Feed Me (Transcripts), Seymour…

By Bill

Posted May 13, 2017 at 9:52pm

If there’s one thing about statistical models that’s generally true: they need to be fed.

For about six months now, I’ve been living most waking moments in the words of Donald Trump. I love algorithms, but I check them. And check them again. And again. It’s not even borderline compulsive. We blew past borderline around January. It is compulsive.

Probably the single biggest challenge I face in shaping the models: access to raw materials. We check, and check again, every word. Yes, he was on Oprah in 1988, but we need more than 3:11… we need the whole show for context.

We are constantly updating our backlog of material, with volunteers generously sending in links (I’m looking at you CJ in particular), text, videos that in turn need to be checked and then fed into Margaret, our pseudo-AI that ravenously consumes every word spoken, analyzing the audio, video and text to build her model. This in turn analyzes tweets, transcribes better, and does lots of other cool things.

The single best source of this information are interviews. As opposed to speeches, they are generally unscripted. As opposed to tweets, you get more than 19.6 words at a time (1 year moving average, 3,203 tweets, 62,871 words). Sometime, I’ll have enough time to do a separate post explaining how different the models view speeches vs. interviews… it’s almost two different people in the output.

However, as Chris Cillizza at CNN pointed out in a recent tweet, these are often unshared, even after the news cycle. Some organizations publish transcripts simultaneously. Most publish just excerpts, noting they’ve been edited. Some share audio and video, but with cuts and jumps. Others… nothing.

https://twitter.com/CillizzaCNN/status/862755920274653185

I’m not naming names, but given that the messages coming from The White House can at times appear to contradict each other, this raw material is crucial, both for the historical record, and for building a base of research that others can analyze.

Also, the full, unedited interview can remove potential questions as to whether comments are in context. Personally, I think that is in nearly every case a ridiculous argument, but the argument can’t be made if there’s no edits.

So I’m making both a public plea, and an offer: please, in the name of all that is good in the world, once you’ve run your stories and pieces, please publish and share the raw materials. Pull any off-the-record comments, but otherwise, share the raw audio, video and text.

Since everyone has a few things to do nowadays, here’s what we’ll offer for any interview with the President, if time or resources constrains a full transcript or sharing raw video and/or audio.

Factba.se will happily, and freely, transcribe in full any video or audio provided, both via Margaret, and with a human editor to verify.
Factba.se will provide, via a spreadsheet or any other medium, ALL metadata developed. This is the stuff that is behind the scenes (not for long) on our site. If audio, you’ll get back second-by-second audio analysis of voice stress and emotion, which is keyed to Trump (the sotto voce whisper). If video, it will include facial expressions, smile/frown, gestures and other analysis (clothing identification, colors, smile / frown, the two-handed punctuation I myself have as a third-generation bridge-and-tunnel child, etc). It is even learning to pick up when he flushes (complexion change). It will be a lot. But it will be everything.
We will provide the full keyword and entity extraction, by three-sentence pair, section and overall, both for the entire interview, and specifically on just when Trump is speaking.
We will provide the full-range of analysis. Grade-level models, sentiment, emotion… all of it.
We will respect any and all embargoes given. We are not meant to be a news organization. If you’d like us to hold until a day, two days, three days after the stories run before integrating and sharing the information, fine. You’re the boss. It’s your interview. You get it back first and control the story.
If a human is in the mix editing, figure two hours per hour of video/audio for transcript. If you don’t mind raw from Margaret (she’s close to 95% dead on now), 90 seconds per hour. We just need a little notice to plan our day to be ready for it if you want a quick turnaround.
We will, of course, link out to your pieces from the text.
If there are any other requests… fine. Our interest is the record, and sharing the resulting analysis.

We’re not looking to create a hippie commune. We are looking, however, to unleash the data that is contained in your excellent work, in a way that does not conflict with your job.

Also, on the off chance Margaret becomes sentient again, you’ll be in her good graces.

https://www.youtube.com/watch?v=RhxXRZ2c1lk

— Bill Frischling