AI Snake Oil (Part 2): Training Data

First in this series, I want to address the simplest and most important question to ask about a machine learning start-up or application:

Question: Is there existing training data? If not, how do they plan on getting it?

To evaluate the answers to this question, you have to understand what training data is and, from there, which tasks or ideas would be extremely difficult to capture in training data. I’ll be addressing both in this post.

Most useful AI applications require training data: examples of the phenomenon they’re trying to replicate with the computer. If some start-up or group proposes a solution to a problem and they don’t have training data, you should be much more skeptical of their proposed solution; it’s now meandering into the territory of magic, great expense, or both.

I like to think of training data as artificial intelligence’s dirty secret. It never gets mentioned in the press, but it is the topic of Day 1 of any Machine Learning class and forms the theoretical basis for what you learn the rest of the semester. Techniques that use training data are often called statistical methods, since they gather statistics about the data they’re provided to make predictions; this is in contrast to the rule-driven methods that were used before them.
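If you want to see what “gathering statistics” looks like in practice, here’s a minimal sketch of a statistical method at work, assuming scikit-learn and a handful of invented labeled examples:

```python
# Minimal sketch: a statistical classifier learning from labeled examples.
# The training data below is invented purely for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["great product, works well", "terrible, broke in a day",
               "love it", "total waste of money"]
train_labels = ["pos", "neg", "pos", "neg"]

# Word counts in, label predictions out: statistics all the way down.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

print(model.predict(["works great"]))  # likely ['pos'], no rules written
```

There are no hand-coded rules anywhere in there; the model is only as good as the examples it’s fed, which is exactly why the question above matters.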


AI Snake Oil (Part 1): Golden Lunar Toilet

A lot of over-hyped AI claims are being thrown around right now. Leveraging this hype, some individuals make promises they can’t keep, no matter how dedicated or talented they are as developers. Steve Jobs may have had a so-called “reality distortion field,” but that never spawned a conscious AI, and neither will these people.

What I do want to describe is how to tell if someone is trying to sell you AI snake oil—bullshit claims about what they can actually achieve on a realistic time and budget. Sure, with infinite resources, I could build you a gold toilet on the moon, but no one has that kind of cash lying around. Shit needs to get done, and the time and materials for doing so are finite.

Anything is possible. The only limit is yourself.
Anything is possible. I will make this happen for $412 billion. Please provide it in gold bullion so I can melt it down into the toilet of my own secret Swiss bank account.

If you’re approached by someone trying to sell you artificial intelligence-related software, or you read a piece in the popular press about which profession AI will uncannily crush in the next year, these are the questions you should ask. Depending on the answers, you can determine whether they’re bluffing or whether they’ve done their homework and are worth taking seriously.

I was originally going to make this a single post, but it grew too large. In this series, each post centers on a question you should ask when someone wants to do something in the real world with natural language processing, machine learning, or other AI components. These questions are:

Each post will detail what you should expect for an answer. As I write, I might add to or revise some of these questions, so don’t consider this list definitive quite yet.

When all is said and done, there are some really great things happening in AI right now; it’s part of why I chose to invest six years of my life in computational linguistics as a field. However, riding on any big wave of technology is a big wave of exploitation. When people exploit the knowledge gap between researchers and the public with hyperbole, it comes back to hurt those of us who work so hard to actually make shit that works. I hope these posts can help non-researchers think more critically about AI and give researchers a way to inform the public without dragging them through the equivalent of graduate-level coursework.

It’s Good* That Word Embeddings Are Sexist

A lot of news has been fluttering around about word embeddings being racist and sexist, e.g.:

This is a good thing, but not in the sense that sexism and racism are good. It’s good because people who work on quantitative problems don’t believe things are real without quantitative evidence, and this is quantitative evidence that sexism and racism are real.
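If you want to poke at this yourself, here’s a minimal sketch, assuming gensim and its downloadable GoogleNews vectors (a sizable download); this is essentially the query behind the widely reported “man is to computer programmer as woman is to homemaker” result:

```python
# Minimal probe of analogy bias in pretrained word embeddings.
# Assumes gensim is installed; the vectors are a ~1.7 GB download.
import gensim.downloader as api

model = api.load("word2vec-google-news-300")  # trained on Google News

# "man is to computer_programmer as woman is to ?"
print(model.most_similar(positive=["woman", "computer_programmer"],
                         negative=["man"], topn=3))
```

Nothing in that query mentions gender roles; the associations come entirely from the text the vectors were trained on.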

My initial reaction was surprise at how much alarm there is about this. When you live in a world that is glaringly *ist, take data from that world, and learn in an unsupervised manner, you’re going to acquire *ist knowledge. But then again, I did my graduate education in a linguistics department with a strong group of sociolinguists. I was exposed to these ideas years ago and was taught to have an awareness of and sensitivity to these issues, and to be critically aware of how language can construct and reinforce racist and sexist norms, especially through prescriptivism.

I suspect a lot of the shock is coming from the stronger CS end of things–the side of the university that is more strictly quantitative. My undergrad was in physics, which I suspect has a similar distribution of social science coursework–namely, just what the university requires. A student might have to take sociology or anthropology, but only if the university demands it. Mine did not; I took macroeconomics in lieu of either.

When you’re in a quantitative program, there are a lot of hidden assumptions. One is that quantitative analysis is the only way to do anything–any other way of approaching any problem of any kind is bullshit. This is because any other approach can involve biases that a researcher is unaware of. Abstraction and measurement help remove the preferences of the researcher from the process, mitigating the effect of their biases. The procedure and the numbers are what count.

Hard-core context control.

This works great for particles in a vacuum, for problems where the context can be completely controlled, but the assumption that these standards can be universally maintained bleeds into other problems where maintaining them is realistically impossible. The air of non-bias around quantitative methods remains, even though the conditions that purged the bias in the first place are gone.

This assumption of non-bias carries into AI research–the belief that a machine built on quantitative principles will be capable of arriving, logically and deductively, at perfect, non-biased truth–the objective truth that’s obscured by those pesky, confounding social factors.

If only Tay had taken advice from Dr. Dre: "I don't smoke weed or sess / Cause it's known to give a brother brain damage / And brain damage on the mic don't manage nothing / But making a sucker and you equal..."

This hope is at odds with AI’s dark secret–the one that never seems to make it into press pieces about AI’s up-and-coming “singularity”: solutions to the most interesting problems in AI rely entirely on training data. Some of that learning is supervised, some of it is unsupervised, but it all still relies on the data it’s fed. With that, a system comes to replicate whatever it’s been provided: garbage in, garbage out.

And so, this is where the shock comes from. For the first time, white, male quantitative researchers are smacked upside the head with the reality that the world exhibits sexist and racist tendencies. The data they provide is digested, and its biases are learned. It turns out that building a perfectly logical, deductive system free from bias–a consciousness liberated from the social confines of human existence–isn’t just hard, but possibly impossible.

This isn’t a bad thing–perhaps disappointing to a slowly dying vision of AI. The upside is that, until now, the majority of the evidence for *ist tendencies in society has been qualitative. You have to trust individuals’ synopses of their aggregate subjective experiences to accept that privilege and bias exist. Here, we’re seeing quantitative evidence that supports their testimonies.

The effect is twofold: hopefully, it opens quantitative researchers up to better acknowledging the validity of qualitative research. Simultaneously, it confirms the findings of a lot of that qualitative research by arriving at the same conclusions from a totally different angle. That sort of independent confirmation is ideal in scientific work, and this convergence is exactly that: decades of social science research supported by evidence from entirely different methods. In a discipline filled with men, this is unequivocal evidence, derived from the discipline’s own methods, that there are issues that need to be addressed. Sexism and racism suck, but with AI finally bumping into them and providing firm support that they are real issues, perhaps we can have better luck garnering public support in the larger social sphere.

“Bing bing, bong bong bong, bing bing.”

For the last few days in the class I’ve been teaching this summer, we’ve been using a parsed version of the Donald Trump speech corpus that Ryan McDermott recently posted to GitHub. One of my students mentioned that Donald Trump had made a speech where he said, quote, “Bing bing, bong bong bong, bing bing.”

I was wondering if this particular speech was actually in the corpus. As a teaching activity, we started searching for instances of /[Bb][io]ng/. I also wanted to see what the parser would do with a string like “bing bong bing bing bong”. There’s a chance the parser would assume this is a normal sentence and produce something like:

[NP bing bong] [VP bing [NP bing bong]]
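The classroom search itself was nothing fancy; here’s a minimal sketch, assuming the corpus sits in a directory of plain-text files (the directory layout is hypothetical):

```python
# Grep-style search for bing/bong tokens across the corpus.
# The directory name is hypothetical; point it at your own checkout.
import re
from pathlib import Path

# Word boundaries added so we don't match inside words like "being".
pattern = re.compile(r"\b[Bb][io]ng\b")

for path in sorted(Path("trump_speeches").glob("*.txt")):
    for lineno, line in enumerate(path.read_text().splitlines(), start=1):
        if pattern.search(line):
            print(f"{path.name}:{lineno}: {line.strip()}")
```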

Another student asked why we were doing this–searching for such an obscure, nonsense lexical item when we could be searching for something actually meaningful.

The answer I had, in part, was that it’s not that obscure. As it turns out, these items are quite characteristic of Trump’s speech. In this corpus alone—which lacks the famous original “bing bing, bong bong” speech cited above—it appears 24 times (16 if you remove duplicates), often in clusters of three:

“And that’s what we ended up getting–the king of teleprompters.  But, so when I look at these things here I say you know what, it’s so much easier, it would be so nice, just bah, pa, bah, pa, bah, bing, bing, bing.  No problems, get off stage, everybody falls asleep and that’s the end of that.  But we have to do something about these teleprompters.”

“I hear where they don’t want me to use the hairspray. They want me to use the pump because the other one, which I really like better than going bing, bing, bing, and then it comes out in big globs, right? And then you’re stuck in your hair and you say, ‘Oh my God, I have to take a shower again. My hair’s all screwed up.’ ”

“You know, in the old days everything was better right? The car seats. You’d sit in your car and you want to move forward and back, you press a button. Bing, bing. Now, you have to open up things, press a computer, takes you 15 minutes.”

“You know, when you have so many people running – we had 17 and then they started to drop. Ding. Bing. I love it. I love it.”

“On the budget – I’m really good at these things – economy, budgets. I sort of expected this. On the budget, Trump – this is with 15 people remaining – Trump 51%. Everyone else bing.”

“In Paris, I call him the guy with the dirty filthy hat. Okay? Not a smart guy. A dummy. Puts people in there – mastermind – bing, bing, bing, it’s like shooting everybody. You’ve got to be a mastermind.”

“I was like the establishment. They’d all come to me, and I’d give them all money I write checks sometimes to Senators whatever the max – bing, bing, bing.”

The communicative goals of these tokens could constitute an entire discourse paper, but let’s stick with the basics for now. He seems to use them to indicate some kind of quick, repetitive action. They don’t seem to carry a particular sentiment: bribing senators, competitors dropping out of the race, committing mass murder, moving the seat conveniently in a car, being annoyed with pump-style hairspray, politicians reading off teleprompters.

It’s undoubtedly characteristic of his speech, though. To say that it’s a mere aberration–something to ignore–is prescriptive. If we look at counts of lemmas throughout the corpus (using spaCy—a little easier to break out than digging through CoreNLP’s XML), the lemma “bing” appears 11 times, with the other 13 instances lemmatized as “be.” In those cases, the lemmatizer assumed “bing” was a VBG, essentially a misspelling of “being.”
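For the curious, the lemma counting amounts to something like this; a sketch assuming spaCy with an English model installed and the corpus concatenated into a single file (the file name is hypothetical):

```python
# Count lemmas with spaCy. Note the failure mode described above:
# the tagger sometimes guesses "bing" is a VBG and lemmatizes it
# to "be", as if it were a typo for "being".
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")
text = open("trump_speeches.txt", encoding="utf-8").read()

lemma_counts = Counter(tok.lemma_.lower()
                       for tok in nlp(text) if tok.is_alpha)
print(lemma_counts["bing"])
```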

Across the whole corpus, counting all 24 instances of “bing,” Trump said “bing” more often than he said:

  • situation: 23
  • donor: 21
  • dangerous: 21
  • migration: 20
  • weak: 20
  • economic: 19
  • freedom: 18
  • mexican: 18
  • illegally: 14
  • muslim: 13
  • god: 11
  • kasich: 11
  • bernardino: 10
  • criminal: 9
  • hispanic: 9
  • chinese: 8


Among many, many other word types. You can get the full list of lemma counts here (when I get around to posting it), though note that “bing” appears with count 11 in that list because many of its instances were erroneously merged with “be.”

To go back to the critical student’s original question, though, it’s a difference in expectations, I suspect. While NLP tools are helpful, they don’t totally address the problem of meaning in text. Meaning is still in large part up to the programmer using the tool, not the tool itself. There’s still a lot of work to be done in that regard, in any application. Sometimes “bing bing bong bong” is really the best we can do.


Actors and Actions

This summer–out of town, meeting many new people–I faced the unenviable dilemma of explaining my dissertation topic far more often than usual. Unintentionally, though, I turned it into an experiment.

Linguistics: where talking about an experiment becomes another experiment.

Typically, when introducing the topic, I presented a set of verbs–“arrest, search, apprehend, try, convict”–and asked what nouns came to mind. Most folks drew a blank. At first I thought it was a fluke, but after a sustained near-zero success rate, and failing so frequently to explain to so many people what I was doing, I got my head out of my ass and admitted I was explaining it wrong.

So instead of giving them verbs and asking what nouns came to mind, I gave them “police and suspect” and asked what verbs came to mind. “Arrest, search…” It worked like a charm.

It’s easy to think of the actors and the actions associated with them as interchangeable, and then to emphasize the extracted product of the process (Chambers and Jurafsky 2009). After all, that list of verbs is a project result. However, coreference chains–strings of co-referring nouns–are employed in the first step, so it’s more sensible to convey the process nouns-first. Then, in a way, the listener becomes part of the project, and that’s far more interesting for them and for you.

Furthermore, this may signal a need to alter the schema construction process. Verbs are compared to one another, and though their similarity depends on their coreferent arguments, the choice of comparison depends on grammatical/referent collocations of verbs, not on the juxtaposition of two actors. In this respect, the pair of actors I prompted listeners with is more like the approach of Balasubramanian et al. 2013, which retains a pairwise relationship between role fillers through the extraction process.

In the end, it’s the nouns I’m interested in. For my 2nd Qualifying Paper, I looked at narratives related to police. Fundamentally, I was interested in what the system told me about police and how they interacted with other argument types: suspects, bystanders, etc. A noun-centric generation process may provide results better suited to this sort of analysis.

A noun-centric process may also improve performance in more challenging domains. I noticed analyzing movie reviews that, while the means of describing films and reviewer sentiment about them varied, particular roles remained constant throughout the domain: the reviewer, the director, characters in a plot synopsis, the film itself. Since that’s where I’m headed, that seems to be the way to think about things.

Synchronous Narratives, Small Data, and Measure Veracity

At the moment, I’m looking for a particular problem to work on for my dissertation. The way I’m going about it feels a bit backwards–I know what kind of solution I want to deploy, but I’m looking for a problem to solve with it. It’s a bit like running around the house with a hammer, looking for nails to hit, or running around with a new saw, cutting up wood for the hell of it. The danger is that I could end up cutting all my wood into tiny shavings, having had a blast with the saw but finding myself homeless at the end of the day.

My tool in this case isn’t a saw, but the abstraction of narrative schemata. The idea is that, using dependency parses and coreference chains, you can extract which verbs are likely to co-occur with a shared referent. For example, arrest, search, and detain often share role fillers of some kind–police, suspect, or something referring to one of those two.
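To make that concrete, here’s a toy sketch of the counting step, assuming coreference has already been run and each document has been reduced to (verb lemma, chain id) pairs; the input format and scoring here are simplified stand-ins for the real pipeline:

```python
# Toy sketch of schema-style counting: score verb pairs by how often
# they share a coreference chain, via an approximate PMI.
from collections import Counter
from itertools import combinations
from math import log

# Hypothetical input: one list of (verb_lemma, chain_id) per document.
docs = [
    [("arrest", 0), ("search", 0), ("detain", 0), ("say", 1)],
    [("arrest", 0), ("try", 0), ("convict", 0), ("say", 0)],
]

verb_counts, pair_counts = Counter(), Counter()
for doc in docs:
    chains = {}
    for verb, chain_id in doc:
        chains.setdefault(chain_id, set()).add(verb)
    for verbs in chains.values():
        verb_counts.update(verbs)
        pair_counts.update(combinations(sorted(verbs), 2))

# Verb pairs that share referents more often than chance score high.
n_verbs = sum(verb_counts.values())
n_pairs = sum(pair_counts.values())
for (v1, v2), n in pair_counts.most_common():
    pmi = log((n / n_pairs) /
              ((verb_counts[v1] / n_verbs) * (verb_counts[v2] / n_verbs)))
    print(f"{v1:>8} {v2:>8} {pmi:+.2f}")
```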

A corpus of news contains all kinds of relationships like those, buried inside the language data itself. Ideally, these represent some sort of shared world knowledge that can be applied to other tasks. Demonstrating that this isn’t mere idealism is what I’m currently hoping to do my dissertation on.

Back in the spring, I made my first attempt at this, and it went okay. My hypothesis–one of convenience, mostly–didn’t pan out, but there were interesting trends in the data. That left me with two things to sort out, though: was my hypothesis wrong, and was the measure I used to test it suitable for doing so? There was some minor evidence that the measure was suitable, but nothing conclusive.

Instead, I started sniffing around for other hypotheses–things someone else had already thought of that might be demonstrable with narrative schemata as an overlying application. In typical procrastination fashion, I stumbled upon a recent article on Salon that critiques national press coverage of Rick Perry, claiming that the narratives presented in the national press diverge wildly from those presented in Texas papers.

Since the author has shown this qualitatively, it’s ripe for quantitative replication. It would make a great experiment for validating whatever measure I end up devising.

The difficulty comes in with corpus building. There isn’t a corpus of these texts lying around. I’d have to dig them up myself, from numerous scattered sources. Additionally, the number of sources is likely to be limited. I may be able to obtain a few hundred articles if I’m relentless. Prior work on schemata began with millions of articles. The robustness of the approach may be questionable, in this case.

Of course, the difference in size may be the source of an interesting result in and of itself, but it’s not what I set out to show when searching for a problem that demonstrates the veracity of my measure.

Broke the Turing Test

There have been reports that some AI “passed” the Turing Test. Let’s delve into this.

First, let’s start with what the Turing Test is, and who Turing was. Alan Turing established many of the theoretical foundations of modern computing–in the 1940s. He was also largely responsible for cracking German secret codes during World War II. He was way ahead of his time–60 years or so.

The Turing Test works like this–if you have some artificial intelligence chatting with you through a computer and a person chatting with you through a computer, can you tell the difference? If you can, the AI has failed the Turing Test. If you can’t, the AI has passed.

So what about this AI?  “…the Eugene Goostman program managed to persuade 33 percent of people that it was a 13-year-old boy from Odessa, Ukraine.” That’s the trick here. First of all, 33% is not the highest bar in the world: if three people examined the system, only one of them got duped. Still, it’s more than nothing. I’m waiting on the research paper to see how significant that bar is; caveats like this are often required in AI research.
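In the meantime, a back-of-the-envelope check is easy; here’s a sketch assuming SciPy, a hypothetical panel of 30 judges, and the 30% threshold the organizers reportedly took from Turing’s prediction:

```python
# How meaningful is a 33% deception rate? A rough check with a
# hypothetical panel size; the real numbers belong to the paper.
from scipy.stats import binomtest

n_judges = 30   # hypothetical
n_fooled = 10   # ~33% of the panel

# Null hypothesis: the true deception rate is exactly the 30% bar.
result = binomtest(n_fooled, n_judges, p=0.3, alternative="greater")
print(f"p-value: {result.pvalue:.2f}")
```

With a panel that size, fooling 10 of 30 judges is entirely consistent with a system that just barely sits at the 30% bar, which is why the paper’s actual numbers matter.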

However, the real gimmick is the “13-year-old boy from Odessa, Ukraine” part. If you can’t make your AI fluent, make your AI simulate someone who isn’t. I don’t think that’s really what Turing intended, but I’d like to congratulate Veselov et al. on finding a loophole in the test. It only took 64 years.