AI Snake Oil (Part 2): Training Data

First in this series, I want to address the simplest and most important question to ask about a machine learning start-up or application:

Question: Is there existing training data? If not, how do they plan on getting it?

To sufficiently understand the answers to this question, you have to understand what training data is and, from there, what tasks or ideas would be extremely difficult to capture within training data. I’ll be addressing those in this post.

Most useful AI applications require training data: examples of the phenomenon they’re trying to replicate with the computer. If a start-up or group proposes a solution to a problem and has no training data, you should be much more skeptical of that solution; it is now drifting into territory that is magical, expensive, or both.

I like to think of training data as artificial intelligence’s dirty secret. It never gets mentioned in the press, but it is the topic of Day 1 of any Machine Learning class and forms the theoretical basis for what you learn the rest of the semester. Techniques like these that use training data are often called statistical methods, since they gather statistics about the data they’re provided in order to make predictions; this is in contrast to the rule-driven methods that were used before them.
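
To make the contrast between rule-driven and statistical methods concrete, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: the toy spam/ham examples, the function names (rule_based, train, predict), and the crude word-counting scheme, which stands in for the far more careful statistics a real system would gather.

```python
from collections import Counter

# Hypothetical toy training data: (text, label) pairs invented for illustration.
TRAINING_DATA = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting moved to noon", "ham"),
    ("lunch at noon tomorrow", "ham"),
]


def rule_based(text):
    """Rule-driven approach: a person hand-writes the decision logic."""
    return "spam" if "free" in text.split() else "ham"


def train(examples):
    """Statistical approach: count how often each word appears under each label."""
    counts = {}
    for text, label in examples:
        counts.setdefault(label, Counter()).update(text.split())
    return counts


def predict(counts, text):
    """Pick the label whose training examples share the most words with the text."""
    words = text.split()
    # Sorting the labels makes ties resolve the same way on every run.
    return max(sorted(counts), key=lambda label: sum(counts[label][w] for w in words))


if __name__ == "__main__":
    model = train(TRAINING_DATA)
    message = "claim your prize now"
    print("rule-based: ", rule_based(message))
    print("statistical:", predict(model, message))
```

The hand-written rule only knows about the word "free", so it lets this message through, while the counted statistics flag it because "prize" and "now" showed up in the spam examples. The flip side is the point of this post: with an empty TRAINING_DATA list, train has nothing to count and predict has no labels to choose from.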
