Back when writing exams was one of the things that I regularly did, I put the goals of the exam into three broad buckets. For maximum clickability, I will call them Describe, Distinguish and Diagnose. As a student, you were probably most familiar with the first two, because those are the purpose of grades… to Describe the level to which you know the material and to Distinguish who knows the material better or worse. If you have ever taught (or even graded), then you have probably used the diagnostic abilities of tests to figure out what exactly students are understanding, and how exactly they are misunderstanding it.
Either way, you are probably thinking “what does this have to do with data?” The answer is that, if we cast ourselves as meritocratic economists, the courses you took in school are a classifier, and the final grade is just the output of predict_proba() on the classification of whether or not you know Calc II.
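To make that metaphor concrete, here is a toy sketch of the difference between a model’s hard classification (“knows Calc II: yes/no”) and its predict_proba() output. The study-hours feature and the data are invented for illustration.

```python
# Invented example: predict whether a student "knows Calc II"
# from hours studied. Hard labels vs. probabilities.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
hours_studied = rng.uniform(0, 20, size=200)
# Hypothetical ground truth: more study, more likely to know the material
knows_calc = (hours_studied + rng.normal(0, 4, size=200) > 10).astype(int)

model = LogisticRegression().fit(hours_studied.reshape(-1, 1), knows_calc)

student = np.array([[12.0]])           # a student who studied 12 hours
print(model.predict(student))           # hard label: like a pass/fail
print(model.predict_proba(student))     # probability: like a "final grade"
```

predict() collapses everything into a yes/no call, while predict_proba() keeps the graded, how-confident-are-we information, which is the distinction the rest of this post leans on.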
This thought should cause conflicting feelings. From your perspective (the perspective of the student) a course is a lot more than a means to see how you rank amongst your peers. You worked hard and made tough decisions, and in the end you learned a lot. Since you were young, you grew, made friends, formed and reformed opinions. Maybe you didn’t get the best grade, but you took the class because it was outside your comfort zone and you wanted to improve as a person….
But I digress.
This is why I want to focus on tests, quizzes, homework and other examinations… final grades tend to be a synthesis of test results and much vaguer notions. In the ML metaphor, exams are more accurately described by classifiers, and final grades, ideally, are decisions informed by a series of these classifiers. This is where Diagnosis comes in.
The difference between Describe and Distinguish
Depending on your background, you are probably either appalled to discover that part of your grade’s purpose was to rank-order the students, or you are confused that there were other purposes. This is because distinguishing observations, even if done well, is necessary but not sufficient for actually describing them. Sometimes you can make a decision based on distinction alone, but sometimes you need more. In a statistical or machine learning setting, the description is distinction plus distribution (More Ds!)

For data scientists this might be a bit strange and maybe even uncomfortable. Our favorite metric for prediction, the Area Under the Receiver Operating Characteristic Curve (AUROC), doesn’t care about the distribution, and just cares about the ranking. (This is oversimplifying, since confidence in correct predictions is also valued. Maybe I’ll do a wonk-ish explanation of this later.)
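You can see the ranking-only nature of AUROC directly: any strictly increasing transform of the scores, no matter how badly it distorts them as probabilities, leaves the metric unchanged. A small sketch with made-up labels and scores:

```python
# AUROC depends only on how the scores rank the observations,
# not on their scale or calibration. Labels and scores are made up.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.8, 0.2, 0.9])

auc_raw      = roc_auc_score(y_true, scores)
auc_rescaled = roc_auc_score(y_true, scores / 10)      # no longer "look" like probabilities
auc_warped   = roc_auc_score(y_true, np.log(scores))   # strictly increasing transform

print(auc_raw, auc_rescaled, auc_warped)  # all identical
```

All three calls give the same number, because dividing by 10 or taking a log never swaps the order of any pair of scores.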
This makes sense from a decision-making perspective: if you are forced to make a “yes” or “no” prediction, the exact fraction is often not important. And, often, the decision you have to make can be made without the distribution… This tends to be the case in modeling competitions (like Google’s Kaggle). If I gave you a ranking of baseball players by batting average, you would probably still be able to make decisions for your fantasy baseball team. But you wouldn’t be able to tell me what proportion of the time each player hits the ball (which is what batting average is).
Alternatively, imagine a 10-day weather forecast which simply consisted of a ranking by likelihood of rain. It would say “Rain on Monday is more likely than Tuesday, but less likely than Wednesday.” Would you know when to pack an umbrella?
There are many ways to go from an ordering to a probabilistic description. Given the ranking of rain likelihoods over each of the next ten days, you could make a pretty good forecast for each day using historical data and current conditions. But it is important to keep in mind which one you need when you’re making decisions. Otherwise, you might get wet.
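One common way to make that jump is to fit a monotone map from historical ranking scores to observed outcomes, e.g. with scikit-learn’s isotonic regression. The forecast scores and rain history below are invented for illustration:

```python
# Turning a ranking score into a probability via isotonic calibration.
# The historical scores and rain outcomes here are made up.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(42)
hist_scores = rng.uniform(0, 1, size=500)                          # past ranking scores
rained = (rng.uniform(0, 1, size=500) < hist_scores).astype(int)   # what actually happened

calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(hist_scores, rained)

next_ten_days = np.sort(rng.uniform(0, 1, size=10))   # the ten days, ordered by score
rain_prob = calibrator.predict(next_ten_days)
print(np.round(rain_prob, 2))  # same ranking, but now readable as "chance of rain"
```

The calibrated output preserves the original ordering (the map is monotone) while adding the distribution piece, which is exactly the “distinction plus distribution” idea from earlier.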