3 Ds of testing


Back when writing exams was one of the things I regularly did, I put the goals of an exam into three broad buckets. For maximum clickability, I will call them Describe, Distinguish and Diagnose. As a student, you were probably most familiar with the first two, because those are the purpose of grades… to Describe the level to which you know the material and to Distinguish who knows the material better/worse. If you have ever taught (or even graded), then you have probably also used the diagnostic abilities of tests: to figure out what exactly students understand, and how exactly they misunderstand.

Either way, you are probably thinking “what does this have to do with data?” The answer is that, if we cast ourselves as meritocratic economists, each course you took in school is a classifier, and the final grade is just predict_proba() on the classification of whether or not you know Calc II.

This thought should cause conflicting feelings. From your perspective (the perspective of the student) a course is a lot more than a means to see how you rank amongst your peers. You worked hard and made tough decisions, and in the end you learned a lot. Since you were young, you grew, made friends, formed and reformed opinions. Maybe you didn’t get the best grade, but you took the class because it was outside your comfort zone and you wanted to improve as a person…

But I digress.

This is why I want to focus on tests, quizzes, homework and other examinations… final grades tend to be a synthesis of test results and much vaguer notions. In the ML metaphor, exams are more accurately described as classifiers, and final grades, ideally, are decisions informed by a series of these classifiers. This is where Diagnose comes in.

The difference between described and distinguished

Depending on your background, you are probably either appalled to discover that part of your grade’s purpose was to rank-order the students, or confused that there were other purposes. This is because distinguishing observations, if done well, is necessary but insufficient to actually describe them. Sometimes you can make a decision based on distinction alone, but sometimes you need more. In a statistical or machine learning setting, the description is distinction plus distribution (more Ds!).

The Aurocc Aurochs is here!

For data scientists this might be a bit strange and maybe even uncomfortable. Our favorite metric for prediction, the Area Under the Receiver Operating Characteristic Curve (Aurocc), doesn’t care about the distribution; it just cares about the ranking. (This is oversimplifying, since confidence in correct predictions is also valued. Maybe I’ll do a wonk-ish explanation of this later.)

This makes sense from a decision-making perspective, especially if you are forced to make a “yes” or “no” prediction and the underlying fraction is not important. And, often, the decision you have to make can be made without the distribution… this tends to be the case in modeling competitions (like Google’s Kaggle). If I give you a ranking of baseball players by batting average, you would probably still be able to make decisions for your fantasy baseball team. But you wouldn’t be able to tell me what proportion of the time each player gets a hit (which is what batting average measures).
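To make the ranking-only point concrete, here is a minimal sketch (the labels and scores below are invented purely for illustration): pushing the scores through any monotone transformation changes the “probabilities,” but leaves the Aurocc untouched, because only the ordering matters.

```python
# A minimal sketch: AUROC depends only on the ordering of the scores.
# The labels and scores below are made up purely for illustration.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.2, 0.9])

# Squashing the scores through a monotone function preserves the ranking...
squashed = 1 / (1 + np.exp(-10 * (scores - 0.5)))

# ...so the AUROC is identical, even though the "probabilities" changed a lot.
print(roc_auc_score(y_true, scores))    # 0.9375
print(roc_auc_score(y_true, squashed))  # same value
```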

Alternatively, imagine a 10-day weather forecast which simply consisted of a ranking by likelihood of rain. It would say “Rain on Monday is more likely than Tuesday, but less likely than Wednesday.” Would you know when to pack an umbrella?

There are many ways to go from an ordering to a probabilistic description. Given the ranked likelihoods of rain over each of the next ten days, you could make a pretty good forecast for each day, given historical data and current conditions. But it is important to keep in mind which one you need when you are making decisions. Otherwise, you might get wet.
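As a sketch of one such way (a common calibration trick, not necessarily how a meteorologist would do it): if you have historical scores and outcomes, you can fit a monotone map from score to probability, for example with scikit-learn’s isotonic regression. The historical scores and rain outcomes below are made up.

```python
# A hedged sketch: turning rank-like scores into probabilities via calibration.
# The historical scores/outcomes are invented for illustration.
import numpy as np
from sklearn.isotonic import IsotonicRegression

hist_scores = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])  # past rank-like scores
hist_rained = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1, 1])   # whether it actually rained

calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(hist_scores, hist_rained)

# Map today's rank-like score onto an approximate probability of rain.
print(calibrator.predict([6.5]))
```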

3 data-specific interview questions you should ask, and why

This is the worst! At the end of every job interview: you’ve spent way too much time on the take-home assignment, you’ve aced all the questions, you didn’t even make the one joke about that bad thing the company did that one time! Then, *Bam*, the one question you didn’t prepare for: “Do you have any questions for us?”

I’ve always found this question particularly difficult because I obsessively research a company/department beforehand, so I usually already have a good sense of the practical/quantifiable questions. I have come to see it instead as a way to get a sense of what it is like to work somewhere, on some team. From the other side, a good way to show that you are a “fit” is to ask questions that get at what it means to fit into the team.

So here are a few of the questions I have asked for data science jobs, and why I like them. I have tried to order them according to when in the interview process it makes sense to ask them.

A warning:

None of these questions works at every stage of an interview, which is why it is important to know who you are talking to. Pay attention to the underlying concept you are trying to get at.

Q1: What does the lifespan of a typical project look like in your group?

You’ve probably gotten a sense of what kinds of methods and languages the team uses from the job posting, but that is a surprisingly small part of the workflow, certainly in the long term. There are a lot of sub-questions which can turn this question into more of a conversation… Do you need to collect and clean the data? How do you design models? What does putting a model into production look like? What responsibilities do you have for your projects in the long term?

Plus, it is nice to ask about someone’s working life, and it is usually easy to get them talking about it.

Q2: Is your team more centralized or more embedded?

This is a super important thing to know going into a job. A lot of people envision a “centralized” tank of data analysts, connected to the rest of the operation by some product-minded people. But this is not how all data teams are structured. Sometimes you are “embedded” in a product team, or a marketing team, etc.

A couple of warnings. First, you might have to explain the terminology you are using… most people will understand that some data teams sit together as one group (centralized), while others are spread out over different departments (embedded).

Second, it is not obvious which one a person might prefer. People tend to assume there is a “right” answer, but there isn’t; it’s a matter of preference and core competencies. It is even possible some people will see one or the other as an insult. And if it’s your first data science job, it may not be obvious even to you which you will like better.

Q3: How do you see the data team (your group) changing over the next few years?

There are a few possible phrasings here for different settings, so don’t just read this verbatim. This question is better saved for later in the interview process, when it is a bit more appropriate for you to be asking about the priorities of the company and team, and when you might be talking to more senior management.

Remember, with this question you are trying to start a conversation. So ask follow-up questions, particularly to pursue things that interest you about the position. A good follow-up is some version of “How does (insert position title here) fit into this plan?” If they are hiring for the position specifically to move toward some long-term goal, that is good to know at this stage.

Remember

One of the main points of an interview is to see if you ‘fit’ into this position and on this team. For a lot of people this means: would they want to work with you? This is a bit different from college/grad school, where schools want to make sure they are willing to be associated with your name. That might be more official, but it is certainly less personal.

So, be nice, and try to have fun.

Data science, but why?

There is a slew of interview questions designed to test whether or not you have thought about the line of work you are entering. There is debate over the value of such questions, but a good part of preparing for them is to read what other people think and, *most importantly*, to ask yourself whether you agree.

Perhaps the most fundamental is: why do you want to be a data scientist? This question was really hard for me, not because I didn’t want to become a data scientist, but because the reason is hard to articulate. Coming from a mathematical background, I tend to think of the difference between, say, logistic regression and a recurrent neural net as a question of scale, not a question of type. (This was especially true when sigmoidal activations were most popular.) So my earlier view was something like: machine learning is simple statistics + simple programming + simple linear algebra.

Now, that is dismissive of the hard work involved in solving the real problems that ML experts deal with, and in the context of a job interview it is… let’s say, not politic. (That is not to say giving off a “this is easy” vibe can’t be a winning strategy, but I think that might be a different post.)

And what convinced me personally is that this is the very thing that makes being a data scientist exciting… you are at the interface of math, statistics, computer science and business (and various sciences, depending on your field). There is a cutting-edge exchange coming from each of those fields, making each project feel like its own research endeavor, but at 100x the pace.

What do you think? What is your answer? If you interview people, what would you look for in an answer?

Error is Random

I’m tempted to call this the Tolstoy effect, after the first line of Anna Karenina:

Happy families are all alike; every unhappy family is unhappy in its own way.

Leo Tolstoy

But it is a pretty common concept in Signal Processing, in Machine Learning, and really in Bayesian Reasoning in general: The Truth is Deterministic; Error is Random. There is the intended signal, and the unintended, random noise.

The important corollary is that if you take repeated measurements, then you can take an average, and the error should vanish while the truth remains.

This is applicable even in situations where “average” is not to be taken literally. Imagine you are typing a text message but, like me, you make a lot of errors on your phone’s keyboard. You could send the same message twice, and even with a lot of errors, the receiver could probably piece together the meaning.

There are a few important assumptions here. Beyond the fundamental one, the signal/noise dichotomy itself, taking the average above requires that the error be either symmetric or at least predictable. Symmetric implies the average of the error is 0; predictable means we have a good guess at what the error is.
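A minimal numpy sketch of the corollary, under the assumptions above (the “measurements” are simulated): zero-mean, symmetric noise averages away, while a biased instrument leaves a consistent, but predictable, offset.

```python
import numpy as np

rng = np.random.default_rng(0)
truth = 10.0  # the deterministic signal we are trying to measure

# Symmetric (zero-mean) error: the average converges to the truth.
symmetric = truth + rng.normal(loc=0.0, scale=2.0, size=10_000)

# Predictable error: a consistent skew of +1.5 that we could correct for.
biased = truth + rng.normal(loc=1.5, scale=2.0, size=10_000)

print(symmetric.mean())  # close to 10.0
print(biased.mean())     # close to 11.5, i.e. predictably wrong
```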

A case of predictability would come in the form of something like a house effect: for polling firms, this means there is a consistent skew in one direction or another. If this skew can be calculated, it will improve your forecasting ability, even if you can’t “fix” the polls… predictably wrong is often good enough.

There is, however, a mathematical benefit to diverse strategies and voices. There is a principle in Machine Learning which says that if you average two similar models, you don’t see much improvement in prediction, but if you average two fundamentally different models, then you can get a prediction better than either of the originals. This comes back to the symmetry problem: similar models tend to make correlated errors, while different models are more likely to err in offsetting directions. And you aren’t going to know whether a Random Forest is going to be bullish or bearish on a particular stock before going into production. So it is important to consult a diverse set of voices before you make a decision.
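Here is a toy sketch of that ensembling idea on a synthetic dataset (everything below is illustrative, not a production recipe): average the predicted probabilities of two different model families and compare. Whether the blend actually wins depends on how correlated the two models’ errors are.

```python
# A toy sketch of the diversity argument: blend two different model families.
# Dataset and models are illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

p_lr = lr.predict_proba(X_te)[:, 1]
p_rf = rf.predict_proba(X_te)[:, 1]
p_avg = (p_lr + p_rf) / 2  # the simplest possible blend

for name, p in [("logistic", p_lr), ("forest", p_rf), ("average", p_avg)]:
    print(name, roc_auc_score(y_te, p))
```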

Stages of Understanding A Problem

1. This problem is trivial! It’s exactly like this problem that I solved last Tuesday, which was also trivial. You just need to do X and that leads to Y and that leads to … oh wait, never mind. You have to do X then Z and then Y, and that will make it so… um… Maybe it’s more like this problem solved by Dunning, Kruger et al. … but wait…

2. Are you sure that this is the problem? It seems wrong! What would a solution to this problem even look like? And why would anyone ever care! I see why someone might want to, I guess. Let’s [web-search-verb] it….

3. The problem is impossible! Surely if it was possible someone on (specialized) StackExchange would have asked about it before. No one sees a problem for the first time! You want to know why it is impossible? Well, if you try to do X and Y together, W happens, but you can only deal with W if you….

4. Oh, that’s how you solve it. I guess the problem was trivial all along.

Check your denominators

It’s a joke: “You’re one in a million! There are 8 people just like you in New York City, but no one like you in Wyoming.” It points at one of the classic probability fallacies. If you are faced with a seemingly “rare event,” checking your denominators means asking yourself: how many opportunities were there for this event to arise? Are these opportunities independent?

If so, the law of rare events says that you should expect the number of occurrences to be (on average) the number of opportunities times the probability of the event (hence there are 8,000,000 × 1/1,000,000 = 8 people just like you in NYC!).
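A back-of-the-envelope sketch of that arithmetic, using the Poisson approximation behind the law of rare events (the Wyoming population is a rough figure):

```python
import math

p = 1 / 1_000_000                                     # "one in a million"
populations = {"NYC": 8_000_000, "Wyoming": 580_000}  # approximate

for place, n in populations.items():
    lam = n * p                  # expected number of such people
    p_none = math.exp(-lam)      # Poisson chance of finding no one at all
    print(f"{place}: expect {lam:.2f} people, chance of none = {p_none:.1%}")
```

With these numbers, Wyoming has a better-than-even chance of containing no one like you, which is the point of the joke.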

This is obvious, right? Is that the price of peppers per pound or per pepper? It’s something you learned in school. A child can do it! … Well okay, hot shot, tell me which toilet paper is the cheapest.

But there is another case. A couple of times recently, I rushed through some data-cleaning process, looked at the size of my output data, and realized, “Hey, I have no idea what size this data is supposed to be!”

This is a very important thing to keep in mind when processing data. Usually it is relatively easy to check how large your output dataframe is going to be… at least approximately… e.g. the output of a left join should have the same number of rows as the left dataset (assuming the join keys on the right are unique). Inner joins are trickier, but in Pandas and R you can check the intersection of key sets much faster than you can perform a join.
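A hedged pandas sketch of those sanity checks ("key", left and right below are placeholder names, not from any real project):

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2, 2, 3], "x": ["a", "b", "c", "d"]})
right = pd.DataFrame({"key": [2, 3, 4], "y": [10, 20, 30]})

# Left join: row count should match `left` *if* the right keys are unique.
assert right["key"].is_unique
merged = left.merge(right, on="key", how="left")
assert len(merged) == len(left)

# Inner join: with unique right keys, the size is the number of left rows
# whose key also appears on the right -- cheap to check before the merge.
expected = left["key"].isin(right["key"]).sum()
inner = left.merge(right, on="key", how="inner")
assert len(inner) == expected
```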

Oftentimes, when I write a script, it will output the number of rows/columns at each stage of transformation. This is pretty easy to do, and is very useful when debugging. Remember to also print a note about where in the process it is! It doesn’t seem important for small scripts, but when your process gets a little longer, it can be hard to keep track of what is what.
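Something in the spirit of that advice, as a small, assumed helper (the file name and stage labels are invented):

```python
import pandas as pd

def log_shape(df: pd.DataFrame, stage: str) -> pd.DataFrame:
    """Print the shape with a stage label, then pass the dataframe along."""
    print(f"[{stage}] rows={df.shape[0]}, cols={df.shape[1]}")
    return df  # returning df lets you drop this into a method chain

# e.g.
# df = (pd.read_csv("raw.csv")
#         .pipe(log_shape, "loaded")
#         .dropna(subset=["id"])
#         .pipe(log_shape, "dropped missing ids"))
```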

Kirchhoff’s laws for data

What can the laws of flow teach us about data?

I was at a meetup recently where the speaker talked about systems diagrams. These are diagrams which show how a resource flows through a networked system. These terms are pretty broad, because they are pretty broadly applied.

Whenever I think of flows on a network, I think of Kirchhoff’s Laws. High-level description: Kirchhoff says the flow into a node has to equal the flow out. You probably know this intuitively… it is implied by the word “flow.” But that is what makes it such a beautiful concept: it is something you kind of knew already, yet grounding yourself in it can lead to deeper understanding.

Kirchhoff is famous for electrodynamics, but the same law applies to fluids, trading algorithms, and search engines. At a lower level, it is a large part of what gives Markov chains their power.

This got me thinking: to what extent can we think of data as a resource? The obvious reason why we would not is that data itself doesn’t necessarily follow Kirchhoff’s laws… the flow out can, theoretically, exceed the flow in. Unlike a lot of resources, data does not deplete with use; instead, it loses potency with time.

That said, there is a sense in which data flows through a team. But usually each member/group transforms the data in some way. So perhaps the right way to view this flow is as a flow of information, or knowledge, or insight.

So while the data that flows into the data engineering team is not the data that comes out, the information that flows in should be consistent with the information passed to the data scientists, which in turn should be at the heart of the insights passed on to management. While the physical data flowing in need not match the data which comes out, you should not be able to draw more information out than is theoretically contained in the data. For example, you can’t find someone’s location from the ambient temperature, although you might be able to narrow it down (e.g. Key West, San Diego, indoors).

This allows you to understand the services that a team or individual provides. Where are they getting their information from? Where is it going to? How is it transformed? Is anything lost? Are we drawing stronger conclusions from it than it can support?

You can also understand the interaction between teams. Are there loops? How the data team uses its own insights should be carefully considered, and a loop might mean that we are building models based on previous models. Are we pseudo-labeling responsibly? Or is there a feedback loop leading to the data version of confirmation bias, where one uses one’s own insights to confirm what one already believed?

Is everyone getting the data they need/use? This complicates things a little, because the information value of data isn’t necessarily additive. For example, if we are trying to locate someone, their longitude by itself is not particularly useful. Nor is latitude taken by itself. But together, they are exactly the information we need.

I am not necessarily treading new ground here, but information is one of the biggest resources many modern companies hold, and it is interesting to understand data in terms of what it means and who can use it.

We Didn’t Start the Fire and Natural Language Processing

Twenty years ago this month, Billy Joel released We Didn’t Start the Fire, which (and I’m paraphrasing the Wikipedia article) he wrote to convince a 21-year-old that a lot of stuff had happened over the course of Joel’s life.

But can he convince a computer? Natural language processing is the art of teaching computers to read.

For those of you who have never listened to a classic rock station: We Didn’t Start the Fire poses an interesting challenge from the perspective of Natural Language Processing, because it is mostly a list of people, places, events and things which affected popular culture, global politics, and domestic (to the US) politics over the course of the 40 years prior to its release.

Usually the first thing people learn in NLP is part-of-speech tagging, i.e. is it a noun? Is it the subject or the object? Is it a verb? But with We Didn’t Start the Fire, most of the words are nouns, and proper nouns for the most part. This means that we would want to use Named Entity Recognition, the art of teaching a computer to recognize “named entities,” which are specific things that get referred to, and to classify these things into specific categories.

Here are the lyrics for the song, which I have tagged with their corresponding “entity” using the Python package spaCy. To see how I did this, take a look at my tutorial on Named Entity Recognition in spaCy.
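For a sense of what that tagging looks like, here is a minimal sketch with spaCy, assuming the small English model en_core_web_sm is installed; the excerpt is the song’s opening line.

```python
import spacy

# Assumes: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Opening line of the song, just as a small example.
doc = nlp("Harry Truman, Doris Day, Red China, Johnnie Ray")

for ent in doc.ents:
    # ent.label_ is the entity category (PERSON, GPE, ORG, ...)
    print(ent.text, ent.label_)
```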

This is a little sloppy: I haven’t fixed all the tags to include every word that belongs in them, but this will work for some simple analysis. You can probably guess most of the tags: Orgs, Persons, Works of Art. GPE stands for geopolitical entity, e.g. a city/state/country. FAC is for facility, and NORP is for nationality, religious or political group. (No idea where the O comes from.)

There is a lot of debate as to what should be tagged as what. As per my earlier discussion, I did not include any events, since it could be argued (and kind of is, by the Wikipedia page) that most if not all of the entities are shorthand for events. I know, it might be weird to classify an H-bomb as a product, but I’m not sure where else it would fit in the scheme. I also toyed with the idea of making Disneyland a geopolitical entity.

The question I have struggled with when thinking about how to tag this is the extent to which the entities that Joel cites are stand-ins for events. In one sense, this gives the song its ripped-from-the-evening-news feel; in particular, using locations as shorthand for specific events is something that I strongly associate with television news. This is, if anything, where the emotional intelligence of the song comes from: Joel was able to find the short words and phrases that send people, from a certain culture and generation, into daydreams of moments or whole eras of their lives.

And this is the problem we face. Single words can mean many different things: does Nixon refer to the person, or does it refer to his various political campaigns? Does it refer to his scandals and resignation? These are all things that we make decisions about every time we turn on the TV or read the newspaper (okay, fine… or look at Facebook/Twitter). Some of these meanings can be gleaned from the text, but others are meta-textual. For example, the Wikipedia article asserts that Joel’s reference to Nixon is to his first presidential campaign, because it comes alongside other things from the 50s and 60s. Meaning that if Nixon had been referenced in the second stanza, it might have been construed as a reference to his resignation!