Right and wrong decisions

Some decisions about data are a matter of weighing cost against benefit, like short-term performance goals against long-term maintainability.

Others, though, are about right and wrong.

For example, let’s say that you’ve trained a model to predict “probability a patient has cancer” from medical records, and that you’ve selected patient age, gender, prior medical conditions, hospital name, vital signs, and test results as features.

Your model had excellent performance on held-out test data but performed terribly on new patients.

Any guesses as to why?

It turns out the model was trained using a feature that wasn’t legitimately available at decision time. When the model was deployed into production, the distribution of this feature changed, and it was no longer a reliable predictor.

In this case, that feature was ‘hospital name’.

You might think, ‘hospital name’…

How could that be predictive?

Well, remember that there are some hospitals that focus on diseases like cancer.

So, the model learned that ‘hospital name’ was very important.

However, at decision time, this feature wasn’t available to the model, because patients hadn’t yet been assigned to a hospital. Rather than throwing an error, the model simply interpreted the hospital name as an empty string, which it could still handle thanks to the out-of-vocabulary buckets in its representations of words.
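To make this failure mode concrete, here is a minimal, hypothetical sketch in Python (using NumPy and scikit-learn; the feature names and data are invented) of how a feature that encodes the label during training but arrives empty in production can inflate offline metrics:

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

def make_records(n, leak):
    """Synthetic patient records. If `leak` is True, hospital_name reflects
    the (already-known) diagnosis; otherwise it is an empty string, as it
    would be before a patient has been assigned to a hospital."""
    labels = rng.integers(0, 2, size=n)  # 1 = has cancer
    records = []
    for y in labels:
        hospital = ("oncology_center" if y else "general_hospital") if leak else ""
        records.append({
            "hospital_name": hospital,
            "age": float(rng.integers(30, 90)),                              # uninformative
            "abnormal_test": float(y if rng.random() < 0.6 else rng.integers(0, 2)),  # weak signal
        })
    return records, labels

# Training and held-out data both contain the leaked feature.
X, y = make_records(2000, leak=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("held-out accuracy  :", accuracy_score(y_test, model.predict(X_test)))

# In production, hospital_name is simply empty, so the one-hot features the
# model relied on never fire and it falls back to the weaker signals.
X_prod, y_prod = make_records(2000, leak=False)
print("production accuracy:", accuracy_score(y_prod, model.predict(X_prod)))
```

On the held-out split the model looks almost perfect, because the leaked feature encodes the label directly; on the production-style data, where the field is empty, it has to fall back on the genuinely available signals and its accuracy drops sharply.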

We refer to this situation, where the label somehow leaks into the training data, as data leakage.

Data leakage is related to a broader class of problems we’ve talked about before: models learning unacceptable strategies.

Previously, we learned that when there is class imbalance, a model might learn to simply predict the majority class.

In this case, the model has learned to use a feature that wouldn’t actually be known at decision time and which cannot plausibly be causally related to the label.

Here’s a similar case.

A professor of 18th century literature believed that there was a relationship between how an author thought about the mind and their political affiliation.

So, for example, perhaps authors who used language like “the mind is a garden” had one political affiliation, and authors who used language like “the mind is a steel trap” had another.

What if we were to naively test this hypothesis with machine learning? Some people tried that, and they got some unexpected results. Here’s what they did.

They took all of the sentences in all of the works by a number of 18th century authors, extracted just the mind metaphors, set those as their features, and then used the political affiliations of the authors who wrote them as labels.

Then, they randomly assigned sentences to each of the training, validation, and test sets.

And because they divided the data in this way, some sentences from each author were distributed to each of those three sets.

And the resulting model was amazing! … But suspiciously amazing.

What might have gone wrong?

One way to think about it is that political affiliation is linked to the person. And if we wouldn’t include ‘person name’ explicitly in the feature set, we should not include it implicitly either. Because sentences from the same author appeared in training, validation, and test sets, the model could memorize each author’s distinctive phrasing, which implicitly encodes their identity and therefore their affiliation.

When the researchers changed the way they partitioned the data, splitting it by author instead of by sentence, the model’s accuracy dropped to something more reasonable.
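Here is a minimal sketch of that fix (the author names, sentences, and labels below are invented stand-ins): scikit-learn’s GroupShuffleSplit keeps all of an author’s sentences on one side of the partition, whereas a plain train_test_split scatters each author across both sides.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, train_test_split

# Toy stand-ins: one row per sentence, tagged with the author who wrote it.
sentences = np.array([f"mind_metaphor_{i}" for i in range(12)])
authors = np.array(["swift"] * 3 + ["pope"] * 3 + ["burney"] * 3 + ["defoe"] * 3)
labels = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1])  # hypothetical affiliations

# Naive split: sentences are shuffled individually, so an author's sentences
# typically land in BOTH train and test -- their identity (and label) leaks.
naive_train, naive_test = train_test_split(
    np.arange(len(sentences)), test_size=0.25, random_state=0)
print("authors on both sides (naive):  ",
      set(authors[naive_train]) & set(authors[naive_test]))

# Grouped split: every sentence from a given author lands on one side only.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(sentences, labels, groups=authors))
print("authors on both sides (grouped):",
      set(authors[train_idx]) & set(authors[test_idx]))  # always the empty set
```

With the grouped split, an author’s idiosyncratic phrasing can no longer carry their affiliation from training into evaluation, so the test score reflects the metaphor-to-affiliation relationship rather than author identity.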