1/32 Adapting to data

When it comes to adapting to change, which of these four is most likely to change: an upstream model, a data source maintained by another team, the relationship between features and labels, or the distribution of inputs?

The answer is that all of them can, and often do, change.

Let’s see how this happens, and what to do about it, with a couple of example scenarios.

Let’s say that you’ve created a model to predict demand for umbrellas that accepts as input an output from a more specialized weather prediction model.

Unbeknownst to you and its owners, the weather model has been trained on the wrong years of data.

Your model, however, is fit to the upstream model’s outputs.

What could go wrong?

One day, the upstream model’s owners silently push a fix, and the performance of your model, which expected the old model’s distribution of data, drops.

The old data had below-average rainfall and now you’re under-predicting the days when you need an umbrella.
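
One way to catch a silent upstream change like this is to compare the upstream model's recent outputs against a stored reference sample. The sketch below uses the population stability index (PSI) for that comparison. The function name, the toy rainfall values, and the 0.25 alert threshold (a common rule of thumb, not something stated in this lecture) are all assumptions made for illustration.

```python
import math

def psi(reference, current, n_bins=5, eps=1e-6):
    """Population stability index between two samples of one numeric input.

    Bins are derived from the reference sample; values outside the
    reference range are clamped into the first or last bin.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant data

    def fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    ref_frac, cur_frac = fractions(reference), fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_frac, cur_frac))

# Hypothetical rainfall forecasts from the upstream weather model.
reference_forecasts = [i % 10 for i in range(100)]       # before the fix
shifted_forecasts = [v + 5 for v in reference_forecasts]  # after the fix

ALERT_THRESHOLD = 0.25  # assumed rule-of-thumb cutoff for a large shift
drift_detected = psi(reference_forecasts, shifted_forecasts) > ALERT_THRESHOLD
```

A check like this would not say why the inputs changed, only that the model is now seeing a distribution it was not fit to, which is a cue to investigate and possibly retrain.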

Here’s another scenario.

Let’s say your small data science team has convinced the web development team to let you ingest their traffic logs.

Later, the web development team refactors their code and changes their logging format, but continues publishing the old format.

At some point, they stop publishing in the old format but they forget to tell your team.

Your model’s performance degrades after getting an unexpectedly high number of null features.
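
A spike in null features can be caught at ingestion time, before it reaches the model. The following is a minimal sketch, assuming log records arrive as dictionaries and that you have recorded a baseline null fraction per field. The field names (`path`, `url`, `user_agent`), the baseline values, and the tolerance are hypothetical.

```python
def null_fraction(records, field):
    """Fraction of records in which `field` is missing or None."""
    missing = sum(1 for record in records if record.get(field) is None)
    return missing / len(records)

def check_null_rates(records, baseline, tolerance=0.2):
    """Return the fields whose null fraction exceeds baseline by > tolerance."""
    alerts = {}
    for field, expected in baseline.items():
        observed = null_fraction(records, field)
        if observed - expected > tolerance:
            alerts[field] = observed
    return alerts

# Hypothetical baseline measured while the old log format was published.
baseline_null_rates = {"path": 0.0, "user_agent": 0.1}

# Hypothetical batch after the switch: "path" was renamed to "url".
new_batch = [
    {"url": "/a", "user_agent": "x"},
    {"url": "/b", "user_agent": None},
    {"url": "/c", "user_agent": "y"},
    {"url": "/d", "user_agent": "z"},
]

alerts = check_null_rates(new_batch, baseline_null_rates)
```

Here `path` is missing from every record, so it is flagged, while `user_agent` stays within the assumed tolerance.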

To fix problems like these, first, you should stop consuming data from a source that doesn’t notify downstream consumers of changes.

Second, you should consider making a local version of the upstream model and keeping it updated.

Sometimes, the set of features that the model has been trained on includes many that were added indiscriminately, which may worsen performance at times.

For example, under pressure during a sprint, your team decided to include a number of new features without understanding their relationship to the label.

One of them is causal, while the others are merely correlated with the causal one.

The model can’t distinguish between the two types, and takes all features into account equally.

Months later, the correlated feature becomes decorrelated with the label and is thus no longer predictive.

The model’s performance suffers.

To address this, features should always be scrutinized before being added, and all features should be subjected to leave-one-out evaluations to assess their importance.
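
A leave-one-out evaluation can be sketched as follows: train and score the model with all features, then retrain with each feature removed in turn and record how much the validation score drops. To stay self-contained, this example assumes a toy nearest-centroid classifier and made-up data with two hypothetical features, `rain_forecast` and `noise`. In practice you would substitute your own model and metric.

```python
from statistics import mean

def fit_centroids(X, y):
    """Nearest-centroid 'model': one mean vector per label."""
    centroids = {}
    for label in sorted(set(y)):
        rows = [x for x, target in zip(X, y) if target == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(x, center))
    # On ties, the first label in insertion (sorted) order wins.
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def accuracy(X_train, y_train, X_val, y_val):
    centroids = fit_centroids(X_train, y_train)
    hits = sum(predict(centroids, x) == t for x, t in zip(X_val, y_val))
    return hits / len(y_val)

def drop_column(X, j):
    return [row[:j] + row[j + 1:] for row in X]

def leave_one_feature_out(X_train, y_train, X_val, y_val, names):
    """Return baseline accuracy and the accuracy drop per removed feature."""
    baseline = accuracy(X_train, y_train, X_val, y_val)
    drops = {}
    for j, name in enumerate(names):
        score = accuracy(drop_column(X_train, j), y_train,
                         drop_column(X_val, j), y_val)
        drops[name] = baseline - score
    return baseline, drops

# Made-up data: the first column separates the classes, the second does not.
feature_names = ["rain_forecast", "noise"]
X_train = [[0.1, 5.0], [0.2, 1.0], [0.9, 4.0], [0.8, 2.0]]
y_train = [0, 0, 1, 1]
X_val = [[0.0, 3.0], [0.3, 3.0], [1.0, 3.0], [0.7, 3.0]]
y_val = [0, 0, 1, 1]

baseline, drops = leave_one_feature_out(
    X_train, y_train, X_val, y_val, feature_names)
```

A feature whose removal barely changes the score, like `noise` here, is a candidate for dropping. Note that this kind of evaluation measures predictive contribution on the validation data, not causality, so it complements rather than replaces scrutiny of each feature's relationship to the label.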