Lab: Adapting to data — Eduardo Avelar

When you lead a team of engineers, many of your decisions are informed by technical debt and other kinds of cost-benefit considerations.

The best teams get very high rates of return on their investments.

With that in mind, let’s consider a few scenarios.

Let’s imagine that you’re the leader of a team of engineers and you are nearing the end of a code sprint.

One of the team’s goals for the sprint is to improve the model’s performance by 5%.

Currently, however, the best-performing model is only marginally better than the previous one.

One of the engineers acknowledges this but still insists that it’s worth spending time on an extensive ablation analysis, in which the value of an individual feature is computed by comparing the full model to a model trained without that feature.

What might this engineer be concerned about?

The engineer might be concerned about legacy and bundled features.

Legacy features are older features that were added because they were valuable at the time.

But since then, better features have been added, which have made them redundant without our knowledge.

Bundled features, on the other hand, are features that were added as part of a bundle, which collectively are valuable but individually may not be.

Both kinds of features represent additional, unnecessary data dependencies.
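
To make the ablation idea concrete, here is a minimal sketch in Python. It assumes a plain least-squares model scored by R^2 on a held-out split. The data is synthetic, and the feature names (new_signal, legacy, other) and the helper functions r2_linear and ablation are made up for illustration; they are not from the lecture. The legacy column is built as a noisier copy of new_signal, so it stands in for a legacy feature that a newer feature has made redundant.

```python
import numpy as np


def r2_linear(X_train, y_train, X_val, y_val):
    """Fit ordinary least squares on the training split; return R^2 on validation."""
    # Add a constant column so the linear model has an intercept.
    A_train = np.column_stack([X_train, np.ones(len(X_train))])
    A_val = np.column_stack([X_val, np.ones(len(X_val))])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    ss_res = np.sum((y_val - A_val @ coef) ** 2)
    ss_tot = np.sum((y_val - y_val.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def ablation(X_train, y_train, X_val, y_val, names, score_fn):
    """Map each feature name to the score lost when that feature is left out."""
    baseline = score_fn(X_train, y_train, X_val, y_val)
    drops = {}
    for i, name in enumerate(names):
        # Retrain on every column except column i.
        keep = [j for j in range(len(names)) if j != i]
        ablated = score_fn(X_train[:, keep], y_train, X_val[:, keep], y_val)
        drops[name] = baseline - ablated
    return drops


# Synthetic data, invented for this illustration only.
rng = np.random.default_rng(0)
n = 400
new_signal = rng.normal(size=n)
legacy = new_signal + rng.normal(scale=0.5, size=n)  # noisier, older proxy
other = rng.normal(size=n)
y = 2.0 * new_signal + other + rng.normal(scale=0.1, size=n)

names = ["new_signal", "legacy", "other"]
X = np.column_stack([new_signal, legacy, other])
train, val = slice(0, 300), slice(300, n)

drops = ablation(X[train], y[train], X[val], y[val], names, r2_linear)
for name, drop in drops.items():
    print(f"{name}: R^2 drop when removed = {drop:.4f}")
```

In this setup, removing legacy should cost almost nothing, while removing either of the other two features should hurt noticeably. A near-zero drop is the signal that a feature may be a redundant data dependency. For bundled features, the same loop could be run over groups of columns as well as over single columns.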

In another scenario, another engineer has found a new data source that is strongly related to the label.

The problem is that it’s in a unique format and there’s no parser written in Python, which is the language the codebase is written in.

Thankfully, there is a parser on the web, but it’s closed source and written in a different language.

The engineer is thinking about the model performance.

But something in the back of your mind tells you that something is wrong.

What is it? It’s the smell.

No, really! There’s a concept called code smell, and it applies in ML as well.

In this case, you might be thinking, “I wonder what introducing code that we can’t inspect and can’t easily modify into our testing and production frameworks will do.”