Serving design decisions

Just as the use case determines the appropriate training architecture, it also determines the appropriate serving architecture.

In designing our serving architecture, one of our goals is to minimize average latency.

Just as in operating systems we don't want to be bottlenecked by slow disk I/O, when serving models we don't want to be bottlenecked by slow-to-decide models.

Remarkably, the solution for serving models is very similar to what we do to optimize I/O performance: we use a cache.

In this case, rather than a faster tier of memory, we'll use a lookup table.

Static serving then computes the label ahead of time and serves by looking it up in the table.
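
As a minimal sketch (my own illustration, not from the course materials), static serving can be as simple as a batch step that fills a dictionary of precomputed predictions, plus a constant-time lookup at request time. The model and division numbers here are hypothetical stand-ins:

```python
from typing import Optional

def predict_revenue(division_id: int) -> float:
    """Stand-in for a real trained model (hypothetical)."""
    return 1000.0 * division_id

# Batch step: precompute the entire (low-cardinality) prediction workload.
PREDICTION_TABLE = {d: predict_revenue(d) for d in range(1, 51)}

def serve(division_id: int) -> Optional[float]:
    # Serving is a constant-time table lookup: low, fixed latency.
    return PREDICTION_TABLE.get(division_id)

print(serve(7))   # 7000.0, computed ahead of time, not at request time
```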

Dynamic serving, in contrast, computes the label on-demand.

There’s a space-time tradeoff.

Static serving is space-intensive, resulting in higher storage costs because we store pre-computed predictions, but it offers a low, fixed latency and lower maintenance costs.

Dynamic serving, however, is compute-intensive. It has lower storage costs, higher maintenance costs, and variable latency.

The choice of whether to use static or dynamic serving is determined by weighing the relative importance of latency, storage, and CPU costs.

Sometimes, it can be hard to express the relative importance of these three areas.

As a result, it might be helpful to consider static and dynamic serving through another lens: peakedness and cardinality.

Peakedness in a data distribution is the degree to which data values are concentrated around the mean, or in this case, how concentrated the distribution of the prediction workload is.

You can also think of it as inverse entropy.
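
To make "inverse entropy" concrete, here is a small illustration of my own (not from the lesson): the Shannon entropy of the request-frequency distribution is low when the workload is peaked and high when it is flat:

```python
import math

def entropy(counts):
    """Shannon entropy (bits) of a prediction-request frequency distribution."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

# A peaked workload (a few inputs dominate) has low entropy...
print(entropy([1000, 50, 10, 5, 1]))   # ~0.4 bits

# ...while a flat workload (every input equally likely) has high entropy.
print(entropy([100] * 5))              # log2(5) ~ 2.32 bits
```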

For example, a model that predicts the next word given the current word, which you might find in your mobile phone keyboard app, would be highly peaked because a small number of words account for the majority of words used.

In contrast, a model that predicts quarterly revenue for all sales verticals in order to populate a report would be run on the same verticals, with the same frequency for each, and so its workload would be very flat.

Cardinality refers to the number of values in a set.

In this case, the set is composed of all the possible things we might have to make predictions for.

So, a model predicting sales revenue given organization division number would have fairly low cardinality.

A model predicting lifetime value given a user for an ecommerce platform would be high cardinality because the number of users, and the number of characteristics of each user, are likely to be quite large.

Taken together, peakedness and cardinality create a space.

When the cardinality is sufficiently low, we can store the entire expected prediction workload, for example, the predicted sales revenue for all divisions, in a table and use static serving.

When the cardinality is high, because the size of the input space is large, and the workload is not very peaked, you probably want to use dynamic serving.

In practice, though, you often choose a hybrid of static and dynamic, where you statically cache some of the predictions while responding on-demand for the long tail.

This works best when the distribution is sufficiently peaked.

On the slide's chart of peakedness versus cardinality, the striped area above the curve and outside the blue rectangle is suitable for a hybrid solution, with the most frequently requested predictions cached and the tail computed on demand.
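
A hybrid server might look like the following sketch (my own assumption about the cache policy; all names are hypothetical): pre-cache the head of the distribution and fall back to the model for the long tail:

```python
from functools import lru_cache

def run_model(word: str) -> str:
    """Stand-in for an expensive online model call (hypothetical)."""
    return word.upper()   # pretend this is a real next-word prediction

# Head of the distribution: precomputed offline for the most frequent inputs.
STATIC_CACHE = {"the": "THE", "and": "AND", "you": "YOU"}

@lru_cache(maxsize=10_000)   # optionally memoize tail requests as well
def dynamic_predict(word: str) -> str:
    return run_model(word)

def serve(word: str) -> str:
    # Frequent requests hit the static table; the long tail is computed
    # on demand, trading some latency for far less storage.
    if word in STATIC_CACHE:
        return STATIC_CACHE[word]
    return dynamic_predict(word)

print(serve("the"), serve("sesquipedalian"))
```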

Let’s try to estimate training and inference needs for the same use cases that we saw in the previous lesson.

The first use case is predicting whether an email is spam.

What inference style is needed?

Well, first we need to consider how peaked the distribution is.

The answer is not at all; most emails are likely to be different, although they may be very similar if generated programmatically.

Depending on the choice of representation, the cardinality might be enormous.

So, this would be dynamic.

The second use case is Android voice-to-text.

This is again subtle.

Inference is almost certainly online, since there’s such a long tail of possible voice clips.

But maybe with sufficient signal processing, some key phrases like “okay google” may have precomputed answers.

So, this would be dynamic or hybrid.

And the third use case is shopping ad conversion rate.

The set of all ads doesn’t change much from day to day.

Assuming users are comfortable waiting for a short while after uploading their ads, this could be done statically, and then a batch script could be run at regular intervals throughout the day.

This would be static.

In practice, you’ll often use a hybrid approach.

You might not have realized it, but dynamic serving is what we have been using so far.

Think back to the architecture of the systems we’ve used to make predictions: a model that lived in AI Platform was sent one or more instances and returned predictions for each.
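
For reference, a dynamic request to AI Platform's (now-legacy) online prediction endpoint might look like this sketch; the project and model names are placeholders, and the instance shape depends on your model's input schema:

```python
from googleapiclient import discovery

# Client for the legacy AI Platform Training and Prediction API.
ml = discovery.build("ml", "v1")

# Hypothetical identifiers: replace with your own project and model.
name = "projects/my-project/models/my_model"
response = (
    ml.projects()
    .predict(name=name, body={"instances": [{"features": [1.0, 2.0, 3.0]}]})
    .execute()
)
print(response["predictions"])
```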

If you wanted to build a static serving system, you would need to make three design changes.

First, you would need to change your call to AI Platform from an online prediction job to a batch prediction job.
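
Submitting a batch job through the same (legacy) API might look like the sketch below; the job id, bucket paths, and region are placeholders:

```python
from googleapiclient import discovery

ml = discovery.build("ml", "v1")

# Hypothetical identifiers: replace with your own project, model, and bucket.
body = {
    "jobId": "my_batch_predictions_001",
    "predictionInput": {
        "modelName": "projects/my-project/models/my_model",
        "dataFormat": "JSON",                      # one instance per line
        "inputPaths": ["gs://my-bucket/inputs/*"],
        "outputPath": "gs://my-bucket/predictions/",
        "region": "us-central1",
    },
}
ml.projects().jobs().create(parent="projects/my-project", body=body).execute()
```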

Second, you’d need to make sure that your model accepted and passed through keys as input.

These keys are what will allow you to join your requests to predictions at serving time.
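
One common pattern for this (a sketch under my own assumptions, not necessarily the course's exact recipe) is to wrap the trained model so a client-supplied key bypasses the network and is echoed alongside each prediction:

```python
import tensorflow as tf

def add_key_passthrough(trained_model: tf.keras.Model,
                        num_features: int) -> tf.keras.Model:
    """Wrap a model so a client-supplied key is returned with each prediction.

    The key never reaches the model itself; it only rides along so batch
    outputs can later be joined back to the requests that produced them.
    """
    key = tf.keras.Input(shape=(1,), dtype=tf.string, name="key")
    features = tf.keras.Input(shape=(num_features,), name="features")
    prediction = trained_model(features)
    # Identity layer: pass the key straight through to the output.
    key_out = tf.keras.layers.Lambda(lambda t: t, name="key_out")(key)
    return tf.keras.Model(
        inputs={"key": key, "features": features},
        outputs={"key": key_out, "prediction": prediction},
    )
```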

And third, you would write the predictions to a data warehouse, like BigQuery and create an API to read from it.
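
Once the batch output is loaded into BigQuery, the serving API reduces to a keyed lookup. Here is a sketch using the google-cloud-bigquery client; the table name is hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()

def lookup_prediction(key: str):
    """Serve a precomputed prediction by key from a (hypothetical) table."""
    job = client.query(
        "SELECT prediction "
        "FROM `my-project.serving.predictions` "   # placeholder table name
        "WHERE key = @key LIMIT 1",
        job_config=bigquery.QueryJobConfig(
            query_parameters=[
                bigquery.ScalarQueryParameter("key", "STRING", key)
            ]
        ),
    )
    rows = list(job.result())
    return rows[0]["prediction"] if rows else None
```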

Although the details for each of these instructions are beyond the scope of this lesson, we’ve provided links in the course resources on:

Enabling pass-through features in your model:

And loading data into BigQuery: