This paper proposes a new recommendation model called Wide & Deep.
It also describes the production system that deploys the model on Google Play.
The paper claims that Wide & Deep combines the advantages of the wide component (a generalized linear model) and the deep component (a feed-forward neural network), while mitigating each other's disadvantages.
The wide component is a generalized linear model: a basic linear model plus non-linear feature transformations.
It is good at memorizing observed examples, since a feature only takes effect when its value has been seen before. This can also be a disadvantage: it cannot generalize to unseen examples.
The most important feature transformation is the cross-product transformation. One example is the co-occurrence of categorical features, e.g. user language and user gender. A concrete example of the wide model features can be found in the TensorFlow model tutorial:
wide_columns = [
gender, native_country, education, occupation, workclass, marital_status, relationship, age_buckets,
tf.contrib.layers.crossed_column([education, occupation], hash_bucket_size=int(1e4)),
tf.contrib.layers.crossed_column([native_country, occupation], hash_bucket_size=int(1e4)),
tf.contrib.layers.crossed_column([age_buckets, race, occupation], hash_bucket_size=int(1e6))]
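To make the crossed columns above more concrete, here is a minimal pure-Python sketch of what a cross-product transformation does: it hashes the co-occurrence of categorical values into a fixed bucket space, producing one sparse feature per combination. The hashing scheme and feature values here are illustrative, not TensorFlow's actual implementation.

```python
import hashlib

def crossed_bucket(values, hash_bucket_size):
    """Map a tuple of co-occurring categorical values to a single bucket index,
    mimicking the idea behind crossed_column(..., hash_bucket_size=...)."""
    key = "_X_".join(values)
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size

# Example: the (education, occupation) co-occurrence becomes one sparse feature
# that the linear model can memorize a weight for.
bucket = crossed_bucket(("Bachelors", "Tech-support"), int(1e4))
assert 0 <= bucket < int(1e4)
```

Note that hashing makes the feature space fixed-size at the cost of occasional collisions, which is why rarer, higher-order crosses in the tutorial get a larger `hash_bucket_size`.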
The deep component is a typical feed-forward neural network.
It generalizes well: observed examples are projected into a lower-dimensional space, where examples with similar hidden semantics can cluster together. However, it can sometimes over-generalize, especially when the user-item matrix is very sparse. In that case the item similarity pattern may not be learnable, but the model will try to learn something anyway, which can lead to undesirable recommendations.
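A tiny sketch of the deep component's core idea: each sparse categorical ID is looked up in a dense embedding table, and the resulting low-dimensional vector is passed through feed-forward ReLU layers. The vocabulary, dimensions, and random weights below are all illustrative.

```python
import random

random.seed(0)

EMBED_DIM = 4  # illustrative; real embeddings are typically tens of dimensions

# Embedding table: each categorical ID maps to a learned dense vector.
vocab = ["en", "fr", "de"]
embedding = {v: [random.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)]
             for v in vocab}

def relu_layer(x, weights, bias):
    """One feed-forward layer: ReLU(W x + b)."""
    return [max(0.0, sum(w_i * x_i for w_i, x_i in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

# The lookup projects the sparse ID into a dense space, where IDs with
# similar hidden semantics can end up close together after training.
x = embedding["en"]
W = [[random.uniform(-0.5, 0.5) for _ in range(EMBED_DIM)] for _ in range(3)]
b = [0.0, 0.0, 0.0]
hidden = relu_layer(x, W, b)
```

The over-generalization issue mentioned above comes from this projection: even for a user-item pair with essentially no signal, the nearest neighbors in embedding space still produce a nonzero score.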
The two components are jointly trained against the app acquisition label. Below is the full model structure.
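A minimal sketch of the joint prediction: the wide and deep logits are summed before a single sigmoid, so one logistic loss backpropagates into both components at once (the paper's combined form is roughly P(y=1|x) = sigma(w_wide . [x, phi(x)] + w_deep . a_final + b)). The numbers here are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def wide_and_deep_prob(wide_logit, deep_logit, bias=0.0):
    """Joint prediction: sum the two components' logits, then apply a
    sigmoid, so a single logistic loss trains both components together."""
    return sigmoid(wide_logit + deep_logit + bias)

# Illustrative logits for one app impression.
p = wide_and_deep_prob(0.8, -0.3)
```

This is joint training, not an ensemble: neither component needs to be accurate on its own, only their sum does, which is why the wide part can stay small (just the crossed features the deep part misses).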
The most novel result comes from a live experiment on the Google Play store. The Wide & Deep model increases online app acquisition by 3.9% relative to the baseline, which is significant for a large-scale real-world recommendation system.
The experiment ran for 3 weeks with 1% of traffic on each arm. The baseline is a wide-only model, with a deep-only model as an additional comparison. The deep-only model gives a 2.9% relative increase; using it as the baseline, Wide & Deep gives about a 1% relative increase.
The paper also reports offline AUC for each model, where the improvement of the Wide & Deep model is less pronounced, and for the deep-only model the number actually decreases. The paper does not dig deeper into this discrepancy between offline and online metrics. The authors' hypothesis is that the online experiment better measures the impact of exploration in app recommendation, which I find intuitive as an app store user.
As new user interaction data comes in, the model can quickly become outdated. In this paper, when a new training run starts, parameters are initialized from the previous version of the model. The paper does not mention whether retraining is then performed on the whole data set or just the delta, nor how frequently retraining happens.
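The warm-start initialization can be sketched as copying every parameter that still exists (and still has the same shape) from the previous model, while anything new keeps its fresh initialization. The parameter names and dict-of-lists representation below are purely illustrative, not the paper's actual mechanism.

```python
def warm_start(new_params, old_params):
    """Initialize the new model from the previous one where names and
    shapes match; new parameters (e.g. freshly added vocabulary or cross
    buckets) keep their fresh initialization."""
    for name, value in old_params.items():
        if name in new_params and len(new_params[name]) == len(value):
            new_params[name] = list(value)
    return new_params

# Illustrative parameter dicts keyed by (made-up) variable names.
prev = {"wide/gender": [0.3, -0.2], "deep/layer1": [0.1, 0.4, -0.5]}
fresh = {"wide/gender": [0.0, 0.0],
         "deep/layer1": [0.0, 0.0, 0.0],
         "deep/layer2": [0.0]}
params = warm_start(fresh, prev)
```

The benefit is that each retraining run starts near the previous optimum instead of from scratch, which matters when retraining has to keep up with fresh data.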
In the live experiment, each data point is an app impression and the label is whether the impressed app was installed. It is natural to assume the number of positive labels is much smaller than the number of negative labels. It would be interesting to learn how this is handled during training, but the paper does not mention it.
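One common remedy for this kind of imbalance, which the paper does not describe, is to up-weight the rare positive (install) examples in the logistic loss. A hedged sketch, with an assumed imbalance ratio:

```python
import math

def weighted_log_loss(y_true, p_pred, pos_weight):
    """Logistic loss with the positive class up-weighted by pos_weight.
    This is a standard technique, not something the paper confirms using."""
    eps = 1e-7
    p = min(max(p_pred, eps), 1.0 - eps)
    if y_true == 1:
        return -pos_weight * math.log(p)
    return -math.log(1.0 - p)

# Assuming roughly 1 install per 100 impressions, pos_weight = 99 would
# make the two classes contribute comparably to the total loss.
loss_pos = weighted_log_loss(1, 0.1, pos_weight=99.0)
loss_neg = weighted_log_loss(0, 0.1, pos_weight=99.0)
```

Other common options include downsampling negatives and then recalibrating the predicted probabilities; which, if any, Google Play used is not stated.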