This model is an extension of the “Wide and Deep” model described in the last post. It replaces the wide part’s logistic regression with a factorization machine (FM), which can capture order-2 feature interactions effectively.
The new model follows the same “Wide and Deep” architecture. The deep part is almost unchanged; the key difference is the wide part.
| Wide part in “Wide and Deep” | Wide part in “DeepFM” |
| --- | --- |
In the Wide and Deep paper, the input of the logistic regression model is the raw features plus their cross-product transformations, i.e. order-n feature interactions, where n is usually small (e.g. 2 or 3). Choosing which features to cross often requires human insight.
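To make the cross-product transformation concrete, here is a minimal sketch. The feature names and the list of crosses are hypothetical; the point is that a crossed feature fires only when all of its (hand-picked) component binary features fire.

```python
def cross_transform(features, crosses):
    """Wide & Deep-style cross-product transformation.

    `features` maps binary feature names to 0/1; `crosses` is the
    hand-designed list of feature groups to interact (the part that
    requires human insight). Each crossed feature is 1 only when all
    of its components are 1.
    """
    out = dict(features)  # keep the raw binary features
    for group in crosses:
        out[" x ".join(group)] = int(all(features.get(f, 0) for f in group))
    return out

# Hypothetical binary features and two hand-designed order-2 crosses:
feats = {"gender=f": 1, "language=en": 1, "country=us": 0}
wide_input = cross_transform(feats, [("gender=f", "language=en"),
                                     ("language=en", "country=us")])
# "gender=f x language=en" fires; "language=en x country=us" does not.
```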
In the DeepFM paper, the linear part is replaced by a factorization machine, which uses latent feature vectors to automatically model all order-2 feature interactions, with no human input needed to select which features to interact. Worth noting that these latent feature vectors are also shared with the deep part as its embeddings.
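A minimal NumPy sketch of this structure, assuming toy sizes and random weights (all illustrative, not from the paper): the FM part and a tiny MLP both read the same embedding table, and their logits are summed before the sigmoid.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: n sparse fields, one active feature id per field,
# and a shared embedding table of dimension k.
n_fields, vocab, k = 3, 10, 4
emb = rng.normal(size=(vocab, k))   # shared latent vectors v_i
w = rng.normal(size=vocab)          # first-order FM weights
x_idx = np.array([1, 5, 8])         # active feature ids, one per field

def fm_part(idx):
    """First-order term plus all pairwise second-order FM terms."""
    v = emb[idx]                    # (n_fields, k)
    first = w[idx].sum()
    # O(nk) identity: sum_{i<j} <v_i, v_j> = 0.5*(||sum_i v_i||^2 - sum_i ||v_i||^2)
    second = 0.5 * ((v.sum(axis=0) ** 2).sum() - (v ** 2).sum())
    return first + second

def deep_part(idx, W1, W2):
    """A tiny MLP over the concatenated shared embeddings."""
    h = np.maximum(emb[idx].reshape(-1) @ W1, 0.0)  # ReLU hidden layer
    return h @ W2

W1 = rng.normal(size=(n_fields * k, 8))
W2 = rng.normal(size=8)
logit = fm_part(x_idx) + deep_part(x_idx, W1, W2)
prob = 1.0 / (1.0 + np.exp(-logit))  # final click-probability prediction
```

Because `emb` appears in both parts, gradients from the deep network and from the FM interactions would update the same latent vectors during training, which is the embedding-sharing idea the paper evaluates.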
Thus it’s easy to understand the improvements claimed in the paper:
- No need for feature engineering (FM handles it automatically)
- Better prediction performance (using a model with more representative power)
The DeepFM model improved AUC by 0.01 to 0.02 (absolute) on two datasets. The authors also compared sharing versus not sharing feature embeddings between the wide and deep parts, and saw a performance improvement from sharing on both datasets.
An efficiency comparison also showed that the added model complexity doesn’t increase training time much.
It’s unclear why a “wide and deep” architecture performs better than a single DNN larger than the deep part alone. One way to understand it: the structure acts as a form of regularization built into the architecture, emphasizing the lower-order feature interactions.
Could models learn this themselves automatically? Just as feature engineering could be replaced by FM, could hand-designed model architecture be replaced by a better modeling strategy?