2025-11-03 | TRACE | 6 min read | #Recommendations #XDeepFM #PyTorch #ML

Squeezing 22% More Accuracy from E-commerce Recommendations with XDeepFM

The story behind my IJTE paper: why explicit feature interactions beat deeper networks for recommendation

In my final year I built a recommendation engine for e-commerce using the Extreme Deep Factorization Machine (XDeepFM), and the work ended up published in IJTE in August 2024 as “Enhanced Contextual Recommendation in E-commerce with XDeepFM.” The headline number was a 22% accuracy gain over our baseline - but the interesting part is why this architecture wins.

The problem: interactions, not features

Recommendation data is categorical and sparse: user ID, item category, time of day, device. No single feature predicts a purchase. The signal lives in combinations - this user, in this category, on mobile, in the evening. Classic factorization machines capture pairwise combinations; plain deep networks learn interactions implicitly, but you can never say which ones, or trust that they found the ones that matter.
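
To make "pairwise combinations" concrete, here is a minimal sketch of the second-order term of a factorization machine for one-hot categorical features. It is written in plain Python for readability, not taken from the paper's code; the feature names and the 2-dimensional embedding values are made up for illustration.

```python
from itertools import combinations


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def fm_pairwise_score(embeddings):
    """Second-order FM term for one-hot features:
    the sum of inner products over every pair of active feature embeddings."""
    return sum(dot(u, v) for u, v in combinations(embeddings, 2))


# Hypothetical embeddings for the features active in one impression.
active = {
    "user=u42": [1.0, 0.0],
    "category=shoes": [0.5, 0.5],
    "device=mobile": [0.0, 2.0],
}

score = fm_pairwise_score(list(active.values()))
print(score)  # 0.5 (user, category) + 0.0 (user, device) + 1.0 (category, device)
```

Every term involves exactly two features, which is the limit the rest of this post is about: a three-way pattern such as user, category and device together has no term of its own here.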

What makes XDeepFM different

XDeepFM's contribution is the Compressed Interaction Network (CIN). Where a DNN mixes features at the bit level and hopes, CIN constructs feature interactions explicitly at the vector level, one order per layer - second-order, third-order, and up. The full model runs three components side by side and lets each do what it's good at:

A linear part for raw, memorized signal - the “people who buy X buy Y” of the model.
The CIN for explicit, bounded-degree feature interactions you can reason about.
A plain DNN for whatever implicit patterns remain.
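
The CIN step can be sketched in a few lines. The version below uses plain Python loops instead of PyTorch tensors so that each multiplication is visible; it is an illustration of the layer as described above, not the code from the paper. The embeddings, the weight values, and the helper names (`cin_layer`, `sum_pool`, `xdeepfm_probability`) are assumptions chosen to keep the arithmetic easy to follow.

```python
import math


def cin_layer(x_prev, x0, weights):
    """One CIN layer.

    x_prev:  H_prev vectors of length D (output of the previous layer)
    x0:      m vectors of length D (the raw field embeddings)
    weights: H_next matrices of shape H_prev x m, one per output feature map
    Returns H_next vectors of length D.
    """
    dim = len(x0[0])
    out = []
    for w in weights:
        row = [0.0] * dim
        for i, prev_vec in enumerate(x_prev):
            for j, base_vec in enumerate(x0):
                for d in range(dim):
                    # Element-wise product: the interaction stays at the vector level.
                    row[d] += w[i][j] * prev_vec[d] * base_vec[d]
        out.append(row)
    return out


def sum_pool(feature_maps):
    """Collapse each feature map to a scalar by summing over the embedding dimension."""
    return [sum(row) for row in feature_maps]


def xdeepfm_probability(linear_logit, cin_logit, dnn_logit):
    """Combine the three components: add their logits, then apply a sigmoid."""
    return 1.0 / (1.0 + math.exp(-(linear_logit + cin_logit + dnn_logit)))


# Two fields with 2-dimensional embeddings (illustrative values).
x0 = [[1.0, 2.0], [3.0, 4.0]]

# Layer 1: a single feature map that keeps only the (field 0, field 1) pair.
w1 = [[[0.0, 1.0], [0.0, 0.0]]]
x1 = cin_layer(x0, x0, w1)   # second-order: [[3.0, 8.0]]

# Layer 2: multiply the layer-1 map with field 0 again.
w2 = [[[1.0, 0.0]]]
x2 = cin_layer(x1, x0, w2)   # third-order: [[3.0, 16.0]]

# Pooled features from every layer feed the CIN's output unit.
pooled = sum_pool(x1) + sum_pool(x2)
print(pooled)  # [11.0, 19.0]
```

Each layer multiplies the previous layer's output with the raw embeddings once more, so the interaction order grows by exactly one per layer, and the depth of the CIN bounds the highest order the model can build. In a real model the weights are learned, and the linear and DNN logits passed to `xdeepfm_probability` come from their own sub-networks.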

What moved the needle in practice

Feature engineering still mattered more than architecture. Sessionizing user behavior and bucketing timestamps into behavioral windows beat adding CIN layers.
Embedding dimension was the most sensitive hyperparameter. Too small and interactions blur; too large and the embeddings of sparse categories overfit.
Two or three CIN layers were enough. Interaction orders beyond that added cost, not accuracy - most real-world signal is low-order.
The 22% gain came from the combination: explicit interactions caught patterns the DNN alone missed, and the DNN caught what factorization couldn't express.
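
The two feature-engineering steps mentioned above, sessionizing and timestamp bucketing, might look roughly like this. The 30-minute inactivity gap and the window boundaries are assumptions for the sketch; this post does not state the values used in the paper.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # assumed inactivity threshold


def time_bucket(ts):
    """Map a timestamp to a coarse behavioral window (boundaries are assumptions)."""
    hour = ts.hour
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    if 18 <= hour < 23:
        return "evening"
    return "night"


def sessionize(timestamps, gap=SESSION_GAP):
    """Assign a session index to each event of a single user.

    Events are processed in chronological order; a new session starts
    whenever the pause since the previous event exceeds `gap`.
    """
    session_ids = []
    current = 0
    previous = None
    for ts in sorted(timestamps):
        if previous is not None and ts - previous > gap:
            current += 1
        session_ids.append(current)
        previous = ts
    return session_ids


# Hypothetical click timestamps for one user.
events = [
    datetime(2024, 3, 1, 19, 0),
    datetime(2024, 3, 1, 19, 10),
    datetime(2024, 3, 1, 21, 0),
    datetime(2024, 3, 2, 8, 30),
]

sessions = sessionize(events)
buckets = [time_bucket(ts) for ts in sorted(events)]
print(sessions)  # [0, 0, 1, 2]
print(buckets)   # ['evening', 'evening', 'evening', 'morning']
```

Both outputs are categorical, so they can be embedded like any other field and fed to the linear part, the CIN and the DNN alongside user ID, item category and device.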

What publishing taught me

Writing the paper forced a discipline that the code never did: every claim needed an experiment, every experiment needed a baseline, and “it seems better” had to become a number. That habit - benchmark first, then believe - followed me straight into backend engineering, where the same rule applies to query latency just as it does to recommendation accuracy.