How a Custom Multimodal Transformer Beat a Fine-Tuned LLM for Attribute


DEV Community·gentic news·about 1 month ago




LeBonCoin's ML team built a custom late-fusion transformer that uses pre-computed visual embeddings and character n-gram text vectors to predict ad attributes. It outperformed a fine-tuned VLM while running on CPU with sub-200ms latency, offering calibrated probabilities and 15-minute retraining cycles.

What Happened

Louis-Victor Pasquier, Senior ML Engineer at LeBonCoin (the French classifieds giant), published a detailed technical post describing how his team's custom multimodal transformer outperformed a fine-tuned Vision-Language Model (VLM) for attribute prediction, while being dramatically more efficient.…
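To make the input side of such a model more concrete, here is a minimal sketch of how an ad title might be turned into a character n-gram vector and paired with a pre-computed image embedding as separate tokens for late fusion. This is not LeBonCoin's code: the function names, the n-gram range, the hashing trick, the vector size, and the sample inputs are all assumptions chosen for illustration, and the transformer that would attend over the tokens is left out.

```python
import math
import zlib


def char_ngrams(text, n_min=2, n_max=4):
    """Return all character n-grams of the lowercased, space-padded text."""
    padded = f" {text.lower().strip()} "
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def hashed_text_vector(text, dim=64):
    """Hash n-gram counts into a fixed-size, L2-normalised vector.

    The hashing trick and dim=64 are illustrative assumptions.
    """
    vec = [0.0] * dim
    for gram in char_ngrams(text):
        # crc32 is deterministic across runs, unlike the built-in hash().
        vec[zlib.crc32(gram.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def late_fusion_tokens(visual_embedding, text_vector):
    """Keep each modality as its own token for a downstream fusion model."""
    return [("image", list(visual_embedding)), ("text", list(text_vector))]


# Hypothetical inputs: a tiny stand-in for a pre-computed image embedding
# and an invented ad title.
visual = [0.1, -0.3, 0.7, 0.2]
text_vec = hashed_text_vector("Vélo de course Peugeot")
tokens = late_fusion_tokens(visual, text_vec)
```

In a real model each token would typically be projected to a shared width before the fusion layers; the sketch stops at building the per-modality inputs, which is the part that can run cheaply on CPU.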
