Scalable Object Detection in the Car Interior With Vision Foundation Models
🚘 IEEE Intelligent Vehicles Symposium (IV) 2026
Overview
Can your car’s personal assistant identify what you left on the back seat? Interior object detection is critical for next-gen vehicle intelligence, but on-board hardware is too constrained to run modern foundation models.
We propose ODAL (Object Detection and Localization), a framework that leverages vision foundation models through a distributed on-board/cloud architecture — bringing the power of large-scale models to the car interior without exceeding hardware limits.
Key Results
🏆 Fine-tuned ODAL-LLaVA achieves an ODAL score of 89%
📈 71% improvement over baseline LLaVA 1.5 7B performance
💪 Outperforms GPT-4o by nearly 20%
🔇 3× higher signal-to-noise ratio than GPT-4o — significantly fewer hallucinations
Key Contributions
🏗️ Distributed architecture — splits computation between on-board and cloud to overcome resource constraints
📏 ODALbench — a new comprehensive metric for evaluating detection and localization quality
🔬 Foundation model comparison — systematic evaluation of GPT-4o vs. lightweight LLaVA models
✨ Fine-tuning wins — demonstrates that a small, fine-tuned model can decisively beat a much larger general-purpose one
Why It Matters
This work shows that you don’t need the biggest model to get the best results. With smart fine-tuning and a distributed architecture, lightweight models can outperform GPT-4o for targeted automotive applications.
Abstract
AI tasks in the car interior, such as identifying and localizing externally introduced objects, are crucial to the response quality of personal assistants. However, the computational resources of on-board systems remain highly constrained, restricting the deployment of such solutions directly within the vehicle. To address this limitation, we propose the novel Object Detection and Localization (ODAL) framework for interior scene understanding. Our approach leverages vision foundation models through a distributed architecture, splitting computational tasks between on-board and cloud systems. This design overcomes the resource constraints of running foundation models directly in the car. To benchmark model performance, we introduce ODALbench, a new metric for comprehensive assessment of detection and localization. Our analysis demonstrates the framework's potential to establish new standards in this domain. We compare the state-of-the-art GPT-4o vision foundation model with the lightweight LLaVA 1.5 7B model and explore how fine-tuning enhances the lightweight model's performance. Remarkably, our fine-tuned ODAL-LLaVA model achieves an ODAL score of 89%, a 71% improvement over its baseline performance, outperforming GPT-4o by nearly 20%. Furthermore, the fine-tuned model maintains high detection accuracy while significantly reducing hallucinations, achieving an ODAL SNR three times higher than GPT-4o.
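To make the two headline numbers concrete, here is a minimal, purely illustrative sketch of how a detection score and a signal-to-noise ratio of the kind reported above could be computed. This is an assumption, not the paper's ODALbench implementation: it assumes a prediction counts as correct when both the object label and its location match the ground truth, and counts any prediction absent from the ground truth as a hallucination.

```python
# Hypothetical ODAL-style evaluation sketch (NOT the paper's ODALbench).
# A prediction is a (label, location) pair; correctness requires both to match.

def odal_eval(predictions, ground_truth):
    """Return (score, snr) for a list of (label, location) predictions.

    score: fraction of ground-truth objects correctly detected and localized.
    snr:   ratio of correct detections to hallucinated ones.
    """
    gt = set(ground_truth)
    correct = sum(1 for p in predictions if p in gt)
    hallucinated = len(predictions) - correct
    score = correct / len(gt) if gt else 1.0
    snr = correct / hallucinated if hallucinated else float("inf")
    return score, snr

# Toy example: one correct detection, one hallucinated object.
gt = [("backpack", "rear-left seat"), ("phone", "rear-right seat")]
pred = [("backpack", "rear-left seat"), ("umbrella", "front passenger seat")]
score, snr = odal_eval(pred, gt)  # score = 0.5, snr = 1.0
```

Under this reading, a 3x higher SNR means the fine-tuned model produces roughly three times more correct detections per hallucinated object than GPT-4o.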
Citation
@inproceedings{Mszros2025,
  title     = {Scalable Object Detection in the Car Interior With Vision Foundation Models},
  author    = {Schmidt, Sebastian and M\'{e}sz\'{a}ros, B\'{a}lint and Firintepe, Ahmet and G\"{u}nnemann, Stephan},
  year      = {2026},
  booktitle = {Proceedings of the IEEE Intelligent Vehicles Symposium (IV)},
  url       = {TBD},
}