# RSE4018 Capstone — Generic C++ ONNX Inference Adapter
My capstone project tackles a real industrial challenge: deploying modern deep learning models on the OMRON FH-L551 Vision Controller — an embedded system running an Intel Atom E3827 with only 2 cores, 1 MB L2 cache, and no AVX/AVX2 support.
I designed and built a Generic C++ Inference Adapter that wraps ONNX Runtime and OpenCV into a single portable executable capable of running any ONNX model. Through systematic OS-level and memory-architecture optimizations, I achieved up to 50% latency reduction without requiring any hardware upgrades or specialized instruction sets.
## The Problem

Most inference acceleration frameworks (OpenVINO, TensorRT) require AVX2 or CUDA — capabilities the FH-L551’s Atom E3827 physically cannot provide. The baseline adapter took 15.7 seconds for a single YOLOX-S inference and 18.9 seconds for a PP-OCR pipeline — far exceeding the 10-second takt time required by the production line.
## Results
| Model | V0 Baseline | V2 Final | Improvement |
|---|---|---|---|
| YOLOX-S (Object Detection) | 15,742 ms | ~7,771 ms | ▼ 50.6% |
| PP-OCR (Text Recognition) | 18,943 ms | ~10,920 ms | ▼ 42.3% |
## Core Optimizations
### 1. Zero-Copy IoBinding

Binds the caller’s raw `float*` pointers directly to ONNX Runtime via `Ort::IoBinding`, eliminating all per-frame `memcpy` operations. Critical on hardware with only 1 MB of L2 cache, where a single 640×640×3 image (4.9 MB as float) exceeds the entire cache.
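A minimal sketch of the binding pattern against the ONNX Runtime C++ API. The tensor names `"images"` and `"output"`, the shape ranks, and the function name are placeholders for illustration, not the adapter’s actual interface:

```cpp
#include <onnxruntime_cxx_api.h>
#include <array>

// Sketch: wrap caller-owned buffers once, then reuse them every frame.
// "images"/"output" are placeholder tensor names; real names come from the model.
void runZeroCopy(Ort::Session& session,
                 float* inputData,  std::array<int64_t, 4> inShape,  size_t inCount,
                 float* outputData, std::array<int64_t, 3> outShape, size_t outCount) {
    Ort::MemoryInfo mem = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);

    // The Ort::Value tensors alias the caller's pointers directly -- no memcpy into ORT.
    Ort::Value inTensor = Ort::Value::CreateTensor<float>(
        mem, inputData, inCount, inShape.data(), inShape.size());
    Ort::Value outTensor = Ort::Value::CreateTensor<float>(
        mem, outputData, outCount, outShape.data(), outShape.size());

    Ort::IoBinding binding(session);
    binding.BindInput("images", inTensor);
    binding.BindOutput("output", outTensor);

    session.Run(Ort::RunOptions{nullptr}, binding);  // results land in outputData
}
```

Because the bound tensors alias caller memory, the only per-frame work is the `Run` call itself.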
### 2. Fused Preprocessing

Replaced 3 separate OpenCV operations (colour conversion, normalization, layout transpose) with a single fused `fillBlobCHW()` loop — one pass through the image instead of three, eliminating 2 temporary buffer allocations per frame.
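A minimal sketch of such a fused loop, assuming interleaved 8-bit BGR input and a planar RGB float blob scaled to [0, 1]; the production adapter’s normalization constants and signature may differ:

```cpp
#include <cstdint>
#include <cstddef>

// Fused BGR->RGB conversion, [0,1] scaling, and HWC->CHW transpose in one pass.
// src: interleaved 8-bit BGR (h*w*3 bytes); dst: planar float RGB (3*h*w floats).
// The 1/255 scale is an assumed normalization; the real constants may differ.
void fillBlobCHW(const std::uint8_t* src, float* dst, std::size_t h, std::size_t w) {
    const std::size_t plane = h * w;
    for (std::size_t i = 0; i < plane; ++i) {
        const std::uint8_t b = src[3 * i + 0];
        const std::uint8_t g = src[3 * i + 1];
        const std::uint8_t r = src[3 * i + 2];
        dst[0 * plane + i] = r / 255.0f;  // R plane
        dst[1 * plane + i] = g / 255.0f;  // G plane
        dst[2 * plane + i] = b / 255.0f;  // B plane
    }
}
```

One pass means each source pixel is still cache-resident when all three of its output writes happen — the property that matters on a 1 MB L2 cache.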
### 3. Hoisted Buffer Allocation

Moved all `std::vector` allocations outside the inference loop — `reserve()` once up front, `clear()` inside the loop — reducing per-frame heap allocations from ~14 to 0. Prevents heap fragmentation on 24/7 industrial systems.
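The pattern relies on `clear()` resetting size but never releasing capacity. A minimal sketch with hypothetical names (`Pipeline`, `Detection` are illustrative, not the adapter’s actual types):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-detection record, for illustration only.
struct Detection { float x, y, w, h, score; int cls; };

class Pipeline {
public:
    explicit Pipeline(std::size_t maxDetections) {
        // Allocate once at startup: this capacity survives every later clear().
        detections_.reserve(maxDetections);
    }

    // Called every frame: clear() resets the size but keeps the capacity,
    // so the steady-state loop performs zero heap allocations.
    const std::vector<Detection>& process() {
        detections_.clear();
        // ... decode model output into detections_; push_back never reallocates
        //     while the count stays within the reserved capacity ...
        detections_.push_back({0.0f, 0.0f, 1.0f, 1.0f, 0.9f, 0});  // placeholder result
        return detections_;
    }

private:
    std::vector<Detection> detections_;  // hoisted out of the per-frame loop
};
```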
## Build History (10 Iterations)

- **Build 1: V0 Baseline.** First compilation. 15,742 ms YOLOX on the FH-L551.
- **Build 2: OpenVINO EP.** SIGILL crash — AVX2 required, permanent incompatibility.
- **Build 3: INT8 Quantization.** 2.9% slower without AVX2 SIMD instructions.
- **Build 4: Thread Pinning.** 7.8× worse jitter on the 2-core Atom.
- **Build 5: V1 Optimized.** Zero-copy IoBinding + fused preprocessing + hoisted allocation.
- **Build 6: YOLOX Shape Mismatch.** Hardcoded output shapes crash on model switch.
- **Build 7: V1 Fixed.** Dynamic shape inference makes the adapter truly generic.
- **Build 8: YOLOX Decode + NMS.** Full anchor-free decode with greedy NMS.
- **Build 9: Experiment Suite.** 5-configuration benchmark identifies optimal ORT settings.
- **Build 10: V2 Final.** Production build with all proven optimizations.
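The greedy NMS step from Build 8 can be sketched as a plain IoU-based suppression loop; `Box`, `iou`, and `nms` are illustrative names here, not the adapter’s actual interface:

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Axis-aligned box with a confidence score (illustrative type).
struct Box { float x1, y1, x2, y2, score; };

// Intersection-over-union of two boxes.
static float iou(const Box& a, const Box& b) {
    const float ix1 = std::max(a.x1, b.x1), iy1 = std::max(a.y1, b.y1);
    const float ix2 = std::min(a.x2, b.x2), iy2 = std::min(a.y2, b.y2);
    const float iw = std::max(0.0f, ix2 - ix1), ih = std::max(0.0f, iy2 - iy1);
    const float inter = iw * ih;
    const float uni = (a.x2 - a.x1) * (a.y2 - a.y1)
                    + (b.x2 - b.x1) * (b.y2 - b.y1) - inter;
    return uni > 0.0f ? inter / uni : 0.0f;
}

// Greedy NMS: keep the highest-scoring box, suppress overlaps above iouThresh, repeat.
// Returns the indices of the kept boxes, in descending score order.
std::vector<std::size_t> nms(const std::vector<Box>& boxes, float iouThresh) {
    std::vector<std::size_t> order(boxes.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) { return boxes[a].score > boxes[b].score; });

    std::vector<std::size_t> keep;
    std::vector<bool> suppressed(boxes.size(), false);
    for (std::size_t i : order) {
        if (suppressed[i]) continue;
        keep.push_back(i);
        for (std::size_t j : order)
            if (!suppressed[j] && j != i && iou(boxes[i], boxes[j]) > iouThresh)
                suppressed[j] = true;
    }
    return keep;
}
```

Greedy NMS is O(n²) in the worst case, but post-threshold candidate counts are small enough that this is negligible next to the model’s forward pass.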
## Skills & Technologies

Languages: C++17, Python | Libraries: ONNX Runtime, OpenCV | Toolchain: MSVC | Models: YOLOX-S, PaddleOCR | Hardware: OMRON FH-L551, Intel Atom E3827