RSE4018 Capstone — Generic C++ ONNX Inference Adapter

My capstone project tackles a real industrial challenge: deploying modern deep learning models on the OMRON FH-L551 Vision Controller — an embedded system running an Intel Atom E3827 with only 2 cores, 1 MB L2 cache, and no AVX/AVX2 support.

I designed and built a Generic C++ Inference Adapter that wraps ONNX Runtime and OpenCV into a single portable executable capable of running any ONNX model. Through systematic OS-level and memory-architecture optimizations, I achieved up to 50% latency reduction without requiring any hardware upgrades or specialized instruction sets.

Key metrics: 50.6% YOLOX-S speedup · 42.3% PP-OCR speedup · 10 build iterations · 0 mallocs/frame (V2)

[Figure: ONNX Inference Adapter Architecture]

The Problem

Most inference acceleration frameworks (OpenVINO, TensorRT) require AVX2 or CUDA — instruction sets the FH-L551’s Atom E3827 physically cannot execute. The baseline adapter took 15.7 seconds for a single YOLOX-S inference and 18.9 seconds for a PP-OCR pipeline — far exceeding the 10-second takt time required by the production line.

Results

Model                        V0 Baseline   V2 Final     Improvement
YOLOX-S (Object Detection)   15,742 ms     ~7,771 ms    ▼ 50.6%
PP-OCR (Text Recognition)    18,943 ms     ~10,920 ms   ▼ 42.3%

Core Optimizations

1. Zero-Copy IoBinding

Binds the caller's raw float* pointers directly to ONNX Runtime via Ort::IoBinding, eliminating all per-frame memcpy operations. This matters on hardware with only 1 MB of L2 cache, where a single 640×640×3 image (4.9 MB as float32) exceeds the entire cache.

2. Fused Preprocessing

Replaced three separate OpenCV operations (color conversion, normalization, layout transpose) with a single fused fillBlobCHW() loop: one pass through the image instead of three, eliminating two temporary buffer allocations per frame.
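A sketch of the fused-loop idea, assuming interleaved 8-bit BGR input and a simple divide-by-255 normalization (the adapter's actual normalization constants are not stated in this summary):

```cpp
#include <cstddef>
#include <cstdint>

// Single pass over the image fuses three steps: BGR->RGB channel swap,
// uint8 -> [0,1] float normalization, and HWC -> CHW layout transpose.
// No intermediate buffers are written.
void fillBlobCHW(const uint8_t* bgr, int h, int w, float* blob) {
    const size_t plane = static_cast<size_t>(h) * w;  // elements per channel
    for (size_t i = 0; i < plane; ++i) {
        const uint8_t b = bgr[3 * i + 0];
        const uint8_t g = bgr[3 * i + 1];
        const uint8_t r = bgr[3 * i + 2];
        blob[0 * plane + i] = r / 255.0f;  // R plane
        blob[1 * plane + i] = g / 255.0f;  // G plane
        blob[2 * plane + i] = b / 255.0f;  // B plane
    }
}
```

On a cache-starved core, touching each pixel once instead of three times is often worth more than the saved allocations.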

3. Hoisted Buffer Allocation

Moved all std::vector allocations outside the inference loop, calling reserve() once up front and clear() (which keeps capacity) inside the loop, reducing per-frame heap allocations from ~14 to 0. This also prevents heap fragmentation on 24/7 industrial systems.
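The pattern can be sketched as a scratch-buffer struct; the member names and sizes here are illustrative, not the adapter's actual fields:

```cpp
#include <cstddef>
#include <vector>

// All per-frame scratch buffers are allocated once at startup and reused.
struct InferenceScratch {
    std::vector<float> blob;    // preprocessed input tensor
    std::vector<float> scores;  // decoded detection scores
    std::vector<int>   keep;    // indices surviving NMS

    InferenceScratch(size_t blobElems, size_t maxDets) {
        blob.reserve(blobElems);  // one-time heap allocations
        scores.reserve(maxDets);
        keep.reserve(maxDets);
    }

    // clear() resets size but keeps capacity, so refilling the vectors
    // inside the inference loop performs zero heap allocations.
    void beginFrame() {
        blob.clear();
        scores.clear();
        keep.clear();
    }
};
```

The C++ standard guarantees that clear() does not shrink capacity, so as long as per-frame usage stays within the reserved bounds, the data pointers never move.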

Build History (10 Iterations)

Build 1: V0 Baseline

First compilation: 15,742 ms for YOLOX-S on the FH-L551.

Build 2: OpenVINO EP

SIGILL crash — AVX2 required, permanent incompatibility.

Build 3: INT8 Quantization

2.9% slower without AVX2 SIMD instructions.

Build 4: Thread Pinning

7.8× worse jitter on 2-core Atom.

Build 5: V1 Optimized

Zero-copy IoBinding + fused preprocessing + hoisted allocation.

Build 6: YOLOX Shape Mismatch

Hardcoded output shapes crash on model switch.

Build 7: V1 Fixed

Dynamic shape inference makes adapter truly generic.

Build 8: YOLOX Decode + NMS

Full anchor-free decode with greedy NMS.
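Greedy NMS itself is compact; a self-contained sketch of the standard algorithm (box layout and threshold are illustrative, and the adapter's actual decode step is omitted):

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

struct Box { float x1, y1, x2, y2, score; };

// Intersection-over-union of two axis-aligned boxes.
static float iou(const Box& a, const Box& b) {
    float iw = std::max(0.0f, std::min(a.x2, b.x2) - std::max(a.x1, b.x1));
    float ih = std::max(0.0f, std::min(a.y2, b.y2) - std::max(a.y1, b.y1));
    float inter = iw * ih;
    float areaA = (a.x2 - a.x1) * (a.y2 - a.y1);
    float areaB = (b.x2 - b.x1) * (b.y2 - b.y1);
    return inter / (areaA + areaB - inter);
}

// Greedy NMS: repeatedly keep the highest-scoring remaining box and
// suppress every box overlapping it beyond iouThresh.
std::vector<int> greedyNMS(const std::vector<Box>& boxes, float iouThresh) {
    std::vector<int> order(boxes.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return boxes[a].score > boxes[b].score; });
    std::vector<int> keep;
    std::vector<bool> suppressed(boxes.size(), false);
    for (int i : order) {
        if (suppressed[i]) continue;
        keep.push_back(i);
        for (int j : order)
            if (!suppressed[j] && j != i && iou(boxes[i], boxes[j]) > iouThresh)
                suppressed[j] = true;
    }
    return keep;
}
```

Greedy NMS is O(n²) in the worst case, but for the few hundred candidates that survive score thresholding it is negligible next to the model's forward pass.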

Build 9: Experiment Suite

5-configuration benchmark identifies optimal ORT settings.

Build 10: V2 Final

Production build with all proven optimizations.

Skills & Technologies

Languages: C++17, Python  |  Libraries: ONNX Runtime, OpenCV  |  Toolchain: MSVC  |  Models: YOLOX-S, PaddleOCR  |  Hardware: OMRON FH-L551, Intel Atom E3827