Technical Documentation — Generic C++ ONNX Inference Adapter
1. Target Hardware — OMRON FH-L551
| Specification | Value | Implication |
|---|---|---|
| CPU | Intel Atom E3827 (2C/2T, 1.74 GHz) | No hyperthreading; limited parallel compute |
| L2 Cache | 1 MB (shared) | 640×640×3 FP32 image = 4.9 MB — exceeds entire cache |
| Max SIMD | SSE4.2 | No AVX/AVX2 — eliminates OpenVINO, INT8 VNNI, most accelerators |
| RAM | 2 GB DDR3L | Must avoid heap fragmentation on 24/7 systems |
| OS | Windows Embedded | Thread affinity competes with OS kernel threads |
The check_avx.cpp utility uses CPUID to confirm: SSE4.2: YES | AVX: NO | AVX2: NO. This single diagnostic permanently ruled out three optimization paths.
2. Adapter Architecture
On construction, the adapter loads the model, discovers its input/output names and shapes via inspectIO(), and dispatches to the appropriate decoder (PP-OCR, YOLOX, or generic fallback). Unknown models produce diagnostic output instead of crashing.
File Structure
| File | Version | Role |
|---|---|---|
| ONNXInference.h/.cpp | V0 | Baseline adapter. Copy-based infer(), OpenVINO EP support (conditional #ifdef). |
| ONNXInference_opt.h/.cpp | V1 | Added inferFixed() zero-copy path using Ort::IoBinding. Removed OpenVINO code. |
| ONNXInference_opt_yoloxopt.h/.cpp | V2 | Added binding cache (binding_cache_valid_) to skip re-binding when pointers are unchanged. |
| main.cpp | V0 | Baseline harness with makeBlobCHW() (3-pass preprocessing), per-frame allocation. |
| main_opt.cpp | V1 | Fused fillBlobCHW(), hoisted allocation, watchdog thread, M4 EWMA supervisor. |
| main_opt_yoloxopt.cpp | V2 | Production harness with all optimizations, full YOLOX decode + NMS, generic fallback. |
Inference Pipeline
Image → Letterbox Resize → fillBlobCHW (fused BGR→RGB + normalize + HWC→CHW)
→ inferFixed() (zero-copy IoBinding)
→ Decode (PP-OCR DBNet / YOLOX anchor-free / Generic fallback)
→ [Optional: REC pipeline for text recognition]
→ CSV benchmark log + M4 EWMA compliance check
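The M4 EWMA supervisor itself is not shown in this document; a minimal sketch, assuming a standard exponentially weighted moving average over per-frame latency with a caller-chosen smoothing factor and latency budget (both hypothetical parameters here), could be:

```cpp
// Hypothetical M4-style EWMA latency supervisor: smooths per-frame latency
// and flags a compliance violation when the smoothed value exceeds a budget.
class EwmaSupervisor {
public:
    EwmaSupervisor(double alpha, double budget_ms)
        : alpha_(alpha), budget_ms_(budget_ms) {}

    // Feed one frame's latency; returns true while the EWMA stays in budget.
    bool update(double latency_ms) {
        ewma_ = seeded_ ? alpha_ * latency_ms + (1.0 - alpha_) * ewma_
                        : latency_ms;  // first sample seeds the average
        seeded_ = true;
        return ewma_ <= budget_ms_;
    }

    double value() const { return ewma_; }

private:
    double alpha_;
    double budget_ms_;
    double ewma_ = 0.0;
    bool seeded_ = false;
};
```

An EWMA reacts to sustained drift while ignoring single-frame spikes, which suits a 24/7 compliance check better than raw per-frame thresholds.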
3. Build History & Compilation
All builds use MSVC cl.exe from a Visual Studio Developer Command Prompt. Dependencies are installed via setup_deps.ps1 (ONNX Runtime 1.17.1 + OpenCV 4.9.0).
Build Commands
V0 Baseline:
cl /O2 /EHsc /Fe:generic_infer.exe ^
  main.cpp ONNXInference.cpp ^
  /I "deps\onnxruntime\include" /I "deps\opencv\build\include" ^
  /link /LIBPATH:"deps\onnxruntime\lib" /LIBPATH:"deps\opencv\build\x64\vc16\lib" ^
  onnxruntime.lib opencv_world490.lib
V1 Optimized (build_opt.bat):
cl /O2 /EHsc /Fe:optimized_generic_infer.exe ^
  main_opt.cpp ONNXInference_opt.cpp ^
  /I "deps\onnxruntime\include" /I "deps\opencv\build\include" ^
  /link /LIBPATH:"deps\onnxruntime\lib" /LIBPATH:"deps\opencv\build\x64\vc16\lib" ^
  onnxruntime.lib opencv_world490.lib
V2 Final (build_final_generic_infer.bat):
cl /O2 /EHsc /Fe:final_generic_infer.exe ^
  main_opt_yoloxopt.cpp ONNXInference_opt_yoloxopt.cpp ^
  /I "deps\onnxruntime\include" /I "deps\opencv\build\include" ^
  /link /LIBPATH:"deps\onnxruntime\lib" /LIBPATH:"deps\opencv\build\x64\vc16\lib" ^
  onnxruntime.lib opencv_world490.lib
10 Build Iterations
| # | Build | Result | Key Takeaway |
|---|---|---|---|
| 1 | V0 Baseline | ✓ | First compilation. Deployed to FH-L551: YOLOX 15,742 ms, PP-OCR 18,943 ms. |
| 2 | V0 + OpenVINO EP | ✗ | OpenVINO requires AVX2. SIGILL crash on Atom E3827 (SSE4.2-only). |
| 3 | V0 + INT8 Quantization | ✗ | 2.9% slower — scalar INT8 fallback + de-quantization overhead without AVX2 VNNI. |
| 4 | V0 + Thread Pinning | ✗ | 7.8× worse jitter (260 ms vs 33 ms). Cache thrashing with OS threads on 2-core system. |
| 5 | V1 Optimized | ✓ | Three core optimizations: zero-copy IoBinding, fused preprocessing, hoisted allocation. |
| 6 | V1 + YOLOX Shape Mismatch | ✗ | Hardcoded 4D output shape crashes when loading 3D YOLOX output. Fixed by dynamic inspectIO(). |
| 7 | V1 Fixed | ✓ | Dynamic output shape inference from model metadata. Generic fallback decoder for unknown models. |
| 8 | V1 + YOLOX Decode/NMS | ✓ | Full Megvii-style anchor-free decode across strides {8,16,32} with greedy per-class NMS. |
| 9 | V1 Experiments | ✓ | 5-config benchmark (5,000 COCO images each). ORT_ENABLE_ALL = −9.6% latency. |
| 10 | V2 Final | ✓ | Production build. Binding cache, watchdog, M4 EWMA supervisor, CSV logger. |
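The greedy per-class NMS added in build #8 is not listed in this document; a self-contained sketch of the standard algorithm (the Det struct and function names are illustrative, not the project's exact types) is:

```cpp
#include <algorithm>
#include <vector>

struct Det { float x1, y1, x2, y2, score; int cls; };

// Intersection-over-union of two axis-aligned boxes.
static float iou(const Det& a, const Det& b) {
    const float ix1 = std::max(a.x1, b.x1), iy1 = std::max(a.y1, b.y1);
    const float ix2 = std::min(a.x2, b.x2), iy2 = std::min(a.y2, b.y2);
    const float iw = std::max(0.0f, ix2 - ix1);
    const float ih = std::max(0.0f, iy2 - iy1);
    const float inter = iw * ih;
    const float areaA = (a.x2 - a.x1) * (a.y2 - a.y1);
    const float areaB = (b.x2 - b.x1) * (b.y2 - b.y1);
    return inter / (areaA + areaB - inter + 1e-9f);
}

// Greedy NMS, applied independently per class: walk boxes in descending score
// order, keep a box unless a kept box of the same class overlaps it too much.
std::vector<Det> greedyNms(std::vector<Det> dets, float iou_thresh) {
    std::sort(dets.begin(), dets.end(),
              [](const Det& a, const Det& b) { return a.score > b.score; });
    std::vector<Det> kept;
    for (const Det& d : dets) {
        bool suppressed = false;
        for (const Det& k : kept)
            if (k.cls == d.cls && iou(k, d) > iou_thresh) { suppressed = true; break; }
        if (!suppressed) kept.push_back(d);
    }
    return kept;
}
```

Per-class suppression means overlapping boxes of different classes survive, which is the behavior YOLOX's reference decode expects.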
4. Optimization Details
4.1 Zero-Copy IoBinding (V1)
Before (V0): infer() copies all input data into ONNX Runtime’s internal buffers, then copies output data back out.
After (V1): inferFixed() binds raw float* pointers via Ort::IoBinding. ONNX Runtime reads/writes directly to caller’s memory.
// V2 binding cache: skip re-bind when pointers unchanged
const bool binding_unchanged =
binding_cache_valid_ &&
last_input_ptrs_ == inputPtrs &&
last_output_ptrs_ == outputPtrs;
if (!binding_unchanged) {
binding_->ClearBoundInputs();
binding_->ClearBoundOutputs();
// ... rebind inputs and outputs ...
binding_cache_valid_ = true;
}
session_->Run(Ort::RunOptions{nullptr}, *binding_);
4.2 Fused Preprocessing (V1)
Before (V0) — makeBlobCHW(): 3 separate passes through the image:
- cv::cvtColor (BGR→RGB) — full image scan #1
- convertTo (normalize) — full image scan #2
- cv::split + memcpy (HWC→CHW) — full image scan #3
After (V1) — fillBlobCHW(): Single fused loop. For each pixel: read BGR, swap to RGB, normalize, write directly to CHW plane pointers.
// Single pass: read BGR, write normalized RGB in CHW layout
const uint8_t* ptr = bgr.ptr<uint8_t>(0);
for (int i = 0; i < N; ++i) {
float b = ptr[3*i + 0] * scale;
float g = ptr[3*i + 1] * scale;
float r = ptr[3*i + 2] * scale;
if (do_norm) { r = (r - m0)/s0; g = (g - m1)/s1; b = (b - m2)/s2; }
pR[i] = r; pG[i] = g; pB[i] = b;
}
4.3 Hoisted Buffer Allocation (V1)
Before: std::vector<float> detBlob; declared inside loop → heap alloc/free every frame.
After: Declared before loop with reserve(). Inside loop: clear() resets size without freeing memory.
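The pattern can be sketched as follows (the frame count and blob size are placeholders; the check that the backing storage never moves is the point of the exercise):

```cpp
#include <cstddef>
#include <vector>

// Allocation hoisted out of the frame loop: one resize() up front, then
// clear() + resize() per frame, which resets the size but keeps the capacity
// and therefore the same heap block for the whole run.
bool runFramesWithoutRealloc(int frames, std::size_t blobSize) {
    std::vector<float> detBlob;
    detBlob.resize(blobSize);                // single up-front allocation
    const float* storage = detBlob.data();
    for (int f = 0; f < frames; ++f) {
        detBlob.clear();                     // size -> 0, capacity unchanged
        detBlob.resize(blobSize);            // reuses the same storage
        if (detBlob.data() != storage)
            return false;                    // would indicate a reallocation
    }
    return true;
}
```

Because resize() never exceeds the existing capacity inside the loop, no per-frame malloc/free occurs, which is what keeps the 2 GB heap from fragmenting on a 24/7 system.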
5. OS-Level Tuning
5.1 ONNX Runtime Session Configuration
| Configuration | Mean Latency | vs Baseline | Jitter | Verdict |
|---|---|---|---|---|
| No Graph Opt (Level 0) | 637.4 ms | +3.9% | 48.3 ms | Rejected |
| Extended (Level 2) | 613.7 ms | Baseline | 33.4 ms | Default |
| All Fused (Level 99) | 554.9 ms | −9.6% | 62.4 ms | Best Speed |
| All + Pin Thread | 644.1 ms | +4.9% | 260.1 ms | Rejected |
| No Arena Allocator | 627.2 ms | +2.2% | 44.2 ms | Rejected |
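The winning row of the table maps onto a handful of ONNX Runtime C++ session options. A sketch of that configuration (the env/model names are placeholders; this fragment requires the ONNX Runtime headers and is not standalone):

```cpp
// Session options behind the table above (ONNX Runtime C++ API).
Ort::Env env;
Ort::SessionOptions opts;
opts.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);  // "Level 99"
opts.SetIntraOpNumThreads(2);   // both Atom cores, no affinity pinning
opts.EnableCpuMemArena();       // default; disabling it cost +2.2% latency
Ort::Session session(env, L"model.onnx", opts);
```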
5.2 Memory Architecture
- Arena Allocator: ONNX Runtime pre-reserves a memory pool at startup. Disabling it caused a 2.2% latency increase from repeated OS-level malloc/free calls.
- Hoisted Allocation: Reduced per-frame heap allocations from ~14 to 0.
- Zero-Copy IoBinding: Eliminates internal memcpy — critical when the working set exceeds the L2 cache.
5.3 OS Scheduler Interaction
SetThreadAffinityMask(GetCurrentThread(), 1) was tested and rejected. On a 2-core system, forcing inference to core 0 competes with Windows kernel threads, causing 7.8× worse jitter (260 ms vs 33 ms).
6. Benchmark Results
All experiments were benchmarked on COCO val2017 (5,000 images) using the built-in CSV logger.
| Model | V0 Baseline | V2 Final | Improvement |
|---|---|---|---|
| YOLOX-S (640×640) | 15,742 ms | ~7,771 ms | −50.6% |
| PP-OCR (det + rec) | 18,943 ms | ~10,920 ms | −42.3% |
The benchmark harness records: det_pre_ms, det_run_ms, det_decode_ms, rec_total_ms, total_ms, CER, WER per frame.
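A minimal logger for those per-frame fields could look like the sketch below; the struct layout, column order, and function names are assumptions, not the project's exact implementation:

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Per-frame benchmark record matching the fields named above.
struct FrameStats {
    double det_pre_ms, det_run_ms, det_decode_ms, rec_total_ms, total_ms;
    double cer, wer;
};

// Serialize one record as a CSV row (no trailing newline).
std::string toCsvRow(const FrameStats& s) {
    std::ostringstream os;
    os << s.det_pre_ms << ',' << s.det_run_ms << ',' << s.det_decode_ms << ','
       << s.rec_total_ms << ',' << s.total_ms << ',' << s.cer << ',' << s.wer;
    return os.str();
}

// Append one row to the log file (opened in append mode each call for
// crash-safety on a 24/7 system; batching would be an easy optimization).
void appendRow(const std::string& path, const FrameStats& s) {
    std::ofstream f(path, std::ios::app);
    f << toCsvRow(s) << '\n';
}
```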
7. API Reference
// Construction
ONNXInference engine("model.onnx"); // Default options
ONNXInference engine("model.onnx", customOpts); // Custom config
// Legacy inference (copy-based, V0)
std::vector<Tensor> outputs = engine.infer({inputBlob});
// Zero-copy inference (production, V1+)
engine.prepareBinding();
engine.inferFixed(inputPtrs, inputShapes, outputPtrs, outputShapes);
// Shape management
engine.fixDynamicHW(640, 640);
engine.fixExactInputShape(0, {1, 3, 48, 320});
// Runtime tuning (triggers session rebuild)
engine.setIntraOpThreads(2);
engine.setGraphOptimization(ORT_ENABLE_ALL);
engine.enableArena(true);
engine.rebuild();
8. Deployment
# Install dependencies
.\setup_deps.ps1

# Build production binary
.\build_final_generic_infer.bat

# Package for FH-L551
.\package_deployment.ps1   # Creates OMRON_FHL551_Deployment.zip

# Run inference
final_generic_infer.exe det.onnx rec.onnx --images list.txt --runs 50 --csv results.csv
final_generic_infer.exe yolox_s.onnx --images coco.txt --opt-level 99 --watchdog-ms 5000 --m4-log