Technical Documentation — Generic C++ ONNX Inference Adapter


1. Target Hardware — OMRON FH-L551

| Specification | Value | Implication |
|---|---|---|
| CPU | Intel Atom E3827 (2C/2T, 1.74 GHz) | No hyperthreading; limited parallel compute |
| L2 Cache | 1 MB (shared) | 640×640×3 FP32 image = 4.9 MB — exceeds entire cache |
| Max SIMD | SSE4.2 | No AVX/AVX2 — eliminates OpenVINO, INT8 VNNI, most accelerators |
| RAM | 2 GB DDR3L | Must avoid heap fragmentation on 24/7 systems |
| OS | Windows Embedded | Thread affinity competes with OS kernel threads |

The check_avx.cpp utility uses CPUID to confirm: SSE4.2: YES | AVX: NO | AVX2: NO. This single diagnostic permanently ruled out three optimization paths.

2. Adapter Architecture

Design Principle: The adapter is model-agnostic. It loads any ONNX file, inspects I/O shapes via inspectIO(), and dispatches to the appropriate decoder (PP-OCR, YOLOX, or generic fallback). Unknown models produce diagnostic output instead of crashing.
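The shape-based dispatch can be sketched as follows (the function name, the [1, anchors, 4+1+classes] YOLOX layout check, and the single-channel DBNet map check are illustrative assumptions — the adapter's actual inspectIO() logic is not shown here):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical dispatch on the first output's shape: a 4D single-channel
// tensor suggests a PP-OCR DBNet probability map, a 3D [1, anchors, 85]
// tensor suggests YOLOX (80 classes + 4 box + 1 objectness), and anything
// else falls through to the generic diagnostic decoder instead of crashing.
std::string pickDecoder(const std::vector<int64_t>& out, int64_t numClasses = 80) {
    if (out.size() == 4 && out[1] == 1)              return "ppocr-dbnet";
    if (out.size() == 3 && out[2] == numClasses + 5) return "yolox";
    return "generic";   // dump shapes/stats rather than abort
}
```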

File Structure

FileVersionRole
ONNXInference.h/.cppV0Baseline adapter. Copy-based infer(), OpenVINO EP support (conditional #ifdef).
ONNXInference_opt.h/.cppV1Added inferFixed() zero-copy path using Ort::IoBinding. Removed OpenVINO code.
ONNXInference_opt_yoloxopt.h/.cppV2Added binding cache (binding_cache_valid_) to skip re-binding when pointers are unchanged.
main.cppV0Baseline harness with makeBlobCHW() (3-pass preprocessing), per-frame allocation.
main_opt.cppV1Fused fillBlobCHW(), hoisted allocation, watchdog thread, M4 EWMA supervisor.
main_opt_yoloxopt.cppV2Production harness with all optimizations, full YOLOX decode + NMS, generic fallback.

Inference Pipeline

Image → Letterbox Resize → fillBlobCHW (fused BGR→RGB + normalize + HWC→CHW)
      → inferFixed() (zero-copy IoBinding)
      → Decode (PP-OCR DBNet / YOLOX anchor-free / Generic fallback)
      → [Optional: REC pipeline for text recognition]
      → CSV benchmark log + M4 EWMA compliance check
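The letterbox step in the pipeline reduces to scale/pad arithmetic, sketched below (the helper name and the centered-padding choice are assumptions; the harness's actual resize code is not shown):

```cpp
#include <algorithm>
#include <cassert>

// Letterbox parameters: scale the source to fit the model input while
// preserving aspect ratio, then pad the remainder symmetrically.
struct Letterbox { float scale; int newW, newH, padX, padY; };

Letterbox letterboxParams(int srcW, int srcH, int dstW, int dstH) {
    float s = std::min(dstW / float(srcW), dstH / float(srcH));
    int nw = int(srcW * s), nh = int(srcH * s);
    return { s, nw, nh, (dstW - nw) / 2, (dstH - nh) / 2 };
}
```

For a 1280×720 frame into a 640×640 YOLOX input, this yields a 0.5 scale factor and 140 px of padding above and below the resized image.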

3. Build History & Compilation

All builds use MSVC cl.exe from a Visual Studio Developer Command Prompt. Dependencies are installed via setup_deps.ps1 (ONNX Runtime 1.17.1 + OpenCV 4.9.0).

Build Commands

V0 Baseline:

cl /O2 /EHsc /Fe:generic_infer.exe ^
   main.cpp ONNXInference.cpp ^
   /I "deps\onnxruntime\include" /I "deps\opencv\build\include" ^
   /link /LIBPATH:"deps\onnxruntime\lib" /LIBPATH:"deps\opencv\build\x64\vc16\lib" ^
   onnxruntime.lib opencv_world490.lib

V1 Optimized (build_opt.bat):

cl /O2 /EHsc /Fe:optimized_generic_infer.exe ^
   main_opt.cpp ONNXInference_opt.cpp ^
   /I "deps\onnxruntime\include" /I "deps\opencv\build\include" ^
   /link /LIBPATH:"deps\onnxruntime\lib" /LIBPATH:"deps\opencv\build\x64\vc16\lib" ^
   onnxruntime.lib opencv_world490.lib

V2 Final (build_final_generic_infer.bat):

cl /O2 /EHsc /Fe:final_generic_infer.exe ^
   main_opt_yoloxopt.cpp ONNXInference_opt_yoloxopt.cpp ^
   /I "deps\onnxruntime\include" /I "deps\opencv\build\include" ^
   /link /LIBPATH:"deps\onnxruntime\lib" /LIBPATH:"deps\opencv\build\x64\vc16\lib" ^
   onnxruntime.lib opencv_world490.lib

10 Build Iterations

| # | Build | Result & Key Takeaway |
|---|---|---|
| 1 | V0 Baseline | First compilation. Deployed to FH-L551: YOLOX 15,742 ms, PP-OCR 18,943 ms. |
| 2 | V0 + OpenVINO EP | OpenVINO requires AVX2. SIGILL crash on Atom E3827 (SSE4.2-only). |
| 3 | V0 + INT8 Quantization | 2.9% slower — scalar INT8 fallback + de-quantization overhead without AVX2 VNNI. |
| 4 | V0 + Thread Pinning | 7.8× worse jitter (260 ms vs 33 ms). Cache thrashing with OS threads on 2-core system. |
| 5 | V1 Optimized | Three core optimizations: zero-copy IoBinding, fused preprocessing, hoisted allocation. |
| 6 | V1 + YOLOX Shape Mismatch | Hardcoded 4D output shape crashes when loading 3D YOLOX output. Fixed by dynamic inspectIO(). |
| 7 | V1 Fixed | Dynamic output shape inference from model metadata. Generic fallback decoder for unknown models. |
| 8 | V1 + YOLOX Decode/NMS | Full Megvii-style anchor-free decode across strides {8,16,32} with greedy per-class NMS. |
| 9 | V1 Experiments | 5-config benchmark (5,000 COCO images each). ORT_ENABLE_ALL = −9.6% latency. |
| 10 | V2 Final | Production build. Binding cache, watchdog, M4 EWMA supervisor, CSV logger. |
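The greedy per-class NMS from iteration 8 can be sketched as follows (the Box layout and the suppression loop are illustrative, not the production code; per-class means boxes of different classes never suppress each other):

```cpp
#include <algorithm>
#include <vector>

struct Box { float x1, y1, x2, y2, score; int cls; };

// Intersection-over-union of two axis-aligned boxes.
static float iou(const Box& a, const Box& b) {
    float ix1 = std::max(a.x1, b.x1), iy1 = std::max(a.y1, b.y1);
    float ix2 = std::min(a.x2, b.x2), iy2 = std::min(a.y2, b.y2);
    float inter = std::max(0.f, ix2 - ix1) * std::max(0.f, iy2 - iy1);
    float uni = (a.x2 - a.x1) * (a.y2 - a.y1)
              + (b.x2 - b.x1) * (b.y2 - b.y1) - inter;
    return uni > 0.f ? inter / uni : 0.f;
}

// Greedy NMS: sort by score, keep a box unless an already-kept box of the
// SAME class overlaps it beyond iouThr.
std::vector<Box> nmsPerClass(std::vector<Box> boxes, float iouThr) {
    std::sort(boxes.begin(), boxes.end(),
              [](const Box& a, const Box& b) { return a.score > b.score; });
    std::vector<Box> kept;
    for (const Box& c : boxes) {
        bool suppressed = false;
        for (const Box& k : kept)
            if (k.cls == c.cls && iou(k, c) > iouThr) { suppressed = true; break; }
        if (!suppressed) kept.push_back(c);
    }
    return kept;
}
```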

4. Optimization Details

4.1 Zero-Copy IoBinding (V1)

Before (V0): infer() copies all input data into ONNX Runtime’s internal buffers, then copies output data back out.

After (V1): inferFixed() binds raw float* pointers via Ort::IoBinding. ONNX Runtime reads/writes directly to caller’s memory.

// V2 binding cache: skip re-bind when pointers unchanged
const bool binding_unchanged =
    binding_cache_valid_ &&
    last_input_ptrs_ == inputPtrs &&
    last_output_ptrs_ == outputPtrs;

if (!binding_unchanged) {
    binding_->ClearBoundInputs();
    binding_->ClearBoundOutputs();
    // ... rebind inputs and outputs ...
    binding_cache_valid_ = true;
}
session_->Run(Ort::RunOptions{nullptr}, *binding_);

4.2 Fused Preprocessing (V1)

Before (V0) — makeBlobCHW(): 3 separate passes through the image:

  1. cv::cvtColor (BGR→RGB) — full image scan #1
  2. convertTo (normalize) — full image scan #2
  3. cv::split + memcpy (HWC→CHW) — full image scan #3

After (V1) — fillBlobCHW(): Single fused loop. For each pixel: read BGR, swap to RGB, normalize, write directly to CHW plane pointers.

// Single pass: read BGR, write normalized RGB in CHW layout.
// N = W*H; pR/pG/pB point at the three CHW planes of the output blob;
// scale = 1/255.f; m0..m2 / s0..s2 are per-channel mean/std (when do_norm).
const uint8_t* ptr = bgr.ptr<uint8_t>(0);   // assumes a contiguous cv::Mat
for (int i = 0; i < N; ++i) {
    float b = ptr[3*i + 0] * scale;
    float g = ptr[3*i + 1] * scale;
    float r = ptr[3*i + 2] * scale;
    if (do_norm) { r = (r - m0)/s0; g = (g - m1)/s1; b = (b - m2)/s2; }
    pR[i] = r; pG[i] = g; pB[i] = b;
}

4.3 Hoisted Buffer Allocation (V1)

Before: std::vector<float> detBlob; declared inside loop → heap alloc/free every frame.

After: Declared before loop with reserve(). Inside loop: clear() resets size without freeing memory.
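A minimal sketch of the pattern (buffer size and helper name are illustrative): because `clear()` leaves capacity untouched, `data()` returns the same buffer every frame after the one up-front `reserve()`.

```cpp
#include <cstddef>
#include <vector>

// Hoisted-allocation sketch: reserve once before the frame loop; inside the
// loop, clear() resets size to 0 without freeing, and the subsequent resize()
// refills the same buffer, so no per-frame heap traffic occurs.
bool bufferIsStable(int frames, std::size_t blobSize) {
    std::vector<float> detBlob;
    detBlob.reserve(blobSize);            // single up-front allocation
    const float* p0 = nullptr;
    for (int f = 0; f < frames; ++f) {
        detBlob.clear();                  // size -> 0, capacity retained
        detBlob.resize(blobSize);         // no reallocation after reserve()
        if (f == 0) p0 = detBlob.data();
        if (detBlob.data() != p0) return false;   // buffer moved: failure
    }
    return true;
}
```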

5. OS-Level Tuning

5.1 ONNX Runtime Session Configuration

| Configuration | Mean Latency | vs Baseline | Jitter | Verdict |
|---|---|---|---|---|
| No Graph Opt (Level 0) | 637.4 ms | +3.9% | 48.3 ms | Rejected |
| Extended (Level 2) | 613.7 ms | Baseline | 33.4 ms | Default |
| All Fused (Level 99) | 554.9 ms | −9.6% | 62.4 ms | Best Speed |
| All + Pin Thread | 644.1 ms | +4.9% | 260.1 ms | Rejected |
| No Arena Allocator | 627.2 ms | +2.2% | 44.2 ms | Rejected |

5.2 Memory Architecture

  • Arena Allocator: ONNX Runtime pre-reserves a memory pool at startup. Disabling it caused a 2.2% latency increase from repeated OS-level malloc/free calls.
  • Hoisted Allocation: Reduced per-frame heap allocations from ~14 to 0.
  • Zero-Copy IoBinding: Eliminates internal memcpy — critical when working set exceeds L2 cache.

5.3 OS Scheduler Interaction

SetThreadAffinityMask(GetCurrentThread(), 1) was tested and rejected. On a 2-core system, forcing inference to core 0 competes with Windows kernel threads, causing 7.8× worse jitter (260 ms vs 33 ms).

6. Benchmark Results

All experiments were benchmarked on COCO val2017 (5,000 images) with the built-in CSV logger.

| Model | V0 Baseline | V2 Final | Improvement |
|---|---|---|---|
| YOLOX-S (640×640) | 15,742 ms | ~7,771 ms | −50.6% |
| PP-OCR (det + rec) | 18,943 ms | ~10,920 ms | −42.3% |

The benchmark harness records: det_pre_ms, det_run_ms, det_decode_ms, rec_total_ms, total_ms, CER, WER per frame.
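The M4 EWMA compliance check referenced throughout could look like the following (the smoothing factor, latency budget, and struct name are hypothetical; the harness's actual parameters are not shown):

```cpp
// EWMA latency supervisor sketch: ewma <- alpha*sample + (1-alpha)*ewma.
// update() returns false when the smoothed latency exceeds the budget,
// which a production harness would treat as a compliance breach.
struct EwmaSupervisor {
    double alpha;            // smoothing factor, e.g. 0.2 (assumed)
    double limitMs;          // latency budget in ms (assumed)
    double ewma = -1.0;      // negative sentinel: no sample seen yet
    bool update(double latencyMs) {
        ewma = (ewma < 0) ? latencyMs
                          : alpha * latencyMs + (1.0 - alpha) * ewma;
        return ewma <= limitMs;
    }
};
```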

7. API Reference

// Construction
ONNXInference engine("model.onnx");                // Default options
ONNXInference engine("model.onnx", customOpts);     // Custom config

// Legacy inference (copy-based, V0)
std::vector<Tensor> outputs = engine.infer({inputBlob});

// Zero-copy inference (production, V1+)
engine.prepareBinding();
engine.inferFixed(inputPtrs, inputShapes, outputPtrs, outputShapes);

// Shape management
engine.fixDynamicHW(640, 640);
engine.fixExactInputShape(0, {1, 3, 48, 320});

// Runtime tuning (triggers session rebuild)
engine.setIntraOpThreads(2);
engine.setGraphOptimization(ORT_ENABLE_ALL);
engine.enableArena(true);
engine.rebuild();

8. Deployment

# Install dependencies
.\setup_deps.ps1

# Build production binary
.\build_final_generic_infer.bat

# Package for FH-L551
.\package_deployment.ps1
# Creates OMRON_FHL551_Deployment.zip

# Run inference
final_generic_infer.exe det.onnx rec.onnx --images list.txt --runs 50 --csv results.csv
final_generic_infer.exe yolox_s.onnx --images coco.txt --opt-level 99 --watchdog-ms 5000 --m4-log