Technical Documentation — Generic C++ ONNX Inference Adapter


1. Target Hardware — OMRON FH-L551

| Specification | Value | Implication |
|---|---|---|
| CPU | Intel Atom E3827 (2C/2T, 1.74 GHz) | No hyperthreading; limited parallel compute |
| L2 Cache | 1 MB (shared) | 640×640×3 FP32 image = 4.9 MB — exceeds entire cache |
| Max SIMD | SSE4.2 | No AVX/AVX2 — eliminates OpenVINO, INT8 VNNI, most accelerators |
| RAM | 2 GB DDR3L | Must avoid heap fragmentation on 24/7 systems |
| OS | Windows Embedded | Thread affinity competes with OS kernel threads |

The check_avx.cpp utility uses CPUID to confirm: SSE4.2: YES | AVX: NO | AVX2: NO. This single diagnostic permanently ruled out three optimization paths.

2. Adapter Architecture

Design Principle: The adapter is model-agnostic. It loads any ONNX file, inspects I/O shapes via inspectIO(), and dispatches to the appropriate decoder (PP-OCR, YOLOX, or generic fallback). Unknown models produce diagnostic output instead of crashing.
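The shape-based dispatch can be sketched as follows (the function name, the [1, anchors, 4+1+classes] YOLOX layout check, and the single-channel DBNet map check are illustrative assumptions — the adapter's actual inspectIO() logic is not shown here):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical dispatch on the first output's shape: a 4D single-channel
// tensor suggests a PP-OCR DBNet probability map, a 3D [1, anchors, 85]
// tensor suggests YOLOX (80 classes + 4 box + 1 objectness), and anything
// else falls through to the generic diagnostic decoder instead of crashing.
std::string pickDecoder(const std::vector<int64_t>& out, int64_t numClasses = 80) {
    if (out.size() == 4 && out[1] == 1)              return "ppocr-dbnet";
    if (out.size() == 3 && out[2] == numClasses + 5) return "yolox";
    return "generic";   // dump shapes/stats rather than abort
}
```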

File Structure

FileVersionRole
ONNXInference.h/.cppV0Baseline adapter. Copy-based infer(), OpenVINO EP support (conditional #ifdef).
ONNXInference_opt.h/.cppV1Added inferFixed() zero-copy path using Ort::IoBinding. Removed OpenVINO code.
ONNXInference_opt_yoloxopt.h/.cppV2Added binding cache (binding_cache_valid_) to skip re-binding when pointers are unchanged.
main.cppV0Baseline harness with makeBlobCHW() (3-pass preprocessing), per-frame allocation.
main_opt.cppV1Fused fillBlobCHW(), hoisted allocation, watchdog thread, M4 EWMA supervisor.
main_opt_yoloxopt.cppV2Production harness with all optimizations, full YOLOX decode + NMS, generic fallback.

Inference Pipeline

Image → Letterbox Resize → fillBlobCHW (fused BGR→RGB + normalize + HWC→CHW)
      → inferFixed() (zero-copy IoBinding)
      → Decode (PP-OCR DBNet / YOLOX anchor-free / Generic fallback)
      → [Optional: REC pipeline for text recognition]
      → CSV benchmark log + M4 EWMA compliance check
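The letterbox step in the pipeline reduces to scale/pad arithmetic, sketched below (the helper name and the centered-padding choice are assumptions; the harness's actual resize code is not shown):

```cpp
#include <algorithm>
#include <cassert>

// Letterbox parameters: scale the source to fit the model input while
// preserving aspect ratio, then pad the remainder symmetrically.
struct Letterbox { float scale; int newW, newH, padX, padY; };

Letterbox letterboxParams(int srcW, int srcH, int dstW, int dstH) {
    float s = std::min(dstW / float(srcW), dstH / float(srcH));
    int nw = int(srcW * s), nh = int(srcH * s);
    return { s, nw, nh, (dstW - nw) / 2, (dstH - nh) / 2 };
}
```

For a 1280×720 frame into a 640×640 YOLOX input, this yields a 0.5 scale factor and 140 px of padding above and below the resized image.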

3. Build History & Compilation

All builds use MSVC cl.exe from a Visual Studio Developer Command Prompt. Dependencies are installed via setup_deps.ps1 (ONNX Runtime 1.17.1 + OpenCV 4.9.0).

Build Commands

V0 Baseline:

cl /O2 /EHsc /Fe:generic_infer.exe ^
   main.cpp ONNXInference.cpp ^
   /I "deps\onnxruntime\include" /I "deps\opencv\build\include" ^
   /link /LIBPATH:"deps\onnxruntime\lib" /LIBPATH:"deps\opencv\build\x64\vc16\lib" ^
   onnxruntime.lib opencv_world490.lib

V1 Optimized (build_opt.bat):

cl /O2 /EHsc /Fe:optimized_generic_infer.exe ^
   main_opt.cpp ONNXInference_opt.cpp ^
   /I "deps\onnxruntime\include" /I "deps\opencv\build\include" ^
   /link /LIBPATH:"deps\onnxruntime\lib" /LIBPATH:"deps\opencv\build\x64\vc16\lib" ^
   onnxruntime.lib opencv_world490.lib

V2 Final (build_final_generic_infer.bat):

cl /O2 /EHsc /Fe:final_generic_infer.exe ^
   main_opt_yoloxopt.cpp ONNXInference_opt_yoloxopt.cpp ^
   /I "deps\onnxruntime\include" /I "deps\opencv\build\include" ^
   /link /LIBPATH:"deps\onnxruntime\lib" /LIBPATH:"deps\opencv\build\x64\vc16\lib" ^
   onnxruntime.lib opencv_world490.lib

10 Build Iterations

| # | Build | Result & Key Takeaway |
|---|---|---|
| 1 | V0 Baseline | First compilation. Deployed to FH-L551: YOLOX 15,742 ms, PP-OCR 18,943 ms. |
| 2 | V0 + OpenVINO EP | OpenVINO requires AVX2. SIGILL crash on Atom E3827 (SSE4.2-only). |
| 3 | V0 + INT8 Quantization | 2.9% slower — scalar INT8 fallback + de-quantization overhead without AVX2 VNNI. |
| 4 | V0 + Thread Pinning | 7.8× worse jitter (260 ms vs 33 ms). Cache thrashing with OS threads on 2-core system. |
| 5 | V1 Optimized | Three core optimizations: zero-copy IoBinding, fused preprocessing, hoisted allocation. |
| 6 | V1 + YOLOX Shape Mismatch | Hardcoded 4D output shape crashes when loading 3D YOLOX output. Fixed by dynamic inspectIO(). |
| 7 | V1 Fixed | Dynamic output shape inference from model metadata. Generic fallback decoder for unknown models. |
| 8 | V1 + YOLOX Decode/NMS | Full Megvii-style anchor-free decode across strides {8,16,32} with greedy per-class NMS. |
| 9 | V1 Experiments | 5-config benchmark (5,000 COCO images each). ORT_ENABLE_ALL = −9.6% latency. |
| 10 | V2 Final | Production build. Binding cache, watchdog, M4 EWMA supervisor, CSV logger. |
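The greedy per-class NMS from iteration 8 can be sketched as follows (the Box layout and the suppression loop are illustrative, not the production code; per-class means boxes of different classes never suppress each other):

```cpp
#include <algorithm>
#include <vector>

struct Box { float x1, y1, x2, y2, score; int cls; };

// Intersection-over-union of two axis-aligned boxes.
static float iou(const Box& a, const Box& b) {
    float ix1 = std::max(a.x1, b.x1), iy1 = std::max(a.y1, b.y1);
    float ix2 = std::min(a.x2, b.x2), iy2 = std::min(a.y2, b.y2);
    float inter = std::max(0.f, ix2 - ix1) * std::max(0.f, iy2 - iy1);
    float uni = (a.x2 - a.x1) * (a.y2 - a.y1)
              + (b.x2 - b.x1) * (b.y2 - b.y1) - inter;
    return uni > 0.f ? inter / uni : 0.f;
}

// Greedy NMS: sort by score, keep a box unless an already-kept box of the
// SAME class overlaps it beyond iouThr.
std::vector<Box> nmsPerClass(std::vector<Box> boxes, float iouThr) {
    std::sort(boxes.begin(), boxes.end(),
              [](const Box& a, const Box& b) { return a.score > b.score; });
    std::vector<Box> kept;
    for (const Box& c : boxes) {
        bool suppressed = false;
        for (const Box& k : kept)
            if (k.cls == c.cls && iou(k, c) > iouThr) { suppressed = true; break; }
        if (!suppressed) kept.push_back(c);
    }
    return kept;
}
```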

4. Optimization Details

4.1 Zero-Copy IoBinding (V1)

Before (V0): infer() copies all input data into ONNX Runtime’s internal buffers, then copies output data back out.

After (V1): inferFixed() binds raw float* pointers via Ort::IoBinding. ONNX Runtime reads/writes directly to caller’s memory.

// V2 binding cache: skip re-bind when pointers unchanged
const bool binding_unchanged =
    binding_cache_valid_ &&
    last_input_ptrs_ == inputPtrs &&
    last_output_ptrs_ == outputPtrs;

if (!binding_unchanged) {
    binding_->ClearBoundInputs();
    binding_->ClearBoundOutputs();
    // ... rebind inputs and outputs ...
    binding_cache_valid_ = true;
}
session_->Run(Ort::RunOptions{nullptr}, *binding_);

4.2 Fused Preprocessing (V1)

Before (V0) — makeBlobCHW(): 3 separate passes through the image:

  1. cv::cvtColor (BGR→RGB) — full image scan #1
  2. convertTo (normalize) — full image scan #2
  3. cv::split + memcpy (HWC→CHW) — full image scan #3

After (V1) — fillBlobCHW(): Single fused loop. For each pixel: read BGR, swap to RGB, normalize, write directly to CHW plane pointers.

// Single pass: read BGR, write normalized RGB in CHW layout.
// N = W*H; pR/pG/pB point at the three CHW planes of the output blob;
// scale = 1/255.f; m0..m2 / s0..s2 are per-channel mean/std (when do_norm).
const uint8_t* ptr = bgr.ptr<uint8_t>(0);   // assumes a contiguous cv::Mat
for (int i = 0; i < N; ++i) {
    float b = ptr[3*i + 0] * scale;
    float g = ptr[3*i + 1] * scale;
    float r = ptr[3*i + 2] * scale;
    if (do_norm) { r = (r - m0)/s0; g = (g - m1)/s1; b = (b - m2)/s2; }
    pR[i] = r; pG[i] = g; pB[i] = b;
}

4.3 Hoisted Buffer Allocation (V1)

Before: std::vector<float> detBlob; declared inside loop → heap alloc/free every frame.

After: Declared before loop with reserve(). Inside loop: clear() resets size without freeing memory.
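A minimal sketch of the pattern (buffer size and helper name are illustrative): because `clear()` leaves capacity untouched, `data()` returns the same buffer every frame after the one up-front `reserve()`.

```cpp
#include <cstddef>
#include <vector>

// Hoisted-allocation sketch: reserve once before the frame loop; inside the
// loop, clear() resets size to 0 without freeing, and the subsequent resize()
// refills the same buffer, so no per-frame heap traffic occurs.
bool bufferIsStable(int frames, std::size_t blobSize) {
    std::vector<float> detBlob;
    detBlob.reserve(blobSize);            // single up-front allocation
    const float* p0 = nullptr;
    for (int f = 0; f < frames; ++f) {
        detBlob.clear();                  // size -> 0, capacity retained
        detBlob.resize(blobSize);         // no reallocation after reserve()
        if (f == 0) p0 = detBlob.data();
        if (detBlob.data() != p0) return false;   // buffer moved: failure
    }
    return true;
}
```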

5. OS-Level Tuning

5.1 ONNX Runtime Session Configuration

| Configuration | Mean Latency | vs Baseline | Jitter | Verdict |
|---|---|---|---|---|
| No Graph Opt (Level 0) | 637.4 ms | +3.9% | 48.3 ms | Rejected |
| Extended (Level 2) | 613.7 ms | Baseline | 33.4 ms | Default |
| All Fused (Level 99) | 554.9 ms | −9.6% | 62.4 ms | Best Speed |
| All + Pin Thread | 644.1 ms | +4.9% | 260.1 ms | Rejected |
| No Arena Allocator | 627.2 ms | +2.2% | 44.2 ms | Rejected |

5.2 Memory Architecture

  • Arena Allocator: ONNX Runtime pre-reserves a memory pool at startup. Disabling it caused a 2.2% latency increase from repeated OS-level malloc/free calls.
  • Hoisted Allocation: Reduced per-frame heap allocations from ~14 to 0.
  • Zero-Copy IoBinding: Eliminates internal memcpy — critical when working set exceeds L2 cache.

5.3 OS Scheduler Interaction

SetThreadAffinityMask(GetCurrentThread(), 1) was tested and rejected. On a 2-core system, forcing inference to core 0 competes with Windows kernel threads, causing 7.8× worse jitter (260 ms vs 33 ms).

6. Benchmark Results

All experiments were benchmarked on COCO val2017 (5,000 images) with the built-in CSV logger.

| Model | V0 Baseline | V2 Final | Improvement |
|---|---|---|---|
| YOLOX-S (640×640) | 15,742 ms | ~7,771 ms | −50.6% |
| PP-OCR (det + rec) | 18,943 ms | ~10,920 ms | −42.3% |

The benchmark harness records: det_pre_ms, det_run_ms, det_decode_ms, rec_total_ms, total_ms, CER, WER per frame.
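The M4 EWMA compliance check referenced throughout could look like the following (the smoothing factor, latency budget, and struct name are hypothetical; the harness's actual parameters are not shown):

```cpp
// EWMA latency supervisor sketch: ewma <- alpha*sample + (1-alpha)*ewma.
// update() returns false when the smoothed latency exceeds the budget,
// which a production harness would treat as a compliance breach.
struct EwmaSupervisor {
    double alpha;            // smoothing factor, e.g. 0.2 (assumed)
    double limitMs;          // latency budget in ms (assumed)
    double ewma = -1.0;      // negative sentinel: no sample seen yet
    bool update(double latencyMs) {
        ewma = (ewma < 0) ? latencyMs
                          : alpha * latencyMs + (1.0 - alpha) * ewma;
        return ewma <= limitMs;
    }
};
```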

7. API Reference

// Construction
ONNXInference engine("model.onnx");                // Default options
ONNXInference engine("model.onnx", customOpts);     // Custom config

// Legacy inference (copy-based, V0)
std::vector<Tensor> outputs = engine.infer({inputBlob});

// Zero-copy inference (production, V1+)
engine.prepareBinding();
engine.inferFixed(inputPtrs, inputShapes, outputPtrs, outputShapes);

// Shape management
engine.fixDynamicHW(640, 640);
engine.fixExactInputShape(0, {1, 3, 48, 320});

// Runtime tuning (triggers session rebuild)
engine.setIntraOpThreads(2);
engine.setGraphOptimization(ORT_ENABLE_ALL);
engine.enableArena(true);
engine.rebuild();

8. Deployment

# Install dependencies
.\setup_deps.ps1

# Build production binary
.\build_final_generic_infer.bat

# Package for FH-L551
.\package_deployment.ps1
# Creates OMRON_FHL551_Deployment.zip

# Run inference
final_generic_infer.exe det.onnx rec.onnx --images list.txt --runs 50 --csv results.csv
final_generic_infer.exe yolox_s.onnx --images coco.txt --opt-level 99 --watchdog-ms 5000 --m4-log