Running Deep Learning Models at the Edge with Intel OpenVINO

Running a model in a notebook is easy. Running that same model on a smart camera, an industrial gateway, or a low-power box under a desk is a different job entirely. Latency matters, bandwidth becomes expensive, and privacy requirements often remove the option of shipping every frame to a cloud endpoint. That is the problem OpenVINO was built to solve.

I completed Intel's OpenVINO foundation course back in early 2019 and sat on my notes for two years. I'm finally writing this up because the same question keeps coming back in ML engineering circles: "How do I actually get this model running on something that isn't a cloud GPU?" OpenVINO is one of the more practical answers to that question, especially when your deployment target is Intel hardware.

This post is intentionally framed around OpenVINO 2021.x, because that was the tooling available when I originally wrote these notes. The API names in the examples reflect that era: IECore, load_network, and device strings like MYRIAD. Newer OpenVINO releases have cleaned up and renamed parts of the stack, but the deployment model and trade-offs are still worth understanding.

This is not a course recap. It is the mental model, the rough edges, and the parts of the toolkit that actually matter once you leave the cloud and have to make inference work on constrained hardware.


Why Edge Inference Is a Different Problem

When you're training or experimenting, a cloud GPU is the obvious choice. You need raw throughput, you're iterating fast, and cost-per-run is manageable.

Production is different. Latency matters. Network bandwidth to send frames to a remote inference endpoint becomes a real constraint. Privacy requirements often mean you cannot send video data off-device at all. And for embedded applications such as retail analytics, industrial inspection, or smart cameras, you are working with hardware that costs tens or hundreds of pounds, not a monthly cloud GPU bill.

OpenVINO was built for exactly this context. It's Intel's toolkit for running optimised inference on their silicon: CPUs, integrated GPUs, the Movidius Neural Compute Stick (NCS2), and FPGAs. The pitch is one code path, multiple target devices. In practice, that holds up reasonably well.


The Two-Stage Pipeline

The core mental model is straightforward: you don't feed your trained model directly to the inference runtime. You go through two stages first.

Model Optimizer

In OpenVINO 2021.x, the Model Optimizer is the offline conversion step. You point it at a trained model and it emits two files: a .xml for the network topology and a .bin for the weights. Together, these form the Intermediate Representation (IR) that the Inference Engine consumes.

This is not just a format conversion. The Optimizer performs graph transformations that make the model easier to execute efficiently at inference time. Training-only behaviour is stripped out, compatible operations are fused, and the graph is normalised into the form OpenVINO expects. The result is a leaner model with less runtime overhead.

A basic conversion from a frozen TensorFlow model looks like this:

python mo_tf.py \
  --input_model frozen_inference_graph.pb \
  --tensorflow_use_custom_operations_config ssd_v2_support.json \
  --tensorflow_object_detection_api_pipeline_config pipeline.config \
  --reverse_input_channels \
  --output_dir ./ir_output

The --reverse_input_channels flag handles the BGR/RGB mismatch between OpenCV, which OpenVINO is commonly used with, and TensorFlow-trained models that expect RGB input. You will forget this at least once. The output will look plausible, and your detections will still be wrong in a way that wastes an afternoon.
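
If you do forget it, the runtime fix is a manual channel swap during preprocessing. A minimal sketch, assuming the same OpenCV-based pipeline as the inference example below:

import cv2

# OpenCV loads images in BGR order; if the IR was converted without
# --reverse_input_channels, swap to RGB before feeding the network.
frame = cv2.imread("frame.jpg")
frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)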

FP16 compression and INT8 quantisation

This is where the terminology gets messy, so it is worth being precise. Model Optimizer handles conversion to IR and can also compress weights to FP16, which is often the easiest size reduction you will get. That usually cuts the model footprint roughly in half with little or no noticeable accuracy loss for common computer vision models.
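
The compression itself is a conversion-time flag. Repeating the earlier mo_tf.py command with FP16 output, under the same file-name assumptions as before:

python mo_tf.py \
  --input_model frozen_inference_graph.pb \
  --tensorflow_use_custom_operations_config ssd_v2_support.json \
  --tensorflow_object_detection_api_pipeline_config pipeline.config \
  --reverse_input_channels \
  --data_type FP16 \
  --output_dir ./ir_output_fp16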

INT8 is a separate optimisation step. In the OpenVINO 2021-era toolchain, post-training INT8 quantisation is typically handled after conversion, using a representative dataset to calibrate the model and measure the accuracy drop. That distinction matters because a lot of people, myself included the first time through, assume Model Optimizer is the quantiser. It is not. It prepares the graph; the later optimisation flow is what decides whether INT8 is a safe trade.

In my own SSD MobileNetV2 tests, INT8 on CPU gave a meaningful latency improvement with a negligible accuracy hit. That is a useful pattern, not a universal promise. Detection models with awkward layers or tighter accuracy budgets need profiling, not optimism.
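
In the 2021-era toolchain, that post-training step was the Post-Training Optimization Tool (POT), driven by a JSON config pointing at the IR files and a calibration data source. A rough sketch of a DefaultQuantization config; the exact keys and file paths here are illustrative and varied across releases, so check the docs for your exact version:

{
  "model": {
    "model_name": "ssd_mobilenet_v2",
    "model": "model.xml",
    "weights": "model.bin"
  },
  "engine": {
    "type": "simplified",
    "data_source": "./calibration_images"
  },
  "compression": {
    "target_device": "CPU",
    "algorithms": [
      {
        "name": "DefaultQuantization",
        "params": { "preset": "performance", "stat_subset_size": 300 }
      }
    ]
  }
}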

Inference Engine

The Inference Engine is a C++ library — but it ships Python bindings that are genuinely usable. Here's what a basic inference loop looks like:

from openvino.inference_engine import IECore
import cv2
 
# Load network
ie = IECore()
net = ie.read_network(model="model.xml", weights="model.bin")
exec_net = ie.load_network(network=net, device_name="CPU")
 
# Get input/output layer names
input_blob = next(iter(net.input_info))
output_blob = next(iter(net.outputs))
 
# Prepare input
frame = cv2.imread("frame.jpg")  # BGR order; fine here, since the IR was converted with --reverse_input_channels
n, c, h, w = net.input_info[input_blob].input_data.shape
resized = cv2.resize(frame, (w, h))
input_image = resized.transpose((2, 0, 1))          # HWC → CHW
input_image = input_image.reshape((n, c, h, w))     # Add batch dim
 
# Run inference
result = exec_net.infer(inputs={input_blob: input_image})
detections = result[output_blob]
 
# Parse detections
for detection in detections[0][0]:
    confidence = float(detection[2])
    if confidence > 0.5:
        xmin = int(detection[3] * frame.shape[1])
        ymin = int(detection[4] * frame.shape[0])
        xmax = int(detection[5] * frame.shape[1])
        ymax = int(detection[6] * frame.shape[0])
        cv2.rectangle(frame, (xmin, ymin), (xmax, ymax), (0, 255, 0), 2)
 
cv2.imwrite("output.jpg", frame)

One important nuance: the detection output format above is common for SSD-style object detection models, but it is not universal. If your model returns [image_id, class_label, confidence, x_min, y_min, x_max, y_max], those coordinates are normalised to the [0, 1] range and need to be scaled back to pixel space using the original frame dimensions. That caught me out the first time I looked at raw outputs and saw values like 0.421.

For production use, you usually want async inference rather than the synchronous infer() call above. The async API uses request IDs to pipeline work across frames, which gives you much better throughput on a live video stream:

# Async inference: pipeline two requests across consecutive frames.
# num_requests defaults to 1, so request_id=1 would fail without it.
exec_net = ie.load_network(network=net, device_name="CPU", num_requests=2)

# frame_0 and frame_1 are preprocessed NCHW arrays, as in the
# synchronous example above.
exec_net.start_async(request_id=0, inputs={input_blob: frame_0})
exec_net.start_async(request_id=1, inputs={input_blob: frame_1})

# wait(-1) blocks until the request completes and returns 0 on success
if exec_net.requests[0].wait(-1) == 0:
    result = exec_net.requests[0].outputs[output_blob]
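
In a real video loop you rotate request IDs: submit frame N on one request while collecting the result of frame N-1 from the other, so the accelerator stays busy while the CPU handles capture and decoding. The number of requests you create with num_requests caps how many frames can be in flight at once.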

Hardware Targeting

This is where OpenVINO earns its keep if you are standardising on Intel hardware. Switching target devices is often a one-line change:

exec_net = ie.load_network(network=net, device_name="MYRIAD")  # NCS2
exec_net = ie.load_network(network=net, device_name="GPU")      # Intel integrated GPU
exec_net = ie.load_network(network=net, device_name="CPU")      # Any Intel CPU

The heterogeneous execution mode is more interesting. HETERO:FPGA,CPU will run the model primarily on FPGA and fall back to CPU for operations the FPGA plugin does not support. In theory this is seamless. In practice, moving tensors between devices is not free, so you need to profile before assuming the hybrid path is faster.
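
Before committing to a hybrid device string, it is worth asking the runtime what each plugin can actually execute. A short sketch using the 2021.x API; the HETERO line assumes an FPGA plugin is installed on the box:

# query_network returns a {layer_name: device} mapping, which shows
# which layers a given plugin supports for this specific network.
supported_layers = ie.query_network(network=net, device_name="CPU")

# Heterogeneous execution: prefer the FPGA, fall back to CPU for
# any operation the FPGA plugin cannot handle.
exec_net = ie.load_network(network=net, device_name="HETERO:FPGA,CPU")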

A few hardware-specific gotchas worth knowing:

  • NCS2 does not support INT8. FP16 is the right quantisation target if you're deploying to Movidius hardware.
  • Integrated GPU supports FP16 but not every layer mix you will see in the wild. Check the supported-layer documentation for the exact OpenVINO release you are deploying before committing to a model architecture.
  • CPU is the most permissive target and the right place to start. Get your pipeline working there before optimising for specific hardware.
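
A quick sanity check before any of this: ask the runtime which device plugins it can actually see on the machine. In the 2021.x Python API that is a one-liner:

# Lists detected device plugins, e.g. ['CPU', 'GPU', 'MYRIAD'],
# depending on the hardware attached and drivers installed.
print(ie.available_devices)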

What I'd Use It For (and What I Wouldn't)

OpenVINO makes sense when you're targeting Intel hardware specifically — which describes a large fraction of embedded and edge compute deployments. Industrial systems, smart cameras, retail analytics: a lot of this runs on Atom or Core-class CPUs with an NCS2 on USB for acceleration.

It is a weak fit if you are deploying to ARM-heavy edge hardware such as Raspberry Pi or Jetson-class devices. TF Lite or ONNX Runtime usually match that world better. And if your inference runs primarily in the cloud on non-Intel accelerators, OpenVINO is rarely the optimisation target you should reach for first.

The Open Model Zoo is genuinely useful. Intel maintains a library of pre-trained and pre-optimised models for common tasks such as person detection, vehicle detection, face recognition, and licence plate recognition. For a lot of practical CV applications, you do not need to train a model at all. You need a decent model that already exists, and a deployment path that does not turn your hardware budget into a science project.
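
Fetching one of those models is a single command with the zoo's downloader script, which ships with the toolkit. A sketch; the model name here is one of the zoo's stock person detectors, and the script's location varies by install:

# Downloads the pre-trained (and, for Intel models, pre-converted)
# IR files for the named model into ./models.
python downloader.py --name person-detection-retail-0013 --output_dir ./models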



Final Takeaway

What I like about OpenVINO is that it solves the right problem. It is not trying to be a universal deep learning framework. It is a deployment toolkit for getting models onto Intel hardware with less friction than stitching the pipeline together yourself.

If your workload is computer vision, your deployment target is Intel silicon, and edge constraints are real, OpenVINO is worth the time. The practical win is not just lower latency. It is having a deployment path that acknowledges bandwidth, privacy, and hardware cost as first-class concerns instead of pretending every model lives next to an A100 forever.