WWT Research • Applied Research Report
May 16, 2025 • 22-minute read

Automating Label Scanning at Scale with AI at the Edge

Automating label scanning at scale in WWT warehouses posed significant challenges. Our AI-powered solution, leveraging edge computing, computer vision and LLMs, achieved real-time, high-quality label detection and extraction during the pilot program. This system promises improved efficiency by reducing manual reconciliation time from hours to minutes. Future enhancements can enable even greater performance gains.

In this report

  1. Automating label scanning at scale: An AI-powered solution
    1. Challenges in automating label detection
    2. Edge-based label detection and processing
    3. Text extraction using multimodal LLM
    4. Handheld app for floor staff
    5. Lessons learned: What didn't work
    6. Performance gains and future work
  2. A deep dive into the core components
    1. 1. Edge App on NVIDIA Jetson AGX Orin
    2. 2. Node.js text extraction microservice
    3. 3. Swift-based handheld app for manual capture
    4. Key lessons learned
  3. Training YOLO models on custom datasets: A practical guide for real-world applications
    1. Annotating your own dataset with Label Studio
    2. Setting up your environment
    3. Preparing your custom dataset
    4. Training your YOLO model
    5. Evaluating model performance
    6. Exporting your model for deployment
    7. Running inference
    8. Strategies for enhancing model performance
  4. Conclusion

Automating label scanning at scale: An AI-powered solution

WWT warehouses receive hundreds of pallets with thousands of boxes daily. Efficiently processing these shipments is critical, but automating label scanning has long been a challenge. Previous vendor solutions fell short of full automation, primarily due to difficulties in detecting and extracting only the relevant labels from moving boxes. To solve this, we designed an intelligent, real-time label scanning system leveraging edge computing, computer vision and multimodal LLMs.

This report walks through our approach, highlighting the core challenges faced and how we solved them using a modular architecture powered by NVIDIA Jetson AGX Orin, a custom-trained YOLO model, GPT-4o and a message queue.

Process flow diagram of our modular architecture powered by NVIDIA Jetson AGX Orin, a custom-trained YOLO model, GPT-4o and a Message Queue.

Challenges in automating label detection

The biggest hurdle was making sure we captured only high-quality images of relevant labels from a moving conveyor belt. Traditional methods captured unnecessary frames, requiring extra processing to remove irrelevant content. We needed an efficient solution that could detect labels in real-time and extract structured information without human intervention.

Our primary challenges were:

  1. Detecting labels of interest: Avoiding the capture of unnecessary frames and producing high-quality label images.
  2. Processing only relevant boxes: Using an IR sensor to trigger image capture only when a box is within proximity.
  3. High image quality: Using Laplacian-based sharpness checks to filter out blurry images.
  4. Providing real-time operator feedback: Using GPIO-controlled LED indicators to signal scan success or failure.
  5. Enabling floor staff to contribute: Developing a mobile app for warehouse workers to manually capture labels when needed.


Edge-based label detection and processing

Given the need for real-time performance, we deployed the solution on an NVIDIA Jetson AGX Orin for high-performance AI compute at 275 TOPS. This edge device runs our computer vision pipeline, leveraging a custom-trained YOLOv8n model for label detection. The model classifies labels into predefined categories with confidence scores. If the score meets our threshold, we validate image quality using OpenCV's Laplacian method.

The Laplacian variance method measures image sharpness by detecting edges and rapid changes in pixel intensity. Higher variance values indicate a sharper image with more defined edges.
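
For reference, a minimal sharpness check along these lines can be written with OpenCV; the threshold below is illustrative and would be tuned per camera and lighting setup.

import cv2

def is_sharp(image_bgr, threshold=120.0):
    # Convert to grayscale, apply the Laplacian and use its variance as the sharpness score.
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var() >= threshold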

The workflow on the edge device includes:

  • Live video feed & display: Two real-time views of the conveyor feed and the last successful label scan.
  • IR sensor-triggered processing: The sensor detects incoming boxes and activates the capture process.
  • Image quality filtering: Only high-resolution, sharp images are stored.
  • Message dispatch via message queue: If an image meets quality standards, metadata (image path, label details, confidence scores) is sent to the message queue.
  • Operator feedback via LEDs: A green LED signals a successful scan; a red LED indicates failure.

Text extraction using multimodal LLM

Once the relevant labels are captured, we extract structured text data using GPT-4o, a Large Multimodal Model (LMM) that can process and understand information from multiple sources (e.g., text, images, audio and video). A secondary application, written in Node.js, listens to the message queue for new label messages. The processing steps include:

  • Downloading the image: Fetching the high-quality label image from storage.
  • Category-specific prompting: Tailoring prompts based on the label category to improve accuracy.
  • Structured JSON output: Formatting extracted key-value pairs into a standard schema.
  • Publishing to a message queue: Making the extracted data available to downstream APIs for reconciliation with packing slips and ERP systems.

With this method, we achieved a remarkable 99 percent accuracy in text extraction when provided with high-quality images and precise prompting. The structured JSON output eliminates inconsistencies for seamless integration with enterprise systems.

Handheld app for floor staff

To further enhance efficiency, we developed a Swift-based mobile app that allows warehouse workers to capture label images manually. The app:

  • Uses the device camera to capture images.
  • Uses API calls to GPT-4o or a similar multimodal model to extract structured text.
  • Publishes extracted data to a message queue (RabbitMQ, Kafka, etc.), making it instantly available for processing.

This handheld solution extends the reach of our automated system, enabling multiple staff members to contribute to label processing when automated capture isn't feasible.

Lessons learned: What didn't work

Building an intelligent label-scanning solution came with a fair share of trial and error. Early iterations helped us uncover some valuable lessons:

Classic computer vision fell short

We initially used OpenCV techniques for box detection and tried to infer label regions by detecting barcodes with Pyzbar. While theoretically feasible, the approach failed in real-world conditions. Not all labels had barcodes, and even when present, barcodes were often incomplete or poorly positioned. The system either missed labels entirely or captured irrelevant regions, leading to a flood of noisy data and excessive post-processing. 

We also tried using PyTesseract for direct OCR. While it was able to detect text from images, the output lacked structure and formatting. To convert that into a clean, structured JSON format suitable for downstream APIs, we found ourselves writing brittle rule-based scripts. Worse, the text output occasionally included special characters or artifacts, which reduced reliability. Compared to a mature multimodal model like GPT-4o, PyTesseract required significantly more post-processing and still produced inconsistent results.

Offloading all intelligence to the edge wasn't practical

Our first instinct was to run the entire processing pipeline — including the call to the multimodal model — on the Jetson edge device. While technically possible, this introduced latency and made it hard to parallelize workloads. It also risked blocking the pipeline under load.

To address this, we transitioned to a more robust publish-subscribe (pub/sub) architecture using RabbitMQ. The edge device's responsibility was limited to capturing and validating label images. Downstream applications then picked up these events asynchronously and handled text extraction. This separation improved scalability, reduced failure points and made retries trivial when something failed.

These learnings shaped our final architecture: lean, modular and built for resilience in noisy, fast-paced warehouse environments.

Performance gains and future work

By implementing this end-to-end system, we significantly improved warehouse efficiency. Previously, manual label reconciliation took up to two hours per batch; our solution now reduces this to one minute per package. The modular architecture also allows for future enhancements, such as integrating additional sensors, optimizing YOLO models and expanding handheld app capabilities.

A deep dive into the core components

Shifting from architecture to implementation, let's walk through the core components that bring this system to life:

  1. An edge application powered by NVIDIA Jetson AGX Orin handles image acquisition, quality filtering and label detection using YOLO.
  2. A Node.js microservice, responsible for extracting structured text from label crops using a mix of OCR and multimodal LLM-based reasoning.
  3. A Swift-based iOS App, designed for warehouse operators to manually capture label images when needed for fallback coverage.

Each of these components plays a specific role in delivering accuracy, reliability and real-time performance. But the solution's real strength lies in how these components work together: combining edge inference, server-side logic and human-in-the-loop systems to create an end-to-end pipeline optimized for scale.

1. Edge App on NVIDIA Jetson AGX Orin

At the heart of our system is the edge application deployed on an NVIDIA Jetson AGX Orin module. This is where real-time intelligence meets physical hardware. The app acts as the brain on the edge: acquiring video streams from multiple Basler IP cameras, filtering for image quality and identifying package labels for downstream processing.

System initialization & configuration

The application is designed to be modular and configuration-driven. All camera parameters, processing thresholds (sharpness, brightness, motion blur), and output settings are defined in an external configuration file. This approach makes the system easy to deploy across staging, testing, and production environments while also allowing quick tuning in response to real-world conditions.

At startup, the app initializes each camera interface based on these config values. This includes resolution settings, exposure levels, frame rate caps and network options so each camera is optimized for the lighting and conveyor speed at its location.
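
Purely as an illustration (the real file and its keys are not reproduced here), a configuration along these lines could express the per-camera settings and quality thresholds described above:

# hypothetical edge-app configuration; keys and values are illustrative
cameras:
  - id: cam-01
    resolution: [1920, 1080]
    exposure_us: 5000
    max_fps: 30
thresholds:
  sharpness_min: 120.0
  brightness_range: [60, 200]
  contrast_min: 30.0
  motion_blur_max: 4.0
output:
  blob_bucket: warehouse-labels   # illustrative bucket name
  queue_name: label-events        # illustrative queue name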

IR sensor-driven capture

Instead of relying on continuous frame capture, the system leverages IR break-beam sensors placed along the conveyor belt. When a package passes through a sensor's beam, it triggers an image capture event within the app.

This tight coupling of sensor events with frame selection serves two purposes:

  • Reducing system load by limiting capture to meaningful moments.
  • Providing spatial consistency, capturing packages only when they are centered in the camera's field of view.
Sample script for IR sensor-driven capture.
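
As a rough illustration of that trigger loop, the sketch below assumes the Jetson.GPIO library, a single OpenCV-readable camera source and illustrative pin numbers; the production system reads from Basler IP cameras and hands each burst to the quality checks described in the next section.

import cv2
import Jetson.GPIO as GPIO

IR_SENSOR_PIN = 18   # illustrative BOARD pin for the break-beam sensor
BURST_SIZE = 5       # number of frames captured per trigger

GPIO.setmode(GPIO.BOARD)
GPIO.setup(IR_SENSOR_PIN, GPIO.IN)

camera = cv2.VideoCapture(0)  # illustrative camera source

def capture_burst(num_frames=BURST_SIZE):
    # Grab a short burst of frames once the beam is broken.
    frames = []
    for _ in range(num_frames):
        ok, frame = camera.read()
        if ok:
            frames.append(frame)
    return frames

try:
    while True:
        # Block until a falling edge: the beam is broken when a box passes through.
        GPIO.wait_for_edge(IR_SENSOR_PIN, GPIO.FALLING)
        burst = capture_burst()
        # Hand the burst to the image quality pipeline described in the next section.
        print(f"Captured {len(burst)} frames on IR trigger")
finally:
    camera.release()
    GPIO.cleanup()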

Real-time image post-processing

Upon trigger, a short burst of frames is captured. The app then analyzes each frame to determine which image best represents the label, using a lightweight image quality assessment pipeline. The checks include:

  • Sharpness, calculated via the variance of the Laplacian.
  • Brightness, compared against dynamic thresholds to account for changing lighting.
  • Contrast, to avoid faded or overly dark images.
  • Motion blur, estimated by analyzing directional edge patterns.

Only the highest-scoring image that passes all checks moves forward for label detection.

Sample script for real-time image post-processing.
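
A simplified sketch of these checks is shown below. The thresholds are illustrative (in the production app they come from the configuration file), and the motion-blur estimate is only a rough directional-gradient proxy for the approach described above.

import cv2
import numpy as np

# Illustrative thresholds; in the production app these come from the configuration file.
SHARPNESS_MIN = 120.0
BRIGHTNESS_RANGE = (60, 200)
CONTRAST_MIN = 30.0
MOTION_BLUR_MAX = 4.0

def quality_metrics(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()   # variance of the Laplacian
    brightness = float(gray.mean())
    contrast = float(gray.std())
    # Crude motion-blur proxy: a strong imbalance between horizontal and vertical
    # gradient energy suggests directional smearing from conveyor movement.
    gx = np.abs(cv2.Sobel(gray, cv2.CV_64F, 1, 0)).mean()
    gy = np.abs(cv2.Sobel(gray, cv2.CV_64F, 0, 1)).mean()
    ratio = (gx + 1e-6) / (gy + 1e-6)
    motion_blur = max(ratio, 1.0 / ratio)
    return sharpness, brightness, contrast, motion_blur

def best_frame(frames):
    # Return the sharpest frame that passes every check, or None if none qualify.
    candidates = []
    for frame in frames:
        sharpness, brightness, contrast, motion_blur = quality_metrics(frame)
        if (sharpness >= SHARPNESS_MIN
                and BRIGHTNESS_RANGE[0] <= brightness <= BRIGHTNESS_RANGE[1]
                and contrast >= CONTRAST_MIN
                and motion_blur <= MOTION_BLUR_MAX):
            candidates.append((sharpness, frame))
    return max(candidates, key=lambda item: item[0])[1] if candidates else None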

Label detection & event dispatch

Once a good image is selected, a YOLOv8 model (custom-trained on warehouse label data) detects labels within the image. Labels are cropped and saved separately from the full original image.

Instead of storing these images locally, they are pushed to a centralized blob storage bucket. Alongside each upload, the app generates a metadata event — containing the image URL, timestamp, camera ID, sensor ID and detection details — and publishes it to a message queue (RabbitMQ, Kafka, etc.).

This event-driven architecture decouples real-time detection from downstream text extraction and business logic, enabling scale and resilience.
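
The sketch below illustrates that dispatch step, assuming the Ultralytics YOLO runtime and the pika RabbitMQ client (a Kafka producer would be analogous). The upload_to_blob helper is hypothetical, standing in for whatever blob storage client is in use, and the paths, hosts and queue names are illustrative.

import json
import time
import pika                     # RabbitMQ client; Kafka would work similarly
from ultralytics import YOLO

model = YOLO("label_detector.pt")   # custom-trained weights (illustrative path)

connection = pika.BlockingConnection(pika.ConnectionParameters("rabbitmq-host"))
channel = connection.channel()
channel.queue_declare(queue="label-events", durable=True)

def dispatch_label_events(image_path, camera_id, sensor_id):
    # Detect labels, upload the image and publish one metadata event per detection.
    results = model.predict(source=image_path, conf=0.6)
    for box in results[0].boxes:
        class_name = results[0].names[int(box.cls)]
        confidence = float(box.conf)
        image_url = upload_to_blob(image_path)   # hypothetical blob storage helper
        event = {
            "image_url": image_url,
            "label_class": class_name,
            "confidence": confidence,
            "camera_id": camera_id,
            "sensor_id": sensor_id,
            "timestamp": time.time(),
        }
        channel.basic_publish(
            exchange="",
            routing_key="label-events",
            body=json.dumps(event),
            properties=pika.BasicProperties(delivery_mode=2),  # persist the message
        )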

2. Node.js text extraction microservice

At the heart of our backend pipeline is a lightweight yet powerful microservice responsible for transforming raw label images into structured data. This service is built in Node.js and designed with scalability and flexibility in mind. It can keep pace with the high volume of packages moving through the warehouse.

Event-driven & scalable

Every time a label image is processed at the edge, an event is published to the message queue. This message includes a pointer to the image stored in blob storage along with metadata about the label type and capture context. The microservice subscribes to this event stream and routes incoming jobs into a processing queue.

Thanks to its stateless architecture, the service can be scaled horizontally with minimal configuration: multiple instances can run in parallel, each picking up jobs from the queue to ensure throughput remains high even during peak hours.

Tailored prompting with multimodal LLM

Rather than relying on traditional OCR like Tesseract, the service uses a multimodal large language model (LLM) — in our case, GPT-4o — to extract key-value pairs directly from the label images. This model can process both visual and textual data in a single request, significantly simplifying the pipeline.

The key advantage here lies in prompt customization. Because the edge device includes the label class in the metadata, the service can dynamically select a prompt that corresponds to that specific label type. This ensures the output JSON structure matches the expectations of the downstream API consuming the data. In some cases, labels may be poorly printed, obstructed, or otherwise unreadable. To ensure robustness, the prompt explicitly instructs the model to mark such fields as "not-readable" where applicable. This makes it easy to:

  • Track which fields or entire labels couldn't be processed reliably.
  • Calculate the percentage of unreadable labels across time windows.
  • Trigger alerts or fallback workflows for human-in-the-loop review.
Sample prompt customization script.
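
For illustration only: the production service is written in Node.js, but the prompt-selection pattern can be sketched in Python with the OpenAI client as below. The label classes, prompt wording and field names here are simplified assumptions rather than the exact prompts used in production.

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# One tailored prompt per label class; unreadable fields must be flagged explicitly.
PROMPTS = {
    "shipping-label": (
        "Extract the tracking number, PO number, ship-to address and carrier from "
        "this shipping label. Return strict JSON with those keys. If a field cannot "
        "be read, set its value to \"not-readable\"."
    ),
    "box-label": (
        "Extract the part number, serial number and quantity from this box label. "
        "Return strict JSON with those keys. Use \"not-readable\" for unreadable fields."
    ),
}

def extract_fields(image_url, label_class):
    # Select the class-specific prompt and ask GPT-4o for structured JSON.
    prompt = PROMPTS.get(label_class, PROMPTS["shipping-label"])
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return json.loads(response.choices[0].message.content)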

Each successful extraction produces a clean, structured JSON payload that is immediately ready for downstream APIs. Alongside this, the system generates an annotated debug image with overlaid fields and bounding boxes, providing visual confirmation for audit and troubleshooting. Robust logging and metrics are also captured to monitor extraction quality and surface trends over time. This service acts as the critical bridge between image capture and business logic — fast, scalable and engineered for resilience.

3. Swift-based handheld app for manual capture

While the edge system handles most of the automated label scanning, certain scenarios demand a human-in-the-loop approach — whether it's for quality control, exception handling or catching packages that bypass the main conveyor. To address this, we built a Swift-based iOS app that empowers operators to manually scan, review and correct label data on the go.

Lightweight & prompt-driven

The app leverages the iOS device camera to capture label images. Once captured, the image is sent to a multimodal LLM (GPT-4o), which extracts the necessary key-value information and returns a structured JSON output — there is no need for local OCR models or heavy processing logic. The prompts from the Node.js service (selected by label class) can be reused in this app with little modification.

Review and edit workflow

The extracted information is displayed in a tabular UI, where the operator can quickly review all detected fields. Each field is touch-editable, allowing for fast corrections if something looks off. This brings human oversight directly into the flow, intuitively and efficiently.

Once verified (or corrected), the app pushes the final payload into the same event queue used by the backend pipeline, ensuring consistency in how data enters the system.

Floor-level flexibility

By decoupling from the edge device, the handheld app introduces greater operational flexibility. It allows multiple team members to assist during high-volume periods, ensuring smoother throughput and reducing bottlenecks. The app also adds a secondary layer of resilience, continuing operations even if the edge pipeline experiences delays. Additionally, it serves as an audit mechanism, enabling human input to validate and cross-check model performance when needed.

Future enhancements

The application has been architected with scalability and extensibility at its core. As a next step, we plan to integrate a lightweight YOLO model using MLToolKit to perform preliminary label-region detection directly within the app. These detections will be published to a message queue, enabling seamless consumption by the existing Node.js pipeline and mirroring the processing workflow currently used by the edge device.

This enhancement will not only standardize the data flow across platforms but also help reduce prompt complexity and improve label isolation, particularly in challenging scenarios such as cluttered backgrounds or overlapping stickers.

Designing and deploying an AI-powered label scanning system in a fast-paced warehouse environment requires far more than stringing models and devices together — it's about building a robust pipeline that gracefully handles variability, scales with demand and integrates seamlessly into existing operations.

Key lessons learned

This project reinforced several key engineering and system design principles:

Multimodal LLMs are redefining the role of OCR

Traditional OCR engines often struggle with label noise, poor lighting and variable formats. GPT-4o (and similar multimodal models) have proven far more resilient and versatile for structured extraction, especially when paired with thoughtful prompt design and expected schema output.

Edge processing is critical for throughput and latency 

By filtering and preprocessing image data at the edge using IR sensors, image quality checks and local logic, we dramatically reduced cloud payloads and ensured we only pushed high-confidence events into the pipeline.

Metadata is critical

Pushing metadata (like label type, location and quality metrics) alongside image URLs to the message queue allows downstream systems to operate with much greater context, tailoring prompts, managing fallbacks and estimating error rates. 

Human-in-the-loop is not a fallback — it's a feature

The Swift app added an entire dimension of flexibility to the system, giving operators visibility and control. It complemented the automated pipeline rather than competing with it.

A loosely coupled, event-driven architecture wins

Leveraging blob storage, message queue and scalable microservices allowed the system to operate asynchronously and gracefully recover from partial failures. Components could be independently tested, monitored and scaled.

This CUDA-accelerated computer vision system demonstrates the power of combining modern deep learning techniques with thoughtful software architecture. By leveraging GPU acceleration, parallel processing and quality control mechanisms, we've created a solution that addresses the real-world challenges of automated label processing in logistics environments.

The system's modular design also allows for future enhancements, such as supporting additional label types, implementing more sophisticated OCR processing, or integrating with emerging warehouse automation systems. As logistics operations continue to seek efficiency improvements, solutions like this will play an increasingly important role in modern supply chains.

For organizations looking to implement similar systems, our architecture provides a blueprint that balances performance, reliability and flexibility — the three key ingredients for successful computer vision deployments in industrial settings.

Training YOLO models on custom datasets: A practical guide for real-world applications

This section covers how we trained the custom YOLO model to detect labels with high precision under real-world conditions.

Object detection remains a cornerstone of modern computer vision applications, particularly in logistics, manufacturing and warehouse automation. YOLO (You Only Look Once) models are widely favored for their speed and accuracy in real-time scenarios. The output of any model is only as good as the data it is trained on, so we will start by preparing the dataset.

Annotating your own dataset with Label Studio

Creating a high-quality annotated dataset is foundational to any successful object detection pipeline. For this project, we used Label Studio, an open-source data labeling platform that supports a wide range of annotation tasks. Its clean UI and powerful customization features allowed us to:

  • Define label classes for different types of warehouse labels.
  • Annotate hundreds of images with bounding boxes efficiently.
  • Collaborate seamlessly across team members.
  • Export annotations in YOLO-compatible format.

Setting up your environment

Before diving into training, you'll need to set up your development environment. The key dependencies include the Ultralytics framework, which provides an easy-to-use implementation of YOLO, along with PyTorch and OpenCV for image processing. Example requirements.txt:

Sample requirements.txt file for setting up your environment.
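
A representative requirements.txt might look like the following; the exact version pins will depend on your environment and hardware.

ultralytics>=8.0.0
torch>=2.0.0
torchvision>=0.15.0
opencv-python>=4.8.0
numpy
pyyaml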

Preparing your custom dataset

The quality of your dataset has a profound impact on model performance. Since our system is designed to capture only relevant frames containing shipping labels on moving boxes, we significantly reduce the number of unnecessary image variations, simplifying training while maintaining real-world relevance.

Data collection

To build a robust dataset, we captured images of warehouse labels across a variety of real-world scenarios. These included different lighting conditions — from bright overhead illumination to low-light corners and naturally lit areas — to simulate the variability often found in warehouses. We also ensured a wide range of angles and perspectives to account for dynamic camera positioning. Additionally, we incorporated diverse backgrounds representative of operational warehouse environments and varied the distance between the camera and boxes to train the model for both close-range and distant detection.

Data annotation with Label Studio

We leveraged Label Studio to draw bounding boxes across the shipping labels and classify the type of label (i.e., shipping-label, box-label, invalid-box-label, invalid-shipping-label). The invalid labels are the use cases where the label was not fully captured in the frame. Annotations were exported in YOLO format and verified for consistency across the dataset. This structured labeling process allowed us to maintain annotation quality as we scaled our dataset size.

Organizing the dataset

After annotation, we used our custom script to split the dataset.

This organized our data into the standard train/validation/test structure that YOLO expects:

Sample script of the standard train/validation/test structure that YOLO expects.
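
A sketch of such a split script is shown below, assuming images and YOLO-format label files share base names; the directory names are illustrative, and the comments show the resulting layout.

import random
import shutil
from pathlib import Path

SOURCE = Path("dataset_raw")   # images/ and labels/ exported from Label Studio (illustrative)
DEST = Path("dataset")
SPLITS = [("train", 0.7), ("val", 0.2), ("test", 0.1)]

images = sorted((SOURCE / "images").glob("*.jpg"))
random.seed(42)
random.shuffle(images)

start = 0
for index, (split, fraction) in enumerate(SPLITS):
    # The last split takes whatever remains so no file is dropped by rounding.
    end = len(images) if index == len(SPLITS) - 1 else start + round(len(images) * fraction)
    for image_path in images[start:end]:
        label_path = SOURCE / "labels" / (image_path.stem + ".txt")
        for kind, src in (("images", image_path), ("labels", label_path)):
            if not src.exists():
                continue
            target_dir = DEST / kind / split
            target_dir.mkdir(parents=True, exist_ok=True)
            shutil.copy(src, target_dir / src.name)
    start = end

# Resulting layout:
# dataset/
#   images/train  images/val  images/test
#   labels/train  labels/val  labels/test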

We found that a 70/20/10 split for train/validation/test worked well for our use case, giving enough data for training while reserving sufficient samples for validation and testing.

Image preprocessing

Preprocessing plays a critical role in ensuring the consistency and quality of data fed into the model. YOLOv8 expects input images of a fixed resolution — typically 640×640 pixels — which allows the network to perform optimally during both training and inference. To align with this requirement, we implemented a preprocessing step that resizes all collected images to the expected dimensions.

Resizing helps normalize the dataset, minimizes aspect ratio variations and improves batch processing efficiency during training. It also ensures the model does not become biased toward any particular input resolution, making it more adaptable to real-world deployment scenarios.

Below is an example script that resizes images from a source directory and saves the processed versions into a designated output folder:

Sample script that resizes images from a source directory and saves the processed versions into a designated output folder.
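
A minimal version of that resize step, using OpenCV and illustrative directory names, could look like this. Because YOLO-format bounding boxes are stored as normalized coordinates, a plain resize keeps the existing annotations valid.

import cv2
from pathlib import Path

SOURCE_DIR = Path("dataset_raw/images")      # illustrative input directory
OUTPUT_DIR = Path("dataset_resized/images")  # illustrative output directory
TARGET_SIZE = (640, 640)                     # resolution expected by YOLOv8

OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

for image_path in SOURCE_DIR.glob("*.jpg"):
    image = cv2.imread(str(image_path))
    if image is None:
        continue  # skip unreadable or corrupted files
    resized = cv2.resize(image, TARGET_SIZE, interpolation=cv2.INTER_AREA)
    cv2.imwrite(str(OUTPUT_DIR / image_path.name), resized)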

Creating the dataset configuration

Before training can begin, YOLOv8 requires a dataset configuration file in YAML format that describes the structure and content of your dataset. This configuration file serves as a blueprint for the training pipeline, specifying where to locate the training, validation and test images and the list of object classes the model is expected to detect.

In our case, the dataset was organized into a standard directory layout with separate folders for images and corresponding labels, each further split into train, val and test subsets. This clear separation ensures reproducibility and helps in measuring performance consistently across different phases of model development.

Here is a sample dataset.yaml configuration file that reflects this structure:

Sample dataset.yaml configuration file that reflects this structure.
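
A dataset.yaml along these lines matches the layout above; the class indices are illustrative and must agree with the order used in the Label Studio export.

# dataset.yaml
path: ./dataset        # root directory of the dataset
train: images/train    # relative paths from the root
val: images/val
test: images/test

names:
  0: shipping-label
  1: box-label
  2: invalid-shipping-label
  3: invalid-box-label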

path: The root directory of your dataset.

train / val / test: Relative paths from the dataset root to the respective image directories.

names: A dictionary mapping class indices to their human-readable labels. These names must match exactly with those used during annotation.

By explicitly defining this structure, we ensure that the YOLO training pipeline correctly associates each image with its annotations and understands which classes it needs to learn. This also simplifies integration with visualization tools, evaluation scripts and deployment pipelines down the line.

Training your YOLO model

With our dataset prepared, training is straightforward. Our training script uses the YOLOv8 nano model as a starting point and fine-tunes it on our custom dataset:

Sample training script uses the YOLOv8 nano model as a starting point and fine-tunes it on our custom dataset.
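
A minimal training sketch uses the YOLOv8 nano checkpoint as the starting point; the epoch count and batch size match the values discussed below, and the run name is illustrative.

from ultralytics import YOLO

# Start from the pre-trained YOLOv8 nano checkpoint and fine-tune on the custom dataset.
model = YOLO("yolov8n.pt")

model.train(
    data="dataset.yaml",        # dataset configuration described above
    epochs=50,
    imgsz=640,
    batch=16,
    name="warehouse-labels",    # illustrative run name
)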

We found that 50 epochs provided a good balance between training time and model performance. The batch size of 16 worked well on our GPU, but you may need to adjust this based on your hardware.

Evaluating model performance

After training, it's essential to evaluate your model's performance:

Sample script to evaluate model performance.
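
A short validation sketch using the Ultralytics API; the checkpoint path is illustrative and depends on the training run name.

from ultralytics import YOLO

# Load the best checkpoint produced by the training run (path is illustrative).
model = YOLO("runs/detect/warehouse-labels/weights/best.pt")

metrics = model.val(data="dataset.yaml")
print(f"mAP50-95:  {metrics.box.map:.3f}")
print(f"mAP50:     {metrics.box.map50:.3f}")
print(f"Precision: {metrics.box.mp:.3f}")
print(f"Recall:    {metrics.box.mr:.3f}")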

This script calculates key metrics like precision, recall and mean Average Precision (mAP), giving you insight into how well your model is performing on your validation data.

In our case, we achieved a mAP of 0.92 on our warehouse label dataset, indicating strong performance across all label types.

Exporting your model for deployment

One of the strengths of the YOLO ecosystem is the ability to export models to various formats for deployment, including PyTorch, ONNX, TensorRT, etc. We chose the PyTorch .pt format to keep the architecture simple for the edge app. Here are a few examples of how to export your model into different formats.

ONNX export

The Ultralytics YOLO library provides a built-in method to export the model, for example to ONNX format:

Sample script to export your model into ONNX.
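
A minimal export sketch; the weights path is illustrative.

from ultralytics import YOLO

model = YOLO("runs/detect/warehouse-labels/weights/best.pt")
model.export(format="onnx")   # writes an .onnx file alongside the weights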

CoreML export for Apple devices

For iOS or macOS applications:

Sample script to export your model for Apple devices.
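
The same export call can target Core ML for Apple platforms; again, the weights path is illustrative.

from ultralytics import YOLO

model = YOLO("runs/detect/warehouse-labels/weights/best.pt")
model.export(format="coreml")   # produces a Core ML model for iOS/macOS apps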

Running inference

Once your YOLOv8 model has been successfully trained and validated, the next logical step is to test it on real-world inputs to evaluate its behavior and performance outside the training loop. This is where inference comes in.

Running inference on new images or video streams is straightforward with the Ultralytics YOLO interface. To run predictions on images in your test set, you can use the example code below, which loads the trained model weights and performs object detection on your input data.

Sample script that loads the trained model weights and performs object detection on your input data.
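
A small inference sketch along those lines; the weights path, test directory and confidence threshold are illustrative.

from ultralytics import YOLO

model = YOLO("runs/detect/warehouse-labels/weights/best.pt")

# Run prediction over a folder of test images and save annotated outputs.
results = model.predict(source="dataset/images/test", conf=0.5, save=True)

for result in results:
    for box in result.boxes:
        class_name = result.names[int(box.cls)]
        confidence = float(box.conf)
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        print(f"{class_name} ({confidence:.2f}) at [{x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f}]")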

Strategies for enhancing model performance

Through iterative experimentation and analysis, we identified several best practices that had a measurable impact on the accuracy, robustness, and generalization capability of our object detection model:

  • Data augmentation: Applying randomized image transformations — such as horizontal flips, rotations, and color jitter — helped increase the diversity of our training data. This, in turn, made the model more resilient to real-world variations in lighting, orientation and background clutter.
  • Transfer learning: Leveraging pre-trained YOLOv8 weights significantly accelerated convergence and improved performance, particularly in the early epochs. This approach enabled us to benefit from prior learning on large, generic datasets while fine-tuning for our specific task.
  • Hyperparameter optimization: Fine-tuning key hyperparameters such as learning rate, batch size and input image resolution directly impacted model performance. Even minor adjustments yielded noticeable improvements in precision and recall.
  • Model architecture selection: While we initially adopted the lightweight YOLOv8n model for its inference speed, transitioning to YOLOv8s provided a substantial boost in detection accuracy with only a marginal increase in computational load. This trade-off proved beneficial in our use case, where accuracy was prioritized without compromising on real-time performance.

These strategies collectively contributed to building a production-grade model that is both efficient and accurate across a wide range of deployment conditions.

Conclusion

Training a YOLO model on a custom dataset doesn't have to be complicated. With the right tools and approach, you can create a high-performing object detector tailored to your specific needs.

Our warehouse label detection system now runs successfully in production, helping automate inventory management and reducing errors in our fulfillment process.

Whether you're building a similar system or tackling an entirely different object detection challenge, the workflow outlined in this report should provide a solid foundation for your project.

 


This report may not be copied, reproduced, distributed, republished, downloaded, displayed, posted or transmitted in any form or by any means, including, but not limited to, electronic, mechanical, photocopying, recording, or otherwise, without the prior express written permission of WWT Research.


This report is compiled from surveys WWT Research conducts with clients and internal experts; conversations and engagements with current and prospective clients, partners and original equipment manufacturers (OEMs); and knowledge acquired through lab work in the Advanced Technology Center and real-world client project experience. WWT provides this report "AS-IS" and disclaims all warranties as to the accuracy, completeness or adequacy of the information.

Contributors

Harry Kabbay
Lead Machine Learning Engineer
