Low-latency software on CUDA for machine vision camera applications
There are a number of image processing applications where we need not only to process camera images, but also to perform fast computations on them and provide immediate feedback. Such projects call for a low-latency solution. To solve this problem, we've implemented the low-latency FastVCR camera application, which keeps end-to-end camera latency to a minimum.
Low-latency applications
- Remotely controlled robots
- Retinal surgery and video endoscopy
- Drones and connected autonomous vehicles
- Broadcasting and streaming
The main issues that affect the latency of the camera application
- Camera (image sensor) frame rate, bit depth and image resolution. The higher the bit depth and resolution, the lower the frame rate, because camera bandwidth is limited. A lower frame rate means more time between frames, which increases latency.
- Camera output interface. Each machine vision camera interface (MIPI CSI-2, GMSL, USB3, CameraLink, 10-GigE, Coax, PCIe) has its own maximum throughput. For better latency, use high-speed interfaces that take less time to transfer data.
- Time for image processing. To achieve low latency, we need the fastest image processing software. We offer a very fast ISP on NVIDIA GPUs, made possible by the implementation of parallel image processing algorithms and their low-level optimization. For better latency it also makes sense to implement the simplest possible pipeline. Typically, ISP performance on NVIDIA Quadro or GeForce is better than we could get with Jetson, and better performance means better latency.
- Batch image processing vs low-latency response. Batch processing with parallel algorithms gives higher throughput than processing separate frames, but it's not acceptable for low-latency applications, so we need to capture and process each frame separately.
- Compression ratio of the output image. To reduce data transfer time, we need to apply image or video encoding. It could be moderate (10-15 times for JPEG) or strong (100-200 times for H.264/H.265), but in any case we should not transfer uncompressed images if we want better latency. We also have to take into account the processing time of each encoding algorithm, and from this point of view JPEG compression could be preferable to H.264/H.265.
- Streaming protocol. When sending images, we should use UDP, not TCP. We also use a custom RTSP implementation; please bear in mind that standard RTSP is not suitable for low-latency tasks.
- Network bandwidth. This is another external limitation, and it's recommended that you use a network with a bandwidth greater than the camera's output data rate.
- Streaming distance. Sending the same amount of data over a greater distance on the same network takes more time, so distance matters as well.
- Monitor frame rate. If a monitor is part of the solution, its own latency can affect the overall latency. For the best results, we recommend high-refresh-rate monitors at 144 Hz or even up to 360 Hz.
- PC performance. Choose a fast CPU, memory, GPU, SSD, GigE card, etc. The PC should have all the resources necessary to perform the calculations as quickly as possible. Delays introduced by the operating system also matter.
- Programming language. The software should be written in C++/CUDA/OpenCL to make computations as fast as possible, and it should be optimized and accelerated.
- On the GPU, it's a good idea to use shared memory and registers for low-level optimization. Pinned memory for I/O is a must as well.
- I/O operations and host-to-device / device-to-host data transfers can ruin latency. I/O should be done in parallel: for example, both image acquisition from the camera and image output should be separated from the image processing pipeline. Async operations are also important.
- It's a good idea to limit the number of threads to avoid excessive thread synchronization.
- Method of latency measurement/estimation. The widely used G2G (glass-to-glass) approach to latency estimation is not ideal, and we should take this into account: its precision and reliability are limited, so we consider G2G results to be approximate.
Full image processing pipeline on GPU
- Image acquisition
- Frame unpacking for 10-bit and 12-bit modes
- Image linearization
- Dark frame subtraction (FPN)
- Flat-Field Correction (shading correction)
- Bad pixel removal
- White Balance / AWB
- Adaptive Exposure and Gain control
- High quality demosaicing with MG algorithm
- Color correction with matrix profile or DCP profile
- Highlight recovery
- Exposure correction (brightness control)
- Denoiser: Bilateral, NLM
- Curves and Levels
- Rotation by 90/180/270 degrees and flip/flop
- Crop
- Resize (downscale and upscale)
- Rotation by an arbitrary angle
- Undistortion via LCP or via calibrated maps
- Sharpening (local contrast)
- Gamma transform
- JPEG compression and storage on SSD
- Optional conversion to NV12 and H.264/H.265/AV1 encoding
- Automatic realtime partitioning of AVI/MP4 video files to the specified file size
- Built-in RTSP server for low latency video streaming and broadcasting
- Realtime output to monitor
FastVCR software outputs
- Video output to monitor via OpenGL in real time
- Camera statistics
- Histogram, Parade, Vectorscope
- Realtime processing and JPEG compression with image storage on SSD
- Video encoding to MJPEG (AVI), H.264/H.265/AV1 (MP4) and storage to video container on SSD
- Low-latency video streaming via RTSP (both player and server are included)
- Glass-to-Glass module for latency evaluation
- RTSP streaming is compatible with VLC and with FastVCR video player
- Interoperability with GPU-based AI libraries and applications at the GPU level
FastVCR latency benchmarks
For latency evaluation, we use a standard glass-to-glass (G2G or GTG) approach. To check system latency, we've implemented a software module that performs G2G tests. The following options are available for G2G testing:
- The camera captures an image of a high-resolution timer displayed on the monitor; we then send the data from the camera to the software, run the image processing on the GPU, and display the processed image on the same monitor next to the timer window. When we stop the software, we see two different times, and their difference is the system latency.
- We have also implemented a more complicated solution: after image processing on the GPU, we can apply JPEG encoding (MJPEG on the GPU), then send the MJPEG stream to the receiver process, where we do MJPEG parsing and decoding, and then output the frames to the monitor. Both processes (sender and receiver) run on the same PC.
- Same solution as the previous approach, but with H.264/H.265 encoding/decoding (CPU or GPU), both processes are on the same PC.
We can also measure latency when streaming compressed data from one PC to another over a network. Latency depends on camera frame rate, monitor refresh rate, NVIDIA GPU performance, network bandwidth, complexity of the image processing pipeline, etc. Because of these dependencies, we consider G2G latency results to be approximate.
For testing, we ran a XIMEA USB3 3-MPix 8-bit color camera at 120 fps with a 144 Hz monitor. The estimated G2G latency is around 35-40 ms. It could be lower with faster cameras and higher-refresh-rate monitors, and it's a good idea to work with PCIe cameras instead of USB3. You can run the software on your PC with your XIMEA camera to evaluate the latency of your system.
FastVCR CLI application
We often need to run the software without a GUI. This is the case for remotely controlled robotics applications, for any other task involving remote camera control, and for long-term unattended video recording and streaming.
To meet these requirements, we've developed a CLI application that has all the above features of the FastVCR software and can work without a GUI. We still have full control over image sensor and image processing parameters in real time. For video preview, we provide our own player with an RTSP client, or you can use VLC instead. The software is Windows/Linux/L4T compatible, and all image processing is done on the GPU.
Compatibility
- CUDA-12.6 for Windows/Linux, NVIDIA GeForce, Quadro, Tesla
- CUDA-12.2 for NVIDIA Jetson NX, Xavier, Orin
- XIMEA USB3 and PCIe cameras
Software downloads