Fast CUDA JPEG encoder for 8-bit images
We have extensive experience with high-speed cameras and software, and one problem comes up again and again: how to increase the duration of high-speed video recording. All our cameras stream video to PC RAM in real time via a CameraLink interface and a PCI-Express frame grabber. Total recording time is therefore limited by the amount of free RAM, which is usually in the range of 1-12 GB. A conventional high-speed camera produces about 650 MB/s, so one can record up to 16 seconds of high-speed video in raw format.
The basic idea for improving the situation is quite straightforward. We need to compress the stream arriving at the PC with a lossy algorithm at a compression ratio of 10-20, so that the compressed stream can be written to HDD/SSD/RAID in real time. A typical CPU cannot cope with that task, so we decided to use a GPU with NVIDIA CUDA technology instead. We believe video cards are a good way to apply the latest NVIDIA advances in parallel computing.
We have implemented the lossy JPEG compression algorithm (baseline JPEG for 8-bit images) because it meets our requirements. The main requirements are the following:
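The arithmetic behind these figures can be sketched as follows. This is only an illustration in Python: the function names are hypothetical, decimal units (1 GB = 1000 MB) are assumed, and the 10.4 GB of free RAM is an assumed value chosen because it reproduces the 16-second figure.

```python
def raw_recording_seconds(free_ram_gb, camera_rate_mb_s):
    """Seconds of raw video that fit in RAM (assumes 1 GB = 1000 MB)."""
    return free_ram_gb * 1000 / camera_rate_mb_s


def compressed_rate_mb_s(camera_rate_mb_s, compression_ratio):
    """Data rate that storage must sustain after compression."""
    return camera_rate_mb_s / compression_ratio


# Assumed example: 10.4 GB of free RAM at 650 MB/s gives about 16 s of raw video.
print(raw_recording_seconds(10.4, 650))

# With a compression ratio of 10-20 the stream shrinks to 65-32.5 MB/s,
# which is the rate the HDD/SSD/RAID has to absorb.
print(compressed_rate_mb_s(650, 10), compressed_rate_mb_s(650, 20))
```

Once the compressed stream is written to disk, recording time is bounded by storage capacity rather than by free RAM.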
- algorithm should be adaptable to parallel computations
- good image quality for compression ratio in the range from 10 to 20 times
- algorithm should not be too computationally intensive
- task should divide into the maximum number of sub-tasks
- minimum amount of memory for one thread
- popular open standard with multiple encoders and decoders available on different platforms
The lossy JPEG compression algorithm meets all of these criteria for parallel computation on the GPU. We developed the software without using standard libraries such as NVIDIA NPP or cuBLAS. To create a high-performance JPEG encoder for high-speed video applications, we also optimized the algorithm for GPU computation.
CUDA JPEG encoder features
- Extremely fast lossy image compression to JPEG with variable compression ratio
- Baseline JPEG for greyscale images (8-bit)
- Data input: 8-bit images from RAM/HDD/RAID/SSD
- Data output: final compressed JPEG image
- Continuous data mode (input one image after another)
- Standard set of computations for parallel implementation: Level shift, DCT, Quantization, Zig-zag, AC/DC, DPCM, RLE, Huffman
- Maximum input image size 12000 x 12000
- Compatibility with the latest NVIDIA GPUs
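To make the first stages in the list above concrete (level shift, DCT, quantization, zig-zag), here is a minimal CPU reference sketch in Python for a single 8x8 block. It is not the CUDA implementation described here. The quantization table is the example luminance table from Annex K of the JPEG standard, which is an assumption, since the article does not say which static table the encoder uses. The function names are hypothetical.

```python
import math

N = 8

# Example luminance table from JPEG Annex K (assumed, not confirmed by the article).
QUANT = [
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
]


def zigzag_order(n=N):
    """Return (row, col) pairs in JPEG zig-zag scan order."""
    order = []
    for s in range(2 * n - 1):
        diag = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        # Even anti-diagonals are walked bottom-left to top-right, odd ones the other way.
        order.extend(reversed(diag) if s % 2 == 0 else diag)
    return order


def dct_8x8(block):
    """Naive 2-D DCT-II of a level-shifted block (O(N^4), written for clarity, not speed)."""
    def c(k):
        return math.sqrt(0.5) if k == 0 else 1.0

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / 16)
                              * math.cos((2 * y + 1) * v * math.pi / 16))
            out[u][v] = 0.25 * c(u) * c(v) * total
    return out


def encode_block(pixels):
    """Level shift, DCT, quantize and zig-zag one 8x8 block of 8-bit pixels."""
    shifted = [[p - 128 for p in row] for row in pixels]
    coeffs = dct_8x8(shifted)
    quantized = [[round(coeffs[u][v] / QUANT[u][v]) for v in range(N)]
                 for u in range(N)]
    return [quantized[r][col] for r, col in zigzag_order()]


# A flat block has only a DC coefficient; all 63 AC coefficients quantize to zero.
flat = [[200] * N for _ in range(N)]
print(encode_block(flat)[:4])
```

Each block depends only on its own 64 pixels, which is why this part of the algorithm divides so naturally into independent sub-tasks.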
PC test configurations
- ASUS P6T Deluxe V2 LGA1366, X58, Core i7 920, 2.67 GHz, DDR-III 6 GB
- GPU: NVIDIA GeForce GT 240 (CC = 1.2, 96 cores), GeForce GTX 580 (CC = 2.0, 512 cores)
- Laptop ASUS N55S, Core i5 2430M, DDR III 6 GB, NVIDIA GeForce GT 555M (CC = 2.1, 144 cores)
- OS Windows 7, 64-bit, CUDA 4.1, driver 296.10 or later
Data path for GPU JPEG encoder
- Application start, CUDA initialization, device query, device enumeration, device test.
- Memory allocation, creating groups of threads, synchronization with user application on CPU.
- Receiving the address or index of the latest frame stored in RAM from the frame grabber driver or from the user application. Optionally, a BMP image can be loaded from HDD/RAID/SSD.
- Image uploading to GPU memory from RAM.
- Numerical calculations: Level shift, DCT, Quantization, Zig-Zag, AC/DC, DPCM, RLE and Huffman for each 8x8 block. We use static Quantization and Huffman tables. All frames are compressed independently; correlation between frames is not taken into account.
- Producing a compressed bit string for each 8x8 block, assembling these strings into an output array, and copying it to the host (memcpy). Optionally, the compressed image is saved to HDD/RAID/SSD.
- Decoding could be done on the GPU in reverse order.
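The DPCM and RLE steps in the data path above prepare the quantized coefficients for Huffman coding. The following Python sketch shows what baseline JPEG does at that point on the CPU. The function names are hypothetical, and the final Huffman table lookup is left out, because the article does not specify the static tables used.

```python
def magnitude_category(value):
    """JPEG size category: number of bits needed to represent |value| (0 for zero)."""
    return abs(value).bit_length()


def dpcm_dc(dc_values):
    """Replace each block's DC coefficient by its difference from the previous block's DC."""
    diffs, prev = [], 0  # the predictor starts at 0 at the beginning of a scan
    for dc in dc_values:
        diffs.append(dc - prev)
        prev = dc
    return diffs


def rle_ac(zigzag_block):
    """Convert the 63 AC coefficients of a zig-zag-ordered block into
    ((zero_run, size), value) symbols.

    (15, 0) is ZRL (a run of sixteen zeros); (0, 0) is EOB (rest of block is zero).
    """
    ac = zigzag_block[1:]
    last_nonzero = max((i for i, v in enumerate(ac) if v != 0), default=-1)
    symbols, run = [], 0
    for v in ac[:last_nonzero + 1]:
        if v == 0:
            run += 1
            continue
        while run > 15:
            symbols.append(((15, 0), None))
            run -= 16
        symbols.append(((run, magnitude_category(v)), v))
        run = 0
    if last_nonzero < len(ac) - 1:
        symbols.append(((0, 0), None))  # EOB is omitted only if the last AC is non-zero
    return symbols


# DC values of three consecutive blocks become small differences.
print(dpcm_dc([36, 38, 35]))

# One block: DC = 36, then AC coefficients 5, 0, 0, -3 followed by zeros.
block = [36, 5, 0, 0, -3] + [0] * 59
print(rle_ac(block))
```

Each DC difference needs only the DC value of the preceding block, and the run-length pass touches a single block, so both steps can still be executed per block.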
Benchmarks for different stages of GPU JPEG encoder
We obtained the following timings for lossy baseline JPEG compression of an 8-bit greyscale image with a resolution of 7216 x 5408 at a compression ratio of about 13 (compression quality 50%). The image was loaded into RAM before compression, and the GPU was a GeForce GTX 580:
- HostIO (data loading from HDD to RAM) - not applicable, because all data are already in RAM
- Host-to-Device (time to upload all image data to GPU) - 6500 µs
- Level shift, DCT, Quantization, Zig-Zag - 1260 µs
- RLE+DPCM - 1570 µs
- Huffman - 590 µs
- Device-to-Host (memcpy of compressed jpg file from device to host) - 520 µs
- Other - 710 µs
- TOTAL: 11150 µs for baseline JPEG compression of an 8-bit greyscale image with a resolution of 7216 x 5408 and CR = 13, including host-to-device and device-to-host transfers.
The total compression time of 11150 µs corresponds to a throughput of 3.5 GB/s. This figure holds when all of the above stages of the JPEG compression algorithm are carried out sequentially (with parallel computation inside every stage). If we could run different stages in parallel, and not only separate 8x8 blocks, compression throughput would increase further. GPUs with the Fermi architecture can perform host-to-device transfers, device-to-host transfers and calculations at the same time, so the maximum throughput could be higher still.
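The throughput figure can be reproduced from the stage timings listed above. The Python sketch below also gives an idealized estimate for the overlapped case. That estimate rests on an assumption that is not a measured result: in steady state the slowest of the three concurrent activities (upload, computation, download) sets the frame time. Whether both transfer directions can really overlap depends on the number of copy engines on the particular GPU. All names are hypothetical.

```python
# Stage timings in microseconds, taken from the benchmark list (GeForce GTX 580).
STAGE_US = {
    "host_to_device": 6500,
    "dct_quant_zigzag": 1260,
    "rle_dpcm": 1570,
    "huffman": 590,
    "device_to_host": 520,
    "other": 710,
}
FRAME_BYTES = 7216 * 5408  # 8-bit greyscale: one byte per pixel


def throughput_mb_s(frame_bytes, time_us):
    """Bytes per microsecond is numerically equal to decimal MB/s."""
    return frame_bytes / time_us


# All stages executed one after another.
sequential_us = sum(STAGE_US.values())

# Kernel time only, without transfers and without the "other" overhead.
kernels_us = (STAGE_US["dct_quant_zigzag"] + STAGE_US["rle_dpcm"]
              + STAGE_US["huffman"])

# Idealized overlap (assumption): frame time equals the slowest concurrent activity.
overlapped_us = max(STAGE_US["host_to_device"],
                    STAGE_US["device_to_host"],
                    kernels_us + STAGE_US["other"])

print(sequential_us, round(throughput_mb_s(FRAME_BYTES, sequential_us)))
print(kernels_us, round(throughput_mb_s(FRAME_BYTES, kernels_us)))
print(overlapped_us, round(throughput_mb_s(FRAME_BYTES, overlapped_us)))
```

Under this model the host-to-device upload is the bottleneck, which suggests that a faster bus would matter more than faster kernels.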
CUDA JPEG encoder benchmarks for 8-bit image on different GPUs
We obtained the following performance figures for lossy baseline JPEG compression of a greyscale image with 7216 x 5408 resolution and CR = 13:
- NVIDIA GeForce GT 240 - 820 MB/s
- NVIDIA GeForce GT 555M (mobile) - 1420 MB/s
- NVIDIA GeForce GTX 580 - 3500 MB/s
- NVIDIA GeForce GTX 680 - 5200 MB/s (preliminary results for PC with PCI-Express 3.0 and Core i7 3770)
The above results include DeviceIO latency (copying image data from RAM to GPU memory and back). They do not include HostIO latency (loading the image from HDD/SSD/RAID into RAM and back). We have also evaluated the throughput of the calculation part of the JPEG compression algorithm alone (without host-to-device and device-to-host transfers) and obtained more than 10 GB/s on the GeForce GTX 580.
JPEG decoding on the GeForce GTX 580 gives a throughput of 3.5 GB/s for the same image with CR = 13. For JPEG compression quality below 50%, decompression is faster than compression. On the GeForce GTX 680 we measured a decompression rate of 4.5 GB/s.
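For camera applications it can help to express the compression throughputs above as frame rates and as headroom over the 650 MB/s camera stream. The Python sketch below is derived arithmetic only, assuming decimal megabytes and one byte per pixel. The function names are hypothetical.

```python
FRAME_BYTES = 7216 * 5408  # 8-bit greyscale test image

# Compression throughput in MB/s, as listed for each GPU.
GPU_MB_S = {
    "GeForce GT 240": 820,
    "GeForce GT 555M": 1420,
    "GeForce GTX 580": 3500,
    "GeForce GTX 680": 5200,
}


def frames_per_second(mb_s, frame_bytes=FRAME_BYTES):
    """Frames of the test image compressed per second at a given throughput."""
    return mb_s * 1e6 / frame_bytes


def headroom(mb_s, camera_mb_s=650):
    """Ratio of encoder throughput to the camera data rate."""
    return mb_s / camera_mb_s


for name, rate in GPU_MB_S.items():
    print(f"{name}: {frames_per_second(rate):.1f} fps, "
          f"{headroom(rate):.2f}x camera rate")
```

A ratio above 1 means the encoder keeps up with a single 650 MB/s camera, which is the case for every GPU in the list.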
How to improve current results for CUDA JPEG encoder
- Software optimization for parallel computations
- Correct usage of registers and memory, number of threads per block, data alignment, etc.
- Concurrent copy and execution
- More powerful hardware
- GPUs with PCI-Express 3.0 interface
- Calculations on multiple GPU systems
- GPUDirect to bypass RAM and to allow P2P DMA transfers directly between a PCIe camera and the GPU
Applications for high-performance JPEG encoder
High-performance GPU JPEG encoder is suitable for a variety of digital imaging applications:
- High speed cameras and scanners
- Medical imaging systems
- Surveillance systems
- Scientific applications
- Software development including PC games
- Fast JPEG image resize for web applications
- GPU cloud computing services
Comparison with existing JPEG codecs for multicore CPU
Here are the fastest commercial JPEG encoders with multithreading capability that we were able to find:
- JPEG encoder IPP-7.0 from Intel (compression throughput up to 500 MB/s)
- JPEG library VXJPEG (version 1.4) from Vision Experts (compression throughput up to 500 MB/s)
- JPEG encoder from Norpix for high-speed video cameras (compression throughput up to 250 MB/s)
- JPEG encoder PICTools Photo from Accusoft Pegasus (compression throughput up to 250 MB/s)
As one can see, our JPEG encoder is much faster than the above CPU solutions.
Comparison with existing FPGA JPEG compression IP cores
The idea of online high-speed JPEG compression is not new. There are many JPEG FPGA implementations for that task. Here are several existing IP cores for FPGA:
- Cast Inc. - JPEG-E Baseline JPEG Compression Core with processing rates of up to 750 MSamples/s.
- Alma-Tech (SVE-JPEG-E, SpeedView Enabled JPEG Encoder Megafunction) - IP Core for FPGA Altera/Xilinx with throughput up to 500 MSamples/s.
- Visengi JPEG Encoder - JPEG / MJPEG Hardware Compressor IP Core with throughput up to 405 MSamples/s on Virtex-5 FPGA.
We obtained much better results with a GPU, though we understand that a GPU is not a universal solution. We consider a GPU to be an excellent choice for many tasks, particularly for testing and prototyping. It can also be attractive when there are no strict limits on power consumption and physical dimensions. Some further advantages of GPU-based image compression are worth pointing out:
- CUDA C code is much easier to understand than Verilog / VHDL code for FPGA
- CUDA C code is suitable for further modifications
- No need to develop hardware for algorithm implementation
- Excellent scalability: to increase throughput we could utilise more powerful GPUs or install more GPUs on PC
What we can offer for your NVIDIA GPU
- Testing utility to evaluate the throughput of the JPEG compression algorithm on your PC
- CUDA JPEG encoder and decoder SDK for high speed imaging
- Custom solutions for CUDA JPEG encoding/decoding and related image processing tasks
- Custom image compression software development
Support
- Full technical support up to successful client integration
- SDK, documentation, DLL without source code
- GPU JPEG decoder
For any further information concerning these solutions please contact us via email.