CUDA JPEG codec for NVIDIA GPUs

We have created a fast JPEG codec based on NVIDIA CUDA technology. The CUDA JPEG codec developed by Fastvideo combines strict standards compliance with encoding and decoding speeds well above those of existing commercial solutions. It is a full, performance-oriented implementation of Baseline JPEG. Its ultra-fast compression and decompression on the GPU come from a fully parallel implementation of the Baseline JPEG algorithm. Our JPEG codec is much faster than the best commercial multithreaded JPEG codecs for multicore CPUs.

Fast JPEG image compression features for CUDA JPEG codec

Implementation is 100% compliant with JPEG Baseline Standard
Baseline JPEG compression and decompression for grayscale (8-bit) and color (24-bit) images with arbitrary width and height
Optional 12-bit JPEG compression for grayscale and color
Extremely fast lossy image encoding and decoding with variable compression ratio
Subsampling modes: 4:4:4, 4:2:2, 4:2:0
Minimum input image size 1×1 for grayscale and color images with any subsampling
Maximum input image size is 16,000 × 16,000 or more (optional)
JPEG image quality in the range from 1 to 100
Read/edit/write any EXIF section
Optional parameters: quantization tables
Data input: 8/24-bit or 12/36-bit images from RAM/HDD/RAID/SSD/GPU
Data output: final compressed/uncompressed 8/24-bit or 12/36-bit image in RAM/HDD/RAID/SSD/GPU
Standard input formats: PGM, YUV, PPM, BMP
Continuous data mode (input one image after another)
Standard set of computations for parallel implementation of Baseline JPEG compression and decompression

JPEG Encoding on GPU: Input data parsing, Color Transform, 2D DCT, Quantization, Zig-zag, AC/DC, DPCM, RLE, Huffman coding, Byte stuffing, JFIF formatting
JPEG Decoding on GPU: JFIF parsing, Restart marker search, Huffman decoding, Inverse RLE, Inverse DPCM, AC/DC, Inverse Zig-zag, Inverse Quantization, Inverse DCT, Inverse Color Transform, Output formatting

Optimized for the latest NVIDIA GPUs
Compatibility with FFmpeg to read/write MJPEG streams (FFmpeg is under LGPLv2.1)
Optional integration with OpenGL
Optional support for input from HD-SDI cards (Bluefish, Deltacast, Imperx)
Compatible with Windows 7/8/10 and Linux (Ubuntu/CentOS)
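
The encoding stages named in the list above (Color Transform, 2D DCT, Quantization, Zig-zag, DPCM, RLE) can be illustrated for a single 8×8 block. This is a minimal serial sketch in plain Python, not Fastvideo's GPU implementation. The function names, the flat quantization table, and the sample block are assumptions made for illustration only. Huffman coding, byte stuffing, and JFIF formatting are left out.

```python
import math

def rgb_to_ycbcr(r, g, b):
    """JFIF color transform for one 8-bit RGB pixel."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 128
    cr = 0.5 * r - 0.418688 * g - 0.081312 * b + 128
    return y, cb, cr

def dct2(block):
    """Direct (slow) 2D DCT-II of an 8x8 block, as defined for JPEG."""
    n = 8
    def alpha(k):
        return math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0.0
            for x in range(n):
                for y in range(n):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            out[u][v] = alpha(u) * alpha(v) * s
    return out

# Zig-zag scan order: walk anti-diagonals, alternating direction on each one.
ZIGZAG = sorted(
    ((r, c) for r in range(8) for c in range(8)),
    key=lambda rc: (rc[0] + rc[1], rc[0] if (rc[0] + rc[1]) % 2 else rc[1]),
)

def rle_ac(zz):
    """Run-length encode the 63 AC coefficients as (zero_run, value) pairs."""
    pairs, run = [], 0
    for v in zz[1:]:
        if v == 0:
            run += 1
            continue
        while run > 15:            # ZRL symbol: a run of 16 zeros
            pairs.append((15, 0))
            run -= 16
        pairs.append((run, v))
        run = 0
    if run:                        # trailing zeros collapse to end-of-block
        pairs.append((0, 0))
    return pairs

def encode_block(pixels, qtable, prev_dc):
    """Level shift, DCT, quantize, zig-zag, DPCM on DC, RLE on AC."""
    shifted = [[p - 128 for p in row] for row in pixels]
    coef = dct2(shifted)
    quant = [[round(coef[u][v] / qtable[u][v]) for v in range(8)]
             for u in range(8)]
    zz = [quant[r][c] for r, c in ZIGZAG]
    dc_diff = zz[0] - prev_dc      # DPCM: store the difference to the previous DC
    return dc_diff, rle_ac(zz), zz[0]

# Usage with a hypothetical flat quantization table and a constant gray block.
flat_q = [[16] * 8 for _ in range(8)]
gray = [[136] * 8 for _ in range(8)]
dc_diff, ac_pairs, dc = encode_block(gray, flat_q, prev_dc=0)
print(dc_diff, ac_pairs)
```

Each block depends on its neighbor only through the previous DC value, so every stage up to the DPCM step can be computed for all blocks independently.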

We have succeeded in parallelizing all stages of the JPEG algorithm, including entropy encoding and decoding. There was a widespread opinion that the Huffman algorithm could only be serial. In our solution, Huffman coding is fully parallel and is no longer a bottleneck. We do not off-load anything from the GPU to the CPU to make the JPEG codec faster: the CUDA JPEG codec is extremely fast and runs entirely on the GPU.
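
The text does not describe how the entropy stage is parallelized. One standard JPEG mechanism that makes parts of the entropy-coded stream independently decodable is the restart marker (the decoding stage list mentions "Restart marker search"). The sketch below is an illustration in CPU Python, not Fastvideo's method, and its function names are hypothetical. It splits a scan at restart markers (bytes 0xFFD0 to 0xFFD7) and processes the segments in parallel. Because byte stuffing follows every 0xFF data byte with 0x00, a 0xFF followed by 0xD0..0xD7 can only be a marker.

```python
from concurrent.futures import ThreadPoolExecutor

def stuff_bytes(raw):
    """Encoder side: insert 0x00 after every 0xFF in entropy-coded data."""
    out = bytearray()
    for b in raw:
        out.append(b)
        if b == 0xFF:
            out.append(0x00)
    return bytes(out)

def unstuff_bytes(segment):
    """Decoder side: drop the 0x00 that follows each 0xFF data byte."""
    return segment.replace(b"\xff\x00", b"\xff")

def split_at_restart_markers(scan):
    """Cut the scan into segments at RST0..RST7 markers (0xFFD0..0xFFD7)."""
    segments, start, i = [], 0, 0
    while i < len(scan) - 1:
        if scan[i] == 0xFF and 0xD0 <= scan[i + 1] <= 0xD7:
            segments.append(scan[start:i])
            start = i + 2
            i += 2
        else:
            i += 1
    segments.append(scan[start:])
    return segments

# Usage: three made-up segments joined by RST0 and RST1.
scan = stuff_bytes(b"\x12\xff\x34") + b"\xff\xd0" + b"\xab" + b"\xff\xd1" + b"\xcd"
segments = split_at_restart_markers(scan)

# Each segment starts byte-aligned with the DC predictor reset, so the
# segments can be handed to separate workers (here: simple unstuffing).
with ThreadPoolExecutor() as pool:
    payloads = list(pool.map(unstuff_bytes, segments))
print(payloads)
```

On a GPU the same idea would map segments to threads or thread blocks. A real decoder would run Huffman decoding on each segment rather than only removing stuffed bytes.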

Benchmarks for JPEG encoding and decoding on NVIDIA GeForce GTX 1080 Ti (Windows 7 and CUDA 8.0, 64-bit)

We now need just 0.51 ms for Baseline JPEG encoding of a 24-bit color image at 4K resolution (3840 × 2160) with JPEG quality 90% and 4:2:0 subsampling, which corresponds to a compression ratio of about 10:1. We chose these encoding parameters because they correspond to so-called "visually lossless" compression.
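
The text does not say how the codec maps a quality value such as 90 to quantization tables. A minimal sketch, assuming the widely used IJG (libjpeg) scaling convention and the example luminance table from Annex K of the JPEG standard, shows what a quality setting typically means in practice. The function name is hypothetical.

```python
# Example luminance quantization table from JPEG Annex K (row-major order).
BASE_LUMA = [
    16, 11, 10, 16, 24, 40, 51, 61,
    12, 12, 14, 19, 26, 58, 60, 55,
    14, 13, 16, 24, 40, 57, 69, 56,
    14, 17, 22, 29, 51, 87, 80, 62,
    18, 22, 37, 56, 68, 109, 103, 77,
    24, 35, 55, 64, 81, 104, 113, 92,
    49, 64, 78, 87, 103, 121, 120, 101,
    72, 92, 95, 98, 112, 100, 103, 99,
]

def scale_quant_table(base, quality):
    """Scale a base table by quality 1..100 (IJG convention, assumed here)."""
    quality = max(1, min(100, quality))
    # Below 50 the scale factor grows quickly; above 50 it falls linearly to 0.
    scale = 5000 // quality if quality < 50 else 200 - 2 * quality
    # Baseline JPEG needs 8-bit table entries, so clamp to 1..255.
    return [max(1, min(255, (q * scale + 50) // 100)) for q in base]

q90 = scale_quant_table(BASE_LUMA, 90)
print(q90[:8])   # first row of the scaled table
```

Under this convention, quality 50 returns the base table unchanged and quality 100 yields a table of all ones, so higher quality means finer quantization steps and a lower compression ratio.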

These are the latest performance benchmarks for encoding 24-bit 2K and 4K images (JPEG compression on GPU, without device I/O latency, single-image mode, no batching, no streaming) on NVIDIA GeForce GTX 1080 Ti and Quadro P6000:

Full HD (2K, 1920 × 1080) ~ 35 GByte/s (0.17 ms)
4K (3840 × 2160) ~ 46 GByte/s (0.51 ms)

These are the JPEG decoding performance benchmarks on NVIDIA GeForce GTX 1080 Ti and Quadro P6000 (without device I/O latency, single-image mode, no batching, no streaming):

Full HD (2K, 1920 × 1080) ~ 5.3 GByte/s (1.2 ms)
4K (3840 × 2160) ~ 11.2 GByte/s (2.12 ms)

The above results are much faster than libjpeg-turbo benchmarks on CPU. Even when host-to-device and device-to-host transfers are taken into account, the performance of the CUDA JPEG codec remains much higher. Additional performance measurements are available for download.
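
The throughput figures in the benchmark lists can be roughly cross-checked from image size and time. The sketch below assumes that throughput means uncompressed 24-bit image bytes divided by processing time, with 1 GByte taken as 10^9 bytes. The source states neither definition, so the computed values agree with the published ones only to within several percent.

```python
def throughput_gbyte_per_s(width, height, bytes_per_pixel, time_ms):
    """Uncompressed image bytes processed per second, in units of 1e9 bytes."""
    image_bytes = width * height * bytes_per_pixel
    return image_bytes / (time_ms / 1000.0) / 1e9

# (label, width, height, time in ms) taken from the benchmark lists above.
cases = [
    ("encode Full HD", 1920, 1080, 0.17),
    ("encode 4K", 3840, 2160, 0.51),
    ("decode Full HD", 1920, 1080, 1.2),
    ("decode 4K", 3840, 2160, 2.12),
]
for label, w, h, ms in cases:
    rate = throughput_gbyte_per_s(w, h, 3, ms)
    print(f"{label}: {rate:.2f} GByte/s")
```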

Options for CUDA JPEG image compressor

We have also included our JPEG compression and decompression software in our main product, the SDK for GPU Image & Video Processing. The SDK includes dark frame subtraction, shading correction, white balance, demosaicing, denoising, color correction, tone mapping, image filtering, 1D LUT, gamma, color management, 3D LUT, histogram, resize, crop, rotate, remap, integral image, defringe, undistortion, sharpening, OpenGL or GLFW output, integration with FFmpeg, raw Bayer compression, J2K encoder, etc. SDK benchmarks are available for JPEG compression and decompression, debayering, resizing, denoising, and J2K encoding on NVIDIA GeForce GTX 1080, Quadro P6000, and mobile Tegra X1 and X2.

Licensing for CUDA JPEG Codec

We license CUDA JPEG and other components of the GPU Image & Video Processing SDK to software developers, camera manufacturers and resellers, internet providers, software integrators, etc. Our SDK is used in a wide range of imaging applications. A demo SDK, documentation, licensing information, and quotations are available upon request. We also offer custom software design to an agreed specification. If you need a significant speed-up for your image processing application, don't hesitate to contact us.

Roadmap for further improvements of CUDA JPEG Codec

New 12-bit CUDA JPEG decoder
Further fast JPEG codec optimizations
Minimum memory usage on GPU
JPEG codec integration into Fast CinemaDNG software - done