GPU Acceleration Performance on Jetson Orin Nano 8G Significantly Lower Than Expected
Issue Overview
Users are experiencing unexpectedly low GPU acceleration on the Jetson Orin Nano 8G compared to laptop GPUs. In a vector-addition benchmark, the GPU was only about 2.4 times faster than the CPU (42 ms vs. 17 ms), while the same code on a laptop (CPU: Ryzen 7 5800H, GPU: RTX 2060) achieved nearly an 8x speedup. This discrepancy is concerning for users expecting higher GPU performance from the Jetson Orin Nano 8G.
Specific details:
- CPU time: 42ms
- GPU time: 17ms
- Jetpack version: 5.1.1 (inferred from CUDA 11.4)
- CUDA version: 11.4.315
- Test program: Vector addition using CUDA
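The original poster's exact code is not shown in the report, but a CUDA vector-addition benchmark of the kind described typically looks like the following sketch (the element count, initialization values, and timing method here are assumptions):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vector_add_gpu(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 24;                 // element count (assumed; not given in the report)
    const size_t bytes = n * sizeof(float);

    float *h_a = (float *)malloc(bytes), *h_b = (float *)malloc(bytes), *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Time only the kernel with CUDA events, excluding host<->device copies
    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    const int block_size = 256;
    const int grid_size = (n + block_size - 1) / block_size;
    cudaEventRecord(start);
    vector_add_gpu<<<grid_size, block_size>>>(d_a, d_b, d_c, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("GPU kernel time: %.3f ms\n", ms);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```

Whether the reported times include host/device transfers matters a great deal on a 2.4x-vs-8x comparison, which is why the sketch times the kernel separately.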
Possible Causes
- Power management settings: The Jetson Orin Nano may be operating in a lower power mode, limiting its performance.
- Dynamic clock frequencies: The GPU clock might not be locked to its maximum frequency, causing inconsistent performance.
- Workload characteristics: The specific workload may not be optimized for the Jetson Orin Nano’s GPU architecture.
- Memory bandwidth limitations: The Jetson Orin Nano’s shared memory architecture might be causing bottlenecks.
- CUDA kernel configuration: The chosen grid and block sizes may not be optimal for the Jetson Orin Nano’s GPU.
- Comparison discrepancy: Comparing an embedded GPU to a discrete laptop GPU may not be a fair comparison due to architectural differences.
Troubleshooting Steps, Solutions & Fixes
- Maximize device performance:
  - Set the power mode to maximum:
    sudo nvpmodel -m 0
  - Lock clocks to maximum frequency:
    sudo jetson_clocks
- Verify current power mode:
  - Check the current power mode:
    sudo nvpmodel -q
  - Ensure it’s set to the highest available mode (e.g., the 15W mode on the Orin Nano 8GB)
- Optimize CUDA kernel configuration:
  - Experiment with different grid and block sizes in the kernel launch:
    // Try different values for grid_size and block_size
    vector_add_gpu<<<grid_size, block_size>>>(dev_a, dev_b, dev_c, n);
  - Use the CUDA occupancy calculator to find optimal launch configurations
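Instead of hand-tuning, the runtime can suggest a launch configuration via the CUDA occupancy API. A sketch (the `launch_vector_add` wrapper is illustrative; the kernel matches the launch shown above):

```cuda
#include <cuda_runtime.h>

__global__ void vector_add_gpu(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Ask the runtime for the block size that maximizes occupancy for this
// kernel on the current device, then size the grid to cover n elements
void launch_vector_add(const float *d_a, const float *d_b, float *d_c, int n) {
    int min_grid_size = 0, block_size = 0;
    cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size, vector_add_gpu, 0, 0);
    int grid_size = (n + block_size - 1) / block_size;
    vector_add_gpu<<<grid_size, block_size>>>(d_a, d_b, d_c, n);
}
```

Note that for a memory-bound kernel like vector addition, occupancy tuning usually yields only modest gains; bandwidth is the limiter.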
- Profile the application:
  - Use NVIDIA Nsight Systems to profile the application and identify potential bottlenecks:
    nsys profile --stats=true ./your_benchmark
  - Look for memory transfer overheads, kernel launch times, and GPU utilization
- Optimize memory transfers:
  - Use pinned memory for host allocations to improve transfer speeds
  - Consider using unified memory if appropriate for the workload
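A sketch of both options: pinned (page-locked) host memory via `cudaMallocHost` speeds up DMA transfers, and on Jetson’s shared-DRAM architecture managed memory can avoid explicit copies entirely (variable names are illustrative):

```cuda
#include <cuda_runtime.h>

int main() {
    const int n = 1 << 24;
    const size_t bytes = n * sizeof(float);

    // Pinned host allocation: faster transfers than pageable malloc() memory
    float *h_a = nullptr;
    cudaMallocHost(&h_a, bytes);

    float *d_a = nullptr;
    cudaMalloc(&d_a, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);

    // Alternative on Jetson: managed memory is visible to both CPU and GPU,
    // which suits the shared physical DRAM of the Orin Nano
    float *m_b = nullptr;
    cudaMallocManaged(&m_b, bytes);

    cudaFree(d_a); cudaFree(m_b); cudaFreeHost(h_a);
    return 0;
}
```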
- Benchmark with different data types:
  - Test with both single-precision (float) and double-precision (double); embedded GPUs typically have far lower double-precision throughput, so a large gap here is expected
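A templated kernel makes the precision comparison easy to set up; this is a sketch, with the launch parameters left as placeholders:

```cuda
#include <cuda_runtime.h>

// One kernel source instantiated for both precisions; on embedded GPUs the
// double-precision instantiation is typically much slower than float
template <typename T>
__global__ void vector_add_gpu(const T *a, const T *b, T *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Example launches (grid/block sizes and pointers are placeholders):
// vector_add_gpu<float><<<grid, block>>>(fa, fb, fc, n);
// vector_add_gpu<double><<<grid, block>>>(da, db, dc, n);
```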
- Compare with other Jetson Orin Nano benchmarks:
  - Run standard benchmark suites such as Rodinia or Parboil to compare your device’s performance with published results
- Check for thermal throttling:
  - Monitor device temperatures and clocks during extended runs to ensure thermal limits are not being reached:
    tegrastats
- Update software:
  - Ensure you’re running the latest JetPack and CUDA versions available for the Jetson Orin Nano
- Optimize CPU code:
  - Ensure OpenMP is properly configured and utilizing all available CPU cores
  - Consider using vectorized instructions (e.g., NEON for ARM) in the CPU implementation
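For a fair CPU baseline, an OpenMP vector add might look like this sketch (the function name is illustrative; compile with `-O3 -fopenmp` to enable the parallel loop):

```cpp
#include <cstddef>
#include <vector>

// CPU baseline: the pragma splits iterations across all cores when built
// with -fopenmp; without that flag it is ignored and the loop runs serially
std::vector<float> vector_add_cpu(const std::vector<float> &a, const std::vector<float> &b) {
    std::vector<float> c(a.size());
    #pragma omp parallel for
    for (long long i = 0; i < static_cast<long long>(a.size()); ++i)
        c[i] = a[i] + b[i];
    return c;
}
```

A faster, properly vectorized CPU baseline shrinks the apparent GPU speedup, so the CPU build flags matter when comparing ratios across machines.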
- Adjust expectations:
  - Understand that embedded GPUs like those in Jetson devices may not achieve the same speedups as discrete GPUs in laptops or desktops; vector addition is memory-bound, and the Orin Nano 8GB’s shared LPDDR5 (roughly 68 GB/s) offers a fraction of a discrete RTX 2060’s memory bandwidth (roughly 336 GB/s)
  - Focus on relative performance improvements within the Jetson ecosystem rather than comparing to non-embedded systems
If these steps do not resolve the issue, consider reaching out to NVIDIA developer forums with detailed benchmark results and system information for further assistance.