Slow Performance of Local Small Models on Jetson Orin Nano
Issue Overview
Users of the Jetson Orin Nano Developer Kit with JetPack 6 are experiencing extremely slow performance when running local small language models, particularly in text generation tasks. The issue manifests as follows:
- The chatbot responds extremely slowly when using models like llama-2-7b-chat.Q4_0.gguf
- Setting n-gpu-layers to 128 as suggested in tutorials causes the Jetson to freeze, requiring a restart
- With n-gpu-layers set to 0, the chatbot works but remains very slow, and the overall system becomes sluggish
- The model appears to be running on the CPU instead of the GPU, given the poor performance
- Terminal output shows token generation speed of only 0.02 tokens/s
This problem significantly impacts the user experience and the practical utility of the Jetson Orin Nano for running local language models.
Possible Causes
- Insufficient GPU Memory: The Jetson Orin Nano may not have enough GPU memory to handle the specified number of GPU layers for the model.
- Improper GPU Utilization: The model might not be effectively utilizing the GPU, causing it to fall back to CPU processing.
- Suboptimal Model Configuration: The chosen model or its configuration may not be optimized for the Jetson Orin Nano’s hardware capabilities.
- Memory Management Issues: Lack of proper memory management, including insufficient swap space, could be limiting the system’s performance.
- Software Optimization: The llama.cpp implementation used might not be fully optimized for the Jetson platform.
- Model Size Mismatch: The selected language model might be too large for the Jetson Orin Nano’s specifications.
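As a quick sanity check on the memory-related causes above, compare the size of the GGUF file against the Orin Nano’s unified memory, which is shared between CPU and GPU. This is a minimal sketch; the model path is a placeholder for wherever you downloaded the file:

```bash
# Size of the quantized model file (path is a placeholder for your download location)
ls -lh ~/models/llama-2-7b-chat.Q4_0.gguf

# Total and available unified memory plus current swap usage
free -h
```

A 4-bit 7B model is roughly 4 GB on disk, and the weights, KV cache, and desktop environment all draw from the same physical RAM, which leaves little headroom on this device.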
Troubleshooting Steps, Solutions & Fixes
- Optimize GPU Memory Usage:
  - Reduce the number of GPU layers to find a balance between performance and stability.
  - Experiment with different n-gpu-layers values, starting from a low number and increasing gradually rather than jumping straight to 128; see the sketch below.
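If you are driving llama.cpp directly, partial offload looks roughly like this. This is a sketch assuming a CUDA-enabled llama.cpp build; the binary is named llama-cli in recent releases (main in older ones), and the model path and layer count are example values only:

```bash
# Offload only part of the model to the GPU; raise --n-gpu-layers in steps
# (e.g. 8 -> 16 -> 24) while watching memory usage with tegrastats.
./llama-cli -m ~/models/llama-2-7b-chat.Q4_0.gguf \
    --n-gpu-layers 16 \
    -p "Hello, how are you?" -n 64
```

The same idea applies when the n-gpu-layers setting is passed through a chatbot front end: keep the offloaded layers small enough that the model plus KV cache fits in the shared memory.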
- Increase Swap Space:
  - Follow the steps outlined in the Jetson containers setup guide to mount additional swap space:
    sudo systemctl disable nvzramconfig
    sudo fallocate -l 16G /mnt/16GB.swap
    sudo mkswap /mnt/16GB.swap
    sudo swapon /mnt/16GB.swap
  - Add the following line to /etc/fstab to make the swap persistent:
    /mnt/16GB.swap none swap sw 0 0
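A quick check that the new swap is actually active (both are standard Linux utilities):

```bash
# List active swap devices; the 16 GB swap file should appear here
swapon --show

# The Swap line should now show roughly 16 GB of total space
free -h
```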
- Try Alternative Implementations:
  - Use Ollama instead of llama.cpp, as it’s generally easier to use and performs better out-of-the-box on Jetson devices.
  - Refer to the Jetson AI Lab page for Ollama setup and usage instructions; a minimal example follows.
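A minimal sketch of getting Ollama running, assuming the standard install script and a stock model from the Ollama library (the Jetson AI Lab page also documents a container-based route):

```bash
# Install Ollama using the official convenience script
curl -fsSL https://ollama.com/install.sh | sh

# Pull and chat with a model; Ollama handles GPU offload defaults itself
ollama run llama2
```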
- Use Smaller Language Models:
  - Explore smaller language models that are better suited for the Jetson Orin Nano’s capabilities.
  - Refer to the Small LLM (SLM) page on the Jetson AI Lab website for a list of compatible models.
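As an example, a model around the 1B-parameter mark leaves far more headroom than a 7B one. A minimal sketch using Ollama’s model library (TinyLlama is one of the smaller models it hosts; any model from the SLM page can be used the same way):

```bash
# A ~1.1B-parameter model: much smaller weights and KV cache than a 7B model,
# so it fits comfortably in the Orin Nano's shared memory.
ollama run tinyllama
```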
- Optimize System Resources:
  - Close unnecessary applications and processes to free up system resources.
  - Monitor system resource usage using tools like top or htop to identify potential bottlenecks.
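One of the larger consumers of the shared memory is the desktop environment itself. A hedged sketch of checking usage and reclaiming memory; the systemctl target switch is a common way to stop the GUI on Ubuntu-based JetPack images, and you can switch back with graphical.target:

```bash
# See what is consuming memory and swap right now
free -h
top -o %MEM   # sort processes by memory usage

# Optionally stop the desktop GUI to free several hundred MB of RAM
sudo systemctl isolate multi-user.target
# ...and restore it later
sudo systemctl isolate graphical.target
```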
- Update Software and Drivers:
  - Ensure that JetPack 6 and all associated drivers are up to date.
  - Check for any available updates or patches specific to language model performance on Jetson devices.
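To confirm the installed JetPack version and pull in updates, the apt metadata for the nvidia-jetpack metapackage can be queried (this assumes an apt-based JetPack install, which is the standard Developer Kit setup):

```bash
# Show the installed JetPack version
apt-cache show nvidia-jetpack | grep Version

# Update system packages, including JetPack components delivered via apt
sudo apt update && sudo apt upgrade
```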
- Verify GPU Utilization:
  - Use the tegrastats command to monitor GPU usage during model inference.
  - Confirm that the GPU is being utilized and identify any potential issues.
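For example, run tegrastats in a second terminal while the model is generating text. The GR3D_FREQ field reports GPU load; if it stays near 0% during inference, the layers are not actually being offloaded:

```bash
# Print utilization once per second; GR3D_FREQ is the GPU load,
# and the RAM field shows how much of the shared memory is in use.
sudo tegrastats --interval 1000
```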
- Experiment with Different Model Quantizations:
  - Try different quantization levels for the model (e.g., Q4_0, Q5_1) to find a balance between performance and accuracy.
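Pre-quantized GGUF files at several levels are usually published alongside each model, so comparing them is mostly a matter of downloading a different file and timing it. A sketch using llama-bench, the benchmarking tool that ships with llama.cpp (the model paths and the -ngl value are placeholders):

```bash
# Compare tokens/s for two quantization levels of the same model
./llama-bench -m ~/models/llama-2-7b-chat.Q4_0.gguf -ngl 16
./llama-bench -m ~/models/llama-2-7b-chat.Q5_1.gguf -ngl 16
```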
- Consult Jetson Community Resources:
  - Check the Jetson Developer Forums for similar issues and potential solutions.
  - Reach out to the NVIDIA Jetson community for specific advice on optimizing language model performance on the Orin Nano.
By following these steps and exploring the suggested solutions, users should be able to improve the performance of local small language models on their Jetson Orin Nano Developer Kit. If issues persist, further investigation into hardware-specific optimizations or alternative model architectures may be necessary.