Training and Deploying Custom Models on Jetson Orin Nano and Jetson Nano
Issue Overview
Users are experiencing challenges and confusion when training custom object detection models (specifically SSD-MobileNet for DetectNet) on the Jetson Orin Nano using transfer learning with jetson-inference. The main concerns include:
- Portability of trained models between Jetson Orin Nano and Jetson Nano
- Retraining models with new data
- Selecting the best model checkpoint for deployment
These issues impact the workflow of developing and deploying custom AI models across different Jetson platforms, potentially affecting the efficiency and effectiveness of AI projects using NVIDIA Jetson devices.
Possible Causes
- Hardware architecture differences: The Jetson Orin Nano (Ampere GPU) and the original Jetson Nano (Maxwell GPU) use different GPU architectures, which affects how models are optimized for each device.
- TensorRT optimization: TensorRT engines are optimized for specific hardware, potentially limiting portability.
- Training data management: Uncertainty about how to effectively incorporate new training data with existing datasets.
- Model checkpoint selection: Lack of clarity on how to choose the best model checkpoint for deployment.
Troubleshooting Steps, Solutions & Fixes
- Model Portability:
- ONNX models can be transferred between platforms, including from Jetson Orin Nano to Jetson Nano.
- TensorRT engines are not portable due to hardware-specific optimizations.
- Solution: Export your trained model to ONNX format before transferring it to other Jetson devices; each device then builds (and caches) its own TensorRT engine from the ONNX file on first run. A typical command sequence is shown below.
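For example, assuming the standard jetson-inference pytorch-ssd layout (the models/my_model directory and camera URI are placeholders; the input/output layer names follow the jetson-inference SSD-MobileNet convention), export on the training device:
python onnx_export.py --model-dir=models/my_model
Then copy the models/my_model directory (the .onnx file plus labels.txt) to the target device and run:
detectnet --model=models/my_model/ssd-mobilenet.onnx --labels=models/my_model/labels.txt --input-blob=input_0 --output-cvg=scores --output-bbox=boxes csi://0
The first run on each device takes several minutes while TensorRT builds and caches the engine.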
- Retraining with New Data:
- Option 1: Resume training from a previous checkpoint:
python train_ssd.py --resume=CHECKPOINT [other arguments]
- Option 2: Use a pretrained SSD model:
python train_ssd.py --pretrained-ssd [other arguments]
- Option 3: Train on multiple datasets simultaneously:
python train_ssd.py --data=dataset1 --data=dataset2 [other arguments]
- Recommendation: Try both incremental training and training from scratch to determine which performs better for your specific use case (a combined example follows this list).
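For instance, to resume from an earlier checkpoint while training on the old and new data together (directory names and the checkpoint filename are placeholders; flags beyond those shown above are typical train_ssd.py options, so verify them against your version with --help):
python train_ssd.py --dataset-type=voc --data=data/original --data=data/new_samples --model-dir=models/my_model --resume=models/my_model/mb1-ssd-Epoch-99-Loss-2.58.pth --batch-size=4 --epochs=30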
- Monitoring Training Progress:
- Use the --validate-mean-ap flag to compute per-class accuracies after each epoch:
python train_ssd.py --validate-mean-ap [other arguments]
- This allows you to closely monitor model performance and compare different training approaches; one way to keep a persistent record is shown below.
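Since the per-epoch accuracies are printed to the console, it can help to capture them in a log file for later comparison, e.g. (the log path is a placeholder):
python train_ssd.py --validate-mean-ap [other arguments] 2>&1 | tee models/my_model/training.log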
- Selecting the Best Model Checkpoint:
- By default, the onnx_export.py script automatically selects the model checkpoint with the lowest loss.
- To manually specify a checkpoint (e.g., one with the highest mAP):
python onnx_export.py --checkpoint=path/to/best_checkpoint.pth [other arguments]
- For advanced users: Modify the onnx_export.py script to automatically select the checkpoint with the highest mAP (a sketch of one approach follows this list).
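A minimal sketch of that idea, written as a standalone helper rather than an edit to onnx_export.py for clarity. It assumes you have saved the per-epoch mAP values reported by --validate-mean-ap into a simple epoch,mAP CSV file; that log format, the map_log.csv name, and the models/my_model paths are assumptions to adapt, while the Epoch-<N>-Loss-<loss>.pth naming follows the train_ssd.py checkpoint convention:

# Pick the checkpoint whose epoch achieved the highest mAP.
# Assumes map_log.csv holds one "epoch,mean_ap" line per epoch
# (a hypothetical format -- adapt the parsing to what you record).
import csv
import glob
import re

def best_checkpoint(model_dir, map_log):
    map_by_epoch = {}
    with open(map_log) as f:
        for row in csv.reader(f):
            if len(row) < 2 or not row[0].strip().isdigit():
                continue  # skip headers and blank lines
            map_by_epoch[int(row[0])] = float(row[1])

    best = None  # (mean_ap, checkpoint_path)
    # Checkpoints are named like mb1-ssd-Epoch-<N>-Loss-<loss>.pth;
    # extract <N> so each file can be matched to its logged mAP.
    for path in glob.glob(f"{model_dir}/*Epoch-*-Loss-*.pth"):
        match = re.search(r"Epoch-(\d+)", path)
        if match and int(match.group(1)) in map_by_epoch:
            candidate = (map_by_epoch[int(match.group(1))], path)
            if best is None or candidate > best:
                best = candidate
    return best

if __name__ == "__main__":
    result = best_checkpoint("models/my_model", "models/my_model/map_log.csv")
    if result:
        print(f"mAP {result[0]:.4f}: {result[1]}")

The printed path can then be passed to onnx_export.py via --checkpoint as shown above.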
- Additional Best Practices:
- When retraining, decide whether to build on previous checkpoints or to start fresh with all data combined; neither approach is universally better.
- Compare both strategies (incremental vs. full-dataset training) on the same validation set to find the optimal one for your use case.
- Regularly save and evaluate model checkpoints to ensure you can always revert to the best-performing model.
By following these steps and recommendations, users should be able to effectively train custom models on the Jetson Orin Nano, deploy them on other Jetson devices, and continuously improve their models with new data.