Device Stuck After Several Weeks, Watchdog Issues on Nvidia Jetson Orin Nano Dev Board
Issue Overview
An Nvidia Jetson Orin Nano Dev Board becomes unresponsive after several weeks of continuous operation. Users report that it stays stuck until the power is unplugged and reconnected, which suggests that the watchdog is not resetting the system on its own.
Symptoms
- The device fails to respond, requiring a manual restart.
- Log messages indicate various errors related to device support and interrupt handling.
- Users have noted that the issue occurs after running applications, particularly Python scripts, but they believe the applications are not the root cause.
Context
- The issue has been reported on a custom carrier board (Orin Nano 8GB from Seeed).
- The Jetpack version in use is 5.1.1.
- The problem appears to occur inconsistently, with users noting it has happened twice over a year across multiple devices.
Impact
This issue significantly affects user experience as it disrupts operations, especially in environments where manual intervention is not feasible. The inability to automatically recover from a hang state poses a risk for deployments in client settings.
Possible Causes
- Hardware Incompatibilities or Defects: Custom boards may not fully support all features of the Orin Nano, leading to instability.
- Software Bugs or Conflicts: The Jetpack version may contain bugs that lead to system hangs or crashes under specific conditions.
- Configuration Errors: If the watchdog is disabled or misconfigured, it will not reset the system when a hang occurs.
- Driver Issues: Outdated or incompatible drivers could lead to conflicts and system hangs.
- Environmental Factors: Power supply issues or overheating could contribute to system instability.
- User Errors or Misconfigurations: Incorrect setup or usage patterns may exacerbate underlying issues.
Troubleshooting Steps, Solutions & Fixes
To address the issue effectively, users can follow these troubleshooting steps:
Step 1: Gather System Information
- Check the installed L4T release, which maps to the Jetpack version (Jetpack 5.1.1 corresponds to L4T 35.3.1):
dpkg -l | grep nvidia-l4t-core
- Review system logs for any error messages:
dmesg | less
journalctl -xe | less
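Because recovery requires a power cycle, the messages that matter were usually written during the previous boot, not the current one. The sketch below filters log text for lines that often accompany a hang. The function name filter_hang_clues and the keyword list are assumptions chosen for illustration, not an NVIDIA-provided list, and reading the previous boot with journalctl -b -1 works only if persistent journaling is enabled on the device.

```shell
#!/bin/sh
# Hypothetical helper: keep only log lines that often accompany a hang.
# The keyword list is an assumption; extend it with strings seen in your own logs.
filter_hang_clues() {
    grep -i -E 'watchdog|soft lockup|hung task|rcu.*stall|out of memory|thermal|kernel panic'
}

# Usage on the device, after the forced power cycle:
#   journalctl -k -b -1 --no-pager | filter_hang_clues
# -k limits output to kernel messages and -b -1 selects the previous boot.
# For the current boot:
#   dmesg | filter_hang_clues
```

If the previous boot's journal is empty, the journal is probably volatile, and the hang will have to be captured some other way, for example over a serial console.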
Step 2: Reproduce the Issue
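- Conduct a stress test by running CPU-intensive tasks to see if the device hangs under load (if the tool is missing, install it with sudo apt install stress). A hang that takes weeks to appear may not show up in a five-minute run, so a clean result does not rule out load as a factor: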
stress --cpu 8 --timeout 300
- Monitor system performance and logs during this test.
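One way to monitor the device during the test is to log temperatures alongside it, since overheating is among the possible causes listed above. The sketch below reads the standard Linux thermal sysfs entries. The function name read_temps and the THERMAL_ROOT override are assumptions for illustration, and the zone names reported depend on the board.

```shell
#!/bin/sh
# Standard Linux sysfs location; overridable so the function can be tried elsewhere.
THERMAL_ROOT="${THERMAL_ROOT:-/sys/class/thermal}"

# Print one line per readable thermal zone: "<zone type> <whole degrees C>".
read_temps() {
    for zone in "$THERMAL_ROOT"/thermal_zone*; do
        [ -r "$zone/temp" ] || continue
        # Some zones refuse reads while their sensor is off; skip those.
        millideg=$(cat "$zone/temp" 2>/dev/null) || continue
        zone_type=$(cat "$zone/type" 2>/dev/null || echo unknown)
        echo "$zone_type $((millideg / 1000))"
    done
}

# Usage while the stress test runs (Ctrl-C to stop):
#   while true; do
#       echo "$(date '+%F %T') $(read_temps | tr '\n' ' ')"
#       sleep 5
#   done >> stress-temps.log
```

Comparing the last logged temperatures before a hang against normal readings can help confirm or rule out a thermal cause.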
Step 3: Check Watchdog Configuration
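- Ensure that the watchdog is correctly configured and enabled. Do not read /dev/watchdog directly: opening the device can arm the timer, and if nothing keeps feeding it the board may reboot. Check that the device node exists and look at what the driver reported at boot instead:
ls -l /dev/watchdog*
dmesg | grep -i watchdog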
- Verify that the watchdog timeout is set appropriately (default is often around 120 seconds).
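The timeout and state can be read without opening the watchdog device, through sysfs. The sketch below assumes the kernel exposes these attributes (they exist only when it was built with CONFIG_WATCHDOG_SYSFS) and that the hardware watchdog is watchdog0. The function name show_watchdog and the WDT_SYSFS override are hypothetical.

```shell
#!/bin/sh
# Assumed location of the first watchdog device; adjust if the board has several.
WDT_SYSFS="${WDT_SYSFS:-/sys/class/watchdog/watchdog0}"

# Print whichever of the standard watchdog attributes are present.
show_watchdog() {
    for attr in identity state timeout timeleft nowayout; do
        if [ -r "$WDT_SYSFS/$attr" ]; then
            value=$(cat "$WDT_SYSFS/$attr" 2>/dev/null) || value="unreadable"
            echo "$attr=$value"
        fi
    done
}

# Usage on the device:
#   show_watchdog
```

If nothing is printed, the attributes are not exposed on this kernel. Where systemd is expected to feed the hardware watchdog, the RuntimeWatchdogSec= setting in /etc/systemd/system.conf controls that behavior; the source does not say whether the Jetpack image sets it.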
Step 4: Update Software and Drivers
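- Consider updating to a newer version of Jetpack if available. SDK Manager runs on a host PC rather than on the Jetson itself; launch it there and follow the prompts. For a third-party carrier board, check first whether the vendor requires its own board support package:
sdkmanager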
- Ensure all drivers are up-to-date.
Step 5: Implement Workarounds
- Set up a cron job for daily reboots as a temporary fix. Use root's crontab, since rebooting requires root privileges:
sudo crontab -e
# Add the following line for a daily reboot at 6 AM
0 6 * * * /sbin/shutdown -r now
- Use smart sockets for remote power cycling if manual restarts are impractical.
Step 6: Test with Different Configurations
- If possible, test with different hardware configurations (e.g., different power supplies or carrier boards) to isolate the issue.
Step 7: Document Findings
- Keep detailed logs of any occurrences of the issue along with steps taken for resolution. This documentation can help identify patterns and inform future troubleshooting efforts.
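Part of this record keeping can be automated with a heartbeat log, so that the time of each hang can be read off afterwards as the last line written before a gap. The following is a minimal sketch; the function name write_heartbeat and the default log path are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical log location; overridable through the environment.
HEARTBEAT_LOG="${HEARTBEAT_LOG:-/var/log/heartbeat.log}"

# Append one line: timestamp, uptime in whole seconds, 1-minute load average.
write_heartbeat() {
    read -r up _ < /proc/uptime
    read -r load1 _ < /proc/loadavg
    printf '%s uptime_s=%s load1=%s\n' \
        "$(date '+%Y-%m-%dT%H:%M:%S%z')" "${up%.*}" "$load1" >> "$HEARTBEAT_LOG"
    sync   # flush to storage so the line survives a hard power cycle
}

# Usage: save this file as, say, /usr/local/bin/heartbeat.sh (hypothetical path),
# add a final line that calls write_heartbeat, and run it every minute from
# root's crontab:
#   * * * * * /usr/local/bin/heartbeat.sh
```

The uptime value also shows whether the device rebooted on its own between two entries, which helps tell a watchdog reset apart from a manual power cycle.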
Additional Resources
For further assistance, consider consulting:
- Nvidia’s official documentation on Watchdog Timer
- Community forums for similar issues and solutions shared by other users.
Unresolved Aspects
Some users have noted that despite following these steps, issues persist, particularly regarding hardware-level failures that cannot be resolved through software means alone. Further investigation into hardware compatibility and potential defects may be necessary for long-term solutions.