Nvidia Jetson Orin Nano Dev Board Random Shutdowns

Issue Overview

Users are experiencing random shutdowns with Nvidia Jetson Orin Nano 8GB devices running Jetpack 6.0 DP (Developer Preview) in field deployments. The issue is characterized by:

  • Devices going offline unexpectedly
  • Power indicator light being OFF when technicians arrive on-site
  • Manual power cycling required to restore functionality
  • Occurrence across multiple devices
  • No clear pattern or trigger for the shutdowns

The problem is causing significant operational disruption: each incident requires a time-consuming, resource-intensive manual intervention on-site before the device can be restored.

Possible Causes

  1. Power Supply Issues: The power indicator being OFF when technicians arrive points to power-related problems, such as:

    • Unstable power source
    • Faulty power supply unit
    • Inadequate power delivery
  2. Hardware Defects: There could be manufacturing defects or component failures in the Orin Nano boards.

  3. Software Bugs: The use of a Developer Preview version of Jetpack 6.0 increases the likelihood of software-related issues, including:

    • Kernel bugs
    • Driver incompatibilities
    • Memory leaks
  4. Thermal Management Problems: Overheating could trigger automatic shutdowns to protect the hardware.

  5. Firmware Issues: Bugs in the system firmware or UEFI bootloader could lead to unexpected behavior.

  6. Environmental Factors: Extreme temperatures, humidity, or electromagnetic interference at deployment sites might contribute to the problem.

Troubleshooting Steps, Solutions & Fixes

  1. Upgrade Jetpack Version:

    • As suggested in the forum, upgrade to the GA (General Availability) version of Jetpack 6.0 instead of using the DP version.
    • Follow the official Nvidia documentation for the upgrade process.
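    • Note that if the unit was flashed with the DP image, Nvidia's release notes may require a full re-flash with SDK Manager rather than a package upgrade. The commands below are only a sketch of the OTA path and assume the apt sources already point at the Jetpack 6.0 GA (L4T r36.x) repository:
      sudo apt update
      sudo apt dist-upgrade
      sudo reboot
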
  2. Power Supply Checks:

    • Verify that the power supply meets the Orin Nano’s specifications.
    • Test with a known good power supply to rule out PSU issues.
    • Monitor voltage levels using system tools or external measurement devices.
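    • As a rough software-side check, tegrastats can log the module's power-rail readings (e.g. VDD_IN on most Jetson modules; exact rail names vary) with timestamps for later correlation against shutdown times:
      sudo tegrastats --interval 5000 --logfile /var/log/tegrastats.log
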
  3. Log Analysis:

    • Examine syslog and kernel logs for patterns or specific errors.
    • Pay attention to entries just before shutdown events.
    • Look for recurring errors such as the "refcount_t: addition on 0; use-after-free" warning.
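    • For example (a sketch assuming systemd journald, as shipped on standard Jetpack images, with persistent logging from step 13 enabled), the following commands show the tail of the previous boot and search its kernel messages for the refcount warning:
      journalctl -b -1 --no-pager | tail -n 100
      journalctl -k -b -1 --no-pager | grep -i "refcount_t"
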
  4. Thermal Monitoring:

    • Use the following command to check CPU temperatures:
      tegrastats
      
    • Monitor temperatures over time to identify any correlation with shutdown events.
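    • The individual thermal zones can also be read directly from sysfs (zone names and counts vary by module; values are reported in millidegrees Celsius):
      paste <(cat /sys/class/thermal/thermal_zone*/type) <(cat /sys/class/thermal/thermal_zone*/temp)
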
  5. Memory and Resource Usage:

    • Monitor system resource usage using tools like top or htop.
    • Look for memory leaks or processes consuming excessive resources.
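    • For a quick snapshot of memory pressure and the heaviest processes (standard Linux tools, nothing Jetson-specific assumed):
      free -h
      ps aux --sort=-%mem | head -n 10
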
  6. Kernel Parameter Adjustments:

    • Try adding the following kernel parameter to disable the NMI (hard-lockup) watchdog and rule out spurious watchdog-triggered resets:
      nmi_watchdog=0
      
    • Edit /boot/extlinux/extlinux.conf and add the parameter to the APPEND line.
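    • The existing arguments on your APPEND line will differ from this illustration; only nmi_watchdog=0 is appended to the end of the line:
      APPEND ${cbootargs} root=/dev/mmcblk0p1 rw rootwait nmi_watchdog=0
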
  7. Hardware Diagnostics:

    • Run built-in hardware diagnostics if available.
    • Consider replacing the device if issues persist across software versions and configurations.
  8. Environmental Mitigation:

    • Ensure proper ventilation and cooling for deployed devices.
    • If possible, monitor environmental conditions at deployment sites.
  9. Firmware Update:

    • Check for and apply any available firmware or UEFI bootloader updates from Nvidia.
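    • On Jetpack's Ubuntu base, bootloader and firmware updates are normally delivered as nvidia-l4t-* Debian packages, so listing upgradable packages is a quick check (this assumes the standard L4T apt repository is configured):
      sudo apt update
      apt list --upgradable 2>/dev/null | grep -i nvidia-l4t
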
  10. Kernel Module Investigation:

    • Based on the kernel trace, investigate potential issues with the following modules:
      • nvidia_modeset
      • r8168
      • nvvrs_pseq_rtc
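    • To confirm which of these modules are actually loaded and to pull their recent kernel messages (nvvrs_pseq_rtc may be built into the kernel rather than loaded as a separate module):
      lsmod | grep -E "nvidia_modeset|r8168|nvvrs"
      sudo dmesg | grep -iE "nvidia_modeset|r8168|nvvrs_pseq_rtc"
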
  11. Network Interface Monitoring:

    • Monitor the stability of network interfaces, particularly eth0, as logs show link up/down events.
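    • Link flaps are logged by the kernel, so filtering for them gives a quick stability indicator (assuming the interface is named eth0, as in the logs):
      sudo dmesg | grep -iE "eth0.*link"
      journalctl -k -f | grep -i "eth0"
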
  12. System Stability Test:

    • Run stress tests to identify potential breaking points:
      sudo apt install stress-ng
      stress-ng --cpu 8 --io 4 --vm 2 --vm-bytes 128M --timeout 10m
      
  13. Collect Detailed Logs:

    • Enable more verbose logging and configure persistent journald storage so that messages leading up to a shutdown survive the power cycle (creating /var/log/journal switches journald's default "auto" storage to disk):
      sudo mkdir -p /var/log/journal
      sudo systemctl restart systemd-journald
      
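    • After the next reboot, seeing more than one entry here confirms that logs now persist across power cycles:
      journalctl --list-boots
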
  14. Contact Nvidia Support:

    • If issues persist after trying these steps, reach out to Nvidia’s technical support with collected logs and diagnostic information.

Remember to document all changes and their effects throughout the troubleshooting process. This will help in identifying patterns and communicating the issue effectively if escalation to Nvidia support becomes necessary.
