What should I monitor/log and how should I monitor/log to determine why my headless NAS is often becoming unavailable?
The problem:
- Another machine that depends on the NAS routinely has its services unavailable because the NFS mounts are no longer mounted.
- When that happens, sometimes a sudo mount -a recovers them.
- Other times, the NAS is not pingable, so I go to the physical host, plug in monitor/keyboard and find that I can’t log in. The login screen is frozen, requiring hard reboot.
- Often when I leave a monitor attached (VGA), I come back to a screen that says:
critical medium error, dev sda, sector 163776752 op 0x0:(READ) flags 0x700 phys_seg 1 prio class 2
I started sudo smartctl -t long /dev/sda a few hours ago, and sometime since then, the server depending on the NAS lost its NFS mounts again. A simple sudo mount -a resolved it.
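One thing that could be logged is how often, and on which device, kernel errors like the medium error quoted above show up. The following is a minimal sketch, not a definitive tool: the regular expression is an assumption modelled on that single quoted line (plus the similar "I/O error, dev ..." wording), and the script name io_errors.py and the function names are made up for illustration.

```python
import re
import sys
from collections import Counter

# Matches kernel block-layer error lines such as:
#   critical medium error, dev sda, sector 163776752 op 0x0:(READ) flags 0x700 ...
# Assumption: the pattern is based on the one line quoted in the post;
# other kernel versions may word these messages differently.
IO_ERROR_RE = re.compile(
    r"(?P<kind>critical medium error|I/O error), "
    r"dev (?P<dev>\w+), sector (?P<sector>\d+)"
    r"(?: op 0x[0-9a-fA-F]+:\((?P<op>\w+)\))?"
)


def parse_io_error(line):
    """Return kind, dev, sector and op for a matching line, else None."""
    m = IO_ERROR_RE.search(line)
    if m is None:
        return None
    return {
        "kind": m.group("kind"),
        "dev": m.group("dev"),
        "sector": int(m.group("sector")),
        "op": m.group("op"),  # None when the line carries no op field
    }


def summarize(lines):
    """Count matching error lines per (device, kind)."""
    counts = Counter()
    for line in lines:
        hit = parse_io_error(line)
        if hit is not None:
            counts[(hit["dev"], hit["kind"])] += 1
    return counts


if __name__ == "__main__":
    # Each argument is a saved kernel log file; with no arguments nothing runs.
    for path in sys.argv[1:]:
        with open(path, errors="replace") as fh:
            for (dev, kind), n in sorted(summarize(fh).items()):
                print(f"{path}\t{dev}\t{kind}\t{n}")
```

For example, after a hard reboot the previous boot's kernel messages could be saved with journalctl -k -b -1 > prev_boot.log and then summarized with python3 io_errors.py prev_boot.log. This only helps if the journal is persistent across reboots, and a hard freeze may lose the last few messages.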
What the server was also doing when it had a network blip:
- rclone was backing up to Backblaze B2
- Acting as NFS server for Plex/*arr media server
- Acting as NFS storage for Proxmox machine (but no VMs or CTs running)
Pasted some zpool output below. Details about the machine:
- Repurposed old hardware, just built this Debian 12 NAS a couple months ago
- Operates as backup destination for other machines
- Operates as media location for my Plex machine - other server that mounts the NAS via NFS.
- P6X58D-E LGA 1366 motherboard, Intel X5670 CPU, 18 GB (3x4GB, 3x2GB triple channel)
- 8 hard drives connected to LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)
- 10GbE to managed TP-Link switch through one port on Mellanox Connectx-3 MCX312A-XCBT EN
➜ sudo zpool list
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
nvr   5.45T  3.35T  2.10T        -         -     2%    61%  1.00x    ONLINE  -
tank  70.9T  34.4T  36.5T        -         -     0%    48%  1.00x    ONLINE  -
➜ sudo zpool status -v
  pool: nvr
 state: ONLINE
  scan: scrub repaired 0B in 08:49:40 with 0 errors on Sun Nov 12 09:13:41 2023
config:

        NAME            STATE     READ WRITE CKSUM
        nvr             ONLINE       0     0     0
          mirror-0      ONLINE       0     0     0
            6T-75LN0J4  ONLINE       0     0     0
            6T-95A2PNV  ONLINE       0     0     0

errors: No known data errors

  pool: tank
 state: ONLINE
  scan: scrub repaired 1M in 16:44:16 with 0 errors on Sun Nov 12 17:08:27 2023
config:

        NAME              STATE     READ WRITE CKSUM
        tank              ONLINE       0     0     0
          raidz1-0        ONLINE       0     0     0
            12T-5PGJ4A0D  ONLINE       0     0     0
            12T-Z2J26EBT  ONLINE       0     0     0
            12T-5PGHSZJC  ONLINE       0     0     0
          raidz1-1        ONLINE       0     0     0
            14T-9KG38U5L  ONLINE       0     0     0
            14T-9KG81HRL  ONLINE       0     0     0
            14T-9RGG5ZDC  ONLINE       0     0     0

errors: No known data errors
Random freezes, I’d be checking for RAM and power supply problems. Power means PSU and motherboard in this case.
Start with memtest86+ and go from there.
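To tell a full freeze apart from an NFS-only hiccup, it may also help to log reachability from the client side with timestamps, so the outage times can be lined up with whatever the NAS logged. Below is a rough sketch under some assumptions: the hostname is passed on the command line, NFS is taken to listen on TCP port 2049 and SSH on port 22, and the script name nas_watch.py and all function names are hypothetical.

```python
import socket
import sys
import time
from datetime import datetime, timezone


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def status_line(host, checks, now=None):
    """Format one timestamped log line from a {name: bool} mapping."""
    now = now or datetime.now(timezone.utc)
    fields = " ".join(
        f"{name}={'up' if ok else 'down'}" for name, ok in checks.items()
    )
    return f"{now.isoformat(timespec='seconds')} {host} {fields}"


def monitor(host, ports, interval=60.0, rounds=None):
    """Print one status line per interval; rounds=None means run forever."""
    done = 0
    while rounds is None or done < rounds:
        checks = {name: port_open(host, port) for name, port in ports.items()}
        print(status_line(host, checks), flush=True)
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Assumed ports: 2049 for NFS, 22 for SSH. Adjust to the actual setup.
    monitor(sys.argv[1], {"nfs": 2049, "ssh": 22})
```

Run on the machine that mounts the NAS, for example as python3 nas_watch.py followed by the NAS hostname, with output appended to a file. If both ports go down together the whole box (or its network path) is gone; if only NFS drops while SSH stays up, the problem is more likely in the storage or NFS layer than a hard freeze.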