Joined 1 year ago
Cake day: October 26th, 2023


  • Here is my over-the-top method.

    ++++++++++++++++++++++++++++++++++++++++++++++++++++

    My Testing methodology

    This is something I developed to stress both new and used drives so that if there are any issues they will appear.
    Testing can take anywhere from 4-7 days depending on hardware. I have a dedicated testing server set up.

    I use a server with ECC RAM installed, but if your RAM has been tested with MemTest86+ then you are probably fine.

    1. SMART Test, check stats

    smartctl -i /dev/sdxx

    smartctl -A /dev/sdxx

    smartctl -t long /dev/sdxx
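A small helper for the "check stats" part, as a rough sketch only. `smartctl -t long` returns right away and the test runs inside the drive, so the result has to be read back later with `smartctl -l selftest`. The `smart_raw` function name is made up for illustration, and it assumes the ATA-style attribute table from `smartctl -A` (name in column 2, raw value in column 10); SAS drives report differently and would need something else.

```shell
# Print the raw value of one attribute from `smartctl -A` output on stdin.
# Assumption: ATA-style table, column 2 = attribute name, column 10 = raw value.
smart_raw() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Hypothetical usage (needs root and a real device):
#   smartctl -A /dev/sdxx | smart_raw Reallocated_Sector_Ct
#   smartctl -l selftest /dev/sdxx   # self-test log, once the long test has finished
```

Comparing the raw values before and after the remaining steps is one way to spot a drive that is getting worse under load.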

    2. BadBlocks - This is a complete write and read test; it will destroy all data on the drive.

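    badblocks -b 4096 -c 65535 -wsv /dev/sdxx > $disk.log

    ($disk must already be set to a name for the drive; if it is empty, the output goes to a file named ".log".)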
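One possible way to get a per-drive log name for the badblocks run, sketched under my own assumptions: the `log_for` helper is hypothetical, and it simply takes the last path component of the device.

```shell
# Derive a log file name from a device path, e.g. /dev/sdxx -> sdxx.log.
# log_for is an illustrative helper, not part of badblocks.
log_for() {
    disk=$(basename "$1")
    printf '%s.log\n' "$disk"
}

# Hypothetical usage (this DESTROYS all data on the drive):
#   dev=/dev/sdxx
#   badblocks -b 4096 -c 65535 -wsv "$dev" > "$(log_for "$dev")"
```

With several drives under test at once, each run then writes to its own file instead of overwriting a shared one.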

    3. Real-world surface testing: format to ZFS. Yes, you want compression on. I have found checksum errors that having compression off would have missed. (I noticed this completely by accident. I had a drive that produced checksum errors whenever it was in a pool, so I pulled it and ran my test without compression on. It passed just fine. I put it back into the pool and the errors appeared again. The pool had compression on, so I pulled the drive and re-ran my test with compression on, and got checksum errors. I have asked around, and no one knows why this happens, but it does. This may have been a bug in early versions of ZOL that is no longer present.)

    zpool create -f -o ashift=12 -O logbias=throughput -O compress=lz4 -O dedup=off -O atime=off -O xattr=sa TESTR001 /dev/sdxx

    zpool export TESTR001

    sudo zpool import -d /dev/disk/by-id TESTR001

    sudo chmod -R ugo+rw /TESTR001

    4. Fill test using F3, plus 5. ZFS scrub to check for any read, write, or checksum errors.

    sudo f3write /TESTR001 && f3read /TESTR001 && sudo zpool scrub TESTR001
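Note that `zpool scrub` only starts the scrub and returns, so the error counters are not final until it completes. A minimal sketch of waiting for it, assuming `zpool status` prints the phrase "scrub in progress" while a scrub is running; the function names and the 60-second polling interval are my own choices.

```shell
# Succeeds while the `zpool status` text on stdin reports a running scrub.
# Assumption: the status output contains "scrub in progress" during a scrub.
scrub_running() {
    grep -q 'scrub in progress'
}

# Poll until the scrub on the given pool is done, then print the full status.
wait_for_scrub() {
    pool="$1"
    while zpool status "$pool" | scrub_running; do
        sleep 60
    done
    zpool status -v "$pool"
}

# Hypothetical usage (needs root):
#   wait_for_scrub TESTR001
```

The READ, WRITE, and CKSUM columns in the final `zpool status -v` output are the ones to look at for the pass or fail decision.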

    If everything passes, the drive goes into my good pile. If something fails, I contact the seller to get a partial refund for the drive or a return label to send it back. I record the WWN and serial number of each drive, along with a copy of any test notes.

    8TB wwn-0x5000cca03bac1768 - Failed, 26 read errors, non-recoverable, drive is unsafe to use.

    8TB wwn-0x5000cca03bd38ca8 - Failed, checksum errors, possibly recoverable, drive use is not recommended.
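For keeping notes like the two lines above, something along these lines would do. This is only a sketch: `record_result` and the notes file layout are invented for illustration, not a tool I am claiming exists.

```shell
# Append one result line to a notes file: size, WWN, verdict, free-text note.
# record_result is a hypothetical helper; the line layout is an assumption.
record_result() {
    notes_file="$1"; size="$2"; wwn="$3"; verdict="$4"; note="$5"
    printf '%s %s - %s, %s\n' "$size" "$wwn" "$verdict" "$note" >> "$notes_file"
}

# Hypothetical usage:
#   record_result drive-notes.txt 8TB wwn-0x5000cca03bac1768 Failed "26 read errors"
```

Keying the notes on the WWN rather than on /dev/sdxx means the record still matches the drive after it moves to another slot or machine.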

    ++++++++++++++++++++++++++++++++++++++++++++++++++++


  • I have a few 24-drive RAIDz3 pools. As long as you can live with the longer scrub and resilver times, they make a good archive or backup pool, but I would not really want one as an always-on active pool. If you want to know the estimated failure rate, here is a calculator.

    Not sure if it’s broken or if my mobile Firefox browser just doesn’t like it, but I keep getting a result of 0% failure rate, which looks like an error. There are other calculators if you google them, though.

    https://www.servethehome.com/raid-calculator/raid-reliability-calculator-simple-mttdl-model/

    I’ve been told that large (e.g. 20 disk) vdevs are bad because resilvers will take a very long time, which creates higher risk of pool failure. How bad of an idea is this?

    I normally only have to replace 1 drive at a time, and with RAIDz3 you have to lose 4 drives at the same time for data loss to happen. If you are using mixed batches of drives (not all from the same run), the chance of this happening is very low, and it is usually due to some other event (overheating, fire, a cow attacking the disk shelf). In the 5 years I have had these pools, the worst was losing 1 drive and having errors pop up on another drive, which were still corrected because RAIDz3 has 3 drives of protection.
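To put a rough number on "very low", here is a back-of-the-envelope sketch. It assumes every drive fails independently with the same probability p during one resilver window; both the independence and the example value of p are assumptions for illustration, not measurements, and correlated events like heat or fire break the model entirely.

```latex
% n = 24 drives in the vdev, X = number of drives failing in the same window
% p = assumed per-drive failure probability during that window
P(\text{data loss}) = P(X \ge 4) = \sum_{k=4}^{24} \binom{24}{k} p^{k} (1-p)^{24-k}

% for small p the k = 4 term dominates:
P(X \ge 4) \approx \binom{24}{4} p^{4} = 10626\, p^{4}

% example with a hypothetical p = 10^{-3}:
10626 \times \left(10^{-3}\right)^{4} \approx 10^{-8}
```

Because the result scales with the fourth power of p, anything that raises p for all drives at once (same bad batch, overheating) hurts far more than the wide vdev itself.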