Just following up: it is done! All the evaluators are running off of ZFS now! Here's a little more backstory on the issue and why we decided to move the machines to ZFS:
Originally, the machines ran off of EXT4. For years this seemed to be fine, but recently we started running into issues where the machines would complain about a lack of free disk space. When we went to check, however, it wasn't disk space that was the problem, but available inodes!
EXT4 has a fixed number of inodes, set when the filesystem is created (by default roughly one per 16 KiB of space, with a hard ceiling of about 4 billion). While derivations (e.g. the .drv files themselves) are small, each one takes up at least one inode. Although the garbage collector knows how to "free X amount of data", it doesn't know how to make sure it frees "X amount of inodes". This led to the disk having plenty of space, but absolutely no inodes available.
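To make the failure mode concrete: a check that only looks at free bytes can pass while the inode table is completely used up. Here's a minimal Python sketch that uses the standard `os.statvfs` call to check both. The `FsUsage` type, the `needs_gc` helper and any thresholds you pass in are hypothetical illustrations, not part of our actual monitoring or GC tooling.

```python
import os
from dataclasses import dataclass


@dataclass
class FsUsage:
    free_bytes: int    # bytes available to unprivileged users
    free_inodes: int   # inodes available to unprivileged users
    total_inodes: int  # total inodes on the filesystem


def fs_usage(path: str) -> FsUsage:
    """Report free space and free inodes for the filesystem containing `path`."""
    st = os.statvfs(path)
    return FsUsage(
        free_bytes=st.f_bavail * st.f_frsize,
        free_inodes=st.f_favail,
        total_inodes=st.f_files,
    )


def needs_gc(usage: FsUsage, min_free_bytes: int, min_free_inodes: int) -> bool:
    """Hypothetical trigger: flag the filesystem if EITHER resource runs low.

    A byte-only check would miss the case above: plenty of space, zero inodes.
    """
    return usage.free_bytes < min_free_bytes or usage.free_inodes < min_free_inodes
```

For example, `needs_gc(fs_usage("/nix/store"), 50 * 2**30, 1_000_000)` would flag the store when either fewer than 50 GiB or fewer than a million inodes remain; those numbers are arbitrary.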
Enter ZFS: ZFS gives you effectively unlimited inodes (it allocates them as needed rather than fixing their number up front), so long as you have the disk space to support them.
Then, in order to actually get the machines running ZFS, I had to do a few things:
- Use Equinix Metal's new NixOS support to deploy the instances
- Set up customdata in order to set up the ZFS pool using the disks exposed to the instance
- Deploy the instances one by one in order to verify they worked properly and wouldn't fall over if left alone
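As a rough illustration of the pool-creation step, assuming the instance's disk device paths have already been discovered, the hypothetical helper below builds a `zpool create` command line. The pool name, the pool properties (`ashift=12`, `compression=lz4`, `mountpoint=none`) and the plain striped layout with no mirror or raidz redundancy are illustrative assumptions, not the values our customdata actually uses.

```python
import shlex


def zpool_create_argv(pool: str, disks: list[str]) -> list[str]:
    """Build a `zpool create` argv striping across all given disks.

    Hypothetical: pool name, properties and layout are illustrative only.
    """
    if not disks:
        raise ValueError("need at least one disk to create a pool")
    return [
        "zpool", "create",
        "-o", "ashift=12",        # assume 4 KiB physical sectors
        "-O", "compression=lz4",  # a common low-overhead default
        "-O", "mountpoint=none",  # datasets (e.g. for /nix) get mountpoints later
        pool,
        *disks,                   # plain stripe: no mirror/raidz vdev keyword
    ]


def zpool_create_command(pool: str, disks: list[str]) -> str:
    """Render the argv as a single shell-safe string, e.g. for a boot script."""
    return shlex.join(zpool_create_argv(pool, disks))
```

Returning an argv list rather than a string keeps the disk paths safe to pass straight to `subprocess.run`; the string form is only for embedding in a shell script.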
For the moment, things seem to be working just fine! Fingers crossed, no more running-out-of-space alerts until we're actually running out of space... That said, I will still keep an eye on it. Once again, if you notice anything out of the ordinary, don't be a stranger: ping me (here or on the related issue / PR)!