NixOS Archivists (!siOVEzpzgLbkHTjpmA:numtide.com)
56 members · 18 servers

Topic: Taking care of NixOS historical build artifacts and GC.
Meeting notes: https://pad.lassul.us/nixos-cache-gc
For self-hosting, see #binary-cache-selfhosting:nixos.org

12 Jun 2024
tomberek: edef: do you need anything to start "Glacierize"? (21:25:01)
flokli: A big snow cannon (22:39:36)
14 Jun 2024
tpw_rules: what does that mean for users? (20:36:05)
tomberek: Proposal: start by copying NARs created prior to 2024, and keep a record of the copy. Meanwhile, we can generate some proposed sets to delete: 1. unreachable NARs, 2. infrequently accessed, 3. oldest. This would mean that if you wanted to build something in one of those sets, you'd have to rebuild. We can try to make this a rarer occurrence, but it will happen. In that case we would need a procedure to either let you access the old data at your own cost, or restore it to the mainline cache. (23:29:24)
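A minimal sketch of what the "copy NARs created prior to 2024" step could look like, assuming boto3, an AWS-hosted cache, and hypothetical bucket names; the actual tooling, bucket layout, and cold tier were not specified in the discussion.

```python
# Hypothetical sketch: server-side copy of NAR objects last modified before
# 2024 into a Glacier-class archive bucket, keeping a record of each copy.
# Bucket names, prefix, and the DEEP_ARCHIVE tier are assumptions.
from datetime import datetime, timezone
import boto3

SOURCE_BUCKET = "nix-cache"            # assumed name
ARCHIVE_BUCKET = "nix-cache-archive"   # assumed name
CUTOFF = datetime(2024, 1, 1, tzinfo=timezone.utc)

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

for page in paginator.paginate(Bucket=SOURCE_BUCKET, Prefix="nar/"):
    for obj in page.get("Contents", []):
        if obj["LastModified"] >= CUTOFF:
            continue
        # Note: copy_object only handles sources up to 5 GB; larger NARs
        # would need a multipart copy instead.
        s3.copy_object(
            Bucket=ARCHIVE_BUCKET,
            Key=obj["Key"],
            CopySource={"Bucket": SOURCE_BUCKET, "Key": obj["Key"]},
            StorageClass="DEEP_ARCHIVE",
            MetadataDirective="COPY",
        )
        # Record what was copied, per the proposal.
        print(obj["Key"], obj["Size"], obj["LastModified"].isoformat())
```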
16 Jun 2024
tpw_rules: so NARs will not be automatically un-glacierized? by delete here do you mean from the main cache or from glacier? doesn't glacierization increase egress costs? (17:08:58)
tomberek: No, for simplicity, and at least for now, the idea is to start copying things over; that will be desirable almost no matter what. An automatic retrieval could be nice; perhaps make that "requester pays". And we can restore some set of objects if we find them popular and useful enough to do so. For deletion, I meant deletion from S3. (17:38:06)
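A hedged sketch of what "access old data at your own cost" could look like on AWS: a requester-pays restore of a Glacier-class object. The bucket name, key, retention period, and retrieval tier are all assumptions, and this is only one possible wiring of the idea.

```python
import boto3

s3 = boto3.client("s3")

# Ask S3 to thaw one archived NAR for a few days. With RequestPayer set (and
# requester-pays enabled on the bucket), retrieval and transfer charges are
# billed to the requester rather than the bucket owner.
s3.restore_object(
    Bucket="nix-cache-archive",          # assumed name
    Key="nar/<hash>.nar.xz",             # placeholder key
    RequestPayer="requester",
    RestoreRequest={
        "Days": 7,                                   # assumed retention
        "GlacierJobParameters": {"Tier": "Bulk"},    # cheapest, slowest tier
    },
)
```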
17 Jun 2024
tpw_rules: what do you mean by s3? as in "not glacier" or AWS entirely? (14:00:57)
tomberek: (Blame AWS for the confusing naming.) I normally refer to S3 as distinct from Glacier, even though that's sometimes confusing because of the storage tiering. I know there are some ideas to move to R2 or Tigris, but I'm not aware of a clear winner for this "long-term low-cost archive" portion of the data. (15:05:44)
tpw_rules: ok i see, thanks for the explanation. i just wanted to be certain about any plans to permanently delete anything. it sounds like there are not any at this point (15:31:54)
24 Jun 2024
@janik0:matrix.org left the room. (08:36:26)
29 Jun 2024
mib 🥐 joined the room. (22:24:45)
4 Jul 2024
Philip Taron (UTC-8) left the room. (15:46:35)
Philip Taron (UTC-8) joined the room. (15:57:02)
10 Jul 2024
Sami Liedes joined the room. (12:06:45)
Sami Liedes: Maybe I should introduce myself. I previously thought I had a question about getting old versions of packages, but I solved it, so here's just an introduction and a brain dump.

I got interested in the cache disk-usage problem even before I installed NixOS about a week ago. I'm still new to Nix, but I do think I know compression and storage. I read the Discourse topics about compression and the January announcement about garbage collection. I see that the space is a big expense, and also that there are nuances, e.g. doing anything with the data can itself cost a lot in egress (unless done inside AWS?).

For playing around, I downloaded something like 40 old versions of firefox-unwrapped. It seems potentially suboptimal to me that all NARs are independently compressed and stored, but it's hard to know. Is there a good intuition for the cost of CPU versus storage and network? Using xz is already mildly surprising to me, but it may well make sense for Nix (and of course it also affects users).

I recompressed the NARs I got with zstd --rsyncable, which tries to synchronize the stream at suitable points (but rarely), adding marginally to the file size but allowing more effective binary diffs. I only had time to test an rdiff between two .xz files, and I think the delta was around 80% of the size of the original xz, which is not very good (but I just chose two random hashes, so maybe two close releases do better).

One interesting aspect of zstd is that at many compression levels it is so CPU-cheap that you can just use it without thinking much (almost memcpy-cheap). So something worth at least considering: it might, at least in theory, be possible to serve each request from a more complicated, zstd-compressed object store with diffs, decompress it, construct the resulting object (using a cheap algorithm), and recompress it at a carefully but deterministically chosen zstd level. But I did get that S3 GET and PUT costs are a major factor, so something like this might need a different backend if it produces a ton of blocks.

The hash is of the uncompressed NAR, I assume? (21:23:33)
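For concreteness, a small sketch of the recompression experiment described above, assuming the xz, zstd, and rdiff binaries are on PATH; file names are placeholders and the compression level is an arbitrary choice.

```python
# Sketch: re-encode an xz-compressed NAR with `zstd --rsyncable`, then
# measure how well rdiff (librsync) delta-encodes one file against another.
import os
import subprocess

def xz_to_zstd_rsyncable(src_xz: str, dst_zst: str, level: int = 19) -> None:
    """Decompress an .xz NAR and recompress it with rsyncable zstd."""
    with open(dst_zst, "wb") as out:
        xz = subprocess.Popen(["xz", "-dc", src_xz], stdout=subprocess.PIPE)
        # -T0 enables the multithreaded compressor, which --rsyncable needs.
        subprocess.run(
            ["zstd", f"-{level}", "--rsyncable", "-T0", "-c"],
            stdin=xz.stdout, stdout=out, check=True,
        )
        xz.stdout.close()
        xz.wait()

def rdiff_delta_size(old: str, new: str) -> int:
    """Signature of the old file, then a delta for the new one; return delta size."""
    subprocess.run(["rdiff", "signature", old, "old.sig"], check=True)
    subprocess.run(["rdiff", "delta", "old.sig", new, "new.delta"], check=True)
    return os.path.getsize("new.delta")
```

Comparing the delta size against the full compressed size of the newer file is what yields the roughly-80% figure mentioned in the message above.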
edef: which hash? FileHash is of the compressed NAR, and is used in the filename (21:25:14)
edef: NarHash is used for signature calculations etc (21:25:33)
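To make the FileHash/NarHash distinction concrete, a hedged sketch of reading a .narinfo from cache.nixos.org; the store-path hash is a placeholder, and the parsing assumes the usual "Key: value" narinfo layout.

```python
# Sketch: fetch a .narinfo and show which hash is which.
# FileHash -> hash of the compressed NAR, also embedded in the nar/... URL.
# NarHash  -> hash of the uncompressed NAR, the one covered by the signature.
import urllib.request

STORE_HASH = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"  # placeholder store-path hash

with urllib.request.urlopen(f"https://cache.nixos.org/{STORE_HASH}.narinfo") as r:
    info = dict(
        line.split(": ", 1)
        for line in r.read().decode().splitlines()
        if ": " in line
    )

print("URL:     ", info["URL"])       # e.g. nar/<FileHash>.nar.xz
print("FileHash:", info["FileHash"])  # compressed NAR
print("NarHash: ", info["NarHash"])   # uncompressed NAR, used for signatures
```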
tomberek: I'm interested in those sorts of more advanced usages of zstd to do smarter things (e.g. https://github.com/facebook/zstd/blob/dev/contrib/seekable_format/zstd_seekable_compression_format.md), but when it comes to the current S3 issue, I think we need to compartmentalize things: split the cache into a long-term, low-cost part and a high-cost, beginner-optimized, high-usage part. (21:31:06)
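A toy illustration of the idea behind the seekable format linked above, not the format itself: compress fixed-size chunks as independent zstd frames and keep an offset table, so a byte range can be decompressed without reading the whole NAR. It assumes the python-zstandard bindings; the real seekable format additionally stores its seek table in a skippable frame at the end of the file.

```python
# Toy version of seekable zstd: one independent frame per chunk plus an index.
import zstandard as zstd

CHUNK = 1 << 20  # 1 MiB of uncompressed data per frame (assumed)

def compress_seekable(data: bytes, level: int = 9):
    cctx = zstd.ZstdCompressor(level=level)
    frames, index, off = [], [], 0
    for i in range(0, len(data), CHUNK):
        frame = cctx.compress(data[i : i + CHUNK])  # independent frame
        index.append((off, len(frame)))             # (compressed offset, size)
        frames.append(frame)
        off += len(frame)
    return b"".join(frames), index

def read_chunk(blob: bytes, index, n: int) -> bytes:
    """Decompress only the n-th chunk using the offset table."""
    off, size = index[n]
    return zstd.ZstdDecompressor().decompress(blob[off : off + size])
```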
tomberek: edef: Valentin said he might be able to help manage and coordinate administratively. I saw the notes, but I'm not sure what the next steps are on the technical side. (21:32:41)
edef: i'm working on a few pieces, one of which is bringing up Tigris (21:46:15)
edef: and another is some data wrangling around the object size distribution (21:48:54)
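A rough sketch of what size-distribution wrangling can look like, assuming a two-column CSV of object key and size (e.g. extracted from an S3 Inventory report) already exists; the file name and column layout are assumptions, not how edef is actually doing it.

```python
# Sketch: bucket objects into power-of-two size classes from a (key, size) CSV.
import csv
from collections import Counter

buckets = Counter()
with open("inventory.csv", newline="") as f:
    for key, size in csv.reader(f):
        buckets[int(size).bit_length()] += 1  # size class ~ 2^bits bytes

for bits in sorted(buckets):
    print(f"~2^{bits} bytes: {buckets[bits]} objects")
```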
11 Jul 2024
Sami Liedes: 25 kUSD to egress sounds like so much that I wonder if it wouldn't be cheaper to just rebuild the packages outside. 😅 (07:22:30)
edef: reproducibility though! (08:53:46)
edef: and no, builds are very expensive (08:54:13)
edef: data is cheap to store (08:54:16)
edef: just not with the particular access latency we have right now (08:54:53)
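Back-of-envelope only, and the per-GB rate is an assumption: at roughly $0.09/GB (AWS's first internet-egress pricing tier; bulk tiers are cheaper, so the real volume would be somewhat larger), 25 kUSD corresponds to on the order of 280 TB transferred.

```python
# Rough check of the 25 kUSD egress figure; the $/GB rate is an assumption.
egress_budget_usd = 25_000
usd_per_gb = 0.09   # approximate first-tier AWS internet egress rate
print(egress_budget_usd / usd_per_gb / 1000, "TB")   # ~278 TB
```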
tomberek: edef: do you know why we have .drv's in the cache? (08:55:34)
Sami Liedes (in reply to edef's "which hash? FileHash is of the compressed NAR, and is used in the filename"): Yeah, I admit having been a bit confused about which hash is which at times :) (12:09:56)
Sami Liedes: So it's also important that the compression of the NARs is deterministic? (12:10:28)
edef: not really, we don't care about preserving the exact compression of them (16:04:58)
