| 5 Mar 2024 |
edef | In reply to @nh2:matrix.org Yes, it's only really concerning availability. For write-mostly backups, one can use higher EC redundancy, or tar/zip the files, which gets rid of the problem of many small files / seeks. yeah, we have a similar problem with Glacier | 05:04:14 |
edef | where objects are costly but size is cheap | 05:04:21 |
edef | so i intend to do aggregation into larger objects | 05:04:53 |
edef | basically we can handle a lot of the read side of that by accepting that tail latencies suck and we just have a bunch of read amplification reading from larger objects and caching what's actually hot | 05:05:49 |
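(A minimal sketch, not from the chat, of the kind of aggregation edef describes: many small objects appended into one large pack with an offset index, single-object reads done as byte-range reads into the pack, and a cache in front to absorb the read amplification. `PackWriter`/`PackReader` and the layout are made up for illustration.)

```python
import io
from dataclasses import dataclass

@dataclass
class PackEntry:
    offset: int  # byte offset inside the pack
    length: int  # object size in bytes

class PackWriter:
    """Append many small objects into one large pack, recording offsets."""
    def __init__(self) -> None:
        self.buf = io.BytesIO()
        self.index: dict[str, PackEntry] = {}

    def add(self, key: str, data: bytes) -> None:
        self.index[key] = PackEntry(self.buf.tell(), len(data))
        self.buf.write(data)

    def finish(self) -> tuple[bytes, dict[str, PackEntry]]:
        return self.buf.getvalue(), self.index

class PackReader:
    """Serve single objects out of a pack via byte ranges, caching hot ones."""
    def __init__(self, pack: bytes, index: dict[str, PackEntry], cache_size: int = 1024) -> None:
        self.pack = pack
        self.index = index
        self.cache: dict[str, bytes] = {}
        self.cache_size = cache_size

    def get(self, key: str) -> bytes:
        if key in self.cache:
            return self.cache[key]
        e = self.index[key]
        # In object-storage terms this slice would be a ranged GET on the pack.
        data = self.pack[e.offset:e.offset + e.length]
        if len(self.cache) < self.cache_size:
            self.cache[key] = data
        return data
```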
edef | i'd really like to have build timing data so we can maybe just pass on requests for things that are quick to build | 05:07:05 |
nh2 | In reply to @edef1c:matrix.org and, stupid question, but i assume you're keeping space for resilver capacity in your iops budget? Yes, that should be fine, because the expected mean serving req/s are only 20% of the IOPS budget (a bit more for writes) | 05:07:30 |
edef | but i'm not entirely sure how much of that data exists | 05:07:33 |
nh2 | In reply to @edef1c:matrix.org so i intend to do aggregation into larger objects This is how we also solved the many-small-files problem on our app's production Ceph. We grouped "files that live and die together" -- literally put them into a 0-compression ZIP, and the web server serves them out of the ZIP. That way we reduced the number of files 100x, making Ceph recoveries approximately that much faster. | 05:09:46 |
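(A small sketch, assuming Python's `zipfile`, of the 0-compression packing nh2 describes: members are stored with `ZIP_STORED`, so they keep their exact bytes and can be served straight out of the archive. File names are illustrative.)

```python
import zipfile

# Pack a group of "live and die together" files into one archive with no
# compression, so the web server can read members back byte-for-byte.
members = ["myfile00", "myfile01", "myfile99"]  # illustrative paths

with zipfile.ZipFile("myfile.zip", "w", compression=zipfile.ZIP_STORED) as zf:
    for name in members:
        zf.write(name)

# Serving side: open the archive once and read individual members on demand.
with zipfile.ZipFile("myfile.zip") as zf:
    data = zf.read("myfile00")
```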
edef | yeah | 05:10:01 |
edef | the metadata for all this is peanuts | 05:10:17 |
edef | i've built some models for the live/die together part | 05:10:46 |
edef | but had some data quality/enrichment stuff to resolve first and i haven't redone that analysis yet | 05:11:07 |
nh2 | For nix store paths the annoying thing is that there's no natural "files that live together" grouping that can be automatically deduced. For my app, all files myfile00 through myfile99 go into myfile.zip.
So you'd have to write some index that says which archive each store path is in.
Assuming we never delete anything, the packing can be arbitrary.
| 05:11:43 |
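(One possible shape for the index nh2 mentions, sketched with SQLite: a table mapping each store path to the archive and member name that holds it. Schema and names are assumptions for illustration, not an existing tool's format.)

```python
import sqlite3

conn = sqlite3.connect("pack-index.sqlite")
conn.execute(
    "CREATE TABLE IF NOT EXISTS pack_index ("
    "  store_path TEXT PRIMARY KEY,"
    "  archive    TEXT NOT NULL,"
    "  member     TEXT NOT NULL"
    ")"
)

def record(store_path: str, archive: str, member: str) -> None:
    # Packing can be arbitrary (nothing is ever deleted), so we just record
    # wherever the path happened to land.
    conn.execute(
        "INSERT OR REPLACE INTO pack_index VALUES (?, ?, ?)",
        (store_path, archive, member),
    )
    conn.commit()

def lookup(store_path: str):
    # Returns (archive, member) or None if the path isn't packed yet.
    return conn.execute(
        "SELECT archive, member FROM pack_index WHERE store_path = ?",
        (store_path,),
    ).fetchone()
```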
edef | [image: chart (temporal locality of path references)] | 05:11:54 |
edef | like this chart is a little iffy bc it shouldn't have this long a left tail, i have the data cleanup now to fix it | 05:12:00 |
edef | but other things on my plate before i can get to that one | 05:12:17 |
edef | basically this is meant to model temporal locality of path references | 05:12:52 |
edef | unfortunately it has a time travel issue that i think should be fixed now | 05:13:22 |
edef | i just need to do the dance again | 05:14:47 |
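(A rough sketch, under stated assumptions, of one way to quantify the temporal locality edef is modelling: given a first-seen timestamp per path and the direct reference edges, look at the age gap between each path and the paths it references. Both input structures are hypothetical, not edef's actual dataset.)

```python
from datetime import datetime

first_seen: dict[str, datetime] = {}        # path -> first time it appeared
references: dict[str, list[str]] = {}       # path -> its direct references

def reference_age_gaps() -> list[float]:
    """Seconds between a path's first appearance and that of each path it
    references. A tight distribution means strong temporal locality, i.e.
    paths that arrive together are good candidates to pack together.
    Negative gaps would be the kind of "time travel" data-quality issue
    mentioned above."""
    gaps = []
    for path, refs in references.items():
        for ref in refs:
            if path in first_seen and ref in first_seen:
                gaps.append((first_seen[path] - first_seen[ref]).total_seconds())
    return gaps
```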
edef | also, just for the bit: what's your best guess on the distribution of number-of-incoming-reference-edges for paths? | 05:15:47 |
edef | (direct, not transitive) | 05:16:40 |
nh2 | In reply to @edef1c:matrix.org also, just for the bit: what's your best guess on the distribution of number-of-incoming-reference-edges for paths? you mean like "glibc is depended on by 200k packages, libpng by 10k, Ceph by 3"? | 05:17:06 |
edef | directionally correct yes | 05:17:25 |
edef | i have a scatterplot but it's a fun thought exercise i don't want to rob you of by posting the plot first :p | 05:18:05 |
nh2 | [image: guess at the distribution] | 05:19:24 |
edef | yeah p much, it's power law distributed | 05:19:43 |
edef | [image: log-log scatterplot of incoming reference edge counts] | 05:19:54 |
edef | (this is log-log) | 05:19:55 |
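(Not from the chat: a sketch of how one could measure this in-degree distribution on a local store, using the real `nix-store --query --requisites` and `--query --references` commands to walk a closure and count incoming edges per path. The root path is a placeholder.)

```python
import subprocess
from collections import Counter

def closure(root: str) -> list[str]:
    """All store paths in the closure of `root`."""
    out = subprocess.run(
        ["nix-store", "--query", "--requisites", root],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.split()

def in_degrees(root: str) -> Counter:
    """Count direct incoming reference edges for each path in the closure."""
    counts: Counter = Counter()
    for path in closure(root):
        refs = subprocess.run(
            ["nix-store", "--query", "--references", path],
            capture_output=True, text=True, check=True,
        ).stdout.split()
        for ref in refs:
            if ref != path:  # skip self-references
                counts[ref] += 1
    return counts

# Plotting these counts (histogram or rank vs. count) on log-log axes is what
# makes the power-law tail show up as a roughly straight line.
```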
nh2 | "the golf putter distribution" | 05:20:50 |
edef | haha | 05:20:58 |