11 Oct 2024 |
hexa | (copied from an infra team discussion) | 14:16:34 |
hexa | our feeling was that the epyc has stronger single core perf than the altra | 14:16:54 |
K900 | (notably, the Amperes are locked at 3GHz, and the 9454P can boost to ~3.8) | 14:17:13 |
K900 | So we're looking at very similar all core throughput | 14:17:31 |
K900 | Also, the EPYC is DDR5 and the Altra is DDR4, which may end up mattering for eval because eval is A LOT of pointer chasing | 14:17:57 |
Tristan Ross | From running Ampere, it does not have as good single core performance as other systems I've seen | 14:18:03 |
K900 | It's not supposed to | 14:18:10 |
Tristan Ross | But its throughput is pretty good | 14:18:14 |
K900 | It's a many small cores design | 14:18:14 |
K900 | Like, this is going to depend on how well we can utilize SMT | 14:20:06 |
K900 | But I'd expect roughly similar MT perf | 14:20:18 |
K900 | With a pretty strong ST lead for the Epyc | 14:20:23 |
Tristan Ross | Gotcha, and the thermals would probably be similar | 14:20:53 |
K900 | Thermals, frankly, should not be our problem | 14:21:59 |
K900 | If Hetzner can't figure out a way to get us hardware that's not thermal throttling, we'll just have to do the math | 14:22:32 |
Tristan Ross | Yeah | 14:22:57 |
Mic92 | In reply to @hexa:lossy.network
bottlenecks:
- parallel compress slots (currently limited to 30, which seems reasonable in relation to the compute rhea has)
- eval memory, which we compensate for with zram at 150%
- eval time, which is single-threaded and probably not fixable through hw upgrades
Eval is parallel in hydra | 14:27:12 |
hexa | it can be, but it is not on h.n.o | 14:27:33 |
Mic92 | Not enabled? | 14:27:52 |
hexa | evaling trunk-combined exceeds the available memory | 14:27:58 |
K900 | Single threaded eval nearly OOMs the box | 14:28:08 |
K900 | And the way parallel eval works just makes it even worse | 14:28:20 |
Mic92 | Ok. Got it | 14:28:22 |
Mic92 | Is there a derivation that depends on all other derivations? | 14:30:54 |
Mic92 | Because this is not normal | 14:31:07 |
Mic92 | I am able to eval arbitrarily large package sets with nix-eval-jobs | 14:33:33 |
Mic92 | It will reclaim memory | 14:33:45 |
K900 | In reply to @joerg:thalheim.io
Is there a derivation that depends on all other derivations?
The tested job depends on A LOT of things | 14:34:31 |
K900 | I believe it is the primary bottleneck | 14:34:37 |
K900 | Because one of the things it depends on is like 200 VM tests | 14:34:50 |
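The all-core comparison earlier in the log (Altra cores fixed at 3 GHz, the 9454P with fewer but faster SMT-capable cores) can be sketched as rough "core-GHz" arithmetic. This is only a back-of-envelope illustration, not a benchmark: the Altra core count, the sustained all-core EPYC clock, and the SMT uplift values are assumptions that do not appear in the discussion, and the `core_ghz` helper is hypothetical. It also ignores per-clock (IPC) differences, memory bandwidth, and thermal throttling.

```python
# Back-of-envelope multi-threaded throughput proxy.
# Values marked ASSUMED are illustrative and not taken from the chat log.

def core_ghz(cores: int, clock_ghz: float, smt_uplift: float = 1.0) -> float:
    """Crude throughput proxy: cores x clock x SMT uplift factor.

    Ignores IPC, memory bandwidth, and throttling, so it can only
    show how sensitive the comparison is to the SMT assumption.
    """
    return cores * clock_ghz * smt_uplift


ALTRA_CORES = 80          # ASSUMED: the log does not name the Altra SKU
ALTRA_CLOCK_GHZ = 3.0     # from the log: "locked at 3GHz"
EPYC_CORES = 48           # EPYC 9454P has 48 cores / 96 threads
EPYC_ALL_CORE_GHZ = 3.3   # ASSUMED sustained all-core clock (log only gives ~3.8 boost)

if __name__ == "__main__":
    altra = core_ghz(ALTRA_CORES, ALTRA_CLOCK_GHZ)  # no SMT on the Altra
    print(f"Altra: {altra:.1f} core-GHz")
    # Sweep an ASSUMED SMT uplift to see how much the result depends on it.
    for uplift in (1.0, 1.2, 1.4):
        epyc = core_ghz(EPYC_CORES, EPYC_ALL_CORE_GHZ, uplift)
        print(f"EPYC with SMT uplift {uplift:.1f}: {epyc:.1f} core-GHz "
              f"({epyc / altra:.0%} of Altra)")
```

Under these assumptions the ratio moves substantially with the SMT uplift alone, which is consistent with the remark that the multi-threaded result depends on how well SMT can be utilized. Because the proxy leaves out per-clock performance, it says nothing about the single-threaded lead discussed above.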