| 19 May 2024 |
Gaétan Lepage | Ok ! Thanks for the details ! | 13:54:29 |
Gaétan Lepage | You have other uses for storage than Nix builds, right ? | 13:55:08 |
connor (he/him) | Ah yeah definitely!
I'm really into multi-frame super resolution so I've been trying to start aggregating photography I've done to turn it into a dataset | 13:56:32 |
connor (he/him) | I've also got a Light L16 I want to use to create a dataset, and a Lytro Illum because I thought it could be neat to see what I can do with a plenoptic camera | 13:57:07 |
connor (he/him) | UGH https://github.com/pytorch/vision/blob/v0.18.0/version.txt | 13:58:36 |
Gaétan Lepage | Oh I see !
At first, I looked at old MB/CPU combos on ebay (Epyc) but they are
- DDR4
- not "that" cheap
- slower than more modern chips
Lately I was more looking at the Threadripper 7960x | 13:58:38 |
Gaétan Lepage | But it's quite expensive, and the MBs too | 13:58:56 |
connor (he/him) | They left it as 0.18.0a0 in version.txt | 13:58:57 |
connor (he/him) | Oof yeah any of the workstation-grade chips are very expensive | 13:59:25 |
connor (he/him) | I didn't realize how dumb Nix's remote build protocol is in terms of scheduling (it doesn't take advantage of data locality, doesn't keep records of build times of previous versions of a package to decide how to allocate, etc.) so I thought scaling out would be better than scaling up | 14:00:14 |
connor (he/him) | nixbuild.net is doing amazing stuff with respect to scaling out though -- they've re-implemented the Nix remote build protocol, so while their endpoint presents itself as a single monolithic machine, on the backend they're able to scale instances up and down as needed | 14:01:50 |
connor (he/him) | hexa (UTC+1): sorry for the @ -- any ideas if the above failure (last four messages) is by design? I'm not familiar with packaging but I saw you contributed the hook doing the version check. I'd just like to know whether I should tell upstream or patch in-tree. | 14:04:53 |
hexa | the upstream package pins that version | 14:05:44 |
hexa | and we provide something that doesn't match that constraint | 14:06:00 |
Gaétan Lepage | Ok ! So what tier do you think is the most interesting for a builder: consumer, HEDT or pro ? | 14:17:57 |
connor (he/him) | Ah it's because pre-releases aren't allowed by default right | 14:18:05 |
Gaétan Lepage | 7960x would be HEDT I guess | 14:18:08 |
connor (he/him) | Changing "torchvision>=0.15.0", to "torchvision>=0.15.0a0", in nix-cuda-test's pyproject.toml enables pre-releases for that requirement (https://github.com/pypa/packaging/blob/32deafe8668a2130a3366b98154914d188f3718e/src/packaging/specifiers.py#L249-L270). So I guess I should submit a PR to torchvision to fix their version (it doesn't match their tag either). | 14:23:35 |
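[A quick sketch of the specifier behavior described above, using the `packaging` library linked in the message. The version string is the one torchvision ships in its version.txt; the rest is a minimal reproduction, not torchvision's actual packaging setup.]

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

# The version torchvision's version.txt reports: a pre-release (a0 suffix).
v = Version("0.18.0a0")

# By default, a SpecifierSet excludes pre-releases, so the check fails:
print(v in SpecifierSet(">=0.15.0"))    # False

# A pre-release in the specifier itself opts that requirement into
# pre-releases, so the same version now satisfies it:
print(v in SpecifierSet(">=0.15.0a0"))  # True
```

This is why changing the requirement to `"torchvision>=0.15.0a0"` makes the hook's check pass even though the installed version string is a pre-release.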
hexa | oohhh, pre-releases | 14:24:13 |
hexa | my bad | 14:24:14 |
hexa | not sure if we should allow pre-releases | 14:25:06 |
hexa | it would probably remove confusion about the error | 14:25:14 |
connor (he/him) | I think it's safe to say it's upstream's fault -- their previous releases didn't have mismatched version.txt files. I made a PR: https://github.com/pytorch/vision/pull/8431 | 14:36:28 |
connor (he/him) | So really, your hook helped me catch something upstream did <3 | 14:36:51 |
connor (he/him) | I don't know how feasible it is to add a warning about pre-releases to the hook, but that would have saved me from reading through packaging's codebase to figure out what was going on haha | 14:37:39 |
connor (he/him) | It's... tricky. Consider running nixpkgs-review for a CUDA PR as an example. A number of the packages are super small and can be built in parallel. But some of them are massive beasts that should never be built in parallel (OpenCV + JAX + PyTorch = cry). Nix doesn't provide a way to allocate cores per build based on system load, or anything similar. All we have to control the builder are max-jobs and cores. It's partly why I thought scaling out was the solution -- have a lot of very fast machines which build one derivation at a time, because there's no way to schedule whether they're going to be told to build some small python wrapper or some massive package. | 14:41:15 |
connor (he/him) | I suppose another way around that is to not mark one of the builders with big-parallel, and to set cores = 0 and max-jobs = auto so it can handle as many jobs as it wants in parallel, so long as they're known to be small. Then one of the other builders would have the big-parallel system feature and have cores = 0 and max-jobs = 1, so it takes the big builds, and only has to build one at a time. | 14:42:55 |
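[A sketch of how that split might look in a Nix `machines` file; hostnames and key paths are made up. Columns are: URI, platform, SSH identity, max-jobs, speed factor, supported features, mandatory features. Note `cores` is set in each builder's own nix.conf, not here.]

```
# Small-jobs builder: many jobs in parallel, never offered big-parallel builds
ssh://builder-small x86_64-linux /root/.ssh/id_ed25519 8 1 kvm,nixos-test -

# Heavy builder: one job at a time; mandatory big-parallel means it is
# only used for derivations that require that feature
ssh://builder-big x86_64-linux /root/.ssh/id_ed25519 1 2 big-parallel,kvm big-parallel
```

Because the small builder doesn't advertise `big-parallel` in its supported features, derivations requiring it can only land on the heavy builder, which matches the scheme described above.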