!atvIbxHoEqNcAIxYpN:nixos.org

NixOS AWS

64 Members
14 Servers


Sender Message Time
24 Nov 2024
@commiterate:matrix.orgcommiterate Fixed, though it means agent.run_as_user in the configuration file is no longer respected (i.e. can't change the user at runtime with a CW config file change) which is fine IMO. 20:47:27
25 Nov 2024
@commiterate:matrix.orgcommiterate *

Arian Any concerns with this Fluent Bit module before I try adding it to Nixpkgs?

https://github.com/commiterate/nix-fluent-bit

Probably going to use it despite the CW Agent work due to the native systemd-journald support and better processing features. That and I'm a bit hesitant now that I've seen the spaghetti under the hood.

06:11:31
1 Dec 2024
@sielicki:matrix.org@sielicki:matrix.org

fyi, working on a handful of changes related to AWS and ML:

  1. Adding the efa kernel module: https://github.com/NixOS/nixpkgs/pull/360347

  2. Adding efa-nv-peermem: https://github.com/NixOS/nixpkgs/pull/360375

  3. Adding an updateScript for the out-of-tree ena build and a package bump: https://github.com/NixOS/nixpkgs/pull/360326

I expect a few others before the weekend is over:

  • modifying the libfabric drv to support building with efa and HMEM_CUDA

  • adding and building libnccl-ofi, plus extending nccl so that it uses it

with all of these in place (minus the ENA part which is independent) it should be possible to support multinode ML training on aws with nixos.

02:31:59
@sielicki:matrix.org@sielicki:matrix.org

Arian: any ideas on how to expose this in a module and enable it?

EFA supported instances types are here: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html#efa-instance-types

efa-nv-peermem and the nccl/libfabric stuff is only really needed on p3/p4.*/p5.*/p5e.*/p5en.*

02:35:40
@sielicki:matrix.org@sielicki:matrix.org there's a separate discussion worth having about neuron kmods and software support 02:36:26
@arianvp:matrix.orgArian Given it's a kernel module, do we need an option? Can't we just add it to the image and have udev load it when needed? 07:47:06
4 Dec 2024
@arianvp:matrix.orgArian nah this looks pretty good. We could perhaps add more structured module types 10:26:18
@arianvp:matrix.orgArian https://github.com/arianvp/nixos-village/blob/main/nix/modules/fluent-bit.nix 10:26:43
@arianvp:matrix.orgArian by using freeformType = 10:26:50
@arianvp:matrix.orgArian the restartTrigger on user is superfluous 10:28:03
@arianvp:matrix.orgArian pretty sure anything that changes the unit file is a restart trigger 10:28:16
@arianvp:matrix.orgArian also the grace option seems unused 10:28:50
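
The freeformType suggestion above can be sketched roughly as follows. This is a minimal illustration, not the code from either linked repository: the option path `services.fluent-bit-sketch`, the output path under /etc, and the default of 5 for `grace` are all assumptions made for the example. Only the keys the module itself reads get a declared option; everything else passes through to the generated YAML untyped.

```nix
# Illustrative sketch only; `services.fluent-bit-sketch` is a hypothetical option path.
{ config, lib, pkgs, ... }:
let
  cfg = config.services.fluent-bit-sketch;
  yamlFormat = pkgs.formats.yaml { };
in
{
  options.services.fluent-bit-sketch.settings = lib.mkOption {
    type = lib.types.submodule {
      # Undeclared keys are accepted as long as they fit the YAML value type.
      freeformType = yamlFormat.type;

      # Only keys the module itself needs to read get a declared, typed option.
      options.service.grace = lib.mkOption {
        type = lib.types.ints.positive;
        default = 5; # assumed default, chosen for illustration
        description = "Seconds Fluent Bit waits for a graceful shutdown.";
      };
    };
    default = { };
    description = "Fluent Bit configuration, rendered to YAML.";
  };

  # Render the merged settings to a file (path chosen for illustration).
  config.environment.etc."fluent-bit-sketch/fluent-bit.yaml".source =
    yamlFormat.generate "fluent-bit.yaml" cfg.settings;
}
```

With this shape, a user could set `services.fluent-bit-sketch.settings.service.flush = 1;` without the module declaring `flush`, so the module does not have to track every upstream schema change.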
6 Dec 2024
@cees:softwareguild.orgCees de Groot joined the room. 16:19:14
@adam:robins.wtf@adam:robins.wtf joined the room. 16:20:58
10 Dec 2024
@commiterate:matrix.orgcommiterate grace is used to set the systemd unit's shutdown timeout for graceful shutdown. We could technically exclude it but I'd rather have systemd also have a timeout in case Fluent Bit has some bug. 18:04:05
@commiterate:matrix.orgcommiterate As for adding more structural typing to the config options, it seems like a maintenance burden since we need to keep up with any config schema changes on the Fluent Bit side. 18:05:06
@commiterate:matrix.orgcommiterate *

grace is used to set the systemd unit's shutdown timeout for graceful shutdown. We could technically exclude it but I'd rather have systemd also have a timeout in case Fluent Bit has some bug.

We also need it anyways since systemd has a default of 90s.

https://www.freedesktop.org/software/systemd/man/latest/systemd-system.conf.html#DefaultTimeoutStartSec=

18:08:35
@commiterate:matrix.orgcommiterate *

grace is used to set the systemd unit's shutdown timeout for graceful shutdown.

We need it since systemd has a default of 90s.

https://www.freedesktop.org/software/systemd/man/latest/systemd-system.conf.html#DefaultTimeoutStartSec=

19:16:17
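
A minimal sketch of the timeout point above, assuming a hypothetical unit name, a hypothetical config path, and a local `graceSeconds` binding that stands in for whatever the module's grace option resolves to. It is not the module's actual code.

```nix
# Sketch only: the unit name, config path, and `graceSeconds` are hypothetical.
{ pkgs, ... }:
let
  graceSeconds = 5; # assumed to mirror Fluent Bit's own grace setting
in
{
  systemd.services.fluent-bit-sketch = {
    wantedBy = [ "multi-user.target" ];
    serviceConfig = {
      ExecStart = "${pkgs.fluent-bit}/bin/fluent-bit --config /etc/fluent-bit-sketch/fluent-bit.yaml";
      # Without this, systemd falls back to DefaultTimeoutStopSec=, which is 90s by default.
      TimeoutStopSec = graceSeconds;
    };
  };
}
```

An integer here is written to the unit file as a number of seconds, so systemd stops waiting at roughly the same point Fluent Bit's own grace period ends.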
13 Dec 2024
@sielicki:matrix.org@sielicki:matrix.org yes and no -- similar to ENA it's a question of whether the in-tree module should be preferred to the out of tree one. 04:04:18
15 Dec 2024
@commiterate:matrix.orgcommiterate Should we find a maintainer and try to merge the net-utils Nix package as is (since it seems like it's fine for now) or should we try to submit changes upstream to swap to systemd device units first? 03:25:59
16 Dec 2024
@commiterate:matrix.orgcommiterate PR: https://github.com/NixOS/nixpkgs/pull/365493 04:43:35
@commiterate:matrix.orgcommiterate * Related note: adding more IMDS categories to the built-in AWS filter plugin: https://github.com/fluent/fluent-bit/pull/9727 04:45:17
@arianvp:matrix.orgArian I mean, we can unconditionally include the EFA module in the AWS config. 12:40:24
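
A rough sketch of what unconditional inclusion might look like in a NixOS configuration. The attribute name `efa` in the kernel package set is an assumption (the package comes from the PR linked earlier and may be exposed under a different name); `boot.extraModulePackages` is the existing NixOS option for out-of-tree modules.

```nix
# Sketch only: `efa` as an attribute of kernelPackages is an assumption.
{ config, ... }:
{
  # Make the out-of-tree module available to modprobe for the running kernel.
  # It is deliberately not listed in boot.kernelModules, so nothing force-loads it;
  # udev can load it by modalias on instances that actually expose an EFA device.
  boot.extraModulePackages = [ config.boot.kernelPackages.efa ];
}
```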
@arianvp:matrix.orgArian https://github.com/NixOS/nixpkgs/pull/365690 21:20:24
20 Dec 2024
@commiterate:matrix.orgcommiterate * Looks like the fluent-bit package effectively has no maintainers. 03:46:36
@0xfeebdaed:matrix.org0xfeebdaed joined the room. 04:04:16
@arianvp:matrix.orgArian Hmm. What do you advise? Should we keep using it? 08:44:09

Room Version: 10