| 3 Aug 2021 |
lunik1 | Any recommendations for data pipeline/toolkits that integrate well with nix? | 11:31:51 |
jbedo | to do what? i use nix with a thin layer on top to manage bioinformatics pipelines | 12:03:59 |
tomberek | jbedo: do you use bionix? or is it another thing? | 13:12:47 |
lunik1 | Currently I just have a bunch of python scripts I execute in a given order, but I'm looking for something that would help me formalise that order, easily extend/swap out parts of those pipelines, and help with deployment | 15:01:59 |
tomberek | lunik1: that's a pattern I used Nix/Hydra for. Basically you have a set of "ingress"/"egress" derivations that may be impure (e.g. fetch from/store to S3) or pure. Then a chain of nix derivations that depend on each other. I defined a function to apply various transformations and mapped it over my list of ingress derivations. It was super nice for iteration, scaling up workers, cached results, experimenting with alternate pipelines. Way better and more productive than something like Airflow. I started to apply content-addressed derivations to them to do short-circuiting as well; that part was still in progress for Hydra compatibility. | 19:36:25 |
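[Editor's sketch] The pattern described above (ingress sources, a chain of transformations mapped over them, cached results, content-addressed short-circuiting) can be loosely illustrated in plain Python. This is not tomberek's code, which is not public, and it does not use Nix or Hydra; `Pipeline`, `content_hash`, `run_one` and the sample steps are hypothetical names chosen for illustration. Each step's result is cached under the step name plus the hash of its input content, so a step is skipped whenever an identical input has already been processed.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

# A transform is a (step name, pure function) pair. Hypothetical type alias.
Transform = Tuple[str, Callable[[str], str]]


def content_hash(data: str) -> str:
    """Hash the content itself, so identical inputs share one cache key."""
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


class Pipeline:
    """Applies a fixed chain of pure transforms to each ingress source."""

    def __init__(self, transforms: List[Transform]) -> None:
        self.transforms = transforms
        # Cache keyed by (step name, hash of the step's input content).
        self.cache: Dict[Tuple[str, str], str] = {}
        self.runs = 0  # number of transforms actually executed

    def run_one(self, data: str) -> str:
        for name, fn in self.transforms:
            key = (name, content_hash(data))
            if key not in self.cache:
                # Cache miss: do the work and remember the result.
                self.cache[key] = fn(data)
                self.runs += 1
            # On a hit the step is short-circuited and the stored result reused.
            data = self.cache[key]
        return data

    def run(self, ingress: List[Callable[[], str]]) -> List[str]:
        # Ingress callables may be impure (e.g. a download); transforms are pure.
        return [self.run_one(fetch()) for fetch in ingress]


pipeline = Pipeline([("strip", str.strip), ("upper", str.upper)])
sources = [lambda: " a ", lambda: "a", lambda: " b "]
results = pipeline.run(sources)
```

In this sketch the second source skips the "upper" step because its stripped content matches the first source's, and running the same sources again executes no transforms at all. Nix would additionally persist results across runs and machines, which an in-memory dictionary does not.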
lunik1 | Damn that sounds awesome, any of this open source? | 19:37:47 |
tomberek | No. My plan is to capture the idea, organize it a bit better, and have that be open source. I've heard of a few people re-inventing this a few times, so I want to extract the common portions and perhaps provide a "flow-library" or something to make it easier to put together. | 19:39:30 |
tomberek | I'd be happy to collaborate on it. | 19:39:45 |
lunik1 | Was that all batch processing or could you handle streaming data too? | 19:39:48 |