| 26 Sep 2024 |
mattleon | May I ask what hardware you are running inference on and how many tokens per second it generates?
It seems the hardware I'm running hass on can do about 3 tokens/s; I'm trying to figure out if that's enough | 14:17:50 |
@hexa:lossy.network | In reply to @mattleon:matrix.org: 3060 12GB | 14:47:07 |
@hexa:lossy.network | In reply to @hexa:lossy.network (3060 12GB): How do I find out how many tokens per second it generates? | 14:47:24 |
mattleon | That's a great question. I got my information from this Medium post on the OG Llama 3 with 8B parameters. The GIF of the terminal output from an inference run shows the number of tokens per second at the end:
https://medium.com/@benoit.clouet/running-llama3-on-the-gpu-of-a-rk1-turing-pi-6dddb9e14521
I do wonder if hass exposes a performance graph 🤔
In either case, a 3060 is quite a bit more powerful than what I'm working with | 14:54:15 |
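For anyone wanting to measure this themselves, here is a rough sketch of how tokens per second could be computed by timing a single generation. It does not use any particular runtime's API: `generate` is a hypothetical callable standing in for whatever produces tokens, and `seconds_for_reply` is an illustrative helper for judging whether a rate like 3 tokens/s is fast enough for a given reply length.

```python
import time
from typing import Callable, List


def tokens_per_second(token_count: int, elapsed_seconds: float) -> float:
    """Throughput is the number of generated tokens divided by wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return token_count / elapsed_seconds


def measure(generate: Callable[[str], List[str]], prompt: str) -> float:
    """Time one call to `generate` (hypothetical: returns the list of output tokens)."""
    start = time.perf_counter()
    tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    return tokens_per_second(len(tokens), elapsed)


def seconds_for_reply(reply_tokens: int, rate: float) -> float:
    """How long a reply of `reply_tokens` tokens takes at `rate` tokens/s."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return reply_tokens / rate
```

At the 3 tokens/s mentioned above, `seconds_for_reply(60, 3.0)` gives 20 seconds for a 60-token reply, so whether that is enough depends mostly on how long the responses need to be.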
K900 | The RK3588 has an NPU | 14:54:37 |
K900 | Which is presumably currently not used | 14:54:42 |
K900 | Because nothing supports it | 14:54:45 |
K900 | But work is ongoing on that | 14:54:53 |