- cross-posted to:
- singularity@lemmit.online
Big day for people who run AI locally. According to the benchmarks, this is a big step forward for free, small LLMs.
A 3090.
But it should be fine on a 3060, with zero offloading.
Drop ollama if you want long context. Grab a 5-6 bpw exl2 quantization and load it with a Q4 or Q6 cache, depending on how much context you want. I personally use EXUI, but text-gen-webui and tabbyapi (with some other frontend) will also load exl2 quants.
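
If you want a rough idea of what fits before downloading anything, here's a back-of-the-envelope sketch in Python. The model shape in the example (parameter count, layers, KV heads, head dim) is made up for illustration, so swap in the numbers from your model's config. It also ignores activation memory and the small overhead of the quantization scales, so treat the result as a lower bound.

```python
GIB = 2**30

def weight_gib(n_params: float, bpw: float) -> float:
    """Approximate size of the quantized weights in GiB (bits per weight / 8 = bytes)."""
    return n_params * bpw / 8 / GIB

def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, cache_bits: float) -> float:
    """Approximate KV cache size in GiB.

    Two tensors (K and V) per layer, one head_dim vector per KV head per token.
    cache_bits is 16 for FP16, roughly 6 for Q6 and 4 for Q4.
    """
    elements = 2 * n_layers * n_kv_heads * head_dim * context_len
    return elements * cache_bits / 8 / GIB

if __name__ == "__main__":
    # Hypothetical model shape, NOT taken from any specific release.
    n_params, n_layers, n_kv_heads, head_dim = 14e9, 48, 8, 128
    for bpw in (5.0, 6.0):
        for cache_bits in (4, 6):
            total = (weight_gib(n_params, bpw)
                     + kv_cache_gib(n_layers, n_kv_heads, head_dim, 32768, cache_bits))
            print(f"{bpw} bpw, Q{cache_bits} cache, 32k context: ~{total:.1f} GiB")
```

Compare the total against your card's VRAM (24 GB on a 3090, 12 GB on the common 3060) and leave a bit of headroom for everything else.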