Are you self-hosting LLMs (AI models) on your headless servers? I’d like to hear about your hardware setup. What server do you have your GPUs in?
When I do a hardware refresh, I’d like to make sure my next server can support one or more GPUs for local LLM inference. I figured I could put either a 4090 or maybe 2x 3090s into an R730. But I’ve only barely started researching this, so maybe it isn’t practical.
I don’t know many hardware lineups besides Dell’s R7xx series.
I host oobabooga on an R710 as a model server API, plus SillyTavern and Stable Diffusion, which connect to oobabooga as clients. The R710 runs inference on the CPU only, so as you can imagine it’s so slow it’s basically unusable. But I wired it up as a proof of concept.
I’m curious what other people who self-host LLMs do. I’m aware of remote options like Mancer or Runpod. I’d like the option for purely local inferencing.
Thanks all
https://www.ebay.com/itm/364128788438?mkcid=16&mkevt=1&mkrid=711-127632-2357-0&ssspo=XkOKzd0RR_6&sssrc=4429486&ssuid=9jfKf00cSoK&var=&widget_ver=artemis&media=COPY
At least according to this fairly detailed eBay listing, you might be limited in which GPUs you can run in an R730. It states a maximum of 300 W per card and double width, which would rule out the 3090/4090 on both power draw and physical size. You could run, say, some Tesla P40s, but they would be a bit slower.
Another option would be to just buy a rack-mount 4U case and, say, an X99 motherboard (same-era CPUs as the R730), which would give you a bit more flexibility to run 3090/4090 cards, so long as you had a 1200 W or so PSU.
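If it helps, here is a rough back-of-the-envelope sketch of the power math in Python. The GPU numbers are the nominal reference-card board power figures (partner cards and transient spikes can go higher), the 300 W per-card limit is just what the listing states, and the CPU draw, "everything else" draw, and 20% headroom are my own assumptions, not measurements.

```python
# Rough power-budget sketch. All figures are nominal or assumed, not measured.

# Nominal board power in watts for reference cards (partner cards may draw more).
GPU_TDP_W = {
    "RTX 3090": 350,
    "RTX 4090": 450,
    "Tesla P40": 250,
}


def fits_power_limit(gpu, per_card_limit_w=300):
    """True if the card's nominal power is within the chassis per-card limit.

    The 300 W default is the figure quoted in the eBay listing for the R730.
    """
    return GPU_TDP_W[gpu] <= per_card_limit_w


def estimate_psu_w(gpus, cpu_w=140, other_w=100, headroom_pct=20):
    """Return (estimated load, suggested PSU size) in watts for a DIY build.

    cpu_w, other_w (drives, fans, RAM, board) and headroom_pct are assumptions.
    """
    load = sum(GPU_TDP_W[g] for g in gpus) + cpu_w + other_w
    # Integer math, rounded up, to avoid floating-point noise.
    suggested = (load * (100 + headroom_pct) + 99) // 100
    return load, suggested


if __name__ == "__main__":
    for name in GPU_TDP_W:
        verdict = "within" if fits_power_limit(name) else "over"
        print(f"{name}: {GPU_TDP_W[name]} W, {verdict} a 300 W per-card limit")

    for build in (["RTX 3090", "RTX 3090"], ["RTX 4090"]):
        load, psu = estimate_psu_w(build)
        print(f"{' + '.join(build)}: ~{load} W load, ~{psu} W PSU with headroom")
```

With those assumptions, two 3090s come out at roughly 940 W of load, which is why something around 1200 W seems like a comfortable PSU size. This only covers power; it says nothing about whether a triple-slot card physically fits in the case.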
Yeah, assembling “plain desktop” hardware in a 4U case, rack mounted and headless, is definitely an option too. I might be asking too much of server hardware by taking R730s (or any racked datacenter hardware) and fitting them to a role they weren’t designed for. These are good thoughts and useful links, thank you.