Local AI for up to 200 Users

Hi there,

I’m a big fan of LTS Community and really enjoy listening to all these intelligent people :slight_smile:

I’ve been tinkering with my AMD Radeon card + Ollama + OpenWebUI for quite a while now. Because of that, we got a request to create a concept for a local AI server in a company.

They have up to 150 users right now, but I want to add some headroom and build a server that is ready for 200 people.

I have two main ways which I’m thinking about:

  1. a monolithic concept with one big server and 2 NVIDIA H100 or 4× RTX 5090, so it is up to the task.

  2. a cluster-like approach with 3-4 NVIDIA DGX Sparks, where I “grow as I go”.

I’m also thinking of offering only one, at most two models, so the system doesn’t have to load and unload models all the time.

The plan is to build the system and test it first, then give key users access so they can chat/interact via the OpenWebUI web page. Later on, we invite more and more users and implement RAG and code-assistance support. In the end we also want to add custom integrations into their existing software stack.

Since I don’t have experience with this “high” number of user logins, I wanted to drop this question in here and ask whether any of you have ideas/tips to share and could check whether my ideas are a good approach or not.

We are still in the planning phase and no hardware/software has been purchased yet… and of course, the customer is aware of (and willing to spend) the money this takes, with prices rising in almost every part of this field.

I hope it isn’t frowned upon to link people to other forums. Though I know Tom has referenced this particular one in the past as Wendell is a good guy.

In any case, you may want to check out the Level1Techs forum. It’s hosted by Wendell from Level1Techs (if that last part wasn’t obvious, lol). He’s big into self-hosting AI and has already quite a few videos on the subject on his YouTube channel. His community reflects that as well.


Hi @Moseph_V ,

I already dropped this question there as well :slight_smile:


Please update here if you get a good response from them :slight_smile:

Never frowned upon to link to great resources such as Level1Techs. :slight_smile:

Wendell is much more informed on this topic than I am, but to get a better idea of the hardware requirements, you should probably drill down into the ‘day-to-day’ expectations:

  1. Model Specifics: Which models are you planning to lead with? There is a huge difference in VRAM requirements and tokens-per-second between running Llama 3 8B versus something like a 70B model or a specialized Coding model.
  2. Concurrency vs. Total Users: While there are 200 users total, how many do you expect to be hitting the ‘Generate’ button at the exact same millisecond? If 20-30 users are prompting simultaneously you’ll need to look closely at how the backend handles batching.
  3. Performance Expectations: What is the ‘acceptable’ speed for the users? For basic chat, 10-15 tokens/sec is fine, but for Code Assistance, users usually expect a much snappier response to keep their flow.
  4. RAG Scope: Since you mentioned RAG, do you have a sense of the document volume? Indexing and querying a massive local database adds another layer of CPU/RAM overhead alongside the GPU inference.
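Points 2 and 3 can be turned into a first back-of-the-envelope number. The sketch below uses Little's law (requests in flight = arrival rate × time per request) to go from headcount to concurrent generations and aggregate tokens/sec. Every input here (share of active users, requests per hour, seconds per answer, peak factor) is a placeholder assumption to be replaced with pilot measurements, not a figure from this thread.

```python
def estimate_concurrency(total_users, active_fraction, requests_per_hour,
                         avg_generation_seconds, peak_factor=1.0):
    """Little's law: requests in flight = arrival rate x time per request."""
    arrivals_per_second = total_users * active_fraction * requests_per_hour / 3600
    return arrivals_per_second * avg_generation_seconds * peak_factor


def required_throughput(concurrent_requests, tokens_per_second_per_user):
    """Aggregate tokens/sec the backend must sustain across all streams."""
    return concurrent_requests * tokens_per_second_per_user


# All inputs below are placeholder assumptions, not measurements:
# 200 users, half of them active, 12 prompts per hour each, 30 s per answer.
average = estimate_concurrency(200, 0.5, 12, 30)
peak = estimate_concurrency(200, 0.5, 12, 30, peak_factor=3.0)
print(f"average in flight: {average:.1f}, peak: {peak:.1f}")
print(f"aggregate tokens/sec at peak: {required_throughput(peak, 15):.0f}")
```

The useful part is less the result than seeing which input dominates: doubling the answer length or the prompts per hour doubles the concurrency, while total headcount alone says very little.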

I have a client I am working with that is spending over $2 million on a system with far fewer users to do engineering work. (I am only helping with the physical network infrastructure, not all the hardware.) When you build a local LLM setup you get a much better idea of just how subsidized the cloud companies are.


I think there is really good potential in local LLMs for companies. The prices will go up, who knows when the bubble will burst, and (maybe) the only way to make sure the data is properly contained is to handle it locally.

I am curious how others handle these things. Because of that I started to convert my gaming PC, upgrading it so I can try out different software and models and get first-hand experience. Obviously it is very different when you have to handle concurrent users.

Hi y’all :slight_smile:

Thanks for your intel so far. I now have the same post running on Wendell’s site:

I couldn’t post it there earlier since his team had to approve my account first.

Since we’re still in the concept stage, nothing is really settled yet except these things:

  • we won’t go with the beefy-system version, since the initial cost is too high and it can’t be scaled up that easily, and we have no realistic estimate of what the demand will be
  • so we use the NVIDIA DGX Spark cluster version with a Mellanox Spectrum SN2100 switch to connect them all together (if we need more than one Spark)
  • In terms of user rollout, we go with the multiplier concept: we take 10 key users who are tech-savvy and train them. They then use the system in a pilot phase for a while and give feedback. In the next step, we put the system “online” for all users. Each key user then looks after 10 people, which reduces the load on 1st-level support.

We have two ideas of implementing the whole thing:

  1. start low with a ChatGPT-style chatbot version, where the AI has no access to any resources and therefore the security concept, just like everything else, is very minimal. After a while, when we know the average load and adoption rate of the users, we go further and build it out to connect it to their data silos with read-only access. That would be a bigger step, because the whole concept around security, access, etc. becomes much more complicated.
  2. start full-on and build a cluster with 2-3 nodes from scratch, connect the data silos to it and implement local/MS365 auth on it. This would put all the planning and testing up front before the implementation, which I don’t like so much.

I would rather go with version one, so start slow and see how the system reacts under a real load of 100-200 users, but you know… management decides :slight_smile:

Concurrency vs. Total Users: While there are 200 users total, how many do you expect to be hitting the ‘Generate’ button at the exact same millisecond? If 20-30 users are prompting simultaneously you’ll need to look closely at how the backend handles batching.

As I mentioned, we don’t know. That’s why I like the grow as you go method more.

The monolithic version would require us to factor in everything that will happen in the next 2-3 years, and you know how it goes: once the system is in place and everyone likes it, you add more and more functions until the server is terribly slow, because it was never designed for that.

With a cluster you could scale much cheaper and easier.

So now to your suggestions and questions:

Model Specifics: Which models are you planning to lead with? There is a huge difference in VRAM requirements and tokens-per-second between running Llama 3 8B versus something like a 70B model or a specialized Coding model.

I think we start off with one midsize 70B model available for all and no way of using other ones. Especially for the chat version, this would be the best in my opinion… everything is better than letting the users choose freely (wildly) :stuck_out_tongue:
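For a feel of what a single 70B model means in memory terms, here is a rough sketch that counts only the weights (parameters × bits per weight) and adds a flat allowance on top. The 20% overhead and the memory budgets in the usage lines are illustrative assumptions; KV cache in particular grows with context length and with the number of concurrent requests, so real headroom needs to be measured.

```python
def weights_gb(params_billion, bits_per_weight):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion, bits_per_weight, memory_budget_gb, overhead_fraction=0.2):
    """True if weights plus a flat overhead allowance fit in the budget.

    overhead_fraction is a crude stand-in (an assumption) for KV cache,
    activations and runtime buffers.
    """
    needed = weights_gb(params_billion, bits_per_weight) * (1 + overhead_fraction)
    return needed <= memory_budget_gb


for bits in (16, 8, 4):
    print(f"70B at {bits}-bit: ~{weights_gb(70, bits):.0f} GB of weights")

# Hypothetical budgets, to be replaced with the shortlisted hardware's specs.
print(fits(70, 4, memory_budget_gb=48))
print(fits(70, 16, memory_budget_gb=128))
```

The quantization level changes the answer by a factor of four, so it is worth deciding early whether a 4-bit or 8-bit build of the model is acceptable quality-wise.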

Performance Expectations: What is the ‘acceptable’ speed for the users? For basic chat, 10-15 tokens/sec is fine, but for Code Assistance, users usually expect a much snappier response to keep their flow.

I don’t have feedback on this at hand right now, but I’d expect “snappy” rather than slow. Since they also want to integrate it into their other systems further down the road (e.g. code analysis, agents, etc.), we need to be as flexible as possible, which is another point for the grow-as-you-go implementation.

RAG Scope: Since you mentioned RAG, do you have a sense of the document volume? Indexing and querying a massive local database adds another layer of CPU/RAM overhead alongside the GPU inference.

Yeah, that’s a huge unknown for now. The only thing I know is that they have everything in SharePoint. I haven’t started planning this part yet. What I do know (fairly certainly) is that a local AI connected directly to SharePoint in the cloud will be a disaster. Once the AI hits the servers with thousands of queries, the connection will get throttled and blocked quickly. So for now I’m thinking about replicating their SharePoint into a local silo via a one-way sync.
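Whatever ends up doing the one-way sync will run into the same throttling, just at a slower pace. A minimal sketch of the usual countermeasure, retrying with backoff and honoring the server's retry hint, is below. It assumes throttling surfaces as HTTP 429 or 503 with an optional retry-after value in seconds; `fetch` is a hypothetical callable standing in for the real download call, and any other status is simply treated as done.

```python
import time


def call_with_backoff(fetch, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it is no longer throttled, backing off in between.

    fetch is a hypothetical callable returning (status, retry_after, payload),
    where retry_after is the server's hint in seconds or None.
    """
    for attempt in range(max_retries + 1):
        status, retry_after, payload = fetch()
        if status not in (429, 503):
            return payload  # simplification: every other status counts as done
        if attempt == max_retries:
            break
        # Prefer the server's hint; otherwise double the wait each attempt.
        delay = retry_after if retry_after is not None else base_delay * 2 ** attempt
        sleep(delay)
    raise RuntimeError("still throttled after retries")


# Simulated source: throttled twice, then returns a document.
responses = iter([(429, None, None), (429, 5.0, None), (200, None, "doc-1")])
waits = []
result = call_with_backoff(lambda: next(responses), sleep=waits.append)
print(result, waits)
```

Passing `sleep` in as a parameter keeps the helper testable without actually waiting; in the real sync job the default `time.sleep` would be used.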

I have a client I am working with that is spending over $2 million on a system with far fewer users to do engineering work. (I am only helping with the physical network infrastructure, not all the hardware.) When you build a local LLM setup you get a much better idea of just how subsidized the cloud companies are.

Oh yeah, even a 100Gig switch with 8-10 ports costs anywhere between 2,500 and 17,000 euros.
Another reason why a beefy system won’t cut it in the end, I think: even a rough calculation comes out way over 80,000 euros. When we reach the price point for the big server in the presentation, it’s probably good to plan a break of at least 15 minutes, so everyone can calm down :slight_smile:

I think there is really good potential in local LLMs for companies. The prices will go up, who knows when the bubble will burst, and (maybe) the only way to make sure the data is properly contained is to handle it locally.

I am curious how others handle these things. Because of that I started to convert my gaming PC, upgrading it so I can try out different software and models and get first-hand experience. Obviously it is very different when you have to handle concurrent users.

Me too, that’s why I’m hoping to get some useful insights from y’all :slight_smile:

We are based in Germany, and therefore there’s (officially) no easy way to use AI right now. It’s slowly coming, but I guess it still takes a lot of paperwork to use it in line with our laws.

And it’s probably cheaper to hire a tech guy to build something on-prem than to hire a lawyer to figure this out :stuck_out_tongue:
