LLMs have become a staple of modern applications, but relying solely on cloud-based APIs can be limiting due to cost, dependence on third parties, and privacy concerns. That’s where self-hosting an LLM for inference (also called on-premises or on-prem LLM hosting) comes in.

Best fits at a glance:

- Ollama: integrations (wide compatibility)
- vLLM: developers (high performance)
- AnythingLLM: local RAG applications
- LM Studio: beginner-friendly experimentation
LLM Compatibility Calculator
Enter your configuration details below to instantly estimate the RAM needed based on model parameters, quantization method, and your hardware specs:
The available quantization methods and precision bits for each vendor are taken from the Hugging Face Transformers library documentation.1
You can read more about the optimization techniques for hosting LLMs locally in the sections below.
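As a rough rule of thumb behind calculators like the one above, required memory is approximately the parameter count times the bytes per parameter, plus overhead for the KV cache and activations. A minimal Python sketch; the ~20% overhead factor is an assumption for illustration, not a universal constant:

```python
def estimate_ram_gb(params_billions: float, bits_per_param: int, overhead: float = 1.2) -> float:
    """Rough RAM/VRAM estimate for LLM inference.

    params_billions: model size, e.g. 7 for a 7B model
    bits_per_param: precision after quantization (16 for FP16, 8, 4, ...)
    overhead: assumed ~20% extra for KV cache and activations
    """
    bytes_per_param = bits_per_param / 8
    return params_billions * bytes_per_param * overhead

# A 7B model quantized to 4 bits needs roughly 4.2 GB:
print(f"{estimate_ram_gb(7, 4):.1f} GB")    # ~4.2
# A 70B model at FP16 needs roughly 168 GB:
print(f"{estimate_ram_gb(70, 16):.1f} GB")  # ~168.0
```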
Top self-hosting tools analyzed
Tool | Ease of Use | Performance | GitHub Stars
---|---|---|---
Ollama | Medium | Medium | 136k
vLLM | Low | High | 44k
AnythingLLM | High | Medium | 43k
LM Studio | High | Low | 3k
Ollama
Ollama is an open-source tool that simplifies running LLMs locally on macOS, Linux, and Windows. It bundles models and configurations, making setup straightforward for various popular LLMs.
Ollama prioritizes ease of use and privacy via offline operation and supports integrations with developer tools like LangChain and user-friendly interfaces like Open WebUI, which provides a chat-based graphical experience for interacting with the locally hosted models.
It allows users and developers to easily run and interact with LLMs on their personal machines, including multimodal models, making it ideal for local development and privacy-conscious usage.
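For example, once Ollama is installed and a model has been pulled (`ollama pull llama3`; the model name is just an illustration), any language can query its local REST API, which listens on port 11434 by default. A minimal Python sketch:

```python
import requests

# Ollama serves a local REST API on port 11434 by default.
# Assumes `ollama pull llama3` has already downloaded the model.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Explain quantization in one sentence.",
        "stream": False,  # return the full answer as one JSON object
    },
)
print(response.json()["response"])
```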
vLLM
vLLM is a high-performance engine designed for fast and memory-efficient large language model serving. It uses techniques like PagedAttention and continuous batching to maximize throughput and reduce memory requirements during inference.
It supports distributed execution and various hardware (NVIDIA, AMD, Intel) and offers an OpenAI-compatible API for integration. vLLM targets developers and researchers focused on optimizing LLM deployment in production environments. It excels at scalable, high-speed model serving.
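A minimal sketch of vLLM's offline Python API; the model name is illustrative, and any supported Hugging Face causal LM can be substituted:

```python
from vllm import LLM, SamplingParams

# Load the model once; vLLM manages GPU memory with PagedAttention.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")

params = SamplingParams(temperature=0.7, max_tokens=128)

# Continuous batching lets vLLM process many prompts efficiently.
prompts = [
    "What is self-hosting?",
    "Name one advantage of open-source LLMs.",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```

The same model can also be exposed behind vLLM's OpenAI-compatible HTTP server (e.g., via `vllm serve <model>`), so existing OpenAI client code can talk to it without changes beyond the base URL.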
AnythingLLM
AnythingLLM is an open-source desktop tool for running large language models (LLMs) on macOS, Windows, and Linux. It enables users to apply RAG to process documents like PDFs, CSVs, and codebases, retrieving relevant information for chat-based interactions without coding.
It operates offline by default for privacy and integrates RAG to enhance responses using user-provided data. AnythingLLM suits developers and beginners exploring document-driven LLM use cases, with additional support for AI agents and customization through a community hub.
LM Studio
LM Studio is a beginner-friendly desktop application for discovering, downloading, and experimenting with large language models locally across macOS, Windows, and Linux. It features an intuitive graphical interface for managing models from sources like Hugging Face and interacting via a chat UI or a local server.
LM Studio simplifies experimentation with features like offline RAG and leverages efficient backends like llama.cpp and MLX. It caters primarily to beginners and developers wanting an easy-to-use environment for exploring local LLMs.
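Because LM Studio's local server speaks the OpenAI API dialect (on port 1234 by default), existing OpenAI client code can be redirected to it by changing two parameters. A sketch assuming a model is already loaded in LM Studio's server tab; the model name and API key are placeholders, since the local server does not validate the key:

```python
from openai import OpenAI

# Point the standard OpenAI client at LM Studio's local server.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

# "local-model" is a placeholder; LM Studio routes requests to
# whichever model is currently loaded in the server.
completion = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Why self-host an LLM?"}],
)
print(completion.choices[0].message.content)
```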
Open-source large language models
Provider | Model |
---|---|
Alibaba | Qwen2.5-Omni-7B |
Alibaba | Qwen2.5-VL-72B-Instruct |
DeepSeek | DeepSeek-R1 |
DeepSeek | DeepSeek-V3-0324 |
Google | Gemma-3-27b-it |
Google | Gemma-2-27b-it |
Meta | Llama-4-Maverick-17B-128E-Original |
Meta | Llama-4-Scout-17B-16E-Original |
Meta | Llama-3.3-70B-Instruct |
Mistral | Mistral-Small-3.1-24B-Instruct-2503 |
Open-source LLMs are models whose architecture and model files (containing weights, often numbering in the billions of parameters) are publicly available, allowing anyone to download, modify, and use them.
Platforms like Hugging Face serve as central repositories, making it easy to access these models for tasks like building a self-hosted LLM solution. Often packaged within a container image for easier deployment, these models enable users to run model inference directly on their own hardware, offering greater control and flexibility compared to closed-source alternatives.
Advantages of self-hosted LLMs
- Full control and deeper customization: By hosting models on your own infrastructure, you gain complete control over how the system is configured and utilized. This allows you to fine-tune the models beyond the limitations of standard APIs, tailoring them to specific use cases or integrating them into unique workflows.
- Enhanced data security: Keeping the model and data within your own systems eliminates the need to send sensitive information to external providers. This approach reduces the risk of data exposure and ensures compliance with strict data protection requirements.
- Cost-effectiveness: Although the initial investment can be high, self-hosting may result in long-term savings. By leveraging your own infrastructure and open-source models, you can avoid recurring subscription fees associated with managed services.
- Flexibility with open models: A wide range of open-source LLMs are available for deployment. Hosting these models yourself provides freedom from vendor lock-in and enables experimentation with different architectures and capabilities.
- User-friendly management: Some self-hosting solutions offer management interfaces, such as web-based dashboards, which simplify deployment, monitoring, and interaction with hosted models.
Disadvantages of self-hosted LLMs
- Significant hardware investment: Running advanced LLMs requires powerful servers with large GPU capacity. Procuring and maintaining this hardware can be expensive and may not be practical for smaller organizations.
- Complex LLM deployment: Setting up models involves more than just downloading code. Dependencies must be managed, configurations need to be adjusted, and technical troubleshooting is often necessary. This creates a steep barrier for teams without prior experience.
- Limited access to proprietary models: Proprietary models such as GPT-4.5 or Grok 3 are not available for self-hosting. They are accessible only through managed APIs provided by the vendors, which means you may miss out on the latest advancements if you choose self-hosting exclusively.
- Performance optimization burden: Maintaining acceptable response times and throughput is an ongoing responsibility. This requires continuous tuning of hardware, software, and model configurations. Unlike managed services, where providers handle optimization, self-hosted deployments place this responsibility entirely on the user.
Optimizing LLMs for self-hosting
Running AI models like large language models on your own hardware can be challenging due to their size and resource needs, but several techniques help manage their model weights effectively. Methods such as quantization, multi-GPU support, off-loading, and model sharding improve efficiency, making it possible to host these models at home or at work.
Quantization
Quantization, as illustrated in the figure below, often involves reducing the precision of model weights in machine learning by converting high-precision values (like 0.9877 in the Original Matrix) to lower-precision representations (like 1.0 in the Quantized Matrix). This process decreases model size and can speed up computation, albeit with a potential loss in accuracy.

Figure 1. Example of a random matrix of weights with four-decimal precision (left) and its quantized form (right), obtained by rounding to one-decimal precision.2
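The same idea underlies practical schemes such as int8 quantization: map floating-point weights onto a small integer grid with a scale factor, store the integers, and multiply by the scale at inference time. A toy sketch of symmetric per-tensor int8 quantization (a deliberate simplification of what libraries like bitsandbytes do):

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor int8 quantization (toy version)."""
    scale = np.abs(weights).max() / 127.0          # map the largest weight to 127
    q = np.round(weights / scale).astype(np.int8)  # store 1 byte per weight
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale            # approximate original values

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, scale)).max())
# Memory drops 4x (float32 -> int8) at the cost of a small rounding error.
```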
Multi GPU support
As illustrated in the figure, distributing the large ‘Model Parameters’ across multiple GPUs (GPU 1 and GPU 2) allows users to run larger, more capable models on hardware they manage, overcoming single-GPU memory limitations and making self-hosting feasible. This effectively pools resources, optimizing the use of available hardware to meet the demanding requirements of modern LLMs.

Figure 2. Comparison of GPU memory allocation for a large language model. On the left, a single GPU holds both the model parameters and the KV cache. On the right, using two GPUs, the model parameters are distributed across both GPUs, while each GPU maintains its own separate KV cache.
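In practice, libraries can handle this split automatically. A sketch using Hugging Face Transformers with Accelerate installed, where `device_map="auto"` spreads the model's layers across every visible GPU; the model name is illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.3-70B-Instruct"  # illustrative; any causal LM works

tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" splits layers across all visible GPUs,
# so a model too large for one card can still be loaded.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)
print(model.hf_device_map)  # shows which layers landed on which device
```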
Off-loading
Parameter off-loading optimizes LLMs for self-hosting by addressing the limited memory available on consumer GPUs. This technique involves dynamically moving parts of the large model, such as inactive “expert” parameters in MoE models, between the fast GPU memory and slower system RAM. By doing this, offloading allows users to run large, powerful models on accessible hardware that wouldn’t otherwise have enough dedicated GPU memory, making self-hosting feasible.3
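The same `device_map="auto"` machinery can offload weights to CPU RAM, and then to disk, once GPU memory runs out. A sketch where the memory budget (8 GiB of VRAM, 30 GiB of system RAM) is an assumption chosen for illustration:

```python
from transformers import AutoModelForCausalLM

# Cap what each device may hold; layers that don't fit on GPU 0
# go to CPU RAM, and any remaining overflow is memory-mapped on disk.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",  # illustrative model choice
    device_map="auto",
    max_memory={0: "8GiB", "cpu": "30GiB"},
    offload_folder="./offload",  # disk spill location for leftover weights
)
```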
Model sharding
Sharding, as illustrated in the image below, divides the complete “Large Language Model” into several smaller, more manageable “Model pieces.” This technique allows the distribution of these pieces across multiple devices (like GPUs) or even different types of memory within a self-hosted setup. By breaking down the model, sharding overcomes the memory limitations of individual hardware components, making it possible to run large models on personally managed infrastructure.

Figure 3. The diagram shows how a complete LLM can be divided into smaller segments or “Model pieces” to create a sharded version, facilitating distribution across multiple hardware resources or memory tiers for efficient processing and management.4
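Sharding also applies to the checkpoint files themselves: Transformers can save a model as multiple shard files so each piece can be loaded and placed independently. A sketch using a small model for the demo; the 2 GB shard size is an arbitrary choice for illustration:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # small model for the demo

# Split the checkpoint into pieces of at most ~2 GB; for models larger
# than the shard size, the folder will contain several
# model-0000X-of-0000N.safetensors files plus an index mapping
# each weight to its shard, so loaders can materialize one piece at a time.
model.save_pretrained("./sharded-model", max_shard_size="2GB")
```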
Reference Links
