How to Easily Deploy Pixtral Large Using Docker and vLLM for Self-Hosting with a One-Liner Command


Recently, Mistral released Pixtral Large, a powerful multimodal model with roughly 124B parameters (a 123B multimodal decoder paired with a 1B vision encoder). In this blog post, we will first look at what Pixtral is, and then walk through deploying Pixtral Large with vLLM.

Understanding Pixtral

Pixtral is Mistral's family of open-weights multimodal models. Each model pairs a vision encoder with a large language decoder, so it can reason over images and text in the same prompt: describing screenshots, answering questions about charts and documents, and handling ordinary text-only tasks.

Pixtral Large, specifically, is the largest model in the family. It is built on top of Mistral Large 2, adds a vision encoder, and supports a 128K-token context window, which makes it well suited for enterprise-level applications. Its size is also why the deployment below spans multiple GPUs.

Deploying Pixtral Large Using vLLM

Now that we have a basic understanding of Pixtral, let’s move on to the deployment process using vLLM. vLLM is an open-source inference and serving engine for large language models: it provides an efficient runtime (continuous batching, paged attention) and exposes an OpenAI-compatible API server, which makes self-hosting a model like this much simpler.

At the time of writing, neither the Pixtral release page nor the Hugging Face model card includes a tutorial for a vLLM deployment with Docker. So here we go.

Prerequisites

Before we begin, ensure you have the following prerequisites:

  1. Docker: Make sure Docker is installed on your system, along with the NVIDIA Container Toolkit so containers can access your GPUs (see the quick checks right after this list).
  2. NVIDIA GPUs: You need NVIDIA GPUs to leverage the full potential of Pixtral Large. Here I am using 8x H100 GPUs.
  3. Hugging Face Token: You need a valid Hugging Face Hub token to access the model. Go to your Hugging Face token settings page to get it.
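
Assuming the NVIDIA drivers and the NVIDIA Container Toolkit are already installed, a couple of quick checks will confirm that Docker can see your GPUs before we start:

    # Confirm Docker is installed
    docker --version

    # Confirm the host sees the GPUs
    nvidia-smi

    # Confirm containers can access the GPUs through the NVIDIA runtime
    docker run --rm --runtime nvidia --gpus all ubuntu nvidia-smi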

Deployment Steps

  1. Pull the vLLM Docker Image:
    First, pull the latest vLLM Docker image, which supports the Pixtral Large model.

    docker pull vllm/vllm-openai:latest

  2. Run the Docker Container:
    Use the following one-liner to run the container with the necessary configurations:

    docker run --runtime nvidia --rm -d --gpus '"device=0,1,2,3,4,5,6,7"' --name=vllm-pixtral124 -v /raid/huggingface:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=<HF_TOKEN>" -p 8888:8000 --ipc=host vllm/vllm-openai:latest --model mistralai/Pixtral-Large-Instruct-2411 --config-format mistral --load-format mistral --tokenizer_mode mistral --limit_mm_per_prompt 'image=10' --tensor-parallel-size 8

Breaking down the command:

  • --runtime nvidia: Specifies the use of NVIDIA runtime.
  • --rm: Automatically removes the container when it exits.
  • -d: Runs the container in detached mode.
  • --gpus '"device=0,1,2,3,4,5,6,7"': Specifies the GPUs to be used. I am using all 8 H100 GPUs here.
  • --name=vllm-pixtral124: Names the container for easy reference.
  • -v /raid/huggingface:/root/.cache/huggingface: Mounts the Hugging Face cache directory. If you already have a Hugging Face cache on your host, I strongly suggest mounting it like this, so the model only needs to be downloaded once and can be reused on later runs.
  • --env "HUGGING_FACE_HUB_TOKEN=<HF_TOKEN>": Sets the Hugging Face Hub token environment variable.
  • -p 8888:8000: Maps port 8888 on the host to port 8000 in the container, where vLLM’s OpenAI-compatible server listens.
  • --ipc=host: Uses the host’s IPC namespace.
  • vllm/vllm-openai:latest: Specifies the Docker image to use.
  • --model mistralai/Pixtral-Large-Instruct-2411: Specifies the model to be used. In this case we use Pixtral Large Instruct 2411.
  • --config-format mistral: Sets the configuration format to mistral.
  • --load-format mistral: Sets the weight load format to mistral as well.
  • --tokenizer_mode mistral: Sets the tokenizer mode to mistral, as recommended by the Mistral team.
  • --limit_mm_per_prompt 'image=10': Limits the number of images per prompt to 10, which helps avoid out-of-memory (OOM) errors.
  • --tensor-parallel-size 8: Shards the model across all 8 GPUs using tensor parallelism.
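
As a side note, once the container is up and the model has finished loading (more on monitoring that below), you can confirm the server is reachable by listing the served models on the host port we mapped (8888 in the command above):

    # Should return a JSON payload listing mistralai/Pixtral-Large-Instruct-2411
    curl -s http://localhost:8888/v1/models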

Now to monitor the logs and ensure everything is running smoothly, use the following command:

    docker logs -f vllm-pixtral124

This command provides real-time logs from the running container, allowing you to troubleshoot any issues that arise.

The first start might be very slow because vLLM has to download the huge model (200+ GB of weights) from Hugging Face.
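
Once the logs show that the server is up, you can send a quick test request to the OpenAI-compatible chat completions endpoint. Here is a minimal sketch; the prompt and the public placeholder image URL are just examples, and the port assumes the 8888:8000 mapping used above:

    # Minimal multimodal test request (placeholder image URL and prompt)
    curl http://localhost:8888/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "mistralai/Pixtral-Large-Instruct-2411",
        "messages": [{
          "role": "user",
          "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url", "image_url": {"url": "https://picsum.photos/id/237/400/300"}}
          ]
        }],
        "max_tokens": 128
      }'

If the request succeeds, the model’s answer appears in the response JSON under choices[0].message.content.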

Conclusion

Deploying Pixtral Large using vLLM is a straightforward process. By following the steps outlined above, you can efficiently self-host this awesome multimodal model for your own applications.

If you encounter any issues or have further questions, feel free to reach out to me. Happy deploying! 🙂

