August 31, 2023
Comparing Code Llama Models Locally
Trying out new LLMs can be cumbersome. Two of the biggest challenges are:
Disk space: there are many different variants of each LLM and downloading all of them to your laptop or desktop can use up 500-1000 GB of disk space easily.
No access to an NVIDIA GPU: most people don’t have an NVIDIA GPU lying around, but modern laptops (like the M1 and M2 MacBooks) have surprisingly good graphics capabilities.
In this post, we’ll showcase how you can stream individual model files on-demand (which helps reduce the burden on your disk space) and how you can use quantized models to run on your local machine’s graphics hardware (which helps with the 2nd challenge).
We wrote this post with owners of Apple Silicon pro computers in mind (e.g. M1 / M2 MacBook Pro or Mac Studio) but you can modify a single instruction (the llama.cpp compilation instruction) to try on other platforms.
Before we dive in, we’re thankful for the work of TheBloke (Tom Jobbins) for quantizing the models themselves, the Llama.cpp community, and Meta for making it possible to even try these models locally with just a few commands.
Llama 2 vs Code Llama
As a follow-up to Llama 2, Meta recently released a specialized set of models named Code Llama. These models have been trained on code-specific datasets for better performance on coding assistance tasks. According to a slew of benchmark measures, the Code Llama models perform better than regular Llama 2:
Code Llama was also trained to provide stable generation with up to 100,000 tokens of context. This enables some pretty unique use cases.
For example, you could feed a stack trace along with your entire code base into Code Llama to help you diagnose the error.
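To get a feel for whether a code base could plausibly fit in that context window, here is a minimal sketch. It assumes a rough heuristic of about 4 characters per token, which is only an approximation and not Code Llama's actual tokenizer; the function names are hypothetical and exist only for this illustration.

```python
from pathlib import Path

CONTEXT_TOKENS = 100_000  # context length quoted in this post
CHARS_PER_TOKEN = 4       # assumption: rough heuristic, not the real tokenizer


def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on character count."""
    return len(text) // CHARS_PER_TOKEN


def codebase_fits(root: str, stack_trace: str, suffix: str = ".py") -> bool:
    """Estimate whether every source file under `root` plus a stack trace fits in context."""
    total = estimate_tokens(stack_trace)
    for path in Path(root).rglob(f"*{suffix}"):
        total += estimate_tokens(path.read_text(errors="ignore"))
    return total <= CONTEXT_TOKENS
```

For a real measurement you would count tokens with the model's own tokenizer; this estimate only tells you whether you are in the right ballpark.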
The Many Flavors of Code Llama
Code Llama has 3 main flavors of models:
Code Llama (vanilla): fine-tuned from Llama 2 for language-agnostic coding tasks
Code Llama - Python: further fine-tuned on 100B tokens of Python code
Code Llama - Instruct: further fine-tuned to generate helpful (and safe) answers in natural language
For each of these models, different versions have been trained with different parameter counts to accommodate different compute and latency requirements:
7 billion (or 7B for short): can be served on a single NVIDIA GPU (without quantization) and has lower latency
13 billion (or 13B for short): more accurate but a heavier GPU is needed
34 billion (or 34B for short): slower, higher performing, but has the highest GPU requirements
For example, the Code Llama - Python variant with 7 billion parameters is referenced as codellama-7b-python across this post and across the web. Also, here's Meta’s diagram comparing the model training approaches:
To take advantage of XetHub’s ability to mount the model files to your local machine, they need to be hosted on XetHub. To run the models locally, we’ll be using the XetHub mirror of the CodeLlama models quantized by TheBloke (aka Tom Jobbins). You'll notice that datasets added to XetHub also get deduplicated to reduce the repo size.
Tom has published models for each combination of model type and parameter count. For example, here’s the HF repo for CodeLlama-7B-GGUF. You’ll notice that each model type has multiple quantization options:
The CodeLlama-7B model alone has 10 different quantization variants. Generally speaking, the more bits used in the quantization process (8 vs 2), the more memory is needed (either standard RAM or GPU RAM), but the higher the quality of the output.
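The memory side of this trade-off can be sketched with simple arithmetic: parameters times bits per weight, divided by 8 to get bytes. This is a lower bound for the weights alone, assuming every weight is stored at exactly the nominal bit width. Real GGUF files are typically somewhat larger because quantization schemes also store scaling data and may keep some tensors at higher precision. The function name is made up for this example.

```python
def approx_weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Lower-bound size of the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Nominal sizes for a 7-billion-parameter model at several bit widths.
for bits in (2, 4, 8, 16):
    print(f"7B model at {bits}-bit: ~{approx_weights_gb(7e9, bits):.2f} GB")
```

By this estimate, an 8-bit 7B model needs roughly four times the memory of a 2-bit one, which is why the lower-bit variants are attractive on laptops.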
GGML vs GGUF
The llama.cpp community initially used the .ggml file format to represent quantized model weights but they’ve since moved on to the .gguf file format. There are a number of reasons and benefits of the switch, but 2 of the most important reasons include:
Support for non-llama models in llama.cpp like Falcon
In an earlier post, I cover how to run the Llama 2 models on your MacBook. That post covers the pre-reqs you need to run any ML model hosted on XetHub. Follow steps 0 to 3 and then come back to this post. Also make sure you’ve signed the license agreement from Meta and you aren’t violating their community license.
Once you’re set up with PyXet and XetHub, and you’ve compiled llama.cpp for your laptop, run the following command to mount the XetHub/codellama repo to your local machine:
This should finish in just a few seconds because none of the model files are downloaded to your machine up front. As a reminder, the XetHub repo for these models lives at this link.
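Once the repo is mounted, the model files show up as ordinary paths, so you can browse the available variants from a script. The sketch below assumes the mount lives in a local `codellama` directory and that files follow the `<model>.<quantization>.gguf` naming seen in paths like `codellama/GGUF/7b/codellama-7b.Q2_K.gguf`. Both helper functions are hypothetical names for this example.

```python
from pathlib import Path


def parse_gguf_name(filename: str) -> tuple[str, str]:
    """Split 'codellama-7b.Q2_K.gguf' into ('codellama-7b', 'Q2_K')."""
    stem = filename.removesuffix(".gguf")
    model, _, quant = stem.rpartition(".")
    return model, quant


def list_models(mount_dir: str = "codellama") -> list[tuple[str, str, str]]:
    """Return (relative path, model name, quantization tag) for each .gguf file."""
    root = Path(mount_dir)
    return [
        (str(p.relative_to(root)), *parse_gguf_name(p.name))
        for p in sorted(root.rglob("*.gguf"))
    ]
```

Calling `list_models()` only walks the directory listing; it does not read the model files themselves.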
Running the Smallest Model
Now, you can run any Code Llama model you like by changing which model file you point llama.cpp to. The model file you need will be downloaded and cached behind the scenes.
Here’s a breakdown of the code:
llama.cpp/main -ngl 1 : when compiled appropriately, specifies the number of layers (1) to run on the GPU (increasing performance)
--model codellama/GGUF/7b/codellama-7b.Q2_K.gguf: path to the model we want to use for inference. This is a 2-bit quantized version (Q2_K) of the codellama-7b model
--prompt "In Snowflake SQL, how do I count the number of rows in a table?" : the prompt we want the model to respond to
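If you plan to compare several models, it can help to assemble this command in a script instead of retyping it. Here is a minimal sketch that builds the argument list from the pieces described above, assuming the `llama.cpp/main` binary path used in this post and the long-form `--model` and `--prompt` flags. The helper name is hypothetical.

```python
import shlex


def build_llama_command(model_path: str, prompt: str, gpu_layers: int = 1) -> list[str]:
    """Assemble the llama.cpp invocation as an argument list."""
    return [
        "llama.cpp/main",
        "-ngl", str(gpu_layers),  # number of layers to run on the GPU
        "--model", model_path,    # path to the quantized .gguf file
        "--prompt", prompt,       # the question for the model
    ]


cmd = build_llama_command(
    "codellama/GGUF/7b/codellama-7b.Q2_K.gguf",
    "In Snowflake SQL, how do I count the number of rows in a table?",
)
print(shlex.join(cmd))  # shell-quoted form you can paste into a terminal
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell-quoting problems when a prompt contains quotes or question marks.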
And now we wait a few minutes! Depending on your internet connection, it might take 5-10 minutes for your computer to download the model file behind the scenes the first time. Subsequent predictions with the same model will happen in under a second.
Comparing Instruct with Python
Let’s ask the following question to the codellama-7b-instruct and the codellama-7b-python variants, both quantized to 8 bits: “How do I find the max value of a specific column in a pandas dataframe? Just give me the code snippet”
Here’s the output from codellama-7b-instruct:
Next let’s try codellama-7b-python:
Here’s the output:
For this specific example and run, the codellama-7b-python model variant returns an accurate response while the generic codellama-7b-instruct one seems to give an inaccurate one. Running the same prompt again often yields different responses, so it’s very challenging to get consistent results from these quantized models. With the default settings, their output is definitely not deterministic.
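One likely contributor to this run-to-run variation is token sampling: at each step the next token is drawn at random from the model's predicted distribution, and llama.cpp exposes options such as `--temp` and `--seed` to control this. The toy sketch below illustrates the idea with made-up scores and is not llama.cpp's actual sampler.

```python
import math
import random


def sample_token(logits: dict[str, float], temperature: float, rng: random.Random) -> str:
    """Pick one token from `logits` using temperature sampling."""
    if temperature <= 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(logits, key=logits.get)
    scaled = {tok: score / temperature for tok, score in logits.items()}
    top = max(scaled.values())
    # Subtracting the max keeps exp() numerically stable.
    weights = [math.exp(s - top) for s in scaled.values()]
    return rng.choices(list(scaled), weights=weights, k=1)[0]


# Made-up scores for the token that follows "df['col']."
logits = {"max": 2.0, "idxmax": 1.5, "sort_values": 0.5}

# With a positive temperature the pick depends on the random seed.
print([sample_token(logits, 0.8, random.Random(seed)) for seed in range(5)])
# With temperature 0 the pick is always the top-scoring token.
print([sample_token(logits, 0.0, random.Random(seed)) for seed in range(5)])
```

Fixing the seed or lowering the temperature makes comparisons between model variants more repeatable, at the cost of less varied output.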
Comparing 2 Bit with 8 Bit Models
Let’s now try asking a SQL code generation question to a 2-bit vs an 8-bit quantized version of codellama-7b-instruct.
Here’s the command to submit the prompt to the 2-bit version:
Here's the output:
From this response, we can actually see what looks like leakage from the underlying dataset (likely Stack Overflow). Let's submit the prompt to the 8-bit version now:
Here’s the output:
This response returns a useful answer without leaking any underlying data, and overall the 8-bit version seems to provide more helpful responses than the 2-bit version. Sadly, neither answer lives up to the experience that ChatGPT provides but Code Llama is at least open source and can be fine-tuned on private data safely.
What else can you use XetHub for?
XetHub is a versioned blob store built for ML teams. You can copy terabyte-scale datasets, ML models, and other files from S3, Git LFS, or another data repo and get the same benefits of mounting and streaming those files to your machine. Branches enable you to make changes and compare the same file between different branches. Any changes you make in the repo can be pushed back quickly thanks to block-level deduplication built into xet. Finally, you can launch Streamlit, Gradio, or custom Python apps from the data in your XetHub repos.
This XetHub workflow enables a host of cool use cases:
You can play with Stable Diffusion using the Stable Diffusion Text-to-Image Generator repo
You can mount and analyze the Common Crawl, StackExchange, and Wikipedia folders from the RedPajama repo
If you have questions or run into issues, join our Slack community to meet us and other XetHub users!