CodeShell is a code large language model (LLM) developed jointly by the Knowledge Computing Lab at Peking University and the AI team of Sichuan Tianfu Bank. CodeShell has 7 billion parameters, was trained on 500 billion tokens, and has a context window length of 8192. On authoritative code evaluation benchmarks (HumanEval and MBPP), CodeShell achieves the best performance for models of its scale. At the same time, we offer deployment solutions and IDE plugins that complement CodeShell. Please refer to the CodeShell repository for details.
The open-source models are as follows:
We selected the two most popular code evaluation datasets (HumanEval and MBPP) to evaluate the model. Compared with the two most advanced 7B code models, CodeLlama and Starcoder, Codeshell achieved optimal results. The specific evaluation results are as follows.
Task | CodeShell-7b | CodeLlama-7b | Starcoder-7b |
---|---|---|---|
humaneval | 34.32 | 29.44 | 27.80 |
mbpp | 38.65 | 37.60 | 34.16 |
multiple-js | 33.17 | 31.30 | 27.02 |
multiple-java | 30.43 | 29.24 | 24.30 |
multiple-cpp | 28.21 | 27.33 | 23.04 |
multiple-swift | 24.30 | 25.32 | 15.70 |
multiple-php | 30.87 | 25.96 | 22.11 |
multiple-d | 8.85 | 11.60 | 8.08 |
multiple-jl | 22.08 | 25.28 | 22.96 |
multiple-lua | 22.39 | 30.50 | 22.92 |
multiple-r | 20.52 | 18.57 | 14.29 |
multiple-rkt | 17.20 | 12.55 | 10.43 |
multiple-rs | 24.55 | 25.90 | 22.82 |
The CodeShell series models have been uploaded to Hugging Face. Developers can quickly call CodeShell and CodeShell-Chat through Transformers.
Before starting, make sure you have set up the environment correctly, installed the necessary packages, and meet the environmental requirements from the previous section. The necessary dependencies can be installed quickly using the following code:
pip install -r requirements.txt
Next, you can use CodeShell through Transformers.
Developers can use CodeShell to quickly generate code, accelerating development efficiency.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("WisdomShell/CodeShell-7B")
model = AutoModelForCausalLM.from_pretrained("WisdomShell/CodeShell-7B", trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()
inputs = tokenizer('def merge_sort():', return_tensors='pt').cuda()
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
CodeShell supports the Fill-in-the-Middle mode to better assist the software development process.
input_text = "<fim_prefix>def print_hello_world():\n <fim_suffix>\n print('Hello world!')<fim_middle>"
inputs = tokenizer(input_text, return_tensors='pt').cuda()
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
CodeShell has also open-sourced the CodeShell-7B-Chat code assistant model. Developers can interact with the model using the following code.
model = AutoModelForCausalLM.from_pretrained('WisdomShell/CodeShell-7B-Chat', trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()
tokenizer = AutoTokenizer.from_pretrained('WisdomShell/CodeShell-7B-Chat')
history = []
query = 'Who are you?'
response = model.chat(query, history, tokenizer)
print(response)
history.append((query, response))
query = 'Write an HTTP server in Python'
response = model.chat(query, history, tokenizer)
print(response)
history.append((query, response))
Developers can also interact with CodeShell-7B-Chat through VS Code and JetBrains plugins. For details, please refer to the VSCode plugin repository and IntelliJ plugin repository.
CodeShell supports 4 bit/8 bit quantization. After 4-bit quantization, the memory footprint is approximately 6GB, allowing users to use CodeShell on GPUs with smaller memory.
model = AutoModelForCausalLM.from_pretrained('WisdomShell/CodeShell-7B-Chat-int4', trust_remote_code=True).to(device)
tokenizer = AutoTokenizer.from_pretrained('WisdomShell/CodeShell-7B-Chat-int4')
As most personal computers lack a GPU, CodeShell offers C/C++ inference support. Developers can compile based on the local environment. See CodeShell C/C++ local version. After compilation, the Web API service can be started with the following command.
We offer demos in four forms: Web-UI, command line, API, and IDE.
Developers can start the Web service using the following command. After the service starts, it can be accessed at https://127.0.0.1:8000.
python demos/web_demo.py
We also offer a command-line interactive demo version. Developers can run it using the following command.
python cli_demo.py
CodeShell also offers a deployment method based on the OpenAI API.
python openai_api.py
Then you can interact with CodeShell via HTTP requests:
curl http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "CodeShell-7B-Chat",
"messages": [
{
"role": "user",
"content": "你好"
}
]
}'
Finally, CodeShell offers an online IDE. Developers can use the IDE for code completion, code Q&A, and other operations. IDE plugins are also released, and developers can install and use them locally. For plugin-related issues, please discuss in the VS Code plugin repository.
Code Shell uses GPT-2 as its basic architecture and employs technologies like Grouped-Query Attention and RoPE relative position encoding.
Hyper-parameter | Value |
---|---|
n_layer | 42 |
n_embd | 4096 |
n_inner | 16384 |
n_head | 32 |
num_query_groups | 8 |
seq-length | 8192 |
vocab_size | 70144 |
CodeShell was trained based on its own scraped Github data, the open-source Stack and StarCoder datasets from Big Code, as well as a small amount of high-quality Chinese and English data. On top of the original dataset, CodeShell used Minihash for data deduplication, KenLM, and a high-quality data selection model for data filtering and selection, resulting in a high-quality pre-training dataset.
CodeShell optimized the Starcoder vocabulary by removing infrequently used words and adding some Chinese vocabulary, significantly improving the Chinese compression rate, laying the groundwork for the training of the Chat version.
Tokenizer | Size | Chinese | English | Code | Total |
---|---|---|---|---|---|
Starcoder | 49152 | 1.22 | 3.47 | 3.30 | 2.66 |
CodeShell | 70144 | 1.50 | 3.47 | 3.30 | 2.95 |
The community's use of the CodeShell model must adhere to the "CodeShell Model License Agreement" and the Apache 2.0 license. CodeShell is permitted for commercial use. However, if you plan to use the CodeShell model or its derivative products for commercial purposes, you must confirm that the entity meets the following conditions:
Under the aforementioned conditions, you need to submit the application materials required by the "CodeShell Model License Agreement" by sending an email to codeshell.opensource@gmail.com. After approval, you will be granted a global, non-exclusive, non-transferable, non-sublicensable commercial copyright license.
🤗 Hugging Face • 🤖 ModelScope • ⭕️ WiseModel • 🌐 PKU-KCL
Author: WisdomShell
Source Code: https://github.com/WisdomShell/codeshell