Using Local LLM with .NET
06/25/2024
5 minutes
In this post I explore how one can run an LLM locally to create a chat assistant with .NET 8. I use two libraries:
Semantic Kernel: Stephen Toub's excellent blog post, Demystifying Retrieval Augmented Generation with .NET, describes how to use the library to create a chat agent.
LLamaSharp: a library that runs local LLaMA/GPT models easily and fast in C#. It uses llama.cpp under the hood. I use LLamaSharp version 0.10.0 along with Semantic Kernel version 1.3.0.
Get a Model
When running an LLM locally, we need to find a good model, preferably one small enough to run on a local machine. I am using models on a 2-in-1 machine with 16GB RAM and a 12th Gen Intel Core i7 CPU, and I find the experience acceptable.
There are many language models available. I chose Phi-2 because it is small, tends to perform well, and has an MIT license, although other models (such as Llama) can work just as well. A gguf version of the Phi-2 model can be found in TheBloke's repo.
Load the Model
To load the model, pass the file path of the gguf file to ModelParams. Then load the weights and pass the weights object to an embedder or to a StatelessExecutor:
var parametersPhi = new ModelParams(phiFilePath)
{
    ContextSize = 2048,
    Seed = 1337,
    GpuLayerCount = 5,
    EmbeddingMode = true,
};
// ...
using var modelEmbed = LLamaWeights.LoadFromFile(parametersPhi);
var embedding = new LLamaEmbedder(modelEmbed, parametersPhi, NullLogger.Instance);

using var modelChat = LLamaWeights.LoadFromFile(parametersPhiChat);
var ex = new StatelessExecutor(modelChat, parametersPhiChat, NullLogger.Instance);
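To verify that the model loads correctly, a quick prompt can be run against the stateless executor. This is a minimal sketch, assuming the ex executor created above and LLamaSharp's InferenceParams type; the prompt text and token limit are only illustrative:

// Stream the generated tokens back as they are produced by the local model.
var inferenceParams = new InferenceParams { MaxTokens = 64 };
await foreach (var token in ex.InferAsync("What is the capital of France?", inferenceParams))
{
    Console.Write(token);
}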
Integrate with Semantic Kernel
The embedder can be integrated with Semantic Kernel using the LLamaSharpEmbeddingGeneration type and the WithTextEmbeddingGeneration extension method on the MemoryBuilder:
ISemanticTextMemory memory = new MemoryBuilder()
    .WithMemoryStore(...)
    .WithTextEmbeddingGeneration(new LLamaSharpEmbeddingGeneration(embedding))
    .Build();
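The memory store elided above can be any memory store implementation. For local experiments, an in-memory store is the simplest option; this is a sketch assuming the Microsoft.SemanticKernel.Plugins.Memory package, which provides VolatileMemoryStore (embeddings are kept in process and lost when the application exits):

// Keep embeddings in memory only; suitable for experiments, not for persistence.
ISemanticTextMemory memory = new MemoryBuilder()
    .WithMemoryStore(new VolatileMemoryStore())
    .WithTextEmbeddingGeneration(new LLamaSharpEmbeddingGeneration(embedding))
    .Build();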
The chat client can be integrated by registering the LLamaSharpChatCompletion type as the chat completion implementation in the service collection of the Kernel:
var kernelBuilder = Kernel.CreateBuilder();
kernelBuilder.Services.AddKeyedSingleton<IChatCompletionService>("llama",
    new LLamaSharpChatCompletion(ex, new ChatRequestSettings()
    {
        MaxTokens = 256,
        Temperature = 0.4
    }));
var kernel = kernelBuilder.Build();
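Once registered, the chat completion service can be resolved from the kernel and invoked with a ChatHistory. This is a minimal sketch, assuming the kernel built above; the system message and question are only illustrative:

// Resolve the keyed chat completion service registered above.
var chatService = kernel.GetRequiredService<IChatCompletionService>("llama");

var history = new ChatHistory("You are a helpful assistant.");
history.AddUserMessage("What is Phi-2?");

// Ask the local model for a single response and print it.
var reply = await chatService.GetChatMessageContentAsync(history);
Console.WriteLine(reply.Content);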
MaxTokens sets the maximum number of tokens generated for the response. Temperature configures how much the model is allowed to 'wander': the higher the value, the more creative (and more made-up) the answers become.
Configuration
While writing the code for the chat service is quite straightforward (for a reference implementation follow Demystifying Retrieval Augmented Generation with .NET), that alone does not guarantee reasonable responses from the model. The chat assistant may 'imagine' too much, or, when guided by large embeddings, the response time may increase significantly. Most of the development time therefore goes into adjusting the configuration values so that the model provides helpful answers.
Here are some of the configuration values that result in significantly different behavior:
Provide a very good system message:
Microsoft.SemanticKernel.ChatCompletion.ChatHistory chat = new("You are an AI assistant who helps to summarize the Facts for the given question.");
Many times, rephrasing the message yields significantly better responses. The Temperature of the model controls how much it may wander: the higher the value, the more it will make up. With the Phi-2 model, I never found the answers reasonable above a value of 0.7. A value that is too low, on the other hand, will just return the facts as-is without adding anything.
Generating embeddings: embeddings help to include closed-source literature of the domain in question. Generating embeddings takes a significant amount of time.
List<string> paragraphs = TextChunker.SplitPlainTextParagraphs(
    TextChunker.SplitPlainTextLines(WebUtility.HtmlDecode(s), 128),
    256);

for (int i = 0; i < paragraphs.Count; i++)
    await memory.SaveInformationAsync(collectionName, paragraphs[i], $"paragraph-{source}-{i}");
Hence, it is best to try out a few samples before fixing the parameters and processing all documents. Both SplitPlainTextParagraphs and SplitPlainTextLines take a token size parameter, which determines the size of the individual text segments added to the embedding store. Each such text segment is returned later by SearchAsync to be used as a fact in the prompt. Segments that are too small lose the larger context of the paragraph, chapter, etc. Segments that are too large might include too much irrelevant information, so the search either excludes a hint due to a low relevance score, or the model wanders off on the irrelevant parts.
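As a rough way to compare chunk sizes before committing to one, the splitting can be run on a sample document with a few candidate values. This is only a sketch; the size pairs and the sampleText variable are illustrative, not taken from the original code:

// Try a few line/paragraph token sizes on a sample document and inspect the result.
foreach (var (lineTokens, paragraphTokens) in new[] { (64, 128), (128, 256), (256, 512) })
{
    var chunks = TextChunker.SplitPlainTextParagraphs(
        TextChunker.SplitPlainTextLines(sampleText, lineTokens), paragraphTokens);

    Console.WriteLine($"{lineTokens}/{paragraphTokens}: {chunks.Count} chunks, " +
        $"average length {chunks.Average(c => c.Length):F0} characters");
}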
Search and include embeddings in the prompt:
await foreach (var result in memory.SearchAsync(collectionName, question, limit: 2, minRelevanceScore: 0.5))
The number of included embeddings and the minimum relevance score will impact the response. The more facts added to the prompt, the slower the response, but it can be more accurate. A relevance score set too high may exclude facts that could enrich the prompt; for example, if a user used the wrong terminology in the question, potentially relevant facts might be missed.
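Putting the pieces together, the search results can be collected into a facts section and appended to the chat history before asking for a completion. This is a minimal sketch, assuming the memory, chat, and question objects from earlier; the prompt layout is only one possible choice:

// Collect the most relevant text segments as facts for the prompt.
var facts = new StringBuilder();
await foreach (var result in memory.SearchAsync(collectionName, question, limit: 2, minRelevanceScore: 0.5))
{
    facts.AppendLine(result.Metadata.Text);
}

// Add the facts and the question as a single user message.
chat.AddUserMessage($"Facts:\n{facts}\nQuestion: {question}");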
Conclusion
Creating a local AI chatbot is surprisingly easy; one can get started with less than 160 lines of code. However, making a chatbot reasonable for a given domain requires a lot of work, commonly referred to as prompt engineering. Embedding the right number of relevant facts for the bot to use requires tuning several parameters.