
Local LLMs are language models that run locally on your own graphics card or CPU.

There are three parts to running an LLM: the model, the backend, and the front end.

The model determines behavior. This is the part that may or may not be censored, or that will steer itself away from certain topics. This is also where size and quantization come in: size is the number of parameters, and quantization determines how many bits each parameter is stored in.

13B and 33B are the sweet spot for size, and 4-bit is the sweet spot for quantization right now. Smaller is faster, dumber, and easier to run. There is also context size, but that's changing rapidly right now. Context size determines how long a passage of text the model can process, and consequently how long its memory is in conversation mode.
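To get a feel for why those numbers are the sweet spot, here's a rough back-of-the-envelope estimate in Python. The 20% overhead factor for context and bookkeeping is my own assumption, not a fixed rule:

```python
# Rough memory estimate for a quantized model: parameters x bits per weight,
# plus some headroom for context/KV cache (the 1.2 factor is an assumption).
def approx_model_size_gb(params_billions: float, bits_per_weight: int, overhead: float = 1.2) -> float:
    bytes_per_weight = bits_per_weight / 8
    return params_billions * bytes_per_weight * overhead  # billions of bytes ~= GB

for params in (7, 13, 33, 65):
    print(f"{params}B @ 4-bit ~ {approx_model_size_gb(params, 4):.1f} GB")
# 13B at 4-bit lands around 7-8 GB, which is why it fits on a single consumer GPU,
# while 65B pushes toward 40 GB, which is why people reach for system RAM instead.
```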

The backend determines how the model is run. The model either needs to be loaded into your GPU's VRAM to run on your graphics card, or into your system RAM to run on your CPU.

Oobabooga is a popular backend for running on GPU, and Llama.cpp and Kobold.cpp are backends that run on CPU.
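If you want to see what the CPU path looks like in code, here's a minimal sketch using llama-cpp-python, the Python bindings for Llama.cpp. The model filename and thread count are placeholders for whatever you actually downloaded and whatever your machine has:

```python
# Minimal CPU-only load-and-generate with llama-cpp-python (bindings for Llama.cpp).
from llama_cpp import Llama

llm = Llama(
    model_path="./models/my-13b-q4.gguf",  # placeholder: point at your quantized model file
    n_ctx=2048,     # context window: how much text it can "remember"
    n_threads=8,    # CPU threads to use
)

out = llm("Q: What is a local LLM?\nA:", max_tokens=128, temperature=0.7)
print(out["choices"][0]["text"])
```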

Of course the GPU runs much faster, but it's far easier to expand your system RAM to the 64 or 128 GB you'll need for the big models than it is to get that much VRAM.

KoboldCpp and Llama.cpp both have GPU offload modes that split the model between VRAM and system RAM, which will speed you up a bit over running purely on the CPU.
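In the same Python bindings, the offload knob is n_gpu_layers. The 35 here is just an illustrative number to tune against how much VRAM you have, and the model path is the same placeholder as above:

```python
# Same idea, but offloading part of the model to the GPU.
# n_gpu_layers controls how many transformer layers go into VRAM; the rest stay in RAM.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/my-13b-q4.gguf",  # placeholder filename
    n_ctx=2048,
    n_gpu_layers=35,   # partial offload: more layers in VRAM = faster, but needs more VRAM
)
```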

For front ends, both Kobold and Ooba have built-in front ends, but there are dedicated front ends like SillyTavern that make interacting with the model more like a chat app.
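Under the hood, a front end like SillyTavern is just talking to the backend's local HTTP API. A hedged sketch, assuming KoboldCpp's default port 5001 and its /api/v1/generate endpoint; check your backend's docs if yours differs:

```python
# Front ends POST prompts to the backend's local HTTP API and display the reply.
# Port and endpoint here assume KoboldCpp's defaults.
import requests

resp = requests.post(
    "http://localhost:5001/api/v1/generate",
    json={
        "prompt": "You are a helpful assistant.\nUser: Hi!\nAssistant:",
        "max_length": 80,
        "temperature": 0.7,
    },
)
print(resp.json()["results"][0]["text"])
```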

So that was the short version. The technology all exists and is pretty good, but it's a pain to set up. I must emphasize that this is all private and uncensored, and you are completely in control of its behavior. Many people create models by taking existing models and fine-tuning them on specific training data. This can be done very cheaply and quickly, unlike the original models, which cost thousands of dollars to create.
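Those cheap fine-tunes are usually LoRA-style adapters layered on top of an existing model. A rough sketch with Hugging Face's transformers and peft libraries; the checkpoint name, target modules, and hyperparameters are placeholders, not a recipe:

```python
# Attach small LoRA adapter matrices to an existing model and train only those,
# which is why fine-tuning is so much cheaper than training from scratch.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("some-llama-13b-checkpoint")  # placeholder name
tok = AutoTokenizer.from_pretrained("some-llama-13b-checkpoint")

lora = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], lora_dropout=0.05)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # typically well under 1% of the full model's weights
# ...then train on your own data with the usual transformers training loop.
```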

This is why so much of the ecosystem is based around Llama, Facebook's model: it was the first model to be leaked in a format that allowed open-source models to be built off it.

In a few years someone will streamline this enough that it will be accessible to people who aren't enthusiasts. Right now the market is basically super-normie censored offerings like ChatGPT, and super-enthusiast models that require you to know a bit about Linux, TensorFlow, and CUDA to get a grip on.
