Once upon a time, gamers were free
To play what they wanted, without decree
They enjoyed their hobby, with no one to blame
And the world of gaming was truly a game
But then came GamerGate, and all hell broke loose
As censorship reared its ugly head, in full force
People wanted games changed, to fit their views
To make them more politically correct, for all to choose
But gamers stood strong, they would not yield
For gaming was their passion, they were fielded
They fought against those who sought to destroy
The freedom of choice that gaming brings with joy
So censorship is wrong, it's plain to see
We must protect our games and maintain our liberty
To play what we want, without fear or doubt
And keep the world of gaming alive, with no way out.
70b? What kind of graphics card do you have? I have a pretty good one, and I'm limited to 13b by my calculations.
GeForce GTX 1080.
I also have 64 GB RAM and 128 GB swap.
You are an absolute psycho. I salute you.
how many tokens per second were you getting?
And are you ready for the llama3 700b model?
Not sure what you mean by "tokens per second"
how many words per second does it generate?
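For reference, a "token" is a subword chunk, roughly 3/4 of an English word on average, so tokens per second is close to (but a bit higher than) words per second. If you want to measure it yourself, here's a rough sketch using llama-cpp-python; the model path and settings are placeholders, not OP's actual setup:

```python
import time
from llama_cpp import Llama  # pip install llama-cpp-python

# Placeholder path; point this at whatever GGUF file you actually have.
llm = Llama(model_path="./models/llama-2-13b.Q4_K_M.gguf", n_ctx=2048)

start = time.time()
out = llm("Write a short poem about video games.", max_tokens=256)
elapsed = time.time() - start

n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.2f} tokens/s")
print(f"roughly {n_tokens * 0.75 / elapsed:.2f} words/s")
```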
Cool! I haven't played around with running any local models. I have a 4080 but only 16 GB of system RAM. How limiting is the system RAM?
Any part of a model that spills into system RAM will cut performance to under 5 tokens per second, and bigger models run even slower.
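A quick back-of-the-envelope check for whether a model fits entirely in VRAM: a GGUF file needs roughly its own size in memory, plus some overhead for the KV cache. The path and VRAM figure below are placeholders:

```python
import os

model_path = "./models/llama-2-70b.Q4_K_M.gguf"  # placeholder path
vram_gb = 8  # e.g. a GTX 1080

model_gb = os.path.getsize(model_path) / 1024**3
if model_gb > vram_gb:
    print(f"{model_gb:.1f} GB model won't fit in {vram_gb} GB of VRAM;")
    print("whatever spills into system RAM will drag generation speed down.")
else:
    print(f"{model_gb:.1f} GB model should fit, with room left for the KV cache.")
```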
You can also rent models through OpenRouter and use them with SillyTavern. It's nice since you can pay with crypto. Just don't expect anything you generate through a paid service to be fully anonymous, though.
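For anyone curious, OpenRouter exposes an OpenAI-style chat-completions API, so a minimal sketch looks something like this; the model name is just an example, check their catalog for what's actually offered:

```python
import requests

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_OPENROUTER_KEY"},  # placeholder key
    json={
        "model": "meta-llama/llama-2-70b-chat",  # example model id
        "messages": [{"role": "user", "content": "Write a poem about gaming."}],
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```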
You can run on regular RAM if you don't mind it being painfully slow. Depending on what you're trying to generate, it might not matter.
70B models are leaps and bounds beyond 13B and 34B. It's definitely worth running at 2 tokens per second if you absolutely need something closer to factually correct, but it's not remotely worth it for ordinary chatting.
llama2-based models can be quantized in ways that use much less memory for comparable results. And as OP mentions, you can take a performance hit and use RAM/swap to stretch what you can run.
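To get a feel for what quantization buys you, here's some rough arithmetic; the bits-per-weight figures are approximate averages for common llama.cpp quant formats, not exact values:

```python
params = 70e9  # a 70B model

# Approximate bits per weight, including quantization overhead.
for name, bpw in [("fp16", 16.0), ("Q8_0", 8.5), ("Q5_K_M", 5.5), ("Q4_K_M", 4.8)]:
    gb = params * bpw / 8 / 1024**3
    print(f"{name:>7}: ~{gb:.0f} GB")

# fp16 needs ~130 GB, while a 4-bit quant squeezes the same weights
# into ~40 GB: the difference between impossible and merely slow
# on a 64 GB RAM + swap setup like OP's.
```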
Also, with the GGUF format, you can keep most of the work on standard memory and the CPU while having the GPU speed up part of the process.
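A minimal sketch of that CPU/GPU split with llama-cpp-python, assuming a GPU-enabled build; the path and layer count are placeholders you'd tune to your hardware:

```python
from llama_cpp import Llama  # needs a GPU-enabled (e.g. CUDA/cuBLAS) build

llm = Llama(
    model_path="./models/llama-2-70b.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=20,  # layers offloaded to VRAM; -1 offloads everything
    n_ctx=2048,       # the remaining layers stay in system RAM on the CPU
)
print(llm("Hello,", max_tokens=32)["choices"][0]["text"])
```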