The thread suggests it doesn't even quantize the model (running it in FP16, so tons of RAM usage), and that it's slower than the llama.cpp Metal backend anyway?
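(Back-of-envelope, assuming a 7B-parameter model: FP16 is 2 bytes per weight, so roughly 14 GB, versus around 4 GB at 4-bit quantization.)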
And MLC-LLM was faster than llama.cpp, last I checked. It's hard to keep up with developments.
I think llama.cpp is the sweet spot right now, due to its grammar capability and many other features (e.g., multimodal). MLC-LLM is nice but they don't offer uncensored models.
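For example, a grammar file like this (a minimal GBNF sketch; llama.cpp's main can load it with --grammar-file) constrains the model to a plain yes/no answer:

```
root   ::= answer
answer ::= ("yes" | "no")
```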
- A: You can convert models to MLC yourself, just like GGUF models, with relative ease.
- B: Yeah, llama.cpp has a killer feature set. And killer integration with other frameworks. MLC is way behind, but is getting more fleshed out every time I take a peek at it.
- C: This is a pet peeve of mine, but I've never run into a local model that was really censored. For some, if you give them a GPT-4-style prompt... of course you get a GPT-4-style response. But you can just give them an unspeakable system prompt or completion, and they will go right ahead and complete it. I don't really get why people fixate on the "default personality" of models trained on GPT-4 data.
Llama.cpp is great, but I have moved to mostly using Ollama because it is both good on the command line and 'ollama serve' runs a very convenient REST server.
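Here's a minimal sketch of calling that REST server from Python (assuming the default port 11434, the /api/generate endpoint, and a model such as llama2 that has already been pulled):

```python
import json
import urllib.request

# Ask a local model a question via Ollama's REST API.
# Assumes `ollama serve` is running on the default port 11434
# and that the "llama2" model has already been pulled.
payload = {
    "model": "llama2",
    "prompt": "Explain GBNF grammars in one sentence.",
    "stream": False,  # return one JSON object instead of a token stream
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```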
In any case, I had fun with MLX today, and I hope it implements 4-bit quantization soon.