
Been super impressed with local models on Mac. Love that the Gemma models have a 128k-token input context size. However, outputs are usually pretty short.

Any tips on generating long output? Like multiple pages of a document, a story, a play or even a book?



The tool you are using may be setting a default limit on context (and effectively on output) without you realizing it. Ollama, for example, has a num_ctx parameter that defaults to 2048: https://github.com/ollama/ollama/blob/main/docs/faq.md#how-c...
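
If it helps, this is roughly how you can raise it per request through Ollama's REST API (untested sketch; the model tag is just an example):

    import requests

    # Ask for a completion with a larger context window than the 2048 default.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "gemma3:12b",  # example tag, use whatever you have pulled
            "prompt": "Write a detailed, multi-page short story about ...",
            "stream": False,
            "options": {"num_ctx": 8192},
        },
    )
    print(resp.json()["response"])

You can also set it interactively with /set parameter num_ctx 8192 inside an ollama run session, or bake it into a Modelfile with PARAMETER num_ctx 8192.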


Been playing with that, but it doesn't seem to have much effect. It works very well to limit output to smaller bits, like setting it to 100-200, but above 2-4k the output never seems to get longer than about a page.

Might try using the models with mlx instead of ollama to see if that makes a difference
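
If the MLX route pans out, this is roughly what I'd try with mlx-lm (untested sketch; the model id is just an example and the exact generate() signature may vary between versions):

    from mlx_lm import load, generate

    # Example quantized model from the mlx-community org on Hugging Face.
    model, tokenizer = load("mlx-community/gemma-3-12b-it-4bit")

    # max_tokens caps the generated length, so set it well above the default.
    text = generate(
        model,
        tokenizer,
        prompt="Write the first chapter of a mystery novel set in Lisbon.",
        max_tokens=4096,
    )
    print(text)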

Any tips on prompting to get longer outputs?

Also, does the model context size determine max output size? Are the two related or are they independent characteristics of the model?


Interestingly, the Gemma 3 docs say: https://ai.google.dev/gemma/docs/core/model_card_3#:~:text=T...

> Total output context up to 128K tokens for the 4B, 12B, and 27B sizes, and 32K tokens for the 1B size per request, subtracting the request input tokens

I don't know how to get it to output anything that length though.


Thank you for the insights and useful links

Will keep experimenting, will also try mistral3.1

edit: just tried mistral3.1 and the quality of the output is very good, at least compared to the other models I tried (llama2:7b-chat, llama2:latest, gemma3:12b, qwq and deepseek-r1:14b)

Doing some research, it seems like most models are not trained on producing long outputs because of their training sets, so even if they technically could, they won't. Might require developing my own training dataset and then doing some fine-tuning. Apparently the models and Ollama also have some safeguards against rambling and repetition.
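
For anyone else hitting this, these look like the relevant Ollama generation options (my current guesses at useful values; check the Modelfile docs for exact semantics and defaults). They drop into the same options field as the API example upthread:

    options = {
        "num_ctx": 16384,        # context window, shared by prompt + output
        "num_predict": 8192,     # max tokens to generate (-1 = no fixed cap)
        "repeat_penalty": 1.05,  # soften the anti-repetition penalty a little
        "repeat_last_n": 256,    # how far back the repetition penalty looks
    }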


You can probably find some long-form tuned models on HF. I've had decent results with QwQ-32B (which I can run on my desktop) and Mistral Large (which I have to run on my server). Generating and refining an outline before writing the whole piece can help, and you can also split the piece up into multiple outputs (working a paragraph or two at a time, for instance). So far I've found it to be a tough process, with mixed results.
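
The outline-then-expand loop, as a rough sketch against the Ollama API (model tag and prompts are just placeholders):

    import requests

    OLLAMA = "http://localhost:11434/api/generate"
    MODEL = "qwq:32b"  # placeholder; any local model tag works

    def ask(prompt):
        r = requests.post(OLLAMA, json={"model": MODEL, "prompt": prompt, "stream": False})
        return r.json()["response"]

    # 1. Generate (and ideally refine) an outline first.
    outline = ask("Write a 10-chapter outline for a noir novella, one line per chapter.")

    # 2. Expand one chapter at a time, feeding back what exists so far.
    chapters = []
    for entry in [l for l in outline.splitlines() if l.strip()]:
        so_far = "\n\n".join(chapters)[-4000:]  # crude way to stay inside the context window
        chapters.append(ask(
            f"Outline:\n{outline}\n\nStory so far (possibly truncated):\n{so_far}\n\n"
            f"Write the full chapter for this outline entry, several pages long:\n{entry}"
        ))

    print("\n\n".join(chapters))

A smarter version would summarize earlier chapters instead of just truncating them.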


Thank you, will try out your suggestions

Have you used something like a director model to supervise the output? If so, could you comment on the effectiveness of it and potentially any tips?
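
What I have in mind is roughly this kind of loop (untested sketch, placeholder model tags):

    import requests

    def ollama(model, prompt):
        r = requests.post("http://localhost:11434/api/generate",
                          json={"model": model, "prompt": prompt, "stream": False})
        return r.json()["response"]

    WRITER = "gemma3:12b"          # placeholder tags
    DIRECTOR = "mistral-small3.1"

    draft = ollama(WRITER, "Write the opening scene of a stage play about a lighthouse keeper.")
    for _ in range(3):  # a few supervision rounds
        notes = ollama(DIRECTOR,
                       f"You are a director. Give terse notes on pacing, length and repetition:\n\n{draft}")
        draft = ollama(WRITER,
                       f"Revise the scene according to these notes, keeping what already works.\n\n"
                       f"Notes:\n{notes}\n\nScene:\n{draft}")
    print(draft)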


Nope, sounds neat though. There's so much to keep up with in this space.


I'm using 12b and getting seriously verbose answers. It's squeezed into 8GB and takes its sweet time, but the answers are really solid.


This is basically the opposite of what I've experienced - at least compared to another recent entry like IBM's Granite 3.3.

By comparison, Gemma 3's output (both 12b and 27b) typically seems longer and more verbose, but not problematically so.


I agree with you. The outputs are usually good; it's just that for my current use case (writing several pages of long dialogue), the output is not as long as I'd want, and definitely not as long as the model is supposedly capable of producing.



