Currently, what I'm seeing with RWKV is that attention fades off quickly. The model will start to produce output, but very quickly (within a few dozen tokens) its own output tokens take 'precedence' over the input question, and it starts to simply repeat itself.
For example, I'm currently attempting to use RWKV for named entity extraction. I ask it to analyze a piece of text and provide output in JSON format. It starts off great. However, eventually it seems like the beginning of the JSON list 'overtakes' the question I asked, and it starts to produce random data that merely looks plausible given the items already in the list. I suspect this is perhaps due to precision losses in the RNN as its state decays.
However, I feel there ought to be some way we can prevent that. Any thoughts?
Yeah... So I did that, which is how I got it to begin correctly. This is what I mean, though.
I'll say "get a list of Blah from the following document in Json format like this:
Example"
Then I feed the document and add a spot for the answer.
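Roughly, the prompt ends up laid out like this sketch (the entity type, example schema, and helper name here are just placeholders for illustration, not my real task or data):

```python
# A sketch of the prompt layout described above; the entity type and the
# example JSON schema are placeholders, not the real task or data.
def build_prompt(document: str) -> str:
    return (
        'Get a list of product names from the following document, '
        'in JSON format like this: ["Acme Widget", "FooBar 3000"]\n\n'
        f"Document:\n{document}\n\n"
        "Answer:"
    )

print(build_prompt("The Acme Widget outsold the FooBar 3000 last quarter."))
```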
The model begins correctly. But usually in the middle of generating the JSON list it will veer off and start hallucinating, as if it had forgotten the document and the task. I'm happy to share specifics and datasets, but this is a cross-cutting problem.
RWKV is able to answer my questions when I ask for a simple yes/no answer or a classification. It's the listing that throws it for a loop. Transformers do not have the same problem: both LLaMA and GPT are able to maintain focus.
Also, do you know where I'd find information on how the current weights were trained?
Why would asking the question first improve quality? Is it because the model will be better aware of what info it can and can't throw away at each step? This seems like the opposite of transformers.
RWKV does not work like a transformer. The "transformer" part refers to training: it can be trained in parallel like a transformer, but at inference it runs as an RNN with a fixed-size state, so old information decays a little each time it reads a new token. Hence the freshest memory is of the most recent tokens.
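A toy illustration of what that decay means in practice (this is not RWKV's actual update rule, just a generic exponentially decaying fixed-size state):

```python
import numpy as np

# Toy fixed-size state: each new token is mixed in with weight (1 - decay),
# so a token read t steps ago contributes on the order of decay**t.
decay = 0.95
state = np.zeros(8)

for _ in range(200):                      # read 200 tokens
    token_embedding = np.random.randn(8)  # stand-in for a token's contribution
    state = decay * state + (1 - decay) * token_embedding

# Remaining weight of the very first token after 200 steps:
print(decay ** 199)  # ~3.7e-5, so the start of the prompt has effectively faded
```

As I understand it, RWKV's real time-mixing learns per-channel decay rates, so some channels can hold information much longer than this toy version, but the recency bias is built in.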
Are there any ways to train it to maintain attention on the original prompt no matter the distance from it, and selectively pay attention to its own output where relevant?