Currently, what I'm seeing with RWKV is that attention fades off quickly. The model will start to produce output, but very quickly (within a few dozen tokens) its own output tokens take 'precedence' over the input question, and it starts to simply repeat itself.
For example, I'm currently attempting to use RWKV for named entity extraction. I ask it to analyze a piece of text and provide output in JSON format. It starts off great. However, eventually it seems like the beginning of the JSON list 'overtakes' the question I asked, and it starts to produce random data that merely looks plausible given the items already in the list. I suspect this is perhaps due to precision losses in the RNN as its state decays.
However, I feel there ought to be some way we can prevent that. Any thoughts?
Yeah... So I did that, which is how I got it to begin correctly. This is what I mean, though.
I'll say "get a list of Blah from the following document in Json format like this:
Example"
Then I feed the document and add a spot for the answer.
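Roughly, the prompt ends up laid out like this sketch (the entity type, example schema, and helper name here are just placeholders for illustration, not my real task or data):

```python
# A sketch of the prompt layout described above; the entity type and the
# example JSON schema are placeholders, not the real task or data.
def build_prompt(document: str) -> str:
    return (
        'Get a list of product names from the following document, '
        'in JSON format like this: ["Acme Widget", "FooBar 3000"]\n\n'
        f"Document:\n{document}\n\n"
        "Answer:"
    )

print(build_prompt("The Acme Widget outsold the FooBar 3000 last quarter."))
```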
The model begins correctly. But usually in the middle of generating the JSON list it will veer off and start hallucinating, as if it had forgotten the document and the task. I'm happy to share specifics and datasets, but this is a cross-cutting problem.
RWKV is able to answer my questions when I ask for a simple yes/no answer or a classification. It's the listing that throws it for a loop. Transformers do not have the same problem: both LLaMA and GPT are able to maintain focus.
Also, do you know where I'd find information on how the current weights were trained?
Why would asking the question first improve quality? Is it because the model will be better aware of what info it can and can't throw away at each step? This seems like the opposite of transformers.
RWKV does not work like a transformer. The "transformer" part refers to training: it can be trained in parallel like a transformer, but at inference it runs as an RNN with a fixed-size state, so old information decays a little each time it reads a new token. Hence the freshest memory is of the most recent tokens.
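A toy illustration of what that decay means in practice (this is not RWKV's actual update rule, just a generic exponentially decaying fixed-size state):

```python
import numpy as np

# Toy fixed-size state: each new token is mixed in with weight (1 - decay),
# so a token read t steps ago contributes on the order of decay**t.
decay = 0.95
state = np.zeros(8)

for _ in range(200):                      # read 200 tokens
    token_embedding = np.random.randn(8)  # stand-in for a token's contribution
    state = decay * state + (1 - decay) * token_embedding

# Remaining weight of the very first token after 200 steps:
print(decay ** 199)  # ~3.7e-5, so the start of the prompt has effectively faded
```

As I understand it, RWKV's real time-mixing learns per-channel decay rates, so some channels can hold information much longer than this toy version, but the recency bias is built in.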
Are there any ways to train it to maintain attention on the original prompt no matter the distance from it, and selectively pay attention to its own output where relevant?