https://arxiv.org/abs/2304.11062
Achieved an effective context of 2M tokens by processing the input in segments with recurrent memory tokens carried between them (Recurrent Memory Transformer), rather than attending over every individual token at once. Model performance is sacrificed a bit, roughly 5-10% depending on the task.
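As a rough illustration, here is a minimal PyTorch sketch of that segment-plus-memory scheme: split the sequence into fixed-size segments, prepend memory tokens to each segment, and feed the memory written by one segment into the next. The dimensions, segment length, and memory size below are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class RMTSketch(nn.Module):
    """Sketch of recurrent memory over segments (hypothetical settings)."""

    def __init__(self, d_model=256, n_memory=16, segment_len=512, vocab=32000):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.memory = nn.Parameter(torch.randn(n_memory, d_model))  # initial memory
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.n_memory = n_memory
        self.segment_len = segment_len

    def forward(self, token_ids):
        # token_ids: (batch, seq_len); seq_len may far exceed segment_len
        batch = token_ids.size(0)
        mem = self.memory.unsqueeze(0).expand(batch, -1, -1)
        outputs = []
        for start in range(0, token_ids.size(1), self.segment_len):
            seg = self.embed(token_ids[:, start:start + self.segment_len])
            x = torch.cat([mem, seg], dim=1)        # prepend memory tokens
            h = self.encoder(x)
            mem = h[:, :self.n_memory]              # updated memory carries over
            outputs.append(h[:, self.n_memory:])    # per-token hidden states
        return torch.cat(outputs, dim=1)

model = RMTSketch()
ids = torch.randint(0, 32000, (2, 2048))  # 4 segments of 512 tokens each
hidden = model(ids)
print(hidden.shape)  # torch.Size([2, 2048, 256])
```

Because each forward pass only attends within one segment (plus the small memory), compute stays roughly linear in sequence length, which is what lets the effective context stretch far beyond a single attention window.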