https://arxiv.org/abs/2304.11062

The Recurrent Memory Transformer (RMT) reaches an effective context of 2M tokens by splitting the input into segments ("token groups") and passing learned memory tokens between them, rather than attending over every individual token at once. Model performance is sacrificed somewhat, roughly 5-10% depending on the task.
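
A minimal sketch of the segment-level recurrence idea, assuming a PyTorch-style encoder. Names like `RecurrentMemorySketch`, `seg_len`, and `n_mem` are illustrative, not from the paper, and this prepend-only memory layout simplifies the paper's actual read/write scheme:

```python
import torch
import torch.nn as nn

class RecurrentMemorySketch(nn.Module):
    """Sketch of segment-level recurrence with memory tokens.

    The long input is split into fixed-size segments; learned memory
    embeddings are prepended to each segment, and the updated memory
    states are carried over to the next segment, so information can
    flow far beyond a single segment's attention window.
    """

    def __init__(self, vocab_size=1000, d_model=128, n_mem=4, seg_len=64):
        super().__init__()
        self.seg_len = seg_len
        self.embed = nn.Embedding(vocab_size, d_model)
        self.memory = nn.Parameter(torch.randn(n_mem, d_model))  # initial memory state
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):  # tokens: (batch, total_len)
        b = tokens.size(0)
        mem = self.memory.unsqueeze(0).expand(b, -1, -1)  # (batch, n_mem, d_model)
        logits = []
        # Process one segment at a time, threading memory through;
        # gradients flow across segments during training (BPTT).
        for start in range(0, tokens.size(1), self.seg_len):
            seg = self.embed(tokens[:, start:start + self.seg_len])
            x = torch.cat([mem, seg], dim=1)    # [memory tokens | segment tokens]
            x = self.encoder(x)
            mem = x[:, :mem.size(1), :]         # read back updated memory
            logits.append(self.head(x[:, mem.size(1):, :]))
        return torch.cat(logits, dim=1)

model = RecurrentMemorySketch()
out = model(torch.randint(0, 1000, (2, 256)))  # 256 tokens = 4 segments of 64
print(out.shape)  # torch.Size([2, 256, 1000])
```

Because attention cost is paid per segment, total compute grows roughly linearly with sequence length instead of quadratically, which is what makes contexts in the millions of tokens feasible.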