https://arxiv.org/abs/2304.11062

The Recurrent Memory Transformer (RMT) reaches an effective context of 2M tokens by splitting the input into segments ("token groups") and passing learned memory tokens between them, rather than attending over every individual token at once. Model performance is sacrificed somewhat, roughly 5-10% depending on the task.
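
A minimal sketch of the segment-level recurrence idea, assuming a PyTorch-style encoder. Names like `RecurrentMemorySketch`, `seg_len`, and `n_mem` are illustrative, not from the paper, and this prepend-only memory layout simplifies the paper's actual read/write scheme:

```python
import torch
import torch.nn as nn

class RecurrentMemorySketch(nn.Module):
    """Sketch of segment-level recurrence with memory tokens.

    The long input is split into fixed-size segments; learned memory
    embeddings are prepended to each segment, and the updated memory
    states are carried over to the next segment, so information can
    flow far beyond a single segment's attention window.
    """

    def __init__(self, vocab_size=1000, d_model=128, n_mem=4, seg_len=64):
        super().__init__()
        self.seg_len = seg_len
        self.embed = nn.Embedding(vocab_size, d_model)
        self.memory = nn.Parameter(torch.randn(n_mem, d_model))  # initial memory state
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):  # tokens: (batch, total_len)
        b = tokens.size(0)
        mem = self.memory.unsqueeze(0).expand(b, -1, -1)  # (batch, n_mem, d_model)
        logits = []
        # Process one segment at a time, threading memory through;
        # gradients flow across segments during training (BPTT).
        for start in range(0, tokens.size(1), self.seg_len):
            seg = self.embed(tokens[:, start:start + self.seg_len])
            x = torch.cat([mem, seg], dim=1)    # [memory tokens | segment tokens]
            x = self.encoder(x)
            mem = x[:, :mem.size(1), :]         # read back updated memory
            logits.append(self.head(x[:, mem.size(1):, :]))
        return torch.cat(logits, dim=1)

model = RecurrentMemorySketch()
out = model(torch.randint(0, 1000, (2, 256)))  # 256 tokens = 4 segments of 64
print(out.shape)  # torch.Size([2, 256, 1000])
```

Because attention cost is paid per segment, total compute grows roughly linearly with sequence length instead of quadratically, which is what makes contexts in the millions of tokens feasible.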