Author
Mark Ogata
AI and Robotics Undergraduate Researcher

Efficient Streaming Language Models with Attention Sinks

Takeaway: train LLMs with a dedicated attention-sink token as the first token in the context to support streaming applications. Otherwise, the model repurposes the first few tokens as attention sinks, and evicting them from the KV cache causes severe performance degradation.
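The eviction policy implied by the takeaway can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's released code: `evict_kv_cache`, `n_sink`, and `window` are hypothetical names, and the function operates on token positions rather than actual key/value tensors.

```python
# StreamingLLM-style KV cache eviction sketch: always retain the first
# `n_sink` positions (the attention sinks) plus a rolling window of the
# most recent `window` positions, evicting everything in between.
# Hypothetical helper for illustration only.

def evict_kv_cache(positions, n_sink=4, window=1020):
    """Return the cache positions to keep for streaming generation."""
    if len(positions) <= n_sink + window:
        return list(positions)  # still within budget, nothing to evict
    # Keep the sink tokens at the start and the recent window at the end.
    return list(positions[:n_sink]) + list(positions[-window:])

# Example: with a 10-slot budget (2 sinks + 8 recent) over 15 tokens,
# positions 0-1 survive as sinks and 7-14 as the recent window.
cache = evict_kv_cache(list(range(15)), n_sink=2, window=8)
```

Keeping the sinks in place preserves the attention mass that the model learned to dump on the earliest positions, which is why a plain sliding window (evicting them) degrades quality.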