Efficient Streaming Language Models with Attention Sinks
takeaway: train LLMs with a dedicated attention sink token as the first token in the context to support streaming applications. (Otherwise the first few tokens become de facto attention sinks, and evicting them from the KV cache causes severe performance degradation.)
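The eviction policy this implies can be sketched as follows: keep the first few cache entries (the sinks) plus a rolling window of the most recent entries, and drop everything in between. This is a minimal illustration, not the paper's released implementation; the function name and parameters (`n_sink`, `window`) are made up for clarity.

```python
def evict_kv_cache(cache, n_sink=4, window=1024):
    """Keep the first n_sink entries (attention sinks) plus the most
    recent `window` entries; evict the middle of the cache."""
    if len(cache) <= n_sink + window:
        return cache  # nothing to evict yet
    return cache[:n_sink] + cache[-window:]

# Usage: with 20 cached positions, keep 4 sinks + the 12 newest.
cache = list(range(20))
kept = evict_kv_cache(cache, n_sink=4, window=12)
# kept == [0, 1, 2, 3, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
```

With a trained sink token at position 0, `n_sink=1` suffices; without one, the first few natural tokens must be retained instead.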