From Absolute Positional Encoding to RoPE: Why Position Can Be a Rotation

sky_io@outlook.com (K4i) — Thu, 28 May 2026 21:53:12 +0800

Introduction

Self-attention has a surprising weakness: by itself, it does not know word order.

Consider these two sentences:

I like you
You like me

They contain almost the same tokens, but their meanings are different. RNNs read inputs step by step, and CNNs preserve local neighborhoods through convolution windows. Standard self-attention, however, compares every token vector with every other token vector and has no built-in notion of which token came first. Its core formula is:

$$\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V$$

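If we permute the input tokens, the output is permuted in exactly the same way and is otherwise unchanged. In other words, self-attention without positional information is permutation-equivariant: nothing in the computation depends on where a token sits in the sequence. Attention is good at modeling content relationships, but order must be injected separately.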
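
To make this concrete, here is a minimal NumPy sketch of single-head self-attention with no positional information. The random projection matrices `Wq`, `Wk`, `Wv`, the toy sizes, and the helper names are assumptions chosen for illustration, not part of any particular model. It checks that shuffling the input rows only shuffles the output rows.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X has shape (n_tokens, d); no positional signal is added anywhere.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
n, d = 3, 4  # e.g. three tokens ("I", "like", "you") with toy 4-dim embeddings
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

perm = np.array([2, 0, 1])  # reorder the tokens
out = self_attention(X, Wq, Wk, Wv)
out_perm = self_attention(X[perm], Wq, Wk, Wv)

# Permuting the inputs permutes the outputs in the same way, nothing more.
equivariant = np.allclose(out_perm, out[perm])
print(equivariant)
```

Each token receives the same output vector regardless of where it appears in the sequence, which is why the model needs a separate positional signal to tell the two example sentences apart by word order.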
