Terminology nitpick / question here.
First, a quick bit of history as I understand it: back in 2015, the original ResNet paper (He et al., 2015) added a skip connection to the CNN architecture (colored annotations are mine):

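The idea was that you would model the output of a layer as the sum of its input and a learned residual, so the layer only has to learn the difference between the desired output and the input.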
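For concreteness, here is how I read the ResNet formulation, written with the paper's symbols: x is the block's input, H(x) is the mapping you actually want, and F is what the stacked layers learn.

```latex
F(x) := H(x) - x
\qquad\Longrightarrow\qquad
y = F(x) + x
```

Here F(x) is the residual, and the bare "+ x" term is the skip (identity) connection.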
Nowadays, many transformer-based papers, especially ones in the mechanistic interpretability domain, use the term ‘residual stream’ to denote the information that flows layer-wise along the transformer outside of the attention and MLP / FFW components. Here, for example, is an image from the paper A Mathematical Framework for Transformer Circuits by Elhage et al. (the wonderful Anthropic mech int team):

As indicated in that figure, the residual stream encompasses both the inter-layer piece and the ‘skip connection’ part where you bypass the attention and MLP / FFW components within an individual layer. In other words, the residual stream does not include the part that I would describe as the residual!
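To make the mismatch concrete, here is a toy sketch of one transformer layer as I understand the picture. Everything in it is an assumption for illustration: `attn_branch` and `mlp_branch` are hypothetical stand-ins rather than real attention or MLP code, vectors are plain Python lists, and layer norm is left out.

```python
# Toy sketch only: attn_branch and mlp_branch are hypothetical placeholders,
# not real attention / MLP implementations. Layer norm is omitted.

def add(a, b):
    """Elementwise sum of two equal-length vectors."""
    return [ai + bi for ai, bi in zip(a, b)]

def attn_branch(x):
    # Stand-in for the attention sublayer: reads the stream, returns an update.
    return [0.5 * xi for xi in x]

def mlp_branch(x):
    # Stand-in for the MLP / FFW sublayer.
    return [xi + 1.0 for xi in x]

def transformer_layer(stream):
    """One layer: each sublayer reads the stream and adds its output back in."""
    attn_out = attn_branch(stream)   # plays the role of ResNet's F(x), the "residual"
    stream = add(stream, attn_out)   # skip path, i.e. part of the "residual stream"
    mlp_out = mlp_branch(stream)     # again an F(x)-style branch
    stream = add(stream, mlp_out)
    return stream, attn_out, mlp_out

# Run two layers; `stream` is the only thing carried from layer to layer.
stream = [1.0, 2.0]
for _ in range(2):
    stream, attn_out, mlp_out = transformer_layer(stream)
```

In this sketch the variable `stream` is what the transformer papers call the residual stream, while `attn_out` and `mlp_out` are the quantities that the ResNet usage would call residuals.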
I’m curious whether anyone is aware of a reason for this seemingly inconsistent terminology, and where the more recent transformer-based term originated (was it Olah’s team at Anthropic?). Reach out if you have any insight.