Python Status: Pending Migration

📖 Transformers: The Architecture That Changed NLP

  • 🧠 Why Transformers?
    • 🔄 Limits of RNNs and CNNs for Sequential Data
    • 🔗 Need for Long-Range Dependencies
    • ⏱️ Parallelism and Efficiency
  • 🏗️ Core Building Blocks
    • 📦 Embeddings
    • 🎯 Positional Encoding
    • 🧮 Self-Attention Mechanism
    • 🧠 Multi-head Attention
    • 🧱 Feedforward Layers
    • 🔁 Layer Norm, Skip Connections
  • 🔬 The Transformer Block
    • 🔍 Encoder Block (Structure + Flow)
    • 🔍 Decoder Block (Structure + Flow)
    • 🔄 Masking in Attention
    • 📶 Stack of N Layers
  • 🔢 Attention Mechanism in Depth
    • 🧠 Attention as Weighted Lookup
    • 📐 Query, Key, Value Vectors
    • 📊 Dot-Product Attention Calculation
    • ⚙️ Softmax + Scaling
  • 🧰 Training a Transformer Model
    • 📊 Example: Sequence Classification or Translation
    • 💡 Tokenization (WordPiece/BPE) Basics
    • 🧮 Input-Output Pipeline
  • 📚 Transformers Beyond Text
    • 🧠 Use in Vision (ViT)
    • 🧪 Time Series and Tabular Data
    • 🧬 Multimodal Transformers
  • 🧭 From Transformers to LLMs
    • 🌐 Evolution: Transformer → GPT/BERT → LLMs
    • 📈 Scaling Laws (Depth, Width, Data)
    • 🔍 Pretraining Objectives: Causal vs. Masked
  • 🔚 Closing Notes
    • ⚠️ Conceptual Pitfalls
    • 🔍 Visual Explainers and Demos to Explore
    • 🚀 Next Up: Large Language Models (04)

🧠 Why Transformers?

🔄 Limits of RNNs and CNNs for Sequential Data

🔗 Need for Long-Range Dependencies

โฑ๏ธ Parallelism and Efficiencyยถ


๐Ÿ—๏ธ Core Building Blocksยถ

📦 Embeddings
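
Before any attention happens, each token id is mapped to a learned dense vector. A minimal sketch of that lookup, assuming PyTorch; vocab_size, d_model, and the token ids below are illustrative values, not anything fixed by this series.

```python
# A token embedding table: one learned d_model-dimensional vector per vocabulary id.
import torch
import torch.nn as nn

vocab_size, d_model = 10_000, 512
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[5, 42, 7, 101]])   # (batch=1, seq_len=4)
x = embedding(token_ids)                      # (1, 4, 512): one vector per token
print(x.shape)
```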

🎯 Positional Encoding
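
Self-attention is order-agnostic, so position information is added explicitly. A sketch of the sinusoidal scheme from the original Transformer paper, assuming PyTorch; max_len and d_model are illustrative.

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    # pe[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # pe[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    position = torch.arange(max_len).unsqueeze(1)                      # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe                                      # added to the token embeddings

print(sinusoidal_positional_encoding(max_len=128, d_model=512).shape)  # torch.Size([128, 512])
```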

🧮 Self-Attention Mechanism
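
A minimal single-head self-attention layer, assuming PyTorch. The class name and sizes are illustrative; this is a sketch of the mechanism, not an optimized implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)   # query projection
        self.k_proj = nn.Linear(d_model, d_model)   # key projection
        self.v_proj = nn.Linear(d_model, d_model)   # value projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(x.size(-1))  # (batch, seq, seq)
        weights = F.softmax(scores, dim=-1)                       # attention weights
        return weights @ v                                        # weighted sum of values

x = torch.randn(2, 10, 64)          # (batch=2, seq_len=10, d_model=64)
print(SelfAttention(64)(x).shape)   # torch.Size([2, 10, 64])
```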

🧠 Multi-head Attention
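
Multi-head attention runs several smaller attention operations in parallel and concatenates the results. A shape-only sketch of the head split, assuming PyTorch; the per-head computation is the same scaled dot-product shown in the previous sketch, and the sizes are illustrative.

```python
import torch

batch, seq_len, d_model, n_heads = 2, 10, 512, 8
head_dim = d_model // n_heads                       # 64 dimensions per head

x = torch.randn(batch, seq_len, d_model)
# (batch, seq_len, d_model) -> (batch, n_heads, seq_len, head_dim)
heads = x.view(batch, seq_len, n_heads, head_dim).transpose(1, 2)
print(heads.shape)                                  # torch.Size([2, 8, 10, 64])

# After attention runs independently in each head, the heads are concatenated back:
merged = heads.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
print(merged.shape)                                 # torch.Size([2, 10, 512])
```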

🧱 Feedforward Layers
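
After attention, each position passes through the same small two-layer network. A sketch assuming PyTorch; d_model=512 and d_ff=2048 mirror the original paper's defaults but are only illustrative here.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # expand
            nn.ReLU(),                  # nonlinearity (GELU is common in newer models)
            nn.Linear(d_ff, d_model),   # project back
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)              # applied to every position independently

print(FeedForward()(torch.randn(2, 10, 512)).shape)   # torch.Size([2, 10, 512])
```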

๐Ÿ” Layer Norm, Skip Connectionsยถ


🔬 The Transformer Block

๐Ÿ” Encoder Block (Structure + Flow)ยถ

๐Ÿ” Decoder Block (Structure + Flow)ยถ

🔄 Masking in Attention
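
The decoder's causal (look-ahead) mask keeps position i from attending to positions after i; padding masks work the same way for padded tokens. A sketch assuming PyTorch: masked positions get a score of negative infinity so the softmax assigns them zero weight.

```python
import torch
import torch.nn.functional as F

seq_len = 4
scores = torch.randn(seq_len, seq_len)                        # raw attention scores
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
masked_scores = scores.masked_fill(mask, float("-inf"))       # hide future positions
weights = F.softmax(masked_scores, dim=-1)
print(weights)   # upper triangle is 0: no attention to future tokens
```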

📶 Stack of N Layers


🔢 Attention Mechanism in Depth

🧠 Attention as Weighted Lookup

๐Ÿ“ Query, Key, Value Vectorsยถ

📊 Dot-Product Attention Calculation
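
The computation in question is the scaled dot-product attention defined in "Attention Is All You Need", where Q, K, and V stack the query, key, and value vectors row-wise and d_k is the key dimension:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V
```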

โš™๏ธ Softmax + Scalingยถ


🧰 Training a Transformer Model

📊 Example: Sequence Classification or Translation
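
For the classification case, a tiny encoder-only classifier built from PyTorch's stock modules: embed tokens, run an encoder stack, mean-pool over positions, and classify. Everything here (class name, hyperparameters, random inputs) is illustrative, and positional encoding is omitted for brevity.

```python
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    def __init__(self, vocab_size=10_000, d_model=128, n_heads=4, n_layers=2, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=256, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # (a real model would also add positional encodings to the embeddings here)
        h = self.encoder(self.embed(token_ids))   # (batch, seq_len, d_model)
        return self.head(h.mean(dim=1))           # pool over the sequence, then classify

logits = TinyClassifier()(torch.randint(0, 10_000, (4, 20)))
print(logits.shape)   # torch.Size([4, 2])
```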

💡 Tokenization (WordPiece/BPE) Basics
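
WordPiece splits unknown words into known subword pieces by greedy longest-match against a learned vocabulary, with "##" marking pieces that continue a word; BPE learns merge rules instead but produces similar-looking splits. A toy sketch with a hypothetical six-entry vocabulary; real vocabularies hold tens of thousands of entries learned from data.

```python
TOY_VOCAB = {"trans", "##form", "##ers", "are", "fun", "[UNK]"}

def wordpiece_tokenize(word: str, vocab=TOY_VOCAB) -> list[str]:
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:                       # try the longest remaining piece first
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                tokens.append(piece)
                break
            end -= 1
        else:                                    # no piece matched at all
            return ["[UNK]"]
        start = end
    return tokens

print([wordpiece_tokenize(w) for w in "transformers are fun".split()])
# [['trans', '##form', '##ers'], ['are'], ['fun']]
```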

🧮 Input-Output Pipeline
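
Sequences of different lengths are padded to a common length, and an attention mask records which positions hold real tokens. A sketch assuming PyTorch; the ids and pad id are illustrative.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

pad_id = 0
sequences = [torch.tensor([5, 42, 7]), torch.tensor([8, 3, 19, 4, 11])]

input_ids = pad_sequence(sequences, batch_first=True, padding_value=pad_id)
attention_mask = (input_ids != pad_id).long()   # 1 = real token, 0 = padding

print(input_ids)        # shape (2, 5), shorter sequence padded with 0
print(attention_mask)   # tells the attention layers which positions to ignore
```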


📚 Transformers Beyond Text

🧠 Use in Vision (ViT)
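
The Vision Transformer's key move is to slice an image into fixed-size patches and embed each patch as a token; a strided convolution does the slicing and projection in one step. A sketch assuming PyTorch, with illustrative sizes (a 224x224 image and 16x16 patches).

```python
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)                           # (batch, channels, H, W)
patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)  # 16x16 patches -> 768-dim tokens

patches = patch_embed(img)                                  # (1, 768, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)                 # (1, 196, 768): a "sentence" of patches
print(tokens.shape)
```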

🧪 Time Series and Tabular Data

🧬 Multimodal Transformers


🧭 From Transformers to LLMs

๐ŸŒ Evolution: Transformer โ†’ GPT/BERT โ†’ LLMsยถ

📈 Scaling Laws (Depth, Width, Data)

๐Ÿ” Pretraining Objectives: Causal vs. Maskedยถ


🔚 Closing Notes

โš ๏ธ Conceptual Pitfallsยถ

๐Ÿ” Visual Explainers and Demos to Exploreยถ

🚀 Next Up: Large Language Models (04)
