Foundations
Transformers Architecture
- Self-attention mechanism
- Multi-head attention
- Positional encoding
- Layer normalization
- Residual connections
- Feed-forward networks
- Encoder vs Decoder structure
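Scaled dot-product attention is the core operation the list above builds on. A minimal PyTorch sketch (the tensor shapes and the optional mask argument are illustrative assumptions, not a fixed API):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, heads, seq_len, d_head); mask: 0 where attention is blocked."""
    d_head = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_head)   # (batch, heads, seq, seq)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)                 # each row sums to 1
    return weights @ v, weights                             # weighted sum of values
```

Multi-head attention runs this same operation in parallel over several learned projections of q, k, and v, then concatenates and re-projects the results.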
Tokenization & Input Embeddings
- Byte Pair Encoding (BPE) or WordPiece
- Token type / segment embeddings to distinguish sentence A from sentence B (esp. in BERT)
- Positional embeddings
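A hedged tokenization example using the HuggingFace Transformers library; the checkpoint name bert-base-uncased is just a common choice, and the exact subword splits depend on its learned vocabulary:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Rare words are split into subword pieces (WordPiece here; GPT models use BPE).
print(tokenizer.tokenize("Transformers are unbelievably effective"))

# Encoding a sentence pair adds [CLS]/[SEP] and segment (token type) ids.
enc = tokenizer("How are you?", "I am fine.")
print(enc["input_ids"])       # includes the [CLS] and [SEP] ids
print(enc["token_type_ids"])  # 0 for sentence A, 1 for sentence B
```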
BERT (Bidirectional Encoder Representations from Transformers)
Architecture Overview
- Stack of transformer encoders only
- Input format: [CLS] sentence A [SEP] sentence B [SEP], padded to length with [PAD]
- Embedding types: token, segment, positional
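A minimal sketch of how the three embedding types are summed before the encoder stack; the sizes match bert-base, but the variable names and toy ids are placeholders:

```python
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 30522, 512, 768
token_emb   = nn.Embedding(vocab_size, d_model)
segment_emb = nn.Embedding(2, d_model)        # 0 = sentence A, 1 = sentence B
pos_emb     = nn.Embedding(max_len, d_model)  # learned positional embeddings
layer_norm  = nn.LayerNorm(d_model)

def embed(input_ids, token_type_ids):
    positions = torch.arange(input_ids.size(1), device=input_ids.device)
    summed = token_emb(input_ids) + segment_emb(token_type_ids) + pos_emb(positions)
    return layer_norm(summed)                 # BERT also applies dropout here

ids  = torch.tensor([[101, 1037, 102, 1038, 102]])  # toy ids standing in for [CLS] A [SEP] B [SEP]
segs = torch.tensor([[0, 0, 0, 1, 1]])
print(embed(ids, segs).shape)                 # (1, 5, 768)
```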
Training Objective
- Masked Language Modeling (MLM)
- Next Sentence Prediction (NSP)
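A sketch of BERT's MLM masking recipe in plain PyTorch: 15% of positions become prediction targets; of those, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged. mask_token_id and vocab_size are placeholders, and real implementations also skip special tokens like [CLS] and [SEP]:

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < mlm_prob    # positions to predict
    labels[~selected] = -100                             # ignored by the loss

    masked_ids = input_ids.clone()
    replace = selected & (torch.rand(input_ids.shape) < 0.8)            # 80% -> [MASK]
    masked_ids[replace] = mask_token_id

    random = selected & ~replace & (torch.rand(input_ids.shape) < 0.5)  # 10% -> random token
    masked_ids[random] = torch.randint(vocab_size, input_ids.shape)[random]

    return masked_ids, labels                            # remaining 10% stay unchanged
```

NSP, by contrast, is a simple binary classification on the [CLS] vector over sentence pairs.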
Pre-training vs Fine-tuning
- Transfer learning in BERT
- Fine-tuning for specific tasks (e.g., QA, NER, sentiment)
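A hedged fine-tuning sketch with HuggingFace Transformers; the two-label sentiment setup, the toy batch, and the learning rate are placeholder assumptions:

```python
import torch
from transformers import AutoTokenizer, BertForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tokenizer(["great movie", "terrible plot"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)   # classification head on top of [CLS]
outputs.loss.backward()                   # one gradient step of fine-tuning
optimizer.step()
```

The pre-trained encoder weights are updated along with the new head; for QA or NER only the task-specific head changes, not the overall recipe.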
Variants
- RoBERTa (drops NSP, trains longer on more data with dynamic masking)
- DistilBERT (knowledge distillation; smaller and faster)
- ALBERT (cross-layer parameter sharing, factorized embeddings)
GPT (Generative Pre-trained Transformer)
Architecture Overview
- Stack of transformer decoders only
- Autoregressive model (predict next token)
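The decoder-only stack stays autoregressive because of a causal (look-ahead) mask: each position may attend only to itself and earlier positions. A minimal PyTorch illustration:

```python
import torch

seq_len = 5
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(causal_mask)
# Inside attention, False entries are set to -inf before the softmax,
# so no token's representation can depend on future tokens.
```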
Training Objective
- Causal Language Modeling (CLM)
- Next-token prediction with unidirectional context
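A sketch of the CLM objective: predict token t+1 from tokens up to t by shifting logits and labels one position. Shapes are illustrative; the logits would come from a decoder-only model:

```python
import torch
import torch.nn.functional as F

def clm_loss(logits, input_ids):
    # logits: (batch, seq_len, vocab_size); input_ids: (batch, seq_len)
    shift_logits = logits[:, :-1, :]   # predictions at positions 0..T-2
    shift_labels = input_ids[:, 1:]    # targets are the next tokens 1..T-1
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```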
Fine-tuning and In-context Learning
- GPT-2/3/4: zero-shot, few-shot, chain-of-thought prompting
- Reinforcement Learning from Human Feedback (RLHF), used in InstructGPT/ChatGPT and GPT-4
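A hedged in-context learning example using GPT-2 through HuggingFace Transformers; GPT-2 is far weaker at this than GPT-3/4, so treat it as an illustration of the few-shot prompt format rather than of model quality:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Few-shot prompt: task demonstrations followed by a new query; no weight updates.
prompt = (
    "Review: I loved every minute. Sentiment: positive\n"
    "Review: A complete waste of time. Sentiment: negative\n"
    "Review: The acting was superb. Sentiment:"
)
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=3, do_sample=False)
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:]))
```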
Attention Visualization & Interpretation
- Attention heads and what they learn
- Visualization tools (e.g., BertViz)
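Raw attention weights can be pulled out of a pre-trained model for inspection; tools like BertViz render these same tensors interactively. A sketch with HuggingFace Transformers:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The cat sat on the mat", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one tensor per layer, each (batch, num_heads, seq_len, seq_len)
print(len(outputs.attentions), outputs.attentions[0].shape)
```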
Memory and Computational Cost
- Model size (parameters), GPU/TPU requirements
- Memory-efficient attention and training (e.g., FlashAttention, mixed precision)
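A back-of-the-envelope sketch: count parameters and estimate memory for the weights alone. The 4-bytes-per-parameter figure assumes fp32; optimizer states and activations add several times more during training:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.0f}M parameters, "
      f"~{num_params * 4 / 1e9:.2f} GB for fp32 weights alone")
```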
Limitations and Biases
- Overfitting, hallucination, adversarial examples
- Mitigating bias in language models
Implementing Transformer Blocks from Scratch
- Building attention, positional encoding, etc. in PyTorch or TensorFlow
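A compact encoder block written from scratch in PyTorch, combining multi-head self-attention, residual connections, layer normalization, and the position-wise feed-forward network from the Foundations list; the hyperparameters are the illustrative "base" sizes:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads,
                                          dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # Self-attention sub-layer: residual connection + layer norm (post-norm)
        attn_out, _ = self.attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + self.dropout(attn_out))
        # Feed-forward sub-layer: residual connection + layer norm
        return self.norm2(x + self.dropout(self.ff(x)))

out = EncoderBlock()(torch.randn(2, 10, 512))   # (batch, seq_len, d_model)
```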
Using Pre-trained Models
- HuggingFace Transformers: loading and fine-tuning BERT/GPT models
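For a quick start, the HuggingFace pipeline API hides the tokenizer/model plumbing; weights download on first use, and the default sentiment model is a BERT-family checkpoint chosen by the library:

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("Fine-tuned encoders make classification easy."))

generator = pipeline("text-generation", model="gpt2")
print(generator("The Transformer architecture", max_new_tokens=20)[0]["generated_text"])
```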
Benchmarking and Evaluation
- Metrics for classification (BERT), generation (GPT)
- Datasets (GLUE, SQuAD, LAMBADA, etc.)
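A sketch of two typical metrics: accuracy/F1 for a BERT-style classifier (as on GLUE) and perplexity for a GPT-style language model. The predictions, labels, and loss value are made-up placeholders, and scikit-learn is assumed for the classification metrics:

```python
import math
from sklearn.metrics import accuracy_score, f1_score

preds  = [1, 0, 1, 1, 0]   # placeholder model predictions
labels = [1, 0, 0, 1, 0]   # placeholder gold labels
print("accuracy:", accuracy_score(labels, preds))
print("F1:", f1_score(labels, preds))

# Perplexity is exp of the mean per-token cross-entropy (natural log).
mean_nll = 3.2             # placeholder, e.g. averaged over a held-out corpus
print("perplexity:", math.exp(mean_nll))
```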