Foundations
Transformers Architecture
- Self-attention mechanism
- Multi-head attention
- Positional encoding
- Layer normalization
- Residual connections
- Feed-forward networks
- Encoder vs Decoder structure
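Scaled dot-product attention is the core operation the list above builds on. A minimal PyTorch sketch (the tensor shapes and the optional mask argument are illustrative assumptions, not a fixed API):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, heads, seq_len, d_head); mask: 0 where attention is blocked."""
    d_head = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_head)   # (batch, heads, seq, seq)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)                 # each row sums to 1
    return weights @ v, weights                             # weighted sum of values
```

Multi-head attention runs this same operation in parallel over several learned projections of q, k, and v, then concatenates and re-projects the results.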
Tokenization & Input Embeddings
- Byte Pair Encoding (BPE) or WordPiece
- Token type / segment embeddings to distinguish sentence A from sentence B (esp. in BERT)
- Positional embeddings
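A hedged tokenization example using the HuggingFace Transformers library; the checkpoint name bert-base-uncased is just a common choice, and the exact subword splits depend on its learned vocabulary:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Rare words are split into subword pieces (WordPiece here; GPT models use BPE).
print(tokenizer.tokenize("Transformers are unbelievably effective"))

# Encoding a sentence pair adds [CLS]/[SEP] and segment (token type) ids.
enc = tokenizer("How are you?", "I am fine.")
print(enc["input_ids"])       # includes the [CLS] and [SEP] ids
print(enc["token_type_ids"])  # 0 for sentence A, 1 for sentence B
```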
BERT (Bidirectional Encoder Representations from Transformers)
Architecture Overview
- Stack of transformer encoders only
- Input format: [CLS] sentence A [SEP] sentence B [SEP], padded to length with [PAD]
- Embedding types: token, segment, positional
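A minimal sketch of how the three embedding types are summed before the encoder stack; the sizes match bert-base, but the variable names and toy ids are placeholders:

```python
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 30522, 512, 768
token_emb   = nn.Embedding(vocab_size, d_model)
segment_emb = nn.Embedding(2, d_model)        # 0 = sentence A, 1 = sentence B
pos_emb     = nn.Embedding(max_len, d_model)  # learned positional embeddings
layer_norm  = nn.LayerNorm(d_model)

def embed(input_ids, token_type_ids):
    positions = torch.arange(input_ids.size(1), device=input_ids.device)
    summed = token_emb(input_ids) + segment_emb(token_type_ids) + pos_emb(positions)
    return layer_norm(summed)                 # BERT also applies dropout here

ids  = torch.tensor([[101, 1037, 102, 1038, 102]])  # toy ids standing in for [CLS] A [SEP] B [SEP]
segs = torch.tensor([[0, 0, 0, 1, 1]])
print(embed(ids, segs).shape)                 # (1, 5, 768)
```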
Training Objective
- Masked Language Modeling (MLM)
- Next Sentence Prediction (NSP)
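A sketch of BERT's MLM masking recipe in plain PyTorch: 15% of positions become prediction targets; of those, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged. mask_token_id and vocab_size are placeholders, and real implementations also skip special tokens like [CLS] and [SEP]:

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < mlm_prob    # positions to predict
    labels[~selected] = -100                             # ignored by the loss

    masked_ids = input_ids.clone()
    replace = selected & (torch.rand(input_ids.shape) < 0.8)            # 80% -> [MASK]
    masked_ids[replace] = mask_token_id

    random = selected & ~replace & (torch.rand(input_ids.shape) < 0.5)  # 10% -> random token
    masked_ids[random] = torch.randint(vocab_size, input_ids.shape)[random]

    return masked_ids, labels                            # remaining 10% stay unchanged
```

NSP, by contrast, is a simple binary classification on the [CLS] vector over sentence pairs.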
Pre-training vs Fine-tuning
- Transfer learning in BERT
- Fine-tuning for specific tasks (e.g., QA, NER, sentiment)
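A hedged fine-tuning sketch with HuggingFace Transformers; the two-label sentiment setup, the toy batch, and the learning rate are placeholder assumptions:

```python
import torch
from transformers import AutoTokenizer, BertForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tokenizer(["great movie", "terrible plot"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)   # classification head on top of [CLS]
outputs.loss.backward()                   # one gradient step of fine-tuning
optimizer.step()
```

The pre-trained encoder weights are updated along with the new head; for QA or NER only the task-specific head changes, not the overall recipe.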
Variants
- RoBERTa (drops NSP, trains longer on more data with dynamic masking)
- DistilBERT (knowledge distillation; smaller and faster)
- ALBERT (cross-layer parameter sharing, factorized embeddings)
GPT (Generative Pre-trained Transformer)
Architecture Overview
- Stack of transformer decoders only
- Autoregressive model (predict next token)
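The decoder-only stack stays autoregressive because of a causal (look-ahead) mask: each position may attend only to itself and earlier positions. A minimal PyTorch illustration:

```python
import torch

seq_len = 5
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(causal_mask)
# Inside attention, False entries are set to -inf before the softmax,
# so no token's representation can depend on future tokens.
```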
Training Objective
- Causal Language Modeling (CLM)
- Next-token prediction with unidirectional context
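A sketch of the CLM objective: predict token t+1 from tokens up to t by shifting logits and labels one position. Shapes are illustrative; the logits would come from a decoder-only model:

```python
import torch
import torch.nn.functional as F

def clm_loss(logits, input_ids):
    # logits: (batch, seq_len, vocab_size); input_ids: (batch, seq_len)
    shift_logits = logits[:, :-1, :]   # predictions at positions 0..T-2
    shift_labels = input_ids[:, 1:]    # targets are the next tokens 1..T-1
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```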
Fine-tuning and In-context Learning
- GPT-2/3/4: zero-shot, few-shot, chain-of-thought prompting
- Reinforcement Learning from Human Feedback (RLHF), used in InstructGPT/ChatGPT and GPT-4
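A hedged in-context learning example using GPT-2 through HuggingFace Transformers; GPT-2 is far weaker at this than GPT-3/4, so treat it as an illustration of the few-shot prompt format rather than of model quality:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Few-shot prompt: task demonstrations followed by a new query; no weight updates.
prompt = (
    "Review: I loved every minute. Sentiment: positive\n"
    "Review: A complete waste of time. Sentiment: negative\n"
    "Review: The acting was superb. Sentiment:"
)
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=3, do_sample=False)
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:]))
```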
Attention Visualization & Interpretation
- Attention heads and what they learn
- Visualization tools (e.g., BertViz)
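Raw attention weights can be pulled out of a pre-trained model for inspection; tools like BertViz render these same tensors interactively. A sketch with HuggingFace Transformers:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The cat sat on the mat", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one tensor per layer, each (batch, num_heads, seq_len, seq_len)
print(len(outputs.attentions), outputs.attentions[0].shape)
```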
Memory and Computational Cost
- Model size (parameters), GPU/TPU requirements
- Memory-efficient attention and training (e.g., FlashAttention, mixed precision)
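A back-of-the-envelope sketch: count parameters and estimate memory for the weights alone. The 4-bytes-per-parameter figure assumes fp32; optimizer states and activations add several times more during training:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.0f}M parameters, "
      f"~{num_params * 4 / 1e9:.2f} GB for fp32 weights alone")
```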
Limitations and Biases
- Overfitting, hallucination, adversarial examples
- Mitigating bias in language models
Implementing Transformer Blocks from Scratch
- Building attention, positional encoding, etc. in PyTorch or TensorFlow
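A compact encoder block written from scratch in PyTorch, combining multi-head self-attention, residual connections, layer normalization, and the position-wise feed-forward network from the Foundations list; the hyperparameters are the illustrative "base" sizes:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads,
                                          dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # Self-attention sub-layer: residual connection + layer norm (post-norm)
        attn_out, _ = self.attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + self.dropout(attn_out))
        # Feed-forward sub-layer: residual connection + layer norm
        return self.norm2(x + self.dropout(self.ff(x)))

out = EncoderBlock()(torch.randn(2, 10, 512))   # (batch, seq_len, d_model)
```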
Using Pre-trained Models
- HuggingFace Transformers: loading and fine-tuning BERT/GPT models
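For a quick start, the HuggingFace pipeline API hides the tokenizer/model plumbing; weights download on first use, and the default sentiment model is a BERT-family checkpoint chosen by the library:

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("Fine-tuned encoders make classification easy."))

generator = pipeline("text-generation", model="gpt2")
print(generator("The Transformer architecture", max_new_tokens=20)[0]["generated_text"])
```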
Benchmarking and Evaluation
- Metrics for classification (BERT), generation (GPT)
- Datasets (GLUE, SQuAD, LAMBADA, etc.)
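A sketch of two typical metrics: accuracy/F1 for a BERT-style classifier (as on GLUE) and perplexity for a GPT-style language model. The predictions, labels, and loss value are made-up placeholders, and scikit-learn is assumed for the classification metrics:

```python
import math
from sklearn.metrics import accuracy_score, f1_score

preds  = [1, 0, 1, 1, 0]   # placeholder model predictions
labels = [1, 0, 0, 1, 0]   # placeholder gold labels
print("accuracy:", accuracy_score(labels, preds))
print("F1:", f1_score(labels, preds))

# Perplexity is exp of the mean per-token cross-entropy (natural log).
mean_nll = 3.2             # placeholder, e.g. averaged over a held-out corpus
print("perplexity:", math.exp(mean_nll))
```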