SAI KAMPES


Transformers / LLM

  

Fundamentals 

  • What is a Transformer?
    • Origins (Attention is All You Need paper)
    • Encoder vs Decoder vs Encoder-Decoder
  • Attention Mechanism
    • Self-attention
    • Scaled Dot-Product Attention (sketched in code after this list)
    • Multi-head attention
  • Positional Encoding
    • Why it’s needed
    • Sinusoidal vs Learned embeddings (the sinusoidal version is sketched after this list)
  • Architecture of Transformers
    • Layer normalization, residuals, FFN
    • Stacking layers and blocks
  • Pretraining vs Fine-tuning
    • Pretraining on large corpora
    • Fine-tuning on specific tasks (e.g., QA, summarization)
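
As a companion to the attention bullets above, here is a minimal NumPy sketch of scaled dot-product self-attention. The toy sizes (4 tokens, model width 8) and the random projection matrices are illustrative only, not taken from any real model.

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                           # pairwise similarity scores
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
        return weights @ V                                        # weighted sum of value vectors

    # Self-attention: queries, keys and values all come from the same token sequence.
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 8))                                   # 4 tokens, model dimension 8
    W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
    out = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
    print(out.shape)                                              # (4, 8): one contextual vector per token

Multi-head attention runs several such attention operations in parallel on lower-dimensional projections and concatenates the results before a final linear layer.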
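
The sinusoidal positional encoding mentioned above can be sketched in a similar way; the sequence length and model width below are arbitrary. The resulting matrix is simply added to the token embeddings so that otherwise order-blind attention layers can tell positions apart.

    import numpy as np

    def sinusoidal_positional_encoding(seq_len, d_model):
        # PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same angle)
        positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
        dims = np.arange(0, d_model, 2)[None, :]       # even dimensions 0, 2, 4, ...
        angles = positions / (10000 ** (dims / d_model))
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles)
        pe[:, 1::2] = np.cos(angles)
        return pe

    pe = sinusoidal_positional_encoding(seq_len=10, d_model=16)
    print(pe.shape)                                    # (10, 16)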

Intermediate Topics 

  • Transformers in NLP
    • BERT (bi-directional encoder)
    • GPT (auto-regressive decoder)
    • RoBERTa, DistilBERT, T5, etc.
  • Tokenization
    • Byte-Pair Encoding (BPE) (merge loop sketched after this list)
    • WordPiece
    • SentencePiece
  • Transfer Learning with LLMs
    • Zero-shot, one-shot, and few-shot learning (see the prompt sketch after this list)
    • Prompt engineering basics
  • LLM Applications
    • Chatbots
    • Text generation
    • Summarization, translation
    • Code generation
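
To make the tokenization bullets above concrete, here is a tiny sketch of the byte-pair-encoding merge-learning loop; the toy vocabulary and the number of merges are made up for illustration.

    import re
    from collections import Counter

    def pair_counts(vocab):
        # Count adjacent symbol pairs across a (space-separated word -> frequency) vocabulary.
        counts = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                counts[(a, b)] += freq
        return counts

    def merge(pair, vocab):
        # Replace every standalone occurrence of the pair with its concatenation.
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
        return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

    # Toy corpus: words pre-split into characters, with an end-of-word marker.
    vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
             "n e w e s t </w>": 6, "w i d e s t </w>": 3}
    for _ in range(5):                       # learn 5 merges
        counts = pair_counts(vocab)
        best = max(counts, key=counts.get)   # most frequent adjacent pair becomes a new token
        vocab = merge(best, vocab)
        print("merged", best)

WordPiece and SentencePiece build subword vocabularies in the same spirit but score candidate merges differently (WordPiece by likelihood rather than raw frequency, SentencePiece working directly on raw text).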
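
For the zero-/few-shot bullets, the difference lies only in how the prompt is built: the same task is posed either directly or with a handful of labelled examples prepended. The review texts and labels below are invented.

    # Zero-shot: the model is asked to do the task with no examples in the prompt.
    zero_shot = ("Classify the sentiment of this review as positive or negative.\n"
                 "Review: The battery died after a week.\nSentiment:")

    # Few-shot: labelled examples are prepended; the model picks up the pattern
    # from the prompt alone (in-context learning), with no weight updates.
    examples = [
        ("Great screen and very fast.", "positive"),
        ("Stopped working after two days.", "negative"),
    ]
    few_shot = "Classify the sentiment of this review as positive or negative.\n"
    for text, label in examples:
        few_shot += f"Review: {text}\nSentiment: {label}\n"
    few_shot += "Review: The battery died after a week.\nSentiment:"
    print(few_shot)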

Advanced Topics

  • Fine-tuning vs Prompt-tuning vs LoRA
    • Parameter-efficient tuning (LoRA update sketched after this list)
    • Adapters, prefix-tuning
  • Retrieval-Augmented Generation (RAG)
    • Combining LLMs with external knowledge
  • In-Context Learning
    • How LLMs learn from the prompt
    • Chain-of-thought prompting
  • Training LLMs
    • Datasets (Common Crawl, C4)
    • Training infrastructure and scaling laws
  • LLM Internals
    • Layer attention patterns
    • Memorization and hallucination
    • Token probabilities
  • Ethics and Safety in LLMs
    • Bias, fairness, misinformation
    • Alignment and RLHF (Reinforcement Learning from Human Feedback)
  • Evaluation Metrics
    • Perplexity (computed in the sketch after this list)
    • BLEU, ROUGE, F1 for generation
    • Human evals
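
For the evaluation bullets just above, perplexity is the exponential of the average negative log-likelihood that the model assigns to the correct next tokens; the probabilities below are invented to show the arithmetic.

    import numpy as np

    # Probability the model assigned to each actual next token of a held-out sequence
    # (values made up for illustration).
    token_probs = np.array([0.40, 0.05, 0.60, 0.10, 0.25])

    nll = -np.log(token_probs)           # per-token negative log-likelihood (nats)
    perplexity = np.exp(nll.mean())      # lower is better; 1.0 would be a perfect model
    print(round(float(perplexity), 2))   # ~5.06 for these probabilities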
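
Further up the list, the parameter-efficient idea behind LoRA can also be sketched in a few lines: the frozen weight matrix W is augmented with a trainable low-rank update B·A, so only r·(d_in + d_out) extra parameters are trained. The layer sizes and rank below are illustrative.

    import numpy as np

    d_in, d_out, r = 512, 512, 8
    rng = np.random.default_rng(0)

    W = rng.normal(size=(d_out, d_in))       # frozen pretrained weight
    A = rng.normal(size=(r, d_in)) * 0.01    # trainable, low-rank
    B = np.zeros((d_out, r))                 # trainable, starts at zero
    alpha = 16                               # scaling hyperparameter

    def lora_forward(x):
        # y = W x + (alpha / r) * B (A x); with B = 0 the layer matches the base model.
        return W @ x + (alpha / r) * (B @ (A @ x))

    x = rng.normal(size=(d_in,))
    print(lora_forward(x).shape)                                 # (512,)
    print(f"trainable parameters: {A.size + B.size} vs {W.size} frozen")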

Bonus/Applied Topics

  • Using Hugging Face Transformers
    • Pipelines, tokenizers, models (pipeline usage sketched after this list)
    • Fine-tuning with Trainer API
  • Deploying LLMs
    • Quantization, pruning
    • On-device and cloud deployment
  • Open-source LLMs
    • LLaMA, Mistral, Falcon, etc.
  • Future Directions
    • Multimodal Transformers (e.g., CLIP, Flamingo, Gemini)
    • Agents (AutoGPT, BabyAGI)
    • LLMs with memory or tools
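
The applied bullets above map directly onto the Hugging Face transformers library. A minimal sketch, assuming the library is installed; the model ids and generation settings are just examples, and the weights are downloaded on first use.

    from transformers import pipeline

    # High-level pipeline API: a task name plus an optional model id from the Hub.
    generator = pipeline("text-generation", model="gpt2")
    out = generator("Transformers are", max_new_tokens=20, num_return_sequences=1)
    print(out[0]["generated_text"])

    summarizer = pipeline("summarization", model="t5-small")
    text = ("Self-attention lets every token attend to every other token in a single "
            "layer, which removes recurrence and makes training highly parallel.")
    print(summarizer(text, max_length=25, min_length=5)[0]["summary_text"])

Fine-tuning is typically done through the same library's Trainer API, optionally combined with parameter-efficient methods such as LoRA via the peft library.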

Copyright © 2025 Sai KAMPES - All Rights Reserved.
