LLM Skills Library

50 AI engineering techniques — real prompts, real code, real tradeoffs.

Prompting

Core techniques for getting better outputs from any LLM.

Chain-of-Thought

intermediate

Chain-of-thought prompting guides LLMs to reason step-by-step before answering. It dramatically improves accuracy on math, logic, and multi-step reasoning tasks. Learn zero-shot CoT, few-shot CoT, and when to use each.

Context Stuffing

intermediate

Context stuffing fills the LLM context window with relevant documents, code, or data so the model can reason over it directly — without retrieval. Learn when to stuff context vs. build RAG, and how to structure large contexts for maximum accuracy.

Few-Shot Prompting

beginner

Few-shot prompting dramatically improves LLM consistency by showing 2–8 examples of the desired input-output pattern before the actual query. Learn example selection, ordering, and formatting strategies.
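
As a sketch of the pattern (the sentiment task, labels, and layout are illustrative, not tied to any particular API):

```python
# Assemble a few-shot prompt from labeled examples so the model
# imitates the demonstrated input -> output pattern and label set.
EXAMPLES = [
    ("The movie was a delight.", "positive"),
    ("Worst purchase I've made.", "negative"),
    ("It arrived on time, nothing special.", "neutral"),
]

def build_few_shot_prompt(query: str) -> str:
    lines = ["Classify the sentiment as positive, negative, or neutral.", ""]
    for text, label in EXAMPLES:
        lines.append(f"Text: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Text: {query}")
    lines.append("Sentiment:")   # the model completes this line
    return "\n".join(lines)

prompt = build_few_shot_prompt("The battery died after two days.")
print(prompt)
```

Example order and formatting consistency matter: the model copies whatever pattern the examples establish, including whitespace and label casing.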

Meta-Prompting

advanced

Meta-prompting uses an LLM to generate, critique, and refine prompts for a target LLM. It automates prompt engineering by having the model act as a prompt optimizer — dramatically reducing manual iteration time.

Prompt Chaining

intermediate

Prompt chaining breaks complex tasks into sequential LLM calls where each output feeds the next. Learn when to chain, how to design handoffs, and how to handle errors mid-chain.
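
A minimal two-step chain, with `call_llm` as a hypothetical stand-in for a real chat-completion client:

```python
def call_llm(prompt: str) -> str:
    # Placeholder: a real implementation would call an LLM API here.
    return f"<output for: {prompt[:30]}...>"

def run_chain(document: str) -> str:
    # Step 1: extract key points; its output becomes step 2's input.
    summary = call_llm(f"List the key points of:\n{document}")
    if not summary.strip():          # handle errors mid-chain, before they cascade
        raise ValueError("step 1 returned empty output")
    # Step 2: write the final artifact from the distilled handoff.
    return call_llm(f"Write a one-paragraph brief from these points:\n{summary}")

print(run_chain("Q3 revenue grew 12%; churn fell to 2.1%; APAC expansion slipped."))
```

The validation between steps is the key design point: each handoff is a checkpoint where a bad intermediate output can be caught or retried instead of silently corrupting the final answer.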

Prompt Compression

advanced

Prompt compression reduces input token count by 50–90% through selective information removal, LLMLingua-style token pruning, and semantic summarization. Learn when compression helps, when it hurts, and how to measure the tradeoff.

ReAct Pattern

advanced

ReAct (Reasoning + Acting) interleaves LLM reasoning traces with tool actions, enabling agents to decompose tasks, call external APIs, and update their plan based on observations. It's the foundation of most production LLM agents.
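
A minimal sketch of the loop. The model's replies are scripted here (`SCRIPT` stands in for live LLM calls) and `search` is a hypothetical stub tool; the control flow — parse an action, run the tool, append the observation, repeat — is the ReAct pattern itself:

```python
SCRIPT = [
    "Thought: I need the population. Action: search[Oslo population]",
    "Thought: I have the fact. Final Answer: about 700,000",
]

def search(query: str) -> str:
    return "Oslo has roughly 700,000 inhabitants."   # stub tool

def react_loop(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}"
    for step in range(max_steps):
        reply = SCRIPT[step]                  # stand-in for llm(transcript)
        if "Final Answer:" in reply:
            return reply.split("Final Answer:", 1)[1].strip()
        # Parse the requested action, execute it, feed the observation back.
        action_arg = reply.split("Action: search[", 1)[1].rstrip("]")
        transcript += f"\n{reply}\nObservation: {search(action_arg)}"
    raise RuntimeError("no answer within step budget")

print(react_loop("What is the population of Oslo?"))
```

The `max_steps` budget is essential in production: without it, an agent that never emits a final answer loops forever.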

Role Prompting

beginner

Role prompting assigns a persona or expert identity to an LLM to improve output quality and domain alignment. Learn which roles work, why they help, and the limits of persona assignment.

Self-Consistency

advanced

Self-consistency runs the same chain-of-thought prompt multiple times with temperature > 0 and takes a majority vote on the final answers. It reliably improves accuracy on reasoning and math tasks at the cost of multiple inference calls.
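
The voting step is simple; in this sketch `sample_answer` is a hypothetical stand-in for one temperature > 0 chain-of-thought completion from which only the final answer has been parsed:

```python
from collections import Counter

def sample_answer(prompt: str, seed: int) -> str:
    # Stub: returns a pre-baked spread of final answers across samples.
    return ["42", "42", "41", "42", "39"][seed % 5]

def self_consistency(prompt: str, n: int = 5) -> str:
    """Sample n independent reasoning paths, then majority-vote the answers."""
    answers = [sample_answer(prompt, seed=i) for i in range(n)]
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("What is 6 * 7?"))   # → 42 (3 of 5 samples agree)
```

Cost scales linearly with `n`, so the technique pays off mainly on hard reasoning tasks where a single sample is unreliable.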

Structured Output

intermediate

Getting reliable JSON, CSV, or schema-compliant output from LLMs. Learn constrained decoding, schema prompting, validation loops, and which APIs guarantee valid JSON.
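
A sketch of the validation-loop approach, with a hypothetical `call_llm` stub that returns malformed JSON on the first attempt and a valid reply on retry:

```python
import json

def call_llm(prompt: str, attempt: int) -> str:
    # Stub: first reply is broken, the retry (with error feedback) is valid.
    return '{"name": "Ada", "age": }' if attempt == 0 else '{"name": "Ada", "age": 36}'

def get_json(prompt: str, max_retries: int = 2) -> dict:
    """Parse the reply; on failure, re-prompt with the parse error appended."""
    for attempt in range(max_retries + 1):
        raw = call_llm(prompt, attempt)
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError as e:
            prompt += f"\nYour last reply was invalid JSON ({e}). Return only valid JSON."
            continue
        if {"name", "age"} <= obj.keys():    # minimal schema check
            return obj
    raise ValueError("no valid JSON after retries")

print(get_json("Return {name, age} for Ada Lovelace as JSON."))
```

Constrained decoding (where the API supports it) makes this loop unnecessary; the retry loop is the fallback when the provider offers no JSON guarantee.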

System Prompt Design

intermediate

System prompts are the foundation of every production LLM application. Learn how to write system prompts that consistently control persona, format, safety constraints, and output quality.

Temperature & Sampling

intermediate

Temperature, top-p, top-k, and frequency penalties control how an LLM samples its output. Learn exactly what each parameter does, when to turn temperature to zero, and how to tune sampling for creative vs. deterministic tasks.
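
The mechanics are easy to see on a toy distribution — temperature divides the logits before softmax (sharpening below 1, flattening above), and top-p keeps the smallest set of tokens whose cumulative probability reaches the threshold:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def filter_top_p(probs, tokens, top_p=0.9):
    """Nucleus sampling: keep tokens until cumulative probability >= top_p."""
    ranked = sorted(zip(probs, tokens), reverse=True)
    kept, total = [], 0.0
    for p, t in ranked:
        kept.append((t, p))
        total += p
        if total >= top_p:
            break
    z = sum(p for _, p in kept)
    return {t: p / z for t, p in kept}   # renormalize the survivors

tokens = ["the", "a", "zebra", "qux"]
logits = [4.0, 3.0, 1.0, 0.5]
cold = softmax([l / 0.2 for l in logits])   # low temperature sharpens
hot  = softmax([l / 2.0 for l in logits])   # high temperature flattens
assert cold[0] > hot[0]
print(filter_top_p(softmax(logits), tokens, top_p=0.9))
```

On these logits the 0.9 nucleus keeps only "the" and "a": the long tail of unlikely tokens is cut off before sampling, which is why top-p curbs incoherent outputs at high temperature.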

Tree of Thought

advanced

Tree of Thought (ToT) enables LLMs to explore multiple reasoning branches in parallel, evaluate intermediate steps, and backtrack — mimicking deliberate human problem-solving for hard tasks.

XML Tags for Claude

intermediate

Claude is trained to follow XML-tagged prompt structure exceptionally well. Learn how to use XML tags to separate instructions from content, pass multi-part inputs, and improve Claude's output consistency and accuracy.
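
An illustrative prompt skeleton; the tag names below are our own choice (nothing in the API requires specific tags), what matters is that each part of the input sits in its own clearly delimited block:

```python
def build_prompt(instructions: str, document: str, question: str) -> str:
    # XML tags separate instructions from untrusted content and multi-part inputs.
    return (
        "<instructions>\n" + instructions + "\n</instructions>\n\n"
        "<document>\n" + document + "\n</document>\n\n"
        "<question>\n" + question + "\n</question>"
    )

prompt = build_prompt(
    "Answer using only the document. Reply inside <answer> tags.",
    "The warranty period is 24 months.",
    "How long is the warranty?",
)
print(prompt)
```

Asking for the reply inside tags (here `<answer>`) also makes the output trivially extractable with a string split or regex.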

Zero-Shot Prompting

beginner

Zero-shot prompting lets you use LLMs without providing any examples. Learn when it works, when it fails, and how to write zero-shot prompts that get reliable results.

RAG & Retrieval

Build accurate retrieval systems that don't hallucinate.

Chunking Strategies

intermediate

Chunking splits documents into pieces for embedding and retrieval. The right chunking strategy — fixed-size, semantic, hierarchical, or late chunking — directly determines RAG accuracy. Learn the tradeoffs for each approach.
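
The fixed-size baseline with overlap is a few lines; the sizes here are characters for illustration (production systems usually count tokens):

```python
def chunk_fixed(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Fixed-size chunking with overlap: the simplest baseline. The overlap
    keeps sentences that straddle a boundary retrievable from at least one chunk."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap   # advance by size minus overlap
    return chunks

doc = "".join(chr(97 + i % 26) for i in range(500))   # 500-char stand-in document
parts = chunk_fixed(doc, size=200, overlap=50)
print(len(parts), [len(p) for p in parts])   # 4 chunks; last one is the remainder
```

Semantic and hierarchical chunking replace the fixed `size` boundary with sentence, paragraph, or heading boundaries; the retrieval loop around them is identical.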

Contextual Retrieval

advanced

Contextual retrieval (Anthropic, 2024) prepends a short context summary to each chunk before embedding, giving the embedding model information about where the chunk sits in the document. It reduces retrieval failures by 49% on Anthropic's benchmarks.

Embedding Selection

intermediate

Choosing the right embedding model is one of the biggest levers for RAG retrieval quality. This guide covers the major embedding models in 2026, benchmarks, dimension/cost tradeoffs, and how to evaluate embeddings on your specific domain.

GraphRAG

advanced

GraphRAG builds a knowledge graph from your corpus and uses it to answer complex, multi-hop questions that naive vector RAG fails on. Microsoft's GraphRAG system (2024) showed 2–5x better performance on global/analytical queries.

Hybrid Search

advanced

Hybrid search combines dense vector search (semantic similarity) with sparse keyword search (BM25) to retrieve documents. It consistently outperforms either approach alone, especially for queries with specific terms, product names, or technical jargon.
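
Reciprocal Rank Fusion (RRF) is one common way to merge the two rankings without comparing their incompatible score scales — each document earns `1/(k + rank)` from every list it appears in:

```python
def rrf_fuse(dense_ranked: list[str], sparse_ranked: list[str], k: int = 60) -> list[str]:
    """Merge a vector-search ranking with a BM25 ranking by summed rank scores."""
    scores: dict[str, float] = {}
    for ranking in (dense_ranked, sparse_ranked):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["d2", "d1", "d3"]        # semantic-similarity order
sparse = ["d1", "d4", "d2"]        # BM25 keyword order
print(rrf_fuse(dense, sparse))     # → ['d1', 'd2', 'd4', 'd3']
```

`d1` wins because it ranks well in both lists, which is exactly the behavior hybrid search is after; `k=60` is the commonly used damping constant from the original RRF paper.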

Late Chunking

advanced

Late chunking (Jina AI, 2024) embeds the full document first to capture global context, then pools token embeddings into chunk representations. This preserves cross-sentence context in each chunk's embedding, improving retrieval for context-dependent text.

Metadata Filtering

intermediate

Metadata filtering narrows the vector search space by pre-filtering documents on structured attributes — date, category, author, language — before semantic search. It dramatically improves precision and enables multi-tenant RAG.
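
A toy in-memory version of the pattern; in a real system the filter runs inside the vector database (typically as a `filter` argument on the query), not in application code:

```python
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

DOCS = [
    {"id": "a", "tenant": "acme",   "year": 2025, "vec": [0.9, 0.1]},
    {"id": "b", "tenant": "acme",   "year": 2023, "vec": [0.8, 0.2]},
    {"id": "c", "tenant": "globex", "year": 2025, "vec": [0.95, 0.05]},
]

def filtered_search(query_vec, tenant, min_year, top_k=2):
    """Pre-filter on structured metadata, then rank the survivors semantically."""
    candidates = [d for d in DOCS if d["tenant"] == tenant and d["year"] >= min_year]
    candidates.sort(key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return [d["id"] for d in candidates[:top_k]]

print(filtered_search([1.0, 0.0], tenant="acme", min_year=2024))   # → ['a']
```

Note that `c` is the closest vector overall but belongs to another tenant — the filter guarantees it can never leak into the results, which is the multi-tenant isolation property.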

Query Expansion

advanced

Query expansion uses an LLM to rewrite, decompose, or augment user queries before retrieval, improving recall by generating hypothetical documents, sub-queries, or synonym variations. It solves the vocabulary mismatch problem in RAG.

RAG Evaluation

intermediate

RAG evaluation measures both retrieval quality (did we fetch the right chunks?) and generation quality (did the LLM produce an accurate, grounded answer?). Learn the RAGAS framework, key metrics, and how to build a continuous eval pipeline.

Reranking

intermediate

Reranking is a second-stage retrieval step that scores each retrieved chunk for relevance to the query using a cross-encoder model. It consistently improves RAG answer quality by 15–30% over pure vector search with minimal added latency.
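
The two-stage shape, sketched with a toy scoring proxy — `cross_encoder_score` here uses token overlap purely for illustration, where a real cross-encoder model jointly encodes query and chunk and outputs a learned relevance score:

```python
def rerank(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Stage 2: re-score each (query, chunk) pair and keep the best."""
    def cross_encoder_score(q: str, c: str) -> float:
        # Toy proxy only: fraction of query tokens appearing in the chunk.
        q_tokens, c_tokens = set(q.lower().split()), set(c.lower().split())
        return len(q_tokens & c_tokens) / len(q_tokens)

    scored = sorted(chunks, key=lambda c: cross_encoder_score(query, c), reverse=True)
    return scored[:top_k]

retrieved = [                       # stage 1: top hits from vector search
    "Reset your password from the account settings page.",
    "Our password policy requires 12 characters.",
    "Shipping takes 3-5 business days.",
]
print(rerank("how do I reset my password", retrieved))
```

The pattern is to over-retrieve in stage 1 (say, top 50 by vector similarity) and let the slower, more accurate reranker pick the final handful for the context window.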

Agents & Tools

Tool use, memory, planning, and multi-agent coordination.

Agent Evaluation

advanced

Evaluating LLM agents is harder than evaluating single-turn LLMs because agents take sequences of actions, have long-horizon goals, and can fail in many ways. Learn task completion metrics, trajectory evaluation, and how to build regression tests for agents.

Agent Memory

intermediate

LLM agents need memory to maintain context across conversations and sessions. Learn the four memory types (in-context, external, procedural, episodic), when to use each, and how to build persistent memory systems that don't hallucinate past events.

Agent Planning

advanced

Agent planning is how LLM agents decompose complex tasks into executable steps, manage dependencies between steps, and adapt the plan when execution diverges from expectations. Good planning architecture is the difference between agents that complete 10-step tasks and ones that fail after 3.

Error Recovery

intermediate

Production LLM agents fail in predictable ways: tool errors, invalid JSON, hallucinated arguments, and infinite loops. Learn defensive error handling patterns that let agents recover gracefully rather than crashing or producing wrong outputs.

Human-in-the-Loop

intermediate

Human-in-the-loop (HITL) patterns define when LLM agents pause for human confirmation, verification, or input. Proper HITL design prevents costly agent mistakes while avoiding excessive interruptions that destroy productivity.

Multi-Agent Coordination

advanced

Multi-agent systems use an orchestrator LLM to decompose tasks and delegate to specialized subagents. This enables parallelism, specialization, and fault isolation that single-agent architectures can't achieve. Learn the orchestrator/subagent pattern, handoff protocols, and when to use agents vs. tools.

Parallel Tool Calls

advanced

Parallel tool calling lets LLMs request multiple tool executions simultaneously in a single response, rather than sequentially. This reduces multi-step agent latency by 50–80% when tasks can run concurrently.
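
On the application side, executing a batch of independent tool calls concurrently is a thread-pool fan-out; the tools below are hypothetical, with `time.sleep` standing in for slow external API calls:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def get_weather(city: str) -> str:
    time.sleep(0.1)               # stand-in for a slow external API
    return f"{city}: 21C"

def get_stock(symbol: str) -> str:
    time.sleep(0.1)
    return f"{symbol}: $187"

def run_tool_calls_parallel(calls):
    """Run independent tool calls concurrently, as when a model returns
    several tool-use requests in a single turn."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(fn, *args) for fn, args in calls]
        return [f.result() for f in futures]   # results stay in request order

calls = [(get_weather, ("Paris",)), (get_stock, ("ACME",))]
start = time.perf_counter()
results = run_tool_calls_parallel(calls)
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")   # ~0.1s total, not 0.2s sequential
```

The ordering guarantee matters: most function-calling APIs expect tool results returned in correspondence with the requested call IDs, so collect results in order even though execution is concurrent.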

Prompt Caching

intermediate

Prompt caching (Anthropic, OpenAI) stores computed key-value pairs for long prompt prefixes and reuses them across requests. On cache hits it cuts input token costs by up to 90% and latency by up to 85% — essential for any agent with a large system prompt or repeated context.

Streaming

intermediate

Streaming sends LLM tokens to the client as they're generated instead of waiting for the complete response. It cuts time-to-first-token from 3–10s to under 500ms, dramatically improving perceived responsiveness for long-form outputs.

Tool Use

intermediate

Tool use (function calling) lets LLMs call external APIs, run code, and query databases by describing available functions and receiving structured JSON calls. It's the foundation of all modern LLM agents.
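
The two halves of the contract, sketched with a hypothetical tool: a schema advertised to the model (the shape below follows common function-calling APIs, though exact field names vary by provider), and a dispatcher that executes the model's structured call:

```python
import json

def get_weather(city: str) -> str:
    return f"Sunny in {city}"      # hypothetical tool backing the schema

TOOLS = {"get_weather": get_weather}

TOOL_SCHEMA = [{
    "name": "get_weather",
    "description": "Get current weather for a city.",
    "parameters": {"type": "object",
                   "properties": {"city": {"type": "string"}},
                   "required": ["city"]},
}]

def dispatch(tool_call_json: str) -> str:
    """Parse the model's structured tool call and run the named function."""
    call = json.loads(tool_call_json)
    fn = TOOLS[call["name"]]       # KeyError here means a hallucinated tool name
    return fn(**call["arguments"])

# Simulated model output: a structured call, not free text.
print(dispatch('{"name": "get_weather", "arguments": {"city": "Oslo"}}'))
```

In a full loop, the dispatcher's return value is sent back to the model as a tool-result message, and the model either calls another tool or writes its final answer.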

Evaluation

Measure, test, and prevent regressions in LLM applications.

A/B Model Testing

advanced

A/B model testing runs two LLM configurations in parallel on real production traffic to measure which produces better outcomes. Unlike offline evals, A/B tests measure actual user behavior and business metrics — the ultimate signal for LLM quality.

Benchmark Selection

intermediate

Choosing the right benchmarks determines whether your model evaluation is predictive of real-world performance. Learn which benchmarks matter in 2026, how to avoid benchmark gaming, and when to build domain-specific benchmarks instead.

Evals Framework

intermediate

An evals framework systematically measures LLM application quality across multiple dimensions, catches regressions, and provides actionable feedback. Learn how to structure eval pipelines, write eval functions, and integrate evals into CI/CD.

Golden Dataset

intermediate

A golden dataset is a curated set of input/expected output pairs used as ground truth for evaluation. It's the foundation of every reliable LLM eval pipeline. Learn how to build, maintain, and expand golden datasets efficiently.

LLM-as-Judge

intermediate

LLM-as-judge uses a language model to score other LLM outputs on quality dimensions like correctness, faithfulness, helpfulness, and safety. It scales evaluation to millions of examples where human labeling is impractical.
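
The core of a judge is a rubric prompt plus defensive parsing of the score; the prompt wording and the 1–5 scale below are illustrative, and the lambda stands in for a real judge-model API call:

```python
JUDGE_PROMPT = """Rate the ANSWER for factual correctness against the REFERENCE.
Respond with only an integer from 1 to 5.

QUESTION: {question}
REFERENCE: {reference}
ANSWER: {answer}
SCORE:"""

def judge_score(question, reference, answer, call_llm) -> int:
    """Ask a judge model for a 1-5 score and parse the reply defensively."""
    raw = call_llm(JUDGE_PROMPT.format(
        question=question, reference=reference, answer=answer))
    digits = [int(ch) for ch in raw if ch.isdigit()]
    if not digits or not 1 <= digits[0] <= 5:
        raise ValueError(f"unparseable judge reply: {raw!r}")
    return digits[0]

# Stub judge for illustration; a real judge would be an LLM call.
print(judge_score("Capital of France?", "Paris", "Paris", lambda p: " 5 "))
```

Judges drift and have biases (position, verbosity, self-preference), so calibrate the judge against a sample of human labels before trusting it at scale.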

Regression Testing

intermediate

Regression testing for LLMs catches quality degradations when you change prompts, upgrade models, or modify retrieval systems. Learn how to structure regression tests, set meaningful pass/fail thresholds, and integrate them into your CI/CD pipeline.
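
The simplest regression gate is an accuracy threshold over a golden dataset; in this sketch `model_answer` is a hypothetical stand-in for the system under test, and exact-match is the (deliberately strict) comparison:

```python
GOLDEN = [
    {"input": "2+2",               "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
    {"input": "3*3",               "expected": "9"},
]

def model_answer(query: str) -> str:
    # Stub system under test, with one deliberately wrong answer.
    return {"2+2": "4", "capital of France": "Paris", "3*3": "6"}[query]

def run_regression(threshold: float = 0.8):
    """Fail the build if accuracy on the golden set drops below threshold."""
    passed = sum(model_answer(c["input"]) == c["expected"] for c in GOLDEN)
    accuracy = passed / len(GOLDEN)
    return accuracy, accuracy >= threshold

accuracy, ok = run_regression()
print(f"accuracy={accuracy:.2f} gate={'PASS' if ok else 'FAIL'}")
```

In CI, the boolean gate becomes the exit code; for free-form outputs, swap exact-match for a semantic comparison or an LLM-as-judge score, but keep the same threshold structure.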

Cost & Latency

Cut costs 50–90% and reduce latency without sacrificing quality.

Safety & Security

Prevent prompt injection, validate outputs, handle PII.