Artificial Intelligence & Cognitive Systems
Welcome to the AI section of my digital brain. This space is dedicated to tracking the rapidly evolving landscape of Large Language Models (LLMs), agentic workflows, integration protocols, and testing frameworks.
1. Modern LLMs
The current state of AI is driven by a mix of proprietary frontier models and highly capable open-weights models:
- Proprietary Frontier Models:
  - Gemini 1.5 Pro / Flash: Natively multimodal, with context windows of up to 2 million tokens (1.5 Pro) and 1 million tokens (1.5 Flash), which changes how we handle long-document research and codebase analysis.
  - Claude 3.5 Sonnet: A leader in software development and coding assistance, with strong performance on reasoning tasks.
  - GPT-4o / o1: OpenAI's multimodal model (GPT-4o) and its reasoning model (o1). o1 is designed for complex multi-step planning and works through a chain of thought before answering.
- Open-Weights Ecosystem:
  - Llama 3 / 3.1 / 3.2: Meta's flagship open models. Llama 3.1 scales up to 405B parameters with a 128k context window, bringing near-frontier capability to self-hosted deployments; Llama 3.2 adds small text models (1B, 3B) and vision models (11B, 90B).
  - Gemma 2: Google's lightweight, high-performance open models (2B, 9B, and 27B) optimized for local development and efficiency.
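To make the context-window numbers concrete, here is a rough budgeting sketch for checking whether a document or codebase fits in a single prompt. The 4-characters-per-token ratio is an assumed rule of thumb for English text, not a real tokenizer, and the model keys are just labels I made up for this note; the window sizes are the ones quoted above.

```python
# Rough context-window budgeting.
CHARS_PER_TOKEN = 4  # assumption: crude heuristic, real tokenizers will differ

# Window sizes (in tokens) as quoted in the notes above; keys are made-up labels.
CONTEXT_WINDOWS = {
    "gemini-1.5-pro": 2_000_000,
    "llama-3.1": 128_000,
}


def estimate_tokens(text: str) -> int:
    """Estimate the token count of `text` from its character length."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def models_that_fit(text: str, reserve_for_output: int = 4_000) -> list[str]:
    """Return the models whose window holds the text plus an output reserve."""
    needed = estimate_tokens(text) + reserve_for_output
    return [name for name, window in CONTEXT_WINDOWS.items() if window >= needed]


if __name__ == "__main__":
    codebase = "x" * 1_200_000  # stand-in for roughly 1.2 MB of source text
    print(estimate_tokens(codebase), models_that_fit(codebase))
```

For anything close to the limit, I would count tokens with the provider's own tokenizer instead of trusting this estimate.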
2. Agentic Workflows & Architectures
We are moving from simple "system-prompt-and-respond" chats to autonomous agents that can plan, reflect, and interact with tools.
Core Components of an Agent:
- Planning:
  - Task Decomposition: Breaking down complex requests into smaller, manageable sub-goals (e.g., via Chain-of-Thought or Tree-of-Thoughts prompting).
  - Self-Reflection: Analyzing tool outputs and correcting course if the plan is failing.
- Memory:
  - Short-term memory: In-context conversation history.
  - Long-term memory: External stores (typically vector databases) holding past knowledge and interactions, retrieved via semantic search (as in RAG).
- Tool Use:
  - Calling APIs, executing code in secure sandboxes, searching the web, and reading database schemas.
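The components above can be sketched as a tiny control loop. This is a minimal illustration under my own assumptions, not how any particular framework does it: `Agent`, `add`, and the scripted planner are hypothetical names, the planner is a plain function standing in for an LLM call, and "reflection" is reduced to noticing a failed tool call and abandoning the plan.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Agent:
    # Stand-in for an LLM planner: maps a goal to (tool_name, argument) steps.
    planner: Callable[[str], list[tuple[str, str]]]
    tools: dict[str, Callable[[str], str]]
    # Short-term memory: the running transcript of this task.
    memory: list[str] = field(default_factory=list)

    def run(self, goal: str) -> list[str]:
        self.memory.append(f"goal: {goal}")
        # Planning: decompose the goal into tool-call steps.
        for tool_name, arg in self.planner(goal):
            try:
                # Tool use: execute the step and capture the observation.
                observation = self.tools[tool_name](arg)
            except Exception as exc:
                # Self-reflection (very simplified): record the failure and
                # stop instead of continuing with a broken plan.
                self.memory.append(f"{tool_name}({arg!r}) failed: {exc}")
                self.memory.append("reflection: plan is failing, stopping early")
                break
            self.memory.append(f"{tool_name}({arg!r}) -> {observation}")
        return self.memory


def add(expr: str) -> str:
    """Toy tool: add two integers written as 'a+b'."""
    left, right = expr.split("+")
    return str(int(left) + int(right))


if __name__ == "__main__":
    agent = Agent(planner=lambda goal: [("add", "2+3"), ("add", "oops")],
                  tools={"add": add})
    for line in agent.run("sum some numbers"):
        print(line)
```

A real agent would feed `memory` back into the next planning call and re-plan after a failure; long-term memory would be a separate store queried by similarity rather than this in-process list.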
3. Model Context Protocol (MCP)
The Model Context Protocol (MCP) is an open standard proposed by Anthropic that enables developers to build secure, standardized connections between AI models and their data sources or development tools.
- Why MCP? Prior to MCP, developers had to write a custom integration for every combination of tool (GitHub, Postgres, Slack, etc.) and model provider, so the work grew with tools times providers.
- How it works: Much like the Language Server Protocol (LSP) standardized IDE integrations for programming languages, MCP acts as a universal bridge. Messages are exchanged as JSON-RPC 2.0.
  - MCP Hosts: Applications such as IDEs (Cursor, VS Code) or chat interfaces that embed the LLM and want to reach external data and tools.
  - MCP Clients: Connectors that live inside the host, each maintaining a one-to-one connection with a single server.
  - MCP Servers: Standardized servers exposing specific tools, prompts, or resources (e.g., a filesystem server, database viewer, or web search tool).
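To get a feel for the client/server exchange, here is a toy in-process "server" that answers the two tool-related requests, `tools/list` and `tools/call`. The message shapes follow my reading of the spec and are heavily simplified: the initialization handshake, transports, resources, and prompts are all left out, and `word_count` and `handle_request` are hypothetical names. A real server would normally be built with an official SDK.

```python
import json

# Hypothetical tool registry for the toy server.
TOOLS = {
    "word_count": {
        "description": "Count the words in a piece of text.",
        "inputSchema": {
            "type": "object",
            "properties": {"text": {"type": "string"}},
            "required": ["text"],
        },
        "handler": lambda args: str(len(args["text"].split())),
    }
}


def error(request_id, code: int, message: str) -> str:
    """Build a JSON-RPC 2.0 error response."""
    return json.dumps({"jsonrpc": "2.0", "id": request_id,
                       "error": {"code": code, "message": message}})


def handle_request(raw: str) -> str:
    """Answer one JSON-RPC 2.0 request with a JSON-RPC 2.0 response."""
    request = json.loads(raw)
    request_id = request.get("id")
    method, params = request["method"], request.get("params", {})

    if method == "tools/list":
        # Advertise each tool's name, description, and JSON Schema for inputs.
        result = {"tools": [
            {"name": name, "description": tool["description"],
             "inputSchema": tool["inputSchema"]}
            for name, tool in TOOLS.items()
        ]}
    elif method == "tools/call":
        tool = TOOLS.get(params.get("name"))
        if tool is None:
            return error(request_id, -32602, "Unknown tool")
        text = tool["handler"](params.get("arguments", {}))
        result = {"content": [{"type": "text", "text": text}]}
    else:
        return error(request_id, -32601, "Method not found")

    return json.dumps({"jsonrpc": "2.0", "id": request_id, "result": result})


if __name__ == "__main__":
    call = {"jsonrpc": "2.0", "id": 1, "method": "tools/call",
            "params": {"name": "word_count", "arguments": {"text": "a b c"}}}
    print(handle_request(json.dumps(call)))
```

The point of the sketch is the division of labor: the host never needs to know how `word_count` is implemented, only how to list tools and call one by name.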
4. Evaluation & Testing Harnesses
As AI applications grow more complex, qualitative "vibe-checks" are no longer sufficient. We need structured evaluation harnesses to catch regressions and to benchmark performance:
- Standard Benchmarks:
  - MMLU (Massive Multitask Language Understanding): Evaluates knowledge across 57 subjects (humanities, sciences, etc.).
  - GSM8K & MATH: Benchmarks for grade-school math and advanced competition mathematics.
  - HumanEval: Standard coding evaluation containing 164 hand-written Python problems, each checked with unit tests.
- Harness Implementations:
  - EleutherAI LM-Eval-Harness (lm-evaluation-harness): An open framework to evaluate LLMs on hundreds of academic benchmarks with zero-shot/few-shot configurations.
  - Custom Agent Harnesses: Mocking environments (using sandboxed tools and pre-recorded outputs) to automatically run agents through coding or web-navigation tasks and grade the outcomes based on final system states.
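A custom agent harness of the kind described in the last bullet can be sketched in a few lines. Everything here is hypothetical scaffolding for illustration (`MockFileSystem`, `Task`, `run_harness`, and the toy `uppercase_agent`): the agent only ever touches an in-memory mock seeded with pre-recorded files, and grading looks at the final state rather than at the steps taken.

```python
from dataclasses import dataclass
from typing import Callable


class MockFileSystem:
    """Sandboxed stand-in for a real filesystem tool."""

    def __init__(self, files: dict[str, str]):
        self.files = dict(files)  # copy so tasks cannot leak state

    def read(self, path: str) -> str:
        return self.files[path]

    def write(self, path: str, content: str) -> None:
        self.files[path] = content


@dataclass
class Task:
    name: str
    initial_files: dict[str, str]   # pre-recorded environment
    expected_files: dict[str, str]  # final state the grader checks


def run_harness(agent: Callable[[MockFileSystem], None],
                tasks: list[Task]) -> dict[str, bool]:
    """Run the agent on each task in a fresh sandbox and grade the end state."""
    results = {}
    for task in tasks:
        fs = MockFileSystem(task.initial_files)
        try:
            agent(fs)
        except Exception:
            results[task.name] = False  # a crash counts as a failed task
            continue
        results[task.name] = all(
            fs.files.get(path) == content
            for path, content in task.expected_files.items()
        )
    return results


def uppercase_agent(fs: MockFileSystem) -> None:
    """Hypothetical agent under test: uppercases in.txt into out.txt."""
    fs.write("out.txt", fs.read("in.txt").upper())


if __name__ == "__main__":
    tasks = [
        Task("upper", {"in.txt": "hi"}, {"out.txt": "HI"}),
        Task("missing-input", {}, {"out.txt": "X"}),
    ]
    results = run_harness(uppercase_agent, tasks)
    print(results, f"pass rate: {sum(results.values())}/{len(results)}")
```

Because each task gets a fresh sandbox and a deterministic check, the same task list can be rerun after every prompt or model change to spot regressions.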