2025-07-29
Let’s try an exercise.
In what ways is GenAI transforming this field? What should we do in this class in response?
How I use it:
I do not use it to mark your assignments, tempting as that is.
See the course README file and the University position statement.
(from https://substack.com/home/post/p-160131730)
Emphasize craftsmanship. AI rewards having a structured, careful development process:
We will do what Andrej Karpathy calls “vibe coding”: a zen-like use of autocomplete to try to get the AI to do something useful.
Kent Beck has a nice model of how this works: you add features, which hurts modularity, and then refactor to get the modularity - the options - back.
Vibe coding vs. “Augmented Coding” (Kent Beck)
| Vibes, man | Augmented |
|---|---|
| ‘continue monkey’ | Testing |
| ‘Don’t look at code’ | Security |
| ‘Let it rip’ | Maintainability |
| ‘Keep Going!’ | Reliability |
| ‘YOLO’ | Correctness |
| | Performance and scaling |
via Gergely
Context Venn diagram via https://www.philschmid.de/context-engineering
As Simon Willison writes,
The entire game when it comes to prompting LLMs is to carefully control their context—the inputs (and subsequent outputs) that make it into the current conversation with the model.
So how can we do that?
An LLM is trained on vast amounts of general data, like all of Wikipedia and all of GitHub.
But our problem is a particular one, and we don’t know if the LLM’s distribution matches ours.
So we need to steer it to the space of solutions that apply to us.
Another term for this is context engineering 1
+1 for “context engineering” over “prompt engineering”.
People associate prompts with short task descriptions you’d give an LLM in your day-to-day use. When in every industrial-strength LLM app, context engineering is the delicate art and science of filling the context window with just the right information for the next step. Science because doing this right involves task descriptions and explanations, few shot examples, RAG, related (possibly multimodal) data, tools, state and history, compacting […] Doing this well is highly non-trivial. And art because of the guiding intuition around LLM psychology of people spirits. […]
(derived from the article here)
Context is not free… every token influences the model’s behavior.
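As a concrete illustration, here is a minimal sketch of what “filling the context window” can look like for a single model call. All names here (including `retrieve_relevant_docs` and the message format) are hypothetical placeholders, not a specific library's API; the point is that we decide exactly what the model sees, and every token counts.

```python
# Sketch of context engineering for one model call (all names are placeholders).

def retrieve_relevant_docs(query: str, k: int = 3) -> list[str]:
    """Stand-in for a retrieval step (e.g. a vector-store lookup)."""
    return ["<project doc 1>", "<project doc 2>", "<project doc 3>"][:k]

def build_context(task: str, history: list[dict], query: str) -> list[dict]:
    """Assemble exactly what the model will see for the next step."""
    system = {"role": "system", "content": task}          # task description / instructions
    few_shot = [                                           # examples of the output we want
        {"role": "user", "content": "Example input ..."},
        {"role": "assistant", "content": "Example output ..."},
    ]
    docs = {"role": "user",                                # retrieved, project-specific context
            "content": "Relevant documents:\n" + "\n---\n".join(retrieve_relevant_docs(query))}
    return [system, *few_shot, *history, docs, {"role": "user", "content": query}]
```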
A lot of the challenge with software engineering is understanding what the current program is supposed to be doing—what Peter Naur called the “theory” of the program.
The LLM is no different. We need to tell it what to do.
RAG: try to figure out the most relevant documents for the user’s question and stuff as many of them as possible into the prompt. – Simon Willison
Combine well-understood information retrieval approaches with transformers to condition the result on the document set.
Given input sequence \(x\), and text documents \(z\), we want to generate output sequence \(y\).
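Roughly, in the spirit of the original RAG formulation (Lewis et al. 2020): a retriever (parameters \(\eta\)) scores documents given the input, a generator (parameters \(\theta\)) conditions on each retrieved document, and the output probability marginalizes over the top-\(k\) retrieved documents:

\[
p(y \mid x) \;\approx\; \sum_{z \in \text{top-}k} p_\eta(z \mid x)\; p_\theta(y \mid x, z)
\]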
Why does this work?
An eval is a test that checks whether the AI’s output is doing what you had hoped. It isn’t a unit test exactly, since unit tests are (ideally) repeatable and deterministic. But an eval should give you a sense that the output is what you expect.
Q: What is an eval for a data analysis project?
This is an emerging space. One approach is described here.
It uses the PromptFoo tool to manage the workflow.
The idea is that you write down prompts and test cases, attach assertions describing the expected output, and then run the suite against one or more models.
Key to an eval is the assertion describing what to look for in the model’s output. In PromptFoo there are several types of assertions 1
The model-graded assertions take a prompt to the LLM that will judge the output. E.g., select-best might say “choose the most concise and accurate response”. Now, whether the other LLM will do that is subject to that LLM’s performance (you can see the recursion here).
A .yaml file holds all the evals for a given problem. This Google sheet gives some examples for an application that tries to parse government websites for food assistance.
**capability**: It gives accurate advice about asset limits based on your state
**question**: I am trying to figure out if I can get food stamps. I lost my job 2 months ago, so have not had any income. But I do have $10,000 in my bank account. I live in Texas. Can I be eligible for food stamps? Answer with only one of: YES, NO, REFUSE.
**__expected**: contains:NO
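As a sketch, the row above might be expressed in a PromptFoo config roughly like this (field names follow PromptFoo’s documented YAML format; the provider and rubric are illustrative assumptions):

```yaml
# promptfooconfig.yaml (sketch): one test case derived from the row above
prompts:
  - "You are a benefits assistant. {{question}}"
providers:
  - openai:gpt-4o-mini        # whichever model(s) you want to evaluate
tests:
  - description: "Accurate advice about asset limits based on your state"
    vars:
      question: "I am trying to figure out if I can get food stamps. ... Answer with only one of: YES, NO, REFUSE."
    assert:
      - type: contains         # deterministic string check
        value: "NO"
      - type: llm-rubric       # model-graded: another LLM judges the output
        value: "Does not speculate beyond the asset rules of the stated state."
```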
What do you need to develop these evals?
from google.adk.agents import Agent  # assuming Google's Agent Development Kit (ADK)

# get_weather and get_current_time are ordinary Python functions (tools)
# defined elsewhere; the agent may call them to answer user questions.
root_agent = Agent(
    name="weather_time_agent",
    model="gemini-2.0-flash",
    description=(
        "Agent to answer questions about the time and weather in a city."
    ),
    instruction=(
        "You are a helpful agent who can answer user questions about the time and weather in a city."
    ),
    tools=[get_weather, get_current_time],
)
Problem: I don’t want the AI to speculate on how to access my API - I have a precise set of calls it can use.
I don’t want it to reinvent regular expressions – just use sed, grep, awk etc.
How do we tell the AI what is available? We need to connect it to the API so that it can discover what is possible.
NEVER try to edit a file by running terminal commands unless the user specifically asks for it. (Copilot instructions) 1
“MCP solves this problem by providing a standardized way for AI models to discover what tools are available, understand how to use them correctly, and maintain conversation context while switching between different tools. It brings determinism and structure to agent-tool interactions, enabling reliable integration without custom code for each new tool” 1
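For example, here is a minimal sketch of an MCP server that exposes only a fixed set of text-processing tools, assuming the official MCP Python SDK’s FastMCP helper (the tool bodies are simplified placeholders):

```python
# Sketch of an MCP server exposing a precise, fixed set of tools.
# Assumes the official MCP Python SDK (`pip install mcp`).
import subprocess
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("text-tools")

@mcp.tool()
def grep(pattern: str, path: str) -> str:
    """Search a file for a pattern using grep (the only search call the agent may use)."""
    result = subprocess.run(["grep", "-n", pattern, path], capture_output=True, text=True)
    return result.stdout

@mcp.tool()
def sed_preview(pattern: str, replacement: str, path: str) -> str:
    """Preview a sed substitution over a file, without editing it in place."""
    result = subprocess.run(["sed", f"s/{pattern}/{replacement}/g", path],
                            capture_output=True, text=True)
    return result.stdout

if __name__ == "__main__":
    mcp.run()   # a connected agent can now discover and call exactly these tools
```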

Neil Ernst ©️ 2024-5