2026-05-30
How did “AI” help in Software Engineering tasks in the past?
Natural language processing (NLP) is a branch of AI concerned with representing, reasoning about, and manipulating natural language and text documents (as opposed to images, videos, moving robots around, etc.).
Language models are often input to the higher-order tasks above. A language model captures the relationships between words in a language so that it can predict which tokens go where.
As we will see, one way to think about a language model is that it can solve a task like: [^bench]
What does System.out.println("neil is an awesome instructor") do?
Source code is a language!!
Complete:
System.out.println( ...
Complete:
My cat has three ...
Relationships in source code can span many tokens (e.g. an opening { and what follows it, or var theVar; ... ; theVar = 5;). Need something to represent this span of relationship.
Convert text into numeric representations of the text. Remember that in training we will be using these tokens billions of times.
Try out https://simonwillison.net/2023/Jun/8/gpt-tokenizers/
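A minimal sketch of converting text into numeric token ids, assuming a simple word-and-punctuation split. Real tokenizers such as the GPT ones linked above use learned subword vocabularies instead, and the helper names here (tokenize, build_vocab, encode) are made up for illustration.

```python
import re

def tokenize(text):
    """Split text into identifiers, numbers, and single punctuation characters."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", text)

def build_vocab(tokens):
    """Give each distinct token an integer id, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def encode(tokens, vocab):
    """Replace each token by its integer id."""
    return [vocab[tok] for tok in tokens]

source = "var theVar; theVar = 5;"
tokens = tokenize(source)
vocab = build_vocab(tokens)
ids = encode(tokens, vocab)

print(tokens)  # ['var', 'theVar', ';', 'theVar', '=', '5', ';']
print(ids)     # [0, 1, 2, 1, 3, 4, 2]
```

Note that both uses of theVar map to the same id, which is what lets a model later relate them.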
Create a language model by effectively counting how often a particular set of n tokens occurs. For example (from the Hindle paper):
\(p(a_4|a_1a_2a_3) = \frac{count(a_1a_2a_3a_4)}{count(a_1a_2a_3*)}\)
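The counting formula above can be sketched in a few lines. This is an illustration, not the Hindle implementation: the toy corpus and function names are assumptions, and there is no smoothing, so unseen prefixes simply get probability 0.

```python
from collections import Counter

def train_ngram(tokens, n):
    """Count every n-gram and every (n-1)-token prefix that starts an n-gram."""
    ngram_counts = Counter()
    prefix_counts = Counter()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        ngram_counts[gram] += 1
        prefix_counts[gram[:-1]] += 1  # count(a_1 a_2 a_3 *)
    return ngram_counts, prefix_counts

def prob(next_token, prefix, ngram_counts, prefix_counts):
    """p(next_token | prefix) = count(prefix + next_token) / count(prefix *)."""
    prefix = tuple(prefix)
    total = prefix_counts[prefix]
    if total == 0:
        return 0.0  # unseen prefix; a real model would smooth or back off
    return ngram_counts[prefix + (next_token,)] / total

# Toy corpus: two nested loop headers, already split into tokens.
corpus = ("for ( int i = 0 ; i < n ; i ++ ) "
          "for ( int j = 0 ; j < n ; j ++ )").split()
ngrams, prefixes = train_ngram(corpus, 4)

p_i = prob("i", ["for", "(", "int"], ngrams, prefixes)
p_zero = prob("0", ["int", "i", "="], ngrams, prefixes)
print(p_i)     # 0.5 ("i" and "j" each follow "for ( int" once)
print(p_zero)  # 1.0
```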
Measure success by log perplexity, or cross-entropy:
\(H_{\mathcal{M}}(s) = - \frac{1}{n} \log p_\mathcal{M}(a_1 \ldots a_n)\)
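As a rough worked example of the cross-entropy measure: by the chain rule the sequence probability is a product of per-token probabilities, so the log becomes a sum. The sketch below assumes log base 2 (bits per token) and that the model has already given us a nonzero probability for each token; the probabilities are made up.

```python
import math

def cross_entropy(token_probs):
    """-(1/n) * sum(log2 p_i): average number of bits needed per token."""
    n = len(token_probs)
    return -sum(math.log2(p) for p in token_probs) / n

# Hypothetical per-token probabilities from two models on the same 4-token snippet.
confident = [0.5, 0.5, 0.5, 0.5]
uncertain = [0.125, 0.125, 0.125, 0.125]

h_confident = cross_entropy(confident)
h_uncertain = cross_entropy(uncertain)
print(h_confident)       # 1.0
print(h_uncertain)       # 3.0
print(2 ** h_uncertain)  # perplexity: 8.0
```

Lower is better: the model that finds the text less surprising gets the lower score.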
After preprocessing, we need to encode the tokens (e.g. “f|unc|tion”) into a numeric representation.
function = [0.322, 0.113, 0.567,..].
Words that “mean” similar things (for our domain of interest) should be closer together on some distance metric.
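One common choice of distance metric is cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; real embeddings are learned from data and have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, invented for this example.
embeddings = {
    "function": [0.9, 0.1, 0.0],
    "method":   [0.8, 0.2, 0.1],
    "cat":      [0.0, 0.1, 0.9],
}

sim_related = cosine_similarity(embeddings["function"], embeddings["method"])
sim_unrelated = cosine_similarity(embeddings["function"], embeddings["cat"])
print(sim_related > sim_unrelated)  # True
```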
This works OK … but language is pretty complicated.
The word embedding model is restricted to looking a few tokens ahead or behind (the window/context).
That means learning more complex relationships (e.g., this phrase modifies a previous noun) is hard to do.
Can we improve on these approaches?
Another approach is to use deep learning via attention mechanisms in transformer models. This is how BERT, GPT-4, Gemini, etc. work.
Which of these is most effective, and which is most costly?
A transformer is an ML architecture that encodes an input and decodes an output:


Earlier we discussed the problem of sliding windows. Attention is a way to make the model ‘remember’ what it saw before.
attention example from jalammar
Score word pairs based on relevance using something like our embedding example (kings/queens/man/woman).
Then use these scores to weight and combine the word representations, and repeat this for many words, many times.
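The scoring-and-combining step can be sketched as scaled dot-product attention. This is a simplified illustration: real transformers first pass the embeddings through learned query, key, and value projections and run several attention heads in parallel, all of which is omitted here. The 2-dimensional vectors are made up.

```python
import math

def softmax(xs):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """For each query: score every key, softmax the scores, mix the values."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Relevance score of this word against every word (dot product, scaled).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three hypothetical token embeddings; tokens 0 and 1 are similar, token 2 is not.
x = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = attention(x, x, x)  # self-attention: Q = K = V = x

print(weights[0])  # token 0 puts more weight on token 1 than on token 2
```

Because every token is scored against every other token, there is no fixed window: the cost grows with the square of the sequence length instead.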
Sutton’s bitter lesson:
general methods that leverage computation are ultimately the most effective, and by a large margin
https://jalammar.github.io/illustrated-transformer/

Pretraining: ingest vast amounts of textual representations.
Posttraining: condition the model to go beyond spitting out simplistic completions.
RLHF: Get low-paid gig workers to solve the knowledge acquisition bottleneck.
the policy is a language model that takes in a prompt and returns a sequence of text (or just probability distributions over text). The action space of this policy is all the tokens corresponding to the vocabulary of the language model and the observation space is the distribution of possible input token sequences, which is also quite large given previous uses of RL. The reward function is a combination of the preference model and a constraint on policy shift.

source: https://huggingface.co/blog/rlhf
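A rough sketch of the reward described in the quote: a preference-model score minus a penalty for drifting away from the original model, with the drift measured here as KL divergence. Everything below is a simplification under stated assumptions: the function names, the penalty weight of 0.1, and the toy 3-token distributions are invented, and real systems compute the penalty per token over whole generated sequences.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def rlhf_reward(preference_score, policy_probs, reference_probs, kl_weight=0.1):
    """Preference-model score minus a penalty on policy shift (kl_weight is hypothetical)."""
    penalty = kl_divergence(policy_probs, reference_probs)
    return preference_score - kl_weight * penalty

reference = [0.5, 0.3, 0.2]  # next-token distribution of the original model
unchanged = [0.5, 0.3, 0.2]  # tuned policy that has not moved
shifted = [0.9, 0.05, 0.05]  # tuned policy that has drifted a lot

r_unchanged = rlhf_reward(1.0, unchanged, reference)
r_shifted = rlhf_reward(1.0, shifted, reference)
print(r_unchanged)              # 1.0 (no shift, so no penalty)
print(r_shifted < r_unchanged)  # True
```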
What constitutes a good test for LLM coding?
Test set pollution: we have to assume the training data includes problems that are available online (e.g. LeetCode).

©️ Neil Ernst