large language models in SE

Neil Ernst

nernst@uvic.ca

University of Victoria

2026-05-30

Large Language Models for Software Engineering

What Did We Do Before

How did “AI” help in Software Engineering tasks in the past?

What are we training on?

Textual data
Sources?
Other approaches? Abstract Syntax Trees (ASTs)?

NLP

Natural language processing (NLP) is a branch of AI concerned with representing, reasoning about, and manipulating natural language and text documents (as opposed to images, videos, moving robots around, etc.).

Language models are often input to the higher-order tasks above. A language model captures the relationships between words in a language so that it can predict which tokens go where.

Tasks for LLMs

As we will see, one way to think about a language model is that it can solve tasks like these:

Prediction/Masking: what replaces 🔡 in “Jim drove a 🔡 to the Drive-Thru line” (masking). Commonly used for self-supervision in training.
Analogy: “Red is to rose as 🔡 is to iris”
Question Answering: “who was president of the USA in 2002?”

Tasks for LLMs

Winograd Schemas: The city council refused the demonstrators a permit because they [feared/advocated] violence.
Translation: “j’aime beaucoup la cuisine japonaise” in English is … (taking into account colloquialisms, idioms, slang, etc.), rather than the word-for-word “I like very much the cooking Japanese”.
Summarization: what does System.out.println("neil is an awesome instructor") do?

Naturalness

Source code is a language!!

Complete:

System.out.println( ...

Complete:

My cat has three ...

Useful terms

N-gram - N tokens in a group. Bigrams: “hot lunch”, “big fish”. Trigram: “Manchester United won”.
Token - an atomic “chunk” in the corpus. Could be a word; could be a source code token ({).
Stop words - words we don’t think will be important. Typically, “the, an, and, or” (but which ones?)

Useful Terms (2)

Stemming/lemmatization - shorten words to their essence. “Skillfully” -> skillful- / skill-
Language Model - for a vocabulary T of tokens, a probability distribution \(p(.)\) over sequences \(s\) of tokens in \(T^*\); that is, the model assigns a probability to each possible token sequence.
AST - Abstract Syntax Tree. A way of representing (encoding) programs using symbols and relationships.
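To make the AST idea concrete, here is a minimal sketch using Python's standard `ast` module. It parses a one-line Python program (an assumption for illustration; the snippets elsewhere in these slides are Java-like) and lists the node types in the resulting tree. The names `source`, `tree`, and `node_types` are illustrative only.

```python
import ast

# A tiny program: one assignment statement.
source = "theVar = 5"

# Parse the text into a tree of syntax nodes.
tree = ast.parse(source)

# Walk the tree and record the type of every node we meet.
node_types = [type(node).__name__ for node in ast.walk(tree)]

print(ast.dump(tree))
print(node_types)
```

The tree records that there is an assignment whose target is a name and whose value is a literal, which is structure that a flat token stream does not state explicitly.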

Useful Terms (3)

Distance: historically ML has had a hard time relating concepts that are far apart in the text. But this happens a lot in source code (var theVar; ... ; theVar = 5;). We need a way to represent these long-range relationships.
OOV: out of vocabulary, a word that is not in the vocabulary (dictionary) learned in the model’s training. E.g., a word that does not appear in Wikipedia, for the GloVe embedding. Such a word will have no representation (think about a self-driving car seeing a totally novel situation like a helicopter landing on the freeway).

Tokenizing

Convert text into numeric representations of the text. Remember that in training we will be using these tokens billions of times.

Try out https://simonwillison.net/2023/Jun/8/gpt-tokenizers/
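As a rough illustration of the idea, the sketch below splits text into tokens with a regular expression and maps each token to an integer id, reserving id 0 for out-of-vocabulary (OOV) tokens. This is a toy scheme under stated assumptions: the functions `tokenize`, `build_vocab`, and `encode` are hypothetical helpers, and production LLM tokenizers typically use subword schemes rather than whole words.

```python
import re

def tokenize(text):
    """Split text into identifiers, numbers, and single punctuation characters."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", text)

def build_vocab(corpus):
    """Give every distinct token an integer id; id 0 is reserved for unknowns."""
    vocab = {"<unk>": 0}
    for text in corpus:
        for tok in tokenize(text):
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(text, vocab):
    """Map text to a list of token ids, using 0 for out-of-vocabulary tokens."""
    return [vocab.get(tok, 0) for tok in tokenize(text)]

corpus = ["var theVar;", "theVar = 5;"]
vocab = build_vocab(corpus)

# "newVar" never appeared in the corpus, so it is encoded as the <unk> id.
ids = encode("theVar = newVar;", vocab)
print(ids)
```

Every unseen identifier collapses to the same id here, which is one reason subword tokenizers are attractive for source code, where new names appear constantly.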

Simple model: N-grams

Create a language model by effectively counting how often a particular set of n tokens occurs. For example (from the Hindle paper):

\(p(a_4|a_1a_2a_3) = \frac{count(a_1a_2a_3a_4)}{count(a_1a_2a_3*)}\)

Measure success by log perplexity, or cross-entropy:

\(H_{\mathcal{M}}(s) = -\frac{1}{n} \log p_{\mathcal{M}}(a_1 \ldots a_n)\)
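The counting model and the cross-entropy measure can be sketched in a few lines. This is a minimal illustration, not the implementation from the Hindle paper: it adds add-alpha smoothing (an assumption, so unseen n-grams do not get probability zero), uses base-2 logs, and averages over the tokens that have a full context. All function names are hypothetical.

```python
import math
from collections import Counter

def train_ngram(tokens, n):
    """Count every n-gram, and how often each (n-1)-token context starts one."""
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    contexts = Counter()
    for gram, count in ngrams.items():
        contexts[gram[:-1]] += count
    return ngrams, contexts

def prob(ngrams, contexts, context, token, vocab_size, alpha=1.0):
    """Add-alpha smoothed estimate of p(token | context)."""
    num = ngrams[tuple(context) + (token,)] + alpha
    den = contexts[tuple(context)] + alpha * vocab_size
    return num / den

def cross_entropy(tokens, ngrams, contexts, n, vocab_size):
    """Average negative log2-probability per predicted token (lower is better)."""
    total, steps = 0.0, 0
    for i in range(n - 1, len(tokens)):
        context = tokens[i - n + 1:i]
        total -= math.log2(prob(ngrams, contexts, context, tokens[i], vocab_size))
        steps += 1
    return total / steps

train = "for i in range ( n ) : for j in range ( n ) :".split()
vocab_size = len(set(train))
ngrams, contexts = train_ngram(train, 3)

natural = "for i in range ( n ) :".split()
shuffled = ": ) n ( range in i for".split()
h_natural = cross_entropy(natural, ngrams, contexts, 3, vocab_size)
h_shuffled = cross_entropy(shuffled, ngrams, contexts, 3, vocab_size)
print(h_natural, h_shuffled)
```

The familiar loop header gets a lower cross-entropy than the same tokens in scrambled order, which is the sense in which the model finds it more "natural".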

Encoding

After preprocessing, we need to encode the tokens (e.g., “f|unc|tion”) into a numeric representation.

Count embedding: each word is represented by its frequency
TF-IDF - weight each token by its term frequency (how often it occurs in a document) multiplied by its inverse document frequency (how rare it is across all documents)
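A small sketch of TF-IDF weighting, assuming the common formulation tf × log(N / df) with no smoothing (libraries often use slightly different variants). The function `tf_idf` and the toy documents are illustrative, and each document is assumed to be a non-empty list of tokens.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {token: weight} dict per document (each doc is a token list)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            tok: (count / len(doc)) * math.log(n_docs / doc_freq[tok])
            for tok, count in counts.items()
        })
    return weights

docs = [
    "public static void main".split(),
    "public void run".split(),
    "public int size".split(),
]
weights = tf_idf(docs)
print(weights[0])
```

Because "public" occurs in every document its weight is zero, while "main", which occurs in only one, gets the highest weight in the first document.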

Encoding

Word embedding: Each word is encoded into a vector (rank 1 tensor) of real numbers representing different dimensions (often 100, 200). E.g., function = [0.322, 0.113, 0.567,..].

Words that “mean” similar things (for our domain of interest) should be closer together on some distance metric.
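One common choice of distance metric is cosine similarity. The sketch below applies it to made-up three-dimensional vectors; the numbers are invented for illustration and do not come from any trained embedding.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings; real ones usually have hundreds of dimensions.
embeddings = {
    "function": [0.32, 0.11, 0.57],
    "method": [0.30, 0.15, 0.52],
    "banana": [0.90, -0.40, 0.05],
}

sim_close = cosine_similarity(embeddings["function"], embeddings["method"])
sim_far = cosine_similarity(embeddings["function"], embeddings["banana"])
print(sim_close, sim_far)
```

With these toy values, "function" is far more similar to "method" than to "banana", which is the behaviour we want from an embedding for a software domain.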

Reuse

Most of the time we can re-use embeddings
Remember how embeddings were trained.
Careful that the source data is close to our data (e.g., the train control system might not be represented by Wikipedia).

This works OK … but language is pretty complicated.

The word embedding model is restricted to looking a few tokens ahead or behind (the window/context).

That means learning more complex relationships (e.g., this phrase modifies a previous noun) is hard to do.

Break

Can we improve on these approaches?

Attention and Transformers

Another approach is to use deep learning via attention mechanisms in transformer models. This is how BERT, GPT-4, Gemini, etc. work.

Digression: supervision.

Fully supervised: humans provide a complete, labeled dataset.
Unsupervised: there is no label; the machine just tries to clump similar things together.
Weakly supervised: the human annotates a few important instances to bootstrap the machine.
Self-supervised: the machine manipulates the data to “hide” various pieces in order to train.
Masking is a self-supervised 🔡 approach.
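The masking idea can be sketched as a function that turns unlabeled text into training pairs with no human labeling. This is a simplified illustration: the function name `mask_tokens`, the `[MASK]` placeholder, and the masking rate are assumptions, and real training pipelines add further details.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Hide a random subset of tokens; return (masked input, {position: original})."""
    rng = random.Random(seed)
    masked = list(tokens)
    labels = {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked[i] = mask_token
            labels[i] = tok  # the "label" comes for free from the data itself
    return masked, labels

tokens = "Jim drove a car to the Drive-Thru line".split()
masked, labels = mask_tokens(tokens, mask_prob=0.5)
print(masked)
print(labels)
```

The model is trained to predict the entries of `labels` from `masked`, so the supervision signal is generated by the machine rather than by an annotator.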

Labeling/Supervising

Which of these is most effective, and which is most costly?

Solving PRs with SWE-bench

How do they filter? What is not in the dataset?
What counts as “resolved”? What do these terms mean?
Some issues cannot be fixed.
Memorization.

Transformers

A transformer is an ML architecture that encodes an input and decodes output:

Drilling in:

Attention

Earlier we discussed the problem of sliding windows. Attention is a way to make the model ‘remember’ what it saw before.

attention example from jalammar

Attention

Score word pairs based on relevance using something like our embedding example (kings/queens/man/woman).

Then use these scores to weight and combine the word vectors, for every word in the input, and repeat this many times (across multiple heads and layers).
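The scoring-and-combining step can be sketched as scaled dot-product attention in plain Python. This is a deliberately stripped-down illustration: real transformers first apply learned projection matrices to produce queries, keys, and values, and run many such heads in parallel. Here the toy embeddings are used directly, and all names are illustrative.

```python
import math

def softmax(xs):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Relevance of every key to this query, scaled by sqrt(dimension).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Output is the weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Self-attention: queries, keys, and values are all the same toy embeddings.
x = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = attention(x, x, x)
print(weights[0])
```

The first token puts more weight on the second token (a similar vector) than on the third, regardless of how far apart they are in the sequence, which is what lets attention sidestep the fixed-window limitation.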

Sutton’s bitter lesson:

general methods that leverage computation are ultimately the most effective, and by a large margin

Attention example

https://jalammar.github.io/illustrated-transformer/

Modern Pipelines

Pretraining: ingest vast amounts of textual representations.

Posttraining: condition the model to go beyond spitting out simplistic completions.¹

RLHF

RLHF (reinforcement learning from human feedback): Get low-paid gig workers to solve the knowledge acquisition bottleneck.

the policy is a language model that takes in a prompt and returns a sequence of text (or just probability distributions over text). The action space of this policy is all the tokens corresponding to the vocabulary of the language model and the observation space is the distribution of possible input token sequences, which is also quite large given previous uses of RL. The reward function is a combination of the preference model and a constraint on policy shift.¹
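The last sentence of the quote describes a reward with two parts. A minimal sketch, assuming the penalty is a weighted estimate of the KL divergence between the tuned policy and a frozen reference model, computed from per-token log-probabilities of the sampled output. The function name `rlhf_reward` and the weight `kl_weight` are hypothetical, and actual systems differ in the details.

```python
def rlhf_reward(preference_score, policy_logprobs, reference_logprobs, kl_weight=0.1):
    """Preference-model score minus a penalty for drifting from the reference model."""
    # Sum of log-probability differences on the sampled tokens: a sample-based
    # estimate of how far the policy has shifted from the reference.
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return preference_score - kl_weight * kl_estimate

# Policy identical to the reference: no penalty, reward equals the preference score.
unchanged = rlhf_reward(1.0, [-1.0, -2.0], [-1.0, -2.0])

# Policy is now much more confident in these tokens than the reference was.
shifted = rlhf_reward(1.0, [-0.5, -1.0], [-1.0, -2.0])
print(unchanged, shifted)
```

The penalty term discourages the policy from earning a high preference score by producing text the original language model would consider very unlikely.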

RLHF Process

The architecture of RLHF: prompts are fed to an LLM, and the outputs are scored by humans to create a preference model

source: https://huggingface.co/blog/rlhf

Human Future: Labeling Data for AI?

example job ad

Validation approaches

What constitutes a good test for LLM coding?

BLEU score: how many n-grams match between the gold set and the predicted set, with a brevity penalty for predictions that are shorter than the reference.
- example
Programming problems: Codex was evaluated on hand-written programming problems, comparable to simple interview questions of the kind found on LeetCode.
- see HumanEval
- pass@k (P@K): “a benchmark problem is considered solved if any one of k code samples passes every test case.”
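In practice pass@k is usually estimated by drawing n ≥ k samples per problem, counting the c samples that pass all tests, and computing 1 − C(n−c, k) / C(n, k). The sketch below implements that estimator for a single problem; the function name `pass_at_k` is illustrative, and `math.comb` requires Python 3.8 or later.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate pass@k from n samples, c of which passed every test case."""
    if n - c < k:
        # Fewer than k failing samples: any k drawn must include a passing one.
        return 1.0
    # 1 minus the probability that k samples drawn without replacement all fail.
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```

The benchmark score is the average of this value over all problems; larger k gives the model more attempts, so pass@k rises with k for the same n and c.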

Challenges

Test set pollution: we have to assume the training data includes benchmark problems and their solutions that are available online (e.g., on LeetCode).

Summary

LLMs are the evolution of decades of work in representing natural language.
There is sophisticated engineering behind tokenizing, word embedding, and the attention mechanisms.
Even more important has been the scale of training and inference, using vast datasets.
Making an LLM truly useful in SE requires work on training, including pre- and post-training activities.