Introduction to Data Science for/in/about SE

Neil Ernst

2026-04-12

Data Science

Analytical techniques can include:

Descriptive stats like mean/median, variance, histograms, scatterplots
Inferential stats like Bayesian inference, maximum likelihood, hypothesis testing
Unsupervised clustering of data
Predicting future values of data
Finding a function that successfully captures the generative model, e.g. with a neural network
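As a minimal sketch of the descriptive end of this list, the snippet below computes a mean, median, variance, and a text histogram using only the Python standard library. The `review_hours` values and the `text_histogram` helper are made up for illustration; they are not from any real dataset.

```python
import statistics

# Hypothetical data: review turnaround times in hours for ten code changes.
review_hours = [2, 3, 3, 4, 5, 5, 5, 8, 13, 40]

mean_hours = statistics.mean(review_hours)
median_hours = statistics.median(review_hours)
variance_hours = statistics.variance(review_hours)  # sample variance


def text_histogram(values, bin_width):
    """Count values per bin; each key is the lower edge of its bin."""
    counts = {}
    for value in values:
        lower = (value // bin_width) * bin_width
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))


bin_width = 5
histogram = text_histogram(review_hours, bin_width)

print(f"mean={mean_hours} median={median_hours} variance={variance_hours:.1f}")
for lower, count in histogram.items():
    print(f"{lower:>3}-{lower + bin_width - 1:<3} {'#' * count}")
```

The single 40-hour review pulls the mean well above the median, which is the kind of skew a histogram makes visible at a glance.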

Software Data

Like many areas, software development produces tons of data:

Productivity stats like lines of code per hour;
Communication patterns like developer code review histories;
Natural language text like issues and bug report discussions;
Code and code changes;
Tool logs like build logs;
Application-specific metrics, such as uptime or fault reports;
and many others.

Data Types

Quantitative
- Nominal
- Ordinal
- Interval/Ratio
Symbolic
Qualitative

(Source: Menzies)
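One way to read the quantitative scales is by which summary makes sense for each. The sketch below uses hypothetical issue records (the field names and values are assumptions) and picks a mode for a nominal field, a median for an ordinal field, and a mean for a ratio field.

```python
import statistics

# Hypothetical issue records: (component, severity, hours_to_fix).
issues = [
    ("ui", "low", 1.5),
    ("db", "high", 12.0),
    ("ui", "medium", 4.0),
    ("api", "high", 9.0),
    ("ui", "low", 2.0),
]

# Nominal: categories with no order, so counting and the mode are meaningful.
components = [component for component, _, _ in issues]
most_common_component = statistics.mode(components)

# Ordinal: categories with an order but no fixed distance between them,
# so a median over ranks is meaningful while a mean is not.
severity_order = ["low", "medium", "high"]
ranks = [severity_order.index(severity) for _, severity, _ in issues]
median_severity = severity_order[statistics.median_low(ranks)]

# Ratio: a true zero and equal intervals, so means and ratios are meaningful.
hours = [h for _, _, h in issues]
mean_hours = statistics.mean(hours)

print(most_common_component, median_severity, mean_hours)
```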

Nine step AI pipeline

Six steps of statistical modeling

specification (create a model)
identification (check if, given a new parameterization, your model’s predictions change)
estimation (use the model to produce estimates)
evaluation (check the model; does the estimate match reality)
respecification (redo the model or try other models)
interpretation
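The six steps can be walked through on a toy problem. The sketch below assumes a straight-line model relating changed lines to review hours; the data, the `predict`, `fit`, and `sse` helpers, and the choice of ordinary least squares are all illustrative assumptions, not a prescribed method.

```python
# Hypothetical data: (changed lines, review hours) for eight code changes.
data = [(10, 1.0), (20, 1.8), (40, 3.1), (80, 6.2),
        (120, 8.9), (160, 12.1), (200, 15.2), (240, 17.8)]
xs = [x for x, _ in data]
ys = [y for _, y in data]


# 1. Specification: review_hours = intercept + slope * changed_lines.
def predict(params, x):
    intercept, slope = params
    return intercept + slope * x


# 2. Identification: a different parameterization should change predictions.
identifiable = predict((0.0, 0.1), 100) != predict((0.0, 0.2), 100)


# 3. Estimation: ordinary least squares in closed form.
def fit(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return (mean_y - slope * mean_x, slope)


params = fit(xs, ys)


# 4. Evaluation: how far are the estimates from the observed values?
def sse(predictions, observed):
    return sum((p - o) ** 2 for p, o in zip(predictions, observed))


linear_sse = sse([predict(params, x) for x in xs], ys)

# 5. Respecification: compare against a simpler, intercept-only model.
mean_y = sum(ys) / len(ys)
baseline_sse = sse([mean_y] * len(ys), ys)
r_squared = 1 - linear_sse / baseline_sse

# 6. Interpretation: the slope is the estimated extra review time per changed line.
print(f"intercept={params[0]:.2f} slope={params[1]:.3f} R^2={r_squared:.3f}")
```

Evaluating on the same data used for fitting, as done here for brevity, flatters the model; a held-out set would be a more honest check.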

Model comparison and exploratory data analysis

When presented with data or a theory about how data is created, what should we do?

Explore the data with few preconceptions
- look for the patterns
Problem: this might bias us if the patterns are just noise

Explore vs. confirm
- Confirm: verify whether the data support or reject a hypothesis
Hard to draw a line (Hullman and Gelman, 2021)
Better intuition: explore means comparing data (typically visually) to a pseudo-statistical model (our prior).
Only then do we create a more rigorous statistical model and compare alternatives.
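To guard against a pattern that is just noise, one possible confirmatory check is a permutation test. This technique is not named in the slides and is used here only as an illustration. The sketch below assumes two hypothetical groups of review times and asks how often a random relabelling of the data produces a difference in means at least as large as the one observed.

```python
from itertools import combinations

# Hypothetical review times (hours) for two teams; exploration suggested A is slower.
team_a = [5, 6, 7, 8, 9, 10]
team_b = [1, 2, 3, 4, 5, 6]


def mean(values):
    return sum(values) / len(values)


observed = abs(mean(team_a) - mean(team_b))

# Enumerate every way of relabelling the pooled data into two groups of the
# same sizes, and count how often the gap is at least as large as observed.
pooled = team_a + team_b
indices = range(len(pooled))
extreme = 0
total = 0
for chosen in combinations(indices, len(team_a)):
    chosen_set = set(chosen)
    group_1 = [pooled[i] for i in indices if i in chosen_set]
    group_2 = [pooled[i] for i in indices if i not in chosen_set]
    if abs(mean(group_1) - mean(group_2)) >= observed - 1e-9:
        extreme += 1
    total += 1

p_value = extreme / total
print(f"observed gap={observed:.2f} hours, p={p_value:.4f}")
```

A small p-value says the gap is unlikely under random relabelling; it does not say the hypothesis was chosen independently of the data, which is the bias the slide warns about.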

Types of tools

Data miners, that tell us what is in the data and build a model: nearest neighbors, decision trees, deep learners
Optimizers, that tell us what to do, specifically, how to do something simple that has the biggest positive impact: genetic algorithms, heuristic search, etc.
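The two kinds of tool can be contrasted in a few lines. The sketch below pairs a 1-nearest-neighbour classifier (a data miner) with a simple hill climber (a heuristic-search optimizer). The commit history, labels, and cost function are hypothetical, chosen only to keep the example small.

```python
# Data miner: 1-nearest-neighbour over hypothetical past commits,
# described as (lines_changed, files_touched) with a known label.
history = [
    ((10, 1), "clean"),
    ((15, 2), "clean"),
    ((300, 12), "buggy"),
    ((250, 9), "buggy"),
]


def nearest_label(point, examples):
    """Return the label of the example closest to point (squared Euclidean)."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(examples, key=lambda example: sq_dist(point, example[0]))[1]


prediction = nearest_label((280, 10), history)


# Optimizer: hill climbing over an integer setting, e.g. a batch size.
def hill_climb(cost, start, low, high):
    """Move to the better neighbour until no neighbour improves the cost."""
    current = start
    while True:
        neighbours = [n for n in (current - 1, current + 1) if low <= n <= high]
        if not neighbours:
            return current
        best = min(neighbours, key=cost)
        if cost(best) >= cost(current):
            return current
        current = best


def hypothetical_cost(batch_size):
    # Assumed cost curve with a single minimum at batch_size == 7.
    return (batch_size - 7) ** 2 + 3


best_size = hill_climb(hypothetical_cost, start=1, low=1, high=20)

print(prediction, best_size)
```

The miner answers "what does this commit look like?", while the optimizer answers "which setting should we pick?". Hill climbing can stall in a local minimum on bumpier cost curves, which is one reason genetic algorithms and other searches are also used.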

Ethics Implications

how did we get the data? was there consent?
what is missing?
what assumptions are being made in the model?
whose views are not included?
(more to come later in term)

Cross Tool Logs example

Paper here

How to read the papers
Methods used
Types of Constructs
Belief in Results - Practical Significance

The reading this week looks at work by Google on understanding developer productivity (one dimension of it anyway).

The first thing to think about is a meta-analysis of the reading, and how to read academic research papers. Note how the paper is structured. This paper was published in a magazine, a more approachable format than others (journals or conferences for example).

The paper has a nice explanation of the motivation, and then gives a few short details on how it was done (why so scanty?).

The second aspect to think about is the methods used to produce insights. Were they all based on data mining? Show the facts on the whiteboard.

The third aspect is the type of constructs the paper uses. One key construct here is the notion of readability. Do you agree with their definition?

The fourth aspect is to reflect on the results obtained. Do you believe the results? What might you do as a manager with these insights? What might be missing from the results?