Introduction to Data Science for/in/about SE

Neil Ernst

2025-07-29

Intro

nernst@uvic.ca

Assoc Professor in Computer Science

Acting BSEng Director

ECS 560

Syllabus

Updates? New topics?

Project in 6 weeks. Plan accordingly!

AI - General Philosophy

What’s the essential skill that students need to be learning, such that when they get out of school they are more capable humans than when they went in?

Answer: resilience

the answer … isn’t simply a body of knowledge to be memorized or a set of skills to be mastered. It’s a stance. A stance from which to address the world and all its challenges. A stance built on self-confidence and resilience: the conviction that one has a fighting chance to overcome or circumvent whatever obstacles the world throws in one’s path. The way you acquire it is by trying, and sometimes failing, to do difficult things. It can be discouraging, but if you have good mentors, and if you’re collaborating with friends who are in the same boat, you can find ways to succeed, and develop a knack for it. That’s true self-reliance.

– Neal Stephenson [via newsletter](https://substack.com/app-link/post?publication_id=2694786&post_id=166816228]

Exercise: Icebreaker

Overview

Short Break

Data Science

Analytical techniques can include:

Descriptive stats like mean/median, variance, histograms, scatterplots
Inferential stats like Bayesian inference, maximum likelihood, hypothesis testing
Unsupervised clustering of data
Predicting future values of data
Finding a function that successfully captures the generative model e.g. with a neural network

Software Data

Like many areas, software development produces tons of data:

Productivity stats like lines of code per hour;
Communication patterns like developer code review histories;
Natural language text like issues and bug report dicussions
Code and code changes;
Tool logs like build logs;
Application specific metrics, such as uptime or fault reports.
many others

Data Types

Quantitative
- Nominal
- Ordinal
- Interval/Ratio
Symbolic
Qualitative

(Source: Menzies)

Nine step AI pipeline

six steps of statistical modeling

specification (create a model)
identification (check if, given a new parameterization, your model’s predictions change)
estimation (use the model to produce estimates)
evaluation (check the model; does the estimate match reality)
respecification (redo the model or try other models)
interpretation

Model comparison and exploratory data analysis

When presented with data or a theory about how data is created, what should we do?

Explore the data with few preconceptions
- look for the patterns
Problem: this might bias us if the patterns are just noise

Explore vs. confirm
- Confirm: verify data support/reject hypothesis
Hard to draw a line (Hullman and Gelman, 2021)
Better intuition: explore means comparing data (typically visually) to a pseudo-statistical model (our prior).
Only then do we create a more rigorous statistical model and compare alternatives.

Types of tools

Data miners, that tell us what is in the data and build a model: nearest neighbors, decision trees, deep learners
Optimizers, that tell us what to do, specifically, how to do something simple that has the biggest positive impact: genetic algorithms, heuristic search, etc.

Ethics Implications

how did we get the data? was there consent?
what is missing?
what assumptions are being made in the model?
whose views are not included?
(more to come later in term)