Introduction to Data Science for/in/about SE

Neil Ernst

2025-07-29

Intro

nernst@uvic.ca

Assoc Professor in Computer Science

Acting BSEng Director

ECS 560

Syllabus

Updates? New topics?

Project in 6 weeks. Plan accordingly!

AI - General Philosophy

What’s the essential skill that students need to be learning, such that when they get out of school they are more capable humans than when they went in?

Answer: resilience

the answer … isn’t simply a body of knowledge to be memorized or a set of skills to be mastered. It’s a stance. A stance from which to address the world and all its challenges. A stance built on self-confidence and resilience: the conviction that one has a fighting chance to overcome or circumvent whatever obstacles the world throws in one’s path. The way you acquire it is by trying, and sometimes failing, to do difficult things. It can be discouraging, but if you have good mentors, and if you’re collaborating with friends who are in the same boat, you can find ways to succeed, and develop a knack for it. That’s true self-reliance.

– Neal Stephenson [via newsletter](https://substack.com/app-link/post?publication_id=2694786&post_id=166816228]

Exercise: Icebreaker

Overview

Short Break

Data Science

Analytical techniques can include:

  • Descriptive stats like mean/median, variance, histograms, scatterplots
  • Inferential stats like Bayesian inference, maximum likelihood, hypothesis testing
  • Unsupervised clustering of data
  • Predicting future values of data
  • Finding a function that successfully captures the generative model e.g. with a neural network

Software Data

Like many areas, software development produces tons of data:

  • Productivity stats like lines of code per hour;
  • Communication patterns like developer code review histories;
  • Natural language text like issues and bug report dicussions
  • Code and code changes;
  • Tool logs like build logs;
  • Application specific metrics, such as uptime or fault reports.
  • many others

Data Types

  • Quantitative
    • Nominal
    • Ordinal
    • Interval/Ratio
  • Symbolic
  • Qualitative

(Source: Menzies)

Nine step AI pipeline

six steps of statistical modeling

  1. specification (create a model)
  2. identification (check if, given a new parameterization, your model’s predictions change)
  3. estimation (use the model to produce estimates)
  4. evaluation (check the model; does the estimate match reality)
  5. respecification (redo the model or try other models)
  6. interpretation

Model comparison and exploratory data analysis

When presented with data or a theory about how data is created, what should we do?

  • Explore the data with few preconceptions
    • look for the patterns
  • Problem: this might bias us if the patterns are just noise

  • Explore vs. confirm
    • Confirm: verify data support/reject hypothesis
  • Hard to draw a line (Hullman and Gelman, 2021)
  • Better intuition: explore means comparing data (typically visually) to a pseudo-statistical model (our prior).
  • Only then do we create a more rigorous statistical model and compare alternatives.

Types of tools

  • Data miners, that tell us what is in the data and build a model: nearest neighbors, decision trees, deep learners
  • Optimizers, that tell us what to do, specifically, how to do something simple that has the biggest positive impact: genetic algorithms, heuristic search, etc.

Ethics Implications

  • how did we get the data? was there consent?
  • what is missing?
  • what assumptions are being made in the model?
  • whose views are not included?
  • (more to come later in term)

Cross Tool Logs example

  1. How to read the papers
  2. Methods used
  3. Types of Constructs
  4. Belief in Results - Practical Significance

Longish Break

Activity: Setting Up R, RStudio, AI tooling

We will now spend some time getting the basic tooling for this class installed.