Introduction to Data Science for/in/about SE

Neil Ernst

2026-04-12

Data Science

Analytical techniques can include:

  • Descriptive stats like mean/median, variance, histograms, scatterplots
  • Inferential stats like Bayesian inference, maximum likelihood, hypothesis testing
  • Unsupervised clustering of data
  • Predicting future values of data
  • Finding a function that approximates the generative model behind the data, e.g. with a neural network
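
As a rough illustration of the descriptive end of this list, the sketch below summarises a small sample using only Python's standard library. The `build_seconds` values are invented for illustration, and the 60-second bin width is an arbitrary choice.

```python
import statistics
from collections import Counter

# Hypothetical build durations in seconds (invented for illustration).
build_seconds = [62, 58, 75, 61, 240, 59, 64, 70]

mean = statistics.mean(build_seconds)          # pulled upward by the 240 s outlier
median = statistics.median(build_seconds)      # robust to that outlier
variance = statistics.variance(build_seconds)  # sample variance (n - 1 denominator)

# Crude text histogram: count builds per 60-second bin (bin width is arbitrary).
bins = Counter((s // 60) * 60 for s in build_seconds)
for start in sorted(bins):
    print(f"{start:>3}-{start + 59:<3} s | {'#' * bins[start]}")

print(f"mean={mean:.1f}  median={median:.1f}  variance={variance:.1f}")
```

A large gap between mean and median, as in this made-up sample, is a hint that the distribution deserves a look before any inferential step.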

Software Data

Like many areas, software development produces tons of data:

  • Productivity stats like lines of code per hour;
  • Communication patterns like developer code review histories;
  • Natural language text like issues and bug report discussions;
  • Code and code changes;
  • Tool logs like build logs;
  • Application-specific metrics, such as uptime or fault reports;
  • and many others.

Data Types

  • Quantitative
    • Nominal
    • Ordinal
    • Interval/Ratio
  • Symbolic
  • Qualitative

(Source: Menzies)
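
One practical consequence of the measurement scales listed above is that each supports different summaries. The sketch below is an illustration under assumptions, not part of the cited taxonomy: the bug-report fields, their values, and the severity ordering are all hypothetical.

```python
import statistics

# Hypothetical observations about five bug reports (invented for illustration).
component = ["ui", "db", "ui", "net", "ui"]                  # nominal: labels only
severity = ["low", "high", "medium", "critical", "medium"]   # ordinal: ordered labels
hours_to_fix = [2.0, 8.0, 3.5, 20.0, 4.0]                    # ratio: true zero

# Nominal: counting and the mode are meaningful; ordering and averaging are not.
most_common_component = statistics.mode(component)

# Ordinal: map labels to ranks, then take a median that is an observed level.
levels = ["low", "medium", "high", "critical"]  # assumed ordering
ranks = [levels.index(s) for s in severity]
median_severity = levels[statistics.median_low(ranks)]

# Interval/ratio: means and differences are meaningful.
mean_hours = statistics.mean(hours_to_fix)

print(most_common_component, median_severity, mean_hours)
```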

Nine step AI pipeline

Six steps of statistical modeling

  1. specification (create a model)
  2. identification (check if, given a new parameterization, your model’s predictions change)
  3. estimation (use the model to produce estimates)
  4. evaluation (check the model: does the estimate match reality?)
  5. respecification (redo the model or try other models)
  6. interpretation
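
The six steps above can be walked through on a toy problem. This is a minimal sketch, assuming a straight-line model and ordinary least squares as the estimator; the data and the names `predict`, `fit_line`, and `r_squared` are hypothetical and chosen only for illustration.

```python
# Hypothetical data: lines changed in a patch vs. minutes spent reviewing it.
lines_changed = [10, 20, 30, 40, 50]
review_minutes = [12, 19, 33, 41, 50]

# 1. Specification: assume minutes = intercept + slope * lines (plus noise).
def predict(intercept, slope, xs):
    return [intercept + slope * x for x in xs]

# 2. Identification: different parameter values must yield different predictions,
#    otherwise the data cannot tell the parameterizations apart.
assert predict(0.0, 1.0, lines_changed) != predict(0.0, 1.1, lines_changed)

# 3. Estimation: ordinary least squares for a single predictor.
def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

intercept, slope = fit_line(lines_changed, review_minutes)

# 4. Evaluation: how much of the variance do the predictions explain (R^2)?
def r_squared(ys, preds):
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

r2_line = r_squared(review_minutes, predict(intercept, slope, lines_changed))

# 5. Respecification: compare against a simpler intercept-only model.
mean_minutes = sum(review_minutes) / len(review_minutes)
r2_mean_only = r_squared(review_minutes, [mean_minutes] * len(review_minutes))

# 6. Interpretation: the slope reads as extra review minutes per changed line.
print(f"slope={slope:.2f} min/line, R^2 line={r2_line:.3f}, mean-only={r2_mean_only:.3f}")
```

With real data, step 4 would normally use held-out observations rather than the same points used for estimation.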

Model comparison and exploratory data analysis

When presented with data or a theory about how data is created, what should we do?

  • Explore the data with few preconceptions
    • look for the patterns
  • Problem: this might bias us if the patterns are just noise

  • Explore vs. confirm
    • Confirm: check whether the data support or reject a hypothesis
  • Hard to draw a line between the two (Hullman and Gelman, 2021)
  • Better intuition: explore means comparing data (typically visually) to a pseudo-statistical model (our prior).
  • Only then do we create a more rigorous statistical model and compare alternatives.
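
A small sketch of that two-stage workflow follows. It is one possible reading rather than a prescribed method: the team data are invented, the "prior" is the informal expectation that both teams behave alike, and a permutation test stands in for the more rigorous model.

```python
import random
import statistics

# Hypothetical review turnaround times in hours for two teams (invented data).
team_a = [5, 6, 7, 6, 5, 7]
team_b = [9, 10, 11, 10, 9, 12]

# Explore: compare the data to a rough expectation (our informal prior),
# here "both teams take about the same time". Normally this would be a plot.
for name, hours in (("A", team_a), ("B", team_b)):
    print(name, sorted(hours), "median:", statistics.median(hours))

# Confirm: a permutation test of the hypothesis "team labels do not matter".
def mean_diff(xs, ys):
    return statistics.mean(ys) - statistics.mean(xs)

observed = mean_diff(team_a, team_b)
rng = random.Random(0)  # fixed seed so the sketch is repeatable
pooled = team_a + team_b
n_shuffles = 2000
at_least_as_extreme = 0
for _ in range(n_shuffles):
    rng.shuffle(pooled)
    diff = mean_diff(pooled[:len(team_a)], pooled[len(team_a):])
    if abs(diff) >= abs(observed) - 1e-12:  # tolerance for float rounding
        at_least_as_extreme += 1

# Add-one correction keeps the estimated p-value away from exactly zero.
p_value = (at_least_as_extreme + 1) / (n_shuffles + 1)
print(f"observed difference = {observed:.2f} h, permutation p = {p_value:.4f}")
```

The exploratory step already encodes an expectation, which is why the line between exploring and confirming is hard to draw.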

Types of tools

  • Data miners, which tell us what is in the data and build a model: nearest neighbors, decision trees, deep learners
  • Optimizers, which tell us what to do, specifically which simple change has the biggest positive impact: genetic algorithms, heuristic search, etc.
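
To make the two tool families concrete, here is a minimal sketch with one example of each: a 1-nearest-neighbor classifier as the data miner and greedy hill climbing as a very simple heuristic search. The module metrics, labels, and the toy `score` objective are all hypothetical.

```python
import math

# Data miner: 1-nearest-neighbor classifier over hypothetical module metrics.
# Each row is ((lines_of_code, commits_last_month), label); all values are invented.
modules = [
    ((100, 2), "clean"),
    ((120, 3), "clean"),
    ((900, 40), "buggy"),
    ((850, 35), "buggy"),
]

def nearest_neighbor(rows, query):
    """Return the label of the training row closest to query (Euclidean distance)."""
    _, label = min(rows, key=lambda row: math.dist(row[0], query))
    return label

predicted = nearest_neighbor(modules, (800, 30))

# Optimizer: greedy hill climbing over an integer setting.
# The objective is a toy stand-in: a score that peaks at 7 parallel test workers.
def score(workers):
    return -(workers - 7) ** 2

def hill_climb(objective, start, low, high):
    current = start
    while True:
        neighbors = [n for n in (current - 1, current + 1) if low <= n <= high]
        best = max(neighbors, key=objective)
        if objective(best) <= objective(current):
            return current  # no neighbor improves the score
        current = best

best_workers = hill_climb(score, start=1, low=1, high=20)
print(predicted, best_workers)
```

In practice the features would be rescaled before computing distances, and hill climbing would need restarts or a population-based method to escape local optima.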

Ethics Implications

  • How did we get the data? Was there consent?
  • What is missing?
  • What assumptions are being made in the model?
  • Whose views are not included?
  • (more to come later in term)

Cross Tool Logs example

Paper here

  1. How to read the papers
  2. Methods used
  3. Types of Constructs
  4. Belief in Results - Practical Significance