2025-07-29
What are the data sources?
One of the biggest problems - and best places to spend money - is getting good data.
The most obvious example of this is building bigger telescopes or high energy physics instruments.
What are we trying to accomplish? We would like some confidence that our experiment will reveal some true effect in the world.
The traditional frequentist (STAT260) approach is null-hypothesis significance testing (NHST) with p-values.
We run a test that calculates the probability of seeing the observed difference, or a more extreme one, under the assumption that there is no real difference between the groups. That probability is the p-value.
We want a test that finds such a difference if it exists (otherwise we get a false negative, a type 2 error) and does not find one if it does not exist (otherwise we get a false positive, a type 1 error).
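The two error types can be made concrete by simulating many experiments, a minimal sketch using only the Python standard library. The group size, effect size, number of runs, and the function names are assumptions chosen for illustration, and the p-value uses a normal approximation rather than an exact t distribution.

```python
import random
from statistics import NormalDist, mean, variance

def approx_p_value(a, b):
    """Two-sided p-value for a difference in means (Welch statistic, normal approximation)."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def rejection_rate(true_diff, n=50, alpha=0.05, runs=2000, seed=1):
    """Fraction of simulated experiments in which the null is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(runs):
        control = [rng.gauss(0.0, 1.0) for _ in range(n)]
        treated = [rng.gauss(true_diff, 1.0) for _ in range(n)]
        if approx_p_value(control, treated) < alpha:
            rejections += 1
    return rejections / runs

type1_rate = rejection_rate(true_diff=0.0)  # no real effect: every rejection is a false positive
power = rejection_rate(true_diff=0.8)       # real effect: every rejection is a correct detection
type2_rate = 1 - power                      # missed detections are false negatives
print(f"type 1 rate ~ {type1_rate:.3f}, power ~ {power:.3f}, type 2 rate ~ {type2_rate:.3f}")
```

When there is no real difference, the rejection rate should sit close to the chosen threshold of 0.05. When the difference is real, the rejection rate is the power of the test, and whatever is left over is the type 2 error rate.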
NB: A non-significant p-value does not mean that the null hypothesis is true.
Let’s get people to fix bugs with AI and without AI. Which group will fix bugs faster? Our null hypothesis might be that AI makes no difference. Our alternative might be that it does.
We can do some statistical testing to see which hypothesis might best explain the data. In the conventional framework, we would assume the null is true, and see if it continues to explain the numbers we see from doing the experiment (in this case, how fast bugs are fixed).
Load the two sample files into R, and run a t-test to evaluate the hypothesis that AI makes developers faster. Make sure to print out the descriptive stats first.
If the data seem very unlikely under the null model, e.g., the AI users are nearly always slower, then we can reject the null.
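The idea of "assume the null, then ask how unlikely the data are" can be sketched directly with a permutation test. This is a different technique from the t-test named in the exercise, and it is written in Python rather than R. The fix times below are invented for illustration and are not the course sample files; the function names are also hypothetical.

```python
import random
from statistics import mean, stdev

# Hypothetical bug-fix times in minutes; illustrative numbers, not real measurements.
with_ai = [42, 35, 51, 38, 44, 29, 47, 40, 36, 33]
without_ai = [55, 48, 62, 45, 58, 50, 66, 53, 49, 60]

def describe(label, xs):
    """Print descriptive statistics before running any test."""
    print(f"{label}: n={len(xs)} mean={mean(xs):.1f} sd={stdev(xs):.1f} min={min(xs)} max={max(xs)}")

def permutation_p_value(a, b, shuffles=5000, seed=7):
    """Two-sided p-value: how often shuffled group labels give a mean gap at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(shuffles):
        rng.shuffle(pooled)  # under the null, group labels are interchangeable
        gap = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if gap >= observed:
            hits += 1
    return (hits + 1) / (shuffles + 1)  # add-one correction so the estimate is never exactly 0

describe("with AI", with_ai)
describe("without AI", without_ai)
p_value = permutation_p_value(with_ai, without_ai)
print(f"observed gap = {mean(without_ai) - mean(with_ai):.1f} min, p ~ {p_value:.4f}")
```

Shuffling the labels simulates a world where AI use has no bearing on fix time. If almost no shuffle produces a gap as large as the observed one, the data are unlikely under the null. A permutation test also avoids assuming normally distributed times, which may matter because task durations are often skewed.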
We will need to define expected effect size, power of detecting that effect, and the threshold \(\alpha\) at which we reject the null.
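These three quantities together determine how many participants are needed. A rough sketch, assuming a two-sided comparison of two equal-sized groups and the normal approximation (an exact t-based calculation asks for slightly more). The function name and default values are illustrative assumptions.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided, two-sample comparison of means.

    effect_size is the standardized difference (Cohen's d). Uses the normal
    approximation, which slightly underestimates what a t-test needs.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the rejection threshold
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {n_per_group(d)} participants per group")
```

Halving the expected effect size roughly quadruples the required sample, which is why a small expected effect makes a developer study expensive to run.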
Note: statistical significance is not the same as practical significance. A study might find a statistically significant effect that has no practical importance. Can you think of examples?
from Section 2.3 of my paper:
“effect size ignores the context of decision making. A raw number reflecting (for example) the standardized difference of means is hard for practitioners to interpret and must be contextualized. Contextual, subjective judgment of observed effect sizes must be made and a ritualized interpretation avoided.”
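A small sketch of why a standardized effect size needs context: the same Cohen's d can describe a gap of minutes or a gap of days. The fix times are invented for illustration, and the second scenario is simply the first one rescaled.

```python
from statistics import mean, variance

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5

# Hypothetical fix times in minutes: (without AI, with AI).
quick_fixes_min = ([12, 15, 11, 14, 13], [10, 13, 9, 12, 11])
# Same numbers read as days instead of minutes (1 day = 1440 minutes).
slow_fixes_min = tuple([x * 1440 for x in grp] for grp in quick_fixes_min)

for label, (without_ai, with_ai) in [("quick fixes", quick_fixes_min), ("slow fixes", slow_fixes_min)]:
    raw_gap = mean(without_ai) - mean(with_ai)
    print(f"{label}: d = {cohens_d(without_ai, with_ai):.2f}, raw gap = {raw_gap:.0f} minutes")
```

Both scenarios report the same d, yet saving two minutes per bug and saving two days per bug lead to very different decisions. The standardized number alone does not tell a practitioner which case they are in.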
Visit here:
What is the “secret”?
The histories of even simple bugs are strongly dependent on social, organizational, and technical knowledge that cannot be solely extracted through automation of electronic repositories
number of people involved
Design a sampling strategy for the following question:
and

Data Science for SE • Neil Ernst (c) 2025