## Significant results are just the beginning.

Congratulations, your experiment produced significant results. You can be fairly confident (at the 95% level) that the independent variable influenced your dependent variable. All you have to do now is write your discussion and submit your findings to a scientific journal. Right…?

Achieving significant results is a big accomplishment in itself, but it doesn't tell the whole story behind your findings. I want to take this moment to address statistical significance, sample size, power, and effect size, all of which have a tremendous impact on how we interpret our results.

First, let's discuss statistical significance, as it is the foundation of inferential statistics. We will discuss its meaning in the context of a true experiment, as that is the most relevant and easiest to understand. A true experiment is used to test a specific hypothesis about the causal relationship between two or more variables. Specifically, we hypothesize that one or more variables (i.e., independent variables) produce a change in another variable (i.e., the dependent variable). It is from this change that we infer causality. *If you want to learn more about the different types of research designs, visit my article (**SHORTCUT**).*

For example, let's test the hypothesis that an authoritative teaching style leads to better student test scores than an authoritarian one. To test this hypothesis properly, we randomly select two groups of students and randomly assign them to one of two classrooms. One classroom is taught by an authoritative teacher and the other by an authoritarian teacher. Throughout the semester, we collect every test score from both classrooms. At the end of the year, we average the results to get an overall mean for each classroom. Let's say the average test score in the authoritarian classroom was 80% and the average in the authoritative classroom was 88%. It looks like our hypothesis was correct: students taught by the authoritative teacher scored, on average, 8 points higher on their tests than students taught by the authoritarian teacher. But what if we ran this experiment 100 times, each time with different groups of students? Would we get similar results? What is the probability that this effect of teaching style on student grades is due to chance, or to some latent (i.e., unmeasured) variable? Finally, is 88% "different enough" from 80%?
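The comparison above is a classic two-sample setting. Here is a minimal sketch of how we might test it in Python, assuming SciPy is available; the score lists are made-up numbers chosen only so that the group means roughly match the 80% and 88% averages in the example.

```python
from scipy import stats

# Hypothetical test scores (percentages) for the two classrooms.
# These numbers are invented for illustration; their means are 80 and 88,
# matching the example above.
authoritarian = [78, 82, 75, 84, 80, 79, 83, 77, 81, 81]
authoritative = [90, 85, 88, 92, 86, 89, 87, 91, 84, 88]

# Independent two-sample t-test: are the two classroom means different?
t_stat, p_value = stats.ttest_ind(authoritative, authoritarian)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

A p-value at or below 0.05 here would lead us to reject the null hypothesis of no difference between the classrooms.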

**Null hypothesis**: the hypothesis we assume to be true by default, which states that there is no significant difference between the groups. In our teaching-style example, the null hypothesis predicts no difference in students' test scores based on teaching style.

**Alternative (research) hypothesis**: our original hypothesis, which predicts that the authoritative teaching style will produce higher average student test scores.

## Now that the premises are set, let's define what a p-value is and what it means for your results to be significant.

*The p-value is the probability of obtaining results at least as extreme as ours if the null hypothesis were true.* Getting a significant result simply means that the p-value from your test statistic was equal to or less than your alpha (the significance threshold), which is 0.05 in most cases.

An alpha of 0.05 is a common standard used in many areas of research.

A significant p-value (that is, one at or below 0.05) indicates that there is less than a 5% probability of seeing a difference this large if the null hypothesis were true. If that's the case, we reject the null hypothesis, accept the alternative hypothesis, and conclude that the students' test scores differ significantly from each other. Note that we did not say that the different teaching styles *caused* the significant difference in test scores. The p-value only tells us whether the groups differ; we have to make the inferential leap and conclude that the teaching styles influenced the groups differently.

Another way to look at a significant p-value: if we ran this experiment 100 times and the null hypothesis were actually true, we could still expect to see a difference this large about 5 times purely by chance.

If we set our alpha to 0.01, the resulting p-value would have to be equal to or less than 0.01 (i.e., 1%) for our results to be considered significant. This imposes a stricter criterion: a significant result would now mean there is less than a 1% probability of seeing such a difference by chance alone.
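The decision rule itself is trivial to express in code. This small sketch (with a hypothetical p-value of 0.03) shows how the same result can be significant under one threshold and not under a stricter one:

```python
def significant(p_value, alpha=0.05):
    """Return True if p_value meets the chosen significance threshold."""
    return p_value <= alpha

p = 0.03  # hypothetical p-value from some test statistic
print(significant(p, alpha=0.05))  # True: significant at the conventional 0.05 level
print(significant(p, alpha=0.01))  # False: not significant under the stricter 0.01 level
```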

The sample size, or number of participants in your study, has a big impact on whether your results reach significance. The greater the actual difference between the groups (i.e., in students' test scores), the smaller the sample size needed to find a significant difference (i.e., p ≤ 0.05). In theory, a significant difference can be found in almost any experiment given a sufficiently large sample size; in practice, however, very large samples make studies expensive and difficult to run.

**Type I error (α)**, or a false positive, is the probability of concluding that the groups are significantly different when in fact they are not. With an alpha of 0.05, we are willing to accept a 5% chance of incorrectly rejecting the null hypothesis.

**Type II error (β)**, or a false negative, is the probability of concluding that the groups are not significantly different when in fact they are. We can reduce the likelihood of making a Type II error by making sure our statistical test has sufficient power.

**Power** is defined as 1 − β (one minus the probability of a Type II error). In other words, it is the probability of detecting a difference between groups when the difference actually exists (i.e., the probability of correctly rejecting the null hypothesis). Therefore, when we increase the power of a statistical test, we increase its ability to detect a significant difference (i.e., p ≤ 0.05) between groups.

It is generally accepted that we should aim for a power of 0.8 or greater.

With a power of 0.8, we have an 80% chance of finding a statistically significant difference when one truly exists. However, we still have a 20% chance of missing a real difference between the groups.

If you recall our teaching-style example, we found a significant difference between the two classrooms: the average test score in the authoritarian classroom was 80% and in the authoritative classroom it was 88%. Effect size attempts to answer the question, "Are these differences large enough to be meaningful, even though they are statistically significant?"

Effect size appeals to the concept of the "minimal important difference," which states that at some point a significant difference (i.e., p ≤ 0.05) is so small that it is not useful in the real world. Thus, effect size seeks to determine whether the 8-point increase in student grades between the authoritarian and authoritative classrooms is large enough to be considered meaningful. *Remember that by "small" we don't mean a small p-value.*

Another way to look at effect size is as a quantitative measure of how much the independent variable (IV) influenced the dependent variable (DV). A large effect size indicates a very meaningful result, as the manipulation of the IV produced a large change in the DV.

For comparing two means, effect size is usually expressed as Cohen's d. Cohen described a small effect as d = 0.2, a medium effect as d = 0.5, and a large effect as d = 0.8.
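Cohen's d is simply the difference between the group means divided by the pooled standard deviation. A minimal sketch, using hypothetical standard deviations and group sizes for the 88% vs. 80% example (an SD of 10 in each classroom of 30 students is an assumption for illustration):

```python
import math

def cohens_d(mean1, mean2, sd1, sd2, n1, n2):
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (mean1 - mean2) / pooled_sd

# Hypothetical SDs and group sizes for the teaching-style example.
d = cohens_d(88, 80, 10, 10, 30, 30)
print(f"d = {d:.2f}")  # 0.80: a large effect by Cohen's benchmarks
```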

Small p-values (0.05 and below) do not necessarily indicate large or important effects, nor do larger p-values (above 0.05) imply unimportant or small effects. With a large enough sample size, even tiny effect sizes can produce significant p-values. In other words, statistical significance tells us how likely our results are due to chance, while effect size tells us how meaningful our results are.
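To make the distinction concrete, here is a sketch (assuming SciPy, with invented numbers) of a trivially small effect that is nonetheless highly significant because the sample is huge:

```python
from scipy import stats

# A tiny difference with a very large sample (hypothetical numbers):
# Cohen's d = (80.5 - 80.0) / 5.0 = 0.1, half of Cohen's "small" benchmark.
t, p = stats.ttest_ind_from_stats(80.5, 5.0, 5000, 80.0, 5.0, 5000)
print(f"p = {p:.2g}")  # significant, yet the effect is negligible in practice
```

The test flags a "real" difference, but half a percentage point on an exam is unlikely to matter to anyone.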

We can calculate the minimum sample size needed for our experiment to achieve a specific statistical power at a given effect size. This calculation must be done before actually running the experiment.

Power analysis is a critical process that you should perform during the design phase of your study. It gives you a good idea of the number of participants needed in each experimental group (including the control group) to find a significant difference, *if one exists*.

G*Power is an excellent free program that allows you to quickly calculate the required sample size from your power and effect size parameters.

1. Choose the **"Test family"** appropriate for your analysis

- We choose t tests

2. Select the **"Statistical test"** used for your analysis

- We will use Means: Difference between two independent means (two groups)

3. Select the **"Type of power analysis"**

- We'll select "a priori" to determine the sample size needed for the power and effect size you want to achieve.

4. Select the number of **tails**

- Use a one-tailed test when you want to detect a significant difference between groups in only one direction
- We choose a two-tailed test

5. Select the desired effect size, or **"Effect size d"**

- We'll run the calculation for a range of effect sizes.

6. Choose the **"α err prob"**, or alpha: the probability of rejecting the null hypothesis when it is actually true.

- We use 0.05

7. Select the **"Power (1−β err prob)"** you want to achieve

- We choose 0.8 (80% power) and 0.9 (90% power)

8. Choose the **"Allocation ratio N2/N1"**

- If you expect the same number of participants in each group (treatment and control), choose 1. If you expect twice as many in one group as in the other, choose 2.
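If you prefer to stay in code, the same a priori calculation can be approximated in Python. This sketch uses the standard normal-approximation formula for a two-tailed, two-sample comparison with equal group sizes; it is an approximation, so G*Power's exact t-based answer will be slightly larger:

```python
import math
from scipy.stats import norm

def sample_size_per_group(d, alpha=0.05, power=0.80):
    """Approximate n per group for a two-tailed two-sample t-test
    (normal approximation: n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)

# Cohen's small, medium, and large benchmarks at 80% power:
for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {sample_size_per_group(d)} per group")
```

Note how quickly the required n grows as the effect size shrinks.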

In general, large effect sizes require smaller sample sizes because they are "obvious" enough for the analysis to find. As the effect size decreases, we need larger samples, since smaller effects are harder to detect. This works in our favor: the larger the effect size, the more meaningful our results and the fewer participants we need to recruit.

Finally, G*Power reports the sample size needed for each group of participants. For example, for an experiment with one IV with 4 groups/levels and one DV, where you want to find a large effect size (0.8+) with 80% power, you would need a sample size of 52 participants per group, or 208 in total.