The results are in!
First, as always, feel free to sign up for our leaderboard if you’d like to help us out even more. If you’re just joining us, we’ve got the whole summer’s work archived for you to look through and get up to speed, or you can jump in right here.
This week, we’ll be doing some statistical analysis of the data to determine which effects are real (or “statistically significant”) and which ones we need more evidence on.
So, let’s get started!
When researchers conduct statistical analyses, they are trying to draw objective conclusions based solely on the data. Last week, we explored the data visually and looked for any patterns we might see. Humans naturally look for patterns in everything, though. That’s why people see clouds that look like everyday objects or find faces in burnt pieces of toast. By conducting statistical analysis, we can decide which of the patterns we saw last week have enough evidence for us to argue that they actually exist!
When a scientist calculates the stats for their data, they choose a set of “tests” to run, each of which is designed to look for a specific kind of difference. The exact type of analysis you use depends on the type of data you’re looking at. Whatever the type, statistical tests give at least two values: the test statistic and the p-value. The test statistic is specific to each test and hard to interpret on its own, so we lean on the second value, the p-value. The p-value gives the probability of seeing a difference at least as large as the one in our data if, in truth, there were no difference at all. In other words, a small p-value tells us our result would be surprising if nothing were really going on, which gives us confidence that we’d find a similar result if we repeated the experiment.
If the p-value is less than 0.05, we call the test statistically significant: the evidence supports the effect or difference being ‘real’ and likely to show up again if we repeated the experiment.
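To make the idea concrete, here’s a small sketch (not from our study!) using Python’s scipy library and a made-up coin-flipping example. The p-value is the chance of seeing a result at least this lopsided if the coin were actually fair:

```python
from scipy import stats

# Suppose a coin lands heads 8 times in 10 flips. The p-value is the
# probability of a result at least this lopsided if the coin were fair.
result = stats.binomtest(8, n=10, p=0.5)
print(round(result.pvalue, 3))  # 0.109 -- not below 0.05

# 9 heads in 10 flips would be much rarer under a fair coin:
result = stats.binomtest(9, n=10, p=0.5)
print(round(result.pvalue, 3))  # 0.021 -- below 0.05, "significant"
```

Eight heads could easily happen by chance (p > 0.05), so we wouldn’t call the coin unfair; nine heads would be surprising enough (p < 0.05) that we’d start to suspect something is going on.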
We’ve got a lot of good info for you this week, because we want to look at a lot of potential effects and tell you how we reached our conclusions! So, we’re going to divide things up a bit and let you jump around to different parts of the post as you see fit. Each section will start with an explanation on the type of test we used on the data and then give our results for the thing we were looking at, so feel free to only read the bits you want to. Don’t miss the final results for the guesses you made in Week 7, too! Those are in with their relevant categories.
Test of Difference of Means: T-Test
T-tests are for comparing the means of two groups. A key idea behind t-tests is that the mean value of a group in the data, say the average number of words remembered by men in our experiment (12.81), is probably not the exact value we would get if we tested the entire group, i.e. all men. However, this ‘true’ value we would get if we tested all men is likely close to 12.81. To account for this uncertainty, the t-test calculates a range, centered on the observed value (12.81), that should contain the ‘true’ value. The t-test then compares how much the ranges of possible ‘true’ values overlap between the different groups of participants, and based on the amount of overlap, it calculates a t-statistic and the related p-value. If the p-value is less than 0.05, a difference as large as the one we observed would be unlikely if the two groups were truly the same, so we conclude that the ‘true’ values for the two groups probably differ.
Statistics are reported in papers in different formats depending on the field. Our lab and field use the American Psychological Association (APA) format. In this format, statistics are presented inline in the following format: (t-value, p-value).
First, let’s see if there is a difference in performance between our two conditions.
We used a t-test to examine if there was a significant difference in the mean number of words remembered by participants told to repeat the words out loud and by participants told to repeat the words in their head (our main research question!). The test was not significant (t=0.72, p=0.48), indicating that there was not a significant difference in the ability of the two groups. Here’s what you all predicted in Week 7. Looks like half of you were right!
So, interestingly (and maybe unfortunately), we didn’t see a clear difference in performance between the two conditions. However, keep reading to see how that isn’t the full story!
Next, let’s see if there is a difference in performance between men and women.
We used a t-test again to examine if there was a significant difference in the mean number of words remembered by participants who identify as men and participants who identify as women. The test was not significant (t=0.71, p=0.48), indicating that there was not a significant difference in ability based on gender. It should be noted that one participant reported “Other” as their gender and one participant preferred not to state their gender. While both of these participants scored above average, we would need multiple participants in each group to draw group-level conclusions. Again, here’s what you thought was going to happen. You predicted a slight lean towards women performing better in the study, but were still pretty close overall!
These results indicate that there was no clear difference in performance based on gender.
Test of Relationship: Correlation/Linear Regression
Instead of looking at differences between groups, some tests examine whether there is a relation between two variables. For example, the analysis below looks at the relationship between the number of remembered words and the participant’s age. A correlation examines how one value increases or decreases as the second value increases or decreases. Maybe you’ve heard of ‘causation vs. correlation’ before? That’s what we’re talking about here; a correlation is just an observed relationship between two sets of values, not necessarily a statement on how one causes the other to happen! A correlation between the number of words remembered and age is asking “as someone’s age increases, does the number of words they remember increase?”

A correlation produces an r-value (similar to a t-test producing a t-value) which gives the strength of the relation, where a value further from zero indicates a stronger relation. The range of r-values runs from -1 to +1, where -1 indicates that as one variable increases, the other decreases, and +1 indicates that as one variable increases, the other also increases. For example, as a tree ages, it grows taller. A correlation examining the relation between a tree’s age and its height would have a high positive r-value because as one variable increases (the age of the tree), the second variable (the height of the tree) almost always increases too.
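The tree example above can be sketched directly in scipy. The ages and heights here are invented numbers chosen to show a strong positive relationship:

```python
from scipy import stats

# Tree age (years) vs. height (metres) -- invented numbers that
# echo the tree example: older trees are almost always taller.
age    = [1, 3, 5, 8, 12, 15, 20, 30]
height = [0.5, 2.0, 4.5, 7.0, 10.0, 12.5, 16.0, 22.0]

# pearsonr gives the r-value (strength/direction of the relation)
# along with a p-value.
r, p_value = stats.pearsonr(age, height)
print(f"r = {r:.2f}")  # close to +1: a strong positive relation
```

An r-value near +1 here reflects that as one variable (age) goes up, the other (height) reliably goes up with it.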
To determine a p-value for a correlation, we can use a technique known as Linear Regression. This technique tries to create a straight line that comes as close to the actual data as possible. To help understand what that means, check out the plot below to see where that line falls compared to the other data. To determine the p-value, we can examine the difference between the predicted line and the actual data.
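A linear regression can also be sketched in a few lines of scipy. The ages and scores below are hypothetical, not our participants’ data; the point is just to show the slope, r-value, and p-value that the fitted line gives us:

```python
from scipy import stats

# Hypothetical participant ages and recall scores (not the real data).
ages   = [18, 22, 25, 30, 35, 41, 47, 52, 60]
scores = [11, 12, 11, 13, 13, 14, 15, 14, 16]

# linregress fits the straight line that comes as close to the data
# as possible, and reports its slope, the correlation (r), and a
# p-value for the relationship.
fit = stats.linregress(ages, scores)
print(f"slope = {fit.slope:.2f} words per year of age")
print(f"r = {fit.rvalue:.2f}, p = {fit.pvalue:.4f}")
```

The slope is the regression’s practical payoff: it estimates how many extra words are remembered per additional year of age, which is exactly the kind of number we report for our real data below.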
So, how much would a participant’s age relate to their performance in the study?
A correlation comparing a participant’s age and the mean number of words they remembered was statistically significant (r=0.40, p=0.02), suggesting that there was a moderately strong relation between age and performance! Linear regression suggested that the number of words a participant could remember increased by 0.18 for each year older. Here are the average predictions you made about age for this study: the lower the score on the graph, the higher you ranked that group in your predictions (i.e. the better you expected them to do). Turns out you were wrong on this one! You gave the youngest participants the lowest score (putting them towards number 1 most often in your rankings), but older people actually did better!
As you can see in the plot below, while there is not a clear pattern where older participants almost always score well and younger participants almost always score poorly, there is a clear trend where higher scores tend to fall in the higher age range.
More Complex Tests: Analysis of Variance (ANOVA)
The tests we discussed before, the correlation and the t-test, are like the hammer and screwdriver of the scientist’s toolbox: it’s hard to complete a project without using at least one of them. If the t-test is the screwdriver of the toolbox, the next test we’ll discuss (ANalysis Of VAriance: ANOVA) is the electric drill. While a t-test is limited to only two groups, ANOVAs allow for comparisons between many different groups, and many different types of groups. The logic is the same as the t-test, though: take the mean of each group and, based on how much participants vary from one to the next, build a range of possible values that should contain the value we might get if we tested every single person possible. Then compare the ranges for each group to decide whether the groups are actually different from one another.
We have two ANOVAs to look at, which can also help explain how ANOVAs are used. First, we want to compare the mean number of words remembered for each of our list categories (animals, objects, and fruits/vegetables) to see if people did better on some lists compared to others. We have three groups though, so we can’t test all of them at once using a t-test. However, an ANOVA can give us a test statistic (an F-value) and a p-value telling us whether there is a difference between any of the three groups.
Our second ANOVA adds a second layer to the question and shows the real strength of ANOVA. We want to know whether the mean number of words remembered differed based on the list category AND whether a participant was told to repeat the words in their head or out loud. We’re still comparing the three groups we looked at in the first ANOVA, which compared participants within-subjects: it compared a participant against themselves, i.e. how many words they remembered for each category. In our second ANOVA, though, we are also comparing between-subjects, by splitting participants based on whether they were in the “in your head” or “out loud” condition. This ANOVA produces an F-value and p-value telling us whether the list-category differences themselves differed between the two conditions, what statisticians call an “interaction”.
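The first, simpler kind of ANOVA can be sketched with scipy. The recall counts below are invented to mimic our three list categories, with one category made deliberately harder:

```python
from scipy import stats

# Hypothetical per-participant recall counts for three list
# categories (invented numbers, not the study's data; "objects"
# is set up to be harder on purpose).
animals = [16, 15, 17, 14, 16, 15]
fruits  = [15, 16, 16, 15, 17, 14]
objects = [11, 12, 10, 13, 11, 12]

# A one-way ANOVA asks: is there a difference anywhere among
# these three group means?
f_stat, p_value = stats.f_oneway(animals, fruits, objects)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```

Note that scipy’s `f_oneway` only covers the one-way case; an interaction ANOVA like our second one is usually fit with a dedicated statistics package instead.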
That might sound a bit confusing, so let’s actually use ANOVA for these questions and see how it works in action.
First, was there a difference based on list category?
An ANOVA examining the effect of list category was statistically significant (F=17.15, p=0.0002), indicating that participants remembered more words for some lists compared to others. We can use t-tests to examine which specific groups were different. There was not a statistically significant difference between the number of animal words remembered and the number of fruit/vegetable words remembered (t=0.03, p=0.97). There was a significant difference between the number of objects remembered and the number of fruits/vegetables remembered (t=2.80, p=0.007), and between the number of objects remembered and the number of animals remembered (t=2.68, p=0.009). The results indicate that household objects were significantly harder to remember than animals or fruits/vegetables.
Second, did it matter whether participants repeated the words out loud or in their heads?
An ANOVA examining the interaction of list category and task condition (out loud/in your head) was not significant (F=2.16, p=0.15). While the test approached significance (a relatively low p-value), there was not enough evidence to conclude that there was a difference.
*Mean number of words remembered per list category for the In Your Head condition: 11.41, 15.88, 15.29.*
Last one! Fisher’s Exact Test for Count Data
We have one more analysis to look at, but it’s pretty straightforward. We want to know if whether the participants were told to repeat the words in their heads or out loud impacted whether or not they used a strategy. For this we can use a Fisher’s Exact Test, which will examine the ratio of Yes responses to No responses to the question “Did you use a strategy to help you remember?” for the two conditions. The test was not significant (p=0.16), but there seemed to be a clear trend in the data. We might need to run a follow-up study to explore this relationship further.
*Table: number of participants reporting each strategy choice (Yes/No) in the In Your Head and Out Loud conditions.*
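For the curious, a Fisher’s Exact Test takes a small table of counts like the one above. The counts in this sketch are made up, since they only exist to show the shape of the test:

```python
from scipy import stats

# Hypothetical counts of strategy use by condition (made up for
# illustration -- not the study's actual table).
#                 Used strategy   No strategy
# In Your Head         10              5
# Out Loud              6              9
table = [[10, 5],
         [6, 9]]

# fisher_exact compares the Yes/No ratios of the two rows.
odds_ratio, p_value = stats.fisher_exact(table)
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.3f}")
```

An odds ratio above 1 would mean the first row’s group reported strategies more often; the p-value tells us whether that imbalance is bigger than chance alone would produce.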
So, what does it all mean?
Now that we’ve done the analyses and have some objective measures of what happened, it’s time to draw some final conclusions. Were the results what you expected? We had two results with p-values around .15, which isn’t low enough for us to conclude they’re real but is low enough that it might be worth exploring more.
In the end, it looks like age was a statistically significant factor, as was the category of words that participants had to remember. But neither gender nor task condition was significant (though the interaction between task condition and list category approached significance). Strategy choice showed a clear pattern but was not statistically significant, and it should be investigated further in a follow-up study to see what we can make of it!
If you’ve read through all of this, well done! If this is your first time looking at data and stats, it may have been a little overwhelming. Don’t worry though! Science is a skill that takes time and practice, and even just learning a bit about how it all works is a big accomplishment.
Remember that, even though using statistical tests like these gives us a far more objective look at our data, this is only the beginning of an even larger process. We were only testing for our pretty specific research question this time around, but by trying to answer that question we’ve run into other cool things we might want to learn more about too! What kinds of studies might be good ways to continue the work we’ve done here so far?
In the comments section below, tell us about what you think the big takeaway message is from our results! What did we learn about internal language in our study?
Next week, we’ll be making some final conclusions about the study and looking back on everything we did this summer!
**Although we moderate every comment before it gets posted, please remember to be kind to others and mindful of your personal information before you post here!**