With all of our data entered into the Google Drive sheet, we began readying it for analysis. Louie began class with an introduction to R – a programming language tailored for statistical analysis, data manipulation, and graphics. Everyone had R and RStudio (a streamlined interface for working with R) loaded on their computers, ready for the day’s instruction.
First, the basics were covered: background on R, RStudio’s interface, basic syntax, operators, common functions, plotting, and script-writing, among other topics. Notably, Louie emphasized the importance of using R Markdown to write our reports.
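The flavor of that first lesson can be sketched with a few lines of R (the object names here are just illustrative, not from the actual class script):

```r
# Assignment, vectors, and common functions
x <- c(2, 4, 6, 8)   # create a numeric vector with the assignment operator
mean(x)              # common functions: mean(), sd(), length(), summary()
x * 10               # operators are vectorized: every element is multiplied

# Data frames are the workhorse structure for datasets like ours
counts <- data.frame(trap = c("pitfall", "sticky"), n = c(18, 3))
# plot(counts$n) would draw a simple graphic in RStudio's Plots pane
```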
With few technical issues, everyone got RStudio running and began playing around with commands and script-writing.
At this point, Marshall took over to lead the class in refining our dataset. Marshall generously wrote a script that standardized much of the data in preparation for analysis. However, there were still many decisions the class had to make to ready the data.
Notably, one point of contention was how to classify our orders as predaceous or not. These decisions rendered Dermaptera predaceous and ants non-predaceous. Unfortunately, one category was hard to resolve: Hymenoptera. Most Hymenoptera had been identified to lower ranks (ant, bee, or wasp), but many entries lacked these specifications. Some collectors meant wasps when they wrote Hymenoptera, while others meant bees or ants. People were able to go back and edit the data they had entered with more specific descriptions. For the remaining unknown entries, however, we had to base our classifications on other characteristics. For example, we agreed it was safe to say that any unknown hymenopteran smaller than 6 mm found on a sticky trap was a wasp, while anything larger was a bee.
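That size rule is simple enough to express directly in R. This is only a sketch of the logic we agreed on (the function and column names are assumptions, not from the actual cleaning script):

```r
# Hypothetical helper applying the agreed rule: unknown Hymenoptera on
# sticky traps under 6 mm are called wasps, anything larger a bee; all
# other orders pass through unchanged.
classify_hymenoptera <- function(order, trap, size_mm) {
  ifelse(order == "Hymenoptera" & trap == "sticky" & size_mm < 6,
         "wasp",
         ifelse(order == "Hymenoptera" & trap == "sticky", "bee", order))
}

classify_hymenoptera("Hymenoptera", "sticky", 4)   # "wasp"
classify_hymenoptera("Hymenoptera", "sticky", 10)  # "bee"
```

Because `ifelse()` is vectorized, the same helper could be applied to a whole column of the dataset at once.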
Initially, Marshall’s script flagged a significant number of missing samples. Many were quickly resolved by catching errors in the recently entered Google Drive data (such as some rows missing the pitfall/sticky column). However, even after all of these resolutions (like recovering unknown treatments or dates), some errors remained. As we tried to figure out what went wrong, it became gravely clear that some of the missing data was not recoverable. Possible explanations include data entry errors (such as incorrectly dragging dates, replicates, or blocks), lack of data collection (if the samples never existed), and unentered data. There were also several occurrences of double counting or double data collection. These errors may cause over-representation of certain samples or incorrect association of treatments.
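One common way to flag missing samples like this (not Marshall’s actual script, just a sketch with made-up blocks, treatments, and traps) is to build the full grid of expected samples and compare it against what was entered:

```r
# Every block x treatment x trap combination we expect to have a sample for
expected <- expand.grid(block = 1:4,
                        treatment = c("A", "B"),
                        trap = c("pitfall", "sticky"),
                        stringsAsFactors = FALSE)

# Pretend two samples never made it into the spreadsheet
entered <- expected[-c(1, 7), ]

# Compare the two tables on a combined key: anything expected but not
# entered is a missing sample to track down
key <- function(d) paste(d$block, d$treatment, d$trap)
missing <- expected[!(key(expected) %in% key(entered)), ]
nrow(missing)  # 2 samples unaccounted for

# duplicated(key(entered)) would similarly flag double-entered samples
```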
At this point there are three missing sticky samples and eighteen missing pitfall samples. Hopefully these can be resolved, but we may have to do without them.
Moving forward, we should begin writing scripts for data analysis. Scripts are beautiful in that they are reproducible: we can work with the data we have now, and if we resolve any of these issues, all we have to do is change the input file for our scripts. On Thursday we will continue practicing R and digging deeper into our data.
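The reproducible-script idea can be sketched in a few lines: all of the analysis flows from one input path, so a corrected spreadsheet export only requires changing a single argument. The file and column names below are placeholders, not our actual export:

```r
# Everything downstream of read.csv() reruns unchanged when the input
# file is swapped for a corrected version.
run_analysis <- function(input_file) {
  dat <- read.csv(input_file)
  table(dat$trap)   # stand-in for the real summaries, models, and figures
}

# Demonstrate with a tiny throwaway file
demo <- tempfile(fileext = ".csv")
write.csv(data.frame(trap = c("pitfall", "pitfall", "sticky")),
          demo, row.names = FALSE)
run_analysis(demo)  # later: point this at the corrected export instead
```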
This is an exciting moment for the project! We are on the verge of peering into what was actually going on out in those fields. We are on the brink of understanding! Here’s to math!
1:40 – 4:30 – Further introduction to script-writing and analysis; discussion of potential data breakdowns