Data Analysis Walkthrough

Overview

Data analysis is about extracting meaningful information from your dataset using an appropriate tool. Displaying the data in graphical form can be a simple way to extract something meaningful. However, there is often a great deal of stochasticity (randomness, or noise) in ecological systems, so ecologists often employ tools whose specialty is separating meaning from noise - statistics!


The Ecoplexity website contains various tools for exploring, analyzing, and graphing your datasets. A good way to begin is to generate summary statistics and basic plots for each dataset. These help you identify potential trends and anomalies in the data and determine whether parametric or non-parametric statistics will be most appropriate. You can then run the formal analyses.


Tips

  • The best time to determine what type of data analysis to use is when you are planning your study. This way you can be sure to gather enough appropriate data, and avoid gathering unnecessary data.

  • One thing to identify is the type of data each of your variables contains. Common types are:

    1) Count data (e.g., number of individuals),

    2) Categorical data (e.g., forest vs. edge vs. meadow habitat), and

    3) Continuous data (e.g., inches of rain).


     


    Note that you may want to treat data as a different type than they would be considered in another circumstance. For example, dates might be categorical for the purposes of comparing the mean temperature of a stream during different times of the year but continuous for determining if there is a correlation between day of the year and mean stream temperature.


     


  • Once you collect data it can be helpful to get to know the structure of the dataset before doing formal statistical tests.

     

    For continuous data, the most common approaches are to:

     


    1) Examine summary statistics and


    2) Generate graphs of the distribution of each variable.


    Summary statistics (such as the mean and range) are especially useful for getting an initial sense of how samples from two groups (levels of a categorical variable) compare. They can also help you spot erroneous values in your dataset. Because many statistical tests require data to be "normally distributed" (to follow a bell curve), graphs such as histograms, box plots, and Q-Q (quantile-quantile) plots are very useful for deciding which test to perform.
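As a rough illustration of the summary-statistics step, the sketch below computes a few statistics per group using only the Python standard library. It is not the Ecoplexity tool itself. The group names, the values, and the helper names summarize and summarize_by_group are all made up for this example.

```python
import statistics

# Hypothetical stacked data as (group, value) pairs. The numbers are invented
# for illustration; 41.0 stands in for a possible data-entry error.
records = [
    ("forest", 12.1), ("forest", 11.8), ("forest", 12.6), ("forest", 13.0),
    ("meadow", 15.2), ("meadow", 14.7), ("meadow", 16.1), ("meadow", 41.0),
]


def summarize(values):
    """Return basic summary statistics for a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "sd": statistics.stdev(values),  # sample standard deviation
    }


def summarize_by_group(pairs):
    """Split (group, value) pairs by group and summarize each group."""
    groups = {}
    for group, value in pairs:
        groups.setdefault(group, []).append(value)
    return {group: summarize(values) for group, values in groups.items()}


summary = summarize_by_group(records)
for group, stats in summary.items():
    print(group, stats)
```

A mean that sits far from the median, or a maximum far above the other values, is a hint that a value deserves a second look before any formal test is run.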

Dataset Preparation

Regardless of how you arranged your data when you collected them, you must make sure they follow the appropriate format for use with the Ecoplexity data analysis tools. Species composition datasets are slightly different from standard datasets.


 


Standard Dataset Structure

  • All of the data you would like graphed or analyzed together should be in a single column.

  • Each column in the dataset should contain a separate variable (e.g., humidity, site code, soil temperature).

  • The first row should contain names for the variables.

  • The column headings should not have spaces between words.

  • All additional rows should contain data entries or NA (blank cells will be substituted with NA).

  • There should not be any empty rows or columns in the dataset.

  • For variables that are numbers, there should be no extraneous characters (e.g., <1).

  • If there are groups within a variable, a separate factor variable should be used to designate groups (see example below).

  • Groups can be designated with numbers or words.

  • Text used in variable names and grouping variables should not have spaces between words (e.g., use Soil.temp.celc).
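Several of the rules above can be checked mechanically before uploading a file. The following is a minimal sketch, assuming the dataset has been exported as CSV text. The function names check_standard_dataset and is_number_or_na, the numeric_columns parameter, and the example values are hypothetical and are not part of the Ecoplexity tools.

```python
import csv
import io


def is_number_or_na(cell):
    """True if the cell is blank, NA, or parses as a number."""
    cell = cell.strip()
    if cell in ("", "NA"):  # blank cells are treated as NA
        return True
    try:
        float(cell)
        return True
    except ValueError:  # e.g. "<1" contains an extraneous character
        return False


def check_standard_dataset(csv_text, numeric_columns):
    """Return a list of problems found in a CSV-formatted standard dataset."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ["dataset is empty"]
    header, data = rows[0], rows[1:]
    problems = []
    for name in header:
        if not name.strip():
            problems.append("column without a name")
        elif " " in name.strip():
            problems.append(f"column name contains a space: {name!r}")
    # Data rows start on line 2 of the file, after the header row.
    for line_no, row in enumerate(data, start=2):
        if all(not cell.strip() for cell in row):
            problems.append(f"row {line_no} is empty")
            continue
        if len(row) != len(header):
            problems.append(f"row {line_no} has {len(row)} cells, expected {len(header)}")
            continue
        for name, cell in zip(header, row):
            if name in numeric_columns and not is_number_or_na(cell):
                problems.append(f"row {line_no}, {name}: non-numeric value {cell!r}")
    return problems


# Made-up example with two deliberate mistakes: a space in a column name
# and an extraneous character in a numeric column.
example = "Soil temp,Visit\n12.5,1\n<1,2\n,3\n"
problems = check_standard_dataset(example, numeric_columns={"Soil temp"})
for problem in problems:
    print(problem)
```

An empty list means none of the checked rules were violated. The sketch does not cover every rule in the list (for example, it does not look for empty columns).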

Example Standard Dataset








To the right is a dataset that could be created in the Excel template. It describes soil temperatures at six sites, measured during each of three months.


To prepare the dataset for a test combining the three groups (e.g., a comparison of means with date as the factor) or to make a single graph containing all of the data (e.g., a bar graph), place all of the temperature data in a single column, and place values designating the group (here, the visit) in a separate column (as shown at right).


In this case, the column containing data for the main variable would be 1 and the column containing grouping data would be 2. The numbers 1 and 2 would be placed in the corresponding fields in the tool.


 

Above: Stacked data, with a factor variable (Visit) to designate groups.



Left: Unstacked data, with data from each group next to each other.
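The conversion from unstacked to stacked data can also be scripted. Below is a minimal sketch under assumed inputs: the temperatures are invented, only three rows per visit are shown rather than six sites, and the function name stack is hypothetical. The column names Soil.temp.celc and Visit follow the naming used in this walkthrough.

```python
# Hypothetical unstacked table: one column per visit, one row per site.
# None marks a missing measurement.
unstacked = {
    "Visit1": [10.5, 11.0, 9.8],
    "Visit2": [14.2, None, 13.9],
    "Visit3": [18.0, 17.4, 18.8],
}


def stack(columns, value_name, group_name):
    """Return rows of [value, group], preceded by a header row."""
    rows = [[value_name, group_name]]
    for group, values in columns.items():
        for value in values:
            # Missing values are written as NA, as the dataset rules require.
            rows.append(["NA" if value is None else value, group])
    return rows


stacked = stack(unstacked, "Soil.temp.celc", "Visit")
for row in stacked:
    print(row)
```

In the result, the measurements are in column 1 and the grouping variable is in column 2, which matches the column numbers described above.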

Species Composition Dataset Structure

Datasets containing counts of species (or of taxa at another level) at one or more sites differ slightly in structure from standard datasets. The differences are as follows:



  • The top-left cell should be left blank (rather than saying "Species" or something similar).

  • The first row should contain the names of sites, starting in the second column.

  • The first column should contain the names of species.

  • Every cell within the matrix should contain a value of zero or greater (missing values are not allowed).
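The species composition rules can be checked in the same spirit. This is a sketch only, assuming the matrix is available as CSV text. The function name read_species_matrix, the species names, the site names, and the counts are invented for illustration.

```python
import csv
import io


def read_species_matrix(csv_text):
    """Parse a species-by-site matrix and enforce the structure rules.

    Returns (sites, counts), where counts[species][site] is a number.
    Raises ValueError if a rule is violated.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("dataset is empty")
    header, body = rows[0], rows[1:]
    if header[0].strip() != "":
        raise ValueError("the top-left cell should be blank")
    sites = [name.strip() for name in header[1:]]
    counts = {}
    for row in body:
        if not row:
            raise ValueError("empty rows are not allowed")
        species, cells = row[0].strip(), row[1:]
        if len(cells) != len(sites):
            raise ValueError(f"{species}: expected {len(sites)} values, got {len(cells)}")
        values = []
        for cell in cells:
            try:
                value = float(cell)
            except ValueError:
                raise ValueError(f"{species}: missing or non-numeric value {cell!r}") from None
            # "not value >= 0" also rejects NaN, which compares False to everything.
            if not value >= 0:
                raise ValueError(f"{species}: value must be zero or greater, got {cell!r}")
            values.append(value)
        counts[species] = dict(zip(sites, values))
    return sites, counts


# Made-up pitfall-trap counts for three taxa at two sites.
example_matrix = ",SiteA,SiteB\nAnt,12,3\nBeetle,0,7\nSpider,4,4\n"
sites, counts = read_species_matrix(example_matrix)
print(sites)
print(counts["Beetle"])
```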

Example Species Composition Dataset

The following is a dataset that could be created in Excel. These specific data describe arthropod species collected in pitfall traps at several sites.



 

 

 

Data Exploration

There are three tools for exploring data:



  • Community Analysis helps you determine how diverse the species composition is at each site and how similar it is among sites.

  • Summary Statistics are useful for assessing the distribution and composition of your dataset.

  • The Graphing Tool allows you to plot continuous variables by group or against other continuous variables and thus present or better interpret your data.
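To make "diverse" and "similar" concrete, the sketch below computes one diversity index and one dissimilarity index. This walkthrough does not say which indices the Community Analysis tool reports, so the Shannon index and Bray-Curtis dissimilarity are assumptions chosen because they are widely used. The counts and function names are hypothetical.

```python
import math


def shannon_diversity(counts):
    """Shannon index H = -sum(p * ln p) over species with non-zero counts."""
    total = sum(counts)
    if total == 0:
        raise ValueError("site has no individuals")
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def bray_curtis_dissimilarity(site_a, site_b):
    """Bray-Curtis dissimilarity: 0 for identical counts, 1 for no shared species."""
    total = sum(site_a) + sum(site_b)
    if total == 0:
        raise ValueError("both sites have no individuals")
    return sum(abs(a - b) for a, b in zip(site_a, site_b)) / total


# Made-up counts for the same three species, in the same order, at two sites.
site_a = [12, 0, 4]
site_b = [3, 7, 4]

print("Shannon H, site A:", shannon_diversity(site_a))
print("Shannon H, site B:", shannon_diversity(site_b))
print("Bray-Curtis dissimilarity:", bray_curtis_dissimilarity(site_a, site_b))
```

A higher Shannon value indicates more species, more even abundances, or both. Bray-Curtis is a dissimilarity, so smaller values mean more similar sites.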

Data Analyses

By now you may have a good idea of what patterns are present in your dataset, but are those patterns different from what a random selection of values could give you?



  • The statistical tool for Comparing Means will answer this question if you are interested in differences in the central tendency of two or more groups.

  • Correlation analysis will answer this question if you are interested in how strongly pairs of variables change together.
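The question of whether a pattern differs from a random selection of values can be illustrated directly. The sketch below uses a permutation test for a difference in means and a Pearson correlation coefficient. These are stand-ins chosen for illustration: this walkthrough does not state which tests the Comparing Means and Correlation tools run. All data values and function names are hypothetical.

```python
import math
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test_means(group_a, group_b, n_permutations=5000, seed=0):
    """Estimate a two-sided p-value for the difference between two group means.

    Group labels are shuffled repeatedly. The p-value is the share of shuffles
    whose difference in means is at least as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # small tolerance for rounding error
            at_least_as_extreme += 1
    # Adding 1 to both counts keeps the estimate above zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    covariance = sum((a - mx) * (b - my) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mx) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - my) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Made-up stream temperatures for two seasons (date treated as categorical).
spring = [8.1, 7.9, 8.4, 8.0, 8.3]
summer = [12.2, 11.8, 12.5, 12.0, 12.4]
p_value = permutation_test_means(spring, summer)
print("p-value for difference in means:", p_value)

# Made-up day-of-year and temperature pairs (date treated as continuous).
day_of_year = [100, 130, 160, 190, 220]
temperature = [8.0, 9.5, 11.1, 12.4, 14.0]
r = pearson_r(day_of_year, temperature)
print("Pearson r:", r)
```

A small p-value means that few random relabelings produce a difference as large as the one observed. A value of r near 1 or -1 means the two variables change together strongly.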