Pretty plots with predictions
I get a lot of questions about the plots in my conference talks, and I’ve been promissing a post about them, so here’s a first shot. I love plotting, and have recently gotten especially into
ggplot2 and some of it’s many options and add-ons. I’ll also include some stats here, to show how to plot the results of of linear mixed model.
A hypothetical hypothesis
Let’s say that I want to know: How does DJL (day journey length, see the intro of this post if you want more info about what this means) change over the course of a female orangutan’s development?
Instead of using age to look at changes over time, I’ll use “stages”: 1) dependence (when she’s still with her mom), 2) independence (when she’s on her own), 3) adolescence (when she’s sexually active), 4) pregnancy (orangutans have an 8-month long gestation period), and 5) adult (when she has her first offspring). And, let’s say that I already know that FAI (fruit availability index) influences DJL: when FAI is higher, orangutans travel farther, so I want to be sure to control for FAI in my statistical analysis. Also, we’ll pretend that these data come from 4 different individuals (focals), and that we’re not interested specifically in differences between individuals, but we do want to control for that.
Click here to download some fake data that I’ll use in this blogpost. (Note: This is not real data. I made it all up. Any patterns/correlations in these data are not necessarily representative of what we actually observe among wild orangutans.)
Set up workspace…
library(tidyverse) #for plotting, etc (includes ggplot2) library(cowplot) #for plotting theme - note that just loading the cowplot library #will hijack the default ggplot theme of your workspace to make it cowplot library(ggbeeswarm) #for different plotting point styles library(nlme) #for linear mixed modelling library(viridis) # for nice colours load("the-right-directory-on-your-computer/DJL data.Rdata")
As always, we should begin by exploring these data and getting to know them a little better. Remember, our main question is: How does DJL change over the course of a female orangutan’s development (i.e. over the stages)? and we suspect that we’ll probably need to control for FAI.
#some trusty base R functions for checking out our data... #structure str(df) #> 'data.frame': 688 obs. of 5 variables: #> $ FollowNumber: int 1 4 5 10 16 17 20 21 23 24 ... #> $ Focal : Factor w/ 4 levels "Luaqlas","Ivoni",..: 1 1 1 1 1 1 1 1 1 1 ... #> $ DJL : num 1033 857 923 1046 974 ... #> $ FAI : num 2.11 1.91 2.31 1.21 4.3 ... #> $ Stage : Factor w/ 5 levels "DEP","INDEP",..: 2 2 2 2 2 2 2 2 2 2 ... #summary summary(df) #> FollowNumber Focal DJL FAI #> Min. : 1.0 Luaqlas:183 Min. : 490.5 Min. : 0.169 #> 1st Qu.: 492.5 Ivoni :226 1st Qu.: 807.0 1st Qu.: 2.844 #> Median : 993.5 Bella : 51 Median : 903.4 Median : 4.236 #> Mean : 990.1 Pi :228 Mean : 907.9 Mean : 4.742 #> 3rd Qu.:1491.5 3rd Qu.:1003.4 3rd Qu.: 6.399 #> Max. :2000.0 Max. :1356.2 Max. :15.186 #> Stage #> DEP :103 #> INDEP:182 #> ADOL :159 #> PREG : 75 #> ADULT:169 #> #some tidyverse functions that can help... #put the stages in the right order (using a forcats package function) df$Stage <- fct_relevel(df$Stage, "DEP", "INDEP", "ADOL", "PREG", "ADULT") #get DJL means per Focal df %>% group_by(Focal) %>% summarize(meanDJL = mean(DJL)) #> # A tibble: 4 x 2 #> Focal meanDJL #> <fct> <dbl> #> 1 Luaqlas 889. #> 2 Ivoni 933. #> 3 Bella 844. #> 4 Pi 913. #get DJL means per Stage df %>% group_by(Stage) %>% summarize(meanDJL = mean(DJL)) #> # A tibble: 5 x 2 #> Stage meanDJL #> <fct> <dbl> #> 1 DEP 924. #> 2 INDEP 868. #> 3 ADOL 985. #> 4 PREG 871. #> 5 ADULT 886. #get DJL means per Focal per Stage df %>% group_by(Focal, Stage) %>% summarize(meanDJL = mean(DJL)) %>% spread(Stage, meanDJL) #spread it into wide format so it's easier to read #> # A tibble: 4 x 6 #> # Groups: Focal  #> Focal DEP INDEP ADOL PREG ADULT #> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 Luaqlas NA 887. 1009. 903. 849. #> 2 Ivoni 912. 858. 986. 832. 980. #> 3 Bella 892. 822. NA NA NA #> 4 Pi 931. 905. 964. 851. 867. #make some super basic plots... (remember that since we have loaded the cowplot library, #all plots will, by default, have the cowplot theme) #start by setting the cowplot theme font size for this workspace theme_set(theme_cowplot(font_size=10)) ggplot(df, aes(DJL)) + geom_histogram(binwidth = 100)
ggplot(df, aes(x = Stage, y = DJL)) + geom_boxplot()
ggplot(df, aes(x = Stage, y = DJL, color = Focal)) + geom_boxplot()
ggplot(df, aes(x = FAI, y = DJL, color = Focal)) + geom_point()
An LMM for these DJLs
Because my outcome variable (DJL) is almost normally distributed, and I want to include a random effect (Focal), I will use a linear mixed model. My fixed effects will be
FAI. I will include
Focal as a random effect because it is a level of replication in the data: the data within each level of
Focal are less independent than between the levels of
Focal. Personally, for LMMs, I like to use the
lme() function in the
nlme package, mostly just because it’s what I’m the most familiar with.
Stage is a temporally ordered factor, and I’m interested in changes over time, I’m going to use polynomial contrasts for the phase variable. This way, instead of comparing the first stage (
DEP) to each other stage, my model will look for a linear, quadratic, cubic, and quartic relationship along the stages, basically treating the stages like a numeric variable from 1 to 5…
#set the contrasts contrasts(df$Stage) <- contr.poly(5) #the number is the total number of factor levels #run the model model <- lme(DJL ~ Stage + FAI, random = ~1|Focal, data = df, method = "ML") #take a look at the model summary summary(model) #> Linear mixed-effects model fit by maximum likelihood #> Data: df #> AIC BIC logLik #> 8638.118 8674.388 -4311.059 #> #> Random effects: #> Formula: ~1 | Focal #> (Intercept) Residual #> StdDev: 15.28533 126.9349 #> #> Fixed effects: DJL ~ Stage + FAI #> Value Std.Error DF t-value p-value #> (Intercept) 830.2283 13.675354 679 60.70982 0.0000 #> Stage.L -9.2698 12.881832 679 -0.71960 0.4720 #> Stage.Q -8.2029 11.604820 679 -0.70685 0.4799 #> Stage.C 12.7765 13.023497 679 0.98104 0.3269 #> Stage^4 84.7546 11.377245 679 7.44948 0.0000 #> FAI 14.6878 1.986168 679 7.39503 0.0000 #> Correlation: #> (Intr) Stag.L Stag.Q Stag.C Stag^4 #> Stage.L -0.092 #> Stage.Q -0.100 -0.226 #> Stage.C -0.338 0.008 0.068 #> Stage^4 -0.021 -0.169 0.088 0.248 #> FAI -0.715 0.153 0.138 0.289 -0.028 #> #> Standardized Within-Group Residuals: #> Min Q1 Med Q3 Max #> -3.1128369 -0.7067194 -0.0227755 0.6815645 2.9390223 #> #> Number of Observations: 688 #> Number of Groups: 4 #take a look at the residuals plot(model)
So, our alpha value is 0.05, and therefore our model summary tells us that there is a significant QUARTIC relationship between DJL and Stage (if more than one polynomial contrast is significant, you always choose the HIGHEST ORDER contrast). It also tells us that there is significant relationship between DJL and FAI. The model plot (standardized residuals against fitted values) shows a pretty nice cloud, with points randomly scattered around the plotting area - LOVELY.
Pretty Plot Plotting
Now, I’ll start by plotting these data - just the raw data. I’ll do a transparent boxplot, overtop of the actual data points, coloured by individual. I’ll use
ggplot2 and two other ggplot add-on packages,
Note that just loading the library
cowplot will hijack the theme of all ggplots made in your R workspace. You can easily override this by adding any
theme_ ggplot argument to a plot.
bxplot_main <- # set up the basic ggplot parameters: data and aesthetic ggplot(df, aes(x = Stage, y = DJL)) + #set the theme's font size (this will override the general one I set earlier) #note that if I wanted to use another theme instead of cowplot, I could specify that #theme here instead (with theme_-----()) theme_cowplot(font_size = 8) + #use a geom from the ggbeeswarm add-in to plot the points, set its own aesthetic #the quasirandom geom with its own aesthetic mapping (same as the overall ggplot, #i.e. the boxplot, except adding color = Focal) geom_quasirandom(aes(Stage, DJL, color = Focal), #set how widely the points should be scattered dodge.width=0.6, #set the point transparency alpha=0.8, #setting stroke to 0 removes the outline around the points stroke = 0, #set point size size=1.3, #add these colours to the legend show.legend=TRUE) + #override ggplots default colours and use plasma colours from the viridis #package instead, need four colours, one for each individual scale_color_manual(values = plasma(5)[1:4], #override the default use of the factor levels (i.e. write names like this...) labels = c("Luaqlas", "Ivoni", "Bella", "Pi"), #override the default use of the column name (i.e. use this instead of "Focal") name = "Individual") + #add the boxplot boxes, on top of the points - this will use the main ggplot #aesthetic mapping #set the transparency geom_boxplot(alpha = 0.4, #set the outline colour of the boxes color = "grey30", #do not plot outliers (since we're plotting all of the points anyways) outlier.shape = NA) + #manually lable the plot and the axes labs(title = "Females, DJL by Stage", y = "DJL (meters)", x = "Stage") + #for the color legend, set the title manually (otherwise would use #column name) guides(color = guide_legend(title = "Individual", #override the alpha value of the points, in the legend override.aes = list(alpha = 1))) + #manually set the x axis lables scale_x_discrete(labels = c("Dep", "Indep", "Adol", "Preg", "Adult")) + #manually set the y axis breaks and limits scale_y_continuous(breaks = seq(0, 1500, by = 200), limits = c(0, 1500)) bxplot_main
So, now, what I am most interested in here is, controlling for FAI and individual differences (accounted for by the random factor), the differences in DJL over the stages. So… it’s time to calculate model predictions, and plot those…
#make a dataframe with every possible unique combination of each factor effect, #and the mean value of any continuous effect interpol <- expand.grid(Focal = factor(unique(model$data$Focal)), Stage = factor(unique(model$data$Stage)), FAI = mean(model$data$FAI)) #now predict the DJLs for each row of the new dataframe interpol$DJL<- predict(model, interpol, type= "response", allow.new.levels= T) #and now average those DJLs per Stage DJLmeansPerStage <- as.vector(tapply(interpol$DJL, interpol$Stage, mean, na.rm = TRUE)) #add these mean model predicitons to the boxplot #(there is probably a more efficient way to do this, #but I haven't figured it out for ggplot yet) bxplot_wPredictions <- bxplot_main + #for each phase, add a line segment that is a bit wider than the box, #start by mapping the start and end x and y coords; might need to #play around with the x coords to get the line segment as wide as you want it geom_segment(aes(x = 0.55, xend = 1.45, #start and end x coords #start and end y-coords are each element in the DJLmeansPerStage vector y = DJLmeansPerStage, yend = DJLmeansPerStage), #some general graphical parameters... color = "black", alpha = 0.8, size = 1.4) + geom_segment(aes(x = 1.55, xend = 2.45, y = DJLmeansPerStage, yend = DJLmeansPerStage), color = "black", alpha = 0.8, size = 1.4) + geom_segment(aes(x = 2.55, xend = 3.45, y = DJLmeansPerStage, yend = DJLmeansPerStage), color = "black", alpha = 0.8, size = 1.4) + geom_segment(aes(x = 3.55, xend = 4.45, y = DJLmeansPerStage, yend = DJLmeansPerStage), color = "black", alpha = 0.8, size = 1.4) + geom_segment(aes(x = 4.55, xend = 5.45, y = DJLmeansPerStage, yend = DJLmeansPerStage), color = "black", alpha = 0.8, size = 1.4) bxplot_wPredictions
And, if you want, you can calculate the parameters of, and plot, the a line to to show the quartic relationship. Dependending on the data and the relationship, this doesn’t always work, but when it does, it’s nice for talks and presentations, when you want to show exactly how a factor is significant.
#make a numeric vector to take the place of "Stage" factor x <- 1:5 #make a y variable out of the predicted values per phase y <- DJLmeansPerStage #make a quartic model for y ~ x pred.model <- lm(y ~ poly(x, 4)) #for plotting, make a dense vector of x-values from 1 to 5 x_vector <- seq(1, 5, by = 0.5) #use the quadratic model to general y values for all of the x-values y_vector <- predict(pred.model, list(x = x_vector)) curve_data <- data.frame(cbind(xvals = x_vector, yvals = y_vector)) bxplot_wPredictions_wCurve <- bxplot_wPredictions + #add curve line, using the newly created curve df stat_smooth(data = curve_data, aes(xvals, yvals), #specify the type of geom (in this case, line) geom = "line", #some general graphical parameters... color = "grey70", alpha = 0.8, size = 3) + #now add some text to explain it, first specify the #type of geom (in this case, text) annotate(geom = "text", #specify the x and y position x = 4.5, y = 500, #specify what the text should say label = "LMM, p < 0.001", #some general graphical parameters... color = "grey70", size = 3, fontface = "bold") bxplot_wPredictions_wCurve #> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Et voilà. For talks, I usually use at least 2 of these plots, if not all 3, building them up step by step. To ensure that the plots all come out exactly the same size, I use the
#will save to current working directory, with the filename: ggsave(filename = "DJL by Stage, full plot with predictions and curve.png", #specify which plot to save plot = bxplot_wPredictions_wCurve, #specify the file format device = "png", #specify the resolution dpi = 300, #specify the size height = 6, width = 6, units = "in")
A few things on my mind as I wrap this up:
- For talks, it’s always good to build up your plots - start with just the axes (put a white box over the data), explain them, then have the basic plot come up, then model predictions, or something like that, whatever you like… I just really don’t recommend throwing the final plot with model predictions and polynomial line at your audience all at once.
- This is NOT real orangutan DJL data. I made it up, just for the purposes of being able to write this post.
- Loading the
cowplotlibrary will hijack the default
ggplottheme in your workspace. I personally like the
cowplotdefault much better than
ggplot’s, but if you want to go back to the
ggplotdefault, just add
+ theme_grey()to any ggplot code (or pick a different
ggplottheme from here.