Comparing two groups with the linear model - r

I want to do a regression when parendiv is my Dependent variable and routine1997 is my Independent variable, and compare males to females. The code is like this:
structure(list(gender = structure(c(2L, 1L, 2L, 1L, 2L, 1L, 1L,
2L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L), .Label = c("male",
"female"), class = "factor"), parent = structure(c(2L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), .Label = c("intact", "parentaldivorce"), class = "factor"),
routine = structure(c(1L, 1L, 1L, 1L, NA, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 1L, 2L, 3L, 2L, 1L, 3L, 3L), .Label = c("Med",
"High", "Low"), class = "factor")), row.names = c(3L, 5L,
6L, 7L, 8L, 9L, 10L, 11L, 16L, 18L, 19L, 21L, 22L, 23L, 24L,
25L, 28L, 29L, 30L, 34L), class = "data.frame")
This is the code and I want to specifically compare coefficient among men and women.
lm(parent~routine, data=nlsy97, subset=gender)

There are two ways to compare the coefficients.
The easiest way would be to code gender as dummy (0/1) and include an interaction term in the model. Then, you get the difference the gender makes for the coefficient, complete with a p-value:
out = lm(parent ~ routine + gender + routine*gender, data=nlsy97)
The other way would be to use a multigroup regression and comparing the pooled regression model (all genders included) with the unpooled models (seperate slopes or intercepts or both for genders). The model with the smallest AIC fits the data best. If your random slope model yields the lowest AIC, you have gender differences in your effect. If the random intercept is best, you just have level differences between the genders but may assume equal effects.
library(lme4)
pooled = lm(parent ~ routine, data=nlsy97)
r.inter = lmer(parent ~ routine + (1|gender), data=nlsy97)
r.slope = lmer(parent ~ routine + (routine|gender), data=nlsy97)
r.unpooled = lmer(parent ~ routine + (1+routine|gender), data=nlsy97)
AIC(pooled)
AIC(r.inter)
AIC(r.slope)
AIC(r.unpooled)
Using the method coefficients() on the model with the lowest AIC provides you with the exact coefficients for the individual groups.
EDIT: I just noticed that you just have 20 cases in total. If this is your whole dataset you should probably not do any statistical analyses at all.

Related

ICC per random factor for multilevel negative binomial regression

I have a multi-level negative binomial model fit with brms (library(brms))
fit1 <- brm(TOTAL_VIOLATIONS ~ LN_POP + Source_binary + Source_purchased + (1|TYPE_consolidated) + (1|COUNTY), data = Data, family = negbinomial())
This is what the data looks like:
> dput(droplevels(Data[1:20, c(3, 9, 20, 21, 22, 23)]))
structure(list(COUNTY = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L), .Label = c("ALAMEDA",
"ALPINE", "AMADOR"), class = "factor"), TYPE_consolidated = structure(c(9L,
6L, 3L, 2L, 5L, 7L, 1L, 1L, 4L, 12L, 1L, 1L, 1L, 1L, 8L, 10L,
6L, 5L, 11L, 2L), .Label = c("City", "County Water District",
"CSA", "CSD", "IOU", "MHP", "MUD", "MWC", "PA", "Private", "PUD",
"Special Act District"), class = "factor"), TOTAL_VIOLATIONS = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 2L, 0L, 0L,
8L, 0L, 0L), LN_POP = c(3.91202300542815, 6.29710931993394, 6.21260609575152,
12.7367008965923, 10.9852927228879, 14.1374128813017, 11.9290007521904,
11.1991321074213, 11.2374881189349, 12.332000202128, 10.2255710517052,
6.10924758276437, 6.62007320653036, 6.21460809842219, 3.91202300542815,
3.2188758248682, 4.24849524204936, 7.88231491898027, 8.96839619119826,
4.91265488573605), Source_binary = structure(c(1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L
), .Label = c("GW", "SW"), class = "factor"), Source_purchased = structure(c(1L,
1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L,
1L, 1L, 2L), .Label = c("No", "Purchased"), class = "factor")), row.names = c(NA,
20L), class = "data.frame")
I want to explore the value of including first random effect (TYPE_consolidated) versus the same model with the second random intercept only (COUNTY) but I can't for the life of me figure out how to get icc() to report this information by group using by_group. The output is the exact same with or without this argument (see below). I have a feeling this is because it is a brms object as according to the help the output is different for these models but I have yet to figure out another way to get this. Does anyone know a way to get the variance ratio for the individual random effects (or alternatively can see what I'm doing wrong with by_group)? If not, is there a standard way to compare ICC between nested models? If so perhaps I could just calculate it for two different versions of my model and do that instead?
> performance::icc(fit1, by_group = TRUE)
# Random Effect Variances and ICC
Conditioned on: all random effects
## Variance Ratio (comparable to ICC)
Ratio: 0.94 CI 95%: [0.80 0.99]
## Variances of Posterior Predicted Distribution
Conditioned on fixed effects: 7.39 CI 95%: [ 2.50 20.03]
Conditioned on rand. effects: 117.57 CI 95%: [59.15 331.86]
## Difference in Variances
Difference: 109.10 CI 95%: [50.23 320.86]
> performance::icc(fit1)
# Random Effect Variances and ICC
Conditioned on: all random effects
## Variance Ratio (comparable to ICC)
Ratio: 0.94 CI 95%: [0.79 0.99]
## Variances of Posterior Predicted Distribution
Conditioned on fixed effects: 7.42 CI 95%: [ 2.48 20.19]
Conditioned on rand. effects: 117.90 CI 95%: [59.53 349.90]
## Difference in Variances
Difference: 109.71 CI 95%: [51.20 340.30]```

show 3 factors ggplot geom

I am trying to make a PCA plot using ggplot and geom_point.
I would like to illustrate 3 factors (Diet, Time, Antibiotics).
I thought I could outline the points in black for one factor).
However this isn't showing the third factor (Time) for the Fill color.
Here is a subset of my data:
> dput(dat.pcx.annot.test)
structure(list(PC1 = c(25.296379160162, 1.4703101394886, 11.4138097811008,
1.41798772574591, 23.7253675969881, 15.5683516005535, -34.6012195481675,
-25.7129281491955, -2.97230018393742, 4.83421092719293, -0.0274189140249825,
23.227939504077, 15.2002258785889, -35.2243685702227, -34.2537374460037,
-7.6380794043063), PC2 = c(27.2678813936857, -9.88577494210313,
-6.19394322321806, -8.88953660465497, 33.6791127012231, -13.2912233546802,
7.77877968081575, 2.7371646557436, -8.41929538502921, -11.5151849519265,
-9.40733576034963, 32.3549860618533, -11.2170071727855, 10.0455709347794,
3.05679707335492, -6.66218028060621), Diet = structure(c(1L,
1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L), .Label = c("RC",
"WD"), class = "factor"), Time = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("ZT14",
"ZT2"), class = "factor"), Antibiotics = structure(c(2L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L), .Label = c("Antibiotics ",
"None"), class = "factor")), row.names = c(1L, 2L, 3L, 4L, 5L,
6L, 7L, 8L, 9L, 10L, 11L, 18L, 19L, 20L, 21L, 22L), class = "data.frame")
Here is the plotting command :
ggplot(dat.pcx.annot.test,aes(x=PC1,y=PC2,color=Diet,shape=Antibiotics,Fill=Time))+
geom_point(size=3,alpha=0.5)+
scale_color_manual(values = c("black","white") )
And the plot it produces:
I thought if I had both color and fill specified then they would both show.
I would like black outlines for Antibiotics, and Fill color for Time.
Right now Time is not represented.
Any help on how to simultaneously view the 3 factors.
Thanks
Yes I had a fill typo. And I finally figured out how to get the legends to correspond. Here is my final answer.
ggplot(dat.pcx.annot,aes(x=PC1,y=PC2,color=Diet,shape=Antibiotics,fill=Time))+
geom_point(size=3)+
scale_shape_manual(values = c(21, 22) )+
scale_color_manual(values = c("black","white") )+
scale_fill_manual(values=c("#EC9DAE","#AEDE94"))+
xlab(PC1var)+
ylab(PC2var)+
guides(fill=guide_legend(override.aes=list(shape=21)))+
guides(color=guide_legend(override.aes=list(shape=21)))
guides(fill=guide_legend(override.aes=list(shape=21,fill=c("#EC9DAE","#AEDE94"),color=c("black","white"))))
ggsave("cohort2_pca.pdf")

Tukey test after LMM keeping contrasts

I want to test a 2x3 factorial design and contrasted the variables like this
library(lme4)
library(emmeans)
my.helmert = matrix(c(2, -1, -1, 0, -1, 1), ncol = 2)
contrasts(Target3$mask) = my.helmert
contrasts(Target3$length)
So for mask I want to compare the first group with the average of the two other groups and in a second step the second with the third group.
This works fine in my LMM
Target3.2_TT.lmer = lmer(logTotalTime ~ mask*length+ (1+length|Subject) +(1|Trialnum), data = Target3)
There is a significant interaction between mask and length, that´s why I want to take a look at this effect and calculate a post hoc test (Turkey) like this:
emmeans(Target3.2_TT.lmer, pairwise ~ mask : length)
This also works pretty fine with one problem: now my contrasts are gone. The text calculates the differences for all masks and not just 1 vs. 2 and 3 and 2 vs. 3. Is there a possibility to keep my contrasts in the Post hoc test?
This is what the data looks like:
> dput(Target3)
structure(list(mask = structure(c(2L, 1L, 2L, 3L, 1L, 2L, 3L,
2L, 1L, 3L, 3L, 3L, 3L, 2L, 2L, 3L, 3L, 1L, 3L, 1L, 1L, 2L, 3L,
3L, 2L, 1L, 3L, 2L, 3L, 2L), contrasts = structure(c(2, -1, -1,
0, -1, 1), .Dim = c(3L, 2L), .Dimnames = list(c("keine Maske",
"syntaktisch\n korrekt", "syntaktisch \n inkorrekt"), NULL)), .Label = c("keine Maske",
"syntaktisch\n korrekt", "syntaktisch \n inkorrekt"), class = "factor"),
length = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 1L,
2L, 2L, 1L, 1L, 2L, 1L), .Label = c("kurzes \n N+1", "langes\n N+1"
), class = "factor"), logTotalTime = c(4.969813299576, 5.37989735354046,
5.14166355650266, 5.40717177146012, 5.27299955856375, 5.72358510195238,
5.4249500174814, 6.18001665365257, 5.67675380226828, 5.44241771052179,
5.66988092298052, 5.04985600724954, 5.78996017089725, 5.03043792139244,
5.92958914338989, 5.15329159449778, 6.11146733950268, 5.26269018890489,
5.17614973257383, 6.18001665365257, 6.03068526026126, 5.68697535633982,
5.17614973257383, 5.19849703126583, 5.29330482472449, 5.89989735358249,
5.73979291217923, 5.65599181081985, 5.94017125272043, 5.72031177660741
)), .Names = c("mask", "length", "logTotalTime"), row.names = c(2L,
4L, 6L, 8L, 9L, 11L, 13L, 15L, 16L, 18L, 20L, 22L, 27L, 29L,
31L, 33L, 35L, 37L, 39L, 41L, 42L, 44L, 47L, 49L, 51L, 54L, 55L,
57L, 59L, 61L), class = "data.frame")
Well, if you ask for pairwise comparisons, that’s what you get, and Helmert contrasts are not the same as pairwise comparisons. Further, the Tukey (not Turkey) method applies only to pairwise comparisons, not to other types of contrasts.
Here’s something to try that may give you what you want.
emm = emmeans(Target3.2_TT.lmer,
~ mask | length)
contrast(emm, list(
c1 = c(2, -1, -1)/2,
c2 = c(0, 1, -1)),
adjust = “mvt”)
This will work independently of whatever parameterization (i.e., contrasts settings) were used when fitting the model. The model parametrrization affects how the model matrix is set up and the interpretation of the coefficients, but does not affect the results from emmeans or its relatives.

`ddply` fails to apply logistic regression (GLM) by group to my dataset

I'm working out the LD50 (lethal dosage) for multiple populations from different experiments using the MASS package. It's simple enough when I subset the data and do one at a time, but I'm getting an error when I use ddply. Essentially I need an LD50 for each population at each temperature.
My data looks somewhat like this:
# dput(d)
d <- structure(list(Pop = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L), .Label = c("a", "b", "c"), class = "factor"), Temp = structure(c(1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("high", "low"), class = "factor"),
Dose = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L,
1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), Dead = c(0L,
11L, 12L, 14L, 2L, 16L, 17L, 7L, 5L, 3L, 17L, 15L, 9L, 20L,
8L, 19L, 7L, 2L, 20L, 14L, 9L, 15L, 1L, 15L), Alive = c(20L,
9L, 8L, 6L, 18L, 4L, 3L, 13L, 15L, 17L, 3L, 5L, 11L, 0L,
12L, 1L, 13L, 18L, 0L, 6L, 11L, 5L, 19L, 5L)), .Names = c("Pop",
"Temp", "Dose", "Dead", "Alive"), class = "data.frame", row.names = c(NA,
-24L))
The following works fine:
d$Mortality <- cbind(d$Alive, d$Dead)
a <- d[d$Pop=="a" & d$Temp=="high",]
library(MASS)
dose.p(glm(Mortality ~ Dose, family="binomial", data=a), p=0.5)[1]
But when I put this into ddply I get the following error:
library(plyr)
d$index <- paste(d$Pop, d$Temp, sep="_")
ddply(d, 'index', function(x) dose.p(glm(Mortality~Dose, family="binomial", data=x), p=0.5)[1])
Error in eval(expr, envir, enclos) : y values must be 0 <= y <= 1
I can get the right LD50 when I use a proportion but can't figure out where I've gone wrong with my approach (and had already written this question).
Perhaps this will amaze you. But if you choose to use formula
cbind(Alive, Dead) ~ Dose
instead of
Mortality ~ Dose
the problem will be gone.
library(MASS)
library(plyr)
## `d` is as your `dput` result
## a function to apply
f <- function(x) {
fit <- glm(cbind(Alive, Dead) ~ Dose, family = "binomial", data = x)
dose.p(fit, p=0.5)[[1]]
}
## call `ddply`
ddply(d, .(Pop, Temp), f)
# Pop Temp V1
#1 a high 2.6946257
#2 a low 2.1834099
#3 b high 2.5000000
#4 b low 0.4830998
#5 c high 2.2899553
#6 c low 2.5000000
So what happened with Mortality ~ Dose? Let's set .inform = TRUE when calling ddply:
## `d` is as your `dput` result
d$Mortality <- cbind(d$Alive, d$Dead)
## a function to apply
g <- function(x) {
fit <- glm(Mortality ~ Dose, family = "binomial", data = x)
dose.p(fit, p=0.5)[[1]]
}
## call `ddply`
ddply(d, .(Pop, Temp), g, .inform = TRUE)
#Error in eval(expr, envir, enclos) : y values must be 0 <= y <= 1
#Error: with piece 1:
# Pop Temp Dose Dead Alive Mortality
#1 a high 1 0 20 20
#2 a high 2 11 9 9
#3 a high 3 12 8 8
#4 a high 4 14 6 6
Now we we see that variable Mortality has lost dimension, and only the first column (Alive) is retained. For a glm with binomial response, if the response is a single vector, glm expects 0-1 binary or a factor of two levels. Now, we have integers 20, 9, 8, 6, ..., hence glm will complain
Error in eval(expr, envir, enclos) : y values must be 0 <= y <= 1
There is really no way to fix this issue. I have tried using a protector:
d$Mortality <- I(cbind(d$Alive, d$Dead))
but it still ends up with the same failure.

ggplot2 loop graph with conditional subsets

Data description:
I have a data set that is in long format with multiple different grouping variables (in data example: StandID and simID)
What I am trying to do:
I need to create simple scatter plots (x=predicted, y=observed) from this dataset for multiple columns based on a unique grouping variable.
An example of what I am trying to do using just standard plot is
obs=subset(example,simID=="OBS_OBS_OBS")
csfnw=example[example$simID== "CS_F_NW",]
plot(obs$X1HR,csfnw$X1HR)
I would need to do this for all simID and columns 9-14. (12 graphs total from data example)
What I have tried:
The problem I am running into is the y axis needs to remain the same, while cycling through the different subsets for the x axis.
I will admit up front, I have no idea what would be the best approach for this... I thought this would be easy for a split second because the data is already in long format and I would just be pointing to a subset of the data.
1) My original approach was to try and just splice up the data so that each simID had its own data frame, and compare it against the observation dataframe but I don't know how I would then pass it to ggplot.
2) My second idea was to make some kind of makeGraph function containing all the aesthetics I wanted essentially and use some kind of apply on it to pass everything through the function, but I could get neither to work.
makePlot=function(dat,x,y) {
ggplot(data=dat,aes(x=x,y=y))+geom_point(shape=Treat)+theme_bw()
}
What I could get to work was just breaking down the dataframe into the vectors of the variables I would then pass to some kind of loop/apply
sims=levels(example$simID)
sims2=sims[sims != "OBS_OBS_OBS"]
fuel_classes=colnames(example)[9:14]
Thank you
Data example:
example=structure(list(Year = structure(c(7L, 7L, 7L, 7L, 7L, 7L, 7L,
7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L,
7L, 7L, 7L, 7L), .Label = c("2001", "2002", "2003", "2004", "2005",
"2013", "2014", "2015"), class = "factor"), StandID = structure(c(10L,
2L, 6L, 22L, 14L, 18L, 34L, 26L, 30L, 10L, 2L, 6L, 22L, 14L,
18L, 34L, 26L, 30L, 10L, 2L, 6L, 22L, 14L, 18L, 34L, 26L, 30L
), .Label = c("1NB", "1NC", "1NT", "1NTB", "1RB", "1RC", "1RT",
"1RTB", "1SB", "1SC", "1ST", "1STB", "2NB", "2NC", "2NT", "2NTB",
"2RB", "2RC", "2RT", "2RTB", "2SB", "2SC", "2ST", "2STB", "3NB",
"3NC", "3NT", "3NTB", "3RB", "3RC", "3RT", "3RTB", "3SB", "3SC",
"3ST", "3STB"), class = "factor"), Block = structure(c(1L, 1L,
1L, 2L, 2L, 2L, 3L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L,
1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), .Label = c("1", "2", "3"
), class = "factor"), Aspect = structure(c(3L, 1L, 2L, 3L, 1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L,
3L, 1L, 2L, 3L, 1L, 2L), .Label = c("N", "R", "S"), class = "factor"),
Treat = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L), .Label = c("B", "C", "T", "TB"), class = "factor"),
Variant = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = c("CS", "OBS", "SN"), class = "factor"),
Fuels = structure(c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L), .Label = c("F", "NF", "OBS"), class = "factor"),
Weather = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = c("NW", "OBS", "W"), class = "factor"),
X1HR = c(0.321666667, 0.177777778, 0.216111111, 0.280555556,
0.255555556, 0.251666667, 0.296666667, 0.231111111, 0.22,
0.27556628, 0.298042506, 0.440185249, 0.36150676, 0.398630172,
0.367523015, 0.345717251, 0.349305987, 0.412227929, 0.242860824,
0.258737177, 0.394024998, 0.287317872, 0.321927488, 0.281322986,
0.313588411, 0.303123146, 0.383658946), X10HR = c(0.440555556,
0.32, 0.266666667, 0.292222222, 0.496666667, 0.334444444,
0.564444444, 0.424444444, 0.432777778, 0.775042951, 0.832148314,
1.08174026, 1.023838878, 0.976997674, 0.844206274, 0.929837704,
1.0527215, 1.089246511, 0.88642776, 0.920596302, 1.209707737,
1.083737493, 1.077612877, 0.92481339, 1.041637182, 1.149550319,
1.229776621), X100HR = c(0.953888889, 1.379444444, 0.881666667,
1.640555556, 2.321666667, 1.122222222, 1.907777778, 1.633888889,
1.208333333, 1.832724094, 2.149356842, 2.364475727, 2.493232965,
2.262988567, 1.903909683, 2.135747433, 2.256677628, 2.288722038,
1.997704744, 2.087135553, 2.524872541, 2.34671092, 2.338253498,
2.06796217, 2.176314831, 2.580271006, 2.857197046), X1000HR = c(4.766666667,
8.342222222, 3.803333333, 8.057777778, 10.11444444, 6.931111111,
6.980555556, 13.20611111, 1.853333333, 3.389177084, 4.915714741,
2.795267582, 2.48227787, 2.218413353, 1.64684248, 2.716156483,
2.913746119, 2.238629341, 3.449863434, 3.432626724, 3.617531776,
3.641639471, 3.453454971, 3.176793337, 3.459602833, 3.871166945,
2.683447838), LITTER = c(2.4, 2.219444444, 2.772222222, 2.596666667,
2.693888889, 2.226111111, 2.552222222, 3.109444444, 2.963333333,
2.882233381, 3.025934696, 3.174396992, 3.291081667, 2.897673607,
2.737119675, 2.987895727, 3.679605484, 2.769756079, 2.882241249,
3.02594161, 3.174404144, 3.291091681, 2.897681713, 2.737129688,
2.987901449, 3.679611444, 2.769766569), DUFF = c(1.483333333,
1.723888889, 0.901666667, 1.520555556, 1.49, 1.366111111,
0.551666667, 1.056111111, 0.786111111, 2.034614563, 2.349547148,
1.685223818, 2.301301956, 2.609308243, 2.21895647, 2.043699026,
2.142618418, 0.953421116, 4.968493462, 4.990526676, 5.012362003,
5.023665905, 4.974074364, 4.947199821, 4.976779461, 5.082509995,
3.55211544), simID = structure(c(5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 5L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L), .Label = c("CS_F_NW", "CS_F_W",
"CS_NF_NW", "CS_NF_W", "OBS_OBS_OBS", "SN_F_NW", "SN_F_W",
"SN_NF_NW", "SN_NF_W"), class = "factor")), .Names = c("Year",
"StandID", "Block", "Aspect", "Treat", "Variant", "Fuels", "Weather",
"X1HR", "X10HR", "X100HR", "X1000HR", "LITTER", "DUFF", "simID"
), row.names = c(37L, 38L, 39L, 40L, 41L, 42L, 43L, 44L, 45L,
82L, 83L, 84L, 85L, 86L, 87L, 88L, 89L, 90L, 127L, 128L, 129L,
130L, 131L, 132L, 133L, 134L, 135L), class = "data.frame")
You were actually on the right track. If all plots are the same, just make one function and then use loops to loop over the subsets. For your example this can be done like this:
library(ggplot2)
# the plot function
plotFun = function(dat, title) {
ggplot(data=dat) +
geom_point(aes(x = x, y = y), shape=18) +
ggtitle(title) +
theme_bw()
}
# columns of interest
colIdx = 9:14
# split on all values of simID
dfList = split(example, example$simID)
# simID has never appearing factors. These are removed
dfList = dfList[lapply(dfList, nrow) != 0]
# make empty array for saving plots
plotList = array(list(), dim = c(length(dfList), length(dfList), length(colIdx)),
dimnames = list(names(dfList), names(dfList), names(example)[colIdx]))
# the first two loops loop over all unique combinations of dfList
for (i in 2:length(dfList)) {
for (j in 1:(i-1)) {
# loop over target variables
for (k in seq_along(colIdx)) {
# store variables to plot in a temporary dataframe
tempDf = data.frame(x = dfList[[i]][, colIdx[k]],
y = dfList[[j]][, colIdx[k]])
# add a title so we can see in the plot what is plotted vs what
title = paste0(names(dfList)[i], ":", names(dfList[[i]])[colIdx[k]], " VS ",
names(dfList)[j], ":", names(dfList[[j]])[colIdx[k]])
# make and save plot
plotList[[i, j, k]] = plotFun(tempDf, title)
}
}
}
# call the plots like this
plotList[[2, 1, 4]]
# Note that we only filled the lower triangle of combinations
# therefore indexing with [[1, 1, 1]] just returns NULL
plotList[, , 1]
This process can probably be more optimized, but when creating graphs I would go for clarity above speed since speed usually isn't an issue.

Resources