nls regression and storing output coefficients and plots - R

I tried to make reproducible data as suggested in "How to make a great R reproducible example?":
structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("ANK.26.1",
"ANK.35.10"), class = "factor"), DAY = c(2L, 3L,
4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L,
18L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L,
15L, 16L, 17L, 18L), carbon = c(1684.351094778, 3514.451339358,
6635.877888654, 10301.700591252, 11361.360992769, 11891.934331254,
12772.885869486, 13545.127224369, 14022.00520767, 14255.045990397,
14479.813468278, 14611.749542181, 14746.382638335, 14942.733363567,
14961.338739162, 15049.433738817, 15047.197961499, 1705.361701104,
3293.593040601, 4788.872254899, 6025.622715999, 6670.80499518,
7150.526272512, 7268.955557607, 7513.61998338, 7896.202773246,
8017.953574608, 8146.09464786, 8286.148260324, 8251.229520243,
8384.244997158, 8413.034235219, 8461.066691601, 8269.360979031
), g.rate.perc = c(NA, 1.08653133557123, 0.888168948119852,0.55242467750436,
0.102862667394628, 0.0466998046116733, 0.0740797513417739, 0.060459426536321,
0.0352066079115925, 0.0166196474238596, 0.0157675729725753, 0.00911172469120847,
0.00921402983026387, 0.0133151790542558, 0.00124511193115184,
0.00588817626489591, -0.000148562222127446, NA, 0.931316411333049,
0.45399634862756, 0.258255053647507, 0.107073129133681, 0.0719135513148148,
0.0165623173150578, 0.0336588143694119, 0.0509185706373581,0.0154189051191185,
0.0159817679236518, 0.0171927308137518, -0.00421410998016991,
0.0161206856006937, 0.00343373053515927, 0.00570929049366353,
-0.0226573929218994), max.carb = c(15049.433738817, 15049.433738817,
15049.433738817, 15049.433738817, 15049.433738817, 15049.433738817,
15049.433738817, 15049.433738817, 15049.433738817, 15049.433738817,
15049.433738817, 15049.433738817, 15049.433738817, 15049.433738817,
15049.433738817, 15049.433738817, 15049.433738817, 8461.066691601,
8461.066691601, 8461.066691601, 8461.066691601, 8461.066691601,
8461.066691601, 8461.066691601, 8461.066691601, 8461.066691601,
8461.066691601, 8461.066691601, 8461.066691601, 8461.066691601,
8461.066691601, 8461.066691601, 8461.066691601, 8461.066691601
)), .Names = c("ID", "DAY", "carbon", "g.rate.perc", "max.carb"
), row.names = c(NA, 34L), class = "data.frame")
'data.frame': 34 obs. of 5 variables:
$ ID : Factor w/ 150 levels "ANK.26.1","ANK.35.10",..: 1 1 1 1 1 1 1 1 1 1 ...
$ DAY : int 2 3 4 5 6 7 8 9 10 11 ...
$ carbon : num 1684 3514 6636 10302 11361 ...
$ g.rate.perc: num NA 1.087 0.888 0.552 0.103 ...
$ max.carb : num 15049 15049 15049 15049 15049 ...
In the sample data, ID only has two levels, not the indicated 150.
My nls call looks like this:
res.sample <- ddply(
  d, .(ID),
  function(x){
    mod <- nls(carbon ~ phi1/(1 + exp(-(phi2 + phi3 * DAY))),
               start = list(
                 phi1 = x$max.carb,
                 phi2 = int[1],
                 phi3 = mean(x$g.rate.perc)),
               data = x, trace = TRUE)
    return(coef(mod))
  }
)
phi2 is actually the intercept obtained from
int <- coef(lm(DAY~carbon,data=sample))
Unfortunately it doesn't work anymore since I tried to wrap it in the ddply structure, and I can't go through all 150 original levels of ID manually.
On top of that, I would like to store all three output values phi1-phi3 in a data frame/list with the respective ID. I meant to do that with
return(coef(mod))
The cherry on top would be plots of the actual data with the fitted curve on top. Manually, with subsetting, I can do this as well, but it is just too time-consuming.
My reduced ggplot code for that is:
ggplot(data = n, aes(x = DAY, y = carbon)) +
  geom_point(size = 2) +
  geom_line(aes(DAY, predict(logMod))) +
  ggtitle("ID")
If the ID, which contains a triplet of information, is less useful in that combined form, here is how it can be separated into its components:
sep_sample <- sample %>% separate(ID, c("algae", "id", "nutrient"))
I feel like this is a lot to ask, but I have really tried, and I can only spend so many days on this.
Here is a summary:
I need to run the model on every level of ID (or on every combination of algae and nutrient, if you split it).
The output phi values should be stored in some kind of data frame/list/table together with the ID they belong to.
Ideally there is a way to include ggplots in all of this, automatically generated and stored as well.
As I said, the model in itself already worked, but when I put it in the ddply structure I get the following error message:
Error in numericDeriv(form[[3L]], names(ind), env) :
Missing value or an infinity produced when evaluating the model
I hope this is something you can work with and that it appears to be a reasonable question to ask. If there are pages that already provide a similar solution that I haven't found, I will gladly take a look.
Thanks a lot!
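For what it's worth, here is a minimal sketch (my own, with illustrative names) of one way to get all three things at once: per-ID fits, a coefficient table, and stored plots. Two details differ deliberately from the code above: phi1 = x$max.carb passes a whole column where nls expects a single start value, and mean(x$g.rate.perc) is NA because of the leading NA in each group; either of these can break the fit inside ddply.
library(plyr)
library(ggplot2)
fits <- dlply(d, .(ID), function(x) {
  int <- coef(lm(DAY ~ carbon, data = x))        # per-subset intercept
  mod <- nls(carbon ~ phi1/(1 + exp(-(phi2 + phi3 * DAY))),
             start = list(phi1 = max(x$carbon),  # scalar start value
                          phi2 = int[[1]],
                          phi3 = mean(x$g.rate.perc, na.rm = TRUE)),
             data = x)
  list(coefs = coef(mod),
       plot = ggplot(x, aes(DAY, carbon)) +
         geom_point(size = 2) +
         geom_line(aes(y = predict(mod))) +
         ggtitle(as.character(unique(x$ID))))
})
coef_table <- ldply(fits, function(f) f$coefs)   # one row of phis per ID
plot_list <- lapply(fits, `[[`, "plot")          # named list of ggplots
Whether every one of the 150 groups converges still depends on the start values, so treat this as a template rather than a guaranteed fix.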

So I came up with this solution, which is not all I wanted, but I think it is a step closer, since it runs:
coef_list <- list()
curve_list <- list()
for (i in levels(d$ALGAE)) {
  for (j in levels(d$NUTRIENT)) {
    dat = d[d$ALGAE == i & d$NUTRIENT == j, ]
    # int <- coef(lm(DAY~carbon,data=dat))
    mod <- nls(carbonlog ~ phi1/(1 + exp(-(phi2 + phi3*DAY))),
               start = list(
                 phi1 = 9.364,
                 phi2 = 0,
                 phi3 = 0.135113),
               data = dat, trace = TRUE)
    coef_list[[paste(i, j, sep = "_")]] = coef(mod)
    plt <- ggplot(data = dat, aes(x = DAY, y = carbonlog)) + geom_point() +
      geom_line(aes(DAY, predict(mod))) +
      ggtitle(paste(i, "RATIO", j, sep = " ")) +
      theme.plot
    curve_list[[paste(i, j, sep = "_")]] = plt
  }
}
Unfortunately the start parameters are static and not dependent on the respective factor combination. I reckon the latter would be more helpful in finding a fit.
If I call
curve_list[["ANK_1"]]
I get an error message though:
Error: Aesthetics must be either length 1 or the same as the data (17): x, y
I only get the message when I use the log-transformed carbon values; when I use carbon in its original format, it plots everything.
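A hedged guess at that last error: if carbonlog comes from log(carbon) and some rows produce NA or -Inf, nls silently drops them, so predict(mod) is shorter than dat and ggplot's length check fails. Fitting and plotting on the complete cases keeps the lengths in sync; a sketch:
dat_cc <- dat[is.finite(dat$carbonlog), ]   # keep only usable rows
mod <- nls(carbonlog ~ phi1/(1 + exp(-(phi2 + phi3 * DAY))),
           start = list(phi1 = 9.364, phi2 = 0, phi3 = 0.135113),
           data = dat_cc, trace = TRUE)
ggplot(dat_cc, aes(x = DAY, y = carbonlog)) + geom_point() +
  geom_line(aes(y = predict(mod)))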

Related

Calculate AUC and variable importance in mgcv::gam in R

Hello, my dataset looks like this:
structure(list(pa = structure(c(2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L,
1L, 2L, 1L, 1L, 2L, 1L, 1L), .Label = c("0", "1"), class = "factor"),
lon = c(26.953632, 26.914444, 26.854655, 26.377477, 26.653273,
26.739085, 26.732233, 26.67895, 26.6691, 26.925116, 26.771316,
26.952233, 26.934466, 26.9493, 26.948333), lat = c(37.65571,
37.658056, 37.548262, 37.714353, 37.670897, 37.652183, 37.664717,
37.672083, 37.6934, 37.63755, 37.41155, 37.65095, 37.661533,
37.65825, 37.652166), distance = c(2664.205501, 2188.408657,
1309.509802, 2931.223857, 443.7116677, 83.4248179, 1162.349952,
1025.302461, 1447.284772, 156.3081952, 1718.49796, 2120.230705,
2940.015299, 2859.658249, 2179.706853), N = c(2L, 3L, 3L,
4L, 1L, 3L, 3L, 4L, 8L, 7L, 2L, 0L, 10L, 0L, 0L), nh4 = c(0.0911071189102672,
0.0912837530530634, 0.0887604283967188, 0.0809833919295647,
0.0806452852518153, 0.0873989977309376, 0.0854938036251452,
0.0837840217003991, 0.113291559368372, 0.139553981108798,
0.136305334431029, 0.149872598116116, 0.14975582563108, 0.149872598116116,
0.149872598116116), ppn = c(3.13649814951996, 3.38222779366539,
2.5790228332411, 1.68392748415672, 2.80087243875361, 3.2346900728285,
3.17393288172866, 2.63412894585215, 3.14572940860351, 4.80038520203728,
5.83457531216185, 5.10820325640801, 5.14342739916075, 5.10820325640801,
5.10820325640801)), row.names = c(1L, 2L, 3L, 5L, 6L, 7L,
8L, 9L, 10L, 11L, 13L, 16L, 17L, 18L, 19L), class = "data.frame")
I'm trying to fit a model with this kind of formula:
mod <- mgcv::gam(data = db, family = binomial(link = "logit"), method = "REML",
                 cbind(pa, N) ~ s(lon) + s(lat) + ti(lon, lat, distance, bs = "re") +
                   s(nh4) + s(ppn, k = 10))
Where pa is a binomial variable (presence/absence) and N is the number of individuals collected (when presence has value 1). The problem is that when I run the following code to calculate the AUC, R returns errors:
library(mgcv) # library for GAM
library(ggplot2) # for beautiful plots
library(cdata) # data wrangling
library(sigr) # AUC calculation
data <- dplyr::select(db, pa, lon, lat, distance, nh4, ppn, N, season)
randn=runif(nrow(data))
train_idx=randn<=0.8
train=data[train_idx,]
test=data[!train_idx,]
performance = function(y, pred){
  confmat_test = table(truth = y, predict = pred > 0.5)
  acc = sum(diag(confmat_test)) / sum(confmat_test)
  precision = confmat_test[2,2] / sum(confmat_test[,2])
  recall = confmat_test[2,2] / sum(confmat_test[2,])
  auc = calcAUC(pred, y)
  c(acc, precision, recall, auc)
}
# Posterior probability
train$pred=predict(gam_model,newdata = train,type = "response")
test$pred=predict(gam_model,newdata=test,type="response")
# model performance evaluated using training data
perf_train=performance(train$pa_dd,train$pred)
perf_test=performance(test$pa_dd,test$pred)
perf_mat=rbind(perf_train,perf_test)
colnames(perf_mat)=c("accuracy","precision","recall","AUC")
round(perf_mat,4)
Questions are:
Is this formula correct?
How can I compute AUC?
How can I compute each variable's importance?
Thank you in advance.
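Two notes, offered as guesses rather than a verified fix. First, with family = binomial a two-column response built with cbind() is read as (successes, failures), so cbind(pa, N) with a 0/1 factor and a raw count is probably not the intended model; pa alone, or cbind(N, total - N) for counts, would be the usual forms. Second, part of the errors may be plain name mismatches: the model is stored as mod but predicted from gam_model, and the data has a pa column while performance() is called on pa_dd. A minimal consistent sketch of the prediction step:
gam_model <- mod   # use one consistent name for the fitted model
train$pred <- predict(gam_model, newdata = train, type = "response")
test$pred <- predict(gam_model, newdata = test, type = "response")
# convert the factor pa to logical before the performance()/calcAUC() calls
perf_train <- performance(train$pa == "1", train$pred)
perf_test <- performance(test$pa == "1", test$pred)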

Plotting multiple effect plots from logistic regression

I have a number of logistic regression models with different response variables but the same predictor variables. I want to use grid.arrange (or anything else) to make a single figure with all these effect plots that were made with the effects package. I followed the advice here to make such a graph: grid.arrange with John Fox's effects plots
library(effects)
library(gridExtra)
data <- structure(list(group = structure(c(1L, 1L, 1L, 1L, 1L,1L, 1L, 2L, 2L, 2L), .Label = c("group1", "group2"), class = "factor"),obs = c(1L, 1L, 4L, 4L, 6L, 12L, 26L, 1L, 10L, 6L),responseA = c(1L, 1L, 2L, 0L, 1L, 10L, 20L, 0L, 3L, 2L), responseB = c(0L, 0L, 2L, 4L, 6L, 4L, 8L, 1L, 8L, 5L)), .Names = c("group", "obs", "responseA","responseB"), row.names = c(53L, 54L, 55L, 56L, 57L, 58L,59L, 115L, 116L, 117L), class = "data.frame")
model1 <- glm(cbind(responseA, (obs - responseA)) ~ group, family = binomial, data = data)
model2 <- glm(cbind(responseB, (obs - responseB)) ~ group, family = binomial, data = data)
ef1 <-allEffects(model1)[[1]]
ef2 <- allEffects(model2)[[1]]
elist <- list( ef1,ef2)
class(elist) <- "efflist"
plot(elist, col=2)
The problem is that in the models I am using the response variable in the form cbind(responseA, no responseA), but for the figure I would like to change it to something cleaner (like "Response A"). I tried changing the y labels by supplying both at once, but got a warning, and it turned both labels into "Response A".
plot(elist, ylab=c("response A","response B"),col=2)
Then I tried the second suggested method, changing the class to trellis, but got an error, so grid.arrange didn't work either.
p1<-plot(allEffects(model1),ylab="Response A")
p2<-plot(allEffects(model2),ylab="Response B")
class(p1) <- class(p2) <- "trellis"
grid.arrange(p1, p2, ncol=2)
Can anyone provide a method to change each y-axis label separately?
With the ef1 and ef2 variables you created, you can try the following
plot1 <- plot(ef1, ylab = "Response A")
plot2 <- plot(ef2, ylab = "Response B")
grid.arrange(plot1, plot2, ncol=2)

Drawing SE in xyplot with errorbars

I am trying to construct a simple XY graph with the milk production (called FCM) of two different groups of cows (from the output I got from the mixed model, using the lsmeans and SE).
I was able to construct the plot displaying the lsmeans using the xyplot function in lattice:
library(lattice)
xyplot(lsmean~Time, type="b", group=Group, data=lsmeans2[order(lsmeans2$Time),],
pch=16, ylim=c(10,35), col=c("darkorange","darkgreen"),
ylab="FCM (kg/day)", xlab="Week", lwd=2,
key=list(space="top",
lines=list(col=c("darkorange","darkgreen"),lty=c(1,1),lwd=2),
text=list(c("Confinement Group","Pasture Group"), cex=0.8)))
I now want to add the error bars. I tried some things with the panel.arrows function, just copying and pasting from other examples, but didn't get any further.
I would really appreciate some help!
My lsmeans2 dataset:
Group Time lsmean SE df lower.CL upper.CL
Stall wk1 26.23299 0.6460481 59 24.19243 28.27356
Weide wk1 25.12652 0.6701080 58 23.00834 27.24471
Stall wk10 21.89950 0.6460589 59 19.85890 23.94010
Weide wk10 18.45845 0.6679617 58 16.34705 20.56986
Stall wk2 25.38004 0.6460168 59 23.33957 27.42050
Weide wk2 22.90409 0.6679617 58 20.79269 25.01549
Stall wk3 25.02474 0.6459262 59 22.98455 27.06492
Weide wk3 24.05886 0.6679436 58 21.94751 26.17020
Stall wk4 23.91630 0.6456643 59 21.87694 25.95565
Weide wk4 22.23608 0.6678912 58 20.12490 24.34726
Stall wk5 23.97382 0.6493483 59 21.92283 26.02481
Weide wk5 18.14550 0.6677398 58 16.03480 20.25620
Stall wk6 24.48899 0.6456643 59 22.44963 26.52834
Weide wk6 19.40022 0.6697394 58 17.28319 21.51724
Stall wk7 24.98107 0.6459262 59 22.94089 27.02126
Weide wk7 19.71200 0.6677398 58 17.60129 21.82270
Stall wk8 22.65167 0.6460168 59 20.61120 24.69214
Weide wk8 19.35759 0.6678912 58 17.24641 21.46877
Stall wk9 22.64381 0.6460481 59 20.60324 24.68438
Weide wk9 19.26869 0.6679436 58 17.15735 21.38004
For completeness, here is a solution using xyplot:
# Reproducible data
lsmeans2 = structure(list(Group = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("Stall",
"Weide"), class = "factor"), Time = structure(c(1L, 1L, 2L, 2L,
3L, 3L, 4L, 4L, 5L, 5L, 6L, 6L, 7L, 7L, 8L, 8L, 9L, 9L, 10L,
10L), .Label = c("wk1", "wk10", "wk2", "wk3", "wk4", "wk5", "wk6",
"wk7", "wk8", "wk9"), class = "factor"), lsmean = c(26.23299,
25.12652, 21.8995, 18.45845, 25.38004, 22.90409, 25.02474, 24.05886,
23.9163, 22.23608, 23.97382, 18.1455, 24.48899, 19.40022, 24.98107,
19.712, 22.65167, 19.35759, 22.64381, 19.26869), SE = c(0.6460481,
0.670108, 0.6460589, 0.6679617, 0.6460168, 0.6679617, 0.6459262,
0.6679436, 0.6456643, 0.6678912, 0.6493483, 0.6677398, 0.6456643,
0.6697394, 0.6459262, 0.6677398, 0.6460168, 0.6678912, 0.6460481,
0.6679436), df = c(59L, 58L, 59L, 58L, 59L, 58L, 59L, 58L, 59L,
58L, 59L, 58L, 59L, 58L, 59L, 58L, 59L, 58L, 59L, 58L), lower.CL = c(24.19243,
23.00834, 19.8589, 16.34705, 23.33957, 20.79269, 22.98455, 21.94751,
21.87694, 20.1249, 21.92283, 16.0348, 22.44963, 17.28319, 22.94089,
17.60129, 20.6112, 17.24641, 20.60324, 17.15735), upper.CL = c(28.27356,
27.24471, 23.9401, 20.56986, 27.4205, 25.01549, 27.06492, 26.1702,
25.95565, 24.34726, 26.02481, 20.2562, 26.52834, 21.51724, 27.02126,
21.8227, 24.69214, 21.46877, 24.68438, 21.38004)), .Names = c("Group",
"Time", "lsmean", "SE", "df", "lower.CL", "upper.CL"), class = "data.frame", row.names = c(NA,
-20L))
xyplot(lsmean ~ Time, type = "b", group = Group, data = lsmeans2[order(lsmeans2$Time), ],
       panel = function(x, y, ...){
         panel.arrows(x, y, x, lsmeans2$upper.CL, length = 0.15,
                      angle = 90, col = c("darkorange", "darkgreen"))
         panel.arrows(x, y, x, lsmeans2$lower.CL, length = 0.15,
                      angle = 90, col = c("darkorange", "darkgreen"))
         panel.xyplot(x, y, ...)
       },
       pch = 16, ylim = c(10, 35), col = c("darkorange", "darkgreen"),
       ylab = "FCM (kg/day)", xlab = "Week", lwd = 2,
       key = list(space = "top",
                  lines = list(col = c("darkorange", "darkgreen"), lty = c(1, 1), lwd = 2),
                  text = list(c("Confinement Group", "Pasture Group"), cex = 0.8)))
The length argument in panel.arrows changes the width of the arrowheads (the caps of the error bars). You can fiddle around with this parameter to get a width you like.
Notice that even though you had lsmeans2[order(lsmeans2$Time),] when specifying data =, the ordering of Time is still wrong. This is because Time is a factor, and R doesn't know you want it ordered by the numerical suffix after wk. This means that it will sort wk10 before wk2, because 1 is smaller than 2. You can use the little trick below to order it correctly:
# Order first by the character length, then by Time
Timelevels = levels(lsmeans2$Time)
Timelevels = Timelevels[order(nchar(Timelevels), Timelevels)]
# Reorder the levels
lsmeans2$Time = factor(lsmeans2$Time, levels = Timelevels)
# Create Subset
lsmeansSub = lsmeans2[order(lsmeans2$Time),]
xyplot(lsmean ~ Time, type = "b", group = Group, data = lsmeansSub,
       panel = function(x, y, ...){
         panel.arrows(x, y, x, lsmeansSub$upper.CL, length = 0.15,
                      angle = 90, col = c("darkorange", "darkgreen"))
         panel.arrows(x, y, x, lsmeansSub$lower.CL, length = 0.15,
                      angle = 90, col = c("darkorange", "darkgreen"))
         panel.xyplot(x, y, ...)
       },
       pch = 16, ylim = c(10, 35), col = c("darkorange", "darkgreen"),
       ylab = "FCM (kg/day)", xlab = "Week", lwd = 2,
       key = list(space = "top",
                  lines = list(col = c("darkorange", "darkgreen"), lty = c(1, 1), lwd = 2),
                  text = list(c("Confinement Group", "Pasture Group"), cex = 0.8)))
Note that even after reordering the levels of "Time", I still need to use the sorted data for the data = argument. This is because xyplot plots the points in the order they appear in the dataset, not the order of the factor levels.
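As an aside, since the levels here are known to be wk1 through wk10, the same reordering can also be done directly:
lsmeans2$Time = factor(lsmeans2$Time, levels = paste0("wk", 1:10))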
Is there a particular reason you want to use xyplot? ggplot2 is much easier to work with and prettier. Here's an example of what I think you want.
#load ggplot2
library(ggplot2)
#load data
d = structure(list(Group = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("Stall",
"Weide"), class = "factor"), Time = structure(c(1L, 1L, 2L, 2L,
3L, 3L, 4L, 4L, 5L, 5L, 6L, 6L, 7L, 7L, 8L, 8L, 9L, 9L, 10L,
10L), .Label = c("wk1", "wk10", "wk2", "wk3", "wk4", "wk5", "wk6",
"wk7", "wk8", "wk9"), class = "factor"), lsmean = c(26.23299,
25.12652, 21.8995, 18.45845, 25.38004, 22.90409, 25.02474, 24.05886,
23.9163, 22.23608, 23.97382, 18.1455, 24.48899, 19.40022, 24.98107,
19.712, 22.65167, 19.35759, 22.64381, 19.26869), SE = c(0.6460481,
0.670108, 0.6460589, 0.6679617, 0.6460168, 0.6679617, 0.6459262,
0.6679436, 0.6456643, 0.6678912, 0.6493483, 0.6677398, 0.6456643,
0.6697394, 0.6459262, 0.6677398, 0.6460168, 0.6678912, 0.6460481,
0.6679436), df = c(59L, 58L, 59L, 58L, 59L, 58L, 59L, 58L, 59L,
58L, 59L, 58L, 59L, 58L, 59L, 58L, 59L, 58L, 59L, 58L), lower.CL = c(24.19243,
23.00834, 19.8589, 16.34705, 23.33957, 20.79269, 22.98455, 21.94751,
21.87694, 20.1249, 21.92283, 16.0348, 22.44963, 17.28319, 22.94089,
17.60129, 20.6112, 17.24641, 20.60324, 17.15735), upper.CL = c(28.27356,
27.24471, 23.9401, 20.56986, 27.4205, 25.01549, 27.06492, 26.1702,
25.95565, 24.34726, 26.02481, 20.2562, 26.52834, 21.51724, 27.02126,
21.8227, 24.69214, 21.46877, 24.68438, 21.38004)), .Names = c("Group",
"Time", "lsmean", "SE", "df", "lower.CL", "upper.CL"), class = "data.frame", row.names = c(NA,
-20L))
#fix week
library(stringr)
library(magrittr)
d$Time %<>% as.character() %>% str_replace(pattern = "wk", replacement = "") %>% as.numeric()
#plot
ggplot(d, aes(Time, lsmean, color = Group, group = Group)) +
  geom_point() +
  geom_errorbar(aes(ymin = lower.CL, ymax = upper.CL), width = .2) +
  geom_line() +
  ylim(10, 35) +
  scale_x_continuous(name = "Week", breaks = 1:10) +
  ylab("FCM (kg/day)") +
  scale_color_discrete(label = c("Confinement Group", "Pasture Group"))
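A compact variant of the same plot (a sketch on the same d, after the Time conversion above) draws each point and its interval in one layer with geom_pointrange:
ggplot(d, aes(Time, lsmean, color = Group, group = Group)) +
  geom_pointrange(aes(ymin = lower.CL, ymax = upper.CL)) +
  geom_line() +
  scale_x_continuous(name = "Week", breaks = 1:10) +
  ylab("FCM (kg/day)")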

Add symbol on top of ggplot2 boxplots to indicate value of variable

Working with the following subset of a much larger dataset,
ex <- structure(list(transect_id = c(1L, 1L, 1L, 1L, 1L, 15L, 15L,
15L, 15L, 15L, 15L), number_f = c(2L, 2L, 2L, 2L, 2L, 0L, 0L,
0L, 0L, 0L, 0L), years_f = c(1L, 1L, 1L, 1L, 1L, 6L, 6L, 6L,
6L, 6L, 6L), b = c(5.036625862, 6.468666553, 8.028989792, 4.168409348,
5.790089607, 10.67796993, 9.371051788, 10.54364777, 6.904324532,
7.203606129, 9.1611166)), .Names = c("transect_id", "number_f",
"years_f", "b"), class = "data.frame", row.names = c(1L, 2L,
3L, 4L, 5L, 2045L, 2046L, 2047L, 2048L, 2049L, 2050L))
I've plotted the distributions of "b" for each of the groups indicated by "transect_id" and have colored them by "number_f", which I do here:
ggplot(aes(x=reorder(transect_id, b, FUN=median), y=b), data=ex) + geom_boxplot(aes(fill=as.factor(number_f))) + xlab('Transect ID')
What I need to do for each of the "transect_id" groups is stack symbols - asterisks or some other symbol - on top of each boxplot to provide an indication of the value of "years_f" that corresponds to each "transect_id". In the data subset above, "years_f" amounts to 1 and 6 for transect_ids 1 and 15, respectively. I'd like to see something like this, which I manually mocked up.
Also keep in mind that the dataset I'm working with is very large, so I'll need a loop or some other way of doing this automatically. Please note that I absolutely welcome other ideas for indicating the value of "years_f" that might not overburden the figure as much as stacked symbols, which will particularly be an issue for larger values of "years_f".
Try adding
annotate('text', x = c(1, 2), y = 3, label = paste0('Year_F =', unique(ex$years_f)))
to the end of your plot like so:
ggplot(aes(x = reorder(transect_id, b, FUN = median), y = b), data = ex) +
  geom_boxplot(aes(fill = as.factor(number_f))) + xlab('Transect ID') +
  annotate('text', x = c(1, 2), y = 3, label = paste0('Year_F =', unique(ex$years_f)))
To use it on a bigger dataset you would have to edit the x and y arguments, but this might be a decent alternative. A possibility for the y coordinate could be something like 0.9 * min(ex$b).
Edit: in response to your comment:
You could first count how many levels of transect_id there are, to specify x:
len.levels <- length(levels(as.factor(ex$transect_id)))
Then you could create a summary table of the unique years_f value by transect_id:
sum.table <- aggregate(years_f~reorder(ex$transect_id, ex$b, median),
data = ex, FUN = unique)
reorder(ex$transect_id, ex$b, median) years_f
1 1 1
2 15 6
and then plot as follows:
ggplot(aes(x = reorder(transect_id, b, FUN = median), y = b), data = ex) +
  geom_boxplot(aes(fill = as.factor(number_f))) + xlab('Transect ID') +
  annotate('text', x = 1:len.levels, y = .9 * min(ex$b),
           label = paste0('Year_F =', sum.table[,2]))
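If hand-picked x and y coordinates feel fragile on the full dataset, a hedged alternative is to precompute one label row per transect and let geom_text place it above each box (the lab helper is mine, not from the original answer):
lab <- merge(aggregate(b ~ transect_id, data = ex, FUN = max),
             unique(ex[, c("transect_id", "years_f")]))
ggplot(aes(x = reorder(transect_id, b, FUN = median), y = b), data = ex) +
  geom_boxplot(aes(fill = as.factor(number_f))) + xlab('Transect ID') +
  geom_text(data = lab, inherit.aes = FALSE,
            aes(x = factor(transect_id), y = b + 0.5,
                label = paste0('Year_F =', years_f)))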

Converting this ugly for-loop to something more R-friendly

Been using SO as a resource constantly for my work. Thanks for holding together such a great community.
I'm trying to do something kinda complex, and the only way I can think to do it right now is with a pair of nested for-loops (I know that's frowned upon in R)... I have records of three million-odd course enrollments: student UserID's paired with CourseID's. In each row, there's a bunch of data including start/end dates and scores and so forth. What I need to do is, for each enrollment, calculate the average score for that user across the courses she's taken before the course in the enrollment.
The code I'm using for the for-loop follows:
data$Mean.Prior.Score <- 0
for (i in as.numeric(rownames(data)) {
  sum <- 0
  count <- 0
  for (j in as.numeric(rownames(data[data$UserID == data$UserID[i],]))) {
    if (data$Course.End.Date[j] < data$Course.Start.Date[i]) {
      sum <- sum + data$Score[j]
      count <- count + 1
    }
  }
  if (count != 0)
    data$Mean.Prior.Score[i] <- sum / count
}
I'm pretty sure this would work, but it runs incredibly slowly... my data frame has over three million rows, but after a good 10 minutes of chugging, the outer loop has only run through 850 of the records. That seems way slower than the time complexity would suggest, especially given that each user has only 5 or 6 courses to her name on average.
Oh, and I should mention that I converted the date strings with as.POSIXct() before running the loop, so the date comparison step shouldn't be too terribly slow...
There's got to be a better way to do this... any suggestions?
Edit: As per mnel's request... finally got dput to play nicely. Had to add control = NULL. Here 'tis:
structure(list(Username = structure(1:20, .Label = c("100225",
"100226", "100228", "1013170", "102876", "105796", "106753",
"106755", "108568", "109038", "110150", "110200", "110350", "111873",
"111935", "113579", "113670", "117562", "117869", "118329"), class = "factor"),
User.ID = c(2313737L, 2314278L, 2314920L, 9708829L, 2325896L,
2315617L, 2314644L, 2314977L, 2330148L, 2315081L, 2314145L,
2316213L, 2317734L, 2314363L, 2361187L, 2315374L, 2314250L,
2361507L, 2325592L, 2360182L), Course.ID = c(2106468L, 2106578L,
2106493L, 5426406L, 2115455L, 2107320L, 2110286L, 2110101L,
2118574L, 2106876L, 2110108L, 2110058L, 2109958L, 2108222L,
2127976L, 2106638L, 2107020L, 2127451L, 2117022L, 2126506L
), Course = structure(c(1L, 7L, 10L, 15L, 11L, 19L, 4L, 6L,
3L, 12L, 2L, 9L, 17L, 8L, 20L, 18L, 13L, 16L, 5L, 14L), .Label = c("ACCT212_A",
"BIOS200_N", "BIS220_T", "BUSN115_A", "BUSN115_T", "CARD205_A",
"CIS211_A", "CIS275_X", "CIS438_S", "ENGL112_A", "ENGL112_B",
"ENGL227_K", "GM400_A", "GM410_A", "HUMN232_M", "HUMN432_W",
"HUMN445_A", "MATH100_X", "MM575_A", "PSYC110_Y"), class = "factor"),
Course.Start.Date = structure(c(1098662400, 1098662400, 1098662400,
1309737600, 1099267200, 1098662400, 1099267200, 1099267200,
1098662400, 1098662400, 1099267200, 1099267200, 1099267200,
1098662400, 1104105600, 1098662400, 1098662400, 1104105600,
1098662400, 1104105600), class = c("POSIXct", "POSIXt"), tzone = "GMT"),
Term.ID = c(12056L, 12056L, 12056L, 66282L, 12057L, 12056L,
12057L, 12057L, 12056L, 12056L, 12057L, 12057L, 12057L, 12056L,
13469L, 12056L, 12056L, 13469L, 12056L, 13469L), Term.Name = structure(c(2L,
2L, 2L, 4L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 3L, 2L,
2L, 3L, 2L, 3L), .Label = c("Fall 2004", "Fall 2004 Session A",
"Fall 2004 Session B", "Summer Session A 2011"), class = "factor"),
Term.Start.Date = structure(c(1L, 1L, 1L, 4L, 2L, 1L, 2L,
2L, 1L, 1L, 2L, 2L, 2L, 1L, 3L, 1L, 1L, 3L, 1L, 3L), .Label = c("2004-10-21",
"2004-10-28", "2004-12-27", "2011-06-26"), class = "factor"),
Score = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.125,
0, 0, 0, 0, 0), First.Course.Date = structure(c(1L, 1L, 1L,
4L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 3L, 1L, 1L, 3L,
1L, 3L), .Label = c("2004-10-25", "2004-11-01", "2004-12-27",
"2011-07-04"), class = "factor"), First.Term.Date = structure(c(1L,
1L, 1L, 4L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 3L, 1L,
1L, 3L, 1L, 3L), .Label = c("2004-10-21", "2004-10-28", "2004-12-27",
"2011-06-26"), class = "factor"), First.Timer = c(TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE), Course.Code = structure(c(1L,
6L, 9L, 13L, 9L, 17L, 4L, 5L, 3L, 10L, 2L, 8L, 15L, 7L, 18L,
16L, 11L, 14L, 4L, 12L), .Label = c("ACCT212", "BIOS200",
"BIS220", "BUSN115", "CARD205", "CIS211", "CIS275", "CIS438",
"ENGL112", "ENGL227", "GM400", "GM410", "HUMN232", "HUMN432",
"HUMN445", "MATH100", "MM575", "PSYC110"), class = "factor"),
Course.End.Date = structure(c(1L, 1L, 1L, 4L, 2L, 1L, 2L,
2L, 1L, 1L, 2L, 2L, 2L, 1L, 3L, 1L, 1L, 3L, 1L, 3L), .Label = c("2004-12-19",
"2005-02-27", "2005-03-26", "2011-08-28"), class = "factor")), .Names = c("Username",
"User.ID", "Course.ID", "Course", "Course.Start.Date", "Term.ID",
"Term.Name", "Term.Start.Date", "Score", "First.Course.Date",
"First.Term.Date", "First.Timer", "Course.Code", "Course.End.Date"
), row.names = c(NA, 20L), class = "data.frame")
I found that data.table worked well.
# Create some data.
library(data.table)
set.seed(1)
n=3e6
numCourses=5 # Average courses per student
data=data.table(UserID=as.character(round(runif(n,1,round(n/numCourses)))),course=1:n,Score=runif(n),CourseStartDate=as.Date('2000-01-01')+round(runif(n,1,365)))
data$CourseEndDate=data$CourseStartDate+round(runif(n,1,100))
setkey(data,UserID)
# test=function(CourseEndDate,Score,CourseStartDate) sapply(CourseStartDate, function(y) mean(Score[y>CourseEndDate]))
# I vastly reduced the number of comparisons with a better "test" function.
test2 = function(CourseEndDate, Score, CourseStartDate) {
  o.end = order(CourseEndDate)
  run.avg = cumsum(Score[o.end]) / seq_along(CourseEndDate)
  idx = findInterval(CourseStartDate, CourseEndDate[o.end])
  idx = ifelse(idx == 0, NA, idx)
  run.avg[idx]
}
system.time(data$MeanPriorScore<-data[,test2(CourseEndDate,Score,CourseStartDate),by=UserID]$V1)
# For three million courses, at an average of 5 courses per student:
# user system elapsed
# 122.06 0.22 122.45
Running a test to see if it looks the same as your code:
set.seed(1)
n=1e2
data=data.table(UserID=as.character(round(runif(n,1,1000))),course=1:n,Score=runif(n),CourseStartDate=as.Date('2000-01-01')+round(runif(n,1,365)))
data$CourseEndDate=data$CourseStartDate+round(runif(n,1,100))
setkey(data,UserID)
data$MeanPriorScore<-data[,test2(CourseEndDate,Score,CourseStartDate),by=UserID]$V1
data["246"]
# UserID course Score CourseStartDate CourseEndDate MeanPriorScore
#1: 246 54 0.4531314 2000-08-09 2000-09-20 0.9437248
#2: 246 89 0.9437248 2000-02-19 2000-03-02 NA
# A comparison with your for loop (slightly modified)
data$MeanPriorScore.old <- NA # Set to NA instead of zero for easy comparison.
# I think you forgot a bracket here. Also, there is no need to work with the rownames.
for (i in seq(nrow(data))) {
  sum <- 0
  count <- 0
  # I reduced the complexity of figuring out the vector to loop through.
  # It will result in the exact same thing if there are no rownames.
  for (j in which(data$UserID == data$UserID[i])) {
    if (data$CourseEndDate[j] <= data$CourseStartDate[i]) {
      sum <- sum + data$Score[j]
      count <- count + 1
    }
  }
  # I had to add "[i]" here. I think that is what you meant.
  if (count != 0) data$MeanPriorScore.old[i] <- sum / count
}
identical(data$MeanPriorScore,data$MeanPriorScore.old)
# [1] TRUE
This seems to be what you want
library(data.table)
# create a data.table object
DT <- data.table(data)
# key by userID
setkeyv(DT, 'userID')
# for each userID, where the Course.End.Date < Course.Start.Date
# return the mean score
# This is too simplistic
# DT[Course.End.Date < Course.Start.Date,
# list(Mean.Prior.Score = mean(Score)) ,
# by = list(userID)]
As per @joran's comment, this will be more complex than the code above.
This is only an outline of what I think a solution might entail. I'm going to use plyr just to illustrate the steps needed, for simplicity.
Let's just restrict ourselves to the case of one student. If we can calculate this for one student, extending it with some sort of split-apply will be trivial.
So let's suppose we have scores for a particular student, sorted by course end date:
d <- sample(seq(as.Date("2011-01-01"),as.Date("2011-01-31"),by = 1),100,replace = TRUE)
dat <- data.frame(date = sort(d),val = rnorm(100))
First, I think you'd need to summarise this by date and then calculate the cumulative running mean:
dat_sum <- ddply(dat,.(date),summarise,valsum = sum(val),n = length(val))
dat_sum$mn <- with(dat_sum,cumsum(valsum) / cumsum(n))
Finally, you'd merge these values back into the original data with the duplicate dates:
dat_merge <- merge(dat,dat_sum[,c("date","mn")])
I could probably write something that does this in data.table using an anonymous function to do all those steps, but I suspect others may be better able to do something that will be concise and fast. (In particular, I don't recommend actually tackling this with plyr, as I suspect it will still be extremely slow.)
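For reference, a sketch of those same three steps in data.table (per-date sums, a cumulative running mean, then a merge back), using the toy dat from above:
library(data.table)
DT <- as.data.table(dat)
dat_sum <- DT[, .(valsum = sum(val), n = .N), by = date][order(date)]
dat_sum[, mn := cumsum(valsum) / cumsum(n)]   # running mean up to each date
dat_merge <- merge(dat, dat_sum[, .(date, mn)], by = "date")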
I think something like this should work, though it'd be helpful to have test data with multiple courses per user. Also, you might need +1 on the start dates in findInterval to make the condition End.Date < Start.Date instead of <=.
# in the test data, one is POSIXct and the other a factor
data$Course.Start.Date = as.Date(data$Course.Start.Date)
data$Course.End.Date = as.Date(data$Course.End.Date)
data = data[order(data$Course.End.Date), ]
data$Mean.Prior.Score = ave(seq_along(data$User.ID), data$User.ID, FUN = function(i)
  c(NA, cumsum(data$Score[i]) / seq_along(i))[1L + findInterval(data$Course.Start.Date[i], data$Course.End.Date[i])])
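findInterval is what replaces the pairwise comparisons there: for a sorted second argument it returns, for each start date, how many end dates are less than or equal to it (hence the remark above about needing +1 for a strict inequality). A toy call shows the behavior:
findInterval(c(1, 5, 10), c(2, 4, 6))
# [1] 0 2 3   (how many of 2, 4, 6 are <= 1, <= 5, <= 10)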
With three million rows, maybe a database is helpful. Here is an SQLite example which I believe computes something similar to your for-loop:
# data.frame for testing
user <- sample.int(10000, 100)
course <- sample.int(10000, 100)
c_start <- sample(
  seq(as.Date("2004-01-01"), by = "3 months", length.out = 12),
  100, replace = TRUE
)
c_end <- c_start + as.difftime(11, units = "weeks")
c_idx <- sample.int(100, 1000, replace = TRUE)
enroll <- data.frame(
  user = sample(user, 1000, replace = TRUE),
  course = course[c_idx],
  c_start = as.character(c_start[c_idx]),
  c_end = as.character(c_end[c_idx]),
  score = runif(1000),
  stringsAsFactors = FALSE
)
# variant 1: for-loop
system.time({
  enroll$avg.p.score <- NA
  for (i in 1:nrow(enroll)) {
    sum <- 0
    count <- 0
    for (j in which(enroll$user == enroll$user[[i]]))
      if (enroll$c_end[[j]] < enroll$c_start[[i]]) {
        sum <- sum + enroll$score[[j]]
        count <- count + 1
      }
    if (count != 0) enroll$avg.p.score[[i]] <- sum / count
  }
})
# variant 2: sqlite
system.time({
  library(RSQLite)
  con <- dbConnect("SQLite", ":memory:")
  dbWriteTable(con, "enroll", enroll, overwrite = TRUE)
  sql <- paste("Select e.user, e.course, Avg(p.score)",
               "from enroll as e",
               "cross join enroll as p",
               "where e.user=p.user and p.c_end < e.c_start",
               "group by e.user, e.course;")
  res <- dbSendQuery(con, sql)
  dat <- fetch(res, n = -1)
})
On my machine, sqlite is ten times faster. If that is not enough, it would be possible to index the database.
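For example, an index on the join columns (dbExecute is the modern DBI call; older RSQLite scripts used dbSendQuery for statements):
dbExecute(con, "CREATE INDEX idx_enroll_user ON enroll (user, c_end)")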
I can't really test this, as your data doesn't appear to satisfy the inequality in any combination, but I'd try something like this:
library(plyr)
res <- ddply(data, .(User.ID), function(d) {
  with(subset(merge(d, d, by = NULL, suffixes = c(".i", ".j")),
              Course.End.Date.j < Course.Start.Date.i),
       c(Mean.Prior.Score = mean(Score.j)))
})
res$Mean.Prior.Score[is.nan(res$Mean.Prior.Score)] = 0
Here is how it works:
ddply: Group data by User.ID and execute function for each subset d of rows for one User.ID
merge: Create two copies of the data for one user, one with columns suffixed .i the other .j
subset: From this outer join, only select those matching the given inequality
mean: Compute the mean for the matched rows
c(…): Give a name to the resulting column
res: Will be a data.frame with columns User.ID and Mean.Prior.Score
is.nan: mean will return NaN for zero-length vectors, change these to zeros
I guess this might be reasonably fast if there are not too many rows for each User.ID. If this isn't fast enough, the data.table approaches mentioned in other answers might help.
Your code is a bit fuzzy on the desired output: you treat data$Mean.Prior.Score like a length-one variable, but assign to it in every iteration of the loop. I assume that this assignment is meant only for one row. Do you need means for every row of the data frame, or only one mean per user?
