Boxplot in for-loop over multiple columns in R

I have a dataset with 7 columns and want to plot each of the last 6 against the first one. Plotting a single column works, but I am apparently missing something when looping.
Here is my dataframe:
colnames(df) <- c("real", "est1", "est2", "est3", "est4", "est5", "est6")
head(df)
real est1 est2 est3 est4 est5 est6
1 6 1.040217e-05 7.693853e-05 0.0006782929 0.002676282 0.033385059 0.9631730251
2 6 1.065455e-05 7.880501e-05 0.0006947352 0.002740934 0.034161665 0.9623132055
3 5 1.037427e-03 7.607541e-03 0.0624143732 0.185340034 0.536009785 0.2075908392
4 1 2.345527e-01 4.855757e-01 0.2374464964 0.032691816 0.008846185 0.0008870667
5 5 3.506084e-04 2.585847e-03 0.0222474072 0.079120851 0.458854341 0.4368409455
6 3 1.710639e-03 1.247417e-02 0.0978889632 0.250555703 0.500355545 0.1370149767
and the code
boxplot( est1 ~ real, data=df, main="Estimated Probability for Category 1 Given the Real Categories")
works fine, but if I do exactly the same in a loop, it doesn't:
looper <- c("est1", "est2", "est3", "est4", "est5", "est6") #to get the column names to loop over
counter <- 0 # for the boxplot's title
for (i in all_of(looper)){
counter <- counter +1
boxplot( i ~ real, data=df,
main = paste("Estimated Probability for Category",counter,"Given the Real Categories")
)
}
I suppose it has to do with the way i is used. I tried i, "i", and i wrapped in backticks, and I always get one of the following errors:
For i: Error in stats::model.frame.default(formula = i ~ real, data = eval.m1_sim) :
variable lengths differ (found for 'real')
For "i" or `i`: Error in terms.formula(formula, data = data) :
invalid term in model formula
What am I missing?

You could go via column numbers:
# random example data as no reproducible example was given
df <- data.frame(
real = sample(1:4, 20, TRUE),
one = runif(20),
two = runif(20),
three = runif(20)
)
# graphics parameters so we see all at once
par(mfrow = c(3,1), mar = c(2, 2, 1, 1))
# the easiest way is through column numbers
for(column in 2:4)
boxplot(df[[column]] ~ df$real)
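If you would rather keep looping over column names, as in the question, one possible sketch is to build the formula from the string with reformulate(). The error in the question arises because i ~ real uses the character value of i, not the column it names. The data below are made up, with column names chosen to mirror the question.

```r
# Hypothetical example data; column names mirror the question's layout
set.seed(1)
df <- data.frame(
  real = sample(1:4, 20, TRUE),
  est1 = runif(20),
  est2 = runif(20),
  est3 = runif(20)
)
looper <- c("est1", "est2", "est3")

par(mfrow = c(3, 1), mar = c(2, 2, 2, 1))
for (k in seq_along(looper)) {
  # reformulate() turns the string into a formula: est1 ~ real, est2 ~ real, ...
  f <- reformulate("real", response = looper[k])
  boxplot(f, data = df,
          main = paste("Estimated Probability for Category", k,
                       "Given the Real Categories"))
}
```

Using seq_along() also removes the need for a separate counter variable for the title.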

Another option:
library(tidyverse)
df %>%
pivot_longer(-real) %>%
mutate(real = factor(real)) %>%
ggplot(aes(real, value)) +
geom_boxplot() +
facet_wrap(~name)

Related

glmmTMB with Ornstein-Uhlenbeck autocorrelation. Different autocorrelations for different phases of the data

I promise this is going to be simple and easy to read. Or else please shoot me dead in the comments section.
I've got a dependent variable which has temporal autocorrelation, which I would like to analyse with glmmTMB. Here's the tricky part: there are times when the dependent variable is MORE autocorrelated, and times when it is LESS autocorrelated.
Let's make up some data.
library(ggplot2)
# function to get time series of autocorrelated data. high autocorrelation = low value for diV parameter
yfunct = function(diV){
y <- c(rnorm(1,0,sd=1)) # Mean=0, standard deviation=1
for(i in 1:(100-1)) { # number of observations is 100
y[i+1]<-( y[i] + rnorm(1,0,sd=1))/diV #
}
y
}
# Create three groups with 6 phases. 3 high autocorrelation and 3 low autocorrelation
set.seed(11)
y = replicate (3, as.vector(replicate ( 3, c(yfunct(1), yfunct( 1.4)) ))) # high autocorrelation (divide by 1) followed by low autocorrelation (divide by 1.4)
# combine into dataframe.
df = data.frame ( y = as.vector ( y) ,
t = rep ( 1:600,3),
group = as.factor ( rep(1:3,each = 600)),
phase = rep(c("highAC","lowAC"),each=100)) # labels follow the simulation order: yfunct(1) first, then yfunct(1.4)
# plot
ggplot ( df, aes ( x=t , y= y, color= group))+
geom_line()+
geom_vline( xintercept = seq( 0,600,100))
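As a quick sanity check of the claim that a low diV gives high autocorrelation, the lag-1 sample autocorrelation of the two regimes can be compared. This is only a sketch: the n argument and the lag1() helper are additions for illustration, and the values are averaged over 50 simulated series per regime.

```r
# Same generator as in the question; n is a hypothetical extra argument
yfunct <- function(diV, n = 100) {
  y <- rnorm(1)
  for (i in 1:(n - 1)) y[i + 1] <- (y[i] + rnorm(1)) / diV
  y
}

# Lag-1 sample autocorrelation of one series (hypothetical helper)
lag1 <- function(y) acf(y, lag.max = 1, plot = FALSE)$acf[2]

set.seed(11)
ac_high <- mean(replicate(50, lag1(yfunct(1))))    # diV = 1
ac_low  <- mean(replicate(50, lag1(yfunct(1.4))))  # diV = 1.4
c(high = ac_high, low = ac_low)
```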
OK, so now I'll prepare for and perform the statistical analysis. First I add a dummy independent variable and strip out some of the rows. The dataset I'm (really) working with has lots of missing rows, which is why I'm using Ornstein-Uhlenbeck autocorrelation. https://cran.r-project.org/web/packages/glmmTMB/vignettes/covstruct.html
My method for dealing with the autocorrelation differences between the phases is to combine group number with phase type ("high" or "low" autocorrelation), and then put this combined variable in the ou() autocorrelation function.
Is this correct? Is there something I haven't considered? Your wisdom will be hugely appreciated.
Dan
library(glmmTMB)
# set the seed
set.seed(11)
# add dummy independent variable
df$x = runif ( nrow( df))
# subsample the dataset
df = df [ sample ( 1:nrow(df), 300),]
# combine group and phase
df$comb = apply ( cbind ( df$group , df$phase) ,1, function(x)paste(x, collapse = "-"))
# numFactor the times
df$t = numFactor(df$t)
# rescale y to lie strictly between 0 and 1, as the beta family requires
df$y = df$y + abs( min(df$y))+0.001
min(df$y)
df$y = df$y / (max(df$y)+0.001)
max(df$y)
# Model
mod1 <- glmmTMB(y~x
+ (1|group)
+ ou(t + 0 |comb)
, data=df
, beta_family(link="logit"))
# Summary
summary(mod1)
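The combined group-phase variable described above can also be built without apply(), since paste() is vectorised. A minimal sketch on toy vectors that stand in for df$group and df$phase (the toy values are assumptions, not the question's data):

```r
# Hypothetical toy columns standing in for df$group and df$phase
group <- factor(rep(1:3, each = 4))
phase <- rep(c("highAC", "lowAC"), times = 6)

# paste() works element-wise, so no row-wise apply() is needed
comb <- factor(paste(group, phase, sep = "-"))
levels(comb)
table(comb)
```

Storing the result as a factor makes the six grouping levels explicit before it is used on the right-hand side of the ou() term.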

MarkovChain package error in R when creating attribution model : Error! Rows of transition matrix do not some one

I am following this tutorial on creating a Markov chain for attribution modelling in R.
However, when I run it I get an error. A solution was proposed on this page, but it still gives the same error. I have also tried two different workarounds to get past the error, but both result in runtime errors and the output looks nothing like the one in the tutorial. The workarounds are documented at the bottom.
Any help with this would be highly appreciated.
library(dplyr)
library(reshape2)
library(ggplot2)
library(ggthemes)
library(ggrepel)
library(RColorBrewer)
library(ChannelAttribution)
library(markovchain)
##### simple example #####
# creating a data sample
df1 <- data.frame(path = c('c1 > c2 > c3', 'c1', 'c2 > c3'), conv = c(1, 0, 0), conv_null = c(0, 1, 1))
# calculating the model
mod1 <- markov_model(df1,
var_path = 'path',
var_conv = 'conv',
var_null = 'conv_null',
out_more = TRUE)
# extracting the results of attribution
df_res1 <- mod1$result
# extracting a transition matrix
df_trans1 <- mod1$transition_matrix
df_trans1 <- dcast(df_trans1, channel_from ~ channel_to, value.var = 'transition_probability')
### plotting the Markov graph ###
df_trans <- mod1$transition_matrix
# adding dummies in order to plot the graph
df_dummy <- data.frame(channel_from = c('(start)', '(conversion)', '(null)'),
channel_to = c('(start)', '(conversion)', '(null)'),
transition_probability = c(0, 1, 1))
df_trans <- rbind(df_trans, df_dummy)
# ordering channels
df_trans$channel_from <- factor(df_trans$channel_from,
levels = c('(start)', '(conversion)', '(null)', 'c1', 'c2', 'c3'))
df_trans$channel_to <- factor(df_trans$channel_to,
levels = c('(start)', '(conversion)', '(null)', 'c1', 'c2', 'c3'))
df_trans <- dcast(df_trans, channel_from ~ channel_to, value.var = 'transition_probability')
# creating the markovchain object
trans_matrix <- matrix(data = as.matrix(df_trans[, -1]),
nrow = nrow(df_trans[, -1]), ncol = ncol(df_trans[, -1]),
dimnames = list(c(as.character(df_trans[, 1])),
c(colnames(df_trans[, -1]))))
trans_matrix[is.na(trans_matrix)] <- 0
trans_matrix1 <- new("markovchain", transitionMatrix = trans_matrix)
# plotting the graph
plot(trans_matrix1, edge.arrow.size = 0.35)
The error occurs on this line
trans_matrix1 <- new("markovchain", transitionMatrix = trans_matrix)
Error:
Aggregation function missing: defaulting to length
Error in validObject(.Object) :
invalid class “markovchain” object: Error! Rows of transition matrix do not some one
EDIT:
I have tried the following; both give runtime errors:
user2554330's solution of dividing each row by its sum, using
trans_matrix <- trans_matrix/rowSums(trans_matrix)
Manually deleting the last row and column, which are both NA, so that all rows sum to 1
The message contains an English error: instead of "do not some one" it should say "do not sum to one". The matrix you are using looks like this:
> trans_matrix
(start) (conversion) (null) NA
(start) 1 0 0 2
(conversion) 0 1 0 0
(null) 0 0 1 0
<NA> 0 1 2 2
To fix this, you could divide each row by its sum using
trans_matrix <- trans_matrix/rowSums(trans_matrix)
After that there's no error. Whether that's the right matrix, I don't know.
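To see what that division does, here is a small base R sketch using the matrix printed above (the "NA" row and column labels are kept as plain strings). It only demonstrates the row normalisation; it makes no claim that the result is the intended transition matrix.

```r
states <- c("(start)", "(conversion)", "(null)", "NA")
trans_matrix <- matrix(c(1, 0, 0, 2,
                         0, 1, 0, 0,
                         0, 0, 1, 0,
                         0, 1, 2, 2),
                       nrow = 4, byrow = TRUE,
                       dimnames = list(states, states))

rs <- rowSums(trans_matrix)
stopifnot(all(rs > 0))   # a row of zeros could not be normalised this way

# rs is recycled down the columns, so every entry in row i is divided by rs[i]
P <- trans_matrix / rs
rowSums(P)
```

The counts in the original matrix come from dcast() defaulting to length, as the "Aggregation function missing" message says, so the normalised values are proportions of counts rather than the model's transition probabilities.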

Rolling regression and prediction with lm() and predict()

I need to apply lm() to an enlarging subset of my dataframe dat, while making prediction for the next observation. For example, I am doing:
fit model predict
---------- -------
dat[1:3, ] dat[4, ]
dat[1:4, ] dat[5, ]
. .
. .
dat[-1, ] dat[nrow(dat), ]
I know what I should do for a particular subset (related to this question: predict() and newdata - How does this work?). For example to predict the last row, I do
dat1 = dat[1:(nrow(dat)-1), ]
dat2 = dat[nrow(dat), ]
fit = lm(log(clicks) ~ log(v1) + log(v12), data=dat1)
predict.fit = predict(fit, newdata=dat2, se.fit=TRUE)
How can I do this automatically for all subsets, and potentially extract what I want into a table?
From fit, I'd need the summary(fit)$adj.r.squared;
From predict.fit I'd need predict.fit$fit value.
Thanks.
(Efficient) solution
This is what you can do:
p <- 3 ## number of parameters in lm()
n <- nrow(dat) - 1
## a function to return what you desire for subset dat[1:x, ]
bundle <- function(x) {
fit <- lm(log(clicks) ~ log(v1) + log(v12), data = dat, subset = 1:x, model = FALSE)
pred <- predict(fit, newdata = dat[x+1, ], se.fit = TRUE)
c(summary(fit)$adj.r.squared, pred$fit, pred$se.fit)
}
## rolling regression / prediction
result <- t(sapply(p:n, bundle))
colnames(result) <- c("adj.r2", "prediction", "se")
Note I have done several things inside the bundle function:
I have used the subset argument to select the rows to fit;
I have used model = FALSE so the model frame is not stored, which saves memory.
Overall, there is no explicit loop, but sapply is used.
Fitting starts from p, the minimum number of observations required to fit a model with p coefficients;
Fitting terminates at nrow(dat) - 1, as we need at least the final row for prediction.
Test
Example data (with 30 "observations")
dat <- data.frame(clicks = runif(30, 1, 100), v1 = runif(30, 1, 100),
v12 = runif(30, 1, 100))
Applying code above gives results (27 rows in total, truncated output for 5 rows)
adj.r2 prediction se
[1,] NaN 3.881068 NaN
[2,] 0.106592619 3.676821 0.7517040
[3,] 0.545993989 3.892931 0.2758347
[4,] 0.622612495 3.766101 0.1508270
[5,] 0.180462206 3.996344 0.2059014
The first column is the adjusted R-squared of the fitted model, the second column is the prediction, and the third is its standard error. The first value of adj.r2 is NaN because the first model has 3 coefficients for 3 data points, so no sensible statistic is available. The same happens to se: the model fits those three points exactly, leaving all residuals at 0, so the uncertainty of the prediction cannot be estimated.
I just made up some random data to use for this example. I'm calling the object data because that was what it was called in the question at the time that I wrote this solution (call it anything you like).
(Efficient) Solution
data <- data.frame(v1=rnorm(100),v2=rnorm(100),clicks=rnorm(100))
data1 = data[1:(nrow(data)-1), ]
data2 = data[nrow(data), ]
for(i in 3:nrow(data)){
nam <- paste("predict", i, sep = "")
nam1 <- paste("fit", i, sep = "")
nam2 <- paste("summary_fit", i, sep = "")
fit = lm(clicks ~ v1 + v2, data=data[1:i,])
tmp <- predict(fit, newdata=data2, se.fit=TRUE)
tmp1 <- fit
tmp2 <- summary(fit)
assign(nam, tmp)
assign(nam1, tmp1)
assign(nam2, tmp2)
}
All of the results you want will be stored in the data objects this creates.
For example:
> summary_fit10$r.squared
[1] 0.3087432
You mentioned in the comments that you'd like a table of results. You can programmatically create tables of results from the 3 types of output files like this:
rm(data,data1,data2,i,nam,nam1,nam2,fit,tmp,tmp1,tmp2)
frames <- ls()
frames.fit <- frames[1:98] #change index or use pattern matching as needed
frames.predict <- frames[99:196]
frames.sum <- frames[197:294]
fit.table <- data.frame(intercept=NA,v1=NA,v2=NA,sourcedf=NA)
for(i in 1:length(frames.fit)){
tmp <- get(frames.fit[i])
fit.table <- rbind(fit.table,c(tmp$coefficients[[1]],tmp$coefficients[[2]],tmp$coefficients[[3]],frames.fit[i]))
}
fit.table
> fit.table
intercept v1 v2 sourcedf
2 -0.0647017971121678 1.34929652763687 -0.300502017324518 fit10
3 -0.0401617893034109 -0.034750571912636 -0.0843076273486442 fit100
4 0.0132968863522573 1.31283604433593 -0.388846211083564 fit11
5 0.0315113918953643 1.31099122173898 -0.371130010135382 fit12
6 0.149582794027583 0.958692838785998 -0.299479715938493 fit13
7 0.00759688947362175 0.703525856001948 -0.297223988673322 fit14
8 0.219756240025917 0.631961979610744 -0.347851129205841 fit15
9 0.13389223748979 0.560583832333355 -0.276076134872669 fit16
10 0.147258022154645 0.581865844000838 -0.278212722024832 fit17
11 0.0592160359650468 0.469842498721747 -0.163187274356457 fit18
12 0.120640756525163 0.430051839741539 -0.201725012088506 fit19
13 0.101443924785995 0.34966728554219 -0.231560038360121 fit20
14 0.0416637001406594 0.472156988919337 -0.247684504074867 fit21
15 -0.0158319749710781 0.451944113682333 -0.171367482879835 fit22
16 -0.0337969739950376 0.423851304105399 -0.157905431162024 fit23
17 -0.109460218252207 0.32206642419212 -0.055331391802687 fit24
18 -0.100560410735971 0.335862465403716 -0.0609509815266072 fit25
19 -0.138175283219818 0.390418411384468 -0.0873106257144312 fit26
20 -0.106984355317733 0.391270279253722 -0.0560299858019556 fit27
21 -0.0740684978271464 0.385267011513678 -0.0548056844433894 fit28
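An alternative sketch, not part of either answer: collecting each step's results in a list and binding them once avoids assign() and the index bookkeeping with ls(). The data are simulated, and the window starts at 4 rows (an assumption) so that the first fit has a residual degree of freedom.

```r
set.seed(42)
dat <- data.frame(v1 = rnorm(30), v2 = rnorm(30), clicks = rnorm(30))

# Fit on rows 1..i, predict row i + 1, and keep one summary row per step
rows <- lapply(4:(nrow(dat) - 1), function(i) {
  fit  <- lm(clicks ~ v1 + v2, data = dat[1:i, ])
  pred <- predict(fit, newdata = dat[i + 1, ], se.fit = TRUE)
  data.frame(n_fit      = i,
             adj.r2     = summary(fit)$adj.r.squared,
             prediction = unname(pred$fit),
             se         = pred$se.fit)
})
results <- do.call(rbind, rows)
head(results)
```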

How to add a column of fitted values to a data frame by group?

Say I have a data frame like this:
X <- data_frame(
x = rep(seq(from = 1, to = 10, by = 1), 3),
y = 2*x + rnorm(length(x), sd = 0.5),
g = rep(LETTERS[1:3], each = length(x)/3))
How can I fit a regression y~x grouped by variable g and add the values from the fitted and resid generic methods to the data frame?
I know I can do:
A <- X[X$g == "A",]
mA <- with(A, lm(y ~ x))
A$fit <- fitted(mA)
A$res <- resid(mA)
B <- X[X$g == "B",]
mB <- with(B, lm(y ~ x))
B$fit <- fitted(mB)
B$res <- resid(mB)
C <- X[X$g == "C",]
mC <- with(C, lm(y ~ x))
C$fit <- fitted(mC)
C$res <- resid(mC)
And then rbind(A, B, C). However, in real life I am not using lm (I'm using rqss in the quantreg package). The method occasionally fails, so I need error handling, where I'd like to place NA in all the rows of a group that failed. Also, there are way more than 3 groups, so I don't want to just keep copying and pasting code for each group.
I tried using dplyr with do but didn't make any progress. I was thinking it might be something like:
make_qfits <- function(data) {
data %>%
group_by(g) %>%
do(failwith(NULL, rqss), formula = y ~ qss(x, lambda = 3))
}
Would this be easy to do by that approach? Is there another way in base R?
You can use do on grouped data for this task, fitting the model in each group in do and putting the model residuals and fitted values into a data.frame. To add these to the original data, just include the . that represents the data going into do in the output data.frame.
In your simple case, this would look like this:
X %>%
group_by(g) %>%
do({model = rqss(y ~ qss(x, lambda = 3), data = .)
data.frame(., residuals = resid.rqss(model), fitted = fitted(model))
})
Source: local data frame [30 x 5]
Groups: g
x y g residuals fitted
1 1 1.509760 A -1.368963e-08 1.509760
2 2 3.576973 A -8.915993e-02 3.666133
3 3 6.239950 A 4.174453e-01 5.822505
4 4 7.978878 A 4.130033e-09 7.978878
5 5 10.588367 A 4.833475e-01 10.105020
6 6 11.786445 A -3.807876e-01 12.167232
7 7 14.646221 A 4.167763e-01 14.229445
8 8 15.938253 A -3.534045e-01 16.291658
9 9 19.114927 A 7.610560e-01 18.353871
10 10 19.574449 A -8.416343e-01 20.416083
.. .. ... . ... ...
Things will look more complicated if you need to catch errors. Here is what it would look like using try and filling the residuals and fitted columns with NA if the fit attempt for a group results in an error.
X[9:30,] %>%
group_by(g) %>%
do({model = try(rqss(y ~ qss(x, lambda = 3), data = .))
# inherits() is safer than class(x) == "..." and the fit is reused, not repeated
if(inherits(model, "try-error")){
data.frame(., residuals = NA, fitted = NA)
}
else{
data.frame(., residuals = resid.rqss(model), fitted = fitted(model))
}
})
Source: local data frame [22 x 5]
Groups: g
x y g residuals fitted
1 9 19.114927 A NA NA
2 10 19.574449 A NA NA
3 1 2.026199 B -4.618675e-01 2.488066
4 2 4.399768 B 1.520739e-11 4.399768
5 3 6.167690 B -1.437800e-01 6.311470
6 4 8.642481 B 4.193089e-01 8.223172
7 5 10.255790 B 1.209160e-01 10.134874
8 6 12.875674 B 8.290981e-01 12.046576
9 7 13.958278 B -4.803891e-10 13.958278
10 8 15.691032 B -1.789479e-01 15.869980
.. .. ... . ... ...
For the lm models you could try
library(nlme) # lmList to do lm by group
library(ggplot2) # fortify to get out the fitted/resid data
do.call(rbind, lapply(lmList(y ~ x | g, data=X), fortify))
This gives you the residual and fitted data in ".resid" and ".fitted" columns as well as a bunch of other fit data. By default the rownames will be prefixed with the letters from g.
With the rqss models that might fail
do.call(rbind, lapply(split(X, X$g), function(z) {
fit <- tryCatch({
rqss(y ~ x, data=z)
}, error=function(e) NULL)
if (is.null(fit)) data.frame(resid=numeric(0), fitted=numeric(0))
else data.frame(resid=fit$resid, fitted=fitted(fit))
}))
Here's a version that works with base R:
modelit <- function(df) {
mB <- with(df, lm(y ~ x, na.action = na.exclude))
df$fit <- fitted(mB)
df$res <- resid(mB)
return(df)
}
dfs.with.preds <- lapply(split(X, as.factor(X$g)), modelit)
output <- Reduce(function(x, y) { rbind(x, y) }, dfs.with.preds)
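The base R version above has no error handling. A possible sketch that keeps every row and fills NA for a group whose fit fails is shown below. It uses lm() as a stand-in for rqss() so that it runs without quantreg, and the failure is provoked artificially by setting all y values in group B to NA; both are assumptions for illustration.

```r
set.seed(1)
X <- data.frame(x = rep(1:10, 3), g = rep(LETTERS[1:3], each = 10))
X$y <- 2 * X$x + rnorm(nrow(X), sd = 0.5)
X$y[X$g == "B"] <- NA   # hypothetical: makes lm() fail for group B

fit_group <- function(d) {
  # Return NULL instead of stopping when the fit throws an error
  fit <- tryCatch(lm(y ~ x, data = d), error = function(e) NULL)
  if (is.null(fit)) {
    d$fit <- NA_real_
    d$res <- NA_real_
  } else {
    d$fit <- fitted(fit)
    d$res <- resid(fit)
  }
  d
}

out <- do.call(rbind, lapply(split(X, X$g), fit_group))
```

Swapping the lm() call for the real model call should be the only change needed, provided fitted() and resid() return one value per row for that model class.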

Apply grouped model group-wise

My question is very similar to this one, but the problem I am facing has a twist that those answers do not address. Specifically, I am estimating a spatial model, y=rho * lw * y + X *beta. Because the observations are related by the matrix lw, I must apply the model to the entire X matrix simultaneously. Because those answers operate row-wise, they do not apply.
Here is MWE data, consisting of twenty points across three groups and a spatial weights matrix:
library(spdep)
#Coordinates
pointcoords <- data.frame(x = runif(n=20, min =10, max = 100), y = runif(n=20, min = 10, max = 100), ID = as.character(1:20))
pointsSP <- SpatialPoints(pointcoords[,1:2])
# Weights matrix
lw <- nb2listw(knn2nb(knearneigh(pointsSP, k = 4, RANN = FALSE),
row.names = pointcoords$ID))
# Data
MyData <- data.frame(ID = rep(1:20, each = 3),
Group = rep(1:3, times = 20),
DV = rnorm(60),IV = rnorm(60))
I can estimate the models by Group with dplyr
library(dplyr)
models <- MyData %>% group_by(Group) %>%
do(lm = lm(DV ~ IV, data = .),
sar = lagsarlm(DV ~ IV, data = ., listw = lw))
Predicting to new data with this answer operates on a row-wise basis, working fine for the lm objects,
MyData2 <- data.frame(ID = rep(1:20, each = 3),
Group = rep(1:3, times = 20),
IV = rnorm(60))
MyData2 %>% left_join(models) %>% rowwise %>%
mutate(lmPred = predict(lm, newdata = list("IV" = IV))) %>% head()
#Joining by: "Group"
#Source: local data frame [6 x 6]
#Groups:
# ID Group IV lm sar lmPred
#1 1 1 -0.8930794 <S3:lm> <S3:sarlm> -0.21378814
#2 1 2 -1.6637963 <S3:lm> <S3:sarlm> 0.42547796
#3 1 3 0.5243841 <S3:lm> <S3:sarlm> -0.23372996
#4 2 1 -0.1956969 <S3:lm> <S3:sarlm> -0.20860280
#5 2 2 0.8149920 <S3:lm> <S3:sarlm> 0.14771431
#6 2 3 -0.3000439 <S3:lm> <S3:sarlm> 0.05082524
But not for the sar models:
MyData2 %>% left_join(models) %>% rowwise %>%
mutate(sarPred = predict(sar, newdata = list("IV" = IV), listw=lw)) %>% head()
#Joining by: "Group"
#Error in if (nrow(newdata) != length(listw$neighbours)) stop("mismatch between newdata and spatial weights") :
argument is of length zero
I think there should be a better way of doing this, without joining the model to every row. Also, creating a list object for newdata won't work if you have several or changing predictor variables. It seems that the dplyr way should be something like this:
MyData2 %>% group_by(Group) %>%
mutate(sarPred = predict(models$sar[[Group]], newdata = ., listw=lw))
But the [[Group]] index isn't quite right.
I ended up doing this with do in dplyr, going through the models data.frame rowwise. I believe it does what you want, although the output doesn't contain the new data used for predictions. I did add in Group to the output, though, as it seemed necessary to keep groups separated.
models %>%
do(data.frame(Group = .$Group,
predlm = predict(.$lm, newdata = filter(MyData2, Group == .$Group)),
predsar = predict(.$sar, newdata = filter(MyData2, Group == .$Group) , listw = lw)))
EDIT
Playing around with adding the explanatory variable into the output data.frame. The following works, although there is likely a better way to do this.
models %>%
do(data.frame(Group = .$Group, IV = select(filter(MyData2, Group == .$Group), IV),
predlm = predict(.$lm, newdata = filter(MyData2, Group == .$Group)),
predsar = predict(.$sar, newdata = filter(MyData2, Group == .$Group) , listw = lw)))
I'm putting this out there because it does do what I want it to, even if it needs to use a for loop (gasp)
predictobj <- list()
for(i in models$Group){
predictobj[[i]] <- predict.sarlm(models$sar[[i]],
newdata = filter(MyData2, Group == i),
listw = lw)
}
Anybody have a dplyr solution?
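A base R sketch of the same group-at-a-time idea, not a dplyr one: split both data sets by Group, then pair each model with its whole block of new data using Map(). To stay runnable without spdep it uses lm() in place of lagsarlm(), which is an assumption; for the spatial model the predict() call would also need listw = lw, as in the loop above.

```r
set.seed(1)
MyData  <- data.frame(Group = rep(1:3, times = 20), DV = rnorm(60), IV = rnorm(60))
MyData2 <- data.frame(Group = rep(1:3, times = 20), IV = rnorm(60))

# One model per group, stored in a list named by group
models <- lapply(split(MyData, MyData$Group), function(d) lm(DV ~ IV, data = d))

# New data split the same way, so each model sees its whole group at once
new_by_group <- split(MyData2, MyData2$Group)
preds <- Map(function(m, nd) cbind(nd, pred = predict(m, newdata = nd)),
             models, new_by_group[names(models)])
result <- do.call(rbind, preds)
head(result)
```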
