Calculating residuals from cross-validation in R

I am trying to calculate the residuals from a random forest cross-validation. I am using the Carseats data set in R, with "Sales" as the response variable, and I want to feed the residuals into a support vector machine. Here is my code so far:
set.seed(1)
library(ISLR)
data(Carseats)
head(Carseats)
  Sales CompPrice Income Advertising Population Price ShelveLoc
1  9.50       138     73          11        276   120       Bad
2 11.22       111     48          16        260    83      Good
3 10.06       113     35          10        269    80    Medium
4  7.40       117    100           4        466    97    Medium
5  4.15       141     64           3        340   128       Bad
6 10.81       124    113          13        501    72       Bad
  Age Education Urban  US sales
1  42        17   Yes Yes   Yes
2  65        10   Yes Yes   Yes
3  59        12   Yes Yes   Yes
4  55        14   Yes Yes   Yes
5  38        13   Yes  No   Yes
6  78        16    No Yes   Yes
## Random forest
# cross-validation to pick the best mtry from 3, 5, 10
library(randomForest)
cv.carseats <- rfcv(trainx = Carseats[, -1], trainy = Carseats[, 1],
                    cv.fold = 5, step = 0.9)
cv.carseats
with(cv.carseats, plot(n.var, error.cv, type = "o"))
# from the graph it would appear mtry = 5 produces the lowest error
## SVM
library(e1071)
# cross-validation to pick the best gamma
# (the candidate values must go in `ranges`; tune() does not vary a
#  plain gamma = c(...) argument)
tune.out <- tune(svm, Sales ~ ., data = Carseats,
                 ranges = list(gamma = c(0.01, 0.1, 1, 10)),
                 tunecontrol = tune.control(cross = 5))
I will replace "Sales" in the SVM with the residuals from the random-forest cross-validation, but I am having a difficult time calculating those residuals. Any help is greatly appreciated! Thank you!
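For reference, rfcv() only returns the aggregate CV error for each number of variables, not per-observation predictions, so the out-of-fold residuals have to be computed by hand. A minimal sketch (the fold loop is illustrative, not from the post, and mtry = 5 is taken from the tuning step above):

library(randomForest)
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(Carseats)))
cv.pred <- numeric(nrow(Carseats))
for (i in 1:k) {
  # fit on the other k-1 folds, predict the held-out fold
  fit <- randomForest(Sales ~ ., data = Carseats[folds != i, ], mtry = 5)
  cv.pred[folds == i] <- predict(fit, newdata = Carseats[folds == i, ])
}
cv.resid <- Carseats$Sales - cv.pred   # out-of-fold residuals

cv.resid could then stand in for "Sales" as the response in the tune(svm, ...) call.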

Related

Implementation of the Difference in Difference (DID) model in R with panel data

I am trying to implement the difference-in-differences (DiD) model in R in order to analyze the effect of a regulation on households.
I have panel data, meaning that I have observations for different households at different periods.
Let's say (for example) that I have the data below:
Name  Europe?  2000  2001  2002  2003  2004
A     YES        56    84    95    32    15
B     NO         63    45     9    25    14
C     NO         47    72   123    54    95
D     YES        28    64   874    14   358
E     YES        45    68    48    32   674
If the regulation came into force in 2003 only in Europe, how can I implement this using R, please?
I know that I have to create one dummy variable for the group (European or not) and another one for the years from which the regulation was in force, but how does it work exactly?
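A minimal sketch of the standard two-way DiD regression, assuming the wide table above has been typed in as df (the reshaped column names value, treated and post are illustrative):

library(tidyr)
df <- data.frame(Name   = c("A", "B", "C", "D", "E"),
                 Europe = c("YES", "NO", "NO", "YES", "YES"),
                 y2000  = c(56, 63, 47, 28, 45),
                 y2001  = c(84, 45, 72, 64, 68),
                 y2002  = c(95, 9, 123, 874, 48),
                 y2003  = c(32, 25, 54, 14, 32),
                 y2004  = c(15, 14, 95, 358, 674))
long <- pivot_longer(df, starts_with("y"), names_to = "year",
                     names_prefix = "y", values_to = "value")
long$year    <- as.integer(long$year)
long$treated <- as.integer(long$Europe == "YES")  # group dummy
long$post    <- as.integer(long$year >= 2003)     # post-regulation dummy
did <- lm(value ~ treated * post, data = long)
summary(did)  # the DiD estimate is the treated:post coefficient

With only five units this is purely illustrative; with real panel data you would typically also cluster standard errors by household.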

What type of modelling can be used when there are more than 100 levels in one categorical variable?

The first few observations of the dataframe are below. All variables are categorical, and some have more than 100 levels.
ac2.surcat ac2.typeonenum ac2.countrynum ac2.sumnewnum
1 Average survival rate 248 556 16
2 Poor survival rate 82 375 12
3 Poor survival rate 73 104 16
4 Below average survival rate 252 <NA> 6
5 Poor survival rate 252 200 11
6 Below average survival rate 252 83 19
7 Poor survival rate 252 200 12
8 Poor survival rate 210 111 5
9 Poor survival rate 252 178 19
10 Poor survival rate 252 178 18
11 Poor survival rate 230 200 5
I know that randomForest in R handles categorical predictors with only up to 52 levels. This is already simplified data: levels have been reduced from the 4000s to the 100s and cannot be simplified further.
The dependent variable is ac2$surcat (the first column).
This is air-crash data. The last three columns are 'type of aircraft', 'country' and 'type of crash' respectively (the independent variables).
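One common workaround is to lump the rarest levels of each high-cardinality factor into a single "other" category before fitting. A hedged sketch (the lump_levels helper and the keep = 50 cutoff are illustrative, not from the question):

# keep the `keep` most frequent levels, pool the rest as "other"
lump_levels <- function(x, keep = 50) {
  x <- as.factor(x)
  top <- names(sort(table(x), decreasing = TRUE))[seq_len(min(keep, nlevels(x)))]
  factor(ifelse(x %in% top, as.character(x), "other"))
}
# illustrative application; adjust to the actual column name
ac2$country_lumped <- lump_levels(ac2$ac2.countrynum, keep = 50)

forcats::fct_lump_n() packages the same idea; alternatives include tree implementations without the 52-level limit (e.g. ranger) or frequency/target encoding of the factor.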

meta regression and bubble plot with metafor package in R

I am working on a meta-regression of the association between year and medication prevalence with the 'metafor' package.
The model I used is 'rma.glmm', a mixed-effects model on the logit scale, from the 'metafor' package.
My R script is below:
dat <- escalc(xi = A, ni = Sample, measure = "PLO")
print(dat)
model_A <- rma.glmm(xi = A, ni = Sample, measure = "PLO", mods = ~ year)
print(model_A)
I did get significant results, so I made a bubble plot for this model. But I found there is no way to produce a bubble plot straight from an 'rma.glmm' fit, so I did something alternative:
wi <- 1/dat$vi
plot(year, transf.ilogit(dat$yi), cex = wi)
Apparently I got some 'crazy' results. My questions are:
1. How could I weight the points in the bubble plot by study sample size? The points should be proportional to the study weight. Here I used 'wi <- 1/dat$vi', where vi is the sampling variance I got from 'escalc()', but it doesn't seem right.
2. Is my model correct for investigating the association between year and medication prevalence? When I tried the 'rma' model I got totally different results.
3. Is there any alternative way to produce the bubble plot? I also tried:
percentage <- A/Sample
plot(year, percentage)
The database is below:
study year Sample A
study 1 2007 414 364
study 2 2010 142 99
study 3 1999 15 0
study 4 2000 17 0
study 5 2001 20 0
study 6 2002 22 5
study 7 2003 21 6
study 8 2004 24 7
study 9 1999 203 82
study 10 2009 647 436
study 11 2009 200 169
study 12 2010 156 128
study 13 2009 10753 6374
study 14 2007 143 109
study 15 2001 247 36
study 16 2004 318 184
study 17 2012 611 565
study 18 2013 180 167
study 19 2006 344 337
study 20 2007 209 103
study 21 2013 470 354
study 22 2010 180 146
study 23 2005 522 302
study 24 2000 62 30
study 25 2001 79 39
study 26 2002 85 43
study 27 2011 548 307
study 28 2009 218 216
study 29 2006 2901 2332
study 30 2008 464 259
study 31 2010 650 393
study 32 2008 2514 704
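On question 1, a hedged sketch that scales bubble size by sample size (so point area tracks n rather than 1/vi); it assumes the table above has been read into a data frame dat0 with columns study, year, Sample and A:

library(metafor)
dat <- escalc(xi = A, ni = Sample, measure = "PLO", data = dat0)
sz <- sqrt(dat0$Sample / mean(dat0$Sample))   # radius ~ sqrt(n), so area ~ n
plot(dat0$year, transf.ilogit(dat$yi), cex = 2 * sz,
     pch = 21, bg = "lightblue", xlab = "Year", ylab = "Prevalence")

For rma (not rma.glmm) fits, newer versions of metafor also provide regplot(), which draws this kind of plot with inverse-variance point sizing built in.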

Error when fitting a glmer with Poisson error structure

I hope somebody can help me. I'm trying to conduct an analysis which examines the number of Hymenoptera specimens caught along an elevational gradient. I want to examine the possibility of a unimodal distribution in relation to elevation, as well as a linear one; hence I am including I(Altitude^2) as an explanatory variable in the analysis.
I am trying to run the following model, which includes a Poisson error structure (as we are dealing with count data) and Date and trap type (Trap) as random effects:
model7 <- glmer(No.Specimens ~ Altitude + I(Altitude^2) + (1|Date) + (1|Trap),
                family = "poisson", data = Santa.Lucia, na.action = na.omit)
However I keep receiving the following error message:
Error: (maxstephalfit) PIRLS step-halvings failed to reduce deviance in pwrssUpdate
In addition: Warning messages:
1: Some predictor variables are on very different scales: consider rescaling
2: In pwrssUpdate(pp, resp, tolPwrss, GQmat, compDev, fac, verbose) :
Cholmod warning 'not positive definite' at file:../Cholesky/t_cholmod_rowfac.c, line 431
3: In pwrssUpdate(pp, resp, tolPwrss, GQmat, compDev, fac, verbose) :
Cholmod warning 'not positive definite' at file:../Cholesky/t_cholmod_rowfac.c, line 431
Clearly I am making some big mistakes. Can anybody help me figure out where I am going wrong?
Here is the structure of the dataframe:
str(Santa.Lucia)
'data.frame': 97 obs. of 6 variables:
$ Date : Factor w/ 8 levels "01-Sep-2014",..: 6 6 6 6 6 6 6 6 6 6 ...
$ Trap.No : Factor w/ 85 levels "N1","N10","N11",..: 23 48 51 14 17 20 24 27 30 33 ...
$ Altitude : int 1558 1635 1703 1771 1840 1929 1990 2047 2112 2193 ...
$ Trail : Factor w/ 3 levels "Cascadas","Limones",..: 1 1 1 1 1 3 3 3 3 3 ...
$ No.Specimens: int 1 0 2 2 3 4 5 0 1 1 ...
$ Trap : Factor w/ 2 levels "Net","Pan": 2 2 2 2 2 2 2 2 2 2 ...
And here is the complete data set (these are only my preliminary analyses):
Date Trap.No Altitude Trail No.Specimens Trap
1 28-Aug-2014 W2 1558 Cascadas 1 Pan
2 28-Aug-2014 W5 1635 Cascadas 0 Pan
3 28-Aug-2014 W8 1703 Cascadas 2 Pan
4 28-Aug-2014 W11 1771 Cascadas 2 Pan
5 28-Aug-2014 W14 1840 Cascadas 3 Pan
6 28-Aug-2014 W17 1929 Tower 4 Pan
7 28-Aug-2014 W20 1990 Tower 5 Pan
8 28-Aug-2014 W23 2047 Tower 0 Pan
9 28-Aug-2014 W26 2112 Tower 1 Pan
10 28-Aug-2014 W29 2193 Tower 1 Pan
11 28-Aug-2014 W32 2255 Tower 0 Pan
12 30-Aug-2014 N1 1562 Cascadas 5 Net
13 30-Aug-2014 N2 1635 Cascadas 0 Net
14 30-Aug-2014 N3 1723 Cascadas 2 Net
15 30-Aug-2014 N4 1779 Cascadas 0 Net
16 30-Aug-2014 N5 1842 Cascadas 3 Net
17 30-Aug-2014 N6 1924 Tower 2 Net
18 30-Aug-2014 N7 1979 Tower 2 Net
19 30-Aug-2014 N8 2046 Tower 0 Net
20 30-Aug-2014 N9 2110 Tower 0 Net
21 30-Aug-2014 N10 2185 Tower 0 Net
22 30-Aug-2014 N11 2241 Tower 0 Net
23 31-Aug-2014 N1 1562 Cascadas 1 Net
24 31-Aug-2014 N2 1635 Cascadas 1 Net
25 31-Aug-2014 N3 1723 Cascadas 0 Net
26 31-Aug-2014 N4 1779 Cascadas 0 Net
27 31-Aug-2014 N5 1842 Cascadas 0 Net
28 31-Aug-2014 N6 1924 Tower 0 Net
29 31-Aug-2014 N7 1979 Tower 7 Net
30 31-Aug-2014 N8 2046 Tower 4 Net
31 31-Aug-2014 N9 2110 Tower 6 Net
32 31-Aug-2014 N10 2185 Tower 1 Net
33 31-Aug-2014 N11 2241 Tower 1 Net
34 01-Sep-2014 W1 1539 Cascadas 0 Pan
35 01-Sep-2014 W2 1558 Cascadas 0 Pan
36 01-Sep-2014 W3 1585 Cascadas 2 Pan
37 01-Sep-2014 W4 1604 Cascadas 0 Pan
38 01-Sep-2014 W5 1623 Cascadas 1 Pan
39 01-Sep-2014 W6 1666 Cascadas 4 Pan
40 01-Sep-2014 W7 1699 Cascadas 0 Pan
41 01-Sep-2014 W8 1703 Cascadas 0 Pan
42 01-Sep-2014 W9 1746 Cascadas 1 Pan
43 01-Sep-2014 W10 1762 Cascadas 0 Pan
44 01-Sep-2014 W11 1771 Cascadas 0 Pan
45 01-Sep-2014 W12 1796 Cascadas 1 Pan
46 01-Sep-2014 W13 1825 Cascadas 0 Pan
47 01-Sep-2014 W14 1840 Tower 4 Pan
48 01-Sep-2014 W15 1859 Tower 2 Pan
49 01-Sep-2014 W16 1889 Tower 2 Pan
50 01-Sep-2014 W17 1929 Tower 0 Pan
51 01-Sep-2014 W18 1956 Tower 0 Pan
52 01-Sep-2014 W19 1990 Tower 1 Pan
53 01-Sep-2014 W20 2002 Tower 3 Pan
54 01-Sep-2014 W21 2023 Tower 2 Pan
55 01-Sep-2014 W22 2047 Tower 0 Pan
56 01-Sep-2014 W23 2068 Tower 1 Pan
57 01-Sep-2014 W24 2084 Tower 0 Pan
58 01-Sep-2014 W25 2112 Tower 1 Pan
59 01-Sep-2014 W26 2136 Tower 0 Pan
60 01-Sep-2014 W27 2150 Tower 1 Pan
61 01-Sep-2014 W28 2193 Tower 1 Pan
62 01-Sep-2014 W29 2219 Tower 0 Pan
63 01-Sep-2014 W30 2227 Tower 1 Pan
64 01-Sep-2014 W31 2255 Tower 0 Pan
85 03/06/2015 WT47 1901 Tower 2 Pan
86 03/06/2015 WT48 1938 Tower 2 Pan
87 03/06/2015 WT49 1963 Tower 2 Pan
88 03/06/2015 WT50 1986 Tower 0 Pan
89 03/06/2015 WT51 2012 Tower 9 Pan
90 03/06/2015 WT52 2033 Tower 0 Pan
91 03/06/2015 WT53 2050 Tower 4 Pan
92 03/06/2015 WT54 2081 Tower 2 Pan
93 03/06/2015 WT55 2107 Tower 1 Pan
94 03/06/2015 WT56 2128 Tower 4 Pan
95 03/06/2015 WT57 2155 Tower 0 Pan
96 03/06/2015 WT58 2179 Tower 2 Pan
97 03/06/2015 WT59 2214 Tower 0 Pan
98 03/06/2015 WT60 2233 Tower 0 Pan
99 03/06/2015 WT61 2261 Tower 0 Pan
100 03/06/2015 WT62 2278 Tower 0 Pan
101 03/06/2015 WT63 2300 Tower 0 Pan
102 04/06/2015 WT31 1497 Cascadas 0 Pan
103 04/06/2015 WT32 1544 Cascadas 1 Pan
104 04/06/2015 WT33 1568 Cascadas 1 Pan
105 04/06/2015 WT34 1574 Cascadas 0 Pan
106 04/06/2015 WT35 1608 Cascadas 5 Pan
107 04/06/2015 WT36 1630 Cascadas 3 Pan
108 04/06/2015 WT37 1642 Cascadas 0 Pan
109 04/06/2015 WT38 1672 Cascadas 5 Pan
110 04/06/2015 WT39 1685 Cascadas 6 Pan
111 04/06/2015 WT40 1723 Cascadas 3 Pan
112 04/06/2015 WT41 1744 Cascadas 2 Pan
113 04/06/2015 WT42 1781 Cascadas 1 Pan
114 04/06/2015 WT43 1794 Cascadas 2 Pan
115 04/06/2015 WT44 1833 Cascadas 0 Pan
116 04/06/2015 WT45 1855 Cascadas 4 Pan
117 04/06/2015 WT46 1876 Cascadas 2 Pan
You're almost there. As @BondedDust suggests, it's not practical to use a two-level factor (Trap) as a random effect; in fact, it doesn't seem right in principle either (the levels of Trap are not arbitrary/randomly chosen/exchangeable). When I tried a model with quadratic altitude, a fixed effect of trap, and a random effect of Date, I was warned that I might want to rescale a parameter:
Some predictor variables are on very different scales: consider rescaling
(you saw this warning mixed in with your error messages). The only continuous (and hence worth rescaling) predictor is Altitude, so I centered and scaled it with scale(); the only disadvantage is that this changes the quantitative interpretation of the coefficients, but the model itself is practically identical. I also added an observation-level random effect to allow for overdispersion.
The results seem OK, and agree with the picture.
library(lme4)
Santa.Lucia <- transform(Santa.Lucia,
                         scAlt = scale(Altitude),
                         obs = factor(seq(nrow(Santa.Lucia))))
model7 <- glmer(No.Specimens ~ scAlt + I(scAlt^2) + Trap + (1|Date) + (1|obs),
                family = "poisson", data = Santa.Lucia, na.action = na.omit)
summary(model7)
summary(model7)
## Random effects:
## Groups Name Variance Std.Dev.
## obs (Intercept) 0.64712 0.8044
## Date (Intercept) 0.02029 0.1425
## Number of obs: 97, groups: obs, 97; Date, 6
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.53166 0.31556 1.685 0.09202 .
## scAlt -0.22867 0.14898 -1.535 0.12480
## I(scAlt^2) -0.52840 0.16355 -3.231 0.00123 **
## TrapPan -0.01853 0.32487 -0.057 0.95451
Test the quadratic term by comparing with a model that lacks it ...
model7R <- update(model7, . ~ . - I(scAlt^2))
## convergence warning, but probably OK ...
anova(model7,model7R)
In principle it might be worth looking at the interaction between the quadratic altitude terms and Trap (allowing for different altitude trends by trap type), but the picture suggests it won't do much ...
library(ggplot2); theme_set(theme_bw())
ggplot(Santa.Lucia, aes(Altitude, No.Specimens, colour = Trap)) +
  stat_sum(aes(size = factor(..n..))) +
  scale_size_discrete(range = c(2, 4)) +
  geom_line(aes(group = Date), colour = "gray", alpha = 0.3) +
  # in current ggplot2 the GAM family must be passed via method.args
  geom_smooth(method = "gam", formula = y ~ poly(x, 2),
              method.args = list(family = "quasipoisson")) +
  geom_smooth(method = "gam", formula = y ~ poly(x, 2),
              method.args = list(family = "quasipoisson"),
              se = FALSE, aes(group = 1), colour = "black")
The problem is almost surely due to you passing a character string to the data argument:
..., data = "Santa.Lucia", ...
?glmer says the data argument should be:
data: an optional data frame containing the variables named in
‘formula’. By default the variables are taken from the
environment from which ‘lmer’ is called. While ‘data’ is
optional, the package authors _strongly_ recommend its use,
especially when later applying methods such as ‘update’ and
‘drop1’ to the fitted model (_such methods are not guaranteed
to work properly if ‘data’ is omitted_). If ‘data’ is
omitted, variables will be taken from the environment of
‘formula’ (if specified as a formula) or from the parent
frame (if specified as a character vector).
The last part in parentheses, "if specified as a character vector", relates to what happens when the formula is given as a character vector, not to specifying data as a character string.
Correct your call to pass data = Santa.Lucia (the data frame itself, unquoted) and you should be good to go.
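That is, the question's model with the data frame passed directly:

model7 <- glmer(No.Specimens ~ Altitude + I(Altitude^2) + (1|Date) + (1|Trap),
                family = "poisson", data = Santa.Lucia, na.action = na.omit)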
You've managed to use two different formats for Date. Here's a fix:
# the > 10 test picks out the "28-Aug-2014"-style values (11 characters);
# note that ifelse() returns the underlying numeric day counts rather than
# Date objects, which is still fine for use as a grouping variable
Santa.Lucia$Date2 <- ifelse(nchar(as.character(Santa.Lucia$Date)) > 10,
                            as.Date(Santa.Lucia$Date, format = "%d-%b-%Y"),
                            as.Date(Santa.Lucia$Date, format = "%d/%m/%Y"))
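An alternative sketch that keeps the Date class (an untested variation; the %b month abbreviation is locale-dependent):

d <- as.character(Santa.Lucia$Date)
Santa.Lucia$Date2 <- as.Date(d, format = "%d-%b-%Y")   # NA where format differs
slash <- is.na(Santa.Lucia$Date2)
Santa.Lucia$Date2[slash] <- as.Date(d[slash], format = "%d/%m/%Y")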
I tried a simpler model:
(model6 <- glmer(No.Specimens ~ Altitude + (1|Date2) + (1|Trap),
                 family = "poisson", data = Santa.Lucia, na.action = na.omit))
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) [
glmerMod]
Family: poisson ( log )
Formula: No.Specimens ~ Altitude + (1 | Date2) + (1 | Trap)
Data: Santa.Lucia
AIC BIC logLik deviance df.resid
368.6522 378.9510 -180.3261 360.6522 93
Random effects:
Groups Name Std.Dev.
Date2 (Intercept) 0.2248
Trap (Intercept) 0.0000
Number of obs: 97, groups: Date2, 6; Trap, 2
Fixed Effects:
(Intercept) Altitude
1.3696125 -0.0004992
Warning messages:
1: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model failed to converge with max|grad| = 0.0516296 (tol = 0.001, component 3)
2: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model is nearly unidentifiable: very large eigenvalue
- Rescale variables?;Model is nearly unidentifiable: large eigenvalue ratio
- Rescale variables?
I'm actually able to get my suggested modification to run without error or warning, but I think that using those two groupings is not right, because one predicts the other:
> table(Santa.Lucia$Date2, Santa.Lucia$Trap)
Net Pan
16310 0 11
16312 11 0
16313 11 0
16314 0 31
16589 0 17
16590 0 16
That's why you are getting non-convergence. It's not the error model that is at fault but the pathology in your design and data collection. I question whether you really have sufficient data to support a mixed model:
(model5 <- glm(No.Specimens ~ Altitude, family = "poisson",
               data = Santa.Lucia, na.action = na.omit))
Call: glm(formula = No.Specimens ~ Altitude, family = "poisson", data = Santa.Lucia,
na.action = na.omit)
Coefficients:
(Intercept) Altitude
1.4218234 -0.0005391
Degrees of Freedom: 96 Total (i.e. Null); 95 Residual
Null Deviance: 215.3
Residual Deviance: 213.2 AIC: 368.6
To compare with a quadratic altitude model:
(model5.2 <- glm(No.Specimens ~ poly(Altitude, 2), family = "poisson",
                 data = Santa.Lucia, na.action = na.omit))
Call: glm(formula = No.Specimens ~ poly(Altitude, 2), family = "poisson",
data = Santa.Lucia, na.action = na.omit)
Coefficients:
(Intercept) poly(Altitude, 2)1 poly(Altitude, 2)2
0.3188 -1.7116 -3.9539
Degrees of Freedom: 96 Total (i.e. Null); 94 Residual
Null Deviance: 215.3
Residual Deviance: 194.6 AIC: 352
> anova(model5.2)
Analysis of Deviance Table
Model: poisson, link: log
Response: No.Specimens
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev
NULL 96 215.31
poly(Altitude, 2) 2 20.698 94 194.61
> anova(model5.2, model5)
Analysis of Deviance Table
Model 1: No.Specimens ~ poly(Altitude, 2)
Model 2: No.Specimens ~ Altitude
Resid. Df Resid. Dev Df Deviance
1 94 194.61
2 95 213.20 -1 -18.59
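To attach a p-value to that deviance drop of 18.59 on 1 df, the usual asymptotic chi-squared test can be requested explicitly:

anova(model5.2, model5, test = "Chisq")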

How to input a 3-way table?

I have the data in table form (not even an R table) and I want to input it into R to perform an analysis.
The table is a 3-way contingency table, which looks like this: (the table was shown as an image; the answer below reproduces its layout).
Is there a way to easily input this into R? (It can be any format, as long as I can perform some regression analysis.)
Or do I need to input it manually?
In R, this is an ftable.
Inputting an ftable manually is not too difficult if you know how the function works. The data need to be in a format like this:
breathless yes no
coughed yes no
age
20-24 9 7 95 1841
25-29 23 9 108 1654
30-34 54 19 177 1863
If the data are in this format, you can use read.ftable. For example:
temp <- read.ftable(textConnection("breathless yes no
coughed yes no
age
20-24 9 7 95 1841
25-29 23 9 108 1654
30-34 54 19 177 1863"))
temp
# breathless yes no
# coughed yes no yes no
# age
# 20-24 9 7 95 1841
# 25-29 23 9 108 1654
# 30-34 54 19 177 1863
From there, if you want a "long" data.frame, with which analysis and reshaping to different formats are much easier, just wrap it in data.frame():
data.frame(temp)
# age breathless coughed Freq
# 1 20-24 yes yes 9
# 2 25-29 yes yes 23
# 3 30-34 yes yes 54
# 4 20-24 no yes 95
# 5 25-29 no yes 108
# 6 30-34 no yes 177
# 7 20-24 yes no 7
# 8 25-29 yes no 9
# 9 30-34 yes no 19
# 10 20-24 no no 1841
# 11 25-29 no no 1654
# 12 30-34 no no 1863
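Since the goal was "some regression analysis": a minimal sketch of a log-linear (Poisson) model fitted to the long table, here with all pairwise associations (one illustrative choice of model, not the only reasonable one):

long <- data.frame(temp)
fit <- glm(Freq ~ (age + breathless + coughed)^2,
           family = poisson, data = long)
summary(fit)   # e.g. the breathless:coughed terms measure their association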
