Plot importance of variables of a Random Forest model in R

What am I doing wrong here? What does "subscript out of bounds" mean?
I got the code below (first block) as an excerpt from a Revolution R online seminar on data mining in R. I'm trying to incorporate it into a RF model I ran, but can't get past what I think is the ordering of variables. I just want to plot the importance of the variables.
I included a little more than needed below to give context, but where I'm really erroring out is the third line of code. The second code block shows the errors I am getting as applied to the data I am working with. Can anyone help me figure this out?
#--------------------------------------------------------------------------
# List the importance of the variables.
rn <- round(importance(model.rf), 2)
rn[order(rn[,3], decreasing=TRUE),]
# Plot variable importance
varImpPlot(model.rf, main="",col="dark blue")
title(main="Variable Importance Random Forest weather.csv",
sub=paste(format(Sys.time(), "%Y-%b-%d %H:%M:%S"), Sys.info()["user"]))
#--------------------------------------------------------------------------
My errors:
> rn[order(rn[,3], decreasing=TRUE),]
Error in order(rn[, 3], decreasing = TRUE) : subscript out of bounds

Think I understand the confusion. I bet you a 4-finger Kit Kat that if you type in ncol(rn) you'll see that rn has 2 columns, not 3 as you might expect. The first "column" you're seeing on the screen isn't really a column - it's just the row names for the object rn. Type rownames(rn) to confirm this. The final column of rn that you want to order by is therefore rn[,2] rather than rn[,3]. The "subscript out of bounds" message comes up because you've asked R to order by column 3, but rn doesn't have a column 3.
Here's my brief detective trail for anyone interested in what the "importance" object actually is... I installed library(randomForest) and then ran an example from the documentation online:
set.seed(4543)
data(mtcars)
mtcars.rf <- randomForest(mpg ~ ., data=mtcars, ntree=1000,
keep.forest=FALSE, importance=TRUE)
importance(mtcars.rf)
Turns out the "importance" object in this case looks like this (first few rows only to save space):
        %IncMSE IncNodePurity
cyl   17.058932     181.70840
disp  19.203139     242.86776
hp    17.708221     191.15919
...
Obviously ncol(importance(mtcars.rf)) is 2, and the row names are likely to be the thing leading to confusion :)
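Tying that back to the original snippet, a minimal sketch using the mtcars fit above (the same checks apply to your own model.rf, assuming it was fit with importance=TRUE):
rn <- round(importance(mtcars.rf), 2)
ncol(rn)        # 2 columns; the variable names are row names, not a column
rownames(rn)    # the "first column" you see on screen
rn[order(rn[, 2], decreasing = TRUE), ]   # order by the last real column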

Related

Allowing for aliased coefficients when running `grangertest()` in R

I'm currently trying to run a Granger causality analysis in R/RStudio. I am receiving errors about aliased coefficients when using the function grangertest(). From my understanding, this occurs because there is perfect multicollinearity between the variables.
Due to having a very large number of pairwise comparisons (e.g. 200+), I would like to simply run the Granger test with the aliased coefficients as normal rather than have it return an error. According to one answer here, the solution is (or was) to set singular.ok=TRUE, but either I am doing it incorrectly or the answer is out of date. I've tried checking the documentation, but have come up empty. Any help would be appreciated.
library(lmtest)
x <- c(0,1,2,3)
y <- c(0,3,6,9)
grangertest(x,y,1) # I want this to run successfully even if there are aliased coefficients.
grangertest(x,y,1, singular.ok=TRUE) # this also doesn't work
"Error in waldtest.lm(fm, 2, ...) :
there are aliased coefficients in the model"
Additionally, is there a way to flag that x and y are actually aliased variables? There seem to be some answers (like here) but I'm having issues getting it to work properly.
alias(x ~ y)
Thanks in advance.
After some investigation and emailing the maintainer of the package that provides grangertest(), they sent me this solution. It should run on aliased variables where grangertest() does not; when the variables are not aliased, it should give the same values as grangertest().
library(lmtest)
library(dynlm)
# Some data that is multicollinear
x <- c(0, 1, 2, 3, 4)
y <- c(0, 3, 6, 9, 12)
# Some data that is not multicollinear
# x <- c(0, 125, 200, 230, 777)
# y <- c(0, 3, 6, 9, 200)
# Convert to time series (this is an important step)
x <- ts(x)
y <- ts(y)
# This will run even when the data is multicollinear (but also when it is not),
# and is functionally the same as running grangertest() (which by default uses a Wald test)
m1 <- dynlm(x ~ L(x, 1:1) + L(y, 1:1))
m2 <- dynlm(x ~ L(x, 1:1))
result <- anova(m1, m2, test = "F")
# This will fail if the data is multicollinear or aliased, but should otherwise give
# the same results as the anova above (F value, p value, etc.)
# grangertest(y, x, 1)
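Two hedged additions to the snippet above (neither is part of the emailed solution). First, with series that are not collinear, the anova() comparison and grangertest() should report the same F statistic; the names x2, y2, n1, n2 are new, chosen to avoid clobbering the objects above:
x2 <- ts(c(0, 125, 200, 230, 777))
y2 <- ts(c(0, 3, 6, 9, 200))
n1 <- dynlm(x2 ~ L(x2, 1:1) + L(y2, 1:1))
n2 <- dynlm(x2 ~ L(x2, 1:1))
anova(n1, n2, test = "F")   # manual restricted vs. unrestricted comparison
grangertest(y2, x2, 1)      # should agree when no coefficients are aliased
Second, for the question about flagging aliased variables, alias() should work on the fitted dynlm object, since it inherits from lm:
alias(m1)   # with the collinear x and y above, this should report L(y, 1:1) as aliased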

Fama MacBeth regression pmg function error in R

I've been trying to run a Fama Macbeth regression using the pmg function for my data "Dev_Panel" but I keep getting this error message:
Error in pmg(BooktoMarket ~ Returns + Profitability + BEtoMEpersistence, :
Insufficient number of time periods
I've read in other posts on here that this could be due to NAs in the data. But I've already removed these from the panel.
Additionally, I've used the pmg function on the data frame "Em_Panel" for which I have undertaken the exact same data cleaning measures as for the "Dev_Panel". The regression for this panel worked, but it only produces a coefficient for the intercept. The other coefficients are NA.
Here's the code I used for the Em_Panel:
require(foreign)
require(plm)
require(lmtest)
Em_Panel <- read.csv2("Em_Panel.csv", na="NA")
FMR_Em <- pmg(BooktoMarket~Returns+Profitability+BEtoMEpersistence, Em_Panel, index = c("companyID", "years"))
And here's the code for the Dev_Panel:
Em_Panel <- read.csv2("Dev_Panel.csv", na="NA")
FMR_Dev <- pmg(BooktoMarket~Returns+Profitability+BEtoMEpersistence, Dev_Panel, index = c("companyID", "years"))
Since this seemingly is a problem concerning my data I will gladly provide it:
http://www.filedropper.com/empanel
http://www.filedropper.com/devpanel
Thank you so much for any help!!!
Edit
After switching the arguments as suggested, the error is now produced by the Dev_Panel and not the Em_Panel.
Also, the regression for the Em_Panel now only produces a coefficient for the intercept. The other coefficients are NA.
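No answer is recorded here, but one hedged diagnostic may help: the "Insufficient number of time periods" error from pmg() typically means some groups have too few observations relative to the number of coefficients being estimated, so it is worth checking the panel dimensions first (using the column names from the code above):
pdim(pdata.frame(Dev_Panel, index = c("companyID", "years")))  # balanced? how many periods per firm?
table(Dev_Panel$companyID)   # observations per firm; very short series are the usual culprits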

How to correctly take out zero observations in panel data in R

I'm running into some problems while running plm regressions on my panel data. Basically, I have to drop a year from my dataset and also drop all observations where one of the variables is zero. I tried to make a reproducible example using a dataset from the AER package.
library(AER)
library(plm)
data("Grunfeld", package = "AER")
View(Grunfeld)
#Here I randomize some observations of the third variable (capital) as zero, to reproduce my dataset
for (i in 1:220) {
  x <- rnorm(10, 0, 1)
  if (mean(x) >= 0) {
    Grunfeld[i, 3] <- 0
  }
}
View(Grunfeld)
panel <- Grunfeld
#First Method
#This is how I was originally manipulating my data and running my regression
panel <- Grunfeld
dd <-pdata.frame(panel, index = c('firm', 'year'))
dd <- dd[dd$year!=1935, ]
dd <- dd[dd$capital !=0, ]
ols_model_2 <- plm(log(value) ~ (capital), data=dd)
summary(ols_model_2)
#However, I couldn't plot the variables of the datasets in graphs, because they weren't vectors. So I tried another way:
#Second Method
panel <- panel[panel$year!= 1935, ]
panel <- panel[panel$capital != 0,]
ols_model <- plm(log(value) ~ log(capital), data=panel, index = c('firm','year'))
summary(ols_model)
#But this gave extremely different results for the ols regression!
In my understanding, both approaches should have yielded the same outputs in the OLS regression. Now I'm afraid my entire analysis is wrong, because I was doing it the first way. Could anyone explain to me what is happening?
Thanks in advance!
You are running two different models. I am not sure why you would expect the results to be the same.
Your first model is:
ols_model_2 <- plm(log(value) ~ (capital), data=dd)
While the second is:
ols_model <- plm(log(value) ~ log(capital), data=panel, index = c('firm','year'))
As you can see from the summary of the models, both are "Oneway (individual) effect Within Model". In the first one you don't specify the index, since dd is a pdata.frame object. In the second you do specify the index, because panel is a plain data.frame. However, this makes no difference at all.
The difference is using the log of capital or capital without log.
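To see that the formula is the whole story, here is a quick check (not part of the original answer) using the dd and panel objects defined above; with the same formula, both routes should give the same fit:
same_1 <- plm(log(value) ~ log(capital), data = dd)
same_2 <- plm(log(value) ~ log(capital), data = panel, index = c('firm', 'year'))
all.equal(coef(same_1), coef(same_2))   # should be TRUE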
As a side note, leaving out 0 observations is often very problematic. If you do that, make sure you also try alternative ways of dealing with zero, and see how much your results change. You can get started here https://stats.stackexchange.com/questions/1444/how-should-i-transform-non-negative-data-including-zeros

Removing character level outlier in R

I have a linear model, model1 <- lm(divorce_rate ~ marriage_rate + median_age + population), for which the leverage plot shows an outlier at 28 (the State variable id for "Nevada"). I'd like to specify a model without Nevada in the dataset. I tried the following but got stuck.
data<-read.dta("census.dta")
attach(data)
data1<-data.frame(pop,divorce,marriage,popurban,medage,divrate,marrate)
attach(data1)
model1<-lm(divrate~marrate+medage+pop,data=data1)
summary(model1)
layout(matrix(1:4,2,2))
plot(model1)
dfbetaPlots(lm(divrate~marrate+medage+pop),id.n=50)
vif(model1)
dataNV<-data[!data$state == "Nevada",]
attach(dataNV)
model3<-lm(divrate~marrate+medage+pop,data=dataNV)
The last line of the above code gives me
Error in model.frame.default(formula = divrate ~ marrate + medage + pop, :
variable lengths differ (found for 'medage')
I suspect that you have some glitch in your code such that you have attach()ed copies that are still lying around in your environment -- that's why it's really best practice not to use attach(). The following code works for me:
library(foreign)
## best not to call data 'data'
mydata <- read.dta("http://www.stata-press.com/data/r8/census.dta")
I didn't find divrate or marrate in the data set: I'm going to speculate that you want the per capita rates:
## best practice to use a new name rather than transforming 'in place'
mydata2 <- transform(mydata,marrate=marriage/pop,divrate=divorce/pop)
model1 <- lm(divrate~marrate+medage+pop,data=mydata2)
library(car)
plot(model1)
dfbetaPlots(model1)
This works fine for me in a clean session:
dataNV <- subset(mydata2,state != "Nevada")
## update() may be nice to avoid repeating details of the
## model specification (not really necessary in this case)
model3 <- update(model1,data=dataNV)
Or you can use the subset argument:
model4 <- update(model1,subset=(state != "Nevada"))

Predicting with lm object in R - black box paradigm

I have a function that returns an lm object. I want to produce predicted values based on some new data. The new data is a data.frame in the exact format as the data passed to the lm function, except that the response has been removed (since we're predicting, not training). I would expect to execute the following, but get an error:
predict( model , newdata )
"Error in eval(expr, envir, enclos) : object 'ModelResponse' not found"
In my case, ModelResponse was the name of the response column in the data I originally trained on. So just for kicks, I tried to insert an NA response:
newdata$ModelResponse = NA
predict( model , newdata )
Error in terms.default(object, data = data) : no terms component nor attribute
Highly frustrating! R's notion of models/regression doesn't match mine:
1. I train a model with some data and get a model object.
2. I can score new data from any environment/function/frame/etc. so long as I input data into the model object that "looks like" the data I trained on (i.e. same column names).
This is a standard black-box paradigm.
So here are my questions:
1. What concept(s) am I missing here?
2. How do I get my scenario to work?
3. How can I get the model object to be portable? str(model) shows me that the model object saved the original data it was trained on! So the model object is massive. I want my model to be portable to any function/environment/etc. and only contain the data it needs to score.
In the absence of str() on either the model or the data offered to the model, here's my guess regarding this error message:
predict( model , newdata )
"Error in eval(expr, envir, enclos) : object 'ModelResponse' not found"
I guess that you made a model object named "model", that your outcome variable (the left-hand side of the formula in the original call to lm) was named "ModelResponse", and that you then named a column in newdata by the same name. But what you should have done was leave out the "ModelResponse" column (because that is what you are predicting) and put in the "Model_Predictor1", "Model_Predictor2", etc. columns ... i.e. all the names on the right-hand side of the formula given to lm().
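A minimal sketch of that workflow with toy data (the names are hypothetical, chosen to mirror the question): predict() only needs the right-hand-side variables in newdata, so the response column can simply be absent:
train <- data.frame(ModelResponse = rnorm(10), Model_Predictor1 = rnorm(10))
model <- lm(ModelResponse ~ Model_Predictor1, data = train)
newdata <- data.frame(Model_Predictor1 = rnorm(5))  # predictors only, no response column
predict(model, newdata)                             # works without ModelResponse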
The coef() function will allow you to extract the information needed to make the model portable.
mod.coef <- coef(model)
mod.coef
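For instance, continuing the toy example above (and assuming a single predictor), a prediction is just the intercept plus the weighted predictors, which needs nothing but the coefficient vector:
manual_pred <- mod.coef["(Intercept)"] + mod.coef["Model_Predictor1"] * newdata$Model_Predictor1
all.equal(unname(manual_pred), unname(predict(model, newdata)))   # should be TRUE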
Since you expressed interest in the rms/Hmisc package combo's Function(), here it is using the help-example from ols, comparing the output with an extracted function and the rms Predict method. Note the capitals, since these are designed to work with the package equivalents of lm, glm(..., family="binomial"), and coxph, which in rms become ols, lrm, and cph.
> set.seed(1)
> x1 <- runif(200)
> x2 <- sample(0:3, 200, TRUE)
> distance <- (x1 + x2/3 + rnorm(200))^2
> d <- datadist(x1,x2)
> options(datadist="d") # No d -> no summary, plot without giving all details
>
>
> f <- ols(sqrt(distance) ~ rcs(x1,4) + scored(x2), x=TRUE)
>
> Function(f)
function(x1 = 0.50549065,x2 = 1) {0.50497361+1.0737604* x1-
0.79398383*pmax(x1-0.083887788,0)^3+ 1.4392827*pmax(x1-0.38792825,0)^3-
0.38627901*pmax(x1-0.65115162,0)^3-0.25901986*pmax(x1-0.92736774,0)^3+
0.06374433*x2+ 0.60885222*(x2==2)+0.38971577*(x2==3) }
<environment: 0x11b4568e8>
> ols.fun <- Function(f)
> pred1 <- Predict(f, x1=1, x2=3)
> pred1
x1 x2 yhat lower upper
1 1 3 1.862754 1.386107 2.339401
Response variable (y): sqrt(distance)
Limits are 0.95 confidence limits
# The "yhat" is the same as one produces with the extracted function
> ols.fun(x1=1, x2=3)
[1] 1.862754
(I have learned through experience that the restricted cubic-spline fit functions coming from rms need to have spaces and carriage returns added to improve readability.)
Thinking long-term, you should probably take a look at the caret package. Many modeling functions work with both data frames and matrices, others have a preference for one, and there may be other variations in what they expect. It's important to quickly get your head around each, but if you want a single wrapper that will simplify life for you, turning the intricacies into a "black box", then caret is as close as you can get.
As a disclaimer: I do not use caret, as I don't think modeling should be a black box. I've had more than a few emails to maintainers of modeling packages resulting from looking into their code and seeing something amiss. Wrapping that in another layer would not serve my interests. So, in the very long run, avoid caret and develop an enjoyment for dissecting what's going into and out of the different modeling functions. :)
