R: How to update model frame after reducing model formula

I am working on a phylogenetic multiple regression using the caper package on Windows 7, and I consistently receive a model frame / formula mismatch error whenever I try to graph a residual leverage plot after generating a reduced model.
Here is the minimal code needed to reproduce the error:
g <- Response ~ (Name1 + Name2 + Name3 + Name4 + Name5 + Name6 + Name7)^2 + Name1Sqd +
  Name2Sqd + Name3Sqd + Name4Sqd + Name5Sqd + Name6Sqd + Name7Sqd
crunchMod <- crunch(g, data = contrasts)
plot(crunchMod, which=c(5)) ####Works just fine####
varName <- row.names(summary(crunchMod)$coefficients)[1]
#it doesn't matter which predictor I remove.
Reduce(paste, deparse(g)) #collapses the multi-line deparse of g into a single string
g <- as.formula(paste(Reduce(paste, deparse(g)), as.name(varName), sep=" - "))
#Edits the model formula to remove varName
crunchMod <- crunch(g, data = contrasts)
plot(crunchMod, which=c(5)) ####Error Happens Here####
When I try to graph a residual leverage plot to look at the effects of model complexity, I get the following error:
Error in model.matrix.default(object, data = list(Response = c(-0.0458443124730482,
: model frame and formula mismatch in model.matrix()
The code that triggers this error is plot(crunchMod, which=c(5)), where crunchMod holds my regression model via crunchMod <- crunch(g, data = contrasts) from the caper package, on Windows 7.
How can I update my model frame so that I can examine Cook's distance again (either graphically or numerically)?

Within the source code of crunch() is the line:
data <- subset(data, select = all.vars(formula))
which has the side effect of invalidating, in the model frame, every interaction term built from a deleted main effect. This becomes apparent once you notice that plotting Cook's distance vs. leverage still works if you delete only interaction terms.
Thus, to solve this problem, all interaction terms must be included as explicit columns in the original data frame before calling crunch() to create a linear model. While this makes transforming the data slightly more complicated, it is easy to add these interactions following these two links:
Generating interaction variables in R dataframes (second answer down)
http://www.r-bloggers.com/type-conversion-and-you-or-and-r/
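For example, here is a minimal sketch (the Name* columns and the contrasts data frame are from the question; the ".x." naming is illustrative) that materialises every pairwise interaction as an explicit column before crunch() is called:
vars <- paste0("Name", 1:7)
pairs <- combn(vars, 2)  # all 21 pairwise combinations
for (i in seq_len(ncol(pairs))) {
  a <- pairs[1, i]
  b <- pairs[2, i]
  # store the interaction as its own column so it survives
  # subset(data, select = all.vars(formula)) inside crunch()
  contrasts[[paste0(a, ".x.", b)]] <- contrasts[[a]] * contrasts[[b]]
}
The model formula then lists these columns (e.g. Name1.x.Name2) instead of the ^2 shorthand, so removing a main effect no longer strips its interactions from the model frame.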

Related

Leads/Lags in linear model with a subsample of the data frame in R

I want to fit the following linear model in R:
\begin{equation}
\text{lPC}_t = \beta_0 + \beta_1\,\text{PIBtvh}_{t+1} + \beta_2\,\text{txDes}_t + \beta_3\,\text{Spread}_{t+4} + u_t
\end{equation}
The name of my data frame is Dados_R. I need to restrict the data because I want to estimate over just the observations between 19 and 45. The problem is that when I create the lead variables I cannot restrict their range, or at least I cannot do so without modifying the original data frame by hand, which is not convenient since I want to fit more models with different leads.
So my question is: how can I change the range of the variables that I created (leadPIBtvh0 and leadSpread0) so that I can fit the linear model with just the observations between 19 and 45?
The code that I wrote:
attach(Dados_R)
leadPIBtvh0=lag(PIBtvh,1)
leadSpread0=lag(Spread,4)
data=Dados_R[19:45,]
detach(Dados_R)
attach(data)
lPC=log(PC/(1-PC))
lm_lPC=lm(lPC~leadPIBtvh0+txDes+leadSpread0)
This code gives me the following error (which I understand):
Error in model.frame.default(formula = lPC ~ leadPIBtvh0 + txDes + leadSpread0, :
variable lengths differ (found for 'leadPIBtvh0')
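One way to avoid this (a sketch, assuming the dplyr package; the question's lag(x, k) calls are really being used as leads) is to create the lead columns inside the data frame first and only then take the row subset, so every variable in the model has the same length:
library(dplyr)
Dados_R$leadPIBtvh0 <- lead(Dados_R$PIBtvh, 1)   # value at t+1
Dados_R$leadSpread0 <- lead(Dados_R$Spread, 4)   # value at t+4
dat <- Dados_R[19:45, ]                          # restrict to observations 19-45
dat$lPC <- log(dat$PC / (1 - dat$PC))
lm_lPC <- lm(lPC ~ leadPIBtvh0 + txDes + leadSpread0, data = dat)
Passing data = dat also avoids the attach()/detach() pattern, which is a common source of exactly this kind of length mismatch.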

Panel regression on the Hedonic data using the plm package in R

I am trying to run a panel regression for an unbalanced panel in R using the plm package, with the 'Hedonic' data set.
I was trying to replicate something similar that is done in the following paper: http://ftp.uni-bayreuth.de/math/statlib/R/CRAN/doc/vignettes/plm/plmEN.pdf (page 14, 3.2.5 Unbalanced Panel).
My code looks something like this:
form = mv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + blacks + lstat
ba = plm(form, data = Hedonic)
However, I am getting the following error on execution:
Error in names(y) <- namesy :
'names' attribute [506] must be the same length as the vector [0]
traceback() yields the following result:
4: pmodel.response.pFormula(formula, data, model = model, effect = effect,
theta = theta)
3: pmodel.response(formula, data, model = model, effect = effect,
theta = theta)
2: plm.fit(formula, data, model, effect, random.method, random.dfcor,
inst.method)
1: plm(form, data = Hedonic)
I am new to panel regression and would be really grateful if someone can help me with this issue.
Thanks.
That paper is ten years old, and I'm not sure plm works like that any more. The latest docs are here: https://cran.r-project.org/web/packages/plm/vignettes/plm.pdf
Your problem arises because, as the docs say:
the current version of plm is capable of working with a regular
data.frame without any further transformation, provided that the
individual and time indexes are in the first two columns,
The Hedonic data set does not have individual and time indexes in the first two columns. I'm not sure where the individual and time indexes are in the data, but if I specify townid for the index I at least get something that runs:
> p <- plm(mv~crim,data=Hedonic)
Error in names(y) <- namesy :
'names' attribute [506] must be the same length as the vector [0]
> p <- plm(mv~crim,data=Hedonic, index="townid")
> p
Model Formula: mv ~ crim
Coefficients:
crim
-0.0097455
because when you don't specify the id and time indexes, plm tries to use the first two columns, and in Hedonic that gives unique numbers for the id, so the whole model falls apart.
If you look at the examples in help(plm) you might notice that the first two columns in all the data sets define the id and the time.
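As a sketch of an alternative, you can also declare the panel structure up front with pdata.frame() instead of relying on column order or on the index argument of plm():
library(plm)
data("Hedonic", package = "plm")
# declare the individual index explicitly; plm generates a time index
# within each town automatically
Hed <- pdata.frame(Hedonic, index = "townid")
p <- plm(mv ~ crim, data = Hed)
summary(p)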

R random forest - training set using target column for prediction

I am learning how to use various random forest packages and coded up the following from example code:
library(party)
library(randomForest)
set.seed(415)
#I'll try to reproduce this with a public data set; in the mean time here's the existing code
data = read.csv(data_location, sep = ',')
test = data[1:65] #basically data w/o the "answers"
m = sample(1:(nrow(data)),nrow(data)/2,replace=FALSE)
o = sample(1:(nrow(data)),nrow(data)/2,replace=FALSE)
train2 = data[m,]
train3 = data[o,]
#random forest implementation
fit.rf <- randomForest(train2[,66] ~., data=train2, importance=TRUE, ntree=10000)
Prediction.rf <- predict(fit.rf, test) #to see if the predictions are accurate -- but it errors out unless I give it all data[1:66]
#cforest implementation
fit.cf <- cforest(train3[,66]~., data=train3, controls=cforest_unbiased(ntree=10000, mtry=10))
Prediction.cf <- predict(fit.cf, test, OOB=TRUE) #to see if the predictions are accurate -- but it errors out unless I give it all data[1:66]
data[,66] is the target factor I'm trying to predict, but it seems that using "~ ." to model it causes the formula to include the target factor among the predictors themselves.
How do I solve for the dimension I want on high-ish dimensionality data, without having to spell out exactly which dimensions to use in the formula (so I don't end up with something like cforest(data[,66] ~ data[,1] + data[,2] + data[,3] + ... etc.)?
EDIT:
On a high level, I believe one basically
loads full data
breaks it down to several subsets to prevent overfitting
trains via subset data
generates a fitting formula so one can predict values of target (in my case data[,66]) given data[1:65].
so my PROBLEM is now: if I give it a new set of test data, let's say test = data[1:65], it says "Error in eval(expr, envir, enclos) :" where it is expecting data[,66]. I want to basically predict data[,66] given the rest of the data!
I think that if the response is in train3 then it will be used as a feature.
I believe this is more like what you want:
crtl <- cforest_unbiased(ntree=1000, mtry=3)
mod <- cforest(iris[,5] ~ ., data = iris[,-5], controls=crtl)
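An alternative sketch that avoids indexing the response by position: name the response column in the formula and pass the full data frame, so the dot expands to the predictor columns only and predict() no longer demands the response in new data:
library(party)
crtl <- cforest_unbiased(ntree = 1000, mtry = 3)
mod <- cforest(Species ~ ., data = iris, controls = crtl)
pred <- predict(mod, newdata = iris[, -5])  # the Species column is not needed here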

I get many predictions after running predict.lm in R for 1 row of new input data

I used the ApacheData data set (83784 rows) to build a linear regression model:
fit <- lm(tomorrow_apache ~ as.factor(state_today)
          + as.numeric(daily_creat)
          + as.numeric(last1yr_min_hosp_icu_MDRD)
          + as.numeric(bun)
          + as.numeric(urin)
          + as.numeric(category6)
          + as.numeric(category7)
          + as.numeric(other_fluid)
          + as.factor(daily)
          + as.factor(age)
          + as.numeric(apache3)
          + as.factor(mv)
          + as.factor(icu_loc)
          + as.factor(liver_tr_before_admit)
          + as.numeric(min_GCS)
          + as.numeric(min_PH)
          + as.numeric(previous_day_creat)
          + as.numeric(previous_day_bun),
          data = ApacheData)
And I want to use this model to predict a new input so I give each predictor variable a value:
predict(fit, data=data.frame(state_today=1, daily_creat=2.3, last1yr_min_hosp_icu_MDRD=3, bun=10, urin=0.01, category6=10, category7=20, other_fluid=0, daily=2 , age=25, apache3=12, mv=1, icu_loc=1, liver_tr_before_admit=0, min_GCS=20, min_PH=3, previous_day_creat=2.1, previous_day_bun=14))
I expect a single value as the prediction for this new input, but I get many, many predictions! I don't know why this is happening. What am I doing wrong?
Thanks a lot for your time!
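The immediate problem is the argument name: predict.lm takes new observations through newdata, not data. An unrecognised data argument is silently absorbed by ..., so predict() falls back to returning the fitted values for all 83784 training rows. A sketch of the corrected call, reusing the values from the question:
newobs <- data.frame(state_today = 1, daily_creat = 2.3,
                     last1yr_min_hosp_icu_MDRD = 3, bun = 10, urin = 0.01,
                     category6 = 10, category7 = 20, other_fluid = 0,
                     daily = 2, age = 25, apache3 = 12, mv = 1, icu_loc = 1,
                     liver_tr_before_admit = 0, min_GCS = 20, min_PH = 3,
                     previous_day_creat = 2.1, previous_day_bun = 14)
predict(fit, newdata = newobs)  # one row in, one prediction out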
You may also want to try the excellent effects package in R (?effects). It's very useful for graphing the predicted values from your model by setting the inputs on the right-hand side of the equation to particular values. I can't reproduce the example you've given in your question, but to give you an idea of how to quickly extract predicted values in R and then plot them (since this is vital to understanding what they mean), here's a toy example using the Prestige data set:
install.packages("effects") # installs the "effects" package in R
library(effects) # loads the "effects" package
data(Prestige) # loads the Prestige data set
m <- lm(prestige ~ income + education + type, data=Prestige)
# this last step creates predicted values of the outcome based on a range of values
# on the "income" variable and holding the other inputs constant at their mean values
eff <- effect("income", m, default.levels=10)
plot(eff) # graphs the predicted probabilities

R: How to make column of predictions for logistic regression model?

So I have a data set called x. The contents are simple enough to write out, so I'll just outline them here:
the dependent variable, Report, in the first column is binary yes/no (0 = no, 1 = yes)
the subsequent 3 columns are all categorical variables (race.f, sex.f, gender.f) that have all been converted to factors, and they're designated by numbers (e.g. 1= white, 2 = black, etc.)
I have run a logistic regression on x as follows:
glm <- glm(Report ~ race.f + sex.f + gender.f, data=x,
family = binomial(link="logit"))
And I can check the fitted probabilities by looking at summary(glm$fitted).
My question: how do I create a fifth column on the right side of this data set x that will include the predictions (i.e. fitted probabilities) for Report? Of course, I could just insert glm$fitted as a column, but I'd like to write code that predicts from whatever is in the race, sex, and gender columns, for more generalized use.
Right now I have the following code, which I hope will create a predicted-probability column as well as lower and upper bounds for the confidence interval.
xnew <- cbind(xnew, predict(glm5, newdata = xnew, type = "link", se = TRUE))
xnew <- within(xnew, {
PredictedProb <- plogis(fit)
LL <- plogis(fit - (1.96 * se.fit))
UL <- plogis(fit + (1.96 * se.fit))
})
Unfortunately I get the error:
Error in eval(expr, envir, enclos) : object 'race.f' not found
after the cbind code.
Anyone have any idea?
There appear to be a few typos in your code. First, the xnew line calls glm5, but your model, as far as I can see, is named glm (by the way, using glm, the name of the fitting function, as the name of your output is probably not a good idea). Second, make sure the variable race.f is actually in the data set you wish to predict from; my guess is that R can't find that variable, hence the error.
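A sketch of the corrected pieces (column names are taken from the question; your se = TRUE happens to partially match predict.glm's se.fit argument, but the full name is clearer):
mod <- glm(Report ~ race.f + sex.f + gender.f, data = x,
           family = binomial(link = "logit"))   # avoid masking the glm function
xnew <- x[, c("race.f", "sex.f", "gender.f")]   # predictors must be present
preds <- predict(mod, newdata = xnew, type = "link", se.fit = TRUE)
xnew <- cbind(xnew, fit = preds$fit, se.fit = preds$se.fit)
xnew <- within(xnew, {
  PredictedProb <- plogis(fit)
  LL <- plogis(fit - 1.96 * se.fit)
  UL <- plogis(fit + 1.96 * se.fit)
})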
