Using fitted() on output from lm with dummy variables in R

reg_ss <- predict(lm(stem_d~stand_id*yr,ss))
fitted.values(reg_ss)
#Error: $ operator is invalid for atomic vectors
I have tried this with fitted() and fitted.values() and receive the same error.
stand_id is a factor with 300+ levels and yr is an integer from 1 to 19; both are stored as numbers.
I have data on tree stem density collected in stands every 2-3 years for 20 years. I want to run a linear regression and predict stem density for stands in the years between samplings, i.e. use data from years 1 and 3 to predict stem density in year 2.
Any suggestions on how I can get predicted values using fitted() or any other method would be greatly appreciated. I suspect it has something to do with dummy variables assigned to the categories but can't seem to find any information on a solution.
Thanks in advance!

If you want fitted values, you should not be calling predict() first.
reg_ss <- lm(stem_d~stand_id*yr,ss)
predict(reg_ss)
fitted(reg_ss)
When you don't pass new data to predict, it's basically doing the same thing as fitted so you get essentially the same values back. Both fitted and predict will return a simple named vector. You cannot use fitted on a named vector (hence the error message).
If you want to predict unobserved values, you need to pass a newdata= parameter to predict(). You should pass in a data.frame with columns named "stand_id" and "yr" just like ss. Make sure to match up the factor levels as well.
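For example, a minimal sketch along those lines (the stand ID "A101" is hypothetical; substitute one of your own levels):
# Build a newdata frame whose factor levels match the training data
new_obs <- data.frame(stand_id = factor("A101", levels = levels(ss$stand_id)),
                      yr = 2)
predict(reg_ss, newdata = new_obs)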

Related

Using predict for linear model with NA values in R

I have a dataset of ~32,000 observations, for which I have created a linear model. ~12,000 observations were deleted due to missingness.
I am trying to use the predict function to backtest the expected value for each of my 32,000 data points, but [as expected] this gives the error 'replacement has 20000 rows, data has 32000'.
Is there any way I can use the model fitted on the 20,000 rows to predict values for all 32,000? I am happy to have 'zero' for observations that don't have results for every column used in the model.
If not, how can I at least subset the 32,000-row dataset so that it only includes the 20,000 complete observations? If my model was lm(a ~ x+y+z, data=data), for example, how would I filter data to only include observations with full data in x, y and z?
The best thing to do is to use na.action=na.exclude when you fit the model in the first place: from ?na.exclude,
when ‘na.exclude’ is used the residuals and predictions are padded to the correct length by inserting ‘NA’s for cases omitted by ‘na.exclude’.
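A minimal sketch of this approach, reusing the formula from the question:
# Refit with na.exclude so predictions line up with the original rows
fit <- lm(a ~ x + y + z, data = data, na.action = na.exclude)
pred <- predict(fit)        # padded with NA for the rows dropped due to missingness
length(pred) == nrow(data)  # TRUE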
The problem with using a 0 instead of a missing value is that the linear model will interpret the value as actually having been 0 rather than missing. For instance, if your variable x had a range of 10-100, the model would interpret your imputed 0's as observations below the training data's range and give you artificially low predictions. If you want to make predictions for the rows with missing values, you will have to do some value imputation (i.e. replace the NAs with the mean, the median, or using k-nearest neighbors).
Using
data[complete.cases(data),]
gives you only observations without NAs. Perhaps that's what you are looking for.
Another way is
na.omit(data)
which additionally records the indices of the removed observations in its "na.action" attribute.
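A toy illustration of the difference:
d <- data.frame(x = c(1, NA, 3), y = c(4, 5, NA))
d[complete.cases(d), ]  # keeps only row 1
na.omit(d)              # same rows; removed indices stored in attr(., "na.action")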

pool.compare generates non-conformable arguments error

Alternate title: Model matrix and set of coefficients show different numbers of variables
I am using the mice package for R to do some analyses. I wanted to compare two models (held in mira objects) using pool.compare(), but I keep getting the following error:
Error in model.matrix(formula, data) %*% coefs : non-conformable arguments
The binary operator %*% indicates matrix multiplication in R.
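A toy example of the conformability requirement behind the error:
X <- matrix(1, nrow = 3, ncol = 4)
b <- c(1, 2, 3)  # length 3, but X has 4 columns
X %*% b          # Error in X %*% b : non-conformable arguments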
The expression model.matrix(formula, data) produces "The design matrix for a regression-like model with the specified formula and data" (from the R Documentation for model.matrix {stats}).
In the error message, coefs is drawn from est1$qbar, where est1 is a mipo object, and the qbar element is "The average of complete data estimates. The multiple imputation estimate." (from the documentation for mipo-class {mice}).
In my case
est1$qbar is a numeric vector of length 36
data is a data.frame with 918 observations of 82 variables
formula is class 'formula' containing the formula for my model
model.matrix(formula, data) is a matrix with dimension 918 x 48.
How can I resolve/prevent this error?
As occasionally happens, I found the answer to my own question while writing the question.
The clue was that the estimates for categorical variables in est1$qbar only exist if that level of that variable was present in the data. Some of my variables are factors in which not every level is represented. This caused the warning "contrasts dropped from factor <variable name> due to missing levels", which I foolishly ignored.
On the other hand, looking at dimnames(model.matrix.temp)[[2]] shows that the model matrix has one column for each level of each factor variable, regardless of whether that level was present in the data. So, although the contrasts for missing factor levels are dropped when estimating the coefficients, those factor levels still appear in the model matrix. This means the model matrix has more columns than the length of est1$qbar (the vector of estimated coefficients), so the matrix multiplication is not going to work.
The answer here is to fix the factor variables so that there are no unused levels. This can be done with the factor() function (as explained here). Unfortunately, this needs to be done on the original dataset, prior to imputation.
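For example, a minimal sketch, assuming the pre-imputation data frame is called dat and grp is a hypothetical factor column:
dat <- droplevels(dat)      # drop unused levels from every factor column
dat$grp <- factor(dat$grp)  # or do it one column at a time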

Weighted linear regression in R [duplicate]

This question already has an answer here:
R: lm() result differs when using `weights` argument and when using manually reweighted data
I would like to do a linear regression with a weighting factor for an analytical chemistry calibration curve. The x values are concentration and assumed to have no error. The y values are instrument response and the variation is assumed proportional to concentration. So, I would like to use a 1/x weighting factor for the linear regression. The data set is simply ten concentrations with a single measurement for each. Is there an easy way to do this in R?
The answer can be found on a somewhat older question on Cross Validated. The lm() function (which represents the usual method of applying a linear regression), has an option to specify weights. As shown in the answer on the link, you can use a formula in the weights argument. In your case, the formula will likely take the form of 1/data$concentration.
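A minimal sketch for the 1/x case, assuming a calibration data frame cal with columns concentration and response:
fit <- lm(response ~ concentration, data = cal, weights = 1/concentration)
summary(fit)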
As suggested by hrbrmstr, I'm adding mpiktas's actual answer from Cross Validated:
I think the R help page of lm answers your question pretty well. The only requirement for weights is that the vector supplied must be the same length as the data. You can even supply only the name of the variable in the data set; R will take care of the rest, NA management, etc. You can also use formulas in the weights argument. Here is an example:
x <- c(rnorm(10), NA)
df <- data.frame(y = 1 + 2*x + rnorm(11)/2, x = x, wght1 = 1:11)
## Fancy weights as numeric vector
summary(lm(y ~ x, data = df, weights = (df$wght1)^(3/4)))
## Fancy weights as formula on a column of the data set
summary(lm(y ~ x, data = df, weights = I(wght1^(3/4))))
## Mundane weights as a column of the data set
summary(lm(y ~ x, data = df, weights = wght1))
Note that weights must be positive, otherwise R will produce an error.

random forest: error in dealing with factor levels in R

I am using a random forest model in R to predict a binary outcome, 0 or 1. I have categorical variables (coded as numbers) in my input data, which are treated as factors during training. I use the factor() function in R to convert each such variable to a factor. So for every categorical variable x, my code is like this:
feature_x1 = factor(feature_x1)  # Convert the variable into a factor in the training data.
# This variable takes 3 levels: 0, 1, 2
This works perfectly fine while training the model. Let us assume my model object is rf_model. When running the model on new data, which is just a vector of numbers, I first convert the numbers into factors for feature_x1:
newdata = data.frame(1, 2)
colnames(newdata) = c("feature_x1", "feature_x2")
newdata$feature_x1 = factor(newdata$feature_x1)
score = predict(rf_model, newdata, type = "prob")
I am receiving the following error
Error in predict.randomForest(rf_model, newdata,type = "prob") :
New factor levels not present in the training data
How do I deal with this error? In reality, after training the model we will always have to deal with data for which the outcome is unknown, and which may be just a single record.
Please let me know if more clarity or code is required
Try
newdata$feature_x1 <- factor(newdata$feature_x1, levels=levels(feature_x1))
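A minimal sketch of the full round trip, reusing the names from the question (feature_x1 here is the factor from the training data):
train_levels <- levels(feature_x1)
newdata <- data.frame(feature_x1 = 1, feature_x2 = 2)
newdata$feature_x1 <- factor(newdata$feature_x1, levels = train_levels)
score <- predict(rf_model, newdata, type = "prob")
Any value in newdata that does not match a training level becomes NA rather than triggering the "new factor levels" error, so it is worth checking for NAs afterwards.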

How to call randomForest predict for use with ROCR?

I am having a hard time understanding how to build a ROC curve, and now I have come to the conclusion that maybe I am not creating the model correctly. I am running a randomForest model on a dataset where the class attribute "y_n" is 0 or 1. I have divided the dataset into bank_training and bank_testing for prediction purposes.
Here are the steps I take:
bankrf <- randomForest(y_n ~ ., data = bank_training, mtry = 4, ntree = 2,
                       keep.forest = TRUE, importance = TRUE)
bankrf.pred <- predict(bankrf, bank_testing, type = 'response',
                       predict.all = TRUE, norm.votes = TRUE)
Is what I have done so far correct? The bankrf.pred object that is created is a list with two components named aggregate and individual. I don't understand where these two names came from. Moreover, when I run:
summary(bankrf.pred)
           Length Class  Mode
aggregate  22606  factor numeric
individual 45212  -none- character
What does this summary mean? The training and testing datasets are 22605 and 22606 rows long, respectively. If someone can explain to me what is happening, I would be very grateful. I think there is something wrong in all this.
When I try to design the ROC curve with ROCR I use the following code:
library(ROCR)
pred <- prediction(bank_testing$y_n, bankrf.pred$c(0,1))
Error in is.data.frame(labels) : attempt to apply non-function
Is it just a mistake in the way I try to create the ROC curve, or is it from the beginning with randomForest?
The documentation for the function you are attempting to use includes this description of its two main arguments:
predictions: A vector, matrix, list, or data frame containing the predictions.
labels: A vector, matrix, list, or data frame containing the true class labels. Must have the same dimensions as 'predictions'.
You are currently passing the variable y_n to the predictions argument, and what looks to me like nonsense to the labels argument.
The predictions will be stored in the output of the random forest model. As documented at ?predict.randomForest, it will be a list with two components. aggregate will contain the predicted values for the entire forest, while individual will contain the predicted values for each individual tree.
So you probably want to do something like this:
prediction(bankrf.pred$aggregate, bank_testing$y_n)
See how that works? The predicted values are passed to the predictions argument, while the "labels", or true values, are passed to the labels argument.
You should drop the predict.all=TRUE argument from predict if you simply want the predicted classes. By using predict.all=TRUE you are telling the function to keep the predictions of all individual trees, rather than just the aggregated prediction from the forest.
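Putting it together, a minimal sketch for an actual ROC curve; note that ROCR's prediction() expects continuous scores, so the class probabilities from predict (type = "prob") are usually more useful here than hard class labels (this assumes y_n was a factor with levels "0" and "1" when the forest was trained):
library(ROCR)
bankrf.prob <- predict(bankrf, bank_testing, type = "prob")[, "1"]  # P(class "1")
pred <- prediction(bankrf.prob, bank_testing$y_n)
perf <- performance(pred, "tpr", "fpr")
plot(perf)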
