I have a dataset of ~32,000 observations, for which I have created a linear model; ~12,000 observations were deleted due to missingness.
I am trying to use the predict function to backtest the expected value for each of my 32,000 data points, but [as expected], this gives the error 'replacement has 20000 rows, data has 32000'.
Is there any way I can use that model made on the 20,000 rows to predict that of the 32,000? I am happy to have 'zero' for observations that don't have results for every column used in the model.
If not, how can I at least subset the 32,000-row dataset correctly so that it only includes the 20,000 complete observations? If my model were lm(a ~ x + y + z, data = data), for example, how would I filter data to only include observations with full data in x, y and z?
The best thing to do is to use na.action=na.exclude when you fit the model in the first place: from ?na.exclude,
when ‘na.exclude’ is used the residuals and predictions are padded to the correct length by inserting ‘NA’s for cases omitted by ‘na.exclude’.
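For example, a minimal sketch using the made-up column names from the question (a, x, y, z):
fit <- lm(a ~ x + y + z, data = data, na.action = na.exclude)
data$pred <- predict(fit)  # padded with NA for incomplete rows, so the length matches nrow(data)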
The problem with using a 0 instead of a missing value is that the linear model will interpret the value as actually having been 0 rather than missing. For instance, if your variable x had a range of 10-100, the model would interpret your imputed 0's as observations below the training data's range and give you artificially low predictions. If you want to make a prediction for the rows with missing values, you're going to have to do some value imputation (i.e. replace the NAs with the mean, the median, or using k-nearest neighbors).
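For instance, a minimal mean-imputation sketch (assuming a numeric predictor x, as in the question's model):
x_mean <- mean(data$x, na.rm = TRUE)   # mean of the observed values only
data$x[is.na(data$x)] <- x_mean        # fill the missing entries with that mean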
Using
data[complete.cases(data),]
gives you only observations without NAs. Perhaps that's what you are looking for.
Another way is
na.omit(data)
which additionally records the indices of the removed observations.
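A quick illustration of the difference, using the generic data frame from the question:
clean <- na.omit(data)   # same rows as data[complete.cases(data), ]
na.action(clean)         # indices of the removed observations, stored as an attribute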
Related
I have been trying to do KNN imputation for some missing values in R, but it has been producing negative values in columns that should never contain negatives, like age. (Age does have missing values, but I don't want it imputed with negative values.)
Here is my code:
# KNN imputation: fit the preprocessing model on the training data
library(caret)  # provides preProcess()
library(RANN)   # required for knnImpute
preProcess_missingdata_model <- preProcess(train, method='knnImpute')
preProcess_missingdata_model
# Use the imputation model to predict the values of missing data points
train <- predict(preProcess_missingdata_model, newdata = train)
What should I do to overcome these negative values being induced?
Any suggestions would be highly appreciated. Thanks.
You can explicitly tell preProcess() which columns you would like to impute. This can be done as follows:
preProcess_missingdata_model <- preProcess(train[,c('Embarked', 'Sex')], method='knnImpute')
You can even specify which rows to use by supplying an index vector before the comma.
knnImpute in the caret package automatically centers and scales your data, so negative values are normal (in fact, if you look at the printed result of preProcess you can see this).
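If you need the imputed values back on the original scale, you can undo the centering/scaling with the training means and standard deviations stored in the preProcess object (a sketch, assuming an 'Age' column, and assuming your caret version stores these in $mean and $std; check str(preProcess_missingdata_model)):
pp <- preProcess_missingdata_model
train$Age <- train$Age * pp$std["Age"] + pp$mean["Age"]  # back to the original units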
I am working on a data set that has 1008 total observations and I remove the missing cases to create the linear model. The training set now has 937 total observations.
The code I used to create my lm is below:
lm.data <- lm(N001S009P088.Average.Horizontal.Wind.Speed.RG.6 ~ C01.Average.Wind.Speed.48.5m+
C02.Average.Wind.Speed.48.5m+ C03.Average.Wind.Speed.40.0m+
C05.Average.Wind.Speed.15.0m+ N002S007P004.Average.Pressure..mb.+
N001S007P006.Average.wind.speed,
data=complete.data)
I am now trying to go back and predict the missing values from the original dataset using the lm from the training set.
all.data$Expected.RG.6.Average.Wind.Speed <- predict(lm.data)
When I run this process I get the following error message:
Error in $<-.data.frame(*tmp*, "Expected.RG.6.Average.Wind.Speed", :
replacement has 947 rows, data has 1008
These predicted values will be plotted against the actual values to see how effective the linear model is in its prediction.
Additionally, the entire column is filled with missing data values.
Any insight on this would be greatly appreciated.
I have a dataset (found here: https://netfiles.umn.edu/users/nacht001/www/nachtsheim/Kutner/Appendix%20C%20Data%20Sets/APPENC01.txt) and I have done some R coding for linear regression. In the attached dataset the columns are not labeled. I had to label the columns of the dataset and save it as a CSV, and I apologize I can't get that on here… but the columns I am using are column 3 (age), column 4 (infection), column 5 (culratio), column 9 (region), column 10 (census), and column 12 (service). I named the dataset hospital.
I am supposed to "for each geographic region, regress infection risk (Y) against the predictor variables age, culratio, census, service using a first-order regression model." Then I need to find the MSE for each region. This is the code I have:
NE<- subset(hospital, region=="1")
NC<- subset(hospital, region=="2")
S<- subset(hospital, region=="3")
W<- subset(hospital, region=="4")
Then, to fit a first-order linear regression model, I use the basic code for each region:
NE.Model<- lm(NE$infection~ NE$age + NE$culratio + NE$census + NE$service)
summary(NE.Model)
and I can get the adjusted R squared value, but how do I find MSE from this output?
Moving my comment to an answer. The "errors" or "residuals" are part of the model object, NE.Model$residuals, so getting the mean squared error is as easy as mean(NE.Model$residuals^2).
Just as a note, you could do this in fewer steps by fitting a region fixed effect term in your model and then calculating the MSE for each subset of the residuals. Same difference, really.
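A sketch of that single-model version (assuming the hospital data frame from the question; interacting region with every predictor reproduces the separate per-region fits):
full.model <- lm(infection ~ (age + culratio + census + service) * factor(region), data = hospital)
tapply(residuals(full.model)^2, hospital$region, mean)  # per-region MSE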
reg_ss <- predict(lm(stem_d~stand_id*yr,ss))
fitted.values(reg_ss)
#Error: $ operator is invalid for atomic vectors
I have tried this with fitted() and fitted.values() and receive the same error.
stand_id is a factor with 300+ levels and yr is an integer from 1 to 19; both are stored as numbers.
I have data on tree stem density collected in stands every 2-3 years for 20 years. I want to run a linear regression and predict stem density for stands in the years between samplings, e.g. use data from years 1 and 3 to predict stem density in year 2.
Any suggestions on how I can get predicted values using fitted() or any other method would be greatly appreciated. I suspect it has something to do with dummy variables assigned to the categories but can't seem to find any information on a solution.
Thanks in advance!
If you want fitted values, you should not be calling predict() first.
# fit the model first, then extract fitted values from the model object
reg_ss <- lm(stem_d ~ stand_id * yr, data = ss)
predict(reg_ss)  # same as fitted(reg_ss) when no newdata is supplied
fitted(reg_ss)
When you don't pass new data to predict(), it does essentially the same thing as fitted(), so you get the same values back. Both fitted() and predict() return a simple named vector, and you cannot call fitted() on a vector (hence the "$ operator is invalid for atomic vectors" error in your code).
If you want to predict unobserved values, you need to pass a newdata= parameter to predict(). You should pass in a data.frame with columns named "stand_id" and "yr" just like ss. Make sure to match up the factor levels as well.
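For example, a minimal sketch (column names taken from the question; the chosen stand and year are hypothetical):
reg_ss <- lm(stem_d ~ stand_id * yr, data = ss)
newd <- data.frame(stand_id = ss$stand_id[1],  # an existing stand; reuses the model's factor levels
                   yr = 2)                     # the unsampled year you want to predict
predict(reg_ss, newdata = newd)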
Consider fitting a coxph model with, say, 100 data points. Only 95 are included in the analysis, while 5 are excluded due to being NA (i.e. missingness). I extract the residuals on the fitted data so I have a residual vector with 95 observations. I would like to include the residuals back into the original data frame, but I can't do this since the lengths are different.
How do I identify which observations from the original data frame were not included in the model, so I can exclude/delete them to make the two lengths the same?
(The original data is much larger so it's hard to locate where data are missing...)
Re-fit your model, setting the na.action argument to na.exclude. This pads the residuals and fitted values that are part of the fitted object with NAs. If your original model is zn50:
zn50_na <- update(zn50, na.action=na.exclude)
This should give you residuals(zn50_na) and fitted(zn50_na) of the appropriate length. See ?na.omit for more info.
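With the padded versions, putting the residuals back into the original data frame is then a one-line assignment (a sketch, assuming the original data frame is called df):
df$resid <- residuals(zn50_na)  # length now matches nrow(df); the 5 dropped rows get NA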