Predict values for missing data in R

I am working on a data set that has 1008 total observations. I removed the cases with missing values to fit the linear model, so the training set now has 937 total observations.
The code I used to create my lm is below:
lm.data <- lm(N001S009P088.Average.Horizontal.Wind.Speed.RG.6 ~ C01.Average.Wind.Speed.48.5m+
C02.Average.Wind.Speed.48.5m+ C03.Average.Wind.Speed.40.0m+
C05.Average.Wind.Speed.15.0m+ N002S007P004.Average.Pressure..mb.+
N001S007P006.Average.wind.speed,
data=complete.data)
I am now trying to go back and predict the missing values from the original dataset using the lm from the training set.
all.data$Expected.RG.6.Average.Wind.Speed <- predict(lm.data)
When I run this process I get the following error message:
Error in `$<-.data.frame`(`*tmp*`, "Expected.RG.6.Average.Wind.Speed", :
replacement has 947 rows, data has 1008
These predicted values will be plotted against the actual values to see how effective the linear model is in its prediction.
Additionally, the entire column is filled with missing data values.
Any insight on this would be greatly appreciated.
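One possible way around this, sketched on toy data rather than the wind-speed set: predict() called without newdata returns one fitted value per row that was used in the fit, so it cannot fill a column of the full data frame. Passing the full data frame as newdata returns one value per row of that frame, with NA wherever a predictor is missing. The column names x1, x2, y and expected.y below are made up for illustration.

```r
# Toy stand-in for all.data (hypothetical columns x1, x2, y)
set.seed(1)
all.data <- data.frame(x1 = rnorm(20), x2 = rnorm(20))
all.data$y <- 1 + 2 * all.data$x1 - all.data$x2 + rnorm(20, sd = 0.1)
all.data$y[c(3, 7)] <- NA   # response missing in two rows
all.data$x2[12] <- NA       # one predictor missing in another row

# Fit on the complete cases only, as in the question
complete.data <- na.omit(all.data)
fit <- lm(y ~ x1 + x2, data = complete.data)

# Without newdata: one value per training row
length(predict(fit))                      # 17

# With newdata: one value per row of all.data
all.data$expected.y <- predict(fit, newdata = all.data)
length(all.data$expected.y)               # 20
```

Rows 3 and 7 get a prediction because only the response is missing there; row 12 stays NA because x2 is missing.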

Related

How can I predict a variable in the test data set, if the number of observations in the data set is smaller than in the training data set? Using R

Hi, I am new to R and have been stuck on this problem for a while:
I am using Bayes linear regression to make a model that predicts bear weight. The model looks like this:
bears.bayes<- stan_glm(WEIGHT ~ CHEST + NECK + HEADLEN +AGE, data=bears.train, seed=111)
As you can see the model is using a full data set, that has had the missing values omitted. The original data set had missing values only on the predicted (weight) variable. I am now trying to use the model to predict the NA's in the original data set.
What I tried is the following:
bears.test$predicted.weight <- predict(bears.bayes, bears.train)
bears.test$predicted.weight <- predict(bears.bayes, bears.test)
The error I get says that I can't predict as there are 44 observations in the full data set, but only 10 in the test data set.
Error in `$<-.data.frame`(`*tmp*`, predicted.weight, value = c(`52` = -1.77440983341086, :
replacement has 44 rows, data has 10
Can someone explain why that is, and whether it's possible to predict using data sets of unequal size?

Is it possible to perform a zero inflated poisson regression model in R with more than 4 variables?

This is my first time posting on here, so I apologize if this isn't the correct format/info. But I'm attempting to run a model in R with the zeroinfl function (pscl package). My data consists of insect count data and 5 different variables, which are 5 different habitat types.
# Zero-inflated Poisson model
summary(m1 <- zeroinfl(count~Hab_1+Hab_2+Hab_3+Hab_4+Hab_5, data = insect_data))
I'm able to run the model when I only use 4 variables in the equation, but when I add the fifth variable it gives me this error code:
Error in optim(fn = loglikfun, gr = gradfun, par = c(start$count, start$zero, :
non-finite value supplied by optim
Is there a way to run a zero inflated model using all 5 of these variables or am I missing something? Any input would be greatly appreciated, thank you!

Using predict for linear model with NA values in R

I have a dataset of ~32,000 observations, for which I have created a linear model. ~12,000 observations were deleted due to missingness.
I am trying to use the predict function to backtest the expected value for each of my 32,000 data points, but [as expected], this gives the error 'replacement has 20000 rows, data has 32000'.
Is there any way I can use that model made on the 20,000 rows to predict that of the 32,000? I am happy to have 'zero' for observations that don't have results for every column used in the model.
If not, how can I at least subset the 32,000 dataset correctly such that it only includes the 20,000 whole observations? If my model was lm(a ~ x+y+z, data=data), for example, how would I filter data to only include observations with full data in x, y and z?
The best thing to do is to use na.action=na.exclude when you fit the model in the first place: from ?na.exclude,
when ‘na.exclude’ is used the residuals and
predictions are padded to the correct length by inserting ‘NA’s
for cases omitted by ‘na.exclude’.
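A minimal sketch of the difference, on a made-up ten-row data frame (the names dat, fit.omit and fit.excl are illustrative only):

```r
# Toy data: x is missing in row 9, y is missing in row 3
dat <- data.frame(
  x = c(1, 2, 3, 4, 5, 6, 7, 8, NA, 10),
  y = c(2.1, 3.9, NA, 8.2, 9.8, 12.1, 14.0, 15.9, 18.1, 20.2)
)

# na.omit drops the incomplete rows and forgets where they were
fit.omit <- lm(y ~ x, data = dat, na.action = na.omit)
length(fitted(fit.omit))      # 8

# na.exclude drops them for fitting but pads the output with NA
fit.excl <- lm(y ~ x, data = dat, na.action = na.exclude)
length(fitted(fit.excl))      # 10

# The padded vectors line up with the original rows
dat$fitted <- predict(fit.excl)
dat$resid  <- residuals(fit.excl)
```

The coefficients of the two fits are the same; only the length of the extracted fitted values and residuals changes.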
The problem with using a 0 instead of a missing value is that the linear model will interpret the value as actually having been 0 instead of missing. For instance, if your variable x had a range of 10-100, the model would interpret your imputed 0's as observations lower than the training data's range and give you artificially low predictions. If you want to make a prediction for the rows with missing values, you're going to have to do some value imputation (i.e. replace the NAs with the mean or the median, or impute them using k-nearest neighbors).
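To illustrate the point about zeros, here is a small sketch with invented numbers where x ranges from 10 to 100. It compares the prediction for a row whose missing x was replaced by 0 with the prediction for the same row after mean imputation. Mean imputation is only one of the options mentioned above, and all object names are hypothetical.

```r
# Toy data: x is missing in row 3
dat <- data.frame(
  x = c(10, 40, NA, 70, 100),
  y = c(1.2, 4.1, 5.0, 7.3, 9.9)
)
fit <- lm(y ~ x, data = dat)   # row 3 is dropped when fitting

# Option 1: replace the NA with 0 (outside the observed 10-100 range)
zeroed <- dat
zeroed$x[is.na(zeroed$x)] <- 0

# Option 2: replace the NA with the mean of the observed x values
x.mean <- mean(dat$x, na.rm = TRUE)
imputed <- dat
imputed$x[is.na(imputed$x)] <- x.mean

pred.zero <- predict(fit, newdata = zeroed)[3]
pred.mean <- predict(fit, newdata = imputed)[3]

# With a positive slope, the zero-filled row is predicted much lower
c(zero = unname(pred.zero), mean = unname(pred.mean))
```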
Using
data[complete.cases(data),]
gives you only observations without NAs. Perhaps that's what you are looking for.
Another way is
na.omit(data)
which additionally records the indices of the removed observations, in the "na.action" attribute of the result.
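Both calls look at every column by default. A sketch of restricting the check to the model variables only, for the lm(a ~ x+y+z) example from the question; the toy values and the extra column named other are assumptions for illustration:

```r
# Toy data with one column ("other") that the model does not use
data <- data.frame(
  a     = c(1, 2, 3, 4, 5),
  x     = c(1, NA, 3, 4, 5),
  y     = c(2, 2, NA, 4, 5),
  z     = c(1, 2, 3, 4, 5),
  other = c(NA, 1, 1, 1, 1)
)

# Keep rows that are complete in the model variables only
keep <- complete.cases(data[, c("a", "x", "y", "z")])
model.rows <- data[keep, ]
nrow(model.rows)                # 3 (rows 1, 4 and 5)

# na.omit() on the whole frame also drops row 1 because of "other"
cleaned <- na.omit(data)
nrow(cleaned)                   # 2

# Indices of the rows that na.omit() removed
dropped <- as.integer(attr(cleaned, "na.action"))
dropped                         # 1 2 3
```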

r - using the output of Lasso regression to index original dataset

I am trying to use the output of the significant predictor variables that I obtained from running a lasso regression to narrow down the number of variables I place in my logistic regression function, but am having trouble properly indexing the original dataset to create a subset of my data.
I was able to pull out the non-zero variables from my lasso model using:
lasso.coef = colnames(stocks.transform)[which(coef(lasso.mod,s=bestlam)!=0)]
lasso.coef
This yielded a vector of all the non-zero attributes. I want to use this vector to index my original data set stocks.transform:
lasso.out = stocks.transform[lasso.coef]
When I run this code I get the following error:
Error in `[.data.frame`(stocks.transform, lasso.coef) :
undefined columns selected
I am unsure how to define the column any further.

locate missingness in fitted model in R

Consider fitting a coxph model with, say, 100 data points. Only 95 are included in the analysis, while 5 are excluded due to being NA (i.e. missingness). I extract the residuals on the fitted data so I have a residual vector with 95 observations. I would like to include the residuals back into the original data frame, but I can't do this since the lengths are different.
How do I identify which observations from the original data frame were not included in the model, so I can exclude/delete them to make the two lengths the same?
(The original data is much larger so it's hard to locate where data are missing...)
Re-fit your model, setting the na.action argument to na.exclude. This pads the residuals and fitted values that are part of the fitted object with NAs. If your original model is zn50:
zn50_na <- update(zn50, na.action=na.exclude)
This should give you residuals(zn50_na) and fitted(zn50_na) of the appropriate length. See ?na.omit for more info.
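The same idea on a toy example. An lm fit stands in for the coxph model here, only to keep the sketch free of extra packages; dat, fit and fit_na are hypothetical names. Calling na.action() on the original fit also answers the question directly, since it lists the rows that were dropped.

```r
# Toy data: y is missing in row 2, x is missing in row 3
dat <- data.frame(
  x = c(1, 2, NA, 4, 5, 6),
  y = c(1.1, NA, 3.2, 3.9, 5.1, 6.2)
)

# Original fit: incomplete rows are silently dropped
fit <- lm(y ~ x, data = dat, na.action = na.omit)
length(residuals(fit))                 # 4

# Which rows of dat were left out of the fit?
omitted <- as.integer(na.action(fit))
omitted                                # 2 3

# Re-fit with na.exclude so the residuals are padded with NA
fit_na <- update(fit, na.action = na.exclude)
length(residuals(fit_na))              # 6

# The padded residuals can go straight back into the data frame
dat$resid <- residuals(fit_na)
```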
