Predicting with new data having greater length - r

I would like to make a prediction on a dataset which is longer than the dataframe in which my training set is present.
Df<-data.frame(MW=c(192700,117900,99300,54100,37800,29500,20200,740),
Bands1=c(0.0427334,0.2393070,0.3206159,0.5732002,0.7228141,0.8164857,0.8462922,0.9273532))
Df.pred<-data.frame(Band2=c(0.4470235,0.4884748,0.5345757,0.5898747,0.6405655,0.6774131,0.7557672,0.7972277,0.8940148,0.9493461,1.0138248,1.0414651))
mod<-lm(log10(Df$MW)~Df$Bands1, data=Df) ## Making the model
Df.pred$PredMW<-predict(lm(log10(Df$MW)~Df$Bands1, data=Df), newdata=Df.pred) ## Asking the model to predict values corresponding to Df.pred based on mod
I seem to get the following output:
Warning message:
'newdata' had 12 rows but variables found have 8 rows
How do I solve this? I have read the ?predict as well as ?predict.lm. I am unable to figure this out.

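Rename the Df.pred column to Bands1, the same as in Df, and use bare variable names in the formula instead of Df$MW and Df$Bands1. predict() looks up the formula's variable names in newdata; with the Df$ prefix it always finds the 8 training rows instead: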
Df.pred <- data.frame(Bands1 = c(0.4470235, 0.4884748 ,0.5345757 ,0.5898747 ,0.6405655,
0.6774131, 0.7557672, 0.7972277, 0.8940148, 0.9493461,
1.0138248, 1.0414651))
mod <- lm(log10(MW) ~ Bands1, data = Df) ## Making the model
Df.pred$PredMW <- predict(mod, newdata = Df.pred) ## Asking the model to predict values corresponding to Df.pred based on mod
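To see the difference between the two formula styles side by side, here is a minimal sketch using the data from the question. The object names new_bands, bad and good are made up for this illustration. Note that the model predicts log10(MW), so the sketch also back-transforms with 10^ to get molecular weights.

```r
Df <- data.frame(
  MW     = c(192700, 117900, 99300, 54100, 37800, 29500, 20200, 740),
  Bands1 = c(0.0427334, 0.2393070, 0.3206159, 0.5732002,
             0.7228141, 0.8164857, 0.8462922, 0.9273532)
)
new_bands <- data.frame(
  Bands1 = c(0.4470235, 0.4884748, 0.5345757, 0.5898747, 0.6405655, 0.6774131,
             0.7557672, 0.7972277, 0.8940148, 0.9493461, 1.0138248, 1.0414651)
)

## Df$ inside the formula: the variables are tied to Df, so newdata is
## effectively ignored and predict() warns about 12 rows vs 8 rows.
bad <- lm(log10(Df$MW) ~ Df$Bands1)
bad_pred <- suppressWarnings(predict(bad, newdata = new_bands))
length(bad_pred)   # one value per training row, not per new row

## Bare names plus data =: predict() finds Bands1 in newdata.
good <- lm(log10(MW) ~ Bands1, data = Df)
good_pred <- predict(good, newdata = new_bands)
length(good_pred)  # one value per row of new_bands

## The response was log10(MW), so back-transform to the original scale.
new_bands$PredMW <- 10^good_pred
```

The stored predictions are on the log10 scale unless you back-transform them as in the last line.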

Related

How to fix predict.naive_bayes using no features for prediction in R

I have a data frame with 45045 variables and only 90 observations in R. I did a PCA to reduce the dimension and I'll use 14 principal components. I need to make predictions and I want to try the Naive Bayes method. I can't use the predict function with the transformed data and I don't understand the error.
Here is some code:
data.pca <- prcomp(data)
I'll use 14 PCs:
newdata <- as.data.frame(data.pca$x[,1:14]) #dimension: 90x14
Training:
library(naivebayes)
mod.nb <- naive_bayes(label ~ newdata$PC1+...+newdata$PC14, data = NULL)
Trying to predict the 50th observation:
test.pca <- predict(data.pca, newdata = data[50,])
test.pca <- as.data.frame(test.pca)
test.pca <- test.pca[,1:14]
pred <- predict(mod.nb, test.pca)
I'm getting these errors:
predict.naive_bayes(): Only 0 feature(s) out of 14 defined in the naive_bayes object "mod.nb" are used for prediction.
predict.naive_bayes(): No feature in the newdata corresponds to probability tables in the object. Classification is done based on the prior probabilities
The vector of labels is a factor with levels 1 to 6, and for any observation that I try to predict the result is only 1. The 50th observation, for example, has the label 4.
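The features in your model are most likely named newdata$PC1, ..., newdata$PC14, which do not match the column names PC1, ..., PC14 of the data you predict on. Fit on a data frame with bare column names instead:
data.pca <- prcomp(data)
newdata <- as.data.frame(data.pca$x[, 1:14])
library(naivebayes)
train <- data.frame(label = label, newdata) ## label and the 14 PCs in one data frame
mod.nb <- naive_bayes(label ~ ., data = train) ## features are now named PC1 ... PC14
pred <- predict(mod.nb, newdata = newdata[50, ])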
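The warning in the question says that none of the 14 features matched. A small base-R sketch of the naming issue that probably causes this: when a formula contains terms such as pcs$PC1, the model frame column is literally named "pcs$PC1". The data frame pcs and the toy values are invented for this illustration, and the assumption is that naive_bayes() takes its feature names from the model frame.

```r
## Hypothetical stand-in for the first two principal components.
pcs <- data.frame(PC1 = c(0.5, -1.2, 0.3, 1.1),
                  PC2 = c(1.0, 0.4, -0.7, -0.2))
label <- factor(c(1, 2, 1, 2))

## Terms written with $ keep the prefix in their names.
dollar_names <- names(model.frame(label ~ pcs$PC1 + pcs$PC2))
dollar_names

## Bare names with data = give plain column names.
bare_names <- names(model.frame(label ~ PC1 + PC2,
                                data = data.frame(label = label, pcs)))
bare_names

## Only the bare names can be matched against the columns of new data.
intersect(dollar_names, names(pcs))
intersect(bare_names, names(pcs))
```

With the $ form there is no overlap with the column names of the new data, which is consistent with the "0 feature(s) out of 14" message.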

Comparing nested models with NAs in R

I am trying to compare nested regression models using the anova() function in R, but am running into problems because the level 1 and level 2 models differ in the number of observations due to missing cases. Here is a simple example:
# Create dataframe with multiple predictors with different number of NAs
dep <- c(45,46,45,48,49)
basevar <- c(10,12,10,16,17)
pred1 <- c(NA,20,NA,19,21)
dat <- data.frame(dep,basevar,pred1)
# Define level 1 of the nested models
basemodel <- lm(dep ~ basevar, data = dat)
# Add level 2
model1 <- lm(dep ~ basevar + pred1, data = dat)
# Compare the models (uh oh!)
anova(basemodel, model1)
I have seen 2 suggestions to similar problems, but both are problematic.
Suggestion 1: Impute the missing data. The problem with this is that the missing cases in my data were removed because they were outliers, and thus are not "missing at random," and imputing may overfit the data.
Suggestion 2: Make a separate data frame containing only the complete cases for the variable with missing cases, and use that for regressions. This is also problematic if you are creating multiple nested models sharing the same level 1 variable, but in which the level 2 variables differ in the number of missing cases. Here is an example of this:
# Create a new predictor variable with a different number of NAs from pred1
pred2 <- c(23,21,NA,10,11)
dat <- cbind(dat,pred2)
# Create dataframe containing only completed cases of pred1
nonadat1 <- subset(dat, subset = !is.na(pred1))
# Do the same for pred2
nonadat2 <- subset(dat, subset = !is.na(pred2))
# Define level 1 of the nested models within dataframe of pred1 complete cases
basemodel1 <- lm(dep ~ basevar, data = nonadat1)
# Check values of the model
summary(basemodel1)
# Add level 2
model1 <- lm(dep ~ basevar + pred1, data = nonadat1)
# Compare the models (yay it runs!)
anova(basemodel1, model1)
# Define level 1 of the nested models within dataframe of pred2 complete cases
basemodel2 <- lm(dep ~ basevar, data = nonadat2)
# Values are different from those in basemodel1
summary(basemodel2)
# Add level 2
model2 <- lm(dep ~ basevar + pred2, data = nonadat2)
# Compare the models
anova(basemodel2, model2)
As you can see, creating individual data frames creates differences at level 1 of the nested models, which makes interpretation problematic.
Does anyone know how I can compare these nested models while circumventing these problems?
Could this work? It is a likelihood-ratio test computed from the log-likelihoods of the two models. It doesn't exactly deal with the fact that the models are fitted on different datasets, but it does allow for a comparison.
A<-logLik(basemodel)
B<-logLik(model1)
(teststat <- -2 * (as.numeric(A)-as.numeric(B)))
(p.val <- pchisq(teststat, df = 1, lower.tail = FALSE))
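A likelihood-ratio statistic like the one above is usually only interpreted when both models are fitted to the same rows. As a hedged sketch, one way to enforce that is to refit the smaller model on the rows the larger model actually used, taken from model.frame(). The simulated data and the names basemodel_all and used are assumptions for illustration only; the toy data in the question has too few complete rows to test anything.

```r
set.seed(1)
n <- 30
dat <- data.frame(basevar = rnorm(n), pred1 = rnorm(n))
dat$dep <- 1 + 2 * dat$basevar + 0.5 * dat$pred1 + rnorm(n)
dat$pred1[c(3, 7, 12)] <- NA           # three rows lose pred1

model1 <- lm(dep ~ basevar + pred1, data = dat)
basemodel_all <- lm(dep ~ basevar, data = dat)
c(nobs(basemodel_all), nobs(model1))   # different sample sizes
## anova(basemodel_all, model1) stops here because the sizes differ.

## Refit the base model on exactly the rows model1 used.
used <- model.frame(model1)
basemodel <- lm(dep ~ basevar, data = used)
anova(basemodel, model1)               # F test on a common sample

## The same comparison as a likelihood-ratio test.
lrt <- -2 * (as.numeric(logLik(basemodel)) - as.numeric(logLik(model1)))
p_lrt <- pchisq(lrt, df = 1, lower.tail = FALSE)
```

This does not remove the issue raised in the question: the base model still differs from one comparison to the next whenever the second-level predictors are missing on different rows.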

Error when replacing new factor levels in test dataset with `NA`

I have split my data set into testing and training data sets. I've tried to fit a regression on the training set, and then use predict on the testing set. When I do this I get an error message that says: "Error in model.frame factor x has New Levels". I know this is because there are levels in my testing data not seen in my training data.
What I want to do is just eliminate or ignore the levels that aren't in both data sets. I've tried to do this, but it isn't setting any levels to NA, and the id object says "integer (empty)":
id <- which(!(test$x %in% levels(train$x)))
train$x[id] <- NA
fit <- lm(y ~ x, data=train)
P <- predict(fit,test)
You will get a "replacement length differs" error with your code.
id <- which(!(test$x %in% levels(train$x)))
tells you which elements of test$x are not in levels(train$x), so you should use id to index test$x, not train$x, when doing the replacement.
test$x[id] <- NA
test$x <- droplevels(test$x) ## also don't forget to remove unused factor levels
fit <- lm(y ~ x, data = train)
P <- predict(fit, test)
All data in train will be used to build your linear regression model. Some predictions in P will be NA.
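Here is a small self-contained sketch of the same steps on made-up data, where the test set contains a level "d" that never occurs in training. The toy values are assumptions for illustration.

```r
train <- data.frame(x = factor(c("a", "b", "a", "c", "b", "c")),
                    y = c(1.0, 2.0, 1.2, 3.0, 2.1, 2.9))
test  <- data.frame(x = factor(c("a", "d", "c")))   # "d" is unseen

## Positions in test$x whose value is not a training level.
id <- which(!(test$x %in% levels(train$x)))

test$x[id] <- NA                 # blank out the unseen values
test$x <- droplevels(test$x)     # drop the now-unused level "d"

fit <- lm(y ~ x, data = train)
P <- predict(fit, test)          # NA where the level was unseen
P
```

If id comes back as integer(0) on your own data, the test set has no values outside the training levels.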
I'm still unable to get the id object to correctly identify which levels are not in both data sets. In the work-space it just shows integer(0).
Then there is no new level to remove: every value in test$x is already in levels(train$x).

Why doesn't predict like the dimensions of my newdata?

I want to perform a multiple regression in R and make predictions based on the trained model. Below is an example code I am using:
price = c(10,18,18,11,17)
predictors = cbind(c(5,6,3,4,5),c(2,1,8,5,6))
predict(lm(price ~ predictors), data.frame(predictors=matrix(c(3,5),nrow=1)))
So, based on the two-variable regression model trained on 5 samples, I want to make a prediction for the test data point where the first variable is 3 and the second is 5. But the code above gives a warning saying that 'newdata' had 1 rows but variable(s) found have 5 rows. How can I correct it? The code below works fine, where I give the variables to the model formula separately. But since I will have hundreds of variables, I have to pass them as a matrix; adding hundreds of columns with the + sign is not feasible.
price = c(10,18,18,11,17)
predictor1 = c(5,6,3,4,5)
predictor2 = c(2,1,8,5,6)
predict(lm(price ~ predictor1 + predictor2), data.frame(predictor1=3,predictor2=5))
Thanks in advance!
The easiest way to get past the issue of matching up variable names from a matrix of covariates to newdata data.frame column names is to put your input data into a data.frame as well. Try this
price = c(10,18,18,11,17)
predictors = cbind(c(5,6,3,4,5),c(2,1,8,5,6))
indata<-data.frame(price,predictors=predictors)
predict(lm(price ~ ., indata), data.frame(predictors=matrix(c(3,5),nrow=1)))
Here we combine price and predictors into a data.frame so that its columns are named the same way as in the newdata data.frame. We use the . in the formula to mean "all other columns" so we don't have to specify them explicitly.
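A variant of the same idea that may scale more comfortably to hundreds of columns is to give the matrix explicit column names once and reuse them for the new data. This is only a sketch; the names X, new_X and the x1, x2 column names are invented for the illustration.

```r
price <- c(10, 18, 18, 11, 17)
X <- cbind(c(5, 6, 3, 4, 5), c(2, 1, 8, 5, 6))
colnames(X) <- paste0("x", seq_len(ncol(X)))    # x1, x2, ...

## One data frame holding the response and all predictors.
train <- data.frame(price = price, X)
fit <- lm(price ~ ., data = train)              # "." = every other column

## New data as a one-row matrix with the same column names.
new_X <- matrix(c(3, 5), nrow = 1, dimnames = list(NULL, colnames(X)))
pred <- predict(fit, newdata = as.data.frame(new_X))
pred
```

Because the names are generated from the matrix itself, the training and prediction data cannot drift apart.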
Need to build the model first, then predict from it:
mod1 <- lm(price ~ predictor1 + predictor2)
predict( mod1 , data.frame(predictor1=3,predictor2=5))

Multinom with Matrix of Counts as Response

According to the help of multinom, package nnet, "The response should be a factor or a matrix with K columns, which will be interpreted as counts for each of K classes." I tried to use this function in the second case, obtaining an error.
Here is sample code for what I do:
library(nnet)
response <- matrix(round(runif(200,0,1)*100),ncol=20) # 10x20 matrix of counts
predictor <- runif(10,0,1)
fit1 <- multinom(response ~ predictor)
weights1 <- predict(fit1, newdata = 0.5, "probs")
Here is what I obtain:
'newdata' had 1 row but variables found have 10 rows
How can I solve this problem?
Bonus question: I also noticed that we can use multinom with a predictor of factors, e.g. predictor <- factor(c(1,2,2,3,1,2,3,3,1,2)). I cannot understand how this is mathematically possible, given that a multinomial linear logit regression should work only with continuous or dichotomous predictors.
The easiest way to obtain predictions for new values is to pass the new data as a data.frame whose column has the same name as the predictor in the formula.
Using the sample code
> predict(fit1, newdata = data.frame(predictor = 0.5), type = "probs")
[1] 0.07231972 0.05604055 0.05932186 0.07318140 0.03980245 0.06785690 0.03951593 0.02663618
[9] 0.04490844 0.04683919 0.02298260 0.04801870 0.05559221 0.04209283 0.03799946 0.06406533
[17] 0.04509723 0.02197840 0.06686314 0.06888748
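For a reproducible version of this, here is a minimal sketch with a seed and a smaller count matrix (4 classes instead of 20, chosen only to keep the fit quick). The simulated counts and the object names are assumptions for illustration. The predicted class probabilities for a single new row should sum to 1.

```r
library(nnet)

set.seed(42)
response  <- matrix(rpois(40, lambda = 20), ncol = 4)  # 10 x 4 counts
predictor <- runif(10)

## trace = FALSE only silences the optimiser's progress output.
fit <- multinom(response ~ predictor, trace = FALSE)

## newdata must be a data frame with a column named like the predictor.
probs <- predict(fit, newdata = data.frame(predictor = 0.5), type = "probs")
probs
sum(probs)
```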

Resources