Getting a warning using the predict function in R

I have a data set of 400 observations, which I divided into 2 separate sets: one for training (300 observations) and one for testing (100 observations). I am trying to create a step function regression; the problem is that once I try to use the model to predict values from the test set, I get a warning:
Warning message: 'newdata' had 100 rows but variables found have 300 rows
The variable I am trying to predict is Income and the explanatory variable is called Age.
This is the code:
fit = lm(Income ~ cut(training$Age, 4), data = training)
predict(fit,test)
Instead of getting 100 predictions based on the test data, I get a warning and 300 predictions based on the training data.
I read about other people having this problem, and usually the answer has to do with the name of the variable being different in the data set and in the model. I don't think that is the problem here, because with a regular simple regression I don't get a warning:
lm.fit=lm(Income~Age,data = training)
predict(lm.fit,test)

There are a number of problems here, so it will take several steps to get to a good answer. You did not provide data, so I am going to use other data that produces the same kind of error message. The built-in data set iris has 4 continuous variables. I will arbitrarily select two for use here, then apply code just like yours:
MyData = iris[,3:4]
set.seed(2017) # for reproducibility
idx = sample(150, 100)   # avoid naming this 'T', which masks TRUE
training = MyData[ idx, ]
test = MyData[-idx, ]
fit=lm(Petal.Width ~ cut(training$Petal.Length, 4), data=training)
predict(fit,test)
Warning message:
'newdata' had 50 rows but variables found have 100 rows
So I am getting the same type of warning.
cut changes the continuous variable Petal.Length into a factor with 4 levels. You built your model on that factor, but when you try to predict the new values, you just passed in test, which still holds the continuous values (Age in your data; Petal.Length in mine). To evaluate the predict statement, R needs to evaluate cut(test$Petal.Length, 4) as part of the process. Look at what that means.
C1 = cut(training$Petal.Length, 4)
C2 = cut(test$Petal.Length, 4)
levels(C1)
[1] "(0.994,2.42]" "(2.42,3.85]" "(3.85,5.28]" "(5.28,6.71]"
levels(C2)
[1] "(1.09,2.55]" "(2.55,4]" "(4,5.45]" "(5.45,6.91]"
The levels are completely different. There is no way that your model can be used on these different levels. You can see the bin boundaries for C1 so it is tempting to just use those boundaries and partition the test data.
levels(C1)
[1] "(0.994,2.42]" "(2.42,3.85]" "(3.85,5.28]" "(5.28,6.71]"
CutPoints = c(0.994, 2.42, 3.85, 5.28, 6.71)
C2 = cut(test$Petal.Length, breaks=CutPoints, include.lowest=TRUE)
But on careful examination, you will see that this did not work. Printing out a relevant piece of the data:
C2[42:46]
[1] (5.28,6.71] (5.28,6.71] <NA> (3.85,5.28] (3.85,5.28]
C2[44] is undefined. Why? One of the values in the test set fell outside the range of values for the training set, so it does not belong in any bin.
test$Petal.Length[44]
[1] 6.9
So what you really need to do is impose no lower or upper limit, by using -Inf and Inf as the outer break points.
## cut the training data to get cut points
C1 = cut(training$Petal.Length, 4)
levels(C1)
[1] "(0.994,2.42]" "(2.42,3.85]" "(3.85,5.28]" "(5.28,6.71]"
CutPoints = c(-Inf, 2.42, 3.85, 5.28, Inf)
It may be easiest to just make new data.frames with the binned data
Binned.training = training
Binned.training$Petal.Length = cut(training$Petal.Length, CutPoints)
Binned.test = test
Binned.test$Petal.Length = cut(test$Petal.Length, CutPoints)
fit=lm(Petal.Width ~ Petal.Length, data=Binned.training)
predict(fit,Binned.test)
## No errors
This will work for your test data and any data that you get in the future.
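To map this back onto the question's variables, here is a self-contained sketch of the whole workflow. The Income and Age data are simulated here, since the original data were not posted, and the bin edges are parsed out of cut's level labels rather than read off by hand:

```r
## Simulated stand-in for the question's data (400 rows, 300/100 split)
set.seed(2017)
Age <- runif(400, 18, 70)
Income <- 20 + 0.5 * Age + rnorm(400, sd = 5)
idx <- sample(400, 300)
training <- data.frame(Age = Age[idx],  Income = Income[idx])
test     <- data.frame(Age = Age[-idx], Income = Income[-idx])

## 1. Let cut() choose 4 bins on the training data
C1 <- cut(training$Age, 4)

## 2. Recover the break points from the labels "(a,b]" and open out the ends
edges <- unique(as.numeric(unlist(strsplit(gsub("[][()]", "", levels(C1)), ","))))
CutPoints <- c(-Inf, edges[2:(length(edges) - 1)], Inf)

## 3. Bin both data sets with the SAME break points, then fit and predict
Binned.training <- training; Binned.training$Age <- cut(training$Age, CutPoints)
Binned.test     <- test;     Binned.test$Age     <- cut(test$Age, CutPoints)
fit <- lm(Income ~ Age, data = Binned.training)
p <- predict(fit, Binned.test)
length(p)   # 100 predictions, no warning
```

Because both data frames are cut with identical break points, the factor levels match exactly, and the open-ended outer bins guarantee no test value falls outside them.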

Related

R Error in mclogit::mblogit() solve.default(X[[i]], ...) : 'a' (4 x 1) must be square

I am trying to use a multinomial logistic regression model to determine how different factors influence the likelihood of several behavioral states among two species of shark.
19 individual animals, comprising two distinct species, were each tracked for ~100 days, and a behavioral state was identified for each day data were collected.
I would like to code individual shark as a categorical random effect variable (with 19 levels) within species, a categorical fixed effect variable (with 2 levels).
With this general idea, the code that I am currently trying to run is:
mclogit::mblogit(cluster ~ species, random = ~1|individual %in% species, data = df, method = "MQL")
The model appears to run normally but produces the error message:
Error in *tmp*[[k]] : subscript out of bounds
Reversing the order of the random effect interaction term produces a different error message. Now the code reads:
mclogit::mblogit(cluster ~ species, random = ~1|species %in% individual, data = df, method = "MQL")
And produces the error:
Error in solve.default(X[[i]], ...) : 'a' (6 x 1) must be square
Here is a sample of the raw data with which I am trying to fit my model:
df <- data.frame(
Date = c("2015-11-25", "2016-01-24", "2016-02-27", "2016-03-27", "2017-12-02", "2017-12-06", "2015-10-30", "2015-10-31"),
cluster = factor(c(3,3,4,6,3,1,3,2)),
species = factor(c("I.oxyrinchus", "I.oxyrinchus", "I.oxyrinchus", "I.oxyrinchus", "P.glauca", "P.glauca", "P.glauca", "P.glauca")),
individual = factor(c("141257", "141257", "141254", "141254", "141256", "141256", "141255", "141255")))
Attempting to run the code with this reduced dataset produces only the second of the two error messages.
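One quick sanity check on the intended nesting, using only base R on the reduced dataset above: each individual should appear under exactly one species for the nested random-effect structure to make sense.

```r
## Reduced dataset from the question (Date column omitted; it is not needed here)
df <- data.frame(
  cluster = factor(c(3, 3, 4, 6, 3, 1, 3, 2)),
  species = factor(c("I.oxyrinchus", "I.oxyrinchus", "I.oxyrinchus", "I.oxyrinchus",
                     "P.glauca", "P.glauca", "P.glauca", "P.glauca")),
  individual = factor(c("141257", "141257", "141254", "141254",
                        "141256", "141256", "141255", "141255")))

## Cross-tabulate individuals against species; a properly nested design has
## exactly one non-zero species column per individual
tab <- xtabs(~ individual + species, data = df)
all(rowSums(tab > 0) == 1)   # TRUE: each individual belongs to one species
```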
My questions are twofold:
What are the meanings of these two error messages, and how might I address one or both of them?
Why might the order of the terms in the random effect portion of the model formula produce two different results?
Thank you.

SVM Prediction is dropping values

I'm running an SVM model on a dataset, and it runs through fine for the train/fitted model. However, when I run it for the prediction/test data, it seems to be dropping rows: when I try to add pred_SVM back into the dataset, the lengths are different.
Below is my code
#SVM MODEL
SVM_swim <- svm(racetime_mins ~ event_date + event_month + year + event_id +
                  gender + place + distance + New_Condition + raceNo_Updated +
                  handicap_mins + points + Wind_Speed_knots + Air_Temp_Celsius +
                  Water_Temp_Celsius + Wave_Height_m,
                data = SVMTrain, kernel = "linear")
summary(SVM_swim)
#Predict Race_Time Using Test Data
pred_SVM <- predict(SVM_swim, SVMTest, type ="response")
View(pred_SVM)
#Add predicted Race_Times back into the test dataset.
SVMTest$Pred_RaceTimes<- pred_SVM
View(SVMTest) #Returns 13214 rows
View(pred_SVM) #Returns 12830
Error in `$<-.data.frame`(`*tmp*`, Pred_RaceTimes, value = c(`2` = 27.1766438249356, :
  replacement has 12830 rows, data has 13214
As mentioned in the comments, you need to get rid of the NA values in your dataset. svm is handling them for you, so the pred_SVM output is calculated without the NA rows.
To test whether there are NAs in your data, just run sum(is.na(SVMTest)).
I am pretty sure that you will see a number greater than zero.
Before building your SVM model, get rid of all NA values with
dataset <- dataset[complete.cases(dataset), ]
Then, after separating your data into train and test sets, you can run
SVM_swim <- svm(.....,data = SVMTrain, kernel='linear')
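Here is a minimal base-R illustration of both the symptom and the fix. lm() stands in for svm() (so the example runs without e1071); predict() drops incomplete rows the same way when na.action = na.omit:

```r
set.seed(42)
d <- data.frame(x = rnorm(20), y = rnorm(20))
d$x[c(13, 17)] <- NA                  # plant NAs in what will be the test rows
train <- d[1:10, ]
test  <- d[11:20, ]
fit <- lm(y ~ x, data = train)

## Symptom: rows with NA are silently dropped, so lengths no longer match
p_drop <- predict(fit, test, na.action = na.omit)
length(p_drop)   # 8, but nrow(test) is 10

## Fix: remove incomplete cases up front, then lengths line up
test_cc <- test[complete.cases(test), ]
test_cc$pred <- predict(fit, test_cc)   # no "replacement has ... rows" error
nrow(test_cc)    # 8 rows, 8 predictions
```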

Plot how the estimated survival depends upon the value of a covariate of interest. Problems with relevel

I want to plot how the estimated survival from a Cox model depends on the value of a covariate of interest, while the rest of the variables are fixed at their average values (if they are continuous) or at their lowest values (for dummies). Following this example http://www.sthda.com/english/wiki/cox-proportional-hazards-model , I have constructed a new data frame with three rows, one for each value of my variable of interest, with the other covariates fixed. Among these covariates are two factor vectors. I created the new dataset and passed it to survfit() via the newdata argument.
When I pass the data frame to survfit(), I get the error message Error in relevel.default(occupation) : 'relevel' only for factors. Where does the problem come from? If it is related to the factor vectors, how can I solve it? Below is an example of the code. Unfortunately, I cannot share the data or find a dataset that reproduces the same error message:
I have transformed the factor variables into integer vectors in the Cox model and in the new dataset; it did not work.
I have deleted all the factor variables, and then it works.
I have tried to implement this strategy, but it did not work: Plotting predicted survival curves for continuous covariates in ggplot
fit <- coxph(Surv(entry, exit, event == 1) ~ status_plot +
exp_national + relevel(occupation, 5) + age + gender + EDUCATION , data = data)
data_rank <- with(data,
data.frame(status_plot = c(1,2,3), # factor vector of interest
exp_national=rep(mean(exp_national, na.rm = TRUE), 3),
occupation = c(5,5,5), # factor with 6 categories, number 5 is the category of reference in the cox model
age=rep(mean(age, na.rm = TRUE), 3),
gender = c(1,1,1),
EDUCATION=rep(mean(EDUCATION, na.rm = TRUE), 3) ))
surv.fin <- survfit(fit, newdata=data_rank) # this produces the error
Looking at the code, it appears you probably attempted to take the mean of a factor, so do post at least str(data) as an edit to the body of your question. You should also realize that you can give a single value to a column in a data.frame call and have it recycled to the correct length, so all the means could be entered as single items rather than `rep`-ing them.
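To illustrate the factor point concretely, here is a small self-contained sketch with lm() standing in for coxph() (the prediction machinery evaluates the formula the same way, so survival is not needed to reproduce the message). Passing a bare number where the model expects a factor reproduces the 'relevel' only for factors error; rebuilding the column with factor() and the model's levels fixes it:

```r
set.seed(7)
d <- data.frame(y = rnorm(60),
                occupation = factor(rep_len(1:6, 60)),   # 6-level factor
                age = rnorm(60, 40, 10))
fit <- lm(y ~ relevel(occupation, 5) + age, data = d)

## Wrong: numeric 5 -- relevel() inside the formula sees a number, not a factor
bad <- data.frame(occupation = c(5, 5, 5), age = rep(mean(d$age), 3))
res <- try(predict(fit, bad), silent = TRUE)   # 'relevel' only for factors

## Right: rebuild the column as a factor carrying the model's levels
good <- data.frame(occupation = factor(c(5, 5, 5), levels = levels(d$occupation)),
                   age = rep(mean(d$age), 3))
predict(fit, good)   # three predictions, no error
```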

subscript out of bounds Error, Random Forest Model

I'm trying to use the random forest model to predict Gender based on Height, Weight and Number of siblings. I've gotten the data from a much larger data set that contains dozens of variables, but I've cleaned it into this "clean" data.frame with omitted NA values and only the 4 variables I care about, the last column being Gender.
I've tried fiddling with the code and searching everywhere but I can't find a concrete fix.
Here's the code:
ind <- sample(nrow(clean),0.8*nrow(clean))
train <- clean[ind,]
test <- clean[-ind,]
rf <- randomForest(Gender ~ ., data = train[,1:4], ntree = 20)
pred <- predict(rf, newdata = test[,-c(length(test))])
cm <- table(test$Gender, pred)
cm
and here's the output:
Error in `[.default`(table(observed = y, predicted = out.class), levels(y), : subscript out of bounds
Traceback:
1. randomForest(Gender ~ ., data = train[, 1:4], ntree = 20)
2. randomForest.formula(Gender ~ ., data = train[, 1:4], ntree = 20)
3. randomForest.default(m, y, ...)
4. table(observed = y, predicted = out.class)[levels(y), levels(y)]
5. `[.table`(table(observed = y, predicted = out.class), levels(y),
. levels(y))
6. NextMethod()
The problem is likely that your test data contain some factor level that was not present in your training data. When the model goes to assign an outcome for that level, it has no basis to do so.
It is impossible to say for sure without sample data, but that is the most likely scenario. Try setting a seed with set.seed(3), then change the seed number to set.seed(28), and so on a few times, to see whether you can find a split where you do not get the error.
Compare a split that triggers the error with one that does not, to see what is missing.
EDIT:
Also, try running str(train) and str(test) to be sure the fields have remained the same. You can share that if you like by editing your post.
If any of the columns are factors with levels missing (meaning it has 10 levels but only 8 are represented in the train with 9 or 10 in the test) it might be a problem. They should be balanced if you are trying to create a predictor for all possible outcomes.
If nothing else works, you can set a seed and remove predictors one at a time until it runs correctly, then look to see how the train and test sets are different in that removed column.
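The checks above can be automated in base R. The sketch below uses simulated data with the question's variable names (the real `clean` data frame was not posted): it lists any factor levels that occur in test but never in train, and drops empty levels before fitting.

```r
## Simulated stand-in for the question's cleaned data
set.seed(3)
clean <- data.frame(Height   = rnorm(50, 170, 10),
                    Weight   = rnorm(50, 70, 10),
                    Siblings = sample(0:4, 50, replace = TRUE),
                    Gender   = factor(sample(c("M", "F"), 50, replace = TRUE)))
ind   <- sample(nrow(clean), 0.8 * nrow(clean))
train <- clean[ind, ]
test  <- clean[-ind, ]

## Report levels seen in test but absent from train, for every factor column
for (v in names(clean)[sapply(clean, is.factor)]) {
  unseen <- setdiff(unique(as.character(test[[v]])),
                    unique(as.character(train[[v]])))
  if (length(unseen)) cat(v, "has levels unseen in training:", unseen, "\n")
}

## Empty factor levels left over from subsetting can also trip up
## randomForest's internal confusion table; droplevels() removes them
train <- droplevels(train)
```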

Cannot run glmer models with na.action=na.fail, necessary for MuMIn dredge function

Mac OS 10.9.5, R 3.2.3, MuMIn_1.15.6, lme4_1.1-10
Reproducible example code, using example data
The MuMIn user guide recommends using na.action=na.fail, otherwise the dredge function will not work, which I have found:
Error in dredge: 'global.model''s 'na.action' argument is not set and options('na.action') is "na.omit".
However, when I try to run a glmer model with na.action=na.fail, I get this:
Error in na.fail.default(list(pr = c(0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, :
missing values in object
Do I have any options other than removing every observation with an NA? My full data set consists of 10,000 observations and has 23 predictor variables which have NAs for different observations. Removing every obs with an NA will waste some data, which I'm looking to avoid.
It is difficult to know what you are asking.
From ?MuMIn::dredge "Use of na.action = "na.omit" (R's default) or "na.exclude" in global.model must be avoided, as it results with sub-models fitted to different data sets, if there are missing values. Error is thrown if it is detected."
In your example, leaving the default options(na.action = na.omit) works fine:
options()$na.action
mod.na.omit <- glmer(formula = pr ~ yr + soil_dist + sla_raw +
yr:soil_dist + yr:sla_raw + (1|plot) + (1|subplot),
data = coldat,
family = binomial)
But, options(na.action = na.fail) causes glmer to fail (as expected from the documentation).
If you look at the length of the data in coldat, complete cases of coldat, mod.na.omit you get the following:
> # number of rows in coldat
> nrow(coldat)
[1] 3171
> # number of complete cases in coldat
> nrow(coldat[complete.cases(coldat), ])
[1] 2551
> # number of rows in data included in glmer model when using 'na.omit'
> length(mod.na.omit@frame$pr)
[1] 2551
From the example data you provided, the complete cases of coldat and the rows of coldat included by glmer when using na.omit (mod.na.omit@frame) yield the same number of rows, but it is conceivable that as predictors are added, this may no longer be the case (i.e., number of rows in mod.na.omit@frame > complete cases of coldat). In that scenario (as the documentation states), there is a risk of sub-models being fitted to different data sets as dredge generates the models. So, rather than potentially fitting sub-models, dredge takes a conservative approach to NA and throws an error.
So, you basically either have to remove the incomplete cases (which you indicated you don't want to do) or impute the missing values. I typically avoid imputation when there are large blocks of missing data that make estimating a value fraught, and remove the incomplete cases instead.
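One way to waste fewer rows when removing incomplete cases, sketched here on simulated data (the real coldat has more columns): compute complete.cases() only over the variables that actually enter the model, so NAs in unused predictors do not cost you observations.

```r
set.seed(1)
coldat <- data.frame(pr        = rbinom(100, 1, 0.3),
                     yr        = rnorm(100),
                     soil_dist = rnorm(100),
                     sla_raw   = rnorm(100),
                     unused    = rnorm(100))
coldat$soil_dist[sample(100, 10)] <- NA   # NAs in a model variable
coldat$unused[sample(100, 30)]    <- NA   # NAs in a variable the model ignores

## Restrict complete.cases() to the model variables only
model_vars <- c("pr", "yr", "soil_dist", "sla_raw")
coldat_cc <- coldat[complete.cases(coldat[, model_vars]), ]
nrow(coldat_cc)   # 90: only the 10 rows missing a model variable are dropped
```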
