I have a data frame that I am using for machine learning with svm() from the e1071 package in R. The formula I use is:
fidSVM = formula(y ~ a + b + c + d)
which I plug into:
fitsvm = svm(fidSVM, data = crossdata, method = "C-classification", kernel = "polynomial",
             degree = incdegree, cost = 0.5, shrinking = TRUE, scale = TRUE,
             gamma = 1, coef0 = 0, cross = 10)
Then, I want to predict. To test the predict() function, I simply reuse the initial data frame:
predict(fitsvm, crossdata)
factor(0)
Levels: 0 1
The classification itself works quite well (I checked that), but the predict() call does not behave properly. As mentioned in other posts, I was careful to pass my data as a data frame. I have used svm() and predict() together in the past without problems. Does anyone have an idea of what may be causing the problem here?
P.S.: I do have NaN's in my data, which otherwise contains only factors and numerical values.
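For reference, here is a minimal self-contained version of the same call pattern, using iris as placeholder data (none of these column names are from my real data frame):

library(e1071)
fidSVM <- formula(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width)
# Note: in e1071::svm() the classification mode is selected with 'type' rather than 'method'.
fitsvm <- svm(fidSVM, data = iris, type = "C-classification", kernel = "polynomial",
              degree = 3, cost = 0.5, shrinking = TRUE, scale = TRUE,
              gamma = 1, coef0 = 0, cross = 10)
# With complete rows this returns one label per row; predict.svm() uses
# na.action = na.omit by default, so rows containing NA/NaN are dropped from the
# output, which can leave an empty factor(0).
head(predict(fitsvm, iris))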
Thank you for your help!
Why is it that beta regression, whose response is bounded between 0 and 1, is unable to handle a large number of independent variables as regressors? I have around 30 independent variables that I am trying to fit, and it shows an error like:
Error in optim(par = start, fn = loglikfun, gr = gradfun, method =
method, : non-finite value supplied by optim
It only accepts a few variables. If I combine all the independent variables into X <- (df$x1 + … + df$x30), set the dependent variable with Y <- df$y, and then run the beta regression, it works, but then I don't get coefficients for the individual independent variables, which is what I want.
betareg(Y ~ X, data = df)
So, what’s the solution?
The model probably did not converge because of multicollinearity. In many cases, regression models cannot be estimated properly when a large number of variables is included. You can overcome this with an appropriate variable selection procedure based on information criteria.
You can benefit from the gamlss package in R. In particular, the stepGAIC() function can help with variable selection when you fit the model with gamlss(..., family = BE).
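A rough sketch of what that could look like (assuming the response y lies strictly between 0 and 1 and the predictors are named x1 … x30 in df; these names are placeholders for your actual columns):

# Rough sketch, not tested on your data: stepwise selection by GAIC over
# placeholder predictors x1..x30 for a Beta (BE) response in gamlss.
library(gamlss)
upper_scope <- as.formula(paste("~", paste(paste0("x", 1:30), collapse = " + ")))
m0 <- gamlss(y ~ 1, family = BE, data = df)   # intercept-only starting model
m_sel <- stepGAIC(m0, scope = list(lower = ~ 1, upper = upper_scope))
summary(m_sel)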
I recently built a random forest model using the ranger package in R. However, I noticed that the predictions stored in the ranger object during training (accessible with model$predictions) do not match the predictions I get when I run predict() on the same dataset with the fitted model. The following code reproduces the problem on the mtcars dataset. I created a binary variable just for the sake of converting this into a classification problem, though I saw similar results with regression trees as well.
library(datasets)
library(ranger)
mtcars <- mtcars
mtcars$mpg2 <- ifelse(mtcars$mpg > 19.2 , 1, 0)
mtcars <- mtcars[,-1]
mtcars$mpg2 <- as.factor(mtcars$mpg2)
set.seed(123)
mod <- ranger(mpg2 ~ ., mtcars, num.trees = 20, probability = T)
mod$predictions[1,] # Probability of 1 = 0.905
predict(mod, mtcars[1,])$predictions # Probability of 1 = 0.967
The same behaviour also appears with the randomForest package, reproducible with the following code.
library(randomForest)
set.seed(123)
mod <- randomForest(mpg2 ~ ., mtcars, ntree = 20)
mod$votes[1,]
predict(mod, mtcars[1,], type = "prob")
Can someone please tell me why this is happening? I would expect the results to be the same. Am I doing something wrong or is there an error in my understanding of some inherent property of random forest that leads to this scenario?
I think you may want to look a little more deeply into how a random forest works. I really recommend An Introduction to Statistical Learning (ISLR), which is available for free online.
That said, I believe the main issue here is that you are treating the mod$votes value and the predict() value as the same, when they are not quite the same thing. If you look at the documentation of the randomForest function, the mod$votes or mod$predicted values are out-of-bag ("OOB") predictions for the input data. This is different from the value that the predict() function produces, which evaluates an observation on the model produced by randomForest(). Typically, you would want to train the model on one set of data, and use the predict() function on the test set.
Finally, you may need to re-run your set.seed() call every time you make the random forest if you want to achieve the same results for the mod object. I think there is a way to set the seed for an entire session, but I am not sure. This looks like a useful post: Fixing set.seed for an entire session
Side note: Here, you are not specifying the number of variables to try at each split, but the default is good enough in most cases (check the documentation for each of the random forest functions you are using for the default). Maybe you are doing that in your actual code and didn't include it in your example, but I thought it was worth mentioning.
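If you do want to set it explicitly, it is the mtry argument in both packages (the value below is just for illustration):

# Illustration only: explicitly set the number of variables tried at each split.
# randomForest() and ranger() both call this argument mtry.
mod_rf <- randomForest(mpg2 ~ ., mtcars, ntree = 20, mtry = 3)
mod_rg <- ranger(mpg2 ~ ., mtcars, num.trees = 20, mtry = 3, probability = TRUE)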
Hope this helps!
Edit:
I tried training the random forest using all of the data except for the first observation (Mazda RX4) and then used the predict function on just that observation, which I think illustrates my point a bit better. Try running something like this:
library(randomForest)
set.seed(123)
mod <- randomForest(mpg2 ~ ., mtcars[-1,], ntree = 200)
predict(mod, mtcars[1,], type = "prob")
Since you converted mpg to mpg2, I was expecting that you wanted to build a classification model. Nevertheless, mod$predictions gives you the probabilities computed while the model is learning from your data points (the out-of-bag estimates), whereas predict(mod, mtcars[,1:10])$predictions gives probabilities from the fully trained model. I ran the same code with probability = F and got the result below; you can see that the predictions from the trained model are perfect, whereas with mod$predictions we have 3 misclassifications.
mod <- ranger(mpg2 ~ ., mtcars, num.trees = 20, probability = F)
> table(mtcars$mpg2, predict(mod, mtcars[,1:10])$predictions)
     0  1
  0 17  0
  1  0 15
> table(mtcars$mpg2, mod$predictions)
     0  1
  0 15  2
  1  1 14
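For what it's worth, the ranger documentation describes the predictions element as being based on out-of-bag samples, so a cleaner check is to predict a row the model has never seen, mirroring the randomForest edit above:

# Train on everything except the first row, then predict that held-out row with ranger.
library(ranger)
set.seed(123)
mod_oos <- ranger(mpg2 ~ ., mtcars[-1, ], num.trees = 20, probability = TRUE)
predict(mod_oos, mtcars[1, ])$predictions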
I was wondering whether it is possible to predict with a model fitted by plm() from the plm package in R for a new dataset of predictor variables. I have created a model object using:
model <- plm(formula, data, index, model = 'pooling')
Now I'm hoping to predict the dependent variable for a new dataset that was not used in estimating the model. I can do it using the coefficients from the model object like this:
col_idx <- c(...)
df <- cbind(rep(1, nrow(df)), df[(1:ncol(df))[-col_idx]])
fitted_values <- as.matrix(df) %*% as.matrix(model_object$coefficients)
Here I first put the index columns used in the model and the columns dropped due to collinearity into col_idx, and then construct a data matrix that is multiplied by the coefficients from the model. However, errors can creep in much more easily with this manual dropping of columns.
A function designed to do this would make the code a lot more readable, I think. I have also found the pmodel.response() function, but I can only get it to work for the dataset that was used to estimate the model.
Any help would be appreciated!
I wrote a function (predict.out.plm) to do out-of-sample predictions after estimating first-differences or fixed-effects models with plm.
The function is posted here:
https://stackoverflow.com/a/44185441/2409896
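For a plain pooling model like the one in the question, you can also get out-of-sample fitted values by hand with model.matrix(), which takes care of the intercept and factor coding; a rough sketch with placeholder variable and data-frame names:

# Rough sketch, placeholder names: build the design matrix for the new data and
# multiply by the estimated pooling coefficients.
library(plm)
pool_mod   <- plm(y ~ x1 + x2, data = old_df, index = c("id", "year"), model = "pooling")
X_new      <- model.matrix(~ x1 + x2, data = new_df)        # intercept + factor coding handled
X_new      <- X_new[, names(coef(pool_mod)), drop = FALSE]  # keep only estimated terms
fitted_new <- drop(X_new %*% coef(pool_mod))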
I fit a model with both biglm and lm; the returned model summaries are the same (apart from formatting). However, when I use them to predict on the same dataset, they produce different results. The lm model is correct when compared with values I calculate by hand from the model coefficients, but the biglm model is not.
Here are the models:
m1 <- biglm(cost ~ d + v + zi, data = tl)
m2 <- lm(cost ~ d + v + zi, data = tl)
Here is a small piece of the model summaries:
m1:
d: coef 473.9196
m2:
d: coef 4.739e+02
The rest of the model coefficients also match, as illustrated above. However, when I use the models to predict, the results are different: m1 != m2.
t1$m1 <- predict(m1, t1)
t1$m2 <- predict(m2, t1)
I tried to call predict.biglm() directly but got an error saying the function doesn't exist.
I also looked at this post (R: lm and biglm producing different answers) and am sure that is not the reason here.
The dataset is too big for me to share here, and it would also take a while for me to de-code some of the information first.
But here is a small comparison of results which shows the predictions are quite different.
      m1        m2
1798.831  2365.868
1801.074  2368.112
1482.508  2351.042
After a long day, I finally figured out the issue.
I know the biglm method requires that the training and testing datasets have records for all factor levels. So when I was processing the dataset, I added one record for each missing factor level (similar to the add-dummies approach posted in another thread cited above).
However, I didn't update the factor levels using the factor() function. In that case, the biglm model runs fine and the syntax is OK, but the prediction results are not!
Anyway, after I updated the factor levels, it worked just fine.
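In code, the fix boils down to rebuilding the factor after the extra records are added, so its levels attribute is consistent between the training and test frames (zi stands in for whichever factor column caused the trouble):

# Sketch of the fix: rebuild the factor after adding records for missing levels,
# and keep the test-set levels aligned with the training set.
library(biglm)
tl$zi <- factor(tl$zi)
t1$zi <- factor(t1$zi, levels = levels(tl$zi))
m1 <- biglm(cost ~ d + v + zi, data = tl)
t1$m1 <- predict(m1, t1)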
I fit the following glm model using the survey package:
design <- svydesign(ids = training.data$name, design = design, family = quasibinomial(), data = training.data)
significant.model <- svyglm(Win ~ x + y + start + speed + vx0 + vy0 + ay + az + length + rate + height + hand + zone + count,
                            design = design, family = quasibinomial, data = training.data)
I have a set of test data that I excluded from the model fitting process so that I would be able to see how the model predicts the outcomes for the test data and examine the difference.
Typically, I would use makeFun in the mosaic package, but this does not support objects of type svyglm. Is there another function or method that I can use to create a function for the model?
There are a lot of categorical variables with multiple levels, so writing a user-defined function is not ideal in this situation.
I'm not sure what difficulty you were experiencing since your example is not reproducible. But since an svyglm object is a glm object, makeFun() will create a wrapper around predict() just as it would do for any glm object. This has not been tested extensively, but it seems to work in the following example:
example(svyglm)         # run the help-page example from the survey package; it creates the fit api.reg
f <- makeFun(api.reg)   # makeFun() builds a wrapper around predict() for that fit
f(enroll = 500)
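If makeFun() still gives trouble, predict() can also be called on the svyglm fit directly; a small sketch, assuming test.data holds the held-out rows with the same predictor columns as training.data:

# Assumed: 'test.data' has the same predictor columns as training.data.
preds <- predict(significant.model, newdata = test.data, type = "response")
head(preds)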