Consider a data set train:
z a
1 1
0 2
0 1
1 3
0 1
1 2
1 1
0 3
0 1
1 3
with a binary outcome variable z and a categorical predictor a with three levels: 1,2,3.
Now consider a data set test:
a
1
1
2
1
2
2
1
When I run the following code:
library(randomForest)
set.seed(825)
RFfit1 <- randomForest(z~a, data=train, importance=TRUE, ntree=2000)
RFprediction1 <- predict(RFfit1, test)
I get the following error message:
Error in predict.randomForest(RFfit1, test) :
Type of predictors in new data do not match that of the training data.
I am assuming this is because the variable a in the test data set does not have three levels. How would I fix this?
You must assign it the same levels as in the training data:
test$a <- factor(test$a, levels=levels(train$a))
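To make this concrete, here is a minimal sketch with stand-in data frames built from the tables above (randomForest itself isn't needed to demonstrate the fix):

```r
# Stand-in data mirroring the question's train and test sets
train <- data.frame(z = factor(c(1, 0, 0, 1, 0, 1, 1, 0, 0, 1)),
                    a = factor(c(1, 2, 1, 3, 1, 2, 1, 3, 1, 3)))
test  <- data.frame(a = factor(c(1, 1, 2, 1, 2, 2, 1)))

levels(test$a)   # only "1" and "2" - level "3" never occurs in test

# Re-level using the training factor; the values themselves are unchanged
test$a <- factor(test$a, levels = levels(train$a))
levels(test$a)   # "1" "2" "3" - now matches train, so predict() will work
```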
I want to run a multinomial mixed effects model with the mclogit package of R.
Below is the head of my data frame.
> head(mydata)
ID VAR1 X1 Time Y other_X3 other_X4 other_X5 other_X6 other_X7
1 1 1 1 1 10 0 0 0 0 0
2 1 1 1 2 5 1 1 1 0 2
3 2 2 3 1 10 0 0 0 0 0
4 2 2 3 2 7 1 0 0 0 2
5 3 1 3 1 10 0 0 0 0 0
6 3 1 3 2 7 1 0 0 0 2
The Y variable is a categorical variable with 10 levels (1-10, is a score variable).
What I want is a model of the form Y ~ X1 + Time with a
random intercept (for the ID variable) and a random slope (for the Time variable).
I tried the following command but I got an error.
> mixed_model <- mclogit( cbind(Y, ID) ~ X1 + Time + X1*Time,
+ random = list(~1|ID, ~Time|ID), data = mydata)
Error in FUN(X[[i]], ...) :
No predictor variable remains in random part of the model.
Please reconsider your model specification.
In addition: Warning messages:
1: In mclogit(cbind(Y, ID) ~ X1 + Time + X1 * Time, random = list(~1 | :
removing X1 from model due to insufficient within-choice set variance
2: In FUN(X[[i]], ...) : removing intercept from random part of the model
because of insufficient within-choice set variance
Any idea how to correct it?
Thank you in advance.
I need to create possible combinations of 3 dummy variables into one categorical variable in a logistic regression using R.
I made the combination manually just like the following:
new_variable_code variable_1 variable_2 variable_3
1                 0          0          0
2                 0          1          0
3                 0          1          1
4                 1          0          0
5                 1          1          0
6                 1          1          1
I excluded the other two options (0 0 1) and (1 0 1) because I do not need them, they are not represented by the data.
I then used new_variable_code as a factor in the logistic regression along with other predictors.
My question is: Is there any automated way to create the same new_variable_code? Or is there another econometric technique to encode the 3 dummy variables into 1 categorical variable inside a logistic regression model?
My objective: To understand which variable combination has the highest odds ratio on the outcome variable (along with other predictors explained in the same model).
Thank you
You could use pmap_dbl in the following way to recode your dummy variables to a 1-6 scale:
library(tidyverse)
# Reproducing your data
df1 <- tibble(
  variable_1 = c(0, 0, 0, 1, 1, 1),
  variable_2 = c(0, 1, 1, 0, 1, 1),
  variable_3 = c(0, 0, 1, 0, 0, 1)
)
factorlevels <- c("000", "010", "011", "100", "110", "111")
df1 <- df1 %>%
  mutate(
    new_variable_code = pmap_dbl(list(variable_1, variable_2, variable_3),
                                 ~ which(paste0(..1, ..2, ..3) == factorlevels))
  )
Output:
# A tibble: 6 x 4
variable_1 variable_2 variable_3 new_variable_code
<dbl> <dbl> <dbl> <dbl>
1 0 0 0 1
2 0 1 0 2
3 0 1 1 3
4 1 0 0 4
5 1 1 0 5
6 1 1 1 6
I would just create a variable with paste using sep="." and make it a factor:
newvar <- factor( paste(variable_1, variable_2, variable_3, sep="."))
I don't think it would be a good idea to then make it a sequential value; it's already an integer with levels, since that's how factors get created.
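A minimal sketch of this approach, using stand-in vectors for the three dummies:

```r
variable_1 <- c(0, 0, 0, 1, 1, 1)
variable_2 <- c(0, 1, 1, 0, 1, 1)
variable_3 <- c(0, 0, 1, 0, 0, 1)

# One factor level per combination that actually occurs, e.g. "0.1.1"
newvar <- factor(paste(variable_1, variable_2, variable_3, sep = "."))
levels(newvar)     # "0.0.0" "0.1.0" "0.1.1" "1.0.0" "1.1.0" "1.1.1"
as.integer(newvar) # the sequential code, should you still want one
```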
When I include factor1, factor2, and their interaction, the interaction term uses the combination of each factor's base level as its own base level. However, if I include the interaction term only (factor1:factor2 instead of factor1*factor2), the combination of the last level of both factors is used as the reference (i.e. this row has NA for estimate, std. error, etc.). I have checked multiple times that each factor has the right base level configured before building the model. Is there a way to make the combination of each factor's first level the reference? Thanks!
Let's look at what's going on here.
(dd <- expand.grid(f1=letters[1:2],f2=LETTERS[1:2]))
## f1 f2
## 1 a A
## 2 b A
## 3 a B
## 4 b B
Add a response variable:
dd2 <- data.frame(dd,y=c(1,2,3,5))
Use model.matrix() to look at what dummy variables get constructed.
data.frame(dd,model.matrix(~f1*f2,data=dd),check.names=FALSE)
## f1 f2 (Intercept) f1b f2B f1b:f2B
## 1 a A 1 0 0 0
## 2 b A 1 1 0 0
## 3 a B 1 0 1 0
## 4 b B 1 1 1 1
So the baseline (intercept) is the a:A combination; the f1b parameter is the a-b contrast when f2==A; the f2B parameter is the A-B contrast when f1==a; and the interaction is the difference between the observed value for b:B and its expectation under the additive model.
If we specify only the interaction, R doesn't know to drop the intercept column. In this overparameterized model matrix there isn't really a "baseline" level, but when there is rank deficiency, R drops the last aliased column by default, so you effectively end up with bB as your baseline (since the bB row of the matrix is [1 0 0 0] if we drop the last column).
data.frame(dd,X3 <- model.matrix(~f1:f2,data=dd),check.names=FALSE)
## f1 f2 (Intercept) f1a:f2A f1b:f2A f1a:f2B f1b:f2B
## 1 a A 1 1 0 0 0
## 2 b A 1 0 1 0 0
## 3 a B 1 0 0 1 0
## 4 b B 1 0 0 0 1
If you want to use a specified model matrix, you can cheat and do this directly. You have to remember that if you don't specify -1 in the formula, R will automatically re-add an intercept column, so here we get rid of the first two columns (y~. says "use all the variables in the data frame, except the response variable, as predictors").
dd3 <- data.frame(y=dd2$y,X3[,-(1:2)])
coef(lm(y~.,data=dd3))
Looking at the model matrix above but leaving out the second column, we interpret the coefficients as:
(Intercept) ([1 0 0 0]) is the value of the a:A combination
f1b:f2A ([1 1 0 0]) is the a-b contrast when f2==A
f1a:f2B ([1 0 1 0]) is the A-B contrast when f1==a
f1b:f2B is now the straight contrast between the b:B and a:A combinations, uncorrected for the additive effects. Is that really what you wanted?
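As an aside (an alternative to the model-matrix surgery above, not part of it): collapsing the two factors into one with interaction() gives a model whose reference level is the first-level combination a:A directly:

```r
dd2 <- data.frame(expand.grid(f1 = letters[1:2], f2 = LETTERS[1:2]),
                  y = c(1, 2, 3, 5))

# Single factor with levels a.A, b.A, a.B, b.B; a.A is the reference
dd2$f12 <- interaction(dd2$f1, dd2$f2)
coef(lm(y ~ f12, data = dd2))
## (Intercept)      f12b.A      f12a.B      f12b.B
##           1           1           2           4
```

Each coefficient is then a plain contrast against the a:A cell, which is what the question asked for.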
I am trying to transform each of my predictions into an N-column vector. I.e.,
say my prediction set is a factor with 3 levels, and I would like to write each prediction as a vector of length 3.
My Current Output is
Id Prediction
1 Prediction 1
2 Prediction 2
3 Prediction 3
and what I am trying to achieve
Id Prediction1 Prediction2 Prediction3
1 0 0 1
2 1 0 0
What is a simpler way of achieving this in R?
It looks like you want to perform so-called "one hot encoding" of your Prediction factor variable by introducing dummy variables. One way to do so is using the caret package.
Suppose you have a data frame like this:
> df <- data.frame(Id = c(1, 2, 3, 4), Prediction = c("Prediction 3", "Prediction 1", "Prediction 2", "Prediction 3"))
> df
Id Prediction
1 1 Prediction 3
2 2 Prediction 1
3 3 Prediction 2
4 4 Prediction 3
First make sure you have the caret package installed and loaded.
> install.packages('caret')
> library(caret)
You can then use caret's dummyVars() function to create dummy variables.
> dummies <- dummyVars( ~ Prediction, data = df, levelsOnly = TRUE)
The first argument to dummyVars(), a formula, tells it to generate dummy variables for the Prediction factor in the data frame df. (levelsOnly = TRUE strips the variable name from the column names, leaving just the level, which looks nicer in this case.)
The dummy variables can then be passed to the predict() function to generate a matrix with the one hot encoded factors.
> encoded <- predict(dummies, df)
> encoded
Prediction 1 Prediction 2 Prediction 3
1 0 0 1
2 1 0 0
3 0 1 0
4 0 0 1
You can then, for example, create a new data frame with the encoded variables instead of the original factor variable:
> data.frame(Id = df$Id, encoded)
Id Prediction.1 Prediction.2 Prediction.3
1 1 0 0 1
2 2 1 0 0
3 3 0 1 0
4 4 0 0 1
This technique generalises easily to a mixture of numerical and categorical variables. Here's a more general example:
> df <- data.frame(Id = c(1,2,3,4), Var1 = c(3.4, 2.1, 6.0, 4.7), Var2 = c("B", "A", "B", "A"), Var3 = c("Rainy", "Sunny", "Sunny", "Cloudy"))
> dummies <- dummyVars(Id ~ ., data = df)
> encoded <- predict(dummies, df)
> encoded
Var1 Var2.A Var2.B Var3.Cloudy Var3.Rainy Var3.Sunny
1 3.4 0 1 0 1 0
2 2.1 1 0 0 0 1
3 6.0 0 1 0 0 1
4 4.7 1 0 1 0 0
All numerical variables remain unchanged, whereas all categorical variables get encoded. A typical situation where this is useful is to prepare data for a machine learning algorithm that only accepts numerical variables, not categorical variables.
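If you'd rather avoid a package dependency, a base-R sketch of the same one-hot idea (not part of the answer above) uses model.matrix with the intercept removed, so every level gets its own indicator column:

```r
df <- data.frame(Id = c(1, 2, 3, 4),
                 Prediction = c("Prediction 3", "Prediction 1",
                                "Prediction 2", "Prediction 3"))

# "- 1" drops the intercept, so all three levels appear as indicator columns
encoded <- model.matrix(~ Prediction - 1, data = df)
encoded
```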
You can use something like:
as.numeric(data[1,][2:4])
Where '1' is the row number that you are converting to a vector.
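For example, with a stand-in data frame already in the wide one-hot layout (the column names here are assumed, matching the desired output in the question):

```r
data <- data.frame(Id = c(1, 2),
                   Prediction1 = c(0, 1),
                   Prediction2 = c(0, 0),
                   Prediction3 = c(1, 0))

as.numeric(data[1, ][2:4])  # row 1 as a plain numeric vector: 0 0 1
```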
Taking WhiteViking's start and using the table() function seems to work.
> df <- data.frame(Id = c(1, 2, 3, 4), Prediction = c("Prediction 3", "Prediction 1", "Prediction 2", "Prediction 3"))
> df
Id Prediction
1 1 Prediction 3
2 2 Prediction 1
3 3 Prediction 2
4 4 Prediction 3
> table(df$Id, df$Prediction)
Prediction 1 Prediction 2 Prediction 3
1 0 0 1
2 1 0 0
3 0 1 0
4 0 0 1
I would use the reshape function
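A sketch of how that could look with base R's reshape(), reusing the example data frame from above and adding an indicator column to spread (the "value.*" column names are reshape's defaults, not from the original answer):

```r
df <- data.frame(Id = c(1, 2, 3, 4),
                 Prediction = c("Prediction 3", "Prediction 1",
                                "Prediction 2", "Prediction 3"))
df$value <- 1  # indicator that becomes the cell contents in wide format

wide <- reshape(df, idvar = "Id", timevar = "Prediction", direction = "wide")
wide[is.na(wide)] <- 0  # combinations that never occur become 0
wide
```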
I would like to run random forest on a large data set: 100k * 400. When I use random forest it takes a lot of time. Can I use parRF method from caret package in order to reduce running time?
What is the right syntax for that?
Here is an example dataframe:
dat <- read.table(text = " TargetVar Var1 Var2 Var3
0 0 0 7
0 0 1 1
0 1 0 3
0 1 1 7
1 0 0 5
1 0 1 1
1 1 0 0
1 1 1 6
0 0 0 8
0 0 1 5
1 1 1 4
0 0 1 2
1 0 0 9
1 1 1 2 ", header = TRUE)
I tried:
library('caret')
m<-randomForest(TargetVar ~ Var1 + Var2 + Var3, data = dat, ntree=100, importance=TRUE, method='parRF')
But I don't see much of a difference. Any ideas?
The reason that you don't see a difference is that you aren't using the caret package. You do load it into your environment with the library() command, but then you run randomForest() which doesn't use caret.
I suggest starting by creating a data frame (or data.table) that contains only your input variables, and a vector containing your outcomes. I'm referring to the recently updated caret docs.
x <- data.frame(dat$Var1, dat$Var2, dat$Var3)
y <- dat$TargetVar
Next, verify that you have the parRF method available. I didn't until I updated my caret package to the most recent version (6.0-29).
library("randomForest")
library("caret")
names(getModelInfo())
You should see parRF in the output. Now you're ready to create your training model.
library(foreach)
rfParam <- expand.grid(mtry = 2)  # mtry is parRF's only tuning parameter
m <- train(x, y, method = "parRF", tuneGrid = rfParam, ntree = 100, importance = TRUE)
Note that ntree and importance are not tuning parameters, so they are passed through train()'s ... rather than in tuneGrid.
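One caveat worth adding: parRF parallelises through foreach, so it only helps if a parallel backend has been registered first. A sketch with the doParallel package (assuming caret, randomForest, and doParallel are installed; the data here are synthetic stand-ins and the worker count is illustrative):

```r
library(caret)         # provides train()
library(randomForest)  # parRF wraps randomForest
library(doParallel)

# Register a backend; without this, parRF falls back to sequential execution
cl <- makeCluster(2)   # e.g. parallel::detectCores() - 1 in practice
registerDoParallel(cl)

# Synthetic stand-in data in the same shape as the question's dat
set.seed(825)
x <- data.frame(Var1 = rbinom(100, 1, 0.5),
                Var2 = rbinom(100, 1, 0.5),
                Var3 = sample(0:9, 100, replace = TRUE))
y <- factor(rbinom(100, 1, 0.5))

m <- train(x, y, method = "parRF", tuneGrid = data.frame(mtry = 2), ntree = 100)

stopCluster(cl)        # release the workers when finished
```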