How to use parRF method so random forest will run faster - r

I would like to run random forest on a large data set: 100k * 400. When I use random forest it takes a lot of time. Can I use parRF method from caret package in order to reduce running time?
What is the right syntax for that?
Here is an example dataframe:
dat <- read.table(text = " TargetVar Var1 Var2 Var3
0 0 0 7
0 0 1 1
0 1 0 3
0 1 1 7
1 0 0 5
1 0 1 1
1 1 0 0
1 1 1 6
0 0 0 8
0 0 1 5
1 1 1 4
0 0 1 2
1 0 0 9
1 1 1 2 ", header = TRUE)
I tried:
library('caret')
m<-randomForest(TargetVar ~ Var1 + Var2 + Var3, data = dat, ntree=100, importance=TRUE, method='parRF')
But I don't see much of a difference. Any ideas?

The reason that you don't see a difference is that you aren't using the caret package. You do load it into your environment with the library() command, but then you run randomForest() which doesn't use caret.
I suggest starting by creating a data frame (or data.table) that contains only your input variables, and a vector containing your outcomes. I'm referring to the recently updated caret docs.
x <- dat[, c("Var1", "Var2", "Var3")]
y <- factor(dat$TargetVar)  # a factor outcome makes train() fit a classifier rather than a regression
Next, verify that you have the parRF method available. I didn't until I updated my caret package to the most recent version (6.0-29).
library("randomForest")
library("caret")
names(getModelInfo())
You should see parRF in the output. Now you're ready to create your training model.
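library(doParallel)
registerDoParallel(cores = 2)  # parRF only runs in parallel once a foreach backend is registered; 2 workers is an arbitrary choice
rfParam <- expand.grid(mtry = 2)  # mtry is the only tuning parameter parRF accepts in tuneGrid
# ntree and importance are passed through train() to randomForest()
m <- train(x, y, method = "parRF", tuneGrid = rfParam, ntree = 100, importance = TRUE)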

Related

Multinomial mixed effects model R

I want to run a multinomial mixed effects model with the mclogit package of R.
The head of my data frame is shown below.
> head(mydata)
ID VAR1 X1 Time Y other_X3 other_X4 other_X5 other_X6 other_X7
1 1 1 1 1 10 0 0 0 0 0
2 1 1 1 2 5 1 1 1 0 2
3 2 2 3 1 10 0 0 0 0 0
4 2 2 3 2 7 1 0 0 0 2
5 3 1 3 1 10 0 0 0 0 0
6 3 1 3 2 7 1 0 0 0 2
The Y variable is a categorical variable with 10 levels (1-10; it is a score variable).
I want a model of the form y ~ x1 + x2 with a random intercept (for the ID variable) and a random slope (for the Time variable).
I tried the following command, but I got an error.
> mixed_model <- mclogit( cbind(Y, ID) ~ X1 + Time + X1*Time,
+ random = list(~1|ID, ~Time|ID), data = mydata)
Error in FUN(X[[i]], ...) :
No predictor variable remains in random part of the model.
Please reconsider your model specification.
In addition: Warning messages:
1: In mclogit(cbind(Y, ID) ~ X1 + Time + X1 * Time, random = list(~1 | :
removing X1 from model due to insufficient within-choice set variance
2: In FUN(X[[i]], ...) : removing intercept from random part of the model
because of insufficient within-choice set variance
Any idea how to correct it?
Thank you in advance.

Creating a combination of dummy variables into a single variable in a logistic regression model in R

I need to create possible combinations of 3 dummy variables into one categorical variable in a logistic regression using R.
I made the combination manually just like the following:
new_variable_code  variable_1  variable_2  variable_3
1                  0           0           0
2                  0           1           0
3                  0           1           1
4                  1           0           0
5                  1           1           0
6                  1           1           1
I excluded the other two options (0 0 1) and (1 0 1) because I do not need them; they are not represented in the data.
I then used new_variable_code as a factor in the logistic regression along with other predictors.
My question is: Is there any automated way to create the same new_variable_code? Or is there another econometric technique to encode the 3 dummy variables into 1 categorical variable inside a logistic regression model?
My objective: To understand which variable combination has the highest odds ratio on the outcome variable (along with other predictors explained in the same model).
Thank you
You could use pmap_dbl in the following way to recode your dummy variables to a 1-6 scale:
library(tidyverse)
# Reproducing your data
df1 <- tibble(
variable_1 = c(0,0,0,1,1,1),
variable_2 = c(0,1,1,0,1,1),
variable_3 = c(0,0,1,0,0,1)
)
factorlevels <- c("000","010","011","100","110","111")
df1 <- df1 %>%
mutate(
new_variable_code = pmap_dbl(list(variable_1, variable_2, variable_3),
~ which(paste0(..1, ..2, ..3) == factorlevels))
)
Output:
# A tibble: 6 x 4
variable_1 variable_2 variable_3 new_variable_code
<dbl> <dbl> <dbl> <dbl>
1 0 0 0 1
2 0 1 0 2
3 0 1 1 3
4 1 0 0 4
5 1 1 0 5
6 1 1 1 6
I would just create a variable with paste using sep="." and make it a factor:
newvar <- factor( paste(variable_1, variable_2, variable_3, sep="."))
I don't think it would be a good idea to then convert it to a sequential value; a factor is already stored as integer codes with levels attached, since that's how factors get created.
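As a rough sketch of the paste() approach above: the data frame `df1` and the column name `combo` are made up for illustration, and `outcome` and `other_predictor` in the commented-out model call are hypothetical columns that are not in the question.

```r
# Hypothetical data reproducing the six observed combinations
df1 <- data.frame(
  variable_1 = c(0, 0, 0, 1, 1, 1),
  variable_2 = c(0, 1, 1, 0, 1, 1),
  variable_3 = c(0, 0, 1, 0, 0, 1)
)

# paste() builds labels such as "0.1.1"; factor() keeps only the
# combinations that actually occur in the data as levels
df1$combo <- factor(with(df1, paste(variable_1, variable_2, variable_3, sep = ".")))

levels(df1$combo)      # the observed combinations, in sorted order
as.integer(df1$combo)  # the integer codes R already stores for the factor

# Assumed usage inside a logistic regression (not run here):
# fit <- glm(outcome ~ combo + other_predictor, family = binomial, data = df1)
# exp(coef(fit))  # odds ratios relative to the reference level
```

Because only observed combinations become levels, the unused (0 0 1) and (1 0 1) patterns never appear, and the first level (here 0.0.0) would act as the reference category in a model.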

How to use predict.stepplr() and confusionmatrix() correctly for step.plr method?

#Here is my code:
library(MASS)
library(caret)
library(stepPlr)
library(janitor)
#stepPlr: L2 penalized logistic regression with a stepwise variable selection
#MASS: Support Functions and Datasets for Venables and Ripley's MASS
#caret: Classification and Regression Training
#janitor: Simple Tools for Examining and Cleaning Dirty Data
#Howells is a main dataframe, we will segregate it.
HNORSE <- Howells[which(Pop=='NORSE'),]
#Let's remove NA cols
#We will use janitor package here to remove NA cols
HNORSE <- remove_empty_cols(HNORSE)
#Assigning 0's and 1's to females and males resp.
HNORSE$PopSex[HNORSE$PopSex=="NORSEF"] <- '0'
HNORSE$PopSex[HNORSE$PopSex=="NORSEM"] <- '1'
HNORSE$PopSex <- as.numeric(HNORSE$PopSex)
HNORSE$PopSex
#Resultant column looks like this
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1
[41] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0
[81] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
I want to use step.plr from the stepPlr package:
a <- step.plr(HNORSE[,c(6:76)], HNORSE$PopSex, lambda = 1e-4, cp="bic", max.terms = 1, trace = TRUE, type = "forward")
#Where HNORSE[,c(6:76)] --> features
#HNORSE$PopSex ---> Binary response
#lambda ----> Default value
#max.terms ---> I tried more than 1 value for max.terms, but then R goes into infinite loop of 'Convergence Error'.
#That's why using max.terms=1
Then I ran summary command on "a"
summary(a)
Call: plr(x = ix0, y = y, weights = weights, offset.subset = offset.subset,
offset.coefficients = offset.coefficients, lambda = lambda,
cp = cp)
Coefficients:Estimate Std.Error z value Pr(>|z|)
Intercept -71.93470 13.3521 -5.388 0
ZYB 0.55594 0.1033 5.382 0
Null deviance: 152.49 on 109 degrees of freedom
Residual deviance: 57.29 on 108 degrees of freedom
Score: deviance + 4.7 * df = 66.69
I used step.plr, so I should then use predict.stepplr and not predict.plr, right?
By this logic I want to use predict.stepplr. The example from its documentation goes like this:
n <- 100
p <- 5
x0 <- matrix(sample(seq(3),n*p,replace=TRUE),nrow=n)
x0 <- cbind(rnorm(n),x0)
y <- sample(c(0,1),n,replace=TRUE)
level <- vector("list",length=6)
for (i in 2:6) level[[i]] <- seq(3)
fit <- step.plr(x0,y,level=level)
x1 <- matrix(sample(seq(3),n*p,replace=TRUE),nrow=n)
x1 <- cbind(rnorm(n),x1)
pred1 <- predict(fit,x0,x1,type="link")
pred2 <- predict(fit,x0,x1,type="response")
pred3 <- predict(fit,x0,x1,type="class")
object: stepplr object
x: matrix of features used for fitting object.
If newx is provided, x must be provided as well.
newx: matrix of features at which the predictions are made.
If newx=NULL, predictions for the training data are returned.
type: If type=link, the linear predictors are returned;
if type=response, the probability estimates are returned; and
if type=class, the class labels are returned. Default is type=link.
...: other options for prediction.
First of all, I did not do any sampling like the example shown here.
I want to predict HNORSE$PopSex, which is a binary variable.
My data set without the binary variable column is HNORSE[,c(6:76)].
What should I pass as the x0 and x1 arguments of predict.stepplr()? If those are not the right arguments, how do I correctly call predict.stepplr?
I want to compute the overall accuracy and then plot(density(overall_accuracy)).

R How to transform Prediction as N Column Vector

I am trying to transform each of my predictions into an N-column vector. For example, say my prediction set is a factor with 3 levels, and I would like to write each prediction as a vector of length 3.
My current output is
Id Prediction
1 Prediction 1
2 prediction 2
3 prediction 3
and what I am trying to achieve
Id Prediction1 Prediction2 Prediction3
1 0 0 1
2 1 0 0
What is a simpler way of achieving this in R?
It looks like you want to perform so-called "one hot encoding" of your Prediction factor variable by introducing dummy variables. One way to do so is using the caret package.
Suppose you have a data frame like this:
> df <- data.frame(Id = c(1, 2, 3, 4), Prediction = c("Prediction 3", "Prediction 1", "Prediction 2", "Prediction 3"))
> df
Id Prediction
1 1 Prediction 3
2 2 Prediction 1
3 3 Prediction 2
4 4 Prediction 3
First make sure you have the caret package installed and loaded.
> install.packages('caret')
> library(caret)
You can then use caret's dummyVars() function to create dummy variables.
> dummies <- dummyVars( ~ Prediction, data = df, levelsOnly = TRUE)
The first argument to dummyVars(), a formula, tells it to generate dummy variables for the Prediction factor in the data frame df. (levelsOnly = TRUE strips the variable name from the column names, leaving just the level, which looks nicer in this case.)
The dummy variables can then be passed to the predict() function to generate a matrix with the one hot encoded factors.
> encoded <- predict(dummies, df)
> encoded
Prediction 1 Prediction 2 Prediction 3
1 0 0 1
2 1 0 0
3 0 1 0
4 0 0 1
You can then, for example, create a new data frame with the encoded variables instead of the original factor variable:
> data.frame(Id = df$Id, encoded)
Id Prediction.1 Prediction.2 Prediction.3
1 1 0 0 1
2 2 1 0 0
3 3 0 1 0
4 4 0 0 1
This technique generalises easily to a mixture of numerical and categorical variables. Here's a more general example:
> df <- data.frame(Id = c(1,2,3,4), Var1 = c(3.4, 2.1, 6.0, 4.7), Var2 = c("B", "A", "B", "A"), Var3 = c("Rainy", "Sunny", "Sunny", "Cloudy"))
> dummies <- dummyVars(Id ~ ., data = df)
> encoded <- predict(dummies, df)
> encoded
Var1 Var2.A Var2.B Var3.Cloudy Var3.Rainy Var3.Sunny
1 3.4 0 1 0 1 0
2 2.1 1 0 0 0 1
3 6.0 0 1 0 0 1
4 4.7 1 0 1 0 0
All numerical variables remain unchanged, whereas all categorical variables get encoded. A typical situation where this is useful is to prepare data for a machine learning algorithm that only accepts numerical variables, not categorical variables.
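If installing caret is not an option, a similar encoding can be sketched in base R with model.matrix(). This is an alternative technique rather than what dummyVars() does internally, and the renaming of the columns is just a cosmetic choice made here.

```r
df <- data.frame(
  Id = c(1, 2, 3, 4),
  Prediction = factor(c("Prediction 3", "Prediction 1", "Prediction 2", "Prediction 3"))
)

# "- 1" drops the intercept so every factor level gets its own 0/1 column
encoded <- model.matrix(~ Prediction - 1, data = df)

# Replace the default "PredictionPrediction 1"-style names with the bare levels
colnames(encoded) <- levels(df$Prediction)

encoded
```

Each row of the resulting matrix contains a single 1 in the column of that row's level, so it can be bound back to the Id column with data.frame(Id = df$Id, encoded) in the same way as above.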
You can use something like:
as.numeric(data[1,][2:4])
Where '1' is the row number that you are converting to a vector.
Taking WhiteViking's start and using table function seems to work.
> df <- data.frame(Id = c(1, 2, 3, 4), Prediction = c("Prediction 3", "Prediction 1", "Prediction 2", "Prediction 3"))
> df
Id Prediction
1 1 Prediction 3
2 2 Prediction 1
3 3 Prediction 2
4 4 Prediction 3
> table(df$Id, df$Prediction)
Prediction 1 Prediction 2 Prediction 3
1 0 0 1
2 1 0 0
3 0 1 0
4 0 0 1
I would use the reshape function

Creating a loop that will run a Logistic regression across all Independent variables

I would like to regress the dependent variable of a logistic regression (in my data set it's dat$admit) on each of the available variables separately, so that each regression has a single independent variable.
The outcome I want is a list containing the summary of each regression. Using the data set below, there should be 3 regressions.
Here is a sample data set (where admit is the logistic regression dependent variable) :
dat <- read.table(text = "
female apcalc admit num
0 0 0 7
0 0 1 1
0 1 0 3
0 1 1 7
1 0 0 5
1 0 1 1
1 1 0 0
1 1 1 6",
header = TRUE)
I found an example for simple linear regression, but when I tried to change the function from lm to glm I got "list()" as a result.
Here is the original code - for the iris dataset where "Sepal.Length" is the dependent variable :
sapply(names(iris)[-1],
function(x) lm.fit(cbind(1, iris[,x]), iris[,"Sepal.Length"])$coef)
How can I create the right function for a logistic regression?
dat <- read.table(text = "
female apcalc admit num
0 0 0 7
0 0 1 1
0 1 0 3
0 1 1 7
1 0 0 5
1 0 1 1
1 1 0 0
1 1 1 6",
header = TRUE)
This is perhaps a little too condensed, but it does the job.
Of course, the sample data set is too small to get any sensible
answers ...
t(sapply(setdiff(names(dat),"admit"),
function(x) coef(glm(reformulate(x,response="admit"),
data=dat,family=binomial))))
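If a list of full summaries is wanted rather than a coefficient matrix, one possible variation on the code above is to keep the fitted models in a named list. This is only a sketch; the names `predictors`, `fits`, and `summaries` are introduced here for illustration.

```r
dat <- read.table(text = "
female apcalc admit num
0 0 0 7
0 0 1 1
0 1 0 3
0 1 1 7
1 0 0 5
1 0 1 1
1 1 0 0
1 1 1 6",
header = TRUE)

# Every column except the response is used as a single predictor in turn
predictors <- setdiff(names(dat), "admit")

# One logistic regression per predictor, stored under the predictor's name
fits <- lapply(setNames(predictors, predictors), function(x) {
  glm(reformulate(x, response = "admit"), data = dat, family = binomial)
})

# One summary.glm object per regression
summaries <- lapply(fits, summary)

summaries$num  # inspect a single regression by predictor name
```

Keeping the fitted models in `fits` also leaves room to extract other quantities later, such as coef(fits$num), without refitting.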

Resources