I am trying to transform each prediction into an N-column indicator vector. For example, say my prediction set is a factor with 3 levels, and I would like to write each prediction as a vector of length 3.
My Current Output is
Id Prediction
1 Prediction 1
2 Prediction 2
3 prediction 3
and what I am trying to achieve
Id Prediction1 Prediction2 Prediction3
1 0 0 1
2 1 0 0
What is a simpler way of achieving this in R?
It looks like you want to perform "one-hot encoding" of your Prediction factor variable by introducing dummy variables. One way to do this is with the caret package.
Suppose you have a data frame like this:
> df <- data.frame(Id = c(1, 2, 3, 4), Prediction = c("Prediction 3", "Prediction 1", "Prediction 2", "Prediction 3"))
> df
Id Prediction
1 1 Prediction 3
2 2 Prediction 1
3 3 Prediction 2
4 4 Prediction 3
First make sure you have the caret package installed and loaded.
> install.packages('caret')
> library(caret)
You can then use caret's dummyVars() function to create dummy variables.
> dummies <- dummyVars( ~ Prediction, data = df, levelsOnly = TRUE)
The first argument to dummyVars(), a formula, tells it to generate dummy variables for the Prediction factor in the data frame df. (levelsOnly = TRUE strips the variable name from the column names, leaving just the level, which looks nicer in this case.)
The dummies object can then be passed to the predict() function to generate a matrix with the one-hot-encoded factors.
> encoded <- predict(dummies, df)
> encoded
Prediction 1 Prediction 2 Prediction 3
1 0 0 1
2 1 0 0
3 0 1 0
4 0 0 1
You can then, for example, create a new data frame with the encoded variables instead of the original factor variable:
> data.frame(Id = df$Id, encoded)
Id Prediction.1 Prediction.2 Prediction.3
1 1 0 0 1
2 2 1 0 0
3 3 0 1 0
4 4 0 0 1
This technique generalises easily to a mixture of numerical and categorical variables. Here's a more general example:
> df <- data.frame(Id = c(1,2,3,4), Var1 = c(3.4, 2.1, 6.0, 4.7), Var2 = c("B", "A", "B", "A"), Var3 = c("Rainy", "Sunny", "Sunny", "Cloudy"))
> dummies <- dummyVars(Id ~ ., data = df)
> encoded <- predict(dummies, df)
> encoded
Var1 Var2.A Var2.B Var3.Cloudy Var3.Rainy Var3.Sunny
1 3.4 0 1 0 1 0
2 2.1 1 0 0 0 1
3 6.0 0 1 0 0 1
4 4.7 1 0 1 0 0
All numerical variables remain unchanged, whereas all categorical variables get encoded. A typical situation where this is useful is to prepare data for a machine learning algorithm that only accepts numerical variables, not categorical variables.
You can use something like:
as.numeric(data[1,][2:4])
Where '1' is the row number that you are converting to a vector.
Taking WhiteViking's start and using the table() function seems to work.
> df <- data.frame(Id = c(1, 2, 3, 4), Prediction = c("Prediction 3", "Prediction 1", "Prediction 2", "Prediction 3"))
> df
Id Prediction
1 1 Prediction 3
2 2 Prediction 1
3 3 Prediction 2
4 4 Prediction 3
> table(df$Id, df$Prediction)
Prediction 1 Prediction 2 Prediction 3
1 0 0 1
2 1 0 0
3 0 1 0
4 0 0 1
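If you need the result as a data frame rather than a table object, it can be converted with as.data.frame.matrix() (a sketch, assuming the same df as above):

```r
df <- data.frame(Id = c(1, 2, 3, 4),
                 Prediction = c("Prediction 3", "Prediction 1",
                                "Prediction 2", "Prediction 3"))
# Contingency table of Id vs. Prediction, coerced to a data frame of 0/1 columns
encoded <- as.data.frame.matrix(table(df$Id, df$Prediction))
encoded <- cbind(Id = df$Id, encoded)
```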
I would use the reshape function.
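A sketch of that approach with base R's stats::reshape, assuming the same df as above: add an indicator column, cast long to wide, and fill the gaps with 0.

```r
df <- data.frame(Id = c(1, 2, 3, 4),
                 Prediction = c("Prediction 3", "Prediction 1",
                                "Prediction 2", "Prediction 3"))
df$value <- 1                      # indicator for the observed level
wide <- reshape(df, idvar = "Id", timevar = "Prediction", direction = "wide")
wide[is.na(wide)] <- 0             # absent levels become 0
```

The resulting columns are named value.Prediction 1 etc.; rename them if you prefer cleaner headers.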
I need to combine 3 dummy variables into one categorical variable for a logistic regression in R.
I made the combination manually, as follows:
new_variable_code variable_1 variable_2 variable_3
1                 0          0          0
2                 0          1          0
3                 0          1          1
4                 1          0          0
5                 1          1          0
6                 1          1          1
I excluded the other two options (0 0 1 and 1 0 1) because I do not need them; they are not represented in the data.
I then used new_variable_code as a factor in the logistic regression along with other predictors.
My question is: is there any automated way to create the same new_variable_code? Or another econometric technique to encode the 3 dummy variables into one categorical variable inside a logistic regression model?
My objective: To understand which variable combination has the highest odds ratio on the outcome variable (along with other predictors explained in the same model).
Thank you
You could use pmap_dbl in the following way to recode your dummy variables to a 1-6 scale:
library(tidyverse)
# Reproducing your data
df1 <- tibble(
  variable_1 = c(0, 0, 0, 1, 1, 1),
  variable_2 = c(0, 1, 1, 0, 1, 1),
  variable_3 = c(0, 0, 1, 0, 0, 1)
)

factorlevels <- c("000", "010", "011", "100", "110", "111")

df1 <- df1 %>%
  mutate(
    new_variable_code = pmap_dbl(list(variable_1, variable_2, variable_3),
                                 ~ which(paste0(..1, ..2, ..3) == factorlevels))
  )
Output:
# A tibble: 6 x 4
variable_1 variable_2 variable_3 new_variable_code
<dbl> <dbl> <dbl> <dbl>
1 0 0 0 1
2 0 1 0 2
3 0 1 1 3
4 1 0 0 4
5 1 1 0 5
6 1 1 1 6
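The same 1-6 recoding can also be done compactly in base R with match() on the pasted codes (a sketch reproducing the data as a plain data frame):

```r
df1 <- data.frame(variable_1 = c(0, 0, 0, 1, 1, 1),
                  variable_2 = c(0, 1, 1, 0, 1, 1),
                  variable_3 = c(0, 0, 1, 0, 0, 1))
factorlevels <- c("000", "010", "011", "100", "110", "111")
# match() returns the position of each pasted code within factorlevels
df1$new_variable_code <- match(
  paste0(df1$variable_1, df1$variable_2, df1$variable_3),
  factorlevels)
```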
I would just create a variable with paste using sep="." and make it a factor:
newvar <- factor( paste(variable_1, variable_2, variable_3, sep="."))
I don't think it would be a good idea to then make it a sequential value; it's already an integer with levels, since that's how factors are created.
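Base R's interaction() builds the same combined factor in one step (a sketch; drop = TRUE keeps only the combinations that actually occur in the data):

```r
variable_1 <- c(0, 0, 0, 1, 1, 1)
variable_2 <- c(0, 1, 1, 0, 1, 1)
variable_3 <- c(0, 0, 1, 0, 0, 1)
# Combined factor with levels like "0.0.0", "0.1.0", ...
newvar <- interaction(variable_1, variable_2, variable_3, drop = TRUE)
```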
Consider a data set train:
z a
1 1
0 2
0 1
1 3
0 1
1 2
1 1
0 3
0 1
1 3
with a binary outcome variable z and a categorical predictor a with three levels: 1,2,3.
Now consider a data set test:
z a
1
1
2
1
2
2
1
When I run the following code:
library(randomForest)
set.seed(825)
RFfit1 <- randomForest(z~a, data=train, importance=TRUE, ntree=2000)
RFprediction1 <- predict(RFfit1, test)
I get the following error message:
Error in predict.randomForest(RFfit1, test1) :
Type of predictors in new data do not match that of the training data.
I am assuming this is because the variable a in the test data set does not have three levels. How would I fix this?
You must assign it the same levels as the training data:
test$a <- factor(test$a, levels=levels(train$a))
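A small self-contained illustration of the fix, using made-up data in the shape of the question; the key line is the factor() call that copies the training levels onto the test column:

```r
train <- data.frame(z = factor(c(1, 0, 0, 1, 0, 1, 1, 0, 0, 1)),
                    a = factor(c(1, 2, 1, 3, 1, 2, 1, 3, 1, 3)))
test <- data.frame(a = c(1, 1, 2, 1, 2, 2, 1))
# Before: as.factor(test$a) would only carry levels "1" and "2"
test$a <- factor(test$a, levels = levels(train$a))
# After: test$a carries all three training levels, so predict() accepts it
```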
I would like to run random forest on a large data set (100k x 400). Running randomForest directly takes a lot of time. Can I use the parRF method from the caret package to reduce the running time?
What is the right syntax for that?
Here is an example dataframe:
dat <- read.table(text = " TargetVar Var1 Var2 Var3
0 0 0 7
0 0 1 1
0 1 0 3
0 1 1 7
1 0 0 5
1 0 1 1
1 1 0 0
1 1 1 6
0 0 0 8
0 0 1 5
1 1 1 4
0 0 1 2
1 0 0 9
1 1 1 2 ", header = TRUE)
I tried:
library('caret')
m<-randomForest(TargetVar ~ Var1 + Var2 + Var3, data = dat, ntree=100, importance=TRUE, method='parRF')
But I don't see much of a difference. Any ideas?
The reason you don't see a difference is that you aren't actually using the caret package. You do load it into your environment with the library() command, but you then run randomForest(), which doesn't use caret.
I'll suggest starting by creating a data frame (or data.table) that contains only your input variables, and a vector containing your outcomes. I'm referring to the recently updated caret docs.
x <- data.frame(dat$Var1, dat$Var2, dat$Var3)
y <- dat$TargetVar
Next, verify that you have the parRF method available. I didn't until I updated my caret package to the most recent version (6.0-29).
library("randomForest")
library("caret")
names(getModelInfo())
You should see parRF in the output. Now you're ready to create your training model.
library(foreach)
library(doParallel)
registerDoParallel(cores = max(1, detectCores() - 1))  # parRF needs a registered parallel backend, or it runs sequentially
rfParam <- expand.grid(mtry = 2)  # mtry is parRF's tuning parameter; ntree and importance are passed through to randomForest
m <- train(x, y, method = "parRF", tuneGrid = rfParam, ntree = 100, importance = TRUE)
I would like to regress the dependent variable of a logistic regression (in my data set it is dat$admit) on each of the available variables, one regression per independent variable.
The outcome I want is a list of each regression's summary. Using the data set below, there should be 3 regressions.
Here is a sample data set (where admit is the logistic regression dependent variable) :
dat <- read.table(text = "
female apcalc admit num
0 0 0 7
0 0 1 1
0 1 0 3
0 1 1 7
1 0 0 5
1 0 1 1
1 1 0 0
1 1 1 6",
header = TRUE)
I found an example for simple linear regression, but when I tried to change the function from lm to glm I got list() as a result.
Here is the original code - for the iris dataset where "Sepal.Length" is the dependent variable :
sapply(names(iris)[-1],
function(x) lm.fit(cbind(1, iris[,x]), iris[,"Sepal.Length"])$coef)
How can I create the right function for a logistic regression?
dat <- read.table(text = "
female apcalc admit num
0 0 0 7
0 0 1 1
0 1 0 3
0 1 1 7
1 0 0 5
1 0 1 1
1 1 0 0
1 1 1 6",
header = TRUE)
This is perhaps a little too condensed, but it does the job.
Of course, the sample data set is too small to get any sensible
answers ...
t(sapply(setdiff(names(dat),"admit"),
function(x) coef(glm(reformulate(x,response="admit"),
data=dat,family=binomial))))
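Since the question asked for a list of full regression summaries rather than just the coefficients, a sketch that keeps one fitted model and one summary per predictor:

```r
dat <- read.table(text = "
female apcalc admit num
0 0 0 7
0 0 1 1
0 1 0 3
0 1 1 7
1 0 0 5
1 0 1 1
1 1 0 0
1 1 1 6",
header = TRUE)

predictors <- setdiff(names(dat), "admit")
# One univariate logistic regression per predictor
models <- lapply(predictors, function(x)
  glm(reformulate(x, response = "admit"), data = dat, family = binomial))
names(models) <- predictors
summaries <- lapply(models, summary)
```

summaries$female, summaries$apcalc, and summaries$num then hold the individual summary objects.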
So, my data set consists of 15 variables, one of which (sex) has only 2 levels. I want to use it as a dummy variable, but its levels are 1 and 2. I want levels 0 and 1 instead; how do I manage this in R?
With most of R's modelling tools that have a formula interface, you don't need to create dummy variables; the underlying code that handles and interprets the formula will do this for you. If you want a dummy variable for some other reason, there are several options. The easiest (IMHO) is model.matrix():
set.seed(1)
dat <- data.frame(sex = sample(c("male","female"), 10, replace = TRUE))
model.matrix( ~ sex - 1, data = dat)
which gives:
> dummy <- model.matrix( ~ sex - 1, data = dat)
> dummy
sexfemale sexmale
1 0 1
2 0 1
3 1 0
4 1 0
5 0 1
6 1 0
7 1 0
8 1 0
9 1 0
10 0 1
attr(,"assign")
[1] 1 1
attr(,"contrasts")
attr(,"contrasts")$sex
[1] "contr.treatment"
> dummy[,1]
1 2 3 4 5 6 7 8 9 10
0 0 1 1 0 1 1 1 1 0
You can use either column of dummy as a numeric dummy variable; choose whichever column you want to be the 1-based level. dummy[,1] chooses 1 as representing the female class and dummy[,2] the male class.
Cast this as a factor if you want it to be interpreted as a categorical object:
> factor(dummy[, 1])
1 2 3 4 5 6 7 8 9 10
0 0 1 1 0 1 1 1 1 0
Levels: 0 1
But that rather defeats the purpose of a factor; what does 0 mean again?
Try this:
set.seed(001) # generating some data
sex <- factor(sample(1:2, 10, replace=TRUE)) # this is what you have
[1] 1 1 2 2 1 2 2 2 2 1
Levels: 1 2
sex <- factor(ifelse(as.numeric(sex) == 2, 1, 0)) # this is what you want
sex
[1] 0 0 1 1 0 1 1 1 1 0
Levels: 0 1
If you want labels to be 0 = Male and 1 = Female, then...
sex <- factor(ifelse(as.numeric(sex) == 2, 1, 0), labels = c('M', 'F'))
sex # this is what you want
[1] M M F F M F F F F M
Levels: M F
Actually, you don't need to create a dummy variable to estimate a model using lm. Let's see this example:
set.seed(001) # Generating some data
N <- 100
x <- rnorm(N, 50, 20)
y <- 20 + 3.5*x + rnorm(N)
sex <- factor(sample(1:2, N, replace=TRUE))
# Estimating the linear model
lm(y ~ x + sex) # using the first category as the baseline (this means sex==1)
Call:
lm(formula = y ~ x + sex)
Coefficients:
(Intercept) x sex2
19.97815 3.49994 -0.02719
# renaming the categories and labelling them
sex<-factor(ifelse(as.numeric(sex)==2, 1,0), labels=c('M', 'F'))
lm(y ~ x + sex) # the same results, baseline is 'Male'
Call:
lm(formula = y ~ x + sex)
Coefficients:
(Intercept) x sexF
19.97815 3.49994 -0.02719
As you can see, R handles the dummies for you: just pass the variable into the formula as a factor and R does the rest.
By the way, there is no need to recode the categories from 1/2 to 0/1; the results are the same, as the example above shows.
As suggested by many above, turn it into a factor.
If you really want to dummy-code the gender variable, consider this:
set.seed(100)
gender = rbinom(100,1,0.5)+1
gender_dummy = gender-1