R - Neuralnet package using dummies and quantitative variables in supervised learning

I am using the package neuralnet for R and would like to use supervised learning.
In my setting I have 15 explanatory variables (8 of them are dummy variables that take the values 0 and 1; the other explanatory variables are real-valued).
I want to use all explanatory variables to predict a real-valued target vector, so my setting is a regression problem.
If I run my neural net without the dummies, the neuralnet()-function produces results.
However, by incorporating the dummies, I get the error message:
Obviously, the dummies cause the error. Running the function without them works fine.
How can I make neuralnet handle the dummies properly and produce output?
Please find below a reproducible example as well as the neural network settings:
#### install packages
# install.packages("devtools")
# require(devtools)
# devtools::install_github("bips-hb/neuralnet") # CRAN version contains bug, use github version
# require(neuralnet)
### create data
set.seed(1)
dt <- matrix(rnorm(200 * 3), nrow = 200, ncol = 3) # 600 draws; rnorm(200) would be silently recycled, making all three columns identical
dummy1 <- as.factor( c(rep(1,100), rep(0,100)) ) # create vector with data (1,0) for first dummy, save as factor
dummy2 <- as.factor( c(rep(0,100), rep(1,100)) ) # create vector with data (0,1) for second dummy, save as factor
dummy_df <- data.frame(dummy1, dummy2) # merging both dummies into dataframe
class(dummy_df[,1]) # factor
class(dummy_df[,2]) # factor
# bringing original data and dummies together
train <- cbind(as.data.frame(dt), dummy_df)
# see colnames
colnames(train)
# start neural net
nnet <- neuralnet(formula = V1 ~ V2 + V3 + dummy1 + dummy2, # use V1 as target for supervised learning
                  data = train,
                  hidden = 1,                        # neurons
                  threshold = 0.01,                  # stop once the partial derivatives of the error function fall below this value
                  rep = 5,                           # number of training repetitions
                  startweights = NULL,               # starting weights
                  learningrate.factor = list(minus = 0.5, plus = 1.2), # increasing and decreasing factors
                  algorithm = "rprop+",              # resilient backpropagation with weight backtracking
                  err.fct = "sse",                   # use sum of squared errors as error function
                  act.fct = "tanh",                  # use hyperbolic tangent
                  linear.output = TRUE,              # output function is linear, regression problem
                  lifesign = "full",                 # print behavior
                  stepmax = 200000)
Thank you for your help!

It appears that neuralnet does not take factors as input (predictors). Just use the numeric values of the dummies, then it works.
dummy1 <- as.numeric(dummy1) # note: as.numeric() on a factor returns the level codes (here 1 and 2 rather than 0 and 1), which still works as a binary predictor
dummy2 <- as.numeric(dummy2)
dummy_df <- data.frame(dummy1, dummy2)
This works fine as long as you have factors with only two levels. If you have factors with more than two levels, use dummy coding.
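For illustration, here is a minimal sketch of dummy coding a factor with more than two levels using base R's model.matrix (the factor f and its levels are made up for this example):
# made-up three-level factor
f <- factor(sample(c("low", "mid", "high"), 200, replace = TRUE))
# one 0/1 indicator column per level; -1 drops the intercept so no level is absorbed into it
f_dummies <- model.matrix(~ f - 1)
head(f_dummies) # columns fhigh, flow, fmid
The resulting numeric 0/1 columns can then be cbind()-ed onto the training data and referenced in the neuralnet formula.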

Adding to Dominik's answer, which seems absolutely correct: incorporating the dummy variables is rather simple, either with a package such as (I believe) dummies, or with base R's model.matrix as shown below.
dt <- matrix(rnorm(200 * 3), nrow = 200, ncol = 3) # 600 draws; rnorm(200) would be silently recycled, making all columns identical
dummy1 <- as.factor( c(rep(1,100), rep(0,100)) ) # create vector with data (1,0) for first dummy, save as factor
dummy2 <- as.factor( c(rep(0,100), rep(1,100)) ) # create vector with data (0,1) for second dummy, save as factor
dummy_df <- data.frame(dummy1, dummy2) # merge both dummies into a data frame
mm <- model.matrix(~ dummy1 + dummy2 - 1, data = dummy_df) # -1 removes the intercept
train <- cbind(as.data.frame(dt), mm) # as.data.frame(dt) keeps the column names V1, V2, V3 used in the formula below
neuralnet::neuralnet(V1 ~ . , data = train) # possibly adding -1 to the formula is sensible as well

Related

How do I utilize imputed data, with categorical levels, in a prediction in R?

I'll illustrate my problem with the iris data set in R. My objective here is to create 5 imputed data sets, fit a regression to each imputed data set, then pool together the results of these regressions into one final model. This is the preferred order of operations for a proper execution of multiple imputation.
library(mice)
df <- iris
# Inject some missingness into the data:
df$Sepal.Width[c(20,40,70,121)] <- NA
df$Species[c(15,80,99,136)] <- NA
# Perform the standard steps of multiple imputation with MICE:
imputed_data <- mice(df, method = c(rep("pmm", 5)), m = 5, maxit = 5)
model <- with(imputed_data, lm(Sepal.Length ~ Sepal.Width + Species))
pooled_model <- pool(model)
This leaves me with this pooled_model object which I am hoping to use as a fitted model in the predict command. However, that does not work. When I run:
predict(pooled_model, newdata = iris)
I get this error:
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "c('mipo', 'data.frame')"
Disregard the reasoning why I am using the original iris data set in my newly fitted model; I simply want to be able to fit this data, or a subset of it, onto the model I created with my imputation.
I specifically chose a data set with multiple levels of a categorical variable to highlight my problem. I thought about using some matrix multiplication with which I could do this manually, but the presence of a categorical variable makes that tough. In my actual data set, I have over a hundred variables, many of which have multiple categorical levels. I say this because I realize one possible solution would be to re-code my categorical variables into dummy variables, and then I can apply some matrix multiplication to get my answer. But that would be an EXTREME amount of work for me. If there's a way I can somehow get a model object I can use in the predict function, that would make my life 100x easier.
Any suggestions?
You have two issues: 1) how to use stats::predict with pooled data and 2) what to do about your categorical variables.
Your first issue has already been documented on the mice Github page and it seems like there's been a desire to have a predict.mira function for a while. The author of the mice package posted some code on how to simulate a predict.mira-like function. Unfortunately, it only works with lm models, but it seems like that's okay considering your reprex. If you have a Github account, you can comment on that Github issue to demonstrate your interest in the predict.mira function.
Your question has also been posted on StackOverflow before; although the answer was never accepted, the SO user suggested this reading by Miles (2015).
For your second question, have you considered leaving out your current method argument when using mice()? As long as your variables are classed as factors, mice will default to the polyreg method for categorical variables and pmm for continuous variables. You can read more about the method argument here.
library(mice)
set.seed(123)
# make missing data
df <- iris
df$Sepal.Width[c(20,40,70,121)] <- NA
df$Species[c(15,80,99,136)] <- NA
# specify method
meth <- mice(df, maxit = 0, printFlag = FALSE)$meth
print(meth)
# this is how you would change your methods, if you wanted
# but pmm and polyreg are defaults
meth["Species"] <- "polr"
meth["Sepal.Width"] <- "midastouch"
print(meth)
# impute
imputed_data <- mice(df,
                     m = 5,
                     maxit = 5,
                     method = meth, # new method
                     printFlag = FALSE)
# make model
model <- with(imputed_data, lm(Sepal.Length ~ Sepal.Width + Species))
summary(pool(model))
# obtain predictions Q and prediction variance U
predm <- lapply(getfit(model), predict, se.fit = TRUE)
Q <- sapply(predm, `[[`, "fit")
U <- sapply(predm, `[[`, "se.fit")^2
dfcom <- predm[[1]]$df
# pool predictions
pred <- matrix(NA, nrow = nrow(Q), ncol = 3,
               dimnames = list(NULL, c("fit", "se.fit", "df")))
for (i in 1:nrow(Q)) {
  pi <- pool.scalar(Q[i, ], U[i, ], n = dfcom + 1)
  pred[i, 1] <- pi[["qbar"]]
  pred[i, 2] <- sqrt(pi[["t"]])
  pred[i, 3] <- pi[["df"]]
}
head(pred)
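If interval estimates are also needed, one possible follow-up (my own sketch, not part of the original answer) is to combine the pooled fit, standard error, and degrees of freedom with a t quantile:
# approximate 95% intervals from the pooled predictions
alpha <- 0.05
lower <- pred[, "fit"] - qt(1 - alpha / 2, df = pred[, "df"]) * pred[, "se.fit"]
upper <- pred[, "fit"] + qt(1 - alpha / 2, df = pred[, "df"]) * pred[, "se.fit"]
head(cbind(pred, lower, upper))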

Feature importance for machine learning models in the caret package

I have a question regarding the feature importance function in the Caret package.
I have a dataset with both numeric and factor features.
I used the command below to get the feature importance of the model. It gives me the importance of each sub-feature (level) of the factor variables. However, I just want the importance of the feature itself, without going into detail for each level of the factor.
gbmImp <- caret::varImp(xgb1, scale = TRUE)
I will create some example data as we don't have any from your question:
library(caret)
# example data
df <- data.frame("x" = rnorm(100),
                 "fac" = as.factor(sample(c(rep("A", 30), rep("B", 35), rep("C", 35)))),
                 "y" = as.numeric((rpois(100, 4))))
# model
model <- train(y ~ ., method = "glm", data = df)
# feature importance
varImp(model, scale = TRUE)
This returns the feature importance that you do not want in your question:
# glm variable importance
#
# Overall
# facB 100.00
# facC 13.08
# x 0.00
You can convert the factor variables to numeric and do the same thing:
# make the factor variable numeric
trans_df <- transform(df, fac = as.numeric(fac))
# model
trans_model <- train(y ~ ., method = "glm", data = trans_df)
# feature importance
varImp(trans_model, scale = TRUE)
This returns the importance for the 'overall' feature:
# glm variable importance
#
# Overall
# x 100
# fac 0
However, I am not sure whether converting the factor variable with as.numeric() leads to a different feature importance than what varImp(trans_model, scale = TRUE) would otherwise report.
Also, check out this SO thread if you find that your specific factor/character variables are problematic when converting to numeric.
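Another option I can think of (my own sketch, not part of the original answer): keep the factor as a factor and aggregate the per-level importances back to the original predictors by matching the dummy column names to the predictor names. Whether summing scaled importances across levels is meaningful depends on the model, so treat this as a rough heuristic:
# sum the per-level importances for each original predictor
# assumes the dummy rows are named <factor name><level>, as in the output above
imp <- varImp(model, scale = TRUE)$importance # data frame with rownames facB, facC, x
predictors <- c("x", "fac")                   # original predictor names
group <- sapply(rownames(imp),
                function(rn) predictors[startsWith(rn, predictors)][1])
tapply(imp$Overall, group, sum)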

R logistic regression model.matrix

I am new to R and I am trying to understand the solution of a logistic regression. All that has been done so far is to remove unused variables and split the data into train and test datasets. I am trying to understand the part where model.matrix is used. I am just getting into R and statistics and I am not sure what model.matrix is or what contrasts are. Here is the code:
## create design matrix; indicators for categorical variables (factors)
Xdel <- model.matrix(delay~.,data=DataFD_new)[,-1]
xtrain <- Xdel[train,]
xnew <- Xdel[-train,]
ytrain <- del$delay[train]
ynew <- del$delay[-train]
m1=glm(delay~.,family=binomial,data=data.frame(delay=ytrain,xtrain))
summary(m1)
Can someone please explain what model.matrix is used for? Why can't we directly create dummy variables for the categorical variables and pass them to glm? I am confused about what model.matrix does.
Marius' comment explains how to do this - the below code just gives an example (which I felt was helpful since the poster was still confused).
# Create example dataset. 'catvar' represents a categorical variable despite being coded with numbers.
X = data.frame("catvar" = sample(c(1, 2, 3), 100, replace = T),
               "numvar" = rnorm(100),
               "y" = sample(c(0, 1), 100, replace = T))
# Check whether your categorical variables are coded correctly. (They'll say 'factor' if so)
sapply(X, class) #catvar is coded as 'numeric', which is wrong.
# Tell 'R' that catvar is categorical. If your categorical variables are already classed as factors, you can skip this step
X$catvar = factor(X$catvar)
sapply(X, class) # check all variables are coded correctly
# Fit model to dataframe (i.e. without needing to convert X to a model matrix)
fit = glm(y ~ numvar + catvar, data = X, family = "binomial")
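To show what model.matrix itself does with the same example data (a minimal illustration I am adding, not part of the original answer): it expands each factor into indicator (dummy) columns according to the factor's contrasts, which is exactly what glm does internally when given a formula and a data frame.
# Build the design matrix explicitly; catvar is expanded into catvar2 and catvar3,
# with level 1 absorbed into the intercept (treatment contrasts, the default)
mm <- model.matrix(y ~ numvar + catvar, data = X)
head(mm) # columns: (Intercept), numvar, catvar2, catvar3
# Fitting glm on the explicit matrix (minus the intercept column) gives the same coefficients as 'fit' above
fit_mm <- glm(X$y ~ mm[, -1], family = "binomial")
cbind(coef(fit), coef(fit_mm))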

caret dummy-vars exclude target

How can I use dummy vars in caret without destroying my target variable?
set.seed(5)
data <- ISLR::OJ
data<-na.omit(data)
dummies <- dummyVars( Purchase ~ ., data = data)
data2 <- predict(dummies, newdata = data)
split_factor = 0.5
n_samples = nrow(data2)
train_idx <- sample(seq_len(n_samples), size = floor(split_factor * n_samples))
train <- data2[train_idx, ]
test <- data2[-train_idx, ]
modelFit <- train(Purchase ~ ., method = 'lda', preProcess = c('scale', 'center'), data = train)
This will fail, as the Purchase variable is missing. If I instead convert it beforehand with data$Purchase <- ifelse(data$Purchase == "CH", 1, 0), caret complains that this is no longer a classification but a regression problem.
At least the example code seems to have a few issues indicated in the comments below. To answer your questions:
The result of ifelse is a numeric vector, not a factor, so the train function defaults to regression.
Passing the dummyVars output directly is done by using the train(x = ..., y = ..., ...) interface instead of a formula.
To avoid these problems, check the class of your objects carefully.
Be aware that the preProcess option in train() will apply the preprocessing to all numeric variables, including the dummies. Option 2 below avoids this by standardizing the data before calling train().
set.seed(5)
data <- ISLR::OJ
data<-na.omit(data)
# Make sure that all variables that should be a factor are defined as such
newFactorIndex <- c("StoreID","SpecialCH","SpecialMM","STORE")
data[, newFactorIndex] <- lapply(data[,newFactorIndex], factor)
library(caret)
# See help for dummyVars. The function does not take a dependent variable and predict will give an error
# I don't include the target variable here, so predicting dummies on new data will drop unknown columns
# including the target variable
dummies <- dummyVars(~., data = data[,-1])
# I don't change the data yet to apply standardization to the numeric variables,
# before turning the categorical variables into dummies
split_factor = 0.5
n_samples = nrow(data)
train_idx <- sample(seq_len(n_samples), size = floor(split_factor * n_samples))
# Option 1 (as asked): Specify independent and dependent variables separately
# Note that dummy variables will be standardized by preProcess as per the original code
# Turn the categorical variables into (unstandardized) dummies
# The output of predict is a matrix, change it to a data frame
data2 <- data.frame(predict(dummies, newdata = data))
modelFit <- train(y = data[train_idx, "Purchase"], x = data2[train_idx, ], method = 'lda', preProcess = c('scale', 'center'))
# Option 2: Append dependent variable to the independent variables (needs to be a data frame to allow factor and numeric)
# Note that I also shift the preprocessing away from train() to
# avoid standardizing the dummy variables
train <- data[train_idx, ]
test <- data[-train_idx, ]
preprocessor <- preProcess(train[!sapply(train, is.factor)], method = c('center',"scale"))
train <- predict(preprocessor, train)
test <- predict(preprocessor, test)
# Turn the categorical variables into (unstandardized) dummies
# The output of predict is a matrix, change it to data frame
train <- data.frame(predict(dummies, newdata = train))
test <- data.frame(predict(dummies, newdata = test))
# Reattach the target variable to the training data that has been
# dropped by predict(dummies,...)
train$Purchase <- data$Purchase[train_idx]
modelFit <- train(Purchase ~ ., data = train, method = 'lda')
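As a quick check (my own addition, not part of the original answer), the fitted model from Option 2 can be evaluated on the held-out test set, which was standardized and dummy-coded the same way:
# Predict classes on the test set and compare against the true labels
test_pred <- predict(modelFit, newdata = test)
confusionMatrix(test_pred, data$Purchase[-train_idx])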

How to automate storage of regression estimates into separate matrices in R

I am trying to run a multiple imputation sensitivity analysis. I have provided a simulated data below to demonstrate my problem.
I have already run a multiple imputation model by imputing predicted data to missing values. The next step I want to take is to automate a sensitivity analysis where I add to the predicted outcome a multiple of the standard error of the model (sigma). I show my code below. I have six (excluding zero) sensitivity factors for my sensitivity imputation: -3, -2, -1, 0, 1, 2, 3. Each factor is multiplied by the standard error of the model residual and added to the predicted values. Here is the code I used to modify the predicted values: + (delta) * summary(cc.m)$sigma. You need to comment this out for the loop to work at this time.
My question: how can I automate the process so that I have a separate matrix for each of my sensitivity analyses, keeping the code efficient? I am happy to clarify if this is not clear. I know I could do them separately or use an existing package; I don't want to take those steps. Below is my code.
Thanks!
# Generating data
score <- rnorm(20, 0) # my outcome variable
age <- rnorm(20, mean= 7.5) # Age
gender <- rbinom(20, 1, 0.5) # Gender
missing <- rbinom(20, 1, 0.3) # zero is observed
# Simulated data
dat <- as.data.frame(cbind(score, age, gender, missing))
# Generating missing data for score, my outcome
dat$score[dat$missing == 1] <- NA
# Generating number of imputations, keeping it small for now
B <- 10
# Generating sensitivity parameter
delta <- seq(-3,3,1)
# Generating empty matrix to store beta estimates
beta.mat <- matrix(NA, nrow = B, ncol = 3)
# Running an imputation loop
for (j in 1:B) {
  # complete cases model
  cc.m <- lm(score ~ age + gender, data = dat, subset = dat$missing == 0)
  ### Generating predicted values + I add a sensitivity parameter, delta.
  dat$score.hat <- predict(cc.m, newdata = dat) + (delta) * summary(cc.m)$sigma # Here I am modifying the predicted values by a sensitivity parameter.
  # Replacing predicted values by actual values for those that are observed
  dat$score.hat[dat$missing == 0] <- dat$score[dat$missing == 0]
  # Running the imputed model
  imp.m <- lm(score.hat ~ age + gender, data = dat)
  # Saving estimates in a matrix, but at this time this only saves values for one of the sensitivity parameters.
  beta.mat[j, ] <- c(summary(imp.m)$coef[, 1])
}
Here are two options:
Store your output in a list:
beta.mat <- list()
for (j in 1:B) {
  ...
  beta.mat[[j]] <- summary(imp.m)$coef
}
Use assign():
for (j in 1:B) {
  ...
  assign(paste0("beta.mat", j), summary(imp.m)$coef)
}
Figured it out finally. The trick is to save the separate matrices from each sensitivity simulation into an array. So we will have a [100,3,7] array. I believe this is memory efficient. But I am happy to hear what people think. I need to run this 10,000 times on 15 variables on a totally different data set. I show my code below.
# Generating data
score <- rnorm(20, 0) # My outcome variable
age <- rnorm(20, mean= 7.5) # Age
gender <- rbinom(20, 1, 0.5) # Gender
missing <- rbinom(20, 1, 0.3) # Zero is observed
# Simulated data
dat <- as.data.frame(cbind(score, age, gender, missing))
# Generating missing data for score, my outcome
dat$score[dat$missing == 1] <- NA
# Generating number of imputations
B <- 100
# Generating sensitivity parameter
delta <- seq(-3,3,1)
# Generating empty matrix to store beta estimates
beta.arr <- array(NA, dim = c(B, 3, length(delta)))
# Running an imputation loop
for (j in 1:B) {
  # Running a sensitivity loop
  for (i in 1:length(delta)) {
    # Complete cases model
    cc.m <- lm(score ~ age + gender, data = dat, subset = dat$missing == 0)
    # Generating predicted values + I add a delta sensitivity parameter
    dat$score.hat <- predict(cc.m, newdata = dat) + (delta[i]) * summary(cc.m)$sigma
    # Replacing predicted values by actual observations
    dat$score.hat[dat$missing == 0] <- dat$score[dat$missing == 0]
    # Running the imputed model
    imp.m <- lm(score.hat ~ age + gender, data = dat)
    # Saving estimates in the array
    beta.arr[j, , i] <- c(summary(imp.m)$coef[, 1])
  }
}
# Then estimation becomes easier
# Here I save the summary of sensitivity estimates in a matrix.
est.mat <- matrix(NA, nrow = length(delta), ncol = 3)
for (i in 1:length(delta)) {
  est.mat[i, ] <- apply(beta.arr[, , i], 2, mean)
}
# snapshot of summary estimates matrix for each variable, and the intercept
head(est.mat)
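As a small aside (my own sketch, not part of the original post), the summary loop can be collapsed into a single apply() call over the coefficient and delta dimensions, which yields the same values:
# mean over the B imputations for each coefficient and each delta
est.mat2 <- t(apply(beta.arr, c(2, 3), mean)) # rows = delta values, columns = coefficients
all.equal(est.mat, est.mat2, check.attributes = FALSE)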
