I am trying to upsample an imbalanced dataset in R using the upSample function from the caret package. However, after applying the function, the target variable C_flag appears to be removed from the dataset entirely. Here is my code:
set.seed(100)
'%ni%' <- Negate('%in%')
up_train <- upSample(x = train[, colnames(train) %ni% "C_flag"], #all predictor variables
y = train$C_flag) #target variable
Here is the count of each C_flag category in the training set:
0 = 100193, 1 = 29651.
I test to see if C_flag is there with this result:
print(up_train$C_flag)
NULL
Does anyone know why this function is removing this variable instead of upsampling?
The first thing that comes to mind is whether C_flag is a factor or not. Anyway, I tried this sample dataset:
library(tidyverse)
library(caret)
train <- data.frame(x1 = c(2,3,4,2,3,3,3,8),
x2 = c(1,2,1,2,4,1,1,4),
C_flag = c("A","B","B","A","A","A","A","A"))
train$C_flag <- as.factor(train$C_flag)
'%ni%' <- Negate('%in%')
up_train <- upSample(x = train[,colnames(train) %ni% "C_flag"],
y = train$C_flag)
up_train$C_flag
It returned NULL. Why? Because upSample names the target column "Class" by default. If you want the target to keep the name C_flag, pass that name through the yname argument:
up_train <- upSample(x = train[,colnames(train) %ni% "C_flag"],
y = train$C_flag,
yname = "C_flag")
print(up_train$C_flag)
[1] A A A A A A B B B B B B
Levels: A B
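To make the renaming visible, here is a small sketch on the same toy data that inspects the column names upSample returns with and without yname. The object names predictors, up_default, and up_named are only illustrative.

```r
library(caret)

# Toy data from the answer above: 6 rows of class "A", 2 rows of class "B"
train <- data.frame(x1 = c(2, 3, 4, 2, 3, 3, 3, 8),
                    x2 = c(1, 2, 1, 2, 4, 1, 1, 4),
                    C_flag = factor(c("A", "B", "B", "A", "A", "A", "A", "A")))

# All columns except the target (illustrative helper object)
predictors <- train[, setdiff(names(train), "C_flag")]

# Default call: the outcome comes back under the name "Class"
up_default <- upSample(x = predictors, y = train$C_flag)
names(up_default)
table(up_default$Class)

# With yname the outcome keeps its original name
up_named <- upSample(x = predictors, y = train$C_flag, yname = "C_flag")
table(up_named$C_flag)
```

Checking names() on the result first is usually the quickest way to tell a renamed column from a dropped one.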
I want to be able to input the variable name that I'll be using in the "weights" option in the lmer function. So then I can change the dataset, and cycle through the "weights" and pull the correct variable.
I want to pull the correct column for weights within the for loop.
So for y, the equation would be:
lmer(y~x+(1|study), weights = weight.var)
And y1:
lmer(y1~x+(1|study),weights = weight.var1)
So I named the weighting variables (weight.opt), then want to use them in the formula within the for loop. I can use "as.formula" to get the formula working and connected to the dataset, but I'm not sure how to do something similar with the weights.
x <- rnorm(300,0,1)
y <- x*rnorm(300,2,0.5)
y1 <- x*rnorm(300,0.1,0.1)
study <- rep(c("a","b","c"),each = 100)
weight.var <- rep(c(0.5,2,4),each = 100)
weight.var1 <- rep(c(0.1,.2,.15),each = 100)
library(lme4)
dataset <- data.frame(x,y,y1,study,weight.var,weight.var1)
resp1 <- c("y","y1")
weight.opt <- c("weight.var","weight.var1")
for(i in 1:2){
lmer(as.formula(paste(resp1[i],"~x+(1|study)")),weights = weight.opt[i],data = dataset)
}
Pass the weights column itself rather than its name as a string. This seems to work fine:
res_list <- list()
for(i in 1:2){
res_list[[i]] <- lmer(as.formula(paste(resp1[i],"~x+(1|study)")),
weights = dataset[[weight.opt[i]]],data = dataset)
}
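The key step is that dataset[[weight.opt[i]]] looks the column up by name and hands over a numeric vector, whereas weight.opt[i] on its own is just a character string. The sketch below shows the same lookup pattern with base lm() as a stand-in for lmer() so that it runs without lme4; the random effect is dropped for that reason, and the list name fits is only illustrative.

```r
set.seed(1)
x <- rnorm(300, 0, 1)
dataset <- data.frame(
  x = x,
  y = x * rnorm(300, 2, 0.5),
  y1 = x * rnorm(300, 0.1, 0.1),
  weight.var = rep(c(0.5, 2, 4), each = 100),
  weight.var1 = rep(c(0.1, 0.2, 0.15), each = 100)
)
resp1 <- c("y", "y1")
weight.opt <- c("weight.var", "weight.var1")

is.character(weight.opt[1])           # the name is a string
is.numeric(dataset[[weight.opt[1]]])  # the looked-up column is numeric

fits <- list()
for (i in seq_along(resp1)) {
  # Pull the weights column by name before fitting
  w <- dataset[[weight.opt[i]]]
  # lm() is a stand-in here; the original question uses lmer()
  fits[[resp1[i]]] <- lm(as.formula(paste(resp1[i], "~ x")),
                         data = dataset, weights = w)
}

# Each fit carries the weights it was given
head(weights(fits[["y1"]]))
```

Naming the list elements after the response makes it easier to pull out the right model later than indexing by position.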
I'm running the classification method Bagging Tree (Bootstrap Aggregation) and comparing its misclassification error rate with that of a single tree.
This is strange to me: the function estim.pred returns factors that map to "pos" and "neg", but res.boot$t is a matrix of integers taking the values 1 or 2, even though estim.pred is the statistic that produces res.boot$t.
Could you please explain the reason for this phenomenon?
library(rpart)
library(boot)
library(mlbench)
data(PimaIndiansDiabetes)
n <- 768
ntrain <- 468
ntest <- 300
B <- 100
M <- 100
train.error <- vector(length = M)
test.error <- vector(length = M)
bagging.error <- vector(length = M)
estim.pred <- function(a.sample, vector.of.indices)
{
current.train <- a.sample[vector.of.indices, ]
current.fitted.model <- rpart(diabetes ~ ., data = current.train, method = "class")
predict(current.fitted.model, test.set, type = "class")
}
fitted.tree <- rpart(diabetes ~ ., data = train.set, method = "class")
pred.train <- predict(fitted.tree, train.set, type = "class")
res.boot = boot(train.set, estim.pred, B)
head(pred.train)
head(res.boot$t)
Here is @Roland's comment. I am posting it here to remove my question from the unanswered list.
res.boot$t is a matrix. A matrix cannot contain a factor variable. Thus, the matrix contains the underlying integer values. Transpose the matrix, turn it into a data.frame and turn the integers into factor variables with your levels.
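A minimal sketch of that conversion, using a small hand-made integer matrix in place of res.boot$t. The matrix boot_t and the level order c("neg", "pos") are assumptions for illustration; with the real data the levels would come from levels(train.set$diabetes).

```r
# Assumed level order of the outcome factor
lev <- c("neg", "pos")

# Assigning a factor into a matrix row keeps only its integer codes
m <- matrix(NA, nrow = 1, ncol = 3)
m[1, ] <- factor(c("neg", "pos", "pos"), levels = lev)
m

# Hypothetical stand-in for res.boot$t:
# one row per bootstrap replicate, one column per test observation
boot_t <- matrix(c(1L, 2L, 2L,
                   2L, 2L, 1L), nrow = 2, byrow = TRUE)

# Transpose so each column is one replicate, then restore the labels
pred_df <- as.data.frame(t(boot_t))
pred_df[] <- lapply(pred_df, function(codes) {
  factor(codes, levels = seq_along(lev), labels = lev)
})
str(pred_df)
```

After this step each column of pred_df is a factor of predictions for one bootstrap replicate, which can be compared directly with the true labels.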
For some reason, my model is not running. I created a model matrix to run a simple model with the neuralnet package. I know it can be challenging to debug other people's code, especially without the data, but in case you think you can help, here is the code:
library(tidyverse)
library(neuralnet)
#Activity 1 Load Data
featchannels <-read.csv("features_channel.csv")
trainTargets <-read.table("traintargets.txt")
#Activity 2 Normalize every column of the features dataset using min-max normalization to range [0-1].
normalized <- function(x) {
return((x-min(x)) /(max(x) -min(x)))
}
featchannels <- normalized(featchannels)
#Activity 3 Add a target feature named response to the features dataset with 0-1 values read from trainTargets.txt, with 1 indicating P300 response and 0 otherwise.
colnames(trainTargets)[1] <- "State"
featchannels <- cbind(featchannels, trainTargets)
# Changing rows to P300 and others.
featchannels <- within(featchannels, State <- factor(State, labels =
c("Other", "P300")))
featchannels$State <- as.factor(featchannels$State)
#4. Take the first 3840 rows of the dataset as the training data set, and the remaining 960 rows as the testing data set.
training <- featchannels[1:3840,]
testing <- featchannels[3841:4800,]
#Activity 6
#Creating model matrix before running the model
df_comb_training <- training
y <- model.matrix(~ df_comb_training$State + 0, data = df_comb_training[,
c('State'), drop=FALSE])
# fix up names for as.formula
y_feats <- gsub("^[^ ]+\\$", "", colnames(y))
colnames(y) <- y_feats
df_comb_training <- df_comb_training[, !(colnames(df_comb_training) ==
"State")]
feats <- colnames(df_comb_training)
df_comb_training <- cbind(y, df_comb_training)
# Concatenate strings
f <- paste(feats, collapse=' + ')
y_f <- paste(y_feats, collapse=' + ')
f <- paste(y_f, '~', f)
# Convert to formula
f <- as.formula(f)
model_h5 <- neuralnet(f, df_comb_training, stepmax = 1e+08, hidden = 5)
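One detail in the code above that may be worth checking, whether or not it is the reason the model does not run: calling normalized() on the whole data frame uses a single global minimum and maximum, while Activity 2 asks for every column to be scaled separately. The sketch below contrasts the two on a made-up data frame (toy, whole, and per_col are illustrative names, since the original CSV is not available).

```r
normalized <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Made-up features with very different column ranges
toy <- data.frame(a = c(1, 2, 3), b = c(10, 20, 40))

# Whole data frame at once: one global min and max for all columns
whole <- normalized(toy)
sapply(whole, range)

# Column by column: each column gets its own min and max
per_col <- toy
per_col[] <- lapply(toy, normalized)
sapply(per_col, range)
```

With the column-wise version every feature spans [0, 1], which is usually what min-max scaling before a neural network is meant to achieve.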
While using regsubsets from the leaps package on data with linear dependencies, I found that the results given by coef() and by summary()$which differ. It seems that when linear dependencies are found, the reordering changes the positions of the coefficients and coef() returns the wrong values.
I use mtcars just to "simulate" the problem I had with other data. In the first example there are no linear dependencies, the best model by BIC is mpg~wt+cyl, and coef() and summary()$which give the same result. In the second example I add a dummy variable, so perfect multicollinearity is possible, but with the variables in this order (dummy in the last column) the problem does not occur. In the last example, after changing the order of the variables in the dataset, the problem finally appears and coef() and summary()$which give different models. Is there anything incorrect in this approach? Is there any other way to get coefficients from regsubsets?
require("leaps") #install.packages("leaps")
###Example1
dta <- mtcars[,c("mpg","cyl","am","wt","hp") ]
bestSubset.cars <- regsubsets(mpg~., data=dta)
(best.sum <- summary(bestSubset.cars))
#
w <- which.min(best.sum$bic)
best.sum$which[w,]
#
best.sum$outmat
coef(bestSubset.cars, w)
#
###Example2
dta2 <- cbind(dta, manual=as.numeric(!dta$am))
bestSubset.cars2 <- regsubsets(mpg~., data=dta2)
(best.sum2 <- summary(bestSubset.cars2))
#
w <- which.min(best.sum2$bic)
best.sum2$which[w,]
#
coef(bestSubset.cars2, w)
#
###Example3
bestSubset.cars3 <- regsubsets(mpg~., data=dta2[,c("mpg","manual","am","cyl","wt","hp")])
(best.sum3 <- summary(bestSubset.cars3))
#
w <- which.min(best.sum3$bic)
best.sum3$which[w,]
#
coef(bestSubset.cars3, w)
#
best.sum2$which
coef(bestSubset.cars2,1:4)
best.sum3$which
coef(bestSubset.cars3,1:4)
summary.regsubsets and regsubsets order the variables differently. The coef() method for regsubsets objects combines the two, so the results get mixed up if you use force.in or a formula with a fixed variable order. Changing a few lines in the coef() method might help. Try the code below and see if it works:
coef.regsubsets <- function (object, id, vcov = FALSE, ...)
{
s <- summary(object)
invars <- s$which[id, , drop = FALSE]
betas <- vector("list", length(id))
for (i in 1:length(id)) {
# added
var.name <- names(which(invars[i, ]))
thismodel <- which(object$xnames %in% var.name)
names(thismodel) <- var.name
# deleted
#thismodel <- which(invars[i, ])
qr <- .Fortran("REORDR", np = as.integer(object$np),
nrbar = as.integer(object$nrbar), vorder = as.integer(object$vorder),
d = as.double(object$d), rbar = as.double(object$rbar),
thetab = as.double(object$thetab), rss = as.double(object$rss),
tol = as.double(object$tol), list = as.integer(thismodel),
n = as.integer(length(thismodel)), pos1 = 1L, ier = integer(1))
beta <- .Fortran("REGCF", np = as.integer(qr$np), nrbar = as.integer(qr$nrbar),
d = as.double(qr$d), rbar = as.double(qr$rbar), thetab = as.double(qr$thetab),
tol = as.double(qr$tol), beta = numeric(length(thismodel)),
nreq = as.integer(length(thismodel)), ier = numeric(1))$beta
names(beta) <- object$xnames[qr$vorder[1:qr$n]]
reorder <- order(qr$vorder[1:qr$n])
beta <- beta[reorder]
if (vcov) {
p <- length(thismodel)
R <- diag(qr$np)
R[row(R) > col(R)] <- qr$rbar
R <- t(R)
R <- sqrt(qr$d) * R
R <- R[1:p, 1:p, drop = FALSE]
R <- chol2inv(R)
dimnames(R) <- list(object$xnames[qr$vorder[1:p]],
object$xnames[qr$vorder[1:p]])
V <- R * s$rss[id[i]]/(object$nn - p)
V <- V[reorder, reorder]
attr(beta, "vcov") <- V
}
betas[[i]] <- beta
}
if (length(id) == 1)
beta
else betas
}
Another solution that works for me is to randomize the order of the columns (independent variables) in your dataset before running regsubsets. The idea is that after reordering, the highly correlated columns will hopefully be far apart from each other and will not trigger the reordering behavior in the regsubsets algorithm.
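As a cross-check that does not rely on coef() at all, one option is to read the selected variables from summary()$which and refit that model with lm(). This is only a sketch on the first mtcars example, not something taken from the answers above; it assumes all predictors are numeric, so that the column names of $which match the variable names, and the names vars and refit are illustrative.

```r
library(leaps)

dta <- mtcars[, c("mpg", "cyl", "am", "wt", "hp")]
bestSubset.cars <- regsubsets(mpg ~ ., data = dta)
best.sum <- summary(bestSubset.cars)

# Index of the model with the lowest BIC
w <- which.min(best.sum$bic)

# Variables flagged TRUE for that model, without the intercept
vars <- setdiff(names(which(best.sum$which[w, ])), "(Intercept)")

# Refit the selected model directly and read its coefficients
refit <- lm(reformulate(vars, response = "mpg"), data = dta)
coef(refit)
```

If coef(bestSubset.cars, w) and coef(refit) disagree on which variables are in the model, that points to the reordering issue described in the question.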
Suppose my data frame DF has two columns, $A and $B. $A is always present. $B is sometimes coded NaN when the value is missing. I want to predict $B.predicted, the missing values for $B, and create a new column $B.complete such that $B.complete[i] is $B.predicted if $B[i] is NaN and is $B[i] otherwise.
I use multinom, which requires a factor as the dependent variable, to predict B from the rows where I have a full observation, using:
DF$B.factor <- factor(DF$B)
model.results <- multinom(formula=B.factor ~ A,
data=DF[!is.na(DF$B),])
B.predicted <- predict(model.results, newdata=DF, type="class")
The variable B.predicted is a factor.
My DF$B column is not a factor.
My question is: how do I merge DF$B and B.predicted to create B.complete? In particular, since B.predicted is a factor and DF$B is not, does this code pick up the correct values?
B.complete <- ifelse(is.na(DF$B), B.predicted, DF$B)
Use replace():
library(nnet)
set.seed(1)
DF <- data.frame(A = factor(sample(letters[1:5],30, TRUE)),
B = sample(c(letters[1:3],NA), 30 , TRUE, prob = rep(c(0.3,0.1),c(3,1))),
stringsAsFactors = FALSE)
DF$B.factor <- factor(DF$B)
# no need to include is.na(DF$B) as multinom will omit anyway
model <- multinom(B.factor ~ A, data = DF)
# use replace to replace the NA values (converting to character when necessary)
DF$B.complete <- replace(DF$B, is.na(DF$B), as.character(predict(model, newdata = DF[is.na(DF$B),])))
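To see why the factor needs converting first, the sketch below compares ifelse() with replace() on a tiny hand-made example. The vectors B and B.predicted here are made up for illustration and stand in for DF$B and the multinom predictions, so no model fit is needed.

```r
# Observed values with one missing entry (stand-in for DF$B)
B <- c("a", NA, "c")

# Predicted classes as a factor (stand-in for the multinom output)
B.predicted <- factor(c("a", "b", "c"), levels = c("a", "b", "c"))

# ifelse() takes the factor's internal integer code, not its label
bad <- ifelse(is.na(B), B.predicted, B)
bad

# replace() with an explicit as.character() keeps the labels
good <- replace(B, is.na(B), as.character(B.predicted[is.na(B)]))
good
```

So the ifelse() line from the question would only be safe after wrapping B.predicted in as.character() (and converting further if DF$B is numeric).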