I have a dataset of 25 variables and 248 rows.
There are 8-factor variables and the rest are integers and numbers.
I am trying to run XGBoost.
I have done the following code: -
# Partition Data
set.seed(1234)
ind <- sample(2, nrow(mission), replace = T, prob = c(0.7,0.3))
train <- mission[ind == 1,]
test <- mission[ind == 2,]
# Create matrix - One-Hot Encoding for Factor variables
trainm <- sparse.model.matrix(GRL ~ .-1, data = train)
head(trainm)
train_label <- train[,"GRL"]
train_matrix <- xgb.DMatrix(data = as.matrix(trainm), label = train_label)
testm <- sparse.model.matrix(GRL~.-1, data = test)
test_label <- test[,"GRL"]
test_matrix <- xgb.DMatrix(data = as.matrix(testm),label = test_label)
The response variable here is "GRL" and I am running the test_label <- test[,"GRL"]
The above code is getting executed but when I am trying to use it in xgb.DMatrix, I am encountering the following error:
Error in setinfo.xgb.DMatrix(dmat, names(p), p[[1]]) :
The length of labels must equal to the number of rows in the input data
I have partitioned the data into 70:30.
test[,"GRL"] returns a data.frame, and XGBoost needs the label to be a vector.
Just use teste$GRL or test[["GRL"]] instead. You also need to do the same for the training dataset
I have been stuck for hours trying to run XGboost with R. I have a training data and test data containing around 40 columns and the last column is the target column. It is a 0,1 nominal value. I am running this code which I got from https://www.kaggle.com/michaelpawlus/xgboost-example-0-76178/code.
require(xgboost)
library(xgboost)
train <- read.csv(file.choose(),header = T)
test <- read.csv(file.choose(),header = T)
feature.names <- names(train)[2:ncol(train)-1]
clf <- xgboost(data = data.matrix(train[,feature.names]),
label = train$target,
nrounds = 100, # 100 is better than 200
objective = "binary:logistic",
eval_metric = "auc")
cat("making predictions in batches due to 8GB memory limitation\n")
submission <- data.frame(ID=test$ID)
submission$target1 <- NA
for (rows in test) {
submission[rows, "Succeed"] <- predict(clf, data.matrix(test[rows,feature.names]))
}
varimp_clf <- xgb.importance(feature_names=feature.names,model=clf)
xgb.plot.importance(varimp_clf)
This is the errors I am getting
Error in xgb.get.DMatrix(data, label, missing, weight) :
xgboost: need label when data is a matrix
Error in $<-.data.frame(*tmp*, target1, value = NA) :
replacement has 1 row, data has 0
Error in predict(clf, data.matrix(test[rows, feature.names])) :
object 'clf' not found
Check your input data. Is your last column named target? It sounds like it isn't.
I would like to implement the stepwise linear regression in a function. Example of my code is given bellow.
test_function <- function(formula, dataset, k, repeats) {
# k is a number of folds used for cross-validation (CV)
# repeats represents the number of repeats of CV
for (m in 1:repeats){
# randomly shuffle dataset
dataset <- dataset[sample(nrow(dataset)), ]
#Create k equally size folds
folds <- cut(seq(1, nrow(dataset)), breaks = k, labels = FALSE)
for (j in 1:k){
#Segement my data by fold using the which() function
testIndexes <- which(folds == j, arr.ind = TRUE)
test <- dataset[testIndexes, ]
train <- dataset[-testIndexes, ]
MLR <- stats::step(lm(formula, data = train), direction = "both", trace = TRUE)
test_predicted <- predict(MLR, test)
}
}
}
# Example
# install.packages("dendroTools") # to get data used in the example below
library(dendroTools)
data("example_dataset_1") # load the data used in the example bellow
test_function(formula = MVA~., dataset = example_dataset_1, k = 10, repeats = 2)
And it works perfectly. But I would like to implement this function in my R package. And when I run RCMD check for this example. I get the following error:
... 166 lines ...
- T_aug_sep 1 0.2973 8.0723 -92.866
<none> 7.7750 -92.817
- T_APR 1 4.2533 12.0283 -72.127
Step: AIC=-92.87
MVA ~ T_APR
Error in eval(predvars, data, env) :
invalid 'envir' argument of type 'closure'
Calls: test_function ... eval -> eval -> <Anonymous> -> model.frame.default -> eval
Execution halted
It looks like, the problem occurs, when there is a non-significant variable.
I have a training_predictors set with 56 columns, all of which are numeric. training_labels is a factor vector of 0 and 1.
I am using following list as subset sizes to be tested.
subset_sizes <- c(1:5, 10, 15, 20, 25)
Following is the list of modified rfFuncs functions.
rfRFE <- list(summary = defaultSummary,
fit = function(x, y, first, last, ...) {
library(randomForest)
randomForest(x, y, importance = first, ...)
},
pred = function(object, x) predict(object, x),
rank = function(object, x, y) {
vimp <- varImp(object)
vimp <- vimp[order(vimp$Overall, decreasing = TRUE),,drop = FALSE]
vimp$var <- rownames(vimp)
vimp
},
selectSize = pickSizeBest,
selectVar = pickVars)
I have declared the control function as:
rfeCtrl <- rfeControl(functions = rfRFE,
method = "cv",
number = 10,
verbose = TRUE)
But when I run rfe function as shown below,
rfProfile <- rfe(training_predictors,
training_labels,
sizes = subset_sizes,
rfeControl = rfeCtrl)
I am getting an error as :
Error in { : task 1 failed - "argument 1 is not a vector"
I also tried changing the vector subset_sizes, but still no luck. What am I doing wrong?
Update : I tried to run these steps one by one and the problem seems to be with the rank function. But I am still unable to figure out the problem.
Update: I found out the problem. varImp in rank function is not containing $Overall. But it contains columns with names 0 and 1. Why is it so? What does 0 and 1 signify (both column values are exactly same, by the way)? Also, how can I make varImp to return $Overall column? [as a temporary solution, I am creating a new column $Overall and attaching it to vimp in rank function.]
Using 0 and 1 as factor levels is problematic since those are not valid R column names. In your other SO post you probably would have received a message about using these as factor levels for your output.
Try using a factor outcome with some more informative levels that can be translated into valid R column names (for class probabilities).
I have found a solution for this same issue to fit a logistic regression model in rfe using caret. The solution as below:
glmFuncs$rank <-function (object, x, y){
vimp <- varImp(object, scale = FALSE)
loadNamespace("dplyr")
vimp <- vimp$importance %>%
mutate(var=row.names(.)) %>%
arrange(-Overall)
vimp <- vimp[order(vimp$Overall, decreasing = TRUE), ,drop = FALSE]
vimp
}
Often, I want to run a cross validation on a dataset which contains some factor variables and after running for a while, the cross validation routine fails with the error: factor x has new levels Y.
For example, using package boot:
library(boot)
d <- data.frame(x=c('A', 'A', 'B', 'B', 'C', 'C'), y=c(1, 2, 3, 4, 5, 6))
m <- glm(y ~ x, data=d)
m.cv <- cv.glm(d, m, K=2) # Sometimes succeeds
m.cv <- cv.glm(d, m, K=2)
# Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
# factor x has new levels B
Update: This is a toy example. The same problem occurs with larger datasets as well, where there are several occurrences of level C but none of them is present in the training partition.
The function createDataPartition function from the package caret does stratified sampling for the outcome variables and correctly warns:
Also, for ‘createDataPartition’, very small class sizes (<= 3) the classes may not show up in both the training and test data.
There are two solutions which spring to mind:
First, create a subset of the data by selecting one random sample of each factor level first, starting from the rarest class (by frequency) and then greedily satisfying the next rare class and so on. Then using createDataPartition on the rest of the dataset and merging the results to create a new train dataset which contains all levels.
Using createDataPartitions and and doing rejection sampling.
So far, option 2 has worked for me because of the data sizes, but I cannot help but think that there must be a better solution than a hand rolled out one.
Ideally, I would want a solution which just works for creating partitions and fails early if there is no way to create such partitions.
Is there a fundamental theoretical reason why packages do not offer this? Do they offer it and I just haven't been able to spot them because of a blind spot? Is there a better way of doing this stratified sampling?
Please leave a comment if I should ask this question on stats.stackoverflow.com.
Update:
This is what my hand rolled out solution (2) looks like:
get.cv.idx <- function(train.data, folds, factor.cols = NA) {
if (is.na(factor.cols)) {
all.cols <- colnames(train.data)
factor.cols <- all.cols[laply(llply(train.data[1, ], class), function (x) 'factor' %in% x)]
}
n <- nrow(train.data)
test.n <- floor(1 / folds * n)
cond.met <- FALSE
n.tries <- 0
while (!cond.met) {
n.tries <- n.tries + 1
test.idx <- sample(nrow(train.data), test.n)
train.idx <- setdiff(1:nrow(train.data), test.idx)
cond.met <- TRUE
for(factor.col in factor.cols) {
train.levels <- train.data[ train.idx, factor.col ]
test.levels <- train.data[ test.idx , factor.col ]
if (length(unique(train.levels)) < length(unique(test.levels))) {
cat('Factor level: ', factor.col, ' violated constraint, retrying.\n')
cond.met <- FALSE
}
}
}
cat('Done in ', n.tries, ' trie(s).\n')
list( train.idx = train.idx
, test.idx = test.idx
)
}
Everyone agrees that there sure is an optimal solution. But personally, I would just try the cv.glm call until it works usingwhile.
m.cv<- try(cv.glm(d, m, K=2)) #First try
class(m.cv) #Sometimes error, sometimes list
while ( inherits(m.cv, "try-error") ) {
m.cv<- try(cv.glm(d, m, K=2))
}
class(m.cv) #always list
I've tried it with 100,000 rows in the data.fame and it only takes a few seconds.
library(boot)
n <-100000
d <- data.frame(x=c(rep('A',n), rep('B', n), 'C', 'C'), y=1:(n*2+2))
m <- glm(y ~ x, data=d)
m.cv<- try(cv.glm(d, m, K=2))
class(m.cv) #Sometimes error, sometimes list
while ( inherits(m.cv, "try-error") ) {
m.cv<- try(cv.glm(d, m, K=2))
}
class(m.cv) #always list
When I call traceback I get this:
> traceback()
9: stop(sprintf(ngettext(length(m), "factor %s has new level %s",
"factor %s has new levels %s"), nm, paste(nxl[m], collapse = ", ")),
domain = NA)
8: model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels)
7: model.frame(Terms, newdata, na.action = na.action, xlev = object$xlevels)
6: predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type ==
"link", "response", type), terms = terms, na.action = na.action)
5: predict.glm(d.glm, data[j.out, , drop = FALSE], type = "response")
4: predict(d.glm, data[j.out, , drop = FALSE], type = "response")
3: mean((y - yhat)^2)
2: cost(glm.y[j.out], predict(d.glm, data[j.out, , drop = FALSE],
type = "response"))
1: cv.glm(d, m, K = 2)
And looking at the cv.glm function gives:
> cv.glm
function (data, glmfit, cost = function(y, yhat) mean((y - yhat)^2),
K = n)
{
call <- match.call()
if (!exists(".Random.seed", envir = .GlobalEnv, inherits = FALSE))
runif(1)
seed <- get(".Random.seed", envir = .GlobalEnv, inherits = FALSE)
n <- nrow(data)
out <- NULL
if ((K > n) || (K <= 1))
stop("'K' outside allowable range")
K.o <- K
K <- round(K)
kvals <- unique(round(n/(1L:floor(n/2))))
temp <- abs(kvals - K)
if (!any(temp == 0))
K <- kvals[temp == min(temp)][1L]
if (K != K.o)
warning(gettextf("'K' has been set to %f", K), domain = NA)
f <- ceiling(n/K)
s <- sample0(rep(1L:K, f), n)
n.s <- table(s)
glm.y <- glmfit$y
cost.0 <- cost(glm.y, fitted(glmfit))
ms <- max(s)
CV <- 0
Call <- glmfit$call
for (i in seq_len(ms)) {
j.out <- seq_len(n)[(s == i)]
j.in <- seq_len(n)[(s != i)]
Call$data <- data[j.in, , drop = FALSE]
d.glm <- eval.parent(Call)
p.alpha <- n.s[i]/n
cost.i <- cost(glm.y[j.out], predict(d.glm, data[j.out,
, drop = FALSE], type = "response"))
CV <- CV + p.alpha * cost.i
cost.0 <- cost.0 - p.alpha * cost(glm.y, predict(d.glm,
data, type = "response"))
}
list(call = call, K = K, delta = as.numeric(c(CV, CV + cost.0)),
seed = seed)
}
It seems the problem has to do with your extremely small sample size and categorical effect (with values "A", "B", and "C"). You are fitting a glm with 2 effects: "B:A" and "C:A". In each CV iteration you bootstrap from the sample dataset and fit a new model d.glm. Given the size, the bootstrapped data are guaranteed to come up with 1 or more iteration in which the value "C" is not sampled, hence the error comes from calculating fitted probabilities from the bootstrap model from the training data in which validation data has a "C" level for x not observed in the training data.
Frank Harrell (often on stats.stackexchange.com) wrote in Regression Modelling Strategies that one ought to favor against split sample validation when sample size is small and/or some cell counts are small in categorical data analysis. Singularity (as you are seeing here) is one of many reasons why I think this is true.
Given the small sample size here, you should consider some split sample cross validation alternatives like a permutation test, or a parametric bootstrap. Another important consideration is exactly why you feel model based inference isn't correct. As Tukey said of the bootstrap, he'd like to call it a shotgun. It will blow the head off of any problem, as long as you're willing to reassemble the pieces.
There don't seem to be many simple solutions around the web so here's one I worked out that should be easy to generalize to as many factors as you need. It uses pre-installed packages and Caret but you could get away with just base R if you really wanted.
To use cross-validation when you have multiple factors follow a two-step process. Convert the factors to numerics and then multiply them together. Use this new variable as the target variable in a stratified sampling function. Be sure to remove it or keep it out of your training set after creating your folds.
If y is your DV and x is a factor then:
#Simulated factors (which are conveniently distributed for the example)
dataset <-data.frame(x=as.factor(rep(c(1,10),1000)),y=as.factor(rep(c(1,2,3,4),250)[sample(1000)]))
#Convert the factors to numerics and multiply together in new variable
dataset$cv.variable <-as.numeric(levels(dataset$x))[dataset$x]*as.numeric(levels(dataset$y))[dataset$y]
prop.table(table(dataset$y)) #One way to view distribution of levels
ftable(dataset$x,dataset$y) #A full table of all x and y combinations
folds <- caret::createFolds(dataset$cv.variable,k=10)
testIndexes <- folds[[k]]
testData <- as.data.frame(dataset[testIndexes, ])
trainData <- as.data.frame(dataset[-testIndexes, ])
prop.table(table(testData$y))
ftable(testData$x,testData$y) #evaluate distribution
which should produce a result that is close to balanced.
Note: In real life, if your sample lacks the requisite unique combinations of factors then your problem is harder overcome and might be impossible. You can either drop some levels from consideration before creating folds or employ some kind of over-sampling.