R Catboost to handle categorical variables - r

I have a question about CatBoost: do I need to preprocess the categorical variables before modeling?
I have 86 variables, including 1 target variable. Of the 85 predictors, 2 are numeric and 83 are categorical (factor type). The target variable is a binary factor, 1 or 0.
Column 1 and Columns 4 to 85 are factors.
Column 2 and 3 are numeric.
I am a little confused about cat_features in catboost.train(). I can set a vector of categorical features in the parameters, and I can also set it in catboost.load_pool().
library(catboost)
library(dplyr)
X_train <- train %>% select(-Target)
y_train <- (as.numeric(unlist(train[c('Target')])) - 1)
X_valid <- test %>% select(-Target)
y_valid <- (as.numeric(unlist(test[c('Target')])) - 1)
train_pool <- catboost.load_pool(data = X_train, label = y_train, cat_features = c(0,3:84))
test_pool <- catboost.load_pool(data = X_valid, label = y_valid, cat_features = c(0,3:84))
params <- list(iterations=500,
learning_rate=0.01,
depth=10,
loss_function='RMSE',
eval_metric='RMSE',
random_seed = 1,
od_type='Iter',
metric_period = 50,
od_wait=20,
use_best_model=TRUE,
cat_features = c(0,3:84))
catboost.train(train_pool, test_pool, params = params)
However, after I ran the code above, I got an error:
Error in catboost.train(train_pool, test_pool, params = params) :
catboost/libs/options/plain_options_helper.cpp:339: Unknown option {cat_features} with value "[0,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84]"
Any help?

Look at this example: cat_features should not go in the params list, only in catboost.load_pool().
library(catboost)
countries = c('RUS','USA','SUI')
years = c(1900,1896,1896)
phone_codes = c(7,1,41)
domains = c('ru','us','ch')
dataset = data.frame(countries, years, phone_codes, domains, stringsAsFactors = T)
dplyr::glimpse(dataset)
label_values = c(0,1,1)
fit_params <- list(iterations = 100,
loss_function = 'Logloss',
ignored_features = c(4,9),
border_count = 32,
depth = 5,
learning_rate = 0.03,
l2_leaf_reg = 3.5)
pool = catboost.load_pool(dataset, label = label_values, cat_features = c(0,3))
model <- catboost.train(pool, params = fit_params)
model

I haven't tried CatBoost in R, but see the example on this page:
https://catboost.ai/docs/concepts/r-reference_catboost-train.html
It appears you only pass the categorical variables in the load_pool() call, and NOT in the train() call.
(This works differently from the Python API, where cat_features is passed in the Python fit() call.)
A suggestion: group all the categorical variables in the leftmost columns. That way the index vector is simpler to create. I also have a check in my code to make sure I did it right...
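A minimal sketch of such a check, in base R only so that it runs without CatBoost installed. It derives the indices of the factor columns instead of typing them by hand. The toy data frame and the names toy and cat_idx are made up for illustration, and the zero-based convention is assumed from the c(0, 3:84) vector in the question.

```r
# Toy stand-in for X_train: columns 1 and 4 are factors, 2 and 3 are numeric
toy <- data.frame(
  f1 = factor(c("a", "b", "a")),
  n1 = c(1.5, 2.5, 3.5),
  n2 = c(10, 20, 30),
  f2 = factor(c("x", "y", "z"))
)

# R positions of the factor columns, shifted to zero-based indices
cat_idx <- unname(which(vapply(toy, is.factor, logical(1)))) - 1
print(cat_idx)

# Sanity check: every flagged column really is a factor
stopifnot(all(vapply(toy[cat_idx + 1], is.factor, logical(1))))
```

The resulting vector could then be passed as cat_features to catboost.load_pool() and left out of the params list.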

Related

Multiple imputation and mlogit for a multinomial regression

I am trying to run a multinomial regression with imputed data. I can do this with the nnet package, however I want to use mlogit. Using the mlogit package I keep getting the following error "Error in 1:nrow(data) : argument of length 0".
First, I create the data:
library(mlogit)
library(nnet)
library(tidyverse)
library(mice)
df <- data.frame(vax = sample(1:6, 500, replace = T),
age = runif(500, 12, 18),
var1 = sample(1:2, 500, replace = T),
var2 = sample(1:5, 500, replace = T))
# Create missing data using the mice package:
df2 <- ampute(df, prop = 0.15)
df3 <- df2$amp
df3$vax <- as.factor(df3$vax)
df3$var1 <- as.factor(df3$var1)
df3$var2 <- as.factor(df3$var2)
# Impute missing data:
df4 <- mice(df3, m = 5, print = T, seed = 123)
It works using nnet's multinom:
multinomtest <- with(df4, multinom(vax ~ age + var1 + var2, data = df, model = T))
summary(pool(multinomtest))
But throws up an error when I try to reshape the data into mlogit format
test <- with(df4, dfidx(data = df4, choice = "vax", shape = "wide"))
Does anyone have any idea how I can get the imputed data into mlogit format, or even whether mlogit has compatibility with mice or any other imputation package?
Answer
You are using with.mids incorrectly, and thus both lines of code are wrong; the multinom line just doesn't give an error. If you want to apply multiple functions to the imputed datasets, you're better off using something like lapply:
analyses <- lapply(seq_len(df4$m), function(i) {
data.i <- complete(df4, i)
data.idx <- dfidx(data = data.i, choice = "vax", shape = "wide")
mlogit(vax ~ 1 | age + var1 + var2,
data = data.idx,
reflevel = "1",
nests = list(type1 = c("1", "2"), type2 = c("3","4"), type3 = c("5","6")))
})
test <- list(call = "", call1 = df4$call, nmis = df4$nmis, analyses = analyses)
oldClass(test) <- c("mira", "matrix")
summary(pool(test))
How with.mids works
When you apply with to a mids object (AKA the output of mice::mice), then you are actually calling with.mids.
If you use getAnywhere(with.mids) (or just type mice:::with.mids), you'll find that it does a couple of things:
It loops over all imputed datasets.
It uses complete to get one dataset.
It runs the expression with the dataset as the environment.
The third step is the problem. For functions that use formulas (like lm, glm and multinom), the formula is evaluated within the given environment. If the variables are not in the current environment (but rather in e.g. a data frame), you can point the function to them by setting the data argument.
The problems
This is where both your problems derive from:
In your multinom call, you set the data argument to df. Hence, you are actually running multinom on the original df, NOT the imputed dataset!
In your dfidx call, you are again filling in data directly. This is also wrong. However, leaving it empty also gives an error. This is because with.mids doesn't fill in the data argument, but only provides the environment. That isn't sufficient for dfidx.
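The environment-versus-data distinction can be reproduced in base R without mice. The sketch below mimics the evaluation step with eval(). The data frames original and imputed are made-up stand-ins (imputed is simply original with y shifted by 100), and lm() replaces multinom() to keep the example free of extra packages.

```r
set.seed(1)
original <- data.frame(x = 1:10, y = 2 * (1:10) + rnorm(10))
imputed  <- transform(original, y = y + 100)  # stand-in for one completed dataset

# No data argument: the formula picks up x and y from the supplied environment
fit_env <- eval(quote(lm(y ~ x)), envir = imputed, enclos = parent.frame())

# Explicit data argument: the supplied environment is bypassed entirely
fit_data <- eval(quote(lm(y ~ x, data = original)),
                 envir = imputed, enclos = parent.frame())

# The intercepts differ by the shift of 100, so the two fits used different data
coef(fit_env)[1] - coef(fit_data)[1]
```

The second fit is identical to fitting lm() on original directly, which is the situation described for the multinom call.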
Fixing multinom
The solution for your multinom line is simple: just don't specify data:
multinomtest <- with(df4, multinom(vax ~ age + var1 + var2, model = T))
summary(pool(multinomtest))
As you will see, this will yield very different results! But it is important to realise that this is what you are trying to obtain.
Fixing dfidx (and mlogit)
We cannot do this with with.mids, since it uses the imputed dataset as the environment, but you want to use the modified dataset (after dfidx) as your environment. So, we have to write our own code. You could just do this with any looping function, e.g. lapply:
analyses <- lapply(seq_len(df4$m), function(i) {
data.i <- complete(df4, i)
data.idx <- dfidx(data = data.i, choice = "vax", shape = "wide")
mlogit(vax ~ 1 | age + var1 + var2, data = data.idx, reflevel = "1", nests = list(type1 = c("1", "2"), type2 = c("3","4"), type3 = c("5","6")))
})
From there, all we have to do is make something that looks like a mira object, so that we can still use pool:
test <- list(call = "", call1 = df4$call, nmis = df4$nmis, analyses = analyses)
oldClass(test) <- c("mira", "matrix")
summary(pool(test))
Offering this as a way forward to circumvent the error with dfidx():
df5 <- df4$imp %>%
# work with a list, where each top-element is a different imputation run (imp_n)
map(~as.list(.x)) %>%
transpose %>%
# for each run, impute and return the full (imputed) data set
map(function(imp_n.x) {
df_out <- df4$data
df_out$vax[is.na(df_out$vax)] <- imp_n.x$vax
df_out$age[is.na(df_out$age)] <- imp_n.x$age
df_out$var1[is.na(df_out$var1)] <- imp_n.x$var1
df_out$var2[is.na(df_out$var2)] <- imp_n.x$var2
return(df_out)
}) %>%
# No errors with dfidx() now
map(function(imp_n.x) {
dfidx(data = imp_n.x, choice = "vax", shape = "wide")
})
However, I'm not too familiar with mlogit(), so can't help beyond this.
Update 8/2/21
As @slamballais mentioned in their answer, the issue is with the dataset you refer to when fitting the model. I assume that mldata (from your code in the comments section) is a data.frame? This is probably why you are seeing the same coefficients: you are not referring to the imputed data sets (which I've identified as imp_n.x in the functions). The function purrr::map() is very similar to lapply(), where you apply a function to elements of a list. So to get the code working properly, you would want to change mldata to imp_n.x:
# To fit mlogit() for each imputed data set
df5 %>%
map(function(imp_n.x) {
# form as specified in the comments
mlogit(vax ~ 1 | age + var1 + var2,
data = imp_n.x,
reflevel = "1",
nests = list(type1 = c('1', '2'),
type2 = c('3','4'),
type3 = c('5','6')))
})

Passing strings into 'contrasts' argument of lme/lmer

I am writing a subroutine to return the output of longitudinal mixed-effects models. I want to be able to pass elements from lists of variables into lme/lmer as the outcome and predictor variables. I would also like to specify contrasts within these models, but I am having trouble getting the contrasts argument to recognise the strings as the variable names referred to in the model specification within the same lme/lmer call.
Here's some toy data,
set.seed(345)
A0 <- rnorm(4,2,.5)
B0 <- rnorm(4,2+3,.5)
A1 <- rnorm(4,6,.5)
B1 <- rnorm(4,6+2,.5)
A2 <- rnorm(4,10,.5)
B2 <- rnorm(4,10+1,.5)
A3 <- rnorm(4,14,.5)
B3 <- rnorm(4,14+0,.5)
score <- c(A0,B0,A1,B1,A2,B2,A3,B3)
id <- rep(1:8,times = 4, length = 32)
time <- factor(rep(0:3, each = 8, length = 32))
group <- factor(rep(c("A","B"), times =2, each = 4, length = 32))
df <- data.frame(id = id, group = group, time = time, score = score)
Now the following call to lme works just fine, with contrasts specified (I know these are the default so this is all purely pedagogical).
library(nlme)
mod <- lme(score ~ group*time, random = ~1|id, data = df, contrasts = list(group = contr.treatment(2), time = contr.treatment(4)))
The following also works, passing strings as variable names into lme using the reformulate() function.
t <- "time"
g <- "group"
dv <- "score"
mod1R <- lme(reformulate(paste0(g,"*",t), response = "score"), random = ~1|id, data = df)
But if I want to specify contrasts, like in the first example, it doesn't work
mod2R <- lme(reformulate(paste0(g,"*",t), response = "score"), random = ~1|id, data = df, contrasts = list(g = contr.treatment(2), t = contr.treatment(4)))
# Error in `contrasts<-`(`*tmp*`, value = contrasts[[i]]) : contrasts apply only to factors
How do I get lme to recognise that the strings specified in the contrasts argument refer to the variables passed into the reformulate() function?
You should be able to use setNames() on the list of contrasts to apply the full names to the list:
# Using a %>% pipe so need to load magrittr
library(magrittr)
mod2R <- lme(reformulate(paste0(g,"*",t), response = "score"),
random = ~1|id,
data = df,
contrasts = list(g = contr.treatment(2), t = contr.treatment(4)) %>%
setNames(c(g, t))
)
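The first attempt fails because list(g = ..., t = ...) names its elements literally g and t; the variables are not looked up. A small base R sketch of the difference, using lm() instead of lme() so that it needs no extra packages (toy is a made-up data frame, not the data from the question):

```r
g <- "group"
t <- "time"

# Argument names in list() are taken literally, not evaluated
literal <- list(g = contr.treatment(2), t = contr.treatment(4))
names(literal)   # "g" "t"

# setNames() assigns the names stored in the variables
named <- setNames(list(contr.treatment(2), contr.treatment(4)), c(g, t))
names(named)     # "group" "time"

# Toy data: 2 groups x 4 time points, 4 observations per cell
set.seed(345)
toy <- expand.grid(rep = 1:4, group = factor(c("A", "B")), time = factor(0:3))
toy$score <- rnorm(nrow(toy), mean = 2 + 4 * as.integer(toy$time))

fit <- lm(reformulate(paste0(g, "*", t), response = "score"),
          data = toy, contrasts = named)
names(fit$contrasts)
```

The same named list can be built without a pipe, which avoids the magrittr dependency.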

Linear Model error or data type error?

The program generates a matrix with location and treatment. It then generates some data from a normal distribution and tries to fit a linear model predicting yield based on treatment and location. The linear model does not work. Why?
trts = paste(0:6)
locs = paste(1:6)
reps = paste(1:4)
plotsize = 4
DF = expand.grid(locs, reps, trts, plotsize, stringsAsFactors = TRUE)
colnames(DF) = c("Location","Replicate","Treatment")
vector = rnorm(1000000, mean=138.2, sd=54.89)
DF$Treatment = as.numeric(DF$Treatment)
DF$Location = as.numeric(DF$Location)
#This approach takes one set of "plotsize" values from "vector" and adds for 5 for each treatment.
DF$Yield = apply(DF, 1, function(x) (5*DF$Treatment)+mean(sample vector,plotsize)))
DF<-t(DF)
Yield<-DF$Yield
trt=as.factor(DF$Treatment)
loc=as.factor(DF$Location)
summary(fm1 <- aov(Yield ~ loc*trt))
result1<-TukeyHSD(fm1, "trtm", ordered = TRUE)
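For reference, one way the snippet above could be made to run, offered as a sketch under assumptions rather than as the asker's intended design. It assumes plotsize is meant only as a sample size (so it is kept out of expand.grid()), that the treatment effect is 5 per treatment level 0 to 6, and that the Tukey comparison is wanted for the treatment factor. The set.seed() call and the name trt_num are additions for illustration.

```r
set.seed(42)
trts <- paste(0:6)
locs <- paste(1:6)
reps <- paste(1:4)
plotsize <- 4

# Name the columns inside expand.grid(); plotsize is not a design factor
DF <- expand.grid(Location = locs, Replicate = reps, Treatment = trts,
                  stringsAsFactors = TRUE)
vector <- rnorm(1000000, mean = 138.2, sd = 54.89)

# Numeric treatment value (0..6) recovered from the factor labels
trt_num <- as.numeric(as.character(DF$Treatment))

# One mean of `plotsize` random draws per row, plus 5 per treatment level
DF$Yield <- 5 * trt_num + replicate(nrow(DF), mean(sample(vector, plotsize)))

# Keep DF as a data frame (no t()), and refer to terms by their real names
fm1 <- aov(Yield ~ Location * Treatment, data = DF)
summary(fm1)
result1 <- TukeyHSD(fm1, "Treatment", ordered = TRUE)
```

The main changes are: expand.grid() receives three factors to match the three column names, sample() is called with parentheses, the data frame is not transposed into a matrix, and TukeyHSD() is given a term that exists in the model.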

Correctly formatting data for lstm recurrent neural network in R / mxnet

I want to train an lstm neural net using the mx.lstm function in the R package mxnet. My data comprises n feature vectors, a vector of labelled classes and a time vector, much like this dummy example where X1, X2, X3 are the features:
dat <- data.frame(
X1 = rnorm(100, 1, sd = 1),
X2 = rnorm(100, 2, sd = 1),
X3 = rnorm(100, 3, sd = 1),
class = sample(c(1,0), replace = T, 100),
time = seq(0.01,1,0.01))
Help for mx.lstm states that the train.data argument requires "mx.io.DataIter or list(data=R.array, label=R.array) The Training set".
I have tried this:
library(mxnet)
# Convert dummy data into suitable format
trainDat <- list(data = array(c(dat$X1, dat$X2, dat$X3), dim = c(100,3)),
label = array(dat[,4], dim = c(100,1)))
# Set the basic network parameters for the lstm (arbitrary for this example)
batch.size = 32
seq.len = 32
num.hidden = 16
num.embed = 16
num.lstm.layer = 1
num.round = 1
learning.rate = 0.1
wd = 0.00001
clip_gradient = 1
update.period = 1
# Run the model
model <- mx.lstm(train.data = trainDat,
ctx=mx.cpu(),
num.round=num.round,
update.period=update.period,
num.lstm.layer=num.lstm.layer,
seq.len=seq.len,
num.hidden=num.hidden,
num.embed=num.embed,
num.label=vocab,
batch.size=batch.size,
input.size=vocab,
initializer=mx.init.uniform(0.1),
learning.rate=learning.rate,
wd=wd,
clip_gradient=clip_gradient)
Which returns "Error in mx.io.internal.arrayiter(as.array(data), as.array(label), unif.rnds, :
basic_string::_M_replace_aux"
There is an example lstm on the mxnet website, but the data used are quite different to mine and I can't make sense of it.
http://mxnet.io/tutorials/r/charRnnModel.html
So, my question is how do I transform my data into a suitable format for mx.lstm?
I tried to reproduce your error and got a more detailed message:
Error in mx.io.internal.arrayiter(as.array(data), as.array(label), unif.rnds, :
io.cc:50: Seems X, y was passed in a Row major way, MXNetR adopts a column major convention.
Please pass in transpose of X instead
I fixed the error by transposing the data and label arrays with aperm(), so that samples run along columns rather than rows:
trainDat <- list(data = aperm(array(c(dat$X1, dat$X2, dat$X3), dim = c(100,3))), label = aperm(array(dat[,4], dim = c(100,1))))
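To see what aperm() does here, the following base R sketch rebuilds the dummy data and compares dimensions. With no perm argument, aperm() reverses the dimension order, which for a two-dimensional array is a transpose. The names x_rows and x_cols are made up for illustration, and the sketch does not call mxnet.

```r
set.seed(1)
dat <- data.frame(X1 = rnorm(100, 1), X2 = rnorm(100, 2), X3 = rnorm(100, 3),
                  class = sample(c(1, 0), 100, replace = TRUE))

x_rows <- array(c(dat$X1, dat$X2, dat$X3), dim = c(100, 3))  # samples in rows
x_cols <- aperm(x_rows)                                       # samples in columns

dim(x_rows)  # 100 3
dim(x_cols)  # 3 100
```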

fitting data to bnlearn model in R

I have a bnlearn model in R that is learned using the gs function with 4 categorical variables and 8 numerical variables.
When I try to validate my model with a test set, I get this error when trying to predict some of the nodes:
Error in check.fit.vs.data(fitted = object, data = data, subset = object[[node]]$parents) :
'Keyword' has different number of levels in the node and in the data.
Is it not possible to use both numerical and categorical variables with bnlearn? And if it is possible, what am I doing wrong?
mydata$A <- as.factor(mydata$A)
mydata$B <- as.numeric(mydata$B)
mydata$C <- as.numeric(mydata$C)
mydata$D <- as.numeric(mydata$D)
mydata$E <- as.factor(mydata$E)
mydata$F <- as.numeric(mydata$F)
mydata$G <- as.numeric(mydata$G)
mydata$H <- as.numeric(mydata$H)
mydata$I <- as.numeric(mydata$I)
mydata$J <- as.numeric(mydata$J)
mydata$K <- as.numeric(mydata$K)
mydata$L <- as.numeric(mydata$L)
mydata$M <- as.numeric(mydata$M)
mydata$N <- as.numeric(mydata$N)
mydata$O <- as.numeric(mydata$O)
mydata$P <- as.numeric(mydata$P)
mydata$Q <- as.numeric(mydata$Q)
#create vector of black arcs
temp1=vector(mode = "character", length = 0)
for (i in 1:length(varnames)){
for (j in 1:length(varnames)){
temp1 <- c(temp1,varnames[i])
}
}
temp2=vector(mode = "character", length = 0)
for (i in 1:length(varnames)){
temp2 <- c(temp2,varnames)
}
#create the whitelisted arcs of the model
arcdata = read.csv("C:/users/asaf/desktop/in progress/whitearcs.csv", header = T)
wfrom=arcdata[,1]
wto=arcdata[,2]
whitelist = data.frame(from = wfrom,to =wto)
#block unwanted arcs
blacklist = data.frame(from = temp1, to = temp2)
#fit and plot the model
#learn the structure with Grow-Shrink (gs)
model = gs(mydata, whitelist = whitelist, blacklist = blacklist)
#fit the parameters of the learned structure
learntmodel = bn.fit(model,mydata,method = "mle",debug = F)
graphviz.plot(learntmodel)
myvalidation=read.csv("C:/users/asaf/desktop/in progress/val.csv", header = T)
#predict A
pred = predict(learntmodel, node="A", myvalidation)
myvalidation$A <- pred
#predict B
pred = predict(learntmodel, node="B", myvalidation)
myvalidation$B <- pred
At this point it throws the following error:
Error in check.fit.vs.data(fitted = object, data = data, subset = object[[node]]$parents) :
'A' has different number of levels in the node and in the data.
bnlearn can't work with mixed variables (qualitative and quantitative) at the same time; I read that it is possible with the deal package.
Another possibility is to use discretize to transform your continuous variables into discrete variables:
dmydata <- discretize(mydata, breaks = 2, method = "interval")
model <- gs(dmydata, whitelist = whitelist, blacklist = blacklist)
... and continue your code.
Actually, I had the same problem today. I resolved it by ensuring that the other nodes connected to the one in question (i.e. $A) also had the same number of levels in the data as in the fitted model.
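A base R sketch of one way to make the levels agree: rebuild each factor in the validation data with the levels of the training data. The data frames train and valid are made-up stand-ins for mydata and myvalidation, and whether this alone resolves the bnlearn error in the question is an assumption.

```r
# Training data: factor A has three levels
train <- data.frame(A = factor(c("low", "mid", "high", "mid")))

# Validation data read from a file may contain only some of those values
valid <- data.frame(A = c("low", "mid", "low"), stringsAsFactors = FALSE)
nlevels(factor(valid$A))  # fewer levels than in train

# Rebuild the factor using the training levels
valid$A <- factor(valid$A, levels = levels(train$A))
stopifnot(identical(levels(valid$A), levels(train$A)))
```

Values in the validation data that never occurred in training would become NA with this approach, so it is worth checking for them afterwards.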
