How to train a model with large categorical features :: RStudio crashes - r

I have a dataset with over 800K rows and 66 columns/features. I am training an xgboost model with caret using 5-fold cross-validation. However, because of the following two columns my R session always crashes, even though I am using an Amazon EC2 instance with these specs:
Amazon EC2 instance type: m5.4xlarge — 16 vCPUs, 64 GiB memory, EBS-only storage, up to 10 Gbps network, 3,500 Mbps EBS bandwidth
# A tibble: 815,885 x 66
first_tile last_tile
<fct> <fct>
1 Filly Brown Body of Evidence
2 The Dish The Hunger Games
3 Waiting for Guffman Hell's Kitchen N.Y.C.
4 The Age of Innocence The Lake House
5 Malevolence In the Name of the Father
6 Old Partner Desperate Measures
7 Lady Jane The Invasion
8 Mad Dog Time Eye of the Needle
9 Beauty Is Embarrassing Funny Lady
10 The Snowtown Murders Alvin and the Chipmunks
11 Superman II Pina
12 Leap of Faith Capote
13 The Royal Tenenbaums Dead Men Don't Wear Plaid
14 School for Scoundrels Tarzan
15 Rhinestone Cocoon: The Return
16 Burn After Reading Death Defying Acts
17 The Doors Half Baked
18 The Wood Dance of the Dead
19 Jason X Around the World in 80 Days
20 Dragon Wars LOL
## Model Training
library(caret)
set.seed(42)
split <- 0.8
train_index <- createDataPartition(data_tbl$paid, p = split, list = FALSE)
data_train <- data_tbl[train_index, ]
data_test <- data_tbl[-train_index, ]
## Summarise The Target Variable
table(data_train$paid) / nrow(data_train)
## Create train/test indexes (preserving the class distribution)
set.seed(42)
my_folds <- createFolds(data_train$paid, k = 5)
# Compare class distribution
i <- my_folds$Fold1
table(data_train$paid[i]) / length(i)
## Reusing trainControl
my_control <- trainControl(
summaryFunction = twoClassSummary,
classProbs = TRUE,
verboseIter = TRUE,
savePredictions = TRUE,
index = my_folds
)
model_xgb <- train(
paid ~ .,
data = data_train,
metric = "ROC",
method = "xgbTree",
trControl = my_control)
Can you suggest a way to get around this memory problem?
Is there a way I can do some sort of one-hot encoding for these features?
How can I estimate how big a machine I need?
I would appreciate any suggestions or help. Thanks in advance.

There are different ways to tackle such issues in the world of ML.
Do you really need all 66 features? Have you applied any feature selection techniques? Have you tried dropping features that do not contribute to your prediction in any way? Check out some feature selection approaches for R here:
https://dataaspirant.com/2018/01/15/feature-selection-techniques-r/
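For example, a cheap first pass (a minimal sketch, not taken from the linked article) is to drop near-zero-variance predictors with caret before training:
library(caret)
# identify predictors with (near-)zero variance: they enlarge the model matrix
# while carrying little information for the prediction
nzv <- nearZeroVar(data_train)
# drop them if any were found
if (length(nzv) > 0) {
  data_train <- data_train[, -nzv]
}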
Assuming you need most or all of your features and now want to encode these categorical variables, one-hot encoding is a popular choice, but there are other encoding techniques out there too. One of my choices would be binary encoding. You can explore other encoding techniques here: https://towardsdatascience.com/smarter-ways-to-encode-categorical-data-for-machine-learning-part-1-of-3-6dca2f71b159
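If you do go with one-hot encoding, building a sparse model matrix and calling xgboost directly is usually far lighter on memory than letting caret expand the factors into a dense matrix. A minimal sketch, assuming paid is a two-level factor whose second level is the positive class (as in your code above):
library(Matrix)
library(xgboost)
# sparse one-hot encoding of all predictors (factor levels become sparse dummy columns)
X <- sparse.model.matrix(paid ~ . - 1, data = data_train)
y <- as.numeric(data_train$paid == levels(data_train$paid)[2])
# xgboost works directly on the sparse matrix, so the dummies are never densified
dtrain <- xgb.DMatrix(data = X, label = y)
cv <- xgb.cv(params = list(objective = "binary:logistic", eval_metric = "auc"),
             data = dtrain, nrounds = 100, nfolds = 5, verbose = TRUE)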
xgboost also has a subsampling mechanism. Did you try training on a sample of your data? Check out the subsample parameter of xgboost here: https://xgboost.readthedocs.io/en/latest/parameter.html
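If you stay with caret, the row and column subsampling parameters can be passed through the tuning grid. A rough sketch reusing the train() call from the question (the specific values are placeholders, not tuned recommendations):
xgb_grid <- expand.grid(
  nrounds = 100,
  max_depth = 6,
  eta = 0.1,
  gamma = 0,
  colsample_bytree = 0.5,   # use half of the columns per tree
  min_child_weight = 1,
  subsample = 0.5           # use half of the rows per tree
)
model_xgb <- train(
  paid ~ .,
  data = data_train,
  metric = "ROC",
  method = "xgbTree",
  tuneGrid = xgb_grid,
  trControl = my_control
)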

Related

Classic king - man + woman = queen example with pretrained word-embedding and word2vec package in R

I am really desperate: I just cannot reproduce the allegedly classic example of king - man + woman = queen with the word2vec package in R and any (!) pre-trained embedding model (as a bin file).
I would be very grateful if anybody could provide working code to reproduce this example... including a link to the necessary pre-trained model that is actually downloadable (many are not!).
Thank you very much!
An overview of using word2vec with R is available at https://www.bnosac.be/index.php/blog/100-word2vec-in-r, which even shows an example of king - man + woman = queen.
Following the instructions there, I downloaded the first English 300-dim word2vec embedding model I encountered at http://vectors.nlpl.eu/repository (trained on the British National Corpus), unzipped the model.bin to my drive, and then inspected the terms in the model (the words apparently have POS tags appended), retrieved the word vectors, displayed them, computed king - man + woman, and looked up the closest vector to the result, which gives ... queen.
> library(word2vec)
> model <- read.word2vec("C:/Users/jwijf/OneDrive/Bureaublad/model.bin", normalize = TRUE)
> head(summary(model, type = "vocabulary"), n = 10)
[1] "vintage-style_ADJ" "Sinopoli_PROPN" "Yarrell_PROPN" "en-1_NUM" "74°–78°F_X"
[6] "bursa_NOUN" "uni-male_ADJ" "37541_NUM" "Menuetto_PROPN" "Saxena_PROPN"
> wv <- predict(model, newdata = c("king_NOUN", "man_NOUN", "woman_NOUN"), type = "embedding")
> head(t(wv), n = 10)
king_NOUN man_NOUN woman_NOUN
[1,] -0.4536242 -0.47802860 -1.03320265
[2,] 0.7096733 1.40374041 -0.91597748
[3,] 1.1509652 2.35536361 1.57869458
[4,] -0.2882653 -0.59587735 -0.59021348
[5,] -0.2110678 -1.05059254 -0.64248675
[6,] 0.1846713 -0.05871651 -1.01818573
[7,] 0.5493720 0.13456300 0.38765019
[8,] -0.9401053 0.56237948 0.02383301
[9,] 0.1140556 -0.38569298 -0.43408644
[10,] 0.3657919 0.92853492 -2.56553030
> wv <- wv["king_NOUN", ] - wv["man_NOUN", ] + wv["woman_NOUN", ]
> predict(model, newdata = wv, type = "nearest", top_n = 4)
term similarity rank
1 king_NOUN 0.9332663 1
2 queen_NOUN 0.7813236 2
3 coronation_NOUN 0.7663506 3
4 kingship_NOUN 0.7626975 4
If you prefer to build your own model based on your own text or a larger corpus (e.g. the text8 file), follow the instructions shown at https://www.bnosac.be/index.php/blog/100-word2vec-in-r.
Get a text file and use the R package word2vec to build the model, wait until the model has finished training, and then interact with it.
download.file("http://mattmahoney.net/dc/text8.zip", "text8.zip")
unzip("text8.zip", files = "text8")
> library(word2vec)
> set.seed(123456789)
> model <- word2vec(x = "text8", type = "cbow", dim = 100, window = 10, lr = 0.05, iter = 5, hs = FALSE, threads = 2)
> wv <- predict(model, newdata = c("king", "man", "woman"), type = "embedding")
> wv <- wv["king", ] - wv["man", ] + wv["woman", ]
> predict(model, newdata = wv, type = "nearest", top_n = 4)
term similarity rank
1 king 0.9743692 1
2 queen 0.8295941 2
You haven't shown which pretrained models you've tried, what data you used in your attempts, what training-then-probing code you used, or how your attempts failed. So it's hard to help without writing you a whole tutorial... and there are already plenty of word2vec tutorials online.
But note:
word2vec is a data-hungry algorithm, and its useful qualities (including analogy-solving capabilities) really only become reliably demoable when using adequately large training sets.
That said, most pretrained models from competent teams should easily show the classic man : king :: woman : queen analogy solution when using the same kind of vector arithmetic and candidate-answer ranking (eliminating all words in the question) as the original work.
If I recall correctly, the merely 100MB of uncompressed text in the text8 dataset from http://mattmahoney.net/dc/textdata will often succeed, or come close to succeeding, on man : king :: woman : queen, though the related text9, which is 1GB of data, tends to do much better. Both, though, are a bit small for making strong general word vectors. For contrast, the GoogleNews vectors Google released circa 2013, at the same time as the original word2vec papers, were said to be trained on something like 100GB of news articles.
Beware, though: the text8 & text9 datasets, by stripping all punctuation/linebreaks, may need to be chunked before being passed to word2vec implementations that require training texts to fit within certain limits. For example, Python's Gensim expects training texts to be no longer than 10,000 tokens each. text8 is 17 million words on one line; if you pass that one line of 17 million tokens to Gensim as one training text, 99.94% of them will be ignored as beyond the 10,000-token limit. Your R implementation may have a similar, or even tighter, limit.
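If chunking does turn out to be necessary with the R word2vec package (an assumption — check its documentation for any such limit), a minimal sketch would be to split text8 into shorter documents before training:
library(word2vec)
# text8 is a single very long line; split it into ~10,000-token documents
txt    <- readLines("text8", warn = FALSE)
tokens <- unlist(strsplit(txt, " ", fixed = TRUE))
docs   <- vapply(split(tokens, ceiling(seq_along(tokens) / 10000)),
                 paste, character(1), collapse = " ")
# train on the character vector of documents instead of the raw file
model <- word2vec(x = docs, type = "cbow", dim = 100, window = 10,
                  lr = 0.05, iter = 5, hs = FALSE, threads = 2)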

Error in generalized linear mixed model cross-validation: The value in 'data[[cat_col]]' must be constant within each ID

I am trying to conduct a 5-fold cross validation on a generalized linear mixed model using the groupdata2 and cvms packages. This is the code I tried to run:
data <- groupdata2::fold(detect, k = 5,
cat_col = 'outcome',
id_col = 'bird') %>%
arrange(.folds)
cvms::cross_validate(
data,
"outcome ~ sex + year + season + (1 | bird) + (1 | obsname)",
family="binomial",
fold_cols = ".folds",
control = NULL,
REML = FALSE)
This is the error I receive:
Error in groupdata2::fold(detect, k = 4, cat_col = "outcome", id_col = "bird") %>% :
1 assertions failed:
* The value in 'data[[cat_col]]' must be constant within each ID.
In the package vignette, the following explanation is given: "A participant must always have the same diagnosis (‘a’ or ‘b’) throughout the dataset. Otherwise, the participant might be placed in multiple folds." This makes sense in the example. However, my data is based on the outcome of resighting birds, so outcome varies depending on whether the bird was observed on that particular survey. Is there a way around this?
Reproducible example:
bird <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
outcome <- c(0,1,1,1,0,0,0,1,0,1,0,1,0,0,1)
df <- data.frame(bird, outcome)
df$outcome <- as.factor(df$outcome)
df$bird <- as.factor(df$bird)
data <- groupdata2::fold(df, k = 5,
cat_col = 'outcome',
id_col = 'bird') %>%
arrange(.folds)
The full documentation says:
cat_col: Name of categorical variable to balance between folds.
E.g. when predicting a binary variable (a or b), we usually
want both classes represented in every fold.
N.B. If also passing an ‘id_col’, ‘cat_col’ should be
constant within each ID.
So in this case, where outcome varies within individual birds (id_col), you simply can't specify that the folds be balanced with respect to the outcome. (I don't 100% understand this constraint in the software: it seems it should be possible to do at least approximate balancing by selecting groups (birds) with a balanced range of outcomes, but I can see how it could make the balancing procedure a lot harder.)
In my opinion, though, the importance of balancing outcomes is somewhat overrated in general. Lack of balance would mean that some of the simpler metrics in ?binomial_metrics (e.g. accuracy, sensitivity, specificity) are not very useful, but others (balanced accuracy, AUC, AIC) should be fine.
A potentially greater problem is that you appear to have (potentially) crossed random effects (i.e. (1|bird) + (1|obsname)). I'm guessing obsname is the name of an observer: if some observers detected (or failed to detect) multiple birds and some birds were detected/failed by multiple observers, then there may be no way to define folds that are actually independent, or at least it may be very difficult.
You may be able to utilize the new collapse_groups() function in groupdata2 v2.0.0 instead of fold() for this. It allows you to take existing groups (e.g. bird) and collapse them to fewer groups (e.g. folds) with attempted balancing of multiple categorical columns, numeric columns, and factor columns (number of unique levels - though the same levels might be in multiple groups).
It does not have the constraints that fold() does with regard to changing outcomes, but on the other hand it does not come with the same "guarantees" in the "non-changing outcome" context. E.g. it does not guarantee at least one of each outcome level in every fold.
You need more birds than the number of folds though, so I've added a few to the test data:
bird <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,
4,4,4,5,5,5,5,5,6,6,6,6,6,7,7,7,7)
outcome <- c(0,1,1,1,0,0,0,1,0,1,0,1,0,0,1,0,1,
0,1,1,0,1,1,0,0,1,1,0,0,1,0,0,1,1)
df <- data.frame(bird, outcome)
df$outcome <- as.factor(df$outcome)
df$bird <- as.factor(df$bird)
# Combine 'bird' groups to folds
data <- groupdata2::collapse_groups(
data = df,
n = 3,
group_cols="bird",
cat_col="outcome",
col_name = ".folds"
) %>%
arrange(.folds)
# Check the balance of the relevant columns
groupdata2::summarize_balances(
data=data,
group_cols=".folds",
cat_cols="outcome"
)$Groups
> # A tibble: 3 × 6
> .group_col .group `# rows` `# bird` `# outc_0` `# outc_1`
> <fct> <fct> <int> <int> <dbl> <dbl>
> 1 .folds 1 14 3 7 7
> 2 .folds 2 10 2 6 4
> 3 .folds 3 10 2 4 6
summarize_balances() shows us that we created 3 folds with 14 rows in the first fold and 10 in the other folds. There are 3 unique bird levels in the first fold and 2 in the others (normally only unique within the group, but here we know that birds are only in one group, as that is how collapse_groups() works with its group_cols argument).
The outcome variable (here # outc_0 and # outc_1) is somewhat decently balanced.
With larger datasets, you might want to run multiple collapsings and choose the one with the best balance from the summary. That can be done by adding num_new_group_cols = 10 to collapse_groups() (for even better results, enable the auto_tune setting) and then listing all the created group columns when running summarize_balances().
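A rough sketch of that workflow, mirroring the call above. It assumes the extra grouping columns end up named .folds_1 through .folds_10 and that auto_tune is passed as a logical argument — check your version of groupdata2 for the exact names:
# create 10 alternative collapsings and compare their balance
data_multi <- groupdata2::collapse_groups(
  data = df,
  n = 3,
  group_cols = "bird",
  cat_col = "outcome",
  col_name = ".folds",
  num_new_group_cols = 10,
  auto_tune = TRUE  # assumed argument name for the auto_tune setting
)
# summarize all 10 candidate fold columns and pick the best-balanced one
groupdata2::summarize_balances(
  data = data_multi,
  group_cols = paste0(".folds_", 1:10),  # assumed naming scheme
  cat_cols = "outcome"
)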
Hope this helps you or others in a similar position. The constraint in fold() is hard to get around with its current internal approach, but collapse_groups hopefully does the trick in those cases.
See more at https://rdrr.io/cran/groupdata2/man/collapse_groups.html

Lambda Issue, or cross validation

I am doing double cross-validation with LASSO from the glmnet package. However, when I plot the results I get lambda values of 0 - 150,000, which is unrealistic in my case. I am not sure what I am doing wrong; can someone point me in the right direction? Thanks in advance!
calcium = read.csv("calciumgood.csv", header=TRUE)
dim(calcium)
n = dim(calcium)[1]
calcium = na.omit(calcium)
names(calcium)
library(glmnet) # use LASSO model from package glmnet
lambdalist = exp((-1200:1200)/100) # defines models to consider
fulldata.in = calcium
x.in = model.matrix(CAMMOL~. - CAMLEVEL - AGE,data=fulldata.in)
y.in = fulldata.in[,2]
k.in = 10
n.in = dim(fulldata.in)[1]
groups.in = c(rep(1:k.in,floor(n.in/k.in)),1:(n.in%%k.in))
set.seed(8)
cvgroups.in = sample(groups.in,n.in) #orders randomly, with seed (8)
#LASSO cross-validation
cvLASSOglm.in = cv.glmnet(x.in, y.in, lambda=lambdalist, alpha = 1, nfolds=k.in, foldid=cvgroups.in)
plot(cvLASSOglm.in$lambda,cvLASSOglm.in$cvm,type="l",lwd=2,col="red",xlab="lambda",ylab="CV(10)")
whichlowestcvLASSO.in = order(cvLASSOglm.in$cvm)[1]; min(cvLASSOglm.in$cvm)
bestlambdaLASSO = (cvLASSOglm.in$lambda)[whichlowestcvLASSO.in]; bestlambdaLASSO
abline(v=bestlambdaLASSO)
bestlambdaLASSO # this is the lambda for the best LASSO model
LASSOfit.in = glmnet(x.in, y.in, alpha = 1,lambda=lambdalist) # fit the model across possible lambda
LASSObestcoef = coef(LASSOfit.in, s = bestlambdaLASSO); LASSObestcoef # coefficients for the best model fit
I found the dataset you are referring to:
Calcium, inorganic phosphorus and alkaline phosphatase levels in elderly patients.
Basically the data are "dirty", and that is a possible reason why the algorithm does not converge properly. E.g. there are 771-year-old patients, and besides 1 and 2 for male and female there is a 22 in the sex encoding, etc.
In your code you removed only the NAs.
You also need to check the imported data.frame column types. E.g. variables that should be factors (SEX, Lab and Age group) could be imported as integers, which will affect the model.
I think you need to:
1) cleanse the data;
2) if that does not work, share the *.csv file.
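A minimal cleaning sketch, assuming column names AGE and SEX as used above (the plausible ranges and factor labels are assumptions — adjust them to the actual codebook):
calcium <- read.csv("calciumgood.csv", header = TRUE)
calcium <- na.omit(calcium)
# drop rows with implausible values (assumed ranges -- adjust to the codebook)
calcium <- subset(calcium, AGE >= 18 & AGE <= 110 & SEX %in% c(1, 2))
# make sure coded variables are factors rather than integers
calcium$SEX <- factor(calcium$SEX, levels = c(1, 2), labels = c("male", "female"))
str(calcium)  # verify the column types before rebuilding the model matrix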

Multinomial logit models and nested logit models

I am using the mlogit package in R. I have converted my data from its original wide format to long format. Here is a sample of the converted data.frame, which I refer to as 'long_perp'. All of the independent variables are individual-specific. I have 4,258 unique observations in the dataset.
date_id act2 grp.bin pdist ship sea avgknots shore day location chid alt
4.dive 40707_004 TRUE 2 2.250 second light 14.06809 2.30805 12 Lower 4 dive
4.fly 40707_004 FALSE 2 2.250 second light 14.06809 2.30805 12 Lower 4 fly
4.none 40707_004 FALSE 2 2.250 second light 14.06809 2.30805 12 Lower 4 none
5.dive 40707_006 FALSE 2 0.000 second light 15.12650 2.53312 12 Lower 5 dive
5.fly 40707_006 TRUE 2 0.000 second light 15.12650 2.53312 12 Lower 5 fly
5.none 40707_006 FALSE 2 0.000 second light 15.12650 2.53312 12 Lower 5 none
6.dive 40707_007 FALSE 1 1.995 second light 14.02101 2.01680 12 Lower 6 dive
6.fly 40707_007 TRUE 1 1.995 second light 14.02101 2.01680 12 Lower 6 fly
6.none 40707_007 FALSE 1 1.995 second light 14.02101 2.01680 12 Lower 6 none
'act2' is the dependent variable and consists of the choices a bird floating on the water could make when approached by a ship: fly, dive, or none. I am interested in how these probabilities relate to the remaining independent variables in the data.frame, i.e. perpendicular distance to the ship's path (pdist), sea conditions (sea), speed (avgknots), distance to shore (shore), etc. The independent variables are a mix of dichotomous, factor and continuous variables.
I ran two multinomial logit models, one including all the choice options and another including only a subset. I then compared these models with the hmftest() function to test the IIA assumption. The results were confusing, to say the least. I will include the code for the two models and the test output (in case I am mis-specifying the models in the code).
# model including all choice options (fly, dive, none)
mod.1 <- mlogit(act2 ~ 1 | pdist + as.factor(grp.bin) +
as.factor(sea) + avgknots + shore + as.factor(location),long_perp ,
reflevel = 'none')
# model including only a subset of choice options (fly, dive)
mod.alt <- mlogit(act2 ~ 1 | pdist + as.factor(grp.bin) +
as.factor(sea) + avgknots + shore + as.factor(location),long_perp ,
reflevel = 'none', alt.subset = c("fly","dive"))
# IIA test
hmftest(mod.1, mod.alt)
# output
Hausman-McFadden test
data: long_perp
chisq = -968.7303, df = 7, p-value = 1
alternative hypothesis: IIA is rejected
As you can see, the chi-square statistic is negative! I assume that either 1. I am doing something wrong, or 2. IIA is violated. This result holds for the choice subset (fly, dive), but the IIA assumption is upheld with the choice subset (none, dive)? This confuses me.
Next I tried to formulate a nested model as a way to relax the IIA assumption. I nested the choices as nest1 = none, nest2 = fly, dive. This makes sense to me as this seems like a logical break, the bird decides to react or not then decides which reaction to make.
I am unclear on how to run the nested logit models (even after reading the two vignettes for mlogit, Croissant vignette and Train vignette).
When I run my analysis following the example in the Croissant vignette I get the following error.
nested.1 <- mlogit(act2 ~ 0 | pdist + as.factor(grp.bin) + as.factor(ship) +
as.factor(sea) + avgknots + shore + as.factor(location),
long_perp , reflevel="none",nests = list(noact = "none",
react = c("dive","fly")), unscaled = TRUE)
# Error in solve.default(crossprod(attr(x, "gradi")[, !fixed])) :
Lapack routine dgesv: system is exactly singular: U[19,19] = 0
I have read a bit about this error message and it may occur because of complete separation. I have looked at some tables of the data and do not believe this is happening as I have 4,000+ observations and only one factor variable with more than 2 levels (it has 3).
Help on these specific problems is greatly appreciated but I am also open to alternate analyses that I can use to answer my question. I am mainly interested in the probability of flying as a function of perpendicular distance to the ships path.
Thanks, Tim
To get a positive chi-sq, change the code as follows:
alt.subset = c("none", "fly")
That is, the reference level should be in the subset too. It may help, though the p-value may not change much.
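In full, that would mean refitting the restricted model with the reference level kept in the subset — a sketch based on the question's own call:
# restricted model that keeps the reference level "none" in the choice subset
mod.alt2 <- mlogit(act2 ~ 1 | pdist + as.factor(grp.bin) +
                     as.factor(sea) + avgknots + shore + as.factor(location),
                   long_perp, reflevel = "none",
                   alt.subset = c("none", "fly"))
# Hausman-McFadden IIA test against the full model
hmftest(mod.1, mod.alt2)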

D-efficient balanced design with R

I was wondering if there is any package for generating a D-efficient balanced design with R.
I tried the AlgDesign package, but I did not manage to get a balanced design.
Not sure if this is because of the small full factorial of the attributes and levels I am considering (32) or because balanced designs are out of the scope of AlgDesign.
I have 3 attributes: two attributes have four levels, and 1 has 2 levels
att 1: 4 levels
att 2: 4 levels
att 3: 2 levels
I use the following R code:
library(AlgDesign)
#-----------------------------
# define attributes and levels
#-----------------------------
desVarNames <- c("esource", "certified", "cost")
desLevels <- c(4,2, 4)
n <- 6 #number of choice sets
desOpt <- 4 #num option per choice set
set.seed(123456)
#generate full factorial
dat<-gen.factorial(desLevels,length(desLevels),varNames=desVarNames, center=TRUE)
destT <- optFederov(~., dat, nTrials = (n*(desOpt)), criterion="D")
destT
First, when you say "balanced" design, I suspect you mean that each factor level should appear the same number of times in the final design. (Actually, your variable "dat" above is a balanced design, too, since each possible combination appears only once).
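One way to see this is to count how often each level appears in the design that optFederov returned (a quick check using the destT object from the question):
# tabulate level counts per attribute in the suggested design;
# a balanced design would show equal counts within every column
lapply(destT$design[, c("esource", "certified", "cost")], table)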
The way optFederov works is by randomly selecting and replacing trials using Federov's exchange algorithm. As such, every time a trial is exchanged with another candidate trial, an initially balanced design will become unbalanced, since if a trial "balances" a design, replacing it with any other trial will unbalance the design. If the algorithm were forced to keep a balanced design at each step, no replacement would be possible, and it would get stuck.
Not only is a balanced design incompatible with the way Federov's exchange algorithm works, it is actually not desirable from a D-efficiency point of view.
For instance, if you have 4 factors with 2, 3, 5 and 7 levels respectively, the only way to have a balanced design is to include all 2*3*5*7 = 210 trials, whereas AlgDesign only suggests 19.
data = gen.factorial(c(2,3,5,7), factors = "all")
trials = optFederov(data = data, center = FALSE, criterion = "D")
In short, the notion of a "balanced D-efficient" design is largely self-contradictory.
