cforest party unbalanced classes - r

I want to measure feature importance with the cforest function from the party package.
My output variable has something like 2000 samples in class 0 and 100 samples in class 1.
I think a good way to avoid bias due to class imbalance is to train each tree of the forest on a subsample in which the number of elements of class 1 equals the number of elements of class 0.
Is there any way to do that? I am thinking of an option like n_samples = c(20, 20)
EDIT:
An example of code
> iris.cf <- cforest(Species ~ ., data = iris,
+ control = cforest_unbiased(mtry = 2)) #<--- Here I would like to train the forest using a balanced subsample of the data
> varimp(object = iris.cf)
Sepal.Length Sepal.Width Petal.Length Petal.Width
0.048981818 0.002254545 0.305818182 0.271163636
>
EDIT:
Maybe my question is not clear enough.
A random forest is a set of decision trees. In general, each decision tree is constructed using only a random subsample of the data. I would like that subsample to contain the same number of elements of class 1 and of class 0.
EDIT:
The option I am looking for is available in the randomForest package:
sampsize
Size(s) of sample to draw. For classification, if sampsize is a vector of the length the number of strata, then sampling is stratified by strata, and the elements of sampsize indicate the numbers to be drawn from the strata.
I need the same for the party package. Is there any way to get it?

I will assume you know what you want to accomplish, but don't know enough R to do that.
I'm not sure whether the function offers class balancing as an argument, but you can do it manually. Below is code I quickly threw together; a more elegant solution might exist.
library(party)
# work on a copy, just in case
myData <- iris
# replicate everything *10* times. Replicate is just a "loop 10 times".
replicate(10,
{
# split dataset by class and add separate classes to list
splitList <- split(myData, myData$Species)
# sample *20* random rows from each data frame in the list
# (sample(20) alone would only shuffle the first 20 rows of each class)
sampledList <- lapply(splitList, function(dat) { dat[sample(nrow(dat), 20), ] })
# combine sampled rows to a data.frame
sampledData <- do.call(rbind, sampledList)
# your code below
res.cf <- cforest(Species ~ ., data = sampledData,
control = cforest_unbiased(mtry = 2)
)
varimp(object = res.cf)
}
)
Hope you can take it from here.
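The per-class sampling step above can be sketched on its own in base R. This is a minimal illustration, not a feature of party: `balanced_sample` and the `toy` data frame are hypothetical names, and the 2000-versus-100 split only mimics the imbalance described in the question.

```r
# Hypothetical helper: draw n_per_class rows from every level of class_col.
balanced_sample <- function(data, class_col, n_per_class) {
  # row indices grouped by class label
  rows_by_class <- split(seq_len(nrow(data)), data[[class_col]])
  picked <- lapply(rows_by_class, function(rows) {
    # index into `rows` so a one-row class is not misread by sample()
    rows[sample.int(length(rows), n_per_class)]
  })
  data[unlist(picked, use.names = FALSE), , drop = FALSE]
}

set.seed(1)
# Toy data mimicking the 2000-vs-100 imbalance from the question
toy <- data.frame(
  y = factor(rep(c(0, 1), times = c(2000, 100))),
  x = rnorm(2100)
)
sub <- balanced_sample(toy, "y", n_per_class = 20)
table(sub$y)
```

The result could be passed to cforest in place of sampledData. Note that this balances the data given to each forest, not the subsample drawn for each tree inside it.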

Related

How to capture the most important variables in Bootstrapped models in R?

I have several models, Lasso being one of them, and I would like to compare which predictors they select as important on the same data set. The data set consists of census data with around a thousand variables that have been renamed to "x1", "x2" and so on for convenience (the original names are extremely long). I would like to report the top features and then rename these variables with shorter, more concise names.
My attempt is to extract the top variables from each iterated model, put them into a list, and then average them over X loops. However, the top 10 most-used predictors still vary, so I cannot rename the variables manually, because each run of the code chunk yields different results. I suspect this is because I have so many variables in my analysis and because the resampling creates a new model on every bootstrap iteration.
For the sake of a simple example I used mtcars and will look for the top 3 most common predictors due to only having 10 variables in this data set.
library(glmnet)
data("mtcars") # Base R Dataset
df <- mtcars
topvar <- list()
for (i in 1:100) {
# Bootstrap resample and train/test split
ind <- sample(nrow(df), nrow(df), replace = TRUE)
ind <- unique(ind)
train <- df[ind, ]
xtrain <- model.matrix(mpg~., train)[,-1]
ytrain <- df[ind, 1]
test <- df[-ind, ]
xtest <- model.matrix(mpg~., test)[,-1]
ytest <- df[-ind, 1]
# Create Model per Loop
model <- glmnet(xtrain, ytrain, alpha = 1, lambda = 0.2)
# Store coefficients per loop
coef_las <- coef(model, s = 0.2)[-1, ] # Remove intercept
# Store all nonzero Coefficients
topvar[[i]] <- coef_las[which(coef_las != 0)]
}
# Unlist
varimp <- unlist(topvar)
# Count all predictors
novar <- table(names(varimp))
# Find the mean of all variables
meanvar <- tapply(varimp, names(varimp), mean)
# Return top 3 repeated Coefs
repvar <- novar[order(novar, decreasing = TRUE)][1:3]
# Return mean of repeated Coefs
repvar.mean <- meanvar[names(repvar)]
repvar
If you rerun the code chunk above, you will notice that the top 3 variables change, so renaming them is difficult when they differ on every run. Any suggestions on how I could approach this?
You can use the function set.seed() to ensure that sample() returns the same draws each time. For example:
set.seed(123)
When I add this to the top of the code above and run it twice, the following is returned both times:
wt carb hp
98 89 86
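The effect of set.seed() can be checked in isolation with a small base R sketch. The seed value 123 is the one from the answer; the variable names `first` and `second` are only illustrative.

```r
# Re-seeding before each draw makes sample() repeat itself exactly
set.seed(123)
first <- sample(10, 5)

set.seed(123)
second <- sample(10, 5)

identical(first, second)  # TRUE
```

The seed has to be set once before the bootstrap loop starts. Setting it inside the loop would make every iteration draw the same resample.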

How to use lapply with get.confusion_matrix() in R?

I am performing a PLS-DA analysis in R using the mixOmics package. I have one binary Y variable (presence or absence of wetland) and 21 continuous predictor variables (X) with values ranging from 1 to 100.
I have made the model with the data_training dataset and want to predict new outcomes with the data_validation dataset. These datasets have exactly the same structure.
My code looks like:
library(mixOmics)
model.plsda<-plsda(X,Y, ncomp = 10)
myPredictions <- predict(model.plsda, newdata = data_validation[,-1], dist = "max.dist")
I want to predict the outcome based on 10, 9, 8, ... to 2 principal components. By using the get.confusion_matrix function, I want to estimate the error rate for every number of principal components.
prediction <- myPredictions$class$max.dist[,10] #prediction based on 10 components
confusion.mat = get.confusion_matrix(truth = data_validation[,1], predicted = prediction)
get.BER(confusion.mat)
I can do this separately 10 times, but I want to do it a little faster. Therefore I was thinking of making a list with the prediction results for every number of components...
library(BBmisc)
prediction_test <- myPredictions$class$max.dist
predictions_components <- convertColsToList(prediction_test, name.list = T, name.vector = T, factors.as.char = T)
...and then using lapply with the get.confusion_matrix and get.BER function. But then I don't know how to do that. I have searched on the internet, but I can't find a solution that works. How can I do this?
Many thanks for your help!
Without a reproducible example there is no way to test this, but you need to turn the code you want to run each time into a function. Something like this:
confmat <- function(x) {
prediction <- myPredictions$class$max.dist[,x] # prediction based on x components
confusion.mat = get.confusion_matrix(truth = data_validation[,1], predicted = prediction)
get.BER(confusion.mat)
}
Now lapply:
results <- lapply(10:2, confmat)
That will return a list with the get.BER results for each number of PCs, so results[[1]] will be the result for 10 PCs. You will not get the values of prediction or confusion.mat unless get.BER includes them in what it returns. If you want all of that, replace the last line of the function with return(list(prediction, confusion.mat, get.BER(confusion.mat))). This will produce a list of lists, so that results[[1]][[1]] will be the prediction for 10 PCs, and results[[1]][[2]] and results[[1]][[3]] will be confusion.mat and get.BER(confusion.mat) respectively.
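The wrap-it-in-a-function pattern can be tried without mixOmics. The sketch below uses made-up labels, and `ber` is a hypothetical stand-in written for this example (mean of the per-class error rates), not the package's get.BER. sapply is used instead of lapply so the results come back as a named vector.

```r
# Stand-in data: true labels and one column of predicted labels
# per number of components (all values are made up)
truth <- c("a", "a", "b", "b")
pred <- cbind(
  comp1 = c("a", "b", "b", "b"),
  comp2 = c("a", "a", "b", "b")
)

# Hypothetical helper, NOT mixOmics::get.BER:
# balanced error rate as the mean of the per-class error rates
ber <- function(truth, predicted) {
  lev <- sort(unique(truth))
  cm <- table(factor(truth, lev), factor(predicted, lev))  # rows = truth
  mean(1 - diag(cm) / rowSums(cm))
}

# One error rate per column of predictions, named after the column
results <- sapply(colnames(pred), function(k) ber(truth, pred[, k]))
results
```

Swapping sapply for lapply gives the same values as a list, which is the better choice when each call returns several objects.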

poLCA - Latent Class how to do the adjusted Lo-Mendell-Rubin (LMR) test with R

Good afternoon,
I am trying to perform Lo, Mendell and Rubin's (2001) adjusted test (LMR) in order to decide the optimal number of classes in LCA. I fitted the models with poLCA, but I didn't find any command to perform the test.
Is there someone that can help me?
Thank you very much!
Here is an example of an (ad-hoc adjusted) LMR test comparing an LCA with 3 classes (alternative model) against one with 2 classes (baseline model).
# load packages/install if needed
library(poLCA)
library(tidyLPA)
data("election")
# Fit LCA with 2 classes (NULL model)
mod_null <- poLCA(formula = cbind(MORALG, CARESG, KNOWG) ~ 1,
data = election, nclass = 2, verbose = F)
# store values baseline model
n <- mod_null$Nobs #number of observations (should be equal in both models)
null_ll <- mod_null$llik #log-likelihood
null_param <- mod_null$npar # number of parameters
null_classes <- length(mod_null$P) # number of classes
# Fit LCA with 3 classes (ALTERNATIVE model)
mod_alt <- poLCA(formula = cbind(MORALG, CARESG, KNOWG) ~ 1,
data = election, nclass = 3, verbose = F)
# Store values alternative model
alt_ll <- mod_alt$llik #log-likelihood
alt_param <- mod_alt$npar # number of parameters
alt_classes <- length(mod_alt$P) # number of classes
# use calc_lrt from tidyLPA package
calc_lrt(n, null_ll, null_param, null_classes, alt_ll, alt_param, alt_classes)
Wow, really late to the game, but as I'm looking at similar things I'll leave this for the next person.
The Lo-Mendell-Rubin test involves a transformation of the data and then a chi-squared test to determine whether K classes fit better than K-1 classes... basically.
However, there is reasonable research out there suggesting that the bootstrap likelihood ratio is a better measure.
The former is still in common use among Mplus users; the latter is far more common in LCA packages in R, e.g. mclust. Dunno about poLCA though...

How do I produce a set of predictions based on a new set of data using predict in R? [duplicate]

This question already has answers here:
Predict() - Maybe I'm not understanding it
(4 answers)
Closed 6 years ago.
I'm struggling to understand how the predict function works and can be used with different sample data. For instance the following code...
my <- data.frame(x=rnorm(1000))
my$y <- 0.5*my$x+0.5*rnorm(1000)
fit <- lm(my$y ~ my$x)
mySample <- my[sample(nrow(my), 100),]
predict(fit, mySample)
I would expect this to return 100 predictions of y based on the sample. But it returns 1,000 rows with the warning message:
'newdata' had 100 rows but variables found have 1000 rows
How do I produce a set of predictions based on a new set of data using predict? Or am I using the wrong function? I am a noob, so I apologise in advance if I am asking stupid questions.
It's never a good idea to use the $ symbol when using the formula syntax (and most of the time it's completely unnecessary). This is especially true when you are trying to make predictions, because the predict() function works hard to exactly match up column names and data types. So rather than
fit <- lm(my$y ~ my$x)
use
fit <- lm(y ~ x, my)
So a complete example would be
set.seed(15) # for reproducibility
my <- data.frame(x=rnorm(1000))
my$y <- 0.5*my$x+0.5*rnorm(1000)
fit <- lm(y ~ x, my)
mySample <- my[sample(1:nrow(my), 100),]
head(predict(fit, mySample))
# 694 278 298 825 366 980
# 0.43593108 -0.67936324 -0.42168723 -0.04982095 -0.72499087 0.09627245
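The difference between the two formula styles can be seen by counting the predictions each fit returns. This is a small sketch reusing the data from the example above; `fit_dollar`, `fit_data`, `n_dollar` and `n_data` are illustrative names.

```r
set.seed(15)
my <- data.frame(x = rnorm(1000))
my$y <- 0.5 * my$x + 0.5 * rnorm(1000)
mySample <- my[sample(nrow(my), 100), ]

# The formula terms are literally `my$y` and `my$x`, so predict()
# cannot find them in newdata and falls back to the full `my`
fit_dollar <- lm(my$y ~ my$x)

# The terms are `y` and `x`, which predict() looks up by name in newdata
fit_data <- lm(y ~ x, data = my)

n_dollar <- length(suppressWarnings(predict(fit_dollar, mySample)))
n_data   <- length(predict(fit_data, mySample))
c(n_dollar, n_data)  # 1000 100
```

The suppressed warning is the one from the question: 'newdata' had 100 rows but variables found have 1000 rows.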
A couple of things are wrong with the code: you are overwriting the sample function with your variable named sample. You want something like mysample <- sample(my$x, 100); it has nothing to do with predict. From my limited understanding, data frames are 'lists of columns', so sampling my means creating 100 samples of the (1000-row) column x. By using my$x you refer to the column in the data frame, which is a list of rows.
In other words, you are sampling from a list of columns (which has only a single element), but you actually want to sample from the rows of column x.
Is this what you want?
library(caret)
my <- data.frame(x=rnorm(1000))
my$y <- 0.5*my$x+0.5*rnorm(1000)
## Divide data into train and test set
Index <- createDataPartition(my$y, p = 0.8, list = FALSE, times = 1)
train <- my[Index, ]
test <- my[-Index,]
lmfit<- train(y~x,method="lm",data=train,trControl = trainControl(method = "cv"))
lmpredict<-predict(lmfit,test)
This gives an in-sample prediction. For a pseudo out-of-sample prediction (forecasting one step ahead), you just need to lag the independent variable by 1:
Lag(x)
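Lag() is not a base R function; packages such as Hmisc and quantmod provide one. Assuming a plain numeric vector, a one-step lag can also be sketched in base R as follows (`x_lag1` is an illustrative name):

```r
x <- c(1, 2, 3, 4)

# Shift every value one position later; the first slot has no predecessor
x_lag1 <- c(NA, head(x, -1))
x_lag1
```

The leading NA means the first observation is dropped when a model is fitted on the lagged variable.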

Local prediction modelling approach in R

users
I am trying to develop a local model (PLSR) which is predicting a query sample by a model built on the 10 most similar samples using the code below (not the full model yet, just a part of it). I got stuck when trying to predict the query sample (second to last line). The model is actually predicting something, ("prd") but not the query sample!
Here is my code:
require("pls")
set.seed(10000) # generate some sample data
mat <- replicate(100, rnorm(100))
y <- mat[, 1, drop = FALSE] # keep y as a one-column matrix
x <- mat[,2:100]
eD <- dist(x, method="euclidean") # create a distance matrix
eDm <- as.matrix(eD)
Looping over all 100 samples and extracting their 10 most similar samples for subsequent model building and prediction of query sample:
for (i in 1:nrow(eDm)) {
kni <- head(order(eDm[,i]),11)[-1] # add 10 most similar samples to kni
pls1 <- plsr(y[kni,] ~ x[kni,], ncomp=5, validation="CV") # run plsr on sel. samples
prd <- predict(pls1, ncomp=5, newdata=x[[i]]) # predict query sample ==> I suspect there is something wrong with this expression: newdata=x[[i]]
}
I can't figure out how to address the query sample properly. Many thanks in advance for any help!
Best regards,
Chega
You are going to run into all sorts of pain building models with formulae like that. Also the x[[i]] isn't doing what you think it is - you need to supply a data frame usually to these modelling functions. In this case a matrix seems fine too.
I get all your code working OK if I use:
prd <- predict(pls1, ncomp=5, newdata=x[i, ,drop = FALSE])
giving
> predict(pls1, ncomp=5, newdata=x[i,,drop = FALSE])
, , 5 comps
y[kni, ]
[1,] 0.6409897
What you were seeing with your code are the fitted values for the training data.
> fitted(pls1)[, , 5, drop = FALSE]
, , 5 comps
y[kni, ]
1 0.1443274
2 0.2706769
3 1.1407780
4 -0.2345429
5 -1.0468221
6 2.1353091
7 0.8267103
8 3.3242296
9 -0.5016016
10 0.6781804
This is the convention in R when you either don't supply newdata or the object you supply makes no sense and doesn't contain the covariates required to generate predictions.
I would have fitted the model as follows:
pls1 <- plsr(y ~ x, ncomp=5, validation="CV", subset = kni)
where I use the subset argument for its intended purpose: selecting the rows of the input data to fit the model with. You get nicer output from the models (the labels use y instead of y[kni, ], etc.), and this general convention will serve you well in other modelling tools, where R will expect newdata to be a data frame with names exactly the same as those mentioned in the model formula. In your case, with your code, that would mean creating a data frame with names like x[kni, ], which is not easy to do, for good reason!
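Why x[[i]] fails as newdata while x[i, , drop = FALSE] works comes down to what each expression returns for a matrix. A small base R sketch with a made-up 3 x 4 matrix:

```r
set.seed(1)
x <- matrix(rnorm(12), nrow = 3)

length(x[[2]])             # 1: `[[` picks a single element, not a row
dim(x[2, ])                # NULL: the row is dropped to a plain vector
dim(x[2, , drop = FALSE])  # 1 4: still a matrix with one row
```

Only the last form keeps the one-row-by-many-columns shape that matches the predictor matrix the model was fitted on.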

Resources