Combining binary classification algorithms - r

I have several algorithms which solve a binary classification problem (with response 0 or 1) by assigning to each observation a probability of the target value being equal to 1. All the algorithms try to minimize the log loss function, -1/N * sum_i [ y_i*log(p_i) + (1 - y_i)*log(1 - p_i) ], where N is the number of observations, y_i is the actual target value and p_i is the probability of 1 predicted by the algorithm. Here is some R code with sample data:
actual.response = c(1,0,0,0,1)
prediction.df = data.frame(
method1 = c(0.5080349,0.5155535,0.5338271,0.4434838,0.5002529),
method2 = c(0.5229466,0.5298336,0.5360780,0.4217748,0.4998602),
method3 = c(0.5175378,0.5157711,0.5133765,0.4372109,0.5215695),
method4 = c(0.5155535,0.5094510,0.5201827,0.4351625,0.5069823)
)
log.loss = colSums(-1/length(actual.response)*(actual.response*log(prediction.df)+(1-actual.response)*log(1-prediction.df)))
The sample code gives the log loss for each algorithm:
  method1   method2   method3   method4
0.6887705 0.6824404 0.6659796 0.6719181
Now I want to combine these algorithms so I can reduce the log loss even further. Is there any R package which can do this for me? I would appreciate references to any algorithms, articles, books or research papers which solve this kind of problem. Note that as a final result I want the predicted probabilities of each class, not plain 0/1 responses.

This is called ensemble learning (Wikipedia).
Check out this article: "An Intro to Ensemble Learning in R."
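For the specific setup in the question (combining already-predicted probabilities to minimize log loss), here is a minimal sketch, my own addition rather than part of the linked article, of one simple approach: a weighted average of the four probability columns, with weights chosen by optim() to minimize the log loss. With only five rows this will badly overfit; in practice you would fit the weights on a held-out validation set.
## log loss for a vector of probabilities p and 0/1 responses y
log.loss <- function(p, y) -mean(y * log(p) + (1 - y) * log(1 - p))
## parameterize the weights with a softmax so the search is unconstrained
ensemble.loss <- function(w, preds, y) {
  w <- exp(w) / sum(exp(w))
  log.loss(as.matrix(preds) %*% w, y)
}
opt <- optim(rep(0, ncol(prediction.df)), ensemble.loss,
             preds = prediction.df, y = actual.response)
weights <- exp(opt$par) / sum(exp(opt$par))
weights
log.loss(as.matrix(prediction.df) %*% weights, actual.response)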
Here is an example I did using the Cornell movie review data, which can be downloaded by clicking the link. I used the data set with 1000 positive and 1000 negative reviews. Once you get the data into R:
library(RTextTools)
library(tm)
library(glmnet)
library(ipred)
library(randomForest)
library(data.table)
## create a column of sentiment score. 0 for negative and 1 for
## positive.
text_neg$pos_neg<-rep(0,1000)
text_pos$pos_neg<-rep(1,1000)
## Combine into 1 data.table and rename.
text_all<-rbind(text_neg, text_pos)
## don't forget to shuffle
set.seed(26)
text2<-text_all[sample(nrow(text_all)),]
## turn the data.frame into a document term matrix. This uses the handy
##RTextTools wrappers and functions.
doc_matrix <- create_matrix(text2$V1, language="english",
removeNumbers=TRUE, stemWords=TRUE, removeSparseTerms=.98)
ncol(data.frame(as.matrix(doc_matrix)))
## 2200 variables at .98 sparsity. runs pretty slow...
## create a container with the very nice RTextTools package
container <- create_container(doc_matrix, text2$pos_neg,
trainSize=1:1700, testSize=1701:2000, virgin=FALSE)
## train the data
time_glm<-system.time(GLMNET <- train_model(container,"GLMNET"));
time_glm #1.19
time_slda<-system.time(SLDA <- train_model(container,"SLDA"));
time_slda #45.03
time_bag<-system.time(BAGGING <- train_model(container,"BAGGING"));
time_bag #59.24
time_rf<-system.time(RF <- train_model(container,"RF")); time_rf #69.59
## classify with the models
GLMNET_CLASSIFY <- classify_model(container, GLMNET)
SLDA_CLASSIFY <- classify_model(container, SLDA)
BAGGING_CLASSIFY <- classify_model(container, BAGGING)
RF_CLASSIFY <- classify_model(container, RF)
## summarize results
analytics <- create_analytics(container,cbind( SLDA_CLASSIFY,
BAGGING_CLASSIFY,RF_CLASSIFY, GLMNET_CLASSIFY))
summary(analytics)
This ran an ensemble classifier using the 4 different methods (random forest, GLMNET, SLDA and bagging). The ensemble summary at the end shows
# ENSEMBLE SUMMARY
#
# n-ENSEMBLE COVERAGE n-ENSEMBLE RECALL
# n >= 1 1.00 0.86
# n >= 2 1.00 0.86
# n >= 3 0.89 0.89
# n >= 4 0.63 0.96
This shows that when all 4 methods agreed on whether the review was positive or negative, the ensemble had a 96% recall rate. But be careful: with a binary outcome (2 choices) and 4 different algorithms, there is bound to be a lot of agreement.
See the RTextTools documentation for more explanation. They also do an almost identical example with U.S. Congress data that I more or less mimicked in the above example.
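Since the question asks for predicted probabilities rather than hard labels: if I remember the RTextTools output correctly, each data frame returned by classify_model() carries both a label and a probability column (e.g. GLMNET_LABEL and GLMNET_PROB), so you can inspect or combine the probabilities directly instead of only using the voted labels.
head(GLMNET_CLASSIFY)
head(RF_CLASSIFY)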
Hope this was helpful.

Related

Clustering with Mclust results in an empty cluster

I am trying to cluster my empirical data using Mclust. When using the following, very simple code:
library(reshape2)
library(mclust)
data <- read.csv(file.choose(), header=TRUE, check.names = FALSE)
data_melt <- melt(data, value.name = "value", na.rm=TRUE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
summary(fit, parameters = TRUE)
R gives me the following result:
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust E (univariate, equal variance) model with 4 components:
log-likelihood n df BIC ICL
-20504.71 3258 8 -41074.13 -44326.69
Clustering table:
1 2 3 4
0 2271 896 91
Mixing probabilities:
1 2 3 4
0.2807685 0.4342499 0.2544305 0.0305511
Means:
1 2 3 4
1381.391 1381.715 1574.335 1851.667
Variances:
1 2 3 4
7466.189 7466.189 7466.189 7466.189
Edit: here is my data for download: https://www.file-upload.net/download-14320392/example.csv.html
I do not readily understand why Mclust gives me an empty cluster (0 observations), especially since its mean is nearly identical to that of the second cluster. This only appears when specifically asking for a univariate, equal-variance model. Using, for example, modelNames="V", or leaving it at the default, does not produce this problem.
This thread, Cluster contains no observations, describes a similar problem, but if I understand correctly, that one appeared to be due to randomly generated data?
I am somewhat clueless as to where my problem is or if I am missing anything obvious.
Any help is appreciated!
As you noted, the means of clusters 1 and 2 are extremely similar, and it so happens that there's quite a lot of data there (see the spike in the histogram):
set.seed(111)
data <- read.csv("example.csv", header=TRUE, check.names = FALSE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
hist(data$value,br=50)
abline(v=fit$parameters$mean,
col=c("#FF000080","#0000FF80","#BEBEBE80","#BEBEBE80"),lty=8)
Briefly, mclust fits Gaussian mixture models (GMMs), which are probabilistic: they estimate the mean and variance of each cluster as well as the probability of each point belonging to each cluster. This is unlike k-means, which gives a hard assignment. The log-likelihood of the model is the sum, over data points, of the log of each point's mixture density; you can check the details in mclust's publication.
In this model, the means of cluster 1 and cluster 2 are close, but their mixing proportions are different:
fit$parameters$pro
[1] 0.28565736 0.42933294 0.25445342 0.03055627
This means that if a data point lies around the means of clusters 1 and 2, it will consistently be assigned to cluster 2. For example, let's predict data points from 1350 to 1400:
head(predict(fit,1350:1400)$z)
1 2 3 4
[1,] 0.3947392 0.5923461 0.01291472 2.161694e-09
[2,] 0.3945941 0.5921579 0.01324800 2.301397e-09
[3,] 0.3944456 0.5919646 0.01358975 2.450108e-09
[4,] 0.3942937 0.5917661 0.01394020 2.608404e-09
[5,] 0.3941382 0.5915623 0.01429955 2.776902e-09
[6,] 0.3939790 0.5913529 0.01466803 2.956257e-09
The $classification is obtained by taking the column with the maximum probability, so in the same example everything is assigned to cluster 2:
head(predict(fit,1350:1400)$classification)
[1] 2 2 2 2 2 2
To answer your question: no, you did not do anything wrong; it's a known shortcoming, at least with this implementation of GMM. I would say it's a bit of overfitting, but you can basically keep only the clusters that actually have members (see the sketch below).
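A small sketch of that last suggestion (my addition, not from the original answer): tabulate the hard assignments and keep only the components that actually receive observations.
table(fit$classification)
used <- sort(unique(fit$classification))
fit$parameters$mean[used]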
If you use modelNames="V", I see the solution is equally problematic:
fitv <- Mclust(data$value, modelNames="V", G = 1:7)
plot(fitv, what="classification")
Using scikit-learn's GMM I don't see a similar issue. So if you need a Gaussian mixture with equal (spherical) variances, consider using a fuzzy k-means instead:
library(ClusterR)
## fit_kmeans was not defined in the original post; one plausible fit (assumption) via KMeans_rcpp with fuzzy memberships:
fit_kmeans <- KMeans_rcpp(as.matrix(data$value), clusters = 3, fuzzy = TRUE)
plot(NULL, xlim=range(data$value), ylim=c(0,4), ylab="cluster", yaxt="n", xlab="values")
points(data$value, fit_kmeans$clusters, pch=19, cex=0.1, col=factor(fit_kmeans$clusters))
axis(2, 1:3, as.character(1:3))
If you don't need equal variance, you can use the GMM function in the ClusterR package too.
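A hedged sketch of that option (my addition); the argument names below are from ClusterR's GMM() and predict_GMM() as I recall them, so double-check against the package documentation:
## unequal-variance GMM via ClusterR; GMM() expects a matrix, so wrap the column
gmm_fit <- GMM(as.matrix(data$value), gaussian_comps = 3)
gmm_fit$centroids              # component means
gmm_fit$covariance_matrices    # component variances
pred <- predict_GMM(as.matrix(data$value), gmm_fit$centroids,
                    gmm_fit$covariance_matrices, gmm_fit$weights)
table(pred$cluster_labels)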

choose thresholds for 100% sensitivity in glm and lda (wbcd, R)

I'm working on the Wisconsin Breast Cancer Dataset; my aim is to build a model with good accuracy and 100% sensitivity. I know that in order to achieve this I have to work with the classification threshold. The problem is that I don't understand how thresholds work and how I can properly choose them.
I'm studying from the famous Introduction to Statistical Learning (with Applications in R) book, but I'm not able to find an explanation of how to choose the threshold in chapter 4.
Here is the code I've written so far:
df <- subset(df, select = -c(X, id)) # Selecting features
set.seed(4)
# Train and test
nrows <- NROW(df)
index <- sample(1:nrows, 0.7 * nrows)
traindf <- df[index,]
testdf <- df[-index,]
glm.fit = glm(diagnosis ~ ., data = traindf, family = binomial)
glm.probs = predict(glm.fit, testdf, type = "response")
glm.pred = rep("B", dim(testdf)[1])
glm.pred[glm.probs > .5] = "M"
table(glm.pred, testdf[,1])
Now, this gives me
glm.pred   B   M
       B 108   3
       M   4  56
What I want is a 0 in the top-right cell of the table (true M predicted as B), but changing the threshold doesn't get me there.
How can I fix the problem?
The same happens with the lda function (which I omit here).
Thanks
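As an illustration of how the threshold interacts with sensitivity, here is a minimal sketch (my addition, using the objects from the code above): to force sensitivity to 100%, lower the cutoff to the smallest predicted probability among the true "M" cases. Choosing the cutoff on the test set itself leaks information, so in practice you would pick it on a separate validation set.
## smallest predicted probability among true malignant cases
thr <- min(glm.probs[testdf$diagnosis == "M"])
glm.pred <- ifelse(glm.probs >= thr, "M", "B")
table(glm.pred, testdf$diagnosis)  # the top-right cell (true M predicted as B) is now 0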

R: adjusting a given time-series but keeping summary statistics equal

Let's say I have a time-series like this
t x
1 100
2 50
3 200
4 210
5 90
6 80
7 300
Is it possible in R to generate a new dataset x1 which has the exact same summary statistics, e.g. mean, variance, kurtosis, skew as x?
The reason for my asking is that I would like to do an experiment where I want to test how subjects react to different graphs of data that contain the same information.
I recently read:
Matejka, Justin, and George Fitzmaurice. "Same stats, different graphs: Generating datasets with varied appearance and identical statistics through simulated annealing." Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. ACM, 2017.
Generating Data with Identical Statistics but Dissimilar Graphics: A Follow up to the Anscombe Dataset, The American Statistician, 2007,
However, Matejka's code is in Python and quite involved, and their data is more complex than time-series data, so I was wondering whether there is a more straightforward way to do this for a simpler data set?
Best regards
I'm not aware of a package that can give you precisely what you are looking for. One option is to use the datasets in the datasauRus package, as JasonAizkalns pointed out (a quick sketch is at the end of this answer). However, if you want to create your own dataset, you can try this:
Fit the Johnson distribution from the SuppDists package to get the moments of the dataset, then draw new sets from that distribution until the difference is sufficiently small. Below is an example with your dataset, although more observations make it easier to replicate the summary statistics:
library(SuppDists)
a <- c(100,50,200,210,90,80,300)
momentsDiffer <- function(x1, x2){
  diff <- sum(abs(moments(x1) - moments(x2)))
  return(diff)
}

repDataset <- function(x, n){
  # fit the Johnson distribution to the original data
  parms <- JohnsonFit(x, moment = "quant")
  # draw from the fitted distribution n times, keeping the best draw so far
  current <- rJohnson(length(x), parms)
  momDiff <- momentsDiffer(x, current)
  for(i in 1:n){
    temp <- rJohnson(length(x), parms)
    tempDiff <- momentsDiffer(x, temp)
    if(tempDiff < momDiff){
      current <- temp
      momDiff <- tempDiff
    }
  }
  return(current)
}
# Drawing 1000 times to allow improvement
b <- repDataset(a,1000)
> moments(b)
mean sigma skew kurt
148.14048691 84.24884165 1.04201116 -0.05008629
> moments(a)
mean sigma skew kurt
147.1428571 84.1281821 0.5894543 -1.0198303
EDIT - Added additional method
Following the suggestion of @Jj Blevins, the method below generates a random sequence based upon the original sequence, leaving out 4 observations. Those 4 observations are then added by solving a non-linear system for the difference between the four moments of the original sequence and the new sequence. This still does not generate a perfect match (note that moments() reports a population standard deviation while the objective below uses var(), and e1071's skewness/kurtosis may use slightly different definitions), so feel free to improve.
library(nleqslv)
library(e1071)
set.seed(1)
a <- c(100,50,200,210,90,80,300)
#a <- floor(runif(1000,0,101))
init <- floor(runif(length(a)-4,min(a),max(a)+1))
moments <- moments(a)
f <- function(x) {
  a <- mean(c(init, x))
  b <- var(c(init, x))
  c <- skewness(c(init, x))
  d <- kurtosis(c(init, x))
  c(a - moments[1], b - moments[2], c - moments[3], d - moments[4])
}
result <- nleqslv(runif(4, min(a), max(a) + 1), f,
                  control = list(ftol = .00000001, allowSingular = TRUE))
> moments(c(init,result$x))
mean sigma skew kurt
49.12747961 29.85435993 0.03327868 -1.25408078
> moments(a)
mean sigma skew kurt
49.96600000 29.10805462 0.03904256 -1.18250616
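As an aside, here is a quick sketch of the datasauRus option mentioned at the top of this answer (my addition): the datasaurus_dozen data set ships 13 x/y sets with near-identical summary statistics but very different shapes.
library(datasauRus)
library(dplyr)
## each "dataset" has almost the same means and standard deviations
datasaurus_dozen %>%
  group_by(dataset) %>%
  summarise(mean_x = mean(x), mean_y = mean(y),
            sd_x = sd(x), sd_y = sd(y))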

R topicmodels LDA

I am running LDA on a small corpus of 2 docs (sentences) for testing purposes. The following code returns topic-term and document-topic distributions that are not reasonable at all given the input documents. Running exactly the same thing in Python returns reasonable results.
Does anyone know what is wrong here?
library(topicmodels)
library(tm)
d1 <- "bank bank bank"
d2 <- "stock stock stock"
corpus <- Corpus(VectorSource(c(d1,d2)))
##fit lda to data
dtm <- DocumentTermMatrix(corpus)
ldafit <- LDA(dtm, k=2, method="Gibbs")
##get posteriors
topicTerm <- t(posterior(ldafit)$terms)
docTopic <- posterior(ldafit)$topics
topicTerm
docTopic
> topicTerm
1 2
bank 0.3114525 0.6885475
stock 0.6885475 0.3114525
> docTopic
1 2
1 0.4963245 0.5036755
2 0.5036755 0.4963245
The results from Python are as follows:
>>> docTopic
array([[ 0.87100799, 0.12899201],
[ 0.12916713, 0.87083287]])
>>> fit.print_topic(1)
u'0.821*"bank" + 0.179*"stock"'
>>> fit.print_topic(0)
u'0.824*"stock" + 0.176*"bank"'
The author of the R package topicmodels, Bettina Grün, pointed out that this is due to the selection of the hyperparameter alpha.
LDA in R selects alpha = 50/k = 25, while LDA in gensim (Python) selects alpha = 1/k = 0.5. A smaller alpha value favors sparse document-topic distributions, i.e. documents contain a mixture of just a few topics. Hence, decreasing alpha in R's LDA yields very reasonable results:
ldafit <- LDA(dtm, k=2, method="Gibbs", control=list(alpha=0.5))
posterior(ldafit)$topics
# 1 2
# 1 0.125 0.875
# 2 0.875 0.125
posterior(ldafit)$terms
# bank stock
# 1 0.03125 0.96875
# 2 0.96875 0.03125
Try plotting the perplexity over iterations and make sure it converges. The initial state also matters. (The document size and sample size both seem small here, though.)
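A hedged sketch of that convergence check (my addition): with method = "Gibbs", setting keep in the control list stores the log-likelihood every keep iterations in the fitted object's @logLiks slot, which can serve as a convergence trace.
ldafit <- LDA(dtm, k = 2, method = "Gibbs",
              control = list(alpha = 0.5, keep = 10, iter = 2000))
plot(ldafit@logLiks, type = "l",
     xlab = "stored iteration", ylab = "log-likelihood")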

poLCA - Latent Class how to do the adjusted Lo-Mendell-Rubin (LMR) test with R

Good afternoon,
I am trying to perform Lo, Mendell and Rubin's (2001) adjusted test (LMR) in order to decide the optimal number of classes in LCA. I fit the models with poLCA, but I couldn't find any command to perform the test.
Is there someone who can help me?
Thank you very much!
Here is an example of an (ad-hoc adjusted) LMR test comparing an LCA with 3 groups (alternative model) against 2 groups (baseline model).
# load packages/install if needed
library(poLCA)
library(tidyLPA)
data("election")
# Fit LCA with 2 classes (NULL model)
mod_null <- poLCA(formula = cbind(MORALG, CARESG, KNOWG) ~ 1,
data = election, nclass = 2, verbose = F)
# store values baseline model
n <- mod_null$Nobs #number of observations (should be equal in both models)
null_ll <- mod_null$llik #log-likelihood
null_param <- mod_null$npar # number of parameters
null_classes <- length(mod_null$P) # number of classes
# Fit LCA with 3 classes (ALTERNATIVE model)
mod_alt <- poLCA(formula = cbind(MORALG, CARESG, KNOWG) ~ 1,
data = election, nclass = 3, verbose = F)
# Store values alternative model
alt_ll <- mod_alt$llik #log-likelihood
alt_param <- mod_alt$npar # number of parameters
alt_classes <- length(mod_alt$P) # number of classes
# use calc_lrt from tidyLPA package
calc_lrt(n, null_ll, null_param, null_classes, alt_ll, alt_param, alt_classes)
Really late to the game, but as I'm looking at similar things I'll leave this for the next person.
The Lo-Mendell-Rubin test involves a transformation of the data and then a chi-squared test to determine whether K classes fit better than K-1 classes... basically.
However, there is reasonable research out there suggesting that a better measure of this is the bootstrap likelihood ratio test (BLRT).
The former is still in common use among Mplus users; the latter is far more common in mixture-model packages in R, e.g. mclust. I don't know about poLCA, though...
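For what it's worth, here is a hedged sketch of the bootstrap likelihood ratio test in mclust (my addition); note this applies to Gaussian mixtures on continuous data rather than to categorical-indicator LCA, for which you would need a package implementing the BLRT for latent class models.
library(mclust)
data(diabetes)
## compare G components against G-1 via parametric bootstrap (nboot kept small here for speed)
blrt <- mclustBootstrapLRT(diabetes[, -1], modelName = "VVV", maxG = 3, nboot = 100)
blrt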
