Cluster assignment after estimation in a large dataset (Mclust) - r

I've been doing a clustering analysis with a relatively large dataset (~50,000 observations and 16 variables).
library(mclust)
load(file="mdper.f.Rdata")#mdper.f = My stored data
As my computer was unable to handle the full dataset, I worked with a few subsets (10 subsets of 5,000 observations; the example below uses 16,000, which took about 15 minutes to compute) and used Mclust to determine the optimal number of groups.
ind <- sample(1:nrow(mdper.f), size=16000) # sample of 16,000 rows (~15 min computing)
nfac <- mdper.f[ind,] # subsample
Fnac <- scale(nfac) #scale data
mod = Mclust(Fnac) #Determining the optimal number of clusters
summary(mod) #Summary
#Results:
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust VII (spherical, varying volume) model with 9 components:
log.likelihood n df BIC ICL
128118.2 16000 80 255462 254905.3
Clustering table:
1 2 3 4 5 6 7 8 9
1879 2505 3452 3117 2846 464 822 590 325
The result was always 9 components (10 out of 10 subsets of 5,000), so I guess it's okay.
Now I would like to assign the estimated cluster divisions to the rest of the data, based on the multidimensional structure of the fitted clusters.
How can I do it?
I started to play with the Mclust object, but I can't see how to handle it and apply it to the rest of the data.
The optimal solution would be my original data with an extra column with the cluster number (1 to 9) assigned, for example.

I got the answer after a few minutes of work:
First of all, there was a conceptual mistake: the dataset must be scaled before partitioning; after that, it is just a matter of using predict().
library(mclust)
load(file="mdper.f.Rdata")#mdper.f = My stored data
mdper.f.s <- scale(mdper.f)#Scaling data
ind <- sample(1:nrow(mdper.f.s), size=16000) # sample 16,000 rows
nfac <- mdper.f.s[ind,] # subsample
mod16 <- Mclust(nfac) # determine the optimal number of clusters (~15 min computing with 7 vars)
prediction <- predict(mod16, mdper.f.s) # predict with the fitted model on the full scaled data
mdper.f <- cbind(mdper.f, prediction$classification) # assign clusters to the original data
colnames(mdper.f)[ncol(mdper.f)] <- "Cluster" # name the new column
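As a quick sanity check (a small optional sketch using the objects above), predict() on the training subsample should reproduce the labels stored in the fitted model:
all(predict(mod16, nfac)$classification == mod16$classification) # should be TRUE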

Related

Clustering with Mclust results in an empty cluster

I am trying to cluster my empirical data using Mclust. When using the following very simple code:
library(reshape2)
library(mclust)
data <- read.csv(file.choose(), header=TRUE, check.names = FALSE)
data_melt <- melt(data, value.name = "value", na.rm=TRUE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
summary(fit, parameters = TRUE)
R gives me the following result:
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust E (univariate, equal variance) model with 4 components:
log-likelihood n df BIC ICL
-20504.71 3258 8 -41074.13 -44326.69
Clustering table:
1 2 3 4
0 2271 896 91
Mixing probabilities:
1 2 3 4
0.2807685 0.4342499 0.2544305 0.0305511
Means:
1 2 3 4
1381.391 1381.715 1574.335 1851.667
Variances:
1 2 3 4
7466.189 7466.189 7466.189 7466.189
Edit: Here is my data for download: https://www.file-upload.net/download-14320392/example.csv.html
I do not readily understand why Mclust gives me an empty cluster (0), especially with nearly identical mean values to the second cluster. This only appears when specifically looking for a univariate, equal-variance model. Using, for example, modelNames="V", or leaving the default, does not produce this problem.
This thread: Cluster contains no observations has a similar problem, but if I understand correctly, that appeared to be due to randomly generated data?
I am somewhat clueless as to where my problem is or if I am missing anything obvious.
Any help is appreciated!
As you noted, the means of clusters 1 and 2 are extremely similar, and it so happens that there's quite a lot of data there (see the spike in the histogram):
set.seed(111)
data <- read.csv("example.csv", header=TRUE, check.names = FALSE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
hist(data$value,br=50)
abline(v=fit$parameters$mean,
col=c("#FF000080","#0000FF80","#BEBEBE80","#BEBEBE80"),lty=8)
Briefly, mclust fits a Gaussian mixture model (GMM), a probabilistic model that estimates the mean and variance of each cluster and also the probability of each point belonging to each cluster; this is unlike k-means, which gives a hard assignment. The log-likelihood of the model is the sum, over data points, of the log of each point's mixture density (the proportion-weighted sum of the component densities); you can check it out also in mclust's publication.
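To make that concrete, here is a small sketch that recomputes the log-likelihood by hand for the equal-variance fit above (assuming the fit and data objects from the code above):
p <- fit$parameters$pro
mu <- fit$parameters$mean
s <- sqrt(fit$parameters$variance$sigmasq) # single variance under modelNames="E"
dens <- sapply(seq_along(p), function(k) p[k] * dnorm(data$value, mu[k], s))
sum(log(rowSums(dens))) # should match fit$loglik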
In this model, the means of cluster 1 and cluster 2 are close, but their expected proportions are different:
fit$parameters$pro
[1] 0.28565736 0.42933294 0.25445342 0.03055627
This means that if you have a data point around the means of clusters 1 and 2, it will consistently be assigned to cluster 2. For example, let's try to predict data points from 1350 to 1400:
head(predict(fit,1350:1400)$z)
1 2 3 4
[1,] 0.3947392 0.5923461 0.01291472 2.161694e-09
[2,] 0.3945941 0.5921579 0.01324800 2.301397e-09
[3,] 0.3944456 0.5919646 0.01358975 2.450108e-09
[4,] 0.3942937 0.5917661 0.01394020 2.608404e-09
[5,] 0.3941382 0.5915623 0.01429955 2.776902e-09
[6,] 0.3939790 0.5913529 0.01466803 2.956257e-09
The $classification is obtained by taking the column with the maximum probability. So, in the same example, everything is assigned to cluster 2:
head(predict(fit,1350:1400)$classification)
[1] 2 2 2 2 2 2
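Equivalently (a small illustrative sketch with the objects above), the hard labels are just the per-row argmax over the posterior matrix:
z <- predict(fit, 1350:1400)$z
head(apply(z, 1, which.max)) # same as $classification above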
To answer your question: no, you did not do anything wrong; it's a failure mode of at least this implementation of GMM. I would say it's a bit of overfitting, but you can basically keep only the clusters that actually have members.
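A quick way to see which fitted components actually receive observations (again a sketch using the fit above):
table(factor(fit$classification, levels = 1:fit$G)) # empty components show 0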
If you use modelNames="V", I see the solution is equally problematic:
fitv <- Mclust(data$value, modelNames="V", G = 1:7)
plot(fitv,what="classification")
Using scikit-learn's GMM I don't see a similar issue. So if you need to use a Gaussian mixture with spherical (equal-variance) components, consider using a fuzzy k-means:
library(ClusterR)
# fit_kmeans was not defined in the original; a plausible fuzzy k-means fit:
fit_kmeans <- KMeans_rcpp(as.matrix(data$value), clusters = 3, fuzzy = TRUE)
plot(NULL, xlim=range(data$value), ylim=c(0,4), ylab="cluster", yaxt="n", xlab="values")
points(data$value, fit_kmeans$clusters, pch=19, cex=0.1, col=factor(fit_kmeans$clusters))
axis(2, 1:3, as.character(1:3))
If you don't need equal variance, you can use the GMM function in the ClusterR package too.
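For reference, a minimal sketch of that ClusterR GMM route (the component count and other choices here are illustrative assumptions, not from the original answer):
library(ClusterR)
gmm_fit <- GMM(as.matrix(data$value), gaussian_comps = 3) # variances free to differ
pred <- predict_GMM(as.matrix(data$value), gmm_fit$centroids,
                    gmm_fit$covariance_matrices, gmm_fit$weights)
table(pred$cluster_labels) # cluster sizes under hard assignment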

Is it wrong to apply Poisson Regression to stock volume?

I recently learned about Poisson regression and want to apply this new-to-me statistical method to real-world problems. So I thought about it for a while and decided to try to predict stock volumes of the Fortune 500 based on financial information from a random sampling of companies.
The problem I am encountering is that, while the model accounts for a massive amount of variance and contains only significant predictors, when I try to get the Poisson model to make predictions using the predict function, it returns predictions with virtually no variance that are way off the actual values.
The dataset I am playing with is not fully filled in, but I decided to take a peek at the results with a small sample size. I did this because I read something online suggesting that the power needed for Poisson regression is lower for large counts, and stock market volume includes some massive numbers. The dataset can be accessed here:
https://drive.google.com/file/d/1qvkwWSfUSodfceyNLvPjA4jqnWTDTeSo/view?usp=sharing
The code I used is presented below:
Stock<-read.csv("C:/FilePath/StockPrices.csv")
head(Stock)
summary(StockTest <- step(glm(formula = X2018.Volume ~ X2017.Stock.Price +
    X2017.Volume + Total.Revenue + Cost.of.Revenue + Research...Development +
    Selling.General...Administrative + Interest.Expense +
    Total.Other.Income...Expenses.Net + Income.Before.Tax + Income.Tax.Expense +
    Income.From.Continuing.Operation + Net.Income + Enviornment + Social +
    Governance, family = "poisson", data = Stock)))
1-StockTest$deviance/StockTest$null.deviance
predict(StockTest)
The model has a great Pseudo R-squared, but its predicted values are way off the actual values. See for yourself:
predict(StockTest)
15.47486 15.00441 15.00881 14.01175 15.01126 16.24620 15.99307 15.68193 15.67123 14.98932 14.77741 15.43363 12.07001 13.84586 15.83090 14.28052 15.16039 13.83686
Versus
Stock[,"X2018.Volume"]
[1] 5160000 110853500 3310000 3310000 1200000 876000 3310000 11400000 8830000 6380000 6410000
[12] 820000 3500000 2620000 4860000 199000 741000 7680000 1287769 3810000 1460000 2310000
What am I doing wrong here? Are there special considerations that need to be made when using the predict function on a Poisson model? Is Poisson regression not an appropriate analysis for the data I am playing with?
First you need to read the manual page for predict.glm: for a GLM, predict() defaults to type = "link", i.e. predictions on the scale of the linear predictor, which for a Poisson model is the log scale. Ask for type = "response" instead (here Stock.glm is the fitted model, i.e. your StockTest above):
predict(Stock.glm, type="response")
# 1 3 4 5 7 8 9
# 5255867.7 3283450.0 3297945.2 1216812.4 3306021.9 11366695.1 8824739.9
# 10 11 13 14 15 16 17
# 6465084.7 6396289.7 3234293.9 2616649.3 5043601.7 174557.7 1030814.3
# 18 19 20 21
# 7503622.7 1592024.5 3837723.8 1021574.3
Stock.glm$model$X2018.Volume
# [1] 5160000 3310000 3310000 1200000 3310000 11400000 8830000 6380000 6410000
# [10] 3500000 2620000 4860000 199000 741000 7680000 1287769 3810000 1460000
You cannot compare these directly to the original data since there are missing values: 4 rows of the original data were dropped from the data used in the analysis.
cor(Stock.glm$model$X2018.Volume, predict(Stock.glm, type="response"))
# [1] 0.9983086
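As a cross-check (a short sketch with the fitted model above): with the canonical log link, the response-scale predictions are simply the exponential of the link-scale predictions, which is why the question's raw predict() output hovered around 14-16:
all.equal(exp(predict(Stock.glm)), predict(Stock.glm, type = "response"))
# [1] TRUE
exp(15) # about 3.27 million, the right order of magnitude for these volumes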

R: adjusting a given time-series but keeping summary statistics equal

Let's say I have a time-series like this
t x
1 100
2 50
3 200
4 210
5 90
6 80
7 300
Is it possible in R to generate a new dataset x1 which has the exact same summary statistics, e.g. mean, variance, kurtosis, skew as x?
The reason for my asking is that I would like to do an experiment where I want to test how subjects react to different graphs of data that contain the same information.
I recently read:
Matejka, Justin, and George Fitzmaurice. "Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing." Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. ACM, 2017.
Chatterjee, Sangit, and Aldo Firat. "Generating Data with Identical Statistics but Dissimilar Graphics: A Follow Up to the Anscombe Dataset." The American Statistician, 2007.
However, Matejka and Fitzmaurice use Python code that is quite involved, and their data are more complex than time-series data, so I was wondering if there is a more efficient way to do this for a simpler dataset.
Best regards
I'm not aware of a package that can give you precisely what you are looking for. One option is using the datasets in the datasauRus package, as JasonAizkalns pointed out (see the short sketch after the example below). However, if you want to create your own dataset, you can try this:
Fit the Johnson distribution from the SuppDists package to get the moments of the dataset, and draw new sets from that distribution until the difference in moments is sufficiently small. Below is an example with your dataset, although more observations would make it easier to replicate the summary statistics:
library(SuppDists)
a <- c(100, 50, 200, 210, 90, 80, 300)
momentsDiffer <- function(x1, x2) {
  sum(abs(moments(x1) - moments(x2)))
}
repDataset <- function(x, n) {
  # fit a Johnson distribution to the data (the original hard-coded a here)
  parms <- JohnsonFit(x, moment = "quant")
  # draw from the fitted distribution n times, keeping the best draw so far
  current <- rJohnson(length(x), parms)
  momDiff <- momentsDiffer(x, current)
  for (i in 1:n) {
    temp <- rJohnson(length(x), parms)
    tempDiff <- momentsDiffer(x, temp)
    if (tempDiff < momDiff) {
      current <- temp
      momDiff <- tempDiff
    }
  }
  current
}
# Drawing 1000 times to allow improvement
b <- repDataset(a,1000)
> moments(b)
mean sigma skew kurt
148.14048691 84.24884165 1.04201116 -0.05008629
> moments(a)
mean sigma skew kurt
147.1428571 84.1281821 0.5894543 -1.0198303
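As an aside, the datasauRus option mentioned above already ships ready-made examples of this idea; a minimal sketch (assuming the datasauRus and dplyr packages are installed):
library(datasauRus)
library(dplyr)
# thirteen datasets whose x/y means and SDs agree to two decimal places
datasaurus_dozen %>%
  group_by(dataset) %>%
  summarise(mean_x = mean(x), sd_x = sd(x), mean_y = mean(y), sd_y = sd(y))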
EDIT - Added an additional method
Following the suggestion of @Jj Blevins, the method below generates a random sequence based upon the original sequence, leaving out 4 observations. Those 4 observations are then added back by solving a non-linear system on the differences between the four moments of the original sequence and the new sequence. This still does not generate a perfect match; feel free to improve.
library(nleqslv)
library(e1071)
set.seed(1)
a <- c(100, 50, 200, 210, 90, 80, 300)
#a <- floor(runif(1000, 0, 101))
init <- floor(runif(length(a) - 4, min(a), max(a) + 1))
targetMoments <- moments(a) # mean, sigma, skew, kurt (SuppDists)
f <- function(x) {
  # note: the original compared var() to sigma; sd() is the closer match,
  # though moments() uses an n divisor while sd() uses n - 1
  m <- mean(c(init, x))
  s <- sd(c(init, x))
  sk <- skewness(c(init, x))
  k <- kurtosis(c(init, x))
  c(m - targetMoments[1], s - targetMoments[2],
    sk - targetMoments[3], k - targetMoments[4])
}
result <- nleqslv(runif(4, min(a), max(a) + 1), f,
                  control = list(ftol = 1e-8, allowSingular = TRUE))
(The output below corresponds to the commented-out uniform example with 1,000 observations, not to the 7-point series.)
> moments(c(init,result$x))
mean sigma skew kurt
49.12747961 29.85435993 0.03327868 -1.25408078
> moments(a)
mean sigma skew kurt
49.96600000 29.10805462 0.03904256 -1.18250616

predict.lm after regression with missing data in Y

I don't understand how to generate predicted values from a linear regression using the predict.lm command when some values of the dependent variable Y are missing, even though no independent X observation is missing. Algebraically this isn't a problem, but I don't know an efficient way to do it in R. Take for example this fake dataframe and regression model. I attempt to assign predictions in the source dataframe but am unable to do so because of one missing Y value; I get an error.
# Create a fake dataframe
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(100,200,300,400,NA,600,700,800,900,100)
df <- as.data.frame(cbind(x,y))
# Regress X and Y
model<-lm(y~x+1)
summary(model)
# Attempt to generate predictions in the source dataframe, but I am unable to:
df$y_ip <- predict.lm(model)
Error in `$<-.data.frame`(`*tmp*`, y_ip, value = c(221.............
replacement has 9 rows, data has 10
I got around this problem by generating the predictions using algebra, df$y_ip <- B0 + B1*df$x, or by calling the coefficients of the model, df$y_ip <- summary(model)$coefficients[1, 1] + summary(model)$coefficients[2, 1]*df$x; however, I am now working with a big data model with hundreds of coefficients, and these methods are no longer practical. I'd like to know how to do it using the predict function.
Thank you in advance for your assistance!
There is built-in functionality for this in R, though it's not necessarily obvious: the na.action argument (see ?na.exclude). With this option set, predict() (and similar downstream processing functions) will automatically fill in NA values in the relevant spots.
Set up data:
df <- data.frame(x=1:10,y=100*(1:10))
df$y[5] <- NA
Fit model: default na.action is na.omit, which simply removes non-complete cases.
mod1 <- lm(y~x+1,data=df)
predict(mod1)
## 1 2 3 4 6 7 8 9 10
## 100 200 300 400 600 700 800 900 1000
na.exclude removes non-complete cases before fitting, but then restores them (filled with NA) in predicted vectors:
mod2 <- update(mod1,na.action=na.exclude)
predict(mod2)
## 1 2 3 4 5 6 7 8 9 10
## 100 200 300 400 NA 600 700 800 900 1000
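Because the prediction vector now has one entry per row of df, the assignment attempted in the question works directly:
df$y_ip <- predict(mod2) # length 10, with NA in row 5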
Actually, you are not using the predict.lm function correctly.
Either way, you have to pass the model itself as the first argument, here model, with or without new data. Without new data, it will only predict on the training data, thus excluding your NA row, and you need this workaround to fit the initial data.frame:
df$y_ip[!is.na(df$y)] <- predict.lm(model)
Or, explicitly specifying some new data. Since the new x has one more row than the training x, it will fill the missing row with a new prediction:
df$y_ip <- predict.lm(model, newdata = df)

BMA Poisson selects only 1 model, with all variables at 100% inclusion probability

I am trying to use Bayesian model averaging (BMA) for a Poisson model to select relevant variables from a large set of variables. In the selection process I force some variables into the model, since they are necessary from an econometric point of view (fixed effects).
The problem is that BMA seems to go through only 1 possible model, or at least reports only 1, and (therefore?) includes all variables in this model. However, I would of course like BMA to select only a subset of the variables.
My script looks as follows (I omitted about 80 variables to keep things simple here).
### Prep
rm(list = ls())
setwd("H:\\DAF_ENV Climate Change\\2015\\CPF bid on clean energy\\Models\\3rd model\\R-scripts and tables")
library(BMA)
## Loading data
D.all <- read.csv("H:/Data/full sets/3rd model/poissonbmadata.csv")
### Poisson family BMA run
fes <- c(2:15, 16:20) # fixed effects
indepvar <- c(65,66,67,68,72,74,75,76,77,78,79,80,81,82,83,84:91) # other independent variables
x <- D.all[, c(fes, indepvar)] # all independent vars
y <- D.all[, 1] # dependent variable
force <- rep(0.5, ncol(x)) # prior inclusion probability 0.5 = "not forced"
force[1:length(fes)] <- 1  # prior inclusion probability 1 forces the FEs in
glm.out.invbma <- bic.glm(x, y, glm.family = poisson(),
                          prior.param = force, factor.type = FALSE)
summary(glm.out.invbma)
The output is:
1 models were selected
Best 1 models (cumulative posterior probability = 1 ):
p!=0 EV SD model 1
Intercept 100 1.020e+01 1.304e-02 1.020e+01
yeardummy2001 100 -1.499e+00 6.912e-03 -1.499e+00
yeardummy2002 100 -1.369e+00 6.187e-03 -1.369e+00
yeardummy2003 100 -1.496e+00 6.575e-03 -1.496e+00
yeardummy2004 100 -1.250e+00 6.032e-03 -1.250e+00
yeardummy2005 100 -4.798e-01 5.219e-03 -4.798e-01
yeardummy2006 100 3.142e-01 3.956e-03 3.142e-01
yeardummy2007 100 5.242e-01 3.723e-03 5.242e-01
yeardummy2008 100 5.977e-01 3.697e-03 5.977e-01
yeardummy2009 100 5.813e-01 3.787e-03 5.813e-01
yeardummy2011 100 8.794e-01 3.603e-03 8.794e-01
yeardummy2012 100 8.238e-01 3.709e-03 8.238e-01
yeardummy2013 100 6.407e-01 3.771e-03 6.407e-01
yeardummy2014 100 -2.690e+00 7.676e-02 -2.690e+00
secdummy5 100 1.830e-01 2.335e-03 1.830e-01
nrecomp 100 -7.020e-06 8.813e-08 -7.020e-06
fit 100 2.453e+00 5.869e-03 2.453e+00
tender 100 1.188e-03 2.087e-06 1.188e-03
pmr 100 -3.908e-02 2.645e-03 -3.908e-02
etspart 100 -5.910e-01 2.354e-03 -5.910e-01
terteduc 100 -1.288e-02 7.119e-05 -1.288e-02
wdi532 100 8.007e-03 5.450e-05 8.007e-03
wdi533 100 -1.473e-02 1.651e-04 -1.473e-02
wdi534 100 -7.677e-02 4.526e-04 -7.677e-02
wdi535 100 -4.996e-03 2.699e-05 -4.996e-03
wdi537 100 -3.188e-03 1.241e-05 -3.188e-03
wdi538 100 -1.778e-03 1.679e-05 -1.778e-03
wdi5310 0 0.000e+00 0.000e+00 .
dobus 100 -2.662e-02 1.208e-04 -2.662e-02
transloss 100 -9.329e-02 3.339e-04 -9.329e-02
nVar 28
BIC 5.970e+06
post prob 1
I already tested removing the FEs, including more or fewer variables, and forcing different or no variables. In almost all cases, all variables end up at 100% inclusion probability. In some cases two models are selected, but still all variables are at >50% inclusion probability.
The context of the model is the influence of various factors on renewable electricity investment. And, no, I don't think all of these factors have an influence warranting a 100% inclusion probability. :)
Edit: The version above also gave warnings after bic.glm regarding NAs. Once I removed the NAs via D.all <- D.all.original[complete.cases(D.all.original),] and excluded the year dummies (one or two of them might have become collinear with the constant due to the NA removal), the results still looked similar, except for the now-missing year FEs.
