Mclust() - NAs in model selection - r

I recently tried to fit a Gaussian mixture model (GMM) in R to a multivariate matrix (400 observations of 196 variables) whose elements belong to known categories. The Mclust() function (from the mclust package) gave very poor results: only around 30% of individuals were correctly classified, whereas with k-means the result reaches more than 90%.
Here is my code:
library(mclust)
X <- read.csv("X.csv", sep = ",", header = TRUE)
y <- read.csv("y.csv", sep = ",")
nclusters <- 5 # I want 5 clusters
gmm <- Mclust(X, G = nclusters)
cl_gmm <- gmm$classification
cl_gmm_lab <- cl_gmm
for (k in 1:nclusters) {
  ii <- which(cl_gmm == k)  # individuals of group k
  counts <- table(y[ii])    # number of occurrences of each label
  imax <- which.max(counts) # majority label
  maj_lab <- names(counts)[imax]
  print(paste("Group ", k, ", majority label = ", maj_lab))
  cl_gmm_lab[ii] <- maj_lab
}
conf_mat_gmm <- table(y, cl_gmm_lab) # CONFUSION MATRIX
The problem seems to come from the fact that every model other than "EII" (spherical, equal volume) is NA when looking at gmm$BIC.
So far I have not found any solution to this problem... are you familiar with this issue?
Here is the link for the data: https://drive.google.com/file/d/1j6lpqwQhUyv2qTpm7KbiMRO-0lXC3aKt/view?usp=sharing
Here is the link for the labels: https://docs.google.com/spreadsheets/d/1AVGgjS6h7v6diLFx4CxzxsvsiEm3EHG7/edit?usp=sharing&ouid=103045667565084056710&rtpof=true&sd=true

I finally found the answer. GMMs simply cannot estimate every covariance model when too many explanatory variables are involved. The right thing to do is to reduce the dimensionality first, selecting a number of dimensions that makes it possible to fit the GMM properly while preserving as much information about the data as possible.
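For reference, a minimal sketch of that workflow on the data above, using principal components (the number of components kept, 10 here, is an arbitrary illustration; in practice choose it from the explained variance):
library(mclust)
# Reduce dimensionality first, then fit the mixture model on the component scores
pca <- prcomp(X, center = TRUE, scale. = TRUE)
summary(pca)            # check the cumulative proportion of variance explained
X_red <- pca$x[, 1:10]  # keep the first 10 principal components (illustrative choice)
gmm <- Mclust(X_red, G = 5)
summary(gmm$BIC)        # models other than EII should no longer be NA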

Related

How to iterate a given process 1'000 times and average the results

I would like to ask about the R language and how to construct a loop that iterates some functions several times.
Here is my problem: I have a numeric matrix obtained from previous analyses (matrix1) that I want to compare (using the overlap function, which returns a single value) with another numeric matrix that I get by extracting values of a given raster at a set of randomly created points, as many points as there are values in the first matrix.
I want to repeat the random sampling procedure 1'000 times, in order to get 1'000 different sets of random points, then repeat the comparison with matrix1 1'000 times (one for each set of random points), and, in the end, calculate the mean of the results to get a single value.
Hereafter I give you an example of the functions I want to use:
#matrix1 is the first matrix, obtained before starting the potential loop;
#LineVector is a polyline shapefile that has to be used within the loop and downloaded before it;
#Raster is a raster from which I should extract values at points location;
#The loop should start from here:
Random_points <- st_sample(LineVector, size = 2000, exact = TRUE, type = "random")
Random_points <- Random_points[!st_is_empty(Random_points)]
Random_points_vect <- vect(Random_points)
Random_values <- terra::extract(Raster, Random_points_vect, ID = F, raw = T)
Random_values <- na.omit(Random_values[, c("Capriolo")])
Values_list <- list(matrix1, Random_values)
Overlapping_value <- overlap(Values_list, type = "2")
#This value, obtained 1'000 times, has then to be averaged into a single number.
I hope I have posed my question in a clear and understandable manner, and I hope you can help me with this problem.
Thanks to everyone in advance, I wish you a good day!
The easiest way I can figure out is to use replicate():
values <- replicate(1000, {
  Random_points <- st_sample(LineVector, size = 2000, exact = TRUE, type = "random")
  Random_points <- Random_points[!st_is_empty(Random_points)]
  Random_points_vect <- vect(Random_points)
  Random_values <- terra::extract(Raster, Random_points_vect, ID = FALSE, raw = TRUE)
  Random_values <- na.omit(Random_values[, c("Capriolo")])
  Values_list <- list(matrix1, Random_values)
  overlap(Values_list, type = "2")
})
mean(values)

How to work with binary constraints in linear optimization?

I have two input matrices, dt (10x3) and wt (3x3), that I need to use to find the optimal decision matrix Par (10x3, same dimensions as dt) so as to maximize an objective function. The R code below should give some direction on the problem (sample inputs are used here):
# Input matrices
dt <- matrix(runif(30), 10, 3)
wt <- matrix(c(1,0,0,0,2,0,0,0,1), 3, 3) # weights
# Objective function
Obj <- function(Par) {
  P <- matrix(Par, nrow = 10, byrow = FALSE) # reshape the parameter vector into a 10 x 3 matrix
  X <- t((dt %*% wt)[, 1]) %*% P[, 1]
  Y <- t((dt %*% wt)[, 2]) %*% P[, 2]
  Z <- t((dt %*% wt)[, 3]) %*% P[, 3]
  as.numeric(X + Y + Z) # maximize
}
Now I am struggling to apply the following constraints to the problem :
1) The matrix Par can only contain binary values (0 or 1)
2) rowSums(Par) = 1 (basically, each row can have a 1 in only one of the three columns)
3) sum(Par[,1]) <= 5, sum(Par[,2]) <= 6, and sum(Par[,3]) <= 4
4) X/(X+Y+Z) < 0.35 and Y/(X+Y+Z) < 0.4 (X, Y, Z are defined in the objective function)
I tried coding the constraints with constrOptim, but I am not sure how to specify binary and integer constraints. I am reading up on lpSolve, but have not been able to figure it out. Any help is much appreciated. Thanks!
I believe this is indeed a MIP, so there are no issues with convexity. If I am correct, the model can look like this:
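A sketch of such a formulation (reconstructed from the constraints in the question; $c_{ij}$ denotes the entries of dt %*% wt, $L_j$ the column limits, and $x_{ij}$ the binary decision variables):

$$
\begin{aligned}
\max_x \quad & \sum_i \sum_j c_{ij} x_{ij} \\
\text{s.t.} \quad & \sum_j x_{ij} = 1 && \text{for all rows } i \\
& \sum_i x_{ij} \le L_j && \text{for all columns } j \\
& \sum_i c_{i1} x_{i1} \le 0.35 \sum_i \sum_j c_{ij} x_{ij} \\
& \sum_i c_{i2} x_{i2} \le 0.4 \sum_i \sum_j c_{ij} x_{ij} \\
& x_{ij} \in \{0,1\}
\end{aligned}
$$

The ratio constraints are already linear in $x$ once written this way, so the whole model remains a MIP.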
This model can be easily transcribed into R. Note that LP/MIP solvers do not use functions for the objective and constraints (as opposed to NLP solvers). In R, one typically builds up matrices of the LP coefficients.
Note: I had to make the limits on the column sums much larger (I used 50,60,40).
Based on Erwin's response, I am able to formulate the model using lpSolveAPI in R. However, I am still struggling to add the final constraint to the model (the 4th constraint in my question above). Here's what I have been able to code so far:
# input dimensions
r <- 10
c <- 3
# input matrices
dt <- matrix(runif(r*c), r, c)
wt <- matrix(c(1,0,0,0,2,0,0,0,1), 3, 3) # weights
# column limits
c.limit <- c(60, 50, 70)
# create the model structure with lpSolveAPI
library(lpSolveAPI)
ncol <- r*c
lp.create <- make.lp(ncol = ncol)
set.type(lp.create, columns = 1:ncol, type = "binary")
# create objective values
obj.vals <- as.vector(t(dt %*% wt))
set.objfn(lp.create, obj.vals)
lp.control(lp.create, sense = 'max')
# Add constraints to ensure the sum of parameters for every row (rowSums) <= 1
for (i in 1:r) {
  add.constraint(lp.create, xt = c(1, 1, 1),
                 indices = c(3*i - 2, 3*i - 1, 3*i), rhs = 1, type = "<=")
}
# Add constraints to ensure the sum of parameters for every column (colSums) <= column limit (defined above)
for (i in 1:c) {
  add.constraint(lp.create, xt = rep(1, r),
                 indices = seq(i, ncol, by = c), rhs = c.limit[i], type = "<=")
}
# Add constraints to ensure the column objectives (t((dt %*% wt)[, i]) %*% P[, i]) respect the limits defined in the problem
# NOT SURE HOW TO APPLY A CONSTRAINT THAT IS DEPENDENT ON THE OBJECTIVE FUNCTION
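# One possible way (a sketch, not verified against the original data): constraint 4 is
# linear once rearranged, relaxing the strict "<" to "<=":
#   X <= 0.35*(X+Y+Z)  becomes   0.65*X - 0.35*Y - 0.35*Z <= 0
#   Y <= 0.40*(X+Y+Z)  becomes  -0.40*X + 0.60*Y - 0.40*Z <= 0
# X, Y and Z are linear in the decision variables, with coefficients given by the
# columns of dt %*% wt, so the per-variable constraint coefficients (in the same
# row-major order as obj.vals) are:
coef.mat <- dt %*% wt
xt.X <- as.vector(t(sweep(coef.mat, 2, c(0.65, -0.35, -0.35), "*")))
xt.Y <- as.vector(t(sweep(coef.mat, 2, c(-0.40, 0.60, -0.40), "*")))
add.constraint(lp.create, xt = xt.X, rhs = 0, type = "<=")
add.constraint(lp.create, xt = xt.Y, rhs = 0, type = "<=")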
solve(lp.create)
get.objective(lp.create) #20
final.par <- matrix(get.variables(lp.create), ncol = c, byrow = TRUE) # reshape the solution vector into the decision matrix
Any help that can get me to the finish line is much appreciated :)
Thanks

knapsack case r implementation for multiple persons using genetic algorithm

I am trying to implement a genetic algorithm in R. I found out that R has the 'GA' and 'genalg' packages for genetic algorithm implementations. I came across the example at http://www.r-bloggers.com/genetic-algorithms-a-simple-r-example/, which solves the knapsack problem. The problem can be briefly explained as:
"You are going to spend a month in the wilderness. You’re taking a backpack with you, however, the maximum weight it can carry is 20 kilograms. You have a number of survival items available, each with its own number of 'survival points'. Your objective is to maximize the number of survival points."
The problem is easily solved using the 'genalg' package for a single person, and the output is a binary string. Now I have a question: let's say that instead of one person there are two or more, i.e. multiple people, and we need to distribute the survival items among them. The weight constraint applies to each person. How can we solve this problem? Can we use the 'genalg' or 'GA' package? If so, how can we apply them? Are there any examples of this solved in R or other software?
Thanks
The R package adagio (https://cran.r-project.org/web/packages/adagio/index.html) comes with two functions (knapsack and mknapsack) which solve this type of problem more efficiently by dynamic programming.
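For the multi-person variant, a rough sketch with mknapsack on the item data from the r-bloggers example (note that mknapsack assigns each item to at most one person, unlike the GA approach below where several people can take the same item; the argument order profits/weights/capacities is an assumption, so check ?mknapsack for the exact interface and return value):
library(adagio)
p <- c(10, 20, 15, 2, 30, 10, 30) # survival points (profits)
w <- c(1, 5, 10, 1, 7, 5, 1)      # item weights
k <- c(20, 20, 20)                # one 20 kg backpack capacity per person
res <- mknapsack(p, w, k)         # assumed interface: mknapsack(profits, weights, capacities)
str(res)                          # inspect which items end up in which knapsack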
A simple approach could be to have one chromosome containing all individuals in the group and have the evaluation function split this chromosome in multiple parts, one for each individual and then have these parts evaluated. In the example below (based on the example in the question) I have assumed each individual has the same weight limit and multiple individuals can bring the same item.
library(genalg)
# Set up the problem parameters
# How many people are in the group
individual_count <- 3
# The weight limit for one individual
weightlimit <- 20
# The items with their survival points
dataset <- data.frame(item = c("pocketknife", "beans", "potatoes", "unions",
                               "sleeping bag", "rope", "compass"),
                      survivalpoints = c(10, 20, 15, 2, 30, 10, 30),
                      weight = c(1, 5, 10, 1, 7, 5, 1))
#Next, we choose the number of iterations, design and run the model.
iter <- 100
#Our chromosome has to be large enough to contain a bit for all individuals and for all items in the dataset
chromosomesize <- individual_count * nrow(dataset)
#Function definitions
# A function to split vector x into n equal parts
split_vector <- function(x, n) split(x, cut(seq_along(x), n, labels = FALSE))
# Evaluate an individual (a part of the chromosome)
evalIndividual <- function(x) {
  current_solution_survivalpoints <- x %*% dataset$survivalpoints
  current_solution_weight <- x %*% dataset$weight
  if (current_solution_weight > weightlimit)
    return(0) else return(-current_solution_survivalpoints)
}
# Evaluate a chromosome
evalFunc <- function(x) {
  # First split the chromosome into a list of individuals, then evaluate each individual
  individuals <- split_vector(x, individual_count)
  # Apply the evalIndividual function to each element of individuals and sum the results
  return(sum(sapply(individuals, evalIndividual)))
}
#Run the Genetic Algorithm
GAmodel <- rbga.bin(size = chromosomesize, popSize = 200, iters = iter, mutationChance = 0.01,
elitism = T, evalFunc = evalFunc)
#First show a summary
summary(GAmodel,echo=TRUE)
#Then extract the best solution from the GAmodel, copy/paste from the source code of the summary function
filter = GAmodel$evaluations == min(GAmodel$evaluations)
bestSolution = GAmodel$population[filter, , drop= FALSE][1,]
#Now split the solution in the individuals.
split_vector(bestSolution,individual_count)

Using a for loop for performing several regressions

I am currently performing a style analysis using the following method: http://www.r-bloggers.com/style-analysis/ . It is a constrained regression of one asset on a number of benchmarks, over a rolling 36-month window.
My problem is that I need to perform this regression for a fairly large number of assets, and doing it one by one would take a huge amount of time. To be more precise: is there a way to tell R to regress columns 1-100 one by one on columns 101-116? Of course this also means printing 100 different plots, one for each asset. I am new to R and have been stuck for several days now.
I hope it doesn't matter that the following excerpt isn't reproducible, since the code works as originally intended.
# Style Regression over Window, constrained
#--------------------------------------------------------------------------
# setup
load.packages('quadprog')
style.weights[] = NA
style.r.squared[] = NA
# Setup constraints
# 0 <= x.i <= 1
constraints = new.constraints(n, lb = 0, ub = 1)
# SUM x.i = 1
constraints = add.constraints(rep(1, n), 1, type = '=', constraints)
# main loop
for (i in window.len:ndates) {
  window.index = (i - window.len + 1):i
  fit = lm.constraint(hist.returns[window.index, -1], hist.returns[window.index, 1], constraints)
  style.weights[i,] = fit$coefficients
  style.r.squared[i,] = fit$r.squared
}
# plot
aa.style.summary.plot('Style Constrained', style.weights, style.r.squared, window.len)
Thank you very much for any tips!
"Is there a way to tell R to regress columns 1-100 one by one on colums 101-116."
Yes! You can use a for loop, but you there's also a whole family of 'apply' functions which are appropriate. Here's a generalized solution with a random / toy dataset and using lm(), but you can sub in whatever regression function you want
# data frame of 116 cols of 20 rows
set.seed(123)
dat <- as.data.frame(matrix(rnorm(116*20), ncol=116))
# with a for loop
models <- list() # empty list to store models
for (i in 1:100) {
  models[[i]] <-
    lm(formula = x ~ ., data = data.frame(x = dat[, i], dat[, 101:116]))
}
# with lapply
models2 <-
  lapply(1:100,
         function(i) lm(formula = x ~ .,
                        data = data.frame(x = dat[, i], dat[, 101:116])))
# compare: they give the same results!
all.equal(models, models2)
# to access a single model, use [[#]]
models2[[1]]

How to find significant correlations in a large dataset

I'm using R.
My dataset has about 40 different variables/vectors, each with about 80 entries. I'm trying to find significant correlations; that is, I want to pick one variable and let R calculate all the correlations of that variable with the other 39 variables.
I tried to do this by using a linear model with one explanatory variable, that is: Y = a*X + b.
The lm() command then gives me an estimate of a and the p-value of that estimate. I would then go on and use one of the other variables for X and try again until I find a p-value that's really small.
I'm sure this is a common problem. Is there some sort of package or function that can try all these possibilities (brute force), show them, and maybe even sort them by p-value?
You can use the function rcorr from the package Hmisc.
Using the same demo data from Richie:
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))
Then:
library(Hmisc)
correlations <- rcorr(as.matrix(the_data))
To access the p-values:
correlations$P
To visualize you can use the package corrgram
library(corrgram)
corrgram(the_data)
This will produce a correlogram of all the pairwise correlations.
In order to print a list of the significant correlations (p < 0.05), you can use the following.
Using the same demo data from #Richie:
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))
Install Hmisc
install.packages("Hmisc")
Import library and find the correlations (#Carlos)
library(Hmisc)
correlations <- rcorr(as.matrix(the_data))
Loop over the values printing the significant correlations
for (i in 1:m) {
  for (j in 1:m) {
    if (!is.na(correlations$P[i, j])) {
      if (correlations$P[i, j] < 0.05) {
        print(paste(rownames(correlations$P)[i], "-", colnames(correlations$P)[j], ": ", correlations$P[i, j]))
      }
    }
  }
}
Warning
You should not use this to draw any serious conclusions; it is only useful for exploratory analysis and for formulating hypotheses. If you run enough tests, you increase the probability of finding some significant p-values by random chance: https://www.xkcd.com/882/. There are statistical methods that are more suitable for this and that make adjustments to compensate for running multiple tests, e.g. https://en.wikipedia.org/wiki/Bonferroni_correction.
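A minimal sketch of such an adjustment, applied to the rcorr() p-values computed above (Bonferroni here; method = "BH" is a common, less conservative alternative):
# Adjust the pairwise p-values for multiple testing before calling anything significant
p_raw <- correlations$P[upper.tri(correlations$P)]
p_adj <- p.adjust(p_raw, method = "bonferroni")
sum(p_adj < 0.05, na.rm = TRUE) # how many pairs survive the correction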
Here's some sample data for reproducibility.
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))
You can calculate the correlation between two columns using cor. This code loops over all columns except the first one (which contains our response), and calculates the correlation between that column and the first column.
correlations <- vapply(
  the_data[, -1],
  function(x) {
    cor(the_data[, 1], x)
  },
  numeric(1)
)
You can then find the column with the largest magnitude of correlation with y using:
correlations[which.max(abs(correlations))]
So knowing which variables are correlated with which other variables can be interesting, but please don't draw any big conclusions from this knowledge. You need to have a proper think about what you are trying to understand, and which techniques you need to use. The folks over at Cross Validated can help.
If you are trying to predict y using only one variable, then you should take the one that is most strongly correlated with y.
To do this, just use the command which.max(abs(cor(x, y))). If you want to use more than one variable in your model, then you should consider something like the lasso estimator, for example as sketched below.
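A minimal lasso sketch with glmnet, reusing the_data from the answers above (alpha = 1 selects the lasso penalty; lambda is chosen by cross-validation):
library(glmnet)
x <- as.matrix(the_data[, -1])       # the 39 candidate predictors
y <- the_data$y                      # the response
cv_fit <- cv.glmnet(x, y, alpha = 1) # lasso with cross-validated lambda
coef(cv_fit, s = "lambda.min")       # predictors with non-zero coefficients are the selected ones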
One option is to run a correlation matrix:
cor_result <- cor(data)
write.csv(cor_result, file = "cor_result.csv")
This correlates all the variables in the file against each other and outputs a matrix.
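To pick out the strongest pairs from that matrix rather than scanning the CSV by eye, something like the following works (the 0.5 cutoff is arbitrary, purely for illustration):
# List variable pairs whose absolute correlation exceeds the chosen threshold
idx <- which(abs(cor_result) > 0.5 & upper.tri(cor_result), arr.ind = TRUE)
data.frame(var1 = rownames(cor_result)[idx[, 1]],
           var2 = colnames(cor_result)[idx[, 2]],
           r    = cor_result[idx])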
