To understand my problem, you will need the whole dataset: https://pastebin.com/82paf0G8
Pre-processing: I had a list of orders and 696 unique item numbers, and wanted to cluster the items based on how frequently each pair of items is ordered together. For each pair of items I counted the number of orders in which both occur; the highest count between two items was 489. I then "calculated" the similarity/correlation as frequency / "max frequency over all pairs" (489). The result is the dataset I uploaded.
Similarity/correlation: I don't know whether my similarity measure is the best choice in this case. I also tried Jaccard's coefficient/index, but got almost the same results.
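For reference, here is a minimal sketch of how both measures can be computed from raw order data. The orders data frame below is hypothetical (the post only shares the pre-computed pair list), but the two formulas match the ones described above.
# Hypothetical raw data: one row per (order, item) combination
orders <- data.frame(order_id = c(1, 1, 1, 2, 2, 3),
                     item     = c("A", "B", "C", "A", "B", "C"))
inc <- table(orders$order_id, orders$item) > 0   # order x item incidence matrix
co  <- crossprod(inc)                            # co-occurrence counts between items
sim <- co / max(co[upper.tri(co)])               # the post's "frequency / max pair frequency"
# Jaccard index: co-occurrences / (orders with item i + orders with item j - co-occurrences)
n_item  <- diag(co)                              # number of orders containing each item
jaccard <- co / (outer(n_item, n_item, "+") - co)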
The dataset: The dataset contains material numbers V1 and V2, and N is the correlation between the two material numbers, ranging from 0 to 1.
With help from another user, I managed to create a distance matrix and apply PAM clustering.
Why PAM clustering? A data scientist suggested it: more than 95% of the pairs carry no information, which puts all of these materials at the same distance and produces a single, very dispersed cluster. This problem can be mitigated with the PAM algorithm, but you will still get one very concentrated group. Another option is to increase the weight of the distances other than one.
Problem 1: The matrix is only 567x567. I think that for clustering I need the full 696x696 matrix, even though many entries would be zero, but I'm not sure.
Problem 2: The clustering does not work very well. I get very concentrated clusters: a large share of the items ends up in the first cluster. Also, judging by the usual ways of validating PAM clusters, my results are poor. Is this due to the similarity measure? What else should I use? Is it due to 95% of the data being zeros? Should I replace the zeros with something else?
The whole code and results:
# Suppose X is the dataset from the pastebin link
library(data.table)
df <- data.table(X)
# Symmetrise the pair list and cast it into a wide similarity matrix
ss <- dcast(rbind(df, df[, .(V1 = V2, V2 = V1, N)]), V1~V2, value.var = "N")[, -1]
ss <- ss/max(ss, na.rm = TRUE)   # rescale by the maximum similarity
ss[is.na(ss)] <- 0               # pairs that never occur together get similarity 0
diag(ss) <- 1                    # an item is perfectly similar to itself
Now apply PAM clustering:
library(cluster)
dd2 <- as.dist(1 - sqrt(ss))        # turn the similarity into a dissimilarity
pam2 <- pam(dd2, 4)                 # PAM with k = 4 medoids
summary(as.factor(pam2$clustering)) # cluster sizes
But I get very concentrated clusters, as:
1 2 3 4
382 100 23 62
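As a quick sketch of how poor or good this partition actually is, the average silhouette width stored in the pam object can be inspected (this reuses dd2 and pam2 from the code above):
pam2$silinfo$avg.width                                    # average silhouette width; values near 0 indicate weak structure
sapply(2:10, function(k) pam(dd2, k)$silinfo$avg.width)   # compare a range of k values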
I'm not sure where you get the 696 number from. After the rbind you have a dataframe with 567 unique values for V1 and V2, and after the dcast you end up, as expected, with a 567 x 567 matrix. Clustering-wise I see no issue with your clusters.
dim(df) # [1] 7659 3
test <- rbind(df, df[, .(V1 = V2, V2 = V1, N)])
dim(test) # [1] 15318 3
length(unique(test$V1)) # 567
length(unique(test$V2)) # 567
test2 <- dcast(test, V1~V2, value.var = "N")[,-1]
dim(test2) # [1] 567 567
Mayo, forget what the data scientist said about PAM. Since you've mentioned that this work is for a thesis, then from an academic viewpoint your current justification for why PAM is required does not hold any merit. Essentially, you need to either prove or justify why PAM is a necessity for your case study. And given the nature of the (continuous) variables in the dataset, V1, V2, N, I do not see the logic of why PAM is applicable here (as I mentioned in the comments, PAM works best for mixed variables).
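To make the "mixed variables" remark concrete, here is a small hypothetical sketch of the setting PAM is usually paired with: a Gower dissimilarity from cluster::daisy() on a data frame mixing numeric and categorical columns.
library(cluster)
mixed <- data.frame(price  = c(10, 12, 95, 80),
                    colour = factor(c("red", "red", "blue", "blue")))
d_gower <- daisy(mixed, metric = "gower")   # handles numeric and factor columns together
pam(d_gower, k = 2)$clustering              # PAM on the mixed-type dissimilarity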
Continuing further, see this post on correlation detection in R:
# Objective: Detect Highly Correlated variables, visualize them and remove them
data("mtcars")
my_data <- mtcars[, c(1,3,4,5,6,7)]
# print the first 6 rows
head(my_data, 6)
# compute correlation matrix using the cor()
res<- cor(my_data)
round(res, 2) # Unfortunately, the function cor() returns only the correlation coefficients between variables.
# Visualize the correlation
# install.packages("corrplot")
library(corrplot)
corrplot(res, type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45)
# Positive correlations are displayed in blue and negative correlations in red color. Color intensity and the size of the circle are proportional to the correlation coefficients. In the right side of the correlogram, the legend color shows the correlation coefficients and the corresponding colors.
# tl.col (for text label color) and tl.srt (for text label string rotation) are used to change text colors and rotations.
#Apply correlation filter at 0.80,
#install.packages("caret", dependencies = TRUE)
library(caret)
highlyCor <- colnames(my_data)[findCorrelation(res, cutoff = 0.80, verbose = TRUE)]
# show highly correlated variables
highlyCor
[1] "disp" "mpg"
removeHighCor <- findCorrelation(res, cutoff = 0.80) # returns indices of highly correlated variables
# remove highly correlated variables from the dataset
my_data <- my_data[,-removeHighCor]
dim(my_data)
[1] 32 4
Hope this helps.
Related
I'm currently working with a large matrix (4 cols and around 8000 rows).
I want to perform a correlation analysis using Pearson's correlation coefficient between the different rows composing this matrix.
I would like to proceed the following way:
Find Pearson's correlation coefficient between row 1 and row 2. Then between rows 1 and 3... and so on with the rest of the rows.
Then find Pearson's correlation coefficient between row 2 and row 3. Then between rows 2 and 4... and so on with the rest of the rows. Note I won't find the coefficient with row 1 again...
For those coefficients being higher or lower than 0.7 or -0.7 respectively, I would like to list on a separate file the row names corresponding to those coefficients, plus the coefficient. E.g.:
row 230 - row 5812 - 0.76
I wrote the following code for this aim. Unfortunately, it takes far too long to run (I estimated almost a week :( ).
arist1p <- NULL  # will hold: row name i, row name j, correlation
for (i in 1:7999) {
  print("Analyzing row:")
  print(i)
  for (j in (i+1):8000) {
    value <- cor(alpha1k[i,], alpha1k[j,], use = "everything", method = "pearson")
    if (value > 0.7 | value < (-0.7)) {
      aristi <- c(row.names(alpha1k)[i], row.names(alpha1k)[j], value)
      arist1p <- rbind(arist1p, aristi)
    }
  }
}
Then my question is whether there's any way I could do this faster. I read about making these calculations in parallel, but I have no clue how to make that work. I hope I made myself clear enough, thank you in advance!
As Roland pointed out, you can use the matrix version of cor to simplify your task. Just transpose your matrix to get a "row" comparison.
mydf <- data.frame(a = c(1,2,3,1,2,3,1,2,3,4), b = rep(5,2,10), c = c(1:10))
cor_mat <- cor(t(mydf)) # correlation of your transposed matrix
idx <- which((abs(cor_mat) > 0.7), arr.ind = T) # get relevant indexes in a matrix form
cbind(idx, cor_mat[idx]) # combine coordinates and the correlation
Note that the parameters use = "everything" and method = "pearson" are the defaults for cor(), so there is no need to specify them.
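Since the question also asks not to list each pair twice, one possible follow-up (reusing cor_mat from above) is to keep only the upper triangle and drop self-correlations:
idx_up <- which(abs(cor_mat) > 0.7 & upper.tri(cor_mat), arr.ind = TRUE)  # each pair once, no diagonal
data.frame(row1 = rownames(cor_mat)[idx_up[, 1]],
           row2 = colnames(cor_mat)[idx_up[, 2]],
           cor  = cor_mat[idx_up])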
I would like to explore the profile of two modalities of a categorical variable over time with respect to a given set of other categorical variables. I paste a reproducible example of such a dataset below.
set.seed(90114)
V1<-sample(rep(c("a", "A"), 100))
V2<-sample(rep(c("a", "A", "b", "B"), 50))
V3<-sample(rep(c("F", "M", "I"), 67), 200)
V4<-sample(rep(c("C", "R"), 100))
V5<-sample(rep(c(1970, 1980, 1990, 2000, 2010), 40))
data<-data.frame(V1, V2, V3, V4, V5)
To explore the behavior of such modalities, I decided to use Multiple Correspondence Analysis (package FactoMineR). To account for variation over time, one possibility is to split the dataset into 5 subsamples representing the different levels of V5 and then run MCA on each subset. The rest of the analysis consists in comparing the positions of the modalities across the different biplots. However, this practice is not without problems if the original dataset is too small. In that case, the dimensions could be flipped or, worse, the locations of the active variables are likely to change from one plot to the other.
To avoid the problem, one solution could be to stabilize the position of the active variables across all the subsets and then predict the coordinates of the supplementary variable, allowing the latter to move over time. I read somewhere that the coordinates of a modality can be obtained by computing the weighted mean of the coordinates of the individuals in which that modality is found. Finding the coordinates of a modality for the year 1970 would then boil down to computing the weighted mean of the coordinates of the individuals in the 1970 subset that carry that modality. However, I don't know whether this is common practice, and if so, I don't know how to implement the calculation. I paste the rest of the code so you can visualize the problem.
library(FactoMineR)
data.mca <- MCA(data[, -5], quali.sup = 1, graph = FALSE)
# Retrieve the coordinates of the first and second dimension
DIM1<-data.mca$ind$coord[, 1]
DIM2<-data.mca$ind$coord[, 2]
# Append the coordinates to the original dataframe
data1<-data.frame(data, DIM1, DIM2)
# Split the data into 5 clusters according to V5 ("year")
data1.split<-split(data1, data1$V5)
data1.split <- lapply(data1.split, function(x) x[, -5]) # drop the fifth column (years), no longer needed
seventies<-as.data.frame(data1.split[1])
eightties<-as.data.frame(data1.split[2])
# ...
a.1970<-seventies[seventies$X1970.V1=="a",]
A.1970<-seventies[seventies$X1970.V1=="A",]
# The idea, then, is to find the coordinates of the modalities "a" and "A" by computing the weighted mean of their respective individuals for each subset. The arithmetic mean would yield
# a.1970.DIM1<-mean(a.1970$X1970.DIM1) # 0.0818
# a.1970.DIM2<-mean(a.1970$X1970.DIM2) # 0.1104
# and so on for the other levels of V5.
I thank you in advance for your help!
I found a solution to my problem. We can simply weight the mean of the coordinates by the values returned by row.w in FactoMineR. To account for the dilatation of the MCA, the resulting coordinates of the barycentres should be divided by the square root of the eigenvalue of the corresponding dimension.
DIM1<-data.mca$ind$coord[, 1]
DIM2<-data.mca$ind$coord[, 2]
WEIGHT<-data.mca$call$row.w
data1<-data.frame(data, WEIGHT, DIM1, DIM2)
# Splitting the dataset according to values of V1
v1_a<-data1[data1$V1=="a",]
v1_A<-data1[data1$V1=="A",]
# Computing the weighted average of the coordinates of Dim1 and Dim2 for the first category of V1
V1_a_Dim1<-sum(v1_a$WEIGHT*v1_a$DIM1)/100 # divide by the sum of the row weights of the 100 individuals carrying "a"; gives -0.0248
v1_a_Dim2<-sum(v1_a$WEIGHT*v1_a$DIM2)/100 # -0.0382
# Account for the dilatation of the dimensions...
V1_a_Dim1/sqrt(data.mca$eig[1,1])
[1] -0.03923839
v1_a_Dim2/sqrt(data.mca$eig[2,1])
[1] -0.06338353
# ... which is the same as the following:
categories<-data.mca$quali.sup$coord[, 1:2]
categories
# Dim 1 Dim 2
# V1_a -0.03923839 -0.06338353
# V1_A 0.03923839 0.06338353
This can be applied to different partitions of the data according to V5 or any other categorical variable.
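As a sketch of that last sentence (hypothetical, but following the same weighted-mean idea with the objects created above), the barycentre of V1 == "a" can be computed within each level of V5 by dividing by the sum of the weights in each subset instead of the fixed 100:
by_year <- split(data1, data1$V5)
sapply(by_year, function(d) {
  d_a <- d[d$V1 == "a", ]
  c(Dim1 = sum(d_a$WEIGHT * d_a$DIM1) / sum(d_a$WEIGHT) / sqrt(data.mca$eig[1, 1]),
    Dim2 = sum(d_a$WEIGHT * d_a$DIM2) / sum(d_a$WEIGHT) / sqrt(data.mca$eig[2, 1]))
})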
Problem
I have a dataframe that is composed of more than 5 variables at any time, and I am trying to run K-Means on it. Because K-Means is greatly affected by outliers, I've spent a few hours looking for how to calculate and remove multivariate outliers. Most examples are demonstrated with 2 variables.
Possible Solutions Explored
mvoutlier - A kind user here noted that mvoutlier may be what I need.
Another Outlier Detection Method - A poster here commented with a mix of R functions to generate an ordered list of outliers.
Issues thus Far
Regarding mvoutlier, I was unable to generate a result because it reported that my dataset contains negative values and could not run because of that. I'm not sure how to make my data positive-only, since I need the negatives in the set I am working with.
Regarding Another Outlier Detection Method, I was able to come up with a list of outliers, but I am unsure how to exclude them from the current data set. Also, I do know that these calculations are done after K-Means, and thus I will probably apply the math prior to doing K-Means.
Minimal Verifiable Example
Unfortunately, the dataset I'm using is off-limits and cannot be shown to anyone, so what you'll need is any random data set with more than 3 variables. The code below is converted from the Another Outlier Detection Method post to work with my data. It should work dynamically with a random data set as well, but the data set should be large enough that 5 cluster centers are reasonable.
clusterAmount <- 5
cluster <- kmeans(dataFrame, centers = clusterAmount, nstart = 20)
centers <- cluster$centers[cluster$cluster, ]
distances <- sqrt(rowSums((dataFrame - centers)^2)) # Euclidean distance of each row to its cluster center
m <- tapply(distances, cluster$cluster, mean)
d <- distances/(m[cluster$cluster])
# 1% outliers
outliers <- d[order(d, decreasing = TRUE)][1:(nrow(dataFrame) * .01)]
Output: a list of outliers ordered by their distance from the center of the cluster they belong to, I believe. The issue then is pairing these results with the respective rows in the data frame and removing them so I can start my K-Means procedure. (Note: while in the example I used K-Means prior to removing outliers, I'll make sure to take the necessary steps and remove the outliers before K-Means once I have a solution.)
Question
With Another Outlier Detection Method example in place, how do I pair the results with the information in my current data frame to exclude those rows before doing K-Means?
I don't know if this is exactly what you need, but if your data are multivariate normal you may want to try a method based on Wilks (1963). Wilks showed that the Mahalanobis distances of multivariate normal data follow a Beta distribution. We can take advantage of this (the iris Sepal data are used as an example):
test.dat <- iris[, c(1, 2)]  # Sepal.Length and Sepal.Width
Wilks.function <- function(dat){
  n <- nrow(dat)
  p <- ncol(dat)
  # beta distribution
  u <- n * mahalanobis(dat, center = colMeans(dat), cov = cov(dat))/(n-1)^2
  w <- 1 - u
  F.stat <- ((n-p-1)/p) * (1/w-1) # computing F statistic
  p <- 1 - round( pf(F.stat, p, n-p-1), 3) # p value for each row
  cbind(w, F.stat, p)
}
plot(test.dat,
col = "blue",
pch = c(15,16,17)[as.numeric(iris$Species)])
dat.rows <- Wilks.function(test.dat); head(dat.rows)
# w F.stat p
#[1,] 0.9888813 0.8264127 0.440
#[2,] 0.9907488 0.6863139 0.505
#[3,] 0.9869330 0.9731436 0.380
#[4,] 0.9847254 1.1400985 0.323
#[5,] 0.9843166 1.1710961 0.313
#[6,] 0.9740961 1.9545687 0.145
Then we can simply find which rows of our multivariate data are significantly different from the beta distribution.
outliers <- which(dat.rows[,"p"] < 0.05)
points(test.dat[outliers,],
col = "red",
pch = c(15,16,17)[as.numeric(iris$Species[outliers])])
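To tie this back to the original question, a minimal sketch (using the outliers vector above) of dropping the flagged rows before running K-Means could look like:
clean.dat <- test.dat[-outliers, ]                 # remove the flagged rows by index
km <- kmeans(clean.dat, centers = 3, nstart = 20)  # K-Means on the cleaned data
table(km$cluster)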
I'm looking for an algorithm, such as k-means, for grouping points on a map into a fixed number of groups by distance.
The number of groups has already been decided, but the tricky part (at least for me) is meeting the criterion that the sum of MOS in each group should lie in a certain range, say bigger than 1. Is there any way to make that happen?
ID MOS X Y
1 0.47 39.27846 -76.77101
2 0.43 39.22704 -76.70272
3 1.48 39.24719 -76.68485
4 0.15 39.25172 -76.69729
5 0.09 39.24341 -76.69884
I was intrigued by your question but was unsure how you might introduce some sort of random process into a grouping algorithm. It seems that the kmeans algorithm does indeed give different results if you permute your dataset (e.g. the order of the rows). I found this bit of info here. The following script demonstrates this with a random set of data. The plot shows the raw data in black and then draws a segment to the center of each cluster for each permutation (colors).
Since I'm not sure how your MOS variable is defined, I have added a random variable to the dataframe to illustrate how you might look for clusterings that satisfy a given criterion. The sum of MOS is calculated for each cluster and the result is stored in the MOS.sums object. In order to reproduce a favorable clustering, you can use the random seed that was used for that permutation, which is stored in the seeds object. You can see that the permutations result in several different clusterings:
set.seed(33)
nsamples=500
nperms=10
nclusters=3
df <- data.frame(x=runif(nsamples), y=runif(nsamples), MOS=runif(nsamples))
MOS.sums <- matrix(NaN, nrow=nperms, ncol=nclusters)
colnames(MOS.sums) <- paste("cluster", 1:nclusters, sep=".")
rownames(MOS.sums) <- paste("perm", 1:nperms, sep=".")
seeds <- round(runif(nperms, min=1, max=10000))
plot(df$x, df$y)
COL <- rainbow(nperms)
for(i in seq(nperms)){
  set.seed(seeds[i])
  ORD <- sample(nsamples)
  K <- kmeans(df[ORD, 1:2], centers=nclusters)
  MOS.sums[i,] <- tapply(df$MOS[ORD], K$cluster, sum)
  segments(df$x[ORD], df$y[ORD], K$centers[K$cluster,1], K$centers[K$cluster,2], col=COL[i])
}
seeds
MOS.sums
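From there, one possible way to pick a clustering that meets the MOS criterion mentioned in the question (e.g. every cluster's MOS sum bigger than 1) is to filter the permutations and reuse the corresponding seed:
ok <- which(apply(MOS.sums, 1, function(s) all(s > 1)))  # permutations where every cluster sum exceeds 1
seeds[ok]                                                # seeds that reproduce those clusterings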
In the following code I use bootstrapping to calculate the C.I. and the p-value under the null hypothesis that two different fertilizers applied to tomato plants have no effect on plant yields (the alternative being that the "improved" fertilizer is better). The first random sample (x) comes from plants where a standard fertilizer was used, while the "improved" one was used on the plants from which the second sample (y) comes.
x <- c(11.4,25.3,29.9,16.5,21.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
total <- c(x,y)
library(boot)
diff <- function(x,i) mean(x[i[6:11]]) - mean(x[i[1:5]])
b <- boot(total, diff, R = 10000)
ci <- boot.ci(b)
p.value <- sum(b$t>=b$t0)/b$R
What I don't like about the code above is that resampling is done as if there were only one sample of 11 values (with the first 5 treated as belonging to sample x and the rest to sample y).
Could you show me how this code should be modified so that it draws resamples of size 5 with replacement from the first sample and separate resamples of size 6 from the second sample, so that the bootstrap resampling mimics the "separate samples" design that produced the original data?
EDIT2 :
The hack I originally posted was a wrong solution and has been deleted. Instead, one has to use the strata argument of the boot function:
total <- c(x,y)
id <- as.factor(c(rep("x",length(x)),rep("y",length(y))))
b <- boot(total, diff, strata=id, R = 10000)
...
Be aware that you're not going to get even close to a correct estimate of your p-value:
x <- c(1.4,2.3,2.9,1.5,1.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
total <- c(x,y)
b <- boot(total, diff, strata=id, R = 10000)
ci <- boot.ci(b)
p.value <- sum(b$t>=b$t0)/b$R
> p.value
[1] 0.5162
How would you explain a p-value of 0.51 for two samples where all values of the second are higher than the highest value of the first?
The above code is fine for getting a (biased) estimate of the confidence interval, but the significance testing of the difference should be done by permutation over the complete dataset.
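A minimal sketch of such a permutation test over the pooled data (same x, y and total as above, with the group labels reshuffled on each iteration):
obs  <- mean(y) - mean(x)            # observed difference in group means
perm <- replicate(10000, {
  z <- sample(total)                 # reshuffle the 11 pooled values
  mean(z[6:11]) - mean(z[1:5])
})
p.value <- mean(perm >= obs)         # one-sided p-value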
Following John, I think the appropriate way to use the bootstrap to test whether the sums of these two different populations are significantly different is as follows:
x <- c(1.4,2.3,2.9,1.5,1.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
b_x <- boot(x, sum, R = 10000)
b_y <- boot(y, sum, R = 10000)
z<-(b_x$t0-b_y$t0)/sqrt(var(b_x$t[,1])+var(b_y$t[,1]))
pnorm(z)
So we can clearly reject the null hypothesis that they are the same population. I may have missed a degrees-of-freedom adjustment (I am not sure how bootstrapping works in that regard), but such an adjustment would not change your results drastically.
While the actual soil beds could be considered a stratification variable in some instances, this is not one of them. You only have the one manipulation, between the groups of plants. Therefore, your null hypothesis is that they really do come from the exact same population. Treating the items as if they're from a single set of 11 samples is the correct way to bootstrap in this case.
If you had two plots, and in each plot you tried the different fertilizers over different seasons in a counterbalanced fashion, then the plots would be stratified samples and you'd want to treat them as such. But that isn't the case here.