Clustering with non-independent variables and a very large data set - r

I have a very large data set (~400,000 instances) that looks like the data below.
data <- as.data.frame(matrix(0, 10, 5))
samp <- function() {
  x <- sample(c(0:9), 5, replace = TRUE,
              prob = c(0.5, 0.1, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05))
  return(x)
}
data <- lapply(split(data, c(1:10)), function(x) samp())
data <- do.call(rbind.data.frame, data)
colnames(data) <- c("fail", "below_a", "aver", "above_a", "exceed")
data$class_size <- apply(data[1:5], 1, FUN = sum)
class_prof <- sample(letters[1:6], nrow(data), replace = T)
data$class_prof <- class_prof
I am trying to cluster this set, but there are the following problems:
class_size is the sum of the first five columns - I think it may cause a collinearity issue, but it is an important variable.
the first five variables are not independent: they are the results of measuring the same quality, and everyone in the class must fall into one of the categories.
the set is really big; the only algorithm that did not have convergence issues was kmeans (without using the class profile variable).
I can drop the categorical variable, as it can be included in the models at a later stage, but I am keen to try some methods that use it as well and compare results.
For the convergence problems, I tried downsampling, but for many methods I need to downsample to 5000-7000 instances to avoid memory issues, which is less than 2% of the original data.
What methods could be applied here using R packages?

Try doing a principal components analysis on the data, then kmeans or knn on the number of dimensions you decide you want.
There are a couple of different packages that are fairly straightforward to use for this; you'll have to mean-center and scale your data beforehand. You'll also have to convert any factors to numeric using a one-hot method (one column for every possible level of that original factor column).
Look into 'prcomp' or 'princomp'.
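As a rough sketch of that pipeline on the sample data above (prcomp for the PCA, model.matrix for the one-hot step; keeping 3 components and k = 3 are arbitrary choices, and on the tiny 10-row sample a constant dummy column could make scale. = TRUE complain, which should not happen on the full data):
# one-hot encode the categorical column (one dummy column per level, no intercept)
X <- cbind(data[, 1:6],
           model.matrix(~ 0 + class_prof, data = data))
# mean-center, scale, and run the PCA
pca <- prcomp(X, center = TRUE, scale. = TRUE)
summary(pca)                 # check how much variance each component explains
# keep a few components and cluster them with kmeans
pcs <- pca$x[, 1:3]
km <- kmeans(pcs, centers = 3, nstart = 25)
table(km$cluster)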

Related

What is the fastest way to calculate a lot of means and sds in R?

I'm relatively new to R and am working on a project where I need to calculate a LOT of column means and standard deviations. I have a dataset called scores that has over 3 million observations of 172 variables. I need to transform each of these scores by subtracting a mean and dividing by a standard deviation. I am able to do what I want with my code below, but it takes up all of the memory in my R session (which is 50GB!). This step (calculating means and sds and transforming values) is the most memory-expensive step in my code, and I am wondering if there is anything I can do to lessen it. Would a function help? Should I store my data differently? Or does it take the same amount of power to do the math regardless of how you ask?
I am trying to avoid paying for a remote machine with more power if possible.
correct_scores <- TRUE
if (correct_scores) {
  # pull score data from larger database
  scores <- noise[["i_scores"]][["whole_dataset"]][, -c(1:4)]
  # calculate means and sds
  meanofmeans <- mean(apply(scores, 2, mean))
  meanofsds <- mean(apply(scores, 2, sd))
  # do the thing
  scores <- (scores - meanofmeans) / meanofsds
  # put values back into larger database
  noise[["i_scores-cor"]][["whole_dataset"]] <-
    cbind(noise[["i_scores"]][["whole_dataset"]][, c(1:4)], scores)
}
a tiny bit of reproducible code from the scores dataset:
scores <- data.frame(ENCFF802ZBQ = c(34.80, -0.01, 0.248, 0.54),
                     ENCFF477IRE = c(0.32, 0.24, -0.24, 23.01),
                     ENCFF127IJN = c(0.23, 0.56, 0.01, 0.01))
Thanks!!
Given your example:
library(data.table)
setDT(scores)[, lapply(.SD, scale)]
setDT(scores) converts scores to a data.table. lapply(.SD, scale) applies the scale(...) function to each column in scores (.SD is a shorthand in data.table for "subset of columns"). In this case the subset is all columns. See ?data.table for more information.
To your question: Should I store my data differently? Yes absolutely. But I'd need to see the structure of noise and perhaps how/why you import it that way to comment further.
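If the aim is to keep the original transformation exactly (subtracting the grand mean of column means and dividing by the grand mean of column sds) without duplicating a 3-million-row table, a hedged sketch using data.table's set() to overwrite columns by reference might look like this (it assumes the relevant columns have already been pulled out of noise into scores):
library(data.table)
setDT(scores)   # convert to data.table in place, no copy
# grand mean of the column means and of the column sds
meanofmeans <- mean(vapply(scores, mean, numeric(1)))
meanofsds   <- mean(vapply(scores, sd,   numeric(1)))
# overwrite each column by reference instead of building a new data frame
for (j in seq_along(scores)) {
  set(scores, j = j, value = (scores[[j]] - meanofmeans) / meanofsds)
}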

Correcting for multiple comparisons in permutation procedure using R and multtest

I have carried out a permutation test comprising a null distribution of distances and 5 observed distances as test statistics. Now I would like to correct for multiple comparisons using the Max-T method, via the multtest package and the ss.maxT, ss.minP and/or sd.maxT functions.
But I have problems implementing the functions and making sense of the results: the first function only gives 1s as a result, the second only gives back the unadjusted p-values, and the third throws an error.
Please see example data below:
## Example data
# Observed distances
obs <- matrix(c(0.001, 0.2, 0.50, 0.9, .9999))
# Null distribution of distances
null_values <- runif(20)
null <- matrix(null_values, nrow = length(obs), ncol = length(c(1:20)), byrow = TRUE)
null
# Hypotheses
alternative <- "more"
# The unadjusted raw p-values
praw <- c(0, 0.1, 0.45, 0.85, 1)
# Only getting 1s as results
adjusted_p_values_max <- multtest::ss.maxT(null, obs, alternative, get.cutoff = FALSE,
                                           get.cr = FALSE, get.adjp = TRUE, alpha = 0.05)
adjusted_p_values_max
# Should probably use this one, but it returns praw, which is supposedly correct (but perhaps odd);
# this is because the null distribution is identical for all 5 variables.
# Hence, should each distance be tested against its own unique null distribution?
adjusted_p_values_min <- multtest::ss.minP(null, obs, praw, alternative, get.cutoff = FALSE,
                                           get.cr = FALSE, get.adjp = TRUE, alpha = 0.05)
adjusted_p_values_min
# Throws an error
adjusted_p_values_sdmax <- multtest::sd.maxT(null, obs, alternative, get.cutoff = TRUE,
                                             get.cr = TRUE, get.adjp = TRUE, alpha = 0.05)
adjusted_p_values_sdmax
Considering the very different conclusions from the first two methods, I'm wondering if my plan to use these methods is incorrect in the first place. Basically, I want to examine several hundred distances against a null distribution of several thousand.
obs = the observed distances from different observed points in space to the same "original" point A. (Hence, the distances are not independent, since they all relate to the same point.)
null = the null distribution comprises distances between points that have been randomly selected (with replacement) from the different observed points and the same original point A.
Using ss.maxT seems way too conservative to me, whereas it seems unnecessary to use ss.minP if it "just" returns the raw p-values; or what am I missing?
Can I perhaps solve this situation by constructing individual null distributions for every observed distance?
Thank you in advance!
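As a sketch of the idea in that last question (one null distribution per observed distance), the example above could be extended roughly like this; B and the runif() placeholder values are assumptions standing in for the real resampled distances:
set.seed(1)
B <- 1000   # number of resampling draws per distance (arbitrary here)
# one row of null distances per observed distance (placeholder values)
null_indiv <- t(replicate(nrow(obs), runif(B)))
# same ss.maxT() call as above, now with row-specific null distributions
adjp_indiv <- multtest::ss.maxT(null_indiv, obs, alternative,
                                get.cutoff = FALSE, get.cr = FALSE,
                                get.adjp = TRUE, alpha = 0.05)
adjp_indiv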

Remove the outlier in the dataset before running factor analysis in R

Some introduction to my dataset: it is questionnaire data on the different reasons for students' antisocial behaviours, and I want to run a factor analysis to group similar reasons into factors.
For instance, one reason for students' antisocial behaviour is how their parents educated them, and another reason is their parents' educational background. There is some similarity between these two reasons, so I am wondering whether they could be merged into one factor; that is why I want to run a factor analysis, to see whether I can merge different reasons into one factor.
In order to run the factor analysis, removing outliers (values smaller than the mean minus 3 standard deviations, or larger than the mean plus 3 standard deviations) is quite important, from my understanding. However, I am not sure whether this is necessary for questionnaire data, and if it is necessary, or at least not completely redundant, with which R code could I achieve this?
I did some research on the Median Absolute Deviation (MAD) method, which can be used to screen out outliers, and I wrote the R code below:
mad.mean.D.O <- as.numeric(D.O.Mean.data$D.O_Mean)
median(mad.mean.D.O)
mad(mad.mean.D.O, center = median(mad.mean.D.O), constant = 1.4826,
    na.rm = FALSE, low = FALSE, high = FALSE)
print(Upper.MAD <- median(mad.mean.D.O) + 3 * mad(mad.mean.D.O, center = median(mad.mean.D.O),
                                                  constant = 1.4826, na.rm = FALSE,
                                                  low = FALSE, high = FALSE))
print(Lower.MAD <- median(mad.mean.D.O) - 3 * mad(mad.mean.D.O, center = median(mad.mean.D.O),
                                                  constant = 1.4826, na.rm = FALSE,
                                                  low = FALSE, high = FALSE))
D.O.clean.mean.data <- D.O.Mean.data %>%
  select(ID_t, anonymity, fail_exm, pregnant, deg_job, new_job, crowded, stu_req,
         int_sub, no_org, child, exm_cont, lec_sup, fals_exp, fin_prob, int_pro,
         family, illness, perf_req, abroad, relevanc, quickcash, deg_per, lack_opp,
         prac_work, D.O_Mean) %>%
  filter(D.O_Mean < 4.197032 & D.O_Mean > 0.282968)
This R code works.
However, I just wonder whether there are other methods that could achieve the same aim in a simpler way.
In addition, my data set looks like this:
All the variables are questionnaire items measured on a Likert scale, and all of them are reasons for antisocial behaviour. For example, the first participant gave a 1 for anonymity, which on the scale from "not exactly" to "exactly" means he/she thinks anonymity did not really contribute to his/her antisocial behaviour.
I would be really thankful for all of your input here.
You can try these functions to remove outliers. They comb through all columns to identify outliers, so be sure to temporarily remove any columns that do not need outliers removed; you can cbind() them back later.
# identify outliers (cells more than `cutoff` sds away from their column mean)
idoutlier <- function(data, cutoff = 3) {
  # calculate the column means and sds
  means <- apply(data, 2, mean, na.rm = TRUE)
  sds   <- apply(data, 2, sd, na.rm = TRUE)
  # identify the cells further than cutoff * sd from the column mean (column wise)
  result <- mapply(function(d, m, s) {
    which(abs(d - m) > cutoff * s)
  }, data, means, sds, SIMPLIFY = FALSE)
  result
}
# remove outliers (set the flagged cells to NA)
rmoutlier <- function(data, outliers) {
  result <- mapply(function(d, o) {
    res <- d
    res[o] <- NA
    return(res)
  }, data, outliers, SIMPLIFY = FALSE)
  return(as.data.frame(result))
}
cbind() the excluded columns back if necessary, and then na.omit() to remove your outliers.
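A quick usage sketch on made-up data (the column names here are purely illustrative, not from the question):
set.seed(1)
df <- data.frame(id    = 1:100,
                 item1 = c(rnorm(99), 15),   # one obvious outlier
                 item2 = rnorm(100))
numeric_cols <- df[, c("item1", "item2")]    # leave `id` out of the screening
out     <- idoutlier(numeric_cols, cutoff = 3)   # list of row indices per column
cleaned <- rmoutlier(numeric_cols, out)          # flagged cells set to NA
result  <- na.omit(cbind(id = df$id, cleaned))   # bind `id` back, drop rows with NA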

Clustering using categorical and continuous data together

I am trying to create an unsupervised model with categorical and continuous data together. I think I have worked it out, but is this the correct way to do it?
Load Libraries
library(tidyr)
library(dummies)
library(fastDummies)
library(cluster)
library(dplyr)
Create sample data set
set.seed(3)
sampleData <- data.frame(id = 1:50,
                         gender = sample(c("Male", "Female"), 10, replace = TRUE),
                         age_bracket = sample(c("0-10", "11-30", "31-60", ">60"), 10, replace = TRUE),
                         income = rnorm(10, 40, 10),
                         volume = rnorm(50, 40, 100))
Create sparse matrix and scale
sd1 <- sampleData %>%
  dummy_cols(select_columns = c("gender", "age_bracket")) %>%
  mutate(id = factor(id)) %>%
  select(-c(gender, age_bracket)) %>%
  mutate_if(is.numeric, scale)
glimpse(sd1)
Generate a k-means model using the pam() function with a k = 3
sd2 <- pam(sd1, k =3)
Extract the vector of cluster assignments from the model
sd3 <- sd2$cluster
Build the segment_customers dataframe
sd4 <- mutate(sd1, cluster = sd3)
Calculate the size of each cluster
count(sd4, cluster)
Dummy coding of variables is fairly standard, but I am not a fan of it. In many cases it IMHO causes large bias and hinders interpretability.
In your case, you additionally apply standardization to the dummy variables, which makes the variable bias even worse.
Your text claims to use k-means, but the code uses PAM. These are not the same. PAM is IMHO the better choice here, because of interpretability and the ability to use other metrics such as Manhattan. The resulting cluster "centers" are data points, not means.
I recommend going down to the mathematical level. PAM tries to minimize the sum of distances to the centers. Now put in the distance you use, e.g., Manhattan. Now substitute the standardization and dummy encoding in there, and you get the actual problem your approach tries to solve. Now take a critical look at this (probably quite large) term: is it helpful for your problem, or are you optimizing the wrong function?
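To make the interpretability point concrete, a small sketch reusing sd1 from the question (dropping id, since an identifier shouldn't drive the distances; Manhattan is just one of the alternative metrics mentioned above):
library(cluster)
library(dplyr)
# drop the id column and make sure everything is plain numeric
sd1_num <- sd1 %>% select(-id) %>% mutate(across(everything(), as.numeric))
# PAM minimizing the sum of Manhattan distances; the medoids are actual observations
pam_man <- pam(sd1_num, k = 3, metric = "manhattan")
pam_man$medoids          # cluster "centers" are rows of the data, not means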

Using the apcluster package in R, is it possible to "score" unclustered data points?

I am new to R and I have a request that I am not sure is possible. We have a number of retail locations that my boss would like to use affinity propagation to group into clusters. We will not be clustering based on geographic location. Once he has found a configuration he likes, he wants to be able to input other locations to determine which of those set clusters they should fall into.
The only solution I have been able to come up with is to use the same options and re-cluster with the original points and the new ones added in; however, I believe this might change the outcome.
Am I understanding this right, or are there other options?
Sorry for the late answer, I just incidentally stumbled over your question.
I agree with Anony-Mousse's answer that clustering is the first step and classification is the second. However, I'm not sure whether this is the best option here. Elena601b is obviously talking about a task with truly spatial data, so my impression is that the best approach is to cluster first and then to "classify" new points/samples/locations by looking for the closest cluster exemplar. Here is some code for synthetic data:
## if not available, run the following first:
## install.packages("apcluster")
library(apcluster)
## create four synthetic 2D clusters
cl1 <- cbind(rnorm(30, 0.3, 0.05), rnorm(30, 0.7, 0.04))
cl2 <- cbind(rnorm(30, 0.7, 0.04), rnorm(30, 0.4, .05))
cl3 <- cbind(rnorm(20, 0.50, 0.03), rnorm(20, 0.72, 0.03))
cl4 <- cbind(rnorm(25, 0.50, 0.03), rnorm(25, 0.42, 0.04))
x <- rbind(cl1, cl2, cl3, cl4)
## run apcluster() (you may replace the Euclidean distance by a different
## distance, e.g. driving distance, driving time)
apres <- apcluster(negDistMat(r=2), x, q=0)
## create new samples
xNew <- cbind(rnorm(10, 0.3, 0.05), rnorm(10, 0.7, 0.04))
## auxiliary predict() function
predict.apcluster <- function(s, exemplars, newdata)
{
    simMat <- s(rbind(exemplars, newdata),
                sel = (1:nrow(newdata)) + nrow(exemplars))[1:nrow(exemplars), ]
    unname(apply(simMat, 2, which.max))
}
## assign new data samples to exemplars
predict.apcluster(negDistMat(r=2), x[apres@exemplars, ], xNew)
## ... the result is a vector of indices to which exemplar/cluster each
## data sample is assigned
I will probably add such a predict() method in a future release of the package (I am the maintainer of the package). I hope that helps.
Clustering is not a drop-in replacement for classification.
Few clustering algorithms can meaningfully integrate new information.
The usual approach for your problem however is simple:
Do clustering.
Use the cluster labels as class labels.
Train a classifier.
Apply the classifier to the new data.
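A hedged sketch of that workflow in R, using kmeans for the clustering step and a k-nearest-neighbour classifier from the class package for the last two steps (both choices are illustrative, not prescribed by the answer, and the toy features stand in for real location attributes):
library(class)
set.seed(42)
train <- matrix(rnorm(200 * 2), ncol = 2)   # existing locations (toy features)
new   <- matrix(rnorm(10 * 2),  ncol = 2)   # locations to assign later
# 1. do clustering
km <- kmeans(train, centers = 4, nstart = 25)
# 2. use the cluster labels as class labels
labels <- factor(km$cluster)
# 3-4. train a classifier and apply it to the new data
assigned <- knn(train = train, test = new, cl = labels, k = 5)
assigned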
