I have a dataset of 54,000 rows and a few (seven) columns. The values are both numeric and alphanumeric (quantitative and qualitative variables). I want to cluster it using the hclust function in R.
Let's take an example:
X <- data.frame(rnorm(54000, sd = 0.3),
                rnorm(54000, mean = 1, sd = 0.3),
                sample(LETTERS[1:24], 54000, replace = TRUE),
                sample(letters[1:10], 54000, replace = TRUE),
                round(rnorm(54000, mean = 25, sd = 3)),
                round(runif(54000, min = 1000, max = 25000)),
                round(runif(54000, 0, 200000)))
colnames(X) <- c("A", "B", "C", "D", "E", "F", "G")
If I use the hclust function like this:
hclust(dist(X), method = "ward.D")
I get this error message:
Error: cannot allocate vector of size 10.9 Gb
What is the problem? I'm trying to create a 54k × 54k distance matrix, which is too big for my PC (4 GB of RAM) to compute. I've read that since R 3.0.0 the software is 64-bit (able to index a 2.916e+09-element matrix like in my example), so the limitation comes from my machine. I've tried hclust from stats, fastcluster, and flashClust and get the same problem.
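For reference, a sketch of the arithmetic behind that error: dist() stores the lower triangle of the distance matrix as 8-byte doubles, so the required memory is
n <- 54000
n * (n - 1) / 2 * 8 / 1024^3 # about 10.86 GiB for the lower triangle alone
which matches the 10.9 Gb in the error message.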
In these packages, hclust is described like this:
hclust(d, method="complete", members=NULL)
flashClust(d, method = "complete", members=NULL)
d: a dissimilarity structure as produced by dist.
These functions always need a dist matrix to work. I've also tried to raise my computer's memory limits for the R session using:
memory.limit(size = 4014)
memory.size(max = TRUE)
Question:
Is it possible to do hierarchical clustering (or cluster the data in a similar way) without computing this dist() matrix, for a quantitative/qualitative dataset in R?
Edit:
About k-means:
The k-means method works great for a big dataset of numerical values, but in my example I have both numeric and alphanumeric values. I've tried transforming my qualitative variables into binary numerical variables so k-means can process them:
First data frame (example):
  Col1     Col2  Col3
1   12 43.93145 Alpha
2   45 44.76081 Beta
3   48 45.09708 Gamma
4   31 45.42278 Alpha
5   12 46.53709 Delta
6    7 39.07841 Beta
7   78 49.60947 Alpha
If I transform this into binary variables, I get this:
  Col1     Col2 Alpha Beta Gamma Delta
1   12 44.29369     1    0     0     0
2   45 43.90610     0    1     0     0
3   48 44.82659     0    0     1     0
4   31 43.09096     1    0     0     0
5   12 42.71190     0    0     0     1
6    7 43.71710     0    1     0     0
7   78 42.24293     1    0     0     0
That's fine with only a few modalities, but a real dataset could easily have about 10,000 modalities for a 50k-row base. I don't think k-means is the solution to this type of problem.
From reading your question, it seems there are two problems:
1. You have a fairly large number of observations to cluster
2. The categorical variables have high cardinality
My advice:
1) You can just take a sample and use fastcluster::hclust, or use clara (see the sketch below).
After sorting out 2) you can probably use more observations; in any case, it is usually fine to work with a sample. Try to take a stratified sample over the categories.
2) You basically need to represent these categories in a numeric format without adding 10,000 more columns. You could use PCA or a discrete version of it (e.g. multiple correspondence analysis).
A few questions deal with this problem:
q1, q2
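To illustrate 1), here is a minimal sketch, assuming the character columns are coded as factors and that clustering a subsample is acceptable. cluster::daisy computes a Gower dissimilarity, which handles mixed variable types, so the full 54k × 54k matrix is never built:
library(cluster)      # daisy(): dissimilarities for mixed variable types
library(fastcluster)  # faster drop-in replacement for stats::hclust

# make sure the qualitative columns are factors, not character
X[] <- lapply(X, function(v) if (is.character(v)) factor(v) else v)

set.seed(1)
idx <- sample(nrow(X), 2000)            # subsample: dist on 2,000 rows is small
d <- daisy(X[idx, ], metric = "gower")  # mixed-type (Gower) dissimilarity
hc <- fastcluster::hclust(d, method = "ward.D")
groups <- cutree(hc, k = 5)             # e.g. cut the tree into 5 clusters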
Related
I am about to run a multilevel analysis (and I am a total newbie).
In this analysis I want to test whether a high value of one predictor (here: senseofhumor, a numeric value transformed into "high"/"medium"/"low") predicts the (numeric) outcome better than the other (numeric) predictors (senseofhumor, seriousness, friendlyness).
I have a dataset with many people and groups, and I want to compare the outcome between the groups regarding the influence of SenseofhumorHIGH.
The code for that might look like this:
RandomslopeEC <- lme(criteria(timepoint1) ~ senseofhumor + seriousness + friendlyness, data = DATA, random = ~ SenseofhumorHIGH | group)
For that reason I created the values "high", "low", and "medium" for my numeric predictor via:
library(tidyverse)
DATA <- DATA %>%
  mutate(predictorNew = case_when(
    senseofhumor < quantile(senseofhumor, 0.5)  ~ 'low',
    senseofhumor > quantile(senseofhumor, 0.75) ~ 'high',
    TRUE ~ 'med'
  ))
Now they look like this:
Person  Group  senseofhumor
     1     56  low
     7      1  high
    87      7  low
   764     45  high
Now I've realized I might need to split the values of this variable into separate indicator variables to test my idea.
Does anyone know how to generate variables so that they look like this?
Person  Group  senseofhumorHIGH  senseofhumorMED  senseofhumorLOW
     1     56                 0                0                1
     7      1                 1                0                0
    87      7                 0                0                1
   764     45                 1                0                0
    51      3                 1                0                0
   362      9                 1                0                0
    87     27                 0                0                1
Does this make any sense to you regarding my approach? Or do you have a better idea?
Thanks a lot in advance
Welcome to learning R. You will want to convert these types of variables to "factors," and R will be able to process them accordingly. To do that, use as.factor(variable); for you it may be DATA$senseofhumor <- as.factor(DATA$senseofhumor). If you need to convert multiple columns, you can use:
factor_cols <- c("Var1","Var2","Var3") # list columns you want as factors
DATA[factor_cols] <- lapply(DATA[factor_cols], as.factor)
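If you do end up needing explicit 0/1 indicator columns like the ones sketched in your question, one option is base R's model.matrix(). This is a minimal sketch; the column names are illustrative:
# expand the factor into one 0/1 column per level
DATA$senseofhumor <- factor(DATA$senseofhumor, levels = c("low", "med", "high"))
dummies <- model.matrix(~ senseofhumor - 1, data = DATA)  # -1 drops the intercept
colnames(dummies) <- c("senseofhumorLOW", "senseofhumorMED", "senseofhumorHIGH")
DATA <- cbind(DATA, dummies)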
Since you are new, note that this forum is typically for questions whose answers can't easily be found online. This question is relatively routine, and more details can be found with a quick Google search. While SO is a great place to learn R, you may be penalized by the SO community in the future for routine questions like this. Just trying to help ensure you keep learning!
I am quite familiar with R but have never had this requirement before, where I need to create exactly equal data partitions at random using createDataPartition in R.
index <- createDataPartition(final_ts$SAR, p = 0.5, list = FALSE)
final_test_data <- final_ts[index, ]
final_validation_data <- final_ts[-index, ]
This code creates two datasets of 1396 and 1398 observations respectively.
I am surprised that p = 0.5 doesn't split the data exactly in half. Does it have something to do with the resulting datasets not being allowed an odd number of observations by default?
Thanks in advance!
It has to do with the number of cases of the response variable (final_ts$SAR in your case).
For example:
y <- rep(c(0,1), 10)
table(y)
y
0 1
10 10
# even number of cases
Now we split (computing the partition once and reusing it, so that train and test are exact complements):
idx <- caret::createDataPartition(y, p = 0.5, list = FALSE)
train <- y[idx]
table(train) # we have 10 obs
train
0 1
5 5
test <- y[-idx]
table(test) # we have 10 obs.
test
0 1
5 5
If we instead build an example with an odd number of cases in each class:
y <- rep(c(0,1), 11)
table(y)
y
0 1
11 11
We have:
idx <- caret::createDataPartition(y, p = 0.5, list = FALSE)
train <- y[idx]
table(train) # we have 12 obs.
train
0 1
6 6
test <- y[-idx]
table(test) # we have 10 obs.
test
0 1
5 5
More info here.
Here is another thread explaining why the number returned by createDataPartition might seem "off" to us, even though it matches what the function is designed to do.
So, it depends on what you have in final_ts$SAR and the spread of the data.
If it is a categorical variable, e.g. T and F, and you have 100 values in total, of which 55 are T and 45 are F, then invoking it the way you do in your code returns 51 indices, because the split is done per class and each class count is rounded up:
55 * 0.5 = 27.5 and 45 * 0.5 = 22.5; rounding each result up gives 28 + 23 = 51.
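A quick way to check this yourself; a minimal sketch, where the expected counts assume the per-class rounding-up described above:
library(caret)
y <- factor(rep(c("T", "F"), times = c(55, 45)))
idx <- createDataPartition(y, p = 0.5, list = FALSE)
length(idx)   # 51 = ceiling(55 * 0.5) + ceiling(45 * 0.5)
table(y[idx]) # 23 F, 28 T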
You can refer to the thread below, which has a great explanation of what happens when the values you want to split are numeric:
R - caret createDataPartition returns more samples than expected
I have a data set with 70 column variables, each a 0-1 dummy variable, and 3,500 observations. I am looking to see how often observations with a 'success' in one variable also have one in another. In other words, if obs 1 has a success dummy in variable one, how often does it also have a success in variable two, and so on for all the variables. I have found how to create a matrix table showing interactions when only two columns are involved, but I can't find anything involving many columns. Ideally I'd like to present this as an interaction matrix, with the 70 variables across and 70 down. Here is an idea of the data set:
Dat A B C D
XX 1 1 1 1
XY 0 1 0 1
XZ 0 0 1 1
The output I'm hoping for would be:
Out  A  B  C  D
A    0  1  1  1
B       0  1  2
C          0  2
D             0
This shows the number of times that (A,B) is a pairing, (B,C) is a pairing, and so on.
I have tried using the table() command as well as as.matrix, but it seems these require data organized as two columns and cannot handle many column variables at once. I am fairly new to R, so I apologize if my question isn't clear or is possibly quite simple.
Any help is appreciated. Thanks
Here's how to create a correlation matrix of indefinite size. First create a reproducible example of your dataset...
dat <- matrix(sample(0:1, size = 700, replace = TRUE), ncol = 70) # 10 rows x 70 columns
dat <- data.frame(dat)
Then calculate the correlation...
dat <- cor(dat)
And then plot the correlation visually...
library(corrplot)
corrplot(dat, method = "square")
You can also plot the correlation using numbers instead of colors...
corrplot(dat, method = "number")
Obviously you'll want to finesse these charts before using them in a publication. corrplot offers tons of options for chart appearance.
You can try:
res <- apply(combn(2:ncol(df), 2), 2,
             function(x, y) sum(rowSums(y[, x]) == 2), df) # count rows where both columns are 1
m <- diag(x = 0, ncol(df) - 1)
m[lower.tri(m)] <- res # combn enumerates pairs in the same column-major order as lower.tri
m <- t(m)              # transpose to move the counts into the upper triangle
m[lower.tri(m)] <- NA
dimnames(m) <- list(colnames(df)[-1], colnames(df)[-1])
m
   A  B  C  D
A  0  1  1  1
B NA  0  1  2
C NA NA  0  2
D NA NA NA  0
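As an aside, because the columns are all 0/1, here is a sketch of an equivalent matrix-algebra approach: crossprod() counts, for each pair of columns, the rows where both are 1 (this assumes the same df layout as above, with the ID column first):
M <- as.matrix(df[, -1])  # drop the ID column "Dat"
co <- crossprod(M)        # co[i, j] = number of rows where columns i and j are both 1
diag(co) <- 0             # zero the diagonal to match the output above
co[lower.tri(co)] <- NA   # keep only the upper triangle
co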
I am totally new to R, and I'll appreciate the time anyone takes to help me with these probably simple tasks. I'm just at a loss with all the resources available and am not sure where to start.
My data looks something like this:
subject sex age nR medL medR meanL meanR pL ageBin
1 0146si 1 67 26 1 1 1.882353 1.5294118 0.5517241 1
2 0162le 1 72 5 2 1 2 1.25 0.6153846 1
3 0323er 1 54 30 2.5 3 2.416667 2.5 0.4915254 0
4 0811ne 0 41 21 2 2 2 1.75 0.5333333 0
5 0825en 1 44 31 2 2 2.588235 1.8235294 0.5866667 0
Though the actual data has many, many more subjects and variables.
The first thing I need to do is compare the 'ageBin' values: 0 = under age 60, 1 = over age 60. I want to compare stats between these two groups, so I guess the first thing I need is the ability to recognize the different ageBin values and make those the two rows.
Then I need to do things like calculate the frequency of the values in the two groups (i.e. how many instances of 1 and 0), the mean of the 'age' variable, the median of the age variable, the number of males (i.e. sex = 1), the mean of meanL, etc. Simple things like that. I just want them all in one table.
So an example of a potential table might be:
          n  nMale  mAge
ageBin 0  14     x     x
ageBin 1  14     x     x
I could easily do this stuff in SPSS or even Excel...I just really want to get started with R. So any resource or advice someone could offer to point me in the right direction would be so, so helpful. Sorry if this sounds unclear...I can try to clarify if necessary.
Thanks in advance, anyone.
Use the plyr package to split up the data structure, apply a function to each piece, and then combine the results back together.
install.packages("plyr") # install package from CRAN
library(plyr) # load the package into R
dd <- list(subject = c("0146si", "0162le", "0323er", "0811ne", "0825en"),
           sex = c(1, 1, 1, 0, 1),
           age = c(67, 72, 54, 41, 44),
           nR = c(26, 5, 30, 21, 31),
           medL = c(1, 2, 2.5, 2, 2),
           medR = c(1, 1, 3, 2, 2),
           meanL = c(1.882353, 2, 2.416667, 2, 2.588235),
           meanR = c(1.5294118, 1.25, 2.5, 1.75, 1.8235294),
           pL = c(0.5517241, 0.6153846, 0.4915254, 0.5333333, 0.5866667),
           ageBin = c(1, 1, 0, 0, 0))
dd <- data.frame(dd) # convert to data.frame
Using the ddply function, you can do things like calculate summaries for the two groups, e.g. the number of males and the mean age:
ddply(dd, .(ageBin), summarise, nMale = sum(sex), mAge = mean(age))
ageBin nMale     mAge
     0     2 46.33333
     1     2 69.50000
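To get all of the columns from the table you sketched, the same call extends with additional name = expression pairs (a minimal sketch; the column names are illustrative):
ddply(dd, .(ageBin), summarise,
      n = length(age),      # group size
      nMale = sum(sex),     # number of males (sex == 1)
      mAge = mean(age),     # mean age
      medAge = median(age)) # median age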
The following is a very useful resource by Sean Anderson for getting up to speed with the plyr package.
A more comprehensive resource by Hadley Wickham, the package author, can be found here.
Try the by function. If your data frame is named df:
by(data = df, INDICES = df$ageBin, FUN = summary)
I am still learning to use data.table (from the data.table package) and even after looking for help on the web and the help files, I am still struggling to do what I want.
I have a large data table with over 60 columns (the first three corresponding to factors and the remaining to response variables, in this case different species) and several rows corresponding to the different levels of the treatments and the species abundances. A very small version looks like this:
TEST <- data.table(Time = c("0", "0", "0", "7", "7", "7", "12"),
                   Zone = c("1", "1", "0", "1", "0", "0", "1"),
                   quadrat = c(1, 2, 3, 1, 2, 3, 1),
                   Sp1 = c(0, 4, 29, 9, 1, 2, 10),
                   Sp2 = c(20, 17, 11, 15, 32, 15, 10),
                   Sp3 = c(1, 0, 1, 1, 1, 1, 0))
setkey(TEST, Time)
TEST
Time Zone quadrat Sp1 Sp2 Sp3
1: 0 1 1 0 20 1
2: 0 1 2 4 17 0
3: 0 0 3 29 11 1
4: 12 1 1 10 10 0
5: 7 1 1 9 15 1
6: 7 0 2 1 32 1
7: 7 0 3 2 15 1
I need to calculate the sum of the covariances for each Zone x quadrat group. If I only had the species list for a given Zone x quadrat combination, I could use the cov() function, but using cov() in the same way I would use mean() or sum() in
Abundance <- TEST[, lapply(.SD, mean), by = "Zone,quadrat"]
does not work as I get the following error message:
Error in cov(value) : supply both 'x' and 'y' or a matrix-like 'x'
I understand why but I cannot figure out how to solve this.
What I want exactly is, for each Zone x quadrat combination, the covariance matrix of all the species across all the sampling Time points. From each matrix, I then need to calculate the sum of the covariances of all pairs of species, so that I end up with one covariance sum per Zone x quadrat combination.
Any help would be greatly appreciated. Thanks.
From the help provided above by @Frank and some additional searching on the use of the upper.tri function, the following code works:
Cov <- TEST[, sum(cov(.SD)[upper.tri(cov(.SD), diag = FALSE)]), # sum of upper-triangle covariances
            by = 'Zone,quadrat',
            .SDcols = paste('Sp', 1:3, sep = '')]
The version initially proposed, where upper.tri() did not appear inside [ ], only extracted the logical values from the covariance matrix; putting it inside [ ] with diag = FALSE excludes the diagonal (the variances) before summing the upper triangle. In my case I didn't care whether it was the upper or lower triangle, and since a covariance matrix is symmetric, using lower.tri() would work equally well.
I hope this helps other users who might encounter a similar issue.
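As a quick sanity check, here is a sketch for one group, assuming the example data above: compute the covariance matrix of a single Zone x quadrat combination by hand and compare it with the matching row of Cov (the computed column is named V1 by default):
sub <- TEST[Zone == "1" & quadrat == 1, .(Sp1, Sp2, Sp3)] # one Zone x quadrat group
cm <- cov(sub)         # 3 x 3 covariance matrix across the Time points
sum(cm[upper.tri(cm)]) # should equal the V1 entry for Zone 1, quadrat 1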