I am still learning to use data.table (from the data.table package) and even after looking for help on the web and the help files, I am still struggling to do what I want.
I have a large data table with over 60 columns (the first three corresponding to factors and the remaining to response variables, in this case different species) and several rows corresponding to the different levels of the treatments and the species abundances. A very small version looks like this:
> TEST<-data.table(Time=c("0","0","0","7","7","7","12"),
Zone=c("1","1","0","1","0","0","1"),
quadrat=c(1,2,3,1,2,3,1),
Sp1=c(0,4,29,9,1,2,10),
Sp2=c(20,17,11,15,32,15,10),
Sp3=c(1,0,1,1,1,1,0))
> setkey(TEST, Time)
> TEST
Time Zone quadrat Sp1 Sp2 Sp3
1: 0 1 1 0 20 1
2: 0 1 2 4 17 0
3: 0 0 3 29 11 1
4: 12 1 1 10 10 0
5: 7 1 1 9 15 1
6: 7 0 2 1 32 1
7: 7 0 3 2 15 1
I need to calculate the sum of the covariances for each Zone x quadrat group. If I only had the species data for a given Zone x quadrat combination, I could use the cov() function, but using cov() the same way I would use mean() or sum(), as in
Abundance = TEST[,lapply(.SD,mean),by="Zone,quadrat"]
does not work as I get the following error message:
Error in cov(value) : supply both 'x' and 'y' or a matrix-like 'x'
I understand why but I cannot figure out how to solve this.
What I exactly want is to be able to get, for each Zone x quadrat combination, the covariance matrix of all the species across all the sampling Time points. From each matrix, I then need to calculate the sum of the covariances of all pairs of species, so that then I can have a sum of covariance for each Zone x quadrat combination.
Any help would be greatly appreciated, Thanks.
From the help provided above by @Frank and some additional searching I did on the use of the upper.tri() function, the following code works:
Cov= TEST[,sum(cov(.SD)[upper.tri(cov(.SD), diag = FALSE)]), by='Zone,quadrat', .SDcols=paste('Sp',1:3,sep='')]
The initially proposed version, where upper.tri() was not placed inside [ ], only extracted logical values from the covariance matrix; using it as an index with diag = FALSE excludes the diagonal (the variances) so that only the upper triangle of the matrix is summed. In my case I didn't care whether it was the upper or lower triangle, and I'm sure that using lower.tri() would work equally well.
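For reference, a possible equivalent that computes the covariance matrix only once per group (this is just a sketch of the same approach, not a different method):
Cov <- TEST[, {m <- cov(.SD); sum(m[upper.tri(m)])},
            by = .(Zone, quadrat), .SDcols = paste0("Sp", 1:3)]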
I hope this helps other users who might encounter a similar issue.
I have some datasets for which I want to calculate gamma diversity as the Shannon H index.
Example dataset:
Site SpecA SpecB SpecC
1 4 0 0
2 3 2 4
3 1 1 1
Calculating the alpha diversity is as follows:
vegan::diversity(df, index = "shannon")
However, I want this diversity function to calculate one number for the complete dataset instead of for each row. I can't wrap my head around this. My thought is that I need to write a function that merges all the rows into one by taking the average abundance of each species, thus creating a dataframe with one site containing all the species information:
site SpecA SpecB SpecC
1 2.6 1 1.6
This seems like a giant workaround for something that could be done with some existing functions, but I don't know how. I hope someone can help in creating this dataframe or suggest some other method to use the diversity() function over the complete dataframe.
Regards
library(vegan)
data(BCI)
diversity(colSums(BCI)) # vector of sums is all you need
## vegan 2.6-0 in github has 'groups' argument for sampling units
diversity(BCI, groups = rep(1, nrow(BCI))) # one group, same result as above
diversity(BCI, groups = "a") # arg 'groups' recycled: same result as above
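Applied back to the small example in the question, the same idea would be (a sketch; it assumes df holds only the species columns, i.e. the Site column has been dropped):
df <- data.frame(SpecA = c(4, 3, 1), SpecB = c(0, 2, 1), SpecC = c(0, 4, 1))
vegan::diversity(colSums(df), index = "shannon")  # one gamma diversity value for the pooled data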
I'm working with an expression matrix obtained by single-cell RNA sequencing, but I have a question about some R code a colleague sent me...
sort(unique(1 + slot(as(data_matrix, "dgTMatrix"), "i")))
# there are no more details in the code...
In theory, this code is meant to remove non-expressed genes (genes that are zero in all samples, I think...), but I can't make sense of it. Can anyone give me a tip?
Well, I think I have understood this code... let me try to explain it! (Please correct me if I'm wrong.)
Our data is stored as a sparse matrix (i.e. more memory-efficient, link), and with as() it is coerced to a specific format for this kind of matrix (triplet format for sparse matrices, link): three columns holding the i and j indices and the value x for each non-zero entry.
y <- matrix_counts # sparse matrix
AAACCTGAGAACAACT-1 AAACCTGTCGGAAATA-1 AAACGGGAGAGCTGCA-1
ENSG00000243485 1 . .
ENSG00000237613 . . 2
y2 <- as(y, "dgTMatrix") #triplet format for sparse matrix
i j x
1 9 1 1 # in row 9 and column 1 we have the value 1
2 50 1 2
3 60 1 1
4 62 1 2
5 78 1 1
6 87 1 1
Next, it takes only the slot "i" (slot(data, "i")), because we only need the row indices (to know which rows have values different from zero), and removes duplicates (unique) to finally obtain a vector of row indices which is used to filter the raw data:
y3 <- unique(1 + slot(as(exprs(gbm), "dgTMatrix"), "i"))
[1] 9 50 60 62 78 87
data <- data_raw[y3,]
As for the sort and 1 +: the "i" slot of a dgTMatrix stores 0-based row indices, so 1 + converts them to R's 1-based indexing, and sort simply orders the resulting indices. So, to summarize, we take the row indices of the non-zero rows (genes) and use them to filter our raw data... another original method for deleting non-expressed genes, interesting!
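A minimal, self-contained illustration of the same idea (the object names here are invented for the example; newer versions of the Matrix package prefer coercing to "TsparseMatrix" rather than "dgTMatrix"):
library(Matrix)
m <- Matrix(c(0, 1, 0,
              0, 0, 0,
              2, 0, 3), nrow = 3, byrow = TRUE, sparse = TRUE)
tm <- as(m, "TsparseMatrix")     # triplet form; slot "i" holds 0-based row indices
keep <- sort(unique(1 + tm@i))   # 1-based indices of rows with at least one non-zero value
m[keep, ]                        # the all-zero row 2 is dropped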
I have a dataset composed of 54,000 rows and a few columns (7). My values are both numeric and alphanumeric (quantitative and qualitative variables). I want to cluster it using the hclust function in R.
Let's take an example :
X <- data.frame(rnorm(54000, sd = 0.3),
rnorm(54000, mean = 1, sd = 0.3),
sample( LETTERS[1:24], 54000, replace=TRUE),
sample( letters[1:10], 54000, replace=TRUE),
round(rnorm(54000,mean=25, sd=3)),
round(runif(n = 54000,min = 1000,max = 25000)),
round(runif(54000,0,200000)))
colnames(X) <- c("A","B","C","D","E","F","G")
If I use the hclust function like this:
hclust(dist(X), method = "ward.D")
I get this error message:
Error: cannot allocate vector of size 10.9 Gb
What is the problem? I'm trying to create a 54k x 54k distance matrix, which is too big to be handled by my PC (4 GB of RAM). I've read that since R 3.0.0 the software is 64-bit (able to address a matrix with 2.916e+09 entries like in my example), so the limitation comes from my computer. I've tried hclust from stats, fastcluster and flashClust and get the same problem.
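For reference, the 10.9 Gb in the error message matches the size of the lower-triangle dist object for 54,000 rows:
54000 * 53999 / 2 * 8 / 2^30   # ~1.46e9 pairwise distances x 8 bytes each, about 10.9 GiB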
In these packages, hclust is described like this:
hclust(d, method="complete", members=NULL)
flashClust(d, method = "complete", members=NULL)
d a dissimilarity structure as produced by dist.
We always need a dist matrix to make this function work. I've also tried to raise the memory limits of my R session using:
memory.limit(size = 4014)
memory.size(max = TRUE)
Question :
Is it possible to do hierarchical clustering (or cluster the data in a similar way) without using this dist() matrix, for a mixed quantitative/qualitative dataset in R?
Edit :
About k-means :
The k-means method works great for a big dataset composed of numerical values. In my example, I have both numeric and alphanumeric values. I've tried to transform my qualitative variables into binary numerical variables in order to run k-means:
First dataframe (example):
Col1 Col2 Col3
1 12 43.93145 Alpha
2 45 44.76081 Beta
3 48 45.09708 Gamma
4 31 45.42278 Alpha
5 12 46.53709 Delta
6 7 39.07841 Beta
7 78 49.60947 Alpha
If I transform this into binary variables, I get this:
Col1 Col2 Alpha Beta Gamma Delta
1 12 44.29369 1 0 0 0
2 45 43.90610 0 1 0 0
3 48 44.82659 0 0 1 0
4 31 43.09096 1 0 0 0
5 12 42.71190 0 0 0 1
6 7 43.71710 0 1 0 0
7 78 42.24293 1 0 0 0
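For what it's worth, dummy columns like the ones above don't have to be built by hand; a sketch (df and Col3 are the names from the example table, which isn't defined as code in the question, so this is only illustrative):
dummies <- model.matrix(~ Col3 - 1, data = df)    # one 0/1 column per level of Col3
df_num  <- cbind(df[, c("Col1", "Col2")], dummies)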
This is OK if I only have a few modalities, but in a real dataset we could get about 10,000 modalities (levels) for a 50k-row dataset. I don't think k-means is the solution to this type of problem.
From reading your question, it seems there are 2 problems:
1. You have a fairly large number of observations for clustering
2. The categorical variables have high cardinality
My advice:
1) You can just take a sample and use fastcluster::hclust, or use clara (a rough sketch follows below).
Probably after sorting out 2) you can use more observations; in any case, it is potentially OK to work on a sample. Try to take a stratified sample across the categories.
2) You basically need to represent these categories in a numeric format without adding 10,000 more columns. You could use PCA or a discrete version of it.
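For point 1), a possible sketch combining a subsample with a mixed-type (Gower) dissimilarity; the sample size, seed, linkage and number of clusters below are arbitrary choices for illustration, not recommendations:
library(cluster)       # daisy() computes a Gower dissimilarity for mixed data
library(fastcluster)   # faster drop-in replacement for stats::hclust()
set.seed(1)
idx <- sample(nrow(X), 2000)          # ideally a stratified sample of the categories
Xs  <- X[idx, ]
Xs[] <- lapply(Xs, function(col) if (is.character(col)) factor(col) else col)  # daisy() expects factors
d    <- daisy(Xs, metric = "gower")   # only a 2000 x 2000 dissimilarity
tree <- hclust(d, method = "average")
groups <- cutree(tree, k = 5)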
A few questions deal with this problem:
q1, q2
I have a dataframe of xyz coordinates of units in 5 different boxes, all 4x4x8 so 128 total possible locations. The units are all different lengths. So even though I know the coordinates of the unit (3 units in, 2 left, and 1 up) I don't know the exact location of the unit in the box (12' in, 14' left, 30' up?). The z dimension corresponds to length and is the dimension I am interested in.
My instinct is to run a for loop summing values, but that is generally not the most efficient in R. The key elements of the for loop would be something along the lines of:
master$unitstartpoint[i] <- if (master$unitz[i] == 1) 0
master$unitstartpoint[i] <- if (master$unitz[i] > 1) master$unitstartpoint[i-1] + master$length[i-1]
i.e. the unit start point is 0 if it is the first in the z dimension, otherwise it is the start point of the prior unit + the length of the prior unit. Here's the data:
# generate dataframe
master<-c(rep(1,128),rep(2,128),rep(3,128),rep(4,128),rep(5,128))
master<-as.data.frame(master)
# input basic data--what load number the unit was in, where it was located
# relative other units
master$boxNumber<-master$master
master$unitx<-rep(c(rep(1,32),rep(2,32),rep(3,32),rep(4,32)),5)
master$unity<-c(rep(1,8),rep(2,8),rep(3,8),rep(4,8))
master$unitz<-rep(1:8,80)
# create unique unit ID # based on load number and xyz coords.
master <- transform(master, ID = paste0(boxNumber, unitx, unity, unitz))
# generate how long the unit is. this length will be used to identify unit
# location in the box
master$length<-round(rnorm(640,13,2))
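For reference, a runnable version of the loop sketched above would be something like this (a slow but straightforward baseline; it relies on the rows being sorted as described):
master$unitstartpoint <- 0
for (i in seq_len(nrow(master))) {
  if (master$unitz[i] > 1) {
    master$unitstartpoint[i] <- master$unitstartpoint[i - 1] + master$length[i - 1]
  }
}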
I'm guessing there is a relatively easy way to do this with apply or by but I am unfamiliar with those functions.
Extra info: the unit ID's are unique and the master dataframe is sorted by boxNumber, unitx, unity, and then unitz, respectively.
This is what I am shooting for:
length unitx unity unitz unitstartpoint
15 1 1 1 0
14 1 1 2 15
11 1 1 3 29
13 1 1 4 40
Any guidance would be appreciated. Thanks!
It sounds like you just want a cumulative sum along the z dimension for each box/x/y combination. I used the cumulative sum because otherwise, if you reset to 0 when z = 1, your definition would leave off the length at z = 8. We can do this easily with ave:
clength <- with(master, ave(length, boxNumber, unitx, unity, FUN=cumsum))
I'm not exactly sure which values you want returned, but this column roughly translates to how you were redefining length above. If I combine it with the original data and look at the total length for the first box for x = 1, y = 1:4
# head(subset(cbind(master, clength), unitz==8), 4)
master boxNumber unitx unity unitz length ID clength
8 1 1 1 1 8 17 1118 111
16 1 1 1 2 8 14 1128 104
24 1 1 1 3 8 10 1138 98
32 1 1 1 4 8 10 1148 99
we see the total lengths for those positions. Since we are using cumsum, we are assuming that the z values are sorted, as you have indicated they are. If you just want one total overall length per box/x/y combo, you can replace cumsum with sum.
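If the column you are after is the start point itself (as in the example output in the question), it is just the cumulative sum shifted back by one element, i.e. cumsum(length) - length; a sketch:
master$unitstartpoint <- with(master,
  ave(length, boxNumber, unitx, unity, FUN = function(v) cumsum(v) - v))
head(master[, c("length", "unitx", "unity", "unitz", "unitstartpoint")], 4)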
I'm using SVD with R and I'm able to reduce the dimensionality of my matrix by replacing the smallest singular values by 0. But when I recompose my matrix I still have the same number of features; I could not find how to effectively delete the most useless features of the source matrix in order to reduce its number of columns.
For example what I'm doing for the moment:
This is my source matrix A:
A B C D
1 7 6 1 6
2 4 8 2 4
3 2 3 2 3
4 2 3 1 3
If I do:
s = svd(A)
s$d[3:4] = 0 # Replacement of the 2 smallest singular values by 0
A_approx <- s$u %*% diag(s$d) %*% t(s$v)   # the A' referred to below
I get A', which has the same dimensions (4x4), was reconstructed with only 2 "components", and is an approximation of A (containing a little less information, maybe less noise, etc.):
[,1] [,2] [,3] [,4]
1 6.871009 5.887558 1.1791440 6.215131
2 3.799792 7.779251 2.3862880 4.357163
3 2.289294 3.512959 0.9876354 2.386322
4 2.408818 3.181448 0.8417837 2.406172
What I want is a sub-matrix with fewer columns that still reproduces the distances between the different rows, something like this (obtained using PCA; let's call it A''):
PC1 PC2
1 -3.588727 1.7125360
2 -2.065012 -2.2465708
3 2.838545 0.1377343 # The similarity between rows 3
4 2.815194 0.3963005 # and 4 in A is conserved in A''
Here is the code to get A'' with PCA:
p = prcomp(A)
A_reduced <- p$x[, 1:2]   # the A'' shown above
The final goal is to reduce the number of columns in order to speed up clustering algorithms on huge datasets.
Thank you in advance if someone can guide me :)
I would check out this chapter on dimensionality reduction or this cross-validated question. The idea is that the entire data set can be reconstructed using less information. It's not like PCA, where you might choose to keep only 2 out of 10 principal components.
When you do the kind of trimming you did above, you're really just taking out some of the "noise" in your data. The data still has the same dimension.
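For the stated goal (fewer columns while preserving row-to-row distances), one option is to keep the truncated left singular vectors scaled by the singular values; a sketch (note that prcomp() centers the columns first, so its scores only match this if A is centered beforehand):
s  <- svd(A)
k  <- 2
Ak <- s$u[, 1:k] %*% diag(s$d[1:k])   # 4 x 2; distances between its rows equal those
                                      # between the rows of the rank-k reconstruction A'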