Show KMeans cluster results with clusters as columns - r

My data has 40+ variables and I am creating a 3 cluster model on it.
I have built a kmeans model:
teen_clusters <- kmeans(interests_z, 3).
It works fine. It is getting an output that I can read is the issue.
When I screen print the model, it places the variables on the top (40 across) and the clusters as rows (3 deep). Very hard to read.
I want it the other way around. 3 cluster columns and 40 rows.
I have tried the below, but get the same thing. This does way too much screen wrap.
aggregate(interests_z,by=list(teen_clusters$cluster),FUN=mean)

Since we don't have your data lets use mtcars ...
ret <- kmeans(mtcars,3)
ret$centers # the default format
t(ret$centers) # transposed as you want
To see the components of ret use str(ret)

Related

Translating a for-loop to perhaps an apply through a list

I have a r code question that has kept me from completing several tasks for the last year, but I am relatively new to r. I am trying to loop over a list to create two variables with a specified correlation structure. I have been able to "cobble" this together with a "for" loop. To further complicate matters, I need to be able to put the correlation number into a data frame two times.
For my ultimate usage, I am concerned about speed, efficiency, and long-term effectiveness of my code.
library(mvtnorm)
n=100
d = NULL
col = c(0, .3, .5)
for (j in 1:length(col)){
X.corr = matrix(c(1, col[j], col[j], 1), nrow=2, ncol=2)
x=rmvnorm(n, mean=c(0,0), sigma=X.corr)
x1=x[,1]
x2=x[,2]
}
d = rbind(d, c(j))
Let me describe my code, so my logic is clear. This is part of a larger simulation. I am trying to draw 2 correlated variables from the mvtnorm function with 3 different correlation levels per pass using 100 observations [toy data to get the coding correct]. d is a empty data frame. The 3 correlation levels will occur in the following way pass 1 uses correlation 0 then create the variables, and yes other code will occur; pass 2 uses correlation .3 to create 2 new variables, and then other code will occur; pass 3 uses correlation .5 to create 2 new variables, and then other code will occur. Within my larger code, the for-loop gets the job done. The last line puts the number of the correlation into the data frame. I realize as presented here it will only put 1 number into this data frame, but when it is incorporated into my larger code it works as desired by putting 3 different numbers in a single column (1=0, 2=.3, and 3=.5). To reiterate, the for-loop gets the job done, but I believe there is a better way--perhaps something in the apply family. I do not know how to construct this and still access which correlation is being used. Would someone help me develop this little piece of code? Thank you.

Problems with DataFrame to generate a table of 15000 rows x 37 columns in Julia

I'm trying to generate a table with 15000 rows and 16 columns, however Julia loses or omits some variables.
I have tried different ways to run DataFrame but I get the following results:
df = DataFrame(periods=15000, households=5000, giniY=giniY)
15,000 rows × 3 columns
However, when I run with the 16 variables I get the following result
df = DataFrame(periods=15000, households=5000, gamma=gamma, delta=delta,
betta=betta, alfa=alfa, miz=miz, roz=roz,
phi=phi, rok=rok, mie=mie, roe=roe,
roez=roez)
1×13 DataFrame. Omitted printing of 3 columns
Your second df variable has 13 columns (I have aligned the code in my edit 4 variables per line so that it is clearly visible). Julia omits printing all columns if they would not fit the screen (imagine what would happen if you had a data frame with 10 000 columns and always tried to print them all).
In REPL Julia omits printing columns if they do not fit the screen unless you pass allcols=true keyword argument to show or create a custom IOContext that you pass to show that defines a non-standard output width. All this is explained in show documentation for DataFrame.
In Jupyter Notebook a similar thing happens, but by default the width of the output is governed by "COLUMNS" environment variable. The details how you can set it are explained at the beginning of the DataFrames.jl manual here.

Using R to cluster based on euclidean distance and a complete linkage metric, too many vectors?

I am trying to figure out how to read in a counts matrix into R, and then cluster based on euclidean distance and a complete linkage metric. The original matrix has 56,000 rows (genes) and 7 columns (treatments). I want to see if there is a clustering relationship between the treatments. However, every time I try to do this, I first get an error stating, Error: cannot allocate vector of size 544.4 Gb Since I'm trying to reproduce work that has been published by someone else, I am wondering if I am making a mistake with my initial data entry.
Second, if I try such clustering with just 20 genes of the 56,000, I am able to make a clustering dendrogram, but the branches are no experimental samples. The paper I am trying to replicate did such clustering with the resulting dendrogram displaying clustering samples.
Here is the code I am trying to run:
exprs <- as.matrix(read.table("small_RMA_table.txt", header=TRUE, sep = "\t", row.names = 1, as.is=TRUE))
eucl_dist=dist(matrix(exprs),method = 'euclidean')
hie_clust=hclust(eucl_dist,method = 'complete')
plot(hie_clust)
And here is a sample of my data table:
AGS KATOIII MKN45 N87 SNU1 SNU5 SNU16
1_DDR1 11.18467721 11.91358171 11.81568242 11.08565284 8.054326631 12.46899188 10.54972491
2_RFC2 9.19869822 9.609015734 8.925772678 8.3641799 8.550993726 10.32160527 9.421779056
3_HSPA6 6.455324139 6.088320986 7.949175048 6.128573129 6.113793411 6.317460116 7.726657567
4_PAX8 8.511225092 8.719103196 8.706242048 8.705618546 8.696547633 9.292782564 8.710369119
5_GUCA1A 3.773404228 3.797729793 3.574286779 3.848753216 3.684193193 3.66065606 3.88239872
6_UBA7 6.477543321 6.631538303 6.506133756 6.433793116 6.145507918 6.92197071 6.479113995
7_THRA 6.263090367 6.507397854 6.896879084 6.696356125 6.243160864 6.936051147 6.444444498
8_PTPN21 6.88050894 6.342007735 6.55408163 6.099950167 5.836763044 5.904301086 6.097067306
9_CCL5 6.197989448 4.00619542 4.445053893 7.350765625 3.892650264 7.140038596 4.123639647
10_CYP2E1 4.379433632 4.867741561 4.719912827 4.547433566 6.530890968 4.187701905 4.453267508
11_EPHB3 6.655231606 7.984278173 7.025962652 7.111129175 6.246989328 6.169529157 6.546374446
12_ESRRA 8.675023046 9.270153715 8.948209029 9.412638347 9.4470612 9.98312055 9.534236722
13_CYP2A6 6.834018146 7.18386746 6.826740822 7.244411918 6.744588768 6.715122111 7.302922762
14_SCARB1 8.856802264 8.962211232 8.975200168 9.710291176 9.120002571 10.29588004 10.55749325
15_TTLL12 8.659539601 9.93935462 8.309244963 9.21145716 9.792647852 10.46958091 10.51879844
16_LINC00152 5.108632654 4.906321384 4.958158343 5.315532543 5.456138001 5.242577092 5.180295902
17_WFDC2 5.595843025 5.590991341 5.776102664 5.622086284 5.273603946 5.304240608 5.573746302
18_MAPK1 6.970036434 5.739881305 4.927993642 5.807358161 7.368137365 6.17697538 5.985006279
19_MAPK1 8.333269232 8.758733916 7.855324572 9.03596893 7.808283302 7.675434022 7.450262521
20_ADAM32 4.075355477 4.216259982 4.653654879 4.250333684 4.648194266 4.250333684 4.114286071
The rows describe genes (Ex., 1_DDR1, 2_RFC2, etc.) and the columns are experimental samples (Ex. AGS, KATOIII). I wish to see the relatedness of the samples in the cluster.
Here is my sample dendrogram that my code produces. I thought it would only show 7 branches reflecting my 7 samples:
The paper's dendrogram (including these 8 samples and many more as well) is below:
Thanks for any help you can provide!
You're running out of RAM. That's it. You can't allocate a vector that exceeds your memory space. Move to a computer with more memory or maybe, try use bigmemory (I've never tried it).
https://support.bioconductor.org/p/53848/
In case anybody was wondering, the answer to my second question is below. I was calling as.matrix on a matrix, and it was screwing up the data. The following code works now!
exprs <- as.matrix(read.table("small_RMA_table.txt", header=TRUE, sep = "\t", row.names = 1, as.is=TRUE))
eucl_dist=dist(exprs,method = 'euclidean')
hie_clust=hclust(eucl_dist,method = 'complete')
plot(hie_clust)
Do you want to cluster on columns (detect similarities between treatments) or on rows (detect similarities between genes)? It sounds like you want the former, given that you're expecting 7 dendrogram branches for 7 treatments.
If so, then you need to transpose your dataset. dist computes a distance matrix for rows, not columns, which is not what you want.
Once you've done the transpose, your clustering should take no time at all, and minimal memory.

Clustering big data

I have a list like this:
A B score
B C score
A C score
......
where the first two columns contain the variable name and third column contains the score between both. Total number of variables is 250,000 (A,B,C....). And the score is a float [0,1]. The file is approximately 50 GB. And the pairs of A,B where scores are 1, have been removed as more than half the entries were 1.
I wanted to perform hierarchical clustering on the data.
Should I convert the linear form to a matrix with 250,000 rows and 250,000 columns? Or should I partition the data and do the clustering?
I'm clueless with this. Please help!
Thanks.
Your input data already is the matrix.
However hierarchical clustering usually scales O(n^3). That won't work with your data sets size. Plus, they usually need more than one copy of the matrix. You may need 1TB of RAM then... 2*8*250000*250000is a lot.
Some special cases can run in O(n^2): SLINK does. If your data is nicely sorted, it should be possible to run single-link in a single pass over your file. But you will have to implement this yourself. Don't even think of using R or something fancy.

Parallel processing in R

I'm working with a custom random forest function that requires both a starting and ending point in a set of genomic data (about 56k columns).
I'd like to split the column numbers into subgroups and allow each subgroup to be processed individually to speed things up. I tried this (unsuccessfully) with the following code:
library(foreach)
library(doMC)
foreach(startMrk=(markers$start), endMrk=(markers$end)) %dopar%
rfFunction(genoA,genoB,0.8,ntree=100,startMrk=startMrk,endMrk=endMrk)
Where startMrk is an array of numeric variables: 1 4 8 12 16 and endMrk is another array: 3 7 11 15 19
For this example, I'd want one core to run samples 1:3, another to run 4:7, etc. I'm new to the idea of parallel processing in R, so I'm more than willing to study any documentation available. Does anyone have advice on things I'm missing for parallel-wise processing or for the above code?
The basic point here is that you're splitting up your columns into chunks, right. First, it might be better to chunk your dataset appropriately at each iteration and feed the chunks into RF. Also, foreach works just like for in some ways, so the code can be
rfs=vector('list',4)
foreach(i=1:4) %dopar% {
ind <- markers$start[i]:markers$end[i]
rfs[[i]] <- randomForest(genoA[,ind],genoB[,ind], 0.8, ntree=100)
}
I gave this in regular randomForest, but you can wrap this up into your custom code in a straightforward manner.

Resources