How to resolve an error in running the kmodes algorithm in R

I'm working on a segmentation problem. I have a dataframe with 49 variables and 500,000 observations; the variables can be continuous, binary, or categorical. I'm reading in only the variables that have no NA values, and to be safe I'm also using na.omit().
Now, as the dataset is too large, I tried running it incrementally, sampling 1,000, 10,000, and 50,000 rows. It ran successfully on 1,000 and 10,000 rows with the following code:
t1c <- t1[sample(nrow(t1), 50000), -c(5, 23, 25, 26, 28, 55)]
library(klaR)
segments <- kmodes(na.omit(t1c), 4, iter.max = 5)
where t1 is my original dataframe. When I ran this with 50,000 rows, I got the following error:
Error in match.arg(useNA) : 'arg' must be of length 1
Any idea what might be the issue here?
P.S. I'm also trying to run PAM via daisy(), which may be a better fit for the type of data I have (I'm still researching this), but I was wondering: if kmodes ran with 10,000 samples, what might be the issue with 50,000?
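There is no accepted answer recorded here, but a minimal diagnostic sketch (reusing the column indices from the question; not a confirmed fix) can rule out the usual culprits, namely NAs that survive into the sample and columns with unexpected types:
library(klaR)
# diagnostic sketch: verify the 50,000-row sample really is clean
t1c <- t1[sample(nrow(t1), 50000), -c(5, 23, 25, 26, 28, 55)]
t1c <- na.omit(t1c)
sum(is.na(t1c))  # should print 0 after na.omit
str(t1c)         # inspect column types; kmodes treats every column as categorical
segments <- kmodes(t1c, modes = 4, iter.max = 5)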

Related

Conjoint analysis levels error - incorrect number of subscripts on matrix?

I'm working on a homework case using Conjoint analysis and running into an error that reads:
Error in X[k, j] <- x[i, j] : incorrect number of subscripts on matrix
I have tried looking up the error and cannot find anything that is conjoint analysis related. I have compared my data with other datasets (tea, ice) that are preloaded and don't cause the error, and it appears to be the same format, just different data.
library(conjoint)
# load in data
caModel(y=conjointrating[1,],x=profiles)
This line of code (caModel) works just fine
caUtilities(y=conjointrating[1,],x=profiles,z=levels)
This is where I run into the error. It seems to come from adding z = levels. Other functions from the conjoint package give exactly the same error; the functions that do not require z (levels) as an input do not give an error.
"conjointrating" gives 40 observations of 16 variables, representing responses to 16 product profiles.
"profiles" gives 16 observations of 5 variables, representing 16 different product profiles.
"levels" is 15 observations of 1 variable, a section of which is below:
sm_suite
lg_rm
rm_off
internet
phone
I am receiving the following error:
Error in X[k, j] <- x[i, j] : incorrect number of subscripts on matrix
I expect to get partial utilities in a matrix format with columns representing the different levels.
I had the same issue with my conjoint analysis. Like Phil, I noticed the preloaded datasets are all data frames, so I converted my data files with data.frame() and it works now. In your case,
profiles <- data.frame(profiles)
levels <- data.frame(levels)
should do it.
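For completeness, a short sketch of that fix with a quick check added (assuming the three objects from the question are already loaded):
library(conjoint)
# sketch of the suggested fix: convert both inputs to data frames
profiles <- data.frame(profiles)
levels <- data.frame(levels)
str(levels)  # should now report a data.frame, matching the preloaded examples
caUtilities(y = conjointrating[1, ], x = profiles, z = levels)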

Generating 3,000,000 strings of length 11 in R

Apparently if I try this:
# first grab the package
install.packages("stringi")
library(stringi)
# and then try to generate some serious dummy data
my_try <- as.vector(sample(1111111111:99999999999,3000000,replace=T))
R will say NOPE, sorry:
Error: cannot allocate vector of size 736.8 Gb
Should I buy more RAM*?
*this is a joke, but I seriously appreciate any help!
EDIT:
The desired output is a dataframe of 20 variables and 3×10^6 rows. Some columns should be strings and some integers, with lengths ranging from 2 to 12.
The error isn't coming from sampling 3 million values; it comes from materializing the full population 1111111111:99999999999 (roughly 99 billion values) before sampling from it. If you want 11-digit numbers (11111111111 to 99999999999), sample from the range 1:88888888889 and add the offset 11111111110 using
sample(88888888889, 3000000,replace=TRUE) + 11111111110
There's no need for as.vector at the end; the result of sample() is already a vector.
P.S. I believe in R-devel the range 1111111111:99999999999 will be stored much more efficiently (basically just the limits), but I don't know if sample() will be modified to work with it that way.
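Since stringi is already loaded in the question, another option (a sketch, not part of the original answer) is to skip the huge integer range entirely and generate strings directly with stri_rand_strings(); the columns below only illustrate the stated 20-variable, mixed-type target:
library(stringi)
set.seed(1)
n <- 3e6
# illustrative columns; the stated goal is 20 variables of lengths 2 to 12
dummy <- data.frame(
  id11 = stri_rand_strings(n, 11, pattern = "[0-9]"),  # 11-character digit strings
  code = stri_rand_strings(n, 5, pattern = "[A-Z]"),   # length-5 letter strings
  count = sample.int(9999, n, replace = TRUE)          # an integer column
)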

Time Series Clustering in R

I have two time series- a baseline (x) and one with an event (y). I'd like to cluster based on dissimilarity of these two time series. Specifically, I'm hoping to create new features to predict the event. I'm much more familiar with clustering, but fairly new to time series.
I've tried a few different things with a limited understanding...
Simulating data...
x<-rnorm(100000,mean=1,sd=10)
y<-rnorm(100000,mean=1,sd=10)
This package seems awesome but there is limited information available on SO or Google.
library(TSclust)
d<-diss.ACF(x, y)
The value of d is:
[,1]
[1,] 0.07173596
I then move on to clustering...
hc <- hclust(d)
but I get the following error:
Error in if (is.na(n) || n > 65536L) stop("size cannot be NA nor exceed 65536") :
missing value where TRUE/FALSE needed
My assumption is this error is because I only have one value in d.
Alternatively, I've tried the following on a single time series (the event).
library(dtw)
distMatrix <- dist(y, method="DTW")
hc <- hclust(distMatrix, method="complete")
but it takes FOREVER to compute the distance matrix.
I have a couple of guesses at what is going wrong, but could use some guidance.
My questions...
Do I need a set of baseline time series and a set of event time series, or is one pair OK to start?
My time series are quite large (100,000 points each). I'm guessing this is what makes the distMatrix calculation SLOW. Thoughts on this?
Any resources on applied clustering on large time series are welcome. I've done a pretty thorough search, but I'm sure there are things I haven't found.
Is this the code you would use to accomplish these goals?
Thanks!
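No answer is included above, but one point deserves a sketch: hclust() needs pairwise dissimilarities among several objects, and diss.ACF(x, y) produces a single value for one pair, which is why hclust(d) fails. Assuming you have several series stored one per row in a matrix, TSclust's diss() builds the full dissimilarity matrix (illustrative only, not a tested answer to the question):
library(TSclust)
set.seed(1)
# 50 shorter series, one per row, instead of a single x/y pair
series <- matrix(rnorm(50 * 200, mean = 1, sd = 10), nrow = 50)
rownames(series) <- paste0("series_", 1:50)
d <- diss(series, METHOD = "ACF")    # pairwise ACF-based dissimilarities
hc <- hclust(d, method = "complete")
plot(hc)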

Using R to cluster based on Euclidean distance and complete linkage, too many vectors?

I am trying to figure out how to read a counts matrix into R and then cluster based on Euclidean distance with complete linkage. The original matrix has 56,000 rows (genes) and 7 columns (treatments). I want to see whether there is a clustering relationship between the treatments. However, every time I try this, I first get an error stating: Error: cannot allocate vector of size 544.4 Gb. Since I'm trying to reproduce work that has been published by someone else, I wonder whether I'm making a mistake in my initial data entry.
Second, if I try the clustering with just 20 of the 56,000 genes, I can make a dendrogram, but its branches are not the experimental samples. The paper I am trying to replicate did the same clustering, and its dendrogram shows the samples clustering.
Here is the code I am trying to run:
exprs <- as.matrix(read.table("small_RMA_table.txt", header=TRUE, sep = "\t", row.names = 1, as.is=TRUE))
eucl_dist=dist(matrix(exprs),method = 'euclidean')
hie_clust=hclust(eucl_dist,method = 'complete')
plot(hie_clust)
And here is a sample of my data table:
AGS KATOIII MKN45 N87 SNU1 SNU5 SNU16
1_DDR1 11.18467721 11.91358171 11.81568242 11.08565284 8.054326631 12.46899188 10.54972491
2_RFC2 9.19869822 9.609015734 8.925772678 8.3641799 8.550993726 10.32160527 9.421779056
3_HSPA6 6.455324139 6.088320986 7.949175048 6.128573129 6.113793411 6.317460116 7.726657567
4_PAX8 8.511225092 8.719103196 8.706242048 8.705618546 8.696547633 9.292782564 8.710369119
5_GUCA1A 3.773404228 3.797729793 3.574286779 3.848753216 3.684193193 3.66065606 3.88239872
6_UBA7 6.477543321 6.631538303 6.506133756 6.433793116 6.145507918 6.92197071 6.479113995
7_THRA 6.263090367 6.507397854 6.896879084 6.696356125 6.243160864 6.936051147 6.444444498
8_PTPN21 6.88050894 6.342007735 6.55408163 6.099950167 5.836763044 5.904301086 6.097067306
9_CCL5 6.197989448 4.00619542 4.445053893 7.350765625 3.892650264 7.140038596 4.123639647
10_CYP2E1 4.379433632 4.867741561 4.719912827 4.547433566 6.530890968 4.187701905 4.453267508
11_EPHB3 6.655231606 7.984278173 7.025962652 7.111129175 6.246989328 6.169529157 6.546374446
12_ESRRA 8.675023046 9.270153715 8.948209029 9.412638347 9.4470612 9.98312055 9.534236722
13_CYP2A6 6.834018146 7.18386746 6.826740822 7.244411918 6.744588768 6.715122111 7.302922762
14_SCARB1 8.856802264 8.962211232 8.975200168 9.710291176 9.120002571 10.29588004 10.55749325
15_TTLL12 8.659539601 9.93935462 8.309244963 9.21145716 9.792647852 10.46958091 10.51879844
16_LINC00152 5.108632654 4.906321384 4.958158343 5.315532543 5.456138001 5.242577092 5.180295902
17_WFDC2 5.595843025 5.590991341 5.776102664 5.622086284 5.273603946 5.304240608 5.573746302
18_MAPK1 6.970036434 5.739881305 4.927993642 5.807358161 7.368137365 6.17697538 5.985006279
19_MAPK1 8.333269232 8.758733916 7.855324572 9.03596893 7.808283302 7.675434022 7.450262521
20_ADAM32 4.075355477 4.216259982 4.653654879 4.250333684 4.648194266 4.250333684 4.114286071
The rows describe genes (Ex., 1_DDR1, 2_RFC2, etc.) and the columns are experimental samples (Ex. AGS, KATOIII). I wish to see the relatedness of the samples in the cluster.
Here is the dendrogram my code produces. I thought it would show only 7 branches, reflecting my 7 samples:
The paper's dendrogram (which includes these samples and many more) is below:
Thanks for any help you can provide!
You're running out of RAM. That's it. You can't allocate a vector that exceeds your memory space. Move to a computer with more memory, or maybe try the bigmemory package (I've never tried it).
https://support.bioconductor.org/p/53848/
In case anybody was wondering, the answer to my second question is below. I was calling matrix() on what was already a matrix inside dist(), which reshaped the data into a single column and scrambled it. The following code works now!
exprs <- as.matrix(read.table("small_RMA_table.txt", header=TRUE, sep = "\t", row.names = 1, as.is=TRUE))
eucl_dist=dist(exprs,method = 'euclidean')
hie_clust=hclust(eucl_dist,method = 'complete')
plot(hie_clust)
Do you want to cluster on columns (detect similarities between treatments) or on rows (detect similarities between genes)? It sounds like you want the former, given that you're expecting 7 dendrogram branches for 7 treatments.
If so, then you need to transpose your dataset. dist computes a distance matrix for rows, not columns, which is not what you want.
Once you've done the transpose, your clustering should take no time at all, and minimal memory.
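As a concrete sketch of that advice (assuming exprs was read in as in the corrected code above):
# cluster the 7 treatment columns: transpose so treatments become rows
eucl_dist = dist(t(exprs), method = 'euclidean')   # only a 7 x 7 set of distances
hie_clust = hclust(eucl_dist, method = 'complete')
plot(hie_clust)  # one leaf per treatment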

Running clustering analysis on a cluster causes one of the nodes to crash

I am a user of a Rocks 4.3 cluster with 22 nodes. I am using it to run a clustering function, parPvclust, on a dataset of 2 million rows and 100 columns (it clusters the sample names in the columns). To run parPvclust, I am using a C-shell script with R code embedded in it. Using the R code below on the full dataset, I always crash one of the nodes.
library("Rmpi")
library("pvclust")
library("snow")
cl <- makeCluster()
load("dataset.RData") # dataset.m: 2 million rows x 100 columns
# subset.m <- dataset.m[1:200000,] # 200 000 rows x 100 columns
output <- parPvclust(cl, dataset.m, method.dist="correlation", method.hclust="ward",nboot=500)
save(output, file = "clust.RData")
I know that the C-shell script works, and I know that the R code works with a smaller dataset: if I use the subset commented out above (200,000 rows), it runs fine and I get an output. Likewise, the non-parallelized version (plain pvclust) also works fine, although that defeats the speed gain of running in parallel.
The parPvclust function requires the Rmpi and snow R packages (for parallelization) and the pvclust package.
The following can produce a reasonable approximation of the dataset I'm using:
dataset <- matrix(unlist(lapply(rnorm(n=2000,0,1),rep,sample.int(1000,1))),ncol=100,nrow=2000000)
Are there any ideas as to why I always crash a node with the larger dataset and not the smaller one?
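No answer is recorded above, but a quick back-of-the-envelope calculation (plain arithmetic, not a diagnosis) shows why the full matrix strains a node while the 200,000-row subset does not; every worker that receives a copy of the data needs this much memory before any bootstrap work begins:
# rough size of one in-memory copy of the full numeric matrix
n_rows <- 2e6
n_cols <- 100
bytes <- n_rows * n_cols * 8  # a double takes 8 bytes
bytes / 1024^3                # about 1.5 GiB per copy; the subset is 10x smaller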
