Combining a kinship matrix and pedigree data into a new kinship matrix - out-of-memory

I am running a very large simulation, trying to simulate kinship in a constant-size population of 500 over 50,000 iterations. In each iteration one individual dies and one is born.
The pedigree function of the kinship2 package does the job easily for a moderate number of iterations. However, once I increase the number of iterations into the tens of thousands, the run fails with a memory error. This is quite reasonable, as for N iterations the system has to calculate approximately 0.5*(500+N)^2 pairwise relationships.
I realized that if I stop the run every k iterations (at a "middle time point") and calculate the kinship matrix at that point, I could probably discard all of the dead individuals and keep running my simulation from there with a much lighter data load. However, I cannot find any existing function that takes both the "middle time point" kinship matrix and the pedigree recorded since that middle time point, combines them, and calculates the updated kinship matrix.
Any suggestions?
Thank you so much
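The combining step can be sketched without any package support, using the standard recursive kinship relation: for a newborn i with parents p and q, phi(i, j) = 0.5 * (phi(p, j) + phi(q, j)) for every individual j already in the matrix, and phi(i, i) = 0.5 * (1 + phi(p, q)). Below is a minimal sketch of that update (not an existing kinship2 function; the function name and arguments are illustrative), which lets you keep only the kinship matrix of currently alive individuals and grow it one birth at a time:

add_offspring_kinship <- function(K, id, dam, sire) {
  # K: kinship matrix of currently alive individuals (named rows/columns)
  # id: identifier of the newborn; dam, sire: identifiers of its parents
  old <- rownames(K)
  k_new  <- 0.5 * (K[dam, ] + K[sire, ])   # phi(i, j) = 0.5 * (phi(dam, j) + phi(sire, j))
  k_self <- 0.5 * (1 + K[dam, sire])       # phi(i, i) = 0.5 * (1 + phi(dam, sire))
  K <- rbind(cbind(K, k_new), c(k_new, k_self))
  rownames(K) <- colnames(K) <- c(old, id)
  K
}

# after each death, simply drop the dead individual's row and column:
# K <- K[alive, alive]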

Related

How can I achieve hierarchical clustering with p-values for a large dataset?

I am trying to carry out hierarchical cluster analysis (based on Ward's method) on a large dataset (thousands of records and 13 variables) representing multi-species observations of marine predators, to identify possible significant clusters in species composition.
Each record has date, time, etc., and presence/absence data (0/1) for each species.
I attempted hierarchical clustering with the function pvclust. I transposed the data (pvclust works on transposed tables), then ran pvclust using the Jaccard distance ("binary" in R), which is suitable for species presence/absence data, as the distance measure and Ward's method ("ward.D2"). I used parallel = TRUE to reduce computation time. However, with the default nboot = 1000 my computer was not able to finish the computation in hours and I finally got an error, so I tried a lower nboot (100).
I cannot provide my dataset here, and I do not think it makes sense to provide a small test dataset, as one of the main issues seems to be the size of the dataset itself. However, here are the lines of code I used for the transposition, clustering and plotting:
library(pvclust)  # bootstrap p-values (AU/BP) for hierarchical clustering

tdata <- t(data)  # pvclust clusters the columns, so transpose first
cluster <- pvclust(tdata, method.hclust = "ward.D2", method.dist = "binary",
                   nboot = 100, parallel = TRUE)
plot(cluster, labels = FALSE)
This is the dendrogram I obtained (never mind the confusion at the lower levels due to overlap of branches).
As you can see, the p-values for the higher branches of the dendrogram all seem to be 0.
Now, I understand that my data may not be perfect, but I still think there is something wrong with the method I am using, as I would not expect all these values to be zero even with very low significance in the clusters.
So my questions are:
Is there anything I got wrong in the pvclust call itself?
Could my low nboot (due to a "weak" computer) be a reason for the non-significance of my results?
Are there other functions in R I could try for hierarchical clustering that also deliver p-values?
Thanks in advance!
Update: I have tried to run the same code on a subset of 500 records with nboot = 1000. This worked in a reasonable computation time, but the output is still not very satisfying; see the second dendrogram, obtained for the subset of 500 records with nboot = 1000.
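One way to check whether the AU/BP values are genuinely zero, rather than an artefact of the plotted labels, is to inspect the fitted pvclust object directly (a minimal sketch, reusing the cluster object from the code above):

# AU (approximately unbiased) and BP (bootstrap probability) values per merge
head(cluster$edges[, c("au", "bp")])

# redraw the dendrogram and highlight clusters with AU p-value >= 0.95
plot(cluster, labels = FALSE)
pvrect(cluster, alpha = 0.95, pv = "au")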

mantel.test (package: cultevo)

I am running into some problems and was wondering if you could help me out.
I have two data.frames, one for species data and another for water quality data. I have converted my water quality dataset to u-scores, as there is "censoring" present. However, due to sampling irregularities, I select/run each water quality variable separately when I actually run the Mantel tests.
I am able to compute my distance matrices for both data.frames (i.e., log Bray-Curtis for species and Euclidean distance for the water quality variable), then perform a Mantel test using:
library(vegan); library(ade4)  # vegdist() and mantel.rtest()
Distance.ln.bray.curtis <- vegdist(log1p(species_abundance))  # Bray-Curtis on log(x + 1) abundances
Distance.euclidean <- dist(explanatory_variable, method = "euclidean")
mantel.rtest(Distance.ln.bray.curtis, Distance.euclidean, nrepet = 9999)
But then I realized that I have to permute the observations within each site (I have 28 sites), as recommended in my statistics book. This is where I am stuck, as I have not found any resources showing how to do "within" permutations in R.
I have seen that the "cultevo" package can perform one or more Mantel permutation tests, which is what I need. But I'm not entirely sure how to go about doing this. Would I have to reinsert the site numbers into the distance matrices? If someone could provide examples, it would be greatly appreciated!
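One way to restrict the permutations to within sites, sketched here with vegan's mantel() rather than cultevo (the site factor is an assumption about how the data are organised):

library(vegan)  # mantel() accepts a strata argument for restricted permutations

# site: assumed factor of length nrow(species_abundance), giving the site (1-28)
# of each observation, in the same row order used to build the distance matrices
mantel(Distance.ln.bray.curtis, Distance.euclidean,
       permutations = 9999, strata = site)  # rows are shuffled only within each site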

Competing risk survival random forest with large data

I have a data set with 500,000 observations with events and a competing risk as well as a time-to-event variable (survival analysis).
I want to run a survival random forest.
The R package randomForestSRC is great for this; however, it is impossible to use more than 100,000 rows due to memory limitations (100,000 rows already use 40GB of RAM), even though I limit the number of predictors to 15-20.
I have a hard time finding a solution. Does anyone have a recommendation?
I looked at H2O and Spark MLlib, neither of which supports survival random forests.
Ideally I am looking for a somewhat R-based solution, but I am happy to explore anything else if anyone knows a way to combine large data with competing-risk random forests.
In general, the memory profile for an RF-SRC data set is n x p x 8 bytes on a 64-bit machine. With n = 500,000 and p = 20, RAM usage is approximately 80MB. This is not large.
You also need to consider the size of the forest, $nativeArray. With the default nodesize = 3, you will have n / 3 = 166,667 terminal nodes. Assuming symmetric trees for convenience, the total number of internal/external nodes will be approximately 2 * n / 3 = 333,333. With the default ntree = 1000, and assuming no factors, $nativeArray will be of dimensions [2 * n / 3 * ntree] x [5]. A simple example will show you why [5] columns are needed in $nativeArray to tag the split parameter and split value. Memory usage for the forest will thus be 2 * n / 3 * ntree * 5, or roughly 1.67 billion values, which is about 13.3GB at 8 bytes each.
So now we are getting into some serious memory usage.
Next consider the ensembles. You haven't said how many events you have in your competing risk data set, but let's assume there are two for simplicity.
The big arrays here are the cause-specific hazard function (CSH) and the cause-specific cumulative incidence function (CIF). These are both of dimension [n] x [time.interest] x [2]. In the worst case, if all your event times are distinct and there are no censored observations, time.interest = n, so each of these outputs is n * n * 2 * 8 bytes. This will blow up most machines. It's time.interest that is your enemy. In big-n situations, you need to constrain the time.interest vector to a subset of the actual event times. This can be controlled with the parameter ntime.
From the documentation:
ntime: Integer value used for survival families to constrain ensemble calculations to a grid of time values of no more than ntime time points. Alternatively if a vector of values of length greater than one is supplied, it is assumed these are the time points to be used to constrain the calculations (note that the constrained time points used will be the observed event times closest to the user supplied time points). If no value is specified, the default action is to use all observed event times.
My suggestion would be to start with a very small value of ntime, just to test whether the data set can be analyzed in its entirety without issue, and then increase it gradually while observing your RAM usage. Note that if you have missing data, RAM usage will be much larger. Also note that I have not mentioned other arrays, such as the terminal-node statistics, that also lead to heavy RAM usage. I only considered the ensembles, but in reality each terminal node will contain arrays of dimension [time.interest] x 2 for the node-specific estimators of the CSH and CIF used in creating the forest ensemble.
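As a concrete starting point, a call along these lines keeps the ensembles small (a sketch only; the data frame dat, the time/status column names, and the specific ntree, nodesize and ntime values are assumptions to be tuned upward while watching RAM):

library(survival)         # Surv()
library(randomForestSRC)  # rfsrc()

# status assumed coded 0 = censored, 1 = event of interest, 2 = competing event
fit <- rfsrc(Surv(time, status) ~ ., data = dat,
             ntree = 250,    # fewer trees to start with
             nodesize = 50,  # larger terminal nodes -> smaller forest
             ntime = 50)     # constrain the ensemble time grid to ~50 points
print(fit)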
In the future, we will be implementing a Big Data option that will suppress ensembles and optimize the memory profile of the package to better accommodate big-n scenarios. In the meantime, you will have to be diligent in using the existing options like ntree, nodesize, and ntime to reduce your RAM usage.

Cluster many curves representing gas consumption

I have 700 hourly time series of gas consumption from 2010 to 2014. Each time series represents the consumption of one company.
Some have constant consumption, others consume only 4 months of the year, and some have highly volatile consumption. As a consequence, I would like to cluster them according to the shape of the consumption curve.
I tried the R package "kml", but I do not get good results. I also tried the "kmlShape" package, but it seems that I have too much data, and each time R crashes.
I wondered whether applying a fast Fourier transform and then clustering the result could be a good idea (see the sketch below). My goal is really to distinguish the group whose consumption is constant from those whose consumption is variable.
Then I would like to cluster the variable consumers according to their peaks and when they consume.
I also tried calculating the mean and variance of each client and then clustering with k-means, but the result is not very good: I can see 2 clusters, one with 650 clients and another with 50.
Thanks.
[Figures: first example, second example]
Here are three examples of what I have; I have 700 curves like these, some highly variable, some fairly constant.
I would like to cluster them according to their shape, so as to have one group where consumption is fairly constant and another where consumption is highly variable, and then to cluster the variable group according to when the peaks appear.
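A minimal sketch of the FFT idea mentioned above, assuming the series are stored in a matrix consumption with one row per company and one column per hour (the matrix name, the number of coefficients, and the number of clusters are all assumptions):

# one shape feature vector per company: overall variability (coefficient of
# variation) plus the magnitudes of the first few Fourier coefficients of the
# standardised series
fft_features <- function(x, n_coef = 10) {
  cv <- sd(x) / mean(x)                      # constant vs volatile consumption
  if (sd(x) == 0) return(c(cv, rep(0, n_coef)))
  z <- (x - mean(x)) / sd(x)                 # remove level so only the shape remains
  c(cv, Mod(fft(z))[2:(n_coef + 1)])         # low-frequency magnitudes (skip the DC term)
}

feats <- t(apply(consumption, 1, fft_features))
set.seed(1)
km <- kmeans(scale(feats), centers = 3, nstart = 25)  # 3 groups is an assumption
table(km$cluster)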

Increasing cluster performance

I'm clustering a data set with categorical data using kmodes, and it's starting to take too long. I'm considering two approaches:
1) reduce the number of iterations;
2) randomly pick a smaller subset of the data, get the centroids, and then assign each remaining record to the cluster with the closest centroid.
I'm wondering what the trade off would be between these two approaches, or if there are other approaches that I'm not thinking of.
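A minimal sketch of the second approach, assuming the data are in a data frame df of factors and using klaR::kmodes (the sample size, number of clusters, and helper names are assumptions):

library(klaR)  # kmodes()

set.seed(1)
idx <- sample(nrow(df), 10000)                 # fit k-modes on a random subset
fit <- kmodes(df[idx, ], modes = 5, iter.max = 10)

# assign every record to the mode it mismatches least
# (simple matching distance = number of differing categories)
nearest_mode <- function(row, modes) {
  which.min(apply(modes, 1, function(m) sum(as.character(row) != as.character(m))))
}
clusters <- apply(df, 1, nearest_mode, modes = fit$modes)

The main trade-off: subsampling risks missing rare categories when estimating the modes, while reducing iter.max risks stopping before the modes have converged; the subsample-then-assign route usually saves far more time for the same loss of accuracy.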
