I'm trying to run a PCA on a really large dataset (160,000 rows x 20,000 variables; about 6.3 GB on disk, but much more once loaded into R) on a cluster. However, it is taking an extremely long time (my job was killed after 90 hours), while it usually finishes in a few hours on datasets half that size.
I'm using the most basic R code possible:
library(FactoMineR)  # PCA() with the ncp/graph arguments matches FactoMineR's interface
data <- read.table("dataset.csv", header = TRUE, sep = ",", row.names = 1, fill = TRUE)
y <- PCA(data, ncp = 100, graph = FALSE)  # keep the first 100 components, no plots
Is there something wrong with what I'm doing or should I try a PCA from another package?
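For illustration, here is a hedged sketch of one possible alternative; the package choice, arguments, and the assumption that only the leading components are needed are mine, not from the original post. It reads the file with data.table::fread and computes a truncated PCA with the irlba package.
library(data.table)
library(irlba)
# drop = 1 skips the row-name column while reading (assumes it is the first column)
dat <- fread("dataset.csv", header = TRUE, drop = 1)
m <- as.matrix(dat)
# compute only the first 100 components instead of a full decomposition
pca <- prcomp_irlba(m, n = 100, center = TRUE, scale. = TRUE)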
I am working with a set of 13 .tif raster files, 116.7 MB each, containing data on mangrove forest distributions in West Africa. Each file holds the distribution for one year (2000-2012). The rasters load into R without any problems and plot relatively easily as well, taking ~20 seconds using base plot() and ~30 seconds using ggplot().
I am running into problems when I try to do any sort of processing or analysis of the rasters. I am trying to do simple raster math, subtracting the 2000 mangrove distribution raster from the 2012 raster to show deforestation hotspots, but as soon as I do, the free storage on my computer starts rapidly disappearing.
I have 48 GB of drive space free, but when I start running the raster math I lose about a GB of storage every few seconds. This continues until my storage is almost empty, I get a notification from my computer that storage is critically low, and I have to stop R. I am running a MacBook Pro (121 GB storage, 8 GB RAM, Big Sur 11.0.1). Does anyone know what could be causing this?
Here's my code:
#load the raster package and import the cropped rasters
library(raster)
crop2000 <- raster("cropped2000.tif")
crop2001 <- raster("cropped2001.tif")
crop2002 <- raster("cropped2002.tif")
crop2003 <- raster("cropped2003.tif")
crop2004 <- raster("cropped2004.tif")
crop2005 <- raster("cropped2005.tif")
crop2006 <- raster("cropped2006.tif")
crop2007 <- raster("cropped2007.tif")
crop2008 <- raster("cropped2008.tif")
crop2009 <- raster("cropped2009.tif")
crop2010 <- raster("cropped2010.tif")
crop2011 <- raster("cropped2011.tif")
crop2012 <- raster("cropped2012.tif")
#look at 2000 distribution
plot(crop2000)
#look at 2012 distribution
plot(crop2012)
#subtract 2000 from 2012 to look at change
chg00_12 <- crop2012 - crop2000
If you work with large datasets that cannot all be kept in RAM, raster will save them to temporary files. This can be especially demanding with raster math, as each step creates a new file, e.g. with a Raster* object x,
y <- 3 * (x + 2) - 5
would create three temp files: one for (x + 2), one for the multiplication by 3, and one for the subtraction of 5. You can avoid that by using functions like calc and overlay:
y <- raster::calc(x, function(i) 3 * (i + 2) - 5)
That would only create one temp file, or none if you provide a filename (which also makes the result easier to delete); you could also use compression (see ?writeRaster).
Also see ?raster::removeTmpFiles
You can also increase the amount of RAM that raster is allowed to use. See ?raster::rasterOptions.
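As a hedged sketch of the suggestions above (the output file name, the memory value, and the compression option are illustrative assumptions, not from the original posts):
library(raster)
# let raster keep more data in memory; units and defaults depend on the raster version, see ?rasterOptions
rasterOptions(maxmemory = 1e9)
crop2000 <- raster("cropped2000.tif")
crop2012 <- raster("cropped2012.tif")
# write the difference straight to a compressed GeoTIFF instead of a temp file
chg00_12 <- overlay(crop2012, crop2000,
                    fun = function(a, b) a - b,
                    filename = "chg00_12.tif",
                    options = "COMPRESS=LZW",
                    overwrite = TRUE)
# delete leftover raster temp files (h = 0 removes them regardless of age)
removeTmpFiles(h = 0)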
I have a large set of polygons (about 20k) that I want to sample points from. I use the st_sample function from the sf package in R, but it's pretty slow. It takes about 5 minutes to sample from all polygons, and I need to repeat this task a large number of times (N >= 1000) so it's not practical.
Is there a way to do faster sampling?
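For reference, a minimal sketch of the kind of call described above; the file name and sample sizes are placeholders, not taken from the original post.
library(sf)
polys <- st_read("polygons.gpkg")                       # ~20k polygons (hypothetical file)
pts   <- st_sample(polys, size = rep(1, nrow(polys)))   # one random point per polygon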
I have two time series- a baseline (x) and one with an event (y). I'd like to cluster based on dissimilarity of these two time series. Specifically, I'm hoping to create new features to predict the event. I'm much more familiar with clustering, but fairly new to time series.
I've tried a few different things with a limited understanding...
Simulating data...
x<-rnorm(100000,mean=1,sd=10)
y<-rnorm(100000,mean=1,sd=10)
The TSclust package seems awesome, but there is limited information available on SO or Google.
library(TSclust)
d<-diss.ACF(x, y)
the value of d is
[,1]
[1,] 0.07173596
I then move on to clustering...
hc <- hclust(d)
but I get the following error:
Error in if (is.na(n) || n > 65536L) stop("size cannot be NA nor exceed 65536") :
missing value where TRUE/FALSE needed
My assumption is this error is because I only have one value in d.
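For reference, a minimal sketch (with simulated data rather than the original series) of how a dissimilarity matrix over several series feeds into hclust:
library(TSclust)
set.seed(1)
series <- matrix(rnorm(10 * 500), nrow = 10)   # 10 short series, one per row
d <- diss(series, METHOD = "ACF")              # pairwise dissimilarities between all series
hc <- hclust(d)                                # works because d now contains more than one pair
plot(hc)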
Alternatively, I've tried the following on a single time series (the event).
library(dtw)
distMatrix <- dist(y, method="DTW")
hc <- hclust(distMatrix, method="complete")
but it takes FOREVER to compute the distance matrix.
I have a couple of guesses at what is going wrong, but could use some guidance.
My questions...
Do I need a set of baseline and a set of event time series? Or is one pairing ok to start?
My time series are quite large (100000 rows). I'm guessing this is causing the SLOW distMatrix calculation. Thoughts on this?
Any resources on applied clustering on large time series are welcome. I've done a pretty thorough search, but I'm sure there are things I haven't found.
Is this the code you would use to accomplish these goals?
Thanks!
I am a user of a Rocks 4.3 cluster with 22 nodes. I am using it to run a clustering function, parPvclust, on a dataset of 2 million rows and 100 columns (it clusters the sample names in the columns). To run parPvclust, I am using a C-shell script in which I've embedded some R code. Using the R code below with the full dataset of 2 million rows and 100 columns, I always crash one of the nodes.
library("Rmpi")
library("pvclust")
library("snow")
cl <- makeCluster()
load("dataset.RData") # dataset.m: 2 million rows x 100 columns
# subset.m <- dataset.m[1:200000,] # 200 000 rows x 100 columns
output <- parPvclust(cl, dataset.m, method.dist="correlation", method.hclust="ward",nboot=500)
save(output, file="clust.RData")
I know that the C-shell script works, and I know that the R code works with a smaller dataset: if I use a subset of the dataset (commented out above), the code runs fine and I get an output. Likewise, the non-parallelized version (i.e. plain pvclust) also works fine, although running it non-parallelized defeats the speed gain of running in parallel.
The parPvclust function requires the Rmpi and snow R packages (for parallelization) and the pvclust package.
The following can produce a reasonable approximation of the dataset I'm using:
dataset <- matrix(unlist(lapply(rnorm(n=2000,0,1),rep,sample.int(1000,1))),ncol=100,nrow=2000000)
Are there any ideas as to why I always crash a node with the larger dataset and not the smaller one?
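As a rough back-of-the-envelope check (not from the original post), a single in-memory copy of dataset.m is already sizeable, and parallel back-ends typically ship a copy of the data to each worker:
2e6 * 100 * 8 / 1024^3   # ~1.49 GiB of doubles per copy of the 2,000,000 x 100 matrix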
I know that R works most efficiently with vectors and that looping should be avoided, but I am having a hard time teaching myself to actually write code this way. I would like some ideas on how to 'vectorize' my code. Here's an example of creating 10 years of sample data for 10,000 non-unique combinations of state (st), plan1 (p1), and plan2 (p2):
st<-NULL
p1<-NULL
p2<-NULL
year<-NULL
i<-0
starttime <- Sys.time()
while (i<10000) {
for (years in seq(1991,2000)) {
st<-c(st,sample(c(12,17,24),1,prob=c(20,30,50)))
p1<-c(p1,sample(c(12,17,24),1,prob=c(20,30,50)))
p2<-c(p2,sample(c(12,17,24),1,prob=c(20,30,50)))
year <-c(year,years)
}
i<-i+1
}
Sys.time() - starttime
This takes about 8 minutes to run on my laptop. I end up with 4 vectors, each with 100,000 values, as expected. How can I do this faster using vector functions?
As a side note, if I limit the above code to 1000 loops on i it only takes 2 seconds, but 10,000 takes 8 minutes. Any idea why?
Clearly I should have worked on this for another hour before I posted my question. It's so obvious in retrospect. :)
To use R's vector logic I took out the loop and replaced it with this:
st <- sample(c(12,17,24),100000,prob=c(20,30,50),replace=TRUE)
p1 <- sample(c(12,17,24),100000,prob=c(20,30,50),replace=TRUE)
p2 <- sample(c(12,17,24),100000,prob=c(20,30,50),replace=TRUE)
year <- rep(1991:2000,10000)
I can now do 100,000 samples almost instantaneously. I knew that vectors were faster, but dang. I presume 100,000 iterations would have taken over an hour with the loop, while the vector approach takes less than a second. Just for kicks, I made the vectors a million long: it took ~2 seconds to complete. Since I must test to failure, I tried 10 million but ran out of memory on my 2 GB laptop. I switched over to my Vista 64 desktop with 6 GB of RAM and created vectors of length 10 million in 17 seconds. At 100 million things fell apart, as one of the vectors was over 763 MB, which caused an allocation issue in R.
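As a rough check (not part of the original post), the ~763 MB figure matches a single numeric vector of length 100 million:
100e6 * 8 / 1024^2   # bytes to MiB: about 762.9 MB for 100 million doubles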
Vectors in R are amazingly fast to me. I guess that's why I am an economist and not a computer scientist.
To answer your question about why the loop of 10000 took much longer than your loop of 1000:
I think the primary suspect is the concatenation happening on every iteration. As the vectors get longer, R has to copy every existing element into a new vector that is one element longer, so the total work grows roughly with the square of the number of iterations. The 1,000-iteration run does about 10,000 copies of vectors averaging ~5,000 elements; the 10,000-iteration run does about 100,000 copies of vectors averaging ~50,000 elements, i.e. roughly a hundred times as much copying.
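A small illustration (not from the original answer) of the difference between growing a vector by concatenation and preallocating it:
# growing by c() copies the whole vector on every iteration (quadratic work)
grow <- function(n) {
  out <- NULL
  for (i in seq_len(n)) out <- c(out, i)
  out
}
# preallocating fills each slot in place (linear work)
prealloc <- function(n) {
  out <- numeric(n)
  for (i in seq_len(n)) out[i] <- i
  out
}
system.time(grow(50000))      # noticeably slower
system.time(prealloc(50000))  # near-instant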