How to reduce the size of the data in R?

I have a CSV file with 600,000 rows and 1,339 columns, about 1.6 GB on disk. 1,337 of the columns are binary, taking either 1 or 0 values, and the other 2 columns are numeric and character variables.
I read the data in with the readr package using the following code:
library(readr)
VLU_All_Before_Wide <- read_csv("C:/Users/petas/Desktop/VLU_All_Before_Wide_Sample.csv")
When I checked the object size with the following code, it was about 3 GB:
> print(object.size(VLU_All_Before_Wide),units="Gb")
3.2 Gb
In the next step, using the code below, I want to create training and test sets for LASSO regression.
set.seed(1234)
train_rows <- sample(1:nrow(VLU_All_Before_Wide), .7*nrow(VLU_All_Before_Wide))
train_set <- VLU_All_Before_Wide[train_rows,]
test_set <- VLU_All_Before_Wide[-train_rows,]
yall_tra <- data.matrix(subset(train_set, select=VLU_Incidence))
xall_tra <- data.matrix(subset(train_set, select=-c(VLU_Incidence,Replicate)))
yall_tes <- data.matrix(subset(test_set, select=VLU_Incidence))
xall_tes <- data.matrix(subset(test_set, select=-c(VLU_Incidence,Replicate)))
When I started my R session the RAM usage was at ~3 GB, and by the time I had executed all of the above code it was at 14 GB, leaving me with an error saying it cannot allocate a vector of size 4 GB. There were no other applications running other than 3 Chrome windows. I removed the original dataset and the training and test datasets, but that only freed 0.7 to 1 GB of RAM.
rm(VLU_All_Before_Wide)
rm(test_set)
rm(train_set)
I'd appreciate it if someone could point me to a way to reduce the size of the data.
Thanks

R struggles with huge datasets because it tries to load and keep all the data in RAM. You can use other R packages that are built to handle big datasets, like bigmemory and ff. Check my answer here, which addresses a similar issue.
You can also do some of the data processing and manipulation outside R and remove unnecessary columns and rows. But still, to handle big datasets, it's better to use the packages built for that purpose.
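As a complement to those packages (and not something from the original answer), one other common way to shrink this particular dataset is to exploit the fact that 1,337 columns are 0/1: glmnet accepts sparse matrices, so the predictors can be stored as a dgCMatrix instead of a dense matrix. The sketch below assumes the column names from the question (VLU_Incidence, Replicate) and a binary outcome; adjust it to your data.

## Hedged sketch: sparse predictors for LASSO via glmnet.
## Column names and the binomial family are assumptions based on the question.
library(data.table)  # fread() tends to use less memory than read_csv()
library(Matrix)
library(glmnet)

dt <- fread("C:/Users/petas/Desktop/VLU_All_Before_Wide_Sample.csv")

y <- dt$VLU_Incidence
x <- sparse.model.matrix(~ . - 1,
                         data = dt[, !c("VLU_Incidence", "Replicate"), with = FALSE])

rm(dt); gc()  # drop the dense copy before fitting

set.seed(1234)
train_rows <- sample(nrow(x), floor(0.7 * nrow(x)))
fit <- cv.glmnet(x[train_rows, ], y[train_rows],
                 family = "binomial", alpha = 1)  # alpha = 1 -> LASSO; assumes 0/1 outcome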

Related

Memory management in R: ComplexUpset package

I'm trying to plot a stacked barplot inside an upset plot using the ComplexUpset package. The plot I'd like to get looks something like the mpaa example from the package documentation (where mpaa would be component in my case).
I have a dataframe of size 57244 by 21, where one column is ID, another is the type of recording, and the other 19 columns are components 1 through 19:
ID component1 component2 ... component19 type
1 1 0 1 a
2 0 0 1 b
3 1 1 0 b
Ones and zeros indicate affiliation with a certain component. As shown in the example in the docs, I first convert these ones and zeros to logical, and then try to plot the basic upset plot. Here's the code:
df <- df %>% mutate(across(where(is.numeric), as.logical))
components <- colnames(df)[2:20]
upset(df, components, name='protein', width_ratio = 0.1)
But unfortunately, after processing the last line for a while, it spits out an error message like this:
Error: cannot allocate vector of size 176.2 Mb
Even though I'm on a machine with 32 GB of RAM, I'm sure I couldn't have flooded the memory so badly that 176 MB can't be allocated, so my guess is that I'm managing memory in R incorrectly somehow. Could you please explain what's faulty in my code, if possible?
I also know that the UpSetR package plots the same data, but as far as I know it provides no way to do the stacked barplotting.
Somehow, it works if you:
Tweak the min_size parameter so that the plot is not overloaded and makes a better impression.
Pass a sample of the data as the first argument of upset(); this also helps, even if your sample is the whole dataset (see the sketch below).
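A minimal sketch combining both suggestions (the min_size value is chosen arbitrarily, and the sampling step simply passes a sampled copy of the same data):

library(dplyr)
library(ComplexUpset)

df <- df %>% mutate(across(where(is.numeric), as.logical))
components <- colnames(df)[2:20]

upset(
  df[sample(nrow(df), nrow(df)), ],  # a sampled copy (here: the whole dataset)
  components,
  name = 'protein',
  width_ratio = 0.1,
  min_size = 100                     # hide small intersections to reduce the work
)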

What is the best way to manage/store results from either posthoc.kruskal.dunn.test() or dunn.test(), where my input data is in dataframe format?

I am a newbie in R programming and am seeking help in analyzing metabolomics data: 118 metabolites measured under 4 conditions (3 replicates per condition). I would like to know, for each metabolite, which conditions are significantly different from which. Here is part of my data:
> head(mydata)
Conditions HMDB03331 HMDB00699 HMDB00606 HMDB00707 HMDB00725 HMDB00017 HMDB01173
1 DMSO_BASAL 0.001289121 0.001578235 0.001612297 0.0007772231 3.475837e-06 0.0001221674 0.02691318
2 DMSO_BASAL 0.001158363 0.001413287 0.001541713 0.0007278363 3.345166e-04 0.0001037669 0.03471329
3 DMSO_BASAL 0.001043537 0.002380287 0.001240891 0.0008595932 4.007387e-04 0.0002033625 0.07426482
4 DMSO_G30 0.001195253 0.002338346 0.002133992 0.0007924157 4.189224e-06 0.0002131131 0.05000778
5 DMSO_G30 0.001511538 0.002264779 0.002535853 0.0011580857 3.639661e-06 0.0001700157 0.02657079
6 DMSO_G30 0.001554804 0.001262859 0.002047611 0.0008419137 6.350990e-04 0.0000851638 0.04752020
This is what I have so far.
I learned the first line from this post
kwtest_pvl = apply(mydata[,-1], 2, function(x) kruskal.test(x,as.factor(mydata$Conditions))$p.value)
and this is where I loop over the metabolites that passed the KW test:
tCol = colnames(mydata[,-1])[kwtest_pvl <= 0.05]
for (k in tCol){
output = posthoc.kruskal.dunn.test(mydata[,k],as.factor(mydata$Conditions),p.adjust.method = "BH")
}
I am not sure how to manage the output so that it is easy to work with for all the metabolites that passed the KW test. Perhaps saving the output from each iteration and appending it to an Excel file? I also tried the dunn.test package, since it has an option for table or list output, but it still leaves me at the same point. I'm kind of stuck here.
Moreover, should I also apply some kind of p-value adjustment (e.g. FWER, FDR, BH) right after the KW test, before performing the post hoc test?
Any suggestion(s) would be greatly appreciated.
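One way to handle this (a hedged sketch, not an accepted answer from this thread: the PMCMR package is assumed as the source of posthoc.kruskal.dunn.test(), and the output file name is a placeholder) is to collect each post hoc result in a list and bind the pairwise p-values into one long data frame:

library(PMCMR)  # assumed source of posthoc.kruskal.dunn.test()

kwtest_pvl <- apply(mydata[, -1], 2,
                    function(x) kruskal.test(x, as.factor(mydata$Conditions))$p.value)
tCol <- colnames(mydata[, -1])[kwtest_pvl <= 0.05]

results <- lapply(tCol, function(k) {
  ph  <- posthoc.kruskal.dunn.test(mydata[[k]],
                                   as.factor(mydata$Conditions),
                                   p.adjust.method = "BH")
  tab <- as.data.frame(as.table(ph$p.value))  # pairwise p-value matrix -> long format
  names(tab) <- c("group1", "group2", "p_value")
  tab$metabolite <- k
  na.omit(tab)                                # drop the empty upper triangle
})
all_posthoc <- do.call(rbind, results)

write.csv(all_posthoc, "posthoc_results.csv", row.names = FALSE)  # placeholder file name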

R - Efficiently create dataframe from large raster excluding NA values

Apologies for cross-posting something similar on the GIS Stack Exchange.
I am looking for a more efficient way to create a frequency table based on a large raster in R.
Currently, I have a few dozen rasters, ~150 million cells each, and I need to create a frequency table for each one. These rasters are derived from masking a base raster with a few hundred small sampling locations*. Therefore, the rasters I am creating the tables from contain ~99% NA values.
My current working approach is this:
library(raster)
sampling_site_raster <- raster("FILE")
base_raster <- raster("FILE")
sample_raster <- mask(base_raster, sampling_site_raster)
DF <- as.data.frame(freq(sample_raster, useNA='no', progress='text'))
### run time for the freq() process ###
user system elapsed
162.60 4.85 168.40
This uses the freq() function from R's raster package. The useNA='no' argument drops the NA values.
My questions are:
1) Is there a more efficient way to create a frequency table from a large raster that is 99% NA values?
or
2) Is there a more efficient way to derive the values from the base raster than by using mask()? (Using the Mask geoprocessing tool in ArcGIS is very fast, but it still keeps the NA values and is an extra step.)
*additional info: The sample areas represented by sampling_site_raster are irregular shapes of various sizes spread randomly across the study area. In the sampling_site_raster the sampling sites are encoded as 1 and non-sampling areas as NA.
Thank you!
If you mask a raster with another raster, you will always get another huge raster; I don't think that is the way to make things faster.
What I would do is try to mask by the polygon layer using extract():
res <- extract(raster, polygons)
Then you will have all the cell values for each polygon and can run freq on them.
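A hedged sketch of that idea, assuming the sampling sites are available as a polygon layer ("sites.shp" is a placeholder file name):

library(raster)

base_raster <- raster("FILE")
sites <- shapefile("sites.shp")        # sampling sites as polygons (placeholder path)

vals <- extract(base_raster, sites)    # list: one vector of cell values per polygon
freq_tab <- as.data.frame(table(unlist(vals), useNA = "no"))
names(freq_tab) <- c("value", "count")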

How to run mclust faster on a 50,000-record dataset

I am a beginner trying to cluster a data frame (with 50,000 records) that has 2 features (x, y) using the mclust package. However, running a command (e.g. Mclust(XXX.df) or densityMclust(XXX.df)) takes forever.
Is there any way to execute the command faster? Example code would be helpful.
For your information, I'm using a 4-core processor with 6 GB RAM. It took me about 15 minutes to do the same analysis (clustering) with Weka; in R, the process has been running for over 1.5 hours. I really do want to use R for the analysis.
Dealing with large datasets in mclust is described in the mclust Technical Report, subsection 11.1.
Briefly, functions Mclust and mclustBIC include a provision for using a subsample of the data in the hierarchical clustering phase before applying EM to the full data set, in order to extend the method to larger datasets.
Generic example:
library(mclust)
set.seed(1)
##
## Data generation
##
N <- 5e3
df <- data.frame(x=rnorm(N)+ifelse(runif(N)>0.5,5,0), y=rnorm(N,10,5))
##
## Full set
##
system.time(res <- Mclust(df))
# > user system elapsed
# > 66.432 0.124 67.439
##
## Subset for initial stage
##
M <- 1e3
system.time(res <- Mclust(df, initialization=list(subset=sample(1:nrow(df), size=M))))
# > user system elapsed
# > 19.513 0.020 19.546
"Subsetted" version runs approximately 3.5 times faster on my Dual Core (although Mclust uses only single core).
When N<-5e4 (as in your example) and M<-1e3 it took about 3.5 minutes for version with subset to complete.

Running out of memory with merge

I have panel data which looks like this (showing only the part relevant to my question):
Persno 122 122 122 333 333 333 333 333 444 444
Income 1500 1500 2000 2000 2100 2500 2500 1500 2000 2200
year 1990 1991 1992 1990 1991 1992 1993 1994 1992 1993
Now I would like to compute, for every row (Persno), the years of work experience at the beginning of the year. I use ddply:
hilf3 <- ddply(data, .(Persno), summarize, Bgwork = 1:(max(year) - min(year)))
To produce output looking like this:
Workexperience: 1 2 3 1 2 3 4 5 1 2
Now I want to merge the ddply results to my original panel data:
data<-(merge(data,hilf3,by.x="Persno",by.y= "Persno"))
The panel data set is very large, and the code stops because of a memory size error.
Error message:
1: In make.unique(as.character(rows)) :
Reached total allocation of 4000Mb: see help(memory.size)
What should I do?
Re-reading your question, I think you don't actually want to use merge here at all. Just sort your original data frame and attach Bgwork from hilf3 as a new column. Also, your ddply call could result in a 1:0 sequence, which is most likely not what you want. Try:
data = data[order(data$Persno, data$year),]
hilf3 = ddply(data, .(Persno), summarize, Bgwork=(year - min(year) + 1))
stopifnot(nrow(data) == nrow(hilf3))
stopifnot(all(data$Persno == hilf3$Persno))
data$Bgwork = hilf3$Bgwork
Well, perhaps the surest way of fixing this is to get more memory. However, that isn't always an option. What you can do depends somewhat on your platform. On Windows, check the result of memory.size() and compare it to your available RAM; if the memory limit is lower than your RAM, you can increase it. This is not an option on Linux, which by default makes all of your memory available.
Another issue that can complicate matters is whether you are running a 32-bit or 64-bit system, as 32-bit Windows can only address a limited amount of RAM (2-4 GB, depending on settings). This is not an issue if you are using 64-bit Windows 7, which can address far more memory.
A more practical solution is to eliminate all unnecessary objects from your workspace before performing the merge. Run gc() to see how much memory you are using and to reclaim space from objects that no longer have references. Personally, I would probably run your ddply() from a script, save the resulting dataframe as a CSV file, close the workspace, reopen it, and then perform the merge.
Finally, the worst possible option (but one which requires a whole lot less memory) is to create a new dataframe and use R's subsetting commands to copy the columns you want over, one by one. I really don't recommend this as it is tiresome and error prone, but I have had to do it once when there was no other way to complete my analysis (I ended up investing in a new computer with more RAM shortly afterwards).
Hope this helps.
If you need to merge large data frames in R, one good option is to do it in pieces of, say, 10,000 rows. If you're merging data frames x and y, loop over 10,000-row pieces of x, merge them (or rather use plyr::join) with y, and immediately append the results to a single CSV file. After all pieces have been merged and written to file, read that CSV file back in. This is very memory-efficient with proper use of logical index vectors and well-placed rm and gc calls. It's not fast, though. A sketch of this approach is shown below.
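A hedged sketch of the chunked merge, assuming x and y share the key column "Persno" (taken from the question) and using a placeholder output file name:

library(plyr)

chunk_size <- 10000
out_file   <- "merged_in_pieces.csv"   # placeholder file name
n          <- nrow(x)

for (start in seq(1, n, by = chunk_size)) {
  idx   <- start:min(start + chunk_size - 1, n)
  piece <- join(x[idx, ], y, by = "Persno", type = "left")
  write.table(piece, out_file, sep = ",",
              row.names = FALSE,
              col.names = (start == 1),   # write the header only for the first chunk
              append    = (start > 1))
  rm(piece); gc()                         # release memory before the next chunk
}

merged <- read.csv(out_file)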
Since this question was posted, the data.table package has provided a re-implementation of data frames and a merge function that I have found to be much more memory-efficient than R's default. Converting the default data frames to data tables with as.data.table may avoid memory issues.
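For instance, a minimal sketch of the data.table route (x and y are placeholder data frames sharing a key column called "id"):

library(data.table)

dtx <- as.data.table(x)
dty <- as.data.table(y)
setkey(dtx, id)
setkey(dty, id)

merged <- merge(dtx, dty, by = "id")   # data.table's merge method, keyed join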
