High-scale signal processing in R

I have high-dimensional brain-signal data that I would like to explore using R.
As a data scientist I do not really work with Matlab, but with R and Python. Unfortunately, the team I am working with uses Matlab to record the signals. Therefore, I have several questions for those of you who are interested in data science.
The Matlab files (the recorded data) are single objects with the following dimensions:
1000*32*6000
1000: the sampling rate of the signal (samples per second).
32: the number of channels.
6000: the duration in seconds, so each recording is 1 hour and 40 minutes long.
The questions/challenges I am facing:
I converted the "mat" files into CSV files so I can use them in R.
However, CSV files are two-dimensional, so the data end up with dimensions 1000*192000.
The CSV files are rather large, about 1.3 gigabytes. Is there a better way to convert "mat" files into something compatible with R, and smaller in size? I have tried "R.matlab" with readMat, but it does not support version 7 Matlab files; so I tried saving as a V6 file instead, but then I get "Error: cannot allocate vector of size 5.7 Gb".
The time it takes to read a CSV file is rather long: about 9 minutes to load the data, and that is with "fread", since the base R function read.csv takes forever. Is there a faster way to read these files?
Once I read the data into R, it is 1000*192000, while it is actually 1000*32*6000. Is there a way to have a multidimensional object in R so that accessing a given channel and time frame becomes easier, something like dataset[1007, 2] for the time frame of the 1007th second on channel 2? The reason I want to access it this way is to compare time frames easily and plot them against each other (see the sketch below).
Any answer to any question would be appreciated.
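One way to get that dataset[second, channel] style of access is to reshape the flat matrix into a three-dimensional array. A minimal sketch, assuming the 192000 CSV columns are laid out channel-fastest (Matlab's column-major reshape) and using a placeholder file name; the mapping should be verified against a few known values:
library(data.table)
flat <- as.matrix(fread("recording.csv"))   # placeholder name; 1000 x 192000 numeric matrix
signals <- array(flat, dim = c(1000, 32, 6000),
                 dimnames = list(NULL, paste0("ch", 1:32), NULL))
frame <- signals[, 2, 1007]                 # the 1000 samples of second 1007 on channel 2
plot(frame, type = "l")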

This is a good reference for reading large CSV files: https://rpubs.com/msundar/large_data_analysis. A key takeaway is to declare the data type of each column you are reading, rather than having the read function decide based on the content.
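For an all-numeric file like the one described above, that can be as simple as the following sketch (placeholder file name; for mixed files fread also accepts named types):
library(data.table)
# declare every column numeric so fread does not have to guess the types
dt <- fread("recording.csv", colClasses = "numeric")
# mixed files can name the types explicitly:
# dt <- fread("data.csv", colClasses = c(id = "integer", value = "numeric"))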

Related

Processing very large files in R

I have a dataset that is 188 million rows with 41 columns. It comes as a massive compressed fixed width file and I am currently reading it into R using the vroom package like this:
d <- vroom_fwf('data.dat.gz',
               fwf_positions([41 column positions],
                             [41 column names]))
vroom does a wonderful job here in the sense that the data are actually read into an R session on a machine with 64 GB of memory. When I run object.size on d it is a whopping 61 GB in size. But when I turn around to do anything with this data I can't: all I get back is Error: cannot allocate vector of size {x} Gb, because there really isn't any memory left to do much of anything with that data. I have tried base R subsetting with [, dplyr::filter, and converting to a data.table via data.table::setDT, each with the same result.
So my question is: what are people's strategies for this type of thing? My main goal is to convert the compressed fixed-width file to parquet format, but I would like to split it into smaller, more manageable files based on the values of a column in the data and then write those to parquet (using arrow::write_parquet).
My idea at this point is to read a subset of columns, keeping the column that I want to split by, write the parquet files, then bind/merge the pieces back together. That seems like a more error-prone solution, though, so I thought I would turn here and see what is available for this kind of conversion.
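A rough, untested sketch of that column-subset idea, with placeholder column positions and names: each subset keeps the splitting column so arrow can partition the output, and the pieces can later be re-joined on a key column.
library(vroom)
library(arrow)
# placeholder layout; the real 41 column positions and names go here
positions <- fwf_positions(start = c(1, 6, 11),
                           end = c(5, 10, 15),
                           col_names = c("split_col", "col_1", "col_2"))
subset_1 <- vroom_fwf('data.dat.gz', positions,
                      col_select = c(split_col, col_1, col_2))
# one parquet directory per value of split_col under parquet/subset_1/
write_dataset(subset_1, "parquet/subset_1",
              format = "parquet", partitioning = "split_col")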

Reading very large CSV files with many columns in R

I am dealing with very large CSV files of 1-10 GB. I have figured out that I need to use the ff package for reading in the data. However, this does not seem to work. I suspect that the problem is that I have approximately 73,000 columns, and since ff reads row-wise, the size is too high for R's memory. My computer has 128 GB of memory, so the hardware should not be a limitation.
Is there any way of reading the data column-wise instead?
Note: in each file there are 10 rows of text which need to be removed before the file can be read as a matrix. I have previously dealt with this by using read.csv(file, skip = 10, header = T, fill = T) on smaller files of the same type.
Here is a picture of how a smaller version of the data set looks in Excel:
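A possible workaround (not ff-based, with a placeholder file name and arbitrary column indices) is to pull the columns in slices with data.table::fread, which skips the 10 text lines the same way as the read.csv call above and only keeps the selected columns in memory:
library(data.table)
# read only columns 1-5000 of the wide file; repeat with further index ranges
first_block <- fread("bigfile.csv", skip = 10, header = TRUE,
                     select = 1:5000)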

What is the fastest way and fastest format for loading large data sets into R [duplicate]

This question already has answers here:
Quickly reading very large tables as dataframes
(12 answers)
Closed 7 years ago.
I have a large dataset (about 13GB uncompressed) and I need to load it repeatedly. The first load (and save to a different format) can be very slow but every load after this should be as fast as possible. What is the fastest way and fastest format from which to load a data set?
My suspicion is that the optimal choice is something like
saveRDS(obj, file = 'bigdata.Rda', compress = FALSE)
obj <- readRDS('bigdata.Rda')
But this seems slower than using the fread function in the data.table package. This should not be the case, because fread has to parse a text CSV file (although it is admittedly highly optimized).
Some timings for a ~800MB dataset are:
> system.time(tmp <- fread("data.csv"))
Read 6135344 rows and 22 (of 22) columns from 0.795 GB file in 00:00:43
   user  system elapsed
  36.94    0.44   42.71
> saveRDS(tmp, file = 'tmp.Rda')
> system.time(tmp <- readRDS('tmp.Rda'))
   user  system elapsed
  69.96    2.02   84.04
Previous Questions
This question is related but does not reflect the current state of R; for example, an answer suggests that reading from a binary format will always be faster than from a text format. The suggestion to use *SQL is also not helpful in my case, as the entire data set is required, not just a subset of it.
There are also related questions about the fastest way of loading data once (eg: 1).
It depends on what you plan on doing with the data. If you want the entire data set in memory for some operation, then I guess your best bet is fread or readRDS (the file size for data saved as RDS is much, much smaller, if that matters to you).
If you will be doing summary operations on the data, I have found a one-time conversion to a database (using sqldf) a much better option, as subsequent operations run much faster as SQL queries on the data; but that is also because I don't have enough RAM to load a 13 GB file in memory.
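As one possible shape of that one-time conversion, here is a sketch using DBI/RSQLite with a chunked read in place of sqldf's wrapper; file, table, and column names are placeholders:
library(DBI)
library(RSQLite)
library(readr)
con <- dbConnect(SQLite(), "bigdata.sqlite")   # persistent on-disk database
# stream the CSV into SQLite in chunks so the full file never has to sit in RAM
read_csv_chunked(
  "data.csv",
  callback = SideEffectChunkCallback$new(function(chunk, pos) {
    dbWriteTable(con, "bigdata", chunk, append = TRUE)
  }),
  chunk_size = 100000
)
# later sessions run SQL against the database instead of reloading the file
res <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM bigdata")
dbDisconnect(con)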

correlation matrix using large data sets in R when ff matrix memory allocation is not enough

I have a simple analysis to do: I just need to calculate the correlation of the columns (or rows, if transposed). Simple enough? Yet I have been unable to get the results for a whole week, and I have looked through most of the solutions here.
My laptop has 4 GB of RAM. I do have access to a server with 32 nodes. My data cannot be posted here as it is huge (411k columns and 100 rows). If you need any other information or maybe part of the data, I can try to put it up here, but the problem can easily be explained without having to see the data. I simply need to get a correlation matrix of size 411k x 411k, which means I need to compute the correlation among the rows of my data.
Concepts I have tried to code: (all of them in some way give me memory issues or run forever)
The most simple way: one row against all, writing the results out using append = T. (Runs forever.)
bigcorPar.r by bobthecat (https://gist.github.com/bobthecat/5024079), splitting the data into blocks and using an ff matrix. (Unable to allocate memory for the corMAT matrix using ff() on my server.)
Splitting the data into sets (every 10,000 contiguous rows form a set) and correlating each set against the others (same logic as bigcorPar), but I am unable to find a way to store all the pieces together to generate the final 411k x 411k matrix; a rough sketch of this block-wise idea is given after this question.
I am attempting this now: bigcorPar.r on 10,000 rows against 411k (so the 10,000 are divided into blocks), saving the results in separate CSV files.
I am also attempting to run every 1,000 rows vs the 411k on one node of my server; today is my third day and I am still on row 71.
I am not an R pro, so I could only attempt this much. Either my code runs forever or I do not have enough memory to store the results. Are there any more efficient ways to tackle this issue?
Thanks for all your comments and help.
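A minimal sketch of the block-wise idea described above, assuming the 100-row matrix itself fits in memory (only the full 411k x 411k result does not) and with arbitrary block size, directory, and file names:
block_cor_to_disk <- function(x, block_size = 10000, out_dir = "cor_blocks") {
  # x: numeric matrix with observations in rows and the 411k variables in columns
  dir.create(out_dir, showWarnings = FALSE)
  starts <- seq(1, ncol(x), by = block_size)
  for (i in starts) {
    cols_i <- i:min(i + block_size - 1, ncol(x))
    for (j in starts[starts >= i]) {            # upper triangle only; cor is symmetric
      cols_j <- j:min(j + block_size - 1, ncol(x))
      tile <- cor(x[, cols_i], x[, cols_j])     # one block of the big correlation matrix
      saveRDS(tile, file.path(out_dir, sprintf("cor_%d_%d.rds", i, j)))
    }
  }
}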
I'm familiar with this problem myself in the context of genetic research.
If you are interested only in the significant correlations, you may find my package MatrixEQTL useful (available on CRAN, more info here: http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/ ).
If you want to keep all correlations, I'd like to first warn you that in binary format (economical compared to text) it would take 411,000 x 411,000 x 8 bytes = 1.3 TB. If this is what you want and you are OK with the storage required for that, I can provide my code for such calculations and storage.

Load large datasets into data frame [duplicate]

This question already has answers here:
Quickly reading very large tables as dataframes
(12 answers)
Closed 8 years ago.
I have a dataset stored in a text file with 997 columns and 45,000 rows. All values are doubles except the row names and column names. I used RStudio with the read.table command to read the data file, but it seemed to be taking hours, so I aborted it.
Even Excel opens it in about 2 minutes.
RStudio seems to lack efficiency at this task. Any suggestions on how to make it faster? I don't want to re-read the data file every time.
I plan to load it once and store it in an RData object, which should make loading faster in the future. But the first load does not seem to work.
I am not a computer science graduate; any kind of help will be much appreciated.
I recommend data.table, although you will end up with a data.table object. If you choose not to work with a data.table, you can simply convert it back to a normal data frame (see below).
require(data.table)
data <- fread('yourpathhere/yourfile')
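If a plain data frame is wanted afterwards, as mentioned above, data.table can convert in place:
setDF(data)   # or: data <- as.data.frame(data)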
As documented in the ?read.table help file, there are three arguments that can dramatically speed up and/or reduce the memory required to import data. First, by telling read.table what kind of data each column contains, you avoid the overhead of read.table trying to guess the type of each column. Second, by telling read.table how many rows the data file has, you avoid allocating more memory than is actually required. Finally, if the file does not contain comments, you can reduce the resources required to import the data by telling R not to look for comments. Using all of these techniques I was able to read a .csv file with 997 columns and 45,000 rows in under two minutes on a laptop with relatively modest hardware:
tmp <- data.frame(matrix(rnorm(997*45000), ncol = 997))
write.csv(tmp, "tmp.csv", row.names = FALSE)
system.time(x <- read.csv("tmp.csv", colClasses="numeric", comment.char = ""))
# user system elapsed
#115.253 2.574 118.471
I tried reading the file using the default read.csv arguments, but gave up after 30 minutes or so.
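The nrows hint mentioned above would be passed the same way (45000 in this case); it was not part of the timed call:
x <- read.csv("tmp.csv", colClasses = "numeric",
              comment.char = "", nrows = 45000)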
