I am dealing with very large CSV files of 1-10 GB. I have figured out that I need to use the ff package for reading in the data. However, this does not seem to work. I suspect that the problem is that I have approximately 73,000 columns and, since ff reads row-wise, the size is too high for R's memory. My computer has 128 GB of memory, so the hardware should not be a limitation.
Is there any way of reading the data column-wise instead?
Note: In each file there are 10 rows of text which need to be removed before the file can be read as a matrix successfully. I have previously dealt with this by using read.csv(file, skip = 10, header = T, fill = T) on smaller files of the same type.
Here is a picture of how a smaller version of the data set looks in Excel:
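Not from the original question, but one possible workaround sketch: data.table::fread can skip leading lines and keep only a subset of columns via its select argument, so the 73,000 columns could be read in blocks with only one block held in memory at a time. The file name and block size below are assumptions.
library(data.table)
# Read the wide file in blocks of columns; only the selected block is stored in memory,
# although fread still scans every line of the file on each pass.
n_cols     <- 73000                     # assumed total number of columns
block_size <- 5000                      # assumed block size
blocks     <- split(seq_len(n_cols), ceiling(seq_len(n_cols) / block_size))
col_blocks <- lapply(blocks, function(cols) {
  # skip = 10 drops the 10 leading rows of text, as in the original read.csv call
  fread("bigfile.csv", skip = 10, header = TRUE, select = cols)
})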
Related
I need to read a huge dataset, trim it down to a tiny one, and then use it in my program. After trimming, the memory is not released (regardless of using gc() and rm()). I am puzzled by this behavior.
I am on Linux, R 4.2.1. I read a huge .Rds file (>10 Gb), both with the base function and the readr version. Memory usage shows 14.58 Gb. I do operations that decrease its size to 800 rows and 24.7 Mb, but memory usage stays the same within this session regardless of what I do. I have tried:
Piping readRDS directly into trimming functions and only storing the trimmed result;
First reading rds into a variable and then replacing it with the trimmed version;
Reading rds into a variable, storing the trimmed data in a new variable, and then removing the big dataset with rm() followed by garbage collection gc().
I understand what the workaround should be: a bash script that first creates a temporary file with the reduced dataset and then runs a separate R session to work with that dataset. But it feels like this shouldn't be happening?
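For concreteness, a minimal sketch of the third attempt described above; the file name and the trimming condition are placeholders, not from the original post.
library(dplyr)
big   <- readRDS("huge_file.Rds")      # resident memory jumps to roughly the size of the object
small <- filter(big, year >= 2021)     # hypothetical trimming step down to a few hundred rows
rm(big)                                # drop the only reference to the large object
gc()                                   # run the garbage collector explicitly; OS-level usage may still stay high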
I have a very large species dataset from GBIF (178 GB zipped; when unzipped it is approximately 800 GB of TSV). My Mac only has 512 GB of storage and 8 GB of RAM; however, I do not need all of this data.
Are there any approaches I can take to unzip the file without eating all of my memory, and to extract a portion of the dataset by filtering rows based on a column? For example, it has occurrence values going back to 1600, but I only need data for the last 2 years, which I believe my PC can more than handle. Perhaps there is a library with a function that can filter rows while loading the data?
I am unsure of how to unzip properly. I have looked at unzipping libraries, and according to this article unzip truncates data over 4 GB. My worry is where I could store 800 GB of data once it is unzipped.
Update:
It seems that all the packages I have come across stop at 4 GB after decompression. I am wondering if it is possible to create a function that decompresses up to the 4 GB mark, records that point (or what has been retrieved so far), begins decompression again from that point, and continues until the whole .zip file has been decompressed. It could store the decompressed files in a folder so that they can be accessed with something like list.files(). Any ideas if this can be done?
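Not from the original question, but one commonly suggested direction is to stream the table in chunks and keep only the rows that are needed, so the full 800 GB never has to sit on disk or in memory at once. A minimal sketch with readr::read_tsv_chunked, assuming a column named year exists and that readr can decompress the archive on the fly (it handles single-file zip/gz archives in many cases; otherwise decompress with a system tool first and stream the TSV):
library(readr)
library(dplyr)
# Keep only rows from the last two years; the column name "year" and the
# file name are assumptions, not taken from the original post.
keep_recent <- DataFrameCallback$new(function(chunk, pos) {
  filter(chunk, year >= 2021)
})
recent <- read_tsv_chunked(
  "occurrence.zip",
  callback   = keep_recent,
  chunk_size = 100000          # rows held in memory at a time
)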
I have a dataset that is 188 million rows with 41 columns. It comes as a massive compressed fixed width file and I am currently reading it into R using the vroom package like this:
d <- vroom_fwf('data.dat.gz',
               fwf_positions([41 column position],
                             [41 column names]))
vroom does a wonderful job here in the sense that the data are actually read into an R session on a machine with 64 Gb of memory. When I run object.size on d it is a whopping 61 Gb in size. But when I turn around to do anything with this data, I can't: all I get back is Error: cannot allocate vector of size {x} Gb, because there really isn't any memory left to do much of anything with that data. I have tried base R with [, dplyr::filter, and converting to a data.table via data.table::setDT, each with the same result.
So my question is: what are people's strategies for this type of thing? My main goal is to convert the compressed fixed-width file to parquet format, but I would like to split it into smaller, more manageable files based on the values in one of the columns, and then write them to parquet (using arrow::write_parquet).
My idea at this point is to read a subset of columns, keeping the column I want to split by, write the parquet files, and then bind the columns / merge the two back together. This seems like a more error-prone solution, though, so I thought I would turn here and see what is available for further conversions.
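Not part of the original question, but a hedged sketch of one chunked approach: read the fixed-width file a block of rows at a time with readr::read_fwf and append each block to a partitioned parquet dataset with arrow::write_dataset, so the full 61 Gb never sits in memory at once. The column positions, column names, and the partitioning column (group_col) below are placeholders.
library(readr)
library(arrow)
# Placeholder layout; the real file has 41 columns. In practice pass col_types
# explicitly so every chunk is parsed consistently.
positions <- fwf_positions(start = c(1, 10), end = c(9, 15),
                           col_names = c("group_col", "value"))
chunk_rows <- 1e6
chunk_id   <- 0
repeat {
  chunk <- read_fwf("data.dat.gz", col_positions = positions,
                    skip = chunk_id * chunk_rows, n_max = chunk_rows)
  if (nrow(chunk) == 0) break
  write_dataset(chunk, "parquet_out", format = "parquet",
                partitioning = "group_col",                               # one folder per value
                basename_template = paste0("chunk-", chunk_id, "-{i}.parquet"))
  chunk_id <- chunk_id + 1
}
# Afterwards the whole collection can be queried lazily:
# ds <- open_dataset("parquet_out")
Because skip re-reads the compressed file from the start on every pass, this is slow but memory-bounded; the upside is that the resulting directory can then be filtered with open_dataset() without ever loading everything at once.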
I have high-dimensional data, for brain signals, that I would like to explore using R.
Since I am a data scientist, I really do not work with Matlab but with R and Python. Unfortunately, the team I am working with uses Matlab to record the signals. Therefore, I have several questions for those of you who are interested in data science.
The Matlab files, recorded data, are single objects with the following dimensions:
1000*32*6000
1000: denotes the sampling rate of the signal.
32: denotes the number of channels.
6000: denotes the time in seconds, so that is 1 hour and 40 minutes long.
The questions/challenges I am facing:
I converted the "mat" files I have into CSV files so that I can use them in R. However, CSV files are 2-dimensional, with the dimensions 1000*192000.
1. The CSV files are rather large, about 1.3 gigabytes. Is there a better way to convert "mat" files into something compatible with R, and smaller in size? I have tried "R.matlab" with readMat, but it is not compatible with the version-7 Matlab format; so I tried to save as a v6 file, but then I get "Error: cannot allocate vector of size 5.7 Gb".
2. The time it takes to read the CSV file is rather long: about 9 minutes to load the data, and that is using "fread", since the base R function read.csv takes forever. Is there a better way to read files faster?
3. Once I read the data into R, it is 1000*192000, while it is actually 1000*32*6000. Is there a way to have a multidimensional object in R where accessing signals and time frames at a given time becomes easier, like dataset[1007, 2], which would be the time frame of the 1007th second and channel 2? The reason I want to access it this way is to compare time frames easily and plot them against each other.
Any answer to any question would be appreciated.
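Regarding the third point, and not from the original question: base R arrays are n-dimensional, so the flat 1000*192000 matrix can be reshaped into a 1000*32*6000 array, assuming the 192000 columns are ordered channel-within-second (this assumption has to be checked against how the Matlab export flattened the array; otherwise the dimensions need to be permuted with aperm()).
library(data.table)
flat <- as.matrix(fread("signals.csv"))            # hypothetical file name, 1000 x 192000
# samples x channels x seconds; array() fills column-major, so columns 1..32 of
# `flat` become the 32 channels of second 1, columns 33..64 second 2, and so on.
signals <- array(flat, dim = c(1000, 32, 6000),
                 dimnames = list(NULL, paste0("ch", 1:32), NULL))
frame <- signals[, 2, 1007]                        # all 1000 samples for channel 2 at second 1007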
This is a good reference for reading large CSV files: https://rpubs.com/msundar/large_data_analysis A key takeaway is to assign the datatype for each column that you are reading, rather than having the read function decide based on the content.
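As an illustration of that point (the file name, column names, and types here are made up, not from the question), the types can be declared up front so the reader skips its type-guessing pass:
library(data.table)
dt <- fread("big.csv",
            colClasses = list(character = "id",
                              numeric   = c("value1", "value2"),
                              integer   = "year"))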
I have a dataset stored in a text file; it has 997 columns and 45000 rows. All values are doubles except the row names and column names. I use RStudio with the read.table command to read the data file, but it seemed to take hours, so I aborted it.
Even Excel opens it in about 2 minutes.
RStudio seems to lack efficiency in this task. Any suggestions on how to make it faster? I don't want to re-read the data file every time.
I plan to load it once and store it in an Rdata object, which should make loading the data faster in the future. But the first load does not seem to work.
I am not a computer graduate; any kind of help will be appreciated.
I recommend data.table, although you will end up with a data.table object after this. If you choose not to use the data.table, you can simply convert it back to a normal data frame.
require(data.table)
data <- fread('yourpathhere/yourfile')
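If a plain data frame is preferred afterwards, the conversion is a one-liner (setDF converts in place without copying):
data.table::setDF(data)    # or: data <- as.data.frame(data)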
As documented in the ?read.table help file, there are three arguments that can dramatically speed up and/or reduce the memory required to import data. First, by telling read.table what kind of data each column contains (colClasses) you avoid the overhead of having read.table guess the type of each column. Second, by telling read.table how many rows the data file has (nrows) you avoid allocating more memory than is actually required. Finally, if the file does not contain comments, you can reduce the resources required to import the data by telling R not to look for comments (comment.char = ""). Using all of these techniques, I was able to read a .csv file with 997 columns and 45000 rows in under two minutes on a laptop with relatively modest hardware:
tmp <- data.frame(matrix(rnorm(997*45000), ncol = 997))
write.csv(tmp, "tmp.csv", row.names = FALSE)
system.time(x <- read.csv("tmp.csv", colClasses="numeric", comment.char = ""))
# user system elapsed
#115.253 2.574 118.471
I tried reading the file using the default read.csv arguments, but gave up after 30 minutes or so.