Load large datasets into data frame [duplicate] - r

This question already has answers here:
Quickly reading very large tables as dataframes
(12 answers)
Closed 8 years ago.
I have a dataset stored in a text file: 997 columns, 45,000 rows. All values are doubles except the row names and column names. I used RStudio with the read.table command to read the data file, but it seemed to be taking hours, so I aborted it.
Even Excel opens the same file in about 2 minutes.
read.table seems to lack efficiency for this task. Any suggestions on how to make it faster? I don't want to re-read the data file every time.
I plan to load it once and store it in an .RData object, which should make loading the data faster in the future. But the first load is the problem.
I am not a computer science graduate; any kind of help will be appreciated.

I recommend data.table, although you will end up with a data.table rather than a data frame. If you prefer not to work with a data.table, you can simply convert it back to a normal data frame afterwards.
library(data.table)
data <- fread('yourpathhere/yourfile')
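If you want a plain data frame afterwards, data.table's setDF() does the conversion in place without copying. A short sketch, continuing from the fread call above:

```r
# Convert the data.table back to a plain data.frame, in place (no copy):
setDF(data)

# or, if you prefer an explicit copy:
data <- as.data.frame(data)
```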

As documented in the ?read.table help file, there are three arguments that can dramatically speed up and/or reduce the memory required to import data. First, by telling read.table what kind of data each column contains (colClasses) you can avoid the overhead of read.table guessing the type of each column. Second, by telling read.table how many rows the file has (nrows) you can avoid allocating more memory than is actually required. Finally, if the file does not contain comments, you can reduce the resources required by telling R not to look for them (comment.char = ""). Using all of these techniques I was able to read a .csv file with 997 columns and 45,000 rows in under two minutes on a laptop with relatively modest hardware:
tmp <- data.frame(matrix(rnorm(997*45000), ncol = 997))
write.csv(tmp, "tmp.csv", row.names = FALSE)
system.time(x <- read.csv("tmp.csv", colClasses="numeric", comment.char = ""))
# user system elapsed
#115.253 2.574 118.471
I tried reading the file using the default read.csv arguments, but gave up after 30 minutes or so.
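Combining those arguments with the one-time binary save the asker had in mind might look like this. A sketch on a small demo file; for the real data, point read.csv at your own file and set nrows = 45000:

```r
# Demo on a small file; swap in your real path and nrows = 45000.
tmp <- data.frame(matrix(rnorm(10 * 5), ncol = 5))
write.csv(tmp, "demo.csv", row.names = FALSE)

# First load, sped up by type hints and no comment scanning:
dat <- read.csv("demo.csv",
                colClasses = "numeric",  # every column is double
                nrows = 10,              # pre-allocate the right number of rows
                comment.char = "")       # don't scan for comments

saveRDS(dat, "demo.rds", compress = FALSE)  # one-time save to R's binary format
dat <- readRDS("demo.rds")                  # later sessions skip text parsing
```

Subsequent sessions then only ever pay the readRDS cost, which is far cheaper than parsing the text file again.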

Related

Processing very large files in R

I have a dataset that is 188 million rows with 41 columns. It comes as a massive compressed fixed width file and I am currently reading it into R using the vroom package like this:
d <- vroom_fwf('data.dat.gz',
               fwf_positions([41 column position],
                             [41 column names]))
vroom does a wonderful job here, in the sense that the data are actually read into an R session on a machine with 64 GB of memory. When I run object.size on d, it is a whopping 61 GB in size. But when I turn around to do anything with this data, I can't. All I get back is Error: cannot allocate vector of size {x} Gb, because there really isn't any memory left to do much of anything with the data. I have tried base R with [, dplyr::filter, and converting to a data.table via data.table::setDT, each with the same result.
So my question is: what are people's strategies for this type of thing? My main goal is to convert the compressed fixed-width file to Parquet format, but I would like to split it into smaller, more manageable files based on the values in one of the columns, then write them to Parquet (using arrow::write_parquet).
My idea at this point is to read a subset of columns (keeping the column I want to split by), write the Parquet files, then bind/merge the pieces back together. This seems like a more error-prone solution, though, so I thought I would ask here and see what else is available.
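One route that fits the stated goal is arrow's write_dataset(), which can do the per-value splitting and the Parquet conversion in a single call. A hedged sketch: starts, ends, col_names, split_col, and needed_cols are all placeholders for the real 41-column layout, and col_select keeps memory down by not loading every column:

```r
library(vroom)
library(arrow)

# Read only the columns needed, plus the column to split by.
# starts/ends/col_names stand in for the real fixed-width spec.
d <- vroom_fwf("data.dat.gz",
               fwf_positions(starts, ends, col_names),
               col_select = c(split_col, needed_cols))

# write_dataset() writes one Parquet partition per value of split_col,
# so the splitting and the conversion happen in one step:
write_dataset(d, path = "parquet_out", partitioning = "split_col")
```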

Read huge csv file using `read.csv` by divide-and-conquer strategy?

I am supposed to read a big csv file (5.4 GB, with 7M lines and 205 columns) in R. I have successfully read it using data.table::fread(). But I want to know: is it possible to read it using the basic read.csv()?
I tried brute force, but my 16 GB of RAM cannot hold that. Then I tried the 'divide-and-conquer' (chunking) strategy below, but it still didn't work. How should I do this?
dt1 <- read.csv('./ss13hus.csv', header = FALSE, nrows = 721900, skip = 1)
print(paste(1, 'th chunk completed'))
system.time(
  for (i in 1:9) {
    tmp <- read.csv('./ss13hus.csv', header = FALSE, nrows = 721900,
                    skip = i * 721900 + 1)
    dt1 <- rbind(dt1, tmp)
    print(paste(i + 1, 'th chunk completed'))
  }
)
Also, I want to know how fread() works such that it can read all the data at once so efficiently, in terms of both memory and time?
Your issue is not fread(); it's the memory bloat caused by not defining colClasses for all your 205 columns. But be aware that trying to read all 5.4 GB into 16 GB of RAM is really pushing it in the first place: you almost surely won't be able to hold the whole dataset in memory, and even if you could, you would blow out memory whenever you tried to process it. So your approach is not going to fly; you seriously have to decide which subset you can handle, i.e. which fields you absolutely need to get started:
Define colClasses for your 205 columns: 'integer' for integer columns, 'numeric' for double columns, 'logical' for boolean columns, 'factor' for factor columns. Otherwise things get stored very inefficiently (e.g. millions of strings are very wasteful), and the result can easily be 5-100x larger than the raw file.
If you can't fit all 7M rows x 205 columns (which you almost surely can't), then you'll need to aggressively reduce memory by doing some or all of the following:
read in and process chunks (of rows) (use skip, nrows arguments, and search SO for questions on fread in chunks)
filter out all unneeded rows (e.g. you may be able to do some crude processing to form a row-index of the subset rows you care about, and import that much smaller set later)
drop all unneeded columns (use fread's select or drop arguments; specify vectors of column names to keep or drop)
Make sure stringsAsFactors = FALSE; it's a notoriously bad default in R which causes no end of memory grief.
Date/datetime fields are currently read as character (which is bad news for memory usage: millions of unique strings). Either drop the date columns entirely to begin with, or read the data in chunks and convert them with the fasttime package or standard base functions.
Look at the args for NA treatment. You might want to drop columns with lots of NAs, or messy unprocessed string fields, for now.
Please see ?fread and the data.table doc for syntax for the above. If you encounter a specific error, post a snippet of say 2 lines of data (head(data)), your code and the error.
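Putting that advice together, a call along these lines is the general shape. A sketch only: the column names here (SERIALNO, ST, NP) are hypothetical stand-ins for whichever of the 205 columns you actually need:

```r
library(data.table)

dt <- fread("ss13hus.csv",
            select = c("SERIALNO", "ST", "NP"),   # keep only needed columns
            colClasses = list(character = "SERIALNO",
                              integer   = c("ST", "NP")),
            stringsAsFactors = FALSE,
            na.strings = c("", "NA"))
```

Dropping unneeded columns with select is usually the single biggest win, since the memory cost scales with columns kept, not columns in the file.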

read.csv to import more than 2105 columns?

Leaving question for archival purposes only.
(Read.csv did read in all columns, I just did not see them in the preview when opening the data.frame)
Related to this question:
Maximum number of columns that can be read using read.csv
I would like to import a csv file into R that contains about 3200 columns (100 rows). I am used to working with data.frames and read.csv, but my usual approach failed because
data <- read.csv("data.csv", header=TRUE)
only imported the first 2105 columns. It did not display an error message.
How can I read in a csv file with more than 2105 columns?
without specifying column classes
into a data frame
the file contains different data types (dates, strings, numbers, ..)
speed is not my biggest concern
I did not manage to apply the solutions in Quickly reading very large tables as dataframes to my situation. I tried this, but it does not seem to work without information on the column classes:
df <- as.data.frame(scan("data.csv",sep=','))
There are already several questions about reading in large datafiles with millions of rows/columns and how to speed up the process, but my files are much smaller, so I am hoping that there is an easier solution that I overlooked.
Try using data.table.
library(data.table)
data <- fread("data.csv")
(Posted answer on behalf of the OP).
Read.csv did read in all columns, I just did not see them in the preview when opening the data.frame

What is the fastest way and fastest format for loading large data sets into R [duplicate]

This question already has answers here:
Quickly reading very large tables as dataframes
(12 answers)
Closed 7 years ago.
I have a large dataset (about 13GB uncompressed) and I need to load it repeatedly. The first load (and save to a different format) can be very slow but every load after this should be as fast as possible. What is the fastest way and fastest format from which to load a data set?
My suspicion is that the optimal choice is something like
saveRDS(obj, file = 'bigdata.Rda', compress = FALSE)
obj <- readRDS('bigdata.Rda')
But this seems slower than using the fread function in the data.table package. This should not be the case, because fread has to parse a text CSV file (although it is admittedly highly optimized).
Some timings for a ~800MB dataset are:
> system.time(tmp <- fread("data.csv"))
Read 6135344 rows and 22 (of 22) columns from 0.795 GB file in 00:00:43
user system elapsed
36.94 0.44 42.71
> saveRDS(tmp, file = 'tmp.Rda')
> system.time(tmp <- readRDS('tmp.Rda'))
user system elapsed
69.96 2.02 84.04
Previous Questions
This question is related but does not reflect the current state of R; for example, an answer suggests that reading from a binary format will always be faster than a text format. The suggestion to use an SQL database is also not helpful in my case, as the entire data set is required, not just a subset of it.
There are also related questions about the fastest way of loading data once (eg: 1).
It depends on what you plan on doing with the data. If you want the entire data in memory for some operation, then I guess your best bet is fread or readRDS (the file size for data saved as RDS is much, much smaller, if that matters to you).
If you will be doing summary operations on the data, I have found a one-time conversion to a database (using sqldf) a much better option, as subsequent operations are much faster when executed as SQL queries on the data; but that is also because I don't have enough RAM to load a 13 GB file in memory.
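The one-time conversion described above might look like this with sqldf's read.csv.sql, which streams the CSV into an SQLite file without ever loading it all into R. File, table, and column names here are placeholders:

```r
library(sqldf)

# One-time: stream data.csv into a persistent SQLite database on disk.
read.csv.sql("data.csv",
             sql = "CREATE TABLE dat AS SELECT * FROM file",
             dbname = "bigdata.sqlite")

# Later sessions run summaries inside the database instead of in RAM:
res <- sqldf("SELECT grp, AVG(value) AS mean_value FROM dat GROUP BY grp",
             dbname = "bigdata.sqlite")
```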

Optimizing File reading in R

My R application reads input data from large txt files. It does not read the entire file in one shot. Users specify the names of genes (3 or 4 at a time), and based on the user input the app goes to the appropriate rows and reads the data.
File format: 32,000 rows (one gene per row; the first two columns contain info about the gene name, etc.) and 35,000 columns with numerical data (decimal numbers).
I used read.table(filename, skip = 10000), etc., to go to the right row, then read the 35,000 columns of data. Then I do this again for the 2nd gene, 3rd gene (up to 4 genes max), and then process the numerical results.
The file reading operations take about 1.5 to 2 minutes. I am experimenting with reading the entire file and then picking out the data for the desired genes.
Is there any way to accelerate this? I can rewrite the gene data in another format (one-time processing) if that will speed up future reads.
You can use the colClasses argument to read.table to speed things up, if you know the exact format of your files. For 2 character columns and 34,998 (?) numeric columns, you would use
colClasses = c(rep("character",2), rep("numeric",34998))
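As a usage sketch (the file name, skip count, and exact column count are placeholders to be matched to your actual file):

```r
# Type hints let read.table skip per-column type guessing entirely.
cc <- c(rep("character", 2), rep("numeric", 34998))
gene_row <- read.table("genes.txt", skip = 10000, nrows = 1,
                       colClasses = cc)
```

Setting nrows = 1 also avoids reading past the target gene's row.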
It would be more efficient to use a database interface. There are several options (e.g. via the RODBC package), but a particularly well-integrated-with-R option is the sqldf package, which by default uses SQLite. You would then be able to use the indexing capability of the database to look up the correct rows and read all the columns in one operation.
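A minimal sketch of that approach with DBI/RSQLite (sqldf uses RSQLite underneath; the file name and the gene_name column are assumptions about your data):

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), "genes.sqlite")

# One-time conversion: load the gene table and index the gene-name column.
genes <- read.table("genes.txt", header = TRUE)
dbWriteTable(con, "genes", genes)
dbExecute(con, "CREATE INDEX idx_gene ON genes(gene_name)")

# Each user request now reads only the matching rows, all columns at once:
hits <- dbGetQuery(con, "SELECT * FROM genes WHERE gene_name = ?",
                   params = list("BRCA1"))
dbDisconnect(con)
```

The index turns each lookup into an O(log n) seek instead of a scan over 32,000 rows, so per-query time should drop from minutes to well under a second.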
