Read a huge csv file with `read.csv` using a divide-and-conquer strategy? - r

I am supposed to read a big csv file (5.4GB, with 7m lines and 205 columns) in R. I have successfully read it using data.table::fread(), but I want to know whether it is possible to read it with the base read.csv().
I tried brute force, but my 16GB of RAM cannot hold it. Then I tried the 'divide-and-conquer' (chunking) strategy below, but it still didn't work. How should I do this?
dt1 <- read.csv('./ss13hus.csv', header = FALSE, nrows = 721900, skip = 1)
print(paste(1, 'th chunk completed'))
system.time(
  for (i in 1:9) {
    tmp <- read.csv('./ss13hus.csv', header = FALSE, nrows = 721900, skip = i * 721900 + 1)
    dt1 <- rbind(dt1, tmp)
    print(paste(i + 1, 'th chunk completed'))
  }
)
I would also like to know how fread() manages to read all the data at once so efficiently, in terms of both memory and time.

Your issue is not fread(); it's the memory bloat caused by not defining colClasses for all your (205) columns. But be aware that trying to read all 5.4GB into 16GB of RAM is really pushing it in the first place: you almost surely won't be able to hold the whole dataset in memory, and even if you could, you will blow out memory whenever you try to process it. So your approach is not going to fly; you seriously have to decide which subset you can handle, i.e. which fields you absolutely need to get started:
Define colClasses for your 205 columns: 'integer' for integer columns, 'numeric' for double columns, 'logical' for boolean columns, 'factor' for factor columns. Otherwise things get stored very inefficiently (e.g. millions of strings are very wasteful), and the result can easily be 5-100x larger than the raw file.
If you can't fit all 7m rows x 205 columns (which you almost surely can't), then you'll need to aggressively reduce memory by doing some or all of the following:
read in and process the data in chunks of rows (use the skip and nrows arguments, and search SO for questions on reading with fread in chunks)
filter out all unneeded rows (e.g. you may be able to do some crude processing to form a row-index of the subset rows you care about, and import that much smaller set later)
drop all unneeded columns (use fread's select/drop arguments to specify vectors of column names to keep or drop).
Make sure the option stringsAsFactors=FALSE is set; it's a notoriously bad default in base R which causes no end of memory grief.
Date/datetime fields are currently read as character (which is bad news for memory usage: millions of unique strings). Either drop the date columns entirely to begin with, or read the data in chunks and convert them with the fasttime package or standard base functions.
Look at the args for NA treatment. You might want to drop columns with lots of NAs, or messy unprocessed string fields, for now.
Please see ?fread and the data.table documentation for the syntax for the above; a rough sketch is below. If you encounter a specific error, post a snippet of, say, two lines of data (head(data)), your code and the error.
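For illustration only, a rough sketch of the two main ideas above (read just the columns you need with fread, or chunk with read.csv). The column names, types, positions and chunk count are placeholders you would adapt to your file:
library(data.table)

# (1) fread: read only the columns you need, with explicit types
keep  <- c("SERIALNO", "ST", "NP")        # hypothetical subset of the 205 columns
types <- c(SERIALNO = "character", ST = "integer", NP = "integer")
dt <- fread("./ss13hus.csv", select = keep, colClasses = types)

# (2) read.csv in chunks: give colClasses for all 205 columns so nothing is
#     guessed or stored as factors; "NULL" drops a column entirely
cls <- rep("NULL", 205)                                    # start by dropping everything
cls[c(1, 2, 3)] <- c("character", "integer", "integer")    # keep a few, by position
chunks <- list()
for (i in 0:9) {
  chunks[[i + 1]] <- read.csv("./ss13hus.csv", header = FALSE, colClasses = cls,
                              skip = i * 721900 + 1, nrows = 721900,
                              stringsAsFactors = FALSE)
}
dt2 <- do.call(rbind, chunks)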

Related

R merge large number of data frames

I have the output from a data submission which is in the form of multiple vector list objects in rda files.
Each list object is in a separate rda file and I have nearly 2000 files.
I want to merge all the objects into a single object in a single rda file in the fastest way possible (partly because I may need to repeat this several times).
All the rda files are fairly small (~10 MB each, though that is a compressed size), but it all adds up with the number of files.
Memory isn't a huge problem, as I am running this on a server with >700GB of RAM.
My first approach, incrementally loading the files one by one, concatenating each with the merged list object and then removing the appended object, went badly because of the time it was going to take (something like 40 days at a best guess).
My revised approach is below, but I am wondering whether there is a quicker way to do this, given that I may need to repeat the process:
load("data_1.rda")
load("data_2.rda")
load("data_3.rda") ...
load("data_2000.rda")
my.list <- list()
my.list <- c(my.list, data.1, data.2, data.3, ... , data.2000)
save(my.list, file="my_list.rda")
And just to add to things, I'm getting an error when doing this:
Error: attempt to set index 18446744071562067968/2877912830 in SET_STRING_ELT
It's not a very helpful error message.
All the rda files load as objects into the environment fine, but the error appears when I try to concatenate them, and it seems to happen only once it gets to a particular point, since it doesn't fail immediately. I wasn't sure whether it is some limit on the number of concatenations you can do, or rogue data, but from troubleshooting it appears to be syntax rather than data related.
I have chunked it up into 5 batches and then do a final concatenation before saving the rda. I have seen other answers for this sort of thing suggesting rbind, mget, and do.call or the list function; would using any of these make it faster and achieve the same thing?
Something like this:
my.list <- do.call(rbind, mget(ls(pattern="^data_")))
Thanks
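For what it's worth, a minimal sketch of the mget/do.call idea mentioned above. Note that since the objects are lists, c() rather than rbind() reproduces the original concatenation; the data_N object names are assumed to match the file names:
# load everything, then concatenate in a single call instead of growing my.list
for (f in list.files(pattern = "^data_.*\\.rda$")) load(f)
my.list <- do.call(c, mget(ls(pattern = "^data_")))
save(my.list, file = "my_list.rda")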

Reading multiple csv files faster into data.table R

I have 900,000 csv files which I want to combine into one big data.table. For this I created a for loop which reads the files one by one and adds them to the data.table. The problem is that it performs too slowly and the time used is growing exponentially. It would be great if someone could help me make the code run faster. Each of the csv files has 300 rows and 15 columns.
The code I am using so far:
library(data.table)
setwd("~/My/Folder")
WD <- "~/My/Folder"
data <- data.table(read.csv(text = "X,Field1,PostId,ThreadId,UserId,Timestamp,Upvotes,Downvotes,Flagged,Approved,Deleted,Replies,ReplyTo,Content,Sentiment"))
csv.list <- list.files(WD)
k <- 1
for (i in csv.list) {
  temp.data <- read.csv(i)
  data <- data.table(rbind(data, temp.data))
  if (k %% 100 == 0)
    print(k / length(csv.list))
  k <- k + 1
}
Presuming your files are conventional csv files, I'd use data.table::fread since it's faster. If you're on a Linux-like OS, I would use the fact that it allows shell commands. Presuming your input files are the only csv files in the folder, I'd do:
dt <- fread("tail -n +2 -q ~/My/Folder/*.csv")
You'll need to set the column names manually afterwards.
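For example, reusing the column names from the header shown in the question:
setnames(dt, c("X", "Field1", "PostId", "ThreadId", "UserId", "Timestamp",
               "Upvotes", "Downvotes", "Flagged", "Approved", "Deleted",
               "Replies", "ReplyTo", "Content", "Sentiment"))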
If you wanted to keep things in R, I'd use lapply and rbindlist:
lst <- lapply(csv.list, fread)
dt <- rbindlist(lst)
You could also use plyr::ldply:
dt <- setDT(ldply(csv.list, fread))
This has the advantage that you can use .progress = "text" to get a readout of progress in reading.
All of the above assume that the files all have the same format and have a header row.
Building on Nick Kennedy's answer using plyr::ldply, there is roughly a 50% speed increase from enabling the .parallel option when reading 400 csv files of roughly 30-40 MB each.
Original answer with progress bar:
dt <- setDT(ldply(csv.list, fread, .progress = "text"))
Enabling .parallel, also with a text progress bar:
library(plyr)
library(data.table)
library(doSNOW)
cl <- makeCluster(4)
registerDoSNOW(cl)
pb <- txtProgressBar(max=length(csv.list), style=3)
pbu <- function(i) setTxtProgressBar(pb, i)
dt <- setDT(ldply(csv.list, fread, .parallel=TRUE, .paropts=list(.options.snow=list(progress=pbu))))
stopCluster(cl)
As suggested by @Repmat, use rbind.fill. As suggested by @Christian Borck, use fread for faster reads.
require(data.table)
require(plyr)
files <- list.files("dir/name")
df <- rbind.fill(lapply(files, fread, header=TRUE))
Alternatively you could use do.call, but rbind.fill is faster (http://www.r-bloggers.com/the-rbinding-race-for-vs-do-call-vs-rbind-fill/)
df <- do.call(rbind, lapply(files, fread, header=TRUE))
Or you could use the data.table package, see this
You are growing your data table in a for loop, which is why it takes forever. If you want to keep the for loop as is, first create an empty data frame (before the loop) that has the dimensions you need (rows x columns) and place it in RAM.
Then write to this empty frame in each iteration.
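For illustration, a minimal sketch of the preallocation idea, assuming every file really has exactly 300 rows and 15 columns as stated in the question (rbindlist from the other answers is usually simpler and faster):
n_files <- length(csv.list)
out <- as.data.frame(matrix(NA, nrow = 300 * n_files, ncol = 15))  # preallocated frame
for (k in seq_along(csv.list)) {
  rows <- ((k - 1) * 300 + 1):(k * 300)                            # slot for file k
  out[rows, ] <- read.csv(csv.list[k], stringsAsFactors = FALSE)
}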
Otherwise, use rbind.fill from the plyr package and avoid the loop altogether.
To use rbind.fill:
require(plyr)
data <- rbind.fill(df1, df2, df3, ... , dfN)
To pass the names of the df's, you could/should use an apply function.
I go with @Repmat, as your current solution using rbind() copies the whole data.table in memory every time it is called (which is why the time is growing exponentially). Another way would be to create an empty csv file with only the headers first, and then simply append the data of all your files to this csv file.
# inside a loop over your csv files, where i is the current file name
write.table(fread(i), file = "your_final_csv_file", sep = ";",
            col.names = FALSE, row.names = FALSE, append = TRUE, quote = FALSE)
This way you don't have to worry about putting the data at the right indices in your data.table. Also, as a hint: fread() is the data.table file reader, which is much faster than read.csv.
In general, R wouldn't be my first choice for data munging tasks like this.
One suggestion would be to merge them first in groups of 10 or so, and then merge those groups, and so on. That has the advantage that if individual merges fail, you don't lose all the work. The way you are doing it now not only leads to exponentially slowing execution, but exposes you to having to start over from the very beginning every time you fail.
This approach will also decrease the average size of the data frames involved in the rbind calls, since most of the binds will involve small data frames and only a few large ones at the end. That should eliminate most of the execution time that is currently growing exponentially.
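For illustration, a rough sketch of the grouped idea with data.table, reusing csv.list from the question; the group size of 10 is arbitrary, and doing one group at a time also gives you fault tolerance, since a failed group can be redone on its own:
library(data.table)
groups  <- split(csv.list, ceiling(seq_along(csv.list) / 10))          # groups of 10 files
partial <- lapply(groups, function(g) rbindlist(lapply(g, fread)))     # merge within groups
data    <- rbindlist(partial)                                          # merge the groups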
I think no matter what you do it is going to be a lot of work.
Some things to consider under the assumption you can trust all the input data and that each record is sure to be unique:
Consider creating the table you are importing into without indexes. As indexes get huge, the time needed to manage them during imports grows, so it sounds like this may be what is happening. If this is your issue, it would still take a long time to create the indexes later.
Alternatively, with the amount of data you are discussing, you may want to consider partitioning the data (often done via date ranges). Depending on your database, you may then have individually indexed partitions, easing the indexing effort.
If your demonstration code doesn't resolve down to a database file import utility then use such a utility.
It may be worth processing files into larger data sets prior to importing them. You could experiment with this by combining 100 files into one larger file before loading, for example, and comparing times.
In the event you can't use partitions (depending on the environment and the experience of the database personnel), you can use a home-brewed method of separating data into various tables, for example data201401 to data201412. However, you'd have to roll your own utilities to query across boundaries.
While decidedly not a better option, it is something you could do in a pinch, and it would allow you to retire/expire aged records easily without having to adjust the related indexes. It would also let you load pre-processed incoming data by "partition" if desired.

Removing lines in data.table and spiking memory usage

I have a data.table of a decent size: 89M rows, 3.7GB. Keys are in place, so everything is set up properly. However, I am experiencing a problem when I remove rows based on a column's value: the memory usage just goes through the roof!
Just for the record, I have read the other posts here about this, but they don't really help much. Also, I am using RStudio, which I am pretty sure is not ideal, but it helps while experimenting; I notice the same behaviour in the plain R console, though. I am using Windows.
Let me post an example (taken from a similar question regarding removal of rows) of creating a very big data.table, approximately 1e6 x 100:
rm(list = ls(all = TRUE))      # clean stuff
gc(reset = TRUE)               # call gc (not really helping, but whatever..)
dimension <- 1e6               # let's say a million
DT <- data.table(col1 = 1:dimension)
cols <- paste0('col', 2:100)   # let these be conditions as columns
for (col in cols) { DT[, col := 1:dimension, with = FALSE] }
DT.m <- melt(DT, id = c('col1', 'col2', 'col3'))
OK, so now we have a data.table with 97M rows, approximately 1.8GB. This is our starting point.
Let's remove all rows where the value column (after the melt) equals, say, 4:
DT.m<-DT.m[value!=4]
The last line takes a huge amount of memory! Prior to executing this line, the memory usage on my PC is approximately 4.3GB, and just after the line is executed, it goes to 6.9GB!
This is the correct way to remove the lines, right? (Just checking.) Has anyone come across this behaviour before?
I thought of looping over all parameters and keeping the rows I am interested in in another data.table, but somehow I doubt that this is a proper way of working.
I am looking forward to your help.
Thanks
Nikos
Update: With this commit, the logical vector is replaced by row indices to save memory (Read the post below for more info). Fixed in 1.9.5.
Doing sum(DT.m$value == 4L) gives me 97. That is, you're removing a total of 97 rows out of 97 million. This in turn implies that the subset operation returns a ~1.8GB data set as well.
Your memory usage was 4.3GB to begin with.
The condition you provide, value == 4, takes the space of a logical vector of length 97 million, i.e. ~360MB.
data.table computes which() on that logical vector to fetch the row indices, which here is almost all the rows, i.e. another ~360MB.
The data being subset has to be allocated elsewhere first, and that is ~1.8GB.
The total comes to 4.3 + 1.8 + 0.72 ≈ 6.8GB.
And garbage collection hasn't happened yet. If you now do gc(), the memory corresponding to old DT.m should be released.
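For example, a quick way to see this on the toy data above:
gc(reset = TRUE)           # baseline before the subset
DT.m <- DT.m[value != 4]   # allocates the new ~1.8GB table before the old one can go
gc()                       # the old DT.m is now unreferenced, so this releases it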
The only place where I can see we can save space is by replacing the logical vector with the integer row indices directly (rather than storing the indices in a separate additional vector), saving the extra ~360MB.
Usually which() results in a much smaller (negligible) vector, and the subset is therefore faster; that is the reason for using which(). But in this case you remove only 97 rows, so almost all the indices are kept.
But good to know that we can save a bit of memory. Could you please file an issue here?
Removing rows by reference, #635, when implemented, should both be fast and memory efficient.

Load large datasets into data frame [duplicate]

This question already has answers here: Quickly reading very large tables as dataframes (12 answers). Closed 8 years ago.
I have a dataset stored in a text file; it has 997 columns and 45,000 rows. All values are doubles except the row names and column names. I used RStudio with the read.table command to read the data file, but it seemed to be taking hours, so I aborted it.
Even when I use Excel to open it, it takes only 2 minutes.
RStudio seems to lack efficiency for this task; any suggestions on how to make it faster? I don't want to read the data file from scratch every time.
I plan to load it once and store it in an RData object, which should make loading the data faster in the future. But the first load doesn't seem to be working.
I am not a computer science graduate; any kind of help will be much appreciated.
I recommend data.table, although you will end up with a data.table after this. If you choose not to work with the data.table, you can simply convert it back to a normal data frame.
require(data.table)
data <- fread('yourpathhere/yourfile')
As documented in the ?read.table help file there are three arguments that can dramatically speed up and/or reduce the memory required to import data. First, by telling read.table what kind of data each column contains you can avoid the overhead associated with read.table trying to guess the type of data in each column. Secondly, by telling read.table how many rows the data file has you can avoid allocating more memory than is actually required. Finally, if the file does not contain comments, you can reduce the resources required to import the data by telling R not to look for comments. Using all of these techniques I was able to read a .csv file with 997 columns and 45000 rows in under two minutes on a laptop with relatively modest hardware:
tmp <- data.frame(matrix(rnorm(997*45000), ncol = 997))
write.csv(tmp, "tmp.csv", row.names = FALSE)
system.time(x <- read.csv("tmp.csv", colClasses="numeric", comment.char = ""))
#    user  system elapsed
# 115.253   2.574 118.471
I tried reading the file using the default read.csv arguments, but gave up after 30 minutes or so.
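A variant that also pre-specifies the row count mentioned above (45000 here, matching the question; a mild over-estimate is fine, while an under-estimate would truncate the data):
x <- read.csv("tmp.csv", colClasses = "numeric", comment.char = "", nrows = 45000)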

Optimizing File reading in R

My R application reads input data from large txt files. It does not read the entire file in one shot. Users specify the names of genes (3 or 4 at a time), and based on the user input the app goes to the appropriate rows and reads the data.
File format: 32,000 rows (one gene per row; the first two columns contain info such as the gene name) and 35,000 columns with numerical data (decimal numbers).
I use read.table(filename, skip = 10000) etc. to go to the right row, then read the 35,000 columns of data. I do this again for the 2nd gene, 3rd gene (up to 4 genes max) and then process the numerical results.
The file-reading operations take about 1.5 to 2.0 minutes. I am experimenting with reading the entire file and then picking out the data for the desired genes.
Is there any way to accelerate this? I can rewrite the gene data in another format (a one-time processing step) if that will speed up reading operations in the future.
You can use the colClasses argument to read.table to speed things up, if you know the exact format of your files. For 2 character columns and 34,998 (?) numeric columns, you would use
colClasses = c(rep("character",2), rep("numeric",34998))
This would be more efficient if you used a database interface. There are several available via the RODBC package, but a particularly well-integrated-with-R option is the sqldf package, which by default uses SQLite. You would then be able to use the indexing capability of the database to look up the correct rows and read all the columns in one operation.
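A minimal sketch of the sqldf route; the file name, separator and the gene_name column are assumptions you would adapt to the real file layout:
library(sqldf)
# read.csv.sql loads only the rows matching the SQL filter via SQLite;
# inside the sql statement the table is referred to as `file`
genes <- read.csv.sql("gene_data.txt", sep = ",", header = TRUE,
                      sql = "select * from file where gene_name in ('GENE1', 'GENE2')")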
