Speed up data reading in R [duplicate] - r

I have very large tables (30 million rows) that I would like to load as a dataframes in R. read.table() has a lot of convenient features, but it seems like there is a lot of logic in the implementation that would slow things down. In my case, I am assuming I know the types of the columns ahead of time, the table does not contain any column headers or row names, and does not have any pathological characters that I have to worry about.
I know that reading in a table as a list using scan() can be quite fast, e.g.:
datalist <- scan('myfile',sep='\t',list(url='',popularity=0,mintime=0,maxtime=0)))
But some of my attempts to convert this to a dataframe appear to decrease the performance of the above by a factor of 6:
df <- as.data.frame(scan('myfile',sep='\t',list(url='',popularity=0,mintime=0,maxtime=0))))
Is there a better way of doing this? Or quite possibly completely different approach to the problem?

An update, several years later
This answer is old, and R has moved on. Tweaking read.table to run a bit faster has precious little benefit. Your options are:
Using vroom from the tidyverse package vroom for importing data from csv/tab-delimited files directly into an R tibble. See Hector's answer.
Using fread in data.table for importing data from csv/tab-delimited files directly into R. See mnel's answer.
Using read_table in readr (on CRAN from April 2015). This works much like fread above. The readme in the link explains the difference between the two functions (readr currently claims to be "1.5-2x slower" than data.table::fread).
read.csv.raw from iotools provides a third option for quickly reading CSV files.
Trying to store as much data as you can in databases rather than flat files. (As well as being a better permanent storage medium, data is passed to and from R in a binary format, which is faster.) read.csv.sql in the sqldf package, as described in JD Long's answer, imports data into a temporary SQLite database and then reads it into R. See also: the RODBC package, and the reverse depends section of the DBI package page. MonetDB.R gives you a data type that pretends to be a data frame but is really a MonetDB underneath, increasing performance. Import data with its monetdb.read.csv function. dplyr allows you to work directly with data stored in several types of database.
Storing data in binary formats can also be useful for improving performance. Use saveRDS/readRDS (see below), the h5 or rhdf5 packages for HDF5 format, or write_fst/read_fst from the fst package.
The original answer
There are a couple of simple things to try, whether you use read.table or scan.
Set nrows=the number of records in your data (nmax in scan).
Make sure that comment.char="" to turn off interpretation of comments.
Explicitly define the classes of each column using colClasses in read.table.
Setting multi.line=FALSE may also improve performance in scan.
If none of these thing work, then use one of the profiling packages to determine which lines are slowing things down. Perhaps you can write a cut down version of read.table based on the results.
The other alternative is filtering your data before you read it into R.
Or, if the problem is that you have to read it in regularly, then use these methods to read the data in once, then save the data frame as a binary blob with save saveRDS, then next time you can retrieve it faster with load readRDS.

Here is an example that utilizes fread from data.table 1.8.7
The examples come from the help page to fread, with the timings on my windows XP Core 2 duo E8400.
library(data.table)
# Demo speedup
n=1e6
DT = data.table( a=sample(1:1000,n,replace=TRUE),
b=sample(1:1000,n,replace=TRUE),
c=rnorm(n),
d=sample(c("foo","bar","baz","qux","quux"),n,replace=TRUE),
e=rnorm(n),
f=sample(1:1000,n,replace=TRUE) )
DT[2,b:=NA_integer_]
DT[4,c:=NA_real_]
DT[3,d:=NA_character_]
DT[5,d:=""]
DT[2,e:=+Inf]
DT[3,e:=-Inf]
standard read.table
write.table(DT,"test.csv",sep=",",row.names=FALSE,quote=FALSE)
cat("File size (MB):",round(file.info("test.csv")$size/1024^2),"\n")
## File size (MB): 51
system.time(DF1 <- read.csv("test.csv",stringsAsFactors=FALSE))
## user system elapsed
## 24.71 0.15 25.42
# second run will be faster
system.time(DF1 <- read.csv("test.csv",stringsAsFactors=FALSE))
## user system elapsed
## 17.85 0.07 17.98
optimized read.table
system.time(DF2 <- read.table("test.csv",header=TRUE,sep=",",quote="",
stringsAsFactors=FALSE,comment.char="",nrows=n,
colClasses=c("integer","integer","numeric",
"character","numeric","integer")))
## user system elapsed
## 10.20 0.03 10.32
fread
require(data.table)
system.time(DT <- fread("test.csv"))
## user system elapsed
## 3.12 0.01 3.22
sqldf
require(sqldf)
system.time(SQLDF <- read.csv.sql("test.csv",dbname=NULL))
## user system elapsed
## 12.49 0.09 12.69
# sqldf as on SO
f <- file("test.csv")
system.time(SQLf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F)))
## user system elapsed
## 10.21 0.47 10.73
ff / ffdf
require(ff)
system.time(FFDF <- read.csv.ffdf(file="test.csv",nrows=n))
## user system elapsed
## 10.85 0.10 10.99
In summary:
## user system elapsed Method
## 24.71 0.15 25.42 read.csv (first time)
## 17.85 0.07 17.98 read.csv (second time)
## 10.20 0.03 10.32 Optimized read.table
## 3.12 0.01 3.22 fread
## 12.49 0.09 12.69 sqldf
## 10.21 0.47 10.73 sqldf on SO
## 10.85 0.10 10.99 ffdf

I didn't see this question initially and asked a similar question a few days later. I am going to take my previous question down, but I thought I'd add an answer here to explain how I used sqldf() to do this.
There's been little bit of discussion as to the best way to import 2GB or more of text data into an R data frame. Yesterday I wrote a blog post about using sqldf() to import the data into SQLite as a staging area, and then sucking it from SQLite into R. This works really well for me. I was able to pull in 2GB (3 columns, 40mm rows) of data in < 5 minutes. By contrast, the read.csv command ran all night and never completed.
Here's my test code:
Set up the test data:
bigdf <- data.frame(dim=sample(letters, replace=T, 4e7), fact1=rnorm(4e7), fact2=rnorm(4e7, 20, 50))
write.csv(bigdf, 'bigdf.csv', quote = F)
I restarted R before running the following import routine:
library(sqldf)
f <- file("bigdf.csv")
system.time(bigdf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F)))
I let the following line run all night but it never completed:
system.time(big.df <- read.csv('bigdf.csv'))

Strangely, no one answered the bottom part of the question for years even though this is an important one -- data.frames are simply lists with the right attributes, so if you have large data you don't want to use as.data.frame or similar for a list. It's much faster to simply "turn" a list into a data frame in-place:
attr(df, "row.names") <- .set_row_names(length(df[[1]]))
class(df) <- "data.frame"
This makes no copy of the data so it's immediate (unlike all other methods). It assumes that you have already set names() on the list accordingly.
[As for loading large data into R -- personally, I dump them by column into binary files and use readBin() - that is by far the fastest method (other than mmapping) and is only limited by the disk speed. Parsing ASCII files is inherently slow (even in C) compared to binary data.]

This was previously asked on R-Help, so that's worth reviewing.
One suggestion there was to use readChar() and then do string manipulation on the result with strsplit() and substr(). You can see the logic involved in readChar is much less than read.table.
I don't know if memory is an issue here, but you might also want to take a look at the HadoopStreaming package. This uses Hadoop, which is a MapReduce framework designed for dealing with large data sets. For this, you would use the hsTableReader function. This is an example (but it has a learning curve to learn Hadoop):
str <- "key1\t3.9\nkey1\t8.9\nkey1\t1.2\nkey1\t3.9\nkey1\t8.9\nkey1\t1.2\nkey2\t9.9\nkey2\"
cat(str)
cols = list(key='',val=0)
con <- textConnection(str, open = "r")
hsTableReader(con,cols,chunkSize=6,FUN=print,ignoreKey=TRUE)
close(con)
The basic idea here is to break the data import into chunks. You could even go so far as to use one of the parallel frameworks (e.g. snow) and run the data import in parallel by segmenting the file, but most likely for large data sets that won't help since you will run into memory constraints, which is why map-reduce is a better approach.

An alternative is to use the vroom package. Now on CRAN.
vroom doesn't load the entire file, it indexes where each record is located, and is read later when you use it.
Only pay for what you use.
See Introduction to vroom, Get started with vroom and the vroom benchmarks.
The basic overview is that the initial read of a huge file, will be much faster, and subsequent modifications to the data may be slightly slower. So depending on what your use is, it could be the best option.
See a simplified example from vroom benchmarks below, the key parts to see is the super fast read times, but slightly sower operations like aggregate etc..
package read print sample filter aggregate total
read.delim 1m 21.5s 1ms 315ms 764ms 1m 22.6s
readr 33.1s 90ms 2ms 202ms 825ms 34.2s
data.table 15.7s 13ms 1ms 129ms 394ms 16.3s
vroom (altrep) dplyr 1.7s 89ms 1.7s 1.3s 1.9s 6.7s

I am reading data very quickly using the new arrow package. It appears to be in a fairly early stage.
Specifically, I am using the parquet columnar format. This converts back to a data.frame in R, but you can get even deeper speedups if you do not. This format is convenient as it can be used from Python as well.
My main use case for this is on a fairly restrained RShiny server. For these reasons, I prefer to keep data attached to the Apps (i.e., out of SQL), and therefore require small file size as well as speed.
This linked article provides benchmarking and a good overview. I have quoted some interesting points below.
https://ursalabs.org/blog/2019-10-columnar-perf/
File Size
That is, the Parquet file is half as big as even the gzipped CSV. One of the reasons that the Parquet file is so small is because of dictionary-encoding (also called “dictionary compression”). Dictionary compression can yield substantially better compression than using a general purpose bytes compressor like LZ4 or ZSTD (which are used in the FST format). Parquet was designed to produce very small files that are fast to read.
Read Speed
When controlling by output type (e.g. comparing all R data.frame outputs with each other) we see the the performance of Parquet, Feather, and FST falls within a relatively small margin of each other. The same is true of the pandas.DataFrame outputs. data.table::fread is impressively competitive with the 1.5 GB file size but lags the others on the 2.5 GB CSV.
Independent Test
I performed some independent benchmarking on a simulated dataset of 1,000,000 rows. Basically I shuffled a bunch of things around to attempt to challenge the compression. Also I added a short text field of random words and two simulated factors.
Data
library(dplyr)
library(tibble)
library(OpenRepGrid)
n <- 1000000
set.seed(1234)
some_levels1 <- sapply(1:10, function(x) paste(LETTERS[sample(1:26, size = sample(3:8, 1), replace = TRUE)], collapse = ""))
some_levels2 <- sapply(1:65, function(x) paste(LETTERS[sample(1:26, size = sample(5:16, 1), replace = TRUE)], collapse = ""))
test_data <- mtcars %>%
rownames_to_column() %>%
sample_n(n, replace = TRUE) %>%
mutate_all(~ sample(., length(.))) %>%
mutate(factor1 = sample(some_levels1, n, replace = TRUE),
factor2 = sample(some_levels2, n, replace = TRUE),
text = randomSentences(n, sample(3:8, n, replace = TRUE))
)
Read and Write
Writing the data is easy.
library(arrow)
write_parquet(test_data , "test_data.parquet")
# you can also mess with the compression
write_parquet(test_data, "test_data2.parquet", compress = "gzip", compression_level = 9)
Reading the data is also easy.
read_parquet("test_data.parquet")
# this option will result in lightning fast reads, but in a different format.
read_parquet("test_data2.parquet", as_data_frame = FALSE)
I tested reading this data against a few of the competing options, and did get slightly different results than with the article above, which is expected.
This file is nowhere near as large as the benchmark article, so maybe that is the difference.
Tests
rds: test_data.rds (20.3 MB)
parquet2_native: (14.9 MB with higher compression and as_data_frame = FALSE)
parquet2: test_data2.parquet (14.9 MB with higher compression)
parquet: test_data.parquet (40.7 MB)
fst2: test_data2.fst (27.9 MB with higher compression)
fst: test_data.fst (76.8 MB)
fread2: test_data.csv.gz (23.6MB)
fread: test_data.csv (98.7MB)
feather_arrow: test_data.feather (157.2 MB read with arrow)
feather: test_data.feather (157.2 MB read with feather)
Observations
For this particular file, fread is actually very fast. I like the small file size from the highly compressed parquet2 test. I may invest the time to work with the native data format rather than a data.frame if I really need the speed up.
Here fst is also a great choice. I would either use the highly compressed fst format or the highly compressed parquet depending on if I needed the speed or file size trade off.

A minor additional points worth mentioning. If you have a very large file you can on the fly calculate the number of rows (if no header) using (where bedGraph is the name of your file in your working directory):
>numRow=as.integer(system(paste("wc -l", bedGraph, "| sed 's/[^0-9.]*\\([0-9.]*\\).*/\\1/'"), intern=T))
You can then use that either in read.csv , read.table ...
>system.time((BG=read.table(bedGraph, nrows=numRow, col.names=c('chr', 'start', 'end', 'score'),colClasses=c('character', rep('integer',3)))))
user system elapsed
25.877 0.887 26.752
>object.size(BG)
203949432 bytes

Often times I think it is just good practice to keep larger databases inside a database (e.g. Postgres). I don't use anything too much larger than (nrow * ncol) ncell = 10M, which is pretty small; but I often find I want R to create and hold memory intensive graphs only while I query from multiple databases. In the future of 32 GB laptops, some of these types of memory problems will disappear. But the allure of using a database to hold the data and then using R's memory for the resulting query results and graphs still may be useful. Some advantages are:
(1) The data stays loaded in your database. You simply reconnect in pgadmin to the databases you want when you turn your laptop back on.
(2) It is true R can do many more nifty statistical and graphing operations than SQL. But I think SQL is better designed to query large amounts of data than R.
# Looking at Voter/Registrant Age by Decade
library(RPostgreSQL);library(lattice)
con <- dbConnect(PostgreSQL(), user= "postgres", password="password",
port="2345", host="localhost", dbname="WC2014_08_01_2014")
Decade_BD_1980_42 <- dbGetQuery(con,"Select PrecinctID,Count(PrecinctID),extract(DECADE from Birthdate) from voterdb where extract(DECADE from Birthdate)::numeric > 198 and PrecinctID in (Select * from LD42) Group By PrecinctID,date_part Order by Count DESC;")
Decade_RD_1980_42 <- dbGetQuery(con,"Select PrecinctID,Count(PrecinctID),extract(DECADE from RegistrationDate) from voterdb where extract(DECADE from RegistrationDate)::numeric > 198 and PrecinctID in (Select * from LD42) Group By PrecinctID,date_part Order by Count DESC;")
with(Decade_BD_1980_42,(barchart(~count | as.factor(precinctid))));
mtext("42LD Birthdays later than 1980 by Precinct",side=1,line=0)
with(Decade_RD_1980_42,(barchart(~count | as.factor(precinctid))));
mtext("42LD Registration Dates later than 1980 by Precinct",side=1,line=0)

I wanted to contribute Spark-based solution in the simplest form:
# Test Data ---------------------------------------------------------------
set.seed(123)
bigdf <-
data.frame(
dim = sample(letters, replace = T, 4e7),
fact1 = rnorm(4e7),
fact2 = rnorm(4e7, 20, 50)
)
tmp_csv <- fs::file_temp(pattern = "big_df", ext = ".csv")
readr::write_csv(x = bigdf, file = tmp_csv)
# Spark -------------------------------------------------------------------
# Installing if needed
# sparklyr::spark_available_versions()
# sparklyr::spark_install()
library("sparklyr")
sc <- spark_connect(master = "local")
# Uploading CSV
system.time(tbl_big_df <- spark_read_csv(sc = sc, path = tmp_csv))
Spark generated fairly OK results:
>> system.time(tbl_big_df <- spark_read_csv(sc = sc, path = tmp_csv))
user system elapsed
0.278 0.034 11.747
This was tested on MacBook Pro with 32GB ram.
Remarks
Spark, usually shouldn't be able to "win" against packages optimised for speed. Nevertheless, I wanted to contribute an answer using Spark:
For some of the comments and answers where process didn't work using Spark may be a viable alternative
In a long-run, hammering as much data as possible into data.frame may prove problematic later on, when other operations are attempted on that object and hit the performance envelope of architecture
I think that for questions like that, where the task is to handle 1e7 or more rows Spark should be given considerations. Even if it may be possible to "hammer in" that data into a single data.frame it's just doesn't feel right. Likely that object will be difficult to work with and create problems when deploying models, etc.

Instead of the conventional read.table I feel fread is a faster function.
Specifying additional attributes like select only the required columns, specifying colclasses and string as factors will reduce the time take to import the file.
data_frame <- fread("filename.csv",sep=",",header=FALSE,stringsAsFactors=FALSE,select=c(1,4,5,6,7),colClasses=c("as.numeric","as.character","as.numeric","as.Date","as.Factor"))

I've tried all above and [readr][1] made the best job. I have only 8gb RAM
Loop for 20 files, 5gb each, 7 columns:
read_fwf(arquivos[i],col_types = "ccccccc",fwf_cols(cnpj = c(4,17), nome = c(19,168), cpf = c(169,183), fantasia = c(169,223), sit.cadastral = c(224,225), dt.sitcadastral = c(226,233), cnae = c(376,382)))

Related

R ERROR in reading 18 GBs data [duplicate]

I have very large tables (30 million rows) that I would like to load as a dataframes in R. read.table() has a lot of convenient features, but it seems like there is a lot of logic in the implementation that would slow things down. In my case, I am assuming I know the types of the columns ahead of time, the table does not contain any column headers or row names, and does not have any pathological characters that I have to worry about.
I know that reading in a table as a list using scan() can be quite fast, e.g.:
datalist <- scan('myfile',sep='\t',list(url='',popularity=0,mintime=0,maxtime=0)))
But some of my attempts to convert this to a dataframe appear to decrease the performance of the above by a factor of 6:
df <- as.data.frame(scan('myfile',sep='\t',list(url='',popularity=0,mintime=0,maxtime=0))))
Is there a better way of doing this? Or quite possibly completely different approach to the problem?
An update, several years later
This answer is old, and R has moved on. Tweaking read.table to run a bit faster has precious little benefit. Your options are:
Using vroom from the tidyverse package vroom for importing data from csv/tab-delimited files directly into an R tibble. See Hector's answer.
Using fread in data.table for importing data from csv/tab-delimited files directly into R. See mnel's answer.
Using read_table in readr (on CRAN from April 2015). This works much like fread above. The readme in the link explains the difference between the two functions (readr currently claims to be "1.5-2x slower" than data.table::fread).
read.csv.raw from iotools provides a third option for quickly reading CSV files.
Trying to store as much data as you can in databases rather than flat files. (As well as being a better permanent storage medium, data is passed to and from R in a binary format, which is faster.) read.csv.sql in the sqldf package, as described in JD Long's answer, imports data into a temporary SQLite database and then reads it into R. See also: the RODBC package, and the reverse depends section of the DBI package page. MonetDB.R gives you a data type that pretends to be a data frame but is really a MonetDB underneath, increasing performance. Import data with its monetdb.read.csv function. dplyr allows you to work directly with data stored in several types of database.
Storing data in binary formats can also be useful for improving performance. Use saveRDS/readRDS (see below), the h5 or rhdf5 packages for HDF5 format, or write_fst/read_fst from the fst package.
The original answer
There are a couple of simple things to try, whether you use read.table or scan.
Set nrows=the number of records in your data (nmax in scan).
Make sure that comment.char="" to turn off interpretation of comments.
Explicitly define the classes of each column using colClasses in read.table.
Setting multi.line=FALSE may also improve performance in scan.
If none of these thing work, then use one of the profiling packages to determine which lines are slowing things down. Perhaps you can write a cut down version of read.table based on the results.
The other alternative is filtering your data before you read it into R.
Or, if the problem is that you have to read it in regularly, then use these methods to read the data in once, then save the data frame as a binary blob with save saveRDS, then next time you can retrieve it faster with load readRDS.
Here is an example that utilizes fread from data.table 1.8.7
The examples come from the help page to fread, with the timings on my windows XP Core 2 duo E8400.
library(data.table)
# Demo speedup
n=1e6
DT = data.table( a=sample(1:1000,n,replace=TRUE),
b=sample(1:1000,n,replace=TRUE),
c=rnorm(n),
d=sample(c("foo","bar","baz","qux","quux"),n,replace=TRUE),
e=rnorm(n),
f=sample(1:1000,n,replace=TRUE) )
DT[2,b:=NA_integer_]
DT[4,c:=NA_real_]
DT[3,d:=NA_character_]
DT[5,d:=""]
DT[2,e:=+Inf]
DT[3,e:=-Inf]
standard read.table
write.table(DT,"test.csv",sep=",",row.names=FALSE,quote=FALSE)
cat("File size (MB):",round(file.info("test.csv")$size/1024^2),"\n")
## File size (MB): 51
system.time(DF1 <- read.csv("test.csv",stringsAsFactors=FALSE))
## user system elapsed
## 24.71 0.15 25.42
# second run will be faster
system.time(DF1 <- read.csv("test.csv",stringsAsFactors=FALSE))
## user system elapsed
## 17.85 0.07 17.98
optimized read.table
system.time(DF2 <- read.table("test.csv",header=TRUE,sep=",",quote="",
stringsAsFactors=FALSE,comment.char="",nrows=n,
colClasses=c("integer","integer","numeric",
"character","numeric","integer")))
## user system elapsed
## 10.20 0.03 10.32
fread
require(data.table)
system.time(DT <- fread("test.csv"))
## user system elapsed
## 3.12 0.01 3.22
sqldf
require(sqldf)
system.time(SQLDF <- read.csv.sql("test.csv",dbname=NULL))
## user system elapsed
## 12.49 0.09 12.69
# sqldf as on SO
f <- file("test.csv")
system.time(SQLf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F)))
## user system elapsed
## 10.21 0.47 10.73
ff / ffdf
require(ff)
system.time(FFDF <- read.csv.ffdf(file="test.csv",nrows=n))
## user system elapsed
## 10.85 0.10 10.99
In summary:
## user system elapsed Method
## 24.71 0.15 25.42 read.csv (first time)
## 17.85 0.07 17.98 read.csv (second time)
## 10.20 0.03 10.32 Optimized read.table
## 3.12 0.01 3.22 fread
## 12.49 0.09 12.69 sqldf
## 10.21 0.47 10.73 sqldf on SO
## 10.85 0.10 10.99 ffdf
I didn't see this question initially and asked a similar question a few days later. I am going to take my previous question down, but I thought I'd add an answer here to explain how I used sqldf() to do this.
There's been little bit of discussion as to the best way to import 2GB or more of text data into an R data frame. Yesterday I wrote a blog post about using sqldf() to import the data into SQLite as a staging area, and then sucking it from SQLite into R. This works really well for me. I was able to pull in 2GB (3 columns, 40mm rows) of data in < 5 minutes. By contrast, the read.csv command ran all night and never completed.
Here's my test code:
Set up the test data:
bigdf <- data.frame(dim=sample(letters, replace=T, 4e7), fact1=rnorm(4e7), fact2=rnorm(4e7, 20, 50))
write.csv(bigdf, 'bigdf.csv', quote = F)
I restarted R before running the following import routine:
library(sqldf)
f <- file("bigdf.csv")
system.time(bigdf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F)))
I let the following line run all night but it never completed:
system.time(big.df <- read.csv('bigdf.csv'))
Strangely, no one answered the bottom part of the question for years even though this is an important one -- data.frames are simply lists with the right attributes, so if you have large data you don't want to use as.data.frame or similar for a list. It's much faster to simply "turn" a list into a data frame in-place:
attr(df, "row.names") <- .set_row_names(length(df[[1]]))
class(df) <- "data.frame"
This makes no copy of the data so it's immediate (unlike all other methods). It assumes that you have already set names() on the list accordingly.
[As for loading large data into R -- personally, I dump them by column into binary files and use readBin() - that is by far the fastest method (other than mmapping) and is only limited by the disk speed. Parsing ASCII files is inherently slow (even in C) compared to binary data.]
This was previously asked on R-Help, so that's worth reviewing.
One suggestion there was to use readChar() and then do string manipulation on the result with strsplit() and substr(). You can see the logic involved in readChar is much less than read.table.
I don't know if memory is an issue here, but you might also want to take a look at the HadoopStreaming package. This uses Hadoop, which is a MapReduce framework designed for dealing with large data sets. For this, you would use the hsTableReader function. This is an example (but it has a learning curve to learn Hadoop):
str <- "key1\t3.9\nkey1\t8.9\nkey1\t1.2\nkey1\t3.9\nkey1\t8.9\nkey1\t1.2\nkey2\t9.9\nkey2\"
cat(str)
cols = list(key='',val=0)
con <- textConnection(str, open = "r")
hsTableReader(con,cols,chunkSize=6,FUN=print,ignoreKey=TRUE)
close(con)
The basic idea here is to break the data import into chunks. You could even go so far as to use one of the parallel frameworks (e.g. snow) and run the data import in parallel by segmenting the file, but most likely for large data sets that won't help since you will run into memory constraints, which is why map-reduce is a better approach.
An alternative is to use the vroom package. Now on CRAN.
vroom doesn't load the entire file, it indexes where each record is located, and is read later when you use it.
Only pay for what you use.
See Introduction to vroom, Get started with vroom and the vroom benchmarks.
The basic overview is that the initial read of a huge file, will be much faster, and subsequent modifications to the data may be slightly slower. So depending on what your use is, it could be the best option.
See a simplified example from vroom benchmarks below, the key parts to see is the super fast read times, but slightly sower operations like aggregate etc..
package read print sample filter aggregate total
read.delim 1m 21.5s 1ms 315ms 764ms 1m 22.6s
readr 33.1s 90ms 2ms 202ms 825ms 34.2s
data.table 15.7s 13ms 1ms 129ms 394ms 16.3s
vroom (altrep) dplyr 1.7s 89ms 1.7s 1.3s 1.9s 6.7s
I am reading data very quickly using the new arrow package. It appears to be in a fairly early stage.
Specifically, I am using the parquet columnar format. This converts back to a data.frame in R, but you can get even deeper speedups if you do not. This format is convenient as it can be used from Python as well.
My main use case for this is on a fairly restrained RShiny server. For these reasons, I prefer to keep data attached to the Apps (i.e., out of SQL), and therefore require small file size as well as speed.
This linked article provides benchmarking and a good overview. I have quoted some interesting points below.
https://ursalabs.org/blog/2019-10-columnar-perf/
File Size
That is, the Parquet file is half as big as even the gzipped CSV. One of the reasons that the Parquet file is so small is because of dictionary-encoding (also called “dictionary compression”). Dictionary compression can yield substantially better compression than using a general purpose bytes compressor like LZ4 or ZSTD (which are used in the FST format). Parquet was designed to produce very small files that are fast to read.
Read Speed
When controlling by output type (e.g. comparing all R data.frame outputs with each other) we see the the performance of Parquet, Feather, and FST falls within a relatively small margin of each other. The same is true of the pandas.DataFrame outputs. data.table::fread is impressively competitive with the 1.5 GB file size but lags the others on the 2.5 GB CSV.
Independent Test
I performed some independent benchmarking on a simulated dataset of 1,000,000 rows. Basically I shuffled a bunch of things around to attempt to challenge the compression. Also I added a short text field of random words and two simulated factors.
Data
library(dplyr)
library(tibble)
library(OpenRepGrid)
n <- 1000000
set.seed(1234)
some_levels1 <- sapply(1:10, function(x) paste(LETTERS[sample(1:26, size = sample(3:8, 1), replace = TRUE)], collapse = ""))
some_levels2 <- sapply(1:65, function(x) paste(LETTERS[sample(1:26, size = sample(5:16, 1), replace = TRUE)], collapse = ""))
test_data <- mtcars %>%
rownames_to_column() %>%
sample_n(n, replace = TRUE) %>%
mutate_all(~ sample(., length(.))) %>%
mutate(factor1 = sample(some_levels1, n, replace = TRUE),
factor2 = sample(some_levels2, n, replace = TRUE),
text = randomSentences(n, sample(3:8, n, replace = TRUE))
)
Read and Write
Writing the data is easy.
library(arrow)
write_parquet(test_data , "test_data.parquet")
# you can also mess with the compression
write_parquet(test_data, "test_data2.parquet", compress = "gzip", compression_level = 9)
Reading the data is also easy.
read_parquet("test_data.parquet")
# this option will result in lightning fast reads, but in a different format.
read_parquet("test_data2.parquet", as_data_frame = FALSE)
I tested reading this data against a few of the competing options, and did get slightly different results than with the article above, which is expected.
This file is nowhere near as large as the benchmark article, so maybe that is the difference.
Tests
rds: test_data.rds (20.3 MB)
parquet2_native: (14.9 MB with higher compression and as_data_frame = FALSE)
parquet2: test_data2.parquet (14.9 MB with higher compression)
parquet: test_data.parquet (40.7 MB)
fst2: test_data2.fst (27.9 MB with higher compression)
fst: test_data.fst (76.8 MB)
fread2: test_data.csv.gz (23.6MB)
fread: test_data.csv (98.7MB)
feather_arrow: test_data.feather (157.2 MB read with arrow)
feather: test_data.feather (157.2 MB read with feather)
Observations
For this particular file, fread is actually very fast. I like the small file size from the highly compressed parquet2 test. I may invest the time to work with the native data format rather than a data.frame if I really need the speed up.
Here fst is also a great choice. I would either use the highly compressed fst format or the highly compressed parquet depending on if I needed the speed or file size trade off.
A minor additional points worth mentioning. If you have a very large file you can on the fly calculate the number of rows (if no header) using (where bedGraph is the name of your file in your working directory):
>numRow=as.integer(system(paste("wc -l", bedGraph, "| sed 's/[^0-9.]*\\([0-9.]*\\).*/\\1/'"), intern=T))
You can then use that either in read.csv , read.table ...
>system.time((BG=read.table(bedGraph, nrows=numRow, col.names=c('chr', 'start', 'end', 'score'),colClasses=c('character', rep('integer',3)))))
user system elapsed
25.877 0.887 26.752
>object.size(BG)
203949432 bytes
Often times I think it is just good practice to keep larger databases inside a database (e.g. Postgres). I don't use anything too much larger than (nrow * ncol) ncell = 10M, which is pretty small; but I often find I want R to create and hold memory intensive graphs only while I query from multiple databases. In the future of 32 GB laptops, some of these types of memory problems will disappear. But the allure of using a database to hold the data and then using R's memory for the resulting query results and graphs still may be useful. Some advantages are:
(1) The data stays loaded in your database. You simply reconnect in pgadmin to the databases you want when you turn your laptop back on.
(2) It is true R can do many more nifty statistical and graphing operations than SQL. But I think SQL is better designed to query large amounts of data than R.
# Looking at Voter/Registrant Age by Decade
library(RPostgreSQL);library(lattice)
con <- dbConnect(PostgreSQL(), user= "postgres", password="password",
port="2345", host="localhost", dbname="WC2014_08_01_2014")
Decade_BD_1980_42 <- dbGetQuery(con,"Select PrecinctID,Count(PrecinctID),extract(DECADE from Birthdate) from voterdb where extract(DECADE from Birthdate)::numeric > 198 and PrecinctID in (Select * from LD42) Group By PrecinctID,date_part Order by Count DESC;")
Decade_RD_1980_42 <- dbGetQuery(con,"Select PrecinctID,Count(PrecinctID),extract(DECADE from RegistrationDate) from voterdb where extract(DECADE from RegistrationDate)::numeric > 198 and PrecinctID in (Select * from LD42) Group By PrecinctID,date_part Order by Count DESC;")
with(Decade_BD_1980_42,(barchart(~count | as.factor(precinctid))));
mtext("42LD Birthdays later than 1980 by Precinct",side=1,line=0)
with(Decade_RD_1980_42,(barchart(~count | as.factor(precinctid))));
mtext("42LD Registration Dates later than 1980 by Precinct",side=1,line=0)
I wanted to contribute Spark-based solution in the simplest form:
# Test Data ---------------------------------------------------------------
set.seed(123)
bigdf <-
data.frame(
dim = sample(letters, replace = T, 4e7),
fact1 = rnorm(4e7),
fact2 = rnorm(4e7, 20, 50)
)
tmp_csv <- fs::file_temp(pattern = "big_df", ext = ".csv")
readr::write_csv(x = bigdf, file = tmp_csv)
# Spark -------------------------------------------------------------------
# Installing if needed
# sparklyr::spark_available_versions()
# sparklyr::spark_install()
library("sparklyr")
sc <- spark_connect(master = "local")
# Uploading CSV
system.time(tbl_big_df <- spark_read_csv(sc = sc, path = tmp_csv))
Spark generated fairly OK results:
>> system.time(tbl_big_df <- spark_read_csv(sc = sc, path = tmp_csv))
user system elapsed
0.278 0.034 11.747
This was tested on MacBook Pro with 32GB ram.
Remarks
Spark, usually shouldn't be able to "win" against packages optimised for speed. Nevertheless, I wanted to contribute an answer using Spark:
For some of the comments and answers where process didn't work using Spark may be a viable alternative
In a long-run, hammering as much data as possible into data.frame may prove problematic later on, when other operations are attempted on that object and hit the performance envelope of architecture
I think that for questions like that, where the task is to handle 1e7 or more rows Spark should be given considerations. Even if it may be possible to "hammer in" that data into a single data.frame it's just doesn't feel right. Likely that object will be difficult to work with and create problems when deploying models, etc.
Instead of the conventional read.table I feel fread is a faster function.
Specifying additional attributes like select only the required columns, specifying colclasses and string as factors will reduce the time take to import the file.
data_frame <- fread("filename.csv",sep=",",header=FALSE,stringsAsFactors=FALSE,select=c(1,4,5,6,7),colClasses=c("as.numeric","as.character","as.numeric","as.Date","as.Factor"))
I've tried all above and [readr][1] made the best job. I have only 8gb RAM
Loop for 20 files, 5gb each, 7 columns:
read_fwf(arquivos[i],col_types = "ccccccc",fwf_cols(cnpj = c(4,17), nome = c(19,168), cpf = c(169,183), fantasia = c(169,223), sit.cadastral = c(224,225), dt.sitcadastral = c(226,233), cnae = c(376,382)))

How do I read and clean a large (4GB) .csv file in R? [duplicate]

I have very large tables (30 million rows) that I would like to load as a dataframes in R. read.table() has a lot of convenient features, but it seems like there is a lot of logic in the implementation that would slow things down. In my case, I am assuming I know the types of the columns ahead of time, the table does not contain any column headers or row names, and does not have any pathological characters that I have to worry about.
I know that reading in a table as a list using scan() can be quite fast, e.g.:
datalist <- scan('myfile',sep='\t',list(url='',popularity=0,mintime=0,maxtime=0)))
But some of my attempts to convert this to a dataframe appear to decrease the performance of the above by a factor of 6:
df <- as.data.frame(scan('myfile',sep='\t',list(url='',popularity=0,mintime=0,maxtime=0))))
Is there a better way of doing this? Or quite possibly completely different approach to the problem?
An update, several years later
This answer is old, and R has moved on. Tweaking read.table to run a bit faster has precious little benefit. Your options are:
Using vroom from the tidyverse package vroom for importing data from csv/tab-delimited files directly into an R tibble. See Hector's answer.
Using fread in data.table for importing data from csv/tab-delimited files directly into R. See mnel's answer.
Using read_table in readr (on CRAN from April 2015). This works much like fread above. The readme in the link explains the difference between the two functions (readr currently claims to be "1.5-2x slower" than data.table::fread).
read.csv.raw from iotools provides a third option for quickly reading CSV files.
Trying to store as much data as you can in databases rather than flat files. (As well as being a better permanent storage medium, data is passed to and from R in a binary format, which is faster.) read.csv.sql in the sqldf package, as described in JD Long's answer, imports data into a temporary SQLite database and then reads it into R. See also: the RODBC package, and the reverse depends section of the DBI package page. MonetDB.R gives you a data type that pretends to be a data frame but is really a MonetDB underneath, increasing performance. Import data with its monetdb.read.csv function. dplyr allows you to work directly with data stored in several types of database.
Storing data in binary formats can also be useful for improving performance. Use saveRDS/readRDS (see below), the h5 or rhdf5 packages for HDF5 format, or write_fst/read_fst from the fst package.
The original answer
There are a couple of simple things to try, whether you use read.table or scan.
Set nrows=the number of records in your data (nmax in scan).
Make sure that comment.char="" to turn off interpretation of comments.
Explicitly define the classes of each column using colClasses in read.table.
Setting multi.line=FALSE may also improve performance in scan.
If none of these thing work, then use one of the profiling packages to determine which lines are slowing things down. Perhaps you can write a cut down version of read.table based on the results.
The other alternative is filtering your data before you read it into R.
Or, if the problem is that you have to read it in regularly, then use these methods to read the data in once, then save the data frame as a binary blob with save saveRDS, then next time you can retrieve it faster with load readRDS.
Here is an example that utilizes fread from data.table 1.8.7
The examples come from the help page to fread, with the timings on my windows XP Core 2 duo E8400.
library(data.table)
# Demo speedup
n=1e6
DT = data.table( a=sample(1:1000,n,replace=TRUE),
b=sample(1:1000,n,replace=TRUE),
c=rnorm(n),
d=sample(c("foo","bar","baz","qux","quux"),n,replace=TRUE),
e=rnorm(n),
f=sample(1:1000,n,replace=TRUE) )
DT[2,b:=NA_integer_]
DT[4,c:=NA_real_]
DT[3,d:=NA_character_]
DT[5,d:=""]
DT[2,e:=+Inf]
DT[3,e:=-Inf]
standard read.table
write.table(DT,"test.csv",sep=",",row.names=FALSE,quote=FALSE)
cat("File size (MB):",round(file.info("test.csv")$size/1024^2),"\n")
## File size (MB): 51
system.time(DF1 <- read.csv("test.csv",stringsAsFactors=FALSE))
## user system elapsed
## 24.71 0.15 25.42
# second run will be faster
system.time(DF1 <- read.csv("test.csv",stringsAsFactors=FALSE))
## user system elapsed
## 17.85 0.07 17.98
optimized read.table
system.time(DF2 <- read.table("test.csv",header=TRUE,sep=",",quote="",
stringsAsFactors=FALSE,comment.char="",nrows=n,
colClasses=c("integer","integer","numeric",
"character","numeric","integer")))
## user system elapsed
## 10.20 0.03 10.32
fread
require(data.table)
system.time(DT <- fread("test.csv"))
## user system elapsed
## 3.12 0.01 3.22
sqldf
require(sqldf)
system.time(SQLDF <- read.csv.sql("test.csv",dbname=NULL))
## user system elapsed
## 12.49 0.09 12.69
# sqldf as on SO
f <- file("test.csv")
system.time(SQLf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F)))
## user system elapsed
## 10.21 0.47 10.73
ff / ffdf
require(ff)
system.time(FFDF <- read.csv.ffdf(file="test.csv",nrows=n))
## user system elapsed
## 10.85 0.10 10.99
In summary:
## user system elapsed Method
## 24.71 0.15 25.42 read.csv (first time)
## 17.85 0.07 17.98 read.csv (second time)
## 10.20 0.03 10.32 Optimized read.table
## 3.12 0.01 3.22 fread
## 12.49 0.09 12.69 sqldf
## 10.21 0.47 10.73 sqldf on SO
## 10.85 0.10 10.99 ffdf
I didn't see this question initially and asked a similar question a few days later. I am going to take my previous question down, but I thought I'd add an answer here to explain how I used sqldf() to do this.
There's been little bit of discussion as to the best way to import 2GB or more of text data into an R data frame. Yesterday I wrote a blog post about using sqldf() to import the data into SQLite as a staging area, and then sucking it from SQLite into R. This works really well for me. I was able to pull in 2GB (3 columns, 40mm rows) of data in < 5 minutes. By contrast, the read.csv command ran all night and never completed.
Here's my test code:
Set up the test data:
bigdf <- data.frame(dim=sample(letters, replace=T, 4e7), fact1=rnorm(4e7), fact2=rnorm(4e7, 20, 50))
write.csv(bigdf, 'bigdf.csv', quote = F)
I restarted R before running the following import routine:
library(sqldf)
f <- file("bigdf.csv")
system.time(bigdf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F)))
I let the following line run all night but it never completed:
system.time(big.df <- read.csv('bigdf.csv'))
Strangely, no one answered the bottom part of the question for years even though this is an important one -- data.frames are simply lists with the right attributes, so if you have large data you don't want to use as.data.frame or similar for a list. It's much faster to simply "turn" a list into a data frame in-place:
attr(df, "row.names") <- .set_row_names(length(df[[1]]))
class(df) <- "data.frame"
This makes no copy of the data so it's immediate (unlike all other methods). It assumes that you have already set names() on the list accordingly.
[As for loading large data into R -- personally, I dump them by column into binary files and use readBin() - that is by far the fastest method (other than mmapping) and is only limited by the disk speed. Parsing ASCII files is inherently slow (even in C) compared to binary data.]
This was previously asked on R-Help, so that's worth reviewing.
One suggestion there was to use readChar() and then do string manipulation on the result with strsplit() and substr(). You can see the logic involved in readChar is much less than read.table.
I don't know if memory is an issue here, but you might also want to take a look at the HadoopStreaming package. This uses Hadoop, which is a MapReduce framework designed for dealing with large data sets. For this, you would use the hsTableReader function. This is an example (but it has a learning curve to learn Hadoop):
str <- "key1\t3.9\nkey1\t8.9\nkey1\t1.2\nkey1\t3.9\nkey1\t8.9\nkey1\t1.2\nkey2\t9.9\nkey2\"
cat(str)
cols = list(key='',val=0)
con <- textConnection(str, open = "r")
hsTableReader(con,cols,chunkSize=6,FUN=print,ignoreKey=TRUE)
close(con)
The basic idea here is to break the data import into chunks. You could even go so far as to use one of the parallel frameworks (e.g. snow) and run the data import in parallel by segmenting the file, but most likely for large data sets that won't help since you will run into memory constraints, which is why map-reduce is a better approach.
An alternative is to use the vroom package. Now on CRAN.
vroom doesn't load the entire file, it indexes where each record is located, and is read later when you use it.
Only pay for what you use.
See Introduction to vroom, Get started with vroom and the vroom benchmarks.
The basic overview is that the initial read of a huge file, will be much faster, and subsequent modifications to the data may be slightly slower. So depending on what your use is, it could be the best option.
See a simplified example from vroom benchmarks below, the key parts to see is the super fast read times, but slightly sower operations like aggregate etc..
package read print sample filter aggregate total
read.delim 1m 21.5s 1ms 315ms 764ms 1m 22.6s
readr 33.1s 90ms 2ms 202ms 825ms 34.2s
data.table 15.7s 13ms 1ms 129ms 394ms 16.3s
vroom (altrep) dplyr 1.7s 89ms 1.7s 1.3s 1.9s 6.7s
I am reading data very quickly using the new arrow package. It appears to be in a fairly early stage.
Specifically, I am using the parquet columnar format. This converts back to a data.frame in R, but you can get even deeper speedups if you do not. This format is convenient as it can be used from Python as well.
My main use case for this is on a fairly restrained RShiny server. For these reasons, I prefer to keep data attached to the Apps (i.e., out of SQL), and therefore require small file size as well as speed.
This linked article provides benchmarking and a good overview. I have quoted some interesting points below.
https://ursalabs.org/blog/2019-10-columnar-perf/
File Size
That is, the Parquet file is half as big as even the gzipped CSV. One of the reasons that the Parquet file is so small is because of dictionary-encoding (also called “dictionary compression”). Dictionary compression can yield substantially better compression than using a general purpose bytes compressor like LZ4 or ZSTD (which are used in the FST format). Parquet was designed to produce very small files that are fast to read.
Read Speed
When controlling by output type (e.g. comparing all R data.frame outputs with each other) we see the the performance of Parquet, Feather, and FST falls within a relatively small margin of each other. The same is true of the pandas.DataFrame outputs. data.table::fread is impressively competitive with the 1.5 GB file size but lags the others on the 2.5 GB CSV.
Independent Test
I performed some independent benchmarking on a simulated dataset of 1,000,000 rows. Basically I shuffled a bunch of things around to attempt to challenge the compression. Also I added a short text field of random words and two simulated factors.
Data
library(dplyr)
library(tibble)
library(OpenRepGrid)
n <- 1000000
set.seed(1234)
some_levels1 <- sapply(1:10, function(x) paste(LETTERS[sample(1:26, size = sample(3:8, 1), replace = TRUE)], collapse = ""))
some_levels2 <- sapply(1:65, function(x) paste(LETTERS[sample(1:26, size = sample(5:16, 1), replace = TRUE)], collapse = ""))
test_data <- mtcars %>%
rownames_to_column() %>%
sample_n(n, replace = TRUE) %>%
mutate_all(~ sample(., length(.))) %>%
mutate(factor1 = sample(some_levels1, n, replace = TRUE),
factor2 = sample(some_levels2, n, replace = TRUE),
text = randomSentences(n, sample(3:8, n, replace = TRUE))
)
Read and Write
Writing the data is easy.
library(arrow)
write_parquet(test_data , "test_data.parquet")
# you can also mess with the compression
write_parquet(test_data, "test_data2.parquet", compress = "gzip", compression_level = 9)
Reading the data is also easy.
read_parquet("test_data.parquet")
# this option will result in lightning fast reads, but in a different format.
read_parquet("test_data2.parquet", as_data_frame = FALSE)
I tested reading this data against a few of the competing options, and did get slightly different results than with the article above, which is expected.
This file is nowhere near as large as the benchmark article, so maybe that is the difference.
Tests
rds: test_data.rds (20.3 MB)
parquet2_native: (14.9 MB with higher compression and as_data_frame = FALSE)
parquet2: test_data2.parquet (14.9 MB with higher compression)
parquet: test_data.parquet (40.7 MB)
fst2: test_data2.fst (27.9 MB with higher compression)
fst: test_data.fst (76.8 MB)
fread2: test_data.csv.gz (23.6MB)
fread: test_data.csv (98.7MB)
feather_arrow: test_data.feather (157.2 MB read with arrow)
feather: test_data.feather (157.2 MB read with feather)
Observations
For this particular file, fread is actually very fast. I like the small file size from the highly compressed parquet2 test. I may invest the time to work with the native data format rather than a data.frame if I really need the speed up.
Here fst is also a great choice. I would either use the highly compressed fst format or the highly compressed parquet depending on if I needed the speed or file size trade off.
A minor additional points worth mentioning. If you have a very large file you can on the fly calculate the number of rows (if no header) using (where bedGraph is the name of your file in your working directory):
>numRow=as.integer(system(paste("wc -l", bedGraph, "| sed 's/[^0-9.]*\\([0-9.]*\\).*/\\1/'"), intern=T))
You can then use that either in read.csv , read.table ...
>system.time((BG=read.table(bedGraph, nrows=numRow, col.names=c('chr', 'start', 'end', 'score'),colClasses=c('character', rep('integer',3)))))
user system elapsed
25.877 0.887 26.752
>object.size(BG)
203949432 bytes
Often times I think it is just good practice to keep larger databases inside a database (e.g. Postgres). I don't use anything too much larger than (nrow * ncol) ncell = 10M, which is pretty small; but I often find I want R to create and hold memory intensive graphs only while I query from multiple databases. In the future of 32 GB laptops, some of these types of memory problems will disappear. But the allure of using a database to hold the data and then using R's memory for the resulting query results and graphs still may be useful. Some advantages are:
(1) The data stays loaded in your database. You simply reconnect in pgadmin to the databases you want when you turn your laptop back on.
(2) It is true R can do many more nifty statistical and graphing operations than SQL. But I think SQL is better designed to query large amounts of data than R.
# Looking at Voter/Registrant Age by Decade
library(RPostgreSQL);library(lattice)
con <- dbConnect(PostgreSQL(), user= "postgres", password="password",
port="2345", host="localhost", dbname="WC2014_08_01_2014")
Decade_BD_1980_42 <- dbGetQuery(con,"Select PrecinctID,Count(PrecinctID),extract(DECADE from Birthdate) from voterdb where extract(DECADE from Birthdate)::numeric > 198 and PrecinctID in (Select * from LD42) Group By PrecinctID,date_part Order by Count DESC;")
Decade_RD_1980_42 <- dbGetQuery(con,"Select PrecinctID,Count(PrecinctID),extract(DECADE from RegistrationDate) from voterdb where extract(DECADE from RegistrationDate)::numeric > 198 and PrecinctID in (Select * from LD42) Group By PrecinctID,date_part Order by Count DESC;")
with(Decade_BD_1980_42,(barchart(~count | as.factor(precinctid))));
mtext("42LD Birthdays later than 1980 by Precinct",side=1,line=0)
with(Decade_RD_1980_42,(barchart(~count | as.factor(precinctid))));
mtext("42LD Registration Dates later than 1980 by Precinct",side=1,line=0)
I wanted to contribute Spark-based solution in the simplest form:
# Test Data ---------------------------------------------------------------
set.seed(123)
bigdf <-
data.frame(
dim = sample(letters, replace = T, 4e7),
fact1 = rnorm(4e7),
fact2 = rnorm(4e7, 20, 50)
)
tmp_csv <- fs::file_temp(pattern = "big_df", ext = ".csv")
readr::write_csv(x = bigdf, file = tmp_csv)
# Spark -------------------------------------------------------------------
# Installing if needed
# sparklyr::spark_available_versions()
# sparklyr::spark_install()
library("sparklyr")
sc <- spark_connect(master = "local")
# Uploading CSV
system.time(tbl_big_df <- spark_read_csv(sc = sc, path = tmp_csv))
Spark generated fairly OK results:
>> system.time(tbl_big_df <- spark_read_csv(sc = sc, path = tmp_csv))
user system elapsed
0.278 0.034 11.747
This was tested on MacBook Pro with 32GB ram.
Remarks
Spark, usually shouldn't be able to "win" against packages optimised for speed. Nevertheless, I wanted to contribute an answer using Spark:
For some of the comments and answers where process didn't work using Spark may be a viable alternative
In a long-run, hammering as much data as possible into data.frame may prove problematic later on, when other operations are attempted on that object and hit the performance envelope of architecture
I think that for questions like that, where the task is to handle 1e7 or more rows Spark should be given considerations. Even if it may be possible to "hammer in" that data into a single data.frame it's just doesn't feel right. Likely that object will be difficult to work with and create problems when deploying models, etc.
Instead of the conventional read.table I feel fread is a faster function.
Specifying additional attributes like select only the required columns, specifying colclasses and string as factors will reduce the time take to import the file.
data_frame <- fread("filename.csv",sep=",",header=FALSE,stringsAsFactors=FALSE,select=c(1,4,5,6,7),colClasses=c("as.numeric","as.character","as.numeric","as.Date","as.Factor"))
I've tried all above and [readr][1] made the best job. I have only 8gb RAM
Loop for 20 files, 5gb each, 7 columns:
read_fwf(arquivos[i],col_types = "ccccccc",fwf_cols(cnpj = c(4,17), nome = c(19,168), cpf = c(169,183), fantasia = c(169,223), sit.cadastral = c(224,225), dt.sitcadastral = c(226,233), cnae = c(376,382)))

How to read 1.5GB .csv file into R in a more efficient and rapid way? [duplicate]

I have very large tables (30 million rows) that I would like to load as a dataframes in R. read.table() has a lot of convenient features, but it seems like there is a lot of logic in the implementation that would slow things down. In my case, I am assuming I know the types of the columns ahead of time, the table does not contain any column headers or row names, and does not have any pathological characters that I have to worry about.
I know that reading in a table as a list using scan() can be quite fast, e.g.:
datalist <- scan('myfile',sep='\t',list(url='',popularity=0,mintime=0,maxtime=0)))
But some of my attempts to convert this to a dataframe appear to decrease the performance of the above by a factor of 6:
df <- as.data.frame(scan('myfile',sep='\t',list(url='',popularity=0,mintime=0,maxtime=0))))
Is there a better way of doing this? Or quite possibly completely different approach to the problem?
An update, several years later
This answer is old, and R has moved on. Tweaking read.table to run a bit faster has precious little benefit. Your options are:
Using vroom from the tidyverse package vroom for importing data from csv/tab-delimited files directly into an R tibble. See Hector's answer.
Using fread in data.table for importing data from csv/tab-delimited files directly into R. See mnel's answer.
Using read_table in readr (on CRAN from April 2015). This works much like fread above. The readme in the link explains the difference between the two functions (readr currently claims to be "1.5-2x slower" than data.table::fread).
read.csv.raw from iotools provides a third option for quickly reading CSV files.
Trying to store as much data as you can in databases rather than flat files. (As well as being a better permanent storage medium, data is passed to and from R in a binary format, which is faster.) read.csv.sql in the sqldf package, as described in JD Long's answer, imports data into a temporary SQLite database and then reads it into R. See also: the RODBC package, and the reverse depends section of the DBI package page. MonetDB.R gives you a data type that pretends to be a data frame but is really a MonetDB underneath, increasing performance. Import data with its monetdb.read.csv function. dplyr allows you to work directly with data stored in several types of database.
Storing data in binary formats can also be useful for improving performance. Use saveRDS/readRDS (see below), the h5 or rhdf5 packages for HDF5 format, or write_fst/read_fst from the fst package.
The original answer
There are a couple of simple things to try, whether you use read.table or scan.
Set nrows=the number of records in your data (nmax in scan).
Make sure that comment.char="" to turn off interpretation of comments.
Explicitly define the classes of each column using colClasses in read.table.
Setting multi.line=FALSE may also improve performance in scan.
If none of these thing work, then use one of the profiling packages to determine which lines are slowing things down. Perhaps you can write a cut down version of read.table based on the results.
The other alternative is filtering your data before you read it into R.
Or, if the problem is that you have to read it in regularly, then use these methods to read the data in once, then save the data frame as a binary blob with save saveRDS, then next time you can retrieve it faster with load readRDS.
Here is an example that utilizes fread from data.table 1.8.7
The examples come from the help page to fread, with the timings on my windows XP Core 2 duo E8400.
library(data.table)
# Demo speedup
n=1e6
DT = data.table( a=sample(1:1000,n,replace=TRUE),
b=sample(1:1000,n,replace=TRUE),
c=rnorm(n),
d=sample(c("foo","bar","baz","qux","quux"),n,replace=TRUE),
e=rnorm(n),
f=sample(1:1000,n,replace=TRUE) )
DT[2,b:=NA_integer_]
DT[4,c:=NA_real_]
DT[3,d:=NA_character_]
DT[5,d:=""]
DT[2,e:=+Inf]
DT[3,e:=-Inf]
standard read.table
write.table(DT,"test.csv",sep=",",row.names=FALSE,quote=FALSE)
cat("File size (MB):",round(file.info("test.csv")$size/1024^2),"\n")
## File size (MB): 51
system.time(DF1 <- read.csv("test.csv",stringsAsFactors=FALSE))
## user system elapsed
## 24.71 0.15 25.42
# second run will be faster
system.time(DF1 <- read.csv("test.csv",stringsAsFactors=FALSE))
## user system elapsed
## 17.85 0.07 17.98
optimized read.table
system.time(DF2 <- read.table("test.csv",header=TRUE,sep=",",quote="",
stringsAsFactors=FALSE,comment.char="",nrows=n,
colClasses=c("integer","integer","numeric",
"character","numeric","integer")))
## user system elapsed
## 10.20 0.03 10.32
fread
require(data.table)
system.time(DT <- fread("test.csv"))
## user system elapsed
## 3.12 0.01 3.22
sqldf
require(sqldf)
system.time(SQLDF <- read.csv.sql("test.csv",dbname=NULL))
## user system elapsed
## 12.49 0.09 12.69
# sqldf as on SO
f <- file("test.csv")
system.time(SQLf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F)))
## user system elapsed
## 10.21 0.47 10.73
ff / ffdf
require(ff)
system.time(FFDF <- read.csv.ffdf(file="test.csv",nrows=n))
## user system elapsed
## 10.85 0.10 10.99
In summary:
## user system elapsed Method
## 24.71 0.15 25.42 read.csv (first time)
## 17.85 0.07 17.98 read.csv (second time)
## 10.20 0.03 10.32 Optimized read.table
## 3.12 0.01 3.22 fread
## 12.49 0.09 12.69 sqldf
## 10.21 0.47 10.73 sqldf on SO
## 10.85 0.10 10.99 ffdf
I didn't see this question initially and asked a similar question a few days later. I am going to take my previous question down, but I thought I'd add an answer here to explain how I used sqldf() to do this.
There's been little bit of discussion as to the best way to import 2GB or more of text data into an R data frame. Yesterday I wrote a blog post about using sqldf() to import the data into SQLite as a staging area, and then sucking it from SQLite into R. This works really well for me. I was able to pull in 2GB (3 columns, 40mm rows) of data in < 5 minutes. By contrast, the read.csv command ran all night and never completed.
Here's my test code:
Set up the test data:
bigdf <- data.frame(dim=sample(letters, replace=T, 4e7), fact1=rnorm(4e7), fact2=rnorm(4e7, 20, 50))
write.csv(bigdf, 'bigdf.csv', quote = F)
I restarted R before running the following import routine:
library(sqldf)
f <- file("bigdf.csv")
system.time(bigdf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F)))
I let the following line run all night but it never completed:
system.time(big.df <- read.csv('bigdf.csv'))
Strangely, no one answered the bottom part of the question for years even though this is an important one -- data.frames are simply lists with the right attributes, so if you have large data you don't want to use as.data.frame or similar for a list. It's much faster to simply "turn" a list into a data frame in-place:
attr(df, "row.names") <- .set_row_names(length(df[[1]]))
class(df) <- "data.frame"
This makes no copy of the data so it's immediate (unlike all other methods). It assumes that you have already set names() on the list accordingly.
[As for loading large data into R -- personally, I dump them by column into binary files and use readBin() - that is by far the fastest method (other than mmapping) and is only limited by the disk speed. Parsing ASCII files is inherently slow (even in C) compared to binary data.]
This was previously asked on R-Help, so that's worth reviewing.
One suggestion there was to use readChar() and then do string manipulation on the result with strsplit() and substr(). You can see the logic involved in readChar is much less than read.table.
I don't know if memory is an issue here, but you might also want to take a look at the HadoopStreaming package. This uses Hadoop, which is a MapReduce framework designed for dealing with large data sets. For this, you would use the hsTableReader function. This is an example (but it has a learning curve to learn Hadoop):
str <- "key1\t3.9\nkey1\t8.9\nkey1\t1.2\nkey1\t3.9\nkey1\t8.9\nkey1\t1.2\nkey2\t9.9\nkey2\"
cat(str)
cols = list(key='',val=0)
con <- textConnection(str, open = "r")
hsTableReader(con,cols,chunkSize=6,FUN=print,ignoreKey=TRUE)
close(con)
The basic idea here is to break the data import into chunks. You could even go so far as to use one of the parallel frameworks (e.g. snow) and run the data import in parallel by segmenting the file, but most likely for large data sets that won't help since you will run into memory constraints, which is why map-reduce is a better approach.
An alternative is to use the vroom package. Now on CRAN.
vroom doesn't load the entire file, it indexes where each record is located, and is read later when you use it.
Only pay for what you use.
See Introduction to vroom, Get started with vroom and the vroom benchmarks.
The basic overview is that the initial read of a huge file, will be much faster, and subsequent modifications to the data may be slightly slower. So depending on what your use is, it could be the best option.
See a simplified example from vroom benchmarks below, the key parts to see is the super fast read times, but slightly sower operations like aggregate etc..
package read print sample filter aggregate total
read.delim 1m 21.5s 1ms 315ms 764ms 1m 22.6s
readr 33.1s 90ms 2ms 202ms 825ms 34.2s
data.table 15.7s 13ms 1ms 129ms 394ms 16.3s
vroom (altrep) dplyr 1.7s 89ms 1.7s 1.3s 1.9s 6.7s
I am reading data very quickly using the new arrow package. It appears to be in a fairly early stage.
Specifically, I am using the parquet columnar format. This converts back to a data.frame in R, but you can get even deeper speedups if you do not. This format is convenient as it can be used from Python as well.
My main use case for this is on a fairly restrained RShiny server. For these reasons, I prefer to keep data attached to the Apps (i.e., out of SQL), and therefore require small file size as well as speed.
This linked article provides benchmarking and a good overview. I have quoted some interesting points below.
https://ursalabs.org/blog/2019-10-columnar-perf/
File Size
That is, the Parquet file is half as big as even the gzipped CSV. One of the reasons that the Parquet file is so small is because of dictionary-encoding (also called “dictionary compression”). Dictionary compression can yield substantially better compression than using a general purpose bytes compressor like LZ4 or ZSTD (which are used in the FST format). Parquet was designed to produce very small files that are fast to read.
Read Speed
When controlling by output type (e.g. comparing all R data.frame outputs with each other) we see the the performance of Parquet, Feather, and FST falls within a relatively small margin of each other. The same is true of the pandas.DataFrame outputs. data.table::fread is impressively competitive with the 1.5 GB file size but lags the others on the 2.5 GB CSV.
Independent Test
I performed some independent benchmarking on a simulated dataset of 1,000,000 rows. Basically I shuffled a bunch of things around to attempt to challenge the compression. Also I added a short text field of random words and two simulated factors.
Data
library(dplyr)
library(tibble)
library(OpenRepGrid)
n <- 1000000
set.seed(1234)
some_levels1 <- sapply(1:10, function(x) paste(LETTERS[sample(1:26, size = sample(3:8, 1), replace = TRUE)], collapse = ""))
some_levels2 <- sapply(1:65, function(x) paste(LETTERS[sample(1:26, size = sample(5:16, 1), replace = TRUE)], collapse = ""))
test_data <- mtcars %>%
rownames_to_column() %>%
sample_n(n, replace = TRUE) %>%
mutate_all(~ sample(., length(.))) %>%
mutate(factor1 = sample(some_levels1, n, replace = TRUE),
factor2 = sample(some_levels2, n, replace = TRUE),
text = randomSentences(n, sample(3:8, n, replace = TRUE))
)
Read and Write
Writing the data is easy.
library(arrow)
write_parquet(test_data , "test_data.parquet")
# you can also mess with the compression
write_parquet(test_data, "test_data2.parquet", compress = "gzip", compression_level = 9)
Reading the data is also easy.
read_parquet("test_data.parquet")
# this option will result in lightning fast reads, but in a different format.
read_parquet("test_data2.parquet", as_data_frame = FALSE)
I tested reading this data against a few of the competing options, and did get slightly different results than with the article above, which is expected.
This file is nowhere near as large as the benchmark article, so maybe that is the difference.
Tests
rds: test_data.rds (20.3 MB)
parquet2_native: (14.9 MB with higher compression and as_data_frame = FALSE)
parquet2: test_data2.parquet (14.9 MB with higher compression)
parquet: test_data.parquet (40.7 MB)
fst2: test_data2.fst (27.9 MB with higher compression)
fst: test_data.fst (76.8 MB)
fread2: test_data.csv.gz (23.6MB)
fread: test_data.csv (98.7MB)
feather_arrow: test_data.feather (157.2 MB read with arrow)
feather: test_data.feather (157.2 MB read with feather)
Observations
For this particular file, fread is actually very fast. I like the small file size from the highly compressed parquet2 test. I may invest the time to work with the native data format rather than a data.frame if I really need the speed up.
Here fst is also a great choice. I would either use the highly compressed fst format or the highly compressed parquet depending on if I needed the speed or file size trade off.
A minor additional points worth mentioning. If you have a very large file you can on the fly calculate the number of rows (if no header) using (where bedGraph is the name of your file in your working directory):
>numRow=as.integer(system(paste("wc -l", bedGraph, "| sed 's/[^0-9.]*\\([0-9.]*\\).*/\\1/'"), intern=T))
You can then use that either in read.csv , read.table ...
>system.time((BG=read.table(bedGraph, nrows=numRow, col.names=c('chr', 'start', 'end', 'score'),colClasses=c('character', rep('integer',3)))))
user system elapsed
25.877 0.887 26.752
>object.size(BG)
203949432 bytes
Often times I think it is just good practice to keep larger databases inside a database (e.g. Postgres). I don't use anything too much larger than (nrow * ncol) ncell = 10M, which is pretty small; but I often find I want R to create and hold memory intensive graphs only while I query from multiple databases. In the future of 32 GB laptops, some of these types of memory problems will disappear. But the allure of using a database to hold the data and then using R's memory for the resulting query results and graphs still may be useful. Some advantages are:
(1) The data stays loaded in your database. You simply reconnect in pgadmin to the databases you want when you turn your laptop back on.
(2) It is true R can do many more nifty statistical and graphing operations than SQL. But I think SQL is better designed to query large amounts of data than R.
# Looking at Voter/Registrant Age by Decade
library(RPostgreSQL);library(lattice)
con <- dbConnect(PostgreSQL(), user= "postgres", password="password",
port="2345", host="localhost", dbname="WC2014_08_01_2014")
Decade_BD_1980_42 <- dbGetQuery(con,"Select PrecinctID,Count(PrecinctID),extract(DECADE from Birthdate) from voterdb where extract(DECADE from Birthdate)::numeric > 198 and PrecinctID in (Select * from LD42) Group By PrecinctID,date_part Order by Count DESC;")
Decade_RD_1980_42 <- dbGetQuery(con,"Select PrecinctID,Count(PrecinctID),extract(DECADE from RegistrationDate) from voterdb where extract(DECADE from RegistrationDate)::numeric > 198 and PrecinctID in (Select * from LD42) Group By PrecinctID,date_part Order by Count DESC;")
with(Decade_BD_1980_42,(barchart(~count | as.factor(precinctid))));
mtext("42LD Birthdays later than 1980 by Precinct",side=1,line=0)
with(Decade_RD_1980_42,(barchart(~count | as.factor(precinctid))));
mtext("42LD Registration Dates later than 1980 by Precinct",side=1,line=0)
I wanted to contribute Spark-based solution in the simplest form:
# Test Data ---------------------------------------------------------------
set.seed(123)
bigdf <-
data.frame(
dim = sample(letters, replace = T, 4e7),
fact1 = rnorm(4e7),
fact2 = rnorm(4e7, 20, 50)
)
tmp_csv <- fs::file_temp(pattern = "big_df", ext = ".csv")
readr::write_csv(x = bigdf, file = tmp_csv)
# Spark -------------------------------------------------------------------
# Installing if needed
# sparklyr::spark_available_versions()
# sparklyr::spark_install()
library("sparklyr")
sc <- spark_connect(master = "local")
# Uploading CSV
system.time(tbl_big_df <- spark_read_csv(sc = sc, path = tmp_csv))
Spark generated fairly OK results:
>> system.time(tbl_big_df <- spark_read_csv(sc = sc, path = tmp_csv))
user system elapsed
0.278 0.034 11.747
This was tested on MacBook Pro with 32GB ram.
Remarks
Spark, usually shouldn't be able to "win" against packages optimised for speed. Nevertheless, I wanted to contribute an answer using Spark:
For some of the comments and answers where process didn't work using Spark may be a viable alternative
In a long-run, hammering as much data as possible into data.frame may prove problematic later on, when other operations are attempted on that object and hit the performance envelope of architecture
I think that for questions like that, where the task is to handle 1e7 or more rows Spark should be given considerations. Even if it may be possible to "hammer in" that data into a single data.frame it's just doesn't feel right. Likely that object will be difficult to work with and create problems when deploying models, etc.
Instead of the conventional read.table I feel fread is a faster function.
Specifying additional attributes like select only the required columns, specifying colclasses and string as factors will reduce the time take to import the file.
data_frame <- fread("filename.csv",sep=",",header=FALSE,stringsAsFactors=FALSE,select=c(1,4,5,6,7),colClasses=c("as.numeric","as.character","as.numeric","as.Date","as.Factor"))
I've tried all above and [readr][1] made the best job. I have only 8gb RAM
Loop for 20 files, 5gb each, 7 columns:
read_fwf(arquivos[i],col_types = "ccccccc",fwf_cols(cnpj = c(4,17), nome = c(19,168), cpf = c(169,183), fantasia = c(169,223), sit.cadastral = c(224,225), dt.sitcadastral = c(226,233), cnae = c(376,382)))

How to optimize read.csv [duplicate]

I have very large tables (30 million rows) that I would like to load as a dataframes in R. read.table() has a lot of convenient features, but it seems like there is a lot of logic in the implementation that would slow things down. In my case, I am assuming I know the types of the columns ahead of time, the table does not contain any column headers or row names, and does not have any pathological characters that I have to worry about.
I know that reading in a table as a list using scan() can be quite fast, e.g.:
datalist <- scan('myfile',sep='\t',list(url='',popularity=0,mintime=0,maxtime=0)))
But some of my attempts to convert this to a dataframe appear to decrease the performance of the above by a factor of 6:
df <- as.data.frame(scan('myfile',sep='\t',list(url='',popularity=0,mintime=0,maxtime=0))))
Is there a better way of doing this? Or quite possibly completely different approach to the problem?
An update, several years later
This answer is old, and R has moved on. Tweaking read.table to run a bit faster has precious little benefit. Your options are:
Using vroom from the tidyverse package vroom for importing data from csv/tab-delimited files directly into an R tibble. See Hector's answer.
Using fread in data.table for importing data from csv/tab-delimited files directly into R. See mnel's answer.
Using read_table in readr (on CRAN from April 2015). This works much like fread above. The readme in the link explains the difference between the two functions (readr currently claims to be "1.5-2x slower" than data.table::fread).
read.csv.raw from iotools provides a third option for quickly reading CSV files.
Trying to store as much data as you can in databases rather than flat files. (As well as being a better permanent storage medium, data is passed to and from R in a binary format, which is faster.) read.csv.sql in the sqldf package, as described in JD Long's answer, imports data into a temporary SQLite database and then reads it into R. See also: the RODBC package, and the reverse depends section of the DBI package page. MonetDB.R gives you a data type that pretends to be a data frame but is really a MonetDB underneath, increasing performance. Import data with its monetdb.read.csv function. dplyr allows you to work directly with data stored in several types of database.
Storing data in binary formats can also be useful for improving performance. Use saveRDS/readRDS (see below), the h5 or rhdf5 packages for HDF5 format, or write_fst/read_fst from the fst package.
The original answer
There are a couple of simple things to try, whether you use read.table or scan.
Set nrows=the number of records in your data (nmax in scan).
Make sure that comment.char="" to turn off interpretation of comments.
Explicitly define the classes of each column using colClasses in read.table.
Setting multi.line=FALSE may also improve performance in scan.
If none of these thing work, then use one of the profiling packages to determine which lines are slowing things down. Perhaps you can write a cut down version of read.table based on the results.
The other alternative is filtering your data before you read it into R.
Or, if the problem is that you have to read it in regularly, then use these methods to read the data in once, then save the data frame as a binary blob with save saveRDS, then next time you can retrieve it faster with load readRDS.
Here is an example that utilizes fread from data.table 1.8.7
The examples come from the help page to fread, with the timings on my windows XP Core 2 duo E8400.
library(data.table)
# Demo speedup
n=1e6
DT = data.table( a=sample(1:1000,n,replace=TRUE),
b=sample(1:1000,n,replace=TRUE),
c=rnorm(n),
d=sample(c("foo","bar","baz","qux","quux"),n,replace=TRUE),
e=rnorm(n),
f=sample(1:1000,n,replace=TRUE) )
DT[2,b:=NA_integer_]
DT[4,c:=NA_real_]
DT[3,d:=NA_character_]
DT[5,d:=""]
DT[2,e:=+Inf]
DT[3,e:=-Inf]
standard read.table
write.table(DT,"test.csv",sep=",",row.names=FALSE,quote=FALSE)
cat("File size (MB):",round(file.info("test.csv")$size/1024^2),"\n")
## File size (MB): 51
system.time(DF1 <- read.csv("test.csv",stringsAsFactors=FALSE))
## user system elapsed
## 24.71 0.15 25.42
# second run will be faster
system.time(DF1 <- read.csv("test.csv",stringsAsFactors=FALSE))
## user system elapsed
## 17.85 0.07 17.98
optimized read.table
system.time(DF2 <- read.table("test.csv",header=TRUE,sep=",",quote="",
stringsAsFactors=FALSE,comment.char="",nrows=n,
colClasses=c("integer","integer","numeric",
"character","numeric","integer")))
## user system elapsed
## 10.20 0.03 10.32
fread
require(data.table)
system.time(DT <- fread("test.csv"))
## user system elapsed
## 3.12 0.01 3.22
sqldf
require(sqldf)
system.time(SQLDF <- read.csv.sql("test.csv",dbname=NULL))
## user system elapsed
## 12.49 0.09 12.69
# sqldf as on SO
f <- file("test.csv")
system.time(SQLf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F)))
## user system elapsed
## 10.21 0.47 10.73
ff / ffdf
require(ff)
system.time(FFDF <- read.csv.ffdf(file="test.csv",nrows=n))
## user system elapsed
## 10.85 0.10 10.99
In summary:
## user system elapsed Method
## 24.71 0.15 25.42 read.csv (first time)
## 17.85 0.07 17.98 read.csv (second time)
## 10.20 0.03 10.32 Optimized read.table
## 3.12 0.01 3.22 fread
## 12.49 0.09 12.69 sqldf
## 10.21 0.47 10.73 sqldf on SO
## 10.85 0.10 10.99 ffdf
I didn't see this question initially and asked a similar question a few days later. I am going to take my previous question down, but I thought I'd add an answer here to explain how I used sqldf() to do this.
There's been little bit of discussion as to the best way to import 2GB or more of text data into an R data frame. Yesterday I wrote a blog post about using sqldf() to import the data into SQLite as a staging area, and then sucking it from SQLite into R. This works really well for me. I was able to pull in 2GB (3 columns, 40mm rows) of data in < 5 minutes. By contrast, the read.csv command ran all night and never completed.
Here's my test code:
Set up the test data:
bigdf <- data.frame(dim=sample(letters, replace=T, 4e7), fact1=rnorm(4e7), fact2=rnorm(4e7, 20, 50))
write.csv(bigdf, 'bigdf.csv', quote = F)
I restarted R before running the following import routine:
library(sqldf)
f <- file("bigdf.csv")
system.time(bigdf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F)))
I let the following line run all night but it never completed:
system.time(big.df <- read.csv('bigdf.csv'))
Strangely, no one answered the bottom part of the question for years even though this is an important one -- data.frames are simply lists with the right attributes, so if you have large data you don't want to use as.data.frame or similar for a list. It's much faster to simply "turn" a list into a data frame in-place:
attr(df, "row.names") <- .set_row_names(length(df[[1]]))
class(df) <- "data.frame"
This makes no copy of the data so it's immediate (unlike all other methods). It assumes that you have already set names() on the list accordingly.
[As for loading large data into R -- personally, I dump them by column into binary files and use readBin() - that is by far the fastest method (other than mmapping) and is only limited by the disk speed. Parsing ASCII files is inherently slow (even in C) compared to binary data.]
This was previously asked on R-Help, so that's worth reviewing.
One suggestion there was to use readChar() and then do string manipulation on the result with strsplit() and substr(). You can see the logic involved in readChar is much less than read.table.
I don't know if memory is an issue here, but you might also want to take a look at the HadoopStreaming package. This uses Hadoop, which is a MapReduce framework designed for dealing with large data sets. For this, you would use the hsTableReader function. This is an example (but it has a learning curve to learn Hadoop):
str <- "key1\t3.9\nkey1\t8.9\nkey1\t1.2\nkey1\t3.9\nkey1\t8.9\nkey1\t1.2\nkey2\t9.9\nkey2\"
cat(str)
cols = list(key='',val=0)
con <- textConnection(str, open = "r")
hsTableReader(con,cols,chunkSize=6,FUN=print,ignoreKey=TRUE)
close(con)
The basic idea here is to break the data import into chunks. You could even go so far as to use one of the parallel frameworks (e.g. snow) and run the data import in parallel by segmenting the file, but most likely for large data sets that won't help since you will run into memory constraints, which is why map-reduce is a better approach.
An alternative is to use the vroom package. Now on CRAN.
vroom doesn't load the entire file, it indexes where each record is located, and is read later when you use it.
Only pay for what you use.
See Introduction to vroom, Get started with vroom and the vroom benchmarks.
The basic overview is that the initial read of a huge file, will be much faster, and subsequent modifications to the data may be slightly slower. So depending on what your use is, it could be the best option.
See a simplified example from vroom benchmarks below, the key parts to see is the super fast read times, but slightly sower operations like aggregate etc..
package read print sample filter aggregate total
read.delim 1m 21.5s 1ms 315ms 764ms 1m 22.6s
readr 33.1s 90ms 2ms 202ms 825ms 34.2s
data.table 15.7s 13ms 1ms 129ms 394ms 16.3s
vroom (altrep) dplyr 1.7s 89ms 1.7s 1.3s 1.9s 6.7s
I am reading data very quickly using the new arrow package. It appears to be in a fairly early stage.
Specifically, I am using the parquet columnar format. This converts back to a data.frame in R, but you can get even deeper speedups if you do not. This format is convenient as it can be used from Python as well.
My main use case for this is on a fairly restrained RShiny server. For these reasons, I prefer to keep data attached to the Apps (i.e., out of SQL), and therefore require small file size as well as speed.
This linked article provides benchmarking and a good overview. I have quoted some interesting points below.
https://ursalabs.org/blog/2019-10-columnar-perf/
File Size
That is, the Parquet file is half as big as even the gzipped CSV. One of the reasons that the Parquet file is so small is because of dictionary-encoding (also called “dictionary compression”). Dictionary compression can yield substantially better compression than using a general purpose bytes compressor like LZ4 or ZSTD (which are used in the FST format). Parquet was designed to produce very small files that are fast to read.
Read Speed
When controlling by output type (e.g. comparing all R data.frame outputs with each other) we see the the performance of Parquet, Feather, and FST falls within a relatively small margin of each other. The same is true of the pandas.DataFrame outputs. data.table::fread is impressively competitive with the 1.5 GB file size but lags the others on the 2.5 GB CSV.
Independent Test
I performed some independent benchmarking on a simulated dataset of 1,000,000 rows. Basically I shuffled a bunch of things around to attempt to challenge the compression. Also I added a short text field of random words and two simulated factors.
Data
library(dplyr)
library(tibble)
library(OpenRepGrid)
n <- 1000000
set.seed(1234)
some_levels1 <- sapply(1:10, function(x) paste(LETTERS[sample(1:26, size = sample(3:8, 1), replace = TRUE)], collapse = ""))
some_levels2 <- sapply(1:65, function(x) paste(LETTERS[sample(1:26, size = sample(5:16, 1), replace = TRUE)], collapse = ""))
test_data <- mtcars %>%
rownames_to_column() %>%
sample_n(n, replace = TRUE) %>%
mutate_all(~ sample(., length(.))) %>%
mutate(factor1 = sample(some_levels1, n, replace = TRUE),
factor2 = sample(some_levels2, n, replace = TRUE),
text = randomSentences(n, sample(3:8, n, replace = TRUE))
)
Read and Write
Writing the data is easy.
library(arrow)
write_parquet(test_data , "test_data.parquet")
# you can also mess with the compression
write_parquet(test_data, "test_data2.parquet", compress = "gzip", compression_level = 9)
Reading the data is also easy.
read_parquet("test_data.parquet")
# this option will result in lightning fast reads, but in a different format.
read_parquet("test_data2.parquet", as_data_frame = FALSE)
I tested reading this data against a few of the competing options, and did get slightly different results than with the article above, which is expected.
This file is nowhere near as large as the benchmark article, so maybe that is the difference.
Tests
rds: test_data.rds (20.3 MB)
parquet2_native: (14.9 MB with higher compression and as_data_frame = FALSE)
parquet2: test_data2.parquet (14.9 MB with higher compression)
parquet: test_data.parquet (40.7 MB)
fst2: test_data2.fst (27.9 MB with higher compression)
fst: test_data.fst (76.8 MB)
fread2: test_data.csv.gz (23.6MB)
fread: test_data.csv (98.7MB)
feather_arrow: test_data.feather (157.2 MB read with arrow)
feather: test_data.feather (157.2 MB read with feather)
Observations
For this particular file, fread is actually very fast. I like the small file size from the highly compressed parquet2 test. I may invest the time to work with the native data format rather than a data.frame if I really need the speed up.
Here fst is also a great choice. I would either use the highly compressed fst format or the highly compressed parquet depending on if I needed the speed or file size trade off.
A minor additional points worth mentioning. If you have a very large file you can on the fly calculate the number of rows (if no header) using (where bedGraph is the name of your file in your working directory):
>numRow=as.integer(system(paste("wc -l", bedGraph, "| sed 's/[^0-9.]*\\([0-9.]*\\).*/\\1/'"), intern=T))
You can then use that either in read.csv , read.table ...
>system.time((BG=read.table(bedGraph, nrows=numRow, col.names=c('chr', 'start', 'end', 'score'),colClasses=c('character', rep('integer',3)))))
user system elapsed
25.877 0.887 26.752
>object.size(BG)
203949432 bytes
Often times I think it is just good practice to keep larger databases inside a database (e.g. Postgres). I don't use anything too much larger than (nrow * ncol) ncell = 10M, which is pretty small; but I often find I want R to create and hold memory intensive graphs only while I query from multiple databases. In the future of 32 GB laptops, some of these types of memory problems will disappear. But the allure of using a database to hold the data and then using R's memory for the resulting query results and graphs still may be useful. Some advantages are:
(1) The data stays loaded in your database. You simply reconnect in pgadmin to the databases you want when you turn your laptop back on.
(2) It is true R can do many more nifty statistical and graphing operations than SQL. But I think SQL is better designed to query large amounts of data than R.
# Looking at Voter/Registrant Age by Decade
library(RPostgreSQL);library(lattice)
con <- dbConnect(PostgreSQL(), user= "postgres", password="password",
port="2345", host="localhost", dbname="WC2014_08_01_2014")
Decade_BD_1980_42 <- dbGetQuery(con,"Select PrecinctID,Count(PrecinctID),extract(DECADE from Birthdate) from voterdb where extract(DECADE from Birthdate)::numeric > 198 and PrecinctID in (Select * from LD42) Group By PrecinctID,date_part Order by Count DESC;")
Decade_RD_1980_42 <- dbGetQuery(con,"Select PrecinctID,Count(PrecinctID),extract(DECADE from RegistrationDate) from voterdb where extract(DECADE from RegistrationDate)::numeric > 198 and PrecinctID in (Select * from LD42) Group By PrecinctID,date_part Order by Count DESC;")
with(Decade_BD_1980_42,(barchart(~count | as.factor(precinctid))));
mtext("42LD Birthdays later than 1980 by Precinct",side=1,line=0)
with(Decade_RD_1980_42,(barchart(~count | as.factor(precinctid))));
mtext("42LD Registration Dates later than 1980 by Precinct",side=1,line=0)
I wanted to contribute Spark-based solution in the simplest form:
# Test Data ---------------------------------------------------------------
set.seed(123)
bigdf <-
data.frame(
dim = sample(letters, replace = T, 4e7),
fact1 = rnorm(4e7),
fact2 = rnorm(4e7, 20, 50)
)
tmp_csv <- fs::file_temp(pattern = "big_df", ext = ".csv")
readr::write_csv(x = bigdf, file = tmp_csv)
# Spark -------------------------------------------------------------------
# Installing if needed
# sparklyr::spark_available_versions()
# sparklyr::spark_install()
library("sparklyr")
sc <- spark_connect(master = "local")
# Uploading CSV
system.time(tbl_big_df <- spark_read_csv(sc = sc, path = tmp_csv))
Spark generated fairly OK results:
>> system.time(tbl_big_df <- spark_read_csv(sc = sc, path = tmp_csv))
user system elapsed
0.278 0.034 11.747
This was tested on MacBook Pro with 32GB ram.
Remarks
Spark, usually shouldn't be able to "win" against packages optimised for speed. Nevertheless, I wanted to contribute an answer using Spark:
For some of the comments and answers where process didn't work using Spark may be a viable alternative
In a long-run, hammering as much data as possible into data.frame may prove problematic later on, when other operations are attempted on that object and hit the performance envelope of architecture
I think that for questions like that, where the task is to handle 1e7 or more rows Spark should be given considerations. Even if it may be possible to "hammer in" that data into a single data.frame it's just doesn't feel right. Likely that object will be difficult to work with and create problems when deploying models, etc.
Instead of the conventional read.table I feel fread is a faster function.
Specifying additional attributes like select only the required columns, specifying colclasses and string as factors will reduce the time take to import the file.
data_frame <- fread("filename.csv",sep=",",header=FALSE,stringsAsFactors=FALSE,select=c(1,4,5,6,7),colClasses=c("as.numeric","as.character","as.numeric","as.Date","as.Factor"))
I've tried all above and [readr][1] made the best job. I have only 8gb RAM
Loop for 20 files, 5gb each, 7 columns:
read_fwf(arquivos[i],col_types = "ccccccc",fwf_cols(cnpj = c(4,17), nome = c(19,168), cpf = c(169,183), fantasia = c(169,223), sit.cadastral = c(224,225), dt.sitcadastral = c(226,233), cnae = c(376,382)))

Quickly reading very large tables as dataframes

I have very large tables (30 million rows) that I would like to load as a dataframes in R. read.table() has a lot of convenient features, but it seems like there is a lot of logic in the implementation that would slow things down. In my case, I am assuming I know the types of the columns ahead of time, the table does not contain any column headers or row names, and does not have any pathological characters that I have to worry about.
I know that reading in a table as a list using scan() can be quite fast, e.g.:
datalist <- scan('myfile',sep='\t',list(url='',popularity=0,mintime=0,maxtime=0)))
But some of my attempts to convert this to a dataframe appear to decrease the performance of the above by a factor of 6:
df <- as.data.frame(scan('myfile',sep='\t',list(url='',popularity=0,mintime=0,maxtime=0))))
Is there a better way of doing this? Or quite possibly completely different approach to the problem?
An update, several years later
This answer is old, and R has moved on. Tweaking read.table to run a bit faster has precious little benefit. Your options are:
Using vroom from the tidyverse package vroom for importing data from csv/tab-delimited files directly into an R tibble. See Hector's answer.
Using fread in data.table for importing data from csv/tab-delimited files directly into R. See mnel's answer.
Using read_table in readr (on CRAN from April 2015). This works much like fread above. The readme in the link explains the difference between the two functions (readr currently claims to be "1.5-2x slower" than data.table::fread).
read.csv.raw from iotools provides a third option for quickly reading CSV files.
Trying to store as much data as you can in databases rather than flat files. (As well as being a better permanent storage medium, data is passed to and from R in a binary format, which is faster.) read.csv.sql in the sqldf package, as described in JD Long's answer, imports data into a temporary SQLite database and then reads it into R. See also: the RODBC package, and the reverse depends section of the DBI package page. MonetDB.R gives you a data type that pretends to be a data frame but is really a MonetDB underneath, increasing performance. Import data with its monetdb.read.csv function. dplyr allows you to work directly with data stored in several types of database.
Storing data in binary formats can also be useful for improving performance. Use saveRDS/readRDS (see below), the h5 or rhdf5 packages for HDF5 format, or write_fst/read_fst from the fst package.
The original answer
There are a couple of simple things to try, whether you use read.table or scan.
Set nrows=the number of records in your data (nmax in scan).
Make sure that comment.char="" to turn off interpretation of comments.
Explicitly define the classes of each column using colClasses in read.table.
Setting multi.line=FALSE may also improve performance in scan.
If none of these thing work, then use one of the profiling packages to determine which lines are slowing things down. Perhaps you can write a cut down version of read.table based on the results.
The other alternative is filtering your data before you read it into R.
Or, if the problem is that you have to read it in regularly, then use these methods to read the data in once, then save the data frame as a binary blob with save saveRDS, then next time you can retrieve it faster with load readRDS.
Here is an example that utilizes fread from data.table 1.8.7
The examples come from the help page to fread, with the timings on my windows XP Core 2 duo E8400.
library(data.table)
# Demo speedup
n=1e6
DT = data.table( a=sample(1:1000,n,replace=TRUE),
b=sample(1:1000,n,replace=TRUE),
c=rnorm(n),
d=sample(c("foo","bar","baz","qux","quux"),n,replace=TRUE),
e=rnorm(n),
f=sample(1:1000,n,replace=TRUE) )
DT[2,b:=NA_integer_]
DT[4,c:=NA_real_]
DT[3,d:=NA_character_]
DT[5,d:=""]
DT[2,e:=+Inf]
DT[3,e:=-Inf]
standard read.table
write.table(DT,"test.csv",sep=",",row.names=FALSE,quote=FALSE)
cat("File size (MB):",round(file.info("test.csv")$size/1024^2),"\n")
## File size (MB): 51
system.time(DF1 <- read.csv("test.csv",stringsAsFactors=FALSE))
## user system elapsed
## 24.71 0.15 25.42
# second run will be faster
system.time(DF1 <- read.csv("test.csv",stringsAsFactors=FALSE))
## user system elapsed
## 17.85 0.07 17.98
optimized read.table
system.time(DF2 <- read.table("test.csv",header=TRUE,sep=",",quote="",
stringsAsFactors=FALSE,comment.char="",nrows=n,
colClasses=c("integer","integer","numeric",
"character","numeric","integer")))
## user system elapsed
## 10.20 0.03 10.32
fread
require(data.table)
system.time(DT <- fread("test.csv"))
## user system elapsed
## 3.12 0.01 3.22
sqldf
require(sqldf)
system.time(SQLDF <- read.csv.sql("test.csv",dbname=NULL))
## user system elapsed
## 12.49 0.09 12.69
# sqldf as on SO
f <- file("test.csv")
system.time(SQLf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F)))
## user system elapsed
## 10.21 0.47 10.73
ff / ffdf
require(ff)
system.time(FFDF <- read.csv.ffdf(file="test.csv",nrows=n))
## user system elapsed
## 10.85 0.10 10.99
In summary:
## user system elapsed Method
## 24.71 0.15 25.42 read.csv (first time)
## 17.85 0.07 17.98 read.csv (second time)
## 10.20 0.03 10.32 Optimized read.table
## 3.12 0.01 3.22 fread
## 12.49 0.09 12.69 sqldf
## 10.21 0.47 10.73 sqldf on SO
## 10.85 0.10 10.99 ffdf
I didn't see this question initially and asked a similar question a few days later. I am going to take my previous question down, but I thought I'd add an answer here to explain how I used sqldf() to do this.
There's been little bit of discussion as to the best way to import 2GB or more of text data into an R data frame. Yesterday I wrote a blog post about using sqldf() to import the data into SQLite as a staging area, and then sucking it from SQLite into R. This works really well for me. I was able to pull in 2GB (3 columns, 40mm rows) of data in < 5 minutes. By contrast, the read.csv command ran all night and never completed.
Here's my test code:
Set up the test data:
bigdf <- data.frame(dim=sample(letters, replace=T, 4e7), fact1=rnorm(4e7), fact2=rnorm(4e7, 20, 50))
write.csv(bigdf, 'bigdf.csv', quote = F)
I restarted R before running the following import routine:
library(sqldf)
f <- file("bigdf.csv")
system.time(bigdf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F)))
I let the following line run all night but it never completed:
system.time(big.df <- read.csv('bigdf.csv'))
Strangely, no one answered the bottom part of the question for years even though this is an important one -- data.frames are simply lists with the right attributes, so if you have large data you don't want to use as.data.frame or similar for a list. It's much faster to simply "turn" a list into a data frame in-place:
attr(df, "row.names") <- .set_row_names(length(df[[1]]))
class(df) <- "data.frame"
This makes no copy of the data so it's immediate (unlike all other methods). It assumes that you have already set names() on the list accordingly.
[As for loading large data into R -- personally, I dump them by column into binary files and use readBin() - that is by far the fastest method (other than mmapping) and is only limited by the disk speed. Parsing ASCII files is inherently slow (even in C) compared to binary data.]
This was previously asked on R-Help, so that's worth reviewing.
One suggestion there was to use readChar() and then do string manipulation on the result with strsplit() and substr(). You can see the logic involved in readChar is much less than read.table.
I don't know if memory is an issue here, but you might also want to take a look at the HadoopStreaming package. This uses Hadoop, which is a MapReduce framework designed for dealing with large data sets. For this, you would use the hsTableReader function. This is an example (but it has a learning curve to learn Hadoop):
str <- "key1\t3.9\nkey1\t8.9\nkey1\t1.2\nkey1\t3.9\nkey1\t8.9\nkey1\t1.2\nkey2\t9.9\nkey2\"
cat(str)
cols = list(key='',val=0)
con <- textConnection(str, open = "r")
hsTableReader(con,cols,chunkSize=6,FUN=print,ignoreKey=TRUE)
close(con)
The basic idea here is to break the data import into chunks. You could even go so far as to use one of the parallel frameworks (e.g. snow) and run the data import in parallel by segmenting the file, but most likely for large data sets that won't help since you will run into memory constraints, which is why map-reduce is a better approach.
An alternative is to use the vroom package. Now on CRAN.
vroom doesn't load the entire file, it indexes where each record is located, and is read later when you use it.
Only pay for what you use.
See Introduction to vroom, Get started with vroom and the vroom benchmarks.
The basic overview is that the initial read of a huge file, will be much faster, and subsequent modifications to the data may be slightly slower. So depending on what your use is, it could be the best option.
See a simplified example from vroom benchmarks below, the key parts to see is the super fast read times, but slightly sower operations like aggregate etc..
package read print sample filter aggregate total
read.delim 1m 21.5s 1ms 315ms 764ms 1m 22.6s
readr 33.1s 90ms 2ms 202ms 825ms 34.2s
data.table 15.7s 13ms 1ms 129ms 394ms 16.3s
vroom (altrep) dplyr 1.7s 89ms 1.7s 1.3s 1.9s 6.7s
I am reading data very quickly using the new arrow package. It appears to be in a fairly early stage.
Specifically, I am using the parquet columnar format. This converts back to a data.frame in R, but you can get even deeper speedups if you do not. This format is convenient as it can be used from Python as well.
My main use case for this is on a fairly restrained RShiny server. For these reasons, I prefer to keep data attached to the Apps (i.e., out of SQL), and therefore require small file size as well as speed.
This linked article provides benchmarking and a good overview. I have quoted some interesting points below.
https://ursalabs.org/blog/2019-10-columnar-perf/
File Size
That is, the Parquet file is half as big as even the gzipped CSV. One of the reasons that the Parquet file is so small is because of dictionary-encoding (also called “dictionary compression”). Dictionary compression can yield substantially better compression than using a general purpose bytes compressor like LZ4 or ZSTD (which are used in the FST format). Parquet was designed to produce very small files that are fast to read.
Read Speed
When controlling by output type (e.g. comparing all R data.frame outputs with each other) we see the the performance of Parquet, Feather, and FST falls within a relatively small margin of each other. The same is true of the pandas.DataFrame outputs. data.table::fread is impressively competitive with the 1.5 GB file size but lags the others on the 2.5 GB CSV.
Independent Test
I performed some independent benchmarking on a simulated dataset of 1,000,000 rows. Basically I shuffled a bunch of things around to attempt to challenge the compression. Also I added a short text field of random words and two simulated factors.
Data
library(dplyr)
library(tibble)
library(OpenRepGrid)
n <- 1000000
set.seed(1234)
some_levels1 <- sapply(1:10, function(x) paste(LETTERS[sample(1:26, size = sample(3:8, 1), replace = TRUE)], collapse = ""))
some_levels2 <- sapply(1:65, function(x) paste(LETTERS[sample(1:26, size = sample(5:16, 1), replace = TRUE)], collapse = ""))
test_data <- mtcars %>%
rownames_to_column() %>%
sample_n(n, replace = TRUE) %>%
mutate_all(~ sample(., length(.))) %>%
mutate(factor1 = sample(some_levels1, n, replace = TRUE),
factor2 = sample(some_levels2, n, replace = TRUE),
text = randomSentences(n, sample(3:8, n, replace = TRUE))
)
Read and Write
Writing the data is easy.
library(arrow)
write_parquet(test_data , "test_data.parquet")
# you can also mess with the compression
write_parquet(test_data, "test_data2.parquet", compress = "gzip", compression_level = 9)
Reading the data is also easy.
read_parquet("test_data.parquet")
# this option will result in lightning fast reads, but in a different format.
read_parquet("test_data2.parquet", as_data_frame = FALSE)
I tested reading this data against a few of the competing options, and did get slightly different results than with the article above, which is expected.
This file is nowhere near as large as the benchmark article, so maybe that is the difference.
Tests
rds: test_data.rds (20.3 MB)
parquet2_native: (14.9 MB with higher compression and as_data_frame = FALSE)
parquet2: test_data2.parquet (14.9 MB with higher compression)
parquet: test_data.parquet (40.7 MB)
fst2: test_data2.fst (27.9 MB with higher compression)
fst: test_data.fst (76.8 MB)
fread2: test_data.csv.gz (23.6MB)
fread: test_data.csv (98.7MB)
feather_arrow: test_data.feather (157.2 MB read with arrow)
feather: test_data.feather (157.2 MB read with feather)
Observations
For this particular file, fread is actually very fast. I like the small file size from the highly compressed parquet2 test. I may invest the time to work with the native data format rather than a data.frame if I really need the speed up.
Here fst is also a great choice. I would either use the highly compressed fst format or the highly compressed parquet depending on if I needed the speed or file size trade off.
A minor additional points worth mentioning. If you have a very large file you can on the fly calculate the number of rows (if no header) using (where bedGraph is the name of your file in your working directory):
>numRow=as.integer(system(paste("wc -l", bedGraph, "| sed 's/[^0-9.]*\\([0-9.]*\\).*/\\1/'"), intern=T))
You can then use that either in read.csv , read.table ...
>system.time((BG=read.table(bedGraph, nrows=numRow, col.names=c('chr', 'start', 'end', 'score'),colClasses=c('character', rep('integer',3)))))
user system elapsed
25.877 0.887 26.752
>object.size(BG)
203949432 bytes
Often times I think it is just good practice to keep larger databases inside a database (e.g. Postgres). I don't use anything too much larger than (nrow * ncol) ncell = 10M, which is pretty small; but I often find I want R to create and hold memory intensive graphs only while I query from multiple databases. In the future of 32 GB laptops, some of these types of memory problems will disappear. But the allure of using a database to hold the data and then using R's memory for the resulting query results and graphs still may be useful. Some advantages are:
(1) The data stays loaded in your database. You simply reconnect in pgadmin to the databases you want when you turn your laptop back on.
(2) It is true R can do many more nifty statistical and graphing operations than SQL. But I think SQL is better designed to query large amounts of data than R.
# Looking at Voter/Registrant Age by Decade
library(RPostgreSQL);library(lattice)
con <- dbConnect(PostgreSQL(), user= "postgres", password="password",
port="2345", host="localhost", dbname="WC2014_08_01_2014")
Decade_BD_1980_42 <- dbGetQuery(con,"Select PrecinctID,Count(PrecinctID),extract(DECADE from Birthdate) from voterdb where extract(DECADE from Birthdate)::numeric > 198 and PrecinctID in (Select * from LD42) Group By PrecinctID,date_part Order by Count DESC;")
Decade_RD_1980_42 <- dbGetQuery(con,"Select PrecinctID,Count(PrecinctID),extract(DECADE from RegistrationDate) from voterdb where extract(DECADE from RegistrationDate)::numeric > 198 and PrecinctID in (Select * from LD42) Group By PrecinctID,date_part Order by Count DESC;")
with(Decade_BD_1980_42,(barchart(~count | as.factor(precinctid))));
mtext("42LD Birthdays later than 1980 by Precinct",side=1,line=0)
with(Decade_RD_1980_42,(barchart(~count | as.factor(precinctid))));
mtext("42LD Registration Dates later than 1980 by Precinct",side=1,line=0)
I wanted to contribute Spark-based solution in the simplest form:
# Test Data ---------------------------------------------------------------
set.seed(123)
bigdf <-
data.frame(
dim = sample(letters, replace = T, 4e7),
fact1 = rnorm(4e7),
fact2 = rnorm(4e7, 20, 50)
)
tmp_csv <- fs::file_temp(pattern = "big_df", ext = ".csv")
readr::write_csv(x = bigdf, file = tmp_csv)
# Spark -------------------------------------------------------------------
# Installing if needed
# sparklyr::spark_available_versions()
# sparklyr::spark_install()
library("sparklyr")
sc <- spark_connect(master = "local")
# Uploading CSV
system.time(tbl_big_df <- spark_read_csv(sc = sc, path = tmp_csv))
Spark generated fairly OK results:
>> system.time(tbl_big_df <- spark_read_csv(sc = sc, path = tmp_csv))
user system elapsed
0.278 0.034 11.747
This was tested on MacBook Pro with 32GB ram.
Remarks
Spark, usually shouldn't be able to "win" against packages optimised for speed. Nevertheless, I wanted to contribute an answer using Spark:
For some of the comments and answers where process didn't work using Spark may be a viable alternative
In a long-run, hammering as much data as possible into data.frame may prove problematic later on, when other operations are attempted on that object and hit the performance envelope of architecture
I think that for questions like that, where the task is to handle 1e7 or more rows Spark should be given considerations. Even if it may be possible to "hammer in" that data into a single data.frame it's just doesn't feel right. Likely that object will be difficult to work with and create problems when deploying models, etc.
Instead of the conventional read.table I feel fread is a faster function.
Specifying additional attributes like select only the required columns, specifying colclasses and string as factors will reduce the time take to import the file.
data_frame <- fread("filename.csv",sep=",",header=FALSE,stringsAsFactors=FALSE,select=c(1,4,5,6,7),colClasses=c("as.numeric","as.character","as.numeric","as.Date","as.Factor"))
I've tried all above and [readr][1] made the best job. I have only 8gb RAM
Loop for 20 files, 5gb each, 7 columns:
read_fwf(arquivos[i],col_types = "ccccccc",fwf_cols(cnpj = c(4,17), nome = c(19,168), cpf = c(169,183), fantasia = c(169,223), sit.cadastral = c(224,225), dt.sitcadastral = c(226,233), cnae = c(376,382)))

Resources