base R faster than readr for reading multiple CSV files - r

There is a lot of documentation on how to read multiple CSVs and bind them into one data frame. I have 5000+ CSV files I need to read in and bind into one data structure.
In particular I've followed the discussion here: Issue in Loading multiple .csv files into single dataframe in R using rbind
The weird thing is that base R is much faster than any other solution I've tried.
Here's what my CSV looks like:
> head(PT)
Line Timestamp Lane.01 Lane.02 Lane.03 Lane.04 Lane.05 Lane.06 Lane.07 Lane.08
1 PL1 05-Jan-16 07:17:36 NA NA NA NA NA NA NA NA
2 PL1 05-Jan-16 07:22:38 NA NA NA NA NA NA NA NA
3 PL1 05-Jan-16 07:27:41 NA NA NA NA NA NA NA NA
4 PL1 05-Jan-16 07:32:43 9.98 10.36 10.41 10.16 10.10 9.97 10.07 9.59
5 PL1 05-Jan-16 07:37:45 9.65 8.87 9.88 9.86 8.85 8.75 9.19 8.51
6 PL1 05-Jan-16 07:42:47 9.14 8.98 9.29 9.04 9.01 9.06 9.12 9.08
I've created three methods for reading in and binding the data. The files are located in a separate directory which I define as:
dataPath <- "data"
PTfiles <- list.files(path=dataPath, full.names = TRUE)
Method 1: Base R
classes <- c("factor", "character", rep("numeric",8))
# build function to load data
load_data <- function(dataPath, classes) {
tables <- lapply(PTfiles, read.csv, colClasses=classes, na.strings=c("NA", ""))
do.call(rbind, tables)
}
#clock
method1 <- system.time(
PT <- load_data(path, classes)
)
Method 2: read_csv
In this case I created a wrapper function for read_csv to use
#create wrapper function for read_csv
read_csv.wrap <- function(x) { read_csv(x, skip = 1, na=c("NA", ""),
col_names = c("tool", "timestamp", paste("lane", 1:8, sep="")),
col_types =
cols(
tool = col_character(),
timestamp = col_character(),
lane1 = col_double(),
lane2 = col_double(),
lane3 = col_double(),
lane4 = col_double(),
lane5 = col_double(),
lane6 = col_double(),
lane7 = col_double(),
lane8 = col_double()
)
)
}
##
# Same as method 1, just uses read_csv instead of read.csv
load_data2 <- function(dataPath) {
tables <- lapply(PTfiles, read_csv.wrap)
do.call(rbind, tables)
}
#clock
method2 <- system.time(
PT2 <- load_data2(path)
)
Method 3: read_csv + dplyr::bind_rows
load_data3 <- function(dataPath) {
tables <- lapply(PTfiles, read_csv.wrap)
dplyr::bind_rows(tables)
}
#clock
method3 <- system.time(
PT3 <- load_data3(path)
)
What I can't figure out, is why read_csv and dplyr methods are slower for elapsed time when they should be faster. The CPU time is decreased, but why would the elapsed time (file system) increase? What's going on here?
Edit - I added the data.table method as suggested in the comments
Method 4 data.table
library(data.table)
load_data4 <- function(dataPath){
tables <- lapply(PTfiles, fread)
rbindlist(tables)
}
method4 <- system.time(
PT4 <- load_data4(path)
)
The data.table method is the fastest from a CPU standpoint. But the question still stands on what is going on with the read_csv methods that makes them so slow.
> rbind(method1, method2, method3, method4)
user.self sys.self elapsed
method1 0.56 0.39 1.35
method2 0.42 1.98 13.96
method3 0.36 2.25 14.69
method4 0.34 0.67 1.74

I would do that in the terminal(Unix). I would put all files int the same folder and then navigate to that folder (in terminal), the use the following command to create only one CSV file:
cat *.csv > merged_csv_file.csv
One observation regarding this method is that the header of each file will show up in the middle of the observations. To solve this I would suggest you do:
Get just the header from the first file
head -2 file1.csv > merged_csv_file.csv
then skip the first "X" lines from the other files, with the folling command, where "X" is the number of lines to skip.
tail -n +3 -q file*.csv >> merged_csv_file.csv
-n +3 makes tail print lines from 3rd to the end, -q tells it not to print the header with the file name (read man), >> adds to the file, not overwrites it as >.

I might have found a related issue. I am reading in nested CSV data from some simulation output, where multiple columns have CSV formatted data as elements, which I need to unnest and reshape for analysis.
With simulations where I have many runs, this resulted in thousands of elements that needed to be parsed. Using map(.,read_csv) this would take hours to transform. When I rewrote my script to apply read.csv in a lambda function, the operation would complete in seconds.
I'm curious if there is some intermediate system I/O operation or error handling that creates a bottleneck you wouldn't run into with a single input file.

Related

R: read_csv reads numeric entries as logical - parsing col_logical instead of col_double

I am new to R.
I wrote a code for an assignment which reads several csv files and binds it into a data frame and then according to the id, calculates the mean of either nitrate or sulfate.
Data sample:
Date sulfate nitrate ID
<date> <dbl> <dbl> <dbl>
1 2003-10-06 7.21 0.651 1
2 2003-10-12 5.99 0.428 1
3 2003-10-18 4.68 1.04 1
4 2003-10-24 3.47 0.363 1
5 2003-10-30 2.42 0.507 1
6 2003-11-11 1.43 0.474 1
...
To read the files and create a data.frame, I wrote this function:
pollutantmean <- function (pollutant, id = 1:332) {
#creating a data frame from several files
file_m <- list.files(path = "specdata", pattern = "*.csv", full.names = TRUE)
read_file_m <- lapply(file_m, read_csv)
df_1 <- bind_rows(read_file_m)
# delete NAs
df_clean <- df_1[complete.cases(df_1),]
#select rows according to id
df_asid_clean <- filter(df_clean, ID %in% id)
#count the mean of the column
mean_result <- mean(df_asid_clean[, pollutant])
mean_result
However, when the read_csv function is applied, certain entries in nitrate column are read as col_logical, although the whole class of the column remains numeric and the entries are numeric. It seems that the code "expects" to receive logical value, although the real value is not.
Throughout the reading I get this message:
<...>
Parsed with column specification:
cols(
Date = col_date(format = ""),
sulfate = col_double(),
nitrate = col_logical(),
ID = col_double()
)
Warning: 41 parsing failures.
row col expected actual file
2055 nitrate 1/0/T/F/TRUE/FALSE 0.383 'specdata/288.csv'
2067 nitrate 1/0/T/F/TRUE/FALSE 0.355 'specdata/288.csv'
2073 nitrate 1/0/T/F/TRUE/FALSE 0.469 'specdata/288.csv'
2085 nitrate 1/0/T/F/TRUE/FALSE 0.144 'specdata/288.csv'
2091 nitrate 1/0/T/F/TRUE/FALSE 0.0984 'specdata/288.csv'
.... ....... .................. ...... ..................
See problems(...) for more details.
I tried to change the column class by writing
df_1[,nitrate] <- as.numeric(as.character(df_1[, nitrate])
, after binding rows, but it only shows that NAs are again introduced in step which calculates the mean.
What is wrong here, and how could I solve it?
Would appreciate your help!
UPDATE: tried to insert read_csv(col_types = list...), but I get "files" argument is not defined. As I understand, the R reads inside read_csv first, then lapply and because there is not "file" given at the time, it shows error.
The problem with readr::read_csv() failure in parsing the column types can be overcome by passing a col_types= argument in lapply(). We do this as follows:
pollutantmean <- function (directory,pollutant,id=1:332){
require(readr)
require(dplyr)
file_m <- list.files(path = directory, pattern = "*.csv", full.names = TRUE)[id]
read_file_m <- lapply(file_m, read_csv,col_types=list(col_date(),col_double(),
col_double(),col_integer()))
# rest of code goes here. Since I am a Community Mentor in the
# JHU Data Science Specialization, I am not allowed to post
# a complete solution to the programming assignment
}
Note that I use the [ form of the extract operator to subset the list of file names with the id vector that is an argument to the function, which avoids reading a lot of data that isn't necessary. This eliminates the need for the filter() statement in the code posted in the question.
With some additional programming statements to complete the assignment, the code in my answer produces the correct results for the three examples posted with the assignment, as listed below.
> pollutantmean("specdata","sulfate",1:10)
[1] 4.064128
> pollutantmean("specdata", "nitrate", 70:72)
[1] 1.706047
> pollutantmean("specdata", "nitrate", 23)
[1] 1.280833
Alternately we could implement lapply() with an anonymous function that also uses read_csv() as follows:
read_file_m <- lapply(file_m, function(x) {read_csv(x,col_types=list(col_date(),col_double(),
col_double(),col_integer()))})
NOTE: while it is completely understandable that students who have been exposed to the tidyverse would like to use it for the programming assignment, the fact that dplyr isn't introduced until the next course in the sequence (and readr isn't covered at all) makes it much more difficult to use for assignments in R Programming, especially the first assignment, where dplyr non-standard evaluation gives people fits. An example of this situation is yet another Stackoverflow question on pollutantmean().
With the read_csv update you don't need lapply, you can simply pass along the file path directly to read_csv as you already have defined.
Regarding the column types this can then be sen manually in the col_type argument:
col_type=cols(Date-col_date,sulfate=...)

Fastest way to upload data via R to PostgresSQL 12

I am using the following code to connect to a PostgreSQL 12 database:
con <- DBI::dbConnect(odbc::odbc(), driver, server, database, uid, pwd, port)
This connects me to a PostgreSQL 12 database on Google Cloud SQL. The following code is then used to upload data:
DBI::dbCreateTable(con, tablename, df)
DBI::dbAppendTable(con, tablename, df)
where df is a data frame I have created in R. The data frame consists of ~ 550,000 records totaling 713 MB of data.
When uploaded by the above method, it took approximately 9 hours at a rate of 40 write operations/second. Is there a faster way to upload this data into my PostgreSQL database, preferably through R?
I've always found bulk-copy to be the best, external to R. The insert can be significantly faster, and your overhead is (1) writing to file, and (2) the shorter run-time.
Setup for this test:
win10 (2004)
docker
postgres:11 container, using port 35432 on localhost, simple authentication
a psql binary in the host OS (where R is running); should be easy with linux, with windows I grabbed the "zip" (not installer) file from https://www.postgresql.org/download/windows/ and extracted what I needed
I'm using data.table::fwrite to save the file because it's fast; in this case write.table and write.csv are still much faster than using DBI::dbWriteTable, but with your size of data you might prefer something quick
DBI::dbCreateTable(con2, "mt", mtcars)
DBI::dbGetQuery(con2, "select count(*) as n from mt")
# n
# 1 0
z1000 <- data.table::rbindlist(replicate(1000, mtcars, simplify=F))
nrow(z1000)
# [1] 32000
system.time({
DBI::dbWriteTable(con2, "mt", z1000, create = FALSE, append = TRUE)
})
# user system elapsed
# 1.56 1.09 30.90
system.time({
data.table::fwrite(z1000, "mt.csv")
URI <- sprintf("postgresql://%s:%s#%s:%s", "postgres", "mysecretpassword", "127.0.0.1", "35432")
system(
sprintf("psql.exe -U postgres -c \"\\copy %s (%s) from %s (FORMAT CSV, HEADER)\" %s",
"mt", paste(colnames(z1000), collapse = ","),
sQuote("mt.csv"), URI)
)
})
# COPY 32000
# user system elapsed
# 0.05 0.00 0.19
DBI::dbGetQuery(con2, "select count(*) as n from mt")
# n
# 1 64000
While this is a lot smaller than your data (32K rows, 11 columns, 1.3MB of data), a speedup from 30 seconds to less than 1 second cannot be ignored.
Side note: there is also a sizable difference between dbAppendTable (slow) and dbWriteTable. Comparing psql and those two functions:
z100 <- rbindlist(replicate(100, mtcars, simplify=F))
system.time({
data.table::fwrite(z100, "mt.csv")
URI <- sprintf("postgresql://%s:%s#%s:%s", "postgres", "mysecretpassword", "127.0.0.1", "35432")
system(
sprintf("/Users/r2/bin/psql -U postgres -c \"\\copy %s (%s) from %s (FORMAT CSV, HEADER)\" %s",
"mt", paste(colnames(z100), collapse = ","),
sQuote("mt.csv"), URI)
)
})
# COPY 3200
# user system elapsed
# 0.0 0.0 0.1
system.time({
DBI::dbWriteTable(con2, "mt", z100, create = FALSE, append = TRUE)
})
# user system elapsed
# 0.17 0.04 2.95
system.time({
DBI::dbAppendTable(con2, "mt", z100, create = FALSE, append = TRUE)
})
# user system elapsed
# 0.74 0.33 23.59
(I don't want to time dbAppendTable with z1000 above ...)
(For kicks, I ran it with replicate(10000, ...) and ran the psql and dbWriteTable tests again, and they took 2 seconds and 372 seconds, respectively. Your choice :-) ... now I have over 650,000 rows of mtcars ... hrmph ... drop table mt ...
I suspect that dbAppendTable results in an INSERT statement per row, which can take a long time for high numbers of rows.
However, you can generate a single INSERT statement for the entire data frame using the sqlAppendTable function and run it by using dbSendQuery explicitly:
res <- DBI::dbSendQuery(con, DBI::sqlAppendTable(con, tablename, df, row.names=FALSE))
DBI::dbClearResult(res)
For me, this was much faster: a 30 second ingest reduced to a 0.5 second ingest.

R: Read multiple files and label them based on the file name

I have a folder with about 400 files with the same structure. Each of these file contains 4 columns with no header, corresponding to 4 climate variables. I would need to include two new columns in each of these files, based on the name of the file. The structure of the name is MeteoData_PXCY, withX=CODE_PLOT and Y=CODE_COUNTRY. Once I have these two new columns I need to read all the files in one single dataset, and aggregate grouping by CODE_PLOT and CODE_COUNTRY to calculate mean values. Hence, the final output is 400 rows, one row per CODE_PLOT and CODE_COUNTRY.
Example file MeteoData_P1C1.csv
32509 33.91 2.9155 4494.5 13.46
32540 63.03 3.9718 6520.8 25.12
32568 71.68 8.7874 11587 58.67
32599 116.38 7.8683 13286 62.58
32629 31.12 16.097 23555 135.35
32660 56.56 16.481 21886 130.24
32690 68.59 19.737 21677 141.15
32721 55.55 18.755 18830 117.39
32752 59.88 15.598 13579 81.06
32782 43.43 12.361 8622.2 54.57
Example MeteoData_P109C19.csv
32509 18.17 -0.70355 1413.5 9.93
32540 78 -0.43607 3574.6 10.46
32568 74.43 0.38645 7478.5 22.53
32599 73.19 2.5743 12352 42.85
32629 36.75 9.4852 21244 105.57
32660 61.65 13.753 21586 117.3
32690 86.16 15.991 20452 127.89
32721 98.02 12.713 13981 76.73
32752 32.14 9.9547 10850 53.13
32782 53.46 4.4252 5041.7 21.46
In the final output I should have this structure (without “;”):
Date; Precip; Temp; Rad; Pet; CODE_PLOT; CODE_COUNTRY
32540; 63.03; 3.9718; 6520.8; 25.12; 1; 1
32568; 71.68; 8.7874; 11587; 58.67; 9; 19
For the moment, I have:
setwd("MeteoData”) # Folder in which all the files are into
filenames <- list.files(pattern=".csv")
clim <- lapply(filenames, function(x) read.csv(file=x, header=FALSE))
You could put all your files in a new folder/directory, and then create a loop using list.files:
all.dfs <- list()
for(filename in list.files("some_dir")) {
all.dfs[[length(all.dfs) + 1]] <- read.table(filename, ...)
# put in read.table call the appropriate arguments, including column names for the existing data in the files
all.dfs[[length(all.dfs)]]$CODE_PLOT <- sub(".*P(\\d*)C(\\d*)\\.csv", "\\1", filename)
all.dfs[[length(all.dfs)]]$CODE_COUNTRY <- sub(".*P(\\d*)C(\\d*)\\.csv", "\\2", filename)
}
Then merging everything into one dataframe...
big.df <- do.call(rbind, all.dfs)
Haven't tested it but feel free to ask questions in comment.

Download VIX futures prices from CBOE

I am trying to get historical prices for VIX futures by downloading all the CSV files on this page (http://cfe.cboe.com/Products/historicalVIX.aspx). Here is the code I am using to do this:
library(XML)
#Extract all links for url
url <- "http://cfe.cboe.com/Products/historicalVIX.aspx"
doc <- htmlParse(url)
links <- xpathSApply(doc, "//a/#href")
free(doc)
#Filter out URLs ending with csv and complete the link.
links <- links[substr(links, nchar(links) - 2, nchar(links)) == "csv"]
links <- paste("http://cfe.cboe.com", links, sep="")
#Peform read.csv on each url in links, skipping the first two URLs as they are not relevant.
c <- lapply(links[-(1:2)], read.csv, header = TRUE)
I get the error:
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
more columns than column names
Upon further investigation, I realize this is because some of the CSV files are formatted differently. If I load the URL links[9] manually, I see that the first row has this disclaimer:
CFE data is compiled for the .......use of CFE data is subject to the Terms and Conditions of CBOE's Websites.
Most of the other files (e.g.links[8] and links[10]) are fine so it seems this has been randomly inserted. Is there some R magic that can be done to handle this?
Thank you.
I have a getSymbols.cfe method in my qmao package (for the getSymbols function in quantmod package) that will make this a lot easier.
#install.packages('qmao', repos='http://r-forge.r-project.org')
library(qmao)
This is from the examples section of ?getSymbols.cfe (please read the help page as the function has a few arguments that you may want to be different than the defaults)
getSymbols(c("VX_U11", "VX_V11"),src='cfe')
#all contracts expiring in 2010 and 2011.
getSymbols("VX",Months=1:12,Years=2010:2011,src='cfe')
#getSymbols("VX",Months=1:12,Years=10:11,src='cfe') #same
And it's not just for VIX
getSymbols(c("VM","GV"),src='cfe') #The mini-VIX and Gold vol contracts expiring this month
If you're not familiar with getSymbols, by default it stores the data in your .GlobalEnv and return the name of the object that was saved.
> getSymbols("VX_Z12", src='cfe')
[1] "VX_Z12"
> tail(VX_Z12)
VX_Z12.Open VX_Z12.High VX_Z12.Low VX_Z12.Close VX_Z12.Settle VX_Z12.Change VX_Z12.Volume VX_Z12.EFP VX_Z12.OpInt
2012-10-26 19.20 19.35 18.62 18.87 18.9 0.0 22043 15 71114
2012-10-31 18.55 19.50 18.51 19.46 19.5 0.6 46405 319 89674
2012-11-01 19.35 19.35 17.75 17.87 17.9 -1.6 40609 2046 95720
2012-11-02 17.90 18.65 17.55 18.57 18.6 0.7 42592 1155 100691
2012-11-05 18.60 20.15 18.43 18.86 18.9 0.3 28136 110 102746
2012-11-06 18.70 18.85 17.75 18.06 18.1 -0.8 35599 851 110638
Edit
I see now that I did not answer your question, but rather pointed you to another way to get the same error! A simple way to make your code work, is to make a wrapper for read.csv that uses readLines to see if the first row contains the disclaimer; if it does, skip the the first row, otherwise use read.csv as normal.
myRead.csv <- function(x, ...) {
if (grepl("Terms and Conditions", readLines(x, 1))) { #is the first row the disclaimer?
read.csv(x, skip=1, ...)
} else read.csv(x, ...)
}
L <- lapply(links[-(1:2)], myRead.csv, header = TRUE)
I also applied that patch to getSymbols.cfe. You can get the latest version of qmao (1.3.11) using svn checkout (see this post if you need help with that), or, you can wait until R-Forge builds it for you which usually happens pretty quickly, but could take up to a couple of days.

Extracting outputs from lapply to a dataframe

I have some R code which performs some data extraction operation on all files in the current directory, using the following code:
files <- list.files(".", pattern="*.tts")
results <- lapply(files, data_for_time, "17/06/2006 12:00:00")
The output from lapply is the following (extracted using dput()) - basically a list full of vectors:
list(c("amer", "14.5"), c("appl", "14.2"), c("brec", "13.1"),
c("camb", "13.5"), c("camo", "30.1"), c("cari", "13.8"),
c("chio", "21.1"), c("dung", "9.4"), c("east", "11.8"), c("exmo",
"12.1"), c("farb", "14.7"), c("hard", "15.6"), c("herm",
"24.3"), c("hero", "13.3"), c("hert", "11.8"), c("hung",
"26"), c("lizr", "14"), c("maid", "30.4"), c("mart", "8.8"
), c("newb", "14.7"), c("newl", "14.3"), c("oxfr", "13.9"
), c("padt", "10.3"), c("pbil", "13.6"), c("pmtg", "11.1"
), c("pmth", "11.7"), c("pool", "14.6"), c("prae", "11.9"
), c("ral2", "12.2"), c("sano", "15.3"), c("scil", "36.2"
), c("sham", "12.9"), c("stra", "30.9"), c("stro", "14.7"
), c("taut", "13.7"), c("tedd", "22.3"), c("wari", "12.7"
), c("weiw", "13.6"), c("weyb", "8.4"))
However, I would like to then deal with this output as a dataframe with two columns: one for the alphabetic code ("amer", "appl" etc) and one for the number (14.5, 14.2 etc).
Unfortunately, as.data.frame doesn't seem to work with this input of nested vectors inside a list. How should I go about converting this? Do I need to change the way that my function data_for_time returns its values? At the moment it just returns c(name, value). Or is there a nice way to convert from this sort of output to a dataframe?
Try this if results were your list:
> as.data.frame(do.call(rbind, results))
V1 V2
1 amer 14.5
2 appl 14.2
3 brec 13.1
4 camb 13.5
...
One option might be to use the ldply function from the plyr package, which will stitch things back into a data frame for you.
A trivial example of it's use:
ldply(1:10,.fun = function(x){c(runif(1),"a")})
V1 V2
1 0.406373084755614 a
2 0.456838687881827 a
3 0.681300171650946 a
4 0.294320539338514 a
5 0.811559669673443 a
6 0.340881009353325 a
7 0.134072444401681 a
8 0.00850683846510947 a
9 0.326008745934814 a
10 0.90791508089751 a
But note that if you're mixing variable types with c(), you probably will want to alter your function to return simply data.frame(name= name,value = value) instead of c(name,value). Otherwise everything will be coerced to character (as it is in my example above).
inp <- list(c("amer", "14.5"), c("appl", "14.2"), .... # did not see need to copy all
data.frame( first= sapply( inp, "[", 1),
second =as.numeric( sapply( inp, "[", 2) ) )
first second
1 amer 14.5
2 appl 14.2
3 brec 13.1
4 camb 13.5
5 camo 30.1
6 cari 13.8
snipped output
Because and forNelton took the response I was in the process of giving and Joran took the only other reasonable response I could think of and since I'm supposed to be writing a paper here's a ridiculous answer:
#I named your list LIST
LIST2 <- LIST[[1]]
lapply(2:length(LIST), function(i) {LIST2 <<- rbind(LIST2, LIST[[i]])})
data.frame(LIST2)

Resources