R: DuckDB dbConnect is very slow - Why?

I have a *.csv file containing columnar numbers and strings (13 GB on disk) which I imported into a new duckdb (or sqlite) database and saved so I can access it later from R. But reconnecting seems to duplicate the database and is very slow; am I doing something wrong?
From within R, I am doing the following:
library(duckdb)
library(dplyr)
library(DBI)
#Create the DB
con <- dbConnect(duckdb::duckdb(), "FINAL_data.duckdb")
#Read in the csv
duckdb_read_csv(con, "data", "FINAL_data_new.csv")
I then close R and restart it to see if it has worked:
# This is super slow (about 10 minutes), because it looks like it's writing the DB again somewhere. But why?
con <- dbConnect(duckdb::duckdb(), "FINAL_data.duckdb")
NB: I've added sqlite as a tag because I don't think this is particular to duckdb.

The slowdown you experienced is due to the database checkpointing on startup. This has been fixed on the master branch already.
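In the meantime, a quick way to confirm that the data really persisted (rather than being re-imported) is to query the table after reconnecting. A minimal sketch, assuming the table was registered as "data" as in the question:
library(duckdb)
library(DBI)
con <- dbConnect(duckdb::duckdb(), "FINAL_data.duckdb")
dbGetQuery(con, "SELECT COUNT(*) FROM data")   # the table is already on disk; nothing is re-imported
dbDisconnect(con, shutdown = TRUE)             # shut the database down cleanly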

Related

What is a good trick for loading a .db file into R when it exhausts memory?

I have a .txt file that is roughly 28 GB and 100 million rows. Originally we were using Python with a mix of Pandas and SQLite3 to load this data into a .db file that we could query. However, my team is more familiar with R, so we want to load the .db file into R instead, and that runs into a memory limit. Is there a workaround for this error? Is it possible to partially load some of the data into R?
library(RSQLite)
filename <- "DB NAME.db"
sqlite.driver <- dbDriver("SQLite")
db <- dbConnect(sqlite.driver, dbname = filename)
## Some operations
dbListTables(db)
mytable <- dbReadTable(db,"TABLE NAME")
Error: vector memory exhausted (limit reached?)
You might want to try the vroom package to load the .txt file you mention, which allows you to select which columns to load, thus conserving memory.
filename <- "DB_NAME.txt"
db <- vroom::vroom(filename, col_select = c(col_1,col_2))
Have a look at the vroom vignette document for more information about how memory is conserved.
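If you would rather keep querying the existing .db file instead of going back to the .txt, another option is to let SQLite do the subsetting with dbGetQuery() rather than pulling the whole table with dbReadTable(). A minimal sketch; the table and column names are placeholders:
library(DBI)
library(RSQLite)
db <- dbConnect(SQLite(), dbname = "DB_NAME.db")
# pull only the columns (and, if needed, rows) you actually use
mytable <- dbGetQuery(db, "SELECT col_1, col_2 FROM TABLE_NAME LIMIT 1000000")
dbDisconnect(db)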

Read large csv file from S3 into R

I need to load a 3 GB csv file with about 18 million rows and 7 columns from S3 into R or RStudio respectively. My code for reading data from S3 usually works like this:
library("aws.s3")
obj <- get_object("s3://myBucketName/aFolder/fileName.csv")
csvcharobj <- rawToChar(obj)
con <- textConnection(csvcharobj)
data <- read.csv(file = con)
Now, with the file being much bigger than usual, I receive an error
> csvcharobj <- rawToChar(obj)
Error in rawToChar(obj) : long vectors not supported yet: raw.c:68
Reading this post, I understand that the vector is too long but how would I subset the data in this case? Any other suggestion how to deal with larger files to read from S3?
Building on Hugh's comment on the OP, here is an answer for those wishing to load regular-sized CSVs from S3.
At least as of May 1, 2019, there is an s3read_using() function that allows you to read the object directly out of your bucket.
Thus
data <-
  aws.s3::s3read_using(read.csv, object = "s3://your_bucketname/your_object_name.csv.gz")
will do the trick. However, if you want to make your work run faster and cleaner, I prefer this (fread() comes from data.table and the pipe from magrittr):
library(data.table)
library(magrittr)
data <-
  aws.s3::s3read_using(fread, object = "s3://your_bucketname/your_object_name.csv.gz") %>%
  janitor::clean_names()
Previously the more verbose method below was required:
library(aws.s3)
data <-
  save_object("s3://myBucketName/directoryName/fileName.csv") %>%
  data.table::fread()
It works for files up to at least 305 MB.
A better alternative to filling up your working directory with a copy of every csv you load:
data <-
  save_object("s3://myBucketName/directoryName/fileName.csv",
              file = tempfile(fileext = ".csv")) %>%
  data.table::fread()
If you are curious about where the tempfile is placed, then Sys.getenv() can give some insight; see TMPDIR, TEMP, or TMP. More information can be found in the base R tempfile docs.
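For example, you can check both in the current R session (base R only):
tempdir()                                # the directory tempfile() will use
Sys.getenv(c("TMPDIR", "TEMP", "TMP"))   # the environment variables that influence it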
You can use AWS Athena: mount your S3 files as Athena tables and query only the records you need into R. How to run R with Athena is explained in detail here:
https://aws.amazon.com/blogs/big-data/running-r-on-amazon-athena/
Hope it helps.
If you are on Spark or similar, another workaround would be to
- read/load the csv into a Spark DataFrame, and
- continue processing it with R Server / sparklyr (a sketch follows below).
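A minimal sparklyr sketch of that idea, reusing the bucket path from the question; reading s3a:// paths directly also assumes your Spark/Hadoop installation has the S3 connector and credentials configured:
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")   # or the master URL of your cluster

# the data stays in Spark; only results you collect() come back into R
flights <- spark_read_csv(sc, name = "flights",
                          path = "s3a://myBucketName/aFolder/fileName.csv")
flights %>% count()

spark_disconnect(sc)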

How to use Rsqlite to save and load data in sqlite

I am learning SQLite. I have a big data frame in csv format and I imported it into SQLite:
db <- dbConnect(SQLite(), dbname = "myDB.sqlite")
dbWriteTable(conn = db, name = "myDB", dataframe, overwrite = TRUE,
             row.names = FALSE)
After that, I saw there is a myDB.sqlite in my directory, but with zero bytes. How can I save the data frame in SQLite so that I don't need to write the table every time? Thanks.
It should be written in your db. Like I said before, your code works for me; it's just that the RStudio file viewer doesn't automatically refresh when files have been written to.
Just to be sure that the data was written to your db, try running dbGetQuery(conn = db, "SELECT * FROM myDB"). That should return your data frame.
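A minimal round trip, following the names from the question (dataframe is the data frame being saved), that shows the table really does persist once the connection is closed and reopened:
library(DBI)
library(RSQLite)

db <- dbConnect(SQLite(), dbname = "myDB.sqlite")
dbWriteTable(db, "myDB", dataframe, overwrite = TRUE, row.names = FALSE)
dbDisconnect(db)          # flushes and releases the file; it is no longer zero bytes

db <- dbConnect(SQLite(), dbname = "myDB.sqlite")
head(dbGetQuery(db, "SELECT * FROM myDB"))   # the data is still there
dbDisconnect(db)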

integrating hadoop, revo-scaleR and hive

I have a requirement to fetch data from HIVE tables into a csv file and use it in RevoScaleR.
Currently we pull the data from Hive and manually put it into a file on the Unix file system for ad hoc analysis; however, the requirement is to redirect the result directly into an HDFS location and use RevoScaleR from there.
How do I do that, or what sort of connection do I need to establish for this?
If I understand your question correctly, you could use a RevoScaleR ODBC connection to import the Hive table and do further analysis from there.
Here is an example using the Hortonworks-provided ODBC driver:
OdbcConnString <- "DSN=Sample Hortonworks Hive DSN"
odbcDS <- RxOdbcData(sqlQuery = "SELECT * FROM airline",
                     connectionString = OdbcConnString,
                     stringsAsFactors = TRUE,
                     useFastRead = TRUE,
                     rowsPerRead = 150000)
xdfFile <- "airlineHWS.xdf"
if (file.exists(xdfFile)) file.remove(xdfFile)
Flights <- rxImport(odbcDS, outFile = xdfFile, overwrite = TRUE)
rxGetInfo(data = "airlineHWS.xdf", getVarInfo = TRUE, numRows = 10)
Chenwei's approach is OK, but there is one problem: the data is temporarily stored in memory as a data frame in the odbcDS object. If we have a huge table in Hive, that won't work.
I would suggest keeping everything on disk by using external tables in Hive and then using the backend data directly in Revolution R.
Something along these lines:
Create an external table from the existing Hive table in text-file (csv, tab, etc.) format.
CREATE EXTERNAL TABLE ext_table
LIKE your_original_table_name
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/your/hdfs/location';
Here we are creating an external table which is stored as a csv file in HDFS.
Next, copy the original table into the external table using the INSERT OVERWRITE command:
INSERT OVERWRITE TABLE ext_table SELECT * FROM your_original_table_name;
If we want to check the backend data on HDFS, type:
hadoop fs -ls /your/hdfs/location/
We can see the part files stored at that location. Go ahead and cat them to be doubly sure.
Now we can use the RxTextData function to read the data from the step above:
hive_data <- RxTextData(file='/your/hdfs/location/', delimiter = ',')
Now you can create an xdf file using hive_data as the input for an RxXdfData output, which is more efficient for further processing; above all, the full data set has never touched memory.
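A minimal sketch of that last step, keeping the same placeholder HDFS path; the fileSystem = RxHdfsFileSystem() arguments are my addition so the paths resolve against HDFS rather than the local file system:
# read the delimited part files that Hive wrote to HDFS
hive_data <- RxTextData(file = "/your/hdfs/location/", delimiter = ",",
                        fileSystem = RxHdfsFileSystem())
# import them into an xdf file for efficient repeated analysis
xdf_out <- RxXdfData("hive_data.xdf", fileSystem = RxHdfsFileSystem())
rxImport(inData = hive_data, outFile = xdf_out, overwrite = TRUE)
rxGetInfo(xdf_out, getVarInfo = TRUE, numRows = 5)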

Improving RODBC-Postgres Write Performance

I've recently begun using RODBC to connect to PostgreSQL as I couldn't get RPostgreSQL to compile and run in Windows x64. I've found that read performance is similar between the two packages, but write performance is not. For example, using RODBC (where z is a ~6.1M row dataframe):
library(RODBC)
con <- odbcConnect("PostgreSQL84")
#autoCommit=FALSE seems to speed things up
odbcSetAutoCommit(con, autoCommit = FALSE)
system.time(sqlSave(con, z, "ERASE111", fast = TRUE))
   user  system elapsed
 275.34  369.86 1979.59
odbcEndTran(con, commit = TRUE)
odbcCloseAll()
Whereas for the same ~6.1M row dataframe using RPostgreSQL (under 32-bit):
library(RPostgreSQL)
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname="gisdb", user="postgres", password="...")
system.time(dbWriteTable(con, "ERASE222", z))
   user  system elapsed
 467.57   56.62  668.29
dbDisconnect(con)
So, in this test, RPostgreSQL is about 3X as fast as RODBC in writing tables. This performance ratio seems to stay more-or-less constant regardless of the number of rows in the dataframe (but the number of columns has far less effect). I do notice that RPostgreSQL uses something like COPY <table> FROM STDIN while RODBC issues a bunch of INSERT INTO <table> (columns...) VALUES (...) queries. I also notice that RODBC seems to choose int8 for integers, while RPostgreSQL chooses int4 where appropriate.
I need to do this kind of dataframe copy often, so I would very sincerely appreciate any advice on speeding up RODBC. For example, is this just inherent in ODBC, or am I not calling it properly?
It seems there is no immediate answer to this, so I'll post a kludgy workaround in case it is helpful for anyone.
Sharpie is correct: COPY FROM is by far the fastest way to get data into Postgres. Based on his suggestion, I've hacked together a function that gives a significant performance boost over RODBC::sqlSave(). For example, writing a 1.1 million row (by 24 column) dataframe took 960 seconds (elapsed) via sqlSave vs. 69 seconds using the function below. I wouldn't have expected this, since the data are written once to disk and then again to the db.
library(RODBC)
con <- odbcConnect("PostgreSQL90")

# create the table
createTab <- function(dat, datname) {
  # make an empty table, saving the trouble of making it by hand
  res <- sqlSave(con, dat[1, ], datname)
  res <- sqlQuery(con, paste("TRUNCATE TABLE", datname))
  # write the dataframe
  outfile <- paste(datname, ".csv", sep = "")
  write.csv(dat, outfile)
  gc()   # don't know why, but memory is not released after writing a large csv?
  # now copy the data into the table. If this doesn't work,
  # be sure that postgres has read permissions for the path
  sqlQuery(con,
           paste("COPY ", datname, " FROM '",
                 getwd(), "/", datname,
                 ".csv' WITH NULL AS 'NA' DELIMITER ',' CSV HEADER;",
                 sep = ""))
  unlink(outfile)
}
odbcClose(con)
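A usage sketch under the same assumptions (z is the dataframe to be written; the table name is illustrative):
con <- odbcConnect("PostgreSQL90")
createTab(z, "erase333")
odbcClose(con)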
