RSQLite dbWriteTable not working on large data - r

Here is my code, where I am trying to write data from R to an SQLite database file.
library(DBI)
library(RSQLite)
library(dplyr)
library(data.table)
library(readr)   # for read_rds()

con <- dbConnect(RSQLite::SQLite(), "data.sqlite")

### Read the file you want to load to the SQLite database
data <- read_rds("data.rds")

dbSafeNames <- function(names) {
  names <- gsub('[^a-z0-9]+', '_', tolower(names))
  names <- make.names(names, unique = TRUE, allow_ = TRUE)
  names <- gsub('.', '_', names, fixed = TRUE)
  names
}
colnames(data) <- dbSafeNames(colnames(data))

### Load the dataset to the SQLite database
dbWriteTable(conn = con, name = "data", value = data, row.names = FALSE, header = TRUE)
While writing the roughly 80 GB of data, I see the size of data.sqlite grow to about 45 GB; then it stops and throws the following errors.
Error in rsqlite_send_query(conn@ptr, statement) : disk I/O error
Error in rsqlite_send_query(conn@ptr, statement) :
  no such savepoint: dbWriteTable
What is the fix, and what should I do? If this is a limitation of RSQLite, please suggest a more robust way to create the database, such as RMySQL or RPostgreSQL.
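Not part of the original question: SQLite's "disk I/O error" often means the disk holding the database file, or the directory SQLite uses for temporary journal/spill files (TMPDIR / SQLITE_TMPDIR on Unix-like systems), ran out of space mid-write, so free space is worth checking first. If space is not the issue, one hedged workaround is to write the table in chunks rather than in one enormous transaction; a minimal sketch, assuming `data` and `con` exist as in the question and that a chunk size of one million rows suits your memory:

# Chunked-write sketch (assumes `data` and `con` from the code above).
# The chunk size of 1e6 rows is an arbitrary assumption; tune it to your RAM.
chunk_size <- 1e6
n <- nrow(data)
starts <- seq(1, n, by = chunk_size)

for (s in starts) {
  e <- min(s + chunk_size - 1, n)
  dbWriteTable(con, "data", data[s:e, , drop = FALSE],
               overwrite = (s == 1),   # replace any partial table from a failed run
               append    = (s > 1),
               row.names = FALSE)
}

dbDisconnect(con)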

Related

dbGetQuery throttles when the limit is removed for a very large table in a hive database

All,
I'm trying to use the packages RJDBC, rJava, and DBI in R to extract data from a big Hive table sitting in a MapR Hive/Hadoop cluster on a remote Linux machine.
I don't have any issues connecting to the Hive cluster. The table I'm trying to extract data from, table1, is about 500 million rows x 16 columns.
This is the code:
options(java.parameters = "-Xmx32g")
library(RJDBC)
library(DBI)
library(rJava)

hive_dir <- "/opt/mapr/hive/hive-0.13/lib"
.jinit()
.jaddClassPath(paste(hive_dir, "hadoop-0.20.2-core.jar", sep = "/"))
.jaddClassPath(c(list.files("/opt/mapr/hadoop/hadoop-0.20.2/lib", pattern = "jar$", full.names = TRUE),
                 list.files("/opt/mapr/hive/hive-0.13/lib", pattern = "jar$", full.names = TRUE),
                 list.files("/mapr/hadoop-dir/user/userid/lib", pattern = "jar$", full.names = TRUE)))

drv <- JDBC("org.apache.hive.jdbc.HiveDriver", "hive-jdbc-0.13.0-mapr-1504.jar", identifier.quote = "`")
hive.master <- "xx.xxx.xxx.xxx:10000"
url.dbc <- paste0("jdbc:hive2://", hive.master)
conn <- dbConnect(drv, url.dbc, "user1", "xxxxxxxx")
dbSendUpdate(conn, "set hive.resultset.use.unique.column.names=false")
df <- dbGetQuery(conn, "select * from dbname.table1 limit 1000000")  # basically 1 million rows
The above works perfectly, and the df data.frame contains exactly what I want. However, if I remove the limit from the last statement, I get an error:
df <- dbGetQuery(conn, "select * from dbname.table1 ")
Error:
Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", : Unable to retrieve JDBC result set for select * from dbname.table1 (Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask)
I googled the last part of the error (Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask) and tried putting these two statements before the dbGetQuery, but I couldn't get rid of the error.
dbSendUpdate(conn, "SET hive.auto.convert.join=false")
dbSendUpdate(conn, "SET hive.auto.convert.join.noconditionaltask=false")
Does anybody have any idea why I'm getting the error when I remove the limit from my select statement? It even works for 120 million rows with the limit, but it takes a long time. At this point, the time taken is not a concern for me.
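Not from the original post: the return-code-1 failure is reported by the Hive/MapReduce job itself, so it may not be fixable from the R side at all. Still, one way to avoid materialising 500 million rows in a single JDBC result set is to stream the result in batches. A rough sketch, assuming the `conn` object from the code above; the batch size of one million rows is an arbitrary assumption:

res <- dbSendQuery(conn, "select * from dbname.table1")
repeat {
  chunk <- fetch(res, n = 1000000)   # pull up to 1 million rows per batch
  if (nrow(chunk) == 0) break
  # process or persist each chunk here, e.g. append it to a local file or database
}
dbClearResult(res)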

Using R, Error in rsqlite_send_query(conn@ptr, statement) : too many SQL variables

I use the sqldf package to make an SQLite database; my matrix has dimensions 2880 x 1951. When I write the table to the SQLite database, it fails with Error in rsqlite_send_query(conn@ptr, statement) : too many SQL variables. I read on the SQLite website that the number of variables is limited to 999. Is there a simple way to increase this value?
Here is my syntax:
library(RSQLite)

db <- dbConnect(SQLite(), dbname = "xxx.sqlite")
bunch_vis <- read.csv("xxx.csv")
dbWriteTable(db, name = "xxx", value = bunch_vis,
             row.names = FALSE, header = TRUE)
and the output:
Error in rsqlite_send_query(conn@ptr, statement) : too many SQL variables
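Not from the original post: the limit applies per prepared INSERT, which binds one variable per column, so a 1951-column table exceeds SQLite's default 999-variable cap mentioned in the question. Rather than recompiling SQLite, one hedged workaround is to store the matrix in long format so each insert binds only three variables; a minimal sketch, assuming `db` and `bunch_vis` from the code above (the table name xxx_long is made up):

m <- as.matrix(bunch_vis)
long <- data.frame(
  row    = rep(seq_len(nrow(m)), times = ncol(m)),   # row index, column-major order
  column = rep(colnames(m), each = nrow(m)),         # original column name
  value  = as.vector(m),
  stringsAsFactors = FALSE
)
dbWriteTable(db, "xxx_long", long, row.names = FALSE)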

Writing an R dataset to Oracle with very large fields (8000 char)

I need to write an R dataset to an Oracle database using the R package ROracle version 1.3-1, R version 3.4.1, and an Oracle OraClient 11g home, and I am new to R.
The dataset includes variables of several different data types and lengths, including several character columns up to 8000 characters long.
Using dbWriteTable
dbWriteTable(conn, "OracleTableName", df)
I get this error:
Error in .oci.WriteTable(conn, name, value, row.names = row.names,
overwrite = overwrite, :
Error in .oci.GetQuery(con, stmt, data = value) :
ORA-01461: can bind a LONG value only for insert into a LONG column
or this
Error in .oci.GetQuery(con, stmt) :
ORA-02263: need to specify the datatype for this column
I also tried this:
drv <- dbDriver("Oracle")
conn <- dbConnect(
  drv,
  username = "username",
  password = "password",
  dbname = "dbname")

test.df1 <- subset(
  df, select = c(
    Var, Var2, Var3,
    Var4, Var5, Var6))
dat <- as.character(test.df1)
attr(dat, "ora.type") <- "clob"
dbWriteTable(conn, "test2", dat)
which returns this:
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘dbWriteTable’ for
signature ‘"OraConnection", "character", "character"’
From researching the error, it appears that the first error indicates that the larger fields (the BLOB/CLOB fields) are not being recognized as such by Oracle.
The documentation indicates that ROracle version 1.3-1 should be able to handle larger datatypes. It suggests using attributes to map NCHAR, CLOB, BLOB, and NCLOB columns correctly in dbWriteTable. I have not been able to follow that example successfully, as I keep getting the same error. Perhaps I just need a different example than the one provided in the documentation?
Initially, I was using the RODBC package, but found that RODBC is known not to handle large datatypes (BLOB).
Any assistance or advice is appreciated.
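Not from the original post, and only a sketch: the attribute approach the ROracle documentation describes is normally applied to individual columns of a data.frame (the last error above comes from passing dbWriteTable a plain character vector rather than a data.frame). Assuming df exists as above and that Var and Var2 are the very long character columns (illustrative names), the attempt would look roughly like this:

library(ROracle)

drv  <- dbDriver("Oracle")
conn <- dbConnect(drv, username = "username", password = "password",
                  dbname = "dbname")

# keep the data as a data.frame and tag the long columns by attribute
test.df1 <- df[, c("Var", "Var2", "Var3", "Var4", "Var5", "Var6")]
attr(test.df1$Var,  "ora.type") <- "CLOB"   # illustrative column names
attr(test.df1$Var2, "ora.type") <- "CLOB"

dbWriteTable(conn, "TEST2", test.df1, overwrite = TRUE)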

Make sqlite database from data.frames with dot in name with dplyr

This is a follow-up to a previously asked question: Copy a list of data.frame(s) to sqlite database using dplyr. Now I want to load data.frames into an SQLite database using dplyr, but some of the data.frames have dots in their names. For example:
data(iris)
data(cars)
res <- list("ir.is" = iris, "cars" = cars)
my_db <- dplyr::src_sqlite(paste0(tempdir(), "/foobar.sqlite3"),
                           create = TRUE)
lapply(seq_along(res), function(i, dt = res) dplyr::copy_to(my_db,
                                                            dt[[i]], names(dt)[[i]]))

Error in sqliteSendQuery(conn, statement, bind.data) :
  error in statement: near "is": syntax error
I think the error is due to lack of quoting in the underlying internal SQL statements.
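Not from the original post: one workaround is to sanitize the list names before handing them to copy_to, so the generated SQL never contains an unquoted ir.is; writing through DBI::dbWriteTable, which quotes identifiers, should also avoid the problem. A minimal sketch of the first option, using the same list as above:

library(DBI)
library(RSQLite)

res <- list("ir.is" = iris, "cars" = cars)
con <- dbConnect(SQLite(), file.path(tempdir(), "foobar.sqlite3"))

# replace dots in table names, e.g. "ir.is" becomes "ir_is"
safe_names <- gsub(".", "_", names(res), fixed = TRUE)
Map(function(df, nm) dbWriteTable(con, nm, df), res, safe_names)

dbDisconnect(con)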

Reading huge csv files into R with sqldf works but sqlite file takes twice the space it should and needs "vacuuming"

Reading around, I found out that the best way to read a larger-than-memory csv file is to use read.csv.sql from the package sqldf. This function reads the data directly into an SQLite database and then executes a SQL statement.
I noticed the following: the data read into SQLite seems to be stored in a temporary table, so to make it accessible for future use, you have to ask for that explicitly in the SQL statement.
As an example, the following code reads some sample data into SQLite:
# generate sample data
sample_data <- data.frame(col1 = sample(letters, 100000, TRUE), col2 = rnorm(100000))
# save as csv
write.csv(sample_data, "sample_data.csv", row.names = FALSE)
# create a sample sqlite database
library(sqldf)
sqldf("attach sample_db as new")
# read the csv into the database and create a table with its content
read.csv.sql("sample_data.csv", sql = "create table data as select * from file",
             dbname = "sample_db", header = TRUE, row.names = FALSE, sep = ",")
The data can then be accessed with sqldf("select * from data limit 5", dbname = "sample_db").
The problem is the following: the sqlite file takes up twice as much space as it should. My guess is that it contains the data twice: once for the temporary read, and once for the stored table. It is possible to clean up the database with sqldf("vacuum", dbname = "sample_db"). This will reclaim the empty space, but it takes a long time, especially when the file is big.
Is there a better solution that avoids this data duplication in the first place?
Solution: using RSQLite without going through sqldf:
library(RSQLite)
con <- dbConnect(SQLite(), dbname = "sample_db")
# read the csv file into the sql database
dbWriteTable(con, name = "sample_data", value = "sample_data.csv",
             row.names = FALSE, header = TRUE, sep = ",")
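As a quick check (not from the original post), the imported table can then be queried directly and the connection closed with standard DBI calls:

dbGetQuery(con, "select * from sample_data limit 5")
dbDisconnect(con)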
