I have created and filled an SQLite database within R using the DBI and RSQLite packages, e.g. like this:
con_sqlite <- DBI::dbConnect(RSQLite::SQLite(), "../mydatabase.sqlite")
DBI::dbWriteTable(con_sqlite, "mytable", d_table, overwrite = TRUE)
...
Now the SQLite file has become too big, so I reduced the tables. However, the file size does not decrease, and I found out that I have to run the VACUUM command. Is there a way to run this command from within R?
I think this should do the trick:
DBI::dbExecute(con, "VACUUM;")
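For context, a minimal end-to-end sketch using the file name from the question (the file.size() calls are only there to confirm the effect):
library(DBI)
con_sqlite <- DBI::dbConnect(RSQLite::SQLite(), "../mydatabase.sqlite")
file.size("../mydatabase.sqlite")          # size before
DBI::dbExecute(con_sqlite, "VACUUM;")      # reclaims the free pages left by the deletes
file.size("../mydatabase.sqlite")          # size after
DBI::dbDisconnect(con_sqlite)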
I have two very large csv files that contain the same variables. I want to combine them into one table inside an SQLite database - if possible using R.
I successfully managed to put both csv files into separate tables inside one database using inborutils::csv_to_sqlite, which imports small chunks of data at a time.
Is there a way to create a third table in which both tables are simply appended, using R (and keeping in mind the limited RAM)? And if not, how else can I perform this task? Maybe via the terminal?
We assume that when the question refers to the "same variables" it means that the two tables have the same column names. Below we create two such test tables, BOD and BOD2, and then in the create statement we combine them, creating the table both. This does the combining entirely on the SQLite side. Finally, we look at both.
library(RSQLite)

con <- dbConnect(SQLite())           # modify to refer to an existing SQLite database
dbWriteTable(con, "BOD", BOD)        # first test table
dbWriteTable(con, "BOD2", 10 * BOD)  # second test table with the same columns

# union all appends the rows of both tables; a plain union would also drop duplicate rows
dbExecute(con, "create table both as select * from BOD union all select * from BOD2")

dbReadTable(con, "both")
dbDisconnect(con)
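If a third copy of the data is not needed, the same idea can append one table's rows directly onto the other, still entirely on the SQLite side. A minimal sketch (run before the dbDisconnect above), assuming the tables from the example:
dbExecute(con, "insert into BOD select * from BOD2")  # append BOD2's rows onto BOD; nothing is pulled into R
dbReadTable(con, "BOD")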
One thing I ran into using the RPostgreSQL package is that dbWriteTable(…, overwrite=TRUE) seems to destroy the existing table structure (datatypes and constraints), and dbRemoveTable() is equivalent to DROP TABLE.
So I’ve used:
ltvTable <- "the_table_to_use"
dfLTV <- data.frame(x, y, z)

sql_truncate <- paste("TRUNCATE", ltvTable)  ## unconditional DELETE FROM…
res <- dbSendQuery(conn = con, statement = sql_truncate)
dbWriteTable(conn = con, name = ltvTable, value = dfLTV, row.names = FALSE, append = TRUE)
Is the TRUNCATE step necessary, or is there a dbWriteTable method that overwrites just the content, not the structure?
I see different behaviour from the answer offered by Manura Omal to How we can write data to a postgres DB table using R?, as overwrite=TRUE does not appear to truncate first.
I'm using: RPostgreSQL 0.4-1; PostgreSQL 9.4
best wishes - JS
As far as I know, overwrite=TRUE does three things:
DROPs the table
CREATEs the table
fills the table with the new data
So if you want to preserve the structure, you need the TRUNCATE step.
The different behaviour might be caused by the existence or non-existence of foreign keys preventing the DROP TABLE step.
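If foreign keys (or permissions) also get in the way of TRUNCATE, an unconditional DELETE achieves the same content-only reset. A minimal sketch, assuming the connection and names from the question:
res <- dbSendQuery(con, paste("DELETE FROM", ltvTable))  # empties the table but keeps datatypes and constraints
dbWriteTable(con, name = ltvTable, value = dfLTV, row.names = FALSE, append = TRUE)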
I would like to bulk-INSERT/UPSERT a moderately large number of rows into a PostgreSQL database using R. In order to do so, I am preparing a multi-row INSERT string using R.
query <- sprintf("BEGIN;
                  CREATE TEMPORARY TABLE
                  md_updates(ts_key varchar, meta_data hstore) ON COMMIT DROP;
                  INSERT INTO md_updates(ts_key, meta_data) VALUES %s;
                  LOCK TABLE %s.meta_data_unlocalized IN EXCLUSIVE MODE;
                  UPDATE %s.meta_data_unlocalized
                  SET meta_data = md_updates.meta_data
                  FROM md_updates
                  WHERE md_updates.ts_key = %s.meta_data_unlocalized.ts_key;
                  COMMIT;", md_values, schema, schema, schema)

DBI::dbGetQuery(con, query)
The entire function can be found here. Surprisingly (at least to me), I learned that the UPDATE part is not the problem; I left it out, ran the query again, and it wasn't much faster. Inserting a million+ records seems to be the issue here.
I did some research and found quite a bit of information:
bulk inserts
bulk inserts II
what causes large inserts to slow down
The answers from Erwin Brandstetter and Craig Ringer were particularly helpful. I was able to speed things up quite a bit by dropping indices and following a few other suggestions.
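(For reference, the drop-and-recreate pattern looks roughly like this; the index name below is only a placeholder:)
DBI::dbExecute(con, sprintf("DROP INDEX IF EXISTS %s.md_ts_key_idx;", schema))   # hypothetical index name
# ... run the bulk INSERT / UPDATE from above ...
DBI::dbExecute(con, sprintf("CREATE INDEX md_ts_key_idx ON %s.meta_data_unlocalized (ts_key);", schema))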
However, I struggled to implement another suggestion which sounded promising: COPY. The problem is I can't get it done from within R.
The following works for me:
sql <- "CREATE TABLE md_updates(ts_key varchar, meta_data hstore);
        COPY md_updates FROM STDIN;"
dbGetQuery(sandbox,"COPY md_updates FROM 'test.csv' DELIMITER ';' CSV;")
But I can't get it done without reading from an extra .csv file. So my questions are:
Is COPY really a promising approach here (over the multi-row INSERT I have)?
Is there a way to use COPY from within R without writing the data to a file? The data fits in memory, and since it's already in memory, why write it to disk?
I am using PostgreSQL 9.5, on OS X and on RHEL respectively.
RPostgreSQL has a postgresqlCopyInDataframe function that looks like it should do what you want:
install.packages("RPostgreSQL")
library(RPostgreSQL)
con <- dbConnect(PostgreSQL(), user="...", password="...", dbname="...", host="...")
dbSendQuery(con, "copy foo from stdin")
postgresqlCopyInDataframe(con, df)
where table foo has the same columns as the data frame df.
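A minimal sketch of how that could plug into the question's workflow (con is an open RPostgreSQL connection and md_updates_df is the data frame to load; both names are placeholders):
dbSendQuery(con, "CREATE TEMPORARY TABLE md_updates(ts_key varchar, meta_data hstore)")
dbSendQuery(con, "COPY md_updates FROM STDIN")   # switch the connection into copy-in mode
postgresqlCopyInDataframe(con, md_updates_df)    # stream the data frame, no intermediate file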
I have a requirement to fetch data from Hive tables into a csv file and use it in RevoScaleR.
Currently we pull the data from Hive and manually put it into a file in the unix file system for ad hoc analysis; however, the requirement is to redirect the result directly into an HDFS location and use RevoScaleR from there.
How do I do that, or what sort of connection do I need to establish for this?
If I understand your question correctly, you could use a RevoScaleR ODBC connection to import the Hive table and do further analysis from there.
Here is an example using the Hortonworks-provided ODBC driver:
OdbcConnString <- "DSN=Sample Hortonworks Hive DSN"

odbcDS <- RxOdbcData(sqlQuery = "SELECT * FROM airline",
                     connectionString = OdbcConnString,
                     stringsAsFactors = TRUE,
                     useFastRead = TRUE,
                     rowsPerRead = 150000)

xdfFile <- "airlineHWS.xdf"
if (file.exists(xdfFile)) file.remove(xdfFile)

Flights <- rxImport(odbcDS, outFile = xdfFile, overwrite = TRUE)
rxGetInfo(data = "airlineHWS.xdf", getVarInfo = TRUE, numRows = 10)
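Once the data is in the .xdf file, further analyses can run straight off disk; for example (ArrDelay is only a guess at a column name in the airline table):
rxSummary(~ ArrDelay, data = xdfFile)   # summarise a column directly from the .xdf file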
Chenwei's approach is OK, but there is one problem: the data is temporarily stored in memory as a data frame in the odbcDS object. If we have a huge table in Hive, that becomes a problem.
I would suggest keeping everything on disk by using external tables in Hive and then using the backend data directly in Revolution R.
Something along these lines:
First, create an external table from the existing Hive table in a text file (csv, tab, etc.) format.
CREATE EXTERNAL TABLE ext_table
LIKE your_original_table_name
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/your/hdfs/location';
Here we are creating an external table which is stored as csv files in HDFS.
Next, copy the original table into the external table using the insert overwrite command.
insert overwrite table ext_table select * from your_original_table_name
If we want to check the backend data on HDFS, type:
hadoop fs -ls /your/hdfs/location/
We can see the part files stored at that location. Go ahead and cat them to double-check.
Now we can use the RxTextData function to read the data from the above step:
hive_data <- RxTextData(file='/your/hdfs/location/', delimiter = ',')
Now you can create an .xdf file from hive_data (passing it as the inData argument to rxImport) to make further processing more efficient, but above all the data has never touched memory all at once.
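A minimal sketch of those last two steps together (the HDFS file-system object and the output path are assumptions about the local setup):
hdfsFS <- RxHdfsFileSystem()   # point RevoScaleR at HDFS
hive_data <- RxTextData(file = "/your/hdfs/location/", delimiter = ",", fileSystem = hdfsFS)
rxImport(inData = hive_data, outFile = "ext_table.xdf", overwrite = TRUE)   # convert to .xdf on disk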
As the question title says, I found that I can use .import in the sqlite shell, but it does not seem to work in the R environment. Any suggestions?
You can use read.csv.sql in the sqldf package. It takes only one line of code to do the read. Assuming you want to create a new database, testingdb, and then read a file into it, try this:
library(sqldf)

# create a test file
write.table(iris, "iris.csv", sep = ",", quote = FALSE, row.names = FALSE)

# create an empty database.
# can skip this step if the database already exists.
sqldf("attach testingdb as new")
# or: cat(file = "testingdb")

# read into a table called iris in the testingdb sqlite database
read.csv.sql("iris.csv", sql = "create table main.iris as select * from file",
             dbname = "testingdb")

# look at the first three lines
sqldf("select * from main.iris limit 3", dbname = "testingdb")
The above uses sqldf, which uses RSQLite. You can also use RSQLite directly; see ?dbWriteTable in RSQLite. Note that there can be problems with line endings if you do it directly with dbWriteTable, which sqldf will (usually) handle automatically.
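A minimal sketch of that direct RSQLite route, reusing the iris.csv and testingdb files created above (RSQLite's dbWriteTable can be handed a file name to load; the extra arguments are assumptions about the csv layout):
library(RSQLite)
con <- dbConnect(SQLite(), "testingdb")
dbWriteTable(con, "iris2", "iris.csv", header = TRUE, sep = ",")   # value given as a file name: RSQLite reads the csv itself
dbGetQuery(con, "select * from iris2 limit 3")
dbDisconnect(con)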
If your intention was to read the file into R immediately after reading it into the database, and you don't really need the database after that, then see:
http://code.google.com/p/sqldf/#Example_13._read.csv.sql_and_read.csv2.sql
I tend to do that with the sqldf package: Quickly reading very large tables as dataframes in R
Keep in mind that in the linked example I read the csv into a temporary sqlite database; you'll obviously need to change that bit.