As per the question title, I found that I can use .import in the sqlite3 shell, but it does not seem to work from within the R environment. Any suggestions?
You can use read.csv.sql in the sqldf package. The read itself is only one line of code. Assuming you want to create a new database, testingdb, and then read a file into it, try this:
library(sqldf)

# create a test file
write.table(iris, "iris.csv", sep = ",", quote = FALSE, row.names = FALSE)

# create an empty database.
# can skip this step if the database already exists.
sqldf("attach testingdb as new")
# or: cat(file = "testingdb")

# read into a table called iris in the testingdb sqlite database
read.csv.sql("iris.csv", sql = "create table main.iris as select * from file",
             dbname = "testingdb")

# look at the first three lines
sqldf("select * from main.iris limit 3", dbname = "testingdb")
The above uses sqldf, which in turn uses RSQLite. You can also use RSQLite directly; see ?dbWriteTable in RSQLite. Note that if you use dbWriteTable directly there can be problems with line endings, which sqldf will usually handle automatically.
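For reference, a minimal sketch of the direct RSQLite route, reusing the iris.csv file and testingdb name from the example above (the connection variable con is just a placeholder); here dbWriteTable is given the csv path so RSQLite does the import itself:
library(RSQLite)

con <- dbConnect(SQLite(), dbname = "testingdb")

# value may be a file name; RSQLite reads the csv straight into the table
dbWriteTable(con, name = "iris", value = "iris.csv",
             row.names = FALSE, header = TRUE, overwrite = TRUE)

dbGetQuery(con, "select * from iris limit 3")
dbDisconnect(con)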
If your intention was to read the file into R immediately after reading it into the database and you don't really need the database after that then see:
http://code.google.com/p/sqldf/#Example_13._read.csv.sql_and_read.csv2.sql
I tend to do that with the sqldf package; see Quickly reading very large tables as dataframes in R.
Keep in mind that in that example I read the csv into a temporary sqlite db, so you will obviously need to change that bit if you want a persistent database.
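For illustration, a minimal sketch of that temporary-database pattern using the iris.csv created earlier (the filter condition is made up): read.csv.sql with the default dbname creates a throwaway SQLite database, loads the file into it, runs the sql, returns only the result as a data frame and then deletes the database.
library(sqldf)
# only the selected rows ever reach R; the temporary database is removed afterwards
DF <- read.csv.sql("iris.csv", sql = "select * from file where Species = 'setosa'")
head(DF)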
Related
I have created and filled a sqlite database within R using the packages DBI and RSQLite, e.g. like this:
con_sqlite <- DBI::dbConnect(RSQLite::SQLite(), "../mydatabase.sqlite")
DBI::dbWriteTable(con_sqlite, "mytable", d_table, overwrite = TRUE)
...
Now the sqlite file has become too big, so I reduced the tables. However, the file size does not decrease, and I found out that I have to use the VACUUM command. Is there a way to run this command from within R?
I think this should do the trick:
DBI::dbExecute(con_sqlite, "VACUUM;")
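A minimal end-to-end sketch, reusing con_sqlite and the mydatabase.sqlite path from the question; the file.size() calls are only there to confirm that the file actually shrinks once VACUUM has run:
con_sqlite <- DBI::dbConnect(RSQLite::SQLite(), "../mydatabase.sqlite")

file.size("../mydatabase.sqlite")   # size before: freed pages are still part of the file

DBI::dbExecute(con_sqlite, "VACUUM;")

file.size("../mydatabase.sqlite")   # size after VACUUM
DBI::dbDisconnect(con_sqlite)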
I'm using the sqldf package to import CSV files into R and then produce statistics based on the data loaded into the resulting data frames. We have a shared lab environment with many users, which means we all share the available RAM on the same server. Although there is a large amount of RAM available, given the number of users who are often connected simultaneously, the administrator of the environment recommends importing our files into some database (PostgreSQL, SQLite, etc.) instead of loading everything into memory.
I was checking the documentation of the sqldf package and the read.csv.sql function caught my attention. Here is what the documentation says:
Reads the indicated file into an sql database creating the database if
it does not already exist. Then it applies the sql statement returning
the result as a data frame. If the database did not exist prior to
this statement it is removed.
However, what I don't understand is whether the returned data frame will be in memory (and therefore in the server's RAM) or whether, like the imported CSV file, it will be in the specified database. If it is in memory, it doesn't meet my requirement, which is to reduce the use of the shared RAM as much as possible, given the huge size of my CSV files (tens of gigabytes, often more than 100 000 000 lines in each file).
Curious to see how this works, I wrote the following program:
df_data <- suppressWarnings(read.csv.sql(
  file = "X:/logs/data.csv",
  sql = "
    select
      nullif(timestamp, '')      as timestamp_value,
      nullif(user_account, '')   as user_account,
      nullif(country_code, '')   as country_code,
      nullif(prefix_value, '')   as prefix_value,
      nullif(user_query, '')     as user_query,
      nullif(returned_code, '')  as returned_code,
      nullif(execution_time, '') as execution_time,
      nullif(output_format, '')  as output_format
    from
      file
  ",
  header = FALSE,
  sep = "|",
  eol = "\n",
  `field.types` = list(
    timestamp_value = c("TEXT"),
    user_account = c("TEXT"),
    country_code = c("TEXT"),
    prefix_value = c("TEXT"),
    user_query = c("TEXT"),
    returned_code = c("TEXT"),
    execution_time = c("REAL"),
    output_format = c("TEXT")
  ),
  dbname = "X:/logs/sqlite_tmp.db",
  drv = "SQLite"
))
I ran the above program to import a big CSV file (almost 150 000 000 rows); it took around 30 minutes. During execution, as specified via the dbname parameter in the source code, I saw that a SQLite database file was created at X:/logs/sqlite_tmp.db. As the rows in the file were being imported, this file grew bigger and bigger, which indicated that all rows were indeed being inserted into the database file on disk and not into memory (the server's RAM). By the end of the import the database file had reached 30 GB. As stated in the documentation, the database was then removed automatically at the end of the import process. Yet even after the created SQLite database had been removed, I was still able to work with the result data frame (that is, df_data in the above code).
My understanding is that the returned data frame was in the server's RAM, otherwise I wouldn't have been able to refer to it after the created SQLite database had been removed. Please correct me if I'm wrong, but if that is the case, I think I misunderstood the purpose of this R package. My aim was to keep everything, even the result data frame, in a database and to use the RAM only for calculations. Is there any way to keep everything in the database until the end of the program?
The purpose of sqldf is to process data frames using SQL. If you want to create a database and read a file into it you can use dbWriteTable from RSQLite directly; however, if you want to use sqldf anyway, then first create an empty database, mydb, then read the file into it, and finally check that the table is there. Ignore the warning that read.csv.sql gives. If you add the verbose = TRUE argument to read.csv.sql it will show the RSQLite statements it uses.
Also you may wish to read https://avi.im/blag/2021/fast-sqlite-inserts/ and https://www.pdq.com/blog/improving-bulk-insert-speed-in-sqlite-a-comparison-of-transactions/
library(sqldf)
sqldf("attach 'mydb' as new")
read.csv.sql("myfile.csv",
             sql = "create table mytab as select * from file",
             dbname = "mydb")
## data frame with 0 columns and 0 rows
## Warning message:
## In result_fetch(res@ptr, n = n) :
##   SQL statements must be issued with dbExecute() or
##   dbSendStatement() instead of dbGetQuery() or dbSendQuery().
sqldf("select * from sqlite_master", dbname = "mydb")
## type name tbl_name rootpage
## .. info on table that was created in mydb ...
sqldf("select count(*) from mytab", dbname = "mydb")
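If the goal is to keep the big table on disk and bring only small results into R's memory, you can keep querying the persistent database and return just the aggregates; a rough sketch against the mytab table created above (the returned_code column name is borrowed from the question's file purely for illustration):
## only this small summary comes back into RAM; the full table stays in the mydb file
stats <- sqldf("select returned_code, count(*) as n
                from mytab
                group by returned_code", dbname = "mydb")
head(stats)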
I am trying to improve my analytics script for a database and want to use the output of a query against my Access DB directly, instead of outputting to a linked Excel file that I refresh and then save as a csv file. I can get the table into R, but I would like it parsed through readr. Is there a way to do this?
I have tried outputting to a csv straight away and then using read_csv to reimport it, but that produced parsing errors.
dbdata <- sqlQuery(db, "SELECT * FROM qRoutput", stringsAsFactors = FALSE)
This is how I currently import my query.
I would just like to run that query output through readr's parsing, so that everything is read in as characters or doubles and the class comes back as "spec_tbl_df" "tbl_df" "tbl" "data.frame".
This should already give the output as a data.frame.
Maybe you could try df <- sqlGetResults(db) after your sqlQuery.
Alternatively, you can go for the dbplyr approach; below is an example. This approach is valid for RODBC connections too. See: Intro to dbplyr
library(RSQLite)
library(dplyr)  # provides tbl(), select() and %>% (via the dbplyr backend)

# Create a db
db <- dbConnect(SQLite(), "./Data/TestDB.db")

# write data to the db
dbWriteTable(db, "testtable", mtcars)

# Check whether the table is present
dbListTables(db)

# use the dplyr syntax to query
tbl(db, "testtable") %>% select(mpg, cyl)
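Note that tbl() builds a lazy query that is translated to SQL and executed in the database; only when you pipe it through collect() does the result come back into R as a regular tibble. A small follow-up to the example above:
result <- tbl(db, "testtable") %>%
  select(mpg, cyl) %>%
  collect()   # runs the SQL in the database and returns a tibble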
I want to use dbWriteTable() from R's DBI package to write data into a database. Usually the respective tables already exist, so I use the argument append = TRUE. How do I find out which rows were added to the table by dbWriteTable()? Most of the tables have certain columns with UNIQUE values, so a SELECT will work (see below for a simple example). However, this is not true for all of them, or only several columns together are UNIQUE, which makes the SELECT more complicated. In addition, I would like to wrap the writing and querying in a function, so I would prefer a consistent approach for all cases.
I mainly need this to get the PRIMARY KEYs added by the database and to allow a user to quickly see what was added. If it matters, my database is PostgreSQL and I would like to use the odbc package for the connection.
I have something like this in mind; however, I am looking for a more general solution:
library(DBI)
con <- dbConnect(odbc::odbc(), dsn = "database")
dbWriteTable(con,
             name = "site",
             value = data.frame(name = c("abcd", "efgh")),
             append = TRUE)
dbGetQuery(con,
           "SELECT * FROM site WHERE name in ('abcd', 'efgh');")
I am trying to load a large-ish csv file into a SQLite database using the RSQLite package (I have also tried the sqldf package). The file contains all UK postcodes and a variety of lookup values for them.
I wanted to avoid loading it into R and just directly load it into the database. Whilst this is not strictly necessary for this task, I want to do so in order to have the technique ready for larger files which won't fit in memory should I have to handle them in the future.
Unfortunately the csv is provided with the values in double quotes and the dbWriteTable function doesn't seem able to strip them or ignore them in any form. Here is the download location of the file: http://ons.maps.arcgis.com/home/item.html?id=3548d835cff740de83b527429fe23ee0
Here is my code:
# Load library
library("RSQLite")
# Create a temporary directory
tmpdir <- tempdir()
# Set the file name
file <- "data\\ONSPD_MAY_2017_UK.zip"
# Unzip the ONS Postcode Data file
unzip(file, exdir = tmpdir )
# Create a path pointing at the unzipped csv file
ONSPD_path <- paste0(tmpdir,"\\ONSPD_MAY_2017_UK.csv")
# Create a SQL Lite database connection
db_connection <- dbConnect(SQLite(), dbname="ons_lkp_db")
# Now load the data into our SQL lite database
dbWriteTable(conn = db_connection,
             name = "ONS_PD",
             value = ONSPD_path,
             row.names = FALSE,
             header = TRUE,
             overwrite = TRUE)
# Check the data upload
dbListTables(db_connection)
dbGetQuery(db_connection,"SELECT pcd, pcd2, pcds from ONS_PD LIMIT 20")
Having hit this issue, I found a reference tutorial (https://www.r-bloggers.com/r-and-sqlite-part-1/) which recommended using the sqldf package, but unfortunately when I try the relevant sqldf function (read.csv.sql) I get the same issue with double quotes.
This feels like a fairly common problem when importing csv files into a SQL system; most import tools can handle double quotes, so I'm surprised to be hitting an issue with this (unless I've missed an obvious help file somewhere along the way).
EDIT 1
Here is some example data from my csv file in the form of a dput output of the SQL table:
structure(list(pcd = c("\"AB1 0AA\"", "\"AB1 0AB\"", "\"AB1 0AD\"",
"\"AB1 0AE\"", "\"AB1 0AF\""), pcd2 = c("\"AB1 0AA\"", "\"AB1 0AB\"",
"\"AB1 0AD\"", "\"AB1 0AE\"", "\"AB1 0AF\""), pcds = c("\"AB1 0AA\"",
"\"AB1 0AB\"", "\"AB1 0AD\"", "\"AB1 0AE\"", "\"AB1 0AF\"")), .Names = c("pcd",
"pcd2", "pcds"), class = "data.frame", row.names = c(NA, -5L))
EDIT 2
Here is my attempt using the filter argument of sqldf's read.csv.sql function (note that Windows users will need Rtools installed for this). Unfortunately this still doesn't remove the quotes from my data, although it does mysteriously remove all the spaces.
library("sqldf")
sqldf("attach 'ons_lkp_db' as new")
db_connection <- dbConnect(SQLite(), dbname="ons_lkp_db")
read.csv.sql(ONSPD_path,
             sql = "CREATE TABLE ONS_PD AS SELECT * FROM file",
             dbname = "ons_lkp_db",
             filter = 'tr.exe -d ^"')
dbGetQuery(db_connection, "SELECT pcd, pcd2, pcds from ONS_PD LIMIT 5")
Also, thanks for the close vote from whoever felt this wasn't a programming question in the scope of Stack Overflow(?!).
The CSV importer in the RSQLite package is derived from the sqlite3 shell, which itself doesn't seem to offer support for quoted values when importing CSV files (How to import load a .sql or .csv file into SQLite?, doc). You could use readr::read_delim_chunked():
callback <- function(data, pos) {
  # readr's chunk callbacks are called with the chunk and its starting position;
  # pos is accepted here but not used
  name <- "ONS_PD"
  exists <- dbExistsTable(con, name)
  dbWriteTable(con, name, data, append = exists)
}

readr::read_delim_chunked(ONSPD_path, callback, ...)
Substitute ... with any extra arguments you need for your CSV file.
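For illustration, a concrete call might look roughly like this; the connection, database file name, delimiter and chunk size are assumptions for this particular file, not part of the original answer:
library(DBI)
library(RSQLite)

# connection used inside the callback above; the db file name is an assumption
con <- dbConnect(SQLite(), dbname = "ons_lkp_db")

# read the csv in 100,000-row chunks and append each chunk to the ONS_PD table
readr::read_delim_chunked(ONSPD_path, callback, delim = ",", chunk_size = 100000)

dbGetQuery(con, "SELECT count(*) FROM ONS_PD")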
Use read.csv.sql from the sqldf package with the filter argument and provide any utility which strips out double quotes or which translates them to spaces.
The question does not provide a fully reproducible minimal example but I have provided one below. If you are using read.csv.sql in order to pick out a subset of rows or columns then just add the appropriate sql argument to do so.
First set up the test input data and then try any of the one-line solutions shown below. Assuming Windows, ensure that the tr utility (found in R's Rtools distribution) or the third party csvfix utility (found here and for Linux also see this) or the trquote2space.vbs vbscript utility (see Note at end) is on your path:
library(sqldf)
cat('a,b\n"1","2"\n', file = "tmp.csv")
# 1 - corrected from FAQ
read.csv.sql("tmp.csv", filter = "tr.exe -d '^\"'")
# 2 - similar but does not require Windows cmd quoting
read.csv.sql("tmp.csv", filter = "tr -d \\42")
# 3 - using csvfix utility (which must be installed first)
read.csv.sql("tmp.csv", filter = "csvfix echo -smq")
# 4 - using trquote2space.vbs utility as per Note at end
read.csv.sql("tmp.csv", filter = "cscript /nologo trquote2space.vbs")
any of which give:
a b
1 1 2
You could also use any other language or utility that is appropriate. For example, your Powershell suggestion could be used although I suspect that dedicated utilities such as tr and csvfix would run faster.
The first solution above is corrected from the FAQ. (It did work at the time the FAQ was written many years ago, but testing it now on Windows 10 it seems to require the indicated change; alternatively, the markdown may not have survived intact in the move from Google Code, where it was originally hosted, to GitHub, which uses a slightly different markdown flavour.)
For Linux, tr is available natively although quoting differs from Windows and can even depend on the shell. csvfix is available on Linux too but would have to be installed. The csvfix example shown above would work identically on Windows and Linux. vbscript is obviously specific to Windows.
Note: sqldf comes with a mini-tr utility written in vbscript. If you change the relevant lines to:
Dim sSearch : sSearch = chr(34)
Dim sReplace : sReplace = " "
and change the name to trquote2space.vbs then you will have a Windows specific utility to change double quotes to spaces.
Honestly, I could not find anything to solve this problem.
The sqldf documentation says:
"so, one limitation with .csv files is that quotes
are not regarded as special within files so a comma within a data field such as
"Smith, James"
would be regarded as a field delimiter and the quotes would be entered as part of the data which
probably is not what is intended"
So it looks like there is no solution, as far as I know.
One possible suboptimal approach (other than the obvious find-and-replace in a text editor)
is to use SQL commands like this:
dbExecute(db_connection, "UPDATE ONS_PD SET pcd = REPLACE(pcd, '\"', '')")
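Extending that idea, the same clean-up can be applied to every quoted column in one statement; the column list below only covers the three columns shown in the dput above, not the full ONSPD layout:
dbExecute(db_connection, "
  UPDATE ONS_PD
  SET pcd  = REPLACE(pcd,  '\"', ''),
      pcd2 = REPLACE(pcd2, '\"', ''),
      pcds = REPLACE(pcds, '\"', '')
")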