Make sqlite database from data.frames with dot in name with dplyr - r

This is a follow-up to a previously asked question: Copy a list of data.frame(s) to sqlite database using dplyr. Now I want to load data.frames into an SQLite database using dplyr, but some of the data.frames have dots in their names. For example,
data(iris)
data(cars)
res <- list("ir.is" = iris, "cars" = cars)
my_db <- dplyr::src_sqlite(paste0(tempdir(), "/foobar.sqlite3"),
                           create = TRUE)
lapply(seq_along(res), function(i, dt = res) {
  dplyr::copy_to(my_db, dt[[i]], names(dt)[[i]])
})
Error in sqliteSendQuery(conn, statement, bind.data) :
  error in statement: near "is": syntax error
I think the error is due to the table name not being quoted in the SQL statements that copy_to() generates internally.
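One possible workaround, sketched here under the assumption that res and my_db are as defined above (this is not a confirmed fix): either sanitize the names before calling copy_to(), or write the tables through DBI::dbWriteTable(), which quotes identifiers:
# Option 1: replace the dots so copy_to() receives a plain table name
safe_names <- gsub(".", "_", names(res), fixed = TRUE)
lapply(seq_along(res), function(i) {
  dplyr::copy_to(my_db, res[[i]], safe_names[i], temporary = FALSE)
})

# Option 2: bypass copy_to() and write through DBI, which quotes identifiers,
# so a dot in the table name is allowed
lapply(names(res), function(nm) {
  DBI::dbWriteTable(my_db$con, nm, res[[nm]])
})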

Related

R query database tables iteratively without for loop with lambda or vectorized function for Shiny app

I am connecting to a SQL Server database through an ODBC connection in R. I have two potential methods of getting the data and am trying to determine which would be more efficient. The data is needed for a Shiny dashboard, so it has to be pulled while the app is loading rather than queried on the fly as the user interacts with the app.
Method 1 is to use over 20 stored procedures to query all of the needed data and store them for use. Method 2 is to query all of the tables individually.
Here is the method I used to query one of the stored procedures:
get_proc_data <- function(proc_name, url, start_date, end_date) {
  dbGetQuery(con, paste0(
    "EXEC dbo.", proc_name, " ",
    "@URL = N'", url, "', ",
    "@Startdate = '", start_date, "', ",
    "@enddate = '", end_date, "' "
  ))
}
data <- get_proc_data(proc_name, url, today(), today() %m-% years(5))
However, each of the stored procedures has a slightly different setup for the parameters, so I would have to define each of them separately.
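One way around that, sketched here as a hypothetical illustration rather than tested code, is a single wrapper that builds the EXEC statement from a named list of parameters, so each procedure differs only in the list it is given:
# Hypothetical sketch: `con` is the existing ODBC connection; parameter values
# are assumed to be safe to pass as quoted strings
call_proc <- function(con, proc_name, params) {
  args <- paste0("@", names(params), " = '", unlist(params), "'", collapse = ", ")
  DBI::dbGetQuery(con, paste0("EXEC dbo.", proc_name, " ", args))
}
# e.g. call_proc(con, "SomeProc",
#                list(URL = url, Startdate = start_date, enddate = end_date))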
I have started to implement Method 2, but have run into issues with iteratively querying each table.
# use dplyr to create a list of table names
db_tables <- dbGetQuery(con, "SELECT * FROM [database_name].INFORMATION_SCHEMA.TABLES;") %>%
  select(TABLE_NAME)

# use dplyr::pull to create a character vector
table_list <- pull(db_tables, TABLE_NAME)

# get a quick look at the first few rows
tbl(con, "[TableName]") %>% head() %>% glimpse()

# iterate through all table names, get the first few rows, and export to .csv
for (table in table_list) {
  write.csv(
    tbl(con, table) %>% head(),
    str_glue("{getwd()}/00_exports/tables/{table}.csv")
  )
}

selected_tables <- db_tables %>% filter(TABLE_NAME %in% c("TableName1", "TableName2"))
Ultimately this method was just to test how long it would take to iterate through the ~60 tables and perform the required function. I have tried putting this into a function instead but have not been able to get it to iterate through while also pulling the name of the table.
Pro/Con for Method 1: The stored procs are currently powering a metrics plug-in that was written in C++ and is displaying metrics on the webpage. This is for internal use to monitor website performance. However, the stored procedures are not all visible to me and the client needs me to extend their current metrics. I also do not have a DBA at my disposal to help with the SQL Server side, and the person who wrote the procs is unavailable. The procs are also using different logic than each other, so joining the results of two different procs gives drastically different values. For example, depending on the proc, each date will list total page views for each day or already be aggregated at the weekly or monthly scale then listed repeatedly. So joining and grouping causes drastic errors in actual page views.
Pro/Con for Method 2: I am familiar with dplyr and would be able to join the tables together to pull the data I need. However, I am not as familiar with SQL and there is no Entity-Relationship Diagram (ERD) of any sort to refer to. Otherwise, I would build each query individually.
Either way, I am trying to come up with a way to proceed with either a named function, lambda function, or vectorized method for iterating. It would be great to name each variable and assign them appropriately so that I can perform the data wrangling with dplyr.
Any help would be appreciated; I am overwhelmed by which direction to go. I researched the R equivalent of Python's list comprehensions but have not been able to get the function in R to perform similarly.
> db_table_head_to_csv <- function(table) {
+ write.csv(
+ tbl(con, table) %>% head(), str_glue("{getwd()}/00_exports/bibliometrics_tables/{table}.csv")
+ )
+ }
>
> bibliometrics_tables %>% db_table_head_to_csv()
Error in UseMethod("as.sql") :
no applicable method for 'as.sql' applied to an object of class "data.frame"
Consider storing all of the table data in a named list (the counterpart to a Python dictionary) using lapply (the counterpart to Python's list/dict comprehension). If you use its sibling, sapply, the character vector you pass in becomes the names of the list elements:
# RETURN VECTOR OF TABLE NAMES
db_tables <- dbGetQuery(
  con, "SELECT [TABLE_NAME] FROM [database_name].INFORMATION_SCHEMA.TABLES"
)$TABLE_NAME

# RETURN NAMED LIST OF DATA FRAMES FOR EACH DB TABLE
df_list <- sapply(db_tables, function(t) dbReadTable(con, t), simplify = FALSE)
You can extend the lambda function to run multiple steps, such as write.csv, or use a named function instead. Just be sure the last line returns a data frame. The version below uses the native pipe, |>, available in base R 4.1.0+:
db_table_head_to_csv <- function(table) {
  head_df <- dbReadTable(con, table) |> head()

  write.csv(
    head_df,
    file.path("00_exports", "bibliometrics_tables", paste0(table, ".csv"))
  )

  return(head_df)
}
df_list <- sapply(db_tables, db_table_head_to_csv, simplify = FALSE)
A data frame loses no functionality when stored in a list, and you can extract it by name with $ or [[:
# EXTRACT SPECIFIC ELEMENT
head(df_list$table_1)
tail(df_list[["table_2"]])
summary(df_list$`table_3`)
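As a possible extension (not part of the original answer), the element names also make it easy to iterate over each data frame together with its table name, which was the sticking point in the question, for example with purrr::iwalk():
# A sketch assuming df_list from above; `df` is the data frame, `name` its table name
purrr::iwalk(df_list, function(df, name) {
  write.csv(head(df), file.path("00_exports", paste0(name, ".csv")), row.names = FALSE)
})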

RSQLite dbWriteTable not working on large data

Here is my code, where I am trying to write data from R to an SQLite database file.
library(DBI)
library(RSQLite)
library(dplyr)
library(data.table)
library(readr)   # for read_rds()

con <- dbConnect(RSQLite::SQLite(), "data.sqlite")

### Read the file you want to load to the SQLite database
data <- read_rds("data.rds")

### Make the column names safe for the database
dbSafeNames <- function(names) {
  names <- gsub('[^a-z0-9]+', '_', tolower(names))
  names <- make.names(names, unique = TRUE, allow_ = TRUE)
  names <- gsub('.', '_', names, fixed = TRUE)
  names
}
colnames(data) <- dbSafeNames(colnames(data))

### Load the dataset to the SQLite database
dbWriteTable(conn = con, name = "data", value = data, row.names = FALSE, header = TRUE)
While writing the 80 GB of data, I can see the size of data.sqlite increase up to 45 GB; then it stops and throws the following error.
Error in rsqlite_send_query(conn@ptr, statement) : disk I/O error
Error in rsqlite_send_query(conn@ptr, statement) :
  no such savepoint: dbWriteTable
What is the fix, and what should I do? If this limitation is specific to RSQLite, please suggest a more robust alternative such as RMySQL, RPostgreSQL, etc.
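One thing worth trying, sketched here as an unverified workaround: write the data in chunks so that each dbWriteTable() call is a smaller transaction, since the rollback journal for a single 80 GB insert can itself consume a large amount of disk space; also check that the drive holding data.sqlite and SQLite's temporary files has enough free space.
# Unverified sketch: append the data in chunks; the chunk size is an arbitrary choice
chunk_size <- 1e6
starts <- seq(1, nrow(data), by = chunk_size)
for (s in starts) {
  e <- min(s + chunk_size - 1, nrow(data))
  dbWriteTable(con, "data", data[s:e, ], append = (s > 1), row.names = FALSE)
}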

run lm command with zeros and N/A in database

I'm currently working with R. My database has a lot of zeros and N/A values, so when I run the lm() command with a log-log model it doesn't work. Any advice?
This is not the best answer, but it got me working: I store the results from my database in a new variable and prune all missing values out (in my case, cells that literally have nothing in them, i.e. ""), like this:
library(RODBC)        # database on a SQL box
library(tidyverse)

myConnection <- odbcConnect("name_of_your_DSN")  # I create a DSN first
Data <- sqlFetch(channel = myConnection, sqtable = "name_of_your_table")

empty_as_na <- function(x) {
  if ("factor" %in% class(x)) x <- as.character(x)  # since ifelse won't work with factors
  ifelse(as.character(x) != "", x, NA)
}
Data <- Data %>% mutate(across(everything(), empty_as_na))  # mutate_all(funs(...)) is deprecated
You might also consider experimenting with adding this after loading the RODBC library:
options(stringsAsFactors = FALSE)
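A possible follow-up, not part of the original answer: once the blanks are recoded as NA, the log-log model still needs the zeros handled, because log(0) is -Inf and lm() cannot use it. A minimal sketch, where y and x are placeholder column names:
# Drop non-positive values before taking logs; remaining NA rows are dropped by na.omit
model_data <- Data %>% dplyr::filter(y > 0, x > 0)
fit <- lm(log(y) ~ log(x), data = model_data, na.action = na.omit)
summary(fit)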

Can't append data to sqlite3 with dplyr db_write_table()

Looking into adding data to a table with dplyr, I saw https://stackoverflow.com/a/26784801/1653571 but the documentation says db_insert_into() is deprecated.
?db_insert_into()
...
db_create_table() and db_insert_into() have been deprecated in favour of db_write_table().
...
I tried to use the non-deprecated db_write_table() instead, but it fails both with and without the append= option:
require(dplyr)
my_db <- src_sqlite("my_db.sqlite3", create = TRUE)   # create src
copy_to(my_db, iris, "my_table", temporary = FALSE)   # create table
newdf <- iris                                          # create new data
db_write_table(con = my_db$con, table = "my_table", values = newdf)                 # insert into
# Error: Table `my_table` exists in database, and both overwrite and append are FALSE
db_write_table(con = my_db$con, table = "my_table", values = newdf, append = TRUE)  # insert into
# Error: Table `my_table` exists in database, and both overwrite and append are FALSE
Should one be able to append data with db_write_table()?
See also https://github.com/tidyverse/dplyr/issues/3120
No, you shouldn't use db_write_table() instead of db_insert_into(), since it can't be generalized across backends.
And you shouldn't use the dplyr versions rather than the corresponding DBI:: functions, since the dplyr helpers are intended for internal use and are not designed to be robust enough for end users. See the discussion at https://github.com/tidyverse/dplyr/issues/3120#issuecomment-339034612 :
Actually, I don't think you should be using these functions at all. Despite that SO post, these are not user facing functions. You should be calling the DBI functions directly.
-- Hadley Wickham, package author.
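In practice that means appending through DBI directly; a minimal sketch (not from the original answer) using the objects from the question:
# Append rows using DBI; my_db$con is the underlying DBIConnection
DBI::dbWriteTable(my_db$con, "my_table", newdf, append = TRUE, row.names = FALSE)

# or, on DBI >= 1.0.0, the dedicated append helper
DBI::dbAppendTable(my_db$con, "my_table", newdf)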

List tables within a Postgres schema using R

I'm connecting to a PostgreSQL db using R and the RPostgreSQL package. The db has a number of schemas and I would like to know which tables are associated with a specific schema.
So far I have tried:
dbListTables(db, schema="sch2014")
dbGetQuery(db, "dt sch2014.*")
dbGetQuery(db, "\dt sch2014.*")
dbGetQuery(db, "\\dt sch2014.*")
None of which have worked.
This related question also exists: Setting the schema name in postgres using R, which would solve the problem by defining the schema at the connection. However, it has not yet been answered!
Reading this answer https://stackoverflow.com/a/15644435/2773500 helped. I can use the following to get the tables associated with a specific schema:
dbGetQuery(db,
           "SELECT table_name FROM information_schema.tables
            WHERE table_schema = 'sch2014'")
The following should work (using DBI v1.1.1):
DBI::dbListObjects(conn, DBI::Id(schema = 'schema_name'))
While it has all the info you want, it's hard to access and hard to read.
I would recommend something that produces a data frame:
# get a hard-to-read table given some Postgres connection `conn`
x = DBI::dbListObjects(conn, DBI::Id(schema = 'schema_name'))

# - extract column "table", a list of formal class 'Id' objects, then
# - extract the 'name' slot of each S4 object element
#   (could also do `lapply(x$table, function(x) x@name)`)
v = lapply(x$table, function(x) slot(x, 'name'))

# create a data frame with columns 'schema' and 'table'
d = as.data.frame(do.call(rbind, v))
Or in one line:
d = as.data.frame(do.call(rbind, lapply(DBI::dbListObjects(conn, DBI::Id(schema = 'schema_name'))$table, function(x) slot(x, 'name'))))
Or in a more "tidy" way:
conn %>%
DBI::dbListObjects(DBI::Id(schema = 'schema_name')) %>%
dplyr::pull(table) %>%
purrr::map(~slot(.x, 'name')) %>%
dplyr::bind_rows()
OUTPUT is something like
> d
       schema  table
1 schema_name mtcars
You can use the table_schema option rather than just schema to see a list of tables within the specific schema. So, keeping with your example code snippet above, the line below should work:
dbListTables(db, table_schema="sch2014")
