Looking into adding data to a table with dplyr, I saw https://stackoverflow.com/a/26784801/1653571 but the documentation says db_insert_table() is deprecated.
?db_insert_into()
...
db_create_table() and db_insert_into() have been deprecated in favour of db_write_table().
...
I tried to use the non-deprecated db_write_table() instead, but it fails both with and without the append= option:
require(dplyr)
my_db <- src_sqlite( "my_db.sqlite3", create = TRUE) # create src
copy_to( my_db, iris, "my_table", temporary = FALSE) # create table
newdf = iris # create new data
db_write_table( con = my_db$con, table = "my_table", values = newdf) # insert into
# Error: Table `my_table` exists in database, and both overwrite and append are FALSE
db_write_table( con = my_db$con, table = "my_table", values = newdf,append=True) # insert into
# Error: Table `my_table` exists in database, and both overwrite and append are FALSE
Should one be able to append data with db_write_table()?
See also https://github.com/tidyverse/dplyr/issues/3120
No, you shouldn't use db_write_table() instead of db_insert_table(), since it can't be generalized across backends.
And you shouldn't use the tidyverse versions rather than the relevant DBI::: versions, since the tidyverse helper functions are for internal use, and not designed to be robust enough for use by users. See the discussion at https://github.com/tidyverse/dplyr/issues/3120#issuecomment-339034612 :
Actually, I don't think you should be using these functions at all. Despite that SO post, these are not user facing functions. You should be calling the DBI functions directly.
-- Hadley Wickham, package author.
Related
I am connecting to a SQL Server database through the ODBC connection in R. I have two potential methods to get data, and am trying to determine which would be more efficient. The data is needed for a Shiny dashboard, so the data needs to be pulled while the app is loading rather than querying on the fly as the user is using the app.
Method 1 is to use over 20 stored procedures to query all of the needed data and store them for use. Method 2 is to query all of the tables individually.
Here is the method I used to query one of the stored procedures:
get_proc_data <- function(proc_name, url, start_date, end_date){
dbGetQuery(con, paste0(
"EXEC dbo.", proc_name, " ",
"#URL = N'", url, "', ",
"#Startdate = '", start_date, "', ",
"#enddate = '", end_date, "' "
))
}
data <- get_proc_data(proc_name, url, today(), today() %m-% years(5))
However, each of the stored procedures has a slightly different setup for the parameters, so I would have to define each of them separately.
I have started to implement Method 2, but have run into issues with iteratively querying each table.
# use dplyr create list of table names
db_tables <- dbGetQuery(con, "SELECT * FROM [database_name].INFORMATION_SCHEMA.TABLES;") %>% select(TABLE_NAME)
# use dplyr pull to create list
table_list <- pull(db_tables , TABLE_NAME)
# get a quick look at the first few rows
tbl(con, "[TableName]") %>% head() %>% glimpse()
# iterate through all table names, get the first five rows, and export to .csv
for (table in table_list){
write.csv(
tbl(con, table) %>% head(), str_glue("{getwd()}/00_exports/tables/{table}.csv")
)
}
selected_tables <- db_tables %>% filter(TABLE_NAME == c("TableName1","TableName2"))
Ultimately this method was just to test how long it would take to iterate through the ~60 tables and perform the required function. I have tried putting this into a function instead but have not been able to get it to iterate through while also pulling the name of the table.
Pro/Con for Method 1: The stored procs are currently powering a metrics plug-in that was written in C++ and is displaying metrics on the webpage. This is for internal use to monitor website performance. However, the stored procedures are not all visible to me and the client needs me to extend their current metrics. I also do not have a DBA at my disposal to help with the SQL Server side, and the person who wrote the procs is unavailable. The procs are also using different logic than each other, so joining the results of two different procs gives drastically different values. For example, depending on the proc, each date will list total page views for each day or already be aggregated at the weekly or monthly scale then listed repeatedly. So joining and grouping causes drastic errors in actual page views.
Pro/Con for Method 2: I am familiar with dplyr and would be able to join the tables together to pull the data I need. However, I am not as familiar with SQL and there is no Entity-Relationship Diagram (ERD) of any sort to refer to. Otherwise, I would build each query individually.
Either way, I am trying to come up with a way to proceed with either a named function, lambda function, or vectorized method for iterating. It would be great to name each variable and assign them appropriately so that I can perform the data wrangling with dplyr.
Any help would be appreciated, I am overwhelmed with which direction to go. I researched the equivalent to Python list comprehension in R but have not been able get the function in R to perform similarly.
> db_table_head_to_csv <- function(table) {
+ write.csv(
+ tbl(con, table) %>% head(), str_glue("{getwd()}/00_exports/bibliometrics_tables/{table}.csv")
+ )
+ }
>
> bibliometrics_tables %>% db_table_head_to_csv()
Error in UseMethod("as.sql") :
no applicable method for 'as.sql' applied to an object of class "data.frame"
Consider storing all table data in a named list (counterpart to Python dictionary) using lapply (counterpart to Python's list/dict comprehension). And if you use its sibling, sapply, the character vector passed in will return as names of elements:
# RETURN VECTOR OF TABLE NAMES
db_tables <- dbGetQuery(
con, "SELECT [TABLE_NAME] FROM [database_name].INFORMATION_SCHEMA.TABLES"
)$TABLE_NAME
# RETURN NAMED LIST OF DATA FRAMES FOR EACH DB TABLE
df_list <- sapply(db_tables, function(t) dbReadTable(conn, t), simplify = FALSE)
You can extend the lambda function for multiple steps like write.csv or use a defined method. Just be sure to return a data frame as last line. Below uses the new pipe, |> in base R 4.1.0+:
db_table_head_to_csv <- function(table) {
head_df <- dbReadTable(con, table) |> head()
write.csv(
head_df,
file.path(
"00_exports", "bibliometrics_tables", paste0(table, ".csv")
)
)
return(head_df)
}
df_list <- sapply(db_tables, db_table_head_to_csv, simplify = FALSE)
You lose no functionality of data frame object if stored in a list and can extract with $ or [[ by name:
# EXTRACT SPECIFIC ELEMENT
head(df_list$table_1)
tail(df_list[["table_2"]])
summary(df_list$`table_3`)
I have a set of targets, lets say data_a, data_b, ...
I want to iterate over all datasets and load the data.
This can be achieved using tar_read(data_a) or tar_read(data_a"). As I want to load the targets programmatically, I would like to use something like this in some kind of lapply:
target_name <- "data_a"
data <- tar_read(target_name)
But then I get the error that the target target_name was not found.
I know this is related to NSE with R as tar_read internally calls substitute, but I wasn't able to figure out how to mask the target_name to make tar_read work. I have tried eval(parse()) and the different options presented in Advanced R, as well as rlang (such as !!, {{ and similar) to no avail.
Any idea how to achieve this?
If you look at the code for tar_read, you see that it uses NSE to convert the name parameter into a character string, then calls the function tar_read_raw on the resulting string:
tar_read
#> function (name, branches = NULL, meta = tar_meta(store = store),
# store = targets::tar_config_get("store"))
#> {
#> force(meta)
#> name <- tar_deparse_language(substitute(name))
#> tar_read_raw(name = name, branches = branches, meta = meta,
#> store = store)
#> }
However, you can also use tar_read_raw directly. The manual for tar_read_raw says:
Like tar_read() except name is a character string.
So you should just be able to do:
data <- tar_read_raw(target_name)
I'm connecting to a PostgreSQL db using R and the RPostgreSQL package. The db has a number of schemas and I would like to know which tables are associated with a specific schema.
So far I have tried:
dbListTables(db, schema="sch2014")
dbGetQuery(db, "dt sch2014.*")
dbGetQuery(db, "\dt sch2014.*")
dbGetQuery(db, "\\dt sch2014.*")
None of which have worked.
This related question also exists: Setting the schema name in postgres using R, which would solve the problem by defining the schema at the connection. However, it's not yet been answered!
Reading this answer https://stackoverflow.com/a/15644435/2773500 helped. I can use the following to get the tables associated with a specific schema:
dbGetQuery(db,
"SELECT table_name FROM information_schema.tables
WHERE table_schema='sch2014'")
The following should work (using DBI_v1.1.1)
DBI::dbListObjects(conn, DBI::Id(schema = 'schema_name'))
While it has all the info you want, it's hard to access and hard to read.
I would recommend something that produces a data frame:
# get a hard to read table given some Postgres connection `conn`
x = DBI::dbListObjects(conn, DBI::Id(schema = 'schema_name'))
# - extract column "table" comprising a list of Formal class 'Id' objects then
# - extract the 'name' slot for each S4 object element
# could also do `lapply(d$table, function(x) x#name)`
v = lapply(x$table, function(x) slot(x, 'name'))
# create a dataframe with header 'schema', 'table'
d = as.data.frame(do.call(rbind, v))
Or in one line:
d = as.data.frame(do.call(rbind, lapply(DBI::dbListObjects(conn, DBI::Id(schema = 'schema_name'))$table, function(x) slot(x, 'name'))))
Or in a more "tidy" way:
conn %>%
DBI::dbListObjects(DBI::Id(schema = 'schema_name')) %>%
dplyr::pull(table) %>%
purrr::map(~slot(.x, 'name')) %>%
dplyr::bind_rows()
OUTPUT is something like
> d
schema table
1 schema_name mtcars
You can use the table_schema option rather than just schema to see a list of tables within the specific schema. So keeping with your example code snippet above, the below line should work:
dbListTables(db, table_schema="sch2014")
I have a RODBC connection using odbcConnect to a DB.
I am reading in a table into a variable.
This table has a text column, however when I submit a subset on the text column, I get a factor.
I am looking for a stringsasfactors=FALSE equivalent while reading a table using RODBC.
Any thoughts on how I might be able to achieve this?
Thanks!
Use eg the stringsAsFactors option to sqlQuery. Documentation is eg here.
Modifying the basic example from the help page:
channel <- odbcConnect("test")
sqlSave(channel, USArrests, rownames = "State", verbose = TRUE)
options(dec=".") # optional, if DBMS is not locale-aware or set to ASCII
## note case of State, Murder, Rape are DBMS-dependent,
## and some drivers need column and table names double-quoted.
sqlQuery(channel, paste("select State, Murder from USArrests",
"where Rape > 30 order by Murder"),
stringsAsFactors=FALSE) ## your option here
close(channel)
As sqlFetch() passes arguments through via ..., it will work the same way. Just add stringsAsFactors=FALSE, or even set it globally via options().
I'm (very!) new to R and mysql and I have been struggling and researching for this problem for days. So I would really appreciate ANY help.
I need to complete a mathematical expression from 2 variables in two different tables. Essentially, I'm trying to figure out how old a subject was (DOB is in one table) when they were serviced (date of service is in another table). I have an identifying variable that is the same in both.
I have tried merging these:
age<-merge("tbl1", "tbl2", by=c("patient_id") all= TRUE)
this returns:
Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column
I have tried sub-setting where I just keep the variables of interest, but is is not working because I believe sub-setting only works for numbers not characters..right?
Again, I would appreciate any help. Thanks in advance
Since you are new to data.base , I think you should use dplyr here. It is an abstraction layer for many data base providers. So you will not have to deal with db problems. Here I show you a simple example of where:
I read the tables from the MYSQL
Merge the table assuming having unique shared variable
The code:
library(dplyr)
library(RMySQL)
## create a connection
SDB <- src_mysql(host = "localhost", user = "foo", dbname = "bar", password = getPassword())
# reading tables
tbl1 <- tbl(SDB, "TABLE1_NAME")
tbl2 <- tbl(SDB, "TABLE2_NAME")
## merge : this step can be done using dplyr also
age <- merge(tbl1, tbl2, all= TRUE)