I have a JDBC connection and would like to query data from one schema and save the result to another.
library(tidyverse)
library(dbplyr)
library(rJava)
library(RJDBC)
# access the temp table in the native schema
temp <- tbl(conn, "temp")
temp_ed <- temp %>% mutate(new = 1)
# I would like to save temp_ed to a new schema "schema_new"
I would like to use something like dbplyr::compute() but define the output schema specifically. It seems dbplyr::copy_to could be used, but that would require bringing the data through the local machine.
I want to use something like RJDBC::dbSendUpdate() but which would ideally integrate nicely with the data manipulating pipeline above.
I do this using dbExecute from the DBI package.
The key idea is to extract the query that defines the current remote table and make this a sub-query in a larger SQL query that writes the table. This requires that (1) the schema exists, (2) you have permission to write new tables, and (3) you know the correct SQL syntax.
Doing this directly might look like:
temp <- tbl(conn, "temp")
temp_ed <- temp %>% mutate(new = 1)
save_table_query <- paste(
  "SELECT * INTO my_database.schema_new.my_table FROM (\n",
  dbplyr::sql_render(temp_ed),
  "\n) AS sub_query"
)
dbExecute(conn, as.character(save_table_query))
INTO is the clause for writing a new table in SQL server (the flavour of SQL I use). You will need to find the equivalent clause for your database.
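For example, in PostgreSQL (and several other databases) the equivalent is CREATE TABLE ... AS. A sketch, assuming a PostgreSQL-style schema-qualified name (adjust for your database):

CREATE TABLE schema_new.my_table AS
SELECT *
FROM (
  -- the query rendered by dbplyr::sql_render() goes here
) AS sub_query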
In practice I use a custom function that looks something like this:
write_to_database <- function(input_tbl, db, schema, tbl_name){
  # connection of the input table
  tbl_connection <- input_tbl$src$con
  # build the SQL query, with the rendered dbplyr query as a sub-query
  sql_query <- glue::glue(
    "SELECT *\n",
    "INTO {db}.{schema}.{tbl_name}\n",
    "FROM (\n",
    dbplyr::sql_render(input_tbl),
    "\n) AS sub_query"
  )
  # dbExecute returns the number of rows affected
  result <- dbExecute(tbl_connection, as.character(sql_query))
}
Applying this in your context:
temp <- tbl(conn, "temp")
temp_ed <- temp %>% mutate(new = 1)
write_to_database(temp_ed, "my_database", "schema_new", "my_table")
What combination of dbplyr verbs is equivalent to DBI::dbSendQuery(con, "DELETE FROM <table> WHERE <condition>")?
What I want is not to query data from the database, but to remove data from, and update, a table in the database.
I want to do it in a dplyr way, but I am not sure if it is possible. I could not find anything similar in the package reference.
dbplyr translates dplyr commands to query database tables. I am not aware of any inbuilt way to modify existing database tables using pure dbplyr.
This is likely a design choice.
Within R we do not need to distinguish between fetching data from a table (querying) and modifying a table. This is probably because in R we can reload the original data into memory if an error/mistake occurs.
But in databases querying and modifying a table are deliberately different things. When modifying a database, you are modifying the source so additional controls are used (because recovering deleted data is a lot harder).
The DBI package is probably your best choice for modifying the database
This is the approach I use for all my dbplyr work. I often use a custom function that takes the query produced by the dbplyr translation and inserts it into a DBI call (you can see examples of this in my dbplyr helpers GitHub repo).
Two approaches to consider for this: (1) an anti-join (on all columns) followed by writing a new table, (2) the DELETE FROM syntax.
Mock up of anti-join approach
records_to_remove <- remote_table %>%
  filter(conditions)

desired_final_table <- remote_table %>%
  anti_join(records_to_remove, by = colnames(remote_table))

query <- paste0(
  "SELECT * INTO output_table FROM (",
  sql_render(desired_final_table),
  ") AS subquery"
)

DBI::dbExecute(db_con, as.character(query))
Mock up of DELETE FROM syntax
records_to_remove <- remote_table %>%
  filter(conditions)

query <- sql_render(records_to_remove) %>%
  as.character() %>%
  gsub(pattern = "SELECT *", replacement = "DELETE", x = ., fixed = TRUE)

DBI::dbExecute(db_con, query)
If you plan to run these queries multiple times, then wrapping them in a function, with checks for validity would be recommended.
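As a sketch of such a wrapper, built on the anti-join approach above (the function name and the specific checks are illustrative, not from any package):

remove_rows_by_rewrite <- function(remote_table, records_to_remove, output_table, db_con){
  # validity checks before touching the database
  stopifnot(inherits(remote_table, "tbl_sql"))
  stopifnot(is.character(output_table), length(output_table) == 1)
  # keep only the rows that are not flagged for removal
  desired_final_table <- remote_table %>%
    anti_join(records_to_remove, by = colnames(remote_table))
  # write the result out as a new table (SQL Server INTO syntax)
  query <- paste0(
    "SELECT * INTO ", output_table, " FROM (",
    dbplyr::sql_render(desired_final_table),
    ") AS subquery"
  )
  # returns the number of rows affected
  DBI::dbExecute(db_con, as.character(query))
}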
For some use cases deleting rows will not be necessary.
You could think of the filter command in R as deleting rows from a table. For example in R we might run:
prepared_table <- input_table %>%
  filter(colX == 1) %>%
  select(colA, colB, colZ)
And think of this as deleting rows where colX == 1 before producing output:
output <- prepared_table %>%
  group_by(colA) %>%
  summarise(sumZ = sum(colZ))
(Or you could use an anti-join above instead of a filter.)
But for this type of deleting, you do not need to edit the source data, as you can just filter out the unwanted rows at runtime every time. Yes it will make your database query larger, but this is normal for working with databases.
So combining the preparation and output in SQL is normal (something like this):
SELECT colA, SUM(colZ) AS sumZ
FROM (
  SELECT colA, colB, colZ
  FROM input_table
  WHERE colX = 1
) AS prepared_table
GROUP BY colA
So unless you need to modify the database, I would recommend filtering instead of deleting.
I am sure this question is very basic, but this is the first time I am using R connected to a server, so a few things still confuse me.
I used ODBC Data Sources on Windows to create a DSN, and used
con <- dbConnect(odbc::odbc(), "TEST_SERVER")
this worked, and now under the Connections tab I can see the server, and if I double-click I can see the databases and tables that exist on the server. How would I go about reading something inside one of those databases?
For example, if the database name is db1 and the table name is t1, what is the code needed to read that table into local memory? I would prefer using dbplyr as I am familiar with the syntax; I am just unsure how to refer to a particular database and table after making the connection to the server.
I haven't used dbplyr before, but you can query the database using dbGetQuery.
test <- dbGetQuery(
  con,
  "SELECT *
   FROM db1.t1"
)
You can also pass the database into the connection string.
con <- dbConnect(
  drv = odbc(),
  dsn = "TEST_SERVER",
  database = "db1"
)
And then your query would just be "SELECT * FROM t1".
EDIT: To query the table using dbplyr:
tbl1 <- tbl(con, "t1")
qry <- tbl1 %>% head() %>% collect()
I like to use RODBC:
con <- RODBC::odbcConnect(dsn = 'your_dsn',
                          uid = 'userid',
                          pwd = 'password')
table_output <- RODBC::sqlQuery(con, 'SELECT * FROM Table')
I would like to have a list of the queries sent to the database by my R script, but it is a bit unclear to me how/where some operations involving local data frames and database tables are performed.
As mentioned in this post, it seems that when performing an operation between a data frame in the local environment and a table from a DBI connection (e.g. a left_join(..., copy = TRUE), where copy = TRUE is needed because the data comes from different sources), the operations are performed on the database side, working with temporary tables.
I tried to verify this using the show_query() to see exactly what is sent to the database and what is not.
I cannot give a proper reproducible example as it involves a database connection, but here is the logic:
con <- DBI::dbConnect(odbc::odbc(),
                      Driver = "SQL Server",
                      Server = "server",
                      Database = "database",
                      UID = "user",
                      PWD = "pwd",
                      Port = port)
db_table <- tbl(con, "tbl_A")
local_df <- read.csv("/.../file.csv", stringsAsFactors = FALSE)
q1 <- local_df %>% inner_join(db_table, by = c('id' = 'id'), copy = TRUE)
Below are the outputs of the show_query() statements:
> db_table %>% show_query()
<SQL>
SELECT *
FROM "tbl_A"
> q1 %>% show_query()
Error in UseMethod("show_query") :
no applicable method for 'show_query' applied to an object of class "data.frame"
This makes me think that in that sequence, the only operation performed on the database side is SELECT * FROM "tbl_A", and that q1 is computed in the local environment using local_df and a local copy of the database table.
I tried to have a look at the dplyr documentation, but there is no information on what happens when data comes from multiple sources.
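One thing worth checking (a sketch, assuming the same connection as above): dplyr dispatches on the class of its first argument, so with the remote table on the left and copy = TRUE, the local data frame is copied into a temporary table and the join stays lazy on the database side:

q2 <- db_table %>% inner_join(local_df, by = c('id' = 'id'), copy = TRUE)
q2 %>% show_query()
# now prints <SQL>, because q2 is a remote (lazy) table rather than a data.frame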
I have a connection to our database:
con <- dbConnect(odbc::odbc(), "myHive")
I know this is successful because when I run it, in the top right of RStudio I can see all of our databases and tables.
My question is, how can I select a specific database and table combination? The documentation shows a user selecting a single table, "flights", but I need to do the equivalent of somedatabase.sometable.
Tried:
mytable <- tbl(con, "somedb.sometable")
Error in new_result(connection@ptr, statement) :
nanodbc/nanodbc.cpp:1344: 42S02: [Hortonworks][SQLEngine] (31740) Table or view not found: HIVE..dp_enterprise.uds_order
Then tried:
mytable <- tbl(con, "somedb::sometable")
Error in new_result(connection@ptr, statement) :
nanodbc/nanodbc.cpp:1344: 42S02: [Hortonworks][SQLEngine] (31740) Table or view not found: HIVE..somedb::sometable
I tried removing the quotes "" too.
Within the connections pane of RStudio I can see somedb.sometable. It's there! How can I save it to variable mytable?
You select the database when creating the connection and the table when creating the tbl (with the from argument).
There is no standard interface to dbConnect, so the exact way to pass the database name depends on the driver you use. Indeed, DBI::dbConnect is simply a generic dispatching to the driver-specific dbConnect method.
In your case, the driver is odbc so you can check out the documentation for odbc::dbConnect and you'll see the relevant argument is database.
This will work:
con <- dbConnect(odbc::odbc(), "myHive", database = "somedb")
df <- tbl(con, from = "sometable")
With most other drivers (e.g. RMariaDB, RMySQL, RPostgres, RSQLite), the argument is called dbname, so you'd do this:
con <- dbConnect(RMariaDB::MariaDB(), dbname = "somedb")
df <- tbl(con, from = "sometable")
I think I found it: use in_schema.
mytable <- tbl(con, in_schema("somedb", "sometable"))
This returns a list rather than a tbl, though, so I'm not sure.
I want to use dbWriteTable() from R's DBI package to write data into a database. Usually, the respective tables are already present, so I use the argument append = TRUE. How do I get which rows were added to the table by dbWriteTable()? Most of the tables have certain columns with UNIQUE values, so a SELECT will work (see below for a simple example). However, this is not true for all of them, or only several columns together are UNIQUE, making the SELECT more complicated. In addition, I would like to put the writing and querying into a function, so I would prefer a consistent approach for all cases.
I mainly need this to get the PRIMARY KEYs added by the database and to allow a user to quickly see what was added. If important, my database is PostgreSQL and I would like to use the odbc package for the connection.
I have something like this in mind, however, I am looking for a more general solution:
library(DBI)
con <- dbConnect(odbc::odbc(), dsn = "database")
dbWriteTable(con,
             name = "site",
             value = data.frame(name = c("abcd", "efgh")),
             append = TRUE)
dbGetQuery(con,
           paste("SELECT * FROM site WHERE name IN ('abcd', 'efgh');"))