Using RPostgreSQL and INSERT INTO - Do I need to bind my dataframe? - r

I'm writing a query using RPostgreSQL meant to fill a table one time. I really have no intention to do anything else with the data within R. I just need it for is to run the function to fill that table.
library(data.table)
library (RPostgreSQL)
MakeAndGetQuery <- function(id) {
q <- paste0("INSERT INTO table_a SELECT * FROM table_c WHERE client_id = ",
id,
" AND event_date = CURRENT_DATE - 1")
as.data.table(dbGetQuery(conn2, q))
}
all_yer_data <- rbindlist(lapply(generate_id$client_id, MakeAndGetQuery))
setkey(all_yer_data, id, ...)
So my question is, will not doing anything with the dataframe within R impact if it runs successfully? In theory, that SQL statement shouldn't even produce any results within R. It's using INSERT INTO from Redshift, so if I ran that in Redshift, it wouldn't return any results, just a message saying it was successfull and "5 Rows affected"

Related

R query database tables iteratively without for loop with lambda or vectorized function for Shiny app

I am connecting to a SQL Server database through the ODBC connection in R. I have two potential methods to get data, and am trying to determine which would be more efficient. The data is needed for a Shiny dashboard, so the data needs to be pulled while the app is loading rather than querying on the fly as the user is using the app.
Method 1 is to use over 20 stored procedures to query all of the needed data and store them for use. Method 2 is to query all of the tables individually.
Here is the method I used to query one of the stored procedures:
get_proc_data <- function(proc_name, url, start_date, end_date){
dbGetQuery(con, paste0(
"EXEC dbo.", proc_name, " ",
"#URL = N'", url, "', ",
"#Startdate = '", start_date, "', ",
"#enddate = '", end_date, "' "
))
}
data <- get_proc_data(proc_name, url, today(), today() %m-% years(5))
However, each of the stored procedures has a slightly different setup for the parameters, so I would have to define each of them separately.
I have started to implement Method 2, but have run into issues with iteratively querying each table.
# use dplyr create list of table names
db_tables <- dbGetQuery(con, "SELECT * FROM [database_name].INFORMATION_SCHEMA.TABLES;") %>% select(TABLE_NAME)
# use dplyr pull to create list
table_list <- pull(db_tables , TABLE_NAME)
# get a quick look at the first few rows
tbl(con, "[TableName]") %>% head() %>% glimpse()
# iterate through all table names, get the first five rows, and export to .csv
for (table in table_list){
write.csv(
tbl(con, table) %>% head(), str_glue("{getwd()}/00_exports/tables/{table}.csv")
)
}
selected_tables <- db_tables %>% filter(TABLE_NAME == c("TableName1","TableName2"))
Ultimately this method was just to test how long it would take to iterate through the ~60 tables and perform the required function. I have tried putting this into a function instead but have not been able to get it to iterate through while also pulling the name of the table.
Pro/Con for Method 1: The stored procs are currently powering a metrics plug-in that was written in C++ and is displaying metrics on the webpage. This is for internal use to monitor website performance. However, the stored procedures are not all visible to me and the client needs me to extend their current metrics. I also do not have a DBA at my disposal to help with the SQL Server side, and the person who wrote the procs is unavailable. The procs are also using different logic than each other, so joining the results of two different procs gives drastically different values. For example, depending on the proc, each date will list total page views for each day or already be aggregated at the weekly or monthly scale then listed repeatedly. So joining and grouping causes drastic errors in actual page views.
Pro/Con for Method 2: I am familiar with dplyr and would be able to join the tables together to pull the data I need. However, I am not as familiar with SQL and there is no Entity-Relationship Diagram (ERD) of any sort to refer to. Otherwise, I would build each query individually.
Either way, I am trying to come up with a way to proceed with either a named function, lambda function, or vectorized method for iterating. It would be great to name each variable and assign them appropriately so that I can perform the data wrangling with dplyr.
Any help would be appreciated, I am overwhelmed with which direction to go. I researched the equivalent to Python list comprehension in R but have not been able get the function in R to perform similarly.
> db_table_head_to_csv <- function(table) {
+ write.csv(
+ tbl(con, table) %>% head(), str_glue("{getwd()}/00_exports/bibliometrics_tables/{table}.csv")
+ )
+ }
>
> bibliometrics_tables %>% db_table_head_to_csv()
Error in UseMethod("as.sql") :
no applicable method for 'as.sql' applied to an object of class "data.frame"
Consider storing all table data in a named list (counterpart to Python dictionary) using lapply (counterpart to Python's list/dict comprehension). And if you use its sibling, sapply, the character vector passed in will return as names of elements:
# RETURN VECTOR OF TABLE NAMES
db_tables <- dbGetQuery(
con, "SELECT [TABLE_NAME] FROM [database_name].INFORMATION_SCHEMA.TABLES"
)$TABLE_NAME
# RETURN NAMED LIST OF DATA FRAMES FOR EACH DB TABLE
df_list <- sapply(db_tables, function(t) dbReadTable(conn, t), simplify = FALSE)
You can extend the lambda function for multiple steps like write.csv or use a defined method. Just be sure to return a data frame as last line. Below uses the new pipe, |> in base R 4.1.0+:
db_table_head_to_csv <- function(table) {
head_df <- dbReadTable(con, table) |> head()
write.csv(
head_df,
file.path(
"00_exports", "bibliometrics_tables", paste0(table, ".csv")
)
)
return(head_df)
}
df_list <- sapply(db_tables, db_table_head_to_csv, simplify = FALSE)
You lose no functionality of data frame object if stored in a list and can extract with $ or [[ by name:
# EXTRACT SPECIFIC ELEMENT
head(df_list$table_1)
tail(df_list[["table_2"]])
summary(df_list$`table_3`)

R: how to query from a database multiple times based on different dates

I have an R Markup file in which I establish a database connection, query data, and store the data in an csv file. The query is based on a specific date range. How can I automated make multiple queries, so that one after another e.g. every week is queried from the database? I cannot make a query for e.g. the whole year, but I need to store the data separately for each week. I could make a data frame, in which I have two columns for the start and end date, which I would like to use for the query.
But how can I automatically run the queries multiple times depending on the date data frame?
My code so far:
#load libraries
drv <- PostgreSQL()
db_con <- dbConnect(drv, host=my_host, user=my_user, dbname=my_name, port=my_port, password=my_password)
start = "2015-01-01"
end = "2015-01-02"
result <- dbGetQuery(
db_con,
"SELECT * FROM table WHERE date >= start AND date <= end;")
st_write(result, pathname)
Consider parameterization using DBI::sqlInterpolate with a Map (wrapper to mapply) iteration:
db_con <- dbConnect(
PostgreSQL(), host=my_host, user=my_user, dbname=my_name,
port=my_port, password=my_password)
)
# ALL WEEKLY DATES IN 2015
dates_df <- data.frame(
start=seq.Date(as.Date("2015-01-01"), as.Date("2016-01-01"), by="week"),
end=seq.Date(as.Date("2015-01-08"), as.Date("2016-01-08"), by="week")
)
# USER DEFINED METHOD TO QUERY AND WRITE DATA
query_db <- function(s, e) {
# PREPARED STATEMENT WITH PLACEHOLDERS
sql <- "SELECT * FROM table WHERE date >= ?start AND date <= ?end;"
# BIND PARAMETERS AND QUERY
stmt <- DBI::sqlInterpolate(db_con, sql, start=s, end=e)
result <- dbGetQuery(db_con, stmt)
# WRITE DATA TO DISK
st_write(result, pathname)
# RETURN QUERY RESULTSET
return(result)
}
# WRITE AND STORE DATA IN MEMORY
df_list <- Map(query_db, dates_df$start, dates_df$end)
I suggest DBI::dbBind and a frame (similar to #Parfait's answer).
For demonstration, I have a "sessions" table on my pg instance that has a field ScheduledStart. In this case, it's a TIMESTAMPTZ column, not a Date column, so I need to take one more step in my demo (convert from R's Date to POSIXt classes).
# pg <- DBI::dbConnect(...)
ranges <- data.frame(
start = seq(as.Date("2020-03-01"), length.out = 4, by = "week")
)
ranges$end <- ranges$start + 6
# this line is only necessary because of my local table
ranges[] <- lapply(ranges, as.POSIXct)
Here is the bulk of the "query multiple weeks":
res <- DBI::dbSendQuery(pg, "select count(*) as n from Sessions where ScheduledStart between ? and ?")
DBI::dbBind(res, ranges)
out <- DBI::dbFetch(res)
DBI::dbClearResult(res)
out
# n
# 1 8
# 2 1
# 3 0
# 4 0
While sqlInterpolate is much better than forming your own query strings (e.g., with sprintf or paste), using dbBind allows for internal iteration like above, and allows the DBMS to optimize the query with the binding parameter ? instead of actual data. (Using sqlInterpolate, the DBMS would see four different queries. Using dbBind, it sees one query, optimizes it, and uses it four times.)
That query was really boring (select * ... works, too), but I think it gets the point across. The only downside of this method is that while it makes it really easy to get all of the data, there is nothing here that inherently tells you which of your queries a particular row came from. I suspect that you can determine that from your data, that your main intent on breaking it down by-week is the amount retrieved per-query.
Side note: I often use code like this in functions, where it is feasible that something between dbSendQuery and dbClearResult might interrupt operation. In that case, I tend to reorder my code a little, like this:
somefunc <- function(...) {
# ...
res <- DBI::dbSendQuery(pg, "select count(*) as n from Sessions where ScheduledStart between ? and ?")
on.exit({
suppressWarnings(DBI::dbClearResult(res))
}, add = TRUE)
DBI::dbBind(res, ranges)
DBI::dbClearResult(res)
return(out)
}

RODBC sqlQuery returns different number of rows every time I run it

I have a simple SQL query that should return 74m rows. I know so because I ran the same query in SSMS using count(*). However, when using RODBC sqlQuery in R, it returned less rows and I tried running the same code many times and found that the returned dataframe contains different number of rows every time (ranging from 1.1m to 20m). I cannot replicate the sql query exactly here due to confidentiality but conceptually it is similar to the below.
con <- odbcConnect("db_name", uid="my_user_id", pwd="my_password")
df_2 <- sqlQuery(con, paste("SELECT *
FROM table_name
WHERE CountryName <> 'US' and Month in (1,2,3)"),
stringsAsFactors = FALSE)
Why would an exact same code return different results every time? I tried running this more than 10 times.

R to SQL server. Make a query joining a database table with a R dataframe

I have an ODBC connection to SQL server database. From R, I want to query a table with lots of data, but I want to get only those records that match my dataframe in R by certain columns (INNER JOIN). I do currently linking ODBC tables in MS ACCESS 2003 (linked tables "dbo_name") and then doing relational queries, without downloading the entire table. I need to reproduce this process in R avoiding downloading the entire table (avoid SQLFetch ()).
I have read the information from ODBC, DBI, rsqlserver packages without success. Is there any package or way to fix this?
If you can't write a table to the database, there is another trick you can use. You essentially make a giant WHERE statement. Let's say you want to join table table in the database to your data.frame called a on the column id. You could say:
ids <- paste0(a$id,collapse=',')
# If a$id is a character, you'll have to surround this in quotes:
# ids <- paste0(paste0("'",a$id,"'"),collapse=',')
dbGetQuery(con, paste0('SELECT * FROM table where id in (',paste(ids,collapse=','),')'))
From your comment, it seems that SQL Server has a problem with a query of that size. I suspect that you may have to "chunk" the query into smaller bits, and then join them all together. Here is an example of splitting the ids into 1000 chunks, querying, and then combining.
id.chunks <- split(a$ids,seq(1000))
result.list <- lapply(id.chunks, function(ids)
dbGetQuery(con,
paste0('SELECT * FROM table where id in (',ids,')')))
combined.resuls <- do.call(rbind,result.list)
The problem was solved. Ids vector was divided into groups of 1000, and then querying to the server each. I show the unorthodox code. Thanks nograpes!
# "lani1" is the vector with 395.474 ids
id.chunks<-split(lani1,seq(1000))
for (i in 1:length(id.chunks)){
idsi<-paste0(paste0("'",as.vector(unlist(id.chunks[i])),"'"),collapse=',')
if(i==1){ani<-sqlQuery(riia,paste0('SELECT * FROM T_ANIMALES WHERE an_id IN (',idsi,')'))
}
else {ani1<-sqlQuery(riia,paste0('SELECT * FROM T_ANIMALES WHERE an_id IN (',idsi,')'))
ani<-rbind(ani,ani1)
}
}
I adapted the answer above and the following worked for me without needing SQL syntax. The table I used was from the adventureworks SQL Server database.
lazy_dim_customer <- dplyr::tbl(conn, dbplyr::in_schema("dbo", "DimCustomer"))
# Create data frame of customer ids
adv_customers <- dplyr::tbl(conn, "DimCustomer")
query1 <- adv_customers %>%
filter(CustomerKey < 20000) %>%
select(CustomerKey)
d00df_customer_keys <- query1 %>% dplyr::collect()
# Chunk customer ids, filter, collect and bind
id.chunks <- split(d00df_customer_keys$CustomerKey, seq(10))
result.list <- lapply(id.chunks, function(ids)
lazy_dim_customer %>%
filter(CustomerKey %in% ids) %>%
select(CustomerKey, FirstName, LastName) %>%
collect() )
combined.results <- do.call(rbind, result.list)

Add a dynamic value into RMySQL getQuery [duplicate]

This question already has answers here:
Dynamic "string" in R
(4 answers)
Closed 5 years ago.
Is it possible to pass a value into the query in dbGetQuery from the RMySQL package.
For example, if I have a set of values in a character vector:
df <- c('a','b','c')
And I want to loop through the values to pull out a specific value from a database for each.
library(RMySQL)
res <- dbGetQuery(con, "SELECT max(ID) FROM table WHERE columna='df[2]'")
When I try to add the reference to the value I get an error. Wondering if it is possible to add a value from an R object in the query.
One option is to manipulate the SQL string within the loop. At the moment you have a string literal, the 'df[2]' is not interpreted by R as anything other than characters. There are going to be some ambiguities in my answer, because df in your Q is patently not a data frame (it is a character vector!). Something like this will do what you want.
Store the output in a numeric vector:
require(RMySQL)
df <- c('a','b','c')
out <- numeric(length(df))
names(out) <- df
Now we can loop over the elements of df to execute your query three times. We can set the loop up two ways: i) with i as a number which we use to reference the elements of df and out, or ii) with i as each element of df in turn (i.e. a, then b, ...). I will show both versions below.
## Version i
for(i in seq_along(df)) {
SQL <- paste("SELECT max(ID) FROM table WHERE columna='", df[i], "';", sep = "")
out[i] <- dbGetQuery(con, SQL)
dbDisconnect(con)
}
OR:
## Version ii
for(i in df) {
SQL <- paste("SELECT max(ID) FROM table WHERE columna='", i, "';", sep = "")
out[i] <- dbGetQuery(con, SQL)
dbDisconnect(con)
}
Which you use will depend on personal taste. The second (ii) version requires you to set names on the output vector out that are the same as the data inside out.
Having said all that, assuming your actual SQL Query is similar to the one you post, can't you do this in a single SQL statement, using the GROUP BY clause, to group the data before computing max(ID)? Doing simple things in the data base like this will likely be much quicker. Unfortunately, I don't have a MySQL instance around to play with and my SQL-fu is weak currently, so I can't given an example of this.
You could also use the sprintf command to solve the issue (it's what I use when building Shiny Apps).
df <- c('a','b','c')
res <- dbGetQuery(con, sprintf("SELECT max(ID) FROM table WHERE columna='%s'"),df())
Something along those lines should work.

Resources