I'm connecting to a PostgreSQL db using R and the RPostgreSQL package. The db has a number of schemas and I would like to know which tables are associated with a specific schema.
So far I have tried:
dbListTables(db, schema="sch2014")
dbGetQuery(db, "dt sch2014.*")
dbGetQuery(db, "\dt sch2014.*")
dbGetQuery(db, "\\dt sch2014.*")
None of which have worked.
This related question also exists: Setting the schema name in postgres using R, which would solve the problem by defining the schema at the connection. However, it's not yet been answered!
Reading this answer https://stackoverflow.com/a/15644435/2773500 helped. I can use the following to get the tables associated with a specific schema:
dbGetQuery(db,
"SELECT table_name FROM information_schema.tables
WHERE table_schema='sch2014'")
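As for defining the schema up front (what the linked, still-unanswered question asks about), one workaround sketch is to change search_path for the session after connecting; unqualified table names in later queries then resolve against that schema first. The table name below is hypothetical:
# sketch only: make sch2014 the default schema for this session
dbGetQuery(db, "SET search_path TO sch2014")
# 'some_table' is a hypothetical table inside sch2014
dbGetQuery(db, "SELECT * FROM some_table LIMIT 5")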
The following should work (using DBI_v1.1.1)
DBI::dbListObjects(conn, DBI::Id(schema = 'schema_name'))
While it has all the info you want, it's hard to access and hard to read.
I would recommend something that produces a data frame:
# get a hard to read table given some Postgres connection `conn`
x = DBI::dbListObjects(conn, DBI::Id(schema = 'schema_name'))
# - extract column "table" comprising a list of Formal class 'Id' objects then
# - extract the 'name' slot for each S4 object element
# could also do `lapply(x$table, function(x) x@name)`
v = lapply(x$table, function(x) slot(x, 'name'))
# create a dataframe with header 'schema', 'table'
d = as.data.frame(do.call(rbind, v))
Or in one line:
d = as.data.frame(do.call(rbind, lapply(DBI::dbListObjects(conn, DBI::Id(schema = 'schema_name'))$table, function(x) slot(x, 'name'))))
Or in a more "tidy" way:
conn %>%
  DBI::dbListObjects(DBI::Id(schema = 'schema_name')) %>%
  dplyr::pull(table) %>%
  purrr::map(~ slot(.x, 'name')) %>%
  dplyr::bind_rows()
The output is something like:
> d
schema table
1 schema_name mtcars
You can use the table_schema option rather than just schema to see a list of tables within the specific schema. So keeping with your example code snippet above, the below line should work:
dbListTables(db, table_schema="sch2014")
Related
I am connecting to a SQL Server database through the ODBC connection in R. I have two potential methods to get data, and am trying to determine which would be more efficient. The data is needed for a Shiny dashboard, so the data needs to be pulled while the app is loading rather than querying on the fly as the user is using the app.
Method 1 is to use over 20 stored procedures to query all of the needed data and store them for use. Method 2 is to query all of the tables individually.
Here is the method I used to query one of the stored procedures:
get_proc_data <- function(proc_name, url, start_date, end_date){
  dbGetQuery(con, paste0(
    "EXEC dbo.", proc_name, " ",
    "@URL = N'", url, "', ",
    "@Startdate = '", start_date, "', ",
    "@enddate = '", end_date, "' "
  ))
}
data <- get_proc_data(proc_name, url, today(), today() %m-% years(5))
However, each of the stored procedures has a slightly different setup for the parameters, so I would have to define each of them separately.
I have started to implement Method 2, but have run into issues with iteratively querying each table.
# use dplyr to create a list of table names
db_tables <- dbGetQuery(con, "SELECT * FROM [database_name].INFORMATION_SCHEMA.TABLES;") %>%
  select(TABLE_NAME)
# use dplyr::pull to create a vector of table names
table_list <- pull(db_tables, TABLE_NAME)
# get a quick look at the first few rows
tbl(con, "[TableName]") %>% head() %>% glimpse()
# iterate through all table names, get the first five rows, and export to .csv
for (table in table_list){
  write.csv(
    tbl(con, table) %>% head(),
    str_glue("{getwd()}/00_exports/tables/{table}.csv")
  )
}
selected_tables <- db_tables %>% filter(TABLE_NAME %in% c("TableName1","TableName2"))
Ultimately this method was just to test how long it would take to iterate through the ~60 tables and perform the required function. I have tried putting this into a function instead but have not been able to get it to iterate through while also pulling the name of the table.
Pro/Con for Method 1: The stored procs currently power a metrics plug-in, written in C++, that displays metrics on the webpage. This is for internal use to monitor website performance. However, not all of the stored procedures are visible to me, and the client needs me to extend their current metrics. I also do not have a DBA at my disposal to help with the SQL Server side, and the person who wrote the procs is unavailable. The procs also use different logic from one another, so joining the results of two different procs gives drastically different values. For example, depending on the proc, each date will either list total page views for each day or already be aggregated at the weekly or monthly scale and then listed repeatedly, so joining and grouping causes drastic errors in the actual page view counts.
Pro/Con for Method 2: I am familiar with dplyr and would be able to join the tables together to pull the data I need. However, I am not as familiar with SQL and there is no Entity-Relationship Diagram (ERD) of any sort to refer to. Otherwise, I would build each query individually.
Either way, I am trying to come up with a way to proceed with either a named function, lambda function, or vectorized method for iterating. It would be great to name each variable and assign them appropriately so that I can perform the data wrangling with dplyr.
Any help would be appreciated; I am overwhelmed with which direction to go. I researched the R equivalent of a Python list comprehension but have not been able to get a function in R to perform similarly.
> db_table_head_to_csv <- function(table) {
+ write.csv(
+ tbl(con, table) %>% head(), str_glue("{getwd()}/00_exports/bibliometrics_tables/{table}.csv")
+ )
+ }
>
> bibliometrics_tables %>% db_table_head_to_csv()
Error in UseMethod("as.sql") :
no applicable method for 'as.sql' applied to an object of class "data.frame"
Consider storing all table data in a named list (the counterpart to a Python dictionary) using lapply (the counterpart to Python's list/dict comprehension). And if you use its sibling, sapply, with simplify = FALSE, the character vector you pass in becomes the names of the list elements:
# RETURN VECTOR OF TABLE NAMES
db_tables <- dbGetQuery(
  con, "SELECT [TABLE_NAME] FROM [database_name].INFORMATION_SCHEMA.TABLES"
)$TABLE_NAME
# RETURN NAMED LIST OF DATA FRAMES FOR EACH DB TABLE
df_list <- sapply(db_tables, function(t) dbReadTable(con, t), simplify = FALSE)
You can extend the anonymous function to multiple steps, such as write.csv, or use a defined function. Just be sure the last line returns a data frame. The version below uses the native pipe, |>, available in base R 4.1.0+:
db_table_head_to_csv <- function(table) {
  head_df <- dbReadTable(con, table) |> head()
  write.csv(
    head_df,
    file.path("00_exports", "bibliometrics_tables", paste0(table, ".csv"))
  )
  return(head_df)
}
df_list <- sapply(db_tables, db_table_head_to_csv, simplify = FALSE)
You lose no data frame functionality when the objects are stored in a list, and you can extract them by name with $ or [[:
# EXTRACT SPECIFIC ELEMENT
head(df_list$table_1)
tail(df_list[["table_2"]])
summary(df_list$`table_3`)
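As a small follow-up (not in the original answer), the named list also makes it easy to run a quick check across every table at once:
# quick sanity checks across all tables in the named list
sapply(df_list, nrow)   # row count per table
sapply(df_list, ncol)   # column count per table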
I am trying to use R to access the clinicaltrials.gov AACT database to create a list of facility_investigators for a specific topic.
The following code is an example of how to get a list of all clinical trials on the topic TP53
library(dplyr)
library(RPostgreSQL)
aact = src_postgres(dbname = 'aact',
                    host = "aact-db.ctti-clinicaltrials.org",
                    user = 'aact',
                    password = 'aact')
study_tbl = tbl(src=aact, 'studies')
x = study_tbl %>% filter(official_title %like% '%TP53%') %>% collect()
Similarly, if I want a list of principal investigators,
library(dplyr)
library(RPostgreSQL)
aact = src_postgres(dbname = 'aact',
                    host = "aact-db.ctti-clinicaltrials.org",
                    user = 'aact',
                    password = 'aact')
study_tbl = tbl(src=aact, 'facility_investigators')
I am unable to make a list of only the facility_investigators for the TP53 studies, i.e. something like TP53 & facility_investigators. Any help would be appreciated.
This is a link where some explanation is provided, but my problem is not resolved - http://www.cancerdatasci.org/post/2017/03/approaches-to-accessing-clinicaltrials.gov-data/
Is this what you're asking? You're pulling from two different tables in the same database: the first one is 'studies' and the second one is 'facility_investigators'. What you need to do is run the head() command for each of the tables (or run glimpse() or str()) and see whether the two tables have a common variable you can merge on after you load them into R. If they do, then you would do something like this:
library(dplyr)
merged_table <- inner_join(x, study_table, by = "common column")
If the columns have different names, it would look like:
library(dplyr)
merged_table <- inner_join(x, study_table, by = c("x_column_name" = "study_table_column_name"))
From there you can limit your dataset to just facility investigators that have the characteristics you want.
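If the common column turns out to be the trial identifier (in AACT this is typically nct_id, but treat that as an assumption and confirm it with glimpse()), a dplyr sketch of the whole thing could look like:
library(dplyr)
# lazy references to both tables, reusing the aact connection from the question
study_tbl <- tbl(src = aact, 'studies')
fi_tbl    <- tbl(src = aact, 'facility_investigators')
# filter studies to TP53, then join to investigators; nct_id is an assumed join key
tp53_investigators <- study_tbl %>%
  filter(official_title %like% '%TP53%') %>%
  select(nct_id, official_title) %>%
  inner_join(fi_tbl, by = "nct_id") %>%
  collect()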
If you want to do it in one PostgreSQL query you can do it like so. For more information about this syntax in particular see page 18:
con <- dbConnect()  # use the same parameters you used above to connect
query <- dbSendQuery(con,
  'select s.*, fi.*
   from (select * from studies where official_title like \'%TP53%\') as s
   inner join facility_investigators as fi
     on s."joining column" = fi."joining column"'
)
r_dataset <- fetch(query)
# I would just close the connection in RStudio using the connection tab.
The above query has an inner join in the main query and a subquery in the FROM clause. The subquery performs the filtering you were trying to do in R; it essentially lets you select from a table whose results are already filtered. An inner join combines all the records the two tables have in common and puts them into one table. If you need to join on more than one column, add an 'and' between the two conditions in the ON clause, as in the sketch below.
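For example, a two-column join condition would look like this (col_a and col_b are placeholders, not actual AACT columns):
query <- dbSendQuery(con, "
  select s.*, fi.*
  from studies as s
  inner join facility_investigators as fi
    on s.col_a = fi.col_a and s.col_b = fi.col_b
")
r_dataset <- fetch(query)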
Looking into adding data to a table with dplyr, I saw https://stackoverflow.com/a/26784801/1653571 but the documentation says db_insert_table() is deprecated.
?db_insert_into()
...
db_create_table() and db_insert_into() have been deprecated in favour of db_write_table().
...
I tried to use the non-deprecated db_write_table() instead, but it fails both with and without the append= option:
require(dplyr)
my_db <- src_sqlite( "my_db.sqlite3", create = TRUE) # create src
copy_to( my_db, iris, "my_table", temporary = FALSE) # create table
newdf = iris # create new data
db_write_table( con = my_db$con, table = "my_table", values = newdf) # insert into
# Error: Table `my_table` exists in database, and both overwrite and append are FALSE
db_write_table( con = my_db$con, table = "my_table", values = newdf,append=True) # insert into
# Error: Table `my_table` exists in database, and both overwrite and append are FALSE
Should one be able to append data with db_write_table()?
See also https://github.com/tidyverse/dplyr/issues/3120
No, you shouldn't use db_write_table() instead of db_insert_table(), since it can't be generalized across backends.
And you shouldn't use these tidyverse helpers at all; use the relevant DBI:: functions instead, since the tidyverse helper functions are for internal use and not designed to be robust enough for end users. See the discussion at https://github.com/tidyverse/dplyr/issues/3120#issuecomment-339034612 :
Actually, I don't think you should be using these functions at all. Despite that SO post, these are not user facing functions. You should be calling the DBI functions directly.
-- Hadley Wickham, package author.
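Following that advice, a minimal sketch of appending with DBI directly (reusing my_db and newdf from the question) would be:
library(DBI)
# append the new rows to the existing table through the DBI interface
DBI::dbWriteTable(my_db$con, "my_table", newdf, append = TRUE, row.names = FALSE)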
I'm (very!) new to R and MySQL, and I have been struggling with and researching this problem for days, so I would really appreciate ANY help.
I need to complete a mathematical expression from 2 variables in two different tables. Essentially, I'm trying to figure out how old a subject was (DOB is in one table) when they were serviced (date of service is in another table). I have an identifying variable that is the same in both.
I have tried merging these:
age <- merge("tbl1", "tbl2", by = c("patient_id"), all = TRUE)
this returns:
Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column
I have tried subsetting so that I just keep the variables of interest, but it is not working because I believe subsetting only works for numbers, not characters... right?
Again, I would appreciate any help. Thanks in advance
Since you are new to databases, I think you should use dplyr here. It is an abstraction layer over many database back ends, so you will not have to deal with database-specific problems. Here is a simple example where I:
read the tables from MySQL
merge the tables, assuming they share a unique variable
The code:
library(dplyr)
library(RMySQL)
## create a connection
SDB <- src_mysql(host = "localhost", user = "foo", dbname = "bar", password = getPassword())
# reading tables
tbl1 <- tbl(SDB, "TABLE1_NAME")
tbl2 <- tbl(SDB, "TABLE2_NAME")
## merge: note this collects both tables into R; the step can also be done with dplyr joins
age <- merge(tbl1, tbl2, all = TRUE)
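As an extension (hedged: the dob and service_date column names are assumptions based on the question, not known table columns), the join can also be done with a dplyr verb and the age computed after collecting:
# join lazily, pull the result into R, then compute age at service
age_tbl <- inner_join(tbl1, tbl2, by = "patient_id") %>%
  collect() %>%
  mutate(age_at_service =
           as.numeric(difftime(as.Date(service_date), as.Date(dob), units = "days")) / 365.25)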
I have an ODBC connection to a SQL Server database. From R, I want to query a table with lots of data, but I want to get only those records that match my data frame in R on certain columns (an INNER JOIN). Currently I link the ODBC tables in MS Access 2003 (linked tables "dbo_name") and then do relational queries without downloading the entire table. I need to reproduce this process in R while avoiding downloading the entire table (i.e. avoiding sqlFetch()).
I have read through the documentation for the RODBC, DBI, and rsqlserver packages without success. Is there any package or approach that can do this?
If you can't write a table to the database, there is another trick you can use: you essentially build a giant WHERE clause. Let's say you want to join a table called table in the database to your data.frame called a on the column id. You could say:
ids <- paste0(a$id,collapse=',')
# If a$id is a character, you'll have to surround this in quotes:
# ids <- paste0(paste0("'",a$id,"'"),collapse=',')
dbGetQuery(con, paste0('SELECT * FROM table WHERE id IN (', ids, ')'))
From your comment, it seems that SQL Server has a problem with a query of that size. I suspect that you may have to "chunk" the query into smaller bits, and then join them all together. Here is an example of splitting the ids into 1000 chunks, querying, and then combining.
id.chunks <- split(a$id, seq(1000))
result.list <- lapply(id.chunks, function(ids)
  dbGetQuery(con,
    paste0('SELECT * FROM table WHERE id IN (', paste(ids, collapse = ','), ')')))
combined.results <- do.call(rbind, result.list)
The problem was solved. The ids vector was divided into groups of 1,000, and each group was then queried against the server. I show the (unorthodox) code below. Thanks nograpes!
# "lani1" is the vector with 395.474 ids
id.chunks<-split(lani1,seq(1000))
for (i in 1:length(id.chunks)){
idsi<-paste0(paste0("'",as.vector(unlist(id.chunks[i])),"'"),collapse=',')
if(i==1){ani<-sqlQuery(riia,paste0('SELECT * FROM T_ANIMALES WHERE an_id IN (',idsi,')'))
}
else {ani1<-sqlQuery(riia,paste0('SELECT * FROM T_ANIMALES WHERE an_id IN (',idsi,')'))
ani<-rbind(ani,ani1)
}
}
I adapted the answer above and the following worked for me without needing SQL syntax. The table I used was from the adventureworks SQL Server database.
lazy_dim_customer <- dplyr::tbl(conn, dbplyr::in_schema("dbo", "DimCustomer"))
# Create data frame of customer ids
adv_customers <- dplyr::tbl(conn, "DimCustomer")
query1 <- adv_customers %>%
  filter(CustomerKey < 20000) %>%
  select(CustomerKey)
d00df_customer_keys <- query1 %>% dplyr::collect()
# Chunk customer ids, filter, collect and bind
id.chunks <- split(d00df_customer_keys$CustomerKey, seq(10))
result.list <- lapply(id.chunks, function(ids)
  lazy_dim_customer %>%
    filter(CustomerKey %in% ids) %>%
    select(CustomerKey, FirstName, LastName) %>%
    collect() )
combined.results <- do.call(rbind, result.list)