I'm connected to Hive using dbplyr and odbc.
A table I would like to connect to is called "pros_year_month":
library(odbc)
library(tidyverse)
library(dbplyr)
con <- dbConnect(odbc::odbc(), "HiveProd")
prosym <- tbl(con, in_schema("my_schema_name", "pros_year_month"))
Table pros_year_month has several fields, two of which are "country" and "year_month".
This appears to work without any problem:
pros_nov <- prosym %>% filter(country == "United States") %>% collect()
However this does not:
pros_nov <- prosym %>% filter(year_month = ymd(as.character(paste0(year_month, "01")))) %>% collect()
Error in new_result(connection#ptr, statement) :
nanodbc/nanodbc.cpp:1344: 42000: [Hortonworks][Hardy] (80) Syntax or
semantic analysis error thrown in server while executing query. Error
message from server: Error while compiling statement: FAILED:
SemanticException [Error 10004]: Line 1:7 Invalid table alias or
column reference 'zzz1.year_month': (possible column names are:
year_month, country, ...
It looks like the field name year_month is somehow now zzz1.year_month? Not sure what this is or how to get around it.
How can I apply a filter for country then year_month before calling collect on a dbplyr object?
Related
I'm joining two relatively simple tables using ODBC and dbplyr. However, I'm getting an error on my join key, it's throwing up an ambiguous column name error. This doesn't happen normally with dplyr joins, and I don't know how to use like an a.key = b.key, using dbplyr.
Error: nanodbc/nanodbc.cpp:1655: 42000: [Microsoft][ODBC SQL Server Driver][SQL Server]Ambiguous column name 'Calendar_key'. [Microsoft][ODBC SQL Server Driver][SQL Server]Statement(s) could not be prepared.
<SQL> 'SELECT "Calendar_key", "Organization_key", "Product_Key", "Promotion_Key", "Shift_Key", "ETL_source_system_key", "Pack_Size", "Qty_Sold", "Inv_Unit_Qty", "Extended_Cost", "Extended_Purchase_Rebate", "Extended_Sales_Rebate", "Extended_Sales", "Ent_Source_Hdr_Key", "Ent_Source_Dtl_Key", "Day_Date", "Day_Of_Week_ID", "Day_Of_Week", "Holiday", "Type_Of_Day", "Calendar_Month_No", "Calendar_Month_Name", "Calendar_Qtr_No", "Calendar_Qtr_Desc", "Calendar_Year", "Fiscal_Week", "Fiscal_Period_No", "Fiscal_Period_Desc", "Fiscal_Year"
FROM "Item_Sales_Fact" AS "LHS"
LEFT JOIN "calendar" AS "RHS"
ON ("LHS"."Calendar_key" = "RHS"."calendar_key")
This is the code block below: My connection is called con
con <- dbConnect(odbc(),
Driver = "SQL Server",
Server = "192.168.139.1",
Database = "pdi_warehouse_2304_01",
UID = XXXX,
PWD = XXXX,
Port = 1433)
item.sales <- tbl(con, "Item_Sales_Fact")
calendar <- tbl(con, "calendar")
organization <- tbl(con, "Organization")
test.df <- item.sales %>%
left_join(calendar, by = c("Calendar_key" = "calendar_key")) %>%
collect()
The SQL generated by dbplyr isn't correct as Calendar_key can either come from RHS or LHS because SQL isn't case sensitive and contrary to R doesn't make a distinction between Calendar_key and calendar_key:
SELECT "Calendar_key", ...
The problem seems to come from the fact that although SQL isn't case sensitive, SQL Server handles case sensitive column names.
A workaround is to rename one of the two keys to obtain exactly the same case sensitive names:
item.sales <- tbl(con, "Item_Sales_Fact")
calendar <- tbl(con, "calendar") %>% rename(Calendar_key = calendar_key)
test.df <- item.sales %>%
left_join(calendar, by = c("Calendar_key" = "Calendar_key")) %>%
collect()
In another project working with Amazon Athena I could do this:
con <- DBI::dbConnect(odbc::odbc(), Driver = "path-to-driver",
S3OutputLocation = "location",
AwsRegion = "eu-west-1", AuthenticationType = "IAM Profile",
AWSProfile = "profile", Schema = "prod")
tbl(con,
# Run SQL query
sql('SELECT *
FROM TABLE')) %>%
# Without having collected the data, I could further wrangle the data inside the database
# using dplyr code
select(var1, var2) %>%
mutate(var3 = var1 + var2)
However, now using BigQuery I get the following error
con <- DBI::dbConnect(bigrquery::bigquery(),
project = "project")
tbl(con,
sql(
'SELECT *
FROM TABLE'
))
Error: dataset is not a string (a length one character vector).
Any idea if with BigQuery is not possible to do what I'm trying to do?
Not a BigQuery user, so can't test this, but from looking at this example it appears unrelated to how you are piping queries (%>%). Instead it appears BigQuery does not support receiving a tbl with an sql string as the second argument.
So it is likely to work when the second argument is a string with the name of the table:
tbl(con, "db_name.table_name")
But you should expect it to fail if the second argument is of type sql:
query_string = "SELECT * FROM db_name.table_name"
tbl(con, sql(query_string))
Other things to test:
Using odbc::odbc() to connect to BigQuery instead of bigquery::bigquery(). The problem could be caused by the bigquery package.
The second approach without the conversation to sql: tbl(con, query_string)
Here is my code
library(DBI)
library(dplyr)
con <- dbConnect(odbc::odbc(), some_credentials)
dbListTables(con, table_name = "Table_A")
The above code returns Table_A indicating presence of table. Now I am trying to query Table_A
df <- as.data.frame(tbl(con, "Table_A"))
and get back:
Error: <SQL> 'SELECT *
FROM "Table_A" AS "zzz18"
WHERE (0 = 1)'
nanodbc/nanodbc.cpp:1587: 42S02: [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Invalid object name 'Table_A'.
so dplyr does not see it. How can I reconcile. I already double checked spelling.
As mentioned, any object (table, stored procedure, function, etc.) residing in a non-default schema requires explicit reference to the schema. Default schemas include dbo in SQL Server and public in PostgreSQL. Therefore, as docs indicate, use in_schema in dbdplyr and Id or SQL in DBI:
# dbplyr VERSION
df <- tbl(con, in_schema("myschema", "Table_A"))
# DBI VERSION
t <- Id(schema = "myschema", table = "Table_A")
df <- dbReadTable(con, t)
df <- dbReadTable(con, SQL("myschema.Table_A"))
Without a reproducible example it is kinda hard but I will try my best. I think you should add the dbplyr package which is often used for connecting to databases.
library(DBI)
library(dbplyr)
library(tidyverse)
con <- dbConnect(odbc::odbc(), some_credentials)
df <- tbl(con, "Table_A") %>%
collect() #will create a dataframe in R and use dplyr
Here are some additional resources:
https://cran.r-project.org/web/packages/dbplyr/vignettes/dbplyr.html
Hope that can help!
I'm trying to connect to Amazon Athena via JDBC and pool:
What has worked so far:
library(RJDBC)
library(DBI)
library(pool)
library(dplyr)
library(dbplyr)
drv <- RJDBC::JDBC('com.amazonaws.athena.jdbc.AthenaDriver', '/opt/jdbc/AthenaJDBC41-1.1.0.jar')
pool_instance <- dbPool(
drv = drv,
url = "jdbc:awsathena://athena.us-west-2.amazonaws.com:443/",
user = "me",
s3_staging_dir = "s3://somedir",
password = "pwd"
)
mydata <- DBI::dbGetQuery(pool_instance, "SELECT *
FROM myDB.myTable
LIMIT 10")
mydata
---> Works fine. Correct data is beeing returned.
That does not work:
pool_instance %>% tbl("myDB.myTable") %>% head(10)
# Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", :
# Unable to retrieve JDBC result set for SELECT *
# FROM "myDB.myTable" AS "zzz2"
# WHERE (0 = 1) ( Table myDB.myTable not found. Please check your query.)
The problem here is that Athena expects the following syntax as SQL:
Either:
SELECT *
FROM "myDB"."myTable"
Or:
SELECT *
FROM myDB.myTable
So basically, by passing the string "myDB.myTable":
pool_instance %>% tbl("myDB.myTable") %>% head(10)
The following syntax is being used:
SELECT *
FROM "myDB.myTable"
which results in the following error since such table doesn't exist:
# Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", :
# Unable to retrieve JDBC result set for SELECT *
# FROM "myDB.myTable" AS "zzz6"
# WHERE (0 = 1) ( Table myDB.myTable not found. Please check your query.)
What I have tried:
So therefore I have tried to pass either "myDB"."myTable" or myDB.myTable to tbl() unsuccessfully:
I have tried to use capture.output(cat('\"myDB\".\"myTable\"')):
pool_instance %>% tbl(capture.output(cat('\"myDB\".\"myTable\"'))) %>% head(10)
# Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", :
# Unable to retrieve JDBC result set for SELECT *
# FROM """myDB"".""myTable""" AS "zzz4"
# WHERE (0 = 1) ( Table ""myDB"".""myTable"" not found. Please check your query.)
pool_instance %>% tbl(noquote("myDB"."myTable") %>% head(10)
# Error in UseMethod("as.sql") :
# no applicable method for 'as.sql' applied to an object of class "noquote"
You can use dbplyr::in_schema:
pool_instance %>% tbl(in_schema("myDB", "myTable")) %>% head(10)
I have a table saved in AWS redshift that has lots of rows and I want to collect only a subset of them using a "user_id" column. I am trying to use R with the dplyr library to accomplish this (see below).
conn_dplyr <- src_postgres('dev',
host = '****',
port = ****,
user = "****",
password = "****")
df <- tbl(conn_dplyr, "redshift_table")
However, when I try to subset over a collection of user ids it fails (see below). Can someone help me understand how I might be able to collect the data table over a collection of user id elements? The individual calls work, but when I combine them both it fails. In this case there are only 2 user ids, but in general it could be hundreds or thousands, so I don't want to do each one individually. Thanks for your help.
df_subset1 <- filter(df, user_id=="2239257806")
df_subset1 <- collect(df_subset1)
df_subset2 <- filter(df, user_id=="22159960")
df_subset2 <- collect(df_subset2)
df_subset_both <- filter(df, user_id==c("2239257806", "22159960"))
df_subset_both <- collect(df_subset_both)
Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not Retrieve the result : ERROR: operator does not exist: character varying = record
HINT: No operator matches the given name and argument type(s). You may need to add explicit type casts.
)
Try this:
df_subset_both <- filter(df, user_id %in% c("2239257806", "22159960"))
Also you can add condition in the query you uploaded from redshift.
install.packages("RPostgreSQL")
library(RPostgreSQL)
drv <- dbDriver("PostgreSQL")
conn <-dbConnect(drv,host='host link',port='5439',dbname='dbname',user='xxx',password='yyy')
df_subset_both <- dbSendQuery(conn,"select * from my_table where user_id in (2239257806,22159960)")