I'm joining two relatively simple tables using ODBC and dbplyr. However, I'm getting an ambiguous column name error on my join key. This doesn't happen with ordinary dplyr joins, and I don't know how to express something like a.key = b.key in dbplyr.
Error: nanodbc/nanodbc.cpp:1655: 42000: [Microsoft][ODBC SQL Server Driver][SQL Server]Ambiguous column name 'Calendar_key'. [Microsoft][ODBC SQL Server Driver][SQL Server]Statement(s) could not be prepared.
<SQL> 'SELECT "Calendar_key", "Organization_key", "Product_Key", "Promotion_Key", "Shift_Key", "ETL_source_system_key", "Pack_Size", "Qty_Sold", "Inv_Unit_Qty", "Extended_Cost", "Extended_Purchase_Rebate", "Extended_Sales_Rebate", "Extended_Sales", "Ent_Source_Hdr_Key", "Ent_Source_Dtl_Key", "Day_Date", "Day_Of_Week_ID", "Day_Of_Week", "Holiday", "Type_Of_Day", "Calendar_Month_No", "Calendar_Month_Name", "Calendar_Qtr_No", "Calendar_Qtr_Desc", "Calendar_Year", "Fiscal_Week", "Fiscal_Period_No", "Fiscal_Period_Desc", "Fiscal_Year"
FROM "Item_Sales_Fact" AS "LHS"
LEFT JOIN "calendar" AS "RHS"
ON ("LHS"."Calendar_key" = "RHS"."calendar_key")
The code is below. My connection is called con:
con <- dbConnect(odbc(),
Driver = "SQL Server",
Server = "192.168.139.1",
Database = "pdi_warehouse_2304_01",
UID = XXXX,
PWD = XXXX,
Port = 1433)
item.sales <- tbl(con, "Item_Sales_Fact")
calendar <- tbl(con, "calendar")
organization <- tbl(con, "Organization")
test.df <- item.sales %>%
left_join(calendar, by = c("Calendar_key" = "calendar_key")) %>%
collect()
The SQL generated by dbplyr is ambiguous: the unqualified Calendar_key in the SELECT list can come from either LHS or RHS, because SQL identifiers are not case sensitive here and, unlike R, the database makes no distinction between Calendar_key and calendar_key:
SELECT "Calendar_key", ...
The problem seems to come from a mismatch in case handling: dbplyr, like R, treats Calendar_key and calendar_key as two different names and so does not qualify them, while SQL Server resolves both to the same column name.
A workaround is to rename one of the two keys so that both have exactly the same name, including case:
item.sales <- tbl(con, "Item_Sales_Fact")
calendar <- tbl(con, "calendar") %>% rename(Calendar_key = calendar_key)
test.df <- item.sales %>%
left_join(calendar, by = c("Calendar_key" = "Calendar_key")) %>%
collect()
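To check the effect of the rename without a live server, the generated SQL can be inspected on a simulated backend. This is only a sketch: it assumes dbplyr's lazy_frame() and simulate_mssql() helpers are available, and the two small tables are made-up stand-ins for Item_Sales_Fact and calendar. The exact SQL text depends on the dbplyr version.

```r
library(dplyr)
library(dbplyr)

# Simulated SQL Server backend: it only generates SQL, nothing is executed
sim_con <- simulate_mssql()

# Hypothetical, cut-down stand-ins for the two real tables
item_sales <- lazy_frame(Calendar_key = 1L, Qty_Sold = 5, con = sim_con)
calendar <- lazy_frame(calendar_key = 1L, Day_Date = "2020-01-01", con = sim_con)

# Rename the right-hand key so both sides use the same case, then join
joined <- item_sales %>%
  left_join(rename(calendar, Calendar_key = calendar_key), by = "Calendar_key")

# Inspect the SQL that dbplyr would send to the server
query_text <- as.character(sql_render(joined))
cat(query_text, "\n")
```

Reading the printed query is a quick way to confirm which columns end up in the SELECT list before running anything against the warehouse.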
In another project working with Amazon Athena I could do this:
con <- DBI::dbConnect(odbc::odbc(), Driver = "path-to-driver",
S3OutputLocation = "location",
AwsRegion = "eu-west-1", AuthenticationType = "IAM Profile",
AWSProfile = "profile", Schema = "prod")
tbl(con,
# Run SQL query
sql('SELECT *
FROM TABLE')) %>%
# Without having collected the data, I could further wrangle the data inside the database
# using dplyr code
select(var1, var2) %>%
mutate(var3 = var1 + var2)
However, now that I'm using BigQuery, I get the following error:
con <- DBI::dbConnect(bigrquery::bigquery(),
project = "project")
tbl(con,
sql(
'SELECT *
FROM TABLE'
))
Error: dataset is not a string (a length one character vector).
Is it simply not possible to do this with BigQuery?
I'm not a BigQuery user, so I can't test this, but judging from this example the error appears unrelated to how you are piping queries (%>%). Instead, it appears that tbl does not accept an sql string as its second argument on a BigQuery connection.
So it is likely to work when the second argument is a string with the name of the table:
tbl(con, "db_name.table_name")
But you should expect it to fail if the second argument is of type sql:
query_string = "SELECT * FROM db_name.table_name"
tbl(con, sql(query_string))
Other things to test:
Using odbc::odbc() to connect to BigQuery instead of bigrquery::bigquery(). The problem could be caused by the bigrquery package.
The second approach without the conversion to sql: tbl(con, query_string)
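For comparison, here is the same tbl(con, sql(...)) pattern on a backend where it is known to work. This is a sketch against an in-memory SQLite database (it assumes the RSQLite package is installed, and the demo table is made up). It shows what the pattern is expected to do; it says nothing about how bigrquery behaves.

```r
library(DBI)
library(dplyr)

# In-memory SQLite database with a small hypothetical table
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "demo", data.frame(var1 = 1:3, var2 = 4:6))

# Start from a raw SQL query, then keep wrangling lazily with dplyr verbs
lazy_query <- tbl(con, sql("SELECT * FROM demo")) %>%
  select(var1, var2) %>%
  mutate(var3 = var1 + var2)

# Nothing is pulled into R until collect() is called
result <- collect(lazy_query)
dbDisconnect(con)
```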
I'm trying to join tables from two different datasets in the same project. How can I do this?
library(tidyverse)
library(bigrquery)
con1 <-
dbConnect(
drv = bigrquery::bigquery(),
project = PROJECT,
dataset = "dataset_1"
)
con2 <-
dbConnect(
drv = bigrquery::bigquery(),
project = PROJECT,
dataset = "dataset_2"
)
A <- con1 %>% tbl("A")
B <- con2 %>% tbl("B")
inner_join(A, B,
by = "key",
copy = TRUE) %>%
collect()
Then I get the error: Error: BigQuery does not support temporary tables
The problem is most likely that you are using different connections to connect with the two tables. When you attempt this, R tries to copy data from one source into a temporary table on the other source.
See this question and the copy parameter in this documentation (it's a different package, but the principle is the same).
The solution is to only use a single connection for all your tables. Something like this:
con <-
dbConnect(
drv = bigrquery::bigquery(),
project = PROJECT,
dataset = "dataset_1"
)
A <- con %>% tbl("A")
B <- con %>% tbl("B")
inner_join(A, B,
by = "key") %>%
collect()
You may need to leave the dataset parameter blank in your connection string, or use in_schema to include the dataset name along with the table when you connect to a remote table. It's hard to be sure without knowing more about the structure of your database(s).
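The in_schema idea can be sketched on a backend that is easy to run locally. The example below uses an in-memory SQLite database with a second attached database standing in for dataset_2; that mapping, the table contents, and the use of RSQLite are all assumptions for illustration, and BigQuery itself is not involved. The point is only that both tables are reached through one connection, so the join stays in the database and no temporary table is needed.

```r
library(DBI)
library(dplyr)
library(dbplyr)

# One connection; a second attached database plays the role of dataset_2
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbExecute(con, "ATTACH DATABASE ':memory:' AS dataset_2")

# Hypothetical table A in the default schema ("main" in SQLite)
dbWriteTable(con, "A", data.frame(key = c(1L, 2L), x = c("a1", "a2")))

# Hypothetical table B in the attached schema
dbExecute(con, 'CREATE TABLE dataset_2.B ("key" INTEGER, y TEXT)')
dbExecute(con, "INSERT INTO dataset_2.B VALUES (1, 'b1'), (3, 'b3')")

# Reference each table with an explicit schema on the same connection
A <- tbl(con, in_schema("main", "A"))
B <- tbl(con, in_schema("dataset_2", "B"))

joined <- inner_join(A, B, by = "key") %>%
  collect()
dbDisconnect(con)
```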
Here is my code
library(DBI)
library(dplyr)
con <- dbConnect(odbc::odbc(), some_credentials)
dbListTables(con, table_name = "Table_A")
The above code returns Table_A indicating presence of table. Now I am trying to query Table_A
df <- as.data.frame(tbl(con, "Table_A"))
and get back:
Error: <SQL> 'SELECT *
FROM "Table_A" AS "zzz18"
WHERE (0 = 1)'
nanodbc/nanodbc.cpp:1587: 42S02: [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Invalid object name 'Table_A'.
so dplyr does not see it. How can I reconcile this? I have already double-checked the spelling.
As mentioned, any object (table, stored procedure, function, etc.) residing in a non-default schema requires an explicit reference to the schema. Default schemas include dbo in SQL Server and public in PostgreSQL. Therefore, as the docs indicate, use in_schema in dbplyr and Id or SQL in DBI:
# dbplyr VERSION
df <- tbl(con, in_schema("myschema", "Table_A"))
# DBI VERSION
t <- Id(schema = "myschema", table = "Table_A")
df <- dbReadTable(con, t)
df <- dbReadTable(con, SQL("myschema.Table_A"))
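To see why a bare string is not enough, it can help to look at how DBI quotes each form. The sketch below uses DBI's ANSI() dummy connection, so the quoting shown is the standard SQL style rather than necessarily what the SQL Server driver emits; myschema is the same placeholder name as above.

```r
library(DBI)

# ANSI() is a dummy connection that applies standard SQL quoting rules
dummy_con <- ANSI()

# A plain string is quoted as ONE identifier, dot included
plain <- dbQuoteIdentifier(dummy_con, "myschema.Table_A")

# Id() quotes schema and table separately and joins them with a dot
qualified <- dbQuoteIdentifier(dummy_con, Id(schema = "myschema", table = "Table_A"))

# SQL() marks the text as literal SQL, so it is passed through unchanged
literal <- dbQuoteIdentifier(dummy_con, SQL("myschema.Table_A"))

print(plain)
print(qualified)
print(literal)
```

In the first form the database looks for a single table whose name contains a dot, which is why the schema has to be given through Id, SQL, or in_schema.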
Without a reproducible example it is kind of hard, but I will try my best. I think you should load the dbplyr package, which dplyr uses to work with database tables.
library(DBI)
library(dbplyr)
library(tidyverse)
con <- dbConnect(odbc::odbc(), some_credentials)
df <- tbl(con, "Table_A") %>%
collect() #will create a dataframe in R and use dplyr
Here are some additional resources:
https://cran.r-project.org/web/packages/dbplyr/vignettes/dbplyr.html
Hope that can help!
I'm connected to Hive using dbplyr and odbc.
A table I would like to connect to is called "pros_year_month":
library(odbc)
library(tidyverse)
library(dbplyr)
con <- dbConnect(odbc::odbc(), "HiveProd")
prosym <- tbl(con, in_schema("my_schema_name", "pros_year_month"))
Table pros_year_month has several fields, two of which are "country" and "year_month".
This appears to work without any problem:
pros_nov <- prosym %>% filter(country == "United States") %>% collect()
However this does not:
pros_nov <- prosym %>% filter(year_month = ymd(as.character(paste0(year_month, "01")))) %>% collect()
Error in new_result(connection@ptr, statement) :
nanodbc/nanodbc.cpp:1344: 42000: [Hortonworks][Hardy] (80) Syntax or
semantic analysis error thrown in server while executing query. Error
message from server: Error while compiling statement: FAILED:
SemanticException [Error 10004]: Line 1:7 Invalid table alias or
column reference 'zzz1.year_month': (possible column names are:
year_month, country, ...
It looks like the field name year_month is somehow now zzz1.year_month? Not sure what this is or how to get around it.
How can I apply a filter for country then year_month before calling collect on a dbplyr object?
I would like to understand the difference between dplyr joins and sql joins.
I have an open connection to an oracle database in R:
con <- dbConnect(odbc::odbc(), …)
The first query:
dbGetQuery(con, "select *
from result join test on result.test_1 = test.test_1
join sample on test.sample = sample.id_2") %>%
setNames(make.names(names(.), unique = TRUE) )%>%
as_tibble()
gives a tibble with 9541 rows (what I want!)
The second query:
tbl(con, "result")%>%
inner_join(tbl(con, "sample"), by = c("test_1" = "id_2"))%>%
collect()
gives a tibble with 2688 rows
test_1 and id_2 are character fields containing spaces, with numbers at the end, for example " 3333".
Thanks
In the SQL I see three tables (result, test, and sample), but in the R code I see only two: result and sample.
This is the probable cause of the difference.
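A dplyr version that mirrors the SQL would join through the test table as well. The sketch below runs on an in-memory SQLite database with tiny made-up tables (it assumes RSQLite is installed); only the table and key names come from the question, everything else is hypothetical.

```r
library(DBI)
library(dplyr)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical miniature versions of the three tables
dbWriteTable(con, "result", data.frame(test_1 = c("t1", "t2"), value = c(10, 20)))
dbWriteTable(con, "test", data.frame(test_1 = c("t1", "t2"), sample = c("s1", "s2")))
dbWriteTable(con, "sample", data.frame(id_2 = c("s1", "s2"), label = c("first", "second")))

# result -> test on test_1, then test -> sample on sample = id_2
three_way <- tbl(con, "result") %>%
  inner_join(tbl(con, "test"), by = "test_1") %>%
  inner_join(tbl(con, "sample"), by = c("sample" = "id_2")) %>%
  collect()
dbDisconnect(con)
```

With the intermediate table included, the join keys match the ON clauses of the SQL, so the row counts should be comparable.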