Use Sparklyr to join tables from 2 different databases - r

This is my current approach after starting a sparklyr session:
dbGetQuery(sparkContext, "USE DB_1")
df_1 <- tbl(sparkContext, "table_1")
dbGetQuery(sparkContext, "USE DB_2")
df_2 <- tbl(sparkContext, "table_2")
df <- df_1 %>% inner_join(df_2, by = c("col_1" = "col_2"))
nrow(df)
The error I get:
"Error: org.apache.spark.sql.AnalysisException: Table or view not found: table_1"
My take is that sparklyr does not (directly) support joining tables from two databases. Does anyone have an elegant solution to this problem?

You can specify the database in the Spark SQL syntax passed to the dbGetQuery function, e.g.:
df_1 <- dbGetQuery(sc, "select * from db_1.table_1")
However, note that dbGetQuery collects the data to the driver as an R dataframe, so you may want to do the join within dbGetQuery, e.g.:
df <- dbGetQuery(sc, "select * from db_1.table_1 A inner join db_2.table_2 B on A.col_1 = B.col_2")
(Or, if your datasets are really large but you want to aggregate via a more R-friendly API instead of Spark SQL, you can use SparkR.)

From the sparklyr book, you can use the dbplyr package to create references to each table:
library(dplyr)
library(dbplyr)
library(sparklyr)
table_1 <- tbl(sc, dbplyr::in_schema("db_1", "table_1"))
table_2 <- tbl(sc, dbplyr::in_schema("db_2", "table_2"))
Then you can do a standard R merge:
df <- merge(table_1, table_2, by.x = "col_1", by.y = "col_2")
(I'm doing this right now, but it's taking forever.)
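The slowness is plausibly because base R's merge() works on in-memory data frames, so both tables are likely collected to the driver before the join. A minimal sketch of the alternative, keeping the join in Spark with a dplyr verb. The function name join_across_dbs is hypothetical, and sc is assumed to be an open sparklyr connection with db_1.table_1 and db_2.table_2 available:

```r
# Hypothetical helper: `sc` is assumed to be an open sparklyr connection,
# and the database, table, and column names are the ones used in the question.
join_across_dbs <- function(sc) {
  # Lazy references to each table, qualified by database name
  table_1 <- dplyr::tbl(sc, dbplyr::in_schema("db_1", "table_1"))
  table_2 <- dplyr::tbl(sc, dbplyr::in_schema("db_2", "table_2"))

  # inner_join() on lazy tables is translated to SQL and run by Spark;
  # no rows reach R until the result is collected
  dplyr::inner_join(table_1, table_2, by = c("col_1" = "col_2"))
}
```

Calling join_across_dbs(sc) returns another lazy table; sparklyr::sdf_nrow() should count its rows without collecting them.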

One more approach could be to create two separate sparklyr dataframes (one from each database) and then do whatever you want. You may join them as sparklyr dataframes, or convert them back to R dataframes and join those, whichever is appropriate for the size of the data.
sdf_1 <- sparklyr::spark_read_table(sc, "first_table_name",
options=list(dbtable="first_database_name.first_table_name"))
sdf_2 <- sparklyr::spark_read_table(sc, "second_table_name",
options=list(dbtable="second_database_name.second_table_name"))
inner_join(sdf_1, sdf_2, by = c("col_1" = "col_2"))

Related

Unable to perform merge: what is the difference in these dataframes?

I have two dataframes, annotatedFile and subOutFile, that contain similar data. I am retrieving annotatedFile from an xlsx file using readxl::read_xlsx. subOutFile is retrieved using read.delim2 from a tab-separated text file. They contain similar columns, but annotatedFile has an extra column, Accuracy, that I want to merge into the subOutFile dataframe.
This is what the data frames look like:
My merge command was:
subOutFile = subOutFile %>% merge(subOutFile, annotatedFile[,c("StimName", "Accuracy")], by = "StimName", all.x = TRUE)
From the images above, you can see that the structure of the two dataframes looks different. One shows the vector-like notification [1:180] and the other does not. Is there something different about these dataframes which is why I am not able to perform the merge? Or is there another reason?
When you write df1 %>% merge(df1, df2), there is one too many df1.
It's either df1 <- merge(df1, df2) or df1 <- df1 %>% merge(df2). For the latter, there is a shortcut, but you will have to load the magrittr package: df1 %<>% merge(df2).
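A small sketch of the corrected call, using made-up data (the values and the RT column are invented for illustration; only StimName and Accuracy come from the question). The pipe already supplies the left-hand data frame as the first argument, so merge() only needs the second table:

```r
library(magrittr)

# Made-up stand-ins for subOutFile and annotatedFile
subOutFile <- data.frame(StimName = c("s1", "s2", "s3"),
                         RT = c(410, 385, 502))
annotatedFile <- data.frame(StimName = c("s1", "s3"),
                            Accuracy = c(1, 0))

# subOutFile is piped in as the first argument of merge()
merged <- subOutFile %>%
  merge(annotatedFile[, c("StimName", "Accuracy")],
        by = "StimName", all.x = TRUE)

merged
```

With all.x = TRUE every row of subOutFile is kept, and stimuli without a match get NA in Accuracy.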

How to write the same code using pipes in R?

I'm very new to R and can't get a hold of using pipes for trivial commands. How to write these correctly working commands using pipes instead? The following two problems are not related.
1) I'm trying to remove duplicates from my dataframe and replace the old dataframe with a new one that has no duplicated values.
2) I'm trying to change factor format to date format.
1) df <- df[!duplicated(df),]
2) df$date_col <- anytime(df$date_col,
useR = getOption("anytimeUseRConversions", FALSE),
oldHeuristic = getOption("anytimeOldHeuristic", FALSE))
Here is one option
library(dplyr)
library(anytime)
df <- df %>%
  distinct() %>%
  mutate(date_col = anytime(date_col))
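For a version that runs on its own, here is a sketch with invented data. It swaps anytime() for base R's as.Date() to avoid the extra dependency, which assumes the factor holds ISO-formatted dates (YYYY-MM-DD); the id column is made up:

```r
library(dplyr)

# Invented example: one duplicated row, dates stored as a factor
df <- data.frame(
  id = c(1, 1, 2),
  date_col = factor(c("2020-01-05", "2020-01-05", "2021-03-10"))
)

df <- df %>%
  distinct() %>%                                          # drop duplicated rows
  mutate(date_col = as.Date(as.character(date_col)))      # factor -> Date

str(df)
```

Assigning the result back to df is what replaces the old data frame; the pipe alone only prints the result.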

Vector addition with vector indexing

This may well have an answer elsewhere but I'm having trouble formulating the words of the question to find what I need.
I have two dataframes, A and B, with A having many more rows than B. I want to look up a value from B based on a column of A, and add it to another column of A. Something like:
A$ColumnToAdd + B[ColumnToMatch == A$ColumnToMatch,]$ColumnToAdd
But I get, with a load of NAs:
Warning in `==.default`: longer object length is not a multiple of shorter object length
I could do it with a messy for-loop but I'm looking for something faster & elegant.
Thanks
If I understood your question correctly, you're looking for a merge or a join, as suggested in the comments.
Here's a simple example for both using dummy data that should fit what you described.
library(tidyverse)
# Some dummy data
ColumnToAdd <- c(1,1,1,1,1,1,1,1)
ColumnToMatch <- c('a','b','b','b','c','a','c','d')
A <- data.frame(ColumnToAdd, ColumnToMatch)
ColumnToAdd <- c(1,2,3,4)
ColumnToMatch <- c('a','b','c','d')
B <- data.frame(ColumnToAdd, ColumnToMatch)
# Example using merge
A %>%
merge(B, by = c("ColumnToMatch")) %>%
mutate(sum = ColumnToAdd.x + ColumnToAdd.y)
# Example using join
A %>%
inner_join(B, by = c("ColumnToMatch")) %>%
mutate(sum = ColumnToAdd.x + ColumnToAdd.y)
The advantages of the dplyr versions over merge are:
rows are kept in existing order
much faster
tells you what keys you're merging by (if you don't supply)
also work with database tables.
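The original one-liner can also be repaired in base R with match(), which returns, for each row of A, the position of its key in B. A minimal sketch on the same dummy data (redefined here so the block runs alone); it assumes the keys in B are unique:

```r
# Same dummy data as in the answer above
A <- data.frame(ColumnToAdd = rep(1, 8),
                ColumnToMatch = c("a", "b", "b", "b", "c", "a", "c", "d"))
B <- data.frame(ColumnToAdd = c(1, 2, 3, 4),
                ColumnToMatch = c("a", "b", "c", "d"))

# For every row of A, find the row of B with the same key
idx <- match(A$ColumnToMatch, B$ColumnToMatch)

# Look up B's value per row of A and add it; A's row order is preserved
A$sum <- A$ColumnToAdd + B$ColumnToAdd[idx]
A$sum
# 2 3 3 3 4 2 4 5
```

Keys in A that have no match in B give NA in idx, and therefore NA in the sum, rather than a recycling warning.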

How to get distinct values of a column in rquery?

Exploring the rquery package of John Mount, Win-Vector LLC, is there a way that I could get the distinct values of a column from a SQL table using the rquery package functions? (WITHOUT writing the appropriate SQL query but using the rquery functions since I need to use my code in Oracle, MSSQL and Postgres).
So I do not need:
rq_get_query(db, "SELECT DISTINCT (COL1) FROM TABLE1")
but I am looking for something similar to unique of base R.
I would use the sqldf package. It is very accessible, and I think you would benefit from it.
install.packages("sqldf")
library(sqldf)
df = sqldf("SELECT DISTINCT COL1 FROM TABLE1")
View(df)
The following returns the distinct combinations of Col1 and Col2. It can of course be any number of columns.
db_td(connection, "table") %.>%
project(., groupby = c("Col1", "Col2"), one = 0) %.>%
execute(connection, .)
The assignment of 0 to a new column is currently necessary. This is supposed to be fixed in the next update of rquery, after which it will work like this:
project(., groupby = c("Col1", "Col2"))

pass grouped dataframe to own function in dplyr

I am trying to transfer from plyr to dplyr. However, I still can't seem to figure out how to call on own functions in a chained dplyr function.
I have a data frame with a factorised ID variable and an order variable. I want to split the frame by the ID, order it by the order variable and add a sequence in a new column.
My plyr function looks like this:
f <- function(x) cbind(x[order(x$order_variable), ], Experience = 0:(nrow(x)-1))
data <- ddply(data, .(ID_variable), f)
In dplyr I thought this should look something like this:
f <- function(x) cbind(x[order(x$order_variable), ], Experience = 0:(nrow(x)-1))
data <- data %>% group_by(ID_variable) %>% f
Can anyone tell me how to modify my dplyr call to successfully pass my own function and get the same functionality my plyr function provides?
EDIT: If I use the dplyr formula as described here, it DOES pass an object to f. However, while plyr seems to pass a number of different tables (split by the ID variable), dplyr does not pass one table per group but the ENTIRE table (as some kind of dplyr object where groups are annotated), thus when I cbind the Experience variable it appends a counter from 0 to the length of the entire table instead of the single groups.
I have found a way to get the same functionality in dplyr using this approach:
data <- data %>%
group_by(ID_variable) %>%
arrange(ID_variable,order_variable) %>%
mutate(Experience = 0:(n()-1))
However, I would still be keen to learn how to pass grouped variables split into different tables to own functions in dplyr.
For those who get here from google. Let's say you wrote your own print function.
printFunction <- function(dat) print(dat)
df <- data.frame(a = 1:6, b = 1:2)
As it was asked here
df %>%
group_by(b) %>%
printFunction(.)
prints the entire data. To make dplyr call the function once per group, so that each group is printed as a separate table, you should use do:
df %>%
group_by(b) %>%
do(printFunction(.))
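In more recent dplyr releases do() is marked as superseded, and group_modify() covers the same use case. A sketch applying the f from the question once per group, assuming dplyr 0.8.1 or later; the data values are invented for illustration:

```r
library(dplyr)

# Invented data with the column names from the question
data <- data.frame(
  ID_variable = c("a", "a", "b", "b", "b"),
  order_variable = c(2, 1, 3, 1, 2)
)

# The user-defined function: sort one group's rows and add a 0-based counter
f <- function(x) {
  cbind(x[order(x$order_variable), , drop = FALSE],
        Experience = seq_len(nrow(x)) - 1L)
}

# group_modify() calls f once per group; .x is that group's rows
# (without the grouping column, which is added back afterwards)
result <- data %>%
  group_by(ID_variable) %>%
  group_modify(~ f(.x)) %>%
  ungroup()

result
```

Here the counter restarts at 0 within each ID, which is the behaviour the plyr version had.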
