Exploring the rquery package by John Mount (Win-Vector LLC): is there a way to get the distinct values of a column from a SQL table using the rquery functions? (WITHOUT writing the SQL query myself, since I need my code to run on Oracle, MSSQL and Postgres.)
So I do not need:
rq_get_query(db, "SELECT DISTINCT (COL1) FROM TABLE1")
but I am looking for something similar to base R's unique().
I would use the sqldf package. It is very accessible, and I think you would benefit from it.
install.packages("sqldf")
library(sqldf)
df = sqldf("SELECT DISTINCT COL1 FROM TABLE1")
View(df)
The following returns the distinct values of Col1 and Col2; it can of course be any number of columns.
db_td(connection, "table") %.>%
  project(., groupby = c("Col1", "Col2"), one = 0) %.>%
  execute(connection, .)
The assignment of 0 to a dummy column is currently necessary; this is supposed to be fixed in the next update of rquery, so that it will work like this:
project(., groupby = c("Col1", "Col2"))
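If you also need to check how this behaves on the different databases, a hedged sketch (not part of the original answer) is to render the SQL that rquery generates for your connection with to_sql() before executing it:

# Sketch only: `connection`, "table" and Col1/Col2 are the placeholders used above.
library(rquery)
op <- db_td(connection, "table") %.>%
  project(., groupby = c("Col1", "Col2"), one = 0)
cat(to_sql(op, connection))   # inspect the SQL rquery will send to Oracle/MSSQL/Postgres
execute(connection, op)       # then run it against the database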
I am using R and the library dplyr.
I want to join a larger database with a smaller database (in terms of rows).
I use left join because I want to have a final database that has the same number of rows as the larger one.
This naturally returns NA values when the smaller database does not have a value corresponding to the joining key.
What I want to achieve is sort of copying the previous values of the smaller database into the rows where NA is returned by the left join.
In other words:
if (is.na(columnvalue[j])) {
  columnvalue[j] <- columnvalue[j - 1]
}
where columnvalue is a joined column from the smaller database and j = 1,..., nrow(largerdataset).
A loop with that if statement should work, but it is a bit cumbersome. Is there any other smarter solution?
Thank you.
If you update your question with some sample data, I could provide full code for this. The general solution is to use fill() from the tidyr package, possibly preceded by a group_by() on the key if needed. You would write it as:
library(tidyverse)
data %>%
  # group_by(key) %>%
  tidyr::fill(var1, var2, var3, .direction = "down")  # carry the previous (j - 1) value forward into the NA rows
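For illustration, a minimal sketch with made-up data (the names key, x and val are hypothetical):

library(dplyr)
library(tidyr)

larger  <- tibble(key = c(1, 1, 2, 2, 3), x = 1:5)   # the bigger table
smaller <- tibble(key = c(1, 3), val = c("a", "b"))  # the smaller lookup table

larger %>%
  left_join(smaller, by = "key") %>%   # rows with key 2 get NA in `val`
  fill(val, .direction = "down")       # each NA takes the previous non-missing value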
Suppose using dbplyr, we have something like
library(dbplyr)
sometable %>%
  head()
then we see the first 6 rows.
But if we try this, we see an error:
sometable %>%
  tail()
# Error: tail() is not supported by sql sources
which is expected behaviour of dbplyr:
Because you can’t find the last few rows without executing the whole query, you can’t use tail().
Question: how do we do the tail() equivalent in this situation?
In general, the row order of a SQL query's results should never be assumed, as the DBMS may store data in whatever order is ideal for indexing or other reasons, not the order you want. Because of that, a common "best practice" for SQL queries is to either (a) assume the data is unordered (and perhaps that the order may change between queries, though I've not seen this in practice); or (b) force an ordering in the query.
Given that, consider arranging your data in descending order and using head().
For instance, if I have a table MyTable with a numeric field MyNumber, then
library(dplyr)
library(dbplyr)
tb <- tbl(con, "MyTable")
tb %>%
  arrange(MyNumber) %>%
  tail() %>%
  sql_render()
# Error: tail() is not supported by sql sources
tb %>%
  arrange(MyNumber) %>%
  head() %>%
  sql_render()
# <SQL> SELECT TOP(6) *
# FROM "MyTable"
# ORDER BY "MyNumber"
tb %>%
  arrange(desc(MyNumber)) %>%
  head() %>%
  sql_render()
# <SQL> SELECT TOP(6) *
# FROM "MyTable"
# ORDER BY "MyNumber" DESC
(This is, obviously, demonstrated on a SQL Server connection, but the premise should work just as well for other DBMS types; they will just shift from SELECT TOP(6) ... to SELECT ... LIMIT 6 or similar.)
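To actually retrieve the last rows as a local data frame (rather than just render the SQL), a hedged sketch is to reverse the sort, take the head, collect, and optionally restore the original order:

tb %>%
  arrange(desc(MyNumber)) %>%  # largest values first
  head(6) %>%                  # the SQL-side equivalent of the "last" 6 rows
  collect() %>%                # bring them into R
  arrange(MyNumber)            # restore ascending order locally, if desired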
Coming from SQL, I would expect to be able to do something like the following in dplyr. Is this possible?
# R
tbl %>% mutate(n = dense_rank(Name, Email))
-- SQL
SELECT Name, Email, DENSE_RANK() OVER (ORDER BY Name, Email) AS n FROM tbl
Also, is there an equivalent for PARTITION BY?
I struggled with this problem too, and here is my solution:
If you can't find a function that supports ordering by multiple variables, I suggest concatenating them in priority order, from left to right, using paste().
Below is the code sample:
tbl %>%
  mutate(n = dense_rank(paste(Name, Email))) %>%
  arrange(Name, Email) %>%
  view()
Moreover, I believe group_by() is the equivalent of PARTITION BY in SQL.
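As a hedged sketch of that equivalence (the grouping column Dept is hypothetical), a windowed rank per group would look like:

# SQL: DENSE_RANK() OVER (PARTITION BY Dept ORDER BY Name, Email)
tbl %>%
  group_by(Dept) %>%                               # the PARTITION BY part
  mutate(n = dense_rank(paste(Name, Email))) %>%   # the ORDER BY part, via paste()
  ungroup()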
The shortfall of this paste() approach is that it only works when all the ordering variables share the same direction. If you need to order by multiple columns with different directions, say one ascending and one descending, I suggest you try this:
Calculate rank with ties based on more than one variable
I have a dataframe titled FilteredData with many columns. Specifically, there are two columns I am interested in: Date and Sale number.
I want to group all Sale number entries by dates. Date is a date-type field, and Sale number is a character-type field. If I'm not mistaken, I think these types are the reason why other Q&As on S.O. haven't been much help to me.
How can I do this?
I've tried the following:
aggregate(FilteredData$`Sale number`, by FilteredData$Date, FUN = count)
group_by(FilteredData$`Sale number`, FilteredData$Date)
Neither worked, and neither did the solution found here when I tried it.
I tried the following:
library(sqldf)
Freq = sqldf('SELECT Date, COUNT(`Sale Number`) FROM FilteredData GROUP BY Date')
and it surprisingly worked. However, is there a way to obtain this result without having to use SQL syntax, i.e. something "purely" in R?
Your question is a little unclear... So you want to group by date and then count the number of non-duplicate entries within a date?
dplyr can do this:
FilteredData %>%                          # take filtered data
  group_by(Date) %>%                      # group by the date
  filter(!duplicated(`Sale number`)) %>%  # drop duplicated sale numbers within each date
  count()                                 # count the remaining sale numbers per date
You can use data.table as follows:
library(data.table)
setDT(FilteredData)
FilteredData[ , uniqueN(`Sale number`), by = Date]
I'm not sure if dplyr has a tailored function for this... you may just want length(unique(`Sale number`)) there.
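For what it's worth, dplyr does have n_distinct(); a hedged sketch of the equivalent, assuming the same FilteredData columns:

library(dplyr)
FilteredData %>%
  group_by(Date) %>%
  summarise(n = n_distinct(`Sale number`))   # distinct sale numbers per date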
This is my current approach after I start a sparklyr session:
dbGetQuery(sparkContext, "USE DB_1")
df_1 <- tbl(sparkContext, "table_1")
dbGetQuery(sparkContext, "USE DB_2")
df_2 <- tbl(sparkContext, "table_2")
df <- df_1 %>% inner_join(df_2, by = c("col_1" = "col_2"))
nrow(df)
The error I run into:
"Error: org.apache.spark.sql.AnalysisException: Table or view not found: table_1"
My take is that sparklyr does not (directly) support joining tables from two databases. I am wondering if anyone has an elegant solution to this problem.
You can specify the database in the Spark SQL syntax passed to the dbGetQuery function, e.g.:
df_1 <- dbGetQuery(sc, "select * from db_1.table_1")
However, note that dbGetQuery collects the data to the driver as an R dataframe, so you may want to do the join within dbGetQuery, e.g.:
df <- dbGetQuery(sc, "select * from db_1.table_1 A inner join db_2.table_2 B on A.col_1 = B.col_2")
(Or, if your datasets are really large but you want to aggregate via a more R-friendly API instead of Spark SQL, you can use SparkR.)
From the sparklyr book, you can use the dbplyr package to create references to each table:
library(dplyr)
library(dbplyr)
library(sparklyr)
table_1 <- tbl(sc, dbplyr::in_schema("db_1", "table_1"))
table_2 <- tbl(sc, dbplyr::in_schema("db_2", "table_2"))
Then you can do a standard R merge:
df <- merge(table_1, table_2, by.x = "col_1", by.y = "col_2")
(I'm doing this right now, but it's taking forever.)
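One possible reason merge() is slow here is that it may pull the data down to R, whereas dplyr's own join verbs are translated to Spark SQL and stay lazy. A hedged sketch using the same in_schema references:

# Sketch only: same table_1 / table_2 references created above.
df <- inner_join(table_1, table_2, by = c("col_1" = "col_2"))  # runs inside Spark, lazily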
One more approach is to create two separate sparklyr data frames, one from each database, and then do whatever you want with them. You can join them as sparklyr data frames, or collect them back to R data frames for the join, whichever is appropriate given the size of the data.
sdf_1 <- sparklyr::spark_read_table(sc, "first_table_name",
                                    options = list(dbtable = "first_database_name.first_table_name"))
sdf_2 <- sparklyr::spark_read_table(sc, "second_table_name",
                                    options = list(dbtable = "second_database_name.second_table_name"))
inner_join(sdf_1, sdf_2, by = c("col_1" = "col_2"))
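If the joined result is small enough, a hedged sketch of bringing it back to R to inspect it (as with the nrow() call from the question):

joined   <- inner_join(sdf_1, sdf_2, by = c("col_1" = "col_2"))
local_df <- dplyr::collect(joined)   # materialise the result as a local data frame
nrow(local_df)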