We have parquet data saved on a server and I am trying to use the SparkR sql() function in the following ways:
df <- sql("SELECT * FROM parquet.`<path to parquet file>`")
head(df)
show(df) # returns "<SQL> SELECT * FROM parquet.`<path to parquet file>`"
and
createOrReplaceTempView(df, "table")
df2 <- sql("SELECT * FROM table")
show(df2) # returns "<SQL> SELECT * FROM table"
In both cases what I get back is the SQL query as a string instead of a Spark DataFrame. Does anyone have any idea why this happens and why I don't get the DataFrame?
Thank you
Don't use show(); use showDF() instead, or View(head(df2, num = 20L)).
This was a very silly problem. The solution is simply to use the fully qualified name of the function,
SparkR::sql(...)
instead of the short name. Apparently the sql function is masked.
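For reference, a minimal sketch of the fix applied to the original query (the parquet path is still a placeholder):
library(SparkR)
sparkR.session()

# fully qualified call, so a masking sql() (e.g. from dplyr/dbplyr) is not used
df <- SparkR::sql("SELECT * FROM parquet.`<path to parquet file>`")

showDF(df)   # prints the rows
head(df)     # returns the first rows as a local data.frame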
The show method documentation in SparkR states that, if eager evaluation is not enabled, it returns the class and type information of the Spark object, so you should use showDF instead.
Besides that, apparently, the sql function is masked in R, so you should call it with an explicit package declaration, like this:
df <- SparkR::sql("select 1")
showDF(df)
I am still new to the world of Azure Databricks, and the use of SparkR remains very obscure to me, even for very simple tasks...
It took me a very long time to find out how to count distinct values, and I'm not sure this is the right way to go:
library(SparkR)
sparkR.session()
DW <- sql("select * from db.mytable")
nb.per <- head(summarize(DW, n_distinct(DW$VAR)))
I thought I had found it, but nb.per is not a single number; it is still a data.frame...
class(nb.per)
[1] "data.frame"
I tried:
nb.per <- as.numeric(head(summarize(DW, n_distinct(DW$VAR))))
It seems OK, but I'm pretty sure there is a better way to achieve this?
Thanks!
Since you are using Spark SQL anyway, a very simple approach would be this:
nb.per <- SparkR::collect(SparkR::sql("select count(distinct VAR) from db.mytable"))[[1]]
Or, using the SparkR DataFrame APIs:
DW <- SparkR::tableToDF("db.mytable")
nb.per <- SparkR::collect(SparkR::agg(DW, SparkR::countDistinct(SparkR::column("VAR"))))[[1]]
The SparkR::sql function returns a SparkDataFrame.
In order to use this in R as an R data.frame, you can simply coerce it:
as.data.frame(sql("select * from db.mytable"))
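Applied to the distinct count from the question (a sketch; the column and table names are the ones used above), this gives a plain one-row data.frame from which the scalar can be pulled directly:
nb.per <- as.data.frame(SparkR::sql("select count(distinct VAR) as n from db.mytable"))$n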
I am trying to calculate the difference in months between two dates in R using the dbplyr package. I want to send a SQL query that computes it with the native MySQL function "timestampdiff", but I'm getting an error:
library(tidyverse)
library(lubridate)
library(dbplyr)
db_df <- tbl(con, "creditos")
db_df %>% mutate(diff_month = timestampdiff(month, column_date_1, column_date_2))
but the month parameter is not being translated correctly because it looks like an object or function in R:
Error in UseMethod("escape") :
no applicable method for 'escape' applied to an object of class "function"
And if written this way:
db_df %>% mutate(diff_month = timestampdiff("month", column_date_1, column_date_2))
I will also get an error:
You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near month, column_date_1, column_date_2) AS diff_month
And I believe this is because dbplyr is writing "month" with double quotes into MySQL, when it should be unquoted, something like this:
TIMESTAMPDIFF(month, column_date_1, column_date_2) AS `diff_month`
Or is there a better way to calculate month difference using dbplyr?
month is a function in the lubridate package. It looks like mutate is being passed month as the R function month() instead of as text.
If you are using native SQL to compute the time difference, then you should not need the lubridate package.
Two possible solutions:
Remove library(lubridate) from your preamble and refer to lubridate functions using the prefix lubridate::, e.g. lubridate::ymd_hms().
Capitalize the parts of your mutate command that you want run in native SQL. This should help the SQL translation distinguish them from the lower-case variants that have other meanings. E.g.: db_df %>% mutate(diff_month = TIMESTAMPDIFF(MONTH, column_date_1, column_date_2))
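If neither option pans out with your MySQL version, a common fallback (sketched here, using the column names from the question; it mirrors the sql() pass-through shown further below) is to inline the raw SQL expression:
library(dplyr)
library(dbplyr)

# sql() marks the string as literal SQL, so dbplyr passes it through untouched
db_df %>%
  mutate(diff_month = sql("TIMESTAMPDIFF(MONTH, column_date_1, column_date_2)"))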
I would like to calculate the Jaro-Winkler string distance in a database. If I bring the data into R (with collect) I can easily use the stringdist function from the stringdist package.
But my data is very large and I'd like to filter on Jaro-Winkler distances before pulling the data into R.
There is SQL code for Jaro-Winkler (https://androidaddicted.wordpress.com/2010/06/01/jaro-winkler-sql-code/ and a version for T-SQL), but I'm not sure how best to get that SQL code working with dbplyr. I'm happy to try to map the stringdist function onto the Jaro-Winkler SQL code, but I don't know where to start. Even something simpler, like executing the SQL code directly from R on the remote data, would be great.
I had hoped the SQL translation section of the dbplyr documentation might help, but I don't think it does.
You can build your own SQL functions in R. They just have to produce a string that is a valid SQL query. I don't know the Jaro-Winkler distance, but I can provide an example for you to build from:
library(dplyr)   # tbl()
library(dbplyr)  # build_sql(), sql_render(), sql()

union_all <- function(table_a, table_b, list_of_columns) {
  # extract the database connection from one of the lazy tables
  connection <- table_a$src$con

  # render each input to SQL and glue the two queries together
  # (list_of_columns is not used in this sketch)
  sql_query <- build_sql(
    con = connection,
    sql_render(table_a),
    "\nUNION ALL\n",
    sql_render(table_b)
  )

  # hand the combined query back as a new lazy table
  return(tbl(connection, sql(sql_query)))
}

unioned_table <- union_all(table_1, table_2, c("who", "where", "when"))
Two key commands here are:
sql_render, which takes a dbplyr table and returns the SQL code that produces it
build_sql, which assembles a query from strings.
You have choices for your execution command:
tbl(connection, sql(sql_query)) will return the resulting table
dbExecute(db_connection, as.character(sql_query)) will execute a query without returning the result (useful for dropping tables, creating indexes, etc.)
Alternatively, if you define the function in SQL as a user-defined function (UDF), you can then simply use the name of that function as if it were an R function in a dbplyr query. When R can't find the function locally, it simply passes it to the SQL back-end and assumes it will be a function that's available in SQL-land.
This is a great way to decouple the logic. The downside is that the dbplyr expression is now dependent on the database backend; you can't run the same code on a local data set. One way around that is to create a UDF that mimics an existing R function: dplyr will then use the local R function and dbplyr will use the SQL UDF.
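For the UDF route, a rough sketch of what the dbplyr side could look like (it assumes a SQL function jaro_winkler(a, b) has already been created in the database, e.g. from the linked implementation, and uses hypothetical table and column names):
library(dplyr)

# jaro_winkler() is not an R function; dbplyr passes the unknown name through
# to the database, where the UDF is expected to exist
candidates <- remote_tbl %>%
  filter(jaro_winkler(name_a, name_b) > 0.9) %>%
  collect()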
You can use sql() which runs whatever raw SQL you provide.
Example
Here the lubridate equivalent doesn't work on a database backend,
so instead I pass the raw SQL expression, sql("EXTRACT(WEEK FROM meeting_date)"), inside mutate(), like so:
your_dbplyr_object %>%
mutate(week = sql("EXTRACT(WEEK FROM meeting_date)"))
I am trying to fetch data from a local PostgreSQL instance into R. I need to work with parameterized queries because the queries will later depend on the user's input.
res <- postgresqlExecStatement(con, "SELECT * FROM patient_set WHERE
instance_id = $1", c(100))
postgresqlFetch(res,n=-1)
postgresqlCloseResult(res)
dataframe = data.frame(res)
dbDisconnect(con)
Unfortunately this still gives me the following error:
Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot coerce class "structure("PostgreSQLResult", package ="RPostgreSQL")" to a data.frame
I also tried switching to dbGetQuery and dbBind but didn't get it running properly. What is the best way to fetch the result of parameterized queries from Postgresql directly into an R dataframe or table?
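For what it's worth, a minimal sketch of the DBI-style parameterized fetch the question mentions (dbSendQuery() plus dbBind()); this assumes the RPostgres backend, which supports bound parameters:
library(DBI)

res <- dbSendQuery(con, "SELECT * FROM patient_set WHERE instance_id = $1")
dbBind(res, list(100))        # bind the value for the $1 placeholder
dataframe <- dbFetch(res)     # fetch the result as a data.frame
dbClearResult(res)
dbDisconnect(con)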
Does anyone know of a way to download blob data from an Oracle database using RJDBC package?
When I do something like this:
library(RJDBC)
drv <- JDBC(driverClass=..., classPath=...)
conn <- dbConnect(drv, ...)
blobdata <- dbGetQuery(conn, "select blobfield from blobtable where id=1")
I get this message:
Error in .jcall(rp, "I", "fetch", stride) :
java.sql.SQLException: Ongeldig kolomtype.: getString not implemented for class oracle.jdbc.driver.T4CBlobAccessor
Well, the message is clear ("Ongeldig kolomtype" is Dutch for "invalid column type"), but still I hope there is a way to download blobs. I read something about getBinary() as a way of getting blob information. Can I find a solution in that direction?
The problem is that RJDBC tries to convert the SQL data type it reads into either a double or a String in Java. Typically the trick works because the JDBC driver for Oracle has routines to convert different data types to String (accessed by the getString() method of the java.sql.ResultSet class). For BLOB, though, the getString() method was dropped at some point. RJDBC still tries calling it, which results in the error.
I tried digging into the guts of RJDBC to see if I could get it to call the proper function for BLOB columns, and apparently the solution requires modifying the fetch S4 method in this package and also the result-grabbing Java class within the package. I'll try to get this patch to the package maintainers. Meanwhile, here is a quick and dirty fix using rJava (assuming conn as in your example and q holding the query string):
# the query string from the question
q <- "select blobfield from blobtable where id=1"

# drop down to rJava: create a JDBC statement and run the query ourselves
s <- .jcall(conn@jc, "Ljava/sql/Statement;", "createStatement")
r <- .jcall(s, "Ljava/sql/ResultSet;", "executeQuery", q, check = FALSE)

# walk the result set and read each BLOB as a raw vector via getBytes()
listraws <- list()
col_num <- 1L
i <- 1
while (.jcall(r, "Z", "next")) {
  listraws[[i]] <- .jcall(r, "[B", "getBytes", col_num)
  i <- i + 1
}
This retrieves a list of raw vectors in R. The next steps depend on the nature of the data; in my application these vectors represent PNG images and can be handled pretty much as file connections by the png package.
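For example, if the raw vectors are PNG images as described above, they can be decoded straight from memory (png::readPNG() accepts a raw vector); a minimal sketch:
library(png)

img <- readPNG(listraws[[1]])  # decode the first BLOB from its raw bytes
dim(img)                       # height x width x channels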
Done using R 3.1.3, RJDBC 0.2-5, Oracle 11-2 and OJDBC driver for JDK >= 1.6