I would like to calculate the Jaro-Winkler string distance in a database. If I bring the data into R (with collect) I can easily use the stringdist function from the stringdist package.
But my data is very large and I'd like to filter on Jaro-Winkler distances before pulling the data into R.
There is SQL code for Jaro-Winkler (https://androidaddicted.wordpress.com/2010/06/01/jaro-winkler-sql-code/, plus a T-SQL version), but I'm not sure how best to get that SQL code to work with dbplyr. I'm happy to try mapping the stringdist function to the Jaro-Winkler SQL code, but I don't know where to start. Even something simpler, like executing the SQL code directly from R on the remote data, would be great.
I had hoped the SQL translation section of the dbplyr documentation might help, but I don't think it does.
You can build your own SQL functions in R. They just have to produce a string that is a valid SQL query. I don't know the Jaro-Winkler distance, but I can provide an example for you to build from:
library(dplyr)   # tbl()
library(dbplyr)  # build_sql(), sql_render(), sql()

union_all <- function(table_a, table_b, list_of_columns) {
  # extract the database connection from the first table
  connection <- table_a$src$con
  # render each lazy table to its SQL and glue the two queries together
  # (list_of_columns is accepted here but not used in this example)
  sql_query <- build_sql(
    con = connection,
    sql_render(table_a),
    "\nUNION ALL\n",
    sql_render(table_b)
  )
  return(tbl(connection, sql(sql_query)))
}

unioned_table <- union_all(table_1, table_2, c("who", "where", "when"))
Two key commands here are:
sql_render, which takes a dbplyr table and returns the SQL code that produces it
build_sql, which assembles a query from strings.
You have choices for your execution command:
tbl(connection, sql(sql_query)) will return the resulting table
dbExecute(db_connection, as.character(sql_query)) will execute a query without returning the result (useful for dropping tables, creating indexes, etc.); see the sketch below.
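For instance, a side-effecting statement can be run like this (the table name is purely illustrative):
# dbExecute() runs the statement and returns the number of affected rows
# rather than a result table
DBI::dbExecute(connection, "DROP TABLE IF EXISTS my_temp_table")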
Alternatively, if you can define the function in SQL as a user-defined function (UDF), you can then simply use the name of that function as if it were an R function in a dbplyr query. When R can't find the function locally, it simply passes it to the SQL back-end and assumes it will be a function that's available in SQL-land.
This is a great way to decouple the logic. The downside is that the dbplyr expression is now dependent on the database back-end; you can't run the same code on a local data set. One way around that is to create a UDF that mimics an existing R function: dplyr will then use the local R function and dbplyr will use the SQL UDF.
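For example (a minimal sketch, assuming you have already created a SQL UDF called jaro_winkler(a, b) on the back-end and have a lazy table remote_tbl with columns name_a and name_b; all of these names are illustrative):
library(dplyr)
library(dbplyr)
# jaro_winkler() is not an R function, so dbplyr passes it through to the
# back-end unchanged and the filtering happens in the database before collect()
remote_tbl %>%
  mutate(jw_sim = jaro_winkler(name_a, name_b)) %>%
  filter(jw_sim > 0.9) %>%
  collect()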
You can use sql() which runs whatever raw SQL you provide.
Example
Here the lubridate equivalent doesn't work on a database backend.
So instead I wrap the raw SQL EXTRACT(WEEK FROM meeting_date) in sql(), like so:
your_dbplyr_object %>%
mutate(week = sql("EXTRACT(WEEK FROM meeting_date)"))
Related
I'm trying to use R's "pool" package to execute a set of queries against a set of databases.
I have a list of queries, queryList (I confirmed that each element is a character vector, e.g. "SELECT...FROM...").
library(pool)
library(DBI)
# queryList defined earlier
myPool <- dbPool(...)
Results <- lapply(queryList, pool::dbGetQuery, myPool) # fails here!
The error I get says this: "unable to find an inherited method for function 'dbGetQuery' for signature '"character", "Pool"'.
One SO thread says this is related to S4 incompatibility. pool::dbGetQuery is an S4 method.
Is there a workaround?
The use of an anonymous function (e.g. function(x)..., as suggested by @neilfws) worked. However, I'm not sure why, since I didn't need to use anonymous functions when I was dealing directly with DBIConnection objects. So this works
lapply(queryList, DBI::dbGetQuery, conn) # conn is a DBIConnection
but this doesn't work
lapply(queryList, pool::dbGetQuery, pool) # pool is a pool of DBIConnections
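The anonymous-function version that does work looks like this (a sketch, keeping the names from the question):
# wrap the call so the Pool object is passed as the connection argument
# and each query string as the statement
results <- lapply(queryList, function(q) pool::dbGetQuery(myPool, q))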
Maybe I'm misreading the official documentation?
I am trying to extract data from an Azure SQL database into an R notebook on Azure Databricks to run an R script on it. Because of the difficulties I have experienced using JDBC and DBI (detailed here: "No suitable driver" error when using sparklyr::spark_read_jdbc to query Azure SQL database from Azure Databricks and here: How to get SQL database connection compatible with DBI::dbGetQuery function when converting between R script and databricks R notebook?) I have decided to use the built-in Spark connector, as follows (credentials changed for security reasons):
%scala
//Connect to database:
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._
// Acquire a DataFrame collection (val collection)
val dave_connection = Config(Map(
"url" -> "servername.database.windows.net",
"databaseName" -> "databasename",
"dbTable" -> "myschema.mytable",
"user" -> "username",
"password" -> "userpassword"
))
val collection = sqlContext.read.sqlDB(dave_connection)
collection.show()
This works in the sense that it displays the data, but as someone who doesn't know the first thing about Scala or Spark, I now have no idea how to get it into an R or R-compatible data frame.
I have tried to see what kind of object "collection" is, but:
%scala
getClass(collection)
returns only:
notebook:1: error: too many arguments for method getClass: ()Class[_ <: $iw]
getClass(collection)
And trying to access it using sparklyr implies that it doesn't actually exist, e.g.
library(sparklyr)
sc <- spark_connect(method="databricks")
sdf_schema(collection)
returns:
Error in spark_dataframe(x) : object 'collection' not found
I feel like this may well be pretty obvious to anyone who understands Scala, but I don't (I come from an analyst rather than a computer science background), and I just want to get this data into an R data frame to perform analyses on it. (I know Databricks is all about parallelisation and scaling, but I'm not performing any parallelised functions on this dataset; the only reason I'm using Databricks is that my work PC doesn't have sufficient memory to run my analyses locally!)
So, does anyone have any ideas on how I can convert this spark object "collection" into an R dataframe?
We have a number of MS Access databases on a server which are copies from remote locations which are updated overnight. We collate some of the data from these machines for reporting purposes on a daily basis. Sometimes the overnight update fails, meaning we don’t have access to all of the databases, so I am attempting to write an R script which will test if we can connect (using a list of the database paths), and output an updated version of the list including only those which we can connect to. This will then be used to run a further script which will only update the data related to the available databases.
This is what I have so far (I am new to R but reasonably proficient in SAS and SQL – attempting to use R both as a learning exercise and for potential cost savings):
{
# Load RODBC for odbcDriverConnect() and sqlQuery()
library(RODBC)
# Create Store data locations listing
A=matrix(c(1000,1,"One","//Server/Comms1/Access.mdb"
,2000,2,"Two","//Server/Comms2/Access.mdb"
,3000,3,"Three","//Server/Comms3/Access.mdb"
)
,nrow=3,ncol=4,byrow=TRUE)
# Add column names
colnames(A)<-c("Ref1","Ref2","Ref3","Location")
#Create summary for testing connections (Ref1 and Location)
B<-A[,c(1,4)]
ConnectionTest<-function(Ref1,Location)
{
out<-tryCatch({ch<-odbcDriverConnect(paste("Driver={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=",Location))
sqlQuery(ch,paste("select ",Ref1," as Ref1,COUNT(variable) as Count from table"))}
,error=matrix(c(Ref1,0),nrow=1,ncol=2,byrow=TRUE)
)
return(out)
}
#Run function, using 'B' to provide arguments
C<-apply(B,1,function(x)do.call(ConnectionTest,as.list(x)))
#Convert to matrix and add column names
D<-matrix(unlist(C),ncol=2,byrow=T)
colnames(D)<-c("Ref1","Count")
}
When I run the script I get the following error message:
Error in value[3L] : attempt to apply non-function
I am guessing this is because I am using tryCatch incorrectly inside the UDF?
Does anyone have any advice on what I am doing incorrectly, or even if this is the best way to do what I am attempting?
Thanks
(apologies if this is formatted incorrectly, having to post on my phone due to Stackoverflow posting being blocked)
Edit - I think I fixed the 'Error in value[3L]' issue by adding function(e) {} around the matrix function in the error part of the tryCatch.
The issue now is that the script just fails if it can't reach one of the databases, rather than doing the matrix function. Do I need to add something else to make it ignore the error?
Edit 2 - it seems tryCatch does now work - it processes the alternate function upon error but also shows warnings about the error, which makes sense.
As mentioned in the edits above, using 'function(e) {}' to wrap the matrix function in the error section of the tryCatch fixed the 'Error in value[3L]' issue, so the script now works but displays warning messages if it can't access a particular channel. I am guessing the 'warning' section of the tryCatch can be used to adjust these as necessary.
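For reference, a sketch of the amended function with the error handler wrapped in function(e) (names and query kept from the question). Wrapping the connect call in suppressWarnings() is offered here as one option, instead of a warning handler, for quieting those connection warnings:
library(RODBC)
ConnectionTest <- function(Ref1, Location) {
  out <- tryCatch({
    ch <- suppressWarnings(odbcDriverConnect(
      paste("Driver={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=", Location)))
    sqlQuery(ch, paste("select ", Ref1, " as Ref1,COUNT(variable) as Count from table"))
  },
  error = function(e) matrix(c(Ref1, 0), nrow = 1, ncol = 2, byrow = TRUE))
  return(out)
}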
We have parquet data saved on a server and I am trying to use SparkR sql() function in the following ways
df <- sql("SELECT * FROM parquet.`<path to parquet file`")
head(df)
show(df) # returns "<SQL> SELECT * FROM parquet.`<path to parquet file`"
and
createOrReplaceTempView(df, "table")
df2 <- sql("SELECT * FROM table")
show(df2) # returns "<SQL> SELECT * FROM table"
In both cases what I get is the SQL query as a string instead of the Spark DataFrame. Does anyone have any idea why this happens and why I don't get the data frame?
Thank you
Don't use the show statement; use showDF() instead, or View(head(df2, num = 20L)).
This was a very silly problem. The solution is to just use the full name of the method
SparkR::sql(...)
instead of the short name. Apparently the function sql is masked.
The show method documentation in SparkR states that if eager evaluation is not enabled, it returns the class and type information of the Spark object, so you should use showDF instead.
Besides that, apparently the sql method is masked in R and you should call it with an explicit package declaration, like this:
df <- SparkR::sql("select 1")
showDF(df)
(By object-relational mapping, I mean what is described here: Wikipedia: Object-relational mapping.)
Here is how I could imagine this working in R: a kind of "virtual data frame" is linked to a database and returns the results of SQL queries when accessed. For instance, head(virtual_list) would actually return the results of (select * from mapped_table limit 5) on the mapped database.
I have found this post by John Myles White, but there seems to have been no progress in the last 3 years.
Is there a working package that implements this ?
If not,
Would it be useful ?
What would be the best way to implement it (S4 ?) ?
The very recent package dplyr is implementing this (amongst other amazing features).
Here are illustrations from the examples of function src_mysql():
# Connection basics ---------------------------------------------------------
# To connect to a database first create a src:
my_db <- src_mysql(host = "blah.com", user = "hadley",
password = "pass")
# Then reference a tbl within that src
my_tbl <- tbl(my_db, "my_table")
# Methods -------------------------------------------------------------------
batting <- tbl(lahman_mysql(), "Batting")
dim(batting)
colnames(batting)
head(batting)
There is an old unsupported package, SQLiteDF, that does that. Build it from source and ignore the numerous error messages.
> # from example(sqlite.data.frame)
>
> library(SQLiteDF)
> iris.sdf <- sqlite.data.frame(iris)
> iris.sdf$Petal.Length[1:10] # $ done via SQL
[1] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5
Looks like John Myles White has given up on it.
There is a bit of a workaround explained here.
I don't think it would be useful. R is not a real OOP language. The "central" data structure in R is the data frame. There's no need for object-relational mapping here. What you want is a mapping between SQL tables and data frames, and the RMySQL and RODBC packages provide just that:
dbGetQuery to return the results of a query in a data frame, and dbWriteTable to insert data into a table or do a bulk update (from a data frame).
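A self-contained sketch of that workflow, using an in-memory SQLite database so it runs anywhere (the same dbGetQuery()/dbWriteTable() calls work unchanged with an RMySQL connection):
library(DBI)
library(RSQLite)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "iris", iris)                      # data frame -> table
df <- dbGetQuery(con, "SELECT * FROM iris LIMIT 5")  # query -> data frame
dbDisconnect(con)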
As an experienced R user, I would not use this. First off, this 'virtual frame' would be slow to use, since you constantly need to synchronize between R memory and the database. It would also require locking the database table, since otherwise you have unpredictable results due to other edits happening at the same time.
Finally, I do not think R is suited for implementing a different evaluation of promise objects. Doing myFrame$foo[ myFrame$foo > 40 ] will still fetch the full foo column, since you cannot possibly implement a full translation scheme from R to SQL.
Therefore, I prefer to load a data frame from a query, use it, and write it back to the database if required.
Next to the various driver packages for querying DBs (DBI, RODBC, RJDBC, RMySQL, ...)
and dplyr, there's also sqldf: https://cran.r-project.org/web/packages/sqldf/
It automatically imports data frames into a temporary database and lets you query the data via SQL; at the end the database is deleted.
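A minimal example (mtcars is copied into the temporary database, queried, and the database is removed afterwards):
library(sqldf)
sqldf("SELECT cyl, COUNT(*) AS n, AVG(mpg) AS mean_mpg FROM mtcars GROUP BY cyl")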
The most similar option is probably dbplyr.
In R you work with tables, not rows.
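A minimal sketch (assuming the RSQLite package is available): tbl() gives a lazy reference to a database table, and the dplyr verbs are translated to SQL that runs on the database.
library(dplyr)
library(dbplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "mtcars", mtcars)
cars_db <- tbl(con, "mtcars")   # lazy table; nothing is pulled into R yet
cars_db %>%
  filter(cyl == 4) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  show_query()                  # prints the generated SQL
head(cars_db)                   # runs a SELECT ... LIMIT query on the database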