Difference in months with the TIMESTAMPDIFF MySQL function in dbplyr (R)

I am trying to calculate the difference in months between two dates in R using the dbplyr package. I want dbplyr to send a SQL query that computes it with MySQL's native TIMESTAMPDIFF function, but I'm getting an error:
library(tidyverse)
library(lubridate)
library(dbplyr)
db_df <- tbl(con, "creditos")
db_df %>% mutate(diff_month = timestampdiff(month, column_date_1, column_date_2))
but the parameter month is not being translated correctly, because R treats it as an object or function:
Error in UseMethod("escape") :
no applicable method for 'escape' applied to an object of class "function"
And if written this way:
db_df %>% mutate(diff_month = timestampdiff("month", column_date_1, column_date_2))
I will also get an error:
You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near month, column_date_1, column_date_2) AS diff_month
And I believe this is because dbplyr is writing "month" into the MySQL query as a quoted string, when it should be unquoted, something like this:
TIMESTAMPDIFF(month, column_date_1, column_date_2) AS `diff_month`
Or is there a better way to calculate month difference using dbplyr?

month is a function in the lubridate package. It looks like mutate is being passed month as the R function month() instead of as text.
If you are using native SQL to compute the time difference, then you should not need the lubridate package.
Two possible solutions:
Remove library(lubridate) from your preamble and call lubridate functions with the lubridate:: prefix, e.g. lubridate::ymd_hms.
Capitalize the parts of your mutate command that you want run in native SQL. This should help the SQL translation distinguish them from the lowercase variants that have other meanings in R, e.g.: db_df %>% mutate(diff_month = TIMESTAMPDIFF(MONTH, column_date_1, column_date_2))
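A further option, which does not depend on how dbplyr resolves a bare symbol such as month or MONTH (some dbplyr versions may reject a symbol that is neither a column nor an R object), is to hand the whole expression to the database as raw SQL with sql(). The sketch below is only an illustration under assumptions: it uses dbplyr's simulated MySQL connection (lazy_frame() with simulate_mysql()) as a stand-in for the real creditos table, so it renders the query without running it.

```r
library(dplyr)
library(dbplyr)

# Stand-in for tbl(con, "creditos"): a lazy table on a simulated MySQL backend.
# The column names come from the question; the values are made up.
db_df <- lazy_frame(
  column_date_1 = as.Date("2020-01-15"),
  column_date_2 = as.Date("2020-04-15"),
  con = simulate_mysql()
)

# sql() marks the text as raw SQL, so MONTH is never looked up as an R object
query <- db_df %>%
  mutate(diff_month = sql("TIMESTAMPDIFF(MONTH, column_date_1, column_date_2)"))

# Render the SQL that would be sent to the server
rendered <- as.character(sql_render(query))
cat(rendered, "\n")
```

Against a real connection, the same mutate() call should apply unchanged to tbl(con, "creditos").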

Related

R apply() format_date

I am a noob trying to troubleshoot an R script written by someone else. The script used to work, but now does not. The problem is related to apply(), whose signature is apply(X, MARGIN, FUN, ...). That says to me that format_date is supposed to be a function. But the person who wrote this script did not define a function called format_date, and I can't find this function in the libraries that are called in the script. Where do I find format_date?
The reason for this line is that the index of this table is date. But we need a date field to export (and not just date as the index), so we are appending it on.
Here is the line throwing the error:
result$date = apply(rownames(result), 1, format_date) # add in date to dataframe
Here is the error message:
Warning: Ignoring unknown parameters: fill
Saving 7 x 7 in image
Error in apply(rownames(result), 1, format_date) :
object 'format_date' not found
You can comment out or delete the line result$date = apply(rownames(result), 1, format_date) from your code.
In older versions of the rtweet package, the function format_date converted the Twitter API datetime format to standard datetime objects. Current versions of rtweet return datetime (POSIX-like) objects, so there is no need for a function like format_date.
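If a date column is still needed for export, one possibility is to build it directly from the row names in base R. This is a minimal sketch under an assumption that the original script does not confirm: the data frame below is a made-up stand-in for result, and its row names are assumed to be "YYYY-MM-DD" strings.

```r
# Hypothetical stand-in for `result`: row names hold the dates (assumed ISO format)
result <- data.frame(
  count = c(10, 12, 9),
  row.names = c("2020-01-01", "2020-01-02", "2020-01-03")
)

# Copy the index into a real Date column; no apply() or format_date needed
result$date <- as.Date(rownames(result))

str(result)
```

If the row names use a different layout, as.Date() accepts a format argument describing it.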

How to use custom SQL function in dbplyr?

I would like to calculate the Jaro-Winkler string distance in a database. If I bring the data into R (with collect) I can easily use the stringdist function from the stringdist package.
But my data is very large and I'd like to filter on Jaro-Winkler distances before pulling the data into R.
There is SQL code for Jaro-Winkler (https://androidaddicted.wordpress.com/2010/06/01/jaro-winkler-sql-code/ and a version for T-SQL) but I guess I'm not sure how best to get that SQL code to work with dbplyr. I'm happy to try and map the stringdist function to the Jaro-Winkler sql code but I don't know where to start on that. But even something simpler like executing the SQL code directly from R on the remote data would be great.
I had hoped that SQL translation in the dbplyr documentation might help, but I don't think so.
You can build your own SQL functions in R. They just have to produce a string that is a valid SQL query. I don't know the Jaro-Winkler distance, but I can provide an example for you to build from:
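union_all = function(table_a, table_b, list_of_columns){
  # note: list_of_columns is accepted but not used in this minimal example
  # extract the database connection from the first table
  connection = table_a$src$con
  # assemble "<query a> UNION ALL <query b>" from the rendered SQL of both tables
  sql_query = build_sql(con = connection,
                        sql_render(table_a),
                        "\nUNION ALL\n",
                        sql_render(table_b)
  )
  # return a lazy table defined by the combined query
  return(tbl(connection, sql(sql_query)))
}
unioned_table = union_all(table_1, table_2, c("who", "where", "when"))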
Two key commands here are:
sql_render, which takes a dbplyr table and returns the SQL code that produces it
build_sql, which assembles a query from strings.
You have choices for your execution command:
tbl(connection, sql(sql_query)) will return the resulting table
dbExecute(connection, as.character(sql_query)) will execute a query without returning the result (useful for dropping tables, creating indexes, etc.)
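To see what sql_render() produces without a live database, the idea can be tried on dbplyr's simulated tables. This is a rough sketch, not the answer's exact method: it assumes lazy_frame() as a stand-in for real remote tables, uses made-up column values, and glues the rendered queries together with paste0() and sql() instead of build_sql().

```r
library(dplyr)
library(dbplyr)

# Simulated remote table (no connection needed); the data is illustrative only
df <- lazy_frame(who = c("a", "b"), where = c("x", "y"))

part_a <- df %>% filter(where == "x")
part_b <- df %>% filter(where == "y")

# sql_render() returns the SQL text behind each lazy table
sql_a <- sql_render(part_a)
sql_b <- sql_render(part_b)

# Combine the two queries and mark the result as SQL
union_sql <- sql(paste0(sql_a, "\nUNION ALL\n", sql_b))
cat(union_sql, "\n")
```

With a real connection, a string built this way could be passed to tbl(connection, union_sql) as in the function above.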
Alternatively, find a way to define the function in SQL as a user-defined function (UDF). You can then simply use the name of that function as if it were an R function in a dbplyr query. When dbplyr does not know how to translate a function, it passes the call through to the SQL back-end unchanged and assumes it will be a function that's available in SQL-land.
This is a great way to decouple the logic. The downside is that the dbplyr expression is now dependent on the database back-end; you can't run the same code on a local data set. One way around that is to create a UDF that mimics an existing R function. dplyr will then use the local R function and dbplyr will use the SQL UDF.
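The pass-through behaviour can be checked on a simulated table. In this sketch JARO_WINKLER is a hypothetical UDF name (it is assumed to exist on the server and is not defined anywhere in R), and lazy_frame() stands in for the real remote table, so the query is only rendered, not executed.

```r
library(dplyr)
library(dbplyr)

# Simulated remote table with two name columns (made-up values)
people <- lazy_frame(name_a = "martha", name_b = "marhta")

# JARO_WINKLER is unknown to dbplyr, so the call is left as-is in the SQL
close_matches <- people %>%
  filter(JARO_WINKLER(name_a, name_b) > 0.9)

rendered <- as.character(sql_render(close_matches))
cat(rendered, "\n")
```

Because the filter is part of the generated SQL, it would run in the database before any data is pulled into R with collect().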
You can use sql(), which passes whatever raw SQL you provide through to the database untranslated.
Example
Here the lubridate equivalent doesn't work on a database backend.
So instead I place the custom SQL code EXTRACT(WEEK FROM meeting_date) inside sql(), like so:
your_dbplyr_object %>%
  mutate(week = sql("EXTRACT(WEEK FROM meeting_date)"))

select command not working in R even after installing the library dplyr

Error message: could not find function "select"
After installing the dplyr package, which contains the select function for R,
I did not expect this error, but I am still getting it.
I want to select a particular column of the dataset, but the dollar sign operator is also not working.
I think I've had this problem as well and I'm not sure what causes it. However, I can usually solve the problem by specifying the package before the command as in the code below.
dplyr::select()
Hope this helps.
@THATguy nailed it! That will solve your problem. This error is often caused by multiple attached packages exporting a function with the same name. In this case specifically, the function select exists in both dplyr and MASS. If you type select in your code, R may pick up the MASS version (the package attached last wins), and if your intention is to select only certain columns of a data frame, then you want select from dplyr. For example:
df <- read.csv("df.csv") %>% # bring in the data frame
  dplyr::select(-x, -y, -z) # remove the x, y, and z columns from the data frame
Or if you want to keep certain columns then drop the '-' in front of the variable.
There are various ways you can try to solve this problem.
Restart the R session with Ctrl + Shift + F10 (in RStudio)
You can use dplyr::select() if that's the select function you want
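As a small self-contained illustration of the namespace-qualified call, the sketch below uses a made-up data frame (the column names x, y, z are borrowed from the example above; the values are arbitrary).

```r
library(dplyr)

# Made-up data frame for illustration
df <- data.frame(x = 1:3, y = 4:6, z = 7:9)

# Keep only x and y; dplyr:: makes the call independent of masking by MASS
kept <- dplyr::select(df, x, y)

# Drop z instead, using the minus sign
dropped <- dplyr::select(df, -z)

print(names(kept))
print(names(dropped))
```

Both calls pick the dplyr version regardless of the order in which packages were attached.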

Why is there no lubridate:::update function?

As said in the title: Why is there no such function? Or in a different way: What is the type of the function? When I type ?update I get something from stats package, but there is a lubridate function as described here on page 7. There also seems to be a lubridate:::update.Date function, but I can't find any explanations for that function.
Background: I use the function in a package and I only got it to work after I used the Depends: field in the DESCRIPTION file. Initially I wanted to use lubridate::update()...
The lubridate package provides the methods lubridate:::update.Date() and lubridate:::update.POSIXt(). Those functions are not exported from the namespace, but they are registered as S3 methods for the generic update() from the stats package, so I assume they are invoked through method dispatch when update() is applied to a POSIXt or Date object while lubridate is loaded.
The help page ?lubridate:::update.POSIXt provides some information concerning the update function within the lubridate package:
Description
update.Date and update.POSIXt return a date with the specified
elements updated. Elements not specified will be left unaltered.
update.Date and update.POSIXt do not add the specified values to the
existing date, they substitute them for the appropriate parts of the
existing date.
Usage
## S3 method for class 'POSIXt'
update(object, ..., simple = FALSE)
The usage section and the examples in the help page indicate that these functions don't need to be addressed individually, as they are called by simply using update() when the lubridate library is loaded.
To inspect these functions one can type, e.g., lubridate:::update.POSIXt in the console (without passing arguments, and without the parentheses).
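The dispatch can be observed directly. In this sketch the date is arbitrary, and the argument names year, month and mday follow the lubridate help page for update; getS3method() is base R's way of looking up a registered method without the ::: operator.

```r
library(lubridate)

# update() is the generic from stats; for a Date it dispatches to lubridate's method
d <- as.Date("2016-08-04")
new_d <- update(d, year = 2010, month = 1, mday = 1)
print(new_d)

# The method is registered for S3 dispatch even though it is not exported
date_method <- getS3method("update", "Date")
print(is.function(date_method))
```

The specified elements replace the corresponding parts of the date rather than being added to them, as the help page describes.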
You need to load the lubridate package:
library(lubridate)
date <- now()
print(date)
new_date <- update(date, year = 2010, month = 1, day = 1)
print(new_date)
Outputs:
"2016-08-04 08:58:08 CEST"
"2010-01-01 08:58:08 CET"

Lubridate Objects Masked After Loading Data.Table

When I load the data.table package after having already loaded the lubridate package, I get the following error message:
Loading required package: data.table
data.table 1.9.4 For help type: ?data.table
*** NB: by=.EACHI is now explicit. See README to restore previous behaviour.
Attaching package: ‘data.table’
The following objects are masked from ‘package:lubridate’:
hour, mday, month, quarter, wday, week, yday, year
Does anyone know a) what's causing this issue and b) how to prevent these objects within lubridate from being masked?
UPDATE:
The issue associated with the above is that I'm using the quarter function from the lubridate package and, after loading the data.table package, I can no longer do so in the same way.
Specifically, when I run quarter(Date, with_year=TRUE) (where Date is a vector of class = Dates), I now get the following error: Error in quarter(Date, with_year = TRUE) : unused argument (with_year = TRUE).
If I simply call quarter(Date), then I can get the desired output without the attached year. For example, if Date is set as simply May 15, 2015 (today), then quarter(Date) will yield 2 (since we're in the 2nd quarter of 2015), but I'd like it to yield 2015.2, hence the importance of the with_year = TRUE option.
Obviously, I can overcome this by using paste to bind together the year and the output of quarter(Date), but I'd prefer to avoid that work-around.
An object name in a package namespace is masked when a new object is defined with the same name. This can be done by the user assigning the name, or by attaching another package that has an object of the same name.
data.table and lubridate have overlapping function names. If you want the lubridate version to be the default, then the easiest solution is to load data.table first, then load lubridate. That way the data.table versions of these functions are the ones masked by the "newer" lubridate versions.
library(data.table)
library(lubridate)
Otherwise, the solution is to use :: (as in package::function) to fully specify which version of the function you want to use, for example:
lubridate::quarter(Date, with_year = TRUE)
Another option, which involves a little less typing but is perhaps a little less clear as well, would be to alias the lubridate functions you want in the global environment at the start of your script.
quarter = lubridate::quarter
Any use of quarter() later in the script will use the lubridate version of the function.
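Both approaches can be tried on the date from the question. This is a small sketch that assumes only lubridate is installed; data.table is deliberately not loaded, since the point is that the qualified call and the alias do not depend on attach order.

```r
library(lubridate)

d <- as.Date("2015-05-15")

# Fully qualified call: always the lubridate version
q_explicit <- lubridate::quarter(d, with_year = TRUE)

# Alias in the global environment: later bare calls use the lubridate version
quarter <- lubridate::quarter
q_alias <- quarter(d, with_year = TRUE)

print(q_explicit)  # year and quarter combined, 2015.2 for this date
```

The alias lives in the global environment, which is searched before any attached package, so it takes precedence over both data.table and lubridate.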
Yet another option is the conflicted package, which provides a system for preferring a function from one package. It is a more deliberate, heavier-handed approach, so you should read the documentation before using it, but your script might include something like this:
library(conflicted)
conflict_prefer("quarter", "lubridate")
The conflicted package provides various alternatives, and using it when loading libraries is good practice because it makes masking explicit.
https://github.com/r-lib/conflicted
