I am trying to use dbplyr and the trunc function from pl/sql, to mutate the date column to the start of the month.
df %>% mutate(start_month = sql(trunc(date_column, 'month'))
however this throws an error invalid identifier when executing the query. I think it is because when it is parsed to pl/sql as a string the query reads select .... trunc("date_column",'month') as start_month so it doesn't recognise as a column name due to the quotes inside the sql function.... Any ideas on how to do this another way or get around this error would be great.
You can probably achieve this by removing the quote marks from month and possible the sql function from inside your mutate.
dbplyr works by translating dplyr commands into the equivalent sql. Where there is no translation it defaults to leaving the command as is. You can make use of this feature to pass in sql commands.
I recommend trying
df %>% mutate(start_month = TRUNC(date_column, MONTH))
As dbplyr translations are not defined for TRUNC or MONTH these should appear in your plsql query in the same way as they appear in your R code:
SELECT ... TRUNC(date_column, MONTH) AS start_month
I recommend writing the commands you do not want translated in capitals because R is case sensitive but sql is not. So even if month is an R function, MONTH is probably not an R function and hence there is no chance dbplyr will try and translate it.
Related
I'm very surprised if this kind of problems cannot be solved with sparklyr:
iris_tbl <- copy_to(sc, aDataFrame)
# date_vector is a character vector of element
# in this format: YYYY-MM-DD (year, month, day)
for (d in date_vector) {
...
aDataFrame %>% mutate(newValue=gsub("-","",d)))
...
}
I receive this error:
Error: org.apache.spark.sql.AnalysisException: Undefined function: 'GSUB'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 2 pos 86
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.failFunctionLookup(SessionCatalog.scala:787)
at org.apache.spark.sql.hive.HiveSessionCatalog.lookupFunction0(HiveSessionCatalog.scala:200)
at org.apache.spark.sql.hive.HiveSessionCatalog.lookupFunction(HiveSessionCatalog.scala:172)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$6$$anonfun$applyOrElse$39.apply(Analyzer.scala:884)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$6$$anonfun$applyOrElse$39.apply(Analyzer.scala:884)
at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun
But with this line:
aDataFrame %>% mutate(newValue=toupper("hello"))
things work. Some help?
It may be worth adding that the available documentation states:
Hive Functions
Many of Hive’s built-in functions (UDF) and built-in aggregate functions (UDAF) can be called inside dplyr’s mutate and summarize. The Languange Reference UDF page provides the list of available functions.
Hive
As stated in the documentation, a viable solution should be achievable with use of regexp_replace:
Returns the string resulting from replacing all substrings in
INITIAL_STRING that match the java regular expression syntax defined
in PATTERN with instances of REPLACEMENT. For example,
regexp_replace("foobar", "oo|ar", "") returns 'fb.' Note that some
care is necessary in using predefined character classes: using '\s' as
the second argument will match the letter s; '\\s' is necessary to
match whitespace, etc.
sparklyr approach
Considering the above it should be possible to combine sparklyr pipeline with
regexp_replace to achieve effect cognate to applying gsub on the desired column. Tested code removing the - character within sparklyr in variable d could be build as follows:
aDataFrame %>%
mutate(clnD = regexp_replace(d, "-", "")) %>%
# ...
where class(aDataFrame ) returns: "tbl_spark" ....
I would strongly recommend you read the sparklyr documentation before proceeding. In particular, you're going to want to read the section on how R is translated to SQL (http://spark.rstudio.com/dplyr.html#sql_translation). In short, a very limited subset of R functions are available for use on sparklyr dataframes, and gsub is not one of those functions (but toupper is). If you really need gsub you're going to have to collect the data in to a local dataframe, then gsub it (you can still use mutate), then copy_to back to spark.
In perl/python DBI APIs have a mechanism to safely interpolate in parameters to an sql query. For example in python I would do:
cursor.execute("SELECT * FROM table WHERE value > ?", (5,))
Where the second parameter to the execute method is a tuple of parameters to add into the sql query
Is there a similar mechanism for R's DBI compliant APIs? The examples I've seen never show parameters passed to the query. If not, what is the safest way to interpolate in parameters to a query? I'm specifically looking at using RPostgresSQL.
Just for completeness, I'll add an answer based on Hadley's comment. The DBI package now has the function sqlInterpolate which can also perform this. It requires a list of function arguments to be named in the sql query that all must start with a ?. Excerpt from the DBI manual below
sql <- "SELECT * FROM X WHERE name = ?name"
sqlInterpolate(ANSI(), sql, name = "Hadley")
# This is safe because the single quote has been double escaped
sqlInterpolate(ANSI(), sql, name = "H'); DROP TABLE--;")
Indeed the use of bind variables is not really well documented. Anyway the ODBC commands in R work differently for different databases. One possibility for postgres would be like this:
res <- postgresqlExecStatement(con, "SELECT * FROM table WHERE value > $1", c(5))
postgresqlFetch(res)
postgresqlCloseResult(res)
Hope it helps.
SELECT EXTRACT (YEAR FROM Starting_Date) as Orderyear,
FROM PGME
WHERE ID =1
I tried to select the year from the Starting_Date which the format is "15/01/1968". But it keep saying the Syntax error. Any recommended? Thank you for advance.
extract is a MySQL function that isn't part of the ANSI SQL standard, and from the comments it seems you're trying to use it with MS-Access. Instead, you could consider using the datepart function which is more-or-less the MS-Access equivalent. Additionally, as lad2025 noted in his comment, you have a redundant comma after Orderyear:
SELECT DATEPART("yyyy", Starting_Date) AS orderyear
FROM pgme
WHERE id = 1
I'm programmatically fetching a bunch of datasets, many of them having silly names that begin with numbers and have special characters like minus signs in them. Because none of the datasets are particularly large, and I wanted the benefit R making its best guess about data types, I'm (ab)using dplyr to dump these tables into SQLite.
I am using square brackets to escape the horrible table names, but this doesn't seem to work. For example:
data(iris)
foo.db <- src_sqlite("foo.sqlite3", create = TRUE)
copy_to(foo.db, df=iris, name="[14m3-n4m3]")
This results in the error message:
Error in sqliteSendQuery(conn, statement, bind.data) : error in statement: no such table: 14m3-n4m3
This works if I choose a sensible name. However, due to a variety of reasons, I'd really like to keep the cumbersome names. I am also able to create such a badly-named table directly from sqlite:
sqlite> create table [14m3-n4m3](foo,bar,baz);
sqlite> .tables
14m3-n4m3
Without cracking into things too deeply, this looks like dplyr is handling the square brackets in some way that I cannot figure out. My suspicion is that this is a bug, but I wanted to check here first to make sure I wasn't missing something.
EDIT: I forgot to mention the case where I just pass the janky name directly to dplyr. This errors out as follows:
library(dplyr)
data(iris)
foo.db <- src_sqlite("foo.sqlite3", create = TRUE)
copy_to(foo.db, df=iris, name="14M3-N4M3")
Error in sqliteSendQuery(conn, statement, bind.data) :
error in statement: unrecognized token: "14M3"
This is a bug in dplyr. It's still there in the current github master. As #hadley indicates, he has tried to escape things like table names in dplyr to prevent this issue. The current problem you're having arises from lack of escaping in two functions. Table creation works fine when providing the table name unescaped (and is done with dplyr::db_create_table). However, the insertion of data to the table is done using DBI::dbWriteTable which doesn't support odd table names. If the table name is provided to this function escaped, it fails to find it in the list of tables (the first error you report). If it is provided escaped, then the SQL to do the insertion is not synatactically valid.
The second issue comes when the table is updated. The code to get the field names, this time actually in dplyr, again fails to escape the table name because it uses paste0 rather than build_sql.
I've fixed both errors at a fork of dplyr. I've also put in a pull request to #hadley and made a note on the issue https://github.com/hadley/dplyr/issues/926. In the meantime, if you wanted to you could use devtools::install_github("NikNakk/dplyr", ref = "sqlite-escape") and then revert to the master version once it's been fixed.
Incidentally, the correct SQL-99 way to escape table names (and other identifiers) in SQL is with double quotes (see SQL standard to escape column names?). MS Access uses square brackets, while MySQL defaults to backticks. dplyr uses double quotes, per the standard.
Finally, the proposal from #RichardScriven wouldn't work universally. For example, select is a perfectly valid name in R, but is not a syntactically valid table name in SQL. The same would be true for other reserved words.
I have a data frame with stock data (date, symbol, high, low, open, close, volume). Using r and mysql and sqldf and rmysql I have a list of unique dates and unique stock symbols.
What I need now is to loop through the data and find the close on two specified dates. For instance:
stkData contains (date, symbol, high, low, open, close, volume)
dates contains unique dates
symbol contains unique symbols
I want to loop through the lists in a sqldf statement as such:
'select stkData$close from stkData where symbol = symbol[k] and date = dates[j]'
k and j would be looped numbers, but my problem is the symbol[k] and dates[j] parts.
sqldf won't read them properly (or I can't code properly). I've tried as.Date, as.character with no luck. I get the following error message:
Error in sqliteExecStatement(con, statement, bind.data) :
RS-DBI driver: (error in statement: near "[4,]": syntax error)
You're pretty far off in terms of syntax for sqldf, unfortunately. You can't use $ or [] notations in sqldf calls because those are both R syntax, not SQL syntax. It's an entirely separate language. What's happening is that sqldf is taking your data frame, importing it into SQLite3, executing the SQL query that you supply against the resulting table, and then importing the result set back into R as a data frame. No R functionality is available within the SQL.
It's not clear to me what you're trying to do, but if you want to run multiple queries in a loop, you probably want to construct the SQL query as a string using the R function paste(), so that when it gets to SQLite3 it'll just be static values where you currently have symbol[k] and dates[j].
So, you'll have something like the following, but wrapped in a loop for j and k:
sqldf(paste('select close from stkData where symbol = ', symbol[k],
' and date = ', dates[j]))
You might need to construct the select statement as a string with paste before it gets passed to your SQL caller. Something like:
combo_kj <- expand.grid(ksym=symbol[1:k], jdates=dates[1:j])
SQLcalls <- paste('select close from stkData where symbol = ',
combo_kj$ksym,
' and date = '
combo_kj$jdates,
sep="")
And then loop over SQLcalls with whatever code you are using.
Preface sqldf with fn$ as shown and then strings within backticks will be replaced by the result of running them in R and strings of the form $variable will be replaced by the contents of that variable (provided the variable name contains only word characters). Note that SQL requires that character constants be put in quotes so be sure to surround the backticks or $variable with quotes:
fn$sqldf("select close from stkData
where symbol = '`symbol[k]`' and
date = '`dates[j]`' ")
To use the $variable syntax try this:
mysymbol <- symbol[k]
mydate <- dates[j]
fn$sqldf("select close from stkData
where symbol = '$mysymbol' and
date = '$mydate' ")
Also see example 5 on the sqldf github page: https://github.com/ggrothendieck/sqldf