How can I read data from delta lib using SparkR? - r

I couldn't find any reference on accessing Delta data from SparkR, so I tried it myself. First, I created a dummy dataset in Python:
from pyspark.sql.types import StructType,StructField, StringType, IntegerType
data2 = [("James","","Smith","36636","M",2000),
("Robert","","Williams","42114","M",5000),
("Maria","Anne","Jones","39192","F",5000),
("Jen","Mary","Brown","","F",-1)
]
schema = StructType([ \
StructField("firstname",StringType(),True), \
StructField("middlename",StringType(),True), \
StructField("lastname",StringType(),True), \
StructField("id", StringType(), True), \
StructField("gender", StringType(), True), \
StructField("salary", IntegerType(), True) \
])
df = spark.createDataFrame(data=data2,schema=schema)
df.write \
.format("delta")\
.mode("overwrite")\
.option("userMetadata", "first-version") \
.save("/temp/customers")
You can modify this code to change the data and run again to simulate the change over time.
I can query in python using this:
df3 = spark.read \
.format("delta") \
.option("timestampAsOf", "2020-11-30 22:03:00") \
.load("/temp/customers")
df3.show(truncate=False)
But I don't know how to pass the option in SparkR. Can you help me?
%r
library(SparkR)
teste_r <- read.df("/temp/customers", source="delta")
head(teste_r)
It works but returns only the current version.

timestampAsOf will work as a parameter in SparkR::read.df.
SparkR::read.df("/temp/customers", source = "delta", timestampAsOf = "2020-11-30 22:03:00")
This can also be done with SparkR::sql.
SparkR::sql('
SELECT *
FROM delta.`/temp/customers`
TIMESTAMP AS OF "2020-11-30 22:03:00"
')
Alternatively, to do it in sparklyr, use the timestamp parameter in spark_read_delta.
library(sparklyr)
sc <- spark_connect(method = "databricks")
spark_read_delta(sc, "/temp/customers", timestamp = "2020-11-30 22:03:00")
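If the path or timestamp varies between runs, the SQL form can be assembled as a string first. The following is a minimal base-R sketch, assuming a hypothetical helper named delta_as_of_sql (it is not part of SparkR or Delta). It only builds the query text, which could then be handed to SparkR::sql() in a session where Delta is available. VERSION AS OF is Delta's version-based counterpart to TIMESTAMP AS OF.

```r
# Hypothetical helper (not part of SparkR or Delta): builds a time-travel query.
delta_as_of_sql <- function(path, timestamp = NULL, version = NULL) {
  # Exactly one of timestamp / version must be supplied.
  stopifnot(xor(is.null(timestamp), is.null(version)))
  clause <- if (!is.null(timestamp)) {
    sprintf('TIMESTAMP AS OF "%s"', timestamp)
  } else {
    sprintf("VERSION AS OF %d", as.integer(version))
  }
  sprintf("SELECT * FROM delta.`%s` %s", path, clause)
}

query <- delta_as_of_sql("/temp/customers", timestamp = "2020-11-30 22:03:00")
cat(query, "\n")
# In a Spark session this could be run as: df <- SparkR::sql(query)
```

The helper does no escaping, so it is only suitable for paths and timestamps you control.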

Related

Converting cURL command to R GET request

Does anybody know how to convert more complex cURL commands into httr::GET() requests? The issue I am having is that the API only requires a key in the form of a username <YOUR_API_KEY> but does not require any password.
$ curl https://api.goclimate.com/v1/flight_footprint \
-u YOUR_API_KEY: \
-d 'segments[0][origin]=ARN' \
-d 'segments[0][destination]=BCN' \
-d 'segments[1][origin]=BCN' \
-d 'segments[1][destination]=ARN' \
-d 'cabin_class=economy' \
-d 'currencies[]=SEK' \
-d 'currencies[]=USD' \
-G
Perhaps another package like RCurl might be more appropriate?
Thanks!
Well, when you include the ":" with nothing after it, you are specifying a password as the empty string. With httr, that would look like this:
GET("https://api.goclimate.com/v1/flight_footprint",
authenticate("YOUR_API_KEY",""),
query=list(
"segments[0][origin]"="ARN",
"segments[0][destination]"="BCN",
"segments[1][origin]"="BCN",
"segments[1][destination]"="ARN",
"cabin_class"="economy",
"currencies[0]"="SEK",
"currencies[1]"="USD"))
Expanding the indexes of the parameters by hand is kind of messy, so you can write a helper function:
query_expand <- function(x) {
expd <- function(name, value) {
do.call("c", unname(Map(function(name, value) {
if(is.list(value) && !is.null(names(value))) {
xx <- expd(paste0("[", names(value), "]"), value)
setNames(xx, paste0(name, names(xx)))
} else if(is.list(value)) {
xx <- expd(paste0("[",seq_along(value)-1,"]"), value)
setNames(xx, paste0(name, names(xx)))
} else if (length(value)>1) {
setNames(as.list(value), paste0(name, "[", seq_along(value)-1,"]"))
} else {
setNames(list(value), name)
}}, name, value)))
}
expd(names(x), x)
}
Then, if you have your data neatly in an object
params <- list("segments" = list(
list(origin="ARN", destination="BCN"),
list(origin="BCN", destination="ARN")
),
"cabin_class" = "economy",
"currencies" = c("SEK","USD"))
You could just use
GET("https://api.goclimate.com/v1/flight_footprint",
authenticate("YOUR_API_KEY",""),
query = query_expand(params))
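To check what the expanded names look like before sending anything, the bracketed keys can also be generated with a plain loop. This is a base-R sketch that assumes each segment is a named character vector; to_query_string is a hypothetical helper that prints an unencoded preview only, since httr does the real URL encoding.

```r
segments <- list(
  c(origin = "ARN", destination = "BCN"),
  c(origin = "BCN", destination = "ARN")
)
currencies <- c("SEK", "USD")

query <- list()
for (i in seq_along(segments)) {
  for (field in names(segments[[i]])) {
    # curl-style keys are zero-based: segments[0][origin], segments[0][destination], ...
    query[[sprintf("segments[%d][%s]", i - 1L, field)]] <- segments[[i]][[field]]
  }
}
query[["cabin_class"]] <- "economy"
for (i in seq_along(currencies)) {
  query[[sprintf("currencies[%d]", i - 1L)]] <- currencies[i]
}

# Hypothetical preview helper: joins name=value pairs without URL encoding.
to_query_string <- function(q) {
  paste(names(q), unlist(q), sep = "=", collapse = "&")
}
cat(to_query_string(query), "\n")
```

The resulting list has the same shape as the hand-written query above, so it could be passed as the query argument of GET().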

Passing Parameters to an R Script via CMD or Batch

Currently I have an R Script that takes 8 parameters that are hard-coded as the first 8 lines of my script.
I've made a Batch file to try and manually change the parameters on the fly, but it doesn't seem to be working the way I want it to.
Batch that currently runs the script (but doesn't actually change the parameters):
echo off
ECHO PRESS ENTER AT ANY INPUT TO ACCEPT the DEFAULT VALUE.
:: Setting of Variables
@Set /P RScript=Set path to R:_
@Set /P RProgram=Set path to RScript:_
@Set /P RStartDir=Set Start Directory:_
@Set /P Begin=Begin on which Loan?:_
@Set /P End=End on which Loan?:_
@Set /P OutputDir=Set Output Directory:_
@Set /P Deal=Set Deal input file (.txt):_
@Set /P OutputFile=Name Deal Output File:_
@Set /P AsOfDate=As of Date?:_
@Set /P ThirtyYrSpread=Thirty Year Mortgage Spread?:_
:: If Blank (enter), set variables/paths to Defaults (Listed Below)
if "%RScript%"=="" Set RScript=c:\program files\r\r-3.4.3\bin\x64\rscript.exe
if "%RProgram%"=="" Set RProgram=C:\MortgageMatt\Cirt2014-1\0.Mortgage Model.R
if "%RStartDir%"=="" Set RStartDir=C:\MortgageMatt\Cirt2014-1
if "%Begin%"=="" Set Begin=1
if "%End%"=="" Set End=2
if "%OutputDir%"=="" Set OutputDir=C:\MortgageMatt\Cirt2014-1
if "%Deal%"=="" Set Deal=Cirt 2014-1 Loan Level.txt
if "%OutputFile%"=="" Set OutputFile=Cirt 2014-1d
if "%AsOfDate%"=="" Set AsOfDate=62017
if "%ThirtyYrSpread%" == "" Set ThirtyYrSpread=135
echo "%RScript% %RProgram% %RStartDir% %Begin% %End% %OutputDir% %Deal% %OutputFile% %AsOfDate% %ThirtyYrSpread%"
ECHO PLEASE CHECK IF THESE VALUES ARE CORRECT
pause
:: Command Prompt, /c Carries out command specified by string and then terminates
cmd /c ""%RScript%" "%RProgram%" "%RStartDir%" "%Begin%" "%End%" "%OutputDir%" "%Deal%" "%OutputFile%" "%AsOfDate%" "%ThirtyYrSpread%""
So because the parameters were actually hard coded into the R Script, this is what I've added to try to accommodate. Does this look okay? I think this is where I'm running into errors.
Added to R Script
args <- commandArgs(trailingOnly = TRUE)
if (length(args) == 0) {
if (!exists("dataDir")) { stop("variables dataDir not found") }
# Set dataDir variable when Running inside a R Session
args <- c(getwd(), 1, 2, ".", "Cirt 2014-1 Loan Level.txt", "Cirt 2014-1", "62017", 175)
}
print(args)
# Input Values
Input.Directory <- paste(args[1]) ## getwd() , "/", "inputs", sep = "")
Begin.Sim <- args[2]
End.Sim <- args[3]
Output.Directory <- paste(args[1],"\\",args[4],sep = "") ##, "/", "outputs", sep = "")
Pool.ID.File <- args[5] #"Cirt 2014-1 Loan Level.txt"
Pool.ID <- args[6] #"Cirt 2014-1"
asofdate <- args[7] #"62017"
Thirty.Yr.Mort.Spread <- args[8] # 175
When I try to run it in cmd using the .bat, I get an error that says it cannot change the working directory. Does anyone have any suggestions?
I sort-of understand where the error is but I'm struggling to fix it.
The path to my file with everything in it is
C:\MortgageMatt\Cirt2014-1
Edit:
I've also heard of something called R CMD Batch... should I look into that? I'm finding that it's an older technique.
What my code looked like before the Args/IF
# Input Values
Input.Directory <- "C:/Mortgage/Cirt 2014 - 1"
Output.Directory <- "C:/Mortgage/Cirt 2014 - 1"
Pool.ID.File <- "Cirt 2014-1 Loan Level.txt"
Pool.ID <- "Cirt 2014-1 NEW"
start<- 1
sims <- 2 # Number of Simulations
asofdate <- "62017"
Thirty.Yr.Mort.Spread <- 175
You can do all of this in R using one of these packages to parse command-line options:
docopt (my favourite)
optparse
argparse
getopt
or doing it manually -- not recommended.
You also do not want the older R CMD BATCH -- use Rscript (or littler, but littler does not work on Windows).
Code Example
#!/usr/bin/Rscript
suppressMessages(library(docopt))
doc <- "Usage: foo.R [-h] [-x] [--src REPODIR] [--out OUTDIR] [FILES...]
-s --src REPODIR source root directory [default: ~/git]
-o --out OUTDIR output directory [default: /tmp]
-h --help show this help text"
opt <- docopt(doc) # docopt parsing
print(opt)
Use with -h
You get a nice message automatically, with no formatting needed:
edd@rob:/tmp$ Rscript so50256138.R -h
Usage: foo.R [-h] [-x] [--src REPODIR] [--out OUTDIR] [FILES...]
-s --src REPODIR source root directory [default: ~/git]
-o --out OUTDIR output directory [default: /tmp]
-h --help show this help text
edd@rob:/tmp$
Use with argument
Note how one default argument is used, and the other from the command-line:
edd@rob:/tmp$ Rscript so50256138.R -s A
List of 9
$ --src : chr "A"
$ --out : chr "/tmp"
$ --help: logi FALSE
$ -x : logi FALSE
$ FILES : list()
$ src : chr "A"
$ out : chr "/tmp"
$ help : logi FALSE
$ x : logi FALSE
NULL
You can access them in opt by name or by option flag.
The docopt site has more; this is actually a portable specification and the CRAN package implements it for R.
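For comparison, the manual route that the answer advises against could look roughly like the sketch below. It assumes positional arguments in the same order as the question; the names defaults and parse_positional are hypothetical. One thing it makes visible is that commandArgs() returns character strings, so numeric parameters need an explicit conversion.

```r
# Hypothetical defaults, mirroring the values used in the question.
defaults <- list(
  input_dir  = getwd(),
  begin      = 1L,
  end        = 2L,
  output_dir = ".",
  pool_file  = "Cirt 2014-1 Loan Level.txt",
  pool_id    = "Cirt 2014-1",
  asofdate   = "62017",
  spread     = 175
)

# Overlay positional arguments on the defaults, keeping each default's type.
parse_positional <- function(args, defaults) {
  stopifnot(length(args) <= length(defaults))
  out <- defaults
  for (i in seq_along(args)) {
    out[[i]] <- if (is.integer(defaults[[i]])) {
      as.integer(args[i])
    } else if (is.numeric(defaults[[i]])) {
      as.numeric(args[i])
    } else {
      args[i]
    }
  }
  out
}

# In a real script this vector would come from commandArgs(trailingOnly = TRUE).
example_args <- c("C:/MortgageMatt/Cirt2014-1", "5", "10")
opts <- parse_positional(example_args, defaults)
str(opts)
```

Arguments that are not supplied keep their defaults, which is the behaviour the batch file tries to emulate with its "press enter to accept" prompts.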

How to call exe program and input parameters using R?

I want to call an .exe program (spi_sl_6.exe) from R using system(), but I can't pass input parameters to the program that way. The following is my command:
system("D:\\working\\spi_sl_6.exe")
I have searched the net for a long time without success. Please help or give me some ideas on how to achieve this. Thanks in advance.
This is using the Standardized Precipitation Index software from
http://drought.unl.edu/MonitoringTools/DownloadableSPIProgram.aspx.
This seems to give a working solution on Windows (but not without warnings!).
First, download the software and example files:
# Create directory to download software
mydir <- "C:\\Users\\david\\spi"
dir.create(mydir)
url <- "http://drought.unl.edu/archive/Programs/SPI"
download.file(file.path(url, "spi_sl_6.exe"), file.path(mydir, "spi_sl_6.exe"), mode="wb")
# Download example files
download.file(file.path(url, "SPI_samplefiles.zip"), file.path(mydir, "SPI_samplefiles.zip"))
# extract one example file, and write out
temp <- unzip(file.path(mydir, "SPI_samplefiles.zip"), "wymo.cor")
dat <- read.table(temp)
# Use this file as an example input
write.table(dat, file.path(mydir,"wymo.cor"), col.names = FALSE, row.names = FALSE)
According to page 3 of the help file basic-spi-program-information.pdf at the above link, the command line should be of the form spi 3 6 12 <infile.dat >outfile.dat. However, neither of the following worked (run from the command line, not from R), nor did various other ways of passing the parameters.
C:\Users\david\spi\spi_sl_6 3 <C:\Users\david\spi\wymo.cor >C:\Users\david\spi\out.dat
cd C:\Users\david\spi && spi_sl_6 3 <wymo.cor >out.dat
However, the approach from the accepted answer to "Running .exe file with multiple parameters in c#" seems to work. That is, again from the command line:
cd C:\Users\david\spi && (echo 2 && echo 3 && echo 6 && echo wymo.cor && echo out1.dat) | spi_sl_6
To run this from R, you can wrap it in shell() (you will need to change the path to where you have saved the exe):
shell("cd C:\\Users\\david\\spi && (echo 2 && echo 3 && echo 6 && echo wymo.cor && echo out2.dat) | spi_sl_6", intern=TRUE)
out1.dat and out2.dat should be the same.
This throws warning messages, I think from the echo (in R, but not from the command line), but the output file is produced.
You can automate the echo calls slightly, so that all you need to supply is the time parameters.
timez <- c(2, 3, 6)
stime <- paste("echo", timez, collapse =" && ")
infile <- "wymo.cor"
outfile <- "out3.dat"
spiCall <- paste("cd", mydir, "&& (" , stime, "&& echo", infile, "&& echo", outfile, " ) | spi_sl_6")
shell(spiCall)
You can construct the command using sprintf and pass it to system:
cmd_name <- "D:\\working\\spi_sl_6.exe"
param1 <- "a"
param2 <- "b"
system(sprintf("%s %s %s", cmd_name, param1, param2))
Or using system2 (I prefer this option):
system2(cmd_name, args = c(param1, param2))
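Since the echo chain above works by piping answers into the program, another route that may be worth trying is the input argument of system2(), which feeds a character vector to the child process on standard input. The sketch below uses Rscript itself as a stand-in child so that it can run without the SPI program. Whether spi_sl_6.exe accepts its prompts this way is an assumption that has not been tested here.

```r
# Lines that would otherwise be sent through the echo chain.
answers <- c("2", "3", "6", "wymo.cor", "out.dat")

# Stand-in child process: an R one-liner that echoes whatever arrives on stdin.
rscript <- file.path(R.home("bin"), "Rscript")
child_code <- 'writeLines(readLines(file("stdin")))'

received <- system2(rscript, args = c("-e", shQuote(child_code)),
                    input = answers, stdout = TRUE)
print(received)

# For the real program the call might look like this (untested assumption):
# system2("C:/Users/david/spi/spi_sl_6.exe", input = answers)
```

With stdout = TRUE the child's output comes back as a character vector, which makes it easy to confirm that the lines were delivered.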

Simple Curl -H in R

I want to do
curl -H "Authorization: Basic YOUR_API_KEY" -d '{"classifier_id":155, "value":"TEST"}' "https://www.machinelearningsite.com/language/classify"
I tried
h = getCurlHandle(header = TRUE, userpwd = YOUR_API_KEY, netrc = TRUE)
out <- getURL("https://www.machinelearningsite.com/language/classify?classifier_id=155&value=TEST", curl=h,ssl.verifypeer=FALSE)
but it says method not allowed
It's much easier to translate curl command-line arguments into httr calls:
library(httr)
result <- GET("https://www.machinelearningsite.com/language/classify",
add_headers(Authorization=sprintf("Basic %s", YOUR_API_KEY)),
query=list(classifier_id=155, value="TEST"))
Ideally, YOUR_API_KEY would be an environment variable, so you can change that to:
result <- GET("https://www.machinelearningsite.com/language/classify",
add_headers(Authorization=sprintf("Basic %s", Sys.getenv("YOUR_API_KEY"))),
query=list(classifier_id=155, value="TEST"))
You can then retrieve the actual data with:
content(result)

Better string interpolation in R

I need to build up long command lines in R and pass them to system(). I find it very inconvenient to use paste0/paste, or even sprintf, to build each command line. Is there a simpler way?
Instead of this hard-to-read version with too many quotes:
cmd <- paste("command", "-a", line$elem1, "-b", line$elem3, "-f", df$Colum5[4])
or:
cmd <- sprintf("command -a %s -b %s -f %s", line$elem1, line$elem3, df$Colum5[4])
Can I have this:
cmd <- buildcommand("command -a %line$elem1 -b %line$elem3 -f %df$Colum5[4]")
For a tidyverse solution see https://github.com/tidyverse/glue. Example
name="Foo Bar"
glue::glue("How do you do, {name}?")
With version 1.1.0 (CRAN release on 2016-08-19), the stringr package has gained a string interpolation function str_interp() which is an alternative to the gsubfn package.
# sample data
line <- list(elem1 = 10, elem3 = 30)
df <- data.frame(Colum5 = 1:4)
# do the string interpolation
stringr::str_interp("command -a ${line$elem1} -b ${line$elem3} -f ${df$Colum5[4]}")
#[1] "command -a 10 -b 30 -f 4"
This comes pretty close to what you are asking for. When any function f is prefaced with fn$, i.e. fn$f, character interpolation will be performed, replacing each backtick-quoted `...` with the result of running ... as an R expression.
library(gsubfn)
cmd <- fn$identity("command -a `line$elem1` -b `line$elem3` -f `df$Colum5[4]`")
Here is a self contained reproducible example:
library(gsubfn)
# test inputs
line <- list(elem1 = 10, elem3 = 30)
df <- data.frame(Colum5 = 1:4)
fn$identity("command -a `line$elem1` -b `line$elem3` -f `df$Colum5[4]`")
## [1] "command -a 10 -b 30 -f 4"
system
Since any function can be used we could operate directly on the system call like this. We have used echo here to make it executable but any command could be used.
exitcode <- fn$system("echo -a `line$elem1` -b `line$elem3` -f `df$Colum5[4]`")
## -a 10 -b 30 -f 4
Variation
This variation would also work. fn$f also performs substitution of $whatever with the value of variable whatever. See ?fn for details.
with(line, fn$identity("command -a $elem1 -b $elem3 -f `df$Colum5[4]`"))
## [1] "command -a 10 -b 30 -f 4"
Another option would be to use whisker.render from https://github.com/edwindj/whisker which is a {{Mustache}} implementation in R. Usage example:
require(dplyr); require(whisker)
bedFile="test.bed"
whisker.render("processing {{bedFile}}") %>% print
Not really a string interpolation solution, but still a very good option for the problem is to use the processx package instead of system() and then you don't need to quote anything.
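The GetoptLong package also provides variable interpolation through qq() and qqcat(), which by default evaluate code marked as @{...}:
library(GetoptLong)
# sample values so that the example runs on its own
region = c(1, 2)
value = 4
name = "name"
str = qq("region = (@{region[1]}, @{region[2]}), value = @{value}, name = '@{name}'")
cat(str)
qqcat("region = (@{region[1]}, @{region[2]}), value = @{value}, name = '@{name}'")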
https://cran.r-project.org/web/packages/GetoptLong/vignettes/variable_interpolation.html
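If adding a package is not an option, something close to the requested buildcommand() can be sketched in base R. The helper name interpolate and the {expr} placeholder syntax are assumptions made for this illustration. It relies on eval(parse()), so it should only be used with templates you trust.

```r
# Hypothetical helper: replaces each {expr} with the value of the R expression.
interpolate <- function(template, env = parent.frame()) {
  force(env)  # pin the caller's environment before any nested calls
  m <- gregexpr("\\{[^{}]+\\}", template, perl = TRUE)
  exprs <- regmatches(template, m)[[1]]
  values <- vapply(exprs, function(e) {
    code <- substr(e, 2, nchar(e) - 1)  # drop the surrounding braces
    as.character(eval(parse(text = code), envir = env))
  }, character(1), USE.NAMES = FALSE)
  regmatches(template, m) <- list(values)
  template
}

# Same sample data as in the answers above.
line <- list(elem1 = 10, elem3 = 30)
df <- data.frame(Colum5 = 1:4)
cmd <- interpolate("command -a {line$elem1} -b {line$elem3} -f {df$Colum5[4]}")
print(cmd)
```

Each expression must evaluate to a single value. When the result goes to system(), wrapping the interpolated values in shQuote() helps with arguments that contain spaces.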
