How do you use sdf_checkpoint to break Spark table lineage in sparklyr?

I'm attempting to manipulate a Spark table via sparklyr with a dplyr mutate command to construct a large number of variables, and each time this fails with an error message about Java memory exceeding 64 bits.
The mutate command is coded correctly and runs properly when executed separately to construct different data frames for small subsets of the variables, but it fails when all of the variables are included in a single mutate command (or in several mutate statements piped together in the same chain to construct a single dataset).
I suspect that this may be an issue with needing to 'break the computation graph' using sparklyr::sdf_checkpoint(eager = TRUE), but I am unclear precisely how to do this, as the documentation for this function is sparse:
When I run this command, is the appropriate sequence of commands df <- df %>% compute('df') %>% sdf_checkpoint(eager = TRUE)?
When I set sparklyr::spark_set_checkpoint_dir() to an HDFS location, I can see that files appear there after running the command. However, when I then reference df after the checkpointing, the mutate command still breaks with the same error message. When I reference df, am I actually referencing the version whose lineage has been broken, or do I need to do something to 'load' the checkpoint back in? If so, could somebody provide me with the steps to load this dataset back into memory without the lineage?
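What I have in mind is something like the following, using iris as a stand-in for my data; the cluster manager, checkpoint directory, and table/column names are placeholders, and registering the checkpointed result under a new name is only one possible approach:

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "yarn")                        # placeholder cluster manager
spark_set_checkpoint_dir(sc, "hdfs:///tmp/sparklyr_ckpt")   # placeholder HDFS path

df <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

# Build the intermediate columns, checkpoint eagerly, and register the result
# under a new name so that later mutates reference the checkpointed table
# rather than the full lineage.
df_ckpt <- df %>%
  mutate(petal_ratio = Petal_Length / Petal_Width) %>%
  sdf_checkpoint(eager = TRUE) %>%
  sdf_register("iris_ckpt")

df_ckpt %>%
  mutate(petal_ratio_doubled = petal_ratio * 2) %>%
  collect()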

Related

R package: Refresh/Update internal raw data

I am developing a package in R (3.4.3) that has some internal data, and I would like to let users refresh it without having to do too much work. But let me be more specific.
The package contains several functions that rely on multiple parameters, but I am really only exporting wrapper functions that do all the work. Because there is a large number of parameters to pass to those wrapper functions, I am putting them in a .csv file and passing them as a list. The data is therefore a simple .csv file called param.csv, with the names of the parameters in one column and their values in the next column:
parameter_name,parameter_value
ticker, SPX Index
field, PX_LAST
rolling_window,252
upper_threshold,1
lower_threshold,-1
signal_positive,1
signal_neutral,0
signal_negative,-1
I process it by running
param.df <- read.csv(file = "param.csv", header = TRUE, sep = ",")
param.list <- split(param.df[, 2], param.df[, 1])
and I put the list inside the package like this:
usethis::use_data(param.list, overwrite = TRUE)
Then, when I restart the R interactive session and execute
devtools::document()
devtools::load_all()
devtools::install()
require(pkg)
everything works out great: the data is available and can be passed to functions.
First problem: when I change param.csv, save it, and repeat the above four lines of code, the internal param.list is not updated. Is there something I am missing here? Or is it by nature that the developer of the package must run usethis::use_data(param.list, overwrite = TRUE) each time the .csv file the data comes from changes?
Because it is intended to be a sort of map, users will want to tweak parameters for (manual) model calibration. To try to solve this issue I let users provide the functions with their own parameter list (as a .csv file like before). I have a function get_param_list("path/to/file.csv") that does exactly the same processing as above and returns the list, which allows users to pass their own parameter list. De facto, the internal data param.list is treated as a default set of parameters.
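For illustration, get_param_list is essentially the following (the body below is a simplified sketch of what I described above, not the exact package code):

get_param_list <- function(path) {
  # Same processing as for the packaged default: read the two-column csv
  # and turn it into a named list keyed by parameter_name.
  param.df <- read.csv(file = path, header = TRUE, sep = ",")
  split(param.df[, 2], param.df[, 1])
}

# Usage: a user-supplied file stands in for the packaged default param.list
my_params <- get_param_list("path/to/file.csv")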
Second problem: I'd like to let users modify this default list of parameters, param.list, from outside the package in a safe way.
Since the internal object is a list, users can simply modify the element of their choice from outside the package. But that is pretty dangerous in my mind, because users could:
- forget parameters (I have default values inside the functions for that case), or
- replace a parameter with a value of another type, which can provoke an error or, worse, silent side effects.
Is there a way to let users modify the internal list of parameters without opening the package? For example, I had thought about a function that could replace the .csv file with a new one provided by the user, but this brings me back to the first problem: when I restart the R prompt and reinstall the package, nothing happens unless I run usethis::use_data(param.list, overwrite = TRUE) from inside the package. Would another type of data be more useful (e.g. making the data internal)?
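One idea I am considering is a helper that merges user overrides into the packaged defaults and validates them before use; a rough sketch of what I mean (the function name and the checks are purely illustrative, not part of the package):

update_param_list <- function(overrides, defaults = param.list) {
  # Reject parameters that do not exist in the defaults.
  unknown <- setdiff(names(overrides), names(defaults))
  if (length(unknown) > 0)
    stop("Unknown parameters: ", paste(unknown, collapse = ", "))
  # Keep defaults for anything the user did not supply, and require that
  # supplied values have the same class as the defaults they replace.
  for (nm in names(overrides)) {
    if (!identical(class(overrides[[nm]]), class(defaults[[nm]])))
      stop("Parameter '", nm, "' must be of class ", class(defaults[[nm]])[1])
    defaults[[nm]] <- overrides[[nm]]
  }
  defaults
}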
This is my first question on this website, and I realize it's quite long and might be taken as a coding-style question. But if anyone can see an obvious mistake or misunderstanding of package development in R on my side, that would really help me.
Cheers

Using dplyr on local vs remote databases

I'm trying to understand how to use dplyr on a remote database versus data stored locally in R. Namely, I'm unclear on which functions can be used with mutate(). For example, this works just fine:
diamonds %>%
  select(color, cut, price) %>%
  mutate(
    newcol = paste0(cut, color)
  )
However, if I try to use paste0() on a remote database (that is too large to store locally), I get an error saying
Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not Retrieve the result : ERROR: function paste0()
That's one example, but I noticed a similar error when trying to use POSIXct dates and other functions from outside base R.
My question: am I limited to only the very basic aggregating functions mentioned here? If not, how does one use other functions (custom, from additional libraries, etc.) through dplyr on remote databases?
Yes, dplyr uses the dbplyr package for the SQL translations. In it, we have to manually specify how each R command translates to that specific SQL syntax, so in some cases a function may work for one database and not for others. I just checked the translation for PostgreSQL, and it looks like we have a translation for paste() but not paste0(). In the meantime, you can also pass SQL commands inside dplyr verbs; for example, mutate(m = strpos(field1, "search")) will run the PostgreSQL strpos() function, which locates a string within a field.
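As an illustration of both points, assuming an existing PostgreSQL connection object con and a remote table named diamonds (both assumptions, not from the original post):

library(dplyr)
library(dbplyr)

remote_diamonds <- tbl(con, "diamonds")

remote_diamonds %>%
  select(color, cut, price) %>%
  mutate(
    newcol = paste(cut, color),   # paste() has a PostgreSQL translation
    pos    = strpos(color, "E")   # unknown functions are passed through to the database as-is
  ) %>%
  show_query()                    # inspect the generated SQL before collecting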

Print result in R console when passing variables from Tableau to R

I am passing R scripts from Tableau to R and would like to be able to see the result in the R console. I have got this to work in the past, but I am not sure how to do it again.
In R, I ran the following lines of code.
Note: [Petal Length] is just a column of numeric values; it doesn't matter what the numbers are. In this case I got it from the iris dataset (which is pre-packaged in R, as you can see if you run data()).
install.packages("Rserve")
library(Rserve)
run.Rserve()
In Tableau, the calculated field that contains the R script is:
SCRIPT_INT('print(.arg1)', SUM([Petal length]))
Thanks.
After some searching I finally found the answer to this question: 1) you have to have the function print() in the Tableau calculated field, and 2) you have to use the command run.Rserve() as opposed to Rserve().

ROracle Connection on Worker Nodes // Automated reporting with R Markdown

I am running into several distinct problems while trying to speed up some automated report generation for a large dataset. I'm using R + markdown -> HTML to generate a report, looping over ~10K distinct groupings and accessing the data from Oracle.
The system comprises two main parts:
a main script
a markdown template file
The main script sets up the computing environment and parallel processing backends:
library(ROracle)
library(doParallel) ..etc
....
cl <- makeCluster(4)
clusterEvalQ(cl, con<-dbConnect(db,un,pw)) ##pseudocode...
Here the first issue appears: R throws an exception stating that the connections on the workers are invalid, but when I monitor live sessions on Oracle they appear to be fine...
Next, the main script calls the loop for report generation:
foreach(i = 1:nrow(reportgroups), .packages = c('ROracle', 'ggplot2', 'knitr')) %dopar%  ## ...etc
{
  rmarkdown::render("inputfile.Rmd", output_file = "outputfile.html",
                    params = list(groupParam1[i], groupParam2[i], etc))
}
If I run the foreach loop sequentially, i.e., %do% instead of %dopar%, everything seems to work fine: no errors, and the entire set runs correctly (I have only tested up to ~400 groups; I will do a full run of all 10k overnight).
However, if I attempt to run the loop in parallel, invariably pandoc throws an error #1 while converting the file. If I run the broken loop multiple times, the 'task' in the loop (or cluster; I am not sure what 'task' refers to in this context) that causes the error changes.
The template file is pretty basic: it takes in parameters for the groups, runs an SQL query on the connection defined for the cluster worker, and uses ggplot2 + dplyr to generate results. Since the template seems to work when not run through a cluster, I believe the problem must be something to do with the ROracle connection objects on the cluster nodes, although I don't know enough about the subject to really pinpoint the problem.
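For context, the render call and template fit together roughly as follows; the file name, output name, and parameter names below are placeholders, not my actual code:

# Render one report per grouping; the template declares matching entries
# under `params:` in its YAML header and reads them as params$group1, etc.
rmarkdown::render(
  input       = "inputfile.Rmd",
  output_file = paste0("report_", i, ".html"),
  params      = list(group1 = reportgroups$group1[i],
                     group2 = reportgroups$group2[i])
)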
If anyone has had a similar experience, or has a hunch about what is going on, any advice would be appreciated!
Let me know if I can clarify anything...
Thanks
The problem has been solved using a variety of hacks. First, the main loop was re-written to pull data into the local R session instead of running SQL within the markdown report. This seemed to cure some of the collisions in connections to the database, but did not fix it entirely. So I added tryCatch() and repeat functionality to the functions which query data or attempt to connect to the DB. I highly recommend that anyone having issues with ROracle on a cluster implement something similar, and spend the time to review the error messages to see what exactly is going on.
Next, the issues with the pandoc conversion. Most of the problems were solved by fixing a join to a deprecated table in the SQL (the old table lacked data for some groups, so no rows were pulled in and the report would not generate). However, some issues remained where rmarkdown::render() would run successfully, but the actual output would be empty or broken. I have no idea what caused this, but the way I solved it was to compare the file sizes of the generated reports (empty ones were ~300 KB, completed ones 400+ KB) and re-run the report generation for those reports on a single machine after the cluster was shut down. This seemed to clean up the last of the issues.
In summary:
Previously: reports generated, but with significant issues and incomplete data within.
Fixes:
For the cluster, ensure multiple attempts to connect to the DB in case of an issue during the run, and don't forget to close the connection after grabbing the data. In ROracle, creating the driver, i.e.,
driver <- dbDriver("Oracle")
is the most time-consuming part of connecting to a database, but the driver can be reused across multiple connections in a loop, e.g.,
## Create the driver outside the loop, and reuse it inside
for (i in 1:n) {
  con  <- dbConnect(driver, 'username', 'password')
  data <- dbGetQuery(con, 'Select * from mydata')
  dbDisconnect(con)
  ...  ## do something with data
}
is much faster than calling
dbConnect(dbDriver("Oracle"), 'username', 'password')
inside the loop.
Wrap connection / SQL attempts inside a function which implements tryCatch and repeat functionality in case of errors (see the sketch after this list).
Wrap calls to rmarkdown::render() with tryCatch and repeat functionality, and log the status, file size, file name, file location, etc.
Load in the logfile created from the rmarkdown::render() calls and find outlier file sizes (in my case, simply Z-scores of file size, filtering those below -3) to identify reports where issues still exist; fix them and rerun on a single worker.
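A minimal sketch of the kind of tryCatch/repeat wrapper I mean (the function name, credentials, retry count, and back-off are illustrative, not the code I actually used):

library(ROracle)

get_data_with_retry <- function(driver, sql, max_tries = 3) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch({
      con <- dbConnect(driver, 'username', 'password')
      out <- dbGetQuery(con, sql)
      dbDisconnect(con)
      out
    }, error = function(e) {
      message("Attempt ", attempt, " failed: ", conditionMessage(e))
      NULL
    })
    if (!is.null(result)) return(result)
    Sys.sleep(attempt)  # simple back-off before retrying
  }
  stop("Query failed after ", max_tries, " attempts")
}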
Overall, ~4,500 reports were generated from a subset of my data; the run time on a 4-core i5 @ 2.8 GHz was around 1 hour. The total number of 'bad' reports generated was around 1% of the total.

Specify my dataset as working dataset

I am a newbie to R.
I can successfully load my dataset into R-Studio, and I can see my dataset in the workspace.
When I run the command summary(mydataset), I get the expected summary of all my variables.
However, when I run
data(mydataset)
I get the following warning message:
In data(mydataset) : data set ‘mydataset’ not found
I need to run the data() command as recommended for the fitLogRegModel() function, which is part of the PredictABEL package.
Does anybody have a hint on how I can specify mydataset as the working dataset?
You don't need to use the data() command. You can just pass your data to the function:
riskmodel <- fitLogRegModel(data=mydataset, cOutcome=2,
cNonGenPreds=3:10, cNonGenPredsCat=6:8,
cGenPreds=c(11, 13:16), cGenPredsCat=0)
The example uses data(ExampleData) so that it can make the data that ships with the package available to you. Since you already have your data, you don't need to load it.
An alternative, although it has its drawbacks, is to use attach(mydataset). You can then refer to variables without a mydataset$ prefix. The main drawback, as far as I know (although I'd welcome the views of more expert R users), is that if you modify a variable after attaching, it is then not part of the dataset. This can cause confusion and lead to "gotchas". In any case, many expert R users counsel against the use of attach().
