How do I refer to a specific DataFrame in a Spark pipeline?

Suppose I have two Spark DataFrames with the same features and I want to build a pipeline to cross-validate a model on each of them. How can I refer to each table within a pipeline? I use sparklyr in R to do this, but I guess it should be the same with pyspark.
First, I can use the following code to build a linear regression and cross-validate it using ml_cross_validator():
suppressMessages(library(sparklyr))
suppressMessages(library(tidyverse))
sc <- spark_connect(master = "local")
copy_to(sc, mtcars, "mtcars")
mtcars <- tbl(sc, "mtcars")
pipeline <- ml_pipeline(sc) %>%
  ft_r_formula(mpg ~ .) %>%
  ml_linear_regression()
grid <- list(linear_regression = list(reg_param = 0))
cv <- ml_cross_validator(
  sc,
  estimator = pipeline, # use our pipeline to estimate the model
  estimator_param_maps = grid, # use the params in grid
  evaluator = ml_regression_evaluator(sc, metric_name = "rmse"), # how to evaluate the CV
  num_folds = 2, # number of CV folds
  seed = 2018
)
cv_model <- ml_fit(cv, mtcars)
cv_model$avg_metrics_df
#> rmse reg_param_1
#> 1 3.997882 0
Created on 2019-09-13 by the reprex package (v0.3.0)
But if I add another table with the same features:
mtcars_sample <- sdf_sample(mtcars, fraction = 0.8) %>%
  sdf_register("mtcars_sample")
How can I refer to it within the pipeline?
─ Session info ─────────────────────────────────────────────
 setting  value
 version  R version 3.6.0 (2019-04-26)
 os       macOS Mojave 10.14.6
 system   x86_64, darwin15.6.0
 ui       RStudio
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Europe/Stockholm
 date     2019-09-13

Related

Generating a series ID using contents of several vectors of different length

I am trying to generate series IDs and I have been able to do so for the task at hand, but I am trying to make the code more flexible for future use.
Below, you can see how I was generating the ID. They follow a pattern that I am using to build up the ID from its component parts. Currently, this works when I am looking at multiple statistics (data_type_code) for one industry across all states. I am trying to think about how I would go about creating series_id if I wanted to, say, look at multiple industries, or wanted both seasonally adjusted and not seasonally adjusted figures.
library(tidycensus)
data(fips_codes)
prefix <- c("SM")
seasonal_adjustment_code <- c("U")
state_code <- unique(fips_codes$state_code)[1:51]
area_code <- c("00000")
supersector_code <- c("20")
industry_code <- c("000000")
data_type_code <- c("01","02","03")
var_name_list <- c('employees_thousands', 'avg_weekly_hours', 'avg_hourly_earn')
series_id <- unlist(lapply(1:length(data_type_code), function(x)
  paste0(prefix, seasonal_adjustment_code, state_code,
         area_code, supersector_code, industry_code,
         data_type_code[x])))
I tried testing out what would happen if I added another industry, but I don't get an ID for every possible combination. I was trying to make mapply work but got stumped and now the only idea I have is an atrocious series of nested for loops that loops over the length of each ID component. Would appreciate other ideas that build on this or scrap it entirely!
Thanks!
I think I've been able to figure out what you wanted without the state_code vector. It looks like you're trying to list all possible combinations of the vectors. paste0 is vectorised, so the following works:
prefix <- c("SM")
data_type_code <- c("01","02","03")
paste0(prefix, data_type_code)
#> [1] "SM01" "SM02" "SM03"
Created on 2022-11-23 with reprex v2.0.2
However, if we have more than one vector with length > 1, then the values get 'recycled' so it doesn't actually loop through every combination:
prefix <- c("SM", "X")
data_type_code <- c("01","02","03")
paste0(prefix, data_type_code)
#> [1] "SM01" "X02" "SM03"
Created on 2022-11-23 with reprex v2.0.2
It matches the 1st value of prefix with the 1st value of data_type_code, the 2nd with the 2nd, and so on, recycling the shorter vector (looping back to its first value) once it runs out.
To get all possible combinations, we could use expand.grid:
prefix <- c("SM", "X")
data_type_code <- c("01","02","03")
x <- expand.grid(prefix, data_type_code)
paste0(x$Var1, x$Var2)
#> [1] "SM01" "X01" "SM02" "X02" "SM03" "X03"
Created on 2022-11-23 with reprex v2.0.2
UPDATE:
We can expand this out to many columns using tidyverse functions as follows:
library(tidycensus)
library(tidyverse)
data(fips_codes)
prefix <- c("SM")
seasonal_adjustment_code <- c("U")
state_code <- unique(fips_codes$state_code)[1:51]
area_code <- c("00000")
supersector_code <- c("20")
industry_code <- c("000000")
data_type_code <- c("01","02","03")
var_name_list <- c('employees_thousands', 'avg_weekly_hours', 'avg_hourly_earn')
tidyr::expand_grid(
  prefix, seasonal_adjustment_code,
  state_code, area_code, supersector_code,
  industry_code, data_type_code) %>%
  tidyr::unite(
    col = "new_code",
    everything(),
    sep = ""
  )
#> # A tibble: 153 x 1
#> new_code
#> <chr>
#> 1 SMU01000002000000001
#> 2 SMU01000002000000002
#> 3 SMU01000002000000003
#> 4 SMU02000002000000001
#> 5 SMU02000002000000002
#> 6 SMU02000002000000003
#> 7 SMU04000002000000001
#> 8 SMU04000002000000002
#> 9 SMU04000002000000003
#> 10 SMU05000002000000001
#> # ... with 143 more rows
Created on 2022-11-29 with reprex v2.0.2
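For comparison, the same all-combinations idea can be sketched in base R alone. This is a minimal sketch and not part of the answer above: make_series_ids is a hypothetical helper name, and the "S" seasonal adjustment code and the shortened component list are assumptions made only to show a second component with more than one value.

```r
# Hypothetical helper: paste together every combination of the ID components.
make_series_ids <- function(...) {
  parts <- list(...)
  # expand.grid() varies its FIRST column fastest, so reverse the components
  # to make the LAST one vary fastest (the ordering tidyr::expand_grid() uses).
  grid <- expand.grid(rev(parts), stringsAsFactors = FALSE,
                      KEEP.OUT.ATTRS = FALSE)
  # Restore the original column order and paste the columns row by row.
  do.call(paste0, unname(as.list(rev(grid))))
}

ids <- make_series_ids(
  prefix = "SM",
  seasonal_adjustment_code = c("S", "U"),  # assumed values, for illustration
  state_code = c("01", "02"),
  data_type_code = c("01", "02", "03")
)
length(ids)  # 1 * 2 * 2 * 3 = 12 combinations
```

Because every component is passed as its own argument, adding another industry or adjustment code only means lengthening the corresponding vector.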

Can't convert a double vector to function using timetk on R version 4.1.0

I get an error when I use the tk_augment_slidify function from the timetk library in R version 4.1.0 (reprex below). However, the same code works fine when I run it in R Markdown. Can someone please help me solve this issue? It shows the following error:
"Error: Can't convert a double vector to function"
Sample data:
purchased_at | revenue
2018-06-03   | 32735.89
2018-06-10   | 38290.07
2018-06-17   | 39973.95
2018-06-24   | 35621.93
2018-07-01   | 28983.72
Standardize and log-transform the data, then create the lags:
transformed_transdata <-
  transdata %>%
  mutate(revenue = log(revenue),
         revenue = standardize_vec(revenue))
transformed_transdata %>%
  bind_rows(
    future_frame(.data = .,
                 .date_var = purchased_at,
                 .length_out = 8)
  ) %>%
  tk_augment_lags(.data = .,
                  .value = revenue,
                  .lags = 8) %>%
  tk_augment_slidify(.data = .,
                     .value = revenue_lag8,
                     .f = mean,
                     .period = 8)
sessionInfo(package = c("tidyverse","timetk","dplyr"))
#> R version 4.1.0 (2021-05-18)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 18363)
#> other attached packages:
#> [1] tidyverse_1.3.1 timetk_2.6.1 dplyr_1.0.7
> knitr_1.33
My best guess is that you have saved some numbers into a variable named mean. In that case, .f = mean hands those numeric values to tk_augment_slidify instead of the base::mean function.
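The suspected masking can be reproduced in base R without timetk. The sketch below assumes, purely for illustration, that an earlier line in the session assigned a numeric vector to mean. It may also explain why knitting works, since a knitted document typically runs in a fresh session where no such variable exists.

```r
# Assumption for illustration: something earlier in the session overwrote `mean`.
mean <- c(2.5, 3.1, 4.0)

# Calling mean(...) still works, because R skips non-function objects when it
# looks up a name used in call position.
call_result <- mean(c(1, 2, 3))

# Passing `mean` as an argument value, however, hands over the numeric vector.
is_function_when_passed <- is.function(mean)        # FALSE
is_function_qualified   <- is.function(base::mean)  # TRUE

# Removing the masking object restores the usual lookup.
rm(mean)
is_function_after_rm <- is.function(mean)           # TRUE
```

If this is the cause, either rm(mean) or writing .f = base::mean in the tk_augment_slidify call should avoid the error.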

What to do about this makePSOCKcluster error?

I'm running a function called mixedpower() that lets you simulate power at different sample sizes. However I get the following error and don't know how to handle it!
Error in makePSOCKcluster(names = spec, ...) :
Cluster setup failed. 11 of 11 workers failed to connect.
Here is a reproducible example. (Just to head off any comments, I know that gear shouldn't be a random variable here, but I'm just using it for the purpose of this example.)
# You can download the mixedpower package like this.
if (!require("devtools")) {
  install.packages("devtools", dependencies = TRUE)
}
devtools::install_github("DejanDraschkow/mixedpower")
library(mixedpower)
library(lme4)
m <- lmer(mpg ~ cyl + disp + hp + drat + (1|gear), data = mtcars)
mtcars$gear_num <- as.numeric(mtcars$gear)
power <- mixedpower(model = m, data = mtcars,
                    fixed_effects = c("cyl", "disp", "hp", "drat"),
                    simvar = "gear_num", steps = c(3, 4, 5),
                    critical_value = 2)
And if it is helpful I am running RStudio version 1.1.453 on MacOS Mojave version 10.14.6.
I can't test this, but I strongly suspect that your problem is related to a bug in recent versions of R (patched in later releases). The RStudio issue linked in the comment below describes a workaround: add the following to your .Rprofile file. It's also possible that updating to the latest patched version of R would fix it.
## WORKAROUND: https://github.com/rstudio/rstudio/issues/6692
## Revert to 'sequential' setup of PSOCK cluster in RStudio Console on macOS and R 4.0.0
if (Sys.getenv("RSTUDIO") == "1" && !nzchar(Sys.getenv("RSTUDIO_TERM")) &&
    Sys.info()["sysname"] == "Darwin" && getRversion() >= "4.0.0") {
  parallel:::setDefaultClusterOptions(setup_strategy = "sequential")
}

How to use a custom font family in UpSetR plots?

I am trying to create a set plot using the UpSetR package; however, I'd like to control the font family. The ideal approach would be to use the theme function from ggplot2, but UpSetR does not support this at the moment (there has been an open issue on GitHub since 2016), and trying it results in NULL.
Example to create test plot:
# R version ---------------------------------------------------------------
# platform x86_64-w64-mingw32
# arch x86_64
# os mingw32
# system x86_64, mingw32
# status
# major 3
# minor 5.1
# year 2018
# month 07
# day 02
# svn rev 74947
# language R
# version.string R version 3.5.1 (2018-07-02)
# nickname Feather Spray
# Package versions --------------------------------------------------------
# Assumes packages are already installed
packageVersion(pkg = "extrafont") == "0.17"
packageVersion(pkg = "UpSetR") == "1.3.3"
packageVersion(pkg = "ggplot2") == "3.1.0"
# Load UpSetR -------------------------------------------------------------
library(UpSetR)
library(extrafont)
library(ggplot2)
# Example -----------------------------------------------------------------
movies <- read.csv(system.file("extdata", "movies.csv", package = "UpSetR"),
                   header = TRUE, sep = ";")
upset(data = movies,
      order.by = "freq",
      keep.order = TRUE,
      mainbar.y.label = "Example plot",
      point.size = 4,
      line.size = 1,
      sets.x.label = NULL)
Going forward, the ideal would be for UpSetR to support layers via the + theme() function from ggplot2; however, UpSetR is not able to use the "+" "layer name" logic. For example, if + theme(text = element_text(family = "Times New Roman")) were added at the end of the call above, it would return NULL and produce no plot.
Can you please suggest any workaround (or customization of a function in the package) that would support custom fonts in the example plot above produced by UpSetR? Alternatively, is there a way to force a default font family in all plots without specifying any family arguments manually?
One way of achieving this with another UpSet implementation would be:
# install if needed
if(!require(ggplot2movies)) install.packages('ggplot2movies')
if(!require(devtools)) install.packages('devtools')
devtools::install_github('krassowski/complex-upset')
movies = as.data.frame(ggplot2movies::movies)
genres = c('Action', 'Animation', 'Comedy', 'Drama', 'Documentary', 'Romance', 'Short')
library(ComplexUpset)
library(ggplot2)
upset(
  movies, genres, min_size=45, width_ratio=0.1,
  themes=upset_default_themes(text=element_text(family='Times New Roman'))
)
Disclaimer: I am the author of this implementation.

Passing arguments to XLConnect functions with ellipses

I have a bunch of Excel files in one folder, and would like to write a single function as follows:
# takes a file path and sheetname for an excel workbook, passes on additional params
getxl_sheet <- function(wb_path, sheetname, ...) {
  testbook <- XLConnect::loadWorkbook(wb_path)
  XLConnect::readWorksheet(testbook, sheet = sheetname, ...)
}
However, when I run the following,
set.seed(31415)
x <- rnorm(15); y <- rnorm(15)
randvals <- data.frame(x=x, y=y)
XLConnect::writeWorksheetToFile("~/temp_rands.xlsx", randvals, "Sheet1")
my_vals <- getxl_sheet("~/temp_rands.xlsx", "Sheet1", endRow=5)
my_vals returns the entire 15 by 2 dataframe, as opposed to just stopping at the fifth row (likewise if I use 'endCol=1' for example, it gives both columns). On the other hand, passing additional arguments in base R hasn't been a problem:
my_plot <- function(...) {
  plot(...)
}
#my_plot(x=x, y=y, pch=16, col="blue")
works as expected. What's the problem with the function defined above to read in xlsx files? Thanks.
devtools::session_info()
Session info---------------------------------------------------------------------
 setting  value
 version  R version 3.1.1 (2014-07-10)
 system   x86_64, darwin13.1.0
 ui       RStudio (0.98.1062)
 language (EN)
 collate  en_US.UTF-8
 tz       America/New_York
Packages-------------------------------------------------------------------------
 package       * version    date       source
 devtools        1.6.0.9000 2014-11-26 Github (hadley/devtools#bd9c252)
 rJava           0.9.6      2013-12-24 CRAN (R 3.1.0)
 rstudioapi      0.1        2014-03-27 CRAN (R 3.1.0)
 XLConnect     * 0.2.9      2014-08-14 CRAN (R 3.1.1)
 XLConnectJars * 0.2.9      2014-08-14 CRAN (R 3.1.1)
The dots mechanism only works when the function receiving the dots expects them. Unlike plot.default, readWorksheet is not designed to handle an ellipsis, so you need to decode the arguments yourself:
getxl_sheetRCshort <- function(wb_path, sheetname, ...) {
  # capture the dots as a named list so individual arguments can be picked out
  arglist <- list(...)
  testbook <- XLConnect::loadWorkbook(wb_path)
  XLConnect::readWorksheet(testbook, sheet = sheetname,
                           endRow = arglist[['endRow']], endCol = arglist[['endCol']])
}
> my_vals <- getxl_sheetRCshort("~/temp_rands.xlsx", "Sheet1", endRow=5)
> my_vals
           x           y
1  1.6470129 -1.27323204
2 -1.1119872 -1.77141948
3 -1.5485456  1.40846809
4 -0.7483785 -0.09450125
You could make this even more general by matching against the entire formals() list of the readWorksheet function; there are worked examples on Stack Overflow that illustrate this. Fortunately, the NULL that 'endCol' receives when no value is supplied appears to be ignored.
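The formals() matching idea can be sketched in base R as follows. This is a hedged illustration and not XLConnect code: read_stub is a hypothetical stand-in for a reader function (its argument list is an assumption, not readWorksheet's real signature), and call_with_known_args is a hypothetical helper name.

```r
# Hypothetical stand-in for a reader with optional named arguments.
read_stub <- function(data, startRow = 1, endRow = nrow(data),
                      endCol = ncol(data)) {
  data[startRow:endRow, seq_len(endCol), drop = FALSE]
}

# Forward only those dots arguments that the target function declares.
call_with_known_args <- function(target, ..., .fixed = list()) {
  dots <- list(...)
  known <- intersect(names(dots), names(formals(target)))
  do.call(target, c(.fixed, dots[known]))
}

df <- data.frame(x = 1:10, y = 11:20)
res <- call_with_known_args(read_stub, endRow = 5, bogus = TRUE,
                            .fixed = list(data = df))
dim(res)  # 5 rows, 2 columns; the undeclared `bogus` argument is dropped
```

One design consequence is that arguments the target does not declare are dropped silently, so a misspelled argument name would go unnoticed unless a warning is added.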
