I'm using the testthat package for unit testing in R. I have a function CalcByResultSubModel that calls another function, CalculateX, inside it. This is the main function:
CalcByResultSubModel = function(doll_data, fn_master, modelPath) {
  # load sub model result
  load(modelPath)
  # calculation
  for (abc in c("ABC", fn_master$fn_a)) {
    # columns
    col_name = paste0("x", abc)
    iterModel = resultSubmodel[[abc]]
    # calculate yhat X
    doll_data[, col_name] = iterModel %>%
      purrr::map(., function(imodel) {
        CalculateX(doll_data, imodel)
      }) %>%
      as.data.frame(.) %>%
      apply(., 1, mean)
    message(paste(col_name, "calculated"))
  }
  doll_data
}
This is the function CalculateX
CalculateX = function(data, model) {
  iterData = data %>%
    dplyr::select(model$feature_names) %>%
    as.matrix(.)
  set.seed(131)
  result = predict(model, iterData, missing = NA)
  result = matrix(result, 2)[2, ]
  return(result)
}
In order to unit test this, we have to mock the function CalculateX. The complication is that the function is called inside a for loop in the main function, and I'm new to this scenario. Can anyone help me with mocking a function that is called in a for loop? This is the mocking code I tried:
local_mock(CalculateX = function(data, model) {
  for (abc in c("ABC", fn_master$fn_a)) {
    case_when(
      abc == "feature1" ~ .ReadCsvWrapper("feature1.csv"),
      abc == "feature2" ~ .ReadCsvWrapper("feature2.csv"),
      abc == "feature3" ~ .ReadCsvWrapper("feature3.csv"))
  }
})
But the above approach doesn't seem to work for me. Can anyone help me with this?
There are a couple of problems in your code. First, CalcByResultSubModel calls CalculateX in the loop
for(abc in c("ABC", fn_master$fn_a)) {
# columns
col_name = paste0("x", abc)
iterModel = resultSubmodel[[abc]]
# calculate yhat X
doll_data[, col_name] = iterModel %>%
purrr::map(., function(imodel) {
CalculateX(data, imodel)
}) %>%
as.data.frame(.) %>%
apply(., 1, mean)
message(paste(col_name, "calculated"))
}
so you don't need to put that for(abc in c("ABC", fn_master$fn_a)) loop into the mocked function. Just set it up to return results similar to what one call to the real function would return.
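For example, a minimal sketch (the fixed return value and the assumed 10-row test fixture are placeholders; note also that in testthat 3e, local_mocked_bindings() replaces the deprecated local_mock()):

# The mock ignores its inputs and returns one fake prediction per row
# of the test data (assumed to have 10 rows here).
local_mock(CalculateX = function(data, model) {
  rep(0.5, 10)
})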
The second problem is that in the real CalculateX, you have set.seed(131). This is almost certainly a bad idea. It resets the random number generator to a fixed setting every time CalculateX is called, which makes it completely non-random, and also makes calls to random number functions afterwards repeat their outputs.
It's often a good idea to set the seed once at the top of your testing script so that tests are predictable, but resetting it as often as you did is not.
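For example, in a testthat setup file (the file name here just follows testthat's setup-file convention):

# tests/testthat/setup-seed.R -- runs once before the tests.
# Seed the RNG a single time so the whole run is reproducible,
# instead of resetting it inside every function call.
set.seed(131)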
I'm attempting to iterate over a list of column suffixes and, for each, call a function that performs a bunch of summaries. My problem is building the column name programmatically and having R interpret it as a variable rather than a string. I've tried various combinations of {{}}, !!, and sym() without success.
I can do this for the LHS in a mutate without problems, but the correct syntax for the RHS seems to elude me.
A stripped down version follows:
fsumm = function(data, lookAhead) {
  summarise(data,
    mean = mean("prefix{lookAhead}", na.rm = TRUE)
  )
}
lookAheadList = c(2, 5, 10, 20)
for (lookAhead in lookAheadList) {
  p = df %>%
    fsumm(lookAhead)
  print(p)
}
UPDATE:
I seem to be making some progress by using:
summarise( mean=mean( !!sym(glue("prefix{lookAhead}")), na.rm=TRUE) )
I thought, though, that tidyeval was meant to let me drop the explicit call to glue(). Is there a simpler way of writing this?
The following seems to work. Not sure if it's the very best solution, or not:
fsumm = function(data, lookAhead) {
  colnm = sym(glue("maxCloseGainPctNext{lookAhead}"))
  data %>%
    summarise(
      ...,  # other summaries elided
      "mean({{lookAhead}})" := mean({{colnm}}, na.rm = TRUE)
    )
}
I'm trying to learn async programming using R so that I can implement an app that requires generation of random numbers with specified seeds (always with specified seeds). I've been using R.utils::withSeed for this, but I know that withr::with_seed also exists so I thought I might check that out.
I know random number generation is tricky, so I've been trying to run simple examples to try to understand how things work. I need:
...to always get the same random numbers when using the same seed
...to get the same random numbers with the same seed regardless of whether I'm using the async framework (so I should be able to run the code outside a promise and get the same answer)
In the code below, I define two functions to generate random numbers, setting the seed with either withr::with_seed or R.utils::withSeed.
These two functions give me the same answer when run outside the promise.
These two functions give different answers when run inside the promise.
The withr::with_seed version gives the same answer inside or outside the promise.
The R.utils::withSeed version gives different answers inside or outside the promise.
The answers seem to be consistent across multiple runs, however.
My question is: why? Is this a bug in R.utils::withSeed, or am I misunderstanding something?
Code
library(future)
library(promises)
plan(multisession)
s0_R = function(seed = 1, n = 1) {
  R.utils::withSeed(expr = {
    rnorm(n)
  }, seed = seed)
}

s0_w = function(seed = 1, n = 1) {
  withr::with_seed(
    seed = seed,
    code = {
      rnorm(n)
    })
}

s_R = function(seed = 1, n = 1) {
  future_promise(
    {
      Sys.sleep(5)
      s0_R(seed, n)
    },
    seed = TRUE
  )
}

s_w = function(seed = 1, n = 1) {
  future_promise(
    {
      Sys.sleep(5)
      s0_w(seed, n)
    },
    seed = TRUE
  )
}
s0_R(123) %>%
paste(" (R.utils::withSeed)\n") %>%
cat()
# -0.560475646552213 (R.utils::withSeed)
s0_w(123) %>%
paste(" (withr::with_seed)\n") %>%
cat()
# -0.560475646552213 (withr::with_seed)
s_R(123) %...>%
paste(" (async, R.utils::withSeed)\n") %...>%
cat()
s_w(123) %...>%
paste(" (async, withr::with_seed)\n") %...>%
cat()
# Results occur later...
# -0.968592726552943 (async, R.utils::withSeed)
# -0.560475646552213 (async, withr::with_seed)
The future package sets the default RNG kind to L'Ecuyer-CMRG, whereas R's default is Mersenne-Twister. withr::with_seed resets the RNG kind to "default" (i.e. Mersenne-Twister) unless one is explicitly specified via the .rng_kind argument. R.utils::withSeed, on the other hand, does nothing about the RNG kind by default, but a kind can be specified through its ... argument, which is passed on to set.seed. In your example, s0_R can be modified as follows to get the same results inside and outside the promise:
s0_R = function(seed = 1, n = 1) {
  R.utils::withSeed(expr = {
    rnorm(n)
  }, seed = seed, kind = "default")
}
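You can see the mismatch directly by comparing the RNG kind in the main session with the kind inside a future (output shown as comments; the main-session kind is typically Mersenne-Twister unless you've changed it):

library(future)
plan(multisession)

RNGkind()[1]
# [1] "Mersenne-Twister"   (typical main-session default)
value(future(RNGkind()[1], seed = TRUE))
# [1] "L'Ecuyer-CMRG"      (inside a future with seed = TRUE)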
I have a Spark DataFrame with an ID column called "userid" that I am manipulating using sparklyr. Each userid can have anywhere from one row of data up to hundreds of rows of data. I am applying a function to each userid group which condenses the number of rows it contains based on certain event criteria. Something like
sdf %>%
  group_by(userid) %>%
  ... %>%  # using dplyr::filter and dplyr::mutate
  ungroup()
I would like to wrap this function in an error handler such as purrr::possibly so that computation will not be interrupted if an error occurs in a single group.
So far, I have had the most success using the replyr package. Specifically, replyr::gapply "partitions from by values in grouping column, applies a generic transform to each group and then binds the groups back together." There are two methods for partitioning the data: "group_by" and "extract". The authors only recommend using "extract" in the case that the number of groups is 100 or less, but the "group_by" method does not work as I'd expect:
library(sparklyr)
library(dplyr)
library(replyr) # replyr::gapply
library(purrr) # purrr::possibly
sc <- spark_connect(master = "local")
# Create a test data frame to use gapply on.
test_spark <- tibble(
  userid = c(1, 1, 2, 2, 3, 3),
  occurred_at = seq(1, 6)
) %>%
  sdf_copy_to(sc, ., "test_spark")

# Create a data frame that purrr::possibly should return in case of error.
default_spark <- tibble(userid = -1, max = -1, min = -1) %>%
  sdf_copy_to(sc, ., "default_spark")
#####################################################
# Method 1: gapply with partitionMethod = "group_by".
#####################################################
# Create a function which may throw an error. The group column, userid, is not
# included since gapply( , partitionMethod = "group_by") creates it.
# - A print statement is included to show that when gapply uses "group_by", the
# function is only called once.
fun_for_groups <- function(sdf) {
  temp <- sample(c(1, 2), 1)
  print(temp)
  if (temp == 2) {
    log("a")  # errors: log() of a character
  } else {
    sdf %>%
      summarise(max = max(occurred_at),
                min = min(occurred_at))
  }
}
# Wrap the risky function to handle errors gracefully.
safe_for_groups <- purrr::possibly(fun_for_groups, otherwise = default_spark)
# Apply the safe function to each userid using gapply and "group_by".
# - The result is either (a) only the default_spark data frame, or
#   (b) the expected result if no error occurs in fun_for_groups.
# I would expect the answer to have a mixture of default_spark rows and correct rows.
replyr::gapply(
  test_spark,
  gcolumn = "userid",
  f = safe_for_groups,
  partitionMethod = "group_by"
)
#####################################################
# Method 2: gapply with partitionMethod = "extract".
#####################################################
# Create a function which may throw an error. The group column, userid, is
# included since gapply( , partitionMethod = "extract") doesn't create it.
# - Include a print statement to show that when gapply uses partitionMethod
#   "extract", the function is called once per userid.
fun_for_extract <- function(df) {
  temp <- sample(c(1, 2), 1)
  print(temp)
  if (temp == 2) {
    log("a")  # errors: log() of a character
  } else {
    df %>%
      summarise(max = max(occurred_at),
                min = min(occurred_at),
                userid = min(userid))
  }
}
safe_for_extract <- purrr::possibly(fun_for_extract, otherwise = default_spark)
# Apply that function to each userid using gapply and "extract".
# - The result data frame has a mixture of "otherwise" rows and correct rows.
replyr::gapply(
  test_spark,
  gcolumn = "userid",
  f = safe_for_extract,
  partitionMethod = "extract"
)
How bad of an idea is it to use gapply when the grouping column has millions of values? Is there an alternative to the error handling strategies presented above?
replyr::gapply() is just a thin wrapper on top of dplyr (and in this case sparklyr).
For the grouped mode: the result can only be correct if no group errors out, as the calculation is issued all at once. This is the most efficient mode, but it can't really achieve any sort of error handling.
For the extract mode: it might be possible to add error handling, but the current code does not have it.
As the replyr author, I would actually suggest looking into sparklyr's spark_apply() method. replyr's gapply was designed when neither spark_apply() nor binding lists of data was available in sparklyr.
Also, replyr is mostly in "maintenance mode" (patching issues for clients who used it in larger projects) and is probably not a good choice for new projects.
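As a hedged sketch of that direction (spark_apply() and its group_by argument are sparklyr API; the tryCatch wrapper and the sentinel values are illustrative assumptions, not tested at scale):

library(sparklyr)

# Sketch: per-group summary via spark_apply(); tryCatch() runs inside each
# group, so one failing group yields sentinel values instead of killing
# the whole job. spark_apply() adds the group_by column back to the result.
result <- spark_apply(
  test_spark,
  function(df) {
    tryCatch(
      data.frame(max = max(df$occurred_at), min = min(df$occurred_at)),
      error = function(e) data.frame(max = -1, min = -1)
    )
  },
  group_by = "userid"
)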
I'm interested in why you would write "string" <- function(x), i.e. assign a function to a quoted name, and what this does. I have included some code below; the function is defined within a larger function. I understand what unique(), which.max(), tabulate(), and match() are doing, but I am confused by "mode" <- function(x).
The code was sent to me by another party and works fine. I'm just interested in what it does.
metric = function(z, i, rn, minht, above)
{
  "mode" <- function(x) {
    ux <- unique(x)
    ux[which.max(tabulate(match(x, ux)))]
  }
  metrics = list(
  )
  return(metrics)
}
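For what it's worth, the quotes around the name make no difference to the assignment: both forms below define the same local function, which returns the most frequent value of x (the statistical mode) and shadows base::mode() inside metric(). A quick check:

# Identical definitions; quoting the name on the LHS of <- changes nothing.
"mode" <- function(x) { ux <- unique(x); ux[which.max(tabulate(match(x, ux)))] }
mode   <- function(x) { ux <- unique(x); ux[which.max(tabulate(match(x, ux)))] }

mode(c(1, 2, 2, 3))
# [1] 2  (the most frequent value)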
I'm having an issue in R running cor.test on a data frame that contains multiple groups.
I am trying to obtain the correlation coefficient between one dependent variable and multiple independent variables contained in a data frame. The data frame has two grouping columns for subsetting the data. Here is an example:
DF <- data.frame(group1 = rep(1:4, 3), group2 = rep(1:2, 6),
                 x = rnorm(12), v1 = rnorm(12), v2 = rnorm(12), v3 = rnorm(12))
I created the following script that uses plyr to calculate the correlation coefficient for each of the groups and then loop through for each of the variables.
library(plyr)

group_cor <- function(DF, x, y) {
  return(data.frame(cor = cor.test(DF[, x], DF[, y])$estimate))
}

resultDF <- ddply(DF, .(group1, group2), group_cor, 3, 4)
for (i in 5:6) {
  resultDF2 <- ddply(DF, .(group1, group2), group_cor, 3, i)
  resultDF <- merge(resultDF, resultDF2, by = c("group1", "group2"))
  rm(resultDF2)
}
This works fine. The problem I'm running into is when there aren't enough values in a group to calculate the correlation coefficient. For example, when I change the data frame created above to include a few key NA values and then try to run the same loop:
DF[c(2, 6, 10), 5] <- NA
for (i in 5:6) {
  resultDF2 <- ddply(DF, .(group1, group2), group_cor, 3, i)
  resultDF <- merge(resultDF, resultDF2, by = c("group1", "group2"))
  rm(resultDF2)
}
I get the following error: "Error: not enough finite observations".
I understand why I get this error and am not expecting a correlation coefficient in these cases. But what I would like to do is pass out a null value and move on to the next group instead of stopping my code at an error.
I've tried using a wrapper with try() but can't seem to pass that variable into my result data frame.
Any help on how to get around this would be much appreciated.
I invariably forget how to use try if I haven't used it in, oh, a day or something. This link helped me remember the basics.
For your function, you could add it in like this:
group_cor = function(DF, x, y) {
  check = try(cor.test(DF[, x], DF[, y])$estimate, silent = TRUE)
  if (!inherits(check, "try-error"))
    return(data.frame(cor = check))
}
However, this won't return anything for the group with the error. That's actually OK if you use the all argument when you merge. Here's another way to merge: save everything into a list with lapply and then merge with Reduce.
allcor = lapply(4:6, function(i) ddply(DF, .(group1,group2), group_cor, 3, i))
Reduce(function(...) merge(..., by = c("group1", "group2"), all = TRUE), allcor)
If you want to fill in with NA inside the function rather than waiting to fill in using merge, you could change your function to:
group_cor2 = function(DF, x, y) {
  check = try(cor.test(DF[, x], DF[, y])$estimate, silent = TRUE)
  if (inherits(check, "try-error"))
    return(data.frame(cor = NA))
  return(data.frame(cor = check))
}
Finally (and outside the scope of the question), depending on what you are doing with your output, you might consider naming your columns uniquely based on which columns you ran the cor.test on, so that merge doesn't rename them all with suffixes. There is likely a better way to do this, maybe with merge and its suffixes argument.
group_cor3 = function(DF, x, y) {
  check = try(cor.test(DF[, x], DF[, y])$estimate, silent = TRUE)
  if (!inherits(check, "try-error")) {
    dat = data.frame(cor = check)
    names(dat) = paste("cor", x, "vs", y, sep = ".")
    dat
  }
}
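For instance, a call like the following produces a column whose name is built by the paste() call above:

res <- ddply(DF, .(group1, group2), group_cor3, 3, 5)
names(res)
# [1] "group1"     "group2"     "cor.3.vs.5"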