I'm using the targets pipelining system in R and am wondering how to branch statically in an optimal way. I have a set of parameters for which I'd like to compute results for most, but not all, interactions. Notice how N_source_components and N_target_components aren't used by the agg_neighbourhoods target, but they are used by other targets that I didn't include in this example. With the current setup, agg_neighbourhoods will be run too many times (targets doesn't understand that not all columns in the values argument of tar_map are relevant to all targets, right?). Is there a smarter way?
I already tried nesting another tar_map call within the one shown here, relegating N_source_components and N_target_components to the inner call. That fixes the redundant executions of agg_neighbourhoods, but it doesn't let me filter out undesirable combinations the way I'm doing now, because the value of query isn't known at 'compilation' time.
Many thanks :)
tar_map(
  values = tidyr::expand_grid(
    query = c('6369', '6489', '6493'),
    k = c(10, 30, 50),
    d = c(5, 10, 15),
    genelist = c(
      'informativeV15',
      'informativeV15_monotonic',
      'informativeV15_monoreporter'
    ),
    N_source_components = 10L,
    N_target_components = as.integer(c(3, 5))
  ) %>%
    dplyr::filter(
      !(query %in% c('6369') & N_target_components > 3)
    ),
  tar_target(agg_neighbourhoods, {
    f(
      so = tar_read(so_target, branch = e2i(query))[[1]],
      genelist = genelist,
      k = k,
      d = d
    )
  }, iteration = 'list')
)
Hopefully this is helpful to someone. In simpler terms, my problem was that targets were being run needlessly: I need to filter out some parameter combinations when instantiating targets, and not all parameters are used by all targets. A simpler and more complete example of the scenario would be:
tar_map(
  values = tidyr::expand_grid(A = 1:2, B = 1:3) %>%
    dplyr::filter(!(A == 2 & B > 2)),
  tar_target(tarX, A * 3),
  tar_target(tarY, A * 4 + B^2)
)
tarX is being run for each value of B, whereas only one evaluation per value of A is required. However, since the values of both A and B determine which combinations aren't required, we have to pre-specify the required targets.
Seeing the 'problem' in this much cleaner, abstracted representation, a solution becomes apparent: just make two calls to tar_map, each operating on tailor-selected columns of the parameter grid.
param_grid <-
  tidyr::expand_grid(A = 1:2, B = 1:3) %>%
  dplyr::filter(!(A == 2 & B > 2))

list(
  tar_map(
    values = param_grid %>%
      dplyr::select(-B) %>%
      dplyr::distinct(),
    tar_target(tarX, A * 3)
  ),
  tar_map(
    values = param_grid,
    tar_target(tarY, A * 4 + B^2)
  )
)
Perhaps there are other solutions as well. I'd be happy to hear them.
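One generalisation of my own solution, sketched here untested: wrap tar_map in a small helper (map_over below is hypothetical, not a tarchetypes function) that hands each target only the distinct rows of the columns it actually uses.
map_over <- function(grid, cols, ...) {
  tarchetypes::tar_map(
    values = dplyr::distinct(dplyr::select(grid, dplyr::all_of(cols))),
    ...
  )
}

list(
  map_over(param_grid, "A", tar_target(tarX, A * 3)),
  map_over(param_grid, c("A", "B"), tar_target(tarY, A * 4 + B^2))
)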
Suppose I have an arbitrary function
foo <- function(a, b) { a + b }
How can I iterate this function onto itself n times?
foo(foo(foo(foo(x, 1), 2), 3), 4)
I am looking at purrr::compose but it doesn't look hopeful for arbitrary n. purrr::reduce feels like it will come into play also... but I'm stumped.
Here is a pure purrr version that is really functional. As you said, reduce comes in handy here: since compose is just a function, and functions are just elements, you can reduce a list of functions by composing them. To fill just one argument, use partial.
foo_n <- reduce(map(1:n, ~partial(foo, b=.x)), compose)
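A quick sanity check (my example, not part of the answer itself): compose applies the right-most function first, so the partials run in the order b = n down to b = 1; since foo is addition, which is commutative, the result still matches the nested call.
library(purrr)
foo <- function(a, b) a + b
n <- 4
foo_n <- reduce(map(1:n, ~ partial(foo, b = .x)), compose)
foo_n(1)                           # 11
foo(foo(foo(foo(1, 1), 2), 3), 4)  # 11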
You can also just append the result of each foo(a, b) call to a numeric vector and then pick the last element.
Let x = 1 and let the b values be the elements of 1:4:
x = 1
n = 4
out = vector("numeric")
steps = seq(1, 4, by = 1)
for (b in steps) {
  ## initial value
  if (length(out) == 0) {
    out = append(out, values = foo(x, b))
  } else {
    out = append(out, values = foo(tail(out, 1), b))
  }
}
tail(out, 1)
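For comparison (my addition, a sketch rather than part of either approach above): since foo just folds each element into an accumulator, the whole loop collapses into a single reduce call.
library(purrr)
foo <- function(a, b) a + b
# equivalent to foo(foo(foo(foo(1, 1), 2), 3), 4)
reduce(1:4, foo, .init = 1)  # 11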
Background
I get hourly interval reports on equipment in buildings: a lot of buildings and a lot of equipment. Each parameter on a piece of equipment is called a point, and the points already have names; I don't get to choose them. Each point name is unique. What I'm trying to do is run a standard report on each building. Eventually I'd like to move this to Shiny, look at my graphs, and maybe print a report from there, but... baby steps.
Question
Am I on the right track? Is there a more efficient way of doing this? Am I going to run into problems when I start to write Markdown reports or transfer this over to Shiny?
Sample Code
library(tidyverse)
set.seed(55)
test_func <- function(pointa, pointb, mult) {
  out = (pointb - pointa) * mult
  return(out)
}
test_fail <- function(pointa, pointb) {
  out = ifelse(pointa > (pointb - 9), 1, 0)
  return(out)
}
tbl.data <- data.frame(
  date = c(rep("2/1/2018", 24),
           rep("2/2/2018", 24),
           rep("2/3/2018", 24),
           rep("2/4/2018", 24),
           rep("2/5/2018", 24),
           rep("2/6/2018", 24),
           rep("2/7/2018", 24)),
  hour = rep(0:23, 7),
  equipa.vala = runif(168, min = 50, max = 60),
  equipb.vala = runif(168, min = 50, max = 60)
) %>%
  mutate(
    equipa.valb = 10 + equipa.vala * runif(168, min = 0.75, max = 1.25),
    equipb.valb = 10 + equipb.vala * runif(168, min = 0.75, max = 1.25)
  )
tbl.equip <- data.frame(
  equipment.id = c(1, 2),
  equipment.name = c("equipa", "equipb"),
  equipment.mult = c(5, 7)
)
tbl.point <- data.frame(
  point = c("equipa.vala", "equipa.valb", "equipb.vala", "equipb.valb"),
  equipment = c("equipa", "equipa", "equipb", "equipb"),
  category = c("vala", "valb", "vala", "valb")
)
for (eq in tbl.equip$equipment.name) {
  vala <- as.character(
    tbl.point$point[tbl.point$equipment == eq &
                      tbl.point$category == "vala"]
  )
  valb <- as.character(
    tbl.point$point[tbl.point$equipment == eq &
                      tbl.point$category == "valb"]
  )
  equip.mult <- as.numeric(
    tbl.equip$equipment.mult[tbl.equip$equipment.name == eq]
  )
  for.data <- tbl.data %>%
    select(cola = all_of(vala),
           colb = all_of(valb)) %>%
    mutate(
      result = test_func(cola, colb, equip.mult),
      fault = test_fail(cola, colb)
    )
  score <- sum(for.data$fault) / length(for.data$fault)
  savings <- sum(for.data$result[for.data$result > 0])
  p1 <- ggplot(for.data, aes(x = colb, y = cola, color = as.factor(fault))) +
    geom_point() +
    annotate("text", label = paste("savings is:", savings), x = 50, y = 60) +
    annotate("text", label = paste("score is:", score), y = 51, x = 80) +
    ggtitle(paste("Equipment:", eq)) +
    theme_minimal()
  print(p1)
}
Explanation
So in this sample, the tbl.data data frame would be the data I receive from each building. I'd have to manually create the tbl.equip and tbl.point data frames, which I'd house in *.csv files on my machine or in a database (and be able to add/edit in Shiny). There's no standard for point names, and there's no guarantee that each piece of equipment has each point, so using select() helpers such as contains() or starts_with() is out of the question.
So I just created an equipment table, which holds parameters for each piece of equipment (in this case a multiplier). There's also a point table, which records which piece of equipment and which category each point belongs to.
For this simple example, I included two sample functions: one calculates a value based on the data, the other tests for a fault. My biggest problem in the past has been that when a piece of equipment doesn't have a point, execution stops, and I have to go in manually and take that piece of code out or work around it some other way. I guess the workaround is to use exists() or something similar and test before running that piece of code.
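For example, a guard along these lines could skip the equipment instead (untested sketch; has_points is just a helper I'd write, not an existing function):
# Helper: TRUE only when every expected point column is present in the data.
has_points <- function(data, point_names) {
  all(point_names %in% names(data))
}

# At the top of the loop body above:
# if (!has_points(tbl.data, c(vala, valb))) {
#   warning("skipping ", eq, ": missing point column(s)")
#   next
# }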
Again, for this simple example I just printed a plot, but the output could be a Markdown document (which I think I've done before, though not like this) or Shiny (I've created some simpler apps).
Conclusion
The big question is: is this the "right" way of doing it? I'm sure this scenario is pretty common, and there has to be a more efficient method I'm not using. What's going to set me up for success when I start writing code to print out reports or move this into a Shiny app?
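One idea I've been toying with (just a sketch, untested, using the objects defined above): reshape the data to long format and join the point and equipment metadata. Missing points then simply drop out of the join instead of stopping execution, and the per-equipment loop mostly disappears.
library(tidyverse)

long_data <- tbl.data %>%
  pivot_longer(-c(date, hour), names_to = "point") %>%
  inner_join(tbl.point, by = "point") %>%
  inner_join(tbl.equip, by = c("equipment" = "equipment.name"))

results <- long_data %>%
  select(date, hour, equipment, equipment.mult, category, value) %>%
  pivot_wider(names_from = category, values_from = value) %>%
  mutate(result = test_func(vala, valb, equipment.mult),
         fault = test_fail(vala, valb))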
I am trying to collate results from a simulation study. The results are saved as a list of data frames containing output from several different classification algorithms, and I'm trying to use purrr and dplyr to summarize them.
I'm trying to calculate
- number of objects assigned to each cluster
- number of objects in the cluster that actually belong to the cluster
- number of true positives, false positives, false negatives, and true negatives using 3 different algorithms (KEEP1 - KEEP3)
- for 2 of the algorithms, I have access to a probability of being in the cluster, so I can compare that probability to alternate choices of alpha and calculate true positives etc. under a different choice of alpha.
I found this: https://github.com/tidyverse/dplyr/issues/3101, which I used successfully on a single element of the list to get exactly what I wanted:
f <- function(.x, .y) {
  sum(.x & .y)
}
actions <- list(
  .vars = lst(
    c('correct'),
    c('KEEP1', 'KEEP2', 'KEEP3'),
    c('pval1', 'pval2')
  ),
  .funs = lst(
    funs(Nk = length, N_correct = sum),
    funs(
      TP1 = f(., .y = correct),
      FN1 = f(!(.), .y = correct),
      TN1 = f(!(.), .y = !(correct)),
      FP1 = f(., .y = !(correct))
    ),
    funs(
      TP2 = f((. < alpha0), .y = correct),
      FN2 = f(!(. < alpha0), .y = correct),
      TN2 = f(!(. < alpha0), .y = !(correct)),
      FP2 = f((. < alpha0), .y = !(correct))
    )
  )
)
reproducible_data <- replicate(
  2,
  data_frame(
    k = factor(rep(1:10, each = 20)),  # group/category
    correct = sample(x = c(TRUE, FALSE), 10 * 20, replace = TRUE, prob = c(.8, .2)),
    pval1 = rbeta(10 * 20, 1, 10),
    pval2 = rbeta(10 * 20, 1, 10),
    KEEP1 = pval1 < 0.05,
    KEEP2 = pval2 < 0.05,
    KEEP3 = runif(10 * 20) > .2,
    alpha0 = 0.05,
    alpha = 0.05 / 20  # divided by no. of objects in each group (k)
  ),
  simplify = FALSE
)
# works
df1 <- reproducible_data[[1]]
pmap(actions, ~ df1 %>% group_by(k) %>% summarize_at(.x, .y)) %>%
  reduce(inner_join, by = 'k')
Now I want to use map to do this for the entire list. However, I can no longer access the variable correct (it hasn't gotten far enough to fail on alpha or alpha0, but presumably the same issue will occur). I'm still learning dplyr/purrr, but my experimenting hasn't proved fruitful.
# does not work
out_summary <- map(
reproducible_data,
pmap(actions, ~ as_tibble(.) %>% group_by("k") %>% summarize_at(.x, .y)) %>%
reduce(inner_join,by = 'k')
)
# this doesn't either
out_summary <- map(
reproducible_data,
pmap(actions, ~ as_tibble(.) %>% group_by("k") %>% summarize_at(.x, .y, alpha = alpha, alpha0 = alpha0, correct = correct)) %>%
reduce(inner_join,by = 'k')
)
Within map, I don't see the variable k in group_by(k) unless it is quoted as group_by('k'), but I did not need to quote it when I used pmap on its own. I've tried various ways to pass the correct variables to these functions, but I'm still learning dplyr and purrr and haven't succeeded yet.
One more note: the actual data is stored as a regular data frame, so I need as_tibble() in the pmap function. I was running into some different errors when I removed it in this example, so I opted to add it back to reproduce the same issues. Thanks!
Try this
map(
reproducible_data,
function(df1) {
pmap(actions, ~ df1 %>%
as_tibble() %>%
group_by(k) %>%
summarize_at(.x, .y)) %>%
reduce(inner_join, by = "k")
}
)
I think your arguments might be getting mixed up when using map and pmap at the same time. I used the function syntax for map to define df1 to try to fix that. The rest of it looks OK (although I switched to pmap_df to return a data frame; the structure of the list was ugly without it, and pmap_df was the easiest way to make it pretty). Let me know if it's not the expected output. 👍
Also, on the problem of group_by("k") vs. group_by(k): writing group_by("k") actually creates a variable filled with the character "k" and groups on that. That will get your code to run, but it won't do what you want. Sometimes that kind of problem is really caused by an error a line or two earlier (or, with dplyr, a pipe or two earlier). In this case, map wasn't passing df1 where you needed it.
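A tiny demonstration of that pitfall with toy data:
library(dplyr)
df <- tibble(k = c(1, 1, 2), v = c(10, 20, 30))
df %>% group_by("k") %>% summarise(n = n())  # one group: the constant string "k"
df %>% group_by(k) %>% summarise(n = n())    # two groups: k = 1 and k = 2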
I've been reading through programming with dplyr and trying to apply the ideas it describes in my work. I have something that works, but it's unclear to me whether I've done it in the "right" way. Is there something more elegant or concise I could be doing?
I have a tibble where rows are scenarios and columns relate to tests that were run in that scenario. There are two types of columns, those that store a test statistic that was computed in that scenario and those that store the degrees of freedom of that test.
So, here's a small, toy example of the type of data I have:
library(tidyverse)
set.seed(27599)
my_tbl <- data_frame(
  test1_stat = rnorm(12),
  test1_df = rep(x = c(1, 2, 3), times = 4),
  test2_stat = rnorm(12),
  test2_df = rep(x = c(1, 2, 3, 4), times = 3)
)
I want to compute a summary of each test that will be based on both its stat and its df. My example here is that I want to compute the median stat for each group, where groups are defined by df. The groupings are not guaranteed to be the same across tests, nor are the number of groups even guaranteed to be the same.
So, here's what I've done:
get_test_median = function(df, test_name) {
  stat_col_name <- paste0(test_name, '_stat')
  df_col_name <- paste0(test_name, '_df')
  median_col_name <- paste0(test_name, '_median')
  df %>%
    dplyr::group_by(rlang::UQ(rlang::sym(df_col_name))) %>%
    dplyr::summarise(rlang::UQ(median_col_name) := median(x = rlang::UQ(rlang::sym(stat_col_name)), na.rm = TRUE))
}
my_tbl %>% get_test_median(test_name = 'test1')
my_tbl %>% get_test_median(test_name = 'test2')
This works. But is it how an experienced rlang user would do it? I am new to NSE, and a bit surprised to be using two nested rlang functions repeatedly (UQ(sym(.))).
I am happy using UQ rather than !!, just because I'm more comfortable with traditional function notation.
Based on the comments, I got rid of the namespace::function notation and now my function doesn't look so verbose:
get_test_median = function(df, test_name) {
  stat_col_name <- paste0(test_name, '_stat')
  df_col_name <- paste0(test_name, '_df')
  median_col_name <- paste0(test_name, '_median')
  df %>%
    dplyr::group_by(UQ(sym(df_col_name))) %>%
    dplyr::summarise(UQ(median_col_name) := median(x = UQ(sym(stat_col_name)), na.rm = TRUE))
}
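For completeness, here's a newer-style sketch of the same function (assumes a recent dplyr/rlang, where UQ() is deprecated in favour of !! and the .data pronoun plus glue-style names are available; get_test_median_2 is just my name for the variant):
get_test_median_2 <- function(df, test_name) {
  stat_col <- paste0(test_name, '_stat')
  df_col <- paste0(test_name, '_df')
  df %>%
    dplyr::group_by(.data[[df_col]]) %>%
    dplyr::summarise("{test_name}_median" := median(.data[[stat_col]], na.rm = TRUE))
}

my_tbl %>% get_test_median_2(test_name = 'test1')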
This is a follow-up question to R: t-test over all columns.
Suppose I have a huge data set, and I create numerous subsets of it based on certain conditions. The subsets should have the same number of columns. Then I want to do a t-test on two subsets at a time (outer loop), and then for each combination of subsets go through all columns one column at a time (inner loop).
Here is what I have come up with based on the previous answer. This one stops with an error.
C <- c("c1","c1","c1","c1","c1",
"c2","c2","c2","c2","c2",
"c3","c3","c3","c3","c3",
"c4","c4","c4","c4","c4",
"c5","c5","c5","c5","c5",
"c6","c6","c6","c6","c6",
"c7","c7","c7","c7","c7",
"c8","c8","c8","c8","c8",
"c9","c9","c9","c9","c9",
"c10","c10","c10","c10","c10")
X <- rnorm(n=50, mean = 10, sd = 5)
Y <- rnorm(n=50, mean = 15, sd = 6)
Z <- rnorm(n=50, mean = 20, sd = 5)
Data <- data.frame(C, X, Y, Z)
Data.c1 = subset(Data, C == "c1",select=X:Z)
Data.c2 = subset(Data, C == "c2",select=X:Z)
Data.c3 = subset(Data, C == "c3",select=X:Z)
Data.c4 = subset(Data, C == "c4",select=X:Z)
Data.c5 = subset(Data, C == "c5",select=X:Z)
Data.Subsets = c("Data.c1",
"Data.c2",
"Data.c3",
"Data.c4",
"Data.c5")
library(plyr)
combo1 <- combn(length(Data.Subsets),1)
adply(combo1, 1, function(x) {
combo2 <- combn(ncol(Data.Subsets[x]),2)
adply(combo2, 2, function(y) {
test <- t.test( Data.Subsets[x][, y[1]], Data.Subsets[x][, y[2]], na.rm=TRUE)
out <- data.frame("Subset" = rownames(Data.Subsets[x]),
, "Row" = colnames(x)[y[1]]
, "Column" = colnames(x[y[2]])
, "t.value" = round(test$statistic,3)
, "df"= test$parameter
, "p.value" = round(test$p.value, 3)
)
return(out)
} )
} )
First of all, you can define your dataset more easily by using gl and by avoiding creating individual variables for the columns.
Data <- data.frame(
C = gl(10, 5, labels = paste("c", 1:10, sep = "")),
X = rnorm(n = 50, mean = 10, sd = 5),
Y = rnorm(n = 50, mean = 15, sd = 6),
Z = rnorm(n = 50, mean = 20, sd = 5)
)
Convert this to "long" format using melt from the reshape2 package (the older reshape package provides melt as well, and you can also use the base reshape function).
library(reshape2)
longData <- melt(Data, id.vars = "C")
Now use pairwise.t.test to compute t-tests on all pairs of X/Y/Z for each level of C.
with(longData, pairwise.t.test(value, interaction(C, variable)))
Note that it is important to use pairwise.t.test rather than lots of individual calls to t.test, because you need to adjust your p-values when you run many tests. (See, e.g., xkcd for an explanation.)
In general, pairwise t-tests are inferior to a regression, so be careful about their usage.
You can use get(Data.Subsets[x]), which will pick out the relevant data frame. But I don't think this should be necessary.
Explicitly subsetting that many times shouldn't be necessary either. You could create the subsets using something like
conditions <- c("c1", "c2", "c3", "c4", "c5")
dfs <- lapply(conditions, function(x) subset(Data, C == x, select = X:Z))
names(dfs) <- conditions  # so each subset can be looked up by condition name
That should (I didn't test it) return a list of data frames, each subsetted on one of the conditions you passed.
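If you do stick with the list, here is an untested sketch that runs t.test on every pair of columns within each subset (it assumes all subsets share the same columns):
results <- lapply(dfs, function(d) {
  # all pairs of column names within this subset
  pairs <- combn(names(d), 2, simplify = FALSE)
  do.call(rbind, lapply(pairs, function(p) {
    tt <- t.test(d[[p[1]]], d[[p[2]]])
    data.frame(col1 = p[1], col2 = p[2],
               t.value = round(tt$statistic, 3),
               df = tt$parameter,
               p.value = round(tt$p.value, 3))
  }))
})

As the other answer stresses, you would still need to adjust the resulting p-values (e.g. with p.adjust) if you keep this many tests.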
However, as @Richie Cotton points out, it would be a much better idea to reshape your data frame and use pairwise t-tests.
I should point out that doing this many t-tests doesn't seem wise, even after correction for multiple testing, be it FDR, permutation, or otherwise. It would be better to try to figure out whether you can use an ANOVA of some sort, as they are designed for almost exactly this purpose.