I would like to wrap pipeable (magrittr/dplyr) functions into shorter function "blocks", hopefully making the code more readable. For example:
library(dplyr)
d <- tbl_df(data.frame(A = rep(LETTERS[2:5], each = 5),
                       M = rep(letters[1:2], times = 10),
                       X = round(rnorm(20, 10, 2), 1)))
# I want to replace this
# ----------------------
d %>%
  group_by(A) %>%
  summarise(X = mean(X)) -> d_test_1
# with this
# ---------
my_mean <- function(d, by_var, x) {
  expr <- substitute(by_var)                    # group variable, seems ok
  expr_2 <- substitute(expression(x = mean(x))) # calculate mean
  print(deparse(expr_2))
  # problem: x = mean(x) is only substituted to x = mean(X) .. one capital X, should be two
  expr_3 <- parse(text = paste(deparse(substitute(x)), "=mean(",
                               deparse(substitute(x)), ")"))
  print(deparse(expr_3))
  # expr_3 does not work either
  d %>%
    group_by(eval(expr)) %>%
    # summarise(X = mean(X)) -> d  # uses the right group variable
    summarise(eval(expr_3)) -> d   # uses wrong group variable <> side-effect of "expr"?
  invisible(d)
}
# this is the short version I am after
d %>%
  my_mean(A, X) -> d_test_2
d_test_1
d_test_2
Thanks & kind regards
In case somebody else does not know where to look either, here is a working version using tidy evaluation:
library(dplyr)
d <- tbl_df(data.frame(A = rep(LETTERS[2:5], each = 5),
                       M = rep(letters[1:2], times = 10),
                       X = round(rnorm(20, 10, 2), 1),
                       stringsAsFactors = FALSE))

my_mean <- function(d, by_var, x) {
  d %>%
    group_by(!!enquo(by_var)) %>%
    summarise(!!quo_name(enquo(x)) := mean(!!enquo(x)))
}
d %>%
  my_mean(A, X) -> want
want
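In case it helps, a more compact alternative is the "curly-curly" operator, which combines enquo() and !!. This is a minimal sketch of my own, assuming dplyr >= 1.0.0 and rlang >= 0.4.3 are available (my_mean2 is just a name I chose to keep it separate from the version above):
my_mean2 <- function(d, by_var, x) {
  d %>%
    group_by({{ by_var }}) %>%
    summarise("{{ x }}" := mean({{ x }}))
}

d %>%
  my_mean2(A, X)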
I have this data:
df_1 <- data.frame(
  x = replicate(n = 10, expr = runif(n = 1000, min = 20, max = 100))
)
My code:
library(dplyr)
df_1 |>
  (\(x) cbind(x, r = apply(x[colnames(x = select(x, where(is.numeric) &
      head(x = everything(x), 2) & starts_with("x.")))], 1, sum, na.rm = T)))()
I tried to use [ instead of colnames, but it doesn't work. I want to convert this part (which reproduces the dplyr::select structure above):
[colnames(x = select(x, where(is.numeric) & head(x = everything(x), 2) & starts_with("x.")))]
to base R.
Just do:
transform(df_1, r = x.1 + x.2)
or even:
cbind(df_1, r = rowSums(df_1[1:2]))
or even:
cbind(df_1, r = df_1[1] + df_1[2])
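If the goal is to keep the column-selection logic itself (numeric columns, among the first two, with names starting with "x.") rather than hard-coding the columns, a base-R sketch of that selection could look like this (the helper keep is my own name, not from the question):
keep <- sapply(df_1, is.numeric) &
  seq_along(df_1) %in% 1:2 &
  startsWith(names(df_1), "x.")
cbind(df_1, r = rowSums(df_1[keep], na.rm = TRUE))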
I want to have a matrix containing one high (1 sd above the average) and one low (1 sd below the median) expression for each of several variables.
In one variant, for each variable I would like one high expression while all other variables are low.
In addition, I would like a variant in which all other variables are set to 0 and there is one high and one low expression for each variable.
I want to use it for model predictions.
For three variables, variant 1 alone would already require:
pred_da <- data.frame(
  var1 = c(median(da$var1) + 1*sd(da$var1), median(da$var1) - 1*sd(da$var1), median(da$var1) - 1*sd(da$var1)),
  var2 = c(median(da$var2) - 1*sd(da$var2), median(da$var2) + 1*sd(da$var2), median(da$var2) - 1*sd(da$var2)),
  var3 = c(median(da$var3) - 1*sd(da$var3), median(da$var3) - 1*sd(da$var3), median(da$var3) + 1*sd(da$var3))
)
For variant 2 it would be even more...
Surely there is a more efficient way to do this?
I think Adam B.'s solution returns the medians instead of median - sd (see the reproducible example below).
Also, your example code uses median +/- sd, while the text defines "high" as 1 sd above average (not median), so it is not clear which one you want. I went with median in both cases.
You can achieve the same quite easily in base R by filling a matrix with the "low" expression for each column and putting the "high" expression on the diagonal:
# data (common to all versions)
set.seed(1)
da <- data.frame(
  ID = 1:10,
  var1 = rnorm(10, 0, 1),
  var2 = rpois(10, 2),
  var3 = rexp(10, 1),
  stringsAsFactors = FALSE
)
varnames <- colnames(da)[-1]
# my version
mat <- data.matrix(da[, -1])
median_da <- apply(mat, 2, median)
sds <- apply(mat, 2, sd)
lower <- median_da - sds
higher <- median_da + sds
res_mat <- matrix(
  rep(lower, each = length(varnames)),
  nrow = length(varnames),
  dimnames = list(seq_along(varnames), varnames)
)
diag(res_mat) <- higher
data.frame(res_mat)
#> var1 var2 var3
#> 1 1.0371615 -0.4337209 -0.1102957
#> 2 -0.5240104 2.4337209 -0.1102957
#> 3 -0.5240104 -0.4337209 1.3406680
## your version:
pred_da <- data.frame(
  var1 = c(
    median(da$var1) + 1 * sd(da$var1),
    median(da$var1) - 1 * sd(da$var1),
    median(da$var1) - 1 * sd(da$var1)
  ),
  var2 = c(
    median(da$var2) - 1 * sd(da$var2),
    median(da$var2) + 1 * sd(da$var2),
    median(da$var2) - 1 * sd(da$var2)
  ),
  var3 = c(
    median(da$var3) - 1 * sd(da$var3),
    median(da$var3) - 1 * sd(da$var3),
    median(da$var3) + 1 * sd(da$var3)
  )
)
# check for equality of results:
all.equal(data.frame(res_mat), pred_da, check.attributes = FALSE)
#> [1] TRUE
# Adam B.'s version:
library(tidyverse)
median_da <- da %>%
  select(-ID) %>%
  mutate_all(~ median(.x)) %>%
  slice(1)

sds <- da %>%
  select(-ID) %>%
  summarise_all(sd)
add_sd <- function(varname, sd) {
  median <- median_da %>%
    pluck(varname)
  median_da %>%
    mutate(!!varname := median + sd)
}

preds_da <- map2(varnames, sds, ~ add_sd(varname = .x, sd = .y)) %>%
  bind_rows()
preds_da
#> var1 var2 var3
#> 1 1.0371615 1.000000 0.6151862
#> 2 0.2565755 2.433721 0.6151862
#> 3 0.2565755 1.000000 1.3406680
median_da
#> var1 var2 var3
#> 1 0.2565755 1 0.6151862
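For the second variant in the question (all other variables set to 0, with one high and one low row per variable), the same matrix-filling idea could be adapted. Here is a rough sketch reusing lower, higher and varnames from above (res_mat2 and idx are my own names, not from the question):
idx <- seq_along(varnames)
res_mat2 <- matrix(0,
                   nrow = 2 * length(varnames),
                   ncol = length(varnames),
                   dimnames = list(NULL, varnames))
res_mat2[cbind(2 * idx - 1, idx)] <- higher  # "high" rows
res_mat2[cbind(2 * idx, idx)] <- lower       # "low" rows
data.frame(res_mat2)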
It's a bit of a mind-squeezer with nonstandard eval, but I managed to get it to work with my example data:
library(tidyverse)
da <- tibble(ID = 1:10, V1 = rnorm(10, 0, 1), V2 = rpois(10, 2), V3 = rexp(10, 1))
varnames <- colnames(da)[-1]
median_da <- da %>%
  select(-ID) %>%
  mutate_all(~ median(.x)) %>%
  slice(1)

sds <- da %>%
  select(-ID) %>%
  summarise_all(sd)
add_sd <- function(varname, sd) {
  median <- median_da %>%
    pluck(varname)
  median_low <- median_da %>%
    mutate(!!varname := median - sd)
  median_high <- median_da %>%
    mutate(!!varname := median + sd)
  median_low %>%
    bind_rows(median_high)
}

preds_da <- map2(varnames, sds, ~ add_sd(varname = .x, sd = .y)) %>%
  bind_rows()
This is a continuation from the previous question:
Apply function over every entry of one table to every entry of another
I have the following tables loss.tib and bandstib and function bandedlossfn:
library(tidyverse)
set.seed(1)
n <- 5
loss.tib <- tibble(lossid = seq(n),
                   loss = rbeta(n, 1, 10) * 100)

bandstib <- tibble(bandid = seq(4),
                   start = seq(0, 75, by = 25),
                   end = seq(25, 100, by = 25))
bandedlossfn <- function(loss, start, end) {
  pmin(end - start, pmax(0, loss - start))
}
It is possible to apply this function over loss.tib using bandstib as arguments:
loss.tib %>%
  mutate(
    result = map(loss, ~ tibble(result = bandedlossfn(.x, bandstib$start,
                                                      bandstib$end)))
  ) %>%
  unnest()
However, I would like to add an index within map as follows:
loss.tib %>%
  mutate(
    result = map(loss, ~ tibble(result = bandedlossfn(.x, bandstib$start,
                                                      bandstib$end)) %>%
                   mutate(bandid2 = row_number()))
  ) %>%
  unnest()
But it does not seem to work as intended.
I also want to add filter(!near(result, 0)) within the map function, for efficient memory management.
The result I'm expecting is:
lossid loss bandid result
1 21.6691088 1 21.6691088
2 6.9390647 1 6.9390647
3 0.5822383 1 0.5822383
4 5.5671643 1 5.5671643
5 27.8237244 1 25.0000000
5 27.8237244 2 2.8237244
Thank you.
Here is one possibility: first nest bandstib and add it to loss.tib; this way the id sticks to your calculations:
bandstib <- tibble(bandid = seq(4),
                   start = seq(0, 75, by = 25),
                   end = seq(25, 100, by = 25)) %>%
  nest(.key = "data")

set.seed(1)
n <- 5

result <- tibble(loss = rbeta(n, 1, 10) * 100) %>%
  bind_cols(., slice(bandstib, rep(1, n))) %>%
  mutate(result = map2(loss, data, ~ bandedlossfn(.x, .y$start, .y$end))) %>%
  unnest()
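If you also need the bandid in the output and want to drop near-zero rows, as in the expected result, one possible sketch (my own variant, not the answer above) is a cross join with tidyr::crossing, using the original, un-nested bandstib from the question:
loss.tib %>%
  crossing(bandstib) %>%                           # every loss paired with every band
  mutate(result = bandedlossfn(loss, start, end)) %>%
  filter(!near(result, 0)) %>%                     # drop the near-zero bands
  select(lossid, loss, bandid, result)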
I would like to do the same thing I have done here with plyr::ddply, but with dplyr's mutate. Is it possible to perform a non-vectorized operation here somehow?
test <- tibble::tibble(
  x = c(1, 2, 3),
  y = c(0.5, 1, 1.5)
)
d <- c(1.23, 0.99, 2.18)
test %>% mutate(., s = (function(x, y) {
  dn <- dnorm(x = d, mean = x, sd = y)
  s <- sum(dn)
  s
})(x, y))

test %>% plyr::ddply(., c("x", "y"), .fun = function(row) {
  dn <- dnorm(x = d, mean = row$x, sd = row$y)
  s <- sum(dn)
  s
})
A popular method is the dplyr function rowwise().
library(tidyverse)
test <- tibble::tibble(
  x = c(1, 2, 3),
  y = c(0.5, 1, 1.5)
)

d <- c(1.23, 0.99, 2.18)

test %>%
  rowwise() %>%  # before mutate, specify that calculations should be row-wise
  mutate(., s = (function(x, y) {
    dn <- dnorm(x = d, mean = x, sd = y)
    s <- sum(dn)
    s
  })(x, y))
This yields the following result:
# A tibble: 3 x 3
x y s
<dbl> <dbl> <dbl>
1 1 0.5 1.56
2 2 1 0.929
3 3 1.5 0.470
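As an alternative to rowwise(), the same row-wise sum can be computed by iterating over both columns with purrr (attached with the tidyverse). This is a minimal sketch of my own, not part of the answer above:
test %>%
  mutate(s = map2_dbl(x, y, ~ sum(dnorm(d, mean = .x, sd = .y))))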
Apologies for what might be a very simple question.
I am new to using the purrr package in R and I'm struggling with trying to pass a second parameter to a function.
library(dplyr)
library(purrr)
my_function <- function(x, y = 2) {
  z <- x + y
  return(z)
}

my_df_2 <- my_df %>%
  mutate(new_col = map_dbl(.x = old_col, .f = my_function))
This works and most often I don't need to change the value of y, but if I had to pass a different value for y (say y = 3) through the mutate & map combination, what is the syntax for it?
Thank you very much in advance!
Another idea is to use the following syntax.
library(dplyr)
library(purrr)
# The function
my_function <- function(x, y = 2) {
  z <- x + y
  return(z)
}

# Example data frame
my_df <- data_frame(old_col = 1:5)

# Apply the function
my_df_2 <- my_df %>%
  mutate(new_col = map_dbl(old_col, ~ my_function(.x, y = 3)))

my_df_2
# # A tibble: 5 x 2
# old_col new_col
# <int> <dbl>
# 1 1 4.
# 2 2 5.
# 3 3 6.
# 4 4 7.
# 5 5 8.
I think all you need to do is modify the map_dbl call so that y is passed through its ... argument, like so:
library(dplyr)
library(purrr)
df <- data.frame(a = c(2, 3, 4, 5.5))

my_function <- function(x, y = 2) {
  z <- x + y
  return(z)
}

df %>%
  mutate(new_col = map_dbl(.x = a, y = 3, .f = my_function))
a new_col
1 2.0 5.0
2 3.0 6.0
3 4.0 7.0
4 5.5 8.5
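A small side note (not from the answer above): with R >= 4.1 the same call can also be written with the native anonymous-function syntax, which keeps the extra argument visible at the call site:
df %>%
  mutate(new_col = map_dbl(a, \(x) my_function(x, y = 3)))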