Rowwise subtract vectors using purrr - r

I have a numeric dataframe (m rows * n columns).
For each row of this dataframe, I want to treat it as a numeric vector (1 * n) and subtract from it another fixed (1 * n) vector, so each row yields a (1 * n) vector.
I would like to return a list with this vector subtraction done for each row of the dataframe, i.e. a list of m vectors, each 1 * n.
I have manually done this for 2 rows in a simple reprex below:
library(tidyverse)
# A function that takes a row as a vector
diff_vec <- function(inp_vec, diff_val){
  base::return(inp_vec - diff_val)
}
# Create a test (dummy) dataset with 3 rows and 4 columns
test_dat <- mtcars %>% dplyr::slice(c(1, 3, 6)) %>% dplyr::select(1:4)
test_dat
#> mpg cyl disp hp
#> 1 21.0 6 160 110
#> 2 22.8 4 108 93
#> 3 18.1 6 225 105
# This is the vector we want to subtract from each row
diff_v <- c(3.2, 5.4, 7.5, 8.2)
first_row <- test_dat %>% dplyr::slice(1) %>% as.vector()
test_out1 <- diff_vec(inp_vec = first_row, diff_val = diff_v)
first_row
#> mpg cyl disp hp
#> 1 21 6 160 110
test_out1
#> mpg cyl disp hp
#> 1 17.8 0.6 152.5 101.8
second_row <- test_dat %>% dplyr::slice(2) %>% as.vector()
test_out2 = diff_vec(inp_vec = second_row, diff_val = diff_v)
second_row
#> mpg cyl disp hp
#> 1 22.8 4 108 93
test_out2
#> mpg cyl disp hp
#> 1 19.6 -1.4 100.5 84.8
Created on 2019-06-07 by the reprex package (v0.2.1)
Could anyone please show how to do this using a purrr-based approach?
Thanks

There is a simple solution:
test_dat %>% map2_dfc(diff_v, ~ .x - .y)
Resulting tibble:
mpg cyl disp hp
<dbl> <dbl> <dbl> <dbl>
1 17.8 0.600 152. 102.
2 19.6 -1.4 100. 84.8
3 14.9 0.600 218. 96.8
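If you specifically want a list of m row-wise vectors, as asked, rather than a tibble, a minimal sketch with purrr::pmap() (reusing test_dat and diff_v from the question; row_diffs is just an illustrative name) could be:
# pmap() calls the lambda once per row, passing that row's values as
# named arguments; c(...) collects them into a named numeric vector
row_diffs <- test_dat %>% pmap(~ c(...) - diff_v)
# row_diffs is a list of nrow(test_dat) elements, each a named numeric
# vector; row_diffs[[1]] should hold 17.8, 0.6, 152.5, 101.8 like test_out1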

Related

Using tidyverse's curly-curly syntax to access data frame columns within a function

I am trying to calculate an indicator value per group in a dataframe, where the indicator value per group is the sum of one column divided by the sum of another column within that group. I want to pass the column names as numerator and denominator arguments. I have tried the following code to no avail.
library(tidyverse)
a = c(1,1,1,2,2)
b = 1:5
c = 6:10
d = 9:13
dummy_data = tibble(a, b, c, d)
calc_indicator = function(numerator, denominator){
  data = dummy_data %>%
    group_by(a) %>%
    mutate(
      indicator_value = sum({{numerator}})/sum({{denominator}})
    )
  data
}
calc_indicator("b","d")
#> Error in `mutate()`:
#> ! Problem while computing `indicator_value = sum("b")/sum("d")`.
#> ℹ The error occurred in group 1: a = 1.
#> Caused by error in `sum()`:
#> ! invalid 'type' (character) of argument
Created on 2022-10-17 by the reprex package (v2.0.1)
I realize that if I do not use quotes in the arguments submitted to the function (i.e. I enter calc_indicator(b, d) rather than calc_indicator("b", "d")), this code runs. However, numerators and denominators for different indicators are defined in an Excel file, so they arrive in the R environment as strings.
Any suggestions?
As per the Programming with dplyr vignette, {{ is used for unquoted column names, but for column names stored as strings (or in a character vector) you should use .data[[col]], e.g.,
calc_indicator = function(numerator, denominator){
  data = dummy_data %>%
    group_by(a) %>%
    mutate(
      indicator_value = sum(.data[[numerator]])/sum(.data[[denominator]])
    )
  data
}
calc_indicator("b","d")
I'd also recommend passing the data frame into the function as an argument. Functions that rely on having (in this case) a data frame named dummy_data in your global environment are much less flexible.
Right now, your function will only work if you have a data frame named dummy_data, and it will only work on that one data frame. If you rewrite the function to have a data argument, then you can use it on any data frame:
calc_indicator = function(data, group, numerator, denominator){
  data %>%
    group_by(.data[[group]]) %>%
    mutate(
      indicator_value = sum(.data[[numerator]])/sum(.data[[denominator]])
    )
}
## you can still use it on your dummy data
calc_indicator(dummy_data, "a", "b", "c")
## you can use it on other data too
calc_indicator(mtcars, "cyl", "hp", "wt")
# # A tibble: 32 × 12
# # Groups: cyl [3]
# mpg cyl disp hp drat wt qsec vs am gear carb indicator_value
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 39.2
# 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 39.2
# 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 36.2
# 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 39.2
# 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 52.3
# ...
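For completeness, another pattern you may see for string inputs is converting them to symbols with rlang::sym() and injecting them with !!; a rough sketch on the same dummy_data (calc_indicator2 is just an illustrative name):
library(rlang)
calc_indicator2 <- function(data, group, numerator, denominator){
  data %>%
    group_by(!!sym(group)) %>%   # turn the string into a column symbol
    mutate(indicator_value = sum(!!sym(numerator))/sum(!!sym(denominator)))
}
calc_indicator2(dummy_data, "a", "b", "d")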

How to loop through my dataframe and extract certain values?

The code below shows me extracting certain values for one parameter in my data frame (Calcium), but I want to be able to do this for all of the parameters/rows in my data frame. There are multiple rows for Calcium, which is why I took the median value.
How can I create a loop that does this for the other drug substance parameters?
Cal_limits=ag_limits_5 %>% filter(PARAMETER=="Drug Substance.Calcium")
lcl <- median(Cal_limits$LCL, na.rm = TRUE)
ucl <- median(Cal_limits$UCL, na.rm = TRUE)
lsl <- median(Cal_limits$LSL_1, na.rm = TRUE)
usl <- median(Cal_limits$USL_1, na.rm = TRUE)
cl <- median(Cal_limits$TARGET_MEAN, na.rm = TRUE)
stdev <- median(Cal_limits$TARGET_STDEV, na.rm = TRUE)
sigabove <- ucl + stdev # 3.219 (UCL + sd (3.11 + 0.107))
sigbelow <- lcl - stdev # 2.363 (LCL - sd (2.47 - 0.107))
A snapshot (not reproduced here) shows that there are multiple rows dedicated to one parameter; the columns not pictured contain confidential information but include the values I am looking to extract.
Edit: I am creating an R Shiny app, so I am not sure if I will need to incorporate a reactive function.
Using mtcars, you can do
aggregate(. ~ cyl, data = mtcars, FUN = median)
# cyl mpg disp hp drat wt qsec vs am gear carb
# 1 4 26.0 108.0 91.0 4.080 2.200 18.900 1 1 4 2.0
# 2 6 19.7 167.6 110.0 3.900 3.215 18.300 1 0 4 4.0
# 3 8 15.2 350.5 192.5 3.115 3.755 17.175 0 0 3 3.5
which provides the median for each of the variables (. means "all others") for each of the levels of cyl. I'm going to guess that this would apply to your data as
aggregate(. ~ PARAMETER, data = ag_limits_5, FUN = median)
If you have more columns than you want to reduce, then you can specify them manually with
aggregate(cbind(LCL, UCL, LSL_1, USL_1, TARGET_MEAN, TARGET_STDEV) ~ PARAMETER,
          data = ag_limits_5, FUN = median)
and I think you'll get output something like
# PARAMETER LCL UCL LSL_1 USL_1 TARGET_MEAN TARGET_STDEV
# 1 Drug Substance.Calcium 1.1 1.2 1.3 1.4 ...
# 2 Drug Substance.Copper ...
(with real numbers, I'm just showing structure there).
Since it appears that you're using dplyr, you can do it this way, too:
mtcars %>%
group_by(cyl) %>%
summarize(across(everything(), ~ median(., na.rm = TRUE)))
# # A tibble: 3 x 11
# cyl mpg disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 4 26 108 91 4.08 2.2 18.9 1 1 4 2
# 2 6 19.7 168. 110 3.9 3.22 18.3 1 0 4 4
# 3 8 15.2 350. 192. 3.12 3.76 17.2 0 0 3 3.5
which for you might be
ag_limits_5 %>%
group_by(PARAMETER) %>%
summarize(across(everything(), ~ median(., na.rm = TRUE)))
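If you also need the derived limits (sigabove/sigbelow) from your snippet for every parameter at once, a rough sketch building on that summarised table (column names assumed from your code above) would be:
ag_limits_5 %>%
  group_by(PARAMETER) %>%
  summarize(across(everything(), ~ median(., na.rm = TRUE))) %>%
  mutate(sigabove = UCL + TARGET_STDEV,  # UCL + sd, as in the Calcium example
         sigbelow = LCL - TARGET_STDEV)  # LCL - sd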

Mimicking a secondary tidy dots argument in an R function

I'm looking to create a function that accepts a list of (data frame) variables as one of its parameters. I've managed to get it working partially, but when I get to the group_by/count, things fall apart. How can I do this??
## Works
f1 <- function(dfr, ..., split = NULL) {
  dots <- rlang::enquos(...)
  split <- rlang::enquos(split)
  dfr %>%
    select(!!!dots, !!!split) %>%
    gather('type', 'score', -c(!!!split))
}
## does not work
f2 <- function(dfr, ..., split = NULL) {
  dots <- rlang::enquos(...)
  split <- rlang::enquos(split)
  dfr %>%
    select(!!!dots, !!!split) %>%
    gather('type', 'score', -c(!!!split))
  count(!!!split, type, score)
}
I would want to do things like
mtcars %>% f2(drat:qsec)
mtcars %>% f2(drat:qsec, split = gear)
mtcars %>% f2(drat:qsec, split = c(gear, carb)) ## ??
These calls with f1() all work, but for f2() none of them work; they all end up with Error in !split : invalid argument type. That f2(drat:qsec) doesn't (immediately) work without the split argument I'm not too surprised about, but how do I get the second and third calls working?
The issue with the second function (the missing pipe notwithstanding) is that count() (or rather group_by() which is called by count()) doesn't support tidyselect syntax so you can't pass it a list to be spliced like you can with select(), gather() etc. Instead, one option is to use group_by_at() and add_tally(). Here's a slightly modified version of the function:
library(dplyr)
f2 <- function(dfr, ..., split = NULL) {
  dfr %>%
    select(..., {{split}}) %>%
    gather('type', 'score', -{{split}}) %>%
    group_by_at(vars({{split}}, type, score)) %>% # could use `group_by_all()`
    add_tally()
}
mtcars %>% f2(drat:qsec)
# A tibble: 96 x 3
# Groups: type, score [81]
type score n
<chr> <dbl> <int>
1 drat 3.9 2
2 drat 3.9 2
3 drat 3.85 1
4 drat 3.08 2
5 drat 3.15 2
6 drat 2.76 2
7 drat 3.21 1
8 drat 3.69 1
9 drat 3.92 3
10 drat 3.92 3
# ... with 86 more rows
mtcars %>% f2(drat:qsec, split = c(gear, carb))
# A tibble: 96 x 5
# Groups: gear, carb, type, score [89]
gear carb type score n
<dbl> <dbl> <chr> <dbl> <int>
1 4 4 drat 3.9 2
2 4 4 drat 3.9 2
3 4 1 drat 3.85 1
4 3 1 drat 3.08 1
5 3 2 drat 3.15 2
6 3 1 drat 2.76 1
7 3 4 drat 3.21 1
8 4 2 drat 3.69 1
9 4 2 drat 3.92 1
10 4 4 drat 3.92 2
# ... with 86 more rows
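As an aside, on current dplyr (>= 1.0.0) group_by_at() and vars() are superseded; an untested sketch of the same function using across() for the grouping step would be:
f2 <- function(dfr, ..., split = NULL) {
  dfr %>%
    select(..., {{split}}) %>%
    gather('type', 'score', -{{split}}) %>%
    group_by(across(c({{split}}, type, score))) %>%  # across() replaces group_by_at()/vars()
    add_tally()
}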

dplyr group by colnames described as vector of strings

I'm trying to group_by multiple columns in my data frame, and I can't write out every single column name in the group_by() call, so I want to pass the column names as a vector, like so:
cols <- grep("[a-z]{3,}$", colnames(mtcars), value = TRUE)
mtcars %>% filter(disp < 160) %>% group_by(cols) %>% summarise(n = n())
This returns error:
Error in mutate_impl(.data, dots) :
Column `mtcars[colnames(mtcars)[grep("[a-z]{3,}$", colnames(mtcars))]]` must be length 12 (the number of rows) or one, not 7
I definitely want to use a dplyr function to do this, but can't figure this one out.
Update
group_by_at() has been superseded; see https://dplyr.tidyverse.org/reference/group_by_all.html. Refer to the answer below using group_by() with across() for the current recommended approach.
Retaining the below approach for posterity
You can use group_by_at, where you can pass a character vector of column names as group variables:
mtcars %>%
filter(disp < 160) %>%
group_by_at(cols) %>%
summarise(n = n())
# A tibble: 12 x 8
# Groups: mpg, cyl, disp, drat, qsec, gear [?]
# mpg cyl disp drat qsec gear carb n
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 19.7 6 145.0 3.62 15.50 5 6 1
# 2 21.4 4 121.0 4.11 18.60 4 2 1
# 3 21.5 4 120.1 3.70 20.01 3 1 1
# 4 22.8 4 108.0 3.85 18.61 4 1 1
# ...
Or you can move the column selection inside group_by_at using vars and column select helper functions:
mtcars %>%
filter(disp < 160) %>%
group_by_at(vars(matches('[a-z]{3,}$'))) %>%
summarise(n = n())
# A tibble: 12 x 8
# Groups: mpg, cyl, disp, drat, qsec, gear [?]
# mpg cyl disp drat qsec gear carb n
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 19.7 6 145.0 3.62 15.50 5 6 1
# 2 21.4 4 121.0 4.11 18.60 4 2 1
# 3 21.5 4 120.1 3.70 20.01 3 1 1
# 4 22.8 4 108.0 3.85 18.61 4 1 1
# ...
I believe group_by_at has now been superseded by using a combination of group_by and across. And summarise has an experimental .groups argument where you can choose how to handle the grouping after you create a summarised object. Here is an alternative to consider:
cols <- grep("[a-z]{3,}$", colnames(mtcars), value = TRUE)
original <- mtcars %>%
filter(disp < 160) %>%
group_by_at(cols) %>%
summarise(n = n())
superseded <- mtcars %>%
filter(disp < 160) %>%
group_by(across(all_of(cols))) %>%
summarise(n = n(), .groups = 'drop_last')
all.equal(original, superseded)
Here is a blog post that goes into more detail about using the across function:
https://www.tidyverse.org/blog/2020/04/dplyr-1-0-0-colwise/
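If you are on dplyr >= 1.1.0, there is also the per-operation .by argument, which takes a tidyselect specification and therefore accepts a character vector via all_of(); a minimal sketch:
mtcars %>%
  filter(disp < 160) %>%
  summarise(n = n(), .by = all_of(cols))  # returns an ungrouped result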

Stepping through a pipeline with intermediate results

Is there a way to output the result of a pipeline at each step without doing it manually? (eg. without selecting and running only the selected chunks)
I often find myself running a pipeline line-by-line to remember what it was doing or when I am developing some analysis.
For example:
library(dplyr)
mtcars %>%
group_by(cyl) %>%
sample_frac(0.1) %>%
summarise(res = mean(mpg))
# Source: local data frame [3 x 2]
#
# cyl res
# 1 4 33.9
# 2 6 18.1
# 3 8 18.7
I'd like to select and run:
mtcars %>% group_by(cyl)
and then...
mtcars %>% group_by(cyl) %>% sample_frac(0.1)
and so on...
But selecting and pressing CMD/CTRL+ENTER in RStudio leaves something to be desired; I'm hoping for a more efficient method.
Can this be done in code?
Is there a function that takes a pipeline and runs/digests it line by line, showing the output at each step in the console, where you continue by pressing Enter, as in demo(...) or example(...) from package guides?
You can select which results to print by using the tee-operator (%T>%) and print(). The tee-operator is used exclusively for side-effects like printing.
# i.e. (attach magrittr for the tee-operator; dplyr only re-exports %>%)
library(magrittr)
mtcars %>%
group_by(cyl) %T>% print() %>%
sample_frac(0.1) %T>% print() %>%
summarise(res = mean(mpg))
It is easy with a magrittr functional sequence. For example, define a function my_chain with:
foo <- function(x) x + 1
bar <- function(x) x + 1
baz <- function(x) x + 1
my_chain <- . %>% foo %>% bar %>% baz
and get the final result of a chain as:
> my_chain(0)
[1] 3
You can get a function list with functions(my_chain)
and define a "stepper" function like this:
stepper <- function(fun_chain, x, FUN = print) {
  f_list <- functions(fun_chain)
  for(i in seq_along(f_list)) {
    x <- f_list[[i]](x)
    FUN(x)
  }
  invisible(x)
}
And run the chain with interposed print function:
stepper(my_chain, 0, print)
# [1] 1
# [1] 2
# [1] 3
Or with waiting for user input:
stepper(my_chain, 0, function(x) {print(x); readline()})
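Applied to the pipeline from the question, the same idea might look like this (a sketch assuming dplyr and magrittr are attached; functions() comes from magrittr, and my_dplyr_chain is just an illustrative name):
library(dplyr)
library(magrittr)
my_dplyr_chain <- . %>%
  group_by(cyl) %>%
  sample_frac(0.1) %>%
  summarise(res = mean(mpg))
# print each intermediate result, pausing for Enter between steps
stepper(my_dplyr_chain, mtcars, function(x) {print(x); readline()})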
Add print:
mtcars %>%
group_by(cyl) %>%
print %>%
sample_frac(0.1) %>%
print %>%
summarise(res = mean(mpg))
IMHO magrittr is mostly useful interactively, that is, when I am exploring data or building a new formula/model.
In these cases, storing intermediate results in distinct variables is very time consuming and distracting, while pipes let me focus on the data rather than typing:
x %>% foo
## reason on results and
x %>% foo %>% bar
## reason on results and
x %>% foo %>% bar %>% baz
## etc.
The problem here is that I don't know in advance what the final pipe will be, as in @bergant's answer.
Typing, as in @zx8754's answer,
x %>% print %>% foo %>% print %>% bar %>% print %>% baz
adds too much overhead and, to me, defeats the whole purpose of magrittr.
Essentially magrittr lacks a simple operator that both prints and pipes results.
The good news is that it seems quite easy to craft one:
`%P>%`=function(lhs, rhs){ print(lhs); lhs %>% rhs }
Now you can print and pipe:
1:4 %P>% sqrt %P>% sum
## [1] 1 2 3 4
## [1] 1.000000 1.414214 1.732051 2.000000
## [1] 6.146264
I found that if one defines/uses key bindings for %P>% and %>%, the prototyping workflow is very streamlined (see Emacs ESS or RStudio).
I wrote the package pipes, which can do several things that might help:
use %P>% to print the output.
use %ae>% to use all.equal on input and output.
use %V>% to use View on the output, it will open a viewer for each relevant step.
If you want to see some aggregated info, you can try %summary>%, %glimpse>% or %skim>%, which use summary, tibble::glimpse or skimr::skim, or you can define your own pipe to show specific changes using new_pipe.
# devtools::install_github("moodymudskipper/pipes")
library(dplyr)
library(pipes)
res <- mtcars %P>%
group_by(cyl) %P>%
sample_frac(0.1) %P>%
summarise(res = mean(mpg))
#> group_by(., cyl)
#> # A tibble: 32 x 11
#> # Groups: cyl [3]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # ... with 22 more rows
#> sample_frac(., 0.1)
#> # A tibble: 3 x 11
#> # Groups: cyl [3]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
#> 2 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
#> 3 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> summarise(., res = mean(mpg))
#> # A tibble: 3 x 2
#> cyl res
#> <dbl> <dbl>
#> 1 4 26
#> 2 6 17.8
#> 3 8 18.7
res <- mtcars %ae>%
group_by(cyl) %ae>%
sample_frac(0.1) %ae>%
summarise(res = mean(mpg))
#> group_by(., cyl)
#> [1] "Attributes: < Names: 1 string mismatch >"
#> [2] "Attributes: < Length mismatch: comparison on first 2 components >"
#> [3] "Attributes: < Component \"class\": Lengths (1, 4) differ (string compare on first 1) >"
#> [4] "Attributes: < Component \"class\": 1 string mismatch >"
#> [5] "Attributes: < Component 2: Modes: character, list >"
#> [6] "Attributes: < Component 2: Lengths: 32, 2 >"
#> [7] "Attributes: < Component 2: names for current but not for target >"
#> [8] "Attributes: < Component 2: Attributes: < target is NULL, current is list > >"
#> [9] "Attributes: < Component 2: target is character, current is tbl_df >"
#> sample_frac(., 0.1)
#> [1] "Different number of rows"
#> summarise(., res = mean(mpg))
#> [1] "Cols in y but not x: `res`. "
#> [2] "Cols in x but not y: `qsec`, `wt`, `drat`, `hp`, `disp`, `mpg`, `carb`, `gear`, `am`, `vs`. "
res <- mtcars %V>%
group_by(cyl) %V>%
sample_frac(0.1) %V>%
summarise(res = mean(mpg))
# you'll have to test this one by yourself
