Evaluation order inconsistency with dplyr mutate - r

I have 2 functions that I use inside a mutate call. One produces per row results as expected while the other repeats the same value for all rows:
library(dplyr)
df <- data.frame(X = rpois(5, 10), Y = rpois(5,10))
pv <- function(a, b) {
fisher.test(matrix(c(a, b, 10, 10), 2, 2),
alternative='greater')$p.value
}
div <- function(a, b) a/b
mutate(df, d = div(X,Y), p = pv(X, Y))
which produces something like:
X Y d p
1 9 15 0.6000000 0.4398077
2 8 7 1.1428571 0.4398077
3 9 14 0.6428571 0.4398077
4 11 15 0.7333333 0.4398077
5 11 7 1.5714286 0.4398077
i.e. the d column varies, but p is constant and its value does not actually correspond to the X and Y values in any of the rows.
I suspect this relates to NSE, but I do not understand how from what little I have been able to find out about it.
What accounts for the different behaviours of div and pv? How do I fix pv?

We need rowwise
df %>%
rowwise() %>%
mutate(d = div(X,Y), p = pv(X,Y))
# X Y d p
# <int> <int> <dbl> <dbl>
#1 10 9 1.111111 0.5619072
#2 12 8 1.500000 0.3755932
#3 9 8 1.125000 0.5601923
#4 11 16 0.687500 0.8232217
#5 16 10 1.600000 0.3145350
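In the OP's code, pv receives the entire 'X' and 'Y' columns as vectors and returns a single value, which mutate then recycles across all rows. div works without rowwise because / is already vectorized.
Or as @Frank mentioned, mapply can be used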
df %>%
mutate(d = div(X,Y), p = mapply(pv, X, Y))
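A quick way to see the difference, as a sketch in base R with made-up X and Y values: div is built on /, which works element by element, whereas pv hands the whole vectors to matrix(), which as far as I can tell keeps only the first four values and so yields a single p-value. Vectorize() is one more way to wrap pv so that it is called once per pair; pv_vec is just a name chosen here for illustration.

```r
pv <- function(a, b) {
  fisher.test(matrix(c(a, b, 10, 10), 2, 2),
              alternative = "greater")$p.value
}
div <- function(a, b) a / b

# Made-up values, only for illustration
X <- c(9, 8, 9, 11, 11)
Y <- c(15, 7, 14, 15, 7)

# `/` is vectorized: one result per element
d <- div(X, Y)
length(d)      # 5

# pv collapses the whole vectors into one 2x2 table: one result in total
# (newer R versions may warn about the data length, hence suppressWarnings)
p_all <- suppressWarnings(pv(X, Y))
length(p_all)  # 1

# Vectorize() wraps pv in mapply, so it is called once per (X, Y) pair
pv_vec <- Vectorize(pv)
p <- pv_vec(X, Y)
length(p)      # 5
```

Inside mutate, p = pv_vec(X, Y) should then behave like the mapply version.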

Related

How to do calculations on a column of a data frame using values contained in another data frame in R?

I have 2 data frames: one with experimental data and one with values of some constants. Experimental data and constants are separated by categories (a and b). I would like to include a new column in the experimental data frame that is the result of the following calculation:
z = k*y
To do this, I'm using the dplyr package and the mutate() function, but I'm not getting the expected result. Does anyone have any tips or suggestions, even if it is necessary to use another package?
library(dplyr)
Category <- c("a", "b")
k <- c(1, 2)
# Data frame with the constants for each category
Constant <- data.frame(Category, k)
x <- seq(0,5,1)
df <- expand.grid(x = x,
Category = Category)
# Data frame with the experimental results
df$y <- seq(1,12,1)
# Failed attempt to calculate z separated by categories
df %>%
group_by(Category) %>%
mutate(z = Constant*y)
With dplyr you can do the following:
library(dplyr)
left_join(df, Constant, by = c("Category")) %>%
mutate(z = k * y) %>%
select(-k)
I did this:
a = c()
for(i in unique(df$Category)){
a = c(a,df[df$Category==i,"y"]*Constant[Constant$Category==i,'k'])
}
df$z=a
result:
x Category y z
1 0 a 1 1
2 1 a 2 2
3 2 a 3 3
4 3 a 4 4
5 4 a 5 5
6 5 a 6 6
7 0 b 7 14
8 1 b 8 16
9 2 b 9 18
10 3 b 10 20
11 4 b 11 22
12 5 b 12 24
I don't know if this is what you were looking for. Just keep in mind that this only works if your df is sorted by the Category column.
If you don't like for loops, here is an lapply version:
df$z =unlist( lapply(unique(df$Category), function(i){return(df[df$Category==i,"y"]*Constant[Constant$Category==i,'k'])}))
If the data isn't sorted by category:
df$z=unlist(lapply(1:nrow(df),function(i){ return(df[i,"y"]*Constant[Constant$Category==df[i,"Category"],'k'])}))
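For the unsorted case, a vectorized lookup with match() may be simpler than going row by row. This is only a sketch: it rebuilds the example data, shuffles the rows to show that order does not matter, and idx is a helper name introduced here.

```r
Category <- c("a", "b")
Constant <- data.frame(Category, k = c(1, 2))

df <- expand.grid(x = 0:5, Category = Category)
df$y <- 1:12

# Shuffle the rows so the data is no longer sorted by Category
set.seed(1)
df <- df[sample(nrow(df)), ]

# For each row of df, find the position of its category in Constant
idx <- match(df$Category, Constant$Category)

# Pick the matching constant and multiply
df$z <- Constant$k[idx] * df$y
```

This does the same job as the left_join answer: each row picks up the k of its own category, whatever the row order.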

Tidyverse Rowwise sum of columns that may or may not exist

Consider the following tibble:
library(tidyverse)
data <- tibble(x = c(rnorm(5,2,n = 10)*1000,NA,1000),
y = c(rnorm(1,1,n = 10)*1000,NA,NA))
Suppose I want to make a row-wise sum of "x" and "y", creating variable "z", like this:
data %>%
rowwise() %>%
mutate(z = sum(c(x,y), na.rm = T))
This works fine for what I want, but the problem is that my true dataset has many variables and I do not want to check beforehand which variables I have and which I do not. So, suppose the sum may include variables that do not exist:
data %>%
rowwise() %>%
mutate(k = sum(c(x,y,w), na.rm = T))
In this case, it will not run, because column "w" does not exist.
How can I make it run anyway, ignoring the non-existence of "w" and summing over "x" and "y"?
PS: I prefer to do it without filtering the dataset before running the sum. I would like to somehow make the sum happen in any case, whether variables exist or not.
If I understood your problem correctly, this would be a solution (a slight modification of @Duck's comment):
library(tidyverse)
data <- tibble(x = c(rnorm(5,2,n = 10)*1000,NA,1000),
y = c(rnorm(1,1,n = 10)*1000,NA,NA),
a = c(rnorm(1,1,n = 10)*1000,NA,NA))
wishlist <- c("x","y","w")
data %>%
dplyr::rowwise() %>%
dplyr::mutate(Sum=sum(c_across(colnames(data)[colnames(data) %in% wishlist]),na.rm=T))
x y a Sum
<dbl> <dbl> <dbl> <dbl>
1 3496. 439. -47.7 3935.
2 6046. 460. 2419. 6506.
3 6364. 672. 1030. 7036.
4 1068. 1282. 2811. 2350.
5 2455. 990. 689. 3445.
6 6477. -612. -1509. 5865.
7 7623. 1554. 2828. 9177.
8 5120. 482. -765. 5602.
9 1547. 1328. 817. 2875.
10 5602. -1019. 695. 4582.
11 NA NA NA 0
12 1000 NA NA 1000
Try this:
library(tidyverse)
data <- tibble(x = c(rnorm(5,2,n = 10)*1000,NA,1000),
y = c(rnorm(1,1,n = 10)*1000,NA,NA))
# intersect() keeps only the wished-for columns that actually exist in data
data$k <- rowSums(as.data.frame(data[, intersect(c("x","y","w"), names(data))]), na.rm=TRUE)
Output:
# A tibble: 12 x 3
x y k
<dbl> <dbl> <dbl>
1 3121. 934. 4055.
2 6523. 1477. 8000.
3 5538. 863. 6401.
4 3099. 1344. 4443.
5 4241. 284. 4525.
6 3251. -448. 2803.
7 4786. -291. 4495.
8 4378. 910. 5288.
9 5342. 653. 5996.
10 4772. 1818. 6590.
11 NA NA 0
12 1000 NA 1000
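Another option that might be worth trying is the any_of() selection helper, which selects the listed columns that exist and silently skips the rest. A small sketch with made-up numbers, assuming dplyr 1.0.0 or later; wishlist is the same illustrative name used above.

```r
library(dplyr)

# Small made-up data set; there is no column "w"
data <- tibble(x = c(1, NA, 3),
               y = c(3, 2, NA))

wishlist <- c("x", "y", "w")

# any_of() keeps only the wishlist columns that are present in data
result <- data %>%
  mutate(k = rowSums(across(any_of(wishlist)), na.rm = TRUE))

result$k  # 4 2 3
```

Because rowSums() is vectorized over rows, this version does not need rowwise().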

Tidyverse Solution for Using Tibble Columns as Input to a Function

I am trying to run a function on all combinations of two column vectors in a tibble.
library(tidyverse)
combination <- tibble(x = c(1, 2), y = c(3, 4))
sum_square <- function(x, y) {
x^2+y^2
}
I would like to run this function on all combinations of column x and column y:
sum_square(1, 3)
sum_square(1, 4)
sum_square(2, 3)
sum_square(2, 4)
Ideally I would like a tidyverse solution.
We can first expand and then apply sum_square on the expanded dataset
library(tidyverse)
expand(combination, x, y) %>%
mutate(new = sum_square(x, y))
# A tibble: 4 x 3
# x y new
# <dbl> <dbl> <dbl>
#1 1 3 10
#2 1 4 17
#3 2 3 13
#4 2 4 20
Another option is outer
combination %>%
reduce(outer, FUN = sum_square) %>%
c %>%
tibble(new = .)
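Note that the two approaches list the results in a different order: outer() returns a matrix, and flattening it goes column by column, so x varies fastest. A base R sketch of what happens, using expand.grid() (which also varies x fastest) to keep the inputs next to the result; m and grid are names made up for this example.

```r
sum_square <- function(x, y) {
  x^2 + y^2
}

x <- c(1, 2)
y <- c(3, 4)

# outer() evaluates sum_square for every (x, y) pair and returns a matrix:
# rows follow x, columns follow y
m <- outer(x, y, FUN = sum_square)

# Flattening a matrix goes column by column, i.e. x varies fastest
flat <- as.vector(m)
flat  # 10 13 17 20

# expand.grid() enumerates the pairs in the same order
grid <- expand.grid(x = x, y = y)
grid$new <- sum_square(grid$x, grid$y)
```

If the order of the expand() output matters (y varying fastest), sorting grid by x and then y should give the same rows.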

Add multiple output variables using purrr and a predefined function

Take this simple dataset and function (representative of more complex problems):
x <- data.frame(a = 1:3, b = 2:4)
mult <- function(a,b,n) (a + b) * n
Using base R's Map I could do this to add 2 new columns in a vectorised fashion:
ns <- 1:2
x[paste0("new",seq_along(ns))] <- Map(mult, x["a"], x["b"], n=ns)
x
# a b new1 new2
#1 1 2 3 6
#2 2 3 5 10
#3 3 4 7 14
purrr attempt via pmap gets close with a list output:
library(purrr)
library(dplyr)
x %>% select(a,b) %>% pmap(mult, n=1:2)
#[[1]]
#[1] 3 6
#
#[[2]]
#[1] 5 10
#
#[[3]]
#[1] 7 14
My attempts from here with pmap_dfr etc all seem to error out in trying to map this back to new columns.
How do I end up making 2 further variables which match my current "new1"/"new2"? I'm sure there is a simple incantation, but I'm clearly overlooking it or using the wrong *map* function.
There is some useful discussion here - How to use map from purrr with dplyr::mutate to create multiple new columns based on column pairs - but it seems overly hacky and inflexible for what I imagined was a simple problem.
The best approach I've found (which is still not terribly elegant) is to pipe into bind_cols. To get pmap_dfr to work correctly, the function should return a named list (which may or may not be a data frame):
library(tidyverse)
x <- data.frame(a = 1:3, b = 2:4)
mult <- function(a,b,n) as.list(set_names((a + b) * n, paste0('new', n)))
x %>% bind_cols(pmap_dfr(., mult, n = 1:2))
#> a b new1 new2
#> 1 1 2 3 6
#> 2 2 3 5 10
#> 3 3 4 7 14
To avoid changing the definition of mult, you can wrap it in an anonymous function:
mult <- function(a,b,n) (a + b) * n
x %>% bind_cols(pmap_dfr(
.,
~as.list(set_names(
mult(...),
paste0('new', 1:2)
)),
n = 1:2
))
#> a b new1 new2
#> 1 1 2 3 6
#> 2 2 3 5 10
#> 3 3 4 7 14
In this particular case, it's not actually necessary to iterate over rows, though, because you can vectorize the inputs from x and instead iterate over n. The advantage is that a data frame usually has more rows than there are values of n, so the number of iterations will be [potentially much] lower. To be clear, whether such an approach is possible depends on which parameters of the function accept vector arguments.
mult still needs to be called on the variables of x. The simplest way to do this is to pass them explicitly:
x %>% bind_cols(map_dfc(1:2, ~mult(x$a, x$b, .x)))
#> a b V1 V2
#> 1 1 2 3 6
#> 2 2 3 5 10
#> 3 3 4 7 14
...but this loses the benefit of pmap that named variables will automatically get passed to the correct parameter. You can get that back by using purrr::lift, which is an adverb that changes the domain of a function so it accepts a list by wrapping it in do.call. The returned function can be called on x and the value of n for that iteration:
x %>% bind_cols(map_dfc(1:2, ~lift(mult)(x, n = .x)))
This is equivalent to
x %>% bind_cols(map_dfc(1:2, ~invoke(mult, x, n = .x)))
but the advantage of the former is that it returns a function which can be partially applied on x so it only has an n parameter left, and thus requires no explicit references to x and so pipes better:
x %>% bind_cols(map_dfc(1:2, partial(lift(mult), .)))
All return the same thing. Names can be fixed after the fact with %>% set_names(~sub('^V(\\d+)$', 'new\\1', .x)), if you like.
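Since lift() is described above as a wrapper around do.call, the same idea can be sketched in base R. This may be useful because, if I remember correctly, lift() and invoke() were deprecated in purrr 1.0.0. The names new_cols and result are made up for this example.

```r
x <- data.frame(a = 1:3, b = 2:4)
mult <- function(a, b, n) (a + b) * n
ns <- 1:2

# For each n, call mult with the columns of x as named arguments (a, b)
# plus n; this is roughly what lift(mult)(x, n = n) does
new_cols <- lapply(ns, function(n) do.call(mult, c(as.list(x), list(n = n))))
names(new_cols) <- paste0("new", ns)

# Attach the new columns to the original data
result <- cbind(x, as.data.frame(new_cols))
result
#   a b new1 new2
# 1 1 2    3    6
# 2 2 3    5   10
# 3 3 4    7   14
```

As with pmap, the columns of x are matched to the parameters of mult by name, so x must contain only columns that mult accepts.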
Here is one possibility.
library(purrr)
library(dplyr)
library(tidyr) # unnest() and spread() come from tidyr
n <- 1:2
x %>%
mutate(val = pmap(., mult, n = n)) %>%
unnest(val) %>%
mutate(var = rep(paste0("new", n), nrow(.) / length(n))) %>%
spread(var, val)
# a b new1 new2
#1 1 2 3 6
#2 2 3 5 10
#3 3 4 7 14
Not pretty, so I'm also curious to see alternatives. A lot of excess comes about from unnesting the list column and spreading into new columns.
Here is another possibility using pmap_dfc plus an ugly as.data.frame(t(...)) call
bind_cols(x, as.data.frame(t(pmap_dfc(x, mult, n = n))))
# a b V1 V2
#1 1 2 3 6
#2 2 3 5 10
#3 3 4 7 14
Sample data
x <- data.frame(a = 1:3, b = 2:4)
mult <- function(a,b,n) (a + b) * n
To mimic the input format for Map, we could call pmap from purrr in this way:
x[paste0("new",seq_along(ns))] <- pmap(list(x['a'], x['b'], ns), mult)
To fit this in a pipe:
x %>%
{list(.['a'], .['b'], ns)} %>%
pmap(mult) %>%
setNames(paste0('new', seq_along(ns))) %>%
cbind(x)
# new1 new2 a b
# 1 3 6 1 2
# 2 5 10 2 3
# 3 7 14 3 4
Admittedly, this looks ugly compared to the concise base R code, but I could not think of a better way.

Minimum value matching across values in multiple columns

I would like to return a dataframe with the minimum value of column one based on the values of columns 2-4:
df <- data.frame(one = rnorm(1000),
two = sample(letters, 1000, replace = T),
three = sample(letters, 1000, replace = T),
four = sample(letters, 1000, replace = T))
I can do:
df_group <- df %>%
group_by(two) %>%
filter(one == min(one))
This gets me the lowest value of all the "m's" in column two, but what if column three or four had a lower "m" value in column one?
The output should look like this:
one two
1 -0.311609752 r
2 0.053166742 n
3 1.546485810 a
4 -0.430308725 d
5 -0.145428664 c
6 0.419181639 u
7 0.008881661 i
8 1.223517580 t
9 0.797273157 b
10 0.790565358 v
11 -0.560031797 e
12 -1.546234090 q
13 -1.847945540 l
14 -1.489130228 z
15 -1.203255034 g
16 0.146969892 m
17 -0.552363433 f
18 -0.006234646 w
19 0.982932856 s
20 0.751936728 o
21 0.220751258 h
22 -1.557436228 y
23 -2.034885868 k
24 -0.463354387 j
25 -0.351448850 p
26 1.331365941 x
I don't care which column has the lowest value for a given letter, I just need the lowest value and the letter column.
I'm trying to wrap my head around writing this simplistically. This might be a duplicate, but I didn't know how to word the title and couldn't find any material or previous questions on how to do it.
Another solution, based on data.table:
library(data.table)
setDT(df)
melt(df,
measure=grep("one",names(df),invert = TRUE,value=TRUE))[
,min(one),value]
You can do something like this:
library(dplyr); library(tidyr)
df %>% gather(cols, letts, -one) %>% # gather all letters into one column
group_by(letts) %>%
summarise(one = min(one)) # do a group by summary for each letter
# A tibble: 26 × 2
# letts one
# <chr> <dbl>
#1 a -2.092327
#2 b -2.461102
#3 c -3.055858
#4 d -2.092327
#5 e -2.461102
#6 f -2.249439
#7 g -1.941632
#8 h -2.543310
#9 i -3.055858
#10 j -1.896974
# ... with 16 more rows
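The same reshape-then-group idea can be sketched in base R. The tiny data frame below is made up so that the result is easy to check by hand; long and res are illustrative names.

```r
# Tiny made-up data set with the same layout as the question
df <- data.frame(one   = c(0.5, -1, 2, 3),
                 two   = c("a", "b", "c", "d"),
                 three = c("b", "b", "c", "d"),
                 four  = c("c", "a", "c", "d"),
                 stringsAsFactors = FALSE)

# Stack the three letter columns into one, repeating `one` alongside them
long <- data.frame(
  one   = rep(df$one, times = 3),
  letts = unlist(df[c("two", "three", "four")], use.names = FALSE)
)

# Minimum of `one` for each letter, whichever column the letter came from
res <- tapply(long$one, long$letts, min)
res
#    a    b    c    d
# -1.0 -1.0  0.5  3.0
```

Here "a" appears next to 0.5 (column two) and -1 (column four), so its minimum is -1 regardless of which column held it.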
