Self defined function with dplyr functions won't accept argument values [duplicate]

Self defined function with dplyr functions won't accept argument values [duplicate] - r

This question already has an answer here:
dplyr with name of columns in a function
(1 answer)
Closed 4 years ago.
I'm trying to use dplyr's mutate_at to subtract a numeric column's value (A1) from another corresponding numeric column (A2), I have multiple columns and several data frames I want to do for this for (BCDE..., df1:df99) so I want to write a function.
df1 <- df1 %>% mutate_at(.vars = vars(A1), .funs = funs(remainder = .-A2))
Works fine, however when I try and write a function to perform this:
REMAINDER <- function(df, numer, denom){
df <- df %>% mutate_at(.vars = vars(numer), .funs = funs(remainder = .-denom))
return(df)
}
With arguments df1 <- REMAINDER(df1, A1, A2)
I get the error Error in mutate_impl(.data, dots) :
Evaluation error: non-numeric argument to binary operator.
Which I don't understand as I just manually called the line of code without a function and my columns are numeric.

The vignette Programming with dplyr explains in great detail what to do:
library(dplyr)
REMAINDER <- function(df, numer, denom) {
numer <- enquo(numer)
denom <- enquo(denom)
df %>% mutate_at(.vars = vars(!! numer), .funs = funs(remainder = . - !! denom))
}
df1 <- data_frame(A1 = 11:13, A2 = 3:1, B1 = 21:23, B2 = 8:6)
REMAINDER(df1, A1, A2)
# A tibble: 3 x 5
A1 A2 B1 B2 remainder
<int> <int> <int> <int> <int>
1 11 3 21 8 8
2 12 2 22 7 10
3 13 1 23 6 12
REMAINDER(df1, B1, B2)
# A tibble: 3 x 5
A1 A2 B1 B2 remainder
<int> <int> <int> <int> <int>
1 11 3 21 8 13
2 12 2 22 7 15
3 13 1 23 6 17
Naming the result column
The OP wants to update df1 and he wants to apply this operation to other columns as well.
Unfortunately, the REMAINDER() function as it is currently defined will overwrite the result column:
df1
# A tibble: 3 x 4
A1 A2 B1 B2
<int> <int> <int> <int>
1 11 3 21 8
2 12 2 22 7
3 13 1 23 6
df1 <- REMAINDER(df1, A1, A2)
df1
# A tibble: 3 x 5
A1 A2 B1 B2 remainder
<int> <int> <int> <int> <int>
1 11 3 21 8 8
2 12 2 22 7 10
3 13 1 23 6 12
df1 <- REMAINDER(df1, B1, B2)
df1
# A tibble: 3 x 5
A1 A2 B1 B2 remainder
<int> <int> <int> <int> <int>
1 11 3 21 8 13
2 12 2 22 7 15
3 13 1 23 6 17
The function can be modified so that the result column is individually named:
REMAINDER <- function(df, numer, denom) {
numer <- enquo(numer)
denom <- enquo(denom)
result_name <- paste0("remainder_", quo_name(numer), "_", quo_name(denom))
df %>% mutate_at(.vars = vars(!! numer),
.funs = funs(!! result_name := . - !! denom))
}
Now, calling REMAINDER() twice on different columns and replacing df1 after each call, we get
df1 <- REMAINDER(df1, A1, A2)
df1 <- REMAINDER(df1, B1, B2)
df1
# A tibble: 3 x 6
A1 A2 B1 B2 remainder_A1_A2 remainder_B1_B2
<int> <int> <int> <int> <int> <int>
1 11 3 21 8 8 13
2 12 2 22 7 10 15
3 13 1 23 6 12 17

I have used this suggestion in order to subtract pairs of columns in a list of data frames. My example has only 3 pairs of columns in each of the two data frames and it can work with higher number of columns and data frames.
dt <- data.table(A1 = round(runif(3),1), A2 = round(runif(3),1),
B1 = round(runif(3),1), B2 = round(runif(3),1),
C1 =round(runif(3),1), C2 =round(runif(3),1))
dt = list(dt,dt+dt)
lapply(seq_along(dt), function(z) {
dt[[z]][, lapply(1:(ncol(.SD)/2), function(x) (.SD[[2*x-1]] - .SD[[2*x]]))]
})

Related

Select columns using starts_with()

With select(starts_with("A") I can select all the columns in a dataframe/tibble starting with "A".
But how can I select all the columns in a dataframe/tibble starting with one of the letters in a vector?
Example:
columns_to_select <- c("A", "B", "C")
df %>% select(starts_with(columns_to_select))
I would like to select A1, A2, A3... and B1, B2, B3, ... and C1, C2, Cxy...

This currently seems to be working the way you're describing:
library(tidyverse)
df <- tibble(A1 = 1:10, B1 = 1:10, C3 = 21:30, D2 = 11:20)
columns_to_select <- c("A", "B", "C")
df |>
select(starts_with(columns_to_select))
#> # A tibble: 10 × 3
#> A1 B1 C3
#> <int> <int> <int>
#> 1 1 1 21
#> 2 2 2 22
#> 3 3 3 23
#> 4 4 4 24
#> 5 5 5 25
#> 6 6 6 26
#> 7 7 7 27
#> 8 8 8 28
#> 9 9 9 29
#> 10 10 10 30
Do you mean to select only by one of the letters at a time? (you can use columns_to_select[1] for this) Apologies if I've misunderstood the question - can delete this response if not relevant.

Define groups of columns and sum all i-th columns of each groups with dplyr

I have two groups of columns, each with 36 columns, and I want to sum all i-th column of group 1 with i-th column of group2, getting 36 columns. The number of columns in each group is not fix in my code, although each group has the same number of them.
Exemple. What I have:
teste <- tibble(a1=c(1,2,3),a2=c(7,8,9),b1=c(4,5,6),b2=c(10,20,30))
a1 a2 b1 b2
<dbl> <dbl> <dbl> <dbl>
1 1 7 4 10
2 2 8 5 20
3 3 9 6 30
What I want:
resultado <- teste %>%
summarise(
a_b1 = a1+b1,
a_b2 = a2+b2
)
a_b1 a_b2
<dbl> <dbl>
1 5 17
2 7 28
3 9 39
It would be nice to perform this operation with dplyr.
I would thank any help.

You will struggle to find a dplyr solution as simple and elegant as the base R one:
teste[1:2] + teste[3:4]
#> a1 a2
#> 1 5 17
#> 2 7 28
#> 3 9 39
Though I guess in dplyr you get the same result with:
teste %>% select(starts_with("a")) + teste %>% select(starts_with("b"))

teste %>%
summarise(across(starts_with("a")) + across(starts_with("b")))
# A tibble: 3 x 2
a1 a2
<dbl> <dbl>
1 5 17
2 7 28
3 9 39

This might also help in base R:
as.data.frame(do.call(cbind, lapply(split.default(teste, sub("\\D(\\d+)", "\\1", names(teste))), rowSums, na.rm = TRUE)))
1 2
1 5 17
2 7 28
3 9 39

Another dplyr solution. We can use rowwise and c_across together to sum the values per row. Notice that we can add na.rm = TRUE to the sum function in this case.
library(dplyr)
teste2 <- teste %>%
rowwise() %>%
transmute(a_b1 = sum(c_across(ends_with("1")), na.rm = TRUE),
a_b2 = sum(c_across(ends_with("2")), na.rm = TRUE)) %>%
ungroup()
teste2
# # A tibble: 3 x 2
# a_b1 a_b2
# <dbl> <dbl>
# 1 5 17
# 2 7 28
# 3 9 39

Can I extract the value from several columns by column's name?

library(dplyr)
mydf <- data.frame(a_x = c(1,2,3,4,5),
b_x = c(8,9,10,11,12),
a_y = c("k",'b','a','d','z'),
b_y = c('aa','bb','cc','dd','ee'),
prefix=c("a","b","c","a","a"))
mydf
Assuming that the data I have is mydf, I would like to produce the same result as mydf2.
I made a column with the name of the column containing the value to be extracted.
I want to extract the value through this column.
mydf2 <- data.frame(a_x=c(1,2,3,4,5),
b_x=c(8,9,10,11,12),
prefix=c("a","b","c","a","a"),
desired_x_value = c(1,9,NA,4,5),
desired_y_value = c('k','bb',NA,'d','z'))
mydf2
I've used 'get' and 'paste0' but it doesn't work. Can I solve this problem through 'dplyr' chain?
mydf %>% mutate(desired_x_value = get(paste0(prefix,"_x")),
desired_y_value = get(paste0(prefix,"_y")))

So basically you want to create new columns (desired_x_value and desired_y_value) of which its value depends on a condition. Using dplyr I prefer case_when as it is the best readable way to do it, but you could also use (nested) if(else) statements. What it is doing is "if X meets condition A do Y, if X meets condition B do Z, if X meets condition .... do ..."
mydf %>%
dplyr::mutate(
desired_x_value = case_when(
prefix == "a" ~ a_x,
prefix == "b" ~ b_x,
desired_y_values = case_when(
prefix == "a" ~a_y,
prefix == "b" ~b_y,
TRUE ~ NA_character_ ))
You can remove the columns you don't need anymore in a second step if you want. the code above results in the table:
a_x b_x a_y b_y prefix desired_x_value desired_y_values
1 1 8 k aa a 1 k
2 2 9 b bb b 9 bb
3 3 10 a cc c NA <NA>
4 4 11 d dd a 4 d
5 5 12 z ee a 5 z

You can write a helper function for this :
get_value <- function(data, prefix, group) {
data[cbind(1:nrow(data), match(paste(prefix, group, sep = '_'), names(data)))]
}
mydf %>%
mutate(desired_x_value = get_value(select(., ends_with('_x')), prefix, 'x'),
desired_y_value = get_value(select(., ends_with('_y')), prefix, 'y'))
# a_x b_x a_y b_y prefix desired_x_value desired_y_value
#1 1 8 k aa a 1 k
#2 2 9 b bb b 9 bb
#3 3 10 a cc c NA <NA>
#4 4 11 d dd a 4 d
#5 5 12 z ee a 5 z

A simple rowwise also works.
mydf %>% rowwise() %>%
mutate(desired_x = ifelse(any(str_detect(names(mydf)[-5], prefix)),
get(paste(prefix, 'x', sep = '_')), NA),
desired_y = ifelse(any(str_detect(names(mydf)[-5], prefix)),
get(paste(prefix, 'y', sep = '_')), NA))
# A tibble: 5 x 7
# Rowwise:
a_x b_x a_y b_y prefix desired_x desired_y
<dbl> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 1 8 k aa a 1 k
2 2 9 b bb b 9 bb
3 3 10 a cc c NA NA
4 4 11 d dd a 4 d
5 5 12 z ee a 5 z
If the prefixes don't contain any invalid column prefixes, this will do without ifelse statement.
mydf <- data.frame(a_x = c(1,2,3,4,5),
b_x = c(8,9,10,11,12),
a_y = c("k",'b','a','d','z'),
b_y = c('aa','bb','cc','dd','ee'),
prefix=c("a","b","a","a","a"))
mydf %>% rowwise() %>%
mutate(desired_x = get(paste(prefix, 'x', sep = '_')),
desired_y = get(paste(prefix, 'y', sep = '_')))
# A tibble: 5 x 7
# Rowwise:
a_x b_x a_y b_y prefix desired_x desired_y
<dbl> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 1 8 k aa a 1 k
2 2 9 b bb b 9 bb
3 3 10 a cc a 3 a
4 4 11 d dd a 4 d
5 5 12 z ee a 5 z

First I would like to say that I am not presenting this as a good solution as other proposed solutions are much better and simpler. However, since you have brought up get function, I wanted to show you how to make use of it to get your desired output. As a matter of fact some of the values in your prefix column such as c does not have a match among your column names and get function throws an error on terminating the execution, and unlike mget function it does not have a ifnotfound argument. So you need a way to go around that error message by means of an ifelse:
library(dplyr)
library(stringr)
library(tidyr)
library(purrr)
library(glue)
mydf1 %>%
mutate(desired_x_value = map(prefix, ~ ifelse(any(str_detect(names(mydf)[-5], .x)),
get(glue("{.x}_x")), NA)),
desired_y_value = map(prefix, ~ ifelse(any(str_detect(names(mydf)[-5], .x)),
get(glue("{.x}_y")), NA))) %>%
unnest(cols = c(desired_x_value, desired_y_value))
# A tibble: 5 x 7
a_x b_x a_y b_y prefix desired_x_value desired_y_value
<dbl> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 1 8 k aa a 1 k
2 2 9 b bb b 9 bb
3 3 10 a cc NA NA NA
4 4 11 d dd a 4 d
5 5 12 z ee a 5 z
You can also use paste function instead of glue and in case we already know the output types of the desired columns, we can spare the last line:
mydf1 %>%
mutate(desired_x_value = map_dbl(prefix, ~ ifelse(any(str_detect(names(mydf)[-5], .x)),
get(paste(.x, "x", sep = "_")), NA)),
desired_y_value = map_chr(prefix, ~ ifelse(any(str_detect(names(mydf)[-5], .x)),
get(paste(.x, "y", sep = "_")), NA)))
# A tibble: 5 x 7
# Rowwise:
a_x b_x a_y b_y prefix desired_x_value desired_y_value
<dbl> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 1 8 k aa a 1 k
2 2 9 b bb b 9 bb
3 3 10 a cc NA NA NA
4 4 11 d dd a 4 d
5 5 12 z ee a 5 z

Infill missing variables of a df from a list

I have missing categorical variables in a list. I would like to add all the combinations of these classifications to the data frame using complete. I can do this for a single variable using mutate.
Simplified example:
library(tidyverse)
df <- tibble(a1 = 1:6,
b1 = rep(c(1,2),3),
c1 = rep(c(1:3), 2))
missing_cols <- list(d1 = c(7:8),
e1 = c(12:14))
# Use the first classification of d1 for mutate and complete with all classifications
df %>%
mutate(!!names(missing_cols)[1] := missing_cols[[1]][1]) %>%
complete(nesting(a1, b1,c1), d1 = missing_cols[[1]])
Desired output
df %>%
mutate(!!names(missing_cols)[1] := missing_cols[[1]][1]) %>%
mutate(!!names(missing_cols)[2] := missing_cols[[2]][1]) %>%
complete(nesting(a1, b1,c1), d1 = missing_cols[[1]], e1 = missing_cols[[2]])
This will get the correct output for d1. How can I do this for all variables in my list?

We can use crossing with cross_df :
library(tidyr)
crossing(df, cross_df(missing_cols))
# a1 b1 c1 d1 e1
# <int> <dbl> <int> <int> <int>
# 1 1 1 1 7 12
# 2 1 1 1 7 13
# 3 1 1 1 7 14
# 4 1 1 1 8 12
# 5 1 1 1 8 13
# 6 1 1 1 8 14
# 7 2 2 2 7 12
# 8 2 2 2 7 13
# 9 2 2 2 7 14
#10 2 2 2 8 12
# … with 26 more rows
cross_df creates all possible combination of missing_cols while crossing takes that output and creates all possible combination with df.

Using expand.grid
library(tidyr)
crossing(df, expand.grid(missing_cols))

R calculating the sum of values according to condition

Here is a data frame:
ID<-c(rep("A",3),rep("B",2), rep("C",3),rep("D",5))
cell<-c("a1","a2","a3","a1","a2","a1","a2", "a3","a1","a2","a1","a2","a3")
value<-c(2,5,3,4,5,6,9,8,7,2,5,2,4)
df<-as.data.frame(cbind(ID, cell, value))
I want to calculate the sum of all values for each ID up to cell a2 (incl.). The sequence of cells and ID’s must be taken into account. If there isn’t any cell “a2” after calculating of the sum, this rows should not be taken into account.
As a result I would like to get this table:
Could You please help me to code this condition?
Thanks in advance.
Best regards, Inna

assuming the file is already correctly ordered by cell
library( tidyverse )
df %>%
group_by( ID ) %>%
mutate( value = cumsum( value ) ) %>%
filter( cell == "a2" )
# # A tibble: 5 x 3
# # Groups: ID [4]
# ID cell value
# <chr> <chr> <dbl>
# 1 A a2 7
# 2 B a2 9
# 3 C a2 15
# 4 D a2 9
# 5 D a2 16

Treating each occurrence of "a2" as different group we can do :
library(dplyr)
df %>%
#Create a group column with every value of cell == 'a2' as different group
group_by(ID, grp = cumsum(lag(cell == 'a2', default = TRUE))) %>%
#Remove those groups that do not have 'a2' in them
filter(any(cell == 'a2')) %>%
#Sum till 'a2' value
summarise(value = sum(value[seq_len(match('a2', cell))]),
cell = last(cell)) %>%
select(-grp)
# ID value cell
# <chr> <dbl> <chr>
#1 A 7 a2
#2 B 9 a2
#3 C 15 a2
#4 D 9 a2
#5 D 7 a2

A succinct solution using ave.
r <- transform(df, value=ave(value, ID, FUN=cumsum))[df$cell == "a2", ]
r
# ID cell value
# 2 A a2 7
# 5 B a2 9
# 7 C a2 15
# 10 D a2 9
# 12 D a2 16

An option with data.table
library(data.table)
setDT(df)[, value := cumsum(value) , ID][cell == 'a2']
-output
# ID cell value
#1: A a2 7
#2: B a2 9
#3: C a2 15
#4: D a2 9
#5: D a2 16