How to call columns implicitly in certain R functions - r

There are some functions in R where you have to call columns explicitly by name, such as pmin. My question is how to get around this, preferably using tidyverse.
Here's some sample data.
library(tidyverse)
df <- tibble(a = c(1:5),
b = c(6:10),
d = c(11:15),
e = c(16:20))
# A tibble: 5 x 4
a b d e
<int> <int> <int> <int>
1 1 6 11 16
2 2 7 12 17
3 3 8 13 18
4 4 9 14 19
5 5 10 15 20
Now I'd like to find the minimum of all the columns except for "e". I can do this:
df %>%
mutate(min = pmin(a, b, d))
# A tibble: 5 x 5
a b d e min
<int> <int> <int> <int> <int>
1 1 6 11 16 1
2 2 7 12 17 2
3 3 8 13 18 3
4 4 9 14 19 4
5 5 10 15 20 5
But what if I have many columns and would like to call every column except "e" without having to type out each column's name? I've made several attempts but none successful. I used the column index in my examples but I'd prefer excluding "e" by name. See below.
df %>%
mutate(min = pmin(-e))
df %>%
mutate(min = pmin(names(. %>% select(.))[-4]))
df %>%
mutate(min = pmin(names(.)[-4]))
df %>%
mutate(min = pmin(noquote(paste0(names(.)[-4], collapse = ","))))
df %>%
mutate(min = pmin(!!ensyms(names(.)[-4])))
None of these worked and I'm a bit at a loss.

One option using dplyr and purrr could be:
df %>%
mutate(min = exec(pmin, !!!select(., -e)))
a b d e min
<int> <int> <int> <int> <int>
1 1 6 11 16 1
2 2 7 12 17 2
3 3 8 13 18 3
4 4 9 14 19 4
5 5 10 15 20 5
For those not reading the comments, a nice option proposed by @IceCreamToucan and involving only dplyr could be:
df %>%
mutate(min = do.call(pmin, select(., -e)))
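The reason this works is that a data frame is a list of columns, so do.call() hands each selected column to pmin() as a separate argument. A minimal sketch, assuming the sample df from the question (the names cols and row_min are only for illustration):

```r
library(dplyr)

# Sample data from the question
df <- tibble(a = 1:5, b = 6:10, d = 11:15, e = 16:20)

# Drop "e" by name; what is left is a list of three columns
cols <- select(df, -e)

# Equivalent to calling pmin(a = df$a, b = df$b, d = df$d)
row_min <- do.call(pmin, cols)

# Compare with the hand-written call from the question
same_as_manual <- all(row_min == pmin(df$a, df$b, df$d))
```

Because the exclusion happens in select(), any tidyselect helper (for example -e, or where(is.numeric)) can be used to pick the columns.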

We can also use reduce with pmin
library(dplyr)
library(purrr)
df %>%
mutate(min = select(., -e) %>%
reduce(pmin))
# A tibble: 5 x 5
# a b d e min
# <int> <int> <int> <int> <int>
#1 1 6 11 16 1
#2 2 7 12 17 2
#3 3 8 13 18 3
#4 4 9 14 19 4
#5 5 10 15 20 5
Or with syms and !!!. Note that the en- prefixed variant (ensyms) is meant for use inside a function, where it captures the function's arguments.
df %>%
mutate(min = pmin(!!! syms(names(.)[-4])))
# A tibble: 5 x 5
# a b d e min
# <int> <int> <int> <int> <int>
#1 1 6 11 16 1
#2 2 7 12 17 2
#3 3 8 13 18 3
#4 4 9 14 19 4
#5 5 10 15 20 5
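Since the question prefers excluding "e" by name rather than by position, the same idea can be sketched with setdiff() on the column names. This assumes the sample df from the question; keep and res are illustrative names:

```r
library(dplyr)
library(rlang)

# Sample data from the question
df <- tibble(a = 1:5, b = 6:10, d = 11:15, e = 16:20)

# Column names to keep, excluding "e" by name instead of by index
keep <- setdiff(names(df), "e")

# syms() turns the names into symbols; !!! splices them in as
# separate arguments, so the call becomes pmin(a, b, d)
res <- df %>%
  mutate(min = pmin(!!!syms(keep)))
```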

I would do this by reshaping long, calculating the min by group, and then reshaping back to wide:
df %>%
rowid_to_column() %>%
pivot_longer(cols = -c(e, rowid)) %>%
group_by(rowid) %>%
mutate(min = min(value)) %>%
ungroup() %>%
pivot_wider() %>%
select(-rowid, -min, min)

Related

Group by the position and not by the name of a column in a dataframe

In the code below I group_by() the Pop column. How can I modify the code so that it does not use Pop by name but groups by the first column of the dataset instead?
library(janitor)
library(dplyr)
Pop<-c("p","p","p","p","p","p","p","p")
RC<-c("a","b","c","c","d","d","f","f")
MM<-c(10,9,8,27,9,10,23,42)
UT<-c(5,3,6,14,7,9,45,61)
df<-data.frame(Pop,RC,MM,UT)
tor <- df %>% group_by(Pop) %>% group_modify(~ .x %>% adorn_totals()) %>% ungroup
tor <- df %>% group_by(df[,1]) %>% group_modify(~ .x %>% adorn_totals()) %>% ungroup
You would want to use across within group_by:
library(dplyr)
df |>
group_by(across(1))
Output:
# A tibble: 8 × 4
# Groups: Pop [1]
Pop RC MM UT
<chr> <chr> <dbl> <dbl>
1 p a 10 5
2 p b 9 3
3 p c 8 6
4 p c 27 14
5 p d 9 7
6 p d 10 9
7 p f 23 45
8 p f 42 61
If it is a single column to group, we may also use .data with the first column name
library(dplyr)
df %>%
group_by(.data[[first(names(.))]])
-output
# A tibble: 8 × 4
# Groups: Pop [1]
Pop RC MM UT
<chr> <chr> <dbl> <dbl>
1 p a 10 5
2 p b 9 3
3 p c 8 6
4 p c 27 14
5 p d 9 7
6 p d 10 9
7 p f 23 45
8 p f 42 61
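To check that grouping by position really resolves to the first column, a small sketch along these lines may help. It uses the sample data from the question; by_pos and totals are illustrative names, and summarise() stands in for the adorn_totals() step:

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  Pop = rep("p", 8),
  RC  = c("a", "b", "c", "c", "d", "d", "f", "f"),
  MM  = c(10, 9, 8, 27, 9, 10, 23, 42),
  UT  = c(5, 3, 6, 14, 7, 9, 45, 61)
)

# Group by whatever column sits in position 1
by_pos <- df %>% group_by(across(1))

# The grouping variable is reported under its real name
grp_name <- group_vars(by_pos)

# Any grouped verb then works as usual, e.g. per-group totals
totals <- by_pos %>%
  summarise(MM = sum(MM), UT = sum(UT), .groups = "drop")
```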

Get mean column values every n rows grouped by value in other column

I have a dataframe df that looks like this
time object
1 1 A
2 2 A
3 3 A
4 4 A
5 5 A
6 6 A
7 7 B
8 8 B
9 9 B
10 10 B
11 11 B
12 12 C
13 13 C
14 14 C
15 15 C
16 16 C
17 17 C
18 18 C
I would like to get the mean of the time column every 3 rows, based on the object column
df_mean
time object
1 2 A
2 5 A
3 8 B
4 13 C
5 16 C
I thought about using dplyr
df%>%
mutate(grp = 1+ (row_number()-1) %/% 3) %>%
group_by(grp) %>%
summarise(across(c("time"), mean, na.rm = TRUE)) %>%
select(-grp)
but I do not know how to integrate the control for the object.
Another option would be to use aggregate
aggregate(.~object, data=df, mean)
but in this case I do not know how to get the mean every 3 rows.
Your dplyr attempt is on the right track. With a few modifications it will work.
library(dplyr)
df <- tibble(time = 1:18, object = rep(c('A', 'B', 'C'), each = 6))
df %>%
group_by(object, grp = (row_number()-1) %/% 3) %>%
summarise(across(time, mean, na.rm = T), .groups = 'drop') %>%
select(-grp)
#> # A tibble: 6 × 2
#> object time
#> <chr> <dbl>
#> 1 A 2
#> 2 A 5
#> 3 B 8
#> 4 B 11
#> 5 C 14
#> 6 C 17
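The sample data in this answer has six rows per object, so every chunk is complete. With the uneven counts from the question (6, 5 and 7 rows), the chunk counter probably needs to restart inside each object. A hedged sketch of that variant, where the column n is added only to show which chunks are incomplete:

```r
library(dplyr)

# Data shaped like the question: A has 6 rows, B has 5, C has 7
df <- tibble(time = 1:18,
             object = rep(c("A", "B", "C"), times = c(6, 5, 7)))

res <- df %>%
  group_by(object) %>%
  # row_number() restarts at 1 in each object, so chunks never
  # straddle two objects
  mutate(grp = (row_number() - 1) %/% 3) %>%
  group_by(object, grp) %>%
  summarise(time = mean(time), n = n(), .groups = "drop")
```

Rows of res with n below 3 are the incomplete chunks; whether to keep or drop them depends on the desired output.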
Here is an alternative dplyr way:
library(dplyr)
n = 3
df %>%
group_by(object, Col2 = rep(row_number(), each=n, length.out = n())) %>%
summarise(time = mean(time, na.rm = TRUE)) %>%
select(-Col2)
object time
<chr> <dbl>
1 A 2
2 A 5
3 B 8
4 B 11
5 C 14
6 C 17
You can do a slight modification, creating your grp variable within object group first, and then filtering where the size of the joint grouping is >=3, and then summarize:
df%>%
group_by(object) %>%
mutate(grp = 1+ (row_number()-1) %/% 3) %>%
group_by(object,grp) %>%
filter(n()>=3) %>%
summarize(time=mean(time), .groups="drop") %>%
select(-grp)
Output:
object time
<chr> <dbl>
1 A 2
2 A 5
3 B 8
4 C 13
5 C 16
data.table option:
library(data.table)
setDT(df)
df[, n3:=gl(.N, 3, length=.N), by=object]
df[, .(time=mean(time)), by=.(object, n3)][, !"n3"]
Output:
object time
1: A 2
2: A 5
3: B 8
4: B 11
5: C 14
6: C 17
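The gl() call is what builds the chunks of three here. A small base R sketch of what it produces for a single object with seven rows (n_rows, chunk and chunk_means are illustrative names):

```r
# Seven rows in one object, chunks of three
n_rows <- 7

# gl() repeats each level 3 times and cuts the result to n_rows values
chunk <- gl(n_rows, 3, length = n_rows)
chunk_id <- as.integer(chunk)

# Mean of time per chunk; droplevels() removes the unused levels
time <- 12:18
chunk_means <- tapply(time, droplevels(chunk), mean)
```

The last chunk holds a single row in this case, so its mean is just that row's value.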
in base R:
aggregate(time~., cbind(df, gr=gl(nrow(df),3, nrow(df))), mean)
object gr time
1 A 1 2
2 A 2 5
3 B 3 8
4 B 4 11
5 C 5 14
6 C 6 17

grouping to aggregate values, but tripping up on NA's

I have long data, and I am trying to make a new variable (consistent) that is the value for a given column (VALUE), for each person (ID), at TIME = 2. I used the code below to do this, but I am getting tripped up on NA's. If the VALUE for TIME = 2 is NA, then I want it to grab the VALUE at TIME = 1 instead. That part I'm not sure how to do. So, in the example below, I want the new variable (consistent) for ID B to be 10 instead of NA.
ID = c("A", "A", "B", "B", "C", "C", "D", "D")
TIME = c(1, 2, 1, 2, 1, 2, 1, 2)
VALUE = c(8, 9, 10, NA, 12, 13, 14, 9)
df = data.frame(ID, TIME, VALUE)
df <- df %>%
group_by(ID) %>%
mutate(consistent = VALUE[TIME == 2]) %>% ungroup
df
If we want to use the same code, then coalesce with the 'VALUE' where 'TIME' is 1 (assuming there is a single observation of 'TIME' for each 'ID')
library(dplyr)
df %>%
group_by(ID) %>%
mutate(consistent = coalesce(VALUE[TIME == 2], VALUE[TIME == 1])) %>%
ungroup
-output
# A tibble: 8 × 4
ID TIME VALUE consistent
<chr> <dbl> <dbl> <dbl>
1 A 1 8 9
2 A 2 9 9
3 B 1 10 10
4 B 2 NA 10
5 C 1 12 13
6 C 2 13 13
7 D 1 14 9
8 D 2 9 9
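For readers new to coalesce(): it works element-wise and returns the first non-missing value at each position. A minimal sketch, with the values for ID "B" copied from the sample data (the names filled and picked are illustrative):

```r
library(dplyr)

# Element-wise: the NA in the first vector is filled from the second
filled <- coalesce(c(9, NA, 13), c(8, 10, 12))

# Inside one ID group of the sample data (ID "B")
VALUE <- c(10, NA)
TIME  <- c(1, 2)

# VALUE at TIME 2 is NA, so the VALUE at TIME 1 is used instead
picked <- coalesce(VALUE[TIME == 2], VALUE[TIME == 1])
```

In the grouped mutate() the single value returned per ID is recycled to every row of that group.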
Or another option is to arrange before doing the group_by and get the first element of 'VALUE' (assuming no replicating for 'TIME')
df %>%
arrange(ID, is.na(VALUE), desc(TIME)) %>%
group_by(ID) %>%
mutate(consistent = first(VALUE)) %>%
ungroup
-output
# A tibble: 8 × 4
ID TIME VALUE consistent
<chr> <dbl> <dbl> <dbl>
1 A 2 9 9
2 A 1 8 9
3 B 1 10 10
4 B 2 NA 10
5 C 2 13 13
6 C 1 12 13
7 D 2 9 9
8 D 1 14 9
Another possible solution, using tidyr::fill:
library(tidyverse)
df %>%
group_by(ID) %>%
mutate(consistent = VALUE) %>% fill(consistent) %>% ungroup
#> # A tibble: 8 × 4
#> ID TIME VALUE consistent
#> <chr> <dbl> <dbl> <dbl>
#> 1 A 1 8 8
#> 2 A 2 9 9
#> 3 B 1 10 10
#> 4 B 2 NA 10
#> 5 C 1 12 12
#> 6 C 2 13 13
#> 7 D 1 14 14
#> 8 D 2 9 9
You can also use ifelse with your condition. If each group has only two members, one with TIME 1 and one with TIME 2, the lagged row is guaranteed to be the TIME 1 row.
df %>%
group_by(ID) %>%
arrange(TIME, .by_group=T) %>%
mutate(consistent=ifelse(is.na(VALUE)&TIME==2, lag(VALUE), VALUE)) %>%
ungroup()
# A tibble: 8 × 4
ID TIME VALUE consistent
<chr> <dbl> <dbl> <dbl>
1 A 1 8 8
2 A 2 9 9
3 B 1 10 10
4 B 2 NA 10
5 C 1 12 12
6 C 2 13 13
7 D 1 14 14
8 D 2 9 9

Combine: rowwise(), mutate(), across(), for multiple functions

This is somewhat related to this question:
In principle, I am trying to understand how rowwise operations with mutate across multiple columns work when applying more than one function (mean(), sum(), min(), etc.).
I have learned that across does this job and not c_across.
I have learned that mean() differs from min() in that mean() doesn't work on data frames, so the data has to be converted to a vector first, which can be done with unlist or as.matrix. I learned this from Ronak Shah here: Understanding rowwise() and c_across()
Now to my actual case: I was able to do this task, but I lose one column, d. How can I avoid losing column d in this setting?
My df:
df <- structure(list(a = 1:5, b = 6:10, c = 11:15, d = c("a", "b",
"c", "d", "e"), e = 1:5), row.names = c(NA, -5L), class = c("tbl_df",
"tbl", "data.frame"))
This does not work:
df %>%
rowwise() %>%
mutate(across(a:e),
avg = mean(unlist(cur_data()), na.rm = TRUE),
min = min(unlist(cur_data()), na.rm = TRUE),
max = max(unlist(cur_data()), na.rm = TRUE)
)
# Output:
a b c d e avg min max
<int> <int> <int> <chr> <int> <dbl> <chr> <chr>
1 1 6 11 a 1 NA 1 a
2 2 7 12 b 2 NA 12 b
3 3 8 13 c 3 NA 13 c
4 4 9 14 d 4 NA 14 d
5 5 10 15 e 5 NA 10 e
This works, but I lose column d:
df %>%
select(-d) %>%
rowwise() %>%
mutate(across(a:e),
avg = mean(unlist(cur_data()), na.rm = TRUE),
min = min(unlist(cur_data()), na.rm = TRUE),
max = max(unlist(cur_data()), na.rm = TRUE)
)
a b c e avg min max
<int> <int> <int> <int> <dbl> <dbl> <dbl>
1 1 6 11 1 4.75 1 11
2 2 7 12 2 5.75 2 12
3 3 8 13 3 6.75 3 13
4 4 9 14 4 7.75 4 14
5 5 10 15 5 8.75 5 15
Using pmap() from purrr might be preferable, since you need to select the data just once and you can use the select helpers:
df %>%
mutate(pmap_dfr(across(where(is.numeric)),
~ data.frame(max = max(c(...)),
min = min(c(...)),
avg = mean(c(...)))))
a b c d e max min avg
<int> <int> <int> <chr> <int> <int> <int> <dbl>
1 1 6 11 a 1 11 1 4.75
2 2 7 12 b 2 12 2 5.75
3 3 8 13 c 3 13 3 6.75
4 4 9 14 d 4 14 4 7.75
5 5 10 15 e 5 15 5 8.75
Or with the addition of tidyr:
df %>%
mutate(res = pmap(across(where(is.numeric)),
~ list(max = max(c(...)),
min = min(c(...)),
avg = mean(c(...))))) %>%
unnest_wider(res)
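To see what pmap() does with the selected columns, here is a stripped-down sketch outside of mutate(). It assumes only the numeric columns of the sample df, written out as a plain list; row_max and row_avg are illustrative names:

```r
library(purrr)

# The numeric columns of the sample data as a plain list
cols <- list(a = 1:5, b = 6:10, c = 11:15, e = 1:5)

# pmap calls the function once per row; the row's values arrive
# as `...`, so c(...) rebuilds the row as a vector
row_max <- pmap_int(cols, ~ max(c(...)))
row_avg <- pmap_dbl(cols, ~ mean(c(...)))
```

The pmap_int() and pmap_dbl() variants return plain vectors, which is enough when only one statistic per call is needed; the answers above return a data frame or list per row so that several statistics can be computed in one pass.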
Edit:
Best way out here
df %>%
rowwise() %>%
mutate(min = min(c_across(a:e & where(is.numeric)), na.rm = TRUE),
max = max(c_across(a:e & where(is.numeric)), na.rm = TRUE),
avg = mean(c_across(a:e & where(is.numeric)), na.rm = TRUE)
)
# A tibble: 5 x 8
# Rowwise:
a b c d e min max avg
<int> <int> <int> <chr> <int> <int> <int> <dbl>
1 1 6 11 a 1 1 11 4.75
2 2 7 12 b 2 2 12 5.75
3 3 8 13 c 3 3 13 6.75
4 4 9 14 d 4 4 14 7.75
5 5 10 15 e 5 5 15 8.75
Earlier Answer
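Your "works" version won't even work properly if you change the order of the outputs, because cur_data() then also picks up the newly created min and max columns when avg is computed. See: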
df %>%
select(-d) %>%
rowwise() %>%
mutate(across(a:e),
min = min(unlist(cur_data()), na.rm = TRUE),
max = max(unlist(cur_data()), na.rm = TRUE),
avg = mean(unlist(cur_data()), na.rm = TRUE)
)
# A tibble: 5 x 7
# Rowwise:
a b c e min max avg
<int> <int> <int> <int> <int> <int> <dbl>
1 1 6 11 1 1 11 5.17
2 2 7 12 2 2 12 6.17
3 3 8 13 3 3 13 7.17
4 4 9 14 4 4 14 8.17
5 5 10 15 5 5 15 9.17
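The shifted averages appear to come from min and max already being part of the row when avg is computed. The arithmetic for the first row can be checked by hand in base R (row1, polluted and intended are illustrative names):

```r
# First row of the sample data without column d
row1 <- c(a = 1, b = 6, c = 11, e = 1)

# If min and max are created first, they are included in the average
polluted <- mean(c(row1, min(row1), max(row1)))  # 31 / 6

# Average over the original four columns only
intended <- mean(row1)                           # 19 / 4
```

Rounded to three significant digits, polluted matches the 5.17 printed above, while intended is the 4.75 from the question.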
Therefore, it is advisable to do it like this:
df %>%
select(-d) %>%
rowwise() %>%
mutate(min = min(c_across(a:e), na.rm = TRUE),
max = max(c_across(a:e), na.rm = TRUE),
avg = mean(c_across(a:e), na.rm = TRUE)
)
# A tibble: 5 x 7
# Rowwise:
a b c e min max avg
<int> <int> <int> <int> <int> <int> <dbl>
1 1 6 11 1 1 11 4.75
2 2 7 12 2 2 12 5.75
3 3 8 13 3 3 13 6.75
4 4 9 14 4 4 14 7.75
5 5 10 15 5 5 15 8.75
One more alternative is
cols <- c('a', 'b', 'c', 'e')
df %>%
rowwise() %>%
mutate(min = min(c_across(all_of(cols)), na.rm = TRUE),
max = max(c_across(all_of(cols)), na.rm = TRUE),
avg = mean(c_across(all_of(cols)), na.rm = TRUE)
)
# A tibble: 5 x 8
# Rowwise:
a b c d e min max avg
<int> <int> <int> <chr> <int> <int> <int> <dbl>
1 1 6 11 a 1 1 11 4.75
2 2 7 12 b 2 2 12 5.75
3 3 8 13 c 3 3 13 6.75
4 4 9 14 d 4 4 14 7.75
5 5 10 15 e 5 5 15 8.75
Even the group_by approach suggested by @Sinh won't work properly in these cases.
Here is one method that preserves column d: temporarily move it into the row names attribute (column_to_rownames), apply the transformation in mutate, and then restore it as a column afterwards (rownames_to_column).
library(dplyr)
library(tibble)
library(purrr)
df %>%
column_to_rownames('d') %>%
mutate(max = reduce(., pmax), min = reduce(., pmin),
avg = rowMeans(.)) %>%
rownames_to_column('d')
# d a b c e max min avg
#1 a 1 6 11 1 11 1 4.75
#2 b 2 7 12 2 12 2 5.75
#3 c 3 8 13 3 13 3 6.75
#4 d 4 9 14 4 14 4 7.75
#5 e 5 10 15 5 15 5 8.75

Restructure data based on row numbers in R

I am having troubles restructuring the data as I need to.
My df looks like this:
id <- (1:20)
author <- c("A","A","A","A","A","B","B","B","A","A","A","B","B","B","B"
,"B","B","B","A","A")
df <- data.frame(id, author)
> print(df)
id author
1 1 A
2 2 A
3 3 A
4 4 A
5 5 A
6 6 B
7 7 B
8 8 B
9 9 A
10 10 A
11 11 A
12 12 B
13 13 B
14 14 B
15 15 B
16 16 B
17 17 B
18 18 B
19 19 A
20 20 A
And I'm trying to get a data structure where the columns are the authors and the rows indicate the first and last id values of each sequence of A or B values. So in this case the first row with author A is id = 1, and the last one of that series is id = 5, and so forth.
Something like this:
A <- c(1, 5, 9, 11, 19,20)
B <- c(6, 8, 12, 18, NA, NA)
df.desired <- data.frame(A, B)
print(df.desired)
A B
1 1 6
2 5 8
3 9 12
4 11 18
5 19 NA
6 20 NA
Any ideas?
Thanks a lot!
We can create groups using data.table rleid, select 1st and last row in each group and get data in wide format.
library(dplyr)
df %>%
group_by(grp = data.table::rleid(author)) %>%
slice(1L, n()) %>%
group_by(author) %>%
mutate(grp = row_number()) %>%
tidyr::pivot_wider(names_from = author, values_from = id) %>%
select(-grp)
# A tibble: 6 x 2
# A B
# <int> <int>
#1 1 6
#2 5 8
#3 9 12
#4 11 18
#5 19 NA
#6 20 NA
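The key step is numbering consecutive runs of the same author. For anyone who wants to see that step without data.table, a base R sketch using rle() follows. It assumes the sample data from the question; run, run_start, run_end and runs are illustrative names:

```r
# Sample data from the question
author <- c(rep("A", 5), rep("B", 3), rep("A", 3), rep("B", 7), rep("A", 2))
id <- seq_along(author)

# Run-length encoding: one entry per block of identical authors
r <- rle(author)

# Run id for every row, comparable to what rleid() returns
run <- rep(seq_along(r$lengths), r$lengths)

# First and last id of each run
run_start <- as.vector(tapply(id, run, min))
run_end   <- as.vector(tapply(id, run, max))

runs <- data.frame(author = r$values, first = run_start, last = run_end)
```

Each row of runs is one block of consecutive rows by the same author; reshaping it to one column per author gives the layout asked for in the question.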
For the updated request in comments we can do :
df %>%
group_by(grp = data.table::rleid(author)) %>%
slice(1L, n()) %>%
mutate(author = row_number()) %>%
tidyr::pivot_wider(names_from = author, values_from = id) %>%
ungroup %>%
select(-grp)
# A tibble: 5 x 2
# `1` `2`
# <int> <int>
#1 1 5
#2 6 8
#3 9 11
#4 12 18
#5 19 20
Here is a base R option
z <- rle(df$author)
lst <- split(df,findInterval(1:nrow(df),cumsum(z$lengths), left.open = TRUE))
u <- lapply(lst,function(v) range(v$id))
idx <- split(seq_along(z$values),z$values)
x <- lapply(idx,function(v) unlist(u[v],use.names = FALSE))
df.desired <- as.data.frame(lapply(x,`length<-`,max(lengths(x))))
which gives
> df.desired
A B
1 1 6
2 5 8
3 9 12
4 11 18
5 19 NA
6 20 NA
An option using data.table:
library(data.table)
dcast(
setDT(df)[, ri := rleid(author)][, id[c(1L, .N)], .(author, ri)],
rowid(author) ~ author, value.var="V1")
output:
author A B
1: 1 1 6
2: 2 5 8
3: 3 9 12
4: 4 11 18
5: 5 19 NA
6: 6 20 NA
If there is a possibility of an author having a single row, you will need unique(c(1L, .N))
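The effect of unique() here can be seen on plain vectors. In this sketch, ends is a hypothetical helper written only for illustration:

```r
# Return the first and last element, but only once if they coincide
ends <- function(x) x[unique(c(1L, length(x)))]

multi  <- ends(c(6, 7, 8))  # run of three rows: first and last
single <- ends(12)          # run of one row: a single value, not two
```

Without unique(), a one-row run would contribute its id twice and shift the later rows in the wide result.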
