group_by() and summarise() by row

I have a data frame with several line IDs per time point and with -Inf values, and I would like to use the R packages dplyr and tidyverse to calculate the proportion of -Inf values per ID per time.
This is my data:
dt <- data.frame(id = rep(1:3, each = 4),
                 time = rep(1:4, times = 3),
                 x = c(1, 2, 1, -Inf, 2, -Inf, 1, 1, 5, 1, 2, 1),
                 y = c(2, -Inf, -Inf, -Inf, -Inf, 5, -Inf, 2, 1, 2, 2, 2))
In the real data I have more than 100 columns but to simplify I put only x and y.
The expected result:
  id time   n
2  1    2 0.5
3  1    3 0.5
4  1    4 1.0
5  2    1 0.5
6  2    2 0.5
7  2    3 0.5
The idea is to take a specific set of columns and compute a single summary value from them for each group. After searching I found the rowwise() function, but it did not help. Here are my attempts:
dt %>%
  group_by(id, time) %>%
  summarise(n = across(x:y, ~ mean(is.infinite(x) & x < 0, na.rm = TRUE)))

dt %>%
  group_by(id, time) %>%
  rowwise() %>%
  summarise(n = across(everything(), ~ mean(is.infinite(x) & x < 0, na.rm = TRUE)))

dt %>%
  rowwise() %>%
  summarise(n = across(everything(), ~ mean(is.infinite(x) & x < 0, na.rm = TRUE)))
# same results:
`summarise()` has grouped output by 'id'. You can override using the `.groups` argument.
# A tibble: 12 x 3
# Groups:   id [3]
      id  time   n$x    $y
   <int> <int> <dbl> <dbl>
 1     1     1     0     0
 2     1     2     0     0
 3     1     3     0     0
 4     1     4     1     1
 5     2     1     0     0
 6     2     2     1     1
 7     2     3     0     0
 8     2     4     0     0
 9     3     1     0     0
10     3     2     0     0
11     3     3     0     0
12     3     4     0     0
Could you help me generate this column n?

I think I understand better what you're aiming to do here. across() isn't needed (it's more for modifying columns in place). Either rowwise() or group_by() would work:
library(dplyr)

dt <- data.frame(id = rep(1:3, each = 4),
                 time = rep(1:4, times = 3),
                 x = c(1, 2, 1, -Inf, 2, -Inf, 1, 1, 5, 1, 2, 1),
                 y = c(2, -Inf, -Inf, -Inf, -Inf, 5, -Inf, 2, 1, 2, 2, 2))

dt %>%
  group_by(id, time) %>%
  summarise(n = mean(c(is.infinite(x), is.infinite(y)))) %>%
  filter(n != 0)
#> `summarise()` has grouped output by 'id'. You can override using the `.groups`
#> argument.
#> # A tibble: 6 × 3
#> # Groups:   id [2]
#>      id  time     n
#>   <int> <int> <dbl>
#> 1     1     2   0.5
#> 2     1     3   0.5
#> 3     1     4   1
#> 4     2     1   0.5
#> 5     2     2   0.5
#> 6     2     3   0.5
Here's a possible way of doing the calculation across any number of columns after grouping (by making a quick helper function that checks for negative infinite values):
library(dplyr)

dt <- data.frame(id = rep(1:3, each = 4),
                 time = rep(1:4, times = 3),
                 x = c(1, 2, 1, -Inf, 2, -Inf, 1, 1, 5, 1, 2, 1),
                 y = c(2, -Inf, -Inf, -Inf, -Inf, 5, -Inf, 2, 1, 2, 2, 2),
                 z = sample(c(1, 2, -Inf), 12, replace = TRUE)) # z is random, so your numbers will differ

is_minus_inf <- function(x) is.infinite(x) & x < 0

dt %>%
  group_by(id, time) %>%
  mutate(n = mean(is_minus_inf(c_across(everything()))))
#> # A tibble: 12 × 6
#> # Groups:   id, time [12]
#>       id  time     x     y     z     n
#>    <int> <int> <dbl> <dbl> <dbl> <dbl>
#>  1     1     1     1     2     2 0
#>  2     1     2     2  -Inf  -Inf 0.667
#>  3     1     3     1  -Inf     2 0.333
#>  4     1     4  -Inf  -Inf     1 0.667
#>  5     2     1     2  -Inf     1 0.333
#>  6     2     2  -Inf     5     2 0.333
#>  7     2     3     1  -Inf  -Inf 0.667
#>  8     2     4     1     2     2 0
#>  9     3     1     5     1     1 0
#> 10     3     2     1     2     1 0
#> 11     3     3     2     2     2 0
#> 12     3     4     1     2  -Inf 0.333
(Or even simpler: use mutate(n = mean(c_across(everything()) == -Inf, na.rm = TRUE)) and no checking function is needed.)
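For reference, here is a sketch of that simpler variant spelled out as a full pipeline (using the same dt as in the block above):

library(dplyr)

# Compare directly against -Inf; no helper function required
dt %>%
  group_by(id, time) %>%
  mutate(n = mean(c_across(everything()) == -Inf, na.rm = TRUE)) %>%
  ungroup()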

How about this solution? It produces the desired output and is scalable.
First I "melt" the columns x and y into long format and then just summarise over id and time:
dt %>%
  reshape2::melt(id.vars = c("id", "time")) %>%
  group_by(id, time) %>%
  summarise(count_neg_inf = mean(value == -Inf, na.rm = TRUE))
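A side note on tooling: reshape2 is superseded these days, and the same melt-and-summarise idea can be written with tidyr (a sketch that should behave the same on the data above):

library(dplyr)
library(tidyr)

# Reshape all non-id columns into one long "value" column, then summarise per group
dt %>%
  pivot_longer(-c(id, time), names_to = "variable", values_to = "value") %>%
  group_by(id, time) %>%
  summarise(count_neg_inf = mean(value == -Inf, na.rm = TRUE), .groups = "drop")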

Related

R dataframe with special cumsum

I have a dataframe like this:
df <- data.frame(grp = c(rep("a", 5), rep("b", 5)),
                 t = c(1:5, 1:5),
                 value = c(-1, 5, 9, -15, 6, 5, 1, 7, -11, 9))

# Limits for desired cumulative sum (CumSum)
maxCumSum <- 8
minCumSum <- 0
What I would like to calculate is a cumulative sum of value by group (grp), kept within the limits maxCumSum and minCumSum. The resulting table should look something like this:
grp  t value CumSum
a    1    -1      0
a    2     5      5
a    3     9      8
a    4   -15      0
a    5     6      6
b    1     5      5
b    2     1      6
b    3     7      8
b    4   -11      0
b    5     9      8
Think of CumSum as a water reservoir that has a certain maximum capacity and whose level cannot sink below zero.
The normal cumsum() obviously does not do the trick, since it respects no maximum or minimum. Does anyone have a suggestion for how to achieve this? In the real dataframe there are of course more than 2 groups and far more than 5 time steps.
Many thanks!
What you can do is create a function that clamps each new running total between the minimum and maximum values, and fold it over the values with Reduce():
library(dplyr)

df <- data.frame(grp = c(rep("a", 5), rep("b", 5)),
                 t = c(1:5, 1:5),
                 value = c(-1, 5, 9, -15, 6, 5, 1, 7, -11, 9))

maxCumSum <- 8
minCumSum <- 0

f <- function(x, y) max(min(x + y, maxCumSum), minCumSum)

df %>%
  group_by(grp) %>%
  mutate(CumSum = Reduce(f, value, 0, accumulate = TRUE)[-1])
#> # A tibble: 10 × 4
#> # Groups:   grp [2]
#>    grp       t value CumSum
#>    <chr> <int> <dbl>  <dbl>
#>  1 a         1    -1      0
#>  2 a         2     5      5
#>  3 a         3     9      8
#>  4 a         4   -15      0
#>  5 a         5     6      6
#>  6 b         1     5      5
#>  7 b         2     1      6
#>  8 b         3     7      8
#>  9 b         4   -11      0
#> 10 b         5     9      8
Created on 2022-07-04 by the reprex package (v2.0.1)
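If you prefer purrr, the same fold can be written with accumulate() (a sketch reusing f, maxCumSum and minCumSum from above):

library(dplyr)
library(purrr)

# .init = 0 seeds the running total; [-1] drops that seed from the result
df %>%
  group_by(grp) %>%
  mutate(CumSum = accumulate(value, f, .init = 0)[-1]) %>%
  ungroup()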

Can't add rows to grouped data frames

This is a follow-up to How to add a row to a dataframe modifying only some columns.
After solving that question I wanted to apply the solution provided by stefan to a larger dataframe with group_by():
My dataframe:
df <- structure(list(test_id = c(1, 1, 1, 1, 1, 1, 1, 1),
                     test_nr = c(1, 1, 1, 1, 2, 2, 2, 2),
                     region = c("A", "B", "C", "D", "A", "B", "C", "D"),
                     test_value = c(3, 1, 1, 2, 4, 2, 4, 1)),
                class = "data.frame", row.names = c(NA, -8L))
  test_id test_nr region test_value
1       1       1      A          3
2       1       1      B          1
3       1       1      C          1
4       1       1      D          2
5       1       2      A          4
6       1       2      B          2
7       1       2      C          4
8       1       2      D          1
I now want to add a new row to each group with this code, but it gives an error:
df %>%
  group_by(test_nr) %>%
  add_row(test_id = .$test_id[1], test_nr = .$test_nr[1],
          region = "mean", test_value = mean(.$test_value))
Error: Can't add rows to grouped data frames.
Run `rlang::last_error()` to see where the error occurred.
My expected output would be:
   test_id test_nr region test_value
1        1       1      A       3.00
2        1       1      B       1.00
3        1       1      C       1.00
4        1       1      D       2.00
5        1       1   MEAN       1.75
6        1       2      A       4.00
7        1       2      B       2.00
8        1       2      C       4.00
9        1       2      D       1.00
10       1       2   MEAN       2.75
I have tried so far:
library(tidyverse)

df %>%
  group_by(test_nr) %>%
  group_split() %>%
  map_dfr(~ .x %>% add_row(!!!map(.[4], mean)))

   test_id test_nr region test_value
     <dbl>   <dbl> <chr>       <dbl>
 1       1       1 A               3
 2       1       1 B               1
 3       1       1 C               1
 4       1       1 D               2
 5      NA      NA NA           1.75
 6       1       2 A               4
 7       1       2 B               2
 8       1       2 C               4
 9       1       2 D               1
10      NA      NA NA           2.75
How could I fill columns 1 to 3 of the added rows with the correct values?
I actually recently made a little helper function for exactly this. The idea is to use group_modify() to take each group's data and bind_rows() the summary statistics calculated with summarise(). This is what it looks like in code:
add_summary_rows <- function(.data, ...) {
  group_modify(.data, function(x, y) bind_rows(x, summarise(x, ...)))
}
And here’s how that would work with your data:
library(dplyr, warn.conflicts = FALSE)

df <- data.frame(
  test_id = c(1, 1, 1, 1, 1, 1, 1, 1),
  test_nr = c(1, 1, 1, 1, 2, 2, 2, 2),
  region = c("A", "B", "C", "D", "A", "B", "C", "D"),
  test_value = c(3, 1, 1, 2, 4, 2, 4, 1)
)

df %>%
  group_by(test_id, test_nr) %>%
  add_summary_rows(
    region = "MEAN",
    test_value = mean(test_value)
  )
#> # A tibble: 10 x 4
#> # Groups:   test_id, test_nr [2]
#>    test_id test_nr region test_value
#>      <dbl>   <dbl> <chr>       <dbl>
#>  1       1       1 A            3
#>  2       1       1 B            1
#>  3       1       1 C            1
#>  4       1       1 D            2
#>  5       1       1 MEAN         1.75
#>  6       1       2 A            4
#>  7       1       2 B            2
#>  8       1       2 C            4
#>  9       1       2 D            1
#> 10       1       2 MEAN         2.75
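Note that group_modify() hands each group's rows to the function without the grouping columns and then prepends the group keys to whatever it returns, which is why the MEAN rows come back with proper test_id/test_nr values here rather than the NAs from the group_split()/add_row() attempt.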
You can combine your two approaches:
df %>%
  split(~test_nr) %>%
  map_dfr(~ .x %>%
            add_row(test_id = .$test_id[1],
                    test_nr = .$test_nr[1],
                    region = "mean",
                    test_value = mean(.$test_value)))
(Note: the formula interface split(~test_nr) requires R >= 4.1; on older versions use split(.$test_nr) instead.)
You could achieve your target with this base R one-liner:
merge(df, aggregate(df, by = list(df$test_nr), FUN = mean), all = TRUE)[, 1:4]
aggregate() provides the summary rows you need, and merge() inserts them into the right places of your dataframe. You don't need the last column of the combined dataframe, so keep only the first four. The code produces some warnings for the region column, which can be disregarded; note that the function name (MEAN) is not displayed in the region column.
Making it a little more generic:
f <- "mean"
df1 <- merge(df, aggregate(df, by = list(df$test_id, df$test_nr), FUN = f),
             all = TRUE)[, 1:4]
df1$region[is.na(df1$region)] <- toupper(f)
Here you aggregate by test_id as well, you can change the aggregating function in one place, and its name is printed in the region column:
> df1
   test_id test_nr region test_value
1        1       1      A       3.00
2        1       1      B       1.00
3        1       1      C       1.00
4        1       1      D       2.00
5        1       1   MEAN       1.75
6        1       2      A       4.00
7        1       2      B       2.00
8        1       2      C       4.00
9        1       2      D       1.00
10       1       2   MEAN       2.75
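If the warnings bother you, here is a sketch of a warning-free variant that aggregates only the numeric column and then stacks the summary rows underneath (it relies on "MEAN" sorting after the single-letter regions):

agg <- aggregate(test_value ~ test_id + test_nr, data = df, FUN = mean)
agg$region <- "MEAN"
df1 <- rbind(df, agg[names(df)])  # reorder agg's columns to match df before stacking
df1[order(df1$test_id, df1$test_nr, df1$region), ]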

Ratio of largest value per row in dataframe in R

I have a dataframe somewhat similar to the one below (df). I need to add a new column holding the ratio of the largest value in each row (the largest value in the row divided by the sum of all values in that row). The output should look similar to df1.
df <- data.frame('x' = c(1, 4, 1, 4, 1),
                 'y' = c(4, 6, 5, 2, 3),
                 'z' = c(5, 3, 2, 3, 2))
df1 <- data.frame('x' = c(1, 4, 1, 4, 1),
                  'y' = c(4, 6, 5, 2, 3),
                  'z' = c(5, 3, 2, 3, 2),
                  'ratio' = c(0.5, 0.462, 0.625, 0.444, 0.5))
Thank you!
Here is a solution using dplyr:
df %>%
  rowwise() %>%
  mutate(max_value = max(x, y, z),
         sum_values = sum(x, y, z),
         ratio = max_value / sum_values) #%>%
  #select(-max_value, -sum_values) # uncomment this line if you want df1 exactly as in your question
# A tibble: 5 x 6
      x     y     z max_value sum_values ratio
  <dbl> <dbl> <dbl>     <dbl>      <dbl> <dbl>
1     1     4     5         5         10 0.5
2     4     6     3         6         13 0.462
3     1     5     2         5          8 0.625
4     4     2     3         4          9 0.444
5     1     3     2         3          6 0.5
library(tidyverse)

df %>%
  rowwise() %>%
  mutate(MAX = max(x, y, z, na.rm = TRUE),
         SUM = sum(x, y, z, na.rm = TRUE),
         ratio = MAX / SUM)

# A tibble: 5 x 6
      x     y     z   MAX   SUM ratio
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     4     5     5    10 0.5
2     4     6     3     6    13 0.462
3     1     5     2     5     8 0.625
4     4     2     3     4     9 0.444
5     1     3     2     3     6 0.5
Another option with rowSums and pmax:
library(dplyr)
library(purrr)

df %>%
  mutate(ratio = reduce(., pmax) / rowSums(.))

#  x y z     ratio
#1 1 4 5 0.5000000
#2 4 6 3 0.4615385
#3 1 5 2 0.6250000
#4 4 2 3 0.4444444
#5 1 3 2 0.5000000
Or in base R
df$ratio <- do.call(pmax, df)/rowSums(df)
Additional solution:
df$ratio <- apply(df, 1, function(x) max(x, na.rm = TRUE) / sum(x, na.rm = TRUE))
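If the real data has many more columns than x, y and z, a sketch with c_across() avoids hard-coding the names (assuming every column in the x:z range should enter the ratio):

library(dplyr)

df %>%
  rowwise() %>%
  mutate(ratio = max(c_across(x:z), na.rm = TRUE) /
                 sum(c_across(x:z), na.rm = TRUE)) %>%
  ungroup()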

Different output between sum and +

I'm working on a problem that basically consists of summing all rows by their ID and summing some specific variables to get a consolidated dataset to feed into other work, but there is an issue with the sum() function, and I'd appreciate some explanation of this behaviour.
Dataset:
teste <- data.frame(ID = c(1, 1, 2, 1, 3, 3, 2),
                    VALUE = c(10, 10, 10, 10, 10, 10, 10),
                    MOD = c(1, 1, 1, 1, 1, 1, 1))

  ID VALUE MOD
1  1    10   1
2  1    10   1
3  2    10   1
4  1    10   1
5  3    10   1
6  3    10   1
7  2    10   1
Using the + operator:
teste %>%
  group_by(ID) %>%
  summarise_all(sum, na.rm = TRUE) %>%
  mutate(CONS = VALUE + MOD)

# A tibble: 3 x 4
     ID VALUE   MOD  CONS
  <dbl> <dbl> <dbl> <dbl>
1     1    30     3    33
2     2    20     2    22
3     3    20     2    22
Using the sum() function:
teste %>%
  group_by(ID) %>%
  summarise_all(sum, na.rm = TRUE) %>%
  mutate(CONS = sum(VALUE, MOD))

# A tibble: 3 x 4
     ID VALUE   MOD  CONS
  <dbl> <dbl> <dbl> <dbl>
1     1    30     3    77
2     2    20     2    77
3     3    20     2    77
summarise_all() removes one level of grouping, so its result is no longer grouped. The + operator is vectorized and works row by row, but sum(VALUE, MOD) collapses all its arguments into one grand total (30 + 20 + 20 + 3 + 2 + 2 = 77), which is then recycled down the whole column. To make sum() work per row, re-group first:
teste %>%
  group_by(ID) %>%
  summarise_all(sum, na.rm = TRUE) %>%
  group_by(ID) %>% # <--------------------------
  mutate(CONS = sum(VALUE, MOD)) %>%
  ungroup()
giving:
# A tibble: 3 x 4
     ID VALUE   MOD  CONS
  <dbl> <dbl> <dbl> <dbl>
1     1    30     3    33
2     2    20     2    22
3     3    20     2    22
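Alternatively, since the problem is only that sum() collapses the whole column, a sketch using rowSums() sidesteps the re-grouping entirely (assumes dplyr >= 1.0 for across()):

library(dplyr)

teste %>%
  group_by(ID) %>%
  summarise_all(sum, na.rm = TRUE) %>%
  mutate(CONS = rowSums(across(c(VALUE, MOD))))  # rowSums() works row-wise by construction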

How to apply a function to mutate a specific combination of columns? (purrr:: use preferred)

Suppose I have the following data:
data = tibble::tribble(
  ~id, ~year_1, ~year_2, ~cod_1, ~cod_2, ~cod_3, ~cod_4, ~var_x,
    1,       0,       1,      5,      5,      3,      6,    "x",
    1,       0,       1,      3,      6,     14,      5,    "x",
    1,       0,       1,      2,      8,      5,      4,    "x",
    2,       1,       0,     10,      8,      2,      3,    "x",
    2,       1,       0,      3,      9,      1,      2,    "x",
    2,       1,       0,      1,     12,      0,      1,    "x"
)
I'd like to create all possible products of the combinations of the "year_" columns with the "cod_" columns. I mean something like this:
data.new = data %>%
  mutate(year_1_cod_1 = year_1 * cod_1) %>%
  mutate(year_1_cod_2 = year_1 * cod_2) %>%
  mutate(year_1_cod_3 = year_1 * cod_3) %>%
  mutate(year_1_cod_4 = year_1 * cod_4) %>%
  mutate(year_2_cod_1 = year_2 * cod_1) %>%
  mutate(year_2_cod_2 = year_2 * cod_2) %>%
  mutate(year_2_cod_3 = year_2 * cod_3) %>%
  mutate(year_2_cod_4 = year_2 * cod_4)
I can get all the possible combinations using:
year.var = colnames(data[, grepl("year", names(data))])
cod.var = colnames(data[, grepl("cod", names(data))])
com = crossing(year.var, cod.var)
> com
# A tibble: 8 x 2
  year.var cod.var
  <chr>    <chr>
1 year_1   cod_1
2 year_1   cod_2
3 year_1   cod_3
4 year_1   cod_4
5 year_2   cod_1
6 year_2   cod_2
7 year_2   cod_3
8 year_2   cod_4
I could use a for loop over the com data frame to create each new column, but I'd like to do this inside the dplyr environment. I think I can use purrr to mutate over all the combinations, but I am not sure how.
In fact, in my real data I have more than 1k possible combinations (i.e. more than 1k new variables to create).
You could use map2() to loop over the combinations in com, use transmute() to create the new columns by multiplying the paired columns via non-standard evaluation, and finally bind the result to the original dataframe:
library(dplyr)
library(purrr)

data %>%
  bind_cols(map2_dfc(com$year.var, com$cod.var,
                     ~ data %>% transmute(!!paste(.x, .y, sep = "_") := !!sym(.x) * !!sym(.y))))
# A tibble: 6 x 16
#     id year_1 year_2 cod_1 cod_2 cod_3 cod_4 var_x year_1_cod_1 year_1_cod_2
#  <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <chr>        <dbl>        <dbl>
#1     1      0      1     5     5     3     6 x                0            0
#2     1      0      1     3     6    14     5 x                0            0
#3     1      0      1     2     8     5     4 x                0            0
#4     2      1      0    10     8     2     3 x               10            8
#5     2      1      0     3     9     1     2 x                3            9
#6     2      1      0     1    12     0     1 x                1           12
# … with 6 more variables: year_1_cod_3 <dbl>, year_1_cod_4 <dbl>,
#   year_2_cod_1 <dbl>, year_2_cod_2 <dbl>, year_2_cod_3 <dbl>,
#   year_2_cod_4 <dbl>
library(dplyr)
library(tidyr)

data %>%
  pivot_longer(starts_with("year"), names_to = "year", values_to = "year_val") %>%
  pivot_longer(starts_with("cod"), names_to = "cod", values_to = "cod_val") %>%
  mutate(year_cod = paste(year, cod, sep = "_"),
         val = year_val * cod_val) %>%
  pivot_wider(
    id_cols = c(id, var_x),
    names_from = year_cod,
    values_from = val,
    values_fn = list(val = list)
  ) %>%
  unnest(cols = c(-id, -var_x))
#> # A tibble: 6 x 10
#>      id var_x year_1_cod_1 year_1_cod_2 year_1_cod_3 year_1_cod_4 year_2_cod_1
#>   <dbl> <chr>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>
#> 1     1 x                0            0            0            0            5
#> 2     1 x                0            0            0            0            3
#> 3     1 x                0            0            0            0            2
#> 4     2 x               10            8            2            3            0
#> 5     2 x                3            9            1            2            0
#> 6     2 x                1           12            0            1            0
#> # … with 3 more variables: year_2_cod_2 <dbl>, year_2_cod_3 <dbl>,
#> #   year_2_cod_4 <dbl>
Created on 2020-02-26 by the reprex package (v0.3.0)
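For what it's worth, with over 1k combinations a plain base R double loop over the two sets of column names is also easy to reason about (a sketch using the same year_/cod_ prefixes):

year_cols <- grep("^year_", names(data), value = TRUE)
cod_cols  <- grep("^cod_",  names(data), value = TRUE)
for (y in year_cols) {
  for (cc in cod_cols) {
    # One new column per (year, cod) pair, named e.g. year_1_cod_1
    data[[paste(y, cc, sep = "_")]] <- data[[y]] * data[[cc]]
  }
}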
