Different output between sum and + - r

I'm working on a problem that basically consists of summing all the rows by ID and then summing some specific variables to get a consolidated dataset to feed into another task, but the sum function does not behave as I expected and I'd appreciate an explanation.
Dataset:
teste <- data.frame(ID = c(1, 1, 2, 1, 3, 3, 2),
VALUE = c(10, 10, 10, 10, 10, 10, 10),
MOD = c(1, 1, 1, 1, 1, 1, 1))
ID VALUE MOD
1 1 10 1
2 1 10 1
3 2 10 1
4 1 10 1
5 3 10 1
6 3 10 1
7 2 10 1
Using + operator:
teste %>%
group_by(ID) %>%
summarise_all(sum, na.rm = TRUE) %>%
mutate(CONS = VALUE + MOD)
# A tibble: 3 x 4
ID VALUE MOD CONS
<dbl> <dbl> <dbl> <dbl>
1 1 30 3 33
2 2 20 2 22
3 3 20 2 22
Using sum function:
teste %>%
group_by(ID) %>%
summarise_all(sum, na.rm = TRUE) %>%
mutate(CONS = sum(VALUE, MOD))
# A tibble: 3 x 4
ID VALUE MOD CONS
<dbl> <dbl> <dbl> <dbl>
1 1 30 3 77
2 2 20 2 77
3 3 20 2 77

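summarise_all removes one level of grouping, so the result is no longer grouped by ID. sum(VALUE, MOD) then adds up every value in both columns into a single number (77), which is recycled to every row. Re-group by ID so that sum is evaluated per group: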
teste %>%
group_by(ID) %>%
summarise_all(sum, na.rm = TRUE) %>%
group_by(ID) %>% # <--------------------------
mutate(CONS = sum(VALUE, MOD)) %>%
ungroup
giving:
# A tibble: 3 x 4
ID VALUE MOD CONS
<dbl> <dbl> <dbl> <dbl>
1 1 30 3 33
2 2 20 2 22
3 3 20 2 22
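To see where 77 comes from without dplyr, here is a minimal base R sketch. The two vectors are the summarised VALUE and MOD columns from the question, typed in by hand for illustration.

```r
# Summarised values from the question, typed in by hand (illustrative only)
VALUE <- c(30, 20, 20)
MOD   <- c(3, 2, 2)

total_all   <- sum(VALUE, MOD)             # one number: all elements of both vectors
per_row     <- VALUE + MOD                 # element-wise: one result per row
per_row_alt <- rowSums(cbind(VALUE, MOD))  # row-wise sum, same values as VALUE + MOD

print(total_all)  # 77
print(per_row)    # 33 22 22
```

In other words, sum collapses all of its arguments to a single value, while + is vectorised over rows, so + (or rowSums) is usually the simpler choice inside mutate.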

Related

How to find the next occurrence in a data.frame in R?

Assume we have an email dataset with a sender and a recipient in every row. We want to find the next occurrence in the dataset for which the sender and the recipient are interchanged. So if sender==x & recipient==y, we are looking for the next row that has sender==y & recipient==x. Subsequently, we want to calculate the difference between counts for those observations. See the column diff_count for the desired output.
# creating the data.frame
id = 1:10
sender = c(1, 2, 3, 2, 3, 1, 2, 1, 2, 3)
recipient = c(2, 1, 2, 3, 1, 2, 3, 3, 1, 1)
count = c(1, 4, 5, 7, 12, 17, 24, 31, 34, 41)
df <- data.frame(id, sender, recipient, count)
# output should look like this
df$diff_count <- c(3, 13, 2, NA, 19, 17, NA, 10, NA, NA)
If there are no more observations that satisfy the requirement, then we simply fill in NA. The solution should be relatively easy with the tidyverse, but I can't seem to work it out.
Another dplyr way, without a custom function but with self joins:
library(dplyr)
# df is the data frame created in the question
df %>%
left_join(df,
by = c("sender" = "recipient", "recipient" = "sender"),
suffix = c("", ".y")) %>%
filter(id < id.y) %>%
group_by(id) %>%
slice_min(id.y) %>%
ungroup() %>%
mutate(diff_count = count.y - count) %>%
right_join(df) %>%
select(-matches("\\.(y|x)")) %>%
arrange(id)
returns
Joining, by = c("id", "sender", "recipient", "count")
# A tibble: 10 x 5
id sender recipient count diff_count
<int> <dbl> <dbl> <dbl> <dbl>
1 1 1 2 1 3
2 2 2 1 4 13
3 3 3 2 5 2
4 4 2 3 7 NA
5 5 3 1 12 19
6 6 1 2 17 17
7 7 2 3 24 NA
8 8 1 3 31 10
9 9 2 1 34 NA
10 10 3 1 41 NA
There should be easier ways, but below is one way using a custom function in tidyverse style:
library(dplyr)
calc_diff <- function(df, send, recp, cnt) {
df %>%
slice_tail(n = nrow(df) - cur_group_rows()) %>%
filter(sender == send, recipient == recp) %>%
slice_head(n = 1) %>%
pull(count) %>%
{ifelse(length(.) == 0, NA, .)} %>%
`-`(., cnt)
}
df %>%
rowwise(id) %>%
mutate(diff_count = calc_diff(df,
send = recipient,
recp = sender,
cnt = count))
#> # A tibble: 10 x 5
#> # Rowwise: id
#> id sender recipient count diff_count
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 2 1 3
#> 2 2 2 1 4 13
#> 3 3 3 2 5 2
#> 4 4 2 3 7 NA
#> 5 5 3 1 12 19
#> 6 6 1 2 17 17
#> 7 7 2 3 24 NA
#> 8 8 1 3 31 10
#> 9 9 2 1 34 NA
#> 10 10 3 1 41 NA
Created on 2021-08-20 by the reprex package (v2.0.1)
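For comparison, the same lookup can be sketched in base R. This is only an illustration of the logic, assuming the rows are already ordered by id as in the question; the helper vector name diff_count_check is made up here.

```r
# Data from the question
df <- data.frame(
  id        = 1:10,
  sender    = c(1, 2, 3, 2, 3, 1, 2, 1, 2, 3),
  recipient = c(2, 1, 2, 3, 1, 2, 3, 3, 1, 1),
  count     = c(1, 4, 5, 7, 12, 17, 24, 31, 34, 41)
)

# For each row i, find the first later row with sender and recipient swapped
diff_count_check <- vapply(seq_len(nrow(df)), function(i) {
  swapped <- which(df$sender == df$recipient[i] & df$recipient == df$sender[i])
  later   <- swapped[swapped > i]
  if (length(later) == 0) NA_real_ else df$count[later[1]] - df$count[i]
}, numeric(1))

print(diff_count_check)
```

This scans the whole data frame once per row, so it is fine for small data but is not meant as an efficient solution.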

Dplyr Summarize: Combining values for certain groups

I have data on hospital admissions per patient. I am trying to add up the price of care for patients who were re-admitted to hospital within 5 days.
This is an example dataset:
(
dt <- data.frame(
id = c(1, 1, 2, 2, 3, 4),
admit_date = c(1, 9, 5, 9, 10, 20),
price = c(10, 20, 20, 30, 15, 16)
)
)
# id admit_date price
# 1 1 1 10
# 2 1 9 20
# 3 2 5 20
# 4 2 9 30
# 5 3 10 15
# 6 4 20 16
And this is what I have tried so far:
library(dplyr)
# 5-day readmission:
dt %>%
group_by(id) %>%
arrange(id, admit_date)%>%
mutate(
duration = admit_date - lag(admit_date),
readmit = ifelse(duration < 6, 1, 0)
) %>%
group_by(id, readmit) %>% # this is where i get stuck
summarize(sumprice = sum(price))
# # A tibble: 6 × 3
# # Groups: id [4]
# id readmit sumprice
# <dbl> <dbl> <dbl>
# 1 1 0 20
# 2 1 NA 10
# 3 2 1 30
# 4 2 NA 20
# 5 3 NA 15
# 6 4 NA 16
And this is what I would like to have:
# id sum_price
# 1 1 10
# 2 1 20
# 3 2 50
# 4 3 15
# 5 4 16
Within each patient, flag a visit with TRUE if the gap in days since the previous visit is greater than 5, and with FALSE otherwise. The first visit has no previous one, so lag's default is set to Inf: the difference is then -Inf, and -Inf > 5 is FALSE. Taking the cumulative sum of these flags labels the groups of visits that belong together. We finally summarize within each individual, using this cumsum as the grouping variable for by:
dt |>
group_by(id) |>
arrange(id, admit_date) |>
summarise(
sum_price = by(
price,
cumsum((admit_date - lag(admit_date, , Inf)) > 5),
sum
)
) |>
ungroup()
# # A tibble: 5 × 2
# id sum_price
# <dbl> <by>
# 1 1 10
# 2 1 20
# 3 2 50
# 4 3 15
# 5 4 16
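The cumsum labelling can also be sketched in base R with ave and aggregate. This is an illustrative variant, not the code above: the column name episode is made up here, and diff is used in place of lag.

```r
# Data from the question
dt <- data.frame(
  id         = c(1, 1, 2, 2, 3, 4),
  admit_date = c(1, 9, 5, 9, 10, 20),
  price      = c(10, 20, 20, 30, 15, 16)
)
dt <- dt[order(dt$id, dt$admit_date), ]

# Within each patient, start a new episode whenever the gap to the
# previous admission is more than 5 days (the first visit never starts one)
dt$episode <- ave(dt$admit_date, dt$id,
                  FUN = function(d) cumsum(c(FALSE, diff(d) > 5)))

# Sum the price within each patient and episode
res <- aggregate(price ~ id + episode, data = dt, FUN = sum)
res <- res[order(res$id, res$episode), ]
print(res)
```

Patient 1's visits are 8 days apart, so they fall into two episodes, while patient 2's visits are 4 days apart and are summed into one.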
You want (at most) one row per patient in the final data frame, so you should group on just id.
Then, for each patient, check whether that patient has any row with readmit == 1.
Finally, filter out of the summarized data frame any patient who wasn't readmitted.
Putting it all together, it might look like:
dt %>%
group_by(id) %>%
arrange(id, admit_date) %>%
mutate(duration = admit_date - lag(admit_date),
readmit = ifelse(duration < 6, 1, 0)) %>%
group_by(id) %>% # group by just 'id' to get one row per patient
summarize(sumprice = sum(price, na.rm = TRUE),
is_readmit = any(readmit == 1)) %>% # If patient has any 'readmit' rows, count the patient as a readmit patient
filter(is_readmit) %>% # Filter out any non-readmit patients
select(-is_readmit) # get rid of the `is_readmit` column
Which should result in:
# A tibble: 1 x 2
id sumprice
<dbl> <dbl>
1 2 50

Dplyr transformation based on string filtering and conditions

I would like to transform a messy dataset in R, but I am having trouble figuring out how to do so. Below are an example dataset and the result that I need to achieve:
dataset <- tribble(
~ID, ~DESC,
1, "3+1Â 81Â mÂ",
2, "2+1Â 90Â mÂ",
3, "3+KK 28Â mÂ",
4, "3+1 120 m (Mezone)")
dataset
dataset_tranformed <- tribble(
~ID, ~Rooms, ~Meters, ~Mezone, ~KK,
1, 4, 81,0, 0,
2, 3, 90,0,0,
3, 3, 28,0,1,
4, 4, 120,1, 0)
dataset_tranformed
The columns first need to be separated; however, using dataset %>% separate(DESC, c("size", "meters_squared", "Mezone"), sep = " ") does not work because (Mezone) is thrown away.
We can do this by extracting the components individually and evaluating the room expression (e.g. 3+1) to get its total:
library(dplyr)
library(stringr)
library(tidyr)
library(purrr) # needed for map_dbl
dataset %>%
mutate(Rooms = map_dbl(DESC, ~
str_extract(.x, "^\\d+\\+\\d*") %>%
str_replace("\\+$", "+0") %>%
rlang::parse_expr(.) %>%
eval ),
Meters = str_extract(DESC, "(?<=\\s)\\d+(?=Â)"),
Mezone = +(str_detect(DESC, "Mezone")),
KK = +(str_detect(DESC, "KK"))) %>%
select(-DESC)
# A tibble: 4 x 5
# ID Rooms Meters Mezone KK
# <dbl> <dbl> <chr> <int> <int>
#1 1 4 81 0 0
#2 2 3 90 0 0
#3 3 3 28 0 1
#4 4 4 120 1 0
Another option is to use extract and then str_detect:
dataset %>%
extract(DESC, into = c("Rooms1", "Rooms2", "Meters"),
"^(\\d+)\\+(\\d*)[^0-9]+(\\d+)", convert = TRUE, remove = FALSE) %>%
transmute(ID, Mezone = +(str_detect(DESC, "Mezone")),
KK = +(is.na(Rooms2)), Rooms = Rooms1 + replace_na(Rooms2, 0), Meters )
# A tibble: 4 x 5
# ID Mezone KK Rooms Meters
# <dbl> <int> <int> <dbl> <int>
#1 1 0 0 4 81
#2 2 0 0 3 90
#3 3 0 1 3 28
#4 4 1 0 4 120
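The capture-group idea behind extract can be illustrated in base R with regexec and regmatches. This is a sketch under an assumption: the strings below are simplified, plain-ASCII stand-ins for the DESC values, not the original data, and the names desc and result are made up for the example.

```r
# Simplified ASCII stand-ins for the DESC column (assumption, not the original data)
desc <- c("3+1 81 m", "2+1 90 m", "3+KK 28 m", "3+1 120 m (Mezone)")

# Groups: rooms before "+", optional digits after "+", then the first later number
pattern <- "^([0-9]+)\\+([0-9]*)[^0-9]+([0-9]+)"
m <- do.call(rbind, regmatches(desc, regexec(pattern, desc)))

# An empty second group (the "KK" case) contributes 0 extra rooms
extra <- m[, 3]
extra[extra == ""] <- "0"

result <- data.frame(
  ID     = seq_along(desc),
  Rooms  = as.numeric(m[, 2]) + as.numeric(extra),
  Meters = as.numeric(m[, 4]),
  Mezone = as.integer(grepl("Mezone", desc)),
  KK     = as.integer(grepl("KK", desc))
)
print(result)
```

Column 1 of m holds the full match, so the three capture groups are in columns 2 to 4.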

How to apply a function to mutate a specific combination of columns? (purrr:: use preferred)

Suppose I have the following data:
data = tibble::tribble(
~id, ~year_1, ~year_2, ~cod_1, ~cod_2, ~cod_3, ~cod_4, ~var_x,
1, 0, 1, 5, 5, 3, 6, "x",
1, 0, 1, 3, 6, 14, 5, "x",
1, 0, 1, 2, 8, 5, 4, "x",
2, 1, 0, 10, 8, 2, 3, "x",
2, 1, 0, 3, 9, 1, 2, "x",
2, 1, 0, 1, 12, 0, 1, "x"
)
I'd like to create all possible products of every "year_" column with every "cod_" column. I mean something like this:
data.new = data %>%
mutate(year_1_cod_1 = year_1 * cod_1) %>%
mutate(year_1_cod_2 = year_1 * cod_2) %>%
mutate(year_1_cod_3 = year_1 * cod_3) %>%
mutate(year_1_cod_4 = year_1 * cod_4) %>%
mutate(year_2_cod_1 = year_2 * cod_1) %>%
mutate(year_2_cod_2 = year_2 * cod_2) %>%
mutate(year_2_cod_3 = year_2 * cod_3) %>%
mutate(year_2_cod_4 = year_2 * cod_4)
I can get all the possible combinations using:
year.var = colnames(data[, grepl("year", names(data))])
cod.var = colnames(data[, grepl("cod", names(data))])
com = crossing(year.var, cod.var)
> com
# A tibble: 8 x 2
year.var cod.var
<chr> <chr>
1 year_1 cod_1
2 year_1 cod_2
3 year_1 cod_3
4 year_1 cod_4
5 year_2 cod_1
6 year_2 cod_2
7 year_2 cod_3
8 year_2 cod_4
I could use a for loop over the com data frame to create each new column, but I'd like to do this within dplyr. I think I can use purrr to mutate over all the combinations, but I am not sure how.
In fact in my real data I have more than 1k possible combinations (i.e. more than 1k variables to mutate).
You could use map2_dfc to loop over the combinations in com, use transmute to create each new column by multiplying the two source columns (referenced by name via !!sym), and finally bind the result to the original data frame.
library(dplyr)
library(purrr)
data %>%
bind_cols(map2_dfc(com$year.var, com$cod.var,
~data %>% transmute(!!paste(.x, .y, sep = "_") := !!sym(.x) * !!sym(.y))))
# A tibble: 6 x 16
# id year_1 year_2 cod_1 cod_2 cod_3 cod_4 var_x year_1_cod_1 year_1_cod_2
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
#1 1 0 1 5 5 3 6 x 0 0
#2 1 0 1 3 6 14 5 x 0 0
#3 1 0 1 2 8 5 4 x 0 0
#4 2 1 0 10 8 2 3 x 10 8
#5 2 1 0 3 9 1 2 x 3 9
#6 2 1 0 1 12 0 1 x 1 12
# … with 6 more variables: year_1_cod_3 <dbl>, year_1_cod_4 <dbl>,
# year_2_cod_1 <dbl>, year_2_cod_2 <dbl>, year_2_cod_3 <dbl>,
# year_2_cod_4 <dbl>
library(dplyr)
library(tidyr)
data %>%
pivot_longer(starts_with("year"), names_to = "year", values_to = "year_val") %>%
pivot_longer(starts_with("cod"), names_to = "cod", values_to = "cod_val") %>%
mutate(year_cod = paste(year, cod, sep = "_"),
val = year_val * cod_val) %>%
pivot_wider(
id_cols = c(id, var_x),
names_from = year_cod,
values_from = val,
values_fn = list(val = list)
) %>%
unnest(cols = c(-id, -var_x))
#> # A tibble: 6 x 10
#> id var_x year_1_cod_1 year_1_cod_2 year_1_cod_3 year_1_cod_4 year_2_cod_1
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 x 0 0 0 0 5
#> 2 1 x 0 0 0 0 3
#> 3 1 x 0 0 0 0 2
#> 4 2 x 10 8 2 3 0
#> 5 2 x 3 9 1 2 0
#> 6 2 x 1 12 0 1 0
#> # … with 3 more variables: year_2_cod_2 <dbl>, year_2_cod_3 <dbl>,
#> # year_2_cod_4 <dbl>
Created on 2020-02-26 by the reprex package (v0.3.0)
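The same pairwise products can be sketched in base R with expand.grid and Map. This is only an illustration on a cut-down, made-up version of the data (two rows, two cod columns); the names year_var, cod_var and data_new are chosen for the example.

```r
# Cut-down illustrative data (not the full table from the question)
data <- data.frame(
  year_1 = c(0, 1), year_2 = c(1, 0),
  cod_1  = c(5, 10), cod_2 = c(5, 8)
)

year_var <- grep("^year_", names(data), value = TRUE)
cod_var  <- grep("^cod_",  names(data), value = TRUE)

# All year/cod pairs; the first column of expand.grid varies fastest
com <- expand.grid(cod = cod_var, year = year_var, stringsAsFactors = FALSE)

# One product column per pair, named like year_1_cod_1
products <- Map(function(y, k) data[[y]] * data[[k]], com$year, com$cod)
names(products) <- paste(com$year, com$cod, sep = "_")

data_new <- cbind(data, as.data.frame(products))
print(names(data_new))
```

Because each product is an independent vectorised multiplication, this scales to a large number of combinations without reshaping the data.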

dplyr collapse 'tail' rows into larger groups

library(tidyverse)
df <- tibble(a = as.factor(1:20), b = c(50, 20, 13, rep(2, 10), rep(1, 7)))
How do I make dplyr look at this data frame df and collapse all the occurrences of 2 into a single summed group, and all the occurrences of 1 into another single summed group, while keeping the rest of the data frame?
Turn this:
# A tibble: 20 x 2
a b
<fct> <dbl>
1 1 50
2 2 20
3 3 13
4 4 2
5 5 2
6 6 2
7 7 2
8 8 2
9 9 2
10 10 2
11 11 2
12 12 2
13 13 2
14 14 1
15 15 1
16 16 1
17 17 1
18 18 1
19 19 1
20 20 1
into this:
# A tibble: 5 x 2
a b
<fct> <dbl>
1 1 50
2 2 20
3 3 13
4 grp2 20
5 grp1 7
[Edit] - I fixed the example data. Sorry about that.
We group by a manufactured sortkey to maintain the sort order. This relies on b being in descending order in the input; if that is not the case in your actual data, replace sortkey = -b with the more general sortkey = data.table::rleid(b) or the longer sortkey = cumsum(coalesce(b != lag(b), FALSE)).
We also convert b to the group names giving a new a. It wasn't clear which groups are to be converted to grp... form. Hard-coded 1 and 2? Any group with more than one row? Groups at the end with more than one row? At any rate it would be easy enough to change the condition in the if_else once that were clarified.
Finally, we perform the summation and then remove the sortkey.
df %>%
group_by(sortkey = -b, a = paste0(if_else(b %in% 1:2, "grp", ""), b)) %>%
summarize(b = sum(b)) %>%
ungroup %>%
select(-sortkey)
giving:
# A tibble: 5 x 2
a b
<chr> <int>
1 50 50
2 20 20
3 13 13
4 grp2 20
5 grp1 7
Here's a way. I have converted a from factor to character to make things easier. You can convert it back to factor if you want. Also your test data was a bit wrong.
df <- tibble(a = as.character(1:20), b = c(50, 20, 13, rep(2, 10), rep(1, 7)))
df %>%
mutate(
a = case_when(
b == 1 ~ "grp1",
b == 2 ~ "grp2",
TRUE ~ a
)
) %>%
group_by(a) %>%
summarise(b = sum(b))
# A tibble: 5 x 2
a b
<chr> <dbl>
1 1 50
2 2 20
3 3 13
4 grp1 7
5 grp2 20
This approach gives you the desired group names without having to know in advance how many such groups there are (e.g. it would create grp3, grp4, ... depending on the values in b).
library(dplyr)
df %>%
mutate(
grp = as.numeric(lag(df$b) != df$b),
grp = cumsum(ifelse(is.na(grp), 0, grp))
) %>% group_by(grp) %>%
mutate(
a = ifelse(n() > 1, paste0("grp", b), a),
b = sum(b)
) %>% ungroup() %>% distinct(a, b)
Output:
a b
<chr> <dbl>
1 1 50
2 2 20
3 3 13
4 grp2 20
5 grp1 7
Note that the code could also be condensed, but in my opinion that makes it less readable:
df %>%
group_by(grp = cumsum(ifelse(is.na(as.numeric(lag(df$b) != df$b)), 0, as.numeric(lag(df$b) != df$b)))) %>%
mutate(
a = ifelse(n() > 1, paste0("grp", b), a),
b = sum(b)
) %>% ungroup() %>% distinct(a, b)
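The run-based grouping used in these answers can also be sketched in base R with rle, which finds consecutive runs of equal values. This is an illustrative alternative, assuming (as in the example data) that equal values of b sit next to each other; the names runs, last_row and res are made up here.

```r
a <- as.character(1:20)
b <- c(50, 20, 13, rep(2, 10), rep(1, 7))

runs <- rle(b)                    # lengths and values of consecutive runs in b
last_row <- cumsum(runs$lengths)  # index of the last row of each run

res <- data.frame(
  # Runs longer than one row get a "grp" label, single rows keep their own a
  a = ifelse(runs$lengths > 1, paste0("grp", runs$values), a[last_row]),
  # A run of identical values sums to value * length
  b = runs$values * runs$lengths
)
print(res)
```

Like the rleid suggestion, this keeps the original row order and does not depend on b being sorted in descending order.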

Resources