R: Multiple levels of grouping in dplyr::summarise using lapply

I'm relatively new to R and I'm having trouble applying a list of criteria to a data.frame I'm trying to summarize. I've been reading a bunch of different posts, but they only seem to be concerned with one level of grouping, and not a second one.
Assuming my df looks like this (my actual data frame is much larger. There are 35 different "Codes" and about 20 different "Colors")
Code Color Value
[1] A Red 10
[2] A Blue 15
[3] A Red 5
[4] B Green 20
[5] B Red 15
[6] C Green 10
Ideally, I'd like to create a summary table that groups the data by Code (I've been successful doing this with group_by and split), but then I'd also like to sum the values by the criterion "Color". Currently, I've only been able to accomplish this by running the criteria one by one.
So far I've been able to do this:
#this gives me the total value by each code, like a pivot or a sumif
dfsummary <-df %>% group_by(Code) %>% summarise (total = sum(Value))
#then I was able to come up with this to give me, by Code, value by Color.
dfsummary2 <- df %>% filter(Color == "Red") %>% group_by(Code) %>%
summarise(sumRed = sum(Value))
The results in dfsummary2 are:
Code sumRed
[1] A 15
[2] B 15
[3] C 0
What I'd like to accomplish is creating a data frame for all "Color" without having to specify each one individually.
My desired output, let's call it dfsummaryall, looks like:
Code sumRed sumBlue sumGreen
[1] A 15 15 0
[2] B 15 0 20
[3] C 0 0 10
This is where I get stumped. I can run each one individually and then merge them into one table, but I'd like to find a way to work in an apply function (lapply, I would think). This is where I'm definitely a novice.
My attempt so far, and this is where I'm sure I'm egregiously wrong, goes like this:
colors <- c("Red","Blue","Green")
dfsummaryall <- lapply(colors, function(x){dftmp <- df %>%
dplyr::filter(Color == x) %>% group_by(Code) %>%
summarise(x == sum(MktValue)
I know there's definitely a problem here in the "summarise(x == sum(MktValue)" part, but I'm really stumped as to how to pull this off.
Any help would be truly appreciated!

From user duckmayr in the comments:
df %>% group_by(Code, Color) %>% summarise(Sum = sum(Value)) %>% tidyr::spread(Color, Sum, fill = 0)
This worked perfectly for my purposes.
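For reference, tidyr::spread has since been superseded by tidyr::pivot_wider. Below is a minimal sketch of the same idea with the newer function, using the sample data from the question. The "sum" prefix and the name dfsummaryall come from the desired output above; the .groups and values_fill arguments assume reasonably recent dplyr and tidyr versions, and the column order follows the order in which the colours first appear.

```r
library(dplyr)
library(tidyr)

# Sample data from the question
df <- data.frame(
  Code  = c("A", "A", "A", "B", "B", "C"),
  Color = c("Red", "Blue", "Red", "Green", "Red", "Green"),
  Value = c(10, 15, 5, 20, 15, 10)
)

dfsummaryall <- df %>%
  group_by(Code, Color) %>%                       # two levels of grouping
  summarise(Sum = sum(Value), .groups = "drop") %>%
  pivot_wider(names_from = Color,                 # one column per colour
              values_from = Sum,
              values_fill = 0,                    # missing combinations become 0
              names_prefix = "sum")

dfsummaryall
```

No lapply is needed here: grouping by both columns computes every Code/Color sum in one pass, and the reshape only changes the layout.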

Related

R - automate exclusion from the split_quantile function

I have a dataframe that looks like this:
Var1 Var2 Var3
100 B 15
200 A 16
700 A 13
500 C 10
This is just preview data, in fact it has 10000+ rows.
I am doing the following:
data %>%
group_by(Var2) %>%
mutate(Tercile = fabricatr::split_quantile(Var3, 3)) %>%
group_by(Var2, Tercile) %>%
summarise(Var1 = mean(Var1))
This results in the following error message:
The `x` argument provided to quantile split must be non-null and length at least 2.
As far as I understand, this means that for some values of Var2 there is only 1 unique value of Var3 and the tercile split cannot be accomplished. My first question is: Is this interpretation correct? I am confused by the part that says "length at least 2" because I expect that length should be at least 3 to perform a tercile split, right?
If the interpretation is correct, my second question is: How to automate the exclusion of such cases? I don't have nearly enough time to go through some 300 values of Var2 and examine the values of Var3. I need a coding solution that excludes such levels of Var2, so that the error mentioned previously doesn't appear.
As the error message says, split_quantile needs a vector of length at least 2, so we can remove the groups that have fewer than 2 rows and then apply the function:
library(dplyr)
data %>%
group_by(Var2) %>%
filter(n() >= 2) %>%
mutate(Tercile = fabricatr::split_quantile(Var3, 3)) %>%
group_by(Var2, Tercile) %>%
summarise(Var1 = mean(Var1))
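Before dropping groups, it can help to see which levels of Var2 the filter would exclude. The following is a small sketch using only dplyr and the preview rows from the question; the names dropped and n_rows are made up for illustration.

```r
library(dplyr)

# Preview rows from the question
data <- tibble(
  Var1 = c(100, 200, 700, 500),
  Var2 = c("B", "A", "A", "C"),
  Var3 = c(15, 16, 13, 10)
)

# Levels of Var2 with fewer than 2 rows, i.e. the ones filter(n() >= 2) removes
dropped <- data %>%
  count(Var2, name = "n_rows") %>%
  filter(n_rows < 2)

dropped
```

On the preview rows this lists B and C, each with a single row, so only group A would reach split_quantile.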

R code to detect a change in a variable over time for multiple patients

I have a data set with multiple rows per patient, where each row represents a 1-week period of time over the course of 4 months. There is a variable grade that can take on values of 1, 2, or 3, and I want to detect when a single patient's grade INCREASES (1 to 2, 1 to 3, or 2 to 3) at any point (the result would be a yes/no variable). I could write a function to do it, but I'm betting there is some clever functional programming I could do to make use of existing R functions. Here is a sample data set below. Thank you!
df=data.frame(patient=c(1,1,1,2,2,3,3,3,3),period=c(1,2,3,1,3,1,3,4,5),grade=c(1,1,1,2,3,1,1,2,3))
What I want is a resulting data frame of:
data.frame(patient=c(1,2,3),grade.increase=c(0,1,1))
library(dplyr)
df %>%
arrange(patient, period) %>%
# group before calling lag() so the first row of a patient is not compared with the previous patient
group_by(patient) %>%
mutate(grade.increase = case_when(grade > lag(grade) ~ TRUE, TRUE ~ FALSE)) %>%
summarise(grade.increase = max(grade.increase))
Combining lag, which looks up the previous value within each patient, with case_when lets us flag every row where the grade increased.
Summarising with the maximum of grade.increase for each patient gives the desired result, because max treats FALSE as 0 and TRUE as 1.
If you feel like doing this in base R, here's a solution that uses the split-apply-combine approach.
You use split to make a list with a separate data frame for each patient;
you use lapply to iterate a summarization function over each list element, where the summarization function uses diff to look at changes in grade and if and any to summarize; and then
you wrap the whole thing in do.call(rbind, ...) to collapse the resulting list into a data frame.
Here's what that looks like:
do.call(rbind, lapply(split(df, df[,"patient"]), function(i) {
data.frame(patient = i[,"patient"][1],
grade.increase = if (any(diff(i[,"grade"]) > 0)) 1 else 0 )
}))
Result:
patient grade.increase
1 1 0
2 2 1
3 3 1
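To make the role of diff concrete, here is a short base R sketch on the sample data. It shows diff on one patient's grades and then the same any(diff(...) > 0) test applied per patient with tapply, which is a swapped-in alternative to the split/lapply/do.call chain above rather than part of the original answer. The name grade_increase is made up for illustration.

```r
df <- data.frame(patient = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
                 period  = c(1, 2, 3, 1, 3, 1, 3, 4, 5),
                 grade   = c(1, 1, 1, 2, 3, 1, 1, 2, 3))

# Make sure the rows are in time order within each patient
df <- df[order(df$patient, df$period), ]

# diff() returns the change between consecutive grades
diff(df$grade[df$patient == 3])
# [1] 0 1 1

# 1 if any consecutive change is positive, 0 otherwise, per patient
grade_increase <- tapply(df$grade, df$patient,
                         function(g) as.integer(any(diff(g) > 0)))
grade_increase
# 1 2 3
# 0 1 1
```

tapply returns a named vector rather than a data frame, so the do.call(rbind, ...) version is still the one to use when a data frame is required.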

R code not working to merge data (unduplicate) in columns based on identical values

I am trying to do some data wrangling on a tibble (dataframe) using dplyr, to unduplicate records, where if an id shows up twice, the resulting record will contain the same values if they are all identical, or an NA if there is a discrepancy in one of the records. For example, if I have df:
id date amount tag
--- ---- ------ ---
1 2018-01-03 10 big
2 2019-01-16 20 small
3 2020-01-05 30 big
3 2001-03-04 30 big
1 2018-01-03 5 big
The result should look like:
id date amount tag
--- ---- ------ ---
1 2018-01-03 NA big
2 2019-01-16 20 small
3 NA 30 big
Based on other answers I've found on stack overflow, I have tried various methods of using summarise_all including:
new_df <- df %>% group_by(id) %>% summarise_all(function(x) ifelse(all(x[1] == x),x[1],NA))
new_df <- df %>% group_by(id) %>% summarise_all(list(~ if(all(.[1] == .)) .[1] else NA))
new_df <- df %>% group_by(id) %>% summarise_all(funs(if(all(.[1] == .)) .[1] else NA))
Since I was able to use ifelse(all(x[1] == x),x[1],NA) on its own with a vector and it worked fine, I thought that would work with summarise_all. But when I use it with summarise_all or the other variants I show above, I get the error:
Error in summarise_impl(.data, dots): Column `date` can't promote group 2 to character
I suspect I just need to make a small tweak to my code to get it to work, but I've been working on this all day, and I don't know why it isn't working... So any help that the community can provide would be appreciated. This is the first time I've actually asked a question on stack overflow, because I almost always can find the answer from other people's questions :-) Thank you so much for any help!
First, the solution:
df %>%
group_by(id) %>%
summarise_all(~if(n_distinct(.) == 1) first(.) else c(NA, .)[1])
This is actually a little tricky. You'd think one could simply write:
df %>%
group_by(id) %>%
summarise_all(~if(n_distinct(.) == 1) first(.) else NA)
This is just an alternative to your if (all ...) ... else ..., using a few more dplyr functions.
However, dplyr does not accept a bare NA here; the NA has to be type-specific, e.g. NA_character_ or NA_integer_, so that it matches the column's data type. This is why your code is failing: the error says that group 2 (i.e. id == 2 in this case) cannot be "promoted" to character. This means that the NA provided for the date column is not being coerced to character, so the new column cannot be created.
Since you don't want to code all the correct NA types by hand, I use a little trick here. Combining an NA with the original variable via c(NA, .)[1] coerces that NA to the variable's type, and that typed NA is what I return. You can probably use other tricks to get the correct NA too.
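The coercion behind the c(NA, .)[1] trick can be checked directly in base R. This is only an illustration of how c() promotes a logical NA to the type of the other elements; the variable names are made up.

```r
# A bare NA is logical
bare_na <- typeof(NA)
bare_na
# [1] "logical"

# c() coerces all elements to a common type, so the NA takes the vector's type
na_chr <- c(NA, c("big", "small"))[1]
na_int <- c(NA, c(10L, 20L))[1]
na_dbl <- c(NA, c(2.5, 3.5))[1]

typeof(na_chr)
# [1] "character"
typeof(na_int)
# [1] "integer"
typeof(na_dbl)
# [1] "double"
```

Each result is still NA, but it now has the same type as the non-missing values returned for the other groups.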

Designing a function so filter does not drop NAs

I have only just recently been bitten by dplyr::filter removing a large number of NAs from my tibble when filtering. I have mostly worked on complete data sets but am now venturing into messier data where I want to make comparisons. Therefore I want to create a function with the same capabilities as filter but without removing NAs. Here are some suggestions: "Why does dplyr's filter drop NA values from a factor variable?" and "How to filter data without losing NA rows using dplyr". However, they are cumbersome solutions when dealing with lots of missing values and many comparisons. Below is an example of some ways to get around it.
This is sample data, with NAs in both columns A and B:
df = tibble(A = rep(c(1,2,3,NA,NA),10000),
B = rep(c(NA,1,2,3,4),10000))
This is intuitively what I want to do: return the rows where A does not equal B. However, it drops all the NAs (as expected).
df %>% filter(A != B)
1st solution: Use %in% from base R. You need to do this row by row and then ungroup, which slows the process down, but it gives the right result by keeping NAs when they appear in either A or B.
df %>% rowwise() %>% filter(!A %in% B) %>% ungroup()
2nd solution: The other option that has previously been suggested is using | to return A and B if they are NA.
df %>% filter(A != B|is.na(A)|is.na(B))
Now if you are doing multiple filters and comparisons, this becomes tiresome and you are likely to slip up somewhere! Is it therefore possible to create a function that has the is.na() check built in? Maybe something like this:
filter_keepna = function(data, expression){
data %>% filter(expression|is.na(column1)|is.na(column2))
}
I do not have enough knowledge to get something like this to work, but judging from the comments across various platforms, it is something people need.
In your function you can use the tidy evaluation functions from the rlang package. enquo(), f_lhs(), f_rhs() and quo_get_expr() help to extract the variables from the expression, and the bang-bang operator (!!) unquotes the captured pieces inside filter(). For your example:
library(dplyr)
library(rlang)
filter_keepna <- function(data, expre){
expre <- enquo(expre) # capture the expression as a quosure
data %>%
filter(!!expre | # !! unquotes the captured expression
# left-hand side of the comparison, e.g. A
is.na(!!f_lhs(quo_get_expr(expre))) |
# right-hand side of the comparison, e.g. B
is.na(!!f_rhs(quo_get_expr(expre))))
}
Using the filter_keepna() function in your example data:
df = tibble(A = rep(c(1,2,3,NA,NA),10000),
B = rep(c(NA,1,2,3,4),10000))
filter_keepna(df, A != B)
# A tibble: 40,000 x 2
# A B
# <dbl> <dbl>
# 1 2 1
# 2 3 2
# 3 NA 3
# 4 NA 4
# 5 2 1
# 6 3 2
# 7 NA 3
# 8 NA 4
# 9 2 1
# 10 3 2
# # ... with 39,990 more rows
Detailed information is available in the quotation reference and the quosure getters reference of the rlang package.
Try coalesce:
df %>% filter(coalesce(A != B, TRUE))
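A small sketch of why this works, on a made-up four-row tibble (small and kept are illustrative names, not from the question): A != B yields NA whenever either side is missing, and coalesce replaces those NAs with TRUE so filter keeps the rows.

```r
library(dplyr)

small <- tibble(A = c(1, 2, NA, 4),
                B = c(1, 3, 5, NA))

# The raw comparison: FALSE, TRUE, NA, NA
small$A != small$B

# coalesce() swaps each NA for TRUE, so only the genuinely equal row is dropped
kept <- small %>% filter(coalesce(A != B, TRUE))
kept
```

Here only the first row (A and B both equal to 1) is removed; the two rows containing an NA survive the filter.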

Dplyr/Lubridate: How to summarise overlapping intervals after grouping

I would like to group agreements and then compare how much their periods overlap (or are apart).
My dataframe may look like:
library(tidyverse)
library(lubridate)
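dft <- tribble(
~ShipTo, ~Code, ~Start, ~End,
"xxxx", "AAA11", "2018-01-01", "2018-03-01",
"yyyy", "BBB23", "2018-02-01", "2018-05-11",
"yyyy", "BBB23", "2018-03-01", "2018-06-11",
"cccc", "AAA11", "2018-01-06", "2018-03-12",
"yyyy", "CCC04", "2018-01-16", "2018-03-31",
"xxxx", "DDD", "2018-01-21", "2018-03-25"
) %>%
# unquoted 2018-01-01 would be evaluated as arithmetic, so parse date strings instead
mutate(Start = ymd(Start), End = ymd(End))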
I would like to mutate a column to create lubridate periods and evaluate them after grouping by ShipTo and Code. What I tried was:
dft3<-dft %>% filter(concat1 %in% to_filter2) %>%
arrange(ShipTo,Code)%>%
group_by(ShipTo,Code)%>%
mutate(period=interval(Start,End),
nextperiod=interval(lead(Start),lead(End)),
interv=day(as.period(intersect(period, nextperiod), "days"))) %>%
group_by(ShipTo,Code)%>%
summarise(count=n(),
intervmax=max(interv),
intervmin=min(interv))
If I remove the line group_by(ShipTo,Code)%>% the intervals are created correctly and also the lead intervals are correctly calculated from the next line. But when I naively use group_by, the intervals are not calculated correctly.
I suspect that perhaps my database should be split into many tables by groups and then, after the operation of creating and comparing intervals it should be glued back together.
Is there a succinct way to do it? Or perhaps there is a simpler way I have not yet learned? Thank you in advance for the hint in the right direction.
EDIT: The desired output should be a column with the overlap of the intervals in days (or the distance between intervals if there is no overlap). Grouping destroys the calculation. I would like to have these values calculated within groups (not across them).
EDIT2: I am trying to solve the problem by splitting the data frame into a list of data frames and then combining them again, but I am not sure of the syntax. The code below, which I was given on another portal, does not quite work and produces tables with one column (perhaps it can illustrate the issue). The idea is to split the data frame, create the new columns, and combine the pieces back into a single table.
fnOverlaps <- function(x) {
mutate(x,okres=interval(Start,End),
nastokres=interval(lead(Start),lead(End)),
interv=day(as.period(intersect(okres, nastokres), "days")))
}
dft3<-dft3 %>%
split(list(.$ShipTo, .$Code), drop = TRUE) %>%
map_df(fnOverlaps) %>%
flatten_dfr()
The result (for one group) that I expect would look like this.
tribble(
~ShipTo, ~Code, ~interv,
"yyyy", "BBB23", 70, # say there is a 70-day overlap
"yyyy", "BBB23", NA #there is no next row to compare
)
It looks like the issue is being caused by trying to combine vectors with the class "Interval." Specifically, they appear to be getting converted to numeric and losing their inherent information.
I think the only viable solution is to split the data.frame, run the analysis on each component separately with lapply, then bring them back together with bind_rows. Groups with only one entry present an issue, as max and min return -Inf and Inf when the argument is empty after removing NAs. But that is easy enough to correct for.
This code should work. Note that I am using group_by to ensure the ShipTo/Code columns are kept, though you could do that in other ways.
dft %>%
split(paste(.$ShipTo, "XXX", .$Code)) %>%
lapply(function(x){
x %>%
arrange(ShipTo,Code) %>%
mutate(period=interval(Start,End)
, nextperiod=interval(lead(Start),lead(End))
, interv=day(as.period(intersect(period, nextperiod), "days"))
) %>%
group_by(ShipTo,Code)%>%
summarise(count=n(),
intervmax=max(interv, na.rm = TRUE),
intervmin=min(interv, na.rm = TRUE)) %>%
ungroup()
}) %>%
bind_rows() %>%
mutate(intervmax = ifelse(is.infinite(intervmax)
, NA, intervmax)
, intervmin = ifelse(is.infinite(intervmin)
, NA, intervmin))
Returns
# A tibble: 5 x 5
ShipTo Code count intervmax intervmin
<chr> <chr> <int> <dbl> <dbl>
1 cccc AAA11 1 NA NA
2 xxxx AAA11 1 NA NA
3 xxxx DDD 1 NA NA
4 yyyy BBB23 2 71.0 71.0
5 yyyy CCC04 1 NA NA
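As an alternative sketch that avoids the Interval class altogether (a swapped-in technique, not the method from the answer above), the overlap in days can be computed with plain date arithmetic inside group_by. It assumes Start and End are Date columns; the names next_start, next_end and overlap_days are made up for illustration, and a negative value means there is a gap of that many days rather than an overlap.

```r
library(dplyr)

# Three rows from the question's data, with proper Date columns
dft <- tibble(
  ShipTo = c("yyyy", "yyyy", "xxxx"),
  Code   = c("BBB23", "BBB23", "AAA11"),
  Start  = as.Date(c("2018-02-01", "2018-03-01", "2018-01-01")),
  End    = as.Date(c("2018-05-11", "2018-06-11", "2018-03-01"))
)

overlaps <- dft %>%
  arrange(ShipTo, Code, Start) %>%
  group_by(ShipTo, Code) %>%
  mutate(
    next_start = lead(Start),   # NA on the last row of each group
    next_end   = lead(End),
    # overlap = earliest end minus latest start, in days
    overlap_days = pmin(as.numeric(End), as.numeric(next_end)) -
      pmax(as.numeric(Start), as.numeric(next_start))
  ) %>%
  ungroup()

overlaps
```

Because the dates are converted to plain numbers before pmin and pmax, nothing is silently coerced by the grouped mutate, and lead only ever looks at the next row within the same ShipTo/Code group.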
I am adding this just for the record. I received an answer from Jake Knaupp on the r4ds Slack group using the more modern map_df() syntax. It calculates the overlap of the periods, but it converts the intervals to numeric and emits a number of warnings saying so.
myFun <- function(x) {
mutate(x,period=interval(Start,End),
nextperiod=interval(lead(Start),lead(End)),
interv=day(as.period(intersect(period, nextperiod), "days")))
}
dft %>%
split(list(.$ShipTo, .$Code), drop = TRUE) %>%
map_df(myFun)
