I would like to group agreements and then compare how much their periods overlap (or are apart).
My dataframe may look like:
library(tidyverse)
library(lubridate)
dft <- tribble(
  ~ShipTo, ~Code, ~Start, ~End,
  "xxxx", "AAA11", "2018-01-01", "2018-03-01",
  "yyyy", "BBB23", "2018-02-01", "2018-05-11",
  "yyyy", "BBB23", "2018-03-01", "2018-06-11",
  "cccc", "AAA11", "2018-01-06", "2018-03-12",
  "yyyy", "CCC04", "2018-01-16", "2018-03-31",
  "xxxx", "DDD",   "2018-01-21", "2018-03-25"
) %>%
  mutate(Start = ymd(Start), End = ymd(End))  # dates must be quoted and parsed, not bare 2018-01-01
I would like to mutate a column to create lubridate periods and evaluate them after grouping by ShipTo and Code. What I tried was:
dft3 <- dft %>%
  # filter(concat1 %in% to_filter2) %>%  # a filtering step specific to my real data
  arrange(ShipTo, Code) %>%
  group_by(ShipTo, Code) %>%
  mutate(period = interval(Start, End),
         nextperiod = interval(lead(Start), lead(End)),
         interv = day(as.period(intersect(period, nextperiod), "days"))) %>%
  group_by(ShipTo, Code) %>%
  summarise(count = n(),
            intervmax = max(interv),
            intervmin = min(interv))
If I remove the first group_by(ShipTo, Code) %>% line, the intervals are created correctly and the lead intervals are also correctly taken from the next row. But when I naively use group_by, the intervals are not calculated correctly.
I suspect that perhaps my database should be split into many tables by groups and then, after the operation of creating and comparing intervals it should be glued back together.
Is there a succinct way to do it? Or perhaps there is a simpler way I have not yet learned? Thank you in advance for the hint in the right direction.
EDIT: The desired output is a column with the overlap of the intervals in days (or the distance between them if they do not overlap). Grouping destroys the calculation. I would like these values calculated within groups (not across them).
EDIT2: I tried to solve the problem by splitting the dataframe into a list of dataframes and then combining them, but I am not sure of the syntax. It does not quite work (it produces tables with one column); it is help I was given on another portal, but perhaps it can illustrate the issue. The idea is to split the database, create the new columns, and glue the tables back into a single table.
fnOverlaps <- function(x) {
  mutate(x,
         okres = interval(Start, End),                  # "okres" is Polish for period
         nastokres = interval(lead(Start), lead(End)),  # "nastokres" = next period
         interv = day(as.period(intersect(okres, nastokres), "days")))
}
dft3 <- dft %>%
  split(list(.$ShipTo, .$Code), drop = TRUE) %>%
  map_df(fnOverlaps) %>%
  flatten_dfr()
The result I expect (for one group) would look like this:
tribble(
  ~ShipTo, ~Code, ~interv,
  "yyyy", "BBB23", 70, # say there is a 70-day overlap
  "yyyy", "BBB23", NA  # there is no next row to compare with
)
It looks like the issue is caused by trying to combine vectors of class "Interval". Specifically, they appear to get converted to numeric, losing their inherent information.
I think the only viable solution is to split the data.frame, run the analysis on each component separately with lapply, then bring them back together with bind_rows. Groups with only one entry present an issue, as max and min return -Inf and Inf when the argument is empty after removing NAs. But that is easy enough to correct for.
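The -Inf/Inf behaviour is easy to see in isolation:
max(NA, na.rm = TRUE)  # -Inf, with a warning about no non-missing arguments
min(NA, na.rm = TRUE)  #  Inf, with the same warning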
This code should work. Note that I am using group_by to ensure the ShipTo/Code columns are kept, though you could do that in other ways.
dft %>%
  split(paste(.$ShipTo, "XXX", .$Code)) %>%
  lapply(function(x){
    x %>%
      arrange(ShipTo, Code) %>%
      mutate(period = interval(Start, End),
             nextperiod = interval(lead(Start), lead(End)),
             interv = day(as.period(intersect(period, nextperiod), "days"))) %>%
      group_by(ShipTo, Code) %>%
      summarise(count = n(),
                intervmax = max(interv, na.rm = TRUE),
                intervmin = min(interv, na.rm = TRUE)) %>%
      ungroup()
  }) %>%
  bind_rows() %>%
  mutate(intervmax = ifelse(is.infinite(intervmax), NA, intervmax),
         intervmin = ifelse(is.infinite(intervmin), NA, intervmin))
Returns
# A tibble: 5 x 5
ShipTo Code count intervmax intervmin
<chr> <chr> <int> <dbl> <dbl>
1 cccc AAA11 1 NA NA
2 xxxx AAA11 1 NA NA
3 xxxx DDD 1 NA NA
4 yyyy BBB23 2 71.0 71.0
5 yyyy CCC04 1 NA NA
For the record: I received an answer from Jake Knaupp on the r4ds Slack group using the modern map_df() syntax. It calculates the overlap of the periods, but it converts the periods to numeric, with a bunch of warnings that it will do so.
myFun <- function(x) {
  mutate(x,
         period = interval(Start, End),
         nextperiod = interval(lead(Start), lead(End)),
         interv = day(as.period(intersect(period, nextperiod), "days")))
}

dft %>%
  split(list(.$ShipTo, .$Code), drop = TRUE) %>%
  map_df(myFun)
I am new here, as well as to R, and I couldn't find any past queries that answered my following question, so I apologise if this has already been brought up before.
I am trying to merge the ID columns from two different datasets into one, but some of the IDs have been coded differently. I need to replace all the "LNZ_" IDs with "LNZ." ones; however, I cannot figure out how to go about doing this.
df_1 <- data.frame(ID_1 = c("LNZ_00001", "LNZ_00002", "LNZ_00003", "DFG00001", "CWD00001"),
                   Sex = c("M","F","F","M","F"))

df_2 <- data.frame(ID_2 = c("LNZ.00001", "LNZ.00002", "LNZ_00003", "DFG00001", "CWD00001"),
                   Type = c("S","S","B","B","B"),
                   AGE = c(56,75,66,64,64))
The above is similar to the datasets I have, only scaled down. I hope this is somewhat clear, and any help would be appreciated.
Thanks!
The issue with merging is that your ID columns use different formatting for some of the entries that are supposed to match, so you need to modify those values before performing the merge. In the example you gave, the difference is between a period separator (.) and an underscore (_); if your real data has more complex issues, you may need different code to clean up those values. Note that the direction of the replacement doesn't matter, as long as both columns end up in the same format.
However, once that is resolved, you can perform your merge easily. Here I've used the {tidyverse} packages to accomplish both steps in one pipe chain.
library(tidyverse)

df_1 <- data.frame(ID_1 = c("LNZ_00001", "LNZ_00002", "LNZ_00003", "DFG00001", "CWD00001"),
                   Sex = c("M","F","F","M","F"))
df_2 <- data.frame(ID_2 = c("LNZ.00001", "LNZ.00002", "LNZ_00003", "DFG00001", "CWD00001"),
                   Type = c("S","S","B","B","B"),
                   AGE = c(56,75,66,64,64))

df_2 %>%
  mutate(ID_2 = str_replace(ID_2, "\\.", "_")) %>%
  left_join(df_1, by = c("ID_2" = "ID_1"))
#> ID_2 Type AGE Sex
#> 1 LNZ_00001 S 56 M
#> 2 LNZ_00002 S 75 F
#> 3 LNZ_00003 B 66 F
#> 4 DFG00001 B 64 M
#> 5 CWD00001 B 64 F
Created on 2022-07-17 by the reprex package (v2.0.1)
I am trying to do some data wrangling on a tibble (dataframe) using dplyr to deduplicate records: if an id shows up twice, the resulting record should contain the same values where they are all identical, and an NA wherever there is a discrepancy between the records. For example, if I have df:
id date amount tag
--- ---- ------ ---
1 2018-01-03 10 big
2 2019-01-16 20 small
3 2020-01-05 30 big
3 2001-03-04 30 big
1 2018-01-03 5 big
The result should look like:
id date amount tag
--- ---- ------ ---
1 2018-01-03 NA big
2 2019-01-16 20 small
3 NA 30 big
Based on other answers I've found on Stack Overflow, I have tried various methods of using summarise_all, including:
new_df <- df %>% group_by(id) %>% summarise_all(function(x) ifelse(all(x[1] == x),x[1],NA))
new_df <- df %>% group_by(id) %>% summarise_all(list(~ if(all(.[1] == .)) .[1] else NA))
new_df <- df %>% group_by(id) %>% summarise_all(funs(if(all(.[1] == .)) .[1] else NA))
Since I was able to use ifelse(all(x[1] == x), x[1], NA) on its own with a vector and it worked fine, I thought it would work with summarise_all. But when I use it with summarise_all, in any of the variants shown above, I get the error:
Error in summarise_impl(.data, dots): Column `date` can't promote group 2 to character
I suspect I just need a small tweak to get this to work, but I've been working on it all day and I don't know why it isn't working... so any help the community can provide would be appreciated. This is the first time I've actually asked a question on Stack Overflow, because I can almost always find the answer in other people's questions :-) Thank you so much for any help!
First, the solution:
df %>%
  group_by(id) %>%
  summarise_all(~ if (n_distinct(.) == 1) first(.) else c(NA, .)[1])
This is actually a little tricky. You'd think one could write simply:
df %>%
  group_by(id) %>%
  summarise_all(~ if (n_distinct(.) == 1) first(.) else NA)
Which is just an alternative to your if (all ...) ... else ..., using some more dplyr functions.
However, dplyr doesn't like being given a plain NA; you need to be type-specific, e.g. provide NA_character_ or NA_integer_ etc. to match the correct data type. This is why your code is failing: the error says that group 2 (i.e. id == 2 in this case) is failing to be "promoted" to character. This means the NA provided there in column date isn't being coerced to character, so the new column fails to be created.
Since you don't want to code all the correct NA types by hand, I use a little trick here: combining an NA with the original variable via c(NA, .)[1] coerces that NA to the correct type, which I then use. You can probably use other tricks to get the correct NA too.
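A quick illustration of that coercion, with a throwaway character vector:
x <- c("a", "b")
typeof(NA)            # "logical"
typeof(NA_character_) # "character"
typeof(c(NA, x)[1])   # "character" -- c() coerces the NA to match x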
I have only just recently been bitten by dplyr::filter removing a large number of NAs from my tibble when filtering. I have mostly worked on complete data sets, but am now venturing into messier data where I want to make comparisons, so I want a function with the same capabilities as filter but without the NA removal. There are suggestions in Why does dplyr's filter drop NA values from a factor variable? and How to filter data without losing NA rows using dplyr, but they are cumbersome solutions when dealing with many missing values and many comparisons. Below is an example of some ways to get around it.
This is sample data, with NAs in both columns A and B:
df = tibble(A = rep(c(1, 2, 3, NA, NA), 10000),
            B = rep(c(NA, 1, 2, 3, 4), 10000))
This is intuitively what I want to do: return rows where A does not equal B. However, it drops all the NAs (as expected).
df %>% filter(A != B)
1st solution: one fix is to use %in% from base R, but you need to do this row by row and then ungroup, which slows the process down. It gives the right result, though, keeping NAs when they appear in either A or B.
df %>% rowwise() %>% filter(!A %in% B) %>% ungroup()
2nd solution: the other option that has previously been suggested is using | with is.na() to also return rows where A or B is NA.
df %>% filter(A != B|is.na(A)|is.na(B))
Now, if you are doing multiple filtering steps and comparisons, this becomes tiresome and you are likely to stuff up somewhere! Therefore, is it possible to create a function that has the is.na() handling built in automatically? Maybe something like this:
filter_keepna = function(data, expression){
  data %>% filter(expression | is.na(column1) | is.na(column2))
}
I do not have enough knowledge to get something like this to work, but judging from comments across various platforms, it seems to be something that is needed.
In your function you can use the tidy-evaluation functions from the rlang package. The enquo(), quo_get_expr(), f_lhs() and f_rhs() functions help to extract the variables from the expression, and you need the bang-bang operator (!!) to splice the quosures back in. For your example:
library(dplyr)
library(rlang)

filter_keepna <- function(data, expre){
  expre <- enquo(expre)  # quotation
  data %>%
    filter(!!expre |  # !! is the tidy-evaluation unquote operator
             # get the quoted left-hand variable from expre
             is.na(!!f_lhs(quo_get_expr(expre))) |
             # get the quoted right-hand variable from expre
             is.na(!!f_rhs(quo_get_expr(expre))))
}
Using the filter_keepna() function on your example data:
df = tibble(A = rep(c(1, 2, 3, NA, NA), 10000),
            B = rep(c(NA, 1, 2, 3, 4), 10000))
filter_keepna(df, A != B)
# A tibble: 40,000 x 2
# A B
# <dbl> <dbl>
# 1 2 1
# 2 3 2
# 3 NA 3
# 4 NA 4
# 5 2 1
# 6 3 2
# 7 NA 3
# 8 NA 4
# 9 2 1
# 10 3 2
# # ... with 39,990 more rows
Detailed information can be found in the quotation reference and the quosure getters reference of the rlang package.
Try coalesce
df %>% filter(coalesce(A != B, TRUE))
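This works because coalesce() replaces the NAs produced by the comparison with TRUE, so those rows survive the filter. A small illustration with made-up vectors:
library(dplyr)

A <- c(1, 2, NA)
B <- c(1, 3, 4)
A != B                  # FALSE  TRUE    NA
coalesce(A != B, TRUE)  # FALSE  TRUE  TRUE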
I am using lme4::lmList on a tibble to obtain the coefficients of linear fit lines fitted for each subject (id) in my data. What I actually want is a nice long chain of pipes, because I don't want to keep any of this output, just use it for a slope/intercept plot. However, I am running into a problem: lmList is creating a dataframe where the row numbers are the original subject ID numbers. I want to keep that information, but as soon as I use mutate on the output, the row numbers change to be sequential from 1.
I tried rescuing them first with rowid_to_column, but that also just gives me a column of sequential numbers from 1. What can I do, other than dropping out of the pipe and putting them in a column with base R? Is unique(a_df$id) really the best solution? I had a look around on here but didn't see a question like this one.
library(tibble)
library(dplyr)
library(Matrix)
library(lme4)
a_df <- tibble(id = c(rep(4, 3), rep(11, 3), rep(12, 3), rep(42, 3)),
               age = c(rep(seq(1, 3), 4)),
               hair = 1 + (age * 2) + rnorm(12) +
                 as.vector(sapply(rnorm(4), function(x) rep(x, 3))))
# as.data.frame to get around stupid RStudio diagnostics bug
int_slope <- coef(lmList(hair ~ age | id, as.data.frame(a_df))) %>%
setNames(., c("Intercept", "Slope"))
# Notice how the row numbers are the original subject ids?
print(int_slope)
Intercept Slope
4 2.9723596 1.387635
11 0.2824736 2.443538
12 -1.8912636 2.494236
42 0.8648395 1.680082
int_slope2 <- int_slope %>% mutate(ybar = Intercept + (mean(a_df$age) * Slope))
# Look! Mutate has changed them to be the numbers 1 to 4
print(int_slope2)
Intercept Slope ybar
1 2.9723596 1.387635 5.747630
2 0.2824736 2.443538 5.169550
3 -1.8912636 2.494236 3.097207
4 0.8648395 1.680082 4.225004
# Try to rescue them with rowid_to_column
int_slope3 <- int_slope %>% rowid_to_column(var = "id")
# Nope, 1 to 4 again
print(int_slope3)
id Intercept Slope
1 1 2.9723596 1.387635
2 2 0.2824736 2.443538
3 3 -1.8912636 2.494236
4 4 0.8648395 1.680082
Thanks,
SJ
The dplyr/tidyverse universe doesn't "believe in" row names. Any data that is important for an observation should be included in a column. The tibble package includes a function to move row names into a column. Try
int_slope %>% rownames_to_column()
before any mutates.
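For example, a sketch using the objects from the question:
library(tibble)
library(dplyr)

int_slope %>%
  rownames_to_column(var = "id") %>%  # the ids arrive as a character column
  mutate(id = as.integer(id),
         ybar = Intercept + mean(a_df$age) * Slope)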
Nothing like asking for help to make you see the answer. Those aren't row numbers, they're numeric row names. Of course they are! Non-contiguous row numbers make no sense. rownames_to_column is my answer.
Why don't you just create another 'ybar' column on int_slope?
int_slope$ybar <- int_slope$Intercept + mean(a_df$age) * int_slope$Slope
I'm relatively new to R and I seem to be having trouble applying a list of criteria to a data.frame I'm trying to summarize. I've been reading a bunch of different posts, but they only seem to be concerned with one level of grouping, not a second one.
Assuming my df looks like this (my actual data frame is much larger; there are 35 different "Codes" and about 20 different "Colors"):
Code Color Value
[1] A Red 10
[2] A Blue 15
[3] A Red 5
[4] B Green 20
[5] B Red 15
[6] C Green 10
Ideally, I'd like to create a summary table which enables me to group the data by Code (I've been successful doing this with group_by and split), but I'd also like to create a sum of values by the criterion "Color". Currently, I've only been able to accomplish this by running the criteria one by one.
So far I've been able to do this:
#this gives me the total value by each code, like a pivot or a sumif
dfsummary <- df %>% group_by(Code) %>% summarise(total = sum(Value))
# then I was able to come up with this to give me, by Code, the value by Color
dfsummary2 <- df %>%
  filter(Color == "Red") %>%
  group_by(Code) %>%
  summarise(sumRed = sum(Value))
The results in dfsummary2 are:
Code sumRed
[1] A 15
[2] B 15
[3] C 0
What I'd like to accomplish is creating a data frame covering all "Colors" without having to specify each one individually.
My desired output, let's call it dfsummaryall, looks like:
Code sumRed sumBlue sumGreen
[1] A 15 15 0
[2] B 15 0 20
[3] C 0 0 10
This is where I get stumped. I can run each one individually and then merge them into one table, but I'd like to find a way to work in an apply function (lapply, I would think). This is where I'm definitely a novice.
My attempt so far, and this is where I'm sure I'm egregiously wrong, goes like this:
colors <- c("Red", "Blue", "Green")
dfsummaryall <- lapply(colors, function(x){
  dftmp <- df %>%
    dplyr::filter(Color == x) %>%
    group_by(Code) %>%
    summarise(x == sum(MktValue))
})
I know there's definitely a problem in the summarise(x == sum(MktValue)) part, but I'm really stumped as to how to pull this off.
Any help would be truly appreciated!
From user duckmayr in the comments:
df %>%
  group_by(Code, Color) %>%
  summarise(Sum = sum(Value)) %>%
  tidyr::spread(Color, Sum, fill = 0)
This worked perfectly for my purposes.
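For what it's worth, on newer tidyr versions pivot_wider() is the successor to spread(); an equivalent call (assuming tidyr >= 1.1 for the scalar values_fill) would be:
library(dplyr)
library(tidyr)

df %>%
  group_by(Code, Color) %>%
  summarise(Sum = sum(Value)) %>%
  pivot_wider(names_from = Color, values_from = Sum, values_fill = 0)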