I have the following data and am looking to create the "Final Col" column shown below using dplyr in R. I would appreciate your ideas.
| Year | Week | MainCat|Qty |Final Col |
|:----: |:------: |:-----: |:-----:|:------------:|
| 2017 | 1 | Edible |69 |69/(69+12) |
| 2017 | 2 | Edible |12 |12/(69+12) |
| 2017 | 1 | Flowers|88 |88/(88+47) |
| 2017 | 2 | Flowers|47 |47/(88+47) |
| 2018 | 1 | Edible |90 |90/(90+35) |
| 2018 | 2 | Edible |35 |35/(90+35) |
| 2018 | 1 | Flowers|78 |78/(78+85) |
| 2018 | 2 | Flowers|85 |85/(78+85) |
It can be done with a group_by operation: group by 'Year' and 'MainCat', then divide 'Qty' by the sum of 'Qty' to create the 'Final' column.
library(dplyr)
df1 <- df1 %>%
group_by(Year, MainCat) %>%
mutate(Final = Qty/sum(Qty))
You can use prop.table :
library(dplyr)
df %>% group_by(Year, MainCat) %>% mutate(Final = prop.table(Qty))
# Year Week MainCat Qty Final
# <int> <int> <chr> <int> <dbl>
#1 2017 1 Edible 69 0.852
#2 2017 2 Edible 12 0.148
#3 2017 1 Flowers 88 0.652
#4 2017 2 Flowers 47 0.348
#5 2018 1 Edible 90 0.72
#6 2018 2 Edible 35 0.28
#7 2018 1 Flowers 78 0.479
#8 2018 2 Flowers 85 0.521
You can also do this in base R :
df$Final <- with(df, ave(Qty, Year, MainCat, FUN = prop.table))
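As a quick sanity check, the sketch below rebuilds the question's table and confirms that dividing by the group sum and using prop.table give the same shares. The names share_manual and share_prop are illustrative choices, not from the original posts, and only base R is used.

```r
# Rebuild the sample data from the question
df <- data.frame(
  Year    = c(2017, 2017, 2017, 2017, 2018, 2018, 2018, 2018),
  Week    = c(1, 2, 1, 2, 1, 2, 1, 2),
  MainCat = c("Edible", "Edible", "Flowers", "Flowers",
              "Edible", "Edible", "Flowers", "Flowers"),
  Qty     = c(69, 12, 88, 47, 90, 35, 78, 85)
)

# Share of each row within its Year/MainCat group, computed two ways
share_manual <- with(df, ave(Qty, Year, MainCat, FUN = function(x) x / sum(x)))
share_prop   <- with(df, ave(Qty, Year, MainCat, FUN = prop.table))

# Both approaches should agree
stopifnot(isTRUE(all.equal(share_manual, share_prop)))

round(share_prop, 3)
# [1] 0.852 0.148 0.652 0.348 0.720 0.280 0.479 0.521
```

Within each Year/MainCat group the shares sum to 1, which is an easy property to check on real data.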
I have a very large dataset and a sample of that looks something like the one below:
| Id | Name | Start_Date | End_Date |
|----|---------|------------|------------|
| 10 | Mark | 4/2/1999 | 7/5/2018 |
| 10 | | 1/1/2000 | 9/24/2018 |
| 25 | | 5/3/1968 | 6/3/2000 |
| 25 | | 6/6/2009 | 4/23/2010 |
| 25 | Anthony | 2/20/2010 | 7/21/2016 |
| 25 | | 9/12/2014 | 11/26/2019 |
I need to parse the names from Name column based on their Id such that the output table looks like:
| Id | Name | Start_Date | End_Date |
|----|---------|------------|------------|
| 10 | Mark | 4/2/1999 | 7/5/2018 |
| 10 | Mark | 1/1/2000 | 9/24/2018 |
| 25 | Anthony | 5/3/1968 | 6/3/2000 |
| 25 | Anthony | 6/6/2009 | 4/23/2010 |
| 25 | Anthony | 2/20/2010 | 7/21/2016 |
| 25 | Anthony | 9/12/2014 | 11/26/2019 |
How can I achieve an output as shown above? I went through the substitute and parse functions, but was unable to understand how they apply to this problem.
My dataset would be:
df=data.frame(Id=c("10","10","25","25","25","25"),Name=c("Mark","","","","Anthony",""),
Start_Date=c("4/2/1999", "1/1/2000","5/3/1968","6/6/2009","2/20/2010","9/12/2014"),
End_Date=c("7/5/2018","9/24/2018","6/3/2000","4/23/2010","7/21/2016","11/26/2019"))
We can change the blanks ("") to NA and use fill to replace the NA elements with the previous non-NA element
library(dplyr)
library(tidyr)
df1 %>%
mutate(Name = na_if(Name, "")) %>%
group_by(Id) %>%
fill(Name, .direction = "down") %>%
fill(Name, .direction = "up")
# A tibble: 6 x 4
# Groups: Id [2]
# Id Name Start_Date End_Date
# <chr> <chr> <chr> <chr>
#1 10 Mark 4/2/1999 7/5/2018
#2 10 Mark 1/1/2000 9/24/2018
#3 25 Anthony 5/3/1968 6/3/2000
#4 25 Anthony 6/6/2009 4/23/2010
#5 25 Anthony 2/20/2010 7/21/2016
#6 25 Anthony 9/12/2014 11/26/2019
In the devel version of tidyr ('0.8.3.9000'), this can be done in a single fill call, since .direction = "downup" is also an option
df1 %>%
mutate(Name = na_if(Name, "")) %>%
group_by(Id) %>%
fill(Name, .direction = "downup")
Or another option is to group by 'Id', and mutate the 'Name' as the first non-blank element
df1 %>%
group_by(Id) %>%
mutate(Name = first(Name[Name!=""]))
# A tibble: 6 x 4
# Groups: Id [2]
# Id Name Start_Date End_Date
# <chr> <chr> <chr> <chr>
#1 10 Mark 4/2/1999 7/5/2018
#2 10 Mark 1/1/2000 9/24/2018
#3 25 Anthony 5/3/1968 6/3/2000
#4 25 Anthony 6/6/2009 4/23/2010
#5 25 Anthony 2/20/2010 7/21/2016
#6 25 Anthony 9/12/2014 11/26/2019
data
df1 <- structure(list(Id = c("10", "10", "25", "25", "25", "25"), Name = c("Mark",
"", "", "", "Anthony", ""), Start_Date = c("4/2/1999", "1/1/2000",
"5/3/1968", "6/6/2009", "2/20/2010", "9/12/2014"), End_Date = c("7/5/2018",
"9/24/2018", "6/3/2000", "4/23/2010", "7/21/2016", "11/26/2019"
)), class = "data.frame", row.names = c(NA, -6L))
Using DF defined reproducibly in the Note at the end, replace each zero-length element of Name with NA and then use na.omit to get the unique non-NA to use to fill. We have assumed that there is only one non-NA per Id which is the case in the question. If not we could replace na.omit with function(x) unique(na.omit(x)) assuming that the non-NAs are all the same within Id. No packages are used.
transform(DF, Name = ave(replace(Name, !nzchar(Name), NA), Id, FUN = na.omit))
giving:
Id Name Start_Date End_Date
1 10 Mark 4/2/1999 7/5/2018
2 10 Mark 1/1/2000 9/24/2018
3 25 Anthony 5/3/1968 6/3/2000
4 25 Anthony 6/6/2009 4/23/2010
5 25 Anthony 2/20/2010 7/21/2016
6 25 Anthony 9/12/2014 11/26/2019
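If an Id could carry the same name on more than one row, the unique(na.omit(x)) variant mentioned above might look like the following sketch. DF2 and filled are made-up names and data for illustration, not part of the question.

```r
# Illustrative data (not from the question): Id 25 carries its name twice
DF2 <- data.frame(
  Id   = c("10", "10", "25", "25", "25"),
  Name = c("Mark", "", "Anthony", "", "Anthony"),
  stringsAsFactors = FALSE
)

# Blank -> NA, then fill every row of an Id with its single distinct name.
# This assumes all non-blank names within an Id are identical.
filled <- transform(
  DF2,
  Name = ave(replace(Name, !nzchar(Name), NA), Id,
             FUN = function(x) unique(na.omit(x)))
)

filled$Name
# [1] "Mark"    "Mark"    "Anthony" "Anthony" "Anthony"
```

If an Id had two different non-blank names, unique would return two values and the fill would no longer be well defined, so that case needs a separate rule.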
na.strings
We can simplify this slightly if we make sure that the zero-length elements of Name are NA in the first place. We replace the read.table line in the Note with the first statement below. Then it is just a matter of using ave with na.omit, without the replace step.
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE, sep = "|",
strip.white = TRUE, na.strings = "")
transform(DF, Name = ave(Name, Id, FUN = na.omit))
Note
The input in reproducible form:
Lines <- "
Id | Name | Start_Date | End_Date
10 | Mark | 4/2/1999 | 7/5/2018
10 | | 1/1/2000 | 9/24/2018
25 | | 5/3/1968 | 6/3/2000
25 | | 6/6/2009 | 4/23/2010
25 | Anthony | 2/20/2010 | 7/21/2016
25 | | 9/12/2014 | 11/26/2019"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE, sep = "|", strip.white = TRUE)
This is a continued question from the post Remove the first row from each group if the second row meets a condition
Below is a sample dataset:
df <- data.frame(id=c("9","9","9","5","5","4","4","4","4","4","20","20"),
Date=c("11/29/2018","11/29/2018","11/29/2018","2/13/2019","2/13/2019",
"6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/23/2018","12/25/2018","12/25/2018"),
Buyer= c("John","John","John","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul"),
Amount= c("959","1158","596","922","922","1849","4193","4256","65","100","313","99"), stringsAsFactors = F) %>%
group_by(Buyer,id) %>% mutate(diffs = c(NA, diff(as.Date(Date, format = "%m/%d/%Y"))))
which would look like:
| id | Date | Buyer | diffs | Amount |
|----|:----------:|------:|-------|--------|
| 9 | 11/29/2018 | John | NA | 959 |
| 9 | 11/29/2018 | John | 0 | 1158 |
| 9 | 11/29/2018 | John | 0 | 596 |
| 5 | 2/13/2019 | Maria | NA | 922 |
| 5 | 2/13/2019 | Maria | 0 | 922 |
| 4 | 6/15/2018 | Sandy | NA | 1849 |
| 4 | 6/20/2018 | Sandy | 5 | 4193 |
| 4 | 8/17/2018 | Sandy | 58 | 4256 |
| 4 | 8/20/2018 | Sandy | 3 | 65 |
| 4 | 8/23/2018 | Sandy | 3 | 100 |
| 20 | 12/25/2018 | Paul | NA | 313 |
| 20 | 12/25/2018 | Paul | 0 | 99 |
I need to retain, for each Buyer and id, those consecutive rows that are at most 5 days apart and whose Amounts sum to more than 5000. For example, Buyer 'Sandy' with id '4' has two transactions of 1849 and 4193 on '6/15/2018' and '6/20/2018', 5 days apart; since these two amounts sum to more than 5000, the output should contain these records. The same Buyer 'Sandy' with id '4' also has transactions of 4256, 65 and 100 on '8/17/2018', '8/20/2018' and '8/23/2018', each 3 days apart, but the output should not contain these records because their sum is below 5000.
The final output would look like:
| id | Date | Buyer | diffs | Amount |
|----|:---------:|------:|-------|--------|
| 4 | 6/15/2018 | Sandy | NA | 1849 |
| 4 | 6/20/2018 | Sandy | 5 | 4193 |
Changing Date from character to Date and Amount from character to numeric:
df$Date<-as.Date(df$Date, '%m/%d/%Y')
df$Amount<-as.numeric(df$Amount)
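Note that the year specifier in the date format is case-sensitive. The short check below (the variable x is just an illustrative value taken from the sample data) shows how %Y and %y treat a four-digit year differently.

```r
x <- "11/29/2018"

# %Y reads the full four-digit year
full_year <- as.Date(x, format = "%m/%d/%Y")
full_year   # 2018-11-29

# %y reads only two digits ("20") and leaves "18" unparsed
two_digit <- as.Date(x, format = "%m/%d/%y")
two_digit   # 2020-11-29
```

Since the sample dates all use four-digit years, %Y is the specifier that matches them.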
Next I group the dataset by id, arrange it by Date, and create a rank within each id (so, for example, Sandy gets ranks 1 through 5 for the 5 different days on which she shopped). Then I define a new variable called ConsecutiveSum, which adds each row's Amount to the previous row's Amount (lag gives you the previous row). The ifelse statement forces ConsecutiveSum to 0 if the previous row's Amount doesn't exist. The final step just enforces your conditions:
df %>%
group_by(id) %>%
arrange(Date) %>%
mutate(rank=dense_rank(Date)) %>%
mutate(ConsecutiveSum = ifelse(is.na(lag(Amount)),0,Amount + lag(Amount , default = 0)))%>%
filter(diffs<=5 & ConsecutiveSum>=5000 | ConsecutiveSum==0 & lead(ConsecutiveSum)>=5000)
# id Date Buyer Amount diffs rank ConsecutiveSum
# <chr> <chr> <chr> <dbl> <dbl> <int> <dbl>
# 1 4 6/15/2018 Sandy 1849 NA 1 0
# 2 4 6/20/2018 Sandy 4193 5 2 6042
I would use a combination of techniques available in tidyverse:
First create a grouping variable (new_id), then sum Amount within each combination of the original id and new_id. We can then keep only the groups whose total Amount exceeds 5000, and use a join or semi_join to filter the original rows by that criterion.
ids is a dataset that holds the total Amount for each id and new_id, filtered to dollar > 5000. This gives you the id and new_id pairs that meet your criteria.
library(tidyverse)
df <- data.frame(id=c("9","9","9","5","5","4","4","4","4","4","20","20"),
Date=c("11/29/2018","11/29/2018","11/29/2018","2/13/2019","2/13/2019",
"6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/23/2018","12/25/2018","12/25/2018"),
Buyer= c("John","John","John","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul"),
Amount= c(959,1158,596,922,922,1849,4193,4256,65,100,313,99), stringsAsFactors = FALSE) %>%
group_by(Buyer,id) %>% mutate(diffs = c(NA, diff(as.Date(Date, format = "%m/%d/%Y"))))
# tf1 flags the first row of each id; id is character, so the lag default is "" rather than 0
# tf2 flags rows that start a new run (first row, or more than 5 days after the previous one)
df1 <- df %>% mutate(Date = as.Date(Date , format = "%m/%d/%Y"),
tf1 = (id != lag(id, default = "")),
tf2 = (is.na(diffs) | diffs > 5))
# new_id increases by one every time either flag is TRUE
df1$new_id <- cumsum(df1$tf1 + df1$tf2 > 0)
> df1
id Date Buyer Amount diffs tf1 tf2 new_id
<chr> <date> <chr> <dbl> <dbl> <lgl> <lgl> <int>
1 9 2018-11-29 John 959 NA TRUE TRUE 1
2 9 2018-11-29 John 1158 0 FALSE FALSE 1
3 9 2018-11-29 John 596 0 FALSE FALSE 1
4 5 2019-02-13 Maria 922 NA TRUE TRUE 2
5 5 2019-02-13 Maria 922 0 FALSE FALSE 2
6 4 2018-06-15 Sandy 1849 NA TRUE TRUE 3
7 4 2018-06-20 Sandy 4193 5 FALSE FALSE 3
8 4 2018-08-17 Sandy 4256 58 FALSE TRUE 4
9 4 2018-08-20 Sandy 65 3 FALSE FALSE 4
10 4 2018-08-23 Sandy 100 3 FALSE FALSE 4
11 20 2018-12-25 Paul 313 NA TRUE TRUE 5
12 20 2018-12-25 Paul 99 0 FALSE FALSE 5
ids <- df1 %>%
group_by(id, new_id) %>%
summarise(dollar = sum(Amount)) %>%
ungroup() %>% filter(dollar > 5000)
id new_id dollar
<chr> <int> <dbl>
1 4 3 6042
df1 %>% semi_join(ids, by = c("id", "new_id"))
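The run-grouping idea behind new_id can also be sketched in base R for a single buyer. The snippet below uses Sandy's five transactions from the question; the names gap, run_id and run_total are illustrative, not from the original answer.

```r
# Sandy's five transactions from the question (id 4)
dates  <- as.Date(c("6/15/2018", "6/20/2018", "8/17/2018",
                    "8/20/2018", "8/23/2018"), format = "%m/%d/%Y")
amount <- c(1849, 4193, 4256, 65, 100)

# Gap in days to the previous transaction; NA for the first row
gap <- c(NA, as.numeric(diff(dates)))

# A new run starts at the first row or whenever the gap exceeds 5 days
run_id <- cumsum(is.na(gap) | gap > 5)

# Total per run, repeated on every row of that run
run_total <- ave(amount, run_id, FUN = sum)

# Keep only rows that belong to a run totalling more than 5000
kept <- data.frame(dates, amount, run_id, run_total)[run_total > 5000, ]
kept
```

Here the first run (1849 + 4193 = 6042) is kept and the second run (4256 + 65 + 100 = 4421) is dropped, matching the expected output in the question.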
Hi I am trying to work out a way to conditionally roll up dates in R.
Suppose I have the following table below and I want to roll dates up using the Flags variable. The Flag can either be 1 or 2 and dictates which subsequent dates can be linked up.
DateStart <- c("2018-01-01", "2018-01-04", "2018-01-05", "2018-01-09", "2018-01-12", "2018-01-20")
DateEnd <- c("2018-01-05", "2018-01-09", "2018-01-12", "2018-01-15", "2018-01-20", "2018-01-21")
IndexRecord <- c(1, NA, NA, NA, NA, NA)
Flag1 <- c(1,1,1,1,1,1)
Flag2 <- c(2,1,1,1,1,1)
Flag3 <- c(1,1,2,1,2,1)
library(dplyr) # for %>% and arrange()
df1 <- data.frame(DateStart = as.Date(DateStart),
DateEnd = as.Date(DateEnd),
IndexRecord = IndexRecord,
Flag1 = Flag1,
Flag2 = Flag2,
Flag3 = Flag3) %>%
arrange(DateStart)
df1
| | DateStart | DateEnd | IndexRecord | Flag1 | Flag2 | Flag3 |
|---|------------|------------|-------------|-------|-------|-------|
| 1 | 2018-01-01 | 2018-01-05 | 1 | 1 | 2 | 1 |
| 2 | 2018-01-04 | 2018-01-09 | NA | 1 | 1 | 1 |
| 3 | 2018-01-05 | 2018-01-12 | NA | 1 | 1 | 2 |
| 4 | 2018-01-09 | 2018-01-15 | NA | 1 | 1 | 1 |
| 5 | 2018-01-12 | 2018-01-20 | NA | 1 | 1 | 2 |
| 6 | 2018-01-20 | 2018-01-21 | NA | 1 | 1 | 1 |
A Flag with value of 1 for the current period means that for a subsequent row to have a valid link, the subsequent row must have DateStart occurring before DateEnd of the current row. Using Flag1 as the column of interest, the result would look like:
| | DateStart | DateEnd | IndexRecord | Flag1 |
|---|------------|------------|-------------|-------|
| 1 | 2018-01-01 | 2018-01-05 | 1 | 1 |
| 2 | 2018-01-04 | 2018-01-09 | NA | 1 |
| 3 | 2018-01-05 | 2018-01-12 | NA | 1 |
| 4 | 2018-01-09 | 2018-01-15 | NA | 1 |
| 5 | 2018-01-12 | 2018-01-20 | NA | 1 |
A Flag with value of 2 for the current period means that for a subsequent row to have a valid link, the subsequent row must have DateStart occurring on the DateEnd of the current row. Using Flag2 as the column of interest, the result would look like:
| | DateStart | DateEnd | IndexRecord | Flag2 |
|---|------------|------------|-------------|-------|
| 1 | 2018-01-01 | 2018-01-05 | 1 | 2 |
| 3 | 2018-01-05 | 2018-01-12 | NA | 1 |
| 4 | 2018-01-09 | 2018-01-15 | NA | 1 |
| 5 | 2018-01-12 | 2018-01-20 | NA | 1 |
One of the more complex cases could occur with patterns such as seen in Flag3 with the desired results:
| | DateStart | DateEnd | IndexRecord | Flag3 |
|---|------------|------------|-------------|-------|
| 1 | 2018-01-01 | 2018-01-05 | 1 | 1 |
| 2 | 2018-01-04 | 2018-01-09 | NA | 1 |
| 3 | 2018-01-05 | 2018-01-12 | NA | 2 |
| 5 | 2018-01-12 | 2018-01-20 | NA | 2 |
| 6 | 2018-01-20 | 2018-01-21 | NA | 1 |
Cheers,
J
Edits:
Since this was perhaps not clear let me clarify step by step.
Flag1.
Row 1 ends on 2018-01-05 and its Flag1 is 1.
This means that for a subsequent row to be linked to this episode, the next episode's DateStart must occur before the DateEnd of Row 1. Row 2 satisfies this condition, since 2018-01-04 occurs before 2018-01-05, and is therefore a valid link.
Among the remaining rows, each DateStart falls before the previous row's DateEnd, except for Row 6. Since Flag1 of Row 5 is 1 and Row 6 starts on (not before) the DateEnd of Row 5, Row 6 cannot be linked, which is why the table stops at Row 5.
Total elapsed time is from 2018-01-01 to 2018-01-20.
Flag2.
Row 1 has Flag2 equal to 2, which means that only a subsequent DateStart of 2018-01-05 can be linked to this row. Therefore Row 2 is dropped. If we keep moving down, we see that Row 3 has a DateStart of 2018-01-05 and can therefore be linked to Row 1.
The remaining rows follow the same pattern as for Flag1, since Flag1 and Flag2 are identical from this point onwards.
As with Flag1, total elapsed time is from 2018-01-01 to 2018-01-20.
The elapsed time is the same as for Flag1, but the path taken differs.
Flag3.
Flag3 for Row 1 and Row 2 is the same as Flag1, which means that at this point Row 1, Row 2, and Row 3 are kept as in the Flag1 example.
Flag3 for Row 3, however, is 2. Since the DateEnd of Row 3 is 2018-01-12, only Row 5 can be linked, and Row 4 is removed.
Since Row 5 has a Flag3 of 2 and a DateEnd of 2018-01-20, Row 6 (which starts on 2018-01-20) can also be linked to this set.
The total elapsed time for this set is from 2018-01-01 to 2018-01-21.
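One possible reading of the rules above can be sketched as a simple loop. The helper roll_up is hypothetical (it is not from the original post) and assumes the rows are already sorted by DateStart, that row 1 is the index record, and that the first later row satisfying the rule is the one that gets linked.

```r
# Hypothetical helper: returns the indices of the rows that stay linked,
# starting from row 1 (the index record). Assumes rows sorted by DateStart.
roll_up <- function(start, end, flag) {
  n <- length(start)
  cur <- 1L
  keep <- cur
  while (cur < n) {
    cand <- (cur + 1L):n
    # Flag 1: later start strictly before current end; Flag 2: exactly on it
    ok <- if (flag[cur] == 1) start[cand] < end[cur] else start[cand] == end[cur]
    if (!any(ok)) break
    cur <- cand[which(ok)[1]]   # first later row that satisfies the rule
    keep <- c(keep, cur)
  }
  keep
}

DateStart <- as.Date(c("2018-01-01", "2018-01-04", "2018-01-05",
                       "2018-01-09", "2018-01-12", "2018-01-20"))
DateEnd   <- as.Date(c("2018-01-05", "2018-01-09", "2018-01-12",
                       "2018-01-15", "2018-01-20", "2018-01-21"))
Flag1 <- c(1, 1, 1, 1, 1, 1)
Flag2 <- c(2, 1, 1, 1, 1, 1)
Flag3 <- c(1, 1, 2, 1, 2, 1)

r1 <- roll_up(DateStart, DateEnd, Flag1)  # 1 2 3 4 5
r2 <- roll_up(DateStart, DateEnd, Flag2)  # 1 3 4 5
r3 <- roll_up(DateStart, DateEnd, Flag3)  # 1 2 3 5 6
```

The returned indices can be used to subset the data frame, for example df1[r3, ], and the total elapsed time is then the span from the first kept DateStart to the last kept DateEnd.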