Split a row into columns with conditions in R

I have a data frame as shown below:
+----+-------+---------+
| ID | VALUE | DATE |
+----+-------+---------+
| 1 | 10 | 2019-08 |
| 2 | 12 | 2018-05 |
| 3 | 45 | 2019-03 |
| 3 | 33 | 2018-03 |
| 1 | 5 | 2018-08 |
| 2 | 98 | 2019-05 |
| 4 | 67 | 2019-10 |
| 4 | 34 | 2018-10 |
| 1 | 55 | 2018-07 |
| 2 | 76 | 2019-08 |
| 2 | 56 | 2018-12 |
+----+-------+---------+
What I'm trying to do is split VALUE and DATE into VALUE1/VALUE2 and DATE1/DATE2, based on the current year (the year of the system date) and the year before.
The condition is that if the year-month in DATE matches that of the current system date, then last year's row should not be considered.
Also disregard all values and dates from before the year of the system date, unless they pair with the same month in the current year.
The resulting output would be as shown below.
In the result, IDs 1, 2 and 3 have values for the same month in both this year and last year, so we split them into two different columns.
We also didn't consider last year's result for ID 4, because its month this year matches the year-month of the system date.
We also disregard all values from last year that have no matching month this year (ID 1 for 2018-07 and ID 2 for 2018-12 in this example).
+----+---------+---------+--------+--------+
| ID | DATE1 | DATE2 | VALUE1 | VALUE2 |
+----+---------+---------+--------+--------+
| 1 | 2019-08 | 2018-08 | 10 | 5 |
| 2 | 2019-05 | 2018-05 | 98 | 12 |
| 3 | 2019-03 | 2018-03 | 45 | 33 |
| 4 | 2019-10 | NA | 67 | NA |
| 2 | 2019-08 | NA | 76 | NA |
+----+---------+---------+--------+--------+

I think you could get everything in the right format first:
df <- data.frame(ID = c(1, 2, 3, 3, 1, 2, 4, 4, 1, 2, 2),
VALUE = c(10, 12, 45, 33, 5, 98, 67, 34, 55, 76, 56),
DATE = c("2019-08", "2018-05", "2019-03","2018-03",
"2018-08","2019-05", "2019-10", "2018-10",
"2018-07", "2019-08", "2018-12"))
library(tidyverse)
df <- df %>% mutate(
year = str_split_fixed(DATE, "-", 2)[,1],
month = str_split_fixed(DATE, "-", 2)[,2]) %>%
pivot_wider(
names_from = year,
values_from = c(VALUE, DATE))
Then, you could filter and remove those values that you do not need according to your logic. I may not fully understand your system time here, but just assume it is the string "2019-10". It could be something like this:
df %>%
filter(!is.na(VALUE_2019)) %>%
mutate(
VALUE_2018 = ifelse(DATE_2019 == "2019-10", NA, VALUE_2018),
DATE_2018 = ifelse(DATE_2019 == "2019-10", NA, as.character(DATE_2018)))
# A tibble: 5 x 6
ID month VALUE_2019 VALUE_2018 DATE_2019 DATE_2018
<dbl> <chr> <dbl> <dbl> <fct> <chr>
1 1 08 10 5 2019-08 2018-08
2 2 05 98 12 2019-05 2018-05
3 3 03 45 33 2019-03 2018-03
4 4 10 67 NA 2019-10 NA
5 2 08 76 NA 2019-08 NA
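The answer above hard-codes "2019-10" and the 2018/2019 column names. A minimal sketch of deriving them from the system date instead, assuming "system date" simply means today's date. The names `today`, `sys_ym`, `this_year` and `last_year` are mine, and `today` is fixed to an arbitrary day only so the example is reproducible:

```r
# Assumption: in real use this would be today <- Sys.Date()
today <- as.Date("2019-10-15")

sys_ym    <- format(today, "%Y-%m")                   # year-month of the system date
this_year <- format(today, "%Y")                      # current year as a string
last_year <- as.character(as.integer(this_year) - 1)  # previous year as a string

# Column names that pivot_wider() would produce for these two years
value_cols <- paste0("VALUE_", c(this_year, last_year))
date_cols  <- paste0("DATE_",  c(this_year, last_year))
```

With these, the comparison `DATE_2019 == "2019-10"` could be written against `sys_ym`, and the columns could be selected by name through `value_cols` and `date_cols` rather than typed out.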

Related

Weekly Weight Based on a category using dplyr in R

I have the following data and looking to create the "Final Col" shown below using dplyr in R. I would appreciate your ideas.
| Year | Week | MainCat|Qty |Final Col |
|:----: |:------: |:-----: |:-----:|:------------:|
| 2017 | 1 | Edible |69 |69/(69+12) |
| 2017 | 2 | Edible |12 |12/(69+12) |
| 2017 | 1 | Flowers|88 |88/(88+47) |
| 2017 | 2 | Flowers|47 |47/(88+47) |
| 2018 | 1 | Edible |90 |90/(90+35) |
| 2018 | 2 | Edible |35 |35/(90+35) |
| 2018 | 1 | Flowers|78 |78/(78+85) |
| 2018 | 2 | Flowers|85 |85/(78+85) |
It can be done with a group_by operation i.e. grouped by 'Year', 'MainCat', divide the 'Qty' by the sum of 'Qty' to create the 'Final' column
library(dplyr)
df1 <- df1 %>%
group_by(Year, MainCat) %>%
mutate(Final = Qty/sum(Qty))
You can use prop.table :
library(dplyr)
df %>% group_by(Year, MainCat) %>% mutate(Final = prop.table(Qty))
# Year Week MainCat Qty Final
# <int> <int> <chr> <int> <dbl>
#1 2017 1 Edible 69 0.852
#2 2017 2 Edible 12 0.148
#3 2017 1 Flowers 88 0.652
#4 2017 2 Flowers 47 0.348
#5 2018 1 Edible 90 0.72
#6 2018 2 Edible 35 0.28
#7 2018 1 Flowers 78 0.479
#8 2018 2 Flowers 85 0.521
You can also do this in base R :
df$Final <- with(df, ave(Qty, Year, MainCat, FUN = prop.table))
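For a self-contained check, here is a minimal base R sketch that rebuilds the sample table from the question and confirms that the shares sum to 1 within each Year/MainCat group. The data frame construction and the `group_sums` check are my additions, not part of the answers above:

```r
# Sample data rebuilt from the table in the question
df1 <- data.frame(
  Year    = rep(c(2017L, 2018L), each = 4),
  Week    = rep(1:2, times = 4),
  MainCat = rep(rep(c("Edible", "Flowers"), each = 2), times = 2),
  Qty     = c(69, 12, 88, 47, 90, 35, 78, 85),
  stringsAsFactors = FALSE
)

# Share of each week's Qty within its Year/MainCat group
df1$Final <- with(df1, ave(Qty, Year, MainCat, FUN = prop.table))

# Sanity check: the shares in every group should add up to 1
group_sums <- tapply(df1$Final, list(df1$Year, df1$MainCat), sum)
```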

Swap results of a value from last years value in same month if the month-year combination is not equal to month-year of system

I've a table as under
+----+-------+---------+
| ID | VALUE | DATE |
+----+-------+---------+
| 1 | 10 | 2019-09 |
| 1 | 12 | 2018-09 |
| 2 | 13 | 2019-10 |
| 2 | 14 | 2018-10 |
| 3 | 67 | 2019-01 |
| 3 | 78 | 2018-01 |
+----+-------+---------+
I want to replace VALUE with last year's value for every ID where DATE != the year-month of the system date.
If DATE == the year-month of the system date, then just keep this year's value.
The resulting table I need is shown below:
+----+-------+---------+
| ID | VALUE | DATE |
+----+-------+---------+
| 1 | 12 | 2019-09 |
| 2 | 13 | 2019-10 |
| 3 | 78 | 2019-01 |
+----+-------+---------+
As Jon and Maurits noticed, your example is unclear: you give no line with what is a wrong format to you, and you mention "current year" but do not describe the expected output for the next year for instance.
Here is an attempt at code that answers your question:
library(dplyr)
x = read.table(text = "
ID VALUE DATE
1 10 2019-09
1 12 2018-09
1 12 2018-09-04
1 12 2018-99
2 13 2019-10
2 14 2018-10
3 67 2019-01
3 78 2018-01
", header=T)
x %>%
mutate(DATE = paste0(DATE, "-01") %>% as.Date("%Y-%m-%d")) %>%
group_by(ID) %>%
filter(DATE==max(DATE, na.rm=T))
I inserted two lines with a "wrong" format (according to me) and treated "current year" as the maximum year found in the column for each ID.
These may be wrong assumptions, but I would need more information to give a better answer.
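For comparison, a minimal base R sketch of the swap as the question words it. It assumes the system year-month is "2019-10" (so that it reproduces the expected table) and that each ID has at most one row per year-month. The names `sys_ym`, `VALUE_PREV` and `res` are mine:

```r
x <- data.frame(
  ID    = c(1, 1, 2, 2, 3, 3),
  VALUE = c(10, 12, 13, 14, 67, 78),
  DATE  = c("2019-09", "2018-09", "2019-10", "2018-10", "2019-01", "2018-01"),
  stringsAsFactors = FALSE
)

sys_ym   <- "2019-10"  # assumption; format(Sys.Date(), "%Y-%m") in real use
sys_year <- as.integer(substr(sys_ym, 1, 4))

x$year  <- as.integer(substr(x$DATE, 1, 4))
x$month <- substr(x$DATE, 6, 7)

# Split into this year's rows and last year's values, then pair them by ID and month
this_year <- x[x$year == sys_year, ]
last_year <- x[x$year == sys_year - 1, c("ID", "month", "VALUE")]
names(last_year)[3] <- "VALUE_PREV"
res <- merge(this_year, last_year, by = c("ID", "month"), all.x = TRUE)

# Swap in last year's value unless the row is for the system year-month
swap <- res$DATE != sys_ym & !is.na(res$VALUE_PREV)
res$VALUE[swap] <- res$VALUE_PREV[swap]
res <- res[order(res$ID), c("ID", "VALUE", "DATE")]
```

Rows without a match in the previous year keep their own value here; whether that is the desired behaviour is not stated in the question.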

parse values based on groups in R

I have a very large dataset and a sample of that looks something like the one below:
| Id | Name | Start_Date | End_Date |
|----|---------|------------|------------|
| 10 | Mark | 4/2/1999 | 7/5/2018 |
| 10 | | 1/1/2000 | 9/24/2018 |
| 25 | | 5/3/1968 | 6/3/2000 |
| 25 | | 6/6/2009 | 4/23/2010 |
| 25 | Anthony | 2/20/2010 | 7/21/2016 |
| 25 | | 9/12/2014 | 11/26/2019 |
I need to parse the names from Name column based on their Id such that the output table looks like:
| Id | Name | Start_Date | End_Date |
|----|---------|------------|------------|
| 10 | Mark | 4/2/1999 | 7/5/2018 |
| 10 | Mark | 1/1/2000 | 9/24/2018 |
| 25 | Anthony | 5/3/1968 | 6/3/2000 |
| 25 | Anthony | 6/6/2009 | 4/23/2010 |
| 25 | Anthony | 2/20/2010 | 7/21/2016 |
| 25 | Anthony | 9/12/2014 | 11/26/2019 |
How can I achieve an output as shown above? I went through the substitute and parse functions, but was unable to understand how they apply to this problem.
My dataset would be:
df=data.frame(Id=c("10","10","25","25","25","25"),Name=c("Mark","","","","Anthony",""),
Start_Date=c("4/2/1999", "1/1/2000","5/3/1968","6/6/2009","2/20/2010","9/12/2014"),
End_Date=c("7/5/2018","9/24/2018","6/3/2000","4/23/2010","7/21/2016","11/26/2019"))
We can change the blanks ("") to NA and use fill to replace the NA elements with the previous non-NA element
library(dplyr)
library(tidyr)
df1 %>%
mutate(Name = na_if(Name, "")) %>%
group_by(Id) %>%
fill(Name, .direction = "down") %>%
fill(Name, .direction = "up")
# A tibble: 6 x 4
# Groups: Id [2]
# Id Name Start_Date End_Date
# <chr> <chr> <chr> <chr>
#1 10 Mark 4/2/1999 7/5/2018
#2 10 Mark 1/1/2000 9/24/2018
#3 25 Anthony 5/3/1968 6/3/2000
#4 25 Anthony 6/6/2009 4/23/2010
#5 25 Anthony 2/20/2010 7/21/2016
#6 25 Anthony 9/12/2014 11/26/2019
In the devel version of tidyr (‘0.8.3.9000’), this can be done in a single fill statement as .direction = "downup" is also an option
df1 %>%
mutate(Name = na_if(Name, "")) %>%
group_by(Id) %>%
fill(Name, .direction = "downup")
Or another option is to group by 'Id', and mutate the 'Name' as the first non-blank element
df1 %>%
group_by(Id) %>%
mutate(Name = first(Name[Name!=""]))
# A tibble: 6 x 4
# Groups: Id [2]
# Id Name Start_Date End_Date
# <chr> <chr> <chr> <chr>
#1 10 Mark 4/2/1999 7/5/2018
#2 10 Mark 1/1/2000 9/24/2018
#3 25 Anthony 5/3/1968 6/3/2000
#4 25 Anthony 6/6/2009 4/23/2010
#5 25 Anthony 2/20/2010 7/21/2016
#6 25 Anthony 9/12/2014 11/26/2019
data
df1 <- structure(list(Id = c("10", "10", "25", "25", "25", "25"), Name = c("Mark",
"", "", "", "Anthony", ""), Start_Date = c("4/2/1999", "1/1/2000",
"5/3/1968", "6/6/2009", "2/20/2010", "9/12/2014"), End_Date = c("7/5/2018",
"9/24/2018", "6/3/2000", "4/23/2010", "7/21/2016", "11/26/2019"
)), class = "data.frame", row.names = c(NA, -6L))
Using DF defined reproducibly in the Note at the end, replace each zero-length element of Name with NA and then use na.omit to get the unique non-NA to use to fill. We have assumed that there is only one non-NA per Id which is the case in the question. If not we could replace na.omit with function(x) unique(na.omit(x)) assuming that the non-NAs are all the same within Id. No packages are used.
transform(DF, Name = ave(replace(Name, !nzchar(Name), NA), Id, FUN = na.omit))
giving:
Id Name Start_Date End_Date
1 10 Mark 4/2/1999 7/5/2018
2 10 Mark 1/1/2000 9/24/2018
3 25 Anthony 5/3/1968 6/3/2000
4 25 Anthony 6/6/2009 4/23/2010
5 25 Anthony 2/20/2010 7/21/2016
6 25 Anthony 9/12/2014 11/26/2019
na.strings
We can simplify this slightly if we make sure that the zero-length elements of Name are NA in the first place. We replace the read.table line in the Note with the first line below. Then it is just a matter of using ave with na.omit, without the replace step.
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE, sep = "|",
strip.white = TRUE, na.strings = "")
transform(DF, Name = ave(Name, Id, FUN = na.omit))
Note
The input in reproducible form:
Lines <- "
Id | Name | Start_Date | End_Date
10 | Mark | 4/2/1999 | 7/5/2018
10 | | 1/1/2000 | 9/24/2018
25 | | 5/3/1968 | 6/3/2000
25 | | 6/6/2009 | 4/23/2010
25 | Anthony | 2/20/2010 | 7/21/2016
25 | | 9/12/2014 | 11/26/2019"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE, sep = "|", strip.white = TRUE)

Calculate sum of a column if the difference between consecutive rows meets a condition

This is a continued question from the post Remove the first row from each group if the second row meets a condition
Below is a sample dataset:
library(dplyr)
df <- data.frame(id=c("9","9","9","5","5","4","4","4","4","4","20","20"),
Date=c("11/29/2018","11/29/2018","11/29/2018","2/13/2019","2/13/2019",
"6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/23/2018","12/25/2018","12/25/2018"),
Buyer= c("John","John","John","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul"),
Amount= c("959","1158","596","922","922","1849","4193","4256","65","100","313","99"), stringsAsFactors = F) %>%
group_by(Buyer,id) %>% mutate(diffs = c(NA, diff(as.Date(Date, format = "%m/%d/%Y"))))
which would look like:
| id | Date | Buyer | diff | Amount |
|----|:----------:|------:|------|--------|
| 9 | 11/29/2018 | John | NA | 959 |
| 9 | 11/29/2018 | John | 0 | 1158 |
| 9 | 11/29/2018 | John | 0 | 596 |
| 5 | 2/13/2019 | Maria | 76 | 922 |
| 5 | 2/13/2019 | Maria | 0 | 922 |
| 4 | 6/15/2018 | Sandy | -243 | 1849 |
| 4 | 6/20/2018 | Sandy | 5 | 4193 |
| 4 | 8/17/2018 | Sandy | 58 | 4256 |
| 4 | 8/20/2018 | Sandy | 3 | 65 |
| 4 | 8/23/2018 | Sandy | 3 | 100 |
| 20 | 12/25/2018 | Paul | 124 | 313 |
| 20 | 12/25/2018 | Paul | 0 | 99 |
I need to retain those records where based on each buyer and id, the sum of amount between consecutive rows >5000 if the difference between two consecutive rows <=5. So, for example, Buyer 'Sandy' with id '4' has two transactions of 1849 and 4193 on '6/15/2018' and '6/20/2018' within a gap of 5 days, and since the sum of these two amounts>5000, the output would have these records. Whereas, for the same Buyer 'Sandy' with id '4' has another transactions of 4256, 65 and 100 on '8/17/2018', '8/20/2018' and '8/23/2018' within a gap of 3 days each, but the output will not have these records as the sum of this amount <5000.
The final output would look like:
| id | Date | Buyer | diff | Amount |
|----|:---------:|------:|------|--------|
| 4 | 6/15/2018 | Sandy | -243 | 1849 |
| 4 | 6/20/2018 | Sandy | 5 | 4193 |
df <- data.frame(id=c("9","9","9","5","5","4","4","4","4","4","20","20"),
Date=c("11/29/2018","11/29/2018","11/29/2018","2/13/2019","2/13/2019",
"6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/23/2018","12/25/2018","12/25/2018"),
Buyer= c("John","John","John","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul"),
Amount= c("959","1158","596","922","922","1849","4193","4256","65","100","313","99"), stringsAsFactors = F) %>%
group_by(Buyer,id) %>% mutate(diffs = c(NA, diff(as.Date(Date, format = "%m/%d/%Y"))))
Changing Date from character to Date and Amount from character to numeric:
df$Date<-as.Date(df$Date, '%m/%d/%Y')
df$Amount<-as.numeric(df$Amount)
Now I group the dataset by id, arrange it by Date, and create a rank within each id (so, for example, Sandy gets ranks 1 through 5 for the 5 different days on which she shopped). Then I define a new variable called ConsecutiveSum, which adds the Amount of each row to the previous row's Amount (lag gives you the previous row). The ifelse statement forces ConsecutiveSum to be 0 if the previous row's Amount doesn't exist. The next step just enforces your conditions:
df %>%
group_by(id) %>%
arrange(Date) %>%
mutate(rank=dense_rank(Date)) %>%
mutate(ConsecutiveSum = ifelse(is.na(lag(Amount)),0,Amount + lag(Amount , default = 0)))%>%
filter(diffs<=5 & ConsecutiveSum>=5000 | ConsecutiveSum==0 & lead(ConsecutiveSum)>=5000)
# id Date Buyer Amount diffs rank ConsecutiveSum
# <chr> <chr> <chr> <dbl> <dbl> <int> <dbl>
# 1 4 6/15/2018 Sandy 1849 NA 1 0
# 2 4 6/20/2018 Sandy 4193 5 2 6042
I would use a combination of techniques available in tidyverse:
First create a grouping variable (new_id) that marks runs of consecutive transactions, then use the original id and new_id together to sum Amount within each run. We can then keep only the runs where the summed Amount is > 5000 and use a join or semi_join to filter the original rows down to those runs.
ids is a dataset that holds the total Amount for each id and new_id, filtered to where dollar > 5000. This gives you the id and new_id pairs that meet your criteria.
library(tidyverse)
df <- data.frame(id=c("9","9","9","5","5","4","4","4","4","4","20","20"),
Date=c("11/29/2018","11/29/2018","11/29/2018","2/13/2019","2/13/2019",
"6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/23/2018","12/25/2018","12/25/2018"),
Buyer= c("John","John","John","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul"),
Amount= c(959,1158,596,922,922,1849,4193,4256,65,100,313,99), stringsAsFactors = F) %>%
group_by(Buyer,id) %>% mutate(diffs = c(NA, diff(as.Date(Date, format = "%m/%d/%Y"))))
df1 <- df %>% mutate(Date = as.Date(Date , format = "%m/%d/%Y"),
tf1 = (id != lag(id, default = "")),
tf2 = (is.na(diffs) | diffs > 5))
df1$new_id <- cumsum(df1$tf1 + df1$tf2 > 0)
>df1
id Date Buyer Amount diffs days_post tf1 tf2 new_id
<chr> <date> <chr> <dbl> <dbl> <date> <lgl> <lgl> <int>
1 9 2018-11-29 John 959 NA 2018-12-04 TRUE TRUE 1
2 9 2018-11-29 John 1158 0 2018-12-04 FALSE FALSE 1
3 9 2018-11-29 John 596 0 2018-12-04 FALSE FALSE 1
4 5 2019-02-13 Maria 922 NA 2019-02-18 TRUE TRUE 2
5 5 2019-02-13 Maria 922 0 2019-02-18 FALSE FALSE 2
6 4 2018-06-15 Sandy 1849 NA 2018-06-20 TRUE TRUE 3
7 4 2018-06-20 Sandy 4193 5 2018-06-25 FALSE FALSE 3
8 4 2018-08-17 Sandy 4256 58 2018-08-22 FALSE TRUE 4
9 4 2018-08-20 Sandy 65 3 2018-08-25 FALSE FALSE 4
10 4 2018-08-23 Sandy 100 3 2018-08-28 FALSE FALSE 4
11 20 2018-12-25 Paul 313 NA 2018-12-30 TRUE TRUE 5
12 20 2018-12-25 Paul 99 0 2018-12-30 FALSE FALSE 5
ids <- df1 %>%
group_by(id, new_id) %>%
summarise(dollar = sum(Amount)) %>%
ungroup() %>% filter(dollar > 5000)
id new_id dollar
<chr> <int> <dbl>
1 4 3 6042
df1 %>% semi_join(ids)
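The same run-grouping idea can be sketched in base R. This is only an illustration under my own assumptions: a run starts at the first row of an id or whenever the gap to the previous row exceeds 5 days, and a run is kept when its total Amount is above 5000. The column names `run_id` and `run_total` are hypothetical:

```r
df <- data.frame(
  id     = c("9","9","9","5","5","4","4","4","4","4","20","20"),
  Date   = as.Date(c("11/29/2018","11/29/2018","11/29/2018","2/13/2019","2/13/2019",
                     "6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/23/2018",
                     "12/25/2018","12/25/2018"), format = "%m/%d/%Y"),
  Buyer  = c("John","John","John","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul"),
  Amount = c(959, 1158, 596, 922, 922, 1849, 4193, 4256, 65, 100, 313, 99),
  stringsAsFactors = FALSE
)

# Gap in days to the previous row within each id (NA for the first row of an id)
df$diffs <- ave(as.numeric(df$Date), df$id, FUN = function(d) c(NA, diff(d)))

# A new run starts at the first row of an id or after a gap of more than 5 days
df$run_id <- cumsum(is.na(df$diffs) | df$diffs > 5)

# Total Amount per run, repeated on every row of that run
df$run_total <- ave(df$Amount, df$run_id, FUN = sum)

res <- df[df$run_total > 5000, c("id", "Date", "Buyer", "diffs", "Amount")]
```

On the sample data this keeps only Sandy's two transactions from 6/15/2018 and 6/20/2018, whose amounts sum to 6042.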

Conditionally rolling up dates in R

Hi I am trying to work out a way to conditionally roll up dates in R.
Suppose I have the following table below and I want to roll dates up using the Flags variable. The Flag can either be 1 or 2 and dictates which subsequent dates can be linked up.
library(dplyr)
DateStart <- c("2018-01-01", "2018-01-04", "2018-01-05", "2018-01-09", "2018-01-12", "2018-01-20")
DateEnd <- c("2018-01-05", "2018-01-09", "2018-01-12", "2018-01-15", "2018-01-20", "2018-01-21")
IndexRecord <- c(1, NA, NA, NA, NA, NA)
Flag1 <- c(1,1,1,1,1,1)
Flag2 <- c(2,1,1,1,1,1)
Flag3 <- c(1,1,2,1,2,1)
df1 <- data.frame(DateStart = as.Date(DateStart),
DateEnd = as.Date(DateEnd),
IndexRecord = IndexRecord,
Flag1 = Flag1,
Flag2 = Flag2,
Flag3 = Flag3) %>%
arrange(DateStart)
df1
| | DateStart | DateEnd | IndexRecord | Flag1 | Flag2 | Flag3 |
|---|------------|------------|-------------|-------|-------|-------|
| 1 | 2018-01-01 | 2018-01-05 | 1 | 1 | 2 | 1 |
| 2 | 2018-01-04 | 2018-01-09 | NA | 1 | 1 | 1 |
| 3 | 2018-01-05 | 2018-01-12 | NA | 1 | 1 | 2 |
| 4 | 2018-01-09 | 2018-01-15 | NA | 1 | 1 | 1 |
| 5 | 2018-01-12 | 2018-01-20 | NA | 1 | 1 | 2 |
| 6 | 2018-01-20 | 2018-01-21 | NA | 1 | 1 | 1 |
A Flag with value of 1 for the current period means that for a subsequent row to have a valid link, the subsequent row must have DateStart occurring before DateEnd of the current row. Using Flag1 as the column of interest, the result would look like:
| | DateStart | DateEnd | IndexRecord | Flag1 |
|---|------------|------------|-------------|-------|
| 1 | 2018-01-01 | 2018-01-05 | 1 | 1 |
| 2 | 2018-01-04 | 2018-01-09 | NA | 1 |
| 3 | 2018-01-05 | 2018-01-12 | NA | 1 |
| 4 | 2018-01-09 | 2018-01-15 | NA | 1 |
| 5 | 2018-01-12 | 2018-01-20 | NA | 1 |
A Flag with value of 2 for the current period means that for a subsequent row to have a valid link, the subsequent row must have DateStart occurring on the DateEnd of the current row. Using Flag2 as the column of interest, the result would look like:
| | DateStart | DateEnd | IndexRecord | Flag2 |
|---|------------|------------|-------------|-------|
| 1 | 2018-01-01 | 2018-01-05 | 1 | 2 |
| 3 | 2018-01-05 | 2018-01-12 | NA | 1 |
| 4 | 2018-01-09 | 2018-01-15 | NA | 1 |
| 5 | 2018-01-12 | 2018-01-20 | NA | 1 |
One of the more complex cases could occur with patterns such as seen in Flag3 with the desired results:
| | DateStart | DateEnd | IndexRecord | Flag3 |
|---|------------|------------|-------------|-------|
| 1 | 2018-01-01 | 2018-01-05 | 1 | 1 |
| 2 | 2018-01-04 | 2018-01-09 | NA | 1 |
| 3 | 2018-01-05 | 2018-01-12 | NA | 2 |
| 5 | 2018-01-12 | 2018-01-20 | NA | 2 |
| 6 | 2018-01-20 | 2018-01-21 | NA | 1 |
Cheers,
J
Edits:
Since this was perhaps not clear let me clarify step by step.
Flag1.
We see that in Row 1 it ends on the 2018-01-05 and Flag1 is 1.
This means that for a subsequent row to be linked to this episode, the next episode's DateStart must occur before DateEnd of Row 1. Row 2, satisfies this condition since 2018-01-04 occurs before 2018-01-05 and therefore is valid link.
If we look at the remaining rows, all of these dates overlap in the same way except Row 6. Since Flag1 of Row 5 is 1 and Row 6 does not start before Row 5 ends, we cannot count Row 6, which is why the table stops at Row 5.
Total elapsed time is from 2018-01-01 to 2018-01-20.
Flag2.
Row 1 has Flag2 equal to 2, which means that only a subsequent DateStart of 2018-01-05 can be linked to this row. Therefore Row 2 is dropped. If we keep moving down, we see that Row 3 has a DateStart of 2018-01-05 and therefore can be linked to Row 1.
The remaining rows follow the same pattern as for Flag1, since Flag1 and Flag2 are identical from this point onwards.
As with Flag1, total elapsed time is from 2018-01-01 to 2018-01-20.
The elapsed time is the same as for Flag1, but the journey taken differs.
Flag3.
Flag3 for Row 1 and Row 2 are the same as in Flag1 which means at this point Row 1, Row 2, and Row 3 are kept as in the Flag1 example.
Flag3 for Row 3 however is 2. Since the DateEnd of Row 3 is 2018-01-12, only Row 5 can be linked and Row 4 is removed.
Since Row 5 has Flag3 of 2 and a DateEnd of 2018-01-20, Row 6 can also be linked to this set.
The total elapsed time for this set is from 2018-01-01 to 2018-01-21.
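The walkthrough above can be sketched as a simple loop in base R. This is only my reading of the rules, not a confirmed solution: it assumes the rows are already sorted by DateStart, that row 1 is the index record, that a Flag of 1 requires a strictly earlier DateStart, and that scanning continues past rows that fail to link. The function name `roll_up` is hypothetical:

```r
df1 <- data.frame(
  DateStart   = as.Date(c("2018-01-01", "2018-01-04", "2018-01-05",
                          "2018-01-09", "2018-01-12", "2018-01-20")),
  DateEnd     = as.Date(c("2018-01-05", "2018-01-09", "2018-01-12",
                          "2018-01-15", "2018-01-20", "2018-01-21")),
  IndexRecord = c(1, NA, NA, NA, NA, NA),
  Flag1       = c(1, 1, 1, 1, 1, 1),
  Flag2       = c(2, 1, 1, 1, 1, 1),
  Flag3       = c(1, 1, 2, 1, 2, 1)
)

roll_up <- function(df, flag_col) {
  keep <- 1L  # rows linked so far; assumes row 1 is the index record
  cur  <- 1L  # most recently linked row
  for (j in seq_len(nrow(df))[-1]) {
    linked <- if (df[[flag_col]][cur] == 1) {
      df$DateStart[j] < df$DateEnd[cur]   # Flag 1: must start before current row ends
    } else {
      df$DateStart[j] == df$DateEnd[cur]  # Flag 2: must start on the day current row ends
    }
    if (linked) {
      keep <- c(keep, j)
      cur  <- j
    }
  }
  df[keep, c("DateStart", "DateEnd", "IndexRecord", flag_col)]
}

res1 <- roll_up(df1, "Flag1")
res2 <- roll_up(df1, "Flag2")
res3 <- roll_up(df1, "Flag3")
```

Under these assumptions the kept rows match the three desired tables: rows 1 to 5 for Flag1, rows 1, 3, 4, 5 for Flag2, and rows 1, 2, 3, 5, 6 for Flag3.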
