I have a question about filling in an order column based on eventdate, per ID.
In the table below, I would like to create a new column called order such that the row where eventdate is not NA gets 0. The rows before it should get consecutive negative integers counting backwards (-1, -2, ...), and the rows after it should get consecutive positive integers (1, 2, ...), within each ID.
+----+------------+-------+
| ID | eventdate | Value |
+----+------------+-------+
| 1 | NA | 10 |
| 1 | NA | 11 |
| 1 | NA | 12 |
| 1 | NA | 11 |
| 1 | 2011-03-18 | 15 |
| 1 | NA | 17 |
| 1 | NA | 18 |
| 1 | NA | 15 |
| 2 | NA | 5 |
| 2 | NA | 6 |
| 2 | NA | 7 |
| 2 | 2011-05-28 | 9 |
| 2 | NA | 10 |
| 2 | NA | 11 |
| 2 | NA | 15 |
| 2 | NA | 16 |
| 3 | NA | 20 |
| 3 | NA | 22 |
| 3 | NA | 23 |
| 3 | NA | 24 |
| 3 | 2012-05-28 | 28 |
| 3 | NA | 29 |
| 3 | NA | 25 |
| 3 | NA | 24 |
| 3 | NA | 26 |
| 3 | NA | 24 |
+----+------------+-------+
In short, I would like to make the following table
+----+------------+-------+-------+
| ID | eventdate | Value | order |
+----+------------+-------+-------+
| 1 | NA | 10 | -4 |
| 1 | NA | 11 | -3 |
| 1 | NA | 12 | -2 |
| 1 | NA | 11 | -1 |
| 1 | 2011-03-18 | 15 | 0 |
| 1 | NA | 17 | 1 |
| 1 | NA | 18 | 2 |
| 1 | NA | 15 | 3 |
| 2 | NA | 5 | -3 |
| 2 | NA | 6 | -2 |
| 2 | NA | 7 | -1 |
| 2 | 2011-05-28 | 9 | 0 |
| 2 | NA | 10 | 1 |
| 2 | NA | 11 | 2 |
| 2 | NA | 15 | 3 |
| 2 | NA | 16 | 4 |
| 3 | NA | 20 | -4 |
| 3 | NA | 22 | -3 |
| 3 | NA | 23 | -2 |
| 3 | NA | 24 | -1 |
| 3 | 2012-05-28 | 28 | 0 |
| 3 | NA | 29 | 1 |
| 3 | NA | 25 | 2 |
| 3 | NA | 24 | 3 |
| 3 | NA | 26 | 4 |
| 3 | NA | 24 | 5 |
+----+------------+-------+-------+
Thank you very much in advance!
You can find the position of the non-NA value in eventdate and subtract it from the row number within each group.
library(dplyr)
df <- df %>%
  group_by(ID) %>%
  mutate(order = row_number() - match(TRUE, !is.na(eventdate))) %>%
  ungroup()
df
# A tibble: 26 × 4
# ID eventdate Value order
# <int> <chr> <int> <int>
# 1 1 NA 10 -4
# 2 1 NA 11 -3
# 3 1 NA 12 -2
# 4 1 NA 11 -1
# 5 1 2011-03-18 15 0
# 6 1 NA 17 1
# 7 1 NA 18 2
# 8 1 NA 15 3
# 9 2 NA 5 -3
#10 2 NA 6 -2
# … with 16 more rows
In base R, the same thing can be written with ave():
df <- transform(df, order = ave(!is.na(eventdate), ID,
                                FUN = function(x) seq_along(x) - match(TRUE, x)))
I have the following data and am looking to create the "Final Col" shown below using dplyr in R.
Please note that the "Trees" main category has a Qty of 0 for week 1, 2017. In that case, I would like the final values for both weeks of that group to be 0.
I would appreciate your ideas.
| Year | Week | MainCat | Qty | Final Col  |
|:----:|:----:|:-------:|:---:|:----------:|
| 2017 |  1   | Edible  | 69  | 69/(69+12) |
| 2017 |  2   | Edible  | 12  | 12/(69+12) |
| 2017 |  1   | Trees   | 00  | 00         |
| 2017 |  2   | Trees   | 12  | 00         |
| 2017 |  1   | Flowers | 88  | 88/(88+47) |
| 2017 |  2   | Flowers | 47  | 47/(88+47) |
| 2018 |  1   | Edible  | 90  | 90/(90+35) |
| 2018 |  2   | Edible  | 35  | 35/(90+35) |
| 2018 |  1   | Trees   | 32  | 32/(32+12) |
| 2018 |  2   | Trees   | 12  | 12/(32+12) |
| 2018 |  1   | Flowers | 78  | 78/(78+85) |
| 2018 |  2   | Flowers | 85  | 85/(78+85) |
For this, you can use a combination of group_by() and a regular if ... else clause:
library(dplyr)
set.seed(0)
# dummy data
data <- tidyr::expand_grid(year = 2017:2018,
                           quarter = 1:2,
                           MainCat = LETTERS[1:3]) %>%
  mutate(Qty = sample(0:15, 3 * 2 * 2)) %>%
  arrange(year, MainCat, quarter)

data %>%
  group_by(year, MainCat) %>%
  mutate(finalCol = if (any(Qty == 0)) { 0 } else { Qty / sum(Qty) })
#> # A tibble: 12 x 5
#> # Groups: year, MainCat [6]
#> year quarter MainCat Qty finalCol
#> <int> <int> <chr> <int> <dbl>
#> 1 2017 1 A 13 0.684
#> 2 2017 2 A 6 0.316
#> 3 2017 1 B 8 0
#> 4 2017 2 B 0 0
#> 5 2017 1 C 3 0.75
#> 6 2017 2 C 1 0.25
#> 7 2018 1 A 12 0.632
#> 8 2018 2 A 7 0.368
#> 9 2018 1 B 10 0.476
#> 10 2018 2 B 11 0.524
#> 11 2018 1 C 2 0.333
#> 12 2018 2 C 4 0.667
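Applied to the data from the question itself (transcribed here by hand from the table above, so column names like FinalCol are my own choice), the same pattern zeroes out the 2017 Trees group:

```r
library(dplyr)

# The question's data, transcribed from the table above
df <- data.frame(
  Year    = rep(c(2017, 2018), each = 6),
  Week    = rep(1:2, times = 6),
  MainCat = rep(rep(c("Edible", "Trees", "Flowers"), each = 2), times = 2),
  Qty     = c(69, 12, 0, 12, 88, 47, 90, 35, 32, 12, 78, 85)
)

df <- df %>%
  group_by(Year, MainCat) %>%
  # scalar 0 is recycled across the group when any Qty is 0
  mutate(FinalCol = if (any(Qty == 0)) 0 else Qty / sum(Qty)) %>%
  ungroup()
```

Both 2017 Trees rows get 0, and e.g. the first Edible row gets 69/(69+12).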
I have a dataset like this in R:
Date | ID | Age |
2019-11-22 | 1 | 5 |
2018-12-21 | 1 | 4 |
2018-05-09 | 1 | 4 |
2018-05-01 | 2 | 5 |
2017-10-10 | 2 | 4 |
2017-07-21 | 1 | 3 |
How do I change the Age values of each group of ID to the most recent Age record?
Results should look like this:
Date | ID | Age |
2019-11-22 | 1 | 5 |
2018-12-21 | 1 | 5 |
2018-05-09 | 1 | 5 |
2018-05-01 | 2 | 5 |
2017-10-10 | 2 | 5 |
2017-07-21 | 1 | 5 |
I tried group_by(ID) %>% mutate(Age = max(Date, Age)),
but it gives strange huge numbers in certain cases when I run it on a very large dataset. What could be going wrong?
Try sorting by date first, then take the last Age in each group:
df %>%
  arrange(as.Date(Date)) %>%
  group_by(ID) %>%
  mutate(Age = last(Age))
which gives,
# A tibble: 6 x 3
# Groups: ID [2]
Date ID Age
<fct> <int> <int>
1 2017-07-21 1 5
2 2017-10-10 2 5
3 2018-05-01 2 5
4 2018-05-09 1 5
5 2018-12-21 1 5
6 2019-11-22 1 5
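If dplyr isn't available, a base R sketch of the same idea (using the question's column names, with the question's sample rows typed in by hand) would be:

```r
# Sample data from the question
df <- data.frame(
  Date = c("2019-11-22", "2018-12-21", "2018-05-09",
           "2018-05-01", "2017-10-10", "2017-07-21"),
  ID   = c(1, 1, 1, 2, 2, 1),
  Age  = c(5, 4, 4, 5, 4, 3)
)

# Sort by Date descending, keep the first (most recent) row per ID,
# then map each row's ID back to that most recent Age
ord    <- order(as.Date(df$Date), decreasing = TRUE)
latest <- df[ord, ][!duplicated(df$ID[ord]), ]
df$Age <- latest$Age[match(df$ID, latest$ID)]
```

Here every row ends up with Age 5, matching the expected output, and the original row order is preserved.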
I think the issue is in your mutate() call: max(Date, Age) mixes two different types, so Date can end up coerced to its underlying numeric representation (days, or seconds for POSIXct, since 1970-01-01), which would explain the strange huge numbers.
Try this:
df %>%
  group_by(ID) %>%
  arrange(as.Date(Date)) %>%
  mutate(Age = max(Age))
(max(Age) works here because Age only increases over time.)
Here's a sample of my dataset:
df = data.frame(id = c("9","9","9","5","5","5","4","4","4","4","4","20","20"),
                Date = c("11/29/2018","11/29/2018","11/29/2018","5/25/2018","2/13/2019","2/13/2019",
                         "6/7/2018","6/15/2018","6/20/2018","8/17/2018","8/20/2018","12/25/2018","12/25/2018"),
                Buyer = c("John","John","John","Maria","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul"))
I need to calculate the difference between dates which I have already done and the dataset then looks like:
| id | Date | Buyer | diff |
|----|:----------:|------:|------|
| 9 | 11/29/2018 | John | NA |
| 9 | 11/29/2018 | John | 0 |
| 9 | 11/29/2018 | John | 0 |
| 5 | 5/25/2018 | Maria | -188 |
| 5 | 2/13/2019 | Maria | 264 |
| 5 | 2/13/2019 | Maria | 0 |
| 4 | 6/7/2018 | Sandy | -251 |
| 4 | 6/15/2018 | Sandy | 8 |
| 4 | 6/20/2018 | Sandy | 5 |
| 4 | 8/17/2018 | Sandy | 58 |
| 4 | 8/20/2018 | Sandy | 3 |
| 20 | 12/25/2018 | Paul | 127 |
| 20 | 12/25/2018 | Paul | 0 |
Now, if the 'diff' value in the second row of a group is greater than or equal to 5, I need to delete the first row of that group. For example, the diff value 264 is greater than 5 for buyer 'Maria' (id '5'), so I want to delete the first row of that group: buyer 'Maria', id '5', Date '5/25/2018', diff -188.
Below is a sample of my code:
df1 = df %>%
  group_by(Buyer, id) %>%
  mutate(diff = c(NA, diff(Date))) %>%
  filter(!(diff >= 5 & row_number() == 1))
The problem is that the above code tests the first row instead of the second, and I don't know how to say "drop the first row of each group when the second row's diff value is greater than or equal to 5".
My expected output should look like:
| id | Date | Buyer | diff |
|----|:----------:|------:|------|
| 9 | 11/29/2018 | John | NA |
| 9 | 11/29/2018 | John | 0 |
| 9 | 11/29/2018 | John | 0 |
| 5 | 2/13/2019 | Maria | 264 |
| 5 | 2/13/2019 | Maria | 0 |
| 4 | 6/15/2018 | Sandy | 8 |
| 4 | 6/20/2018 | Sandy | 5 |
| 4 | 8/17/2018 | Sandy | 58 |
| 4 | 8/20/2018 | Sandy | 3 |
| 20 | 12/25/2018 | Paul | 127 |
| 20 | 12/25/2018 | Paul | 0 |
I think you forgot to provide the diff column in df. I created one called diffs so that it doesn't conflict with the function diff():
library(dplyr)
df1 %>%
  group_by(id) %>%
  mutate(diffs = c(NA, diff(as.Date(Date, format = "%m/%d/%Y")))) %>%
  filter(
    n() == 1 |         # always keep if only one row in group
    row_number() > 1 | # always keep rows after the first
    diffs[2] < 5       # keep 1st row only if 2nd row's diffs < 5
  ) %>%
  ungroup()
# A tibble: 11 x 4
id Date Buyer diffs
<chr> <chr> <chr> <dbl>
1 9 11/29/2018 John NA
2 9 11/29/2018 John 0
3 9 11/29/2018 John 0
4 5 2/13/2019 Maria 264
5 5 2/13/2019 Maria 0
6 4 6/15/2018 Sandy 8
7 4 6/20/2018 Sandy 5
8 4 8/17/2018 Sandy 58
9 4 8/20/2018 Sandy 3
10 20 12/25/2018 Paul NA
11 20 12/25/2018 Paul 0
Data (I added stringsAsFactors = FALSE):
df1 <- data.frame(id = c("9","9","9","5","5","5","4","4","4","4","4","20","20"),
                  Date = c("11/29/2018","11/29/2018","11/29/2018","5/25/2018","2/13/2019","2/13/2019",
                           "6/7/2018","6/15/2018","6/20/2018","8/17/2018","8/20/2018","12/25/2018","12/25/2018"),
                  Buyer = c("John","John","John","Maria","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul"),
                  stringsAsFactors = FALSE)
Maybe I overthought it, but here is one idea,
df %>%
  mutate(Date = as.Date(Date, format = '%m/%d/%Y')) %>%
  mutate(diff = c(NA, diff(Date))) %>%
  group_by(id) %>%
  mutate(diff1 = as.integer(diff >= 5) + row_number()) %>%
  filter(diff1 != 1 | lead(diff1) != 3) %>%
  select(-diff1)
which gives,
# A tibble: 11 x 4
# Groups: id [4]
id Date Buyer diff
<fct> <date> <fct> <dbl>
1 9 2018-11-29 John NA
2 9 2018-11-29 John 0
3 9 2018-11-29 John 0
4 5 2019-02-13 Maria 264
5 5 2019-02-13 Maria 0
6 4 2018-06-15 Sandy 8
7 4 2018-06-20 Sandy 5
8 4 2018-08-17 Sandy 58
9 4 2018-08-20 Sandy 3
10 20 2018-12-25 Paul 127
11 20 2018-12-25 Paul 0
I have a table of daily temperatures:
+-------+-----+-------------+
| City | Day | Temperature |
+-------+-----+-------------+
| Miami | 1 | 25 |
| Miami | 2 | 27 |
| Miami | 3 | 34 |
| Miami | 4 | 23 |
| Miami | 5 | 30 |
| Miami | 6 | 31 |
| Paris | 1 | 15 |
| Paris | 2 | 17 |
| Paris | 3 | 14 |
| Paris | 4 | 13 |
| Paris | 5 | 10 |
| Paris | 6 | 11 |
+-------+-----+-------------+
I would like to be able to summarize them by city in chunks of n days.
An example of the result with chunks of 3 days:
+-------+-----+---------------------+
| City | Day | AVGTemperature |
+-------+-----+---------------------+
| Miami | 1-3 | 28.66 |
| Miami | 4-6 | 29 |
| Paris | 1-3 | 15.33 |
| Paris | 4-6 | 14.5 |
+-------+-----+---------------------+
I could do
AVGTemp <- ddply(temp, .(Day, City), summarize, AVGTemperature = mean(Temperature))
but that gives me the average for every single day. Can I make it return chunks of n days?
Here's a dplyr solution. Change breaks from 3 to the number of chunks you'd like.
library(dplyr)
tab %>%
  mutate(day_group = cut(Day, 3, include.lowest = TRUE, labels = FALSE)) %>%
  group_by(City, day_group) %>%
  summarise(mean_temp = mean(Temperature),
            start_day = min(Day),
            end_day = max(Day))
# Source: local data frame [6 x 5]
# Groups: City [?]
#
# City day_group mean_temp start_day end_day
# (fctr) (int) (dbl) (int) (int)
# 1 Miami 1 26.0 1 2
# 2 Miami 2 28.5 3 4
# 3 Miami 3 30.5 5 6
# 4 Paris 1 16.0 1 2
# 5 Paris 2 13.5 3 4
# 6 Paris 3 10.5 5 6
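Note that cut(Day, 3) splits the day range into three equal-width intervals (1-2, 3-4, 5-6 above), not chunks of three days. To get fixed chunks of n days, as in the question's 1-3 / 4-6 example, integer division is a simple alternative (a sketch, assuming Day starts at 1 and using the question's data typed in by hand):

```r
library(dplyr)

n <- 3  # chunk size in days
tab <- data.frame(
  City = rep(c("Miami", "Paris"), each = 6),
  Day  = rep(1:6, times = 2),
  Temperature = c(25, 27, 34, 23, 30, 31, 15, 17, 14, 13, 10, 11)
)

res <- tab %>%
  mutate(day_group = (Day - 1) %/% n) %>%   # 0 for days 1-3, 1 for days 4-6
  group_by(City, day_group) %>%
  summarise(AVGTemperature = mean(Temperature),
            Day = paste(min(Day), max(Day), sep = "-"),
            .groups = "drop")
res
```

This yields one row per city per n-day chunk, with a "1-3" / "4-6" style Day label like in the question.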