I would like to be able to aggregate survey data collected over a range of days into a single period. For example, I'd like to combine the first three dates (2015-03-17, 2015-03-23, 2015-03-26) into the period "March 2015". I will then use these combined dates to produce boxplots which show "Average.Count" for each period.
All up I would like to make 4 unique periods:
March 2015 (first 3 dates as per table below)
September 2015 (dates 4,5 as per table below)
March 2016 (dates 6-15 as per table below)
September 2016 (dates 16-23 as per table below)
Here are the dataset headings.
head(Survival.Pre.Harvest)
Bay.Unique Date Average.Count Total.Predators Time Previous.Average.Count
2 1 2015-03-17 346.9 2 0 NA
3 1 2015-09-14 326.6 8 181 346.9
4 1 2016-02-29 322.6 3 349 326.6
7 2 2015-03-17 326.4 2 0 NA
8 2 2015-09-14 288.8 4 181 326.4
9 2 2016-02-29 271.4 6 349 288.8
These are the unique dates within the dataset.
table(Survival.Pre.Harvest$Date)
2015-03-17 2015-03-23 2015-03-26 2015-09-14 2015-09-15 2016-02-24 2016-02-25 2016-02-26 2016-02-29
9 3 1 9 3 4 6 6 5
2016-03-01 2016-03-02 2016-03-03 2016-03-04 2016-03-22 2016-03-23 2016-09-12 2016-09-13 2016-09-14
3 6 3 6 6 2 6 6 4
2016-09-20 2016-09-22 2016-10-18 2016-10-19 2016-10-20
7 10 4 3 14
Thanks in advance!
dput(head(Survival.Pre.Harvest))
structure(list(Bay.Unique = c(1, 1, 1, 2, 2, 2), Date = structure(c(16511,
16692, 16860, 16511, 16692, 16860), class = "Date"), Average.Count = c(346.9,
326.6, 322.6, 326.4, 288.8, 271.4), Total.Predators = c(2L, 8L,
3L, 2L, 4L, 6L), Time = c(0, 181, 349, 0, 181, 349), Previous.Average.Count = c(NA,
346.9, 326.6, NA, 326.4, 288.8)), .Names = c("Bay.Unique", "Date",
"Average.Count", "Total.Predators", "Time", "Previous.Average.Count"
), row.names = c(2L, 3L, 4L, 7L, 8L, 9L), class = "data.frame")
This should work:
library(lubridate)
library(ggplot2)
Survival.Pre.Harvest$Date <- ymd(Survival.Pre.Harvest$Date)
bks = ymd("2015-01-01", "2015-08-31", "2016-01-01", "2016-08-31", "2017-01-01")
lbs <- c("Mar2015", "Sep2015", "Mar2016", "Sep2016")
Survival.Pre.Harvest$yearmonth <- cut.Date(Survival.Pre.Harvest$Date, breaks = bks, labels = lbs)
ggplot(Survival.Pre.Harvest, aes(x=yearmonth, y=Average.Count)) + geom_boxplot()
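To check which period each survey date lands in before plotting, here is a minimal base-R sketch of the same binning. It assumes the break dates and labels from the answer above; the `dates` vector is a hypothetical subset of the survey dates, and `cut()` dispatches to the Date method, whose intervals are closed on the left by default.

```r
# Hypothetical subset of the survey dates from the question
dates <- as.Date(c("2015-03-17", "2015-03-26", "2015-09-14",
                   "2016-02-24", "2016-10-20"))

# Break points and labels as in the answer; each interval is [break, next break)
bks <- as.Date(c("2015-01-01", "2015-08-31", "2016-01-01",
                 "2016-08-31", "2017-01-01"))
lbs <- c("Mar2015", "Sep2015", "Mar2016", "Sep2016")

period <- cut(dates, breaks = bks, labels = lbs)

# Number of dates that fall into each period
table(period)
```

Note that the late-February 2016 surveys fall into "Mar2016" only because the third break is set at 2016-01-01; a break at the start of March would leave them out.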
I've seen various solutions for this question based on date only, but the time component is tripping me up. I have two data frames with POSIXct columns called 'datetime'. For DF1 that column has data rounded to the nearest hour. For DF2, the time component is not rounded to the nearest hour and can occur anytime. The dataframes look like this:
DF1
datetime              X   Y   Z
2020-09-01 03:00:00   1   3   4
2020-09-02 12:00:00  12   3   5
2020-09-02 22:00:00   4   9  19
2020-09-03 01:00:00   4  10   2
2020-09-04 06:00:00   4  12   1
2020-09-04 08:00:00  11  13  10
DF2
datetime             Var
2020-09-01 02:23:14  A
2020-09-01 03:12:09  B
2020-09-02 11:52:15  A
2020-09-02 12:15:44  B
2020-09-02 22:31:56  A
2020-09-02 21:38:05  B
2020-09-03 01:11:39  A
2020-09-03 00:59:33  B
2020-09-04 05:12:19  A
2020-09-04 06:07:09  B
2020-09-04 08:22:28  A
2020-09-04 07:50:17  B
What I want is to merge these two dataframes based on this column using the date and time that are closest in time to 'datetime' in DF1, so that it looks like this:
datetime              X   Y   Z  Var
2020-09-01 03:00:00   1   3   4  B
2020-09-02 12:00:00  12   3   5  A
2020-09-02 22:00:00   4   9  19  B
2020-09-03 01:00:00   4  10   2  B
2020-09-04 06:00:00   4  12   1  B
2020-09-04 08:00:00  11  13  10  B
Thank you!
Add helper columns for merge and group_by, merge on the calendar date, then use dplyr to keep the closest match in each group:
library(dplyr)
df1$tmp <- as.Date(df1$datetime)
df2$tmp <- as.Date(df2$datetime)
df1$grp <- 1:(nrow(df1))
merge(df1, df2, "tmp") %>%
  group_by(grp) %>%
  slice(which.min(abs(difftime(datetime.x, datetime.y)))) %>%
  ungroup() %>%
  select(-c(tmp, grp, datetime.y))
# A tibble: 6 × 5
datetime.x X Y Z Var
<chr> <int> <int> <int> <chr>
1 2020-09-01 03:00:00 1 3 4 B
2 2020-09-02 12:00:00 12 3 5 A
3 2020-09-02 22:00:00 4 9 19 B
4 2020-09-03 01:00:00 4 10 2 B
5 2020-09-04 06:00:00 4 12 1 B
6 2020-09-04 08:00:00 11 13 10 B
Data
df1 <- structure(list(datetime = c("2020-09-01 03:00:00", "2020-09-02 12:00:00",
"2020-09-02 22:00:00", "2020-09-03 01:00:00", "2020-09-04 06:00:00",
"2020-09-04 08:00:00"), X = c(1L, 12L, 4L, 4L, 4L, 11L), Y = c(3L,
3L, 9L, 10L, 12L, 13L), Z = c(4L, 5L, 19L, 2L, 1L, 10L)), class = "data.frame", row.names = c(NA,
-6L))
df2 <- structure(list(datetime = c("2020-09-01 02:23:14", "2020-09-01 03:12:09",
"2020-09-02 11:52:15", "2020-09-02 12:15:44", "2020-09-02 22:31:56",
"2020-09-02 21:38:05", "2020-09-03 01:11:39", "2020-09-03 00:59:33",
"2020-09-04 05:12:19", "2020-09-04 06:07:09", "2020-09-04 08:22:28",
"2020-09-04 07:50:17"), Var = c("A", "B", "A", "B", "A", "B",
"A", "B", "A", "B", "A", "B")), class = "data.frame", row.names = c(NA,
-12L))
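Because the merge above pairs rows by calendar date, a match that lies just across midnight would not be considered. As an alternative sketch (not the answer's method), the nearest row can be looked up directly on the timestamps in base R. The data below is a hypothetical subset of the question's tables, and the `nearest` vector is an illustrative name; the approach compares every df1 row with every df2 row, so it assumes df2 is small enough for that.

```r
# Hypothetical subset of DF1 and DF2, parsed as POSIXct in UTC
df1 <- data.frame(
  datetime = as.POSIXct(c("2020-09-01 03:00:00", "2020-09-02 12:00:00",
                          "2020-09-02 22:00:00"), tz = "UTC"),
  X = c(1L, 12L, 4L)
)
df2 <- data.frame(
  datetime = as.POSIXct(c("2020-09-01 02:23:14", "2020-09-01 03:12:09",
                          "2020-09-02 11:52:15", "2020-09-02 12:15:44",
                          "2020-09-02 22:31:56", "2020-09-02 21:38:05"),
                        tz = "UTC"),
  Var = c("A", "B", "A", "B", "A", "B")
)

# For each df1 row, index of the df2 row with the smallest absolute time gap
t2 <- as.numeric(df2$datetime)
nearest <- vapply(seq_len(nrow(df1)), function(i) {
  which.min(abs(t2 - as.numeric(df1$datetime[i])))
}, integer(1))

df1$Var <- df2$Var[nearest]
df1
```

For much larger tables a rolling join (for example in data.table) would likely scale better, but that is outside what this sketch shows.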
I have a dataset that looks like below:
PPID join_date week date visit
A 2017-10-01 1 NA 0
A 2017-10-01 2 2017-10-08 2
A 2017-10-01 3 2017-10-15 1
A 2017-10-01 4 NA 0
B 2017-05-23 1 2017-05-21 4
B 2017-05-23 2 2017-05-28 2
B 2017-05-23 3 NA 0
week indicates the difference between the Sunday of the week of join_date and date in weeks (e.g. for participant B, the Sunday of the week of 2017-05-23 is 2017-05-21; thus participant B's week1 starts on 2017-05-21, and week2 starts on 2017-05-28).
My goal is to fill in date where it is currently NA, such that the output looks like below:
PPID join_date week date visit
A 2017-10-01 1 2017-10-01 0
A 2017-10-01 2 2017-10-08 2
A 2017-10-01 3 2017-10-15 1
A 2017-10-01 4 2017-10-22 0
B 2017-05-23 1 2017-05-21 4
B 2017-05-23 2 2017-05-28 2
B 2017-05-23 3 2017-06-04 0
The code I currently have is:
library(dplyr)
library(lubridate)
df2 <- df %>%
  group_by(PPID) %>%
  mutate(date = seq(unique(floor_date(as.Date(join_date), "weeks")),
                    unique(floor_date(as.Date(join_date), "weeks") + 7*(max(week)-1)),
                    by="week"))
The problem with this approach is that I'm working with a large dataset (~8 million observations) and it takes forever to run! I read in some posts that all those date conversions/calculations (e.g. floor_date or as.Date) are what take so long, and was wondering if there are ways to make my code more efficient.
Thanks!
How about simply
df2$date = floor_date(df2$join_date, 'week') + 7*(df2$week-1)
# PPID join_date week date visit
# 1 A 2017-10-01 1 2017-10-01 0
# 2 A 2017-10-01 2 2017-10-08 2
# 3 A 2017-10-01 3 2017-10-15 1
# 4 A 2017-10-01 4 2017-10-22 0
# 5 B 2017-05-23 1 2017-05-21 4
# 6 B 2017-05-23 2 2017-05-28 2
# 7 B 2017-05-23 3 2017-06-04 0
Although this calculates floor_date for every row, it is vectorised rather than looping over groups (as you did implicitly using group_by), so it should be fast enough for most purposes. If you need even more speed-up, you could subset on is.na(df2$date) to only calculate the rows you need to impute.
Data:
df2 = structure(list(PPID = c("A", "A", "A", "A", "B", "B", "B"), join_date = structure(c(17440,
17440, 17440, 17440, 17309, 17309, 17309), class = "Date"), week = c(1L,
2L, 3L, 4L, 1L, 2L, 3L), date = structure(c(NA, 17447, 17454,
NA, 17307, 17314, NA), class = "Date"), visit = c(0L, 2L, 1L,
0L, 4L, 2L, 0L)), row.names = c(NA, -7L), class = "data.frame")
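The idea of imputing only the missing rows can be sketched as follows. This is a minimal illustration under stated assumptions: the data frame is a hypothetical subset of the question's data, and `sunday_floor` is a made-up base-R stand-in for `floor_date(x, "week")` that assumes weeks start on Sunday.

```r
# Hypothetical subset of the question's data, with two missing dates
df <- data.frame(
  PPID = c("A", "A", "B", "B", "B"),
  join_date = as.Date(c("2017-10-01", "2017-10-01",
                        "2017-05-23", "2017-05-23", "2017-05-23")),
  week = c(1L, 2L, 1L, 2L, 3L),
  date = as.Date(c(NA, "2017-10-08", "2017-05-21", "2017-05-28", NA))
)

# Base-R stand-in for floor_date(d, "week"): step back to the previous Sunday
# ($wday is 0 for Sunday, 1 for Monday, and so on)
sunday_floor <- function(d) d - as.POSIXlt(d)$wday

# Compute dates only for the rows that are actually missing
miss <- is.na(df$date)
df$date[miss] <- sunday_floor(df$join_date[miss]) + 7 * (df$week[miss] - 1L)
df
```

Whether the subsetting pays off depends on the share of missing rows; when most rows are NA, computing the whole column in one go is probably just as fast.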
I need to replace NAs with the mean of previous three values, by group.
Once an NA is replaced, it will serve as input for computing the mean corresponding to the next NA (if next NA is within the next three months).
Here is an example:
id date value
1 2017-04-01 40
1 2017-05-01 40
1 2017-06-01 10
1 2017-07-01 NA
1 2017-08-01 NA
2 2014-01-01 27
2 2014-02-01 13
Data:
dt <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L), date = structure(c(17257, 17287, 17318, 17348, 17379, 16071, 16102), class = "Date"), value = c(40, 40, 10, NA, NA, 27, 13)), row.names = c(1L, 2L, 3L, 4L, 5L, 8L, 9L), class = "data.frame")
The output should look like:
id date value
1 2017-04-01 40.00
1 2017-05-01 40.00
1 2017-06-01 10.00
1 2017-07-01 30.00
1 2017-08-01 26.66
2 2014-01-01 27.00
2 2014-02-01 13.00
where 26.66 = (30 + 10 + 40)/3
What is an efficient way to do this (i.e. to avoid for loops)?
The following uses base R only and does what you need.
sp <- split(dt, dt$id)
sp <- lapply(sp, function(DF){
  for(i in which(is.na(DF$value))){
    tmp <- DF[seq_len(i - 1), ]
    DF$value[i] <- mean(tail(tmp$value, 3))
  }
  DF
})
result <- do.call(rbind, sp)
row.names(result) <- NULL
result
# id date value
#1 1 2017-04-01 40.00000
#2 1 2017-05-01 40.00000
#3 1 2017-06-01 10.00000
#4 1 2017-07-01 30.00000
#5 1 2017-08-01 26.66667
#6 2 2014-01-01 27.00000
#7 2 2014-02-01 13.00000
Define a roll function which takes up to 3 previous values as a list plus the current value. It returns a list holding the previous 2 values followed by the current value if the current value is not NA, or followed by the mean of the previous values if it is NA. Use that with Reduce and pick off the last value of each list in the result. Then apply all that to each group using ave.
roll <- function(prev, cur) {
  prev <- unlist(prev)
  list(tail(prev, 2), if (is.na(cur)) mean(prev) else cur)
}
reduce_roll <- function(x) {
  sapply(Reduce(roll, init = x[1], x[-1], accumulate = TRUE), tail, 1)
}
transform(dt, value = ave(value, id, FUN = reduce_roll))
giving:
id date value
1 1 2017-04-01 40
2 1 2017-05-01 40
3 1 2017-06-01 10
4 1 2017-07-01 30
5 1 2017-08-01 26.66667
8 2 2014-01-01 27
9 2 2014-02-01 13
ID Date
1 1-1-2016
1 2-1-2016
1 3-1-2016
2 5-1-2016
3 6-1-2016
3 11-1-2016
3 12-1-2016
4 7-1-2016
5 9-1-2016
5 19-1-2016
5 20-1-2016
6 11-04-2016
6 12-04-2016
6 16-04-2016
6 04-08-2016
6 05-08-2016
6 06-08-2016
The expected data frame is based on runs of consecutive dates:
1st_Date is the date of the first visit
2nd_Date is the first date that starts a run of 2 consecutive days of visits
3rd_Date is the first date that starts a run of 3 consecutive days of visits
For example:
For ID = 1, he first visited on 1-1-2016, and his first runs of 2 and of 3 consecutive visits also began on 1-1-2016.
Similarly, for ID = 2, he only visited once, so the rest remain blank.
For ID = 3, he first visited on 6-1-2016 but first visited on 2 consecutive days starting on 11-1-2016.
NOTE: This only has to be done up to the earliest 3rd_Date.
Expected Output
ID 1st_Date 2nd_Date 3rd_Date
1 1-1-2016 1-1-2016 1-1-2016
2 5-1-2016 NA NA
3 6-1-2016 11-1-2016 NA
4 7-1-2016 NA NA
5 9-1-2016 19-1-2016 NA
6 11-04-2016 11-04-2016 04-08-2016
Here is an attempt using dplyr and tidyr. The first thing to do is to convert your Date column with as.Date and group_by the IDs. We next create a few new variables. The first one, new, checks to see which dates are consecutive. Date is then updated to give NA for those consecutive dates. However, if not all the dates are consecutive, then we filter out the ones that were converted to NA. We then fill (replace NA with the latest non-NA date for each ID), remove unwanted columns and spread.
library(dplyr)
library(tidyr)
df %>%
  mutate(Date = as.Date(Date, format = '%d-%m-%Y')) %>%
  group_by(ID) %>%
  mutate(new = cumsum(c(1, diff.difftime(Date, units = 'days'))),
         Date = replace(Date, c(0, diff(new)) == 1, NA),
         new1 = sum(is.na(Date)),
         new2 = seq(n())) %>%
  filter(!is.na(Date)|new1 != 1) %>%
  fill(Date) %>%
  select(-c(new, new1)) %>%
  spread(new2, Date) %>%
  select(ID:`3`)
# ID `1` `2` `3`
#* <int> <date> <date> <date>
#1 1 2016-01-01 2016-01-01 2016-01-01
#2 2 2016-01-05 <NA> <NA>
#3 3 2016-01-06 2016-01-11 <NA>
#4 4 2016-01-07 <NA> <NA>
#5 5 2016-01-09 2016-01-09 2016-01-09
With your updated data set, it gives
# ID `1` `2` `3`
#* <int> <date> <date> <date>
#1 1 2016-01-01 2016-01-01 2016-01-01
#2 2 2016-01-05 <NA> <NA>
#3 3 2016-01-06 2016-01-11 <NA>
#4 4 2016-01-07 <NA> <NA>
#5 5 2016-01-09 2016-01-19 <NA>
DATA USED
dput(df)
structure(list(ID = c(1L, 1L, 1L, 2L, 3L, 3L, 3L, 4L, 5L, 5L,
5L), Date = structure(c(1L, 5L, 7L, 8L, 9L, 2L, 3L, 10L, 11L,
4L, 6L), .Label = c("1-1-2016", "11-1-2016", "12-1-2016", "19-1-2016",
"2-1-2016", "20-1-2016", "3-1-2016", "5-1-2016", "6-1-2016",
"7-1-2016", "9-1-2016"), class = "factor")), .Names = c("ID",
"Date"), class = "data.frame", row.names = c(NA, -11L))
Use reshape. Code below assumes z is your data frame where date is a numeric date/time variable, ordered increasingly.
# a "set" variable represents a set of consecutive dates
z$set <- unsplit(tapply(z$date, z$ID, function(x) cumsum(diff(c(x[1], x)) > 1)), z$ID)
# "first.date" represents the first date in the set (of consecutive dates)
z$first.date <- unsplit(lapply(split(z$date, z[, c("ID", "set")]), min), z[, c("ID", "set")])
# "occurrence" is the consecutive occurrence number
z$occurrence <- unsplit(lapply(split(seq(nrow(z)), z$ID), seq_along), z$ID)
reshape(z[, c("ID", "first.date", "occurrence")], direction = "wide",
idvar = "ID", v.names = "first.date", timevar = "occurrence")
The result:
ID first.date.1 first.date.2 first.date.3
1 1 2016-01-01 2016-01-01 2016-01-01
4 2 2016-01-05 <NA> <NA>
5 3 2016-01-06 2016-01-11 2016-01-11
8 4 2016-01-07 <NA> <NA>
9 5 2016-01-09 2016-01-09 2016-01-09
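Both answers rest on labelling runs of consecutive dates. A minimal base-R sketch of that step for a single ID, assuming the dates are already sorted and parsed as Date, is given below. The names `run_id`, `starts`, `lens` and `first_run_of` are illustrative and not taken from either answer, and the sample dates are those of ID 3 in the question.

```r
# Visits for one ID (ID 3 in the question), sorted ascending
d <- as.Date(c("2016-01-06", "2016-01-11", "2016-01-12"))

# A new run starts at the first date and wherever the gap to the
# previous date is not exactly one day
run_id <- cumsum(c(TRUE, as.numeric(diff(d)) != 1))

starts <- d[!duplicated(run_id)]  # first date of each run
lens   <- tabulate(run_id)        # number of dates in each run

# First date that starts a run of at least k consecutive days (NA if none)
first_run_of <- function(k) starts[lens >= k][1]

first_run_of(1)
first_run_of(2)
first_run_of(3)
```

Applied per ID (for example with split and lapply), the three calls would give the 1st_Date, 2nd_Date and 3rd_Date columns as the question defines them.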
I have a list of items with 2 dates (start date and end date) and duration in days (end date - start date). I want to cut them into bins to show the number of "active items" in each bin, i.e. if start date <= bin date and end date > bin date, the item should be counted in the bin.
Item StartDate EndDate Duration
Machine1 2005/01/21 2011/03/29 2258
Machine2 2004/05/12 2012/05/08 2918
Machine3 2004/10/15 2005/09/10 330
Machine4 2004/08/30 2011/08/02 2528
Machine5 2005/06/06 2010/12/03 2006
Machine6 2004/05/11 2007/03/17 1040
Machine7 2005/08/09 2011/05/30 2120
Machine8 2005/01/06 2012/06/07 2709
Machine9 2005/06/13 2008/08/28 1172
Machine10 2005/06/28 2010/04/08 1745
Machine11 2004/11/09 2007/05/14 916
Machine12 2005/05/26 2012/09/16 2670
Machine13 2004/05/28 2009/06/09 1838
Machine14 2005/01/06 2012/05/25 2696
Machine15 2005/08/20 2012/02/11 2366
Machine16 2004/08/02 2011/10/23 2638
Machine17 2004/08/10 2009/03/15 1678
Machine18 2005/05/08 2006/04/17 344
Machine19 2005/08/26 2006/07/24 332
Machine20 2004/03/30 2006/05/07 768
Bin counts that I want to produce:
2004/01/01 0
2005/01/01 9
2006/01/01 19
2007/01/01 16
2008/01/01 14
2009/01/01 13
2010/01/01 11
2011/01/01 9
2012/01/01 5
2013/01/01 0
As you can see, the totals of the bins do not add up to the total number of items, as you would expect with a traditional histogram.
I can do this with some verbose code, but I'm sure there must be some short way, using cut or split. I'm aware that the bin labels are off by one according to my definition above, but let's ignore that for now.
One way is:
#turn dates to actual dates
DF$StartDate <- as.Date(DF$StartDate, "%Y/%m/%d")
DF$EndDate <- as.Date(DF$EndDate, "%Y/%m/%d")
binDF[,1] <- as.Date(binDF[,1], "%Y/%m/%d")
counts <- colSums(sapply(binDF[,1], function(x) {DF$StartDate <= x & DF$EndDate > x}))
#> counts
#[1] 0 9 19 16 14 13 11 9 5 0
And as a complete dataframe:
resDF <- data.frame(dates = binDF[,1], counts = counts, stringsAsFactors = FALSE)
#> resDF
# dates counts
#1 2004-01-01 0
#2 2005-01-01 9
#3 2006-01-01 19
#4 2007-01-01 16
#5 2008-01-01 14
#6 2009-01-01 13
#7 2010-01-01 11
#8 2011-01-01 9
#9 2012-01-01 5
#10 2013-01-01 0
The dataframes DF and binDF:
DF <- structure(list(Item = c("Machine1", "Machine2", "Machine3", "Machine4",
"Machine5", "Machine6", "Machine7", "Machine8", "Machine9", "Machine10",
"Machine11", "Machine12", "Machine13", "Machine14", "Machine15",
"Machine16", "Machine17", "Machine18", "Machine19", "Machine20"
), StartDate = c("2005/01/21", "2004/05/12", "2004/10/15", "2004/08/30",
"2005/06/06", "2004/05/11", "2005/08/09", "2005/01/06", "2005/06/13",
"2005/06/28", "2004/11/09", "2005/05/26", "2004/05/28", "2005/01/06",
"2005/08/20", "2004/08/02", "2004/08/10", "2005/05/08", "2005/08/26",
"2004/03/30"), EndDate = c("2011/03/29", "2012/05/08", "2005/09/10",
"2011/08/02", "2010/12/03", "2007/03/17", "2011/05/30", "2012/06/07",
"2008/08/28", "2010/04/08", "2007/05/14", "2012/09/16", "2009/06/09",
"2012/05/25", "2012/02/11", "2011/10/23", "2009/03/15", "2006/04/17",
"2006/07/24", "2006/05/07"), Duration = c(2258L, 2918L, 330L,
2528L, 2006L, 1040L, 2120L, 2709L, 1172L, 1745L, 916L, 2670L,
1838L, 2696L, 2366L, 2638L, 1678L, 344L, 332L, 768L)), .Names = c("Item",
"StartDate", "EndDate", "Duration"), class = "data.frame", row.names = c(NA,
-20L))
binDF <- structure(list(V1 = c("2004/01/01", "2005/01/01", "2006/01/01",
"2007/01/01", "2008/01/01", "2009/01/01", "2010/01/01", "2011/01/01",
"2012/01/01", "2013/01/01"), V2 = c(0L, 9L, 19L, 16L, 14L, 13L,
11L, 9L, 5L, 0L)), .Names = c("V1", "V2"), class = "data.frame", row.names = c(NA,
-10L))
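The counting rule from the question (StartDate <= bin date and EndDate > bin date) can be checked by hand on a small subset. The sketch below uses three of the machines (Machine3, Machine18, Machine20) and four of the bin dates; the vector names `start`, `end` and `bins` are illustrative, and vapply is used here in place of the sapply/colSums combination only to make the integer result type explicit.

```r
# Three machines from the question: Machine3, Machine18, Machine20
start <- as.Date(c("2004/10/15", "2005/05/08", "2004/03/30"), "%Y/%m/%d")
end   <- as.Date(c("2005/09/10", "2006/04/17", "2006/05/07"), "%Y/%m/%d")

# Four of the yearly bin dates
bins <- as.Date(c("2004/01/01", "2005/01/01", "2006/01/01", "2007/01/01"),
                "%Y/%m/%d")

# An item is active in a bin if it started on or before the bin date
# and ended after it
counts <- vapply(seq_along(bins), function(i) {
  sum(start <= bins[i] & end > bins[i])
}, integer(1))

data.frame(dates = bins, counts = counts)
```

Because an item is counted in every bin it spans, the counts are not expected to sum to the number of items.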