I am trying to calculate the date difference between the second row and the last row per group id. The data looks like this:
data <- data.frame(
  pid = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3),
  day = c("25/07/2018", "19/10/2018", "17/01/2019", "19/03/2019",
          "10/09/2018", "29/11/2018", "26/03/2019", "17/06/2016",
          "25/04/2018", "17/07/2018", "05/04/2019", "09/02/2021"),
  catt = c(1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 1, 2)
)
data
   pid        day catt
1    1 25/07/2018    1
2    1 19/10/2018    1
3    1 17/01/2019    2
4    1 19/03/2019    1
5    2 10/09/2018    1
6    2 29/11/2018    1
7    2 26/03/2019    2
8    3 17/06/2016    2
9    3 25/04/2018    2
10   3 17/07/2018    1
11   3 05/04/2019    1
12   3 09/02/2021    2
I use the following code to obtain a difference in months.
difftime("19/10/2018","19/03/2019 ", units = "days")/ (30)
difftime("29/11/2018","26/03/2019 ", units = "days")/ (30)
difftime("25/04/2018","09/02/2021 ", units = "days")/ (30)
The desired output:
pid        day difference
  1 25/07/2018
  1 19/10/2018
  1 17/01/2019
  1 19/03/2019       7.13
  2 10/09/2018
  2 29/11/2018
  2 26/03/2019       44.7
  3 17/06/2016
  3 25/04/2018
  3 17/07/2018
  3 05/04/2019
  3 09/02/2021   196.7667
But this is hard to do by hand for large data, so can anyone help with a lubridate() + slice() approach?
Convert day to a date object, then calculate the difference between the last and the second date for each pid.
library(dplyr)
library(lubridate)
data %>%
  mutate(day = dmy(day)) %>%
  arrange(pid, day) %>%
  group_by(pid) %>%
  summarise(difference = (last(day) - day[2]) / 30)
# pid difference
# <dbl> <dbl>
#1 1 5.03
#2 2 3.9
#3 3 34.0
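Since the question explicitly asks for slice(), here is an equivalent sketch that keeps only the second and the last row of each group before summarising (same data and conversion as above):
library(dplyr)
library(lubridate)

data %>%
  mutate(day = dmy(day)) %>%
  arrange(pid, day) %>%
  group_by(pid) %>%
  slice(c(2, n())) %>%   # keep the 2nd and the last row per pid
  summarise(difference = as.numeric(diff(day)) / 30)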
If you want to maintain the number of rows in the dataframe, use mutate and fill in the difference only on the last row of each group.
data %>%
  mutate(day = dmy(day)) %>%
  arrange(pid, day) %>%
  group_by(pid) %>%
  mutate(difference = ifelse(row_number() == n(), (last(day) - day[2]) / 30, NA))
Note that the output from difftime in the question is incorrect: the strings are in day/month/year order, but R's default parser reads them as year/month/day (so "19/10/2018" becomes a date in the year 19 with the day truncated), and the arguments are also in the wrong order. Parse with dmy() and put the later date first:
#Wrong output
difftime("19/10/2018","19/03/2019 ", units = "days")
#Time difference of 214 days
#Correct output
difftime(dmy("19/03/2019"), dmy("19/10/2018"), units = "days")
#Time difference of 151 days
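As an aside, dividing days by 30 only approximates months. If calendar months are what you actually want, lubridate can measure the gap directly; a minimal sketch:
library(lubridate)

# time_length() measures an interval in calendar-aware units
time_length(interval(dmy("19/10/2018"), dmy("19/03/2019")), "month")
# 5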
I have a database containing a list of events. Each event has an associated start date and a date when the event ended or was completed, e.g.:
dataset <- tibble(
  eventid = sample(1:100, 25, replace = TRUE),
  start_date = sample(seq(as.Date('2011/01/01'), as.Date('2012/01/01'), by = "day"), 25),
  completed_date = sample(seq(as.Date('2012/01/01'), as.Date('2014/01/01'), by = "day"), 25)
)
> dataset
# A tibble: 25 x 3
eventid start_date completed_date
<int> <date> <date>
1 57 2011-01-14 2013-01-07
2 97 2011-01-21 2011-03-03
3 58 2011-01-26 2011-02-05
4 25 2011-03-22 2013-07-20
5 8 2011-04-20 2012-07-16
6 81 2011-04-26 2013-03-04
7 42 2011-05-02 2012-01-16
8 77 2011-05-03 2012-08-14
9 78 2011-05-21 2013-09-26
10 49 2011-05-22 2013-01-04
# ... with 15 more rows
I am trying to produce a rolling "snapshot" of how many tasks were pending at different points in time, e.g. month by month. Expected result:
# A tibble: 25 x 2
month count
<date> <int>
1 2011-01-01 0
2 2011-02-01 3
3 2011-03-01 2
4 2011-04-01 2
5 2011-05-01 4
6 2011-06-01 8
I have attempted to group my variables using group_by(period=floor_date(start_date,"month")), but I'm a bit stuck and would appreciate a pointer in the right direction!
I would prefer a solution using dplyr if possible.
Thanks!
You can expand the rows for each month included in the range of dates with map2 from purrr. map2 iterates over multiple inputs simultaneously; in this case, it steps through the start and end dates at the same time.
In each iteration, it creates a monthly sequence using seq (or seq.Date) from the start to the end month (determined with floor_date). The result is nested for each row of data (since one row can have multiple months in the sequence), so unnest is needed afterwards.
The transmute will add a new variable called month_year (and drop the old ones) and use substr to extract the year and month only (no day). This is the first through seventh character of the date.
Then, you can group_by the month-year and count up the number of pending projects for each month_year.
I included set.seed so the data below is reproducible.
library(dplyr)
library(tidyr)
library(purrr)
library(lubridate)
dataset %>%
  mutate(month = map2(floor_date(start_date, "month"),
                      floor_date(completed_date, "month"),
                      seq.Date,
                      by = "month")) %>%
  unnest(month) %>%
  transmute(month_year = substr(month, 1, 7)) %>%
  group_by(month_year) %>%
  summarise(count = n())
Output
month_year count
<chr> <int>
1 2011-01 1
2 2011-02 3
3 2011-03 9
4 2011-04 10
5 2011-05 13
6 2011-06 15
7 2011-07 16
8 2011-08 18
9 2011-09 19
10 2011-10 20
# … with 22 more rows
If you want to exclude the completed month (except when the start and completed months are the same, if that can occur), you can shorten the sequence: subtracting one day from the floored completed date makes the sequence stop in the prior month. pmax ensures that when the start and completed months coincide, that month is still counted.
Here is the modified mutate with map2:
mutate(month = map2(floor_date(start_date, "month"),
                    pmax(floor_date(completed_date, "month") - 1,
                         floor_date(start_date, "month")),
                    seq.Date,
                    by = "month"))
Data
set.seed(123)
dataset <- tibble(
  eventid = sample(1:100, 25, replace = TRUE),
  start_date = sample(seq(as.Date('2011/01/01'), as.Date('2012/01/01'), by = "day"), 25),
  completed_date = sample(seq(as.Date('2012/01/01'), as.Date('2014/01/01'), by = "day"), 25)
)
I have hourly temperature measurements and I wish to calculate the daily average only for complete days (i.e. days with 24 measurements). Incomplete days should then be summarized as NA.
I have grouped the values per year, month and day and called summarize().
I have three months of data missing, which appear as a gap in my ggplot output, and that is what I want to achieve for the rest. The problem is that when I call summarize() to calculate the mean of my values, days with only 1 or 2 measurements also get averaged. Only days where all 24 values are missing appear as NA.
Date TempUrb TempRur UHI
1 2011-03-21 22:00:00 10.1 11.67000 -1.570000
2 2011-03-21 23:00:00 9.9 11.67000 -1.770000
3 2011-03-22 00:00:00 10.9 11.11000 -0.210000
4 2011-03-22 01:00:00 10.7 10.56000 0.140000
5 2011-03-22 02:00:00 9.7 10.00000 -0.300000
6 2011-03-22 03:00:00 9.5 10.00000 -0.500000
7 2011-03-22 04:00:00 9.4 8.89000 0.510000
8 2011-03-22 05:00:00 8.4 8.33500 0.065000
9 2011-03-22 06:00:00 8.2 7.50000 0.700000
AvgUHI <- UHI %>%
  group_by(year(Date), add = TRUE) %>%
  group_by(month(Date), add = TRUE) %>%
  group_by(day(Date), add = TRUE, .drop = TRUE) %>%
  summarize(AvgUHI = mean(UHI, na.rm = TRUE))
# A tibble: 2,844 x 4
# Groups: year(Date), month(Date) [95]
`year(Date)` `month(Date)` `day(Date)` AvgUHI
<int> <int> <int> <dbl>
1476 2015 4 4 0.96625000
1477 2015 4 5 -0.11909722
1478 2015 4 6 -0.60416667
1479 2015 4 7 -0.92916667
1480 2015 4 8 NA
1481 2015 4 9 NA
AvgUHI <- AvgUHI %>%
  group_by(`year(Date)`, add = TRUE) %>%
  group_by(`month(Date)`, add = TRUE) %>%
  summarize(AvgUHI = mean(AvgUHI, na.rm = TRUE))
# A tibble: 95 x 3
# Groups: year(Date) [9]
`year(Date)` `month(Date)` AvgUHI
<int> <int> <dbl>
50 2015 4 0.580887346
51 2015 5 0.453815051
52 2015 6 0.008479618
As you can see in the final table, I have an average for 04-2015 even though data is missing in that month (08/04/2015 and 09/04/2015 in this example, shown in the second table).
The same happens when I calculate AvgUHI and I'm missing hourly data.
I simply would like the AvgUHI for 04-2015 in the last table to be NA.
The following will give a dataframe aggregated by day, where only the complete days, those with 24 observations, are not NA. Then you can group by month to get the final dataframe.
UHI %>%
  mutate(Day = as.Date(Date)) %>%
  group_by(Day) %>%
  mutate(n = n(), tmpUHI = if_else(n == 24, UHI, NA_real_)) %>%
  summarize(AvgUHI = mean(tmpUHI)) %>%
  full_join(data.frame(Day = seq(min(.$Day), max(.$Day), by = "day"))) %>%
  arrange(Day) -> AvgUHI
For hours look at Rui Barradas' answer. For months the following code worked:
AvgUHI %>%
  group_by(year(Day), add = TRUE) %>%
  group_by(month(Day), add = TRUE) %>%
  mutate(sum = sum(is.na(AvgUHI)),
         tmpUHI = if_else(sum <= 10, AvgUHI, NA_real_)) %>%
  summarise(AvgUHI = mean(tmpUHI, na.rm = TRUE)) -> AvgUHI
Using R.
This is a small subset of my dataset, simplified to show only the relevant columns. The data is taken from Capital Bikeshare. The Start.date column below has exact rental times for a bike.
Start.date Member.type
2018-11-01 00:00:45 Member
2018-11-01 00:00:52 Casual
2018-11-01 00:01:46 Member
2018-11-01 01:00:02 Casual
2018-11-01 01:03:36 Member
What I'm trying to do is group all of the data by date and hour of day, with the number of each member type and the total number of rentals (Casual + Member) for any given hour of any given day. So, in the end, I'll just have "Day - Hour - Number of Rentals per member type" so I can predict trends by hour of the day.
Here is my relevant code:
library(dplyr)
library(lubridate)

bikeData <- read.csv("2011data.csv")
bikeData <- bikeData %>%
  mutate(Hour = format(strptime(
    Start.date, "%Y-%m-%d %H:%M:%S"), "%m-%d %H")) %>%
  mutate(day = wday(Start.date, label = TRUE))
groupData <- bikeData %>%
  mutate(Start.date = ymd_hms(Start.date)) %>%
  count(date1 = as.Date(Start.date), Hour1 = hour(Start.date),
        member = Member.type) %>%
  group_by(date1, Hour1) %>%
  arrange(date1, Hour1) %>%
  summarise(total = sum(n))
What this gives me is the following new dataset, groupData
date1 Hour1 total
2018-11-01 0 82
2018-11-01 1 43
2018-11-01 2 17
2018-11-01 3 4
2018-11-02 0 5
2018-11-02 1 24
So I was able to get the total number of Member + Casual rentals for all 24 hours of each day of my dataset, but how do I get two more columns, one with the number of Casual rentals and one with the number of Member rentals? Thanks!
Desired below:
date1 Hour1 total Casual Member
2018-11-01 0 82 40 42
2018-11-01 1 43 20 23
2018-11-01 2 17 10 7
2018-11-01 3 4 1 3
2018-11-02 0 5 1 4
2018-11-02 1 24 20 4
groupData <- bikeData %>%
  mutate(Start.date = ymd_hms(Start.date)) %>%
  count(date1 = as.Date(Start.date), Hour1 = hour(Start.date),
        member = Member.type) %>%
  group_by(date1, Hour1) %>%
  arrange(date1, Hour1) %>%
  summarise(total = sum(n),
            members = sum(n[member == "Member"]),
            casuals = sum(n[member == "Casual"]))
You can simply add two variables to your summarise call that sum the per-type counts where member equals each of the options. (Note that after count() the type column is called member and the counts live in n; the raw Member.type column is no longer available, so sum over n[member == ...].)
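Alternatively, here is a sketch with tidyr's pivot_wider (assuming the Member.type values are exactly "Casual" and "Member"): count once per type, spread the counts into columns, then add the total.
library(dplyr)
library(tidyr)
library(lubridate)

bikeData %>%
  mutate(Start.date = ymd_hms(Start.date)) %>%
  count(date1 = as.Date(Start.date), Hour1 = hour(Start.date),
        member = Member.type) %>%
  # one column per member type; hours with no rentals of a type get 0
  pivot_wider(names_from = member, values_from = n, values_fill = 0) %>%
  mutate(total = Casual + Member)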
So I have some data with a time stamp, and for each row, I want to count the number of rows that fall within a certain time window. For example, if I have the data below with a time stamp in h:mm (column ts), I want to count the number of rows that occur from that time stamp to five minutes in the past (column count). The first n rows that are less than five minutes from the first data point should be NAs.
ts data count
1:01 123 NA
1:02 123 NA
1:03 123 NA
1:04 123 NA
1:06 123 5
1:07 123 5
1:10 123 3
1:11 123 4
1:12 123 4
This is straightforward to do with a for loop, but I've been trying to implement with the apply() family and have not yet found any success. Any suggestions?
EDIT: modified to account for the potential for multiple readings per minute, raised in a comment.
Data with the new mid-minute reading: a sketch reconstructed from the question's example, assuming an extra observation at 1:06:30 and h:mm:ss timestamps so that ymd_hms() can parse them.
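# Reconstructed input (an assumption, not the poster's original data):
# the question's readings plus one extra mid-minute reading at 1:06:30
df <- data.frame(
  ts = c("1:01:00", "1:02:00", "1:03:00", "1:04:00", "1:06:00",
         "1:06:30", "1:07:00", "1:10:00", "1:11:00", "1:12:00"),
  data = 123
)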
library(dplyr)
df %>%
  # Take the text above and convert to datetime
  mutate(ts = lubridate::ymd_hms(paste(Sys.Date(), ts))) %>%
  # Count how many observations per minute
  group_by(ts_min = lubridate::floor_date(ts, "1 minute")) %>%
  summarize(obs_per_min = sum(!is.na(data))) %>%
  # Add rows for any missing minutes, count as zero observations
  padr::pad(interval = "1 min") %>%
  tidyr::replace_na(list(obs_per_min = 0)) %>%
  # Count cumulative observations, and calc how many in the window that
  # begins 5 minutes ago and ends at the end of the current minute
  mutate(cuml_count = cumsum(obs_per_min),
         prior_cuml = lag(cuml_count) %>% tidyr::replace_na(0),
         in_window = cuml_count - lag(prior_cuml, 5)) %>%
  # Exclude unneeded columns and rows
  select(-cuml_count, -prior_cuml) %>%
  filter(obs_per_min > 0)
Output (now reflects the additional reading at 1:06:30)
# A tibble: 9 x 3
ts_min obs_per_min in_window
<dttm> <dbl> <dbl>
1 2018-09-26 01:01:00 1 NA
2 2018-09-26 01:02:00 1 NA
3 2018-09-26 01:03:00 1 NA
4 2018-09-26 01:04:00 1 NA
5 2018-09-26 01:06:00 2 6
6 2018-09-26 01:07:00 1 6
7 2018-09-26 01:10:00 1 4
8 2018-09-26 01:11:00 1 5
9 2018-09-26 01:12:00 1 4
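For comparison, since the question asked about the apply() family: a base-R sketch that counts, for each timestamp, the readings in the closed five-minute window ending at that timestamp (assuming df$ts is already POSIXct). It reproduces the counts in the question:
# for each timestamp, count readings within [t - 5 min, t]
counts <- sapply(df$ts, function(t) sum(df$ts >= t - 5 * 60 & df$ts <= t))
# timestamps less than five minutes after the first reading get NA
counts[df$ts < min(df$ts) + 5 * 60] <- NA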
Suppose I have a daily rain data.frame like this:
df.meteoro = data.frame(Dates = seq(as.Date("2017/1/19"), as.Date("2018/1/18"), "days"),
rain = rnorm(length(seq(as.Date("2017/1/19"), as.Date("2018/1/18"), "days"))))
I'm trying to sum the accumulated rain over 14-day intervals with this code:
library(tidyverse)
library(lubridate)
df.rain <- df.meteoro %>%
  mutate(TwoWeeks = round_date(Dates, "14 days")) %>%
  group_by(TwoWeeks) %>%
  summarise(sum_rain = sum(rain))
The problem is that it isn't starting on 2017-01-19 but on 2017-01-15 and I was expecting my output dates to be:
"2017-02-02" "2017-02-16" "2017-03-02" "2017-03-16" "2017-03-30" "2017-04-13"
"2017-04-27" "2017-05-11" "2017-05-25" "2017-06-08" "2017-06-22" "2017-07-06" "2017-07-20"
"2017-08-03" "2017-08-17" "2017-08-31" "2017-09-14" "2017-09-28" "2017-10-12" "2017-10-26"
"2017-11-09" "2017-11-23" "2017-12-07" "2017-12-21" "2018-01-04" "2018-01-18"
TL;DR: I have a year-long daily rain data.frame and want to sum the accumulated rain over the 14-day intervals ending on the dates above.
Please help.
Using round_date in the way you have shown will not give you 14-day periods starting where you expect. I have taken a different approach in this solution: generate a sequence of dates between your first and last dates, group those into 14-day periods, then join the dates to your observations.
startdate = min(df.meteoro$Dates)
enddate = max(df.meteoro$Dates)

dateseq = data.frame(Dates = seq.Date(startdate, enddate, by = 1)) %>%
  mutate(group = as.numeric(Dates - startdate) %/% 14) %>%
  group_by(group) %>%
  mutate(starts = min(Dates))

df.rain <- df.meteoro %>%
  right_join(dateseq) %>%
  group_by(starts) %>%
  summarise(sum_rain = sum(rain))
> head(df.rain)
# A tibble: 6 x 2
starts sum_rain
<date> <dbl>
1 2017-01-19 6.09
2 2017-02-02 5.55
3 2017-02-16 -3.40
4 2017-03-02 2.55
5 2017-03-16 -0.12
6 2017-03-30 8.95
The right-join to the date sequence ensures that if missing observation days spanned a complete time period, you would still get that period listed in the result (though in your case you have a complete year of dates anyway).
round_date rounds to the nearest multiple of unit (here, 14 days) since some epoch (probably the Unix epoch of 1970-01-01 00:00:00), which doesn't line up with your purpose.
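You can see the epoch effect on the first date; per the question's output, the nearest boundary is 2017-01-15, not the start of the data:
library(lubridate)

round_date(as.Date("2017-01-19"), "14 days")
# "2017-01-15"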
To get what you want, you can do the following:
df.rain = df.meteoro %>%
  mutate(days_since_start = as.numeric(Dates - as.Date("2017/1/18")),
         TwoWeeks = as.Date("2017/1/18") + 14 * ceiling(days_since_start / 14)) %>%
  group_by(TwoWeeks) %>%
  summarise(sum_rain = sum(rain))
This computes days_since_start as the days since 2017/1/18 and then manually rounds to the next multiple of two weeks.
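A quick spot-check of the rounding: dates through 2017-02-01 share one label, and 2017-02-02 starts the next period (note these labels fall one day before the dates listed in the question; shift the anchor date if you need those exact labels):
d <- as.Date(c("2017-01-19", "2017-02-01", "2017-02-02"))
days_since_start <- as.numeric(d - as.Date("2017/1/18"))
as.Date("2017/1/18") + 14 * ceiling(days_since_start / 14)
# "2017-02-01" "2017-02-01" "2017-02-15"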
Assuming you want to round to the closest date among the ones you have specified, I guess the following will work:
library(lubridate)
library(plyr)

targetDates <- seq(ymd("2017-02-02"), ymd("2018-01-18"), by = '14 days')
df.meteoro$Dates <- targetDates[sapply(df.meteoro$Dates,
                                       function(x) which.min(abs(interval(targetDates, x))))]
sum_rain <- ddply(df.meteoro, .(Dates), summarize, sum_rain = sum(rain, na.rm = TRUE))
As you can see, not all dates have the same number of observations. Date "2017-02-02", for instance, gets all the records from "2017-01-19" through "2017-02-09", which is 22 records. From "2017-02-10" on, dates are rounded to "2017-02-16", and so on.
This may be a cheat, but assuming each row/observation is a separate day, why not just group every 14 rows and sum?
# Assign interval groups, each 14 rows
df.meteoro$my_group <- rep(1:100, each = 14, length.out = nrow(df.meteoro))

# Grab interval names
my_interval_names <- df.meteoro %>%
  select(-rain) %>%
  group_by(my_group) %>%
  slice(1)

# Summarise
df.meteoro %>%
  group_by(my_group) %>%
  summarise(rain = sum(rain)) %>%
  left_join(., my_interval_names)
#> Joining, by = "my_group"
#> # A tibble: 27 x 3
#> my_group rain Dates
#> <int> <dbl> <date>
#> 1 1 3.86 2017-01-19
#> 2 2 -0.581 2017-02-02
#> 3 3 -0.876 2017-02-16
#> 4 4 1.80 2017-03-02
#> 5 5 3.79 2017-03-16
#> 6 6 -3.50 2017-03-30
#> 7 7 5.31 2017-04-13
#> 8 8 2.57 2017-04-27
#> 9 9 -1.33 2017-05-11
#> 10 10 5.41 2017-05-25
#> # ... with 17 more rows
Created on 2018-03-01 by the reprex package (v0.2.0).