I have weather data that I need to summarize across different date intervals.
Here is the weather data:
library(dplyr)
rainfall_data <- read.csv(text = "
date,rainfall_daily_mm
01/01/2019,0
01/02/2019,1
01/03/2019,3
01/04/2019,45
01/05/2019,0
01/06/2019,0
01/07/2019,0
01/08/2019,43
01/09/2019,5
01/10/2019,0
01/11/2019,55
01/12/2019,6
01/13/2019,0
01/14/2019,7
01/15/2019,0
01/16/2019,7
01/17/2019,8
01/18/2019,89
01/19/2019,65
01/20/2019,3
01/21/2019,0
01/22/2019,0
01/23/2019,2
01/24/2019,0
01/25/2019,0
01/26/2019,0
01/27/2019,0
01/28/2019,22
01/29/2019,3
01/30/2019,0
01/31/2019,0
") %>%
mutate(date = as.Date(date, format = "%d/%m/%Y"))
And here are the date intervals I need summaries for from the weather file:
intervals <- read.csv(text= "
treatment,initial,final
A,01/01/2019,01/05/2019
B,01/13/2019,01/20/2019
C,01/12/2019,01/26/2019
D,01/30/2019,01/31/2019
E,01/11/2019,01/23/2019
F,01/03/2019,01/19/2019
G,01/01/2019,01/24/2019
H,01/26/2019,01/28/2019
") %>%
mutate(initial = as.Date(initial, format = "%d/%m/%Y"),
final = as.Date(final, format = "%d/%m/%Y"))
The expected outcome is this one:
This is what I've tried based on a similar question:
summary_by_date_interval <- rainfall_data %>%
mutate(group = cumsum(grepl(intervals$initial|intervals$final, date))) %>%
group_by(group) %>%
summarise(rainfall = sum(rainfall_daily_mm))
And this is the error I got:
Error in `mutate()`:
! Problem while computing `group = cumsum(grepl(intervals$initial |
intervals$final, date))`.
Caused by error in `Ops.Date()`:
! | not defined for "Date" objects
Run `rlang::last_error()` to see where the error occurred.
Any help will be really appreciated.
First, the format %d/%m/%Y needs to be %m/%d/%Y (otherwise you'll get wrong dates and many NAs).
Then you could, for example, use lubridate's interval() and %within%:
library(dplyr)
library(lubridate)
intervals |>
group_by(treatment) |>
mutate(test = sum(rainfall_data$rainfall_daily_mm[rainfall_data$date %within% interval(initial, final)])) |>
ungroup()
Output:
# A tibble: 8 × 4
treatment initial final test
<chr> <date> <date> <int>
1 A 2019-01-01 2019-01-05 49
2 B 2019-01-13 2019-01-20 179
3 C 2019-01-12 2019-01-26 187
4 D 2019-01-30 2019-01-31 0
5 E 2019-01-11 2019-01-23 242
6 F 2019-01-03 2019-01-19 333
7 G 2019-01-01 2019-01-24 339
8 H 2019-01-26 2019-01-28 22
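
If you would rather not load lubridate, the same per-interval sum can be sketched in base R. This is only an illustration on the first five days of the data; the names `rain`, `ivals` and `total_mm`, and the interval "X", are made up for the example and are not from the question.

```r
# The dates are month/day/year, so "%d/%m/%Y" fails for any day above 12
is.na(as.Date("01/13/2019", format = "%d/%m/%Y"))  # TRUE

# First five days of the rainfall data, parsed with the corrected format
rain <- data.frame(
  date = as.Date(c("01/01/2019", "01/02/2019", "01/03/2019",
                   "01/04/2019", "01/05/2019"), format = "%m/%d/%Y"),
  rainfall_daily_mm = c(0, 1, 3, 45, 0)
)

# Interval A is from the question; interval X is hypothetical
ivals <- data.frame(
  treatment = c("A", "X"),
  initial = as.Date(c("2019-01-01", "2019-01-03")),
  final = as.Date(c("2019-01-05", "2019-01-04"))
)

# For each interval, sum the rainfall on days inside [initial, final]
ivals$total_mm <- vapply(seq_len(nrow(ivals)), function(i) {
  in_range <- rain$date >= ivals$initial[i] & rain$date <= ivals$final[i]
  sum(rain$rainfall_daily_mm[in_range])
}, numeric(1))

ivals
```

Both ends of each interval are included, which matches how %within% treats interval boundaries in the answer above.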
I have a dataframe 'my_data' which looks like this:
Calendar_Day Name
2018-03-31 ABC
2018-03-31 XYZ
2018-03-31 OPR
2019-01-31 ABC
2019-01-31 RTE
2019-10-31 YUD
2018-03-31 RYT
I would like to add a column that serves as a primary key, in the format
YEAR + MONTH + 6-digit sequence, e.g. 201803000001.
I am new to R and couldn't find a way to implement this.
The data frame should look like this:
Calendar_Day Name ID
2018-03-31 ABC 201803000001
2018-03-31 XYZ 201803000002
2018-03-31 OPR 201803000003
2019-01-31 ABC 201901000001
2019-01-31 RTE 201901000002
2019-10-31 YUD 201910000001
2018-03-31 RYT 201803000004
library(dplyr)
library(lubridate)
# d: the data from the question, with the date column named Date
d %>%
  mutate(Date = ymd(Date)) %>%
  group_by(tmp1 = year(Date), tmp2 = month(Date)) %>%
  mutate(ID = paste0(year(Date),
                     sprintf("%02d", month(Date)),
                     sprintf("%06d", row_number()))) %>% # six-digit sequence within each year-month
  ungroup() %>%
  select(-tmp1, -tmp2)
#> # A tibble: 7 x 3
#> Date Name ID
#> <date> <chr> <chr>
#> 1 2018-03-31 ABC 201803000001
#> 2 2018-03-31 XYZ 201803000002
#> 3 2018-03-31 OPR 201803000003
#> 4 2019-01-31 ABC 201901000001
#> 5 2019-01-31 RTE 201901000002
#> 6 2019-10-31 YUD 201910000001
#> 7 2018-03-31 RYT 201803000004
You could use the tidyverse package like this:
library(tidyverse)
mydata %>%
mutate(Date2 = format(Date, "%Y%m")) %>%
group_by(Date2) %>%
mutate(ID = paste0(Date2, str_pad(1:n(), width = 6, side = "left", pad = "0"))) %>%
ungroup() %>%
select(-Date2)
The main idea is to use the format function: format(mydate, "%Y") returns the year of a date object and format(mydate, "%m") returns the month, so format(mydate, "%Y%m") gives both at once.
I paste this together with the six-digit sequence.
I use str_pad to add leading zeros to the sequence.
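
For comparison, here is a minimal base R sketch of the same idea, assuming the column names from the question (`Calendar_Day`, `Name`) and only four of its rows. The helper names `ym` and `seq_in_month` are mine. `ave()` numbers the rows within each year-month and `sprintf()` does the zero padding.

```r
my_data <- data.frame(
  Calendar_Day = as.Date(c("2018-03-31", "2018-03-31", "2019-01-31", "2018-03-31")),
  Name = c("ABC", "XYZ", "ABC", "RYT")
)

# Year and month as one string, e.g. "201803"
ym <- format(my_data$Calendar_Day, "%Y%m")

# Running counter within each year-month, in row order
seq_in_month <- ave(seq_along(ym), ym, FUN = seq_along)

# Year-month followed by the counter, zero-padded to six digits
my_data$ID <- sprintf("%s%06d", ym, seq_in_month)

my_data
```

The ID is kept as a character string here; converting a 12-digit ID to a number would need a double rather than an integer, since it exceeds the integer range in R.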
I have hourly temperature measurements and I wish to calculate the average per day, but only for complete days (i.e. days with 24 measurements). Incomplete days should then be summarized as "NA".
I have grouped the values by year, month and day and called summarize().
I have three months of data missing, which appear as a gap in my ggplot, and that is what I want to achieve for the other incomplete periods as well. The problem is that when I call summarize() to calculate the mean of my values, days with only 1 or 2 measurements also get an average. Only days where all 24 values are missing appear as "NA".
Date TempUrb TempRur UHI
1 2011-03-21 22:00:00 10.1 11.67000 -1.570000
2 2011-03-21 23:00:00 9.9 11.67000 -1.770000
3 2011-03-22 00:00:00 10.9 11.11000 -0.210000
4 2011-03-22 01:00:00 10.7 10.56000 0.140000
5 2011-03-22 02:00:00 9.7 10.00000 -0.300000
6 2011-03-22 03:00:00 9.5 10.00000 -0.500000
7 2011-03-22 04:00:00 9.4 8.89000 0.510000
8 2011-03-22 05:00:00 8.4 8.33500 0.065000
9 2011-03-22 06:00:00 8.2 7.50000 0.700000
AvgUHI <- UHI %>% group_by(year(Date), add = TRUE) %>%
group_by(month(Date), add = TRUE) %>%
group_by(day(Date), add = TRUE, .drop = TRUE) %>%
summarize(AvgUHI = mean(UHI, na.rm = TRUE))
# A tibble: 2,844 x 4
# Groups: year(Date), month(Date) [95]
`year(Date)` `month(Date)` `day(Date)` AvgUHI
<int> <int> <int> <dbl>
1476 2015 4 4 0.96625000
1477 2015 4 5 -0.11909722
1478 2015 4 6 -0.60416667
1479 2015 4 7 -0.92916667
1480 2015 4 8 NA
1481 2015 4 9 NA
AvgUHI<- AvgUHI %>% group_by(`year(Date)`, add = TRUE) %>%
group_by(`month(Date)`, add = TRUE) %>%
summarize(AvgUHI= mean(AvgUHI, na.rm = TRUE))
# A tibble: 95 x 3
# Groups: year(Date) [9]
`year(Date)` `month(Date)` AvgUHI
<int> <int> <dbl>
50 2015 4 0.580887346
51 2015 5 0.453815051
52 2015 6 0.008479618
As you can see above on the final table, I have an average for 04-2015, while I am missing data on that month (08 - 09/04/2015 on this example represented on the second table).
The same happens when I calculate AvgUHI and I'm missing hourly data.
I simply would like to see on the last table the AvgUHI for 04-2015 be NA.
The following will give a dataframe aggregated by day, where only the complete days, with 24 observations, are not NA. Then you can group by month to get the final dataframe.
UHI %>%
mutate(Day = as.Date(Date)) %>%
group_by(Day) %>%
mutate(n = n(), tmpUHI = if_else(n == 24, UHI, NA_real_)) %>%
summarize(AvgUHI = mean(tmpUHI)) %>%
full_join(data.frame(Day = seq(min(.$Day), max(.$Day), by = "day"))) %>%
arrange(Day) -> AvgUHI
For hours look at Rui Barradas' answer. For months the following code worked:
AvgUHI %>%
  group_by(year(Day), month(Day)) %>%
  mutate(sum = sum(is.na(AvgUHI)), tmpUHI = if_else(sum <= 10, AvgUHI, NA_real_)) %>% # keep only months with at most 10 missing days
  summarise(AvgUHI = mean(tmpUHI, na.rm = TRUE)) -> AvgUHI
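
To see the "complete days only" rule in isolation, here is a small base R sketch on made-up data: one full day (24 hourly values) followed by a day with only 3 values. The names `hourly`, `day_key`, `daily` and `monthly` are illustrative, and the strict monthly rule (any NA day makes the month NA) is what the question asks for, not the 10-day tolerance used above.

```r
# 27 hourly timestamps: 24 on 2015-04-07, then 3 on 2015-04-08
stamps <- seq(as.POSIXct("2015-04-07 00:00", tz = "UTC"), by = "hour", length.out = 27)
hourly <- data.frame(Date = stamps, UHI = c(1:24, 5, 6, 7))

# Calendar day of each measurement, as text
day_key <- format(hourly$Date, "%Y-%m-%d")

# Daily mean only when all 24 hourly values are present, otherwise NA
daily <- tapply(hourly$UHI, day_key, function(v) {
  if (length(v) == 24) mean(v) else NA_real_
})

# Monthly mean without na.rm, so a single incomplete day makes the month NA
monthly <- tapply(as.vector(daily), substr(names(daily), 1, 7), mean)

daily
monthly
```

Note that a day with 24 rows but NA readings inside it would also come out as NA here, because `mean()` is called without `na.rm = TRUE`.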
I've got a data set with reservation data that has the below format :
property <- c('casa1', 'casa2', 'casa3')
check_in <- as.Date(c('2018-01-01', '2018-01-30','2018-02-28'))
check_out <- as.Date(c('2018-01-02', '2018-02-03', '2018-03-02'))
total_paid <- c(100,110,120)
df <- data.frame(property,check_in,check_out, total_paid)
My goal is to split each reservation's total_paid evenly across the days of the stay and assign the amounts to the correct month, for budgeting reasons.
There's no issue for casa1, but casa2 and casa3 each have days reserved in two different months, so the monthly totals get skewed.
Any help much appreciated!
Here you go:
library(dplyr)
library(tidyr)
df %>%
  mutate(id = seq_along(property), # helper columns: row id, price per night, first night
         day_paid = total_paid / as.numeric(check_out - check_in),
         date = check_in) %>%
  group_by(id) %>%
  complete(date = seq.Date(check_in, (check_out - 1), by = "day")) %>% # one row per night of the stay (check-out day excluded)
  ungroup() %>%
  mutate(month = cut(date, breaks = "month")) %>% # month each night falls in
  fill(property, check_in, check_out, total_paid, day_paid) %>% # copy the stay details onto the added rows
  group_by(id, month) %>%
  summarise(property = unique(property),
            check_in = unique(check_in),
            check_out = unique(check_out),
            total_paid = unique(total_paid),
            paid_month = sum(day_paid)) # total per stay and month
result:
# A tibble: 5 x 7
# Groups: id [3]
id month property check_in check_out total_paid paid_month
<int> <fct> <fct> <date> <date> <dbl> <dbl>
1 1 2018-01-01 casa1 2018-01-01 2018-01-02 100 100
2 2 2018-01-01 casa2 2018-01-30 2018-02-03 110 55
3 2 2018-02-01 casa2 2018-01-30 2018-02-03 110 55
4 3 2018-02-01 casa3 2018-02-28 2018-03-02 120 60
5 3 2018-03-01 casa3 2018-02-28 2018-03-02 120 60
I hope it's somewhat readable, but please ask if there is anything I should explain. By convention, guests don't pay for the check-out day, so I took that into account.
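
The arithmetic for a single stay can be checked by hand with a short base R sketch. It uses casa2 from the question; the names `nights`, `per_night` and `paid_by_month` are made up for the illustration.

```r
check_in <- as.Date("2018-01-30")
check_out <- as.Date("2018-02-03")
total_paid <- 110

# One entry per night of the stay; the check-out day itself is not paid
nights <- seq(check_in, check_out - 1, by = "day")

# Even split of the total across the nights
per_night <- total_paid / length(nights)

# Add up the nightly amounts within each calendar month
paid_by_month <- tapply(rep(per_night, length(nights)), format(nights, "%Y-%m"), sum)

paid_by_month
```

With four nights at 27.5 each, two fall in January and two in February, which agrees with the 55 / 55 split in the result table above.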
I've pieced together the code below from other SO answers, but I'm getting stuck with an error message. I searched SO for similar errors and resolutions but haven't been able to figure it out, so help is appreciated.
For every group ("id"), I want to get the difference between the start times for consecutive rows.
Reproducible data:
require(dplyr)
df <-data.frame(id=as.numeric(c("1","1","1","2","2","2")),
start= c("1/31/17 10:00","1/31/17 10:02","1/31/17 10:45",
"2/10/17 12:00", "2/10/17 12:20","2/11/17 09:40"))
time <- strptime(df$start, format = "%m/%d/%y %H:%M")
df %>%
group_by(id)%>%
mutate(diff = time - lag(time),
diff_mins = as.numeric(diff, units = 'mins'))
Gets me error:
Error in mutate_impl(.data, dots) :
Column diff must be length 3 (the group size) or one, not 6
In addition: Warning message:
In unclass(time1) - unclass(time2) :
longer object length is not a multiple of shorter object length
Do you mean something like this?
There is no need for lag here, a simple diff on the grouped times is sufficient.
df %>%
mutate(start = as.POSIXct(start, format = "%m/%d/%y %H:%M")) %>%
group_by(id) %>%
mutate(diff = c(0, diff(start)))
## A tibble: 6 x 3
## Groups: id [2]
# id start diff
# <dbl> <dttm> <dbl>
#1 1. 2017-01-31 10:00:00 0.
#2 1. 2017-01-31 10:02:00 2.
#3 1. 2017-01-31 10:45:00 43.
#4 2. 2017-02-10 12:00:00 0.
#5 2. 2017-02-10 12:20:00 20.
#6 2. 2017-02-11 09:40:00 1280.
You can use lag and difftime (per Hadley):
df %>%
mutate(time = as.POSIXct(start, format = "%m/%d/%y %H:%M")) %>%
group_by(id) %>%
mutate(diff = difftime(time, lag(time)))
# A tibble: 6 x 4
# Groups: id [2]
id start time diff
<dbl> <fct> <dttm> <time>
1 1. 1/31/17 10:00 2017-01-31 10:00:00 <NA>
2 1. 1/31/17 10:02 2017-01-31 10:02:00 2
3 1. 1/31/17 10:45 2017-01-31 10:45:00 43
4 2. 2/10/17 12:00 2017-02-10 12:00:00 <NA>
5 2. 2/10/17 12:20 2017-02-10 12:20:00 20
6 2. 2/11/17 09:40 2017-02-11 09:40:00 1280
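
One caveat with both answers: diff() and difftime() choose the unit automatically unless told otherwise, so on other data the numbers could come back in seconds or hours rather than minutes. A minimal base R sketch that fixes the unit to minutes, assuming the data from the question and a UTC time zone (the time zone and the column name `diff_mins` are my assumptions):

```r
df <- data.frame(
  id = c(1, 1, 1, 2, 2, 2),
  start = c("1/31/17 10:00", "1/31/17 10:02", "1/31/17 10:45",
            "2/10/17 12:00", "2/10/17 12:20", "2/11/17 09:40")
)

# Parse the timestamps; tz = "UTC" is an assumption made for reproducibility
df$time <- as.POSIXct(df$start, format = "%m/%d/%y %H:%M", tz = "UTC")

# Within each id: difference to the previous row in seconds, converted to minutes
df$diff_mins <- ave(as.numeric(df$time), df$id,
                    FUN = function(s) c(NA, diff(s)) / 60)

df
```

The first row of each id gets NA because there is no previous start time to compare with.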
Suppose I have a daily rain data.frame like this:
df.meteoro = data.frame(Dates = seq(as.Date("2017/1/19"), as.Date("2018/1/18"), "days"),
rain = rnorm(length(seq(as.Date("2017/1/19"), as.Date("2018/1/18"), "days"))))
I'm trying to sum the accumulated rain over 14-day intervals with this code:
library(tidyverse)
library(lubridate)
df.rain <- df.meteoro %>%
mutate(TwoWeeks = round_date(Dates, "14 days")) %>%
group_by(TwoWeeks) %>%
summarise(sum_rain = sum(rain))
The problem is that it isn't starting on 2017-01-19 but on 2017-01-15 and I was expecting my output dates to be:
"2017-02-02" "2017-02-16" "2017-03-02" "2017-03-16" "2017-03-30" "2017-04-13"
"2017-04-27" "2017-05-11" "2017-05-25" "2017-06-08" "2017-06-22" "2017-07-06" "2017-07-20"
"2017-08-03" "2017-08-17" "2017-08-31" "2017-09-14" "2017-09-28" "2017-10-12" "2017-10-26"
"2017-11-09" "2017-11-23" "2017-12-07" "2017-12-21" "2018-01-04" "2018-01-18"
TL;DR: I have a year-long daily rain data.frame and want to sum the accumulated rain for the dates above.
Please help.
Use of round_date in the way you have shown it will not give you 14-day periods as you might expect. I have taken a different approach in this solution and generated a sequence of dates between your first and last dates and grouped these into 14-day periods then joined the dates to your observations.
startdate = min(df.meteoro$Dates)
enddate = max(df.meteoro$Dates)
dateseq =
data.frame(Dates = seq.Date(startdate, enddate, by = 1)) %>%
mutate(group = as.numeric(Dates - startdate) %/% 14) %>%
group_by(group) %>%
mutate(starts = min(Dates))
df.rain <- df.meteoro %>%
right_join(dateseq) %>%
group_by(starts) %>%
summarise(sum_rain = sum(rain))
head(df.rain)
# A tibble: 6 x 2
starts sum_rain
<date> <dbl>
1 2017-01-19 6.09
2 2017-02-02 5.55
3 2017-02-16 -3.40
4 2017-03-02 2.55
5 2017-03-16 -0.12
6 2017-03-30 8.95
Using a right-join to the date sequence is to ensure that if there are missing observation days that spanned a complete time period you'd still get that period listed in the result (though in your case you have a complete year of dates anyway).
round_date rounds to the nearest multiple of unit (here, 14 days) since some epoch (probably the Unix epoch of 1970-01-01 00:00:00), which doesn't line up with your purpose.
To get what you want, you can do the following:
df.rain = df.meteoro %>%
mutate(days_since_start = as.numeric(Dates - as.Date("2017/1/18")),
TwoWeeks = as.Date("2017/1/18") + 14*ceiling(days_since_start/14)) %>%
group_by(TwoWeeks) %>%
summarise(sum_rain = sum(rain))
This computes days_since_start as the days since 2017/1/18 and then manually rounds to the next multiple of two weeks.
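
To check which labels this rounding produces, here is a small sketch that wraps the same arithmetic in a helper. `bin_end` is a hypothetical name of mine, and the second anchor date is my own variation, not part of the answer.

```r
# Round each date up to the next multiple of 14 days after the anchor
bin_end <- function(dates, anchor) {
  days_since_start <- as.numeric(dates - anchor)
  anchor + 14 * ceiling(days_since_start / 14)
}

d <- as.Date(c("2017-01-19", "2017-02-01", "2017-02-02", "2017-02-03"))

# Anchor used in the answer
labels_18 <- bin_end(d, as.Date("2017-01-18"))
format(labels_18)

# Anchor shifted by one day
labels_19 <- bin_end(d, as.Date("2017-01-19"))
format(labels_19)
```

With the anchor 2017-01-18 the labels are 2017-02-01, 2017-02-15 and so on, one day earlier than the dates listed in the question. Moving the anchor to 2017-01-19 gives 2017-02-02, 2017-02-16 and so on, but 2017-01-19 itself then forms a one-day group of its own, so pick whichever convention suits your periods.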
Assuming you want to round each date to the closest of the dates you have specified, I guess the following will work:
library(lubridate)
library(plyr) # for ddply() and .()
targetDates <- seq(ymd("2017-02-02"), ymd("2018-01-18"), by = "14 days")
# map each date to the nearest target date (ties go to the earlier target)
df.meteoro$Dates <- targetDates[sapply(df.meteoro$Dates, function(x) which.min(abs(interval(targetDates, x))))]
sum_rain <- ddply(df.meteoro, .(Dates), summarize, sum_rain = sum(rain, na.rm = TRUE))
As you can see, not all dates have the same number of observations. The date "2017-02-02", for instance, collects all the records from "2017-01-19" to "2017-02-09", which is 22 records. From "2017-02-10" on, dates are rounded to "2017-02-16", and so on.
This may be a cheat, but assuming each row/observation is a separate day, then why not just group by every 14 rows and sum.
# Assign interval groups, each 14 rows
df.meteoro$my_group <-rep(1:100, each=14, length.out=nrow(df.meteoro))
# Grab Interval Names
my_interval_names <- df.meteoro %>%
select(-rain) %>%
group_by(my_group) %>%
slice(1)
# Summarise
df.meteoro %>%
group_by(my_group) %>%
summarise(rain = sum(rain)) %>%
left_join(., my_interval_names)
#> Joining, by = "my_group"
#> # A tibble: 27 x 3
#> my_group rain Dates
#> <int> <dbl> <date>
#> 1 1 3.86 2017-01-19
#> 2 2 -0.581 2017-02-02
#> 3 3 -0.876 2017-02-16
#> 4 4 1.80 2017-03-02
#> 5 5 3.79 2017-03-16
#> 6 6 -3.50 2017-03-30
#> 7 7 5.31 2017-04-13
#> 8 8 2.57 2017-04-27
#> 9 9 -1.33 2017-05-11
#> 10 10 5.41 2017-05-25
#> # ... with 17 more rows
Created on 2018-03-01 by the reprex package (v0.2.0).
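
Since this approach relies on every row being exactly one day after the previous one, it may be worth checking that assumption first. A minimal sketch, using the date range from the question and the made-up names `dates` and `grp`:

```r
dates <- seq(as.Date("2017-01-19"), as.Date("2018-01-18"), by = "day")

# Guard: grouping by row position only works if no day is missing
stopifnot(all(diff(dates) == 1))

# Group number for each row: rows 1-14 are group 1, rows 15-28 group 2, ...
grp <- (seq_along(dates) - 1) %/% 14 + 1

# Number of days in each group
table(grp)
```

For this range there are 365 days, so the last group (number 27) contains a single day. If the guard fails on real data, the date-based grouping from the earlier answers is the safer choice.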