I have two columns with the start and end dates of every week. I need to aggregate another column on a monthly basis, taking the mean over the weeks of a particular month (the dataset spans 3 years), and create another column holding that value for the whole month (so it is the same value for the 4-6 weeks a given month has for a given ID; the dataset has thousands of IDs). The tricky part is that some weeks overlap two months, so one row must sometimes be counted in both months, e.g. a row with start_date = 2020-07-27 and end_date = 2020-08-09 has to be counted in both July and August.
This is my data:
ID  weight  start_date  end_date
60  1,2     2019-12-30  2020-01-05
60  1,4     2020-01-06  2020-01-12
60  1,3     2020-01-13  2020-01-19
60  1,0     2020-01-20  2020-01-26
60  3,8     2020-01-27  2020-02-02
61  1,7     2019-12-30  2020-01-05
61  12,9    2020-01-06  2020-01-12
I want to obtain:
ID  weight  start_date  end_date    Monthly_weight  Month
60  1,2     2019-12-30  2020-01-05  1,74            01.2020
60  1,4     2020-01-06  2020-01-12  1,74            01.2020
60  1,3     2020-01-13  2020-01-19  1,74            01.2020
60  1,0     2020-01-20  2020-01-26  1,74            01.2020
60  3,8     2020-01-27  2020-02-02  1,74            01.2020
61  1,7     2019-12-30  2020-01-05  7,3             01.2020
61  12,9    2020-01-06  2020-01-12  7,3             01.2020
At first I wanted to write a loop that detects every month in both columns and, where a month appears, takes the mean of the other column, but then I found a similar problem on Stack Overflow (How to convert weekly data into monthly data?) and decided to try zoo.
I tried to implement solution from the above post:
library(zoo)
z.st <- read.zoo(long_weights[c("start_date", "weight")])
z.en <- read.zoo(long_weights[c("end_date", "weight")])
z <- c(z.st, z.en)
g <- zoo(, seq(start(z), end(z), "day"))
m <- na.locf(merge(z, g))
aggregate(m, as.yearmon, mean)
but after this line:
z <- c(z.st, z.en)
I get the error: Error in bind.zoo(...) : indexes overlap
I also tried the following, but it does not take the overlapping weeks into account:
df <- df %>%
  group_by(HHKEY,
           month = floor_date((as.Date(end_date) - as.Date(start_date)) / 2 + as.Date(start_date), "month")) %>%
  mutate(monthly_weight = mean(weight), .after = end_date,
         month = format(month, "%Y.%m")) %>%
  ungroup()
A possible solution is to derive the grouping variable month from the end_date, except for an ID's last week, where the start_date's month is used, so that a week running into the next month still counts toward the month it started in. I extended the data to include a year change within an ID.
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(start_date = as.Date(start_date), end_date = as.Date(end_date),
         month = lead(format(start_date, "%m.%Y")),
         month = if_else(is.na(month),
                         format(start_date, "%m.%Y"), format(end_date, "%m.%Y"))) %>%
  group_by(ID, month) %>%
  mutate(monthly_weight = mean(weight), .before = month) %>%
  ungroup()
# A tibble: 14 × 6
ID weight start_date end_date monthly_weight month
<dbl> <dbl> <date> <date> <dbl> <chr>
1 60 1.2 2019-12-30 2020-01-05 1.74 01.2020
2 60 1.4 2020-01-06 2020-01-12 1.74 01.2020
3 60 1.3 2020-01-13 2020-01-19 1.74 01.2020
4 60 1 2020-01-20 2020-01-26 1.74 01.2020
5 60 3.8 2020-01-27 2020-02-02 1.74 01.2020
6 61 1.7 2019-12-30 2020-01-05 7.3 01.2020
7 61 12.9 2020-01-06 2020-01-12 7.3 01.2020
8 61 1.2 2020-12-29 2021-01-04 1.74 01.2021
9 61 1.4 2021-01-05 2021-01-11 1.74 01.2021
10 61 1.3 2021-01-12 2021-01-18 1.74 01.2021
11 61 1 2021-01-19 2021-01-25 1.74 01.2021
12 61 3.8 2021-01-26 2021-02-01 1.74 01.2021
13 63 1.7 2020-12-29 2021-01-04 7.3 01.2021
14 63 12.9 2021-01-05 2021-01-11 7.3 01.2021
Extended data:
df <- structure(list(ID = c(60, 60, 60, 60, 60, 61, 61, 61, 61, 61,
61, 61, 63, 63), weight = c(1.2, 1.4, 1.3, 1, 3.8, 1.7, 12.9,
1.2, 1.4, 1.3, 1, 3.8, 1.7, 12.9), start_date = structure(c(18260,
18267, 18274, 18281, 18288, 18260, 18267, 18625, 18632, 18639,
18646, 18653, 18625, 18632), class = "Date"), end_date = structure(c(18266,
18273, 18280, 18287, 18294, 18266, 18273, 18631, 18638, 18645,
18652, 18659, 18631, 18638), class = "Date")), row.names = c(NA,
-14L), class = c("tbl_df", "tbl", "data.frame"))
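The grouping above assigns each boundary week to a single month. If, as the question asks, a week that spans two months should enter both monthly means, one possible sketch (my own variant, not part of the answer above) duplicates such rows, one copy per month touched, using tidyr:

```r
library(dplyr)
library(tidyr)
library(lubridate)

df %>%
  mutate(start_month = floor_date(start_date, "month"),
         end_month   = floor_date(end_date, "month")) %>%
  # one row per (week, month) pair; weeks crossing a month boundary get two rows
  pivot_longer(c(start_month, end_month), values_to = "month") %>%
  distinct(ID, weight, start_date, end_date, month) %>%
  group_by(ID, month = format(month, "%m.%Y")) %>%
  mutate(monthly_weight = mean(weight)) %>%
  ungroup()
```

With this variant each boundary week contributes to both months' means, so the numbers differ from the single-month grouping shown above.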
Related
I have a time series with 15-minute intervals.
I would like to change it to a 1-hour interval using R, so that the measurements within each hour are added together.
Could you please help me with this?
And is it possible to change it after that from hours to months?
The data frame is as below:
timestamp (UTC) value
2020-06-11 22:15:00 5,841
2020-06-11 22:30:00 5,719
2020-06-11 22:45:00 5,841
2020-06-11 23:00:00 5,841
2020-06-11 23:15:00 5,597
2020-06-11 23:30:00 5,232
2020-06-11 23:45:00 5,476
2020-06-12 0:00:00 4,259
2020-06-12 0:15:00 0,243
2020-06-12 0:30:00 0,243
2020-06-12 0:45:00 0,365
2020-06-12 1:00:00 0,243
Depending on how you count: if every 15 minutes after the hour belongs to the next hour, use lubridate::ceiling_date (22:15 => 23:00); if it belongs to the same hour, use lubridate::floor_date (22:15 => 22:00).
library(dplyr)
library(lubridate)
# option 1
df1 %>%
  mutate(timestamp = ceiling_date(timestamp, unit = "hour")) %>%
  group_by(timestamp) %>%
  summarise(value = sum(value))
# A tibble: 3 × 2
timestamp value
<dttm> <dbl>
1 2020-06-11 23:00:00 23.2
2 2020-06-12 00:00:00 20.6
3 2020-06-12 01:00:00 1.09
# option 2
df1 %>%
  mutate(timestamp = floor_date(timestamp, unit = "hour")) %>%
  group_by(timestamp) %>%
  summarise(value = sum(value))
# A tibble: 4 × 2
timestamp value
<dttm> <dbl>
1 2020-06-11 22:00:00 17.4
2 2020-06-11 23:00:00 22.1
3 2020-06-12 00:00:00 5.11
4 2020-06-12 01:00:00 0.243
data:
df1 <- structure(list(timestamp = structure(c(1591906500, 1591907400,
1591908300, 1591909200, 1591910100, 1591911000, 1591911900, 1591912800,
1591913700, 1591914600, 1591915500, 1591916400), class = c("POSIXct",
"POSIXt"), tzone = ""), value = c(5.841, 5.719, 5.841, 5.841,
5.597, 5.232, 5.476, 4.259, 0.243, 0.243, 0.365, 0.243)), row.names = c(NA,
-12L), class = "data.frame")
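The question also asks about going from hours to months; the same pattern extends directly, just with unit = "month". A minimal sketch reusing df1 from above:

```r
library(dplyr)
library(lubridate)

df1 %>%
  mutate(month = floor_date(timestamp, unit = "month")) %>%
  group_by(month) %>%
  summarise(value = sum(value))
```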
Hi, I am trying to find the YTD change. The YTD formula is (current month value / last month of previous year) - 1. The result I would like to get is in column y.
For example, Jan-20 is (20/100)-1 and Feb-20 is (120/100)-1: all 2020 values are divided by the Dec-19 value, the last month of 2019.
And Jan-21 should be divided by the Dec-20 value, so it is (100/210)-1.
structure(list(date = structure(c(1575158400, 1577836800, 1580515200,
1583020800, 1585699200, 1588291200, 1590969600, 1593561600, 1596240000,
1598918400, 1601510400, 1604188800, 1606780800, 1609459200, 1612137600
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), x = c(100,
20, 120, 90, 100, 40, 55, 70, 90, 120, 290, 100, 210, 100, 130
), y = c(NA, -0.8, 0.2, -0.1, 0, -0.6, -0.45, -0.3, -0.1, 0.2,
1.9, 0, 1.1, -0.523809523809524, -0.380952380952381)), class = "data.frame", row.names =
c(NA, -15L))
date x y
2019-12-01 100 NA
2020-01-01 20 -0.8000000
2020-02-01 120 0.2000000
2020-03-01 90 -0.1000000
2020-04-01 100 0.0000000
2020-05-01 40 -0.6000000
2020-06-01 55 -0.4500000
2020-07-01 70 -0.3000000
2020-08-01 90 -0.1000000
2020-09-01 120 0.2000000
2020-10-01 290 1.9000000
2020-11-01 100 0.0000000
2020-12-01 210 1.1000000
2021-01-01 100 -0.5238095
2021-02-01 130 -0.3809524
Here's a solution using the tidyverse and lubridate packages. First we create a data frame called last_per_year that stores the last value for each year. Then in the main data frame, we calculate each date's "previous year", and use this to join with last_per_year. With that done, it's simple to perform the YTD calculation.
This technique would make it easy to select multiple columns in last_per_year, join those into the main data set, and compute whatever calculations are needed.
library(tidyverse)
library(lubridate)
last_per_year <- df %>% # YOUR DATA GOES HERE
group_by(year = year(date)) %>% # for each year...
slice_max(order_by = date) %>% # get the last date in each year
select(year, last_value = x) # output columns are "year" and "last_value" (renamed from "x")
year last_value
<dbl> <dbl>
1 2019 100
2 2020 210
3 2021 130
df.new <- df %>%
select(-y) %>% # removing your example output
mutate(
year = year(date),
prev_year = year - 1
) %>%
inner_join(last_per_year, by = c(prev_year = 'year')) %>% # joining with "last_per_year"
mutate(
ytd = x / last_value - 1
)
df.new
date x year prev_year last_value ytd
1 2020-01-01 20 2020 2019 100 -0.8000000
2 2020-02-01 120 2020 2019 100 0.2000000
3 2020-03-01 90 2020 2019 100 -0.1000000
4 2020-04-01 100 2020 2019 100 0.0000000
5 2020-05-01 40 2020 2019 100 -0.6000000
6 2020-06-01 55 2020 2019 100 -0.4500000
7 2020-07-01 70 2020 2019 100 -0.3000000
8 2020-08-01 90 2020 2019 100 -0.1000000
9 2020-09-01 120 2020 2019 100 0.2000000
10 2020-10-01 290 2020 2019 100 1.9000000
11 2020-11-01 100 2020 2019 100 0.0000000
12 2020-12-01 210 2020 2019 100 1.1000000
13 2021-01-01 100 2021 2020 210 -0.5238095
14 2021-02-01 130 2021 2020 210 -0.3809524
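If the data are guaranteed to be ordered with one row per month, a more compact base-R sketch (my own variant, not part of the answer above) indexes the previous year's last value directly:

```r
df$year <- as.integer(format(df$date, "%Y"))
# last value of x within each year, as a vector named by year
last_per_year <- tapply(df$x, df$year, function(v) v[length(v)])
# divide by the previous year's December value; NA where no previous year exists
df$ytd <- df$x / last_per_year[as.character(df$year - 1)] - 1
```

This relies on rows being date-sorted within each year, so that the last element really is December.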
I am working in R. I have a data frame that consists of Sampling Date and water temperature. I have provided a sample dataframe below:
Date Temperature
2015-06-01 11
2015-08-11 13
2016-01-12 2
2016-07-01 12
2017-01-08 4
2017-08-13 14
2018-03-04 7
2018-09-19 10
2019-08-24 8
Due to the erratic nature of the sampling dates (a consequence of the samplers' ability to access the site), I cannot classify years normally, January 1st to December 31st; instead I am using the beginning of the sampling period as the start of a year. In this case a year starts June 1st and ends May 31st, so that I can accurately compare the years to one another. Thus I want the four years to have the following labels:
Year_One = "2015-06-01" - "2016-05-31"
Year_Two = "2016-06-01" - "2017-05-31"
Year_Three = "2017-06-01" - "2018-05-31"
Year_Four = "2018-06-01" - "2019-08-24"
My goal is to create an additional column with these labels but have thus far been unable to do so.
I create two columns, year1 and year2, with two different approaches. The year2 approach requires all periods to start June 1st and end May 31st (in your labels, Year_Four ends 2019-08-24), so it may not be exactly what you need:
library(tidyverse)
library(lubridate)
dt$Date <- as.Date(dt$Date)
dt %>%
  mutate(year1 = case_when(between(Date, as.Date("2015-06-01"), as.Date("2016-05-31")) ~ "Year_One",
                           between(Date, as.Date("2016-06-01"), as.Date("2017-05-31")) ~ "Year_Two",
                           between(Date, as.Date("2017-06-01"), as.Date("2018-05-31")) ~ "Year_Three",
                           between(Date, as.Date("2018-06-01"), as.Date("2019-08-24")) ~ "Year_Four",
                           TRUE ~ "0")) %>%
  mutate(year2 = paste0(year(Date - months(5)), "/", year(Date - months(5)) + 1))
The output:
# A tibble: 9 x 4
Date Temperature year1 year2
<date> <dbl> <chr> <chr>
1 2015-06-01 11 Year_One 2015/2016
2 2015-08-11 13 Year_One 2015/2016
3 2016-01-12 2 Year_One 2015/2016
4 2016-07-01 12 Year_Two 2016/2017
5 2017-01-08 4 Year_Two 2016/2017
6 2017-08-13 14 Year_Three 2017/2018
7 2018-03-04 7 Year_Three 2017/2018
8 2018-09-19 10 Year_Four 2018/2019
9 2019-08-24 8 Year_Four 2019/2020
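The hard-coded case_when can also be replaced by arithmetic: a date in January-May belongs to the season that started the previous calendar year. A minimal sketch (my addition, assuming lubridate and a Date-class column):

```r
library(lubridate)

# subtract 1 for dates before June, e.g. 2016-01-12 -> season starting 2015
dt$season_start <- year(dt$Date) - (month(dt$Date) < 6)
dt$year_label   <- paste0(dt$season_start, "/", dt$season_start + 1)
```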
Use strftime to get the years, then make a factor with levels on the unique values. I'd recommend numbers instead of words, because they can be generated automatically; otherwise, use labels=c("one", "two", ...).
d <- within(d, {
year <- strftime(Date, "%Y")
year <- paste("Year", factor(year, labels=seq(unique(year))), sep="_")
})
# Date temperature year
# 1 2017-06-01 1 Year_1
# 2 2017-09-01 2 Year_1
# 3 2017-12-01 3 Year_1
# 4 2018-03-01 4 Year_2
# 5 2018-06-01 5 Year_2
# 6 2018-09-01 6 Year_2
# 7 2018-12-01 7 Year_2
# 8 2019-03-01 8 Year_3
# 9 2019-06-01 9 Year_3
# 10 2019-09-01 10 Year_3
# 11 2019-12-01 11 Year_3
# 12 2020-03-01 12 Year_4
# 13 2020-06-01 13 Year_4
Data:
d <- structure(list(Date = structure(c(17318, 17410, 17501, 17591,
17683, 17775, 17866, 17956, 18048, 18140, 18231, 18322, 18414
), class = "Date"), temperature = 1:13), class = "data.frame", row.names = c(NA,
-13L))
I have a dataset of temperature values taken at specific datetimes across five locations. For whatever reason, sometimes the readings are every hour and sometimes every four hours. Another issue is that when the time changed as a result of daylight saving, the readings are off by one hour. I am interested in the readings taken every four hours and would like to split these into day and night to ultimately get daily and nightly mean temperatures.
To summarise, the readings I am interested in are either:
0800, 1200, 1600 =day
2000, 0000, 0400 =night
Recordings between 0800-1600 and 2000-0400 each day should be averaged.
During daylight savings, the equivalent times are:
0900, 1300, 1700 =day
2100, 0100, 0500 =night
Recordings between 0900-1700 and 2100-0500 each day should be averaged.
In the process, I am hoping to subset by site.
There are also some NA values or blank cells which should be ignored.
So far, I tried to subset by one hour of interest just to see if it worked, but haven't got any further than that. Any tips on how to subset by a series of times of interest? Thanks!
temperature <- read.csv("SeaTemperatureData.csv",
stringsAsFactors = FALSE)
temperature <- subset(temperature, select=-c(X)) #remove last column that contains comments, not needed
temperature$Date.Time <- as.POSIXct(temperature$Date.Time,
                                    format="%d/%m/%Y %H:%M",
                                    tz="Pacific/Auckland")
#subset data by time, we only want to include temperatures recorded at certain times
temperature.goat <- subset(temperature, Date.Time==c('01:00:00'), select=c("Goat.Island"))
Date.Time Goat.Island Tawharanui Kawau Tiritiri Noises
1 2019-06-10 16:00:00 16.820 16.892 16.749 16.677 15.819
2 2019-06-10 20:00:00 16.773 16.844 16.582 16.654 15.796
3 2019-06-11 00:00:00 16.749 16.820 16.749 16.606 15.819
4 2019-06-11 04:00:00 16.487 16.796 16.654 16.558 15.796
5 2019-06-11 08:00:00 16.582 16.749 16.487 16.463 15.867
6 2019-06-11 12:00:00 16.630 16.773 16.725 16.654 15.867
One possible solution is to extract the hour from your DateTime variable, then filter for the particular hours of interest.
Here is a fake example over 4 days:
library(lubridate)
df <- data.frame(DateTime = seq(ymd_hms("2020-02-01 00:00:00"), ymd_hms("2020-02-05 00:00:00"), by = "hour"),
Value = sample(1:100,97, replace = TRUE))
DateTime Value
1 2020-02-01 00:00:00 99
2 2020-02-01 01:00:00 51
3 2020-02-01 02:00:00 44
4 2020-02-01 03:00:00 49
5 2020-02-01 04:00:00 60
6 2020-02-01 05:00:00 56
Now, you can extract hours with hour function of lubridate and subset for the desired hour:
library(lubridate)
subset(df, hour(DateTime) == 5)
DateTime Value
6 2020-02-01 05:00:00 56
30 2020-02-02 05:00:00 31
54 2020-02-03 05:00:00 65
78 2020-02-04 05:00:00 80
EDIT: getting the mean of each site per subset of hours
Per the OP's request in the comments, the goal is to calculate the mean of the values of several sites over different periods of time.
Basically, you want two periods per day, one from 8:00 to 17:00 and the other from 18:00 to 7:00.
Here, a more elaborated example based on the previous one:
df <- data.frame(DateTime = seq(ymd_hms("2020-02-01 00:00:00"), ymd_hms("2020-02-05 00:00:00"), by = "hour"),
Site1 = sample(1:100,97, replace = TRUE),
Site2 = sample(1:100,97, replace = TRUE))
DateTime Site1 Site2
1 2020-02-01 00:00:00 100 6
2 2020-02-01 01:00:00 9 49
3 2020-02-01 02:00:00 86 12
4 2020-02-01 03:00:00 34 55
5 2020-02-01 04:00:00 76 29
6 2020-02-01 05:00:00 41 1
....
So now you can label each time point as day or night, group by this category for each day, and calculate the mean of each individual site using summarise_at:
library(lubridate)
library(dplyr)
df %>% mutate(Date = date(DateTime),
Hour= hour(DateTime),
Category = ifelse(between(hour(DateTime),8,17),"Daily","Night")) %>%
group_by(Date, Category) %>%
summarise_at(vars(c(Site1,Site2)), ~ mean(., na.rm = TRUE))
# A tibble: 9 x 4
# Groups: Date [5]
Date Category Site1 Site2
<date> <chr> <dbl> <dbl>
1 2020-02-01 Daily 56.9 63.1
2 2020-02-01 Night 58.9 46.6
3 2020-02-02 Daily 54.5 47.6
4 2020-02-02 Night 36.9 41.7
5 2020-02-03 Daily 42.3 56.9
6 2020-02-03 Night 44.1 55.9
7 2020-02-04 Daily 54.3 50.4
8 2020-02-04 Night 54.8 34.3
9 2020-02-05 Night 75 16
Does that answer your question?
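The daylight-saving wrinkle from the question (readings at 0900/1300/1700 instead of 0800/1200/1600) is not handled above. One possible sketch, assuming DateTime carries a DST-aware time zone such as Pacific/Auckland, shifts DST readings back one hour with lubridate::dst() before classifying:

```r
library(dplyr)
library(lubridate)

df %>%
  mutate(Hour_std = hour(DateTime) - dst(DateTime),  # dst() is TRUE during DST, subtracting 1
         Category = ifelse(between(Hour_std, 8, 17), "Daily", "Night"))
```

This is an assumption on my part; it only works if the timestamps were parsed with the correct local time zone rather than UTC.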
I have an xts time series of temperature data at 5-minute resolution.
head(dataset)
Time Temp
2016-04-26 10:00:00 6.877
2016-04-26 10:05:00 6.877
2016-04-26 10:10:00 6.978
2016-04-26 10:15:00 6.978
2016-04-26 10:20:00 6.978
I want to identify all the periods in which the temperature exceeds a certain threshold (let's say 20 °C) and calculate their durations, in particular the longest one.
I create a data.frame from my xts-data:
df=data.frame(Time=index(dataset),coredata(dataset))
head(df)
Time Temp
1 2016-04-26 10:00:00 6.877
2 2016-04-26 10:05:00 6.877
3 2016-04-26 10:10:00 6.978
4 2016-04-26 10:15:00 6.978
5 2016-04-26 10:20:00 6.978
6 2016-04-26 10:25:00 7.079
then I create a subset with only the data that exceeds the threshold:
sub=(subset(x=df,subset = df$Temp>20))
head(sub)
Time Temp
7514 2016-05-22 12:05:00 20.043
7515 2016-05-22 12:10:00 20.234
7516 2016-05-22 12:15:00 20.329
7517 2016-05-22 12:20:00 20.424
7518 2016-05-22 12:25:00 20.615
7519 2016-05-22 12:30:00 20.805
But now I'm having trouble calculating the duration of each event where the temperature exceeds the threshold. I don't know how to identify a connected period and calculate its duration.
I would be happy if you have a solution to this question (it's my first thread, so please excuse minor mistakes). If you need more information on my data, feel free to ask.
This may work. Take this data as an example:
df <- structure(list(Time = structure(c(1463911500, 1463911800, 1463912100,
1463912400, 1463912700, 1463913000), class = c("POSIXct", "POSIXt"
), tzone = ""), Temp = c(20.043, 20.234, 6.329, 20.424, 20.615,
20.805)), row.names = c(NA, -6L), class = "data.frame")
> df
Time Temp
1 2016-05-22 12:05:00 20.043
2 2016-05-22 12:10:00 20.234
3 2016-05-22 12:15:00 6.329
4 2016-05-22 12:20:00 20.424
5 2016-05-22 12:25:00 20.615
6 2016-05-22 12:30:00 20.805
library(dplyr)
library(data.table)  # for rleid()
df %>%
  # add an id for the different periods/events
  mutate(tmp_Temp = Temp > 20, id = rleid(tmp_Temp)) %>%
  # keep only periods with high temperature
  filter(tmp_Temp) %>%
  # for each period/event, get its duration
  group_by(id) %>%
  summarise(event_duration = difftime(last(Time), first(Time)))
id event_duration
<int> <time>
1 1 5 mins
2 3 10 mins
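For just the longest stretch, a base-R sketch with rle works too, assuming a regular 5-minute grid:

```r
# run-length encoding of the above/below-threshold indicator
r <- rle(df$Temp > 20)
longest_run <- max(r$lengths[r$values])  # longest run of consecutive readings above 20
longest_run * 5                          # duration in minutes on a 5-minute grid
```

Note this counts readings (3 readings -> 15 minutes), whereas difftime(last, first) above measures first-to-last (10 minutes); pick whichever convention fits your definition of duration.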