I have a data set like below:
Timestamp Value1 Value2
2020-10-29 05:00:00 10 20
2020-10-29 05:00:01 10 20
2020-10-29 05:00:02 11 22
2020-10-29 05:00:03 11 22
and so on, in one-second intervals, up to a few hours of data. I want to generate an average value every two minutes, but left-align the data. Essentially, the two-minute average at 2020-10-29 05:00:00 should be the average of the data points between 2020-10-29 05:00:00 and 2020-10-29 05:01:59.
I have used data %>% group_by(Timestamp = cut(Timestamp, breaks = "2 min")) %>% summarize(Meanval1 = mean(Value1), Meanval2 = mean(Value2)), but this right-aligns the data. How can I left-align it?
Thanks!
You can round down the Timestamp column to the nearest two minutes using lubridate::floor_date. If you then group_by this new column, you will get a left-aligned two-minute mean:
library(dplyr)
df %>%
  mutate(time = lubridate::floor_date(TimeStamp, "2 minutes")) %>%
  group_by(time) %>%
  summarize(mean_val1 = mean(Value1), mean_val2 = mean(Value2))
#> # A tibble: 9 x 3
#> time mean_val1 mean_val2
#> <dttm> <dbl> <dbl>
#> 1 2020-10-29 05:00:00 10.2 19.9
#> 2 2020-10-29 05:02:00 9.84 20.0
#> 3 2020-10-29 05:04:00 10.1 19.9
#> 4 2020-10-29 05:06:00 9.72 20.3
#> 5 2020-10-29 05:08:00 9.98 19.9
#> 6 2020-10-29 05:10:00 9.98 20.0
#> 7 2020-10-29 05:12:00 10.1 20.0
#> 8 2020-10-29 05:14:00 10.0 20.1
#> 9 2020-10-29 05:16:00 10.0 20.2
Data used
set.seed(69)
t <- seq(as.POSIXct("2020-10-29 05:00:00"), by = "1 sec", length.out = 1000)
df <- data.frame(TimeStamp = t,
                 Value1 = sample(8:12, 1000, TRUE),
                 Value2 = sample(18:22, 1000, TRUE))
head(df)
#> TimeStamp Value1 Value2
#> 1 2020-10-29 05:00:00 8 20
#> 2 2020-10-29 05:00:01 10 21
#> 3 2020-10-29 05:00:02 9 19
#> 4 2020-10-29 05:00:03 12 19
#> 5 2020-10-29 05:00:04 12 19
#> 6 2020-10-29 05:00:05 9 18
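Side note: if you want a sliding two-minute mean at every timestamp rather than fixed two-minute bins, one option is a left-aligned rolling window, e.g. with zoo::rollapply. This is only a sketch and assumes the data are strictly one observation per second, so a window of 120 rows spans exactly two minutes:
library(zoo)
# Each value is the mean of the current second and the following 119 seconds
df$roll_val1 <- rollapply(df$Value1, width = 120, FUN = mean, align = "left", fill = NA)
df$roll_val2 <- rollapply(df$Value2, width = 120, FUN = mean, align = "left", fill = NA)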
Related
I'm using the heatwaveR package in R to make a plot (event_line()) and visualize the heatwaves over the years. The first step is to run ts2clm(), but this command turns my temp column into NA so I can't plot anything. Does anyone see any errors?
This is my data:
>>> Data
t temp
[Date] [num]
0 2020-05-14 6.9
1 2020-05-06 6.8
2 2020-04-23 5.5
3 2020-04-16 3.6
4 2020-03-31 2.5
5 2020-02-25 2.3
6 2020-01-30 2.8
7 2019-10-02 13.4
8 2022-09-02 19
9 2022-08-15 18.7
...
687 1974-05-06 4.2
This is my code:
#Load data
Data <- read_xlsx("seili_raw_temp.xlsx")
#Set t as class Date
Data$t <- as.Date(Data$t, format = "%Y-%m-%d")
#Constructs seasonal and threshold climatologies
ts <- ts2clm(Data, climatologyPeriod = c("1974-05-06", "2020-05-14"))
#This is the point where almost all temp values turn into NA, so you can ignore below.
#Detect event
res <- detect_event(ts)
#Draw heatwave plot
event_line(res, min_duration = "3", metric = "int_cum",
           start_date = c("1974-05-06"), end_date = c("2020-05-14"))
The data you posted isn't long enough to get the function to work, so I just made some up:
library(heatwaveR)
library(lubridate)
set.seed(1234)
Data <- data.frame(
  t = seq(ymd("2015-01-01"), ymd("2023-01-01"), by = "7 day"))
Data$temp <- runif(nrow(Data), 0, 45)
Then, when I execute the function, I get the result below. The problem is that your data (like the ones I generated) have one observation every 7 days. The ts2clm() function pads out the dataset so that every day has an entry and if a temperature was not observed on that day, it fills in with a missing value.
ts <- ts2clm(Data, climatologyPeriod = c("2015-01-01", "2022-12-29"))
ts
#> # A tibble: 2,920 × 5
#> doy t temp seas thresh
#> <int> <date> <dbl> <dbl> <dbl>
#> 1 1 2015-01-01 5.12 22.5 38.6
#> 2 2 2015-01-02 NA 22.4 38.5
#> 3 3 2015-01-03 NA 22.2 38.2
#> 4 4 2015-01-04 NA 22.1 37.9
#> 5 5 2015-01-05 NA 21.9 37.3
#> 6 6 2015-01-06 NA 21.7 36.8
#> 7 7 2015-01-07 NA 21.5 36.5
#> 8 8 2015-01-08 28.0 21.3 36.1
#> 9 9 2015-01-09 NA 21.2 36.1
#> 10 10 2015-01-10 NA 21.0 35.8
#> # … with 2,910 more rows
Created on 2023-02-10 by the reprex package (v2.0.1)
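As a possible fix (not tested against your file): if I recall correctly, ts2clm() also has a maxPadLength argument that linearly interpolates over short runs of missing values after padding, so something like the call below might fill the 6-day gaps in a weekly series. Whether a weekly series is dense enough to build a meaningful daily climatology is a separate question, though.
ts <- ts2clm(Data, climatologyPeriod = c("2015-01-01", "2022-12-29"),
             maxPadLength = 7)  # assumed argument: interpolate gaps of up to 7 days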
I have data resembling the following structure, where the when variable denotes the day of measurement:
## Generate data.
set.seed(1986)
n <- 1000
y <- rnorm(n)
when <- as.POSIXct(strftime(seq(as.POSIXct("2021-11-01 23:00:00 UTC", tryFormats = "%Y-%m-%d"),
                                as.POSIXct("2022-11-01 23:00:00 UTC", tryFormats = "%Y-%m-%d"),
                                length.out = n), format = "%Y-%m-%d"))
dta <- data.frame(y, when)
head(dta)
#> y when
#> 1 -0.04625141 2021-11-01
#> 2 0.28000082 2021-11-01
#> 3 0.25317063 2021-11-01
#> 4 -0.96411077 2021-11-02
#> 5 0.49222664 2021-11-02
#> 6 -0.69874551 2021-11-02
I need to compute averages of y over time. For instance, the following computes daily averages:
## Compute daily averages of y.
library(dplyr)
daily_avg <- dta %>%
  group_by(when) %>%
  summarise(daily_mean = mean(y)) %>%
  ungroup()
daily_avg
#> # A tibble: 366 × 2
#> when daily_mean
#> <dttm> <dbl>
#> 1 2021-11-01 00:00:00 0.162
#> 2 2021-11-02 00:00:00 -0.390
#> 3 2021-11-03 00:00:00 -0.485
#> 4 2021-11-04 00:00:00 -0.152
#> 5 2021-11-05 00:00:00 0.425
#> 6 2021-11-06 00:00:00 0.726
#> 7 2021-11-07 00:00:00 0.855
#> 8 2021-11-08 00:00:00 0.0608
#> 9 2021-11-09 00:00:00 -0.995
#> 10 2021-11-10 00:00:00 0.395
#> # … with 356 more rows
I am having a hard time computing weekly averages. Here is what I have tried so far:
## Fail - compute weekly averages of y.
library(lubridate)
dta$week <- week(dta$when) # This is wrong.
dta[165: 171, ]
#> y when week
#> 165 0.9758333 2021-12-30 52
#> 166 -0.8630091 2021-12-31 53
#> 167 0.3054031 2021-12-31 53
#> 168 1.2814421 2022-01-01 1
#> 169 0.1025440 2022-01-01 1
#> 170 1.3665411 2022-01-01 1
#> 171 -0.5373058 2022-01-02 1
Using the week function from the lubridate package ignores the fact that my data span multiple years. So, if I were to use code similar to the one I used for the daily averages, I would aggregate observations belonging to different years (but to the same week number). How can I solve this?
You can use %V (from ?strptime) for weeks, combining it with the year.
dta %>%
  group_by(week = format(when, format = "%Y-%V")) %>%
  summarize(daily_mean = mean(y)) %>%
  ungroup()
# # A tibble: 54 x 2
# week daily_mean
# <chr> <dbl>
# 1 2021-44 0.179
# 2 2021-45 0.0477
# 3 2021-46 0.0340
# 4 2021-47 0.356
# 5 2021-48 0.0544
# 6 2021-49 -0.0948
# 7 2021-50 -0.0419
# 8 2021-51 0.209
# 9 2021-52 0.251
# 10 2022-01 -0.197
# # ... with 44 more rows
There are different variants of "week", depending on your preference.
%V
Week of the year as decimal number (01–53) as defined in ISO 8601.
If the week (starting on Monday) containing 1 January has four or more
days in the new year, then it is considered week 1. Otherwise, it is
the last week of the previous year, and the next week is week 1.
(Accepted but ignored on input.)
%W
Week of the year as decimal number (00–53) using Monday as the first
day of week (and typically with the first Monday of the year as day 1
of week 1). The UK convention.
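To see how the conventions differ, here is a quick illustrative check near a year boundary (this assumes a recent R where %V and %G are supported; %G is the ISO week-based year that pairs with %V, which avoids mislabelling the first or last days of a calendar year when you combine week and year):
format(as.Date("2021-01-01"), "%V")     # "53": ISO week, still the last week of 2020
format(as.Date("2021-01-01"), "%G-%V")  # "2020-53": %G gives the matching ISO week-based year
format(as.Date("2021-01-01"), "%W")     # "00": UK convention, before the first Monday of 2021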
You can extract year and week from the dates and group by both:
dta %>%
  mutate(year = year(when),
         week = week(when)) %>%
  group_by(year, week) %>%
  summarise(y_mean = mean(y)) %>%
  ungroup()
# # A tibble: 54 x 3
# # Groups: year, week [54]
# year week y_mean
# <dbl> <dbl> <dbl>
# 1 2021 44 -0.222
# 2 2021 45 0.234
# 3 2021 46 0.0953
# 4 2021 47 0.206
# 5 2021 48 0.192
# 6 2021 49 -0.0831
# 7 2021 50 0.0282
# 8 2021 51 0.196
# 9 2021 52 0.132
# 10 2021 53 -0.279
# # ... with 44 more rows
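Another option, if you'd rather avoid week numbers entirely, is to floor each date to the start of its week and group by that; the year boundary is then handled automatically. A sketch using lubridate::floor_date (week_start = 1 makes weeks start on Monday):
dta %>%
  group_by(week_start = floor_date(when, unit = "week", week_start = 1)) %>%
  summarise(weekly_mean = mean(y)) %>%
  ungroup()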
I am new to coding. I have a data set of daily stream flow averages over 20 years. Following is an example:
DATE FLOW
1 10/1/2001 88.2
2 10/2/2001 77.6
3 10/3/2001 68.4
4 10/4/2001 61.5
5 10/5/2001 55.3
6 10/6/2001 52.5
7 10/7/2001 49.7
8 10/8/2001 46.7
9 10/9/2001 43.3
10 10/10/2001 41.3
11 10/11/2001 39.3
12 10/12/2001 37.7
13 10/13/2001 35.8
14 10/14/2001 34.1
15 10/15/2001 39.8
I need to create a loop summing the previous 6 days as well as the current day (rolling weekly average), and print it to an array for the designated water year. I have already created an aggregate function to separate yearly average daily means into their designated water years.
# Separating dates into specific water years
wtr_yr <- function(dates, start_month = 9) {
  # Convert dates into POSIXlt
  dates_posix <- as.POSIXlt(dates)
  # Year offset
  offset <- ifelse(dates_posix$mon >= start_month - 1, 1, 0)
  # Water year
  dates_posix$year + 1900 + offset
}
# Aggregating by water year to take the mean
mean.FLOW <- aggregate(data_set$FLOW, list(wtr_yr(data_set$DATE)), mean)
It seems that it can be done much more easily.
But first I need to prepare a bit more data.
library(tidyverse)
library(lubridate)
df = tibble(
  DATE = seq(mdy("1/1/2010"), mdy("12/31/2022"), 1),
  FLOW = rnorm(length(DATE), 40, 10)
)
output
# A tibble: 4,748 x 2
DATE FLOW
<date> <dbl>
1 2010-01-01 34.4
2 2010-01-02 37.7
3 2010-01-03 55.6
4 2010-01-04 40.7
5 2010-01-05 41.3
6 2010-01-06 57.2
7 2010-01-07 44.6
8 2010-01-08 27.3
9 2010-01-09 33.1
10 2010-01-10 35.5
# ... with 4,738 more rows
Now let's do the aggregation by year and week number
df %>%
  group_by(year(DATE), week(DATE)) %>%
  summarise(mean = mean(FLOW))
output
# A tibble: 689 x 3
# Groups: year(DATE) [13]
`year(DATE)` `week(DATE)` mean
<dbl> <dbl> <dbl>
1 2010 1 44.5
2 2010 2 39.6
3 2010 3 38.5
4 2010 4 35.3
5 2010 5 44.1
6 2010 6 39.4
7 2010 7 41.3
8 2010 8 43.9
9 2010 9 38.5
10 2010 10 42.4
# ... with 679 more rows
Note, for the week function, the first week starts on January 1st. If you want to number the weeks according to the ISO 8601 standard, use the isoweek function. Alternatively, you can use the epiweek function, which follows the US CDC convention.
df %>%
  group_by(year(DATE), isoweek(DATE)) %>%
  summarise(mean = mean(FLOW))
output
# A tibble: 681 x 3
# Groups: year(DATE) [13]
`year(DATE)` `isoweek(DATE)` mean
<dbl> <dbl> <dbl>
1 2010 1 40.0
2 2010 2 45.5
3 2010 3 33.2
4 2010 4 38.9
5 2010 5 45.0
6 2010 6 40.7
7 2010 7 38.5
8 2010 8 42.5
9 2010 9 37.1
10 2010 10 42.4
# ... with 671 more rows
If you want to better understand how these functions work, please follow the code below
df %>%
  mutate(
    w1 = week(DATE),
    w2 = isoweek(DATE),
    w3 = epiweek(DATE)
  )
output
# A tibble: 4,748 x 5
DATE FLOW w1 w2 w3
<date> <dbl> <dbl> <dbl> <dbl>
1 2010-01-01 34.4 1 53 52
2 2010-01-02 37.7 1 53 52
3 2010-01-03 55.6 1 53 1
4 2010-01-04 40.7 1 1 1
5 2010-01-05 41.3 1 1 1
6 2010-01-06 57.2 1 1 1
7 2010-01-07 44.6 1 1 1
8 2010-01-08 27.3 2 1 1
9 2010-01-09 33.1 2 1 1
10 2010-01-10 35.5 2 1 2
# ... with 4,738 more rows
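If you specifically need the trailing seven-day mean described in the question (the current day plus the previous six days), rather than calendar-week bins, a rolling window is one option. A minimal sketch with zoo::rollapplyr, assuming exactly one row per day with no gaps:
library(zoo)
df %>%
  mutate(weekly_mean = rollapplyr(FLOW, width = 7, FUN = mean, fill = NA))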
The dataframe df1 summarizes water temperature at different depths (T5m,T15m,T25m,T35m) for every hour (Datetime). As an example of dataframe:
df1 <- data.frame(Datetime = c("2016-08-12 12:00:00","2016-08-12 13:00:00","2016-08-12 14:00:00","2016-08-12 15:00:00","2016-08-13 12:00:00","2016-08-13 13:00:00","2016-08-13 14:00:00","2016-08-13 15:00:00"),
                  T5m = c(10,20,20,10,10,20,20,10),
                  T15m = c(10,20,10,20,10,20,10,20),
                  T25m = c(20,20,20,30,20,20,20,30),
                  T35m = c(20,20,10,10,20,20,10,10))
df1$Datetime<- as.POSIXct(df1$Datetime, format="%Y-%m-%d %H")
df1
Datetime T5m T15m T25m T35m
1 2016-08-12 12:00:00 10 10 20 20
2 2016-08-12 13:00:00 20 20 20 20
3 2016-08-12 14:00:00 20 10 20 10
4 2016-08-12 15:00:00 10 20 30 10
5 2016-08-13 12:00:00 10 10 20 20
6 2016-08-13 13:00:00 20 20 20 20
7 2016-08-13 14:00:00 20 10 20 10
8 2016-08-13 15:00:00 10 20 30 10
I would like to create a new dataframe df2 in which I have the average water temperature per day, both for each depth interval and for the whole water column, together with the standard error estimate. I would expect something like this (I did the calculations by hand so there might be some mistakes):
> df2
Date meanT5m meanT15m meanT25m meanT35m meanTotal seT5m seT15m seT25m seT35m seTotal
1 2016-08-12 15 15 22.5 15 16.875 2.88 2.88 2.5 2.88 1.29
2 2016-08-13 15 15 22.5 15 16.875 2.88 2.88 2.5 2.88 1.29
I am especially interested in knowing how to do it with data.table since I will work with huge data.frames and I think data.table is quite efficient.
For calculating the standard error I know the function std.error() from the package plotrix.
Update based on #chinsoon's comment
First transform your data frame into a data table:
library(data.table)
setDT(df1)
Create a total column:
df1[, total := rowSums(.SD), .SDcols = grep("T[0-9]+m", names(df1))][]
# Datetime T5m T15m T25m T35m total
# 1: 2016-08-12 12:00:00 10 10 20 20 60
# 2: 2016-08-12 13:00:00 20 20 20 20 80
# 3: 2016-08-12 14:00:00 20 10 20 10 60
# 4: 2016-08-12 15:00:00 10 20 30 10 70
# 5: 2016-08-13 12:00:00 10 10 20 20 60
# 6: 2016-08-13 13:00:00 20 20 20 20 80
# 7: 2016-08-13 14:00:00 20 10 20 10 60
# 8: 2016-08-13 15:00:00 10 20 30 10 70
Apply the functions per day:
library(lubridate)
(df3 <- df1[, as.list(unlist(lapply(.SD, function(x)
  c(mean = mean(x), sem = sd(x) / sqrt(length(x)))))),
  day(Datetime)])
# day T5m.mean T5m.sem T15m.mean T15m.sem T25m.mean T25m.sem T35m.mean
# 1: 12 15 2.886751 15 2.886751 22.5 2.5 15
# 2: 13 15 2.886751 15 2.886751 22.5 2.5 15
# T35m.sem total.mean total.sem
# 1: 2.886751 67.5 4.787136
# 2: 2.886751 67.5 4.787136
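One caveat: day(Datetime) extracts only the day of the month, so if the data spanned more than one month, the same day number from different months would be pooled together. If that matters, grouping on the calendar date is safer; a sketch of the same computation keyed by date:
(df3 <- df1[, as.list(unlist(lapply(.SD, function(x)
  c(mean = mean(x), sem = sd(x) / sqrt(length(x)))))),
  by = .(Date = as.IDate(Datetime))])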
Here is one way using dplyr and tidyr, calculated in two parts:
library(dplyr)
library(tidyr)
df2 <- df1 %>%
  mutate(Datetime = as.Date(Datetime)) %>%
  gather(key, value, -Datetime) %>%
  group_by(Datetime, key) %>%
  summarise(se = plotrix::std.error(value),
            mean = mean(value)) %>%
  gather(total, value, -key, -Datetime)

bind_rows(df2, df2 %>%
            group_by(Datetime, total) %>%
            summarise(value = sum(value)) %>%
            mutate(key = paste("total", c("mean", "se"), sep = "_"))) %>%
  unite(key, key, total) %>%
  spread(key, value)
# A tibble: 2 x 11
# Groups: Datetime [2]
# Datetime T15m_mean T15m_se T25m_mean T25m_se T35m_mean
# <date> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 2016-08-12 15 2.89 22.5 2.5 15
#2 2016-08-13 15 2.89 22.5 2.5 15
# … with 5 more variables: T35m_se <dbl>, T5m_mean <dbl>,
# T5m_se <dbl>, total_mean_mean <dbl>, total_se_se <dbl>
I am trying to summarise this daily time series of rainfall into groups of 10-day periods within each month and calculate the accumulated rainfall.
library(tidyverse)
(dat <- tibble(
  date = seq(as.Date("2016-01-01"), as.Date("2016-12-31"), by = 1),
  rainfall = rgamma(length(date), shape = 2, scale = 2)))
Therefore, the third group will vary in length through the year; for instance, in January the third period has 11 days, in February 9 days, and so on. This is my try:
library(lubridate)
dat %>%
  group_by(decade = floor_date(date, "10 days")) %>%
  summarize(acum_rainfall = sum(rainfall),
            days = n())
This is the resulting output:
# A tibble: 43 x 3
decade acum_rainfall days
<date> <dbl> <int>
1 2016-01-01 48.5 10
2 2016-01-11 39.9 10
3 2016-01-21 36.1 10
4 2016-01-31 1.87 1
5 2016-02-01 50.6 10
6 2016-02-11 32.1 10
7 2016-02-21 22.1 9
8 2016-03-01 45.9 10
9 2016-03-11 30.0 10
10 2016-03-21 42.4 10
# ... with 33 more rows
Can someone help me sum the residual days into the third period so that I always obtain 3 periods within each month? This would be the desired output (pay attention to row 3):
decade acum_rainfall days
<date> <dbl> <int>
1 2016-01-01 48.5 10
2 2016-01-11 39.9 10
3 2016-01-21 37.97 11
4 2016-02-01 50.6 10
5 2016-02-11 32.1 10
6 2016-02-21 22.1 9
One way to do this is to use if_else to apply floor_date with different arguments depending on the day value of date. If day(date) is < 30, use the normal "10 days" flooring; if it's >= 30, use "20 days" to ensure it gets rounded down to day 21:
dat %>%
  group_by(decade = if_else(day(date) >= 30,
                            floor_date(date, "20 days"),
                            floor_date(date, "10 days"))) %>%
  summarize(acum_rainfall = sum(rainfall),
            days = n())
# A tibble: 36 x 3
decade acum_rainfall days
<date> <dbl> <int>
1 2016-01-01 38.8 10
2 2016-01-11 38.4 10
3 2016-01-21 43.4 11
4 2016-02-01 34.4 10
5 2016-02-11 34.8 10
6 2016-02-21 25.3 9
7 2016-03-01 39.6 10
8 2016-03-11 53.9 10
9 2016-03-21 38.1 11
10 2016-04-01 36.6 10
# … with 26 more rows
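An alternative sketch that avoids the if_else is to cap the period index at 3 directly. The grouping columns differ (month plus a period number rather than a start date), but the three groups per month are the same:
dat %>%
  group_by(month = floor_date(date, "month"),
           period = pmin((day(date) - 1) %/% 10 + 1, 3)) %>%
  summarize(acum_rainfall = sum(rainfall),
            days = n(),
            .groups = "drop")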