I am trying to summarise this daily time series of rainfall by groups of 10-day periods within each month and calculate the accumulated rainfall.
library(tidyverse)
(dat <- tibble(
date = seq(as.Date("2016-01-01"), as.Date("2016-12-31"), by=1),
rainfall = rgamma(length(date), shape=2, scale=2)))
The third period will therefore vary in length through the year: in January it has 11 days, in February 9 days (2016 is a leap year), and so on. This is my attempt:
library(lubridate)
dat %>%
group_by(decade=floor_date(date, "10 days")) %>%
summarize(acum_rainfall=sum(rainfall),
days = n())
This is the resulting output:
# A tibble: 43 x 3
decade acum_rainfall days
<date> <dbl> <int>
1 2016-01-01 48.5 10
2 2016-01-11 39.9 10
3 2016-01-21 36.1 10
4 2016-01-31 1.87 1
5 2016-02-01 50.6 10
6 2016-02-11 32.1 10
7 2016-02-21 22.1 9
8 2016-03-01 45.9 10
9 2016-03-11 30.0 10
10 2016-03-21 42.4 10
# ... with 33 more rows
Can someone help me add the leftover period (day 31) to the third one, so that I always get 3 periods within each month? This would be the desired output (note row 3):
decade acum_rainfall days
<date> <dbl> <int>
1 2016-01-01 48.5 10
2 2016-01-11 39.9 10
3 2016-01-21 37.97 11
4 2016-02-01 50.6 10
5 2016-02-11 32.1 10
6 2016-02-21 22.1 9
One way to do this is to use if_else to apply floor_date with different arguments depending on the day value of date. If day(date) is below 30, floor to "10 days" as before; if it is 30 or more, floor to "20 days" so that the date is rounded down to day 21 instead of starting a new period on day 31:
dat %>%
group_by(decade=if_else(day(date) >= 30,
floor_date(date, "20 days"),
floor_date(date, "10 days"))) %>%
summarize(acum_rainfall=sum(rainfall),
days = n())
# A tibble: 36 x 3
decade acum_rainfall days
<date> <dbl> <int>
1 2016-01-01 38.8 10
2 2016-01-11 38.4 10
3 2016-01-21 43.4 11
4 2016-02-01 34.4 10
5 2016-02-11 34.8 10
6 2016-02-21 25.3 9
7 2016-03-01 39.6 10
8 2016-03-11 53.9 10
9 2016-03-21 38.1 11
10 2016-04-01 36.6 10
# … with 26 more rows
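The same grouping can also be sketched without floor_date, by computing the period index directly from the day of the month. This is only an illustration in base R: the helper name dekad_start is hypothetical (it is not part of lubridate or of the answer above), and the rainfall values are simulated with an arbitrary seed.

```r
# Hypothetical helper: map each date to the first day of its 10-day period,
# folding days 21 to end-of-month into the third period.
dekad_start <- function(date) {
  day <- as.integer(format(date, "%d"))
  period <- pmin((day - 1) %/% 10, 2)   # 0, 1 or 2; day 31 is capped at 2
  as.Date(paste0(format(date, "%Y-%m-"), sprintf("%02d", period * 10 + 1)))
}

set.seed(1)  # arbitrary seed, only to make the sketch reproducible
dates <- seq(as.Date("2016-01-01"), as.Date("2016-12-31"), by = 1)
dat <- data.frame(date = dates,
                  rainfall = rgamma(length(dates), shape = 2, scale = 2))

dat$decade <- dekad_start(dat$date)

# Accumulated rainfall and number of days per period
res <- aggregate(rainfall ~ decade, data = dat, FUN = sum)
res$days <- as.vector(table(dat$decade))
head(res)
```

Capping the index with pmin makes the "always three periods per month" rule explicit, at the cost of not reusing floor_date.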
Related
I am new to coding. I have a data set of daily stream flow averages over 20 years. Following is an example:
DATE FLOW
1 10/1/2001 88.2
2 10/2/2001 77.6
3 10/3/2001 68.4
4 10/4/2001 61.5
5 10/5/2001 55.3
6 10/6/2001 52.5
7 10/7/2001 49.7
8 10/8/2001 46.7
9 10/9/2001 43.3
10 10/10/2001 41.3
11 10/11/2001 39.3
12 10/12/2001 37.7
13 10/13/2001 35.8
14 10/14/2001 34.1
15 10/15/2001 39.8
I need to create a loop summing the previous 6 days as well as the current day (rolling weekly average), and print it to an array for the designated water year. I have already created an aggregate function to separate yearly average daily means into their designated water years.
# Separating dates into specific water years
wtr_yr <- function(dates, start_month=9) {
# Convert dates into POSIXlt
POSIDATE = as.POSIXlt(dates)
# Year offset
offset = ifelse(POSIDATE$mon >= start_month - 1, 1, 0)
# Water year
POSIDATE$year + 1900 + offset
}
# Aggregating the water year function to take the mean
adj.year = wtr_yr(NEW_DATE)
mean.FLOW=aggregate(data_set$FLOW,list(adj.year), mean)
It seems that it can be done much more easily.
But first I need to prepare a bit more data.
library(tidyverse)
library(lubridate)
df = tibble(
DATE = seq(mdy("1/1/2010"), mdy("12/31/2022"), 1),
FLOW = rnorm(length(DATE), 40, 10)
)
output
# A tibble: 4,748 x 2
DATE FLOW
<date> <dbl>
1 2010-01-01 34.4
2 2010-01-02 37.7
3 2010-01-03 55.6
4 2010-01-04 40.7
5 2010-01-05 41.3
6 2010-01-06 57.2
7 2010-01-07 44.6
8 2010-01-08 27.3
9 2010-01-09 33.1
10 2010-01-10 35.5
# ... with 4,738 more rows
Now let's do the aggregation by year and week number
df %>%
group_by(year(DATE), week(DATE)) %>%
summarise(mean = mean(FLOW))
output
# A tibble: 689 x 3
# Groups: year(DATE) [13]
`year(DATE)` `week(DATE)` mean
<dbl> <dbl> <dbl>
1 2010 1 44.5
2 2010 2 39.6
3 2010 3 38.5
4 2010 4 35.3
5 2010 5 44.1
6 2010 6 39.4
7 2010 7 41.3
8 2010 8 43.9
9 2010 9 38.5
10 2010 10 42.4
# ... with 679 more rows
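Note that week() counts weeks from January 1st, so week 1 always starts on that date. If you want to number the weeks according to the ISO 8601 standard, use the isoweek function. Alternatively, epiweek follows the US CDC convention. ISO weeks can straddle calendar years, so pairing isoweek() with isoyear() rather than year() keeps all days of one ISO week in the same group.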
df %>%
group_by(year(DATE), isoweek(DATE)) %>%
summarise(mean = mean(FLOW))
output
# A tibble: 681 x 3
# Groups: year(DATE) [13]
`year(DATE)` `isoweek(DATE)` mean
<dbl> <dbl> <dbl>
1 2010 1 40.0
2 2010 2 45.5
3 2010 3 33.2
4 2010 4 38.9
5 2010 5 45.0
6 2010 6 40.7
7 2010 7 38.5
8 2010 8 42.5
9 2010 9 37.1
10 2010 10 42.4
# ... with 671 more rows
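To illustrate why the ISO week number is best paired with the ISO year, here is a small base R sketch. It relies on the "%G" (ISO year) and "%V" (ISO week) output formats of format(), which are assumed to be supported on your platform; the three dates are picked only for illustration.

```r
# Three dates around the 2010 year boundaries
d <- as.Date(c("2010-01-01", "2010-01-04", "2010-12-31"))

# ISO 8601 year and week: "%G" is the ISO year, "%V" the ISO week number
iso_key <- format(d, "%G-W%V")
iso_key

# The calendar year differs from the ISO year for 2010-01-01
cal_year <- format(d, "%Y")
iso_year <- format(d, "%G")
data.frame(date = d, cal_year, iso_year, iso_key)
```

2010-01-01 belongs to the last ISO week of 2009, so grouping by calendar year and ISO week would label it as week 53 of 2010.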
If you want to better understand how these functions work, please follow the code below
df %>%
mutate(
w1 = week(DATE),
w2 = isoweek(DATE),
w3 = epiweek(DATE)
)
output
# A tibble: 4,748 x 5
DATE FLOW w1 w2 w3
<date> <dbl> <dbl> <dbl> <dbl>
1 2010-01-01 34.4 1 53 52
2 2010-01-02 37.7 1 53 52
3 2010-01-03 55.6 1 53 1
4 2010-01-04 40.7 1 1 1
5 2010-01-05 41.3 1 1 1
6 2010-01-06 57.2 1 1 1
7 2010-01-07 44.6 1 1 1
8 2010-01-08 27.3 2 1 1
9 2010-01-09 33.1 2 1 1
10 2010-01-10 35.5 2 1 2
# ... with 4,738 more rows
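The question asked for a rolling mean over the current day and the previous 6 days, which is not the same as a mean per calendar week. A minimal base R sketch of that rolling version follows, using the first ten FLOW values from the question; it assumes the rows are already sorted by date with no missing days.

```r
# First ten daily flows from the question
flow <- c(88.2, 77.6, 68.4, 61.5, 55.3, 52.5, 49.7, 46.7, 43.3, 41.3)

# Trailing 7-day mean: sides = 1 makes the filter use the current value and
# the 6 before it. stats:: is spelled out because dplyr masks filter().
roll7 <- as.numeric(stats::filter(flow, rep(1 / 7, 7), sides = 1))

data.frame(day = seq_along(flow), flow, roll7)
```

The first six entries are NA because fewer than seven days are available there. To restart the window for each water year, the same call could be applied per group, for example with ave() or a grouped mutate().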
I have a data set like below:
Timestamp Value1 Value2
2020-10-29 05:00:00 10 20
2020-10-29 05:00:01 10 20
2020-10-29 05:00:02 11 22
2020-10-29 05:00:03 11 22
and so on, in one-second intervals, for up to a few hours of data. I want to generate an average value every two minutes, but left-align the data. Essentially, the two-minute average at 2020-10-29 05:00:00 should be the average of the data points between 2020-10-29 05:00:00 and 2020-10-29 05:01:59.
I have used data %>% group_by(Timestamp = cut(Timestamp, breaks="2min")) %>% summarize(Meanval1=mean(Value1), Meanval2=mean(Value2)) but this right-aligns the data. How can I left-align it?
Thanks!
You can round down the Timestamp column to the nearest two minutes using lubridate::floor_date. If you then group_by this new column, you will get a left-aligned two-minute mean:
library(dplyr)
df %>%
mutate(time = lubridate::floor_date(TimeStamp, "2 minutes")) %>%
group_by(time) %>%
summarize(mean_val1 = mean(Value1), mean_val2 = mean(Value2))
#> # A tibble: 9 x 3
#> time mean_val1 mean_val2
#> <dttm> <dbl> <dbl>
#> 1 2020-10-29 05:00:00 10.2 19.9
#> 2 2020-10-29 05:02:00 9.84 20.0
#> 3 2020-10-29 05:04:00 10.1 19.9
#> 4 2020-10-29 05:06:00 9.72 20.3
#> 5 2020-10-29 05:08:00 9.98 19.9
#> 6 2020-10-29 05:10:00 9.98 20.0
#> 7 2020-10-29 05:12:00 10.1 20.0
#> 8 2020-10-29 05:14:00 10.0 20.1
#> 9 2020-10-29 05:16:00 10.0 20.2
Data used
set.seed(69)
t <- seq(as.POSIXct("2020-10-29 05:00:00"), by = "1 sec", length.out = 1000)
df <- data.frame(TimeStamp = t,
Value1 = sample(8:12, 1000, TRUE),
Value2 = sample(18:22, 1000, TRUE))
head(df)
#> TimeStamp Value1 Value2
#> 1 2020-10-29 05:00:00 8 20
#> 2 2020-10-29 05:00:01 10 21
#> 3 2020-10-29 05:00:02 9 19
#> 4 2020-10-29 05:00:03 12 19
#> 5 2020-10-29 05:00:04 12 19
#> 6 2020-10-29 05:00:05 9 18
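For intuition about what rounding down to two minutes does, the same binning can be sketched in base R with integer arithmetic on the epoch seconds. This is only an illustration, not a replacement for floor_date: the data are made up, and the time zone is fixed to UTC as an assumption.

```r
# 300 one-second timestamps with made-up values
ts <- seq(as.POSIXct("2020-10-29 05:00:00", tz = "UTC"),
          by = "1 sec", length.out = 300)
value1 <- rep(c(10, 11, 12), each = 100)

# Round each timestamp down to the start of its 120-second bin
bin_start <- as.POSIXct(floor(as.numeric(ts) / 120) * 120,
                        origin = "1970-01-01", tz = "UTC")

# Mean per bin, labelled by the bin's start time (left-aligned)
means <- tapply(value1, format(bin_start, "%H:%M:%S"), mean)
means
```

Each label is the first second of its window, so the bin labelled 05:00:00 covers 05:00:00 to 05:01:59.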
I am trying to apply some basic math to daily stock values based on a corresponding yearly value.
reprex
(daily prices)
library(tidyquant)
data(FANG)
# daily prices
FANG %>%
select(c(date, symbol, adjusted)) %>%
group_by(symbol)
# A tibble: 4,032 x 3
# Groups: symbol [4]
date symbol adjusted
<date> <chr> <dbl>
1 2013-01-02 FB 28
2 2013-01-03 FB 27.8
3 2013-01-04 FB 28.8
4 2013-01-07 FB 29.4
5 2013-01-08 FB 29.1
6 2013-01-09 FB 30.6
7 2013-01-10 FB 31.3
8 2013-01-11 FB 31.7
9 2013-01-14 FB 31.0
10 2013-01-15 FB 30.1
# ... with 4,022 more rows
(max price per year)
FANG_yearly_high <-
FANG %>%
group_by(symbol) %>%
summarise_by_time(
.date_var = date,
.by = "year",
price = MAX(adjusted)) # summarise_by_time() comes from the timetk package
# Groups: symbol [4]
symbol date price
<chr> <date> <dbl>
1 AMZN 2013-01-01 404.
2 AMZN 2014-01-01 407.
3 AMZN 2015-01-01 694.
4 AMZN 2016-01-01 844.
5 FB 2013-01-01 58.0
6 FB 2014-01-01 81.4
7 FB 2015-01-01 109.
8 FB 2016-01-01 133.
9 GOOG 2013-01-01 560.
10 GOOG 2014-01-01 609.
11 GOOG 2015-01-01 777.
12 GOOG 2016-01-01 813.
13 NFLX 2013-01-01 54.4
14 NFLX 2014-01-01 69.2
15 NFLX 2015-01-01 131.
16 NFLX 2016-01-01 128.
I would like to divide each daily price by the corresponding max price for the year.
I tried:
FANG %>%
group_by(symbol) %>%
summarise_by_time(
.date_var = date,
.by = "year",
price = AVERAGE(adjusted) / YEAR(date(MAX(adjusted)))
)
and I get this error:
Error in as.POSIXlt.numeric(x, tz = tz(x)) : 'origin' must be supplied
Any sensible way to accomplish this?
Thank you
summarise_by_time is good if you just want to summarise. But you want to divide a daily price by a max of a period. So you need to use mutate. Below are 2 examples. The first one for daily prices, the second for weekly. You can adjust the weekly version easily to monthly.
library(tidyquant)
library(dplyr)
data(FANG)
# daily prices
FANG %>%
select(c(date, symbol, adjusted)) %>%
group_by(symbol, year = year(date)) %>%
mutate(price_pct = adjusted / max(adjusted))
# A tibble: 4,032 x 5
# Groups: symbol, year [16]
date symbol adjusted year price_pct
<date> <chr> <dbl> <dbl> <dbl>
1 2013-01-02 FB 28 2013 0.483
2 2013-01-03 FB 27.8 2013 0.479
3 2013-01-04 FB 28.8 2013 0.496
4 2013-01-07 FB 29.4 2013 0.508
5 2013-01-08 FB 29.1 2013 0.501
6 2013-01-09 FB 30.6 2013 0.528
7 2013-01-10 FB 31.3 2013 0.540
8 2013-01-11 FB 31.7 2013 0.547
9 2013-01-14 FB 31.0 2013 0.534
10 2013-01-15 FB 30.1 2013 0.519
# ... with 4,022 more rows
Weekly / monthly:
# weekly
FANG %>%
select(c(date, symbol, adjusted)) %>%
group_by(symbol) %>%
tq_transmute(mutate_fun = to.period,
period = "weeks" # change weeks to months for monthly
) %>%
group_by(symbol, year = year(date)) %>%
mutate(price_pct = adjusted / max(adjusted))
# A tibble: 836 x 5
# Groups: symbol, year [16]
symbol date adjusted year price_pct
<chr> <date> <dbl> <dbl> <dbl>
1 FB 2013-01-04 28.8 2013 0.519
2 FB 2013-01-11 31.7 2013 0.572
3 FB 2013-01-18 29.7 2013 0.535
4 FB 2013-01-25 31.5 2013 0.569
5 FB 2013-02-01 29.7 2013 0.536
6 FB 2013-02-08 28.5 2013 0.515
7 FB 2013-02-15 28.3 2013 0.511
8 FB 2013-02-22 27.1 2013 0.489
9 FB 2013-03-01 27.8 2013 0.501
10 FB 2013-03-08 28.0 2013 0.504
# ... with 826 more rows
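The idea behind the grouped mutate (attach the group maximum to every row, then divide) can also be sketched in base R with ave(). The symbols and prices below are made up for illustration and are not taken from the FANG data.

```r
# Made-up daily prices for two hypothetical symbols
prices <- data.frame(
  symbol   = c("AAA", "AAA", "AAA", "AAA", "BBB", "BBB"),
  date     = as.Date(c("2013-01-02", "2013-06-03", "2014-01-02",
                       "2014-06-02", "2013-01-02", "2013-06-03")),
  adjusted = c(10, 20, 30, 15, 5, 4)
)

prices$year <- as.integer(format(prices$date, "%Y"))

# ave() returns a vector as long as the input, holding each row's group maximum
prices$year_max <- ave(prices$adjusted, prices$symbol, prices$year, FUN = max)
prices$price_pct <- prices$adjusted / prices$year_max
prices
```

Because the result keeps one row per day, this mirrors mutate rather than summarise.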
The dataframe df1 summarizes water temperature at different depths (T5m,T15m,T25m,T35m) for every hour (Datetime). As an example of dataframe:
df1<- data.frame(Datetime=c("2016-08-12 12:00:00","2016-08-12 13:00:00","2016-08-12 14:00:00","2016-08-12 15:00:00","2016-08-13 12:00:00","2016-08-13 13:00:00","2016-08-13 14:00:00","2016-08-13 15:00:00"),
T5m= c(10,20,20,10,10,20,20,10),
T15m=c(10,20,10,20,10,20,10,20),
T25m=c(20,20,20,30,20,20,20,30),
T35m=c(20,20,10,10,20,20,10,10))
df1$Datetime<- as.POSIXct(df1$Datetime, format="%Y-%m-%d %H:%M:%S")
df1
Datetime T5m T15m T25m T35m
1 2016-08-12 12:00:00 10 10 20 20
2 2016-08-12 13:00:00 20 20 20 20
3 2016-08-12 14:00:00 20 10 20 10
4 2016-08-12 15:00:00 10 20 30 10
5 2016-08-13 12:00:00 10 10 20 20
6 2016-08-13 13:00:00 20 20 20 20
7 2016-08-13 14:00:00 20 10 20 10
8 2016-08-13 15:00:00 10 20 30 10
I would like to create a new dataframe df2 holding the daily mean water temperature, both for each depth and for the whole water column, together with the standard error of each mean. I would expect something like this (I did the calculations by hand so there might be some mistakes):
> df2
Date meanT5m meanT15m meanT25m meanT35m meanTotal seT5m seT15m seT25m seT35m seTotal
1 2016-08-12 15 15 22.5 15 16.875 2.88 2.88 2.5 2.88 1.29
2 2016-08-13 15 15 22.5 15 16.875 2.88 2.88 2.5 2.88 1.29
I am especially interested in knowing how to do it with data.table since I will work with huge data.frames and I think data.table is quite efficient.
For calculating the standard error I know the function std.error() from the package plotrix.
Update based on @chinsoon's comment
First transform your data frame into a data table:
library(data.table)
setDT(df1)
Create a total column:
df1[, total := rowSums(.SD), .SDcols = grep("T[0-9]+m", names(df1))][]
# Datetime T5m T15m T25m T35m total
# 1: 2016-08-12 12:00:00 10 10 20 20 60
# 2: 2016-08-12 13:00:00 20 20 20 20 80
# 3: 2016-08-12 14:00:00 20 10 20 10 60
# 4: 2016-08-12 15:00:00 10 20 30 10 70
# 5: 2016-08-13 12:00:00 10 10 20 20 60
# 6: 2016-08-13 13:00:00 20 20 20 20 80
# 7: 2016-08-13 14:00:00 20 10 20 10 60
# 8: 2016-08-13 15:00:00 10 20 30 10 70
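Apply the functions per day. Note that day() returns the day of the month, so for data spanning more than one month it is safer to group by as.Date(Datetime) instead: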
library(lubridate)
(df3 <- df1[, as.list(unlist(lapply(.SD, function (x)
c(mean = mean(x), sem = sd(x) / sqrt(length(x)))))),
day(Datetime)])
# day T5m.mean T5m.sem T15m.mean T15m.sem T25m.mean T25m.sem T35m.mean
# 1: 12 15 2.886751 15 2.886751 22.5 2.5 15
# 2: 13 15 2.886751 15 2.886751 22.5 2.5 15
# T35m.sem total.mean total.sem
# 1: 2.886751 67.5 4.787136
# 2: 2.886751 67.5 4.787136
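The total above is the row sum over depths, whereas the question's meanTotal (16.875) looks like the mean over all readings of the day. A possible data.table sketch for that pooled reading is below. It assumes that "whole water column" means pooling every depth and hour of a day into one sample; the names long and pooled are illustrative and the time zone is fixed to UTC.

```r
library(data.table)

dt <- data.table(
  Datetime = as.POSIXct(c("2016-08-12 12:00:00", "2016-08-12 13:00:00",
                          "2016-08-12 14:00:00", "2016-08-12 15:00:00",
                          "2016-08-13 12:00:00", "2016-08-13 13:00:00",
                          "2016-08-13 14:00:00", "2016-08-13 15:00:00"),
                        tz = "UTC"),
  T5m  = c(10, 20, 20, 10, 10, 20, 20, 10),
  T15m = c(10, 20, 10, 20, 10, 20, 10, 20),
  T25m = c(20, 20, 20, 30, 20, 20, 20, 30),
  T35m = c(20, 20, 10, 10, 20, 20, 10, 10)
)

# Long format: one row per (Datetime, depth) reading
long <- melt(dt, id.vars = "Datetime",
             variable.name = "depth", value.name = "temp")

# Pool all depths per calendar day: mean and standard error of the mean
pooled <- long[, .(meanTotal = mean(temp),
                   seTotal   = sd(temp) / sqrt(.N)),
               by = .(Date = as.IDate(Datetime))]
pooled
```

Under this pooling the daily mean is 16.875 and the standard error is about 1.505, computed from 16 readings per day.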
Here is one way using dplyr and tidyr calculated in two parts
library(dplyr)
library(tidyr)
df2 <- df1 %>%
mutate(Datetime = as.Date(Datetime)) %>%
gather(key, value, -Datetime) %>%
group_by(Datetime, key) %>%
summarise(se = plotrix::std.error(value),
mean = mean(value)) %>%
gather(total, value, -key, -Datetime)
bind_rows(df2, df2 %>%
group_by(Datetime, total) %>%
summarise(value = sum(value)) %>%
mutate(key = paste("total", c("mean", "se"), sep = "_"))) %>%
unite(key, key, total) %>%
spread(key, value)
# A tibble: 2 x 11
# Groups: Datetime [2]
# Datetime T15m_mean T15m_se T25m_mean T25m_se T35m_mean
# <date> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 2016-08-12 15 2.89 22.5 2.5 15
#2 2016-08-13 15 2.89 22.5 2.5 15
# … with 5 more variables: T35m_se <dbl>, T5m_mean <dbl>,
# T5m_se <dbl>, total_mean_mean <dbl>, total_se_se <dbl>
I have a dataframe like this:
ds y
1 2015-12-31 35.59050
2 2016-01-01 28.75111
3 2016-01-04 25.53158
4 2016-01-06 17.75369
5 2016-01-07 29.01500
6 2016-01-08 29.22663
7 2016-01-09 29.05249
8 2016-01-10 27.54387
9 2016-01-11 28.05674
10 2016-01-12 29.00901
11 2016-01-13 31.66441
12 2016-01-14 29.18520
13 2016-01-15 29.79364
14 2016-01-16 30.07852
I'm trying to create a loop that removes the rows whose values in the 'y' column are above 34 or below 26, because that is where my outliers are:
for (i in grupo$y){if (i < 26) {grupo$y[i] = NA}}
I tried this to remove those below 26. I don't get any errors, but those rows are not removed.
Any suggestions on how to remove those outliers?
Thanks in advance
Here are a base R solution and a tidyverse solution. Part of the strength of R is that for a problem such as this one, R's default of working across vectors means you often don't need a for loop. The issue is that in your loop, you're assigning values to NA. That doesn't actually get rid of those values, it just gives them the value NA. The loop also iterates over the values of y rather than over row positions, so grupo$y[i] does not refer to the row the value came from.
In base R, you can use subset to get the rows or columns of a data frame that meet certain criteria:
subset(grupo, y >= 26 & y <= 34)
#> # A tibble: 11 x 2
#> ds y
#> <date> <dbl>
#> 1 2016-01-01 28.8
#> 2 2016-01-07 29.0
#> 3 2016-01-08 29.2
#> 4 2016-01-09 29.1
#> 5 2016-01-10 27.5
#> 6 2016-01-11 28.1
#> 7 2016-01-12 29.0
#> 8 2016-01-13 31.7
#> 9 2016-01-14 29.2
#> 10 2016-01-15 29.8
#> 11 2016-01-16 30.1
Or using dplyr functions, you can filter your data similarly, and make use of dplyr::between. between(y, 26, 34) is a shorthand for y >= 26 & y <= 34.
library(dplyr)
grupo %>%
filter(between(y, 26, 34))
#> # A tibble: 11 x 2
#> ds y
#> <date> <dbl>
#> 1 2016-01-01 28.8
#> 2 2016-01-07 29.0
#> 3 2016-01-08 29.2
#> 4 2016-01-09 29.1
#> 5 2016-01-10 27.5
#> 6 2016-01-11 28.1
#> 7 2016-01-12 29.0
#> 8 2016-01-13 31.7
#> 9 2016-01-14 29.2
#> 10 2016-01-15 29.8
#> 11 2016-01-16 30.1
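Both subset and filter build a logical vector with one TRUE or FALSE per row and keep the TRUE rows. A minimal base R sketch of that mechanism, using the first five rows from the question (the names keep and cleaned are illustrative):

```r
# First five rows of the data from the question
grupo <- data.frame(
  ds = as.Date(c("2015-12-31", "2016-01-01", "2016-01-04",
                 "2016-01-06", "2016-01-07")),
  y  = c(35.59050, 28.75111, 25.53158, 17.75369, 29.01500)
)

# One logical value per row: TRUE where y lies inside [26, 34]
keep <- grupo$y >= 26 & grupo$y <= 34
keep

# Logical indexing drops the FALSE rows; no loop is needed
cleaned <- grupo[keep, ]
cleaned
```

Unlike assigning NA, indexing with keep returns a data frame with fewer rows.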
With dplyr you could do:
library(dplyr)
df %>%
filter(y >= 26 & y <= 34)
ds y
1 2016-01-01 28.75111
2 2016-01-07 29.01500
3 2016-01-08 29.22663
4 2016-01-09 29.05249
5 2016-01-10 27.54387
6 2016-01-11 28.05674
7 2016-01-12 29.00901
8 2016-01-13 31.66441
9 2016-01-14 29.18520
10 2016-01-15 29.79364
11 2016-01-16 30.07852