Which is the first day above and below a threshold? - r

I have a dataframe that represents a two year daily time series of temperature for two rivers. For each river and each year, I would like to know what day of year:
temperature is greater than or equal to 15 degrees
temperature is sustained greater than or equal to 15 degrees (sustained is when there are no more dips below 15 until the autumn)
temperature is less than or equal to 15 degrees
temperature is sustained less than or equal to 15 degrees (sustained is when there are no more peaks above 15 until the following spring)
Example timeseries
library(ggplot2)
library(lubridate)
library(dplyr)
library(dataRetrieval)
siteNumber <- c("01428500","01432805") # United States Geological Survey site numbers
parameterCd <- "00010" # temperature
statCd <- "00003" # mean
startDate <- "2018-01-01"
endDate <- "2019-10-31"
dat <- readNWISdv(siteNumber, parameterCd, startDate, endDate, statCd=statCd) # obtains the timeseries from the USGS
dat <- dat[,c(2:4)]
colnames(dat)[3] <- "temperature"
# To view the time series
ggplot(data = dat, aes(x = Date, y = temperature)) +
  geom_point() +
  theme_bw() +
  facet_wrap(~site_no)
# Adds a new column for year and day of year (doy; Jan 1 = 1, Dec 31 = 365)
dat <- dat %>%
  mutate(year = year(Date),
         doy = yday(Date))
I have tried using the dplyr filter() function but have had little success
dat %>%
  group_by(site_no, year) %>%
  filter(temperature >= 15 & temperature <= 15)
The ideal output would look something like this:
site_no year doy_firstabove15 doy_sustainedAbove15 doy_firstbelow15 doy_sustainedBelow15
1 01428500 2018 136 144 253 286
2 01428500 2019 140 146 279 289
3 01432805 2018 143 143 272 276
4 01432805 2019 140 140 278 278

I like this question. It's specific, tricky, and has a good example. I would go about this by creating a rolling variable that tracks changes between above and below 15 degrees. Then you can group by site and year and check for the first instance of each change in the group id. I make several assumptions here that might be bad: 1) I group by calendar year, but it would likely be better to group by water year. You don't explicitly say the year is important, but I suspect it is. 2) I also assume autumn is defined by the first day of month 9; maybe you want "sustained" to hold through the fall rather than only up to its first day. The same assumption applies to spring.
library(tidyverse)
library(dataRetrieval)
dat |>
  mutate(month = lubridate::month(Date),
         year = lubridate::year(Date)) |>
  group_by(site_no, year) |>
  mutate(gt_15 = temperature > 15,
         lt_15 = temperature < 15,
         chng1 = cumsum(gt_15 != lag(gt_15, def = first(gt_15))),
         chng2 = cumsum(lt_15 != lag(lt_15, def = first(lt_15)))) |>
  summarise(first_above_15 = Date[chng1 == 1][1],
            indx_autum = chng1[month == 9][1],
            first_above_15_sustained = Date[chng1 == indx_autum][1],
            first_below_15 = Date[chng2 == 1][1],
            indx_spring = chng1[month == 3][1],
            first_below_15_sustained = Date[chng2 == indx_spring][1],
            .groups = "drop") |>
  select(-contains("indx"))
#> # A tibble: 4 x 6
#> site_no year first_above_15 first_above_15_sustained first_belo~1 first_be~2
#> <chr> <dbl> <date> <date> <date> <date>
#> 1 01428500 2018 2018-05-18 2018-05-24 2018-05-16 2018-01-01
#> 2 01428500 2019 2019-05-20 2019-05-26 2019-05-20 2019-01-01
#> 3 01432805 2018 2018-05-23 2018-05-23 2018-05-23 2018-01-01
#> 4 01432805 2019 2019-05-20 2019-05-20 2019-05-20 2019-02-22
#> # ... with abbreviated variable names 1: first_below_15,
#> # 2: first_below_15_sustained
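For reference, the cumsum(x != lag(x, ...)) columns above simply count state changes: the counter starts at 0 and goes up by 1 every day the above/below-15 state flips, so the first day it equals 1 is the first crossing of the year. A tiny illustration (not part of the original answer):
x <- c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE)
cumsum(x != dplyr::lag(x, default = dplyr::first(x)))
#> [1] 0 0 1 1 2 3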

This works by splitting each annual time series into the periods before and after the peak temperature. It then returns the day of year (doy) when temperature is first above 15 before the peak and first below 15 after the peak.
Similarly, sustain_above and sustain_below are determined by first creating a run-length ID column (run), then, within the subset of run where below_peak or after_peak is TRUE, taking the first doy of the maximum run.
dat %>%
  mutate(year = year(Date),
         doy = yday(Date)) %>%
  group_by(site_no, year) %>%
  mutate(gt_15 = temperature >= 15,
         lt_15 = temperature <= 15,
         peak_doy = doy[which.max(temperature)],
         below_peak = doy < peak_doy,
         after_peak = doy > peak_doy,
         run = data.table::rleid(lt_15)) %>%
  summarise(first_above = doy[below_peak & gt_15][1],
            sustain_above = first(doy[run == max(run[below_peak])]),
            first_below = doy[after_peak & lt_15][1],
            sustain_below = first(doy[run == max(run[after_peak])]),
            .groups = 'drop')
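For reference, data.table::rleid() gives each uninterrupted run of equal values its own id, so the largest run id among the below_peak (or after_peak) rows marks the last unbroken stretch before (or after) the peak. A small illustration (not from the original answer):
data.table::rleid(c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE))
#> [1] 1 1 2 2 2 3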

Related

Identify first unique value when multiple conditions are met using dplyr in R

I have a dataframe representing a two-year daily time series of temperature for two rivers. I have identified when the temperature is either above or below the peak temperature. I have also created a run-length ID column for when temperature is either above or below a threshold temperature of 10 degrees.
How can I get, for each site and year, the first day of year that meets the following conditions:
maximum run-length & below peak = TRUE
maximum run-length & above peak = TRUE
Example Data:
library(ggplot2)
library(lubridate)
library(dplyr)
library(dataRetrieval)
siteNumber <- c("01432805","01388000") # United States Geological Survey site numbers
parameterCd <- "00010" # temperature
statCd <- "00003" # mean
startDate <- "1996-01-01"
endDate <- "1997-12-31"
dat <- readNWISdv(siteNumber, parameterCd, startDate, endDate, statCd=statCd) # obtains the timeseries from the USGS
dat <- dat[,c(2:4)]
colnames(dat)[3] <- "temperature"
# To view the time series
ggplot(data = dat, aes(x = Date, y = temperature)) +
  geom_point() +
  theme_bw() +
  facet_wrap(~site_no)
To create the columns described above
dat <- dat %>%
  mutate(year = year(Date),
         doy = yday(Date)) %>% # doy = day of year
  group_by(site_no, year) %>%
  mutate(lt_10 = temperature <= 10,
         peak_doy = doy[which.max(temperature)],
         below_peak = doy < peak_doy,
         after_peak = doy > peak_doy,
         run = data.table::rleid(lt_10))
View(dat)
The ideal output would look as follows:
site_no year doy_below doy_after
1 01388000 1996 111 317
2 01388000 1997 112 312
3 01432805 1996 137 315
4 01432805 1997 130 294
doy_after = the doy of the first row where after_peak == TRUE and run equals the maximum run among the after_peak rows, within group_by(site_no, year)
doy_below = the doy of the first row where below_peak == TRUE and run equals the maximum run among the below_peak rows, within group_by(site_no, year)
For site_no = 01388000 in year = 1996, the max(run) when below_peak == TRUE is 4. The first row where run = 4 and below_peak == TRUE corresponds to the date 1996-04-20, which has doy = 111.
As the data is already grouped, just summarise: within the subset of run where 'below_peak' or 'after_peak' is TRUE, extract the 'doy' where run is at its maximum and take the first element.
library(dplyr)
dat %>%
  summarise(doy_below = first(doy[run == max(run[below_peak])]),
            doy_above = first(doy[run == max(run[after_peak])]),
            .groups = 'drop')
Output:
# A tibble: 4 × 4
site_no year doy_below doy_above
<chr> <dbl> <dbl> <dbl>
1 01388000 1996 111 317
2 01388000 1997 112 312
3 01432805 1996 137 315
4 01432805 1997 130 294

R Calculate change in Weekly values Year on Year (with additional complication)

I have a data set of daily values. It spans from Dec 1, 2018 to April 1, 2020.
The columns are "date" and "value", as shown here:
date <- c("2018-12-01","2000-12-02", "2000-12-03",
...
"2020-03-30","2020-03-31","2020-04-01")
value <- c(1592,1825,1769,1909,2022, .... 2287,2169,2366,2001,2087,2099,2258)
df <- data.frame(date,value)
What I would like to do is sum the values by week and then calculate the week-over-week change from the current to the previous year.
I know that I can sum by week using the following function:
Data_week <- df %>% group_by(category, week = cut(date, "week")) %>% mutate(summed = sum(value))
My questions are twofold:
1) How do I sum by week and then manipulate the dataframe so that I can calculate the week-over-week change (e.g. the week of Dec 1, 2019 vs. the week of Dec 1, 2018)?
2) How can I do the above, but using a "customized" week? Let's say I want to define a week by moving 7 days back from the latest date I have data for, e.g. the latest week would start on March 26th (April 1st - 7 days).
We can use lag from dplyr to help and also some convenience functions from lubridate.
library(dplyr)
library(lubridate)
df %>%
  mutate(year = year(date)) %>%
  group_by(week = week(date), year) %>%
  summarize(summed = sum(value)) %>%
  arrange(year, week) %>%
  ungroup() %>%
  mutate(change = summed - lag(summed))
# week year summed change
# <dbl> <dbl> <dbl> <dbl>
# 1 48 2018 3638. NA
# 2 49 2018 15316. 11678.
# 3 50 2018 13283. -2033.
# 4 51 2018 15166. 1883.
# 5 52 2018 12885. -2281.
# 6 53 2018 1982. -10903.
# 7 1 2019 14177. 12195.
# 8 2 2019 14969. 791.
# 9 3 2019 14554. -415.
#10 4 2019 12850. -1704.
#11 5 2019 1907. -10943.
If you would like to define "weeks" in different ways, there are also isoweek and epiweek. See this answer for a great explanation of your options.
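The change column above compares each week with the immediately preceding one. If the goal is to compare the same calendar week across years (the question's week of Dec 1, 2019 vs. week of Dec 1, 2018), one hedged variation is to lag within each week number instead; a sketch:
df %>%
  mutate(year = year(date)) %>%
  group_by(week = week(date), year) %>%
  summarize(summed = sum(value), .groups = "drop") %>%
  arrange(week, year) %>%
  group_by(week) %>%
  mutate(yoy_change = summed - lag(summed)) %>% # same week, previous year
  ungroup()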
Data
set.seed(1)
df <- data.frame(date = seq.Date(from = as.Date("2018-12-01"), to = as.Date("2019-01-29"), "days"), value = runif(60,1500,2500))
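For the second question (a "customized" week counted in 7-day blocks back from the latest date in the data), a rough sketch, assuming date is of class Date:
df %>%
  mutate(days_back = as.numeric(max(date) - date),
         week_end = max(date) - 7 * (days_back %/% 7)) %>% # each custom week is labelled by the date it ends on
  group_by(week_end) %>%
  summarise(summed = sum(value))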

How to calculate/count the number of extreme precipitation events (above a "threshold") from daily rainfall data in each month, per year

I am working with daily rainfall data and trying to count the extreme events in the time series above a certain threshold value in each month, per year, i.e. the number of times the rainfall exceeded a certain threshold in each month of each year.
The rainfall time series data is from St Lucia and has two columns:
"YEARMODA" - defining the time (format- YYYYMMDD)
"PREP" - rainfall in mm (numeric)
StLucia <- read_excel("C:/Users/hp/Desktop/StLuciaProject.xlsx")
The dataframe I'm working on, i.e. "Precip1", has two columns:
Time (format YYYY-MM-DD)
Precipitation (numeric value)
The code is provided below:
library("imputeTS")
StLucia$YEARMODA <- as.Date(as.character(StLucia$YEARMODA), format = "%Y%m%d")
data1 <- na_ma(StLucia$PREP, k=4, weighting = "exponential")
Precip1 <- data.frame(Time= StLucia$YEARMODA, Precipitation= data1, check.rows = TRUE)
I found the threshold values based on the 95th and 99th percentiles using the quantile() function.
I now want to count the number of "extreme events" of rainfall above this threshold in each month on per year basis.
Please help me out on this. I would be highly obliged by your help. Thank You!
If you are open to a tidyverse method, here is an example with the economics dataset that is built into ggplot2. We can use ntile to assign a percentile group to each observation. Then we group_by the year, and get a count of the values that are in the desired percentiles. Because this is monthly data the counts are pretty low, but it's easily translated to daily data.
library(tidyverse)
library(lubridate) # needed for year(); not attached by core tidyverse at the time
thresholds <- economics %>%
  mutate(
    pctile = ntile(unemploy, 100),
    year = year(date)
  ) %>%
  group_by(year) %>%
  summarise(
    q95 = sum(pctile >= 95L),
    q99 = sum(pctile >= 99L)
  )
arrange(thresholds, desc(q95))
#> # A tibble: 49 x 3
#> year q95 q99
#> <dbl> <int> <int>
#> 1 2010 12 6
#> 2 2011 12 0
#> 3 2009 10 5
#> 4 1967 0 0
#> 5 1968 0 0
#> 6 1969 0 0
#> 7 1970 0 0
#> 8 1971 0 0
#> 9 1972 0 0
#> 10 1973 0 0
#> # ... with 39 more rows
Created on 2018-06-04 by the reprex package (v0.2.0).
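Adapted to the question's daily Precip1 data frame (columns Time and Precipitation), a hedged sketch that fixes the thresholds with quantile(), as the question describes, and then counts exceedances per month and year:
library(dplyr)
library(lubridate)
q95 <- quantile(Precip1$Precipitation, 0.95, na.rm = TRUE)
q99 <- quantile(Precip1$Precipitation, 0.99, na.rm = TRUE)
Precip1 %>%
  group_by(year = year(Time), month = month(Time)) %>%
  summarise(n_extreme_95 = sum(Precipitation > q95), # days above the 95th percentile threshold
            n_extreme_99 = sum(Precipitation > q99), # days above the 99th percentile threshold
            .groups = "drop")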

Assign day of the year to a month

Sample data
df <- data.frame(ID1 = rep(1:1000, each = 5*365),
                 year = rep(rep(2000:2004, each = 365), times = 1000),
                 day = rep(1:365, times = 1000*5),
                 x = runif(365*1000*5))
This data contains a column day which is the day of the year. I need to produce two columns:
Month column: a column of month (which month the day belongs to)
Biweek column: which biweek the day belongs to. There are 24 biweeks in a year. All days <= 15 in a month are in that month's first biweek and days > 15 are in its second.
For example:
1-15 Jan is biweek 1,
16-31 Jan is biweek 2,
1-15 Feb is biweek 3 and
16-28 Feb is biweek 4, and so on.
For sake of simplicity, I am assuming all the years are non-leap years.
Here's the code I have (with help from RS as well) that creates the two columns.
# create a vector of days for each month
months <- list(1:31, 32:59, 60:90, 91:120, 121:151, 152:181, 182:212, 213:243, 244:273, 274:304, 305:334, 335:365)
library(dplyr)
ptm <- proc.time()
df <- df %>%
  mutate(month = sapply(day, function(x) which(sapply(months, function(y) x %in% y))), # this assigns each day to a month
         date = as.Date(paste0(year, '-', format(strptime(paste0('1981-', day), '%Y-%j'), '%m-%d'))), # this creates a vector of dates for a non-leap year
         twowk = month*2 - (as.numeric(format(date, "%d")) <= 15)) %>% # this describes which biweek each day falls into
  dplyr::select(-date)
proc.time() - ptm
user system elapsed
121.71 0.31 122.43
My issue is the time it takes to run this script; I am looking for a solution that is relatively faster.
EDIT: To be clear, I have assumed all years must have 365 days. In one of the answers below, for the year 2000 (a leap year), Feb has 29 days (the last day of Feb is 60 but I want the last day to be 59) and therefore Dec has only 30 days (Dec starts at 336 though it should start at 335). I hope this is clear. My solution addresses this issue but takes a lot of time to run.
Here is a solution using lubridate extractors and replacement functions, as mentioned by Frank in a comment. The key ones are yday<-, mday() and month(), which respectively set the day of year of a date, get the day of month of a date, and get the month of a date. An 8-second running time seems pretty acceptable to me, though I'm sure some optimising could shave that down, perhaps at some loss of generality.
Note also the use of case_when to ensure the correct numbering of days after Feb 29 on a leap year.
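The original (slower) code is not reproduced here; a minimal sketch of that approach, using the question's df and the setters named above (my reconstruction, not the author's exact code), might look like:
library(dplyr)
library(lubridate)
df_out1 <- df %>%
  mutate(date = `yday<-`(ymd(paste0(year, "-01-01")), day), # set the day of year on Jan 1 of each row's year
         # the question assumes 365-day years, so on leap years push day 60
         # and later forward one day past Feb 29
         date = case_when(leap_year(year) & day >= 60 ~ date + 1,
                          TRUE ~ date),
         month = month(date),
         biweek = if_else(mday(date) <= 15, month * 2 - 1, month * 2))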
EDIT: Here is a significantly faster solution. You can just get the mapping of DOYs to months and biweeks for a single year, and then left_join to the main table. 0.36s running time, since you no longer have to repetitively create the date. We also bypass having to use case_when, since the join will take care of the missing days. See that Day 59 of year 2000 is February and Day 60 is March, as requested.
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#>
#> date
tbl <- tibble(
  ID1 = rep(1:1000, each = 5*365),
  year = rep(rep(2000:2004, each = 365), times = 1000),
  day = rep(1:365, times = 1000*5),
  x = runif(365*1000*5)
)
tictoc::tic("")
doys <- tibble(
  day = rep(1:365),
  date = seq.Date(ymd("2001-1-1"), ymd("2001-12-31"), by = 1),
  month = month(date),
  biweek = case_when(
    mday(date) <= 15 ~ (month * 2) - 1,
    mday(date) > 15 ~ month * 2
  )
)
tbl_out2 <- left_join(tbl, select(doys, -date), by = "day")
tictoc::toc()
#> : 0.36 sec elapsed
tbl_out2
#> # A tibble: 1,825,000 x 6
#> ID1 year day x month biweek
#> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 1 2000 1 0.331 1. 1.
#> 2 1 2000 2 0.284 1. 1.
#> 3 1 2000 3 0.627 1. 1.
#> 4 1 2000 4 0.762 1. 1.
#> 5 1 2000 5 0.460 1. 1.
#> 6 1 2000 6 0.500 1. 1.
#> 7 1 2000 7 0.340 1. 1.
#> 8 1 2000 8 0.952 1. 1.
#> 9 1 2000 9 0.663 1. 1.
#> 10 1 2000 10 0.385 1. 1.
#> # ... with 1,824,990 more rows
tbl_out2[55:65, ]
#> # A tibble: 11 x 6
#> ID1 year day x month biweek
#> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 1 2000 55 0.127 2. 4.
#> 2 1 2000 56 0.779 2. 4.
#> 3 1 2000 57 0.625 2. 4.
#> 4 1 2000 58 0.245 2. 4.
#> 5 1 2000 59 0.640 2. 4.
#> 6 1 2000 60 0.423 3. 5.
#> 7 1 2000 61 0.439 3. 5.
#> 8 1 2000 62 0.105 3. 5.
#> 9 1 2000 63 0.218 3. 5.
#> 10 1 2000 64 0.668 3. 5.
#> 11 1 2000 65 0.589 3. 5.
Created on 2018-04-06 by the reprex package (v0.2.0).
You can speed this up almost an order of magnitude by defining date first, reducing redundancy in the date call, and then extracting month from date.
ptm <- proc.time()
df <- df %>% mutate(
  date = as.Date(paste0(year, "-", day), format = "%Y-%j"), # this creates a vector of dates
  month = as.numeric(format(date, "%m")), # extract month
  twowk = month*2 - (as.numeric(format(date, "%d")) <= 15)) %>% # this describes which biweek each day falls into
  dplyr::select(-date)
proc.time() - ptm
# user system elapsed
# 18.58 0.13 18.75
Versus original version in the question
# user system elapsed
# 117.67 0.15 118.45
Filtered for one year. I think it solves the leap issue you described, unless I'm not clear on what you're saying. Last day of Feb is 59 in the df in my result below, but only because day is 0 indexed.
df2000 <- filter(df, year == "2000")
ptm <- proc.time()
df2000 <- df2000 %>% mutate(
  day = day - 1, # dates are 0 indexed
  date = as.Date(day, origin = "2000-01-01"),
  month = as.numeric(as.POSIXlt(date, format = "%Y-%m-%d")$mon + 1),
  bis = month * 2 - (as.numeric(format(date, "%d")) <= 15)
)
proc.time() - ptm
user system elapsed
0.8 0.0 0.8
One year is 0.2 of the whole df, so times reflect that.

Using filter in dplyr to generate values for all rows

library(tidyverse)
library(nycflights13)
nycflights13::flights
If the following expression gives flights per day from the dataset:
daily <- dplyr::group_by( flights, year, month, day)
(per_day <- dplyr::summarize( daily, flights = n()))
I wanted something similar for cancelled flights:
canx <- dplyr::filter( flights, is.na(dep_time) & is.na(arr_time))
canx2 <- canx %>% dplyr::group_by( year, month, day)
My goal was to have the same length of data frame as for all summarised flights.
I can get number of flights cancelled per day:
(canx_day <- dplyr::summarize( canx2, flights = n()))
but obviously this is a slightly shorter data frame, so I cannot run e.g.:
canx_day$propcanx <- per_day$flights/canx_day$flights
Even if I introduce NAs I can replace them.
So my question is, should I not be using filter, or are there arguments to filter I should be applying?
Many thanks
You should not be using filter. As others suggest, this is easy with a canceled column, so our first step will be to create that column. Then you can easily get whatever you want with a single summarize. For example:
flights %>%
  mutate(canceled = as.integer(is.na(dep_time) & is.na(arr_time))) %>%
  group_by(year, month, day) %>%
  summarize(n_scheduled = n(),
            n_not_canceled = sum(!canceled),
            n_canceled = sum(canceled),
            prop_canceled = mean(canceled))
# # A tibble: 365 x 7
# # Groups: year, month [?]
# year month day n_scheduled n_not_canceled n_canceled prop_canceled
# <int> <int> <int> <int> <int> <int> <dbl>
# 1 2013 1 1 842 838 4 0.004750594
# 2 2013 1 2 943 935 8 0.008483563
# 3 2013 1 3 914 904 10 0.010940919
# 4 2013 1 4 915 909 6 0.006557377
# 5 2013 1 5 720 717 3 0.004166667
# 6 2013 1 6 832 831 1 0.001201923
# 7 2013 1 7 933 930 3 0.003215434
# 8 2013 1 8 899 895 4 0.004449388
# ...
This gives you flights and canceled flights per day, grouped by flight, year, month and day:
nycflights13::flights %>%
  group_by(flight, year, month, day) %>%
  summarize(per_day = n(),
            canx = sum(ifelse(is.na(arr_time), 1, 0)))
There is a simple way to calculate the number of flights canceled per day. Let's assume that a Cancelled column is TRUE for cancelled flights. If so, the way to calculate daily cancelled flights is:
flights %>%
  group_by(year, month, day) %>%
  summarize(canx_day = sum(Cancelled))
canx_day will contain the number of cancelled flights for each day.
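One note: nycflights13::flights has no Cancelled column, so it would need to be created first, e.g. along the lines of the first answer's canceled flag (a sketch):
library(tidyverse)
library(nycflights13)
flights %>%
  mutate(Cancelled = is.na(dep_time) & is.na(arr_time)) %>%
  group_by(year, month, day) %>%
  summarize(canx_day = sum(Cancelled), .groups = "drop")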
