I have daily temperature data in this format, running from 1950 to 2017.
I need to compute the snowmelt onset date, which is defined as the first day when the daily temperature is above 0 C, following the last five-day period between March and May when the daily temperature is below 0 C. My code so far:
# applying the condition to check each temperature value
df1$boolean<- ifelse(df1$temp<0.0 , 1, 0)
#computing the total sum < 0 and the start and end date
snow<-df1 %>%
mutate(boolean = ifelse(is.na(boolean), 0, boolean)) %>%
group_by(group = cumsum(c(0, diff(boolean) != 0))) %>%
filter(boolean == 1 & n() > 1) %>%
summarize("Start Date"=min(as.character(date)),
"End Date"=max(as.character(date)),
"Length of Run"=n()) %>%
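ungroup()
# rename by name rather than by position, since the grouping column is still present
names(snow)[names(snow) == "Length of Run"] <- 'length'
# keep only runs of at least 5 days
obs<-subset(snow,length >=5)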
The code above gives me a partial solution (with further manual editing I can get a result that matches my definition). I am only interested in one onset date for each year, and I need some guidance on how to edit this code to compute the onset date based on the definition above.
I have a number of locations, so manually editing this would not be an ideal solution.
Your help would be appreciated.

Two solutions follow. In (1) we assume that the melt day must occur in Mar, Apr or May. In (2) we assume that only the 5 subzero days must occur in Mar, Apr or May, so the melt day could fall in June, say.
1) Define df2 which is df1 plus additional columns: month, year and code where code is 0 if the date is not in Mar, Apr, May and is otherwise 1 if temp < 0 and 2 if temp >= 0.
Now using df2 run rollapplyr on code returning TRUE if the most recent 6 dates have codes 1, 1, 1, 1, 1, 2 and otherwise FALSE. Take the TRUE rows and only keep the last in each year. Right join that to a data frame of all years in order to generate NAs in the output for any missing years.
library(dplyr)
library(zoo) # for rollapplyr
df2 <- df1 %>%
mutate(Date = as.Date(Date), month = as.numeric(format(Date, "%m")),
year = as.numeric(format(Date, "%Y")),
code = (month %in% 3:5) * ((temp < 0) + 2 * (temp >= 0)),
OK = rollapplyr(code, 6, identical, c(1, 1, 1, 1, 1, 2), fill = FALSE))
df2 %>%
filter(OK) %>%
filter(!duplicated(year, fromLast = TRUE)) %>%
right_join(unique(df2["year"]), by = "year") %>%
select(year, Date)
year Date
1 1950 1950-05-24
2 1951 1951-05-21
3 1952 1952-05-28
4 1953 1953-05-15
5 1954 1954-05-28
6 1955 1955-05-14
7 1956 1956-05-27
8 1957 1957-05-17
9 1958 1958-05-21
10 1959 <NA>
11 1960 1960-05-26
12 1961 1961-05-16
13 1962 1962-05-19
14 1963 1963-05-13
15 1964 1964-05-27
16 1965 1965-05-20
17 1966 1966-05-26
18 1967 1967-05-26
19 1968 1968-05-27
20 1969 1969-05-30
21 1970 1970-05-21
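As a rough illustration of the pattern match in (1), the sketch below runs the same right-aligned six-day comparison on a made-up code vector. It uses a plain base R loop as a stand-in for rollapplyr, and the toy values and the names target, k and ok are assumptions for this example only.

```r
# Toy codes: 0 = outside Mar-May, 1 = subzero day, 2 = day at or above zero
code <- c(2, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2)
target <- c(1, 1, 1, 1, 1, 2)
k <- length(target)

# Right-aligned window: position i looks at code[(i - k + 1):i];
# the first k - 1 positions have no full window and are FALSE
ok <- vapply(seq_along(code), function(i) {
  i >= k && identical(code[(i - k + 1):i], target)
}, logical(1))

which(ok)           # positions where five 1s are followed by a 2
tail(which(ok), 1)  # the last such position, as kept per year in (1)
```

Only positions 7 and 17 match here: the 2 at position 10 follows just two subzero days, so it is not flagged.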
2) In (1) we assumed that the melt onset day must be in Mar, Apr or May but here we assume that only the subzero days lie in that range and the melt onset day may extend further out.
Calculations are the same as in (1) except that the codes are now such that 1 indicates a subzero temperature in Mar, Apr or May, 2 indicates a temperature at or above zero at any time (not just in Mar, Apr and May) and 0 is anything else. We collapse the codes into a character string (one character per date) and use a regular expression on it to look for a substring of 5 ones followed by anything up to the next 2. We process the rest as in (1), except that now we don't need the join since there will always be a melt onset day. Without the join we can represent this as a single pipeline.
df1 %>%
mutate(Date = as.Date(Date), month = as.numeric(format(Date, "%m")),
year = as.numeric(format(Date, "%Y")),
code = (month %in% 3:5) * (temp < 0) + 2 * (temp >= 0),
OK = { g <- gregexpr("1{5}.*?2", paste(code, collapse = ""))[[1]]
seq_along(code) %in% (g + attr(g, "match.length") - 1) }) %>%
filter(OK) %>%
filter(!duplicated(year, fromLast = TRUE)) %>%
select(year, Date)
year Date
1 1950 1950-05-24
2 1951 1951-06-01
3 1952 1952-05-28
4 1953 1953-05-15
5 1954 1954-05-28
6 1955 1955-05-14
7 1956 1956-05-27
8 1957 1957-05-17
9 1958 1958-05-21
10 1959 1959-06-02
11 1960 1960-05-26
12 1961 1961-05-16
13 1962 1962-05-19
14 1963 1963-06-01
15 1964 1964-05-27
16 1965 1965-05-20
17 1966 1966-05-26
18 1967 1967-05-26
19 1968 1968-05-27
20 1969 1969-05-30
21 1970 1970-05-21
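The regular expression step in (2) may be easier to follow on a short made-up vector. This is only a sketch: the toy codes are invented, and perl = TRUE is added here as an assumption to make the lazy `.*?` explicitly PCRE, whereas the pipeline above calls gregexpr with its defaults.

```r
# Toy codes: 1 = subzero day in Mar-May, 2 = day at or above zero, 0 = anything else
code <- c(2, 1, 1, 1, 1, 1, 0, 0, 2, 1, 1, 2)
s <- paste(code, collapse = "")  # one character per day

# Five 1s, then as little as possible, then the next 2
g <- gregexpr("1{5}.*?2", s, perl = TRUE)[[1]]

# End position of each match = start + length - 1, i.e. the day holding the 2
ends <- as.integer(g + attr(g, "match.length") - 1)
onset <- seq_along(code) %in% ends
which(onset)
```

The match starts at day 2 and runs through the two 0 days to the 2 at day 9, which is the day flagged as onset.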

A straightforward solution in tidyverse.
library(dplyr)
library(lubridate) # for year() and month()
df1 <- readxl::read_excel("temp.xlsx")
df1 %>%
mutate(year = year(Date),
month = month(Date)) %>%
group_by(year) %>%
mutate(below_0 = as.numeric(temp < 0),
streak5 = cumsum(below_0) - cumsum(lag(below_0, 5, 0)),
onset = month %in% c(3, 4, 5) & lag(streak5) == 5 & below_0 == 0) %>%
filter(onset) %>%
summarise(Date = last(Date))
# A tibble: 20 x 2
year Date
<dbl> <dttm>
1 1950 1950-05-24
2 1951 1951-05-21
3 1952 1952-05-28
4 1953 1953-05-15
5 1954 1954-05-28
6 1955 1955-05-14
7 1956 1956-05-27
8 1957 1957-05-17
9 1958 1958-05-21
10 1960 1960-05-26
11 1961 1961-05-16
12 1962 1962-05-19
13 1963 1963-05-13
14 1964 1964-05-27
15 1965 1965-05-20
16 1966 1966-05-26
17 1967 1967-05-26
18 1968 1968-05-27
19 1969 1969-05-30
20 1970 1970-05-21
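I hope the code more or less explains itself: streak5 counts the days with temp below 0 in the five-day window ending on the current day, onset implements the criteria given in the question (the previous five days were all below 0 and the current day is not), and summarise picks the last such date in a given year.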
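To see how the difference of two cumulative sums gives a five-day running count, here is a small base R sketch on made-up values. The vector and the helper names lag5 and prev_streak are assumptions for illustration; the pipeline above uses dplyr's lag() instead.

```r
# Toy indicator: 1 = day below 0, 0 = day at or above 0
below_0 <- c(0, 1, 1, 1, 1, 1, 0, 1, 1, 0)

# below_0 shifted forward by 5 days, padded with 0 at the start
lag5 <- c(rep(0, 5), head(below_0, -5))

# Count of subzero days in the 5-day window ending on the current day
streak5 <- cumsum(below_0) - cumsum(lag5)

# Onset: the previous day's window was all subzero and today is not subzero
prev_streak <- c(NA, head(streak5, -1))
onset <- !is.na(prev_streak) & prev_streak == 5 & below_0 == 0
which(onset)
```

Here streak5 reaches 5 on day 6, so day 7 (the first day that is not below 0 afterwards) is flagged.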

rle() to the rescue!
library(tidyverse)
temp <- read_csv("temp.csv")
It is best to read the pipeline further down before reading this helper function.
For each year we:
take a run-length encoding of the below-0 flag
take the first run that is TRUE (< 0) and lasts 5 or more consecutive days as our candidate
take the index of the day right after that run
if that index is past the end of the data (no day fits the criteria), return NA
otherwise return that date
mk_runs <- function(xdf) {
  r <- rle(xdf$below_0) # take the T/F RLE
  pos <- which(r$values & r$lengths >= 5)[1] # find the first run meeting the criteria
  # sum the lengths up to and including that run, then add 1 to reach the
  # first day that is not below 0; NA if no run qualifies
  idx <- if (is.na(pos)) NA else sum(r$lengths[1:pos]) + 1
  if (is.na(idx) || idx > nrow(xdf)) { # no qualifying run, or past our date range: return NA
    tibble(year = xdf$year[1], date = as.Date(NA))
  } else {
    xdf[idx, c("year", "date")]
  }
}
We need to get the data into shape:
separate(temp, Date, c("month", "day", "year")) %>%
mutate_all(as.numeric) %>%
mutate(year = ifelse(year >=50, 1900+year, 2000+year)) %>%
mutate(date = as.Date(sprintf("%04d-%02d-%02d", year, month, day))) %>%
mutate(month = lubridate::month(date)) %>%
mutate(below_0 = temp < 0) %>%
filter(month >= 3 & month <=5) %>%
group_by(year) %>% # year groups
arrange(date) %>% # in order
do(mk_runs(.)) # see above function
## # A tibble: 21 x 2
## # Groups: year [21]
## year date
## <dbl> <date>
## 1 1950 1950-04-30
## 2 1951 1951-05-21
## 3 1952 1952-05-28
## 4 1953 1953-05-15
## 5 1954 1954-05-28
## 6 1955 1955-05-14
## 7 1956 1956-05-02
## 8 1957 1957-05-07
## 9 1958 1958-04-27
## 10 1959 NA
## 11 1960 1960-04-24
## 12 1961 1961-05-16
## 13 1962 1962-05-19
## 14 1963 1963-05-13
## 15 1964 1964-05-20
## 16 1965 1965-05-20
## 17 1966 1966-05-07
## 18 1967 1967-04-27
## 19 1968 1968-05-10
## 20 1969 1969-05-22
## 21 1970 1970-05-21
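For readers who have not used rle() before, the following sketch walks through the index arithmetic of the helper on a made-up logical vector. The values are invented for illustration and stand in for one year's below_0 column.

```r
# Toy flag: TRUE = day below 0
below_0 <- c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE)

r <- rle(below_0)
r$values   # TRUE FALSE TRUE FALSE TRUE
r$lengths  # 2 1 5 1 1

# First run that is TRUE and at least 5 days long
pos <- which(r$values & r$lengths >= 5)[1]

# Days covered by all runs up to and including that one, plus 1,
# gives the index of the first day after the run
idx <- sum(r$lengths[1:pos]) + 1
idx
```

The qualifying run is the third one (days 4 to 8), so idx is 9. Using tail(which(...), 1) in place of [1] would pick the last qualifying run instead of the first, which may be closer to the wording of the question.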

Here is another attempt. In the first step, I created two new columns (year and month) and filtered the data to the months between March and May. Then, per year, I created index numbers for the rows whose temperature is higher than 0 Celsius. Since five consecutive days are needed before a day with temperature above zero, index numbers equal to or smaller than 5 must be ignored. This is done by the inner if_else(), which sits in the true branch of the outer if_else().
In the second step, I chose to use a package called SOfun, which is developed by the author of splitstackshape. You can download this package from GitHub. getMyRows() does three things: 1) it identifies which rows should be considered, as specified by pattern, 2) it gets a certain range of rows around each row marked in 1), and 3) it creates a list. Here range = -5:0 means that I am choosing the five rows before a target row, and the target row itself.
In the third step, I subsetted mylist with two logical conditions. !is.na(x$ind[6]) checks that the 6th element of ind is not NA, and all(x$temp[1:5] < 0) checks that the 1st to 5th elements of temp (temperature) are all smaller than zero. Filter() keeps the list elements that satisfy both conditions. Then I extracted the 6th row from each data frame, since that is the target row. I bound the list into one data frame, grouped the data by year and chose the first observation for each year using slice().
library(readxl)
library(dplyr)
library(SOfun) # provides getMyRows()
# Part 1
mydf <- read_excel("temp.xlsx") %>%
mutate(year = as.numeric(format(Date, "%Y")),
month = as.numeric(format(Date, "%m"))) %>%
filter(between(month, 3, 5)) %>%
group_by(year) %>%
mutate(ind = if_else(temp > 0,
{ind <- row_number()
if_else(ind <= 5, NA_integer_, ind)},
NA_integer_))
# Part 2
mylist <- getMyRows(mydf,
pattern = which(complete.cases(mydf$ind)),
range = -5:0, isNumeric = TRUE)
# Part 3
Filter(function(x) !is.na(x$ind[6]) & all(x$temp[1:5] < 0), mylist) %>%
lapply(function(x) x[6, ]) %>%
bind_rows %>%
group_by(year) %>%
slice(1)
year Date
<dbl> <dttm>
1 1950 1950-04-30 00:00:00
2 1951 1951-05-21 00:00:00
3 1952 1952-05-28 00:00:00
4 1953 1953-05-15 00:00:00
5 1954 1954-05-28 00:00:00
6 1955 1955-05-14 00:00:00
7 1956 1956-05-02 00:00:00
8 1957 1957-05-07 00:00:00
9 1958 1958-04-27 00:00:00
10 1960 1960-04-24 00:00:00
11 1961 1961-05-16 00:00:00
12 1962 1962-05-19 00:00:00
13 1963 1963-05-13 00:00:00
14 1964 1964-05-20 00:00:00
15 1965 1965-05-20 00:00:00
16 1966 1966-05-07 00:00:00
17 1967 1967-04-27 00:00:00
18 1968 1968-05-10 00:00:00
19 1969 1969-05-22 00:00:00
20 1970 1970-05-21 00:00:00
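The window-and-filter idea behind Parts 2 and 3 can be sketched without SOfun. The base R code below is a swapped-in illustration on made-up temperatures, not the getMyRows() implementation: it takes the five rows before each candidate day plus the day itself, then keeps candidates whose five preceding days are all below zero.

```r
# Toy temperatures for one year, already restricted to Mar-May
temp <- c(-1, -2, -3, -1, -2, 1, -1, 2, -4, -1, -2, -3, -5, 3)

# Candidate rows: temperature above 0 and at least five rows before them
targets <- which(temp > 0 & seq_along(temp) > 5)

# For each candidate, the five previous values plus the candidate itself
windows <- lapply(targets, function(i) temp[(i - 5):i])

# Keep candidates whose five preceding days are all below zero
keep <- vapply(windows, function(w) all(w[1:5] < 0), logical(1))
targets[keep]
```

Row 8 is dropped because one of its five preceding days (row 6) is above zero. Taking the first element of targets[keep] corresponds to the slice(1) step above.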


