Calculate number of negative values between two dates - r

I have a data frame of SPEI values. I want to calculate two statistics (explained below) at 20-year intervals, i.e. 2021-2040, 2041-2060, 2061-2080, 2081-2100. The first column contains the Date (month-year), and the data covers each year, i.e. 2021, 2022, 2023, etc., up to 2100.
The statistics are:
Drought frequency: the number of times SPEI < 0 in the specified period (the 20-year and 1-year periods, respectively)
Drought duration: equal to the number of months between a drought's start month (included) and end month (not included) within the specified period. I am assuming a drought event starts when SPEI < 0.
Is there a way to do that in R? It seems like an easy problem, but I don't know how to do it. Please help me out; Excel is taking too long. Thanks.
> head(test, 20)
Date spei-3
1 2021-01-01 NA
2 2021-02-01 NA
3 2021-03-01 -0.52133737
4 2021-04-01 -0.60047887
5 2021-05-01 0.56838399
6 2021-06-01 0.02285012
7 2021-07-01 0.26288462
8 2021-08-01 -0.14314685
9 2021-09-01 -0.73132256
10 2021-10-01 -1.23389220
11 2021-11-01 -1.15874943
12 2021-12-01 0.27954143
13 2022-01-01 1.14606657
14 2022-02-01 0.66872986
15 2022-03-01 -1.13758050
16 2022-04-01 -0.27861017
17 2022-05-01 0.99992395
18 2022-06-01 0.61024314
19 2022-07-01 -0.47450485
20 2022-08-01 -1.06682997
Edit:
I would very much like to add some code, but I don't know where to start.
test <- readxl::read_excel("E:/drought.xlsx")  # read the data (assuming the readxl package for an .xlsx file)
# Extract year and month and add them as columns
test$Year <- format(test$Date, "%Y")
test$Month <- format(test$Date, "%B")
I don't know how to go on from here. I found that cumsum can help, but how do I select one year and then apply cumsum to it? I am not withholding code on purpose; I just don't know where or how to begin.

There are a couple of questions in the OP's post, so I will go through them step by step. You'll need the dplyr and lubridate packages for this workflow.
First, we create some fake data to use:
library(lubridate)
library(dplyr)
#create example data
dd <- data.frame(Date = seq.Date(as.Date("2021-01-01"), as.Date("2100-12-01"), by = "month"),
                 spei = rnorm(960, 0, 2))
That will look like this (shown here after the grouping columns created below have been added), similar to what you have above:
> head(dd)
Date spei year year_20 drought
1 2021-01-01 -6.85689789 2021 2021_2040 1
2 2021-02-01 -0.09292459 2021 2021_2040 1
3 2021-03-01 0.13715922 2021 2021_2040 0
4 2021-04-01 2.26805601 2021 2021_2040 0
5 2021-05-01 -0.47325008 2021 2021_2040 1
6 2021-06-01 0.37034138 2021 2021_2040 0
Then we can use lubridate and cut to create yearly and 20-year variables to group by later, and create a column drought signifying whether spei was negative.
#create a column to group on by year and by 20-year
dd <- dd %>%
  mutate(year = year(Date),
         year_20 = cut(year, breaks = c(2020, 2040, 2060, 2080, 2100), include.lowest = TRUE,
                       labels = c("2021_2040", "2041_2060", "2061_2080", "2081_2100"))) %>%
  # column signifying if that month was a drought
  mutate(drought = ifelse(spei < 0, 1, 0))
Once we have that, we just use the group_by function to get the frequency (i.e. the number of months with a drought) by year or by 20-year period.
#by year
dd %>%
  group_by(year) %>%
  summarise(year_freq = sum(drought)) %>%
  ungroup()
# A tibble: 80 x 2
year year_freq
<dbl> <dbl>
1 2021 6
2 2022 4
3 2023 7
4 2024 6
5 2025 6
6 2026 7
#by 20-year group
dd %>%
  group_by(year_20) %>%
  summarise(year20_freq = sum(drought)) %>%
  ungroup()
# A tibble: 4 x 2
year_20 year20_freq
<fct> <dbl>
1 2021_2040 125
2 2041_2060 121
3 2061_2080 121
4 2081_2100 132
Calculating drought duration is a bit more complicated. It involves:
1) identifying the first month of each drought
2) calculating the length of each drought
3) combining the information from 1) and 2) together
We can use lag to identify when a month changed from "no drought" to "drought". In this case we want an index of where the value in row i differs from the value in row i-1.
# find index of where values change.
change.ind <- dd$drought != lag(dd$drought)
#use index to find drought start
drought.start <- dd[change.ind & dd$drought == 1,]
This results in a subset of the initial dataset containing only the rows with the first month of each drought. Then we can use rle to calculate the length of each drought. rle computes the length of every run of numbers, so we will have to subset to only those runs where the value == 1 (drought).
#calculate drought lengths
drought.lengths <- rle(dd$drought)
# we only want droughts (values = 1)
drought.lengths <- drought.lengths$lengths[drought.lengths$values==1]
Now we can combine these two pieces of information together. The first row is an NA because there is no value at i-1 to compare the lag to. It can be dropped, unless you want to include that data.
drought.dur <- cbind(drought.start, drought_length = drought.lengths)
head(drought.dur)
Date spei year year_20 drought drought_length
NA <NA> NA NA <NA> NA 2
5 2021-05-01 -0.47325008 2021 2021_2040 1 1
9 2021-09-01 -2.04564549 2021 2021_2040 1 1
11 2021-11-01 -1.04293866 2021 2021_2040 1 2
14 2022-02-01 -0.83759671 2022 2021_2040 1 1
17 2022-05-01 -0.07784316 2022 2021_2040 1 1
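From here, if you also want duration statistics per period, one possible follow-up is to summarise drought.dur by the 20-year groups (a sketch, using the object and column names created above; the leading NA row is dropped first):
drought.dur %>%
  filter(!is.na(Date)) %>%
  group_by(year_20) %>%
  summarise(n_droughts = n(),
            mean_duration = mean(drought_length),
            max_duration = max(drought_length)) %>%
  ungroup()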

Related

Six-month peak-season running average

I'm trying to implement this:
The recommendation is a peak season ozone AQG level of 60 μg/m3
(the average of daily maximum 8-hour mean ozone concentrations).
The peak season is defined as the six consecutive months of the year
with the highest six-month running-average ozone concentration.
In regions away from the equator, this period will typically be in the
warm season within a single calendar year (northern hemisphere)
or spanning two calendar years (southern hemisphere). Close to
the equator, such clear seasonal patterns may not be obvious, but a
running-average six-month peak season will usually be identifiable
from existing monitoring or modelling data.
I have:
# A tibble: 300 × 2
date value
<dttm> <dbl>
1 1997-01-01 00:00:00 NA
2 1997-02-01 00:00:00 NA
3 1997-03-01 00:00:00 NA
4 1997-04-01 00:00:00 30.2
5 1997-05-01 00:00:00 20.9
6 1997-06-01 00:00:00 10.1
7 1997-07-01 00:00:00 9.40
8 1997-08-01 00:00:00 22.4
9 1997-09-01 00:00:00 26.2
10 1997-10-01 00:00:00 32.9
# … with 290 more rows
Every year is complete (with or without NAs). I found the peaks with findpeaks from the pracma package:
peaks = findpeaks(mda8_omit$value, minpeakdistance = 6,
                  minpeakheight = mean(mda8_omit$value))
How do I optimize to get the best six months by peak? For the northern hemisphere it is easier because the peak falls within a single year (summer), but in the southern hemisphere it is split across two years, and the peaks may change depending on latitude. Any ideas on how to continue?
Assuming that:
1) we only use windows with 6 consecutive months of data,
2) the year that a window falls in is determined by the last month of the window,
3) we compare all such windows, at most 12, within each calendar year,
calculate the rolling mean and then, grouping by year, take the row with the largest rolling mean within the year. This row is the last month of the 6-month window. The input is shown reproducibly in the Note at the end.
library(dplyr)
library(zoo)
DF %>%
  mutate(date = as.yearmon(date),
         peakmean = rollapplyr(value, 6, mean, fill = NA)) %>%
  group_by(year = as.integer(date)) %>%
  slice_max(peakmean) %>%
  ungroup %>%
  select(-year)
## # A tibble: 1 × 3
## date value peakmean
## <yearmon> <dbl> <dbl>
## 1 Oct 1997 32.9 20.3
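Since the date shown is the last month of the 6-month window, the window's first month can be recovered with yearmon arithmetic (a sketch; the name peak is introduced here just to hold the result of the pipeline above):
peak <- DF %>%
  mutate(date = as.yearmon(date),
         peakmean = rollapplyr(value, 6, mean, fill = NA)) %>%
  group_by(year = as.integer(date)) %>%
  slice_max(peakmean) %>%
  ungroup() %>%
  select(-year)
# yearmon values step in units of 1/12, so subtracting 5/12 moves back five months
peak %>% mutate(window_start = date - 5/12)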
Note
Lines <- "date value
1 1997-01-01T00:00:00 NA
2 1997-02-01T00:00:00 NA
3 1997-03-01T00:00:00 NA
4 1997-04-01T00:00:00 30.2
5 1997-05-01T00:00:00 20.9
6 1997-06-01T00:00:00 10.1
7 1997-07-01T00:00:00 9.40
8 1997-08-01T00:00:00 22.4
9 1997-09-01T00:00:00 26.2
10 1997-10-01T00:00:00 32.9"
DF <- read.table(text = Lines)

Convert tibble to time series

I tried to download COVID-19 data provided by The Economist's GitHub repository.
library(readr)
library(knitr)
myfile <- "https://raw.githubusercontent.com/TheEconomist/covid-19-excess-deaths-tracker/master/output-data/excess-deaths/all_weekly_excess_deaths.csv"
test <- read_csv(myfile)
What I get is a tibble data frame, and I am unable to easily access the data stored in that tibble. I would like to look at one column, say test$covid_deaths_per_100k, and re-shape it into a matrix or ts object with rows referring to time and columns referring to countries.
I tried it manually, but I failed. Then I tried with the tsibble package and failed again:
tsibble(test[c("covid_deaths_per_100k","country")],index=test$start_date)
Error: Must extract column with a single valid subscript.
x Subscript `var` has the wrong type `date`.
ℹ It must be numeric or character.
So, I guess the problem is that the data are stacked by countries and hence the time index is duplicated. I would need some of these magic pipe functions to make this work? Is there an easy way to do that, perhaps without piping?
A valid tsibble must have distinct rows identified by key and index:
as_tsibble(test, index = start_date, key = c(country, region))
# A tsibble: 11,715 x 17 [1D]
# Key: country, region [176]
country region region_code start_date end_date days year week population total_deaths
<chr> <chr> <chr> <date> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Australia Australia 0 2020-01-01 2020-01-07 7 2020 1 25734100 2497
2 Australia Australia 0 2020-01-08 2020-01-14 7 2020 2 25734100 2510
3 Australia Australia 0 2020-01-15 2020-01-21 7 2020 3 25734100 2501
4 Australia Australia 0 2020-01-22 2020-01-28 7 2020 4 25734100 2597
5 Australia Australia 0 2020-01-29 2020-02-04 7 2020 5 25734100 2510
6 Australia Australia 0 2020-02-05 2020-02-11 7 2020 6 25734100 2530
7 Australia Australia 0 2020-02-12 2020-02-18 7 2020 7 25734100 2613
8 Australia Australia 0 2020-02-19 2020-02-25 7 2020 8 25734100 2608
9 Australia Australia 0 2020-02-26 2020-03-03 7 2020 9 25734100 2678
10 Australia Australia 0 2020-03-04 2020-03-10 7 2020 10 25734100 2602
# ... with 11,705 more rows, and 7 more variables: covid_deaths <dbl>, expected_deaths <dbl>,
# excess_deaths <dbl>, non_covid_deaths <dbl>, covid_deaths_per_100k <dbl>,
# excess_deaths_per_100k <dbl>, excess_deaths_pct_change <dbl>
ts works best with monthly, quarterly or annual series. Here we show a few approaches.
1) monthly: This creates a monthly zoo object z from the indicated test columns, splitting by country and aggregating to produce a monthly time series. It then creates a ts object from that.
library(zoo)
z <- read.zoo(test[c("start_date", "country", "covid_deaths")],
              split = "country", FUN = as.yearmon, aggregate = sum)
as.ts(z)
2) weekly: To create a weekly ts object with frequency 53:
to_weekly <- function(x) {
  yr <- as.integer(as.yearmon(x))
  wk <- as.integer(format(as.Date(x), "%U"))
  yr + wk/53
}
z <- read.zoo(test[c("start_date", "country", "covid_deaths")],
              split = "country", FUN = to_weekly, aggregate = sum)
as.ts(z)
3) daily: If you want a series where the times are dates, then omit the FUN argument and use zoo directly:
z <- read.zoo(test[c("end_date", "country", "covid_deaths")],
              split = "country", aggregate = sum)
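And if what you ultimately want is a plain matrix with rows referring to time and columns referring to countries, the zoo object can be unpacked directly (a sketch based on the daily object z from 3; substitute covid_deaths_per_100k for covid_deaths if that is the column you need):
# coredata() returns the underlying numeric matrix (columns = countries);
# index() gives the matching dates, used here as row names
m <- coredata(z)
rownames(m) <- as.character(index(z))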

How to Calculate rolling average of a month Day by Day with flexible window in r?

I am trying to calculate a rolling average of COVID cases in the month of March, day by day.
For example, on 5 March it should take the mean of cases for the first 5 days of March; on the 20th it should take the mean of the first 20 days.
I have written a small piece of code for this, but is there a prebuilt function or a better way of doing this?
df:
Country.Region Date Cases_count
<chr> <date> <dbl>
1 France 2021-03-01 4730
2 France 2021-03-02 22872
3 France 2021-03-03 26903
4 France 2021-03-04 25286
5 France 2021-03-05 23507
6 France 2021-03-06 23306
7 France 2021-03-07 21835
8 France 2021-03-08 5534
9 France 2021-03-09 23143
10 France 2021-03-10 29674
code:
max_date <- ymd(max(df$Date))
march <- seq(ymd("2021-03-01"), ymd(max_date), by = "day")
rolling_data <- lapply(march, function(x){
  rolling_avg <- df %>%
    filter(Country.Region == "France",
           Date %in% c(ymd("2021-03-01"): x)) %>%
    summarise(rolling_mean = mean(Cases_count)) #%>%
  # from: https://stackoverflow.com/questions/61038643/loop-through-irregular-list-of-numbers-to-append-rows-to-summary-table
  data.frame(Date = x, rolling_march = rolling_avg)
})
do.call(rbind, rolling_data)
output:
Date rolling_mean
1 2021-03-01 4730.00
2 2021-03-02 13801.00
3 2021-03-03 18168.33
4 2021-03-04 19947.75
5 2021-03-05 20659.60
6 2021-03-06 21100.67
7 2021-03-07 21205.57
8 2021-03-08 19246.62
9 2021-03-09 19679.56
10 2021-03-10 20679.00
Issue: to use this along with the case counts, I will have to do some join. So if there is a prebuilt function, I can probably use it with mutate or summarise instead.
So what you actually want is a cumulative average, not a rolling/moving average.
A much easier approach is to use cumsum. For example, if you have a vector x with N elements, the cumulative mean can be expressed as:
cumulative_mean <- cumsum(x) / seq_len(length(x))
For an actual rolling mean, the zoo package provides zoo::rollmean.
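As a sketch of how that plugs into the workflow from the question (column names taken from the df shown above; dplyr also ships cummean(), which does the same thing):
library(dplyr)
df %>%
  filter(Country.Region == "France",
         Date >= as.Date("2021-03-01")) %>%
  arrange(Date) %>%
  mutate(rolling_march = cumsum(Cases_count) / seq_along(Cases_count),
         # equivalent, using dplyr's built-in cumulative mean
         rolling_march2 = cummean(Cases_count))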

How do I know what day of the week a date is

I've got the following problem: I have the daily stock exchange rates of a certain share stored in a vector, with the corresponding dates (from 2015 to 2017).
I need to extract the last exchange rate of every week.
This means I need to know what weekday corresponds to every date and store those rates in a vector (or delete the other rows from the existing vector). I did this using wday (from lubridate) and then did the following:
vector <- stockexchangerate
weekdays <- wday(stockexchangerate)  ## length = 35; Monday = 2, Tuesday = 3, ...
for (i in 1:10) {
  if (weekdays[i] < 6) {
    vector <- vector[-c(i)]
  }
}
But this only has the consequence that seemingly random rows are deleted, and if I run this code six times, there is only one row left, although some of the values were taken on a Friday. Can anyone help me?
Yes, using lubridate was a good insight. I would extract the day of the week using lubridate::wday with the argument label = TRUE and filter on that column.
Assuming that you have a data frame with two columns (one for the dates and one for the value of the rates), you can do:
library(tidyverse)
library(lubridate)
# DATA
#> df <- tibble(date = mdy("02/15/1980") + 1:300,
#> value = 1:300)
df %>%
mutate(day = wday(date, label = TRUE)) %>%
filter(day == "Fri")
#> # A tibble: 42 x 3
#> date value day
#> <date> <int> <ord>
#> 1 1980-02-22 7 Fri
#> 2 1980-02-29 14 Fri
#> 3 1980-03-07 21 Fri
#> 4 1980-03-14 28 Fri
#> 5 1980-03-21 35 Fri
#> 6 1980-03-28 42 Fri
#> 7 1980-04-04 49 Fri
#> 8 1980-04-11 56 Fri
#> 9 1980-04-18 63 Fri
#> 10 1980-04-25 70 Fri
#> # … with 32 more rows
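To then keep only the last available rate of every week (what the question is ultimately after), one option is to group by ISO week and take the latest date within each group; a sketch, assuming the same two-column df as above:
df %>%
  mutate(iso_year = isoyear(date),
         iso_week = isoweek(date)) %>%
  group_by(iso_year, iso_week) %>%
  slice_max(date, n = 1) %>%   # last trading day present in that week
  ungroup()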

Extend data frame column with inflation in R

I'm trying to extend some code to be able to:
1) read in a vector of prices
2) left join that vector of prices to a data frame of years (or years and months)
3) append/fill the prices for missing years with interpolated data based on the last year of available prices plus a specified inflation rate.
Consider an example like this one:
prices <- data.frame(year = 2018:2022,
                     wti = c(75, 80, 90, NA, NA),
                     brent = c(80, 85, 94, 93, NA))
What I need is something that will fill the missing rows of each column with the last price plus inflation (suppose 2%). I can do this in a pretty brute force way as:
i_rate <- 0.02
for (i in c(1:nrow(prices))) {
  if (is.na(prices$wti[i]))
    prices$wti[i] <- prices$wti[i-1] * (1 + i_rate)
  if (is.na(prices$brent[i]))
    prices$brent[i] <- prices$brent[i-1] * (1 + i_rate)
}
It seems to me there should be a way to do this using some combination of apply() and/or fill() but I can't seem to make it work.
Any help would be much appreciated.
As noted by @camille, the problem with dplyr::lag is that it doesn't work here with consecutive NAs, because it uses the "original" ith element of a vector instead of the "revised" ith element. We first have to create a version of lag that handles this by writing a new function:
impute_inflation <- function(x, rate) {
  output <- x
  y <- rep(NA, length = length(x)) # Creating an empty vector to fill in with the loop. Preallocating makes R faster for vectors with a large number of elements.
  for (i in seq_len(length(output))) {
    if (i == 1) {
      y[i] <- output[i] # To avoid an error attempting to use the 0th element.
    } else {
      y[i] <- output[i - 1]
    }
    if (is.na(output[i])) {
      output[i] <- y[i] * (1 + rate)
    } else {
      output[i]
    }
  }
  output
}
Then it's a cinch to apply this across a bunch of variables with dplyr::mutate_at():
library(dplyr)
mutate_at(prices, vars(wti, brent), impute_inflation, 0.02)
year wti brent
1 2018 75.000 80.00
2 2019 80.000 85.00
3 2020 90.000 94.00
4 2021 91.800 93.00
5 2022 93.636 94.86
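On current dplyr (1.0 or later), where mutate_at() is superseded, an equivalent call would use across(); a small sketch:
library(dplyr)
prices %>%
  mutate(across(c(wti, brent), ~ impute_inflation(.x, rate = 0.02)))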
You can use dplyr::lag to get the previous value in a given column. Your lagged values look like this:
library(dplyr)
inflation_factor <- 1.02
prices <- data_frame(year = 2018:2022,
                     wti = c(75, 80, 90, NA, NA),
                     brent = c(80, 85, 94, 93, NA)) %>%
  mutate_at(vars(wti, brent), as.numeric)
prices %>%
mutate(prev_wti = lag(wti))
#> # A tibble: 5 x 4
#> year wti brent prev_wti
#> <int> <dbl> <dbl> <dbl>
#> 1 2018 75 80 NA
#> 2 2019 80 85 75
#> 3 2020 90 94 80
#> 4 2021 NA 93 90
#> 5 2022 NA NA NA
When a value is NA, multiply the lagged value by the inflation factor. As you can see, that doesn't handle consecutive NAs, however.
prices %>%
  mutate(wti = ifelse(is.na(wti), lag(wti) * inflation_factor, wti),
         brent = ifelse(is.na(brent), lag(brent) * inflation_factor, brent))
#> # A tibble: 5 x 3
#> year wti brent
#> <int> <dbl> <dbl>
#> 1 2018 75 80
#> 2 2019 80 85
#> 3 2020 90 94
#> 4 2021 91.8 93
#> 5 2022 NA 94.9
Or to scale this and avoid doing the same multiplication over and over, gather the data into a long format, get lags within each group (wti, brent, or any others you may have), and adjust values as needed. Then you can spread back to the original shape:
prices %>%
  tidyr::gather(key = key, value = value, wti, brent) %>%
  group_by(key) %>%
  mutate(value = ifelse(is.na(value), lag(value) * inflation_factor, value)) %>%
  tidyr::spread(key = key, value = value)
#> # A tibble: 5 x 3
#> year brent wti
#> <int> <dbl> <dbl>
#> 1 2018 80 75
#> 2 2019 85 80
#> 3 2020 94 90
#> 4 2021 93 91.8
#> 5 2022 94.9 NA
Created on 2018-07-12 by the reprex package (v0.2.0).
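On newer tidyr (1.0 or later), where gather() and spread() are superseded, the same reshape can be written with pivot_longer() and pivot_wider(); a sketch:
library(dplyr)
library(tidyr)
prices %>%
  pivot_longer(c(wti, brent), names_to = "key", values_to = "value") %>%
  group_by(key) %>%
  mutate(value = ifelse(is.na(value), lag(value) * inflation_factor, value)) %>%
  ungroup() %>%
  pivot_wider(names_from = key, values_from = value)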
