Heatwave calculation based on maximum temperature in R

A heatwave is defined when the maximum temperature at a meteorological station is 3 °C or more above the normal temperature for 3 or more consecutive days. I have calculated the daily average (daily normal) from multiple years of daily maximum temperature data like this:
dates <- seq(from = as.Date("1970-1-1"), to = as.Date("2000-12-31"), by = "day")
df <- data.frame(date = dates, MaxT = runif(length(dates), 20, 40))
df$day <- format(df$date, format = "%m-%d")
daily_mean <- aggregate(MaxT ~ day, data = df, FUN = mean)
Now this has to be matched against each year's daily maximum temperatures to identify the dates when the maximum temperature is 3 °C or more above the daily normal for 3 or more consecutive days. Those events will be considered heatwaves. My question is how to implement this in R.

Here is a way using dplyr and zoo's rollapplyr.
library(dplyr)

df_out <- df %>%
  left_join(daily_mean %>% rename(mean_temp = MaxT), by = "day") %>%
  # rollapplyr is right-aligned: a day is flagged TRUE only when that day and
  # the two days before it were all more than 3 degrees above the daily normal
  mutate(is_heatwave = zoo::rollapplyr(MaxT > (mean_temp + 3), 3, all, fill = NA))
Some instances of heatwave:
df_out[31:50, ]
# date MaxT day mean_temp is_heatwave
#31 1970-01-31 26.31675 01-31 28.31451 FALSE
#32 1970-02-01 22.05946 02-01 29.83059 FALSE
#33 1970-02-02 34.22469 02-02 29.84562 FALSE
#34 1970-02-03 33.03264 02-03 29.87919 FALSE
#35 1970-02-04 36.62357 02-04 31.50603 TRUE
#36 1970-02-05 29.82134 02-05 30.22581 FALSE
#37 1970-02-06 28.13625 02-06 29.64073 FALSE
#38 1970-02-07 29.95754 02-07 29.54277 FALSE
#39 1970-02-08 21.40026 02-08 30.96619 FALSE
#40 1970-02-09 33.10983 02-09 28.16146 FALSE
#41 1970-02-10 30.87346 02-10 29.37693 FALSE
#42 1970-02-11 31.08721 02-11 28.89930 FALSE
#43 1970-02-12 27.34925 02-12 29.27882 FALSE
#44 1970-02-13 31.88582 02-13 29.35825 FALSE
#45 1970-02-14 30.05155 02-14 28.24995 FALSE
#46 1970-02-15 35.07049 02-15 29.02716 FALSE
#47 1970-02-16 39.49029 02-16 32.75644 FALSE
#48 1970-02-17 37.41917 02-17 31.44022 TRUE
#49 1970-02-18 36.03564 02-18 29.56212 TRUE
#50 1970-02-19 36.48052 02-19 30.18766 TRUE
TRUE values are where a heatwave was present. As we can see, in rows 33, 34 and 35 we had 3 consecutive days when MaxT was greater than mean_temp by more than 3 degrees, so the third day (row 35) is flagged. Similarly, we can verify the other days.
To get yearly heatwave occurrences, we can do:
df_year <- df_out %>%
  group_by(year = format(date, "%Y")) %>%
  # each run of consecutive TRUE values is one heatwave event, so counting
  # the TRUE runs from rle() gives the number of events per year
  summarise(total_heat = with(rle(is_heatwave), sum(values, na.rm = TRUE)))
and sum(df_year$total_heat) gives the overall count.
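If you also need the start and end dates of each event, here is a minimal sketch under the same setup; it assumes the df_out built above and applies rle() to the per-day exceedance rather than to the rolling flag, so events of 3 or more days are kept whole:
exceed <- with(df_out, MaxT > (mean_temp + 3))  # day is more than 3 degrees above normal
r <- rle(exceed)
ends <- cumsum(r$lengths)
starts <- ends - r$lengths + 1
keep <- r$values & r$lengths >= 3               # runs of 3+ consecutive hot days
heatwaves <- data.frame(start = df_out$date[starts[keep]],
                        end = df_out$date[ends[keep]],
                        days = r$lengths[keep])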

Related

Replace zero values with the mean and then find the daily sum based on hourly data

The following data is a subset of a larger dataset (power_use):
ï..time watt_hour
1 18-01-22 0:00 0.0261
2 18-01-22 1:00 0.0304
3 18-01-22 2:00 0.0275
4 18-01-22 3:00 0.0275
5 18-01-22 4:00 6.4600
6 18-01-22 5:00 0.0275
7 18-01-22 6:00 3.9500
8 18-01-22 7:00 0.0259
9 18-01-22 8:00 0.2470
10 18-01-22 9:00 1.1200
I am trying to replace all zero and NA values by imputing (replacing with the mean while omitting zeros and NAs). I have the code below:
dates <- power_use$ï..time
avg <- power_use$watt_hour
newdata <- power_use %>%
  mutate(dates = as.POSIXct(dates, format = "%d-%m-%y %H:%M", tz = "")) %>%
  group_by(date(dates)) %>%
  summarise(mean = mean(avg, na.rm = TRUE),
            max = max(avg, na.rm = TRUE),
            min = min(avg, na.rm = TRUE))
The result I am getting is that all new daily values are equal to the mean. I have used this exact same code on another dataset where the only difference is that there are no zero values, only NAs, and the data is taken every 30 minutes; there it worked perfectly well.
After replacing zeros and NAs with the mean, I want to take the daily sum. Some days have 2 data points while others have more.
Any help?
If you replace all zeros with NA in a first step, you could use na_mean from the imputeTS package:
library(imputeTS)
library(tidyverse)
library(lubridate)

data <- tibble(date = ymd("2013-01-01") + months(0:17),
               value = c(NA, 0, 0, 0, 0, 0, 0, 7, NA, 0, 0, 0, 11, 23, 54, 33, 45, 0))
data[data == 0] <- NA
na_mean(data, option = "mean", maxgap = Inf)
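Applied to the question's data, a minimal sketch (assuming power_use with the watt_hour column as shown) imputes the numeric vector directly:
library(imputeTS)
wh <- as.numeric(power_use$watt_hour)
wh[wh == 0] <- NA                          # treat zeros as missing
power_use$watt_hour_imputed <- na_mean(wh) # replace NAs with the overall mean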
I managed to solve my problem by doing the following:
ac13 <- as.numeric(power_use$watt_hour)
ac13[ac13 == 0] <- NA                 # treat zeros as missing
avg_ac13 <- mean(ac13, na.rm = TRUE)  # mean over non-zero, non-NA values
power_use_replaced <- power_use %>%
  mutate(ac13_mean_replaced = ifelse(is.na(ac13), avg_ac13, ac13))
sum(is.na(power_use_replaced$watt_hour))
sum(is.na(power_use_replaced$ac13_mean_replaced))
head(power_use_replaced)
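The remaining step from the question, taking the daily sum after imputation, could then look like this; a sketch assuming the ï..time column is formatted as in the sample (day-month-year hour:minute):
library(dplyr)
power_use_replaced %>%
  mutate(day = as.Date(as.POSIXct(`ï..time`, format = "%d-%m-%y %H:%M", tz = ""))) %>%
  group_by(day) %>%
  summarise(daily_sum = sum(ac13_mean_replaced))
As an aside, the original attempt returned the same value for every day because avg was extracted as a full-length vector before the pipeline, so mean(avg, na.rm = TRUE) inside summarise() ignored the grouping; referring to the column itself inside the pipe avoids this.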

Merge time series data with different length (gaps)

I have two water flow measurement devices which give a value every minute. Now I need to merge both time series. My problem: the devices produce some failures every couple of hours, so the two time series have different lengths. I need to fill the gaps first. This could be done with NA, a zero value, or the last value before the gap.
I can easily define the required time vector tseq from the min and max values of the time series:
from <- as.POSIXct(min(Measurement1[[1]], Measurement2[[1]]))
to <- as.POSIXct(max(Measurement1[[1]], Measurement2[[1]]))
tseq <- as.data.frame(seq.POSIXt(from = from, to = to, by = deltaT, tz = "UTC"))
Then I tried to complete the two data frames Measurement1 and Measurement2 with the zoo function as follows:
Measurement1Zoo <- as.data.frame(zoo(x=Measurement1, tseq[[1]]))
This leads to a df with the same length as tseq, but zoo just adds some values at the end of the vector.
I'm a bit confused how zoo works. I just want to add the missing time stamps in the two time series and complete it with NA (or another value). How could this be done? You can find two example files here:
Example time series
Thank you!
You can use dplyr to do an outer join (i.e. full_join):
library(data.table)

m1 <- fread(file = "/Measurement1.CSV", sep = ";", header = TRUE)
m1$Date <- as.POSIXct(m1$Date, format = "%d.%m.%Y %H:%M", tz = Sys.timezone())
m2 <- fread(file = "/Measurement2.CSV", sep = ";", header = TRUE)
m2$Date <- as.POSIXct(m2$Date, format = "%d.%m.%Y %H:%M", tz = Sys.timezone())
names(m2)[2] <- "Value 5"

min(m1$Date) == min(m2$Date) #TRUE
max(m1$Date) == max(m2$Date) #TRUE

library(dplyr)
m_all <- full_join(x = m1, y = m2, by = "Date")
nrow(m1)    #11517
nrow(m2)    #11520
nrow(m_all) #11520
head(m_all)
# Date Value 1 Value 2 Value 3 Value 4 Value 5
#1 2015-07-24 00:00:00 28 2 0 26 92
#2 2015-07-24 00:01:00 28 2 0 26 95
#3 2015-07-24 00:02:00 28 2 0 26 90
#4 2015-07-24 00:03:00 28 2 0 26 89
#5 2015-07-24 00:04:00 28 2 0 26 94
#6 2015-07-24 00:05:00 27 1 0 26 95
# checking NAs
sum(is.na(m1$`Value 1`)) #0
sum(is.na(m1$`Value 2`)) #0
sum(is.na(m1$`Value 3`)) #3
sum(is.na(m1$`Value 4`))#0
sum(is.na(m2$`Value 5`)) #42
sum(is.na(m_all$`Value 1`)) #3
sum(is.na(m_all$`Value 2`)) #3
sum(is.na(m_all$`Value 3`)) #6
sum(is.na(m_all$`Value 4`)) #3
sum(is.na(m_all$`Value 5`)) #42
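Since the question asked how zoo works here: a zoo series is indexed by its own timestamps, so merging two series takes the union of their indices and fills NA where a series has no observation. A minimal sketch with the m1/m2 objects from above (na.locf carries the last value forward, one of the fill options mentioned in the question):
library(zoo)
z1 <- zoo(as.data.frame(m1)[, -1], order.by = m1$Date)  # values indexed by timestamp
z2 <- zoo(as.data.frame(m2)[, -1], order.by = m2$Date)
z_all <- merge(z1, z2, all = TRUE)         # union of timestamps, NA in the gaps
z_filled <- na.locf(z_all, na.rm = FALSE)  # optional: fill gaps with the leading value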

Filter R data frame by hour of the day

I have a data frame with a datetime column. I want to know the number of rows by hour of the day. However, I care only about the rows between 8 AM and 10 PM.
The lubridate package requires us to filter hours of the day using the 24-hour convention.
library(tidyverse)
library(lubridate)
### Fake data with date-times ----
x <- seq.POSIXt(as.POSIXct('1999-01-01'), as.POSIXct('1999-02-01'), length.out = 1000)
df <- data.frame(myDateTime = x)

### Get all rows between 8 AM and 10 PM (inclusive)
df %>%
  mutate(myHour = hour(myDateTime)) %>%
  filter(myHour >= 8, myHour <= 22) %>% ## between 8 AM and 10 PM (both inclusive)
  count(myHour)                         ## number of rows per hour
Is there a way for me to use 10:00 PM rather than the integer 22?
You can use the ymd_hm and hour functions to do 12-hour to 24-hour conversions.
df %>%
  mutate(myHour = hour(myDateTime)) %>%
  filter(myHour >= hour(ymd_hm("2000-01-01 8:00 AM")),      ## hour() ignores year, month, date
         myHour <= hour(ymd_hm("2000-01-01 10:00 PM"))) %>% ## between 8 AM and 10 PM (both inclusive)
  count(myHour)
A more elegant solution:
## custom function to convert 12-hour time to 24-hour time
hourOfDay_12to24 <- function(time12hrFmt) {
  out <- paste("2000-01-01", time12hrFmt)
  hour(ymd_hm(out))
}
df %>%
  mutate(myHour = hour(myDateTime)) %>%
  filter(myHour >= hourOfDay_12to24("8:00 AM"),
         myHour <= hourOfDay_12to24("10:00 PM")) %>% ## between 8 AM and 10 PM (both inclusive)
  count(myHour)
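As a small usage note, dplyr's between() (inclusive on both ends) can replace the pair of comparisons; a sketch using the same df and helper as above:
df %>%
  mutate(myHour = hour(myDateTime)) %>%
  filter(between(myHour, hourOfDay_12to24("8:00 AM"), hourOfDay_12to24("10:00 PM"))) %>%
  count(myHour)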
You can also use base R to do this:
# extract the hour
df$hour_day <- as.numeric(format(df$myDateTime, "%H"))

# subset data between 08:00 AM and 10:00 PM (inclusive)
lower <- as.integer(format(as.POSIXct("08:00 AM", format = "%I:%M %p"), "%H"))
upper <- as.integer(format(as.POSIXct("10:00 PM", format = "%I:%M %p"), "%H"))
new_df <- df[df$hour_day >= lower & df$hour_day <= upper, ]

# count the frequency
stack(table(new_df$hour_day))
# values ind
#1 42 8
#2 42 9
#3 41 10
#4 42 11
#5 42 12
#6 41 13
#7 42 14
#8 41 15
#9 42 16
#10 42 17
#11 41 18
#12 42 19
#13 42 20
#14 41 21
#15 42 22
This gives the same output as the tidyverse/lubridate approach:
library(tidyverse)
library(lubridate)
df %>%
  mutate(myHour = hour(myDateTime)) %>%
  filter(myHour >= hour(ymd_hm("2000-01-01 8:00 AM")),
         myHour <= hour(ymd_hm("2000-01-01 10:00 PM"))) %>%
  count(myHour)

R - find outliers in a time series dataset using standard deviation

I have an xts time series object with numeric values for the data. str(dataTS) gives:
An ‘xts’ object on 2014-02-14 14:27:00/2014-02-28 14:22:00 containing:
Data: num [1:4032, 1] 51.8 44.5 41.2 48.6 46.7 ...
Indexed by objects of class: [POSIXlt,POSIXt] TZ:
xts Attributes:
NULL
I want to find the data points that are more than 2 standard deviations away from the mean and create a new dataset from them. Here is the head of the time series:
[,1]
2015-02-14 14:27:00 51.846
2015-02-14 14:32:00 44.508
2015-02-14 14:37:00 41.244
2015-02-14 14:42:00 48.568
2015-02-14 14:47:00 46.714
2015-02-14 14:52:00 44.986
2015-02-14 14:57:00 49.108
2015-02-14 15:02:00 1000.470
2015-02-14 15:07:00 53.404
2015-02-14 15:12:00 45.400
2015-02-14 15:17:00 3.216
2015-02-14 15:22:00 49.7204
I want to subset the outliers 3.216 and 1000.470.
You can scale your data to have zero mean and unit standard deviation. You can then directly identify individual observations that are >= 2 sd away from the mean.
As an example, I randomly sample some data from a Cauchy distribution.
set.seed(2010)
smpl <- rcauchy(10, location = 4, scale = 3)
To illustrate, I store the sample data and scaled sample data in a data.frame; I also mark observations that are >= 2 standard deviations away from the mean.
library(tidyverse)

df <- data.frame(Data = smpl) %>%
  mutate(
    Data.scaled = as.numeric(scale(Data)),
    # flag observations at least 2 standard deviations from the mean (either side)
    deviation_greater_than_2sd = abs(Data.scaled) >= 2)
df
# Data Data.scaled deviation_greater_than_2sd
#1 8.007951 -0.2639689 FALSE
#2 -34.072054 -0.5491882 FALSE
#3 465.099800 2.8342104 TRUE
#4 7.191778 -0.2695010 FALSE
#5 2.383882 -0.3020890 FALSE
#6 3.544079 -0.2942252 FALSE
#7 -7.002769 -0.3657119 FALSE
#8 4.384503 -0.2885287 FALSE
#9 15.722492 -0.2116796 FALSE
#10 4.268082 -0.2893179 FALSE
We can also visualise the distribution of Data.scaled:
ggplot(df, aes(Data.scaled)) + geom_histogram()
The "outlier" is 2.8 units of standard deviation away from the mean.

Plotting histogram for data with start and end date

I have a data set that is something like this:
start_date end_date outcome
1 2014-07-18 2014-08-20 TRUE
2 2014-08-04 2014-09-23 TRUE
3 2014-08-01 2014-09-03 TRUE
4 2014-08-01 2014-09-03 TRUE
5 2014-12-10 2014-12-10 TRUE
6 2014-10-11 2014-11-07 TRUE
7 2015-04-27 2015-05-20 TRUE
8 2014-11-22 2014-12-25 TRUE
9 2015-03-24 2015-04-26 TRUE
10 2015-03-12 2015-04-10 FALSE
11 2014-05-29 2014-06-28 FALSE
12 2015-03-19 2015-04-20 TRUE
13 2015-03-25 2015-04-26 TRUE
14 2015-03-25 2015-04-26 TRUE
15 2014-07-09 2014-08-10 TRUE
16 2015-03-26 2015-04-26 TRUE
17 2014-07-09 2014-08-10 TRUE
18 2015-03-30 2015-04-28 TRUE
19 2014-03-13 2014-04-13 TRUE
20 2015-04-01 2015-04-29 TRUE
I want to plot a histogram where each bar corresponds to a month and shows the proportion of FALSE outcomes out of all outcomes (FALSE + TRUE) in that month.
What is the easiest way to do this in R, preferably using ggplot?
Here is one way; there will be better ways to do this, but I will leave what I tried. The main job is to create a new data frame for the graphic. Using your data above, I first converted the factors to date objects (if you already have date objects in your data, you do not need this step). Then I summarised the data for start_date and end_date using count(), bound the two data frames, and computed the proportion of FALSE for each month.
library(zoo)
library(dplyr)
library(ggplot2)
library(scales)

# convert to Date, then reduce each date to its "YYYY-MM" month
foo1 <- mydf %>%
  mutate(across(-outcome, ~ format(as.Date(.x, format = "%Y-%m-%d"), "%Y-%m")))

foo2 <- count(foo1, start_date, outcome) %>% rename(date = start_date)

foo3 <- count(foo1, end_date, outcome) %>%
  rename(date = end_date) %>%
  bind_rows(foo2) %>%
  group_by(date, outcome) %>%
  summarise(total = sum(n), .groups = "drop_last") %>%
  summarise(prop = sum(total[outcome == FALSE]) / sum(total)) %>%
  mutate(date = as.Date(as.yearmon(date)))

ggplot(data = foo3, aes(x = date, y = prop)) +
  geom_bar(stat = "identity") +
  scale_x_date(labels = date_format("%Y-%m"), breaks = date_breaks("month")) +
  theme(axis.text.x = element_text(angle = 90, vjust = 1))
