Fill NA values with mean of preceding and subsequent values - R

I'm working with a dataset of weather variables (temperature, precipitation, etc.) that has a few missing values. Because of my specific approach (summing these variables across several days), I need to address NA values in the dataset.
When there is a missing daily value, I'd like to fill that day with a mean value of the previous and following day. The assumption here is that weather values are similar from one day to the next. And yes, I realize this is a big assumption.
I've developed the following:
maxTemp <- c(13.2, 10.7, NA, 17.9, 6.6, 10, 13, NA, NA, 8.8, 9.9, 14.9, 16.3, NA, 18, 9.9, 11.5, 15.3, 21.7, 23.9, 26.6, 27, 22.3, NA, 17.9)
weather <- as.data.frame(maxTemp)

library(dplyr)
weather %>%
  mutate(maxTempNA = if_else(is.na(maxTemp),
                             (lag(maxTemp) + lead(maxTemp)) / 2,
                             maxTemp))
However, in a few cases I have two NA values on consecutive days, so this doesn't work. Any thoughts on how to code this so that when there are two (or more) NAs in a row, the average uses the 'bookending' values to fill them?
The final result would look like this:
maxTemp <- c(13.2, 10.7, 14.3, 17.9, 6.6, 10, 13, 10.9, 10.9, 8.8, 9.9, 14.9, 16.3, 17.15, 18, 9.9, 11.5, 15.3, 21.7, 23.9, 26.6, 27, 22.3, 20.1, 17.9)

How about using approx to replace the NAs with interpolated values? By default, approx uses linear interpolation, so this should match your manual replace-by-mean results.
weather %>%
  mutate(maxTemp_interp = approx(1:n(), maxTemp, 1:n())$y)
# maxTemp maxTemp_interp
# 1 13.2 13.20
# 2 10.7 10.70
# 3 NA 14.30
# 4 17.9 17.90
# 5 6.6 6.60
# 6 10.0 10.00
# 7 13.0 13.00
# 8 NA 11.60
# 9 NA 10.20
# 10 8.8 8.80
# 11 9.9 9.90
# 12 14.9 14.90
# 13 16.3 16.30
# 14 NA 17.15
# 15 18.0 18.00
# 16 9.9 9.90
# 17 11.5 11.50
# 18 15.3 15.30
# 19 21.7 21.70
# 20 23.9 23.90
# 21 26.6 26.60
# 22 27.0 27.00
# 23 22.3 22.30
# 24 NA 20.10
# 25 17.9 17.90
I've created a new column here to make it easier to compare with the original data.
Update
Markus pointed out in the comments (thanks @markus) that to reproduce your expected output you actually need method = "constant" with f = 0.5, which fills every NA in a run with the plain mean of the two bookending values:
weather %>%
  mutate(maxTemp_interp = approx(1:n(), maxTemp, 1:n(), method = "constant", f = 0.5)$y)
# maxTemp maxTemp_interp
# 1 13.2 13.20
# 2 10.7 10.70
# 3 NA 14.30
# 4 17.9 17.90
# 5 6.6 6.60
# 6 10.0 10.00
# 7 13.0 13.00
# 8 NA 10.90
# 9 NA 10.90
# 10 8.8 8.80
# 11 9.9 9.90
# 12 14.9 14.90
# 13 16.3 16.30
# 14 NA 17.15
# 15 18.0 18.00
# 16 9.9 9.90
# 17 11.5 11.50
# 18 15.3 15.30
# 19 21.7 21.70
# 20 23.9 23.90
# 21 26.6 26.60
# 22 27.0 27.00
# 23 22.3 22.30
# 24 NA 20.10
# 25 17.9 17.90

If you want to use the mean of the most recent non-NA value going backwards and forwards, you can use something like data.table::nafill() to fill values both down and up, and then take the mean:
weather$prevTemp <- data.table::nafill(weather$maxTemp, type = "locf")
weather$nextTemp <- data.table::nafill(weather$maxTemp, type = "nocb")
weather$maxTemp[is.na(weather$maxTemp)] <- ((weather$prevTemp + weather$nextTemp) / 2)[is.na(weather$maxTemp)]
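For completeness, a one-liner alternative (assuming the zoo package is an option; it is not part of the answers above): zoo::na.approx performs the same linear interpolation as approx, touching only the NA positions:

```r
# zoo::na.approx linearly interpolates interior NAs only;
# na.rm = FALSE keeps any leading/trailing NAs in place rather than dropping them
library(zoo)
weather$maxTemp_interp <- zoo::na.approx(weather$maxTemp, na.rm = FALSE)
```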


How to merge the names while converting nested list to data frame

I have a nested list with names:
lst <- list(var1 = list(`0.1` = c(100, 10, 1, 0.1, 0.01),
                        `0.2` = c(100, 20, 4, 0.8, 0.16),
                        `0.3` = c(100, 30, 9, 2.7, 0.81),
                        `0.4` = c(100, 40, 16, 6.4, 2.56),
                        `0.5` = c(100, 50, 25, 12.5, 6.25),
                        `0.6` = c(100, 60, 36, 21.6, 12.96),
                        `0.7` = c(100, 70, 49, 34.3, 24.01),
                        `0.8` = c(100, 80, 64, 51.2, 40.96),
                        `0.9` = c(100, 90, 81, 72.9, 65.61)),
            var2 = list(`0.1` = c(10, 11, 11.1, 11.11, 11.111),
                        `0.2` = c(10, 12, 12.4, 12.48, 12.496),
                        `0.3` = c(10, 13, 13.9, 14.17, 14.251),
                        `0.4` = c(10, 14, 15.6, 16.24, 16.496),
                        `0.5` = c(10, 15, 17.5, 18.75, 19.375),
                        `0.6` = c(10, 16, 19.6, 21.76, 23.056),
                        `0.7` = c(10, 17, 21.9, 25.33, 27.731),
                        `0.8` = c(10, 18, 24.4, 29.52, 33.616),
                        `0.9` = c(10, 19, 27.1, 34.39, 40.951)))
I'd like to convert it to a data frame. I could do it with dplyr::bind_cols, but then my names are partly lost:
# A tibble: 5 x 18
`0.1` `0.2` `0.3` `0.4` `0.5` `0.6` `0.7` `0.8` `0.9` `0.11` `0.21` `0.31` `0.41` `0.51` `0.61` `0.71` `0.81`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 100 100 100 100 100 100 100 100 100 10 10 10 10 10 10 10 10
2 10 20 30. 40 50 60 70 80 90 11 12 13 14 15 16 17 18
3 1 4 9. 16 25 36 49. 64 81 11.1 12.4 13.9 15.6 17.5 19.6 21.9 24.4
4 0.1 0.8 2.7 6.4 12.5 21.6 34.3 51.2 72.9 11.1 12.5 14.2 16.2 18.8 21.8 25.3 29.5
5 0.01 0.16 0.81 2.56 6.25 13.0 24.0 41.0 65.6 11.1 12.5 14.3 16.5 19.4 23.1 27.7 33.6
# ... with 1 more variable: `0.91` <dbl>
while I'd like to create informative names by joining the names from the two list levels. So the resulting output would be:
# A tibble: 5 x 18
var1_0.1 var1_0.2 var1_0.3 var1_0.4 var1_0.5 var1_0.6 var1_0.7 var1_0.8 var1_0.9 var2_0.1 var2_0.2 var2_0.3 var2_0.4
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 100 100 100 100 100 100 100 100 100 10 10 10 10
2 10 20 30. 40 50 60 70 80 90 11 12 13 14
3 1 4 9. 16 25 36 49. 64 81 11.1 12.4 13.9 15.6
4 0.1 0.8 2.7 6.4 12.5 21.6 34.3 51.2 72.9 11.1 12.5 14.2 16.2
5 0.01 0.16 0.81 2.56 6.25 13.0 24.0 41.0 65.6 11.1 12.5 14.3 16.5
# ... with 5 more variables: var2_0.5 <dbl>, var2_0.6 <dbl>, var2_0.7 <dbl>, var2_0.8 <dbl>, var2_0.9 <dbl>
How can I achieve that in the most efficient way?
I did not use dplyr, but data.table and rlist.
Is this what you wanted?
library(data.table)
library(rlist)
lst <- list(var1 = list(`0.1` = c(100, 10, 1, 0.1, 0.01),
`0.2` = c(100, 20, 4, 0.8, 0.16),
`0.3` = c(100, 30, 9, 2.7, 0.81),
`0.4` = c(100, 40, 16, 6.4, 2.56),
`0.5` = c(100, 50, 25, 12.5, 6.25),
`0.6` = c(100, 60, 36, 21.6, 12.96),
`0.7` = c(100, 70, 49, 34.3, 24.01),
`0.8` = c(100, 80, 64, 51.2, 40.96),
`0.9` = c(100, 90, 81, 72.9, 65.61)),
var2 = list(`0.1` = c(10, 11, 11.1, 11.11, 11.111),
`0.2` = c(10, 12, 12.4, 12.48, 12.496),
`0.3` = c(10, 13, 13.9, 14.17, 14.251),
`0.4` = c(10, 14, 15.6, 16.24, 16.496),
`0.5` = c(10, 15, 17.5, 18.75, 19.375),
`0.6` = c(10, 16, 19.6, 21.76, 23.056),
`0.7` = c(10, 17, 21.9, 25.33, 27.731),
`0.8` = c(10, 18, 24.4, 29.52, 33.616),
`0.9` = c(10, 19, 27.1, 34.39, 40.951)))
temp <- lapply(lst, as.data.table)
final <- rlist::list.cbind(temp)
final
#> var1.0.1 var1.0.2 var1.0.3 var1.0.4 var1.0.5 var1.0.6 var1.0.7 var1.0.8
#> 1: 1e+02 100.00 100.00 100.00 100.00 100.00 100.00 100.00
#> 2: 1e+01 20.00 30.00 40.00 50.00 60.00 70.00 80.00
#> 3: 1e+00 4.00 9.00 16.00 25.00 36.00 49.00 64.00
#> 4: 1e-01 0.80 2.70 6.40 12.50 21.60 34.30 51.20
#> 5: 1e-02 0.16 0.81 2.56 6.25 12.96 24.01 40.96
#> var1.0.9 var2.0.1 var2.0.2 var2.0.3 var2.0.4 var2.0.5 var2.0.6 var2.0.7
#> 1: 100.00 10.000 10.000 10.000 10.000 10.000 10.000 10.000
#> 2: 90.00 11.000 12.000 13.000 14.000 15.000 16.000 17.000
#> 3: 81.00 11.100 12.400 13.900 15.600 17.500 19.600 21.900
#> 4: 72.90 11.110 12.480 14.170 16.240 18.750 21.760 25.330
#> 5: 65.61 11.111 12.496 14.251 16.496 19.375 23.056 27.731
#> var2.0.8 var2.0.9
#> 1: 10.000 10.000
#> 2: 18.000 19.000
#> 3: 24.400 27.100
#> 4: 29.520 34.390
#> 5: 33.616 40.951
Created on 2020-04-30 by the reprex package (v0.3.0)
You can use dplyr::bind_cols to convert to a data frame (as you have mentioned) and then fix the names in base R by replicating the first-level names an appropriate number of times:
df <- dplyr::bind_cols(lst)
names(df) <- paste(rep(names(lst), times = sapply(lst, length)),
                   unlist(lapply(lst, names)),
                   sep = '_')
If you know your inner-level names beforehand, it gets even simpler:
paste(rep(names(lst), each = 9), seq(0.1, 0.9, by = 0.1), sep = '_')
A data.table solution:
library(data.table)
l <- lapply(seq_along(lst), function(x) {
  tmp <- as.data.table(lst[[x]])
  names(tmp) <- paste0(names(lst)[x], "_", names(lst[[x]]))
  tmp
})
as.data.table(unlist(l,recursive = FALSE))
#> var1_0.1 var1_0.2 var1_0.3 var1_0.4 var1_0.5 var1_0.6 var1_0.7 var1_0.8
#> 1: 1e+02 100.00 100.00 100.00 100.00 100.00 100.00 100.00
#> 2: 1e+01 20.00 30.00 40.00 50.00 60.00 70.00 80.00
#> 3: 1e+00 4.00 9.00 16.00 25.00 36.00 49.00 64.00
#> 4: 1e-01 0.80 2.70 6.40 12.50 21.60 34.30 51.20
#> 5: 1e-02 0.16 0.81 2.56 6.25 12.96 24.01 40.96
#> var1_0.9 var2_0.1 var2_0.2 var2_0.3 var2_0.4 var2_0.5 var2_0.6 var2_0.7
#> 1: 100.00 10.000 10.000 10.000 10.000 10.000 10.000 10.000
#> 2: 90.00 11.000 12.000 13.000 14.000 15.000 16.000 17.000
#> 3: 81.00 11.100 12.400 13.900 15.600 17.500 19.600 21.900
#> 4: 72.90 11.110 12.480 14.170 16.240 18.750 21.760 25.330
#> 5: 65.61 11.111 12.496 14.251 16.496 19.375 23.056 27.731
#> var2_0.8 var2_0.9
#> 1: 10.000 10.000
#> 2: 18.000 19.000
#> 3: 24.400 27.100
#> 4: 29.520 34.390
#> 5: 33.616 40.951
In base R:
new_lst <- do.call(cbind, lapply(lst, function(x) do.call(cbind, x)))
colnames(new_lst) <- with(stack(lapply(lst, names)), paste(ind, values, sep = "_"))
Another base R method (this one is much easier to understand):
lst <- do.call(cbind, unlist(lst, recursive = FALSE))
colnames(lst) <- sub("\\.", "_", colnames(lst))
Output:
> new_lst
var1_0.1 var1_0.2 var1_0.3 var1_0.4 var1_0.5 var1_0.6 var1_0.7 var1_0.8 var1_0.9 var2_0.1 var2_0.2 var2_0.3 var2_0.4 var2_0.5 var2_0.6 var2_0.7 var2_0.8 var2_0.9
[1,] 1e+02 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 10.000 10.000 10.000 10.000 10.000 10.000 10.000 10.000 10.000
[2,] 1e+01 20.00 30.00 40.00 50.00 60.00 70.00 80.00 90.00 11.000 12.000 13.000 14.000 15.000 16.000 17.000 18.000 19.000
[3,] 1e+00 4.00 9.00 16.00 25.00 36.00 49.00 64.00 81.00 11.100 12.400 13.900 15.600 17.500 19.600 21.900 24.400 27.100
[4,] 1e-01 0.80 2.70 6.40 12.50 21.60 34.30 51.20 72.90 11.110 12.480 14.170 16.240 18.750 21.760 25.330 29.520 34.390
[5,] 1e-02 0.16 0.81 2.56 6.25 12.96 24.01 40.96 65.61 11.111 12.496 14.251 16.496 19.375 23.056 27.731 33.616 40.951

Where am I going wrong in splitting a time series?

data<-c(10.0,11.1,12.3,13.2,14.8,15.6,16.7,17.5,18.9,19.7,20.7,21.1,22.6,23.5,24.9,25.1,26.3,27.8,28.8,29.6,30.2,31.6,32.1,33.7)
startDate <- '2013-01-01'
endDate <- '2013-01-01'
df <- ts(cbind(data, startDate, endDate))
df
################
smp_size <- 0.80
train_ind <- length(df) * smp_size
train_split <- seq(from = 1, to = train_ind)
test_split <- seq(from = train_ind +1, to = length(df))
train <- data[train_split]
test <- data[-test_split]
(c(train, test))
I have the above data and I am trying to split it into time-series train/test sets, i.e. the first 80% as training and the remaining 20% as testing.
I keep getting weird results:
(c(train, test))
[1] 10.0 11.1 12.3 13.2 14.8 15.6 16.7 17.5 18.9 19.7 20.7 21.1 22.6 23.5 24.9 25.1 26.3 27.8 28.8 29.6 30.2
[22] 31.6 32.1 33.7 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[43] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 10.0 11.1 12.3 13.2 14.8 15.6
[64] 16.7 17.5 18.9 19.7 20.7 21.1 22.6 23.5 24.9 25.1 26.3 27.8 28.8 29.6 30.2 31.6 32.1 33.7
Why are there NA values in the middle of the data?
You should use nrow(df), not length(df), for time-series objects: on a multi-column ts, length() counts every cell of the underlying matrix (24 rows × 3 columns = 72 here), so indices built from it run well past row 24 and pull in NAs.
data <- c(10.0, 11.1, 12.3, 13.2, 14.8, 15.6, 16.7, 17.5, 18.9,
19.7, 20.7, 21.1, 22.6, 23.5, 24.9, 25.1, 26.3, 27.8,
28.8, 29.6, 30.2, 31.6, 32.1, 33.7)
startDate <- '2013-01-01'
endDate <- '2013-01-01'
df <- ts(cbind(data, startDate, endDate))
train <- df[1:(nrow(df) * .8), ]
test <- df[-(1:(nrow(df) * .8)), ]
> all.equal(df, ts(rbind(train, test)))
[1] TRUE
> length(df)
[1] 72
> nrow(df)
[1] 24
Calculate the number of rows to include in the training set and use the window function to subset the time series:
train_size <- ceiling(nrow(df) * 0.8)
train_set <- window(df, end = train_size)
test_set <- window(df, start = train_size + 1)
train_set
#Time Series:
#Start = 1
#End = 20
#Frequency = 1
# data startDate endDate
# 1 10 2013-01-01 2013-01-01
# 2 11.1 2013-01-01 2013-01-01
# 3 12.3 2013-01-01 2013-01-01
# 4 13.2 2013-01-01 2013-01-01
# 5 14.8 2013-01-01 2013-01-01
# 6 15.6 2013-01-01 2013-01-01
# 7 16.7 2013-01-01 2013-01-01
# 8 17.5 2013-01-01 2013-01-01
# 9 18.9 2013-01-01 2013-01-01
#10 19.7 2013-01-01 2013-01-01
#11 20.7 2013-01-01 2013-01-01
#12 21.1 2013-01-01 2013-01-01
#13 22.6 2013-01-01 2013-01-01
#14 23.5 2013-01-01 2013-01-01
#15 24.9 2013-01-01 2013-01-01
#16 25.1 2013-01-01 2013-01-01
#17 26.3 2013-01-01 2013-01-01
#18 27.8 2013-01-01 2013-01-01
#19 28.8 2013-01-01 2013-01-01
#20 29.6 2013-01-01 2013-01-01
test_set
#Time Series:
#Start = 21
#End = 24
#Frequency = 1
# data startDate endDate
#21 30.2 2013-01-01 2013-01-01
#22 31.6 2013-01-01 2013-01-01
#23 32.1 2013-01-01 2013-01-01
#24 33.7 2013-01-01 2013-01-01

Aggregate with a start and end of date

I'm new to R so this is maybe simple, but I haven't found how to do it yet.
I'm trying to aggregate my temperature data by day so I have a mean temperature for every day of the year.
Here's an example of my data and the code I made:
Date Qobs Ptot Fsol Temp PE X
1 1956-11-01 0.001 14.0 -99 12.0 1.4 NA
2 1956-11-02 0.001 0.0 -99 13.5 1.5 NA
3 1956-11-03 0.001 0.0 -99 13.5 1.5 NA
4 1956-11-04 0.001 0.0 -99 13.0 1.4 NA
5 1956-11-05 0.001 0.0 -99 11.5 1.3 NA
6 1956-11-06 0.001 0.0 -99 11.0 1.2 NA
7 1956-11-07 0.001 2.0 -99 12.5 1.3 NA
8 1956-11-08 0.000 0.0 -99 5.0 0.7 NA
9 1956-11-09 0.000 0.5 -99 0.0 0.4 NA
10 1956-11-10 0.000 0.0 -99 -2.5 0.2 NA
11 1956-11-11 0.000 2.5 -99 5.5 0.8 NA
12 1956-11-12 0.000 0.0 -99 7.5 0.9 NA
reg_T <- aggregate(x = tmp_data$Temp, by = list(j = format(tmp_data$Date, "%j")), mean)
But as you can see, my data doesn't start on the 1st of January; the first day of my data is 01/11, which complicates things later once it's aggregated.
How can I aggregate so the grouping starts at 01/01, and ignore the beginning and end of my data, since they are not complete years?
Thanks!
dput() of the data:
df <- structure(list(Date = structure(c(-4809, -4808, -4807, -4806, -4805, -4804,
-4803, -4802, -4801, -4800, -4799, -4798, -4797,
-4796, -4795, -4794, -4793, -4792, -4791, -4790,
-4789, -4788, -4787, -4786, -4785, -4784, -4783,
-4782, -4781, -4780), class = "Date"),
Temp = c(12, 13.5, 13.5, 13, 11.5, 11, 12.5, 5, 0, -2.5, 5.5, 7.5,
1.5, 6, 14, 6, 0.5, 0.5, 4, 2, 9, -4.5, -11.5, -10, -4.5,
-2.5, -3.5, -1, -1.5, -7.5)),
.Names = c("Date", "Temp"), row.names = c(NA, 30L), class = "data.frame")
What about something like this:
require(tidyverse)
df %>%
  mutate(MonthDay = str_sub(as.character(Date), 6)) %>%
  group_by(MonthDay) %>%
  summarise(MeanDay = mean(Temp, na.rm = TRUE))
# A tibble: 30 x 2
MonthDay MeanDay
<chr> <dbl>
1 11-01 12.0
2 11-02 13.5
3 11-03 13.5
4 11-04 13.0
5 11-05 11.5
6 11-06 11.0
7 11-07 12.5
8 11-08 5.00
9 11-09 0.
10 11-10 -2.50
# ... with 20 more rows
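If you prefer to stay close to the aggregate() call from your attempt, grouping on a month-day string instead of the "%j" day-of-year keys every year to the same calendar date, so partial years line up automatically. A base-R sketch using the df from the dput above:

```r
# group by month-day so incomplete years still align on calendar date
reg_T <- aggregate(x = df$Temp,
                   by = list(MonthDay = format(df$Date, "%m-%d")),
                   FUN = mean, na.rm = TRUE)
```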

Calculate formula over all rows and specific columns of dataframe

I have the following sample dataframe with prices of toys in different shops:
dfData <- data.frame(article = c("Fix", "Foxi", "Stan", "Olli", "Barbie", "Ken", "Hulk"),
                     priceToys1 = c(10, NA, 10.5, NA, 10.7, 11.2, 12.0),
                     priceAllToys = c(NA, 11.4, NA, 11.9, 11.7, 11.1, NA),
                     price123Toys = c(12, 12.4, 12.7, NA, NA, 11.0, 12.1))
Additionally I generate a min price column by adding:
dfData$MinPrice <- apply(dfData[, grep("price", colnames(dfData))], 1, FUN=min, na.rm = TRUE)
So I have this dataframe now:
# article priceToys1 priceAllToys price123Toys MinPrice
#1 Fix 10.0 NA 12.0 10.0
#2 Foxi NA 11.4 12.4 11.4
#3 Stan 10.5 NA 12.7 10.5
#4 Olli NA 11.9 NA 11.9
#5 Barbie 10.7 11.7 NA 10.7
#6 Ken 11.2 11.1 11.0 11.0
#7 Hulk 12.0 NA 12.1 12.0
How do I get additional columns into the data frame that give each price relative to the minimum price, as a percentage? The new column names should also include the shop name.
The result should look like this:
# article priceToys1 PercToys1 priceAllToys PercAllToys price123Toys Perc123Toys MinPrice
#1 Fix 10.0 100.0 NA NA 12.0 120.0 10.0
#2 Foxi NA NA 11.4 100.0 12.4 108.8 11.4
#3 Stan 10.5 100.0 NA NA 12.7 121.0 10.5
#4 Olli NA NA 11.9 100.0 NA NA 11.9
#5 Barbie 10.7 100.0 11.7 109.4 NA NA 10.7
#6 Ken 11.2 101.8 11.1 100.9 11.0 100.0 11.0
#7 Hulk 12.0 100.0 NA NA 12.1 100.8 12.0
Two possible solutions:
1) With the data.table-package:
# load the 'data.table'-package
library(data.table)
# get the column names on which to operate
cols <- names(dfData)[2:4]  # or: grep("price", names(dfData), value = TRUE)
# convert dfData to a 'data.table'
setDT(dfData)
# compute the percentage columns
dfData[, paste0('Perc', gsub('price', '', cols)) := lapply(.SD, function(x) round(100 * x / MinPrice, 1)),
       .SDcols = cols][]
which gives:
article priceToys1 priceAllToys price123Toys MinPrice PercToys1 PercAllToys Perc123Toys
1: Fix 10.0 NA 12.0 10.0 100.0 NA 120.0
2: Foxi NA 11.4 12.4 11.4 NA 100.0 108.8
3: Stan 10.5 NA 12.7 10.5 100.0 NA 121.0
4: Olli NA 11.9 NA 11.9 NA 100.0 NA
5: Barbie 10.7 11.7 NA 10.7 100.0 109.3 NA
6: Ken 11.2 11.1 11.0 11.0 101.8 100.9 100.0
7: Hulk 12.0 NA 12.1 12.0 100.0 NA 100.8
2) With base R:
cols <- names(dfData)[2:4]  # or: grep("price", names(dfData), value = TRUE)
dfData[, paste0('Perc', gsub('price', '', cols))] <- round(100 * dfData[, cols] / dfData$MinPrice, 1)
which will get you the same result.
We can use mutate_at from dplyr
library(dplyr)
library(magrittr)
dfData %<>%
  mutate_at(vars(matches("^price")), funs(Perc = round(100 * . / MinPrice, 1)))
dfData
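A side note, assuming a newer dplyr (≥ 1.0): funs() is deprecated there, and across() is its replacement. A sketch (the rename_with step is my addition, so the column names match the PercToys1 style asked for):

```r
# across() sketch replacing the deprecated funs()/mutate_at() idiom
library(dplyr)
dfData <- dfData %>%
  mutate(across(matches("^price"),
                ~ round(100 * .x / MinPrice, 1),
                .names = "Perc{.col}")) %>%
  # drop the "price" part so names become PercToys1, PercAllToys, Perc123Toys
  rename_with(~ sub("^Percprice", "Perc", .x))
```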

How to subset consecutive rows if they meet a condition

I am using R to analyze a number of time series (1951-2013) containing daily values of Max and Min temperatures. The data has the following structure:
YEAR MONTH DAY MAX MIN
1985 1 1 22.8 9.4
1985 1 2 28.6 11.7
1985 1 3 24.7 12.2
1985 1 4 17.2 8.0
1985 1 5 17.9 7.6
1985 1 6 17.7 8.1
I need to find the frequency of heat waves based on this definition: A period of three or more consecutive days ‎with a daily maximum and minimum temperature exceeding the 90th percentile of the maximum ‎and minimum temperatures for all days in the studied period.
Basically, I want to subset those consecutive days (three or more) when the Max and Min temp exceed a threshold value. The output would be something like this:
YEAR MONTH DAY MAX MIN
1989 7 18 45.0 23.5
1989 7 19 44.2 26.1
1989 7 20 44.7 24.4
1989 7 21 44.6 29.5
1989 7 24 44.4 31.6
1989 7 25 44.2 26.7
1989 7 26 44.5 25.0
1989 7 28 44.8 26.0
1989 7 29 44.8 24.6
1989 8 19 45.0 24.3
1989 8 20 44.8 26.0
1989 8 21 44.4 24.0
1989 8 22 45.2 25.0
I have tried the following to subset my full dataset to just the days that exceed the 90th percentile temperature:
HW <- subset(Mydata, Mydata$MAX >= quantile(Mydata$MAX, .9) &
                     Mydata$MIN >= quantile(Mydata$MIN, .9))
However, I got stuck in how I can subset only consecutive days that have met the condition.
An approach with data.table which is slightly different from @jlhoward's approach (using the same data):
library(data.table)
setDT(df)
df[, hotday := +(MAX >= 44.5 & MIN >= 24.5)
   ][, hw.length := with(rle(hotday), rep(lengths, lengths))
   ][hotday == 0, hw.length := 0]
This produces a data.table with a heat-wave length variable (hw.length) instead of a TRUE/FALSE variable for a specific heat-wave length:
> df
YEAR MONTH DAY MAX MIN hotday hw.length
1: 1989 7 18 45.0 23.5 0 0
2: 1989 7 19 44.2 26.1 0 0
3: 1989 7 20 44.7 24.4 0 0
4: 1989 7 21 44.6 29.5 1 1
5: 1989 7 22 44.4 31.6 0 0
6: 1989 7 23 44.2 26.7 0 0
7: 1989 7 24 44.5 25.0 1 3
8: 1989 7 25 44.8 26.0 1 3
9: 1989 7 26 44.8 24.6 1 3
10: 1989 7 27 45.0 24.3 0 0
11: 1989 7 28 44.8 26.0 1 1
12: 1989 7 29 44.4 24.0 0 0
13: 1989 7 30 45.2 25.0 1 1
I may be missing something here, but I don't see the point of subsetting beforehand. If you have data for every day, in chronological order, you can use run-length encoding (see the docs on the rle(...) function).
In this example we create an artificial data set and define "heat wave" as MAX >= 44.5 and MIN >= 24.5. Then:
# example data set
df <- data.frame(YEAR = 1989, MONTH = 7, DAY = 18:30,
                 MAX = c(45, 44.2, 44.7, 44.6, 44.4, 44.2, 44.5, 44.8, 44.8, 45, 44.8, 44.4, 45.2),
                 MIN = c(23.5, 26.1, 24.4, 29.5, 31.6, 26.7, 25, 26, 24.6, 24.3, 26, 24, 25))
r <- with(with(df, rle(MAX >= 44.5 & MIN >= 24.5)), rep(lengths, lengths))
df$heat.wave <- with(df, MAX >= 44.5 & MIN >= 24.5) & (r > 2)
df
# YEAR MONTH DAY MAX MIN heat.wave
# 1 1989 7 18 45.0 23.5 FALSE
# 2 1989 7 19 44.2 26.1 FALSE
# 3 1989 7 20 44.7 24.4 FALSE
# 4 1989 7 21 44.6 29.5 FALSE
# 5 1989 7 22 44.4 31.6 FALSE
# 6 1989 7 23 44.2 26.7 FALSE
# 7 1989 7 24 44.5 25.0 TRUE
# 8 1989 7 25 44.8 26.0 TRUE
# 9 1989 7 26 44.8 24.6 TRUE
# 10 1989 7 27 45.0 24.3 FALSE
# 11 1989 7 28 44.8 26.0 FALSE
# 12 1989 7 29 44.4 24.0 FALSE
# 13 1989 7 30 45.2 25.0 FALSE
This creates a column, heat.wave, which is TRUE if there was a heat wave on that day. If you need to extract only the heat-wave days, use
df[df$heat.wave,]
# YEAR MONTH DAY MAX MIN heat.wave
# 7 1989 7 24 44.5 25.0 TRUE
# 8 1989 7 25 44.8 26.0 TRUE
# 9 1989 7 26 44.8 24.6 TRUE
Your question really boils down to finding groupings of 3+ consecutive days in your subsetted dataset, removing all remaining data.
Let's consider an example where we would want to keep some rows and remove others:
dat <- data.frame(year = 1989,
                  month = c(6, 7, 7, 7, 7, 7, 8, 8, 8, 10, 10),
                  day = c(12, 11, 12, 13, 14, 21, 5, 6, 7, 12, 13))
dat
# year month day
# 1 1989 6 12
# 2 1989 7 11
# 3 1989 7 12
# 4 1989 7 13
# 5 1989 7 14
# 6 1989 7 21
# 7 1989 8 5
# 8 1989 8 6
# 9 1989 8 7
# 10 1989 10 12
# 11 1989 10 13
I've excluded the temperature data, because I'm assuming we've already subsetted to just the days that exceed the 90th percentile using the code from your question.
In this dataset there is a 4-day heat wave in July and a three-day heat wave in August. The first step would be to convert the data to date objects and compute the number of days between consecutive observations (I assume the data is already ordered by day here):
dates <- as.Date(paste(dat$year, dat$month, dat$day, sep="-"))
(dd <- as.numeric(difftime(tail(dates, -1), head(dates, -1), units="days")))
# [1] 29 1 1 1 7 15 1 1 66 1
We're close, because now we can see the time periods where there were multiple date gaps of 1 day -- these are the ones we want to grab. We can use the rle function to analyze runs of the number 1, keeping only the runs of length 2 or more:
(valid.gap <- with(rle(dd == 1), rep(values & lengths >= 2, lengths)))
# [1] FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
Finally, we can subset the dataset to just the days that were on either side of a 1-day date gap that is part of a heat wave:
dat[c(FALSE, valid.gap) | c(valid.gap, FALSE),]
# year month day
# 2 1989 7 11
# 3 1989 7 12
# 4 1989 7 13
# 5 1989 7 14
# 7 1989 8 5
# 8 1989 8 6
# 9 1989 8 7
A simple approach, not fully vectorized:
# play data
year <- c("1960")
month <- c(rep(1,30), rep(2,30), rep(3,30))
day <- rep(1:30,3)
maxT <- round(runif(90, 20, 22),1)
minT <- round(runif(90, 10, 12),1)
df <- data.frame(year, month, day, maxT, minT)
# target and tricky data...
df[1:3, 4] <- 30
df[1:4, 5] <- 14
df[10:13, 4] <- 30
df[10:11, 5] <- 14
# limits
df$maxTope <- df$maxT - quantile(df$maxT, 0.9)
df$minTope <- df$minT - quantile(df$minT, 0.9)
# define heat day
df$heat <- ifelse(df$maxTope > 0 & df$minTope > 0, 1, 0)
# count consecutive heat days
df$count <- 0
df$count[1] <- ifelse(df$heat[1] == 1, 1, 0)
for (i in 2:nrow(df)) {
  df$count[i] <- ifelse(df$heat[i] == 1, df$count[i - 1] + 1, 0)
}
# select last day of heat wave (and show the number of days in $count)
df[which(df$count >= 3),]
Here's a quick little solution:
is_High_Temp <- Mydata$MAX >= quantile(Mydata$MAX, .9) &
                Mydata$MIN >= quantile(Mydata$MIN, .9)
start_of_a_series <- c(TRUE, is_High_Temp[-1] != is_High_Temp[-length(is_High_Temp)]) # this is the tricky part
series_number <- cumsum(start_of_a_series)
series_length <- ave(series_number, series_number, FUN = length)
is_heat_wave <- series_length >= 3 & is_High_Temp
A solution with dplyr, also using rle():
library(dplyr)
cond <- expr(MAX >= 44.5 & MIN >= 24.5)
df %>%
mutate(heatwave =
rep(rle(!!cond)$values & rle(!!cond)$lengths >= 3,
rle(!!cond)$lengths)) %>%
filter(heatwave)
#> YEAR MONTH DAY MAX MIN heatwave
#> 1 1989 7 24 44.5 25.0 TRUE
#> 2 1989 7 25 44.8 26.0 TRUE
#> 3 1989 7 26 44.8 24.6 TRUE
Created on 2020-05-16 by the reprex package (v0.3.0)
data
#devtools::install_github("alistaire47/read.so")
df <- read.so::read.so("YEAR MONTH DAY MAX MIN
1989 7 18 45.0 23.5
1989 7 19 44.2 26.1
1989 7 20 44.7 24.4
1989 7 21 44.6 29.5
1989 7 24 44.4 31.6
1989 7 25 44.2 26.7
1989 7 26 44.5 25.0
1989 7 28 44.8 26.0
1989 7 29 44.8 24.6
1989 8 19 45.0 24.3
1989 8 20 44.8 26.0
1989 8 21 44.4 24.0
1989 8 22 45.2 25.0")
