I'm having trouble figuring out how to get xblocks() to work. First, here's a small example from a much larger dataset:
data <- data.frame(
Date = sample(c("1993-07-05", "1993-07-05", "1993-07-05", "1993-08-30", "1993-08-30", "1993-08-30", "1993-08-30", "1993-09-04", "1993-09-04")),
Oxygen = sample(c("0.9", "0.4", "4.2", "5.6", "7.3", NA, "9.5", NA, "0.3")))
I then averaged values for each month using xts:
xtsAveragedata <- xts(Averagedata[-1], Averagedata[[1]])
xtsAverageMonthlyData <- apply.monthly(xtsAveragedata, FUN = mean)
Next, I linearly interpolated my data:
Interpolateddata <- na.approx(xtsAverageMonthlyData)
I want to create a figure in which I use xblocks() or something similar to show the regions in my data where I used interpolation, something like this, which I found online:
How do I go about doing this for all values and automating it for my entire dataset? There are no examples in the reference guide that I could translate into something like this.
Thank you for your help. It is greatly appreciated.
So this doesn't use xts or zoo, but maybe this walkthrough will be helpful. I am using a slightly larger (and daily) dataset, but it should be reproducible:
library(tidyverse)
library(lubridate)
set.seed(4)
df <- tibble(
Date = seq.Date(ymd("1993-07-01"), by = "1 day", length.out = 100),
Oxygen = runif(100, 0, 10)
)
# Randomly assign 20 records to NA
df[sample(1:nrow(df), 20), "Oxygen"] <- NA
df_for_plot <- df %>%
arrange(Date) %>%
group_by(month(Date)) %>%
mutate(
is_na = is.na(Oxygen),
month_avg = mean(Oxygen, na.rm = TRUE),
oxygen_to_plot = if_else(is_na, month_avg, Oxygen)
)
df_for_plot
#> # A tibble: 100 x 6
#> # Groups: month(Date) [4]
#> Date Oxygen `month(Date)` is_na month_avg oxygen_to_plot
#> <date> <dbl> <dbl> <lgl> <dbl> <dbl>
#> 1 1993-07-01 5.86 7 FALSE 5.87 5.86
#> 2 1993-07-02 0.0895 7 FALSE 5.87 0.0895
#> 3 1993-07-03 2.94 7 FALSE 5.87 2.94
#> 4 1993-07-04 2.77 7 FALSE 5.87 2.77
#> 5 1993-07-05 8.14 7 FALSE 5.87 8.14
#> 6 1993-07-06 NA 7 TRUE 5.87 5.87
#> 7 1993-07-07 7.24 7 FALSE 5.87 7.24
#> 8 1993-07-08 9.06 7 FALSE 5.87 9.06
#> 9 1993-07-09 9.49 7 FALSE 5.87 9.49
#> 10 1993-07-10 0.731 7 FALSE 5.87 0.731
#> # ... with 90 more rows
# Plot the regular data, but for the geom_rect use only the filtered data where the is_na column is TRUE.
# Assuming you have daily data, you just set the xmax to be that Date + 1.
ggplot(df_for_plot, aes(x = Date, y = oxygen_to_plot)) +
geom_line() +
geom_rect(
data = df_for_plot %>% filter(is_na),
aes(xmin = Date, xmax = Date + 1, ymin = -Inf, ymax = +Inf), fill = "skyblue", alpha = 0.5
)
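If you want to stay with the xts objects from the question, the same geom_rect idea carries over once the series is pulled into a data frame. Here is a minimal sketch, assuming xtsAverageMonthlyData exists as built in the question (still containing its NAs); the xmax = Date + 31 is just a rough stand-in for "one month" of shading width:
library(ggplot2)
library(zoo)  # index(), coredata() and na.approx() also work on xts objects

plot_df <- data.frame(
  Date         = as.Date(index(xtsAverageMonthlyData)),
  Oxygen       = as.numeric(na.approx(xtsAverageMonthlyData, na.rm = FALSE)),
  interpolated = as.vector(is.na(coredata(xtsAverageMonthlyData)))  # TRUE where values were filled in
)

ggplot(plot_df, aes(x = Date, y = Oxygen)) +
  geom_line() +
  geom_rect(
    data = subset(plot_df, interpolated),
    aes(xmin = Date, xmax = Date + 31, ymin = -Inf, ymax = Inf),
    fill = "skyblue", alpha = 0.5, inherit.aes = FALSE
  )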
I am working with a large time series of oceanographic data which needs a lot of manipulation.
I have several days of data missing and would like to interpolate them. Specifically date/depth/temperature.
Here is an example of my df:
> tibble(df)
# A tibble: 351,685 x 9
date time depthR SV temp salinity conduct density calcSV
<date> <times> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2021-11-17 07:50:18 0.5 1524. 19.7 37.8 51.0 27 1524.
2 2021-11-17 07:50:22 0.5 1524. 19.9 37.6 50.9 26.8 1524.
3 2021-11-17 07:50:23 1.1 1524. 19.9 37.6 50.9 26.8 1524.
4 2021-11-17 07:50:24 1.5 1524. 19.9 37.6 50.9 26.8 1524.
5 2021-11-17 07:50:25 2 1524. 19.9 37.6 50.9 26.8 1524.
Each date contains over 1000 lines of data, so my idea was to find the max depth for each day and then interpolate reasonable max depth values for the missing days in between.
So far, I have found the max depth per date:
group <- df %>% group_by(date) %>% summarise(max = max(depthR, na.rm = TRUE))
> tibble(group)
# A tibble: 40 x 2
date max
<date> <dbl>
1 2021-11-17 685.
2 2021-11-18 695.
3 2021-11-19 136.
4 2021-11-20 138.
5 2021-11-21 142.
6 2021-11-22 26
7 2021-11-23 136.
8 2021-11-24 297.
9 2021-11-25 613.
10 2021-11-26 81.1
# ... with 30 more rows
I then generated the full sequence of dates, including the missing ones, with:
> group <- seq(min(group$date), max(group$date), by = "1 day")
> group <- data.frame(date=group)
> tibble(group)
# A tibble: 69 x 1
date
<date>
1 2021-11-17
2 2021-11-18
3 2021-11-19
4 2021-11-20
5 2021-11-21
6 2021-11-22
7 2021-11-23
8 2021-11-24
9 2021-11-25
10 2021-11-26
# ... with 59 more rows
As you can see, the previous query was overwritten.
So I tried creating a new df for the generated dates and merging the two together, but got this error:
> library(stringr)
> group$combined <- str_c(group$date, '', dateinterp$date)
Error: Assigned data `str_c(group$date, "", dateinterp$date)` must be compatible with existing data.
x Existing data has 40 rows.
x Assigned data has 69 rows.
i Only vectors of size 1 are recycled.
How can I combine these two sets of dates of differing length into one dataframe in chronological order, without overwriting the original data or creating conflicts?
Following that, I'm not sure how I would proceed to interpolate the depths and temperatures for each date.
Perhaps starting with something like the following:
depth = seq(1, 200, length.out = 100)
Eventually the date variable will be exchanged for geo coords.
Any advice greatly appreciated.
EDIT: As requested by @AndreaM, an example of my data:
> dput(head(df))
structure(list(date = structure(c(18948, 18948, 18948, 18948,
18948, 18948), class = "Date"), time = structure(c(0.326597222222222,
0.326643518518519, 0.326655092592593, 0.326666666666667, 0.326678240740741,
0.326712962962963), format = "h:m:s", class = "times"), depth = c(0.5,
0.5, 1.1, 1.5, 2, 2.5), SV = c(1524.024, 1524.026, 1524.025,
1524.008, 1524.016, 1524.084), temp = c(19.697, 19.864, 19.852,
19.854, 19.856, 19.847), salinity = c(37.823, 37.561, 37.557,
37.568, 37.573, 37.704), conduct = c(51.012, 50.878, 50.86, 50.876,
50.884, 51.032), density = c(27, 26.755, 26.758, 26.768, 26.773,
26.877), calcSV = c(1523.811, 1523.978, 1523.949, 1523.975, 1523.993,
1524.124)), row.names = 100838:100843, class = "data.frame")
One approach; adapt to your case as appropriate:
library(dplyr)
library(lubridate) ## facilitates date-time manipulations
## example data:
patchy_data <- data.frame(date = as.Date('2021-11-01') + sample(1:10, 6),
                          value = rnorm(6)) %>%
  arrange(date)
## create vector of -only!- missing dates:
missing_dates <-
setdiff(
seq.Date(from = min(patchy_data$date),
to = max(patchy_data$date),
by = '1 day'
),
patchy_data$date
) %>% as.Date(origin = '1970-01-01')
## extend initial dataframe with rows per missing date:
full_data <-
patchy_data %>%
bind_rows(data.frame(date = missing_dates,
value = NA)
) %>%
arrange(date)
## group by month and impute missing data from monthwise statistic:
full_data %>%
mutate(month = lubridate::month(date)) %>%
group_by(month) %>%
## coalesce conveniently replaces ifelse-constructs to replace NAs
mutate(imputed = coalesce(value, mean(value, na.rm = TRUE)))
Edit
One way to combine the generated missing dates with additional parameters (e.g. measurement depths) is to use expand.grid, as follows. Assuming object names from the previous code:
## depths of daily measurements:
observation_depths <- c(0.5, 1.1, 1.5) ## example
## generate dataframe with missing dates x depths:
missing_dates_and_depths <-
setNames(expand.grid(missing_dates, observation_depths),
c('date','depthR')
)
## stack both dataframes as above:
full_data <-
patchy_data %>%
bind_rows(missing_dates_and_depths) %>%
arrange(date)
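To then actually interpolate the missing daily maxima (rather than filling them with a monthly mean), one option is to left-join the observed per-day maxima onto the full date sequence and let zoo::na.approx() interpolate linearly across the gaps. A sketch, assuming group still holds one max depth per observed date, as computed in the question:
library(dplyr)
library(zoo)

all_dates <- data.frame(date = seq(min(group$date), max(group$date), by = "1 day"))

interpolated_max <- all_dates %>%
  left_join(group, by = "date") %>%                               # NA for days with no data
  mutate(max_interp = na.approx(max, x = date, na.rm = FALSE))    # linear interpolation over gaps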
Suppose I have data that looks like this:
Date time price minute FOMC Daily.Return
<date> <time> <dbl> <dbl> <fct> <dbl>
1 2005-01-03 16:00:00 120. 960 FALSE -1.24
2 2005-01-04 16:00:00 119. 960 FALSE -1.44
3 2005-01-05 16:00:00 118. 960 FALSE -0.354
4 2005-01-06 16:00:01 119. 960 FALSE 0.245
5 2005-01-07 15:59:00 119. 959 FALSE -0.328
6 2005-01-10 16:00:00 119. 960 FALSE 0.506
7 2005-01-11 16:00:00 118. 960 FALSE -0.279
8 2005-01-12 16:00:01 119. 960 FALSE 0.329
9 2005-01-13 16:00:00 118. 960 FALSE -0.787
10 2005-01-14 16:00:00 118. 960 FALSE 0.372
I want to summarize Daily.Return per group using the FOMC variable which is either TRUE or FALSE. That is easy with dplyr. I get the following:
daily.SPY %>% group_by(FOMC) %>%
summarise(Mean = 100 * mean(Daily.Return),
Median = 100 * median(Daily.Return),
Vol = 100 * sqrt(252) * sd(Daily.Return/100))
As expected, I get the following tibble back:
FOMC Mean Median Vol
<fct> <dbl> <dbl> <dbl>
1 FALSE 0.00551 5.24 14.9
2 TRUE 20.8 1.20 17.6
However, I would like to have a third row which would perform the same computations without grouping. It would compute the average, median and standard deviation for the whole sample, without conditioning on the group. What's the easiest way to do that within tidyverse? Thanks!
One option is to just row-bind a duplicate of the whole data in which you mutate() the FOMC variable to "ALL", so that it ends up as a separate group when you group_by() and summarise().
library(tidyverse)
set.seed(1)
daily.SPY <- tibble(
FOMC = factor(rep(c(T, F), each = 25)),
Daily.Return = c(cumsum(rnorm(25)), cumsum(rnorm(25)))
)
daily.SPY %>%
bind_rows(., mutate(., FOMC = "ALL")) %>%
group_by(FOMC) %>%
summarise(Mean = 100 * mean(Daily.Return),
Median = 100 * median(Daily.Return),
Vol = 100 * sqrt(252) * sd(Daily.Return/100))
#> # A tibble: 3 x 4
#> FOMC Mean Median Vol
#> <chr> <dbl> <dbl> <dbl>
#> 1 ALL 58.4 -6.57 32.3
#> 2 FALSE -80.3 -53.6 13.9
#> 3 TRUE 197. 151. 30.5
Created on 2022-01-11 by the reprex package (v2.0.1)
You can make a function for summarizing data:
summarize_returns = function(data) {
data %>%
summarise(
Mean = 100 * mean(Daily.Return),
Median = 100 * median(Daily.Return),
Vol = 100 * sqrt(252) * sd(Daily.Return / 100),
.groups = "drop"
)
}
Then you can combine the two summaries using dplyr::bind_rows():
data %>%
group_by(FOMC) %>%
summarize_returns() %>%
bind_rows(
data %>% summarize_returns() %>% mutate(FOMC = "Total")
)
# A tibble: 3 x 4
FOMC Mean Median Vol
<chr> <dbl> <dbl> <dbl>
1 FALSE -13.6 -13.3 15.5
2 TRUE 14.4 8.79 16.6
3 Total 0.992 -1.08 16.2
My data:
library(tidyverse)
set.seed(123)
data = tibble(
  FOMC = as.character(sample(c(TRUE, FALSE), 100, replace = TRUE)),
  Daily.Return = rnorm(100)
)
I have the following data frame in R
data <- structure(list(Date = structure(c(18352, 18382, 18413, 18443,
18474, 18505, 18535, 18566, 18596, 18627), class = "Date"), `Item 1` = c(1.51832975855564,
0.37616251475745, 0.235532024125229, 0.709469777058103, 1.02933768602063,
1.32152918133017, 1.40923563776068, 1.20359679507398, 1.16086943030891,
1.28886722075181), `Item 2` = c(2.14545986795986, 2.96713831051805,
3.07870599806344, 3.23176921561792, 4.15485377279825, 4.90266273750217,
5.01422259880169, 4.41057753970351, 4.99683267473077, 5.26300032931175
)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
))
I would like to calculate rolling correlations between the two columns in my dataframe. When I try to use rollapply like below
correlation <- rollapply(data, width=3, function(x) cor(x[,2],x[,3]), by.column=FALSE)
I get the following error. I've tried converting the two columns to numeric using as.numeric, but that didn't solve the issue. Any ideas on how to fix this?
Error in cor(x[, 2], x[, 3]) : 'x' must be numeric
1) rollapply coerces the data to a matrix, and since the first column of data is not numeric it ends up passing a character matrix. Omit the first column and it will work. Also, you may or may not want to use rollapplyr, with an r on the end, to get a right-aligned window (the default is centered), and fill = NA (the default is no filling) to pad the result out to the same number of rows as data. Both are shown here in case that is what you wanted.
library(zoo)  # provides rollapply() / rollapplyr()

cor2 <- function(x, ...) cor(x[, 1], x[, 2], ...) # scalar cor of 1st 2 cols
rollapplyr(data[-1], 3, cor2, by.column = FALSE, fill = NA)
## [1] NA NA -0.9999543 0.7475923 0.8759702 0.9994088
## [7] 0.9947371 0.9652525 0.3777091 0.4791595
2) or if you wanted a centered window with no NA filling then use rollapply with no r on the end and omit the fill= argument.
rollapply(data[-1], 3, cor2, by.column = FALSE)
## [1] -0.9999543 0.7475923 0.8759702 0.9994088 0.9947371 0.9652525 0.3777091
## [8] 0.4791595
3) Another possibility is to use a zoo object as the input, giving the zoo object shown below as output. A zoo object stores the time index as an attribute rather than a column, so we can pass it as is.
z <- read.zoo(data)
rollapply(z, 3, cor2, by.column = FALSE)
## 2020-04-30 2020-05-31 2020-06-30 2020-07-31 2020-08-31 2020-09-30 2020-10-31
## -0.9999543 0.7475923 0.8759702 0.9994088 0.9947371 0.9652525 0.3777091
## 2020-11-30
## 0.4791595
Note
One other point, not related to the rolling window: the input appears to use the last day of each month to represent a year/month. The yearmon class (also from zoo) can represent a year and month directly, without a day. It stores the date internally as the year plus a fraction (Jan = 0, Feb = 1/12, ..., Dec = 11/12), displays as the month and year, and is regularly spaced, unlike month-end dates, which are irregularly spaced because months have different numbers of days.
data$Date <- as.yearmon(data$Date)
giving:
# A tibble: 10 x 3
Date `Item 1` `Item 2`
<yearmon> <dbl> <dbl>
1 Mar 2020 1.52 2.15
2 Apr 2020 0.376 2.97
3 May 2020 0.236 3.08
4 Jun 2020 0.709 3.23
5 Jul 2020 1.03 4.15
6 Aug 2020 1.32 4.90
7 Sep 2020 1.41 5.01
8 Oct 2020 1.20 4.41
9 Nov 2020 1.16 5.00
10 Dec 2020 1.29 5.26
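If you do convert to yearmon, the rolling correlation from (3) can be computed on the yearmon-indexed series in the same way; only the index changes. A small sketch reusing cor2 from above (assuming read.zoo keeps the yearmon index, which it does when the index column is already a time class):
library(zoo)

data$Date <- as.yearmon(data$Date)   # as in the note above
z <- read.zoo(data)                  # zoo series indexed by yearmon
rollapplyr(z, 3, cor2, by.column = FALSE, fill = NA)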
I'll respond this way, as I prefer slider (and it would be even safer to use tsibble on your dates). Anyway, here's an answer (where you'll need, say, nine observations before calculating a correlation):
library(dplyr)  # for mutate() and the pipe; slider is called via its namespace below

data %>%
mutate(
correl = slider::slide2_dbl(
.x = `Item 1`,
.y = `Item 2`,
.f = ~cor(.x, .y),
.before = 8L,
.complete = TRUE
)
)
... which produces:
# A tibble: 10 x 4
Date `Item 1` `Item 2` correl
<date> <dbl> <dbl> <dbl>
1 2020-03-31 1.52 2.15 NA
2 2020-04-30 0.376 2.97 NA
3 2020-05-31 0.236 3.08 NA
4 2020-06-30 0.709 3.23 NA
5 2020-07-31 1.03 4.15 NA
6 2020-08-31 1.32 4.90 NA
7 2020-09-30 1.41 5.01 NA
8 2020-10-31 1.20 4.41 NA
9 2020-11-30 1.16 5.00 0.407
10 2020-12-31 1.29 5.26 0.941
That said, you can switch to .complete = FALSE if you want correlation estimates for the earlier dates as well (computed on however many observations are available).
I have a data frame in R that contains readings every five minutes for a couple of months. I want to calculate the daily mean of Var3 (data frame below) and add it to this data frame as Var4.
Here is my df:
>df
timestamp Var1 Var2 Var3
1 2018-07-20 13:50:00 32.0358 28.1 3.6
2 2018-07-20 13:55:00 32.0358 28.0 2.5
3 2018-07-20 14:00:00 32.0358 28.1 2.2
I found this solution by searching the forum, but it raises an error.
Here is the solution I am applying:
aggregate(ts(df$var3[, 2], freq = 288), 1, mean)
This is the error I am getting:
Error in df$var3[, 2] : incorrect number of dimensions
I think this should work for my data frame too, but I am not able to get rid of this error. Please help.
Here's an approach with dplyr and lubridate.
library(dplyr)
library(lubridate)
df %>%
group_by(Day = day(ymd_hms(timestamp))) %>%
mutate(Var4 = mean(Var3))
## A tibble: 1,000 x 6
## Groups: Day [5]
# timestamp Var1 Var2 Var3 Day Var4
# <dttm> <dbl> <dbl> <dbl> <int> <dbl>
# 1 2018-07-20 13:55:30 32.2 22.9 2.35 20 2.99
# 2 2018-07-20 14:00:30 37.7 24.8 2.99 20 2.99
# 3 2018-07-20 14:05:30 38.7 29.6 3.47 20 2.99
# 4 2018-07-20 14:10:30 30.4 24.2 3.02 20 2.99
# 5 2018-07-20 14:15:30 32.0 28.4 2.95 20 2.99
## … with 995 more rows
Sample Data
df <- data.frame(timestamp = ymd_hms("2018-07-20 13:50:30") + 60*5 * 1:1000,
                 Var1 = runif(1000, 30, 40),
                 Var2 = runif(1000, 20, 30),
                 Var3 = runif(1000, 2, 4))
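One caveat worth noting (this is an assumption about the full dataset, which spans a couple of months): day() returns the day of the month, so July 20 and August 20 would land in the same group. Grouping by the calendar date instead avoids that:
df %>%
  group_by(Day = as_date(timestamp)) %>%   # use as_date(ymd_hms(timestamp)) if timestamp is character
  mutate(Var4 = mean(Var3)) %>%
  ungroup()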
I do time series decomposition and want to save the resulting objects in a dataframe. It works if I store the results in an object and use it to make the dataframe afterwards:
# needed packages
library(tidyverse)
library(forecast)
# some "time series"
vec <- 1:1000 + rnorm(1000)
# store pipe results
pipe_out <-
# do decomposition
decompose(msts(vec, start= c(2001, 1, 1), seasonal.periods= c(7, 365.25))) %>%
# relevant data
.$seasonal
# make a dataframe with the stored seasonal data
data.frame(ts= pipe_out)
But doing the same as a one-liner fails:
decompose(msts(vec, start= c(2001, 1, 1), seasonal.periods= c(7, 365.25))) %>%
data.frame(ts= .$seasonal)
I get the error
Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) :
cannot coerce class ‘"decomposed.ts"’ to a data.frame
I thought that the pipe simply passes forward the result of the previous step, which saves us from storing it in intermediate objects. If so, shouldn't both versions produce the very same output?
EDIT (from comments)
The first version works, but it is a poor solution because extracting all the vectors of the decomposed time series would require multiple steps. Something like the following would be better:
decompose(msts(vec, start= c(2001, 1, 1),
seasonal.periods= c(7, 365.25))) %>%
data.frame(seasonal= .$seasonal, x=.$x, trend=.$trend, random=.$random)
It's unclear from your example whether you want to extract $x or $seasonal. Either way, you can extract an element of a list with the `[[`() function from base R or its magrittr alias extract2(), as you prefer. You should then use the . when you create a data.frame in the last step.
Cleaning up the code a bit to be consistent with the piping, the following works:
library(magrittr)
library(tidyverse)
library(forecast)
vec <- 1:1000 + rnorm(1000)
vec %>%
msts(start = c(2001, 1, 1), seasonal.periods= c(7, 365.25)) %>%
decompose %>%
`[[`("seasonal") %>%
# extract2("seasonal") %>% # Another option, uncomment if preferred
data.frame(ts = .) %>%
head # Just for the reprex, remove as required
#> ts
#> 1 -1.17332998
#> 2 0.07393265
#> 3 0.37631946
#> 4 0.30640395
#> 5 1.04279779
#> 6 0.20470768
Created on 2019-11-28 by the reprex package (v0.3.0)
Edit based on comment:
To do what you mention in the comments, you need to use curly brackets (see e.g. here for an explanation why). Hence, the following works:
library(magrittr)
library(tidyverse)
library(forecast)
vec <- 1:1000 + rnorm(1000)
vec %>%
msts(start= c(2001, 1, 1), seasonal.periods = c(7, 365.25)) %>%
decompose %>%
{data.frame(seasonal = .$seasonal,
trend = .$trend)} %>%
head
#> seasonal trend
#> 1 -0.4332034 NA
#> 2 -0.6185832 NA
#> 3 -0.5899566 NA
#> 4 0.7640938 NA
#> 5 -0.4374417 NA
#> 6 -0.8739449 NA
However, for your specific use case, it may be clearer and easier to use magrittr::extract and then simply bind_cols:
vec %>%
msts(start= c(2001, 1, 1), seasonal.periods = c(7, 365.25)) %>%
decompose %>%
magrittr::extract(c("seasonal", "trend")) %>%
bind_cols %>%
head
#> # A tibble: 6 x 2
#> seasonal trend
#> <dbl> <dbl>
#> 1 -0.433 NA
#> 2 -0.619 NA
#> 3 -0.590 NA
#> 4 0.764 NA
#> 5 -0.437 NA
#> 6 -0.874 NA
Created on 2019-11-29 by the reprex package (v0.3.0)
With daily data, decompose() does not work well because it will only handle the annual seasonality and will give relatively poor estimates of it. If the data involve human behaviour, it will probably have both weekly and annual seasonal patterns.
Also, msts objects are not great for daily data either because they don't store the dates explicitly.
I suggest you use tsibble objects with an STL decomposition instead. Here is an example using your data.
library(tidyverse)
library(tsibble)
library(feasts)
mydata <- tsibble(
day = as.Date(seq(as.Date("2001-01-01"), length=1000, by=1)),
vec = 1:1000 + rnorm(1000)
)
#> Using `day` as index variable.
mydata
#> # A tsibble: 1,000 x 2 [1D]
#> day vec
#> <date> <dbl>
#> 1 2001-01-01 0.161
#> 2 2001-01-02 2.61
#> 3 2001-01-03 1.37
#> 4 2001-01-04 3.15
#> 5 2001-01-05 4.43
#> 6 2001-01-06 7.35
#> 7 2001-01-07 7.10
#> 8 2001-01-08 10.0
#> 9 2001-01-09 9.16
#> 10 2001-01-10 10.2
#> # … with 990 more rows
# Compute a decomposition
mydata %>% STL(vec)
#> # A dable: 1,000 x 7 [1D]
#> # STL Decomposition: vec = trend + season_year + season_week + remainder
#> day vec trend season_year season_week remainder season_adjust
#> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2001-01-01 0.161 14.7 -14.6 0.295 -0.193 14.5
#> 2 2001-01-02 2.61 15.6 -14.2 0.0865 1.04 16.7
#> 3 2001-01-03 1.37 16.6 -15.5 0.0365 0.240 16.9
#> 4 2001-01-04 3.15 17.6 -13.0 -0.0680 -1.34 16.3
#> 5 2001-01-05 4.43 18.6 -13.4 -0.0361 -0.700 17.9
#> 6 2001-01-06 7.35 19.5 -12.4 -0.122 0.358 19.9
#> 7 2001-01-07 7.10 20.5 -13.4 -0.181 0.170 20.7
#> 8 2001-01-08 10.0 21.4 -12.7 0.282 1.10 22.5
#> 9 2001-01-09 9.16 22.2 -13.8 0.0773 0.642 22.9
#> 10 2001-01-10 10.2 22.9 -12.7 0.0323 -0.0492 22.9
#> # … with 990 more rows
Created on 2019-11-30 by the reprex package (v0.3.0)
The output is a dable (decomposition table) which behaves like a dataframe most of the time. So you can extract the trend column, or either of the seasonal component columns in the usual way.
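For example, here is a small sketch of pulling selected components out of the dable with ordinary dplyr verbs (column names as in the output above):
library(dplyr)

mydata %>%
  STL(vec) %>%
  as_tibble() %>%
  select(day, trend, season_week, season_year)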