Rolling 7 Day Sum grouped by date and unique ID - r

I am using workload data to compute three metrics: Daily, 7-Day Rolling (sum of the last 7 days), and 28-Day Rolling Average (sum of the last 28 days divided by 4).
I have been able to compute the Daily metric, but I am having trouble with the 7-Day Rolling and 28-Day Rolling Average. I have 17 unique IDs for each date (dates range from 2018-08-09 to 2018-12-15).
library(tidyverse)  # attaches dplyr and tidyr
library(zoo)

Post_Practice <- read.csv("post.csv", stringsAsFactors = FALSE)
Post_Data <- Post_Practice[, 1:3]

DailyLoad <- Post_Data %>%
  group_by(Date, Name) %>%
  transmute(Daily = sum(DayLoad)) %>%
  distinct(Date, Name, .keep_all = TRUE) %>%
  mutate('7-day' = rollapply(Daily, 7, sum, na.rm = TRUE, partial = TRUE))
Input:
Date Name DayLoad
2018-08-09 Athlete 1 273.92000
2018-08-09 Athlete 2 351.16000
2018-08-09 Athlete 3 307.97000
2018-08-09 Athlete 1 434.20000
2018-08-09 Athlete 2 605.92000
2018-08-09 Athlete 3 432.87000
Input looks like this all the way to 2018-12-15. Some dates have multiple entries per athlete (like above) and some have only one entry.
This code produces the 7-day column, but it shows the same number as Daily, i.e.:
Date Name Daily 7-day
<chr> <chr> <dbl> <dbl>
1 2018-08-09 Athlete 1 708. 708.
2 2018-08-09 Athlete 2 957. 957.
3 2018-08-09 Athlete 3 741. 741.
The goal is to have the final table (i.e. 7 days later) look like this:
Date Name Daily 7-day
<chr> <chr> <dbl> <dbl>
1 2018-08-15 Athlete 1 413. 3693.
2 2018-08-15 Athlete 2 502. 4348.
3 2018-08-15 Athlete 3 490. 4007.
Here Daily is the sum for that specific date and 7-day is the sum of the last 7 dates for that specific unique ID.

The help file for rollsum says:
The default methods of rollmean and rollsum do not handle inputs that
contain NAs.
Use rollapplyr(x, width, sum, na.rm = TRUE) to exclude NAs in the input from the sum. Note the r at the end of rollapplyr to specify right alignment.
Also note the partial=TRUE argument can be used if you want partial sums at the beginning rather than NAs.
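Putting those pieces together for the question's data, here is a minimal sketch (assuming Post_Data has the Date, Name and DayLoad columns shown above). The key change is to collapse to one row per Date/Name first and then group by Name only, so the window rolls across dates within each athlete; the original code kept the data grouped by both Date and Name, leaving one row per group, which is why 7-day simply repeated Daily.
library(dplyr)
library(zoo)

DailyLoad <- Post_Data %>%
  group_by(Date, Name) %>%
  summarise(Daily = sum(DayLoad)) %>%   # one row per date per athlete
  ungroup() %>%
  arrange(Name, Date) %>%
  group_by(Name) %>%                    # roll within each athlete, across dates
  mutate(`7-day`  = rollapplyr(Daily, 7,  sum, na.rm = TRUE, partial = TRUE),
         `28-day` = rollapplyr(Daily, 28, sum, na.rm = TRUE, partial = TRUE) / 4)
Note that the window counts rows, not calendar days, so this assumes each athlete has an entry for every date in the range.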

Related

Sum table values per day

I have a table, as shown in the image, where each comment has a publication date with year, month, day, and time. I would like to sum the sentiment values by day.
This is how the table is composed:
serie <- data.frame(comments$created_time,sentiment2$positive-sentiment2$negative)
Using dplyr you can do:
library(dplyr)
df %>%
  group_by(as.Date(comments.created_time)) %>%
  summarize(total = sum(sentiment))
Here is some sample data that will help others to troubleshoot and understand the data:
df <- tibble(comments.created_time = c("2015-01-26 22:43:00",
                                       "2015-01-26 22:44:00",
                                       "2015-01-27 22:43:00",
                                       "2015-01-27 22:44:00",
                                       "2015-01-28 22:43:00",
                                       "2015-01-28 22:44:00"),
             sentiment = c(1, 3, 5, 1, 9, 1))
Using the sample data will yield:
# A tibble: 3 × 2
`as.Date(comments.created_time)` total
<date> <dbl>
1 2015-01-26 4
2 2015-01-27 6
3 2015-01-28 10
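A small usage note: naming the grouping expression avoids the awkward backticked as.Date(comments.created_time) column name in the output:
df %>%
  group_by(date = as.Date(comments.created_time)) %>%
  summarize(total = sum(sentiment))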

R Calculate change in Weekly values Year on Year (with additional complication)

I have a data set of daily values. It spans from Dec-1 2018 to April-1 2020.
The columns are "date" and "value". As shown here:
date <- c("2018-12-01", "2018-12-02", "2018-12-03",
          ...
          "2020-03-30", "2020-03-31", "2020-04-01")
value <- c(1592, 1825, 1769, 1909, 2022, ..., 2287, 2169, 2366, 2001, 2087, 2099, 2258)
df <- data.frame(date, value)
What I would like to do is sum the values by week and then calculate the week-over-week change from the current to the previous year.
I know that I can sum by week using the following code:
Data_week <- df %>%
  group_by(week = cut(as.Date(date), "week")) %>%
  mutate(summed = sum(value))
My questions are twofold:
1) How do I sum by week and then manipulate the data frame so that I can calculate the week-over-week change (e.g. week of Dec 1 2019 vs. week of Dec 1 2018)?
2) How can I do the above using a "customized" week? Let's say I want to define a week by moving 7 days back from the latest date I have data for, e.g. the latest week would start on March 26th (April 1st minus 7 days).
We can use lag from dplyr to help and also some convenience functions from lubridate.
library(dplyr)
library(lubridate)
df %>%
  mutate(year = year(date)) %>%
  group_by(week = week(date), year) %>%
  summarize(summed = sum(value)) %>%
  arrange(year, week) %>%
  ungroup() %>%
  mutate(change = summed - lag(summed))
# week year summed change
# <dbl> <dbl> <dbl> <dbl>
# 1 48 2018 3638. NA
# 2 49 2018 15316. 11678.
# 3 50 2018 13283. -2033.
# 4 51 2018 15166. 1883.
# 5 52 2018 12885. -2281.
# 6 53 2018 1982. -10903.
# 7 1 2019 14177. 12195.
# 8 2 2019 14969. 791.
# 9 3 2019 14554. -415.
#10 4 2019 12850. -1704.
#11 5 2019 1907. -10943.
If you would like to define "weeks" in different ways, there are also isoweek and epiweek. See this answer for a great explanation of your options.
Data
set.seed(1)
df <- data.frame(date = seq.Date(from = as.Date("2018-12-01"),
                                 to = as.Date("2019-01-29"), "days"),
                 value = runif(60, 1500, 2500))
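For question 2, here is one possible sketch of the "customized" week, under the assumption that weeks should be counted in 7-day blocks backwards from the latest date in the data (so with data up to April 1st, the most recent block starts on March 26th):
library(dplyr)

df %>%
  mutate(days_back = as.integer(max(date) - date),  # days before the latest date
         week_id = days_back %/% 7) %>%             # 0 = most recent 7-day block
  group_by(week_id) %>%
  summarise(week_start = min(date), summed = sum(value))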

How to calculate/count the number of extreme precipitation events (above a "threshold") from daily rainfall data in each month per year basis

I am working on daily rainfall data and trying to count the extreme events in the time series above a certain threshold value, i.e. the number of times the rainfall exceeded that threshold in each month of each year.
The rainfall timeseries data is from St Lucia and has two columns:
"YEARMODA" - defining the time (format- YYYYMMDD)
"PREP" - rainfall in mm (numeric)
StLucia <- read_excel("C:/Users/hp/Desktop/StLuciaProject.xlsx")
The data frame I'm working on, i.e. "Precip1", has two columns:
Time (format YYYY-MM-DD)
Precipitation (numeric value)
The code is provided below:
library("imputeTS")
StLucia$YEARMODA <- as.Date(as.character(StLucia$YEARMODA), format = "%Y%m%d")
data1 <- na_ma(StLucia$PREP, k=4, weighting = "exponential")
Precip1 <- data.frame(Time= StLucia$YEARMODA, Precipitation= data1, check.rows = TRUE)
I determined the threshold values based on the 95th and 99th percentiles using quantile().
I now want to count the number of "extreme events" of rainfall above these thresholds in each month on a per-year basis. Any help would be much appreciated!
If you are open to a tidyverse method, here is an example with the economics dataset that is built into ggplot2. We can use ntile to assign a percentile group to each observation. Then we group_by the year, and get a count of the values that are in the desired percentiles. Because this is monthly data the counts are pretty low, but it's easily translated to daily data.
library(tidyverse)
library(lubridate)  # year() is not attached by library(tidyverse) alone

thresholds <- economics %>%
  mutate(
    pctile = ntile(unemploy, 100),
    year = year(date)
  ) %>%
  group_by(year) %>%
  summarise(
    q95 = sum(pctile >= 95L),
    q99 = sum(pctile >= 99L)
  )

arrange(thresholds, desc(q95))
#> # A tibble: 49 x 3
#> year q95 q99
#> <dbl> <int> <int>
#> 1 2010 12 6
#> 2 2011 12 0
#> 3 2009 10 5
#> 4 1967 0 0
#> 5 1968 0 0
#> 6 1969 0 0
#> 7 1970 0 0
#> 8 1971 0 0
#> 9 1972 0 0
#> 10 1973 0 0
#> # ... with 39 more rows
Created on 2018-06-04 by the reprex package (v0.2.0).
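Translating this to the asker's Precip1 data frame is a matter of grouping by both year and month. A hedged sketch, assuming the Time and Precipitation columns described above and a threshold computed with quantile():
library(dplyr)
library(lubridate)

thr95 <- quantile(Precip1$Precipitation, 0.95, na.rm = TRUE)

Precip1 %>%
  group_by(year = year(Time), month = month(Time)) %>%  # one group per month per year
  summarise(extreme_events = sum(Precipitation > thr95))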

How to calculate aggregate statistics on a dataframe in R by applying conditions on time values?

I am working on climate data analysis. After loading the file in R, my interest is to subset the data based upon hours of the day.
For time analysis we can use $hour on the variable in which the time vector is stored, if our interest is to deal with hours.
I want to subset my data for each hour in a day for 365 days and then take an average of the data at a particular hour throughout the year. Say I am interested in taking the values of irradiation/wind speed etc. at 12:00 PM for a year and then taking the mean of these values to get the desired result.
I know how to subset a data frame based upon conditions. If, for example, my data is in a matrix called data and contains two columns, say time and wind speed, and I'm interested in subsetting the rows in which irradiation isn't zero, we can do this using the following code:
my_data <- subset(data, data[,1]>0)
But now, in order to deal with the hour values in the time column, which is a variable stored in data, how can I subset the values?
My data look like this: [screenshot omitted]
I hope I made sense in this question.
Thanks in advance!
Here is a possible solution. You can create an hourly grouping with format(df$time, '%H'), so we obtain only the hour for each period. We can then simply group by this new column and calculate the mean for each group.
df = data.frame(time = seq(Sys.time(), Sys.time() + 2*60*60*24, by = 'hour'),
                val = sample(seq(5), 49, replace = TRUE))
library(dplyr)
df %>%
  mutate(hour = format(time, '%H')) %>%
  group_by(hour) %>%
  summarize(mean_val = mean(val))
To subset the non-zero values first, you can do either:
df = subset(df, val != 0)
or start the dplyr chain with:
df %>% filter(val != 0)
Hope this helps!
df looks as follows:
time val
1 2018-01-31 12:43:33 4
2 2018-01-31 13:43:33 2
3 2018-01-31 14:43:33 2
4 2018-01-31 15:43:33 3
5 2018-01-31 16:43:33 3
6 2018-01-31 17:43:33 1
7 2018-01-31 18:43:33 2
8 2018-01-31 19:43:33 4
... ... ... ...
And the output:
# A tibble: 24 x 2
hour mean_val
<chr> <dbl>
1 00 3.50
2 01 3.50
3 02 4.00
4 03 2.50
5 04 3.00
6 05 2.00
.... ....
This assumes your time column is already of class POSIXct; otherwise you'd first have to convert it, for example with as.POSIXct(x, format = '%Y-%m-%d %H:%M:%S').
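For the asker's specific 12:00 PM example, you can also filter on the hour first and average only those observations (a sketch using the same df):
df %>%
  filter(format(time, '%H') == '12') %>%  # keep only the noon observations
  summarize(mean_val = mean(val))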

R: converting xts or zoo object to a data frame

What is an easy way of coercing time series data to a data frame, in a format where the resulting data is a summary of the original?
This could be some example data, stored in xts or zoo object:
t, V1
"2010-12-03 12:00", 10.0
"2010-11-04 12:00", 10.0
"2010-10-05 12:00", 10.0
"2010-09-06 12:00", 10.0
...and so on, monthly data for many years.
and I would like to transform it to a data frame like:
year, month, V1
2010, 12, a descriptive statistic calculated from that month's data
2010, 11, ...
2010, 10, ...
2010, 9, ...
The reason I'm asking this, is because I want to plot monthly calculated summaries of data in the same plot. I can do this quite easily for data in the latter format, but haven't found a plotting method for the time series format.
For example, I could have temperature data from several years measured at a daily interval, and I would like to plot the curves of the monthly mean temperatures for each year in the same plot. I couldn't figure out how to do this using the xts-formatted data, or whether this even suits the purpose of the xts/zoo format, which seems to always carry the year information along with it.
Please provide a sample of data to work with and I will try to provide a less general answer. Basically you can use apply.monthly to calculate summary statistics on your xts object. Then you can convert the index to yearmon and convert the xts object to a data.frame.
library(xts)  # also attaches zoo, which provides as.yearmon()

x <- xts(rnorm(50), Sys.Date() + 1:50)
mthlySumm <- apply.monthly(x, mean)
index(mthlySumm) <- as.yearmon(index(mthlySumm))
Data <- as.data.frame(mthlySumm)
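If you want the separate year and month columns from the question, one possible follow-up (Data2 is just an illustrative name; yearmon indices accept format strings, and coredata() is from zoo):
Data2 <- data.frame(year  = as.integer(format(index(mthlySumm), "%Y")),
                    month = as.integer(format(index(mthlySumm), "%m")),
                    V1    = as.numeric(coredata(mthlySumm)))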
Here's a solution using the tidyquant package, which includes functions as_xts() for coercing data frames to xts objects and as_tibble() for coercing xts objects to tibbles ("tidy" data frames).
Recreating your data:
> data_xts
V1
2010-09-06 10
2010-10-05 10
2010-11-04 10
2010-12-03 10
Use as_tibble() to convert to a tibble. The preserve_row_names = TRUE argument adds a column called "row.names" containing the xts index as character class. A rename() and a mutate() are used to clean up the dates. The output is a tibble of dates and values.
> data_df <- data_xts %>%
    as_tibble(preserve_row_names = TRUE) %>%
    rename(date = row.names) %>%
    mutate(date = as_date(date))
> data_df
# A tibble: 4 × 2
date V1
<date> <dbl>
1 2010-09-06 10
2 2010-10-05 10
3 2010-11-04 10
4 2010-12-03 10
You can go a step further and add other fields such as day, month, and year using the mutate function.
> data_df %>%
mutate(day = day(date),
month = month(date),
year = year(date))
# A tibble: 4 × 5
date V1 day month year
<date> <dbl> <int> <dbl> <dbl>
1 2010-09-06 10 6 9 2010
2 2010-10-05 10 5 10 2010
3 2010-11-04 10 4 11 2010
4 2010-12-03 10 3 12 2010
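From here, the monthly summary table from the question is one group_by() away (a sketch; mean() stands in for whichever descriptive statistic you want):
> data_df %>%
    group_by(year = year(date), month = month(date)) %>%
    summarise(V1_summary = mean(V1))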
