Adding Hourly Gaps into a Frequency Table

I am trying to create two frequency tables, one daily and one hourly. I can get the daily values fairly easily:
C <- Data
C$Data <- format(C$Data, "%m/%d/%Y")
Freq_Day <- C %>% group_by(Data) %>% summarise(frequency = n())
However, when I try to get the hourly frequency by doing the following:
B <- Data
B$Data <- format(B$Data, "%m/%d/%Y %H:%M")
Freq_HRLY <- B %>% group_by(Data) %>% summarise(frequency = n())
It omits hours that simply did not occur in the data set, so it returns fewer rows than (# of days) * 24. How would I go about getting a column of dates in one-hour increments with their corresponding frequencies, so that any hour with no occurrence in Data simply gets a value of 0?

One way would be to use tidyr::complete on the already-calculated Freq_HRLY data to fill in the missing hours, by creating a sequence at one-hour intervals between the minimum and maximum Data.
library(dplyr)

Freq_HRLY %>%
  ungroup() %>%
  # parse the formatted strings back into POSIXct so an hourly sequence can be built
  mutate(Data = as.POSIXct(Data, format = "%m/%d/%Y %H:%M")) %>%
  # add the hours that never occur, giving them a frequency of 0
  tidyr::complete(Data = seq(min(Data), max(Data), by = "1 hour"),
                  fill = list(frequency = 0))
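If you prefer to build the hourly table in one pass from the raw data, here is a minimal sketch assuming, as in the question, a data frame Data whose Data column is already a POSIXct date-time; floor_date, count, and the column name hour are just illustrative choices:
library(dplyr)
library(tidyr)
library(lubridate)

Freq_HRLY_full <- Data %>%
  # truncate each timestamp to the start of its hour
  mutate(hour = floor_date(Data, unit = "hour")) %>%
  # count occurrences per hour
  count(hour, name = "frequency") %>%
  # insert the hours that never occur, with frequency 0
  complete(hour = seq(min(hour), max(hour), by = "1 hour"),
           fill = list(frequency = 0))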

Related

R impute/'approximate' delayed covid time series points using non-delayed total

I'm using a grouped time-series data set where there are often NAs for more recent dates (the length of the NA runs varies fairly randomly). A total of all the series is provided in the data; for more recent dates this total is actually greater than the sum of the individual series, I guess because of imputation/forecasting.
So, my question is, how can the missing values be estimated, assuming that the series total is correct?
My general approach is to calculate what proportion of the total each series is, and somehow extrapolate that to the future missing dates. As you can see from the graphs, I'm not very successful. There are complications caused by differing last dates of reported data. I'm not sure if working with cumulative values makes a difference.
R code for the simulated data and the failed solution below:
## simulate simple grouped time series
library(tidyverse)
set.seed(555)

## time series length, e.g. 10
len = 10
## group names
grps = letters[1:5]

df = bind_rows(lapply(grps, function(z) {
  tibble(rn = seq_len(len)) %>%
    mutate(real_val = runif(len, min = 0, max = 1)) %>%
    mutate(grp = z) %>%
    select(grp, rn, real_val) %>%
    ## replace final data points with NA, length varying by group
    ## this simulates delays in data reporting across groups
    mutate(reported_val = ifelse(rn > len - match(z, letters) + 1, NA, real_val)) %>%
    # mutate(reported_val = ifelse(rn > len - runif(1, 0, round(max_trim)), NA, real_val)) %>%
    ## make cumulative to assist viz a bit. may affect imputation method.
    group_by(grp) %>%
    arrange(rn) %>%
    mutate(real_val = cumsum(real_val), reported_val = cumsum(reported_val)) %>%
    ungroup()
}))
df
## attempt to impute/estimate missing real values, given total for each rn (date)
## general solution is to use (adjusted?) proportions of the total.
df2 = df %>%
  group_by(rn) %>%
  mutate(sum_real_val = sum(real_val), sum_reported_val = sum(reported_val, na.rm = TRUE)) %>%
  ungroup() %>%
  ## total value missing for each date
  mutate(val_missing = sum_real_val - sum_reported_val) %>%
  ## what proportion of the continent that country takes up
  mutate(prop = reported_val / sum_real_val) %>%
  # mutate(prop = reported_val / sum_reported_val) %>%
  ## fill missing proportions to end terminus from most recent value
  group_by(grp) %>%
  arrange(rn) %>%
  fill(prop, .direction = 'down') %>%
  ungroup() %>%
  ## get estimated proportion of those missing
  mutate(temp1 = ifelse(is.na(reported_val), prop, NA)) %>%
  ## re-calculate proportion as only of those missing.
  group_by(grp) %>%
  mutate(prop_temp1 = temp1 / sum(temp1, na.rm = TRUE)) %>%
  ungroup() %>%
  ## if value missing, then multiply total missing by expected proportion of missing
  mutate(result_val = ifelse(is.na(reported_val), val_missing * prop_temp1, reported_val)) %>%
  ungroup()
## time series plot
## stacked by group, black line shows real_val total.
ggplot(df2 %>%
         select(grp, rn, real_val, reported_val, sum_real_val, result_val) %>%
         gather(val_grp, val, -c(grp, rn, sum_real_val)) %>%
         ungroup(),
       aes(rn, val)) +
  # geom_line(aes(colour = loc)) +
  geom_area(aes(fill = grp)) +
  geom_line(aes(y = sum_real_val)) +
  facet_wrap("val_grp")
## but alas the result total doesn't agree with the reported total
## nb the imputed values for each group don't necessarily have to agree with the real values.
It's hard to know what conclusions to draw from the incomplete dates. One simple assumption could be to take each group's share at the last date where every group has reported and extrapolate that into the future:
default_share <- df %>%
  count(rn, grp, wt = !is.na(reported_val)) %>%
  count(rn, n) %>%
  filter(nn == max(nn)) %>%
  slice_max(rn) %>%
  left_join(df) %>%
  mutate(share = real_val / sum(real_val)) %>%
  select(rn, grp, share)
df %>%
  group_by(rn) %>%
  mutate(result_val = if_else(rn > default_share$rn[[1]],
                              sum(real_val) * default_share$share[rn == rn],
                              real_val)) %>%
  ungroup() %>%
  select(-reported_val) %>%
  pivot_longer(-c(grp:rn)) %>%
  ggplot(aes(rn, value)) +
  geom_area(aes(fill = grp)) +
  facet_wrap(~name)
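Whichever imputation you settle on, it can be worth checking the stated constraint that the imputed series add back up to the known total on each date. A minimal sketch of such a check, reusing df and default_share from above (the column names total_result, total_real, and diff are just illustrative):
df %>%
  group_by(rn) %>%
  mutate(result_val = if_else(rn > default_share$rn[[1]],
                              sum(real_val) * default_share$share[rn == rn],
                              real_val)) %>%
  # one row per date: imputed total vs known total
  summarise(total_result = sum(result_val),
            total_real = sum(real_val)) %>%
  mutate(diff = total_result - total_real)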

Loop for aggregated data frame in R

I have a data frame with 58 columns labeled SD1 through to SD58 along with columns for date info (Date, Year, Month, Day).
I'm trying to find the date of the maximum value of each of the SD columns each year using the following code:
maxs <- aggregate(SD1 ~ Year, data = SDtime, max)
SDMax <- merge(maxs, SDtime)
I only need the dates, so I made a new data frame and relabeled the column as below:
SD1Max = subset(SDMax, select = c(Year, Date))
SD1Max %>%
  rename(SD1 = Date)
I want to do the same thing for every SD column but I don't want to have to repeat these steps 58 times. Is there a way to loop the process?
Assuming there are no ties (multiple days where a variable reaches its yearly maximum), this probably does what you want:
library(tidyverse)

SDtime %>%
  pivot_longer(cols = matches('^SD[0-9]{1,2}$')) %>%
  # for each year and SD column, keep the row(s) where that column hits its maximum
  group_by(Year, name) %>%
  filter(value == max(value, na.rm = TRUE)) %>%
  ungroup()
You might want to pivot_wider afterwards.
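A minimal sketch of that wide reshape, assuming the Year and Date columns from the question and no ties, so each Year/SD pair keeps exactly one date:
SDtime %>%
  pivot_longer(cols = matches('^SD[0-9]{1,2}$')) %>%
  group_by(Year, name) %>%
  filter(value == max(value, na.rm = TRUE)) %>%
  ungroup() %>%
  select(Year, name, Date) %>%
  # one column per SD variable, holding the date of its yearly maximum
  pivot_wider(names_from = name, values_from = Date)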

How to compute the mean of a date range in R?

This is my tibble:
date;temp
1953-1-1;-0.2
1953-1-2;-0.2
1953-1-3;-0.2
1953-1-4;-0.1
...
1954-1-1;2
1954-1-2;3
1954-1-3;4
1954-1-4;5
...
1955-1-1;6
1955-1-2;7
1955-1-3;8
1955-1-4;9
I would now like to calculate the mean temperature per year, i.e. average all values of the temp column for each year. However, I have no idea how to work with the year numbers in R. Can someone tell me how to solve this?
library(dplyr)

tb <- tb %>%
  # take the year as the first four characters of the date
  mutate(year = substr(date, start = 1, stop = 4)) %>%
  group_by(year) %>%
  summarise(mean_temp = mean(temp, na.rm = TRUE))
Alternatively, lubridate is a nice package for working with dates.
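For example, a minimal lubridate-based version of the same aggregation (assuming date is a Date, or an ISO-formatted string that as.Date can parse):
library(dplyr)
library(lubridate)

tb <- tb %>%
  # extract the calendar year from each date
  mutate(year = year(as.Date(date))) %>%
  group_by(year) %>%
  summarise(mean_temp = mean(temp, na.rm = TRUE))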

How to group by month with summing or counting in R?

I am using the code below to group by month and sum (or count). However, the SLARespond column appears to be summed over the whole data set, not within each month.
Is there any way to fix the problem?
Also, instead of the sum function, can I count the rows where SLAIncident$IsSlaRespondByViolated == 1?
I appreciate the help!
SLAIncident <- SLAIncident %>%
  mutate(month = format(SLAIncident$CreatedDateLocal, "%m"),
         year = format(SLAIncident$CreatedDateLocal, "%Y")) %>%
  group_by(year, month) %>%
  summarise(SLARespond = sum(SLAIncident$IsSlaRespondByViolated))
If you could provide a small bit of the dataset to illustrate your example, that would be great. The main issue is that referring to SLAIncident$IsSlaRespondByViolated inside summarise pulls the whole column and ignores the grouping; referring to the column by name alone fixes that. I would also make sure your months/years are characters or factors so that dplyr can group on them. An ifelse wrapped in a sum covers the second part of your question. I am using your code here to convert the dates into month and year, but I recommend lubridate.
SLAIncident <- SLAIncident %>%
  mutate(month = as.character(format(CreatedDateLocal, "%m")),
         year = as.character(format(CreatedDateLocal, "%Y"))) %>%
  group_by(year, month) %>%
  summarise(SLARespond = sum(IsSlaRespondByViolated),
            sla_1 = sum(ifelse(IsSlaRespondByViolated == 1, 1, 0)))
Also, as hinted at in the comments, these column names are really long and could use some tidying.
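For reference, a hedged lubridate-based sketch of the same summary (assuming CreatedDateLocal is a Date or date-time and the flag column is spelled IsSlaRespondByViolated):
library(dplyr)
library(lubridate)

SLAIncident %>%
  mutate(year = year(CreatedDateLocal),
         month = month(CreatedDateLocal)) %>%
  group_by(year, month) %>%
  summarise(SLARespond = sum(IsSlaRespondByViolated, na.rm = TRUE),
            # count of incidents where the flag equals 1
            sla_1 = sum(IsSlaRespondByViolated == 1, na.rm = TRUE))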

Keeping the class date for aggregated time series

I'd like to aggregate time-series data into weekly data, but in doing so the class of the temporal variable becomes "character" instead of "Date", thereby losing all the nice features of being a date.
This is quite annoying, especially when I need to plot the data and play with breaks and labels.
Here is a short example of what I'm facing:
# Storing some random daily data
require(plyr)
require(dplyr)

df = data.frame(date = seq.Date(from = as.Date('2013-01-01'),
                                to = as.Date('2014-12-31'),
                                by = 'day'),
                data = rnorm(365 * 2))
Aggregating the data into some weekly data
wdf = df %>%
  mutate(week = strftime(df$date, format = '%Y-%U')) %>%
  group_by(week) %>%
  summarise(wdata = max(data))
Unfortunately, the variable week is now not of class "Date". Any idea how to keep the Date class for values in the %Y-%V format?
Thanks in advance!
EB
Use the awesome lubridate package. It has a floor_date function that rounds a date down to any of several time units (including weeks, as you want).
library(lubridate)

wdf = df %>%
  mutate(week = floor_date(date, unit = 'week')) %>%
  group_by(week) %>%
  summarise(wdata = max(data))
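Because week is now an actual Date, ggplot2's date scale works directly; a minimal sketch (the break and label settings are just examples):
library(ggplot2)

ggplot(wdf, aes(week, wdata)) +
  geom_line() +
  # with a Date-class axis, breaks and labels can be controlled directly
  scale_x_date(date_breaks = '1 month', date_labels = '%b %Y')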
