Moving averages - r

I have daily data for over 100 years that looks like
01.01.1856 12
02.01.1956 9
03.01.1956 -12
04.01.1956 7
etc.
I wish to calculate the 30-year running average for this huge dataset. I tried converting the data into a time series but still can't figure out how to go about it. I would prefer a simple method that works with a data.frame.

I guess the preparation is the difficult part, considering leap years.
So I will show one way of preparing the data before using the already mentioned runmean function from the caTools package.
First we create example data (not necessary for you, but helpful for understanding).
Second, I split the data frame into a list of data frames, one per year, and take the mean value for each year. These two steps could be done at once (a one-call alternative is sketched after the code below), but I think the separated way is easier to understand and to adapt.
#example data
Days <- seq(as.Date("1958-01-01"), as.Date("2015-12-31"), by = "days")
Values <- runif(length(Days))
DF <- data.frame(Days = Days, Values = Values)
#start of script
Years <- format(DF$Days, "%Y")
UniqueYears <- unique(format(DF$Days, "%Y"))
#Create subset of years
#look for every unique year which elements of Days are in this year.
YearlySubset <- lapply(UniqueYears, function(x){
  DF[which(Years == x), ]
})
YearlyMeanValues <- sapply(YearlySubset, function(x){
  mean(x$Values)
})
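As mentioned above, the two preparation steps can also be collapsed into a single call. A minimal sketch using base aggregate on the same DF (YearlyMeans is just an illustrative name):
#one-step alternative (sketch): yearly means in a single aggregate() call
YearlyMeans <- aggregate(Values ~ Year,
                         data = transform(DF, Year = format(Days, "%Y")),
                         FUN = mean)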
Now the running mean is applied:
#install.packages("caTools")
require(caTools)
RM <- data.frame(Years = UniqueYears, RunningMean30y = runmean(YearlyMeanValues, 30))
In case I misunderstood you at first and you actually want a running mean for every day over roughly 30 years, you could of course simply do:
RM <- cbind(DF, runmean(DF$Values, 365 * 30))
And considering your problems creating a time series: with dates like 01.01.1956 the format string is day.month.year, i.e.
DF[ , 1] <- as.Date(DF[ , 1], format = "%d.%m.%Y")
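If the original data sits in a plain text file with two whitespace-separated columns and no header (an assumption; the question only shows the values, and the file name here is hypothetical), a minimal sketch for reading and converting it could be:
#read the raw data and convert the day.month.year dates
DF <- read.table("data.txt", col.names = c("Days", "Values"),
                 colClasses = c("character", "numeric"))
DF$Days <- as.Date(DF$Days, format = "%d.%m.%Y")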

I would also suggest exploring RcppRoll in combination with dplyr, which provides a fairly convenient way to calculate rolling averages, sums, etc.
Code
# Libs
library(RcppRoll)  # 'roll'-ing functions for R vectors and matrices
library(dplyr)     # data grammar (convenience)
library(zoo)       # time series (convenience)
library(magrittr)  # compound assignment pipe-operator (convenience)

# Data
data("UKgas")
## Convert to data frame to make example better
UKgas <- data.frame(Y = as.matrix(UKgas), date = time(UKgas))

# Calculations
UKgas %<>%
  # To make example more illustrative I converted the data to a quarterly format
  mutate(date = as.yearqtr(date)) %>%
  arrange(date) %>%
  # The window size can be changed to reflect any period
  mutate(roll_mean = roll_mean(Y, n = 4, align = "right", fill = NA))
Notes
As the data provided in the example was fairly modest I used quarterly UK gas consumption data available via the data function in the utils package.
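Applied to the original question, a hedged sketch with dplyr and RcppRoll, assuming the daily DF data frame from the first answer above (yearly means followed by a 30-year right-aligned window):
# sketch: yearly means, then a 30-year rolling mean on the yearly series
DF %>%
  mutate(Year = as.integer(format(Days, "%Y"))) %>%
  group_by(Year) %>%
  summarise(YearMean = mean(Values)) %>%
  arrange(Year) %>%
  mutate(RunningMean30y = roll_mean(YearMean, n = 30, align = "right", fill = NA))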

Related

Applying a function using elements within a list take 2

I attempted this question yesterday (Applying a function using elements within a list) but my reprex produced the wrong data structure and unfortunately the suggestions didn't work for my actual dataset.
I have what is hopefully a simple functional programming question. I have a list of locations with average temperature and amplitude for each day (180 days in my actual dataset). I want to iterate through these locations and create a sine curve of 24 points using a custom-made function, taking the average temperature and amplitude from each day within a list. Below is my new reprex.
library(tibble)
library(REdaS)  ## degrees to radians
library(tidyverse)

sinefunc <- function(Amplitude, Average){
  hour <- seq(0, 23, 1)
  temperature <- vector("double", length = 24)
  for(i in seq_along(hour)){
    temperature[i] <- (Amplitude*sin(deg2rad(180*(hour[i]/24))) + Average) + Amplitude*sin(deg2rad(180*hour[i]/12))
  }
  temperature
}

data <- tibble(Location = c(rep("London",6), rep("Glasgow",6), rep("Dublin",6)),
               Day = rep(seq(1,6,1), 3),
               Average = runif(18,0,20),
               Amplitude = runif(18,0,15)) %>%
  nest_by(Location)
Using purrr and map_dfr I get the error: Error in .x$Average : $ operator is invalid for atomic vectors
df <- data %>%
  map_dfr(~sinefunc(.x$Average, .x$Amplitude))
Using lapply I get the error Error in x[, "Amplitude"] : incorrect number of dimensions
data <- lapply(data, function(x){
  sinefunc(Amplitude = x[,"Amplitude"], Average = x[,"Average"])
})
My goal is to have 24 hourly data points for each day and location.
Any further help would be much appreciated.
Stuart
Maybe this is what you are looking for? You get a data frame back with 24 data points for each day and location, e.g. London_1, Dublin_1, etc.
library(dplyr)
library(purrr)

data <- tibble(Location = c(rep("London",6), rep("Glasgow",6), rep("Dublin",6)),
               Day = rep(seq(1,6,1), 3),
               Average = runif(18,0,20),
               Amplitude = runif(18,0,15))

# get group names
group_name <- data %>%
  group_by(Location, Day) %>%
  group_keys() %>%
  mutate(group_name = stringr::str_c(Location, "_", Day)) %>%
  pull(group_name)

data %>%
  # split into a list of data frames
  group_split(Location, Day) %>%
  # name the list elements
  setNames(group_name) %>%
  # apply your function and get a dataframe back
  map_dfr(~sinefunc(.x$Average, .x$Amplitude))
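If you would rather have the hour and the Location/Day labels as columns instead of 24 unnamed values per group, a hedged variant of the last step could look like the sketch below; it uses named arguments so that Amplitude and Average cannot be accidentally swapped, and the "group" column name is just illustrative:
data %>%
  group_split(Location, Day) %>%
  setNames(group_name) %>%
  map_dfr(~ tibble(Location = .x$Location,
                   Day = .x$Day,
                   hour = 0:23,
                   temperature = sinefunc(Amplitude = .x$Amplitude,
                                          Average = .x$Average)),
          .id = "group")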

How can I re-write code that applies a function on subset of rows based on another vector in different R ecosystems?

In my problem I have to apply a function to a subset of each individual time series, based on a set of dates extracted from the original data.
So, I have a data.frame with a time series for each individual between 2005-01-01 and 2010-12-31 (test_final_ind_series) and a sample of individual-date pairs (sample_events) ideally extracted from the same data.
With these, in my example I attempt to calculate an average over a subset of the time-series values exp, conditional on individual and date in sample_events.
I did this in 2 different ways:
1: a simple but effective code that gets the job done very quickly
I simply ask the user to input the data for a specific individual and define a lag of time and a window width (like a rolling average). The function exp_summary then outputs the requested average.
To repeat the operation for each row in sample_events I decided to nest the individual series by ID of the individuals and then attach the sample of dates. Eventually, I just run a loop that applies the function to each individual nested dataframe.
#Sample data
set.seed(111)
exp_series <- data.frame(
  id = as.character(rep(1:10000, each = 2191)),
  date = rep(seq(as.Date('2005-01-01'),
                 as.Date('2010-12-31'), by = 'day'), times = 10000),
  exp = rep(rnorm(n = 10000, mean = 10, sd = 5), times = 2191)
)

sample_dates <- data.frame(
  Event_id = as.character(replicate(10000, sample(1:10000, size = 1, replace = TRUE))),
  Event_date = sample(
    seq(as.Date('2005-01-01'),
        as.Date('2010-12-31'), by = 'day'),
    size = 10000, replace = TRUE)
)
#packages used below (data.table for the function, dplyr/tidyr for the nesting)
library(data.table)
library(dplyr)
library(tidyr)

#This function, given a dataframe with dates and an exposure series (df),
#an event_date,
#a lag value,
#and a width of the window,
#outputs the average for a user-defined time window
exp_summary <- function(df, event_date, lag = 0, width = 0){
  df <- as.data.table(df)
  end <- as.character(as.Date(event_date) - lag)
  start <- as.character(max(as.Date(end) - width, min(df$date))) # needed in case the time window goes beyond the earliest available date
  return(mean(df[date %between% c(start, end)]$exp))
}
#Nest dataframes
exp_series_nest <- exp_series %>%
  group_by(id) %>%
  nest()

#Merge with sample events, including only the necessary dates
full_data <- merge(exp_series_nest, sample_dates, by.x = "id", by.y = "Event_id", all.x = FALSE, all.y = TRUE)

#Initialize dataframe in advance
summaries1 <- setNames(data.frame(matrix(ncol = 2, nrow = nrow(full_data))), c("id", "mean"))
summaries1$id <- full_data$id

#Loop over each id, whose series is a nested data.frame
system.time(for (i in 1:nrow(full_data)){
  summaries1$mean[i] <- exp_summary(full_data$data[[i]], full_data$Event_date[i], lag = 1, width = 365)
})
2: using the highly flexible package runner
With the same data I need to specify the arguments properly. I have also opened an issue on the GitHub repository about speeding up this code with parallelization.
library(runner)

system.time(summaries2 <- sample_dates %>%
  group_by(Event_id) %>%
  mutate(
    mean = runner(
      x = exp_series[exp_series$id == Event_id[1], ],
      k = "365 days",
      lag = "1 days",
      idx = exp_series$date[exp_series$id == Event_id[1]],
      at = Event_date,
      f = function(x) {mean(x$exp)},
      na_pad = FALSE
    )
  )
)
They give the very same results up to the second decimal, but method 1 is much faster than method 2, and you can see the difference with very large datasets.
My question is, for method 1, how can I write the last loop in a more concise way within the data.table and/or tidyverse ecosystems? I really struggle to make nested lists and "normal" columns embedded in the same dataframe work together.
Also, if you have any other recommendations I am open to hearing them! I am here more out of curiosity than need, as my problem is already solved acceptably by method 1.
With data.table, you could join exp_series with the desired date range in sample_dates and calculate the mean by = .EACHI:
library(data.table)
setDT(exp_series)
setDT(sample_dates)

lag <- 1
width <- 365

# Define range
sample_dates[, ':='(begin = Event_date - width - lag, end = Event_date - lag)]

# Calculate mean by .EACHI
summariesDT <- exp_series[sample_dates, .(id, mean = mean(exp)),
                          on = .(id = Event_id, date >= begin, date <= end),
                          by = .EACHI][, .(id, mean)]
Note that this returns the same results as summaries1 only for Event_id without duplicates in sample_dates.
The results are different in case of duplicates, for instance Event_id==1002:
sample_dates[Event_id == 1002]
   Event_id Event_date      begin        end
     <char>     <Date>     <Date>     <Date>
1:     1002 2010-08-17 2009-08-16 2010-08-16
2:     1002 2010-06-23 2009-06-22 2010-06-22
If you don't have duplicates in your real data, this shouldn't be a problem.
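For the tidyverse side of the question, a hedged sketch of method 1's loop rewritten with purrr, assuming the full_data object and the exp_summary function from above (summaries_tv is just an illustrative name):
library(purrr)

#replace the for loop with map2_dbl over the nested series and the event dates
summaries_tv <- full_data %>%
  mutate(mean = map2_dbl(data, Event_date,
                         ~ exp_summary(.x, .y, lag = 1, width = 365))) %>%
  select(id, mean)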

Plotting only 1 hourly datapoint (1 per day) alongside hourly points (24 per day) in R Studio

I am a bit stuck with some code. Of course I would appreciate a piece of code which solves my dilemma, but I am also grateful for hints on how to sort it out.
Here goes:
First of all, I installed the packages (ggplot2, lubridate, and openxlsx)
The relevant part:
I extract a file from an Italian gas TSO's website:
Storico_G1 <- read.xlsx(xlsxFile = "http://www.snamretegas.it/repository/file/Info-storiche-qta-gas-trasportato/dati_operativi/2017/DatiOperativi_2017-IT.xlsx",sheet = "Storico_G+1", startRow = 1, colNames = TRUE)
Then I created a data frame with the variables I want to keep:
Storico_G1_df <- data.frame(Storico_G1$pubblicazione, Storico_G1$IMMESSO, Storico_G1$`SBILANCIAMENTO.ATTESO.DEL.SISTEMA.(SAS)`)
Then change the time format:
Storico_G1_df$pubblicazione <- ymd_h(Storico_G1_df$Storico_G1.pubblicazione)
Now the struggle begins. In this example I would like to chart the two time series with two different y-axes, because the ranges are very different. This is not really a problem as such, because with the melt function and ggplot I can achieve that. However, since there are NAs in one column, I don't know how to work around that. In the incomplete (SAS) column I mainly care about the data point at 16:00, so I would ideally have hourly points on one chart and only one data point per day (at said 16:00) on the second chart. I attached an unrelated example picture of the chart style I mean; however, in the attached chart I have equally many data points on both charts and hence it works fine.
Grateful for any hints.
Take care
library(lubridate)
library(ggplot2)
library(openxlsx)
library(dplyr)

#Use na.strings: it looks like NAs can have several representations in the dataset
storico.xl <- read.xlsx(xlsxFile = "http://www.snamretegas.it/repository/file/Info-storiche-qta-gas-trasportato/dati_operativi/2017/DatiOperativi_2017-IT.xlsx",
                        sheet = "Storico_G+1", startRow = 1,
                        colNames = TRUE,
                        na.strings = c("NA","N.D.","N.D"))

#Select and rename the crazy column names
storico.g1 <- data.frame(storico.xl) %>%
  select(pubblicazione, IMMESSO, SBILANCIAMENTO.ATTESO.DEL.SISTEMA..SAS.)
names(storico.g1) <- c("date_hour","immesso","sads")

# the date column is in the format ymd_h
storico.g1 <- storico.g1 %>% mutate(date_hour = ymd_h(date_hour))

#Not sure exactly what you want to plot, but here is each point by hour
ggplot(storico.g1, aes(x = date_hour, y = immesso)) + geom_line()

#For each day you can group by the calendar date of date_hour
#You can check there are 24 points per day
#feed the new columns into ggplot
storico.g1 %>%
  group_by(date = as.Date(date_hour)) %>%
  summarise(count = n(),
            daily.immesso = sum(immesso)) %>%
  ggplot(aes(x = date, y = daily.immesso)) + geom_line()
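For the second chart the question describes (only the 16:00 value of the sads column per day), a hedged sketch building on storico.g1 could be:
#sketch: keep only the 16:00 observation of each day and plot the sads series
storico.g1 %>%
  filter(hour(date_hour) == 16) %>%
  ggplot(aes(x = date_hour, y = sads)) + geom_point() + geom_line()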

R aggregating irregular time series data by groups (with meta data)

Hi, I have a data frame (~4 million rows) with time series data for different sites and events.
Here is a rough idea of my data, obviously on a different scale. I have several similar time series, so I've kept it general, as I want to be able to apply the approach in different cases:
Data1 <- data.frame(DateTimes = as.POSIXct("1988-04-30 13:20:00") + c(1:10, 12:15, 20:30, 5:13, 16:20, 22:35)*300,
                    Site = c(rep("SiteA", 25), rep("SiteB", 28)),
                    Quality = rep(25, 53),
                    Value = round(runif(53, 0, 5), 2),
                    Othermetadata = c(rep("E1", 10), rep("E2", 15), rep("E1", 10), rep("E2", 18)))
What I'm looking for is a simple way to group and aggregate this data to different timesteps while keeping metadata which doesn't vary within the group
I have tried using the zoo library and zoo::aggregate, i.e.:
library(zoo)
zooData <- read.zoo(select(Data1, DateTimes, Value))
zooagg <- aggregate(zooData, time(zooData) - as.numeric(time(zooData))%%3600, FUN = sum, reg = T)
However when I do this I lose all my metadata and merge the different sites' data together.
I wondered about using plyr or dplyr to split up the data and then apply the aggregation, but I'm still going to lose my other columns.
Is there a better way to do this? I had a brief look at the docs for the xts library but couldn't see an intuitive solution in there either.
*Note: as I want this to work for a few different things, both the starting time step and the final time step might change, with the possibility of a random time step, or a somewhat regular time step with missing points. The FUN applied may vary (mostly sum or mean), as well as the fields I want to split by.*
Edit: I found the solution after Hercules Apergis pushed me in the right direction.
newData <- Data1 %>% group_by(timeagg, Site) %>% summarise(Total = sum(Value))
finaldata <- inner_join(Data1,newData) %>% select(-DateTimes, - Value) %>% distinct()
The original DateTimes column wasn't a grouping variable; it was the time series itself, so I added a grouping variable of my aggregated time (here: time to the nearest hour) and summarised on this. The problem was that if I joined on this new column, I missed any points that fell within that hour but not exactly on the hour, hence the inner_join %>% select %>% distinct approach.
Now hopefully it works with my real data and not just the example data!
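For completeness, one way to create the timeagg column the edit refers to (a sketch; it assumes lubridate::floor_date to the hour, use round_date if "nearest hour" is meant literally):
library(dplyr)
library(lubridate)

Data1 <- Data1 %>% mutate(timeagg = floor_date(DateTimes, "hour"))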
Given the function that you have on aggregation:
aggregate(zooData, time(zooData) - as.numeric(time(zooData))%%3600, FUN = sum, reg = T)
You want to sum the values by group of times AND NOT lose other columns. You can simply do this with the dplyr package:
library(dplyr)
newdata <- Data1 %>% group_by(DateTimes) %>% summarise(sum(Value))
finaldata <- inner_join(Data1, newdata, by = "DateTimes")
The newdata object is a data.frame where, for each group of DateTimes, the Values are summed. Then inner_join merges the parts that are common between the two datasets by the DateTimes variable. Since I am not entirely sure what your desired output is, this should be a good starting point.
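If the aggregation should also move to a coarser time step while keeping the metadata columns that are constant within a group, a hedged dplyr sketch (assuming an hourly bucket via lubridate::floor_date, as in the question's edit) might look like:
library(dplyr)
library(lubridate)

newdata <- Data1 %>%
  group_by(Site, timeagg = floor_date(DateTimes, "hour")) %>%
  summarise(Total = sum(Value),
            Quality = first(Quality),
            Othermetadata = first(Othermetadata),
            .groups = "drop")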

Group R table entries by month

I have a CSV file with two columns, a Unix timestamp and a version string. What I finally want to achieve is to group the data by month and plot it, so that the single months are entries on the x-axis and a line is plotted for each unique version string, where the y-axis values represent the number of hits in that month.
Here is a small example CSV:
timestamp,version
1434974143,1.0.0
1435734004,1.1.0
1435734304,1.0.0
1435735386,1.2.0
I'm new to R, so I encountered several problems. First I successfully read the csv with
mydata <- read.csv("data.csv")
and figured out an ugly function that converts a single timestamp into an R date:
as_time <- function(val){
  return(head(as.POSIXct(as.numeric(as.character(val)), origin = "1970-01-01", tz = "GMT")))
}
But none of the several apply functions seemed to work on the table column.
So how do I create a data structure, that groups the version hits by month, and can be plotted later?
It's easier than you think!
You are essentially looking for the hist function.
#Let's make some mock data
# Set the random seed for reproducibility
set.seed(12345)
my.data <- data.frame(timestamp = runif(1000, 1420000000, 1460000000),
                      version = sample(1:5, 1000, replace = T))
my.data$timestamp <- as.POSIXct(my.data$timestamp, origin = "1970-01-01")
# Histogram of the data, irrespective of version
hist(my.data$timestamp, "month")
# If you want to see the version then split the data first...
my.data.split <- split(my.data, my.data$version)
# Then apply hist
# lapply (not sapply) so counts stays a list for the do.call("rbind", ...) below
counts <- lapply(my.data.split, function (x)
{
  h <- hist(x$timestamp, br = "month", plot = FALSE)
  h$counts
})
# Transform into a matrix and plot
counts <- do.call("rbind", counts)
barplot(counts, beside = T)
You can use the as.yearmon function from the zoo package to get year/month formats:
library(zoo)
dat$yearmon <- as.yearmon(as.POSIXct(dat$timestamp, origin = "1970-01-01", tz = "GMT"))
Then it depends on what you want to do with your data. For example, the number of version hits per month (thanks to @Frank for fixing):
library(dplyr)
dat %>% group_by(yearmon, version) %>%
  summarise(hits = n())
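To get from there to the plot described in the question (one line per version, months on the x-axis), a minimal sketch could be (hits is just an illustrative name; as.Date works on yearmon via zoo):
library(ggplot2)

hits <- dat %>%
  group_by(yearmon, version) %>%
  summarise(hits = n(), .groups = "drop")

ggplot(hits, aes(x = as.Date(yearmon), y = hits, colour = factor(version))) +
  geom_line() +
  labs(x = "Month", y = "Hits", colour = "Version")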
