Extract month mean from time series in R

I have some data in the following format:
date x
2001/06 9949
2001/07 8554
2001/08 6954
2001/09 7568
2001/10 11238
2001/11 11969
... more rows
I want to compute the mean of x for each month. I tried some code with aggregate, but it
failed. Thanks for any help on doing this.

Here I simulate a data frame called df with more data:
df <- data.frame(
  date = apply(expand.grid(2001:2012, 1:12), 1, paste, collapse = "/"),
  x = rnorm(12^2, 1000, 1000),
  stringsAsFactors = FALSE)
Given the way your date vector is constructed, you can obtain the month by removing the first four digits and the following forward slash. Here I use this as the indexing variable in tapply to compute the means:
with(df, tapply(x, gsub("\\d{4}/","",date), mean))
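Since you mention having tried aggregate, the same grouping works there as well; a sketch on the simulated df:
aggregate(x ~ month, data = transform(df, month = gsub("\\d{4}/", "", date)), FUN = mean)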

Sorry... I just created a month-sequence vector and then used tapply.
It was very easy:
m.seq <- rep(c(6:12, 1:5), length.out = nrow(data))
m.means <- tapply(data$x, m.seq, mean)
But thanks for the comments anyway!
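Note that this position-based sequence assumes the series starts in June and contains no missing months. A sketch that derives the month directly from the date column instead, assuming the "YYYY/MM" format shown above:
m.means <- tapply(data$x, substr(data$date, 6, 7), mean)  # "2001/06" -> "06"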

Related

Identify Week in Column of Dates and Generate Automatic Dataframe Subsets per Week

I want to automate code that calculates transport times. The code should let me choose 4 months out of a year-long readout, split the last of those months into its four weeks, and then describe the data subsets (the describing is not the problem).
Generating subsets for the chosen months is not the problem either, because I can define the months.
Where I struggle is with the 3-4 weeks of the last month: I need to identify them automatically and then generate the subsets. (I hope generating the subsets will be easier once they are identified.)
I can give you a little mock-up of my data.
dates <- as.Date(c("2019-01-07", "2019-01-08", "2019-01-09",
                   "2019-01-15", "2019-01-21"))
number <- c(12, 13, 14, 15, 20)
df <- data.frame(number, dates)
The original df consists of 60 variables, but I believe this simple mock-up provides enough info for the task.
I am pretty new to R and have no idea how to solve the problem. I will show you how I solved it for the months, but as said, in this case they are predefined:
function(data = df, m1 = "01", m2 = "02") {
  Monat1 <- subset(data, format.Date(dates, "%m") == m1)
  # ... (the second month m2 is handled the same way)
}
Thank you for helping me out a bit.
You can use the function strftime:
strftime(df$dates, format = "%W")
In RStudio, use
?strftime
to see all the different values you can extract from a date or POSIXct object.
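Applied to the mock dates above, this yields week-of-year strings; with %W, weeks start on Monday and days before the year's first Monday fall in week 00:
strftime(df$dates, format = "%W")
# "01" "01" "01" "02" "03" for the five mock dates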
You can do it using base R and lubridate
Data
dates <- as.Date(c("2019-01-07", "2019-01-08", "2019-01-09",
"2019-01-15", "2019-01-21"))
number <- c(12,13,14,15,20)
df <- data.frame(number, dates)
str(df)
Answer
library(lubridate)
df$condition <- ifelse(month(df$dates) == month(Sys.Date()) - 1, week(df$dates), "-")
condition checks whether the date falls in the previous calendar month; if it does, it stores that row's week number, otherwise a dash.
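Building on this, a sketch that actually generates the weekly subsets for the chosen month (base R split() plus lubridate's week(); note that month(Sys.Date()) - 1 returns 0 in January, so a robust version would need to wrap around):
library(lubridate)
last_month <- df[month(df$dates) == month(Sys.Date()) - 1, ]
weekly_subsets <- split(last_month, week(last_month$dates))  # one data frame per week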

Unnest a ts class

My data contains sales data for multiple customers with different start and end dates, so I did simple exponential smoothing.
I used the following code to apply ses:
library(zoo)
library(forecast)
z <- read.zoo(data_set, FUN = function(x) as.Date(x) + seq_along(x) / 10^10,
              index = "Date", split = "customer_id")
L <- lapply(as.list(z), function(x) ts(na.omit(x), frequency = 52))
HW <- lapply(L, ses)
Now my output is a list with elements of uneven lengths. Can someone help me unnest or unlist the output into a data frame and get the fitted values, actuals, and residuals along with their dates, sales, and customer_id?
Note: the reason I post my input data rather than the HW data is that the HW data is too large.
I would use the tidyverse to handle this problem.
library(tidyverse)
map(HW, ~ .x %>%
      as.data.frame %>%          # convert each element of the list to a data.frame
      rownames_to_column) %>%    # add row names as a column within each element
  bind_rows(.id = "customer_id") # bind all elements and add the customer ID
I am not sure how to relate dates and actual sales to your output (HW). If you explain it I might provide solution to that part of the problem too.
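If HW is the named list of ses() fits from the question, a sketch that pulls the actuals, fitted values, and residuals out of each fit (fitted(), residuals(), and the $x component are standard accessors for forecast objects; the column names here are my own):
library(purrr)
map_dfr(HW, function(fit)
  data.frame(actual   = as.numeric(fit$x),
             fitted   = as.numeric(fitted(fit)),
             residual = as.numeric(residuals(fit))),
  .id = "customer_id")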
First, I took all the unique customer_id values into a variable called k:
k <- unique(data_set$customer_id)
Then I created an empty data frame:
b <- data.frame()
I extracted all the fitted values using a for loop, stored them in a, and attached each set of fitted values to data frame b with rbind:
for (key in k) {
  a <- as.data.frame(as.numeric(HW[[key]]$model$fitted))
  b <- rbind(b, a)
}
Finally, I attached the input data set to data frame b with cbind:
data_set_final <- cbind(data_set,b)

Comparing two dataframes in ddply function

I have two dataframes, data and quantile. data has dimension 23011 x 2 and consists of the columns "year" and "data", where year covers the days of 1951:2013. The quantile df has dimension 63 x 2 and consists of the columns "year" and "quantile", with one row per year, i.e. 1951:2013.
I need to compare the quantile df against the data df and, for each year, sum the data values exceeding that year's quantile value. For that, I'm using ddply in this manner:
ddply(data, .(year), function(y) sum(y[which(y[, 2] > quantile[, 2]), 2]))
However, the code compares only against the first row of quantile and does not iterate over each year of quantile against the data df.
I want to iterate over each year in the quantile df and calculate the sum of data exceeding that year's quantile.
Any help shall be greatly appreciated.
The example problem: the quantile df is derived from the data; it is the 90th percentile of the data values exceeding 1:
quantile <- quantile(data[-which(data[, 2] < 1), 2], 0.9)
In addition to Heroka's answer above: if you have many columns and need to iterate over each of them, you can use matrix notation in this form:
lapply(x, function(y) {
  ddply(data, .(year), function(d) return(sum(d[d[, y] > quantile(d[d[, y] > 1, y], 0.9), y])))
})
where x is the vector of column indices, e.g. 1:10000, and data is the df which contains the data.
quantile(d[d[, y] > 1, y], 0.9) gives the 90th percentile of the column-y values exceeding 1.
d[d[, y] > quantile(d[d[, y] > 1, y], 0.9), y] returns the rows which satisfy the condition for the yth column, and sum() is used to calculate the sum.
Why not do this in one go? Creating the quantiles-dataframe first and then referring back to it makes things more complicated than they need to be. You can do this with ddply too.
set.seed(1)
data <- data.frame(
  year = sample(1951:2013, 23011, replace = TRUE),
  data = rnorm(23011)
)
res <- ddply(data, .(year), function(x) {
  return(sum(x$data[x$data > quantile(x$data, .9)]))
})
And, as plyr has been superseded by dplyr:
library(dplyr)
res2 <- data %>%
  group_by(year) %>%
  summarise(test = sum(data[data > quantile(data, .9)]))
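One subtlety: this computes each year's 90th percentile from all of that year's values, whereas the original quantile df was built only from values exceeding 1. If that filter matters, use quantile(data[data > 1], .9) inside summarise().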

R count days of exceedance per year

My aim is to count days of exceedance per year for each column of a dataframe. I want to do this with one fixed value for the whole dataframe, as well as with different values for each column. For one fixed value for the whole dataframe, I found a solution using count with aggregate and another solution using the package plyr with ddply and colwise. But I couldn't figure out how to do this with different values for each column.
Approach for one fixed value:
# create example data
date <- seq(as.Date("1961/1/1"), as.Date("1963/12/31"), "days")     # create dates
date <- date[format.Date(as.Date(date), "%m %d") != "02 29"]        # delete leap days
TempX <- rep(airquality$Temp, length.out = length(date))
TempY <- rep(rev(airquality$Temp), length.out = length(date))
df <- data.frame(date, TempX, TempY)
# This approach works fine for specific values using aggregate.
library(plyr)
dyear <- as.numeric(format(df$date, "%Y"))  # year vector
fa80 <- function(fT) {cft <- count(fT >= 80); return(cft[2, 2])}  # function for counting days of exceedance
aggregate(df[, -1], list(year = dyear), fa80)  # use aggregate to apply the function to the dataframe
# Another approach using ddply with colwise, which works fine for one specific value.
fd80 <- function(fT) {cft <- count(fT >= 80); cft[2, 2]}  # function to count days of exceedance
ddply(cbind(df[, -1], dyear), .(dyear), colwise(fd80))  # use ddply to apply the function columnwise to the dataframe
In order to use a specific value for each column separately, I tried passing a second argument to the function, but this didn't work:
# pass second argument to function
Oc <- c(80, 85)  # values
fo80 <- function(fT, fR) {cft <- count(fT >= fR); return(cft[2, 2])}  # function for counting days of exceedance
aggregate(df[, -1], list(year = dyear), fo80, fR = Oc)  # fails: aggregate passes the whole fR vector to fo80 for every column, recycling it elementwise
I tried using apply.yearly, but it didn't work with count. I want to avoid using a loop, as it is slow and I have a lot of dataframes with > 100 columns and long time series to process.
Furthermore the approach has to work for subsets of the dataframe as well.
# subset of dataframe
dfmay <- df[format.Date(as.Date(df$date), "%m") == "05", ]  # subset dataframe - only May
dyearmay <- as.numeric(format(dfmay$date, "%Y"))  # year vector
aggregate(dfmay[, -1], list(year = dyearmay), fa80)  # use aggregate to apply the function to the dataframe
I am out of ideas, how to solve this problem. Any help will be appreciated.
You could try something like this:
# set the target temperature for each column
targets <- c(80, 80)
dyear <- as.numeric(format(df$date, "%Y"))
# for each row of the data, check if the temp is above the target limit
# this will return a matrix of TRUE/FALSE
exceedance <- t(apply(df[, -1], 1, function(x) x >= targets))
# aggregate by year and sum
aggregate(exceedance, list(year = dyear), sum)
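An equivalent without the row-wise apply(): sweep() compares each column against its own threshold in one vectorised step (a sketch on the same df, dyear, and targets):
exceedance <- sweep(df[, -1], 2, targets, ">=")  # logical matrix; column j is compared to targets[j]
aggregate(exceedance, list(year = dyear), sum)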

How to get the sum of each four rows of a matrix in R

I have a 4n by m matrix (sums at 7.5 min intervals for a year). I would like to transform these to 30 min sums, e.g. convert a 70080 x 1 matrix to a 17520 x 1 matrix.
What is the most computationally efficient way to do this?
More specifics: here is an example (shortened to one day instead of one year)
library(lubridate)
start.date <- ymd_hms("2009-01-01 00:00:00")
n.seconds <- 192  # number of 450-second (7.5 min) intervals in one day
time <- start.date + (seq(n.seconds) - 1) * seconds(450)
test.data <- data.frame(time = time,
                        observation = sin(1:n.seconds / n.seconds * pi))
R version: 2.13; Platform: x86_64-pc-linux-gnu (64-bit)
colSums(matrix(test.data$observation, nrow=4))
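This works because filling a 4-row matrix column by column puts each consecutive group of four 7.5-minute sums into one column, and colSums() then collapses every column into a 30-minute sum; it assumes the series length is a multiple of 4.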
I'm going to make a set of crazy assumptions since your question is fairly ambiguous.
I'll assume your data is a matrix with observations every 7.5 min and there is NO spatial index. So 100 rows might look like this:
data <- matrix(rnorm(400), ncol=4)
and you want to sum chunks of 4 rows.
There's a bunch of ways to do this, but the first one to hop in my head is to create an index and then do the R version of a "group by" and sum.
An example index could be something like this:
index <- rep(1:25, each = 4)  # each group label repeated 4 times, in order
So now that we have an index of the same length as the data, you can use aggregate() to sum things up:
aggregate(x=data, by = list(index), FUN=sum)
EDIT:
The spirit of the above method may still work. However, if you do much work with time series data you should probably get to know the xts package. Here's an xts example:
require(xts)
test.xts <- xts(test.data$observation, order.by = test.data$time)
period.apply(test.xts, endpoints(test.xts, "minutes", 30), sum)
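Here endpoints() returns the row indices at which each 30-minute period ends, and period.apply() sums the observations between consecutive endpoints.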
sapply(split(test.data$observation, rep(1:(192/4), each=4)), sum)
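Here rep(1:(192/4), each = 4) builds a grouping vector that assigns four consecutive observations to each group; split() partitions the observation vector accordingly, and sapply() sums each chunk.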
