Given the following baby fragment:
d1=as.Date('April 26, 2001',format='%B %d, %Y')
d2=as.Date('April 27, 2001',format='%B %d, %Y')
d3=as.Date('April 28, 2001',format='%B %d, %Y')
tibble(DATE=c(d1,d1,d2,d2,d3,d3), Symbol=c("A","B","A","B","A","B"), voladj=c(0.2, 0.3, -0.2, -0.1, 0.3, 0.2))
resulting in
# A tibble: 6 x 3
DATE Symbol voladj
<date> <chr> <dbl>
1 2001-04-26 A 0.2
2 2001-04-26 B 0.3
3 2001-04-27 A -0.2
4 2001-04-27 B -0.1
5 2001-04-28 A 0.3
6 2001-04-28 B 0.2
I try to compute a series of correlation/covariance matrices cor at time D2, cor at time D3, ... etc. Ideally data is exponentially weighted. What options do I have in R? To make things a bit more spicy a Symbol C may at some point show up, too. I was thinking to compute the outer product (rank 1 matrix) at time t1, t2, t3, and then use a simple moving mean.
A potential output could be the following:
DATE cov
<date>
1 2001-04-26 M1
2 2001-04-27 M2
3 2001-04-28 M3
where M_i are matrices (or frames), such as
M_1 = A B
A 1.0 c1
B c1 1.0
etc.
Obviously more interesting once more symbols are involved
Updated answer, given comments
Here is an approach using quantmod to retrieve 5 stocks for three weeks from Yahoo Finance. We combine the Close variable from the xts objects into a data frame, generate week identifiers with lubridate::week(), split() it by week, and calculate covariance matrices for each week using lapply().
library(quantmod)
from.dat <- as.Date("12/03/19",format="%m/%d/%y")
to.dat <- as.Date("12/24/19",format="%m/%d/%y")
theSymbols <- c("AAPL","AXP","BA","CAT","CSCO")
getSymbols(theSymbols,from=from.dat,to=to.dat,src="yahoo")
#combine to single data frame
combinedData <- data.frame(date = as.Date(rownames(as.data.frame(AAPL))),
AAPL$AAPL.Close,
AXP$AXP.Close,
BA$BA.Close,
CAT$CAT.Close,
CSCO$CSCO.Close)
colnames(combinedData) <- c("date","AAPL","AXP","BA","CAT","CSCO")
# split by week
library(lubridate)
combinedData$week <- week(combinedData$date)
symbolsByWeek <- split(combinedData,as.factor(combinedData$week))
covariances <- lapply(symbolsByWeek,function(x){
cov(x[,-c(1,7)])
})
covariances[[1]]
...and the output:
> covariances[[1]]
AAPL AXP BA CAT CSCO
AAPL 19.4962156 7.0959976 3.9093027 5.4158116 -0.66194433
AXP 7.0959976 3.0026695 2.0175793 2.2569625 -0.18793832
BA 3.9093027 2.0175793 10.4511473 1.8555752 0.55619975
CAT 5.4158116 2.2569625 1.8555752 1.8335361 -0.11141911
CSCO -0.6619443 -0.1879383 0.5561997 -0.1114191 0.07287982
>
Original answer
Here is an approach using quantmod to retrieve Dow 30 data for four days from Yahoo Finance, apply() and do.call() with rbind() to massage it into a single data frame, and split() to split by day to produce daily covariance matrices.
library(quantmod)
from.dat <- as.Date("12/02/19",format="%m/%d/%y")
to.dat <- as.Date("12/06/19",format="%m/%d/%y")
theSymbols <- c("AAPL","AXP","BA","CAT","CSCO","CVX","XOM","GS","HD","IBM",
"INTC","JNJ","KO","JPM","MCD","MMM","MRK","MSFT","NKE","PFE","PG",
"TRV","UNH","UTX","VZ","V","WBA","WMT","DIS","DOW")
getSymbols(theSymbols,from=from.dat,to=to.dat,src="yahoo")
# since quantmod::getSymbols() writes named xts objects, need to use
# get() with the symbol names to access each data frame
# e.g. head(get(theSymbols[[1]]))
# convert to list
symbolData <- lapply(theSymbols,function(x){
y <- as.data.frame(get(x))
colnames(y) <- c("open","high","low","close","volume","adjusted")
# add date and symbol name to output data frames
y$date <- rownames(y)
y$symbol <- x
y
})
#combine to single data frame
combinedData <- do.call(rbind,symbolData)
# split by day
symbolsByDay <- split(combinedData,as.factor(combinedData$date))
covariances <- lapply(symbolsByDay,function(x){
cov(x[,1:6]) # only use first 6 columns
})
# print first covariance matrix
covariances[1]
...and the output:
> covariances[1]
$`2019-12-02`
open high low close volume adjusted
open 5956.289 5962.359 5811.514 5818.225 -9.274871e+07 5809.939
high 5962.359 5968.557 5817.580 5824.272 -9.314473e+07 5816.005
low 5811.514 5817.580 5671.809 5678.470 -9.188418e+07 5670.276
close 5818.225 5824.272 5678.470 5685.467 -9.155485e+07 5677.246
volume -92748711.735 -93144729.578 -91884178.312 -91554853.356 4.365841e+13 -90986549.261
adjusted 5809.939 5816.005 5670.276 5677.246 -9.098655e+07 5669.171
>
Related
library(qrmdata)
library(xts)
library(dplyr)
#load the data
data("EUR_USD")
data("SP500_const")
data("EURSTX_const")
#select stock and period
walmart <- data.frame(nr = SP500_const['2005-05-20/2015-05-19',"WMT"])
danone <- data.frame(nr = EURSTX_const['2005-05-20/2015-05-19',"BN.PA"])
exrate <- data.frame(nr = EUR_USD['2005-05-20/2015-05-19',])
#omit 'NA' entries
walmart <- na.omit(walmart)
danone <- na.omit(danone)
exrate <- na.omit(exrate)
I want to merge the three time series walmart, danone and exrate into one time series, but I only want those days in it for which I have data in all three time series.
I tried to merge danone and walmart first using
z <- merge(danone,walmart, join='inner')
which should merge danone and walmart (only using the days for which I have data from both danone and walmart) but it doesn't give me the output I described above
You can use inner_join from dplyr library
walmart$date<-rownames(walmart)
danone$date<-rownames(danone)
exrate$date<-rownames(exrate)
a<-inner_join(inner_join(walmart, danone,by = "date" ), exrate, by = "date")
> head(a)
WMT date BN.PA EUR.USD
1 37.52 2005-05-20 24.8986 1.2561
2 38.05 2005-05-23 25.0496 1.2585
3 37.89 2005-05-24 25.1000 1.2586
4 37.61 2005-05-25 25.0832 1.2605
5 37.62 2005-05-26 25.3013 1.2513
6 37.59 2005-05-27 25.2006 1.2580
Here my time period range:
start_day = as.Date('1974-01-01', format = '%Y-%m-%d')
end_day = as.Date('2014-12-21', format = '%Y-%m-%d')
df = as.data.frame(seq(from = start_day, to = end_day, by = 'day'))
colnames(df) = 'date'
I need to created 10,000 data.frames with different fake years of 365days each one. This means that each of the 10,000 data.frames needs to have different start and end of year.
In total df has got 14,965 days which, divided by 365 days = 41 years. In other words, df needs to be grouped 10,000 times differently by 41 years (of 365 days each one).
The start of each year has to be random, so it can be 1974-10-03, 1974-08-30, 1976-01-03, etc... and the remaining dates at the end df need to be recycled with the starting one.
The grouped fake years need to appear in a 3rd col of the data.frames.
I would put all the data.frames into a list but I don't know how to create the function which generates 10,000 different year's start dates and subsequently group each data.frame with a 365 days window 41 times.
Can anyone help me?
#gringer gave a good answer but it solved only 90% of the problem:
dates.df <- data.frame(replicate(10000, seq(sample(df$date, 1),
length.out=365, by="day"),
simplify=FALSE))
colnames(dates.df) <- 1:10000
What I need is 10,000 columns with 14,965 rows made by dates taken from df which need to be eventually recycled when reaching the end of df.
I tried to change length.out = 14965 but R does not recycle the dates.
Another option could be to change length.out = 1 and eventually add the remaining df rows for each column by maintaining the same order:
dates.df <- data.frame(replicate(10000, seq(sample(df$date, 1),
length.out=1, by="day"),
simplify=FALSE))
colnames(dates.df) <- 1:10000
How can I add the remaining df rows to each col?
The seq method also works if the to argument is unspecified, so it can be used to generate a specific number of days starting at a particular date:
> seq(from=df$date[20], length.out=10, by="day")
[1] "1974-01-20" "1974-01-21" "1974-01-22" "1974-01-23" "1974-01-24"
[6] "1974-01-25" "1974-01-26" "1974-01-27" "1974-01-28" "1974-01-29"
When used in combination with replicate and sample, I think this will give what you want in a list:
> replicate(2,seq(sample(df$date, 1), length.out=10, by="day"), simplify=FALSE)
[[1]]
[1] "1985-07-24" "1985-07-25" "1985-07-26" "1985-07-27" "1985-07-28"
[6] "1985-07-29" "1985-07-30" "1985-07-31" "1985-08-01" "1985-08-02"
[[2]]
[1] "2012-10-13" "2012-10-14" "2012-10-15" "2012-10-16" "2012-10-17"
[6] "2012-10-18" "2012-10-19" "2012-10-20" "2012-10-21" "2012-10-22"
Without the simplify=FALSE argument, it produces an array of integers (i.e. R's internal representation of dates), which is a bit trickier to convert back to dates. A slightly more convoluted way to do this is and produce Date output is to use data.frame on the unsimplified replicate result. Here's an example that will produce a 10,000-column data frame with 365 dates in each column (takes about 5s to generate on my computer):
dates.df <- data.frame(replicate(10000, seq(sample(df$date, 1),
length.out=365, by="day"),
simplify=FALSE));
colnames(dates.df) <- 1:10000;
> dates.df[1:5,1:5];
1 2 3 4 5
1 1988-09-06 1996-05-30 1987-07-09 1974-01-15 1992-03-07
2 1988-09-07 1996-05-31 1987-07-10 1974-01-16 1992-03-08
3 1988-09-08 1996-06-01 1987-07-11 1974-01-17 1992-03-09
4 1988-09-09 1996-06-02 1987-07-12 1974-01-18 1992-03-10
5 1988-09-10 1996-06-03 1987-07-13 1974-01-19 1992-03-11
To get the date wraparound working, a slight modification can be made to the original data frame, pasting a copy of itself on the end:
df <- as.data.frame(c(seq(from = start_day, to = end_day, by = 'day'),
seq(from = start_day, to = end_day, by = 'day')));
colnames(df) <- "date";
This is easier to code for downstream; the alternative being a double seq for each result column with additional calculations for the start/end and if statements to deal with boundary cases.
Now instead of doing date arithmetic, the result columns subset from the original data frame (where the arithmetic is already done). Starting with one date in the first half of the frame and choosing the next 14965 values. I'm using nrow(df)/2 instead for a more generic code:
dates.df <-
as.data.frame(lapply(sample.int(nrow(df)/2, 10000),
function(startPos){
df$date[startPos:(startPos+nrow(df)/2-1)];
}));
colnames(dates.df) <- 1:10000;
>dates.df[c(1:5,(nrow(dates.df)-5):nrow(dates.df)),1:5];
1 2 3 4 5
1 1988-10-21 1999-10-18 2009-04-06 2009-01-08 1988-12-28
2 1988-10-22 1999-10-19 2009-04-07 2009-01-09 1988-12-29
3 1988-10-23 1999-10-20 2009-04-08 2009-01-10 1988-12-30
4 1988-10-24 1999-10-21 2009-04-09 2009-01-11 1988-12-31
5 1988-10-25 1999-10-22 2009-04-10 2009-01-12 1989-01-01
14960 1988-10-15 1999-10-12 2009-03-31 2009-01-02 1988-12-22
14961 1988-10-16 1999-10-13 2009-04-01 2009-01-03 1988-12-23
14962 1988-10-17 1999-10-14 2009-04-02 2009-01-04 1988-12-24
14963 1988-10-18 1999-10-15 2009-04-03 2009-01-05 1988-12-25
14964 1988-10-19 1999-10-16 2009-04-04 2009-01-06 1988-12-26
14965 1988-10-20 1999-10-17 2009-04-05 2009-01-07 1988-12-27
This takes a bit less time now, presumably because the date values have been pre-caclulated.
Try this one, using subsetting instead:
start_day = as.Date('1974-01-01', format = '%Y-%m-%d')
end_day = as.Date('2014-12-21', format = '%Y-%m-%d')
date_vec <- seq.Date(from=start_day, to=end_day, by="day")
Now, I create a vector long enough so that I can use easy subsetting later on:
date_vec2 <- rep(date_vec,2)
Now, create the random start dates for 100 instances (replace this with 10000 for your application):
random_starts <- sample(1:14965, 100)
Now, create a list of dates by simply subsetting date_vec2 with your desired length:
dates <- lapply(random_starts, function(x) date_vec2[x:(x+14964)])
date_df <- data.frame(dates)
names(date_df) <- 1:100
date_df[1:5,1:5]
1 2 3 4 5
1 1997-05-05 2011-12-10 1978-11-11 1980-09-16 1989-07-24
2 1997-05-06 2011-12-11 1978-11-12 1980-09-17 1989-07-25
3 1997-05-07 2011-12-12 1978-11-13 1980-09-18 1989-07-26
4 1997-05-08 2011-12-13 1978-11-14 1980-09-19 1989-07-27
5 1997-05-09 2011-12-14 1978-11-15 1980-09-20 1989-07-28
This question already has answers here:
Split date into different columns for year, month and day
(4 answers)
Closed 6 years ago.
I have a dataset which looks like:
mother_id,dateOfBirth
1,1962-09-24
2,1991-02-19
3,1978-11-11
I need to extract the constituent elements (day,month,year) from date of birth and put them in corresponding columns to look like:
mother_id,dateOfBirth,dayOfBirth,monthOfBirth,yearOfBirth
1,1962-09-24,24,09,1962
2,1991-02-19,19,02,1991
3,1978-11-11,11,11,1978
Currently, I have it coded as a loop:
data <- read.csv("/home/tumaini/Desktop/IHI-Projects/Data-Linkage/matching file dss nacp.csv",stringsAsFactors = F)
dss_individuals <- read.csv("/home/tumaini/Desktop/IHI-Projects/Data-Linkage/Data/dssIndividuals.csv", stringsAsFactors = F)
lookup <- data[,c("patientid","extId")]
# remove duplicates
lookup <- lookup[!(duplicated(lookup$patientid)),]
dss_individuals$dateOfBirth <- as.character.Date(dss_individuals$dob)
dss_individuals$dayOfBirth <- 0
dss_individuals$monthOfBirth <- 0
dss_individuals$yearOfBirth <- 0
# Loop starts here
for(i in 1:nrow(dss_individuals)){ #nrow(dss_individuals)
split_list <- unlist(strsplit(dss_individuals[i,]$dateOfBirth,'[- ]'))
dss_individuals[i,]["dayOfBirth"] <- split_list[3]
dss_individuals[i,]["monthOfBirth"] <- split_list[2]
dss_individuals[i,]["yearOfBirth"] <- split_list[1]
}
This seems to work, but is horrendously slow as I have 400 000 rows. Is there a way I can get this done more efficiently?
I compared the speed of substr, format, and use of lubridate. It seems that lubridate and format are much faster than substr, if the the variable is stored as date. However, substr would be fastest if the variable is stored as character vector. The results of a single run is shown.
x <- sample(
seq(as.Date('1000/01/01'), as.Date('2000/01/01'), by="day"),
400000, replace = T)
system.time({
y <- substr(x, 1, 4)
m <- substr(x, 6, 7)
d <- substr(x, 9, 10)
})
# user system elapsed
# 3.775 0.004 3.779
system.time({
y <- format(x,"%y")
m <- format(x,"%m")
d <- format(x,"%d")
})
# user system elapsed
# 1.118 0.000 1.118
system.time({
y <- year(x)
m <- month(x)
d <- day(x)
})
# user system elapsed
# 0.951 0.000 0.951
x1 <- as.character(x)
system.time({
y <- substr(x1, 1, 4)
m <- substr(x1, 6, 7)
d <- substr(x1, 9, 10)
})
# user system elapsed
# 0.082 0.000 0.082
Not sure if this will solve your speed issues but here is a nicer way of doing it using dplyr and lubridate. In general when it comes to manipulating data.frames I personally recommend using either data.tables or dplyr. Data.tables is supposed to be faster but dplyr is more verbose which I personally prefer as I find it easier to pick up my code after not having read it for months.
library(dplyr)
library(lubridate)
dat <- data.frame( mother_id = c(1,2,3),
dateOfBirth = ymd(c( "1962-09-24" ,"1991-02-19" ,"1978-11-11"))
)
dat %>% mutate( year = year(dateOfBirth) ,
month = month(dateOfBirth),
day = day(dateOfBirth) )
Or you can use the mutate_each function to save having to write the variable name multiple times (though you get less control over the name of the output variables)
dat %>% mutate_each( funs(year , month , day) , dateOfBirth)
Here are some solutions. These solutions each (i) use 1 or 2 lines of code and (ii) return numeric year, month and day columns. In addition, the first two solutions use no packages -- the third uses chron's month.day.year function.
1) POSIXlt Convert to "POSIXlt" class and pick off the parts.
lt <- as.POSIXlt(DF$dateOfBirth, origin = "1970-01-01")
transform(DF, year = lt$year + 1900, month = lt$mon + 1, day = lt$mday)
giving:
mother_id dateOfBirth year month day
1 1 1962-09-24 1962 9 24
2 2 1991-02-19 1991 2 19
3 3 1978-11-11 1978 11 11
2) read.table
cbind(DF, read.table(text = format(DF$dateOfBirth), sep = "-",
col.names = c("year", "month", "day")))
giving:
mother_id dateOfBirth year month day
1 1 1962-09-24 1962 9 24
2 2 1991-02-19 1991 2 19
3 3 1978-11-11 1978 11 11
3) chron::month.day.year
library(chron)
cbind(DF, month.day.year(DF$dateOfBirth))
giving:
mother_id dateOfBirth month day year
1 1 1962-09-24 9 24 1962
2 2 1991-02-19 2 19 1991
3 3 1978-11-11 11 11 1978
Note 1: Often when year, month and day are added to data it is not really necessary and in fact they could be generated on the fly when needed using format, substr or as.POSIXlt so you might critically examine whether you actually need to do this.
Note 2: The input data frame, DF in reproducible form, was assumed to be:
Lines <- "mother_id,dateOfBirth
1,1962-09-24
2,1991-02-19
3,1978-11-11"
DF <- read.csv(text = Lines)
Use format once for each part:
dss_individuals$dayOfBirth <- format(dss_individuals$dateOfBirth,"%d")
dss_individuals$monthOfBirth <- format(dss_individuals$dateOfBirth,"%m")
dss_individuals$yearOfBirth <- format(dss_individuals$dateOfBirth,"%Y")
Check the substr function from the base package (or other functions from the nice stringr package) to extract different parts of a string. This function may assume that day, month and year are always in the same place and with the same length.
The strsplit function is vectorized so using rbind.data.frame to convert your list to a dataframe works:
do.call(rbind.data.frame, strsplit(df$dateOfBirth, split = '-'))
Results need to be transposed in order to be used: you can do it using do.call or the t function.
The above code calculates the cohort-wise retention rates. The cohort is the month of joining. So, the code caluclates the number customers who joined in the month of May 2015, how many are active month on month. The final output is stored in data frame df1 (shown below)
I need help creating dynamic column names which are currently hard coded in the ddply function. M0 which means month of joining, M1 means 1st month from joining, M2 means 2 month from joining to M(n) should be variables. This can calculated by subtracting the farthest expiry date from the earliest joining date.
Unfortunately, I'm not able to auto calculate M0 to M(n) range dynamically.
Here is my code dump which works but is not optimal because I have hardcoded M0 to M3 as a variable in ddply function. So if my input data has a customer with a longer subscription period than 5 months my code will fail.
The input to the code is the following dummy data.
customer dj exp
abc 01/05/15 25/06/15
efg 01/05/15 25/07/15
ghd 01/05/15 25/07/15
mkd 01/06/15 25/07/15
kskm 01/06/15 05/08/15
Reproducible code.
library(zoo)
library(plyr)
customer<-c("abc","efg","ghd","mkd","kskm")
dj<-c("2015-05-01", "2015-05-01", "2015-05-01","2015-06-01","2015-06-01")
exp<-c("2015-06-25", "2015-07-25", "2015-07-25","2015-07-01","2015-08-05")
data.frame(customer,dj,exp)
df$dj <- as.Date(df$dj,"%d/%m/%y")
df$exp <- as.Date(df$exp,"%d/%m/%y")
# The data in the file has different variable names than your example data
# so I'm changing them to match
names(df)[1:3] <- c("customer","dj","exp")
# Make a variable called Cohort that contains only the year and month of joining
# as.yearmon() comes from the 'zoo' package
df$Cohort <- as.yearmon(df$dj)
# Calculate the difference in months between date of expiry and date of joining
df$MonthDiff <- ceiling((df$exp-df$dj)/30)
#df$MonthDiff <- 12*(as.yearmon(df$exp+months(1))-df$Cohort)
range<-as.integer(ceiling((max(df$exp)-min(df$dj)))/30)
# Use ddply() from the 'plyr' package to get the frequency of subjects that are
# still active after 0, 1, 2, and 3 months.
df1 <- ddply(df,.(Cohort),summarize,
M0 = sum(MonthDiff > 0),
M1 = sum(MonthDiff > 1),
M2 = sum(MonthDiff > 2),
M3 = sum(MonthDiff > 3)
)
df1
df1
Cohort M0 M1 M2 M3
1 May 2015 3 3 2 0
2 Jun 2015 2 2 1 0
The above is the output working output. Ask is to make column M0 to M3 dynamic
Try inserting this after creating range:
for(i in 0:range) df <- within(df,assign(paste0("M",i),MonthDiff>i))
df1 <- ddply(df,.(Cohort),function(x) colSums(x[,paste0("M",0:range)]))
df1
# Cohort M0 M1 M2 M3
# 1 May 2015 3 3 2 0
# 2 Jun 2015 2 1 1 0
I don't often have to work with dates in R, but I imagine this is fairly easy. I have daily data as below for several years with some values and I want to get for each 8 days period the sum of related values.What is the best approach?
Any help you can provide will be greatly appreciated!
str(temp)
'data.frame':648 obs. of 2 variables:
$ Date : Factor w/ 648 levels "2001-03-24","2001-03-25",..: 1 2 3 4 5 6 7 8 9 10 ...
$ conv2: num -3.93 -6.44 -5.48 -6.09 -7.46 ...
head(temp)
Date amount
24/03/2001 -3.927020472
25/03/2001 -6.4427004
26/03/2001 -5.477592528
27/03/2001 -6.09462162
28/03/2001 -7.45666902
29/03/2001 -6.731540928
30/03/2001 -6.855206184
31/03/2001 -6.807210228
1/04/2001 -5.40278802
I tried to use aggregate function but for some reasons it doesn't work and it aggregates in wrong way:
z <- aggregate(amount ~ Date, timeSequence(from =as.Date("2001-03-24"),to =as.Date("2001-03-29"), by="day"),data=temp,FUN=sum)
I prefer the package xts for such manipulations.
I read your data, as zoo objects. see the flexibility of format option.
library(xts)
ts.dat <- read.zoo(text ='Date amount
24/03/2001 -3.927020472
25/03/2001 -6.4427004
26/03/2001 -5.477592528
27/03/2001 -6.09462162
28/03/2001 -7.45666902
29/03/2001 -6.731540928
30/03/2001 -6.855206184
31/03/2001 -6.807210228
1/04/2001 -5.40278802',header=TRUE,format = '%d/%m/%Y')
Then I extract the index of given period
ep <- endpoints(ts.dat,'days',k=8)
finally I apply my function to the time series at each index.
period.apply(x=ts.dat,ep,FUN=sum )
2001-03-29 2001-04-01
-36.13014 -19.06520
Use cut() in your aggregate() command.
Some sample data:
set.seed(1)
mydf <- data.frame(
DATE = seq(as.Date("2000/1/1"), by="day", length.out = 365),
VALS = runif(365, -5, 5))
Now, the aggregation. See ?cut.Date for details. You can specify the number of days you want in each group using cut:
output <- aggregate(VALS ~ cut(DATE, "8 days"), mydf, sum)
list(head(output), tail(output))
# [[1]]
# cut(DATE, "8 days") VALS
# 1 2000-01-01 8.242384
# 2 2000-01-09 -5.879011
# 3 2000-01-17 7.910816
# 4 2000-01-25 -6.592012
# 5 2000-02-02 2.127678
# 6 2000-02-10 6.236126
#
# [[2]]
# cut(DATE, "8 days") VALS
# 41 2000-11-16 17.8199285
# 42 2000-11-24 -0.3772209
# 43 2000-12-02 2.4406024
# 44 2000-12-10 -7.6894484
# 45 2000-12-18 7.5528077
# 46 2000-12-26 -3.5631950
rollapply. The zoo package has a rolling apply function which can also do non-rolling aggregations. First convert the temp data frame into zoo using read.zoo like this:
library(zoo)
zz <- read.zoo(temp)
and then its just:
rollapply(zz, 8, sum, by = 8)
Drop the by = 8 if you want a rolling total instead.
(Note that the two versions of temp in your question are not the same. They have different column headings and the Date columns are in different formats. I have assumed the str(temp) output version here. For the head(temp) version one would have to add a format = "%d/%m/%Y" argument to read.zoo.)
aggregate. Here is a solution that does not use any external packages. It uses aggregate based on the original data frame.
ix <- 8 * ((1:nrow(temp) - 1) %/% 8 + 1)
aggregate(temp[2], list(period = temp[ix, 1]), sum)
Note that ix looks like this:
> ix
[1] 8 8 8 8 8 8 8 8 16
so it groups the indices of the first 8 rows, the second 8 and so on.
Those are NOT Date classed variables. (No self-respecting program would display a date like that, not to mention the fact that these are labeled as factors.) [I later noticed these were not the same objects.] Furthermore, the timeSequence function (at least the one in the timeDate package) does not return a Date class vector either. So your expectation that there would be a "right way" for two disparate non-Date objects to be aligned in a sensible manner is ill-conceived. The irony is that just using the temp$Date column would have worked since :
> z <- aggregate(amount ~ Date, data=temp , FUN=sum)
> z
Date amount
1 1/04/2001 -5.402788
2 24/03/2001 -3.927020
3 25/03/2001 -6.442700
4 26/03/2001 -5.477593
5 27/03/2001 -6.094622
6 28/03/2001 -7.456669
7 29/03/2001 -6.731541
8 30/03/2001 -6.855206
9 31/03/2001 -6.807210
But to get it in 8 day intervals use cut.Date:
> z <- aggregate(temp$amount ,
list(Dts = cut(as.Date(temp$Date, format="%d/%m/%Y"),
breaks="8 day")), FUN=sum)
> z
Dts x
1 2001-03-24 -49.792561
2 2001-04-01 -5.402788
A more cleaner approach extended to #G. Grothendieck appraoch. Note: It does not take into account if the dates are continuous or discontinuous, sum is calculated based on the fixed width.
code
interval = 8 # your desired date interval. 2 days, 3 days or whatevea
enddate = interval-1 # this sets the enddate
nrows = nrow(z)
z <- aggregate(.~V1,data = df,sum) # aggregate sum of all duplicate dates
z$V1 <- as.Date(z$V1)
data.frame ( Start.date = (z[seq(1, nrows, interval),1]),
End.date = z[seq(1, nrows, interval)+enddate,1],
Total.sum = rollapply(z$V2, interval, sum, by = interval, partial = TRUE))
output
Start.date End.date Total.sum
1 2000-01-01 2000-01-08 9.1395926
2 2000-01-09 2000-01-16 15.0343960
3 2000-01-17 2000-01-24 4.0974712
4 2000-01-25 2000-02-01 4.1102645
5 2000-02-02 2000-02-09 -11.5816277
data
df <- data.frame(
V1 = seq(as.Date("2000/1/1"), by="day", length.out = 365),
V2 = runif(365, -5, 5))