I have the following dataframe,
FECHA Pxlast
1 2010-12-31 332.636
2 2011-01-07 334.327
3 2011-01-14 341.771
4 2011-01-21 331.241
5 2011-01-28 333.252
I have to calculate a new column called "Rolling 4 weeks". The values are based on the following idea: for index 5, for example, it is Pxlast[5] / Pxlast[5 - number of weeks] - 1.
That gives the performance over 4 weeks. In this example Pxlast[5] = 333.252 and Pxlast[5 - 4] = 332.636, so I divide and subtract 1, and the result is about 0.00185.
I can do it using a "for" loop, but reading about functions that might do it more cleanly I found chart.RollingPerformance from the PerformanceAnalytics package.
It applies the function given in its FUN argument over a rolling window of a given width; with FUN = "mean", for example, it calculates the mean within each window, but I don't know how to calculate the performance correctly with it.
Here is the output dataframe for example.
FECHA Pxlast Rolling4W
1 2010-12-31 332.636 NA
2 2011-01-07 334.327 NA
3 2011-01-14 341.771 NA
4 2011-01-21 331.241 NA
5 2011-01-28 333.252 0.00185
The NA values appear because the performance is calculated over a window of 4 rows (weeks), so the first 4 rows have no reference value.
Is there any function to do it without loops?
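Yes: shift() from data.table can do this without an explicit loop. Here it is first on a small toy example: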
require(data.table)
d <- data.table(x = 1:5, y=c(1, 2, 4, 7, 11))
d[, z := shift(y, 4)]
d[, z := y/z - 1]
d
# x y z
# 1: 1 1 NA
# 2: 2 2 NA
# 3: 3 4 NA
# 4: 4 7 NA
# 5: 5 11 10
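And applied to data shaped like the question's (a sketch, assuming the columns are named FECHA and Pxlast as shown above):
require(data.table)
dt <- data.table(FECHA = as.Date(c("2010-12-31", "2011-01-07", "2011-01-14",
                                   "2011-01-21", "2011-01-28")),
                 Pxlast = c(332.636, 334.327, 341.771, 331.241, 333.252))
# divide each price by the price 4 rows (weeks) earlier, then subtract 1
dt[, Rolling4W := Pxlast / shift(Pxlast, 4) - 1]
# rows 1-4 come out NA; row 5 is 333.252/332.636 - 1, roughly 0.00185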
Can somebody please help me with a recode from SPSS into R?
SPSS code:
RECODE variable1
(1,2=1)
(3 THRU 8 =2)
(9, 10 =3)
(ELSE = SYSMIS)
INTO variable2
I can create new variables with the different values. However, I'd like it to be in the same variable, as SPSS does.
Many thanks.
x <- y <- 1:20
x
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
y[x %in% (1:2)] <- 1
y[x %in% (3:8)] <- 2
y[x %in% (9:10)] <- 3
y[!(x %in% (1:10))] <- NA
y
[1] 1 1 2 2 2 2 2 2 3 3 NA NA NA NA NA NA NA NA NA NA
I wrote a function whose call is very similar to the SPSS RECODE syntax. See here:
variable1 <- -1:11
recodeR(variable1, c(1, 2, 1), c(3:8, 2), c(9, 10, 3), else_do= "missing")
NA NA 1 1 2 2 2 2 2 2 3 3 NA
The function also works for other examples. This is how it is defined:
recodeR <- function(vec_in, ..., else_do){
  l <- list(...)
  # extract the "from" values (all but the last element of each rule)
  from_vec <- unlist(lapply(l, function(x) x[1:(length(x) - 1)]))
  # extract the "to" values (the last element, repeated once per "from" value)
  to_vec <- unlist(lapply(l, function(x) rep(x[length(x)], length(x) - 1)))
  # plyr is required for mapvalues
  require(plyr)
  # recode the variable
  vec_out <- mapvalues(vec_in, from_vec, to_vec)
  # if else_do = "missing", all values outside the defined range become NA;
  # otherwise values outside the defined range stay the same
  if (else_do == "missing") {
    vec_out <- ifelse(vec_in < min(from_vec, na.rm = TRUE) |
                      vec_in > max(from_vec, na.rm = TRUE), NA, vec_out)
  }
  # return the resulting vector
  return(vec_out)
}
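As an aside (not part of the answer above), a sketch of the same recode using dplyr::case_when(), which also reads close to the SPSS rules:
library(dplyr)
variable1 <- -1:11
# each formula mirrors one SPSS RECODE rule; values matching no rule
# fall through to NA, like ELSE = SYSMIS
variable2 <- case_when(
  variable1 %in% c(1, 2)  ~ 1,
  variable1 %in% 3:8      ~ 2,
  variable1 %in% c(9, 10) ~ 3,
  TRUE                    ~ NA_real_
)
variable2
# [1] NA NA  1  1  2  2  2  2  2  2  3  3 NA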
I would like to create a rolling 2-quarter average for alpha, bravo and charlie (and lots of other variables). My research keeps leading me to the zoo and lubridate packages, but the examples always seem to come back to rolling within one variable or group.
set.seed(123)
dates <- c("Q4'15", "Q1'16", "Q2'16","Q3'16", "Q4'16", "Q1'17", "Q2'17" ,"Q3'17", "Q4'17","Q1'18")
df <- data.frame(dates = sample(dates, 100, replace = TRUE, prob=rep(c(.03,.07,.03,.08, .05),2)),
alpha = rnorm(100, 5), bravo = rnorm(100, 10), charlie = rnorm(100, 15))
I'm looking for something like
x <- df %>% mutate_if(is.numeric, funs(rollmean(., 2, align='right', fill=NA)))
Desired result: a weighted average across "Q4'15" & "Q1'16", "Q1'16" & "Q2'16", etc for each column of data (alpha, bravo, charlie). Not looking for the average of the paired quarterly averages.
Here is what the averages would be for the Q4'15 & Q1'16 time point:
df %>% filter(dates %in% c("Q4'15", "Q1'16")) %>% select(-dates) %>% summarise_all(mean)
I like data.table for this, and I have a solution for you but there may be a more elegant one. Here is what I have:
Data
Now as a data.table:
R> suppressMessages(library(data.table))
R> set.seed(123)
R> datesvec <- c("Q4'15", "Q1'16", "Q2'16","Q3'16", "Q4'16",
+ "Q1'17", "Q2'17" ,"Q3'17", "Q4'17","Q1'18")
R> df <- data.table(dates = sample(datesvec, 100, replace = TRUE,
+ prob=rep(c(.03,.07,.03,.08, .05),2)),
+ alpha = rnorm(100, 5),
+ bravo = rnorm(100, 10),
+ charlie = rnorm(100, 15))
R> df[ , ind := which(datesvec==dates), by=dates]
R> setkey(df, ind) # optional but may as well
R> head(df)
dates alpha bravo charlie ind
1: Q4'15 5.37964 11.05271 14.4789 1
2: Q4'15 7.05008 10.36896 15.0892 1
3: Q4'15 4.29080 12.12845 13.6047 1
4: Q4'15 5.00576 8.93667 13.3325 1
5: Q4'15 3.53936 9.81707 13.6360 1
6: Q1'16 3.45125 10.56299 16.0808 2
R>
The key here is that we need to restore / maintain the temporal ordering of your quarters, which your data representation does not have.
Average by quarter
This is easy with data.table:
R> ndf <- df[ ,
+ .(qtr=head(dates,1), # label of quarter
+ sa=sum(alpha), # sum of a in quarter
+ sb=sum(bravo), # sum of b in quarter
+ sc=sum(charlie), # sum of c in quarter
+ n=.N), # number of observations
+ by=ind]
R> ndf
ind qtr sa sb sc n
1: 1 Q4'15 25.2656 52.3039 70.1413 5
2: 2 Q1'16 65.8562 132.6650 192.7921 13
3: 3 Q2'16 10.3422 17.8061 31.3404 2
4: 4 Q3'16 84.6664 168.1914 256.9010 17
5: 5 Q4'16 41.3268 87.8253 139.5873 9
6: 6 Q1'17 42.6196 85.4059 134.8205 9
7: 7 Q2'17 76.5190 162.0784 241.2597 16
8: 8 Q3'17 42.8254 83.2483 127.2600 8
9: 9 Q4'17 68.1357 133.5794 198.1920 13
10: 10 Q1'18 37.0685 78.4107 120.2808 8
R>
Lag those averages once
R> ndf[, `:=`(psa=shift(sa), # previous sum of a
+ psb=shift(sb), # previous sum of b
+ psc=shift(sc), # previous sum of c
+ pn=shift(n))] # previous nb of obs
R> ndf
ind qtr sa sb sc n psa psb psc pn
1: 1 Q4'15 25.2656 52.3039 70.1413 5 NA NA NA NA
2: 2 Q1'16 65.8562 132.6650 192.7921 13 25.2656 52.3039 70.1413 5
3: 3 Q2'16 10.3422 17.8061 31.3404 2 65.8562 132.6650 192.7921 13
4: 4 Q3'16 84.6664 168.1914 256.9010 17 10.3422 17.8061 31.3404 2
5: 5 Q4'16 41.3268 87.8253 139.5873 9 84.6664 168.1914 256.9010 17
6: 6 Q1'17 42.6196 85.4059 134.8205 9 41.3268 87.8253 139.5873 9
7: 7 Q2'17 76.5190 162.0784 241.2597 16 42.6196 85.4059 134.8205 9
8: 8 Q3'17 42.8254 83.2483 127.2600 8 76.5190 162.0784 241.2597 16
9: 9 Q4'17 68.1357 133.5794 198.1920 13 42.8254 83.2483 127.2600 8
10: 10 Q1'18 37.0685 78.4107 120.2808 8 68.1357 133.5794 198.1920 13
R>
Average over current and previous quarter
R> ndf[is.finite(psa), # where we have valid data
+ `:=`(ra=(sa+psa)/(n+pn), # total sum / total n == avg
+ rb=(sb+psb)/(n+pn),
+ rc=(sc+psc)/(n+pn))]
R> ndf[,c(1:2, 11:13)]
ind qtr ra rb rc
1: 1 Q4'15 NA NA NA
2: 2 Q1'16 5.06233 10.27605 14.6074
3: 3 Q2'16 5.07989 10.03141 14.9422
4: 4 Q3'16 5.00045 9.78935 15.1706
5: 5 Q4'16 4.84589 9.84680 15.2496
6: 6 Q1'17 4.66369 9.62395 15.2449
7: 7 Q2'17 4.76554 9.89937 15.0432
8: 8 Q3'17 4.97268 10.22195 15.3550
9: 9 Q4'17 5.28386 10.32513 15.4977
10: 10 Q1'18 5.00972 10.09476 15.1654
R>
This takes advantage of the fact that the total sum over two quarters divided by the total number of observations is the same as the mean over those two quarters. (And this reflects an edit following an earlier thinko of mine.)
Spot check
We can use the selection feature of data.table to compute two of those rows by hand; I picked the ones for indices <1,2> and <4,5> here:
R> df[ ind <= 2, .(a=mean(alpha), b=mean(bravo), c=mean(charlie))]
a b c
1: 5.06233 10.276 14.6074
R> df[ ind == 4 | ind == 5, .(a=mean(alpha), b=mean(bravo), c=mean(charlie))]
a b c
1: 4.84589 9.8468 15.2496
R>
This pans out fine, and the approach should scale easily to millions of rows thanks to data.table.
PS: All in One
As you mentioned pipes etc., you can write all this with chained data.table operations. Not my preferred style, but possible. The following creates the exact same output without ever materializing the ndf temporary used above:
## All in one
df[ , ind := which(datesvec==dates), by=dates][
,
.(qtr=head(dates,1), # label of quarter
sa=sum(alpha), # sum of a in quarter
sb=sum(bravo), # sum of b in quarter
sc=sum(charlie), # sum of c in quarter
n=.N), # number of observations
by=ind][
,
`:=`(psa=shift(sa), # previous sum of a
psb=shift(sb), # previous sum of b
psc=shift(sc), # previous sum of c
pn=shift(n))][
is.finite(psa), # where we have valid data
`:=`(ra=(sa+psa)/(n+pn), # total sum / total n == avg
rb=(sb+psb)/(n+pn),
rc=(sc+psc)/(n+pn))][
,c(1:2, 11:13)][]
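An aside not in the answer above: data.table 1.12.0 and later ship rolling functions, so the shift-and-add step could likely be condensed with frollsum(), which sums over a sliding window (a sketch reusing the ndf built earlier; the ra2/rb2/rc2 names are hypothetical):
# rolling two-quarter sums divided by rolling two-quarter counts
# reproduce the combined means; the first row is NA by construction
ndf[, `:=`(ra2 = frollsum(sa, 2) / frollsum(n, 2),
           rb2 = frollsum(sb, 2) / frollsum(n, 2),
           rc2 = frollsum(sc, 2) / frollsum(n, 2))]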
I am reading an extremely large dataset into a data.table for speed. The relevant columns are DATE (weekly data as year-month-day strings, e.g. "2017-12-25"), V1 (integer), V2 (string), and V3 (numeric). I would like to produce V4, which is the moving average of V3 over the last 3 weeks (DATE, DATE-7, and DATE-14).
Here is a naive attempt/solution, which is terribly inefficient:
dt <- fread("largefile.csv")
dt$DATE <- as.IDate(dt$DATE)  # convert date strings to Date format
V1_list <- sort(unique(dt$V1))
V2_list <- sort(unique(dt$V2))
DATE_list <- sort(unique(dt$DATE))
for (i in 1:length(V1_list)) {
  for (j in 1:length(V2_list)) {
    for (k in 3:length(DATE_list)) {
      dt[which(dt$V1 == V1_list[i] & dt$V2 == V2_list[j] & dt$DATE == DATE_list[k]), "V4"] <-
        mean(dt[which(dt$V1 == V1_list[i] & dt$V2 == V2_list[j] & dt$DATE %in% DATE_list[(k-2):k]), "V3"])
    }
  }
}
I am avoiding using plyr partly due to computational constraints given the 50M rows I'm using. I have investigated options with setkey() and zoo / rolling functions but I am unable to figure out how to layer in the date component (assuming I group by V1, V2 and average V3). Apologies for not providing sample code.
The OP has requested to append a new column, the rolling average of V3 over the past 3 weeks, grouped by V1 and V2, for a data.table of 50M rows.
If the DATE values are without gap, i.e., without missing weeks in all groups, one possible approach is to use the rollmeanr() function from the zoo package:
DT[order(DATE), V4 := zoo::rollmeanr(V3, 3L, fill = NA), by = .(V1, V2)]
DT[order(V1, V2, DATE)]
DATE V1 V2 V3 V4
1: 2017-12-04 1 A 1 NA
2: 2017-12-11 1 A 2 NA
3: 2017-12-18 1 A 3 2
4: 2017-12-25 1 A 4 3
5: 2017-12-04 1 B 5 NA
6: 2017-12-11 1 B 6 NA
7: 2017-12-18 1 B 7 6
8: 2017-12-25 1 B 8 7
9: 2017-12-04 2 A 9 NA
10: 2017-12-11 2 A 10 NA
11: 2017-12-18 2 A 11 10
12: 2017-12-25 2 A 12 11
13: 2017-12-04 2 B 13 NA
14: 2017-12-11 2 B 14 NA
15: 2017-12-18 2 B 15 14
16: 2017-12-25 2 B 16 15
Note that the NAs are purposefully introduced because we do not have DATE-7 and DATE-14 values for the first two rows within each group.
Also note that this approach does not require type conversion of the character dates.
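As an aside, if data.table 1.12.0 or later is available, the built-in frollmean() should give the same result without the zoo dependency (a sketch, untested against the data below):
# right-aligned rolling mean of width 3, NA-filled, per (V1, V2) group
DT[order(DATE), V4 := frollmean(V3, 3L), by = .(V1, V2)]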
Data
According to OP's description, the data.table has 4 columns: DATE are weekly character dates in standard unambiguous format %Y-%m-%d, V1 is of type integer, V2 is of type character, and V3 is of type double (numeric). V1 and V2 are used for grouping.
library(data.table)
# create data
n_week = 4L
n_V1 = 2L
# cross join
DT <- CJ(
DATE = as.character(rev(seq(as.Date("2017-12-25"), length.out = n_week, by = "-1 week"))),
V1 = seq_len(n_V1),
V2 = LETTERS[1:2]
)
DT[order(V1, V2, DATE), V3 := as.numeric(seq_len(.N))][]
DATE V1 V2 V3
1: 2017-12-04 1 A 1
2: 2017-12-04 1 B 5
3: 2017-12-04 2 A 9
4: 2017-12-04 2 B 13
5: 2017-12-11 1 A 2
6: 2017-12-11 1 B 6
7: 2017-12-11 2 A 10
8: 2017-12-11 2 B 14
9: 2017-12-18 1 A 3
10: 2017-12-18 1 B 7
11: 2017-12-18 2 A 11
12: 2017-12-18 2 B 15
13: 2017-12-25 1 A 4
14: 2017-12-25 1 B 8
15: 2017-12-25 2 A 12
16: 2017-12-25 2 B 16
So I tried to solve your problem using two inner_join() calls from the dplyr package.
First I created an example data.frame (1,000,000 rows):
V3 <- seq(from=1, to=1000000, by =1 )
DATE <- seq(from=1, to= 7000000, by =7)
dt <- data.frame(V3, DATE)
Does it look correct? I dropped all the unnecessary content and ignored the Date format (you can subtract Dates the same way as integers).
Next, I did two inner joins on the DATE column, with the second data.frame containing DATE + 7 and DATE + 14 respectively, so that you join on the correct dates. Finally, I selected the three interesting columns and computed the row means. It took about 5 seconds on my lousy MacBook.
library(dplyr)
inner_join(
  inner_join(x = dt, y = mutate(dt, DATE = DATE + 7), by = 'DATE'),
  y = mutate(dt, DATE = DATE + 14), by = 'DATE') %>%
  select(V3, V3.y, V3.x) %>%
  rowMeans()
And if you want to add it to your dt, remember that the first two dates have no average, because no DATE - 7 and DATE - 14 values exist:
dt$V4 <- c(NA, NA, inner_join(
  inner_join(x = dt, y = mutate(dt, DATE = DATE + 7), by = 'DATE'),
  y = mutate(dt, DATE = DATE + 14), by = 'DATE') %>%
  select(V3, V3.y, V3.x) %>%
  rowMeans())
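One hedged caveat: the question also grouped by V1 and V2, which this example drops. Assuming dt also carried those columns, the same idea should work by adding them to the join keys, so rows are only matched within their own group (untested sketch):
# join on the group keys as well as the shifted date, so each row is
# paired only with earlier weeks from the same (V1, V2) group
inner_join(
  inner_join(x = dt, y = mutate(dt, DATE = DATE + 7),
             by = c('DATE', 'V1', 'V2')),
  y = mutate(dt, DATE = DATE + 14),
  by = c('DATE', 'V1', 'V2')) %>%
  select(V3, V3.y, V3.x) %>%
  rowMeans()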
I have a dataset with two values for each date like this:
date x y
1 2013-05-01 1 2
2 2013-05-02 2 2
3 2013-05-03 3 2
date is of class Date, created using the lubridate package.
Now I want to have the mean of the two values, except for a certain time span, in which I want to use the values of x.
I tried the following:
mean=(x+y)/2
newdata=ifelse((data$date < 2013-10-01 | date$date > 2014-04-09), mean, x)
but it will just take the mean for all dates.
Is it possible to use greater/lesser than relationships for dates?
Any suggestions on how to make this work?
Thanks in advance
It looks like you are not casting the comparison values as dates. Also, the dates you used for comparison don't exclude any of the dates in the data frame you provided, so I'd expect the mean to be selected every time.
date <- as.Date(c('2013-05-01', '2013-05-02', '2013-05-03'))
x <- c(1, 2, 3)
y <- c(2, 2, 2)
mean <- (x + y)/2
df <- data.frame(date = date, x = x, y = y)
newdata <- ifelse((df$date < as.Date('2013-05-02') | df$date > as.Date('2014-04-09')), mean, x)
newdata
I changed the dates in the condition to be more selective and I got 1.5 2.0 3.0. It selects the first value from mean and the others from x, which agrees with the condition I used in the ifelse().
How about something like this:
library(lubridate)
library(data.table)
##
set.seed(123)
Data <- data.frame(
date=as.Date(ymd(20130904))+0:364,
x=as.numeric(sample(1:3,365,replace=TRUE)),
y=as.numeric(sample(1:3,365,replace=TRUE)))
setDT(Data)
##
xSpan <- seq.Date(
from=as.Date("2013-10-01"),
to=as.Date("2014-04-09"),
by="day")
##
Edited - forgot to group by date
Data[,z:=ifelse(
date %in% xSpan,
x,
mean(c(x,y))),
by=date]
##
> head(Data)
date x y z
1: 2013-09-04 1 3 2.0
2: 2013-09-05 3 1 2.0
3: 2013-09-06 2 1 1.5
4: 2013-09-07 3 2 2.5
5: 2013-09-08 3 2 2.5
6: 2013-09-09 1 2 1.5
> head(subset(Data, date %in% xSpan))
date x y z
1: 2013-10-01 2 3 2
2: 2013-10-02 1 3 1
3: 2013-10-03 1 1 1
4: 2013-10-04 3 1 3
5: 2013-10-05 3 1 3
6: 2013-10-06 3 1 3
I just defined xSpan as the contiguous sequence of days for which the other value is used (in your example, just the identity of x). Dates not included in this time span use mean to determine their value of z.
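An aside: because (x + y)/2 is already vectorized, the by=date grouping can likely be avoided altogether, e.g. with data.table's fifelse() (available from data.table 1.12.4; base ifelse behaves the same way here):
# pick x inside the span, the row-wise mean of x and y outside it
Data[, z := fifelse(date %in% xSpan, x, (x + y) / 2)]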
Assume you have a data frame like this:
df <- data.frame(Nums = c(1,2,3,4,5,6,7,8,9,10), Cum.sums = NA)
> df
Nums Cum.sums
1 1 NA
2 2 NA
3 3 NA
4 4 NA
5 5 NA
6 6 NA
7 7 NA
8 8 NA
9 9 NA
10 10 NA
and you want an output like this:
Nums Cum.sums
1 1 0
2 2 0
3 3 0
4 4 3
5 5 5
6 6 7
7 7 9
8 8 11
9 9 13
10 10 15
The 4th element of the column Cum.sums is the sum of 1 and 2, the 5th element is the sum of 2 and 3, and so on...
In other words, I don't want the normal cumulative sum, but for each row the sum of the element 2 rows above the current row plus the element 3 rows above it.
I already tried to play around a little with the sum and cumsum functions, but I failed.
Any ideas?
Thanks!
You could use the embed function to create the appropriate lags, rowSums to sum, then lag appropriately (I used head).
df$Cum.sums[-(1:3)] <- head(rowSums(embed(df$Nums,2)),-2)
You don't need any special function, just use normal vector operations (these solutions are all equivalent):
df$Cum.sums[-(1:3)] <- head(df$Nums, -3) + head(df$Nums[-1], -2)
or
with(df, Cum.sums[-(1:3)] <- head(Nums, -3) + head(Nums[-1], -2))
or
df$Cum.sums[-(1:3)] <- df$Nums[1:(nrow(df)-3)] + df$Nums[2:(nrow(df)-2)]
I believe the first 3 sums SHOULD be NA, not 0, but if you prefer zeroes, you can initialize the sums first:
df$Cum.sums <- 0
Another solution, elegant and general, uses matrix multiplication, and is therefore very inefficient for large data. It's not very practical, though a nice exercise:
len <- nrow(df)
sr <- 2 # number of rows to sum
lag <- 3
mat <- matrix(
head(c(
rep(0, lag * len),
rep(rep(1:0, c(sr, len - sr + 1)), len)
), len * len),
nrow = len, byrow = TRUE
)
mat %*% df$Nums
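An aside: the same "element 2 rows above plus element 3 rows above" rule can also be stated directly with lags, e.g. via data.table::shift(); the first three entries then come out NA rather than 0, as suggested above:
library(data.table)
# row i gets Nums[i - 2] + Nums[i - 3]; rows 1 to 3 have no history, hence NA
df$Cum.sums <- shift(df$Nums, 2) + shift(df$Nums, 3)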