Suppose I need to apply an MA(5) to a batch of market data stored in an xts object. I can easily pull the subset of data I want smoothed with xts subsetting:
x['2013-12-05 17:00:01/2013-12-06 17:00:00']
However, I need an additional 5 observations prior to the first one in my subset to "prime" the filter. Is there an easy way to do this?
The only thing I have been able to figure out is really ugly, with explicit row numbers (here using xts sample data):
require(xts)
data(sample_matrix)
x <- as.xts(sample_matrix)
x$rn <- row(x[,1])
frst <- first(x['2007-05-18'])$rn
finl <- last(x['2007-06-09'])$rn
ans <- x[(frst-5):finl,]
Can I just say bleah? Somebody help me.
UPDATE: by popular request, a short example that applies an MA(5) to the daily data in sample_matrix:
require(xts)
data(sample_matrix)
x <- as.xts(sample_matrix)$Close
calc_weights <- function(x) {
  ## replace rnorm with sophisticated analysis
  wgts <- matrix(rnorm(5, 0, 0.5), nrow = 1)
  xts(wgts, index(last(x)))
}
smooth_days <- function(x, wgts) {
  w <- wgts[index(last(x))]
  out <- filter(x, w, sides = 1)
  xts(out, index(x))
}
set.seed(1.23456789)
wgts <- apply.weekly(x, calc_weights)
lapply(split(x, f='weeks'), smooth_days, wgts)
For brevity, only the final week's output:
[[26]]
[,1]
2007-06-25 NA
2007-06-26 NA
2007-06-27 NA
2007-06-28 NA
2007-06-29 -9.581503
2007-06-30 -9.581208
The NAs here are my problem. I want to recalculate my weights for each week of data, and apply those new weights to the upcoming week. Rinse, repeat. In real life, I replace the lapply with some ugly stuff with row indexes, but I'm sure there's a better way.
In an attempt to define the problem clearly: this appears to be a conflict between the desire to run an analysis on non-overlapping time periods (weeks, in this case) and the need for overlapping periods of data (two weeks, in this case) to perform the calculation.
Here's one way to do this using endpoints and a for loop. You could still use the which.i=TRUE suggestion in my comment, but integer subsetting is faster.
y <- x*NA # pre-allocate result
ep <- endpoints(x,"weeks") # time points where parameters change
set.seed(1.23456789)
for (i in seq_along(ep)[-(1:2)]) {
  rng1 <- ep[i-1]:ep[i]   # obs to calc weights
  rng2 <- ep[i-2]:ep[i]   # "prime" obs
  wgts <- calc_weights(x[rng1])
  # calc smooth_days on rng2, but only keep rng1 results
  y[rng1] <- smooth_days(x[rng2], wgts)[index(x[rng1])]
}
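For completeness, a minimal sketch of the which.i=TRUE idea mentioned above (the date range is illustrative): subsetting with which.i=TRUE returns the integer locations of the matched rows rather than the data, so you can extend the window backwards by however many priming observations you need.
# sketch only: integer locations of the target window, extended back 5 rows for priming
loc <- x['2007-06-25/2007-06-30', which.i = TRUE]  # row numbers, not data
ans <- x[(min(loc) - 5):max(loc), ]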
Related
My data frame contains 22 columns: "DATE", "INDEX" and S1, S2, S3 ... S20. It has 4322 rows. I want to calculate log returns and store the results in a data frame. That should give me 4321 rows.
I run the code below, but I am sure there is a much more elegant, shorter way to do the calculation.
# count the number of rows so the following (n-1) formulas work appropriately
n <- nrow(df)
# calculating the log returns (natural logarithm), of INDEX and S1-20
LogRet_INDEX <- log(df$INDEX[2:n])-log(df$INDEX[1:(n-1)])
LogRet_S1 <- log(df$S1[2:n])-log(df$S1[1:(n-1)])
LogRet_S2 <- log(df$S2[2:n])-log(df$S2[1:(n-1)])
LogRet_S3 <- log(df$S3[2:n])-log(df$S3[1:(n-1)])
LogRet_S4 <- log(df$S4[2:n])-log(df$S4[1:(n-1)])
LogRet_S5 <- log(df$S5[2:n])-log(df$S5[1:(n-1)])
LogRet_S6 <- log(df$S6[2:n])-log(df$S6[1:(n-1)])
LogRet_S7 <- log(df$S7[2:n])-log(df$S7[1:(n-1)])
LogRet_S8 <- log(df$S8[2:n])-log(df$S8[1:(n-1)])
LogRet_S9 <- log(df$S9[2:n])-log(df$S9[1:(n-1)])
LogRet_S10 <- log(df$S10[2:n])-log(df$S10[1:(n-1)])
LogRet_S11 <- log(df$S11[2:n])-log(df$S11[1:(n-1)])
LogRet_S12 <- log(df$S12[2:n])-log(df$S12[1:(n-1)])
LogRet_S13 <- log(df$S13[2:n])-log(df$S13[1:(n-1)])
LogRet_S14 <- log(df$S14[2:n])-log(df$S14[1:(n-1)])
LogRet_S15 <- log(df$S15[2:n])-log(df$S15[1:(n-1)])
LogRet_S16 <- log(df$S16[2:n])-log(df$S16[1:(n-1)])
LogRet_S17 <- log(df$S17[2:n])-log(df$S17[1:(n-1)])
LogRet_S18 <- log(df$S18[2:n])-log(df$S18[1:(n-1)])
LogRet_S19 <- log(df$S19[2:n])-log(df$S19[1:(n-1)])
LogRet_S20 <- log(df$S20[2:n])-log(df$S20[1:(n-1)])
# adding the results from the previous calculation (log returns) to a data frame
LogRet_df <- data.frame(LogRet_INDEX, LogRet_S1, LogRet_S2, LogRet_S3, LogRet_S4, LogRet_S5, LogRet_S6, LogRet_S7, LogRet_S8, LogRet_S9, LogRet_S10, LogRet_S11, LogRet_S12, LogRet_S13, LogRet_S14, LogRet_S15, LogRet_S16, LogRet_S17, LogRet_S18, LogRet_S19, LogRet_S20)
Is there a way to make this code shorter, maybe with some kind of loop or a for statement? I am quite new to R and trying to improve my knowledge.
Any kind of help is highly appreciated!
You can use sapply to apply a function to each column of the data.frame.
The code below 1) takes columns 2 to 22 from the data frame called df, 2) for each of these columns, takes the logarithm and then the difference between neighboring rows, and 3) converts the result to a data frame called df2.
df2 <- as.data.frame(sapply(df[2:22], function(x) diff(log(x))))
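If you also want to keep the dates aligned with the returns, one hedged variant (assuming DATE is the first of the 22 columns, as described) is to drop the first date and bind it back on:
# sketch: bind the shifted DATE column back onto the log returns
df2 <- data.frame(DATE = df$DATE[-1],
                  sapply(df[2:22], function(x) diff(log(x))))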
Part of a function I am including in an R package involves filling NAs with last observation carried forward (locf). The locf should be applied to all columns in the data frame except what I call below the good columns, goodcols (i.e. it should be applied to the badcols). The column names for the badcols can be anything. I use the locf function below and a for loop to achieve this. However, the for loop is a bit slow on large data sets. Can anybody suggest a faster alternative or another way of filling in the NAs in the presented scenario?
Here is an example data frame:
#Test df
TIME <- c(0,5,10,15,20,25,30,40,50)
AMT <- c(50,0,0,0,50,0,0,0,0)
COV1 <- c(10,9,NA,NA,5,5,NA,10,NA)
COV2 <- c(20,15,15,NA,NA,10,NA,30,NA)
ID <- rep(1, times=length(TIME))
df <- data.frame(ID,TIME,AMT,COV1,COV2)
df <- expand.grid(df)  # blow the toy data up to many rows (all column-value combinations)
goodcols <- c("ID","TIME","AMT")
badcols <- which(!(names(df) %in% goodcols))
#----------------------------------------------------
#locf function
locf <- function(x) {
  good <- !is.na(x)
  positions <- seq(length(x))
  good.positions <- good * positions
  last.good.position <- cummax(good.positions)
  last.good.position[last.good.position == 0] <- NA
  x[last.good.position]
}
#------------------------------------------------------
#Now fill in the gaps by locf function
for (i in badcols) {
  df[,i] <- locf(df[,i])
}
Sorry for writing an answer (not enough reputation to just comment), but what prevents you from doing what #ProcrastinatusMaximus suggested? You can include the zoo call directly in your loop. It would look like this:
for (i in badcols) {
  df[,i] <- zoo::na.locf(df[,i])
}
I am not sure if zoo is faster than your implementation. You would have to try this out. You could also check spacetime::na.locf, imputeTS::na.locf to see which of the existing locf implementations is the fastest.
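For what it's worth, the explicit for loop is not strictly needed either; a minimal sketch using the same badcols as above (na.rm = FALSE keeps any leading NAs instead of dropping rows):
# sketch: fill all bad columns in one go
df[badcols] <- lapply(df[badcols], zoo::na.locf, na.rm = FALSE)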
I've gotten fairly good with the *apply family of functions, and I've recently learned to use do.call("rbind", by(...)) as a wrapper for tapply. I'm working with a large data set (Compustat) and I have a function (see below) that generates a new column of lagged variables which I later attach to the main data frame df.
My problem is that it is extremely slow. I create about two dozen lagged variables, and the processing in this function takes approximately 1.5 hours because there are 350,000+ firm-year observations in the data set.
Can anyone help improve the speed of this function without losing the aspects that I find desirable:
#' lag vector of unknown size (for do.call-rbind-by: using datadate to track)
lag.vec <- function(x) {
  x <- x[order(x$datadate), ]  # sort data into ascending order by date
  var <- x[,2]                 # take the variable by position; its name doesn't matter here
  var.name <- paste(names(x)[2], "lag", sep = '.')  # keep variable name
  if (length(var) > 1) {       # no lagging if single observation
    lagged <- c(NA, var[1:(length(var)-1)])
    datelag <- c(x$datadate[1], x$datadate[1:(length(x$datadate) - 1)])
    datediff <- x$datadate - datelag
    y <- data.frame(x$datadate, datediff, lagged)  # join lagged variable and difference in YYYYMMDD dates
    y$lagged[y$datediff >= 20000 & !is.na(y$datediff)] <- NA  # 2 or more full years' difference
    y <- y[, c('x.datadate', 'lagged')]
    names(y) <- c("datadate", var.name)
  } else {
    y <- c(x$datadate[1], NA)
    names(y) <- c("datadate", var.name)
  }
  return(y)
}
I then call this function in a command separately for each variable that I want to generate a lagged series for (here I use the ni variable as an example):
ni_lag <- do.call('rbind', by(df[ , c('datadate', 'ni')], df$gvkey, lag.vec))
where gvkey is the ID number for the particular firm and datadate is an 8-digit integer of the form YYYYMMDD.
The approach was much faster when I used a simpler function:
lag.vec.seq <- function(x) {  # lag vector when all data points are present, in order
  if (length(x) > 1) {
    y <- c(NA, x[1:(length(x)-1)])
  } else {
    y <- NA
  }
  return(y)
}
along with the tapply command in something like
ni_lag <- as.vector(unlist(tapply(df$ni, df$gvkey, lag.vec.seq)))
As you can see the main difference is that the tapply approach doesn't include any datadate information and so the function assumes that all data are sequential (i.e., there are no missing years in the dataframe). Since I know there are missing years, I built the do.call-by function to account for that.
Some notes:
1) The first order command in the function is probably unnecessary since my data is ordered by gvkey and datadate in advance (e.g. df <- df[order(df$gvkey, df$datadate), ]). However, I'm always a bit afraid that R messes up my row ordering when I use functional programming like this. Is that an unfounded fear?
2) Identifying what is slowing down the processing would be very helpful. Is it the renaming of variables? The creation of a new data frame in the function? Or is the do.call with by just typically (much) slower than tapply?
Thank you!
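For what it's worth, here is a minimal sketch of one vectorized alternative using data.table::shift, assuming df has gvkey, datadate (YYYYMMDD integers) and ni columns as described; the date-gap rule mirrors the >= 20000 check in lag.vec:
library(data.table)
dt <- as.data.table(df)
setkey(dt, gvkey, datadate)                               # sort by firm, then date
dt[, ni.lag   := shift(ni), by = gvkey]                   # one-period lag within each firm
dt[, datediff := datadate - shift(datadate), by = gvkey]  # gap to the previous observation
dt[datediff >= 20000, ni.lag := NA]                       # blank lags across gaps of 2+ years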
I am currently trying to use plyr + reshape2 to process my data, but it is taking a lot of time.
I have a dataframe (df) with 3 columns: network, user_id and date.
My goal is:
To split df by two levels (network and user_id);
apply a function (get_interval) to each split;
bind the results into another dataframe (df2).
get_interval returns a vector of the same length as the number of rows of the input.
Thus, df2 has the same number of rows as df, but with the results computed by get_interval.
The problem is that I cannot use ddply directly, since it only handles vectors of equal length and the results of the function have varying lengths.
I came up with this solution:
aux <- melt(dlply(df,.(network,user_id), get_interval))
df2 <- cbind(interval=aux$value,colsplit(aux$L1,"\\.",names=c("network","user_id")))
But it is very inefficient, and since df is quite big I waste hours every time I have to run it.
Is there a way of doing this more efficiently?
EDIT
The basic operation of get_interval is as follows:
get_interval <- function(df){
  if (nrow(df) < 2)
    return(NA)
  x <- c(NA, df$date[-1] - df$date[-nrow(df)])
  return(x)  ## ceiling won't work because some intervals are 0.
}
It is possible to generate this data artificially with:
n <- 1000000
ref_time <- as.POSIXct("2013-12-17 00:00:00")
interval_range <- 86400*10 # 10 days
df <- data.frame(user_id = floor(runif(n, 1, n/10)),
                 network = gl(2, n, labels = c("anet","unet")),
                 date    = as.POSIXct(ref_time - runif(n, 0, interval_range)))
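For comparison, a hedged sketch of the same grouped computation done in place with data.table (assuming the network, user_id and date columns described above); this avoids the melt/colsplit round-trip entirely:
library(data.table)
dt <- as.data.table(df)
# interval to the previous observation within each (network, user_id) group
dt[, interval := c(NA_real_, diff(as.numeric(date))), by = .(network, user_id)]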
Preliminaries: this question is mostly of educational value; the actual task at hand has been completed, even if the approach is not entirely optimal. My question is whether the code below can be optimized for speed and/or implemented more elegantly, perhaps using additional packages such as plyr or reshape. Run on the actual data it takes about 140 seconds, much longer than on the simulated data, since some of the original rows contain nothing but NAs and additional checks have to be made. For comparison, the simulated data are processed in about 30 seconds.
Conditions: the dataset contains 360 variables, 30 times the set of 12. Let's name them V1_1, V1_2... (first set), V2_1, V2_2 ... (second set) and so forth. Each set of 12 variables contains dichotomous (yes/no) responses, in practice corresponding to a career status. For instance: work (yes/no), study (yes/no) and so forth, in total 12 statuses, repeated 30 times.
Task: the task at hand is to recode each set of 12 dichotomous variables into a single variable with 12 response categories (e.g. work, study... ). Ultimately we should get 30 variables, each with 12 response categories.
Data: I cannot post the actual dataset, but here is a good simulated approximation:
randomRow <- function() {
  # make a row with a single 1 and some NAs
  sample(x = c(rep(0,9), 1, NA, NA), size = 12, replace = FALSE)
}
# create a data frame with 12 variables and 1500 cases
makeDf <- function() {
  data <- matrix(NA, ncol = 12, nrow = 1500)
  for (i in 1:1500) {
    data[i,] <- randomRow()
  }
  return(data)
}
mydata <- NULL
# combine 30 of these data frames horizontally
for (i in 1:30) {
  mydata <- cbind(mydata, makeDf())
}
mydata <- as.data.frame(mydata)  # example data ready
My solution:
# Divide the dataset into a list of 30 data frames, each with 12 variables
S1 <- lapply(1:30, function(i) {
  Z <- rep(1:30, each = 12)  # define selection vector
  mydata[Z == i]             # use selection vector to get groups of 12 variables
})
recodeDf <- function(df) {
  result <- as.numeric(apply(df, 1, function(x) {
    if (any(!is.na(x))) which(x == 1) else NA  # return the position of the 1 per row
  }))  # the if/else check is for the real data, where some rows are all NA
  return(result)
}
# Combine individual position vectors into a dataframe
final.df <- as.data.frame(do.call(cbind,lapply(S1,recodeDf)))
All in all, there is a double *apply function, one across the list, the other across the dataframe rows. This makes it a bit slow. Any suggestions? Thanks in advance.
Here is an approach using data.table's set that is basically instantaneous (system.time = 0.1 seconds). The columnMatch component will depend on your data, but if it is every 12 columns, then the following will work.
library(data.table)
MYD <- data.table(mydata)
# a new data.table (changed to numeric : Arun)
newDT <- as.data.table(replicate(30, numeric(nrow(MYD)), simplify = FALSE))
# for each column, which values equal 1
whiches <- lapply(MYD, function(x) which(x == 1))
# create a list of column matches (those you wish to aggregate)
columnMatch <- split(names(mydata), rep(1:30, each = 12))
setattr(columnMatch, 'names', names(newDT))
# cycle through all new columns
# and assign the rows in the new data.table
## Arun: had to generate numeric indices for
## cycling through 1:12, 13:24 in whiches[[.]]. That was the problem.
for (jj in seq_along(columnMatch)) {
  for (ii in seq_along(columnMatch[[jj]])) {
    set(newDT, j = jj, i = whiches[[ii + 12 * (jj-1)]], value = ii)
  }
}
This would work just as well when adding the columns by reference to the original. Note that set works on data.frames as well.
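A tiny illustration of that last point, on hypothetical toy data (not from the question):
DF <- data.frame(a = 1:3, b = letters[1:3])
data.table::set(DF, i = 2L, j = 1L, value = 99L)  # modifies DF in place
DF$a  # 1 99 3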
I really like #Arun's matrix multiplication idea. Interestingly, if you compile R against an optimized BLAS such as OpenBLAS, you could get it to operate in parallel.
However, I wanted to provide another solution, perhaps slower than the matrix multiplication, that uses your original pattern but is much faster than your implementation:
# Match is usually faster than which, because it only returns the first match
# (and therefore won't fail on multiple matches)
# It also neatly handles your *all NA* case
recodeDf2 <- function(df) apply(df,1,match,x=1)
# You can split your data.frame by column with split.default
# (Using split on data.frame will split-by-row)
S2<-split.default(mydata,rep(1:30,each=12))
final.df2<-lapply(S2,recodeDf2)
If you had a very large data frame, and many processors, you may consider parallelizing this operation with:
library(parallel)
final.df2<-mclapply(S2,recodeDf2,mc.cores=numcores)
# Where numcores is your number of processors.
Having read #Arun and #mnel, I learned a lot about how to improve this function: avoid the coercion to an array and process the data.frame by column instead of by row. I don't mean to "steal" an answer here; the OP should consider switching the checkbox to #mnel's answer.
I wanted, however, to share a solution that doesn't use data.table, and avoids for. It is still, however, slower than #mnel's solution, albeit slightly.
nograpes2 <- function(mydata) {
  test <- function(df) {
    l <- lapply(df, function(x) which(x == 1))
    lens <- lengths(l)
    rep.int(seq_along(l), times = lens)[order(unlist(l))]
  }
  S2 <- split.default(mydata, rep(1:30, each = 12))
  data.frame(lapply(S2, test))
}
I would also like to add that #Aaron's approach, using which with arr.ind=TRUE, would also be very fast and elegant if mydata started out as a matrix rather than a data.frame. Coercion to a matrix is slower than the rest of the function. If speed were an issue, it would be worth considering reading the data in as a matrix in the first place.
IIUC, you have only one 1 per set of 12 columns, with the rest being 0's or NA's. If so, the operation can be performed much faster with this idea.
The idea: Instead of going through each row and asking for the position of 1, you could use a matrix with dimensions 1500 * 12 where each row is just 1:12. That is:
mul.mat <- matrix(rep(1:12, nrow(DT)), ncol = 12, byrow=TRUE)
Now, you can multiply this matrix with each of your subsetted data.frames (of the same dimensions, 1500*12 here) and then take their rowSums (which is vectorised) with na.rm = TRUE. This directly gives the position of the 1 within each row (because that 1 will have been multiplied by the corresponding value between 1 and 12).
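To make the idea concrete, a minimal base R sketch on just the first block of 12 columns (assuming mydata from the question):
sub <- as.matrix(mydata[, 1:12])                 # first set of 12 dichotomous variables
mul.mat <- matrix(rep(1:12, nrow(sub)), ncol = 12, byrow = TRUE)
head(rowSums(sub * mul.mat, na.rm = TRUE))       # position of the 1 in each row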
data.table implementation: here I'll use data.table to illustrate the idea. Since it creates columns by reference, I'd expect the same idea used on a data.frame to be a tad slower, although it should still drastically speed up your current code.
require(data.table)
DT <- data.table(mydata)
ids <- seq(1, ncol(DT), by=12)
# for multiplying with each subset and taking rowSums to get position of 1
mul.mat <- matrix(rep(1:12, nrow(DT)), ncol = 12, byrow=TRUE)
for (i in ids) {
  sdcols <- i:(i+12-1)
  # keep appending the new columns by reference to the original data
  DT[, paste0("R", i %/% 12 + 1) := rowSums(.SD * mul.mat, na.rm = TRUE),
     .SDcols = sdcols]
}
# delete all original 360 columns by reference from the original data
DT[, grep("V", names(DT), value=TRUE) := NULL]
Now, you'll be left with 30 columns that correspond to the position of 1's. On my system, this takes about 0.4 seconds.
all(unlist(final.df) == unlist(DT)) # not a fan of `identical`
# [1] TRUE
Another way this could be done with base R is to simply compute the values you want to put in the new matrix and fill them in directly with matrix indexing.
idx <- which(mydata == 1, arr.ind = TRUE)  # get indices of the 1's
i <- ((idx[,2] - 1) %% 12) + 1             # position of the 1 within its group of 12
idx[,2] <- ((idx[,2] - 1) %/% 12) + 1      # get "group" and put it in the "col" of idx
out <- array(NA, dim = c(1500, 30))        # make an empty matrix
out[idx] <- i                              # and fill it in!