Last observation carried forward in a data frame - R

Part of a function I am including in an R package involves filling NAs by last observation carried forward (LOCF). The LOCF should be applied to all columns in the data frame except what I call below the good columns, goodcols (i.e. it should be applied to the badcols). The column names of the badcols can be anything. I currently use the locf function below together with a for loop to achieve this. However, the for loop is a bit slow on large data sets. Can anybody suggest a faster alternative, or another way of filling in the NAs in this scenario?
Here is an example data frame:
#Test df
TIME <- c(0, 5, 10, 15, 20, 25, 30, 40, 50)
AMT <- c(50, 0, 0, 0, 50, 0, 0, 0, 0)
COV1 <- c(10, 9, NA, NA, 5, 5, NA, 10, NA)
COV2 <- c(20, 15, 15, NA, NA, 10, NA, 30, NA)
ID <- rep(1, times = length(TIME))
df <- data.frame(ID, TIME, AMT, COV1, COV2)
df <- expand.grid(df)  # expand to all combinations to get a larger test set
goodcols <- c("ID", "TIME", "AMT")
badcols <- which(!(names(df) %in% goodcols))
#----------------------------------------------------
#locf function
locf <- function(x) {
  good <- !is.na(x)                                   # TRUE where an observation exists
  positions <- seq(length(x))
  good.positions <- good * positions                  # position of each value, 0 where NA
  last.good.position <- cummax(good.positions)        # index of the most recent non-NA value
  last.good.position[last.good.position == 0] <- NA   # leading NAs stay NA
  x[last.good.position]
}
#------------------------------------------------------
#Now fill in the gaps with the locf function
for (i in badcols) {
  df[, i] <- locf(df[, i])
}

Sorry for writing an answer (not enough reputation to just comment), but what prevents you from doing what @ProcrastinatusMaximus said? You can include the zoo call in your loop. It would look like this:
for (i in badcols) {
  df[, i] <- zoo::na.locf(df[, i], na.rm = FALSE)  # na.rm = FALSE keeps leading NAs so the length still matches
}
I am not sure whether zoo is faster than your implementation; you would have to try that out. You could also check spacetime::na.locf and imputeTS::na.locf to see which of the existing locf implementations is the fastest.
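One way to try that out is a quick benchmark. A minimal sketch, assuming the microbenchmark package is installed and using one covariate column from the example above (the whole column loop could likewise be written as df[badcols] <- lapply(df[badcols], zoo::na.locf, na.rm = FALSE)):
library(microbenchmark)
microbenchmark(
  manual = locf(df$COV1),
  zoo    = zoo::na.locf(df$COV1, na.rm = FALSE),
  times  = 100
)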

Related

Split Apply Combine

I have a large list, and would like to apply the exact technique detailed in the answer here:
Create mutually exclusive dummy variables from categorical variable in R
However, my data is much larger, and I would like to split, apply and combine the operation to each individual row.
This code, which of course does not work, illustrates what I am trying to do:
id <- c(1, 1, 1, 1)
time <- as.character(c(1, 2, 3, 4))
df <- data.frame(id, time)
unique.time <- as.character(unique(df$time))
df1 <- split(df, row(df))
res <- lapply(df1, function(d) sapply(unique.time, function(x) as.numeric(d$time == x)))
z <- unsplit(res, row(df))
Thanks!
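For what it is worth, the row-wise split may not be needed at all. Here is a vectorised sketch of what I understand the goal to be (my assumption, not the exact code from the linked answer): build one 0/1 column per unique time value over the whole vector at once.
id <- c(1, 1, 1, 1)
time <- as.character(c(1, 2, 3, 4))
df <- data.frame(id, time)
unique.time <- as.character(unique(df$time))
# one dummy column per unique time value, no split/unsplit required
dummies <- sapply(unique.time, function(x) as.numeric(df$time == x))
colnames(dummies) <- paste0("time_", unique.time)
cbind(df, dummies)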

Stepwise reducing the input dataframe in a loop

I need to iteratively evaluate the variance of a dataset while I reduce the data.frame row by row in each step. As an example:
data <- matrix(runif(100), 10, 10)
perc <- vector("list", nrow(data))
sums <- sum(data)
for (i in 1:nrow(data)) {
  data <- data[-1, ]
  perc[[i]] <- sum(data)/sums  # in reality, here are ~8 additional lines of code
}
I don't like that data is re-initialized in every step, and that the loop breaks with an error when data is emptied.
So the questions are:
1. How can I express data <- data[-1,] in an incrementing way (something like tmp <- data[-c(1:i),], which doesn't work)?
2. Is there a way to stop the loop before the last row is removed from data?
You could try
set.seed(123)
data <- matrix(runif(100),10,10)
sums <- sum(data)
perc <- lapply(2:nrow(data), function(x) sum(data[x:nrow(data), ])/sums)
The above code yields the same result as your original code, but without the error and without modifying data.
perc1 <- list()
for (i in 1:nrow(data)) {
  data <- data[-1, ]
  perc1[[i]] <- sum(data)/sums
}
identical(perc,perc1)
#[1] TRUE
If you wish to preserve the for loop in order to perform other calculations within the loop, you could try:
for (i in 2:nrow(data)) {
  perc[[i - 1]] <- sum(data[i:nrow(data), ])/sums
  # do more stuff here
}
identical(perc,perc1)
#[1] TRUE
If you are using the loop index i for other calculations within the loop, you will most likely need to replace it with i - 1, depending on what is calculated.
You can use lapply:
res <- lapply(2:nrow(data), function(i) sum(data[i:nrow(data), ])/sums)
You can write the loop part like this:
for (i in 2:nrow(data)) {
  perc[[i - 1]] <- sum(data[i:nrow(data), ])/sums  # in reality, here are ~8 additional lines of code
}
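A fully vectorised variant is also possible with cumsum, which computes every "remaining rows" sum in one pass. A sketch (my addition, assuming data is still the original 10 x 10 matrix):
remaining <- rev(cumsum(rev(rowSums(data))))   # remaining[i] = sum of rows i:n
perc.vec <- remaining[-1]/sums                 # drop i = 1 to mirror the loops above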

convert X-Y data.frame to matrix for every column in R efficiently

I have found a way to do this using reshape2, but it is quite slow and doesn't give me exactly what I want. I have a data.frame that looks like this:
df <- data.frame(expand.grid(1:10, 1:10))
colnames(df) <- c("x", "y")
for (i in 3:10) {
  df[i] <- runif(100, 10, 100)
}
I run:
require(reshape2)
matrices <- lapply(colnames(df)[-c(1:2)], function(x) {
  mat <- acast(df, y ~ x, value.var = x, fill = 0, fun.aggregate = mean)
  return(mat)
})
That gives me a list of matrices, one for each value column in my data, which I can transform into an array of dimension 1:10, 1:10, 1:10. However, I am looking for a faster way to do this, as my datasets can contain many value columns and this process can take a long time; I can't seem to find a more efficient way of doing it.
Thanks for any help.
If your data.frame is stored regularly as you say, you could accomplish this in a for loop, which may actually be faster than casting:
# preallocate array
myArray <- array(0, dim = c(10, 10, 10))
# loop through:
for (i in 1:10) {
  myArray[, , i] <- as.matrix(df[df$y == i, ])
}
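If the expand.grid ordering is guaranteed (x varying fastest within y, as in the example), another option is to reshape each value column directly instead of subsetting per slice. A sketch of that idea (my assumption, not part of the answer above; each slice comes out with rows = y and columns = x, like acast(df, y ~ x, ...)):
valcols <- names(df)[-(1:2)]
myArray2 <- array(NA_real_, dim = c(10, 10, length(valcols)),
                  dimnames = list(y = as.character(1:10), x = as.character(1:10), value = valcols))
for (k in seq_along(valcols)) {
  # each column holds 100 values ordered x-fastest; reshape, then transpose to y ~ x
  myArray2[, , k] <- t(matrix(df[[valcols[k]]], nrow = 10, ncol = 10))
}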

R - Improve speed of do.call / by function

I've gotten fairly good with the *apply family of functions, and I've recently learned to use do.call("rbind", by(...)) as a wrapper for tapply. I'm working with a large data set (Compustat), and I have a function (see below) that generates a new column of lagged variables which I later attach to the main data frame df.
My problem is that it is extremely slow. I create about two dozen lagged variables, and the processing in this function takes approximately 1.5 hours because there are 350,000+ firm-year observations in the data set.
Can anyone help improve the speed of this function without losing the aspects that I find desirable:
#' lag vector of unknown size (for do.call-rbind-by; uses datadate to track)
lag.vec <- function(x){
  x <- x[order(x$datadate), ]                        # sort data into ascending order by date
  var <- x[, 2]                                      # the specific variable name in data.frame x is hereby ignored
  var.name <- paste(names(x)[2], "lag", sep = '.')   # keep variable name
  if (length(var) > 1) {                             # no lagging if single observation
    lagged <- c(NA, var[1:(length(var) - 1)])
    datelag <- c(x$datadate[1], x$datadate[1:(length(x$datadate) - 1)])
    datediff <- x$datadate - datelag
    y <- data.frame(x$datadate, datediff, lagged)    # join lagged variable and difference in YYYYMMDD dates
    y$lagged[y$datediff >= 20000 & !is.na(y$datediff)] <- NA  # 2 or more full years' difference
    y <- y[, c('x.datadate', 'lagged')]
    names(y) <- c("datadate", var.name)
  } else {
    y <- c(x$datadate[1], NA)
    names(y) <- c("datadate", var.name)
  }
  return(y)
}
I then call this function in a command separately for each variable that I want to generate a lagged series for (here I use the ni variable as an example):
ni_lag <- do.call('rbind', by(df[ , c('datadate', 'ni')], df$gvkey, lag.vec))
where gvkey is the ID number for the particular firm and datadate is an 8-digit integer of the form YYYYMMDD.
The approach was much faster when I used a simpler function:
lag.vec.seq <- function(x){  #' lag vector when all data points are present, in order
  if (length(x) > 1) {
    y <- c(NA, x[1:(length(x) - 1)])
  } else {
    y <- NA
  }
  return(y)
}
along with the tapply command in something like
ni_lag <- as.vector(unlist(tapply(df$ni, df$gvkey, lag.vec.seq)))
As you can see, the main difference is that the tapply approach doesn't include any datadate information, so the function assumes that all data are sequential (i.e., there are no missing years in the data frame). Since I know there are missing years, I built the do.call-by function to account for that.
Some notes:
1) The first order command in the function is probably unnecessary, since my data is ordered by gvkey and datadate in advance (e.g. df <- df[order(df$gvkey, df$datadate), ]). However, I'm always a bit afraid that R messes up my row ordering when I use functional programming like this. Is that an unfounded fear?
2) Identifying what is slowing down the processing would be very helpful. Is it the renaming of variables? The creation of a new data frame inside the function? Or is do.call with by just typically (much) slower than tapply?
Thank you!
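One direction that might be worth profiling (my sketch, not part of the original post): do the group-wise lag and the date-gap check with data.table::shift, which avoids building a data.frame per firm. It assumes df contains gvkey, datadate and ni as described in the question and that data.table is installed.
library(data.table)
dt <- as.data.table(df)
setorder(dt, gvkey, datadate)                                 # sort once, by firm then date
dt[, ni.lag := shift(ni), by = gvkey]                         # previous ni within each firm
dt[, datediff := datadate - shift(datadate), by = gvkey]      # gap in YYYYMMDD units
dt[!is.na(datediff) & datediff >= 20000, ni.lag := NA_real_]  # drop lags spanning 2+ years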

Prepend xts rows to a subset

Suppose I need to apply an MA(5) to a batch of market data stored in an xts object. I can easily pull the subset of data I want smoothed with xts subsetting:
x['2013-12-05 17:00:01/2013-12-06 17:00:00']
However, I need an additional 5 observations prior to the first one in my subset to "prime" the filter. Is there an easy way to do this?
The only thing I have been able to figure out is really ugly, with explicit row numbers (here using xts sample data):
require(xts)
data(sample_matrix)
x <- as.xts(sample_matrix)
x$rn <- row(x[,1])
frst <- first(x['2007-05-18'])$rn
finl <- last(x['2007-06-09'])$rn
ans <- x[(frst-5):finl,]
Can I just say bleah? Somebody help me.
UPDATE: by popular request, a short example that applies an MA(5) to the daily data in sample_matrix:
require(xts)
data(sample_matrix)
x <- as.xts(sample_matrix)$Close
calc_weights <- function(x) {
  ## replace rnorm with sophisticated analysis
  wgts <- matrix(rnorm(5, 0, 0.5), nrow = 1)
  xts(wgts, index(last(x)))
}
smooth_days <- function(x, wgts) {
  w <- wgts[index(last(x))]
  out <- filter(x, w, sides = 1)
  xts(out, index(x))
}
set.seed(1.23456789)
wgts <- apply.weekly(x, calc_weights)
lapply(split(x, f='weeks'), smooth_days, wgts)
For brevity, only the final week's output:
[[26]]
[,1]
2007-06-25 NA
2007-06-26 NA
2007-06-27 NA
2007-06-28 NA
2007-06-29 -9.581503
2007-06-30 -9.581208
The NAs here are my problem. I want to recalculate my weights for each week of data, and apply those new weights to the upcoming week. Rinse, repeat. In real life, I replace the lapply with some ugly stuff with row indexes, but I'm sure there's a better way.
In an attempt to define the problem clearly: this appears to be a conflict between the desire to run an analysis on non-overlapping time periods (weeks, in this case) and the need for overlapping periods of data (2 weeks, in this case) to perform the calculation.
Here's one way to do this using endpoints and a for loop. You could still use the which.i=TRUE suggestion in my comment, but integer subsetting is faster.
y <- x*NA                     # pre-allocate result
ep <- endpoints(x, "weeks")   # time points where parameters change
set.seed(1.23456789)
for (i in seq_along(ep)[-(1:2)]) {
  rng1 <- ep[i-1]:ep[i]       # obs to calc weights
  rng2 <- ep[i-2]:ep[i]       # "prime" obs
  wgts <- calc_weights(x[rng1])
  # calc smooth_days on rng2, but only keep rng1 results
  y[rng1] <- smooth_days(x[rng2], wgts)[index(x[rng1])]
}
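For completeness, the which.i = TRUE idea mentioned above could look roughly like this (a sketch under my reading of that comment; it keeps the date-based subsetting but converts it to integer positions so the five priming rows can be prepended):
idx <- x["2007-05-18/2007-06-09", which.i = TRUE]   # integer positions of the subset
ans <- x[(min(idx) - 5):max(idx), ]                 # prepend the 5 priming observations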
