Constrained randomization of column order in a data.frame - r

I am trying to duplicate each column from a data frame and move it to a randomly chosen position within 1-3 columns of where it started, and do this for every column in the data frame. I want columns to move AT LEAST one space to the left or right. Of course sample(data) reorders columns randomly, but my attempts to put it in a loop are embarrassingly bad (I admit I skipped the majority of my linear algebra classes, damn...). Below is some example data:
dat <- read.table(textConnection(
"-515.5718 94.33423 939.6324 -502.9918 -75.14629 946.6926
-515.2283 96.10239 939.5687 -503.1425 -73.39015 946.6360
-515.0044 97.68119 939.4177 -503.4021 -71.79252 946.6909
-514.7430 99.59141 939.3976 -503.6645 -70.08514 946.6887
-514.4449 101.08511 939.2342 -503.9207 -68.48133 946.7183
-514.2769 102.29453 939.0013 -504.2665 -67.04509 946.7809
-513.9294 104.02753 938.9436 -504.4703 -65.34361 946.7899
-513.5900 105.49624 938.7684 -504.7405 -63.75965 946.7991"
),header=F,as.is=T)
sample(dat)#random columns position

How about this brute-force but plenty-fast solution?
It tries out different permutations of the columns until it finds one in which each column is moved at least 1, and not more than 3 columns to left or right. When it finds such a permutation, the test in the final line of the while() call evaluates to FALSE, terminating the loop and leaving the variable x containing the acceptable permutation.
n <- ncol(dat)
while({x <- sample(n) # Proposed new column positions
y <- seq_len(n) # Original column positions
max(abs(x - y)) > 3 | min(abs(x - y)) == 0
}) NULL
dat[x]
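As a quick sanity check, the rejection loop above can be wrapped in a function and the constraint verified on the result. This is only a sketch: the function name `shuffle_within` and its `max_shift` argument are made up for illustration, and it assumes at least one valid permutation exists (otherwise the loop never ends).

```r
# Rejection sampling: draw permutations until every column has moved
# between 1 and max_shift positions. Hypothetical helper, not from the answer.
shuffle_within <- function(n, max_shift = 3) {
  orig <- seq_len(n)
  repeat {
    x <- sample(n)
    shift <- abs(x - orig)
    if (all(shift >= 1 & shift <= max_shift)) return(x)
  }
}

set.seed(42)
x <- shuffle_within(6)
shift <- abs(x - seq_len(6))
range(shift)   # both ends lie between 1 and 3
# dat[x] then gives the reordered data frame
```

Because the accepted permutation is drawn at random, the exact order differs from run to run unless the seed is fixed.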

I should probably wait to post this until I have time to comment it up, and discuss some of the ambiguities in the problem as currently specified in the comments above. But since I won't be able to do that, possibly for a while, I thought I'd give you code for a solution that you can examine yourself.
# Create a function that generates acceptable permutations of the data
getPermutation <- function(blockSize, # number of columns/block
nBlock, # number of blocks of data
fromBlocks) { # indices of blocks to be moved
X <- unique(as.vector(outer(fromBlocks, c(-2,-1,1,2), "+")))
# To remove nonsensical indices like 0 or -1
X <- X[X %in% seq.int(nBlock)]
while({toBlocks <- sample(X, size = length(fromBlocks))
max(abs(toBlocks - fromBlocks)) > 2 | min(abs(toBlocks - fromBlocks)) < 1
}) NULL
A <- seq.int(nBlock)
A[toBlocks] <- fromBlocks
A[fromBlocks] <- toBlocks
blockColIndices <-
lapply(seq.int(nBlock) - 1,
function(X) {
seq(from = X * blockSize + 1,
by = 1,
length.out = blockSize)
})
unlist(blockColIndices[A])
}
# Create an example dataset, a 90 column data.frame
dat <- as.data.frame(matrix(seq.int(90*4), ncol=90))
# Call the function for a data frame with 30 3-column blocks
# within which you want to move blocks 2, 14, and 15.
index <- getPermutation(3, 30, c(2, 14, 15))
newdat <- dat[index]


r: function with lag working across rows, not columns

I'm writing a function that takes the most recent observation and adds to it the previous day's value multiplied by a designated share. The version below uses a single share and works:
library(dplyr) # lag() with a 'default' argument comes from dplyr
df1<- data.frame(var1=rnorm(10,3,2), var2= rnorm(10, 4, 3))
df1$carryover<- lag(df1$var1, 1, default = 0)*(.5) + df1$var1
>df1
var1 var2 carryover
1 3.2894474 2.0839128 3.2894474
2 3.6059389 7.8880658 5.2506625
3 -1.4274057 6.2763882 0.3755637
4 3.8531253 3.2653448 3.1394225
My function attempts to do the same but across multiple different shares, see below:
carryover<- function(x){
result_df<- data.frame(x)
xnames<- names(x)
for (i in 1:7){
result_column<- lag(x, 1, default = 0)*(i/10) + x
result_column_name<- paste(xnames, i, sep= "_")
result_df[result_column_name] <- result_column
}
return(result_df)
}
When I run carryover(df1), df1$var1 remains the same across all iterations while df1$var2 takes lag values across rows, when I'm aiming for columns. What is structurally wrong with my function that stops it from lagging the column values?
I worked on this a bit using feedback from Stack Overflow and came up with the solution below: define the carryover function inside a larger function, then use apply with MARGIN=2 to calculate by column:
adStock<- function(x){
# create data frame to store results in
result_df<- data.frame(x)
# assign names to be applied as a column
xnames<- names(x)
# create list of carryovers
carryovers<- seq(.1, .7, .1)
# create carryover function
carryover<- function(x){
x + dplyr::lag(x, 1, default = 0)*(i)
}
# run for loop across all carryover values
for (i in carryovers){
result_column<- apply(x, 2, carryover)
result_column_name<- paste(xnames, i, sep= "_")
result_df[result_column_name] <- result_column
}
return(data.frame(result_df))
}
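The same per-column, per-share calculation can also be sketched in base R without `apply()`. This is an illustration rather than the poster's method: `lag1()` and `ad_stock()` are hypothetical names, and the sketch assumes every column of the input is numeric.

```r
# Shift a vector down by one position, padding the start with 0
# (a base R stand-in for dplyr::lag(v, 1, default = 0)).
lag1 <- function(v) c(0, head(v, -1))

# For each share and each column, add share * previous value to the column.
ad_stock <- function(x, shares = seq(0.1, 0.7, by = 0.1)) {
  out <- x
  for (s in shares) {
    for (nm in names(x)) {
      out[[paste(nm, s, sep = "_")]] <- x[[nm]] + s * lag1(x[[nm]])
    }
  }
  out
}

df1 <- data.frame(var1 = c(1, 2, 3), var2 = c(10, 20, 30))
res <- ad_stock(df1, shares = c(0.1, 0.5))
res$var1_0.5   # 1.0 2.5 4.0
```

Looping over column names keeps each lag inside its own column, which is the behaviour the question was after.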

R: Creating Grid and checking what row falls into each cell with for loop

I have looked through many forum posts and have not found an answer. I have a large list of latitudes and longitudes and would like to make a grid of them and based on that grid have each pair of lat/longs be assigned a cell reference from that grid. Eventually I want to assign values based on the cell reference. E.g. Lat 39.5645 and long -122.4654 fall into grid cell reference 1, the total number of murders in that cell are 16 and assaults are 21. There is a better way to do this but this is the only way I know of.
#number of segments, this determines size of grid
segments <- 5
#use these to divide up the arrays
Xcounter <-(max(cleantrain$X)-min(cleantrain$X))/segments
Ycounter <-(quantile (cleantrain$Y,.9999)-min(cleantrain$Y))/segments
#arrays created from the counter and lat and longs
Xarray <- as.data.frame(seq(from=min(cleantrain$X), to=max(cleantrain$X), by=Xcounter))
Yarray <-as.data.frame(seq(from=min(cleantrain$Y), to=quantile(cleantrain$Y,.9999), by=Ycounter))
#the max for the latitude is 90 but the .9999 percentile is ~39,
# but I still want the grid to include the 90
Yarray[6,1]<-max(cleantrain$Y)
#create dummy column so I know what the values shouldn't be when I print the results
cleantrain$Area <- seq(from =1, to=nrow(cleantrain), by =1)
#for loop that goes through once for each row in my data
for (k in 1:100) {
#this loop goes through the longitudes
for (i in 1:seg-1) {
#this loop goes though the latitudes
for (j in 1:seg-1){
#should check if the row fits into that grid
if(cleantrain$Y[k] < Yarray[(j+1),1] &&
cleantrain$X[k] < Xarray[(i+1),1] &&
cleantrain$Y[k] >= Yarray[j,1] &&
cleantrain$X[k] >= Xarray[i,1]){
#writes to the row the cell reference
cleantrain$Area[k] <- ((i-1)*segments+j)
}
}
}
}
#check the results
cleantrain$Area[1:100]
If you only write the i value to cleantrain$Area, it always prints 1 instead of 1-5, while the j loop prints 1-5 as it should. But if you swap the i and j references inside the if statement, then j is always 1 and i ranges over 1-5.
Here are my array values
#Yarray
1 37.70788
2 37.73030
3 37.75272
4 37.77514
5 37.79756
6 37.81998
#Xarray
1 -122.5136
2 -122.1109
3 -121.7082
4 -121.3055
5 -120.9027
6 -120.5000
EDIT:
Here are the first 10 lats and longs:
cleantrain$Y[1:10]
[1] 37.77460 37.77460 37.80041 37.80087 37.77154 37.71343 37.72514 37.72756 37.77660 37.80780
cleantrain$X[1:10]
[1] -122.4259 -122.4259 -122.4244 -122.4270 -122.4387 -122.4033 -122.4233 -122.3713 -122.5082 -122.4191
The code above isn't reproducible, but I think I get the gist of what you're trying to achieve. In this example, I will assume I have a bunch of x coordinates and y coordinates which are randomly generated. x in [0,1] and y in [10, 20].
df <- data.frame(xcoord = runif(1000), ycoord = runif(1000, min=10, max=20))
To assign points to a grid we simply need to bin each coordinate. The easiest way to do this is with the cut function. For example, to split xcoord into 10 equal-width bins and ycoord into 9, we would do:
df$x_cut <- as.numeric(cut(df$xcoord, 10))
df$y_cut <- as.numeric(cut(df$ycoord, 9))
If you want to cut things in a specific interval, you could do it like this: cut(runif(10), c(0, 0.2, 0.4, 0.6, 0.8, 1)), and see this answer for other ways.
Now we basically have our grid. If you want to map it to the individual numbers we can create such a mapping.
grid_index <- expand.grid(x_cut=1:10, y_cut=1:9)
grid_index$index <- row.names(grid_index)
and merge the two data frames to get the whole picture.
df_all <- merge(df, grid_index)
Full code:
df <- data.frame(xcoord = runif(1000), ycoord = runif(1000, min=10, max=20))
df$x_cut <- as.numeric(cut(df$xcoord, 10))
df$y_cut <- as.numeric(cut(df$ycoord, 9))
grid_index <- expand.grid(x_cut=1:10, y_cut=1:9)
grid_index$index <- row.names(grid_index)
df_all <- merge(df, grid_index)
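If the grid edges need to be fixed rather than derived from the data, one option is `findInterval()` with explicit breaks, combining the two bin numbers into a single cell id arithmetically. The sketch below is an illustration under assumptions: the break vectors, the column names and the numbering `(x_bin - 1) * n_y + y_bin` are choices made here, not part of the answer above.

```r
# Explicit grid edges: 5 bins in x on [0, 1], 4 bins in y on [10, 20]
x_breaks <- seq(0, 1, length.out = 6)
y_breaks <- seq(10, 20, length.out = 5)

pts <- data.frame(xcoord = c(0.05, 0.95, 0.50),
                  ycoord = c(10.1, 19.9, 15.0))

# rightmost.closed = TRUE puts points lying exactly on the last edge
# into the last bin instead of one past it
pts$x_bin <- findInterval(pts$xcoord, x_breaks, rightmost.closed = TRUE)
pts$y_bin <- findInterval(pts$ycoord, y_breaks, rightmost.closed = TRUE)

n_y <- length(y_breaks) - 1
pts$cell <- (pts$x_bin - 1) * n_y + pts$y_bin
pts$cell   # 1 20 11

# Count points per cell, e.g. as a first step toward per-cell totals
table(pts$cell)
```

Computing the cell id arithmetically avoids the merge step, at the cost of having to remember the numbering convention.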

use rollapply for certain rows

I wonder if it is possible to use rollapply() only for certain rows of a data frame. I know the "by" argument can make it calculate FUN at every by-th time point, but I have a specific vector of row indices at which I want to apply rollapply(). For example, I have the data frame below:
df <- data.frame(x = (1:10), y = (11:20))
I know how to calculate the rolling mean for y column when the rolling width is 3.
library(zoo)
m <- rollapply(df$y, width = 3, FUN = mean, fill = NA, align = "right")
But what if I want the width-3-mean only for the 4th and 9th row? Is there something in "by" argument that I can manipulate? Or some other better methods (using apply to do rolling calculation maybe)?
Hopefully I am understanding your question correctly. I think you are asking how to perform a function on every 4th and 9th element in a sliding window? If yes, just restrict your function to the 4th and 9th element using x[4] and x[9]. Like this:
output <- rollapply(df, 9, function(x) (x[4] + x[9])/2, fill = NA, align = "right")
Alternatively, you may be asking how to get the mean only for windows that contain the 4th or 9th row. This can be done by subsetting. The question to think about is where you want the 4th and 9th rows to sit within the window: at position x[1], x[2], or x[3]? What sits at the other positions will obviously affect your output. If you don't know, and all three seem reasonable, you will need to build a list of data frames containing the range of data you are interested in, and then use an apply function, or a for loop, to rollapply the mean function over each data frame in the list. You can then combine all of these outputs into a data frame to work with further. Like this:
# the rlist library has a function that allows us to add items to a list
# which will be handy later on
library(rlist)
library(zoo)
# your example data
df <- data.frame(x = (1:10), y = (11:20))
# a vector of your desired rows
desired_rows <- c(4,9)
# A for loop that generates a list of dataframes
# with your desired rows in the middle of each
for (i in desired_rows){
lower_bound <- i-2
upper_bound <- i+2
df_subset <- df[c(lower_bound:upper_bound), ]
if(exists("list_df_range")){
list_df_range <- list.append(list_df_range, df_subset)
}else{
list_df_range <- list(df_subset)
}
}
# a second for loop that applies your rollapply function to each
# data frame in the list and then
# returns a dataframe of the final results
# with each column named after the originating row
for (n in list_df_range){
m <- rollapply(n$y, width = 3, FUN = mean, fill = NA, align = "right")
if(exists("final_out")){
final_out <- cbind(final_out, m)
}else{
final_out <- data.frame(m)
}
}
names(final_out) <- desired_rows
Based on the comment below the question by the poster it seems that what is wanted is to take the mean of each rolling window of width 3 excluding the middle element in each window and only keeping the 4th and 9th elements so
cc <- c(4, 9)
rollapply(df$y, list(c(-2, 0)), mean, fill = NA)[cc]
## [1] 13 18
or
rollapplyr(df$y, 3, function(x) mean(x[-2]), fill = NA)[cc]
## [1] 13 18
or
sapply(cc, function(ix) mean(df$y[seq(to = ix, by = 2, length = 2)]))
## [1] 13 18
or
(df$y[cc - 2] + df$y[cc]) / 2
## [1] 13 18
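If the goal is simply the ordinary width-3 trailing mean at a handful of rows, as in the original question, a plain base R sketch avoids computing the other windows at all. It assumes every requested row has at least two rows before it; `rows` and `width` are names chosen here.

```r
df <- data.frame(x = 1:10, y = 11:20)
rows  <- c(4, 9)   # rows at which the rolling mean is wanted
width <- 3

# Mean of the `width` values ending at each requested row
m <- sapply(rows, function(i) mean(df$y[(i - width + 1):i]))
m   # 13 18
```

On this evenly spaced example the result happens to equal the values shown above that exclude the middle element; on other data the two calculations would generally differ.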

averaging every 16 columns in r [duplicate]

Possible Duplicate:
apply a function over groups of columns
I have a data.frame with 30 rows and many columns (1000+), but I need to average every 16 columns together. For example, the data frame will look like this (I truncate it to make it easier..):
Col1 Col2 Col3 Col4........
4.176 4.505 4.048 4.489
6.167 6.184 6.359 6.444
5.829 5.739 5.961 5.764
.
.
.
Therefore, I cannot aggregate (I do not have a list) and I tried:
a <- data.frame(rowMeans(my.df[,1:length(my.df)]) )
which gives me the average of all 1000+ columns. But is there any way to do that for every 16 columns until the end? (The total number of columns is a multiple of 16.)
A secondary, less important point but would be useful to solve this as well.
The col names are in the following structure:
XXYY4ZZZ.txt
Once the columns are averaged, all I need is a new column name with only XXYY, as the rest will be averaged out. I know I could use gsub, but is there an optimal way to do the averaging and this renaming in one go?
I am still relatively new to R and therefore I am not sure where and how to find the answer.
Here is an example adapted from @ben's question and @TylerRinker's answer from "apply a function over groups of columns". It should be able to apply any function over a matrix or data frame by intervals of columns.
# Create sample data for reproducible example
n <- 1000
set.seed(1234)
x <- matrix(runif(30 * n), ncol = n)
# Function to apply 'fun' to object 'x' over every 'by' columns
# Alternatively, 'by' may be a vector of groups
byapply <- function(x, by, fun, ...)
{
# Create index list
if (length(by) == 1)
{
nc <- ncol(x)
split.index <- rep(1:ceiling(nc / by), each = by, length.out = nc)
} else # 'by' is a vector of groups
{
nc <- length(by)
split.index <- by
}
index.list <- split(seq(from = 1, to = nc), split.index)
# Pass index list to fun using sapply() and return object
sapply(index.list, function(i)
{
do.call(fun, list(x[, i], ...))
})
}
# Run function
y <- byapply(x, 16, rowMeans)
# Test to make sure it returns expected result
y.test <- rowMeans(x[, 17:32])
all.equal(y[, 2], y.test)
# TRUE
You can do other odd things with it. For example, if you needed to know the total sum of every 10 columns, being sure to remove NAs if present:
y.sums <- byapply(x, 10, sum, na.rm = T)
y.sums[1]
# 146.7756
sum(x[, 1:10], na.rm = T)
# 146.7756
Or find the standard deviations:
byapply(x, 10, apply, 1, sd)
Update
by can also be specified as a vector of groups:
byapply(x, rep(1:10, each = 10), rowMeans)
This works for me on a much smaller data frame:
rowMeans(my.df[,seq(1,length(my.df),by=16)])
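For the secondary point about column names, one possible sketch groups the columns by a prefix of their names and averages within each group. It assumes that the first four characters (the `XXYY` part) identify a group and that all columns of a group share that prefix; the toy names and the group size of 2 below are invented for the example.

```r
# Toy data: 3 rows, 4 columns, two columns per group
my.df <- data.frame(AABB4001.txt = c(1, 2, 3),
                    AABB4002.txt = c(3, 4, 5),
                    CCDD4001.txt = c(10, 20, 30),
                    CCDD4002.txt = c(30, 40, 50))

# Group label for each column: the first four characters of its name
grp <- substr(names(my.df), 1, 4)

# Split column indices by group and take row means within each group
avg <- sapply(split(seq_along(my.df), grp),
              function(i) rowMeans(my.df[, i, drop = FALSE]))
avg   # column AABB holds 2 3 4, column CCDD holds 20 30 40
```

Note that `split()` orders the groups alphabetically by label, so the result columns may not follow the original column order.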

Replacing NAs in R with nearest value

I'm looking for something similar to na.locf() in the zoo package, but instead of always using the previous non-NA value I'd like to use the nearest non-NA value. Some example data:
dat <- c(1, 3, NA, NA, 5, 7)
Replacing NA with na.locf (3 is carried forward):
library(zoo)
na.locf(dat)
# 1 3 3 3 5 7
and na.locf with fromLast set to TRUE (5 is carried backwards):
na.locf(dat, fromLast = TRUE)
# 1 3 5 5 5 7
But I wish the nearest non-NA value to be used. In my example this means that the 3 should be carried forward to the first NA, and the 5 should be carried backwards to the second NA:
1 3 3 5 5 7
I have a solution coded up, but wanted to make sure that I wasn't reinventing the wheel. Is there something already floating around?
FYI, my current code is as follows. Perhaps if nothing else, someone can suggest how to make it more efficient. I feel like I'm missing an obvious way to improve this:
na.pos <- which(is.na(dat))
if (length(na.pos) == length(dat)) {
return(dat)
}
non.na.pos <- setdiff(seq_along(dat), na.pos)
nearest.non.na.pos <- sapply(na.pos, function(x) {
return(which.min(abs(non.na.pos - x)))
})
dat[na.pos] <- dat[non.na.pos[nearest.non.na.pos]]
To answer smci's questions below:
No, any entry can be NA
If all are NA, leave them as is
No. My current solution defaults to the lefthand nearest value, but it doesn't matter
These rows are a few hundred thousand elements typically, so in theory the upper bound would be a few hundred thousand. In reality it'd be no more than a few here & there, typically a single one.
Update So it turns out that we're going in a different direction altogether but this was still an interesting discussion. Thanks all!
Here is a very fast one. It uses findInterval to find what two positions should be considered for each NA in your original data:
f1 <- function(dat) {
N <- length(dat)
na.pos <- which(is.na(dat))
if (length(na.pos) %in% c(0, N)) {
return(dat)
}
non.na.pos <- which(!is.na(dat))
intervals <- findInterval(na.pos, non.na.pos,
all.inside = TRUE)
left.pos <- non.na.pos[pmax(1, intervals)]
right.pos <- non.na.pos[pmin(N, intervals+1)]
left.dist <- na.pos - left.pos
right.dist <- right.pos - na.pos
dat[na.pos] <- ifelse(left.dist <= right.dist,
dat[left.pos], dat[right.pos])
return(dat)
}
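To see what `findInterval` contributes, the intermediate vectors can be traced on the six-element example from the question. The snippet below just repeats the steps of `f1` inline (without the `pmax`/`pmin` guards, which are not needed here); nothing new is assumed beyond the example vector.

```r
dat <- c(1, 3, NA, NA, 5, 7)
na.pos     <- which(is.na(dat))    # 3 4
non.na.pos <- which(!is.na(dat))   # 1 2 5 6

# Index (into non.na.pos) of the last non-NA position at or before each NA
intervals <- findInterval(na.pos, non.na.pos, all.inside = TRUE)  # 2 2

left.pos  <- non.na.pos[intervals]       # 2 2 -> value 3 on the left
right.pos <- non.na.pos[intervals + 1]   # 5 5 -> value 5 on the right

left.dist  <- na.pos - left.pos          # 1 2
right.dist <- right.pos - na.pos         # 2 1

dat[na.pos] <- ifelse(left.dist <= right.dist, dat[left.pos], dat[right.pos])
dat   # 1 3 3 5 5 7
```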
And here I test it:
# sample data, suggested by @JeffAllen
dat <- as.integer(runif(50000, min=0, max=10))
dat[dat==0] <- NA
# computation times
system.time(r0 <- f0(dat)) # your function
# user system elapsed
# 5.52 0.00 5.52
system.time(r1 <- f1(dat)) # this function
# user system elapsed
# 0.01 0.00 0.03
identical(r0, r1)
# [1] TRUE
Code below. The initial question was not totally well-defined, I had asked for these clarifications:
Is it guaranteed that at least the first and/or last entries are non-NA? [No]
What to do if all entries in a row are NA? [Leave as-is]
Do you care how ties are split i.e. how to treat the middle NA in 1 3 NA NA NA 5 7? [Don't-care/ left]
Do you have an upper-bound (S) on the longest contiguous span of NAs in a row? (I'm thinking a recursive solution if S is small. Or a dataframe solution with ifelse if S is large and number of rows and cols is large.) [worst-case S could be pathologically large, hence recursion should not be used]
geoffjentry, re your solution your bottlenecks will be the serial calculation of nearest.non.na.pos and the serial assignment dat[na.pos] <- dat[non.na.pos[nearest.non.na.pos]]
For a large gap of length G all we really need to compute is that the first (G/2, round up) items fill-from-left, the rest from right. (I could post an answer using ifelse but it would look similar.)
Are your criteria runtime, big-O efficiency, temp memory usage, or code legibility?
Coupla possible tweaks:
only need to compute N <- length(dat) once
common-case speed enhance: if (length(na.pos) == 0) skip row, since it has no NAs
if (length(na.pos) == length(dat)-1) the (rare) case where there is only one non-NA entry hence we fill entire row with it
Outline solution:
Sadly na.locf does not work on an entire dataframe, you must use sapply, row-wise:
na.fill_from_nn <- function(x) {
row.na <- is.na(x)
fillFromLeft <- na.locf(x, na.rm=FALSE)
fillFromRight <- na.locf(x, fromLast=TRUE, na.rm=FALSE)
disagree <- rle(fillFromLeft!=fillFromRight)
for (loc in (disagree)) { ... resolve conflicts, row-wise }
}
sapply(dat, na.fill_from_nn)
Alternatively, since as you say contiguous NAs are rare, do a fast-and-dumb ifelse to fill isolated NAs from left. This will operate data-frame wise => makes the common-case fast. Then handle all the other cases with a row-wise for-loop. (This will affect the tiebreak on middle elements in a long span of NAs, but you say you don't care.)
I can't think of an obvious simple solution, but, having looked at the suggestions (particularly smci's suggestion of using rle) I came up with a complicated function that appears to be more efficient.
This is the code, I'll explain below:
# Your function
your.func = function(dat) {
na.pos <- which(is.na(dat))
if (length(na.pos) == length(dat)) {
return(dat)
}
non.na.pos <- setdiff(seq_along(dat), na.pos)
nearest.non.na.pos <- sapply(na.pos, function(x) which.min(abs(non.na.pos - x)))
dat[na.pos] <- dat[non.na.pos[nearest.non.na.pos]]
dat
}
# My function
my.func = function(dat) {
nas=is.na(dat)
if (!any(!nas)) return (dat)
t=rle(nas)
f=lapply(t$lengths[t$values],seq_len) # lapply: sapply would collapse equal-length runs into a matrix
a=unlist(f)
b=unlist(lapply(f,rev))
x=which(nas)
l=length(dat)
dat[nas]=ifelse(a>b,dat[ ifelse((x+b)>l,x-a,x+b) ],dat[ifelse((x-a)<1,x+b,x-a)])
dat
}
# Test
n = 100000
test.vec = 1:n
set.seed(1)
test.vec[sample(test.vec,n/4)]=NA
system.time(t1<-my.func(test.vec))
system.time(t2<-your.func(test.vec)) # 10 times speed improvement on my machine
# Verify
any(t1!=t2)
My function relies on rle. I am reading the comments above but it looks to me like rle works just fine for NA. It is easiest to explain with a small example.
If I start with a vector:
dat=c(1,2,3,4,NA,NA,NA,8,NA,10,11,12,NA,NA,NA,NA,NA,18)
I then get the positions of all the NAs:
x=c(5,6,7,9,13,14,15,16,17)
Then, for every "run" of NAs I create a sequence from 1 to the length of the run:
a=c(1,2,3,1,1,2,3,4,5)
Then I do it again, but I reverse the sequence:
b=c(3,2,1,1,5,4,3,2,1)
Now, I can just compare vectors a and b: If a<=b then look back and grab the value at x-a. If a>b then look ahead and grab the value at x+b. The rest is just handling the corner cases when you have all NAs or NA runs at the end or the start of the vector.
There is probably a better, simpler, solution, but I hope this gets you started.
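The vectors in the walkthrough can be reproduced directly with `rle`. This sketch only covers the interior case described above (no NA run touching either end of the vector) and builds the per-run sequences with `lapply` and `seq_len`.

```r
dat <- c(1, 2, 3, 4, NA, NA, NA, 8, NA, 10, 11, 12, NA, NA, NA, NA, NA, 18)

nas  <- is.na(dat)
runs <- rle(nas)
f <- lapply(runs$lengths[runs$values], seq_len)  # one 1..k sequence per NA run

x <- which(nas)               # 5 6 7 9 13 14 15 16 17
a <- unlist(f)                # 1 2 3 1 1 2 3 4 5  (steps back to a value)
b <- unlist(lapply(f, rev))   # 3 2 1 1 5 4 3 2 1  (steps ahead to a value)

# Look back when a <= b, look ahead otherwise
dat[nas] <- ifelse(a <= b, dat[x - a], dat[x + b])
dat
# 1 2 3 4 4 4 8 8 8 10 11 12 12 12 12 18 18 18
```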
I like all the rigorous solutions. Though not directly what was asked, I found this post looking for a solution to filling NA values with an interpolation. After reviewing this post I discovered na.fill on a zoo object(vector, factor, or matrix):
z <- c(1,2,3,4,5,6,NA,NA,NA,2,3,4,5,6,NA,NA,4,6,7,NA)
z1 <- zoo::na.fill(z, "extend")
Note the smooth transition across the NA values
round(z1, 0)
#> [1] 1 2 3 4 5 6 5 4 3 2 3 4 5 6 5 5 4 6 7 7
Perhaps this could help
Here's my stab at it. I never like to see a for loop in R, but in the case of a sparsely-NA vector, it looks like it will actually be more efficient (performance metrics below). The gist of the code is below.
#get the index of all NA values
nas <- which(is.na(dat))
#get the Boolean map of which are NAs, used later to determine which values can be used as a replacement, and which are just filled-in NA values
namask <- is.na(dat)
#calculate the maximum size of a run of NAs
length <- getLengthNAs(dat);
#the furthest away an NA value could be is half of the length of the maximum NA run
windowSize <- ceiling(length/2)
#loop through all NAs
for (thisIndex in nas){
#extract the neighborhood of this NA
neighborhood <- dat[(thisIndex-windowSize):(thisIndex+windowSize)]
#any already-filled-in values which were NA can be replaced with NAs
neighborhood[namask[(thisIndex-windowSize):(thisIndex+windowSize)]] <- NA
#the center of this neighborhood
center <- windowSize + 1
#compute the difference within this neighborhood to find the nearest non-NA value
delta <- center - which(!is.na(neighborhood))
#find the closest replacement
replacement <- delta[abs(delta) == min(abs(delta))]
#in case length > 1, just pick the first
replacement <- replacement[1]
#replace with the nearest non-NA value.
dat[thisIndex] <- dat[(thisIndex - (replacement))]
}
I liked the code you proposed, but I noticed that we were calculating the delta between every NA value and every other non-NA index in the matrix. I think this was the biggest performance hog. Instead, I just extract the minimum-sized neighborhood or window around each NA and find the nearest non-NA value within that window.
So the performance scales linearly on the number of NAs and the window size -- where the window size is (the ceiling of) half the length of the maximum run of NAs. To calculate the length of the maximum run of NAs, you can use the following function:
getLengthNAs <- function(dat){
nas <- which(is.na(dat))
spacing <- diff(nas)
length <- 1;
while (any(spacing == 1)){
length <- length + 1;
spacing <- diff(which(spacing == 1))
}
length
}
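The longest run of NAs can also be read straight off `rle`, which avoids the repeated `diff` passes. This is an alternative sketch, not the code timed in this answer; the name `maxNARun` is made up here, and unlike `getLengthNAs` it returns 0 when there are no NAs.

```r
maxNARun <- function(dat) {
  runs <- rle(is.na(dat))
  na_runs <- runs$lengths[runs$values]   # lengths of the NA runs only
  if (length(na_runs) == 0) 0L else max(na_runs)
}

maxNARun(c(1, NA, NA, NA, 2, NA))   # 3
maxNARun(c(1, 2, 3))                # 0
```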
Performance Comparison
#create a test vector with 10% NAs and length 50,000.
dat <- as.integer(runif(50000, min=0, max=10))
dat[dat==0] <- NA
#the a() function is the code posted in the question
a <- function(dat){
na.pos <- which(is.na(dat))
if (length(na.pos) == length(dat)) {
return(dat)
}
non.na.pos <- setdiff(seq_along(dat), na.pos)
nearest.non.na.pos <- sapply(na.pos, function(x) {
return(which.min(abs(non.na.pos - x)))
})
dat[na.pos] <- dat[non.na.pos[nearest.non.na.pos]]
dat
}
#my code
b <- function(dat){
#the same code posted above, but with some additional helper code to sanitize the input
if(is.null(dat)){
return(NULL);
}
if (all(is.na(dat))){
stop("Can't impute NAs if there are no non-NA values.")
}
if (!any(is.na(dat))){
return(dat);
}
#starts with an NA (or multiple), handle these
if (is.na(dat[1])){
firstNonNA <- which(!is.na(dat))[1]
dat[1:(firstNonNA-1)] <- dat[firstNonNA]
}
#ends with an NA (or multiple), handle these
if (is.na(dat[length(dat)])){
lastNonNA <- which(!is.na(dat))
lastNonNA <- lastNonNA[length(lastNonNA)]
dat[(lastNonNA+1):length(dat)] <- dat[lastNonNA]
}
#get the index of all NA values
nas <- which(is.na(dat))
#get the Boolean map of which are NAs, used later to determine which values can be used as a replacement, and which are just filled-in NA values
namask <- is.na(dat)
#calculate the maximum size of a run of NAs
length <- getLengthNAs(dat);
#the furthest away an NA value could be is half of the length of the maximum NA run
#if there's a run at the beginning or end, then the nearest non-NA value could possibly be `length` away, so we need to keep the window large for that case.
windowSize <- ceiling(length/2)
#loop through all NAs
for (thisIndex in nas){
#extract the neighborhood of this NA
neighborhood <- dat[(thisIndex-windowSize):(thisIndex+windowSize)]
#any already-filled-in values which were NA can be replaced with NAs
neighborhood[namask[(thisIndex-windowSize):(thisIndex+windowSize)]] <- NA
#the center of this neighborhood
center <- windowSize + 1
#compute the difference within this neighborhood to find the nearest non-NA value
delta <- center - which(!is.na(neighborhood))
#find the closest replacement
replacement <- delta[abs(delta) == min(abs(delta))]
#in case length > 1, just pick the first
replacement <- replacement[1]
#replace with the nearest non-NA value.
dat[thisIndex] <- dat[(thisIndex - (replacement))]
}
dat
}
#@nograpes' answer on this question
c <- function(dat){
nas=is.na(dat)
if (!any(!nas)) return (dat)
t=rle(nas)
f=lapply(t$lengths[t$values],seq_len) # lapply: sapply would collapse equal-length runs into a matrix
a=unlist(f)
b=unlist(lapply(f,rev))
x=which(nas)
l=length(dat)
dat[nas]=ifelse(a>b,dat[ ifelse((x+b)>l,x-a,x+b) ],dat[ifelse((x-a)<1,x+b,x-a)])
dat
}
#run 10 times each to get average performance.
sum <- 0; for (i in 1:10){ sum <- sum + system.time(a(dat))["elapsed"];}; cat ("A: ", sum/10)
A: 5.059
sum <- 0; for (i in 1:10){ sum <- sum + system.time(b(dat))["elapsed"];}; cat ("B: ", sum/10)
B: 0.126
sum <- 0; for (i in 1:10){ sum <- sum + system.time(c(dat))["elapsed"];}; cat ("C: ", sum/10)
C: 0.287
So it looks like this code (at least under these conditions) offers about a 40X speedup over the original code posted in the question, and a 2.2X speedup over @nograpes' answer below (though I imagine an rle solution would certainly be faster in some situations, including a more NA-rich vector).
Speed is about 3-4x slower than that of the chosen answer. Mine is pretty simple though. It's a rare while loop too.
f2 <- function(x){
# check if all are NA to skip loop
if(!all(is.na(x))){
# replace NA's until they are gone
while(anyNA(x)){
# replace from the left
x[is.na(x)] <- c(NA,x[1:(length(x)-1)])[is.na(x)]
# replace from the right
x[is.na(x)] <- c(x[-1],NA)[is.na(x)]
}
}
# return original or fixed x
x
}
