I am calculating the sums of matrix columns for each group, where the corresponding group labels are stored in the columns of a second matrix. At the moment I am using a loop as follows:
index <- matrix(c("A","A","B","B","B","B","A","A"),4,2)
x <- matrix(1:8,4,2)
for (i in 1:2) {
tapply(x[,i], index[,i], sum)
}
At the end of the day I need the following result:
1 2
A 3 15
B 7 11
Is there a way to do this using matrix operations without a loop? In addition, the real data is large (e.g. 500 x 10000), so it has to be fast.
Thanks in advance.
Here are a couple of solutions:
# 1
ag <- aggregate(c(x), data.frame(index = c(index), col = c(col(x))), sum)
xt <- xtabs(x ~., ag)
# 2
m <- mapply(rowsum, as.data.frame(x), as.data.frame(index))
dimnames(m) <- list(levels(factor(index)), 1:ncol(index))
The second only works if every column of index has at least one of each level, and it also requires at least 2 levels; however, it's faster.
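If some columns might be missing a level, one possible workaround is to group by level and column number in a single tapply call. This is only a sketch on the question's toy data; the names `group`, `column` and `res` are made up for illustration, and it has not been timed on the full-size problem.

```r
index <- matrix(c("A","A","B","B","B","B","A","A"), 4, 2)
x <- matrix(1:8, 4, 2)

# Flatten both matrices and group by (level, column number) in one call.
res <- tapply(c(x), list(group = c(index), column = c(col(x))), sum)

# Level/column combinations that never occur come back as NA;
# treat them as zero sums.
res[is.na(res)] <- 0
res
```

Rows are the levels and columns are the column numbers, so the layout matches the result requested in the question.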
This is ugly, but it works. There's a much better, more generalizable way to do it; I'm just getting the ball rolling.
data.frame("col1"=as.numeric(table(rep(index[,1], x[,1]))),
"col2"=as.numeric(table(rep(index[,2], x[,2]))),
row.names=names(table(index)))
I still suspect there's a better option, but this seems reasonably fast actually:
index <- matrix(sample(LETTERS[1:4],size = 500*10000,replace = TRUE),500,10000)
x <- matrix(sample(1:10,500*10000,replace = TRUE),500,10000)
rs <- matrix(NA,4,10000)
rownames(rs) <- LETTERS[1:4]
for (i in LETTERS[1:4]){
tmp <- x
tmp[index != i] <- 0
rs[i,] <- colSums(tmp)
}
It runs in ~0.8 seconds on my machine. I upped the number of categories to four and scaled it up to the size of the data you have. But I don't like having to copy x each time.
You can get clever with matrix multiplication, but I think you still have to do one row or column at a time.
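One way to drop the explicit copy of x is to multiply x by a logical mask for each level. The sketch below uses the question's toy data; `groups` and `res` are illustrative names, and note that `x * (index == g)` still allocates a temporary matrix internally, so this is tidier rather than necessarily faster.

```r
index <- matrix(c("A","A","B","B","B","B","A","A"), 4, 2)
x <- matrix(1:8, 4, 2)

groups <- sort(unique(c(index)))

# (index == g) is TRUE where a cell belongs to group g; multiplying by it
# zeroes every other cell, so colSums() gives the per-column group sums.
res <- t(sapply(groups, function(g) colSums(x * (index == g))))
res
```

This still loops over the levels, but the number of levels is usually far smaller than the number of columns.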
You used tapply. If you add mapply, you can complete your objective.
It does the same thing as that for loop.
index <- matrix(c("A","A","B","B","B","B","A","A"),4,2)
x <- matrix(1:8,4,2)
mapply( function(i) tapply(x[,i], index[,i], sum), 1:2 )
result:
[,1] [,2]
A 3 15
B 7 11
I have the following code:
n <- 1e6
no_clm <- rpois(n,30)
hold <- data.frame("x" = double(n))
c = 1
for (i in no_clm){
ctl <- sum(rgamma(i,30000)-2000)
hold[c,1] <- ctl
#hold <- rbind(hold,df)
c = c +1
}
Unfortunately this code is quite slow. I've narrowed the slowdown down to hold[c,1] <- ctl. If I remove this line, the code runs nearly instantly.
How can I make this efficient? I need to store the results to some sort of dataframe or list in a fast fashion. In reality the actual code is more complex than this but the slowing point is the assigning.
Note that the above is just an example, in reality I have multiple calculations on the rgamma samples and each of these calculations are then stored in a large dataframe.
Try this
hold=data.frame(sapply(no_clm,function(x){
return(sum(rgamma(x,30000)-2000))
}))
It looks like you can just use one call to rgamma, as you are iterating over the number of observations parameter.
So if you do one call and then split the vector into the lengths required (no_clm), you can just iterate over that list and sum:
n <- 1e6
no_clm <- rpois(n, 30)
hold <- data.frame("x" = double(n))
# total observations to use for rgamma
total_clm <- sum(no_clm)
# get values
gammas <- rgamma(total_clm, 30000) - 2000
# split into list of lengths dictated by no_clm
hold$x <- sapply(split(gammas, cumsum(sequence(no_clm) == 1)), sum)
This took 5.919892 seconds
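A possible variation on the same idea replaces split() and sapply() with rowsum(), which adds up values by group in compiled code. This is a sketch under assumptions: `grp`, `rs` and `totals` are illustrative names, n is kept small, and the Poisson mean is lowered to 3 so that zero counts actually occur and can be shown to keep a slot in the result.

```r
set.seed(42)
n <- 1000                      # small n, for illustration only
no_clm <- rpois(n, 3)          # low mean so that some counts are zero
gammas <- rgamma(sum(no_clm), 30000) - 2000

# One group id per simulated value: 1 repeated no_clm[1] times, and so on.
grp <- rep(seq_along(no_clm), times = no_clm)

# rowsum() returns a one-column matrix whose row names are the group ids;
# groups that have no values do not appear in it.
rs <- rowsum(gammas, grp)

totals <- numeric(n)           # groups with zero claims keep a total of 0
totals[as.integer(rownames(rs))] <- rs[, 1]
hold <- data.frame(x = totals)
```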
Use sapply() instead of a for loop, and note that the subtraction of 2000 can be moved outside the loop as 2000 * no_clm (to minimise the number of function calls).
n <- 1e6
no_clm <- rpois(n, 30)
hold <- data.frame(x = sapply(no_clm, function(i) sum(rgamma(i, 30000))) - 2000 * no_clm)
You may observe a speed pickup using data.table:
library(data.table)
dt = data.table(no_clm)
dt[, hold := sapply(no_clm, function(x) sum(rgamma(x, 30000)-2000))]
My question is quite similar to this one: Find a subset from a set of integer whose sum is closest to a value
It discussed the algorithm only, but I want to solve it with R. I'm quite new to R and tried to work out a solution, but I wonder whether there is a more efficient way.
Here is my example:
# Define a vector, to find a subset whose sum is closest to the reference number 20.
A <- c(2,5,6,3,7)
# display all the possible combinations
y1 <- combn(A,1)
y2 <- combn(A,2)
y3 <- combn(A,3)
y4 <- combn(A,4)
y5 <- combn(A,5)
Y <- list(y1,y2,y3,y4,y5)
# calculate the distance to the reference number of each combination
s1 <- abs(apply(y1,2,sum)-20)
s2 <- abs(apply(y2,2,sum)-20)
s3 <- abs(apply(y3,2,sum)-20)
s4 <- abs(apply(y4,2,sum)-20)
s5 <- abs(apply(y5,2,sum)-20)
S <- list(s1,s2,s3,s4,s5)
# find the minimum difference
M <- sapply(S,FUN=function(x) list(which.min(x),min(x)))
Mm <- which.min(as.numeric(M[2,]))
# return the right combination: row 1 of M holds the which.min index,
# so pick that column from the Mm-th combination matrix
Y[[Mm]][, M[[1, Mm]]]
so the answer is 2,5,6,7.
How can I refine this program? Especially the five combn()s and five apply()s, is there a way that can work them at once? I hope when A has more items in it, I can use length(A) to cover it.
Here is another way to do it,
l1 <- sapply(seq_along(A), function(i) combn(A, i))
l2 <- sapply(l1, function(i) abs(colSums(i) - 20))
Filter(length, Map(function(x, y)x[,y], l1, sapply(l2, function(i) i == Reduce(min, l2))))
#[[1]]
#[1] 2 5 6 7
The last line uses Map to index l1 based on a logical list created by finding the minimum value from list l2.
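The same search can also be written as a single loop over subset sizes, which covers any length(A) without separate combn() and apply() calls. This is only a sketch: `closest_subset` is a made-up name, ties go to the first (smallest) subset found, and it still enumerates all 2^n - 1 subsets, so it is only practical for short vectors.

```r
closest_subset <- function(A, target) {
  best <- NULL
  best_diff <- Inf
  # Caution: combn() treats a single positive integer as 1:A,
  # so this assumes length(A) >= 2.
  for (k in seq_along(A)) {
    combos <- combn(A, k)                     # k x choose(length(A), k) matrix
    diffs <- abs(colSums(combos) - target)    # distance of each subset sum
    j <- which.min(diffs)
    if (diffs[j] < best_diff) {               # strictly better than best so far
      best_diff <- diffs[j]
      best <- combos[, j]
    }
  }
  best
}

closest_subset(c(2, 5, 6, 3, 7), 20)
#[1] 2 5 6 7
```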
The combiter library has an isubsetv iterator, which goes through all subsets of a vector. Combining it with foreach simplifies the code.
library(combiter)
library(foreach)
A <- c(2,5,6,3,7)
res <- foreach(x = isubsetv(A), .combine = c) %do% sum(x)
absdif <- abs(res-20)
ind <- which(absdif==min(absdif))
as.list(isubsetv(A))[ind]
What I am trying to do is generate all possible permutations of 1 and 0 given a particular sample size. For instance with a sample of n=8 I would like the m = 2^8 = 256 possible permutations, i.e:
I've written a function in R to do this, but after n=11 it takes a very long time to run. I would prefer a solution in R, but if it's in another programming language I can probably figure it out. Thanks!
PermBinary <- function(n){
n.perms <- 2^n
array <- matrix(0,nrow=n,ncol=n.perms)
# array <- big.matrix(n, n.perms, type='integer', init=-5)
for(i in 1:n){
div.length <- ncol(array)/(2^i)
div.num <- ncol(array)/div.length
end <- 0
while(end!=ncol(array)){
end <- end +1
start <- end + div.length
end <- start + div.length -1
array[i,start:end] <- 1
}
}
return(array)
}
expand.grid is probably the best vehicle to get what you want.
For example if you wanted a sample size of 3 we could do something like
expand.grid(0:1, 0:1, 0:1)
For a sample size of 4
expand.grid(0:1, 0:1, 0:1, 0:1)
So what we want to do is find a way to automate that call.
If we had a list of the inputs we want to give to expand.grid we could use do.call to construct the call for us. For example
vals <- 0:1
tmp <- list(vals, vals, vals)
do.call(expand.grid, tmp)
So now the challenge is to automatically make the "tmp" list above in a fashion that we can dictate how many copies of "vals" we want. There are lots of ways to do this but one way is to use replicate. Since we want a list we'll need to tell it to not simplify the result or else we will get a matrix/array as the result.
vals <- 0:1
tmp <- replicate(4, vals, simplify = FALSE)
do.call(expand.grid, tmp)
Alternatively we can use rep on a list input (which I believe is faster because it doesn't have as much overhead as replicate but I haven't tested it)
tmp <- rep(list(vals), 4)
do.call(expand.grid, tmp)
Now wrap that up into a function to get:
binarypermutations <- function(n, vals = 0:1){
tmp <- rep(list(vals), n)
do.call(expand.grid, tmp)
}
Then call with the sample size like so binarypermutations(5).
This gives a data.frame of dimensions 2^n x n as a result - transpose and convert to a different data type if you'd like.
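If a plain numeric matrix is wanted from the start, one alternative is to read the bits off the integers 0 to 2^n - 1 with integer division. This is a sketch, not a benchmarked replacement; `binary_matrix` is a made-up name and it assumes n >= 1.

```r
binary_matrix <- function(n) {
  i <- 0:(2^n - 1)        # every integer whose n-bit pattern we want
  p <- 2^(0:(n - 1))      # place values 1, 2, 4, ...
  # (i %/% p) %% 2 extracts one bit; outer() evaluates it for all pairs.
  outer(i, p, function(i, p) (i %/% p) %% 2)
}

m <- binary_matrix(3)
dim(m)
#[1] 8 3
```

The rows come out in the same order as expand.grid produces them, with the first column varying fastest.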
The answer above may be better since it uses base - my first thought was to use data.table's CJ function:
library(data.table)
do.call(CJ, replicate(8, c(0, 1), FALSE))
It will be slightly faster (~15%) than expand.grid, so it will only be more valuable for extreme cases.
I have a data.frame, ordered by mean column that looks like this:
10SE191_2 10SE207 10SE208 mean
7995783 12.64874 13.06391 12.69378 12.73937
8115327 12.69979 12.52285 12.41582 12.50363
8108370 12.58685 12.87818 12.66021 12.45720
7945680 12.46392 12.26087 11.77040 12.36518
7923547 11.98463 11.96649 12.50666 12.33138
8016718 12.81610 12.71548 12.48164 12.32703
I would like to apply a t.test to each row, using as input the intensity values: df[i,1:3] and the mean values from the rows with lower intensities. For example, for the first row I want to compute a t.test for df[1,1:3] vs _mean values_ from row 2 to row 6. My code uses a for loop but my current data.frame has more than 20,000 rows and 24 columns and it takes a long time. Any ideas for improving the code?
Thanks
Code:
temp <- matrix(-9, nrow=dim(matrix.order)[1], ncol=2) #create a result matrix
l <- dim(matrix.order)[1]
for (i in 1:l){
j <- 1+i
if (i < l | j +2 == l) { #avoid not enough y observations
mean.val <- matrix.order[j:l,4]
p <- t.test(matrix.order[i, 1:3], mean.val)
temp[i,1] <- p$p.value
}
else {temp[i,1] <- 1}
}
dput for my df
structure(list(`10SE191_2` = c(12.6487418898415, 12.6997932097351,12.5868508174491, 12.4639169398277, 11.9846348627906, 12.8160978540904), `10SE207` = c(13.0639063105224, 12.522848114011, 12.8781769160682, 12.260865493177, 11.9664905651469, 12.7154788700468), `10SE208` = c(12.6937808736673, 12.4158248856386, 12.6602128982717, 11.7704045448312, 12.5066604109231, 12.4816357798965), mean = c(12.7393707471856, 12.5036313008127, 12.4572035036992, 12.3651842840775, 12.3313821056582, 12.3270331271091)), .Names = c("10SE191_2", "10SE207", "10SE208", "mean"), row.names = c("7995783", "8115327", "8108370", "7945680", "7923547", "8016718"), class = "data.frame")
You can obtain all p-values (if possible) with this command:
apply(df, 1, function(x) {
y <- df$mean[df$mean < x[4]]
if(length(y) > 1)
t.test(x[1:3], y)$p.value
else NA
})
The function will return NA if there are not enough values for y.
7995783 8115327 8108370 7945680 7923547 8016718
0.08199794 0.15627947 0.04993244 0.50885253 NA NA
Running 2E4 t.tests probably takes a lot of time no matter what. Try using Rprof to find the hot spots. You might also want to use mclapply (from the parallel package) or similar parallel processing tools, since your analysis of each row is independent of the results for all other rows (which means this is a task well-suited to multicore parallel processing).
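A minimal sketch of the parallel route, using made-up data rather than the poster's data.frame. The names `vals`, `mu`, `row_p` and `n_cores` are illustrative; `n_cores` is left at 1 as an assumption so the code runs anywhere, and can be raised on Unix-alikes (forking is not available on Windows).

```r
library(parallel)

set.seed(1)
vals <- matrix(rnorm(30, mean = 12, sd = 0.3), nrow = 10)   # made-up intensities
vals <- vals[order(rowMeans(vals), decreasing = TRUE), ]    # sort rows by mean
mu <- rowMeans(vals)
n <- nrow(vals)

# p-value for row i against the means of all rows below it.
row_p <- function(i) {
  if (n - i < 2) return(NA_real_)     # fewer than two lower means: no test
  t.test(vals[i, ], mu[(i + 1):n])$p.value
}

n_cores <- 1L   # assumption: increase on Unix-alikes to actually run in parallel
pvals <- unlist(mclapply(seq_len(n), row_p, mc.cores = n_cores))
```

Working on a plain matrix and a numeric vector, rather than indexing data.frame rows inside the loop, may also reduce the per-row overhead.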
I'm looking for something similar to na.locf() in the zoo package, but instead of always using the previous non-NA value I'd like to use the nearest non-NA value. Some example data:
dat <- c(1, 3, NA, NA, 5, 7)
Replacing NA with na.locf (3 is carried forward):
library(zoo)
na.locf(dat)
# 1 3 3 3 5 7
and na.locf with fromLast set to TRUE (5 is carried backwards):
na.locf(dat, fromLast = TRUE)
# 1 3 5 5 5 7
But I wish the nearest non-NA value to be used. In my example this means that the 3 should be carried forward to the first NA, and the 5 should be carried backwards to the second NA:
1 3 3 5 5 7
I have a solution coded up, but wanted to make sure that I wasn't reinventing the wheel. Is there something already floating around?
FYI, my current code is as follows. Perhaps if nothing else, someone can suggest how to make it more efficient. I feel like I'm missing an obvious way to improve this:
na.pos <- which(is.na(dat))
if (length(na.pos) == length(dat)) {
return(dat)
}
non.na.pos <- setdiff(seq_along(dat), na.pos)
nearest.non.na.pos <- sapply(na.pos, function(x) {
return(which.min(abs(non.na.pos - x)))
})
dat[na.pos] <- dat[non.na.pos[nearest.non.na.pos]]
To answer smci's questions below:
No, any entry can be NA
If all are NA, leave them as is
No. My current solution defaults to the lefthand nearest value, but it doesn't matter
These rows are a few hundred thousand elements typically, so in theory the upper bound would be a few hundred thousand. In reality it'd be no more than a few here & there, typically a single one.
Update: So it turns out that we're going in a different direction altogether, but this was still an interesting discussion. Thanks all!
Here is a very fast one. It uses findInterval to find what two positions should be considered for each NA in your original data:
f1 <- function(dat) {
N <- length(dat)
na.pos <- which(is.na(dat))
if (length(na.pos) %in% c(0, N)) {
return(dat)
}
non.na.pos <- which(!is.na(dat))
intervals <- findInterval(na.pos, non.na.pos,
all.inside = TRUE)
left.pos <- non.na.pos[pmax(1, intervals)]
right.pos <- non.na.pos[pmin(N, intervals+1)]
left.dist <- na.pos - left.pos
right.dist <- right.pos - na.pos
dat[na.pos] <- ifelse(left.dist <= right.dist,
dat[left.pos], dat[right.pos])
return(dat)
}
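To see what findInterval contributes, here is a small trace of the same steps on the question's example vector. It is a simplified sketch that leaves out the edge-case handling of f1; `filled` is an illustrative name.

```r
dat <- c(1, 3, NA, NA, 5, 7)
na.pos <- which(is.na(dat))          # 3 4
non.na.pos <- which(!is.na(dat))     # 1 2 5 6

# For each NA position, the index into non.na.pos of the closest
# non-NA position on its left.
intervals <- findInterval(na.pos, non.na.pos, all.inside = TRUE)   # 2 2

left.pos <- non.na.pos[intervals]        # 2 2
right.pos <- non.na.pos[intervals + 1]   # 5 5

filled <- dat
filled[na.pos] <- ifelse(na.pos - left.pos <= right.pos - na.pos,
                         dat[left.pos], dat[right.pos])
filled
# 1 3 3 5 5 7
```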
And here I test it:
# sample data, suggested by @JeffAllen
dat <- as.integer(runif(50000, min=0, max=10))
dat[dat==0] <- NA
# computation times
system.time(r0 <- f0(dat)) # your function
# user system elapsed
# 5.52 0.00 5.52
system.time(r1 <- f1(dat)) # this function
# user system elapsed
# 0.01 0.00 0.03
identical(r0, r1)
# [1] TRUE
Code below. The initial question was not totally well-defined, I had asked for these clarifications:
Is it guaranteed that at least the first and/or last entries are non-NA? [No]
What to do if all entries in a row are NA? [Leave as-is]
Do you care how ties are split i.e. how to treat the middle NA in 1 3 NA NA NA 5 7? [Don't-care/ left]
Do you have an upper-bound (S) on the longest contiguous span of NAs in a row? (I'm thinking a recursive solution if S is small. Or a dataframe solution with ifelse if S is large and number of rows and cols is large.) [worst-case S could be pathologically large, hence recursion should not be used]
geoffjentry, re your solution your bottlenecks will be the serial calculation of nearest.non.na.pos and the serial assignment dat[na.pos] <- dat[non.na.pos[nearest.non.na.pos]]
For a large gap of length G all we really need to compute is that the first (G/2, round up) items fill-from-left, the rest from right. (I could post an answer using ifelse but it would look similar.)
Are your criteria runtime, big-O efficiency, temp memory usage, or code legibility?
Coupla possible tweaks:
only need to compute N <- length(dat) once
common-case speed enhance: if (length(na.pos) == 0) skip row, since it has no NAs
if (length(na.pos) == length(dat)-1) the (rare) case where there is only one non-NA entry hence we fill entire row with it
Outline solution:
Sadly na.locf does not work on an entire dataframe, you must use sapply, row-wise:
na.fill_from_nn <- function(x) {
row.na <- is.na(x)
fillFromLeft <- na.locf(x, na.rm=FALSE)
fillFromRight <- na.locf(x, fromLast=TRUE, na.rm=FALSE)
disagree <- rle(fillFromLeft!=fillFromRight)
for (loc in (disagree)) { ... resolve conflicts, row-wise }
}
sapply(dat, na.fill_from_nn)
Alternatively, since as you say contiguous NAs are rare, do a fast-and-dumb ifelse to fill isolated NAs from left. This will operate data-frame wise => makes the common-case fast. Then handle all the other cases with a row-wise for-loop. (This will affect the tiebreak on middle elements in a long span of NAs, but you say you don't care.)
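The fill-from-both-sides idea can be sketched for a single vector in base R. Note the swap: instead of na.locf, this uses cummax/cummin to find the position of the previous and next non-NA value, then picks the closer one. `fill_nearest` is a made-up name, ties go to the left, and an all-NA vector is returned unchanged.

```r
fill_nearest <- function(x) {
  n <- length(x)
  ok <- !is.na(x)
  if (!any(ok)) return(x)                 # all NA: leave as-is
  idx <- seq_len(n)

  # Position of the last non-NA at or before each element (0 = none).
  prev <- cummax(ifelse(ok, idx, 0L))
  # Position of the first non-NA at or after each element (n + 1 = none).
  nxt <- rev(cummin(rev(ifelse(ok, idx, n + 1L))))

  d_left <- ifelse(prev == 0L, Inf, idx - prev)
  d_right <- ifelse(nxt == n + 1L, Inf, nxt - idx)

  # Take the closer source position; ties go to the left.
  x[ifelse(d_left <= d_right, prev, nxt)]
}

fill_nearest(c(1, 3, NA, NA, 5, 7))
# 1 3 3 5 5 7
```

To apply it row-wise to a matrix `m`, something like `t(apply(m, 1, fill_nearest))` should work.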
I can't think of an obvious simple solution, but, having looked at the suggestions (particularly smci's suggestion of using rle) I came up with a complicated function that appears to be more efficient.
This is the code, I'll explain below:
# Your function
your.func = function(dat) {
na.pos <- which(is.na(dat))
if (length(na.pos) == length(dat)) {
return(dat)
}
non.na.pos <- setdiff(seq_along(dat), na.pos)
nearest.non.na.pos <- sapply(na.pos, function(x) which.min(abs(non.na.pos - x)))
dat[na.pos] <- dat[non.na.pos[nearest.non.na.pos]]
dat
}
# My function
my.func = function(dat) {
nas=is.na(dat)
if (!any(!nas)) return (dat)
t=rle(nas)
f=lapply(t$lengths[t$values],seq_len) # lapply: sapply would simplify equal-length runs to a matrix and break rev() below
a=unlist(f)
b=unlist(lapply(f,rev))
x=which(nas)
l=length(dat)
dat[nas]=ifelse(a>b,dat[ ifelse((x+b)>l,x-a,x+b) ],dat[ifelse((x-a)<1,x+b,x-a)])
dat
}
# Test
n = 100000
test.vec = 1:n
set.seed(1)
test.vec[sample(test.vec,n/4)]=NA
system.time(t1<-my.func(test.vec))
system.time(t2<-your.func(test.vec)) # 10 times speed improvement on my machine
# Verify
any(t1!=t2)
My function relies on rle. I am reading the comments above but it looks to me like rle works just fine for NA. It is easiest to explain with a small example.
If I start with a vector:
dat=c(1,2,3,4,NA,NA,NA,8,NA,10,11,12,NA,NA,NA,NA,NA,18)
I then get the positions of all the NAs:
x=c(5,6,7,9,13,14,15,16,17)
Then, for every "run" of NAs I create a sequence from 1 to the length of the run:
a=c(1,2,3,1,1,2,3,4,5)
Then I do it again, but I reverse the sequence:
b=c(3,2,1,1,5,4,3,2,1)
Now, I can just compare vectors a and b: If a<=b then look back and grab the value at x-a. If a>b then look ahead and grab the value at x+b. The rest is just handling the corner cases when you have all NAs or NA runs at the end or the start of the vector.
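The walk-through above can be reproduced directly. This sketch covers only the interior case, so it leaves out the corner-case handling for NA runs at the start or end of the vector; `runs`, `src` and `filled` are illustrative names.

```r
dat <- c(1,2,3,4,NA,NA,NA,8,NA,10,11,12,NA,NA,NA,NA,NA,18)

nas <- is.na(dat)
runs <- rle(nas)
f <- lapply(runs$lengths[runs$values], seq_len)   # 1..k for each NA run

x <- which(nas)                # positions of the NAs
a <- unlist(f)                 # distance back to the value before the run
b <- unlist(lapply(f, rev))    # distance forward to the value after the run

# Look back when a <= b, look ahead otherwise.
src <- ifelse(a <= b, x - a, x + b)

filled <- dat
filled[nas] <- dat[src]
filled
# 1 2 3 4 4 4 8 8 8 10 11 12 12 12 12 18 18 18
```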
There is probably a better, simpler, solution, but I hope this gets you started.
I like all the rigorous solutions. Though not directly what was asked, I found this post looking for a solution to filling NA values with an interpolation. After reviewing this post I discovered na.fill on a zoo object(vector, factor, or matrix):
z <- c(1,2,3,4,5,6,NA,NA,NA,2,3,4,5,6,NA,NA,4,6,7,NA)
z1 <- zoo::na.fill(z, "extend")
Note the smooth transition across the NA values
round(z1, 0)
#> [1] 1 2 3 4 5 6 5 4 3 2 3 4 5 6 5 5 4 6 7 7
Perhaps this could help
Here's my stab at it. I never like to see a for loop in R, but in the case of a sparsely-NA vector, it looks like it will actually be more efficient (performance metrics below). The gist of the code is below.
#get the index of all NA values
nas <- which(is.na(dat))
#get the Boolean map of which are NAs, used later to determine which values can be used as a replacement, and which are just filled-in NA values
namask <- is.na(dat)
#calculate the maximum size of a run of NAs
length <- getLengthNAs(dat);
#the furthest away an NA value could be is half of the length of the maximum NA run
windowSize <- ceiling(length/2)
#loop through all NAs
for (thisIndex in nas){
#extract the neighborhood of this NA
neighborhood <- dat[(thisIndex-windowSize):(thisIndex+windowSize)]
#any already-filled-in values which were NA can be replaced with NAs
neighborhood[namask[(thisIndex-windowSize):(thisIndex+windowSize)]] <- NA
#the center of this neighborhood
center <- windowSize + 1
#compute the difference within this neighborhood to find the nearest non-NA value
delta <- center - which(!is.na(neighborhood))
#find the closest replacement
replacement <- delta[abs(delta) == min(abs(delta))]
#in case length > 1, just pick the first
replacement <- replacement[1]
#replace with the nearest non-NA value.
dat[thisIndex] <- dat[(thisIndex - (replacement))]
}
I liked the code you proposed, but I noticed that we were calculating the delta between every NA value and every other non-NA index in the matrix. I think this was the biggest performance hog. Instead, I just extract the minimum-sized neighborhood or window around each NA and find the nearest non-NA value within that window.
So the performance scales linearly on the number of NAs and the window size -- where the window size is (the ceiling of) half the length of the maximum run of NAs. To calculate the length of the maximum run of NAs, you can use the following function:
getLengthNAs <- function(dat){
nas <- which(is.na(dat))
spacing <- diff(nas)
length <- 1;
while (any(spacing == 1)){
length <- length + 1;
spacing <- diff(which(spacing == 1))
}
length
}
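The longest NA run can also be read straight off rle(), which avoids the repeated diff() calls. This is a sketch with a made-up name, `max_na_run`. One difference to be aware of: it returns 0 for a vector without NAs, whereas getLengthNAs above returns 1 in that case.

```r
max_na_run <- function(dat) {
  runs <- rle(is.na(dat))
  na_runs <- runs$lengths[runs$values]   # lengths of the NA runs only
  if (length(na_runs) == 0) 0L else max(na_runs)
}

max_na_run(c(1, NA, NA, NA, 2, NA))
# 3
```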
Performance Comparison
#create a test vector with 10% NAs and length 50,000.
dat <- as.integer(runif(50000, min=0, max=10))
dat[dat==0] <- NA
#the a() function is the code posted in the question
a <- function(dat){
na.pos <- which(is.na(dat))
if (length(na.pos) == length(dat)) {
return(dat)
}
non.na.pos <- setdiff(seq_along(dat), na.pos)
nearest.non.na.pos <- sapply(na.pos, function(x) {
return(which.min(abs(non.na.pos - x)))
})
dat[na.pos] <- dat[non.na.pos[nearest.non.na.pos]]
dat
}
#my code
b <- function(dat){
#the same code posted above, but with some additional helper code to sanitize the input
if(is.null(dat)){
return(NULL);
}
if (all(is.na(dat))){
stop("Can't impute NAs if there are no non-NA values.")
}
if (!any(is.na(dat))){
return(dat);
}
#starts with an NA (or multiple), handle these
if (is.na(dat[1])){
firstNonNA <- which(!is.na(dat))[1]
dat[1:(firstNonNA-1)] <- dat[firstNonNA]
}
#ends with an NA (or multiple), handle these
if (is.na(dat[length(dat)])){
lastNonNA <- which(!is.na(dat))
lastNonNA <- lastNonNA[length(lastNonNA)]
dat[(lastNonNA+1):length(dat)] <- dat[lastNonNA]
}
#get the index of all NA values
nas <- which(is.na(dat))
#get the Boolean map of which are NAs, used later to determine which values can be used as a replacement, and which are just filled-in NA values
namask <- is.na(dat)
#calculate the maximum size of a run of NAs
length <- getLengthNAs(dat);
#the furthest away an NA value could be is half of the length of the maximum NA run
#if there's a run at the beginning or end, then the nearest non-NA value could possibly be `length` away, so we need to keep the window large for that case.
windowSize <- ceiling(length/2)
#loop through all NAs
for (thisIndex in nas){
#extract the neighborhood of this NA
neighborhood <- dat[(thisIndex-windowSize):(thisIndex+windowSize)]
#any already-filled-in values which were NA can be replaced with NAs
neighborhood[namask[(thisIndex-windowSize):(thisIndex+windowSize)]] <- NA
#the center of this neighborhood
center <- windowSize + 1
#compute the difference within this neighborhood to find the nearest non-NA value
delta <- center - which(!is.na(neighborhood))
#find the closest replacement
replacement <- delta[abs(delta) == min(abs(delta))]
#in case length > 1, just pick the first
replacement <- replacement[1]
#replace with the nearest non-NA value.
dat[thisIndex] <- dat[(thisIndex - (replacement))]
}
dat
}
# @nograpes' answer on this question
c <- function(dat){
nas=is.na(dat)
if (!any(!nas)) return (dat)
t=rle(nas)
f=lapply(t$lengths[t$values],seq_len) # lapply: sapply would simplify equal-length runs to a matrix and break rev() below
a=unlist(f)
b=unlist(lapply(f,rev))
x=which(nas)
l=length(dat)
dat[nas]=ifelse(a>b,dat[ ifelse((x+b)>l,x-a,x+b) ],dat[ifelse((x-a)<1,x+b,x-a)])
dat
}
#run 10 times each to get average performance.
sum <- 0; for (i in 1:10){ sum <- sum + system.time(a(dat))["elapsed"];}; cat ("A: ", sum/10)
A: 5.059
sum <- 0; for (i in 1:10){ sum <- sum + system.time(b(dat))["elapsed"];}; cat ("B: ", sum/10)
B: 0.126
sum <- 0; for (i in 1:10){ sum <- sum + system.time(c(dat))["elapsed"];}; cat ("C: ", sum/10)
C: 0.287
So it looks like this code (at least under these conditions) offers about a 40X speedup over the original code posted in the question, and a 2.2X speedup over @nograpes' answer (though I imagine an rle solution would certainly be faster in some situations, including a more NA-rich vector).
Speed is about 3-4x slower than that of the chosen answer. Mine is pretty simple though. It's a rare while loop too.
f2 <- function(x){
# check if all are NA to skip loop
if(!all(is.na(x))){
# replace NA's until they are gone
while(anyNA(x)){
# replace from the left
x[is.na(x)] <- c(NA,x[1:(length(x)-1)])[is.na(x)]
# replace from the right
x[is.na(x)] <- c(x[-1],NA)[is.na(x)]
}
}
# return original or fixed x
x
}