Loop in a dataset simulation - r

I hope to get help with the following problem in R.
I have the following code to generate a 30-column dataset based on an exponential distribution:
x0=0
xmax=8000
xout=3000
lambda=0.0002
n=1
x1=x0+rexp(n,lambda)-xout
x2=x1+rexp(n,lambda)-xout
x3=x2+rexp(n,lambda)-xout
x4=x3+rexp(n,lambda)-xout
x5=x4+rexp(n,lambda)-xout
x6=x5+rexp(n,lambda)-xout
x7=x6+rexp(n,lambda)-xout
x8=x7+rexp(n,lambda)-xout
x9=x8+rexp(n,lambda)-xout
x10=x9+rexp(n,lambda)-xout
x11=x10+rexp(n,lambda)-xout
x12=x11+rexp(n,lambda)-xout
x13=x12+rexp(n,lambda)-xout
x14=x13+rexp(n,lambda)-xout
x15=x14+rexp(n,lambda)-xout
x16=x15+rexp(n,lambda)-xout
x17=x16+rexp(n,lambda)-xout
x18=x17+rexp(n,lambda)-xout
x19=x18+rexp(n,lambda)-xout
x20=x19+rexp(n,lambda)-xout
x21=x20+rexp(n,lambda)-xout
x22=x21+rexp(n,lambda)-xout
x23=x22+rexp(n,lambda)-xout
x24=x23+rexp(n,lambda)-xout
x25=x24+rexp(n,lambda)-xout
x26=x25+rexp(n,lambda)-xout
x27=x26+rexp(n,lambda)-xout
x28=x27+rexp(n,lambda)-xout
x29=x28+rexp(n,lambda)-xout
x30=x29+rexp(n,lambda)-xout
I have three questions:
1 - Is there any way to write this function in a reduced form?
2 - This row (30 columns) needs to be simulated 10,000 times. How to do this in a loop?
3 - The values of each cell (x1, x2, x3 ...) must be limited to the interval x0 and xmax (0-8000). How to do this?

That depends on what you want to do with values over 8000. Here's a solution that just takes those values and wraps them around with a modulo operator.
library(tidyverse)
test <- data.frame(x0 = rep(0, n))
for (i in 1:30) {
  new_col <- sym(paste0("x", i))
  old_col <- sym(paste0("x", i - 1))
  test <- test %>%
    mutate(!!new_col := (!!old_col + rexp(n, lambda) - xout) %% xmax)
}
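To see what the modulo does here: values that run past xmax wrap back to the start of the interval, and negative values wrap up from xmax. A quick illustration, with xmax = 8000 as in the question:
xmax <- 8000
c(8500, -500, 3000) %% xmax
# [1]  500 7500 3000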
I don't know how familiar you may or may not be with the tidyverse and tidy evaluation, which I've used liberally here. The !! operator, combined with sym(), turns the variable names into actual variables. The %>% operator "pipes" data from one function to the next. The := operator is needed only if you want to make assignments with a !! on the left-hand side.
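If it helps, here is a stripped-down sketch of the sym()/!!/:= pattern on a tiny made-up data frame (not the simulation itself):
library(dplyr)
d <- data.frame(x1 = 1:3)
new_col <- sym("x2")
old_col <- sym("x1")
d %>% mutate(!!new_col := !!old_col * 10)
#   x1 x2
# 1  1 10
# 2  2 20
# 3  3 30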
I think this is my first time actually trying to post an answer on StackOverflow, so be easy on me! :)

As I'm fairly new to R myself, I thought it would be good practice to try to write this out. Perhaps not the most efficient code, but it works:
xmax <- 8000
xout <- 3000
lambda <- 0.0002
n <- 1
iterations <- 30
df <- data.frame(matrix(ncol = 31, nrow = iterations))
names(df) <- c(paste("x", 0:30, sep=""))
for (j in 1:iterations) {
  df$x0[j] <- 0
  df$x1[j] <- df$x0[j] + rexp(n, lambda) - xout
  if (df$x1[j] < 0) {
    df$x1[j] <- 0
  }
  if (df$x1[j] > 8000) {
    df$x1[j] <- 8000
  }
  for (i in 3:31) {
    df[j, i] <- df[j, i-1] + rexp(n, lambda) - xout
    if (df[j, i] < 0) {
      df[j, i] <- 0
    }
    if (df[j, i] > 8000) {
      df[j, i] <- 8000
    }
  }
}
You can change iterations to 10,000 for the full simulation; for testing purposes I've used 30. Also, I didn't know whether you wanted to limit the values to 0 and 8000 before or after the next iteration, so I've applied the limit before.

Is there any way to write this function in a reduced form?
I would do it like this. Pretty sure this is equivalent.
ncol = 30
row = rexp(ncol, lambda)
row = cumsum(row) - xout * (1:ncol)
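If you want to convince yourself that this matches the step-by-step version, you can compare the two under the same random seed (a quick check, using the constants from the question):
lambda <- 0.0002
xout <- 3000
ncol <- 30

set.seed(1)
x_loop <- numeric(ncol)
prev <- 0
for (k in 1:ncol) {
  prev <- prev + rexp(1, lambda) - xout  # the original one-step-at-a-time recursion
  x_loop[k] <- prev
}

set.seed(1)
x_vec <- cumsum(rexp(ncol, lambda)) - xout * (1:ncol)
all.equal(x_loop, x_vec)
# [1] TRUE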
This row (30 columns) needs to be simulated 10,000 times. How to do this in a loop?
Use replicate with the code above:
sim_data = t(replicate(10000, {
  row = rexp(ncol, lambda)
  row = cumsum(row) - xout * (1:ncol)
}))
replicate gives 10000 columns and 30 rows. We use t() to transpose it to 10000 rows with 30 columns.
The values of each cell (x1, x2, x3 ...) must be limited to the interval x0 and xmax (0-8000). How to do this?
Use pmin() and pmax(). Not sure if you want this done before or after the cumulative summing...
sim_data = t(replicate(10000, {
  row = rexp(ncol, lambda)
  row = cumsum(row) - xout * (1:ncol)
  row = pmax(0, row)
  row = pmin(xmax, row)
  row
}))
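As a quick sanity check of the clamping and of the final shape (assuming xmax = 8000 as in the question):
pmin(xmax, pmax(0, c(-1200, 4500, 9300)))
# [1]    0 4500 8000
dim(sim_data)
# [1] 10000    30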

Related

Avoiding a loop when populating data frames in R

I have an empty data frame T_modelled with 2784 columns and 150 rows.
T_modelled <- data.frame(matrix(ncol = 2784, nrow = 150))
names(T_modelled) <- paste0("t=", t_sec_ERT)
rownames(T_modelled) <- paste0("z=", seq(from = 0.1, to = 15, by = 0.1))
where
t_sec_ERT <- seq(from = -23349600, to = 6706800, by = 10800)
z <- seq(from = 0.1, to = 15, by = 0.1)
I filled T_modelled by column with a nested for loop, based on a formula:
for (i in 1:ncol(T_modelled)) {
  col_tmp <- colnames(T_modelled)[i]
  for (j in 1:nrow(T_modelled)) {
    z_tmp <- z[j] - 0.1
    T_tmp <- MANSRT+As*e^(-z_tmp*(omega/(2*K))^0.5)*sin(omega*t_sec_ERT[i]-((omega/(2*K))^0.5)*z_tmp)
    T_modelled[j, col_tmp] <- T_tmp
  }
}
where
MANSRT <- -2.051185
As <- 11.59375
omega <- (2*pi)/(347.875*24*60*60)
c <- 790
k <- 0.00219
pb <- 2600
K <- (k*1000)/(c*pb)
e <- exp(1)
I do get the desired results, but I keep thinking there must be a more efficient way of filling that data frame. The loop is quite slow and looks cumbersome to me. I guess there is an opportunity to take advantage of R's vectorized way of calculating; I just cannot see how to incorporate the formula in an easier way to fill T_modelled.
Anyone got any ideas how to get the same result in a faster, more "R-like" manner?
I believe this does it.
Run this first instruction right after creating T_modelled; it will be needed to test that the results are equal.
Tm <- T_modelled
Now run your code then run the code below.
z_tmp <- z - 0.1
for (i in 1:ncol(Tm)) {
  T_tmp <- MANSRT + As*exp(-z_tmp*(omega/(2*K))^0.5)*sin(omega*t_sec_ERT[i]-((omega/(2*K))^0.5)*z_tmp)
  Tm[, i] <- T_tmp
}
all.equal(T_modelled, Tm)
#[1] TRUE
You don't need the inner loop, that's the only difference.
(I also used exp directly but that is of secondary importance.)
Much like the solution to your previous question which you accepted, consider simply using sapply, iterating through the vector t_sec_ERT, which is the same length as your desired data frame's number of columns. But first adjust every element of z by 0.1. Also, there's no need to create an empty data frame beforehand.
z_adj <- z - 0.1
T_modelled2 <- data.frame(sapply(t_sec_ERT, function(ert)
  MANSRT+As*e^(-z_adj*(omega/(2*K))^0.5)*sin(omega*ert-((omega/(2*K))^0.5)*z_adj)))
colnames(T_modelled2) <- paste0("t=", t_sec_ERT)
rownames(T_modelled2) <- paste0("z=", z)
all.equal(T_modelled, T_modelled2)
# [1] TRUE
Rui is of course correct, I just want to suggest a way of reasoning when writing a loop like this.
You have two numeric vectors. Functions for numerics in R are usually vectorized. By which I mean you can do stuff like this
x <- c(1, 6, 3)
sum(x)
not needing something like this
x_ <- 0
for (i in x) {
x_ <- i + x_
}
x_
That is, there is no need for explicit looping in R. Of course looping takes place nonetheless; it just happens in the underlying C, Fortran, etc. code, where it can be done more efficiently. This is usually what we mean when we call a function vectorized: the looping happens "under the hood", as it were. The output of Vectorize() thus isn't strictly vectorized by this definition.
When you have two numeric vectors you want to loop over, first check whether the constituent functions are vectorized, usually by reading the docs.
If they are, continue by constructing that central vectorized compound function and start testing it with one vector and one scalar. In your case it would be something like this (testing with just the first element of t_sec_ERT).
z_tmp <- z - 0.1
i <- 1
T_tmp <- MANSRT + As *
exp(-z_tmp*(omega/(2*K))^0.5) *
sin(omega*t_sec_ERT[i] - ((omega/(2*K))^0.5)*z_tmp)
Looks OK. Then you start looping over the elements of t_sec_ERT.
T_tmp <- matrix(nrow = length(z), ncol = length(t_sec_ERT))
for (i in 1:length(t_sec_ERT)) {
  T_tmp[, i] <- MANSRT + As *
    exp(-z_tmp*(omega/(2*K))^0.5) *
    sin(omega*t_sec_ERT[i] - ((omega/(2*K))^0.5)*z_tmp)
}
Or you can do it with sapply() which is often neater.
f <- function(x) {
  MANSRT + As *
    exp(-z_tmp*(omega/(2*K))^0.5) *
    sin(omega*x - ((omega/(2*K))^0.5)*z_tmp)
}
T_tmp <- sapply(t_sec_ERT, f)
I would prefer to put the data in a long format, with all combinations of z and t_sec_ERT as two columns, in order to take advantage of vectorization. Although I usually prefer tidyr for switching between long and wide formats, I've tried to keep this as a base solution:
t_sec_ERT <- seq(from = -23349600, to = 6706800, by = 10800)
z <- seq(from = 0.1, to = 15, by = 0.1)
v <- expand.grid(t_sec_ERT, z)
names(v) <- c("t_sec_ERT", "z")
v$z_tmp <- v$z-0.1
v$T_tmp <- MANSRT+As*e^(-v$z_tmp*(omega/(2*K))^0.5)*sin(omega*v$t_sec_ERT-((omega/(2*K))^0.5)*v$z_tmp)
T_modelled <- data.frame(matrix(v$T_tmp, nrow = length(z), ncol = length(t_sec_ERT), byrow = TRUE))
names(T_modelled) <- paste0("t=", t_sec_ERT)
rownames(T_modelled) <- paste0("z=", seq(from = 0.1, to = 15, by = 0.1))

Use an 'apply' function to perform code with conditional statements in R

I have been working on a project for which I need to find peaks and valleys in a dataset (not just the highest numbers per column, but all of the peaks and valleys).
I did manage to get it to work on 1 column, but I use a for-loop for that and I need to do this for about 50 columns, so I think I should use an 'apply' function. I just don't know how to do so. Can I put 'if' statements and such in an 'apply' function?
Here is what I used for checking one column:
('First' is the name of the dataset and 'Seq1' is the first column)
Lowest = 0
Highest = 0
Summits = vector('numeric')
Valleys = vector('numeric')
for (i in 1:length(First$Seq1))
{
  if (!is.na(First$Seq1[i+1]))
  {
    if (First$Seq1[i] < Lowest) {Lowest = First$Seq1[i]}
    if (First$Seq1[i] > Highest) {Highest = First$Seq1[i]}
    if (First$Seq1[i] > 0 && First$Seq1[i+1] < 0)
      { Summits <- append(Summits, Highest, after=length(Summits)) }
    if (First$Seq1[i] < 0 && First$Seq1[i+1] > 0)
      { Valleys <- append(Valleys, Lowest, after=length(Summits)) }
  }
}
Sure you can! I would first define a helper function that specifies what is to be done with one specific column, and then call that function within apply:
HelperFun <- function(x) {
  # your code from above, replacing 'Seq1' by x
}
apply(First, 2, HelperFun)
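To answer the 'if' part directly: yes, conditionals are perfectly fine inside the function you pass to an apply. A tiny made-up example (not your peak/valley logic; sapply over a data frame works the same way):
classify <- function(x) {
  if (mean(x, na.rm = TRUE) > 0) "mostly positive" else "mostly negative"
}
sapply(data.frame(a = c(1, 2, -1), b = c(-3, -1, 0)), classify)
#                 a                 b
# "mostly positive" "mostly negative"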
An *apply function is not better for this than a for loop, provided you don't grow an object in the for loop. You must never use append in a loop. Pre-allocate your results object and fill it.
This finds all local minima on a grid:
#an example
set.seed(42)
plane <- matrix(rnorm(100, sd = 5), 10)
#plot
library(raster)
plot(raster(plane))
#initialize a logical matrix
res <- matrix(TRUE, ncol = ncol(plane), nrow = nrow(plane))
#check for each subgrid of 2 times 2 cells which of the cells is the minimum
for (i in 1:(nrow(plane) - 1)) {
  for (j in 1:(ncol(plane) - 1)) {
    inds <- as.matrix(expand.grid(r = i + 0:1, c = j + 0:1))
    #cell must be a minimum of all 4 subgrids it is part of
    res[inds] <- res[inds] & plane[inds] == min(plane[inds])
  }
}
print(res)
plane[res]
#[1] -13.282277 -8.906542 -8.585043 -12.071038 -3.919195 -14.965450 -5.215595 -5.498904 -5.971644 -2.380870 -7.296070
#highlight local minima
plot(rasterToPolygons(raster(res)), border = t(res), add = TRUE)
library(reshape2)
res1 <- melt(res)
res1 <- res1[res1$value,]
text(x = res1$Var2 /10 - 0.05,
y = 1-res1$Var1 /10 + 0.05,
labels = round(plane[res],1))
I've assumed here that diagonal neighbors are counted as neighbors and not only neighbors in the same column or row. But this would be trivial to change.
I know that this is not the solution you want (you have one-dimensional time series), but here is a (more direct) variation on Roland's solution.
#example data
set.seed(42)
plane <- matrix(rnorm(100, sd = 5), 10)
library(raster)
r <- raster(plane)
f <- focal(r, matrix(1,3,3), min, pad=TRUE, na.rm=TRUE)
x <- r == f
mins <- mask(r, x, maskvalue=FALSE)
pts <- rasterToPoints(mins)
cells <- cellFromXY(x, pts)
r[cells]
plot(r)
text(mins, digits=1)
plot(rasterToPolygons(mins), add=TRUE)

weird error with R when using data.table

I'm doing some small calculations and I decided to fill the data inside a data.table, since it's much faster than data.frame and rbind.
So basically my code is something like this:
df is a data.frame used in the calculation; what it contains is shown in the EDIT below.
l=12000
dti = 1
dt = data.table(ni = 0, nj = 0, regerr = 0)
for (i in seq(1, 12000, 200)) {
  for (j in seq(1, 12000, 200)) {
    for (ind in 1:nrow(df)) {
      if (i + j >= l/2) {
        df[ind,]$X = df[ind,]$pos * 2
      } else {
        df[ind,]$X = df[ind,]$pos/l
      }
    }
    for (i in 1:100) { # 100 samples
      sample(df$X, nrow(df), replace=FALSE)
      fit = lm(X ~ gx, df) # linear regression calculation
      regerror = sum(residuals(fit)^2)
      print(paste(i, j, regerror))
      set(dt, dti, 1L, as.double(i))
      set(dt, dti, 2L, as.double(j))
      set(dt, dti, 3L, regerror)
      dti = dti + 1
    }
  }
}
The code prints the first few rounds of print(paste(i,j,regerror)) and then it quits with this error:
*** caught segfault ***
address 0x3ff00008, cause 'memory not mapped'
Segmentation fault (core dumped)
EDIT
structure(list(ax = c(-0.0242214, 0.19770304, 0.01587302, -0.0374415,
0.05079826, 0.12209738), gx = c(-0.3913043, -0.0242214, -0.4259067,
-0.725, -0.0374415, 0.01587302), pos = c(11222, 13564, 16532,
12543, 12534, 14354)), .Names = c("ax", "gx", "pos"), row.names = c(NA,
-6L), class = "data.frame")
Any ideas are appreciated.
Without meaning to sound rude, I think you may benefit from reading a few R tutorials before going forward. This question is also very likely to be closed as too localized. Also, seg faults are almost always a bug somewhere, but you can avoid a lot of this headache by understanding what each piece of your code is doing. Since it's Friday, let's walk through some of it:
if( i+j >= l/2 ){
  data[ind,]$X = df[ind,]$pos * 2
}
else{
  data[ind,]$X = df[ind,]$pos/l
}
I'll assume data is meant to be df and go from there. We're inside two loops of i and j that both go from 1 through 12000. They will never sum to less than 1/2, so you will always execute the first statement. Also, if you ever expected the FALSE case to occur, you would need else on the same line as your closing brace:
if (i + j >= 1/2) {
df$X <- df$pos * 2
} else {
df$X <- df$pos
}
R is vectorized, so doing the above is the same as looping through every value and multiplying by 2. I also removed the / 1 statement since it doesn't do anything. This whole section can be moved outside of the loop, since it's a constant operation of adding a column X that is double the column pos.
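To make the vectorization point concrete, using the pos values from your EDIT:
pos <- c(11222, 13564, 16532, 12543, 12534, 14354)
pos * 2  # every element doubled in one statement; no loop over rows needed
# [1] 22444 27128 33064 25086 25068 28708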
Next, your loop where you do a fit:
for (i in 1:100) { # 100 sample
  sample(df$X, nrow(df), replace=FALSE)
  fit = lm(X ~ gx, df) #linear regression calculation
  regerror = sum(residuals(fit)^2)
  print(paste(i, j, regerror))
  set(dt, dti, 1L, as.double(i))
  set(dt, dti, 2L, as.double(j))
  set(dt, dti, 3L, regerror)
  dti = dti + 1
}
Calling sample(df$X, nrow(df), replace=FALSE) will only show you the new order; it doesn't actually assign it. Instead use df$X <- sample(df$X, nrow(df), replace=FALSE).
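A two-line illustration of the difference:
x <- 1:5
sample(x)       # returns a shuffled copy and prints it, but x itself is unchanged
x               # still 1 2 3 4 5
x <- sample(x)  # this is what actually reorders x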
Now, it looks like you're going to assign into dt (which, like df, is the name of a base function and should be avoided as a variable name) at row dti the result of this fit error as well as your indices? As far as I can tell, nothing depends on i or j. Instead, you're going to perform a randomly ordered fit 60 * 60 * 100 times... If that is what you want to do, by all means go for it! But do it in an efficient way:
df$X <- df$pos * 2
fit.fun <- function(n, dat) {
  jumble <- sample(nrow(dat))
  dat$X <- dat$X[jumble]
  sum(residuals(lm(X ~ gx, dat))^2)
}
sapply(1:10, fit.fun, dat=df)

Faster solution to looped grouped RLE calculation

I have a working solution to my problem, but I will not be able to use it because it is so slow (my calculations predict that the whole simulation will take 2-3 years!). Thus I am looking for a better (faster) solution. This is (in essence) the code I am working with:
N = 4
x <- NULL
for (i in 1:N) { # first loop
  v <- sample(0:1, 1000000, 1/2)  # generate data
  v <- as.data.frame(v)           # convert to data frame
  v$t <- rep(1:2, each=250)       # group
  v$p <- rep(1:2000, each=500)    # p.number
  # second loop
  for (j in 1:2000) {
    # count rle for group 1 for each p.number
    x <- rbind(x, table(rle(v$v[v$t==1 & v$p==j])))
    # count rle for group 2 for each p.number
    x <- rbind(x, table(rle(v$v[v$t==2 & v$p==j])))
  } # end second loop
} # end first loop
# total rle counts for both group 1 & 2
y <- aggregate(x, list(as.numeric(rownames(x))), sum)
In words: the code generates a coin-flip simulation (v). A group factor is generated (1 & 2). A p.number factor is generated (1:2000). The run lengths are recorded for each p.number (1:2000) for both group 1 & group 2 (each p.number has runs in both groups). After N loops (the first loop), the total run lengths are presented as a table (aggregate); that is, the run lengths for each group, for each p.number, over N loops as a total.
I need the first loop because the data that I am working with comes in individual files (so I'm loading the file, calculating various statistics etc and then loading the next file and doing the same). I am much less attached to the second loop, but can't figure out how to replace it with something faster.
What can be done to the second loop to make it (hopefully, a lot) faster?
You are committing the cardinal sin of growing an object within a for() loop in R. Don't (I repeat don't) do this. Allocate sufficient storage for x at the beginning and then fill in x as you go.
x <- matrix(nrow = N * (2000 * 2), ncol = ??)
Then in the inner loop
x[ii, ] <- table(rle(....))
where ii is a loop counter that you initialise to 1 before the first loop and increment within the second loop:
x <- matrix(nrow = N * (2000 * 2), ncol = ??)
ii <- 1
for (i in 1:N) {
  .... # stuff here
  for (j in 1:2000) {
    .... # stuff here
    x[ii, ] <- table(rle(....))
    ## increment ii
    ii <- ii + 1
    x[ii, ] <- table(rle(....))
    ## increment ii
    ii <- ii + 1
  } ## end inner loop
} ## end outer loop
Also note that you are reusing the index i in both for() loops, which will not work. i is just a normal R object, and so both for() loops will be overwriting it as they progress. Use j for the second loop as I did above.
Try that simple optimisation first and see if that will allow the real simulation to complete in an acceptable amount of time. If not, come back with a new Q showing the latest code and we can think about other optimisations. The optimisation above is simple to do; optimising table() and rle() might take a lot more work. That said, you might look at the tabulate() function, which does the heavy lifting in table() and might be one avenue for optimising that particular step.
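As a rough illustration of the difference (a hypothetical toy run-length vector):
runs <- rle(c(1, 1, 0, 1, 1, 1, 0, 0))$lengths  # 2 1 3 2
table(runs)     # named counts of each run length: "1" -> 1, "2" -> 2, "3" -> 1
tabulate(runs)  # positional counts: 1 2 1 (bin i = number of runs of length i)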
If you just want to run rle and table for each combination of the values of v$t and v$p separately, there is no need for the second loop. It is much faster this way:
values <- v$v + v$t * 10 + v$p * 100
runlength <- rle(values)
runlength$values <- runlength$values %% 2
x <- table(runlength)
y <- aggregate(unclass(x), list(as.numeric(rownames(x))), sum)
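To see why the encoding works, here is a tiny made-up example with 8 flips, one p.number, and the t boundary in the middle; the run of 1s that straddles the boundary is split at it, and %% 2 recovers the original coin value of each run:
v_small <- data.frame(v = c(1, 1, 0, 1, 1, 1, 1, 0),
                      t = rep(1:2, each = 4),
                      p = rep(1, 8))
vals <- v_small$v + v_small$t * 10 + v_small$p * 100
vals
# [1] 111 111 110 111 121 121 121 120
rl <- rle(vals)
rl$lengths      # 2 1 1 3 1 -- the run of four 1s crossing the t boundary is split into 1 + 3
rl$values %% 2  # 1 0 1 1 0 -- the original coin value for each run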
The whole code will look like this. If N is as low as 4, the growing object x will not be a severe problem, but generally I agree with @GavinSimpson that it is not a good programming technique.
N = 4
x <- NULL
for (i in 1:N) { # first loop
  v <- sample(0:1, 1000000, 1/2)  # generate data
  v <- as.data.frame(v)           # convert to data frame
  v$t <- rep(1:2, each=250)       # group
  v$p <- rep(1:2000, each=500)    # p.number
  values <- v$v + N * 10 + v$t * 100 + v$p * 1000
  runlength <- rle(values)
  runlength$values <- runlength$values %% 2
  x <- rbind(x, table(runlength))
} # end first loop
y <- aggregate(x, list(as.numeric(rownames(x))), sum) # total rle counts

R filter() dealing with NAs

I am trying to implement a Chebyshev filter to smooth a time series but, unfortunately, there are NAs in the data series.
For example,
t <- seq(0, 1, len = 100)
x <- c(sin(2*pi*t*2.3) + 0.25*rnorm(length(t)),NA, cos(2*pi*t*2.3) + 0.25*rnorm(length(t)))
I am using a Chebyshev filter: cf1 = cheby1(5, 3, 1/44, type = "low")
I am trying to filter the time series excluding the NAs, but without messing up the order/positions. I have already tried na.rm=T, but it seems there's no such argument.
Then
z <- filter(cf1, x) # apply filter
Thank you guys.
Try using x <- x[!is.na(x)] to remove the NAs, then run the filter.
You can remove the NAs beforehand using the complete.cases function. You also might consider imputing the missing data. Check out the mtsdi or Amelia II packages.
EDIT:
Here's a solution with Rcpp. This might be helpful if speed is important:
require(inline)
require(Rcpp)
t <- seq(0, 1, len = 100)
set.seed(7337)
x <- c(sin(2*pi*t*2.3) + 0.25*rnorm(length(t)),NA, cos(2*pi*t*2.3) + 0.25*rnorm(length(t)))
NAs <- x
x2 <- x[!is.na(x)]
#do something to x2
src <- '
Rcpp::NumericVector vecX(vx);
Rcpp::NumericVector vecNA(vNA);
int j = 0; //counter for vx
for (int i = 0; i < vecNA.size(); i++) {
  if (!(R_IsNA(vecNA[i]))) {
    //replace and update j
    vecNA[i] = vecX[j];
    j++;
  }
}
return Rcpp::wrap(vecNA);
'
fun <- cxxfunction(signature(vx="numeric", vNA="numeric"),
                   src, plugin="Rcpp")
if (identical(x,fun(x2,NAs)))
print("worked")
# [1] "worked"
I don't know if ts objects can have missing values, but if you just want to re-insert the NA values, you can use ?insert from R.utils. There might be a better way to do this.
install.packages(c('R.utils', 'signal'))
require(R.utils)
require(signal)
t <- seq(0, 1, len = 100)
set.seed(7337)
x <- c(sin(2*pi*t*2.3) + 0.25*rnorm(length(t)), NA, NA, cos(2*pi*t*2.3) + 0.25*rnorm(length(t)), NA)
cf1 = cheby1(5, 3, 1/44, type = "low")
xex <- na.omit(x)
z <- filter(cf1, xex) # apply
z <- as.numeric(z)
for (m in attributes(xex)$na.action) {
  z <- insert(z, ats = m, values = NA)
}
all.equal(is.na(z), is.na(x))
?insert
Here is a function you can use to filter a signal with NAs in it.
The NAs are ignored rather than replaced by zero.
You can then specify a maximum percentage of weight which the NAs may take at any point of the filtered signal. If there are too many NAs (and too few actual data) at a specific point, the filtered signal itself will be set to NA.
# This function applies a filter to a time series with potentially missing data
filter_with_NA <- function(x,
                           window_length = 12,  # will be applied centrally
                           myfilter = rep(1/window_length, window_length),  # a boxcar filter by default
                           max_percentage_NA = 25)  # which percentage of weight created by NA should not be exceeded
{
  # make the signal longer at both sides
  signal <- c(rep(NA, window_length), x, rep(NA, window_length))
  # see where data are present and not NA
  present <- is.finite(signal)
  # replace the NA values by zero
  signal[!is.finite(signal)] <- 0
  # apply the filter
  filtered_signal <- as.numeric(filter(signal, myfilter, sides = 2))
  # find out which percentage of the filtered signal was created by non-NA values
  # this is easy because the filter is linear
  original_weight <- as.numeric(filter(present, myfilter, sides = 2))
  # where this is lower than one, the signal is now artificially smaller
  # because we added zeros - compensate for that
  filtered_signal <- filtered_signal / original_weight
  # but where there are too few values present, discard the signal
  filtered_signal[100*(1 - original_weight) > max_percentage_NA] <- NA
  # cut away the padding to left and right which we previously inserted
  filtered_signal <- filtered_signal[(window_length + 1):(window_length + length(x))]
  return(filtered_signal)
}
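A possible way to call it on the series from the question (a usage sketch; note that filter() inside the function is stats::filter, so run it without the signal package attached, or spell out stats::filter, to avoid masking by signal::filter):
t <- seq(0, 1, len = 100)
set.seed(7337)
x <- c(sin(2*pi*t*2.3) + 0.25*rnorm(length(t)), NA, cos(2*pi*t*2.3) + 0.25*rnorm(length(t)))
smoothed <- filter_with_NA(x, window_length = 12, max_percentage_NA = 25)
plot(x, type = "l")
lines(smoothed, col = "red", lwd = 2)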
