As part of a project I wrote a smoother to fill in missing data. I use the slope of the previous data points to calculate each new value; after calculating a new point I use it to calculate the next one, and so on. Hence I used a while loop to calculate each value, both from left to right and from right to left, eventually taking the average of the two results. This script works fine!
Although I expect I could speed this up considerably with a function from the apply family, I want to keep the while loop for now. The script is, however, really slow (about 3 days for ~2,500,000 data points). Do you have any tips for speeding up the current script?
# Loop from bottom -> top
number_rows <- nrow(weight_id)
i <- nrow(weight_id)
while (i >= 1) {
  j <- as.integer(weight_id[i, 1])
  prev1 <- temp[j + 1, ]$new_MAP_bottom
  if (j < max(weight_id)) {
    previous_slope <- ifelse((temp[j + 2, ]$duration - temp[j + 1, ]$duration) > 0,
                             prev1 - temp[j + 2, ]$new_MAP_bottom, 0)
  } else {
    previous_slope <- 0
  }
  new_MAP <- round(prev1 + ((previous_slope - (factor * temp[j, ]$steps)) / (1 + factor)), digits = 2)
  temp[j, ]$new_MAP_bottom <- new_MAP
  i <- i - 1
}
# Loop from top -> bottom
weight_factor <- 0
i <- 1
while (i <= nrow(weight_id)) {
  j <- as.integer(weight_id[i, 1])
  prev1 <- temp[j - 1, ]$new_MAP_top
  if (j > 2) {
    previous_slope <- ifelse((temp[j - 1, ]$duration - temp[j - 2, ]$duration) > 0,
                             prev1 - temp[j - 2, ]$new_MAP_top, 0)
  } else {
    previous_slope <- 0
  }
  new_MAP <- round(prev1 + ((previous_slope + (factor * temp[j, ]$steps)) / (1 + factor)), digits = 2)
  temp[j, ]$new_MAP_top <- new_MAP
  # Take a weighted average of the two approaches (top -> bottom / bottom -> top)
  if (weight_factor < 1) { weight_factor <- temp[j, ]$weight - 1 }
  weight_top <- weight_factor
  weight_bottom <- temp[j, ]$weight - weight_factor
  if (weight_top > weight_bottom) { weight_top <- weight_top - 1 }
  if (weight_top < weight_bottom) { weight_bottom <- weight_bottom - 1 }
  temp[j, ]$MAP <- round(((new_MAP * weight_top) + (temp[j, ]$new_MAP_bottom * weight_bottom)) /
                           (weight_top + weight_bottom), digits = 0)
  weight_factor <- weight_factor - 1
  i <- i + 1
}
I did not read all of your code, especially without example data, but from the textual description it sounds like plain linear interpolation. Please check whether the built-in functions approx and approxfun already do what you are trying to implement yourself, as these will be better optimized than anything you can write with reasonable effort.
par(mfrow=c(2,1))
example <- data.frame(x = 1:14,
                      y = c(3, 4, 5, NA, NA, NA, 6, 7, 8.1, 8.2, NA, 8.4, 8.5, NA))
plot(example)
f <- approxfun(example)
plot(example$x, f(example$x))
The apply family tends to give you shorter, more succinct code, but not necessarily much more speed than loops. If you are after speed, first check whether somebody else has already implemented what you need, then try vectorization.
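As a generic illustration of what vectorization buys (a sketch, unrelated to your specific smoother): computing consecutive differences with an explicit loop versus letting diff() do the looping in compiled code.
# illustrative toy data only
y <- c(3, 4, 5, 6, 7)
# explicit loop
slopes_loop <- numeric(length(y) - 1)
for (i in seq_along(slopes_loop)) slopes_loop[i] <- y[i + 1] - y[i]
# vectorized: the same loop happens in compiled code
slopes_vec <- diff(y)
identical(slopes_loop, slopes_vec)
# [1] TRUE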
Edit:
The following runs in about a second on my computer. If it does something close enough to your own "linear smoother" that you can replace yours with it, that is a speed increase of about 3 days.
n <- 2500000
example <- data.frame(x = 1:n,
                      y = sample(1:1000, n, replace = TRUE))
example$y[sample(1:n, n/5)] <- NA
print(Sys.time())
f <- approxfun(example)
mean(f(example$x))
print(Sys.time())
I am looking at a set of numbers and aiming to split them into subsets via set partitioning. The deciding factor in how these subsets are generated is that the sum of the elements in each subset should be as close as possible to a number generated by a pre-determined distribution. The subsets need not be the same size, and each element can only be in one subset. I had previously been given guidance on this problem via the greedy algorithm (Link here), but I have found that some of the larger numbers in the set drastically skewed the results. I would therefore like to use some form of set partitioning for this problem.
A deeper underlying issue, which I would really like to correct for future problems, is that I find I am drawn to the "brute force" approach with these types of problems (as you can see from my code below, which attempts to use folds to solve the problem by brute force). This is obviously a completely inefficient way to tackle the problem, so I would like to approach these minimization-type problems more intelligently going forward. Any advice is greatly appreciated.
library(groupdata2)
library(dplyr)
set.seed(345)
j <- runif(500,0,10000000)
dist <- c(.3,.2,.1,.05,.065,.185,.1)
s_diff <- 9999999999
for (i in 1:100) {
  x <- fold(j, k = length(dist), method = "n_rand")
  if (abs(sum(j) * dist[1] - sum(j[which(x$.folds == 1)])) < abs(s_diff)) {
    s_diff <- abs(sum(j) * dist[1] - sum(j[which(x$.folds == 1)]))
    x_fin <- x
  }
}
This is just a simplified version, looking only at the first 'subset'. s_diff is the smallest difference between the theoretical and simulated results, and x_fin records which subset each element falls into (i.e. which fold it corresponds to). I was then going to remove the elements that fell into the first subset and continue from there, but I know my method is inefficient.
Thanks in advance!
This is not a trivial problem, as you will probably gather from the complete lack of answers at 10 days, even with a bounty. As it happens, I think it is a great problem for thinking about algorithms and optimizations, so thanks for posting.
The first thing I would note is that you are absolutely right that this is not the kind of problem with which to try brute force. You may get near to a correct answer, but with a non-trivial number of samples and distribution points, you won't find the optimum solution. You need an iterative approach that moves elements about only if they make the fit better, and the algorithm needs to stop when it can't make it any better.
My approach here is to split the problem into three stages:
1. Cut the data into approximately the correct bins as a first approximation.
2. Move elements from the bins that are a bit too big to the ones that are a bit too small. Do this iteratively until no more moves will optimize the bins.
3. Swap elements between columns to fine-tune the fit, until the swaps are optimal.
The reason to do it in this order is that each step is computationally more expensive, so you want to pass a better approximation to each step before letting it do its thing.
Let's start with a function to cut the data into approximately the correct bins:
cut_elements <- function(j, dist)
{
  # Specify the sums that we want to achieve in each partition
  partition_sizes <- dist * sum(j)

  # The cumulative partition sizes give us our initial cuts
  partitions <- cut(cumsum(j), cumsum(c(0, partition_sizes)))

  # Name our partitions according to the given distribution
  levels(partitions) <- levels(cut(seq(0, 1, 0.001), cumsum(c(0, dist))))

  # Return our partitioned data as a data frame
  data.frame(data = j, group = partitions)
}
We want a way to assess how close this approximation (and subsequent approximations) is to our target. We can plot against the target distribution, but it will also be helpful to have a numerical figure for the goodness of fit to include on our plot. Here I will use the sum of the squared differences between the sample bins and the target bins, on a log scale to make the numbers more comparable. The lower the number, the better the fit.
library(dplyr)
library(ggplot2)
library(tidyr)
compare_to_distribution <- function(df, dist, title = "Comparison")
{
  df %>%
    group_by(group) %>%
    summarise(estimate = sum(data) / sum(j)) %>%
    mutate(group = factor(cumsum(dist))) %>%
    mutate(target = dist) %>%
    pivot_longer(cols = c(estimate, target)) ->
  plot_info

  log_ss <- log(sum((plot_info$value[plot_info$name == "estimate"] -
                     plot_info$value[plot_info$name == "target"])^2))

  ggplot(data = plot_info, aes(x = group, y = value, fill = name)) +
    geom_col(position = "dodge") +
    labs(title = paste(title, ": log sum of squares =", round(log_ss, 2)))
}
So now we can do:
cut_elements(j, dist) %>% compare_to_distribution(dist, title = "Cuts only")
We can see that the fit is already pretty good with a simple cut of the data, but we can do a lot better by moving appropriately sized elements from the over-sized bins to the under-sized bins. We do this iteratively until no more moves will improve our fit. We use two nested while loops, which should make us worry about computation time, but we have started with a close match, so we shouldn't get too many moves before the loop stops:
move_elements <- function(df, dist)
{
  ignore_max <- length(dist)
  while (ignore_max > 0)
  {
    ignore_min <- 1
    match_found <- FALSE
    while (ignore_min < ignore_max)
    {
      group_diffs <- sort(tapply(df$data, df$group, sum) - dist * sum(df$data))
      group_diffs <- group_diffs[ignore_min:ignore_max]
      too_big <- which.max(group_diffs)
      too_small <- which.min(group_diffs)
      swap_size <- (group_diffs[too_big] - group_diffs[too_small]) / 2
      which_big <- which(df$group == names(too_big))
      candidate_row <- which_big[which.min(abs(swap_size - df[which_big, 1]))]
      if (df$data[candidate_row] < 2 * swap_size)
      {
        df$group[candidate_row] <- names(too_small)
        ignore_max <- length(dist)
        match_found <- TRUE
        break
      }
      else
      {
        ignore_min <- ignore_min + 1
      }
    }
    if (match_found == FALSE) ignore_max <- ignore_max - 1
  }
  return(df)
}
Let's see what that has done:
cut_elements(j, dist) %>%
  move_elements(dist) %>%
  compare_to_distribution(dist, title = "Cuts and moves")
You can see now that the match is so close we are struggling to see whether there is any difference between the target and the partitioned data. That's why we needed the numerical measure of GOF.
Still, let's get this fit as good as possible by swapping elements between columns to fine-tune them. This step is computationally expensive, but again we are already giving it a close approximation, so it shouldn't have much to do:
swap_elements <- function(df, dist)
{
  ignore_max <- length(dist)
  while (ignore_max > 0)
  {
    ignore_min <- 1
    match_found <- FALSE
    while (ignore_min < ignore_max)
    {
      group_diffs <- sort(tapply(df$data, df$group, sum) - dist * sum(df$data))
      too_big <- which.max(group_diffs)
      too_small <- which.min(group_diffs)
      current_excess <- group_diffs[too_big]
      current_defic <- group_diffs[too_small]
      current_ss <- current_excess^2 + current_defic^2
      all_pairs <- expand.grid(df$data[df$group == names(too_big)],
                               df$data[df$group == names(too_small)])
      all_pairs$diff <- all_pairs[, 1] - all_pairs[, 2]
      all_pairs$resultant_big <- current_excess - all_pairs$diff
      all_pairs$resultant_small <- current_defic + all_pairs$diff
      all_pairs$sum_sq <- all_pairs$resultant_big^2 + all_pairs$resultant_small^2
      improvements <- which(all_pairs$sum_sq < current_ss)
      if (length(improvements) > 0)
      {
        swap_this <- improvements[which.min(all_pairs$sum_sq[improvements])]
        r1 <- which(df$data == all_pairs[swap_this, 1] & df$group == names(too_big))[1]
        r2 <- which(df$data == all_pairs[swap_this, 2] & df$group == names(too_small))[1]
        df$group[r1] <- names(too_small)
        df$group[r2] <- names(too_big)
        ignore_max <- length(dist)
        match_found <- TRUE
        break
      }
      else ignore_min <- ignore_min + 1
    }
    if (match_found == FALSE) ignore_max <- ignore_max - 1
  }
  return(df)
}
Let's see what that has done:
df <- cut_elements(j, dist) %>%
  move_elements(dist) %>%
  swap_elements(dist)

df %>% compare_to_distribution(dist, title = "Cuts, moves and swaps")
Pretty close to identical. Let's quantify that:
tapply(df$data, df$group, sum)/sum(j)
# (0,0.3] (0.3,0.5] (0.5,0.6] (0.6,0.65] (0.65,0.715] (0.715,0.9]
# 0.30000025 0.20000011 0.10000014 0.05000010 0.06499946 0.18500025
# (0.9,1]
# 0.09999969
So, we have an exceptionally close match: each partition is less than one part in one million away from the target distribution. Quite impressive considering we only had 500 measurements to put into 7 bins.
In terms of retrieving your data, we haven't touched the ordering of j within the data frame df:
all(df$data == j)
# [1] TRUE
and the partitions are all contained within df$group. So if we want a single function to return just the partitions of j given dist, we can just do:
partition_to_distribution <- function(data, distribution)
{
  cut_elements(data, distribution) %>%
    move_elements(distribution) %>%
    swap_elements(distribution) %>%
    `[`(, 2)
}
In conclusion, we have created an algorithm that creates an exceptionally close match. However, that's no good if it takes too long to run. Let's test it out:
microbenchmark::microbenchmark(partition_to_distribution(j, dist), times = 100)
# Unit: milliseconds
#                                expr      min       lq     mean   median       uq      max neval
#  partition_to_distribution(j, dist) 47.23613 47.56924 49.95605 47.78841 52.60657 93.00016   100
Only 50 milliseconds to fit 500 samples. Seems good enough for most applications. It would grow exponentially with larger samples (about 10 seconds on my PC for 10,000 samples), but by that point the relative fineness of the samples means that cut_elements %>% move_elements already gives you a log sum of squares of below -30 and would therefore be an exceptionally good match without the fine tuning of swap_elements. These would only take about 30 ms with 10,000 samples.
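If that trade-off suits your data, a lighter-weight variant could simply drop the fine-tuning step for large inputs. This is a sketch reusing the functions defined above, not part of my original answer; the name partition_to_distribution_fast is just illustrative.
# For large n, cutting and moving alone may already give an excellent fit
partition_to_distribution_fast <- function(data, distribution)   # illustrative name
{
  cut_elements(data, distribution) %>%
    move_elements(distribution) %>%
    `[`(, 2)
}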
To add to the excellent answer by #AllanCameron, here is a solution that utilizes the highly efficient function comboGeneral from RcppAlgos*.
library(RcppAlgos)
partDist <- function(v, d, tol_ratio = 0.0001) {
  tot_sum <- d * sum(v)
  orig_len <- length(v)
  tot_len <- d * orig_len

  df <- do.call(rbind, lapply(1L:(length(d) - 1L), function(i) {
    len <- as.integer(tot_len[i])
    vals <- comboGeneral(v, len,
                         constraintFun = "sum",
                         comparisonFun = "==",
                         limitConstraints = tot_sum[i],
                         tolerance = tol_ratio * tot_sum[i],
                         upper = 1)
    ind <- match(vals, v)
    v <<- v[-ind]
    data.frame(data = as.vector(vals), group = rep(paste0("g", i), len))
  }))

  len <- orig_len - nrow(df)
  rbind(df, data.frame(data = v,
                       group = rep(paste0("g", length(d)), len)))
}
The idea is that we find a subset of v (e.g. j in the OP's case) whose sum is within a tolerance of sum(v) * d[i] for some index i (d is equivalent to dist in the OP's example). After we find a solution (N.B. we put a cap on the number of solutions by setting upper = 1), we assign its elements to a group and then remove them from v. We then iterate until we are left with just enough elements in v to be assigned to the last distribution value (i.e. dist[length(dist)]).
Here is an example using the OP's data:
set.seed(345)
j <- runif(500,0,10000000)
dist <- c(.3,.2,.1,.05,.065,.185,.1)
system.time(df_op <- partDist(j, dist, 0.0000001))
user system elapsed
0.019 0.000 0.019
And using the function for plotting by #AllanCameron we have:
df_op %>% compare_to_distribution(dist, "RcppAlgos OP Ex")
What about a larger sample with the same distribution:
set.seed(123)
j <- runif(10000,0,10000000)
## N.B. Very small ratio
system.time(df_huge <- partDist(j, dist, 0.000000001))
user system elapsed
0.070 0.000 0.071
The results:
df_huge %>% compare_to_distribution(dist, "RcppAlgos Large Ex")
As you can see, the solution scales very well. We can speed up execution by loosening tol_ratio at the expense of the quality of the result.
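For instance (illustrative only, not from my timings above; results will vary with machine and seed), a looser tolerance on the large example might look like this, with df_loose just an illustrative name:
# Looser tolerance: faster, at the cost of how closely each group matches its target
system.time(df_loose <- partDist(j, dist, tol_ratio = 0.001))
df_loose %>% compare_to_distribution(dist, "RcppAlgos, loose tolerance")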
For reference, with the large data set the solution given by #AllanCameron takes just under 3 seconds and gives a similar log sum of squares value (~44):
system.time(allan_large <- partition_to_distribution(j, dist))
user system elapsed
2.261 0.675 2.938
* I am the author of RcppAlgos
I have an empty data frame T_modelled with 2784 columns and 150 rows.
T_modelled <- data.frame(matrix(ncol = 2784, nrow = 150))
names(T_modelled) <- paste0("t=", t_sec_ERT)
rownames(T_modelled) <- paste0("z=", seq(from = 0.1, to = 15, by = 0.1))
where
t_sec_ERT <- seq(from = -23349600, to = 6706800, by = 10800)
z <- seq(from = 0.1, to = 15, by = 0.1)
I filled T_modelled by column with a nested for loop, based on a formula:
for (i in 1:ncol(T_modelled)) {
  col_tmp <- colnames(T_modelled)[i]
  for (j in 1:nrow(T_modelled)) {
    z_tmp <- z[j] - 0.1
    T_tmp <- MANSRT + As*e^(-z_tmp*(omega/(2*K))^0.5)*sin(omega*t_sec_ERT[i] - ((omega/(2*K))^0.5)*z_tmp)
    T_modelled[j, col_tmp] <- T_tmp
  }
}
where
MANSRT <- -2.051185
As <- 11.59375
omega <- (2*pi)/(347.875*24*60*60)
c <- 790
k <- 0.00219
pb <- 2600
K <- (k*1000)/(c*pb)
e <- exp(1)
I do get the desired results, but I keep thinking there must be a more efficient way of filling that data frame. The loop is quite slow and looks cumbersome to me. I guess there is an opportunity to take advantage of R's vectorized way of calculating; I just cannot see how to incorporate the formula in an easier way to fill T_modelled.
Anyone got any ideas how to get the same result in a faster, more "R-like" manner?
I believe this does it.
Run this first instruction right after creating T_modelled; it will be needed to test that the results are equal.
Tm <- T_modelled
Now run your code then run the code below.
z_tmp <- z - 0.1
for (i in 1:ncol(Tm)) {
  T_tmp <- MANSRT + As*exp(-z_tmp*(omega/(2*K))^0.5)*sin(omega*t_sec_ERT[i] - ((omega/(2*K))^0.5)*z_tmp)
  Tm[, i] <- T_tmp
}
all.equal(T_modelled, Tm)
#[1] TRUE
You don't need the inner loop; that's the only difference.
(I also used exp directly, but that is of secondary importance.)
Much like the solution to your previous question, which you accepted, consider simply using sapply, iterating through the vector t_sec_ERT, which has the same length as your desired data frame's number of columns. But first adjust every element of z by 0.1. Also, there is no need to create the empty data frame beforehand.
z_adj <- z - 0.1
T_modelled2 <- data.frame(sapply(t_sec_ERT, function(ert)
  MANSRT + As*e^(-z_adj*(omega/(2*K))^0.5)*sin(omega*ert - ((omega/(2*K))^0.5)*z_adj)))
colnames(T_modelled2) <- paste0("t=", t_sec_ERT)
rownames(T_modelled2) <- paste0("z=", z)
all.equal(T_modelled, T_modelled2)
# [1] TRUE
Rui is of course correct; I just want to suggest a way of reasoning when writing a loop like this.
You have two numeric vectors. Functions for numerics in R are usually vectorized, by which I mean you can do things like this
x <- c(1, 6, 3)
sum(x)
not needing something like this
x_ <- 0
for (i in x) {
  x_ <- i + x_
}
x_
That is, there is no need for looping in R. Of course looping takes place nonetheless; it just happens in the underlying C, Fortran, etc. code, where it can be done more efficiently. This is usually what we mean when we call a function vectorized: the looping takes place "under the hood", as it were. The output of Vectorize() thus isn't strictly vectorized by this definition.
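As a quick illustration of that last point (f_scalar and f_vec are made up just for this demonstration):
# Vectorize() is only a wrapper around mapply(), so the looping still happens at the R level
f_scalar <- function(a, b) if (a > b) a - b else 0
f_vec <- Vectorize(f_scalar)
f_vec(c(5, 1, 7), c(2, 3, 4))
# [1] 3 0 3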
When you have two numeric vectors you want to loop over, first check whether the constituent functions are vectorized, usually by reading the docs.
If they are, you continue by constructing the central vectorized compound function and start testing it with one vector and one scalar. In your case it would be something like this (testing with just the first element of t_sec_ERT).
z_tmp <- z - 0.1
i <- 1
T_tmp <- MANSRT + As *
exp(-z_tmp*(omega/(2*K))^0.5) *
sin(omega*t_sec_ERT[i] - ((omega/(2*K))^0.5)*z_tmp)
Looks OK. Then you start looping over the elements of t_sec_ERT.
T_tmp <- matrix(nrow = length(z), ncol = length(t_sec_ERT))
for (i in 1:length(t_sec_ERT)) {
  T_tmp[, i] <- MANSRT + As *
    exp(-z_tmp*(omega/(2*K))^0.5) *
    sin(omega*t_sec_ERT[i] - ((omega/(2*K))^0.5)*z_tmp)
}
Or you can do it with sapply() which is often neater.
f <- function(x) {
  MANSRT + As *
    exp(-z_tmp*(omega/(2*K))^0.5) *
    sin(omega*x - ((omega/(2*K))^0.5)*z_tmp)
}
T_tmp <- sapply(t_sec_ERT, f)
I would prefer to put the data in a long format, with all combinations of z and t_sec_ERT as two columns, in order to take advantage of vectorization. Although I usually prefer tidyr for switching between long and wide formats, I've tried to keep this as a base solution:
t_sec_ERT <- seq(from = -23349600, to = 6706800, by = 10800)
z <- seq(from = 0.1, to = 15, by = 0.1)
v <- expand.grid(t_sec_ERT, z)
names(v) <- c("t_sec_ERT", "z")
v$z_tmp <- v$z-0.1
v$T_tmp <- MANSRT+As*e^(-v$z_tmp*(omega/(2*K))^0.5)*sin(omega*v$t_sec_ERT-((omega/(2*K))^0.5)*v$z_tmp)
T_modelled <- data.frame(matrix(v$T_tmp, nrow = length(z), ncol = length(t_sec_ERT), byrow = TRUE))
names(T_modelled) <- paste0("t=", t_sec_ERT)
rownames(T_modelled) <- paste0("z=", seq(from = 0.1, to = 15, by = 0.1))
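For completeness, here is a sketch of the tidyr route alluded to above (assumes tidyr >= 1.0, and the resulting column names would still need the same cosmetic renaming; the object wide is just an illustrative name, not part of the base solution):
library(tidyr)
# Reshape the long data v back into the wide z-by-time layout
wide <- pivot_wider(v[, c("z", "t_sec_ERT", "T_tmp")],
                    names_from = t_sec_ERT,
                    values_from = T_tmp,
                    names_prefix = "t=")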
I have a collection of DNA sequencing reads of various lengths, sorted from longest to shortest. I would like to know the largest number of reads I can include in a set such that the N50 of that set is above some threshold t.
For any given set of reads, the total amount of data is just the cumulative sum of the lengths of the reads. The N50 is defined as the length of the read such that half of the data are contained in reads at least that long.
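As a tiny worked illustration of that definition (made-up lengths, not my real data):
lens <- c(10, 8, 6, 4, 2)                # read lengths, longest first
cumsum(lens)                             # 10 18 24 28 30; total data = 30
# half the data is 15; reads of length >= 8 already contain 18 bases, so the N50 is 8
lens[min(which(cumsum(lens) >= sum(lens) / 2))]
# [1] 8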
I have a solution below, but it is slow for very large read sets. I tried vectorising it, but that was slower (probably because my threshold is usually relatively large, so my solution below stops calculating fairly early on).
Here's a worked example:
df = data.frame(l = 100:1) # read lengths
df$cs = cumsum(df$l) # getting the cumulative sum is easy and quick
t = 95 # let's imagine that this is my threshold N50
for (i in 1:nrow(df)) {
  N50 = df$l[min(which(df$cs > df$cs[i]/2))]
  if (N50 < t) { break }
}
# the loop will have gone one too far, so I subtract one
number.of.reads = as.integer(i-1)
This works fine on small datasets, but my actual data are more like 5 million reads that vary in length from ~200,000 down to 1 (longer reads are rarer), and with an N50 threshold of 100,000 it gets pretty slow.
This example is closer to something that's realistic. It takes ~15s on my desktop.
l = ceiling(runif(100000, min = 0, max = 19999))
l = sort(l, decreasing = T)
df = data.frame(l = l)
df$cs = cumsum(df$l)
t = 18000
for (i in 1:nrow(df)) {
  n = df$l[min(which(df$cs > df$cs[i]/2))]
  if (n < t) { break }
}
result = as.integer(i-1)
So, I'm interested in any ideas, tips, or tricks to noticeably optimise this. It seems like this should be possible, but I'm out of ideas.
As n is decreasing with i, you should use a binary search algorithm.
binSearch <- function(min, max) {
  print(mid <- floor(mean(c(min, max))))
  if (mid == min) {
    if (df$l[min(which(df$cs > df$cs[min]/2))] < t) {
      return(min - 1)
    } else {
      return(max - 1)
    }
  }
  n = df$l[min(which(df$cs > df$cs[mid]/2))]
  if (n >= t) {
    return(binSearch(mid, max))
  } else {
    return(binSearch(min, mid))
  }
}
Then, just call
binSearch(1, nrow(df))
Since your data are ordered by read length, maybe you could avoid testing every single row. Instead, you can iterate and test a limited number of reasonably spaced rows at each iteration (using while(), for example), and so get progressively closer to your solution. This should make things much faster. Just make sure that once you get close to the solution, you stop iterating.
This is your solution
set.seed(111)
l = ceiling(runif(100000, min = 0, max = 19999))
l = sort(l, decreasing = T)
df = data.frame(l = l)
df$cs = cumsum(df$l)
t = 18000
for (i in 1:nrow(df)) {
  n = df$l[min(which(df$cs > df$cs[i]/2))]
  if (n < t) { break }
}
result = as.integer(i-1)
result
# 21216, in ~29 seconds
Instead of testing every row, let's set a range
i1 <- 1
i2 <- nrow(df)
i.range <- as.integer(seq(i1, i2, length.out = 10))
Now, test only these 10 rows. Get the closest one and "focus in" by re-defining the range. Stop when you cannot increase granularity.
while (sum(duplicated(i.range)) == 0) {
  for (i in 1:length(i.range)) {
    N50 = df$l[min(which(df$cs > df$cs[i.range[i]]/2))]
    if (N50 < t) { break }
  }
  # update i1 and i2
  i1 <- i.range[(i-1)]
  i2 <- i.range[i]
  i.range <- as.integer(seq(i1, i2, length.out = 10))
}
i.range <- seq(i1, i2, by=1)
for (i in i.range) {
  N50 = df$l[min(which(df$cs > df$cs[i]/2))]
  if (N50 < t) { break }
}
result <- as.integer(i-1)
result
#21216, in ~ 0.06 seconds
Same result in a fraction of the time.
I made a matrix-based population model; however, I would like to run more than one simultaneously, representing different groups of animals, so that dispersing individuals can move between matrices. I originally just repeated everything to get a second matrix, but then I realised that, because I run the model using a for loop and break() under certain conditions (when that specific matrix should stop running, i.e. that group has died out), it is, understandably, stopping the whole model rather than just that single matrix.
I was wondering if anyone had suggestions on the best way to code the model so that, instead of breaking and stopping the whole for loop, it just stops running that specific matrix. I'm a little stumped. I have included a single run of one matrix below.
Also, if anyone has a more efficient way of creating and running 9 matrices than writing everything out 9 times, advice is much appreciated.
n.steps <- 100
mats <- array(0,c(85,85,n.steps))
ns <- array(0,c(85,n.steps))
ns[1,1]<-0
ns[12,1]<-rpois(1,3)
ns[24,1]<-rpois(1,3)
ns[85,1] <- 1
birth<-4
nextbreed<-12
for (i in 2:n.steps) {
  # set up an empty matrix
  mat <- matrix(0, nrow = 85, ncol = 85)

  surv.age.1 <- 0.95
  x <- 2:10
  diag(mat[x, (x-1)]) <- surv.age.1

  surv.age.a <- 0.97
  disp <- 1:74
  disp <- disp * -0.001
  disp1 <- 0.13
  disp <- 1 - (disp + disp1)
  survdisp <- surv.age.a * disp
  x <- 11:84
  diag(mat[x, (x-1)]) <- survdisp

  if (i == nextbreed) {
    pb <- 1
  } else {
    pb <- 0
  }
  if (pb == 1) {
    nextbreed <- nextbreed + 12
  }
  mat[1, 85] <- pb * birth
  mat[85, 85] <- 1

  death <- sample(c(replicate(1000,
                    sample(c(1, 0), prob = c(0.985, 1 - 0.985), size = 1))), 1)
  if (death == 0) {
    break()
  }

  mats[,,i] <- mat
  ns[,i] <- mat %*% ns[,i-1]
}
group.size <- apply(ns[1:85,],2,sum)
plot(group.size)
View(mat)
View(ns)
As somebody else suggested on Twitter, one solution might be simply to turn the matrix into all 0s whenever death happens. It looks to me like death is the probability that a local population disappears? In which case it seems to make good biological sense to just turn the entire population matrix into 0s.
A few other small changes: I made a list of replicate simulations so I could summarize them easily.
If I understand correctly,
death<-sample(c(replicate(1000,sample(c(1,0), prob=c(0.985, 1-0.985), size =1))),1)
says " a local population dies completely with probability 1.5% ". In which case, I think you could replace it with rbinom(). I did that below and my plots look similar to those I made with your code.
Hope that helps!
lots <- replicate(100, simplify = FALSE, expr = {
  for (i in 2:n.steps) {
    # set up an empty matrix
    mat <- matrix(0, nrow = 85, ncol = 85)

    surv.age.1 <- 0.95
    x <- 2:10
    diag(mat[x, (x-1)]) <- surv.age.1

    surv.age.a <- 0.97
    disp <- 1:74
    disp <- disp * -0.001
    disp1 <- 0.13
    disp <- 1 - (disp + disp1)
    survdisp <- surv.age.a * disp
    x <- 11:84
    diag(mat[x, (x-1)]) <- survdisp

    if (i == nextbreed) {
      pb <- 1
    } else {
      pb <- 0
    }
    if (pb == 1) {
      nextbreed <- nextbreed + 12
    }
    mat[1, 85] <- pb * birth
    mat[85, 85] <- 1

    # NB: prob = 0.985 would match the 1.5% death probability in the question;
    # prob = 0.6 makes deaths much more frequent
    death <- rbinom(1, size = 1, prob = 0.6)
    if (death == 0) {
      mat[] <- 0   # zero out the whole projection matrix instead of breaking
    }

    mats[,,i] <- mat
    ns[,i] <- mat %*% ns[,i-1]
  }
  ns
})
lapply(lots, FUN = function(x) apply(x[1:85,],2,sum))
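Building on that summary, one way to visualise all replicates at once (a sketch, not from my original answer; sizes is just an illustrative name):
# Overlay the group-size trajectories of all replicates on one plot
sizes <- lapply(lots, function(x) apply(x[1:85, ], 2, sum))
plot(sizes[[1]], type = "l", ylim = range(unlist(sizes)),
     xlab = "time step", ylab = "group size")
invisible(lapply(sizes[-1], lines))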
I am trying to implement a Chebyshev filter to smooth a time series but, unfortunately, there are NAs in the data series.
For example,
t <- seq(0, 1, len = 100)
x <- c(sin(2*pi*t*2.3) + 0.25*rnorm(length(t)),NA, cos(2*pi*t*2.3) + 0.25*rnorm(length(t)))
I am using a Chebyshev filter: cf1 = cheby1(5, 3, 1/44, type = "low")
I am trying to filter the time series excluding the NAs, but without messing up the order/positions. I have already tried na.rm=T, but it seems there is no such argument.
Then
z <- filter(cf1, x) # apply filter
Thank you guys.
Try using x <- x[!is.na(x)] to remove the NAs, then run the filter.
You can remove the NAs beforehand using the complete.cases function. You might also consider imputing the missing data; check out the mtsdi or Amelia II packages.
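For example, a minimal sketch of the complete.cases route (keep is just an illustrative name):
keep <- complete.cases(x)   # FALSE where x is NA
x2 <- x[keep]               # filter this, then re-insert at the positions where keep is TRUE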
EDIT:
Here's a solution with Rcpp. This might be helpful if speed is important:
require(inline)
require(Rcpp)
t <- seq(0, 1, len = 100)
set.seed(7337)
x <- c(sin(2*pi*t*2.3) + 0.25*rnorm(length(t)),NA, cos(2*pi*t*2.3) + 0.25*rnorm(length(t)))
NAs <- x
x2 <- x[!is.na(x)]
#do something to x2
src <- '
Rcpp::NumericVector vecX(vx);
Rcpp::NumericVector vecNA(vNA);
int j = 0; // counter for vx
for (int i = 0; i < vecNA.size(); i++) {
  if (!(R_IsNA(vecNA[i]))) {
    // replace and update j
    vecNA[i] = vecX[j];
    j++;
  }
}
return Rcpp::wrap(vecNA);
'
fun <- cxxfunction(signature(vx = "numeric",
                             vNA = "numeric"),
                   src, plugin = "Rcpp")
if (identical(x,fun(x2,NAs)))
print("worked")
# [1] "worked"
I don't know if ts objects can have missing values, but if you just want to re-insert the NA values, you can use ?insert from R.utils. There might be a better way to do this.
install.packages(c('R.utils', 'signal'))
require(R.utils)
require(signal)
t <- seq(0, 1, len = 100)
set.seed(7337)
x <- c(sin(2*pi*t*2.3) + 0.25*rnorm(length(t)), NA, NA, cos(2*pi*t*2.3) + 0.25*rnorm(length(t)), NA)
cf1 = cheby1(5, 3, 1/44, type = "low")
xex <- na.omit(x)
z <- filter(cf1, xex) # apply
z <- as.numeric(z)
for (m in attributes(xex)$na.action) {
  z <- insert(z, ats = m, values = NA)
}
all.equal(is.na(z), is.na(x))
?insert
Here is a function you can use to filter a signal with NAs in it.
The NAs are ignored rather than replaced by zero.
You can then specify a maximum percentage of weight which the NAs may take at any point of the filtered signal. If there are too many NAs (and too few actual data) at a specific point, the filtered signal itself will be set to NA.
# This function applies a filter to a time series with potentially missing data
filter_with_NA <- function(x,
                           window_length = 12,                             # will be applied centrally
                           myfilter = rep(1/window_length, window_length), # a boxcar filter by default
                           max_percentage_NA = 25)                         # which percentage of weight created by NA should not be exceeded
{
  # make the signal longer at both sides
  signal <- c(rep(NA, window_length), x, rep(NA, window_length))
  # see where data are present and not NA
  present <- is.finite(signal)
  # replace the NA values by zero
  signal[!is.finite(signal)] <- 0
  # apply the filter
  filtered_signal <- as.numeric(filter(signal, myfilter, sides = 2))
  # find out which percentage of the filtered signal was created by non-NA values
  # this is easy because the filter is linear
  original_weight <- as.numeric(filter(present, myfilter, sides = 2))
  # where this is lower than one, the signal is now artificially smaller
  # because we added zeros - compensate for that
  filtered_signal <- filtered_signal / original_weight
  # but where there are too few values present, discard the signal
  filtered_signal[100 * (1 - original_weight) > max_percentage_NA] <- NA
  # cut away the padding to left and right which we previously inserted
  filtered_signal <- filtered_signal[(window_length + 1):(window_length + length(x))]
  return(filtered_signal)
}
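A possible usage example on data like the question's (illustrative only; the window length of 12 is arbitrary here, and smoothed is just an illustrative name):
set.seed(1)
t <- seq(0, 1, len = 100)
x <- c(sin(2*pi*t*2.3) + 0.25*rnorm(length(t)), NA,
       cos(2*pi*t*2.3) + 0.25*rnorm(length(t)))

smoothed <- filter_with_NA(x, window_length = 12, max_percentage_NA = 25)

plot(x, type = "l", col = "grey")
lines(smoothed, col = "red", lwd = 2)
Note that filter here is stats::filter; if the signal package is attached, signal::filter will mask it, so you may need to call stats::filter explicitly inside the function.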