I know that the filter() function in R calculates a moving average. I would like to know whether there is a function that returns the moving variance or standard deviation, so that I can show it in a plot side by side with the output of filter().
Consider the zoo package. For example filter() gives:
> filter(1:100, rep(1/3,3))
Time Series:
Start = 1
End = 100
Frequency = 1
[1] NA 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
[51] 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
[76] 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 NA
whereas rollmean() in zoo gives:
> rollmean(1:100, k = 3, na.pad = TRUE)
[1] NA 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
[51] 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
[76] 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 NA
which is the same (for a 3 point moving average in this example).
Whilst zoo doesn't have a rollsd() or rollvar(), it does have rollapply(), which works like the apply() family to apply any R function over the specified window.
> rollapply(1:100, width = 3, FUN = sd, na.pad = TRUE)
[1] NA 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[26] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[51] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[76] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 NA
Warning message:
In rollapply.zoo(zoo(data), ...) : na.pad argument is deprecated
or on something more interesting:
> rollapply(vec, width = 3, FUN = sd, na.pad = TRUE)
[1] NA 0.3655067 0.8472871 0.5660495 0.3491970 0.4732417 0.9236859
[8] 0.8075226 1.8725851 1.1930784 0.6329325 1.1412416 0.8430772 0.5808005
[15] 0.3838545 1.1738170 1.1655400 1.3241700 0.6876834 0.1534157 0.4858477
[22] 0.9843506 0.6002713 0.6897541 2.0619563 2.5675788 6.3522039 6.0066864
[29] 6.2618432 5.1704866 2.1360853 2.5602557 1.0408528 1.0316396 4.9441628
[36] 5.0319314 5.7589716 3.2425000 4.8788158 2.0847286 4.5199291 2.5323486
[43] 2.1987149 1.8393000 1.2278639 1.5998965 1.5341485 4.4287108 4.4159166
[50] 4.3224546 3.6959067 4.9826264 5.3134044 8.4084322 9.1249234 7.5506725
[57] 3.8499136 3.9680487 5.6362296 4.9124095 4.3452706 4.0227141 4.5867559
[64] 4.7350394 4.3203807 4.4506799 7.2759499 7.6536424 7.8487654 2.0905576
[71] 4.0056880 5.6209853 1.5551659 1.3615268 2.8469458 2.8323588 1.9848578
[78] 1.1201124 1.4248380 1.7802571 1.4281773 2.5481935 1.8554451 1.0925410
[85] 2.1823722 2.2788755 2.4205378 2.0733741 0.7462248 1.3873578 1.4265948
[92] 0.7212619 0.7425993 1.0696432 2.4520585 3.0555819 3.1000885 1.0945292
[99] 0.3726928 NA
Warning message:
In rollapply.zoo(zoo(data), ...) : na.pad argument is deprecated
You can get rid of the warning by using the fill = NA argument, as in
> rollapply(vec, width = 3, FUN = sd, fill = NA)
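To address the plotting part of the original question, here is a minimal sketch (assuming vec is the numeric series of interest) that draws the rolling mean and rolling sd side by side with base graphics:
library(zoo)
m <- rollapply(vec, width = 3, FUN = mean, fill = NA)  # moving average
s <- rollapply(vec, width = 3, FUN = sd, fill = NA)    # moving standard deviation
op <- par(mfrow = c(1, 2))                             # two panels side by side
plot(m, type = "l", main = "Rolling mean", xlab = "index", ylab = "mean")
plot(s, type = "l", main = "Rolling sd", xlab = "index", ylab = "sd")
par(op)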
The TTR package has runSD among others:
> library(TTR)
> ls("package:TTR", pattern="run*")
[1] "runCor" "runCov" "runMAD" "runMax" "runMean"
[6] "runMedian" "runMin" "runSD" "runSum" "runVar"
runSD will be much faster than rollapply because it avoids making many R function calls. For example:
rzoo <- function(x,n) rollapplyr(x, n, sd, fill=NA)
rttr <- function(x,n) runSD(x, n)
library(rbenchmark)
set.seed(21)
x <- rnorm(1e4)
all.equal(rzoo(x,250), rttr(x,250))
# [1] TRUE
benchmark(rzoo(x,250), rttr(x,250))[,1:6]
# test replications elapsed relative user.self sys.self
# 2 rttr(x, 250) 100 0.58 1.000 0.58 0.00
# 1 rzoo(x, 250) 100 54.53 94.017 53.85 0.06
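Since the question also asks about moving variance, runVar (listed above) works the same way as runSD; a quick sketch:
library(TTR)
v <- runVar(x, n = 250)            # moving variance over a 250-point window
all.equal(v, runSD(x, n = 250)^2)  # runSD is the square root of runVar, so this should be TRUE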
rollapply in the zoo package takes an arbitrary function. It differs from filter in that, by default, it does not pad the ends with NAs, so the result is shorter unless you ask for fill = NA.
That being said, though, there's not much sense in loading a package for a function that's so simple to roll yourself (pun intended).
Here's one that's right-aligned:
my.rollapply <- function(vec, width, FUN)
  sapply(seq_along(vec),
         function(i) if (i < width) NA else FUN(vec[i:(i - width + 1)]))
set.seed(1)
vec <- sample(1:50, 50)
my.rollapply(vec, 3, sd)
[1] NA NA 7.094599 12.124356 16.522712 18.502252 18.193405 7.234178 8.144528
[10] 14.468356 12.489996 3.055050 20.808652 19.467922 18.009257 18.248288 15.695010 7.505553
[19] 10.066446 11.846237 17.156146 6.557439 5.291503 23.629078 22.590558 21.197484 22.810816
[28] 24.433583 19.502137 16.165808 11.503623 12.288206 9.539392 13.051181 13.527749 19.974984
[37] 19.756855 17.616280 19.347696 18.248288 15.176737 6.082763 10.000000 10.016653 4.509250
[46] 2.645751 1.527525 5.291503 10.598742 6.557439
# rollapply output for comparison
rollapply(vec, width=3, sd, fill=NA, align='right')
[1] NA NA 7.094599 12.124356 16.522712 18.502252 18.193405 7.234178 8.144528
[10] 14.468356 12.489996 3.055050 20.808652 19.467922 18.009257 18.248288 15.695010 7.505553
[19] 10.066446 11.846237 17.156146 6.557439 5.291503 23.629078 22.590558 21.197484 22.810816
[28] 24.433583 19.502137 16.165808 11.503623 12.288206 9.539392 13.051181 13.527749 19.974984
[37] 19.756855 17.616280 19.347696 18.248288 15.176737 6.082763 10.000000 10.016653 4.509250
[46] 2.645751 1.527525 5.291503 10.598742 6.557439
The runner function in the runner package applies any R function on running windows. With runner one can specify a simple window by setting the window length k or the lag. Below is the moving sd, as requested by the OP, on 4-element windows.
library(runner)
set.seed(1)
x <- rnorm(20, sd = 1)
runner(x, sd, k = 4, na_pad = TRUE)
#[1] NA NA NA 1.1021597 0.9967429 1.1556947 0.9884053 0.6902835 0.7180483 0.4647160
#[11] 0.7454670 0.7489618 0.9449882 1.5821988 1.4459037 1.3889432 1.3954101 0.6193867 0.5296744 0.4266423
To apply running functions on date windows one should specify idx. idx should be the same length as x and of type Date or integer. The example below illustrates a window of size k = 5 lagged by lag = 1.
idx <- c(4, 6, 7, 13, 17, 18, 18, 21, 27, 31, 37, 42, 44, 47, 48)
runner::runner(x = 1:15,
               k = 5,
               lag = 1,
               idx = idx,
               f = function(x) mean(x))
# [1] NA 1.0 1.5 NA 4.0 4.5 4.5 6.0 NA 9.0 NA 11.0 12.0 12.5 13.5
More info is in the documentation and vignettes.
Thanks to @akrun, I could run the code from my previous question about merging and creating tables with a loop: Merge and create tables using a loop
However, because my laptop only has 16GB of RAM, I couldn't run the code on the large dataset. So, instead of merging 100 times at once, I decided to split the process and do it step by step using a for loop.
I was going to create 20 lists of data using a for loop, but I couldn't find a way to make this happen.
To be specific, I would run the following 20 lines of code manually without using a for loop.
list1 <- mget(paste0("", 1:5))
list2 <- mget(paste0("", 6:10))
list3 <- mget(paste0("", 11:15))
list4 <- mget(paste0("", 16:20))
list5 <- mget(paste0("", 21:25))
...
list20 <- mget(paste0("", 96:100))
How would I write the for loop in this case?
I tried to find a way to do this (for example, as below), but I am getting an error.
for(i in 1:20){
list[i] <- mget(paste0("",5*i-4:5*i))
}
Thanks in advance for all your help!
There are multiple ways to create the list. Either use split with %/%
fulllst <- lapply(split(as.character(1:100), (1:100-1) %/% 5 + 1), mget)
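For intuition, the grouping index produced by the integer division looks like this (a quick check, not part of the original answer):
head((1:100 - 1) %/% 5 + 1, 12)
# [1] 1 1 1 1 1 2 2 2 2 2 3 3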
Or use the same code as in the OP's post, wrapping the subexpressions with () so the evaluation follows the intended precedence of operators:
# create an empty list to store the output
lstout <- vector('list', 20)
# loop over the sequence, adding () around (5 * i - 4) and (5 * i)
for(i in 1:20)
  lstout[[i]] <- mget(as.character((5 * i - 4):(5 * i)))
Use print to see the difference:
> for(i in 1:20) print((5 *i -4):(5*i))
[1] 1 2 3 4 5
[1] 6 7 8 9 10
[1] 11 12 13 14 15
[1] 16 17 18 19 20
[1] 21 22 23 24 25
[1] 26 27 28 29 30
[1] 31 32 33 34 35
[1] 36 37 38 39 40
[1] 41 42 43 44 45
[1] 46 47 48 49 50
[1] 51 52 53 54 55
[1] 56 57 58 59 60
[1] 61 62 63 64 65
[1] 66 67 68 69 70
[1] 71 72 73 74 75
[1] 76 77 78 79 80
[1] 81 82 83 84 85
[1] 86 87 88 89 90
[1] 91 92 93 94 95
[1] 96 97 98 99 100
> for(i in 1:20) print(5 *i -4:5*i)
[1] 1 0
[1] 2 0
[1] 3 0
[1] 4 0
[1] 5 0
[1] 6 0
[1] 7 0
[1] 8 0
[1] 9 0
[1] 10 0
[1] 11 0
[1] 12 0
[1] 13 0
[1] 14 0
[1] 15 0
[1] 16 0
[1] 17 0
[1] 18 0
[1] 19 0
[1] 20 0
i.e. if we don't use the (), the evaluation will be
i <- 1
(5 * i) - (4:5 * i)
[1] 1 0
# instead of
(5 * i -4):(5 * i)
[1] 1 2 3 4 5
The operator precedence is shown in ?Syntax:
:: :::      access variables in a namespace
$ @         component / slot extraction
[ [[        indexing
^           exponentiation (right to left)
- +         unary minus and plus
:           sequence operator
%any% |>    special operators (including %% and %/%)
* /         multiply, divide
+ -         (binary) add, subtract
....
We are looking to create a vector with the following sequence:
1,4,5,8,9,12,13,16,17,20,21,...
Start with 1, then skip 2 numbers, then add 2 numbers, then skip 2 numbers, etc., not going above 2000. We also need the inverse sequence 2,3,6,7,10,11,...
We may use a recycling logical vector to filter the sequence:
(1:21)[c(TRUE, FALSE, FALSE, TRUE)]
[1] 1 4 5 8 9 12 13 16 17 20 21
Here's an approach using rep and cumsum. Effectively, "add up alternating increments of 1 (successive #s) and 3 (skip two)."
cumsum(rep(c(1,3), 500))
and, for the inverse sequence,
cumsum(rep(c(3,1), 500)) - 1
Got this one myself - head(sort(c(seq(1, 2000, 4), seq(4, 2000, 4))), 20)
We can try something like below:
> (v <- seq(21))[v %% 4 %in% c(0, 1)]
[1] 1 4 5 8 9 12 13 16 17 20 21
You may arrange the data in a matrix and extract the 1st and 4th columns.
val <- 1:100
sort(c(matrix(val, ncol = 4, byrow = TRUE)[, c(1, 4)]))
# [1] 1 4 5 8 9 12 13 16 17 20 21 24 25 28 29 32 33
#[18] 36 37 40 41 44 45 48 49 52 53 56 57 60 61 64 65 68
#[35] 69 72 73 76 77 80 81 84 85 88 89 92 93 96 97 100
A tidyverse option.
library(purrr)
library(dplyr)
map_int(1:11, ~ case_when(. == 1 ~ as.integer(1),
                          . %% 2 == 0 ~ as.integer(. * 2),
                          TRUE ~ as.integer((. * 2) - 1)))
# [1] 1 4 5 8 9 12 13 16 17 20 21
I have a function like this
extract = function(x)
{
  a = x$2007[6:18]
  b = x$2007[30:42]
  c = x$2007[54:66]
}
The subsetting needs to continue up to 744 in this way: I need to skip the first 6 data points and then pull out every other 12 points into a new object or a list. Is there a more elegant way to do this with a for loop or apply?
Side note: if 2007 truly is a column name (you would have had to set this explicitly; R defaults to converting numeric names to ones starting with a letter, see make.names("2007")), then x$"2007"[6:18] (etc.) should work for column reference.
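To illustrate that side note, here is what make.names does with a numeric-looking name (a quick check):
make.names("2007")
# [1] "X2007"
# i.e. data.frame()/read.table() with the default check.names = TRUE would have
# renamed the column to X2007; x$"2007" or x[["2007"]] only works if the name
# really is "2007".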
To generate that sequence of integers, let's try
nr <- 100
ind <- seq(6, nr, by = 12)
ind
# [1] 6 18 30 42 54 66 78 90
ind[ seq_along(ind) %% 2 == 1 ]
# [1] 6 30 54 78
ind[ seq_along(ind) %% 2 == 0 ]
# [1] 18 42 66 90
Map(seq, ind[ seq_along(ind) %% 2 == 1 ], ind[ seq_along(ind) %% 2 == 0 ])
# [[1]]
# [1] 6 7 8 9 10 11 12 13 14 15 16 17 18
# [[2]]
# [1] 30 31 32 33 34 35 36 37 38 39 40 41 42
# [[3]]
# [1] 54 55 56 57 58 59 60 61 62 63 64 65 66
# [[4]]
# [1] 78 79 80 81 82 83 84 85 86 87 88 89 90
So you can use this in your function to create a list of subsets:
nr <- nrow(x)
ind <- seq(6, nr, by = 12)
out <- lapply(Map(seq, ind[ seq_along(ind) %% 2 == 1 ], ind[ seq_along(ind) %% 2 == 0 ]),
              function(i) x$"2007"[i])
We could use
split( x[7:744] , cut(7:744,seq(7,744,12)) )
I am having difficulties trying to order a list element-wise by decreasing order...
I have a ByPos_MIndex object (essentially a list of 1000 IRanges objects, CG_seqP) from
C <- vmatchPattern(CG, CPGi_Seq, max.mismatch = 0, with.indels = FALSE)
IRanges object with 27 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 1 2 2
[2] 3 4 2
[3] 9 10 2
[4] 27 28 2
[5] 34 35 2
... ... ... ...
[23] 189 190 2
[24] 207 208 2
[25] 212 213 2
[26] 215 216 2
[27] 218 219 2
length(1000 of these IRanges)
I then change this to a list of only the start integers (which I want)
CG_SeqP <- sapply(C, function(x) sapply(as.vector(x), "[", 1))
[[1]]
[1] 1 3 9 27 34 47 52 56 62 66 68 70 89 110 112
[16] 136 140 146 154 160 163 178 189 207 212 215 218
(1000 of these)
The problem happens when I try to order the list of elements using
CG_SeqP <- sapply(as.vector(CG_SeqP),order, decreasing = TRUE)
I get a list of what I think are row numbers, so if the first IRanges object has 27 ranges I get this...
CG_SeqP[1]
[[1]]
[1] 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8
[21] 7 6 5 4 3 2 1
So the decreasing order has worked, but not on my actual list elements?
Any suggestions, thanks in advance.
order returns the order (positions) of the sorted sequence, not the actual elements of your vector. To extract them, let us look at a toy example (I am following your idea here):
set.seed(1)
alist1 <- list(a = sample(1:100, 30))
So, if you print alist1 with the current seed value, you will get the results below:
> alist1
$a
[1] 99 51 67 59 23 25 69 43 17 68 10 77 55 49 29 39 93 16 44
[20] 7 96 92 80 94 34 97 66 31 5 24
Now, to sort them you can use either the sort function or order. sort simply sorts the data, whereas order returns the positions of the elements in the sorted sequence; it doesn't return the actual values, only their positions. Hence we need to index the original vector with those positions, using square brackets, to get the sorted outcome.
lapply(as.vector(alist1),function(x)x[order(x, decreasing = TRUE)])
I have used lapply instead of sapply just to ensure the outcome is a list. You are free to choose whichever command suits your need.
Will return:
#> lapply(as.vector(alist1),function(x)x[order(x, decreasing = TRUE)])
#$a
# [1] 99 97 96 94 93 92 80 77 69 68 67 66 59 55 51 49 44 43 39
#[20] 34 31 29 25 24 23 17 16 10 7 5
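Equivalently, sort gives the same result more directly (a quick check on the toy example):
lapply(alist1, sort, decreasing = TRUE)
# identical output to the order()-based version above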
I hope this clarifies your doubt. Thanks
I have a simple question regarding the sample function in R. I'm randomly sampling from 0s and 1s and summing them; an input vector of length 5 designates the number of trials to run for each sample, and I set the seed to generate reproducible random numbers. The seed works as expected, but I get different matrices of random numbers depending on what I put in the prob argument. In this case I assumed prob=NULL should be the same as prob=c(0.5,0.5). Why isn't it?
vn<-c(12, 44, 9, 17, 28)
> do.call(cbind, lapply(c(1:10),function(X) {set.seed(X); sapply(vn, function(Y) sum(sample(x=c(0,1),size=Y,replace=T)), simplify=TRUE)}))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 6 7 7 6 6 9 3 6 2 5
[2,] 22 21 20 29 22 24 24 19 25 19
[3,] 4 8 3 5 4 4 4 6 4 2
[4,] 8 4 12 9 11 7 9 10 8 8
[5,] 13 9 11 14 12 14 10 13 11 12
> do.call(cbind, lapply(c(1:10),function(X) {set.seed(X); sapply(vn, function(Y) sum(sample(x=c(0,1),size=Y,replace=T, prob=c(0.5,0.5))), simplify=TRUE)}))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 6 5 5 6 6 3 9 6 10 7
[2,] 22 23 24 15 22 20 20 25 19 25
[3,] 5 1 6 4 5 5 5 3 5 7
[4,] 9 13 5 8 6 10 8 7 9 9
[5,] 15 19 17 14 16 14 18 15 17 16
UPDATE:
I extended the samplings to 100, with the input vector
vn <- seq(0, 100, 5)
and compared the rowMeans of the output matrices without prob (test1) and with prob=c(0.5,0.5) (test2) against the expected mean. Interestingly, test1 and test2 are off by the exact same amount with reversed signs. Why is that? Thanks!
> rowMeans(test1)-seq(0,100,5)/2
[1] 0.00 -0.07 -0.01 -0.35 -0.07 0.19 -0.07 0.24 0.21 0.46 0.20 0.50 -0.37 -0.35 0.00 0.64 -0.59 0.63 -1.19 0.44 -0.38
> rowMeans(test2)-seq(0,100,5)/2
[1] 0.00 0.07 0.01 0.35 0.07 -0.19 0.07 -0.24 -0.21 -0.46 -0.20 -0.50 0.37 0.35 0.00 -0.64 0.59 -0.63 1.19 -0.44 0.38
As suggested by Randy, different routines are used by sample.int depending on whether prob is NULL.
In your case, it returns inverse results:
> set.seed(1); sample(c(0,1), size=20, replace=TRUE)
[1] 0 0 1 1 0 1 1 1 1 0 0 0 1 0 1 0 1 1 0 1
> set.seed(1); sample(c(0,1), size=20, replace=TRUE, prob=c(.5,.5))
[1] 1 1 0 0 1 0 0 0 0 1 1 1 0 1 0 1 0 0 1 0
What's going on?
For the former, we hit line src/main/random.c:546:
for (int i = 0; i < k; i++) iy[i] = (int)(dn * unif_rand() + 1);
This one is simple. unif_rand() returns a value between 0 and 1 (and will never return 1), and dn is 2 (the number of elements in x), so iy[i] is set to 1 or 2 depending on whether unif_rand() returns a value < .5 or >= .5 respectively; this index picks the value from x.
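That line of the R 3.0.2 C code can be mimicked in R as an illustrative sketch (not the actual implementation):
set.seed(1)
u <- runif(5)            # stands in for unif_rand()
iy <- floor(2 * u) + 1   # (int)(dn * unif_rand() + 1) with dn = 2
c(0, 1)[iy]
# [1] 0 0 1 1 0   -- with this seed, the start of the prob = NULL output above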
The latter is a bit more complex. Because prob is specified, do_sample calls the function ProbSampleReplace at src/main/random.c:309. Here, the probabilities are sorted into descending order with the function revsort at src/main/sort.c:248. This uses a heap sort on the probabilities, and with a two-element vector of equal probabilities, it reverses the order.
ProbSampleReplace again calls unif_rand() but this time it maps it to the cumulative probabilities computed after flipping the order of the vector, so if unif_rand() returns a value < 0.5 the second value is returned (1 in your example). This is the code that does the mapping of unif_rand() to the values in x:
/* compute the sample */
for (i = 0; i < nans; i++) {
    rU = unif_rand();
    for (j = 0; j < nm1; j++) {
        if (rU <= p[j])
            break;
    }
    ans[i] = perm[j];
}
So with equal probabilities of two elements, setting the probability explicitly to c(0.5, 0.5) will return the inverse of the same call without setting the probabilities. With more than two elements, it's not going to always reverse them, but it won't return the same order.
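For the two-element case the mapping can be traced by hand in R (an illustrative sketch of the logic described above, not the real ProbSampleReplace):
perm <- c(2, 1)               # revsort() has flipped the two equal probabilities
p <- cumsum(c(0.5, 0.5))      # cumulative probabilities: 0.5, 1.0
set.seed(1)
u <- runif(5)                 # stands in for unif_rand()
j <- ifelse(u <= p[1], 1, 2)  # first cumulative bin that covers u
c(0, 1)[perm[j]]
# [1] 1 1 0 0 1   -- the inverse of the prob = NULL result, matching the output above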
This also explains why Fernando's suggestion works. The values are close enough to .5 as to not change the results for this example, and the heap sort returns the values in the original order.
This expression returns the same matrix as your first line of code:
do.call(cbind, lapply(c(1:10),function(X) {set.seed(X); sapply(vn, function(Y) sum(sample(x=c(1,0),size=Y,replace=T, prob=c(0.5,0.5))), simplify=TRUE)}))
Here, the order of the entries in x have been reversed to account for the two-element sort of equal values (which swaps the entries).
Of course this is all academic. In practice, permuting the order of equiprobable entries doesn't matter.
Source files and line numbers above refer to R 3.0.2.
I updated my comment to an answer. sample uses different C routines for uniform sampling and weighted sampling. Even though you are using equal weights, R will call the weighted-sampling routine anyway.
To see this, consider
> set.seed(1)
> sample.int(100)
[1] 27 37 57 89 20 86 97 62 58 6 19 16 61 34 67 43 88 83
[19] 32 63 75 17 51 10 21 29 1 28 81 25 87 42 70 13 55 44
[37] 78 7 45 26 50 39 46 82 30 65 2 84 59 36 24 85 22 12
[55] 4 5 14 23 73 79 99 47 18 95 60 77 41 53 3 69 11 71
[73] 35 31 40 49 76 9 38 64 80 66 8 91 33 92 100 54 98 94
[91] 52 74 68 72 93 15 56 48 90 96
> set.seed(1)
> sample.int(100, prob = rep(1/100, 100))
[1] 28 39 60 93 21 91 96 67 63 7 22 18 71 41 79 51 74 1
[19] 38 78 94 20 64 12 29 40 2 42 87 35 50 61 52 17 84 69
[37] 81 10 73 44 85 65 80 54 49 82 4 46 75 68 43 90 36 23
[55] 8 11 30 55 66 34 97 26 47 31 70 24 53 86 6 95 32 89
[73] 27 33 56 98 88 25 77 100 37 62 19 15 76 13 59 5 14 9
[91] 45 3 83 99 72 58 48 57 92 16
Note the two different sampled sequences.