Apply t.test on consecutive elements of a vector in R

I have a vector that consists of n elements.
I am wondering how to do the following in an efficient manner (it is basically a rolling computation):
1. Extract elements with indices 1 to k.
2. Extract elements with indices k+1 to 2k.
3. Perform a t.test on those elements.
4. Extract elements with indices 2 to k+1.
5. Extract elements with indices k+2 to 2k+1.
6. Perform a t.test on those elements.
...
7. Repeat until the end of the vector.
I do not need to see the extracted values, just some statistics (e.g. [1]$statistic).
I can do it with a loop, but I am wondering how one could do it with some functions (e.g. apply).
All the best

It's unclear from your pseudo-code whether you really expected the vector to have n = 3 * k elements, but that's how I have programmed it.
set.seed(123)
x = rnorm(15)
o = 1:5
k = 5
sapply(o, function(oi){
  x1 = x[oi:(k + oi - 1)]
  x2 = x[(k + oi):(2 * k + oi)]
  t.test(x1, x2)$statistic
})
and the result is
t t t t t
0.04435004 1.28433640 1.67879015 0.11191044 -0.19398686
You could also work out some updating formulae for adding and subtracting a new element to each pair of vectors in the t-test, but it doesn't seem worth the effort.
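For what it's worth, here is a sketch of that updating idea (my own illustration, not part of the original answers): the rolling window sums and sums of squares can be computed once with cumsum, and the Welch t statistic (the default in t.test) assembled from them without looping over windows. The helper name rolling_t and the exact window indexing are assumptions made for illustration.
rolling_t <- function(x, k) {
  cs  <- cumsum(c(0, x))                     # cs[i + 1] = sum(x[1:i])
  cs2 <- cumsum(c(0, x^2))                   # running sums of squares
  win_sum  <- function(from) cs[from + k]  - cs[from]
  win_sum2 <- function(from) cs2[from + k] - cs2[from]
  s1 <- seq_len(length(x) - 2 * k + 1)       # starts of the first windows
  s2 <- s1 + k                               # starts of the adjacent second windows
  m1 <- win_sum(s1) / k
  m2 <- win_sum(s2) / k
  v1 <- (win_sum2(s1) - k * m1^2) / (k - 1)  # sample variance of each first window
  v2 <- (win_sum2(s2) - k * m2^2) / (k - 1)  # sample variance of each second window
  (m1 - m2) / sqrt(v1 / k + v2 / k)          # Welch t statistic per window pair
}
Each element should agree, up to floating-point error, with t.test(x[i:(i + k - 1)], x[(i + k):(i + 2 * k - 1)])$statistic for the corresponding start i.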

You can use sapply, here wrapped in a function for clarity:
multiple_ttest <- function(vec, k)
{
  sapply(seq(length(vec) - 2 * k), function(x){
    t.test(vec[x + 1:k], vec[x + (k + 1):(2 * k)])$statistic
  })
}
multiple_ttest(rnorm(100), 20)
#> t t t t t t t
#> -0.6253586 -1.1680595 -1.2979357 -1.1051207 -1.4668645 -0.6156220 -0.2470322
#> t t t t t t t
#> -1.2416802 -0.1763081 -0.1682269 -0.2490938 0.4569783 0.7721602 1.4383522
#> t t t t t t t
#> 1.5530524 2.1523397 2.3221591 2.1876903 1.7248729 1.6149827 1.6919704
#> t t t t t t t
#> 1.9747871 2.3833846 2.2826131 2.2337453 2.2225939 1.6284361 2.2067872
#> t t t t t t t
#> 1.2447557 1.1435925 0.6731618 -0.1737223 -0.5889784 -0.7339752 -1.2202893
#> t t t t t t t
#> -1.4936935 -1.5821069 -1.7795797 -1.6290700 -1.7760278 -1.4223367 -1.5931553
#> t t t t t t t
#> -2.6832755 -2.7733223 -2.7710590 -3.0889778 -2.3604743 -2.3510220 -1.5456745
#> t t t t t t t
#> -1.4098492 -0.5268468 -0.3187017 0.1878282 0.5208563 0.5656160 0.4952028
#> t t t t
#> 0.2005182 0.3769581 0.5093635 0.9358878
Created on 2020-04-02 by the reprex package (v0.3.0)

Related

Combinatorial optimization with discrete options in R

I have a function with five variables that I want to maximize using only a specific set of parameters for each variable.
Are there any methods in R that can do this, other than by brute force? (e.g. Particle Swarm Optimization, Genetic Algorithm, Greedy, etc.). I have read about a few packages, but they seem to create their own set of parameters from within a given range. I am only interested in optimizing over the set of options provided.
Here is a simplified version of the problem:
#Example of 5 variable function to optimize
Fn <- function(x){
  a = x[1]
  b = x[2]
  c = x[3]
  d = x[4]
  e = x[5]
  SUM = a + b + c + d + e
  return(SUM)
}
#Parameters for variables to optimize
Vars = list(
  As = c(seq(1.5, 3, by = 0.3)),        #float
  Bs = c(1, 2),                         #Binary
  Cs = c(seq(1, 60, by = 10)),          #Integer
  Ds = c(seq(60, -60, length.out = 5)), #Negative
  Es = c(1, 2, 3)
)
#Full combination
FullCombn = expand.grid(Vars)
Results = data.frame(I = as.numeric(), Sum = as.numeric())
for (i in 1:nrow(FullCombn)){
  ParsI = FullCombn[i, ]
  ResultI = Fn(ParsI)
  Results = rbind(Results, c(I = i, Sum = ResultI))
}
#Best iteration (Largest result)
Best = Results[Results[, 2] == max(Results[, 2]), ]
#Best parameters
FullCombn[Best$I, ]
Two more possibilities. Both minimize by default, so I flip the sign in your objective function (i.e. return -SUM).
#Example of 5 variable function to optimize
Fn <- function(x, ...){
  a = x[1]
  b = x[2]
  c = x[3]
  d = x[4]
  e = x[5]
  SUM = a + b + c + d + e
  return(-SUM)
}
#Parameters for variables to optimize
Vars = list(
  As = c(seq(1.5, 3, by = 0.3)),        #float
  Bs = c(1, 2),                         #Binary
  Cs = c(seq(1, 60, by = 10)),          #Integer
  Ds = c(seq(60, -60, length.out = 5)), #Negative
  Es = c(1, 2, 3)
)
First, a grid search. Exactly what you did, just convenient. And the implementation allows you to distribute the evaluations of the objective function.
library("NMOF")
gridSearch(fun = Fn,
           levels = Vars)[c("minfun", "minlevels")]
## 5 variables with 6, 2, 6, 5, ... levels: 1080 function evaluations required.
## $minfun
## [1] -119
##
## $minlevels
## [1] 3 2 51 60 3
An alternative: a simple Local Search. You start with a valid initial guess, and then move randomly through possible feasible solutions. The key ingredient is the neighbourhood function. It picks one element randomly and then, again randomly, sets this element to one allowed value.
nb <- function(x, levels, ...) {
  i <- sample(length(levels), 1)
  x[i] <- sample(levels[[i]], 1)
  x
}
(There would be better algorithms for neighbourhood functions; but this one is simple and so demonstrates the idea well.)
LSopt(Fn, list(x0 = c(1.8, 2, 11, 30, 2), ## a feasible initial solution
               neighbour = nb,
               nI = 200                   ## iterations
               ),
      levels = Vars)$xbest
## Local Search.
## ##...
## Best solution overall: -119
## [1] 3 2 51 60 3
(Disclosure: I am the maintainer of package NMOF, which provides functions gridSearch and LSopt.)
In response to the comment, a few remarks on Local Search and the neighbourhood function above (nb). Local Search, as implemented in LSopt, starts with an arbitrary solution and then changes that solution slightly. This new solution, called a neighbour, is compared (by its objective-function value) with the old solution. If the new solution is better, it becomes the current solution; otherwise it is rejected and the old solution remains the current one. Then the algorithm repeats, for a number of iterations. So, in short, Local Search is not random sampling, but a guided random walk through the search space: it is guided because only better solutions get accepted, while worse ones get rejected. In this sense, LSopt will narrow down on good parameter values.
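As an illustration of that accept/reject loop, here is a bare-bones sketch of the idea (my own, not the actual LSopt code), reusing Fn, Vars and nb from above:
local_search <- function(fun, x0, neighbour, nI, ...) {
  xc <- x0
  fc <- fun(xc, ...)
  for (i in seq_len(nI)) {
    xn <- neighbour(xc, ...)  # propose a random neighbour
    fn <- fun(xn, ...)
    if (fn < fc) {            # accept only if better (we minimise)
      xc <- xn
      fc <- fn
    }
  }
  list(xbest = xc, fbest = fc)
}
## e.g. local_search(Fn, x0 = c(1.8, 2, 11, 30, 2), neighbour = nb, nI = 200, levels = Vars)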
The implementation of the neighbourhood is not ideal, for two reasons. The first is that a solution may not be changed at all, since I sample from all feasible values, including the current one. For a small set of possible values, as here, it may happen quite often that the current value is drawn again. However, for larger search spaces this inefficiency is typically negligible, since the probability of sampling the same value becomes smaller; often so small that the additional code for testing whether the solution has changed becomes more expensive than the occasionally wasted iteration. (A variant with such a check is sketched below.)
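If one did want that check, a variant of nb could simply re-draw until the chosen element actually changes (again only a sketch; the guard avoids an endless loop if a variable has a single allowed level):
nb_changed <- function(x, levels, ...) {
  i <- sample(length(levels), 1)
  if (length(levels[[i]]) > 1L) {
    repeat {
      value <- sample(levels[[i]], 1)
      if (value != x[i]) break   # keep drawing until the element really changes
    }
    x[i] <- value
  }
  x
}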
A second thing could be improved, albeit through a more complicated function; and again, for this small problem it does not matter. In the current neighbourhood, an element is picked and then set to any feasible value. That means changes from one solution to the next might be large. Instead of picking any feasible value of the As, in realistic problems it will often be better to pick a value close to the current one: for example, when you are at 2.1, move to 1.8 or 2.4, but not to 3.0. (This reasoning is only relevant, of course, if the variable in question is on a numeric or at least ordinal scale.) Ultimately, what implementation works well can be tested only empirically. Many more details are in this tutorial.
Here is one alternative implementation. A solution is now a vector of positions within the original values: e.g. if x[1] is 2, it "points" to the second value of As (1.8); if x[2] is 1, it points to the first value of Bs (1); and so on.
## precompute lengths of vectors in Vars
lens <- lengths(Vars)
nb2 <- function(x, lens, ...) {
  i <- sample(length(lens), 1)
  if (x[i] == 1L) {
    x[i] <- 2
  } else if (x[i] == lens[i]) {
    x[i] <- lens[i] - 1
  } else
    x[i] <- x[i] + sample(c(1, -1), 1)
  x
}
## the objective function now needs to map the
## indices in x back to the levels in Vars
Fn2 <- function(x, levels, ...){
  y <- mapply(`[`, levels, x)
  ## => same as
  ## y <- numeric(length(x))
  ## y[1] <- Vars[[1]][x[1]]
  ## y[2] <- Vars[[2]][x[2]]
  ## ....
  SUM <- sum(y)
  return(-SUM)
}
xbest <- LSopt(Fn2,
               list(x0 = c(1, 1, 1, 1, 1), ## an initial solution
                    neighbour = nb2,
                    nI = 200               ## iterations
                    ),
               levels = Vars,
               lens = lens)$xbest
## Local Search.
## ....
## Best solution overall: -119
## map the solution back to the values
mapply(`[`, Vars, xbest)
## As Bs Cs Ds Es
## 3 2 51 60 3
Here is a genetic algorithm solution with package GA.
The key is to write a function decode enforcing the constraints, see the package vignette.
library(GA)
#> Loading required package: foreach
#> Loading required package: iterators
#> Package 'GA' version 3.2.2
#> Type 'citation("GA")' for citing this R package in publications.
#>
#> Attaching package: 'GA'
#> The following object is masked from 'package:utils':
#>
#> de
decode <- function(x) {
  As <- Vars$As
  Bs <- Vars$Bs
  Cs <- Vars$Cs
  Ds <- rev(Vars$Ds)
  # fix real variable As
  i <- findInterval(x[1], As)
  if (x[1L] - As[i] < As[i + 1L] - x[1L])
    x[1L] <- As[i]
  else x[1L] <- As[i + 1L]
  # fix binary variable Bs
  if (x[2L] - Bs[1L] < Bs[2L] - x[2L])
    x[2L] <- Bs[1L]
  else x[2L] <- Bs[2L]
  # fix integer variable Cs
  i <- findInterval(x[3L], Cs)
  if (x[3L] - Cs[i] < Cs[i + 1L] - x[3L])
    x[3L] <- Cs[i]
  else x[3L] <- Cs[i + 1L]
  # fix integer variable Ds
  i <- findInterval(x[4L], Ds)
  if (x[4L] - Ds[i] < Ds[i + 1L] - x[4L])
    x[4L] <- Ds[i]
  else x[4L] <- Ds[i + 1L]
  # fix the other, integer variable
  x[5L] <- round(x[5L])
  setNames(x, c("As", "Bs", "Cs", "Ds", "Es"))
}
Fn <- function(x){
  x <- decode(x)
  # a <- x[1]
  # b <- x[2]
  # c <- x[3]
  # d <- x[4]
  # e <- x[5]
  # SUM <- a + b + c + d + e
  SUM <- sum(x, na.rm = TRUE)
  return(SUM)
}
#Parameters for variables to optimize
Vars <- list(
  As = seq(1.5, 3, by = 0.3),          # Float
  Bs = c(1, 2),                        # Binary
  Cs = seq(1, 60, by = 10),            # Integer
  Ds = seq(60, -60, length.out = 5),   # Negative
  Es = c(1, 2, 3)
)
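As a quick illustration of what decode does (the input values here are hypothetical, chosen by me), an arbitrary point in the search domain is snapped onto the nearest allowed levels, which, if I have traced the rounding correctly, gives As = 2.7, Bs = 1, Cs = 41, Ds = 0, Es = 3:
decode(c(2.75, 1.4, 37, 10, 2.6))
#>   As   Bs   Cs   Ds   Es
#>  2.7  1.0 41.0  0.0  3.0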
res <- ga(type = "real-valued",
          fitness = Fn,
          lower = c(1.5, 1, 1, -60, 1),
          upper = c(3, 2, 51, 60, 3),
          popSize = 1000,
          seed = 123)
summary(res)
#> ── Genetic Algorithm ───────────────────
#>
#> GA settings:
#> Type = real-valued
#> Population size = 1000
#> Number of generations = 100
#> Elitism = 50
#> Crossover probability = 0.8
#> Mutation probability = 0.1
#> Search domain =
#> x1 x2 x3 x4 x5
#> lower 1.5 1 1 -60 1
#> upper 3.0 2 51 60 3
#>
#> GA results:
#> Iterations = 100
#> Fitness function value = 119
#> Solutions =
#> x1 x2 x3 x4 x5
#> [1,] 2.854089 1.556080 46.11389 49.31045 2.532682
#> [2,] 2.869408 1.638266 46.12966 48.71106 2.559620
#> [3,] 2.865254 1.665405 46.21684 49.04667 2.528606
#> [4,] 2.866494 1.630416 46.12736 48.78017 2.530454
#> [5,] 2.860940 1.650015 46.31773 48.92642 2.521276
#> [6,] 2.851644 1.660358 46.09504 48.81425 2.525504
#> [7,] 2.855078 1.611837 46.13855 48.62022 2.575492
#> [8,] 2.857066 1.588893 46.15918 48.60505 2.588992
#> [9,] 2.862644 1.637806 46.20663 48.92781 2.579260
#> [10,] 2.861573 1.630762 46.23494 48.90927 2.555612
#> ...
#> [59,] 2.853788 1.640810 46.35649 48.87381 2.536682
#> [60,] 2.859090 1.658127 46.15508 48.85404 2.590679
apply(res@solution, 1, decode) |> t() |> unique()
#> As Bs Cs Ds Es
#> [1,] 3 2 51 60 3
Created on 2022-10-24 with reprex v2.0.2

Pi Estimator in R

The code below estimates pi in R. Now I am trying to find the minimum number of terms N_Min
you would have to include in the estimate of pi to make it accurate to three decimal places.
pi_Est <- function(NTerms){
  NTerms = 5 # start with an estimate of just five terms
  pi_Est = 0 # initialise the value of pi to zero
  Sum_i = NA # initialise the summation variable to null
  for(ii in 1:NTerms)
  {
    Sum_i[ii] = (-1)^(ii+1)/(2*ii - 1) # this is the series equation for calculating pi
  }
  Sum_i = 4*Sum_i # multiply by four as required in the formula (see lecture notes)
  pi_Est = sum(Sum_i)
  cat('\nThe estimate of pi with terms = ', NTerms, ' is ', pi_Est)
}
First of all, I would change some things about your function. Instead of getting it to print out a message, get it to return a value. Otherwise it becomes very difficult to do anything with its output, including testing it for convergence to pi.
Also, no matter what value of NTerms you feed this function, you immediately overwrite NTerms inside the function.
You could rewrite the function like this:
pi_Est <- function(NTerms) {
  pi_Est <- 0
  Sum_i <- numeric()
  for(ii in seq(NTerms))
  {
    Sum_i[ii] <- (-1)^(ii+1)/(2*ii - 1)
  }
  return(sum(4 * Sum_i))
}
And to show it converges to pi, let's test it with 50,000 terms:
pi_Est(50000)
#> [1] 3.141573
Now, if we want to find the first value of NTerms that is correct to 3 decimal places, we are going to need to be able to call this function on a vector of NTerms - at the moment it is only working on a single number. So let's define the function f that vectorizes pi_Est:
f <- Vectorize(pi_Est)
Now, let's create the estimate for all values of NTerms between 1 and 2,000 and store them in a vector:
estimates <- f(1:2000)
We can see that the values of estimates seem to oscillate round and converge to pi if we plot the first 100 values:
plot(estimates[1:100], type = 'l')
abline(h = pi)
Our answer is just the first value which, when rounded to three decimal places, is the same as pi rounded to three decimal places:
result <- which(round(estimates, 3) == round(pi, 3))[1]
result
#> [1] 1103
And we can check this is correct by feeding 1103 into our original function:
pi_Est(result)
#> [1] 3.142499
You will see that this gives us 3.142, which is the same as pi rounded to 3 decimal places.
Created on 2022-01-31 by the reprex package (v2.0.1)
1000 terms are required to make the estimate accurate to within 0.001:
pi_Est1 <- function(n) {
  if (n == 0) return(0)
  neg <- 1/seq(3, 2*n + 1, 4)
  if (n %% 2) neg[length(neg)] <- 0
  4*sum(1/seq(1, 2*n, 4) - neg)
}
pi_Est2 <- function(tol) {
  for (i in ceiling(1/tol + 0.5):0) {
    est <- pi_Est1(i)
    if (abs(est - pi) > tol) break
    est1 <- est
  }
  list(NTerms = i + 1, Estimate = est1)
}
tol <- 1e-3
pi_Est2(tol)
#> $NTerms
#> [1] 1000
#>
#> $Estimate
#> [1] 3.140593
tol - abs(pi - pi_Est2(tol)$Estimate)
#> [1] 2.500001e-10
tol - abs(pi - pi_Est1(pi_Est2(tol)$NTerms - 1))
#> [1] -1.00075e-06
Created on 2022-01-31 by the reprex package (v2.0.1)
Perhaps we can try the code below
pi_Est <- function(digits = 3) {
  s <- 0
  ii <- 1
  repeat {
    s <- s + 4 * (-1)^(ii + 1) / (2 * ii - 1)
    if (round(s, digits) == round(pi, digits)) break
    ii <- ii + 1
  }
  list(est = s, iter = ii)
}
and you will see
> pi_Est()
$est
[1] 3.142499
$iter
[1] 1103
> pi_Est(5)
$est
[1] 3.141585
$iter
[1] 130658
Why not use a single line of code for the calculation?
Pi <- tail(cumsum(4*(1/seq(1,4*50000000,2))*rep(c(1,-1), 50000000)),1)
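On a much smaller scale, the same cumsum idea also gives a vectorised way to locate the first partial sum that matches pi to three decimal places (a sketch of my own; the index should agree with the 1103 found in the answers above):
est <- cumsum(4 * (-1)^(0:9999) / seq(1, by = 2, length.out = 10000))
which(round(est, 3) == round(pi, 3))[1]
#> [1] 1103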

Is there a testthat expect_* function for comparing two objects' behavior?

I am writing some unit tests for an R package using testthat. I would like to compare two objects where not all the details need to match, but they must maintain equivalence with respect to a set of functions of interest.
For a simple example, I want to use something like
library(testthat)
x <- 1:4
y <- matrix(4:1, nrow=2)
test_that("objects behave similarly", {
expect_equal_applied(x, y, .fn=list(sum, prod))
## which would be shorthand for:
## expect_equal(sum(x), sum(y))
## expect_equal(prod(x), prod(y))
})
In practice, x and y might be S3 objects, not simply base data structures.
Obviously, this is simple to implement, but I'd prefer something idiomatic if already existing. So, the question is, does testthat implement an expect function like this?
Searching through the API, nothing struck me as fitting this description, but it seems like a natural pattern. Or maybe there is a reason why such a pattern is objectionable that I'm overlooking.
Looking at the documentation, {testthat} currently (third edition) has no function like expect_equal_applied. But, as you already mention, we can construct such a function easily:
library(testthat)
x <- 1:4
y <- matrix(4:1, nrow=2)
expect_equal_applied <- function(object, expected, fns) {
  fns <- purrr::map(fns, rlang::as_function)
  purrr::map(fns, ~ expect_equal(.x(object), .x(expected)))
}
test_that("objects behave similarly", {
  expect_equal_applied(x, y, fns = list(sum, prod))
})
#> Test passed
x <- 1:3
test_that("objects behave similarly", {
  expect_equal_applied(x, y, fns = list(sum, prod))
})
#> -- Failure (<text>:19:3): objects behave similarly -----------------------------
#> .x(object) not equal to .x(expected).
#> 1/1 mismatches
#> [1] 6 - 10 == -4
#> Backtrace:
#> 1. global::expect_equal_applied(x, y, fns = list(sum, prod))
#> 2. purrr::map(fns, ~expect_equal(.x(object), .x(expected)))
#> 3. .f(.x[[i]], ...)
#> 4. testthat::expect_equal(.x(object), .x(expected))
#>
#> -- Failure (<text>:19:3): objects behave similarly -----------------------------
#> .x(object) not equal to .x(expected).
#> 1/1 mismatches
#> [1] 6 - 24 == -18
#> Backtrace:
#> 1. global::expect_equal_applied(x, y, fns = list(sum, prod))
#> 2. purrr::map(fns, ~expect_equal(.x(object), .x(expected)))
#> 3. .f(.x[[i]], ...)
#> 4. testthat::expect_equal(.x(object), .x(expected))
Created on 2021-09-17 by the reprex package (v2.0.1)
Regarding why such a function seems to be missing from {testthat}: I think it isn't really necessary, given that we can construct it easily with lapply or map, for example:
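expect_equal_applied <- function(object, expected, fns) {
  # base-R sketch of the same helper, without purrr/rlang
  invisible(lapply(fns, function(f) testthat::expect_equal(f(object), f(expected))))
}
With this version you would pass plain function(x) ... definitions rather than purrr-style lambdas.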

Call function in redefining the function in R

I want to redefine my function in a loop by calling the function from the last iteration. However, I know this is basically recursion, which I don't want. To give an example, see the following:
for(i in 1:3)
{
  ...
  myfunction <- function(y){myfunction(y)*dnorm(full_x[j], mean=y*full_x[j-1], sd=true_eps)}
  ...
  result[[i]] <- myfunction
}
What I want is that in step t I define a function called myfunction, and in step t+1 I redefine the function by using myfunction from the last step, in order to do some dynamic updating. The program keeps giving me the error: "Error: C stack usage 7971152 is too close to the limit". I guess the problem lies in this "recursive way". I tried to relabel the function in each loop but it did not work. Is there any way to solve this?
A concrete example would be the following:
# loop for calculating the density function
for(j in 2:length(full_x))
{
  # define the normal density
  trued <- function(y){dnorm(full_x[j], mean=y*full_x[j-1], sd=true_eps)}
  # joint distribution,
  # which is the normal density multiplied by a prior
  if(j >= 3)
  {
    trued <- function(y){trued(y)*true_density(y)}
  }
  # integration of the density w.r.t. rho
  trueint <- integrate(trued, lower = 0, upper = 1)$value
  # density function
  true_density <- function(y) {trued(y)/trueint}
  # save into list
  dyn_density_true[[j-1]] <- true_density
}
What I want to do is calculate a sequence of density functions. In the first step, it is just a normal density normalised by its integral (so a posterior density). In the following steps, I need to use the density function from the last step as a prior and then do the same exercise again.
You're referring to myfunction directly when you want to refer to the latest entry in result. If you print myfunction or result you'll see why it doesn't work.
You can instead create a string with the actual function you want and then parse it. I've created a sample function below where each iteration multiplies the previous function result by 5.
f <- function(y) y**2
k <- 5
res <- list(f)
for (i in 1:3) {
  fstring <- paste0('function(y) k*res[[', i, ']](y)')
  res[[i+1]] <- eval(parse(text = fstring))
}
res
#> [[1]]
#> function(y) y**2
#>
#> [[2]]
#> function(y) k*res[[1]](y)
#>
#> [[3]]
#> function(y) k*res[[2]](y)
#>
#> [[4]]
#> function(y) k*res[[3]](y)
lapply(res,function(f) f(2))
#> [[1]]
#> [1] 4
#>
#> [[2]]
#> [1] 20
#>
#> [[3]]
#> [1] 100
#>
#> [[4]]
#> [1] 500
Created on 2019-10-16 by the reprex package (v0.3.0)
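Applied to the concrete density example from the question, the same parse/eval pattern might look roughly like this (a sketch that assumes full_x and true_eps exist as in the question; each stored density refers to the previous list entry by index, so nothing ever calls itself):
dyn_density_true <- list()
norm_const <- numeric()
for (j in 2:length(full_x)) {
  dens_txt <- paste0("dnorm(full_x[", j, "], mean = y * full_x[", j - 1, "], sd = true_eps)")
  if (j >= 3)  # multiply by the previously stored density, which acts as the prior
    dens_txt <- paste0(dens_txt, " * dyn_density_true[[", j - 2, "]](y)")
  trued <- eval(parse(text = paste0("function(y) ", dens_txt)))
  norm_const[j - 1] <- integrate(trued, lower = 0, upper = 1)$value
  dyn_density_true[[j - 1]] <- eval(parse(text = paste0(
    "function(y) (", dens_txt, ") / norm_const[", j - 1, "]")))
}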

Variable length formula construction

I am trying to apply the Simpson's Diversity Index across a number of different datasets with a variable number of species ('nuse') captured. As such I am trying to construct code which can cope with this automatically without needing to manually construct a formula each time I do it. Example dataset for a manual formula is below:
diverse <- data.frame(nuse1=c(0,20,40,20), nuse2=c(5,5,3,20), nuse3=c(0,2,8,20), nuse4=c(5,8,2,20), total=c(10,35,53,80))
simp <- function(x) {
  total <- x[,"total"]
  nuse1 <- x[,"nuse1"]
  nuse2 <- x[,"nuse2"]
  nuse3 <- x[,"nuse3"]
  nuse4 <- x[,"nuse4"]
  div <- round(((1-(((nuse1*(nuse1 - 1)) + (nuse2*(nuse2 - 1)) + (nuse3*(nuse3 - 1)) + (nuse4*(nuse4 - 1)))/(total*(total - 1))))), digits=4)
  return(div)
}
diverse$Simpson <- simp(diverse)
diverse
As you can see this works fine. However, how would I be able to create a function which could automatically adjust to, for example, 9 species (so up to nuse9)?
I have experimented with the paste function + as.formula as indicated in Formula with dynamic number of variables; however, it is the expanded form of (nuse1 * (nuse1 - 1)) that I'm struggling with. Does anyone have any suggestions please? Thanks.
How about something like:
diverse <- data.frame(nuse1=c(0,20,40,20), nuse2=c(5,5,3,20), nuse3=c(0,2,8,20), nuse4=c(5,8,2,20), total=c(10,35,53,80))
simp <- function(x, species) {
  spcs <- grep(species, colnames(x)) # which column names have "nuse"
  total <- rowSums(x[,spcs])         # sum by row
  div <- round(1 - rowSums(apply(x[,spcs], 2, function(s) s*(s-1))) / (total*(total - 1)), digits = 4)
  return(div)
}
diverse$Simpson2 <- simp(diverse, species = "nuse")
diverse
# nuse1 nuse2 nuse3 nuse4 total Simpson2
# 1 0 5 0 5 10 0.5556
# 2 20 5 2 8 35 0.6151
# 3 40 3 8 2 53 0.4107
# 4 20 20 20 20 80 0.7595
All it does is find which columns contain "nuse" (or whatever species pattern you have in your dataset) in their names. It constructs the "total" value within the function, so it does not require a total column in the dataset.
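As a quick worked check of the formula on the first row (my own illustration, with nuse counts 0, 5, 0, 5 and a total of 10):
round(1 - (0*(0 - 1) + 5*(5 - 1) + 0*(0 - 1) + 5*(5 - 1)) / (10 * (10 - 1)), 4)
# [1] 0.5556
which agrees with the Simpson2 value shown above.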
