The dataset, "male.wt" is a collection of 100 weights of male taxi patrons. Use bootstrap sampling to estimate the variance for the population of males who use taxis.
I am trying to use the boot() function in R and I am completely confused.
Here is the data set I was given for this problem.
malewt = structure(list(x = c(184.291514203183, 238.183299307855, 217.544606414151,
233.931926116624, 229.12042611005, 243.881689583996, 259.230802242781,
217.939619221934, 137.636923032685, 170.379447345948, 195.852641733122,
185.832690963969, 186.676714564328, 215.711426139253, 186.413495533494,
237.83223009147, 180.124153998503, 215.393108191779, 188.846039074142,
142.373198101437, 233.234630310378, 186.141325709762, 220.062112044187,
213.851199681057, 148.622198219149, 197.438771523918, 206.920961557603,
190.874857845699, 217.889075914836, 152.318099234166, 218.089620221194,
196.736930479919, 235.122424359223, 217.446826955801, 201.352404389309,
216.290374765672, 173.85609629461, 215.961826427613, 213.87732008193,
177.952521505061, 132.734879010504, 221.707886490889, 224.336488758995,
218.604034088911, 228.157844234374, 196.544661577149, 228.787736646279,
237.009125179319, 194.73342863066, 190.569523115323, 192.198491573128,
204.589742888237, 198.662802876867, 195.238634847898, 201.834508205684,
220.989134791548, 180.006492709174, 168.199898332071, 250.705048451896,
209.824701073225, 212.36145906497, 205.250728119598, 196.572466206237,
186.818746613236, 138.493748904934, 193.572713536688, 171.605082170236,
243.803356964054, 188.768040728907, 201.408088256783, 196.23847341016,
202.686141019735, 167.25735383257, 171.907526464761, 224.396425425799,
183.494470842407, 220.15969728649, 143.164453849305, 152.539942653094,
198.52004650272, 185.145815429412, 206.741840856439, 259.866591064748,
135.212011256414, 164.2297511973, 200.623731663392, 199.599177980586,
175.970651370212, 197.304554981825, 189.116019204125, 198.630618004183,
185.096675814379, 203.780160863916, 174.584831373708, 150.483001599829,
223.78078870159, 170.772181294322, 218.770812392057, 151.645084212409,
210.350813872005)), class = "data.frame", row.names = c(NA, -100L
))
The question is very ambiguous. Here is how to plot a histogram of the bootstrap estimates of the variance:
library(purrr)
boots <- 100
data <- malewt  # the data.frame defined in the question above
map(seq_len(boots),
    ~ data$x[sample.int(length(data$x), length(data$x), TRUE)]
) %>%
  map_dbl(var) %>%
  hist()
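The same pipeline can also return a point estimate rather than a histogram; a minimal sketch reusing data and boots from above (output omitted):
# average of the resampled variances = bootstrap point estimate
map(seq_len(boots),
    ~ data$x[sample.int(length(data$x), length(data$x), TRUE)]
) %>%
  map_dbl(var) %>%
  mean()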
Created on 2022-12-09 with reprex v2.0.2
Here is a solution with the package boot. In the case of the variance it's very easy to bootstrap:
Create a function bootvar to compute each resampled variance. The function must have the data as its 1st argument and the indices vector as its 2nd argument, in that order; the indices vector is what resamples the data, and boot creates it automatically for the user.
In the function, extract the sample given by the indices vector (i below).
Compute and return the statistic of interest, var.
I wrote the function in two lines of code to make it clearer.
library(boot)
set.seed(2022)
bootvar <- function(data, i) {
  y <- data$x[i]  # the resample selected by the indices vector
  var(y)
}
b <- boot(malewt, bootvar, R = 5000)
b
#>
#> ORDINARY NONPARAMETRIC BOOTSTRAP
#>
#>
#> Call:
#> boot(data = malewt, statistic = bootvar, R = 5000)
#>
#>
#> Bootstrap Statistics :
#> original bias std. error
#> t1* 792.2551 -7.657891 106.0901
# bootstrapped variance
mean(b$t)
#> [1] 784.5972
hist(b$t)
Created on 2022-12-09 with reprex v2.0.2
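If an interval estimate is also wanted, boot.ci can be applied to the same object b; a percentile interval is one reasonable choice here (a sketch, output omitted):
# 95% percentile bootstrap confidence interval for the variance
boot.ci(b, conf = 0.95, type = "perc")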
It is also possible to write the bootstrap function so that it takes a vector as its first argument rather than a data.frame as above. The call to boot is then adapted accordingly. The results are exactly the same as long as the pseudo-RNG seed is set to the same value (2022 in this case).
library(boot)
set.seed(2022)
bootvar_x <- function(x, i) {
  y <- x[i]  # the resample selected by the indices vector
  var(y)
}
bx <- boot(malewt$x, bootvar_x, R = 5000)
bx
#>
#> ORDINARY NONPARAMETRIC BOOTSTRAP
#>
#>
#> Call:
#> boot(data = malewt$x, statistic = bootvar_x, R = 5000)
#>
#>
#> Bootstrap Statistics :
#> original bias std. error
#> t1* 792.2551 -7.657891 106.0901
mean(bx$t)
#> [1] 784.5972
hist(bx$t)
Created on 2022-12-09 with reprex v2.0.2
The code below estimates pi in R. Now I am trying to find the minimum number of terms, N_Min, that I would have to include in my estimate of pi to make it accurate to three decimal places.
pi_Est<- function(NTerms){
NTerms = 5 # start with an estimate of just five terms
pi_Est = 0 # initialise the value of pi to zero
Sum_i = NA # initialise the summation variable to null
for(ii in 1:NTerms)
{
Sum_i[ii] = (-1)^(ii+1)/(2*ii - 1) # this is the series equation for calculating pi
}
Sum_i = 4*Sum_i # multiply by four as required in the formula (see lecture notes)
pi_Est = sum(Sum_i)
cat('\nThe estimate of pi with terms = ', NTerms ,' is ',pi_Est)
}
First of all, I would change some things about your function. Instead of having it print a message, have it return a value; otherwise it becomes very difficult to do anything with its output, including testing it for convergence to pi.
Also, no matter what value of NTerms you feed this function, you immediately overwrite it with 5 inside the function.
You could rewrite the function like this:
pi_Est <- function(NTerms) {
pi_Est <- 0
Sum_i <- numeric()
for(ii in seq(NTerms))
{
Sum_i[ii] <- (-1)^(ii+1)/(2*ii - 1)
}
return(sum(4 * Sum_i))
}
And to show it converges to pi, let's test it with 50,000 terms:
pi_Est(50000)
#> [1] 3.141573
Now, if we want to find the first value of NTerms that is correct to 3 decimal places, we need to be able to call this function on a vector of NTerms; at the moment it only works on a single number. So let's define a function f that vectorizes pi_Est:
f <- Vectorize(pi_Est)
Now, let's create the estimate for all values of NTerms between 1 and 2,000 and store them in a vector:
estimates <- f(1:2000)
We can see that the values of estimates oscillate around pi and converge towards it if we plot the first 100 values:
plot(estimates[1:100], type = 'l')
abline(h = pi)
Our answer is the first value that, when rounded to three decimal places, matches pi rounded to three decimal places:
result <- which(round(estimates, 3) == round(pi, 3))[1]
result
#> [1] 1103
And we can check this is correct by feeding result (1103) into our original function:
pi_Est(result)
#> [1] 3.142499
You will see that this gives us 3.142, which is the same as pi rounded to 3 decimal places.
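As an aside, the loop inside pi_Est can be replaced by a vectorised sum over the terms. This variant (a sketch; the name pi_Est_vec is hypothetical, and it is still scalar in NTerms, so Vectorize applies to it the same way) sums exactly the same terms:
# vectorised form of the same series: compute all NTerms terms at once
pi_Est_vec <- function(NTerms) {
  ii <- seq_len(NTerms)
  sum(4 * (-1)^(ii + 1) / (2 * ii - 1))
}
pi_Est_vec(50000)
#> [1] 3.141573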
Created on 2022-01-31 by the reprex package (v2.0.1)
1000 terms are required to make the estimate accurate to within 0.001:
pi_Est1 <- function(n) {
  if (n == 0) return(0)
  # negative terms of the series: 1/3, 1/7, 1/11, ...
  neg <- 1/seq(3, 2*n + 1, 4)
  # for odd n the last negative term is not part of the first n terms
  if (n %% 2) neg[length(neg)] <- 0
  # positive terms are 1/1, 1/5, 1/9, ...; pair them with the negatives and sum
  4*sum(1/seq(1, 2*n, 4) - neg)
}
pi_Est2 <- function(tol) {
  # search downward from an upper bound on the number of terms,
  # stopping at the first i whose error exceeds tol
  for (i in ceiling(1/tol + 0.5):0) {
    est <- pi_Est1(i)
    if (abs(est - pi) > tol) break
    est1 <- est
  }
  # i + 1 is the last term count whose estimate was still within tol
  list(NTerms = i + 1, Estimate = est1)
}
tol <- 1e-3
pi_Est2(tol)
#> $NTerms
#> [1] 1000
#>
#> $Estimate
#> [1] 3.140593
tol - abs(pi - pi_Est2(tol)$Estimate)
#> [1] 2.500001e-10
tol - abs(pi - pi_Est1(pi_Est2(tol)$NTerms - 1))
#> [1] -1.00075e-06
Created on 2022-01-31 by the reprex package (v2.0.1)
Perhaps we can try the code below, which increments the number of terms until the rounded estimate matches pi rounded to the desired number of digits:
pi_Est <- function(digits = 3) {
s <- 0
ii <- 1
repeat {
s <- s + 4 * (-1)^(ii + 1) / (2 * ii - 1)
if (round(s, digits) == round(pi, digits)) break
ii <- ii + 1
}
list(est = s, iter = ii)
}
and you will see
> pi_Est()
$est
[1] 3.142499
$iter
[1] 1103
> pi_Est(5)
$est
[1] 3.141585
$iter
[1] 130658
Why not use a single line of code for the calculation?
Pi <- tail(cumsum(4*(1/seq(1,4*50000000,2))*rep(c(1,-1), 50000000)),1)
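A note on that line: cumsum followed by tail keeps all 100 million partial sums in memory; plain sum gives essentially the same final value (up to floating-point accumulation order) without the intermediate vector:
# same 100-million-term series, summed directly
Pi <- sum(4*(1/seq(1, 4*50000000, 2))*rep(c(1, -1), 50000000))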
So I'm trying to plot a function as a graph, and ultimately I landed on what seemed the easiest option: supply the alpha values and optimize the velocity for the x_position values. The problem is, I'm doing something wrong with the optimization.
Here's what I've got thus far:
y <- seq(0, 55, 0.1)
x_position <- function(alpha,velo)
{velo*cos(alpha)*((velo*sin(alpha)+sqrt((velo^2*sin(alpha))^2+2*16.5*9.81)/9.81))}
x <- optimize(x_position,c(1,1000),alpha=y,maximum=TRUE)$objective
Basically, I'm trying to make "y" a vector for the angle and "x" a vector of the maximum function value for each angle, so that I can then plot x against y for the function. The problem is, I can't get the x vector right. For whatever reason it keeps telling me "invalid function value in 'optimize'". Changing the optimization interval doesn't accomplish anything, and I'm out of ideas. The function seems to work fine when I test it with, e.g., alpha 55 and velocity 10.
optimize varies one scalar argument and requires the function to return a single value; with alpha = y (a whole vector) the function returns a vector, which is what produces "invalid function value". So fix alpha to a single angle per call and optimize over velo:
y <- seq(0, 55, 0.1)
x_position <- function(velo,alpha){
velo*cos(alpha)*((velo*sin(alpha)+sqrt((velo^2*sin(alpha))^2+2*16.5*9.81)/9.81))
}
optimize(f = x_position, interval = c(1,1000), maximum=TRUE, alpha = y[1])
#> $maximum
#> [1] 999.9999
#>
#> $objective
#> [1] 1834.098
a <- sapply(y, function(y) optimize(f = x_position, interval = c(1,1000),
maximum=TRUE, alpha = y)$objective)
head(a)
#> [1] 1834.098 10225190.493 20042734.667 29061238.316 36921162.118
#> [6] 43309155.705
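With the vector a computed above, the plot the question was aiming for is then just (a sketch; the axis labels are assumptions):
# maximal x-position as a function of the angle
plot(y, a, type = "l", xlab = "alpha", ylab = "maximal x_position")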
Created on 2021-09-29 by the reprex package (v2.0.1)
I am writing some unit tests for an R package using testthat. I would like to compare two objects where not all the details need to match, but they must maintain equivalence with respect to a set of functions of interest.
For a simple example, I want to use something like
library(testthat)
x <- 1:4
y <- matrix(4:1, nrow=2)
test_that("objects behave similarly", {
expect_equal_applied(x, y, .fn=list(sum, prod))
## which would be shorthand for:
## expect_equal(sum(x), sum(y))
## expect_equal(prod(x), prod(y))
})
In practice, x and y might be S3 objects, not simply base data structures.
Obviously, this is simple to implement, but I'd prefer something idiomatic if it already exists. So the question is: does testthat implement an expect function like this?
Searching through the API, nothing struck me as fitting this description, but it seems like a natural pattern. Or maybe there is a reason why such a pattern is objectionable that I'm overlooking.
Looking at the documentation, {testthat} currently (third edition) has no function like expect_equal_applied. But, as you already mention, we can construct such a function easily:
library(testthat)
x <- 1:4
y <- matrix(4:1, nrow=2)
expect_equal_applied <- function(object, expected, fns) {
  # accept purrr-style lambdas as well as plain functions
  fns <- purrr::map(fns, rlang::as_function)
  # apply each function to both objects and compare the results
  purrr::map(fns, ~ expect_equal(.x(object), .x(expected)))
}
test_that("objects behave similarly", {
expect_equal_applied(x, y, fns = list(sum, prod))
})
#> Test passed
x <- 1:3
test_that("objects behave similarly", {
expect_equal_applied(x, y, fns = list(sum, prod))
})
#> -- Failure (<text>:19:3): objects behave similarly -----------------------------
#> .x(object) not equal to .x(expected).
#> 1/1 mismatches
#> [1] 6 - 10 == -4
#> Backtrace:
#> 1. global::expect_equal_applied(x, y, fns = list(sum, prod))
#> 2. purrr::map(fns, ~expect_equal(.x(object), .x(expected)))
#> 3. .f(.x[[i]], ...)
#> 4. testthat::expect_equal(.x(object), .x(expected))
#>
#> -- Failure (<text>:19:3): objects behave similarly -----------------------------
#> .x(object) not equal to .x(expected).
#> 1/1 mismatches
#> [1] 6 - 24 == -18
#> Backtrace:
#> 1. global::expect_equal_applied(x, y, fns = list(sum, prod))
#> 2. purrr::map(fns, ~expect_equal(.x(object), .x(expected)))
#> 3. .f(.x[[i]], ...)
#> 4. testthat::expect_equal(.x(object), .x(expected))
Created on 2021-09-17 by the reprex package (v2.0.1)
Regarding why such a function seems to be missing from {testthat}: I think it isn't really necessary, given that we can construct it easily with lapply or map.
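To illustrate that point, here is a dependency-free sketch with base lapply (the name expect_equal_applied_base is hypothetical, and it assumes fns is a list of plain functions):
expect_equal_applied_base <- function(object, expected, fns) {
  # apply each function to both objects and compare the results
  lapply(fns, function(f) testthat::expect_equal(f(object), f(expected)))
}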
I have a vector that consists of n elements.
I am wondering how to do the following in an efficient manner (it is basically a rolling computation):
1. Extract elements with indices 1 to k
2. Extract elements with indices k+1 to 2k
3. Perform a t.test on those elements
4. Extract elements with indices 2 to k+1
5. Extract elements with indices k+2 to 2k+1
6. Perform a t.test on those elements
...
7. Repeat until the end of the vector.
I do not need to see the extracted values, just some statistics (e.g. $statistic).
I can do it with a loop, but I am wondering how one could do it with some function (e.g. apply).
All the best
It's unclear from your pseudo-code whether you really expected the vector to have n = 3 * k elements, but that's how I have programmed it.
set.seed(123)
x = rnorm(15)
o = 1:5  # window offsets
k = 5
sapply(o, function(oi){
  x1 = x[oi:(k + oi - 1)]        # first window: k elements starting at oi
  x2 = x[(k + oi):(2 * k + oi)]  # second window: the next k + 1 elements, as written
  t.test(x1, x2)$statistic
})
and the result is
t t t t t
0.04435004 1.28433640 1.67879015 0.11191044 -0.19398686
You could also work out some updating formulae for adding and subtracting a new element to each pair of vectors in the t-test, but it doesn't seem worth the effort.
You can use sapply, here wrapped in a function for clarity:
multiple_ttest <- function(vec, k)
{
  # slide both k-element windows forward one position at a time
  sapply(seq(length(vec) - 2 * k), function(x){
    t.test(vec[x + 1:k], vec[x + (k + 1):(2 * k)])$statistic
  })
}
multiple_ttest(rnorm(100), 20)
#> t t t t t t t
#> -0.6253586 -1.1680595 -1.2979357 -1.1051207 -1.4668645 -0.6156220 -0.2470322
#> t t t t t t t
#> -1.2416802 -0.1763081 -0.1682269 -0.2490938 0.4569783 0.7721602 1.4383522
#> t t t t t t t
#> 1.5530524 2.1523397 2.3221591 2.1876903 1.7248729 1.6149827 1.6919704
#> t t t t t t t
#> 1.9747871 2.3833846 2.2826131 2.2337453 2.2225939 1.6284361 2.2067872
#> t t t t t t t
#> 1.2447557 1.1435925 0.6731618 -0.1737223 -0.5889784 -0.7339752 -1.2202893
#> t t t t t t t
#> -1.4936935 -1.5821069 -1.7795797 -1.6290700 -1.7760278 -1.4223367 -1.5931553
#> t t t t t t t
#> -2.6832755 -2.7733223 -2.7710590 -3.0889778 -2.3604743 -2.3510220 -1.5456745
#> t t t t t t t
#> -1.4098492 -0.5268468 -0.3187017 0.1878282 0.5208563 0.5656160 0.4952028
#> t t t t
#> 0.2005182 0.3769581 0.5093635 0.9358878
Created on 2020-04-02 by the reprex package (v0.3.0)