Incorrect results using the new `svyquantile()` with `svyby()`


In July of 2021, the survey package overhauled the svyquantile() function. Quoting from the package's NEWS:
svyquantile() has been COMPLETELY REWRITTEN. The old version is available as oldsvyquantile().
svyquantile() was being used in one of my packages when this change occurred. A co-author updated the code to keep using the old version of the function for continuity. Now that it's been 1.5 years since the release, we're updating the code to use the new version. However, we are experiencing an issue with the new version when combining it with svyby().
When we use the updated function, the results for all levels of the by variable come out equivalent. Also, the returned data frame duplicates each of the columns (once for each level of the by variable).
My guess is that rather than showing the results for the by variable in each row, the results are mistakenly being added as new columns (with repeated column names, and duplicated rows).
Am I doing something wrong with svyby(), or is this a bug in the code? The survey package is so widely used and I am no expert in its use, so I lean toward there being an issue with my code. But I am having trouble diagnosing the problem.
Thank you!
suppressPackageStartupMessages(library(survey))
packageVersion("survey")
#> [1] '4.1.1'
data(api)
# OLD svyquantile, results as expected
svyby(
# svyby args
formula = ~api00,
by = ~both,
design = svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc),
FUN = oldsvyquantile,
# args being passed to oldsvyquantile
na.rm = TRUE,
keep.var = FALSE,
quantiles = 0.5
)
#> both statistic
#> No No 631.0
#> Yes Yes 653.5
# NEW svyquantile, I don't understand what I am doing wrong
svyby(
# svyby args
formula = ~api00,
by = ~both,
design = svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc),
FUN = svyquantile,
# args being passed to svyquantile
na.rm = TRUE,
keep.var = FALSE,
quantiles = 0.5
)
#> both statistic.statistic.api00.quantile statistic.statistic.api00.ci.2.5
#> No No 631 547
#> Yes Yes 631 547
#> statistic.statistic.api00.ci.97.5 statistic.statistic.api00.se
#> No 722 39.75493
#> Yes 722 39.75493
#> statistic.statistic.api00.quantile statistic.statistic.api00.ci.2.5
#> No 655 566
#> Yes 655 566
#> statistic.statistic.api00.ci.97.5 statistic.statistic.api00.se
#> No 717 34.94774
#> Yes 717 34.94774
# results are as expected when used outside `svyby()`
svyquantile(
x = ~api00,
design = svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc),
na.rm = TRUE,
keep.var = FALSE,
quantiles = 0.5
)
#> $api00
#> quantile ci.2.5 ci.97.5 se
#> 0.5 652 561 714 35.66788
#>
#> attr(,"hasci")
#> [1] TRUE
#> attr(,"class")
#> [1] "newsvyquantile"
Created on 2022-12-25 with reprex v2.0.2

It's a bug. It's fixed in the development version, on r-forge (https://r-forge.r-project.org/R/?group_id=1788)
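Until a version containing the fix is installed, one possible workaround is to do the by-group split by hand: subset the design to each level of the by variable and call the new svyquantile() on each subset. This is only a sketch under that assumption, not something the answer above recommends; the names dclus1, by_levels and medians are made up for illustration.

```r
library(survey)
data(api)

# same design as in the question
dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)

# levels of the by variable ("No", "Yes")
by_levels <- levels(apiclus1$both)

# one svyquantile() call per subpopulation; subset() on a design object
# keeps the design information needed for domain estimation
medians <- vapply(by_levels, function(lev) {
  sub_design <- subset(dclus1, both == lev)
  q <- svyquantile(~api00, sub_design, quantiles = 0.5, na.rm = TRUE)
  # q$api00 is a one-row matrix with columns quantile, ci.2.5, ci.97.5, se
  unname(q$api00[, "quantile"])
}, numeric(1))

medians
```

The result is a named numeric vector with one median per level, which can be compared against the quantile columns in the svyby() output shown in the question.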

Related

Set 3 double parameters in p_dbl using paradox package

How can I set a parameter to, say, 3 float values?
For example, I want to search parameter X over 0.99, 0.98 and 0.97.
p_dbl has lower and upper parameters, but no values argument I can use.
For example something like (this doesn't work ofc):
p_dbl(c(0.99, 0.98, 0.97))

I am assuming you want to do this for tuning purposes. If you don't, you have to use a p_fct() and set the levels as characters ("0.99", ...).
Note that the second solution that I am giving here is simply a shortcut for the first approach, where the transformation is explicitly defined. This means that the paradox package will create the transformation for you, similarly to how it is done in the first of the two solutions.
library(paradox)
library(data.table)
search_space = ps(
a = p_fct(levels = c("0.99", "0.98", "0.97"), trafo = function(x, param_set) {
switch(x,
"0.99" = 0.99,
"0.98" = 0.98,
"0.97" = 0.97
)
})
)
design = rbindlist(generate_design_grid(search_space, 3)$transpose(), fill = TRUE)
design
#> a
#> 1: 0.99
#> 2: 0.98
#> 3: 0.97
class(design[[1]])
#> [1] "numeric"
# the same can be achieved as follows:
search_space = ps(
a = p_fct(levels = c(0.99, 0.98, 0.97))
)
design = rbindlist(generate_design_grid(search_space, 3)$transpose(), fill = TRUE)
design
#> a
#> 1: 0.99
#> 2: 0.98
#> 3: 0.97
class(design[[1]])
#> [1] "numeric"
Created on 2023-02-15 with reprex v2.0.2

How do I perform a bootstrap in R and estimate the variance?

The dataset, "male.wt" is a collection of 100 weights of male taxi patrons. Use bootstrap sampling to estimate the variance for the population of males who use taxis.
I am trying to use the boot() function in R and I am just completely confused.
Here is the data set that was given to me to do this problem.
malewt = structure(list(x = c(184.291514203183, 238.183299307855, 217.544606414151,
233.931926116624, 229.12042611005, 243.881689583996, 259.230802242781,
217.939619221934, 137.636923032685, 170.379447345948, 195.852641733122,
185.832690963969, 186.676714564328, 215.711426139253, 186.413495533494,
237.83223009147, 180.124153998503, 215.393108191779, 188.846039074142,
142.373198101437, 233.234630310378, 186.141325709762, 220.062112044187,
213.851199681057, 148.622198219149, 197.438771523918, 206.920961557603,
190.874857845699, 217.889075914836, 152.318099234166, 218.089620221194,
196.736930479919, 235.122424359223, 217.446826955801, 201.352404389309,
216.290374765672, 173.85609629461, 215.961826427613, 213.87732008193,
177.952521505061, 132.734879010504, 221.707886490889, 224.336488758995,
218.604034088911, 228.157844234374, 196.544661577149, 228.787736646279,
237.009125179319, 194.73342863066, 190.569523115323, 192.198491573128,
204.589742888237, 198.662802876867, 195.238634847898, 201.834508205684,
220.989134791548, 180.006492709174, 168.199898332071, 250.705048451896,
209.824701073225, 212.36145906497, 205.250728119598, 196.572466206237,
186.818746613236, 138.493748904934, 193.572713536688, 171.605082170236,
243.803356964054, 188.768040728907, 201.408088256783, 196.23847341016,
202.686141019735, 167.25735383257, 171.907526464761, 224.396425425799,
183.494470842407, 220.15969728649, 143.164453849305, 152.539942653094,
198.52004650272, 185.145815429412, 206.741840856439, 259.866591064748,
135.212011256414, 164.2297511973, 200.623731663392, 199.599177980586,
175.970651370212, 197.304554981825, 189.116019204125, 198.630618004183,
185.096675814379, 203.780160863916, 174.584831373708, 150.483001599829,
223.78078870159, 170.772181294322, 218.770812392057, 151.645084212409,
210.350813872005)), class = "data.frame", row.names = c(NA, -100L
))

Very ambiguous question. Here is how to plot a histogram of the bootstrap estimator of the variance:
library(purrr)
boots <- 100
data <- structure(list(x = c(184.291514203183, 238.183299307855, 217.544606414151, 233.931926116624, 229.12042611005, 243.881689583996, 259.230802242781, 217.939619221934, 137.636923032685, 170.379447345948, 195.852641733122, 185.832690963969, 186.676714564328, 215.711426139253, 186.413495533494, 237.83223009147, 180.124153998503, 215.393108191779, 188.846039074142, 142.373198101437, 233.234630310378, 186.141325709762, 220.062112044187, 213.851199681057, 148.622198219149, 197.438771523918, 206.920961557603, 190.874857845699, 217.889075914836, 152.318099234166, 218.089620221194, 196.736930479919, 235.122424359223, 217.446826955801, 201.352404389309, 216.290374765672, 173.85609629461, 215.961826427613, 213.87732008193, 177.952521505061, 132.734879010504, 221.707886490889, 224.336488758995, 218.604034088911, 228.157844234374, 196.544661577149, 228.787736646279, 237.009125179319, 194.73342863066, 190.569523115323, 192.198491573128, 204.589742888237, 198.662802876867, 195.238634847898, 201.834508205684, 220.989134791548, 180.006492709174, 168.199898332071, 250.705048451896, 209.824701073225, 212.36145906497, 205.250728119598, 196.572466206237, 186.818746613236, 138.493748904934, 193.572713536688, 171.605082170236, 243.803356964054, 188.768040728907, 201.408088256783, 196.23847341016, 202.686141019735, 167.25735383257, 171.907526464761, 224.396425425799, 183.494470842407, 220.15969728649, 143.164453849305, 152.539942653094, 198.52004650272, 185.145815429412, 206.741840856439, 259.866591064748, 135.212011256414, 164.2297511973, 200.623731663392, 199.599177980586, 175.970651370212, 197.304554981825, 189.116019204125, 198.630618004183, 185.096675814379, 203.780160863916, 174.584831373708, 150.483001599829, 223.78078870159, 170.772181294322, 218.770812392057, 151.645084212409, 210.350813872005)), class = "data.frame", row.names = c(NA, -100L ))
map(seq_len(boots),
~ data$x[sample.int(length(data$x), length(data$x), replace = TRUE)]
) %>%
map_dbl(var) %>%
hist()
Created on 2022-12-09 with reprex v2.0.2

Here is a solution with the boot package. In the case of the variance, it's very easy to bootstrap.
Create a function bootvar to compute each resampled variance;
the function must take the data and the indices as its 1st and 2nd arguments, in this order. The indices vector is what performs the resampling from the data; boot creates it automatically for the user;
in the function, extract the sample given by the indices vector (i below);
compute and return the statistic of interest, var.
I wrote the function in two lines of code to make it clearer.
library(boot)
set.seed(2022)
bootvar <- function(data, i) {
y <- data$x[i]
var(y)
}
b <- boot(malewt, bootvar, R = 5000)
b
#>
#> ORDINARY NONPARAMETRIC BOOTSTRAP
#>
#>
#> Call:
#> boot(data = malewt, statistic = bootvar, R = 5000)
#>
#>
#> Bootstrap Statistics :
#> original bias std. error
#> t1* 792.2551 -7.657891 106.0901
# bootstrapped variance
mean(b$t)
#> [1] 784.5972
hist(b$t)
Created on 2022-12-09 with reprex v2.0.2
It is also possible to write the bootstrap function so that it takes a vector as its first argument, not a data.frame like above. The call to boot is then adapted accordingly. The results are exactly the same as long as the pseudo-RNG seed is set to the same value (in this case 2022).
library(boot)
set.seed(2022)
bootvar_x <- function(x, i) {
y <- x[i]
var(y)
}
bx <- boot(malewt$x, bootvar_x, R = 5000)
bx
#>
#> ORDINARY NONPARAMETRIC BOOTSTRAP
#>
#>
#> Call:
#> boot(data = malewt$x, statistic = bootvar_x, R = 5000)
#>
#>
#> Bootstrap Statistics :
#> original bias std. error
#> t1* 792.2551 -7.657891 106.0901
mean(bx$t)
#> [1] 784.5972
hist(bx$t)
Created on 2022-12-09 with reprex v2.0.2
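To make the role of the indices argument concrete, here is roughly what an ordinary nonparametric bootstrap does, written in base R without boot. It is a sketch, not boot's actual implementation, and it uses a simulated stand-in vector x (an assumption for self-containment, not the malewt data), so its numbers will differ from the output above.

```r
set.seed(2022)

# stand-in data: 100 simulated weights (hypothetical, replace with malewt$x)
x <- rnorm(100, mean = 200, sd = 28)

R <- 2000  # number of bootstrap replicates

# each replicate: draw indices with replacement, then compute the statistic
boot_vars <- replicate(R, {
  i <- sample.int(length(x), replace = TRUE)
  var(x[i])
})

# the same three quantities that print(boot_object) reports
summary_stats <- c(
  original  = var(x),
  bias      = mean(boot_vars) - var(x),
  std.error = sd(boot_vars)
)
summary_stats

# a simple percentile interval for the variance
quantile(boot_vars, c(0.025, 0.975))
```

Replacing x with malewt$x gives a bootstrap estimate for the question's data; the exact values will not match boot() even with the same seed, because the two approaches draw their random indices differently.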

How can I normalize isotopic ratios from a single study using the CIAAWconsensus package

I want to use the package CIAAWconsensus by Juris Meija and Antonio Possolo to normalize the isotopic ratios of the NIST SRM 981 to the reference isotope 207Pb. I get an error that I cannot understand when I use the function normalize.ratios:
library(CIAAWconsensus)
## Certified isotopic ratios in NIST SRM 981 (Value, standard uncertainty)
CertiRati.SRM.981 <- list(R.204.206 = c(0.059042, 0.000037/2),
R.207.206 = c(0.91464, 0.00033/2),
R.208.206 = c(2.1681, 0.0008/2))
## Ordering the data frame as requested by package
(dat <- data.frame(Study = 1:3, Year = rep(1991, 3), Author = rep('NIST', 3),
Outcome = c('204Pb/206Pb', '207Pb/206Pb', '208Pb/206Pb'),
Value = c(CertiRati.SRM.981[[1]][1],
CertiRati.SRM.981[[2]][1],
CertiRati.SRM.981[[3]][1]),
Unc = c(CertiRati.SRM.981[[1]][2],
CertiRati.SRM.981[[2]][2],
CertiRati.SRM.981[[3]][2]),
k_extra = rep(1, 3)))
# Study Year Author Outcome Value Unc k_extra
# 1 1 1991 NIST 204Pb/206Pb 0.059042 1.85e-05 1
# 2 2 1991 NIST 207Pb/206Pb 0.914640 1.65e-04 1
# 3 3 1991 NIST 208Pb/206Pb 2.168100 4.00e-04 1
## Attempt to normalize the ratios to the 207Pb
normalize.ratios(dat = dat, element = 'lead', ref.isotope = '207Pb')
# Error in dimnames(x) <- dn :
# Length of 'dimnames' [2] not equal to array extent
The data frame is created as indicated in the package's documentation, but I cannot get around this error. Any help will be much appreciated.
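As a rough cross-check that does not use the package at all, the renormalization itself is plain arithmetic: dividing each ratio by the 207Pb/206Pb ratio changes the denominator from 206Pb to 207Pb. The sketch below does this in base R and propagates the standard uncertainties to first order, assuming the three certified ratios are independent. That independence is my assumption and is unlikely to be what normalize.ratios does internally; all variable names are illustrative.

```r
# certified ratios and standard uncertainties from the question
value <- c("204Pb/206Pb" = 0.059042,
           "207Pb/206Pb" = 0.91464,
           "208Pb/206Pb" = 2.1681)
unc <- c(0.000037, 0.00033, 0.0008) / 2
names(unc) <- names(value)

ref    <- "207Pb/206Pb"              # ratio that carries the new reference isotope
others <- setdiff(names(value), ref)
rel    <- unc / value                # relative standard uncertainties

# new ratios with 207Pb as denominator
new_value <- c(value[others] / value[ref], 1 / value[ref])
names(new_value) <- c("204Pb/207Pb", "208Pb/207Pb", "206Pb/207Pb")

# first-order propagation for a quotient, assuming independent inputs;
# the reciprocal keeps the relative uncertainty of the original ratio
new_rel <- c(sqrt(rel[others]^2 + rel[ref]^2), rel[ref])
new_unc <- new_value * unname(new_rel)

data.frame(Outcome = names(new_value), Value = new_value, Unc = new_unc,
           row.names = NULL)
```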

First two values in .Random.seed are always the same with different set.seed()s

Preamble
I've looked through other questions (1, 2, 3) describing the use and function of set.seed() and .Random.seed and can't find this particular issue documented so here it is as a question:
Initial Observation
When I inspect the .Random.seeds generated as a result of set.seed(1) and set.seed(2), I find that the first two elements are always the same (10403 & 624) while the rest appears not to be. See example below.
My questions
Is that expected?
Why does it happen?
Will this have any untoward consequences for any random simulation I might do based on it?
Reproducible Example
f <- function(s1, s2){
set.seed(s1)
r1 <- .Random.seed
set.seed(s2)
r2 <- .Random.seed
print(r1[1:3])
print(r2[1:3])
plot(r1, r2)
}
f(1, 2)
#> [1] 10403 624 -169270483
#> [1] 10403 624 -1619336578
Created on 2022-01-04 by the reprex package (v2.0.1)
Note that the first two elements of each .Random.seed are identical but the remainder is not. You can see in the scatterplot that it's just a random cloud as expected.

Expanding helpful comments from @r2evans and @Dave2e into an answer.
1) .Random.seed[1]
From ?.Random.seed, it says:
".Random.seed is an integer vector whose first element codes the
kind of RNG and normal generator. The lowest two decimal digits are in
0:(k-1) where k is the number of available RNGs. The hundreds
represent the type of normal generator (starting at 0), and the ten
thousands represent the type of discrete uniform sampler."
Therefore the first value doesn't change unless one changes the generator method (RNGkind).
Here is a small demonstration of this for each of the available RNGkinds:
library(tidyverse)
# available RNGkind options
kinds <- c(
"Wichmann-Hill",
"Marsaglia-Multicarry",
"Super-Duper",
"Mersenne-Twister",
"Knuth-TAOCP-2002",
"Knuth-TAOCP",
"L'Ecuyer-CMRG"
)
# test over multiple seeds
seeds <- c(1:3)
f <- function(kind, seed) {
# set seed with simulation parameters
set.seed(seed = seed, kind = kind)
# check value of first element in .Random.seed
return(.Random.seed[1])
}
# run on simulated conditions and compare value over different seeds
expand_grid(kind = kinds, seed = seeds) %>%
pmap(f) %>%
unlist() %>%
matrix(
ncol = length(seeds),
byrow = TRUE,
dimnames = list(kinds, paste0("seed_", seeds))
)
#> seed_1 seed_2 seed_3
#> Wichmann-Hill 10400 10400 10400
#> Marsaglia-Multicarry 10401 10401 10401
#> Super-Duper 10402 10402 10402
#> Mersenne-Twister 10403 10403 10403
#> Knuth-TAOCP-2002 10406 10406 10406
#> Knuth-TAOCP 10404 10404 10404
#> L'Ecuyer-CMRG 10407 10407 10407
Created on 2022-01-06 by the reprex package (v2.0.1)
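The encoding quoted from ?.Random.seed can also be unpacked directly with integer arithmetic. The sketch below splits the first element into its three parts; the settings are passed explicitly so the result does not depend on whatever RNG configuration is currently active. The variable names are mine.

```r
# fix all three settings explicitly, then read the first element of the seed
set.seed(1, kind = "Mersenne-Twister",
         normal.kind = "Inversion", sample.kind = "Rejection")
code <- .Random.seed[1]

rng_kind    <- code %% 100            # lowest two decimal digits: RNG kind
normal_kind <- (code %/% 100) %% 100  # hundreds: normal generator
sample_kind <- code %/% 10000         # ten thousands: discrete uniform sampler

c(code = code, rng = rng_kind, normal = normal_kind, sample = sample_kind)

# the human-readable names of the same three settings
RNGkind()
```

With these settings the code is 10403, which splits into 3, 4 and 1. Since none of the three parts depends on the seed value, the first element is the same for set.seed(1) and set.seed(2).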
2) .Random.seed[2]
At least for the default "Mersenne-Twister" method, .Random.seed[2] is an index that indicates the current position in the random set. From the docs:
The ‘seed’ is a 624-dimensional set of 32-bit integers plus a current
position in that set.
This is updated when random processes using the seed are executed. However, for the other methods the documentation doesn't mention anything like this, and there doesn't appear to be a clear trend in the same way.
See below for an example of changes in .Random.seed[2] over iterative random process after set.seed().
library(tidyverse)
# available RNGkind options
kinds <- c(
"Wichmann-Hill",
"Marsaglia-Multicarry",
"Super-Duper",
"Mersenne-Twister",
"Knuth-TAOCP-2002",
"Knuth-TAOCP",
"L'Ecuyer-CMRG"
)
# create function to run random process and report .Random.seed[2]
t <- function(n = 1) {
p <- .Random.seed[2]
runif(n)
p
}
# create function to set seed and iterate a random process
f2 <- function(kind, seed = 1, n = 5) {
set.seed(seed = seed,
kind = kind)
replicate(n, t())
}
# set simulation parameters
trials <- 5
seeds <- 1:2
x <- expand_grid(kind = kinds, seed = seeds, n = trials)
# evaluate and report
x %>%
pmap_dfc(f2) %>%
mutate(n = paste0("trial_", 1:trials)) %>%
pivot_longer(-n, names_to = "row") %>%
pivot_wider(names_from = "n") %>%
select(-row) %>%
bind_cols(x[,1:2], .)
#> # A tibble: 14 x 7
#> kind seed trial_1 trial_2 trial_3 trial_4 trial_5
#> <chr> <int> <int> <int> <int> <int> <int>
#> 1 Wichmann-Hill 1 23415 8457 23504 2.37e4 2.28e4
#> 2 Wichmann-Hill 2 21758 27800 1567 2.58e4 2.37e4
#> 3 Marsaglia-Multicarry 1 1280795612 945095059 14912928 1.34e9 2.23e8
#> 4 Marsaglia-Multicarry 2 -897583247 -1953114152 2042794797 1.39e9 3.71e8
#> 5 Super-Duper 1 1280795612 -1162609806 -1499951595 5.51e8 6.35e8
#> 6 Super-Duper 2 -897583247 224551822 -624310 -2.23e8 8.91e8
#> 7 Mersenne-Twister 1 624 1 2 3 4
#> 8 Mersenne-Twister 2 624 1 2 3 4
#> 9 Knuth-TAOCP-2002 1 166645457 504833754 504833754 5.05e8 5.05e8
#> 10 Knuth-TAOCP-2002 2 967462395 252695483 252695483 2.53e8 2.53e8
#> 11 Knuth-TAOCP 1 1050415712 999978161 999978161 1.00e9 1.00e9
#> 12 Knuth-TAOCP 2 204052929 776729829 776729829 7.77e8 7.77e8
#> 13 L'Ecuyer-CMRG 1 1280795612 -169270483 -442010614 4.71e8 1.80e9
#> 14 L'Ecuyer-CMRG 2 -897583247 -1619336578 -714750745 2.10e9 -9.89e8
Created on 2022-01-06 by the reprex package (v2.0.1)
Here you can see that for the Mersenne-Twister method, .Random.seed[2] wraps from its maximum of 624 back to 1 and then increases by the size of each random draw, and that this is the same for set.seed(1) and set.seed(2). However, the same trend is not seen in the other methods. To illustrate the increment, see that runif(1) increments .Random.seed[2] by 1 while runif(2) increments it by 2:
# create function to run random process and report .Random.seed[2]
t <- function(n = 1) {
p <- .Random.seed[2]
runif(n)
p
}
set.seed(1, kind = "Mersenne-Twister")
replicate(9, t(1))
#> [1] 624 1 2 3 4 5 6 7 8
set.seed(1, kind = "Mersenne-Twister")
replicate(5, t(2))
#> [1] 624 2 4 6 8
Created on 2022-01-06 by the reprex package (v2.0.1)
3) Sequential Randoms
Because the index or state of .Random.seed (apparently for all the RNG methods) advances according to the size of the 'random draw' (the number of random values generated from the .Random.seed), it is possible to generate the same series of random numbers from the same seed in different sized increments. Furthermore, as long as you run the same random process at the same point in the sequence after setting the same seed, it seems that you will get the same result. Observe the following example:
# draw 3 at once
set.seed(1, kind = "Mersenne-Twister")
sample(100, 3, T)
#> [1] 68 39 1
# repeat single draw 3 times
set.seed(1, kind = "Mersenne-Twister")
sample(100, 1)
#> [1] 68
sample(100, 1)
#> [1] 39
sample(100, 1)
#> [1] 1
# draw 1, do something else, draw 1 again
set.seed(1, kind = "Mersenne-Twister")
sample(100, 1)
#> [1] 68
runif(1)
#> [1] 0.5728534
sample(100, 1)
#> [1] 1
Created on 2022-01-06 by the reprex package (v2.0.1)
4) Correlated Randoms
As we saw above, two random processes run at the same point after setting the same seed are expected to give the same result. However, even when you provide constraints on how similar the result can be (e.g. by changing the mean of rnorm() or even by providing different functions) it seems that the results are still perfectly correlated within their respective ranges.
# same function with different constraints
set.seed(1, kind = "Mersenne-Twister")
a <- runif(50, 0, 1)
set.seed(1, kind = "Mersenne-Twister")
b <- runif(50, 10, 100)
plot(a, b)
# different functions
set.seed(1, kind = "Mersenne-Twister")
d <- rnorm(50)
set.seed(1, kind = "Mersenne-Twister")
e <- rlnorm(50)
plot(d, e)
Created on 2022-01-06 by the reprex package (v2.0.1)
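One way to read these two plots, offered as an interpretation rather than something established above: after the same seed, both calls consume the same underlying stream, and the second result is just a deterministic transformation of the first. runif(n, min, max) rescales a standard uniform draw, and rlnorm() exponentiates a normal draw. The sketch below checks both relationships numerically; the two flag names are made up.

```r
# same uniform stream, different range: b should equal 10 + 90 * a
set.seed(1, kind = "Mersenne-Twister")
a <- runif(50, 0, 1)
set.seed(1, kind = "Mersenne-Twister")
b <- runif(50, 10, 100)
same_uniform_stream <- isTRUE(all.equal(b, 10 + (100 - 10) * a))

# same normal stream, different function: e should equal exp(d)
set.seed(1, kind = "Mersenne-Twister")
d <- rnorm(50)
set.seed(1, kind = "Mersenne-Twister")
e <- rlnorm(50)
same_normal_stream <- isTRUE(all.equal(e, exp(d)))

c(uniform = same_uniform_stream, lognormal = same_normal_stream)
```

This matters for simulations only when the seed is reset between draws. Setting the seed once at the start and letting the stream advance avoids this kind of dependence between draws.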

h2o SHAP values / predict_contributions for cross validation

I've looked into the h2o.predict_contributions function that exposes the Shap values from xgb and gbm models. Does this function also provide these metrics from cross validation predictions? I can't seem to find them.
library(h2o)
library(mlbench)
h2o.init()
data(Sonar)
Sonar.h2o = as.h2o(Sonar)
# pass the H2OFrame, not the R data.frame, as the training frame
mdl = h2o.xgboost(x=names(Sonar), y='Class', training_frame = Sonar.h2o, nfolds=5, keep_cross_validation_predictions = TRUE)

Yes, you can apply the function to a single fold of interest. Here is some example code:
library(h2o)
h2o.init()
prostate_path <- system.file("extdata", "prostate.csv", package = "h2o")
prostate <- h2o.uploadFile(path = prostate_path)
prostate_gbm <- h2o.gbm(3:9, "AGE", prostate, nfolds = 3)
h2o.predict(prostate_gbm, prostate)
h2o.predict_contributions(prostate_gbm, prostate)
# take a look at the output to see which key you want to use
# there are also other options to key names
prostate_gbm@model$cross_validation_models
# update this with the key of interest
key = 'GBM_model_R_1557326910287_7702_cv_2'
cv2 = h2o.getModel(key)
h2o.predict_contributions(cv2, prostate)
# RACE DPROS DCAPS PSA VOL GLEASON __internal_cv_weights__ BiasTerm
# 1 -0.006481315 -0.19211742 -0.0836791 -0.06186131 -0.9217098 -0.20128664 0 66.37209
# 2 -0.005238285 -1.09128833 0.9614767 -0.95340544 -0.7698430 0.06820074 0 66.37209
# 3 -0.006481315 0.98101193 0.1770813 1.21195042 -1.0359415 -0.23213011 0 66.37209
# 4 0.069538474 -0.01738315 -0.2000238 4.11799049 0.1177490 -0.01457024 0 66.37209
# 5 0.012923095 0.40362182 -0.1132747 1.21669090 0.9920316 -0.37245926 0 66.37209
# 6 -0.002282504 -0.91798097 0.9024866 -0.17398398 -0.6048008 0.42300656 0 66.37209
Note: you can ignore the __internal_cv_weights__ column. I've created a ticket to clean up the output that can be tracked here.
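Instead of copying a model key by hand, the fold models can probably be collected programmatically. The sketch below assumes that h2o.cross_validation_models() returns the list of fold models when keep_cross_validation_models is TRUE, and then applies h2o.predict_contributions() to each. It needs a running H2O cluster, and cv_models and cv_contribs are illustrative names.

```r
library(h2o)
h2o.init()

prostate_path <- system.file("extdata", "prostate.csv", package = "h2o")
prostate <- h2o.uploadFile(path = prostate_path)

# keep the per-fold models so they can be retrieved afterwards
prostate_gbm <- h2o.gbm(x = 3:9, y = "AGE", training_frame = prostate,
                        nfolds = 3, keep_cross_validation_models = TRUE,
                        seed = 1)

# one H2O model object per fold
cv_models <- h2o.cross_validation_models(prostate_gbm)

# SHAP contributions from each fold model, one H2OFrame per fold
cv_contribs <- lapply(cv_models, function(m) {
  h2o.predict_contributions(m, newdata = prostate)
})

length(cv_contribs)
```

Each fold model here is scored on the full frame. To get contributions for held-out rows only, the frame would first have to be restricted to that fold's holdout rows.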
