Bootstrapping weighted functions in Julia

I am trying to use Bootstrap.jl functions to obtain the Standard Error (SE) of a weighted function (e.g. a weighted median).
See below the Bootstrap.bootstrap code to obtain the SE of an unweighted median.
using StatsBase, DataFrames, Bootstrap
v = collect(1:1:20)
bootstrap(median, v, BasicSampling(100))
I would now need to pass a second argument to median above to obtain the SE of the weighted median. Outside of the bootstrap function, this looks like:
w = collect(0.1:0.1:2)
median(v, Weights(w))
How can I pass a second argument to the median function inside bootstrap to include the weights? Note that the bootstrap resampling must be applied to both vectors, drawing the same indices for each of them.

You can pass a DataFrame containing both vectors as the second argument of bootstrap. Because the rows are resampled together, each value stays paired with its weight. Then write an anonymous function that uses each column inside median. For example:
df = DataFrame(v = collect(1:1:20),
               w = collect(0.1:0.1:2))
bootstrap(d -> median(d[!,:v], Weights(d[!,:w])), df, BasicSampling(100))
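To make the paired-resampling idea concrete, here is a minimal language-neutral sketch in plain Python (standard library only). It is not how Bootstrap.jl is implemented. The helper names `weighted_median` and `bootstrap_se` are hypothetical, and the weighted median used here (the smallest value whose cumulative weight reaches half the total) is a simple definition assumed for illustration that may differ from the one StatsBase uses.

```python
import random
import statistics


def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half of the total weight."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return value
    return pairs[-1][0]


def bootstrap_se(values, weights, n_boot=1000, seed=0):
    """Standard error of the weighted median from paired resampling."""
    rng = random.Random(seed)
    n = len(values)
    estimates = []
    for _ in range(n_boot):
        # Draw ONE set of indices and apply it to both vectors,
        # so every value keeps its own weight.
        idx = [rng.randrange(n) for _ in range(n)]
        resampled_values = [values[i] for i in idx]
        resampled_weights = [weights[i] for i in idx]
        estimates.append(weighted_median(resampled_values, resampled_weights))
    # The bootstrap SE is the standard deviation of the replicate estimates.
    return statistics.stdev(estimates)


v = list(range(1, 21))
w = [0.1 * k for k in range(1, 21)]
se = bootstrap_se(v, w)
print(se)
```

Resampling rows of a table that holds both columns, as in the DataFrame answer above, achieves the same pairing without managing indices by hand.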

Related

Julia function for weighted variance returning "wrong" value

I'm trying to calculate the weighted variance using Julia, but when I compare the results
with my own formula, I get a different value.
x = rand(10)
w = Weights(rand(10))
Statistics.var(x,w,corrected=false) #Julia's default function
sum(w.*(x.-mean(x)).^2)/sum(w) #my own formula
When I read the docs for the "var" function, it says that the formula for "corrected=false" is
the one I wrote.
You have to subtract a weighted mean in your formula to get the same result:
sum(w.*(x.-mean(x,w)).^2)/sum(w)
or (to expand it)
sum(w.*(x.- sum(w.*x)/sum(w)).^2)/sum(w)
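As a small numeric check of the point above, the following Python sketch (standard library only, with made-up example numbers) computes the weighted variance once around the weighted mean and once around the plain mean. The function names are hypothetical and only illustrate the formula. Centering on the plain mean gives a larger value whenever the two means differ.

```python
def weighted_mean(x, w):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(wi * xi for xi, wi in zip(x, w)) / sum(w)


def weighted_var(x, w, center):
    """Uncorrected weighted variance of x around the given center."""
    return sum(wi * (xi - center) ** 2 for xi, wi in zip(x, w)) / sum(w)


x = [1.0, 2.0, 4.0]
w = [1.0, 1.0, 2.0]

plain_mean = sum(x) / len(x)      # ignores the weights
w_mean = weighted_mean(x, w)      # (1 + 2 + 8) / 4 = 2.75

var_about_weighted = weighted_var(x, w, w_mean)    # centered on the weighted mean
var_about_plain = weighted_var(x, w, plain_mean)   # what the question's formula computes

print(var_about_weighted, var_about_plain)
```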

Coding weighted mean (R)

I am having trouble with a piece of my code. I want to compute a weighted mean, but the value I get is not the value I obtain if I calculate the weighted mean myself.
Here's how I'm coding the weighted mean:
weighted.mean(x = dataset$A[rows], weights = weights)
The variable is "dataset$A" and the rows I'm using for the weighted mean are listed in "rows" (there are 2 rows). The weights are listed in "weights."
Here's how I'm calculating it myself:
dataset$A_MEAN[rows[1]]*weights[1] + dataset$A_MEAN[rows[2]]*weights[2]
Why is there a difference between these two lines of code?
I tried with the following values:
dataset$A = [45792.76, 64984.67]
weights = [0.3253927, 0.6746073]
The first line of code returns: 55388.71
The second line of code returns: 58739.76
Thank you so much! I am sure that this is something minor, but it's driving me nuts!
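Check your use of weighted.mean.
The argument weights should be named w. Because weighted.mean has no argument called weights, the value you pass is absorbed by ... and ignored, so all observations get equal weight. That is why the first line returns 55388.71, which is the plain mean of the two values.
weighted.mean(x = dataset$A[rows], w = weights) should give you what you want.
When calling a function, you can make sure that you're using the correct argument names by reading the function's documentation with ?weighted.mean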
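The arithmetic can be checked with the numbers from the question. This is a plain Python sketch (not R), written only to show that the first reported result is the unweighted mean and the second is the weighted mean:

```python
values = [45792.76, 64984.67]
weights = [0.3253927, 0.6746073]

# Plain mean: what you get when the weights are silently ignored.
unweighted = sum(values) / len(values)

# Weighted mean: sum(w_i * x_i) / sum(w_i).
weighted = sum(v * w for v, w in zip(values, weights)) / sum(weights)

print(round(unweighted, 2), round(weighted, 2))
```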

Does the boot package in R use the first return(result) as the observed data to calculate confidence intervals?

I am using the function boot in R to do a bootstrap, but instead of passing my dataset directly as the data parameter of boot, I pass an index that is used inside the statistic to merge two data tables and get my result. It seems that boot uses the result of the first bootstrap as the real sampled data (that is, the empirical value). Is this correct? I ask because I get similar results when I do the bootstrap manually, although I would expect boot to use 'data' as the original data. I am confused: the CIs make sense, but I would expect this not to work, unless it is for the reason I have mentioned.
In short, I have an index vector
x=1:100
and my function
myboot <- function(data,indeces) {
  toselect <- data[indeces] # allows boot to select sample
  toselect=as.data.table(toselect)
  #this is where I use the index for the merge
  t=merge(toselect,mydataset,allow.cartesian=TRUE)
  return(nrow(t))
}
b <- boot(data=x, statistic=myboot, R=1000)
The results I get
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = x, statistic = myboot, R = 1000)
Bootstrap Statistics :
    original        bias    std. error
t1* 397.2477 -0.03669725      11.70803
> boot.ci(b, type="bca")
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL :
boot.ci(boot.out = b, type = "bca")
Intervals :
Level BCa
95% (375.2, 421.1 )
Yes, you are correct.
The function used to compute the statistic has the following requirement (according to the help page):
... In all other cases statistic must take at least two arguments. The first argument passed will always be the original data. The second will be a vector of indices, frequencies or weights which define the bootstrap sample. Further, if predictions are required, then a third argument is required which would be a vector of the random indices used to generate the bootstrap predictions.
Since your dataset consists of the numbers 1:100, the second argument passed to your function is a sample drawn from 1:100, so your data[indeces] line returns exactly the same values as indeces itself.
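The calling convention described in the help page can be sketched as follows. This is a simplified Python illustration, not the actual source of boot. The names `boot_sketch` and `mean_stat` are hypothetical. It assumes, as I understand boot's behavior for index-based statistics, that the observed ("original") value comes from calling the statistic once with the unshuffled indices, and each replicate from calling it with indices drawn with replacement.

```python
import random


def boot_sketch(data, statistic, n_rep, seed=0):
    """Return the observed statistic and a list of bootstrap replicates."""
    rng = random.Random(seed)
    n = len(data)
    # Observed value: the statistic evaluated on the data in original order.
    observed = statistic(data, list(range(n)))
    # Replicates: the statistic evaluated on indices sampled with replacement.
    replicates = [
        statistic(data, [rng.randrange(n) for _ in range(n)])
        for _ in range(n_rep)
    ]
    return observed, replicates


def mean_stat(data, indices):
    """Example statistic: the mean of the selected observations."""
    selected = [data[i] for i in indices]
    return sum(selected) / len(selected)


data = list(range(1, 101))  # plays the role of x = 1:100 in the question
observed, reps = boot_sketch(data, mean_stat, 1000)
print(observed, len(reps))
```

Under this convention the statistic always receives the full data plus a set of indices, so using the data vector itself as a lookup key for a merge, as in the question, still yields an observed value computed from every row.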

How to extract saved envelope values in Spatstat?

I am new to both R & spatstat and am working with the inhomogeneous pair correlation function. My dataset consists of point values spread across several time intervals.
sp77.ppp = ppp(sp77.dat$Plot_X, sp77.dat$Plot_Y, window = window77, marks = sp77.dat$STATUS)
Dvall77 = envelope((Y=dv77.ppp[dv77.ppp$marks=='2']),fun=pcfinhom, r=seq(0,20,0.25), nsim=999,divisor = 'd', simulate=expression((rlabel(dv77.ppp)[rlabel(dv77.ppp)$marks=='1']),(rlabel(dv77.ppp)[rlabel(dv77.ppp)$marks=='2'])), savepatterns = T, savefuns = T)
I am trying to make multiple pairwise comparisons (between different time periods) and need to write a function that goes through every calculated envelope value, at each ‘r’ value, and finds the min and max differences between the envelopes.
My question is: How do I find the saved envelope values? I know that the savefuns = T is saving all the simulated envelope values but I can’t find how to extract the values. The summary (below) says that the values are stored. How do I call the values and extract them?
> print(Dvall77)
Pointwise critical envelopes for g[inhom](r)
and observed value for ‘(Y = dv77.ppp[dv77.ppp$marks == "2"])’
Edge correction: “iso”
Obtained from 999 evaluations of user-supplied expression
(All simulated function values are stored)
(All simulated point patterns are stored)
Alternative: two.sided
Significance level of pointwise Monte Carlo test: 2/1000 = 0.002
.......................................................................................
      Math.label                Description
r     r                         distance argument r
obs   {hat(g)[inhom]^{obs}}(r)  observed value of g[inhom](r) for data pattern
mmean {bar(g)[inhom]}(r)        sample mean of g[inhom](r) from simulations
lo    {hat(g)[inhom]^{lo}}(r)   lower pointwise envelope of g[inhom](r) from simulations
hi    {hat(g)[inhom]^{hi}}(r)   upper pointwise envelope of g[inhom](r) from simulations
.......................................................................................
Default plot formula: .~r
where “.” stands for ‘obs’, ‘mmean’, ‘hi’, ‘lo’
Columns ‘lo’ and ‘hi’ will be plotted as shading (by default)
Recommended range of argument r: [0, 20]
Available range of argument r: [0, 20]
Thanks in advance for any suggestions!
If you are looking to access the values of the summary statistic (ginhom) for each of the randomly labelled patterns, this is in principle documented in help(envelope.ppp). Admittedly the help file is long, and if you are new to both R and spatstat it is easy to get lost. The clue is in the Value section of the help file. The result is a data.frame with some additional classes (envelope and fv), and as the help file says:
Additionally, if ‘savepatterns=TRUE’, the return value has an
attribute ‘"simpatterns"’ which is a list containing the ‘nsim’
simulated patterns. If ‘savefuns=TRUE’, the return value has an
attribute ‘"simfuns"’ which is an object of class ‘"fv"’
containing the summary functions computed for each of the ‘nsim’
simulated patterns.
Then of course you need to know how to access an attribute in R, which is done using attr:
funs <- attr(Dvall77, "simfuns")
Then funs is a data.frame (and fv-object) with all the function values for each randomly labelled pattern.
I can't really understand from your question whether you just need the values of the upper and lower curve defining the envelope? In that case you just access them like an ordinary data.frame (and there is no need to save all the individual function values in the envelope):
lo <- Dvall77$lo
hi <- Dvall77$hi
d <- hi - lo
More elegantly you can do:
d <- with(Dvall77, hi - lo)

How to implement variance function in R

I am trying to calculate the variance of a column from a data frame. I know that there is a built-in function var() for calculating the variance, but I am not sure how to write my own variance function that takes a data frame column as its argument.
var(banknote$Length)*((n-1)/n)
If the vector you're going to take the variance of is 1-dimensional, as in your case, you can simply do:
myvar = function(v) {
  m = mean(v)
  mean((m - v)^2)
}
This assumes (based on your example) that you don't want to use the n/(n-1) correction.
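The relationship between the two versions of the variance can be checked numerically. The sketch below uses Python's standard library and made-up example data. `my_var` is a hypothetical name mirroring the function above. It divides by n, whereas the sample variance divides by n - 1, so the two differ by the factor (n - 1)/n.

```python
import statistics


def my_var(v):
    """Population variance: mean squared deviation from the mean (divides by n)."""
    m = sum(v) / len(v)
    return sum((x - m) ** 2 for x in v) / len(v)


v = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(v)

population = my_var(v)                # divides by n
sample = statistics.variance(v)       # divides by n - 1
rescaled = sample * (n - 1) / n       # same rescaling as var(x) * ((n-1)/n)

print(population, sample, rescaled)
```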
