Coding weighted mean (R)

I am having trouble with a piece of my code. I want to compute a weighted mean, but the value I get is not the value I obtain if I calculate the weighted mean myself.
Here's how I'm coding the weighted mean:
weighted.mean(x = dataset$A[rows], weights = weights)
The variable is "dataset$A" and the rows I'm using for the weighted mean are listed in "rows" (there are 2 rows). The weights are listed in "weights."
Here's how I'm calculating it myself:
dataset$A_MEAN[rows[1]]*weights[1] + dataset$A_MEAN[rows[2]]*weights[2]
Why is there a difference with these two lines of code?
I tried with the following values:
dataset$A = [45792.76, 64984.67]
weights = [0.3253927, 0.6746073]
The first line of code returns: 55388.71
The second line of code returns: 58739.76
Thank you so much! I am sure that this is something minor, but it's driving me nuts!

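Check your use of weighted.mean.
The argument for the weights is named w, not weights. An argument called weights is silently absorbed by ... and ignored, which is why your first line returns the unweighted mean (55388.71).
weighted.mean(x = dataset$A[rows], w = weights) should give you what you want.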
When calling a function, you can make sure that you're using the correct variable names by reading the function's documentation with ?weighted.mean
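As a quick check, the sketch below reuses the two values and weights from the question (x stands in for dataset$A[rows]) and compares the misnamed call, the corrected call, and the manual sum:

```r
# Values taken from the question; x stands in for dataset$A[rows]
x <- c(45792.76, 64984.67)
weights <- c(0.3253927, 0.6746073)

# 'weights' is not a formal argument, so it falls into '...' and is ignored
unweighted <- weighted.mean(x = x, weights = weights)

# 'w' is the actual argument name for the weights
weighted <- weighted.mean(x = x, w = weights)

# Manual calculation; dividing by sum(weights) also covers weights that do not sum to 1
manual <- sum(x * weights) / sum(weights)

print(c(unweighted = unweighted, weighted = weighted, manual = manual))
```

The first entry equals mean(x), while the other two agree with each other.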

Related

Julia function for weighted variance returning "wrong" value

I'm trying to calculate the weighted variance using Julia, but when I compare the results
with my own formula, I get a different value.
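using Statistics, StatsBase  # Weights and the weighted var method come from StatsBase
x = rand(10)
w = Weights(rand(10))
Statistics.var(x,w,corrected=false) # Julia's built-in weighted variance
sum(w.*(x.-mean(x)).^2)/sum(w) # my own formula (centers on the unweighted mean)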
When I read the docs for the "var" function, it says that the formula for "corrected=false" is
the one I wrote.
The corrected=false formula centers on the weighted mean, not the plain mean, so you have to subtract the weighted mean in your formula to get the same result:
sum(w.*(x.-mean(x,w)).^2)/sum(w)
or (to expand it)
sum(w.*(x.- sum(w.*x)/sum(w)).^2)/sum(w)
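Written out in standard notation, the two expressions above correspond to the following definitions. This is only a sketch; the symbols \mu_w and \sigma_w^2 are labels chosen here, not names taken from the Julia documentation:

```latex
\mu_w = \frac{\sum_{i} w_i x_i}{\sum_{i} w_i},
\qquad
\sigma_w^2 = \frac{\sum_{i} w_i \,(x_i - \mu_w)^2}{\sum_{i} w_i}
```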

Calculating the MSE by definition vs. Var - Bias

So I am trying to calculate the MSE in two ways.
Say T is an estimator for the value t.
First I am trying to calculate it in R by using the theorem:
MSE(T) = Var(T) + (Bias(T))^2
Secondly, I am trying to calculate it in R by definition, i.e. MSE(T) = E((T-t)^2).
And say that T is an unbiased estimator, i.e. Bias(T) = 0
So MSE(T) = Var(T), which we can compute in R with var(T).
But when I try calculating the MSE by definition I get a different number from Var(T)...
I think the formula I wrote in R is wrong. This is what I wrote for the MSE by definition:
It was suggested that "weighted.mean" is equivalent to the "expected value" function.
So I wrote: weighted.mean( (T - 2)^2) where my t = 2.
I hope I provided enough information to get help, thanks in advance.
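One possible explanation, sketched under the assumption that T is a vector of repeated estimates (simulated here as sample means; T_hat, t_true and n_rep are illustrative names, not from the question): weighted.mean() without weights is just mean(), so the definition gives mean((T - t)^2). var() divides by n - 1 and centers on mean(T) rather than on t, so the two numbers differ slightly in any finite sample. The decomposition holds exactly only with the n-denominator variance and the estimated bias:

```r
set.seed(1)
t_true <- 2     # the true value t from the question
n_rep <- 1000   # number of simulated estimates (arbitrary choice)

# Each element is one estimate: the mean of 20 draws from N(2, 1)
T_hat <- replicate(n_rep, mean(rnorm(20, mean = t_true, sd = 1)))

mse_def <- mean((T_hat - t_true)^2)        # MSE by definition
bias_hat <- mean(T_hat) - t_true           # estimated bias (close to, but not exactly, 0)
var_pop <- mean((T_hat - mean(T_hat))^2)   # variance with denominator n
var_samp <- var(T_hat)                     # variance with denominator n - 1

# Exact identity: MSE = Var (denominator n) + Bias^2
mse_decomp <- var_pop + bias_hat^2

print(c(mse_def = mse_def, mse_decomp = mse_decomp, var_samp = var_samp))
```

Here mse_def and mse_decomp agree to floating-point precision, while var_samp is close but not identical.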

Interpolation function approx() in r gives error - need at least two non-NA values to interpolate

I am using RStudio on a Windows 8 machine. I am trying to interpolate a point between two points.
x1 = -159.9, y1 = 56.5,
x2 = -159.9, y2 = 56.3
I am using the approx() function in the following manner (reproducible):
approx(c(-159.9,-159.9), c(56.5,56.3), n = 3)
which gives me an error
Error in approx(c(-159.9, -159.9), c(56.5, 56.3), n = 3) :
need at least two non-NA values to interpolate
It's expecting two non-NA values, which I have provided.
The function works flawlessly for other points; only this pair is a problem.
If you have come across this error, please let me know how you solved it.
From the Details of ?approx():
The inputs can contain missing values which are deleted, so at least
two complete (x, y) pairs are required (for method = "linear", one
otherwise). If there are duplicated (tied) x values and ties is a
function it is applied to the y values for each distinct x value.
The approx function can't interpolate values where the x-coordinates are the same.
Hence, I would tackle this problem as follows:
Group the cases with equal x-coordinates and aggregate their y values, for example with the median, the mean, or a custom function.
Use your intended interpolation scheme, for example the approx function.
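The two steps above can be sketched as follows. The points are made up for illustration and pts, agg and res are hypothetical names. Note that approx() already applies ties = mean by default, which is why two points sharing a single x value collapse into one point and trigger the error in the question:

```r
# Hypothetical points; x = 1 appears twice with different y values
pts <- data.frame(x = c(1, 1, 2, 3), y = c(10, 12, 20, 30))

# Step 1: collapse tied x values, here with the mean of their y values
agg <- aggregate(y ~ x, data = pts, FUN = mean)

# Step 2: interpolate on the de-duplicated points
res <- approx(agg$x, agg$y, n = 5)
print(res)
```

Swapping FUN = mean for median or a custom function changes how the tied values are combined; the interpolation step stays the same.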

How to extract saved envelope values in Spatstat?

I am new to both R & spatstat and am working with the inhomogeneous pair correlation function. My dataset consists of point values spread across several time intervals.
sp77.ppp = ppp(sp77.dat$Plot_X, sp77.dat$Plot_Y, window = window77, marks = sp77.dat$STATUS)
Dvall77 = envelope((Y = dv77.ppp[dv77.ppp$marks == '2']), fun = pcfinhom,
                   r = seq(0, 20, 0.25), nsim = 999, divisor = 'd',
                   simulate = expression((rlabel(dv77.ppp)[rlabel(dv77.ppp)$marks == '1']),
                                         (rlabel(dv77.ppp)[rlabel(dv77.ppp)$marks == '2'])),
                   savepatterns = TRUE, savefuns = TRUE)
I am trying to compare multiple pairwise comparisons (from different time periods) and need to write a function that goes through every calculated envelope value, at each 'r' value, and finds the min and max differences between the envelopes.
My question is: How do I find the saved envelope values? I know that savefuns = T saves all the simulated envelope values, but I can't find how to extract them. The summary (below) says that the values are stored. How do I access and extract them?
> print(Dvall77)
Pointwise critical envelopes for g[inhom](r)
and observed value for ‘(Y = dv77.ppp[dv77.ppp$marks == "2"])’
Edge correction: “iso”
Obtained from 999 evaluations of user-supplied expression
(All simulated function values are stored)
(All simulated point patterns are stored)
Alternative: two.sided
Significance level of pointwise Monte Carlo test: 2/1000 = 0.002
.......................................................................................
       Math.label                Description
r      r                         distance argument r
obs    {hat(g)[inhom]^{obs}}(r)  observed value of g[inhom](r) for data pattern
mmean  {bar(g)[inhom]}(r)        sample mean of g[inhom](r) from simulations
lo     {hat(g)[inhom]^{lo}}(r)   lower pointwise envelope of g[inhom](r) from simulations
hi     {hat(g)[inhom]^{hi}}(r)   upper pointwise envelope of g[inhom](r) from simulations
.......................................................................................
Default plot formula: .~r
where “.” stands for ‘obs’, ‘mmean’, ‘hi’, ‘lo’
Columns ‘lo’ and ‘hi’ will be plotted as shading (by default)
Recommended range of argument r: [0, 20]
Available range of argument r: [0, 20]
Thanks in advance for any suggestions!
If you are looking to access the values of the summary statistic (ginhom) for each of the randomly labelled patterns, this is in principle documented in help(envelope.ppp). Admittedly the help file is long, and if you are new to both R and spatstat it is easy to get lost. The clue is in the Value section of the help file. The result is a data.frame with some additional classes (envelope and fv), and as the help file says:
Additionally, if ‘savepatterns=TRUE’, the return value has an
attribute ‘"simpatterns"’ which is a list containing the ‘nsim’
simulated patterns. If ‘savefuns=TRUE’, the return value has an
attribute ‘"simfuns"’ which is an object of class ‘"fv"’
containing the summary functions computed for each of the ‘nsim’
simulated patterns.
Then of course you need to know how to access an attribute in R, which is done using attr:
funs <- attr(Dvall77, "simfuns")
Then funs is a data.frame (and fv-object) with all the function values for each randomly labelled pattern.
I can't really understand from your question whether you just need the values of the upper and lower curve defining the envelope? In that case you just access them like an ordinary data.frame (and there is no need to save all the individual function values in the envelope):
lo <- Dvall77$lo
hi <- Dvall77$hi
d <- hi - lo
More elegantly you can do:
d <- with(Dvall77, hi - lo)
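To illustrate the attr() approach without needing spatstat installed, here is a minimal sketch on a stand-in object. The data.frame env, its numbers, and the column names sim1 and sim2 are all made up, and the one-column-per-simulation layout of "simfuns" is an assumption. For a real fv object, converting with as.data.frame() first is probably the safer route:

```r
# Stand-in for an envelope result: a plain data.frame, NOT a real spatstat object
r <- seq(0, 1, by = 0.25)
env <- data.frame(r = r, lo = r, hi = r + 2)

# Assumed layout: the r column plus one column per simulated pattern
attr(env, "simfuns") <- data.frame(r = r, sim1 = r + 0.5, sim2 = r + 1.5)

# Pull the stored simulated function values out of the attribute
funs <- as.data.frame(attr(env, "simfuns"))
sim_values <- as.matrix(funs[, names(funs) != "r", drop = FALSE])

# Pointwise min and max over simulations at each r value
sim_min <- apply(sim_values, 1, min)
sim_max <- apply(sim_values, 1, max)

# Width of the envelope at each r, and its smallest and largest value
width <- with(env, hi - lo)
print(range(width))
```

The same pattern should carry over to the real object by replacing env with Dvall77, provided the stored functions have the layout assumed here.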

How can we specify a custom lambda sequence to glmnet

I am new to the glmnet package in R, and wanted to pass a lambda sequence, based on the suggestion in a published research paper, to the cv.glmnet function. The documentation says that we can supply a decreasing sequence of lambdas as a parameter. However, the documentation has no examples of how to do this.
I would be very grateful if someone could suggest how to go about doing this. Do I pass a vector of 100-odd values (the default value for nlambda) to the function? What restrictions should there be on the min and max values of this vector, if any? Also, are there things to keep in mind regarding nvars, nobs, etc. while specifying the vector?
Thanks in advance.
You can define a grid like this :
grid=10^seq(10,-2,length=100) ##get lambda sequence
ridge_mod=glmnet(x,y,alpha=0,lambda=grid)
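The same grid can also be handed to cv.glmnet through its lambda argument. The sketch below builds the grid in base R, checks that it is decreasing, and only runs the fit if glmnet is installed. The simulated x and y and the names lambda_grid and cv_fit are illustrative assumptions, not part of the question:

```r
# Decreasing grid from 1e10 down to 1e-2, as in the answer above
lambda_grid <- 10^seq(10, -2, length.out = 100)

# The glmnet documentation asks for a decreasing sequence, so check that
is_decreasing <- all(diff(lambda_grid) < 0)
print(is_decreasing)

# Only attempt the fit when glmnet is available
if (requireNamespace("glmnet", quietly = TRUE)) {
  set.seed(42)
  x <- matrix(rnorm(100 * 5), nrow = 100)  # simulated predictors (illustrative only)
  y <- x[, 1] + rnorm(100)                 # simulated response
  cv_fit <- glmnet::cv.glmnet(x, y, alpha = 0, lambda = lambda_grid)
  print(cv_fit$lambda.min)                 # grid value with the lowest CV error
}
```

How wide the grid should be depends on the scale of the data; the range used here is simply the one from the answer.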
This is fairly easy though it's not well explained in the original documentation ;)
In the following I've used the cox family, but you can change it to suit your needs:
my_cvglmnet_fit <- cv.glmnet(x=regression_data, y=glmnet_response, family="cox", maxit = 100000)
Then you can plot the fitted object created by cv.glmnet, and in the plot you can easily see where the cross-validated error is lowest. One of the dotted vertical lines marks lambda.min and the other marks lambda.1se.
plot(my_cvglmnet_fit)
The following lines help you see the non-zero coefficients and their corresponding values:
coef(my_cvglmnet_fit, s = "lambda.min")[which(coef(my_cvglmnet_fit, s = "lambda.min") != 0)] # the non zero coefficients
colnames(regression_data)[which(coef(my_cvglmnet_fit, s = "lambda.min") != 0)] # The features that are selected
Here are some links that may help:
http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html
http://blog.revolutionanalytics.com/2013/05/hastie-glmnet.html
