sample variance - r

function(x) var(sum(((x - mean(x))^2)/(n - 1)))
This is my function, but it does not seem to work.
Are we allowed to use var() inside the function?

Unless I'm mistaken, this is significantly harder than you think it is; you can't compute it directly the way you're trying to do it. (Hint: when you take the sum() you get a single value, i.e. a vector of length 1, so the sample variance of that vector is NA.) This post on Mathematics Stack Exchange derives the variance of the sample variance as mu_4/n - sigma^4*(n-3)/(n*(n-1)), where mu_4 is the fourth central moment; you could also see this post on CrossValidated or this Wikipedia page (different derivations present the result in terms of the fourth central moment, the kurtosis, or the excess kurtosis ...). So:
set.seed(101)
r <- rnorm(100)
mu_4 <- mean((r-mean(r))^4)
sigma_4 <- var(r)^2
n <- length(r)
mu_4/n - sigma_4*(n-3)/(n*(n-1)) ## 0.01122
Be careful:
var(replicate(100000,var(rnorm(100))))
gives about 0.0202. At least it's not the wrong order of magnitude, but it's a factor of 2 different from the example above. My guess is that the estimate is itself highly variable, which wouldn't be surprising since it depends on the fourth moment; I tried the estimation method above a few times and it does indeed seem highly variable.
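To get a feel for that variability, here is a minimal sketch (not from the original answer) that just recomputes the plug-in estimate on fresh N(0,1) samples; the helper name est_var_of_var is mine:
est_var_of_var <- function(n = 100) {
  r <- rnorm(n)
  mu_4 <- mean((r - mean(r))^4)
  sigma_4 <- var(r)^2
  mu_4/n - sigma_4*(n - 3)/(n*(n - 1))  ## plug-in estimate of Var(s^2)
}
summary(replicate(1000, est_var_of_var()))  ## the spread shows how variable the estimate is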

Related

optim (method=Brent) and optimize not giving correct minimum for binomial distribution (N > 1000)

I'm using optim() (and optimize()) to try to find the quantiles of a binomial distribution; however, for N ~ 2000 (N = 2135) the functions do not give the correct value.
optim(21, function(x) abs(1 - pbinom(x, 2135, 21/2135) - 0.1),
method = "Brent", lower = 1, upper = 2135)
optimize(function(x) abs(1 - pbinom(x, 2135, 21/2135) - 0.1), c(1,2135))
P.S.: I also tried setting the min argument equal to the probability, but I still get incorrect answers.
The problem is that optimize() assumes that small changes in the parameter will give reliable information about whether the minimum has been attained (and which direction to go if not). (I initially said that the function needed to be differentiable, which might not be true: see the Wikipedia article on Brent's method.) In other words, most of the easily available optimization algorithms can fail on an objective function that is piecewise constant, as this one is ...
IMO the accepted answer to this nearly identical question is simply wrong. (It states that "the gradient at your starting point is almost 0", whereas in fact it's exactly zero; using optimize() doesn't help, as you found out, and picking a different starting point is more or less a matter of luck ...)
I made up a smaller example to illustrate: find the 0.6 quantile of the binomial distribution with N=10, prob=0.2. R can do this directly, very easily: qbinom(0.6, size=10, prob=0.2)! But let's assume you want to solve some other problem of a similar form and this is just an example, or that the constraints are given by a homework problem, or ...
Slightly simplified objective function (using the squared difference rather than the absolute value):
fx <- function(x) (pbinom(x, size=10, prob=0.2) - 0.6)^2
What does this look like?
curve(fx, from = 0, to = 10, n = 501)
So the correct answer is any value between 2 and 3. In this particular case optimize(fx, interval=c(1,10)) happens to work OK (it returns 2.313, which you could convert to 2 with floor()), but it will fail if I use a wider interval (optimize(fx, interval=c(1,100)) returns 99.99996), or for a problem with a larger size. Let me try
fx2 <- function(x) pbinom(x, size=1000, prob=0.2) - 0.6
qbinom(0.6, size=1000, prob=0.2) ## answer: 203
optimize(fx2, interval=c(1,1000)) ## 999.9999
The problem is that if the initial step of the optimization method jumps less than one unit, the algorithm will conclude that it has found the minimum.
One potential solution is to look for a root rather than a minimum:
fx3 <- function(x) pbinom(x, size=1000, prob=0.2) - 0.6
uniroot(fx3, interval=c(1,1000)) ## 203
I don't know a good way to solve this as an optimization problem. A stochastic global optimizer would work, but would in general be very inefficient. See here for one particular problem involving nonlinear discrete optimization in R. You can also look at the optimization task view, although I didn't find it useful ...
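For completeness, here is a minimal sketch (not from the original answer) of the brute-force alternative: since the support is discrete and small, you can simply evaluate the objective at every integer and skip the continuous optimizer entirely.
xs <- 0:1000
## smallest x with pbinom(x) >= 0.6, i.e. the 0.6 quantile (agrees with qbinom)
xs[which(pbinom(xs, size=1000, prob=0.2) >= 0.6)[1]]
## or minimize the squared-difference objective directly over the integers
xs[which.min((pbinom(xs, size=1000, prob=0.2) - 0.6)^2)]
Note that the minimizer of the squared difference can land one step away from qbinom(), which by definition returns the smallest x with pbinom(x) >= p.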

How to estimate a parameter in R- sample with replace

I have a txt file with numbers that looks like this (but with 100 numbers):
[1] 7.1652348 5.6665965 4.4757553 4.8497086 15.2276296 -0.5730937
[7] 4.9798067 2.7396933 5.1468304 10.1221489 9.0165661 65.7118194
[13] 5.5205704 6.3067488 8.6777177 5.2528503 3.5039562 4.2477401
[19] 11.4137624 -48.1722034 -0.3764006 5.7647536 -27.3533138 4.0968204
I need to estimate the MLE of the theta parameter of this distribution:
[image: the density f(x | theta) given in the assignment]
I need to estimate theta from a sample of 1000 observations drawn with replacement, save the sample, and plot a histogram.
How can I estimate theta from my sample? I have no information about a normal distribution.
I wrote something like this -
data<-read.table(file.choose(), header = TRUE, sep= "")
B <- 1000
sample.means <- numeric(data)
sample.sd <- numeric(data)
for (i in 1:B) {
MySample <- sample(data, length(data), replace = TRUE)
sample.means <- c(sample.means,mean(MySample))
sample.sd <- c(sample.sd,sd(MySample))
}
sd(sample.sd)
but it doesn't work..
This question incorporates multiple different ones, so let's tackle each step by step.
First, you will need to draw a random sample from your population (with replacement). Assuming your 100 population observations sit in a vector named pop,
rs <- sample(pop, 1000, replace = TRUE)
gives you your vector of random samples. If you want to save it, you can write it to disk in several formats; see, e.g., this related question (How to Export/Import Vectors in R?).
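For instance, a minimal sketch of one way to save the sampled vector and draw the histogram the assignment asks for (saveRDS()/readRDS() is just one of several options):
saveRDS(rs, "resample.rds")    # write the resampled vector to disk
rs <- readRDS("resample.rds")  # ... and read it back later
hist(rs)                       # histogram of the resample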
In a second step, you can use the mle() function from the stats4 package (https://stat.ethz.ch/R-manual/R-devel/library/stats4/html/mle.html) and specify the objective function explicitly.
However, the second part of your question is more of a statistical/conceptual question than an R-related one, IMO.
Try to understand what MLE actually does. You do not need normally distributed variables. The idea behind MLE is to choose theta in such a way, that under the resulting distribution the random sample is the most probable. Check https://en.wikipedia.org/wiki/Maximum_likelihood_estimation for more details or some youtube videos, if you'd like a more intuitive approach.
I assume that in the description of your task it is stated that f(x|theta) is the conditional joint density function and that the observations x are iid?
What you want to do in this case is to select theta such that the squared differences between the observations x and the parameter theta are minimized.
For your statistical understanding: in such cases it makes sense to work with the log-likelihood rather than the non-linear likelihood function itself.
Minimizing the squared differences is then equivalent to maximizing the log-transformed likelihood, since the sum enters with a negative sign (the product was in the denominator) and the log and the +1 are monotone transformations that do not move the maximum.
This leaves you with the maximization problem: choose theta to maximize -sum_i (x_i - theta)^2.
The first-order condition is sum_i (x_i - theta) = 0, which gives theta_hat = (1/n) * sum_i x_i.
Obviously, you would also have to check via the second-order condition that you are actually dealing with a maximum, but I'll omit that at this stage for simplicity.
The mle() routine in R does nothing more than solve this maximization problem numerically.
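As a concrete sketch (assuming, as in the derivation above, that maximizing the likelihood reduces to minimizing the squared differences; rs is the resampled vector drawn earlier):
library(stats4)
## negative log-likelihood up to constants: minimizing sum((x - theta)^2)
## gives the same theta-hat
negloglik <- function(theta) sum((rs - theta)^2)
fit <- mle(negloglik, start = list(theta = mean(rs)))
coef(fit)  ## numerically close to mean(rs); standard errors from this fit are
           ## not meaningful because the constants were dropped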
Hope this helps for your understanding. Maybe some smarter people can give some additional input.

R - linear model does not match experimental data

I am trying to perform a linear regression on experimental data consisting of replicate measures of the same condition (for several conditions) to check for the reliability of the experimental data. For each condition I have ~5k-10k observations stored in a data frame df:
[1] cond1 repA cond1 repB cond2 repA cond2 repB ...
[2] 4.158660e+06 4454400.703 ...
[3] 1.458585e+06 4454400.703 ...
[4] NA 887776.392 ...
...
[5024] 9571785.382 9.679092e+06 ...
I use the following code to plot scatterplot + lm + R^2 values (stored in rdata) for the different conditions:
for (i in seq(1,13,2)){
vec <- matrix(0, nrow = nrow(df), ncol = 2)
vec[,1] <- df[,i]
vec[,2] <- df[,i+1]
vec <- na.exclude(vec)
plot(log10(vec[,1]),log10(vec[,2]), xlab = 'rep A', ylab = 'rep B' ,col="#00000033")
abline(fit<-lm(log10(vec[,2])~log10(vec[,1])), col='red')
legend("topleft",bty="n",legend=paste("R2 is",rdata[1,((i+1)/2)] <- format(summary(fit)$adj.r.squared,digits=4)))
}
However, the lm seems to be shifted so that it does not fit the trend I see in the experimental data:
It consistently occurs for every condition. I unsuccessfully tried to find an explanation by looking up the source code and browsing different forums and posts (this or here).
I would have liked to simply comment and ask a few questions, but I can't.
From what I've understood, both repA and repB are measured with error. Hence, you cannot fit your data using an ordinary least squares procedure, which only takes into account the error in Y (some might argue a weighted OLS may work, but I'm not skilled enough to discuss that). Your question seems linked to this one.
What you can use is a total least squares (TLS) procedure: it takes into account the error in both X and Y. In the example below, I've used a "normal" TLS assuming the same error in X and Y (thus error.ratio=1). If that is not the case, you can specify the error ratio by setting error.ratio=var(y1)/var(x1) (at least I think it's var(Y)/var(X): check the documentation to make sure).
library(mcr)
## x1 and y1 are assumed to hold the two (log-transformed) replicate vectors
MCR_reg=mcreg(x1,y1,method.reg="Deming",error.ratio=1,method.ci="analytical")
MCR_intercept=getCoefficients(MCR_reg)[1,1]
MCR_slope=getCoefficients(MCR_reg)[2,1]
# CI for predicted values
x_to_predict=seq(0,35)
predicted_values=MCResultAnalytical.calcResponse(MCR_reg,x_to_predict,alpha=0.05)
CI_low=predicted_values[,4]
CI_up=predicted_values[,5]
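To overlay the fit on the original scatterplot, something along these lines should work (a sketch reusing the objects created above, with x1 and y1 the two replicate vectors):
plot(x1, y1, col="#00000033")                    # raw data
abline(MCR_intercept, MCR_slope, col="blue")     # Deming/TLS fit
lines(x_to_predict, CI_low, lty=2, col="blue")   # lower confidence band
lines(x_to_predict, CI_up, lty=2, col="blue")    # upper confidence band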
Please note that in Deming/TLS regression your x- and y-errors are supposed to follow a normal distribution, as explained here. If that's not the case, go for a Passing-Bablok regression (the R code is here).
Also note that R2 isn't defined for Deming or Passing-Bablok regressions (see here). A correlation coefficient is a good proxy, although it does not provide exactly the same information. Since you're studying a linear correlation between two factors, see Pearson's product-moment correlation coefficient, and use e.g. the rcorr function.
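As a minimal sketch of that correlation check (vec holds the two replicate columns from the loop above; Hmisc::rcorr() gives the same Pearson estimate plus a p-value):
cor(log10(vec[,1]), log10(vec[,2]), use="complete.obs")  # Pearson by default
## or, if the Hmisc package is installed:
## Hmisc::rcorr(log10(vec[,1]), log10(vec[,2]), type="pearson")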

Understanding `scale` in R

I'm trying to understand the definition of scale that R provides. I have data (mydata) that I want to make a heat map with, and there is a VERY strong positive skew. I've created a heatmap with a dendrogram for both scale(mydata) and log(mydata), and the dendrograms are different for both. Why? What does it mean to scale my data, versus log-transform my data? And which would be more appropriate if I want to look at the dendrogram illustrating the relationship between the columns of my data?
Thank you for any help! I've read the definitions, but they go right over my head.
log simply takes the logarithm (base e, by default) of each element of the vector.
scale, with default settings, will calculate the mean and standard deviation of the entire vector, then "scale" each element by those values by subtracting the mean and dividing by the sd. (If you use scale(x, scale=FALSE), it will only subtract the mean but not divide by the std deviation.)
Note that the following two approaches will give you the same values:
set.seed(1)
x <- runif(7)
# Manually scaling
(x - mean(x)) / sd(x)
scale(x)
It does nothing more than standardize the data. The values it creates are known under several different names, one of them being z-scores ("Z" because the normal distribution is also known as the "Z distribution").
More can be found here:
http://en.wikipedia.org/wiki/Standard_score
This is a late addition, but I was looking for information on the scale function myself and thought it might help somebody else as well.
To modify the response from Ricardo Saporta a little bit:
scaling is not always done using the standard deviation. When you call scale() without centering (center = FALSE), the divisor is the root mean square, sqrt(sum(x^2)/(length(x) - 1)), rather than sd(x) - at least in version 3.6.1 of R. I base this on "Becker, R. (2018). The new S language. CRC Press." and my own experimentation.
X.man.scaled <- X/sqrt(sum(X^2)/(length(X)-1))
X.aut.scaled <- scale(X, center = F)
The results of these two lines are exactly the same; I show it without centering for simplicity.
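A quick check of that claim (a sketch; any numeric vector will do):
X <- rnorm(10)
all.equal(as.numeric(scale(X, center = FALSE)),
          X/sqrt(sum(X^2)/(length(X) - 1)))  ## TRUE: the divisor is the root mean square, not sd(X)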
I would respond in a comment but did not have enough reputation.
I thought I would contribute by providing a concrete example of the practical use of the scale function. Say you have 3 test scores (Math, Science, and English) that you want to compare. Maybe you even want to generate a composite score based on the 3 tests for each observation. Your data could look like this:
student_id <- seq(1,10)
math <- c(502,600,412,358,495,512,410,625,573,522)
science <- c(95,99,80,82,75,85,80,95,89,86)
english <- c(25,22,18,15,20,28,15,30,27,18)
df <- data.frame(student_id,math,science,english)
Obviously it would not make sense to compare the means of these 3 scores, as the scales of the scores are vastly different. By scaling them, however, you have more comparable scoring units:
z <- scale(df[,2:4],center=TRUE,scale=TRUE)
You could then use these scaled results to create a composite score. For instance, average the values and assign a grade based on the percentiles of this average (see the sketch below). Hope this helped!
Note: I borrowed this example from the book "R In Action". It's a great book! Would definitely recommend.
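A minimal sketch of that composite-score idea, reusing z from above (the new column names are only illustrative):
df$composite <- rowMeans(z)                              # average of the three z-scores
df$percentile <- round(100*rank(df$composite)/nrow(df))  # percentile rank of the composite
head(df[order(-df$composite), ])                         # students ordered by composite score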

Random Data Sets within loops

Here is what I want to do:
I have a time series data frame with, let us say, 100 time series of length 600, each in one column of the data frame.
I want to pick 4 of the time series at random and then assign them random weights that sum to one (e.g. 0.1, 0.5, 0.3, 0.1). Using those, I want to compute the mean of the sum of the 4 weighted time series (i.e. a convex combination).
I want to do this, let us say, 100k times and store each result in the form
ts1.name, ts2.name, ts3.name, ts4.name, weight1, weight2, weight3, weight4, mean
so that I get a 100k-by-9 data frame.
I tried some things already, but R is very bad with loops, and I know vector-oriented solutions are better because of R's design.
Here is what I did, and I know it is horrible:
The df is in the form
v1,v2,v3,...,v100
1,5,6,.......9
2,4,6,.......10
3,5,8,.......6
2,2,8,.......2
etc
e=NULL
for (x in 1:100000)
{
s=sample(1:100,4)#pick 4 variables randomly
a=sample(seq(0,1,0.01),1)
b=sample(seq(0,1-a,0.01),1)
c=sample(seq(0,(1-a-b),0.01),1)
d=1-a-b-c
e=c(a,b,c,d)#4 random weights
average=mean(timeseries.df[,s]%*%t(e))
e=rbind(e,s,average)#in the end i get the 9*100k df
}
The procedure runs way too slowly.
EDIT:
Thanks for the help I've had. I am not used to thinking in R, and I am not very used to translating every problem into a matrix-algebra formulation, which is what you need in R.
The problem becomes a bit more complex if I want to calculate the standard deviation as well.
I need the covariance matrix, and I am not sure if/how I can pick the corresponding random elements for each sample from the covariance matrix of the original timeseries.df and then compute the sample variance
t(sampleweights)%*%sample_cov.mat%*%sampleweights
to end up with the ts.weighted_standard_dev matrix.
Last question: what is the best way to proceed if I want to bootstrap the original df x times and then apply the same computations to test the robustness of my data?
Thanks
Ok, let me try to solve your problem. As a foreword: I can think of no application where it is sensible to do what you are doing. However, that is for you to judge (nonetheless, I would be interested in the application ...).
First, note that the mean of the weighted sum equals the weighted sum of the means: mean(w1*x1 + w2*x2 + w3*x3 + w4*x4) = w1*mean(x1) + w2*mean(x2) + w3*mean(x3) + w4*mean(x4).
Let's generate some sample data:
timeseries.df <- data.frame(matrix(runif(1000, 1, 10), ncol=40))
n <- 4 # number of items in the convex combination
replications <- 100 # number of replications
Thus, we may first compute the mean of all columns and do all further computations using this mean:
ts.means <- apply(timeseries.df, 2, mean)
Let's create some samples:
samples <- replicate(replications, sample(1:length(ts.means), n))
and the corresponding weights for those samples:
weights <- matrix(runif(replications*n), nrow=n)
# Now norm the weights so that each column sums up to 1:
weights <- weights / matrix(apply(weights, 2, sum), nrow=n, ncol=replications, byrow=T)
That part was a little bit tricky. Run the individual functions on their own with a small number of replications to figure out what they are doing. Note that I took a different approach for generating the weights: first draw uniformly distributed values and then norm them by their sum. The result should be identical to your approach, but with arbitrary resolution and much better performance.
Again a little bit tricky: get the means for each selected time series and multiply them by the weights just computed:
ts.weightedmeans <- matrix(ts.means[samples], nrow=n) * weights
# and sum them up:
weights.sum <- apply(ts.weightedmeans, 2, sum)
Now we are basically done - all the information is available and ready to use. The rest is just a matter of correctly formatting the data.frame.
result <- data.frame(t(matrix(names(ts.means)[samples], nrow=n)), t(weights), weights.sum)
# For readability, use better names:
colnames(result) <- c(paste("Sample", 1:n, sep=''), paste("Weight", 1:n, sep=''), "WeightedMean")
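A quick sanity check of the result (sketch): each row should carry four weights that sum to one.
head(result)
stopifnot(all(abs(rowSums(result[, paste("Weight", 1:n, sep='')]) - 1) < 1e-8))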
I would assume this approach to be rather fast - on my system, the code took 1.25 seconds with the number of repetitions you stated.
Final word: you were in luck that I was looking for something that kept me thinking for a while. Your question was not asked in a way that encourages users to think about your problem and give good answers. The next time you have a problem, I would suggest you read www.whathaveyoutried.com first and try to break the problem down as far as you are able to. The more concrete your problem, the faster and higher-quality the answers will be.
Edit
You mentioned correctly that the weights generated above are not uniformly distributed over the whole range of values. (I would still point out that even a combination like (0.9, 0.05, 0.025, 0.025) is possible, just very unlikely.)
Now we are playing in a different league, though. I am pretty sure that the approach you took does not produce uniformly distributed weights either - the probability of the last value being 0.9 is far smaller than the probability of the first one being that large. Honestly, I do not have a good idea ready for you for generating random vectors that are uniformly distributed on the unit sphere with respect to the L_1 distance. (Actually, it is not really a unit sphere, but the two problems should be equivalent.)
Thus, I have to give up on this.
I would suggest you raise a new question at stats.stackexchange.com concerning the generation of those random vectors. It is probably fairly simple with the correct technique. However, I doubt that this question, with this heading and an already fairly long answer, will attract a potential responder ... (If you ask the question over there, I would appreciate a link, as I would like to know the solution ;)
Concerning the variance: I do not fully understand which standard deviation you want to compute. If you just want the standard deviation of each time series, why not use the built-in function sd? In the computation above you could simply replace mean by it.
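If what you are after is instead the standard deviation of each weighted combination, as the t(sampleweights)%*%sample_cov.mat%*%sampleweights formula in your edit suggests, here is a minimal sketch reusing samples, weights and replications from above:
cov.mat <- cov(timeseries.df)  # sample covariance matrix of the original series
weighted.sd <- sapply(seq_len(replications), function(i) {
  idx <- samples[, i]          # the n series picked in replication i
  w   <- weights[, i]          # their weights
  sqrt(as.numeric(t(w) %*% cov.mat[idx, idx] %*% w))
})
result$WeightedSD <- weighted.sd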
Bootstrapping: That is a whole new question. Separate different topics by starting new questions.

Resources