Separating two superimposed normal distributions in R [closed]

I am searching for a function/package-name in R which allows one to separate two superimposed normal distributions. The distribution looks something like this:
x<-c(3.95, 3.99, 4.0, 4.04, 4.1, 10.9, 11.5, 11.9, 11.7, 12.3)

I had good results in the past using vector generalized linear models; the VGAM package is useful for that. Its mix2normal1 function estimates the parameters of a mixture of two univariate normal distributions.
A small example:
require(VGAM)
set.seed(12345)
# Create a mixture of two normal distributions with means 10 and 20
data <- c(rnorm(100, 10, 1.5), rnorm(200, 20, 3))
# Initial parameters for minimization algorithm
# You may want to create some logic to estimate this a priori... not always easy but possible
# m, m2: Means - s, s2: SDs - w: relative weight of the first distribution (the second is 1-w)
init.params <- list(m=5, m2=8, s=1, s2=1, w=0.5)
fit <- vglm(data ~ 1, mix2normal1(equalsd=FALSE),
            iphi=init.params$w, imu=init.params$m, imu2=init.params$m2,
            isd1=init.params$s, isd2=init.params$s2)
# Calculated parameters
pars = as.vector(coef(fit))
w = logit(pars[1], inverse=TRUE)
m1 = pars[2]
sd1 = exp(pars[3])
m2 = pars[4]
sd2 = exp(pars[5])
# Plot a histogram of the data
hist(data, 30, col="black", freq=F)
# Superimpose the fitted distribution
x <- seq(0, 30, 0.1)
points(x, w*dnorm(x, m1, sd1) + (1-w)*dnorm(x, m2, sd2), type="l", col="red", lwd=2)
This gives estimates reasonably close to the "true" parameters (10, 20, 1.5, 3):
> m1
[1] 10.49236
> m2
[1] 20.06296
> sd1
[1] 1.792519
> sd2
[1] 2.877999

You might want to use nls, the nonlinear regression tool (or other nonlinear regressors). I'm guessing you have a vector of data representing the superimposed distributions. Then, roughly, nls(y ~ I(a*exp(-(x-meana)^2/siga) + b*exp(-(x-meanb)^2/sigb)), {initial guess values required for all constants}), where y is your distribution and x is the domain.
I haven't thought this through in detail, so I'm not sure which convergence methods are least likely to fail.
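A rough sketch of that nls approach on binned data (the histogram binning and the starting values below are illustrative assumptions, not part of the original suggestion):
# Simulate two superimposed normals and bin them into an empirical density curve
set.seed(1)
data <- c(rnorm(100, 10, 1.5), rnorm(200, 20, 3))
h <- hist(data, breaks = 40, plot = FALSE)
x <- h$mids        # bin centres
y <- h$density     # empirical density at each bin
# Fit the sum of two Gaussian-shaped curves; the start values are rough guesses
fit <- nls(y ~ a*exp(-(x-meana)^2/siga) + b*exp(-(x-meanb)^2/sigb),
           start = list(a=0.1, meana=10, siga=5, b=0.1, meanb=20, sigb=10))
coef(fit)
# Overlay the fitted curve on the binned data
plot(x, y, pch = 16)
lines(x, predict(fit), col = "red", lwd = 2)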

Related

Calculating the probability of a sample mean using R

I'm in an intro to stats class right now, and have absolutely no idea what's going on. How would I solve the following problem using R?
Let x be a continuous random variable that has a normal distribution with a mean of 71 and a standard deviation of 15. Assuming n/N is less than or equal to 0.05, find the probability that the sample mean, x-bar, for a random sample of 24 taken from this population will be between 68.1 and 78.3.
I'm really struggling on this one and I still have to get through other problems in the same format. Any help would be greatly appreciated!
For R coding this might set you up:
# Children's IQ scores are normally distributed with a
# mean of 100 and a standard deviation of 15. What
# proportion of children are expected to have an IQ between
# 80 and 120?
mean=100; sd=15
lb=80; ub=120
x <- seq(-4, 4, length=100)*sd + mean
hx <- dnorm(x, mean, sd)
plot(x, hx, type="n", xlab="IQ Values", ylab="",
     main="Normal Distribution", axes=FALSE)
i <- x >= lb & x <= ub
lines(x, hx)
polygon(c(lb, x[i], ub), c(0, hx[i], 0), col="red")
area <- pnorm(ub, mean, sd) - pnorm(lb, mean, sd)
result <- paste("P(", lb, "< IQ <", ub, ") =",
                signif(area, digits=3))
mtext(result, 3)
axis(1, at=seq(40, 160, 20), pos=0)
There is also a nice introductory course on R and data analysis from DataCamp, which might come in handy:
https://www.datacamp.com/courses/exploratory-data-analysis
And another tutorial on R and statistics:
http://www.cyclismo.org/tutorial/R/confidence.html
In terms of the code:
# The population sd is known (15), so the standard error of the sample mean is 15/sqrt(24)
se_pop <- 15/sqrt(24)
pnorm(78.3, 71, se_pop) - pnorm(68.1, 71, se_pop) # approximately 0.82
In terms of the stats... you should probably refer to stats.stackexchange.com or your professor.

Fit different regression models to polynomial data? [closed]

I am an absolute beginner with R so please bear with me.
I have some generated polynomial (squared) data
x.training <- seq(0, 5, by=0.01) # x data
error.training <- rnorm(n=length(x.training), mean=0, sd=1) # Error (0, 1)
y.training <- x.training^2 + error.training # y data
I want to apply 3 different regression models to this data to demonstrate which one has a better fit. My 3 models are linear, polynomial, and trigonometric (cos).
I have tried the following but the lines either don't show up or are just straight lines. How could I go about applying these models properly?
Full code:
x.training <- seq(0, 5, by=0.01) # x data
error.training <- rnorm(n=length(x.training), mean=0, sd=1) # Error (0, 1)
y.training <- x.training^2 + error.training # y data
linear.model <- lm(y.training~x.training)
poly.model <- lm(y.training~poly(x.training, 2))
trig.model <- lm(y.training~cos(x.training))
linear.predict <- predict(linear.model)
poly.predict <- predict(poly.model)
trig.predict <- predict(trig.model)
plot(x.training, y.training)
lines(linear.predict, col="red")
lines(poly.predict, col="blue")
lines(trig.predict, col="green")
Absolutely simple mistake on my part. I feel silly.
lines(x.training, linear.predict, col="red")
lines(x.training, poly.predict, col="blue")
lines(x.training, trig.predict, col="green")
I wasn't feeding in any X coordinates, and predict only returns Y-hat.
Much better!

Local FDR from p-values in R 3.1.1 [closed]

I'm trying to find a package to calculate Efron's local FDR for a series of tests. I have over 1000 covariates, so multiple-testing correction is definitely in order.
Looking for local FDR, I see the locfdr package is no longer available on CRAN. Any idea why it was removed? It seems the most closely related to the original publication on local FDR.
I did find fdrtool, but it cannot calculate local FDR from p-values. Other packages I've found are not available for 3.1.1: LocalFDR, localFDR, kerfdr, twilight. Of course all these packages use slightly different methods. Even if I could get at them, which should I choose?
The twilight package can do this (as also suggested in the comments). It is easy to use, as demonstrated below, and quite clear vignettes are available on Bioconductor.
First we install the package.
# Install twilight
source("http://bioconductor.org/biocLite.R")
biocLite("twilight")
Next, we simulate some test-scores and p-values.
# Simulate p p-values
set.seed(1)
p <- 10000 # Number of "genes"
prob <- 0.2 # Proportion of true alternatives
# Simulate draws from null (=0) or alternatives (=1)
null <- rbinom(p, size = 1, prob = prob)
# Simulate some t-scores, all non-null genes have an effect of 2
t.val <- (1-null)*rt(p, df = 15) + null*rt(p, df = 15, ncp = 2)
# Compute p-values
p.val <- 2*pt(-abs(t.val), df = 15)
# Plot the results
par(mfrow = c(1,2))
hist(t.val, breaks = 70, col = "grey",
xlim = c(-10, 10), prob = TRUE, ylim = c(0, .35))
hist(p.val, breaks = 70, prob = TRUE)
Next we load the library and run the fdr analysis:
library(twilight)
ans <- twilight(p.val)
We see that the estimate of the true alternative proportion is quite good:
print(1 - ans$pi0)
#[1] 0.1529
print(prob)
#[1] 0.2
The package reorders the p-values, q-values, and fdr-values in increasing order, so we do a trick to reconstruct the original order.
o <- order(order(p.val))
fdr <- ans$result$fdr[o]
plot(p.val, fdr, pch = 16, col = "red", cex = .2)
Lastly, we can cross-tabulate the significant calls against the truth:
table(estimate = fdr < 0.5, truth = as.logical(null))
# truth
#estimate FALSE TRUE
# FALSE 7564 1172
# TRUE 368 896
Hence, we have an accuracy of 84.6% in this toy example. I hope this helps. The twilight function also features bootstrapped CIs for the fdr, which you'll find in ?twilight along with further references.
Edit
It seems that the fdrtool package (which is on CRAN) actually can compute the local fdr from p-values. In our case we do the following:
library("fdrtool")
fdr <- fdrtool(p.val, statistic="pvalue")
fdr$lfdr # The local fdr values

Create correlated variables from existing variable [closed]

Let's say I have a vector:
Q<-rnorm(10,mean=0,sd=20)
From this vector I would like to:
1. Create 10 variables (a1...a10) that each have a correlation above .5 (i.e. between .5 and 1) with Q.
The first part can be done with:
t1 <- sapply(1:10, function(x) jitter(t, factor=100))
2. Each of these variables (a1...a10) should have a pre-specified correlation with each other. For example, some should be correlated .8 and some -.2.
Can these two things be done?
I create a correlation matrix:
cor.table <- matrix(sample(c(0.9, -0.9), 2500, prob=c(0.8, 0.2), replace=TRUE), 50, 50)
cor.table[1, ] <- 0.55
cor.table[, 1] <- 0.55
diag(cor.table) <- 1
However, when I apply the excellent solution by @SprengMeister I get the error:
Error in eigen(cor.table)$values > 0 :
invalid comparison with complex values
continued here: Eigenvalue decomposition of correlation matrix
As a pointer to a solution, use the noise function jitter in R:
set.seed(100)
t = rnorm(10,mean=0,sd=20)
t1 = jitter(t, factor = 100)
cor(t,t1)
[1] 0.8719447
To generate data with a prescribed correlation (or variance), you can start with random data and rescale it using the Cholesky decomposition of the desired correlation matrix.
# Sample data
Q <- rnorm(10, mean=0, sd=20)
desired_correlations <- matrix(c(
1, .5, .6, .5,
.5, 1, .2, .8,
.6, .2, 1, .5,
.5, .8, .5, 1 ), 4, 4 )
stopifnot( eigen( desired_correlations )$values > 0 )
# Random data, with Q in the first column
n <- length(Q)
k <- ncol(desired_correlations)
x <- matrix( rnorm(n*k), nc=k )
x[,1] <- Q
# Rescale, first to make the variance equal to the identity matrix,
# then to get the desired correlation matrix.
y <- x %*% solve(chol(var(x))) %*% chol(desired_correlations)
var(y)
y[,1] <- Q # The first column was only rescaled: that does not affect the correlation
cor(y) # Desired correlation matrix
I answered a very similar question a little while ago
R: Constructing correlated variables
I am not familiar with jitter, so maybe my solution is more verbose, but it would allow you to determine exactly what the intercorrelations between each of your variables and Q are supposed to be.
The F matrix referenced in that answer describes the intercorrelations that you want to impose on your data.
EDIT to answer question in comment:
If I am not mistaken, you are trying to create a multivariate correlated data set, so all the variables in the set are correlated to varying degrees. I assume Q is your criterion or DV, and a1-a10 are predictors or IVs.
In the F matrix you would reflect the relationships between these variables. For example
cor_Matrix <- matrix(c(1.00, 0.90, 0.20,
                       0.90, 1.00, 0.40,
                       0.20, 0.40, 1.00),
                     nrow=3, ncol=3, byrow=TRUE)
describes the relationships between three variables. The first one could be Q, the second a1 and the third a2. So in this scenario, Q is correlated with a1 (.90) and a2 (.20), and a1 is correlated with a2 (.40). The rest of the matrix is redundant, since it is symmetric.
In the remainder of the code, you simply create your raw, uncorrelated variables and then impose the loadings that you previously pulled from the F matrix.
I hope this helps. If there is a package in R that does all that, please let me know. I built this to help me understand how multivariate data sets are actually generated.
To generalize this to 10 variables plus Q, just change the parameters that are set to 3 now to 11 and create an 11x11 F matrix. A small sketch of the idea follows.
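A minimal sketch of that idea, using MASS::mvrnorm instead of the code in the linked answer (which is not reproduced here); note that it generates a fresh Q together with a1 and a2 rather than starting from an existing vector:
library(MASS)
# The desired intercorrelations between Q, a1 and a2 (the F matrix above)
cor_Matrix <- matrix(c(1.00, 0.90, 0.20,
                       0.90, 1.00, 0.40,
                       0.20, 0.40, 1.00),
                     nrow=3, ncol=3, byrow=TRUE)
set.seed(1)
# Draw 1000 observations from a multivariate normal with that correlation structure
dat <- mvrnorm(n=1000, mu=rep(0, 3), Sigma=cor_Matrix, empirical=TRUE)
colnames(dat) <- c("Q", "a1", "a2")
round(cor(dat), 2)  # reproduces cor_Matrix (empirical=TRUE forces an exact match)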

Given a sample of random variables, and n, how do I find the ecdf of the sum of n Xs? [closed]

I can't fit X to a common distribution, so currently I just have X ~ ecdf(sample_data).
How do I calculate the empirical distribution of sum(X1 + ... + Xn), given n? X1 to Xn are iid.
To estimate the distribution of that sum, you can repeatedly sample with replacement (and then take the sum of) n variates from sample_data. (sample() places equal probability mass on each element of sample_data, just as the ecdf does, so you don't need to calculate ecdf(sample_data) as an intermediate step.)
# Create some example data
sample_data <- runif(100)
n <- 10
X <- replicate(1000, sum(sample(sample_data, size=n, replace=TRUE)))
# Plot the estimated distribution of the sum of n variates.
hist(X, breaks=40, col="grey", main=expression(sum(x[i], i==1, n)))
box(bty="l")
# Plot the ecdf of the sum
plot(ecdf(X))
First, generalize and simplify: solve for step-function CDFs X and Y, independent but not identically distributed. For every step jump xi and every step jump yj, there will be a corresponding step jump at xi+yj in the CDF of X + Y. So the CDF of X + Y will be characterized by the list:
sort(outer(x, y, "+"))  # every pairwise sum x_i + y_j, in increasing order
That means if there are k points in X's CDF, there will be k^n in (X1 + ... + Xn). We can cut that down to a manageable number at the end by throwing away all but k again, but clearly the intermediate calculations will be costly in time and space.
Also, note that even though the original CDF is an ECDF for X, the result will not be an ECDF for (X1 + ... + Xn), even if you keep all kn points.
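For two samples, that exact convolution can be sketched as follows (the sum_ecdf helper and the runif example data are illustrative, not from the original answer); repeating it n-1 times gives X1 + ... + Xn, at the k^n-point cost described above:
# Exact CDF of X + Y when X and Y each put equal mass on the points of a sample
sum_ecdf <- function(sx, sy) {
  pts <- sort(as.vector(outer(sx, sy, "+")))  # every pairwise sum x_i + y_j, sorted
  list(x = pts, cdf = seq_along(pts)/length(pts))  # equal mass on each pair
}
sample_data <- runif(100)
xy <- sum_ecdf(sample_data, sample_data)  # exact distribution of X1 + X2
plot(xy$x, xy$cdf, type="s", xlab="x1 + x2", ylab="CDF")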
In conclusion, use Josh's solution.
