Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 8 years ago.
I'm trying to find a package to calculate Efron's local FDR for a series of tests. I have over 1000 covariates, so multiple-testing correction is definitely in order.
Looking for local FDR, I see the locfdr package is no longer available on CRAN. Any idea why it was removed? It seems the most closely related to the original publication on local FDR.
I did find fdrtool, but it cannot calculate local FDR from p-values. Other packages I've found are not available for R 3.1.1: LocalFDR, localFDR, kerfdr, twilight. Of course, all these packages use slightly different methods. Even if I could get to them, which should I choose?
The twilight package can do this (as also suggested in the comments). It is easy to use, as demonstrated below, and quite clear vignettes are available on Bioconductor.
First we install the package.
# Install twilight
source("http://bioconductor.org/biocLite.R")
biocLite("twilight")
Next, we simulate some test scores and p-values.
# Simulate p test statistics and their p-values
set.seed(1)
p <- 10000 # Number of "genes"
prob <- 0.2 # Proportion of true alternatives
# Simulate draws from null (=0) or alternatives (=1)
null <- rbinom(p, size = 1, prob = prob)
# Simulate some t-scores, all non-null genes have an effect of 2
t.val <- (1-null)*rt(p, df = 15) + null*rt(p, df = 15, ncp = 2)
# Compute p-values
p.val <- 2*pt(-abs(t.val), df = 15)
# Plot the results
par(mfrow = c(1,2))
hist(t.val, breaks = 70, col = "grey",
xlim = c(-10, 10), prob = TRUE, ylim = c(0, .35))
hist(p.val, breaks = 70, prob = TRUE)
Next we load the library and run the fdr analysis:
library(twilight)
ans <- twilight(p.val)
We see that the estimate of the true alternative proportion is quite good:
print(1 - ans$pi0)
#[1] 0.1529
print(prob)
#[1] 0.2
The package reorders the p-values, q-values, and fdr-values in increasing order, so we do a trick to reconstruct the original order.
o <- order(order(p.val))
fdr <- ans$result$fdr[o]
plot(p.val, fdr, pch = 16, col = "red", cex = .2)
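As a quick sanity check, order(order(p.val)) computes the ranks of the p-values, and indexing the sorted values by those ranks recovers the original order. A toy illustration:

z <- c(0.3, 0.1, 0.2)
order(order(z))           # the ranks: 3 1 2
sort(z)[order(order(z))]  # gives back 0.3 0.1 0.2, the original order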
Lastly, we can cross-tabulate the calls (local fdr < 0.5) against the truth:
table(estimate = fdr < 0.5, truth = as.logical(null))
# truth
#estimate FALSE TRUE
# FALSE 7564 1172
# TRUE 368 896
Hence, we have an accuracy of 84.6% in this toy example. I hope this helps. The twilight function also features bootstrapped confidence intervals for the fdr, which you'll find described in ?twilight along with further references.
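If I recall the interface correctly, the bootstrap is switched on through the B argument (the number of bootstrap samples); please verify the argument names against ?twilight before relying on this sketch:

# Assumed call: B sets the number of bootstrap samples used for the fdr CIs
ans.boot <- twilight(p.val, B = 1000)
head(ans.boot$result)  # should now include lower/upper bounds for the fdr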
Edit
It seems that the fdrtool package (which is on CRAN) actually can compute the local fdr from p-values. In our case we do the following:
library("fdrtool")
fdr <- fdrtool(p.val, statistic="pvalue")
fdr$lfdr # The local fdr values
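As a rough cross-check (a sketch, assuming the twilight objects from above are still in the workspace), the two local fdr estimates can be compared directly; fdrtool returns its values in the original input order, so no reordering is needed:

plot(fdr, fdr.tool$lfdr, pch = 16, cex = 0.2,
     xlab = "local fdr (twilight)", ylab = "local fdr (fdrtool)")
abline(0, 1, col = "red")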
Related
I want to generate 95% confidence intervals for the R2 of a linear model. While developing the code and using the same seed for both approaches, I found that doing the bootstrap manually doesn't give me the same results as using the boot function from the boot package. I am wondering now if I am doing something wrong, or why this is happening.
On the other hand, in order to calculate the 95% CI I was trying to use the confint function, but I'm getting the error "$ operator is invalid for atomic vectors". Is there a way to avoid this error?
Here is a reproducible example to explain my concerns
#creating the dataframe
a <- rpois(n = 100, lambda = 10)
b <- rnorm(n = 100, mean = 5, sd = 1)
DF<- data.frame(a,b)
#bootstrapping manually
set.seed(123)
x=length(DF$a)
B_manually<- data.frame(replicate(100, summary(lm(a~b, data = DF[sample(x, replace = T),]))$r.squared))
names(B_manually)[1]<- "r_squared"
# Bootstrapping using the boot() function from the boot package
set.seed(123)
library(boot)
B_boot <- boot(DF, function(data,indices)
summary(lm(a~b, data[indices,]))$r.squared,R=100)
head(B_manually) == head(B_boot$t)
r_squared
1 FALSE
2 FALSE
3 FALSE
4 FALSE
5 FALSE
6 FALSE
# Why do the results of the manual approach and the boot function differ if I'm using the same seed?
# 2nd question (using the confint function to determine the 95% CI gives me an error)
confint(B_manually$r_squared, level = 0.95, method = "quantile")
confint(B_boot$t, level = 0.95, method = "quantile")
#Error: $ operator is invalid for atomic vectors
# NOTE: I already used boot.ci to determine the 95% confidence interval, as well as the
# quantile function, but the results of these CIs differ from each other,
# and I just wanted to compare them with the confint function.
quantile(B_boot$t, c(0.025, 0.975))
boot.ci(B_boot, index = 1, type = "perc")
Thanks in advance for any help!
The boot package does not use replicate with sample to generate the indices; check the index-generating functions (e.g. importance.array) in the source code of boot. It basically generates all the indices in one go, so there is no reason to expect that you will end up with the same indices or the same results even with the same seed. Taking a step back, the purpose of the bootstrap is to use random resampling to obtain an estimate of your parameters, and you should get similar (not identical) estimates from different implementations of the bootstrap.
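If you want to see exactly which indices boot used, a sketch (assuming boot.array behaves as documented and returns one row of indices per replicate) is to pull them out of the fitted object and recompute the statistic by hand:

# Recompute each bootstrap R^2 from the indices boot actually used;
# this should reproduce B_boot$t exactly
idx <- boot.array(B_boot, indices = TRUE)   # R x n matrix of resampled row indices
r2_manual <- apply(idx, 1, function(i) summary(lm(a ~ b, data = DF[i, ]))$r.squared)
all.equal(as.vector(B_boot$t), r2_manual)   # expected to be TRUE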
For example, you can see the distribution of R^2 is very similar:
set.seed(111)
a <- rpois(n = 100, lambda = 10)
b <- rnorm(n = 100, mean = 5, sd = 1)
DF<- data.frame(a,b)
set.seed(123)
x=length(DF$a)
B_manually<- data.frame(replicate(999, summary(lm(a~b, data = DF[sample(x, replace = T),]))$r.squared))
library(boot)
B_boot <- boot(DF, function(data,indices)
summary(lm(a~b, data[indices,]))$r.squared,R=999)
par(mfrow=c(2,1))
hist(B_manually[,1],breaks=seq(0,0.4,0.01),main="dist of R2 manual")
hist(B_boot$t,breaks=seq(0,0.4,0.01),main="dist of R2 boot")
The confint function you are using is meant for fitted model objects such as those returned by lm, and it estimates a confidence interval for the model coefficients (see its help page). It takes the standard error of a coefficient and multiplies it by the critical t-value to give you the confidence interval; you can check out this book page for the formula. The objects from your bootstrapping are not lm objects, so this function doesn't work on them; it is not meant for arbitrary vectors of estimates.
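For the bootstrap replicates themselves, the usual route is a percentile interval, either directly with quantile() or via boot.ci(); a minimal sketch:

# 95% percentile interval from the manual replicates
quantile(B_manually$r_squared, c(0.025, 0.975))
# 95% percentile interval from the boot object
boot.ci(B_boot, index = 1, type = "perc")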
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 6 years ago.
I would like to divide all documents into 10 topics, and the model converges fine, except that the dimensions of the topic distributions and their covariance matrix are not what I expected.
Why is the topic distribution a 9-dimensional vector instead of 10, and why is the covariance matrix 9x9 instead of 10x10?
I have used library(topicmodels) and the function CTM() to fit the topic model on Chinese documents.
My code is below:
library(rJava)
library(Rwordseg)
library(NLP)
library(tm)
library(tmcn)
library(topicmodels)
installDict("C:\\Users\\Jeffy\\OneDrive\\Workplace\\R\\Law.scel", "Law")
installDict("C:\\Users\\Jeffy\\OneDrive\\Workplace\\R\\NationalInstitution.scel", "NationalInstitution")
installDict("C:\\Users\\Jeffy\\OneDrive\\Workplace\\R\\Place.scel", "Place")
installDict("C:\\Users\\Jeffy\\OneDrive\\Workplace\\R\\Psychology.scel", "Psychology")
installDict("C:\\Users\\Jeffy\\OneDrive\\Workplace\\R\\Politics.scel", "Politics")
listDict()
#read file
d.vec <- segmentCN("samgovWithoutID.csv", returnType = "tm")
samgov.segment <- read.table("samgovWithoutID.segment.csv", header = TRUE, fill = TRUE, stringsAsFactors = F, sep = ",",fileEncoding='utf-8')
fix(samgov.segment)
# create DTM(document term matrix)
d.corpus <- Corpus(VectorSource(samgov.segment$content))
inspect(d.corpus[1:10])
d.corpus <- tm_map(d.corpus, removeWords, stopwordsCN())
ctrl <- list(removePunctuation = TRUE, removeNumbers = TRUE, stopwords = stopwordsCN(), wordLengths = c(2, Inf))
d.dtm <- DocumentTermMatrix(d.corpus, control = ctrl)
inspect(d.dtm[1:10, 110:112])
# implement topic models
ctm10<-CTM(d.dtm,k=10, control=list(seed=2014012692))
Terms10 <- terms(ctm10, 10)
Terms10[,1:10]
ctm20<-CTM(d.dtm,k=20, control=list(seed=2014012692))
Terms20 <- terms(ctm20, 20)
Terms20[,1:20]
(The original post included screenshots here: the RStudio output with the relevant dimensions highlighted, and the CTM help document.)
A probability distribution over 10 values has 9 free parameters: once I tell you the probability of the first 9, the probability of the last value has to be one minus the sum of those probabilities.
A 10-dimensional logistic normal distribution is equivalent to sampling a 10-dimensional vector from a Gaussian distribution and then "squashing" that vector by exponentiating it and normalizing it to sum to 1.0. There are an infinite number of 10-dimensional vectors that will exponentiate and normalize to the same 10-dimensional probability distribution -- you just have to add an arbitrary constant c to each value. That's because the mean of the Gaussian has 10 free parameters, one more than the more constrained distribution.
There are several ways to make the Gaussian "identifiable". One is to fix one of the elements of the mean vector to be 0.0. That's why you see a 9-dimensional mean and covariance matrix: the 10th value is always 0 with no variance.
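A quick numeric illustration of that identifiability point (a toy sketch, not taken from the topicmodels code):

# "Squashing" a vector into a probability distribution (softmax);
# adding any constant c to the vector gives exactly the same distribution
softmax <- function(v) exp(v) / sum(exp(v))
v <- rnorm(10)
all.equal(softmax(v), softmax(v + 3))  # TRUE: the shift by c = 3 changes nothing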
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 6 years ago.
I have carried out clustering of storm energy data using different clustering methods (kmeans, hclust, agnes, fanny) in R, but even though it is easy enough to pick a method for my work, I need a computational (and not theoretical) way to compare and evaluate the methods via their results. Do you know whether something like that exists?
Thanks in advance,
Thanks for the question; I learnt that you can compute the optimal number of clusters using the eclust function from the factoextra package.
Using the kmeans demo from here:
# Load and scale the dataset
data("USArrests")
DF <- scale(USArrests)
When the data is not scaled, the clustering results might not be reliable: [example](http://stats.stackexchange.com/questions/140711/why-does-gap-statistic-for-k-means-suggest-one-cluster-even-though-there-are-ob)
library("factoextra")
# Enhanced k-means clustering
res.km <- eclust(DF, "kmeans")
# Gap statistic plot
fviz_gap_stat(res.km$gap_stat)
Comparison of Clustering Functions:
You can use all the available methods and compute the optimal number of clusters with:
clusterFuncList = c("kmeans", "pam", "clara", "fanny", "hclust", "agnes" ,"diana")
resultList <- do.call(rbind, lapply(clusterFuncList, function(x) {
  cat("Begin clustering for function:", x, "\n")
  # For each clustering function find the optimal number of clusters; to disable plotting use graph=FALSE
  clustObj <- eclust(DF, x, graph = FALSE)
  cat("End clustering for function:", x, "\n\n\n")
  # Return the optimal number of clusters for each clustering function
  data.frame(clustFunc = x, optimalNumbClusters = clustObj$nbclust, stringsAsFactors = FALSE)
}))
# >resultList
# clustFunc optimalNumbClusters
# 1 kmeans 4
# 2 pam 4
# 3 clara 5
# 4 fanny 5
# 5 hclust 4
# 6 agnes 4
# 7 diana 4
Gap statistic, i.e. a goodness-of-fit measure:
The "gap statistic" is used as a goodness-of-fit measure for clustering algorithms; see the paper by Tibshirani, Walther and Hastie (2001).
For a fixed, user-defined number of clusters, we can compare the gap statistic of each clustering algorithm with the clusGap function from the cluster package:
numbClusters = 5
library(cluster)
clusterFuncFixedK = c("kmeans", "pam", "clara", "fanny")
gapStatList <- do.call(rbind,lapply(clusterFuncFixedK,function(x) {
cat("Begin clustering for function:",x,"\n")
set.seed(42)
#For each clustering function compute gap statistic
gapStatBoot=clusGap(DF,FUNcluster=get(x),K.max=numbClusters)
gapStatVec= round(gapStatBoot$Tab[,"gap"],3)
gapStat_at_AllClusters = paste(gapStatVec,collapse=",")
gapStat_at_chosenCluster = gapStatVec[numbClusters]
#return gap statistic for each clustering function
cat("End clustering for function:",x,"\n\n\n")
resultDF = data.frame(clustFunc = x, gapStat_at_AllClusters = gapStat_at_AllClusters,gapStat_at_chosenCluster = gapStat_at_chosenCluster, stringsAsFactors=FALSE)
}))
# >gapStatList
# clustFunc gapStat_at_AllClusters gapStat_at_chosenCluster
#1 kmeans 0.184,0.235,0.264,0.233,0.27 0.270
#2 pam 0.181,0.253,0.274,0.307,0.303 0.303
#3 clara 0.181,0.253,0.276,0.311,0.315 0.315
#4 fanny 0.181,0.23,0.313,0.351,0.478 0.478
The table above has the gap statistic of each algorithm at each number of clusters from k = 1 to 5. Column 3, gapStat_at_chosenCluster, has the gap statistic at k = 5 clusters. The higher the gap statistic, the better the partitioning; hence, at k = 5 clusters, fanny performs best relative to the other algorithms on the USArrests dataset.
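If you also want a principled choice of k from the gap statistic itself rather than a comparison at fixed k, the cluster package ships the maxSE helper implementing the Tibshirani et al. selection rules. A minimal sketch, recomputing clusGap for kmeans only (the gapStatBoot objects above are local to the loop):

# Pick the number of clusters for kmeans via the gap statistic and its SE
set.seed(42)
gap.km <- clusGap(DF, FUNcluster = kmeans, K.max = 10, B = 100)
with(as.data.frame(gap.km$Tab),
     maxSE(f = gap, SE.f = SE.sim, method = "Tibs2001SEmax"))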
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist
Closed 9 years ago.
Let's say I have a vector:
Q<-rnorm(10,mean=0,sd=20)
From this vector I would like to:
1. create 10 variables (a1...a10) that each have a correlation above .5 (i.e. between .5 and 1) with Q.
The first part can be done with:
t1 <- sapply(1:10, function(x) jitter(Q, factor = 100))
2. each of these variables (a1...a10) should have a pre-specified correlation with each other. For example some should be correlated .8 and some -.2.
Can these two things be done?
I create a correlation matrix:
cor.table <- matrix(sample(c(0.9, -0.9), 2500, prob = c(0.8, 0.2), replace = TRUE), 50, 50)
cor.table[1, ] <- 0.55
cor.table[, 1] <- 0.55
diag(cor.table) <- 1
However, when I apply the excellent solution by @SprengMeister I get the error:
Error in eigen(cor.table)$values > 0 :
invalid comparison with complex values
continued here: Eigenvalue decomposition of correlation matrix
As a pointer to a solution, use the noise function jitter in R:
set.seed(100)
t = rnorm(10,mean=0,sd=20)
t1 = jitter(t, factor = 100)
cor(t,t1)
[1] 0.8719447
To generate data with a prescribed correlation (or variance),
you can start with random data,
and rescale it using the Cholesky decomposition of the desired correlation matrix.
# Sample data
Q <- rnorm(10, mean=0, sd=20)
desired_correlations <- matrix(c(
1, .5, .6, .5,
.5, 1, .2, .8,
.6, .2, 1, .5,
.5, .8, .5, 1 ), 4, 4 )
stopifnot( eigen( desired_correlations )$values > 0 )
# Random data, with Q in the first column
n <- length(Q)
k <- ncol(desired_correlations)
x <- matrix( rnorm(n*k), nc=k )
x[,1] <- Q
# Rescale, first to make the variance equal to the identity matrix,
# then to get the desired correlation matrix.
y <- x %*% solve(chol(var(x))) %*% chol(desired_correlations)
var(y)
y[,1] <- Q # The first column was only rescaled: that does not affect the correlation
cor(y) # Desired correlation matrix
I answered a very similar question a little while ago
R: Constructing correlated variables
I am not familiar with jitter, so maybe my solution is more verbose, but it lets you determine exactly what the intercorrelations of each of your variables and Q are supposed to be.
The F matrix referenced in that answer describes the intercorrelations that you want to impose on your data.
EDIT to answer question in comment:
If I am not mistaken, you are trying to create a multivariate correlated data set, so all the variables in the set are correlated to varying degrees. I assume Q is your criterion or DV, and a1-a10 are predictors or IVs.
In the F matrix you would reflect the relationships between these variables. For example
cor_Matrix <- matrix(c(1.00, 0.90, 0.20 ,
0.90, 1.00, 0.40 ,
0.20, 0.40, 1.00),
nrow=3,ncol=3,byrow=TRUE)
describes the relationships between three variables. The first one could be Q, the second a1, and the third a2. So in this scenario, Q is correlated with a1 (.90) and a2 (.20), and a1 is correlated with a2 (.40). The rest of the matrix is redundant because it is symmetric.
In the remainder of the code, you simply create your raw, uncorrelated variables and then impose on them the loadings that you previously pulled from the F matrix.
I hope this helps. If there is a package in R that does all of that, please let me know; I built this to help me understand how multivariate data sets are actually generated.
To generalize this to 10 variables plus Q, just set the parameters that are currently 3 to 11 and create an 11x11 F matrix.
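For reference, one packaged route is MASS::mvrnorm with empirical = TRUE, which makes the sample correlation matrix match the target exactly. This is a sketch rather than a drop-in replacement, since it regenerates Q along with the predictors instead of keeping the original vector:

library(MASS)
# Draw 100 observations whose sample correlation matrix equals cor_Matrix exactly
set.seed(1)
dat <- mvrnorm(n = 100, mu = rep(0, 3), Sigma = cor_Matrix, empirical = TRUE)
colnames(dat) <- c("Q", "a1", "a2")
round(cor(dat), 2)  # matches cor_Matrix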
Closed. This question is off-topic. It is not currently accepting answers.
Closed 10 years ago.
I am searching for a function or package in R which allows one to separate two superimposed normal distributions. The data look something like this:
x<-c(3.95, 3.99, 4.0, 4.04, 4.1, 10.9, 11.5, 11.9, 11.7, 12.3)
I had good results in the past using vector generalized linear models; the VGAM package is useful for that. Its mix2normal1 function allows you to estimate the parameters of a mixture of two univariate normal distributions.
A little example:
require(VGAM)
set.seed(12345)
# Create a binormal distribution with means 10 and 20
data <- c(rnorm(100, 10, 1.5), rnorm(200, 20, 3))
# Initial parameters for minimization algorithm
# You may want to create some logic to estimate this a priori... not always easy but possible
# m, m2: Means - s, s2: SDs - w: relative weight of the first distribution (the second is 1-w)
init.params <- list(m=5, m2=8, s=1, s2=1, w=0.5)
fit <- vglm(data ~ 1, mix2normal1(equalsd=FALSE),
iphi=init.params$w, imu=init.params$m, imu2=init.params$m2,
isd1=init.params$s, isd2=init.params$s2)
# Calculated parameters
pars = as.vector(coef(fit))
w = logit(pars[1], inverse=TRUE)
m1 = pars[2]
sd1 = exp(pars[3])
m2 = pars[4]
sd2 = exp(pars[5])
# Plot a histogram of the data
hist(data, 30, col="black", freq=F)
# Superimpose the fitted distribution
x <- seq(0, 30, 0.1)
points(x, w*dnorm(x, m1, sd1)+(1-w)*dnorm(x,m2,sd2), "l", col="red", lwd=2)
This gives estimates close to the "true" parameters (10, 20, 1.5, 3):
> m1
[1] 10.49236
> m2
[1] 20.06296
> sd1
[1] 1.792519
> sd2
[1] 2.877999
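To actually separate the two distributions, the fitted parameters above can be turned into posterior membership probabilities for each observation. A small sketch building only on the objects already defined:

# Posterior probability that each observation belongs to the first component
d1 <- w * dnorm(data, m1, sd1)
d2 <- (1 - w) * dnorm(data, m2, sd2)
post1 <- d1 / (d1 + d2)
# Hard assignment: component 1 if its posterior exceeds 0.5, else component 2
table(ifelse(post1 > 0.5, "component 1", "component 2"))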
You might want to use nls, the nonlinear regression tool (or another nonlinear regressor). I'm guessing you have a vector of data representing the superimposed distributions. Then, roughly, nls(y ~ I(a*exp(-(x-meana)^2/siga) + b*exp(-(x-meanb)^2/sigb)), start = list(...initial guesses for all constants...)), where y is your empirical density and x is the domain.
I haven't thought this through in detail, so I'm not sure which convergence methods are less likely to fail.
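A minimal sketch of that idea, fitting a two-component normal density to binned data; the simulated data and start values below are made up for illustration and may need tuning on real data:

# Bin the data and fit a two-component normal density to the histogram heights
set.seed(1)
data <- c(rnorm(100, 10, 1.5), rnorm(200, 20, 3))
h <- hist(data, breaks = 30, plot = FALSE)
d <- data.frame(x = h$mids, y = h$density)
fit.nls <- nls(y ~ w * dnorm(x, m1, s1) + (1 - w) * dnorm(x, m2, s2),
               data = d,
               start = list(w = 0.5, m1 = 8, s1 = 2, m2 = 18, s2 = 2))
round(coef(fit.nls), 2)  # weight, means and sds of the two components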