Demonstration Code for Nested Dirichlet Process in R

My question is about how to implement the nested Dirichlet process (NDP) in R.
The NDP is suitable for clustering distributions and simultaneously clustering observations within each distribution. Rodriguez et al. (2008) provided a simulation example to demonstrate the ability of the NDP to distinguish different distributions. I am trying to learn this approach by reproducing the results for this example, but I have failed to do so because I do not understand well how the base distribution is related to the mixture components.
The simulation example used a normal inverse-gamma distribution, NIG(0, 0.01, 3, 1), as the base distribution. But the four different distributions are:
The algorithm provided in Section 4 (Rodriguez et al., 2008, p. 1135) was used to do the simulation. I have trouble understanding and executing this algorithm, especially step 5:
Can you please provide sample code to demonstrate this algorithm? Your help is highly appreciated!

I have not been able to do the coding myself, but I have found a recent paper that runs this simulation using exact inference instead of a truncation approximation. I think it might help someone else who is interested, just like me, so I am posting the link to that paper here.
What I like about this paper is that it is well written and comes with R source code, which helped me understand the methodology better.
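To illustrate the point that confused me (how the NIG base distribution generates the mixture components), here is a minimal sketch of the building block involved. This is my own illustration, not code from either paper, and the NIG parameterization is an assumption: NIG(m, lambda, a, b) is read as sigma2 ~ Inverse-Gamma(a, b) and mu | sigma2 ~ N(m, sigma2/lambda). The atoms (mu_k, sigma2_k) drawn from the base distribution are combined with truncated stick-breaking weights to form one random mixture:
set.seed(1)
K <- 20                                    # truncation level
alpha <- 1                                 # DP concentration parameter
m <- 0; lambda <- 0.01; a <- 3; b <- 1     # NIG(0, 0.01, 3, 1)
sigma2 <- 1 / rgamma(K, shape = a, rate = b)           # sigma2_k ~ Inverse-Gamma(a, b)
mu <- rnorm(K, mean = m, sd = sqrt(sigma2 / lambda))   # mu_k | sigma2_k ~ N(m, sigma2_k / lambda)
v <- rbeta(K, 1, alpha)                    # stick-breaking proportions
w <- v * cumprod(c(1, 1 - v[-K]))          # mixture weights w_k
comp <- sample.int(K, 500, replace = TRUE, prob = w)   # component labels
y <- rnorm(500, mean = mu[comp], sd = sqrt(sigma2[comp]))
hist(y, breaks = 40)                       # one random mixture drawn from the DP
In the full NDP this construction is nested: the atoms of the top-level stick-breaking process are themselves mixtures built this way, which is what allows clustering both across and within distributions.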

Related

How to create response surface using random forest model in R?

I have made a random forest model in R with six predictors and a response. The predictive model seems good enough, but we also wanted to generate a response surface for it.
library(randomForest)
set.seed(1)
# fit a random forest for Mf on all six predictors
rfalloy <- randomForest(Mf ~ ., data = al_mf, mtry = 6, importance = TRUE)
rfalloy
# predict on the training data (pass the data frame, not the response vector)
rfpred <- predict(rfalloy, newdata = al_mf)
rfpred
# explained (model) sum of squares
sse <- sum((rfpred - mean(al_mf$Mf))^2)
sse
# residual sum of squares
ssr <- sum((rfpred - al_mf$Mf)^2)
ssr
Rsquare <- 1 - (ssr / (sse + ssr))
Rsquare
importance(rfalloy)
At a general level, since you haven't provided many specifics about what you are looking for in your response surface, here are a few hopefully helpful starting points:
Have you taken a look at rsm? This documentation provides some good use cases for the package.
These in-class notes from a University of New Mexico stats lecture are full of code examples related to response surfaces. Just check out the table of contents and you'll probably find what you're looking for.
This StackOverflow post also provides an example using the rgl package.
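To make these pointers concrete, here is a minimal sketch of a grid-based response surface for a randomForest fit. The predictor names x1 and x2 are hypothetical stand-ins for two of the six predictors in al_mf, and the remaining predictors are assumed numeric:
# build a grid over two predictors
grid <- expand.grid(
  x1 = seq(min(al_mf$x1), max(al_mf$x1), length.out = 50),
  x2 = seq(min(al_mf$x2), max(al_mf$x2), length.out = 50)
)
# hold the remaining predictors fixed at their medians
others <- setdiff(names(al_mf), c("x1", "x2", "Mf"))
for (v in others) grid[[v]] <- median(al_mf[[v]])
# predicted response over the grid
grid$pred <- predict(rfalloy, newdata = grid)
z <- matrix(grid$pred, nrow = 50)
contour(unique(grid$x1), unique(grid$x2), z, xlab = "x1", ylab = "x2")
# persp(unique(grid$x1), unique(grid$x2), z) gives a 3-D version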

Graph Clustering

I've been searching for a review paper on graph clustering methods, but none of what I found satisfied me.
Please tell me what, in your opinion, is the best method for graph clustering. Sorry if my question is very general.
Thanks
With such an open question, I guess I can recommend that you try WEKA.
It has a nice set of user interfaces to let you import your dataset and then try and compare various classification and clustering algorithms on your data, without writing even one line of code.
After you have identified an algorithm that works for your problem, you can then search for a nice, fast implementation in the programming language of your choice.
EDIT: Since you mentioned the graph tag, maybe you should have a look at the Markov Cluster Algorithm; otherwise, you will have a hard time trying to represent your graph data in a format suitable for the distance-based clustering algorithms in WEKA.
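For a feel of what MCL does, here is a bare-bones sketch of its expansion/inflation loop in base R on a toy adjacency matrix. This is only illustrative; for real work use a dedicated MCL implementation.
mcl_sketch <- function(adj, inflation = 2, max_iter = 100, tol = 1e-6) {
  A <- adj + diag(nrow(adj))               # add self-loops
  M <- sweep(A, 2, colSums(A), "/")        # column-stochastic transition matrix
  for (i in seq_len(max_iter)) {
    Mexp <- M %*% M                        # expansion: flow along longer paths
    Minf <- Mexp ^ inflation               # inflation: strengthen strong flows
    Mnew <- sweep(Minf, 2, colSums(Minf), "/")
    if (max(abs(Mnew - M)) < tol) break
    M <- Mnew
  }
  apply(M, 2, which.max)                   # nodes sharing an attractor form a cluster
}
# toy graph: two triangles joined by a single bridge edge
adj <- matrix(0, 6, 6)
edges <- rbind(c(1,2), c(2,3), c(1,3), c(4,5), c(5,6), c(4,6), c(3,4))
for (k in seq_len(nrow(edges))) adj[edges[k,1], edges[k,2]] <- adj[edges[k,2], edges[k,1]] <- 1
mcl_sketch(adj)   # expect two clusters: nodes 1-3 and nodes 4-6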

Setting Contrasts for ANOVA in R

I've been attempting to perform an ANOVA in R recently on the attached data frame.
My question revolves around the setting of contrasts.
My design is a 3x5 within-subjects design.
There are 3 visual conditions under 'Circle1' and 5 audio conditions under 'Beep1'.
Does anyone have any idea how I should set the contrasts? This is something I'm unfamiliar with, as I'm making the transition from point-and-click stats in SPSS to coded analyses in R.
Thanks for your time
Data file:
Reiterating my answer from another Stack Overflow question that was flagged as similar: since you didn't provide any code, you might start by having a look at the contrast package in R. As the documentation notes:
"The purpose of the contrast package is to provide a standardized interface for testing linear combinations of parameters from common regression models. The syntax mimics the contrast. Design function from the Design library. The contrast class has been extended in this package to linear models produced using the functions lm, glm, gls, lme and geese."
There is also a nice little tutorial here by Dr. William King, who talks about factorial between-subjects ANOVA and also includes an abundance of R code. This is wider in scope than your question but would be a great place to start (just to get context).
Finally, here is another resource that you can refer to which talks about setting up orthogonal contrasts in R.
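As a minimal sketch of what setting contrasts could look like here (the data frame name df and the columns Subject and DV are hypothetical, since the data file is not included):
df$Circle1 <- factor(df$Circle1)          # 3 visual conditions
df$Beep1   <- factor(df$Beep1)            # 5 audio conditions
contrasts(df$Circle1) <- contr.sum(3)     # sum-to-zero (deviation) contrasts
contrasts(df$Beep1)   <- contr.poly(5)    # orthogonal polynomial trend contrasts
# within-subjects ANOVA via aov() with an Error() stratum
fit <- aov(DV ~ Circle1 * Beep1 + Error(Subject / (Circle1 * Beep1)), data = df)
summary(fit)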

How does r calculate the p-values in logistic regression

What type of p-values does R calculate in a binomial logistic regression, and where is this documented?
When I read the documentation for ?glm, I find no reference to the calculation of the p-values.
The p-values are calculated by the function summary.glm. See ?summary.glm for a (very brief) bit about how those are calculated.
For more information, look at the source code by typing
summary.glm
at the R command prompt. There you will find the lines of code where an object pvalue is created. Follow the code back to see how the components of the p-value calculation are (conditionally) calculated.
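For the default binomial case, the p-values reported by summary.glm are two-sided Wald tests based on the normal distribution (the dispersion is fixed at 1). A short sketch that reproduces them by hand:
# fit a binomial glm on a built-in dataset
fit <- glm(am ~ wt, data = mtcars, family = binomial)
est <- coef(fit)
se  <- sqrt(diag(vcov(fit)))      # standard errors from the estimated covariance
z   <- est / se                   # Wald z statistics
p   <- 2 * pnorm(-abs(z))         # two-sided normal tail probabilities
cbind(p, summary(fit)$coefficients[, "Pr(>|z|)"])   # the two columns match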
The authors of R wrote the help system with several principles in mind: compactness (don't write more than is needed; it's not a textbook), accuracy, and a curious and well-educated audience. It really was written for other statisticians. The "curious" part of that opening sentence was included to raise the question of why you did not also follow the various links on the ?glm page: to summary.glm, where you would have found one answer to your ambiguous question, or to anova.glm, where you would have found another possible answer. The help authors do expect that you will follow those links, read the whole page, and execute the examples. You will notice that even after you get to summary.glm there is no mention of "binary logistic regression", since they pretty much assume that you are well grounded in statistics and have a copy of McCullagh and Nelder handy, or if not, that you will go read the references.
The other principle: sometimes it is the code itself (given the open-source nature of R) that performs the documentation. Technically, glm doesn't print anything and print.glm doesn't print p-values; it is print.summary.glm or print.anova.glm that does any printing. Part of learning R is learning that results printed to the console have gone through an eval-print loop and that output can be tailored with object-class-specific functions.
These assumptions are just part of what many people see as a "steep learning curve for R" (although I would have called it a shallow curve if plotted with time/effort on the x-axis).

Random number analysis

Given a series of randomly generated data, how can I figure out how random it actually is? Is R a good tool for this, or MATLAB? What other questions can these tools answer about randomly generated data? Is there another tool better suited for this?
The DieHarder test battery by Robert G. Brown, which reimplements and extends the old DIEHARD suite by Marsaglia et al., has been wrapped into the R package RDieHarder, which you could start with.
Note that RDieHarder versions need their particular matching DieHarder releases, and we're not there yet for the most recent development version of the latter.
Edit: Also, for the subset of cryptographic tests, the NIST suite (which is included in DieHarder) should be appropriate, as that is what it was designed for.
First you need to decide what kind of randomness you're testing for. Do you have in mind a uniform distribution inside some range? That's usually what people have in mind, though you may have some other flavor of randomness such as a normal distribution.
Once you have a candidate distribution, you can test the goodness of fit to that distribution. The Kolmogorov-Smirnov test is a good general-purpose test. I believe it's called ks.test in R. But I also believe it assumes distinct values, so that could be a problem if you're sampling from such a small range of values that the same value appears more than once.
S. Lott mentioned Knuth's Seminumerical Algorithms in the comments. That book has a good introduction to the chi-squared test and the Kolmogorov-Smirnov tests for goodness of fit.
If you do suspect you have uniform random values, the DIEHARD test that Dirk Eddelbuettel mentioned is a standard test.
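A quick sketch of both tests mentioned above, applied to a sample that really is uniform:
x <- runif(1000)                      # data to test (here genuinely uniform)
ks.test(x, "punif", min = 0, max = 1) # Kolmogorov-Smirnov test against U(0, 1)
# chi-squared goodness of fit on binned counts
bins <- cut(x, breaks = seq(0, 1, by = 0.1))
chisq.test(table(bins))               # null hypothesis: equal expected counts per bin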
According to Wikipedia (Randomness):
"The central idea is that a string of bits is random if and only if it is shorter than any computer program that can produce that string (Kolmogorov randomness) — this means that random strings are those that cannot be compressed."
Therefore, given the random stream of numbers, save it to a file and compress it using your favorite tool (zip, rar, ...). The compression ratio can be interpreted as a measure of randomness. Even better, I would use it as a relative score to compare the randomness of two data series.
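A minimal sketch of this compression heuristic in base R, using memCompress (the interpretation of the ratio is informal):
compression_ratio <- function(x) {
  raw_bytes <- serialize(x, connection = NULL)   # raw representation of the data
  length(memCompress(raw_bytes, type = "gzip")) / length(raw_bytes)
}
compression_ratio(runif(1e5))      # near 1: barely compressible, "random-looking"
compression_ratio(rep(0.5, 1e5))   # much smaller: highly compressible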
I recommend reading Chapter 10 of Beautiful Testing: Testing a Random Number Generator. It's a little more approachable than most texts on the topic. Maybe, if we're nice, the author of that chapter, John Cook, might stop by and give his input.
There's, as always, a toolbox for it.
For theory, the above-mentioned reference by Knuth is useful, and to connect with Amro's response, there is work by Li and Vitányi that is relevant here.
