No correlation after phylogenetic independent contrast - r

I am testing the correlation between two physiological parameters in plants. I am a bit stuck on the interpretation of the phylogenetic independent contrast (PIC). Without considering the PIC, I got a significant correlation, but there is no correlation between the PIC. What does the absence of correlation between the PIC mean:
the correlation without PIC is the effect of phylogeny
OR,
There is no phylogenetic effect on the correlation.

Looks like you've run into the classic case when an apparent correlation between two traits disappears when you analyze the independent contrasts. This could be due to relatedness: closely related species will tend to have similar trait values and that might look like a correlation between the traits. But it is simply a byproduct of the bifurcating nature of phylogenies and statistical non-idependence of species' trait values. I would recommend to go back to the early papers (1980 and 1990) by Felsenstein, Garland, etc, and the book by Harvey and Pagel (1991) where these concepts feature prominently.
Otherwise, there are many similar threads on the r-sig-phylo mailing list (highly recommended), websites of various phylogenetic workshops (e.g., Bodega, Woods Hole, ...), and blogs (e.g., Liam Revell's phytools blog) that might be of help.

Related

Conjoint experiment: How to formally test if effect sizes for attribute levels are significantly different from each other

I am doing conjoint analysis in R, working with the Cregg-package by Thomas Leeper. I am estimating AMCEs (all good up to this point, getting nice plots, etc.)
Now, I want to formally test whether the estimated effect of one attribute level is significantly different from the estimated effect of another attribute level within the same dimension (just for the simple AMCE, no subgroups).
Anybody has an idea of how to do that? Any solution for cjoint (or STATA) would also be most welcome. Thanks!

Multivariate Classification Trees in R

I'm looking for advice on creating classification trees where each split is based on multiple variables. A bit of background: I'm helping design a vegetation classification system, and we're hoping to use a classification and regression tree algorithm to both classify new veg data and create (or at least help to create) visual keys which can be used in publications. The data I'm using is laid out as community data, with tree species as columns, and observations as rows, and the first column is a factor with classes. I'll also add that I'm very new to this type of analysis, and while I've tried to read about it as much as possible, it's quite likely that I've missed some simple but important aspects. My apologies.
Now the problem: R has excellent packages and great documentation for classification with univariate splits (e.g. rpart, partykit, C5.0). However, I would ideally like to be able to create classification trees where each split was based on multiple criteria - so instead of each split having one decision (e.g. "Percent cover of Species A > 6.67"), it would have multiple (Percent cover of Species A > 6.67 AND Percent cover of Species B < 4.2). I've had a lot of trouble finding packages that are capable of doing multivariate splits and creating trees. This answer: https://stats.stackexchange.com/questions/4356/does-rpart-use-multivariate-splits-by-default has been very useful, and I've tried all the packages suggested there for multivariate splitting. Prim does do multivariate splits, but doesn't seem to make trees; the partDSA package seems to be somewhat what I'm looking for, but it also only creates trees with one criteria per split; the optpart package also doesn't seem to be able to make classification trees. If anyone has advice on how I could go about making a classification tree based on a multivariate partitioning method, that would be super appreciated.
Also, this is my first question, and I am very open to suggestions about how to ask questions. I didn't feel that providing an example would be helpful in this case, but if necessary I easily can.
Many Thanks!

Maximal Information Coefficient vs Hierarchical Agglomerative Clustering

What is the difference between the Maximal Information Coefficient and Hierarchical Agglomerative Clustering in identifying functional and non functional dependencies.
Which of them can identify duplicates better?
This question doesn't make a lot of sense, sorry.
The MIC and HAC have close to zero in common.
The MIC is a crippled form of "correlation" with a very crude heuristic search, and plenty of promotion video and news announcements, and received some pretty harsh reviews from statisticians. You can file it in the category "if it had been submitted to an appropriate journal (rather than the quite unspecific and overrated Science which probably shouldn't publish such topics at all - or at least, get better reviewers from the subject domains. It's not the first Science article of this quality....), it would have been rejected (as-is - better expert reviewers would have demanded major changes)". See, e.g.,
Noah Simon and Robert Tibshirani, Comment on “Detecting Novel Associations in Large Data Sets” by Reshef et al., Science Dec. 16, 2011
"As one can see from the Figure, MIC has lower power than dcor, in every case except the somewhat pathological high-frequency sine wave. MIC is sometimes less powerful than Pearson correlation as well, the linear case being particularly worrisome."
And "tibs" is a highly respected author. And this is just one of many surprised that such things get accepted in such a high reputation journal. IIRC, the MIC authors even failed to compare to "ancient" alternatives such as Spearman, to modern alternatives like dCor, or to properly conduct a test of statistical power of their method.
MIC works much worse than advertised when studied with statistical scrunity:
Gorfine, M., Heller, R., & Heller, Y. (2012). Comment on "detecting novel associations in large data sets"
"under the majority of the noisy functionals and non-functional settings, the HHG and dCor tests hold very large power advantages over the MIC test, under practical sample sizes; "
As a matter of fact, MIC gives wildly inappropriate results on some trivial data sets such as a checkerboard uniform distribution ▄▀, which it considers maximally correlated (as correlated as y=x); by design. Their grid-based design is overfitted to the rather special scenario with the sine curve. It has some interesting properties, but these are IMHO captured better by earlier approaches such as Spearman and dCor).
The failure by the MIC authors to compare to Spearman is IMHO a severe omission, because their own method is also purely rank-based if I recall correctly. Spearman is Pearson-on-ranks, yet they compare only to Pearson. The favorite example of MIC (another questionable choice) is the sine wave - which after rank transformation actually is busy a zigzag curve, not a sine anymore). I consider this to be "cheating" to make Pearson look bad, by not using the rank transformation with Pearson, too. Good reviewers would have demanded such a comparison.
Now all of these complaints are essentially unrelated to HAC. HAC is not trying to define any form if "correlation", but it can be used with any distance or similarity (including correlation similarity).
HAC is something completely different: a clustering algorithm. It analyzes a larger rows, not two (!) columns.
You could even combine them: if you compute the MIC foe every pair of variables (but I'd rather use Pearson correlation, Spearman correlation, or distance correlation dCor instead), you can use HAC to cluster variables.
For finding aftual duplicates, neither is a good choice. Just sort your data, and duplicates will follow each other. (Or, if you sort columns, next to each other).

R linear regression with lm - how to deal with categorical variables with thousands of values (like city or zip code)?

I am using R and the linear regression method function lm() to build a prediction model for business sales of retail stores. Among the many dependent feature variables in my dataset, there are some categorical (factor) features that can take on thousands of different values, such as zip code (and/or city name). For example, there are over 6000 different zip codes for California alone; if I instead use city, there are over 400 cities.
I understand that lm() creates a variable for each value of a categorical feature. The problem is that when I run lm(), the explosion of variables takes a lot of memory and a really long time. How can I avoid or handle this situation with my categorical variables?
Your intuition to move from zip codes to cities is good. However, the question is, is there a further level of spatial aggregation which will capture important spatial variation, but will result in the creation of less categorical (i.e. dummy) variables? Probably. Depending on your question, simply including a dummy for rural/suburban/urban maybe all you need.
In your case geographic region is likely a proxy meant to capture variation in socio-economic data. If so, why not include the socio-economic data directly. To do this you could use your city/zip data to link to US census data.
However, if you really need/want to include cities, try estimating a fixed effect model. The within-estimator that results differences out time invariant categorical coefficients such as your city coefficients.
Even if you find a way to obtain an OLS estimate with 400 cities in R, I would strongly encourage you not do use an OLS estimator, use a Ridge or Lasso estimator. Unless your data is massive (it can't be too big since your using R), the inclusive of so many dummy variables is going to dramatically reduce the degrees of freedom, which can lead to over-fitting and generally poorly estimated coefficients and standard errors.
In a slightly more sophisticated language, when degrees of freedom are low the minimization problem you solve when you estimate the OLS is "ill-posed", consequently you should use a regularization. For example, a Ridge Regression (i.e. Tikhonov regularization), would be a good solution. Remember, however, Ridge regression is a biased estimator and therefore you should perform bias-correction.
My solutions in order of my preference:
Aggregate up to a coarser spatial area (i.e. maybe a regions instead of cities)
Fixed effect estimator.
Ridge regression.
If you don't like my suggestions, I would suggest you pose this question on cross validated. IMO your question is closer to a statistics question than a programming question.

regressions with many nested categorical covariates

I have a few hundred thousand measurements where the dependent
variable is a probability, and would like to use logistic regression.
However, the covariates I have are all categorical, and worse, are all
nested. By this I mean that if a certain measurement has "city -
Phoenix" then obviously it is certain to have "state - Arizona" and
"country - U.S." I have four such factors - the most granular has
some 20k levels, but if need be I could do without that one, I think.
I also have a few non-nested categorical covariates (only four or so,
with maybe three different levels each).
What I am most interested in
is prediction - given a new observation in some city, I would like to
know the relevant probability/dependent variable. I am not interested
as much in the related inferential machinery - standard deviations,
etc - at least as of now. I am hoping I can afford to be sloppy.
However, I would love to have that information unless it requires
methods that are more computationally expensive.
Does anyone have any advice on how to attack this? I have looked into
mixed effects, but am not sure it is what I am looking for.
I think this is more of model design question than on R specifically; as such, I'd like to address the context of the question first then the appropriate R packages.
If your dependent variable is a probability, e.g., $y\in[0,1]$, a logistic regression is not data appropriate---particularly given that you are interested in predicting probabilities out of sample. The logistic is going to be modeling the contribution of the independent variables to the probability that your dependent variable flips from a zero to a one, and since your variable is continuous and truncated you need a different specification.
I think your latter intuition about mixed effects is a good one. Since your observations are nested, i.e., US <-> AZ <-> Phoenix, a multi-level model, or in this case a hierarchical linear model, may be the best specification for your data. The best R packages for this type of modeling are multilevel and nlme, and there is an excellent introduction to both multi-level models in R and nlme available here. You may be particularly interested in the discussion of data manipulation for multi-level modeling, which begins on page 26.
I would suggest looking into penalised regressions like the elastic net. The elastic net is used in text mining where each column represents the present or absence of a single word, and there maybe hundreds of thousands of variables, an analogous problem to yours. A good place to start with R would be the glmnet package and its accompanying JSS paper: http://www.jstatsoft.org/v33/i01/.

Resources