I am unsure which test to use in R. Here is an oversimplified version of the sampling procedure, with easy-looking numbers:
We have 20 patches of the same size in a field.
Inside these patches, we look for 50 different species (10 species of grass, and 40 species of insect).
Every time we find a species of grass, we record its coverage on a rough logarithmic scale from 1 to 4.
Every time we find a species of insect, we count them and record their abundance on a rough logarithmic scale from 1 to 4.
So my data sort of looks like this:
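(The example table itself is not shown here, so below is a hypothetical sketch in R of the layout described above, with invented patch/species names and values: one row per patch, one column per species, and the 1-4 log-scale score in each cell, with 0 assumed to mean the species was not found in that patch.)
# Hypothetical reconstruction of the data layout (names and values invented):
set.seed(1)
patches <- paste0("patch", 1:20)
species <- c(paste0("grass", 1:10), paste0("insect", 1:40))
dat <- as.data.frame(
  matrix(sample(0:4, 20 * 50, replace = TRUE, prob = c(0.5, 0.2, 0.15, 0.1, 0.05)),
         nrow = 20, dimnames = list(patches, species))
)
head(dat[, 1:5])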
My problem is, how do I test which species are significantly associated? How do I detect clusters? Multivariate analysis? Half weight index? Bootstrap?
I'm not exactly gifted when it comes to statistics, so any help would be greatly appreciated!
My response variable is the proportion of range exposed to extreme events for terrestrial mammal species in the future. More precisely, it is the difference in proportion of range exposed (DPRE) between the historical period and future greenhouse gas emission scenarios (a measure of the increase/decrease in the percentage of range exposed). This means that my response variable goes from -1 to 1 (where +1 implies that the range will experience a +100% increase in the proportion of exposure: from 0% in the historical period to 100% in the future scenario).
As said, I am analyzing these differences for all terrestrial mammals (5311 species), across different scenarios and for two time periods: near future (means over 2021-2040) and far future (means over 2081-2100).
So, my explanatory variables are:
3 scenarios of greenhouse gas emissions (Representative Concentration Pathways: RCP2.6, RCP4.5 and RCP8.5);
Time Periods (Near Future and Far Future): NF and FF;
Species: 5311 levels.
I am not much of an expert in statistics, so I'm not sure which of the two suggestions I received to go with:
A Friedman test with Species as blocks (but in which I should somehow build a nested model, with RCPs as groups nested within TimePeriods, or a sort of two-way Friedman with RCP and TimePeriod as the two factors).
Linear mixed models with RCP*TimePeriod as fixed effects and (TimePeriod | Species) as random effects (a rough lme4 sketch of this option follows below).
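A minimal sketch of what suggestion 2 might look like with lme4 (the data frame dat and the column names DPRE, RCP, TimePeriod and Species are assumptions based on the description above):
library(lme4)

# Assumed long format: one row per Species x RCP x TimePeriod combination,
# with the response DPRE in [-1, 1] (all names are assumptions).
m <- lmer(DPRE ~ RCP * TimePeriod + (TimePeriod | Species), data = dat)
summary(m)

# Random slopes for TimePeriod within Species may fail to converge with only
# two time points per species; a simpler fallback is a random intercept:
# m2 <- lmer(DPRE ~ RCP * TimePeriod + (1 | Species), data = dat)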
I ran t-tests, but the distributions turned out to be non-normal, which is why I was advised to use Friedman instead of ANOVA; I also ran pairwise Wilcoxon rank-sum tests, and in that case I found significant differences between NF and FF for all RCPs.
I should say that I ran 3 Wilcoxon tests, one for each RCP, so maybe a third option would be to create 3 different models, one for each RCP, but this would also move away from the standard "repeated measures" analysis of the Friedman test.
One last consideration: I have to run another model, where the response variable is the difference of proportion of subrange exposed. The other explanatory variables are maintained, but in this case the analysis is not global: it takes into consideration the differences that could be present across 14 IUCN biomes. So every analysis is made across RCPs, for NF and FF, and for all biomes. Should I create and run 14 (biomes) x 3 (RCPs) x 2 (time periods) = 84 models in this case? Or a sort of doubly nested model (time periods and biomes)?
If necessary I can provide the large dataframe.
I have the following code, which basically tries to predict Species from the iris data using randomForest. What I'm really interested in is finding the best features (variables) that explain the species classification. I found that the package randomForestExplainer serves this purpose best.
library(randomForest)
library(randomForestExplainer)

# Fit the forest, keeping local importance measures for later inspection
forest <- randomForest::randomForest(Species ~ ., data = iris, localImp = TRUE)

# Compute the per-variable importance measures and plot them against each other
importance_frame <- randomForestExplainer::measure_importance(forest)
randomForestExplainer::plot_multi_way_importance(importance_frame, size_measure = "no_of_nodes")
The code produces this plot:
Based on the plot, the key measures explaining why Petal.Length and Petal.Width are the best predictors are these (the descriptions are based on the vignette):
mean_min_depth – mean minimal depth calculated in one of three ways specified by the parameter mean_sample,
times_a_root – total number of trees in which Xj is used for splitting the root node (i.e., the whole sample is divided into two based on the value of Xj),
no_of_nodes – total number of nodes that use Xj for splitting (it is usually equal to no_of_trees if trees are shallow),
It's not entirely clear to me why a high times_a_root and a high no_of_nodes are better, and why a low mean_min_depth is better. What is the intuitive explanation for that?
The vignette information doesn't help.
You would like a statistical model or measure to be a balance between "power" and "parsimony". The randomForest is designed internally to do penalization as its statistical strategy for achieving parsimony. Furthermore, the number of variables selected in any given subsample will be less than the total number of predictors. This allows model building when the number of predictors exceeds the number of cases (rows) in the dataset. Early splitting or classification rules can be applied relatively easily, but subsequent splits find it increasingly difficult to meet the criteria of validity. "Power" is the ability to correctly classify items that were not in the subsample, for which a proxy, the so-called OOB or "out-of-bag" items, is used. The randomForest strategy is to do this many times in order to build up a representative set of rules that classify items, under the assumption that the out-of-bag samples will be a fair representation of the "universe" from which the whole dataset arose.
The times_a_root statistic falls into the category of measuring the "relative power" of a variable compared to its "competitors". It measures the number of times a variable is "at the top" of a decision tree, i.e., how likely it is to be chosen first in the process of selecting split criteria. The no_of_nodes statistic measures the number of times the variable is chosen at all as a splitting criterion among all of the subsampled trees.
From:
?randomForest # to find the names of the object leaves
forest$ntree
[1] 500
... we get a denominator for assessing the meaning of the values of roughly 200 on the y-axis of the plot. About 2/5ths of the sampled trees had Petal.Length as the top split criterion, while another 2/5ths had Petal.Width as the top variable selected as the most important one. About 75 of the 500 had Sepal.Length, while only about 8 or 9 had Sepal.Width (note that it's a log scale). In the case of the iris dataset, the subsamples would have ignored at least one of the variables in each subsample, so the maximum possible value of times_a_root would have been less than 500. Scores of around 200 are pretty good in this situation, and we can see that both of these variables have comparable explanatory ability.
The no_of_nodes statistic totals up the number of nodes across all trees in which that variable was used for splitting, remembering that the number of nodes would be constrained by the penalization rules.
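To put those plot values next to their denominator, the raw importance table can be inspected directly; a quick sketch, using the objects created above and the column names discussed in the question (variable, times_a_root, no_of_nodes, mean_min_depth):
# Raw numbers behind the plot ('forest' and 'importance_frame' from the code above)
forest$ntree                       # total number of trees (500 by default)
importance_frame[, c("variable", "times_a_root", "no_of_nodes", "mean_min_depth")]

# Fraction of trees in which each variable provided the root split:
importance_frame$times_a_root / forest$ntree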
I'm working through a project that is due imminently and I have almost zero idea how to use R, so I was wondering if someone could lay out, step by step, how to bootstrap my data. I have data comparing the diets of 2 carnivore species; each species feeds on 16 different prey items (almost all the same). I want to test whether these observed values (the frequency of each prey item) are significant versus randomly generated values (1000 bootstrap iterations). I have very little idea of how to do it.
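One possible starting point (a sketch only, not necessarily the intended analysis): treat the 2 x 16 table of prey counts as a contingency table and obtain a Monte Carlo p-value for the observed association from 1000 randomly generated tables. The object diet_counts and its values below are invented for illustration.
# Hypothetical 2 x 16 table of prey counts: rows = carnivore species,
# columns = prey items (values invented for illustration).
set.seed(42)
diet_counts <- matrix(rpois(32, lambda = 10), nrow = 2,
                      dimnames = list(c("carnivoreA", "carnivoreB"),
                                      paste0("prey", 1:16)))

# Compare the observed table against 1000 randomly generated tables instead of
# relying on the chi-squared approximation:
chisq.test(diet_counts, simulate.p.value = TRUE, B = 1000)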
Hello StackOverflow community,
5 weeks ago I learned to read and write R and it made me a happier being :) Stack Overflow has helped me out a hundred times or more! I have been struggling with vegan for a while now. So far I have succeeded in making beautiful nMDS plots. The next step for me is DCA, but here I run into trouble...
Let me explain:
I have an abundance dataset where the columns are different species (N = 120) and the rows are transects (N = 460). Column 1 with the transect codes has been deleted. Abundance is in counts of individuals (not relative or transformed). Most species are rare to very rare, and a couple of species have very high abundance (10,000-30,000). The total number of individuals is about 100,000.
When I run the decorana function it returns this output:
decorana(veg = DCAMVA)
Detrended correspondence analysis with 26 segments.
Rescaling of axes with 4 iterations.
                  DCA1   DCA2   DCA3   DCA4
Eigenvalues     0.7121 0.4335 0.1657 0.2038
Decorana values 0.7509 0.4368 0.2202 0.1763
Axis lengths    1.7012 4.0098 2.5812 3.3408
The DCA1 species scores, however, are really small... only 1 species has a DCA1 value of about 2; the rest are all on the order of -1.4E-4, etc. That high-scoring DCA1 species has an abundance of 1 individual, but it is not the only species with only 1 individual.
                          DCA1      DCA2      DCA3      DCA4  Totals
almaco.jack           6.44e-04  1.85e-01  1.37e-01  3.95e-02       0
Atlantic.trumpetfish  4.21e-05  5.05e-01 -6.89e-02  9.12e-02     104
banded.butterflyfish -4.62e-07  6.84e-01 -4.04e-01 -2.68e-01      32
bar.jack             -3.41e-04  6.12e-01 -2.04e-01  5.53e-01      91
barred.cardinalfish  -3.69e-04  2.94e+00 -1.41e+00  2.30e+00      15
and so on
I can't post the picture on StackOverflow yet, but the idea is that there is spread along the Y-axis while there is almost none along the X-axis, resulting in a vertical line in the plot.
I guess everything is running okay, since no errors are returned. I only really wonder what the reason for this clustering is... Does anybody have any clue? Is there an ecological idea behind this?
Any help is appreciated :)
Love
Erik
Looks like your data have an "outlier": a deviant site with a deviant species composition. DCA has essentially selected the first axis to separate this site from everything else, and DCA2 then reflects a major pattern of variance in the remaining sites. (D)CA is known to suffer (if you want to call it that) from this problem, but it is really telling you something about your data. This likely didn't affect NMDS at all, because metaMDS() maps the rank order of the distances between samples, which means it only needs to put this sample slightly further away from any other sample than the distance between the next two most dissimilar samples.
You could just stop using (D)CA for these sorts of data and continue to use NMDS via metaMDS() in vegan. An alternative is to apply a transformation such as the Hellinger transformation and then use PCA (see Legendre & Gallagher 2001, Oecologia, for the details). This transformation can be applied via decostand(..., method = "hellinger"), but it is trivial to do by hand as well; a sketch follows.
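A rough sketch of that alternative in vegan, assuming the abundance matrix is the DCAMVA object from the decorana() call above:
library(vegan)

# Hellinger transformation: square root of the row-wise relative abundances
DCAMVA_hel <- decostand(DCAMVA, method = "hellinger")

# PCA on the transformed data; in vegan, rda() without constraints is a PCA
pca_hel <- rda(DCAMVA_hel)
plot(pca_hel)

# The same transformation done by hand, for comparison:
# DCAMVA_hel2 <- sqrt(DCAMVA / rowSums(DCAMVA))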
I want to do clustering of my data (kmeans or hclust) in R. My data are ordinal: a Likert scale measuring the causes of cost escalation, with 41 causes ("variables") scaled from 1 to 5, where 1 is no effect and 5 is a major effect, and about 160 observations (respondents who rated the causes). Any help on how to cluster the 41 causes based on the observations would be welcome. Do I have to convert the scale to percentages or z-scores before clustering, or is there anything else that would help? I really need your help! Here is the data to play with: https://docs.google.com/spreadsheet/ccc?key=0AlrR2eXjV8nXdGtLdlYzVk01cE96Rzg2NzRpbEZjUFE&usp=sharing
I want to cluster the variables (the columns) in terms of similarity of occurrence across observations. I followed the code at statmethods.net/advstats/cluster.html, but I couldn't cluster the variables (the columns) that way; I also followed the work at mattpeeples.net/kmeans.html#help, but I don't know why he converts the data to percentages and then Z-score standardizes them.
It isn't clear to me whether you want to cluster the rows (the observations) in terms of similarity in the variables, or cluster the variables (the columns) in terms of similarity of occurrence across observations.
Anyway, see package cluster. This is a recommended package that comes with all R installations.
Read ?daisy for details of what is done with ordinal data. The resulting dissimilarities can be used in functions such as agnes (for hierarchical clustering) or pam (for partitioning around medoids, a more robust version of k-means).
By default, these will cluster the rows/observations. Simply transpose the data object using t() if you want to cluster the columns (variables), although that may well mess up the data depending on how you have stored them; see the sketch below.
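A minimal sketch of that workflow, assuming the Likert responses sit in a data frame called ratings with 160 rows (respondents) and 41 columns (causes):
library(cluster)

# Treat the columns as ordered factors so daisy() uses an ordinal metric
ratings_ord <- as.data.frame(lapply(ratings, ordered, levels = 1:5))
d_rows <- daisy(ratings_ord)              # dissimilarities between respondents

# To cluster the 41 causes (columns) instead, transpose first; note that after
# t() the values are plain numbers again, so they are treated as interval-scaled
d_causes <- daisy(as.data.frame(t(ratings)), metric = "gower")

agnes_fit <- agnes(d_causes)              # hierarchical clustering of the causes
plot(agnes_fit)                           # banner and dendrogram

pam_fit <- pam(d_causes, k = 4)           # partitioning around medoids; k = 4 is arbitrary
pam_fit$clustering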
Converting the data to percentages is a form of normalization, so that all the variables are in the range 0-1. If the data are not normalized, you run the risk of bias towards dimensions with large values.
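For completeness, a small sketch of the two rescalings mentioned above (min-max "percentage" normalization versus z-score standardization), using a hypothetical numeric data frame x:
# Min-max normalization: every variable rescaled to the range [0, 1]
x_minmax <- as.data.frame(lapply(x, function(v) (v - min(v)) / (max(v) - min(v))))

# Z-score standardization: every variable centred to mean 0 and sd 1
x_z <- as.data.frame(scale(x))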