Testing the variance of betadisper() output - r

I have some population genomics data where I have allele frequencies of SNPs with respect to the ancestral genome for each treatment.
I wanted to test whether beta diversity differed between treatments, so I used the vegan package and betadisper() on Euclidean distances.
After extracting all the information from the model and putting it into data frames that ggplot2 can use, I get this plot.
Although to my eye this shows higher beta diversity in mixed (circle) than static (triangle) treatments, the anova(), permutest() and TukeyHSD() methods give results where we do not reject the null hypothesis of homogeneity of variances. In addition, the p-values for most of these tests are above 0.8.
From what I can work out, these tests on a betadisper() model object look at differences in the mean distance to the centroid, which is not different across treatments.
However the spread of distance to centroid does seem to be different between the treatments.
I was wondering whether it is OK to use something like Bartlett's test or Levene's test (in the car package) to look at differences in the variance of the distances from the centroid for each group as another metric of "beta diversity" (variance across groups). Alternatively, does anyone know of methods within vegan that look at the variance of distance to centroid as well as changes in the mean distance to centroid?

Your graphics are misleading: You should use equal aspect ratio (isometric scaling) in PCoA, but the horizontal axis is stretched and vertical axis compressed in your plot. Moreover, the convex hull can be misleading as it focuses on extreme observations, but the test focuses on "typical" distances from the centroid. So your "eye" was wrong and misled by graphics. We do provide correct graphics as methods for betadisper and using these instead of self-concocted ggplot2 graphics would have saved you from this problem, or at least you could use these graphics to cross-check your own versions.
Please note that betadisper already tests "homogeneity" of variances, and a variance of variances (= variance of distances from centroids) may not be useful or easily interpreted. The pair of functions we have are adonis2 for differences of centroids and betadisper for sizes of dispersion with respect to centroids.
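As a rough sketch of the cross-check suggested above, the following uses the plot and boxplot methods that vegan provides for betadisper objects, followed by the tests mentioned in the question. The varespec data set ships with vegan; the two-group split (16 and 8 sites, with hypothetical labels) is only an illustrative grouping and stands in for the real treatments.

```r
library(vegan)

# Example community data shipped with vegan (24 sites)
data(varespec)

# Euclidean distances, as in the question
dis <- vegdist(varespec, method = "euclidean")

# Illustrative grouping only: first 16 sites vs the remaining 8 (assumption)
groups <- factor(rep(c("grazed", "ungrazed"), times = c(16, 8)))

# Distances of each site to its group centroid in PCoA space
mod <- betadisper(dis, groups)

# Built-in graphics: the PCoA plot uses an equal aspect ratio
plot(mod)
# Distances to centroid per group, which is what the tests compare
boxplot(mod)

# Tests of homogeneity of multivariate dispersions
anova(mod)
permutest(mod, permutations = 999)
```

The boxplot is usually the more direct view of what anova() and permutest() evaluate, because it shows the distances to centroid themselves rather than the hull around extreme points.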

Related

Interpretation of dispersion between groups using ordihull

I have difficulty interpreting my data with regard to dispersion and composition. I have 6 groups and used adonis2() to test the compositional difference between them. Furthermore, I used betadisper() to check dispersion per group and compared the groups with anova. Now I want to visualize this, and an elegant way seemed to be to use ordihull() in my NMDS plot.
Now my question, can I use ordihull to visualize group dispersion in an NMDS ordination? It looks like this:
Could I interpret and say that groups with largest surface area in ordihull (indicated by the coloured outlining) have the highest dispersion?

Removing outliers that are skewing data

I am looking at the relationship between agricultural intensity and functional diversity of birds.
In my GLM model I have included a number of other variables including forest, semi-natural habitat, temperature, pesticides etc.
When checking whether my variables are normally distributed, I used a QQ plot and there appear to be three outliers.
I wondered how I would remove these outliers to make my data more normally distributed?
I tried to use the outliers package but all the examples I found failed to work, or I failed to understand how they worked!
Any help would be appreciated. This is my QQ plot for my functional dispersion model and a scatter of functional dispersion x agricultural intensity.
[Figure: QQ plot]
[Figure: functional dispersion x agriculture scatter]
You could remove the observations that appear out of place. Given the number of observations, this is unlikely to change the estimates, but please make sure this is indeed the case. Also, when reporting your work, make sure you justify why you removed those points based on your domain knowledge about the variable.
You can remove the observation using
model.data.scaled <- model.data.scaled[model.data.scaled$agri > -5, ]
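One way to check that the estimates barely change is to fit the model with and without the suspect rows and compare coefficients. This is a minimal sketch on simulated data: the response name fdis, the effect size, and the three planted extreme values are assumptions for illustration, and only the agri column and the -5 cut-off come from the line above.

```r
set.seed(1)

# Simulated stand-in for the real data (hypothetical values)
n <- 200
model.data.scaled <- data.frame(agri = rnorm(n))
model.data.scaled$fdis <- 0.3 * model.data.scaled$agri + rnorm(n, sd = 0.5)

# Plant three extreme observations to mimic the outliers in the QQ plot
model.data.scaled$agri[1:3] <- c(-6, -7, -8)

# Fit with all rows
fit_all <- glm(fdis ~ agri, data = model.data.scaled)

# Fit after dropping rows below the cut-off used above
trimmed <- model.data.scaled[model.data.scaled$agri > -5, ]
fit_trimmed <- glm(fdis ~ agri, data = trimmed)

# Side-by-side coefficients: large shifts mean the outliers are influential
coef_compare <- cbind(all = coef(fit_all), trimmed = coef(fit_trimmed))
print(coef_compare)
```

If the two columns differ noticeably, the removal changes the conclusions and needs a stronger justification than the QQ plot alone.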

Simulating data using existing data and probability

I have measured multiple attributes (height, species, crown width, condition etc) for about 1500 trees in a city. Using remote sensing techniques I also have the heights for the rest of the 9000 trees in the city. I want to simulate/generate/estimate the missing attributes for these unmeasured trees by using their heights.
From the measured data I can obtain the proportion of each species in the measured population (and thus a rough probability), height distributions for each species, height-crown width relationships for the species, species-condition relationships and so on. I want to use the height data for the unmeasured trees to first estimate the species and then estimate the rest of the attributes using probability theory. So a tree with a height of, say, 25 m is more likely to be a cedar (height range 5 - 30 m) than a mulberry (height range 2 - 8 m), and more likely to be a cedar (50% of population) than an oak (same height range but 2% of population), and hence will have a crown width of 10 m and a health condition of 95% (based on the distributions for cedar trees in my measured data). But I also expect some of the other 25 m trees to be assigned oak, just less frequently than cedar, based on the proportion in the population.
Is there a way to do this using probability theory in R preferably utilising Bayesian or machine learning methods?
I'm not asking for someone to write the code for me; I am fairly experienced with R. I just want to be pointed in the right direction, i.e. a package that does this kind of thing neatly.
Thanks!
Because you want to predict a categorical variable, i.e. the species, you should consider a classification tree, a method available in the R packages rpart and randomForest. These models excel when you have a discrete number of categories and you need to slot your observations into those categories. I think those packages would work in your application. As a comparison, you can also look at multinomial regression (mnlogit, nnet, maxent), which can also predict categorical outcomes; unfortunately multinomial regression can get unwieldy with large numbers of outcomes and/or large datasets.
If you then want to predict the remaining attributes for individual trees, first fit a regression of the attribute of interest on all of your measured variables, including species, using the measured trees. Then take the species labels you predicted for the unmeasured trees and predict out-of-sample, using those labels as predictors for the unmeasured variable of interest, say crown width. That way the regression predicts the average crown width for that species (a dummy variable), plus some error, while incorporating any other information you have on that out-of-sample tree.
If you want to use a Bayesian method, you could consider a hierarchical regression to model these out-of-sample predictions. Sometimes hierarchical models do better at predicting as they tend to be fairly conservative. Consider looking at the package rstanarm for some examples.
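The two-stage idea can be sketched with rpart, which ships with standard R installations. Everything about the data here is made up for illustration: the species proportions, the height and crown-width distributions, and the column names are assumptions, not values from the question. Species are drawn from the predicted class probabilities rather than taking the most likely class, so that rarer species such as oak still appear occasionally.

```r
library(rpart)
set.seed(123)

# Simulated "measured" trees (all numbers are hypothetical)
species_levels <- c("cedar", "mulberry", "oak")
n <- 1500
sp <- sample(species_levels, n, replace = TRUE, prob = c(0.50, 0.48, 0.02))
mean_h <- c(cedar = 18, mulberry = 5, oak = 18)
sd_h <- c(cedar = 5, mulberry = 1.2, oak = 5)
measured <- data.frame(species = factor(sp, levels = species_levels))
measured$height <- pmax(rnorm(n, mean_h[sp], sd_h[sp]), 1)
measured$crown <- 0.4 * measured$height + rnorm(n, sd = 1)

# Stage 1: classification tree for species given height
fit_species <- rpart(species ~ height, data = measured, method = "class")

# Trees with height only (hypothetical)
unmeasured <- data.frame(height = c(4, 12, 25))
probs <- predict(fit_species, newdata = unmeasured, type = "prob")

# Draw a species per tree from its predicted class probabilities
drawn <- apply(probs, 1, function(p) sample(colnames(probs), 1, prob = p))
unmeasured$species <- factor(drawn, levels = species_levels)

# Stage 2: regression for crown width using height and species
fit_crown <- lm(crown ~ height + species, data = measured)
unmeasured$crown <- predict(fit_crown, newdata = unmeasured)
print(unmeasured)
```

The second stage returns the expected crown width only; adding a draw from the residual distribution would be needed if the simulated trees should show realistic scatter.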
I suggest looking into Bayesian networks with table CPDs over your random variables. This is a generative model that can handle missing data and do inference over causal relationships between variables. The Bayesian network structure can be specified by hand or learned from data by an algorithm.
R has several implementations of Bayesian Networks with bnlearn being one of them: http://www.bnlearn.com/
Please see a tutorial on how to use it here: https://www.r-bloggers.com/bayesian-network-in-r-introduction/
For each species, the distribution of the other variables (height, width, condition) is probably a fairly simple bump. You can probably model the height and width as a joint Gaussian distribution; dunno about condition. Anyway with a joint distribution for variables other than species, you can construct a mixture distribution of all those per-species bumps, with mixing weights equal to the proportion of each species in the available data. Given the height, you can find the conditional distribution of the other variables conditional on height (and it will also be a mixture distribution). Given the conditional mixture, you can sample from it as usual: pick a bump with frequency equal to its mixing weight, and then sample from the selected bump.
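A minimal base-R sketch of this mixture approach, restricted to height and crown width, might look as follows. The simulated "measured" data, the species counts, and the function names are all hypothetical. Each species gets a bivariate Gaussian; given a height, the mixing weights are updated by Bayes' rule, a species is drawn, and crown width is drawn from that species' conditional Gaussian.

```r
set.seed(7)

# Simulated "measured" data per species (all numbers are hypothetical)
sim_species <- function(name, n, mu_h, sd_h) {
  h <- rnorm(n, mu_h, sd_h)
  data.frame(species = name, height = h, crown = 0.4 * h + rnorm(n, sd = 1))
}
measured <- rbind(sim_species("cedar", 750, 18, 5),
                  sim_species("oak", 30, 18, 5),
                  sim_species("mulberry", 720, 5, 1.2))

# One "bump" per species: mixing weight, mean vector, covariance matrix
params <- lapply(split(measured[, c("height", "crown")], measured$species),
                 function(d) list(weight = nrow(d) / nrow(measured),
                                  mu = colMeans(d),
                                  sigma = cov(d)))

simulate_tree <- function(h) {
  # Conditional mixing weights: prior weight times marginal height density
  post <- sapply(params, function(p)
    p$weight * dnorm(h, p$mu[["height"]], sqrt(p$sigma["height", "height"])))
  post <- post / sum(post)

  # Pick a bump with frequency equal to its conditional weight
  s <- sample(names(params), 1, prob = post)
  p <- params[[s]]

  # Conditional Gaussian of crown width given height for that species
  slope <- p$sigma["crown", "height"] / p$sigma["height", "height"]
  cond_mean <- p$mu[["crown"]] + slope * (h - p$mu[["height"]])
  cond_var <- p$sigma["crown", "crown"] - slope * p$sigma["crown", "height"]

  data.frame(height = h, species = s,
             crown = rnorm(1, cond_mean, sqrt(cond_var)))
}

simulated <- do.call(rbind, lapply(c(4, 12, 25, 25, 25), simulate_tree))
print(simulated)
```

With these made-up parameters, most 25 m trees come out as cedar and a few as oak, in proportion to the mixing weights, while mulberry is effectively ruled out by its height distribution.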
Sounds like a good problem. Good luck and have fun.

Figure 2.5 in Elements of Statistical Learning

I am having difficulty calculating the Bayes decision boundary of Figure 2.5. The package ElemStatLearn already contains the calculated probability at each point and uses contour to draw the boundary. Can anyone tell me how to calculate the probability? Thank you very much.
In a traditional Bayes decision problem the class distributions are usually normal distributions, but this example uses two steps to generate the samples, so I have some difficulty calculating the distribution.
Thank you very much.
Section 2.3.3 of ESL (accessible online) states how the data were generated. Each class is a mixture of 10 Gaussian distributions of equal covariance, and each of the 10 means is drawn from another bivariate Gaussian, as specified in the text. To calculate the exact decision boundary of the simulation in Figure 2.5, you would need to know the particular 20 means (10 for each class) that were generated to produce the data, but those values are not provided in the text.
However, you can generate a new pair of mixture models and calculate the probability for each of the two classes (BLUE & ORANGE) that you generate. Since each of the 10 distributions in a class is equally likely, the class-conditional density p(x|BLUE) is just the average of the densities of the 10 distributions in the BLUE model.
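The calculation can be sketched in base R as follows. The 20 means are freshly generated here, so the boundary will not match the one printed in the book. The generating recipe follows Section 2.3.3 (BLUE means drawn from N((1,0), I), ORANGE means from N((0,1), I), observations from N(m_k, I/5)); equal class priors and the grid range are assumptions of this sketch.

```r
set.seed(42)
n_centers <- 10

# Draw 10 component means per class (new draws, not the book's values)
means_blue <- cbind(rnorm(n_centers, 1), rnorm(n_centers, 0))
means_orange <- cbind(rnorm(n_centers, 0), rnorm(n_centers, 1))

# Class-conditional density: average of 10 bivariate Gaussians
# with covariance (1/5) * I, evaluated at the rows of x
mixture_density <- function(x, means, var = 1 / 5) {
  d2 <- outer(x[, 1], means[, 1], "-")^2 + outer(x[, 2], means[, 2], "-")^2
  rowMeans(exp(-d2 / (2 * var)) / (2 * pi * var))
}

# Evaluation grid (range chosen by eye, an assumption)
gx <- seq(-3, 4, length.out = 100)
gy <- seq(-3, 4, length.out = 100)
grid <- as.matrix(expand.grid(x1 = gx, x2 = gy))

p_blue <- mixture_density(grid, means_blue)
p_orange <- mixture_density(grid, means_orange)

# Posterior probability of ORANGE under equal priors
post_orange <- p_orange / (p_blue + p_orange)

# The Bayes decision boundary is the 0.5 contour of the posterior
contour(gx, gy, matrix(post_orange, length(gx), length(gy)),
        levels = 0.5, drawlabels = FALSE, xlab = "x1", ylab = "x2")
```

Unequal class priors would simply multiply p_blue and p_orange by the respective prior before forming the ratio.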

estimating density in a multidimensional space with R

I have two types of individuals, say M and F, each described with six variables (forming a 6D space S). I would like to identify the regions in S where the densities of M and F differ maximally. I first tried a binomial logistic model linking F/M to the six variables, but the result of this GLM is very hard to interpret (in part due to the numerous significant interaction terms). Thus I am thinking of a "spatial" analysis where I would separately estimate the density of M and F individuals everywhere in S, then calculate the difference in densities. Eventually I would manually look for the largest difference in densities and extract the values of the 6 variables.
I found the function sm.density in the package sm, which can estimate densities in a 3D space, but I find nothing for a space with n > 3. Do you know of something that can do this in R? Alternatively, would you have a more elegant method to answer my first question (2nd sentence)?
In advance,
Thanks a lot for your help
The function kde in the package ks performs kernel density estimation for multivariate data with 1 to 6 dimensions.
The pdfCluster and np packages provide functions to perform kernel density estimation in higher dimensions.
If you prefer parametric techniques, you could look at R packages doing Gaussian mixture estimation, like mclust or mixtools.
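To make the density-difference idea concrete without depending on a particular package, here is a hand-rolled sketch in base R. It is not the ks implementation: it uses a product Gaussian kernel with a simple rule-of-thumb bandwidth per dimension, which is an assumption of this example, and the two simulated groups are hypothetical. The densities are evaluated at the observed points rather than on a full 6D grid, which would be far too large.

```r
set.seed(2024)
d <- 6
n <- 300

# Simulated groups (hypothetical): females shifted along var1
males <- matrix(rnorm(n * d), ncol = d)
females <- matrix(rnorm(n * d), ncol = d)
females[, 1] <- females[, 1] + 1.5
colnames(males) <- colnames(females) <- paste0("var", 1:d)

# Product Gaussian kernel density estimate of `train`, evaluated at rows of `eval`
kde_product <- function(train, eval) {
  n_train <- nrow(train)
  n_dim <- ncol(train)
  # Rule-of-thumb bandwidth per dimension (assumption of this sketch)
  h <- apply(train, 2, sd) * n_train^(-1 / (n_dim + 4))
  apply(eval, 1, function(pt) {
    z <- sweep(sweep(train, 2, pt, "-"), 2, h, "/")
    mean(exp(-0.5 * rowSums(z^2))) / ((2 * pi)^(n_dim / 2) * prod(h))
  })
}

# Evaluate both densities at all observed points and take the difference
pooled <- rbind(males, females)
dens_diff <- kde_product(females, pooled) - kde_product(males, pooled)

# Observed point where the two densities differ most
most_different <- pooled[which.max(abs(dens_diff)), ]
print(most_different)
```

Kernel estimates in six dimensions need a lot of data to be stable, so the location of the maximum should be treated as exploratory.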
The ability to do this with GLM models may be constrained both by the interpretability issues that you already encountered and by numerical stability issues. Furthermore, you don't describe the GLM models, so it's not possible to see whether you considered non-linearity. If you have lots of data, you might consider using 2D crossed spline terms. (These are not really density estimates.) If I were doing initial exploration with facilities in the rms/Hmisc packages in five dimensions it might look like:
library(rms)
dd <- datadist(dat)        # predictor summaries that Predict() relies on
options(datadist = "dd")
# `lrm` is logistic regression in rms; `dat`, `MF` and var1..var5 are placeholders
big.mod <- lrm(MF ~ (rcs(var1, 3) +
                     rcs(var2, 3) +
                     rcs(var3, 3) +
                     rcs(var4, 3) +
                     rcs(var5, 3))^2,   # all 2-way interactions
               data = dat,
               maxit = 50)              # passed on to lrm.fit; these fits may take longer
bplot(Predict(big.mod, var1, var2, np = 10))
That should show the simultaneous functional form of var1's and var2's contribution to the "5 dimensional" model estimates at 10 points each and at the median value of the three other variables.

Resources