Using envfit (vegan) to calculate species scores - r

I am running an NMDS and have a few questions regarding the envfit() function in the vegan package. I have read the documentation for this function and numerous posts on SO and others about vegan, envfit(), and species scores in general.
I have seen both envfit() and wascores() used to calculate species scores for ordination techniques. By default, metaMDS() uses wascores(). This uses weighted averaging, which I understand. I am having a harder time understanding envfit(). Do envfit() and wascores() yield the same results? Is wascores() preferable given that it is the default? I realize that in some situations wascores() might not be an option (i.e. negative values), as mentioned in this post: How to get 'species score' for ordination with metaMDS()?
Given that envfit() and wascores() both seem to be used for species scores, they should yield similar results, right? I am hoping that we could work through a proof of this here...
The following shows species scores determined by metaMDS() with the default wascores():
library(vegan)
data(varespec)
ord <- metaMDS(varespec)
species.scores <- as.data.frame(scores(ord, "species"))
species.scores
wascores() makes sense to me: it uses weighted averaging. There is a good explanation of weighted averaging for species scores in Analysis of Ecological Communities by McCune and Grace (2002), p. 150.
Could somebody help me break down envfit()?
species.envfit <- envfit(ord, varespec, choices = c(1,2), permutations = 999)
species.scores.envfit <- as.data.frame(scores(species.envfit, display = "vectors"))
species.scores.envfit
"The values that you see in the table are the standardised coefficients from the linear regression used to project the vectors into the ordination. These are directions for arrows of unit length." - comment from Plotted envfit vectors not matching NMDS scores
^Could somebody please show me what linear model is being run here and what standardized value is being extracted?
species.scores
species.scores.envfit
These values are very different from each other. What am I missing here?
This is my first SO post, please have mercy. I would have asked a question on some of the other relevant threads, but I am the dregs of SO and don't even have the reputation to comment.
Thanks!

Q: Do wascores() and envfit() give the same result?
No, they do not give the same result, because they are doing two quite different things. In this answer I have explained how envfit() works. wascores() takes the coordinates of the points in the NMDS space and computes the mean on each dimension, weighting observations by the abundance of the species at each point. Hence the species score returned by wascores() is a weighted centroid in the NMDS space for each species, where the weights are the abundances of the species. envfit() fits vectors that point in the direction of increasing abundance. This implies a plane over the NMDS ordination where abundance increases linearly from any point on the plane as you move parallel to the arrow, whereas the wascores() are best thought of as optima, where the abundance declines as you move away from the weighted centroid, although I think this analogy is looser than, say, with a CA ordination.
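To make the difference concrete, here is a small sketch using the varespec example from your question (the species "Cladstel" is an arbitrary choice, and metaMDS() expands the weighted averages when it computes its species scores, so the first result will not match scores(ord, "species") exactly):
library(vegan)
data(varespec)
ord <- metaMDS(varespec)
site.xy <- scores(ord, display = "sites")
## wascores(): the abundance-weighted centroid of the site coordinates
w <- varespec[, "Cladstel"]
colSums(site.xy * w) / sum(w)
wascores(site.xy, varespec)["Cladstel", ]   # same thing via vegan
## envfit(): regress the abundance on the NMDS axes and rescale the slope
## vector to unit length; that unit vector is the direction of the arrow
b <- coef(lm(varespec[, "Cladstel"] ~ site.xy[, 1] + site.xy[, 2]))[-1]
b / sqrt(sum(b^2))
If I remember the vegan internals correctly, scores() on the envfit object additionally scales these unit-length directions by the square root of each vector's r2, which is another reason the numbers look nothing like the weighted-average species scores.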
The issue about being optimal or not is only an issue if you passed in standardised data; as the answer you linked to shows, this would imply negative weights, which doesn't work. Typically one doesn't standardise species abundances. There are transformations that we apply, like converting to proportions, square root or log transformations, or normalising the data to the interval 0-1, but these wouldn't give you negative abundances, so you're less likely to run into that issue.
envfit() in an NMDS is not necessarily a good thing as we wouldn't expect abundances to vary linearly over the ordination space. The wascores() are better as they imply non-linear abundances, but they are a little hackish in NMDS. ordisurf() is a better option in general as it adds a GAM (smooth) surface instead of the plane implied by the vectors, but you can't show more than one or a few surfaces on the ordination, whereas you can add as many species WA scores or arrows as you want.
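For example, to add a smooth abundance surface for a single species to the NMDS from your question (the species is an arbitrary choice):
ordisurf(ord, varespec$Cladstel, main = "Cladstel")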
The basic issue here is the assumption that envfit() and wascores() should give the same results. There is no reason to assume that, as these are fundamentally different approaches to computing "species scores" for NMDS, and each comes with its own assumptions, advantages, and disadvantages.

Related

How to calculate NME (Normalized Mean Error) between ground-truth and predicted landmarks when some ground-truth points have no corresponding prediction?

I am trying to learn about facial landmark detection models, and I notice that many of them use NME (Normalized Mean Error) as a performance metric.
The formula is straightforward: it calculates the L2 distance between the ground-truth points and the model's predictions, then divides by a normalization factor, which varies between datasets.
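For reference, this is how I understand the metric (with N landmarks, p_i the ground-truth point, \hat{p}_i the prediction, and d the dataset-specific normalization factor, e.g. the inter-ocular distance):
NME = \frac{1}{N} \sum_{i=1}^{N} \frac{\lVert p_i - \hat{p}_i \rVert_2}{d}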
However, when applying this formula to a landmark detector that someone else developed, I have to deal with a non-trivial situation: some detectors may not be able to generate the full number of landmarks for some input images (perhaps because of NMS, problems inherent to the model, image quality, etc.). Thus some of the ground-truth points might not have a corresponding point in the prediction result.
So how should I solve this problem? Should I just add such missing points to a "failure result set", use the failure rate (FR) to measure the model, and ignore them when doing the NME calculation?
If the output of your neural network is, for example, a 10x1 vector, those are your points, like [x1, y1, x2, y2, ..., x5, y5]. This vector has a fixed length because of the number of output neurons in your model.
If you have missing points (say you only see 4 of 5 points), it is usually because some predicted points fall beyond the image width and height, or have negative coordinates like [-0.1, -0.2, 0.5, 0.7, ...]: the first two points there cannot be seen on the image, so they look missing, but they are still in the output vector and you can still calculate the NME.
In some custom neural nets this can be handled by replacing the missing values with maximum-error points.

PCoA function pcoa extract vectors; percentage of variance explained

I have a dataset consisting of 132 observations and 10 variables.
These variables are all categorical. I am trying to see how my observations cluster and how they differ based on the percentage of variance, i.e. I want to find out a) whether there are any variables which help to draw certain observation points apart from one another and b) if yes, what percentage of variance is explained by them.
I was advised to run a PCoA (Principal Coordinate Analysis) on my data. I ran it using the vegan and ape packages. This is my code after loading my csv file into R; I call it data:
library(vegan)
library(ape)
data.dis <- vegdist(data, method = "gower", na.rm = TRUE)
data.pcoa <- pcoa(data.dis)
I was then told to extract the vectors from the pcoa data and so
data.pcoa$vectors
It then returned 132 rows but 20 columns of values (from Axis 1 to Axis 20).
I was perplexed over why there were 20 columns of values when I only have 10 variables. I was under the impression that I would only get 10 columns. If any kind souls out there could help to explain a) what do the vectors actually represent and b) how do I get the percentage of variance explained by Axis 1 and 2?
Another question I had: I don't really understand the purpose of extracting the eigenvalues from data.pcoa. I saw some websites doing that after running a PCoA on their distance matrix, but there was no further explanation of it.
Gower index is non-Euclidean and you can expect more real axes than the number of variables in Euclidean ordination (PCoA). However, you said that your variables are categorical. I assume that in R lingo they are factors. If so, you should not use vegan::vegdist() which only accepts numeric data. Moreover, if the variable is defined as a factor, vegan::vegdist() refuses to compute the dissimilarities and gives an error. If you managed to use vegdist(), you did not properly define your variables as factors. If you really have factor variables, you should use some other package than vegan for Gower dissimilarity (there are many alternatives).
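If your variables really are factors, a minimal sketch of one alternative (using cluster::daisy(), which accepts factor columns; 'data' is your data frame from the question, and daisy() is just one of the many ways to compute Gower dissimilarity):
library(cluster)
library(ape)
data.dis <- daisy(data, metric = "gower")   # Gower dissimilarity, works with factors
data.pcoa <- pcoa(data.dis)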
The percentage of "variance" is a bit tricky for non-Euclidean dissimilarities, which also give some negative eigenvalues corresponding to imaginary dimensions. In that case, the sum of all positive eigenvalues (real axes) is higher than the total "variance" of the data. ape::pcoa() returns the information you asked for in the element values. The proportion of variance explained is in its element values$Relative_eig. The total "variance" is returned in element trace. All this was documented in ?pcoa, where I read it.
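Continuing your code, that would be something like:
data.pcoa$values$Relative_eig[1:2]          # proportion of "variance" on Axis 1 and 2
100 * data.pcoa$values$Relative_eig[1:2]    # as percentages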

Using a Point Process model for Prediction

I am analysing ambulance incident data. The dataset covers three years and has roughly 250000 incidents.
Preliminary analysis indicates that the incident distribution is related to population distribution.
Fitting a point process model using spatstat agrees with this, with broad agreement in a partial residual plot.
However, it is believed that the pattern diverges from this population-related trend during "social hours", that is, Friday and Saturday nights and public holidays.
I want to take subsets of the data and see how they differ from the gross picture. How do I account for the difference in intensity due to the smaller number of points inherent in a subset of the data?
Or is there a way to directly use my fitted model for the gross picture?
It is difficult to provide data as there are privacy issues, and with the size of the dataset it's hard to simulate the situation. I am not by any means a statistician, hence I am floundering a bit here. I have a copy of "Spatial Point Patterns: Methodology and Applications with R", which is very useful.
I will try with pseudocode to explain my methodology so far:
library(spatstat)
amb_pts <- ppp(the_ambulance_data$x, the_ambulance_data$y, window = the_window)   # ~250k incident points
census_pts <- ppp(census_data$x, census_data$y, window = the_window)              # ~1.3m census points
Best bandwidth for the density surface by visual inspection seemed to be bw.scott. This was used to fit a density surface for the points.
inc_density <- density(amb_pts, sigma = bw.scott)
pop_density <- density(census_pts, sigma = bw.scott)
fit0 <- ppm(amb_pts ~ 1)
fit_pop <- ppm(amb_pts ~ pop_density)
partials <- parres(fit_pop, "pop_density")
Plotting the partial residuals shows that the agreement with the linear fit is broadly acceptable, with some areas of 'wobble'..
What I am thinking of doing next:
library(dplyr)
library(tidyr)
the_ambulance_data %>%
  group_by(day_of_week, hour_of_day) %>%
  select(x_coord, y_coord) %>%
  nest() -> nested_day_hour_pts
Taking one of these list items and creating a ppp, say fri_2300hr_ppp:
fri23.den <- density(fri_2300hr_ppp, sigma = bw.scott)
fit_fri23 <- ppm(fri_2300hr_ppp ~ pop_density)
How do I then compare this ppp or density with the broader model? I can do characteristic tests such as dispersion, clustering.. Can I compare the partial residuals of fit_pop and fit_fri23?
How do I control for the effect of the number of points on the density - i.e. I have 250k points versus maybe 8000 points in the subset. I'm thinking maybe quantiles of the density surface?
Attach marks to the ambulance data representing the subsets/categories of interest (e.g. 'busy' vs 'non-busy'). For an informal or nonparametric analysis, use tools like relrisk(), or use density.splitppp() after separating the different types of points using split.ppp(). For a formal analysis (taking into account the sample sizes, etc.) you should fit several candidate models to the same data, one model having a busy/non-busy effect and another model having no such effect, then use anova.ppm() to test formally whether there is a busy/non-busy effect. See Chapter 14 of the book mentioned.
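A rough sketch of that formal comparison (the object names here are illustrative: 'amb_marked' would be your ambulance pattern with a factor mark distinguishing busy from non-busy periods, and 'pop_density' the population covariate from your code):
fit_null <- ppm(amb_marked ~ pop_density)           # no busy/non-busy effect
fit_busy <- ppm(amb_marked ~ marks * pop_density)   # population effect differs between busy and non-busy
anova(fit_null, fit_busy, test = "Chi")             # formal test for the busy/non-busy effect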

Normalizing data in R using a population raster

I have two pixel images that I created using spatstat: one is a density image created from a set of points (using density.ppp), and the other is a pixel image created from a population raster. I am wondering if there is a way to use the population raster to normalize the density image. Basically, I have a dataset of 10000+ cyber attack origin locations in the US, which I hope to investigate for spatial patterns using spatstat. However, the obvious problem is that areas of higher population have more cyber attack origins because there are more people. I would like to use the population raster to correct for that. Any ideas would be appreciated.
As the comment by @RHA says: the first solution is to simply divide by the intensity.
I don't have your data so I will make some that might seem similar. The Chorley dataset has two types of cancer cases. I will make an estimate of the intensity of lung cancer and use it as your given population density. Then a density estimate of the larynx cases serves as your estimate of the cyber attack intensity:
library(spatstat)
# Split into list of two patterns
tmp <- split(chorley)
# Generate fake population density
pop <- density(tmp$lung)
# Generate fake attack locations
attack <- tmp$larynx
# Plot the intensity of attacks relative to population
plot(density(attack)/pop)
Alternatively, you could use the inverse population density as weights in density.ppp:
plot(density(attack, weights = 1/pop[attack]))
This might be the preferred way, where you basically say that an attack occurring at e.g. a place with population density 10 only "counts" half as much as an attack occurring at a place with density 5.
I'm not sure what exactly you want to do with your analysis, but maybe you should consider fitting a simple Poisson model with ppm() and see how your data diverge from the proposed model to understand the behaviour of the attacks.
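For example, a minimal sketch with the fake data from above, treating the population density as a baseline via an offset (just one way to set this up):
# intensity proportional to population density; departures show up in the diagnostics
fit <- ppm(attack ~ offset(log(pop)))
diagnose.ppm(fit)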

DCA Vegan Small DCA1 Eigenvalues resulting in weird plot

Hello StackOverflow community,
Five weeks ago I learned to read and write R and it made me a happier being :) Stack Overflow has helped me out a hundred times or more! I have been struggling with vegan for a while now. So far I have succeeded in making beautiful NMDS plots. The next step for me is DCA, but here I run into trouble...
Let me explain:
I have an abundance dataset where the columns are different species (N=120) and the rows are transects (N=460). Column 1 with the transect codes has been deleted. Abundance is in counts (not relative or transformed). Most species are rare to very rare, and a couple of species have very high abundance (10000-30000). The total number of individuals is about 100000.
When I run the decorana function it returns this info.
decorana(veg = DCAMVA)
Detrended correspondence analysis with 26 segments.
Rescaling of axes with 4 iterations.
                  DCA1   DCA2   DCA3   DCA4
Eigenvalues     0.7121 0.4335 0.1657 0.2038
Decorana values 0.7509 0.4368 0.2202 0.1763
Axis lengths    1.7012 4.0098 2.5812 3.3408
The DCA1 species scores are, however, really small... Only one species has a DCA1 score of about 2; the rest are all around -1.4E-4, etc. This high-DCA1 species has an abundance of 1 individual... but it is not the only species with only 1 individual.
                          DCA1      DCA2      DCA3      DCA4  Totals
almaco.jack           6.44e-04  1.85e-01  1.37e-01  3.95e-02       0
Atlantic.trumpetfish  4.21e-05  5.05e-01 -6.89e-02  9.12e-02     104
banded.butterflyfish -4.62e-07  6.84e-01 -4.04e-01 -2.68e-01      32
bar.jack             -3.41e-04  6.12e-01 -2.04e-01  5.53e-01      91
barred.cardinalfish  -3.69e-04  2.94e+00 -1.41e+00  2.30e+00      15
...and so on
I can't post the plot on Stack Overflow yet, but the idea is that there is spread along the Y-axis while the X-values barely vary, resulting in a vertical line in the plot.
I guess everything is running okay (no errors are returned). I only really wonder what the reason for this clustering is. Does anybody have any clue? Is there an ecological idea behind this?
Any help is appreciated :)
Love
Erik
Looks like your data has an "outlier", a deviant site with deviant species composition. DCA has essentially selected the first axis to separate this site from everything else, and then DCA2 reflects a major pattern of variance in the remaining sites. (D)CA is known to suffer (if you want to call it that) from this problem, but it is really telling you something about your data. This likely didn't affect the NMDS at all because metaMDS() maps the rank order of the distances between samples, which means it only needs to put this sample slightly further away from any other sample than the distance between the next two most dissimilar samples.
You could just stop using (D)CA for these sorts of data and continue to use NMDS via metaMDS() in vegan. An alternative is to apply a transformation such as the Hellinger transformation and then use PCA (see Legendre & Gallagher 2001, Oecologia, for the details). This transformation can be applied via decostand(...., method = "hellinger") but it is trivial to do by hand as well...
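For example (a minimal sketch, assuming DCAMVA is the abundance matrix from your question):
library(vegan)
spp.hel <- decostand(DCAMVA, method = "hellinger")   # Hellinger transformation
pca.hel <- rda(spp.hel)                              # unconstrained rda() is just PCA
biplot(pca.hel)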
