RStudio CCA of community indices and landscape metrics

I'm currently trying to perform a CCA in RStudio with a range of environmental and biodiversity variables. However, while I encounter no coding errors, the result doesn't seem to be correct.
Now, I'm no whiz when it comes to the fundamentals of stats, if I'm being honest, so I was hoping someone could explain the issue I'm facing here.
Here is the code.
library(vegan)  # cca() is provided by the vegan package
Arthropod.cca <- cca(Biodiversity ~ D1 + MPI + SDI + SEI + AWMPFD, data = Environment)
Arthropod.cca
plot(Arthropod.cca)
summary(Arthropod.cca)
Biodiversity is the name of my community structure dataset. It includes values for Menhinick, Shannon's indices and also Hill's ratio for evenness.
D1+MPI+SDI+SEI+AWMPFD are my environmental variables looking at landscape diversity, evenness, fragmentation etc.
However, R just gives me this back.
               Inertia Proportion Rank
Total         0.008392   1.000000
Constrained   0.008392   1.000000    2
Unconstrained 0.000000   0.000000    0
Inertia is scaled Chi-square
Some constraints or conditions were aliased because they were redundant
Eigenvalues for constrained axes:
    CCA1     CCA2
0.008388 0.000004
I originally had 5 conditions, however, this has been reduced to only 2, with wildly different eigenvalues. Just overall extremely confused about this.


Uniquenesses component of exploratory factor analysis

I am applying an exploratory factor analysis to a dataset using the factanal() function in R. After applying a scree test, I found that 2 factors should be retained from 20 features.
Trying to find out what the uniqueness represents, I found the following explanation:
"A high uniqueness for a variable usually means it doesn’t fit neatly into our factors. ..... If we subtract the uniquenesses from 1, we get a quantity called the communality. The communality is the proportion of variance of the ith variable contributed by the m common factors. ......In general, we’d like to see low uniquenesses or high communalities, depending on what your statistical program returns."
I understand that if the value of uniquenesses is high, it could be better represented in a single factor. But what is a good threshold for this uniquenesses measure? All of my features show a value greater than 0.3, and most of them range from 0.3 to 0.7. Does this mean that my factor analysis doesn't work well on my data? I have tried rotation, but the results are not very different. What else should I try?
You can partition an indicator variable's variance into its...
Uniqueness (u2 = 1 - h2): The variance that is not explained by the common factors
Communality (h2): The variance that is explained by the common factors
Which values can be considered "good" depends on your context. You should look for examples in your application domain to know what you can expect. In the package psych, you can find some examples from psychology:
library(psych)
m0 <- fa(Thurstone.33,2,rotate="none",fm="mle")
m0
m0$loadings
When you run the code, you can see that the communalities are around 0.6. The absolute factor loadings of the unrotated solution vary between 0.27 and 0.85.
An absolute value of 0.4 is often used as an arbitrary cutoff for acceptable factor loadings in psychological domains.
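The split between uniqueness and communality can also be checked numerically. The following is a minimal sketch using base R's factanal() and the built-in ability.cov covariance matrix; both the data set and the choice of two factors are assumptions made for illustration, not taken from the question.

```r
# Illustrative only: ability.cov ships with base R (datasets package).
fit <- factanal(factors = 2, covmat = ability.cov)

# Variance not explained by the common factors
uniqueness <- fit$uniquenesses
# Variance explained by the common factors: sum of squared loadings per variable
communality <- rowSums(unclass(fit$loadings)^2)

# For each variable the two parts should add up to (approximately) 1
round(cbind(uniqueness, communality, total = uniqueness + communality), 3)
```

If the totals are close to 1, then 1 - uniqueness can be read directly as the communality, whichever of the two a given program reports.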

How to estimate less conservative standard errors when using post-stratified weights without full information in the survey package?

I'm encountering very large standard errors in my analysis of proportions with post-stratified data when using the survey package.
I'm working with a data set including (normalized) weights calculated via raking by another party. I don't know exactly how the strata have been defined (e.g. "ageXgender" has been used, but it's unclear which categorization has been used). Let's assume a simple random sample with a considerable amount of non-response.
Is there any way in survey to estimate the reduced standard errors due to post-stratification without exact information about the procedure? I could recalibrate the weights with rake() if I could define the strata exactly, but I don't have enough information for this.
I have tried to infer the strata by grouping all equal weights together, thinking that this would at least give me an upper bound on the reduction in standard errors. Using them, however, only led to marginally reduced standard errors, and sometimes even to increased standard errors:
# An example with the api datasets, pretending that pw are post-stratification weights of unknown origin
library(survey)
data(api)
apistrat$pw <- apistrat$pw / mean(apistrat$pw)  # normalized weights
# Include some more extreme weights to simulate my data
mins <- which(apistrat$pw == min(apistrat$pw))
maxs <- which(apistrat$pw == max(apistrat$pw))
apistrat[mins[1:5], "pw"] <- 0.1
apistrat[maxs[1:5], "pw"] <- 10
apistrat[mins[6:10], "pw"] <- 0.2
apistrat[maxs[6:10], "pw"] <- 5
# Design without any strata information
dclus1 <- svydesign(id = ~1, weights = ~pw, data = apistrat)
# "Estimate" strata from the weights: every distinct weight value becomes one stratum
apistrat$ps_est <- as.factor(apistrat$pw)
dclus_ps_est <- svydesign(id = ~1, strata = ~ps_est, weights = ~pw, data = apistrat)
svymean(~api00, dclus1)
svymean(~api00, dclus_ps_est)
# this actually increases the SE instead of reducing it
My real weights are also much more complex with 700 unique values in 1000 cases.
Is it possible to somehow approximate the reduction of standard errors due to post-stratification without knowing the real variables and categories and -especially- population values for rake? Could I use rake with some assumptions about the variables and categories used in the strata definitions but without the population totals in some way?
If your data are already raked, then you know the population totals exactly: raking makes the estimated population totals equal the true population totals for the raking variables. So, if you know the raking variables, you can estimate the population totals from the weighted data and then rake. The raking won't change the weights (because ex hypothesi these were already raked), but it will change the standard error estimates.
(The next version of the survey package will have an option in svydesign to do exactly this.)
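A minimal sketch of that idea with the api data follows. It assumes, purely for illustration, that pw are already-raked weights and that stype was the only raking variable; with real data the raking variables and their categories would have to be known or guessed.

```r
library(survey)
data(api)

# Assumption for this illustration: pw are raked weights and stype was
# the (only) raking variable.
des <- svydesign(id = ~1, weights = ~pw, data = apistrat)

# Population totals implied by the weights: sum of weights per category
pop.stype <- aggregate(pw ~ stype, data = apistrat, FUN = sum)
names(pop.stype)[names(pop.stype) == "pw"] <- "Freq"  # rake() expects a Freq column

# Rake to those totals. The weights should stay (almost) unchanged, but the
# design now records the post-stratification for variance estimation.
des.raked <- rake(des, sample.margins = list(~stype),
                  population.margins = list(pop.stype))

svymean(~api00, des)        # standard error ignoring the raking
svymean(~api00, des.raked)  # standard error accounting for it
```

With several raking variables, one margin formula and one totals table per variable would go into the two lists.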

R: Adjusting explanatory variable's distribution to known non-normal distribution

I have data for a sample of the U.S. population. The dataset for the sample has N = 10,000 records. Each row is described by a quantitative explanatory variable E, a price that affects the probability R that people return a bought item. It is necessary for the sample and population to have similar distribution of E to ensure validity of statistical models linking it to R.
There is a significant discrepancy between the frequency distributions of E in the U.S. population and in the sample (see summary below). In particular, a normal distribution does not seem to describe well the population distribution.
Value of E     Population distribution of E   Sample distribution of E
0-10           56.57%                         92.95%
10.01-20        6.90%                          1.19%
20.01-30        8.29%                          1.38%
30.01-40        5.87%                          0.85%
40.01-50        8.18%                          0.32%
50.01-60        4.63%                          0.48%
60.01-70        1.34%                          0.32%
70.01-80        1.50%                          0.08%
80.01-90        0.29%                          0.49%
90.01-100       3.72%                          1.12%
100.01-110      2.10%                          0.69%
110.01-120      0.24%                          0.00%
120.01+         0.35%                          0.13%
What are good things to do in R to make the sample's E-distribution more akin to the population's, hopefully to match it? I have tried filtering off sample data with low E values to no avail. At the same time, I am not quite sure which transformations to use since most of the common transformations attempt to fit data to a normal distribution --- which does not seem applicable here.
I myself think that transformations (possibly including weightings) of E are permissible, deletion of rows borderline acceptable, and creation of new rows forbidden --- but I would appreciate any input on what operations are usually considered permissible in contexts similar to mine.
The best way to do this would be to use prediction intervals. It is clear that most of your sample has very low values of E. This means that you are relatively confident about the predicted value of R for low values of E. However, as you move farther away from the range of your data (i.e. to very high values of E), you are much less confident about your predictions for R.
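As a rough sketch of what this looks like in R, the example below fits a linear model to simulated data and compares prediction intervals near the bulk of the data and far out in the tail. The data, the coefficients and the use of lm() (which treats R as a continuous return rate) are all assumptions made for illustration.

```r
set.seed(1)
# Simulated data, for illustration only: E is heavily concentrated at low values
E <- rexp(1000, rate = 1 / 8)
R <- 0.05 + 0.002 * E + rnorm(1000, sd = 0.05)  # hypothetical return rate

fit <- lm(R ~ E)

# Prediction intervals close to the bulk of the data and far out in the tail
new <- data.frame(E = c(5, 120))
pred <- predict(fit, newdata = new, interval = "prediction")
cbind(new, pred, width = pred[, "upr"] - pred[, "lwr"])
```

The interval at E = 120 is wider than the one at E = 5 because the point lies much farther from the mean of the observed E values.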

DCA Vegan Small DCA1 Eigenvalues resulting in weird plot

Hello StackOverflow community,
5 weeks ago I learned to write and read R and it made me a happier being :) Stack Overflow helped me out a hundred times or more! For a while I have been struggling with vegan now. So far I have succeeded in making beautiful nMDS plots. The next step for me is DCA, but here I run into trouble...
Let me explain:
I have an abundance dataset where the columns are different species (N=120) and the rows are transects (460). Column 1 with transect codes is deleted. Abundance is in N (not relative or transformed). Most species are rare to very rare and a couple of species have very high abundance (10000-30000). Total N individuals is about 100000.
When I run the decorana function it returns this info.
decorana(veg = DCAMVA)
Detrended correspondence analysis with 26 segments.
Rescaling of axes with 4 iterations.
                  DCA1   DCA2   DCA3   DCA4
Eigenvalues     0.7121 0.4335 0.1657 0.2038
Decorana values 0.7509 0.4368 0.2202 0.1763
Axis lengths    1.7012 4.0098 2.5812 3.3408
The eigenvalues are however really small... Only 1 species has a DCA1 value of 2; the rest are all around -1.4E-4, etc. This high DCA1 point has an abundance of 1 individual, but it is not the only species with just 1 individual.
                          DCA1     DCA2      DCA3      DCA4 Totals
almaco.jack           6.44e-04 1.85e-01  1.37e-01  3.95e-02      0
Atlantic.trumpetfish  4.21e-05 5.05e-01 -6.89e-02  9.12e-02    104
banded.butterflyfish -4.62e-07 6.84e-01 -4.04e-01 -2.68e-01     32
bar.jack             -3.41e-04 6.12e-01 -2.04e-01  5.53e-01     91
barred.cardinalfish  -3.69e-04 2.94e+00 -1.41e+00  2.30e+00     15
and so on
I can't post the picture on StackOverflow yet, but the idea is that there is spread along the Y-axis and almost none along the X-axis, resulting in a line in the plot.
I guess everything is running okay, as no errors are returned. I only really wonder what the reason for this clustering is... Does anybody have a clue? Is there an ecological idea behind this?
Any help is appreciated :)
Love
Erik
Looks like your data has an "outlier", a deviant site with deviant species composition. DCA has essentially selected the first axis to separate this site from everything else, and then DCA2 reflects a major pattern of variance in the remaining sites. (D)CA is known to suffer (if you want to call it that) from this problem, but it is really telling you something about your data. This likely didn't affect NMDS at all, because metaMDS() maps the rank order of the distances between samples. That means it only needs to put this sample slightly further away from any other sample than the distance between the next two most dissimilar samples.
You could just stop using (D)CA for these sorts of data and continue to use NMDS via metaMDS() in vegan. An alternative is to apply a transformation such as the Hellinger transformation and then use PCA (see Legendre & Gallagher 2001, Oecologia, for the details). This transformation can be applied via decostand(...., method = "hellinger") but it is trivial to do by hand as well...
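For reference, doing the Hellinger transformation by hand and following it with a PCA can be sketched as below. The abundance matrix is made up for illustration (including one hyper-abundant species), and the sketch assumes that no transect is completely empty.

```r
set.seed(42)
# Made-up abundance matrix: 10 transects (rows) x 6 species (columns)
abund <- matrix(rpois(60, lambda = 5), nrow = 10,
                dimnames = list(paste0("T", 1:10), paste0("sp", 1:6)))
abund[1, 1] <- 20000  # one hyper-abundant species, similar to the question

# Hellinger transformation: square root of the row-wise relative abundances
# (assumes every row has at least one individual, so no row sum is zero)
hell <- sqrt(abund / rowSums(abund))

# PCA on the transformed data
pca <- prcomp(hell, center = TRUE, scale. = FALSE)
summary(pca)
```

After the transformation the squared values in each row sum to 1, which dampens the influence of the very abundant species.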

How to replicate Stata "factor" command in R

I'm trying to replicate some Stata results in R and am having a lot of trouble. Specifically, I want to recover the same eigenvalues as Stata does in exploratory factor analysis. To provide a specific example, the factor help in Stata uses bg2 data (something about physician costs) and gives you the following results:
webuse bg2
factor bg2cost1-bg2cost6
(obs=568)
Factor analysis/correlation                      Number of obs    =      568
    Method: principal factors                    Retained factors =        3
    Rotation: (unrotated)                        Number of params =       15
    --------------------------------------------------------------------------
         Factor  |   Eigenvalue   Difference        Proportion   Cumulative
    -------------+------------------------------------------------------------
        Factor1  |      0.85389      0.31282            1.0310       1.0310
        Factor2  |      0.54107      0.51786            0.6533       1.6844
        Factor3  |      0.02321      0.17288            0.0280       1.7124
        Factor4  |     -0.14967      0.03951           -0.1807       1.5317
        Factor5  |     -0.18918      0.06197           -0.2284       1.3033
        Factor6  |     -0.25115            .           -0.3033       1.0000
    --------------------------------------------------------------------------
    LR test: independent vs. saturated:  chi2(15) =  269.07 Prob>chi2 = 0.0000
I'm interested in the eigenvalues in the first column of the table. When I use the same data in R, I get the following results:
library(foreign)  # provides read.dta()
bg2 <- read.dta("bg2.dta")
eigen(cor(bg2))
$values
[1] 1.7110112 1.4036760 1.0600963 0.8609456 0.7164879 0.6642889 0.5834942
As you can see, these values are quite different from Stata's results. It is likely that the two programs are using different means of calculating the eigenvalues, but I've tried a wide variety of different methods of extracting the eigenvalues, including most (if not all) of the options in R commands fa, factanal, principal, and maybe some other R commands. I simply cannot extract the same eigenvalues as Stata. I've also read through Stata's manual to try and figure out exactly what method Stata uses, but couldn't figure it out with enough specificity.
I'd love any help! Please let me know if you need any additional information to answer the question.
I would advise against carrying out a factor analysis on all the variables in the bg2 data as one of the variables is clinid, which is an arbitrary identifier 1..568 and carries no information, except by accident.
Sensibly or not, you are not using the same data, as you worked on the 6 cost variables in Stata and those PLUS the identifier in R.
Another way to notice that would be to spot that you got 6 eigenvalues in one case and 7 in the other.
Nevertheless the important principle is that eigen(cor(bg2)) is just going to give you the eigenvalues from a principal component analysis based on the correlation matrix. So you can verify that pca in Stata would match what you report from R.
So far, so clear.
But your larger question remains. I don't know how to mimic Stata's (default) factor analysis in R. You may need a factor analysis expert, if any hang around here.
In short, PCA is not equal to principal axis method factor analysis.
Different methods of calculating eigenvalues are not the issue here. I'd bet that given the same matrix Stata and R match up well in reporting eigenvalues. The point is that different techniques mean different eigenvalues in principle.
P.S. I am not an R person, but I think what you call R commands are strictly R functions. In turn I am open to correction on that small point.
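To make the PCA versus principal-factor distinction concrete, here is a hedged sketch. It uses one common formulation of the principal-factor method, in which the diagonal of the correlation matrix is replaced by squared multiple correlations before the eigendecomposition. The mtcars columns stand in for the bg2 cost variables, and whether this reproduces Stata's numbers for bg2cost1-bg2cost6 has not been checked here.

```r
# Illustrative data: six numeric columns from the built-in mtcars data set
# (not the bg2 data from the question)
X <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]
R <- cor(X)

# PCA: eigenvalues of the full correlation matrix (1s on the diagonal)
pca.values <- eigen(R)$values

# Principal-factor sketch: put squared multiple correlations (SMC) on the
# diagonal as communality estimates, then decompose the reduced matrix
smc <- 1 - 1 / diag(solve(R))
R.reduced <- R
diag(R.reduced) <- smc
pf.values <- eigen(R.reduced)$values

round(rbind(PCA = pca.values, principal.factor = pf.values), 4)
```

The PCA eigenvalues sum to the number of variables, whereas the reduced-matrix eigenvalues sum to the total of the communality estimates and can be negative, which is the pattern seen in the Stata table above.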
