After using the Swin Transformer network to predict depth, the three threshold-accuracy values were 0.825, 0.992, and 0.998 respectively. The first value is noticeably low; under normal circumstances it should be around 0.900. What could be the reason for this?
I don't understand how summarySE() in the Rmisc package calculates the confidence intervals (ci) of my data. The values do not seem to be correct.
For example, after running summarySE(data = df, measurevar = "numbers", groupvars = "conditions", conf.interval = 0.95), the output shows:
conditions N numbers sd se ci
1 constructionA 10 6.025 0.3987829 0.1261062 0.2852721
2 constructionB 10 1.925 0.3545341 0.1121135 0.2536184
However, the confidence interval of constructionA should be 6.025 ± 1.96 x (0.3987829)/√10, which should be 6.025 ± 0.2471682. I don't understand where the value of 0.2852721 comes from after applying summarySE... Shouldn't it be 0.2471682?
Could anyone tell me what's wrong here?
Thank you!
A common construction of a confidence interval is
(statistic) +/- c*(standard error of statistic)
where c is the critical value. c = 1.96 is (approximately) the critical value for a normally distributed z-statistic and a 95% confidence interval, but that is not part of the definition of a CI; it is just the critical value you get if you assume your statistic is normally distributed.
However, most calculations of confidence intervals, summarySE() included, use the t-distribution rather than the normal distribution to calculate critical values, as they produce more accurate results than the normal when sample sizes are small (and nearly identical results when they are large).
Here, your sample size is only N = 10, so the difference between the normal-distribution 1.96 and the critical value from the t-distribution is noticeable. The 2.5th percentile of the t distribution with 10 - 1 = 9 degrees of freedom is qt(.025, 9) = -2.262157, so c = 2.262157 for a two-sided 95% confidence interval.
0.1261062*2.262157 = 0.285272, and this is where the confidence interval column comes from.
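To see the arithmetic end to end, here is a minimal R sketch reproducing the ci column, using the sd and N from the summarySE() output above:
se <- 0.3987829 / sqrt(10)       # standard error = sd / sqrt(N)
tcrit <- qt(0.975, df = 10 - 1)  # two-sided 95% critical value, 9 df
se * tcrit                       # approximately 0.2852721, the reported ci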
I'm trying to estimate the probability that the mean of 3 observations from a population is under a certain value.
Let's say I want to know the probability that the mean of 3 people's heights is under 1.8 m.
Population = c(1.7, 1.9, 1.6, 1.76, 1.8, 1.72, 1.99, 2, 1.66, 1.89)
If I randomly pick 3 observations (x_i, x_j, x_k), what's the probability that the mean of these 3 observations is under 1.8 m?
Thanks in advance.
Since the distribution of sums of variables is given by convolutions, you could approximate those convolutions with FFTs, using sampling windows small enough relative to the gaps between your population values.
If k is large, you can use the central limit theorem to approximate:
sqrt(k) * (sum_k(x_k)/k - population_mean) / population_stdev   as   Normal(0, 1),
which makes it very easy to evaluate:
P( sum_k(x_k)/k < Val ) = P( sqrt(k) * (sum_k(x_k)/k - population_mean)/population_stdev < sqrt(k) * (Val - population_mean)/population_stdev )
= Phi( sqrt(k) * (Val - population_mean)/population_stdev ), where Phi is the standard normal CDF.
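As a concrete check, here is a short R sketch of the CLT approximation above, together with an exact enumeration of all choose(10, 3) samples; the enumeration is my own addition and is feasible only because the population is tiny:
Population <- c(1.7, 1.9, 1.6, 1.76, 1.8, 1.72, 1.99, 2, 1.66, 1.89)
k <- 3
Val <- 1.8
mu <- mean(Population)
sigma <- sqrt(mean((Population - mu)^2))  # population (not sample) sd
pnorm(sqrt(k) * (Val - mu) / sigma)       # CLT approximation of P(mean < Val)
combos <- combn(Population, k)            # all 120 samples of 3 distinct values
mean(colMeans(combos) < Val)              # exact probability without replacement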
In the PST package, we use the value C as a cut-off for the information gain function used to prune the tree. The C value, for an alpha of 0.05, is calculated as follows:
C95 <- qchisq(0.95, 1) / 2
What does it mean that the C value is based on an alpha of 0.05? Does it mean we need to be at least 95% certain that an additional node adds more information compared to previous nodes, in order for it to be retained by the pruning algorithm?
Your question concerns the use of gain="G2" in the prune function and is about the choice of the threshold C for this gain function.
The G2 gain function used to check whether a branch can be pruned is such that 2*G2 is the likelihood ratio test statistic comparing the likelihood of the tree before and after pruning the branch. Under the assumption that the tested branch adds no information, 2*G2 has a chi-squared distribution. So the branch is pruned when the difference is not statistically significant, i.e. as long as the G2 value does not exceed the threshold C for the given significance level.
The alpha is the usual significance level used in statistical tests, typically 1% or 5%. Choosing alpha = 0.05 means that there is a 5% chance of wrongly NOT pruning a branch, because of the randomness of the sample.
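In code, the pruning rule described above amounts to something like this minimal R sketch (the G2 value is hypothetical, not output from an actual PST fit):
alpha <- 0.05
C <- qchisq(1 - alpha, df = 1) / 2  # the C95 threshold from the question
G2 <- 1.5                           # hypothetical gain for a candidate branch
if (G2 <= C) "prune the branch" else "keep the branch"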
My data is in the following format and includes a particular statistic:
site LRStat
1 3.580728
2 2.978038
3 5.058644
4 3.699278
5 4.349046
This is just a sample of the data.
I then obtained the null LR distribution as well, by permuting random pairs of data. I used this to plot a histogram with frequency on the y-axis and the LR statistic on the x-axis. How can I determine the critical p-value cut-off points based on the null distribution (as shown in the figure below)?
You now have a sampling distribution of LR values. The quantile function in R will give you an estimate of whatever "critical value" you prefer. If, for instance, you decided you wanted the conventional 0.05 "p-value" you could take your dataframe, named LR_df for illustration, and issue this command:
quantile( LR_df[ , 'LRStat'] , 0.95)
If you wanted all of those "probabilities" on the figure, you would use a vector of probabilities complementary to unity (1 minus each p-value). The following code gives the LRStat values above which the corresponding proportion (0.1, 0.05, 0.01, ...) of the sample lies.
quantile( LR_df[ , 'LRStat'] , c(0.9, 0.95, 0.99, 0.999, 0.9999) )
P-values are just tail probabilities from the sampling distribution of a test statistic under a null hypothesis. Your null hypothesis in this case is that the LRStats are uniformly distributed. (I know it sounds strange to put it that way, but if you want to argue with the statisticians, get a copy of http://amstat.tandfonline.com/doi/pdf/10.1198/000313008X332421 .) The choice of p-value cutoff depends on the scientific or business setting. If you were assessing an investment opportunity, the cutoff might be 0.15, but if you are trying to establish new scientific knowledge, I think it should be smaller (a more stringent test). The field of molecular genetics has a lot of junk (i.e. results that fail to reproduce) in its literature because researchers were not strict enough in their statistical methods.
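If you also want an empirical p-value for each observed statistic, rather than just cut-offs, here is a minimal sketch; null_LR stands in for your vector of permuted LR values and is simulated here only so the code runs:
null_LR <- rchisq(10000, df = 1)  # placeholder for your permutation null values
LR_df <- data.frame(site = 1:5,
                    LRStat = c(3.580728, 2.978038, 5.058644, 3.699278, 4.349046))
LR_df$p <- sapply(LR_df$LRStat, function(x) mean(null_LR >= x))  # proportion of null values at least as extreme
LR_df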
I have performed a canonical correspondence analysis in R using the vegan package, but I find the output very difficult to understand. The triplot is understandable, but all the numbers I get from summary(cca) are confusing to me (I've just started to learn about ordination techniques).
I would like to know how much of the variance in Y is explained by X (in this case, the environmental variables) and which of the independent variables are important in this model.
My output looks like this:
Partitioning of mean squared contingency coefficient:
Inertia Proportion
Total 4.151 1.0000
Constrained 1.705 0.4109
Unconstrained 2.445 0.5891
Eigenvalues, and their contribution to the mean squared contingency coefficient
Importance of components:
CCA1 CCA2 CCA3 CCA4 CCA5 CCA6 CCA7
Eigenvalue 0.6587 0.4680 0.34881 0.17690 0.03021 0.02257 0.0002014
Proportion Explained 0.1587 0.1127 0.08404 0.04262 0.00728 0.00544 0.0000500
Cumulative Proportion 0.1587 0.2714 0.35548 0.39810 0.40538 0.41081 0.4108600
CA1 CA2 CA3 CA4 CA5 CA6 CA7
Eigenvalue 0.7434 0.6008 0.36668 0.33403 0.28447 0.09554 0.02041
Proportion Explained 0.1791 0.1447 0.08834 0.08047 0.06853 0.02302 0.00492
Cumulative Proportion 0.5900 0.7347 0.82306 0.90353 0.97206 0.99508 1.00000
Accumulated constrained eigenvalues
Importance of components:
CCA1 CCA2 CCA3 CCA4 CCA5 CCA6 CCA7
Eigenvalue 0.6587 0.4680 0.3488 0.1769 0.03021 0.02257 0.0002014
Proportion Explained 0.3863 0.2744 0.2045 0.1037 0.01772 0.01323 0.0001200
Cumulative Proportion 0.3863 0.6607 0.8652 0.9689 0.98665 0.99988 1.0000000
Scaling 2 for species and site scores
* Species are scaled proportional to eigenvalues
* Sites are unscaled: weighted dispersion equal on all dimensions
Species scores
CCA1 CCA2 CCA3 CCA4 CCA5 CCA6
S.marinoi -0.3890 0.39759 0.1080 -0.005704 -0.005372 -0.0002441
C.tripos 1.8428 0.23999 -0.1661 -1.337082 0.636225 -0.5204045
P.alata 1.6892 0.17910 -0.3119 0.997590 0.142028 0.0601177
P.seriata 1.4365 -0.15112 -0.8646 0.915351 -1.455675 -1.4054078
D.confervacea 0.2098 -1.23522 0.5317 -0.089496 -0.034250 0.0278820
C.decipiens 2.2896 0.65801 -1.0315 -1.246933 -0.428691 0.3649382
P.farcimen -1.2897 -1.19148 -2.3562 0.032558 0.104148 -0.0068910
C.furca 1.4439 -0.02836 -0.9459 0.301348 -0.975261 0.4861669
Biplot scores for constraining variables
CCA1 CCA2 CCA3 CCA4 CCA5 CCA6
Temperature 0.88651 0.1043 -0.07283 -0.30912 -0.22541 0.24771
Salinity 0.32228 -0.3490 0.30471 0.05140 -0.32600 0.44408
O2 -0.81650 0.4665 -0.07151 0.03457 0.20399 -0.20298
Phosphate 0.22667 -0.8415 0.41741 -0.17725 -0.06941 -0.06605
TotP -0.33506 -0.6371 0.38858 -0.05094 -0.24700 -0.25107
Nitrate 0.15520 -0.3674 0.38238 -0.07154 -0.41349 -0.56582
TotN -0.23253 -0.3958 0.16550 -0.25979 -0.39029 -0.68259
Silica 0.04449 -0.8382 0.15934 -0.22951 -0.35540 -0.25650
Which of all these numbers are important to my analysis?
/anna
How much variation is explained by X?
In a CCA, variance isn't variance in the normal sense. We express it as the "mean squared contingency coefficient", or "inertia". All the info you need to ascertain how much "variation" in Y is explained by X is contained in the section of the output that I reproduce below:
Partitioning of mean squared contingency coefficient:
Inertia Proportion
Total 4.151 1.0000
Constrained 1.705 0.4109
Unconstrained 2.445 0.5891
In this example the total inertia is 4.151 and your X variables (the "Constrained" part) explain 1.705 of it, which is about 41%, leaving about 59% unexplained.
The next section, on the eigenvalues, shows, both in inertia and in proportion explained, how much each axis contributes to the explanatory "power" of the CCA (the Constrained part of the table above) and to the unexplained "variance" (the Unconstrained part of the table above).
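If you want these quantities programmatically, here is a short sketch of how one might extract them with vegan; it uses the example data shipped with the package rather than your data, and the all-variables formula is illustrative:
library(vegan)
data(varespec, varechem)                    # example data shipped with vegan
m <- cca(varespec ~ ., data = varechem)     # CCA constrained by all env. variables
m$CCA$tot.chi / m$tot.chi                   # proportion of total inertia explained
anova(m, by = "terms", permutations = 199)  # permutation test of each variable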
The next section contains the ordination scores. Think of these as the coordinates of the points in the triplot. For some reason you don't show the site scores in the output above, but they would normally be there. Note that these have been scaled - by default this uses scaling = 2 - so site points are at their weighted average of the species scores, IIRC.
The "Biplot" scores are the locations of the arrow heads, or of the labels on the arrows - I forget exactly how the plot is drawn now.
Which of all these numbers are important to my analysis?
All of them are important - if you think the triplot is important and interpretable, it is based entirely on the information reported by summary(). If you have specific questions to ask of the data, then perhaps only certain sections will be of paramount importance to you.
However, StackOverflow is not the place to ask such questions of a statistical nature.
I don't have the ability to comment, but in response to the first answer's interpretation of the species and site scores in scaling 2, I believe their explanation is backwards.
In the book "Numerical Ecology with R" by Borcard, Gillet, and Legendre, they clearly state that in scaling 2 the species scores are weighted averages of the site scores.
This can be confirmed by using the ordihull function with a CCA.
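As a minimal sketch of that kind of check, using example data shipped with vegan (the model and grouping variable are illustrative, not from the OP's data):
library(vegan)
data(dune, dune.env)                          # example data shipped with vegan
m <- cca(dune ~ Management, data = dune.env)
plot(m, scaling = 2)                          # triplot with the default scaling 2
ordihull(m, groups = dune.env$Management, display = "sites")  # hulls around site scores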
Also, the output from the OP states that species scores are scaled and site scores are unscaled, which I believe confirms what the book says:
"Scaling 2 for species and site scores
* Species are scaled proportional to eigenvalues
* Sites are unscaled: weighted dispersion equal on all dimensions"