Ranking based on unequal numbers of discrete observations

We have a set of targets, each with a different, fixed number of discrete binary observations (0 or 1).
We want to compare targets to each other for their overall propensity towards a certain kind of observation (0 or 1), in a way that is not biased by the total number of observations for each individual target.
The ranking should also be on a scale between 0 and 1 and should carry an uncertainty that shrinks with the number of observations (the more observations, the greater the certainty about the target's propensity towards one of the binary observational outcomes).
For example:
On an imaginary map we have countries which can have between 1 and 20 neighbours with discrete boundaries.
We want to rate each country based on the quality of its boundary with each of its neighbours,
so that we can come up with a comparison scheme to rank countries by the quality of their boundaries.
So for example:
Country A has 8 boundaries (8 neighbours), 4 of which are good boundaries, +/- some uncertainty.
Country B has 3 boundaries (3 neighbours), 2 of which are good boundaries, +/- some uncertainty.
So Country B rank > Country A rank?
The problem requires comparison on features (boundaries) that vary in number for each country.
Intuitively, a comparison scheme suggests some function that is proportional to the number of good boundaries and inversely proportional to the total number of boundaries each country has.
However, left unmoderated, this simple proportionality produces a decaying sawtooth pattern whose frequency decreases with the number of boundaries, which impedes simple ranking between countries with disparate numbers of boundaries.
What would be a good approach to rating the countries based on discrete boundaries in such a way that they can be practically compared?
My intuition is that this problem, in some guise, is a known problem in some branch of discrete maths (graph theory, perhaps?); however, I have been unable to find an algorithm or mathematical approach for ranking countries based on discrete features.
Any thoughts appreciated
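One common way to get a 0-1 score with an observation-dependent error bar, roughly matching the intuition above, is to treat each country's good/total boundary counts as a binomial proportion and attach a Wilson score interval; ranking by the interval's lower bound shrinks the score of countries with few boundaries. A minimal sketch in R, using the counts from the example (the function name is just for illustration):

# Proportion of good boundaries plus a Wilson score interval; the interval
# narrows as the number of boundaries (observations) grows.
wilson_interval <- function(good, total, z = 1.96) {
  p_hat  <- good / total
  denom  <- 1 + z^2 / total
  centre <- (p_hat + z^2 / (2 * total)) / denom
  half   <- z * sqrt(p_hat * (1 - p_hat) / total + z^2 / (4 * total^2)) / denom
  c(estimate = p_hat, lower = centre - half, upper = centre + half)
}

wilson_interval(good = 4, total = 8)  # Country A: estimate 0.50, fairly wide interval
wilson_interval(good = 2, total = 3)  # Country B: estimate ~0.67, even wider interval

Ranking by lower rather than estimate is what penalises small observation counts; a Bayesian alternative with the same shrinkage effect is to rank by the mean of a Beta(good + 1, bad + 1) posterior.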

Related

Score for over/under representations of a variable in sub-group

I have a corpus of book publications split into different clusters. I have information about the nationality of the authors (variable A) and the nationality of the publishing company (variable B).
In the case of variable B, publishing companies are either US-based or Euro-based (2 categories). In the case of variable A, authors are either American, European or others (3 categories).
I want to know whether a cluster is more euro-centered or more us-centered when compared to the overall corpus (basically identify clusters in which EU/US identity is important) and plot it on two axes according to variables A and B.
A positive value on the Y-axis would mean the cluster has an over-representation of EU authors, and a negative value the opposite. Similarly, the X-axis would have a positive value when we find an over-representation of EU publishing companies and a negative value for US companies. (In the case of variable A, which has three categories, simply comparing proportions can lead to both US and EU authors appearing over-represented, because the 'other' category may be under-represented.)
I initially subtracted the relative ratios for variable A and plotted the resulting value on the y-axis according to the following formula:
(share_europeans_authors_cluster/share_US_authors_cluster - share_europeans_authors/share_US_authors)
I did a similar thing for variable B and the x-axis, and plotted the clusters on these two axes.
I would like a better measure of what I am trying to do because my intuition is that there is something wrong with my approach. I tried using the log ratio, but it led to other issues.
You could use the log of the odds ratio, e.g. for the authors:
log( (share_europeans_authors_cluster / share_US_authors_cluster) / (share_europeans_authors / share_US_authors) )
A value of zero then means that the odds are the same in both groups. Values greater than zero mean that the odds of being from the EU in the cluster are larger than the odds of being from the EU in the corpus. Negative values mean the odds of being from the US in the cluster are larger than the odds of being from the US in the corpus.
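A minimal sketch in R of that measure, assuming you have raw counts of EU and US authors per cluster and for the whole corpus (the function and argument names are just for illustration):

log_odds_ratio <- function(eu_cluster, us_cluster, eu_corpus, us_corpus) {
  # log of (EU:US ratio in the cluster) relative to (EU:US ratio in the corpus)
  log((eu_cluster / us_cluster) / (eu_corpus / us_corpus))
}

# Example: 30 EU vs 10 US authors in a cluster, 400 EU vs 600 US in the corpus
log_odds_ratio(30, 10, 400, 600)   # ~1.5 > 0, so EU authors are over-represented

The same formula applies to variable B with publisher counts; adding 0.5 to each count (a Haldane-style correction) is a common way to avoid infinities when a cluster has zero authors in one category.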

How to include plots / rows with zero values in the presence / absence community matrix in a CCA using R Vegan package

I am trying to do CCA using a presence / absence matrix of plant quadrat data and continuous environmental data for the same quadrats, using the Vegan package in R. Some of the quadrats have no plant species present (the row for the quadrat is full of 0's) but do have corresponding environmental data in another dataframe. The context of the study is that the environmental data is metal concentrations in soil, which are typically high where there are no plant species, so the quadrats with zero species do contribute to the data, and are not errors or NA's. When running the CCA with the R Vegan Package so far I have had to delete these rows to get it to work, otherwise it returns the error
'Error in cca.default(d$X, d$Y, d$Z) :
all row sums must be >0 in the community data matrix' .
Is there a way to include the data from quadrats that have no plant species in the CCA? I have read in this paper, which also uses the Vegan package and has a similar research design: https://www.researchgate.net/publication/229087061_Relationships_between_the_presence_of_odonate_species_and_environmental_characteristics_in_lowland_ponds_of_central_Italy that the authors included plots with zero species by adding a 'zero species' variable, but they do not elaborate on how this is done.
I am new to coding, so any help is very much appreciated.
Thanks in advance
Here is how to do it. Assume your data set is called comm and it has some rows (sampling units) that have no species:
comm$ZERO <- as.numeric(rowSums(comm) == 0)
This will add a new column ZERO which is 1 for rows that had no species, and 0 for others.
Personally, I would be worried about doing this. Correspondence Analysis is a compositional analysis, and adding a column (species) that never occurs with any other species (by definition) creates a data set with two disjunct blocks. In unconstrained CA this disjunct block manifests as a first eigenvalue of 1 – which is the theoretical maximum in CA. This first eigenvector will separate the blocks: the ZERO species and the sampling units with ZERO species at one extreme, and all other species and sampling units at the other extreme of the first axis. The second axis of this ZERO ordination will be identical to the first axis without ZERO, so in effect you just add this disjunction axis to the ordination.
Things are slightly different with CCA, which actually looks at the fitted values of your species, and these fitted values may not be disjunct. So technically you can do it. However, it is not quite clear to me what the analysis means if you do so. Even if the data set is not completely disjunct with CCA, the zero sampling units will probably be far separated from the other points, and all plotted at the same point.
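For completeness, a minimal sketch of the workflow described above, assuming the community data is in a data frame comm and the soil/metal variables for the same quadrats are in a data frame env (both names are placeholders):

library(vegan)

# flag quadrats with no species so that every row sum is > 0
comm$ZERO <- as.numeric(rowSums(comm) == 0)

# constrained ordination of the community data on the environmental variables
mod <- cca(comm ~ ., data = env)
plot(mod)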

How to calculate NME (Normalized Mean Error) between ground-truth and predicted landmarks when some ground-truth points have no correspondence in the prediction?

I am trying to learn about facial landmark detection models, and I notice that many of them use NME (Normalized Mean Error) as a performance metric.
The formula is straightforward: it calculates the L2 distance between the ground-truth points and the model's predictions, then divides by a normalization factor, which varies between datasets.
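For reference, a common form of the metric (the exact normalization factor d, e.g. the inter-ocular distance, is dataset specific) is:
NME = (1 / N) * sum_{i=1..N} ( ||p_i - p_hat_i||_2 / d )
where p_i are the N ground-truth landmarks and p_hat_i the corresponding predictions.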
However, when applying this formula to a landmark detector that someone else developed, I have to deal with a non-trivial situation: the detector may not be able to generate the full number of landmarks for some input images (perhaps because of NMS, inherent model problems, image quality, etc.). Thus some of the ground-truth points may have no corresponding point in the prediction result.
So how should I handle this? Should I just add such missing-point results to a 'failure result set' and use the failure rate (FR) to measure the model, and ignore them when doing the NME calculation?
If the output of your neural network is, for example, a 10x1 vector, those are your points, [x1, y1, x2, y2, ..., x5, y5]. This vector has a fixed length because of the number of output neurons in your model.
If you have missing points (say you only see 4 of the 5), it is usually because some predicted points fall beyond the image width and height, or have negative coordinates, like [-0.1, -0.2, 0.5, 0.7, ...]: here the first point cannot be seen on the image, so it looks missing, but it is still in the output vector and you can still calculate the NME.
In some custom neural nets genuinely missing outputs can occur, in which case the missing values are replaced by maximum-error points.
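As a sketch of the option raised in the question (scoring only the landmarks the detector actually produced), assuming gt and pred are N x 2 matrices with NA rows in pred where a landmark was not detected, and d is the dataset's normalization factor (all names hypothetical):

nme_matched <- function(gt, pred, d) {
  ok  <- stats::complete.cases(pred)                                # landmarks that were predicted
  err <- sqrt(rowSums((gt[ok, , drop = FALSE] - pred[ok, , drop = FALSE])^2))
  mean(err) / d                                                     # mean normalized point-to-point error
}

Whichever option you choose, it is worth reporting the fraction of dropped landmarks alongside the NME, otherwise a detector could look good by only predicting the easy points.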

r - Estimate selection-unbiased allele frequencies with linear regression systems

I have a few data sets consisting of frequencies for i distinct alleles/SNPs in some populations. Additionally, I recorded some factors that are suspected of having changed the frequencies of these alleles within the populations in the past through their selective effect. It is assumed that the selection impact can be described in the form of a simple linear regression for every selection factor.
Now I'd like to estimate what the allele frequencies would be expected to look like under identical selectional forces (thus, I set selection=1). These new allele frequencies a'_i are derived as
a'_i = a_i - function[a_i|selection=1]
with a_i the current frequency of allele i in a population and function[a_i|selection=1] the estimated allele frequency in the absence of selectional forces.
However, there are some constraints for the whole process:
The minimal value allowed for a'_i is 0.
The sum of all allele frequencies a'_i has to be 1.
Usually I'd solve this problem by applying multiple linear regressions. But then the constraints are not fulfilled ...
Any idea how to approach this analysis with constraints (maybe using linear equation/regression systems or structural equation modelling)?
Here is an example data set containing allele frequencies for the ABO major allele groups (p, q, r) as well as the selection variables (x, y, z).
Although this example file only contains 3 alleles and 3 influential variables, all my data sets contain up to ~1050 alleles/SNPs and always 8 selection variables that may (but don't have to) have an impact on the allele frequencies ...
Many thanks in advance for ideas, code snippets and hints!
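One illustrative route, not a full solution: fit the per-allele regressions, apply the adjustment from the formula above, and then project the result back onto the constraints by clamping at 0 and renormalizing each population's frequencies to sum to 1. A minimal sketch in R, assuming a data frame pops laid out like the example file (allele columns p, q, r and selection columns x, y, z); a proper constrained fit would instead use quadratic programming (e.g. the quadprog package) or a compositional (log-ratio) regression:

allele_cols <- c("p", "q", "r")
sel_at_one  <- transform(pops, x = 1, y = 1, z = 1)     # all selection variables set to 1

adjusted <- sapply(allele_cols, function(al) {
  fit <- lm(reformulate(c("x", "y", "z"), response = al), data = pops)
  pops[[al]] - predict(fit, newdata = sel_at_one)       # a'_i = a_i - function[a_i|selection=1]
})

adjusted <- pmax(adjusted, 0)                # constraint 1: no negative frequencies
adjusted <- adjusted / rowSums(adjusted)     # constraint 2: each population's frequencies sum to 1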

Understanding TSA::periodogram()

I have some data sampled at regular intervals that looks sinusoidal, and I would like to determine the frequency of the wave. To that end I obtained R and loaded the TSA package, which contains a function named 'periodogram'.
In an attempt to understand how it works I created some data as follows:
x<-.0001*1:260
This could be interpreted to be 260 samples with an interval of .0001 seconds
Frequency=80
The frequency could be interpreted to be 80Hz so there should be about 125 points per wave period
y<-sin(2*pi*Frequency*x)
I then do:
foo=TSA::periodogram(y)
In the resulting periodogram I would expect to see a sharp spike at the frequency that corresponds to my data - I do see a sharp spike but the maximum 'spec' value has a frequency of 0.007407407, how does this relate to my frequency of 80Hz?
I note that there is variable foo$bandwidth with a value of 0.001069167 which I also have difficulty interpreting.
If there are better ways of determining the frequency of my data I would be interested - my experience with R is limited to one day.
The periodogram is computed from the time series without knowledge of your actual sampling interval. This results in frequencies that are limited to the normalized [0, 0.5] range (cycles per sample). To obtain a frequency in Hertz that takes the sampling interval into account, you simply multiply by the sampling rate. In your case, the spike at a normalized frequency of 0.007407407, with a sampling rate of 10,000Hz, corresponds to a frequency of ~74Hz.
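In code, assuming the object returned by TSA::periodogram() exposes the plotted frequency axis and spectrum as foo$freq and foo$spec (the question already refers to foo$spec and foo$bandwidth):

fs <- 1 / 0.0001                             # sampling rate implied by the 0.0001 s interval
peak_norm <- foo$freq[which.max(foo$spec)]   # normalized frequency of the largest peak
peak_hz   <- peak_norm * fs                  # ~0.0074 * 10000 = ~74 Hz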
Now, that's not quite 80Hz (the original tone frequency), but you have to keep in mind that a periodogram is a frequency spectrum estimate, and its frequency resolution is limited by the number of input samples. In your case you are using 260 samples, so the frequency resolution is on the order of 10,000Hz/260 or ~38Hz. Since 74Hz is well within 80 +/- 38Hz, it is a reasonable result. To get a better frequency estimate you would have to increase the number of samples.
Note that the periodogram of a sinusoidal tone will typically spike near the tone frequency and decay on either side (a phenomenon caused by the limited number of samples used for the estimation, often called spectral leakage) until the value can be considered comparatively negligible. The foo$bandwidth value is also given on the normalized frequency scale; multiplied by the sampling rate it corresponds to roughly 0.001069167 * 10,000Hz ≈ 10.7Hz, and it reflects the bandwidth of the spectral window used by the periodogram estimate (how finely it can resolve frequencies) rather than a property of the input signal itself.
