Estimate selection-unbiased allele frequencies with linear regression systems

I have a few data sets consisting of frequencies for i distinct alleles/SNPs across some populations. Additionally, I recorded some factors that are suspected of having changed the frequencies of these alleles within the populations in the past through their selective effect. It is assumed that the selection impact can be described by a simple linear regression for every selection factor.
Now I'd like to estimate what the allele frequencies would be under identical selective forces (thus, I set selection = 1). These new allele frequencies a'_i are derived as
a'_i = a_i - f(a_i | selection = 1)
with a_i the current frequency of allele i in a population, and f(a_i | selection = 1) the estimated allele frequency in the absence of differential selective forces.
However, there are some constraints for the whole process:
The minimal value allowed for any a'_i is 0.
The sum of all allele frequencies a'_i has to be 1.
Usually I'd solve this problem by applying multiple linear regressions. But then the constraints are not fulfilled ...
Any idea how to approach this analysis with constraints (maybe using linear equation/regression systems or structural equation modelling)?
Here is an example data set containing allele frequencies for the ABO major allele groups (p, q, r) as well as the selection variables (x, y, z).
Although this example file only contains 3 alleles and 3 influential variables, all my data sets contain up to ~1050 alleles/SNPs and always 8 selection variables that may (but need not) have an impact on the allele frequencies ...
Many thanks in advance for ideas, code snippets and hints!
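One possible way to satisfy both constraints, sketched below on simulated data, is to fit one ordinary linear model per allele, subtract its prediction at selection = 1 as in the a'_i formula above, and then project the adjusted frequencies back onto the simplex (clip at 0, renormalize to sum to 1). The clip-and-renormalize step and all names below (sel, freq, adj) are illustrative assumptions, not from the original post; a more rigorous route would solve the joint constrained least-squares problem, e.g. with quadprog::solve.QP.

# simulated stand-in for the real data: 50 populations, alleles p/q/r,
# selection variables x/y/z as in the example file
set.seed(1)
n <- 50
sel <- data.frame(x = runif(n), y = runif(n), z = runif(n))
raw <- matrix(runif(n * 3), n, 3)
freq <- raw / rowSums(raw)                   # each row sums to 1
colnames(freq) <- c("p", "q", "r")

# one linear regression per allele, adjusted by the prediction at selection = 1
adj <- sapply(colnames(freq), function(al) {
  fit <- lm(freq[, al] ~ x + y + z, data = sel)
  freq[, al] - predict(fit, newdata = data.frame(x = 1, y = 1, z = 1))
})

# enforce the constraints: a'_i >= 0 and rows summing to 1
# (rows that collapse to all zeros would need special handling)
adj <- pmax(adj, 0)
adj <- adj / rowSums(adj)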

Related

Loess regression on genomic data

I am struggling with the loess function in R.
I have a dataframe on which I would like to run locally weighted polynomial regression (loess).
Each ‘Gene’ has an associated ‘Count’ (log10-transformed) which gives information regarding the gene's expression. Each Gene also has an ‘Integrity’ measurement (range 0-100) which tells you the quality of the ‘Count’ measurement for that gene. As a general principle, the higher the ‘Integrity’, the more reliable the ‘Count’ for the specific Gene.
Below is reported a chunk of the dataframe:

Gene                Integrity   Count
ENSG00000198786.2   96.6937     3.55279
ENSG00000210194.1   96.68682    1.39794
ENSG00000212907.2   94.81709    2.396199
ENSG00000198886.2   93.87207    3.61595
ENSG00000198727.2   89.08319    3.238548
ENSG00000198804.2   88.82048    3.78326
I would like to use loess to predict the ‘true’ value of genes with low ‘Integrity’ values (since these are less reliable).
I) Should I pre-process my dataframe in order to correctly apply loess? From a plethora of examples I observed sinusoidal distributions of points (A), while my dataset seems distributed in a ‘rollercoaster’-like fashion (B).
II) How should I run loess?
I cannot understand how to run loess with the correct syntax to differentially weight the observations:
-1 loess(Count ~ Integrity)                      # no prior weights
-2 loess(Count ~ 1:nrow(dataframe), weights = Integrity)
I performed several tests. Figures C-D used loess (from stats), and figures E-F used weightedLowess (from limma). I used two different packages because, from the loess docs, the local weights appear to be based on the x-distance between points, whereas the weightedLowess function lets the user supply prior weights for the regression.
Below is the basic syntax adopted to perform the regressions and generate the images.
C) loess(Count ~ Integrity, degree = 2, span = 0.1)
D) loess(Count ~ 1:nrow(df), weights = Integrity, degree = 2, span = 0.1)
E) weightedLowess(x = 1:nrow(df), y = Count, weights = Integrity, span = 0.1)
F) weightedLowess(x = 1:nrow(df), y = order(Count), weights = Integrity, span = 0.1)
Please find enclosed the images (A-F) cited in the question.
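For what it's worth, here is a minimal runnable sketch of both weighted calls, using just the six sample rows above (span and degree are loosened so the toy data fit; the fitted component name of the weightedLowess output is my assumption, check ?weightedLowess):

# the six sample rows from the table above
df <- data.frame(
  Gene = c("ENSG00000198786.2", "ENSG00000210194.1", "ENSG00000212907.2",
           "ENSG00000198886.2", "ENSG00000198727.2", "ENSG00000198804.2"),
  Integrity = c(96.6937, 96.68682, 94.81709, 93.87207, 89.08319, 88.82048),
  Count = c(3.55279, 1.39794, 2.396199, 3.61595, 3.238548, 3.78326)
)

# stats::loess takes prior case weights via 'weights', on top of its
# internal x-distance (tricube) local weighting
fit <- loess(Count ~ Integrity, data = df, weights = df$Integrity,
             degree = 1, span = 1)
df$smooth_loess <- predict(fit)

# limma::weightedLowess takes the prior weights explicitly
library(limma)
wl <- weightedLowess(x = df$Integrity, y = df$Count,
                     weights = df$Integrity, span = 1)
df$smooth_wl <- wl$fitted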

How to include plots / rows with zero values in the presence / absence community matrix in a CCA using R Vegan package

I am trying to do CCA using a presence/absence matrix of plant quadrat data and continuous environmental data for the same quadrats, using the vegan package in R. Some of the quadrats have no plant species present (the row for the quadrat is full of 0's) but do have corresponding environmental data in another dataframe. The context of the study is that the environmental data are metal concentrations in soil, which are typically high where there are no plant species, so the quadrats with zero species do contribute to the data and are not errors or NAs. When running the CCA with vegan I have so far had to delete these rows to get it to work; otherwise it returns the error
'Error in cca.default(d$X, d$Y, d$Z) :
all row sums must be >0 in the community data matrix' .
Is there a way to include the data from quadrats that have no plant species in the CCA? I have read this paper, which also uses the vegan package and has a similar research design: https://www.researchgate.net/publication/229087061_Relationships_between_the_presence_of_odonate_species_and_environmental_characteristics_in_lowland_ponds_of_central_Italy. The authors included plots with zero species by adding a 'zero species' variable, but they do not elaborate on how this is done.
I am new to coding so any help is very much appreciated,
Thanks in advance
Here is how to do it. Assume your data set is called comm and it has some rows (sampling units) that have no species:
comm$ZERO <- as.numeric(rowSums(comm) == 0)
This will add a new column ZERO which is 1 for rows that had no species, and 0 for others.
Personally, I would be worried about doing this. Correspondence Analysis is a compositional analysis, and adding a column (species) that by definition never occurs together with any other species creates a data set with two disjunct blocks. In unconstrained CA this disjunct block manifests as a first eigenvalue of 1 – the theoretical maximum in CA. This first eigenvector will separate the blocks: the ZERO species and the sampling units having it at one extreme, and all other species and sampling units at the other extreme of the first axis. The second axis of this ZERO ordination will be identical to the first axis of the ordination without ZERO, so in effect you just add this disjunction axis to the ordination.
Things are slightly different with CCA, which actually looks at the fitted values of your species, and these fitted values may not be disjunct. So technically you can do it. However, it is not quite clear to me what you get if you do so. Even if the data set is not completely disjunct under CCA, the zero sampling units will probably be far separated from the other points, and all plotted in the same point.
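For completeness, a self-contained sketch of the pseudo-species trick on simulated data (the comm/env names and the single metal variable are made up for illustration):

library(vegan)
set.seed(42)

# toy presence/absence community matrix: 10 quadrats x 6 species
comm <- as.data.frame(matrix(rbinom(60, 1, 0.4), nrow = 10))
comm[3, ] <- 0                               # one quadrat with no species

# matching environmental data for the same quadrats
env <- data.frame(metal = rnorm(10))

comm$ZERO <- as.numeric(rowSums(comm) == 0)  # 1 where no species occurred
ord <- cca(comm ~ metal, data = env)         # all row sums are now > 0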

PCoA function pcoa extract vectors; percentage of variance explained

I have a dataset consisting of 132 observations and 10 variables.
These variables are all categorical. I am trying to see how my observations cluster and how they differ, based on the percentage of variance explained; i.e., I want to find out a) whether there are any variables that help to draw certain observation points apart from one another and b) if yes, what percentage of variance is explained by them?
I was advised to run a PCoA (Principal Coordinates Analysis) on my data. I ran it using the vegan and ape packages. This is my code after loading my csv file into R (I call the data frame data):
data.dis <- vegdist(data, method = "gower", na.rm = TRUE)
data.pcoa <- pcoa(data.dis)
I was then told to extract the vectors from the pcoa output, so:
data.pcoa$vectors
It returned 132 rows but 20 columns of values (e.g. from Axis 1 to Axis 20).
I was perplexed as to why there were 20 columns of values when I only have 10 variables; I was under the impression that I would only get 10 columns. If any kind souls out there could help explain a) what the vectors actually represent and b) how I get the percentage of variance explained by Axes 1 and 2?
Another question I had: I don't really understand the purpose of extracting the eigenvalues from data.pcoa. I saw some websites doing that after running a pcoa on their distance matrix, but there was no further explanation of it.
Gower index is non-Euclidean, and you can expect more real axes than the number of variables in a Euclidean ordination (PCoA). However, you said that your variables are categorical. I assume that in R lingo they are factors. If so, you should not use vegan::vegdist(), which only accepts numeric data. Moreover, if a variable is defined as a factor, vegan::vegdist() refuses to compute the dissimilarities and gives an error. If you managed to use vegdist(), you did not properly define your variables as factors. If you really have factor variables, you should use some other package than vegan for Gower dissimilarity (there are many alternatives).
The percentage of "variance" is a bit tricky for non-Euclidean dissimilarities, which also give some negative eigenvalues corresponding to imaginary dimensions. In that case, the sum of all positive eigenvalues (real axes) is higher than the total "variance" of the data. ape::pcoa() returns the information you asked for in the element values. The proportion of variance explained is in its element values$Relative_eig. The total "variance" is returned in the element trace. All this is documented in ?pcoa, where I read it.
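A minimal sketch of that extraction on simulated factor data (cluster::daisy is used here as one of the many alternative Gower implementations; it was not named in the original answer):

library(ape)
library(cluster)                 # daisy() accepts factor variables

set.seed(1)
data <- data.frame(a = factor(sample(letters[1:3], 132, replace = TRUE)),
                   b = factor(sample(letters[1:4], 132, replace = TRUE)))

data.dis <- daisy(data, metric = "gower")   # Gower dissimilarity on factors
data.pcoa <- pcoa(data.dis)

data.pcoa$values$Relative_eig[1:2]   # proportion of "variance", axes 1 and 2
data.pcoa$trace                      # total "variance"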

SVD with missing values in R

I am performing an SVD analysis with R, but I have a matrix with structural NA values. Is it possible to obtain an SVD decomposition in this case? Are there alternative solutions? Thanks in advance.
You might want to try the SVDmiss function in the SpatioTemporal package, which does missing-value imputation as well as computing the SVD of the imputed matrix; see the SVDmiss documentation.
However, you might want to be wary of the nature of your data and whether missing value imputation makes sense in your case.
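As a rough illustration of the impute-then-decompose idea, here is a plain base-R sketch (simple column-mean imputation; this is not the SVDmiss algorithm, which iterates the SVD itself):

set.seed(1)
X <- matrix(rnorm(100), 10, 10)
X[sample(length(X), 15)] <- NA         # knock out some entries

# crude imputation: replace each NA with its column mean
col_means <- colMeans(X, na.rm = TRUE)
na_idx <- which(is.na(X), arr.ind = TRUE)
X[na_idx] <- col_means[na_idx[, "col"]]

s <- svd(X)                            # SVD of the completed matrix
str(s)                                 # components d, u, v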
I have tried using SVM in R with NA values without success.
Sometimes the missing observations are important in the analysis, so I usually transform my data as follows:
If you have lots of variables, try to reduce their number (clustering, lasso, etc...)
Transform the remaining predictors like this:
- for quantitative variables:
  - calculate deciles per predictor (leaving missing obs out)
  - calculate the frequency of Y per decile (assuming Y is qualitative)
  - regroup deciles by their Y-frequency similarity into 2/3/4 groups
    (you can do this by looking at their plot too)
  - create a new binary variable for each group
    (X11 = 1 if X1 takes values in the interval ...)
  - calculate the Y frequency for the missing obs of that predictor
  - join the missing-obs category to the group that has the closest Y frequency
- for qualitative variables:
  - if you have variables with lots of levels, you should do clustering by the Y variable
  - for variables with fewer levels, you can calculate the Y frequency per class
  - regroup the classes as above
  - calculate the same thing for missing obs and attach them to the most similar group of non-missing
  - recode the variable as in the numeric case
There, now you have a complete database of dummy variables and the chance to perform SVM, neural networks, LASSO, etc...
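A rough sketch of the quantitative branch of this recipe (the column names x1 and y are hypothetical, and the regrouping of deciles is left to eye-balling, as in the recipe itself):

set.seed(1)
df <- data.frame(x1 = c(rnorm(90), rep(NA, 10)),
                 y  = rbinom(100, 1, 0.5))

# deciles per predictor, leaving missing observations out
brks <- quantile(df$x1, probs = seq(0, 1, 0.1), na.rm = TRUE)
df$x1_dec <- cut(df$x1, breaks = brks, include.lowest = TRUE)

# frequency of Y per decile: regroup similar deciles into a few bins by eye
tapply(df$y, df$x1_dec, mean)

# frequency of Y among missing observations, to decide which bin they join
mean(df$y[is.na(df$x1)])

# one binary indicator per final bin, e.g. if deciles 1-3 form the first group:
df$x11 <- as.numeric(!is.na(df$x1_dec) & as.integer(df$x1_dec) <= 3)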

Looking for an efficient way to compute the variances of a multinomial distribution in R

I have an R matrix whose dimensions are ~20,000,000 rows by 1,000 columns. The first column represents counts and the rest of the columns represent the probabilities of a multinomial distribution over these counts. In other words, in each row the first column is n and the remaining k columns are the probabilities of the k categories. Another point is that the matrix is sparse, meaning that in each row many columns have the value 0.
Here's a toy matrix I created:
mat = rbind(
  c( 5, 0.1, 0.1,  0.1,  0.1,  0.1,  0.1, 0.1, 0.1, 0.1, 0.1),
  c( 2, 0.2, 0.2,  0.2,  0.2,  0.2,  0,   0,   0,   0,   0),
  c(22, 0.4, 0.6,  0,    0,    0,    0,   0,   0,   0,   0),
  c( 5, 0.5, 0.2,  0,    0.1,  0.2,  0,   0,   0,   0,   0),
  c( 4, 0.4, 0.15, 0.15, 0.15, 0.15, 0,   0,   0,   0,   0),
  c(10, 0.6, 0.1,  0.1,  0.1,  0.1,  0,   0,   0,   0,   0)
)
What I'd like to do is obtain an empirical measure of the variance of the counts for each category. The natural thing that comes to mind is to obtain random draws and then compute the variances over them. Something like:
draws = apply(mat, 1, function(x) rmultinom(samples, x[1], x[2:ncol(mat)]))
where, say, samples = 100000.
Then I can run an apply over draws to compute the variances.
However, for my real data dimensions this will become prohibitive, at least in terms of RAM. Is there a more efficient solution in R to this problem?
If all you need is the variance of the counts, just compute it immediately instead of returning the intermediate simulated draws. Note that rmultinom() returns a k x samples matrix, so take the variance over each category (row) rather than calling var() on the whole matrix:
vars = apply(mat, 1, function(x) apply(rmultinom(samples, x[1], x[-1]), 1, var))
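It may also be worth noting (standard multinomial theory, not part of the original answer) that the marginal count variances have a closed form, Var(X_j) = n * p_j * (1 - p_j), which is exact, fully vectorized, and needs no simulation at all:

n <- mat[, 1]              # counts, one per row
p <- mat[, -1]             # category probabilities
vars <- n * p * (1 - p)    # variance of each category's count, row by row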
