I am attempting to perform a linear discriminant analysis (LDA) on geometric morphometric data. Because geometric morphometric data typically produce large numbers of variables, and discriminant analyses require more data points than variables to classify specimens accurately, a common solution in the literature is to run a principal component analysis (PCA) first and then use as LDA input the number of PCs that returns the highest reclassification rate while representing less than 99% of the cumulative variance.
Right now I run the LDA in R (using functions from the Morpho and MASS packages) for every possible number of PCs and note the classification accuracy by hand until I find the lowest number of PCs that returns the highest accuracy, but this is highly inefficient.
I was wondering if there is a way to write a function that runs an LDA for every possible number of the first N PCs (up to a user-defined level representing 99% of the cumulative variance) and returns the percent reclassification rate for each level, producing something like the following:
PCs percent_accuracy
20 72.2
19 76.3
18 77.4
17 80.1
16 75.4
15 50.7
... ...
1 20.2
So row 1 would be the reclassification rate when the first 20 PCs are used, row 2 the rate when the first 19 PCs are used, and so on.
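One way to automate this could look like the sketch below. It is only a minimal illustration under some assumptions: it uses `prcomp` and `MASS::lda` with leave-one-out cross-validation (`CV = TRUE`) rather than the Morpho functions, the function name `lda_accuracy_by_pcs` and its arguments are made up for this example, and the built-in `iris` data stand in for real shape variables. Note that the PCA here is fitted on all specimens before cross-validation, which can make the rates slightly optimistic.

```r
library(MASS)  # provides lda()

# Hypothetical helper: cross-validated LDA accuracy for the first 1..max_pcs PCs.
# x       : numeric matrix/data frame of shape variables (rows = specimens)
# groups  : group label for each specimen
# max_pcs : largest number of PCs to try (e.g. the count reaching 99% variance)
lda_accuracy_by_pcs <- function(x, groups, max_pcs) {
  groups <- as.factor(groups)
  scores <- prcomp(x)$x                      # PC scores for every specimen
  max_pcs <- min(max_pcs, ncol(scores))
  acc <- sapply(seq_len(max_pcs), function(k) {
    # leave-one-out cross-validated classification using the first k PCs
    fit <- lda(scores[, 1:k, drop = FALSE], grouping = groups, CV = TRUE)
    100 * mean(fit$class == groups)
  })
  out <- data.frame(PCs = seq_len(max_pcs), percent_accuracy = acc)
  out[order(-out$PCs), ]                     # most PCs first, as in the table above
}

# Stand-in data: four measurements and three species from iris
res <- lda_accuracy_by_pcs(iris[, 1:4], iris$Species, max_pcs = 4)
print(res, row.names = FALSE)
```

The number of PCs to pass as `max_pcs` can be read from the cumulative variance, for example `cumsum(pca$sdev^2) / sum(pca$sdev^2)` for a `prcomp` result `pca`.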
I am beginning my PhD in transcriptome (Affymetrix assay) analysis.
I have an expression matrix (trans_data: 32000 genes x 620 samples) and a clinical matrix (clin_data: 620 samples x 42 clinical characteristics).
Each sample belongs to one of four populations: A, B, C or D. I'd like to compare gene expression between populations A and B without trying to bind the two matrices.
I'd like to obtain a matrix with the mean expression of each gene in the two populations, then the p-value, then the adjusted p-value.
Then I could select only the differentially expressed genes (padj < 0.05).
Thanks for your help.
Alain
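A minimal sketch of that workflow in base R is shown below. It makes several assumptions: the data are simulated stand-ins with far fewer genes and samples, the clinical column is called `population` (a hypothetical name), the row names of clin_data match the column names of trans_data, and a per-gene Welch t-test with Benjamini-Hochberg adjustment is an acceptable test. For real microarray data, a dedicated package such as limma is commonly used instead.

```r
set.seed(1)

# Simulated stand-ins for the real objects (200 genes x 40 samples)
trans_data <- matrix(rnorm(200 * 40), nrow = 200,
                     dimnames = list(paste0("gene", 1:200), paste0("s", 1:40)))
clin_data <- data.frame(population = rep(c("A", "B", "C", "D"), each = 10),
                        row.names = colnames(trans_data))

# Look up each sample's population by name; the two matrices are never merged
pop  <- clin_data[colnames(trans_data), "population"]
in_a <- pop == "A"
in_b <- pop == "B"

# Per-gene Welch t-test between population A and population B
pvalue <- apply(trans_data, 1, function(g) t.test(g[in_a], g[in_b])$p.value)

result <- data.frame(
  mean_A = rowMeans(trans_data[, in_a]),
  mean_B = rowMeans(trans_data[, in_b]),
  pvalue = pvalue,
  padj   = p.adjust(pvalue, method = "BH")  # Benjamini-Hochberg adjustment
)

# Keep only genes passing the adjusted threshold
de_genes <- result[result$padj < 0.05, ]
head(result)
```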
Can't answer your question directly without a clear reproducible example, but you might want to check out the rather excellent tableone package.
I have a few data sets consisting of frequencies for i distinct alleles/SNPs of some populations. Additionally, I recorded some factors that are suspected of having changed the frequencies of these alleles within the populations in the past through selection. The selection impact is assumed to be describable by a simple linear regression for every selection factor.
Now I'd like to estimate what the allele frequencies would be under identical selective forces (thus, I set selection=1). These new allele frequencies a'_i are derived as
a'_i = a_i - function[a_i|selection=1]
where a_i is the current frequency of allele i in a population and function[a_i|selection=1] is the estimated allele frequency under the absence of selective forces.
However, there are some constraints for the whole process:
The minimal values of a'_i allowed is 0.
The sum of all allele frequencies a'_i has to be 1.
Usually I'd solve this problem by applying multiple linear regressions. But then the constraints are not fulfilled ...
Any idea how to approach this analysis with constraints (maybe using linear equation/regression systems or structural equation modelling)?
Here is an example data set containing allele frequencies for the ABO major allele groups (p, q, r) as well as the selection variables (x, y, z).
Although this example file only contains 3 alleles and 3 influential variables, all my data sets contain up to ~1050 alleles/SNPs and always 8 selection variables that may have (but don't have to) an impact on the allele frequencies ...
Many thanks in advance for ideas, code snippets and hints!
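As a starting point, the two constraints can at least be enforced after the fact. The sketch below is not a constrained regression: it simply clamps negative adjusted frequencies to 0 and rescales the rest so they sum to 1. The helper name `constrain_frequencies` and the input values are made up for illustration, and a model that respects the constraints during fitting (for example a compositional approach) may be more appropriate.

```r
# Hypothetical helper: force a vector of adjusted allele frequencies onto the
# constraints (every value >= 0, values sum to 1) by clamping and rescaling.
constrain_frequencies <- function(a_adj) {
  a_adj <- pmax(a_adj, 0)          # constraint 1: no frequency below 0
  total <- sum(a_adj)
  if (total == 0) stop("all adjusted frequencies are zero or negative")
  a_adj / total                    # constraint 2: frequencies sum to 1
}

# Made-up unconstrained estimates for the three ABO allele groups
a_adjusted    <- c(p = 0.31, q = -0.02, r = 0.68)
a_constrained <- constrain_frequencies(a_adjusted)
print(a_constrained)
```

Clamping changes the relative sizes of the remaining frequencies only through the common rescaling factor, so alleles that were already non-negative keep their ratios to each other.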
I am working with a gene prioritization program and I want to analyze its performance using ROC curves and the ROCR package for R. The problem is that the program does not give me a score for each gene; it only orders the genes.
ROCR only accepts continuous data as predictions. Is it possible to assign each gene a number in descending order? For example, I have these genes in this order:
Gene A
Gene B
Gene C
Could I assign these values?
Gene A: 3
Gene B: 2
Gene C: 1
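Because a ROC curve depends only on the ordering of the predictions, descending rank numbers like these should behave as scores. The sketch below illustrates this with made-up genes and made-up true/false labels. The AUC is computed in base R from the rank sums (the helper `auc_from_scores` is hypothetical), and the same scores are passed to ROCR's `prediction()` and `performance()` only if that package happens to be installed.

```r
# Genes in the order returned by the prioritization tool (best first)
genes       <- c("Gene A", "Gene B", "Gene C", "Gene D", "Gene E")
is_true_hit <- c(1, 1, 0, 1, 0)        # made-up labels: 1 = known positive

# Descending scores: the top-ranked gene gets the highest number
score <- rev(seq_along(genes))         # 5, 4, 3, 2, 1

# Hypothetical helper: AUC from the Mann-Whitney rank-sum statistic
auc_from_scores <- function(score, label) {
  r     <- rank(score)
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
auc <- auc_from_scores(score, is_true_hit)
print(auc)

# The same scores can be handed to ROCR, if it is available
if (requireNamespace("ROCR", quietly = TRUE)) {
  pred <- ROCR::prediction(score, is_true_hit)
  perf <- ROCR::performance(pred, "tpr", "fpr")   # ROC curve coordinates
}
```

Any strictly decreasing set of numbers would give the same curve, so the exact values assigned do not matter as long as the order is preserved.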
I am looking for the best way or best package available for simulating a genetic association between a specific SNP and a quantitative phenotype, with the simulated data being as similar as possible to my real data, except that I know the causal variant.
All of the packages I saw in R seem to be specialised in pedigree data or in population data where coalescence and other evolutionary factors are specified, but I don't have any experience in population genetics and I only want to simulate the simple case of a European population with characteristics similar to my real data (i.e. normal distribution for the trait and an additive effect for the genotype, similar allele frequencies…)
So for example if my genetic data is X and my quantitative variable is Y:
X <-rbinom(1000,2,0.4)
Y <- rnorm(1000,1,0.4)
I am looking for something in R similar to the function in Plink where one specifies a range of allele frequencies, a range for the phenotype, and a specific variant that should come out associated with the phenotype (this is important because I need to repeat these associations in different datasets with the causal variant being the same).
Can someone please help me?
If the genotype changes only the mean of the phenotype, this is very simple.
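phenotype.means <- c(5, 15, 20) # phenotype means for genotypes 0, 1, and 2
phenotype.sd <- 5
X <- rbinom(1000, 2, 0.4) # genotypes coded 0, 1, 2
# R vectors are 1-indexed and an index of 0 is silently dropped, so shift by 1
Y <- rnorm(1000, phenotype.means[X + 1], phenotype.sd)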
This will lead to Y containing 1000 normally distributed values, where those with homozygous recessive genotypes (aa, or 0) will have a mean of 5, those with heterozygous genotypes (Aa, or 1) will have a mean of 15, and those with homozygous dominant genotypes (AA, or 2) will have a mean of 20.
If you want a more traditional two-level phenotype (AA/Aa versus aa), just set phenotype.means to something like c(5, 20, 20).
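For the additive case mentioned in the question, the phenotype mean can instead be written as a baseline plus a per-allele effect. The sketch below is one possible way to do that, wrapped in a function so the same causal effect can be reused across datasets. The function name `simulate_additive` and the values for the allele frequency, effect size, and noise are assumptions chosen for illustration, not estimates from any real data.

```r
# Hypothetical helper: simulate one SNP with an additive effect on a normal trait.
# n     : number of individuals
# maf   : frequency of the effect allele
# beta  : change in phenotype mean per copy of the effect allele
# base  : phenotype mean for genotype 0
# noise : residual standard deviation
simulate_additive <- function(n, maf, beta, base = 1, noise = 0.4) {
  X <- rbinom(n, 2, maf)                       # genotypes coded 0, 1, 2
  Y <- base + beta * X + rnorm(n, 0, noise)    # additive genotype effect
  data.frame(X = X, Y = Y)
}

set.seed(42)
beta_true <- 0.2                               # assumed causal effect size
sim <- simulate_additive(n = 1000, maf = 0.4, beta = beta_true)

# Check that a simple linear model recovers the simulated effect
fit <- lm(Y ~ X, data = sim)
print(coef(summary(fit))["X", ])
```

Calling `simulate_additive` again with the same `beta` but a different `n` or `maf` gives a new dataset in which the same variant remains causal.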