Simulation of genetic data in R

I am looking for the best way, or the best available package, to simulate a genetic association between a specific SNP and a quantitative phenotype, with the simulated data as similar as possible to my real data, except that I know the causal variant.
All of the R packages I have seen seem to be specialised in pedigree data, or in population data where coalescence and other evolutionary factors have to be specified. I don't have any experience in population genetics, and I only want to simulate the simple case of a European population with characteristics similar to my real data (i.e. a normally distributed trait, an additive effect for the genotype, similar allele frequencies, ...).
So, for example, if my genetic data is X and my quantitative variable is Y:
X <- rbinom(1000, 2, 0.4)
Y <- rnorm(1000, 1, 0.4)
I am looking for something in R similar to the simulation function in PLINK, where one specifies a range of allele frequencies, a range for the phenotype, and a specific variant that should come out associated with the phenotype (this is important because I need to repeat these associations in different datasets with the same causal variant).
Can someone please help me?

If the genotype changes only the mean of the phenotype, this is very simple.
phenotype.means <- c(5, 15, 20)  # phenotype means for genotypes 0, 1, and 2
phenotype.sd <- 5
X <- rbinom(1000, 2, 0.4)
Y <- rnorm(1000, phenotype.means[X + 1], phenotype.sd)  # X + 1 because R vectors are 1-indexed
This will lead to Y containing 1000 normally distributed values, where individuals with the homozygous genotype coded 0 (aa) will have a mean of 5, those with the heterozygous genotype coded 1 (Aa) will have a mean of 15, and those with the homozygous genotype coded 2 (AA) will have a mean of 20.
If you want a more traditional two-group phenotype (AA/Aa versus aa), just set phenotype.means to something like c(5, 20, 20).
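If, in addition, you want to control the allele frequency and the size of the additive effect explicitly, and to repeat the simulation in several datasets with the same causal variant (as asked in the question), a minimal sketch along the same lines is below; the values of maf, beta and resid.sd are assumptions that you would tune to your real data.
set.seed(42)
n.datasets <- 5       # number of replicate datasets
n          <- 1000    # individuals per dataset
maf        <- 0.4     # allele frequency of the causal SNP
beta       <- 0.5     # additive effect per copy of the allele
resid.sd   <- 0.4     # residual (environmental) standard deviation

sim.list <- lapply(seq_len(n.datasets), function(i) {
  X <- rbinom(n, 2, maf)                     # genotype coded 0/1/2
  Y <- 1 + beta * X + rnorm(n, 0, resid.sd)  # additive genetic effect
  data.frame(X = X, Y = Y)
})

# the causal variant recovers a similar effect estimate in every dataset
sapply(sim.list, function(d) coef(lm(Y ~ X, data = d))["X"])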

Related

How can I add an amount of random error to a numerical variable in R?

I am working on investigating the relationship between body measurements and overall weight in a set of biological specimens using regression equations. I have been comparing my results to previous studies, which did not draw their measurement data and body weights from the same series of individuals. Instead, these studies used the mean values reported for each species from the previously published literature (with body measurements and weight drawn from different sets of individuals) or just took the midpoint of reported ranges of body measurements.
I am trying to figure out how to introduce a small amount of random error into my data to simulate the effects of drawing measurement and weight data from different sources. For example, altering every value by roughly +/- 5% of its actual value (which is close to the difference I see between my measurements and the literature measurements) and seeing how much that affects the accuracy statistics. I know there is the jitter() command, but that only seems to be intended for plotting data.
There is a jitter() function in base R that allows you to add random noise to the data.
x <- 1:10
set.seed(123)
jitter(x)
#[1] 0.915 2.115 2.964 4.153 5.176 5.818 7.011 8.157 9.021 9.983
Check ?jitter, which explains the different ways to control the noise that is added.
This is straightforward if you know what the error looks like (i.e. how your error is distributed). Is the error normally distributed? Uniform?
set.seed(1)                    # fix the random seed so the noise is reproducible
v1 <- rep(100, 10)             # measurements with no noise
v1_n <- v1 + rnorm(10, 0, 20)  # normally distributed error with mean 0 and sd 20
v1_u <- v1 + runif(10, -5, 5)  # uniformly distributed error between -5 and 5 (mean 0)
v1_n
[1] 87.47092 103.67287 83.28743 131.90562 106.59016 83.59063 109.74858 114.76649 111.51563 93.89223
v1_u
[1] 104.34705 97.12143 101.51674 96.25555 97.67221 98.86114 95.13390 98.82388 103.69691 98.40349
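For the roughly +/- 5% error described in the question, the noise should scale with each value rather than have a fixed size. A minimal sketch (the uniform scaling factor is an assumption; a normal factor with sd around 0.05 would work just as well):
set.seed(1)
meas <- c(12.5, 30.1, 45.0, 80.2)                    # hypothetical measurements
meas_pct <- meas * runif(length(meas), 0.95, 1.05)   # each value perturbed by up to +/- 5%
meas_pct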

Regressing out or removing age as a confounding factor from experimental results

I have obtained cycle threshold (CT) values for some genes in diseased and healthy samples. The healthy samples were younger than the diseased ones. I want to check whether age (the exact age values) is impacting the CT values, and if so, I want to obtain an adjusted CT value matrix in which the gene values are not affected by age.
I have checked various sources on confounding-variable adjustment, but they all deal with categorical confounding factors (like batch effects). I can't work out how to do it for age.
I have done the following:
modcombat = model.matrix(~1, data = data.frame(data_val))
modcancer = model.matrix(~Age, data = data.frame(data_val))
combat_edata = ComBat(dat = t(data_val), batch = Age, mod = modcombat,
                      par.prior = TRUE, prior.plots = FALSE)
pValuesComBat = f.pvalue(combat_edata, mod, mod0)
qValuesComBat = p.adjust(pValuesComBat, method = "BH")
data_val is the gene expression/CT values matrix.
Age is the age vector for all the samples.
For some genes the p-value is significant. So how do I correctly modify those gene values so as to remove the age effect?
I tried linear regression as well (upon checking some blogs):
lm1 = lm(data_val[1,] ~ Age) #1 indicates first gene. Did this for all genes
cor.test(lm1$residuals, Age)
The blog suggested checking the p-value of the correlation between the residuals and the confounding factor. I don't understand why one would test the correlation of the residuals with age.
And how do I apply a correction to the CT values using regression?
Please advise whether what I have done is correct.
If it's incorrect, kindly tell me how to obtain data_val with no age effect.
There are many methods to solve this.
Basic statistical approach
A very basic method to account for the effect of the Age parameter and make the final dataset age-agnostic is:
Centre and scale your data based on Age. By this I mean group your data by age, compute the mean of each group, and then standardise your data within these groups using that mean (a short sketch follows the list below).
For standardising you can use the following methods:
1) z-score normalisation: change each data point to (x - mean(x)) / sd(x), using the group mean and group standard deviation.
2) mean normalisation: simply subtract the group mean from every observation.
3) min-max normalisation: a modification of z-score normalisation in which the standard deviation is replaced by the group minimum or maximum, i.e. (x - mean(x)) / min(x) or (x - mean(x)) / max(x).
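A minimal sketch of the group-wise centring and scaling described above, assuming age is binned into groups first (the bin boundaries and the made-up CT values are placeholders):
set.seed(1)
Age <- sample(30:80, 20, replace = TRUE)            # sample ages
ct  <- rnorm(20, mean = 25, sd = 3)                 # CT values for one gene
age.group <- cut(Age, breaks = c(0, 50, 65, 100))   # bin the continuous ages

grp.mean <- ave(ct, age.group, FUN = mean)          # group mean for each sample
grp.sd   <- ave(ct, age.group, FUN = sd)            # group sd for each sample

ct.centred <- ct - grp.mean                         # mean normalisation
ct.zscore  <- (ct - grp.mean) / grp.sd              # z-score normalisation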
On to more complex statistics:
You can get the importance of all the features/columns in your dataset using algorithms like PCA (principal component analysis, https://en.wikipedia.org/wiki/Principal_component_analysis). Although it is generally used as a dimensionality-reduction algorithm, it can also be used to examine the variance in the whole dataset and the importance of individual features.
Below is a simple example explaining it.
I have plotted the importance as a variable plot and a biplot, using the decathlon2 dataset from the factoextra package:
library("factoextra")
data(decathlon2)
colnames(data)
data<-decathlon2[,1:10] # taking only 10 variables/columns for easyness
res.pca <- prcomp(data, scale = TRUE)
#fviz_eig(res.pca)
fviz_pca_var(res.pca,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
hep.PC.cor = prcomp(data, scale=TRUE)
biplot(hep.PC.cor)
Output:
[1] "X100m" "Long.jump" "Shot.put" "High.jump" "X400m" "X110m.hurdle"
[7] "Discus" "Pole.vault" "Javeline" "X1500m"
Along similar lines, you can run PCA on your data to gauge the importance of the age parameter.
I hope this helps; if I find more such methods I will share them.
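For the regression-based correction asked about in the question (and not covered above), one common sketch is to keep, for each gene, the residuals of a regression on age plus the gene's original mean. This assumes samples in rows and genes in columns, so transpose first if your matrix is the other way around; the example data below are made up purely for illustration:
# hypothetical example data: 30 samples x 4 genes with an age trend added
set.seed(7)
Age      <- sample(25:75, 30, replace = TRUE)
data_val <- matrix(rnorm(30 * 4, mean = 25), ncol = 4) + 0.05 * Age

# age-adjust each gene: residuals of gene ~ Age, shifted back to the original mean
adjust_for_age <- function(expr, age) {
  apply(expr, 2, function(gene) residuals(lm(gene ~ age)) + mean(gene))
}

data_adj <- adjust_for_age(data_val, Age)
cor(data_adj[, 1], Age)   # essentially zero after adjustment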

How can I find dependencies in my data in R? (A + B + C -> D)

I want to reduce my data by removing dependent variables, e.g. A + B + C -> D, so that I can leave out D without losing any information.
d <- data.frame(A = c(1, 2, 3, 4, 5),
                B = c(2, 4, 6, 4, 2),
                C = c(3, 4, 2, 1, 0),
                D = c(6, 10, 11, 9, 9))
The last value of D is wrong; that is because the data can be inaccurate.
How can I identify these dependencies with R, and how can I control how strict the check is (e.g. use a cutoff of 80 or 90 percent)?
For example, findCorrelation only considers pairwise correlations. Is there a function for multiple correlations?
You want to find dependencies in your data, and you contrast findCorrelation with what you want by asking 'is there a function for multiple correlations?'. To answer that, we need to clarify which technique is appropriate for you...
Do you want partial correlation:
Partial correlation is the correlation of two variables while controlling for a third or more other variables
or semi-partial correlation?
Semi-partial correlation is the correlation of two variables with variation from a third or more other variables removed only from the second variable.
The definitions are from {ppcor}. There is also a decent YouTube video, although the speaker may have some of the details of the relationship to regression slightly confused.
Regarding Fer Arce's suggestion... it is about right. Regression is closely related to these methods; however, when predictors are correlated (called multicollinearity) this can cause issues (see the answer by gung). You could force your predictors to be orthogonal (uncorrelated) via PCA, but then you would make interpreting the coefficients quite hard.
Implementation:
library(ppcor)
d <- data.frame(A = c(1, 2, 3, 4, 5),
                B = c(2, 4, 6, 4, 2),
                C = c(3, 4, 2, 1, 0),
                D = c(6, 10, 11, 9, 9))
# partial correlations
pcor(d, method = "pearson")
# semi-partial correlations
spcor(d, method = "pearson")
You can get a 'correlation' if you fit a lm:
summary(lm(D ~ A + B + C, data = d))
But I am not sure what exactly you are asking for. I mean, with this you can get the R^2, which I guess is what you are looking for?
Although correlation matrices are helpful and perfectly legitimate, one way I find particularly useful is to look at the variance inflation factor. Wikipedia's article describing the VIF is quite good.
A few reasons why I like to use the VIF:
Instead of looking at the rows or columns of a correlation matrix and trying to divine which variables are collinear with the other covariates jointly rather than singly, you get a single number that describes a given predictor's relationship to all the others in the model.
It's easy to use the VIF in a stepwise fashion to, in most cases, eliminate collinearity within your predictor space.
It's easy to obtain, either by using the vif() function in the car package or by writing your own function to calculate it.
VIF essentially works by regressing each predictor you've included, in turn, against all the other covariates/predictors in your model. It obtains the R^2 value and takes the ratio 1/(1 - R^2). This gives you a number VIF >= 1. If you think of R^2 as the amount of variation in that predictor explained by the remaining covariates, then a covariate with a high R^2 of, say, 0.80 has a VIF of 5.
You choose what your threshold of comfort is. The Wikipedia article suggests that a VIF of 10 indicates a predictor should go. I was taught that 5 is a good threshold. Often I've found it's easy to get the VIF down to less than 2 for all of my predictors without a big impact on my final model's adjusted R^2.
I feel that even a VIF of 5, meaning a predictor can be modeled by its companion predictors with an R^2 of 0.80, means that that predictor's marginal information contribution is quite low and not worth it. I try to follow a strategy of minimizing all of my VIFs for a given model without a huge impact on my main model (say, more than a 0.1 reduction in R^2). That sort of impact gives me a sense that, even if the VIF is higher than I'd like, the predictor still holds a lot of information.
There are other approaches. You might look into Lawson's paper on an alias-matrix-guided variable selection method as well - I feel it's particularly clever, though harder to implement than what I've discussed above.
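As a minimal sketch, here are both routes on the small data frame d from the question: the vif() function from the car package, and the manual 1/(1 - R^2) calculation it corresponds to (the car package is the only assumption beyond base R):
library(car)   # for vif()

d <- data.frame(A = c(1, 2, 3, 4, 5),
                B = c(2, 4, 6, 4, 2),
                C = c(3, 4, 2, 1, 0),
                D = c(6, 10, 11, 9, 9))

fit <- lm(D ~ A + B + C, data = d)
vif(fit)                                            # one VIF per predictor

# the same number by hand, for predictor A:
r2.A <- summary(lm(A ~ B + C, data = d))$r.squared
1 / (1 - r2.A)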
The question is about how to detect dependencies in larger sets of data.
For one thing, this is possible by manually checking every possibility, as proposed in other answers with, for example, summary(lm(D ~ A + B + C, data = d)). But this means a lot of manual work.
I see a few possibilities. The first is filter methods, such as RReliefF or Spearman correlation, which look at correlations and measure distances within the data set (a short sketch follows below).
The second possibility is feature-extraction methods like PCA, LDA or ICA, which all try to find the independent components (meaning they eliminate any correlations...).
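As a concrete starting point for the filter idea, here is a minimal sketch with a Spearman correlation matrix and findCorrelation() from the caret package, using a 0.8 cutoff (both the package and the cutoff value are assumptions, and this still only catches pairwise relationships):
library(caret)   # for findCorrelation()

d <- data.frame(A = c(1, 2, 3, 4, 5),
                B = c(2, 4, 6, 4, 2),
                C = c(3, 4, 2, 1, 0),
                D = c(6, 10, 11, 9, 9))

corr <- cor(d, method = "spearman")
findCorrelation(corr, cutoff = 0.8, names = TRUE)   # columns suggested for removal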

Weighted correlation in R

I am trying to output a correlation matrix for various locations. The row names 'PC1', 'PC2', etc. represent principal components. Since the percentage of variance explained (and thus the weight) of the principal components decreases from PC1 to PC4, I need to run a Pearson correlation that takes the weights of the PCs into account.
In other words, row 1 is more important in determining the correlation among locations than row 2, and row 2 is more important than row 3, and so on...
A simple weight vector for the 4 rows can be as follows:
w <- c(1.00, 0.75, 0.50, 0.25)
I did go through this, but I am not fully clear on the solution, and unlike that question, I need to find the correlation between the columns of a SINGLE matrix while weighting its rows.
OK, this is very easy to do in R using cov.wt() (available in the stats package):
weighted_corr <- cov.wt(DF, wt = w, cor = TRUE)
corr_matrix <- weighted_corr$cor
That's it!
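A self-contained sketch with made-up data, in case DF and w above are unclear (here DF is a hypothetical 4 x 3 matrix with PC1-PC4 in the rows and three locations in the columns; cov.wt weights the rows/observations):
set.seed(10)
DF <- matrix(rnorm(12), nrow = 4,
             dimnames = list(paste0("PC", 1:4), c("loc1", "loc2", "loc3")))
w  <- c(1.00, 0.75, 0.50, 0.25)   # decreasing weight from PC1 to PC4

weighted_corr <- cov.wt(DF, wt = w, cor = TRUE)
weighted_corr$cor                 # weighted correlation between the locations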

GLM combining results

I'm writing a Sweave file for a clearer presentation of glm() results. The GLM is for calculating premium prices of insurance policies. Usually two separate GLMs are used for this: one for claim frequency and one for claim severity. To get the final price I have to multiply the coefficient estimates of the two models according to the categorization.
If both models have the same independent variables with the same levels, the problem is trivial: I can just multiply the fitted values of both and it's done. The problem arises when the factors have different levels, as a consequence of merging levels to get better results. Let's say I have the factor age with 3 levels (0-25, 25-50, 50-110) for frequency and 2 levels (0-25, 25-110) for severity. I want to combine the fitted values to be multiplied in the following sense:
Frequency   Severity
0-25        0-25
25-50       25-110
50-110      25-110
In other words, the fitted values should be multiplied only when the categories cover the same range. This should also work for non-numeric categorizations. For instance:
Frequency   Severity
a           ab
b           ab
c           c
Is there any function/package in R that would allow me to do that? If not, what other ways exist?
Currently my only idea is to use custom labels for the factor levels and then use string comparisons between them.
The best way to do this is to create code for transforming your dataset in model-specific ways, and then call it before computing the predictions. This generalises easily to situations where your models involve different subsets of variables, or are of different forms completely. Since this is R and not SAS, you can do it all in one function.
predict_combined <- function(glm.cf, glm.cs, newdata)
{
    newdata.cf <- within(newdata, {
        age <- cut(age, c(0, 25, 50, 110))
        ...
        ...
    })
    newdata.cs <- within(newdata, {
        age <- cut(age, c(0, 25, 110))
        ...
        ...
    })
    pred.cf <- predict(glm.cf, newdata.cf, type = "resp")
    pred.cs <- predict(glm.cs, newdata.cs, type = "resp")
    pred.cf * pred.cs
}
This can be turned into a one-liner, but that would probably obfuscate more than it would elucidate.
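A hypothetical usage sketch, assuming glm.cf (the frequency model) and glm.cs (the severity model) have already been fitted on appropriately banded copies of the data, and new_policies is a data frame with a raw age column (all names here are placeholders):
expected.cost <- predict_combined(glm.cf, glm.cs, new_policies)
head(expected.cost)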
