Hi everyone!
I am trying to run a GWAS analysis in R on some very simple genetic data. It only contains the SNPs and one outcome variable (as well as an ID variable for each observation).
Everything I have found online includes chromosome and position data. I have that for the SNPs, but in a separate file. (My plan is to map the SNPs after the relevant ones have been selected).
How can I go about running a GWAS analysis on this data? Do I need to, or could I use another method to filter down to only the most significant SNPs?
I tried this, but it didn't work, because my data is not a gData object.
# SNPs are in A/B notation, with 0 = AA, 1 = AB, and 2 = BB
library(statgenGWAS)
id <- c("person1", "person2", "person3", "person4", "person5", "person6", "person7", "person8", "person9", "person10")
snp1 <- c(0, 1, 2, 2, 1, 0, 0, 0, 1, 1)
snp2 <- c(2, 2, 2, 1, 1, 1, 0, 0, 0, 1)
snp3 <- c(0, 0, 2, 2, 0, 2, 1, 0, 2, 2)
diagnosis <- c(0, 1, 1, 0, 0, 1, 1, 0, 1, 1)
data <- data.frame(id, snp1, snp2, snp3, diagnosis) # data.frame() rather than cbind(), which would coerce everything to character
gwas1a <- runSingleTraitGwas(gData = data,
                             traits = "diagnosis") # fails: data is a plain data.frame, not a gData object
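From the statgenGWAS documentation it looks like I first need to combine the genotypes, map, and phenotypes into a gData object with createGData(). Here is an untested sketch of what I think that would look like (the map file name and its columns are placeholders for my separate chromosome/position file):
geno <- data.frame(snp1, snp2, snp3, row.names = id)      # genotypes in rows, SNPs in columns
map <- read.csv("snp_map.csv", row.names = 1)             # placeholder file; expects columns chr and pos, rownames = SNP names
pheno <- data.frame(genotype = id, diagnosis = diagnosis) # first column must be named "genotype"
gData <- createGData(geno = geno, map = map, pheno = pheno)
gwas1b <- runSingleTraitGwas(gData = gData, traits = "diagnosis")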
Any help here is appreciated.
Thank you!
When I apply a dbRDA to a distance matrix (in this case the Bray-Curtis distance) like this:
dbrda(sqrt(dist) ~ ., site_vars)
is it OK to include a column of ordered factors in site_vars, a data frame of values measured at the sampling sites (e.g. mean temperature) that also includes a column "Soil" in which different soil types are ranked? Or is it necessary to add all the ordinal and nominal scaled variables in a separate Condition() argument to the formula?
Here is a small example:
data <- rbind(
c(1, 1, 0, 1, 1, 0, 0, 0, 0, 0),
c(1, 1, 1, 0, 1, 1, 0, 0, 0, 0),
c(0, 1, 0, 1, 0, 1, 1, 0, 1, 0),
c(1, 0, 0, 0, 1, 0, 1, 1, 1, 0),
c(0, 0, 0, 1, 0, 0, 0, 0, 1, 1)
)
rownames(data) <- c("Site_1", "Site_2", "Site_3", "Site_4", "Site_5")
colnames(data) <- c("Spec_1", "Spec_2", "Spec_3", "Spec_4", "Spec_5", "Spec_6", "Spec_7", "Spec_8", "Spec_9", "Spec_10")
library(vegan)
dist <- vegdist(data, "bray") # Bray-Curtis dissimilarity
site_vars <- data.frame(
Tmean = c(9, 10, 12, 14.5, 14),
SomethingElse = c(12, 14, 13, 16, 21),
Soil = c("good", "good", "OK", "OK", "bad")
)
site_vars$Soil <- ordered(site_vars$Soil, levels = c("good", "OK", "bad"))
# Version 1
dbRDA_Condition <- dbrda(sqrt(dist) ~ Tmean + SomethingElse + Condition(Soil), site_vars)
plot(dbRDA_Condition)
# Version 2
dbRDA <- dbrda(sqrt(dist) ~ Tmean + SomethingElse + Soil, site_vars)
plot(dbRDA)
Version 1 seems to disregard the fact that my soil variable is ranked. Version 2 generates an output I find a bit tricky to interpret: in addition to the group centroids, it also displays arrows. I would expect one arrow for soil, as if it were a numerical variable with values 1, 2 and 3 instead of three levels. However, it shows two arrows, labeled Soil.L and Soil.Q. Why are there two arrows for one variable, and what do .L and .Q stand for? Unfortunately, I haven't found any explanation.
R analyses factors using contrasts. For unordered factors the default contrasts are differences from the first factor level. For ordered factors, R uses polynomial contrasts: linear (.L), quadratic (.Q), cubic (.C), fourth-order (^4), and so on. Check any guide to the R statistical environment. dbrda() does not invent this feature; it is standard R behaviour.
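To see them directly, you can inspect the contrast matrix R builds for the soil factor; a quick illustration using nothing beyond base R:
soil <- ordered(c("good", "OK", "bad"), levels = c("good", "OK", "bad"))
contrasts(soil)      # two columns: .L (linear) and .Q (quadratic)
model.matrix(~ soil) # the soil.L and soil.Q columns that become the two arrows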
I want to visualize a comparison of group means with boxplots in ggplot2, but instead of one vector of categories, I have several vectors of 1s and 0s indicating whether each observation belongs to a given category. There's some overlap, i.e. some data points belong to multiple groups simultaneously.
I'm able to get a boxplot of the values in one group, but not able to add another group's values to the same plot. With as.factor() applied to a dummy variable I can get boxplots of the scores for those in that group vs. not in that group. I've seen posts about faceting that seemed like they might help, but none of the examples I've found (Multiple boxplots placed side by side for different column values in ggplot; How do I make a boxplot with two categorical variables in R?) are quite like what I'm trying to do.
score <- c(1, 8, 3, 5, 10, 7, 4, 3, 8, 1)
group1 <- c(0, 0, 1, 0, 1, 1, 0, 1, 0, 1)
group2 <- c(1, 1, 0, 1, 0, 1, 1, 1, 0, 0)
group3 <- c(0, 1, 0, 0, 0, 0, 0, 0, 1, 1)
df <- data.frame(score, group1, group2, group3)
library(ggplot2)
ggplot(df, aes(y = score, x = as.factor(group1), fill = group1)) +
  geom_boxplot() # boxes for values both inside and outside the group
ggplot(df, aes(y = score, x = as.numeric(group1), fill = group1)) +
  geom_boxplot() # box for just those values where group1 == 1
I want to end up with either (a) multiple plots like what I get from the first call, or (b) multiple plots like what I get from the second. The former includes a boxplot for all the values outside the group; the latter does not. It would also be cool to have a boxplot for the overall data, but I'm not sure what's feasible.
I'm not quite sure whether you just want box plots for those with dummy = 1. In any case, data.table::melt can be useful to you: it gives you an easily plottable long format.
library(data.table)
dat.m <- melt(dat, measure.vars = 2:4) # long format: score, variable (group name), value (0/1)
boxplot(score ~ value + variable, dat.m[dat.m$value == 1, ])
This yields one box per group (plot omitted).
Data
dat <- structure(list(score = c(1, 8, 3, 5, 10, 7, 4, 3, 8, 1), group1 = c(0,
0, 1, 0, 1, 1, 0, 1, 0, 1), group2 = c(1, 1, 0, 1, 0, 1, 1, 1,
0, 0), group3 = c(0, 1, 0, 0, 0, 0, 0, 0, 1, 1)), class = "data.frame", row.names = c(NA,
-10L))
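Since the question asked for ggplot2, the same long format plots directly. A minimal sketch using the dat.m built above (tidyr::pivot_longer would work equally well for the reshape):
library(ggplot2)
# one box per group, restricted to rows where the dummy is 1
ggplot(dat.m[dat.m$value == 1, ], aes(x = variable, y = score)) +
  geom_boxplot()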
I recently asked a question about looping a glm command over all possible combinations of independent variables. Another user provided a great answer that runs all possible models, but I can't figure out how to produce a data.frame of all possible p-values.
The code suggested in the previous question (pasted below) works for independent variables that are binary. However, several of my variables are categorical. Is there any way to adjust the code so that I can produce a table of all p-values for every possible model? (There are 2,046 possible models with 10 independent variables...)
# p-values in a data.frame
p_values <-
  cbind(formula_vec, as.data.frame(do.call(rbind,
    lapply(glm_res, function(x) {
      coefs <- coef(x)
      rbind(c(coefs[, 4], rep(NA, length(ind_vars) - length(coefs[, 4]) + 1)))
    })
  )))
An example of one independent variable is "Bedrock" where possible categories include: "till," "silt," and "glacial deposit." It's not feasible to assign a numerical value to these variables, which is part of the problem. Any suggestions would be appreciated.
With an additional categorical variable IndVar4 (a factor with levels a, b, c), the coefficient table can grow by more than one row. Adding variable IndVar4:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.7548180 1.4005800 -1.2529223 0.2102340
IndVar1 -0.2830926 1.2076534 -0.2344154 0.8146625
IndVar2 0.1894432 0.1401217 1.3519903 0.1763784
IndVar3 0.1568672 0.2528131 0.6204867 0.5349374
IndVar4b 0.4604571 1.0774018 0.4273773 0.6691045
IndVar4c 0.9084545 1.0943227 0.8301523 0.4064527
The maximum number of rows is bounded by the number of variables plus the extra rows contributed by factor levels:
max_values <- length(ind_vars) +
  sum(sapply(dfPRAC, function(x) pmax(length(levels(x)) - 1, 0))) # levels - 1 extra rows per factor
So the corrected version is:
p_values <-
  cbind(formula_vec, as.data.frame(do.call(rbind,
    lapply(glm_res, function(x) {
      coefs <- coef(x)
      rbind(c(coefs[, 4], rep(NA, max_values - length(coefs[, 4]) + 1)))
    })
  )))
But the result is not as clean as with continuous variables. I think Metrics' idea of converting every categorical variable to (levels - 1) dummy variables gives the same results and maybe a cleaner presentation.
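For reference, a minimal sketch of that dummy-variable conversion (my own illustration, using dfPRAC as defined under Data below): model.matrix() expands a factor into levels - 1 indicator columns.
# drop the intercept column; what remains are the IndVar4b / IndVar4c dummies
dummies <- model.matrix(~ IndVar4, dfPRAC)[, -1, drop = FALSE]
head(dummies)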
Data:
dfPRAC <- structure(list(DepVar1 = c(0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1,
1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1), DepVar2 = c(0, 1, 0, 0,
1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1),
IndVar1 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
0, 0, 0, 1, 0, 0, 0, 1, 0),
IndVar2 = c(1, 3, 9, 1, 5, 1,
1, 8, 4, 6, 3, 15, 4, 1, 1, 3, 2, 1, 10, 1, 9, 9, 11, 5),
IndVar3 = c(0.500100322564443, 1.64241601558441, 0.622735778490702,
2.42429812749226, 5.10055213237027, 1.38479786027561, 7.24663629203007,
0.5102348706939, 2.91566510995229, 3.73356170379198, 5.42003495939846,
1.29312896116503, 3.33753833987496, 0.91783513806083, 4.7735736131668,
1.17609362602233, 5.58010703426296, 5.6668754863739, 1.4377813063642,
5.07724130837643, 2.4791994535923, 2.55100067348583, 2.41043629522981,
2.14411703944206)), .Names = c("DepVar1", "DepVar2", "IndVar1",
"IndVar2", "IndVar3"), row.names = c(NA, 24L), class = "data.frame")
dfPRAC$IndVar4 <- factor(rep(c("a", "b", "c"),8))
dfPRAC$IndVar5 <- factor(rep(c("d", "e", "f", "g"),6))
Set up the models:
dep_vars <- c("DepVar1", "DepVar2")
ind_vars <- c("IndVar1", "IndVar2", "IndVar3", "IndVar4", "IndVar5")
# create all combinations of ind_vars
ind_vars_comb <-
  unlist(sapply(seq_along(ind_vars), function(i) {
    apply(combn(ind_vars, i), 2, function(x) paste(x, collapse = "+"))
  }))
# pair with dep_vars:
var_comb <- expand.grid(dep_vars, ind_vars_comb )
# formulas for all combinations
formula_vec <- sprintf("%s ~ %s", var_comb$Var1, var_comb$Var2)
# create models
glm_res <- lapply(formula_vec, function(f) {
  fit1 <- glm(f, data = dfPRAC, family = binomial("logit"))
  fit1$coefficients <- coef(summary(fit1)) # store the full coefficient matrix, incl. p-values
  return(fit1)
})
names(glm_res) <- formula_vec
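An alternative sketch (my own variation, not from the original answer): key the p-values by coefficient name instead of padding positionally, so each term always lands in the same column regardless of which variables a model contains.
# collect every coefficient name occurring across all fitted models
all_terms <- unique(unlist(lapply(glm_res, function(x) rownames(coef(x)))))
p_mat <- t(sapply(glm_res, function(x) {
  coefs <- coef(x) # the coef(summary(fit)) matrix stored above
  out <- setNames(rep(NA_real_, length(all_terms)), all_terms)
  out[rownames(coefs)] <- coefs[, 4] # column 4 = Pr(>|z|)
  out
}))
p_values_named <- cbind(formula_vec, as.data.frame(p_mat))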
Using kernlab I've trained a model with code like the following:
library(kernlab)
my.model <- ksvm(result ~ f1+f2+f3, data=gold, kernel="vanilladot")
Since it's a linear model, I prefer at run-time to compute the scores as a simple weighted sum of the feature values rather than using the full SVM machinery. How can I convert the model to something like this (some made-up weights here):
> c(.bias=-2.7, f1=0.35, f2=-0.24, f3=2.31)
.bias f1 f2 f3
-2.70 0.35 -0.24 2.31
where .bias is the bias term and the rest are feature weights?
EDIT:
Here's some example data.
gold <- structure(list(result = c(-1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), f1 = c(0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1), f2 = c(13.4138113499447,
13.2216999857095, 12.964145772169, 13.1975227965938, 13.1031520152764,
13.59351759447, 13.1031520152764, 13.2700658838026, 12.964145772169,
13.1975227965938, 12.964145772169, 13.59351759447, 13.59351759447,
13.0897162110721, 13.364151238365, 12.9483051847806, 12.964145772169,
12.964145772169, 12.964145772169, 12.9483051847806, 13.0937231331592,
13.5362700880482, 13.3654209223623, 13.4356400945176, 13.59351759447,
13.2659406408724, 13.4228886221088, 13.5103065354936, 13.5642812689161,
13.3224757352068, 13.1779418771704, 13.5601730479315, 13.5457299603578,
13.3729010596517, 13.4823595997866, 13.0965264603473, 13.2710281801434,
13.4489887206797, 13.5132372154748, 13.5196188787197), f3 = c(0,
1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,
0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0)), .Names = c("result",
"f1", "f2", "f3"), class = "data.frame", row.names = c(NA, 40L
))
To get the bias, just evaluate the model with a feature vector of all zeros. To get the coefficient of the first feature, evaluate the model with a feature vector with a "1" in the first position and zeros everywhere else, then subtract the bias, which you already know. I'm afraid I don't know R syntax, but conceptually you want something like this:
bias = my.model.eval([0, 0, 0])
f1 = my.model.eval([1, 0, 0]) - bias
f2 = my.model.eval([0, 1, 0]) - bias
f3 = my.model.eval([0, 0, 1]) - bias
To test that you did it correctly, you can try something like this:
assert(bias + f1 + f2 + f3 == my.model.eval([1, 1, 1]))
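In R, that probing idea might look like the following sketch (assuming a 2-class C-svc fit, so that kernlab's predict(..., type = "decision") returns the raw decision value):
probe <- function(f1, f2, f3) {
  as.numeric(predict(my.model, data.frame(f1 = f1, f2 = f2, f3 = f3),
                     type = "decision"))
}
bias <- probe(0, 0, 0)
w1 <- probe(1, 0, 0) - bias
w2 <- probe(0, 1, 0) - bias
w3 <- probe(0, 0, 1) - bias
all.equal(bias + w1 + w2 + w3, probe(1, 1, 1)) # linear kernel: the pieces should add up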
If I'm not mistaken, I think you're asking how to extract the weight vector W of the SVM, where
W = sum_i (y_i * alpha_i * x_i)
i.e., the sum of the support vectors weighted by their coefficients. Once you have W, you can read off the weight for any feature you want.
Assuming this is correct, you'd:
1. Get the indices of your data points that are the support vectors
2. Get their weights (alphas)
3. Calculate W
kernlab stores the support-vector indices and their values in a list (so that it also works on multiclass problems); the list manipulation below is just to get at the real data. The lists returned by alpha() and alphaindex() have length 1 if you have a 2-class problem, which I'm assuming you do.
my.model <- ksvm(result ~ f1+f2+f3, data=gold, kernel="vanilladot", type="C-svc")
alpha.idxs <- alphaindex(my.model)[[1]] # Indices of SVs in original data
alphas <- alpha(my.model)[[1]]
y.sv <- gold$result[alpha.idxs]
# for unscaled data
sv.matrix <- as.matrix(gold[alpha.idxs, c('f1', 'f2', 'f3')])
weight.vector <- (y.sv * alphas) %*% sv.matrix # W = sum_i y_i * alpha_i * x_i
bias <- b(my.model)
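As a sanity check, a sketch assuming the model was fit on unscaled data (ksvm(..., scaled = FALSE)): decision values recomputed from the extracted weights should match kernlab's own, up to the sign convention kernlab uses for the first class.
X <- as.matrix(gold[, c("f1", "f2", "f3")])
manual <- as.vector(X %*% t(weight.vector)) - bias # f(x) = w . x - b
all.equal(manual, as.vector(predict(my.model, gold, type = "decision")))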
kernlab actually scales your data before fitting. You can get the (scaled) weights like so (where, I guess, the bias should be 0):
weight.vector <- (y.sv * alphas) %*% xmatrix(my.model)[[1]]
If I understood your question, this should get you what you're after.