Boxplots for overlapping data subsets with dummy variables in R

I want to visualize a comparison of group means with boxplots in ggplot2, but instead of a single categorical variable, I have several dummy vectors of 1s and 0s indicating whether each observation belongs to that group. There's some overlap, i.e., some data points belong to multiple groups simultaneously.
I'm able to get a boxplot of the values in one group, but not to add another group's values to the same plot. With as.factor() applied to a dummy variable I can get boxplots of the scores for observations in that group vs. not in that group. I've seen posts about faceting that seem like they might be helpful, but none of the examples I've found (Multiple boxplots placed side by side for different column values in ggplot, How do I make a boxplot with two categorical variables in R?) are quite what I'm trying to do.
score <- c(1, 8, 3, 5, 10, 7, 4, 3, 8, 1)
group1 <- c(0, 0, 1, 0, 1, 1, 0, 1, 0, 1)
group2 <- c(1, 1, 0, 1, 0, 1, 1, 1, 0, 0)
group3 <- c(0, 1, 0, 0, 0, 0, 0, 0, 1, 1)
df <- data.frame(score, group1, group2, group3)
library(ggplot2)
ggplot(aes(y = score, x = as.factor(group1), fill = group1), data = df) +
  geom_boxplot()  # boxes for values both inside and outside the group
ggplot(aes(y = score, x = as.numeric(group1), fill = group1), data = df) +
  geom_boxplot()  # box for just those values where group1 == 1
I want to end up with either (a) multiple plots like what I get from the first ggplot call, or (b) multiple plots like what I get from the second. The former includes a boxplot for the values outside the group; the latter does not. It would also be cool to have a boxplot for the overall scores, but I'm really not sure what's feasible.

I'm not quite sure whether you want box plots only for those with dummy = 1. In any case, data.table::melt can be useful here: it gives you a long format that is easy to plot.
library(data.table)
dat.m <- melt(dat, measure.vars = 2:4)  # long format: score, variable (group1-group3), value (0/1)
boxplot(score ~ value + variable, dat.m[dat.m$value == 1, ])  # keep only the dummy == 1 rows
Yields a plot with one box per group (group1, group2, group3), drawn from the selected rows.
Data
dat <- structure(list(score = c(1, 8, 3, 5, 10, 7, 4, 3, 8, 1), group1 = c(0,
0, 1, 0, 1, 1, 0, 1, 0, 1), group2 = c(1, 1, 0, 1, 0, 1, 1, 1,
0, 0), group3 = c(0, 1, 0, 0, 0, 0, 0, 0, 1, 1)), class = "data.frame", row.names = c(NA,
-10L))
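Since the question asked for ggplot2, roughly the same picture can be drawn from the melted data as well; a minimal sketch, reusing dat.m from above and keeping only the dummy = 1 rows:
library(ggplot2)
# one box per group, built from the rows where the dummy equals 1
ggplot(dat.m[dat.m$value == 1, ], aes(x = variable, y = score)) +
  geom_boxplot()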

Related

How to calculate specificity, sensitivity, predictive values and the ROC curve in R?

I have a rather small dataset resulting from a linkage between two different datasets. I would like to know how I can calculate specificity, sensitivity, and predictive values, and plot the ROC curve. This is the first time I'm using this kind of statistics in R, so I don't even know how to start.
Part of the data looks like this:
data <- data.frame(NMM_TOTAL = c(1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1),
                   CPAV_TOTAL = c(0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0),
                   SIH_NMM_TOTAL = c(0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1),
                   SIH_CPAV_TOTAL = c(1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1))
And the two way tables would be the combination of:
tab1 <- table(data$SIH_NMM_TOTAL, data$NMM_TOTAL)
tab2 <- table(data$SIH_CPAV_TOTAL, data$CPAV_TOTAL)
Where NMM_TOTAL and CPAV_TOTAL are the "gold standard". I don't know if any of this makes sense. Thanks in advance!
Note: 1 stands for positive and 0 for negative.
Let's work with tab1 to demonstrate specificity, sensitivity, and predictive values. Consider labeling the rows and columns of your tables to enhance clarity:
act <- data$SIH_NMM_TOTAL  # test result
ref <- data$NMM_TOTAL      # gold standard
table(act, ref)
Load the caret library:
library(caret)
The inputs need to be factors:
act <- factor(act)
ref <- factor(ref)
The commands look like this:
sensitivity(act, ref)
specificity(act, ref)
posPredValue(act, ref)
negPredValue(act, ref)
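One caveat: by default caret treats the first factor level ("0" here) as the positive class. Since 1 stands for positive in this data, it is probably safer to name the classes explicitly; a sketch, assuming the positive/negative arguments of these caret functions:
sensitivity(act, ref, positive = "1")   # "1" is the positive class per the question
specificity(act, ref, negative = "0")
posPredValue(act, ref, positive = "1")
negPredValue(act, ref, negative = "0")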
ROC curve: the Receiver Operating Characteristic (ROC) curve is used to assess the accuracy of a continuous measurement for predicting a binary outcome. It is not clear from your data that you can plot an ROC curve. Let me show you a simple example of how to generate one, drawn from https://cran.r-project.org/web/packages/plotROC/vignettes/examples.html
library(ggplot2)
library(plotROC)
set.seed(1)
D.ex <- rbinom(200, size = 1, prob = .5)
M1 <- rnorm(200, mean = D.ex, sd = .65)
test <- data.frame(D = D.ex, D.str = c("Healthy", "Ill")[D.ex + 1],
                   M1 = M1, stringsAsFactors = FALSE)
head(test)
ggplot(test, aes(d = D, m = M1)) + geom_roc()
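If you also want the area under the curve, plotROC can compute it from the same plot object; a small sketch, assuming calc_auc() as shown in the plotROC vignette:
roc_plot <- ggplot(test, aes(d = D, m = M1)) + geom_roc()
calc_auc(roc_plot)  # area under the fitted ROC curve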

R - How to run a GWAS analysis with no position data?

Hello everyone!
I am trying to run a GWAS analysis in R on some very simple genetic data. It only contains the SNPs and one outcome variable (as well as an ID variable for each observation).
Everything I have found online includes chromosome and position data. I have that for the SNPs, but in a separate file. (My plan is to map the SNPs after the relevant ones have been selected).
How can I go about running a GWAS analysis on this data? Would I need the position data, or could I use another method to filter down to only the most significant SNPs?
I tried this, but it didn't work, because my data is not a gData object.
# SNPs are in A/B notation, with 0 = AA, 1 = AB, and 2 = BB
library(statgenGWAS)
id <- c("person1", "person2", "person3", "person4", "person5", "person6", "person7", "person8", "person9", "person10")
snp1 <- c(0, 1, 2, 2, 1, 0, 0, 0, 1, 1)
snp2 <- c(2, 2, 2, 1, 1, 1, 0, 0, 0, 1)
snp3 <- c(0, 0, 2, 2, 0, 2, 1, 0, 2, 2)
diagnosis <- c(0, 1, 1, 0, 0, 1, 1, 0, 1, 1)
data <- as.data.frame(cbind(id, snp1, snp2, snp3, diagnosis))
gwas1a <- runSingleTraitGwas(gData = data,
                             traits = "diagnosis")
Any help here is appreciated.
Thank you!
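One simple way to rank SNPs without positional information is a per-SNP association test; a minimal sketch using base R logistic regression (not a full GWAS: no multiple-testing correction or population-structure control, and note that the cbind() call above coerces every column to character):
# rebuild the data frame so the SNP and outcome columns stay numeric
gdat <- data.frame(id, snp1, snp2, snp3, diagnosis)
snp_cols <- c("snp1", "snp2", "snp3")
# fit diagnosis ~ snp separately for each SNP and keep the Wald p-value of the dosage term
pvals <- sapply(snp_cols, function(s) {
  fit <- glm(reformulate(s, response = "diagnosis"), data = gdat, family = binomial)
  coef(summary(fit))[s, "Pr(>|z|)"]
})
sort(pvals)  # smallest p-values first; the top SNPs could then be mapped afterwards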

Am I using dbRDA correctly with ordered variables? What are the *.L and *.Q arrows in the output?

When I apply a dbRDA to a distance matrix (in this case the Bray-Curtis distance) like this:
dbrda(sqrt(dist) ~ ., site_vars)
is it OK to include a column of ordered factors in the site_vars variable? This is a data frame with values measured at the sampling sites, e.g. mean temperature, but it also includes a column "Soil" where different soil types are ordered. Or is it necessary to add all the ordinal and nominal variables in a separate Condition argument to the formula?
Here is a small example:
library(vegan)  # for vegdist() and dbrda()
data <- rbind(
  c(1, 1, 0, 1, 1, 0, 0, 0, 0, 0),
  c(1, 1, 1, 0, 1, 1, 0, 0, 0, 0),
  c(0, 1, 0, 1, 0, 1, 1, 0, 1, 0),
  c(1, 0, 0, 0, 1, 0, 1, 1, 1, 0),
  c(0, 0, 0, 1, 0, 0, 0, 0, 1, 1)
)
rownames(data) <- c("Site_1", "Site_2", "Site_3", "Site_4", "Site_5")
colnames(data) <- c("Spec_1", "Spec_2", "Spec_3", "Spec_4", "Spec_5", "Spec_6", "Spec_7", "Spec_8", "Spec_9", "Spec_10")
dist <- vegdist(data, "bray")
site_vars <- data.frame(
  Tmean = c(9, 10, 12, 14.5, 14),
  SomethingElse = c(12, 14, 13, 16, 21),
  Soil = c("good", "good", "OK", "OK", "bad")
)
site_vars$Soil <- ordered(site_vars$Soil, levels = c("good", "OK", "bad"))
# Version 1
dbRDA_Condition <- dbrda(sqrt(dist) ~ Tmean + SomethingElse + Condition(Soil), site_vars)
plot(dbRDA_Condition)
# Version 2
dbRDA <- dbrda(sqrt(dist) ~ Tmean + SomethingElse + Soil, site_vars)
plot(dbRDA)
Version 1 seems to disregard the fact that my soil variable is ranked. Version 2 generates an output I find a bit tricky to interpret, because in addition to the group centroids it also displays arrows. I would expect one arrow for Soil, as if it were a numerical variable with the values 1, 2, and 3 instead of three levels. However, it shows two arrows, labeled Soil.L and Soil.Q. Why are there two arrows for one variable, and what do *.L and *.Q stand for? Unfortunately, I haven't found any explanation.
R analyses factors using contrasts. For unordered factors the default contrasts are differences from the first factor level. For ordered factors, R uses polynomial contrasts: linear (L), quadratic (Q), cubic (C), fourth-order (^4), and so on. Check any guide to the R statistical environment. dbrda does not invent this feature; it is standard R behaviour.
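You can see where these columns come from by inspecting the contrasts R builds for the ordered factor, for example:
contrasts(site_vars$Soil)
# returns a matrix with a linear (.L) and a quadratic (.Q) column, one row per level
# (good, OK, bad); for a 3-level factor .L is roughly (-0.71, 0, 0.71) and .Q is (0.41, -0.82, 0.41)
# dbrda fits one coefficient per contrast column, hence the two arrows Soil.L and Soil.Q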

Selecting Rows in a Column Contingent on Two Variables in R

I am working with a data set that contains multiple observations for each prescription a patient is taking, with many different patients. Patients typically take one of several drugs, which are indicated as their own binary variables, Drug1, Drug2 and so on.
I am attempting to pull out only the individuals that have switched from one drug to the other, i.e., have a 1 in the Drug1 column and a 1 in Drug2, but these occur in different rows.
I have attempted to use newdata <- mydata[which(Drug1 == 1 & Drug2 == 1),]; however, this assumes that the 1's are in the same row, which they are not.
Is there a way to select the patients that have received both drugs, but the indicator variables are in different rows?
Thank you
I believe this is a solution to what you are asking using dplyr.
data <- data.frame(id = rep(c(1, 2, 3, 4), each = 2),
                   drug1 = c(1, 0, 0, 0, 0, 1, 1, 1),
                   drug2 = c(0, 1, 1, 1, 1, 0, 0, 0))
library(dplyr)
data %>%
  group_by(id) %>%
  mutate(both_drugs = ifelse(any(drug1 == 1) & any(drug2 == 1), 1, 0)) %>%
  filter(both_drugs == 1)
Try creating a variable for each drug that indicates whether or not it was the only drug taken at that time by that individual.
data <- data.frame(id = rep(c(1, 2, 3, 4), each = 3),
                   drug1 = c(1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0),
                   drug2 = c(0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0))
library(dplyr)
data %>%
  group_by(id) %>%
  mutate(drug1only = ifelse(drug1 == 1 & drug2 == 0, 1, 0),
         drug2only = ifelse(drug2 == 1 & drug1 == 0, 1, 0)) %>%
  summarise(drug_switch = ifelse(max(drug1only) + max(drug2only) == 2, 1, 0))
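A base R alternative, if you prefer to avoid dplyr, is to intersect the ids seen with each drug; a small sketch using the same example data:
ids_both <- intersect(data$id[data$drug1 == 1], data$id[data$drug2 == 1])
data[data$id %in% ids_both, ]  # all rows for patients who received both drugs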

R Return p-values for categorical independent variables with glm

I recently asked a question about looping a glm command over all possible combinations of independent variables. Another user provided a great answer that runs all possible models; however, I can't figure out how to produce a data.frame of all the p-values.
The code suggested in the previous question works for independent variables that are binary (pasted below). However, several of my variables are categorical. Is there any way to adjust the code so that I can produce a table of all p-values for every possible model (there are 2,046 possible models with 10 independent variables...)?
# p-values in a data.frame
p_values <-
  cbind(formula_vec, as.data.frame(do.call(rbind,
    lapply(glm_res, function(x) {
      coefs <- coef(x)
      rbind(c(coefs[, 4], rep(NA, length(ind_vars) - length(coefs[, 4]) + 1)))
    })
  )))
An example of one independent variable is "Bedrock" where possible categories include: "till," "silt," and "glacial deposit." It's not feasible to assign a numerical value to these variables, which is part of the problem. Any suggestions would be appreciated.
With an additional categorical variable IndVar4 (a factor with levels a, b, c), the coefficient table can grow by more than one row. Adding variable IndVar4:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.7548180 1.4005800 -1.2529223 0.2102340
IndVar1 -0.2830926 1.2076534 -0.2344154 0.8146625
IndVar2 0.1894432 0.1401217 1.3519903 0.1763784
IndVar3 0.1568672 0.2528131 0.6204867 0.5349374
IndVar4b 0.4604571 1.0774018 0.4273773 0.6691045
IndVar4c 0.9084545 1.0943227 0.8301523 0.4064527
The maximum number of coefficients is bounded by the number of variables plus the number of extra factor levels:
max_values <- length(ind_vars) +
  sum(sapply(dfPRAC, function(x) pmax(length(levels(x)) - 1, 0)))
So the new corrected function is:
p_values <-
  cbind(formula_vec, as.data.frame(do.call(rbind,
    lapply(glm_res, function(x) {
      coefs <- coef(x)
      rbind(c(coefs[, 4], rep(NA, max_values - length(coefs[, 4]) + 1)))
    })
  )))
But the result is not as clean as with continuous variables. I think Metrics' idea of converting every categorical variable to (levels - 1) dummy variables gives the same results with perhaps a cleaner presentation.
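If you do go the dummy-variable route, model.matrix() does the (levels - 1) expansion for you; a quick sketch with a hypothetical Bedrock factor like the one described in the question:
bedrock <- factor(c("till", "silt", "glacial deposit", "till"))
model.matrix(~ bedrock)[, -1]  # (levels - 1) dummy columns; the first level is the reference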
Data:
dfPRAC <- structure(list(DepVar1 = c(0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1,
1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1), DepVar2 = c(0, 1, 0, 0,
1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1),
IndVar1 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
0, 0, 0, 1, 0, 0, 0, 1, 0),
IndVar2 = c(1, 3, 9, 1, 5, 1,
1, 8, 4, 6, 3, 15, 4, 1, 1, 3, 2, 1, 10, 1, 9, 9, 11, 5),
IndVar3 = c(0.500100322564443, 1.64241601558441, 0.622735778490702,
2.42429812749226, 5.10055213237027, 1.38479786027561, 7.24663629203007,
0.5102348706939, 2.91566510995229, 3.73356170379198, 5.42003495939846,
1.29312896116503, 3.33753833987496, 0.91783513806083, 4.7735736131668,
1.17609362602233, 5.58010703426296, 5.6668754863739, 1.4377813063642,
5.07724130837643, 2.4791994535923, 2.55100067348583, 2.41043629522981,
2.14411703944206)), .Names = c("DepVar1", "DepVar2", "IndVar1",
"IndVar2", "IndVar3"), row.names = c(NA, 24L), class = "data.frame")
dfPRAC$IndVar4 <- factor(rep(c("a", "b", "c"),8))
dfPRAC$IndVar5 <- factor(rep(c("d", "e", "f", "g"),6))
Set up the models:
dep_vars <- c("DepVar1", "DepVar2")
ind_vars <- c("IndVar1", "IndVar2", "IndVar3", "IndVar4", "IndVar5")
# create all combinations of ind_vars
ind_vars_comb <-
  unlist(sapply(seq_len(length(ind_vars)), function(i) {
    apply(combn(ind_vars, i), 2, function(x) paste(x, collapse = "+"))
  }))
# pair with dep_vars:
var_comb <- expand.grid(dep_vars, ind_vars_comb )
# formulas for all combinations
formula_vec <- sprintf("%s ~ %s", var_comb$Var1, var_comb$Var2)
# create models
glm_res <- lapply(formula_vec, function(f) {
  fit1 <- glm(f, data = dfPRAC, family = binomial("logit"))
  fit1$coefficients <- coef(summary(fit1))  # store the full coefficient table (incl. p-values)
  return(fit1)
})
names(glm_res) <- formula_vec
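If the NA-padded layout above feels messy, one alternative (not from the original answer, and assuming dplyr is installed) is to keep the coefficient names and let dplyr::bind_rows() align the p-values by term:
p_list <- lapply(glm_res, function(x) {
  pv <- x$coefficients[, 4]                     # the Pr(>|z|) column stored above
  data.frame(as.list(pv), check.names = FALSE)  # one-row data frame, named by term
})
p_values_named <- cbind(formula = formula_vec, dplyr::bind_rows(p_list))  # missing terms become NA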
