I have a file that represents the gene structure of bacteria models. Each row represents a model. A row is a fixed length binary string of which genes are present (1 for present and 0 for absent). My task is to compare the gene sequence for each pair of models and get a score of how similar they are and computer a dissimilarity matrix.
In total there are 450 models (rows) in one file and there are 250 files. I have a working code however it takes roughly 1.6 hours to do the whole thing for only one file.
#Sample Data
Generation: 0
[0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0]
[1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1]
[1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0]
[0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0]
[0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0]
[1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
What my code does:
Reads the file
Convert the binary string into a data frame Gene, Model_1, Model_2,
Model_3, … Model_450
Run a nested for loop to do the pair-wise comparison (only the top
half of the matrix) – I take the two corresponding columns and add
them, then count the positions where the sum is 2 (meaning present
in both models)
Write the data to a file
Create the matrix later
comparison code
generationFiles = list.files(pattern = "^Generation.*\\_\\d+.txt$")
start.time = Sys.time()
for(a in 1:length(generationFiles)){
fname = generationFiles[a]
geneData = read.table(generationFiles[a], sep = "\n", header = T, stringsAsFactors = F)
geneCount = str_count(geneData[1,1],"[1|0]")
geneDF <- data.frame(Gene = paste0("Gene_", c(1:geneCount)), stringsAsFactors = F)
#convert the string into a data frame
for(i in 1:nrow(geneData)){
#remove the square brackets
dataRow = substring(geneData[i,1], 2, nchar(geneData[i,1]) - 1)
#removing white spaces
dataRow = gsub(" ", "", dataRow, fixed = T)
#splitting the string
dataRow = strsplit(dataRow, ",")
#converting to numeric
dataRow = as.numeric(unlist(dataRow))
colName = paste("M_",i,sep = "")
geneDF <- cbind(geneDF, dataRow)
colnames(geneDF)[colnames(geneDF) == 'dataRow'] <- colName
dataRow <- NULL
}
summaryDF <- data.frame(Model1 = character(), Model2 = character(), Common = integer(),
Uncommon = integer(), Absent = integer(), stringsAsFactors = F)
modelNames = paste0("M_",c(1:450))
secondaryLevel = modelNames
fileName = paste0("D://BellosData//GC_3//Summary//",substr(fname, 1, nchar(fname) - 4),"_Summary.txt")
for(x in 1:449){
secondaryLevel = secondaryLevel[-1]
for(y in 1:length(secondaryLevel)){
result = geneDF[modelNames[x]] + geneDF[secondaryLevel[y]]
summaryDF <- rbind(summaryDF, data.frame(Model1 = modelNames[x],
Model2 = secondaryLevel[y],
Common = sum(result == 2),
Uncommon = sum(result == 1),
Absent = sum(result == 0)))
}
}
write.table(summaryDF, fileName, sep = ",", quote = F, row.names = F)
geneDF <- NULL
summaryDF <- NULL
geneData <-NULL
}
converting to matrix
maxNum = max(summaryDF$Common)
normalizeData = summaryDF[,c(1:3)]
normalizeData[c('Common')] <- lapply(normalizeData[c('Common')], function(x) 1 - x/maxNum)
normalizeData[1:2] <- lapply(normalizeData[1:2], factor, levels=unique(unlist(normalizeData[1:2])))
distMatrixN = xtabs(Common~Model1+Model2, data=normalizeData)
distMatrixN = distMatrixN + t(distMatrixN)
Is there a way to make the process run faster? Is there a more efficient way to do the comparison?
This code should be faster. Nested loops are nightmare slow in R. Operations like rbind-ing one row at a time is also among the worst and slowest ideas in R programming.
Generate 450 rows with 20 elements of 0, 1 on each row.
M = do.call(rbind, replicate(450, sample(0:1, 20, replace = T), simplify = F))
Generate list of combination(450, 2) numbers of row pairs
L = split(v<-t(utils::combn(450, 2)), seq(nrow(v))); rm(v)
Apply whatever comparison function you want. In this case, the number of 1's at the same position for each row combinations. If you want to calculate different metrics, just write another function(x) where M[x[1],] is the first row and M[x[2],] is the second row.
O = lapply(L, function(x) sum(M[x[1],]&M[x[2],]))
Code takes ~4 seconds a fairly slow 2.6 Ghz Sandy Bridge
Get a clean data.frame with your results, three columns : row 1, row 2, metric between the two rows
data.frame(row1 = sapply(L, `[`, 1),
row2 = sapply(L, `[`, 2),
similarity_metric = do.call(rbind, O))
To be honest, I didn't thoroughly comb through your code to replicate exactly what you were doing. If this is not what you are looking for (or can't be modified to achieve what you are looking for), leave a comment.
Related
I have a rather small dataset resulted from a linkage between two different datasets. I would like to know how can I calculate specificity, sensibility, predictive values and plot the ROC curve. This is the first time I'm using this kind of statistics in R, so I don't even know how to start.
Part of the data looks like this:
data <- data.frame(NMM_TOTAL = c(1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1),
CPAV_TOTAL = c(0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0),
SIH_NMM_TOTAL = c(0, 0, 0, 1, 1, 1, 1, 1, 1, 0 , 0, 1, 1, 0, 1),
SIH_CPAV_TOTAL = c(1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1))
And the two way tables would be the combination of:
tab1 <- table(data$SIH_NMM_TOTAL, data$NMM_TOTAL)
tab2 <- table(data$SIH_CPAV_TOTAL, data$CPAV_TOTAL)
Where NMM_TOTAL and CPAV_TOTAL are the "gold standard". I don't know if any of this makes sense. Thanks in advance!
Obs: 1 stands for positive and 0 for negative.
Let's work with tab1 to demonstrate specificity, sensitivity, and predictive values. Consider labeling the rows and columns of your tables to enhance clarity
act <- data$SIH_NMM_TOTAL
ref <- data$NMM_TOTAL
table(act,ref)
Load this library
library(caret)
The input data needs to be factors
act <- factor(act)
ref <- factor(ref)
The commands look like this
sensitivity(act, ref)
specificity(act, ref)
posPredValue(act, ref)
negPredValue(act, ref)
ROC curve. The Receiver Operating Characteristic (ROC) curve is used to assess the accuracy of a continuous measurement for predicting a binary outcome. It is not clear from your data that you can plot an ROC curve. Let me show you a simple example on how to generate one. The example is drawn from https://cran.r-project.org/web/packages/plotROC/vignettes/examples.html
library(ggplot2)
library(plotROC)
set.seed(1)
D.ex <- rbinom(200, size = 1, prob = .5)
M1 <- rnorm(200, mean = D.ex, sd = .65)
test <- data.frame(D = D.ex, D.str = c("Healthy", "Ill")[D.ex + 1],
M1 = M1, stringsAsFactors = FALSE)
head(test)
ggplot(test, aes(d = D, m = M1)) + geom_roc()
everyone!
I am trying to run a GWAS analysis in R on some very simple genetic data. It only contains the SNPs and one outcome variable (as well as an ID variable for each observation).
Everything I have found online includes chromosome and position data. I have that for the SNPs, but in a separate file. (My plan is to map the SNPs after the relevant ones have been selected).
How can I go about running a GWAS analysis on this data? Would I need to, or could I use another method to filter to only the most significant SNPs?
I tried this, but it didn't work, because my data is not a gData object.
# SNPs are in A/B notation, with 0 = AA, 1 = AB, and 2 = BB
library(statgenGWAS)
id <- c("person1", "person2", "person3", "person4", "person5", "person6", "person7", "person8", "person9", "person10")
snp1 <- c(0, 1, 2, 2, 1, 0, 0, 0, 1, 1)
snp2 <- c(2, 2, 2, 1, 1, 1, 0, 0, 0, 1)
snp3 <- c(0, 0, 2, 2, 0, 2, 1, 0, 2, 2)
diagnosis <- c(0, 1, 1, 0, 0, 1, 1, 0, 1, 1)
data <- as.data.frame(cbind(id, snp1, snp2, snp3, diagnosis))
gwas1a <- runSingleTraitGwas(gData = data,
traits = "diagnosis")
Any help here is appreciated.
Thank you!
I have a data frame df in which for each column I want to calculate what share of occurrences also occur in another column. Each row of occurrences has a weight so ideally I would like to get a weighted share.
A <- c(0, 1, 0, 0, 1, 0, 1, 1, 1, 0)
B <- c(0, 1, 0, 1, 1, 0, 0, 0, 0, 0)
C <- c(0, 0, 0, 1, 1, 0, 0, 0, 0, 1)
D <- c(1, 0, 0, 1, 1, 0, 0, 0, 0, 0)
weight <- c(0.5, 1, 0.2, 0.3, 1.4, 1.5, 0.8, 1.2, 1, 0.9)
df <- data.frame(A, B, C, D, weight)
I was trying to calculate it for each column pair this way:
#total weight of occurences in A
wgt_A <- df%>%
filter(A == 1)%>%
summarise(weight_A = sum(weight))%>%
select(weight_A)
#weighted share of occurrences in A that also occur in B
wgt_A_B <- df%>%
filter(A == 1, B == 1)%>%
summarise(weight_A_B = sum(weight))%>%
select(weight_A_B)
Result_1 <- wgt_A_B / wgt_A
I would want to end up with six results in total for all combinations of the 4 columns. However, for this I would need to replicate this dplyr pipe a lot of times and my actual dataset has 20+ columns like this. Is there a more efficient/quicker way to do this with apply/sapply or some kind of loop where I can also select for which columns I want to perform this?
I'm new to R and stackoverflow so please let me know (and excuse me) if I'm doing/saying anything stupid
We may use combn to do the combinations in base R
out <- combn(df[1:4], 2, FUN = function(x)
sum(df$weight[x[[1]] & x[[2]]])/ sum(df$weight[as.logical(x[[1]])]) )
names(out) <- combn(names(df)[1:4], 2, FUN = paste, collapse = "_")
-output
> out
A_B A_C A_D B_C B_D C_D
0.4444444 0.2592593 0.2592593 0.6296296 0.6296296 0.6538462
I recently asked a question about looping a glm command for all possible combinations of independent variables. Another user provided a great answer that runs all possible models, however I can't figure out how to produce a data.frame of all possible p-values.
The code suggested in the previous question works for independent variables that are binary (pasted below). However, several of my variables are categorical. Is there any way to adjust the code so that I can produce a table of all p-values for every possible model (there are 2,046 possible models with 10 independent variables...)?
# p-values in a data.frame
p_values <-
cbind(formula_vec, as.data.frame ( do.call(rbind,
lapply(glm_res, function(x) {
coefs <- coef(x)
rbind(c(coefs[,4] , rep(NA, length(ind_vars) - length(coefs[,4]) + 1)))
})
)))
An example of one independent variable is "Bedrock" where possible categories include: "till," "silt," and "glacial deposit." It's not feasible to assign a numerical value to these variables, which is part of the problem. Any suggestions would be appreciated.
In case of additional categorical variable IndVar4 (factor a, b, c) the coefficient table can be more than just a row longer. Adding variable IndVar4:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.7548180 1.4005800 -1.2529223 0.2102340
IndVar1 -0.2830926 1.2076534 -0.2344154 0.8146625
IndVar2 0.1894432 0.1401217 1.3519903 0.1763784
IndVar3 0.1568672 0.2528131 0.6204867 0.5349374
IndVar4b 0.4604571 1.0774018 0.4273773 0.6691045
IndVar4c 0.9084545 1.0943227 0.8301523 0.4064527
Max number of rows is less then all variables + all categories:
max_values <- length(ind_vars) +
sum(sapply( dfPRAC, function(x) pmax(length(levels(x))-1,0)))
So the new corrected function is:
p_values <-
cbind(formula_vec, as.data.frame ( do.call(rbind,
lapply(glm_res, function(x) {
coefs <- coef(x)
rbind(c(coefs[,4] , rep(NA, max_values - length(coefs[,4]) + 1)))
})
)))
But the result is not so clean as with continuous variables. I think Metrics' idea to convert every categorical variable to (levels-1) dummy variables gives same results and maybe cleaner presentation.
Data:
dfPRAC <- structure(list(DepVar1 = c(0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1,
1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1), DepVar2 = c(0, 1, 0, 0,
1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1),
IndVar1 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
0, 0, 0, 1, 0, 0, 0, 1, 0),
IndVar2 = c(1, 3, 9, 1, 5, 1,
1, 8, 4, 6, 3, 15, 4, 1, 1, 3, 2, 1, 10, 1, 9, 9, 11, 5),
IndVar3 = c(0.500100322564443, 1.64241601558441, 0.622735778490702,
2.42429812749226, 5.10055213237027, 1.38479786027561, 7.24663629203007,
0.5102348706939, 2.91566510995229, 3.73356170379198, 5.42003495939846,
1.29312896116503, 3.33753833987496, 0.91783513806083, 4.7735736131668,
1.17609362602233, 5.58010703426296, 5.6668754863739, 1.4377813063642,
5.07724130837643, 2.4791994535923, 2.55100067348583, 2.41043629522981,
2.14411703944206)), .Names = c("DepVar1", "DepVar2", "IndVar1",
"IndVar2", "IndVar3"), row.names = c(NA, 24L), class = "data.frame")
dfPRAC$IndVar4 <- factor(rep(c("a", "b", "c"),8))
dfPRAC$IndVar5 <- factor(rep(c("d", "e", "f", "g"),6))
Set up the models:
dep_vars <- c("DepVar1", "DepVar2")
ind_vars <- c("IndVar1", "IndVar2", "IndVar3", "IndVar4", "IndVar5")
# create all combinations of ind_vars
ind_vars_comb <-
unlist( sapply( seq_len(length(ind_vars)),
function(i) {
apply( combn(ind_vars,i), 2, function(x) paste(x, collapse = "+"))
}))
# pair with dep_vars:
var_comb <- expand.grid(dep_vars, ind_vars_comb )
# formulas for all combinations
formula_vec <- sprintf("%s ~ %s", var_comb$Var1, var_comb$Var2)
# create models
glm_res <- lapply( formula_vec, function(f) {
fit1 <- glm( f, data = dfPRAC, family = binomial("logit"))
fit1$coefficients <- coef( summary(fit1))
return(fit1)
})
names(glm_res) <- formula_vec
I am trying to write a function which will have a numeric vector "x" on input and will create on output a numeric vector of xi indexes, such that x(i) == x(i+1)
By far I wrote such function neighbor:
neighbor <- function(l) {
stopifnot(is.numeric(l))
w <- sapply(l, function(x) which(l[x]==l[x+1]))
w
}
So executing this instruction:
neighbor(c(1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0))
Should produce a numeric vector:
1, 4, 5, 7, 9
But I cannot get it working. Any ideas?
I am searching for an elegant solution without control-flow and if-else instructions.
diff will help with this:
x <- c(1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0)
which(diff(x) == 0)