Replicating dplyr pipe structure with apply family or loop - r

I have a data frame df in which for each column I want to calculate what share of occurrences also occur in another column. Each row of occurrences has a weight so ideally I would like to get a weighted share.
A <- c(0, 1, 0, 0, 1, 0, 1, 1, 1, 0)
B <- c(0, 1, 0, 1, 1, 0, 0, 0, 0, 0)
C <- c(0, 0, 0, 1, 1, 0, 0, 0, 0, 1)
D <- c(1, 0, 0, 1, 1, 0, 0, 0, 0, 0)
weight <- c(0.5, 1, 0.2, 0.3, 1.4, 1.5, 0.8, 1.2, 1, 0.9)
df <- data.frame(A, B, C, D, weight)
I was trying to calculate it for each column pair this way:
library(dplyr)
# total weight of occurrences in A
wgt_A <- df%>%
filter(A == 1)%>%
summarise(weight_A = sum(weight))%>%
select(weight_A)
#weighted share of occurrences in A that also occur in B
wgt_A_B <- df%>%
filter(A == 1, B == 1)%>%
summarise(weight_A_B = sum(weight))%>%
select(weight_A_B)
Result_1 <- wgt_A_B / wgt_A
I would want to end up with six results in total for all combinations of the 4 columns. However, for this I would need to replicate this dplyr pipe a lot of times and my actual dataset has 20+ columns like this. Is there a more efficient/quicker way to do this with apply/sapply or some kind of loop where I can also select for which columns I want to perform this?
I'm new to R and stackoverflow so please let me know (and excuse me) if I'm doing/saying anything stupid

We can use combn in base R to iterate over the column pairs:
out <- combn(df[1:4], 2, FUN = function(x)
  sum(df$weight[x[[1]] & x[[2]]]) / sum(df$weight[as.logical(x[[1]])]))
names(out) <- combn(names(df)[1:4], 2, FUN = paste, collapse = "_")
Output:
> out
      A_B       A_C       A_D       B_C       B_D       C_D
0.4444444 0.2592593 0.2592593 0.6296296 0.6296296 0.6538462
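If both directions are needed (the share of A that also occurs in B, and the share of B that also occurs in A), a matrix product can give all ordered pairs at once. This is a minimal sketch rather than part of the answer above; `cols`, `X`, `co` and `share` are names introduced here for illustration, and `cols` can be edited to choose which columns take part.

```r
A <- c(0, 1, 0, 0, 1, 0, 1, 1, 1, 0)
B <- c(0, 1, 0, 1, 1, 0, 0, 0, 0, 0)
C <- c(0, 0, 0, 1, 1, 0, 0, 0, 0, 1)
D <- c(1, 0, 0, 1, 1, 0, 0, 0, 0, 0)
weight <- c(0.5, 1, 0.2, 0.3, 1.4, 1.5, 0.8, 1.2, 1, 0.9)
df <- data.frame(A, B, C, D, weight)

cols <- c("A", "B", "C", "D")   # hypothetical selection of indicator columns
X <- as.matrix(df[cols])        # 0/1 matrix, one column per indicator

# co[i, j] = total weight of the rows where both column i and column j are 1;
# the diagonal holds each column's own total weight
co <- crossprod(X, X * df$weight)

# Divide row i by the total weight of column i:
# share[i, j] = weighted share of occurrences in i that also occur in j
share <- co / diag(co)
round(share, 4)
```

Here `share["A", "B"]` corresponds to the `A_B` value from the combn approach, while `share["B", "A"]` gives the reverse direction.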

Related

R - How to run a GWAS analysis with no position data?

everyone!
I am trying to run a GWAS analysis in R on some very simple genetic data. It only contains the SNPs and one outcome variable (as well as an ID variable for each observation).
Everything I have found online includes chromosome and position data. I have that for the SNPs, but in a separate file. (My plan is to map the SNPs after the relevant ones have been selected).
How can I go about running a GWAS analysis on this data? Would I need to, or could I use another method to filter to only the most significant SNPs?
I tried this, but it didn't work, because my data is not a gData object.
# SNPs are in A/B notation, with 0 = AA, 1 = AB, and 2 = BB
library(statgenGWAS)
id <- c("person1", "person2", "person3", "person4", "person5", "person6", "person7", "person8", "person9", "person10")
snp1 <- c(0, 1, 2, 2, 1, 0, 0, 0, 1, 1)
snp2 <- c(2, 2, 2, 1, 1, 1, 0, 0, 0, 1)
snp3 <- c(0, 0, 2, 2, 0, 2, 1, 0, 2, 2)
diagnosis <- c(0, 1, 1, 0, 0, 1, 1, 0, 1, 1)
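# data.frame() keeps the SNP columns numeric; as.data.frame(cbind(...)) would coerce them to character
data <- data.frame(id, snp1, snp2, snp3, diagnosis)
# Fails: runSingleTraitGwas() expects a gData object, not a data frame
gwas1a <- runSingleTraitGwas(gData = data,
                             traits = "diagnosis")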
Any help here is appreciated.
Thank you!
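One way to rank SNPs without any chromosome or position information is a plain per-SNP association test. The sketch below does not use statgenGWAS; it swaps in one base R logistic regression per SNP, assuming additive 0/1/2 coding and a binary outcome. The names `dat`, `snp_cols` and `result` are introduced here for illustration, and with only ten observations the p-values are not meaningful.

```r
# Toy data from the question, built with data.frame() so SNP columns stay numeric
id <- paste0("person", 1:10)
snp1 <- c(0, 1, 2, 2, 1, 0, 0, 0, 1, 1)
snp2 <- c(2, 2, 2, 1, 1, 1, 0, 0, 0, 1)
snp3 <- c(0, 0, 2, 2, 0, 2, 1, 0, 2, 2)
diagnosis <- c(0, 1, 1, 0, 0, 1, 1, 0, 1, 1)
dat <- data.frame(id, snp1, snp2, snp3, diagnosis)

snp_cols <- grep("^snp", names(dat), value = TRUE)

# One logistic regression per SNP; keep the p-value of the SNP coefficient
p_values <- sapply(snp_cols, function(snp) {
  fit <- glm(reformulate(snp, response = "diagnosis"),
             data = dat, family = binomial)
  coef(summary(fit))[snp, "Pr(>|z|)"]
})

# Rank SNPs by p-value and adjust for multiple testing (Benjamini-Hochberg)
result <- data.frame(snp = snp_cols,
                     p = p_values,
                     p_adj = p.adjust(p_values, method = "BH"))
result <- result[order(result$p), ]
result
```

The SNPs that pass a chosen threshold on `p_adj` could then be joined to the separate position file for mapping.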

Unlist the second to last list of a nested list

I have a deeply nested list of lists. In the "center" of the nested list is a vector containing n integers. I need to count how many integers are in each nested list, then unlist one level above to have a vector of these counts (i.e., instead of list(0, 1:5, 0, 0, 1:3) at the center of the nest, I want c(0, 5, 0, 0, 3)).
This seems relatively simple: I was able to use rapply to accomplish the first part, i.e. convert list(0, 1:5, 0, 0, 1:3) to list(0, 5, 0, 0, 3). The specific question I need help with is how to unlist the innermost lists to a vector (instead of list(0, 5, 0, 0, 3) I want c(0, 5, 0, 0, 3)).
I have searched and tried various apply, lapply, and unlist approaches, but none of them are quite right because they target the very innermost list. Since the list I want to unlist is the second-to-last level, I am struggling to find a way to accomplish this elegantly.
In the sample data below, I can get the desired outcome in two ways: either multiple lapply calls or a for loop. However, my actual data contain many more lists and millions of data points, so these are unlikely to be efficient options.
Below is (1) sample data, (2) what I have tried, and (3) sample data having the desired structure.
Sample Data
have_list <- list(scenario1 = list(method1 = list(place1 = list(0, 1:5, 0, 0, 1:3),
place2 = list(1:2, 0, 1:10, 0, 0),
place3 = list(0:19, 0, 0, 0, 0),
place4 = list(1:100, 0, 0, 1:4, 0)),
method2 = list(place1 = list(1:5, 1:5, 0, 0, 1:3),
place2 = list(0, 0, 1:5, 0, 0),
place3 = list(0:19, 0, 1:7, 0, 0),
place4 = list(1:22, 0, 0, 1:4, 0)),
method3 = list(place1 = list(0, 1:2, 1:6, 0, 1:3),
place2 = list(1:2, 0, 1:6, 1:4, 0),
place3 = list(0:19, 0, 0, 0, 1:2),
place4 = list(1:12, 0, 0, 1:12, 0))),
scenario2 = list(method1 = list(place1 = list(0, 1:5, 0, 0, 1:3),
place2 = list(1:2, 0, 1:10, 0, 0),
place3 = list(0:19, 0, 0, 0, 0),
place4 = list(1:100, 0, 0, 1:4, 0)),
method2 = list(place1 = list(1:5, 1:5, 0, 0, 1:3),
place2 = list(0, 0, 1:5, 0, 0),
place3 = list(0:19, 0, 1:7, 0, 0),
place4 = list(1:22, 0, 0, 1:4, 0)),
method3 = list(place1 = list(0, 1:2, 1:6, 0, 1:3),
place2 = list(1:2, 0, 1:6, 1:4, 0),
place3 = list(0:19, 0, 0, 0, 1:2),
place4 = list(1:12, 0, 0, 1:12, 0))))
What I have tried
And SO questions I have visited:
Unlist LAST level of a list in R
R - unlist nested list of dates
Issues unlisting a nested-list
# Get number of integers in each nested list
lengths <- rapply(have_list, function(x) unlist(length(x)), how = "list") # this works fine
#' Each count is currently still in its own list of length 1.
#' Convert each count to a vector.
#' In the "middle" of the nested list:
# I have list(0, 5, 0, 0, 3)
# I want c(0, 5, 0, 0, 3)
# Attempts to unlist the counts
test1 <- rapply(lengths, unlist, how = "list") # doesn't work
test2 <- unlist(lengths, recursive = FALSE) # doesn't work
test3 <- lapply(lengths, function(x) lapply(x, unlist)) # doesn't work
test4 <- lapply(lengths, function(x) lapply(x, unlist, recursive = FALSE)) # doesn't work
test5 <- rapply(have_list, function(x) unlist(length(x)), how = "list") # doesn't work
test6 <- rapply(have_list, function(x) unlist(length(x)), how = "unlist") # doesn't work
Data structure I want
# This works on test data but is impractical for real data
want_list <- lapply(lengths, function(w) lapply(w, function(x) lapply(x, unlist)))
# or
want_list <- lengths
## for loops work but are not practical
for (i in seq_along(lengths)) {
  for (j in seq_along(lengths[[i]])) {
    for (k in seq_along(lengths[[i]][[j]])) {
      want_list[[i]][[j]][[k]] <- unlist(lengths[[i]][[j]][[k]])
    }
  }
}
One option is to melt the nested list with rrapply, replace the 'value' column with the lengths, and then use the recursive split (rsplit) from collapse:
library(rrapply)
library(collapse)
dat <- transform(rrapply(have_list, how = "melt"), value= lengths(value))
out <- rsplit(dat$value, dat[1:3])
Testing against the OP's expected output:
identical(out, want_list)
[1] TRUE
Another solution with rrapply() could be to apply lengths() only to the lists of vectors using a condition function:
library(rrapply)
out <- rrapply(have_list, classes = "list", condition = \(x) is.numeric(x[[1]]), f = lengths)
identical(want_list, out)
#> [1] TRUE
This can also be done with recursion. A simple recursive function would be:
my_fun <- function(x) if(is.list(x[[1]])) lapply(x, my_fun) else lengths(x)
out <- my_fun(have_list)
identical(out, want_list)
[1] TRUE
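To see what the recursive approach returns, here is a small self-contained sketch on a cut-down list. The names `count_leaves` and `small` are introduced here for illustration. Note that a lone 0 is a vector of length 1, so it is counted as 1 rather than 0.

```r
# Descend while the children are lists; once the children are plain vectors,
# replace that list of vectors with an integer vector of their lengths
count_leaves <- function(x) {
  if (is.list(x[[1]])) lapply(x, count_leaves) else lengths(x)
}

# Hypothetical cut-down version of the sample data
small <- list(s1 = list(m1 = list(p1 = list(0, 1:5, 0, 0, 1:3),
                                  p2 = list(1:2, 0, 1:10, 0, 0))))

res <- count_leaves(small)
res$s1$m1$p1   # 1 5 1 1 3
res$s1$m1$p2   # 2 1 10 1 1
```

The names at every level are preserved because lapply keeps them, and only the second-to-last level is collapsed into a vector.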

Pair wise binary comparison - optimizing code in R

I have a file that represents the gene structure of bacteria models. Each row represents a model. A row is a fixed-length binary string showing which genes are present (1 for present and 0 for absent). My task is to compare the gene sequence for each pair of models, get a score of how similar they are, and compute a dissimilarity matrix.
In total there are 450 models (rows) in one file and there are 250 files. I have working code; however, it takes roughly 1.6 hours to process just one file.
#Sample Data
Generation: 0
[0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0]
[1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1]
[1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0]
[0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0]
[0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0]
[1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
What my code does:
1. Reads the file
2. Converts the binary strings into a data frame with columns Gene, Model_1, Model_2, Model_3, … Model_450
3. Runs a nested for loop to do the pair-wise comparison (only the top half of the matrix): I take the two corresponding columns and add them, then count the positions where the sum is 2 (meaning present in both models)
4. Writes the data to a file
5. Creates the matrix later
comparison code
library(stringr)  # provides str_count()
generationFiles = list.files(pattern = "^Generation.*_\\d+\\.txt$")
start.time = Sys.time()
for(a in 1:length(generationFiles)){
fname = generationFiles[a]
geneData = read.table(generationFiles[a], sep = "\n", header = T, stringsAsFactors = F)
geneCount = str_count(geneData[1,1], "[01]")  # "[1|0]" would also match a literal "|"
geneDF <- data.frame(Gene = paste0("Gene_", c(1:geneCount)), stringsAsFactors = F)
#convert the string into a data frame
for(i in 1:nrow(geneData)){
#remove the square brackets
dataRow = substring(geneData[i,1], 2, nchar(geneData[i,1]) - 1)
#removing white spaces
dataRow = gsub(" ", "", dataRow, fixed = T)
#splitting the string
dataRow = strsplit(dataRow, ",")
#converting to numeric
dataRow = as.numeric(unlist(dataRow))
colName = paste("M_",i,sep = "")
geneDF <- cbind(geneDF, dataRow)
colnames(geneDF)[colnames(geneDF) == 'dataRow'] <- colName
dataRow <- NULL
}
summaryDF <- data.frame(Model1 = character(), Model2 = character(), Common = integer(),
Uncommon = integer(), Absent = integer(), stringsAsFactors = F)
modelNames = paste0("M_",c(1:450))
secondaryLevel = modelNames
fileName = paste0("D://BellosData//GC_3//Summary//",substr(fname, 1, nchar(fname) - 4),"_Summary.txt")
for(x in 1:449){
secondaryLevel = secondaryLevel[-1]
for(y in 1:length(secondaryLevel)){
result = geneDF[modelNames[x]] + geneDF[secondaryLevel[y]]
summaryDF <- rbind(summaryDF, data.frame(Model1 = modelNames[x],
Model2 = secondaryLevel[y],
Common = sum(result == 2),
Uncommon = sum(result == 1),
Absent = sum(result == 0)))
}
}
write.table(summaryDF, fileName, sep = ",", quote = F, row.names = F)
geneDF <- NULL
summaryDF <- NULL
geneData <-NULL
}
converting to matrix
maxNum = max(summaryDF$Common)
normalizeData = summaryDF[,c(1:3)]
normalizeData[c('Common')] <- lapply(normalizeData[c('Common')], function(x) 1 - x/maxNum)
normalizeData[1:2] <- lapply(normalizeData[1:2], factor, levels=unique(unlist(normalizeData[1:2])))
distMatrixN = xtabs(Common~Model1+Model2, data=normalizeData)
distMatrixN = distMatrixN + t(distMatrixN)
Is there a way to make the process run faster? Is there a more efficient way to do the comparison?
This code should be faster. Nested loops are nightmarishly slow in R. Operations like rbind-ing one row at a time are also among the worst and slowest ideas in R programming.
Generate 450 rows with 20 elements of 0, 1 on each row.
M = do.call(rbind, replicate(450, sample(0:1, 20, replace = T), simplify = F))
Generate the list of all choose(450, 2) = 101,025 row pairs
L = split(v<-t(utils::combn(450, 2)), seq(nrow(v))); rm(v)
Apply whatever comparison function you want. In this case, the number of 1's at the same position for each row combinations. If you want to calculate different metrics, just write another function(x) where M[x[1],] is the first row and M[x[2],] is the second row.
O = lapply(L, function(x) sum(M[x[1],]&M[x[2],]))
The code takes ~4 seconds on a fairly slow 2.6 GHz Sandy Bridge
Get a clean data.frame with your results, three columns : row 1, row 2, metric between the two rows
data.frame(row1 = sapply(L, `[`, 1),
row2 = sapply(L, `[`, 2),
similarity_metric = do.call(rbind, O))
To be honest, I didn't thoroughly comb through your code to replicate exactly what you were doing. If this is not what you are looking for (or can't be modified to achieve what you are looking for), leave a comment.
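As an alternative to iterating over row pairs, all three counts from the question (Common, Uncommon, Absent) can be obtained with matrix products. This is a sketch of a different technique from the lapply approach above, assuming every row has the same number of genes. The names `lines`, `common`, `absent`, `uncommon` and `dissim` are introduced here, and the input is the first three sample rows from the question.

```r
# First three sample rows from the question, as they appear in the file
lines <- c(
  "[0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0]",
  "[1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1]",
  "[1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0]"
)

# Strip brackets and spaces, split on commas, stack into a models x genes matrix
M <- do.call(rbind, lapply(strsplit(gsub("\\[|\\]| ", "", lines), ","), as.numeric))
rownames(M) <- paste0("M_", seq_len(nrow(M)))

common   <- tcrossprod(M)               # genes present in both models
absent   <- tcrossprod(1 - M)           # genes absent in both models
uncommon <- ncol(M) - common - absent   # genes present in exactly one model

# Same normalisation as in the question: 1 - Common / max(Common) over distinct pairs
dissim <- 1 - common / max(common[upper.tri(common)])
diag(dissim) <- 0
dissim
```

This avoids growing a data frame row by row, and the result is already a symmetric matrix, so the later xtabs step is not needed. If no pair shares any gene, the maximum is 0 and the division would need a guard.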

Selecting Rows in a Column Contingent on Two Variables in R

I am working with a data set that contains multiple observations for each prescription a patient is taking, with many different patients. Patients typically take one of several drugs, which are indicated as their own binary variables, Drug1, Drug2 and so on.
I am attempting to pull out only the individuals who have switched from one drug to the other, i.e., who have a 1 in both the Drug1 and Drug2 columns, but in different rows.
I have attempted to use newdata <- mydata[which(Drug1 == 1 & Drug2 == 1),] however, this assumes that the 1's are in the same row, which they are not.
Is there a way to select the patients that have received both drugs, but the indicator variables are in different rows?
Thank you
I believe this is a solution to what you are asking using dplyr.
data <- data.frame(id = rep(c(1, 2, 3, 4), each = 2),
drug1 = c(1, 0, 0, 0, 0, 1, 1, 1),
drug2 = c(0, 1, 1, 1, 1, 0, 0, 0)
)
library(dplyr)
data %>%
group_by(id) %>%
mutate(both_drugs = ifelse(any(drug1 == 1) & any(drug2 == 1), 1, 0)) %>%
filter(both_drugs == 1)
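The same per-patient check can be sketched in base R, which may help show why the row-wise condition in the question finds nothing: the test has to be made once per id, not once per row. The names `has1`, `has2`, `both_ids` and `switchers` are introduced here for illustration.

```r
data <- data.frame(id = rep(c(1, 2, 3, 4), each = 2),
                   drug1 = c(1, 0, 0, 0, 0, 1, 1, 1),
                   drug2 = c(0, 1, 1, 1, 1, 0, 0, 0))

# For each patient: did any of their rows have drug1, and did any have drug2?
has1 <- tapply(data$drug1 == 1, data$id, any)
has2 <- tapply(data$drug2 == 1, data$id, any)

# Patients who received both drugs, possibly in different rows
both_ids <- names(has1)[has1 & has2]

# Keep all rows belonging to those patients
switchers <- data[data$id %in% both_ids, ]
switchers
```

With this toy data, patients 1 and 3 are kept, since each has one row with drug1 and another row with drug2.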
Try creating a variable for each drug that indicates whether or not it was the only drug taken at that time by that individual.
data <- data.frame(id = rep(c(1, 2, 3, 4), each = 3),
drug1 = c(1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0),
drug2 = c(0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0))
library(dplyr)
data %>%
group_by(id) %>%
mutate(drug1only = ifelse(drug1==1 & drug2==0, 1, 0),
drug2only = ifelse(drug2==1 & drug1==0, 1, 0)) %>%
summarise(
drug_switch = ifelse(max(drug1only)+max(drug2only)==2,1,0))

R Return p-values for categorical independent variables with glm

I recently asked a question about looping a glm command for all possible combinations of independent variables. Another user provided a great answer that runs all possible models, however I can't figure out how to produce a data.frame of all possible p-values.
The code suggested in the previous question works for independent variables that are binary (pasted below). However, several of my variables are categorical. Is there any way to adjust the code so that I can produce a table of all p-values for every possible model (there are 2,046 possible models with 10 independent variables...)?
# p-values in a data.frame
p_values <-
cbind(formula_vec, as.data.frame ( do.call(rbind,
lapply(glm_res, function(x) {
coefs <- coef(x)
rbind(c(coefs[,4] , rep(NA, length(ind_vars) - length(coefs[,4]) + 1)))
})
)))
An example of one independent variable is "Bedrock" where possible categories include: "till," "silt," and "glacial deposit." It's not feasible to assign a numerical value to these variables, which is part of the problem. Any suggestions would be appreciated.
With an additional categorical variable such as IndVar4 (a factor with levels a, b, c), the coefficient table can grow by more than one row, because each non-reference level gets its own coefficient. After adding IndVar4:
              Estimate Std. Error    z value  Pr(>|z|)
(Intercept) -1.7548180  1.4005800 -1.2529223 0.2102340
IndVar1     -0.2830926  1.2076534 -0.2344154 0.8146625
IndVar2      0.1894432  0.1401217  1.3519903 0.1763784
IndVar3      0.1568672  0.2528131  0.6204867 0.5349374
IndVar4b     0.4604571  1.0774018  0.4273773 0.6691045
IndVar4c     0.9084545  1.0943227  0.8301523 0.4064527
The maximum number of rows is bounded by the number of variables plus the extra factor levels (levels - 1 per factor):
max_values <- length(ind_vars) +
sum(sapply( dfPRAC, function(x) pmax(length(levels(x))-1,0)))
So the new corrected function is:
p_values <-
cbind(formula_vec, as.data.frame ( do.call(rbind,
lapply(glm_res, function(x) {
coefs <- coef(x)
rbind(c(coefs[,4] , rep(NA, max_values - length(coefs[,4]) + 1)))
})
)))
But the result is not as clean as with continuous variables. I think Metrics' idea of converting every categorical variable to (levels - 1) dummy variables gives the same results and maybe a cleaner presentation.
Data:
dfPRAC <- structure(list(DepVar1 = c(0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1,
1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1), DepVar2 = c(0, 1, 0, 0,
1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1),
IndVar1 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
0, 0, 0, 1, 0, 0, 0, 1, 0),
IndVar2 = c(1, 3, 9, 1, 5, 1,
1, 8, 4, 6, 3, 15, 4, 1, 1, 3, 2, 1, 10, 1, 9, 9, 11, 5),
IndVar3 = c(0.500100322564443, 1.64241601558441, 0.622735778490702,
2.42429812749226, 5.10055213237027, 1.38479786027561, 7.24663629203007,
0.5102348706939, 2.91566510995229, 3.73356170379198, 5.42003495939846,
1.29312896116503, 3.33753833987496, 0.91783513806083, 4.7735736131668,
1.17609362602233, 5.58010703426296, 5.6668754863739, 1.4377813063642,
5.07724130837643, 2.4791994535923, 2.55100067348583, 2.41043629522981,
2.14411703944206)), .Names = c("DepVar1", "DepVar2", "IndVar1",
"IndVar2", "IndVar3"), row.names = c(NA, 24L), class = "data.frame")
dfPRAC$IndVar4 <- factor(rep(c("a", "b", "c"),8))
dfPRAC$IndVar5 <- factor(rep(c("d", "e", "f", "g"),6))
Set up the models:
dep_vars <- c("DepVar1", "DepVar2")
ind_vars <- c("IndVar1", "IndVar2", "IndVar3", "IndVar4", "IndVar5")
# create all combinations of ind_vars
ind_vars_comb <-
unlist( sapply( seq_len(length(ind_vars)),
function(i) {
apply( combn(ind_vars,i), 2, function(x) paste(x, collapse = "+"))
}))
# pair with dep_vars:
var_comb <- expand.grid(dep_vars, ind_vars_comb )
# formulas for all combinations
formula_vec <- sprintf("%s ~ %s", var_comb$Var1, var_comb$Var2)
# create models
glm_res <- lapply( formula_vec, function(f) {
fit1 <- glm( f, data = dfPRAC, family = binomial("logit"))
fit1$coefficients <- coef( summary(fit1))
return(fit1)
})
names(glm_res) <- formula_vec
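A possibly cleaner alternative to padding each row with NA is to collect the p-values in long format, one row per model and coefficient, keeping the term names. The following is a sketch on hypothetical toy data (`toy`, `formulas`, `p_long` and `p_wide` are names introduced here, not from the answer above); the same pattern should carry over to dfPRAC and formula_vec.

```r
set.seed(42)
# Hypothetical toy data: binary outcome, one numeric and one 3-level categorical predictor
toy <- data.frame(
  y  = rbinom(60, 1, 0.5),
  x1 = rnorm(60),
  f  = factor(rep(c("a", "b", "c"), 20))
)
formulas <- c("y ~ x1", "y ~ f", "y ~ x1+f")

# One row per (model, coefficient), so models with different numbers
# of coefficients stack without any padding
p_long <- do.call(rbind, lapply(formulas, function(fml) {
  fit <- glm(as.formula(fml), data = toy, family = binomial("logit"))
  ctab <- coef(summary(fit))
  data.frame(formula = fml,
             term = rownames(ctab),
             p_value = ctab[, "Pr(>|z|)"],
             row.names = NULL)
}))

# Optional: one row per model, one column per term (NA where the term is absent)
p_wide <- reshape(p_long, idvar = "formula", timevar = "term", direction = "wide")
p_wide
```

Because each factor level appears under its own term name (for example `fb` and `fc`), the columns stay correctly labelled even when models contain different sets of categorical variables.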
