dplyr split-apply-combine with list column - r

Say I have a tibble with two columns: a group variable (grp) and a list-column
containing matrices of equal dimension (mat).
mat1 <- matrix(c(2, 0, 0, 0), nrow = 2)
mat2 <- matrix(c(0, 0, 0, 0), nrow = 2)
mat3 <- matrix(c(0, 0, 0, 2), nrow = 2)
mat4 <- matrix(c(0, 0, 0, 0), nrow = 2)
df <- tibble(grp = c('a', 'a', 'b', 'b'),
mat = list(mat1, mat2, mat3, mat4))
Edit:
I want to calculate the mean matrix by group and add it as a new list-column, i.e., the new column should be:
list(matrix(c(1, 0, 0, 0), nrow = 2),
matrix(c(1, 0, 0, 0), nrow = 2),
matrix(c(0, 0, 0, 1), nrow = 2),
matrix(c(0, 0, 0, 1), nrow = 2))
The best I can do is:
df_out <- df %>%
group_by(grp) %>%
mutate(n = n(),
mean_mat = list(Reduce('+', mat) / n)) %>%
ungroup()
It works, but I'm trying to understand why the call to list is necessary, and also hoping to find an alternative approach (either tidyverse or base R) that is perhaps simpler.
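On the question of why list() is needed, here is a minimal sketch (assuming dplyr is installed; group_means and mean_mat_base are names introduced here only for illustration). Inside a grouped mutate(), each expression should return either one value per row of the group or a single value that can be recycled. A bare 2 x 2 matrix is not a single value in that sense: depending on the dplyr version it is either rejected for its size or treated row-wise as a matrix column. Wrapping it in list() gives a length-1 list, which is recycled to every row of the group. The base R lines show one possible alternative built on split().

```r
library(dplyr)

mat1 <- matrix(c(2, 0, 0, 0), nrow = 2)
mat2 <- matrix(c(0, 0, 0, 0), nrow = 2)
mat3 <- matrix(c(0, 0, 0, 2), nrow = 2)
mat4 <- matrix(c(0, 0, 0, 0), nrow = 2)
df <- tibble(grp = c("a", "a", "b", "b"),
             mat = list(mat1, mat2, mat3, mat4))

# tidyverse: n() can be called directly, so the helper column n is optional
df_out <- df %>%
  group_by(grp) %>%
  mutate(mean_mat = list(Reduce(`+`, mat) / n())) %>%
  ungroup()

# base R: one mean matrix per group, then look up each row's group by name
group_means <- lapply(split(df$mat, df$grp),
                      function(m) Reduce(`+`, m) / length(m))
mean_mat_base <- unname(group_means[df$grp])
```

Both df_out$mean_mat and mean_mat_base are lists of four matrices, one per row of df.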

Related

Unlist the second to last list of a nested list

I have a deeply nested list of lists. In the "center" of the nested list is a vector containing n integers. I need to count how many integers are in each nested list, then unlist one level above to have a vector of these counts (i.e., instead of list(0, 1:5, 0, 0, 1:3) at the center of the nest, I want c(0, 5, 0, 0, 3)).
This seems relatively simple. I was able to use rapply to accomplish the first part, i.e., convert list(0, 1:5, 0, 0, 1:3) to list(0, 5, 0, 0, 3). My specific question is how to unlist the innermost lists to a vector (instead of list(0, 5, 0, 0, 3) I want c(0, 5, 0, 0, 3)).
I have searched and tried various apply, lapply, and unlist approaches, but none of them are quite right because they target the very innermost list. Since the list I want to unlist is the second-to-last level, I am struggling to find a way to accomplish this elegantly.
In the sample data below, I can get the desired outcome in two ways: either multiple lapply calls or a for loop. However, my actual data contain many more lists and millions of data points, so these are unlikely to be practical options.
Below is (1) sample data, (2) what I have tried, and (3) sample data having the desired structure.
Sample Data
have_list <- list(scenario1 = list(method1 = list(place1 = list(0, 1:5, 0, 0, 1:3),
place2 = list(1:2, 0, 1:10, 0, 0),
place3 = list(0:19, 0, 0, 0, 0),
place4 = list(1:100, 0, 0, 1:4, 0)),
method2 = list(place1 = list(1:5, 1:5, 0, 0, 1:3),
place2 = list(0, 0, 1:5, 0, 0),
place3 = list(0:19, 0, 1:7, 0, 0),
place4 = list(1:22, 0, 0, 1:4, 0)),
method3 = list(place1 = list(0, 1:2, 1:6, 0, 1:3),
place2 = list(1:2, 0, 1:6, 1:4, 0),
place3 = list(0:19, 0, 0, 0, 1:2),
place4 = list(1:12, 0, 0, 1:12, 0))),
scenario2 = list(method1 = list(place1 = list(0, 1:5, 0, 0, 1:3),
place2 = list(1:2, 0, 1:10, 0, 0),
place3 = list(0:19, 0, 0, 0, 0),
place4 = list(1:100, 0, 0, 1:4, 0)),
method2 = list(place1 = list(1:5, 1:5, 0, 0, 1:3),
place2 = list(0, 0, 1:5, 0, 0),
place3 = list(0:19, 0, 1:7, 0, 0),
place4 = list(1:22, 0, 0, 1:4, 0)),
method3 = list(place1 = list(0, 1:2, 1:6, 0, 1:3),
place2 = list(1:2, 0, 1:6, 1:4, 0),
place3 = list(0:19, 0, 0, 0, 1:2),
place4 = list(1:12, 0, 0, 1:12, 0))))
What I have tried
And SO questions I have visited:
Unlist LAST level of a list in R
R - unlist nested list of dates
Issues unlisting a nested-list
# Get number of integers in each nested list
lengths <- rapply(have_list, function(x) unlist(length(x)), how = "list") # this works fine
#' Each count is currently still in its own list of length 1,
#' Convert each count to vector
#' In the "middle" the nested list:
# I have list(0, 5, 0, 0, 3)
# I want c(0, 5, 0, 0, 3)
# Attempts to unlist the counts
# Unlist the counts
test1 <- rapply(lengths, unlist, how = "list") # doesn't work
test2 <- unlist(lengths, recursive = FALSE) # doesn't work
test3 <- lapply(lengths, function(x) lapply(x, unlist)) # doesn't work
test4 <- lapply(lengths, function(x) lapply(x, unlist, recursive = FALSE)) # doesn't work
test5 <- rapply(have_list, function(x) unlist(length(x)), how = "list") # doesn't work
test6 <- rapply(have_list, function(x) unlist(length(x)), how = "unlist") # doesn't work
Data structure I want
# This works on test data but is impractical for real data
want_list <- lapply(lengths, function(w) lapply(w, function(x) lapply(x, unlist)))
# or
want_list <- lengths
## for loops work but are not practical
for (i in 1:length(lengths)){
for (j in 1:length(lengths[[i]])){
for (k in 1:length(lengths[[i]][[j]])){
want_list[[i]][[j]][[k]] <- unlist(lengths[[i]][[j]][[k]])
}
}
}
An option is to melt the nested list with rrapply, replace the 'value' column with the lengths, and then use the recursive split (rsplit) from collapse:
library(rrapply)
library(collapse)
dat <- transform(rrapply(have_list, how = "melt"), value= lengths(value))
out <- rsplit(dat$value, dat[1:3])
Testing against the OP's expected output:
identical(out, want_list)
[1] TRUE
Another solution with rrapply() could be to apply lengths() only to the lists of vectors using a condition function:
library(rrapply)
out <- rrapply(have_list, classes = "list", condition = \(x) is.numeric(x[[1]]), f = lengths)
identical(want_list, out)
#> [1] TRUE
This can be done by using recursion. A simple recursion will be:
my_fun <- function(x) if(is.list(x[[1]])) lapply(x, my_fun) else lengths(x)
out <- my_fun(have_list)
identical(out, want_list)
[1] TRUE
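To make the recursion easier to follow, here is a minimal sketch on a cut-down list of the same shape (small and count_leaves are names introduced here for illustration, not part of the original answers). The function keeps descending while the first element is itself a list, and calls lengths() once it reaches a list of vectors.

```r
# Cut-down stand-in for have_list: same nesting, fewer branches
small <- list(scenario1 = list(method1 = list(place1 = list(0, 1:5, 0),
                                              place2 = list(1:2, 0, 1:10))))

count_leaves <- function(x) {
  # Descend while the children are lists; at a list of vectors, count each one
  if (is.list(x[[1]])) lapply(x, count_leaves) else lengths(x)
}

out <- count_leaves(small)
out$scenario1$method1$place1
#> [1] 1 5 1
```

Note that a lone 0 is a vector of length 1, so it is counted as 1 rather than 0. If a 0 is meant to stand for "no integers", the counts would need an extra step to map those entries to 0.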

How to include matrix multiplication in constraint?

I am trying to run this model. I want to maximize x[4]
subject to Mv = c(0, 0, 0, 0)
lb < v < ub
But I have two problems. The first is matrix multiplication.
library(ompr)
lb <- c(-200, 0, -200, -200)
ub <- c(1000, 1000, 1000, 1000)
M <- matrix(rbind(
c(-1, 0, -1, 0), # A
c(-1, 0, 0, -2), # B
c(1, -2, 0, 0), # C
c(1, 0, 0, 2), # D
c(0, 2, -1, 0), # E
c(0, 0, 1, -1) # F
), nrow = 6)
n <- 4
rhs <- rep(0, n)
model <- MIPModel() %>%
add_variable(x[i], i = 1:n, type = "continuous") %>%
set_objective(x[4]) %>%
add_constraint(M[i, ] %*% x == rhs[i], i = 1:n)
I got the following error.
Error in M[i, ] %*% x : requires numeric/complex matrix/vector arguments
Second, I am trying to set the bounds in a vectorized way, but I don't know how to do that. I tried the following:
set_bounds(x[i], ub = ub[i], lb = lb[i], i = 1:n)
This gives:
object 'i' not found
Any help would be very useful!
It works like this, although the solution is (0, 0, 0, 0):
library(ompr)
library(ompr.roi)
library(ROI.plugin.glpk)
library(magrittr)
lb <- c(-200, 0, -200, -200)
ub <- c(1000, 1000, 1000, 1000)
M <- matrix(rbind(
c(-1, 0, -1, 0), # A
c(-1, 0, 0, -2), # B
c(1, -2, 0, 0), # C
c(1, 0, 0, 2), # D
c(0, 2, -1, 0), # E
c(0, 0, 1, -1) # F
), nrow = 6)
n <- 4
rhs <- rep(0, n)
model <- MIPModel() %>%
add_variable(x[i], i = 1:n, type = "continuous") %>%
set_objective(x[4], "max") %>%
add_constraint(sum_over(M[i, j] * x[j], j = 1:4) == rhs[i], i = 1:n) %>%
add_constraint(x[i] <= ub[i], i = 1:n) %>%
add_constraint(x[i] >= lb[i], i = 1:n) %>%
solve_model(with_ROI(solver = "glpk"))
get_solution(model, x[i])
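As a hedged check on why the solver returns (0, 0, 0, 0), the base R sketch below assumes the model is exactly as posted, so only rows A to D of M enter the constraints (i runs over 1:n with n = 4). A and direction are names introduced here for illustration.

```r
M <- rbind(
  c(-1,  0, -1,  0), # A
  c(-1,  0,  0, -2), # B
  c( 1, -2,  0,  0), # C
  c( 1,  0,  0,  2), # D
  c( 0,  2, -1,  0), # E
  c( 0,  0,  1, -1)  # F
)
A <- M[1:4, ]                 # rows actually used by the constraints
direction <- c(2, 1, -2, -1)  # candidate direction of feasible points

A %*% direction  # all zeros: every multiple t * direction satisfies A x = 0
qr(A)$rank       # 3 of 4 unknowns are pinned, so these multiples are the only solutions
```

Every feasible point then has the form x = t * (2, 1, -2, -1). The lower bound lb[2] = 0 forces t >= 0, and the objective is x[4] = -t, so the maximum is reached at t = 0. Under that reading, the zero vector is the true optimum of the model as written rather than a sign that the code is wrong.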

Replicating dplyr pipe structure with apply family or loop

I have a data frame df in which for each column I want to calculate what share of occurrences also occur in another column. Each row of occurrences has a weight so ideally I would like to get a weighted share.
A <- c(0, 1, 0, 0, 1, 0, 1, 1, 1, 0)
B <- c(0, 1, 0, 1, 1, 0, 0, 0, 0, 0)
C <- c(0, 0, 0, 1, 1, 0, 0, 0, 0, 1)
D <- c(1, 0, 0, 1, 1, 0, 0, 0, 0, 0)
weight <- c(0.5, 1, 0.2, 0.3, 1.4, 1.5, 0.8, 1.2, 1, 0.9)
df <- data.frame(A, B, C, D, weight)
I was trying to calculate it for each column pair this way:
#total weight of occurrences in A
wgt_A <- df%>%
filter(A == 1)%>%
summarise(weight_A = sum(weight))%>%
select(weight_A)
#weighted share of occurrences in A that also occur in B
wgt_A_B <- df%>%
filter(A == 1, B == 1)%>%
summarise(weight_A_B = sum(weight))%>%
select(weight_A_B)
Result_1 <- wgt_A_B / wgt_A
I would want to end up with six results in total for all combinations of the 4 columns. However, for this I would need to replicate this dplyr pipe a lot of times and my actual dataset has 20+ columns like this. Is there a more efficient/quicker way to do this with apply/sapply or some kind of loop where I can also select for which columns I want to perform this?
I'm new to R and stackoverflow so please let me know (and excuse me) if I'm doing/saying anything stupid
We may use combn to generate the column pairs in base R:
out <- combn(df[1:4], 2, FUN = function(x)
sum(df$weight[x[[1]] & x[[2]]])/ sum(df$weight[as.logical(x[[1]])]) )
names(out) <- combn(names(df)[1:4], 2, FUN = paste, collapse = "_")
Output:
> out
A_B A_C A_D B_C B_D C_D
0.4444444 0.2592593 0.2592593 0.6296296 0.6296296 0.6538462
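The share is directional: A_B is the weight of rows where both A and B occur divided by the total weight of A, which generally differs from the same overlap divided by the total weight of B. As an alternative sketch (a matrix-product technique swapped in here, not the combn approach above; X, W and shares are names introduced for illustration), all ordered pairs can be computed at once:

```r
A <- c(0, 1, 0, 0, 1, 0, 1, 1, 1, 0)
B <- c(0, 1, 0, 1, 1, 0, 0, 0, 0, 0)
C <- c(0, 0, 0, 1, 1, 0, 0, 0, 0, 1)
D <- c(1, 0, 0, 1, 1, 0, 0, 0, 0, 0)
weight <- c(0.5, 1, 0.2, 0.3, 1.4, 1.5, 0.8, 1.2, 1, 0.9)
df <- data.frame(A, B, C, D, weight)

# Pick the indicator columns to analyse
X <- as.matrix(df[c("A", "B", "C", "D")])

# W[i, j] = total weight of rows where both column i and column j equal 1
W <- crossprod(X * df$weight, X)

# Divide row i by W[i, i], the total weight of column i
shares <- W / diag(W)

shares["A", "B"]  # share of A's weight that also occurs in B (A_B above)
shares["B", "A"]  # the reverse direction
```

Changing the column names passed to df[...] selects which columns take part, which may help with the 20+ column case.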

Iterate through elements in a list with shared element names in R

I have a list that looks something like this (a much-reduced version of a list with 301 sub-elements):
myList <- list()
myList$Speaker1 <- list("ID" = c(1, 2, 3, 4, 5),
"S1C1.Sonorant" = c(0, 0, 0, 0.5, 0, -1),
"S1C1.Consonantal" = c(0, 0, 0, 0, 0, 1),
"S1C1.Voice" = c(0, 0, 1, 1, 1, -1),
"S1C1.Nasal" = c(0, 0, 1, 0, 1, -1))
myList$Speaker2 <- list("ID" = c(1, 2, 3, 4, 5),
"S1C1.Sonorant" = c(0, 0, 0, 0.5, 0, -1),
"S1C1.Consonantal" = c(0, 0, 0, 0, 0, 1),
"S1C1.Voice" = c(0, 0, 1, 1, 1, -1),
"S1C1.Nasal" = c(0, 0, 1, 0, 1, -1))
myList$Speaker3 <- list("ID" = c(1, 2, 3, 4, 5),
"S1C1.Sonorant" = c(0, 0, 0, 0.5, 0, -1),
"S1C1.Consonantal" = c(0, 0, 0, 0, 0, 1),
"S1C1.Voice" = c(0, 0, 1, 1, 1, -1),
"S1C1.Nasal" = c(0, 0, 1, 0, 1, -1))
For each speaker, I want to run some functions through all the sub-elements that include the string S1C1.. So far, I have the following, which calls each column containing S1C1 individually:
my_matrix <- lapply(myList, FUN = function(element) {
ones <- rep(1, nrow(element)) # count repeated rows
sonorant_vec.S1C1 <- element$S1C1.Sonorant
sonorant_mat.S1C1 <- (sonorant_vec.S1C1 %*% t(ones) - ones %*% t(sonorant_vec.S1C1))^2
consonantal_vec.S1C1 <- element$S1C1.Consonantal
consonantal_mat.S1C1 <- (consonantal_vec.S1C1 %*% t(ones) - ones %*% t(consonantal_vec.S1C1))^2
voice_vec.S1C1 <- element$S1C1.Voice
voice_mat.S1C1 <- (voice_vec.S1C1 %*% t(ones) - ones %*% t(voice_vec.S1C1))^2
nasal_vec.S1C1 <- element$S1C1.Nasal
nasal_mat.S1C1 <- (nasal_vec.S1C1 %*% t(ones) - ones %*% t(nasal_vec.S1C1))^2
mat.S1C1 <- sonorant_mat.S1C1 +
consonantal_mat.S1C1 +
voice_mat.S1C1 +
nasal_mat.S1C1
rownames(mat.S1C1) <- element$S1C1.S1C1
colnames(mat.S1C1) <- element$S1C1.S1C1
all_mat <- sqrt(mat.S1C1[,])
return(all_mat)
})
Is there a way I can iterate through all the sub-elements that start with the string S1C1.? The current code works but is very long!
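One possible approach, sketched under assumptions: select the sub-element names by prefix with startsWith(), build one squared-difference matrix per feature with outer(), and add them up with Reduce(). The helper feature_dist and its prefix argument are hypothetical names introduced here; the example keeps only the feature vectors from the question and leaves out row and column names.

```r
features <- list("S1C1.Sonorant"    = c(0, 0, 0, 0.5, 0, -1),
                 "S1C1.Consonantal" = c(0, 0, 0, 0, 0, 1),
                 "S1C1.Voice"       = c(0, 0, 1, 1, 1, -1),
                 "S1C1.Nasal"       = c(0, 0, 1, 0, 1, -1))
myList <- list(Speaker1 = features, Speaker2 = features)

feature_dist <- function(element, prefix = "S1C1.") {
  # Names of all sub-elements that start with the prefix
  cols <- names(element)[startsWith(names(element), prefix)]
  # outer(v, v, "-")[i, j] is v[i] - v[j], i.e. v %*% t(ones) - ones %*% t(v)
  sq <- lapply(element[cols], function(v) outer(v, v, "-")^2)
  # Sum the squared differences over features, then take the square root
  sqrt(Reduce(`+`, sq))
}

my_matrix <- lapply(myList, feature_dist)
my_matrix$Speaker1[1, 6]
#> [1] 2
```

Because the result is the Euclidean distance between rows of the feature table, dist() applied to the selected columns bound together with cbind() should give the same numbers.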

Pair wise binary comparison - optimizing code in R

I have a file that represents the gene structure of bacteria models. Each row represents a model. A row is a fixed-length binary string showing which genes are present (1 for present and 0 for absent). My task is to compare the gene sequence for each pair of models, get a score of how similar they are, and compute a dissimilarity matrix.
In total there are 450 models (rows) in one file and there are 250 files. I have working code; however, it takes roughly 1.6 hours to do the whole thing for only one file.
#Sample Data
Generation: 0
[0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0]
[1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1]
[1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0]
[0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0]
[0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0]
[1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
What my code does:
Reads the file
Converts the binary string into a data frame (Gene, Model_1, Model_2, Model_3, … Model_450)
Runs a nested for loop to do the pair-wise comparison (only the top half of the matrix). I take the two corresponding columns and add them, then count the positions where the sum is 2 (meaning present in both models)
Writes the data to a file
Creates the matrix later
Comparison code:
generationFiles = list.files(pattern = "^Generation.*\\_\\d+.txt$")
start.time = Sys.time()
for(a in 1:length(generationFiles)){
fname = generationFiles[a]
geneData = read.table(generationFiles[a], sep = "\n", header = T, stringsAsFactors = F)
geneCount = str_count(geneData[1,1],"[1|0]")
geneDF <- data.frame(Gene = paste0("Gene_", c(1:geneCount)), stringsAsFactors = F)
#convert the string into a data frame
for(i in 1:nrow(geneData)){
#remove the square brackets
dataRow = substring(geneData[i,1], 2, nchar(geneData[i,1]) - 1)
#removing white spaces
dataRow = gsub(" ", "", dataRow, fixed = T)
#splitting the string
dataRow = strsplit(dataRow, ",")
#converting to numeric
dataRow = as.numeric(unlist(dataRow))
colName = paste("M_",i,sep = "")
geneDF <- cbind(geneDF, dataRow)
colnames(geneDF)[colnames(geneDF) == 'dataRow'] <- colName
dataRow <- NULL
}
summaryDF <- data.frame(Model1 = character(), Model2 = character(), Common = integer(),
Uncommon = integer(), Absent = integer(), stringsAsFactors = F)
modelNames = paste0("M_",c(1:450))
secondaryLevel = modelNames
fileName = paste0("D://BellosData//GC_3//Summary//",substr(fname, 1, nchar(fname) - 4),"_Summary.txt")
for(x in 1:449){
secondaryLevel = secondaryLevel[-1]
for(y in 1:length(secondaryLevel)){
result = geneDF[modelNames[x]] + geneDF[secondaryLevel[y]]
summaryDF <- rbind(summaryDF, data.frame(Model1 = modelNames[x],
Model2 = secondaryLevel[y],
Common = sum(result == 2),
Uncommon = sum(result == 1),
Absent = sum(result == 0)))
}
}
write.table(summaryDF, fileName, sep = ",", quote = F, row.names = F)
geneDF <- NULL
summaryDF <- NULL
geneData <-NULL
}
Converting to a matrix:
maxNum = max(summaryDF$Common)
normalizeData = summaryDF[,c(1:3)]
normalizeData[c('Common')] <- lapply(normalizeData[c('Common')], function(x) 1 - x/maxNum)
normalizeData[1:2] <- lapply(normalizeData[1:2], factor, levels=unique(unlist(normalizeData[1:2])))
distMatrixN = xtabs(Common~Model1+Model2, data=normalizeData)
distMatrixN = distMatrixN + t(distMatrixN)
Is there a way to make the process run faster? Is there a more efficient way to do the comparison?
This code should be faster. Nested loops are very slow in R, and rbind-ing one row at a time is among the slowest patterns in R programming.
Generate 450 rows, each with 20 elements drawn from 0 and 1:
M = do.call(rbind, replicate(450, sample(0:1, 20, replace = T), simplify = F))
Generate list of combination(450, 2) numbers of row pairs
L = split(v<-t(utils::combn(450, 2)), seq(nrow(v))); rm(v)
Apply whatever comparison function you want. In this case, it is the number of 1's at the same position for each row combination. If you want to calculate a different metric, just write another function(x) where M[x[1],] is the first row and M[x[2],] is the second row.
O = lapply(L, function(x) sum(M[x[1],]&M[x[2],]))
The code takes ~4 seconds on a fairly slow 2.6 GHz Sandy Bridge.
Get a clean data.frame with your results, three columns : row 1, row 2, metric between the two rows
data.frame(row1 = sapply(L, `[`, 1),
row2 = sapply(L, `[`, 2),
similarity_metric = do.call(rbind, O))
To be honest, I didn't thoroughly comb through your code to replicate exactly what you were doing. If this is not what you are looking for (or can't be modified to achieve what you are looking for), leave a comment.
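As a further sketch, swapping in matrix products for the pairwise loop (an alternative technique, not the method used above): for a 0/1 matrix with one model per row, tcrossprod() counts the positions where both rows are 1 for every pair in one call. The random matrix stands in for real data, and common, absent, uncommon and dissim are names introduced here to mirror the question's Common, Uncommon and Absent columns.

```r
set.seed(1)
# Stand-in data: 450 models (rows) by 20 genes (columns)
M <- matrix(sample(0:1, 450 * 20, replace = TRUE), nrow = 450)

common   <- tcrossprod(M)                   # both models have the gene
absent   <- tcrossprod(1 - M)               # neither model has the gene
uncommon <- ncol(M) - common - absent       # exactly one model has it

# Same normalisation as in the question: 1 - Common / max(Common) over pairs
max_common <- max(common[upper.tri(common)])
dissim <- 1 - common / max_common
diag(dissim) <- 0                           # a model is not dissimilar to itself
```

Each result is a full symmetric 450 x 450 matrix, so the later xtabs() step and the addition of the transpose would no longer be needed under this approach.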
