Unlist the second-to-last level of a nested list - R

I have a deeply nested list of lists. In the "center" of the nested list is a vector containing n integers. I need to count how many integers are in each nested list, then unlist one level up to get a vector of these counts (i.e., instead of list(0, 1:5, 0, 0, 1:3) at the center of the nest, I want c(0, 5, 0, 0, 3)).
This seems relatively simple - I was able to use rapply to accomplish the first part, i.e. convert list(0, 1:5, 0, 0, 1:3) to list(0, 5, 0, 0, 3). The specific question I need help with is how to unlist these innermost lists to vectors (instead of list(0, 5, 0, 0, 3) I want c(0, 5, 0, 0, 3)).
I have searched and tried various apply, lapply, and unlist approaches, but none of them are quite right, as they target the very innermost lists. Since the list I want to unlist is at the second-to-last level, I am struggling to find a way to accomplish this elegantly.
In the sample data below, I can get the desired outcome in two ways: either multiple nested lapply calls or a for loop. However, my actual data contain many more lists and millions of data points, so these are likely not practical options.
Below are (1) sample data, (2) what I have tried, and (3) sample data with the desired structure.
Sample Data
have_list <- list(
  scenario1 = list(
    method1 = list(place1 = list(0, 1:5, 0, 0, 1:3),
                   place2 = list(1:2, 0, 1:10, 0, 0),
                   place3 = list(0:19, 0, 0, 0, 0),
                   place4 = list(1:100, 0, 0, 1:4, 0)),
    method2 = list(place1 = list(1:5, 1:5, 0, 0, 1:3),
                   place2 = list(0, 0, 1:5, 0, 0),
                   place3 = list(0:19, 0, 1:7, 0, 0),
                   place4 = list(1:22, 0, 0, 1:4, 0)),
    method3 = list(place1 = list(0, 1:2, 1:6, 0, 1:3),
                   place2 = list(1:2, 0, 1:6, 1:4, 0),
                   place3 = list(0:19, 0, 0, 0, 1:2),
                   place4 = list(1:12, 0, 0, 1:12, 0))),
  scenario2 = list(
    method1 = list(place1 = list(0, 1:5, 0, 0, 1:3),
                   place2 = list(1:2, 0, 1:10, 0, 0),
                   place3 = list(0:19, 0, 0, 0, 0),
                   place4 = list(1:100, 0, 0, 1:4, 0)),
    method2 = list(place1 = list(1:5, 1:5, 0, 0, 1:3),
                   place2 = list(0, 0, 1:5, 0, 0),
                   place3 = list(0:19, 0, 1:7, 0, 0),
                   place4 = list(1:22, 0, 0, 1:4, 0)),
    method3 = list(place1 = list(0, 1:2, 1:6, 0, 1:3),
                   place2 = list(1:2, 0, 1:6, 1:4, 0),
                   place3 = list(0:19, 0, 0, 0, 1:2),
                   place4 = list(1:12, 0, 0, 1:12, 0))))
What I have tried
And the SO questions I have visited:
Unlist LAST level of a list in R
R - unlist nested list of dates
Issues unlisting a nested-list
# Get the number of integers in each nested list
lengths <- rapply(have_list, function(x) unlist(length(x)), how = "list") # this works fine
#' Each count is currently still in its own list of length 1.
#' Convert each count to a vector.
#' In the "middle" of the nested list:
# I have list(0, 5, 0, 0, 3)
# I want c(0, 5, 0, 0, 3)
# Attempts to unlist the counts
test1 <- rapply(lengths, unlist, how = "list") # doesn't work
test2 <- unlist(lengths, recursive = FALSE) # doesn't work
test3 <- lapply(lengths, function(x) lapply(x, unlist)) # doesn't work
test4 <- lapply(lengths, function(x) lapply(x, unlist, recursive = FALSE)) # doesn't work
test5 <- rapply(have_list, function(x) unlist(length(x)), how = "list") # doesn't work
test6 <- rapply(have_list, function(x) unlist(length(x)), how = "unlist") # doesn't work
Data structure I want
# This works on test data but is impractical for real data
want_list <- lapply(lengths, function(w) lapply(w, function(x) lapply(x, unlist)))
# or
want_list <- lengths
## for loops work but are not practical
for (i in 1:length(lengths)) {
  for (j in 1:length(lengths[[i]])) {
    for (k in 1:length(lengths[[i]][[j]])) {
      want_list[[i]][[j]][[k]] <- unlist(lengths[[i]][[j]][[k]])
    }
  }
}

An option is to melt the nested list with rrapply, replace the 'value' column with the lengths, and then use the recursive split (rsplit) from collapse:
library(rrapply)
library(collapse)
dat <- transform(rrapply(have_list, how = "melt"), value = lengths(value))
out <- rsplit(dat$value, dat[1:3])
Testing with the OP's expected output:
identical(out, want_list)
[1] TRUE

Another solution with rrapply() could be to apply lengths() only to the lists of vectors using a condition function:
library(rrapply)
out <- rrapply(have_list, classes = "list", condition = \(x) is.numeric(x[[1]]), f = lengths)
identical(want_list, out)
#> [1] TRUE

This can be done using recursion. A simple recursive function (it checks only the first element, so it assumes uniform nesting) is:
my_fun <- function(x) if(is.list(x[[1]])) lapply(x, my_fun) else lengths(x)
out <- my_fun(have_list)
identical(out, want_list)
[1] TRUE

Related

Replicating dplyr pipe structure with apply family or loop

I have a data frame df in which, for each column, I want to calculate what share of occurrences also occur in another column. Each row of occurrences has a weight, so ideally I would like to get a weighted share.
A <- c(0, 1, 0, 0, 1, 0, 1, 1, 1, 0)
B <- c(0, 1, 0, 1, 1, 0, 0, 0, 0, 0)
C <- c(0, 0, 0, 1, 1, 0, 0, 0, 0, 1)
D <- c(1, 0, 0, 1, 1, 0, 0, 0, 0, 0)
weight <- c(0.5, 1, 0.2, 0.3, 1.4, 1.5, 0.8, 1.2, 1, 0.9)
df <- data.frame(A, B, C, D, weight)
I was trying to calculate it for each column pair this way:
library(dplyr)

# total weight of occurrences in A
wgt_A <- df %>%
  filter(A == 1) %>%
  summarise(weight_A = sum(weight)) %>%
  select(weight_A)
# weighted share of occurrences in A that also occur in B
wgt_A_B <- df %>%
  filter(A == 1, B == 1) %>%
  summarise(weight_A_B = sum(weight)) %>%
  select(weight_A_B)
Result_1 <- wgt_A_B / wgt_A
I would want to end up with six results in total for all combinations of the 4 columns. However, for this I would need to replicate this dplyr pipe a lot of times and my actual dataset has 20+ columns like this. Is there a more efficient/quicker way to do this with apply/sapply or some kind of loop where I can also select for which columns I want to perform this?
I'm new to R and stackoverflow so please let me know (and excuse me) if I'm doing/saying anything stupid
We may use combn to do the combinations in base R:
out <- combn(df[1:4], 2, FUN = function(x)
  sum(df$weight[x[[1]] & x[[2]]]) / sum(df$weight[as.logical(x[[1]])]))
names(out) <- combn(names(df)[1:4], 2, FUN = paste, collapse = "_")
Output:
> out
A_B A_C A_D B_C B_D C_D
0.4444444 0.2592593 0.2592593 0.6296296 0.6296296 0.6538462

Converting ENSEMBL IDs to Gene ID in a Data Frame

I have a large data table of RNA-seq data that is listed by ensembl_gene_id, but I would like to convert to hgnc_symbol for ease of visualization on heat maps.
So far I have the following code - but not sure how to proceed. Would it be better to convert the names from the beginning, or only on the subsetted data?
I am also more familiar with python, and normally, I would use a dictionary to map ensembl_gene_id and hgnc_symbol, but in R, not sure how to go about this. My gut says for loops wouldn't be scalable.
Any suggestions would be appreciated.
library(biomaRt)
library(RColorBrewer)
#Load ggplot2 for graphing
#library(ggplot2)
#Load the Gene Expression File. This one is MEAN TPM for genes across cell types.
GE_file <- read.csv(file = "mean_tpm_merged.csv")
#Get the header names of this file
headers <- names(GE_file)
# define biomart object
mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
# query biomart
#Define Genes of Interest
GOI <- c("TFEB", "RAC1", "TFE3", "RAB5A")
# get the mapping of GOI and ENSEMBL IDs and create a dictionary
IDs <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"),
             filters = "hgnc_symbol", values = GOI,
             mart = mart)
# make the row names the HGNC symbols (column 2)
row.names(IDs) <- IDs[,2]
# Look by rows of interest for this data out of the large dataset
Data_subset <- subset(GE_file, gene %in% IDs$ensembl_gene_id)
# make the row names the ENSEMBL IDs
row.names(Data_subset) <- Data_subset[,1]
# drop the first column (gene IDs), as it's not needed for the numerical matrix
Data_subset_matrix <- as.matrix(Data_subset[,2:16])
# colors should be green/red if possible, or whatever is color blind compatible.
# should go row-wise for the coloring.
# excise colors for B cells/NK cells/CD8 T cells.
my_palette <- colorRampPalette(c("red","green"))(n = 299)
heatmap(Data_subset_matrix, Colv = NA, Rowv = NA, scale = 'row', col = my_palette)
Some relevant output:
> dput(head(GE_file))
structure(list(gene = c("ENSG00000223116", "ENSG00000233440",
"ENSG00000207157", "ENSG00000229483", "ENSG00000252952", "ENSG00000235205"
), T.cell..CD4..naive..activated. = c(0, 0.0034414596504, 0,
0, 0, 0), NK.cell..CD56dim.CD16. = c(0, 0, 0, 0, 0, 0.0139463278778
), T.cell..CD4..TFH = c(0, 0, 0, 0, 0, 0), T.cell..CD4..memory.TREG = c(0,
0, 0, 0, 0, 0.000568207845073), T.cell..CD4..TH1.17 = c(0, 0.0196376949773,
0, 0, 0, 0), B.cell..naive = c(0, 0, 0, 0, 0, 0), T.cell..CD4..TH2 = c(0,
0, 0, 0, 0, 0), T.cell..CD4..TH1 = c(0, 0, 0, 0, 0, 0.000571213481481
), T.cell..CD4..naive = c(0, 0, 0, 0, 0, 0), T.cell..CD4..TH17 = c(0,
0.00434618468012, 0, 0, 0, 0), Monocyte..classical = c(0, 0,
0, 0, 0, 0), Monocyte..non.classical = c(0, 0, 0, 0, 0, 0), T.cell..CD4..naive.TREG = c(0,
0, 0, 0, 0, 0.000821516453853), T.cell..CD8..naive = c(0, 0,
0, 0, 0, 0.000508869486411), T.cell..CD8..naive..activated. = c(0,
0.00348680689669, 0, 0, 0, 0)), row.names = c(NA, 6L), class = "data.frame")
Get everything in one go:
mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
IDs <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"),
             filters = "ensembl_gene_id", values = GE_file[,1],
             mart = mart)
head(IDs)
ensembl_gene_id hgnc_symbol
1 ENSG00000207157 RNY3P4
2 ENSG00000229483 LINC00362
3 ENSG00000233440 HMGA1P6
4 ENSG00000235205 TATDN2P3
5 ENSG00000252952 RNU6-58P
GOI <- c("RNY3P4", "TATDN2P3")
Simple way: subset the ENSEMBL IDs in your master table, and subset your dataset according to that:
GOI_ens = IDs$ensembl_gene_id[IDs$hgnc_symbol %in% GOI]
Data_subset = subset(GE_file, gene %in% GOI_ens)[,-1]
Dictionary way: there's always something you can do, but you need to ensure there are no duplicated symbols:
dedup = !duplicated(IDs$hgnc_symbol)
dict = tapply(IDs$hgnc_symbol, IDs$ensembl_gene_id, unique)
subset(GE_file, dict[gene] %in% GOI)
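This dict is effectively the Python-style dictionary the OP describes: a named array mapping each ensembl_gene_id to its hgnc_symbol, so single lookups work by name. A quick illustration (my own sketch; the ID/symbol pair is taken from the head(IDs) output above):
# look up one ENSEMBL ID by name, as one would with a Python dict
dict["ENSG00000207157"]
#> ENSG00000207157
#>        "RNY3P4"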

dplyr split-apply-combine with list column

Say I have a tibble with two columns: a group variable (grp) and a list-column
containing matrices of equal dimension (mat).
library(dplyr) # provides tibble() and the verbs used below

mat1 <- matrix(c(2, 0, 0, 0), nrow = 2)
mat2 <- matrix(c(0, 0, 0, 0), nrow = 2)
mat3 <- matrix(c(0, 0, 0, 2), nrow = 2)
mat4 <- matrix(c(0, 0, 0, 0), nrow = 2)
df <- tibble(grp = c('a', 'a', 'b', 'b'),
             mat = list(mat1, mat2, mat3, mat4))
Edit:
I want to calculate the mean matrix by group and add it as a new list-column, i.e., the new column should be:
list(matrix(c(1, 0, 0, 0), nrow = 2),
     matrix(c(1, 0, 0, 0), nrow = 2),
     matrix(c(0, 0, 0, 1), nrow = 2),
     matrix(c(0, 0, 0, 1), nrow = 2))
The best I can do is:
df_out <- df %>%
  group_by(grp) %>%
  mutate(n = n(),
         mean_mat = list(Reduce('+', mat) / n)) %>%
  ungroup()
It works, but I'm trying to understand why the call to list is necessary, and also hoping to find an alternative approach (either tidyverse or base R) that is perhaps simpler.
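One way to see why list() is needed: Reduce('+', mat) / n evaluates to a single 2x2 matrix (length 4), while mutate() requires a result whose length is the group size or 1; wrapping the matrix in list() yields a length-1 list that is recycled across the group's rows, producing a list-column. Below is a base R alternative sketch of the same computation (my own, assuming the equal-dimension matrices from the example):
# mean matrix per group, then recycle each group's mean back to its rows
sp <- split(df$mat, df$grp)
group_means <- lapply(sp, function(m) Reduce(`+`, m) / length(m))
df$mean_mat <- group_means[df$grp]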

Pairwise binary comparison - optimizing code in R

I have a file that represents the gene structure of bacteria models. Each row represents a model. A row is a fixed-length binary string indicating which genes are present (1 for present, 0 for absent). My task is to compare the gene sequence for each pair of models, get a score of how similar they are, and compute a dissimilarity matrix.
In total there are 450 models (rows) in one file, and there are 250 files. I have working code; however, it takes roughly 1.6 hours to do the whole thing for only one file.
#Sample Data
Generation: 0
[0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0]
[1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1]
[1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0]
[0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0]
[0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0]
[1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
What my code does:
1. Reads the file
2. Converts the binary strings into a data frame: Gene, Model_1, Model_2, Model_3, ..., Model_450
3. Runs a nested for loop to do the pair-wise comparison (only the top half of the matrix) - takes the two corresponding columns and adds them, then counts the positions where the sum is 2 (meaning present in both models)
4. Writes the data to a file
5. Creates the matrix later
comparison code
library(stringr)

generationFiles = list.files(pattern = "^Generation.*\\_\\d+.txt$")
start.time = Sys.time()

for (a in 1:length(generationFiles)) {
  fname = generationFiles[a]
  geneData = read.table(generationFiles[a], sep = "\n", header = T, stringsAsFactors = F)
  geneCount = str_count(geneData[1,1], "[1|0]")
  geneDF <- data.frame(Gene = paste0("Gene_", c(1:geneCount)), stringsAsFactors = F)
  # convert the strings into a data frame
  for (i in 1:nrow(geneData)) {
    # remove the square brackets
    dataRow = substring(geneData[i,1], 2, nchar(geneData[i,1]) - 1)
    # remove white space
    dataRow = gsub(" ", "", dataRow, fixed = T)
    # split the string
    dataRow = strsplit(dataRow, ",")
    # convert to numeric
    dataRow = as.numeric(unlist(dataRow))
    colName = paste("M_", i, sep = "")
    geneDF <- cbind(geneDF, dataRow)
    colnames(geneDF)[colnames(geneDF) == 'dataRow'] <- colName
    dataRow <- NULL
  }
  summaryDF <- data.frame(Model1 = character(), Model2 = character(), Common = integer(),
                          Uncommon = integer(), Absent = integer(), stringsAsFactors = F)
  modelNames = paste0("M_", c(1:450))
  secondaryLevel = modelNames
  fileName = paste0("D://BellosData//GC_3//Summary//", substr(fname, 1, nchar(fname) - 4), "_Summary.txt")
  for (x in 1:449) {
    secondaryLevel = secondaryLevel[-1]
    for (y in 1:length(secondaryLevel)) {
      result = geneDF[modelNames[x]] + geneDF[secondaryLevel[y]]
      summaryDF <- rbind(summaryDF, data.frame(Model1 = modelNames[x],
                                               Model2 = secondaryLevel[y],
                                               Common = sum(result == 2),
                                               Uncommon = sum(result == 1),
                                               Absent = sum(result == 0)))
    }
  }
  write.table(summaryDF, fileName, sep = ",", quote = F, row.names = F)
  geneDF <- NULL
  summaryDF <- NULL
  geneData <- NULL
}
converting to matrix
maxNum = max(summaryDF$Common)
normalizeData = summaryDF[,c(1:3)]
normalizeData[c('Common')] <- lapply(normalizeData[c('Common')], function(x) 1 - x/maxNum)
normalizeData[1:2] <- lapply(normalizeData[1:2], factor, levels=unique(unlist(normalizeData[1:2])))
distMatrixN = xtabs(Common~Model1+Model2, data=normalizeData)
distMatrixN = distMatrixN + t(distMatrixN)
Is there a way to make the process run faster? Is there a more efficient way to do the comparison?
This code should be faster. Nested loops are nightmarishly slow in R, and operations like rbind-ing one row at a time are also among the worst and slowest ideas in R programming.
Generate 450 rows with 20 elements of 0 or 1 on each row:
M = do.call(rbind, replicate(450, sample(0:1, 20, replace = T), simplify = F))
Generate a list of all choose(450, 2) row pairs:
L = split(v<-t(utils::combn(450, 2)), seq(nrow(v))); rm(v)
Apply whatever comparison function you want. In this case, it counts the 1s at the same positions for each pair of rows. If you want to calculate a different metric, just write another function(x), where M[x[1],] is the first row and M[x[2],] is the second row:
O = lapply(L, function(x) sum(M[x[1],]&M[x[2],]))
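For example (my own illustration, not from the original answer), the question's "Uncommon" count - positions where exactly one of the two rows has a 1 - fits the same pattern:
# positions where exactly one of the two rows is 1
O_uncommon = lapply(L, function(x) sum(xor(M[x[1],], M[x[2],])))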
The code takes ~4 seconds on a fairly slow 2.6 GHz Sandy Bridge.
Get a clean data.frame with your results, with three columns: row 1, row 2, and the metric between the two rows:
data.frame(row1 = sapply(L, `[`, 1),
           row2 = sapply(L, `[`, 2),
           similarity_metric = do.call(rbind, O))
To be honest, I didn't thoroughly comb through your code to replicate exactly what you were doing. If this is not what you are looking for (or can't be modified to achieve what you are looking for), leave a comment.
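As an aside (my own suggestion, not part of the answer above): for this particular metric, matrix algebra produces the entire pairwise table in one step, since entry (i, j) of M %*% t(M) counts the positions where rows i and j are both 1:
# all pairwise shared-1 counts at once; tcrossprod(M) computes M %*% t(M)
common <- tcrossprod(M)
common[1:3, 1:3] # counts for the first three models against each other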

R Return p-values for categorical independent variables with glm

I recently asked a question about looping a glm command for all possible combinations of independent variables. Another user provided a great answer that runs all possible models, however I can't figure out how to produce a data.frame of all possible p-values.
The code suggested in the previous question works for independent variables that are binary (pasted below). However, several of my variables are categorical. Is there any way to adjust the code so that I can produce a table of all p-values for every possible model (there are 2,046 possible models with 10 independent variables...)?
# p-values in a data.frame
p_values <-
  cbind(formula_vec, as.data.frame(do.call(rbind,
    lapply(glm_res, function(x) {
      coefs <- coef(x)
      rbind(c(coefs[,4], rep(NA, length(ind_vars) - length(coefs[,4]) + 1)))
    })
  )))
An example of one independent variable is "Bedrock" where possible categories include: "till," "silt," and "glacial deposit." It's not feasible to assign a numerical value to these variables, which is part of the problem. Any suggestions would be appreciated.
In the case of an additional categorical variable IndVar4 (a factor with levels a, b, c), the coefficient table can be more than just one row longer. Adding variable IndVar4:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.7548180 1.4005800 -1.2529223 0.2102340
IndVar1 -0.2830926 1.2076534 -0.2344154 0.8146625
IndVar2 0.1894432 0.1401217 1.3519903 0.1763784
IndVar3 0.1568672 0.2528131 0.6204867 0.5349374
IndVar4b 0.4604571 1.0774018 0.4273773 0.6691045
IndVar4c 0.9084545 1.0943227 0.8301523 0.4064527
The maximum number of rows is at most the number of variables plus all extra category levels:
max_values <- length(ind_vars) +
  sum(sapply(dfPRAC, function(x) pmax(length(levels(x)) - 1, 0)))
So the new corrected function is:
p_values <-
  cbind(formula_vec, as.data.frame(do.call(rbind,
    lapply(glm_res, function(x) {
      coefs <- coef(x)
      rbind(c(coefs[,4], rep(NA, max_values - length(coefs[,4]) + 1)))
    })
  )))
But the result is not as clean as with continuous variables. I think Metrics' idea of converting every categorical variable to (levels - 1) dummy variables gives the same results and maybe a cleaner presentation.
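A sketch of that dummy-variable conversion (my own illustration, using the dfPRAC data defined below; model.matrix() applies R's default treatment contrasts, so it generates column names such as IndVar4b, matching the coefficient table above):
# expand IndVar4 into (levels - 1) dummy columns; [, -1] drops the intercept column
dummies <- model.matrix(~ IndVar4, data = dfPRAC)[, -1]
dfPRAC_dummy <- cbind(dfPRAC[setdiff(names(dfPRAC), "IndVar4")], dummies)
head(dfPRAC_dummy)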
Data:
dfPRAC <- structure(list(DepVar1 = c(0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1,
1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1), DepVar2 = c(0, 1, 0, 0,
1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1),
IndVar1 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
0, 0, 0, 1, 0, 0, 0, 1, 0),
IndVar2 = c(1, 3, 9, 1, 5, 1,
1, 8, 4, 6, 3, 15, 4, 1, 1, 3, 2, 1, 10, 1, 9, 9, 11, 5),
IndVar3 = c(0.500100322564443, 1.64241601558441, 0.622735778490702,
2.42429812749226, 5.10055213237027, 1.38479786027561, 7.24663629203007,
0.5102348706939, 2.91566510995229, 3.73356170379198, 5.42003495939846,
1.29312896116503, 3.33753833987496, 0.91783513806083, 4.7735736131668,
1.17609362602233, 5.58010703426296, 5.6668754863739, 1.4377813063642,
5.07724130837643, 2.4791994535923, 2.55100067348583, 2.41043629522981,
2.14411703944206)), .Names = c("DepVar1", "DepVar2", "IndVar1",
"IndVar2", "IndVar3"), row.names = c(NA, 24L), class = "data.frame")
dfPRAC$IndVar4 <- factor(rep(c("a", "b", "c"),8))
dfPRAC$IndVar5 <- factor(rep(c("d", "e", "f", "g"),6))
Set up the models:
dep_vars <- c("DepVar1", "DepVar2")
ind_vars <- c("IndVar1", "IndVar2", "IndVar3", "IndVar4", "IndVar5")
# create all combinations of ind_vars
ind_vars_comb <-
  unlist(sapply(seq_len(length(ind_vars)),
                function(i) {
                  apply(combn(ind_vars, i), 2, function(x) paste(x, collapse = "+"))
                }))
# pair with dep_vars:
var_comb <- expand.grid(dep_vars, ind_vars_comb)
# formulas for all combinations
formula_vec <- sprintf("%s ~ %s", var_comb$Var1, var_comb$Var2)
# create models
glm_res <- lapply(formula_vec, function(f) {
  fit1 <- glm(f, data = dfPRAC, family = binomial("logit"))
  fit1$coefficients <- coef(summary(fit1))
  return(fit1)
})
names(glm_res) <- formula_vec
