trying to get a proper names(list) output - r

I'm trying to split a 2 level deep list of characters into a 1 level list using a suffix.
More precisely, I have a list of genes, each containing 6 lists of probes corresponding to 6 bins. The architecture looks like :
feat_indexed_probes_bin$HSPB6$bin1
[1] "cg14513218" "cg22891287" "cg20713852" "cg04719839" "cg27580050" "cg18139462" "cg02956481" "cg26608795" "cg15660498" "cg25654926" "cg04878216"
I'm trying to get a list "bins_indexed_probes" with the following architecture :
bins_indexed_probes$HSPB6_bin6 containing the same probes so I can pass it to my map-reducing function.
I tried many solutions such as melt(), for loop, etc but I can't figure how to perform a double nested loop ( on genes and on bins) and get a list output with only 1 level depth.
For the moment, my func to do so is the following :
create_map <- function(indexes = feat_indexed_probes_bin, binlist = c("bin1", "bin2", "bin3", "bin4", "bin5", "bin6"), genes = features) {
map <- list()
ret <- lapply(binlist, function(bin) {
lapply(rownames(features), function(gene) {
map[[paste(gene, "_", bin, sep = "")]] <- feat_indexed_probes_bin[[gene]][[bin]]
tmp_names <<- paste(gene, "_", bin, sep = "")
return(map)
})
names(map) <- tmp_names
rm(tmp_names)
})
return(ret)
}
it returns:
[[6]][[374]]
GDF10_bin6
"cg13565300"
[[6]][[375]]
NULL
[[6]][[376]]
[[6]][[376]]$HNF1B_bin6
[1] "cg03433642" "cg09679923" "cg17652435" "cg03348978" "cg02435495" "cg02701059" "cg05110178" "cg11862993" "cg09463047"
[[6]][[377]]
[[6]][[377]]$GPIHBP1_bin6
[1] "cg01953797" "cg00152340"
instead, I would expect something like
$GPIHBP1_bin1
"cg...." "cg...."
...
$GPIHBP1_bin6
"someotherprobe"
$someothergene_bin1
"probe" "probe"
...
I hope I'm being clear, and since this is my first time asking question, I already apologise if I didn't follow the stackoverflow protocol.
Thank you already for reading me

Consider a nested lapply with extract, [[, and setNames calls, all wrapped in do.call using c to bind return elements together.
bins_indexed_probes <- do.call(c,
lapply(1:6, function(i)
setNames(lapply(feat_indexed_probes_bin, `[[`, i),
paste0(names(feat_indexed_probes_bin), "_bin", i))
)
)
# RE-ORDER ELEMENTS BY NAME
bins_indexed_probes <- bins_indexed_probes[sort(names(bins_indexed_probes))]
Rextester Demo

Related

Nested list assignment R

I have a list of the following type
categories = list(
c("Women","Clothing", "Jeans"),
c("Women","Clothing", "Sweaters"),
c("Men","Accessories", "Belts"),
c("Women", "Accessories", "Jewelry" ))
I want to parse this list and create a list of lists to export in JSON and it should have the following structure:
Women={
Clothing= {
Jeans{},
Sweaters{}
},
accesories={
Jewleery{}
}
},
Men ={
Accessires={
Belts={}
}
So it should go over each element which is a char vector contained in the list and check if there is such element in the final list, if there isn't it should append it. It should append the element at the proper level. For example if Clothing is second element to Woman, it should append to the Women list of the final list. Or if Sweaters is thrid element to Women.Clothing it should apppend Clothing list of the Women list of the final list.
If the element exists at the given level already it should not append, instead it should go to next element in the char vector.
In the char vectors of the input lsit, the first element is always level 1 the second level 2 the third level 3 etc..
It should be done recursively, I tried few times but I have no idea how to assign to a nested list, specifically i need to do nested assigns.
I made the data into a matrix, transposed, then a dataframe:
x <- data.frame(t(vapply(categories, identity, character(3))), stringsAsFactors = F)
Then split, and lapply. You could do this recursively if you have more than 3 levels:
lapply(split(x, x$X1), function(df) {
lapply(split(df, df$X2), function(df) {
lapply(split(df, df$X3), function(x) list())
})
})
If you are looking for a recursive solution, then the following may help you:
output the full directory trajectory within a string at the end
## construct a data frame from list
df <- data.frame(matrix(unlist(categories),nrow = length(categories),byrow = T),stringsAsFactors = F)
## recursion function that makes nested list
f <- function(df, k=1) {
if (k == ncol(df)) return(lapply(split(df,df[,k]), toString)) ##
return(lapply(split(df,df[,k]), function(df) f(df, k+1)))
}
The nested list output looks as below
> f(df)
$Men
$Men$Accessories
$Men$Accessories$Belts
[1] "Men, Accessories, Belts"
$Women
$Women$Accessories
$Women$Accessories$Jewelry
[1] "Women, Accessories, Jewelry"
$Women$Clothing
$Women$Clothing$Jeans
[1] "Women, Clothing, Jeans"
$Women$Clothing$Sweaters
[1] "Women, Clothing, Sweaters"
output empty lists at the end
f <- function(df, k=1) {
if (k == ncol(df)) return(lapply(split(df,df[,k]), function(v) list()))
return(lapply(split(df,df[,k]), function(df) f(df, k+1)))
}
which gives:
> f(df)
$Men
$Men$Accessories
$Men$Accessories$Belts
list()
$Women
$Women$Accessories
$Women$Accessories$Jewelry
list()
$Women$Clothing
$Women$Clothing$Jeans
list()
$Women$Clothing$Sweaters
list()

Loop Changing to Matrix then Running tests

I have a dataframe with ~9000 rows of human coded data in it, two coders per item so about 4500 unique pairs. I want to break the dataset into each of these pairs, so ~4500 dataframes, run a kripp.alpha on the scores that were assigned, and then save those into a coder sheet I have made. I cannot get the loop to work to do this.
I can get it to work individually, using this:
example.m <- as.matrix(example.m)
s <- kripp.alpha(example.m)
example$alpha <- s$value
However, when trying a loop I am getting either "Error in get(v) : object 'NA' not found" when running this:
for (i in items) {
v <- i
v <- v[c("V1","V2")]
v <- assign(v, as.matrix(get(v)))
s <- kripp.alpha(v)
i$alpha <- s$value
}
Or am getting "In i$alpha <- s$value : Coercing LHS to a list" when running:
for (i in items) {
i.m <- i[c("V1","V2")]
i.m <- as.matrix(i.m)
s <- kripp.alpha(i.m)
i$alpha <- s$value
}
Here is an example set of data. Items is a list of individual dataframes.
l <- as.data.frame(matrix(c(4,3,3,3,1,1,3,3,3,3,1,1),nrow=2))
t <- as.data.frame(matrix(c(4,3,4,3,1,1,3,3,1,3,1,1),nrow=2))
items <- c("l","t")
I am sure this is a basic question, but what I want is for each file, i, to add a column with the alpha score at the end. Thanks!
Your problem is with scoping and extracting names from objects when referenced through strings. You'd need to eval() some of your object to make your current approach work.
Here's another solution
library("irr") # For kripp.alpha
# Produce the data
l <- as.data.frame(matrix(c(4,3,3,3,1,1,3,3,3,3,1,1),nrow=2))
t <- as.data.frame(matrix(c(4,3,4,3,1,1,3,3,1,3,1,1),nrow=2))
# Collect the data as a list right away
items <- list(l, t)
Now you can sapply() directly over the elements in the list.
sapply(items, function(v) {
kripp.alpha(as.matrix(v[c("V1","V2")]))$value
})
which produces
[1] 0.0 -0.5

R invoking a dynamically created list within for loop

I have the following dynamic list created with the names cluster_1, cluster_2... like so:
observedUserShifts <- vector("list")
cut <- 2
for (i in 1:cut) {
assign(paste('cluster_', i, sep=''), subset(sortedTestRTUser, cluster==i))
observedUserShifts[[i]] <- mean(cluster_1$shift_length_avg)
}
Notice that i have cut=2 so 2 lists are created dynamically with the names due to the 'assign' function: cluster_1 and cluster_2
I want to invoke each of the above lists within the for loop. Notice that i have hard coded cluster_1 in the for loop (2nd line inside for loop). How do I change this so that this is not hard coded?
I tried:
> observedUserShifts[[i]] <- mean((paste('cluster_','k',sep='')$shift_length_avg)
+ )
Error in paste("cluster_", "k", sep = "")$shift_length_avg :
$ operator is invalid for atomic vectors
Agree this is suboptimal coding practice, but to answer the specific question, use get:
for (i in 1:cut) {
assign(paste('cluster_', i, sep=''), subset(sortedTestRTUser, cluster==i))
observedUserShifts[[i]] <-
mean( get(paste('cluster_', i, sep='') )[['shift_length_avg']] )
}
Notice that instead of using $ I chose to use [[ with a quoted column name.

rbind multiple dataframes within a function

I found this code line below on SO and it worked as a charm outside a function to identify the list of dataframes and join them using rbind.
mylist<-ls(pattern='leg_')
mleg <- do.call(rbind, lapply(mylist, get))
But when I enclose this within a loop, I am getting an error message. I have tried to troubleshoot at various steps in the functions and those work but I might be missing something that is causing this error.
for(i in 1:(length(blg_idx))){
assign(paste(deparse(substitute(leg_)),i,sep=''),l_merge(get(paste(deparse(substitute(blg)),i,sep='')),get(paste(deparse(substitute(bsg)),i,sep=''))))
}
mylist<-ls(pattern='leg_')
#return(mylist) # this returns a good list of dataframes
#mlegleg <- rbind(leg_1,leg_2) # this works
mleg <- do.call(rbind, lapply(mylist, get))
return(mleg)
} #end function read_leg
Error in FUN(c("leg_1", "leg_2")[[1L]], ...) :
object 'leg_1' not found
When I return mylist from the function, it is able to identify all the dataframes and list them. The function is able to return leg_1 or leg_2 dataframe when I choose to return those in debugging.
[1] "leg_1" "leg_2"
Any help?
update
I found another of achieving what I need but I am sure it is inefficient although my list of dataframes is a maximum of 4
for(i in 1:(length(blg_idx))){
assign(paste(deparse(substitute(leg_)),i,sep=''),l_merge(get(paste(deparse(substitute(blg)),i,sep='')),get(paste(deparse(substitute(bsg)),i,sep=''))))
}
mylist<-ls(pattern='leg_')
#return(mylist)
#mlegleg <- rbind(leg_1,leg_2) # this works
# mleg <- do.call(rbind, lapply(mylist, get))
mleg <- leg_1
for(i in 2:(length(blg_idx))){
mleg <- rbind(leg,get(paste(deparse(substitute(leg_)),i,sep='')))
}
return(mleg)
} #end read_leg
update 2
Here is the reproducible example for the issue I am facing. For some reason do.call & get is unable to process the mylist parameter generated for dataframes generated within a function.
read_date <- function(x){
pur_1 <- data.frame(sku=859, X = sample(1:10),Y = sample(c("yes", "no"), 10, replace = TRUE))
pur_2 <- data.frame(sku=859, X = sample(11:20),Y = sample(c("yes", "no","na"), 10, replace = TRUE))
mylist<-ls(pattern='pur_')
pur_final <- do.call(rbind, lapply(mylist, get))
#fancier version that I want to achieve is below
#assign(paste('pur_',eval(pur_1$sku[1]),sep=''),do.call(rbind, lapply(mylist, get)))
return(pur_final)
}
read_date()
Error message is
read_date()
Error in FUN(c("pur_1", "pur_2")[[1L]], ...) : object 'pur_1' not found
update 3
I am sorry for unconventional management of this post but I will get better with my next ones.
Here is what I stumbled upon that is working for me with an exception.
pur_final <- do.call(rbind, mget(paste0("pur_", 1:2),envir = as.environment(-1)))
But the next not so big issue is to suppress the row.names that get added to the dataframe. Any suggestions to suppress the row.names in this context.
> read_date()
sku X Y
pur_1.1 859 8 yes
pur_1.2 859 4 no
pur_1.3 859 3 yes
....
pur_2.8 859 14 na
pur_2.9 859 13 na
pur_2.10 859 19 no
>
You do not have a reproducible example with which to test this solution but take a look at the help page for get and try this:
mleg <- do.call(rbind, lapply(mylist, get, envir = globalenv() ))
The answer above contains the key to your question: envir = globalenv()
It took me a while to realize that R will create a private environment for each function. And within that private environment your other variables don't exist. That is, unless you tell your function to look in the Global Environment by using the envir argument.
Here's a function that should take a character string as input and then identify all variables (e.g. data frames) in Global Environment that include that string of text in their name. Then it will try to bind those variables (data frames).
If all variables are data frames with the same column names, then it should return a single binded data frame. myBindedDF <- mergeCompatibleTables("mypattern")
bindCompatibleTables <- function(x){
if(is.character(x)){
mylist <- grep(x, ls(pos = 1), value=T)
mergedDF <- do.call(rbind, mget(mylist,envir = as.environment(1)))
return(bindedDF)
} else {
stop("Input is not a character string")
}
}
A late response but I just faced a similar issue to the update 2 posting where "object 'pur_1' not found".
If you want to use the following within a function when you have an unknown number of dataframes starting with "pur_", for example:
mylist <- ls(pattern='pur_')
mleg <- do.call(rbind, lapply(mylist, get))
Then you need to point to the correct environment within the function:
mylist <- ls(pattern='pur_')
mleg <- do.call(rbind, lapply(mylist, get, env=environment()))

combination of expand.grid and mapply?

I am trying to come up with a variant of mapply (call it xapply for now) that combines the functionality (sort of) of expand.grid and mapply. That is, for a function FUN and a list of arguments L1, L2, L3, ... of unknown length, it should produce a list of length n1*n2*n3 (where ni is the length of list i) which is the result of applying FUN to all combinations of the elements of the list.
If expand.grid worked to generate lists of lists rather than data frames, one might be able to use it, but I have in mind that the lists may be lists of things that won't necessarily fit into a data frame nicely.
This function works OK if there are exactly three lists to expand, but I am curious about a more generic solution. (FLATTEN is unused, but I can imagine that FLATTEN=FALSE would generate nested lists rather than a single list ...)
xapply3 <- function(FUN,L1,L2,L3,FLATTEN=TRUE,MoreArgs=NULL) {
retlist <- list()
count <- 1
for (i in seq_along(L1)) {
for (j in seq_along(L2)) {
for (k in seq_along(L3)) {
retlist[[count]] <- do.call(FUN,c(list(L1[[i]],L2[[j]],L3[[k]]),MoreArgs))
count <- count+1
}
}
}
retlist
}
edit: forgot to return the result. One might be able to solve this by making a list of the indices with combn and going from there ...
I think I have a solution to my own question, but perhaps someone can do better (and I haven't implemented FLATTEN=FALSE ...)
xapply <- function(FUN,...,FLATTEN=TRUE,MoreArgs=NULL) {
L <- list(...)
inds <- do.call(expand.grid,lapply(L,seq_along)) ## Marek's suggestion
retlist <- list()
for (i in 1:nrow(inds)) {
arglist <- mapply(function(x,j) x[[j]],L,as.list(inds[i,]),SIMPLIFY=FALSE)
if (FLATTEN) {
retlist[[i]] <- do.call(FUN,c(arglist,MoreArgs))
}
}
retlist
}
edit: I tried #baptiste's suggestion, but it's not easy (or wasn't for me). The closest I got was
xapply2 <- function(FUN,...,FLATTEN=TRUE,MoreArgs=NULL) {
L <- list(...)
xx <- do.call(expand.grid,L)
f <- function(...) {
do.call(FUN,lapply(list(...),"[[",1))
}
mlply(xx,f)
}
which still doesn't work. expand.grid is indeed more flexible than I thought (although it creates a weird data frame that can't be printed), but enough magic is happening inside mlply that I can't quite make it work.
Here is a test case:
L1 <- list(data.frame(x=1:10,y=1:10),
data.frame(x=runif(10),y=runif(10)),
data.frame(x=rnorm(10),y=rnorm(10)))
L2 <- list(y~1,y~x,y~poly(x,2))
z <- xapply(lm,L2,L1)
xapply(lm,L2,L1)
#ben-bolker, I had a similar desire and think I have a preliminary solution worked out, that I've also tested to work in parallel. The function, which I somewhat confusingly called gmcmapply (g for grid) takes an arbitrarily large named list mvars (that gets expand.grid-ed within the function) and a FUN that utilizes the list names as if they were arguments to the function itself (gmcmapply will update the formals of FUN so that by the time FUN is passed to mcmapply it's arguments reflect the variables that the user would like to iterate over (which would be layers in a nested for loop)). mcmapply then dynamically updates the values of these formals as it cycles over the expanded set of variables in mvars.
I've posted the preliminary code as a gist (reprinted with an example below) and would be curious to get your feedback on it. I'm a grad student, that is self-described as an intermediately-skilled R enthusiast, so this is pushing my R skills for sure. You or other folks in the community may have suggestions that would improve on what I have. I do think even as it stands, I'll be coming to this function quite a bit in the future.
gmcmapply <- function(mvars, FUN, SIMPLIFY = FALSE, mc.cores = 1, ...){
require(parallel)
FUN <- match.fun(FUN)
funArgs <- formals(FUN)[which(names(formals(FUN)) != "...")] # allow for default args to carry over from FUN.
expand.dots <- list(...) # allows for expanded dot args to be passed as formal args to the user specified function
# Implement non-default arg substitutions passed through dots.
if(any(names(funArgs) %in% names(expand.dots))){
dot_overwrite <- names(funArgs[which(names(funArgs) %in% names(expand.dots))])
funArgs[dot_overwrite] <- expand.dots[dot_overwrite]
#for arg naming and matching below.
expand.dots[dot_overwrite] <- NULL
}
## build grid of mvars to loop over, this ensures that each combination of various inputs is evaluated (equivalent to creating a structure of nested for loops)
grid <- expand.grid(mvars,KEEP.OUT.ATTRS = FALSE, stringsAsFactors = FALSE)
# specify formals of the function to be evaluated by merging the grid to mapply over with expanded dot args
argdefs <- rep(list(bquote()), ncol(grid) + length(expand.dots) + length(funArgs) + 1)
names(argdefs) <- c(colnames(grid), names(funArgs), names(expand.dots), "...")
argdefs[which(names(argdefs) %in% names(funArgs))] <- funArgs # replace with proper dot arg inputs.
argdefs[which(names(argdefs) %in% names(expand.dots))] <- expand.dots # replace with proper dot arg inputs.
formals(FUN) <- argdefs
if(SIMPLIFY) {
#standard mapply
do.call(mcmapply, c(FUN, c(unname(grid), mc.cores = mc.cores))) # mc.cores = 1 == mapply
} else{
#standard Map
do.call(mcmapply, c(FUN, c(unname(grid), SIMPLIFY = FALSE, mc.cores = mc.cores)))
}
}
example code below:
# Example 1:
# just make sure variables used in your function appear as the names of mvars
myfunc <- function(...){
return_me <- paste(l3, l1^2 + l2, sep = "_")
return(return_me)
}
mvars <- list(l1 = 1:10,
l2 = 1:5,
l3 = letters[1:3])
### list output (mapply)
lreturns <- gmcmapply(mvars, myfunc)
### concatenated output (Map)
lreturns <- gmcmapply(mvars, myfunc, SIMPLIFY = TRUE)
## N.B. This is equivalent to running:
lreturns <- c()
for(l1 in 1:10){
for(l2 in 1:5){
for(l3 in letters[1:3]){
lreturns <- c(lreturns,myfunc(l1,l2,l3))
}
}
}
### concatenated outout run on 2 cores.
lreturns <- gmcmapply(mvars, myfunc, SIMPLIFY = TRUE, mc.cores = 2)
Example 2. Pass non-default args to FUN.
## Since the apply functions dont accept full calls as inputs (calls are internal), user can pass arguments to FUN through dots, which can overwrite a default option for FUN.
# e.g. apply(x,1,FUN) works and apply(x,1,FUN(arg_to_change= not_default)) does not, the correct way to specify non-default/additional args to FUN is:
# gmcmapply(mvars, FUN, arg_to_change = not_default)
## update myfunc to have a default argument
myfunc <- function(rep_letters = 3, ...){
return_me <- paste(rep(l3, rep_letters), l1^2 + l2, sep = "_")
return(return_me)
}
lreturns <- gmcmapply(mvars, myfunc, rep_letters = 1)
A bit of additional functionality I would like to add but am still trying to work out is
cleaning up the output to be a pretty nested list with the names of mvars (normally, I'd create multiple lists within a nested for loop and tag lower-level lists onto higher level lists all the way up until all layers of the gigantic nested loop were done). I think using some abstracted variant of the solution provided here will work, but I haven't figured out how to make the solution flexible to the number of columns in the expand.grid-ed data.frame.
I would like an option to log the outputs of the child processesthat get called in mcmapply in a user-specified directory. So you could look at .txt outputs from every combination of variables generated by expand.grid (i.e. if the user prints model summaries or status messages as a part of FUN as I often do). I think a feasible solution is to use the substitute() and body() functions, described here to edit FUN to open a sink() at the beginning of FUN and close it at the end if the user specifies a directory to write to. Right now, I just program it right into FUN itself, but later it would be nice to just pass gmcmapply an argument called something like log_children = "path_to_log_dir. and then editing the body of the function to (pseudocode) sink(file = file.path(log_children, paste0(paste(names(mvars), sep = "_"), ".txt")
Let me know what you think!
-Nate

Resources