Extracting Nested Elements of an R List Generated by Loops - r

For lists within lists produced by a loop in R (in this example a list of caret models) I get an object with an unpredictable length and names for inner elements, such as list[[1]][[n repeats of 1]][[2]] where the internal [[1]] is repeated multiple times according to the function's input. In some cases, the length of n is not known, when accessing some older stored lists where input was not saved. While there are ways to work within a list index, like with list[length(list)], there appears to be no way to do this with repeated nested elements. This has made accessing them and passing them to various jobs awkward. I assume there is an efficient way to access them that I have missed, so I'm asking for help to do so, with an example case given below.
The function I'm generating gives out a list from a function that creates several outputs. The final list returned for a function having a complicated output structure is produced by returning something like:
return(list(listOfModels, trainingData, testingData))
The listofModels has variable length, depending on input of models given, and potentially other conditions depend on evaluation inside the function. It is made by:
listOfModels <- list(c(listOfModels, list(trainedModel)))
Where the "trainedModel" refers to the most recently trained model generated in the loop. The models used and the number of them may vary each time depending on choice. An unfortunate result is a complicated nested lists within a list.
That is, output[[1]] contains the models I want to access more efficiently, which are themselves list objects, while output[[2]] and output[[3]] are the dataframes used to train and evaluate the models. While accessing the dataframes is simple and has a defined, reproducible structure each time (simply being output[[2]], output[[3]] every time), output[[1]] becomes a mess. E.g., something like the following follows the "output[[1]]":
The only thing I am able to attempt in order to access this is using the fact that [[1]] is attached upon output[[1]] before [[2]]. All of the nested elements except one have a [[2]] at the end. Given the above pattern, there is an ugly solution that works, but is not a desirable format to work with. E.g., after evaluating n models given by a vector of strings called inputList, and a list given as output of the function, "output", I can have [[1]] repeated tens to hundreds of times.
for (i in (1:length(inputList)-1)){
eval(rlang::parse_expr(paste0(c("output", c(rep("[[1]]", 1+i)), "[[2]]" ) , collapse="")) )
}
This could be used to use all models for some downstream task like making predictions on new data, or whatever. In cases where the length of the inputList was not known, this could be found out by attempting to repeat this until finding an error, or something similar. This approach can be modified to call on a specific part of the list, for example, a certain model within inputList, if I know the original list input and can find the number for that model. Besides the bulkiness code working this way, compared to some way where I could just call on output[[1]][[n]] using some predictable format for various length n. One of the big problems is when accessing older runs that have been saved where the input list of models was not saved, leaving the length of n unknown. I don't know of any way of using something like length() or lengths() to count how many nested elements exist within a list. (For my example, output[[1]] is of length 1, no matter how many [[1]] repeat elements there are.)
I believe the simplest solution is to change the way the list is saved by the function, so that I can access it by a systematic reference, however, I have a bunch of old lists which I still want to access and perform some work with, and I'd also like to be able to have better control of working with lists in any case. So any help would be greatly appreciated.
I expected there would be some way to query the structure of nested R lists, which could be used to pass nested elements to separate functions, without having to use very long repetition of brackets.

Related

Object modification only happens in list

I have put objects that I would like to edit in a list.
Say, the names of the objects are kind of like this:
name1_X
name1_Y
name2_X
name2_Y
And there are different sets of these objects, that are stored in different lists, so for each different set, they would have a slightly different name, like:
name1_P_X
name1_F1_X
name2_F2_Y
and so on..
So for every "name" there are six objects. There are two each ending with X or Y for P, F1, F2. We have three lists (listbF_P, listbF_F1, listbF_F2), each containing objects that end with X and Y.
I edited the objects in the list like this (example for only one list):
for (i in 1:NROW(listbF_P)){
listbF_P[[i]]#first.year <- 1986
listbF_P[[i]]#last.year <- 2005
listbF_P[[i]]#year.aggregate.method <- "mean"
listbF_P[[i]]#id <- makeFieldID(listbF_P[[i]])
}
When I check whether the changes were successfully applied, it works but only when referring to the objects inside the list but not the same objects "unlisted".
So if I call
listbF_P[[1]]#last.year
it returns
"2005"
But if I call
name1_X#last.year
it returns
"Inf"
The problem with this is that I want the edited objects in a different list later.
So I need either a way that the latter call example returns "2005" or a way that I can search for a certain object name pattern in multiple lists to put the ones that fit the pattern into another list.
This is because the example above was made with multiple lists (listbF_P, listbF_F1, listbF_F2) and these lists contain a pattern matching "X" and another matching "Y".
So basically I want to have two lists with edited objects, one matching pattern "X" and the other matching pattern "Y".
I would call the list matching the desired patterns like this:
listbF_ALL_X <- mget(ls(pattern=".*_X$"))
listbF_ALL_Y <- mget(ls(pattern=".*_Y$"))
The first list would hence contain all objects ending with "X", e.g.:
name1_P_X
name1_F1_X
name1_F2_X
name2_P_X
[...]
and I would like to have the ones that I edited in the loop earlier
..but when calling the objects out of that list
listbF_ALL_X[[1]]#last.year
again just returns
"Inf"
since it takes the objects out of the environment and not the list. But I want it to return the desired number that has been changed (e.g. "2005").
I hope my problem and the two possible ways of solving them are clear..
If something isn't, ask :)
Thanks for any input
Regards
In R, unlike in many other modern languages, (almost) all objects are logically copies of each other. You can’t have multiple names that are references to the same object (see below for caveats).
But even if this was supported, your design looks confusing. Rather than have lots of related objects with different names, put your objects into nested lists and classes that logically relate them. That is, rather than have objects with names name{1..10}_{P,F1,F2}_{X,Y}, you should have one list, name, in which you store nested lists or classes with named members P, F1, F2 which, in turn, are objects that have names X and Y. Then you could access an object by, say, name[1L]$P$X (or name[1L]#P#X, if you’re using S4 objects with slots).
Or you use a more data-oriented approach and flatten all these nested objects into a table with corresponding columns P, F1, F2, X and Y. Which solution is more appropriate depends on your exact use-case.
Now for the caveat: you can use reference semantics in R by using *environments8 instead of regular objects. When copying an environment, a reference to the same environment object is created. However, this semantic is usually confusing because it’s contrary to the expectation in R, so it should be used with care. The ‘R6’ package creates an object system with reference semantics based on environments. For many purposes where reference semantics are indispensable, ‘R6’ is the right answer.
I found another solution:
I went on by modifying this part:
listbF_ALL_X <- mget(ls(pattern=".*_X$"))
listbF_ALL_Y <- mget(ls(pattern=".*_Y$"))
To not call objects from the environment but by calling objects from each list:
listbF_ALL_X <- c(c(listbF_P, listbF_F1, listbF_F2)[grepl(".*_X$", names(c(listbF_P, listbF_F1, listbF_F2)))])
listbF_ALL_Y <- c(c(listbF_P, listbF_F1, listbF_F2)[grepl(".*_Y$", names(c(listbF_P, listbF_F1, listbF_F2)))])
It's not the prettiest way of doing it but it works and in my case it was the solution that required the least amount of change in my script.

Subsetting list containing multiple classes by same index/vector

I'm needing to subset a list which contains an array as well as a factor variable. Essentially if you imagine each component of the array is relative to a single individual which is then associated to a two factor variable (treatment).
list(array=array(rnorm(2,4,1),c(5,5,10)), treatment= rep(c(1,2),5))
Typically when sub-setting multiple components of the array from the first component of the list I would use something like
list$array[,,c(2,4,6)]
this would return the array components in location 2,4 and 6. However, for the factor component of the list this wouldn't work as subsetting is different, what you would need is this:
list$treatment[c(2,4,6)]
Need to subset a list with containing different classes (array and vector) by the same relative number.
You're treating your list of matrices as some kind of 3-dimensional object, but it's not.
Your list$matrices is of itself a list as well, which means you can index at as a list as well, it doesn't matter if it is a list of matrices, numerics, plot-objects, or whatever.
The data you provided as an example can just be indexed at one level, so list$matrices[c(2,4,6)] works fine.
And I don't really get your question about saving the indices in a numeric vector, what's to stop you from this code?
indices <- c(2,4,6)
mysubset <- list(list$matrices[indices], list$treatment[indices])
EDIT, adding new info for edited question:
I see you actually have an 3-D array now. Which is kind of weird, as there is no clear convention of what can be seen as "components". I mean, from your question I understand that list$array[,,n] refers to the n-th individual, but from a pure code-point of view there is no reason why something like list$array[n,,] couldn't refer to that.
Maybe you got the idea from other languages, but this is not really R-ish, your earlier example with a list of matrices made more sense to me. And I think the most logical would have been a data.frame with columns matrix and treatment (which is conceptually close to a list with a vector and a list of matrices, but it's clearer to others what you have).
But anyway, what is your desired output?
If it's just subsetting: with this structure, as there are no constraints on what could have been the content, you just have to tell R exactly what you want. There is no one operator that takes a subset of a vector and the 3rd index of an array at the same time. You're going to have to tell R that you want 3rd index to use for subsetting, and that you want to use the same index for subsetting a vector. Which is basically just the code you already have:
idx <- c(2,4,6)
output <- list(list$array[,,idx], list$treatment[idx])
The way that you use for subsetting multiple matrices actually gives an error since you are giving extra dimension although you already specify which sublist you are in. Hence in order to subset matrices for the given indices you can usemy_list[[1]][indices] or directly my_list$matrices[indices]. It is the same for the case treatement my_list[[2]][indices] or my_list$treatement[indices]

index by name or by position in list / vector, which is faster?

I am currently trying to optimise the speed of a physical model computation. The specificity of this model is that it uses hundreds of input parameters, all stored in a big named vector:
initialize = c("temperature"=100, "airpressure"=150, "friction"=0.46)
The model, while iterating hundreds of times, needs to access the parameters, possibly updates them, etc.:
compute(initialize['temperature'], initialize['airpressure'])
initialize['friction'] <- updateP(initialize['friction'])
This is the logic. However I wonder if this is really efficient to work like this. What happens behind an indexation by name, is it fast? Some ideas to change this logic:
define each parameter as an independent variable in the environment?
(but how to pass a a large number of them as argument of a function?
have a list of parameters instead of a named vector?
access each parameter by its index in the vector, like this:
compute(initialize[1], initialize[2])
If I go with this last solution, of course I will loose the readability of the code (which parameter is actually initialize[1]?). So a way to go could be to define their positions first:
temperature.pos <- 1
airpressure.pos <- 2
compute(initialize[temperature.pos], initialize[airpressure.pos])
Of course, why didn't I try this and tested the speed? Well, it would take me hours to transform every location of parameters call in the script, that's why I ask before doing it.
And maybe there is a even more clever solution?
Thanks

R approach for iterative querying

This is a question of a general approach in R, I'm trying to find a way into R language but the data types and loop approaches (apply, sapply, etc) are a bit unclear to me.
What is my target:
Query data from API with parameters from a config list with multiple parameters. Return the data as aggregated data.frame.
First I want to define a list of multiple vectors (colums)
site segment id
google.com Googleuser 123
bing.com Binguser 456
How to manage such a list of value groups (row by row)? data.frames are column focused, you cant write a data.frame row by row in an R script. So the only way I found to define this initial config table is a csv, which is really an approach I try to avoid, but I can't find a way to make it more elegant.
Now I want to query my data, lets say with this function:
query.data <- function(site, segment, id){
config <- define_request(site, segment, id)
result <- query_api(config)
return result
}
This will give me a data.frame as a result, this means every time I query data the same columns are used. So my result should be one big data.frame, not a list of similar data.frames.
Now sapply allows to use one parameter-list and multiple static parameters. The mapply works, but it will give me my data in some crazy output I cant handle or even understand exactly what it is.
In principle the list of data.frames is ok, the data is correct, but it feels cumbersome to me.
What core concepts of R I did not understand yet? What would be the approach?
If you have a lapply/sapply solution that is returning a list of dataframes with identical columns, you can easily get a single large dataframe with do.call(). do.call() inputs each item of a list as arguments into another function, allowing you to do things such as
big.df <- do.call(rbind, list.of.dfs)
Which would append the component dataframes into a single large dataframe.
In general do.call(rbind,something) is a good trick to keep in your back pocket when working with R, since often the most efficient way to do something will be some kind of apply function that leaves you with a list of elements when you really want a single matrix/vector/dataframe/etc.

How to make loops in R that operate on and return multiple objects

This is my first post, and I think I have looked thoroughly for my answer with no luck, but I might not be typing in the right search terms, since I am relatively new to R. I apologize if this has been answered before and if it has a link would be greatly appreciated.
In essence, I am trying to make a loop that will operate on a set of data frames that I have read into R from .txt files using read.table. I am working with simulated vegetation data organized into many species by site matrices, so it would be best for me if I could create loops that will just operate on the objects I have read in using some functions I have made and then put out new objects into my workspace with a specific naming pattern (e.g. put "_av" on the end of the name of the object operated on when creating a new object).
for convenience sake, lets say I have only four matrices I want to work with, all which contain the phrase "mod" for model. I have read that I can put these data frames into a list of data frames by the following code:
list.mods=lapply(ls(pattern="mod"),get)
This does create a list which I have been having trouble on getting my functions to actually operate on. From what I read this is the best way to make a list of objects you want to operate on.
So lets say that list.mods is now my list of operable matrices - mod1, mod2, mod3, and mod4. Also, lets say I have a function that simply calculates Bray-Curtis dissimilarity as follows:
bc=function(x){
vegdist(x,method="bray")
}
I can use this by typing in:
mod1.bc=bc(mod1)
That works. But it seems like I should be able to apply my list of models to the function bc and have it output the models with a pattern mod1.bc, mod2.bc, mod3.bc, and mod4.bc. I cannot get my list of files to work in the function much less save each operation as a new object with a patterned name.
What am I doing wrong? In the end I might have as many as a hundred models or more and would really appreciate being able to create a list of items that I can run through loops.
Thanks in advance.
You can use lapply again:
new.list.mods <- lapply(list.mods, bc)
This will return a new list in which each element is the result of applying bc to the corresponding element of list.mods.
The 'apply' family of functions in R basically allows you to save typing. If that's easier for you to understand, you can use a 'for loop' instead. Of course you will need to know how to access elements in a list for that. There is a question about that.
How about collecting the names of the models/objects you want into a list:
mod_list <- sapply(ls(pattern = "mod"), as.name)
and then looping over them with your function:
output_list <- lapply(eval(mod_list), bc)
With this approach you avoid creating the potentially large and redundant list.mods object in your example. Also, I think this will result in conveniently named lists.

Resources