What's the difference between a list and a vector whose mode is list?

Title essentially says it all. I'm having trouble figuring out the difference between initializing a vector with vector(mode="list") and a list with list().
There are some minor differences in the signatures: list() can take value arguments or tag = value arguments, whereas vector() cannot.
And then there's the following quote from the list() documentation:
Almost all lists in R internally are Generic Vectors
So is there any actual difference besides the fact that lists can be initialized with tags and values?

I'd say they're the same:
identical(list(),vector(mode="list", length=0))
## [1] TRUE
(see also this question about the confusing fact that a list is a vector in R: usually when R users refer to "vectors", they actually mean atomic vectors ...)
In my experience the most common use case for vector(mode="list",...) is when you want to initialize a list with length>0. vector(mode="list",10) might be a little more expressive than replicate(10,NULL). If you want to create a length-0 list I can't see any reason to use vector() instead of list().
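To make the point above concrete, here is a minimal sketch in base R. The variable name x and the values stored in it are only illustrative.

```r
# An empty list built either way is the same object.
same_empty <- identical(list(), vector(mode = "list", length = 0))

# Preallocate a list of length 3; every slot starts out as NULL.
x <- vector(mode = "list", length = 3)

# A list is a (generic) vector, but not an atomic one.
checks <- c(is_list = is.list(x), is_vector = is.vector(x), is_atomic = is.atomic(x))

# Fill the preallocated slots in a loop.
for (i in seq_along(x)) {
  x[[i]] <- i^2
}
```

Preallocating this way avoids growing the list one element at a time inside the loop.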

Related

Extracting Nested Elements of an R List Generated by Loops

For lists within lists produced by a loop in R (in this example a list of caret models) I get an object with an unpredictable length and names for inner elements, such as list[[1]][[n repeats of 1]][[2]] where the internal [[1]] is repeated multiple times according to the function's input. In some cases, the length of n is not known, when accessing some older stored lists where input was not saved. While there are ways to work within a list index, like with list[length(list)], there appears to be no way to do this with repeated nested elements. This has made accessing them and passing them to various jobs awkward. I assume there is an efficient way to access them that I have missed, so I'm asking for help to do so, with an example case given below.
The function I'm writing creates several outputs and returns them in a list. For a function with a complicated output structure, the final list is produced by returning something like:
return(list(listOfModels, trainingData, testingData))
The listOfModels has variable length, depending on the models given as input and potentially on other conditions evaluated inside the function. It is made by:
listOfModels <- list(c(listOfModels, list(trainedModel)))
Here "trainedModel" refers to the most recently trained model generated in the loop. The models used and the number of them may vary each time depending on choice. An unfortunate result is a complicated set of nested lists within a list.
That is, output[[1]] contains the models I want to access more efficiently, which are themselves list objects, while output[[2]] and output[[3]] are the dataframes used to train and evaluate the models. While accessing the dataframes is simple and has a defined, reproducible structure each time (simply being output[[2]], output[[3]] every time), output[[1]] becomes a mess. E.g., something like the following follows the "output[[1]]":
The only thing I am able to attempt in order to access this is using the fact that [[1]] is attached upon output[[1]] before [[2]]. All of the nested elements except one have a [[2]] at the end. Given the above pattern, there is an ugly solution that works, but is not a desirable format to work with. E.g., after evaluating n models given by a vector of strings called inputList, and a list given as output of the function, "output", I can have [[1]] repeated tens to hundreds of times.
for (i in ((1:length(inputList)) - 1)) {
  # builds e.g. output[[1]][[1]][[2]] as text, then parses and evaluates it
  eval(rlang::parse_expr(paste0(c("output", rep("[[1]]", 1 + i), "[[2]]"), collapse = "")))
}
This could be used to run all models on some downstream task, such as making predictions on new data. In cases where the length of the inputList was not known, this could be found out by repeating this until it hits an error, or something similar. This approach can be modified to call on a specific part of the list, for example, a certain model within inputList, if I know the original list input and can find the number for that model. Besides the bulkiness of code written this way, compared with simply calling output[[1]][[n]] in some predictable format for any n, one of the big problems is accessing older runs that were saved without the input list of models, which leaves n unknown. I don't know of any way of using something like length() or lengths() to count how many nested elements exist within a list. (In my example, output[[1]] is of length 1, no matter how many repeated [[1]] elements there are.)
I believe the simplest solution is to change the way the list is saved by the function, so that I can access it by a systematic reference, however, I have a bunch of old lists which I still want to access and perform some work with, and I'd also like to be able to have better control of working with lists in any case. So any help would be greatly appreciated.
I expected there would be some way to query the structure of nested R lists, which could be used to pass nested elements to separate functions, without having to use very long repetition of brackets.
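One possible approach, sketched here under assumptions rather than tested on real caret output: walk the nested list recursively and collect every classed object, treating plain unclassed lists as wrappers. The helper names make_model and collect_models are hypothetical, and the toy models only stand in for trained models. The sketch assumes the real models carry a class attribute, so that is.object() can tell them apart from the wrapper lists.

```r
# Hypothetical stand-in for a trained model: a classed list.
make_model <- function(name) structure(list(name = name), class = "toy_model")

# Rebuild the nesting described in the question: each pass wraps the list again.
listOfModels <- list()
for (nm in c("rf", "glm", "svm")) {
  listOfModels <- list(c(listOfModels, list(make_model(nm))))
}

# Collect every classed object, descending only into plain (unclassed) lists.
collect_models <- function(x) {
  if (is.object(x)) return(list(x))   # a model: keep it whole
  if (!is.list(x)) return(list())     # neither model nor wrapper: skip
  do.call(c, c(list(list()), lapply(x, collect_models)))
}

models <- collect_models(listOfModels)
n_models <- length(models)
model_names <- vapply(models, function(m) m$name, character(1))
```

After flattening, models[[n]] gives a predictable way to reach the n-th model, and length(models) recovers the count even when the original input list was not saved.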

How can I access data in a nested R list?

I want to learn how to access data from a nested list in R. I am relatively new to the R programming language, so I am unsure how to proceed.
The data is a 'large list' (947 elements, 654.9 MB) and takes the form:
The numbers within the datalist refer to station numbers and when I click on one (in Rstudio) it looks like this:
I want to know how I can access the data within 'doy', for example. I have tried:
data[[1]]
which returns all the data for the first element of the list (site, location, doy,ltm etc). So clearly the number used within the square brackets is interpreted as an index for the list, as opposed to an identifier for the elements/station in the list.
Then I tried:
data$1
but it returned the error:
Error: unexpected numeric constant in "data$1"
Then I tried:
data[data$1==doy]
But was returned this:
Error: unexpected numeric constant in "data[data$1"
So at this point, I realise that it is not construing the number of the station as a category/factor within the list. It's just reading it as a number. So I thought I'd put some quotes around it to see if that changed what happened:
data[data$"1"=="doy"]
This returned
named list()
But when I looked at it in the environment, it was a list of 0.
I looked at some of the similar questions here on Stack Overflow (like: accessing nested lists in R) and tried:
data[data$"1"=="doy",][[1]]
But just got:
Error in data[data$"1" == "doy", ] : incorrect number of dimensions
How can I access this data? It reminds me of a structure in Matlab, but it doesn't seem to be indexed in a similar fashion in R.
Let's look at some ways to do what you want:
data[[1]]
This returns the first element of the list, which is itself a list. You can use the $ subsetting shorthand, but the name of the first element is nonstandard. R prefers names that start with letters and include only alphanumeric characters, periods and underscores. You can escape this behavior with backticks:
data$`1`
If you want to access one of the elements of list 1 in your list of lists, you need to subset further. To get to doy, which is the third element of 1, you can use any of these four ways:
data[[1]][[3]]
data$`1`[[3]]
data[[1]]$doy
data$`1`$doy
One way (in addition to what Ben Norris has shown):
our_list[[c("1", "doy")]]
Reproducible example data (please provide next time)
our_list <- list(`1` = list(site = "x", doy = 3))
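As a self-contained sketch of the options above: the toy list below is an assumption, since the real data was not posted. The station names "1" and "2" and the field values are made up for illustration.

```r
# Toy stand-in for the station list; names and values are illustrative only.
stations <- list(
  `1` = list(site = "x", location = "north", doy = 1:3),
  `2` = list(site = "y", location = "south", doy = 4:6)
)

by_position  <- stations[[1]][[3]]           # first element, then its third element
by_name      <- stations[["1"]][["doy"]]     # element named "1", then "doy"
by_backtick  <- stations$`1`$doy             # same, using $ with a backticked name
by_recursive <- stations[[c("1", "doy")]]    # recursive indexing in one call

# Pull doy out of every station at once.
all_doy <- lapply(stations, `[[`, "doy")
```

Note the difference between stations[[1]] (position 1) and stations[["1"]] (the element named "1"). They coincide here only because the first station happens to be named "1".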

Object modification only happens in list

I have put objects that I would like to edit in a list.
Say, the names of the objects are kind of like this:
name1_X
name1_Y
name2_X
name2_Y
And there are different sets of these objects, that are stored in different lists, so for each different set, they would have a slightly different name, like:
name1_P_X
name1_F1_X
name2_F2_Y
and so on..
So for every "name" there are six objects. There are two each ending with X or Y for P, F1, F2. We have three lists (listbF_P, listbF_F1, listbF_F2), each containing objects that end with X and Y.
I edited the objects in the list like this (example for only one list):
for (i in 1:NROW(listbF_P)){
  listbF_P[[i]]@first.year <- 1986
  listbF_P[[i]]@last.year <- 2005
  listbF_P[[i]]@year.aggregate.method <- "mean"
  listbF_P[[i]]@id <- makeFieldID(listbF_P[[i]])
}
When I check whether the changes were successfully applied, it works but only when referring to the objects inside the list but not the same objects "unlisted".
So if I call
listbF_P[[1]]@last.year
it returns
"2005"
But if I call
name1_X@last.year
it returns
"Inf"
The problem with this is that I want the edited objects in a different list later.
So I need either a way that the latter call example returns "2005" or a way that I can search for a certain object name pattern in multiple lists to put the ones that fit the pattern into another list.
This is because the example above was made with multiple lists (listbF_P, listbF_F1, listbF_F2) and these lists contain a pattern matching "X" and another matching "Y".
So basically I want to have two lists with edited objects, one matching pattern "X" and the other matching pattern "Y".
I would call the list matching the desired patterns like this:
listbF_ALL_X <- mget(ls(pattern=".*_X$"))
listbF_ALL_Y <- mget(ls(pattern=".*_Y$"))
The first list would hence contain all objects ending with "X", e.g.:
name1_P_X
name1_F1_X
name1_F2_X
name2_P_X
[...]
and I would like to have the ones that I edited in the loop earlier
..but when calling the objects out of that list
listbF_ALL_X[[1]]@last.year
again just returns
"Inf"
since it takes the objects out of the environment and not the list. But I want it to return the desired number that has been changed (e.g. "2005").
I hope my problem and the two possible ways of solving them are clear..
If something isn't, ask :)
Thanks for any input
Regards
In R, unlike in many other modern languages, (almost) all objects have value semantics: assigning an object to a new name, or putting it into a list, logically creates a copy. You can’t have multiple names that are references to the same object (see below for caveats).
But even if this were supported, your design looks confusing. Rather than have lots of related objects with different names, put your objects into nested lists and classes that logically relate them. That is, rather than have objects with names name{1..10}_{P,F1,F2}_{X,Y}, you should have one list, name, in which you store nested lists or classes with named members P, F1, F2 which, in turn, are objects that have names X and Y. Then you could access an object by, say, name[[1L]]$P$X (or name[[1L]]@P@X, if you’re using S4 objects with slots).
Or you use a more data-oriented approach and flatten all these nested objects into a table with corresponding columns P, F1, F2, X and Y. Which solution is more appropriate depends on your exact use-case.
Now for the caveat: you can use reference semantics in R by using environments instead of regular objects. When copying an environment, a reference to the same environment object is created. However, this semantic is usually confusing because it’s contrary to the expectation in R, so it should be used with care. The ‘R6’ package creates an object system with reference semantics based on environments. For many purposes where reference semantics are indispensable, ‘R6’ is the right answer.
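A small base-R sketch of the difference, using plain lists and an environment rather than the asker's S4 objects. The names obj, lst, env and lst_env are made up for illustration.

```r
# Value semantics: the list holds a copy, so editing it leaves the original alone.
obj <- list(last.year = Inf)
lst <- list(obj)
lst[[1]]$last.year <- 2005
original_after_edit <- obj$last.year      # unchanged
copy_after_edit <- lst[[1]]$last.year     # edited

# Reference semantics: an environment stored in a list is still the same environment.
env <- new.env()
env$last.year <- Inf
lst_env <- list(env)
lst_env[[1]]$last.year <- 2005
env_after_edit <- env$last.year           # the edit is visible through 'env' too
```

This is why the objects in the global environment still show Inf after the loop: the loop only changed the copies held in the list.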
I found another solution:
I went on by modifying this part:
listbF_ALL_X <- mget(ls(pattern=".*_X$"))
listbF_ALL_Y <- mget(ls(pattern=".*_Y$"))
To not call objects from the environment but by calling objects from each list:
listbF_ALL_X <- c(c(listbF_P, listbF_F1, listbF_F2)[grepl(".*_X$", names(c(listbF_P, listbF_F1, listbF_F2)))])
listbF_ALL_Y <- c(c(listbF_P, listbF_F1, listbF_F2)[grepl(".*_Y$", names(c(listbF_P, listbF_F1, listbF_F2)))])
It's not the prettiest way of doing it but it works and in my case it was the solution that required the least amount of change in my script.
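The same idea can be written a little more compactly by combining the lists once and then filtering by name. This is only a sketch with toy values standing in for the real objects, and it assumes, as the solution above does, that the list elements are named after the objects.

```r
# Toy stand-ins for the edited lists; the numbers replace the real objects.
listbF_P  <- list(name1_P_X = 1, name1_P_Y = 2)
listbF_F1 <- list(name1_F1_X = 3, name1_F1_Y = 4)
listbF_F2 <- list(name1_F2_X = 5, name1_F2_Y = 6)

# Combine once, then select by the suffix of each element's name.
all_objects  <- c(listbF_P, listbF_F1, listbF_F2)
listbF_ALL_X <- all_objects[grepl("_X$", names(all_objects))]
listbF_ALL_Y <- all_objects[grepl("_Y$", names(all_objects))]
```

Because the selection is made from the lists, the result contains the edited objects rather than the untouched ones in the environment.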

Subsetting list containing multiple classes by same index/vector

I need to subset a list which contains an array as well as a factor variable. Imagine that each component of the array belongs to a single individual, who is in turn associated with one level of a two-level factor (treatment).
list(array=array(rnorm(2,4,1),c(5,5,10)), treatment= rep(c(1,2),5))
Typically when sub-setting multiple components of the array from the first component of the list I would use something like
list$array[,,c(2,4,6)]
This would return the array components in locations 2, 4 and 6. However, this wouldn't work for the factor component of the list, because it is subset differently. What you would need is this:
list$treatment[c(2,4,6)]
I need to subset a list containing different classes (array and vector) by the same relative index.
You're treating your list of matrices as some kind of 3-dimensional object, but it's not.
Your list$matrices is itself a list, which means you can index it as a list as well. It doesn't matter whether it is a list of matrices, numerics, plot objects, or whatever.
The data you provided as an example can just be indexed at one level, so list$matrices[c(2,4,6)] works fine.
And I don't really get your question about saving the indices in a numeric vector, what's to stop you from this code?
indices <- c(2,4,6)
mysubset <- list(list$matrices[indices], list$treatment[indices])
EDIT, adding new info for edited question:
I see you actually have a 3-D array now. Which is kind of weird, as there is no clear convention of what can be seen as "components". I mean, from your question I understand that list$array[,,n] refers to the n-th individual, but from a pure code point of view there is no reason why something like list$array[n,,] couldn't refer to that.
Maybe you got the idea from other languages, but this is not really R-ish, your earlier example with a list of matrices made more sense to me. And I think the most logical would have been a data.frame with columns matrix and treatment (which is conceptually close to a list with a vector and a list of matrices, but it's clearer to others what you have).
But anyway, what is your desired output?
If it's just subsetting: with this structure, as there are no constraints on what could have been the content, you just have to tell R exactly what you want. There is no one operator that takes a subset of a vector and the 3rd index of an array at the same time. You're going to have to tell R that you want 3rd index to use for subsetting, and that you want to use the same index for subsetting a vector. Which is basically just the code you already have:
idx <- c(2,4,6)
output <- list(list$array[,,idx], list$treatment[idx])
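Spelled out as a runnable sketch: the object name dat is an assumption, used here to avoid masking the list() function, and the data is random. drop = FALSE is added so the result stays three-dimensional even if idx has length 1.

```r
set.seed(1)

# Toy data in the shape described in the question: 10 individuals, 5 x 5 values each.
dat <- list(
  array     = array(rnorm(250, mean = 4, sd = 1), dim = c(5, 5, 10)),
  treatment = factor(rep(c(1, 2), 5))
)

# One index vector, applied to the third array dimension and to the factor.
idx <- c(2, 4, 6)
output <- list(
  array     = dat$array[, , idx, drop = FALSE],
  treatment = dat$treatment[idx]
)

out_dim <- dim(output$array)
out_trt <- as.character(output$treatment)
```

If this is needed often, the two subsetting calls could be wrapped in a small function that takes dat and idx.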
The way you subset the matrices actually gives an error, because you supply extra dimensions even though you have already specified which sublist you are in. To subset the matrices for the given indices, you can use my_list[[1]][indices] or, directly, my_list$matrices[indices]. The same goes for the treatment: my_list[[2]][indices] or my_list$treatment[indices].

Why doesn't R throw an error when I use only the initial part of my column name in a data frame?

I have a data frame containing various columns along with sender_bank_flag. I ran the below two queries on my data frame.
sum(s_50k_sample$sender_bank_flag, na.rm=TRUE)
sum(s_50k_sample$sender_bank, na.rm=TRUE)
I got the same output from both the queries even though there is no such column as sender_bank in my data frame. I expected to get an error for the second code. Didn't know R has such a functionality! Does anyone know what exactly is this functionality & how can it be better utilized?
Probably worthwhile to augment all comments into an answer.
Both my comment and BenBolker's point to doc page ?Extract:
Under Recursive (list-like) objects:
Both "[[" and "$" select a single element of the list. The main difference is that "$" does not allow computed indices, whereas "[[" does. x$name is equivalent to x[["name", exact = FALSE]]. Also, the partial matching behavior of "[[" can be controlled using the exact argument.
Under Character indices:
Character indices can in some circumstances be partially matched (see ?pmatch) to the names or dimnames of the object being subsetted (but never for subassignment). Unlike S (Becker et al p. 358), R never uses partial matching when extracting by "[", and partial matching is not by default used by "[[" (see argument exact).
Thus the default behaviour is to use partial matching only when extracting from recursive objects (except environments) by "$". Even in that case, warnings can be switched on by options(warnPartialMatchDollar = TRUE).
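A short sketch of the behaviour described in the documentation excerpts above, using a made-up data frame df with the same column name as in the question.

```r
# Toy data frame; the values are illustrative only.
df <- data.frame(sender_bank_flag = c(1, 0, NA, 1))

total_full    <- sum(df$sender_bank_flag, na.rm = TRUE)
total_partial <- sum(df$sender_bank, na.rm = TRUE)    # "$" partially matches the name

exact_lookup <- df[["sender_bank"]]                   # "[[" is exact by default
loose_lookup <- df[["sender_bank", exact = FALSE]]    # opt in to partial matching

# Turn on the warning for partial matches with "$", then restore the old setting.
old_opts <- options(warnPartialMatchDollar = TRUE)
warned <- tryCatch({ df$sender_bank; FALSE }, warning = function(w) TRUE)
options(old_opts)
```

Using [[ with the full name, or switching on warnPartialMatchDollar, is a reasonable guard against silently picking up the wrong column.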
Note that the manual is rich in information, so make sure you digest it fully. I formatted the content, adding Stack Overflow threads behind where relevant.
The links provided in phiver's comment are worth reading in the long term.
