What is the (precise) meaning of "matrix-like?" - r

I would like to create "matrix-like" objects that are not necessarily "proper" matrices. But what, precisely, does "matrix-like" mean?
Example 1
> image(1:9)
Error in image.default(1:9) : argument must be matrix-like
Example 2
In the R Language Definition (in v3.3.1, §3.4.3) it is an hapax legomenon (emphasis added):
[An] example of a class method for [ is… if two indices are supplied (even if one is empty) it creates matrix-like indexing…
Example 3
The title of help(scale) reads, "Scaling and Centering of Matrix-like Objects" (emphasis added). There seems to be a clue there:
numeric-alike means that as.numeric(.) will be applied successfully if is.numeric(.) is not true.

matrix-like data is data in tabular form, with the dim attribute set. But length(dim(obj)) must be equal to 2, matrices are 2-dim objects.
Quoting from Advanced R by Hadley Wickham:
Matrices and arrays
Adding a dim attribute to an atomic vector allows it to behave like a
multi-dimensional array. A special case of the array is the matrix,
which has two dimensions. Matrices are used commonly as part of the
mathematical machinery of statistics. Arrays are much rarer, but worth
being aware of.
Matrices and arrays are created with matrix() and array(), or by using
the assignment form of dim()
See also the help("dim") page.
Example:
x <- 1:9
image(x) # error
y <- 1:9
dim(y) <- c(3, 3)
image(y)

Related

mpfr'izing a data.frame in R

I'm trying to convert a data.frame in R to mpfr format by multiplying by an mpfr unit constant. This works, as demonstrated in the code below, when applied to a column (result variable 'mpfr_col'), but for both approaches shown for working with a data.frame, it does not. The relevant errors for each attempt are listed in comment.
library(Rmpfr)
prec <- 256
m1 <- mpfr(1,prec)
col_build <- 1:10
test_df <- data.frame(col_build, col_build, col_build)
mpfr_col <- m1*(col_build)
mpfr_df <- m1*test_df # (list) object cannot be coerced to type 'double'
for(colnum in 1:length(colnames(test_df))){
test_df[,colnum] <- m1*test_df[,colnum] # attempt to replicate an object of type 'S4'
}
Answer:
Use [[colnum]] to access the columns instead of [,colnum]:
for(colnum in length(colnames(test_df))){
test_df[[colnum]] <- m1*test_df[[colnum]]
}
(Note: the print method of data.frame will fail, but the 'mpfr-izing' work. You can print it either by printing the columns individually or using as_tibble(test_df).
Explanation
The original fails because the [,colnum] assignment doesn't coerce the argument, I think. Using [[ returns an element (aka a column) of the list (aka the data.frame).
See this bit of Hadley Wickham's Advanced R book:
[ selects sub-lists. It always returns a list; if you use it with a
single positive integer, it returns a list of length one. [[ selects
an element within a list. $ is a convenient shorthand: x$y is
equivalent to x[["y"]].
And the help from Extract.data.frame {base}:
When [ and [[ are used to add or replace a whole column, no coercion
takes place but value will be replicated (by calling the generic
function rep) to the right length if an exact number of repeats can be
used.

What's the real meaning about 'Everything that exists is an object' in R?

I saw:
“To understand computations in R, two slogans are helpful:
• Everything that exists is an object.
• Everything that happens is a function call."
— John Chambers
But I just found:
a <- 2
is.object(a)
# FALSE
Actually, if a variable is a pure base type, it's result is.object() would be FALSE. So it should not be an object.
So what's the real meaning about 'Everything that exists is an object' in R?
The function is.object seems only to look if the object has a "class" attribute. So it has not the same meaning as in the slogan.
For instance:
x <- 1
attributes(x) # it does not have a class attribute
NULL
is.object(x)
[1] FALSE
class(x) <- "my_class"
attributes(x) # now it has a class attribute
$class
[1] "my_class"
is.object(x)
[1] TRUE
Now, trying to answer your real question, about the slogan, this is how I would put it. Everything that exists in R is an object in the sense that it is a kind of data structure that can be manipulated. I think this is better understood with functions and expressions, which are not usually thought as data.
Taking a quote from Chambers (2008):
The central computation in R is a function call, defined by the
function object itself and the objects that are supplied as the
arguments. In the functional programming model, the result is defined
by another object, the value of the call. Hence the traditional motto
of the S language: everything is an object—the arguments, the value,
and in fact the function and the call itself: All of these are defined
as objects. Think of objects as collections of data of all kinds. The data contained and the way the data is organized depend on the class from which the object was generated.
Take this expression for example mean(rnorm(100), trim = 0.9). Until it is is evaluated, it is an object very much like any other. So you can change its elements just like you would do it with a list. For instance:
call <- substitute(mean(rnorm(100), trim = 0.9))
call[[2]] <- substitute(rt(100,2 ))
call
mean(rt(100, 2), trim = 0.9)
Or take a function, like rnorm:
rnorm
function (n, mean = 0, sd = 1)
.Call(C_rnorm, n, mean, sd)
<environment: namespace:stats>
You can change its default arguments just like a simple object, like a list, too:
formals(rnorm)[2] <- 100
rnorm
function (n, mean = 100, sd = 1)
.Call(C_rnorm, n, mean, sd)
<environment: namespace:stats>
Taking one more time from Chambers (2008):
The key concept is that expressions for evaluation are themselves
objects; in the traditional motto of the S language, everything is an
object. Evaluation consists of taking the object representing an
expression and returning the object that is the value of that
expression.
So going back to our call example, the call is an object which represents another object. When evaluated, it becomes that other object, which in this case is the numeric vector with one number: -0.008138572.
set.seed(1)
eval(call)
[1] -0.008138572
And that would take us to the second slogan, which you did not mention, but usually comes together with the first one: "Everything that happens is a function call".
Taking again from Chambers (2008), he actually qualifies this statement a little bit:
Nearly everything that happens in R results from a function call.
Therefore, basic programming centers on creating and refining
functions.
So what that means is that almost every transformation of data that happens in R is a function call. Even a simple thing, like a parenthesis, is a function in R.
So taking the parenthesis like an example, you can actually redefine it to do things like this:
`(` <- function(x) x + 1
(1)
[1] 2
Which is not a good idea but illustrates the point. So I guess this is how I would sum it up: Everything that exists in R is an object because they are data which can be manipulated. And (almost) everything that happens is a function call, which is an evaluation of this object which gives you another object.
I love that quote.
In another (as of now unpublished) write-up, the author continues with
R has a uniform internal structure for representing all objects. The evaluation process keys off that structure, in a simple form that is essentially
composed of function calls, with objects as arguments and an object as the
value. Understanding the central role of objects and functions in R makes
use of the software more effective for any challenging application, even those where extending R is not the goal.
but then spends several hundred pages expanding on it. It will be a great read once finished.
Objects For x to be an object means that it has a class thus class(x) returns a class for every object. Even functions have a class as do environments and other objects one might not expect:
class(sin)
## [1] "function"
class(.GlobalEnv)
## [1] "environment"
I would not pay too much attention to is.object. is.object(x) has a slightly different meaning than what we are using here -- it returns TRUE if x has a class name internally stored along with its value. If the class is stored then class(x) returns the stored value and if not then class(x) will compute it from the type. From a conceptual perspective it matters not how the class is stored internally (stored or computed) -- what matters is that in both cases x is still an object and still has a class.
Functions That all computation occurs through functions refers to the fact that even things that you might not expect to be functions are actually functions. For example when we write:
{ 1; 2 }
## [1] 2
if (pi > 0) 2 else 3
## [1] 2
1+2
## [1] 3
we are actually making invocations of the {, if and + functions:
`{`(1, 2)
## [1] 2
`if`(pi > 0, 2, 3)
## [1] 2
`+`(1, 2)
## [1] 3

R: passing by parameter to function and using apply instead of nested loop and recursive indexing failed

I have two lists of lists. humanSplit and ratSplit. humanSplit has element of the form::
> humanSplit[1]
$Fetal_Brain_408_AGTCAA_L001_R1_report.txt
humanGene humanReplicate alignment RNAtype
66 DGKI Fetal_Brain_408_AGTCAA_L001_R1_report.txt 6 reg
68 ARFGEF2 Fetal_Brain_408_AGTCAA_L001_R1_report.txt 5 reg
If you type humanSplit[[1]], it gives the data without name $Fetal_Brain_408_AGTCAA_L001_R1_report.txt
RatSplit is also essentially similar to humanSplit with difference in column order. I want to apply fisher's test to every possible pairing of replicates from humanSplit and ratSplit. Now I defined the following empty vector which I will use to store the informations of my fisher's test
humanReplicate <- vector(mode = 'character', length = 0)
ratReplicate <- vector(mode = 'character', length = 0)
pvalue <- vector(mode = 'numeric', length = 0)
For fisher's test between two replicates of humanSplit and ratSplit, I define the following function. In the function I use `geneList' which is a data.frame made by reading a file and has form:
> head(geneList)
human rat
1 5S_rRNA 5S_rRNA
2 5S_rRNA 5S_rRNA
Now here is the main function, where I use a function getGenetype which I already defined in other part of the code. Also x and y are integers :
fishertest <-function(x,y) {
ratReplicateName <- names(ratSplit[x])
humanReplicateName <- names(humanSplit[y])
## merging above two based on the one-to-one gene mapping as in geneList
## defined above.
mergedHumanData <-merge(geneList,humanSplit[[y]], by.x = "human", by.y = "humanGene")
mergedRatData <- merge(geneList, ratSplit[[x]], by.x = "rat", by.y = "ratGene")
## [here i do other manipulation with using already defined function
## getGenetype that is defined outside of this function and make things
## necessary to define following contingency table]
contingencyTable <- matrix(c(HnRn,HnRy,HyRn,HyRy), nrow = 2)
fisherTest <- fisher.test(contingencyTable)
humanReplicate <- c(humanReplicate,humanReplicateName )
ratReplicate <- c(ratReplicate,ratReplicateName )
pvalue <- c(pvalue , fisherTest$p)
}
After doing all this I do the make matrix eg to use in apply. Here I am basically trying to do something similar to double for loop and then using fisher
eg <- expand.grid(i = 1:length(ratSplit),j = 1:length(humanSplit))
junk = apply(eg, 1, fishertest(eg$i,eg$j))
Now the problem is, when I try to run, it gives the following error when it tries to use function fishertest in apply
Error in humanSplit[[y]] : recursive indexing failed at level 3
Rstudio points out problem in following line:
mergedHumanData <-merge(geneList,humanSplit[[y]], by.x = "human", by.y = "humanGene")
Ultimately, I want to do the following:
result <- data.frame(humanReplicate,ratReplicate, pvalue ,alternative, Conf.int1, Conf.int2, oddratio)
I am struggling with these questions:
In defining fishertest function, how should I pass ratSplit and humanSplit and already defined function getGenetype?
And how I should use apply here?
Any help would be much appreciated.
Up front: read ?apply. Additionally, the first three hits on google when searching for "R apply tutorial" are helpful snippets: one, two, and three.
Errors in fishertest()
The error message itself has nothing to do with apply. The reason it got as far as it did is because the arguments you provided actually resolved. Try to do eg$i by itself, and you'll see that it is returning a vector: the corresponding column in the eg data.frame. You are passing this vector as an index in the i argument. The primary reason your function erred out is because double-bracket indexing ([[) only works with singles, not vectors of length greater than 1. This is a great example of where production/deployed functions would need type-checking to ensure that each argument is a numeric of length 1; often not required for quick code but would have caught this mistake. Had it not been for the [[ limit, your function may have returned incorrect results. (I've been bitten by that many times!)
BTW: your code is also incorrect in its scoped access to pvalue, et al. If you make your function return just the numbers you need and the aggregate it outside of the function, your life will simplify. (pvalue <- c(pvalue, ...) will find pvalue assigned outside the function but will not update it as you want. You are defeating one purpose of writing this into a function. When thinking about writing this function, try to answer only this question: "how do I compare a single rat record with a single human record?" Only after that works correctly and simply without having to overwrite variables in the parent environment should you try to answer the question "how do I apply this function to all pairs and aggregate it?" Try very hard to have your function not change anything outside of its own environment.
Errors in apply()
Had your function worked properly despite these errors, you would have received the following error from apply:
apply(eg, 1, fishertest(eg$i, eg$j))
## Error in match.fun(FUN) :
## 'fishertest(eg$i, eg$j)' is not a function, character or symbol
When you call apply in this sense, it it parsing the third argument and, in this example, evaluates it. Since it is simply a call to fishertest(eg$i, eg$j) which is intended to return a data.frame row (inferred from your previous question), it resolves to such, and apply then sees something akin to:
apply(eg, 1, data.frame(...))
Now that you see that apply is being handed a data.frame and not a function.
The third argument (FUN) needs to be a function itself that takes as its first argument a vector containing the elements of the row (1) or column (2) of the matrix/data.frame. As an example, consider the following contrived example:
eg <- data.frame(aa = 1:5, bb = 11:15)
apply(eg, 1, mean)
## [1] 6 7 8 9 10
# similar to your use, will not work; this error comes from mean not getting
# any arguments, your error above is because
apply(eg, 1, mean())
## Error in mean.default() : argument "x" is missing, with no default
Realize that mean is a function itself, not the return value from a function (there is more to it, but this definition works). Because we're iterating over the rows of eg (because of the 1), the first iteration takes the first row and calls mean(c(1, 11)), which returns 6. The equivalent of your code here is mean()(c(1, 11)) will fail for a couple of reasons: (1) because mean requires an argument and is not getting, and (2) regardless, it does not return a function itself (in a "functional programming" paradigm, easy in R but uncommon for most programmers).
In the example here, mean will accept a single argument which is typically a vector of numerics. In your case, your function fishertest requires two arguments (templated by my previous answer to your question), which does not work. You have two options here:
Change your fishertest function to accept a single vector as an argument and parse the index numbers from it. Bothing of the following options do this:
fishertest <- function(v) {
x <- v[1]
y <- v[2]
ratReplicateName <- names(ratSplit[x])
## ...
}
or
fishertest <- function(x, y) {
if (missing(y)) {
y <- x[2]
x <- x[1]
}
ratReplicateName <- names(ratSplit[x])
## ...
}
The second version allows you to continue using the manual form of fishertest(1, 57) while also allowing you to do apply(eg, 1, fishertest) verbatim. Very readable, IMHO. (Better error checking and reporting can be used here, I'm just providing a MWE.)
Write an anonymous function to take the vector and split it up appropriately. This anonymous function could look something like function(ii) fishertest(ii[1], ii[2]). This is typically how it is done for functions that either do not transform as easily as in #1 above, or for functions you cannot or do not want to modify. You can either assign this intermediary function to a variable (which makes it no longer anonymous, figure that) and pass that intermediary to apply, or just pass it directly to apply, ala:
.func <- function(ii) fishertest(ii[1], ii[2])
apply(eg, 1, .func)
## equivalently
apply(eg, 1, function(ii) fishertest(ii[1], ii[2]))
There are two reasons why many people opt to name the function: (1) if the function is used multiple times, better to define once and reuse; (2) it makes the apply line easier to read than if it contained a complex multi-line function definition.
As a side note, there are some gotchas with using apply and family that, if you don't understand, will be confusing. Not the least of which is that when your function returns vectors, the matrix returned from apply will need to be transposed (with t()), after which you'll still need to rbind or otherwise aggregrate.
This is one area where using ddply may provide a more readable solution. There are several tutorials showing it off. For a quick intro, read this; for a more in depth discussion on the bigger picture in which ddply plays a part, read Hadley's Split, Apply, Combine Strategy for Data Analysis paper from JSS.

vectorize a bidimensional function in R

I have a some true and predicted labels
truth <- factor(c("+","+","-","+","+","-","-","-","-","-"))
pred <- factor(c("+","+","-","-","+","+","-","-","+","-"))
and I would like to build the confusion matrix.
I have a function that works on unary elements
f <- function(x,y){ sum(y==pred[truth == x])}
however, when I apply it to the outer product, to build the matrix, R seems unhappy.
outer(levels(truth), levels(truth), f)
Error in outer(levels(x), levels(x), f) :
dims [product 4] do not match the length of object [1]
What is the recommended strategy for this in R ?
I can always go through higher order stuff, but that seems clumsy.
I sometimes fail to understand where outer goes wrong, too. For this task I would have used the table function:
> table(truth,pred) # arguably a lot less clumsy than your effort.
pred
truth - +
- 4 2
+ 1 3
In this case, you are test whether a multivalued vector is "==" to a scalar.
outer assumes that the function passed to FUN can take vector arguments and work properly with them. If m and n are the lengths of the two vectors passed to outer, it will first create two vectors of length m*n such that every combination of inputs occurs, and pass these as the two new vectors to FUN. To this, outer expects, that FUN will return another vector of length m*n
The function described in your example doesn't really do this. In fact, it doesn't handle vectors correctly at all.
One way is to define another function that can handle vector inputs properly, or alternatively, if your program actually requires a simple matching, you could use table() as in #DWin 's answer
If you're redefining your function, outer is expecting a function that will be run for inputs:
f(c("+","+","-","-"), c("+","-","+","-"))
and per your example, ought to return,
c(3,1,2,4)
There is also the small matter of decoding the actual meaning of the error:
Again, if m and n are the lengths of the two vectors passed to outer, it will first create a vector of length m*n, and then reshapes it using (basically)
dim(output) = c(m,n)
This is the line that gives an error, because outer is trying to shape the output into a 2x2 matrix (total 2*2 = 4 items) while the function f, assuming no vectorization, has given only 1 output. Hence,
Error in outer(levels(x), levels(x), f) :
dims [product 4] do not match the length of object [1]

How to correctly use lists?

Brief background: Many (most?) contemporary programming languages in widespread use have at least a handful of ADTs [abstract data types] in common, in particular,
string (a sequence comprised of characters)
list (an ordered collection of values), and
map-based type (an unordered array that maps keys to values)
In the R programming language, the first two are implemented as character and vector, respectively.
When I began learning R, two things were obvious almost from the start: list is the most important data type in R (because it is the parent class for the R data.frame), and second, I just couldn't understand how they worked, at least not well enough to use them correctly in my code.
For one thing, it seemed to me that R's list data type was a straightforward implementation of the map ADT (dictionary in Python, NSMutableDictionary in Objective C, hash in Perl and Ruby, object literal in Javascript, and so forth).
For instance, you create them just like you would a Python dictionary, by passing key-value pairs to a constructor (which in Python is dict not list):
x = list("ev1"=10, "ev2"=15, "rv"="Group 1")
And you access the items of an R List just like you would those of a Python dictionary, e.g., x['ev1']. Likewise, you can retrieve just the 'keys' or just the 'values' by:
names(x) # fetch just the 'keys' of an R list
# [1] "ev1" "ev2" "rv"
unlist(x) # fetch just the 'values' of an R list
# ev1 ev2 rv
# "10" "15" "Group 1"
x = list("a"=6, "b"=9, "c"=3)
sum(unlist(x))
# [1] 18
but R lists are also unlike other map-type ADTs (from among the languages I've learned anyway). My guess is that this is a consequence of the initial spec for S, i.e., an intention to design a data/statistics DSL [domain-specific language] from the ground-up.
three significant differences between R lists and mapping types in other languages in widespread use (e.g,. Python, Perl, JavaScript):
first, lists in R are an ordered collection, just like vectors, even though the values are keyed (ie, the keys can be any hashable value not just sequential integers). Nearly always, the mapping data type in other languages is unordered.
second, lists can be returned from functions even though you never passed in a list when you called the function, and even though the function that returned the list doesn't contain an (explicit) list constructor (Of course, you can deal with this in practice by wrapping the returned result in a call to unlist):
x = strsplit(LETTERS[1:10], "") # passing in an object of type 'character'
class(x) # returns 'list', not a vector of length 2
# [1] list
A third peculiar feature of R's lists: it doesn't seem that they can be members of another ADT, and if you try to do that then the primary container is coerced to a list. E.g.,
x = c(0.5, 0.8, 0.23, list(0.5, 0.2, 0.9), recursive=TRUE)
class(x)
# [1] list
my intention here is not to criticize the language or how it is documented; likewise, I'm not suggesting there is anything wrong with the list data structure or how it behaves. All I'm after is to correct is my understanding of how they work so I can correctly use them in my code.
Here are the sorts of things I'd like to better understand:
What are the rules which determine when a function call will return a list (e.g., strsplit expression recited above)?
If I don't explicitly assign names to a list (e.g., list(10,20,30,40)) are the default names just sequential integers beginning with 1? (I assume, but I am far from certain that the answer is yes, otherwise we wouldn't be able to coerce this type of list to a vector w/ a call to unlist.)
Why do these two different operators, [], and [[]], return the same result?
x = list(1, 2, 3, 4)
both expressions return "1":
x[1]
x[[1]]
why do these two expressions not return the same result?
x = list(1, 2, 3, 4)
x2 = list(1:4)
Please don't point me to the R Documentation (?list, R-intro)--I have read it carefully and it does not help me answer the type of questions I recited just above.
(lastly, I recently learned of and began using an R Package (available on CRAN) called hash which implements conventional map-type behavior via an S4 class; I can certainly recommend this Package.)
Just to address the last part of your question, since that really points out the difference between a list and vector in R:
Why do these two expressions not return the same result?
x = list(1, 2, 3, 4); x2 = list(1:4)
A list can contain any other class as each element. So you can have a list where the first element is a character vector, the second is a data frame, etc. In this case, you have created two different lists. x has four vectors, each of length 1. x2 has 1 vector of length 4:
> length(x[[1]])
[1] 1
> length(x2[[1]])
[1] 4
So these are completely different lists.
R lists are very much like a hash map data structure in that each index value can be associated with any object. Here's a simple example of a list that contains 3 different classes (including a function):
> complicated.list <- list("a"=1:4, "b"=1:3, "c"=matrix(1:4, nrow=2), "d"=search)
> lapply(complicated.list, class)
$a
[1] "integer"
$b
[1] "integer"
$c
[1] "matrix"
$d
[1] "function"
Given that the last element is the search function, I can call it like so:
> complicated.list[["d"]]()
[1] ".GlobalEnv" ...
As a final comment on this: it should be noted that a data.frame is really a list (from the data.frame documentation):
A data frame is a list of variables of the same number of rows with unique row names, given class ‘"data.frame"’
That's why columns in a data.frame can have different data types, while columns in a matrix cannot. As an example, here I try to create a matrix with numbers and characters:
> a <- 1:4
> class(a)
[1] "integer"
> b <- c("a","b","c","d")
> d <- cbind(a, b)
> d
a b
[1,] "1" "a"
[2,] "2" "b"
[3,] "3" "c"
[4,] "4" "d"
> class(d[,1])
[1] "character"
Note how I cannot change the data type in the first column to numeric because the second column has characters:
> d[,1] <- as.numeric(d[,1])
> class(d[,1])
[1] "character"
Regarding your questions, let me address them in order and give some examples:
1) A list is returned if and when the return statement adds one. Consider
R> retList <- function() return(list(1,2,3,4)); class(retList())
[1] "list"
R> notList <- function() return(c(1,2,3,4)); class(notList())
[1] "numeric"
R>
2) Names are simply not set:
R> retList <- function() return(list(1,2,3,4)); names(retList())
NULL
R>
3) They do not return the same thing. Your example gives
R> x <- list(1,2,3,4)
R> x[1]
[[1]]
[1] 1
R> x[[1]]
[1] 1
where x[1] returns the first element of x -- which is the same as x. Every scalar is a vector of length one. On the other hand x[[1]] returns the first element of the list.
4) Lastly, the two are different between they create, respectively, a list containing four scalars and a list with a single element (that happens to be a vector of four elements).
Just to take a subset of your questions:
This article on indexing addresses the question of the difference between [] and [[]].
In short [[]] selects a single item from a list and [] returns a list of the selected items. In your example, x = list(1, 2, 3, 4)' item 1 is a single integer but x[[1]] returns a single 1 and x[1] returns a list with only one value.
> x = list(1, 2, 3, 4)
> x[1]
[[1]]
[1] 1
> x[[1]]
[1] 1
One reason lists work as they do (ordered) is to address the need for an ordered container that can contain any type at any node, which vectors do not do. Lists are re-used for a variety of purposes in R, including forming the base of a data.frame, which is a list of vectors of arbitrary type (but the same length).
Why do these two expressions not return the same result?
x = list(1, 2, 3, 4); x2 = list(1:4)
To add to #Shane's answer, if you wanted to get the same result, try:
x3 = as.list(1:4)
Which coerces the vector 1:4 into a list.
Just to add one more point to this:
R does have a data structure equivalent to the Python dict in the hash package. You can read about it in this blog post from the Open Data Group. Here's a simple example:
> library(hash)
> h <- hash( keys=c('foo','bar','baz'), values=1:3 )
> h[c('foo','bar')]
<hash> containing 2 key-value pairs.
bar : 2
foo : 1
In terms of usability, the hash class is very similar to a list. But the performance is better for large datasets.
You say:
For another, lists can be returned
from functions even though you never
passed in a List when you called the
function, and even though the function
doesn't contain a List constructor,
e.g.,
x = strsplit(LETTERS[1:10], "") # passing in an object of type 'character'
class(x)
# => 'list'
And I guess you suggest that this is a problem(?). I'm here to tell you why it's not a problem :-). Your example is a bit simple, in that when you do the string-split, you have a list with elements that are 1 element long, so you know that x[[1]] is the same as unlist(x)[1]. But what if the result of strsplit returned results of different length in each bin. Simply returning a vector (vs. a list) won't do at all.
For instance:
stuff <- c("You, me, and dupree", "You me, and dupree",
"He ran away, but not very far, and not very fast")
x <- strsplit(stuff, ",")
xx <- unlist(strsplit(stuff, ","))
In the first case (x : which returns a list), you can tell what the 2nd "part" of the 3rd string was, eg: x[[3]][2]. How could you do the same using xx now that the results have been "unraveled" (unlist-ed)?
This is a very old question, but I think that a new answer might add some value since, in my opinion, no one directly addressed some of the concerns in the OP.
Despite what the accepted answer suggests, list objects in R are not hash maps. If you want to make a parallel with python, list are more like, you guess, python lists (or tuples actually).
It's better to describe how most R objects are stored internally (the C type of an R object is SEXP). They are made basically of three parts:
an header, which declares the R type of the object, the length and some other meta data;
the data part, which is a standard C heap-allocated array (contiguous block of memory);
the attributes, which are a named linked list of pointers to other R objects (or NULL if the object doesn't have attributes).
From an internal point of view, there is little difference between a list and a numeric vector for instance. The values they store are just different. Let's break two objects into the paradigm we described before:
x <- runif(10)
y <- list(runif(10), runif(3))
For x:
The header will say that the type is numeric (REALSXP in the C-side), the length is 10 and other stuff.
The data part will be an array containing 10 double values.
The attributes are NULL, since the object doesn't have any.
For y:
The header will say that the type is list (VECSXP in the C-side), the length is 2 and other stuff.
The data part will be an array containing 2 pointers to two SEXP types, pointing to the value obtained by runif(10) and runif(3) respectively.
The attributes are NULL, as for x.
So the only difference between a numeric vector and a list is that the numeric data part is made of double values, while for the list the data part is an array of pointers to other R objects.
What happens with names? Well, names are just some of the attributes you can assign to an object. Let's see the object below:
z <- list(a=1:3, b=LETTERS)
The header will say that the type is list (VECSXP in the C-side), the length is 2 and other stuff.
The data part will be an array containing 2 pointers to two SEXP types, pointing to the value obtained by 1:3 and LETTERS respectively.
The attributes are now present and are a names component which is a character R object with value c("a","b").
From the R level, you can retrieve the attributes of an object with the attributes function.
The key-value typical of an hash map in R is just an illusion. When you say:
z[["a"]]
this is what happens:
the [[ subset function is called;
the argument of the function ("a") is of type character, so the method is instructed to search such value from the names attribute (if present) of the object z;
if the names attribute isn't there, NULL is returned;
if present, the "a" value is searched in it. If "a" is not a name of the object, NULL is returned;
if present, the position of the first occurence is determined (1 in the example). So the first element of the list is returned, i.e. the equivalent of z[[1]].
The key-value search is rather indirect and is always positional. Also, useful to keep in mind:
in hash maps the only limit a key must have is that it must be hashable. names in R must be strings (character vectors);
in hash maps you cannot have two identical keys. In R, you can assign names to an object with repeated values. For instance:
names(y) <- c("same", "same")
is perfectly valid in R. When you try y[["same"]] the first value is retrieved. You should know why at this point.
In conclusion, the ability to give arbitrary attributes to an object gives you the appearance of something different from an external point of view. But R lists are not hash maps in any way.
x = list(1, 2, 3, 4)
x2 = list(1:4)
all.equal(x,x2)
is not the same because 1:4 is the same as c(1,2,3,4).
If you want them to be the same then:
x = list(c(1,2,3,4))
x2 = list(1:4)
all.equal(x,x2)
Although this is a pretty old question I must say it is touching exactly the knowledge I was missing during my first steps in R - i.e. how to express data in my hand as an object in R or how to select from existing objects. It is not easy for an R novice to think "in an R box" from the very beginning.
So I myself started to use crutches below which helped me a lot to find out what object to use for what data, and basically to imagine real-world usage.
Though I not giving exact answers to the question the short text below might help the reader who just started with R and is asking similar questions.
Atomic vector ... I called that "sequence" for myself, no direction, just sequence of same types. [ subsets.
Vector ... the sequence with one direction from 2D, [ subsets.
Matrix ... bunch of vectors with the same length forming rows or columns, [ subsets by rows and columns, or by sequence.
Arrays ... layered matrices forming 3D
Dataframe ... a 2D table like in excel, where I can sort, add or remove rows or columns or make arit. operations with them, only after some time I truly recognized that data frame is a clever implementation of list where I can subset using [ by rows and columns, but even using [[.
List ... to help myself I thought about the list as of tree structure where [i] selects and returns whole branches and [[i]] returns item from the branch. And because it is tree like structure, you can even use an index sequence to address every single leaf on a very complex list using its [[index_vector]]. Lists can be simple or very complex and can mix together various types of objects into one.
So for lists you can end up with more ways how to select a leaf depending on situation like in the following example.
l <- list("aaa",5,list(1:3),LETTERS[1:4],matrix(1:9,3,3))
l[[c(5,4)]] # selects 4 from matrix using [[index_vector]] in list
l[[5]][4] # selects 4 from matrix using sequential index in matrix
l[[5]][1,2] # selects 4 from matrix using row and column in matrix
This way of thinking helped me a lot.
Regarding vectors and the hash/array concept from other languages:
Vectors are the atoms of R. Eg, rpois(1e4,5) (5 random numbers), numeric(55) (length-55 zero vector over doubles), and character(12) (12 empty strings), are all "basic".
Either lists or vectors can have names.
> n = numeric(10)
> n
[1] 0 0 0 0 0 0 0 0 0 0
> names(n)
NULL
> names(n) = LETTERS[1:10]
> n
A B C D E F G H I J
0 0 0 0 0 0 0 0 0 0
Vectors require everything to be the same data type. Watch this:
> i = integer(5)
> v = c(n,i)
> v
A B C D E F G H I J
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> class(v)
[1] "numeric"
> i = complex(5)
> v = c(n,i)
> class(v)
[1] "complex"
> v
A B C D E F G H I J
0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i
Lists can contain varying data types, as seen in other answers and the OP's question itself.
I've seen languages (ruby, javascript) in which "arrays" may contain variable datatypes, but for example in C++ "arrays" must be all the same datatype. I believe this is a speed/efficiency thing: if you have a numeric(1e6) you know its size and the location of every element a priori; if the thing might contain "Flying Purple People Eaters" in some unknown slice, then you have to actually parse stuff to know basic facts about it.
Certain standard R operations also make more sense when the type is guaranteed. For example cumsum(1:9) makes sense whereas cumsum(list(1,2,3,4,5,'a',6,7,8,9)) does not, without the type being guaranteed to be double.
As to your second question:
Lists can be returned from functions even though you never passed in a List when you called the function
Functions return different data types than they're input all the time. plot returns a plot even though it doesn't take a plot as an input. Arg returns a numeric even though it accepted a complex. Etc.
(And as for strsplit: the source code is here.)
If it helps, I tend to conceive "lists" in R as "records" in other pre-OO languages:
they do not make any assumptions about an overarching type (or rather the type of all possible records of any arity and field names is available).
their fields can be anonymous (then you access them by strict definition order).
The name "record" would clash with the standard meaning of "records" (aka rows) in database parlance, and may be this is why their name suggested itself: as lists (of fields).
why do these two different operators, [ ], and [[ ]], return the same result?
x = list(1, 2, 3, 4)
[ ] provides sub setting operation. In general sub set of any object
will have the same type as the original object. Therefore, x[1]
provides a list. Similarly x[1:2] is a subset of original list,
therefore it is a list. Ex.
x[1:2]
[[1]] [1] 1
[[2]] [1] 2
[[ ]] is for extracting an element from the list. x[[1]] is valid
and extract the first element from the list. x[[1:2]] is not valid as [[ ]]
does not provide sub setting like [ ].
x[[2]] [1] 2
> x[[2:3]] Error in x[[2:3]] : subscript out of bounds
you can try something like,
set.seed(123)
l <- replicate(20, runif(sample(1:10,1)), simplify = FALSE)
out <- vector("list", length(l))
for (i in seq_along(l)) {
out[[i]] <- length(unique(l[[i]])) #length(l[[i]])
}
unlist(out)
unlist(lapply(l,length))
unlist(lapply(l, class))
unlist(lapply(l, mean))
unlist(lapply(l, max))

Resources