aaply fails on a vector - r

I am trying to understand how to use the excellent plyr package's commands on a vector (in my case, of strings). I suppose I'd want to use aaply, but it fails, asking for a margin. But there aren't columns or rows in my vector!
To be a bit more concrete, the following command works, but returns results in a weird list. states.df is a data frame, and region is the name of the state (returned using Hadley's map_data("state") command). Thus, states.df$region is a vector of strings (specifically, state names). opinion.new is a vector of numbers, named using state names.
states.df <- map_data("state")
ch = sapply(states.df$region, function (x) { opinion.new[names(opinion.new)==x] } )
What I'd like to do is:
ch = aaply(states.df$region, function (x) { opinion.new[names(opinion.new)==x] } )
Where ch is the vector of numbers looked up or pulled from opinion.new. But aaply requires an array, and fails on a vector.
Thanks!

If you want to use plyr on a vector, you have to use l*ply, as follows:
library(plyr)
v <- 1:10
sapply(v, function(x)x^2)
[1] 1 4 9 16 25 36 49 64 81 100
laply(v, function(x)x^2)
[1] 1 4 9 16 25 36 49 64 81 100
In other words, for a plain vector input like this, sapply and laply give the same result.
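For the lookup in the question itself, neither sapply nor laply may be needed, because a named vector can be indexed by name directly. The sketch below uses made-up stand-ins for opinion.new and states.df$region (the values and the `region` vector are assumptions for illustration, not the asker's data):

```r
# Hypothetical stand-ins for the question's objects
opinion.new <- c(alabama = 0.41, arizona = 0.52, texas = 0.47)
region <- c("alabama", "alabama", "texas", "arizona", "texas")

# Index the named vector by name; one value comes back per element of region
ch <- unname(opinion.new[region])

# The sapply version from the question, for comparison
ch_sapply <- unname(sapply(region, function(x) opinion.new[names(opinion.new) == x]))
```

With direct indexing, a region that has no entry in opinion.new comes back as NA, so the result stays a plain vector of the same length as region.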

Related

Difference between df["speed"] and df$speed

Suppose a data frame df has a column speed, then what is difference in the way accessing the column like so:
df["speed"]
or like so:
df$speed
The following calculates the mean value correctly:
lapply(df["speed"], mean)
But this prints all values under the column speed:
lapply(df$speed, mean)
There are two elements to the question in the OP. The first element was addressed in the comments: df["speed"] is an object of type data.frame() whereas df$speed is a numeric vector. We can see this via the str() function.
We'll illustrate this with Ezekiel's 1930 analysis of speed and stopping distance, the cars data set from the datasets package.
> library(datasets)
> data(cars)
>
> str(cars["speed"])
'data.frame': 50 obs. of 1 variable:
$ speed: num 4 4 7 7 8 9 10 10 10 11 ...
> str(cars$speed)
num [1:50] 4 4 7 7 8 9 10 10 10 11 ...
>
The second element that was not addressed in the comments is that lapply() behaves differently when passed a vector versus a list().
With a vector, lapply() processes each element in the vector independently, producing unexpected results for a function such as mean().
> unlist(lapply(cars$speed,mean))
[1] 4 4 7 7 8 9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15
[26] 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24 24 24 24 25
What happened?
Since each element of cars$speed is processed by mean() independently, lapply() returns a list of 50 means of 1 number each: the original elements in the cars$speed vector.
Processing a list with lapply()
With a list, each element of the list is processed independently. We can calculate how many items will be processed by lapply() with the length() function.
> length(cars["speed"])
[1] 1
>
Since a data frame is also a list(), and cars["speed"] contains exactly one element (the numeric vector speed), the length() function returns the value 1. Therefore, when processed by lapply(), a single mean is calculated over the whole column, not one per row of the speed column.
> lapply(cars["speed"],mean)
$speed
[1] 15.4
>
If we pass the entire cars data frame as the input object for lapply(), we obtain one mean per column in the data frame, since both variables in the data frame are numeric.
> lapply(cars,mean)
$speed
[1] 15.4
$dist
[1] 42.98
>
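One way to see the difference is to look at what each input looks like once it is treated as a list. This is a small sketch using the same cars data; the variable names are only for illustration:

```r
library(datasets)

# A numeric vector becomes one list element per value
n_from_vector <- length(as.list(cars$speed))

# A one-column data frame is a list with one element per column
n_from_frame <- length(as.list(cars["speed"]))

# So lapply() on the data frame calls mean() once, on the whole column
col_mean <- lapply(cars["speed"], mean)$speed
```

Here n_from_vector is the number of rows and n_from_frame is the number of columns, which is why the two lapply() calls return 50 results and 1 result respectively.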
A theoretical perspective
The differing behaviors of lapply() are explained by the fact that R is an object oriented language. In fact, John Chambers, creator of the S language on which R is based, once said:
In R, two slogans are helpful.
-- Everything that exists is an object, and
-- Everything that happens is a function call.
John Chambers, quoted in Advanced R, p. 79.
The fact that lapply() works differently on a data frame than a vector is an illustration of the object oriented feature of polymorphism where the same behavior is implemented in different ways for different types of objects.
While this looks like a beginner's question, I think it's worth answering since many beginners could have a similar question, and a guide to the corresponding documentation is helpful IMHO.
No up-votes please - I am just collecting the comment fragments from the question that contribute to the answer - feel free to edit this answer...
A data.frame is a list of vectors with the same length (number of elements). Please read the help in the R console (by typing ?data.frame)
The $ operator is implemented by returning one column as vector (?"$.data.frame")
lapply applies a function to each element of a list (see ?lapply). If the first param X is an atomic vector (integer, double...) with multiple elements, each element of the vector is treated as one separate list element (same as as.list(1:26))
Examples:
x <- data.frame(a = LETTERS, b = 1:26, stringsAsFactors = FALSE)
b.vector <- x$b
b.data.frame <- x["b"]
class(b.vector) # integer
class(b.data.frame) # data.frame
lapply(b.vector, mean)
# returns a result list with 26 list elements, the same as `lapply(1:26, mean)`
# [[1]]
# [1] 1
#
# [[2]]
# [1] 2
# ... up to list element 26
lapply(b.data.frame, mean)
# returns a list with one element per column of the data frame;
# the function is applied to the whole column vector
# $b
# [1] 13.5
So IMHO your original question can be reduced to: Why is lapply behaving differently if the first parameter is an atomic vector instead of a list?

Why do $ and [ on a data frame column give different output presentation and data types?

I'm new to R. Just learning via online tutorials. My question is:
1) Why does accessing the same column with different syntaxes have different output presentation?
Vertical Display:
> airquality["Ozone"]
Ozone
1 41
2 36
3 12
Horizontal Display:
airquality$Ozone
[1] 41 36 12 18 NA 28 23 19 8
[46] NA 21 37 20 12 13 NA NA NA
[91] 64 59 39 9 16 78 35 66 122
2) Why do the following have different data types?
> class(airquality["Ozone"])
[1] "data.frame"
> class(airquality$Ozone)
[1] "integer"
> class(airquality[["Ozone"]])
[1] "integer"
Same reason for both: airquality["Ozone"] returns a dataframe, whereas airquality$Ozone returns a vector. class() shows you their object types. str() is also good for succinctly showing you an object.
See the help on the '[' operator, which is also known as 'extracting', or the function getElement(). In R, you can call help() on a special character or operator, just surround it with quotes: ?'[' or ?'$' (In Python/C++/Java or most other languages we'd call this 'slicing').
As to why they print differently, print(obj) in R dispatches under-the-hood to an object-specific print method. In this case: print.data.frame, which prints the dataframe column(s) vertically, with row-indices, vs print.default for a vector, which just prints the vector contents horizontally, with only the index of the first element on each line.
Now back to extraction with the '[' vs '$' operators:
The most important distinction between ‘[’, ‘[[’ and ‘$’ is that the ‘[’ can select more than one element whereas the other two ’[[’ and ‘$’ select a single element.
There's also a '[[' extract syntax, which will do like '$' does in selecting a single element (vector):
airquality[["Ozone"]]
[1] 41 36 12 18
The difference between [["colname"]] and $colname is that in the former, the column-name can come from a variable, but in the latter, it must be typed literally. So [[varname]] would allow you to index different columns depending on the value of varname.
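A short sketch of that difference, using airquality (the variable name varname is just an example):

```r
library(datasets)

varname <- "Ozone"

# [[ evaluates varname and looks up the column it names
by_variable <- airquality[[varname]]

# $ takes the name literally, so this is the Ozone column...
by_name <- airquality$Ozone

# ...and this looks for a column literally called "varname", which does not exist
no_such_column <- airquality$varname
```

by_variable and by_name are the same integer vector, while no_such_column is NULL.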
Read the doc (?Extract and ?"[.data.frame") about the exact=TRUE and drop=TRUE options. Note that drop is ignored when a dataframe is indexed list-style with a single index, and R warns about it:
airquality["Ozone", drop=TRUE]
In `[.data.frame`(airquality, "Ozone", drop = TRUE) :
'drop' argument will be ignored
It's all kinda confusing, offputting at first, eccentrically different and quirkily non-self-explanatory. But once you learn the syntax, it makes sense. Until then, it feels like hitting your head off a wall of symbols.
Please take a very brief skim of R-intro and R-lang#Indexing HTML or in PDF. Bookmark them and come back to them regularly. Read them on the bus or plane...
PS as @Henry mentioned, strictly when accessing a dataframe, we should insert a comma to disambiguate that the column-names get applied to columns, not rows: airquality[, "Ozone"]. If we used numeric indices, airquality[,1] and airquality[1] both extract the Ozone column (as a vector and as a one-column dataframe respectively), whereas airquality[1,] extracts the first row. R is applying some cleverness since usually strings aren't row-indices.
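To make the comma forms concrete, here is a small sketch on airquality showing what each indexing style returns (variable names are only illustrative):

```r
library(datasets)

# Matrix-style indexing with a comma: a single column is dropped to a vector
as_vector <- airquality[, "Ozone"]

# drop = FALSE keeps the data frame structure in the two-index form
as_frame <- airquality[, "Ozone", drop = FALSE]

# List-style indexing with one index always returns a data frame
list_style <- airquality["Ozone"]

# A row index before the comma selects rows
first_row <- airquality[1, ]
```

So drop does matter for data frames, but only when both a row and a column position are given.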
Anyway, it's all in the doc... not necessarily all contiguous or clearly-explained... welcome to R :-)

Determine when a sequence of numbers has been broken in R

Say I have a series of numbers:
seq1<-c(1:20,25:40,48:60)
How can I return a vector that lists points in which the sequence was broken, like so:
c(21,24)
[1] 21 24
c(41,47)
[1] 41 47
Thanks for any help.
To show my miserably failing attempt:
nums<-min(seq1):max(seq1) %in% seq1
which(nums==F)[1]
res.vec<-vector()
counter<-0
res.vec2<-vector()
counter2<-0
for (i in 2:length(seq1)){
if(nums[i]==F & nums[i-1]!=F){
counter<-counter+1
res.vec[counter]<-seq1[i]
}
if(nums[i]==T & nums[i-1]!=T){
counter2<-counter2+1
res.vec2[counter2]<-seq1[i]
}
}
cbind(res.vec,res.vec2)
I have changed the general function a bit, so I think this should be a separate answer.
You could try
seq1<-c(1:20,25:40,48:60)
myfun<-function(data,threshold){
cut<-which(c(1,diff(data))>threshold)
return(cut)
}
You get the points you have to care about using
myfun(seq1,1)
[1] 21 37
To make the result easier to reuse, it is convenient to store it in an object.
pru<-myfun(seq1,1)
So you can now call
df<-data.frame(pos=pru,value=seq1[pru])
df
pos value
1 21 25
2 37 48
You get a data frame with the position and the value of the breaks for your desired threshold. If you want a list instead of a data frame it works like this:
list(pos=pru,value=seq1[pru])
$pos
[1] 21 37
$value
[1] 25 48
Function diff will give you the differences between successive values
> x <- c(1,2,3,5,6,3)
> diff(x)
[1] 1 1 2 1 -3
Now look for those values that are not equal to one for "breakpoints" in your sequence.
Taking into account the comments made here, for general-purpose use you could write:
fun<-function(data,threshold){
# pad with threshold so the first element is never reported as a break
idx<-which(c(threshold,diff(data)) != threshold)
return(idx)
}
Consider that data could be any numerical vector (such as a data frame column). I would also consider using grep with a similar approach but it all depends on user preference.
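The answers above return the position where each new run starts. If what you want is the missing range itself, as in the question (21 to 24 and 41 to 47), a minimal sketch along the same diff() lines could be the following; the names brk and gaps are only illustrative:

```r
seq1 <- c(1:20, 25:40, 48:60)

# Positions of the last element before each break
brk <- which(diff(seq1) > 1)

# First and last missing value of each gap
gaps <- data.frame(from = seq1[brk] + 1,
                   to   = seq1[brk + 1] - 1)
```

Each row of gaps is one break: from is the value after the last number before the break, and to is the value before the first number after it. This assumes seq1 is sorted in increasing order.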

R data.table - searching for multiple rows by key which is composed of 2 columns

I have a data.table with two of its columns (V1,V2) set as a key. A third column contains multiple values.
I have another data.table with pairs I want to search for in the first table key, and return a united list of the lists in V3.
locations<-data.table(c(-159.58,0.2,345.1),c(21.901,22.221,66.5),list(c(10,20),c(11),c(12,33)))
setkey(locations,V1,V2)
searchfor<-data.table(lon=c(-159.58,345.1,11),lat=c(21.901,66.5,0))
The final result should look like this:
[1] 10 20 12 33
The following works when searching for just one item.
locations[.(-159.58,21.901),V3][[1]]
[1] 10 20
I don't know how to generalize this and use the table "searchfor" as source for the searched indices (BTW - the "searchfor" table can be changed to a different format if it makes the solution easier).
Also, how do I unite the different values in V3 I'll then (hopefully) receive, into one list?
You can use the same syntax that you used with a data.table as an index.
lst <- locations[ .(searchfor), V3]
You probably want to use only the non-null elements. If so, you can directly use nomatch=0L argument:
locations[ .(searchfor), V3, nomatch=0L]
# [[1]]
# [1] 10 20
#
# [[2]]
# [1] 12 33
This will return a list. If you want to return a vector instead, use the base function unlist() as follows:
locations[ .(searchfor), unlist(V3), nomatch=0L]
# [1] 10 20 12 33

R Pipelining with Anonymous Functions

I have a question which is an extension of another question.
I want to be able to pipeline anonymous functions. In the previous question the answer to pipeline defined functions was to create a pipeline operator "%|>%" and to define it this way:
"%|>%" <- function(fun1, fun2){
function(x){fun2(fun1(x))}
}
This would allow you to call a series of functions while continually passing the result of the previous function to the next. The caveat was that the functions had to be predefined. Now I'm trying to figure out how to do this with anonymous functions. The previous solution which used predefined functions looks like this:
square <- function(x){x^2}
add5 <- function(x){x + 5}
pipelineTest <-
square %|>%
add5
Which gives you this behaviour:
> pipelineTest(1:10)
[1] 6 9 14 21 30 41 54 69 86 105
I would like to be able to define the pipelineTest function with anonymous functions like this:
anonymousPipelineTest <-
function(x){x^2} %|>%
function(x){x+5} %|>%
x
When I try to call this with the same arguments as above I get the following:
> anonymousPipelineTest(1:10)
function(x){fun2(fun1(x))}
<environment: 0x000000000ba1c468>
What I'm hoping to get is the same result as pipelineTest(1:10). I know that this is a trivial example. What I'm really trying to get at is a way to pipeline anonymous functions. Thanks for the help!
Using Compose (from the functional package), and calling the resulting function gives this:
"%|>%" <- function(...) Compose(...)()
Now get rid of the 'x' as the final "function" (here it is replaced with an actual function, which is not needed but is included as an example):
anonymousPipelineTest <-
function(x){x^2} %|>%
function(x){x+5} %|>% function(x){x}
anonymousPipelineTest(1:10)
[1] 6 9 14 21 30 41 54 69 86 105
This is an application of an example offered on the ?funprog help page:
Funcall <- function(f, ...) f(...)
anonymousPipelineTest <- function(x) Reduce( Funcall, list(
function(x){x+5}, function(x){x^2}),
x, right=TRUE)
anonymousPipelineTest(1:10)
#[1] 6 9 14 21 30 41 54 69 86 105
I am putting up an answer which is the closest thing I've found, for other people looking for the same thing. I won't give myself points for the answer though, because it is not quite what I want.
Returning a Function:
If you want to put several functions together, the easiest thing I've found is to use the 'Compose' function found in the 'functional' package for R. It would look something like this:
anonymousPipe <- Compose(
function(x){x^2},
function(x){x+5})
This allows you to call this series of functions like this:
> anonymousPipe(1:10)
[1] 6 9 14 21 30 41 54 69 86 105
Returning Data:
If all you want to do is start with some data and send it through a series of transformations (my original intent) then the first function in the 'Compose' function should be your starting data and after the close of the 'Compose' function add a parenthesis pair to call the function. It looks like this:
anonymousPipeData <- Compose(
seq(1:10),
function(x){x^2},
function(x){x+5})()
'anonymousPipeData' is now the data which is the result of the series of functions. Please note the pair of parentheses at the end. This is what causes R to return the data rather than a function.
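For what it's worth, the original two-argument operator from the question also seems to work with anonymous functions once each one is wrapped in parentheses. Without them, the body of the first function(x) swallows everything to its right, so the whole chain is parsed as a single function. A minimal sketch, reusing the operator definition from the question:

```r
# Operator as defined in the question
"%|>%" <- function(fun1, fun2) {
  function(x) { fun2(fun1(x)) }
}

# Parentheses stop each function body from absorbing the rest of the chain
anonymousPipelineTest <-
  (function(x) { x^2 }) %|>%
  (function(x) { x + 5 })

result <- anonymousPipelineTest(1:10)
```

result should match pipelineTest(1:10) from the question, and no trailing x or extra package is needed.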

Resources