Mystery Matrix Subset - r

I came across this strange matrix operation the other day and can't figure out what it is doing.
Consider:
a<-matrix(nrow=2,ncol=2,c(9,8,7,6))
b<-matrix(nrow=2,ncol=2,c(1,2,1,2))
a[b]
Whoa! How can you even use a matrix to subset another matrix? Anyway - this is the result
a[b]
#[1] 9 6
I thought maybe b was providing the indexing to reference a (i.e. get 1,1 and then get 2,2). But if that is what is happening, the rules get thrown out of the window when you do this:
a<-matrix(nrow=3,ncol=3,c(9,8,7,6,5,4,3,2,1))
b<-matrix(nrow=3,ncol=3,c(1,2,3,2,2,2,1,1,1))
a[b]
#[1] 9 8 7 8 8 8 9 9 9
Does anyone know what is happening here?

This is not a mystery. In your second example the indexing matrix b is treated as a plain numeric vector:
as.numeric(b)
#[1] 1 2 3 2 2 2 1 1 1
a[as.numeric(b)]
#[1] 9 8 7 8 8 8 9 9 9
You have to remember that, on top of the two-dimensional [i, j] (row, column) indexing, matrices also have a one-dimensional indexing, where each element is assigned a number in sequence, starting with the top-left element and going down the columns. So a[1, 1] is the same as a[1] and a[2, 2] is the same as a[5]. Hence a[b] gives you c(a[1], a[2], a[3], a[2], ..., a[1]), which is the same as c(a[1,1], a[2,1], a[3,1], a[2,1], a[2,1], ..., a[1,1]).
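The first example in the question behaves differently for a reason the answer above leaves implicit. As far as I know (see ?Extract), when the index is a numeric matrix with as many columns as the array has dimensions, each row of the index is read as one (row, column) pair; otherwise the index is flattened and used as plain positions. A small sketch of both cases, using the question's matrices under the names a2/b2 and a3/b3 (my names, not the OP's):

```r
# First example: b2 has 2 columns and a2 has 2 dimensions,
# so each ROW of b2 is read as one (row, column) pair.
a2 <- matrix(c(9, 8, 7, 6), nrow = 2, ncol = 2)
b2 <- matrix(c(1, 2, 1, 2), nrow = 2, ncol = 2)
by_pairs <- a2[b2]                    # picks a2[1, 1] and a2[2, 2]

# Second example: b3 has 3 columns, which does not match the
# 2 dimensions of a3, so b3 is flattened and used as positions.
a3 <- matrix(c(9, 8, 7, 6, 5, 4, 3, 2, 1), nrow = 3, ncol = 3)
b3 <- matrix(c(1, 2, 3, 2, 2, 2, 1, 1, 1), nrow = 3, ncol = 3)
by_position <- a3[b3]
same_thing  <- a3[as.vector(b3)]

# To get the positional behaviour with a two-column matrix,
# flatten it explicitly first.
forced <- a2[as.vector(b2)]           # positions 1, 2, 1, 2
```

So the OP's first guess (get 1,1 and then 2,2) was right for the 2 x 2 case; it just stops applying once the index matrix no longer has exactly two columns.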

A matrix is essentially a vector with a dimension attribute. In R, matrices are stored in "column-major order", meaning that the matrix is filled column by column. This implies the following:
a <- matrix(1:4, nrow = 2)
> a
#     [,1] [,2]
#[1,]    1    3
#[2,]    2    4
Since it is also a vector, you will still be able to access the elements of a using single indices.
> a[1]
#[1] 1
> a[2]
#[1] 2
> a[3]
#[1] 3
When you put a variable inside the bracket operator, R tries to coerce it into an integer vector, so that it becomes a set of positions in the underlying vector a.
To understand better, you could try the following:
a<-matrix(nrow=3,ncol=3,c(9,8,7,6,5,4,3,2,1))
b<-matrix(nrow=3,ncol=3,c(1,2,3,2,2,2,1,10,1)) # a[10] = NA
> a[b]
# [1] 9 8 7 8 8 8 9 NA 9
Since the brackets coerce the matrices into integer vectors (truncating toward zero), you can even use a b matrix with floating-point values:
b<-matrix(nrow=3,ncol=3,c(1.1,2.1,3.9,2.8,2,2,1,10.5,1))
> a[b]
# [1] 9 8 7 8 8 8 9 NA 9
This is because, as said earlier:
> as.integer(b) # same as as.integer(c(1.1,2.1,3.9,2.8,2,2,1,10.5,1))
# [1] 1 2 3 2 2 2 1 10 1
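A quick way to convince yourself that the conversion truncates toward zero rather than rounds. This is only a sketch; `v` is an illustrative vector that does not appear in the question:

```r
v <- c(10, 20, 30, 40)

at_2_9 <- v[2.9]   # truncated to position 2, not rounded up to 3
at_3_9 <- v[3.9]   # truncated to position 3

# The same truncation, made explicit.
positions <- as.integer(c(1.1, 2.9, 3.9))
```

Here at_2_9 is 20 and at_3_9 is 30, which is why 3.9 and 2.8 in b above behave like 3 and 2.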

Related

Matrix indexing notation syntax

Let's create a matrix m.
m <- matrix(1:9, 3, 3, byrow = TRUE); m
#     [,1] [,2] [,3]
#[1,]    1    2    3
#[2,]    4    5    6
#[3,]    7    8    9
m[3,1] # 7
m[3][1] # 7
Why does the second indexing notation work? Is there a difference between these two notations? Is it safe to use?
But sequences behave differently:
m[1:2, 1:2] # works as expected, returns a matrix
m[1:2][1:2] # returns the vector 1 4, why?
A matrix is a vector with a dim attribute. m[3] returns only the 3rd element of that vector. If we want to chain the extractions, we can first extract the row, leaving the column index blank after the comma (with drop = FALSE if we want to avoid the matrix being coerced to a vector), and then select the first element, which is the first column:
m[3,, drop = FALSE][1]
#[1] 7
In the OP's first option, m[3, 1] selects the element using both the row index and the column index.
In the updated example, the OP specified the first 2 rows as the row index and the first 2 columns as the column index, so it returns a matrix omitting the 3rd row and 3rd column:
m[1:2, 1:2]
#     [,1] [,2]
#[1,]    1    2
#[2,]    4    5
But, in the second case
m[1:2]
#[1] 1 4
extracts the first two elements. Likewise, if we do
m[1:5]
#[1] 1 4 7 2 5
we get the first five elements in columnwise order.
Therefore,
m[1:2][1:2]
returns only 1, 4 because the first extraction yields just 1 and 4. The second extraction then works on that two-element vector. If we ask for positions beyond its length, those positions do not exist and are filled with NA:
m[1:2][1:4]
#[1] 1 4 NA NA
The elementwise indexing is acting on the vector
c(m)
#[1] 1 4 7 2 5 8 3 6 9
where the first two elements are 1 and 4
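The link between the two notations can be written down explicitly. A minimal sketch, assuming the same m as in the question; the names i, j, k and rc are mine:

```r
m <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)

# Position k in column-major order corresponds to row i, column j
# with k = (j - 1) * nrow(m) + i.
i <- 3
j <- 1
k <- (j - 1) * nrow(m) + i
same_element <- m[k] == m[i, j]

# arrayInd() goes the other way: linear positions to (row, column) pairs.
rc <- arrayInd(1:2, dim(m))   # positions 1 and 2 are rows 1 and 2 of column 1
from_pairs <- m[rc]           # same elements as m[1:2]
```

For m[3, 1] this gives k = 3, which is why m[3][1] happened to return the same value. That is a coincidence of the first column; m[3, 2] is position 6, not 3.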

Generate sequence between each element of 2 vectors

I have a for loop that generates, on each iteration, 2 vectors of the same length (the length can vary between iterations), such as:
>aa
[1] 3 5
>bb
[1] 4 8
I want to create a sequence from each pair of corresponding elements of these vectors, to obtain this:
>zz
[1] 3 4 5 6 7 8
Is there a function in R to create that?
We can use Map to get the sequences for corresponding elements of 'aa' and 'bb'. The output is a list, so we unlist it to get a vector.
unlist(Map(`:`, aa, bb))
#[1] 3 4 5 6 7 8
data
aa <- c(3,5)
bb <- c(4, 8)
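The same call works unchanged when the vectors hold more than two pairs. A sketch with made-up vectors `from` and `to`; the sequence() variant assumes R 4.0.0 or later, where (to my knowledge) the `from` argument was added:

```r
from <- c(3, 5, 10)
to   <- c(4, 8, 12)

# One sequence per pair (3:4, 5:8, 10:12), then flattened.
z_map <- unlist(Map(`:`, from, to))

# Vectorised alternative: one run per pair, each of length to - from + 1.
z_seq <- sequence(to - from + 1, from = from)
```

Both give 3 4 5 6 7 8 10 11 12. Note that the gap between 8 and 10 is preserved, which is what "a sequence between each pair of elements" means here.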
One can obtain a sequence by using the colon operator : that separates the beginning of a sequence from its end. We can define such sequences for each vector, aa and bb, and concatenate the results with c() into a single series of numbers.
To avoid double entries in overlapping ranges we can use the unique() function:
zz <- unique(c(aa[1]:aa[length(aa)],bb[1]:bb[length(bb)]))
#> zz
#[1] 3 4 5 6 7 8
with
aa <- c(3,5)
bb <- c(4,8)
Depending on your desired output, here are a few more alternatives:
> do.call("seq",as.list(range(aa,bb)))
[1] 3 4 5 6 7 8
> Reduce(seq,range(aa,bb)) #all credit due to @BrodieG
[1] 3 4 5 6 7 8
> min(aa,bb):max(aa,bb)
[1] 3 4 5 6 7 8

Get specific column value for each row

I want to get a vector of length m that, for each row of an m x n matrix, gives the value in the column identified by another column (say column "Z").
I did it with a for loop (df is the data frame, v the result vector):
v <- numeric(nrow(df))
for (i in seq_len(nrow(df))) v[i] <- df[i, df$Z[i]]
Do you see a simpler way to code it avoiding the loop?
"apply" is a possibility:
> M <- cbind( matrix(1:15,3,5), "Z"=c(3,1,2) )
> M
                 Z
[1,] 1 4 7 10 13 3
[2,] 2 5 8 11 14 1
[3,] 3 6 9 12 15 2
> v <- apply(M,1,function(x){x[x["Z"]]})
> v
[1] 7 2 6
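Another possibility that avoids both the loop and apply is to index the matrix with a two-column matrix of (row, column) pairs. A minimal sketch on the same M; `idx` is my own name:

```r
M <- cbind(matrix(1:15, 3, 5), Z = c(3, 1, 2))

# Build one (row, column) pair per row: row i, column M[i, "Z"].
idx <- cbind(seq_len(nrow(M)), M[, "Z"])

# Indexing with a two-column matrix picks one element per pair.
v <- M[idx]
```

This gives the same 7 2 6 as the apply version. For a data frame with mixed column types, converting the relevant columns with as.matrix() first is probably the cautious route.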

cumsum the opposite of diff in r

I have a question and I'm not sure if I'm being totally stupid here or if this is a genuine problem, or if I've misunderstood what these functions do.
Is the opposite of diff the same as cumsum? I thought it was. However, using this example:
dd <- c(17.32571,17.02498,16.71613,16.40615,
16.10242,15.78516,15.47813,15.19073,
14.95551,14.77397)
par(mfrow = c(1,2))
plot(dd)
plot(cumsum(diff(dd)))
> dd
[1] 17.32571 17.02498 16.71613 16.40615 16.10242 15.78516 15.47813 15.19073 14.95551
[10] 14.77397
> cumsum(diff(dd))
[1] -0.30073 -0.60958 -0.91956 -1.22329 -1.54055 -1.84758 -2.13498 -2.37020 -2.55174
These aren't the same. Where have I gone wrong?
AHHH! Fridays.
Obviously
The functions are quite different: diff(x) returns a vector of length length(x) - 1 containing the differences between each element of x and the one before it, while cumsum(x) returns a vector of the same length as x containing the running (cumulative) sums of the elements of x.
Example:
x <- c(1:10)
#[1] 1 2 3 4 5 6 7 8 9 10
> diff(x)
#[1] 1 1 1 1 1 1 1 1 1
v <- cumsum(x)
> v
#[1] 1 3 6 10 15 21 28 36 45 55
The function cumsum() is the cumulative sum, so each entry v[i] of the vector it returns is the sum of all elements of x from x[1] to x[i]. In contrast, diff(x) only takes the difference between one element and the next, x[i+1] - x[i].
The combination of cumsum and diff leads to different results, depending on the order in which the functions are executed:
> cumsum(diff(x))
# 1 2 3 4 5 6 7 8 9
Here the result is the cumulative sum of a sequence of nine ones. Note that, compared with the original vector x, the last entry 10 is missing.
On the other hand, by calculating
> diff(cumsum(x))
# 2 3 4 5 6 7 8 9 10
one obtains a vector that is again similar to the original vector x, but now the first entry 1 is missing.
In neither case is the original vector restored, so it cannot be stated that cumsum() is the opposite or inverse function of diff().
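You forgot to account for the first element, which diff() drops. Adding it back restores the original vector (all.equal() is safer than == here because of floating-point rounding):
all.equal(dd, c(dd[[1]], dd[[1]] + cumsum(diff(dd))))
#[1] TRUE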
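An equivalent way to write the same idea is to put the first value back in front of the differences before accumulating. A small sketch on the OP's dd; `restored` is my own name:

```r
dd <- c(17.32571, 17.02498, 16.71613, 16.40615, 16.10242,
        15.78516, 15.47813, 15.19073, 14.95551, 14.77397)

# Keep the first value, then accumulate the differences on top of it.
restored <- cumsum(c(dd[1], diff(dd)))

# Compare with a tolerance rather than exact equality.
ok <- isTRUE(all.equal(restored, dd))
```

In that sense cumsum() does undo diff(), provided the starting value that diff() discards is supplied separately.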
@RHertel answered it well, stating that diff() returns a vector of length length(x) - 1.
Therefore, another simple workaround would be to prepend 0 to the original vector, so that diff() also computes the difference between x[1] and 0.
> x <- 5:10
> x
#[1] 5 6 7 8 9 10
> diff(x)
#[1] 1 1 1 1 1
> diff(c(0,x))
#[1] 5 1 1 1 1 1
This way it is possible to use diff() together with c() as the inverse of cumsum(), and vice versa:
> cumsum(diff(c(0,x)))
#[1] 5 6 7 8 9 10
> diff(c(0,cumsum(x)))
#[1] 5 6 7 8 9 10
If you know the values of "lag" and "differences", you can use diffinv():
x<-5:10
y<-diff(x,lag=1,differences=1)
z<-diffinv(y,lag=1,differences = 1,xi=5) # xi is the first value of x
k<-as.data.frame(cbind(x,z))
k
   x  z
1  5  5
2  6  6
3  7  7
4  8  8
5  9  9
6 10 10

what is the meaning of tapply(x,index) if no FUN?

I know the meaning of tapply(dat$sale,list(dat$year,dat$province),sum) in the code:
> dat=data.frame(
+ year=c(rep(2007,5),rep(2008,3),rep(2009,3)),
+ province=c("a","a","b","c","d","a","c","d","b","c","d"),
+ sale=1:11)
> tapply(dat$sale,list(dat$year,dat$province),sum)
      a  b  c  d
2007  3  3  4  5
2008  6 NA  7  8
2009 NA  9 10 11
What is the meaning of tapply(dat$sale,list(dat$year,dat$province)) if there is no FUN in it?
> tapply(dat$sale,list(dat$year,dat$province))
[1] 1 1 4 7 10 2 8 11 6 9 12
The result is a vector of subscripts. What is the meaning of 12 or 9 in the result?
By which rule do I get 12 or 9, and how is it calculated?
From ?tapply:
FUN the function to be applied, or NULL. In the case of functions
like +, %*%, etc., the function name must be backquoted or quoted. If
FUN is NULL, tapply returns a vector which can be used to subscript
the multi-way array tapply normally produces.
FUN defaults to NULL, so you get the subscripts.
Note that in R, matrices/arrays, like those returned by tapply, are just vectors with dimensions. Matrices are stored in column-major order, so the subscripts run down the first column until they wrap around to the top of the second column:
> mat <- matrix(seq(9),ncol=3)
> mat
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
> mat[4]
[1] 4
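Applied to the question's data, the 12 and the 9 can be worked out by hand. A sketch using the OP's dat; the names totals, idx and per_row_total are mine:

```r
dat <- data.frame(
  year     = c(rep(2007, 5), rep(2008, 3), rep(2009, 3)),
  province = c("a", "a", "b", "c", "d", "a", "c", "d", "b", "c", "d"),
  sale     = 1:11
)

totals <- tapply(dat$sale, list(dat$year, dat$province), sum)  # 3 x 4 table
idx    <- tapply(dat$sale, list(dat$year, dat$province))       # cell number per row

# Row 11 of dat is (2009, "d"): 3rd row, 4th column of the table,
# so its cell number is (4 - 1) * 3 + 3 = 12.
cell_of_row_11 <- idx[11]
row_col_of_12  <- arrayInd(12, dim(totals))   # row 3, column 4

# Subscripting the table with idx gives each row's group total.
per_row_total <- totals[idx]
```

By the same rule, row 10 of dat is (2009, "c"), i.e. row 3, column 3 of the table, giving (3 - 1) * 3 + 3 = 9.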
