what is the meaning of tapply(x,index) if no FUN? - r

I know the meaning of tapply(dat$sale,list(dat$year,dat$province),sum) in the code:
> dat=data.frame(
+ year=c(rep(2007,5),rep(2008,3),rep(2009,3)),
+ province=c("a","a","b","c","d","a","c","d","b","c","d"),
+ sale=1:11)
> tapply(dat$sale,list(dat$year,dat$province),sum)
      a  b  c  d
2007  3  3  4  5
2008  6 NA  7  8
2009 NA  9 10 11
What is the meaning of tapply(dat$sale,list(dat$year,dat$province)) if there is no FUN in it?
> tapply(dat$sale,list(dat$year,dat$province))
[1] 1 1 4 7 10 2 8 11 6 9 12
The result is a vector of subscripts. What is the meaning of 12 or 9 in the result?
By which rule do I get 12 or 9, and how are they calculated?

From ?tapply:
FUN the function to be applied, or NULL. In the case of functions
like +, %*%, etc., the function name must be backquoted or quoted. If
FUN is NULL, tapply returns a vector which can be used to subscript
the multi-way array tapply normally produces.
FUN defaults to NULL, so you get the subscripts.
Note that in R, matrices and arrays, like those returned by tapply, are just vectors with a dim attribute. They are stored in column-major order, so a single index counts down the first column and then wraps around to the top of the second column:
> mat <- matrix(seq(9),ncol=3)
> mat
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
> mat[4]
[1] 4
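To see where 12 and 9 come from, the subscripts can be rebuilt by hand. This is only a sketch that reuses dat from the question; rows, cols and idx are names introduced here for illustration.

```r
dat <- data.frame(
  year = c(rep(2007, 5), rep(2008, 3), rep(2009, 3)),
  province = c("a", "a", "b", "c", "d", "a", "c", "d", "b", "c", "d"),
  sale = 1:11
)

# Position of each observation along the two dimensions of the result
rows <- as.integer(factor(dat$year))      # 2007 -> 1, 2008 -> 2, 2009 -> 3
cols <- as.integer(factor(dat$province))  # a -> 1, b -> 2, c -> 3, d -> 4

# Column-major linear index into the 3 x 4 array: row + (column - 1) * nrow
idx <- rows + (cols - 1L) * 3L
idx

# Compare with what tapply() returns when FUN is not supplied
all(idx == tapply(dat$sale, list(dat$year, dat$province)))
```

The last observation is (2009, d), i.e. row 3 and column 4, so its subscript is 3 + (4 - 1) * 3 = 12. The one before it is (2009, c), which gives 3 + (3 - 1) * 3 = 9.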

Related

Axis numbering for R's apply function

Given the following simple matrix
> mymatrix<-matrix(1:9,nrow=3)
> mymatrix
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
Let's do row-wise and column-wise sums:
> apply(mymatrix,1,sum)
[1] 12 15 18
> apply(mymatrix,2,sum)
[1] 6 15 24
My intuition would have the axes reversed from what we see above. I think of rows as the first dimension of a matrix. So applying the sum operation on axis-1 should give us row sums. What is the proper way to understand the thinking of having the opposite polarity?
I actually misunderstood what matrix(1:9,nrow=3) generates: I had not paid attention to the output. I had presumed it would create
1 2 3
4 5 6
7 8 9
But instead it fills columns first. So apply does exactly what I expect: it sums rows when the second argument (MARGIN) is 1 and sums columns when it is 2.
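A quick check of this, as a sketch only: the variable names below are made up for illustration, and byrow = TRUE is shown because it produces the layout I had originally assumed.

```r
mymatrix <- matrix(1:9, nrow = 3)

row_sums <- apply(mymatrix, 1, sum)  # MARGIN = 1: one result per row
col_sums <- apply(mymatrix, 2, sum)  # MARGIN = 2: one result per column

# The same results from the dedicated helpers
all(row_sums == rowSums(mymatrix))
all(col_sums == colSums(mymatrix))

# Filling by row gives the 1 2 3 / 4 5 6 / 7 8 9 layout instead
byrow_matrix <- matrix(1:9, nrow = 3, byrow = TRUE)
byrow_row_sums <- apply(byrow_matrix, 1, sum)
```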

Crosschecking numbers of a matrix in R

I'm currently working with a large two-column matrix, and I want to check whether each row (a combination of two values) is also present in a loaded data frame, which has two columns as well.
Example,
(obj_design <- matrix(c(2,5,4,7,6,6,20,12,4,0), nrow = 5, ncol = 2))
     [,1] [,2]
[1,]    2    6
[2,]    5   20
[3,]    4   12
[4,]    7    4
[5,]    6    0
(refined_grid <- data.frame(i=1:4, j=1:12))
   i  j
1  1  1
2  2  2
3  3  3
4  4  4
5  1  5
6  2  6
7  3  7
8  4  8
9  1  9
10 2 10
11 3 11
12 4 12
In this reproducible example, the rows (2,6) and (4,12) would be selected.
I'm wondering if there's a function I can use to check the whole matrix, see whether each row is in the data frame, and (if possible) write the rows that are found to a separate new dataset.
Any assistance would be wonderful.
Here is an option with match:
i1 <- match(do.call(paste, as.data.frame(obj_design)),
            do.call(paste, refined_grid), nomatch = 0)
refined_grid[i1, ]
This code tells you which rows of the matrix exist in the data frame:
which(paste(obj_design[,1], obj_design[,2]) %in%
      paste(refined_grid$i, refined_grid$j))
You can then assign the result to a vector.
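If the goal is a separate dataset rather than row numbers, the same comparison can be kept as a logical mask and used to split the matrix. This is a sketch under the question's example data; in_grid, matched and unmatched are names made up here.

```r
obj_design <- matrix(c(2, 5, 4, 7, 6, 6, 20, 12, 4, 0), nrow = 5, ncol = 2)
refined_grid <- data.frame(i = 1:4, j = 1:12)

# TRUE for every matrix row whose (col 1, col 2) pair occurs in the data frame
in_grid <- paste(obj_design[, 1], obj_design[, 2]) %in%
  paste(refined_grid$i, refined_grid$j)

# drop = FALSE keeps the result a matrix even if only one row matches
matched   <- obj_design[in_grid, , drop = FALSE]
unmatched <- obj_design[!in_grid, , drop = FALSE]
matched
```

For the example data, matched holds the rows (2,6) and (4,12).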

Mystery Matrix Subset

I came across this strange matrix operation the other day and can't figure out what it is doing.
Consider:
a<-matrix(nrow=2,ncol=2,c(9,8,7,6))
b<-matrix(nrow=2,ncol=2,c(1,2,1,2))
a[b]
Whoa! How can you even use a matrix to subset another matrix? Anyway - this is the result
a[b]
#[1] 9 6
I thought maybe b was providing the indexing to reference a (i.e. get 1,1 and then get 2,2). But if that is what is happening, the rules get thrown out of the window when you do this:
a<-matrix(nrow=3,ncol=3,c(9,8,7,6,5,4,3,2,1))
b<-matrix(nrow=3,ncol=3,c(1,2,3,2,2,2,1,1,1))
a[b]
#[1] 9 8 7 8 8 8 9 9 9
Does anyone know what is happening here?
This is not a mystery. In your second example the indexing matrix b is treated as a plain numeric vector:
as.numeric(b)
#[1] 1 2 3 2 2 2 1 1 1
a[as.numeric(b)]
#[1] 9 8 7 8 8 8 9 9 9
You have to remember that, on top of the two-dimensional i,j (row, column) indexing, matrices also have a one-dimensional indexing in which each element is numbered in sequence, starting with the top-left element and going down the columns. So for this 3 x 3 matrix a[1, 1] is the same as a[1] and a[2, 2] is the same as a[5]. Hence a[b] gives you c(a[1], a[2], a[3], a[2], ..., a[1]), which is the same as c(a[1,1], a[2,1], a[3,1], a[2,1], a[2,1], ..., a[1,1]).
A matrix is essentially a vector with a dimension attribute. In R, matrices are stored in "column-major order", meaning that the matrix is filled column by column. This implies the following:
> a <- matrix(1:4, nrow = 2)
> a
#      [,1] [,2]
# [1,]    1    3
# [2,]    2    4
Since it is also a vector, you will still be able to access elements of a using single indices.
> a[1]
# [1] 1
> a[2]
# [1] 2
> a[3]
# [1] 3
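When you put a variable inside the brackets, R tries to coerce it into an integer vector, so that it becomes a set of indices into the vector underlying a. The exception is a numeric matrix with as many columns as a has dimensions: each of its rows is then read as a (row, column) pair, which is why the 2 x 2 example in the question returns 9 6.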
To understand better, you could try the following:
a<-matrix(nrow=3,ncol=3,c(9,8,7,6,5,4,3,2,1))
b<-matrix(nrow=3,ncol=3,c(1,2,3,2,2,2,1,10,1)) # a[10] = NA
> a[b]
# [1] 9 8 7 8 8 8 9 NA 9
Since the brackets coerce the matrix into an integer vector, you can even have a b matrix with non-integer values, which are truncated toward zero:
b<-matrix(nrow=3,ncol=3,c(1.1,2.1,3.9,2.8,2,2,1,10.5,1))
> a[b]
# [1] 9 8 7 8 8 8 9 NA 9
This is because, as said earlier:
> as.integer(b) # same as as.integer(c(1.1,2.1,3.9,2.8,2,2,1,10.5,1))
# [1] 1 2 3 2 2 2 1 10 1
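The 2 x 2 example from the question follows a different rule, and it can help to put the two side by side. When the index is a numeric matrix with as many columns as the indexed array has dimensions, R reads each row as one (row, column) position. A small sketch (pairs_result and linear_result are names made up here):

```r
a <- matrix(c(9, 8, 7, 6), nrow = 2, ncol = 2)
b <- matrix(c(1, 2, 1, 2), nrow = 2, ncol = 2)

# b has 2 columns and a has 2 dimensions, so the rows of b,
# (1, 1) and (2, 2), are read as positions: a[1, 1] and a[2, 2]
pairs_result <- a[b]

# Dropping the dim attribute forces plain vector indexing: a[1], a[2], a[1], a[2]
linear_result <- a[as.vector(b)]

pairs_result
linear_result
```

With the 3 x 3 matrices, b has three columns while a has two dimensions, so only the plain vector indexing applies.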

Get specific column value for each row

Given an m x n matrix, I want a vector of length m that holds, for each row, the value in the column whose index is stored in another column (say column "Z").
I made it using a for loop:
for (i in 1:dim(data.frame)[1]){vector[i] <- data.frame[i,data.frame$Z[i]]}
Do you see a simpler way to code it avoiding the loop?
"apply" is a possibility:
> M <- cbind( matrix(1:15,3,5), "Z"=c(3,1,2) )
> M
                 Z
[1,] 1 4 7 10 13 3
[2,] 2 5 8 11 14 1
[3,] 3 6 9 12 15 2
> v <- apply(M,1,function(x){x[x["Z"]]})
> v
[1] 7 2 6
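Another option that avoids both the loop and apply is to index with a two-column matrix of (row, column) pairs. This is a sketch on the same M as above; it assumes Z holds valid column numbers.

```r
M <- cbind(matrix(1:15, 3, 5), Z = c(3, 1, 2))

# One (row, column) pair per row of M: (1, 3), (2, 1), (3, 2)
pairs <- cbind(seq_len(nrow(M)), M[, "Z"])

# Indexing with a two-column matrix picks one element per pair
v <- M[pairs]
v
```

Because the lookup is a single vectorised subscript, it should also scale better than apply on large matrices.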

cumsum the opposite of diff in r

I have a question and I'm not sure if I'm being totally stupid here or if this is a genuine problem, or if I've misunderstood what these functions do.
Is the opposite of diff the same as cumsum? I thought it was. However, using this example:
dd <- c(17.32571,17.02498,16.71613,16.40615,
16.10242,15.78516,15.47813,15.19073,
14.95551,14.77397)
par(mfrow = c(1,2))
plot(dd)
plot(cumsum(diff(dd)))
> dd
[1] 17.32571 17.02498 16.71613 16.40615 16.10242 15.78516 15.47813 15.19073 14.95551
[10] 14.77397
> cumsum(diff(dd))
[1] -0.30073 -0.60958 -0.91956 -1.22329 -1.54055 -1.84758 -2.13498 -2.37020 -2.55174
These aren't the same. Where have I gone wrong?
AHHH! Fridays.
Obviously
The functions are quite different: diff(x) returns a vector of length (length(x)-1) containing the differences between consecutive elements of x, while cumsum(x) returns a vector of the same length as x containing the running sums of its elements.
Example:
> x <- c(1:10)
> x
#[1] 1 2 3 4 5 6 7 8 9 10
> diff(x)
#[1] 1 1 1 1 1 1 1 1 1
> v <- cumsum(x)
> v
#[1] 1 3 6 10 15 21 28 36 45 55
The function cumsum() is the cumulative sum, so each entry v[i] of the vector it returns is the sum of all elements from x[1] to x[i]. In contrast, diff(x) only takes the difference between one element and the next, x[i+1] - x[i].
The combination of cumsum and diff leads to different results, depending on the order in which the functions are executed:
> cumsum(diff(x))
#[1] 1 2 3 4 5 6 7 8 9
Here the result is the cumulative sum of a sequence of nine "1". Note that if this result is compared with the original vector x, the last entry 10 is missing.
On the other hand, by calculating
> diff(cumsum(x))
#[1] 2 3 4 5 6 7 8 9 10
one obtains a vector that again resembles the original vector x, but now the first entry 1 is missing.
In neither case is the original vector restored, so cumsum() cannot be called the opposite or inverse function of diff().
You forgot to account for the impact of the first element:
dd == c(dd[[1]], dd[[1]] + cumsum(diff(dd)))
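Since dd holds decimal values, an element-wise == may report FALSE for entries that differ only by floating-point rounding. A sketch of the same check with a tolerance, where rebuilt is a name made up here:

```r
dd <- c(17.32571, 17.02498, 16.71613, 16.40615,
        16.10242, 15.78516, 15.47813, 15.19073,
        14.95551, 14.77397)

# Put the first element back in front of the differences, then accumulate
rebuilt <- cumsum(c(dd[1], diff(dd)))

# all.equal() compares with a small numeric tolerance instead of exact equality
isTRUE(all.equal(dd, rebuilt))
```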
@RHertel answered it well, stating that diff() returns a vector of length length(x)-1.
Therefore, another simple workaround is to add 0 to the beginning of the original vector, so that diff() also computes the difference between x[1] and 0.
> x <- 5:10
> x
#[1] 5 6 7 8 9 10
> diff(x)
#[1] 1 1 1 1 1
> diff(c(0,x))
#[1] 5 1 1 1 1 1
This way it is possible to use diff() with c() as a representation of the inverse of cumsum():
> cumsum(diff(c(0,x)))
#[1] 5 6 7 8 9 10
> diff(c(0,cumsum(x)))
#[1] 5 6 7 8 9 10
If you know the values of "lag" and "differences", you can also invert diff() with diffinv():
x <- 5:10
y <- diff(x, lag = 1, differences = 1)
z <- diffinv(y, lag = 1, differences = 1, xi = 5) # xi is the first value
k <- as.data.frame(cbind(x, z))
k
   x  z
1  5  5
2  6  6
3  7  7
4  8  8
5  9  9
6 10 10
