Is the result of the which() function *always* ordered? - r

I want to be sure that the result of which(..., arr.ind = TRUE) is always ordered, specifically arranged in ascending order by (col, row). I do not see such a guarantee in the which() documentation, yet it does seem to hold in the experiments I have made. How can I check / learn whether this is the case?
Example: when I run the code below, the output is a matrix whose rows are arranged in ascending order by the (col, row) columns.
> set.seed(1)
> vals <- rnorm(10)
> valsall <- sample(as.numeric(replicate(10, vals)))
> mat <- matrix(valsall, 10, 10)
> which(mat == max(mat), arr.ind = TRUE)
      row col
 [1,]   1   1
 [2,]   3   1
 [3,]   1   2
 [4,]   2   2
 [5,]  10   2
 [6,]   1   6
 [7,]   2   8
 [8,]   4   8
 [9,]   1   9
[10,]   6   9

Part1:
This answers the part of your question about how to understand functions on a deeper level when the documentation is not enough, without going into the details of which() itself.
Since which() is not a primitive function (primitives are implemented directly in C) but is written using the basic building blocks of R, we can check what is going on behind the scenes by printing the function itself. Note that backticks let you inspect functions whose names are reserved or syntactically special, e.g. `+`; they are optional in this example. This dense R code can be extremely tiresome to read, but I have found it very educational, and it does untangle a mental knot every once in a while.
> print(`which`)
function (x, arr.ind = FALSE, useNames = TRUE)
{
    wh <- .Internal(which(x))
    if (arr.ind && !is.null(d <- dim(x)))
        arrayInd(wh, d, dimnames(x), useNames = useNames)
    else wh
}
<bytecode: 0x00000000058673e0>
<environment: namespace:base>
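For comparison, a primitive function prints only a .Primitive() stub, so there is no R-level body to read; the backticks are what make it possible to inspect operators at all. A quick illustration (not part of the original output):
> print(sum)       # a primitive: implemented in C, no R body to inspect
function (..., na.rm = FALSE)  .Primitive("sum")
> print(`+`)       # backticks are needed because + is a syntactic operator
function (e1, e2)  .Primitive("+")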
Part2:
So, after giving up on trying to understand which() and arrayInd() in the way described above, I will try common sense instead. The most efficient way to check each value of a matrix/array is, at some point, to treat it as a one-dimensional object. R stores matrices in column-major order, so coercing a matrix to an atomic vector (or any reduction of dimensions) concatenates the complete columns one after another; it therefore seems natural that higher-level functions follow the same fundamental rule.
> testmat <- matrix(1:10, nrow = 2, ncol = 5)
> testmat
[,1] [,2] [,3] [,4] [,5]
[1,] 1 3 5 7 9
[2,] 2 4 6 8 10
> as.numeric(testmat)
[1] 1 2 3 4 5 6 7 8 9 10
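Tying this back to the question: without arr.ind = TRUE, which() returns the linear (column-major) positions of the TRUE entries, and in the experiments above these come back in ascending order; arrayInd() then just translates each position into a (row, col) pair, which is why the matrix comes out ordered by col and then by row. A short sketch, reusing mat from the question:
# linear positions; they correspond to the rows shown above
# via (col - 1) * nrow(mat) + row, e.g. (1,1) -> 1, (3,1) -> 3, (1,2) -> 11
which(mat == max(mat))
# arrayInd() merely converts those ascending positions into (row, col) pairs
arrayInd(which(mat == max(mat)), dim(mat))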
I found Hadley Wickham's Advanced R an extremely valuable resource in answering your question, especially the chapters about functions and data structures.
http://adv-r.had.co.nz/

Related

How to write an apply() function to limit each element in a matrix column to a maximum allowable value?

I'm trying to learn how to use the apply() functions.
Suppose we have a 3-row, 2-column matrix, test <- matrix(c(1,2,3,4,5,6), ncol = 2), and we would like each element in the first column (1, 2, 3) to not exceed 2, for example, so that we end up with a matrix containing (1, 2, 2, 4, 5, 6).
How would one write an apply() function to do this?
Here's my latest attempt: test1 <- apply(test[,1], 2, function(x) {if(x > 2){return(x = 2)} else {return(x)}})
We may use pmin() on the first column, with the value 2 as the second argument, so that it does elementwise checking against the recycled 2 and returns the minimum for each value of the first column:
test[,1] <- pmin(test[,1], 2)
Output:
> test
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 2 6
Note that apply() needs its 'X' argument to be an array/matrix, i.e. an object with dimensions; when we subset a single column/row, the dimensions are dropped because drop = TRUE by default.
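A quick sketch of that behaviour (an illustration added here, using the test matrix from the question):
test <- matrix(c(1, 2, 3, 4, 5, 6), ncol = 2)
dim(test[, 1])                  # NULL: subsetting dropped the dimensions
dim(test[, 1, drop = FALSE])    # 3 1 : still a one-column matrix
# with drop = FALSE, apply() over the column works as intended
apply(test[, 1, drop = FALSE], 2, function(x) pmin(x, 2))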
If you really want to use the apply() function, I guess you're looking for something like this:
t(apply(test, 1, function(x) c(min(x[1], 2), x[2])))
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 2 6
But if you want my opinion, akrun's suggestion is definitely better.

R : confusion regarding LHS and RHS of assignment and order of operation

I am having some fundamental confusion with R. I have a snippet of R code.
> m <- 1:10
> m
[1] 1 2 3 4 5 6 7 8 9 10
> dim(m) <- c(2,5)
> m
[,1] [,2] [,3] [,4] [,5]
[1,] 1 3 5 7 9
[2,] 2 4 6 8 10
Now, I am a C/Python programmer, and the line dim(m) <- c(2,5) is incredibly confusing to me. I realize that it effectively changed a vector into a matrix; however, looking at it, I do not understand the logic / order of operations.
<- is the assignment operator in R. So to me, logically, the order of operations is: assign (2,5) to the output of dim(m). Since the output of dim(m) isn't assigned to a variable, that output would be lost.
Could someone explain how I should read the line dim(m) <- c(2,5)? What is the order of operations? It seems that the order of operations with <- changes depending on the LHS and RHS of the assignment.
These are special functions called Replacement Functions. I quote from Hadley's Advanced-R book:
Replacement functions act like they modify their arguments in place, and have the special name xxx<-. They typically have two arguments (x and value), although they can have more, and they must return the modified object. For example, the following function allows you to modify the second element of a vector:
`second<-` <- function(x, value) {
    x[2] <- value
    x
}
x <- 1:10
second(x) <- 5L
x
#> [1] 1 5 3 4 5 6 7 8 9 10
When R evaluates the assignment second(x) <- 5, it notices that the left hand side of the <- is not a simple name, so it looks for a function named second<- to do the replacement.
You can check the full chapter of Advanced R under the 'Replacement functions' heading.
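To connect this back to the original line: dim(m) <- c(2,5) is effectively shorthand for calling the replacement function `dim<-` and assigning its result back to m. A minimal sketch of the equivalence:
m <- 1:10
# what R effectively does for dim(m) <- c(2, 5):
m <- `dim<-`(m, c(2, 5))
m
#      [,1] [,2] [,3] [,4] [,5]
# [1,]    1    3    5    7    9
# [2,]    2    4    6    8   10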

Issue while executing drop() function in R

I am trying to find out the usage of the drop() function. I read in the documentation that a matrix or array can be the input object for the function; however, the size of my matrix or object does not change. Can someone explain its actual usage and how it works?
I am using R version 3.2.1. Code snippet:
data1 <- matrix(data=(1:10),nrow=1,ncol=1)
drop(data1)
R has factors, which are very cool (and somewhat analogous to labeled levels in Stata). Unfortunately, the full set of levels sticks around even if you remove data such that no examples of a particular level still exist.
# Create some fake data
x <- as.factor(sample(head(colors()),100,replace=TRUE))
levels(x)
x <- x[x!="aliceblue"]
levels(x) # still the same levels
table(x) # even though one level has 0 entries!
The solution is simple: run factor() again:
x <- factor(x)
levels(x)
If you need to do this on many factors at once (as is the case with a data.frame containing several columns of factors), use drop.levels() from the gdata package:
x <- x[x!="antiquewhite1"]
df <- data.frame(a=x,b=x,c=x)
df <- drop.levels(df)
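Note that recent versions of base R also provide droplevels(), which does the same thing for a single factor or for every factor column of a data frame, so the extra package may not be needed. A minimal sketch, assuming x and df from above:
x <- droplevels(x)     # drop unused levels of a single factor
df <- droplevels(df)   # drop unused levels in all factor columns
levels(df$a)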
An R matrix is a two-dimensional array. R has a lot of operators and functions that make matrix handling very convenient.
Matrix assignment:
>A <- matrix(c(3,5,7,1,9,4),nrow=3,ncol=2,byrow=TRUE)
>A
[,1] [,2]
[1,] 3 5
[2,] 7 1
[3,] 9 4
Matrix row and column count:
>rA <- nrow(A)
>rA
[1] 3
>cA <- ncol(A)
>cA
[1] 2
The t(A) function returns the transpose of A:
>B <- t(A)
>B
[,1] [,2] [,3]
[1,] 3 7 9
[2,] 5 1 4
Element-wise multiplication (note that * multiplies element by element; true matrix multiplication uses the %*% operator, illustrated after the output below):
>C <- A * A
>C
[,1] [,2]
[1,] 9 25
[2,] 49 1
[3,] 81 16
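For comparison, a short illustration (added here, not from the original answer) of actual matrix multiplication with %*%, multiplying A by its transpose:
>A %*% t(A)
     [,1] [,2] [,3]
[1,]   34   26   47
[2,]   26   50   67
[3,]   47   67   97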
Matrix Addition:
>C <- A + A
>C
[,1] [,2]
[1,] 6 10
[2,] 14 2
[3,] 18 8
Matrix subtraction (-) and division (/) work element-wise in the same way.
Sometimes a matrix needs to be sorted by a specific column, which can be done with the order() function.
Here is an example CSV file (sortmatrix.csv):
,t1,t2,t3,t4,t5,t6,t7,t8
r1,1,0,1,0,0,1,0,2
r2,1,2,5,1,2,1,2,1
r3,0,0,9,2,1,1,0,1
r4,0,0,2,1,2,0,0,0
r5,0,2,15,1,1,0,0,0
r6,2,2,3,1,1,1,0,0
r7,2,2,3,1,1,1,0,1
The following R code reads the file above into a data frame (read.csv() returns a data frame rather than a matrix), sorts it by column 4, and writes the result to an output file:
x <- read.csv("sortmatrix.csv", header = TRUE, sep = ",")
x <- x[order(x[, 4]), ]
write.table(x, file = "tp.txt", sep = ",")
The result is:
"X","t1","t2","t3","t4","t5","t6","t7","t8"
"1","r1",1,0,1,0,0,1,0,2
"4","r4",0,0,2,1,2,0,0,0
"6","r6",2,2,3,1,1,1,0,0
"7","r7",2,2,3,1,1,1,0,1
"2","r2",1,2,5,1,2,1,2,1
"3","r3",0,0,9,2,1,1,0,1
"5","r5",0,2,15,1,1,0,0,0
(Note: the following describes the Transact-SQL DROP FUNCTION statement in SQL Server, not R's drop().) DROP FUNCTION removes one or more user-defined functions from the current database and supports natively compiled, scalar user-defined functions.
To execute DROP FUNCTION, a user must at a minimum have ALTER permission on the schema to which the function belongs, or CONTROL permission on the function.
DROP FUNCTION will fail if there are Transact-SQL functions or views in the database that reference this function and were created using SCHEMA BINDING, or if there are computed columns, CHECK constraints, or DEFAULT constraints that reference the function. It will also fail if there are indexed computed columns that reference the function.
Syntax:
DROP FUNCTION { [ schema_name. ] function_name } [ ,...n ]

Parallelize a nested for loop in R

I have a dataframe with DNA barcodes in its rownames, and I would like to determine the difference (e.g. the Levenshtein distance) between these barcodes. The values in the dataframe need to be processed later in the analysis. I've worked out an example that uses a slightly simplified comparison, just checking the individual bases (A, T, G, C) after a strsplit(), and puts the results in a matrix:
results <- matrix(data = NA, nrow = dim(dat)[1], ncol = dim(dat)[1])
# Do the string splitting and comparison of the barcodes one by one.
system.time(
  for (i in 1:dim(dat)[1]) {
    for (j in 1:dim(dat)[1]) {
      results[i, j] <- sum(unlist(strsplit(rownames(dat)[i], split = "")) !=
                           unlist(strsplit(rownames(dat)[j], split = "")))
    }
  }
)
This all works as expected but of course is embarrassingly parallel. To save some time and put our university cluster to good use, I would like to try to parallelize this function, but I'm having trouble getting it right. Hints would be appreciated!
Parallelisation should be the last step in optimising your code, after you've implemented the easier steps that should include:
Vectorisation
Using built-in high performance functions
In your case, you should use adist() to compute the Levenshtein distance.
# Function to simulate barcodes of given length
g <- function(n)paste(sample(c("G", "A", "C", "T"), size=n, replace=TRUE), collapse="")
# Replicate data
barcodes <- replicate(5, g(n=4))
Then use adist():
barcodes
[1] "CTAA" "AGGC" "CACT" "GGCG" "TTGA"
adist(barcodes, barcodes)
[,1] [,2] [,3] [,4] [,5]
[1,] 0 4 3 4 2
[2,] 4 0 4 2 3
[3,] 3 4 0 3 4
[4,] 4 2 3 0 4
[5,] 2 3 4 4 0
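If a single core is still too slow for a large set of barcodes even with adist(), one possible sketch (an addition here, assuming the base parallel package and a local cluster of 4 workers) is to compute blocks of rows on separate workers and bind them back together:
library(parallel)

# simulate a larger set of barcodes, reusing g() from above
barcodes <- replicate(2000, g(n = 12))

cl <- makeCluster(4)
# split the barcodes into 4 chunks; each worker computes its block of rows
chunks <- split(barcodes, cut(seq_along(barcodes), 4, labels = FALSE))
blocks <- parLapply(cl, chunks, function(chunk, all) adist(chunk, all),
                    all = barcodes)
stopCluster(cl)

dist_mat <- do.call(rbind, blocks)   # full 2000 x 2000 distance matrix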

rollapply variation - growing window functions

How would one use rollapply (or some other R function) to grow the window size as the function progresses through the data? To phrase it another way: the first apply works with the first element, the second with the first two elements, the third with the first three elements, etc.
If you are looking to apply min, max, sum or prod, these functions already have their cumulative counterparts:
cummin, cummax, cumsum and cumprod
To apply more exotic functions on a growing / expanding window, you can simply use sapply
e.g.
# your vector of interest
x <- c(1,2,3,4,5)
sapply(seq_along(x), function(y,n) yourfunction(y[seq_len(n)]), y = x)
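For example, with median() as an arbitrary choice of function (a small illustration added here):
x <- c(1, 2, 3, 4, 5)
sapply(seq_along(x), function(n, y) median(y[seq_len(n)]), y = x)
# [1] 1.0 1.5 2.0 2.5 3.0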
For a basic zoo object
x.Date <- as.Date("2003-02-01") + c(1, 3, 7, 9, 14) - 1
x <- zoo(rnorm(5), x.Date)
# cumsum etc will work and return a zoo object
cs.zoo <- cumsum(x)
# convert back to zoo for the `sapply` solution
# here `sum`
foo.zoo <- zoo(sapply(seq_along(x), function(n,y) sum(y[seq_len(n)]), y= x), index(x))
identical(cs.zoo, foo.zoo)
## [1] TRUE
From peering at the documentation at ?rollapply, I think this will do what you want, where a is your matrix and sum can be any function:
a <- cbind(1:5,1:5)
# [,1] [,2]
# [1,] 1 1
# [2,] 2 2
# [3,] 3 3
# [4,] 4 4
# [5,] 5 5
rollapply(a,width=seq_len(nrow(a)),sum,align="right")
# [,1] [,2]
# [1,] 1 1
# [2,] 3 3
# [3,] 6 6
# [4,] 10 10
# [5,] 15 15
But mnel's answer seems sufficient and more generalizable.
In addition to @mnel's answer:
For more exotic functions you can simply use sapply, and if the sapply approach takes too long, you may be better off formulating your function iteratively, as sketched below.
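For instance, a minimal sketch of an iterative (single-pass) expanding mean that updates a running sum instead of recomputing the whole window at every step:
x <- c(1, 2, 3, 4, 5)
run_mean <- numeric(length(x))
total <- 0
for (i in seq_along(x)) {
  total <- total + x[i]       # update the running sum in O(1)
  run_mean[i] <- total / i    # expanding-window mean up to element i
}
run_mean
# [1] 1.0 1.5 2.0 2.5 3.0
# same result as: sapply(seq_along(x), function(n) mean(x[seq_len(n)]))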
