I am trying to create a function that builds a symbolic Hessian matrix which I can then evaluate at different points. First I tried storing expressions as entries of a matrix or array, but although I could initially put an expression into a matrix, I couldn't replace it with the calculated value.
hessian_matrix <- function(gx, respect_to){
out_mat <- matrix(0, nrow=length(respect_to), ncol=length(respect_to))
for(i in 1:length(respect_to)){
for(j in 1:length(respect_to)){
dthetad2x <- deriv(D(gx, respect_to[i]), respect_to[j], function.arg=TRUE)
# also tried
# dthetad2x <- as.expression(D(D(gx, respect_to[i])))
out_mat[i,j] <- dthetad2x
}
}
return(out_mat)
}
Because that didn't work, I decided to create an environment to hold the entries of the Hessian matrix as objects.
hessian_matrix <- function(gx, respect_to){
out_env <- new.env()
for(i in 1:length(respect_to)){
for(j in 1:length(respect_to)){
dthetad2x <- as.call(D(D(gx, respect_to[i]), respect_to[j]))
assign(paste0(i,j), dthetad2x, out_env)
}
}
return(out_env)
}
g <- expression(x^3-2*x*y-y^6)
h_g <- hessian_matrix(g, respect_to = c('x', 'y'))
This worked, and when I pass this in as a parameter I can see the expression, but I can't evaluate it. I tried call(), eval(), do.call(), get(), etc., and none of them worked. I also tried assigning the answer within the environment passed in, making a new environment to return, or simply using variables.
fisher_observed <- function(h, at_val, params, sum=TRUE){
out_env <- new.env()
# add params to passed environment
for(i in 1:length(at_val)){
h[[names(at_val)[i]]] <- unname(at_val[i])
}
for(i in ls(h)){
value <- do.call(i, envir=h, at_val)
assign(i, value, out_env)
}
return(h)
}
fisher_observed(h_g, at_val=list(x=1,y=2))
According to the code for do.call(), this is how it should be used, but it isn't working when passed as a parameter in this way.
R can already compute the Hessian matrix, so you do not have to write this yourself. You could use deriv (with hessian = TRUE) or deriv3 as shown below:
g <- expression(x^3 - 2 * x * y - y^6)
eval(deriv3(g, c('x','y')),list(x=1,y=2))
[1] -67
attr(,"gradient")
x y
[1,] -1 -194
attr(,"hessian")
, , x
x y
[1,] 6 -2
, , y
x y
[1,] -2 -480
If you want to use a function, you could do:
hessian <- function(expr,values){
nms <- names(values)
f <- eval(deriv3(expr, nms), as.list(values))  # use the argument `expr`, not the global `g`
matrix(attr(f, 'hessian'), length(values), dimnames = list(nms,nms))
}
hessian(g, c(x=1,y=2))
x y
x 6 -2
y -2 -480
That said, the function is not really necessary: if you wanted both the gradient and the Hessian, calling it in addition to deriv3 would do the computation twice.
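To illustrate the point about avoiding double computation, here is a minimal sketch that evaluates deriv3 once and reads both the gradient and the Hessian from the attributes of the result. The names vars, res, grad and hess are only illustrative.

```r
g <- expression(x^3 - 2 * x * y - y^6)
vars <- c("x", "y")

# One evaluation yields the value, the gradient and the Hessian together.
res <- eval(deriv3(g, vars), list(x = 1, y = 2))

# The gradient is stored as a 1 x 2 matrix; drop() turns it into a named vector.
grad <- drop(attr(res, "gradient"))

# The Hessian is stored as a 1 x 2 x 2 array; reshape it into a 2 x 2 matrix.
hess <- matrix(attr(res, "hessian"), length(vars), dimnames = list(vars, vars))

grad
hess
```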
I think this (almost) does what you're looking for:
fisher_observed <- function(h, at_val) {
values <- numeric(length = length(names(h)))
for (i in seq_len(length(names(h)))) {
values[i] = purrr::pmap(.l = at_val, function(x, y) eval(h[[names(h)[i]]]))
}
names(values) = names(h)
return(values)
}
This currently returns a named list of evaluated points:
$`21`
[1] -2
$`22`
[1] -480
$`11`
[1] 6
$`12`
[1] -2
You'd still need to rearrange this into a matrix (which should be fairly easy given that the names are preserved). I think the key thing is that the names must be characters when looking up values in h_g.
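The rearranging step could look roughly like the sketch below. It assumes the names are two single digits, row then column, as produced by paste0(i, j) in the question, so it would not work for more than nine variables. The list vals stands in for the output shown above.

```r
# Stand-in for the named list returned by fisher_observed().
vals <- list(`21` = -2, `22` = -480, `11` = 6, `12` = -2)

# Split each name into its row and column digit: "21" -> c(2, 1).
idx <- do.call(rbind, lapply(strsplit(names(vals), ""), as.integer))

# Fill an empty matrix using two-column matrix indexing.
H <- matrix(NA_real_, nrow = max(idx[, 1]), ncol = max(idx[, 2]))
H[idx] <- unlist(vals)
H
```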
You cannot have a matrix of "calls" but you can have a character matrix then evaluate it:
hessian_matrix <- function(gx, respect_to){
out_mat <- matrix("", nrow=length(respect_to), ncol=length(respect_to))
for(i in 1:length(respect_to)){
for(j in 1:length(respect_to)){
dthetad2x <- D(D(gx, respect_to[i]), respect_to[j])
out_mat[i,j] <- deparse(dthetad2x)
}
}
return(out_mat)
}
g <- expression(x^3-2*x*y-y^6)
h_g <- hessian_matrix(g, respect_to = c('x', 'y'))
h_g
#> [,1] [,2]
#> [1,] "3 * (2 * x)" "-2"
#> [2,] "-2" "-(6 * (5 * y^4))"
apply(h_g, 1:2, \(x) eval(str2lang(x), list(x=1, y=2)))
#> [,1] [,2]
#> [1,] 6 -2
#> [2,] -2 -480
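As an alternative sketch (not part of the answer above): an atomic matrix cannot hold calls, but a list with a dim attribute can, which avoids the round trip through character strings. The names vars, idx, h and H below are illustrative.

```r
g <- expression(x^3 - 2 * x * y - y^6)
vars <- c("x", "y")
n <- length(vars)

# All (i, j) index pairs; i varies fastest, which matches column-major order.
idx <- expand.grid(i = seq_len(n), j = seq_len(n))

# A list of second derivatives, given matrix shape via the dim attribute.
h <- Map(function(i, j) D(D(g[[1]], vars[i]), vars[j]), idx$i, idx$j)
dim(h) <- c(n, n)

# Evaluate every entry at a point and rebuild a numeric matrix.
vals <- vapply(h, eval, numeric(1), envir = list(x = 1, y = 2))
H <- matrix(vals, n, n, dimnames = list(vars, vars))
H
```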
I have two large sparse matrices (about 41,000 x 55,000 in size). The density of nonzero elements is around 10%. They both have the same row index and column index for nonzero elements.
I now want to modify the values in the first sparse matrix if values in the second matrix are below a certain threshold.
library(Matrix)
# Generating the example matrices.
set.seed(42)
# Rows with values.
i <- sample(1:41000, 227000000, replace = TRUE)
# Columns with values.
j <- sample(1:55000, 227000000, replace = TRUE)
# Values for the first matrix.
x1 <- runif(227000000)
# Values for the second matrix.
x2 <- sample(1:3, 227000000, replace = TRUE)
# Constructing the matrices.
m1 <- sparseMatrix(i = i, j = j, x = x1)
m2 <- sparseMatrix(i = i, j = j, x = x2)
I now get the rows, columns and values from the first matrix in a new matrix. This way, I can simply subset them and only the ones I am interested in remain.
# Getting the positions and values from the matrices.
position_matrix_from_m1 <- rbind(i = m1@i, j = summary(m1)$j, x = m1@x)
position_matrix_from_m2 <- rbind(i = m2@i, j = summary(m2)$j, x = m2@x)
# Subsetting to get the elements of interest.
position_matrix_from_m1 <- position_matrix_from_m1[,position_matrix_from_m1[3,] > 0 & position_matrix_from_m1[3,] < 0.05]
# We add 1 to the values, since the sparse matrix is 0-based.
position_matrix_from_m1[1,] <- position_matrix_from_m1[1,] + 1
position_matrix_from_m1[2,] <- position_matrix_from_m1[2,] + 1
Now I am getting into trouble. Overwriting the values in the second matrix takes too long. I let it run for several hours and it did not finish.
# This takes hours.
m2[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 1
m1[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 0
I thought about pasting the row and column information together. Then I have a unique identifier for each value. This also takes too long and is probably just very bad practice.
# We would get the unique identifiers after the subsetting.
m1_identifiers <- paste0(position_matrix_from_m1[1,], "_", position_matrix_from_m1[2,])
m2_identifiers <- paste0(position_matrix_from_m2[1,], "_", position_matrix_from_m2[2,])
# Now, I could use which and get the position of the values I want to change.
# This also uses too much memory.
m2_identifiers_of_interest <- which(m2_identifiers %in% m1_identifiers)
# Then I would modify the x values in the position_matrix_from_m2 matrix and overwrite m2@x in the sparse matrix object.
Is there a fundamental error in my approach? What should I do to run this efficiently?
Is there a fundamental error in my approach?
Yes. Here it is.
# This takes hours.
m2[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 1
m1[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 0
The syntax mat[rn, cn] (whether mat is a dense or a sparse matrix) selects all rows in rn and all columns in cn, so you get a length(rn) x length(cn) matrix. Here is a small example:
A <- matrix(1:9, 3, 3)
# [,1] [,2] [,3]
#[1,] 1 4 7
#[2,] 2 5 8
#[3,] 3 6 9
rn <- 1:2
cn <- 2:3
A[rn, cn]
# [,1] [,2]
#[1,] 4 7
#[2,] 5 8
What you intend to do is to select only the elements (rn[1], cn[1]), (rn[2], cn[2]), and so on. The correct syntax for that is mat[cbind(rn, cn)]. Here is a demo:
A[cbind(rn, cn)]
#[1] 4 8
So you need to fix your code to:
m2[cbind(position_matrix_from_m1[1,], position_matrix_from_m1[2,])] <- 1
m1[cbind(position_matrix_from_m1[1,], position_matrix_from_m1[2,])] <- 0
Oh wait... Based on your construction of position_matrix_from_m1, this is just
ij <- t(position_matrix_from_m1[1:2, ])
m2[ij] <- 1
m1[ij] <- 0
Now, let me explain how you can do better. You have underused summary(). It returns a 3-column data frame giving the (i, j, x) triplets, where both i and j are 1-based indices. You could have worked with this output directly, as follows:
# Getting (i, j, x) triplet (stored as a data.frame) for both `m1` and `m2`
position_matrix_from_m1 <- summary(m1)
# you never seem to use `position_matrix_from_m2` so I skip it
# Subsetting to get the elements of interest.
position_matrix_from_m1 <- subset(position_matrix_from_m1, x > 0 & x < 0.05)
Now you can do:
ij <- as.matrix(position_matrix_from_m1[, 1:2])
m2[ij] <- 1
m1[ij] <- 0
Is there an even better solution? Yes! Note that the nonzero elements in m1 and m2 are located in the same positions. So basically, you just need to change m2@x according to m1@x.
ind <- m1@x > 0 & m1@x < 0.05
m2@x[ind] <- 1
m1@x[ind] <- 0
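A small sketch of the slot-based approach on toy data follows. It rests on the assumption that both matrices share exactly the same sparsity pattern, which the check on the i and p slots verifies. It also shows one consequence not mentioned above: writing 0 into m1@x keeps the entries stored as explicit zeros, and drop0() from the Matrix package removes them if that matters.

```r
library(Matrix)

# Toy matrices with identical nonzero positions.
m1 <- sparseMatrix(i = c(1, 2, 3), j = c(1, 3, 2), x = c(0.01, 0.5, 0.03))
m2 <- sparseMatrix(i = c(1, 2, 3), j = c(1, 3, 2), x = c(3, 2, 3))

# The slot trick is only valid when the patterns match.
same_pattern <- identical(m1@i, m2@i) && identical(m1@p, m2@p)
stopifnot(same_pattern)

ind <- m1@x > 0 & m1@x < 0.05
m2@x[ind] <- 1
m1@x[ind] <- 0

# The zeros are still stored explicitly ...
n_stored_before <- length(m1@x)

# ... until they are dropped. After this, m1 and m2 no longer share a pattern.
m1 <- drop0(m1)
n_stored_after <- length(m1@x)
```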
A complete R session
I don't have enough RAM to create your large matrix, so I reduced your problem size a little bit for testing. Everything worked smoothly.
library(Matrix)
# Generating the example matrices.
set.seed(42)
## reduce problem size to what my laptop can bear with
squeeze <- 0.1
# Rows with values.
i <- sample(1:(41000 * squeeze), 227000000 * squeeze ^ 2, replace = TRUE)
# Columns with values.
j <- sample(1:(55000 * squeeze), 227000000 * squeeze ^ 2, replace = TRUE)
# Values for the first matrix.
x1 <- runif(227000000 * squeeze ^ 2)
# Values for the second matrix.
x2 <- sample(1:3, 227000000 * squeeze ^ 2, replace = TRUE)
# Constructing the matrices.
m1 <- sparseMatrix(i = i, j = j, x = x1)
m2 <- sparseMatrix(i = i, j = j, x = x2)
## give me more usable RAM
rm(i, j, x1, x2)
##
## fix to your code
##
m1a <- m1
m2a <- m2
# Getting (i, j, x) triplet (stored as a data.frame) for both `m1` and `m2`
position_matrix_from_m1 <- summary(m1)
# Subsetting to get the elements of interest.
position_matrix_from_m1 <- subset(position_matrix_from_m1, x > 0 & x < 0.05)
ij <- as.matrix(position_matrix_from_m1[, 1:2])
m2a[ij] <- 1
m1a[ij] <- 0
##
## the best solution
##
m1b <- m1
m2b <- m2
ind <- m1@x > 0 & m1@x < 0.05
m2b@x[ind] <- 1
m1b@x[ind] <- 0
##
## they are identical
##
all.equal(m1a, m1b)
#[1] TRUE
all.equal(m2a, m2b)
#[1] TRUE
Caveat:
I know that some people may propose
m1c <- m1
m2c <- m2
logi <- m1 > 0 & m1 < 0.05
m2c[logi] <- 1
m1c[logi] <- 0
It looks completely natural in R's syntax. But trust me, it is extremely slow for large matrices.
I need to calculate this
where x is a vector of length n and f is a function.
What is the most efficient calculation for this in R?
One method is a double for loop, but that is obviously slow.
One fast way to do it is the following:
Assume we have this vector:
x = c(0,1,2)
i.e. n = 3, and assume f is multiplication.
Now we use a custom function, expand.grid.unique, which produces the unique combinations of the vector's elements; in other words, it is similar to the base function expand.grid but returns each pair only once:
expand.grid.unique <- function(x, y, include.equals=FALSE)
{
x <- unique(x)
y <- unique(y)
g <- function(i)
{
z <- setdiff(y, x[seq_len(i-include.equals)])
if(length(z)) cbind(x[i], z, deparse.level=0)
}
do.call(rbind, lapply(seq_along(x), g))
}
In our vector case, when we call expand.grid.unique(x, x), it produces the following result:
> expand.grid.unique(x,x)
[,1] [,2]
[1,] 0 1
[2,] 0 2
[3,] 1 2
Let's assign two_by_two to it:
two_by_two <- expand.grid.unique(x,x)
Since our function is assumed to be multiplication, then we need to calculate sum-product, i.e. dot product of first and second columns of two_by_two. For this we need %*% operator:
output <- two_by_two[,1] %*% two_by_two[,2]
> output
[,1]
[1,] 2
See ?combn
x <- 0:2
combn(x, 2)
# unique combos
#     [,1] [,2] [,3]
#[1,] 0 0 1
#[2,] 1 2 2
sum(combn(x, 2))
#[1] 6
combn() creates all the unique combinations. If you have a function that you want to sum, you can add a FUN to the call:
random_f <- function(x){x[1] + 2 * x[2]}
combn(x, 2, FUN = random_f)
#[1] 2 4 5
sum(combn(x, 2, FUN = random_f))
#[1] 11
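If f takes two separate arguments rather than a length-2 vector, a small wrapper is enough. The sketch below assumes, as both answers do, that the quantity wanted is the sum of f over all unordered pairs of elements of x, and it uses multiplication as a stand-in for f. The outer() variant is a swapped-in alternative that only works when f is vectorised; it is included here as a cross-check.

```r
x <- 0:2
f <- function(a, b) a * b  # stand-in for the real f

# combn() passes each pair as a length-2 vector, so unpack it for f.
pair_sum <- sum(combn(x, 2, FUN = function(p) f(p[1], p[2])))

# Alternative for vectorised f: evaluate all pairs, keep the upper triangle.
all_pairs <- outer(x, x, f)
pair_sum_outer <- sum(all_pairs[upper.tri(all_pairs)])

pair_sum
pair_sum_outer
```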
Let's assume we have 4 vectors:
a <- c(200,204,209,215)
b <- c(215,220,235,245)
c <- c(230,236,242,250)
d <- c(240,242,243,267)
I want to create a loop that computes the difference between each pair of vectors and then calculates the Z scores of those differences, so something like scale(d-a). How do I create a loop that goes through scale(b-a), then scale(c-a), scale(d-a), etc.? Many thanks.
Single named variables don't lend themselves too well to "looping".
Let's use a list() of vectors instead:
vecs <- list(
a = c(200,204,209,215),
b = c(215,220,235,245),
c = c(230,236,242,250),
d = c(240,242,243,267)
)
This allows us to apply a function to all pairs using combn
scale_diff <- function(subset) {
z <- scale(subset[[1]] - subset[[2]])
colnames(z) <- paste(names(subset), collapse = " - ")
z
}
z_scores <- combn(vecs, 2, scale_diff, simplify = FALSE)
Now z_scores is a list of 6 matrices (column vectors). The column names show you which vectors were subtracted before scaling.
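Since the list returned by combn has no names of its own, one possible follow-up is to copy the column names onto the list so that individual pairs can be looked up by label. This is only a sketch that repeats the definitions above to stay self-contained.

```r
vecs <- list(
  a = c(200, 204, 209, 215),
  b = c(215, 220, 235, 245),
  c = c(230, 236, 242, 250),
  d = c(240, 242, 243, 267)
)

scale_diff <- function(subset) {
  z <- scale(subset[[1]] - subset[[2]])
  colnames(z) <- paste(names(subset), collapse = " - ")
  z
}

z_scores <- combn(vecs, 2, scale_diff, simplify = FALSE)

# Use each matrix's column name ("a - b", "a - c", ...) as the list name.
names(z_scores) <- vapply(z_scores, colnames, character(1))

z_scores[["a - b"]]
```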
We can place it in a list and use combn to get the combinations and then apply the difference
lst1 <- list(a = a, b = b, c = c, d = d)
out <- combn(lst1, 2, FUN = function(x) scale(Reduce(`-`, x))[,1])
colnames(out) <- combn(names(lst1), 2, FUN = paste, collapse='_')
out
# a_b a_c a_d b_c b_d c_d
#[1,] 0.9108601 1.2009612 0.1290994 -0.7643506 -0.753390 -0.2219686
#[2,] 0.7759179 0.2401922 0.3872983 -0.9441978 -0.360317 0.3699477
#[3,] -0.5735045 -0.2401922 0.9036961 0.6744270 1.474024 1.1098432
#[4,] -1.1132735 -1.2009612 -1.4200939 1.0341214 -0.360317 -1.2578222
As @AlexR mentioned in the comments, if the attributes are important, then remove [,1] and keep each result as a one-column matrix:
out <- combn(lst1, 2, FUN = function(x) scale(Reduce(`-`, x)), simplify = FALSE)
Sometimes I want to transform several data columns (usually character or factor) into one new column (usually a number). I try to do this using a lookup matrix. For example, my dataset is
dset <- data.frame(
x=c("a", "a", "b"),
y=c("v", "w", "w"),
stringsAsFactors=FALSE
)
lookup <- matrix(c(1:4), ncol=2)
rownames(lookup) <- c("a", "b")
colnames(lookup) <- c("v", "w")
Ideally (for my purpose here), I would now do
transform(dset, z=lookup[x,y])
and get my new data column. While this works in the one-dimensional case, this fails here, as lookup[x,y] returns a matrix. I came up with this function, which looks rather slow:
fill_from_matrix <- function(m, ...) {
arg <- list(...)
len <- sapply(arg, length)
if(sum(diff(len))!=0) stop("differing lengths in fill_from_matrix")
if(length(arg)!=length(dim(m))) stop("differing dimensions in fill_from_matrix")
n <- len[[1]]
dims <- length(dim(m))
res <- rep(NA, n)
for (i in seq(1,n)) {
one_arg <- list(m)
for (j in seq(1,dims)) one_arg[[j+1]] <- arg[[j]][[i]]
res[i] <- do.call("[", one_arg)
}
return(res)
}
With this function, I can call transform and get the result I wanted:
transform(dset, z=fill_from_matrix(lookup,x,y))
# x y z
# 1 a v 1
# 2 a w 3
# 3 b w 4
However, I am not satisfied with the code and wonder if there is a more elegant (and faster) way to perform this kind of transformation. How do I get rid of the for loops?
This is really quite easy and I suspect fast with base R indexing because the "[" function accepts a two-column matrix for this precise purpose:
> dset$z <- lookup[ with(dset, cbind(x,y)) ]
> dset
x y z
1 a v 1
2 a w 3
3 b w 4
If you needed it as a specific function then:
lkup <- function(tbl, rowidx, colidx){ tbl[ cbind(rowidx, colidx)]}
zvals <- lkup(lookup, dset$x, dset$y)
zvals
#[1] 1 3 4
(I'm pretty sure you can also use three and four column matrices if you have arrays of those dimensions.)
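The remark about higher dimensions can be checked with a small sketch: a character matrix with one column per dimension indexes an array that has dimnames. The array arr and its dimension names are made up for illustration.

```r
# A 2 x 2 x 2 array with names on every dimension.
arr <- array(
  1:8,
  dim = c(2, 2, 2),
  dimnames = list(c("a", "b"), c("v", "w"), c("p", "q"))
)

# Each row of idx picks one cell: ("a", "w", "q") and ("b", "v", "p").
idx <- cbind(c("a", "b"), c("w", "v"), c("q", "p"))

picked <- arr[idx]
picked
```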
You can use inner_join from the dplyr package, with a data.frame instead of a matrix as the lookup table:
library(dplyr)
lookup = transform(expand.grid(c('a','b'),c('v','w')), v=1:4) %>%
setNames(c('x','y','val'))
inner_join(dset, lookup, by=c('x','y'))
# x y val
#1 a v 1
#2 a w 3
#3 b w 4
Another fast way is to use the data.table package, with my definition of lookup:
library(data.table)
setDT(lookup)
setDT(dset)
setkey(lookup, x ,y)[dset]
# x y val
#1: a v 1
#2: a w 3
#3: b w 4
If for any reason you have your lookup as a matrix, transform it into a data frame first:
lookup = transform(expand.grid(rownames(lookup), colnames(lookup)), v=c(lookup))
names(lookup) = c('x','y','val')
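A base R sketch of the same idea, for readers who prefer not to load extra packages: as.data.frame(as.table(...)) reshapes the named matrix into long form, and merge() performs the join. The names long and res are illustrative. Note that merge() sorts the result by the key columns, so the row order of dset is not guaranteed to be preserved in general.

```r
dset <- data.frame(
  x = c("a", "a", "b"),
  y = c("v", "w", "w"),
  stringsAsFactors = FALSE
)
lookup <- matrix(1:4, ncol = 2, dimnames = list(c("a", "b"), c("v", "w")))

# Long form: one row per (row name, column name, value).
long <- as.data.frame(as.table(lookup), stringsAsFactors = FALSE)
names(long) <- c("x", "y", "val")

# Join on both key columns.
res <- merge(dset, long, by = c("x", "y"))
res
```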