Is there a simple way to demean a sparse matrix by columns while considering zero-values as missing (using Matrix package)?
There seem to be two problems I struggle with:
Finding proper column means
Empty cells are considered zero rather than missing:
M0 <- matrix(rep(1:5,4),nrow = 4)
M0[2,2] <- M0[2,3] <- 0
M <- as(M0, "sparseMatrix")
M
#[1,] 1 5 4 3 2
#[2,] 2 . . 4 3
#[3,] 3 2 1 5 4
#[4,] 4 3 2 1 5
colMeans(M)
#[1] 2.50 2.50 1.75 3.25 3.50
Correct result should be:
colMeans_correct <- colSums(M) / c(4,3,3,4,4)
colMeans_correct
#[1] 2.500000 3.333333 2.333333 3.250000 3.500000
Subtract column mean
Subtraction is performed also on the missing cells:
sweep(M, 2, colMeans_correct)
#4 x 5 Matrix of class "dgeMatrix"
# [,1] [,2] [,3] [,4] [,5]
#[1,] -1.5 1.6666667 1.6666667 -0.25 -1.5
#[2,] -0.5 -3.3333333 -2.3333333 0.75 -0.5
#[3,] 0.5 -1.3333333 -1.3333333 1.75 0.5
#[4,] 1.5 -0.3333333 -0.3333333 -2.25 1.5
P.S. hope it is not a problem posting a question composed of two problems. They are connected to the same task and seem to reflect the same problem - distinguish between missing and actual zero values.
One option is to divide the colSums by the colSums of the non-zero logical matrix
colSums(M)/colSums(M!=0)
#[1] 2.500000 3.333333 2.333333 3.250000 3.500000
Or another option is to replace the 0 with NA and get the colMeans with na.rm = TRUE argument
colMeans(M*NA^!M, na.rm = TRUE)
#[1] 2.500000 3.333333 2.333333 3.250000 3.500000
Or, as @user20650 commented:
colSums(M) / diff(M@p)
#[1] 2.500000 3.333333 2.333333 3.250000 3.500000
where p is the column-pointer slot described in ?sparseMatrix, so diff(M@p) gives the number of stored (non-zero) entries in each column:
In typical usage, p is missing, i and j are vectors of positive
integers and x is a numeric vector. These three vectors, which must
have the same length, form the triplet representation of the sparse
matrix.
If i or j is missing then p must be a non-decreasing integer vector
whose first element is zero. It provides the compressed, or “pointer”
representation of the row or column indices, whichever is missing. The
expanded form of p, rep(seq_along(dp),dp) where dp <- diff(p), is used
as the (1-based) row or column indices.
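For the second part of the question (subtracting the column mean only from the stored, non-zero cells), the same pointer can be reused. This is a minimal sketch, assuming M is a column-compressed dgCMatrix with no explicitly stored zeros, so that diff(M@p) counts the non-missing entries per column; the names nnz_col, cm and D are illustrative only.

```r
library(Matrix)

M0 <- matrix(rep(1:5, 4), nrow = 4)
M0[2, 2] <- M0[2, 3] <- 0
M <- as(M0, "CsparseMatrix")

nnz_col <- diff(M@p)          # stored entries per column
cm <- colSums(M) / nnz_col    # column means over non-zero cells only

D <- M
# The x slot holds the stored values column by column, so repeating each
# mean nnz_col times lines it up with the entries of its own column
D@x <- M@x - rep(cm, nnz_col)
D
```

Cells that were empty in M stay empty in D. An entry that happens to equal its column mean is kept as a stored zero rather than dropped.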
Related
First I create a 5x4 matrix with random numbers from 1 to 10:
A <- matrix(sample(1:10, 20, TRUE), 5, 4)
> A
[,1] [,2] [,3] [,4]
[1,] 1 5 6 6
[2,] 5 9 9 4
[3,] 10 6 1 8
[4,] 4 4 10 2
[5,] 10 9 7 5
In the following step I would like to obtain the returns by row (for row 1: (5-1)/1, (6-5)/5, (6-6)/6 and the same procedure for the other rows). The final matrix should therefore be a 5x3 matrix.
You can make use of the base R function diff() applied to your transposed matrix:
Code:
# Data
set.seed(1)
A <- matrix(sample(1:10, 20, TRUE), 5, 4)
# [,1] [,2] [,3] [,4]
#[1,] 9 7 5 9
#[2,] 4 2 10 5
#[3,] 7 3 6 5
#[4,] 1 1 10 9
#[5,] 2 5 7 9
# transpose so we get per row and not column returns
t(diff(t(A))) / A[, -ncol(A)]
[,1] [,2] [,3]
[1,] -0.2222222 -0.2857143 0.8000000
[2,] -0.5000000 4.0000000 -0.5000000
[3,] -0.5714286 1.0000000 -0.1666667
[4,] 0.0000000 9.0000000 -0.1000000
[5,] 1.5000000 0.4000000 0.2857143
A <- matrix(sample(1:10, 20, TRUE), 5, 4)
fn.Calc <- function(a,b){(a-b)/b} # (new - old) / old, e.g. (5-1)/1
B <- matrix(NA, nrow(A), ncol(A)-1)
for (ir in 1:nrow(B)){
for (ic in 1:ncol(B)){
B[ir, ic] <- fn.Calc(A[ir, ic+1], A[ir, ic])
}
}
Small note: when working with random functions, providing a seed is welcome ;)
So what we have here:
fn.Calc is just the calculation you are trying to do; I've isolated it in a function so that it's easier to change if needed.
Then a new matrix B is created with one column fewer than A but the same number of rows.
Finally we loop over every element of B. I like to use ir for the row index and ic for the column index. Inside the loop, B[ir, ic] <- fn.Calc(A[ir, ic+1], A[ir, ic]) is where the actual values are calculated and stored in B.
It's a very basic approach that needs no packages; there are probably many other ways to solve this with less code.
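As a quick cross-check, the element-wise loop and a fully vectorized column-shift give the same returns. This sketch uses the matrix printed in the question and divides by the earlier value, as in the question's (5-1)/1 example; the names B and V are illustrative.

```r
# Matrix from the question, entered explicitly so the result is reproducible
A <- rbind(c(1, 5, 6, 6),
           c(5, 9, 9, 4),
           c(10, 6, 1, 8),
           c(4, 4, 10, 2),
           c(10, 9, 7, 5))

# Loop version: (next column - current column) / current column
B <- matrix(NA_real_, nrow(A), ncol(A) - 1)
for (ir in seq_len(nrow(B))) {
  for (ic in seq_len(ncol(B))) {
    B[ir, ic] <- (A[ir, ic + 1] - A[ir, ic]) / A[ir, ic]
  }
}

# Vectorized version: divide columns 2..n by columns 1..(n-1)
V <- A[, -1] / A[, -ncol(A)] - 1

all.equal(B, V)
V[1, ]
# [1] 4.0 0.2 0.0
```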
Good day,
I have a feature co-occurrence matrix (fcm, from the quanteda package in R) of dimensions 94966 x 94966 (named fcm2, for illustration). I am able to select rows (class: fcm object) by the feature name or row number, e.g.:
a1 <- fcm2[1,]
and perform a normalization calculation on that particular row:
a2 <- a1/(max(a1)-min(a1))
My objective is to normalize each row in my original fcm. The strategy I attempted was to initialize an empty matrix and then use a for loop to go through the rows and perform the calculation. The initialization fails because of memory issues (Windows 10, 12 Gb RAM, R version 3.4.4):
fcm3 <- matrix(data = NA, nrow = 94966, ncol = 94966)
Error: cannot allocate vector of size 33.6 Gb
I had been able to accomplish the normalization by using a dataframe structure, but there is not enough memory to store the entire fcm2 as a dataframe:
Step 1: Extract a "sub-matrix" based on a list of keywords, convert to dataframe, drop unneeded columns
m <- fcm2[keywords(),]
df_m1 <- as.data.frame(m)
df_m1 <- subset(df_m1, select = -c(document, rt))
Step 2: Normalization
k <- 0 # initialize counter
df2 <- data.frame() # initialize
n4 <- nrow(df_m1) # count rows of the extracted sub-matrix as df (df_m1)
for(k in 1:n4){
a1 <- df_m1[k,] # store the k-th row
max_k <- max(a1)
min_k <- min(a1)
a2 <- a1/(max_k-min_k) # normalize so max is 1, 0s are still 0s
df2 <- rbind(df2, a2) # append normalized results into a row of a data.frame
}
Is there a more efficient way to normalize each row for the entire fcm?
Kind thanks!
You can write a function:
norm=function(mat){
mx=mat[cbind(1:nrow(mat),max.col(mat))]
mn=mat[cbind(1:nrow(mat),max.col(-mat))]
mat/(mx-mn)
}
And then use it.
Example
set.seed(1)
mat1=matrix(sample(20),5)
mat1
[,1] [,2] [,3] [,4]
[1,] 6 14 3 7 #max is 14, min is 3 thus divide by 11
[2,] 8 15 2 12
[3,] 11 9 20 17
[4,] 16 19 10 18
[5,] 4 1 5 13
norm(mat1)
[,1] [,2] [,3] [,4]
[1,] 0.5454545 1.27272727 0.2727273 0.6363636
[2,] 0.6153846 1.15384615 0.1538462 0.9230769
[3,] 1.0000000 0.81818182 1.8181818 1.5454545
[4,] 1.7777778 2.11111111 1.1111111 2.0000000
[5,] 0.3333333 0.08333333 0.4166667 1.0833333
You can decide to print out in fraction form to see whether the results do represent what was needed:
MASS::fractions(norm(mat1))
[,1] [,2] [,3] [,4]
[1,] 6/11 14/11 3/11 7/11
[2,] 8/13 15/13 2/13 12/13
[3,] 1 9/11 20/11 17/11
[4,] 16/9 19/9 10/9 2
[5,] 1/3 1/12 5/12 13/12
I understand that the OP is constrained by memory and cannot allocate a second copy of that big matrix.
If memory permits, a solution can be:
mat1 = t(apply(mat1, 1, function(x) x/(max(x)-min(x))))
Under the memory constraint, one option is to write a function that normalises a vector and apply it to each row in a for-loop, overwriting the matrix row by row. That should be reasonably efficient in this scenario.
# Function to normalize a vector
normalise <- function(x){
x/(max(x)-min(x))
}
#Apply over all rows of matrix
for(i in 1:nrow(mat1)){
mat1[i,] = normalise(mat1[i,])
}
mat1
# [,1] [,2] [,3] [,4]
# [1,] 0.5454545 1.27272727 0.2727273 0.6363636
# [2,] 0.6153846 1.15384615 0.1538462 0.9230769
# [3,] 1.0000000 0.81818182 1.8181818 1.5454545
# [4,] 1.7777778 2.11111111 1.1111111 2.0000000
# [5,] 0.3333333 0.08333333 0.4166667 1.0833333
Data: As used by @Onyambu
# Data
set.seed(1)
mat1=matrix(sample(20),5)
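When memory does permit a dense copy, the per-row range can also be computed once and divided out without transposing. This is a small sketch on the first two rows of the example above; it relies on R recycling a vector of length nrow down the columns, so element [i, j] is divided by the i-th range. The name row_range is illustrative.

```r
mat1 <- rbind(c(6, 14, 3, 7),
              c(8, 15, 2, 12))

# One (max - min) value per row
row_range <- apply(mat1, 1, function(r) max(r) - min(r))

# A length-nrow vector is recycled column-wise, so each row is divided
# by its own range
normed <- mat1 / row_range
normed
```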
The most efficient way is to operate on the sparse values of the fcm object directly, avoiding any transformation into a dense object such as a matrix or data.frame. This is how the dfm and fcm manipulation and computation functions are defined in quanteda, and why they are able to execute quickly and within limited memory.
To define such a function for your type of normalisation, you could use the following function, which I have demonstrated here on a simple fcm.
library("quanteda")
library("Matrix")
myfcm <- fcm(data_char_sampletext, window = 5)
myfcm
## Feature co-occurrence matrix of: 244 by 244 features.
Now we define a function that (for convenience) transforms the fcm into a sparse triplet representation (the dgTMatrix class) and extracts the non-zero values using split(). Each element of the resulting list will represent a row of your fcm, but only for non-zero values. (Because of this, we also have to return a zero for empty rows.)
fcm_customnorm <- function(x) {
  x <- as(x, "dgTMatrix")
  split_x <- split(x@x, x@i)
  norm_x <- lapply(split_x, function(y) {
    result <- y/(max(y) - min(y))
    # transform any divisions by zero into zero
    result[is.nan(result)] <- 0
    result
  })
  x@x <- unlist(norm_x, use.names = FALSE)
  quanteda:::as.fcm(x)
}
Applying this on a subset, we see that it works:
myfcm[1:5, 1:5]
## Feature co-occurrence matrix of: 5 by 5 features.
## 5 x 5 sparse Matrix of class "fcm"
## features
## features Instead we have a Fine
## Instead 0 5 1 4 1
## we 0 10 5 20 5
## have 0 0 0 4 1
## a 0 0 0 6 4
## Fine 0 0 0 0 0
fcm_customnorm(myfcm[1:5, 1:5])
## Feature co-occurrence matrix of: 5 by 5 features.
## 5 x 5 sparse Matrix of class "fcm"
## features
## features Instead we have a Fine
## Instead 0 1.0 0.8000000 0.3333333 1.00
## we 0 0.2 0.2000000 1.3333333 0.25
## have 0 0 0.6666667 0.3333333 3.00
## a 0 0 0 0.0000000 2.00
## Fine 0 0 0 0 0.00
Another option would have been to extract the simple triplet representation to a data.table (from the data.table package) and then perform your computations using the grouping functions and :=. But this approach is simpler and gets your desired result, which is a normalised fcm.
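The same split-by-row idea can be sketched with the Matrix package alone, which makes it easy to check on a tiny example. This is a hedged illustration rather than quanteda's own method: row_range_norm is a hypothetical helper, it assumes the matrix has at least one stored entry, and it maps rows with zero range to 0 by assumption. It uses unsplit() rather than unlist() so that each normalised value goes back to the position of the entry it came from, whatever order the triplets are stored in.

```r
library(Matrix)

# Hypothetical helper: divide the stored values of each row by that row's
# (max - min), computed over stored (non-zero) values only
row_range_norm <- function(x) {
  x <- as(x, "TsparseMatrix")     # triplet form with slots i, j, x
  by_row <- split(x@x, x@i)       # stored values grouped by (0-based) row
  normed <- lapply(by_row, function(y) {
    out <- y / (max(y) - min(y))
    out[!is.finite(out)] <- 0     # assumption: zero-range rows map to 0
    out
  })
  x@x <- unsplit(normed, x@i)     # write back in the original triplet order
  x
}

m <- sparseMatrix(i = c(1, 1, 2, 2, 2), j = c(2, 3, 1, 2, 3),
                  x = c(5, 1, 10, 5, 20), dims = c(3, 3))
row_range_norm(m)
```

Row 1 holds 5 and 1, so its values are divided by 4; row 2 holds 10, 5 and 20, so its values are divided by 15; the empty row 3 is left untouched.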
I'm trying to figure out a way to find the second maximum position for each row of a matrix, just like max.col function does, but for the second highest value.
Any ideas?
Thanks!
You could try to replace the maximum value with -Inf and then take max.col a second time:
set.seed(1)
mat <- matrix(sample(1:15), nrow = 5)
#mat
# [,1] [,2] [,3]
#[1,] 4 9 2
#[2,] 6 10 13
#[3,] 8 14 12
#[4,] 11 5 15
#[5,] 3 1 7
max.col(replace(mat, cbind(1:5, max.col(mat)), -Inf))
#[1] 1 2 3 1 1
Using a second max.col is handy because if you had ties, you can still use max.col to handle them appropriately.
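If a general k-th largest position is needed, a row-wise order() is one possible alternative. This is a sketch only: kth_max_col is a hypothetical helper name, and unlike max.col (which breaks ties at random by default) order() resolves ties by position, so the first tied column wins.

```r
# The example matrix from above, entered explicitly
mat <- rbind(c(4, 9, 2),
             c(6, 10, 13),
             c(8, 14, 12),
             c(11, 5, 15),
             c(3, 1, 7))

# Column index of the k-th largest value in each row (k = 2 by default)
kth_max_col <- function(m, k = 2) {
  apply(m, 1, function(r) order(r, decreasing = TRUE)[k])
}

kth_max_col(mat)
# [1] 1 2 3 1 1
```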
I am searching for a solution that implements the following simple growth-rate formula by applying vectorization in R:
gr <- function(x){
  a <- matrix(NA_real_, nrow=nrow(x), ncol=ncol(x))
  for (j in 1:ncol(x)){
    for (i in 2:nrow(x)){
      if (!is.na(x[i,j]) && !is.na(x[i-1,j]) && x[i-1,j] != 0){
        a[i,j] <- x[i,j]/x[i-1,j]-1
      }
    }
  }
  return(a)
}
I found the xts package to generate lags of time-series, but in the end I always ended up having to compare too many values (see above), so I cannot simply use ifelse. One possible problem is when the time-series (e.g. a price index) has zeros in between. This would create NaNs in the result, which I am trying to avoid and which cannot simply be removed afterwards (edit: apparently they can, see the answers below!)
In short: I'd like to produce a table of correct growth rates for a given table of values. Here is an example:
m <- matrix(c(1:3,NA,2.4,2.8,3.9,0,1,3,0,2,1.3,2,NA,7,3.9,2.4),6,3)
generates:
[,1] [,2] [,3]
[1,] 1.0 3.9 1.3
[2,] 2.0 0.0 2.0
[3,] 3.0 1.0 NA
[4,] NA 3.0 7.0
[5,] 2.4 0.0 3.9
[6,] 2.8 2.0 2.4
correct result, produced by gr(m):
[,1] [,2] [,3]
[1,] NA NA NA
[2,] 1.0000000 -1 0.5384615
[3,] 0.5000000 NA NA
[4,] NA 2 NA
[5,] NA -1 -0.4428571
[6,] 0.1666667 NA -0.3846154
But this takes forever with large tables. Is there any way to use conditions without looping so extensively?
You can speed this up by performing the entire calculation in a single vectorized operation (with one additional operation to fix up the results whenever you divide by 0):
out <- rbind(NA, tail(m, -1) / head(m, -1) - 1)
out[!is.finite(out)] <- NA
out
# [,1] [,2] [,3]
# [1,] NA NA NA
# [2,] 1.0000000 -1 0.5384615
# [3,] 0.5000000 NA NA
# [4,] NA 2 NA
# [5,] NA -1 -0.4428571
# [6,] 0.1666667 NA -0.3846154
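This is much faster than a looping solution, as demonstrated on a 10000 x 1000 example:
# josilber() is assumed here to wrap the two vectorized lines above
josilber <- function(m) {
  out <- rbind(NA, tail(m, -1) / head(m, -1) - 1)
  out[!is.finite(out)] <- NA
  out
}
set.seed(144)
m <- matrix(rnorm(10000000), 10000, 1000)
system.time(j <- josilber(m))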
# user system elapsed
# 1.425 0.030 1.446
system.time(g <- gr(m))
# user system elapsed
# 34.551 0.263 36.581
The vectorized solution provides a 25x speedup.
Here are a couple of ways:
1) no packages
rbind(NA, exp(diff(log(m)))-1)
giving:
[,1] [,2] [,3]
[1,] NA NA NA
[2,] 1.0000000 -1 0.5384615
[3,] 0.5000000 Inf NA
[4,] NA 2 NA
[5,] NA -1 -0.4428571
[6,] 0.1666667 Inf -0.3846154
If it's not important to have a first row of NA then it can be simplified to just exp(diff(log(m)))-1 .
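If the Inf entries should instead be NA, as in the question's expected result, the non-finite values can be masked afterwards. This is a small sketch on the question's matrix; it assumes all values are non-negative, since log() of a negative number yields NaN with a warning.

```r
m <- matrix(c(1:3, NA, 2.4, 2.8, 3.9, 0, 1, 3, 0, 2,
              1.3, 2, NA, 7, 3.9, 2.4), 6, 3)

out <- rbind(NA, exp(diff(log(m))) - 1)
out[!is.finite(out)] <- NA   # Inf (previous value was 0) and NaN become NA
out
```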
2) zoo Another way is to use zoo's geometric diff function. Convert to zoo, take geometric differences and subtract 1. If it's important to have a first row of NAs then merge it back with a zero-width series having the original time points (otherwise omit the merge statement and just use g as the answer):
library(zoo)
zm <- as.zoo(m)
g <- diff(zm, arithmetic = FALSE) - 1
merge(g, zoo(, time(zm))) # omit this line if 1st row of NAs not needed
giving:
g.1 g.2 g.3
1 NA NA NA
2 1.0000000 -1 0.5384615
3 0.5000000 Inf NA
4 NA 2 NA
5 NA -1 -0.4428571
6 0.1666667 Inf -0.3846154
I'd like to use the previous row value for a calculation involving the current row. The matrix looks something like:
A B
[1,] 1 2
[2,] 2 3
[3,] 3 4
[4,] 4 5
[5,] 5 6
I want to do the following operation: (cell[i]/cell[i-1])-1, essentially calculating the % change (-1 to 1) from the current row to the previous (excluding the first row).
The output should look like:
C D
[1,] NA NA
[2,] 1.0 0.5
[3,] 0.5 0.33
[4,] 0.33 0.25
[5,] 0.25 0.20
This can be accomplished easily using for-loops, but I am working with large data sets so I would like to use apply (or other inbuilt functions) for performance and cleaner code.
So far I've come up with:
test.perc <- sapply(test, function(x,y) x-x[y])
But it's not working.
Any ideas?
Thanks.
df/rbind(c(NA,NA), df[-nrow(df),]) - 1
will work.
1) division
ans1 <- DF[-1,] / DF[-nrow(DF),] - 1
or rbind(NA, ans1) if it's important to have the NAs in the first row
2) diff
ans2 <- exp(sapply(log(DF), diff)) - 1
or rbind(NA, ans2) if it's important to have the NAs in the first row
3) diff.zoo
library(zoo)
coredata(diff(as.zoo(DF), arithmetic = FALSE)) - 1
If it's important to have the NA at the beginning then add the na.pad=TRUE argument like this:
coredata(diff(as.zoo(DF), arithmetic = FALSE, na.pad = TRUE)) - 1
Alternatively, sticking with your original sapply method:
sapply(dat, function(x) x/c(NA,head(x,-1)) - 1 )
Or a variation on @user3114046's answer:
dat/rbind(NA,head(dat,-1))-1
# A B
#[1,] NA NA
#[2,] 1.0000000 0.5000000
#[3,] 0.5000000 0.3333333
#[4,] 0.3333333 0.2500000
#[5,] 0.2500000 0.2000000