I am searching for a solution that implements the following simple growth-rate formula by applying vectorization in R:
gr <- function(x) {
  a <- matrix(NA, nrow = nrow(x), ncol = ncol(x))
  for (j in 1:ncol(x)) {
    for (i in 2:nrow(x)) {
      if (!is.na(x[i, j]) & !is.na(x[i - 1, j]) & x[i - 1, j] != 0) {
        a[i, j] <- x[i, j] / x[i - 1, j] - 1
      }
    }
  }
  return(a)
}
I found the xts package to generate lags of time series, but in the end I always ended up having to compare too many values (see above), so I cannot simply use ifelse. One possible problem is when the time series (e.g. a price index) has zeros in between. This would create NaNs in the result, which I am trying to avoid and which cannot simply be removed afterwards (edit: apparently they can, see the answers below!)
In short: I'd like to produce a table of correct growth rates for a given table of values. Here is an example:
m <- matrix(c(1:3,NA,2.4,2.8,3.9,0,1,3,0,2,1.3,2,NA,7,3.9,2.4),6,3)
generates:
[,1] [,2] [,3]
[1,] 1.0 3.9 1.3
[2,] 2.0 0.0 2.0
[3,] 3.0 1.0 NA
[4,] NA 3.0 7.0
[5,] 2.4 0.0 3.9
[6,] 2.8 2.0 2.4
correct result, produced by gr(m):
[,1] [,2] [,3]
[1,] NA NA NA
[2,] 1.0000000 -1 0.5384615
[3,] 0.5000000 NA NA
[4,] NA 2 NA
[5,] NA -1 -0.4428571
[6,] 0.1666667 NA -0.3846154
But this takes forever with large tables. Is there any way to use conditions without looping so extensively?
You can speed this up by performing the entire calculation in a single vectorized operation (with one additional operation to fix up the results whenever you divide by 0):
out <- rbind(NA, tail(m, -1) / head(m, -1) - 1)
out[!is.finite(out)] <- NA
out
# [,1] [,2] [,3]
# [1,] NA NA NA
# [2,] 1.0000000 -1 0.5384615
# [3,] 0.5000000 NA NA
# [4,] NA 2 NA
# [5,] NA -1 -0.4428571
# [6,] 0.1666667 NA -0.3846154
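For the benchmark below, the vectorized code above is assumed to be wrapped in a function; a minimal sketch (the name josilber simply matches the call in the timing code):
josilber <- function(m) {
  out <- rbind(NA, tail(m, -1) / head(m, -1) - 1)
  out[!is.finite(out)] <- NA
  out
}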
This is much faster than a looping solution, as demonstrated on a 10000 x 1000 example:
set.seed(144)
m <- matrix(rnorm(10000000), 10000, 1000)
system.time(j <- josilber(m))
# user system elapsed
# 1.425 0.030 1.446
system.time(g <- gr(m))
# user system elapsed
# 34.551 0.263 36.581
The vectorized solution provides a 25x speedup.
Here are a couple of ways:
1) no packages
rbind(NA, exp(diff(log(m)))-1)
giving:
[,1] [,2] [,3]
[1,] NA NA NA
[2,] 1.0000000 -1 0.5384615
[3,] 0.5000000 Inf NA
[4,] NA 2 NA
[5,] NA -1 -0.4428571
[6,] 0.1666667 Inf -0.3846154
If it's not important to have a first row of NA then it can be simplified to just exp(diff(log(m))) - 1.
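If the Inf values produced by the zero denominators should become NA, as in the desired output in the question, the same fix-up shown in the first answer can be applied; a short sketch:
out <- rbind(NA, exp(diff(log(m))) - 1)
out[!is.finite(out)] <- NA   # turn the Inf/NaN produced by zeros into NA
out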
2) zoo Another way is to use zoo's geometric diff function. Convert to zoo, take geometric differences and subtract 1. If it's important to have a first row of NAs then merge the result back with a zero-width series having the original time points (otherwise omit the merge statement and just use g as the answer):
library(zoo)
zm <- as.zoo(m)
g <- diff(zm, arithmetic = FALSE) - 1
merge(g, zoo(, time(zm))) # omit this line if 1st row of NAs not needed
giving:
g.1 g.2 g.3
1 NA NA NA
2 1.0000000 -1 0.5384615
3 0.5000000 Inf NA
4 NA 2 NA
5 NA -1 -0.4428571
6 0.1666667 Inf -0.3846154
Good day,
I have a feature co-occurrence matrix (fcm, from the quanteda package in R) of dimensions 94966 x 94966 (named fcm2 for illustration). I am able to select rows (class: fcm object) by feature name or row number, e.g.:
a1 <- fcm2[1,]
and perform a normalization calculation on that particular row:
a2 <- a1/(max(a1)-min(a1))
My objective is to normalize each row in my original fcm. The strategy I attempted was to initialize an empty matrix and then use a for loop to go through the rows and perform the calculation. The initialization fails because of memory issues (Windows 10, 12 Gb RAM, R version 3.4.4):
fcm3 <- matrix(data = NA, nrow = 94966, ncol = 94966)
Error: cannot allocate vector of size 33.6 Gb
I had been able to accomplish the normalization by using a dataframe structure, but there is not enough memory to store the entire fcm2 as a dataframe:
Step 1: Extract a "sub-matrix" based on a list of keywords, convert to dataframe, drop unneeded columns
m <- fcm2[keywords(),]
df_m1 <- as.data.frame(m)
df_m1 <- subset(df_m1, select = -c(document, rt))
Step 2: Normalization
k <- 0               # initialize counter
df2 <- data.frame()  # initialize
n4 <- nrow(df_m1)    # number of rows of the extracted sub-matrix (df_m1)
for (k in 1:n4) {
  a1 <- df_m1[k, ]             # store the k-th row
  max_k <- max(a1)
  min_k <- min(a1)
  a2 <- a1 / (max_k - min_k)   # normalize so max is 1, 0s are still 0s
  df2 <- rbind(df2, a2)        # append normalized row to the data.frame
}
Is there a more efficient way to normalize each row for the entire fcm?
Kind thanks!
You can write a function:
norm = function(mat) {
  mx = mat[cbind(1:nrow(mat), max.col(mat))]   # row-wise maxima
  mn = mat[cbind(1:nrow(mat), max.col(-mat))]  # row-wise minima
  mat / (mx - mn)
}
And then use it.
Example
set.seed(1)
mat1=matrix(sample(20),5)
mat1
[,1] [,2] [,3] [,4]
[1,] 6 14 3 7 #max is 14, min is 3 thus divide by 11
[2,] 8 15 2 12
[3,] 11 9 20 17
[4,] 16 19 10 18
[5,] 4 1 5 13
norm(mat1)
[,1] [,2] [,3] [,4]
[1,] 0.5454545 1.27272727 0.2727273 0.6363636
[2,] 0.6153846 1.15384615 0.1538462 0.9230769
[3,] 1.0000000 0.81818182 1.8181818 1.5454545
[4,] 1.7777778 2.11111111 1.1111111 2.0000000
[5,] 0.3333333 0.08333333 0.4166667 1.0833333
You can print the output as fractions to check that the results represent what was needed:
MASS::fractions(norm(mat1))
[,1] [,2] [,3] [,4]
[1,] 6/11 14/11 3/11 7/11
[2,] 8/13 15/13 2/13 12/13
[3,] 1 9/11 20/11 17/11
[4,] 16/9 19/9 10/9 2
[5,] 1/3 1/12 5/12 13/12
I understand the OP has a memory constraint and cannot allocate memory to hold another copy of that big matrix.
If memory permits, the solution can be:
mat1 = t(apply(mat1, 1, function(x) x/(max(x)-min(x))))
Under the memory constraint, one can instead write a function to normalise a vector and apply it over all rows in a for loop, overwriting each row in place. This should be efficient enough in the given scenario.
# Function to normalize a vector
normalise <- function(x){
  x / (max(x) - min(x))
}
# Apply over all rows of the matrix
for (i in 1:nrow(mat1)) {
  mat1[i, ] <- normalise(mat1[i, ])
}
mat1
# [,1] [,2] [,3] [,4]
# [1,] 0.5454545 1.27272727 0.2727273 0.6363636
# [2,] 0.6153846 1.15384615 0.1538462 0.9230769
# [3,] 1.0000000 0.81818182 1.8181818 1.5454545
# [4,] 1.7777778 2.11111111 1.1111111 2.0000000
# [5,] 0.3333333 0.08333333 0.4166667 1.0833333
Data: As used by #Onyambu
# Data
set.seed(1)
mat1=matrix(sample(20),5)
The most efficient way is to operate on the sparse values of the fcm object directly, avoiding any transformation into a dense object such as a matrix or data.frame. This is how the dfm and fcm manipulation and computation functions are defined in quanteda, and why they are able to execute quickly and within limited memory.
For your type of normalisation, you could use the following function, demonstrated here on a simple fcm.
library("quanteda")
library("Matrix")
myfcm <- fcm(data_char_sampletext, window = 5)
myfcm
## Feature co-occurrence matrix of: 244 by 244 features.
Now we define a function that (for convenience) transforms the fcm into a sparse triplet representation (the dgTMatrix class) and extracts the non-zero values using split(). Each element of the resulting list will represent a row of your fcm, but only for non-zero values. (Because of this, we also have to return a zero for empty rows.)
fcm_customnorm <- function(x) {
  x <- as(x, "dgTMatrix")
  split_x <- split(x@x, x@i)
  norm_x <- lapply(split_x, function(y) {
    result <- y/(max(y) - min(y))
    # transform any divisions by zero into zero
    result[is.nan(result)] <- 0
    result
  })
  x@x <- unlist(norm_x, use.names = FALSE)
  quanteda:::as.fcm(x)
}
Applying this on a subset, we see that it works:
myfcm[1:5, 1:5]
## Feature co-occurrence matrix of: 5 by 5 features.
## 5 x 5 sparse Matrix of class "fcm"
## features
## features Instead we have a Fine
## Instead 0 5 1 4 1
## we 0 10 5 20 5
## have 0 0 0 4 1
## a 0 0 0 6 4
## Fine 0 0 0 0 0
fcm_customnorm(myfcm[1:5, 1:5])
## Feature co-occurrence matrix of: 5 by 5 features.
## 5 x 5 sparse Matrix of class "fcm"
## features
## features Instead we have a Fine
## Instead 0 1.0 0.8000000 0.3333333 1.00
## we 0 0.2 0.2000000 1.3333333 0.25
## have 0 0 0.6666667 0.3333333 3.00
## a 0 0 0 0.0000000 2.00
## Fine 0 0 0 0 0.00
Another option would have been to extract the simple triplet representation to a data.table (from the data.table package) and then perform your computations using the grouping functions and :=. But this approach is simpler and gets your desired result, which is a normalised fcm.
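For completeness, here is a hedged sketch of that data.table alternative, working on the triplet slots directly (assumes myfcm from the example above; this is not part of the original answer):
library("Matrix")
library("data.table")
xT <- as(myfcm, "dgTMatrix")
dt <- data.table(i = xT@i + 1L, j = xT@j + 1L, x = xT@x)  # 1-based triplets
dt[, x := x / (max(x) - min(x)), by = i]                  # normalise within each row
dt[is.nan(x), x := 0]                                     # rows where max == min
xT@x <- dt$x                                              # := preserves row order
quanteda:::as.fcm(xT)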
Is there a simple way to demean a sparse matrix by columns while considering zero-values as missing (using Matrix package)?
There seem to be two problems I struggle with:
1) Finding proper column means
Empty cells are considered zero rather than missing:
M0 <- matrix(rep(1:5,4),nrow = 4)
M0[2,2] <- M0[2,3] <- 0
M <- as(M0, "sparseMatrix")
M
#[1,] 1 5 4 3 2
#[2,] 2 . . 4 3
#[3,] 3 2 1 5 4
#[4,] 4 3 2 1 5
colMeans(M)
#[1] 2.50 2.50 1.75 3.25 3.50
Correct result should be:
colMeans_correct <- colSums(M) / c(4,3,3,4,4)
colMeans_correct
#[1] 2.500000 3.333333 2.333333 3.250000 3.500000
2) Subtracting the column means
Subtraction is also performed on the missing cells:
sweep(M, 2, colMeans_correct)
#4 x 5 Matrix of class "dgeMatrix"
# [,1] [,2] [,3] [,4] [,5]
#[1,] -1.5 1.6666667 1.6666667 -0.25 -1.5
#[2,] -0.5 -3.3333333 -2.3333333 0.75 -0.5
#[3,] 0.5 -1.3333333 -1.3333333 1.75 0.5
#[4,] 1.5 -0.3333333 -0.3333333 -2.25 1.5
P.S. I hope it is not a problem posting a question composed of two problems. They are connected to the same task and seem to reflect the same underlying issue: distinguishing between missing and actual zero values.
One option is to divide the colSums by the colSums of the non-zero logical matrix
colSums(M)/colSums(M!=0)
#[1] 2.500000 3.333333 2.333333 3.250000 3.500000
Or another option is to replace the 0 with NA and get the colMeans with na.rm = TRUE argument
colMeans(M*NA^!M, na.rm = TRUE)
#[1] 2.500000 3.333333 2.333333 3.250000 3.500000
Or as #user20650 commented
colSums(M) / diff(M@p)
#[1] 2.500000 3.333333 2.333333 3.250000 3.500000
where 'p' is the pointer slot described in ?sparseMatrix:
In typical usage, p is missing, i and j are vectors of positive
integers and x is a numeric vector. These three vectors, which must
have the same length, form the triplet representation of the sparse
matrix.
If i or j is missing then p must be a non-decreasing integer vector
whose first element is zero. It provides the compressed, or “pointer”
representation of the row or column indices, whichever is missing. The
expanded form of p, rep(seq_along(dp),dp) where dp <- diff(p), is used
as the (1-based) row or column indices.
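The same pointer also handles the second part of the question: since only the stored (non-zero) entries should be demeaned, the appropriate column mean can be subtracted from M@x directly. A minimal sketch, not part of the original answer (note that entries equal to their column mean become explicitly stored zeros):
dp <- diff(M@p)                            # number of non-zero entries per column
cm <- colSums(M) / dp                      # column means over non-zero entries
col_index <- rep(seq_along(dp), dp)        # expanded form of p, as in the quote above
Md <- M
Md@x <- Md@x - cm[col_index]               # demean only the stored entries
Md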
I have a matrix in which every row has at least one NA cell, and every column has at least one NA cell as well. What I need is to find the largest subset of this matrix that contains no NAs.
For example, for this matrix A
A <-
structure(c(NA, NA, NA, NA, 2L, NA,
1L, 1L, 1L, 0L, NA, NA,
1L, 8L, NA, 1L, 1L, NA,
NA, 1L, 1L, 6L, 1L, 3L,
NA, 1L, 5L, 1L, 1L, NA),
.Dim = c(6L, 5L),
.Dimnames =
list(paste0("R", 1:6),
paste0("C", 1:5)))
A
C1 C2 C3 C4 C5
R1 NA 1 1 NA NA
R2 NA 1 8 1 1
R3 NA 1 NA 1 5
R4 NA 0 1 6 1
R5 2 NA 1 1 1
R6 NA NA NA 3 NA
There are two solutions (8 cells): A[c(2, 4), 2:5] and A[2:5, 4:5], though finding just one valid solution is enough for my purposes. The dimensions of my actual matrix are 77x132.
Being a noob, I see no obvious way to do this. Could anyone help me with some ideas?
1) optim In this approach we relax the problem to a continuous optimization problem which we solve with optim.
The objective function is f and the input to it is a 0-1 vector whose first nrow(A) entries correspond to rows and whose remaining entries correspond to columns. f uses a matrix Ainf which is derived from A by replacing the NAs with a large negative number and the non-NAs with 1. In terms of Ainf, the negative of the number of elements in the rectangle of rows and columns corresponding to x is -x[seq(6)] %*% Ainf %*% x[-seq(6)], which we minimize as a function of x subject to each component of x lying between 0 and 1.
Although this is a relaxation of the original problem to continuous optimization, it seems that we get an integer solution anyway, as desired.
Actually most of the code below is just to get the starting value. To do that we first apply seriation. This permutes the rows and columns giving a more blocky structure and then in the permuted matrix we find the largest square submatrix.
In the case of the specific A in the question, the largest rectangular submatrix happens to be square and the starting values are already good enough to produce the optimum, but we perform the optimization anyway so the approach works in general. You can play around with different starting values if you like. For example, change k from 1 to some higher number in largestSquare, in which case largestSquare will return k columns giving k starting values that can be used in k runs of optim, taking the best.
If the starting values are sufficiently good then this should produce the optimum.
library(seriation) # only used for starting values
A.na <- is.na(A) + 0
Ainf <- ifelse(A.na, -prod(dim(A)), 1) # used by f
nr <- nrow(A) # used by f
f <- function(x) - c(x[seq(nr)] %*% Ainf %*% x[-seq(nr)])
# starting values
# Input is a square matrix of zeros and ones.
# Output is a matrix with k columns such that first column defines the
# largest square submatrix of ones, second defines next largest and so on.
# Based on algorithm given here:
# http://www.geeksforgeeks.org/maximum-size-sub-matrix-with-all-1s-in-a-binary-matrix/
largestSquare <- function(M, k = 1) {
nr <- nrow(M); nc <- ncol(M)
S <- 0*M; S[1, ] <- M[1, ]; S[, 1] <- M[, 1]
for(i in 2:nr)
for(j in 2:nc)
if (M[i, j] == 1) S[i, j] = min(S[i, j-1], S[i-1, j], S[i-1, j-1]) + 1
o <- head(order(-S), k)
d <- data.frame(row = row(M)[o], col = col(M)[o], mx = S[o])
apply(d, 1, function(x) {
dn <- dimnames(M[x[1] - 1:x[3] + 1, x[2] - 1:x[3] + 1])
out <- c(rownames(M) %in% dn[[1]], colnames(M) %in% dn[[2]]) + 0
setNames(out, unlist(dimnames(M)))
})
}
s <- seriate(A.na)
p <- permute(A.na, s)
# calculate the largest square submatrix of zeros in p, rearranging to be in A's order
st <- largestSquare(1-p)[unlist(dimnames(A)), 1]
res <- optim(st, f, lower = 0*st, upper = st^0, method = "L-BFGS-B")
giving:
> res
$par
R1 R2 R3 R4 R5 R6 C1 C2 C3 C4 C5
0 1 1 1 0 0 0 1 0 1 1
$value
[1] -9
$counts
function gradient
1 1
$convergence
[1] 0
$message
[1] "CONVERGENCE: NORM OF PROJECTED GRADIENT <= PGTOL"
2) GenSA Another possibility is to repeat (1) but using GenSA from the GenSA package instead of optim. It does not require starting values (although you can provide one via the par argument, which might improve the solution in some cases), so the code is considerably shorter; however, since it uses simulated annealing it can be expected to take substantially longer to run. We use f (and nr and Ainf, which f uses) from (1). Below we try it without a starting value.
library(GenSA)
resSA <- GenSA(lower = rep(0, sum(dim(A))), upper = rep(1, sum(dim(A))), fn = f)
giving:
> setNames(resSA$par, unlist(dimnames(A)))
R1 R2 R3 R4 R5 R6 C1 C2 C3 C4 C5
0 1 1 1 0 0 0 1 0 1 1
> resSA$value
[1] -9
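To turn the 0/1 solution vector back into the submatrix itself, one option is the following sketch (rounding guards against the relaxation returning non-integer values; the first nrow(A) entries select rows, the rest select columns, as described above):
sel <- round(setNames(resSA$par, unlist(dimnames(A))))
rows <- which(sel[seq_len(nrow(A))] == 1)
cols <- which(sel[-seq_len(nrow(A))] == 1)
A[rows, cols]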
I have a solution, but it doesn't scale very well:
findBiggestSubmatrixNonContiguous <- function(A) {
A <- !is.na(A); ## don't care about non-NAs
howmany <- expand.grid(nr=seq_len(nrow(A)),nc=seq_len(ncol(A)));
howmany <- howmany[order(apply(howmany,1L,prod),decreasing=T),];
for (ri in seq_len(nrow(howmany))) {
nr <- howmany$nr[ri];
nc <- howmany$nc[ri];
rcom <- combn(nrow(A),nr);
ccom <- combn(ncol(A),nc);
comcom <- expand.grid(ri=seq_len(ncol(rcom)),ci=seq_len(ncol(ccom)));
for (comi in seq_len(nrow(comcom)))
if (all(A[rcom[,comcom$ri[comi]],ccom[,comcom$ci[comi]]]))
return(list(ri=rcom[,comcom$ri[comi]],ci=ccom[,comcom$ci[comi]]));
}; ## end for
NULL;
}; ## end findBiggestSubmatrixNonContiguous()
It's based on the idea that if the matrix has a small enough density of NAs, then by searching for the largest submatrices first, you'll be likely to find a solution fairly quickly.
The algorithm works by computing a cartesian product of all counts of rows and counts of columns that could be indexed out of the original matrix to produce the submatrix. The set of pairs of counts is then decreasingly ordered by the size of the submatrix that would be produced by each pair of counts; in other words, ordered by the product of the two counts. It then iterates over these pairs. For each pair, it computes all combinations of row indexes and column indexes that could be taken for that pair of counts, and tries each combination in turn until it finds a submatrix that contains zero NAs. Upon finding such a submatrix, it returns that set of row and column indexes as a list.
The result is guaranteed to be correct because it tries submatrix sizes in decreasing order, so the first one it finds must be the biggest (or tied for the biggest) possible submatrix that satisfies the condition.
## OP's example matrix
A <- data.frame(C1=c(NA,NA,NA,NA,2L,NA),C2=c(1L,1L,1L,0L,NA,NA),C3=c(1L,8L,NA,1L,1L,NA),C4=c(NA,1L,1L,6L,1L,3L),C5=c(NA,1L,5L,1L,1L,NA),row.names=c('R1','R2','R3','R4','R5','R6'));
A;
## C1 C2 C3 C4 C5
## R1 NA 1 1 NA NA
## R2 NA 1 8 1 1
## R3 NA 1 NA 1 5
## R4 NA 0 1 6 1
## R5 2 NA 1 1 1
## R6 NA NA NA 3 NA
system.time({ res <- findBiggestSubmatrixNonContiguous(A); });
## user system elapsed
## 0.094 0.000 0.100
res;
## $ri
## [1] 2 3 4
##
## $ci
## [1] 2 4 5
##
A[res$ri,res$ci];
## C2 C4 C5
## R2 1 1 1
## R3 1 1 5
## R4 0 6 1
We see that the function works very quickly on the OP's example matrix, and returns a correct result.
randTest <- function(NR,NC,probNA,seed=1L) {
set.seed(seed);
A <- replicate(NC,sample(c(NA,0:9),NR,prob=c(probNA,rep((1-probNA)/10,10L)),replace=T));
print(A);
print(system.time({ res <- findBiggestSubmatrixNonContiguous(A); }));
print(res);
print(A[res$ri,res$ci,drop=F]);
invisible(res);
}; ## end randTest()
I wrote the above function to make testing easier. We can call it to test a random input matrix of size NR by NC, where each cell is NA with probability probNA.
Here are a few trivial tests:
randTest(8L,1L,1/3);
## [,1]
## [1,] NA
## [2,] 1
## [3,] 4
## [4,] 9
## [5,] NA
## [6,] 9
## [7,] 0
## [8,] 5
## user system elapsed
## 0.016 0.000 0.003
## $ri
## [1] 2 3 4 6 7 8
##
## $ci
## [1] 1
##
## [,1]
## [1,] 1
## [2,] 4
## [3,] 9
## [4,] 9
## [5,] 0
## [6,] 5
randTest(11L,3L,4/5);
## [,1] [,2] [,3]
## [1,] NA NA NA
## [2,] NA NA NA
## [3,] NA NA NA
## [4,] 2 NA NA
## [5,] NA NA NA
## [6,] 5 NA NA
## [7,] 8 0 4
## [8,] NA NA NA
## [9,] NA NA NA
## [10,] NA 7 NA
## [11,] NA NA NA
## user system elapsed
## 0.297 0.000 0.300
## $ri
## [1] 4 6 7
##
## $ci
## [1] 1
##
## [,1]
## [1,] 2
## [2,] 5
## [3,] 8
randTest(10L,10L,1/3);
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] NA NA 0 3 8 3 9 1 6 NA
## [2,] 1 NA NA 4 5 8 NA 8 2 NA
## [3,] 4 2 5 3 7 6 6 1 1 5
## [4,] 9 1 NA NA 4 NA NA 1 NA 9
## [5,] NA 7 NA 8 3 NA 5 3 7 7
## [6,] 9 3 1 2 7 NA NA 9 NA 7
## [7,] 0 2 NA 7 NA NA 3 8 2 6
## [8,] 5 0 1 NA 3 3 7 1 NA 6
## [9,] 5 1 9 2 2 5 NA 7 NA 8
## [10,] NA 7 1 6 2 6 9 0 NA 5
## user system elapsed
## 8.985 0.000 8.979
## $ri
## [1] 3 4 5 6 8 9 10
##
## $ci
## [1] 2 5 8 10
##
## [,1] [,2] [,3] [,4]
## [1,] 2 7 1 5
## [2,] 1 4 1 9
## [3,] 7 3 3 7
## [4,] 3 7 9 7
## [5,] 0 3 1 6
## [6,] 1 2 7 8
## [7,] 7 2 0 5
I don't know an easy way of verifying that the above result is correct, but it looks good to me. However, it took almost 9 seconds to generate, so running the function on moderately larger matrices, let alone the OP's 77x132 matrix, is probably a lost cause.
Waiting to see if someone can come up with a brilliant efficient solution...
How can I rewrite this function as a vectorized variant? As far as I know, using loops is not good practice in R:
# replaces rows that contain all NAs with the non-NA value from the previous row's k-th column
na.replace <- function(x, k) {
  for (i in 2:nrow(x)) {
    if (!all(is.na(x[i - 1, ])) && all(is.na(x[i, ]))) {
      x[i, ] <- x[i - 1, k]
    }
  }
  x
}
This is the input data and the data returned by the function:
m <- cbind(c(NA,NA,1,2,NA,NA,NA,6,7,8), c(NA,NA,2,3,NA,NA,NA,7,8,9))
m
[,1] [,2]
[1,] NA NA
[2,] NA NA
[3,] 1 2
[4,] 2 3
[5,] NA NA
[6,] NA NA
[7,] NA NA
[8,] 6 7
[9,] 7 8
[10,] 8 9
na.replace(m, 2)
[,1] [,2]
[1,] NA NA
[2,] NA NA
[3,] 1 2
[4,] 2 3
[5,] 3 3
[6,] 3 3
[7,] 3 3
[8,] 6 7
[9,] 7 8
[10,] 8 9
Here is a solution using na.locf in the zoo package. row.na is a vector with one component per row of m such that a component is TRUE if the corresponding row of m is all NA and FALSE otherwise. We then set all elements of such rows to the result of applying na.locf to column 2.
At the expense of a bit of speed the lines ending with ## could be replaced with row.na <- apply(is.na(m), 1, all) which is a bit more readable.
If we knew that if any row has an NA in column 2 then all columns of that row are NA, as in the question, then the lines ending in ## could be reduced to just row.na <- is.na(m[, 2])
library(zoo)
nr <- nrow(m) ##
nc <- ncol(m) ##
row.na <- .rowSums(is.na(m), nr, nc) == nc ##
m[row.na, ] <- na.locf(m[, 2], na.rm = FALSE)[row.na]
The result is:
> m
[,1] [,2]
[1,] NA NA
[2,] NA NA
[3,] 1 2
[4,] 2 3
[5,] 3 3
[6,] 3 3
[7,] 3 3
[8,] 6 7
[9,] 7 8
[10,] 8 9
Notice that, unless you have a pathological condition where the first row is all NA (in which case you're screwed anyway), you don't need to check whether all(is.na(x[i - 1, ])) is TRUE or FALSE, because the previous time through the loop you "fixed" row i - 1.
Further, all you care about is that the designated k-th value is not NA. The rest of the row doesn't matter.
BUT: The k-th value always "falls through" from the top, so perhaps you should:
1) treat the k-th column as a vector, e.g. c(NA,1,NA,NA,3,NA,4,NA,NA) and "fill-down" all numeric values. That's been done many times on SO questions.
2) Every row which is entirely NA except for column k gets filled with that same value.
I think that's still best done using either a loop or apply
You probably need to clarify whether some rows have both numeric and NA values, which your example fails to include. If that's the case, then things get trickier.
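A base-R sketch of the two steps just described (assumes m and k = 2 from the question; the fill-down here is done with vectorized indexing rather than a loop, which is only one of many ways to do it):
k <- 2
v <- m[, k]
idx <- cummax(seq_along(v) * !is.na(v))   # index of the last non-NA value seen so far
idx[idx == 0] <- NA                       # leading NAs stay NA
filled_k <- v[idx]                        # step 1: fill-down of column k
all_na <- rowSums(is.na(m)) == ncol(m)    # rows that are entirely NA
m[all_na, ] <- filled_k[all_na]           # step 2: fill those rows with that value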
The most important part in this answer is getting the grouping you want, which is:
groups = cumsum(rowSums(is.na(m)) != ncol(m))
groups
#[1] 0 0 1 2 2 2 2 3 4 5
Once you have that the rest is just doing your desired operation by group, e.g.:
library(data.table)
dt = as.data.table(m)
k = 2
cond = rowSums(is.na(m)) != ncol(m)
dt[, (k) := .SD[[k]][1], by = cumsum(cond)]
dt[!cond, names(dt) := .SD[[k]]]
dt
# V1 V2
# 1: NA NA
# 2: NA NA
# 3: 1 2
# 4: 2 3
# 5: 3 3
# 6: 3 3
# 7: 3 3
# 8: 6 7
# 9: 7 8
#10: 8 9
Here is another base only vectorized approach:
na.replace <- function(x, k) {
  is.all.na <- rowSums(is.na(x)) == ncol(x)           # rows that are entirely NA
  ref.idx <- cummax((!is.all.na) * seq_len(nrow(x)))  # last preceding non-NA row
  ref.idx[ref.idx == 0] <- NA                         # leading all-NA rows stay NA
  x[is.all.na, ] <- x[ref.idx[is.all.na], k]          # fill with that row's k-th value
  x
}
And for fair comparison with #Eldar's solution, replace is.all.na with is.all.na <- is.na(x[, k]).
Finally, I came up with my own vectorized solution and it works as expected. Any comments and suggestions are welcome :)
# Last Observation Moved Forward
# works like na.locf but is much faster; accepts only 1D structures
na.lomf <- function(object, na.rm = FALSE) {
  idx <- which(!is.na(object))
  if (!na.rm && is.na(object[1])) idx <- c(1, idx)
  rep.int(object[idx], diff(c(idx, length(object) + 1)))
}

na.replace <- function(x, k) {
  v <- x[, k]              # the k-th column
  i <- which(is.na(v))     # rows whose k-th value is NA
  r <- na.lomf(v)          # fill the k-th column down
  x[i, ] <- r[i]           # overwrite those rows with the filled value
  x
}
Here's a workaround with the na.locf function from zoo:
m[na.locf(ifelse(apply(m, 1, function(x) all(is.na(x))), NA, 1:nrow(m)), na.rm=F),]
[,1] [,2]
[1,] NA NA
[2,] NA NA
[3,] 1 2
[4,] 2 3
[5,] 2 3
[6,] 2 3
[7,] 2 3
[8,] 6 7
[9,] 7 8
[10,] 8 9
I'd like to use the previous row value for a calculation involving the current row. The matrix looks something like:
A B
[1,] 1 2
[2,] 2 3
[3,] 3 4
[4,] 4 5
[5,] 5 6
I want to do the following operation: (cell[i]/cell[i-1])-1, essentially calculating the % change (-1 to 1) of the current row relative to the previous one (excluding the first row).
The output should look like:
C D
[1,] NA NA
[2,] 1.0 0.5
[3,] 0.5 0.33
[4,] 0.33 0.25
[5,] 0.25 0.20
This can be accomplished easily using for-loops, but I am working with large data sets so I would like to use apply (or other inbuilt functions) for performance and cleaner code.
So far I've come up with:
test.perc <- sapply(test, function(x,y) x-x[y])
But it's not working.
Any ideas?
Thanks.
df/rbind(c(NA,NA), df[-nrow(df),]) - 1
will work.
1) division
ans1 <- DF[-1,] / DF[-nrow(DF),] - 1
or rbind(NA, ans1) if it's important to have the NAs in the first row
2) diff
ans2 <- exp(sapply(log(DF), diff)) - 1
or rbind(NA, ans2) if it's important to have the NAs in the first row
3) diff.zoo
library(zoo)
coredata(diff(as.zoo(DF), arithmetic = FALSE)) - 1
If it's important to have the NA at the beginning then add the na.pad=TRUE argument like this:
coredata(diff(as.zoo(DF), arithmetic = FALSE, na.pad = TRUE)) - 1
Alternatively, sticking with your original sapply method:
sapply(dat, function(x) x/c(NA,head(x,-1)) - 1 )
Or a variation on #user3114046's answer:
dat/rbind(NA,head(dat,-1))-1
# A B
#[1,] NA NA
#[2,] 1.0000000 0.5000000
#[3,] 0.5000000 0.3333333
#[4,] 0.3333333 0.2500000
#[5,] 0.2500000 0.2000000