I'd like to use the previous row value for a calculation involving the current row. The matrix looks something like:
A B
[1,] 1 2
[2,] 2 3
[3,] 3 4
[4,] 4 5
[5,] 5 6
I want to do the following operation: (cell[i]/cell[i-1])-1, essentially calculating the % change, expressed as a fraction from -1 to 1, from the previous row to the current row (excluding the first row).
The output should look like:
C D
[1,] NA NA
[2,] 1.0 0.5
[3,] 0.5 0.33
[4,] 0.33 0.25
[5,] 0.25 0.20
This can be accomplished easily using for-loops, but I am working with large data sets so I would like to use apply (or other inbuilt functions) for performance and cleaner code.
So far I've come up with:
test.perc <- sapply(test, function(x,y) x-x[y])
But it's not working.
Any ideas?
Thanks.
df/rbind(c(NA,NA), df[-nrow(df),]) - 1
will work.
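For example, a quick check that rebuilds the question's data as a data frame (an assumption here; the original object could equally be a matrix) and applies that expression:
df <- data.frame(A = 1:5, B = 2:6)   # hypothetical reconstruction of the example data
df / rbind(c(NA, NA), df[-nrow(df), ]) - 1
# first row NA NA, then 1.0 0.5, 0.5 0.333..., 0.333... 0.25, 0.25 0.2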
1) division
ans1 <- DF[-1,] / DF[-nrow(DF),] - 1
or rbind(NA, ans1) if it's important to have the NAs in the first row
2) diff
ans2 <- exp(sapply(log(DF), diff)) - 1
or rbind(NA, ans2) if it's important to have the NAs in the first row
3) diff.zoo
library(zoo)
coredata(diff(as.zoo(DF), arithmetic = FALSE)) - 1
If it's important to have the NA at the beginning then add the na.pad=TRUE argument like this:
coredata(diff(as.zoo(DF), arithmetic = FALSE, na.pad = TRUE)) - 1
Alternatively, sticking with your original sapply method:
sapply(dat, function(x) x/c(NA,head(x,-1)) - 1 )
Or a variation on #user3114046's answer:
dat/rbind(NA,head(dat,-1))-1
# A B
#[1,] NA NA
#[2,] 1.0000000 0.5000000
#[3,] 0.5000000 0.3333333
#[4,] 0.3333333 0.2500000
#[5,] 0.2500000 0.2000000
From the gtools library, take combinations(5,2). This gives the following output:
> combinations(5,2)
[,1] [,2]
[1,] 1 2
[2,] 1 3
[3,] 1 4
[4,] 1 5
[5,] 2 3
[6,] 2 4
[7,] 2 5
[8,] 3 4
[9,] 3 5
[10,] 4 5
Storing this as comb, we can extract rows, e.g. comb[1,] which is 1 2. Now suppose that I want to use comb[1,] as an input to a function of two arguments, say beta(a,b) (documented here). beta(1,2) works just fine, but beta(comb[1,]) will throw the error Error in beta(comb[1,]) : argument "b" is missing, with no default. Why is this? I'm confident that this is a type issue, but R is not a strongly typed language. How could this error be avoided by a user who wants to use something like beta(comb[1,])?
If we need to apply beta on the output of combinations, one option is to convert it to a list and then use do.call
library(gtools)
do.call(beta, asplit(combinations(5,2), 2))
#[1] 0.500000000 0.333333333 0.250000000 0.200000000 0.083333333
#[6] 0.050000000 0.033333333 0.016666667 0.009523810 0.003571429
Or another option is apply
apply(combinations(5, 2), 1, FUN = function(x) beta(x[1], x[2]))
This also avoids creating any intermediate objects.
Checking the output against individual calls to beta:
beta(1, 2) #1st row of combinations output
#[1] 0.5
beta(1, 3) # 2nd row of combinations output
#[1] 0.3333333
beta(1, 4) # 3rd row of combinations output
#[1] 0.25
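As to the "why": comb[1, ] is a single vector of length two, so the whole vector is matched positionally to beta's first argument a, and b is left missing. A small sketch of splatting one row into the two arguments (assuming comb <- combinations(5, 2) as in the question):
comb <- combinations(5, 2)
do.call(beta, as.list(comb[1, ]))  # equivalent to beta(1, 2)
#[1] 0.5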
This is a very difficult question to phrase properly, and I have already visited numerous posts on SO without finding a solution for this problem.
I have 2 dataframes with identical dimensions (in reality they are each 983 x 27, but in this example they are 10 x 5).
df1 <- data.frame(v=runif(10),w=runif(10),x=runif(10),y=runif(10),z=runif(10))
df2 <- data.frame(v=runif(10),w=runif(10),x=runif(10),y=runif(10),z=runif(10))
df1
v w x y z
1 0.47183652 0.22260903 0.22871379 0.549137695 0.19310086
2 0.26030258 0.33811230 0.66651066 0.432569755 0.88964481
3 0.99671428 0.87778858 0.76554728 0.486628372 0.28298038
4 0.51320543 0.62279625 0.52370766 0.003457935 0.20230251
5 0.09182823 0.88205170 0.43630438 0.308291706 0.03875207
6 0.29005832 0.96372511 0.65346596 0.411204978 0.22091272
7 0.76790152 0.47633721 0.79825487 0.329127652 0.48165651
8 0.85939833 0.70695256 0.05128899 0.631819822 0.26584177
9 0.14903837 0.09196876 0.56711615 0.443217700 0.33934426
10 0.79928314 0.15035157 0.82297350 0.203435449 0.21088680
df2
v w x y z
1 0.9733651 0.1407513 0.32073105 0.18886833 0.76234111
2 0.9009754 0.1303898 0.48968741 0.45347721 0.78475371
3 0.8460530 0.6597701 0.20024460 0.59079853 0.63302668
4 0.9879135 0.2348028 0.73954442 0.70185877 0.23834780
5 0.5748540 0.4139660 0.79869841 0.02760473 0.99871034
6 0.9164362 0.7166881 0.25280647 0.35890724 0.03500226
7 0.1302808 0.3734517 0.25132321 0.67417021 0.57109357
8 0.1114569 0.7319093 0.57513770 0.11055742 0.86348983
9 0.6596877 0.5261662 0.50796080 0.95685045 0.17689039
10 0.8299933 0.8244658 0.04408135 0.33849748 0.96904940
I need to iterate through each column and, for each day T, count the number of prior days (T-1, T-2, T-3, ..., T-n) whose values are less than the value on day T in both dataframes simultaneously, then compute the % frequency.
The steps would be:
for example on Day T=2, consider df1[2,1] (which is 0.26030258) and go back and flag any days prior to T=2 that are less than 0.26030258. Since we are using T=2 as an example, the only prior observation is df1[1,1]. If df1[1,1] < df1[2,1] flag this day as 1 IF
df2[1,1] is ALSO less than df2[2,1]
Finally, still for example T=2, sum the number of 1s and divide by the number of observations to generate a frequency for T=2.
Again, I need to do this for 983 dates, and across 27 columns. I have tried various methods using rollify, as well as various functions wrapped in sapply, but it is challenging given the dynamic width of the countif criterion, let alone doing this across 2 DFs at the same time.
I think something like this:
m1 = as.matrix(df1)
m2 = as.matrix(df2)
results = matrix(nrow = nrow(df1) - 1, ncol = ncol(df1))
colnames(results) = names(df1)
for(i in 2:nrow(df1)) {
results[i - 1, ] = rowSums(t(m1[1:(i - 1), , drop = FALSE]) < m1[i, ] & t(m2[1:(i - 1), , drop = FALSE]) < m2[i, ]) / (i - 1)
}
results
# v w x y z
# [1,] 0.0000000 1.0 1.0000000 0.0000000 0.0000000
# [2,] 0.5000000 0.0 0.0000000 0.0000000 1.0000000
# [3,] 0.0000000 0.0 0.3333333 0.6666667 0.6666667
# [4,] 0.2500000 0.0 0.0000000 0.0000000 0.0000000
# [5,] 0.0000000 0.4 0.4000000 0.6000000 0.6000000
# [6,] 0.0000000 0.0 0.3333333 0.0000000 0.1666667
# [7,] 0.0000000 0.0 0.4285714 0.0000000 0.2857143
# [8,] 0.1250000 0.5 0.6250000 0.5000000 0.8750000
# [9,] 0.2222222 0.0 0.4444444 0.4444444 0.2222222
There's a bit of guesswork since you haven't responded yet to comments, but this should be easily modifiable in case my assumptions are wrong.
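If you prefer to avoid the explicit loop, roughly the same computation can be written with sapply over the row index (a sketch using the m1/m2 matrices defined above; not benchmarked, and not necessarily faster than the loop):
results2 <- t(sapply(2:nrow(m1), function(i) {
  lower1 <- sweep(m1[1:(i - 1), , drop = FALSE], 2, m1[i, ], "<")  # prior days below day i in df1
  lower2 <- sweep(m2[1:(i - 1), , drop = FALSE], 2, m2[i, ], "<")  # same condition in df2
  colMeans(lower1 & lower2)                                        # % frequency for day i
}))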
For the first df:
library(dplyr)  # needed for %>%, filter() and row_number()
df1_result <- matrix(nrow = 10, ncol = 5)
for(j in 1:ncol(df1)){
  for(i in 1:nrow(df1)){
    df1_result[i, j] <- df1 %>%
      filter(df1[ ,j] < df1[i, j] & row_number() < i) %>%
      nrow()
  }
}
Resulting in:
> df1_result
[,1] [,2] [,3] [,4] [,5]
[1,] 0 0 0 0 0
[2,] 1 0 1 0 1
[3,] 0 1 0 2 0
[4,] 2 1 1 2 0
[5,] 4 3 3 2 1
[6,] 2 2 3 5 2
[7,] 4 4 6 0 2
[8,] 4 7 2 1 2
[9,] 0 3 5 3 5
[10,] 6 7 5 8 9
Will gladly expand when you respond to comments.
Data
set.seed(1701)
df1 <- data.frame(v=runif(10),w=runif(10),x=runif(10),y=runif(10),z=runif(10))
df2 <- data.frame(v=runif(10),w=runif(10),x=runif(10),y=runif(10),z=runif(10))
> df1
v w x y z
1 0.127393428 0.85600486 0.4791849 0.55089910 0.9201376
2 0.766723202 0.02407293 0.8265008 0.35612092 0.9279873
3 0.054421675 0.51942589 0.1076198 0.80230714 0.5993939
4 0.561384595 0.20590965 0.2213454 0.73043828 0.1135139
5 0.937597936 0.71206404 0.6717478 0.72341749 0.2472984
6 0.296445079 0.27272126 0.5053170 0.98789408 0.4514940
7 0.665117463 0.66765977 0.8849426 0.04751297 0.3097986
8 0.652215607 0.94837341 0.3560469 0.06630861 0.2608917
9 0.002313313 0.46710461 0.5732139 0.55040341 0.5375610
10 0.661490602 0.84157353 0.5091688 0.95719901 0.9608329
Is there a simple way to demean a sparse matrix by columns while considering zero-values as missing (using Matrix package)?
There seem to be two problems I struggle with:
Finding proper column means
Empty cells are considered zero rather than missing:
M0 <- matrix(rep(1:5,4),nrow = 4)
M0[2,2] <- M0[2,3] <- 0
M <- as(M0, "sparseMatrix")
M
#[1,] 1 5 4 3 2
#[2,] 2 . . 4 3
#[3,] 3 2 1 5 4
#[4,] 4 3 2 1 5
colMeans(M)
#[1] 2.50 2.50 1.75 3.25 3.50
Correct result should be:
colMeans_correct <- colSums(M) / c(4,3,3,4,4)
colMeans_correct
#[1] 2.500000 3.333333 2.333333 3.250000 3.500000
Subtract column mean
The subtraction is also performed on the missing cells:
sweep(M, 2, colMeans_correct)
#4 x 5 Matrix of class "dgeMatrix"
# [,1] [,2] [,3] [,4] [,5]
#[1,] -1.5 1.6666667 1.6666667 -0.25 -1.5
#[2,] -0.5 -3.3333333 -2.3333333 0.75 -0.5
#[3,] 0.5 -1.3333333 -1.3333333 1.75 0.5
#[4,] 1.5 -0.3333333 -0.3333333 -2.25 1.5
P.S. hope it is not a problem posting a question composed of two problems. They are connected to the same task and seem to reflect the same problem - distinguish between missing and actual zero values.
One option is to divide the colSums by the colSums of the non-zero logical matrix
colSums(M)/colSums(M!=0)
#[1] 2.500000 3.333333 2.333333 3.250000 3.500000
Or another option is to replace the 0 with NA and get the colMeans with na.rm = TRUE argument
colMeans(M*NA^!M, na.rm = TRUE)
#[1] 2.500000 3.333333 2.333333 3.250000 3.500000
Or as #user20650 commented
colSums(M) / diff(M@p)
#[1] 2.500000 3.333333 2.333333 3.250000 3.500000
where 'p' is the pointer slot described in ?sparseMatrix:
In typical usage, p is missing, i and j are vectors of positive
integers and x is a numeric vector. These three vectors, which must
have the same length, form the triplet representation of the sparse
matrix.
If i or j is missing then p must be a non-decreasing integer vector
whose first element is zero. It provides the compressed, or “pointer”
representation of the row or column indices, whichever is missing. The
expanded form of p, rep(seq_along(dp),dp) where dp <- diff(p), is used
as the (1-based) row or column indices.
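For the second part of the question (subtracting each column mean from the non-missing cells only), one possible sketch that works directly on the slots of the compressed column form (this assumes M is a dgCMatrix, which is what as(M0, "sparseMatrix") returns):
col_means <- colSums(M) / diff(M@p)              # means over the stored (non-zero) entries
M_demeaned <- M
M_demeaned@x <- M@x - rep(col_means, diff(M@p))  # shift only the stored values; the '.' cells stay untouched
# Matrix::drop0(M_demeaned) would remove any values that happen to become exactly zero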
I am searching for a solution that implements the following simple growth-rate formula by applying vectorization in R:
# growth rate relative to the previous row, skipping NAs and zero denominators
gr <- function(x){
  a <- matrix(, nrow = nrow(x), ncol = ncol(x))
  for (j in 1:ncol(x)){
    for (i in 2:nrow(x)){
      if (!is.na(x[i, j]) & !is.na(x[i - 1, j]) & x[i - 1, j] != 0){
        a[i, j] <- x[i, j] / x[i - 1, j] - 1
      }
    }
  }
  return(a)
}
I found the xts package to generate lags of time series, but in the end I always ended up having to compare too many values (see above), so I cannot simply use ifelse. One possible problem is when the time series (e.g. a price index) has zeros in between. This would create NaN/Inf values in the result, which I am trying to avoid and which cannot simply be removed afterwards (edit: apparently they can, see the answers below!)
In short: I'd like to produce a table of correct growth rates for a given table of values. Here is an example:
m <- matrix(c(1:3,NA,2.4,2.8,3.9,0,1,3,0,2,1.3,2,NA,7,3.9,2.4),6,3)
generates:
[,1] [,2] [,3]
[1,] 1.0 3.9 1.3
[2,] 2.0 0.0 2.0
[3,] 3.0 1.0 NA
[4,] NA 3.0 7.0
[5,] 2.4 0.0 3.9
[6,] 2.8 2.0 2.4
correct result, produced by gr(m):
[,1] [,2] [,3]
[1,] NA NA NA
[2,] 1.0000000 -1 0.5384615
[3,] 0.5000000 NA NA
[4,] NA 2 NA
[5,] NA -1 -0.4428571
[6,] 0.1666667 NA -0.3846154
But this takes forever with large tables. Is there any way to use conditions without looping so extensively?
You can speed this up by performing the entire calculation in a single vectorized operation (with one additional operation to fix up the results whenever you divide by 0):
out <- rbind(NA, tail(m, -1) / head(m, -1) - 1)
out[!is.finite(out)] <- NA
out
# [,1] [,2] [,3]
# [1,]        NA NA        NA
# [2,] 1.0000000 -1 0.5384615
# [3,] 0.5000000 NA NA
# [4,] NA 2 NA
# [5,] NA -1 -0.4428571
# [6,] 0.1666667 NA -0.3846154
This is much faster than the looping solution, as demonstrated on a 10,000 x 1,000 example (josilber() here refers to the vectorized code above; gr() is the looping function from the question):
set.seed(144)
m <- matrix(rnorm(10000000), 10000, 1000)
system.time(j <- josilber(m))
# user system elapsed
# 1.425 0.030 1.446
system.time(g <- gr(m))
# user system elapsed
# 34.551 0.263 36.581
The vectorized solution provides a 25x speedup.
Here are a couple of ways:
1) no packages
rbind(NA, exp(diff(log(m)))-1)
giving:
[,1] [,2] [,3]
[1,] NA NA NA
[2,] 1.0000000 -1 0.5384615
[3,] 0.5000000 Inf NA
[4,] NA 2 NA
[5,] NA -1 -0.4428571
[6,] 0.1666667 Inf -0.3846154
If it's not important to have a first row of NA then it can be simplified to just exp(diff(log(m)))-1.
2) zoo Another way is to use zoo's geometric diff function. Convert to zoo, take geometric differences and subtract 1. If it's important to have a first row of NAs then merge it back with a zero width series having the original time points (otherwise omit the merge statement and just use g as the answer):
library(zoo)
zm <- as.zoo(m)
g <- diff(zm, arithmetic = FALSE) - 1
merge(g, zoo(, time(zm))) # omit this line if 1st row of NAs not needed
giving:
g.1 g.2 g.3
1 NA NA NA
2 1.0000000 -1 0.5384615
3 0.5000000 Inf NA
4 NA 2 NA
5 NA -1 -0.4428571
6 0.1666667 Inf -0.3846154
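In both of these, if the Inf entries produced by zero denominators should appear as NA (as in the desired output in the question), they can be blanked out afterwards in the same way as the earlier answer did, for example:
ans <- rbind(NA, exp(diff(log(m))) - 1)
ans[!is.finite(ans)] <- NA  # turn Inf from division by zero into NA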
How can I rewrite this function as a vectorized variant? As far as I know, using loops is not good practice in R:
# replaces rows that contains all NAs with non-NA values from previous row and K-th column
na.replace <- function(x, k) {
for (i in 2:nrow(x)) {
if (!all(is.na(x[i - 1, ])) && all(is.na(x[i, ]))) {
x[i, ] <- x[i - 1, k]
}
}
x
}
Here is the input data and the output returned by the function:
m <- cbind(c(NA,NA,1,2,NA,NA,NA,6,7,8), c(NA,NA,2,3,NA,NA,NA,7,8,9))
m
[,1] [,2]
[1,] NA NA
[2,] NA NA
[3,] 1 2
[4,] 2 3
[5,] NA NA
[6,] NA NA
[7,] NA NA
[8,] 6 7
[9,] 7 8
[10,] 8 9
na.replace(m, 2)
[,1] [,2]
[1,] NA NA
[2,] NA NA
[3,] 1 2
[4,] 2 3
[5,] 3 3
[6,] 3 3
[7,] 3 3
[8,] 6 7
[9,] 7 8
[10,] 8 9
Here is a solution using na.locf in the zoo package. row.na is a vector with one component per row of m such that a component is TRUE if the corresponding row of m is all NA and FALSE otherwise. We then set all elements of such rows to the result of applying na.locf to column 2.
At the expense of a bit of speed the lines ending with ## could be replaced with row.na <- apply(is.na(m), 1, all) which is a bit more readable.
If we knew that if any row has an NA in column 2 then all columns of that row are NA, as in the question, then the lines ending in ## could be reduced to just row.na <- is.na(m[, 2])
library(zoo)
nr <- nrow(m) ##
nc <- ncol(m) ##
row.na <- .rowSums(is.na(m), nr, nc) == nc ##
m[row.na, ] <- na.locf(m[, 2], na.rm = FALSE)[row.na]
The result is:
> m
[,1] [,2]
[1,] NA NA
[2,] NA NA
[3,] 1 2
[4,] 2 3
[5,] 3 3
[6,] 3 3
[7,] 3 3
[8,] 6 7
[9,] 7 8
[10,] 8 9
REVISED Some revisions to improve speed as in comments below. Also added alternatives in discussion.
Notice that, unless you have a pathological condition where the first row is all NA (in which case you're screwed anyway), you don't need to check whether all(is.na(x[i - 1, ])) is TRUE or FALSE, because on the previous pass through the loop you already "fixed" row i - 1.
Further, all you care about is that the designated k-th value is not NA. The rest of the row doesn't matter.
BUT: The k-th value always "falls through" from the top, so perhaps you should:
1) treat the k-th column as a vector, e.g. c(NA,1,NA,NA,3,NA,4,NA,NA) and "fill-down" all numeric values. That's been done many times on SO questions.
2) Every row which is entirely NA except for column k gets filled with that same value.
I think that's still best done using either a loop or apply; a small sketch of the two steps is given below.
You probably need to clarify whether some rows have both numeric and NA values, which your example fails to include. If that's the case, then things get trickier.
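As an illustration of the two numbered steps above (not code from the original answer), a minimal sketch assuming x is the input matrix, k the column index, and using zoo::na.locf for the fill-down:
filled_k <- zoo::na.locf(x[, k], na.rm = FALSE)  # 1) carry the k-th column value downward
all_na <- rowSums(!is.na(x)) == 0                # rows that are entirely NA
x[all_na, ] <- filled_k[all_na]                  # 2) fill those rows with the carried value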
The most important part in this answer is getting the grouping you want, which is:
groups = cumsum(rowSums(is.na(m)) != ncol(m))
groups
#[1] 0 0 1 2 2 2 2 3 4 5
Once you have that the rest is just doing your desired operation by group, e.g.:
library(data.table)
dt = as.data.table(m)
k = 2
cond = rowSums(is.na(m)) != ncol(m)
dt[, (k) := .SD[[k]][1], by = cumsum(cond)]
dt[!cond, names(dt) := .SD[[k]]]
dt
# V1 V2
# 1: NA NA
# 2: NA NA
# 3: 1 2
# 4: 2 3
# 5: 3 3
# 6: 3 3
# 7: 3 3
# 8: 6 7
# 9: 7 8
#10: 8 9
Here is another base only vectorized approach:
na.replace <- function(x, k) {
is.all.na <- rowSums(is.na(x)) == ncol(x)
ref.idx <- cummax((!is.all.na) * seq_len(nrow(x)))
ref.idx[ref.idx == 0] <- NA
x[is.all.na, ] <- x[ref.idx[is.all.na], k]
x
}
And for fair comparison with #Eldar's solution, replace is.all.na with is.all.na <- is.na(x[, k]).
Finally, I came up with my own vectorized solution and it works as expected. Any comments and suggestions are welcome :)
# Last Observation Move Forward
# works as na.locf but much faster and accepts only 1D structures
na.lomf <- function(object, na.rm = F) {
idx <- which(!is.na(object))
if (!na.rm && is.na(object[1])) idx <- c(1, idx)
rep.int(object[idx], diff(c(idx, length(object) + 1)))
}
na.replace <- function(x, k) {
v <- x[, k]
i <- which(is.na(v))
r <- na.lomf(v)
x[i, ] <- r[i]
x
}
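A quick check with the example matrix m and k = 2 from the question (hypothetical usage, not part of the original answer):
na.replace(m, 2)  # reproduces the expected output above: rows 5-7 become 3 3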
Here's a workaround with the na.locf function from zoo:
m[na.locf(ifelse(apply(m, 1, function(x) all(is.na(x))), NA, 1:nrow(m)), na.rm=F),]
[,1] [,2]
[1,] NA NA
[2,] NA NA
[3,] 1 2
[4,] 2 3
[5,] 2 3
[6,] 2 3
[7,] 2 3
[8,] 6 7
[9,] 7 8
[10,] 8 9