Let's say I have a symmetric matrix A, for example:
> A <- matrix(runif(16),nrow = 4,byrow = T)
> ind <- lower.tri(A)
> A[ind] <- t(A)[ind]
> A
[,1] [,2] [,3] [,4]
[1,] 0.4212778 0.6874073 0.1551896 0.46757640
[2,] 0.6874073 0.5610995 0.1779030 0.54072946
[3,] 0.1551896 0.1779030 0.9515304 0.79429777
[4,] 0.4675764 0.5407295 0.7942978 0.01206526
I also have a 4 x 3 matrix B that gives specific positions of matrix A, for example:
> B<-matrix(c(1,2,4,2,1,3,3,2,4,4,1,3),nrow=4,byrow = T)
> B
[,1] [,2] [,3]
[1,] 1 2 4
[2,] 2 1 3
[3,] 3 2 4
[4,] 4 1 3
The B matrix represents the following positions of A: (1,1), (1,2), (1,4), (2,2), (2,1), (2,3), (3,3), (3,2), (3,4), (4,4), (4,1), (4,3).
I want to change the values of A that are NOT in the positions given by B, replacing them by Inf. The result I want is:
[,1] [,2] [,3] [,4]
[1,] 0.4212778 0.6874073 Inf 0.46757640
[2,] 0.6874073 0.5610995 0.1779030 Inf
[3,] Inf 0.1779030 0.9515304 0.79429777
[4,] 0.4675764 Inf 0.7942978 0.01206526
How can I do that quickly, avoiding a for loop (which I am able to code)? I've seen many similar posts, but none gave me what I want. Thank you!
You want to do something like matrix subsetting (e.g., P[Q]) except that you can't use negative indexing in matrix subsetting (e.g., P[-Q] is not allowed). Here's a work-around.
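As a small aside, here is a sketch of the index-matrix subsetting this relies on, using a hypothetical 3x3 matrix P (not from the question), just to show what P[Q] does and why P[-Q] can't be used:
P <- matrix(1:9, nrow = 3)
Q <- cbind(c(1, 2), c(3, 1))  # each row of Q is a (row, column) coordinate of P
P[Q]                          # selects P[1, 3] and P[2, 1]: 7 2
# P[-Q] is not allowed (negative values can't be used in a matrix subscript)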
Store the elements you want to retain from A in a 2-column matrix where each row is a coordinate of A:
Idx <- cbind(rep(1:4, each=ncol(B)), as.vector(t(B)))
Create a matrix where all values are Inf, and then overwrite the values you wanted to "keep" from A:
Res <- matrix(Inf, nrow=nrow(A), ncol=ncol(A))
Res[Idx] <- A[Idx]
Result
Res
# [,1] [,2] [,3] [,4]
#[1,] 0.9043131 0.639718071 Inf 0.19158238
#[2,] 0.6397181 0.601327568 0.007363378 Inf
#[3,] Inf 0.007363378 0.752123162 0.61428003
#[4,] 0.1915824 Inf 0.614280026 0.02932679
Here is a one-liner. Since each row of B lists three of the four column indices, the missing column in row i is sum(1:ncol(A)) - rowSums(B)[i]:
A[cbind(1:nrow(A), sum(1:ncol(A)) - rowSums(B))] <- Inf
[,1] [,2] [,3] [,4]
[1,] 0.4150663 0.23440503 Inf 0.6665222
[2,] 0.2344050 0.38736067 0.01352211 Inf
[3,] Inf 0.01352211 0.88319263 0.9942303
[4,] 0.6665222 Inf 0.99423028 0.7630221
Another way would be to identify the missing cells with an apply and set them to Inf.
cnum <- 1:ncol(A)
A[cbind(1:nrow(A), apply(B, 1, function(x) cnum[-which(cnum %in% x)]))] <- Inf
A
# [,1] [,2] [,3] [,4]
# [1,] 0.9148060 0.9370754 Inf 0.8304476
# [2,] 0.9370754 0.5190959 0.7365883 Inf
# [3,] Inf 0.7365883 0.4577418 0.7191123
# [4,] 0.8304476 Inf 0.7191123 0.9400145
Note: set.seed(42).
A <- matrix(runif(16),nrow = 4,byrow = T)
ind <- lower.tri(A)
A[ind] <- t(A)[ind]
## > A
## [,1] [,2] [,3] [,4]
## [1,] 0.07317535 0.167118857 0.0597721 0.2128698
## [2,] 0.16711886 0.008661005 0.6419335 0.6114373
## [3,] 0.05977210 0.641933514 0.7269202 0.3547959
## [4,] 0.21286984 0.611437278 0.3547959 0.4927997
The first thing to notice is that the matrix B is not very helpful in its current form: what we actually need is each row number of B paired with the values in that row.
B<-matrix(c(1,2,4,2,1,3,3,2,4,4,1,3),nrow=4,byrow = T)
> B
## [,1] [,2] [,3]
## [1,] 1 2 4
## [2,] 2 1 3
## [3,] 3 2 4
## [4,] 4 1 3
We can create that pairing simply by using melt from the reshape2 package and keeping the Var1 and value columns:
> library(reshape2)
> melt(B)
## Var1 Var2 value
## 1 1 1 1
## 2 2 1 2
## 3 3 1 3
## 4 4 1 4
## 5 1 2 2
## 6 2 2 1
## 7 3 2 2
## 8 4 2 1
## 9 1 3 4
## 10 2 3 3
## 11 3 3 4
## 12 4 3 3
We need to replace the entries of A whose indices do not appear in B with Inf. This is not easy to do directly, so an easy way out is to create another matrix full of Inf and fill in the values of A at the positions given by melt(B).
> C<-matrix(Inf,nrow(A),ncol(A))
idx <- as.matrix(melt(B)[,c("Var1","value")])
C[idx]<-A[idx]
> C
## [,1] [,2] [,3] [,4]
## [1,] 0.07317535 0.167118857       Inf 0.2128698
## [2,] 0.16711886 0.008661005 0.6419335 Inf
## [3,] Inf 0.641933514 0.7269202 0.3547959
## [4,] 0.21286984 Inf 0.3547959 0.4927997
Another way to use matrix subsetting (e.g., P[Q]) is to create the index matrix Q manually. Here's one way to do it.
Figure out which column index is "missing" from each row of B:
col_idx <- apply(B, 1, function(x) (1:nrow(A))[-match(x, 1:nrow(A))])
Create subsetting matrix Q
Idx <- cbind(1:nrow(A), col_idx)
Do the replacement
A[Idx] <- Inf
Of course, you can make this a one-liner if you really want to:
A[cbind(1:nrow(A), apply(B, 1, function(x) (1:nrow(A))[-match(x, 1:nrow(A))]))] <- Inf
If I have a vector, for example
vec <- c(3,4,5,NA)
I can replace the NA with the median value of the other values in the vector with the following code:
vec[which(is.na(vec))] <- median(vec, na.rm = T)
However, if I have a matrix containing NAs, applying this same code across all columns of the matrix doesn't give me back a matrix; it just returns the median of each column.
mat <- matrix(c(1,NA,3,5,6,7,NA,3,4,NA,2,8), ncol = 3)
apply(mat, 2, function(x) x[which(is.na(x))] <- median(x, na.rm=T) )
#[1] 3 6 4
How can I get the matrix back with NAs replaced by column medians? This question is similar: Replace NA values by row means but I can't adapt any of the solutions to my case.
There is a convenient function (na.aggregate) in zoo to replace the NA elements with the specified FUN.
library(zoo)
apply(mat, 2, FUN = function(x) na.aggregate(x, FUN = median))
# [,1] [,2] [,3]
#[1,] 1 6 4
#[2,] 3 7 4
#[3,] 3 6 2
#[4,] 5 3 8
Or, as @G.Grothendieck commented, na.aggregate can be applied directly to the matrix:
na.aggregate(mat, FUN = median)
Adding return(x) as the last line of the function within apply will solve it; without it, the function returns the value of the assignment, which is just the median.
> apply(mat, 2, function(x){
x[which(is.na(x))] <- median(x, na.rm=T)
return(x)
})
[,1] [,2] [,3]
[1,] 1 6 4
[2,] 3 7 4
[3,] 3 6 2
[4,] 5 3 8
In MATLAB it is possible to add elements vertically to a vector even when their indices reach far beyond the next row. For example:
test = 1:5;
test(5,:) = 1:5;
will return:
1 2 3 4 5
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
1 2 3 4 5
Is there a nice, not so hacky way to do this in R?
Although joran is probably right that you should alter your procedure when switching to R, you can exploit the fact that assigning past the end of a vector enlarges it, as Frank noted in the comments. Taking advantage of matrices being vectors with a "dim" attribute and, for convenience, of the by-column storage of matrices, you could use something like:
add_col = function(x, col, value)
{
    nr = NROW(x)
    nc = if(col > NCOL(x)) col else NCOL(x)  # grow to "col" columns if needed
    i1 = nr * (col - 1) + 1                  # first linear index of column "col"
    i2 = i1 + length(value) - 1              # last linear index to be filled
    x[i1:i2] = value                         # out-of-bounds assignment enlarges the vector
    length(x) = nr * nc                      # pad the remaining cells with NA
    dim(x) = c(nr, nc)                       # restore the matrix shape
    return(x)
}
test = 1:5
add_col(test, 3, 1:3)
# [,1] [,2] [,3]
#[1,] 1 NA 1
#[2,] 2 NA 2
#[3,] 3 NA 3
#[4,] 4 NA NA
#[5,] 5 NA NA
t(add_col(add_col(test, 3, 1:3), 6, 4:1))
# [,1] [,2] [,3] [,4] [,5]
#[1,] 1 2 3 4 5
#[2,] NA NA NA NA NA
#[3,] 1 2 3 NA NA
#[4,] NA NA NA NA NA
#[5,] NA NA NA NA NA
#[6,] 4 3 2 1 NA
This might be a bit hacky, but it is not much slower than just indexing a vector out of bounds:
X = seq_len(1e5)
microbenchmark::microbenchmark(add_col(X, 1e2, seq_along(X)),
t(add_col(X, 1e2, seq_along(X))), #with a transpose
"[<-"(X, (1e7 - (length(X) - 1)):1e7, seq_along(X)), #just vector out-of-bounds indexing
times = 50)
#Unit: milliseconds
# expr min lq median uq max neval
# add_col(X, 100, seq_along(X)) 34.79408 40.02492 45.61020 63.24266 78.48069 50
# t(add_col(X, 100, seq_along(X))) 79.81389 84.06544 87.57906 102.75845 110.72842 50
# `[<-`(X, (1e+07 - (length(X) - 1)):1e+07, seq_along(X)) 17.25123 18.06138 21.48956 24.69084 48.91988 50
identical(c(add_col(X, 1e2, seq_along(X))), "[<-"(X, (1e7 - (length(X) - 1)):1e7, seq_along(X)))
#[1] TRUE
How can I rewrite this function as a vectorized variant? As far as I know, using loops is not good practice in R:
# replaces rows that contains all NAs with non-NA values from previous row and K-th column
na.replace <- function(x, k) {
for (i in 2:nrow(x)) {
if (!all(is.na(x[i - 1, ])) && all(is.na(x[i, ]))) {
x[i, ] <- x[i - 1, k]
}
}
x
}
This is input data and returned data for function:
m <- cbind(c(NA,NA,1,2,NA,NA,NA,6,7,8), c(NA,NA,2,3,NA,NA,NA,7,8,9))
m
[,1] [,2]
[1,] NA NA
[2,] NA NA
[3,] 1 2
[4,] 2 3
[5,] NA NA
[6,] NA NA
[7,] NA NA
[8,] 6 7
[9,] 7 8
[10,] 8 9
na.replace(m, 2)
[,1] [,2]
[1,] NA NA
[2,] NA NA
[3,] 1 2
[4,] 2 3
[5,] 3 3
[6,] 3 3
[7,] 3 3
[8,] 6 7
[9,] 7 8
[10,] 8 9
Here is a solution using na.locf in the zoo package. row.na is a vector with one component per row of m such that a component is TRUE if the corresponding row of m is all NA and FALSE otherwise. We then set all elements of such rows to the result of applying na.locf to column 2.
At the expense of a bit of speed the lines ending with ## could be replaced with row.na <- apply(is.na(m), 1, all) which is a bit more readable.
If we knew that if any row has an NA in column 2 then all columns of that row are NA, as in the question, then the lines ending in ## could be reduced to just row.na <- is.na(m[, 2])
library(zoo)
nr <- nrow(m) ##
nc <- ncol(m) ##
row.na <- .rowSums(is.na(m), nr, nc) == nc ##
m[row.na, ] <- na.locf(m[, 2], na.rm = FALSE)[row.na]
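For reference, the intermediate fill-forward on column 2 looks like this (assuming m as defined in the question):
na.locf(m[, 2], na.rm = FALSE)
# [1] NA NA  2  3  3  3  3  7  8  9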
The result is:
> m
[,1] [,2]
[1,] NA NA
[2,] NA NA
[3,] 1 2
[4,] 2 3
[5,] 3 3
[6,] 3 3
[7,] 3 3
[8,] 6 7
[9,] 7 8
[10,] 8 9
REVISED: Some revisions to improve speed, as in the comments below. Also added alternatives in the discussion.
Notice that, unless you have a pathological condition where the first row is all NA (in which case you're screwed anyway), you don't need to check whether all(is.na(x[i - 1, ])) is TRUE or FALSE, because on the previous pass through the loop you "fixed" row i - 1.
Further, all you care about is that the designated k-th value is not NA. The rest of the row doesn't matter.
BUT: The k-th value always "falls through" from the top, so perhaps you should:
1) treat the k-th column as a vector, e.g. c(NA,1,NA,NA,3,NA,4,NA,NA) and "fill-down" all numeric values. That's been done many times on SO questions.
2) Every row which is entirely NA except for column k gets filled with that same value.
I think that's still best done using either a loop or apply
You probably need to clarify whether some rows have both numeric and NA values, which your example fails to include. If that's the case, then things get trickier.
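Here is a minimal base-R sketch of those two steps, assuming m and k = 2 as in the question; fill_down is a made-up helper name, not an existing function:
fill_down <- function(v) {
  # carry the last non-NA value forward; leading NAs stay NA
  idx <- cummax(ifelse(is.na(v), 0L, seq_along(v)))
  v[replace(idx, idx == 0L, NA)]
}
k <- 2
filled <- fill_down(m[, k])            # step 1: fill down the k-th column
all.na <- rowSums(!is.na(m)) == 0      # step 2: rows that are entirely NA
m[all.na, ] <- filled[all.na]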
The most important part in this answer is getting the grouping you want, which is:
groups = cumsum(rowSums(is.na(m)) != ncol(m))
groups
#[1] 0 0 1 2 2 2 2 3 4 5
Once you have that the rest is just doing your desired operation by group, e.g.:
library(data.table)
dt = as.data.table(m)
k = 2
cond = rowSums(is.na(m)) != ncol(m)
dt[, (k) := .SD[[k]][1], by = cumsum(cond)]
dt[!cond, names(dt) := .SD[[k]]]
dt
# V1 V2
# 1: NA NA
# 2: NA NA
# 3: 1 2
# 4: 2 3
# 5: 3 3
# 6: 3 3
# 7: 3 3
# 8: 6 7
# 9: 7 8
#10: 8 9
Here is another base only vectorized approach:
na.replace <- function(x, k) {
is.all.na <- rowSums(is.na(x)) == ncol(x)
ref.idx <- cummax((!is.all.na) * seq_len(nrow(x)))
ref.idx[ref.idx == 0] <- NA
x[is.all.na, ] <- x[ref.idx[is.all.na], k]
x
}
And for fair comparison with @Eldar's solution, replace is.all.na with is.all.na <- is.na(x[, k]).
Finally I came up with my own vectorized solution, and it works as expected. Any comments and suggestions are welcome :)
# Last Observation Move Forward
# works as na.locf but much faster and accepts only 1D structures
na.lomf <- function(object, na.rm = F) {
idx <- which(!is.na(object))
if (!na.rm && is.na(object[1])) idx <- c(1, idx)
rep.int(object[idx], diff(c(idx, length(object) + 1)))
}
na.replace <- function(x, k) {
v <- x[, k]
i <- which(is.na(v))
r <- na.lomf(v)
x[i, ] <- r[i]
x
}
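A couple of quick sanity checks on the example data from the question (a sketch, output written as comments):
na.lomf(c(NA, 1, NA, NA, 3))   # NA 1 1 1 3 -- the leading NA stays because na.rm = FALSE
na.replace(m, 2)               # should reproduce the expected output shown in the question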
Here's a workaround with the na.locf function from zoo (note that it carries the entire previous non-NA row forward, so rows 5-7 become 2 3 rather than 3 3):
m[na.locf(ifelse(apply(m, 1, function(x) all(is.na(x))), NA, 1:nrow(m)), na.rm=F),]
[,1] [,2]
[1,] NA NA
[2,] NA NA
[3,] 1 2
[4,] 2 3
[5,] 2 3
[6,] 2 3
[7,] 2 3
[8,] 6 7
[9,] 7 8
[10,] 8 9
I have a numeric vector called area of length 166860. It consists of 412 separate series, most of length 405 and some of length 809. I have their start and end ids.
My goal is to extract them and put them in a matrix/data frame with 412 columns.
Right now, I'm trying this code:
m = matrix(NA,ncol=412, nrow=809)
for (j in 1:412){
temp.start = start.ids[j]
temp.end = end.ids[j]
m[,j] = area[temp.start:temp.end]
}
But I just end up with this error message:
"Error in m[, j] = area[temp.start:temp.end] :
number of items to replace is not a multiple of replacement length"
Here's a quite easy approach:
Example data:
area <- c(1:4, 1:5, 1:6, 1:3)
# [1] 1 2 3 4 1 2 3 4 5 1 2 3 4 5 6 1 2 3
start.ids <- which(area == 1)
# [1] 1 5 10 16
end.ids <- c(which(area == 1)[-1] - 1, length(area))
# [1] 4 9 15 18
Create a list with one-row matrices:
mats <- mapply(function(x, y) t(area[seq(x, y)]), start.ids, end.ids)
# [[1]]
# [,1] [,2] [,3] [,4]
# [1,] 1 2 3 4
#
# [[2]]
# [,1] [,2] [,3] [,4] [,5]
# [1,] 1 2 3 4 5
#
# [[3]]
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 1 2 3 4 5 6
#
# [[4]]
# [,1] [,2] [,3]
# [1,] 1 2 3
Use the function rbind.fill.matrix from the plyr package to create the matrix and transpose it (t):
library(plyr)
m <- t(rbind.fill.matrix(mats))
# [,1] [,2] [,3] [,4]
# 1 1 1 1 1
# 2 2 2 2 2
# 3 3 3 3 3
# 4 4 4 4 NA
# 5 NA 5 5 NA
# 6 NA NA 6 NA
You are creating a matrix whose columns have 809 entries (nrow = 809), and matrices cannot have columns of variable length. This means the value you assign to a column must either have length 809 or a shorter length that divides 809 evenly, in which case it is recycled. From the manual on ?matrix:
If there are too few elements in data to fill the matrix, then the elements in data are recycled. If data has length zero, NA of an appropriate type is used for atomic vectors (0 for raw vectors) and NULL for lists.
As another answer suggested, you could assign to rows instead with m[j, ], but here each row has only 412 entries, so for a 412-column result you should keep assigning to columns and pad the value with NA so that it is always of length 809.
m = matrix(NA,ncol=412, nrow=809)
for (j in 1:412){
temp.start = start.ids[j]
temp.end = end.ids[j]
val <- area[temp.start:temp.end]
m[, j] = c(val, rep(NA, 809 - length(val)))
}
How about this? I've manufactured some sample data:
#here are the random sets of numbers - length either 408 or 809
nums<-lapply(1:412,function(x)runif(sample(c(408,809),1)))
#this represents your numeric (one list of all the numbers)
nums.vec<-unlist(nums)
#get data about the series (which you have)
nums.lengths<-sapply(nums,function(x)length(x))
nums.starts<-cumsum(c(1,head(nums.lengths,-1))) # start of series k = 1 + sum of the previous lengths
nums.ends<-nums.starts+nums.lengths-1
new.vec<-unlist(lapply(1:412,function(x){
v<-nums.vec[nums.starts[x]:nums.ends[x]]
c(v,rep(0,(809-length(v))))
}))
matrix(new.vec,ncol=412)
What about
m[j,] = area[temp.start:temp.end]
?
Edit:
a <- area[temp.start:temp.end]
m[1:length(a),j] <- a
Maybe others have better answers. As I see it, you have two options:
1) Change m[,j] to m[1:length(area[temp.start:temp.end]), j]; then you will not get an error, but you would have some NA's left.
2) Use a list of matrices instead, so each matrix can keep its own dimensions (a sketch follows below).
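Here is a minimal sketch of option 2, assuming area, start.ids and end.ids as defined in the question:
# element j of the list holds the j-th series as a one-column matrix, keeping its own length
segs <- mapply(function(s, e) matrix(area[s:e], ncol = 1),
               start.ids, end.ids, SIMPLIFY = FALSE)
length(segs)        # 412
sapply(segs, nrow)  # mix of 405 and 809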