I've got a large nxn matrix and would like to take off-diagonal slices of varying sizes. For example:
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
I'd like an R function which, when given the matrix and a "width of diagonal slice", would return an nxn matrix of just those values. So for the matrix above and, say, 3, I'd get:
1 x x x x x
1 2 x x x x
1 2 3 x x x
x 2 3 4 x x
x x 3 4 5 x
x x x 4 5 6
At the moment I'm using (forgive me) a for loop which is incredibly slow:
getDiags <- function(ndiags, cormat) {
  resmat <- matrix(ncol = ncol(cormat), nrow = nrow(cormat))
  dimnames(resmat) <- dimnames(cormat)
  for (j in 0:(ndiags - 1)) {  # j = 0 is the main diagonal, matching the example above
    resmat[row(resmat) == col(resmat) + j] <-
      cormat[row(cormat) == col(cormat) + j]
  }
  return(resmat)
}
I realise that this is a very "un-R" way to go about solving this problem. Is there a better way to do it, probably using diag or lower.tri?
size <- 6
mat <- matrix(seq_len(size ^ 2), ncol = size)
low <- 0   # keep diagonals from the main diagonal (delta = 0) ...
high <- 3  # ... down to the third subdiagonal (delta = 3)
delta <- rep(seq_len(ncol(mat)), nrow(mat)) -
  rep(seq_len(nrow(mat)), each = ncol(mat))
# or Ben Bolker's better alternative
delta <- row(mat) - col(mat)
mat[delta < low | delta > high] <- NA
mat
This works with 5000 x 5000 matrices on my machine.
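For reference, after the NA assignment mat should print roughly as below; note that with low = 0 and high = 3 you keep four diagonals (the main diagonal plus three subdiagonals), so use high = 2 if you want the question's width-3 slice:
#     [,1] [,2] [,3] [,4] [,5] [,6]
#[1,]    1   NA   NA   NA   NA   NA
#[2,]    2    8   NA   NA   NA   NA
#[3,]    3    9   15   NA   NA   NA
#[4,]    4   10   16   22   NA   NA
#[5,]   NA   11   17   23   29   NA
#[6,]   NA   NA   18   24   30   36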
If you want to use upper.tri and lower.tri you could write functions like these:
cormat <- mapply(rep, 1:6, 6)
u.diags <- function(X, n) {
  X[n:nrow(X), ][lower.tri(X[n:nrow(X), ])] <- NA
  return(X)
}
or
l.diags <- function(X, n) {
  X[, n:ncol(X)][upper.tri(X[, n:ncol(X)])] <- NA
  return(X)
}
or
n.diags <- function(X, n.u, n.l) {
  X[n.u:nrow(X), ][lower.tri(X[n.u:nrow(X), ])] <- NA
  X[, n.l:ncol(X)][upper.tri(X[, n.l:ncol(X)])] <- NA
  return(X)
}
l.diags(cormat, 3)
u.diags(cormat, 3)
n.diags(cormat, 3, 1)
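For reference, a sketch of the expected print for u.diags(cormat, 3) (recall cormat[i, j] = j): it blanks everything three or more diagonals below the main one, and l.diags does the mirror image above it:
#     [,1] [,2] [,3] [,4] [,5] [,6]
#[1,]    1    2    3    4    5    6
#[2,]    1    2    3    4    5    6
#[3,]    1    2    3    4    5    6
#[4,]   NA    2    3    4    5    6
#[5,]   NA   NA    3    4    5    6
#[6,]   NA   NA   NA    4    5    6
n.diags(cormat, 3, 1) combines the two and reproduces the width-3 slice from the question.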
You can do:
matrix:
m <- matrix(1:6, ncol = 6, nrow = 6, byrow = TRUE)
function:
n_diag <- function(x, n) {
  d <- dim(x)
  ndiag <- .row(d) - n >= .col(d)
  x[upper.tri(x) | ndiag] <- NA
  return(x)
}
call:
n_diag(m,3)
# [,1] [,2] [,3] [,4] [,5] [,6]
#[1,] 1 NA NA NA NA NA
#[2,] 1 2 NA NA NA NA
#[3,] 1 2 3 NA NA NA
#[4,] NA 2 3 4 NA NA
#[5,] NA NA 3 4 5 NA
#[6,] NA NA NA 4 5 6
just for fun:
#lapply(1:6, n_diag, x = m)
I'm trying to build a function that returns the multiples of 3 from 0 up to half of the argument number. I'm adding an argument n that limits the number of results I get.
function1 <- function(number, n){
  half <- number / 2
  lessequal <- seq.int(from = 0, to = half, length.out = n)
  multiple <- (lessequal %% 3) == 0
  return(lessequal[multiple])
}
When I run this function with n = 2:
function1 (24, 2)
[1] 0 12
when the expected result would be:
[1] 0 3
If I run it with n = 4, the outcome is still 2 elements instead of 4:
function1 (24, 4)
[1] 0 12
When I expected to get:
[1] 0 3 6 9
What am I doing wrong?
Thanks.
Try with this code:
function1 <- function(number, n){
  half <- number / 2
  lessequal <- seq.int(from = 0, to = half, by = 1)
  multiple <- (lessequal %% 3) == 0
  vals <- lessequal[multiple]
  vals <- vals[1:n]
  return(vals)
}
function1 (24, 4)
Output:
[1] 0 3 6 9
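For context, the bug in the original function: seq.int(from = 0, to = half, length.out = n) spreads n evenly spaced points over [0, half] instead of counting up by 1, so increasing n only changes the spacing, not the pool of candidates. A quick illustration:
seq.int(from = 0, to = 12, length.out = 2)
#[1]  0 12
seq.int(from = 0, to = 12, length.out = 4)
#[1]  0  4  8 12
Of these four values only 0 and 12 are divisible by 3, which is why function1(24, 4) returned 0 12.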
For most cases you can get the output with seq itself, without passing number at all:
function1 <- function(n){
  seq(0, length.out = n, by = 3)
}
function1(2)
#[1] 0 3
function1(4)
#[1] 0 3 6 9
If there will be cases where the first n multiples of 3 exceed number/2, you can use:
function1 <- function(number, n){
  val <- seq(0, number/2, by = 3)
  val[1:n]
}
function1(24, 2)
#[1] 0 3
function1(24, 4)
#[1] 0 3 6 9
function1(12, 4)
#[1] 0 3 6 NA
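If you would rather drop the trailing NAs than pad with them, head(val, n) returns at most n values; a small variation on the above:
function1 <- function(number, n){
  val <- seq(0, number/2, by = 3)
  head(val, n)
}
function1(12, 4)
#[1] 0 3 6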
We can use %/% (integer division) and then grab the elements at positions 3*k + 1 for k = 0, ..., n-1; since the vector starts at 0, those positions hold exactly the multiples of 3.
get_odd_n <- function(x, n) lapply(x, function(i) (0:(i %/% 2))[3 * (0:(n - 1)) + 1])
setNames(get_odd_n(6 * 1:6 + 4, 7), 6 * 1:6 + 4)
$`10`
[1] 0 3 NA NA NA NA NA
$`16`
[1] 0 3 6 NA NA NA NA
$`22`
[1] 0 3 6 9 NA NA NA
$`28`
[1] 0 3 6 9 12 NA NA
$`34`
[1] 0 3 6 9 12 15 NA
$`40`
[1] 0 3 6 9 12 15 18
I was wondering if there might be a way in R to distribute n among k units without repetition (e.g., 3 5 2 is the same as 5 3 2 and as 2 3 5) and without 0s (i.e., no 9 1 0), and to see the make-up of this distribution.
For example if n = 9 and k = 3 then we expect the make-up to be:
(Note: k will always be the # of columns)
3 3 3
4 3 2
4 4 1
5 2 2
5 3 1
6 2 1
7 1 1
makeup <- function(n, k){
# your suggested solution #
}
These are called integer partitions (more specifically restricted integer partitions) and can efficiently be generated with the packages partitions or arrangements like so:
partitions::restrictedparts(9, 3, include.zero = FALSE)
[1,] 7 6 5 4 5 4 3
[2,] 1 2 3 4 2 3 3
[3,] 1 1 1 1 2 2 3
arrangements::partitions(9, 3)
[,1] [,2] [,3]
[1,] 1 1 7
[2,] 1 2 6
[3,] 1 3 5
[4,] 1 4 4
[5,] 2 2 5
[6,] 2 3 4
[7,] 3 3 3
These are much faster than the other solutions provided so far:
library(microbenchmark)
microbenchmark(arrangePack = arrangements::partitions(20, 5),
               partsPack = partitions::restrictedparts(20, 5, include.zero = FALSE),
               myfun2(20, 5, 20),
               myfun1(20, 5, 20),
               makeup(20, 5),
               mycomb(20, 5), times = 3, unit = "relative")
Unit: relative
              expr          min           lq        mean       median          uq         max neval
       arrangePack     1.000000     1.000000    1.000000     1.000000    1.000000    1.000000     3
         partsPack     3.070203     2.755573    2.084231     2.553477    1.854912    1.458389     3
 myfun2(20, 5, 20) 10005.679667  8528.784033 6636.284386  7580.133387 5852.625112 4872.050067     3
 myfun1(20, 5, 20) 12770.400243 10574.957696 8005.844282  9164.764625 6897.696334 5610.854109     3
     makeup(20, 5) 15422.745155 12560.083171 9248.916738 10721.316721 7812.997976 6162.166646     3
    mycomb(20, 5)   1854.125325  1507.150003 1120.616461  1284.278219  950.015812  760.280469     3
In fact, for the example below, the other functions will error out because of memory:
system.time(arrangements::partitions(100, 10))
user system elapsed
0.068 0.031 0.099
arrangements::npartitions(100, 10)
[1] 2977866
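If you want each partition as a row in decreasing order, like the question's expected make-up, you can sort the rows of the arrangements output afterwards; a sketch:
t(apply(arrangements::partitions(9, 3), 1, sort, decreasing = TRUE))
#     [,1] [,2] [,3]
#[1,]    7    1    1
#[2,]    6    2    1
#[3,]    5    3    1
#[4,]    4    4    1
#[5,]    5    2    2
#[6,]    4    3    2
#[7,]    3    3    3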
You may try gtools::combinations for this, as below, with the repeats.allowed = TRUE option:
m <- gtools::combinations(9, 3, repeats.allowed = TRUE)
m[rowSums(m) == 9, ]
A possible function: with options(expressions = 500000) it can go up to n = 500 (it ran successfully on my machine for n = 500, r = 3):
library(gtools)
mycomb <- function(n, r, sumval){
  m <- combinations(n, r, repeats.allowed = TRUE)
  m[rowSums(m) == sumval, ]
}
mycomb(9,3,9)
Output:
# [,1] [,2] [,3]
#[1,] 1 1 7
#[2,] 1 2 6
#[3,] 1 3 5
#[4,] 1 4 4
#[5,] 2 2 5
#[6,] 2 3 4
#[7,] 3 3 3
Here's a base solution using expand.grid. I'm not going to recommend it for large n, but it works:
makeup <- function(n, k) {
  x <- expand.grid(rep(list(1:n), k))       # generate all combinations
  x <- x[rowSums(x) == n, ]                 # filter out stuff that doesn't sum to n
  x <- as.data.frame(t(apply(x, 1, sort)))  # order everything
  unique(x)                                 # keep non-duplicates
}
A little rethinking simplifies this greatly. If we have a vector of n objects, we can break it apart at n - 1 different spots. Starting from this observation, we can reduce the work substantially:
makeup <- function(n, k) {
  splits <- combn(n - 1, k - 1)                     # locations where to split up the data
  bins <- rbind(rep(0, ncol(splits)), splits)       # add an extra "split" before the 1st element
  x <- apply(bins, 2, function(x) c(x[-1], n) - x)  # count how many items land in each bin
  x <- as.data.frame(t(apply(x, 2, sort)))          # order everything
  unique(x)                                         # keep non-duplicates
}
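To make the correspondence concrete: for n = 9 and k = 3, the split points {2, 5} cut the 9 objects into bins of sizes 2, 3 and 4, i.e. the composition 2 + 3 + 4. A quick check of the function (row order and row names may differ):
makeup(9, 3)
# should contain exactly the seven partitions
# 1 1 7, 1 2 6, 1 3 5, 1 4 4, 2 2 5, 2 3 4, 3 3 3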
Using matrix in base R:
myfun1 <- function(n, k){
  x <- as.matrix(expand.grid(rep(list(seq_len(n)), k)))
  x <- x[rowSums(x) == n, ]
  x[!duplicated(t(apply(x, 1, sort))), ]
}
myfun1(n = 9, k = 3)
Maybe this, using data.table:
myfun2 <- function(n, k){
  require('data.table')
  dt <- do.call(CJ, rep(list(seq_len(n)), k))
  dt <- dt[rowSums(dt) == n, ]
  dt[which(!duplicated(dt[, transpose(lapply(transpose(.SD), sort))])), ]
}
myfun2(n = 9, k = 3)
# V1 V2 V3
# 1: 7 1 1
# 2: 6 2 1
# 3: 5 3 1
# 4: 4 4 1
# 5: 5 2 2
# 6: 4 3 2
# 7: 3 3 3
I have a list and a matrix, as below:
List Y:
$`1`
V1 V2
1 1 1
2 1 2
3 2 1
4 2 2
$`2`
V1 V2
5 5 5
6 11 2
$`3`
V1 V2
7 10 1
8 10 2
9 11 1
10 5 6
Matrix Z:
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    2    1    5    5   10    1
I consider the following to be points1, points2 and points3 in matrix Z, respectively:
points1 - (2,1), from columns 1-2
points2 - (5,5), from columns 3-4
points3 - (10,1), from columns 5-6
I want to calculate the sum of distances between all points in list Y[[1]] and points1, all points in list Y[[2]] and points2, and all points in list Y[[3]] and points3, i.e. something like rowSums(|y - z|^2). How can I do this in R?
Based on the description, split Z into one point per list element and subtract each point from the rows of the corresponding data frame:
Map(function(y, z) rowSums(abs(y - z[col(y)])^2),
    Y, split(Z, as.numeric(gl(ncol(Z), 2, ncol(Z)))))
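This returns, for each list element, the squared distance from each row of Y[[i]] to its point; to reduce each to a single sum, a sketch reusing the same split:
d2 <- Map(function(y, z) rowSums(abs(y - z[col(y)])^2),
          Y, split(Z, as.numeric(gl(ncol(Z), 2, ncol(Z)))))
sapply(d2, sum)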
Try the following. It uses Map to apply a function to each pair of corresponding elements of the two lists passed to it. Note that we cannot simply do
Map('-', Y, Z2)
because R would do the subtractions columnwise, not row by row.
f <- function(x, y){
  for(i in seq_len(nrow(x)))
    x[i, ] <- x[i, ] - y
  x
}
Z2 <- split(Z, rep(1:3, each = 2))
Map(f, Y, Z2)
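Map(f, Y, Z2) returns the row-wise differences; to get the question's sum of squared distances per list element, a hedged follow-up sketch:
diffs <- Map(f, Y, Z2)
sapply(diffs, function(d) sum(rowSums(d^2)))  # one total per list element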
Let me try to make this question as general as possible.
Let's say I have two variables a and b.
a <- as.integer(runif(20, min = 0, max = 10))
a <- as.data.frame(a)
b <- as.data.frame(a[c(-7, -11, -15),])
So b has 17 observations and is a subset of a which has 20 observations.
My question is the following: how would I use these two variables to generate a third variable c which, like a, has 20 observations, but in which observations 7, 11 and 15 are missing, the remaining observations being identical to b and in the order of a?
Or to put it somewhat differently: how could I squeeze the missing observations into variable b at locations 7, 11 and 15?
It seems pretty straightforward (and it probably is), but I have not been able to get this to work for a while now.
1) loop Try this loop:
# test data
set.seed(123) # for reproducibility
a <- as.integer(runif(20, min = 0, max = 10))
a <- as.data.frame(a)
b <- as.data.frame(a[c(-7, -11, -15),])
# lets work with vectors
A <- a[[1]]
B <- b[[1]]
j <- 1
C <- A
for(i in seq_along(A)) if (A[i] == B[j]) j <- j+1 else C[i] <- NA
which gives:
> C
[1] 2 7 4 8 9 0 NA 8 5 4 NA 4 6 5 NA 8 2 0 3 9
2) Reduce Here is a loop-free version: f advances an index j into B whenever the current element of A matches B[j], so positions where the accumulated index fails to advance (i.e., where it is duplicated) correspond to the dropped observations.
f <- function(j, a) j + (a == B[j])
r <- Reduce(f, A, acc = TRUE)
ifelse(duplicated(r), NA, A)
giving:
[1] 2 7 4 8 9 0 NA 8 5 4 NA 4 6 5 NA 8 2 0 3 9
3) dtw. Using dtw in the package of the same name we can get a compact loop-free one-liner:
library(dtw)
ifelse(duplicated(dtw(A, B)$index2), NA, A)
giving:
[1] 2 7 4 8 9 0 NA 8 5 4 NA 4 6 5 NA 8 2 0 3 9
REVISED Added additional solutions.
Here's a more complicated way of doing it, using the Levenshtein distance algorithm, that does a better job on more complicated examples (it also seemed faster in a couple of larger tests I tried):
# using same data as G. Grothendieck:
set.seed(123) # for reproducibility
a <- as.integer(runif(20, min = 0, max = 10))
a <- as.data.frame(a)
b <- as.data.frame(a[c(-7, -11, -15),])
A = a[[1]]
B = b[[1]]
# compute the transformation between the two, assigning infinite weight to
# insertion and substitution
# using +1 here because the integers fed to intToUtf8 have to be larger than 0
# could also adjust the range more dynamically based on A and B
transf = attr(adist(intToUtf8(A + 1), intToUtf8(B + 1),
                    costs = c(Inf, 1, Inf), counts = TRUE), 'trafos')
C = A
C[substring(transf, 1:nchar(transf), 1:nchar(transf)) == "D"] <- NA
#[1] 2 7 4 8 9 0 NA 8 5 4 NA 4 6 5 NA 8 2 0 3 9
More complex matching example (where the greedy algorithm would perform poorly):
A = c(1,1,2,2,1,1,1,2,2,2)
B = c(1,1,1,2,2,2)
transf = attr(adist(intToUtf8(A), intToUtf8(B),
                    costs = c(Inf, 1, Inf), counts = TRUE), 'trafos')
C = A
C[substring(transf, 1:nchar(transf), 1:nchar(transf)) == "D"] <- NA
#[1] NA NA NA NA 1 1 1 2 2 2
# the greedy algorithm would return this instead:
#[1] 1 1 NA NA 1 NA NA 2 2 2
The data frame version, which isn't terribly different from G.'s above.
(Assumes a,b setup as above).
j <- 1
c <- a
for (i in seq_along(a[, 1])) {
  if (a[i, 1] == b[j, 1]) {
    j <- j + 1
  } else {
    c[i, 1] <- NA
  }
}
Here is my problem. I have a large vector of positive data. My goal is to remove the sequences of at least N consecutive values that are repeated in the vector (all repeated values must be strictly > 0).
I've written a program that works, as follows:
X is my vector of numeric values;
N is the minimum length of repeated sequences.
rmpParNASerieRepetee <- function(X, N)
{
  X_ <- paste("T", paste(X, collapse = "T"), "T", sep = "")
  ind.parcours <- 1
  ind.sup <- c()
  # Loop on the values
  while (ind.parcours <= (length(X) - N + 1))
  {
    # indices of my sequence of N values
    deb <- ind.parcours
    fin <- ind.parcours + N - 1
    # sequence of N values to search for in the vector
    serie <- X[deb:fin]
    serie_ <- paste("T", paste(serie, collapse = "T"), "T", sep = "")
    # borne is 1 except when the window is the tail of X itself
    borne <- 1 * (ind.parcours < (length(X) - N + 1))
    if (sum(X[(length(X) - N + 1):length(X)] == serie) == N) borne <- 0
    # split the big string by the sequence string of N values and count the pieces
    if (length(unlist(strsplit(X_, serie_))) - 1 > borne && length(which(serie != 0)) >= N)
    { ind.sup <- unique(c(ind.sup, deb:fin)) }
    ind.parcours <- ind.parcours + 1
  }
  if (length(ind.sup) != 0) { X[ind.sup] <- NA }
  list_return <- list(X = X, Ind.sup = unique(sort(ind.sup)))
  return(list_return)
}
I think my function is far from optimal (a running time of about 1 hour 15 minutes for a vector of 92,000 values with N = 18), and I have to do this step 1600 times... It would take around 3 months...
Does anyone have a better idea?
Example :
x <- c(1,2,3,4,0,4,1,2,3,8,9,1,2,3,4,0)
N <- 3
# (1,2,3) is a sequence of 3 elements which is repeated
# (1,2,3,4) is a sequence of 4 elements which is repeated
# no other sequence of length at least 3 repeats
# my result should therefore be:
# NA NA NA NA 0 4 NA NA NA 8 9 NA NA NA NA 0
# The result of my program is:
# $X
# [1] NA NA NA NA 0 4 NA NA NA 8 9 NA NA NA NA 0
#$Ind.sup
# [1] 1 2 3 4 7 8 9 12 13 14 15
A way:
f <- function(X, N)
{
  .rle <- rle(sort(X))
  res <- .rle$values[.rle$lengths >= N]
  res <- res[res > 0]
  inds <- X %in% res
  X[inds] <- NA
  list(X = X, Ind = which(inds))
}
# with X <- c(1,2,3,4,0,0,0,0,1,4,1,2,3,4,8,9,1,2,3,1,4,1,0,0,0), the example vector used in the next answer
#> f(X, 3)
#$X
# [1] NA NA NA NA 0 0 0 0 NA NA NA NA NA NA 8 9 NA NA NA NA NA NA 0 0 0
#
#$Ind
# [1] 1 2 3 4 9 10 11 12 13 14 17 18 19 20 21 22
Try using table and %in% to get faster speed due to vectorisation.
rmpParNASerieRepetee <- function(X, N){
  tab <- table(X[X > 0])
  over.n <- as.numeric(names(tab)[tab >= N])
  ind <- X %in% over.n
  Ind.sup <- which(ind)
  X <- ifelse(ind, NA, X)
  list(Ind.sup, X)
}
X <- c(1,2,3,4,0,0,0,0,1,4,1,2,3,4,8,9,1,2,3,1,4,1,0,0,0)
rmpParNASerieRepetee(X,3)
# [[1]]
# [1] 1 2 3 4 9 10 11 12 13 14 17 18 19 20 21 22
#
# [[2]]
# [1] NA NA NA NA 0 0 0 0 NA NA NA NA NA NA 8 9 NA NA NA NA NA NA 0 0 0
And a little test for 92000 values:
X <- sample(1:10000, 92000, TRUE)
system.time(rmpParNASerieRepetee(X, 3))
# user system elapsed
# 0.14 0.00 0.14
One way to think about this is that in a sequence, each element differs from the last one by 1, so:
X <- c(1,2,3,4,0,0,0,0,1,4,1,2,3,4,8,9,1,2,3,1,4,1,0,0,0)
y <- X[-1]
diffs <- y - X[-length(X)]  # same as diff(X); renamed so base::diff isn't masked
diffs
[1] 1 1 1 -4 0 0 0 1 3 -3 1 1 1 4 1 -8 1 1 -2 3 -3 -1 0 0
And now you're looking for runs of at least N - 1 consecutive 1s in diffs (N values increasing by 1 produce N - 1 ones).
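A minimal sketch of that last step using rle, assuming diffs as computed above and N = 3 as in the example:
r <- rle(diffs == 1)
long <- r$values & r$lengths >= N - 1  # runs of 1s that are long enough
ends <- cumsum(r$lengths)              # where each run ends in diffs
starts <- ends - r$lengths + 1         # where each run starts in diffs
# a difference at position i links X[i] and X[i + 1]
idx <- unlist(Map(function(s, e) s:(e + 1), starts[long], ends[long]))
Note, though, that this flags ascending stretches rather than repeated subsequences, so it only covers part of the original problem.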
I have optimized my function, and now it takes "only" 10 minutes for a vector of length 92,000. Maybe someone can find a solution faster than mine.
Imagine my vector is X <- c(1,2,3,4,0,7,8,1,2,3,NA,8,9,1,2,3,4) and N = 3.
c(1,2,3) and c(1,2,3,4) are the only repeated sequences of length at least N without NA or 0, so my result should be NA NA NA NA 0 7 8 NA NA NA NA 8 9 NA NA NA NA.
To answer my problem, I use this principle:
I create a big string, X_ <- "T1T2T3T4T0T7T8T1T2T3TNAT8T9T1T2T3T4T", in which all the values of X are concatenated, separated by T. For each little string built from N = 3 consecutive values (e.g., the first is "T1T2T3T"), I break the big string X_ with strsplit, using the little string as the pattern. If the result has more than two pieces, the sequence is repeated.
Care must be taken not to include zero values in the sequences, and some adaptation is needed to avoid edge effects (handled by borne in my function).
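A tiny illustration of the counting trick, a sketch with the example vector above and N = 3 (fixed = TRUE is my addition, to avoid regex interpretation):
X  <- c(1,2,3,4,0,7,8,1,2,3,NA,8,9,1,2,3,4)
X_ <- paste("T", paste(X, collapse = "T"), "T", sep = "")
s_ <- "T1T2T3T"  # the first little string
length(unlist(strsplit(X_, s_, fixed = TRUE))) - 1
#[1] 3
The pattern occurs three times, i.e. more than once, so the sequence c(1,2,3) is flagged as repeated.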
I created these functions, which work:
# Function to count NAs in a vector
count.na <- function(vec) { sum(is.na(vec)) }
# Function to detect sequences of strictly positive numbers of length at least N
rmpParNASerieRepetee <- function(X, N, val.min = 0)
{
  # Collapse the vector to make a big string
  X_ <- paste("T", paste(X, collapse = "T"), "T", sep = "")
  # Index term
  ind.parcours <- 1
  ind.sup <- c()
  # Loop on X values
  while (ind.parcours <= (length(X) - N + 1))
  {
    # Selection of the sequence to be detected
    deb <- ind.parcours
    fin <- ind.parcours + N - 1
    serie <- X[deb:fin]
    # All values are > 0
    if (length(which(serie > 0)) >= (N - count.na(serie)))
    {
      # Search for repetition with strsplit
      serie_ <- paste("T", paste(serie, collapse = "T"), "T", sep = "")
      borne <- 1 * (ind.parcours < (length(X) - N + 1))
      if (sum(X[(length(X) - N + 1):length(X)] == serie, na.rm = TRUE) == N) borne <- 0
      if (length(unlist(strsplit(X_, serie_))) - 1 > borne)
        ind.sup <- unique(c(ind.sup, deb:fin))
      # Incrementation
      ind.parcours <- ind.parcours + 1
    }
    # Contains 0
    else
    {
      ind.parcours <- ind.parcours + max(which(serie == 0))
    }
  }
  # Invalidation of repeated sequences
  if (length(ind.sup) != 0) { X[ind.sup] <- NA }
  # Return
  list_return <- list(X = X, Ind.sup = unique(sort(ind.sup)))
  return(list_return)
}
I hope someone will find another way to solve my problem!