Removing rows/columns with only one element from a binary matrix in R

I'm trying to remove "singletons" from a binary matrix. Here, singletons refers to elements that are the only "1" value in the row AND the column in which they appear. For example, given the following matrix:
> matrix(c(0,1,0,1,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,1,1), nrow=6)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    0    1    0    0    0    0    0
[2,]    1    0    1    0    0    0    0
[3,]    0    0    0    1    0    0    0
[4,]    1    1    0    0    0    0    0
[5,]    0    0    0    0    1    1    1
[6,]    0    0    0    0    1    0    1
...I would like to remove all of row 3 (and, if possible, all of column 4), because the 1 in [3,4] is the only 1 in that row/column combination. [1,2] is fine, since there are other 1's in column [,2]; similarly, [2,3] is fine, since there are other 1's in row [2,]. Any help would be appreciated - thanks!

You first want to find which rows and columns are singletons, and then check whether any singleton row and singleton column share an index (i.e. cross at the same 1). Here is a short bit of code to accomplish this task:
foo <- matrix(c(0,1,0,...), nrow=6)  # the example matrix from the question
singRows <- which(rowSums(foo) == 1)
singCols <- which(colSums(foo) == 1)
singCombinations <- expand.grid(singRows, singCols)
singPairs <- singCombinations[apply(singCombinations, 1,
    function(x) which(foo[x[1],] == 1) == x[2]),]
noSingFoo <- foo[-unique(singPairs[,1]), -unique(singPairs[,2])]
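For the example matrix, the intermediate objects are worth stepping through once (outputs worked out by hand from the matrix in the question):
singRows
# [1] 1 3
singCols
# [1] 3 4 6
singPairs
#   Var1 Var2
# 4    3    4
dim(noSingFoo)  # row 3 and column 4 have been dropped
# [1] 5 6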
With many singleton rows or columns you might need to make this a bit more efficient, but it does the job.
UPDATE: Here is the more efficient version I knew could be done. This way you loop only over the rows (or columns if desired) and not all combinations. Thus it is much more efficient for matrices with many singleton rows/columns.
## starting with foo and singRows as before
singPairRows <- singRows[sapply(singRows, function(singRow)
  sum(foo[, foo[singRow, ] == 1]) == 1)]
singPairs <- sapply(singPairRows, function(singRow)
  c(singRow, which(foo[singRow, ] == 1)))
noSingFoo <- foo[-singPairs[1,], -singPairs[2,]]
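A quick sanity check on the example matrix: the faster version finds the same single pair (3, 4), so the two results are identical:
singPairs
#      [,1]
# [1,]    3
# [2,]    4
identical(noSingFoo, foo[-3, -4])
# [1] TRUE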
UPDATE 2: I have compared the two methods (mine = non-sparse and @Chris's = sparse) using the rbenchmark package, over a range of matrix sizes (10 to 1000 rows/columns; square matrices only) and levels of sparsity (0.1 to 5 non-zero entries per row/column). The relative performance is shown in the heat map below: white marks equal performance (log2 ratio of run times), red marks where the sparse method is faster, and blue where the non-sparse method is faster. Note that I am not including the conversion to a sparse matrix in the timings, so that will add some time to the sparse method. Just thought it was worth a little effort to see where this boundary lies.

cr1msonB1ade's way is a great answer. For more computationally intensive matrices (millions x millions), you can use this method:
Encode your matrix in sparse notation:
DT <- structure(list(i = c(1, 2, 2, 3, 4, 4, 5, 5, 5, 6, 6),
                     j = c(2, 1, 3, 4, 1, 2, 5, 6, 7, 5, 7),
                     val = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)),
                .Names = c("i", "j", "val"),
                row.names = c(NA, -11L), class = "data.frame")
Gives (0s are implicit)
> DT
i j val
1 1 2 1
2 2 1 1
3 2 3 1
4 3 4 1
5 4 1 1
6 4 2 1
7 5 5 1
8 5 6 1
9 5 7 1
10 6 5 1
11 6 7 1
Then we can filter using:
library(data.table)
DT <- data.table(DT)
DT[, rowcount := .N, by = i]
DT[, colcount := .N, by = j]
Giving:
> DT[!(rowcount * colcount == 1)]
i j val rowcount colcount
1: 1 2 1 1 2
2: 2 1 1 2 2
3: 2 3 1 2 1
4: 4 1 1 2 2
5: 4 2 1 2 2
6: 5 5 1 3 2
7: 5 6 1 3 1
8: 5 7 1 3 2
9: 6 5 1 2 2
10: 6 7 1 2 2
(Note the (3,4) row is now missing)
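If a dense matrix is needed afterwards, one option (my addition, not part of the original answer) is to rebuild it with Matrix::sparseMatrix and then drop the now-empty row 3 and column 4:
library(Matrix)
kept <- DT[!(rowcount * colcount == 1)]
m <- as.matrix(sparseMatrix(i = kept$i, j = kept$j, x = kept$val, dims = c(6, 7)))
m[rowSums(m) > 0, colSums(m) > 0]  # back to the 5 x 6 dense result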

Related

Classify column values into new column with specific output

I want to classify the rows of a data frame based on a threshold applied to a numeric reference column. If the reference value is below the threshold, the result is 0, which I want to add to a new column. If the reference value is over the threshold, the new column gets 1 for all consecutive rows over the threshold, until a 0 row comes up again. The next run of values over the threshold then gets 2, and so on.
If we set up the threshold > 2 then an example of what I would like to obtain is:
row  reference  result
  1          2       0
  2          1       0
  3          4       1
  4          3       1
  5          1       0
  6          6       2
  7          8       2
  8          4       2
  9          1       0
 10          3       3
 11          6       3
row <- c(1:11)
reference <- c(2,1,4,3,1,6,8,4,1,3,6)
result <- c(0,0,1,1,0,2,2,2,0,3,3)
table <- cbind(row, reference, result)
Thank you!
We can use run-length encoding (rle) for this.
The below assumes a data.frame:
r <- rle(quux$reference <= 2)
r$values <- ifelse(r$values, 0, cumsum(!r$values))
quux$result2 <- inverse.rle(r)
quux
# row reference result result2
# 1 1 2 0 0
# 2 2 1 0 0
# 3 3 4 1 1
# 4 4 3 1 1
# 5 5 1 0 0
# 6 6 6 2 2
# 7 7 8 2 2
# 8 8 4 2 2
# 9 9 1 0 0
# 10 10 3 3 3
# 11 11 6 3 3
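To see why this works, it helps to look at the run-length encoding itself (using the Data below):
rle(quux$reference <= 2)
# Run Length Encoding
#   lengths: int [1:6] 2 2 1 3 1 2
#   values : logi [1:6] TRUE FALSE TRUE FALSE TRUE FALSE
Each FALSE run is a block of rows above the threshold; cumsum(!r$values) numbers those blocks 1, 2, 3, the ifelse() zeroes the TRUE runs, and inverse.rle() expands everything back to row level.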
Data
quux <- structure(list(row = 1:11, reference = c(2, 1, 4, 3, 1, 6, 8, 4, 1, 3, 6), result = c(0, 0, 1, 1, 0, 2, 2, 2, 0, 3, 3)), row.names = c(NA, -11L), class = "data.frame")
As noted in the comments by @Sotos, consider a different name for your object, since table is also a base R function.
Since it wasn't clear whether you have a data.frame or a matrix, assume we have a data.frame df based on your data:
df <- as.data.frame(table)
And have a threshold of 2:
threshold = 2
You can adapt this solution by #flodel:
df$new_result = ifelse(
  x <- df$reference > threshold,
  cumsum(c(x[1], diff(x) == 1)),
  0)
df
In this case, diff(x) produces a vector in which a value of 1 marks each transition from FALSE to TRUE (0 to 1), i.e. where reference crosses from below to above the threshold (rows 3, 6, and 10 in the sample data); cumsum then numbers these transitions. Note that x[1] is prepended because the diff() output is one element shorter than its input.
The ifelse then applies these incremental values only where reference exceeds the threshold, and otherwise sets 0.
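Laid out explicitly for the sample data, the intermediate steps look like this:
x <- df$reference > threshold
x
#  [1] FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
cumsum(c(x[1], diff(x) == 1))
#  [1] 0 0 1 1 1 2 2 2 2 3 3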
Output
row reference result new_result
1 1 2 0 0
2 2 1 0 0
3 3 4 1 1
4 4 3 1 1
5 5 1 0 0
6 6 6 2 2
7 7 8 2 2
8 8 4 2 2
9 9 1 0 0
10 10 3 3 3
11 11 6 3 3

Get the first non-null value from selected cells in a row

Good afternoon, friends!
I'm currently performing some calculations in R (df is displayed below). My goal is to display in a new column the first non-null value from selected cells for each row.
My df is:
MD <- c(100, 200, 300, 400, 500)
liv <- c(0, 0, 1, 3, 4)
liv2 <- c(6, 2, 0, 4, 5)
liv3 <- c(1, 1, 1, 1, 1)
liv4 <- c(1, 0, 0, 3, 5)
liv5 <- c(0, 2, 7, 9, 10)
df <- data.frame(MD, liv, liv2, liv3, liv4, liv5)
I want to display (in a column called "liv6") the first non-zero value from the five liv cells of each row. Given the data, the first row has liv = 0, liv2 = 6, liv3 = 1, liv4 = 1 and liv5 = 0, so the result should be 6. This calculation should be repeated for each row in my dataframe.
I do know how to do this in Python, but not in R.
Any help is highly appreciated!
One option with dplyr could be:
library(dplyr)
df %>%
  rowwise() %>%
  mutate(liv6 = with(rle(c_across(liv:liv5)), values[which.max(values != 0)]))
MD liv liv2 liv3 liv4 liv5 liv6
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 100 0 6 1 1 0 6
2 200 0 2 1 0 2 2
3 300 1 0 1 0 7 1
4 400 3 4 1 3 9 3
5 500 4 5 1 5 10 4
A Base R solution:
df$liv6 <- apply(df[-1], 1, function(x) x[min(which(x != 0))])
output
df
   MD liv liv2 liv3 liv4 liv5 liv6
1 100   0    6    1    1    0    6
2 200   0    2    1    0    2    2
3 300   1    0    1    0    7    1
4 400   3    4    1    3    9    3
5 500   4    5    1    5   10    4
A simple base R option is to apply across relevant columns (I exclude MD here, you can use any data frame subsetting style you want), then just take the first value of the non-zero values of that row.
df$liv6 <- apply(df[-1], 1, \(x) head(x[x > 0], 1))
df
#> MD liv liv2 liv3 liv4 liv5 liv6
#> 1 100 0 6 1 1 0 6
#> 2 200 0 2 1 0 2 2
#> 3 300 1 0 1 0 7 1
#> 4 400 3 4 1 3 9 3
#> 5 500 4 5 1 5 10 4
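One caveat worth adding (my note, not part of the answer): if a row can be all zeros, head(x[x > 0], 1) returns a zero-length vector and apply() degrades to a list. Supplying a default value keeps the result an atomic vector:
df$liv6 <- apply(df[-1], 1, \(x) c(x[x > 0], NA)[1])  # NA for all-zero rows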
One approach is to use purrr::detect to detect the first non-zero element of each row.
We define a function which takes a numeric vector (row) and returns a boolean indicating whether each element is non-zero:
is_nonzero <- function(x) x != 0
We use this function to detect the first non-zero element in each row via purrr::detect:
library(dplyr)  # needed for the %>% pipe
first_nonzero <- apply(df %>% dplyr::select(liv:liv5), 1, function(x) {
  purrr::detect(x, is_nonzero, .dir = "forward")
})
We finally create the new column:
df$liv6 <- first_nonzero
As a result, we have
> df
   MD liv liv2 liv3 liv4 liv5 liv6
1 100   0    6    1    1    0    6
2 200   0    2    1    0    2    2
3 300   1    0    1    0    7    1
4 400   3    4    1    3    9    3
5 500   4    5    1    5   10    4
Another straightforward solution is:
Reduce(function(x, y) ifelse(!x, y, x), df[, -1])
#[1] 6 2 1 3 4
This way should be very efficient, since we "scan" by column, as, presumably, the data have much fewer columns than rows.
The Reduce approach is a more functional form of a simple, old-school, loop:
ans = df[, 2]
for(j in 3:ncol(df)) {
  i = !ans
  ans[i] = df[i, j]
}
ans
#[1] 6 2 1 3 4

Have 3 matrices of same dimensions - I want to get the highest value of each cell of the three different matrices

Basically I have 3 matrices of the same dimensions. They only consist of values 0 , 1, 2 ,3. I would like to create a new matrix that takes the highest value from each of the corresponding matrices.
For example, if the first row of the matrices are as follows:
A: 0 1 0 0 1
B: 2 0 0 2 0
C: 0 3 0 3 0
Final: 2 3 0 3 1
I was trying to do a for function with apply but I couldn't get it working.
Edit: I think pmax is the function to use, according to the comments. Thanks! I'm just starting out and learning R, so sorry if this is a simple question.
Here's some sample data:
m1 <- matrix(sample(0:3, 12, replace = TRUE), 4)
m2 <- matrix(sample(0:3, 12, replace = TRUE), 4)
m3 <- matrix(sample(0:3, 12, replace = TRUE), 4)
And the result
pmax(m1, m2, m3)
# [,1] [,2] [,3]
# [1,] 3 1 3
# [2,] 2 3 1
# [3,] 1 3 3
# [4,] 3 3 3
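Since pmax() accepts any number of arguments, this generalizes to a whole list of matrices via do.call (my extension, assuming the matrices are collected in a list):
mats <- list(m1, m2, m3)
do.call(pmax, mats)  # same result as pmax(m1, m2, m3)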

Rescore Items from Scoring Key

I have a set of data on which respondents were given a series of questions, each with five response options (e.g., 1:5). Given those five options, I have a scoring key for each question, where some responses are worth full points (e.g., 2), others half points (1), and others no points (0). So, the data frame is n (people) x k (questions), and the scoring key is a k (questions) x m (responses) matrix.
What I am trying to do is to programmatically create a new dataset of the rescored items. Trivial dataset:
x <- sample(c(1:5), 50, replace = TRUE)
y <- sample(c(1:5), 50, replace = TRUE)
z <- sample(c(1:5), 50, replace = TRUE)
dat <- data.frame(cbind(x,y,z)) # 3 items, 50 observations (5 options per item)
head(dat)
x y z
1 3 1 2
2 2 1 3
3 5 3 4
4 1 4 5
5 1 3 4
6 4 5 4
# Each option is scored 0, 1, or 2:
key <- matrix(sample(c(0,0,1,1,2), size = 15, replace = TRUE), ncol=5)
key
[,1] [,2] [,3] [,4] [,5]
[1,] 0 0 0 1 2
[2,] 2 1 1 1 2
[3,] 2 2 1 1 2
Some other options, firstly using Map:
data.frame(Map( function(x,y) key[y,x], dat, seq_along(dat) ))
# x y z
#1 0 2 2
#2 0 2 1
#3 2 1 1
#4 0 1 2
#5 0 1 1
#6 1 2 1
Secondly using matrix indexing on key:
newdat <- dat
newdat[] <- key[cbind( as.vector(col(dat)), unlist(dat) )]
newdat
# x y z
#1 0 2 2
#2 0 2 1
#3 2 1 1
#4 0 1 2
#5 0 1 1
#6 1 2 1
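This works because indexing a matrix with a two-column matrix looks up one (row, column) pair per line; here each pair is (question, chosen response). Using the head(dat) shown in the question, the first rows of the index matrix would be:
head(cbind(as.vector(col(dat)), unlist(dat)))
#   [,1] [,2]
#x1    1    3
#x2    1    2
#x3    1    5
#x4    1    1
#x5    1    1
#x6    1    4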
Things would be even simpler if you specified key as a list:
key <- list(x=c(0,0,0,1,2),y=c(2,1,1,1,2),z=c(2,2,1,1,2))
data.frame(Map("[",key,dat))
# x y z
#1 0 2 2
#2 0 2 1
#3 2 1 1
#4 0 1 2
#5 0 1 1
#6 1 2 1
For posterity, I was discussing this issue with a friend, who suggested another approach. The benefit is that it still uses mapvalues() to do the rescoring, but replaces the for loop with sapply, using the function's index argument to select the right row of the key.
library(plyr)
scored <- sapply(1:ncol(dat), function(x, dat, key){
  mapvalues(dat[, x], from = 1:ncol(key), to = key[x, ])
}, dat = dat, key = key)
My current working approach uses 1) mapvalues(), which lives in package plyr, to do the heavy lifting: it takes a vector of data to modify, plus two additional parameters, "from" (the original values, here 1:5) and "to" (what we want to convert them to); and 2) a for loop with index notation, in which we cycle through the available questions, extract the response vector for each using the current loop value, and use that value to select the proper row from our scoring key.
library(plyr)
newdat <- matrix(data=NA, nrow=nrow(dat), ncol=ncol(dat))
for (i in 1:3) {
  newdat[,i] <- mapvalues(dat[,i], from = c(1,2,3,4,5),
                          to = c(key[i,1], key[i,2], key[i,3], key[i,4], key[i,5]))
}
head(newdat)
[,1] [,2] [,3]
[1,] 0 2 2
[2,] 0 2 1
[3,] 2 1 1
[4,] 0 1 2
[5,] 0 1 1
[6,] 1 2 1
I am pretty happy with this solution, but if anyone has any better approaches, I would love to see them!

Extract integers from ranges

In R, what's an efficient way to extract the integers from ranges?
Let's say I have a matrix of ranges (column1=start, column2=end)
1 5
3 6
10 13
I would like to store the encompassing unique integers of all the ranges in the matrix into an object:
1
2
3
4
5
6
10
11
12
13
This would be applied to a matrix containing ~4 million ranges, so hopefully someone can offer a solution that is somewhat efficient.
Suppose you had start = 3, end = 7, and you'd marked each as a '1' on a number line starting at 1
starts: 0 0 1 0 0 0 0 0 0 ...
ends + 1: 0 0 0 0 0 0 0 1 0 ...
The cumulative sum of the starts, the cumulative sum of the ends, and the difference between the two are:
cumsum(starts): 0 0 1 1 1 1 1 1 1 ...
cumsum(ends + 1): 0 0 0 0 0 0 0 1 1 ...
diff: 0 0 1 1 1 1 1 0 0
and the locations of the 1's in the diff are
which(diff > 0): 3 4 5 6 7
Use tabulate to allow for multiple starts / ends at the same location, and wrap the computation in a function:
range2 <- function(ranges)
{
    max <- max(ranges)
    starts <- tabulate(ranges[, 1], max)
    ends <- tabulate(ranges[, 2] + 1L, max)
    which(cumsum(starts) - cumsum(ends) > 0L)
}
For the question, this gives
> eg <- matrix(c(1, 3, 10, 5, 6, 13), 3)
> range2(eg)
[1] 1 2 3 4 5 6 10 11 12 13
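The tabulate() calls are what make repeated endpoints safe; for instance (my example, not from the answer), two identical ranges still yield each integer once:
range2(matrix(c(1, 1, 5, 5), 2))
[1] 1 2 3 4 5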
It is pretty fast; for Andrie's example below:
> system.time(runs <- range2(xx))
user system elapsed
0.108 0.000 0.111
(this sounds a bit like DNA sequence analysis, for which GenomicRanges might be your friend; you'd use the coverage and slice functions on reads, perhaps input with readGappedAlignments).
I don't know that it is particularly efficient, but if your matrix of ranges is ranges then the following should work:
unique(unlist(apply(ranges, 1, function(x) x[1]:x[2])))
Use sequence and rep:
x <- matrix(c(1, 5, 3, 6, 10, 13), ncol=2, byrow=TRUE)
ranges <- function(x){
  len <- x[, 2] - x[, 1] + 1
  # allocate space
  a <- b <- vector("numeric", sum(len))
  a <- rep(x[, 1], len)
  b <- sequence(len) - 1
  unique(a + b)
}
ranges(x)
[1] 1 2 3 4 5 6 10 11 12 13
Since this makes use of only vectorised code, this should be quite fast, even for large data sets. On my machine an input matrix of 1 million rows takes ~5 seconds to run:
set.seed(1)
xx <- sample(1e6, 1e6)
xx <- matrix(c(xx, xx+sample(1:100, 1e6, replace=TRUE)), ncol=2)
str(xx)
int [1:1000000, 1:2] 265509 372124 572853 908206 201682 898386 944670 660794 629110 61786 ...
system.time(zz <- ranges(xx))
user system elapsed
4.33 0.78 5.22
str(zz)
num [1:51470518] 265509 265510 265511 265512 265513 ...
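As a side note (my variation, not in the original answers): since R 4.0.0, sequence() accepts a from argument, so the same rep/sequence idea collapses to a one-liner:
ranges2 <- function(x) unique(sequence(x[, 2] - x[, 1] + 1, from = x[, 1]))
ranges2(x)
[1]  1  2  3  4  5  6 10 11 12 13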
Is it not something as simple as:
x <- matrix(c(1, 5, 3, 6, 10, 13), ncol=2, byrow=TRUE)
do.call(":",as.list(range(x)))
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13
Edit
Looks like I got the wrong end of the stick, but my answer can be modified to use union, although this is just a wrapper for unique:
Reduce("union",apply(x,1,function(y) do.call(":",as.list(y))))
[1] 1 2 3 4 5 6 10 11 12 13
