Write an R function to find binary subsets - r

I have a binary vector x = (1,0,0,1). The lower-order terms of this vector, including itself, are assumed to be (0,0,0,0), (0,0,0,1), (1,0,0,0) and (1,0,0,1). How do I find these lower-order vectors in R?
What I understand so far: basically we want to find the subsets by replacing each 1 with 0. But how to do it in R? I am clueless.
Here is what I tried so far:
a <- c(1, 0, 0, 1)
M <- length(a)
for (i in 1:M) {
  ifelse(a[i] == 1, a[i] <- 0, next)
  print(a)
}
[1] 0 0 0 1
[1] 0 0 0 0
What I am looking for, in detail: say I have 4 factors A, B, C, D. Here (1,0,0,1) means AD.
Now I want the subsets of (1,0,0,1), i.e. of AD. The subsets cannot contain B or C. The result will be {} {A} {D} {AD}, in binary form (0,0,0,0), (1,0,0,0), (0,0,0,1), (1,0,0,1).

Here's a method relying on expand.grid to do the heavy lifting:
vecs = lapply(a, seq, 0) # keep 0s as 0, make 1s c(1, 0)
do.call(expand.grid, vecs) # generate all combinations
# Var1 Var2 Var3 Var4
# 1 1 0 0 1
# 2 0 0 0 1
# 3 1 0 0 0
# 4 0 0 0 0
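Since the question asks for a function, this is easy to wrap up. A minimal sketch (the name binary_subsets and the as.matrix conversion are my additions):
binary_subsets <- function(a) {
  vecs <- lapply(a, seq, 0)  # keep 0s as 0, make 1s c(1, 0)
  as.matrix(do.call(expand.grid, vecs))
}
binary_subsets(c(1, 0, 0, 1))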

Using RcppAlgos::permuteGeneral.
library(RcppAlgos)
A <- t(apply(permuteGeneral(length(a), sum(a)), 1, function(x) {a[x] <- 0; a}))
A[!duplicated(A), ]
# [,1] [,2] [,3] [,4]
# [1,] 0 0 0 1
# [2,] 0 0 0 0
# [3,] 1 0 0 1
# [4,] 1 0 0 0

We can use the which, combn, and *apply functions to perform this operation. Since this is a step-by-step operation, it may be helpful to look at the results line-by-line.
Here it is wrapped in a function called find_binary_subsets:
find_binary_subsets <- function(x){
  # where does x equal 1
  x_eq_1 <- which(x == 1)
  # combinations of indexes where x == 1
  l_w_x <- lapply(length(x_eq_1):1,
                  FUN = function(l) combn(x_eq_1, l))
  # loop over the combinations of indexes where x == 1, replace by 0, return vector
  # apply(., 2) loops over the columns of a matrix, which is what we want
  combs <- lapply(l_w_x,
                  FUN = function(d)
                    apply(d, 2, FUN = function(i){x[i] <- 0; x}))
  # cbind results, then transpose to arrange by row
  t(cbind(do.call("cbind", combs), x))
}
find_binary_subsets(a)
find_binary_subsets(a)
[,1] [,2] [,3] [,4]
0 0 0 0
0 0 0 1
1 0 0 0
x 1 0 0 1
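A quick sanity check, assuming a 0/1 input vector: every lower-order vector corresponds to zeroing some subset of the positions holding a 1, so the result should have 2^sum(a) rows.
res <- find_binary_subsets(a)
nrow(res) == 2^sum(a)
# [1] TRUE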

Related

Count number of rows for each column in a matrix with non-zero cells in R [duplicate]

Very new to R and I have a .rda file that contains a matrix of gene IDs and counts for each ID in 96 columns. It looks like this:
I want to get separate counts for the number of non-zero items in each column. I've been trying the sum() function in a loop, but perhaps I don't understand loop syntax in R. Any help appreciated.
Thanks!
Forest
What about:
apply(your.matrix, 2, function(c) sum(c != 0))
Does this help?
edit:
Even better:
colSums(your.matrix != 0)
edit 2:
Here we go, with an example for ya:
> example = matrix(sample(c(0,0,0,100),size=70,replace=T),ncol=7)
> example
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 0 100 0 0 100 0 100
[2,] 100 0 0 0 0 0 100
[3,] 0 0 0 0 0 0 100
[4,] 0 100 0 0 0 0 0
[5,] 0 0 100 100 0 0 0
[6,] 0 0 0 100 0 0 0
[7,] 0 100 100 0 0 0 0
[8,] 100 0 0 0 0 0 0
[9,] 100 100 0 0 100 0 0
[10,] 0 0 0 0 0 100 0
> colSums(example != 0)
[1] 3 4 2 2 2 1 3
(New example: the previous example with '1' values was not suited to showing that we are summing the number of non-zero cells, not their contents.)
With x being a column or vector:
length(which(x != 0))
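As a side note, sum(x != 0) gives the same count more directly, since each TRUE counts as 1 when summed:
x <- c(0, 3, 0, 5)
length(which(x != 0))
# [1] 2
sum(x != 0)
# [1] 2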
Another method using plyr's numcolwise:
library(plyr)
dat <- data.frame(a = sample(1:25, 25),
                  b = rep(0, 25),
                  c = sample(1:25, 25))
nonzero <- function(x) sum(x != 0)
numcolwise(nonzero)(dat)
a b c
1 25 0 25
There is also a way to count, for each row, the number of columns that are zero. This one uses dplyr.
First, the data frame's operation mode needs to be set to rowwise(); then columns must be subset with c_across(), which returns a vector that can be used in any function that takes vectors. Finally, the values are assigned to a new column using mutate().
library(dplyr)
df <- data.frame(a = sample(0:10, 100, replace = T),
                 b = sample(0:10, 100, replace = T),
                 c = sample(0:10, 100, replace = T))
df %>%
  rowwise() %>%
  mutate(N_zeros = sum(c_across(everything()) == 0))
This idea can also be modified for any other operation that would take all or a subset of columns for row-wise operation.
See documentation of c_across() for more details. Tested with dplyr version 1.0.6.
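For a per-column count of zeros, the same can be done without rowwise() by combining summarise() with across(); a sketch, assuming dplyr >= 1.0:
df %>%
  summarise(across(everything(), ~ sum(.x == 0)))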

R: Mutate last sequence of specific values

I have a dataframe containing columns with 0's and 1's. I want to mutate the last sequence of 1's into zeros like this:
# data
a <- c(0,1,1,0,1,1,1)
b <- c(0,1,1,1,0,1,1)
c <- data.frame(cbind(a,b))
head(c,7)
# desired output
a_desired <- c(0,1,1,0,0,0,0)
b_desired <- c(0,1,1,1,0,0,0)
c_desired <- data.frame(cbind(a_desired,b_desired))
head(c_desired,7)
such that I end up with the same sequence except that the last sequence of 1's has been mutated into 0's. I've tried using tail() but haven't found a solution so far.
You may try using rle:
apply(c, 2, function(x){
  y <- max(which(rle(x == 1)$values))
  x[(sum(rle(x == 1)$lengths[1:(y-1)]) + 1):sum(rle(x == 1)$lengths[1:y])] <- 0
  x
})
a b
[1,] 0 0
[2,] 1 1
[3,] 1 1
[4,] 0 1
[5,] 0 0
[6,] 0 0
[7,] 0 0
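To see what rle() contributes here: it encodes a vector as runs of equal values, so the last run of TRUEs (the final stretch of 1's) can be located by index. For example, on column a:
rle(a == 1)
# Run Length Encoding
#   lengths: int [1:4] 1 2 1 3
#   values : logi [1:4] FALSE TRUE FALSE TRUE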
A purrr::map variant:
library(purrr)
map(c, function(x){
  last1 <- max(which(x == 1))
  last0 <- which(x[1:last1] == 0)
  c(x[seq_len(max(last0))], rep(0, length(x) - max(last0)))
})
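Since map() returns a list of vectors, wrap the result to get a data frame back. A sketch, where zero_last_run is simply a name (mine) for the anonymous function above:
zero_last_run <- function(x){
  last1 <- max(which(x == 1))
  last0 <- which(x[1:last1] == 0)
  c(x[seq_len(max(last0))], rep(0, length(x) - max(last0)))
}
as.data.frame(map(c, zero_last_run))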
You can try numbering the runs of 1's with cumsum() over diff() and replacing the values where this count reaches its maximum, i.e. from the start of the last run of 1's onward.
sapply(c, function(x) {
  . <- cumsum(diff(c(0, x) == 1) == 1)
  `[<-`(x, . == max(.), 0L)
  #replace(x, . == max(.), 0L) # alternative to [<-
})
# a b
#[1,] 0 0
#[2,] 1 1
#[3,] 1 1
#[4,] 0 1
#[5,] 0 0
#[6,] 0 0
#[7,] 0 0
Or the same but written in a different way (thanks to @thelatemail):
sapply(c, function(x) {
  cs <- cumsum(diff(c(0, x) == 1) == 1)
  x[cs == max(cs)] <- 0L
  x
})
Or another variant, iterating from the last element to the beginning until a 0 is found:
sapply(c, function(x) {
  n <- length(x)
  i <- n
  while (x[i] != 1 & i > 1L) i <- i - 1L
  while (x[i] != 0 & i > 1L) i <- i - 1L
  x[i:n] <- 0L
  x
})
You can write your own function:
fun <- function(x){
  y <- rle(x)
  y$values[length(y$values)] <- 0
  inverse.rle(y)
}
Now run:
data.frame(sapply(c, fun))
a b
1 0 0
2 1 1
3 1 1
4 0 1
5 0 0
6 0 0
7 0 0
If your sequences always end with 1s, you can try (given df <- data.frame(a, b)):
> df * sapply(df, function(x) rev(cumsum(rev(x != 1)) != 0))
a b
1 0 0
2 1 1
3 1 1
4 0 1
5 0 0
6 0 0
7 0 0

Replace specific numbers that follow each other in a matrix

I am working with survey data and I would like to replace specific values - that follow each other - in a data frame.
For example:
v1 v2 v3 v4 v5
0 2 0 0 55
0 0 3 0 1
3 0 1 1 2
0 2 0 2 0
If I replace (0,2,0) with 1's and the rest of the data frame with 0's, the new matrix will look like
v1 v2 v3 v4 v5
1 1 1 0 0
0 0 0 0 0
0 0 0 0 0
1 1 1 1 1
How can I do this for a specific pattern of any length, e.g. (1,3) or (1,2,4,5,8,2)?
As others have pointed out, you need to clarify your question a bit to make sure that we are answering it correctly.
My assumptions are that, you are matching a pattern (c(0,2,0)) in the example you show, and you are only matching the pattern across rows. That is, it cannot wrap from row 2, column 5 to row 3, column 1, nor will it check matches in the columns only.
If those assumptions are correct, then the following function will work. It replicates the example you provided and returns a matrix. You can modify the replacement value (rep_with) and fill value (fill_val), 1 and 0 in your example respectively, with the optional parameters. This function could also be improved to be more elegant, but I think it works.
Code
replace_pattern <- function(x, pattern, rep_with = 1, fill_val = 0)
{
  n <- length(pattern)
  if (n > ncol(x))
    stop("pattern is longer than number of columns")
  new_x <- matrix(fill_val, nrow = nrow(x), ncol = ncol(x))
  # loop over each row
  for (rr in seq_len(nrow(x))) {
    # start matching the pattern at the entry = length of pattern
    # and look backwards
    for (cc in n:ncol(x)) {
      cur_cols <- (cc - n + 1):cc
      cur_vals <- x[rr, cur_cols]
      # if it matches the pattern, replace the values with the specified value
      if (isTRUE(all.equal(cur_vals, pattern, check.attributes = FALSE))) {
        new_x[rr, cur_cols] <- rep_with
      }
    }
  }
  new_x
}
Testing
Using your example, and setting it to xx
xx:
v1 v2 v3 v4 v5
0 2 0 0 55
0 0 3 0 1
3 0 1 1 2
0 2 0 2 0
And then calling replace_pattern(xx, c(0, 2, 0)) returns the second matrix you provided.
x2 <- xx
x2[2, 2] <- 1
replace_pattern(x2, c(1, 3))
Returns:
[,1] [,2] [,3] [,4] [,5]
[1,] 0 0 0 0 0
[2,] 0 1 1 0 0
[3,] 0 0 0 0 0
[4,] 0 0 0 0 0
And
xx3 <- rbind(xx, c(1, 2, 4, 5, 8))
replace_pattern(xx3, c(1, 2, 4, 5, 8))
Returns:
[,1] [,2] [,3] [,4] [,5]
[1,] 0 0 0 0 0
[2,] 0 0 0 0 0
[3,] 0 0 0 0 0
[4,] 0 0 0 0 0
[5,] 1 1 1 1 1
Finally, replace_pattern(xx, c(1,2,4,5,8,2)) will fail because the pattern is longer than the number of columns.

Group column in blocks of 4 observations and return 1 based on a condition

Here my example data.frame:
df = read.table(text = 'Value
1
1
0
1
0
0
0
0
0
0
1
1
1
0
1
1
0
0
0
0', header = TRUE)
I need to divide the column into blocks of 4, and if there exists at least one 1 within a block, I need to return a data.frame with 1s.
Here my expected result:
Result
1
1
1
Or also return a data.frame with 1s and 0s, where 0s are printed for blocks with no 1s:
Result
1
0
1
1
0
A simple vectorized way could be to convert to a 4-row matrix and then run colSums (this, of course, assumes your data length is divisible by 4):
as.integer(colSums(matrix(df$Value, 4)) > 0)
# [1] 1 0 1 1 0
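The divisibility assumption can be lifted by padding with zeros, which can never flip a block to 1. A sketch:
v <- df$Value
pad <- (4 - length(v) %% 4) %% 4  # zeros needed to reach a multiple of 4
as.integer(colSums(matrix(c(v, rep(0, pad)), 4)) > 0)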
Or using the matrixStats package
matrixStats::colMaxs(matrix(df$Value, 4))
# [1] 1 0 1 1 0
With data.table
library(data.table)
setDT(df)[, grp := as.integer(gl(.N, 4, .N))][, +(any(Value==1)) , grp]$V1
#[1] 1 0 1 1 0
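Here gl() builds the grouping factor: gl(n, k, length) generates n levels in blocks of k, truncated or recycled to the given length. For example:
gl(5, 4, 20)
# [1] 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5
# Levels: 1 2 3 4 5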
Or with rowsum from base R
+(rowsum(df$Value, gl(20, 4, 20))>0)
Or with tidyverse:
library(dplyr)
df %>%
  group_by(grp = gl(n(), 4, n())) %>%
  summarise(Value = max(Value))
Here is a base R method that uses matrix multiplication to calculate the sum of each block of 4 values and then uses sign to convert positive sums to 1.
sign(rep(1, 4) %*% matrix(df$Value, 4))
[,1] [,2] [,3] [,4] [,5]
[1,] 1 0 1 1 0
Note that this assumes that the vector is made up of 0s and 1s. It may fail if there are negative numbers.
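To see why negative numbers break it: a 1 and a -1 in the same block cancel, so the block sums to 0 even though a 1 is present.
sign(rep(1, 4) %*% matrix(c(1, -1, 0, 0), 4))
#      [,1]
# [1,]    0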

What is the least memory-demanding method to do these steps?

I posted a question yesterday and got wonderful responses from the experts. However, I am facing another problem now: the job cannot be done on my real data, as my starting file (df1) is too large. I wonder if there is a faster method to do the same job without using adply or a for loop?
My original question is listed below:
Step 1: I have a simplified dataframe like this:
df1 = data.frame(B = c(1,0,1), C = c(1,1,0),
                 D = c(1,0,1), E = c(1,1,0), F = c(0,0,1),
                 G = c(0,1,0), H = c(0,0,1), I = c(0,1,0))
B C D E F G H I
1 1 1 1 1 0 0 0 0
2 0 1 0 1 0 1 0 1
3 1 0 1 0 1 0 1 0
Step 2: I want to do row-wise subtraction, i.e. (row1 - row2), (row1 - row3) and (row2 - row3):
row1-row2 1 0 1 0 0 -1 0 -1
row1-row3 0 1 0 1 -1 0 -1 0
row2-row3 -1 1 -1 1 -1 1 -1 1
Step 3: replace all -1 with 0:
row1-row2 1 0 1 0 0 0 0 0
row1-row3 0 1 0 1 0 0 0 0
row2-row3 0 1 0 1 0 1 0 1
Would you mind teaching me how to do this in a less memory-demanding way?
The fastest way I know to do step 2 is to use indices into df1 for the various pairwise comparisons you want to do. The combn() function can be used to generate the set of row-by-row comparisons required. (This will be the rate-limiting step for big data sets.)
To form the combinations of row-by-row operations:
> cmb <- combn(as.numeric(rownames(df1)), 2)
> cmb
[,1] [,2] [,3]
[1,] 1 1 2
[2,] 2 3 3
The rows of cmb contain the two sets of indices into df1 needed to form the three rows of your requested output. (The 3 columns correspond to the 3 rows in your expected result.)
The next step is to use the two rows of cmb to index df1 and use a standard vectorised operation in R via -, e.g.:
> (out <- df1[cmb[1,], ] - df1[cmb[2,], ])
B C D E F G H I
1 1 0 1 0 0 -1 0 -1
1.1 0 1 0 1 -1 0 -1 0
2 -1 1 -1 1 -1 1 -1 1
Step 3 can now be done, although I am assuming that there can only be 1, 0, and -1 values in the resulting output:
> out[out < 0] <- 0
> out
B C D E F G H I
1 1 0 1 0 0 0 0 0
1.1 0 1 0 1 0 0 0 0
2 0 1 0 1 0 1 0 1
Which is consistent with the output you requested.
For big operations, doing this with matrices might be faster overall. So we could do:
> mat <- data.matrix(df1)
> cmb <- combn(seq_len(NROW(mat)), 2)
> cmb
[,1] [,2] [,3]
[1,] 1 1 2
[2,] 2 3 3
> out2 <- mat[cmb[1,], ] - mat[cmb[2,], ]
> out2[out2 < 0] <- 0
> out2
B C D E F G H I
[1,] 1 0 1 0 0 0 0 0
[2,] 0 1 0 1 0 0 0 0
[3,] 0 1 0 1 0 1 0 1
If you need the rownames as you show, then you can easily generate these at the end:
> apply(cmb, 2, function(x) paste("row", x[1], "-row", x[2], sep = ""))
[1] "row1-row2" "row1-row3" "row2-row3"
which can be used as:
> rownames(out) <- apply(cmb, 2, function(x) paste("row", x[1], "-row", x[2], sep = ""))
> out
B C D E F G H I
row1-row2 1 0 1 0 0 0 0 0
row1-row3 0 1 0 1 0 0 0 0
row2-row3 0 1 0 1 0 1 0 1
Using the sqldf package or RSQLite directly would allow one to do this with all computations done outside of R so that there would be no intermediate storage required. We illustrate using sqldf. See the sqldf home page for more info.
Alternative 1 In this approach note that we use dbname = tempfile() so that it performs all computations in an external database (which it creates on the fly and automatically deletes) rather than doing it in memory.
library(sqldf)
gc()
DF <- sqldf("select x.rowid x, y.rowid y,
               max(x.B - y.B, 0) B, max(x.C - y.C, 0) C,
               max(x.D - y.D, 0) D, max(x.E - y.E, 0) E,
               max(x.F - y.F, 0) F, max(x.G - y.G, 0) G,
               max(x.H - y.H, 0) H, max(x.I - y.I, 0) I
             from df1 x, df1 y
             where x.rowid > y.rowid", dbname = tempfile())
This would only require that we are able to store df1 and DF in our workspace.
Alternative 2. If even that overflows, we can write out df1, remove it, and perform the calculation below; then we only need sufficient storage for the result, DF.
read.csv.sql uses dbname = tempfile() by default, so in this case we do not need to specify it.
write.table(df1, "data.txt", sep = ",", quote = FALSE)
rm(df1)
gc()
DF <- read.csv.sql("data.txt", sql = "select
         x.rowid x, y.rowid y,
         max(x.B - y.B, 0) B, max(x.C - y.C, 0) C,
         max(x.D - y.D, 0) D, max(x.E - y.E, 0) E,
         max(x.F - y.F, 0) F, max(x.G - y.G, 0) G,
         max(x.H - y.H, 0) H, max(x.I - y.I, 0) I
       from file x, file y
       where x.rowid > y.rowid")
(Of course, if it's really this large then you might have trouble doing any subsequent calculations on it too.)
Output. At any rate, both alternatives give the same result shown below. x and y show which input rows were subtracted.
> DF
x y B C D E F G H I
1 2 1 0 0 0 0 0 1 0 1
2 3 1 0 0 0 0 1 0 1 0
3 3 2 1 0 1 0 1 0 1 0
Note. Although the question asked about optimizing memory rather than speed, if speed were an issue one could add indexes.
Since the data is homogeneous, use a matrix representation. Organize it so that the 'rows' are columns, as
m <- t(as.matrix(df1))
mode(m) <- "integer" # maybe already true?
pre-allocate the space for an answer
n <- ncol(m) - 1
ans <- matrix(0L, nrow(m), (n+1) * n / 2)
We want to compare column 1 to columns 1:n + 1L (the 1L treats the number one as an integer value rather than a real). This is m[,1] - m[, 1:n + 1L], using R's recycling. Iterate over the columns, with idx and off keeping track of the indices of the columns we want to compare against and the placement of the columns in the answer:
off <- 0
for (i in 1:n) {
  idx <- i:n + 1L
  ans[, off + seq_along(idx)] <- m[, i] - m[, idx]
  off <- off + length(idx)
}
The final step is
ans[ans<0L] <- 0L
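As a cross-check (a sketch, assuming out2 from the matrix approach above is still around): each column of ans holds one pairwise difference, so its transpose should match out2 entry for entry.
all(t(ans) == out2)
# [1] TRUE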
Maybe there are additional efficiencies from realizing that the truth table under the original operation is 0 unless m[,1] == 1 & m[, 1:n + 1L] == 0. Likewise, if space were a serious issue, the data might be represented with mode(m) <- "raw" and the arithmetic operations replaced with the comparison just suggested, along the lines of:
m <- t(as.matrix(df1))
mode(m) <- "raw"
n <- ncol(m) - 1          # as defined above
off <- 0
x0 <- as.raw(0); x1 <- as.raw(1)
ans <- matrix(raw(), nrow(m), (n+1) * n / 2)
for (i in 1:n) {
  idx <- i:n + 1L
  updt <- which((m[, i] == x1) & (m[, idx] == x0))
  ans[off + updt] <- x1
  off <- off + length(idx) * nrow(ans)
}
