generate a vector with set number of 1s [duplicate] - r

I want to generate a large vector of arbitrary length containing only 0s and 1s, with at most ten 1s in it.
(For those familiar with the term, a 10-sparse vector of arbitrary length.)
How can I do this in R/RStudio?

n <- 20 # number of zeros (example value)
rep(0, n) # generate n zeros
sample(0:10, 1) # draw a random integer between 0 and 10
rep(1, sample(0:10, 1)) # generate a random number of ones
sample(c(rep(0, n), rep(1, sample(0:10, 1)))) # combine and permute; the result has length n + (number of ones)

# function that generates a 10-sparse vector
GenerateSparceVector = function(N) {
  # number of 1s
  n = sample(1:10, 1)
  # create vector
  vec = c(rep(1, n), rep(0, N - n))
  # randomise vector
  sample(vec)
}
# for reproducibility
set.seed(32)
# apply the function
GenerateSparceVector(20)
# [1] 0 0 0 1 0 0 0 0 1 0 1 1 0 1 0 0 0 0 0 1
Note that I assumed you need at least one 1 in your vector, and that N is at least 10 (otherwise rep(0, N-n) can fail).
Every time you run it, there's an equal probability of getting 1, 2, 3, ..., 10 ones in your vector.
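If zero 1s should also be allowed, as the question's "at max 10" suggests, one possible variant is sketched below. It places the ones by sampling positions instead of shuffling the whole vector. The names sparse_vector and max_ones are illustrative and do not come from the answers above.

```r
# Sketch: 0/1 vector of length N with at most max_ones ones.
# sparse_vector and max_ones are hypothetical names used for illustration.
sparse_vector <- function(N, max_ones = 10) {
  upper <- min(max_ones, N)             # cannot place more ones than positions
  k <- sample.int(upper + 1L, 1L) - 1L  # uniform draw from 0..upper
  vec <- integer(N)                     # start with all zeros
  vec[sample.int(N, k)] <- 1L           # set k random positions to 1
  vec
}

set.seed(1)
v <- sparse_vector(1000)
length(v)  # 1000
sum(v)     # some value between 0 and 10
```

Sampling positions keeps the result at exactly length N, whereas concatenating n zeros with the ones yields a vector longer than n.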

Related

How can I use rowSums with conditions to return binary value?

Say I have a data frame with a column for summed data. What is the most efficient way to return a binary 0 or 1 in a new column if any value in columns a, b, or c are NOT zero? rowSums is fine for a total, but I also need a simple indicator if anything differs from a value.
tt <- data.frame(a=c(0,-5,0,0), b=c(0,5,10,0), c=c(-5,0,0,0))
tt[, ncol(tt)+1] <- rowSums(tt)
This yields:
> tt
   a  b  c V4
1  0  0 -5 -5
2 -5  5  0  0
3  0 10  0 10
4  0  0  0  0
The fourth column is a simple sum of the data in the first three columns. How can I add a fifth column that returns a binary 1/0 value if any value differs from a criterion set on the first three columns?
For example, is there a simple way to return a 1 if any of a, b, or c are NOT 0?
as.numeric(rowSums(tt != 0) > 0)
# [1] 1 1 1 0
tt != 0 gives us a logical matrix telling us where there are values not equal to zero in tt.
When the sum of each row is greater than zero (rowSums(tt != 0) > 0), we know that at least one value in that row is not zero.
Then we convert the result to numeric (as.numeric(.)) and we've got a binary vector result.
We can also use Reduce:
+(Reduce(`|`, lapply(tt, `!=`, 0)))
#[1] 1 1 1 0
One could also use the good old apply loop:
+apply(tt != 0, 1, any)
#[1] 1 1 1 0
The argument tt != 0 is a logical matrix with entries stating whether the value is different from zero. Then apply() with margin 1 is used for a row-wise operation to check if any of the entries is true. The prefix + converts the logical output into numeric 0 or 1. It is a shorthand version of as.numeric().
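Because tt already contains the summed column by the time the indicator is computed, it may be safer to name the columns explicitly. The following is a minimal sketch under that assumption; the column names total and flag are illustrative only.

```r
tt <- data.frame(a = c(0, -5, 0, 0), b = c(0, 5, 10, 0), c = c(-5, 0, 0, 0))
cols <- c("a", "b", "c")  # columns the criterion applies to

tt$total <- rowSums(tt[, cols])                     # plain row sum
tt$flag <- as.integer(rowSums(tt[, cols] != 0) > 0) # 1 if any of a, b, c is non-zero

tt$flag
# [1] 1 1 1 0
```

Row 2 shows why the indicator cannot be derived from the sum: its values (-5, 5, 0) add up to 0, yet the row contains non-zero entries.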

How to use tabulate function to count zeros?

I am trying to count integers in a vector that also contains zeros. However, tabulate doesn't count the zeros. Any ideas what I am doing wrong?
Example:
> tabulate(c(0,4,4,5))
[1] 0 0 0 2 1
but the answer I expect is:
[1] 1 0 0 0 2 1
Use a factor and define its levels
tabulate(factor(c(0,4,4,5), 0:5))
#[1] 1 0 0 0 2 1
The explanation for the behaviour you're seeing is in ?tabulate (bold face mine)
bin: a numeric vector (of positive integers), or a factor. Long
vectors are supported.
In other words, if you give a numeric vector, it needs to contain positive integers (> 0); zeros are ignored. Alternatively, use a factor.
I got annoyed enough by tabulate to write a short function that can count not only the zeroes but any other integers in a vector:
my.tab <- function(x, levs) {
  sapply(levs, function(n) {
    length(x[x==n])
  })
}
The parameter x is an integer vector that we want to tabulate. levs is another integer vector that contains the "levels" whose occurrences we count. Let's set x to some integer vector:
x <- c(0,0,1,1,1,2,4,5,5)
A) Use my.tab to emulate R's built-in tabulate. Zeros will be ignored:
my.tab(x, 1:max(x))
# [1] 3 1 0 1 2
B) Count the occurrences of integers from 0 to 6:
my.tab(x, 0:6)
# [1] 2 3 1 0 1 2 0
C) If you want to know (for some strange reason) only how many 1-s and 4-s your x vector contains, but ignore everything else:
my.tab(x, c(1,4))
# [1] 3 1
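Another option, sketched here on the assumption that x contains only non-negative integers, is to shift the values by one so that 0 lands in tabulate's first bin. The variable name counts is illustrative.

```r
x <- c(0, 0, 1, 1, 1, 2, 4, 5, 5)

# Shift by one: the value 0 falls into bin 1, the value 1 into bin 2, and so on.
counts <- tabulate(x + 1, nbins = 7)  # 7 bins for the values 0..6
names(counts) <- 0:6                  # label each bin with the value it counts
counts
# 0 1 2 3 4 5 6
# 2 3 1 0 1 2 0
```

This gives the same counts as my.tab(x, 0:6) while staying with the built-in function.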

R: how to identify the minimum representative vector/row sets from a list of vectors/dataframe

I have a list of vectors, each containing a number of elements. I would like to identify the minimum number of vectors that have the maximum coverage of unique elements. For example, suppose the vectors are represented as rows in a binary data frame with the unique elements in columns, as follows:
> df <- data.frame(a=c(1,0,0,0,0), b=c(1,1,1,1,0), c=c(1,0,1,1,1), d=c(0,0,0,1,1), e=c(0,0,0,1,1), f=c(0,0,0,0,1))
> df
  a b c d e f
1 1 1 1 0 0 0
2 0 1 0 0 0 0
3 0 1 1 0 0 0
4 0 1 1 1 1 0
5 0 0 1 1 1 1
The vectors are rows 1 to 5, and they contain different combinations of the elements a to f. I would like to get the minimum set of representative vectors (rows) covering as many elements as possible. In this example, the minimum representative (most parsimonious) vectors are rows 1 and 5. Is there a way to do that statistically? I tried visualizing the dataset in a two-way clustered heatmap to identify the minimum combinations manually. However, is there a statistical approach that can handle this and provide some numeric measure of the selection performance?
Another example to illustrate my question. Given the following vectors:
> vec.1 <- c("a", "c", "d")
> vec.2 <- c("a", "b", "c", "d")
> vec.3 <- c("b","e")
> vec.4 <- c("b", "c", "d", "g")
> vec.5 <- c("f","g")
The minimum combination is 2,3 and 5 because they cover all elements, from a to g, with minimum overlap. In larger datasets, multiple answers can be possible, however, the smaller the number of vectors in a combination the better.
Thank you.
One solution is to compute the number of elements covered by each pair of rows (the size of their union) and extract the row pair with maximum coverage, as follows:
m <- apply(df, 1, function(x) apply(df, 1, function(y) sum(x | y)))
which(m == max(m), arr.ind = TRUE)
The resulting output is:
row col
[1,] 5 1
[2,] 1 5
You can pick either combination (since row 1 vs. row 5 and row 5 vs. row 1 are the same).
This method uses n^2 operations, though, and only compares pairs of rows. I'm not sure if there is a more efficient package/algorithm that finds maximum Hamming distance pairs of rows, which seems to be what you want.
It is a combinatorial problem. First: is there one row with all 1s? If not, second: is there a combination of two rows that covers all elements? If not: ... three ... Use the function combn() to generate the combinations. If a combination is found, calculate the amount of overlap to select the combination with minimal overlap:
df <- data.frame(a=c(1,0,0,0,0), b=c(1,1,1,1,0), c=c(1,0,1,1,1), d=c(0,0,0,1,1), e=c(0,0,0,1,1), f=c(0,0,0,0,1))
n <- nrow(df)
test1.allc <- function(i) all(colSums(df[i,, drop=FALSE])>0)
for (i in 1:n) {
  Ci <- combn(n,i)
  t1 <- apply(Ci, 2, test1.allc)
  if (any(t1)) break # minimal number of rows/vectors is i
}
print(i) # number of rows needed to have all elements
Ci <- Ci[, t1, drop=FALSE] # only valid combinations
overlap <- function(j) { o <- colSums(df[j,, drop=FALSE]); sum(o) - length(o) }
j <- which.min(apply(Ci, 2, overlap))
print(j) # the j-th combination(s) has/have minimal overlap
for (jj in j) print(df[Ci[, jj],])
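For larger inputs, where enumerating all combinations becomes expensive, a greedy heuristic is a common alternative: repeatedly pick the vector that covers the most still-uncovered elements. The sketch below is not one of the answers above, and greedy selection is not guaranteed to find the smallest possible set. The function name greedy_cover is hypothetical; the input is the second example from the question.

```r
# Greedy set-cover heuristic (a sketch; may not return the true minimum).
# greedy_cover is a hypothetical helper name.
greedy_cover <- function(sets) {
  uncovered <- unique(unlist(sets))  # all elements that need covering
  chosen <- integer(0)
  while (length(uncovered) > 0) {
    # number of still-uncovered elements each vector would add
    gain <- vapply(sets, function(s) length(intersect(s, uncovered)), integer(1))
    best <- unname(which.max(gain))  # first vector with the largest gain
    chosen <- c(chosen, best)
    uncovered <- setdiff(uncovered, sets[[best]])
  }
  chosen
}

vecs <- list(
  c("a", "c", "d"),
  c("a", "b", "c", "d"),
  c("b", "e"),
  c("b", "c", "d", "g"),
  c("f", "g")
)
sort(greedy_cover(vecs))
# [1] 2 3 5
```

On this example the heuristic returns vectors 2, 3 and 5, the combination named in the question.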

Implementing simple scoring function with permutation test in R

I'm new to R, and I want to calculate a specific score for a bunch of genes.
Can somebody help me implement this? :-)
I have the following two vectors:
vector 1: (0.01,0.02,0.04,0.5,0.9,0.002,0.07,0.008)
vector 2: (1,0,0,1,0,0,0,0)
Vector 2 shows the membership of the vector 1 elements in a specific set c.
I want to implement a scoring function that does the following steps:
1) Take vector 1 and vector 2 as inputs.
2) Sort vector 1 in decreasing order and then reorder vector 2 correspondingly.
3) Go through the sorted vector 1; if, for element i of vector 1, the corresponding element of the sorted vector 2 is 1, the score should be increased by (m-l), otherwise it should be decreased by l, where
m = length of vector 1
l = number of non-zero elements in vector 2
4) Finally, permute vector 1 and vector 2 and recalculate the score of step 3. The permutation should preserve the true membership of the vector 1 elements in vector 2. For example: vector 1: (10,7,4), vector 2: (0,0,1); after one possible permutation: vector 1: (4,7,10), vector 2: (1,0,0).
Here is my attempt:
vector1 <- c(0.01,0.02,0.04,0.5,0.9,0.002,0.07,0.008)
vector2 <- c(1,0,0,1,0,0,0,0)
m <- length(vector1)
l <- nnzero(vector2, na.counted = NA)
score=0
score_function <- function (a,b){
  a <- sort(a,decreasing = T)
  for (i in a){
    if (b[i]==1) {
      score= + m-1
    } else{ score= score-l }
  }
  score
}
But I couldn't sort b (vector 2) according to a (vector 1).
If you want to sort by another vector, use order() as an index to "[" (add decreasing = TRUE for descending order):
> vector1<- c(0.01,0.02,0.04,0.5,0.9,0.002,0.07,0.008)
>
> vector2<- c(1,0,0,1,0,0,0,0)
> vector2[ order(vector1) ]
[1] 0 0 1 0 0 0 1 0
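Putting order() to work on the scoring steps from the question, a possible sketch follows. Two points are assumptions rather than something stated in the thread. First, with l members each adding (m-l) and (m-l) non-members each subtracting l, the final total is always l(m-l) - (m-l)l = 0, so the sketch keeps the running sum and takes its maximum as the statistic. Second, a permutation that keeps each value paired with its membership flag gives the same result after sorting (apart from ties), so the sketch shuffles the membership flags instead. The names running_score, observed, null_max and p_value are hypothetical.

```r
vector1 <- c(0.01, 0.02, 0.04, 0.5, 0.9, 0.002, 0.07, 0.008)
vector2 <- c(1, 0, 0, 1, 0, 0, 0, 0)

# Running score along vector1 sorted in decreasing order (hypothetical helper).
running_score <- function(values, member) {
  m <- length(values)      # length of vector 1
  l <- sum(member != 0)    # number of non-zero entries in vector 2
  sorted_member <- member[order(values, decreasing = TRUE)]
  cumsum(ifelse(sorted_member != 0, m - l, -l))  # +(m-l) for members, -l otherwise
}

rs <- running_score(vector1, vector2)
rs
# [1] -2  4  2  0 -2  4  2  0
observed <- max(rs)  # assumed statistic: maximum of the running sum

# Assumed null distribution: shuffle the membership flags and recompute.
set.seed(1)
null_max <- replicate(1000, max(running_score(vector1, sample(vector2))))
p_value <- mean(null_max >= observed)  # share of shuffles at least as extreme
```

If a different summary of the running sum is intended, only the max() calls need to change.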

need to count number of specific transitions in a vector in R

I am programming a sampler in R, which is basically a big for loop, and in every iteration I have to count the number of transitions in a vector. I have a vector called k, which contains 1000 entries, all zeros and ones.
I have used the following, horribly slow, code:
# we determine the number of transitions n00, n01, n10, n11
n00 = n01 = n10 = n11 = 0 # reset number of transitions between states from last time
for (j in 1:(1000-1)) {
  if (k[j+1]==1 && k[j]==0) {n01 <- n01+1}
  else { if (k[j+1]==1 && k[j]==1) {n11 <- n11+1}
    else { if (k[j+1]==0 && k[j]==1) {n10 <- n10+1}
      else {n00 <- n00+1}
    }
  }
}
So on every pass of the loop, the variables n00, n01, n10 and n11 count the transitions in the vector. For example, n00 counts the number of times a 0 is followed by another 0, and so on.
This is very slow, and I am very new to R, so I am kind of desperate here. I do not understand how to use grep, if that is even possible.
Thank you for your help.
Try something like this:
x <- sample(0:1,20,replace = TRUE)
> table(paste0(head(x,-1),tail(x,-1)))
00 01 10 11
4 3 4 8
The head and tail return portions of the vector x: all but the last element, and then all but the first element. This means that the corresponding elements are the consecutive pairs from x.
Then paste0 converts each element to character and pastes the two vectors together elementwise: the first elements, the second elements, etc. The result is a character vector with elements like "00", "01", etc. Then table just counts how many of each there are.
You can assign the result to a new variable like so:
trans <- table(paste0(head(x,-1),tail(x,-1)))
Experiment yourself with each piece of the code to see how it works. Run just head(x,-1), etc. to see what each piece does.
To address the comment below, to ensure that all types appear with counts when you run table, convert it to a factor first:
x1 <- factor(paste0(head(x,-1),tail(x,-1)),levels = c('00','01','10','11'))
table(x1)
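To get back the four counters from the question, the same head/tail idea can be written as a two-way table. This is a minimal sketch; the names from, to and tr are illustrative, and k is a short example vector rather than the 1000-entry one from the question.

```r
k <- c(0, 0, 0, 0, 0, 1, 0, 1, 1, 0)

# Fixing the levels guarantees a full 2 x 2 table even if a state never occurs.
from <- factor(head(k, -1), levels = 0:1)  # state at position j
to <- factor(tail(k, -1), levels = 0:1)    # state at position j + 1
tr <- table(from, to)

n00 <- tr["0", "0"]  # 4
n01 <- tr["0", "1"]  # 2
n10 <- tr["1", "0"]  # 2
n11 <- tr["1", "1"]  # 1
```

Indexing the table by the character labels "0" and "1" avoids confusing the state 0 with a numeric index.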
If we don't care about distinguishing the n00 and n11 cases, then this becomes much simpler:
x <- sample(0:1,20,replace = TRUE)
# [1] 0 0 0 0 0 1 0 1 1 0 0 0 1 0 0 1 0 0 0 0
table(diff(x))
# -1 0 1
# 4 11 4
Since the question says that you're primarily interested in the transitions, this may be acceptable, otherwise one of the other answers would be preferable.
x <- sample(0:1, 10, replace = TRUE)
# my sample: [1] 0 0 0 0 0 1 0 1 1 0
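rl <- rle(x)
zero_to_zero <- sum(rl$lengths[rl$values == 0 & rl$lengths > 1] - 1)
one_to_one <- sum(rl$lengths[rl$values == 1 & rl$lengths > 1] - 1)
zero_to_one <- sum(diff(rl$values) == 1) # a run of 0s followed by a run of 1s
one_to_zero <- sum(diff(rl$values) == -1) # a run of 1s followed by a run of 0s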
x
# [1] 0 0 0 0 0 1 0 1 1 0
zero_to_zero
# [1] 4
one_to_one
# [1] 1
zero_to_one
# [1] 2
one_to_zero
# [1] 2
@joran's answer is far cleaner, though. Still, I thought I might as well finish the stroll I started down the (dirty) trail and share the result.
