R - Match two vectors with conditional - r

I've got two binary vectors and I'm trying to find the most efficient way of comparing them based on slightly more than just a standard "are they equal?".
Given vector x and vector y, I want to find out how many times in vector x do I have a 1 at the same index that vector y has a 0. I also need to count when vector y has a 1 + has a 0 where vector x also has a 0. (Note: if I find either of these I can just take the complement to get the other; I'm just not sure which is easier/more efficient, i.e. VectorY Score = length(VectorX) - VectorX Score.)
Ex:
vector x: 1 1 1 0 0 1 - Score: 2
vector y: 0 1 0 1 0 1 - Score: 4
I know that I could just use a for loop to go through each index, but I'd like something more efficient if possible. I have vector lengths of 100 and I need to do many of these comparisons so speed matters.
I tried to use the sum command, but I can't figure out how to add complex conditionals to it. I can find every spot that matches, but that's not enough to solve this.
Ex:
sum(vectorX == vectorY)

Sample:
> vx
[1] 1 1 1 0 0 1
> vy
[1] 0 1 0 1 0 1
You said: "how many times in vector x do I have a 1 at the same index that vector y has a 0"
> vx==1 & vy==0 # constructs this vector:
[1] TRUE FALSE TRUE FALSE FALSE FALSE
> sum(vx==1 & vy==0) # its sum is the answer (TRUE=1, FALSE=0)
[1] 2
You also said: "when vector y has a 1 + has a 0 where vector x also has a 0" which I don't understand but you can clarify that and probably work it out yourself given the answer I've just given you.
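To make the one-liner above concrete, here is a minimal sketch that computes both scores from the question's sample vectors. It assumes the second score is simply the complement of the first (as the question's own note suggests), and the `vx > vy` shortcut is only valid when both vectors are strictly 0/1.

```r
vx <- c(1, 1, 1, 0, 0, 1)
vy <- c(0, 1, 0, 1, 0, 1)

# positions where vx is 1 and vy is 0
score_x <- sum(vx == 1 & vy == 0)

# assumed reading of the second score: every remaining position
score_y <- length(vx) - score_x

# for strictly binary vectors, vx > vy marks the same positions
score_x_alt <- sum(vx > vy)
```

Both forms are vectorised, so no explicit loop over the 100 indices is needed.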

Related

how to count the number of distinct items in a linear programming problem

I am studying some linear programming problems with all-binary variables, where it is necessary to count (and then either constrain or maximise/minimise) the number of distinct items in the solution.
This is the post I could find that seemed closest to it:
https://stats.stackexchange.com/questions/136608/constrained-assignment-problem-linear-programming-genetic-algorithm-etc
The 'items' being counted in this case are the supply centers used. I am trying to understand if the approach suggested in the above post is correct for my purposes.
In the answer by user 'TAS', the example is 3 shops by 2 supply centers, and the idea is (A) to assign one (and only one) supply center to each shop, so that: (B) the distance travelled is minimal, (C) no supply center must supply more than a given maximal number of shops (in this case 3, i.e. no limit), and (D) the max total number of supply centers used is limited (in this case to 2).
I tried to reconstruct how the problem was set up, starting from a dataset like the one I would have in my case.
df <- cbind(expand.grid(shop=1:3,supply=1:2),distance=c(2.8,5.4,1.4,4.2,3.0,6.3))
df["Entry"] <- 1:dim(df)[[1]]
shop.mat <- table(df$shop,df$Entry)
shop.mat
    1 2 3 4 5 6
  1 1 0 0 1 0 0
  2 0 1 0 0 1 0
  3 0 0 1 0 0 1
supply.mat <- table(df$supply,df$Entry)
supply.mat
    1 2 3 4 5 6
  1 1 1 1 0 0 0
  2 0 0 0 1 1 1
N_supply <- dim(supply.mat)[[1]]
N_shop <- dim(shop.mat)[[1]]
N_entry <- dim(df)[[1]]
The solution vector will have length N_entry + N_supply, and each row of the constraint matrix will need to have the same length.
constr.mat <- NULL
dir <- NULL
rhs <- NULL
(A) is addressed by constraining each line in the shop.mat to be == 1:
constr.mat <- rbind(constr.mat,cbind(shop.mat,matrix(0,N_shop,N_supply)))
dir <- c(dir,rep("==",N_shop))
rhs <- c(rhs,rep(1,N_shop))
(B) is addressed by setting the objective vector to the distance for each Entry, and 0 for each supply center (because there is no cost in adding one more supply center, although in reality there might be):
obj <- c(aggregate(distance~Entry,df,c)[["distance"]],rep(0,N_supply))
(C) is addressed by rearranging the equation and turning it into a <= 0 constraint:
constr.mat <- rbind(constr.mat,cbind(supply.mat,-diag(table(df$supply))))
dir <- c(dir,rep("<=",N_supply))
rhs <- c(rhs,rep(0,N_supply))
(D) is addressed by adding a constraint <= 2:
constr.mat <- rbind(constr.mat,c(rep(0,N_entry),rep(1,N_supply)))
dir <- c(dir,"<=")
rhs <- c(rhs,2)
The problem can then be solved using lpSolve:
require(lpSolve)
sol <- lp("min", obj, constr.mat, dir, rhs, all.bin = TRUE,num.bin.solns = 1, use.rw=FALSE, transpose.constr=TRUE)
sol$solution
[1] 1 0 1 0 1 0 1 1
sol$objval
[1] 7.2
selected_Entry <- dimnames(shop.mat)[[2]][as.logical(sol$solution[1:N_entry])]
selected_Entry
[1] "1" "3" "5"
df[df$Entry %in% selected_Entry,]
  shop supply distance Entry
1    1      1      2.8     1
3    3      1      1.4     3
5    2      2      3.0     5
I can see that in this specific case the solution vector is forced (by constraints (C)) to have '1' in any of the 'supply' variables for which at least one corresponding Entry is selected. If this were not the case, the row sums for constraints (C) would be > 0.
But: suppose the distances were different and only supply center 1 were chosen for all 3 shops. What would stop the solution vector variable for supply center 2 from being set to '1'?
The current solution gives:
constr.mat %*% sol$solution
  [,1]
1    1
2    1
3    1
1   -1
2   -2
     2
But this alternative solution would still meet all the constraints:
constr.mat %*% c(1,1,1,0,0,0,1,1)
  [,1]
1    1
2    1
3    1
1    0
2   -3
     2
despite the fact that supply center 2 was not used.
In this case this would not affect the solution, because there is no cost associated to including the supply centers (the corresponding elements of the objective vector are 0).
But if I wanted to get from the solution the count of distinct supply centers used, I think this would not work.
A few years ago I asked for advice on this problem on another forum, and someone immediately gave me the solution, however saying that he/she 'was not sure it was the most efficient'.
It was the following: keep everything as above, and then append to constr.mat two copies of supply.mat augmented by the negated diagonal matrix of the number of entries per supply center. The first N_supply added rows are constrained to be <= 0, and the last N_supply rows to be >= 1 minus the diagonal of that diagonal matrix.
constr.mat <- rbind(constr.mat,cbind(supply.mat,-diag(table(df$supply))),cbind(supply.mat,-diag(table(df$supply))))
dir <- c(dir,rep("<=",N_supply),rep(">=",N_supply))
rhs <- c(rhs,rep(0,N_supply),1-table(df$supply))
The addition of these constraints makes sure that the 'supply' variables in the solution vector are 1 if and only if the corresponding supply center has been used, and 0 if and only if it hasn't been used.
For instance, the original solution would still work:
paste(t(unlist(constr.mat %*% sol$solution)),dir,rhs)
[1] "1 == 1" "1 == 1" "1 == 1" "-1 <= 0"
[5] "-2 <= 0" "2 <= 2" "-1 <= 0" "-2 <= 0"
[9] "-1 >= -2" "-2 >= -2"
[BTW I would not know how to turn this into an evaluated logical vector; any idea?]
whereas the other solution, which erroneously set the variable for supply center 2 to 1 although this supply center wasn't used, would instead not be valid:
paste(t(unlist(constr.mat %*% c(1,1,1,0,0,0,1,1))),dir,rhs)
[1] "1 == 1" "1 == 1" "1 == 1" "0 <= 0"
[5] "-3 <= 0" "2 <= 2" "0 <= 0" "-3 <= 0"
[9] "0 >= -2" "-3 >= -2"
(the last constraint would not be met).
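On the side question of turning the pasted strings into an evaluated logical vector: one possible approach, sketched here under the assumption that `dir` only ever contains valid R comparison operators, is to look each operator up with `match.fun` and apply it pairwise. The `lhs` values below are copied from the second printout; in the real problem they would come from `as.vector(constr.mat %*% candidate)`.

```r
# left-hand sides, directions and right-hand sides of the ten constraints
lhs <- c(1, 1, 1, 0, -3, 2, 0, -3, 0, -3)
dir <- c("==", "==", "==", "<=", "<=", "<=", "<=", "<=", ">=", ">=")
rhs <- c(1, 1, 1, 0, 0, 2, 0, 0, -2, -2)

# match.fun("<=") returns the `<=` function, which is then called on (l, r)
ok <- mapply(function(l, d, r) match.fun(d)(l, r), lhs, dir, rhs)

all_met <- all(ok)   # FALSE here, because the last constraint fails
```

With non-integer coefficients, the `==` rows may need a tolerance (for example via `all.equal`) rather than an exact comparison.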
Q1 Do you think the above makes sense, i.e. is it true that we need the additional constraint rows I mentioned to make sure that the 'supply' variables in the solution vector are appropriately set, or am I wrong?
Q2 Can you think of a more efficient way to count occurrences of distinct items in such problems (the example here is small, but I am often dealing with VERY large ones, where adding so many more constraints is not really helping, despite all the presolve in the world)?
Thanks!
Note: this question was originally posted in another community. I deleted it from there.
EDIT after consulting the Wikipedia page on 'Uncapacitated Facility Location Problem', mentioned in the original post I linked above.
In fact there is a cost associated to opening a new supply center, so the objective vector should not have 0's at the end, but some cost ($f_i$ in the Wikipedia formulation).
Only then does the issue of $\sum_iy_i$ not always being the number of open supply centers disappear: the $\sum_jx_{i,j} \le m \cdot y_i$ constraints will still ensure that whenever a given center is used, the corresponding $y_i$ is 1, and there will be no need for the other condition I imposed, because there is now a cost associated with setting each $y_i$ to 1, so only the strictly necessary $y_i$'s will be set to 1.
So in short, if the objective vector is properly constructed, with costs for each supply center, I can do without several constraints.
In fact, depending on the value for the supply center opening cost, the constraint on the max total number of centers may even be superseded.
Then it would be interesting to evaluate the suggestion made in the Wikipedia discussion, namely to split the 'big M' constraints into several smaller ones. If it's true that it makes the problem easier to solve computationally, why not...
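The effect of an opening cost can be checked on the toy example without a solver. The sketch below enumerates all 2^8 binary candidates in base R, keeps only constraints (A) and the big-M link $\sum_jx_{i,j} \le m \cdot y_i$, and picks the cheapest one. The opening cost `f <- 3` is a made-up value chosen for illustration, and brute force is obviously only viable at this size.

```r
dist   <- c(2.8, 5.4, 1.4, 4.2, 3.0, 6.3)  # distances for Entry 1..6
shop   <- rep(1:3, times = 2)              # shop of each Entry
supply <- rep(1:2, each = 3)               # supply center of each Entry
f <- 3                                     # hypothetical opening cost per center
m <- 3                                     # big-M: max shops per center

cand <- as.matrix(expand.grid(rep(list(0:1), 8)))  # all 256 binary vectors
x <- cand[, 1:6]   # assignment variables
y <- cand[, 7:8]   # "center is open" variables

# (A) every shop gets exactly one center
one_per_shop <- sapply(1:3, function(s) rowSums(x[, shop == s]) == 1)
# big-M link: a center can only serve shops if its y is 1
linked <- sapply(1:2, function(k) rowSums(x[, supply == k]) <= m * y[, k])
feasible <- rowSums(one_per_shop) == 3 & rowSums(linked) == 2

cost <- as.vector(x %*% dist) + f * rowSums(y)
cost[!feasible] <- Inf

best   <- cand[which.min(cost), ]
best_x <- unname(best[1:6])
best_y <- unname(best[7:8])
```

With this cost, serving all three shops from center 1 is cheapest, and `best_y` leaves center 2 closed even though no constraint forces it to 0, so `sum(best_y)` matches the number of centers actually used.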
EDIT: The algorithm is fixed.
I had a similar issue that I wanted to minimize the number of distinct values in a solution. The following is how I came up with the answer to mathematically calculate the number of distinct items.
Suppose we have the following set:
11 12 13 11 11
We can see there are 3 distinct numbers in there (11, 12, and 13). Following is the way to compute it.
Write the numbers in triangle matrix like this:
11 12 13 11 11   row=0
   12 13 11 11   row=1
      13 11 11   row=2
         11 11   row=3
if I get the difference of 11 and 12 and assign a binary variable to
1 if |a1 - a2| != 0
0 if |a1 - a2| == 0
then I have the following:
1 1 0 0 --> 0
  1 1 1 --> 1
    1 1 --> 1
      0 --> 0
1 + 1 + (extra 1) = 3
if a number is distinct then its row should be all 1s.
So for the above case we have 2 rows of full 1s, meaning we have 2 numbers that are distinct from the first number. So, in total we have 3.
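The counting rule can be sanity-checked numerically before turning it into constraints. This is only a sketch of the arithmetic in R (not of the linear program itself); it assumes the vector has at least two elements.

```r
a <- c(11, 12, 13, 11, 11)
n <- length(a)

# row i compares a[i] with every later element;
# the row is "all 1s" when a[i] differs from all of them
full_rows <- sapply(seq_len(n - 1), function(i) all(a[i] != a[(i + 1):n]))

# the last element has no later duplicate, hence the extra 1
n_distinct <- sum(full_rows) + 1
```

In other words, each distinct value is counted once, at its last occurrence, which is why the result agrees with `length(unique(a))`.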
Now to translate into Linear Programming:
Assume:
Variables = a(1), a(2), a(3), a(4), ..., a(n)
Binary Variables b(i,j) where i,j in [0...n]
Binary Variable c(i) where i in [0...n]
The Linear Program would be:
obj = 1
for i in range(0, n):
    for j in range(i+1, n):
        # This is |a(i) - a(j)| part
        addConstr( b(i,j) * BigM >= a(i) - a(j))
        addConstr( b(i,j) * BigM >= -(a(i) - a(j)))
    # So here c(i) will be 0 if there is any 0 in the row otherwise it is 1.
    addConstr(c(i) * BigM >= sum(b(i,j) for all j) - (n-i))
    obj = obj + c(i)
Minimize(Sum(obj))

How can I use rowSums with conditions to return binary value?

Say I have a data frame with a column for summed data. What is the most efficient way to return a binary 0 or 1 in a new column if any value in columns a, b, or c are NOT zero? rowSums is fine for a total, but I also need a simple indicator if anything differs from a value.
tt <- data.frame(a=c(0,-5,0,0), b=c(0,5,10,0), c=c(-5,0,0,0))
tt[, ncol(tt)+1] <- rowSums(tt)
This yields:
> tt
   a  b  c V4
1  0  0 -5 -5
2 -5  5  0  0
3  0 10  0 10
4  0  0  0  0
The fourth column is a simple sum of the data in the first three columns. How can I add a fifth column that returns a binary 1/0 value if any value differs from a criterion set on the first three columns?
For example, is there a simple way to return a 1 if any of a, b, or c are NOT 0?
as.numeric(rowSums(tt != 0) > 0)
# [1] 1 1 1 0
tt != 0 gives us a logical matrix telling us where there are values not equal to zero in tt.
When the sum of each row is greater than zero (rowSums(tt != 0) > 0), we know that at least one value in that row is not zero.
Then we convert the result to numeric (as.numeric(.)) and we've got a binary vector result.
We can use Reduce
+(Reduce(`|`, lapply(tt, `!=`, 0)))
#[1] 1 1 1 0
One could also use the good old apply loop:
+apply(tt != 0, 1, any)
#[1] 1 1 1 0
The argument tt != 0 is a logical matrix with entries stating whether the value is different from zero. Then apply() with margin 1 is used for a row-wise operation to check if any of the entries is true. The prefix + converts the logical output into numeric 0 or 1. It is a shorthand version of as.numeric().
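A small sketch to confirm that the three approaches above agree on the question's data. The column selection `cols` is an addition of mine: once the sum column `V4` exists, testing the whole of `tt` would include it too, so the check is restricted to a, b and c.

```r
tt <- data.frame(a = c(0, -5, 0, 0), b = c(0, 5, 10, 0), c = c(-5, 0, 0, 0))
tt$V4 <- rowSums(tt)

cols <- c("a", "b", "c")   # only these columns are tested

flag_rowsums <- as.numeric(rowSums(tt[cols] != 0) > 0)
flag_reduce  <- +(Reduce(`|`, lapply(tt[cols], `!=`, 0)))
flag_apply   <- unname(+apply(tt[cols] != 0, 1, any))

tt$flag <- flag_rowsums    # new indicator column
```

Row 2 is the interesting case: its sum is 0, but the indicator is still 1 because -5 and 5 are both non-zero.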

How to use tabulate function to count zeros?

I am trying to count integers in a vector that also contains zeros. However, tabulate doesn't count the zeros. Any ideas what I am doing wrong?
Example:
> tabulate(c(0,4,4,5))
[1] 0 0 0 2 1
but the answer I expect is:
[1] 1 0 0 0 2 1
Use a factor and define its levels
tabulate(factor(c(0,4,4,5), 0:5))
#[1] 1 0 0 0 2 1
The explanation for the behaviour you're seeing is in ?tabulate (bold face mine)
bin: a numeric vector (of positive integers), or a factor. Long
vectors are supported.
In other words, if you give a numeric vector, it needs to have positive >0 integers. Or use a factor.
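Another option, not taken from the answer above, is to shift the values by one so that 0 lands in the first bin. This is a sketch that assumes all values are non-negative integers.

```r
x <- c(0, 4, 4, 5)

# shift by 1 so that value v is counted in bin v + 1
counts <- tabulate(x + 1L, nbins = max(x) + 1L)

# label the bins with the original values 0..max(x)
names(counts) <- 0:max(x)
```

Here `counts` holds the same six numbers as the factor-based call, with names "0" to "5" attached.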
I got annoyed enough by tabulate to write a short function that can count not only the zeroes but any other integers in a vector:
my.tab <- function(x, levs) {
  sapply(levs, function(n) {
    length(x[x==n])
  })
}
The parameter x is an integer vector that we want to tabulate. levs is another integer vector that contains the "levels" whose occurrences we count. Let's set x to some integer vector:
x <- c(0,0,1,1,1,2,4,5,5)
A) Use my.tab to emulate R's built-in tabulate. 0-s will be ignored:
my.tab(x, 1:max(x))
# [1] 3 1 0 1 2
B) Count the occurrences of integers from 0 to 6:
my.tab(x, 0:6)
# [1] 2 3 1 0 1 2 0
C) If you want to know (for some strange reason) only how many 1-s and 4-s your x vector contains, but ignore everything else:
my.tab(x, c(1,4))
# [1] 3 1

How to get which() to return similar indices from two dataframes?

I have two dataframes (ma.sig, pricebreak) that look like this:
Date A B C
01/1 1 0 1
02/1 1 0 1
Date D E G
01/1 1 0 1
02/1 0 1 0
For starters, I just want to retrieve the column indices for all non-zero values in the first row. I tried doing this via the following methods:
sig <- which(!ma.sig[1,]==0&!pricebreak[1,]==0)
and
sig <- which(!ma.sig[1,]==0)&which(!pricebreak[1,]==0)
I would like it to return something like: 1, 3 (based on the above sample dataframe). However, I get this string of logical sequences:
[1] TRUE FALSE TRUE
How do I get it to return the column indices? I do not want to use merge to merge my dataframes because of the nature of the data.
EDIT: Just for background information, the above data frames are 'signals' that are on when the values are non-zero. I'm trying to use sig to collect indices that I can use for my main dataframe so that I can only calculate and print outputs when the signals are on.
@serhatCevikel has already given the answer:
I am just trying to explain it more for your convenience.
ma.sig =
Date A B C
01/1 1 0 1
02/1 1 0 1
pricebreak =
Date D E G
01/1 1 0 1
02/1 0 1 0
Now as per your method:
sig <- which(!ma.sig[1,]==0)&which(!pricebreak[1,]==0)
print(sig)
gives:
TRUE TRUE TRUE
Now try:
which(sig)
it will return index of TRUE value:
1 2 3
Please let me know if you get this. I have checked it twice in my terminal. Hope you will get this too.
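For comparison, here is a minimal sketch that returns the indices 1 and 3 the question asks for. It assumes the Date column has been dropped (or is stored as row names), so that only the signal columns remain; the data frames are rebuilt from the sample shown above.

```r
ma.sig     <- data.frame(A = c(1, 1), B = c(0, 0), C = c(1, 1))
pricebreak <- data.frame(D = c(1, 0), E = c(0, 1), G = c(1, 0))

# element-wise test on the first row of each frame
on <- unlist(ma.sig[1, ]) != 0 & unlist(pricebreak[1, ]) != 0

# which() is applied once, to the combined logical vector
sig <- which(unname(on))
```

The key difference from `which(a) & which(b)` is that `&` is applied to the logical vectors first, and `which` is called only once on the result.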

need to count number of specific transitions in a vector in R

I am programming a sampler in R, which basically is a big for loop, and for every iteration I have to count the number of transitions in a vector. I have a vector called k, which contains zeros and ones, with 1000 entries in the vector.
I have used the following, horribly slow, code:
# we determine the number of transitions n00, n01, n10, n11
n00=n01=n10=n11=0 # reset number of transitions between states from last time
for (j in 1:(1000-1)) {
  if (k[j+1]==1 && k[j]==0) {
    n01 <- n01+1
  } else if (k[j+1]==1 && k[j]==1) {
    n11 <- n11+1
  } else if (k[j+1]==0 && k[j]==1) {
    n10 <- n10+1
  } else {
    n00 <- n00+1
  }
}
On each pass of the loop, the variables n00, n01, n10, n11 count the transitions in the vector. For example, n00 counts the number of times a 0 is followed by another 0. And so on...
This is very slow, and I am very new to R, so I am kind of desperate here. I do not understand how to use grep, if that even is possible.
Thank you for your help
Try something like this:
x <- sample(0:1,20,replace = TRUE)
> table(paste0(head(x,-1),tail(x,-1)))
00 01 10 11
4 3 4 8
The head and tail return portions of the vector x: all but the last element, and then all but the first element. This means that the corresponding elements are the consecutive pairs from x.
Then paste0 just converts each one to a character vector and pastes the first elements together, then the second elements, etc. The result is a character vector with elements like "00", "01", etc. Then table just counts up how many of each there are.
You can assign the result to a new variable like so:
tab <- table(paste0(head(x,-1),tail(x,-1)))  # named tab rather than T, which would mask the TRUE shorthand
Experiment yourself with each piece of the code to see how it works. Run just head(x,-1), etc. to see what each piece does.
To address the comment below, to ensure that all types appear with counts when you run table, convert it to a factor first:
x1 <- factor(paste0(head(x,-1),tail(x,-1)),levels = c('00','01','10','11'))
table(x1)
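If the string pasting turns out to be a bottleneck inside the sampler loop, a purely numeric variant is possible. This is a sketch of an alternative, assuming `x` contains only 0s and 1s; it encodes each consecutive pair as a number from 1 to 4 and counts with `tabulate`.

```r
x <- c(0, 0, 0, 0, 0, 1, 0, 1, 1, 0)

from <- head(x, -1)   # all but the last element
to   <- tail(x, -1)   # all but the first element

# 00 -> 1, 01 -> 2, 10 -> 3, 11 -> 4
counts <- tabulate(2 * from + to + 1, nbins = 4)
names(counts) <- c("n00", "n01", "n10", "n11")
```

Because `nbins = 4` is fixed, transition types that never occur still appear with a count of 0, which plays the same role as the factor levels above.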
If we don't care about distinguishing the n00 and n11 cases, then this becomes much simpler:
x <- sample(0:1,20,replace = TRUE)
# [1] 0 0 0 0 0 1 0 1 1 0 0 0 1 0 0 1 0 0 0 0
table(diff(x))
# -1 0 1
# 4 11 4
Since the question says that you're primarily interested in the transitions, this may be acceptable, otherwise one of the other answers would be preferable.
x <- sample(0:1, 10, replace = TRUE)
# my sample: [1] 0 0 0 0 0 1 0 1 1 0
rl <- rle(x)
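# a run of length n contributes n - 1 same-state transitions
zero_to_zero <- sum(rl$lengths[rl$values == 0 & rl$lengths > 1] - 1)
one_to_one <- sum(rl$lengths[rl$values == 1 & rl$lengths > 1] - 1)
# diff() is next minus previous, so 0 -> 1 gives +1 and 1 -> 0 gives -1
zero_to_one <- sum(diff(rl$values) == 1)
one_to_zero <- sum(diff(rl$values) == -1)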
x
# [1] 0 0 0 0 0 1 0 1 1 0
zero_to_zero
# [1] 4
one_to_one
# [1] 1
zero_to_one
# [1] 2
one_to_zero
# [1] 2
@joran's answer is far cleaner though... Still, I thought I might as well finish the stroll I started down (the dirty) trail, and share the result.
