Create Row Labels based on an existing label in R

I have a df with a label "S" for anywhere my numeric column is <35.
I'd like to use each S position and label the 3 rows before it "S-3", "S-2", "S-1", and the 2 rows after it "S+1", "S+2".
like this:
 N   S
45
56
67   S-3
47   S-2
52   S-1
28   S
89   S+1
66   S+2
55
76
I was using this to start me off, just as an example.
n <- sample(50:100, 10, replace=T)
data <- data.frame(N=n)
data <- rbind(data, 30)
data <- rbind(data,data,data,data,data,data)
data$S <- ifelse(data$N<35, "S", "")
Any ideas..?

Here is an option using base R. We get the index of rows where 'N' is less than 35 ('i1') and create the 'S' column filled with blank ("") elements. Then we loop through 'i1', build the sequence from 3 rows before to 2 rows after each index, paste the matching labels with 'S', keep only the positions that fall inside the data ('ind'), and assign the strings ('val') to the 'S' column.
i1 <- which(data$N < 35)
data$S <- ""
out <- do.call(rbind, lapply(i1, function(i)
  data.frame(ind = (i - 3):(i + 2),
             val = c(paste0("S-", 3:1), "S", paste0("S+", 1:2)),
             stringsAsFactors = FALSE)))
i2 <- out$ind %in% seq_len(nrow(data))
data$S[out$ind[i2]] <- out$val[i2]
data
The example data used above (with a seed so it is reproducible):
set.seed(24)
n <- sample(50:100, 10, replace = TRUE)
data <- data.frame(N = n)
data <- rbind(data, 30)
data <- rbind(data, data, data, data, data, data)

Without dealing with possible overlap, here is a set of nested ifelse() statements to get the job done. Maybe you can think of a more appropriate way to generalize it (one possible sketch follows the example output below).
You can use lag() and lead() with the dplyr package.
library(dplyr)
data %>%
  mutate(S = ifelse(S == "S", S,
              ifelse(lag(S == "S"), "S+1",
              ifelse(lag(S == "S", 2), "S+2",
              ifelse(lead(S == "S"), "S-1",
              ifelse(lead(S == "S", 2), "S-2", ""))))),
         S = ifelse(is.na(S), "", S))
(You would get NA values in the first two rows if the first value is not <35, so replace these with "".)
N S
1 52
2 86
3 86
4 57
5 54
6 57
7 51
8 98
9 100 S-2
10 73 S-1
11 30 S
12 52 S+1
13 86 S+2
14 86
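As referenced above, a hedged generalisation sketch (not from the original answer; it assumes the data frame and S column set up in the question): loop over the wanted offsets instead of nesting ifelse() calls.
# Sketch only: label the 3 rows before and 2 rows after every "S",
# ignoring overlaps (later offsets simply overwrite earlier ones)
offsets <- c(-3, -2, -1, 1, 2)
labels  <- c("S-3", "S-2", "S-1", "S+1", "S+2")
s_idx <- which(data$S == "S")
for (k in seq_along(offsets)) {
  idx <- s_idx + offsets[k]
  idx <- idx[idx >= 1 & idx <= nrow(data)]   # stay inside the data
  idx <- idx[data$S[idx] != "S"]             # never overwrite an "S" row
  data$S[idx] <- labels[k]
}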

This is a long-ish answer since I break it down into pieces I would normally implement using a pipeline and lambda expressions, but it should be easy enough to follow.
I will work on row indices and compute two vectors, one containing the index closest to i on the left that has label "S", and one containing the index closest to i on the right.
indices <- 1:length(data$S)
closest_left <- rep(NA, length = length(indices))
closest_right <- rep(NA, length = length(indices))
I compute these using purrr's reduce functions but you could easily do it in a loop as well.
this_or_left <- function(left_val, i) {
  res <- if (data$S[[i]] == "S") i else left_val
  closest_left[[i]] <<- res
  res
}
this_or_right <- function(right_val, i) {
  res <- if (data$S[[i]] == "S") i else right_val
  closest_right[[i]] <<- res
  res
}
purrr::reduce(indices, this_or_left, .init = this_or_left(NA, 1))
purrr::reduce_right(indices, this_or_right, .init = this_or_right(NA, length(indices)))
Whether you could do it with vectorised expressions I don't know. Possibly. I didn't try.
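As a hedged aside (not part of the original answer): one vectorised way to get the same two vectors is to record the index of each "S" row and run cummax()/cummin() from each end; this assumes the same data and S column as above.
# Vectorised sketch: index of each "S" row, NA elsewhere
s_pos <- ifelse(data$S == "S", seq_along(data$S), NA_integer_)
# nearest S at or before each row (NA if there is none yet)
closest_left <- cummax(ifelse(is.na(s_pos), -Inf, s_pos))
closest_left[!is.finite(closest_left)] <- NA
# nearest S at or after each row (NA if there is none remaining)
closest_right <- rev(cummin(rev(ifelse(is.na(s_pos), Inf, s_pos))))
closest_right[!is.finite(closest_right)] <- NA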
Now, I simply have to compute the signed distance to the closest S and make labels from that, using empty labels when a row is more than 3 rows before or more than 2 rows after an S, and the label "S" when the distance is zero.
get_dist <- Vectorize(function(i) {
  down <- i - closest_left[i]   # rows since the S on the left (NA if none)
  up <- closest_right[i] - i    # rows until the S on the right (NA if none)
  if (is.na(down) && is.na(up)) NA
  else if (is.na(down) || (!is.na(up) && up < down)) -up  # closer to the S ahead -> "S-up"
  else down                                               # on or after an S -> "S+down"
})
make_label <- Vectorize(function(dist) {
  if (is.na(dist)) ""
  else if (dist == 0) "S"
  else if (dist < -3 || dist > 2) ""     # outside the 3-before / 2-after window
  else if (dist < 0) paste0("S", dist)   # e.g. -2 gives "S-2"
  else paste0("S+", dist)                # e.g.  1 gives "S+1"
})
make_label(get_dist(indices))
Here, I used Vectorize() to change it up a little.
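A small hedged usage note (assuming the objects computed above): the labels can be written back onto the data frame.
# Assign the computed labels back to the S column
data$S <- make_label(get_dist(indices))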

Related

R: Introducing a WHILE LOOP to a Function

I am working with the R programming language.
I am trying to find the first time a certain pattern (e.g. ABCD) appears in a random string (e.g. ACABCDCDBCABCDBC, answer = 6). I wrote the following to do this:
library(stringr)
letters <- c("A", "B", "C", "D")
results <- list()
for (i in 1:100) {
  iteration_i = i
  letters_i = paste(sample(letters, 100, replace = TRUE, prob = c(0.25, 0.25, 0.25, 0.25)), collapse = "")
  position_i = str_locate(letters_i, "ADBC")
  results_tmp = data.frame(iteration_i, letters_i, position_i)
  results[[i]] <- results_tmp
}
results_df <- do.call(rbind.data.frame, results)
This looks something like this now (note: I don't think this is correct - in row 5, I see ABCD at the beginning of the row, but it's being recorded as NA for some reason):
iteration_i letters_i start end
1 1 BACDCCCDCCCDCDDBBCBBAACACBBBBAAABDDDACAABDDABBABADCDDCDACCBBBCABCDABCDCCCDADDDBADBDCADAABDBDCDCAACCB NA NA
2 2 CACACCCCDCCBADACBBAADBCABBAAAAADBDDBCADCAAADADAAABDCABBAABABBCBDADCDDDDCDBADDBDCBCDDDBDCDDAACBBBBACA 20 23
3 3 CDCBDAABDDDDADBAAABBADAADBDDDBDADDCABADDDCDABBBCBCBBACBBDADABBCDCCACDBCDCDDBDBADBCDCADDADDDBDBAAABBD 79 82
4 4 ADBCDBADADBAAACAADACACACACBDDCACBDACCBDAAABDBAAAABBCCDBADADDADCBCABCBAABDCBCDCDACDCCDBADCBDDAADBCDAC 1 4
5 5 D**ABCD**DDCCBCDABADBBBBCDBCADCBBBDCAAACACCCBCBCADBDDABBACACBDABAAACCAAAAACCCCBCBCCABABDDADBABDDDCCDDCCC NA NA
6 6 DDDDDBDDDDBDDDABDDADAADCABCDAABBCCCDAABDDAACBDABBBBBABBCBDADBDCCAAADACCBCDDBDCAADCBBBCACDBBADDDDCABC NA NA
Currently, I am only generating 100 letters and hoping that this is enough to observe the desired pattern (sometimes this doesn't happen, notice the NA's) - is there a way to add a WHILE LOOP to what I have written to keep generating letters until the desired pattern first appears?
Can someone please show me how to do this?
Thanks!
The loop below is a repeat loop rather than a while loop; it only breaks once the pattern is found. I have set the results list length to 2; there's no point in making it bigger just to test the code.
library(stringr)
Letters <- c("A", "B", "C", "D")
Pattern <- "ADBC"
n <- 2L
set.seed(2022)
results <- vector("list", length = n)
for (i in seq.int(n)) {
  repeat {
    l <- sample(Letters, 100, replace = TRUE, prob = c(0.25, 0.25, 0.25, 0.25))
    letters_i <- paste(l, collapse = "")
    position_i <- str_locate(letters_i, pattern = Pattern)
    if (any(!is.na(position_i))) break
  }
  results_tmp <- data.frame(iteration = i, letters = letters_i, position_i)
  results[[i]] <- results_tmp
}
results_df <- do.call(rbind.data.frame, results)
results_df
#> iteration letters start end
#> 1 1 ADBDBDBBCABBBDDBADDAADCBBADACACDCCBBADAADCDDABADCABCDCDDCCCBDDAABACCBDAAAADBDDCCCCADBCBBDABBDCCCBADD 83 86
#> 2 2 DDBDBDBCDDBDBBBDBABBCCBBCCBDBDABBAAABACABADCCBBABADBCCCDABABBDBADCADCABDDDAAACCBDCAACACACBBDDDACCDDC 50 53
Created on 2022-06-11 by the reprex package (v2.0.1)
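Since the question asked specifically about a while loop, here is a hedged sketch of the same idea using while() instead of repeat; it assumes the Letters and Pattern objects defined above and uses the default equal sampling probabilities.
# Keep drawing 100-letter strings until the pattern is found at least once
position_i <- NA
while (all(is.na(position_i))) {
  letters_i  <- paste(sample(Letters, 100, replace = TRUE), collapse = "")
  position_i <- str_locate(letters_i, pattern = Pattern)
}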

R: Find set of columns which contain most 1s in matrix of 0 and 1

I have a matrix of 1s and 0s where the rows are individuals and the columns are events. A 1 indicates that an event happened to an individual and a 0 that it did not.
I want to find which set of (in the example) 5 columns/events that cover the most rows/individuals.
Test Data
#Make test data
set.seed(123)
d <- sapply(1:300, function(x) sample(c(0,1), 30, T, c(0.9,0.1)))
colnames(d) <- 1:300
rownames(d) <- 1:30
My attempt
My initial attempt was just based on combining the set of 5 columns with the highest colMeans:
#Get top 5 columns with highest row coverage
col_set <- head(sort(colMeans(d), decreasing = T), 5)
#Have a look the set
col_set
>
197 199 59 80 76
0.2666667 0.2666667 0.2333333 0.2333333 0.2000000
#Check row coverage of the column set
sum(apply(d[,colnames(d) %in% names(col_set)], 1, sum) > 0) / 30 #top 5
>
[1] 0.7
However this set does not cover the most rows. I tested this by pseudo-randomly sampling 10,000 different sets of 5 columns, and then finding the set with the highest coverage:
#Get 5 random columns using colMeans as prob in sample
##Random sample 10,000 times
set.seed(123)
result <- lapply(1:10000, function(x){
col_set2 <- sample(colMeans(d), 5, F, colMeans(d))
cover <- sum(apply(d[,colnames(d) %in% names(col_set2)], 1, sum) > 0) / 30 #random 5
list(set = col_set2, cover = cover)
})
##Have a look at the best set
result[which.max(sapply(result, function(x) x[["cover"]]))]
>
[[1]]
[[1]]$set
59 169 262 68 197
0.23333333 0.10000000 0.06666667 0.16666667 0.26666667
[[1]]$cover
[1] 0.7666667
The reason for supplying the colMeans to sample is that the columns with the highest coverages are the ones I am most interested in.
So, using pseudo-random sampling I can collect a set of columns with higher coverage than when just using the top 5 columns. However, since my actual data sets are larger than the example I am looking for a more efficient and rational way of finding the set of columns with the highest coverage.
EDIT
For the interested, I decided to microbenchmark the 3 solutions provided:
#Defining G. Grothendieck's coverage function outside his solutions
coverage <- function(ix) sum(rowSums(d[, ix]) > 0) / 30
#G. Grothendieck top solution
solution1 <- function(d) {
  cols <- tail(as.numeric(names(sort(colSums(d)))), 20)
  co <- combn(cols, 5)
  itop <- which.max(apply(co, 2, coverage))
  co[, itop]
}
#G. Grothendieck "Older solution"
solution2 <- function(d) {
  require(lpSolve)
  ones <- rep(1, 300)
  res <- lp("max", colSums(d), t(ones), "<=", 5, all.bin = TRUE, num.bin.solns = 10)
  m <- matrix(res$solution[1:3000] == 1, 300)
  cols <- which(rowSums(m) > 0)
  co <- combn(cols, 5)
  itop <- which.max(apply(co, 2, coverage))
  co[, itop]
}
#user2554330 solution
bestCols <- function(d, n = 5) {
  result <- numeric(n)
  for (i in seq_len(n)) {
    result[i] <- which.max(colMeans(d))
    d <- d[d[, result[i]] != 1, , drop = FALSE]
  }
  result
}
#Benchmarking...
microbenchmark::microbenchmark(solution1 = solution1(d),
                               solution2 = solution2(d),
                               solution3 = bestCols(d), times = 10)
>
Unit: microseconds
      expr        min         lq       mean      median         uq       max neval
 solution1 390811.850 497155.887 549314.385 578686.3475 607291.286 651093.16    10
 solution2  55252.890  71492.781  84613.301  84811.7210  93916.544 117451.35    10
 solution3    425.922    517.843   3087.758    589.3145    641.551  25742.11    10
This looks like a relatively hard optimization problem, because of the ways columns interact. An approximate strategy would be to pick the column with the highest mean; then delete the rows with ones in that column, and repeat. You won't necessarily find the best solution this way, but you should get a fairly good one.
For example,
set.seed(123)
d <- sapply(1:300, function(x) sample(c(0,1), 30, T, c(0.9,0.1)))
colnames(d) <- 1:300
rownames(d) <- 1:30
bestCols <- function(d, n = 5) {
  result <- numeric(n)
  for (i in seq_len(n)) {
    result[i] <- which.max(colMeans(d))
    d <- d[d[, result[i]] != 1, , drop = FALSE]
  }
  cat("final dim is ", dim(d))
  result
}
col_set <- bestCols(d)
sum(apply(d[,colnames(d) %in% col_set], 1, sum) > 0) / 30 #top 5
This gives 90% coverage.
The following provides a heuristic to find an approximate solution. Find the N=20 columns, say, with the most ones, cols, and then use brute force to find every subset of 5 columns out of those 20. The subset having the highest coverage is shown below and its coverage is 93.3%.
coverage <- function(ix) sum(rowSums(d[, ix]) > 0) / 30
N <- 20
cols <- tail(as.numeric(names(sort(colSums(d)))), N)
co <- combn(cols, 5)
itop <- which.max(apply(co, 2, coverage))
co[, itop]
## [1] 90 123 197 199 286
coverage(co[, itop])
## [1] 0.9333333
Repeating this for N=5, 10, 15 and 20 we get coverages of 83.3%, 86.7%, 90% and 93.3%. The higher the N the better the coverage but the lower the N the less the run time.
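A hedged sketch of how those coverages can be reproduced (assuming d and the coverage() helper defined above):
# Coverage of the best 5-column subset drawn from the top-N columns, for several N
sapply(c(5, 10, 15, 20), function(N) {
  cols <- tail(as.numeric(names(sort(colSums(d)))), N)
  co   <- combn(cols, 5)
  max(apply(co, 2, coverage))
})
# should give roughly 0.833, 0.867, 0.900 and 0.933, as stated above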
Older solution
We can approximate the problem with a knapsack problem that chooses the 5 columns with largest numbers of ones using integer linear programming.
We take the 10 best solutions to this approximate problem and collect all columns that appear in at least one of them. There are 14 such columns, and we then use brute force to find which subset of 5 of those 14 columns has the highest coverage.
library(lpSolve)
ones <- rep(1, 300)
res <- lp("max", colSums(d), t(ones), "<=", 5, all.bin = TRUE, num.bin.solns = 10)
coverage <- function(ix) sum(rowSums(d[, ix]) > 0) / 30
# each column of m is logical 300-vector defining possible soln
m <- matrix(res$solution[1:3000] == 1, 300)
# cols is the set of columns which are in any of the 10 solutions
cols <- which(rowSums(m) > 0)
length(cols)
## [1] 14
# use brute force to find the 5 best columns among cols
co <- combn(cols, 5)
itop <- which.max(apply(co, 2, coverage))
co[, itop]
## [1] 90 123 197 199 286
coverage(co[, itop])
## [1] 0.9333333
You can test whether there is a better column and exchange it with one of the columns currently in the selection.
n <- 5  #Number of columns / events
i <- rep(1, n)
for (k in 1:10) {  #How many times to iterate
  tt <- i
  for (j in seq_along(i)) {
    x <- +(rowSums(d[, i[-j]]) > 0)
    i[j] <- which.max(colSums(x == 0 & d == 1))
  }
  if (identical(tt, i)) break
}
sort(i)
#[1] 90 123 197 199 286
mean(rowSums(d[, i]) > 0)
#[1] 0.9333333
Since the initial condition influences the result, you can also use random starts.
n <- 5  #Number of columns / events
x <- apply(d, 2, function(x) colSums(x == 0 & d == 1))
diag(x) <- -1
idx <- which(!apply(x == 0, 1, any))
x <- apply(d, 2, function(x) colSums(x != d))
diag(x) <- -1
x[upper.tri(x)] <- -1
idx <- unname(c(idx, which(apply(x == 0, 1, any))))
res <- sample(idx, n)
for (l in 1:100) {
  i <- sample(idx, n)
  for (k in 1:10) {  #How many times to iterate
    tt <- i
    for (j in seq_along(i)) {
      x <- +(rowSums(d[, i[-j]]) > 0)
      i[j] <- which.max(colSums(x == 0 & d == 1))
    }
    if (identical(tt, i)) break
  }
  if (sum(rowSums(d[, i]) > 0) > sum(rowSums(d[, res]) > 0)) res <- i
}
sort(res)
#[1] 90 123 197 199 286
mean(rowSums(d[, res]) > 0)
#[1] 0.9333333

Multiple conditions not giving output in R

I have a dataframe
df <- data.frame(ID = paste(rep("a", 7), c(1:7), sep = ""),
                 col2 = c(12, 10, 1, 2, 5, 10, 8),
                 col3 = c(200, 150, 180, 450, 100, 130, 200))
I'm trying to find a combination of those elements whose col2 values sum to a particular number AND, for those same elements, to check a sum-product.
But it is not giving any output.
For example, from dataframe df these values make a sum of 25:
col2 col3
  10  150
   5  100
  10  130
=SUMPRODUCT(E3:E5, D3:D5) [Excel formula]
output = 3300
And this is what I expect as output from the following code: a2, a5, a6.
This is my code
for (i in 1:nrow(df)) {
  comb <- combn(1:nrow(df), i, FUN = NULL, simplify = TRUE)
  for (j in 1:ncol(comb)) {
    subvec <- comb[, j]
    a <- sum(df[subvec, 2])
    b <- sum(df[i, 2] * df[i, 3])
    if (a == 25 && b == 3300) {
      print(df[subvec, 1])
    }
  }
}
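A hedged sketch of the likely fix (not part of the original post): b is computed from the loop index i rather than from subvec, so the condition never matches; computing both quantities over the same candidate subset finds a2, a5, a6.
# Sketch: evaluate the sum and the sum-product over the same candidate subset
for (i in 1:nrow(df)) {
  comb <- combn(1:nrow(df), i)
  for (j in 1:ncol(comb)) {
    subvec <- comb[, j]
    a <- sum(df[subvec, 2])                    # sum of col2 over the subset
    b <- sum(df[subvec, 2] * df[subvec, 3])    # SUMPRODUCT over the subset
    if (a == 25 && b == 3300) {
      print(df[subvec, 1])
    }
  }
}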

The which function in R is not giving the desired output

I have a matrix that contains 3 columns and in total 10,000 elements. First and second columns are indexes and third column is the score. I want to normalize the score column based on this formula:
Normalized_score_i_j = score_i_j / (sqrt(score_i_i) * sqrt(score_j_j))
score_i_j = the current score itself
score_i_i = look at current score's index in first column, and in the dataset look for a score that has that index in both its first and second columns
score_j_j = look at current score's index in second column, and in the dataset look for a score that has that index in both its first and second columns
An example is for instance, if df is as follow:
df <- read.table(text = "
First.Protein,Second.Protein,Score
1,1,25
1,2,90
1,3,82
1,4,19
2,1,90
2,2,99
2,3,76
2,4,79
3,1,82
3,2,76
3,3,91
3,4,33
4,1,28
4,2,11
4,3,99
4,4,50
", header = TRUE, sep = ",")
If we are normalizing this row:
First.Protein Second.Protein Score
4 3 99
The normalized score will be:
The score itself, divided by the square root of the score whose First.Protein and Second.Protein are both 4, times the square root of the score whose First.Protein and Second.Protein are both 3.
Therefore:
Normalized = 99 / (sqrt(50) * sqrt(91)) = 1.467674
I have the code below, but it is behaving very weirdly and is giving me values that are not at all normalized and are in fact very odd:
for (i in 1:nrow(Smith_Waterman_Scores)) {
  Smith_Waterman_Scores$Score[i] <-
    Smith_Waterman_Scores$Score[i] /
      (sqrt(Smith_Waterman_Scores$Score[which(
         Smith_Waterman_Scores$First.Protein == Smith_Waterman_Scores$First.Protein[i] &
         Smith_Waterman_Scores$Second.Protein == Smith_Waterman_Scores$First.Protein[i])])) *
      (sqrt(Smith_Waterman_Scores$Score[which(
         Smith_Waterman_Scores$First.Protein == Smith_Waterman_Scores$Second.Protein[i] &
         Smith_Waterman_Scores$Second.Protein == Smith_Waterman_Scores$Second.Protein[i])]))
}
Here's a re-write of your original attempt (which() is not necessary; just use the logical vector for sub-setting; with() allows you to refer to variables in the data frame without having to re-type the name of the data.frame -- easier to read but also easier to make a mistake)
orig0 <- function(df) {
  for (i in 1:nrow(df)) {
    df$Score[i] <- with(df, {
      ii <- First.Protein == First.Protein[i] &
            Second.Protein == First.Protein[i]
      jj <- First.Protein == Second.Protein[i] &
            Second.Protein == Second.Protein[i]
      Score[i] / (sqrt(Score[ii]) * sqrt(Score[jj]))
    })
  }
  df$Score
}
The problem is that Score[ii] and Score[jj] appear on the right-hand side both before and after they've been updated. Here's a revision where the original columns are interpreted as 'read-only'
orig1 <- function(df) {
  normalized <- numeric(nrow(df))  # pre-allocate
  for (i in 1:nrow(df)) {
    normalized[i] <- with(df, {
      ii <- First.Protein == First.Protein[i] &
            Second.Protein == First.Protein[i]
      jj <- First.Protein == Second.Protein[i] &
            Second.Protein == Second.Protein[i]
      Score[i] / (sqrt(Score[ii]) * sqrt(Score[jj]))
    })
  }
  normalized
}
I think the results are now correct (see below). A better implementation would use sapply (or vapply) to avoid having to worry about the allocation of the return value
orig2 <- function(df) {
  sapply(seq_len(nrow(df)), function(i) {
    with(df, {
      ii <- First.Protein == First.Protein[i] &
            Second.Protein == First.Protein[i]
      jj <- First.Protein == Second.Protein[i] &
            Second.Protein == Second.Protein[i]
      Score[i] / (sqrt(Score[ii]) * sqrt(Score[jj]))
    })
  })
}
Now that the results are correct, we can ask about performance. Your solution requires a scan of, e.g., First.Protein, each time through the loop. There are N=nrow(df) elements of First.Protein, and you're going through the loop N times, so you'll be making a multiple of N * N = N^2 comparisons -- if you increase the size of the data frame from 10 to 100 rows, the time taken will change from 10 * 10 = 100 units, to 100 * 100 = 10000 units of time.
Several of the answers attempt to avoid that polynomial scaling. My answer does this using match() on a vector of values; this probably scales as N (each look-up occurs in constant time, and there are N look-ups), which is much better than polynomial.
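As a tiny hedged illustration (toy vectors, not the protein data) of the match()-as-lookup idiom used below:
# match() returns, for each query key, its position in the reference keys;
# indexing the reference values with those positions acts as a lookup table
keys   <- c("a", "b", "c")
values <- c(10, 20, 30)
query  <- c("c", "a", "c", "b")
values[match(query, keys)]
# [1] 30 10 30 20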
Create a subset of data with identical first and second proteins
ii = df[df$First.Protein == df$Second.Protein,]
Here's the ijth score from the original data frame
s_ij = df$Score
Look up First.Protein of df in ii and record the score; likewise for Second.Protein
s_ii = ii[match(df$First.Protein, ii$First.Protein), "Score"]
s_jj = ii[match(df$Second.Protein, ii$Second.Protein), "Score"]
The normalized scores are then
> s_ij / (sqrt(s_ii) * sqrt(s_jj))
[1] 1.0000000 1.8090681 1.7191871 0.5374012 1.8090681 1.0000000 0.8007101
[8] 1.1228571 1.7191871 0.8007101 1.0000000 0.4892245 0.7919596 0.1563472
[15] 1.4676736 1.0000000
This will be fast, using a single call to match() instead of many calls to which() inside a for loop or tests for identity inside an apply() -- both of the latter make N^2 comparisons and so scale very poorly.
I summarized some of the proposed solutions as
f0 <- function(df) {
  contingency = xtabs(Score ~ ., df)
  diagonals <- unname(diag(contingency))
  i <- df$First.Protein
  j <- df$Second.Protein
  idx <- matrix(c(i, j), ncol = 2)
  contingency[idx] / (sqrt(diagonals[i]) * sqrt(diagonals[j]))
}
f1 <- function(df) {
  ii = df[df$First.Protein == df$Second.Protein, ]
  s_ij = df$Score
  s_ii = ii[match(df$First.Protein, ii$First.Protein), "Score"]
  s_jj = ii[match(df$Second.Protein, ii$Second.Protein), "Score"]
  s_ij / (sqrt(s_ii) * sqrt(s_jj))
}
f2 <- function(dt) {
  dt.lookup <- dt[First.Protein == Second.Protein]
  setkey(dt, "First.Protein")
  setkey(dt.lookup, "First.Protein")
  colnames(dt.lookup) <- c("First.Protein", "Second.Protein", "Score1")
  dt <- dt[dt.lookup]
  setkey(dt, "Second.Protein")
  setkey(dt.lookup, "Second.Protein")
  colnames(dt.lookup) <- c("First.Protein", "Second.Protein", "Score2")
  dt[dt.lookup][
    , Normalized := Score / (sqrt(Score1) * sqrt(Score2))][
    , .(First.Protein, Second.Protein, Normalized)]
}
f3 <- function(dt) {
  eq = dt[First.Protein == Second.Protein]
  dt[eq, Score_ii := i.Score, on = "First.Protein"]
  dt[eq, Score_jj := i.Score, on = "Second.Protein"]
  dt[, Normalised := Score/sqrt(Score_ii * Score_jj)]
  dt[, c("Score_ii", "Score_jj") := NULL]
}
I know how to programmatically check that the first two generate consistent results. I don't know data.table well enough to get the normalized result out of f2() in the same order as the input columns, so I can't compare it with the others (though the results look correct 'by eye'). f3() produces numerically similar but not identical results, presumably because sqrt(Score_ii * Score_jj) and sqrt(Score_ii) * sqrt(Score_jj) differ by floating-point rounding.
> identical(orig1(df), f0(df))
[1] TRUE
> identical(f0(df), f1(df))
[1] TRUE
> identical(f0(df), { f3(dt3); dt3[["Normalized"]] }) # pass by reference!
[1] FALSE
> all.equal(f0(df), { f3(dt3); dt3[["Normalized"]] })
[1] TRUE
There are performance differences
library(data.table)
dt2 <- as.data.table(df)
dt3 <- as.data.table(df)
library(microbenchmark)
microbenchmark(f0(df), f1(df), f2(dt2), f3(dt3))
with
> microbenchmark(f0(df), f1(df), f2(df), f3(df))
Unit: microseconds
   expr      min        lq      mean    median       uq      max neval
 f0(df)  967.117  992.8365 1059.7076 1030.9710 1094.247 2384.360   100
 f1(df)  176.238  192.8610  210.4059  207.8865  219.687  333.260   100
 f2(df) 4884.922 4947.6650 5156.0985 5017.1785 5142.498 6785.975   100
 f3(df) 3281.185 3329.4440 3463.8073 3366.3825 3443.400 5144.430   100
The solutions f0 - f3 are likely to scale well (especially data.table) with real data; the fact that the times are in microseconds probably means that speed is not important (now that we are not implementing an N^2 algorithm).
On reflection, a more straightforward implementation of f1() just looks up the 'diagonal' elements
f1a <- function(df) {
  ii = df[df$First.Protein == df$Second.Protein, ]
  d = sqrt(ii$Score[order(ii$First.Protein)])
  df$Score / (d[df$First.Protein] * d[df$Second.Protein])
}
You may be doing this in a very round-about manner. Can you see if this works for you:
R> xx
First Second Score
1 1 1 25
2 1 2 90
3 1 3 82
4 1 4 19
5 2 1 90
6 2 2 99
7 2 3 76
8 2 4 79
9 3 1 82
10 3 2 76
11 3 3 91
12 3 4 33
13 4 1 28
14 4 2 11
15 4 3 99
16 4 4 50
R> contingency = xtabs(Score ~ ., data=xx)
R> contingency
Second
First 1 2 3 4
1 25 90 82 19
2 90 99 76 79
3 82 76 91 33
4 28 11 99 50
R> diagonals <- unname(diag(contingency))
R> diagonals
[1] 25 99 91 50
R> normalize <- function (i, j, contingencies, diagonals) {
+ contingencies[i, j] / (sqrt(diagonals[i]) * sqrt(diagonals[j]))
+ }
R> normalize(4, 3, contingency, diagonals)
[1] 1.467674
Here's how I'd approach this using data.table. Hopefully @MartinMorgan finds this easier to understand :-).
require(data.table) # v1.9.6+
dt = as.data.table(df) # or use setDT(df) to convert by reference
eq = dt[First.Protein == Second.Protein]
So far, I've just created a new data.table eq which contains all rows where both columns are equal.
dt[eq, Score_ii := i.Score, on = "First.Protein"]
dt[eq, Score_jj := i.Score, on = "Second.Protein"]
Here we add the columns Score_ii and Score_jj while joining on the columns First.Protein and Second.Protein respectively. That these are join operations should be clear from the on= argument. The i. prefix refers to the Score column of the data.table provided as the i-argument (here, eq's Score).
Note that we could use match() here as well, but that wouldn't work if you have to look up rows directly (and as efficiently) based on more than one column. Using on=, we can extend this quite easily, and it is also much easier to read and understand.
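As a hedged illustration of that extension (not from the original answer; Score_diag is just a made-up column name and is not needed for this task), joining on more than one column only means listing both names in on=:
# Illustration only: a two-column join; here it would only touch the diagonal rows
dt[eq, Score_diag := i.Score, on = c("First.Protein", "Second.Protein")]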
Once we've all the required columns, the task is just to get the final Normalised column (and delete the intermediates if they're not necessary).
dt[, Normalised := Score/sqrt(Score_ii * Score_jj)]
dt[, c("Score_ii", "Score_jj") := NULL] # delete if you don't want them
I'll leave out the micro- and milli- second benchmarks as I'm not interested in them.
PS: The columns Score_ii and Score_jj are added above on purpose under the assumption that you might need them. If you don't want them at all, you can also do:
Score_ii = eq[dt, Score, on = "First.Protein"] ## -- (1)
Score_jj = eq[dt, Score, on = "Second.Protein"]
(1) reads: for each row in dt get matching row in eq while matching on column First.Protein and extract eq$Score corresponding to that matching row.
Then, we can directly add the Normalised column as:
dt[, Normalised := Score / sqrt(Score_ii * Score_jj)]
You can implement this with joins; here is an example using data.table:
library(data.table)
dt <- data.table(df)
dt.lookup <- dt[First.Protein == Second.Protein]
setkey(dt,"First.Protein" )
setkey(dt.lookup,"First.Protein" )
colnames(dt.lookup) <- c("First.Protein","Second.Protein","Score1")
dt <- dt[dt.lookup]
setkey(dt,"Second.Protein" )
setkey(dt.lookup,"Second.Protein")
colnames(dt.lookup) <- c("First.Protein","Second.Protein","Score2")
dt <- dt[dt.lookup][
, Normalized := Score / (sqrt(Score1) * sqrt(Score2))][
, .(First.Protein, Second.Protein, Normalized)]
Just make sure you don't use for loops.
Loop through rows using apply:
#compute
df$ScoreNorm <-
  apply(df, 1, function(i) {
    i[3] /
      (sqrt(df[df$First.Protein == i[1] &
               df$Second.Protein == i[1], "Score"]) *
       sqrt(df[df$First.Protein == i[2] &
               df$Second.Protein == i[2], "Score"]))
  })
#test output
df[15, ]
# First.Protein Second.Protein Score ScoreNorm
# 15 4 3 99 1.467674

How to pairwise compare values referring to distinct elements in two matrices of different formats?

I've got a set of objects, let's say with the IDs 'A' to 'J'. And I've got two data frames which look the following way (as you can see, the second data frame is symmetric):
df1 <- data.frame(ID = LETTERS[1:5], Var = c(9, 13, 15, 11, 28))
df2 <- as.data.frame(matrix(data = c(NA, 42, 83, 74, 84,
                                     42, NA, 26, 69,  9,
                                     83, 26, NA, 67, 95,
                                     74, 69, 67, NA,  6,
                                     84,  9, 95,  6, NA),
                            ncol = 5, nrow = 5,
                            dimnames = list(df1$ID, df1$ID)))
For example, take the objects 'B' and 'E'. I want to know: Is 13+28 (from df1) less than 9 (from df2)? I'd like to know this for all pairs of objects. The output should be
(a) a logical data frame structured like df2 and
(b) the number of "TRUE" values.
Most of the time I will only need result (b), but sometimes I would also need (a). So if (b) can be calculated without (a) and if this would be significantly faster, then I'd like to have both algorithms in order to select the suitable one dependent on which output I need to answer a particular question.
I'm comparing around 2000 objects, so the algorithm should be reasonably fast. So far I've been only able to implement this with two nested for-loops which is awfully slow. I bet there is a much nicer way to do this, maybe exploiting vectorisation.
This is what it currently looks like:
df3 <- as.data.frame(matrix(data = NA, ncol = nrow(df1), nrow = nrow(df1),
dimnames = list(df1$ID, df1$ID)))
for (i in 2:nrow(df3)){
for (j in 1:(i-1)){
sum.val <- df1[df1$ID == rownames(df3)[i], "Var"] + df1[df1$ID == names(df3)[j], "Var"]
df3[i,j] <- sum.val <= df2[i,j]
}
}
Is this what you want?
df3 <- outer(df1$Var, df1$Var, "+")
df3
df4 <- df3 < df2
df4
sum(df4, na.rm = TRUE)
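A hedged follow-up (not part of the original answer): because the result is symmetric, sum(df4, na.rm = TRUE) counts every pair twice; restricting the count to the upper triangle counts each unordered pair once.
# Count each pair of objects once (assumes df4 from the lines above)
sum(df4[upper.tri(df4)], na.rm = TRUE)
# each of the 7 qualifying pairs is counted once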
Here's one way to do it...
# Get row and column indices
ind <- t( combn( df1$ID , 2 ) )
# Get totals
tot <- with( df1 , Var[ match( ind[,1] , ID ) ] + Var[ match( ind[,2] , ID ) ] )
# Make df2 a matrix
m <- as.matrix( df2 )
# Total number of values is simply
sum( m[ ind ] > tot )
#[1] 7
# Fill the corresponding (upper triangle) cells with the comparison results (1 = TRUE);
# indexing with the two-column matrix 'ind' keeps values and cells aligned
m[ind] <- m[ind] > tot
m
#    A  B  C  D  E
# A NA  1  1  1  1
# B 42 NA  0  1  0
# C 83 26 NA  1  1
# D 74 69 67 NA  0
# E 84  9 95  6 NA
This will do what you want.
# Generate the data
df1 <- data.frame(ID = LETTERS[1:5], Var = c(9,13,15,11,28))
df2 <- as.data.frame(matrix(data = c(NA,42,83,74,84,42,NA,26,
69,9,83,26,NA,67,95,74,69,
67,NA,6,84,9,95,6,NA),
ncol = 5, nrow = 5,
dimnames = list(df1$ID, df1$ID)))
# Define a pairwise comparison index matrix using 'combn'
idx <- combn(nrow(df1), 2)
# Create a results matrix
res <- matrix(NA, ncol = ncol(df2), nrow = nrow(df2))
# Loop through 'idx' for each possible comparison (without repeats)
for (i in 1:ncol(idx)) {
  logiTest <- (df1$Var[idx[1, i]] + df1$Var[idx[2, i]]) < df2[idx[1, i], idx[2, i]]
  res[idx[1, i], idx[2, i]] <- logiTest
  res[idx[2, i], idx[1, i]] <- logiTest
}
# Count the number of 'true' comparisons
nTrues <- sum(res, na.rm = TRUE) / 2
The code simply uses a pairwise comparison index (idx) to define which elements in both df1 and df2 are to be used in each iteration of the 'for loop'. It then uses this same index to define where in the 'res' matrix the answer to the logical test is to be written.
N.B. This code will break down if the order of elements in df1 and df2 is not the same. In such cases, it would be appropriate to use the actual letters to define which values to compare.
