Subtraction based on two factors

My dataframe looks like so:
group <- c("A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C", "C", "C")
value <- c(3:6, 1:4, 4:9)
type <- c("d", "d", "e", "e", "g", "g", "e", "e", "d", "d", "e", "e", "f", "f")
df <- cbind.data.frame(group, value, type)
df
group value type
1 A 3 d
2 A 4 d
3 A 5 e
4 A 6 e
5 B 1 g
6 B 2 g
7 B 3 e
8 B 4 e
9 C 4 d
10 C 5 d
11 C 6 e
12 C 7 e
13 C 8 f
14 C 9 f
Within each level of factor "group" I would like to subtract the values based on "type", such that (for group "A") I get 3 - 5 (1st value of d - 1st value of e) and 4 - 6 (2nd value of d - 2nd value of e). My outcome should look similar to this:
A
group d_e
1 A -2
2 A -2
B
group g_e
1 B -2
2 B -2
C
group d_e d_f e_f
1 C -2 -4 -2
2 C -2 -4 -2
So if - as for group C - there are more than 2 types, I would like to calculate the difference between each combination of types.
Reading this post I reckon I could maybe use ddply and transform. However, I am struggling with finding a way to automatically assign the types, given that each group consists of different types and also different numbers of types.
Do you have any suggestions as to how I could manage that?

It's not clear why the sample output in the question has two identical rows in each group rather than just one, but at any rate this produces output similar to that shown:
DF <- df[!duplicated(df[-2]), ]
f <- function(x) setNames(
data.frame(group = x$group[1:2], as.list(- combn(x$value, 2, diff))),
c("group", combn(x$type, 2, paste, collapse = "_"))
)
by(DF, DF$group, f)
giving:
DF$group: A
group d_e
1 A -2
2 A -2
------------------------------------------------------------
DF$group: B
group g_e
1 B -2
2 B -2
------------------------------------------------------------
DF$group: C
group d_e d_f e_f
1 C -2 -4 -2
2 C -2 -4 -2
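If the two rows per group are meant to pair the k-th value of one type with the k-th value of the other (as "1st value of d - 1st value of e" in the question suggests), something along these lines might be closer to the intent. This is only a sketch: the helper name pairwise_diffs is made up for illustration, and it assumes that every type occurs the same number of times within a group and that each group has at least two types.

```r
group <- c("A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C", "C", "C")
value <- c(3:6, 1:4, 4:9)
type  <- c("d", "d", "e", "e", "g", "g", "e", "e", "d", "d", "e", "e", "f", "f")
df <- data.frame(group, value, type)

# Hypothetical helper: element-wise differences for every pair of types
# within one group (assumes equal counts per type, at least two types).
pairwise_diffs <- function(x) {
  type <- as.character(x$type)
  # values per type, keeping the types in order of first appearance
  vals <- split(x$value, factor(type, levels = unique(type)))
  pairs <- combn(names(vals), 2, simplify = FALSE)
  out <- lapply(pairs, function(p) vals[[p[1]]] - vals[[p[2]]])
  names(out) <- vapply(pairs, paste, character(1), collapse = "_")
  data.frame(group = x$group[seq_along(out[[1]])], out)
}

res <- lapply(split(df, df$group), pairwise_diffs)
res
```

On the sample data the two rows of each group still come out identical, because the values within each type happen to differ by the same amount.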

Related

How to remove rows based on the column values

I have a large data.frame, example:
> m <- matrix(c(3,6,2,5,3,3,2,5,4,3,5,3,6,3,6,7,5,8,2,5,5,4,9,2,2), nrow=5, ncol=5)
> colnames(m) <- c("A", "B", "C", "D", "E")
> rownames(m) <- c("a", "b", "c", "d", "e")
> m
A B C D E
a 3 3 5 7 5
b 6 2 3 5 4
c 2 5 6 8 9
d 5 4 3 2 2
e 3 3 6 5 2
I would like to remove all rows where column A and/or column B has a greater value than columns C, D and E.
So in this case rows b, d, e should be removed and I should get this:
A B C D E
a 3 3 5 7 5
c 2 5 6 8 9
I cannot remove them one by one because the data.frame has more than a million rows.
Thanks

Use subsetting, together with pmin() and pmax() to retain the values that you want. I'm not sure that I fully understand your criteria (you said "C D and E" but since you want to throw away row e, I think that you meant C, D or E ), but the following seems to do what you want:
> m[pmax(m[,"A"],m[,"B"])<=pmin(m[,"C"],m[,"D"],m[,"E"]),]
A B C D E
a 3 3 5 7 5
c 2 5 6 8 9
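If the two sets of columns are not fixed in advance, the same pmax()/pmin() comparison can be written with do.call(). A minimal sketch, assuming the data is held in a data frame; the names left and right are made up for illustration:

```r
m <- matrix(c(3,6,2,5,3,3,2,5,4,3,5,3,6,3,6,7,5,8,2,5,5,4,9,2,2), nrow = 5, ncol = 5)
colnames(m) <- c("A", "B", "C", "D", "E")
rownames(m) <- c("a", "b", "c", "d", "e")
df <- as.data.frame(m)

left  <- c("A", "B")        # columns that must not exceed ...
right <- c("C", "D", "E")   # ... any of these columns

# row-wise maximum of the left columns and row-wise minimum of the right ones
row_max <- do.call(pmax, unname(as.list(df[left])))
row_min <- do.call(pmin, unname(as.list(df[right])))

kept <- df[row_max <= row_min, ]
kept
```

Both pmax() and pmin() are vectorised, so this avoids a row-by-row loop, which matters with more than a million rows.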

# creating the df
m <- matrix(c(3,6,2,5,3,3,2,5,4,3,5,3,6,3,6,7,5,8,2,5,5,4,9,2,2), nrow=5, ncol=5)
colnames(m) <- c("A", "B", "C", "D", "E")
rownames(m) <- c("a", "b", "c", "d", "e")
# initialize as data frame.
m <- as.data.frame(m)
df_n <- m
for (i in seq_len(nrow(m))) {
  # flag the row if the larger of A and B exceeds the smallest of C, D and E
  if (max(m[i, 1:2]) > min(m[i, 3:5])) {
    df_n[i, ] <- NA
  }
}
#df_n
df_n <- df_n[complete.cases(df_n), ]
print(df_n)
Results
> print(df_n)
A B C D E
a 3 3 5 7 5
c 2 5 6 8 9

Here's another solution with apply:
m[apply(m, 1, function(x) max(x[1], x[2]) <= min(x[3], x[4], x[5])),]
Result:
A B C D E
a 3 3 5 7 5
c 2 5 6 8 9
I think what you actually meant is to remove rows where max(A, B) > min(C, D, E), which translates to keeping rows where every value in A and B is no larger than every value in C, D, and E.

Correction for multiple testing for very large files with repetitions

I have 10 files with size ~8-9 Gb like:
7 72603 0.0780181622612
15 72603 0.027069072329
20 72603 0.00215643186987
24 72603 0.00247965378216
29 72603 0.0785606184492
32 72603 0.0486866833899
33 72603 0.000123332654879
For each pair of numbers (1st and 2nd column) I have p-value (3rd column).
However, some pairs are repeated, possibly in different files and with the two numbers swapped, and I want to keep only one of each. If the files were smaller, I would use pandas. Example of a repeated pair:
7 15 0.0012423442
...
15 7 0.0012423442
Also I want to apply to this set a correction for multiple testing, but the vector of values is very large.
Is it possible to do this with Python or R?

> df <- data.frame(V1 = c("A", "A", "B", "B", "C", "C"),
+ V2 = c("B", "C", "A", "C", "A", "B"),
+ n = c(1, 3, 1, 2, 3, 2))
> df
V1 V2 n
1 A B 1
2 A C 3
3 B A 1
4 B C 2
5 C A 3
6 C B 2
> df[!duplicated(t(apply(df, 1, sort))), ]
V1 V2 n
1 A B 1
2 A C 3
4 B C 2
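For numeric identifiers like those in the question, the swapped pairs can also be detected without sorting each row, by building an order-independent key with pmin() and pmax(). The p-values can then be corrected with p.adjust(). This is a sketch on toy data only: apart from the 7/15 pair, the rows are made up for illustration, Benjamini-Hochberg is just one possible choice of method, and it assumes the deduplicated table fits in memory, which it may not for 10 files of 8-9 Gb.

```r
# toy data: rows 1 and 4 are the same pair with the columns swapped
pairs <- data.frame(
  V1 = c(7, 15, 20, 15),
  V2 = c(15, 72603, 72603, 7),
  p  = c(0.0012423442, 0.027069072329, 0.00215643186987, 0.0012423442)
)

# order-independent key: smaller id first, larger id second
lo <- pmin(pairs$V1, pairs$V2)
hi <- pmax(pairs$V1, pairs$V2)

# keep the first occurrence of each unordered pair
uniq <- pairs[!duplicated(data.frame(lo, hi)), ]

# multiple-testing correction on the deduplicated p-values
uniq$p_adj <- p.adjust(uniq$p, method = "BH")
uniq
```

Only the key columns are compared here, so the p-value column plays no part in deciding what counts as a duplicate.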

Summing columns of characters in R data frame to create a new column

I currently have a data frame in R, where each entry is a character. However, each character also corresponds to a point value, where: B = 10, S = 1, C = 1, X = 0.
For example, consider the following data frame
> df = data.frame(p1 = c("B", "B", "C", "C", "S", "S", "X"), p2 = c("X", "B", "B", "S", "C", "S", "X"), p3 = c("C", "B", "B", "X", "C", "S", "X"))
> df
p1 p2 p3
1 B X C
2 B B B
3 C B B
4 C S X
5 S C C
6 S S S
7 X X X
I want to create three new columns in R: c1, c2, c3 where these are essentially the "lagged" sum (using the numeric values of each characters) of the p1, p2, and p3 values.
p1 p2 p3 c1 c2 c3
1 B X C 0 10 10
2 B B B 0 10 20
3 C B B 0 1 11
4 C S X 0 1 2
5 S C C 0 1 2
6 S S S 0 1 2
7 X X X 0 0 0
For example, c1 is always initialized to 0. c2 will be the point value of p1, and c3 will be the sum of c2 and the point value of p2.
In general c_i = c_{i-1} + p_{i-1}.
Is there an easy way to do this in R? Thank you in advance, as I am a relatively novice R user.

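Something like this would work. matchFun looks up the point value of each character by name, so it works whether the columns are character vectors or factors.
matchFun <- function(x) unname(c(B = 10, C = 1, S = 1, X = 0)[as.character(x)])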
within(df, {
c3 <- rowSums(sapply(list(p1, p2), matchFun))
c2 <- matchFun(p1)
c1 <- 0L
})
# p1 p2 p3 c1 c2 c3
# 1 B X C 0 10 10
# 2 B B B 0 10 20
# 3 C B B 0 1 11
# 4 C S X 0 1 2
# 5 S C C 0 1 2
# 6 S S S 0 1 2
# 7 X X X 0 0 0
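For more than three columns, the general rule c_i = c_{i-1} + p_{i-1} amounts to a row-wise cumulative sum shifted one column to the right. A minimal sketch under that reading; the names points, pts and lagged are made up for illustration:

```r
df <- data.frame(p1 = c("B", "B", "C", "C", "S", "S", "X"),
                 p2 = c("X", "B", "B", "S", "C", "S", "X"),
                 p3 = c("C", "B", "B", "X", "C", "S", "X"))

points <- c(B = 10, S = 1, C = 1, X = 0)

# numeric matrix of point values, one column per p-column
pts <- sapply(df, function(col) unname(points[as.character(col)]))

# row-wise cumulative sums, then shift right: c1 = 0, c_i = c_{i-1} + p_{i-1}
cs <- t(apply(pts, 1, cumsum))
lagged <- cbind(0, cs[, -ncol(cs), drop = FALSE])
colnames(lagged) <- paste0("c", seq_len(ncol(lagged)))

res <- cbind(df, lagged)
res
```

The last p-column never contributes to any c-column, which matches the expected output in the question.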

Compute matrix of sums

Suppose I have a data.frame with several columns of categorical data, and one column of quantitative data. Here's an example:
my_data <- structure(list(A = c("f", "f", "f", "f", "t", "t", "t", "t"),
B = c("t", "t", "t", "t", "f", "f", "f", "f"),
C = c("f","f", "t", "t", "f", "f", "t", "t"),
D = c("f", "t", "f", "t", "f", "t", "f", "t")),
.Names = c("A", "B", "C", "D"),
row.names = 1:8, class = "data.frame")
my_data$quantity <- 1:8
Now my_data looks like this:
A B C D quantity
1 f t f f 1
2 f t f t 2
3 f t t f 3
4 f t t t 4
5 t f f f 5
6 t f f t 6
7 t f t f 7
8 t f t t 8
What's the most elegant way to get a cross tab / sum of quantity where both values =='t'? That is, I'm looking for an output like this:
A B C D
A "?" "?" "?" "?"
B "?" "?" "?" "?"
C "?" "?" "?" "?"
D "?" "?" "?" "?"
..where the intersection of x/y is the sum of quantity where x=='t' and y=='t'. (I only care about half this table, really, since half is duplicated)
So for example the value of A/C should be:
good_rows <- with(my_data, A=='t' & C=='t')
sum(my_data$quantity[good_rows])
15
*Edit: What I already had was:
nodes <- names(my_data)[-ncol(my_data)]
sapply(nodes, function(rw) {
sapply(nodes, function(cl) {
good_rows <- which(my_data[, rw]=='t' & my_data[, cl]=='t')
sum(my_data[good_rows, 'quantity'])
})
})
Which gives the desired result:
A B C D
A 26 0 15 14
B 0 10 7 6
C 15 7 22 12
D 14 6 12 20
I like this solution because, being very 'literal', it's fairly readable: two apply funcs (aka loops) to go through rows * columns, compute each cell, and produce the matrix. Also plenty fast enough on my actual data (tiny: 192 rows x 10 columns). I didn't like it because it seems like a lot of lines. Thank you for the answers so far! I will review and absorb.

Try using matrix multiplication
temp <- (my_data[1:4]=="t")*my_data$quantity
t(temp) %*% (my_data[1:4]=="t")
# A B C D
#A 26 0 15 14
#B 0 10 7 6
#C 15 7 22 12
#D 14 6 12 20
(Although this might be a fluke)
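As far as I can tell it is not a fluke: entry (x, y) of the product is the sum over rows of quantity * (x == 't') * (y == 't'), which is exactly the sum the question asks for. A small check against the double sapply from the question, as a sketch (the names ind, via_matrix and via_loops are made up for illustration):

```r
my_data <- data.frame(A = rep(c("f", "t"), each = 4),
                      B = rep(c("t", "f"), each = 4),
                      C = rep(c("f", "f", "t", "t"), 2),
                      D = rep(c("f", "t"), 4),
                      quantity = 1:8)

ind <- my_data[1:4] == "t"            # 8 x 4 logical indicator matrix

# matrix route: weight each row by its quantity, then multiply
weighted <- ind * my_data$quantity
via_matrix <- t(weighted) %*% ind

# literal route: one sum per cell
nodes <- colnames(ind)
via_loops <- sapply(nodes, function(rw) {
  sapply(nodes, function(cl) sum(my_data$quantity[ind[, rw] & ind[, cl]]))
})

all(via_matrix == via_loops)
```

The result is symmetric, so either triangle gives the half of the table the question cares about.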

For each column name, you could build a data frame dat containing just the rows where that column equals t. Then multiply the TRUE/FALSE values in this subset by each row's quantity (so a cell is 0 when false and the quantity when true), and finally take the column sums.
sapply(c("A", "B", "C", "D"), function(x) {
dat <- my_data[my_data[,x] == "t",]
colSums((dat[,-5] == "t") * dat[,5])
})
# A B C D
# A 26 0 15 14
# B 0 10 7 6
# C 15 7 22 12
# D 14 6 12 20

Convert nominal results from round robin tournaments into a list of adjacency matrices

I would like to take nominal results from a round-robin tournament and convert them to a list of binary adjacency matrices.
By convention, results from these tournaments are written by recording the name of the winner. Here is code for an example table where four individuals (A,B,C,D) compete against each other:
set <- c(rep(1, 6), rep(2,6))
trial <- (1:12)
home <- c("B", "A", "C", "D", "B", "C", "D", "C", "B", "A", "A", "D")
visitor <- c("D", "C", "B", "A", "A", "D", "B", "A", "C", "D", "B", "C" )
winners.rr1 <- c("D", "A", "B", "A", "A", "D", "D", "A", "B", "D", "A", "D")
winners.rr2 <- c("D", "A", "C", "A", "A", "D", "D", "A", "C", "A", "A", "D")
winners.rr3 <- c("D", "A", "B", "A", "A", "D", "D", "A", "B", "D", "A", "D")
roundrobin <- data.frame(set=set, trial=trial, home=home, visitor=visitor,
winners.rr1=winners.rr1, winners.rr2=winners.rr2,
winners.rr3=winners.rr3)
Here's the table:
> roundrobin
set trial home visitor winners.rr1 winners.rr2 winners.rr3
1 1 1 B D D D D
2 1 2 A C A A A
3 1 3 C B B C B
4 1 4 D A A A A
5 1 5 B A A A A
6 1 6 C D D D D
7 2 7 D B D D D
8 2 8 C A A A A
9 2 9 B C B C B
10 2 10 A D D A D
11 2 11 A B A A A
12 2 12 D C D D D
This table shows the winners from three round robin tournaments. Within each tournament, there are two sets: each player competes against all others once at home, and once as a visitor. This makes for a total of 12 trials in each round robin tournament.
So, in the first trial in the first set, player D defeated player B. In the second trial of the first set, player A defeated player C, and so on.
I would like to turn these results into a list of six adjacency matrices. Each matrix is to be derived from each set within each round robin tournament. Wins are tallied on rows as "1", and losses are tallied as "0" on rows. ("Home" and "visitor" designations are irrelevant for what follows).
Here is what the adjacency matrix from Set 1 of the first round robin would look like:
> Adj.mat.set1.rr1
X A B C D
1 A NA 1 1 1
2 B 0 NA 1 0
3 C 0 0 NA 0
4 D 0 1 1 NA
And here is what Set 2 of the first round robin would look like:
> Adj.mat.set2.rr1
X A B C D
1 A NA 1 1 0
2 B 0 NA 1 0
3 C 0 0 NA 0
4 D 1 1 1 NA
The latter matrix shows, for example, that player A won 2 trials, player B won 1 trial, player C won 0 trials, and player D won 3 trials.
The trick of this manipulation is therefore to convert each win (recorded as a name) into a score of "1" in the appropriate row on the adjacency matrix, while losses are recorded as "0".
Any help is much appreciated.

Here's one way to go about it, although I imagine there must be a simpler approach, perhaps involving plyr. The following splits the data frame into subsets corresponding to set, then, for each tournament (each winners column), sets up a table of zeroes (with NA on the diagonal) to hold results, and finally sets the "winning cells" to 1 by subsetting the table with a two-column matrix of winner and loser names. The output class is set to matrix so that the results print as plain matrices.
results <- lapply(split(roundrobin, roundrobin$set), function(set) {
lapply(grep('^winners', names(set)), function(i) {
tab <- table(set$home, set$visitor)
tab[] <- 0
diag(tab) <- NA
msub <- t(apply(set, 1, function(x) {
c(x[i], setdiff(c(x['home'], x['visitor']), x[i]))
}))
tab[msub] <- 1
class(tab) <- 'matrix'
tab
})
})
Results for set 1:
> results[[1]]
[[1]]
A B C D
A NA 1 1 1
B 0 NA 1 0
C 0 0 NA 0
D 0 1 1 NA
[[2]]
A B C D
A NA 1 1 1
B 0 NA 0 0
C 0 1 NA 0
D 0 1 1 NA
[[3]]
A B C D
A NA 1 1 1
B 0 NA 1 0
C 0 0 NA 0
D 0 1 1 NA
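The core step, indexing a matrix with a two-column character matrix of (winner, loser) names, can also be pulled out into a small function. This is only a sketch for a single set of a single tournament; the function name adjacency and its arguments are made up for illustration, and it assumes every winner is either the home or the visiting player.

```r
# Hypothetical helper: one adjacency matrix from one set of results
adjacency <- function(home, visitor, winner) {
  home    <- as.character(home)
  visitor <- as.character(visitor)
  winner  <- as.character(winner)
  players <- sort(unique(c(home, visitor)))
  # the loser is whichever of the two players did not win
  loser <- ifelse(winner == home, visitor, home)
  mat <- matrix(0, length(players), length(players),
                dimnames = list(players, players))
  diag(mat) <- NA
  # rows are winners, columns are losers
  mat[cbind(winner, loser)] <- 1
  mat
}

# set 1 of the first round robin from the question
set1 <- adjacency(home    = c("B", "A", "C", "D", "B", "C"),
                  visitor = c("D", "C", "B", "A", "A", "D"),
                  winner  = c("D", "A", "B", "A", "A", "D"))
set1
```

Applying it over split(roundrobin, roundrobin$set) and over the winners columns would give the full list of six matrices.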