Concatenating difference rows with multi-column keys - r

Suppose I have a data.frame where if I take multiple columns together (say a, b, and c), then I have an identifier that is unique to two different rows (that differ on column name, and a bunch of value columns x, y, and z).
I'd like to take the difference on the value columns, preserve the key columns, and give the name column a new value like diff.
So for example, suppose I have the following data:
a b c x y z name
1 1 M J 0.0 1.0 2.0 alpha
2 1 M K 0.1 0.9 2.0 alpha
3 1 O J 0.2 0.8 2.0 alpha
4 1 O K 0.3 0.7 2.0 alpha
5 2 M J 0.4 0.6 2.0 alpha
6 2 M K 0.5 0.5 2.0 alpha
7 2 O J 0.6 0.4 2.0 alpha
8 2 O K 0.7 0.3 2.0 alpha
9 1 M J 0.0 2.0 1.0 beta
10 1 M K 0.1 1.9 3.0 beta
11 1 O J 0.2 1.8 1.0 beta
12 1 O K 0.3 1.7 3.0 beta
13 2 M J 0.4 1.6 1.0 beta
14 2 M K 0.5 1.5 3.0 beta
15 2 O J 0.6 1.4 1.0 beta
16 2 O K 0.7 1.3 3.0 beta
Then I want the new data frame to be:
a b c x y z name
1 1 M J 0.0 1.0 2.0 alpha
2 1 M K 0.1 0.9 2.0 alpha
3 1 O J 0.2 0.8 2.0 alpha
4 1 O K 0.3 0.7 2.0 alpha
5 2 M J 0.4 0.6 2.0 alpha
6 2 M K 0.5 0.5 2.0 alpha
7 2 O J 0.6 0.4 2.0 alpha
8 2 O K 0.7 0.3 2.0 alpha
9 1 M J 0.0 2.0 1.0 beta
10 1 M K 0.1 1.9 3.0 beta
11 1 O J 0.2 1.8 1.0 beta
12 1 O K 0.3 1.7 3.0 beta
13 2 M J 0.4 1.6 1.0 beta
14 2 M K 0.5 1.5 3.0 beta
15 2 O J 0.6 1.4 1.0 beta
16 2 O K 0.7 1.3 3.0 beta
17 1 M J 0.0 -1.0 1.0 diff
18 1 M K 0.0 -1.0 -1.0 diff
19 1 O J 0.0 -1.0 1.0 diff
20 1 O K 0.0 -1.0 -1.0 diff
21 2 M J 0.0 -1.0 1.0 diff
22 2 M K 0.0 -1.0 -1.0 diff
23 2 O J 0.0 -1.0 1.0 diff
24 2 O K 0.0 -1.0 -1.0 diff
What's the easiest way to accomplish this?

You could make each column individually:
colx = ave(df$x, paste(df$a, df$b, df$c), FUN=function(x) x[1]-x[2])
coly = ave(df$y, paste(df$a, df$b, df$c), FUN=function(x) x[1]-x[2])
colz = ave(df$z, paste(df$a, df$b, df$c), FUN=function(x) x[1]-x[2])
And then put them together:
df2 = subset(df, name=="alpha")
df2$name = "diff"
df2$x = colx[1:(length(colx)/2)]
df2$y = coly[1:(length(coly)/2)]
df2$z = colz[1:(length(colz)/2)]
Now join to the original:
df = rbind(df, df2)
That gives:
a b c x y z name
1 1 m j 0.0 1.0 2 a
2 1 m k 0.1 0.9 2 a
3 1 o j 0.2 0.8 2 a
4 1 o k 0.3 0.7 2 a
5 2 m j 0.4 0.6 2 a
6 2 m k 0.5 0.5 2 a
7 2 o j 0.6 0.4 2 a
8 2 o k 0.7 0.3 2 a
9 1 m j 0.0 2.0 1 b
10 1 m k 0.1 1.9 3 b
11 1 o j 0.2 1.8 1 b
12 1 o k 0.3 1.7 3 b
13 2 m j 0.4 1.6 1 b
14 2 m k 0.5 1.5 3 b
15 2 o j 0.6 1.4 1 b
16 2 o k 0.7 1.3 3 b
17 1 m j 0.0 -1.0 1 diff
18 1 m k 0.0 -1.0 -1 diff
19 1 o j 0.0 -1.0 1 diff
20 1 o k 0.0 -1.0 -1 diff
21 2 m j 0.0 -1.0 1 diff
22 2 m k 0.0 -1.0 -1 diff
23 2 o j 0.0 -1.0 1 diff
24 2 o k 0.0 -1.0 -1 diff

If your data frame is always sorted and balanced, then this should work:
half<-1:(nrow(df)/2)
rbind(
df,
cbind(
df[half, 1:3],
df[half, 4:6] - df[half+half[length(half)], 4:6],
name="diff"
)
)
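Both answers above depend on the rows being in a fixed order. A minimal sketch of an order-independent variant, assuming name only takes the values alpha and beta and each key (a, b, c) appears exactly once per name: merge the two halves on the key columns and subtract. The helper names key_cols, val_cols, and diff_rows are made up for illustration.

```r
# Rebuild the example data from the question
keys <- expand.grid(c = c("J", "K"), b = c("M", "O"), a = 1:2,
                    stringsAsFactors = FALSE)[, c("a", "b", "c")]
x <- seq(0, 0.7, by = 0.1)
alpha <- cbind(keys, x = x, y = 1 - x, z = 2, name = "alpha")
beta  <- cbind(keys, x = x, y = 2 - x,
               z = ifelse(keys$c == "J", 1, 3), name = "beta")
df <- rbind(alpha, beta)

key_cols <- c("a", "b", "c")   # columns that identify a pair of rows
val_cols <- c("x", "y", "z")   # columns to subtract

# Pair each alpha row with the beta row that has the same key
m <- merge(df[df$name == "alpha", c(key_cols, val_cols)],
           df[df$name == "beta",  c(key_cols, val_cols)],
           by = key_cols, suffixes = c(".alpha", ".beta"))

# alpha minus beta for every value column, keeping the keys
diff_rows <- m[key_cols]
diff_rows[val_cols] <- m[paste0(val_cols, ".alpha")] -
                       m[paste0(val_cols, ".beta")]
diff_rows$name <- "diff"

result <- rbind(df, diff_rows)
```

Because the pairing is done by merge, this sketch should not care whether the alpha and beta rows are sorted the same way, at the cost of one extra intermediate data frame.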

Related

Create lower triangle genetic distance matrix

I have distance matrix like this
1 2 3 4 5
A 0.1 0.2 0.3 0.5 0.6
B 0.7 0.8 0.9 1 1.1
C 1.2 1.3 1.4 1.5 1.6
D 1.7 1.8 1.9 2 2.1
E 2.2 2.3 2.4 2.5 2.6
and now I want to create lower triangle matrix like this
1 2 3 4 5 A B C D E
1 0
2 0.1 0
3 0.2 0.1 0
4 0.4 0.3 0.2 0
5 0.5 0.4 0.3 0.1 0
A 0.1 0.2 0.3 0.5 0.6 0
B 0.7 0.8 0.9 1 1.1 0.6 0
C 1.2 1.3 1.4 1.5 1.6 1.1 0.5 0
D 1.7 1.8 1.9 2 2.1 1.6 1 0.5 0
E 2.2 2.3 2.4 2.5 2.6 2.1 1.5 1 0.5 0
I subtracted the distance for 1 from the distance for 2 in the first table to get the genetic distance between 1 and 2 (0.2 - 0.1 = 0.1), and did the same for the rest of the entries. I do not know whether this approach is correct. After these calculations I built the lower triangle matrix by hand. I tried this in R:
x <- read.csv("AD2.csv", head = FALSE, sep = ",")
b<-lower.tri(b, diag = FALSE)
but I get only TRUE and FALSE as output, not a distance matrix.
Can anyone help me solve this problem? Here is a link to my example data.
You can make use of dist to calculate sub-matrices. Then use cbind and create the top and bottom half. Then rbind the 2 halves. Then set upper triangular to NA to create the desired output.
mat <- rbind(
cbind(as.matrix(dist(tbl[1,])), t(tbl)),
cbind(tbl, as.matrix(dist(tbl[,1])))
)
mat[upper.tri(mat, diag=FALSE)] <- NA
mat
Hope it helps.
data:
tbl <- as.matrix(read.table(text="1 2 3 4 5
A 0.1 0.2 0.3 0.5 0.6
B 0.7 0.8 0.9 1 1.1
C 1.2 1.3 1.4 1.5 1.6
D 1.7 1.8 1.9 2 2.1
E 2.2 2.3 2.4 2.5 2.6", header=TRUE, check.names=FALSE, row.names=1))
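As for why the attempt in the question printed only TRUE and FALSE: lower.tri does not extract values, it returns a logical mask with the same shape as its input. A small sketch on a made-up 3 x 3 matrix (m and mask are illustrative names) shows how such a mask is normally used for indexing:

```r
m <- matrix(1:9, nrow = 3)          # filled column by column

mask <- lower.tri(m, diag = FALSE)  # logical matrix, TRUE below the diagonal
below <- m[mask]                    # values below the diagonal: 2 3 6

# Blank out the upper triangle, as in the answer above
m_lower <- m
m_lower[upper.tri(m_lower, diag = FALSE)] <- NA
```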

Multiply values depending on values of certains columns

I have two tables, df and cf. I want to multiply each value of A in df by a coefficient from cf, chosen by the values of B and C in the same row of df.
For example,
in row 1 of df, A = 20, B = 4 and C = 2, so the correct coefficient is 0.3 and
the result is 20 * 0.3 = 6.
Is there a simple way to do that in R?
Thanks in advance!!
df
A B C
20 4 2
30 4 5
35 2 2
24 3 3
43 2 1
cf
C
B/C 1 2 3 4 5
1 0.2 0.3 0.5 0.6 0.7
2 0.1 0.5 0.3 0.3 0.4
3 0.9 0.1 0.6 0.6 0.8
4 0.7 0.3 0.7 0.4 0.6
One solution with apply:
#iterate over df's rows
apply(df, 1, function(x) {
x[1] * cf[x[2], x[3]]
})
#[1] 6.0 18.0 17.5 14.4 4.3
Try this vectorized:
df[,1] * cf[as.matrix(df[,2:3])]
#[1] 6.0 18.0 17.5 14.4 4.3
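The vectorized line works because indexing with a two-column matrix picks one element per index row, reading the first column as the row number and the second as the column number. A self-contained sketch, assuming cf is stored as a plain numeric matrix (idx and coef are illustrative names):

```r
df <- data.frame(A = c(20, 30, 35, 24, 43),
                 B = c(4, 4, 2, 3, 2),
                 C = c(2, 5, 2, 3, 1))
cf <- matrix(c(0.2, 0.3, 0.5, 0.6, 0.7,
               0.1, 0.5, 0.3, 0.3, 0.4,
               0.9, 0.1, 0.6, 0.6, 0.8,
               0.7, 0.3, 0.7, 0.4, 0.6),
             nrow = 4, byrow = TRUE)

idx <- cbind(df$B, df$C)   # one (row, column) pair per row of df
coef <- cf[idx]            # picks cf[4, 2], cf[4, 5], cf[2, 2], ...
result <- df$A * coef
result
# 6.0 18.0 17.5 14.4 4.3
```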
A solution using dplyr and a vectorised function:
df = read.table(text = "
A B C
20 4 2
30 4 5
35 2 2
24 3 3
43 2 1
", header=T, stringsAsFactors=F)
cf = read.table(text = "
0.2 0.3 0.5 0.6 0.7
0.1 0.5 0.3 0.3 0.4
0.9 0.1 0.6 0.6 0.8
0.7 0.3 0.7 0.4 0.6
")
library(dplyr)
# function to get the correct element of cf
# vectorised version
f = function(x,y) cf[x,y]
f = Vectorize(f)
df %>%
mutate(val = f(B,C),
result = val * A)
# A B C val result
# 1 20 4 2 0.3 6.0
# 2 30 4 5 0.6 18.0
# 3 35 2 2 0.5 17.5
# 4 24 3 3 0.6 14.4
# 5 43 2 1 0.1 4.3
The final dataset has both result and val in order to check which value from cf was used each time.

Expanding rows of data

I have a problem expanding the rows of my data frame. I tried expand from tidyr inside a dplyr chain. The function does expand the data, but it changes the order of the expanded values, which I do not want. I want to keep the order of the sp column after expanding.
Here is my attempt
df <- data.frame(label1=letters[1:6],label2=letters[7:12])
sp <- c(-1,0,seq(0.1,0.5,0.1),seq(-2,-2.5,-0.1),seq(0.1,0.5,0.1))
sp
# [1] -1.0 0.0 0.1 0.2 0.3 0.4 0.5 -2.0 -2.1 -2.2 -2.3 -2.4 -2.5 0.1 0.2 0.3 0.4 0.5
library(dplyr)
library(tidyr)
expanded <- df%>%
expand(df,sp)
> head(expanded)
label1 label2 sp
1 a g -2.5
2 a g -2.4
3 a g -2.3
4 a g -2.2
5 a g -2.1
6 a g -2.0
I want to expand df in the order given by sp. How can we do that?
Expected output:
label1 label2 sp
1 a g -1.0
2 a g 0.0
3 a g 0.1
4 a g 0.2
5 a g 0.3
6 a g 0.4
7 a g 0.5
8 a g -2
9 a g -2.1
10 a g -2.2
11 a g -2.3
12 a g -2.4
13 a g -2.5
14 b h -1.0
15 b h 0.0
16 b h 0.1
and so on
We can match the column 'sp' with the vector sp in the global environment to do the ordering
r1 <- df %>%
expand(df, sp) %>%
arrange(label1, label2, match(sp, unique(.GlobalEnv$sp)))
dim(r1)
#[1] 78 3
identical(unique(r1$sp), unique(sp))
#[1] TRUE
Update
If there are duplicates in the 'sp' vector and we want to expand on all the values, one option is to do the expansion on the sequence of the vector and later change the values
r2 <- df %>%
expand(df, sp=seq_along(sp)) %>%
mutate(sp = .GlobalEnv$sp[sp])
dim(r2)
#[1] 108 3
head(r2, length(sp))
# label1 label2 sp
# 1 a g -1.0
# 2 a g 0.0
# 3 a g 0.1
# 4 a g 0.2
# 5 a g 0.3
# 6 a g 0.4
# 7 a g 0.5
# 8 a g -2.0
# 9 a g -2.1
# 10 a g -2.2
# 11 a g -2.3
# 12 a g -2.4
# 13 a g -2.5
# 14 a g 0.1
# 15 a g 0.2
# 16 a g 0.3
# 17 a g 0.4
# 18 a g 0.5
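If pulling in tidyr is not required, the same order-preserving expansion can be sketched in base R by repeating row indices. This is only an alternative under the assumption that every row of df should be paired with every element of sp, duplicates included; expanded is an illustrative name.

```r
df <- data.frame(label1 = letters[1:6], label2 = letters[7:12])
sp <- c(-1, 0, seq(0.1, 0.5, 0.1), seq(-2, -2.5, -0.1), seq(0.1, 0.5, 0.1))

# Repeat each row of df once per element of sp, then attach sp in its
# original order for every repeated block
row_idx <- rep(seq_len(nrow(df)), each = length(sp))
expanded <- cbind(df[row_idx, ], sp = rep(sp, times = nrow(df)))
rownames(expanded) <- NULL   # drop the 1, 1.1, 1.2, ... row names

dim(expanded)
# 108 3
```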

How to reset row names?

Here is a sample data set:
sample1 <- data.frame(Names=letters[1:10], Values=sample(seq(0.1,1,0.1)))
When I reorder the data set, the row names are no longer in sequence:
sample1[order(sample1$Values), ]
Names Values
7 g 0.1
4 d 0.2
3 c 0.3
9 i 0.4
10 j 0.5
5 e 0.6
8 h 0.7
6 f 0.8
1 a 0.9
2 b 1.0
Desired output:
Names Values
1 g 0.1
2 d 0.2
3 c 0.3
4 i 0.4
5 j 0.5
6 e 0.6
7 h 0.7
8 f 0.8
9 a 0.9
10 b 1.0
Try (assuming the reordered data frame is stored as Ordersample2)
rownames(Ordersample2) <- 1:10
or more generally
rownames(Ordersample2) <- NULL
I had a dplyr use case:
df %>% as.data.frame(row.names = 1:nrow(.))
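Putting the question and the first answer together, a minimal end-to-end sketch might look like this. The set.seed call and the name ordered are additions for illustration only, so the permutation will differ from the one shown in the question.

```r
set.seed(42)  # only to make the shuffle reproducible
sample1 <- data.frame(Names = letters[1:10],
                      Values = sample(seq(0.1, 1, 0.1)))

# Reordering keeps the old row names attached to each row
ordered <- sample1[order(sample1$Values), ]

# Assigning NULL resets them to the default sequence 1, 2, ..., n
rownames(ordered) <- NULL
ordered
```

Assigning NULL avoids hard-coding the number of rows, which is why it is the more general of the two forms.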

Merge data frames based on rownames in R

How can I merge the columns of two data frames, containing a distinct set of columns but some rows with the same names? The fields for rows that don't occur in both data frames should be filled with zeros:
> d
a b c d e f g h i j
1 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10
2 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
> e
k l m n o p q r s t
1 11 12 13 14 15 16 17 18 19 20
3 21 22 23 24 25 26 27 28 29 30
> de
a b c d e f g h i j k l m n o p q r s t
1 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10 11 12 13 14 15 16 17 18 19 20
2 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0 0 0 0 0 0 0 0 0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 21 22 23 24 25 26 27 28 29 30
See ?merge:
the name "row.names" or the number 0 specifies the row names.
Example:
R> de <- merge(d, e, by=0, all=TRUE) # merge by row names (by=0 or by="row.names")
R> de[is.na(de)] <- 0 # replace NA values
R> de
Row.names a b c d e f g h i j k l m n o p q r s
1 1 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10 11 12 13 14 15 16 17 18 19
2 2 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0 0 0 0 0 0 0 0
3 3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 21 22 23 24 25 26 27 28 29
t
1 20
2 0
3 30
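The merged result carries the keys in a Row.names column rather than as actual row names. If the layout from the question is wanted, one possible follow-up is to move that column back, sketched here on a cut-down version of d and e with two columns each (the shortened data is an assumption made for brevity):

```r
d <- data.frame(a = c(1, 0.1), b = c(2, 0.2), row.names = c("1", "2"))
e <- data.frame(k = c(11, 21), l = c(12, 22), row.names = c("1", "3"))

# Full outer join on the row names
de <- merge(d, e, by = "row.names", all = TRUE)

# Restore the keys as row names and drop the helper column
rownames(de) <- de$Row.names
de$Row.names <- NULL

# Rows missing from one side come back as NA; fill them with zeros
de[is.na(de)] <- 0
de
```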
