R: Applying a function on data frame columns defined in another table - r

I have a dataframe (df) that looks like this:
A B C D E F G H
a 1 2 3 4 5 3 4 2
b 3 4 5 5 4 5 5 4
c 1 4 6 7 9 6 7 4
d 2 4 5 7 8 5 7 4
e 2 2 4 5 7 4 5 2
I would like to compute the ratio of different columns, based on a table like this, that indicates which column needs to be the numerator and denominator:
num denom
A E
B G
F C
H D
So for example the first column in the output will be column A divided by column E.
The output would look like this (I want to keep the column names of the numerators):
A B F H
a 0.2 0.5 1 0.5
b 0.75 0.8 1 0.8
c 0.111111111 0.571428571 1 0.571428571
d 0.25 0.571428571 1 0.571428571
e 0.285714286 0.4 1 0.4
I hope I am making this understandable, here is the dput of the original table:
structure(list(A = c(1L, 3L, 1L, 2L, 2L), B = c(2L, 4L, 4L, 4L,
2L), C = c(3L, 5L, 6L, 5L, 4L), D = c(4L, 5L, 7L, 7L, 5L), E = c(5L,
4L, 9L, 8L, 7L), F = c(3L, 5L, 6L, 5L, 4L), G = c(4L, 5L, 7L,
7L, 5L), H = c(2L, 4L, 4L, 4L, 2L)), .Names = c("A", "B", "C",
"D", "E", "F", "G", "H"), class = "data.frame", row.names = c("a",
"b", "c", "d", "e"))
Any help will be much appreciated! Thanks

Even easier with dplyr::transmute()
library(dplyr)
df %>% transmute(A=A/E, B=B/G, F=F/C, H=H/D)
# A B F H
#1 0.2000000 0.5000000 1 0.5000000
#2 0.7500000 0.8000000 1 0.8000000
#3 0.1111111 0.5714286 1 0.5714286
#4 0.2500000 0.5714286 1 0.5714286
#5 0.2857143 0.4000000 1 0.4000000

Here we go:
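mdf <- data.frame(num=c("A", "B", "F", "H"), denom=c("E", "G", "C", "D"), stringsAsFactors = FALSE)
df_num <- df[, mdf$num]
df_denom <- df[, mdf$denom]
df_num/df_denom
The stringsAsFactors = FALSE is important: before R 4.0.0, data.frame() converts strings to factors by default, and indexing df with a factor uses its integer codes rather than the column names.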
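The same table-driven idea can also be written pair by pair with Map(). This is only a sketch on a cut-down copy of the question's data (three rows, two ratio pairs); the name ratios is made up for the illustration.

```r
# Toy data: a subset of the question's columns and rows
df <- data.frame(A = c(1, 3, 1), E = c(5, 4, 9),
                 B = c(2, 4, 4), G = c(4, 5, 7),
                 row.names = c("a", "b", "c"))

# Lookup table: which column is divided by which
mdf <- data.frame(num = c("A", "B"), denom = c("E", "G"),
                  stringsAsFactors = FALSE)

# Divide each numerator column by its paired denominator column.
# Map() names the results after its first character argument,
# so the output keeps the numerator names.
ratios <- as.data.frame(
  Map(function(n, d) df[[n]] / df[[d]], mdf$num, mdf$denom),
  row.names = rownames(df)
)
ratios
```

Because each pair is handled separately, this form also makes it easy to swap the division for any other two-column function.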

Related

How to move two specific rows to top of dataframe?

Below I have a DF.
A B C D
a 4 2 2
g 5 2 2
d 7 65 7
e 3 6 7
I would like to make this DF so that column A has "g" in the first row, and "d" in the second row. I would like to do this by calling the value in column A (rather than an index). How can I do this?
Ideal output
A B C D
g 5 2 2
d 7 65 7
a 4 2 2
e 3 6 7
We may convert to a factor with the levels specified in the desired order before arranging
library(forcats)
library(dplyr)
DF %>%
arrange(fct_relevel(A, 'g', 'd'))
A B C D
1 g 5 2 2
2 d 7 65 7
3 a 4 2 2
4 e 3 6 7
With fct_relevel, we can specify the order of specific levels without specifying the rest of the levels
> with(DF, fct_relevel(A, 'g', 'd'))
[1] a g d e
Levels: g d a e
data
DF <- structure(list(A = c("a", "g", "d", "e"), B = c(4L, 5L, 7L, 3L
), C = c(2L, 2L, 65L, 6L), D = c(2L, 2L, 7L, 7L)), class = "data.frame",
row.names = c(NA,
-4L))
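For readers without forcats, roughly the same effect can be had in base R by building the level order by hand. This is a sketch rather than a drop-in replacement: the names top, lvls and reordered are invented here, and the remaining values are kept in order of appearance, whereas fct_relevel keeps the remaining levels in their existing level order.

```r
DF <- data.frame(A = c("a", "g", "d", "e"), B = c(4L, 5L, 7L, 3L),
                 stringsAsFactors = FALSE)

# Values that should come first, in the order wanted
top <- c("g", "d")

# Requested values first, then every other value in order of appearance
lvls <- c(top, setdiff(unique(DF$A), top))

# order() on a factor sorts by level position, not alphabetically
reordered <- DF[order(factor(DF$A, levels = lvls)), ]
reordered
```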
Another possible solution:
library(dplyr)
df <- data.frame(
stringsAsFactors = FALSE,
A = c("a", "g", "d", "e"),
B = c(4L, 5L, 7L, 3L),
C = c(2L, 2L, 65L, 6L),
D = c(2L, 2L, 7L, 7L)
)
df %>% arrange(match(A, c("g", "d", setdiff(A, c("g", "d")))))
#> A B C D
#> 1 g 5 2 2
#> 2 d 7 65 7
#> 3 a 4 2 2
#> 4 e 3 6 7
Try the code below
with(
df,
df[match(c("g","d",A[!A%in%c("g","d")]),A),]
)
and you will see
A B C D
2 g 5 2 2
3 d 7 65 7
1 a 4 2 2
4 e 3 6 7
Just to add a base R solution if you are not interested in external packages, you can specify the row order directly:
# Sample Data
DF <- structure(list(A = c("a", "g", "d", "e"), B = c(4L, 5L, 7L, 3L
), C = c(2L, 2L, 65L, 6L), D = c(2L, 2L, 7L, 7L)), class = "data.frame",
row.names = c(NA, -4L))
Hard-coded for this example:
DF2 <- DF[c(2,3,1,4),]
A more generalizable version:
# Rows holding the desired values, in the requested order
rownums <- match(c("g", "d"), DF$A)
# All other rows, in their original order
otherrows <- setdiff(seq_len(nrow(DF)), rownums)
# Reorder
DF2 <- DF[c(rownums, otherrows), ]

Merge two matrix by column names and row names

I want to merge two matrices according to their column names and row names.
The values in both matrices are numeric. Where a cell (row name and column name) appears in both matrices, the merged value should be the average of the two; otherwise the single available value is kept.
matrix1:
A B C
x 1 4 3
z 5 2 4
k 1 2 3
and matrix2:
A B C D
x 6 4 1 2
y 2 3 1 3
z 1 4 1 4
k 7 5 3 1
so the output will be:
A B C D
x 3.5 4 2 2
y 2 3 1 3
z 3 3 2.5 4
k 4 3.5 3 1
My idea is to use for loop or apply function, but if the matrices are big, then this program will run for a long time. Any advice? Thank you!
You can use rownames and colnames to subset matrix2 and update only part of it.
matrix2[rownames(matrix1), colnames(matrix1)] <- (matrix1 + matrix2[rownames(matrix1), colnames(matrix1)])/2
matrix2
# A B C D
#x 3.5 4.0 2.0 2
#y 2.0 3.0 1.0 3
#z 3.0 3.0 2.5 4
#k 4.0 3.5 3.0 1
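The one-liner above relies on every row and column of matrix1 also existing in matrix2. If that cannot be assumed, one possible generalization is to accumulate sums and counts over the union of names. The function name average_matrices is invented for this sketch, and cells present in neither matrix come out as NaN.

```r
matrix1 <- matrix(c(1, 5, 1, 4, 2, 2, 3, 4, 3), nrow = 3,
                  dimnames = list(c("x", "z", "k"), c("A", "B", "C")))
matrix2 <- matrix(c(6, 2, 1, 7, 4, 3, 4, 5, 1, 1, 1, 3, 2, 3, 4, 1), nrow = 4,
                  dimnames = list(c("x", "y", "z", "k"), c("A", "B", "C", "D")))

# Hypothetical helper: average two named matrices cell by cell
average_matrices <- function(m1, m2) {
  rows <- union(rownames(m1), rownames(m2))
  cols <- union(colnames(m1), colnames(m2))
  total <- matrix(0, length(rows), length(cols), dimnames = list(rows, cols))
  count <- total
  for (m in list(m1, m2)) {
    r <- rownames(m)
    k <- colnames(m)
    total[r, k] <- total[r, k] + m   # running sum per cell
    count[r, k] <- count[r, k] + 1   # how many matrices contain the cell
  }
  total / count
}

# Reorder to match matrix2 so the result is easy to compare
avg <- average_matrices(matrix1, matrix2)[rownames(matrix2), colnames(matrix2)]
avg
```

Everything is done with vectorized sub-matrix assignment, so there is no loop over individual cells.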
data
matrix1 <- structure(c(1L, 5L, 1L, 4L, 2L, 2L, 3L, 4L, 3L), .Dim = c(3L,
3L), .Dimnames = list(c("x", "z", "k"), c("A", "B", "C")))
matrix2 <- structure(c(6L, 2L, 1L, 7L, 4L, 3L, 4L, 5L, 1L, 1L, 1L, 3L, 2L,
3L, 4L, 1L), .Dim = c(4L, 4L), .Dimnames = list(c("x", "y", "z",
"k"), c("A", "B", "C", "D")))

Compute mean pairwise covariance between elements in a list

I have the following data frames:
# df1
id cg_v
1 a
2 b
3 a b
4 b c
5 b c d
6 d
# df2
id cg
1 a
2 b
3 a
3 b
4 b
4 c
5 b
5 c
5 d
6 d
I need to add a column to df1 that contains the mean covariance computed across each pair of elements in cg_v. If cg_v contains only one element, then I would like the new column to contain its variance.
I can get a covariance matrix by cov(crossprod(table(df2)))
# a b c d
a 0.9166667 0.0000000 -0.5833333 -0.6666667
b 0.0000000 2.0000000 1.0000000 0.0000000
c -0.5833333 1.0000000 0.9166667 0.3333333
d -0.6666667 0.0000000 0.3333333 0.6666667
What do I do from here?
The end result should be like this:
# df1
id cg_v cg_cov
1 a 0.9166667
2 b 2.0000000
3 a b 0.0000000
4 b c 1.0000000
5 b c d 0.4444444 # This is equal to (1.0000000 + 0.3333333 + 0.0000000)/3
6 d 0.6666667
Code to generate df1 and df2:
df1 <- structure(list(id = c(1L, 2L, 3L, 4L, 5L, 6L),
cg_v = c("a", "b", "a b", "b c", "b c d", "d")),
.Names = c("id", "cg_v"),
class = "data.frame", row.names = c(NA, -6L))
df2 <- structure(list(id = c(1L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 5L, 6L),
cg = c("a", "b", "a", "b", "b", "c", "b", "c", "d", "d")),
.Names = c("id", "cg"),
class = "data.frame", row.names = c(NA, -10L))
I think I found a solution for this problem using data.table and reshape. What do you want to do when there are three letters, as in b c d? I assumed that you want the covariance of the first two letters:
require(reshape)
require(data.table)
dt1 <- data.table(id = c(1L, 2L, 3L, 4L, 5L, 6L),
cg_v = c("a", "b", "a b", "b c", "b c d", "d"))
dt2 <- data.table(id = c(1L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 5L, 6L),
cg = c("a", "b", "a", "b", "b", "c", "b", "c", "d", "d"))
cov_dt <- data.table(melt(cov(crossprod(table(dt2)))))
dt1 <- cbind(dt1, t(sapply(strsplit(as.character(dt1$cg_v), " "), function(x)x[1:2])))
# replace the NA with the first column
dt1[is.na(V2), V2 := V1]
# Merge them on two columns
setkey(dt1, "V1", "V2")
setkey(cov_dt, "X1", "X2")
result <- cov_dt[dt1]
> result[,.(id, cg_v, value)]
id cg_v value
1: 1 a 0.9166667
2: 3 a b 0.0000000
3: 2 b 2.0000000
4: 4 b c 1.0000000
5: 5 b c d 1.0000000
6: 6 d 0.6666667
Variant which also works if there are more than 2 letters (not the most efficient code):
require(reshape)
require(combinat)
df1 <- data.frame(id = c(1L, 2L, 3L, 4L, 5L, 6L),
cg_v = c("a", "b", "a b", "b c", "b c d", "d"))
df2 <- data.frame(id = c(1L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 5L, 6L),
cg = c("a", "b", "a", "b", "b", "c", "b", "c", "d", "d"))
cov_dt <- cov(crossprod(table(df2)))
mat <- sapply(strsplit(as.character(df1$cg_v), " "), function(x) if(length(x) == 1){c(x,x)} else(x))
# Every element should now have a length of at least 2
sapply(mat, length) > 1
mat <- sapply(mat, function(x) matrix(combn(x,2), nrow = 2))
df1$cg_cov <- sapply(mat, function(x) mean(apply(x,2, function(x) cov_dt[x[1],x[2]])))
> df1
id cg_v cg_cov
1 1 a 0.9166667
2 2 b 2.0000000
3 3 a b 0.0000000
4 4 b c 1.0000000
5 5 b c d 0.4444444
6 6 d 0.6666667
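The same calculation can be sketched in base R only, since combn() also ships with the utils package. The helper name mean_pair_cov is invented here; it indexes the covariance matrix with a two-column character matrix of label pairs.

```r
df1 <- data.frame(id = 1:6,
                  cg_v = c("a", "b", "a b", "b c", "b c d", "d"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(id = c(1L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 5L, 6L),
                  cg = c("a", "b", "a", "b", "b", "c", "b", "c", "d", "d"),
                  stringsAsFactors = FALSE)

# Covariance matrix as computed in the question
cov_mat <- cov(crossprod(table(df2)))

# Hypothetical helper: variance for one label, mean pairwise covariance otherwise
mean_pair_cov <- function(labels, cov_mat) {
  if (length(labels) == 1L) return(cov_mat[labels, labels])
  pairs <- combn(labels, 2)                      # one column per pair
  mean(cov_mat[cbind(pairs[1, ], pairs[2, ])])   # look up each pair by name
}

df1$cg_cov <- vapply(strsplit(df1$cg_v, " "), mean_pair_cov,
                     numeric(1), cov_mat = cov_mat)
df1
```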

How to collapse session path data into from-to paths for visualizing network data?

What are some ways to transform session path data such as this:
df
# Session Link1 Link2 Link3 Link4 Link5
# 1 1 A B
# 2 2 C
# 3 3 D A B
# 4 4 C F G H J
# 5 5 A B C
Into a data set that looks like this:
desired
# Session From To
# 1 1 A B
# 2 2 C <NA>
# 3 3 D A
# 4 3 A B
# 5 4 C F
# 6 4 F G
# 7 4 G H
# 8 4 H J
# 9 5 A B
# 10 5 B C
Data for reproducibility:
df <- structure(list(Session = 1:5, Link1 = structure(c(1L, 2L, 3L, 2L, 1L), .Label = c("A", "C", "D"), class = "factor"), Link2 = structure(c(3L, 1L, 2L, 4L, 3L), .Label = c("", "A", "B", "F"), class = "factor"), Link3 = structure(c(1L, 1L, 2L, 4L, 3L), .Label = c("", "B", "C", "G"), class = "factor"), Link4 = structure(c(1L, 1L, 1L, 2L, 1L), .Label = c("", "H"), class = "factor"), Link5 = structure(c(1L, 1L, 1L, 2L, 1L), .Label = c("", "J"), class = "factor")), .Names = c("Session", "Link1", "Link2", "Link3", "Link4", "Link5"), class = "data.frame", row.names = c(NA, -5L))
desired <- structure(list(Session = c(1L, 2L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L), From = structure(c(1L, 3L, 4L, 1L, 3L, 5L, 6L, 7L, 1L, 2L), .Label = c("A", "B", "C", "D", "F", "G", "H"), class = "factor"), To = structure(c(2L, NA, 1L, 2L, 4L, 5L, 6L, 7L, 2L, 3L), .Label = c("A", "B", "C", "F", "G", "H", "J"), class = "factor")), .Names = c("Session", "From", "To"), class = "data.frame", row.names = c(NA, -10L))
We could use data.table. Convert the 'data.frame' to a 'data.table' with setDT(df), then reshape from 'wide' to 'long' format with melt, specifying id.var as 'Session'. Drop the 'value' elements that are empty (value != ''). Next, grouped by 'Session', append NA to the 'value' column for any 'Session' that has only a single row (the if...else). Finally, create two columns ('From' and 'To') by removing the last and the first element of 'V1' respectively, again grouped by 'Session'.
library(data.table)#v1.9.5+
melt(setDT(df), id.var='Session')[value!=''][,
if(.N==1L) c(value, NA) else value, by = Session][,
list(From=V1[-.N], To=V1[-1L]), by = Session]
# Session From To
#1: 1 A B
#2: 2 C NA
#3: 3 D A
#4: 3 A B
#5: 4 C F
#6: 4 F G
#7: 4 G H
#8: 4 H J
#9: 5 A B
#10: 5 B C
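The above could be simplified to a single block after the melt step. Note that tmp[-.N] does not work here: .N is the number of rows in the group, not length(tmp), so for a single-row session (where NA has been appended to tmp) it would drop the only real value. So I used tmp[1:(.N-1)].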
melt(setDT(df), id.var= 'Session')[value!='', {
tmp <- if(.N==1L) c(value, NA) else value
list(From= tmp[1:(.N-1)], To= tmp[-1L]) }, by = Session]
# Session From To
#1: 1 A B
#2: 2 C NA
#3: 3 D A
#4: 3 A B
#5: 4 C F
#6: 4 F G
#7: 4 G H
#8: 4 H J
#9: 5 A B
#10: 5 B C
Inspired by @akrun, this is my personal stab at the problem. Granted, the results are tweaked to include the terminal from-to path for each pair:
library(dplyr)
library(tidyr)
gather(df, "Link_Num", "Value", -Session) %>%
group_by(Session) %>%
mutate(to = Value,
from = lag(to)) %>%
filter(Link_Num != "Link1" &
from != "") %>%
select(Session, from, to, Link_Num) %>%
arrange(Session)
Which yields:
Session from to Link_Num
1 1 A B Link2
2 1 B Link3
3 2 C Link2
4 3 D A Link2
5 3 A B Link3
6 3 B Link4
7 4 C F Link2
8 4 F G Link3
9 4 G H Link4
10 4 H J Link5
11 5 A B Link2
12 5 B C Link3
13 5 C Link4
Another approach, using melt from reshape2 and lead from dplyr:
library(dplyr)
library(reshape2)
df$spacer <- ""
df %>% melt(id.var = "Session") %>%
arrange(Session) %>%
mutate(To = lead(value)) %>%
filter(To !="" & value !="" | To =="" & variable =="Link1") %>%
mutate(To = ifelse(To == "", NA, To)) %>% select(-variable)
# Session value To
# 1 1 A B
# 2 2 C <NA>
# 3 3 D A
# 4 3 A B
# 5 4 C F
# 6 4 F G
# 7 4 G H
# 8 4 H J
# 9 5 A B
# 10 5 B C
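For comparison, the from-to pairs can also be built row by row in base R. This is a minimal sketch on a three-session toy version of the data; the function name session_paths is invented, and it assumes the first column is the session id and that empty links are stored as "".

```r
df <- data.frame(Session = 1:3,
                 Link1 = c("A", "C", "D"),
                 Link2 = c("B", "", "A"),
                 Link3 = c("", "", "B"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: one From/To row per consecutive pair of links
session_paths <- function(data) {
  links <- as.matrix(data[, -1, drop = FALSE])   # character matrix of links
  pieces <- lapply(seq_len(nrow(data)), function(i) {
    steps <- unname(links[i, ])
    steps <- steps[steps != ""]                  # drop empty links
    n <- length(steps)
    if (n == 0L) return(NULL)                    # session without any link
    from <- if (n == 1L) steps else steps[-n]
    to <- if (n == 1L) NA_character_ else steps[-1]
    data.frame(Session = data$Session[i], From = from, To = to,
               stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

paths <- session_paths(df)
paths
```

A session with a single link yields one row with NA in To, matching the desired output in the question.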

R- How to merge multiple dataframes of different lengths?

I have been stuck with this issue for a while now. Need some help.
I am reading the following files (there can be more than 3 files) into a dataframe.
My input files look like the following:
file1:
someName someMOD someID
A T754(P),M691(O),S692(P),S694(P),S739(P),S740(P),S759(P),S762(P) 1
B S495(P) 2
C S162(P),Q159(D) 3
D S45(P),C47(C),S48(P),S26(P) 4
E S18(P) 5
file2:
someName someMOD someID
C S162(P),Q159(D) 3
D S45(P),C47(C),S48(P),S26(P) 4
F S182(P) 6
E S18(P) 5
Z Q100(P) 9
A T754(P),M691(O),S694(P),S739(P),S740(P) 1
file3:
someName someMOD someID
A T754(P),M691(O),S692(P),S694(P),S739(P),S740(P),S759(P) 1
B S495(P) 2
D S45(P),C47(C),S48(P),S26(P) 4
E S18(P) 5
F S182(P) 6
L Z182(P) 8
C S162(P),Q159(D) 3
My Code:
# read.xls() is assumed to come from the gdata package
library(gdata)
fileList <- dir(pattern="*.xls")
i <- 1
j <- 1
a <- list()
mybigtable <- data.frame()
for (f in seq_along(fileList)) {
  fileName <- fileList[f]
  X <- read.xls(fileName)
  if (regexpr("Drug_Rep", fileName)[1] > 0) {
    # keep only the files whose name contains "Drug_Rep"
    a[[i]] <- X
    i <- i + 1
  } else {
    # Don't do anything
  }
}
# Now I want to merge my dataframes
mymerge <- function(x, y)
  merge(x, y, by=c("someName", "someID"), all=TRUE)
Reduce(mymerge, a) # passing my list of dataframes 'a'
I did dput() on my 'a' list:
list(structure(list(someName = structure(c(1L, 2L, 4L, 5L, 6L,
7L, 3L), .Label = c("A", "B", "C", "D", "E", "F", "L"), class = "factor"),
someMOD = structure(c(6L, 5L, 4L, 2L, 3L, 7L, 1L), .Label = c("S162(P),Q159(D)",
"S18(P)", "S182(P)", "S45(P),C47(C),S48(P),S26(P)", "S495(P)",
"T754(P),M691(O),S692(P),S694(P),S739(P),S740(P),S759(P)",
"Z182(P)"), class = "factor"), someID = c(1L, 2L, 4L, 5L,
6L, 8L, 3L)), .Names = c("someName", "someMOD", "someID"), class = "data.frame", row.names = c(NA,
-7L)), structure(list(someName = structure(1:5, .Label = c("A",
"B", "C", "D", "E"), class = "factor"), someMOD = structure(c(5L,
4L, 1L, 3L, 2L), .Label = c("S162(P),Q159(D)", "S18(P)", "S45(P),C47(C),S48(P),S26(P)",
"S495(P)", "T754(P),M691(O),S692(P),S694(P),S739(P),S740(P),S759(P),S762(P)"
), class = "factor"), someID = 1:5), .Names = c("someName", "someMOD",
"someID"), class = "data.frame", row.names = c(NA, -5L)), structure(list(
someName = structure(c(2L, 3L, 5L, 4L, 6L, 1L), .Label = c("A",
"C", "D", "E", "F", "Z"), class = "factor"), someMOD = structure(c(2L,
5L, 4L, 3L, 1L, 6L), .Label = c("Q100(P)", "S162(P),Q159(D)",
"S18(P)", "S182(P)", "S45(P),C47(C),S48(P),S26(P)", "T754(P),M691(O),S694(P),S739(P),S740(P)"
), class = "factor"), someID = c(3L, 4L, 6L, 5L, 9L, 1L)), .Names = c("someName",
"someMOD", "someID"), class = "data.frame", row.names = c(NA,
-6L)))
What is my mistake in populating a list? Any help is really appreciated.
I am just trying to get an output like the following:
The problem with the code I gave you before is that merge gets confused if there are any duplicate column names, and you're merging more than 3 datasets. You'll have to rename your someMOD columns so they don't clash. A for loop works as well as anything for this purpose.
dupvars <- which(!names(a[[1]]) %in% c("someName", "someID"))
for(i in seq_along(a))
names(a[[i]])[dupvars] <- paste0(names(a[[i]])[dupvars], i)
# and then merge
Reduce(mymerge, a)
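A self-contained sketch of the rename-then-merge idea, on made-up toy data (the someMOD values and the names keys, renamed and merged are invented for the illustration, not taken from the files in the question):

```r
# Three small data frames sharing the key columns someName and someID
a <- list(
  data.frame(someName = c("A", "B"), someMOD = c("S1(P)", "S2(P)"),
             someID = c(1L, 2L), stringsAsFactors = FALSE),
  data.frame(someName = c("A", "C"), someMOD = c("S3(P)", "S4(P)"),
             someID = c(1L, 3L), stringsAsFactors = FALSE),
  data.frame(someName = c("B", "C"), someMOD = c("S5(P)", "S6(P)"),
             someID = c(2L, 3L), stringsAsFactors = FALSE)
)
keys <- c("someName", "someID")

# Suffix every non-key column with the position of its data frame
renamed <- lapply(seq_along(a), function(i) {
  d <- a[[i]]
  is_value <- !names(d) %in% keys
  names(d)[is_value] <- paste0(names(d)[is_value], i)
  d
})

# Full outer join across the whole list
merged <- Reduce(function(x, y) merge(x, y, by = keys, all = TRUE), renamed)
merged
```

With all = TRUE, a name that is missing from one of the inputs gets NA in that input's someMOD column.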
Perhaps the problem is that you're actually not trying to merge in the standard sense, but reshape. In this case, you can rbind all the data.frames together after adding a "time" variable, and use dcast from "reshape2" to get what you're after:
Add a "time" variable and rbind the data.frames together
temp <- do.call(rbind,
lapply(seq_along(a),
function(x) data.frame(a[[x]], time = x)))
head(temp)
# someName someMOD someID time
# 1 A T754(P),M691(O),S692(P),S694(P),S739(P),S740(P),S759(P) 1 1
# 2 B S495(P) 2 1
# 3 D S45(P),C47(C),S48(P),S26(P) 4 1
# 4 E S18(P) 5 1
# 5 F S182(P) 6 1
# 6 L Z182(P) 8 1
Transform the data.frame from a "long" format to a "wide" format
library(reshape2)
dcast(temp, someName + someID ~ time, value.var="someMOD")
# someName someID 1
# 1 A 1 T754(P),M691(O),S692(P),S694(P),S739(P),S740(P),S759(P)
# 2 B 2 S495(P)
# 3 C 3 S162(P),Q159(D)
# 4 D 4 S45(P),C47(C),S48(P),S26(P)
# 5 E 5 S18(P)
# 6 F 6 S182(P)
# 7 L 8 Z182(P)
# 8 Z 9 <NA>
# 2
# 1 T754(P),M691(O),S692(P),S694(P),S739(P),S740(P),S759(P),S762(P)
# 2 S495(P)
# 3 S162(P),Q159(D)
# 4 S45(P),C47(C),S48(P),S26(P)
# 5 S18(P)
# 6 <NA>
# 7 <NA>
# 8 <NA>
# 3
# 1 T754(P),M691(O),S694(P),S739(P),S740(P)
# 2 <NA>
# 3 S162(P),Q159(D)
# 4 S45(P),C47(C),S48(P),S26(P)
# 5 S18(P)
# 6 S182(P)
# 7 <NA>
# 8 Q100(P)
