Optimizing matching in R - r

Hoping someone can help. I have a ton of ortholog mapping to do in R, which is proving to be incredibly time consuming. I've posted an example structure below. Obvious answers such as iterating line by line (for i in 1:nrow(df)) and string splitting, or using sapply have been tried and are incredibly slow. I am therefore hoping for a vectorized option.
options(stringsAsFactors = FALSE)
# example accession mapping
map <- data.frame(source = c("1", "2 4", "3", "4 6 8", "9"),
target = c("a b", "c", "d e f", "g", "h i"))
# example protein list
df <- data.frame(sourceIDs = c("1 2", "3", "4", "5", "8 9"))
# now, map df$sourceIDs to map$target
# expected output
> matches
[1] "a b c" "d e f" "g" "" "g h i"
I appreciate any help!

In most cases, the best approach to this kind of problem is to create data.frames with one observation per row.
map_split <- lapply(map, strsplit, split = ' ')
long_mappings <- mapply(expand.grid, map_split$source, map_split$target, SIMPLIFY = FALSE)
all_map <- do.call(rbind, long_mappings)
names(all_map) <- c('source', 'target')
Now all_map looks like this:
source target
1 1 a
2 1 b
3 2 c
4 4 c
5 3 d
6 3 e
7 3 f
8 4 g
9 6 g
10 8 g
11 9 h
12 9 i
Doing the same for df...
sourceIDs_split <- strsplit(df$sourceIDs, ' ')
df_long <- data.frame(
index = rep(seq_along(sourceIDs_split), lengths(sourceIDs_split)),
source = unlist(sourceIDs_split)
)
This gives us df_long:
index source
1 1 1
2 1 2
3 2 3
4 3 4
5 4 5
6 5 8
7 5 9
Now they just need to be merged and collapsed.
matches <- merge(df_long, all_map, by = 'source', all.x = TRUE)
tapply(
matches$target,
matches$index,
function(x) {
paste0(sort(x), collapse = ' ')
}
)
# 1 2 3 4 5
# "a b c" "d e f" "c g" "" "g h i"
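For reuse, the split, merge and collapse steps above could be wrapped in a single function. This is only a sketch under assumptions: the name map_ids and its arguments are hypothetical, IDs are taken to be separated by single spaces, and a target that is reached more than once within a row is reported once.

```r
# Hypothetical helper (not from the original answer): maps space-separated
# source IDs to the space-separated targets of every map row they occur in.
map_ids <- function(ids, map_source, map_target, sep = " ") {
  # One entry per single map ID, carrying that map row's full target string
  src <- strsplit(map_source, sep, fixed = TRUE)
  map_long <- data.frame(source = unlist(src),
                         target = rep(map_target, lengths(src)),
                         stringsAsFactors = FALSE)

  # One entry per single query ID, remembering the input row it came from
  qry <- strsplit(ids, sep, fixed = TRUE)
  qry_long <- data.frame(row_id = rep(seq_along(qry), lengths(qry)),
                         source = unlist(qry),
                         stringsAsFactors = FALSE)

  # Inner join on the single IDs, then split the target strings as well
  hits <- merge(qry_long, map_long, by = "source")
  tgt <- strsplit(hits$target, sep, fixed = TRUE)

  # Group single targets by input row; rows without a hit stay empty
  groups <- factor(rep(hits$row_id, lengths(tgt)), levels = seq_along(qry))
  per_row <- split(as.character(unlist(tgt)), groups)
  unname(vapply(per_row,
                function(x) paste(sort(unique(x)), collapse = sep),
                character(1)))
}

map <- data.frame(source = c("1", "2 4", "3", "4 6 8", "9"),
                  target = c("a b", "c", "d e f", "g", "h i"),
                  stringsAsFactors = FALSE)
df <- data.frame(sourceIDs = c("1 2", "3", "4", "5", "8 9"),
                 stringsAsFactors = FALSE)

matches <- map_ids(df$sourceIDs, map$source, map$target)
```

As in the tapply() result above, the third element comes out as "c g", because ID 4 appears in two rows of map.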

Related

Replace multiple words with a single word

I want to replace multiple letters/words with a single letter/word, multiple times in a dataframe. As an example,
Some data:
df = data.frame(
a = 1:8,
b = c("colour1 o", "colour2 O", "colour3 out", "colour4 Out",
"soundi i", "soundr I", "sounde in", "soundw In"))
df
a b
1 1 colour1 o
2 2 colour2 O
3 3 colour3 out
4 4 colour4 Out
5 5 soundi i
6 6 soundr I
7 7 sounde in
8 8 soundw In
Here is what I want to replace with:
df_repl <- list(
O = c("o", "out", "Out"),
In = c("i", "in", "I"))
So in df$b o, out and Out should become O and i, in and I become In, but only if they are separated from any other words by a space, so o in colour is not capitalised.
This gets me half way there, but I think I need another nested for-loop to move through df_repl...
for (word in df_repl[[1]]){
patt <- paste0('\\b', word, '\\b')
repl <- paste(names(df_repl[1]))
df$b <- gsub(patt, repl, df$b)
}
df
a b
1 1 colour1 O
2 2 colour2 O
3 3 colour3 O
4 4 colour4 O
5 5 soundi i
6 6 soundr I
7 7 sounde in
8 8 soundw In
Above, o, out and Out become O, but i, in and I are not altered. Here is the desired output:
a b
1 1 colour1 O
2 2 colour2 O
3 3 colour3 O
4 4 colour4 O
5 5 soundi In
6 6 soundr In
7 7 sounde In
8 8 soundw In
In the real data there are many more than two replacement words/letters so I can't just rerun the for-loop again. I'm not tied to a for-loop solution, but preferably using base R, any suggestions much appreciated.
EDIT
Trying to clarify my question:
Whenever one of o, out or Out occur in df$b I want to replace it with O
Whenever one of i, in or I occur in df$b I want to replace it with In
I can achieve the desired output like this:
for (word in df_repl[[1]]){
patt <- paste0('\\b', word, '\\b')
repl <- paste(names(df_repl[1]))
df$b <- gsub(patt, repl, df$b)
}
for (word in df_repl[[2]]){
patt <- paste0('\\b', word, '\\b')
repl <- paste(names(df_repl[2]))
df$b <- gsub(patt, repl, df$b)
}
But in my real dataset df_repl has length 50 rather than two, so I don't want to copy/paste/edit/rerun the for-loop 50 times.
You can skip the inner loop over the words in df_repl by pasting them together with | (or) between them:
for(i in names(df_repl)) {
df$b <- sub(paste(paste0("\\b",df_repl[[i]],"\\b"), collapse = "|")
, i, df$b)
}
df
# a b
#1 1 colour1 O
#2 2 colour2 O
#3 3 colour3 O
#4 4 colour4 O
#5 5 soundi In
#6 6 soundr In
#7 7 sounde In
#8 8 soundw In
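The same alternation idea could be packaged as a small function. This is a hedged sketch, not part of the answer above: replace_words is a hypothetical name, it uses gsub() instead of sub() so that several occurrences in one string would all be replaced, and it assumes the words contain no regex metacharacters.

```r
# Hypothetical helper: for each name in `repl`, replace any of its words
# (matched as whole words) with that name.
replace_words <- function(x, repl) {
  for (target in names(repl)) {
    # Builds e.g. "\\b(o|out|Out)\\b" from the words listed under `target`
    pattern <- paste0("\\b(", paste(repl[[target]], collapse = "|"), ")\\b")
    x <- gsub(pattern, target, x)
  }
  x
}

b <- c("colour1 o", "colour2 O", "colour3 out", "colour4 Out",
       "soundi i", "soundr I", "sounde in", "soundw In")
df_repl <- list(O = c("o", "out", "Out"),
                In = c("i", "in", "I"))

b_new <- replace_words(b, df_repl)
```

Because the replacements are applied one list entry after another, a word produced by an earlier entry could be rewritten by a later one; with the example list this does not happen.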
You may try using three separate calls to sub:
df$b <- sub("\\bo\\b", "i", df$b)
df$b <- sub("\\bout\\b", "in", df$b)
df$b <- sub("\\bOut\\b", "I", df$b)
df
a b
1 1 colour1 i
2 2 colour2 O
3 3 colour3 in
4 4 colour4 I
5 5 soundi i
6 6 soundr I
7 7 sounde in
8 8 soundw In
To automate this, you could try using sapply with an index:
terms_in <- c("o", "out", "Out")
pat <- paste0("\\b", terms_in, "\\b")
replace <- c("i", "in", "I")
sapply(seq_along(pat), function(x) {
df$b <<- sub(pat[x], replace[x], df$b)
})
This is another solution:
library(stringr)
in1 <- str_split(df$b, " ", simplify = TRUE)[,1]
in2 <- str_split(df$b, " ", simplify = TRUE)[,2]
in2[in2 %in% c("o", "out", "Out")] <- "O"
in2[in2 %in% c("i", "in", "I")] <- "In"
df$b <- paste(in1, in2, sep=" ")
df
If you have a long list of words in your data, you could also move c(word list) outside:
in1<- str_split(df$b, " ", simplify = TRUE)[,1]
in2<- str_split(df$b, " ", simplify = TRUE)[,2]
o <- c("o", "out", "Out")
i <- c("i", "in", "I")
in2[in2 %in% o] <- "O"
in2[in2 %in% i] <- "In"
df$b <- paste(in1, in2, sep=" ")
df
> df
a b
1 1 colour1 O
2 2 colour2 O
3 3 colour3 O
4 4 colour4 O
5 5 soundi In
6 6 soundr In
7 7 sounde In
8 8 soundw In
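With a long replacement list, an alternative to one %in% line per word group is a single named lookup vector built from the list. The following is only a sketch in base R; lookup and the other variable names are made up for illustration, and words are assumed to be separated by single spaces.

```r
b <- c("colour1 o", "colour2 O", "colour3 out", "colour4 Out",
       "soundi i", "soundr I", "sounde in", "soundw In")
df_repl <- list(O = c("o", "out", "Out"),
                In = c("i", "in", "I"))

# Named vector: names are the words to find, values are their replacements
lookup <- setNames(rep(names(df_repl), lengths(df_repl)),
                   unlist(df_repl, use.names = FALSE))

# Split each string into words, swap the words found in the lookup, re-join
words <- strsplit(b, " ", fixed = TRUE)
b_new <- vapply(words, function(w) {
  hit <- w %in% names(lookup)
  w[hit] <- lookup[w[hit]]
  paste(w, collapse = " ")
}, character(1))
```

Unlike the version above, this does not assume exactly two words per string.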

I have a sample dataset, which has missing values in it

I have a sample dataset with missing values in it. I want to create a new column containing a message that says, for each row, which columns' values are missing.
Example:
Dataset:
A  B  C  D
1  NA 2  4
NA 4  4  NA
4  1  NA NA
NA NA NA NA
3  NA 2  3
The possible combinations of missing columns are:
"a", "b", "c", "d", "a, b", "a, c", "a, d", "b, c", "b, d", "c, d", "a, b, c", "a, b, d", "a, c, d", "b, c, d", "a, b, c, d"
Result:
A  B  C  D  Message
1  NA 2  4  Column B is missing
NA 4  4  NA column A and D is Missing
4  1  NA NA Column C and D is Missing
NA NA NA NA All column values are missing
3  NA 2  3  Column B is Missing
Any suggestion would be really appreciated
Here's a way using apply from base R -
set.seed(4)
df <- data.frame(matrix(sample(c(1:5, NA), 15, replace = T), ncol = 3))
names(df) <- LETTERS[1:3]
df$msg <- apply(df, 1, function(x) {
if(anyNA(x)) {
paste0(paste0(names(x)[which(is.na(x))], collapse = " "), " missing", collapse = "")
} else {
"No missing"
}
})
df
A B C msg
1 4 2 5 No missing
2 1 5 2 No missing
3 2 NA 1 B missing
4 2 NA NA B C missing
5 5 1 3 No missing
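To get messages closer to the wording asked for in the question, the same row-wise idea could be written against is.na(). This is a sketch only: missing_message is a hypothetical name, the "No column is missing" text is an assumption (the question does not say what complete rows should show), and the example data assume that the blank cells in the question are NA.

```r
# Hypothetical helper: one message per row naming the columns that are NA
missing_message <- function(df) {
  na_mat <- is.na(df)  # logical matrix with the same column names as df
  vapply(seq_len(nrow(df)), function(i) {
    cols <- colnames(na_mat)[na_mat[i, ]]
    if (length(cols) == 0) {
      "No column is missing"
    } else if (length(cols) == ncol(df)) {
      "All column values are missing"
    } else {
      paste("Column", paste(cols, collapse = " and "), "is missing")
    }
  }, character(1))
}

dat <- data.frame(A = c(1, NA, 4, NA, 3),
                  B = c(NA, 4, 1, NA, NA),
                  C = c(2, 4, NA, NA, 2),
                  D = c(4, NA, NA, NA, 3))

msg <- missing_message(dat)
dat$Message <- msg
```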

Pairing truncated character into a dataframe

I have a truncated character vector, chr[1:10], in which each element is organized as shown below, and some rows don't have the same length:
[1] "\nA B C D E"
[2] "\n1  3 4 5"
[3] "\nF G H"
[4] "\n6 7 8"
Here's an updated version of my question
line.1 <- c("A B C D E")
line.2 <- c("1  3 4 5")
line <- rbind(line.1, line.2)
line <- data.frame(line)
line
line.1 A B C D E
line.2 1 3 4 5
So, my desired output should be:
       V1 V2 V3 V4 V5
Line.1 A  B  C  D  E
Line.2 1     3  4  5
I can't quite figure out how to split it into different columns with the extra space in between being counted as one value.
Here's one way to do it:
# Build the character vector
x <- c("\nA B C D E", "\n1  3 4 5", "\nF G H", "\n6 7 8")
# Remove the new line characters
x <- sub("\n", "", x)
# Select every other element of the character vector as column 1
Col1 <- paste(x[c(T, F)], collapse = ' ')
Col1 <- strsplit(Col1, ' ')[[1]]
# Do the same for column 2
Col2 <- paste(x[c(F, T)], collapse = ' ')
Col2 <- strsplit(Col2, ' ')[[1]]
# Combine them in a data frame
data.frame(Col1, Col2)
# Col1 Col2
# 1 A 1
# 2 B
# 3 C 3
# 4 D 4
# 5 E 5
# 6 F 6
# 7 G 7
# 8 H 8
The use of strsplit is what splits the values into different columns:
> strsplit(line.2, ' ')[[1]]
[1] "1" "" "3" "4" "5"
So to combine both lines as a dataframe, you can do:
data.frame(rbind(strsplit(line.1, ' ')[[1]], strsplit(line.2, ' ')[[1]]))
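If the lines can split into different numbers of fields, rbind() would recycle the shorter ones, so it may be safer to pad them first. A minimal sketch, assuming the gap in the second line is exactly two spaces (so that splitting on a single space leaves one empty field); the variable names are made up for illustration:

```r
lines <- c(line.1 = "A B C D E",
           line.2 = "1  3 4 5")   # two spaces between 1 and 3

# Split on a single literal space; the doubled space yields an empty field
parts <- strsplit(lines, " ", fixed = TRUE)

# Pad every line with "" up to the width of the longest one
width <- max(lengths(parts))
padded <- lapply(parts, function(p) c(p, rep("", width - length(p))))

# Stack the padded lines as rows; columns are named V1, V2, ...
out <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
rownames(out) <- names(lines)
```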

R: split data frame rows by space, remove common elements, put unequal length columns in new df

Suppose I have a df with two rows of strings that I need to split by space and unlist, then find the anti-intersection and reuse it in a list. I can do it by brute force, working with each row individually. The problem is that there can be more than 2 rows. My working solution so far is below, but there must be a simpler way that avoids accessing each line. Thanks!!
df = structure(list(A = structure(1:2, .Label = c("R1", "R2"), class = "factor"),
B = c("a b c d e f g o l",
"b h i j k l m n o p q"
)), .Names = c("A", "B"), row.names = c(NA, -2L), class = "data.frame")
dat1 = unlist(strsplit(df[1,2]," "))
dat2 = unlist(strsplit(df[2,2]," "))
f <- function (...)
{
aux <- list(...)
ind <- rep(1:length(aux), sapply(aux, length))
x <- unlist(aux)
boo <- !(duplicated(x) | duplicated(x, fromLast = T))
split(x[boo], ind[boo])
}
excl = (f(dat1, dat2))
L <- list(excl[[1]],excl[[2]])
cfun <- function(L) {
pad.na <- function(x,len) {
c(x,rep("",len-length(x)))
}
maxlen <- max(sapply(L,length))
print(maxlen)
do.call(data.frame,lapply(L,pad.na,len=maxlen))
}
a = cfun(L)
What I had:
A B
1 Food a b c d e f g
2 HABA b h i j k l m n o p q
What I got:
  c..a....c....d....e....f....g.......... c..h....i....j....k....m....n....p....q..
1 a                                       h
2 c                                       i
3 d                                       j
4 e                                       k
5 f                                       m
6 g                                       n
7                                         p
8                                         q
Edit: The goal is to eliminate common elements from all columns. I.e. if "4" is present in row 1 and seen anywhere else - remove. New test set:
df1 = structure(list(A = structure(1:3, .Label = c("R1", "R2", "R3"
), class = "factor"), B = c("1 4 78 5 4 6 7 0", "2 3 76 8 2 1 8 0",
"4 7 1 2")), .Names = c("A", "B"), row.names = c(NA, -3L), class = "data.frame")
Current output from suggested code:
a b c
1 4 2 4
2 78 3 7
3 5 76 2
4 4 8 NA
5 6 2 NA
6 7 8 NA
7 0 0 NA
2, 4, and 7 should not be there as they are seen in more than 1 column. Bottom line - output should consist of unique numbers/elements only in any columns. Thanks!!
Here's one way using base R that avoids a lot of your current code
## split column B on the space character
s <- strsplit(df$B, " ")
## find the intersection of all s
r <- Reduce(intersect, s)
## iterate over s, removing the intersection characters in r
l <- lapply(s, function(x) x[!x %in% r])
## reset the length of each vector in l to the length of the longest vector
## then create the new data frame
setNames(as.data.frame(lapply(l, "length<-", max(lengths(l)))), letters[seq_along(l)])
# a b
# 1 a h
# 2 c i
# 3 d j
# 4 e k
# 5 f m
# 6 g n
# 7 <NA> p
# 8 <NA> q
I presume that's what you are shooting for?
Note that lengths() was added to the base package in R version 3.2.0; it is a faster, more efficient replacement for sapply(x, length) on a list.
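Reduce(intersect, s) only removes elements that occur in every row. For the edited requirement (drop anything seen in more than one row), one possibility is to count in how many rows each element occurs. This is a sketch under assumptions: repeats within a single row are collapsed with unique(), and the names shared and kept are made up for illustration.

```r
B <- c("1 4 78 5 4 6 7 0", "2 3 76 8 2 1 8 0", "4 7 1 2")

# Split each row and drop repeats within the row
s <- lapply(strsplit(B, " ", fixed = TRUE), unique)

# Count the number of rows each element occurs in
counts <- table(unlist(s))
shared <- names(counts)[counts > 1]

# Keep only elements that occur in exactly one row
kept <- lapply(s, function(x) x[!x %in% shared])

# Pad to equal length with NA and build the data frame, as above
out <- setNames(
  as.data.frame(lapply(kept, "length<-", max(lengths(kept))),
                stringsAsFactors = FALSE),
  letters[seq_along(kept)]
)
```

For this test set the third row shares all of its elements with other rows, so its column contains only NA.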

How can I build an inverted index from a data frame in R?

Say I have a data frame in R : data.frame(x=1:4, y=c("a b c", "b", "a c", "c"))
x y
1 1 a b c
2 2 b
3 3 a c
4 4 c
Now I want to build a new data frame, an inverted index which is quite common in IR or recommendation systems, from it:
y x
a 1 3
b 1 2
c 1 3 4
How can I do this in an efficient way?
conv <- function(x) {
l <- function(z) {
paste(x$x[grep(z, x$y)], collapse=' ')
}
lv <- Vectorize(l)
alphabet <- unique(unlist(strsplit(as.character(x$y), ' '))) # hard-coding this might be preferred for some uses.
y <- lv(alphabet)
data.frame(y=names(y), x=y)
}
x <- data.frame(x=1:4, y=c("a b c", "b", "a c", "c"))
> conv(x)
## y x
## a a 1 3
## b b 1 2
## c c 1 3 4
An attempt, after converting y to characters:
test <- data.frame(x=1:4,y=c("a b c","b","a c","c"),stringsAsFactors=FALSE)
result <- strsplit(test$y," ")
result2 <- sapply(unique(unlist(result)),function(y) sapply(result,function(x) y %in% x))
result3 <- apply(result2,2,function(x) test$x[which(x)])
final <- data.frame(x=names(result3),y=sapply(result3,paste,collapse=" "))
> final
x y
a a 1 3
b b 1 2
c c 1 3 4
Quick and dirty:
original.df <- data.frame(x=1:4, y=c("a b c", "b", "a c", "c"))
original.df$y <- as.character(original.df$y)
y.split <- strsplit(original.df$y, " ")
y.unlisted <- unique(unlist(y.split))
new.df <-
sapply(y.unlisted, function(element)
paste(which(sapply(y.split, function(y.row) element %in% y.row)), collapse=" " ))
as.data.frame(new.df)
> new.df
a 1 3
b 1 2
c 1 3 4
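The nested sapply() calls test every term against every row. A possible alternative is to expand the data to one (term, id) pair per entry and group with split(). A minimal sketch in base R, with variable names (terms, postings, index) chosen only for illustration:

```r
x <- data.frame(x = 1:4, y = c("a b c", "b", "a c", "c"),
                stringsAsFactors = FALSE)

# One vector of terms per row
terms <- strsplit(x$y, " ", fixed = TRUE)

# Repeat each id once per term in its row, then group the ids by term
postings <- split(rep(x$x, lengths(terms)), unlist(terms))

# Collapse each posting list into a space-separated string
index <- data.frame(
  y = names(postings),
  x = unname(vapply(postings, paste, character(1), collapse = " ")),
  stringsAsFactors = FALSE
)
```

Keeping postings as a list, rather than pasting the ids together, may be more convenient if the index is used for lookups afterwards.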
