Identifying patterns in two strings in R

I want to check whether ColA contains any string that does not appear in ColB. However, certain strings (for example, oil) should be ignored. I would like an indicator variable as follows:
ColA                     ColB               Ind
------------------------ ------------------ -----
coconut+grape+pine       grape+coconut      TRUE
orange+apple+grape+pine  grape+coconut      TRUE
grape+pine               grape+oil          TRUE
oil+grape                grape+apple        FALSE
grape                    grape+oil          FALSE
grape+pine               grape+orange+pine  FALSE
Any Suggestions using R?
Many thanks!

Since we need to split the strings, we'll start with strsplit:
strsplit(dat$ColA, '+', fixed = TRUE)
# [[1]]
# [1] "coconut" "grape" "pine"
# [[2]]
# [1] "orange" "apple" "grape" "pine"
# [[3]]
# [1] "grape" "pine"
# [[4]]
# [1] "oil" "grape"
# [[5]]
# [1] "grape"
# [[6]]
# [1] "grape" "pine"
From here, we want to determine what is in ColA that is not in ColB. I'll use Map to run setdiff on each set (ColA's [[1]] with ColB's [[1]], etc).
Map(setdiff, strsplit(dat$ColA, '+', fixed = TRUE), strsplit(dat$ColB, '+', fixed = TRUE))
# [[1]]
# [1] "pine"
# [[2]]
# [1] "orange" "apple" "pine"
# [[3]]
# [1] "pine"
# [[4]]
# [1] "oil"
# [[5]]
# character(0)
# [[6]]
# character(0)
To determine which one has "new words", we can just check for non-zero length using lengths(.) > 0:
lengths(Map(setdiff, strsplit(dat$ColA, '+', fixed = TRUE), strsplit(dat$ColB, '+', fixed = TRUE))) > 0
# [1] TRUE TRUE TRUE TRUE FALSE FALSE
But since you don't care about oil, we need to remove that as well.
lapply(Map(setdiff, strsplit(dat$ColA, '+', fixed = TRUE), strsplit(dat$ColB, '+', fixed = TRUE)), setdiff, "oil")
# [[1]]
# [1] "pine"
# [[2]]
# [1] "orange" "apple" "pine"
# [[3]]
# [1] "pine"
# [[4]]
# character(0)
# [[5]]
# character(0)
# [[6]]
# character(0)
lengths(lapply(Map(setdiff, strsplit(dat$ColA, '+', fixed = TRUE), strsplit(dat$ColB, '+', fixed = TRUE)),
setdiff, "oil")) > 0
# [1] TRUE TRUE TRUE FALSE FALSE FALSE
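The steps above (split, per-row setdiff, drop the ignored word, test for non-zero length) could be collected into one function. This is only a sketch: the name has_new_items and its ignore and sep arguments are made up for illustration and are not part of the answer.

```r
# Hypothetical helper: the name and arguments are illustrative only.
has_new_items <- function(a, b, ignore = "oil", sep = "+") {
  a_parts <- strsplit(a, sep, fixed = TRUE)
  b_parts <- strsplit(b, sep, fixed = TRUE)
  # Row by row: items present in a but absent from b
  new_parts <- Map(setdiff, a_parts, b_parts)
  # Drop the items we do not care about
  new_parts <- lapply(new_parts, setdiff, ignore)
  # TRUE where at least one relevant new item remains
  lengths(new_parts) > 0
}

ColA <- c("coconut+grape+pine", "orange+apple+grape+pine", "grape+pine",
          "oil+grape", "grape", "grape+pine")
ColB <- c("grape+coconut", "grape+coconut", "grape+oil",
          "grape+apple", "grape+oil", "grape+orange+pine")

ind <- has_new_items(ColA, ColB)
ind
# [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE
```

Because ignore is an ordinary character vector, several words can be excluded at once, e.g. ignore = c("oil", "pine").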
@akrun suggested a tidyverse variant:
library(dplyr)
library(purrr) # map2_lgl
library(stringr) # str_extract_all
dat %>%
  mutate(
    new = map2_lgl(
      str_extract_all(ColB, "\\w+"), str_extract_all(ColA, "\\w+"),
      ~ !all(setdiff(.y, "oil") %in% .x)
    )
  )
#                      ColA              ColB   Ind   new
# 1      coconut+grape+pine     grape+coconut  TRUE  TRUE
# 2 orange+apple+grape+pine     grape+coconut  TRUE  TRUE
# 3              grape+pine         grape+oil  TRUE  TRUE
# 4               oil+grape       grape+apple FALSE FALSE
# 5                   grape         grape+oil FALSE FALSE
# 6              grape+pine grape+orange+pine FALSE FALSE
Data
dat <- structure(list(ColA = c("coconut+grape+pine", "orange+apple+grape+pine", "grape+pine", "oil+grape", "grape", "grape+pine"), ColB = c("grape+coconut", "grape+coconut", "grape+oil", "grape+apple", "grape+oil", "grape+orange+pine"), Ind = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)), class = "data.frame", row.names = c(NA, -6L))

Here's a solution similar to r2evans's that calls strsplit only once with the help of do.call.
rid <- function(x) x[!x %in% z] ## helper FUN to get rid of the oil
z <- "oil"
L <- sapply(unname(dat), strsplit, "\\+")
dat$ind <- sapply(1:nrow(L), function(x) length(do.call(setdiff, rev(Map(rid, L[x,]))))) > 0
dat
#                  V1                      V2   ind
# 1     grape+coconut      coconut+grape+pine  TRUE
# 2     grape+coconut orange+apple+grape+pine  TRUE
# 3         grape+oil              grape+pine  TRUE
# 4       grape+apple               oil+grape FALSE
# 5         grape+oil                   grape FALSE
# 6 grape+orange+pine              grape+pine FALSE
Data:
dat <- structure(list(V1 = c("grape+coconut", "grape+coconut", "grape+oil",
"grape+apple", "grape+oil", "grape+orange+pine"), V2 = c("coconut+grape+pine",
"orange+apple+grape+pine", "grape+pine", "oil+grape", "grape",
"grape+pine")), row.names = c(NA, -6L), class = "data.frame")

Related

R - find cases in list of lists which meet specific condition

I have a large list of lists with three columns (TRUE/FALSE) per case. I want to find out for which cases all three columns = TRUE.
Example data:
l1 <- list( c("TRUE","FALSE","FALSE") , c("FALSE","FALSE","FALSE") , c("FALSE","FALSE","FALSE") )
l2 <- list( c("TRUE","TRUE","TRUE") , c("TRUE","TRUE","TRUE") , c("FALSE","FALSE","FALSE") )
l3 <- list( c("TRUE","TRUE","TRUE") , c("FALSE","FALSE","FALSE") , c("TRUE","FALSE","FALSE") )
mylist <- list(l1,l2,l3)
In the output I need to see which cases in which lists meet the condition, so something like
l2[[1]]
l3[[1]]
Hope that someone can help! Thank you so much in advance!

With rapply:
rapply(mylist, \(x) all(x == "TRUE"), how = "list")
output
[[1]]
[[1]][[1]]
[1] FALSE
[[1]][[2]]
[1] FALSE
[[1]][[3]]
[1] FALSE
[[2]]
[[2]][[1]]
[1] TRUE
[[2]][[2]]
[1] TRUE
[[2]][[3]]
[1] FALSE
[[3]]
[[3]][[1]]
[1] TRUE
[[3]][[2]]
[1] FALSE
[[3]][[3]]
[1] FALSE
Or, if you want a more compact result:
rapply(mylist, \(x) all(x == "TRUE"), how = "list") |>
lapply(\(x) which(unlist(x)))
[[1]]
integer(0)
[[2]]
[1] 1 2
[[3]]
[1] 1
Another compact solution with rrapply::rrapply:
rrapply::rrapply(mylist, \(x) all(x == "TRUE"), how = "melt")
L1 L2 value
1 2 1 TRUE, TRUE, TRUE
2 2 2 TRUE, TRUE, TRUE
3 3 1 TRUE, TRUE, TRUE
Note: your real data probably contains logical vectors, i.e. c(TRUE, FALSE) (without the quotes). In that case, all(x) is sufficient.
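As a sketch of that logical-vector case, assuming the same structure as the question's mylist but with real logicals instead of strings, the compact per-list indices can be computed with base R only:

```r
# Same shape as the question's data, but with real logical vectors (assumption)
l1 <- list(c(TRUE, FALSE, FALSE), c(FALSE, FALSE, FALSE), c(FALSE, FALSE, FALSE))
l2 <- list(c(TRUE, TRUE, TRUE), c(TRUE, TRUE, TRUE), c(FALSE, FALSE, FALSE))
l3 <- list(c(TRUE, TRUE, TRUE), c(FALSE, FALSE, FALSE), c(TRUE, FALSE, FALSE))
mylist <- list(l1, l2, l3)

# For each inner list, the positions of the cases whose values are all TRUE
hits <- lapply(mylist, function(cases) which(vapply(cases, all, logical(1))))
hits
# [[1]]
# integer(0)
#
# [[2]]
# [1] 1 2
#
# [[3]]
# [1] 1
```

vapply() is used instead of sapply() so that the result is guaranteed to be a logical vector even when an inner list is empty.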

Perhaps this helps
which(sapply(mylist, \(x) sapply(x, \(y) all(y == 'TRUE'))), arr.ind = TRUE)
or use
lapply(mylist, \(x) Filter(\(y) all(y == 'TRUE'), x))
[[1]]
list()
[[2]]
[[2]][[1]]
[1] "TRUE" "TRUE" "TRUE"
[[2]][[2]]
[1] "TRUE" "TRUE" "TRUE"
[[3]]
[[3]][[1]]
[1] "TRUE" "TRUE" "TRUE"
Or maybe
> lapply(mylist, \(x) lapply(x, \(y) all(y == 'TRUE')))
[[1]]
[[1]][[1]]
[1] FALSE
[[1]][[2]]
[1] FALSE
[[1]][[3]]
[1] FALSE
[[2]]
[[2]][[1]]
[1] TRUE
[[2]][[2]]
[1] TRUE
[[2]][[3]]
[1] FALSE
[[3]]
[[3]][[1]]
[1] TRUE
[[3]][[2]]
[1] FALSE
[[3]][[3]]
[1] FALSE

Here is another trick using lapply
lapply(
mylist,
function(x) {
as.list(
colMeans(type.convert(list2DF(x), as.is = TRUE)) == 1
)
}
)
which gives
[[1]]
[[1]][[1]]
[1] FALSE
[[1]][[2]]
[1] FALSE
[[1]][[3]]
[1] FALSE
[[2]]
[[2]][[1]]
[1] TRUE
[[2]][[2]]
[1] TRUE
[[2]][[3]]
[1] FALSE
[[3]]
[[3]][[1]]
[1] TRUE
[[3]][[2]]
[1] FALSE
[[3]][[3]]
[1] FALSE

Count the number of `TRUE` in a list

I have a list as such
$`1`
[1] TRUE
$`5`
[1] TRUE
$`14`
[1] FALSE
$`17`
[1] TRUE
$`19`
[1] TRUE
$`20`
[1] TRUE
Is there an easy way to count the total number of TRUE values in the list?
I tried doing this trucount <- function(z){sum(z,na.rm = TRUE)} , but it doesn't work.
In the above example, the solution would return 5

You can use isTRUE():
> ll = list(`1`=TRUE, `5`=TRUE, `14`=TRUE, `17`=TRUE, `19`=TRUE, `20`=TRUE, `21`=FALSE)
> length(which(sapply(ll, isTRUE)))
[1] 6
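The attempt in the question fails because sum() needs an atomic vector and stops with an error when handed a list. A minimal sketch using the question's own list (five TRUE, one FALSE) shows two ways around that; the variable names are illustrative:

```r
# The list from the question
ll <- list(`1` = TRUE, `5` = TRUE, `14` = FALSE, `17` = TRUE, `19` = TRUE, `20` = TRUE)

# Flatten the list to a logical vector first, then sum it
n_unlist <- sum(unlist(ll), na.rm = TRUE)

# Or test each element with isTRUE(), which also treats NA and non-logicals as FALSE
n_vapply <- sum(vapply(ll, isTRUE, logical(1)))

n_unlist
# [1] 5
n_vapply
# [1] 5
```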

How do I search if the numbers in one vector are within a range of two other vectors in R?

I have two vectors. One holds the start and one the stop of a range of nucleotides in a protein. For example, one range is 1374742-1375555.
domainStart = c(1374742,1374760,1374769,1375822,1376182,1376320,1376350)
domainStop = c(1375555, 1375726,1375516, 1378129, 1376638, 1376638, 1377382)
Next I have a long list of nucleotide mutation positions.
db = c(37788, 40303, 138445, 161587, 165946,172979,177605, 200118, 244427, 251156, 258459, 265170, 344062)
I want to know whether each mutation position (db) falls within any of the domain ranges (e.g. 1374742-1375555), and return TRUE/FALSE as a vector with one value per position. Thanks!

You could use map2() from the purrr package:
domainStart = c(1374742,1374760,1374769,1375822,1376182,1376320,1376350)
domainStop = c(1375555, 1375726,1375516, 1378129, 1376638, 1376638, 1377382)
db = c(37788, 40303, 138445, 161587, 165946,172979,177605, 200118, 244427, 251156, 258459, 265170, 344062)
purrr::map2(domainStart, domainStop, ~which(db > .x & db < .y))
# [[1]]
# integer(0)
#
# [[2]]
# integer(0)
#
# [[3]]
# integer(0)
#
# [[4]]
# integer(0)
#
# [[5]]
# integer(0)
#
# [[6]]
# integer(0)
#
# [[7]]
# integer(0)
Each element of the list identifies the position of the match in db for each pair of start/stop values. Here it is with some values that actually fall inside the ranges:
db <- c(1374750, 1374761, 1374770)
purrr::map2(domainStart, domainStop, ~which(db > .x & db < .y))
# [[1]]
# [1] 1 2 3
#
# [[2]]
# [1] 2 3
#
# [[3]]
# [1] 3
#
# [[4]]
# integer(0)
#
# [[5]]
# integer(0)
#
# [[6]]
# integer(0)
#
# [[7]]
# integer(0)
Update: Fixed to address comment
db <- c(1374750, 1374761, 1374770)
purrr::map2(domainStart, domainStop, function(.x, .y){
  mx <- db[which(db > .x & db < .y)]
  # keep a row for domains without any match
  if(length(mx) == 0){
    mx <- NA
  }
  data.frame(domainStart = .x, domainStop = .y, db = mx)
})
# [[1]]
# domainStart domainStop db
# 1 1374742 1375555 1374750
# 2 1374742 1375555 1374761
# 3 1374742 1375555 1374770
#
# [[2]]
# domainStart domainStop db
# 1 1374760 1375726 1374761
# 2 1374760 1375726 1374770
#
# [[3]]
# domainStart domainStop db
# 1 1374769 1375516 1374770
#
# [[4]]
# domainStart domainStop db
# 1 1375822 1378129 NA
#
# [[5]]
# domainStart domainStop db
# 1 1376182 1376638 NA
#
# [[6]]
# domainStart domainStop db
# 1 1376320 1376638 NA
#
# [[7]]
# domainStart domainStop db
# 1 1376350 1377382 NA

Perhaps we can try the code below
df <- data.frame(Start = domainStart, Stop = domainStop)
apply(
  # a position matches a domain when Start <= position <= Stop
  outer(db, domainStart, `>=`) & outer(db, domainStop, `<=`),
  1,
  function(v) {
    df[which(v, arr.ind = TRUE), ]
  }
)
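The question asks for one TRUE/FALSE per position, which the same outer() comparison can provide by collapsing each row. This is a sketch only: the inclusive bounds and the test positions in db are assumptions chosen for illustration, not values from the question.

```r
domainStart <- c(1374742, 1374760, 1374769, 1375822, 1376182, 1376320, 1376350)
domainStop  <- c(1375555, 1375726, 1375516, 1378129, 1376638, 1376638, 1377382)

# Hypothetical test positions (not from the question's db)
db <- c(37788, 1374750, 1374761, 1375800, 1376400)

# Rows are positions, columns are domains; TRUE where Start <= position <= Stop
inside <- outer(db, domainStart, ">=") & outer(db, domainStop, "<=")

# A position counts as a hit if it falls in at least one domain
in_any_domain <- rowSums(inside) > 0
in_any_domain
# [1] FALSE  TRUE  TRUE FALSE  TRUE
```

The intermediate matrix has length(db) * length(domainStart) cells, so for very long position vectors a loop over the domains may be easier on memory.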

Which files have some content in R

I have a list that contains the lines of several files; a sample is shown below.
list(c("\"ID\",\"SIGNALINTENSITY\",\"SNR\"", "\"NM_012429\",\"7.19739265676517\",\"0.738130599770152\"",
"\"NM_003980\",\"12.4036181424743\",\"13.753593768862\"", "\"AY044449\",\"8.74973537284918\",\"1.77200602833912\"",
"\"NM_005015\",\"11.3735054810744\",\"6.76079815107347\""), c("\"ID\",\"SIGNALINTENSITY\",\"SNR\"",
"\"NM_012429\",\"7.07699512126353\",\"0.987579612646805\"", "\"NM_003980\",\"11.3172936656653\",\"8.38227473088534\"",
"\"AY044449\",\"9.2865464417786\",\"2.61149606120517\"", "\"NM_005015\",\"10.1228142794354\",\"3.98707517627092\""
), c("ID,SIGNALINTENSITY,SNR", "1,NM_012429,6.44764696592035,0.84120306786724",
"2,NM_003980,9.52604513443066,3.02404186191898", "3,AY044449,9.11930818670925,2.24361163736047",
"4,NM_005015,10.5672879852575,5.29334273442728"))
I want to confirm the match when reading lines. I tried to find out which files have content starting with NM or GE with the following code
which(lapply(lines, function(x) any(grepl(paste(c("^NM_","^GE"),collapse = "|"), x, ignore.case = TRUE))) == T)
which is supposed to give the indices of all three files, but it returns integer(0). I am not sure what I am missing.

Try this:
lyst <- list(c("\"ID\",\"SIGNALINTENSITY\",\"SNR\"", "\"NM_012429\",\"7.19739265676517\",\"0.738130599770152\"",
"\"NM_003980\",\"12.4036181424743\",\"13.753593768862\"", "\"AY044449\",\"8.74973537284918\",\"1.77200602833912\"",
"\"NM_005015\",\"11.3735054810744\",\"6.76079815107347\""), c("\"ID\",\"SIGNALINTENSITY\",\"SNR\"",
"\"NM_012429\",\"7.07699512126353\",\"0.987579612646805\"", "\"NM_003980\",\"11.3172936656653\",\"8.38227473088534\"",
"\"AY044449\",\"9.2865464417786\",\"2.61149606120517\"", "\"NM_005015\",\"10.1228142794354\",\"3.98707517627092\""
), c("ID,SIGNALINTENSITY,SNR", "1,NM_012429,6.44764696592035,0.84120306786724",
"2,NM_003980,9.52604513443066,3.02404186191898", "3,AY044449,9.11930818670925,2.24361163736047",
"4,NM_005015,10.5672879852575,5.29334273442728"))
Assuming lyst is the list of strings from your question, you can do:
lapply(seq_along(lyst), function(x) grepl("^NM|^GE", gsub('"', "", lyst[[x]])))
Logic:
First remove the double quotes with gsub, then use '^' with grepl to test whether the string starts with NM or GE.
However, to also match lines that begin with a number and a comma (as in the third element),
one can use this regex:
lapply(1:3, function(x)grepl("^(NM|GE)|^\\d+,(NM|GE)",gsub('"',"", lyst[[x]])))
Output:
> lapply(1:3, function(x)grepl("^(NM|GE)|^\\d+,(NM|GE)",gsub('"',"", lyst[[x]])))
[[1]]
[1] FALSE TRUE TRUE FALSE TRUE
[[2]]
[1] FALSE TRUE TRUE FALSE TRUE
[[3]]
[1] FALSE TRUE TRUE FALSE TRUE
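The original attempt returned integer(0) because no line actually starts with NM: each one begins with a double quote or with a row number. To get the file indices the question asks for, the per-line results can be collapsed with any(). A minimal sketch on abbreviated stand-in data (the shortened values and the third, non-matching element are made up for illustration):

```r
# Abbreviated stand-in for the question's list; values are illustrative
lyst <- list(
  c("\"ID\",\"SNR\"", "\"NM_012429\",\"0.73\""),  # quoted fields
  c("ID,SNR", "1,NM_012429,0.84"),                # leading row number
  c("ID,SNR", "1,AY044449,2.24")                  # no NM_/GE line at all
)

# Either the ID starts the line, or it follows a leading row number and comma
pattern <- "^(NM_|GE)|^\\d+,(NM_|GE)"

has_match <- vapply(
  lyst,
  function(x) any(grepl(pattern, gsub('"', "", x), ignore.case = TRUE)),
  logical(1)
)
which(has_match)
# [1] 1 2
```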

Another option is to parse each element as CSV first (lines is the list from the question):
dat <- lapply(
  lines,
  function(x) read.csv(text = x)
)
# [[1]]
# ID SIGNALINTENSITY SNR
# 1 NM_012429 7.197393 0.7381306
# 2 NM_003980 12.403618 13.7535938
# 3 AY044449 8.749735 1.7720060
# 4 NM_005015 11.373505 6.7607982
#
# [[2]]
# ID SIGNALINTENSITY SNR
# 1 NM_012429 7.076995 0.9875796
# 2 NM_003980 11.317294 8.3822747
# 3 AY044449 9.286546 2.6114961
# 4 NM_005015 10.122814 3.9870752
#
# [[3]]
# ID SIGNALINTENSITY SNR
# 1 NM_012429 6.447647 0.8412031
# 2 NM_003980 9.526045 3.0240419
# 3 AY044449 9.119308 2.2436116
# 4 NM_005015 10.567288 5.2933427
To filter lines:
lapply(
dat,
function(df) df[grepl("^NM_|^GE", df$ID, ignore.case = TRUE), ]
)
# [[1]]
# ID SIGNALINTENSITY SNR
# 1 NM_012429 7.197393 0.7381306
# 2 NM_003980 12.403618 13.7535938
# 4 NM_005015 11.373505 6.7607982
#
# [[2]]
# ID SIGNALINTENSITY SNR
# 1 NM_012429 7.076995 0.9875796
# 2 NM_003980 11.317294 8.3822747
# 4 NM_005015 10.122814 3.9870752
#
# [[3]]
# ID SIGNALINTENSITY SNR
# 1 NM_012429 6.447647 0.8412031
# 2 NM_003980 9.526045 3.0240419
# 4 NM_005015 10.567288 5.2933427
Or if only a logical index is needed:
lapply(
dat,
function(df) grepl("^NM_|^GE", df$ID, ignore.case = TRUE)
)
# [[1]]
# [1] TRUE TRUE FALSE TRUE
#
# [[2]]
# [1] TRUE TRUE FALSE TRUE
#
# [[3]]
# [1] TRUE TRUE FALSE TRUE
Or with grep instead of grepl:
lapply(
dat,
function(df) grep("^NM_|^GE", df$ID, ignore.case = TRUE)
)
# [[1]]
# [1] 1 2 4
#
# [[2]]
# [1] 1 2 4
#
# [[3]]
# [1] 1 2 4

Combining sequences with similar gene IDs

I have a list of gene IDs along with their sequences in R.
$2435
[1]"ATGCGGGCGGGGGTCGTCGA"
$2435
[1]"ATGCGGCGCGCGCGCTATATACGC"
$2435
[1]"ATGCGGCGCCTCTCATCGCGGGGG"
I want to combine the sequences with the same gene IDs in that list in R.
$2435
[1]"ATGCGGGCGGGGGTCGTCGAATGCGGCGCGCGCGCTATATACGCATGCGGCGCCTCTCATCGCGGGGG"

Use lapply after matching the names with unique. Here's some sample data:
A <- list("12" = "AAAABBBBCCCCDDDD",
"34" = "GGGG",
"12" = "XXXXXXXXXXXXXXXXXXXXXXX",
"10" = "FFFFGGGG",
"10" = "HHHHIIII")
A
# $`12`
# [1] "AAAABBBBCCCCDDDD"
#
# $`34`
# [1] "GGGG"
#
# $`12`
# [1] "XXXXXXXXXXXXXXXXXXXXXXX"
#
# $`10`
# [1] "FFFFGGGG"
#
# $`10`
# [1] "HHHHIIII"
Subset the related names and paste them together.
lapply(unique(names(A)), function(x) paste(A[names(A) %in% x], collapse = ""))
# [[1]]
# [1] "AAAABBBBCCCCDDDDXXXXXXXXXXXXXXXXXXXXXXX"
#
# [[2]]
# [1] "GGGG"
#
# [[3]]
# [1] "FFFFGGGGHHHHIIII"
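The lapply call above returns an unnamed list, so the gene IDs are lost. One possible way to keep them, sketched here with setNames (the variable names ids and combined are illustrative), is:

```r
A <- list("12" = "AAAABBBBCCCCDDDD",
          "34" = "GGGG",
          "12" = "XXXXXXXXXXXXXXXXXXXXXXX",
          "10" = "FFFFGGGG",
          "10" = "HHHHIIII")

# Unique IDs in order of first appearance
ids <- unique(names(A))

# Paste together all sequences sharing an ID, then label each result with its ID
combined <- setNames(
  lapply(ids, function(id) paste(unlist(A[names(A) == id]), collapse = "")),
  ids
)
names(combined)
# [1] "12" "34" "10"
combined[["10"]]
# [1] "FFFFGGGGHHHHIIII"
```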

l <- list("A" = "ABC", "B" = "XYX", "A" = "DEF", "C" = "YZY", "A" = "GHI")
tapply(l, names(l), paste, collapse = "", simplify = FALSE)
# $A
# [1] "ABCDEFGHI"
#
# $B
# [1] "XYX"
#
# $C
# [1] "YZY"

Bonus:
For a dataframe output, use this:
aggregate(unlist(A), by=list(id=names(A)), paste, collapse="")
Where A is your list.
Using @Ananda's A, I get this:
id x
1 10 FFFFGGGGHHHHIIII
2 12 AAAABBBBCCCCDDDDXXXXXXXXXXXXXXXXXXXXXXX
3 34 GGGG
