Find a vector of strings in R

I have a vector of strings like:
vector=c("a","hb","cd")
I also have a matrix with a column in which each element is a list of strings separated by "|", like:
1 "ab|hb"
2 "ab|hbc|cd"
For each string in vector, I want to find the row of the matrix in which it appears as a complete element (not as part of a longer string).
For the above vector, the result is:
NA, 1, 2

You can use strsplit for splitting strings:
x <- strsplit("ab|hbc|cd", split="|", fixed=TRUE)
and then check if values of vector appear in the data, e.g.
sapply(vector, function(x) x %in% strsplit("a|ab|cd|efg|bh",
split="|", fixed=TRUE)[[1]])
Warning: strsplit returns a list with one element per input string, so in the example above I extract the first (and only) element with [[1]]. You can handle the list in another way if you prefer.
EDIT: in answer to your question about data stored as a vector:
data <- c("ab|cd|ef", "aaa|b", "ab", "wf", "fg|hb|a", "cd|cd|df")
sapply(sapply(data, function(x) strsplit(x, split="|", fixed=TRUE)[[1]]),
function(y) sapply(vector, function(z) z %in% y))
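To get the row numbers the question asks for (NA, 1, 2), the same strsplit idea can be turned around: split every row once, then look up each string. This is only a minimal sketch. It assumes the matrix column is available as a character vector and that only the first matching row is wanted; the names rows, parts and first_row are made up for illustration.

```r
# Hypothetical data mirroring the question
vector <- c("a", "hb", "cd")
rows <- c("ab|hb", "ab|hbc|cd")

# Split each row into its "|"-separated elements (a list of character vectors)
parts <- strsplit(rows, split = "|", fixed = TRUE)

# For each string, find the first row whose elements contain it exactly
first_row <- sapply(vector, function(s) {
  hit <- which(sapply(parts, function(p) s %in% p))
  if (length(hit)) hit[1] else NA_integer_
}, USE.NAMES = FALSE)

first_row
```

Because %in% compares whole elements, "a" does not match "ab" and "hb" does not match "hbc".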

Here's an approach using regular expressions:
# Example data
vector <- c("a","hb","cd")
mat <- matrix(c("ab|hb", "ab|hbc|cd"), nrow = 2)
sapply(paste0("\\b", vector, "\\b"), function(x)
if(length(tmp <- grep(x, mat[ , 1]))) tmp else NA,
USE.NAMES = FALSE)
# [1] NA 1 2

Related

How can I use lapply, sapply or apply to filter a data frame in R?

I am trying to remove all fields that do not contain 10-digit numbers, as well as those that consist of 10 zeros. I want to achieve this with lapply, sapply or apply. My code below does not work:
lapply(df, function(x) filter(x %like% "^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]" | !x %in% "0000000000"))
10 zeroes are part of 10-digit numbers so you don't need to test for them separately.
df <- data.frame(a = c('123456789', '123456789', '123'),
b = c('0000000000', '2345', '1234'))
result <- lapply(df, function(x) grep('\\d{10}', x, value = TRUE, invert = TRUE))
#$a
#[1] "123456789" "123456789" "123"
#$b
#[1] "2345" "1234"
You can also use nchar to count number of characters.
result <- lapply(df, function(x) x[nchar(x) != 10])
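Note that an unanchored '\\d{10}' also matches longer strings that merely contain ten consecutive digits. If the goal is read the other way round (keep only values that are exactly 10 digits and not all zeros, which is one possible reading of the question), an anchored pattern could be used. The following is a sketch under that assumption; df2, keep_valid and result2 are hypothetical names.

```r
# Hypothetical data; columns are character vectors
df2 <- data.frame(a = c("1234567890", "0000000000", "123"),
                  b = c("12345678901", "2345", "9876543210"),
                  stringsAsFactors = FALSE)

# Keep values that are exactly 10 digits (anchored with ^ and $)
# and that are not the all-zero string
keep_valid <- function(x) x[grepl("^[0-9]{10}$", x) & x != "0000000000"]

result2 <- lapply(df2, keep_valid)
result2
```

The 11-digit value in column b is dropped because of the anchors, which the unanchored pattern would not do.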

What is the best way to find an element in a regex expression in R?

I have a list of strings in R like so:
"A(123:456)"
"B(23456:345)"
"C(3451:45600)"
I want to parse out the first number and the second number in the parenthesis for all these items:
first second
123 456
23456 345
3451 45600
What is the best way to do this in a vectorized manner? I've thought of using substrings and index-of, but then heard of regexes, and I am wondering what the most "R" way to do this is.
You could use regexpr to match the pattern,
and regmatches to extract the matched patterns.
You could define the pattern to match (to be extracted) as \\d+, which means 1 or more digits.
With regexpr, this will match the first run of digits in each string.
And extract the matches with regmatches, like this:
v <- c("A(123:456)", "B(234:345)", "C(345:456)")
regmatches(v, regexpr('\\d+', v))
The above will give a vector of values:
[1] "123" "234" "345"
To get a data.frame with two columns of the numeric values,
you can use gregexpr instead of regexpr, so that all matches are found.
regmatches then returns a list of character vectors (one per input string),
from which you can extract the values into vectors:
m <- regmatches(v, gregexpr('\\d+', v))
first <- sapply(m, function(x) x[[1]])
second <- sapply(m, function(x) x[[2]])
Or as @RuiBarradas pointed out in a comment, you can simplify the sapply calls like this:
first <- sapply(m, '[[', 1)
second <- sapply(m, '[[', 2)
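To finish the step described above, the two vectors can be converted to integers and combined into a data.frame. A minimal sketch using the strings from the question, and assuming every string contains at least two runs of digits; the name res is chosen here for illustration only.

```r
v <- c("A(123:456)", "B(23456:345)", "C(3451:45600)")

# All runs of digits per string: a list of character vectors
m <- regmatches(v, gregexpr("[0-9]+", v))

# vapply also checks that each extracted piece is a single string
first <- vapply(m, `[[`, character(1), 1)
second <- vapply(m, `[[`, character(1), 2)

# Convert to integers and assemble the two columns
res <- data.frame(first = as.integer(first), second = as.integer(second))
res
```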
Here's one way with regex:
# Your data
df <- data.frame(obs=c("A(123:456)","B(234:345)","C(345:456)"))
# extraction:
df$first <- gsub(df$obs,pattern="^.*\\((.*)\\:.*$",replacement="\\1")
Here are two ways.
The first is the simplest and if your strings always have exactly two characters followed by the three digit number of interest, it will work.
The second uses regular expressions.
substr(x, 3, 5)
[1] "123" "234" "345"
sub("^.*\\(([[:digit:]]*).*", "\\1", x)
[1] "123" "234" "345"
Then, if you want numeric results, use as.integer or as.numeric.
DATA.
x <- scan(what = character(), text = '
"A(123:456)"
"B(234:345)"
"C(345:456)"')
EDIT.
After the question's edit by the OP, the solutions above are no longer valid. The following one is. Note that the regex has changed and that I now also use strsplit.
res <- do.call(rbind, strsplit(sub("^.*\\((.*)\\).*$", "\\1", x), ":"))
res <- as.data.frame(res, stringsAsFactors = FALSE)
names(res) <- c("first", "second")
res
# first second
#1 123 456
#2 234 345
#3 345 456
The columns of this dataframe are both of class character. In order to have numbers, coerce them with
res[] <- lapply(res, as.integer)
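As a possible alternative to the sub/strsplit/lapply sequence, utils::strcapture (available in R 3.4.0 and later) extracts capture groups directly into typed columns. This is a sketch rather than part of the answers above; the pattern and the name res2 are assumptions made for illustration.

```r
x <- c("A(123:456)", "B(23456:345)", "C(3451:45600)")

# Each parenthesised group in the pattern becomes one column of the result;
# proto supplies the column names and types
res2 <- utils::strcapture("\\(([0-9]+):([0-9]+)\\)", x,
                          proto = data.frame(first = integer(),
                                             second = integer()))
res2
```

The columns come back as integers, so no separate coercion step is needed.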

Paste some elements of mixed vector

I have a vector with terms that may be followed by zero or more qualifiers starting with "/". The first element should always be a term.
mesh <- c("Animals", "/physiology" , "/metabolism*",
"Insects", "Arabidopsis", "/immunology" )
I'd like to join the qualifier with the last term and get a new vector
Animals/physiology
Animals/metabolism*
Insects
Arabidopsis/immunology
Make a group identifier by grepling for values not starting with a /, split on this group identifier, then paste0:
unlist(by(mesh, cumsum(grepl("^[^/]",mesh)), FUN=function(x) paste0(x[1], x[-1])))
# 11 12 2 3
# "Animals/physiology" "Animals/metabolism*" "Insects" "Arabidopsis/immunology"
Another option is tapply
unlist(tapply(mesh, cumsum(grepl("^[^/]", mesh)),
FUN = function(x) paste0(x[1], x[-1])), use.names=FALSE)
#[1] "Animals/physiology" "Animals/metabolism*" "Insects" "Arabidopsis/immunology"
Can't think of anything more elegant than this:
mesh <- c("Animals", "/physiology" , "/metabolism*",
"Insects", "Arabidopsis", "/immunology" )
#gets "prefixes", assuming they all start with a letter:
pre <- grep(pattern = "^[[:alpha:]]", x = mesh)
#gives integer IDs for the prefix-suffix groupings
id <- rep(1:length(pre), times = diff(c(pre,length(mesh) + 1)))
#function that pastes the first term in vector to any remaining ones
#will just return first term if there are no others
combine <- function(x) paste0(x[1], x[-1])
#groups mesh by id, then applies combine to each group
results <- tapply(mesh, INDEX = id, FUN = combine)
unlist(results)
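The grouping idea used in these answers can also be written without by or tapply, using only vectorized indexing. This is a sketch, assuming (as the question states) that the first element is always a term and that qualifiers always start with "/"; the names is_term, grp, terms, has_qual and out are made up here.

```r
mesh <- c("Animals", "/physiology", "/metabolism*",
          "Insects", "Arabidopsis", "/immunology")

is_term <- !startsWith(mesh, "/")   # TRUE for terms, FALSE for qualifiers
grp <- cumsum(is_term)              # group number of each element
terms <- mesh[is_term]              # one term per group

# Groups with more than one element have at least one qualifier
has_qual <- tabulate(grp) > 1

# Prefix each qualifier with its term; keep a bare term only
# when its group has no qualifiers
out <- ifelse(is_term, mesh, paste0(terms[grp], mesh))
out <- out[!is_term | !has_qual[grp]]
out
```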

Change column in data frames in list

I have a list of 78 data frames (list_of_df) that all have the same first column containing annotated Ensembl transcript IDs. However, the IDs carry a version suffix such as ".1" (e.g. "ENST00000448914.1"), and I would like to remove it in order to match them against plain ENST IDs.
I have tried to use lapply with a sapply inside like this:
lapply(list_of_df, function(x)
cbind(x,sapply(x$target_id, function(y) unlist(strsplit(y,split=".",fixed=T))[1])) )
but it takes forever. Does anyone have a better idea of how to do it?
We loop through the list of data.frames, and use sub to remove the . followed by numbers in the first column.
lapply(list_of_df, function(x) {
x[,1] <-sub('\\.\\d+', '', x[,1])
x })
#[[1]]
# target_id value
#1 ENST000049 39
#2 ENST010393 42
#[[2]]
# target_id value
#1 ENST123434 423
#2 ENST00838 23
NOTE: Even if the OP's first column is factor, this should work.
data
list_of_df <- list(data.frame(target_id= c("ENST000049.1",
"ENST010393.14"), value= c(39, 42), stringsAsFactors=FALSE),
data.frame(target_id=c("ENST123434.42", "ENST00838.22"),
value= c(423, 23), stringsAsFactors=FALSE))
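A slightly more defensive variant of the sub approach is sketched below. It anchors the pattern with $ so that only a trailing version suffix is removed, and converts the column with as.character so that a factor column ends up as plain character. The names dfs, strip_version and cleaned are hypothetical.

```r
# Hypothetical list; the second data frame stores the IDs as a factor
dfs <- list(
  data.frame(target_id = c("ENST000049.1", "ENST010393.14"),
             value = c(39, 42), stringsAsFactors = FALSE),
  data.frame(target_id = factor(c("ENST123434.42", "ENST00838.22")),
             value = c(423, 23))
)

# Remove a trailing ".<digits>" from the first column and return the data frame
strip_version <- function(df) {
  df[[1]] <- sub("\\.[0-9]+$", "", as.character(df[[1]]))
  df
}

cleaned <- lapply(dfs, strip_version)
cleaned
```

Since sub is vectorized over the whole column, there is no need for an inner sapply over individual IDs.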
You could simplify your code to the following (taking the first piece of every split, and returning the modified data frame):
lapply(list_of_df, function(x) { x[,1] <- sapply(strsplit(x[,1], split=".", fixed=TRUE), `[`, 1); x })
If your columns have factor as class, you can wrap x[,1] in as.character:
lapply(list_of_df, function(x) { x[,1] <- sapply(strsplit(as.character(x[,1]), split=".", fixed=TRUE), `[`, 1); x })
You could also make use of the stringi package:
library(stringi)
lapply(list_of_df, function(x) { x[,1] <- unlist(stri_split_fixed(x[,1], ".", n=1, tokens_only=TRUE)); x })

lapply: extract specific element

I have a list of subsets obtained through:
lapply(1:5, function(x) combn(5,x))
I would like to extract a specific vector from this list. For example, the 16th element of this list, which is (1,2,3). Any hints? Thanks.
The command produces all the non-empty subsets of (1,2,3,4,5), which is 2^5 - 1 = 31 subsets, the 16th being (1,2,3). I want to know how to extract this by using its position (16th).
We could split each matrix into a list of vectors, one per column (split), concatenate the output with c to flatten the list, and subset using the numeric index.
lst2 <- do.call(`c`,lapply(lst, function(x) split(x, col(x))))
lst2[[16]]
#[1] 1 2 3
Or instead of splitting the matrix output, we could use the FUN argument within combn to create a list and then concatenate with c using do.call
lst <- do.call(`c`,lapply(1:5, function(x) combn(5, x, FUN=list)))
lst[[16]]
#[1] 1 2 3
Or instead of do.call(c,..), we can use (contributed by @Marat Talipov)
lst <- unlist(lapply(1:5, function(x)
combn(5, x, FUN=list)), recursive=FALSE)
data
lst <- lapply(1:5, function(x) combn(5,x))
I would rather consider producing the right data instead of looping again on them :)
lst = Reduce('c', lapply(1:5, function(x) as.list(data.frame(combn(5,x)))))
> lst[[16]]
[1] 1 2 3
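A compact way to check the flattening step is sketched below, using combn's simplify = FALSE argument so that each call already returns a list of vectors. The names subsets and idx are made up for illustration.

```r
# One flat list holding every non-empty subset of 1:5, ordered by size
subsets <- unlist(lapply(1:5, function(k) combn(5, k, simplify = FALSE)),
                  recursive = FALSE)

length(subsets)   # 5 + 10 + 10 + 5 + 1 subsets
subsets[[16]]     # first subset of size 3

# Reverse lookup: position of a given subset
idx <- which(sapply(subsets, function(s) length(s) == 3 && all(s == 1:3)))
idx
```

The 16th element is the first subset of size 3 because the 5 singletons and 10 pairs come before it.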
