How to grep exact matches using a vector in R?

I have a large data frame where I want to extract rows based on a column value. My problem is that grep will take all instances of the value (e.g. will take "11" if I wanted to grep "1"). How do I get exact matches? Example below simply illustrates my issue. I only want to grep the "metm1" row but it is grepping all rows even though they are not exact matches.
## make data
df1 <- data.frame(matrix(, nrow=4, ncol=2))
colnames(df1) <- c("met", "dt1")
df1$met <- c("metm11", "metm1", "metm1", "metm12")
df1$dt1 <- c("0.666", "0.777", "0.99", "0.01")
# make list for grep
mets <- "metm1"
# grep
new_df <- as.data.frame(df1[grep(paste(mets, collapse = "|"), df1$met), ])
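Running the grep above illustrates the problem (output added here for reference; it is not part of the original question):
grep(paste(mets, collapse = "|"), df1$met)
# [1] 1 2 3 4
# "metm11" and "metm12" are also matched because they contain the substring "metm1"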

You may place ^ and $ anchors around the search term to force an exact match:
regex <- paste0("^(?:", paste(mets, collapse = "|"), ")$")
new_df <- as.data.frame(df1[grep(regex, df1$met, perl = TRUE), ])
Note that perl = TRUE is needed because (?: ) is Perl-style regex syntax; fixed = TRUE would treat the pattern as a literal string and defeat the anchors. For reference, the regex pattern being used here is:
^(?:metm1)$
^(?:metm1|metm2|metm3)$ <-- for multiple terms
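Applied to the example data, a quick check (not part of the original answer) shows that only exact matches are returned:
regex <- paste0("^(?:", paste(mets, collapse = "|"), ")$")
grep(regex, df1$met, perl = TRUE, value = TRUE)
# [1] "metm1" "metm1"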

You can simply use == to make an exact match.
df1[df1$met == mets,]
# met dt1
#2 metm1 0.777
#3 metm1 0.99
In case mets is more than one element long, use %in%, as already pointed out in the comments by @MrFlick.
df1[df1$met %in% mets,]
# met dt1
#2 metm1 0.777
#3 metm1 0.99
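For example, with a hypothetical two-element search vector (made up here for illustration):
mets2 <- c("metm1", "metm12")
df1[df1$met %in% mets2, ]
#      met   dt1
#2   metm1 0.777
#3   metm1  0.99
#4  metm12  0.01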

Another solution is to use the word-boundary anchor \\b:
df1[grep(paste0("\\b(", paste0(mets, collapse = "|"),")\\b"), df1$met), ]
met dt1
2 metm1 0.777
3 metm1 0.99
Using dplyr you'd filter with grepl, which returns TRUE and FALSE whereas grep returns indices of matches:
library(dplyr)
df1 %>%
filter(grepl(paste0("\\b(", paste0(mets, collapse = "|"),")\\b"), met))
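If the values are exact strings rather than patterns, the same filter can also be written without any regex (a simple equivalent, not from the original answer):
df1 %>%
  filter(met %in% mets)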

Extract String Part to Column in R

Consider the following dataframe:
status
1 file-status-done-bad
2 file-status-maybe-good
3 file-status-underreview-good
4 file-status-complete-final-bad
We want to extract the second-to-last part of status, where parts are delimited by -. Like this:
status status_extract
1 file-status-done-bad done
2 file-status-maybe-good maybe
3 file-status-underreview-good underreview
4 file-status-complete-final-bad final
In SQL this is easy: select split_part(status, '-', -2).
However, the solutions I've seen in R either operate on vectors or make it messy to extract particular elements (they return ALL elements). How is this done in a mutate chain? Below is a failed attempt.
df %>%
mutate(status_extract = str_split_fixed(status, pattern = '-')[[-2]])
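For reproducibility, the example data can be rebuilt like this (a construction assumed from the table above; it is not given in the original question):
df <- data.frame(
  status = c("file-status-done-bad", "file-status-maybe-good",
             "file-status-underreview-good", "file-status-complete-final-bad"),
  stringsAsFactors = FALSE
)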
Found a really simple answer.
library(tidyverse)
df %>%
mutate(status_extract = word(status, -2, sep = "-"))
In base R you can combine the functions sapply and strsplit
df$status_extract <- sapply(strsplit(df$status, "-"), function(x) x[length(x) - 1])
# status status_extract
# 1 file-status-done-bad done
# 2 file-status-maybe-good maybe
# 3 file-status-underreview-good underreview
# 4 file-status-complete-final-bad final
You can use map_chr() and nth() to extract the nth value from each split vector.
library(tidyverse)
df %>%
mutate(status_extract = map_chr(str_split(status, "-"), nth, -2))
# status status_extract
# 1 file-status-done-bad done
# 2 file-status-maybe-good maybe
# 3 file-status-underreview-good underreview
# 4 file-status-complete-final-bad final
which is equivalent to a base version like
sapply(strsplit(df$status, "-"), function(x) rev(x)[2])
# [1] "done" "maybe" "underreview" "final"
You can use regex to get what you want without splitting the string.
sub('.*-(\\w+)-.*$', '\\1', df$status)
#[1] "done" "maybe" "underreview" "final"

How to use grepl function multiple times, in R

I have a vector like go_id and a data.frame like data.
go_id <- c("[GO:0000086]", "[GO:0000209]", "[GO:0000278]")
protein_id <- c("Q96IF1","P26371","Q8NHG8","P60372","O75526","Q01130")
bio_process <- c("[GO:0000086]; [GO:0000122]; [GO:0000932]", "[GO:0005829]; [GO:0008544]","[GO:0000209]; [GO:0005737]; [GO:0005765]","NA","[GO:0000398]; [GO:0003729]","[GO:0000278]; [GO:0000381]; [GO:0000398]; [GO:0003714]")
data <- as.data.frame(cbind(protein_id,bio_process))
How can I keep the rows of data whose bio_process cell contains at least one of the go_id elements? Note that a GO code cannot be repeated within the same bio_process cell.
To be more precise, I would like to keep only the first, third, and sixth rows of the data.frame.
I have tried a for loop using 'grepl' function, like this:
go_id <- gsub("GO:","", go_id, fixed = TRUE)
for (i in 1:6) {
new_data <- data[grepl("\\[GO:go_id[i]\\]",data$Gene.ontology..biological.process.)]
}
I know this cannot work, because I cannot paste a variable's value into the regular expression that way.
Any ideas on this?
Thank you
We can use Reduce with grepl
data$ind <- Reduce(`|`, lapply(go_id, function(pat)
  grepl(pat, data$bio_process, fixed = TRUE)))
data
# protein_id bio_process ind
#1 Q96IF1 [GO:0000086]; [GO:0000122]; [GO:0000932] TRUE
#2 P26371 [GO:0005829]; [GO:0008544] FALSE
#3 Q8NHG8 [GO:0000209]; [GO:0005737]; [GO:0005765] TRUE
#4 P60372 NA FALSE
#5 O75526 [GO:0000398]; [GO:0003729] FALSE
#6 Q01130 [GO:0000278]; [GO:0000381]; [GO:0000398]; [GO:0003714] TRUE
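To keep only the matching rows afterwards, the logical column can be used directly (a follow-up step, not shown in the original answer):
new_data <- data[data$ind, ]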
You should use fixed = TRUE in grepl():
vect <- rep(FALSE, nrow(data))
for(id in go_id){
vect <- vect | grepl(id, data$bio_process, fixed = T)
}
data[vect,]
You can subset using str_extract to define the pattern on those substrings that are distinctive:
library(stringr)
data[grepl(paste(str_extract(go_id, "\\d{4}]"), collapse="|"), data$bio_process),]
protein_id bio_process
1 Q96IF1 [GO:0000086]; [GO:0000122]; [GO:0000932]
3 Q8NHG8 [GO:0000209]; [GO:0005737]; [GO:0005765]
6 Q01130 [GO:0000278]; [GO:0000381]; [GO:0000398]; [GO:0003714]
EDIT:
The most straightforward solution is subsetting with grepl, using paste0 to add the escaping backslashes for the metacharacter [:
data[grepl(paste0("\\", go_id, collapse="|"), data$bio_process),]
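For reference, the pattern built by paste0 looks like this (printed here for illustration, using the go_id vector as originally defined in the question):
paste0("\\", go_id, collapse = "|")
# [1] "\\[GO:0000086]|\\[GO:0000209]|\\[GO:0000278]"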

Keep doubled columns which differ in only 2 letters in a data.frame

I have a data frame in R which consists of around 100 columns. Most of the columns are doubled but differ in 2 letters. I want to keep these columns and delete those columns which are not doubled.
Here is an example:
234-rgz SK 234-rgz PV 556-gft SK 456-hjk SK 456-hjk PV
The Output should be:
234-rgz SK 234-rgz PV 456-hjk SK 456-hjk PV
All columns follow the same naming convention: a number from 2 to 150, then a "-", then 4 or 5 letters, then a space, and then "SK" or "PV". I thought of using a regular expression, but that still doesn't solve the problem of how to get rid of the single (non-doubled) columns. Thanks for your help!
You can use duplicated on the column names after removing the suffix part. The output is a logical index that can be used to subset the original dataset.
v1 <- colnames(df1)
v2 <- sub('\\s+[^ ]+$', '', v1)
indx <- duplicated(v2)|duplicated(v2, fromLast=TRUE)
v1[indx]
#[1] "234-rgz SK" "234-rgz PV" "456-hjk SK" "456-hjk PV"
To subset the columns in the dataframe,
df1[indx]
Another option is splitting the column-name strings into substrings and using grep to match the substrings that have a frequency > 1:
tbl <- table(unlist(strsplit(v1, '\\s+.*')))
df1[grep(paste(names(tbl)[tbl>1], collapse="|"), v1)]
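For reference, the intermediate table of base names looks like this (shown for illustration):
tbl
#
# 234-rgz 456-hjk 556-gft
#       2       2       1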
data
set.seed(24)
df1 <- as.data.frame(matrix(sample(0:9, 5*10, replace=TRUE), ncol=5,
dimnames=list(NULL, c('234-rgz SK', '234-rgz PV' , '556-gft SK',
'456-hjk SK' , '456-hjk PV') )) )

Splitting a dataframe by column name indices

This is a variation of an earlier question.
df <- data.frame(matrix(rnorm(9*9), ncol=9))
names(df) <- c("c_1", "d_1", "e_1", "a_p", "b_p", "c_p", "1_o1", "2_o1", "3_o1")
I want to split the data frame by the index that is given in the column names after the underscore "_". (The indices can be any characters/numbers of different lengths; these are just random examples.)
indx <- gsub(".*_", "", names(df))
and name the resulting data frames accordingly. In the end I would like to get three data frames, called:
df_1
df_p
df_o1
Thank you!
Here, you can split the column names by indx, subset the data within a list using lapply and [, set the names of the list elements using setNames, and use list2env if you need them as individual datasets (not really recommended, since most operations can be done within the list, and the list can later be written out with write.table via lapply).
list2env(
setNames(
lapply(split(colnames(df), indx), function(x) df[x]),
paste('df', sort(unique(indx)), sep="_")),
envir=.GlobalEnv)
head(df_1,2)
# c_1 d_1 e_1
#1 1.0085829 -0.7219199 0.3502958
#2 -0.9069805 -0.7043354 -1.1974415
head(df_o1,2)
# 1_o1 2_o1 3_o1
#1 0.7924930 0.434396 1.7388130
#2 0.9202404 -2.079311 -0.6567794
head(df_p,2)
# a_p b_p c_p
#1 -0.12392272 -1.183582 0.8176486
#2 0.06330595 -0.659597 -0.6350215
Or use Map. This is similar to the above approach, i.e. split the column names by indx and use [ to extract the columns; the rest is as above.
list2env(setNames(Map(`[` ,
list(df), split(colnames(df), indx)),
paste('df',unique(sort(indx)), sep="_")), envir=.GlobalEnv)
Update
You can do:
indx1 <- factor(indx, levels=unique(indx))
split(colnames(df), indx1)
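The point of the factor is ordering: split() arranges its output by factor levels, so fixing the levels to unique(indx) keeps the groups in the order the suffixes first appear (a quick illustration, not part of the original answer):
names(split(colnames(df), indx))   # "1"  "o1" "p"  (alphabetical)
names(split(colnames(df), indx1))  # "1"  "p"  "o1" (original order)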
You can try this:
invisible(sapply(unique(indx),
  function(x)
    assign(paste("df", x, sep="_"),
           df[, grepl(paste0("_", x, "$"), colnames(df))],
           envir=.GlobalEnv)))
# For each unique element of indx, this assigns (in the global environment) the
# columns matching that suffix to a new data.frame named after it.
# invisible() prevents the resulting data.frames from being printed to the screen.
> ls()
[1] "df" "df_1" "df_o1" "df_p" "indx"
> df_1[1:3,]
c_1 d_1 e_1
1 1.8033188 0.5578494 2.2458750
2 1.0095556 -0.4042410 -0.9274981
3 0.7122638 1.4677821 0.7770603
> df_o1[1:3,]
1_o1 2_o1 3_o1
1 -2.05854176 -0.92394923 -0.4932116
2 -0.05743123 -0.24143979 1.9060076
3 0.68055653 -0.70908036 1.4514368
> df_p[1:3,]
a_p b_p c_p
1 -0.2106823 -0.1170719 2.3205184
2 -0.1826542 -0.5138504 1.9341230
3 -1.0551739 -0.2990706 0.5054421

Subsetting data frame by vector of elements

I spent about 20 minutes looking through previous questions, but could not find what I am looking for. I have a large data frame I want to subset down based on a list of names, but the names in the data frame can also have a postfix not indicated in the list.
In other words, is there a simpler, generic way (for an arbitrary number of postfixes) to do the following:
data <- data.frame("name"=c("name1","name1_post1","name2","name2_post1",
"name2_post2","name3","name4"),
"data"=rnorm(7,0,1),
stringsAsFactors=FALSE)
names <- c("name2","name3")
subset <- data[ data$name %in% names | data$name %in% paste0(names,"_post1") | data$name %in% paste0(names,"_post2") , ]
In response to @Arun's answer: the names in my data actually include more than one underscore, which makes the problem more complicated.
data <- data.frame("name"=c("name1_target_time","name1_target_time_post1","name2_target_time","name2_target_time_post1",
"name2_target_time_post2","name3_target_time","name4_target_time"),
"data"=rnorm(7,0,1),
stringsAsFactors=FALSE)
names <- c("name2_target_time","name3_target_time")
subset <- data[ data$name %in% names | data$name %in% paste0(names,"_post1") | data$name %in% paste0(names,"_post2") , ]
Edit: solution using regular expressions (following OP's follow-up in comment):
data[grepl(paste(names, collapse="|"), data$name), ]
# name data
# 3 name2 1.4934931
# 4 name2_post1 -1.6070809
# 5 name2_post2 -0.4157518
# 6 name3 0.4220084
On your new data:
# name data
# 3 name2_target_time 0.6295361
# 4 name2_target_time_post1 0.8951720
# 5 name2_target_time_post2 0.6602126
# 6 name3_target_time 2.2734835
Also, as @flodel shows in the comments, this works fine too!
subset(data, sub("_post\\d+$", "", name) %in% names)
Old solution:
data[sapply(strsplit(data$name, "_"), "[[", 1) %in% names, ]
# name data
# 3 name2 1.4934931
# 4 name2_post1 -1.6070809
# 5 name2_post2 -0.4157518
# 6 name3 0.4220084
The idea: first split the string at _ using strsplit. This results in a list. For example, name2 results in just name2 (a one-element list entry), while name2_post1 results in name2 and post1 (a two-element entry). Wrapping it with sapply and using [[ with 1 selects just the first element of each entry. Then we can use that with %in% to check whether it is present in names (which is straightforward).
A grep solution would probably look something like the following (matching against the name column):
subset <- data[grep("(name2)|(name3)", data$name), ]
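If partial matches are a concern (e.g. name2 also matching a hypothetical name22), an anchored pattern built from the vector may be safer (a sketch, not from the original answers):
pattern <- paste0("^(", paste(names, collapse = "|"), ")(_post\\d+)?$")
subset <- data[grepl(pattern, data$name), ]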
