Regex expression exceptions in subsetting data with grepl - r

I'm trying to subset data in R by certain characters in a field and cannot find the correct regex logic to get what I need. I need to subset records for which the ID contains either:
Just "AB"
"AB" and "ABC"
But NOT fields with ONLY "ABC"
These patterns fall within any part of the field (beginning, middle, end) in this data set and have no certain separators.
Example dataset TEST:
Record ID value
1 blueAB_ABC 7
2 green_ABCblue 9
3 ABC_green 45
4 green_AB 23
5 CD_red 45
So for this example I would want to subset records 1 and 4.
I've gotten as far as returning those with just AB and excluding ABC, but cannot seem to find the proper regex to get all with "AB" and potentially "ABC".
AB_set <- subset(TEST, grepl("*AB", ID) & !grepl("*ABC", ID) )
Record ID value
4 green_AB 23
What I'm hoping to get:
Record ID value
1 blueAB_ABC 7
4 green_AB 23
EDIT: Just to clarify, I updated the dataset to show that the pattern in question may fall next to other characters than an underscore, or may not necessarily occur at the beginning/end (as previously noted, "no certain separators").

You can get this by specifying that "AB" should be surrounded by either underscore or a word boundary.
df[grepl("(\\b|_)AB(\\b|_)", df$ID),]
Record ID value
1 1 blue_AB_ABC 7
4 4 green_AB 23

"ABC" is not needed because "AB" is always required to be matched. The following matches AB only if it is surrounded by underscore or it starts or ends an ID:
AB_set <- subset(TEST, grepl("(^|_)AB(_|$)", TEST$ID))
Result:
Record ID value
1 1 blue_AB_ABC 7
4 4 green_AB 23
Data:
TEST = structure(list(Record = 1:5, ID = structure(c(2L, 5L, 1L, 4L,
3L), .Label = c("ABC_green", "blue_AB_ABC", "CD_red", "green_AB",
"green_ABC_blue"), class = "factor"), value = c(7L, 9L, 45L,
23L, 45L)), .Names = c("Record", "ID", "value"), class = "data.frame", row.names = c(NA,
-5L))
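For the edited data, where "AB" can sit directly next to other characters, the two patterns above no longer catch record 1. A minimal sketch of one alternative is a negative lookahead: match "AB" only when it is not immediately followed by "C". This assumes that any "AB" not followed by "C" counts as a hit (so "ABD" would match too), and the data frame below is rebuilt by hand from the edited example in the question.

```r
# Edited example data from the question, rebuilt by hand
TEST <- data.frame(
  Record = 1:5,
  ID = c("blueAB_ABC", "green_ABCblue", "ABC_green", "green_AB", "CD_red"),
  value = c(7L, 9L, 45L, 23L, 45L),
  stringsAsFactors = FALSE
)

# "AB(?!C)" matches an "AB" that is not immediately followed by "C";
# lookaheads need perl = TRUE in base R
AB_set <- TEST[grepl("AB(?!C)", TEST$ID, perl = TRUE), ]
AB_set
```

Records 1 and 4 are kept: record 1 has a standalone "AB" before the underscore, and record 4 ends in "AB".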

Related

Identifying, extracting and counting patterns in sequences

Hello lovely and nice people of SO. I'm working with a data frame that contains only two columns: one holds a unique ID generated by a virtual machine, and the other holds a name, but this column may also contain the string "ERROR". The objective is a script that identifies every occurrence of "ERROR" and captures the names immediately before and after it, together with the unique ID assigned to that "ERROR". To illustrate, let's look at the following example:
If I have this data
ID NAMES
1 James
3 ERROR
6 Keras
88 Kelly
53 Micheal
55 ERROR
7 Cindy
834 Keras
Then we would like to have come up with the following list:
ID NAMES
3 James-Keras
55 Micheal-Cindy
This is because the first "ERROR" had an ID of 3 and sat between the names James (before) and Keras (after); the next "ERROR" had an ID of 55 and sat between Micheal and Cindy. If "ERROR" is at the top or the bottom of the list, we should include only the name we do find; it is OK to have, say, "NA-NAME" if "ERROR" was found at the top.
But here is where it gets tricky if we ever run into a sequence with consecutive strings "ERROR" we should always use as a "guide" the very last one in descending order for instance:
If I have this data set
ID NAMES
1 James
3 ERROR
6 ERROR
88 ERROR
53 Jude
55 ERROR
7 Cindy
834 Keras
then we will want to have
ID NAMES
88 James-Jude
55 Jude-Cindy
This is because "ERROR" was repeated 3 times consecutively and the last one was at ID 88, so we take that one as the reference and record the names before and after it. Another way of seeing this is to view consecutive "ERROR" strings as a block and record the names before and after each block.
Thank you so much to everyone trying to help me out. I'd really appreciate it if you could point me to a book or functions that could help.
We may create a function to do this:
f1 <- function(dat) {
  # subdat1 keeps the last row of each run of ERROR / non-ERROR rows
  subdat1 <- subset(dat, !duplicated(with(rle(NAMES == "ERROR"),
      rep(seq_along(values), lengths)), fromLast = TRUE))
  # subdat2 keeps the first row of each run
  subdat2 <- subset(dat, !duplicated(with(rle(NAMES == "ERROR"),
      rep(seq_along(values), lengths))))
  ind <- which(subdat1$NAMES == "ERROR")
  # paste the name before each ERROR block with the name after it
  do.call(rbind, lapply(ind[c(TRUE, diff(ind) > 1)], function(i)
    data.frame(ID = subdat1$ID[i], NAMES = paste(subdat1$NAMES[i - 1],
        subdat2$NAMES[i + 1], sep = "-"))))
}
Testing:
> f1(df1)
ID NAMES
1 3 James-Keras
2 55 Micheal-Cindy
> f1(df2)
ID NAMES
1 88 James-Jude
2 55 Jude-Cindy
data
df1 <- structure(list(ID = c(1L, 3L, 6L, 88L, 53L, 55L, 7L, 834L), NAMES = c("James",
"ERROR", "Keras", "Kelly", "Micheal", "ERROR", "Cindy", "Keras"
)), class = "data.frame", row.names = c(NA, -8L))
df2 <- structure(list(ID = c(1L, 3L, 6L, 88L, 53L, 55L, 7L, 834L), NAMES = c("James",
"ERROR", "ERROR", "ERROR", "Jude", "ERROR", "Cindy", "Keras")),
class = "data.frame", row.names = c(NA,
-8L))
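The "block" view described in the question can also be sketched without rle, by locating the first and last row of each ERROR block directly. This is only an illustrative alternative to f1, not a replacement for it; the function name error_blocks is made up here, and it assumes the column is called NAMES and that a missing neighbour should show up as "NA", as the question allows.

```r
# Second example from the question
df2 <- data.frame(
  ID = c(1L, 3L, 6L, 88L, 53L, 55L, 7L, 834L),
  NAMES = c("James", "ERROR", "ERROR", "ERROR", "Jude", "ERROR", "Cindy", "Keras"),
  stringsAsFactors = FALSE
)

# error_blocks is a hypothetical helper name used only for this sketch
error_blocks <- function(dat) {
  is_err <- dat$NAMES == "ERROR"
  n <- nrow(dat)
  # first row of each block: an ERROR whose previous row is not an ERROR
  first <- which(is_err & !c(FALSE, is_err[-n]))
  # last row of each block: an ERROR whose next row is not an ERROR
  last <- which(is_err & !c(is_err[-1], FALSE))
  # name just before / just after the block, NA when the block touches an edge
  before <- ifelse(first > 1, dat$NAMES[pmax(first - 1, 1)], NA)
  after <- ifelse(last < n, dat$NAMES[pmin(last + 1, n)], NA)
  data.frame(ID = dat$ID[last], NAMES = paste(before, after, sep = "-"),
             stringsAsFactors = FALSE)
}

res <- error_blocks(df2)
res
```

The ID is taken from the last row of each block, which is the "guide" row the question asks for.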

Remove row with specific number in R

I want to remove rows containing the text "student2". However, I don't want to remove rows like "student22", "student23", etc.
For example:
Student.Code Values
1 canada.student12 2
2 canada.student2 3 # remove
3 canada.student23 5 # keep
4 US.student2 6 # remove
5 US.student32 2
6 Aus.student87 645
7 Turkey.student25 4 #keep
I used grepl("student2", example$Student.Code, fixed = TRUE), but it also finds (and removes) rows like "student23".
We can use grepl("student2$", example$Student.Code), where $ anchors the match to the end of the string:
library(tidyverse)
example <- tibble::tribble(
~Student.Code, ~Values,
"canada.student12", 2L,
"canada.student2", 3L,
"canada.student23", 5L,
"US.student2", 6L,
"US.student32", 2L,
"Aus.student87", 645L,
"Turkey.student25", 4L
)
example$Student.Code
grepl("student2$", example$Student.Code)
[1] FALSE TRUE FALSE TRUE FALSE FALSE FALSE
example %>%
filter(!grepl("student2$", Student.Code))
# A tibble: 5 x 2
Student.Code Values
<chr> <int>
1 canada.student12 2
2 canada.student23 5
3 US.student32 2
4 Aus.student87 645
5 Turkey.student25 4
Data:
df <- data.frame(
Student = c("canada.student12", "canada.student2", "canada.student23","US.student2", "US.student32", "Aus.student87", "Turkey.student25"),
Value = c(2,3,5,6,2,654,5)
)
Solution: (in base R)
The idea is to use grepl to match those values where the number 2 occurs at the word boundary, that is, in regex, at \\b, and to exclude these strings with the negator !:
df[!grepl("student2\\b", df$Student),]
Student Value
1 canada.student12 2
3 canada.student23 5
5 US.student32 2
6 Aus.student87 654
7 Turkey.student25 5
Alternatively, you can also go the opposite way and match those patterns that you want to keep:
df[grepl("student(?=\\d{2,})", df$Student, perl = TRUE),]
Here, the idea is to use positive lookahead to match values with student only if they are followed immediately by at least two digits (\\d{2,}). (Note that when using lookahead or lookbehind you need to include perl = TRUE.)
If you have a variable with an exact value you want to remove, don't use grep or grepl.
example <- tibble::tribble(
~Student.Code, ~Values,
"canada.student12", 2L,
"canada.student2", 3L,
"canada.student23", 5L,
"US.student2", 6L,
"US.student32", 2L,
"Aus.student87", 645L,
"Turkey.student25", 4L
)
example <- example[example$Student.Code != "canada.student2",]
# or, in dplyr
example <- filter(example, Student.Code != "canada.student2")
# for multiple values
example <- filter(example, !(Student.Code %in% c("canada.student2", "US.student2")))
fixed = TRUE does not work here because it only means 'search for this exact string within the input strings', not 'match only when this exact string is the whole value'.
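To make the difference between the approaches in this thread concrete, here is a small side-by-side sketch on the codes from the question. The vector name codes is just for illustration.

```r
codes <- c("canada.student12", "canada.student2", "canada.student23",
           "US.student2", "US.student32", "Aus.student87", "Turkey.student25")

# Plain substring search: also hits "student23" and "student25"
hit_fixed <- grepl("student2", codes, fixed = TRUE)

# Anchored at the end of the string: only codes ending in "student2"
hit_end <- grepl("student2$", codes)

# Word boundary after the 2: same result on this data
hit_boundary <- grepl("student2\\b", codes)

data.frame(codes, hit_fixed, hit_end, hit_boundary)
```

On this data the anchored and word-boundary patterns agree; they would differ only if something non-alphanumeric followed "student2" inside a longer code.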

Looking for how to use separate() with multiple separators in R (ClinVar variant data dealing)

Dear StackOverflow community
I'm a biologist and I'm working with a disease/genetic variants from ClinVar official database. My aim is to extract all gene names, transcripts and variants from this list.
ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/ClinVarFullRelease_2020-01.xml.gz
However, ClinVar offers the information I need in a single column called "Name". (I've separated some of the values with different results that I want to deal with in the example in the table below:)
Name ClinicalSignificance
1 NG_012236.2:g.11027del Pathogenic
2 NM_018077.3(RBM28):c.1052T>C (p.Leu351Pro) Pathogenic
3 NC_012920.1:m.7445A>G Pathogenic
4 m.7510T>C Pathogenic
5 NC_000023.11:g.(134493178_134493182)_(134501172_134501176)del Pathogenic
(there is other type of data, however since it does not contain the information I need I will treat it as garbage)
I am looking for a way to split the "Name" column into 3 other columns using multiple separators. I've tried using "|" in my regex to allow multiple matches. However, each time a separator matches, the data that has already been separated gets pushed one column to the right.
My code:
ClinVar_Clean <- separate(ClinVar_Clean, Name, into = c("Transcript","gene.var"),sep = "(?<=\\.[0-9]{1,2})[(]|(?<=[0-9]{3,16}\\.[0-9]{1,2}):|(?=[cmpng]\\.)")
ClinVar_Clean <- separate(ClinVar_Clean, gene.var, into = c("Gene","Variant"),sep = "\\):|(?=[cmpng]\\.)")
My result:
Transcript Gene Variant ClinicalSignificance
1 NG_012236.2 <NA> Pathogenic
2 NM_018077.3 RBM28 Pathogenic
3 NC_012920.1 <NA> Pathogenic
4 m.7510T>C Pathogenic
5 NC_000023.11 <NA> Pathogenic
How the result should look like:
Transcript Gene Variant ClinicalSignificance
1 NG_012236.2 g.11027del Pathogenic
2 NM_018077.3 RBM28 c.1052T>C (p.Leu351Pro) Pathogenic
3 NC_012920.1 m.7445A>G Pathogenic
4 m.7510T>C Pathogenic
5 NC_000023.11 g.(134493178_134493182)_(134501172_134501176)del Pathogenic
I also tried applying each separator individually to avoid shifting the data to the right, but that overwrites the remaining data.
If anyone could help, I would appreciate it!
I was trying to do this with a single extract/separate call but couldn't come up with one that gives the exact expected output. So here is an attempt that breaks it down into separate steps, using str_extract from stringr and sub from base R.
library(dplyr)
library(stringr)
df %>%
mutate(Transcript = str_extract(Name, ".*(?<=:)"),
Gene = str_extract(Transcript, "(?<=\\().*(?=\\))"),
Variant = sub(".*:(.*)", "\\1", Name)) %>%
select(Transcript, Gene, Variant)
# Transcript Gene Variant
#1 NG_012236.2: <NA> g.11027del
#2 NM_018077.3(RBM28): RBM28 c.1052T>C(p.Leu351Pro)
#3 NC_012920.1: <NA> m.7445A>G
#4 <NA> <NA> m.7510T>C
#5 NC_000023.11: <NA> g.(134493178_134493182)_(134501172_134501176)del
In Transcript we capture everything up to and including the colon.
For Gene, we take the characters inside the parentheses in Transcript.
For Variant, we take everything after the colon.
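If the Transcript column should hold only the accession (no gene, no trailing colon), as in the expected output of the question, the same three steps can be sketched in base R. This is a rough sketch under one assumption that is not confirmed by the question: the first colon in Name always separates the transcript part from the variant. The vector name and the intermediate variable prefix are illustrative only.

```r
name <- c("NG_012236.2:g.11027del",
          "NM_018077.3(RBM28):c.1052T>C (p.Leu351Pro)",
          "NC_012920.1:m.7445A>G",
          "m.7510T>C",
          "NC_000023.11:g.(134493178_134493182)_(134501172_134501176)del")

# Part before the first colon (NA when there is no colon at all)
has_colon <- grepl(":", name, fixed = TRUE)
prefix <- ifelse(has_colon, sub(":.*$", "", name), NA)

# Variant: everything after the first colon, or the whole name if no colon
variant <- sub("^[^:]*:", "", name)

# Transcript: the prefix with any "(GENE)" suffix dropped
transcript <- sub("\\(.*$", "", prefix)

# Gene: the text inside the parentheses of the prefix, if present
gene <- ifelse(grepl("(", prefix, fixed = TRUE),
               sub("^.*\\((.*)\\)$", "\\1", prefix), NA)

data.frame(Transcript = transcript, Gene = gene, Variant = variant)
```

Row 4 has no colon, so Transcript and Gene stay NA and the whole name is treated as the variant.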
data
df <- structure(list(Name = structure(c(4L, 5L, 3L, 1L, 2L), .Label = c("m.7510T>C",
"NC_000023.11:g.(134493178_134493182)_(134501172_134501176)del",
"NC_012920.1:m.7445A>G", "NG_012236.2:g.11027del",
"NM_018077.3(RBM28):c.1052T>C(p.Leu351Pro)"
), class = "factor"), ClinicalSignificance = structure(c(1L,
1L, 1L, 1L, 1L), .Label = "Pathogenic", class = "factor")), class =
"data.frame", row.names = c("1", "2", "3", "4", "5"))

is.element on column of lists in data frame

I have a data frame with a column that contains some elements that are lists. I would like to find out which rows of the data frame contain a keyword in that column.
The data frame, df, looks a bit like this
idstr tag
1 wl
2 other.to
3 other.from
4 c("wl","other.to")
5 wl
6 other.wl
7 c("ll","other.to")
The goal is to assign all of the rows with 'wl' in their tag to a new data frame. In this example, I would want a new data frame that looks like:
idstr tag
1 wl
4 c("wl","other.to")
5 wl
I tried something like this
df_wl <- df[which(is.element('wl',df$tag)),]
but this only returns the first row of the data frame (whether or not it contains 'wl'). I think the trouble lies in iterating through the rows and applying the is.element function. Here are two calls of the function and their results:
is.element('wl',df$tag[[4]]) # TRUE
is.element('wl',df$tag[4]) # FALSE
How do you suggest I iterate through the data frame to assign df_wl its proper values?
PS: Here's the dput:
structure(list(idstr = 1:7, tag = structure(c(6L, 5L, 4L, 2L, 6L, 3L, 1L), .Label = c("c(\"ll\",\"other.to\")", "c(\"wl\",\"other.to\")", "other.wl", "other.from", "other.to", "wl"), class = "factor")), .Names = c("idstr", "tag"), row.names = c(NA, -7L), class = "data.frame")
Based on your dput data, this may work. The regular expression (^wl$)|(\"wl\") matches a tag that is exactly wl, or any occurrence of "wl" wrapped in double quotes.
df[grepl("(^wl$)|(\"wl\")", df$tag),]
# idstr tag
# 1 1 wl
# 4 4 c("wl","other.to")
# 5 5 wl
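The regex above works because the dput shows tag as a factor of strings. If the column really is a list column, as the question text suggests, a membership test per row may be closer to what was intended. A minimal sketch, assuming the list column is built by hand as below (this is not the asker's actual object):

```r
# Hypothetical version of the data with a true list column
df <- data.frame(idstr = 1:7)
df$tag <- list("wl", "other.to", "other.from", c("wl", "other.to"),
               "wl", "other.wl", c("ll", "other.to"))

# Test each row's tag vector for the exact element "wl"
keep <- vapply(df$tag, function(x) "wl" %in% x, logical(1))
df_wl <- df[keep, ]
df_wl$idstr
```

Because %in% compares whole elements, "other.wl" in row 6 is not picked up.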

Get the time between consecutive dates stored in a single column

I am trying to figure out how to get the time between consecutive events when events are stored as a column of dates in a dataframe.
sampledf=structure(list(cust = c(1L, 1L, 1L, 1L), date = structure(c(9862,
9879, 10075, 10207), class = "Date")), .Names = c("cust", "date"
), row.names = c(NA, -4L), class = "data.frame")
I can get an answer with
as.numeric(rev(rev(difftime(c(sampledf$date[-1],0),sampledf$date))[-1]))
# [1] 17 196 132
but it is really ugly. Among other things, I only know how to exclude the first item in a vector, but not the last so I have to rev() twice to drop the last value.
Is there a better way?
By the way, I will use ddply to do this to a larger set of data for each cust id, so the solution would need to work with ddply.
library(plyr)
ddply(sampledf,
c("cust"),
summarize,
daysBetween = as.numeric(rev(rev(difftime(c(date[-1],0),date))[-1]))
)
Thank you!
Are you looking for this?
as.numeric(diff(sampledf$date))
# [1] 17 196 132
To remove the last element, use head:
head(as.numeric(diff(sampledf$date)), -1)
# [1] 17 196
require(plyr)
ddply(sampledf, .(cust), summarise, daysBetween = as.numeric(diff(date)))
# cust daysBetween
# 1 1 17
# 2 1 196
# 3 1 132
You can just use diff.
as.numeric(diff(sampledf$date))
To leave off the last element, you can do:
vec[-length(vec)] #where `vec` is your vector
In this case I don't think you need to leave anything off though, because diff is already one element shorter:
test <- ddply(sampledf,
c("cust"),
summarize,
daysBetween = as.numeric(diff(date))
)
test
# cust daysBetween
#1 1 17
#2 1 196
#3 1 132
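For readers without plyr, the per-customer version can be sketched in base R with split and lapply. The second customer below is made up purely to show the grouping; only customer 1 comes from the question.

```r
# Customer 1 is the question's data; customer 2 is invented for illustration
events <- data.frame(
  cust = c(1L, 1L, 1L, 1L, 2L, 2L),
  date = as.Date(c(9862, 9879, 10075, 10207, 10000, 10010),
                 origin = "1970-01-01")
)

# Split the dates by customer, then take day gaps within each customer
gaps <- lapply(split(events$date, events$cust),
               function(d) as.numeric(diff(d)))
gaps
```

Each list element is one shorter than that customer's number of events, so nothing needs to be dropped afterwards.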
