Extracting digits before and after a forward slash / in R

I am having trouble extracting the strings before and after /.
x <- c("maximusa/b=5/1","maximusa/b=-4/1","maximusa/b=3/-2")
before_slash=sub(".*=(\\d+).*","\\1", x, perl = TRUE)
gives
"5" "maximusa/b=-4/1" "3"
then
after_slash=sub("^.*\\/(d+)","\\1", x, perl = TRUE)
gives
"maximusa/b=5/1" "maximusa/b=-4/1" "maximusa/b=3/-2"
However, the expected output is:
before slash: 5 -4 3
after slash: 1 1 -2
How can I get the expected output? Thanks for any answers.
I would like to add one more condition. Assume we have strings like the ones below. As in the original question, how could we also extract numbers with a + sign while ignoring the parentheses? The current solution from @mob gives:
x <- c("maximusa/b=(5/+1)","maximusa/b=(-4/1)","maximusa/b=(+3/-2)")
after_slash=sub("^.*/(\\d+)","\\1", x, perl = TRUE)
> after_slash
[1] "maximusa/b=(5/+1)" "1)" "maximusa/b=(+3/-2)"
and
before_slash=sub(".*=(-?\\d+).*","\\1", x, perl = TRUE)
> before_slash
[1] "maximusa/b=(5/+1)" "maximusa/b=(-4/1)" "maximusa/b=(+3/-2)"
I tried some but no luck!

One problem is that
after_slash=sub("^.*\\/(d+)","\\1", x, perl = TRUE)
should be
after_slash=sub("^.*/(\\d+)","\\1", x, perl = TRUE)
To capture negative integers as well, you'll want to use
before_slash=sub(".*=(-?\\d+).*","\\1", x, perl = TRUE)
after_slash=sub("^.*/(-?\\d+)","\\1", x, perl = TRUE)
The token -? means "the - character, 0 or 1 times".
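For the follow-up strings with parentheses and explicit + signs, a minimal sketch along the same lines (allowing an optional ( after = and an optional sign before the digits; this is my extension, not part of the original answer):
x1 <- c("maximusa/b=(5/+1)","maximusa/b=(-4/1)","maximusa/b=(+3/-2)")
# optional "(" after "=", optional sign, then digits
before_slash <- sub(".*=\\(?([+-]?\\d+).*", "\\1", x1, perl = TRUE)
# last "/", optional sign, digits, optional ")" at the end
after_slash  <- sub(".*/([+-]?\\d+)\\)?$", "\\1", x1, perl = TRUE)
before_slash
# [1] "5"  "-4" "+3"
after_slash
# [1] "+1" "1"  "-2"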

We can use str_extract_all to match an optional - followed by one or more digits ([0-9]+) and then convert the matches to numeric:
library(tidyverse)
map_dfc(str_extract_all(x, "-?[0-9]+"), as.numeric)
# A tibble: 2 x 3
# V1 V2 V3
# <dbl> <dbl> <dbl>
#1 5 -4 3
#2 1 1 -2
Or use read.table after removing everything up to = with sub, specifying sep = "/" to create a two-column data.frame:
read.table(text= sub(".*=", "", x), sep="/")
# V1 V2
#1 5 1
#2 -4 1
#3 3 -2
Or another option is strsplit
sapply(strsplit(x, "[=/]"), `[`, 3:4)
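For reference, that strsplit call returns a character matrix (one column per input string); a hedged sketch of what it looks like and how to coerce it to numeric, which the original answer doesn't show:
m <- sapply(strsplit(x, "[=/]"), `[`, 3:4)
m
#      [,1] [,2] [,3]
# [1,] "5"  "-4" "3"
# [2,] "1"  "1"  "-2"
# coerce to numeric while keeping the matrix shape
matrix(as.numeric(m), nrow = nrow(m))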
Update
If the OP's strings have () as well, the first option should still work; for the second option, we can change it to
x1 <- c("maximusa/b=(5/1)","maximusa/b=(-4/1)","maximusa/b=(3/-2)")
read.table(text= gsub(".*=|[()]", "", x1), sep="/")
# V1 V2
#1 5 1
#2 -4 1
#3 3 -2

This should work, too.
matrix(as.numeric(unlist(strsplit(
gsub("(^\\w*\\/)(b=)(-?\\d)(\\/)(-?\\d$)", "\\3 \\5", x), " "))), 2)
# [,1] [,2] [,3]
# [1,] 5 -4 3
# [2,] 1 1 -2

Related

Regular expression in R to remove pairs

I have a code that outputs a pair of integers as "(1, 21)", as a string. The integers are always between 1 and 99.
I want to extract the integers into an array as numeric. How can I do this? I've done some research and it seems regex is the way to go, but I'm unsure exactly how to do this here.
Here are several base R one-liners. These each produce a data frame. Use as.matrix(...) on that if you want a matrix/array. (2) seems particularly compact.
1) trimws/read.table Trim non-digits off the ends using trimws and then use read.table to read it in, giving the data frame shown.
x <- c("(1, 21)", "(2, 22)", "(3, 33)") # input
read.table(text = trimws(x, white = "\\D"), sep = ",")
## V1 V2
## 1 1 21
## 2 2 22
## 3 3 33
2) gsub/read.table Another approach is to convert each non-digit to a space and then use read.table:
read.table(text = gsub("\\D", " ", x))
## V1 V2
## 1 1 21
## 2 2 22
## 3 3 33
3) strcapture Define a regular expression with captures to use with strcapture.
strcapture("(\\d+), (\\d+)", x, data.frame(V1 = integer(0), V2 = integer(0)))
## V1 V2
## 1 1 21
## 2 2 22
## 3 3 33
4) chartr/read.table Use chartr to replace ( with a space and then use read.table defining the comment character as ).
read.table(text = chartr("(", " ", x), sep = ",", comment.char = ")")
## V1 V2
## 1 1 21
## 2 2 22
## 3 3 33
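As mentioned above, wrap any of these in as.matrix(...) if you want a numeric matrix/array rather than a data frame; a minimal sketch using option (2):
as.matrix(read.table(text = gsub("\\D", " ", x)))  # a 3 x 2 integer matrix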
You can use gsub to remove ( and ) via the character class [()], then strsplit to split on ", ". unlist the returned list, convert it with as.integer, and create a matrix or array.
matrix(as.integer(unlist(strsplit(gsub("[()]", "", x), ", ", TRUE))), 2)
# [,1] [,2]
#[1,] 1 3
#[2,] 21 31
Data:
x <- c("(1, 21)", "(3, 31)")
Tidyverse way
x <- c("(1, 21)", "(33, 99)", "(1, 7)")
library(tidyverse)
map_dfr(str_split(str_replace(x, '\\((\\d+)\\,\\s(\\d+)\\)', '\\1 \\2'), ' '), ~ set_names(.x, c('A', 'B')))
#> # A tibble: 3 x 2
#> A B
#> <chr> <chr>
#> 1 1 21
#> 2 33 99
#> 3 1 7
Created on 2021-06-02 by the reprex package (v2.0.0)
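The columns above come back as character; a hedged follow-up (assuming dplyr >= 1.0 for across(); not part of the original answer) to coerce them to numeric:
map_dfr(str_split(str_replace(x, '\\((\\d+)\\,\\s(\\d+)\\)', '\\1 \\2'), ' '),
        ~ set_names(.x, c('A', 'B'))) %>%
  mutate(across(everything(), as.numeric))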

Count substring matches within start/stop positions of pattern in R

From a given string " GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT", I want a matrix that counts triplets in substrings that
start with “AAA” or “GAA”
end with “AGT”
and
have at least 2 and at most n other triplets between the start and the end.
For my problem, n = 10.
So I have below code and result:
library( stringr )
#sample data
dna <- c("GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT")
#set constants
start <- c("AAA", "GAA")
end <- "AGT"
n <- 10 # << set as desired
#build regex
regex <- paste0( "(", paste0( start, collapse = "|" ), ")", paste0( "([A-Z]{3}){2,", n, "}" ), end )
str_extract_all( dna, regex )
Result:
[[1]]
[1] "GAACCCACTAGTATAAAATTTGGGAGT" "AAACCCTTTGGGAGT"
Now, how can I modify the code so that it returns a matrix containing the starting position of each match along with the actual number of triplets (every 3 characters count as a triplet) between the start and the end for each match?
For the results above, the output should be:
1 7
31 3
Here 1 is the starting position of "GAACCCACTAGTATAAAATTTGGGAGT" and 7 is the number of triplets between the starting pattern GAA and the ending pattern AGT.
Likewise for "AAACCCTTTGGGAGT": 31 is the starting position of AAA and 3 is the number of triplets between AAA and AGT.
Here's a function that might give you what I think you need, though it doesn't provide the matrix you said you wanted.
func <- function(dna, n = 10, starts, stops) {
  if (length(n) == 1L) n <- c(2L, n)
  startptn <- paste0("(", paste(starts, collapse = "|"), ")")
  stopptn  <- paste0("(", paste(stops, collapse = "|"), ")")
  starts_ind <- gregexpr(startptn, dna)
  stops_ind  <- gregexpr(stopptn, dna)
  # stops_ind ends on the first char of the triplet, so add 2
  stops_ind <- lapply(stops_ind, `+`, 2L)
  candidates <- Map(function(bgn, end, txt) {
    mtx <- outer(bgn, end, FUN = function(b, e) substring(txt, b, e))
    vec <- mtx[nzchar(mtx)]
    vec
  }, starts_ind, stops_ind, dna)
  # "6L" is the first/last triplet defined in starts/stops
  cand_triplets <- lapply(candidates, function(z) nchar(z) - 6L)
  lens <- lengths(candidates)
  df <- data.frame(id = rep(seq_along(dna), lens), dna = rep(dna, lengths(candidates)))
  df$match <- unlist(candidates)
  df$inner <- substring(df$match, 4, nchar(df$match) - 3)
  df$ntriplets <- nchar(df$inner) / 3
  if (nrow(df) > 0) {
    df <- df[ abs(df$ntriplets %% 1) < 1e-5 &
                n[1] <= df$ntriplets &
                df$ntriplets <= n[2], , drop = FALSE ]
  }
  df
}
Demo:
dna <- c("GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT")
func(c(dna, dna), 10, c("AAA","GAA"), "AGT")
# id dna match inner ntriplets
# 1 1 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGT CCCACT 2
# 2 1 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGTATAAAATTTGGGAGT CCCACTAGTATAAAATTTGGG 7
# 6 1 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT AAACCCTTTGGGAGT CCCTTTGGG 3
# 7 2 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGT CCCACT 2
# 8 2 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGTATAAAATTTGGGAGT CCCACTAGTATAAAATTTGGG 7
# 12 2 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT AAACCCTTTGGGAGT CCCTTTGGG 3
func(c(dna, dna), 20, c("AAA","GAA"), "AGT")
# id dna match inner ntriplets
# 1 1 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGT CCCACT 2
# 2 1 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGTATAAAATTTGGGAGT CCCACTAGTATAAAATTTGGG 7
# 4 1 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT CCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGG 13
# 6 1 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT AAACCCTTTGGGAGT CCCTTTGGG 3
# 7 2 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGT CCCACT 2
# 8 2 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGTATAAAATTTGGGAGT CCCACTAGTATAAAATTTGGG 7
# 10 2 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT CCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGG 13
# 12 2 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT AAACCCTTTGGGAGT CCCTTTGGG 3
In the input:
each input dna can produce 0 or more rows in the output; I use c(dna, dna) merely to demonstrate that this works on multiple dna strings, vectorized;
if n is length 1, then it defaults to ranging between 2 and n; if length 2, then both are used appropriately;
In the output:
id numbers each of the dna input strings;
dna is the actual string; I include both id and dna in case a string is accidentally repeated (minor);
match is the matching substring, including the start and stop triplets;
inner is the same substring with the start/stop triplets removed;
ntriplets is really just the nchar of $inner;
the code filters to ensure we have whole triplets (dividing by 3 and checking that %% 1 is 0), and then keeps only matches whose triplet count falls within the input n
(If you want to see all matches, set n to Inf, though it'll still filter out non-triplet inner strings.)
As to your request of a matrix of lengths, if we insert a browser() in the Map and form mtx, we'll see that we are using matrices:
bgn
# [1] 1 15 31
# attr(,"match.length")
# [1] 3 3 3
# attr(,"index.type")
# [1] "chars"
# attr(,"useBytes")
# [1] TRUE
end
# [1] 12 27 45
# attr(,"match.length")
# [1] 3 3 3
# attr(,"index.type")
# [1] "chars"
# attr(,"useBytes")
# [1] TRUE
txt
# [1] "GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT"
mtx <- outer(bgn, end, FUN = function(b, e) substring(txt, b, e))
mtx
# [,1] [,2] [,3]
# [1,] "GAACCCACTAGT" "GAACCCACTAGTATAAAATTTGGGAGT" "GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT"
# [2,] "" "AAAATTTGGGAGT" "AAAATTTGGGAGTCCCAAACCCTTTGGGAGT"
# [3,] "" "" "AAACCCTTTGGGAGT"
nchar(mtx) - 6
# [,1] [,2] [,3]
# [1,] 6 21 39
# [2,] -6 7 25
# [3,] -6 -6 9
(The negatives are just an artifact of the debugging environment and my naïve subtraction of 6 with empty strings present; this does not reflect a bug in the function.)
To me, this matrix suggests we have 2, 7, and 13 triplets in the top row.
I have an ugly solution. You just need to analyse your pattern.
library( stringr )
#sample data
dna <- c("GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT")
#set constants
start <- c("AAA", "GAA")
end <- "AGT"
n <- 10 # << set as desired
#build regex
regex <- paste0( "(", paste0( start, collapse = "|" ), ")", paste0( "([A-Z]{3}){2,", n, "}" ), end )
triplet_string <- str_extract_all( dna, regex )
triplet_matrix <- str_locate_all(dna, regex)
# first solution:
# difference between end and start position (-1), divided by 3 for triplets, minus the first and last triplet
triplet_length_1 <- (triplet_matrix[[1]][,2] - (triplet_matrix[[1]][,1]-1))/3 - 2
df <- data.frame(startpos = triplet_matrix[[1]][,1],
                 lengthtriplet = triplet_length_1)
# second solution:
# get the length of each ORF and subtract 2 for the start and stop codons
triplet_length_2 <- nchar(triplet_string[[1]])/3 - 2
df <- data.frame(startpos = triplet_matrix[[1]][,1],
                 lengthtriplet = triplet_length_2)
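Either version should then reproduce the expected output (hedged; the values follow from the match positions str_locate_all reports, 1-27 and 31-45):
df
#   startpos lengthtriplet
# 1        1             7
# 2       31             3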
I hope this helps.

Subset dataframe if a $ symbol exists in a string column

I have a dataframe with a time column and a string column. I want to subset this dataframe - where I only keep the rows in which the string column contains a $ symbol somewhere in it.
After subsetting, I want to clean the string column so that it only contains the characters after the $ symbol until there is a space or symbol
df <- data.frame("time"=c(1:10),
                 "string"=c("$ABCD test","test","test $EFG test",
                            "$500 test","$HI/ hello","test $JK/",
                            "testing/123","$MOO","$abc","123"))
I want the final output to be:
Time string
1 ABCD
3 EFG
4 500
5 HI
6 JK
8 MOO
9 abc
It only keeps rows that have a $ in the string column, and then only keeps the characters after the $ symbol and until a space or symbol
I have had some success with sub simply to pull out the string, but haven't been able to apply that to the df and subset it. Thanks for your help.
Until someone comes up with pretty regex solutions, here is my take:
# subset for $ signs and convert to character class
res <- df[ grepl("$", df$string, fixed = TRUE),]
res$string <- as.character(res$string)
# split on non alpha and non $, and grab the one with $, then remove $
res$clean <- sapply(strsplit(res$string, split = "[^a-zA-Z0-9$']", perl = TRUE),
                    function(i){
                      x <- i[grepl("$", i, fixed = TRUE)]
                      # in case there is more than one $
                      # x <- i[grepl("$", i, fixed = TRUE)][1]
                      gsub("$", "", x, fixed = TRUE)
                    })
res
# time string clean
# 1 1 $ABCD test ABCD
# 3 3 test $EFG test EFG
# 4 4 $500 test 500
# 5 5 $HI/ hello HI
# 6 6 test $JK/ JK
# 8 8 $MOO MOO
# 9 9 $abc abc
We can do this with regexpr/regmatches to extract only the substring that follows a $:
i1 <- grep("$", df$string, fixed = TRUE)
transform(df[i1,], string = regmatches(string, regexpr("(?<=[$])\\w+", string, perl = TRUE)))
# time string
#1 1 ABCD
#3 3 EFG
#4 4 500
#5 5 HI
#6 6 JK
#8 8 MOO
#9 9 abc
Or with the tidyverse syntax
library(tidyverse)
df %>%
  filter(str_detect(string, fixed("$"))) %>%
  mutate(string = str_extract(string, "(?<=[$])\\w+"))
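That pipeline should return the same seven rows as the base versions above (hedged output; dplyr renumbers the rows):
#   time string
# 1    1   ABCD
# 2    3    EFG
# 3    4    500
# 4    5     HI
# 5    6     JK
# 6    8    MOO
# 7    9    abc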

r substring wildcard search to find text

I have a data.frame column that has values such as below. I want to use each cell to create two columns, num1 and num2, such that num1 = everything before "-" and num2 = everything between "-" and ".".
I am thinking of using gregexpr function as shown here and write a for loop to iterate over each row. Is there a faster way to do this?
60-150.PNG
300-12.PNG
employee <- c('60-150.PNG','300-12.PNG')
employ.data <- data.frame(employee)
Try
library(tidyr)
extract(employ.data, employee, into=c('num1', 'num2'),
'([^-]*)-([^.]*)\\..*', convert=TRUE)
# num1 num2
#1 60 150
#2 300 12
Or
library(data.table)#v1.9.5+
setDT(employ.data)[, tstrsplit(employee, '[-.]', type.convert=TRUE)[-3]]
# V1 V2
#1: 60 150
#2: 300 12
Or based on #rawr's comment
read.table(text=gsub('-|.PNG', ' ', employ.data$employee),
col.names=c('num1', 'num2'))
# num1 num2
#1 60 150
#2 300 12
Update
To keep the original column
extract(employ.data, employee, into=c('num1', 'num2'), remove=FALSE,
'([^-]*)-([^.]*)\\..*', convert=TRUE)
# employee num1 num2
#1 60-150.PNG 60 150
#2 300-12.PNG 300 12
Or
setDT(employ.data)[, paste0('num', 1:2) := tstrsplit(employee,
'[-.]', type.convert=TRUE)[-3]]
# employee num1 num2
#1: 60-150.PNG 60 150
#2: 300-12.PNG 300 12
Or
cbind(employ.data, read.table(text=gsub('-|.PNG', ' ',
employ.data$employee),col.names=c('num1', 'num2')))
# employee num1 num2
#1 60-150.PNG 60 150
#2 300-12.PNG 300 12
You can try cSplit from my "splitstackshape" package:
library(splitstackshape)
cSplit(employ.data, "employee", "-|.PNG", fixed = FALSE)
# employee_1 employee_2
# 1: 60 150
# 2: 300 12
Since you mention gregexpr, you can probably try something like:
do.call(rbind,
        regmatches(as.character(employ.data$employee),
                   gregexpr("-|.PNG", employ.data$employee),
                   invert = TRUE))[, -3]
[,1] [,2]
[1,] "60" "150"
[2,] "300" "12"
Another option using stringi
library(stringi)
data.frame(type.convert(stri_split_regex(employee, "[-.]", simplify = TRUE)[, -3]))
# X1 X2
# 1 60 150
# 2 300 12
Or with the simple gsub.
gsub("-.*", "", employ.data$employee) # substitute everything after - with nothing
gsub(".*-(.*)\\..*", "\\1", employ.data$employee) #keep only anything between - and .
The strsplit function will give you what you're looking for, output to a list.
employee <- c('60-150.PNG','300-12.PNG')
strsplit(employee, "[-]")
##Output:
[[1]]
[1] "60" "150.PNG"
[[2]]
[1] "300" "12.PNG"
Note the second argument to strsplit is a regex value, not just a character to split on, so more complicated regexp can be used.
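To go from that list to the two numeric columns the question asks for, one possible follow-up is to split on "." as well, so the ".PNG" suffix drops out (a sketch under that assumption):
parts <- strsplit(employee, "[-.]")   # split on both "-" and "."
data.frame(num1 = as.numeric(sapply(parts, `[`, 1)),
           num2 = as.numeric(sapply(parts, `[`, 2)))
#   num1 num2
# 1   60  150
# 2  300   12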

R: count word occurrence by row and create variable

New to R. I am looking to create a function to count the number of rows that contain one or more of the following words ("foo", "x", "y") in a column.
I then want to label that row with a variable, such as "1".
I have a data frame that looks like this:
a->
id text time username
1 "hello x" 10 "me"
2 "foo and y" 5 "you"
3 "nothing" 15 "everyone"
4 "x,y,foo" 0 "know"
The correct output should be:
count: 3
new data frame
a2 ->
id text time username keywordtag
1 "hello x" 10 "me" 1
2 "foo and y" 5 "you" 1
3 "nothing" 15 "everyone"
4 "x,y,foo" 0 "know" 1
Any hints on how to do this would be appreciated!
Here are 2 approaches with base and qdap:
a <- read.table(text='id text time username
1 "hello x" 10 "me"
2 "foo and y" 5 "you"
3 "nothing" 15 "everyone"
4 "x,y,foo" 0 "know"', header=TRUE)
# Base
a$keywordtag <- as.numeric(grepl("\\b[foo]\\b|\\b[x]\\b|\\b[y]\\b", a$text))
a
# qdap
library(qdap)
terms <- termco(gsub("(,)([^ ])", "\\1 \\2", a$text),
                id(a), list(terms = c(" foo ", " x ", " y ")))
a$keywordtag <- as.numeric(counts(terms)[[3]] > 0)
a
# output
## id text time username keywordtag
## 1 1 hello x 10 me 1
## 2 2 foo and y 5 you 1
## 3 3 nothing 15 everyone 0
## 4 4 x,y,foo 0 know 1
The base approach is by far the more elegant and simple.
# EDIT (borrowing from Richard; I believe this is the most generalizable and understandable):
words <- c("foo", "x", "y")
regex <- paste(sprintf("\\b[%s]\\b", words), collapse="|")
within(a, {
  keywordtag = as.numeric(grepl(regex, a$text))
})
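The question also asks for the overall count; once keywordtag exists, a simple follow-up (my addition, not part of the original answer) is:
sum(a$keywordtag)  # number of rows containing at least one keyword
# [1] 3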
This is probably much safer than my previous answer.
> string <- c("foo", "x", "y")
> a$keywordtag <-
(1:nrow(a) %in% c(sapply(string, grep, a$text, fixed = TRUE)))+0
> a
# id text time username keywordtag
# 1 1 hello x 10 me 1
# 2 2 foo and y 5 you 1
# 3 3 nothing 15 everyone 0
# 4 4 x,y,foo 0 know 1
Your question boils down to splitting a vector of strings on multiple delimiters and checking if any of the tokens are in your set of desired words. You can split on multiple delimiters using strsplit (I'll use comma and whitespace, since your question doesn't specify the full set of delimiters for your problem), and I'll use intersect to check if it contains any word in your set:
m <- c("foo", "x", "y")
a$keywordtag <- as.numeric(unlist(lapply(strsplit(as.character(a$text), ",|\\s"),
                                         function(x) length(intersect(x, m)) > 0)))
a
# id text time username keywordtag
# 1 1 hello x 10 me 1
# 2 2 foo and y 5 you 1
# 3 3 exciting 15 everyone 0
# 4 4 x,y,foo 0 know 1
I've included "exciting", which is a word that contains "x" but that isn't listed as a match by this approach.
Another variation on Tyler Rinker's answer:
within(a, {keywordtag = as.numeric(grepl("foo|x|y", fixed = FALSE, a$text))})
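Note that, unlike the word-boundary patterns above, a bare "foo|x|y" also matches substrings, so a word such as "exciting" (which contains an x) would be tagged as well; the strsplit answer above avoids this.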
