Regular expression in R to remove pairs - r

I have code that outputs a pair of integers as a string, e.g. "(1, 21)". The integers are always between 1 and 99.
I want to extract the integers into an array as numeric. How can I do this? I've done some research and it seems regex is the way to go, but I'm unsure exactly how to do this here.

Here are several base R one-liners. These each produce a data frame. Use as.matrix(...) on that if you want a matrix/array. (2) seems particularly compact.
1) trimws/read.table trim non-digits off the ends using trimws and then use read.table to read it in giving the data frame shown.
x <- c("(1, 21)", "(2, 22)", "(3, 33)") # input
read.table(text = trimws(x, white = "\\D"), sep = ",")
## V1 V2
## 1 1 21
## 2 2 22
## 3 3 33
2) gsub/read.table Another approach is to convert each non-digit to a space and then use read.table:
read.table(text = gsub("\\D", " ", x))
## V1 V2
## 1 1 21
## 2 2 22
## 3 3 33
3) strcapture Define a regular expression with captures to use with strcapture.
strcapture("(\\d+), (\\d+)", x, data.frame(V1 = integer(0), V2 = integer(0)))
## V1 V2
## 1 1 21
## 2 2 22
## 3 3 33
4) chartr/read.table Use chartr to replace ( with a space and then use read.table defining the comment character as ).
read.table(text = chartr("(", " ", x), sep = ",", comment.char = ")")
## V1 V2
## 1 1 21
## 2 2 22
## 3 3 33

You can use gsub to remove ( and ) using [()] and then use strsplit to split at ", ". unlist the returned list, convert it with as.integer, and create a matrix or array.
matrix(as.integer(unlist(strsplit(gsub("[()]", "", x), ", ", TRUE))), 2)
# [,1] [,2]
#[1,] 1 3
#[2,] 21 31
Data:
x <- c("(1, 21)", "(3, 31)")
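If one row per pair is preferred (the matrix above holds one pair per column), a minimal base R sketch is to pull out every run of digits with gregexpr/regmatches and fill the matrix by row. The names digits and pairs are only illustrative and not part of the answers above.

```r
# Sample input in the same format as the question
x <- c("(1, 21)", "(2, 22)", "(3, 33)")

# Extract every run of digits from each string (a list with one element per string)
digits <- regmatches(x, gregexpr("[0-9]+", x))

# Flatten, convert to integer, and fill row by row: one row per input pair
pairs <- matrix(as.integer(unlist(digits)), ncol = 2, byrow = TRUE)
pairs
```

This assumes every string contains exactly two integers; a string with a different count would shift the rows.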

Tidyverse way
x <- c("(1, 21)", "(33, 99)", "(1, 7)")
library(tidyverse)
map_dfr(str_split(str_replace(x, '\\((\\d+)\\,\\s(\\d+)\\)', '\\1 \\2'), ' '), ~ set_names(.x, c('A', 'B')))
#> # A tibble: 3 x 2
#> A B
#> <chr> <chr>
#> 1 1 21
#> 2 33 99
#> 3 1 7
Created on 2021-06-02 by the reprex package (v2.0.0)
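The columns in the tibble above are character. Since the question asks for numeric values, one possible follow-up step is type.convert, sketched here in base R on a small stand-in data frame (chr_df and num_df are hypothetical names):

```r
# Stand-in for the character result shown above
chr_df <- data.frame(A = c("1", "33", "1"),
                     B = c("21", "99", "7"),
                     stringsAsFactors = FALSE)

# type.convert guesses the column types; as.is = TRUE avoids creating factors
num_df <- type.convert(chr_df, as.is = TRUE)
str(num_df)
```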

Related

Count substring matches within start/stop positions of pattern in R

From a given string "GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT", I want a matrix that counts triplets in substrings that
start with “AAA” or “GAA”
end with “AGT”
and
have at least 2 and at most n other triplets between the start and the end.
For my problem, n = 10.
So I have below code and result:
library( stringr )
#sample data
dna <- c("GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT")
#set constants
start <- c("AAA", "GAA")
end <- "AGT"
n <- 10 # << set as desired
#build regex
regex <- paste0( "(", paste0( start, collapse = "|" ), ")", paste0( "([A-Z]{3}){2,", n, "}" ), end )
str_extract_all( dna, regex )
Result:
[[1]]
[1] "GAACCCACTAGTATAAAATTTGGGAGT" "AAACCCTTTGGGAGT"
How can I modify the code so that it returns a matrix containing the starting position of each match along with the actual number of triplets (every 3 characters count as one triplet) between the start and the end of each match?
Like for the above results, the result should be:
1 7
31 3
Here 1 is the starting position of GAACCCACTAGTATAAAATTTGGGAGT and 7 is the number of triplets between the starting pattern GAA and the ending pattern AGT.
Likewise for AAACCCTTTGGGAGT:
31 is the starting position of AAA and 3 is the number of triplets between AAA and AGT.
Here's a function that might give you what I think you need, though it doesn't provide the matrix you said you wanted.
func <- function(dna, n = 10, starts, stops) {
  if (length(n) == 1L) n <- c(2L, n)
  startptn <- paste0("(", paste(starts, collapse = "|"), ")")
  stopptn <- paste0("(", paste(stops, collapse = "|"), ")")
  starts_ind <- gregexpr(startptn, dna)
  stops_ind <- gregexpr(stopptn, dna)
  # stops_ind ends on the first char of the triplet, so add 2
  stops_ind <- lapply(stops_ind, `+`, 2L)
  candidates <- Map(function(bgn, end, txt) {
    mtx <- outer(bgn, end, FUN = function(b, e) substring(txt, b, e))
    vec <- mtx[nzchar(mtx)]
    vec
  }, starts_ind, stops_ind, dna)
  # "6L" is the first/last triplet defined in starts/stops
  cand_triplets <- lapply(candidates, function(z) nchar(z) - 6L)
  lens <- lengths(candidates)
  df <- data.frame( id = rep(seq_along(dna), lens), dna = rep(dna, lengths(candidates)) )
  df$match <- unlist(candidates)
  df$inner <- substring(df$match, 4, nchar(df$match) - 3)
  df$ntriplets <- nchar(df$inner) / 3
  if (nrow(df) > 0) {
    df <- df[ abs(df$ntriplets %% 1) < 1e-5 &
                n[1] <= df$ntriplets &
                df$ntriplets <= n[2], , drop = FALSE ]
  }
  df
}
Demo:
dna <- c("GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT")
func(c(dna, dna), 10, c("AAA","GAA"), "AGT")
# id dna match inner ntriplets
# 1 1 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGT CCCACT 2
# 2 1 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGTATAAAATTTGGGAGT CCCACTAGTATAAAATTTGGG 7
# 6 1 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT AAACCCTTTGGGAGT CCCTTTGGG 3
# 7 2 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGT CCCACT 2
# 8 2 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGTATAAAATTTGGGAGT CCCACTAGTATAAAATTTGGG 7
# 12 2 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT AAACCCTTTGGGAGT CCCTTTGGG 3
func(c(dna, dna), 20, c("AAA","GAA"), "AGT")
# id dna match inner ntriplets
# 1 1 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGT CCCACT 2
# 2 1 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGTATAAAATTTGGGAGT CCCACTAGTATAAAATTTGGG 7
# 4 1 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT CCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGG 13
# 6 1 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT AAACCCTTTGGGAGT CCCTTTGGG 3
# 7 2 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGT CCCACT 2
# 8 2 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGTATAAAATTTGGGAGT CCCACTAGTATAAAATTTGGG 7
# 10 2 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT CCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGG 13
# 12 2 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT AAACCCTTTGGGAGT CCCTTTGGG 3
In the input:
each input dna can produce 0 or more rows in the output; I use c(dna, dna) merely to demonstrate that this works on multiple dna strings, vectorized;
if n is length 1, then it defaults to ranging between 2 and n; if length 2, then both are used appropriately;
In the output:
id numbers each of the dna input strings;
dna is the actual string; I include both id and dna in case a string is accidentally repeated (minor);
match is the matching substring, including the start and stop triplets;
inner is the same substring with the start/stop triplets removed;
ntriplets is really just the nchar of $inner divided by 3;
the code filters to ensure we have triplets (/3 reduces it, and %% 1 should be 0), and then how many triplets we have based on the input n
(If you want to see all matches, set n to Inf, though it'll still filter out non-triplet inner strings.)
As to your request of a matrix of lengths, if we insert a browser() in the Map and form mtx, we'll see that we are using matrices:
bgn
# [1] 1 15 31
# attr(,"match.length")
# [1] 3 3 3
# attr(,"index.type")
# [1] "chars"
# attr(,"useBytes")
# [1] TRUE
end
# [1] 12 27 45
# attr(,"match.length")
# [1] 3 3 3
# attr(,"index.type")
# [1] "chars"
# attr(,"useBytes")
# [1] TRUE
txt
# [1] "GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT"
mtx <- outer(bgn, end, FUN = function(b, e) substring(txt, b, e))
mtx
# [,1] [,2] [,3]
# [1,] "GAACCCACTAGT" "GAACCCACTAGTATAAAATTTGGGAGT" "GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT"
# [2,] "" "AAAATTTGGGAGT" "AAAATTTGGGAGTCCCAAACCCTTTGGGAGT"
# [3,] "" "" "AAACCCTTTGGGAGT"
nchar(mtx) - 6
# [,1] [,2] [,3]
# [1,] 6 21 39
# [2,] -6 7 25
# [3,] -6 -6 9
(The negatives are just an artifact of my naïve subtraction of 6 from the empty strings in mtx; inside the function those empty strings are dropped by nzchar before any counting, so this does not reflect a bug in the function.)
To me, this matrix suggests we have 2, 7, and 13 triplets in the top row.
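The step from this matrix to triplet counts can be sketched as follows, assuming the same bgn, end and txt values shown in the debugging session (entered here as plain vectors). Marking the empty strings as NA avoids the negative entries. The name inner_triplets is illustrative only.

```r
txt <- "GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT"
bgn <- c(1, 15, 31)   # start positions of AAA/GAA
end <- c(12, 27, 45)  # last character of each AGT

# Every start/stop combination; substring() returns "" when the start is after the stop
mtx <- outer(bgn, end, FUN = function(b, e) substring(txt, b, e))

len <- nchar(mtx)
len[len == 0L] <- NA             # impossible combinations become NA, not -6
inner_triplets <- (len - 6L) / 3 # drop the start and stop triplets, count in triplets
inner_triplets
```

Non-integer entries in the result correspond to candidates whose inner part is not a whole number of triplets, which is what the %% 1 filter in the function removes.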
I have an ugly solution. You just need to analyse your pattern.
library( stringr )
#sample data
dna <- c("GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT")
#set constants
start <- c("AAA", "GAA")
end <- "AGT"
n <- 10 # << set as desired
#build regex
regex <- paste0( "(", paste0( start, collapse = "|" ), ")", paste0( "([A-Z]{3}){2,", n, "}" ), end )
triplet_string <- str_extract_all( dna, regex )
triplet_matrix <- str_locate_all(dna, regex)
# first solution:
# length of the match (end - start + 1), divided by 3 for triplets, minus 2 for the first and last triplet
triplet_length_1 <- (triplet_matrix[[1]][,2] - (triplet_matrix[[1]][,1]-1))/3 - 2
df <- data.frame(startpos = triplet_matrix[[1]][,1],
lengthtriplet = triplet_length_1)
# second solution:
# length of each ORF in triplets, minus 2 for the start and stop codons
triplet_length_2 <- nchar(triplet_string[[1]])/3 - 2
df <- data.frame(startpos = triplet_matrix[[1]][,1],
lengthtriplet = triplet_length_2)
I hope this helps.
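For comparison, here is a base R sketch of the same idea without stringr. It assumes the regex built above, written out literally for n = 10, and uses gregexpr, which returns the start positions and carries the match lengths in its match.length attribute. The names m and result are illustrative.

```r
dna <- "GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT"

# Same pattern as paste0() builds above for start = c("AAA", "GAA"), end = "AGT", n = 10
regex <- "(AAA|GAA)([A-Z]{3}){2,10}AGT"

# Start positions of all non-overlapping matches, with lengths as an attribute
m <- gregexpr(regex, dna, perl = TRUE)[[1]]

# One row per match: start position and number of triplets between start and end
result <- cbind(start = as.integer(m),
                ntriplets = attr(m, "match.length") / 3 - 2)
result
```

As with str_extract_all, only non-overlapping matches are reported, so shorter candidates nested inside a longer match do not appear.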

Extracting digits before and after forward slash /

I have trouble with extracting the string before and after /.
x <- c("maximusa/b=5/1","maximusa/b=-4/1","maximusa/b=3/-2")
before_slash=sub(".*=(\\d+).*","\\1", x, perl = TRUE)
gives
"5" "maximusa/b=-4/1" "3"
then
after_slash=sub("^.*\\/(d+)","\\1", x, perl = TRUE)
gives
"maximusa/b=5/1" "maximusa/b=-4/1" "maximusa/b=3/-2"
OTOH, the expected output is
before slash 5 -4 3
after slash 1 1 -2
how can I get the expected output ?
Thanks for the answers.
I would like to add one more condition for extracting the strings.
Assume we have strings like the ones below.
As in the original question, how could we also extract values with a + sign while ignoring the parentheses? The current solution from @mob gives
x <- c("maximusa/b=(5/+1)","maximusa/b=(-4/1)","maximusa/b=(+3/-2)")
after_slash=sub("^.*/(\\d+)","\\1", x, perl = TRUE)
> after_slash
[1] "maximusa/b=(5/+1)" "1)" "maximusa/b=(+3/-2)"
and
before_slash=sub(".*=(-?\\d+).*","\\1", x, perl = TRUE)
> before_slash
[1] "maximusa/b=(5/+1)" "maximusa/b=(-4/1)" "maximusa/b=(+3/-2)"
I tried some but no luck!
One problem is that
after_slash=sub("^.*\\/(d+)","\\1", x, perl = TRUE)
should be
after_slash=sub("^.*/(\\d+)","\\1", x, perl = TRUE)
To capture negative integers as well, you'll want to use
before_slash=sub(".*=(-?\\d+).*","\\1", x, perl = TRUE)
after_slash=sub("^.*/(-?\\d+)","\\1", x, perl = TRUE)
The tokens -? mean "the - character, 0 or 1 times"
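Put together on the original input, the corrected patterns can be sketched like this. The as.integer conversion is an addition for illustration, since the question's expected output is numeric.

```r
x <- c("maximusa/b=5/1", "maximusa/b=-4/1", "maximusa/b=3/-2")

# Capture an optional minus sign plus digits right after "="
before_slash <- as.integer(sub(".*=(-?\\d+).*", "\\1", x, perl = TRUE))

# Greedy ".*" runs to the last "/", then capture the signed integer that follows
after_slash <- as.integer(sub("^.*/(-?\\d+)", "\\1", x, perl = TRUE))

before_slash
after_slash
```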
We can use str_extract_all to match a - (if any) followed by one or more digits ([0-9]+) and change the type of it to numeric
library(tidyverse)
map_dfc(str_extract_all(x, "-?[0-9]+"), as.numeric)
# A tibble: 2 x 3
# V1 V2 V3
# <dbl> <dbl> <dbl>
#1 5 -4 3
#2 1 1 -2
Or with read.table after getting the substring with sub and then specifying the sep as / to create a two column data.frame
read.table(text= sub(".*=", "", x), sep="/")
# V1 V2
#1 5 1
#2 -4 1
#3 3 -2
Or another option is strsplit
sapply(strsplit(x, "[=/]"), `[`, 3:4)
Update
If the OP's strings contain () as well, the first option should still work; for the second option, we can also strip the parentheses:
x1 <- c("maximusa/b=(5/1)","maximusa/b=(-4/1)","maximusa/b=(3/-2)")
read.table(text= gsub(".*=|[()]", "", x1), sep="/")
# V1 V2
#1 5 1
#2 -4 1
#3 3 -2
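For the follow-up case with an explicit + sign, one possible sketch in base R is to match an optional sign followed by digits, which also skips the parentheses. This assumes the only digits in each string are the two values of interest; nums and res are illustrative names.

```r
x <- c("maximusa/b=(5/+1)", "maximusa/b=(-4/1)", "maximusa/b=(+3/-2)")

# Optional + or - followed by one or more digits; parentheses are never matched
nums <- regmatches(x, gregexpr("[+-]?[0-9]+", x))

# as.numeric() accepts a leading "+", so "+1" becomes 1
res <- matrix(as.numeric(unlist(nums)), nrow = 2,
              dimnames = list(c("before_slash", "after_slash"), NULL))
res
```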
This should work, too.
matrix(as.numeric(unlist(strsplit(
gsub("(^\\w*\\/)(b=)(-?\\d)(\\/)(-?\\d$)", "\\3 \\5", x), " "))), 2)
# [,1] [,2] [,3]
# [1,] 5 -4 3
# [2,] 1 1 -2

Replace ( gsub) all rows in a column from values in another column?

Suppose I have a dataframe as such,
df = data.frame ( a = c(1,14,15,11) , b= c("xxxchrxxx","xxxchryy","zzchrzz","aachraa") )
a b
1 1 xxxchrxxx
2 14 xxxchryy
3 15 zzchrzz
4 11 aachraa
what I want is to replace chr from column b with chrx, x derive from column a
a b
1 1 xxxchr1xxx
2 14 xxxchr14yy
3 15 zzchr15zz
4 11 aachr11aa
however I cant get gsub to work since its expecting a single element
df$b = gsub ( "chr",paste0("chr",df$a), df$b)
any way to do this?
The reason is that gsub replacement takes only a vector with length 1. According to ?gsub
replacement - if a character vector of length 2 or more is supplied, the first element is used with a warning.
If it needs to have a vectorized replacement, use str_replace
library(stringr)
str_replace(df$b, "chr", paste0("chr", df$a))
#[1] "xxxchr1xxx" "xxxchr14yy" "zzchr15zz" "aachr11aa"
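If stringr is not available, a base R sketch of the same vectorized replacement is to pair each string with its own replacement via mapply. This is an alternative for illustration, not the method used above; new_b is a hypothetical name.

```r
df <- data.frame(a = c(1, 14, 15, 11),
                 b = c("xxxchrxxx", "xxxchryy", "zzchrzz", "aachraa"),
                 stringsAsFactors = FALSE)

# sub() is called once per row, each time with a length-1 replacement
new_b <- mapply(function(s, r) sub("chr", r, s, fixed = TRUE),
                df$b, paste0("chr", df$a), USE.NAMES = FALSE)
new_b
```

sub replaces only the first occurrence of "chr" in each string, which matches the behavior of str_replace.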
If column b held only the literal "chr" in every row (as in the data frame defined further down), a simple paste would be enough
df$b <- with(df, paste0(b, a))
EDIT:: With stringr:
stringr::str_replace_all(df$b,"chr",paste0("chr",df$a))
Continuing with paste0:
df$b<-paste0(df$b,df$a)
a b
1 1 chr1
2 14 chr14
3 15 chr15
4 11 chr11
df = data.frame ( a = c(1,14,15,11) , b= c("chr","chr","chr","chr") )
df$b <- paste0(df$b, df$a)
df
#> a b
#> 1 1 chr1
#> 2 14 chr14
#> 3 15 chr15
#> 4 11 chr11
Created on 2019-02-22 by the reprex package (v0.2.1)

R split array into Data frame

VERY new to R and struggling with knowing exactly what to ask, have found a similar question here
How to split a character vector into data frame?
but this has fixed length, and I've been unable to adjust for my problem
I've got some data in an array in R
TEST <- c("Value01:100|Value02:200|Value03:300|","Value04:1|Value05:2|",
"StillAValueButNamesAreNotConsistent:12345.6789|",
"AlsoNotAllLinesAreTheSameLength:1|")
The data is stored in pairs, and I'm looking to split out into a dataframe as such:
Variable Value
Value01 100
Value02 200
Value03 300
Value04 1
Value05 2
StillAValueButNamesAreNotConsistent 12345.6789
AlsoNotAllLinesAreTheSameLength 1
The Variable name is a string and the value will always be a number
Any help would be great!
Thanks
One can use tidyr based solution. Convert vector TEST to a data.frame and remove the last | from each row as that doesn't carry any meaning as such.
Now, use tidyr::separate_rows to expand rows based on | and then separate data in 2 columns using tidyr::separate function.
library(dplyr)
library(tidyr)
data.frame(TEST) %>%
mutate(TEST = gsub("\\|$","",TEST)) %>%
separate_rows(TEST, sep = "[|]") %>%
separate(TEST, c("Variable", "Value"), ":")
# Variable Value
# 1 Value01 100
# 2 Value02 200
# 3 Value03 300
# 4 Value04 1
# 5 Value05 2
# 6 StillAValueButNamesAreNotConsistent 12345.6789
# 7 AlsoNotAllLinesAreTheSameLength 1
We can do it in base R with one line. Just change the | characters to line breaks then use : as the sep value in read.table(). You can also set column names there too.
read.table(text = gsub("\\|", "\n", TEST), sep = ":",
col.names = c("Variable", "Value"))
# Variable Value
# 1 Value01 100.00
# 2 Value02 200.00
# 3 Value03 300.00
# 4 Value04 1.00
# 5 Value05 2.00
# 6 StillAValueButNamesAreNotConsistent 12345.68
# 7 AlsoNotAllLinesAreTheSameLength 1.00
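The 12345.68 in the printed output is only display rounding, since R prints 7 significant digits by default; the full value is still stored. A short check, assuming the same TEST vector (res is an illustrative name):

```r
TEST <- c("Value01:100|Value02:200|Value03:300|", "Value04:1|Value05:2|",
          "StillAValueButNamesAreNotConsistent:12345.6789|",
          "AlsoNotAllLinesAreTheSameLength:1|")

res <- read.table(text = gsub("\\|", "\n", TEST), sep = ":",
                  col.names = c("Variable", "Value"))

# Print the stored value with more significant digits than the default
print(res$Value[6], digits = 10)
```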
With Base R:
(I've broken out each step to hopefully make the code clear)
# your data
myvec <- c("Value01:100|Value02:200|Value03:300|","Value04:1|Value05:2|",
"StillAValueButNamesAreNotConsistent:12345.6789|",
"AlsoNotAllLinesAreTheSameLength:1|")
# convert into one long string
all_text_str <- paste0(myvec, collapse="")
# split the string by "|"
all_text_vec <- unlist(strsplit(all_text_str, split="\\|"))
# split each "|"-group by ":"
data_as_list <- strsplit(all_text_vec, split=":")
# collect into a dataframe
df <- as.data.frame(do.call(rbind, data_as_list), stringsAsFactors = FALSE)
# clean up the dataframe by adding names and converting value to numeric
names(df) <- c("variable", "value")
df$value <- as.numeric(df$value)
With help of strsplit and unlist function. Each command is shown with output below.
Input
TEST
# [1] "Value01:100|Value02:200|Value03:300|"
# [2] "Value04:1|Value05:2|"
# [3] "StillAValueButNamesAreNotConsistent:12345.6789|"
# [4] "AlsoNotAllLinesAreTheSameLength:1|"
Splitting by | and then by :
my_list <- strsplit(unlist(strsplit(TEST, "|", fixed = TRUE)), ":", fixed = TRUE)
my_list
# [[1]]
# [1] "Value01" "100"
# [[2]]
# [1] "Value02" "200"
# [[3]]
# [1] "Value03" "300"
# [[4]]
# [1] "Value04" "1"
# [[5]]
# [1] "Value05" "2"
# [[6]]
# [1] "StillAValueButNamesAreNotConsistent" "12345.6789"
# [[7]]
# [1] "AlsoNotAllLinesAreTheSameLength" "1"
Converting above list to data.frame
df <- data.frame(matrix(unlist(my_list), ncol = 2, byrow=TRUE))
df
# X1 X2
# 1 Value01 100
# 2 Value02 200
# 3 Value03 300
# 4 Value04 1
# 5 Value05 2
# 6 StillAValueButNamesAreNotConsistent 12345.6789
# 7 AlsoNotAllLinesAreTheSameLength 1
Colnames to dataframe
names(df) <- c("Variable", "Value")
df
# Variable Value
# 1 Value01 100
# 2 Value02 200
# 3 Value03 300
# 4 Value04 1
# 5 Value05 2
# 6 StillAValueButNamesAreNotConsistent 12345.6789
# 7 AlsoNotAllLinesAreTheSameLength 1
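The Value column above is still text. A compact sketch of the same splitting steps that also converts the values to numeric, assuming every piece has exactly one ":" (parts is an illustrative name):

```r
TEST <- c("Value01:100|Value02:200|Value03:300|", "Value04:1|Value05:2|",
          "StillAValueButNamesAreNotConsistent:12345.6789|",
          "AlsoNotAllLinesAreTheSameLength:1|")

# Split on "|" first, then split each piece on ":"
parts <- strsplit(unlist(strsplit(TEST, "|", fixed = TRUE)), ":", fixed = TRUE)

# First element of each pair is the name, second is the value
df <- data.frame(Variable = vapply(parts, `[`, character(1), 1),
                 Value = as.numeric(vapply(parts, `[`, character(1), 2)),
                 stringsAsFactors = FALSE)
df
```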

Text processing on data frame in r

I have a text file in which data is stored as given below
{{2,3,4},{1,3},{4},{1,2} .....}
I want to remove the brackets and convert it to two column format where first column is bracket number and followed by the term
1 2
1 3
1 4
2 1
2 3
3 4
4 1
4 2
so far i have read the file
tab <- read.table("test.txt",header=FALSE,sep="}")
This gives a dataframe
V1 V2 V3 V4
1 {{2,3,4 {1,3 {4 {1,2 .....
How to proceed ?
We read the file with readLines, split on the braces with strsplit to drop {}, build a two-column data frame with an index, and reshape it to 'long' format with separate_rows
library(tidyverse)
v1 <- setdiff(unlist(strsplit(lines, "[{}]")), c("", ","))
tibble(index = seq_along(v1), Col = v1) %>%
separate_rows(Col, convert = TRUE)
# A tibble: 8 x 2
# index Col
# <int> <int>
#1 1 2
#2 1 3
#3 1 4
#4 2 1
#5 2 3
#6 3 4
#7 4 1
#8 4 2
Or a base R method would be to replace the , after each } with another delimiter, read the groups with scan, split each group by , into a list, and stack it into a two-column data.frame
v1 <- scan(text=gsub("[{}]", "", gsub("},", ";", lines)), what = "", sep=";", quiet = TRUE)
stack(setNames(lapply(strsplit(v1, ","), as.integer), seq_along(v1)))[2:1]
data
lines <- readLines(textConnection("{{2,3,4},{1,3},{4},{1,2}}"))
#reading from file
lines <- readLines("yourfile.txt")
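Another base R variant, sketched under the assumption that each inner group is a comma-separated run of integers with no spaces, extracts the groups directly with a regular expression. The names groups, terms and res are illustrative.

```r
lines <- "{{2,3,4},{1,3},{4},{1,2}}"

# A group starts with a digit and continues with digits or commas;
# the commas between groups are skipped because they follow a "}"
groups <- regmatches(lines, gregexpr("[0-9][0-9,]*", lines))[[1]]

# Split each group into its terms
terms <- strsplit(groups, ",", fixed = TRUE)

# Repeat the group number once per term and flatten the terms
res <- data.frame(index = rep(seq_along(terms), lengths(terms)),
                  term = as.integer(unlist(terms)))
res
```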
Data:
tab <- read.table(text=' V1 V2 V3 V4
1 {{2,3,4 {1,3 {4 {1,2
2 {{2,3,4 {1,3 {4 {1,2 ')
Code: using gsub, remove { and split each string by ",", then make a data frame. The column names are removed. Finally, the list of data frames in df1 is combined using rbindlist
df1 <- lapply( seq_along(tab), function(x) {
  temp <- data.frame( x, strsplit( gsub( "{", "", tab[[x]], fixed = TRUE ), split = "," ),
                      stringsAsFactors = FALSE)
  colnames(temp) <- NULL
  temp
} )
Output:
data.table::rbindlist(df1)
# V1 V2 V3
# 1: 1 2 2
# 2: 1 3 3
# 3: 1 4 4
# 4: 2 1 1
# 5: 2 3 3
# 6: 3 4 4
# 7: 4 1 1
# 8: 4 2 2
