Regex: how to keep all digits when splitting a string? - r

Question
Using a regular expression, how do I keep all digits when splitting a string?
Overview
I would like to split each element within the character vector sample.text into two elements: one of only digits and one of only the text.
Current Attempt is Dropping Last Digit
This regular expression - \\d\\s{1} - inside of base::strsplit() removes the last digit. Below is my attempt, along with my desired output.
# load necessary data -----
sample.text <-
c("111110 Soybean Farming", "0116 Soybeans")
# split string by digit and one space pattern ------
strsplit(sample.text, split = "\\d\\s{1}")
# [[1]]
# [1] "11111" "Soybean Farming"
#
# [[2]]
# [1] "011" "Soybeans"
# desired output --------
# [[1]]
# [1] "111110" "Soybean Farming"
#
# [[2]]
# [1] "0116" "Soybeans"
# end of script #
Any advice on how I can split sample.text to keep all digits would be much appreciated! Thank you.

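Because the split pattern includes \\d, the last digit is part of the delimiter, so strsplit() consumes it and it never reaches the output. Use a lookbehind for the digit instead; it asserts that a digit precedes the space without matching the digit itself (lookbehind requires perl=TRUE):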
strsplit(sample.text, split = "(?<=\\d) ", perl=TRUE)
http://rextester.com/GDVFU71820
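To make the difference concrete, here is a minimal sketch that runs both patterns side by side on the question's sample.text. The variable names consumed and kept are mine, and \\s stands in for the literal space used in the answer.

```r
sample.text <- c("111110 Soybean Farming", "0116 Soybeans")

# The digit is part of the delimiter, so strsplit() drops it
consumed <- strsplit(sample.text, split = "\\d\\s")

# The lookbehind only asserts that a digit precedes the whitespace;
# just the whitespace is consumed (lookbehind requires perl = TRUE)
kept <- strsplit(sample.text, split = "(?<=\\d)\\s", perl = TRUE)

print(consumed[[1]])
print(kept[[1]])
```

Either pattern would also split at any later space that follows a digit, which the sample data does not contain.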

Some alternative solutions, using very simple pattern matching on the first occurrence of space:
1) Indirectly by using sub to substitute your own delimiter, then strsplit on your delimiter:
E.g. you can substitute ';' for the first space (if you know that character does not exist in your data):
strsplit( sub(' ', ';', sample.text), split=';')
2) Using regexpr and regmatches
You can effectively match on the first " " (space character), and split as follows:
regmatches(sample.text, regexpr(" ", sample.text), invert = TRUE)
Result is a list, if that's what you are after per your sample desired output:
[[1]]
[1] "111110" "Soybean Farming"
[[2]]
[1] "0116" "Soybeans"
3) Using stringr library:
library(stringr)
str_split_fixed(sample.text, " ", 2) #outputs a character matrix
[,1] [,2]
[1,] "111110" "Soybean Farming"
[2,] "0116" "Soybeans"
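If the goal is two parallel vectors rather than a list, one hedged variant is to anchor on the leading digits with sub(). This is only a sketch: it assumes every element starts with a run of digits followed by whitespace, and the names code and label are hypothetical.

```r
sample.text <- c("111110 Soybean Farming", "0116 Soybeans")

# Keep only the leading run of digits
code <- sub("^(\\d+)\\s+.*$", "\\1", sample.text)

# Drop the leading digits and the whitespace that follows them
label <- sub("^\\d+\\s+", "", sample.text)

print(data.frame(code, label))
```

Because both patterns are anchored with ^, digits or spaces later in the text are left alone.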

Related

Extract a string that spans across multiple lines - stringr

I need to extract a string that spans across multiple lines on an object.
The object:
> text <- paste("abc \nd \ne")
> cat(text)
abc
d
e
With str_extract_all I can extract all the text between ‘a’ and ‘c’, for example.
> str_extract_all(text, "a.*c")
[[1]]
[1] "abc"
Using the function ‘regex’ and the argument ‘multiline’ set to TRUE, I can extract a string across multiple lines. In this case, I can extract the first character of multiple lines.
> str_extract_all(text, regex("^."))
[[1]]
[1] "a"
> str_extract_all(text, regex("^.", multiline = TRUE))
[[1]]
[1] "a" "d" "e"
But when I try to extract "every character between a and d" (a regex that spans multiple lines), the output is "character(0)".
> str_extract_all(text, regex("a.*d", multiline = TRUE))
[[1]]
character(0)
The desired output is:
“abcd”
How to get it with stringr?
dplyr:
library(dplyr)
library(stringr)
data.frame(text) %>%
mutate(new = lapply(str_extract_all(text, "(?!e)\\w"), paste0, collapse = ""))
text new
1 abc \nd \ne abcd
Here we use the character class \\w, which does not include the new line metacharacter \n. The negative lookahead (?!e) makes sure the e is not matched.
Without dplyr (this still uses str_extract_all from stringr):
unlist(lapply(str_extract_all(text, "(?!e)\\w"), paste0, collapse = ""))
[1] "abcd"
str_remove_all(text,"\\s\\ne?")
[1] "abcd"
OR
paste0(trimws(strsplit(text, "\\ne?")[[1]]), collapse="")
[1] "abcd"
The answers above remove line breaks. So, a two-step approach can work to get the desired output 'abcd'.
1 - Use str_remove_all or gsub to remove the line breaks (in this case, also removing blank spaces).
2 - Use str_extract_all to get the desired output ('abcd' in this case).
> text %>%
+ str_remove_all("\\s\\n") %>%
+ str_extract_all("a.*d")
[[1]]
[1] "abcd"
Short regex reference:
\n - new line (return)
\s - any whitespace
\r - carriage return
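The reason "a.*d" returns character(0) in the question is that multiline only changes how ^ and $ behave; it does not let . match a line break. In stringr, regex() has a separate dotall argument for that. A minimal sketch, assuming the same text object as in the question (the names spanned and result are mine):

```r
library(stringr)

text <- "abc \nd \ne"

# dotall = TRUE lets . match "\n", so the match can span lines
spanned <- str_extract(text, regex("a.*d", dotall = TRUE))

# The match still contains the space and the line break; strip whitespace
result <- str_remove_all(spanned, "\\s")
print(result)
```

This keeps the extraction and the cleanup as two separate steps, in the same spirit as the two-step approach above, but extracts first and cleans second.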
Update:
In base R to get the desired output abcd:
text <- gsub("[\r\n]|[[:blank:]]", "", text)
substr(text,1, nchar(text)-1)
[1] "abcd"
First answer:
We can use gsub:
gsub("[\r\n]|[[:blank:]]", "", text)
[1] "abcde"

Removing unwanted parts of strings in a list, and combining the pieces into a single string in R

I am trying to take a list of strings, remove everything except capital letters, and output a list of strings without any spaces or breaks.
Unfortunately, I have been trying to use str_extract_all(), but whenever the original string contains characters other than capital letters, it returns the relevant pieces of that string as separate elements of a character vector.
Can anyone please suggest a way to get the desired output?
# Some example data:
a <- list("n[28.0313]MVNNGHSFNVEYDDSQDK[28.0313]AVLK[28.0313]D_+4",
"SLGKVGTRC[71.0371]CTK[28.0313]PESER_+4",
"n[28.0313]AVVQDPALK[28.0313]PLALVY_+3",
"n[28.0313]TCVADESHAGC[71.0371]EK[28.0313]_+2")
# The desired output:
list("MVNNGHSFNVEYDDSQDKAVLKD",
"SLGKVGTRCCTKPESER",
"AVVQDPALKPLALVY",
"TCVADESHAGCEK")
# What I've tried so far:
a %>% str_extract_all("[A-Z]+")
[[1]]
[1] "MVNNGHSFNVEYDDSQDK" "AVLK" "D"
[[2]]
[1] "SLGKVGTRC" "CTK" "PESER"
[[3]]
[1] "AVVQDPALK" "PLALVY"
[[4]]
[1] "TCVADESHAGC" "EK"
# Not what I want.
I need to find a way to isolate the strings and combine them, but I'm at the limit of my R knowledge.
As it is a list of multiple elements, we can just paste it together by looping over the list
library(dplyr)
library(stringr)
library(purrr)
a %>%
str_extract_all("[A-Z]+") %>%
map_chr(str_c, collapse="")
-output
[1] "MVNNGHSFNVEYDDSQDKAVLKD" "SLGKVGTRCCTKPESER"
[3] "AVVQDPALKPLALVY" "TCVADESHAGCEK"
Or just use gsub to match all characters other than the upper case and replace with blank
gsub("[^A-Z]+", "", a)
[1] "MVNNGHSFNVEYDDSQDKAVLKD" "SLGKVGTRCCTKPESER" "AVVQDPALKPLALVY" "TCVADESHAGCEK"
or with str_remove_all
str_remove_all(a, "[^A-Z]+")
[1] "MVNNGHSFNVEYDDSQDKAVLKD" "SLGKVGTRCCTKPESER" "AVVQDPALKPLALVY" "TCVADESHAGCEK"
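The output is a character vector. To get a list with one string per element, as in the desired output, convert it with as.list() (wrapping it in list() would instead give a list of length one holding the whole vector):
as.list(str_remove_all(a, "[^A-Z]+"))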
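For readers without stringr or purrr, the extract-then-paste idea can be sketched in base R with gregexpr() and regmatches(). This is an illustrative equivalent under the same "[A-Z]+" pattern, not the answer's own code; the names pieces and joined are mine, and only two of the question's strings are used to keep it short.

```r
a <- list("n[28.0313]AVVQDPALK[28.0313]PLALVY_+3",
          "n[28.0313]TCVADESHAGC[71.0371]EK[28.0313]_+2")

x <- unlist(a)

# One character vector of capital-letter runs per input string
pieces <- regmatches(x, gregexpr("[A-Z]+", x))

# Collapse each vector of runs into a single string
joined <- vapply(pieces, paste, character(1), collapse = "")

print(as.list(joined))
```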

Remove run-length duplicate numbers and NA in strings

I have columns with large strings of decimal numbers and NA:
df <- data.frame(
A_gsr =c("2.752,2.752,2.752,2.752,2.752,2.752,2.752,2.911,2.911,3.555",
"2.999,2.999,2.999,2.752,2.752,2.752,2.752"),
B_gsr = c("1.34,1.34,1.34,1.55,1.55,1.55,1.55,1.55,1.55,1.55",
"1.56,1.56,1.56,1.55,1.55,1.55,1.55,NA,NA,NA,NA,1.34,1.34,1.34"),
C_gsr = c("NA,NA,NA,0.147,0.147,0.147,0.147,0.147,NA",
"0.146,0.146,0.146,0.146,0.146,0.146,0.146,0.146,0.146,0.146")
)
I want to remove all run-length duplicates. Using gsub and backreference, I'm getting pretty close to what I want to have:
lapply(df[,1:3], function(x) gsub("((\\d\\.\\d+,)|(NA,))\\1+", "\\1", x))
$A_gsr
[1] "2.752,2.911,3.555" "2.999,2.752,2.752"
$B_gsr
[1] "1.34,1.55,1.55" "1.56,1.55,NA,1.34,1.34"
$C_gsr
[1] "NA,0.147,NA" "0.146,0.146"
However, not close enough - there are still some run-length dups, all at the end of the strings. The expected result is this:
$A_gsr
[1] "2.752,2.911,3.555" "2.999,2.752"
$B_gsr
[1] "1.34,1.55" "1.56,1.55,NA,1.34"
$C_gsr
[1] "NA,0.147,NA" "0.146"
You can use
lapply(df[,1:3], function(x) gsub("\\b(\\d+\\.\\d+|NA)(?:,\\1)+\\b", "\\1", x))
## => $A_gsr
## [1] "2.752,2.911,3.555" "2.999,2.752"
##
## $B_gsr
## [1] "1.34,1.55" "1.56,1.55,NA,1.34"
##
## $C_gsr
## [1] "NA,0.147,NA" "0.146"
See the regex demo and the R demo online.
Details:
\b - a word boundary
(\d+\.\d+|NA) - Group 1: one or more digits, ., one or more digits, OR NA string
(?:,\1)+ - one or more repetitions of a comma and the value in Group 1
\b - a word boundary
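As a sanity check on the pattern described above, the sketch below applies it to two shortened strings and compares the result with a non-regex cross-check based on rle(). The rle() route is my own swapped-in technique, not part of the answer, and I pass perl = TRUE here by choice, whereas the answer's call uses the default engine. The last line shows why the trailing \b matters: without it, "1.5" could be treated as a repeat of the start of "1.55".

```r
s <- c("2.999,2.999,2.999,2.752,2.752,2.752,2.752",
       "1.56,1.56,1.55,1.55,NA,NA,NA,1.34,1.34")

pattern <- "\\b(\\d+\\.\\d+|NA)(?:,\\1)+\\b"
by_regex <- gsub(pattern, "\\1", s, perl = TRUE)

# Cross-check: split on commas and keep one value per run
by_rle <- vapply(strsplit(s, ",", fixed = TRUE),
                 function(v) paste(rle(v)$values, collapse = ","),
                 character(1))

print(by_regex)
print(identical(by_regex, by_rle))

# Distinct values that share a prefix are left untouched
near_miss <- gsub(pattern, "\\1", "1.5,1.55", perl = TRUE)
print(near_miss)
```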

regex strsplit expression in R so it only applies once to the first occurrence of a specific character in each string?

I have a list filled with strings:
string<- c("SPG_L_subgenual_ACC_R", "SPG_R_MTG_L_pole", "MTG_L_pole_CerebellumGM_L")
I need to split the strings so they appear like:
"SPG_L", "subgenual_ACC_R", "SPG_R", "MTG_L_pole", "MTG_L_pole", "CerebellumGM_L"
I tried using the following regex expression to split the strings:
str_split(string,'(?<=[[RL]|pole])_')
But this leads to:
"SPG_L", "subgenual" "ACC_R", "SPG_R", "MTG_L", "pole", "MTG_L", "pole", "CerebellumGM_L"
How do I edit the regular expression so that each string element is split only once: at the "_" after the first occurrence of "R" or "L", unless that first "R" or "L" is followed by "pole", in which case the split should come after the first occurrence of "pole"?
I suggest a matching approach using
^(.*?[RL](?:_pole)?)_(.*)
See the regex demo
Details
^ - start of string
(.*?[RL](?:_pole)?) - Group 1:
.*? - any zero or more chars other than line break chars as few as possible
[RL](?:_pole)? - R or L optionally followed with _pole
_ - an underscore
(.*) - Group 2: any zero or more chars other than line break chars as many as possible
See the R demo:
library(stringr)
x <- c("SPG_L_subgenual_ACC_R", "SPG_R_MTG_L_pole", "MTG_L_pole_CerebellumGM_L", "SFG_pole_R_IFG_triangularis_L", "SFG_pole_R_IFG_opercularis_L" )
res <- str_match_all(x, "^(.*?[RL](?:_pole)?)_(.*)")
lapply(res, function(x) x[-1])
Output:
[[1]]
[1] "SPG_L" "subgenual_ACC_R"
[[2]]
[1] "SPG_R" "MTG_L_pole"
[[3]]
[1] "MTG_L_pole" "CerebellumGM_L"
[[4]]
[1] "SFG_pole_R" "IFG_triangularis_L"
[[5]]
[1] "SFG_pole_R" "IFG_opercularis_L"
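The same matching approach can be sketched in base R with regexec() and regmatches(), for anyone who prefers not to load stringr. This assumes the pattern from the answer unchanged and uses perl = TRUE for the lazy quantifier and the non-capturing group; the names m and parts are mine.

```r
x <- c("SPG_L_subgenual_ACC_R", "SPG_R_MTG_L_pole", "MTG_L_pole_CerebellumGM_L")

# Each list element holds the full match followed by the two groups
m <- regmatches(x, regexec("^(.*?[RL](?:_pole)?)_(.*)", x, perl = TRUE))

# Drop the full match and keep Group 1 and Group 2
parts <- lapply(m, function(v) v[-1])
print(parts)
```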
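library(stringr)  # provides str_split() and re-exports %>%
# Split once after "pole_" if present; otherwise split once after the first "R_" or "L_".
split_again <- function(x) {
  if (length(x) > 1) {
    return(x)  # already split after "pole"
  }
  # [RL] rather than [R|L]: inside a character class, | is a literal pipe
  str_split(string = x, pattern = "(?<=[RL])_", n = 2)
}
str_split(string = string, pattern = "(?<=pole)_", n = 2) %>%
  lapply(split_again) %>%
  unlist()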
You could use sub() and then strsplit(), as shown:
strsplit(sub("^.*?[LR](?:_pole)?\\K_",":",string,perl=TRUE),":")
[[1]]
[1] "SPG_L" "subgenual_ACC_R"
[[2]]
[1] "SPG_R" "MTG_L_pole"
[[3]]
[1] "MTG_L_pole" "CerebellumGM_L"
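The key piece in that pattern is \K, which (with perl=TRUE) discards everything matched so far from the reported match, so sub() replaces only the underscore. A small sketch of the intermediate step, assuming ":" never occurs in the data; the names marked and parts are mine.

```r
string <- c("SPG_L_subgenual_ACC_R", "SPG_R_MTG_L_pole", "MTG_L_pole_CerebellumGM_L")

# Only the "_" after \K is replaced; the prefix before it is kept as-is
marked <- sub("^.*?[LR](?:_pole)?\\K_", ":", string, perl = TRUE)
print(marked)

# fixed = TRUE treats ":" as a literal delimiter rather than a regex
parts <- strsplit(marked, ":", fixed = TRUE)
print(parts)
```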

Extract text in parentheses in R

Two related questions. I have vectors of text data such as
"a(b)jk(p)" "ipq" "e(ijkl)"
and want to easily separate it into a vector containing the text OUTSIDE the parentheses:
"ajk" "ipq" "e"
and a vector containing the text INSIDE the parentheses:
"bp" "" "ijkl"
Is there any easy way to do this? An added difficulty is that these can get quite large and have a large (unlimited) number of parentheses. Thus, I can't simply grab text "pre/post" the parentheses and need a smarter solution.
Text outside the parentheses
> x <- c("a(b)jk(p)" ,"ipq" , "e(ijkl)")
> gsub("\\([^()]*\\)", "", x)
[1] "ajk" "ipq" "e"
Text inside the parentheses
> x <- c("a(b)jk(p)" ,"ipq" , "e(ijkl)")
> gsub("(?<=\\()[^()]*(?=\\))(*SKIP)(*F)|.", "", x, perl=T)
[1] "bp" "" "ijkl"
The (?<=\\()[^()]*(?=\\)) part matches the characters inside the parentheses, and the (*SKIP)(*F) that follows makes that match fail while telling the engine to skip past the text it just covered. The engine then tries the alternative after the | symbol on the rest of the string, so the dot . matches every character that was not skipped. Replacing all of those matched characters with an empty string leaves only the text inside the parentheses.
> gsub("\\(([^()]*)\\)|.", "\\1", x, perl=T)
[1] "bp" "" "ijkl"
This regex captures the characters inside each pair of parentheses in group 1, while the |. alternative matches every other character one at a time. Replacing each match with the contents of group 1 (\\1), which is empty whenever the . alternative matched, gives the desired output.
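A further hedged alternative for the inside text is to extract the matches directly rather than delete everything else. The sketch below reuses the lookaround part of the first pattern with gregexpr() and regmatches(); it assumes parentheses are not nested, and the names hits and inside are mine.

```r
x <- c("a(b)jk(p)", "ipq", "e(ijkl)")

# One character vector of parenthesized contents per input string;
# strings without parentheses yield character(0)
hits <- regmatches(x, gregexpr("(?<=\\()[^()]*(?=\\))", x, perl = TRUE))

# Collapse each vector; an empty vector collapses to ""
inside <- vapply(hits, paste, character(1), collapse = "")
print(inside)
```

Keeping hits as a list is useful when the individual pieces ("b" and "p") are needed separately.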
The rm_round function in the qdapRegex package I maintain was born to do this:
First we'll get and load the package via pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(qdapRegex)
## Then we can use it to remove and extract the parts you want:
x <-c("a(b)jk(p)", "ipq", "e(ijkl)")
rm_round(x)
## [1] "ajk" "ipq" "e"
rm_round(x, extract=TRUE)
## [[1]]
## [1] "b" "p"
##
## [[2]]
## [1] NA
##
## [[3]]
## [1] "ijkl"
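To condense b and p use the following (note that the element with no parentheses comes back as the string "NA" rather than ""):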
sapply(rm_round(x, extract=TRUE), paste, collapse="")
## [1] "bp" "NA" "ijkl"
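If an empty string is wanted instead of "NA", one option is to drop missing values before pasting. The sketch below does not call qdapRegex; the extracted list is hand-built to mirror the shape of the rm_round(x, extract=TRUE) output printed above, which is an assumption on my part, and the name condensed is hypothetical.

```r
# Stand-in for rm_round(x, extract = TRUE), copied from the output shown above
extracted <- list(c("b", "p"), NA_character_, "ijkl")

# Remove NA entries first so that a missing match collapses to ""
condensed <- vapply(extracted,
                    function(v) paste(v[!is.na(v)], collapse = ""),
                    character(1))
print(condensed)
```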
