R Regex searching text between delimiter - r

I have a data file which has text in the following format:
"name: alex age: 27 profession: it"
I want to pull the data between ':' (it should exclude the preceding field name before ":" e.g. name, age, and profession are the only corresponding values that should be retrieved. The token names are not same; they can change.)
I want data to be
alex 27 it

We can use gsub to match word (\\w+), then a :, one or more spaces (\\s+) followed by a word captured as a group ((\\w+)) and replace it with the backreference.
gsub("\\w+:\\s+(\\w+)", "\\1", str1)
#[1] "alex 27 it"
NOTE: Here, we assume the pattern of the string is in key: value pair

Using str_split with a negative lookback Regex can you split the text into a vector of three
st <- "name: alex age: 27 profession: it"
str_split(st,"(?<!:) ")
after that it is easy to remove the text that we dont want with gsub
str_split(st,"(?<!:) ") %>% unlist() %>% gsub("^.*: ","",.)
now using the same technic but extracting the names and using setNames we get a named list wich is very comfortable to work with
dta <- setNames(
str_split(st,"(?<!:) ") %>%
unlist() %>%
gsub("^.*: ","",.) %>%
as.list(),
str_split(st,"(?<!:) ") %>%
unlist() %>%
gsub(":.*$","",.))
dta$profession
[1] "it"

A solution with str_extract_all from stringr. This matches alphanumerics ([[:alnum:]]) that are followed by a : and a space (\\s) and ends at a word boundary (\\b):
library(stringr)
str_extract_all(string, "(?<=:\\s)[[:alnum:]]+\\b")[[1]]
# [1] "alex" "27" "it"
or:
paste(str_extract_all(string, "(?<=:\\s)[[:alnum:]]+\\b")[[1]], collapse = " ")
# [1] "alex 27 it"

Related

R splitting string on predefined location

I have string, which should be split into parts from "random" locations. Split occurs always from next comma after colon.
My idea was to find colons with
stringr::str_locate_all(test, ":") %>%
unlist()
then find commas
stringr::str_locate_all(test, ",") %>%
unlist()
and from there to figure out position where it should be split up, but could not find suitable way to do it. Feels like there is always 6 characters after colon before the comma, but I can't be sure about that for whole data.
Here is example string:
dput(test)
"AA,KK,QQ,JJ,TT,99,88:0.5083,66,55:0.8303,AK,AQ,AJs,AJo:0.9037,ATs:0.0024,ATo:0.5678"
Here is what result should be
dput(result)
c("AA,KK,QQ,JJ,TT,99,88:0.5083", "66,55:0.8303", "AK,AQ,AJs,AJo:0.9037",
"ATs:0.0024", "ATo:0.5678")
Perehaps we can use regmatches like below
> regmatches(test, gregexpr("(\\w+,?)+:[0-9.]+", test))[[1]]
[1] "AA,KK,QQ,JJ,TT,99,88:0.5083" "66,55:0.8303"
[3] "AK,AQ,AJs,AJo:0.9037" "ATs:0.0024"
[5] "ATo:0.5678"
here is one option with strsplit - replace the , after the digit followed by the . and one or more digits (\\d+) with a new delimiter using gsub and then split with strsplit in base R
result1 <- strsplit(gsub("([0-9]\\.[0-9]+),", "\\1;", test), ";")[[1]]
-checking
> identical(result, result1)
[1] TRUE
If the number of characters are fixed, use a regex lookaround
result1 <- strsplit(test, "(?<=:.{6}),", perl = TRUE)[[1]]

reverse the name if it seperate by comma

If there is a first and last name is like "nandan, vivek". I want to display as "vivek nandan".
n<-("nandan,vivek")
result:
[1] vivek nandan
where first name:vivek
last name:nandan
this is the author name.
We can try using sub here:
input <- "nankin,vivek"
sub("([^,]+),\\s*(.*)", "\\2 \\1", input)
[1] "vivek nankin"
The regex pattern used above matches the last name followed by the first name, in separate capture groups. It then replaces with those capture groups, in reverse order, separated by a single space.
An option would be sub to capture the substring that are letters ([a-z]+) followed by a , and again capture the next word ([a-z]+). In the replacement, reverse the order of the backreferences
sub("([a-z]+),([a-z]+)", "\\2 \\1", n)
#[1] "vivek nandan"
A non-regex option would be to split the string and then paste the reversed words
paste(rev(strsplit(n, ",")[[1]]), collapse=" ")
#[1] "vivek nandan"
Or extract the word and paste
library(stringr)
paste(word(n, 2, sep=","), word(n, 1, sep=","))
#[1] "vivek nandan"
data
n<- "nandan,vivek"

Apply a regex only to the first word of a phrase (defined with spaces)

I have this regex to separate letters from numbers (and symbols) of a word: (?<=[a-zA-Z])(?=([[0-9]|[:punct:]])). My test string is: "CALLE15 CRA22".
I want to apply this regex only to the first word of that sentence (the word is defined with spaces). Namely, I want apply that only to "CALLE15".
One solution is split the string (sentence) into words and then apply the regex to the first word, but I want to do all in one regex. Other solution is to use r stringr::str_replace() (or sub()) that replace only the first match, but I need stringr::str_replace_all (or gsub()) for other reasons.
What I need is to insert a space between the two that I do with the replacement function. The outcome I want is "CALLE 15 CRA22" and with the posibility of "CALLE15 CRA 22". I try a lot of positions for the space and nothing, neither the ^ at the beginning.
https://rubular.com/r/7dxsHdOA3avTdX
Thanks for your help!!!!
I am unsure about your problem statement (see my comment above), but the following reproduces your expected output and uses str_replace_all
ss <- "CALLE15 CRA22"
library(stringr)
str_replace_all(ss, "^([A-Za-z]+)(\\d+)(\\s.+)$", "\\1 \\2\\3")
#[1] "CALLE 15 CRA22"
Update
To reproduce the output of the sample string from the comment above
ss <- "CLL.6 N 5-74NORTE"
pat <- c(
"(?<=[A-Za-z])(?![A-Za-z])",
"(?<![A-Za-z])(?=[A-Za-z])",
"(?<=[0-9])(?![0-9])",
"(?<![0-9])(?=[0-9])")
library(stringr)
str_split(ss, sprintf("(%s)", paste(pat, collapse = "|"))) %>%
unlist() %>%
.[nchar(trimws(.)) > 0] %>%
paste(collapse = " ")
#[1] "CLL . 6 N 5 - 74 NORTE"

regex commas not between two numbers

I am looking for a regex for gsub to remove all the unwanted commas:
Data:
,,,,,,,12345
12345,1345,1354
123,,,,,,
12345,
,12354
Desired result:
12345
12345,1345,1354
123
12345
12354
This is the progress I have made so far:
(,(?!\d+))
You seem to want to remove all leading and trailing commas.
You may do it with
gsub("^,+|,+$", "", x)
See the regex demo
The regex contans two alternations, ^,+ matches 1 or more commas at the start and ,+$ matches 1+ commas at the end, and gsub replaces these matches with empty strings.
See R demo
x <- c(",,,,,,,12345","12345,1345,1354","123,,,,,,","12345,",",12354")
gsub("^,+|,+$", "", x)
## [1] "12345" "12345,1345,1354" "123" "12345"
## [5] "12354"
You can also use str_extract from stringr. Thanks to greedy matching, you don't have to specify how many times a digit occurs, the longest match is automatically chosen:
library(dplyr)
library(stringr)
df %>%
mutate(V1 = str_extract(V1, "\\d.+\\d"))
or if you prefer base R:
df$V1 = regmatches(df$V1, gregexpr("\\d.+\\d", df$V1))
Result:
V1
1 12345
2 12345,1345,1354
3 123
4 12345
5 12354
Data:
df = read.table(text = ",,,,,,,12345
12345,1345,1354
123,,,,,,
12345,
,12354")

removing spaces outside quotes in r

Rearranging simpsons names with R to follow first name, last name format but there are large spaces between the names, is it possible to remove spaces outside the quoted names?
library(stringr)
simpsons <- c("Moe Syzlak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy", "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")
reorder <- sapply(sapply(str_split(simpsons, ","), str_trim),rev)
for (i in 1:length(name) ) {
splitname[i]<- paste(unlist(splitname[i]), collapse = " ")
}
splitname <- unlist(splitname)
If we need to rearrange the first name followed by last name, we could use sub. We capture one or more than character which is not a , in a group, followed by , followed by 0 or more space (\\s*), capture one or more characters that are not a , as the 2nd group, and in the replacement reverse the backreference to get the output.
sub("([^,]+),\\s*([^,]+)", "\\2 \\1", simpsons)
#[1] "Moe Syzlak" "C. Montgomery Burns" "Rev. Timothy Lovejoy" "Ned Flanders" "Homer Simpson" "Dr. Julius Hibbert"

Resources