Extracting a string with gsub in R

I am trying to extract a string between two commas with gsub. If I have the following
xz<- "1620 Honeylocust Drive, 60210 IL, USA"
and I want to extract everything between the two commas (60210 IL), is it possible to use gsub?
I have tried
gsub(".*,","",xz)
The result is USA. How can I do it?

We can match either of two alternatives and replace both with an empty string (""): zero or more non-comma characters ([^,]*) from the start of the string (^), followed by a comma and any whitespace (\\s*); or (|) a comma followed by zero or more non-comma characters ([^,]*) up to the end of the string ($).
gsub("^[^,]*,\\s*|,[^,]*$", "", xz)
#[1] "60210 IL"
Another option is to use sub and capture the wanted part as a group
sub("^[^,]+,\\s+([^,]+).*", "\\1", xz)
#[1] "60210 IL"
Or another option is regexpr/regmatches
regmatches(xz, regexpr("(?<=,\\s)[^,]*(?=,)", xz, perl = TRUE))
#[1] "60210 IL"
Or with str_extract from stringr
library(stringr)
str_extract(xz, "(?<=,\\s)[^,]*(?=,)")
#[1] "60210 IL"
Update
With the new string,
xz1 <- "1620, Honeylocust Drive, 60210 IL, USA"
sub(".*,\\s+([0-9]+[^,]+).*", "\\1", xz1)
#[1] "60210 IL"

You could also do this using strsplit and grep (here I did it in 2 lines for readability):
xz1 <- "1620, Honeylocust Drive, 60210 IL, USA"
a1 <- strsplit(xz1, "[ ]*,[ ]*")[[1]]
grep("^[0-9]+[ ]+[A-Z]+", a1, value=TRUE)
#[1] "60210 IL"
It's not using gsub, and in the present case it is not better, but maybe it is easier to adapt to other situations.
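The split-based idea above can be sketched as a small helper that returns the n-th comma-separated field. This is only an illustration: `extract_field` is a hypothetical name, not a base R function, and it assumes you already know which field position you want (2 for xz, 3 for xz1).

```r
# Hypothetical helper (not part of base R): return the n-th comma-separated
# field of each string, with surrounding whitespace removed.
extract_field <- function(x, n) {
  parts <- strsplit(x, ",", fixed = TRUE)
  vapply(parts, function(p) {
    # Return NA when the string has fewer than n fields
    if (length(p) < n) NA_character_ else trimws(p[n])
  }, character(1))
}

xz  <- "1620 Honeylocust Drive, 60210 IL, USA"
xz1 <- "1620, Honeylocust Drive, 60210 IL, USA"

res_xz  <- extract_field(xz, 2)
res_xz1 <- extract_field(xz1, 3)
print(res_xz)
print(res_xz1)
```

Unlike the regex versions, this does not look at the content of the field, so it only helps when the position of the field is fixed.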

Related

Convert sign in column names if not at certain position in R [duplicate]

I have a character string of names which look like
"_6302_I-PAL_SPSY_000237_001"
I need to remove the first underscore, so that it becomes
"6302_I-PAL_SPSY_000237_001"
I am aware of gsub, but it removes all of the underscores. Thank you for any suggestions.
gsub can do this too: anchor the pattern with ^ so that only an underscore at the start of the string is removed
x <- "_6302_I-PAL_SPSY_000237_001"
x <- gsub("^\\_","",x)
[1] "6302_I-PAL_SPSY_000237_001"
We can use sub with pattern as _ and replacement as blanks (""). This will remove the first occurrence of '_'.
sub("_", "", str1)
#[1] "6302_I-PAL_SPSY_000237_001"
NOTE: This removes the first occurrence of _ wherever it appears; it is not restricted to the start of the string.
For example, suppose we have the string
str2 <- "6302_I-PAL_SPSY_000237_001"
sub("_", "", str2)
#[1] "6302I-PAL_SPSY_000237_001"
As the example has _ at the beginning, another option is substring
substring(str1, 2)
#[1] "6302_I-PAL_SPSY_000237_001"
data
str1 <- "_6302_I-PAL_SPSY_000237_001"
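To make the difference between the anchored and unanchored patterns concrete, here is a minimal sketch that runs both on a string with a leading underscore and on one without. The variable names `anchored` and `unanchored` are only for illustration.

```r
str1 <- "_6302_I-PAL_SPSY_000237_001"  # has a leading underscore
str2 <- "6302_I-PAL_SPSY_000237_001"   # no leading underscore

# "^_" only matches an underscore at the very start of the string
anchored <- sub("^_", "", c(str1, str2))

# "_" matches the first underscore wherever it occurs
unanchored <- sub("_", "", c(str1, str2))

print(anchored)
print(unanchored)
```

With the anchored pattern, str2 comes back unchanged, whereas the unanchored pattern removes its first internal underscore.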
This can be done with base R's trimws() too
string1<-"_6302_I-PAL_SPSY_000237_001"
trimws(string1, which='left', whitespace = '_')
[1] "6302_I-PAL_SPSY_000237_001"
In case we have multiple words with leading underscores, we may have to include a word boundary (\\b) in our regex, and use either gsub or stringr::str_remove_all:
string2<-paste(string1, string1)
string2
[1] "_6302_I-PAL_SPSY_000237_001 _6302_I-PAL_SPSY_000237_001"
library(stringr)
str_remove_all(string2, "\\b_")
[1] "6302_I-PAL_SPSY_000237_001 6302_I-PAL_SPSY_000237_001"

Reverse the name if it is separated by a comma

Suppose a name is given as last name, then first name, like "nandan, vivek". I want to display it as "vivek nandan".
n<-("nandan,vivek")
Expected result:
[1] vivek nandan
where the first name is vivek and the last name is nandan.
This is the author's name.
We can try using sub here:
input <- "nankin,vivek"
sub("([^,]+),\\s*(.*)", "\\2 \\1", input)
[1] "vivek nankin"
The regex pattern used above matches the last name followed by the first name, in separate capture groups. It then replaces with those capture groups, in reverse order, separated by a single space.
Another option is to use sub to capture a run of letters ([a-z]+), followed by a comma, and then capture the next word ([a-z]+). In the replacement, reverse the order of the backreferences
sub("([a-z]+),([a-z]+)", "\\2 \\1", n)
#[1] "vivek nandan"
A non-regex option would be to split the string and then paste the reversed words
paste(rev(strsplit(n, ",")[[1]]), collapse=" ")
#[1] "vivek nandan"
Or extract the word and paste
library(stringr)
paste(word(n, 2, sep=","), word(n, 1, sep=","))
#[1] "vivek nandan"
data
n<- "nandan,vivek"
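The answers above differ in how they treat whitespace around the comma. As a hedged sketch, the capture-group approach can be wrapped in a helper that tolerates optional spaces and leaves strings without a comma untouched. `reverse_name` is a hypothetical name, and the lazy quantifiers assume perl = TRUE.

```r
# Hypothetical helper: swap "last, first" into "first last".
# Lazy groups plus \\s* strip optional whitespace around each part.
reverse_name <- function(x) {
  sub("^\\s*([^,]+?)\\s*,\\s*(.+?)\\s*$", "\\2 \\1", x, perl = TRUE)
}

authors  <- c("nandan,vivek", "nandan, vivek", " nandan , vivek ", "vivek")
reversed <- reverse_name(authors)
print(reversed)
```

The last element has no comma, so the pattern does not match and sub returns it as is.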

Extract only first appearance of number after gsub

I have a string like this:
Sample.ID<-"(<SampleID>, 2213 )"
I am using the following gsub code to extract the numbers from this string:
ID<-as.numeric(gsub("\\D", "", Sample.ID))
This is ok, but sometimes in my data the string is like this:
Sample.ID<-"(<SampleID>, 2213-EQUINOX BELL 2-P, )"
Then I have a problem, as it takes all the digits (i.e. 22132), whereas I just want 2213.
What is the work-around?
Thanks,
Phuong
You can capture the digits and then use a backreference
sub(".*?(\\d+).*", "\\1", Sample.ID)
[1] "2213" "2213"
As your data looks like
Sample.ID<-"(<SampleID>, 2213-EQUINOX BELL 2-P, )"
you can use (?<=, )\d+ to match the number.
The following code matches the whole string and extracts the first group:
gsub(".*(?<=, )(\\d+).*", "\\1", Sample.ID, perl=TRUE)
We can match either of two alternatives and replace both with an empty string (""): zero or more non-comma characters ([^,]*) from the start of the string (^), followed by a comma and one or more spaces (\\s+); or (|) a - or whitespace followed by any remaining characters (.*).
as.numeric(gsub("^[^,]*,\\s+|(-|\\s+).*", "", Sample.ID))
#[1] 2213 2213
If there are no other restrictions, str_extract can be used to extract the first occurrence of a number
library(stringr)
as.numeric(str_extract(Sample.ID, "\\d+"))
#[1] 2213 2213
Or with parse_number from readr
readr::parse_number(Sample.ID)
#[1] 2213 2213
Or a similar option with base R
as.numeric(regmatches(Sample.ID, regexpr("\\d+", Sample.ID)))
#[1] 2213 2213
data
Sample.ID <- c("(<SampleID>, 2213 )", "(<SampleID>, 2213-EQUINOX BELL 2-P, )")
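One caveat with the regmatches/regexpr option is that regmatches drops elements with no match, so the result can be shorter than the input. A minimal sketch that keeps the length and returns NA for strings without digits is shown below. `first_number` is a hypothetical helper name, and the third element of `ids_in` is added here only to illustrate the no-match case.

```r
# Hypothetical helper: first run of digits in each string, as numeric.
# Strings without any digit give NA instead of being dropped.
first_number <- function(x) {
  m   <- regexpr("[0-9]+", x)
  hit <- !is.na(m) & m != -1          # which elements actually matched
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)        # regmatches returns matches only
  as.numeric(out)
}

ids_in <- c("(<SampleID>, 2213 )",
            "(<SampleID>, 2213-EQUINOX BELL 2-P, )",
            "(<SampleID>, )")          # illustrative no-digit case
ids <- first_number(ids_in)
print(ids)
```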

Remove others in a string except a needed word including certain patterns in R

I have a vector of strings, and I would like to remove everything from each string except the word that contains a certain pattern (here, mir).
s <- c("a mir-96 line (kk27)", "mir-133a cell",
"d mir-14-3p in", "m mir133 (sas)", "mir_23_5p r 27")
I want to obtain:
mir-96, mir-133a, mir-14-3p, mir133, mir_23_5p
I know the idea: use gsub() with a pattern that matches a word beginning with (or including) mir.
But I have no idea how to construct such a pattern.
Or is there another way?
Any help will be appreciated!
One way in base R would be splitting every string into words and then extracting only those with mir in it
unlist(lapply(strsplit(s, " "), function(x) grep("mir", x, value = TRUE)))
#[1] "mir-96" "mir-133a" "mir-14-3p" "mir133" "mir_23_5p"
We can save the unlist step in lapply by using sapply, as suggested by @Rich Scriven in the comments
sapply(strsplit(s, " "), function(x) grep("mir", x, value = TRUE))
We can use sub to match zero or more characters (.*), followed by a word boundary (\\b), followed by mir and one or more non-whitespace characters (\\S+), which we capture as a group by wrapping them in (...), followed by any other characters (.*). In the replacement we use the backreference to the captured group (\\1)
sub(".*\\b(mir\\S+).*", "\\1", s)
#[1] "mir-96" "mir-133a" "mir-14-3p" "mir133" "mir_23_5p"
Update
If a string contains multiple 'mir.*' substrings and we want the one that has a numeric part
sub(".*\\b(mir[^0-9]*[0-9]+\\S*).*", "\\1", s1)
#[1] "mir-96" "mir-133a" "mir-14-3p" "mir133" "mir_23_5p" "mir_23-5p"
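If you would rather see every mir token in each string before deciding which one to keep, a hedged alternative is gregexpr with regmatches, which returns a list with all matches per element. The names `all_mir` and `with_digits` are only for illustration.

```r
s1 <- c("a mir-96 line (kk27)", "mir-133a cell", "d mir-14-3p in",
        "m mir133 (sas)", "mir_23_5p r 27", "a mir_23-5p 1 mir-net")

# All tokens that start with "mir" at a word boundary, one list entry per string
all_mir <- regmatches(s1, gregexpr("\\bmir\\S*", s1, perl = TRUE))

# Keep only the tokens that contain at least one digit
with_digits <- lapply(all_mir, function(m) m[grepl("[0-9]", m)])

print(all_mir[[6]])
print(unlist(with_digits))
```

The last string yields two tokens, and the digit filter keeps only the one with a numeric part.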
data
s1 <- c("a mir-96 line (kk27)", "mir-133a cell", "d mir-14-3p in", "m mir133 (sas)",
"mir_23_5p r 27", "a mir_23-5p 1 mir-net")

How to change the word separator character in a vector?

I have a character vector consisting of the following style:
mylist <- c('John Myer Stewert','Steve',' Michael Boris',' Daniel and Frieds','Michael-Myer')
I'm trying to create a character vector like this:
mylist <- c('John+Myer+Stewert','Steve',' Michael+Boris',' Daniel+and+Frieds','Michael+Myer')
I have tried:
test <- cat(paste(shQuote(mylist , type="cmd"), collapse="+"))
That seems wrong. How can I change the word separator in mylist as shown above?
You could use chartr(). Just re-use the + sign for both space and - characters.
chartr(" -", "++", trimws(mylist))
# [1] "John+Myer+Stewert" "Steve" "Michael+Boris"
# [4] "Daniel+and+Frieds" "Michael+Myer"
Note that I also trimmed the leading whitespace since there is really no need to keep it.
We can use gsub by matching the space (" ") as the pattern and replacing it with "+". Note that this only replaces spaces, so the hyphen in "Michael-Myer" is left as it is.
gsub(" ", "+", trimws(mylist))
#[1] "John+Myer+Stewert" "Steve" "Michael+Boris"
#[4] "Daniel+and+Frieds" "Michael-Myer"
I assumed that the leading spaces are a typo. If they are not, we can either use regex lookarounds
gsub("(?<=[a-z])[ -](?=[[:alpha:]])", "+", mylist, perl = TRUE)
#[1] "John+Myer+Stewert" "Steve" " Michael+Boris"
#[4] " Daniel+and+Frieds" "Michael+Myer"
Or some PCRE regex
gsub("(^ | $)(*SKIP)(*F)|[ -]", "+", mylist, perl = TRUE)
#[1] "John+Myer+Stewert" "Steve" " Michael+Boris"
#[4] " Daniel+and+Frieds" "Michael+Myer"
You can use the package stringr.
library(stringr)
str_replace_all(trimws(mylist), "[ -]", "+")
#[1] "John+Myer+Stewert" "Steve" "Michael+Boris"
#[4] "Daniel+and+Frieds" "Michael+Myer"
Between [] we specify what we want to replace with +. In this case, that is a single white space and -. I used trimws from Akrun's answer to get rid of the extra white space in the beginning of some elements in your string.
This is yet another alternative.
library(stringi)
stri_replace_all_regex(trimws(mylist), "[ -]", "+")
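For completeness, the same character class also works with base gsub. The sketch below adds a + quantifier so that runs of separators collapse into a single "+". This is an assumption about what is wanted, and the last element of the vector is a made-up case added only to show that behaviour.

```r
mylist <- c("John Myer Stewert", "Steve", " Michael Boris",
            " Daniel and Frieds", "Michael-Myer",
            "Anna  Maria - Lee")   # illustrative element with repeated separators

# Trim outer whitespace, then replace each run of spaces/hyphens with one "+"
joined <- gsub("[ -]+", "+", trimws(mylist))
print(joined)
```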
