extracting the second last word between the special characters "/" - r

I would like to extract the second last string after the '/' symbol. For example,
url<- c('https://example.com/names/ani/digitalcod-org','https://example.com/names/bmc/ambulancecod.org' )
df<- data.frame (url)
I want to extract the second word from the last between the two // and would like to get the words 'ani' and 'bmc'
so, I tried this
library(stringr)
df$name<- word(df$url,-2)
I need output which as follows:
name
ani
bmc

You can use word but you need to specify the separator,
library(stringr)
word(url, -2, sep = '/')
#[1] "ani" "bmc"

Try this:
as.data.frame(sapply(str_extract_all(df$url,"\\w{2,}(?=\\/)"),"["))[3,]
# V1 V2
#3 ani bmc
as.data.frame(sapply(str_extract_all(df$url,"\\w{2,}(?=\\/)"),"["))[2:3,]
# V1 V2
#2 names names
#3 ani bmc

Use gsub with
.*?([^/]+)/[^/]+$
In R:
urls <- c('https://example.com/names/ani/digitalcod-org','https://example.com/names/bmc/ambulancecod.org' )
gsub(".*?([^/]+)/[^/]+$", "\\1", urls)
This yields
[1] "ani" "bmc"
See a demo on regex101.com.

Here is a solution using strsplit
words <- strsplit(url, '/')
L <- lengths(words)
vapply(seq_along(words), function (k) words[[k]][L[k]-1], character(1))
# [1] "ani" "bmc"

A non-regex approach using basename
basename(mapply(sub, pattern = basename(url), replacement = "", x = url, fixed = TRUE))
#[1] "ani" "bmc"
basename(url) "removes all of the path up to and including the last path separator (if any)" and returns
[1] "digitalcod-org" "ambulancecod.org"
use mapply to replace this outcome for every element in url by "" and call basename again.

Related

How to extract everything after a specific string?

I'd like to extract everything after "-" in vector of strings in R.
For example in :
test = c("Pierre-Pomme","Jean-Poire","Michel-Fraise")
I'd like to get
c("Pomme","Poire","Fraise")
Thanks !
With str_extract. \\b is a zero-length token that matches a word-boundary. This includes any non-word characters:
library(stringr)
str_extract(test, '\\b\\w+$')
# [1] "Pomme" "Poire" "Fraise"
We can also use a back reference with sub. \\1 refers to string matched by the first capture group (.+), which is any character one or more times following a - at the end:
sub('.+-(.+)', '\\1', test)
# [1] "Pomme" "Poire" "Fraise"
This also works with str_replace if that is already loaded:
library(stringr)
str_replace(test, '.+-(.+)', '\\1')
# [1] "Pomme" "Poire" "Fraise"
Third option would be using strsplit and extract the second word from each element of the list (similar to word from #akrun's answer):
sapply(strsplit(test, '-'), `[`, 2)
# [1] "Pomme" "Poire" "Fraise"
stringr also has str_split variant to this:
str_split(test, '-', simplify = TRUE)[,2]
# [1] "Pomme" "Poire" "Fraise"
We can use sub to match characters (.*) until the - and in the replacement specify ""
sub(".*-", "", test)
Or another option is word
library(stringr)
word(test, 2, sep="-")
I think the other answers might be what you're looking for, but if you don't want to lose the original context you can try something like this:
library(tidyverse)
tibble(test) %>%
separate(test, c("first", "last"), remove = F)
This will return a dataframe containing the original strings plus components, which might be more useful down the road:
# A tibble: 3 x 3
test first last
<chr> <chr> <chr>
1 Pierre-Pomme Pierre Pomme
2 Jean-Poire Jean Poire
3 Michel-Fraise Michel Fraise
For some reason the responses here didn't work for my particular string. I found this response more helpful (i.e., using Stringr's lookbehind function): stringr str_extract capture group capturing everything.

Extract substring in R using grepl

I have a table with a string column formatted like this
abcdWorkstart.csv
abcdWorkcomplete.csv
And I would like to extract the last word in that filename. So I think the beginning pattern would be the word "Work" and ending pattern would be ".csv". I wrote something using grepl but not working.
grepl("Work{*}.csv", data$filename)
Basically I want to extract whatever between Work and .csv
desired outcome:
start
complete
I think you need sub or gsub (substitute/extract) instead of grepl (find if match exists). Note that when not found, it will return the entire string unmodified:
fn <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')
out <- sub(".*Work(.*)\\.csv$", "\\1", fn)
out
# [1] "start" "complete" "abcdNothing.csv"
You can work around this by filtering out the unchanged ones:
out[ out != fn ]
# [1] "start" "complete"
Or marking them invalid with NA (or something else):
out[ out == fn ] <- NA
out
# [1] "start" "complete" NA
With str_extract from stringr. This uses positive lookarounds to match any character one or more times (.+) between "Work" and ".csv":
x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")
library(stringr)
str_extract(x, "(?<=Work).+(?=\\.csv)")
# [1] "start" "complete"
Just as an alternative way, remove everything you don't want.
x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")
gsub("^.*Work|\\.csv$", "", x)
#[1] "start" "complete"
please note:
I have to use gsub. Because I first remove ^.*Work then \\.csv$.
For [\\s\\S] or \\d\\D ... (does not work with [g]?sub)
https://regex101.com/r/wFgkgG/1
Works with akruns approach:
regmatches(v1, regexpr("(?<=Work)[\\s\\S]+(?=[.]csv)", v1, perl = T))
str1<-
'12
.2
12'
gsub("[^.]","m",str1,perl=T)
gsub(".","m",str1,perl=T)
gsub(".","m",str1,perl=F)
. matches also \n when using the R engine.
Here is an option using regmatches/regexpr from base R. Using a regex lookaround to match all characters that are not a . after the string 'Work', extract with regmatches
regmatches(v1, regexpr("(?<=Work)[^.]+(?=[.]csv)", v1, perl = TRUE))
#[1] "start" "complete"
data
v1 <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')

String between first two (.dots)

Hi have data which contains two or more dots. My requirement is to get string from first to second dot.
E.g string <- "abcd.vdgd.dhdsg"
Result expected =vdgd
I have used
pt <-strapply(string, "\\.(.*)\\.", simplify = TRUE)
which is giving correct data but for string having more than two dots its not working as expected.
e.g string <- "abcd.vdgd.dhdsg.jsgs"
its giving dhdsg.jsgs but expected is vdgd
Could anyone help me.
Thanks & Regards,
In base R we can use strsplit
ss <- "abcd.vdgd.dhdsg"
unlist(strsplit(ss, "\\."))[2]
#[1] "vdgd"
Or using gregexpr with regmatches
unlist(regmatches(ss, gregexpr("[^\\.]+", ss)))[2]
#[1] "vdgd"
Or using gsub (thanks #TCZhang)
gsub("^.+?\\.(.+?)\\..*$", "\\1", ss)
#[1] "vdgd"
Another option:
string <- "abcd.vdgd.dhdsg.jsgs"
library(stringr)
str_extract(string = string, pattern = "(?<=\\.).*?(?=\\.)")
[1] "vdgd"
I like this one because the str_extract function will return the first instance of the correct pattern, but you could also use str_extract_all to get all instances.
str_extract_all(string = string, pattern = "(?<=\\.).*?(?=\\.)")
[[1]]
[1] "vdgd" "dhdsg"
From here, you could index to get any position between two dots you want.
Another solution with the qdapRegex package:
library(qdapRegex)
ex_between("abcd.vdgd.dhdsg.jsgs", ".", ".")[[1]][1]
# "vdgd"
You can use read.table as well if you wish.Here providing the string as given in your problem and selecting the separator as dot("."), Once the column is converted into a data.frame, you may choose to select whatever column you want to pick(In this case it is column number 2).
read.table(text=string, sep=".",stringsAsFactors = FALSE)[,2]
Output:
> read.table(text=string, sep=".",stringsAsFactors = FALSE)[,2]
[1] "vdgd"
Here is a fun easy way via stringr
stringr::word(string, 2, sep = '\\.')
Here are two options that are vectorized over the input string vector:
You can try tstrsplit from data.table, which is vectorized over string:
> string <- c("abcd.vdgd.dhdsg", "abcd.vdgd.dhdsg.jsgs")
> tstrsplit(string, '.', fixed = TRUE)[[2]]
[1] "vdgd" "vdgd"
or regex:
> sub('.*?\\.(.*?)\\..*', '\\1', string)
[1] "vdgd" "vdgd"`

How to extract substring from a string?

There are some strings which show the following pattern
ABC, DEF.JHI
AB,DE.(JH)
Generally, it includes three sections which are separated with , and . The last character can be either normal character or sth like ). I would like to extract the last part. For example, I would like to generate the following two strings based on the above ones
JHI
(JH)
Is there a way to do that in R?
library(stringr)
str1 <- c("ABC, DEF.JHI","AB,DE.(JH)")
str_extract(str1,perl('(?<=\\.).*'))
#[1] "JHI" "(JH)"
(?<=\\.) search for . followed by .* all characters
You can just split on the . using strsplit and extract the second element.
str1 <- c("ABC, DEF.JHI","AB,DE.(JH)")
unlist(lapply(strsplit(str1, "\\."), "[", 2))
# [1] "JHI" "(JH)"
Here's another possibility:
sapply(strsplit(str1, "\\.\\(|\\.|\\)"), "[[", 2)
Riffing on #josiber's answer you could remove the part of the string before the .
str1 <- c("ABC, DEF.JHI","AB,DE.(JH)")
gsub(".*\\.", "", str1)
# [1] "JHI" "(JH)"
EDIT
In case your third element is not always preceded by a ., to extract the final part
str1 <- c("ABC, DEF.JHI","AB,DE.(JH)", "ABC.DE, (JH)")
gsub(".*[,.]", "" , str1)
# [1] "JHI" "(JH)" " (JH)"

Remove part of a string

How do I remove part of a string? For example in ATGAS_1121 I want to remove everything before _.
Use regular expressions. In this case, you can use gsub:
gsub("^.*?_","_","ATGAS_1121")
[1] "_1121"
This regular expression matches the beginning of the string (^), any character (.) repeated zero or more times (*), and underscore (_). The ? makes the match "lazy" so that it only matches are far as the first underscore. That match is replaced with just an underscore. See ?regex for more details and references
You can use a built-in for this, strsplit:
> s = "TGAS_1121"
> s1 = unlist(strsplit(s, split='_', fixed=TRUE))[2]
> s1
[1] "1121"
strsplit returns both pieces of the string parsed on the split parameter as a list. That's probably not what you want, so wrap the call in unlist, then index that array so that only the second of the two elements in the vector are returned.
Finally, the fixed parameter should be set to TRUE to indicate that the split parameter is not a regular expression, but a literal matching character.
If you're a Tidyverse kind of person, here's the stringr solution:
R> library(stringr)
R> strings = c("TGAS_1121", "MGAS_1432", "ATGAS_1121")
R> strings %>% str_replace(".*_", "_")
[1] "_1121" "_1432" "_1121"
# Or:
R> strings %>% str_replace("^[A-Z]*", "")
[1] "_1121" "_1432" "_1121"
Here's the strsplit solution if s is a vector:
> s <- c("TGAS_1121", "MGAS_1432")
> s1 <- sapply(strsplit(s, split='_', fixed=TRUE), function(x) (x[2]))
> s1
[1] "1121" "1432"
Maybe the most intuitive solution is probably to use the stringr function str_remove which is even easier than str_replace as it has only 1 argument instead of 2.
The only tricky part in your example is that you want to keep the underscore but its possible: You must match the regular expression until it finds the specified string pattern (?=pattern).
See example:
strings = c("TGAS_1121", "MGAS_1432", "ATGAS_1121")
strings %>% stringr::str_remove(".+?(?=_)")
[1] "_1121" "_1432" "_1121"
Here the strsplit solution for a dataframe using dplyr package
col1 = c("TGAS_1121", "MGAS_1432", "ATGAS_1121")
col2 = c("T", "M", "A")
df = data.frame(col1, col2)
df
col1 col2
1 TGAS_1121 T
2 MGAS_1432 M
3 ATGAS_1121 A
df<-mutate(df,col1=as.character(col1))
df2<-mutate(df,col1=sapply(strsplit(df$col1, split='_', fixed=TRUE),function(x) (x[2])))
df2
col1 col2
1 1121 T
2 1432 M
3 1121 A

Resources