R - Extract text between symbol or delimiter '/'

I have several vectors, like these ones:
str <- c("AT/FBA/1/12/360/26/SF/96", "AT/RLMW/1/12/360/44/SF/122", "AT/ACR/1/12/362/66/SF/175", "AT/AA/1/12/363/72/SF/281", "AT/BB/1/12/364/90/SF/310", "AT/ANT/1/123/364/92/SF/338")
N.B. each argument between '/' may vary in length (number of characters).
I want to extract the 5th and 6th arguments delimited by the '/'.
For example, in this case:
"360/26", "360/44", "362/66", "363/72", "364/90", "364/92"
I looked at the answers to these similar questions:
Extract text after a symbol in R
Extracting part of string by position in R
I tried to use:
sub("^([^/]+/){4}([^/]+).*", "\\2", str)
but it gives me only the 5th argument, as follows:
[1] "360" "360" "362" "363" "364" "364" "364" "365" "365" "366" "365" "002" "002" "002" "002" "003"
[17] "003" "003" "004" "004" "004" "005"
Then I tried:
scan(text=str, sep="/", what="", quiet=TRUE)[c(5:6)]
but it gives me just the two arguments without the delimiter '/'.

A simple regex solution would be
sub("^([^/]*/){4}([^/]*/[^/]*)/.*", "\\2", str)
returning the desired
[1] "360/26" "360/44" "362/66" "363/72" "364/90"
[6] "364/92"

Use read.table like this:
with(read.table(text = str, sep = "/"), paste(V5, V6, sep = "/"))
## [1] "360/26" "360/44" "362/66" "363/72" "364/90" "364/92"

Will this work:
apply(sapply(strsplit(str, split = '/'), '[', c(5,6)),2, function(x) paste(x, collapse = '/'))
[1] "360/26" "360/44" "362/66" "363/72" "364/90" "364/92"
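The split-and-paste idea generalizes to any set of field positions. Below is a minimal sketch, assuming a hypothetical helper named extract_fields (it is not from any package) and using only two of the example strings:

```r
# Hypothetical helper: keep the fields at positions `idx` and re-join them with `sep`.
extract_fields <- function(x, idx, sep = "/") {
  parts <- strsplit(x, sep, fixed = TRUE)  # one character vector per input string
  # vapply guarantees exactly one string per input element
  vapply(parts, function(p) paste(p[idx], collapse = sep), character(1))
}

str <- c("AT/FBA/1/12/360/26/SF/96", "AT/ANT/1/123/364/92/SF/338")
res <- extract_fields(str, 5:6)
res
# [1] "360/26" "364/92"
```

Note that a string with fewer fields than the requested positions would yield "NA" pieces, so it may be worth checking lengths(parts) first if the input is not uniform.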

Here is a tidyverse solution I thought you could also use:
library(dplyr)
library(tidyr)
str %>%
  as_tibble() %>%
  separate(value, into = LETTERS[1:8], sep = "\\/") %>%
  select(5, 6) %>%
  unite("Extract", c("E", "F"), sep = "/")
# A tibble: 6 x 1
Extract
<chr>
1 360/26
2 360/44
3 362/66
4 363/72
5 364/90
6 364/92

Related

How to extract words between two hyphens?

How to extract all between two hyphens in R
ts = c("az_bna_njh","j_hj_lkiuy","ml_", "_kk")
I need to extract bna,hj,ml, and kk
We can use
sub("^\\w+_(\\w+)_.*", "\\1", trimws(ts, whitespace = "_"))
#[1] "bna" "hj" "ml" "kk"
Or another option is
sub("^\\w+_(\\w+)_.*", "\\1", gsub("^_|_$", "", ts))
Also you can try:
#Data
ts = c("az_bna_njh","j_hj_lkiuy","ml_", "_kk")
#Code
gsub(".*_(.*)\\_.*", "\\1", trimws(ts,whitespace = '_'))
Output:
[1] "bna" "hj" "ml" "kk"
Another way you can try
library(stringr)
str_replace_all(ts, c("^.*_(\\w+)_.*$" = "\\1", "^_|_$" = ""))
#[1] "bna" "hj" "ml" "kk"
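The regex answers above rely on sub() returning the input unchanged when the pattern does not match. A possible alternative that spells the rule out is sketched below, assuming the intended rule is "strip outer underscores, then take the second piece if there is one, otherwise the first" (that rule is inferred from the four examples, not stated in the question):

```r
ts <- c("az_bna_njh", "j_hj_lkiuy", "ml_", "_kk")

# Drop leading/trailing underscores, then split on the remaining ones.
pieces <- strsplit(gsub("^_+|_+$", "", ts), "_", fixed = TRUE)

# Assumed rule: second piece when available, otherwise the only piece.
res <- vapply(pieces, function(p) if (length(p) >= 2) p[2] else p[1], character(1))
res
# [1] "bna" "hj"  "ml"  "kk"
```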

How to replace space with "_" after last slash in a string with R

I have a list of strings, and for each string, I need to replace all spaces after the last slash with an "_". Here's a minimum reproducible example.
my_list <- list("abc/as 345/as df.pdf", "adf3344/aer4 ffsd.doc", "abc/3455/dfr.xls", "abc/3455/dfr serf_dff.xls", "abc/34 5 5/dfr 345 dsdf 334.pdf")
After doing the replacement, the result should be:
list("abc/as 345/as_df.pdf", "adf3344/aer4_ffsd.doc", "abc/3455/dfr.xls", "abc/3455/dfr_serf_dff.xls", "abc/34 5 5/dfr_345_dsdf_334.pdf")
I thought of matching the text after the last slash using regex, and then replace " " for "_", but didn't find a way to implement it.
It would be something like this:
gsub(pattern, "_", my_list),
in which pattern would be a regex that would be saying: match every space after the last slash (there is at least one slash in every element of the list).
You may use a negative lookahead:
gsub(" (?!.*/.*)", "_", unlist(my_list), perl = TRUE)
# [1] "abc/as 345/as_df.pdf" "adf3344/aer4_ffsd.doc"
# [3] "abc/3455/dfr.xls" "abc/3455/dfr_serf_dff.xls"
# [5] "abc/34 5 5/dfr_345_dsdf_334.pdf"
Here we match and replace only those spaces that have no slash anywhere after them.
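To see which spaces the lookahead selects, it can help to inspect the match positions on a single string. This is only an illustrative sketch; the pattern " (?!.*/)" used here is a slightly shortened form that should be equivalent to " (?!.*/.*)":

```r
x <- "abc/as 345/as df.pdf"

# Positions of spaces that are NOT followed by a slash later in the string.
hits <- gregexpr(" (?!.*/)", x, perl = TRUE)[[1]]
pos <- as.integer(hits)
pos
# [1] 14

# The space at position 7 is skipped because a "/" still follows it.
res <- gsub(" (?!.*/)", "_", x, perl = TRUE)
res
# [1] "abc/as 345/as_df.pdf"
```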
You can use dirname, basename and file.path :
as.list(file.path(
dirname(unlist(my_list)),
gsub(" ", "_", basename(unlist(my_list)))
))
# [[1]]
# [1] "abc/as 345/as_df.pdf"
#
# [[2]]
# [1] "adf3344/aer4_ffsd.doc"
#
# [[3]]
# [1] "abc/3455/dfr.xls"
#
# [[4]]
# [1] "abc/3455/dfr_serf_dff.xls"
#
# [[5]]
# [1] "abc/34 5 5/dfr_345_dsdf_334.pdf"
or a bit more efficient and compact :
as.list(file.path(
dirname(. <- unlist(my_list)),
gsub(" ", "_", basename(.))
))
Here's a thought. First, split by slash:
l2 <- strsplit(unlist(my_list), "/")
l2
# [[1]]
# [1] "abc" "as 345" "as df.pdf"
# [[2]]
# [1] "adf3344" "aer4 ffsd.doc"
# [[3]]
# [1] "abc" "3455" "dfr.xls"
# [[4]]
# [1] "abc" "3455" "dfr serf_dff.xls"
# [[5]]
# [1] "abc" "34 5 5" "dfr 345 dsdf 334.pdf"
Now we do a gsub on just the last element of each split-string, recombining with slashes:
mapply(function(a,i) paste(c(a[-i], gsub(" ", "_", a[i])), collapse="/"),
l2, lengths(l2), SIMPLIFY=FALSE)
# [[1]]
# [1] "abc/as 345/as_df.pdf"
# [[2]]
# [1] "adf3344/aer4_ffsd.doc"
# [[3]]
# [1] "abc/3455/dfr.xls"
# [[4]]
# [1] "abc/3455/dfr_serf_dff.xls"
# [[5]]
# [1] "abc/34 5 5/dfr_345_dsdf_334.pdf"
Here's a solution that uses the gsubfn package.
You use the regex (/[^/]+)$ to find the content following the last slash and you edit that content with a function that converts spaces to underscores.
library(gsubfn)
change_space_to_underscore <- function(x) gsub(x = x, pattern = "[[:space:]]+", replacement = "_")
gsubfn(x = my_list,
pattern = "(/[^/]+)$",
replacement = change_space_to_underscore)
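The same "edit only the part after the last slash" idea can also be sketched in base R with regexpr and the replacement form of regmatches. This is an assumed alternative rather than what gsubfn does internally, and it replaces each space individually instead of collapsing runs of whitespace; the vector name files is made up for the example:

```r
files <- c("abc/as 345/as df.pdf", "abc/3455/dfr.xls",
           "abc/34 5 5/dfr 345 dsdf 334.pdf")

# Locate the trailing run of non-slash characters (the file name part).
m <- regexpr("[^/]+$", files)

# Overwrite just that matched part with its underscore version.
regmatches(files, m) <- gsub(" ", "_", regmatches(files, m), fixed = TRUE)
files
# [1] "abc/as 345/as_df.pdf"            "abc/3455/dfr.xls"
# [3] "abc/34 5 5/dfr_345_dsdf_334.pdf"
```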

Splitting Character String then Pasting it together

I was manipulating my count-data (fcm) and had my Barcode ID's as column names in the format: TCGA.BH.A0DQ.11A.12R.A089.07 etc
I proceeded to use:
CountCol= colnames(fcm)
Barcode = strsplit(as.character(CountCol), ".", fixed=TRUE)
giving me a list of all the split character strings such as :
head(Barcode,2)
[[1]]
[1] "TCGA" "3C" "AAAU" "01A" "11R" "A41B" "07"
[[2]]
[1] "TCGA" "3C" "AALI" "01A" "11R" "A41B" "07"
My question is now how do I put only the first three elements together to make new column names separated by a "-" (i.e. TCGA-3C-AAAU for the first and so forth for the next ~1200 values)
I hope this was clear.
I tried a few methods but keep coming short of the correct solution.
Try sapply:
sapply(Barcode,function(x){paste(x[1:3],collapse="-")})
You could also use the purrr library for simpler code:
library(purrr)
x <- c("TCGA", "3C", "AAAU", "01A", "11R", "A41B", "07" )
y <- c("TCGA", "3C", "AALI", "01A", "11R", "A41B", "07" )
z <- list(x, y)
purrr::map(z, ~paste(.[1:3], collapse = "-"))
[[1]]
[1] "TCGA-3C-AAAU"
[[2]]
[1] "TCGA-3C-AALI"
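To go all the way from the original column names to the renamed matrix, something like the following may work. It is a sketch on a tiny made-up stand-in for fcm (the real count matrix is not shown in the question) and uses vapply so the result is guaranteed to be a character vector:

```r
# Toy stand-in for the count matrix; only the column names matter here.
fcm <- matrix(1:4, nrow = 2,
              dimnames = list(NULL, c("TCGA.3C.AAAU.01A.11R.A41B.07",
                                      "TCGA.3C.AALI.01A.11R.A41B.07")))

parts <- strsplit(colnames(fcm), ".", fixed = TRUE)

# Join the first three pieces of each barcode with "-".
colnames(fcm) <- vapply(parts, function(p) paste(p[1:3], collapse = "-"),
                        character(1))
colnames(fcm)
# [1] "TCGA-3C-AAAU" "TCGA-3C-AALI"
```

Shortened names may no longer be unique, so checking anyDuplicated(colnames(fcm)) afterwards could be worthwhile.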

Splitting CamelCase in R

Is there a way to split camel case strings in R?
I have attempted:
string.to.split = "thisIsSomeCamelCase"
unlist(strsplit(string.to.split, split="[A-Z]") )
# [1] "this" "s" "ome" "amel" "ase"
string.to.split = "thisIsSomeCamelCase"
gsub("([A-Z])", " \\1", string.to.split)
# [1] "this Is Some Camel Case"
strsplit(gsub("([A-Z])", " \\1", string.to.split), " ")
# [[1]]
# [1] "this" "Is" "Some" "Camel" "Case"
Comparing Ramnath's answer with mine supports my initial impression that this was an underspecified question.
Tommy and Ramnath deserve upvotes for pointing out [:upper:]:
strsplit(gsub("([[:upper:]])", " \\1", string.to.split), " ")
# [[1]]
# [1] "this" "Is" "Some" "Camel" "Case"
Here is one way to do it
split_camelcase <- function(...){
  strings <- unlist(list(...))
  strings <- gsub("^[^[:alnum:]]+|[^[:alnum:]]+$", "", strings)
  strings <- gsub("(?!^)(?=[[:upper:]])", " ", strings, perl = TRUE)
  return(strsplit(tolower(strings), " ")[[1]])
}
split_camelcase("thisIsSomeGood")
# [1] "this" "is" "some" "good"
Here's an approach using a single regex (a Lookahead and Lookbehind):
strsplit(string.to.split, "(?<=[a-z])(?=[A-Z])", perl = TRUE)
## [[1]]
## [1] "this" "Is" "Some" "Camel" "Case"
Here is a one-liner using the gsubfn package's strapply. The regular expression matches either the beginning of the string (^) followed by one or more lower case letters ([[:lower:]]+), or (|) an upper case letter ([[:upper:]]) followed by zero or more lower case letters ([[:lower:]]*). strapply then passes the matched strings to c, which concatenates the individual matches into a vector. As with strsplit, it returns a list, so we take the first component ([[1]]):
library(gsubfn)
strapply(string.to.split, "^[[:lower:]]+|[[:upper:]][[:lower:]]*", c)[[1]]
## [1] "this" "Is" "Some" "Camel" "Case"
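If gsubfn is not available, the same "match the words instead of the gaps" idea can be sketched in base R with gregexpr and regmatches, reusing the regular expression described above:

```r
string.to.split <- "thisIsSomeCamelCase"

# Either a leading lower-case run, or an upper-case letter plus its lower-case tail.
pattern <- "^[[:lower:]]+|[[:upper:]][[:lower:]]*"

# gregexpr finds every match; regmatches extracts them as a list (one entry per input).
words <- regmatches(string.to.split, gregexpr(pattern, string.to.split))[[1]]
words
# [1] "this"  "Is"    "Some"  "Camel" "Case"
```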
I think my other answer is better than the following, but if only a one-liner to split is needed... here we go:
library(snakecase)
unlist(strsplit(to_parsed_case(string.to.split), "_"))
#> [1] "this" "Is" "Some" "Camel" "Case"
The beginning of an answer is to split the string into individual characters:
sp.x <- strsplit(string.to.split, "")
Then find which string positions are upper case:
ind.x <- lapply(sp.x, function(x) which(!tolower(x) == x))
Then use that to split out each run of characters . . .
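One possible way to finish that idea (an assumption about where the answer was heading, not the answerer's own code) is to turn the upper-case indicator into group numbers with cumsum and paste each group back together:

```r
string.to.split <- "thisIsSomeCamelCase"

chars <- strsplit(string.to.split, "")[[1]]

# TRUE wherever a character differs from its lower-case form, i.e. is upper case.
is_upper <- chars != tolower(chars)

# Each upper-case letter starts a new group: 0 0 0 0 1 1 2 2 2 2 ...
groups <- cumsum(is_upper)

# Paste the characters of each group back into a word.
words <- unname(vapply(split(chars, groups), paste, character(1), collapse = ""))
words
# [1] "this"  "Is"    "Some"  "Camel" "Case"
```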
Here is an easy solution via snakecase plus some tidyverse helpers:
install.packages("snakecase")
library(snakecase)
library(magrittr)
library(stringr)
library(purrr)
string.to.split = "thisIsSomeCamelCase"
to_parsed_case(string.to.split) %>%
  str_split(pattern = "_") %>%
  purrr::flatten_chr()
#> [1] "this" "Is" "Some" "Camel" "Case"
GitHub link to snakecase: https://github.com/Tazinho/snakecase

Extracting values after pattern

A beginner question...
I have a list like this:
x <- c("aa=v12, bb=x21, cc=f35", "xx=r53, bb=g-25, yy=h48", "nn=u75, bb=26, gg=m98")
(but many more lines)
I need to extract what is between "bb=" and ",". I.e. I want:
x21
g-25
26
Having read many similar questions here, I suppose it is stringr with str_extract I should use, but somehow I can't get it to work. Thanks for all help.
/Chris
strapply in the gsubfn package can do that. Note that [^,]* matches a string of non-commas.
strapply extracts the back referenced portion (the part within parentheses):
> library(gsubfn)
> strapply(x, "bb=([^,]*)", simplify = TRUE)
[1] "x21" "g-25" "26"
If there are several x vectors then provide them in a list like this:
> strapply(list(x, x), "bb=([^,]*)")
[[1]]
[1] "x21" "g-25" "26"
[[2]]
[1] "x21" "g-25" "26"
An option using regexpr:
> temp = regexpr('bb=[^,]*', x)
> substr(x, temp + 3, temp + attr(temp, 'match.length') - 1)
[1] "x21" "g-25" "26"
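For comparison, two more base R variants of the same extraction are sketched below. Both assume every element contains exactly one "bb=" entry; sub() returns an element unchanged when nothing matches, while regmatches() silently drops it, so the two can differ on messier input:

```r
x <- c("aa=v12, bb=x21, cc=f35", "xx=r53, bb=g-25, yy=h48", "nn=u75, bb=26, gg=m98")

# Variant 1: replace the whole string with the captured group.
res_sub <- sub(".*bb=([^,]*).*", "\\1", x)

# Variant 2: a lookbehind matches only the value, so it can be extracted directly.
res_look <- regmatches(x, regexpr("(?<=bb=)[^,]*", x, perl = TRUE))

res_sub
# [1] "x21"  "g-25" "26"
```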
Here's one solution using the base regex functions in R. First we use strsplit to split on the comma. Then we use grepl to filter only the items that start with bb= and gsub to extract all the characters after bb=.
> x <- c("aa=v12, bb=x21, cc=f35", "xx=r53, bb=g-25, yy=h48", "nn=u75, bb=26, gg=m98")
> y <- unlist(strsplit(x , ","))
> unlist(lapply(y[grepl("bb=", y)], function(x) gsub("^.*bb=(.*)", "\\1", x)))
[1] "x21" "g-25" "26"
It looks like str_replace is the function you are after if you want to go that route:
> library(stringr)
> str_replace(y[grepl("bb=",y)], "^.*bb=(.*)", "\\1")
[1] "x21" "g-25" "26"
Read it in with commas as separators and take the second column:
> x.split <- read.table(textConnection(x), header=FALSE, sep=",", stringsAsFactors=FALSE)[[2]]
> x.split
[1] " bb=x21" " bb=g-25" " bb=26"
Then remove the "bb=" (note that the leading space remains):
> gsub("bb=", "", x.split )
[1] " x21" " g-25" " 26"
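The leading space can be dropped with trimws before removing the prefix. A small sketch along the same lines, using read.table's text argument in place of textConnection and assuming the "bb=" entry is always the second field:

```r
x <- c("aa=v12, bb=x21, cc=f35", "xx=r53, bb=g-25, yy=h48", "nn=u75, bb=26, gg=m98")

# Second comma-separated field of each line, still carrying a leading space.
second <- read.table(text = x, header = FALSE, sep = ",",
                     stringsAsFactors = FALSE)[[2]]

# Trim the whitespace, then strip the "bb=" prefix.
res <- sub("^bb=", "", trimws(second))
res
# [1] "x21"  "g-25" "26"
```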
