A beginner question...
I have a list like this:
x <- c("aa=v12, bb=x21, cc=f35", "xx=r53, bb=g-25, yy=h48", "nn=u75, bb=26, gg=m98")
(but many more lines)
I need to extract what is between "bb=" and ",". I.e. I want:
x21
g-25
26
Having read many similar questions here, I suppose it is stringr with str_extract I should use, but somehow I can't get it to work. Thanks for all help.
/Chris
strapply in the gsubfn package can do that. Note that [^,]* matches a string of non-commas.
strapply extracts the back referenced portion (the part within parentheses):
> library(gsubfn)
> strapply(x, "bb=([^,]*)", simplify = TRUE)
[1] "x21" "g-25" "26"
If there are several x vectors then provide them in a list like this:
> strapply(list(x, x), "bb=([^,]*)")
[[1]]
[1] "x21" "g-25" "26"
[[2]]
[1] "x21" "g-25" "26"
An option using regexpr:
> temp = regexpr('bb=[^,]*', x)
> substr(x, temp + 3, temp + attr(temp, 'match.length') - 1)
[1] "x21" "g-25" "26"
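For comparison, the same extraction can be sketched in base R with regmatches and a lookbehind. This is only a minimal sketch and not part of the answers above; the name bb is illustrative, and perl = TRUE is needed because the default regex engine does not support lookbehind.

```r
x <- c("aa=v12, bb=x21, cc=f35", "xx=r53, bb=g-25, yy=h48", "nn=u75, bb=26, gg=m98")

# (?<=bb=) requires "bb=" immediately before the match without consuming it;
# [^,]* then takes everything up to the next comma (or the end of the string).
m <- regexpr("(?<=bb=)[^,]*", x, perl = TRUE)
bb <- regmatches(x, m)
print(bb)
```

Note that regmatches drops elements that contain no match, so the result can be shorter than x if some lines lack a bb= entry.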
Here's one solution using the base regex functions in R. First we use strsplit to split on the comma. Then we use grepl to filter only the items that start with bb= and gsub to extract all the characters after bb=.
> x <- c("aa=v12, bb=x21, cc=f35", "xx=r53, bb=g-25, yy=h48", "nn=u75, bb=26, gg=m98")
> y <- unlist(strsplit(x , ","))
> unlist(lapply(y[grepl("bb=", y)], function(x) gsub("^.*bb=(.*)", "\\1", x)))
[1] "x21" "g-25" "26"
It looks like str_replace is the function you are after if you want to go that route:
> library(stringr)
> str_replace(y[grepl("bb=",y)], "^.*bb=(.*)", "\\1")
[1] "x21" "g-25" "26"
Read it in with commas as separators and take the second column (this assumes bb= is always the second field):
> x.split <- read.table(textConnection(x), header=FALSE, sep=",", stringsAsFactors=FALSE)[[2]]
[1] " bb=x21" " bb=g-25" " bb=26"
Then remove the "bb=" together with the leading space that read.table keeps in each field:
> gsub("^ *bb=", "", x.split)
[1] "x21" "g-25" "26"
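If the position of bb= varies from line to line, a lookup by key may be more robust than taking a fixed column. The following is a minimal sketch under that assumption; get_value is a hypothetical helper name, not an existing function.

```r
x <- c("aa=v12, bb=x21, cc=f35", "xx=r53, bb=g-25, yy=h48", "nn=u75, bb=26, gg=m98")

# get_value is a hypothetical helper: it looks up one key in a
# comma-separated "key=value" string, wherever that key appears.
get_value <- function(s, key) {
  fields <- trimws(strsplit(s, ",", fixed = TRUE)[[1]])  # e.g. "bb=x21"
  kv <- strsplit(fields, "=", fixed = TRUE)               # list of c(key, value)
  keys <- vapply(kv, `[`, character(1), 1)
  vals <- vapply(kv, `[`, character(1), 2)
  hit <- vals[keys == key]
  # Return NA when the key is absent, otherwise the first value found
  if (length(hit) == 0) NA_character_ else hit[1]
}

bb <- vapply(x, get_value, character(1), key = "bb", USE.NAMES = FALSE)
print(bb)
```

Unlike the regmatches approach, this keeps one result per input line and returns NA for lines without the key.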
Related
I have several vectors, like these ones:
str <- c("AT/FBA/1/12/360/26/SF/96", "AT/RLMW/1/12/360/44/SF/122", "AT/ACR/1/12/362/66/SF/175", "AT/AA/1/12/363/72/SF/281", "AT/BB/1/12/364/90/SF/310", "AT/ANT/1/123/364/92/SF/338")
N.B. that each argument between '/' may change in length (amount of characters).
I want to extract the 5th and 6th arguments delimited by the '/'.
for example in this case:
"360/26", "360/44", "362/66", "363/72", "364/90", "364/92"
I checked at these answers from similar questions:
Extract text after a symbol in R
Extracting part of string by position in R
I tried to use:
sub("^([^/]+/){4}([^/]+).*", "\\2", str)
but it gives me only the 5th argument, as follows:
[1] "360" "360" "362" "363" "364" "364" "364" "365" "365" "366" "365" "002" "002" "002" "002" "003"
[17] "003" "003" "004" "004" "004" "005"
then I tried
scan(text=str, sep="/", what="", quiet=TRUE)[c(5:6)]
but it gives me just the two arguments without the delimiter '/'.
A simple regex solution would be
sub("^([^/]*/){4}([^/]*/[^/]*)/.*", "\\2", str)
returning the desired
[1] "360/26" "360/44" "362/66" "363/72" "364/90"
[6] "364/92"
Use read.table like this:
with(read.table(text = str, sep = "/"), paste(V5, V6, sep = "/"))
## [1] "360/26" "360/44" "362/66" "363/72" "364/90" "364/92"
Will this work:
apply(sapply(strsplit(str, split = '/'), '[', c(5,6)),2, function(x) paste(x, collapse = '/'))
[1] "360/26" "360/44" "362/66" "363/72" "364/90" "364/92"
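The strsplit idea can also be wrapped in a small reusable function. This is a sketch only; extract_fields is a hypothetical helper name, and vapply is used instead of sapply/apply so that the result is always a character vector.

```r
str <- c("AT/FBA/1/12/360/26/SF/96", "AT/RLMW/1/12/360/44/SF/122",
         "AT/ACR/1/12/362/66/SF/175", "AT/AA/1/12/363/72/SF/281",
         "AT/BB/1/12/364/90/SF/310", "AT/ANT/1/123/364/92/SF/338")

# extract_fields is a hypothetical helper: split each string on sep,
# keep the fields at positions idx, and glue them back together with sep.
extract_fields <- function(s, idx, sep = "/") {
  parts <- strsplit(s, sep, fixed = TRUE)
  vapply(parts, function(p) paste(p[idx], collapse = sep), character(1))
}

res <- extract_fields(str, 5:6)
print(res)
```

If a string has fewer fields than requested, p[idx] yields NA and paste turns it into the text "NA", so inputs of uneven length would need an extra check.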
Here is a tidyverse solution I thought you could also use:
library(dplyr)
library(tidyr)
str %>%
  as_tibble() %>%
  separate(value, into = LETTERS[1:8], sep = "\\/") %>%
  select(5, 6) %>%
  unite("Extract", c("E", "F"), sep = "/")
# A tibble: 6 x 1
  Extract
  <chr>
1 360/26
2 360/44
3 362/66
4 363/72
5 364/90
6 364/92
How to extract everything between two underscores in R
ts = c("az_bna_njh","j_hj_lkiuy","ml_", "_kk")
I need to extract bna,hj,ml, and kk
We can use
sub("^\\w+_(\\w+)_.*", "\\1", trimws(ts, whitespace = "_"))
#[1] "bna" "hj" "ml" "kk"
Or another option is
sub("^\\w+_(\\w+)_.*", "\\1", gsub("^_|_$", "", ts))
Also you can try:
#Data
ts = c("az_bna_njh","j_hj_lkiuy","ml_", "_kk")
#Code
gsub(".*_(.*)\\_.*", "\\1", trimws(ts,whitespace = '_'))
Output:
[1] "bna" "hj" "ml" "kk"
Another way you can try
library(stringr)
str_replace_all(ts, c("^.*_(\\w+)_.*$" = "\\1", "^_|_$" = ""))
#[1] "bna" "hj" "ml" "kk"
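The regex answers above all encode the same implicit rule, which can be spelled out without regular expressions. The sketch below assumes the intended rule is "take the second non-empty piece if there is one, otherwise the only piece"; pick_piece is a hypothetical helper name.

```r
ts <- c("az_bna_njh", "j_hj_lkiuy", "ml_", "_kk")

# pick_piece is a hypothetical helper encoding the assumed rule:
# second non-empty piece if there is one, otherwise the only piece.
pick_piece <- function(s) {
  pieces <- strsplit(s, "_", fixed = TRUE)[[1]]
  pieces <- pieces[nzchar(pieces)]          # drop empty pieces from leading "_"
  if (length(pieces) == 0) return(NA_character_)
  if (length(pieces) >= 2) pieces[2] else pieces[1]
}

res <- vapply(ts, pick_piece, character(1), USE.NAMES = FALSE)
print(res)
```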
I am trying web scraping of movies of 2019 from IMDB. I am extracting the Director's name from a nested list.
Now, the issue is that the directors' names are not given for all the movies, only for a select few, hence I need to extract the director's name wherever the term 'Director:\n' appears.
The nested list is as follows:
[[1]]
[1] "Henry Cavill,Freya Allan,Anya Chalotra,Mimi Ndiweni\n"
[[2]]
[1] "\n"
[2] "Director:\nJ.J. Abrams"
[3] "|"
[4] "Stars:\nCarrie Fisher,Mark Hamill,Adam Driver,Daisy Ridley\n"
[[3]]
[1] "Pedro Pascal,Carl Weathers,Rio Hackford,Gina Carano\n"
[[4]]
[1] "\n"
[2] "Director:\nTom Hooper"
[3] "|"
[4] "Stars:\nFrancesca Hayward,Taylor Swift,Laurie Davidson,Robbie Fairchild\n"
[[5]]
[1] "Guy Pearce,Andy Serkis,Stephen Graham,Joe Alwyn\n"
[[6]]
[1] "\n"
[2] "Director:\nMichael Bay"
[3] "|"
[4] "Stars:\nRyan Reynolds,Mélanie Laurent,Manuel Garcia-Rulfo,Ben Hardy\n"
Here as one can see, the Director's name appears in an alternate manner but this is just for example purpose. Thanks in advance.
Expected Output:
directors_data
NA,"J.J. Abrams",NA,"Tom Hooper",NA,"Michael Bay"
Here is a base R solution, where you can use either grep + gsub or regmatches + gregexpr.
Assuming your data is a list lst, you can try the following code to extract the director's name:
sapply(lst, function(x) ifelse(length(r <- grep("Director",x,value = T)),gsub("Director:\n","",r),NA))
or
sapply(lst, function(x) ifelse(length(r<-unlist(regmatches(x,gregexpr("(?<=Director:\n)(.*)",x,perl = T)))),r,NA))
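The same logic may be easier to read when written as a named function with an explicit if/else. This is a sketch, not the answerer's code: extract_director is a hypothetical helper, lst is rebuilt here from the first four elements shown in the question, and only the first "Director:" entry per element is used.

```r
lst <- list(
  "Henry Cavill,Freya Allan,Anya Chalotra,Mimi Ndiweni\n",
  c("\n", "Director:\nJ.J. Abrams", "|",
    "Stars:\nCarrie Fisher,Mark Hamill,Adam Driver,Daisy Ridley\n"),
  "Pedro Pascal,Carl Weathers,Rio Hackford,Gina Carano\n",
  c("\n", "Director:\nTom Hooper", "|",
    "Stars:\nFrancesca Hayward,Taylor Swift,Laurie Davidson,Robbie Fairchild\n")
)

# extract_director is a hypothetical helper: return the text after
# "Director:\n" for one list element, or NA if there is no such entry.
extract_director <- function(el) {
  hit <- el[startsWith(el, "Director:\n")]
  if (length(hit) == 0) return(NA_character_)
  sub("Director:\n", "", hit[1], fixed = TRUE)
}

directors_data <- vapply(lst, extract_director, character(1))
print(directors_data)
```

Using vapply with character(1) guarantees one entry per movie, so the result lines up with the other scraped columns.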
You can use str_extract to extract string and map to loop over each element in the list
library(purrr)
library(stringr)
map_chr(list_df, ~{temp <- na.omit(str_extract(.x, "(?<=Director:\n)(.*)"));
  if(length(temp) > 0) temp else NA_character_})
#[1] NA "J.J. Abrams" NA "Tom Hooper"
data
Since you did not provide a reproducible example I created one myself.
list_df <- list("Henry Cavill,Freya Allan,Anya Chalotra,Mimi Ndiweni\n",
c("\n", "Director:\nJ.J. Abrams", "|", "Stars:\nCarrie Fisher,Mark Hamill,Adam Driver,Daisy Ridley\n"
), "Pedro Pascal,Carl Weathers,Rio Hackford,Gina Carano\n",
c("\n", "Director:\nTom Hooper", "|", "Stars:\nFrancesca Hayward,Taylor Swift,Laurie Davidson,Robbie Fairchild\n"
))
Base R solution:
directors_data <- gsub("Director:\n", "",
unlist(Map(function(x){x[2]}, list_df)), fixed = TRUE)
Base R solution not using unlist and using mapply not Map:
directors_data <- gsub(".*\\\n", "",
mapply(function(x){x[2]}, list_df, SIMPLIFY = TRUE))
Base R solution if pattern appears at different indices per list element:
directors_data <- gsub(".*\\\n", "",
mapply(function(x) {
ifelse(length(x[which(grepl("Director", x))]) > 0,
x[which(grepl("Director", x))],
NA)}, list_df, SIMPLIFY = TRUE))
I was manipulating my count-data (fcm) and had my Barcode ID's as column names in the format: TCGA.BH.A0DQ.11A.12R.A089.07 etc
I proceeded to use:
CountCol= colnames(fcm)
Barcode = strsplit(as.character(CountCol), ".", fixed=TRUE)
giving me a list of all the split character strings such as :
head(Barcode,2)
[[1]]
[1] "TCGA" "3C" "AAAU" "01A" "11R" "A41B" "07"
[[2]]
[1] "TCGA" "3C" "AALI" "01A" "11R" "A41B" "07"
My question is now how do I put only the first three elements together to make new column names separated by a "-" (i.e. TCGA-3C-AAAU for the first and so forth for the next ~1200 values)
I hope this was clear.
I tried a few methods but keep coming short of the correct solution.
try sapply
sapply(Barcode,function(x){paste(x[1:3],collapse="-")})
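Putting the split and the paste together, an end-to-end sketch could look like the following. The two IDs are rebuilt from the head(Barcode, 2) output in the question, short_ids is an illustrative name, and the final assignment to colnames(fcm) is left as a comment because fcm is not available here.

```r
# Example IDs rebuilt from the head(Barcode, 2) output in the question
ids <- c("TCGA.3C.AAAU.01A.11R.A41B.07", "TCGA.3C.AALI.01A.11R.A41B.07")

# Split on literal dots, then join the first three pieces with "-"
parts <- strsplit(ids, ".", fixed = TRUE)
short_ids <- vapply(parts, function(p) paste(p[1:3], collapse = "-"), character(1))
print(short_ids)

# With the real data this would presumably be followed by:
# colnames(fcm) <- short_ids
```

Shortening the IDs can produce duplicates if several columns share the same first three fields, so it may be worth checking anyDuplicated(short_ids) before assigning them as column names.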
You could also use the purrr library for simpler code:
library(purrr)
x <- c("TCGA", "3C", "AAAU", "01A", "11R", "A41B", "07" )
y <- c("TCGA", "3C", "AALI", "01A", "11R", "A41B", "07" )
z <- list(x, y)
purrr::map(z, ~paste(.[1:3], collapse = "-"))
[[1]]
[1] "TCGA-3C-AAAU"
[[2]]
[1] "TCGA-3C-AALI"
Is there a way to split camel case strings in R?
I have attempted:
string.to.split = "thisIsSomeCamelCase"
unlist(strsplit(string.to.split, split="[A-Z]") )
# [1] "this" "s" "ome" "amel" "ase"
string.to.split = "thisIsSomeCamelCase"
gsub("([A-Z])", " \\1", string.to.split)
# [1] "this Is Some Camel Case"
strsplit(gsub("([A-Z])", " \\1", string.to.split), " ")
# [[1]]
# [1] "this" "Is" "Some" "Camel" "Case"
Looking at Ramnath's and mine I can say that my initial impression that this was an underspecified question has been supported.
And give Tommy and Ramnath upvotes for pointing out [:upper:]
strsplit(gsub("([[:upper:]])", " \\1", string.to.split), " ")
# [[1]]
# [1] "this" "Is" "Some" "Camel" "Case"
Here is one way to do it
split_camelcase <- function(...){
strings <- unlist(list(...))
strings <- gsub("^[^[:alnum:]]+|[^[:alnum:]]+$", "", strings)
strings <- gsub("(?!^)(?=[[:upper:]])", " ", strings, perl = TRUE)
return(strsplit(tolower(strings), " ")[[1]])
}
split_camelcase("thisIsSomeGood")
# [1] "this" "is" "some" "good"
Here's an approach using a single regex (a Lookahead and Lookbehind):
strsplit(string.to.split, "(?<=[a-z])(?=[A-Z])", perl = TRUE)
## [[1]]
## [1] "this" "Is" "Some" "Camel" "Case"
Here is a one-liner using the gsubfn package's strapply. The regular expression matches the beginning of the string (^) followed by one or more lower case letters ([[:lower:]]+) or (|) an upper case letter ([[:upper:]]) followed by zero or more lower case letters ([[:lower:]]*) and processes the matched strings with c (which concatenates the individual matches into a vector). As with strsplit it returns a list so we take the first component ([[1]]) :
library(gsubfn)
strapply(string.to.split, "^[[:lower:]]+|[[:upper:]][[:lower:]]*", c)[[1]]
## [1] "this" "Is" "Some" "Camel" "Case"
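If installing gsubfn is not an option, the same pattern should also work in base R with gregexpr and regmatches. This is a sketch of an equivalent, not part of the original answer; pattern and words are illustrative names.

```r
string.to.split <- "thisIsSomeCamelCase"

# Same pattern as in the strapply call: a leading run of lower case letters,
# or an upper case letter followed by any number of lower case letters.
pattern <- "^[[:lower:]]+|[[:upper:]][[:lower:]]*"

# gregexpr finds every match; regmatches pulls out the matched substrings
words <- regmatches(string.to.split, gregexpr(pattern, string.to.split))[[1]]
print(words)
```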
I think my other answer is better than the following, but if only a one-liner to split is needed, here we go:
library(snakecase)
unlist(strsplit(to_parsed_case(string.to.split), "_"))
#> [1] "this" "Is" "Some" "Camel" "Case"
The beginning of an answer is to split the string into individual characters:
sp.x <- strsplit(string.to.split, "")
Then find which string positions are upper case:
ind.x <- lapply(sp.x, function(x) which(!tolower(x) == x))
Then use that to split out each run of characters . . .
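One way to finish this idea, offered as a sketch rather than the original author's continuation, is to number the runs with cumsum and paste each run back together. The names chars, is_upper, group and words are illustrative.

```r
string.to.split <- "thisIsSomeCamelCase"

# Split into single characters and flag the upper case ones
chars <- strsplit(string.to.split, "")[[1]]
is_upper <- chars != tolower(chars)

# Each upper case letter starts a new run, so the running count of
# upper case letters seen so far labels the run a character belongs to.
group <- cumsum(is_upper)

# Paste the characters of each run back into one word
words <- unname(vapply(split(chars, group), paste, character(1), collapse = ""))
print(words)
```

Because every capital starts a new run, a string with consecutive capitals (an acronym, for instance) would be split into single letters.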
Here is an easy solution via snakecase + some tidyverse helpers:
install.packages("snakecase")
library(snakecase)
library(magrittr)
library(stringr)
library(purrr)
string.to.split = "thisIsSomeCamelCase"
to_parsed_case(string.to.split) %>%
str_split(pattern = "_") %>%
purrr::flatten_chr()
#> [1] "this" "Is" "Some" "Camel" "Case"
GitHub link to snakecase: https://github.com/Tazinho/snakecase