Extracting specific data from nested list - r

I am scraping 2019 movies from IMDB and trying to extract the director's name from a nested list.
The issue is that a director is not given for every movie, only for some of them, so I need to extract the director's name wherever the term 'Director:\n' appears.
The nested list is as follows:
[[1]]
[1] "Henry Cavill,Freya Allan,Anya Chalotra,Mimi Ndiweni\n"
[[2]]
[1] "\n"
[2] "Director:\nJ.J. Abrams"
[3] "|"
[4] "Stars:\nCarrie Fisher,Mark Hamill,Adam Driver,Daisy Ridley\n"
[[3]]
[1] "Pedro Pascal,Carl Weathers,Rio Hackford,Gina Carano\n"
[[4]]
[1] "\n"
[2] "Director:\nTom Hooper"
[3] "|"
[4] "Stars:\nFrancesca Hayward,Taylor Swift,Laurie Davidson,Robbie Fairchild\n"
[[5]]
[1] "Guy Pearce,Andy Serkis,Stephen Graham,Joe Alwyn\n"
[[6]]
[1] "\n"
[2] "Director:\nMichael Bay"
[3] "|"
[4] "Stars:\nRyan Reynolds,Mélanie Laurent,Manuel Garcia-Rulfo,Ben Hardy\n"
As one can see here, the director's name appears in every other element, but that is only because this is an example. Thanks in advance.
Expected Output:
directors_data
NA,"J.J. Abrams",NA,"Tom Hooper",NA,"Michael Bay"

Here is a base R solution, where you can use either grep + gsub or regmatches + gregexpr.
Assuming your data is a list lst, you can try the following code to extract the director's name:
sapply(lst, function(x) ifelse(length(r <- grep("Director",x,value = T)),gsub("Director:\n","",r),NA))
or
sapply(lst, function(x) ifelse(length(r<-unlist(regmatches(x,gregexpr("(?<=Director:\n)(.*)",x,perl = T)))),r,NA))
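As a minimal sketch of the grep + sub idea above, the following rebuilds the list from the question and wraps the extraction in a small helper. The names lst and extract_director are assumptions for illustration; vapply is used only so that the result is guaranteed to be a character vector.

```r
# Rebuild the nested list shown in the question (assumed to be named `lst`).
lst <- list(
  "Henry Cavill,Freya Allan,Anya Chalotra,Mimi Ndiweni\n",
  c("\n", "Director:\nJ.J. Abrams", "|",
    "Stars:\nCarrie Fisher,Mark Hamill,Adam Driver,Daisy Ridley\n"),
  "Pedro Pascal,Carl Weathers,Rio Hackford,Gina Carano\n",
  c("\n", "Director:\nTom Hooper", "|",
    "Stars:\nFrancesca Hayward,Taylor Swift,Laurie Davidson,Robbie Fairchild\n"),
  "Guy Pearce,Andy Serkis,Stephen Graham,Joe Alwyn\n",
  c("\n", "Director:\nMichael Bay", "|",
    "Stars:\nRyan Reynolds,Mélanie Laurent,Manuel Garcia-Rulfo,Ben Hardy\n")
)

# Hypothetical helper: return the director for one list element, or NA.
extract_director <- function(x) {
  # Keep only the strings that start with the "Director:\n" label.
  hit <- grep("^Director:\n", x, value = TRUE)
  if (length(hit) == 0) return(NA_character_)
  # Drop the label and keep the name; use the first hit if there are several.
  sub("^Director:\n", "", hit[1])
}

directors_data <- vapply(lst, extract_director, character(1))
directors_data
```

Elements without a 'Director:\n' entry come back as NA, so the result has one entry per movie.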

You can use str_extract to extract string and map to loop over each element in the list
library(purrr)
library(stringr)
map_chr(list_df, ~{temp <- na.omit(str_extract(.x, "(?<=Director:\n)(.*)"));
if(length(temp) > 0) temp else NA})
#[1] NA "J.J. Abrams" NA "Tom Hooper"
data
Since you did not provide a reproducible example I created one myself.
list_df <- list("Henry Cavill,Freya Allan,Anya Chalotra,Mimi Ndiweni\n",
c("\n", "Director:\nJ.J. Abrams", "|", "Stars:\nCarrie Fisher,Mark Hamill,Adam Driver,Daisy Ridley\n"
), "Pedro Pascal,Carl Weathers,Rio Hackford,Gina Carano\n",
c("\n", "Director:\nTom Hooper", "|", "Stars:\nFrancesca Hayward,Taylor Swift,Laurie Davidson,Robbie Fairchild\n"
))

Base R solution:
directors_data <- gsub("Director:\n", "",
unlist(Map(function(x){x[2]}, list_df)), fixed = TRUE)
Base R solution not using unlist and using mapply not Map:
directors_data <- gsub(".*\\\n", "",
mapply(function(x){x[2]}, list_df, SIMPLIFY = TRUE))
Base R solution if pattern appears at different indices per list element:
directors_data <- gsub(".*\\\n", "",
mapply(function(x) {
ifelse(length(x[which(grepl("Director", x))]) > 0,
x[which(grepl("Director", x))],
NA)}, list_df, SIMPLIFY = TRUE))

Related

how to extract part of a string matching pattern with separation in r

I'm trying to extract part of a file name that matches a set of letters with variable length. The file names consist of several parameters separated by "_", but they vary in the number of parts. I'm trying to pull some of the parameters out to use separately.
Example file names:
a = "Vel_Mag_ft_modelExisting_350cfs_blah3.tif"
b = "Depth_modelDesign_11000cfs_blah2.tif"
I'm trying to pull out the parts that start with "model" so I end up with
"modelExisting"
"modelDesign"
The filenames are stored as a variable in a data.frame
I've tried
library(tidyverse)
tibble(files = c(a,b))%>%
mutate(attempt1 = str_extract(files, "model"),
attempt2 = str_match(str_split(files, "_"), "model"))
but just ended up with the "model" in all cases and not the "model...." that I need.
The pieces I need are a consistent number of pieces from the end, but I couldn't figure out how to specify that either. I tried
str_split(files, "_")[-3]
but this threw an error that it must be size 480 or 1 not size 479
We can create a function that captures the run of letters sitting between an underscore and the underscore followed by one or more digits; in the replacement, the backreference (\\1) returns the captured group
f1 <- function(x) sub(".*_([[:alpha:]]+)_\\d+.*", "\\1", x)
-testing
> f1(a)
[1] "modelExisting"
> f1(b)
[1] "modelDesign"
We can use strsplit or regmatches like below
> s <- c("Vel_Mag_ft_modelExisting_350cfs_blah3.tif", "Depth_modelDesign_11000cfs_blah2.tif")
> lapply(strsplit(s, "_"), function(x) x[which(grepl("^\\d+", x)) - 1])
[[1]]
[1] "modelExisting"
[[2]]
[1] "modelDesign"
> regmatches(s, gregexpr("[[:alpha:]]+(?=_\\d+)", s, perl = TRUE))
[[1]]
[1] "modelExisting"
[[2]]
[1] "modelDesign"
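Two further base R sketches, offered as possibilities and not as the approach of the answers above: one matches the literal "model" prefix directly, the other takes a fixed position counted from the end of the split pieces, which is what the question's second attempt was after. The variable names files, model_part, parts and from_end are made up for this example.

```r
files <- c("Vel_Mag_ft_modelExisting_350cfs_blah3.tif",
           "Depth_modelDesign_11000cfs_blah2.tif")

# 1) Match "model" followed by everything up to the next underscore.
#    Note: regmatches() silently drops elements that have no match.
model_part <- regmatches(files, regexpr("model[^_]+", files))
model_part

# 2) Split on "_" and take the third piece counted from the end.
parts <- strsplit(files, "_", fixed = TRUE)
from_end <- vapply(parts, function(p) p[length(p) - 2], character(1))
from_end
```

Both give "modelExisting" and "modelDesign" for these two file names; the second variant assumes every name has at least three pieces.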

take a part from a string without a specific pattern

I have a column where in
cell.1 is "UNIV ZURICH;NOTREPORTED;NOTREPORTED;NOTREPORTED"
cell.2 is "UNIBG"
s = c("UNIV ZURICH;NOTREPORTED;NOTREPORTED;NOTREPORTED", "UNIBG")
s1 = unlist(strsplit(s, split=';', fixed=TRUE))[1]
s1
and I want to get
cell.1 UNIV ZURICH
cell.2 UNIBG
many thanks in advance,
s = c("UNIV ZURICH;NOTREPORTED;NOTREPORTED;NOTREPORTED", "UNIBG")
s1 = strsplit(s, split=';')
result = data.frame(mycol = unlist(lapply(s1, function(x){x[1]})))
> result
mycol
1 UNIV ZURICH
2 UNIBG
Your strsplit() approach is a good idea, it gives:
strsplit(s, split=';', fixed=TRUE)
[[1]]
[1] "UNIV ZURICH" "NOTREPORTED" "NOTREPORTED" "NOTREPORTED"
[[2]]
[1] "UNIBG"
In order to get what you are looking for, you need to extract the first element of each element of the list you obtained and then merge them. Here is a way to do so (btw, fixed=TRUE is not required for this example, because ';' has no special meaning in a regular expression).
s1 <- unlist(lapply(strsplit(s, split=';', fixed=TRUE), `[`, 1))
Previously, you were merging all elements in one list:
unlist(strsplit(s, split=';', fixed=TRUE))
[1] "UNIV ZURICH" "NOTREPORTED" "NOTREPORTED" "NOTREPORTED"
[5] "UNIBG"
and then you were taking the first element of this vector.
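To make the difference concrete, here is a small sketch that takes the first field per element in two ways: by indexing each split result, and by deleting everything from the first semicolon onward. The names first_by_split and first_by_sub are assumptions for illustration.

```r
s <- c("UNIV ZURICH;NOTREPORTED;NOTREPORTED;NOTREPORTED", "UNIBG")

# Split each string, then take the first piece of every element separately.
parts <- strsplit(s, split = ";", fixed = TRUE)
first_by_split <- vapply(parts, function(p) p[1], character(1))

# Alternative: remove the first ";" and everything after it.
first_by_sub <- sub(";.*$", "", s)

first_by_split
first_by_sub
```

Either way the result keeps one value per input string, which is what the data frame column needs.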

How do I remove these characters from my vector of strings

I need a solution to how I can clean my vector of strings which has characters and symbols,
for example
[1]c("hiv3=0", "comdiab=0", "ppl=0")
[2]c("fxet3=1", "hiv3=0", "ppl=0")
[3]c("fxet3=1", "escol4=0", "alcool=0", "tipores3=1")
[4]c("escol4=0", "alcool=0", "ppl=0", "tipores3=1")
The intended string will produce
[1]"hiv3=0,comdiab=0, ppl=0"
[2]"fxet3=1, hiv3=0, ppl=0"
[3]"fxet3=1, escol4=0, alcool=0, tipores3=1"
[4]"escol4=0, alcool=0, ppl=0, tipores3=1"
Any solution is acceptable, though I have tried using the gsub function
Regex solution would be very much acceptable also
Based on the post, it seems to be a list of vectors. We can use paste to create a single string from each vector in the list
sapply(lst1, paste, collapse=", ")
#[1] "hiv3=0, comdiab=0, ppl=0"
#[2] "fxet3=1, hiv3=0, ppl=0"
#[3] "fxet3=1, escol4=0, alcool=0, tipores3=1"
#[4] "escol4=0, alcool=0, ppl=0, tipores3=1"
or otherwise can be modified as
sapply(lst1, toString)
data
lst1 <- list(c("hiv3=0", "comdiab=0", "ppl=0"), c("fxet3=1", "hiv3=0",
"ppl=0"), c("fxet3=1", "escol4=0", "alcool=0", "tipores3=1"),
c("escol4=0", "alcool=0", "ppl=0", "tipores3=1"))
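As a quick check of the two variants above, this sketch confirms that toString gives the same result as paste with collapse = ", " on this data. The names via_paste and via_toString are only for illustration.

```r
lst1 <- list(c("hiv3=0", "comdiab=0", "ppl=0"),
             c("fxet3=1", "hiv3=0", "ppl=0"),
             c("fxet3=1", "escol4=0", "alcool=0", "tipores3=1"),
             c("escol4=0", "alcool=0", "ppl=0", "tipores3=1"))

# Collapse each character vector into one comma-separated string.
via_paste <- sapply(lst1, paste, collapse = ", ")
via_toString <- sapply(lst1, toString)

# Both approaches should agree element by element.
identical(via_paste, via_toString)
```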
tidyverse answer
library(tidyverse)
my_strings <- list(c("hiv3=0", "comdiab=0", "ppl=0"),
c("fxet3=1", "hiv3=0", "ppl=0"),
c("fxet3=1", "escol4=0", "alcool=0", "tipores3=1"),
c("escol4=0", "alcool=0", "ppl=0", "tipores3=1"))
map_chr(.x = my_strings, .f = str_c, collapse = " ")
# [1] "hiv3=0 comdiab=0 ppl=0"
# [2] "fxet3=1 hiv3=0 ppl=0"
# [3] "fxet3=1 escol4=0 alcool=0 tipores3=1"
# [4] "escol4=0 alcool=0 ppl=0 tipores3=1"

Splitting CamelCase in R

Is there a way to split camel case strings in R?
I have attempted:
string.to.split = "thisIsSomeCamelCase"
unlist(strsplit(string.to.split, split="[A-Z]") )
# [1] "this" "s" "ome" "amel" "ase"
string.to.split = "thisIsSomeCamelCase"
gsub("([A-Z])", " \\1", string.to.split)
# [1] "this Is Some Camel Case"
strsplit(gsub("([A-Z])", " \\1", string.to.split), " ")
# [[1]]
# [1] "this" "Is" "Some" "Camel" "Case"
Looking at Ramnath's answer and mine, I can say that my initial impression that this was an underspecified question has been supported.
And give Tommy and Ramnath upvotes for pointing out [:upper:]
strsplit(gsub("([[:upper:]])", " \\1", string.to.split), " ")
# [[1]]
# [1] "this" "Is" "Some" "Camel" "Case"
Here is one way to do it
split_camelcase <- function(...){
strings <- unlist(list(...))
strings <- gsub("^[^[:alnum:]]+|[^[:alnum:]]+$", "", strings)
strings <- gsub("(?!^)(?=[[:upper:]])", " ", strings, perl = TRUE)
return(strsplit(tolower(strings), " ")[[1]])
}
split_camelcase("thisIsSomeGood")
# [1] "this" "is" "some" "good"
Here's an approach using a single regex (a Lookahead and Lookbehind):
strsplit(string.to.split, "(?<=[a-z])(?=[A-Z])", perl = TRUE)
## [[1]]
## [1] "this" "Is" "Some" "Camel" "Case"
Here is a one-liner using the gsubfn package's strapply. The regular expression matches the beginning of the string (^) followed by one or more lower case letters ([[:lower:]]+) or (|) an upper case letter ([[:upper:]]) followed by zero or more lower case letters ([[:lower:]]*) and processes the matched strings with c (which concatenates the individual matches into a vector). As with strsplit it returns a list so we take the first component ([[1]]) :
library(gsubfn)
strapply(string.to.split, "^[[:lower:]]+|[[:upper:]][[:lower:]]*", c)[[1]]
## [1] "this" "Is" "Some" "Camel" "Case"
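If installing gsubfn is not an option, a similar match-based split can be sketched in base R with gregexpr and regmatches. This is an assumption-laden variant and not the answer's own code: it drops the ^ anchor, which is not needed for this input, and the name pieces is made up.

```r
string.to.split <- "thisIsSomeCamelCase"

# Match either a run of lower case letters, or an upper case letter
# followed by zero or more lower case letters.
pattern <- "[[:lower:]]+|[[:upper:]][[:lower:]]*"

# gregexpr finds all matches; regmatches returns them as a list.
pieces <- regmatches(string.to.split, gregexpr(pattern, string.to.split))[[1]]
pieces
```

As with strapply, digits and other non-letter characters are not matched and would be dropped from the result.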
I think my other answer is better than the following, but if only a one-liner to split is needed, here we go:
library(snakecase)
unlist(strsplit(to_parsed_case(string.to.split), "_"))
#> [1] "this" "Is" "Some" "Camel" "Case"
The beginning of an answer is to split the string into individual characters:
sp.x <- strsplit(string.to.split, "")
Then find which string positions are upper case:
ind.x <- lapply(sp.x, function(x) which(!tolower(x) == x))
Then use that to split out each run of characters . . .
Here is an easy solution via snakecase + some tidyverse helpers:
install.packages("snakecase")
library(snakecase)
library(magrittr)
library(stringr)
library(purrr)
string.to.split = "thisIsSomeCamelCase"
to_parsed_case(string.to.split) %>%
str_split(pattern = "_") %>%
purrr::flatten_chr()
#> [1] "this" "Is" "Some" "Camel" "Case"
GitHub link to snakecase: https://github.com/Tazinho/snakecase

Extracting values after pattern

A beginner question...
I have a list like this:
x <- c("aa=v12, bb=x21, cc=f35", "xx=r53, bb=g-25, yy=h48", "nn=u75, bb=26, gg=m98")
(but many more lines)
I need to extract what is between "bb=" and ",". I.e. I want:
x21
g-25
26
Having read many similar questions here, I suppose it is stringr with str_extract I should use, but somehow I can't get it to work. Thanks for all help.
/Chris
strapply in the gsubfn package can do that. Note that [^,]* matches a string of non-commas.
strapply extracts the back referenced portion (the part within parentheses):
> library(gsubfn)
> strapply(x, "bb=([^,]*)", simplify = TRUE)
[1] "x21" "g-25" "26"
If there are several x vectors then provide them in a list like this:
> strapply(list(x, x), "bb=([^,]*)")
[[1]]
[1] "x21" "g-25" "26"
[[2]]
[1] "x21" "g-25" "26"
An option using regexpr:
> temp = regexpr('bb=[^,]*', x)
> substr(x, temp + 3, temp + attr(temp, 'match.length') - 1)
[1] "x21" "g-25" "26"
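The offsets in the substr call are easier to follow when spelled out: temp is the position where "bb=" starts, adding 3 skips the three characters of "bb=", and the match length gives the end. A hedged sketch wrapping this in a helper follows; extract_value and its key argument are hypothetical, and key is assumed to contain no regex metacharacters.

```r
x <- c("aa=v12, bb=x21, cc=f35", "xx=r53, bb=g-25, yy=h48", "nn=u75, bb=26, gg=m98")

# Hypothetical helper: value following `key`, up to the next comma.
extract_value <- function(x, key = "bb=") {
  # Position and length of "key + non-comma characters" in each string.
  m <- regexpr(paste0(key, "[^,]*"), x)
  # Start right after the key, end at the last character of the match.
  out <- substr(x, m + nchar(key), m + attr(m, "match.length") - 1)
  # regexpr returns -1 where nothing matched; report those as NA.
  out[m == -1] <- NA_character_
  out
}

extract_value(x)
```

Strings that do not contain the key yield NA instead of an empty string.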
Here's one solution using the base regex functions in R. First we use strsplit to split on the comma. Then we use grepl to filter only the items that start with bb= and gsub to extract all the characters after bb=.
> x <- c("aa=v12, bb=x21, cc=f35", "xx=r53, bb=g-25, yy=h48", "nn=u75, bb=26, gg=m98")
> y <- unlist(strsplit(x , ","))
> unlist(lapply(y[grepl("bb=", y)], function(x) gsub("^.*bb=(.*)", "\\1", x)))
[1] "x21" "g-25" "26"
It looks like str_replace is the function you are after if you want to go that route:
> str_replace(y[grepl("bb=",y)], "^.*bb=(.*)", "\\1")
[1] "x21" "g-25" "26"
Read it in with commas as separators and take the second column:
> x.split <- read.table(textConnection(x), header=FALSE, sep=",", stringsAsFactors=FALSE)[[2]]
[1] " bb=x21" " bb=g-25" " bb=26"
Then remove the "bb="
> gsub("bb=", "", x.split )
[1] " x21" " g-25" " 26"
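Note that the values above still carry a leading space. A small sketch of one way to remove it with trimws, assuming (as in this answer) that the "bb=" field is always the second column; the name values is made up for illustration.

```r
x <- c("aa=v12, bb=x21, cc=f35", "xx=r53, bb=g-25, yy=h48", "nn=u75, bb=26, gg=m98")

# Read the strings as comma-separated records and keep the second column.
x.split <- read.table(text = x, header = FALSE, sep = ",",
                      stringsAsFactors = FALSE)[[2]]

# Remove the literal "bb=" label, then strip the surrounding whitespace.
values <- trimws(sub("bb=", "", x.split, fixed = TRUE))
values
```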
