I have a column where in
cell.1 is "UNIV ZURICH;NOTREPORTED;NOTREPORTED;NOTREPORTED"
cell.2 is "UNIBG"
s = c("UNIV ZURICH;NOTREPORTED;NOTREPORTED;NOTREPORTED", "UNIBG")
s1 = unlist(strsplit(s, split=';', fixed=TRUE))[1]
s1
and I want to get
cell.1 UNIV ZURICH
cell.2 UNIBG
many thanks in advance,
s = c("UNIV ZURICH;NOTREPORTED;NOTREPORTED;NOTREPORTED", "UNIBG")
s1 = strsplit(s, split=';')
result = data.frame(mycol = unlist(lapply(s1, function(x){x[1]})))
> result
mycol
1 UNIV ZURICH
2 UNIBG
Your strsplit() approach is a good idea; it gives:
strsplit(s, split=';', fixed=TRUE)
[[1]]
[1] "UNIV ZURICH" "NOTREPORTED" "NOTREPORTED" "NOTREPORTED"
[[2]]
[1] "UNIBG"
To get what you are looking for, you need to extract the first element of each element of the list you obtained and then combine them. Here is one way to do so (by the way, fixed=TRUE is not required for this example).
s1 <- unlist(lapply(strsplit(s, split=';', fixed=TRUE), `[`, 1))
Previously, you were merging all elements into a single vector:
unlist(strsplit(s, split=';', fixed=TRUE))
[1] "UNIV ZURICH" "NOTREPORTED" "NOTREPORTED" "NOTREPORTED"
[5] "UNIBG"
and then you were taking the first element of this vector.
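As a variation on the same idea, here is a minimal sketch (not part of the original answers) that uses vapply() so the result is checked to be one string per input, plus a regex alternative with sub() that skips the split entirely. The names first_split and first_regex are only illustrative.

```r
s <- c("UNIV ZURICH;NOTREPORTED;NOTREPORTED;NOTREPORTED", "UNIBG")

# Split on ";" and keep the first piece of each element.
# vapply() fails loudly if any result is not a single string.
first_split <- vapply(strsplit(s, ";", fixed = TRUE), `[`, character(1), 1)

# Alternative: delete the first ";" and everything after it.
# Strings without a ";" are returned unchanged.
first_regex <- sub(";.*$", "", s)

first_split
first_regex
```

Both vectors should hold "UNIV ZURICH" and "UNIBG", so either can be assigned back to the data frame column.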
I am trying web scraping of movies of 2019 from IMDB. I am extracting the Director's name from a nested list.
The issue is that the Director's name is not given for every movie, only for a few, so I need to extract the Director's name wherever the term 'Director:\n' appears.
The nested list is as follows:
[[1]]
[1] "Henry Cavill,Freya Allan,Anya Chalotra,Mimi Ndiweni\n"
[[2]]
[1] "\n"
[2] "Director:\nJ.J. Abrams"
[3] "|"
[4] "Stars:\nCarrie Fisher,Mark Hamill,Adam Driver,Daisy Ridley\n"
[[3]]
[1] "Pedro Pascal,Carl Weathers,Rio Hackford,Gina Carano\n"
[[4]]
[1] "\n"
[2] "Director:\nTom Hooper"
[3] "|"
[4] "Stars:\nFrancesca Hayward,Taylor Swift,Laurie Davidson,Robbie Fairchild\n"
[[5]]
[1] "Guy Pearce,Andy Serkis,Stephen Graham,Joe Alwyn\n"
[[6]]
[1] "\n"
[2] "Director:\nMichael Bay"
[3] "|"
[4] "Stars:\nRyan Reynolds,Mélanie Laurent,Manuel Garcia-Rulfo,Ben Hardy\n"
As one can see here, the Director's name appears in every other element, but that is only the case in this example. Thanks in advance.
Expected Output:
directors_data
NA,"J.J. Abrams",NA,"Tom Hooper",NA,"Michael Bay"
Here is a base R solution, where you can use either grep + gsub or regmatches + gregexpr.
Assuming your data is a list lst, you can try the following code to extract the director's name:
sapply(lst, function(x) ifelse(length(r <- grep("Director",x,value = T)),gsub("Director:\n","",r),NA))
or
sapply(lst, function(x) ifelse(length(r<-unlist(regmatches(x,gregexpr("(?<=Director:\n)(.*)",x,perl = T)))),r,NA))
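The grep + gsub idea can also be written as a small helper, which may be easier to read than the one-liners. This is only a sketch: extract_director is a made-up name, the sample list is abbreviated from the question, and it assumes at most one "Director:" entry matters per movie (it keeps the first).

```r
# Abbreviated sample data in the shape shown in the question
lst <- list(
  "Henry Cavill,Freya Allan,Anya Chalotra,Mimi Ndiweni\n",
  c("\n", "Director:\nJ.J. Abrams", "|", "Stars:\nCarrie Fisher,Mark Hamill\n"),
  "Pedro Pascal,Carl Weathers,Rio Hackford,Gina Carano\n",
  c("\n", "Director:\nTom Hooper", "|", "Stars:\nFrancesca Hayward,Taylor Swift\n")
)

# Hypothetical helper: return the director for one movie, or NA if none is listed
extract_director <- function(x) {
  hit <- grep("^Director:\n", x, value = TRUE)
  if (length(hit) == 0) return(NA_character_)
  sub("^Director:\n", "", hit[1])  # strip the label, keep the name
}

# vapply() guarantees a character vector with one entry per movie
directors_data <- vapply(lst, extract_director, character(1))
directors_data
```

Returning NA_character_ rather than a bare NA keeps the result a character vector even if no movie has a director.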
You can use str_extract to extract string and map to loop over each element in the list
library(purrr)
library(stringr)
map_chr(list_df, ~{temp <- na.omit(str_extract(.x, "(?<=Director:\n)(.*)"));
if(length(temp) > 0) temp else NA_character_})
#[1] NA "J.J. Abrams" NA "Tom Hooper"
data
Since you did not provide a reproducible example I created one myself.
list_df <- list("Henry Cavill,Freya Allan,Anya Chalotra,Mimi Ndiweni\n",
c("\n", "Director:\nJ.J. Abrams", "|", "Stars:\nCarrie Fisher,Mark Hamill,Adam Driver,Daisy Ridley\n"
), "Pedro Pascal,Carl Weathers,Rio Hackford,Gina Carano\n",
c("\n", "Director:\nTom Hooper", "|", "Stars:\nFrancesca Hayward,Taylor Swift,Laurie Davidson,Robbie Fairchild\n"
))
Base R solution:
directors_data <- gsub("Director:\n", "",
unlist(Map(function(x){x[2]}, list_df)), fixed = TRUE)
Base R solution not using unlist and using mapply not Map:
directors_data <- gsub(".*\\\n", "",
mapply(function(x){x[2]}, list_df, SIMPLIFY = TRUE))
Base R solution if pattern appears at different indices per list element:
directors_data <- gsub(".*\\\n", "",
mapply(function(x) {
ifelse(length(x[which(grepl("Director", x))]) > 0,
x[which(grepl("Director", x))],
NA)}, list_df, SIMPLIFY = TRUE))
I have an R list of approx. 90 character vectors (representing 90 documents), each containing several author names. As a means to stem (or normalize, what have you) the names, I'd like to drop all characters after the white-space and first character just past the comma in each element. So, for example, "Smith, Joe" would become "Smith, J" (or "Smith J" would be fine).
1) I've tried using lapply with str_sub, but I can't seem to specify keeping one character past the comma (each element has different character length). 2) I also tried using lapply to split on the comma and make the last and first names separate elements, then using modify_depth to apply str_sub, but I can't figure out how to specifically use the str_sub only on the second element.
Fake sample to replicate issue.
doc1 = c("King, Stephen", "Martin, George")
doc2 = c("Clancy, Tom", "Patterson, James", "Stine, R.L.")
author = list(doc1,doc2)
What I've tried:
myfun1 = function(x,arg1){str_split(x, ", ")}
author = lapply(author, myfun1)
myfun2 = function(x,arg1){str_sub(x, end = 1L)}
f2 = modify_depth(author, myfun2, .depth = 2)
f2
[[1]]
[[1]][[1]]
[1] "K" "S"
[[1]][[2]]
[1] "M" "G"
Ultimately, I'm hoping after applying a solution, including maybe using unite(), the result will be as follows:
[[1]]
[[1]][[1]]
[1] "King S"
[[1]][[2]]
[1] "Martin G"
lapply( author, function(x) gsub( "(^.*, [A-Z]).*$", "\\1", x))
# [[1]]
# [1] "King, S" "Martin, G"
#
# [[2]]
# [1] "Clancy, T" "Patterson, J" "Stine, R"
What it does:
lapply loops over the list of authors.
gsub replaces the part of each vector element that matches the regex "(^.*, [A-Z]).*$" with the first group (the part between the round brackets).
The regex "(^.*, [A-Z]).*$" puts everything from the start (^.*) up to and including the 'comma, space, capital letter' (, [A-Z]) into a group; the trailing .*$ matches the rest of the string, which is dropped.
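If the comma-free form from the question ("King S") is preferred, a small variation with two capture groups should work. This is a sketch rather than the original answer's code; it assumes every name has the form "Last, First" and that the first name starts with a letter.

```r
doc1 <- c("King, Stephen", "Martin, George")
doc2 <- c("Clancy, Tom", "Patterson, James", "Stine, R.L.")
author <- list(doc1, doc2)

# Group 1: everything before the first comma (the last name).
# Group 2: the first letter after the comma and any spaces.
# The rest of the string is matched by .*$ and discarded.
stemmed <- lapply(author, function(x) {
  sub("^([^,]+), *([A-Za-z]).*$", "\\1 \\2", x)
})
stemmed
```

Names that do not match the pattern are left unchanged by sub(), so unusual entries are not silently mangled.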
I have a list of phrases, in which I want to replace certain words with a similar word, in case it is misspelled.
How can I search a string for a matching word and replace it?
The expected result is the following example:
a1<- c(" the classroom is ful ")
a2<- c(" full")
In this case I would be replacing ful with full in a1.
Take a look at the hunspell package. As the comments have already suggested, your problem is much more difficult than it seems, unless you already have a dictionary of misspelled words and their correct spelling.
library(hunspell)
a1 <- c(" the classroom is ful ")
bads <- hunspell(a1)
bads
# [[1]]
# [1] "ful"
hunspell_suggest(bads[[1]])
# [[1]]
# [1] "fool" "flu" "fl" "fuel" "furl" "foul" "full" "fun" "fur" "fut" "fol" "fug" "fum"
So even in your example, would you want to replace ful with full, or many of the other options here?
The package does let you use your own dictionary. Let's say you're doing that, or at least you're happy with the first returned suggestion.
library(stringr)
str_replace_all(a1, bads[[1]], hunspell_suggest(bads[[1]])[[1]][1])
# [1] " the classroom is fool "
But, as the other comments and answers have pointed out, you do need to be careful with the word showing up within other words.
a3 <- c(" the thankful classroom is ful ")
str_replace_all(a3,
paste("\\b",
hunspell(a3)[[1]],
"\\b",
collapse = "", sep = ""),
hunspell_suggest(hunspell(a3)[[1]])[[1]][1])
# [1] " the thankful classroom is fool "
Update
Based on your comment, you already have a dictionary, structured as a vector of badwords and another vector of their replacements.
library(stringr)
a4 <- "I would like a cheseburger and friees please"
badwords.corpus <- c("cheseburger", "friees")
goodwords.corpus <- c("cheeseburger", "fries")
vect.corpus <- goodwords.corpus
names(vect.corpus) <- badwords.corpus
str_replace_all(a4, vect.corpus)
# [1] "I would like a cheeseburger and fries please"
Update 2
Addressing your comment: with your new example, the issue is back to words showing up inside other words. The solution is to use \\b, which represents a word boundary. The pattern "thin" matches "thin", "think", "thinking", etc. But if you bracket it with \\b, the pattern is anchored to word boundaries: \\bthin\\b will only match "thin".
Your example:
a <- c(" thin, thic, thi")
badwords.corpus <- c("thin", "thic", "thi" )
goodwords.corpus <- c("think", "thick", "this")
The solution is to modify badwords.corpus
badwords.corpus <- paste("\\b", badwords.corpus, "\\b", sep = "")
badwords.corpus
# [1] "\\bthin\\b" "\\bthic\\b" "\\bthi\\b"
Then create the vect.corpus as I describe in the previous update, and use in str_replace_all.
vect.corpus <- goodwords.corpus
names(vect.corpus) <- badwords.corpus
str_replace_all(a, vect.corpus)
# [1] " think, thick, this"
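To see the effect of \\b in isolation, here is a tiny base R check (a sketch with made-up test words; perl = TRUE is used so the boundary is handled by the PCRE engine):

```r
words <- c("thin", "think", "thinking", "thin, thic")

# Without boundaries, "thin" is found inside longer words too
plain <- grepl("thin", words)
plain
# [1] TRUE TRUE TRUE TRUE

# With \\b on both sides, only the standalone word matches
bounded <- grepl("\\bthin\\b", words, perl = TRUE)
bounded
# [1]  TRUE FALSE FALSE  TRUE
```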
I think the function you are looking for is gsub():
gsub (pattern = "ful", replacement = a2, x = a1)
Create a list of the corrections then replace them using gsubfn which is a generalization of gsub that can also take list, function and proto object replacement objects. The regular expression matches a word boundary, one or more word characters and another word boundary. Each time it finds a match it looks up the match in the list names and if found replaces it with the corresponding list value.
library(gsubfn)
L <- list(ful = "full") # can add more words to this list if desired
gsubfn("\\b\\w+\\b", L, a1, perl = TRUE)
## [1] " the classroom is full "
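If you would rather not add a dependency, roughly the same lookup can be sketched in base R with gregexpr() and regmatches<-. This is an approximation of what gsubfn does here, not its implementation; corrections is a hypothetical dictionary stored as a named character vector.

```r
a1 <- " the classroom is ful "
corrections <- c(ful = "full")  # hypothetical dictionary: names are misspellings

# Locate every word in the string
m <- gregexpr("\\b\\w+\\b", a1, perl = TRUE)
words <- regmatches(a1, m)[[1]]

# Swap only the words that appear in the dictionary
hit <- words %in% names(corrections)
words[hit] <- corrections[words[hit]]

# Write the (possibly corrected) words back into a copy of the string
out <- a1
regmatches(out, m) <- list(words)
out
```

Because the words are written back in place, the original spacing and punctuation are preserved.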
For a kind of ordered replacement, you can try this
a1 <- c("the classroome is ful")
# ordered replacement
badwords.corpus <- c("ful", "classroome")
goodwords.corpus <- c("full", "classroom")
qdap::mgsub(badwords.corpus, goodwords.corpus, a1) # or
stringi::stri_replace_all_fixed(a1, badwords.corpus, goodwords.corpus, vectorize_all = FALSE)
For unordered replacement you can use approximate string matching (see stringdist::amatch). Here is an example:
a1 <- c("the classroome is ful")
a1
[1] "the classroome is ful"
library(stringdist)
goodwords.corpus <- c("full", "classroom")
badwords.corpus <- unlist(strsplit(a1, " ")) # extract words
for (badword in badwords.corpus){
patt <- paste0('\\b', badword, '\\b')
repl <- goodwords.corpus[amatch(badword, goodwords.corpus, maxDist = 1)] # you can change the distance see ?amatch
final.word <- ifelse(is.na(repl), badword, repl)
a1 <- gsub(patt, final.word, a1)
}
a1
[1] "the classroom is full"
When I call strsplit() on a column of a data frame, depending on the results of the strsplit(), I sometimes get one or two "sublists" as a result of splitting. For example,
v <- c("50", "1 h 30 ", "1 h", NA)
split <- strsplit(v, "h")
[[1]]
[1] "50"
[[2]]
[1] "1 "   " 30 "
[[3]]
[1] "1 "
[[4]]
[1] NA
I know I can access the individual lists of split using '[]' and '[[]]' tells me the contents of those sublists, so I think I understand that. And that I can access the " 30" in [[2]] by doing split[[2]][2].
Unfortunately, I don't know how to access this programmatically over the entire column that I have. I am trying to convert the column to numeric data. But that "1 h 30" case is giving me a lot of trouble.
func1 <- function(x){
split.l <- strsplit(x, "h")
len <- lapply(split.l, length)
total <- ifelse(len == 2, as.numeric(split.l[2]) + as.numeric(split.l[1]) * 60, as.numeric(split.l[2]))
return(total)
}
v <- ifelse(grepl("h", v), func1(v), as.numeric(v))
I know len returns the vector of the length of the splits. But when it comes to actually accessing that individual sublist's second element, I simply don't know how to do it properly. This will generate an error because split.l[1] and split.l[2] will only return the first two elements of the entire original dataframe column every time. [[1]] and [[2]] won't work either. I need something like [[i]][1] and [[i]][2]. But I'm trying not to use a for loop and iterate.
To make a long story short, how do I access the inner list element programmatically?
For reference, I did look at the question "apply strsplit to specific column in a data.frame", which helped, but I still haven't been able to solve it.
I'm really struggling with lists and list processing in R so any help is appreciated.
A common idiom is lapply(l, `[`, 2), which applied to your example gives:
> lapply(split, `[`, 2)
[[1]]
[1] NA
[[2]]
[1] " 30 "
[[3]]
[1] NA
[[4]]
[1] NA
sapply() will collapse this to a vector if it can.
What is being done is that lapply() takes each component of split in turn (this is the [[i]] bit of your pseudocode), and from each of those we extract the nth element. We do this by applying the `[` function with argument n, in this case 2L.
If you want the first element unless there is a second element, in which case take the second, you could just write a wrapper instead of using [ directly:
wrapper <- function(x) {
if(length(x) > 1L) {
x[2L]
} else {
x[1L]
}
}
lapply(split, wrapper)
which gives
> lapply(split, wrapper)
[[1]]
[1] "50"
[[2]]
[1] " 30 "
[[3]]
[1] "1 "
[[4]]
[1] NA
or perhaps
lens <- lengths(split)
out <- lapply(split, `[`, 2L)
ind <- lens == 1L
out[ind] <- lapply(split[ind], `[`, 1L)
out
but that loops over the output from strsplit() twice.
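Putting the pieces together for the numeric conversion in the question, here is one possible sketch. It rests on an assumption about the data that the question does not spell out: a value without "h" is already in minutes, and "X h Y" means X hours plus Y minutes. The names has_h, first, second and minutes are illustrative.

```r
v <- c("50", "1 h 30 ", "1 h", NA)

has_h <- grepl("h", v, fixed = TRUE)     # NA gives FALSE here
parts <- strsplit(v, "h", fixed = TRUE)

# First and second piece of every split, as numbers.
# `[` returns NA when the requested position does not exist.
first  <- as.numeric(trimws(vapply(parts, `[`, character(1), 1L)))
second <- as.numeric(trimws(vapply(parts, `[`, character(1), 2L)))
second[is.na(second)] <- 0               # no minutes part means 0 minutes

# Assumption: with "h" the first piece is hours, otherwise it is minutes
minutes <- ifelse(has_h, first * 60 + second, first)
minutes
```

Everything is vectorised over the column, so no explicit for loop is needed, and a missing input stays NA.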
I have a character vector where each element is a word. It has been generated from a text in which some segments are marked up with angle brackets. These segments vary in length. I need the marked up segments to be merged in the vector.
The input looks like this:
c("This","is","some","text","with","<marked","up","chunks>[L]","in","it")
I need the output to look like this:
c("This","is","some","text","with","<marked up chunks>[L]","in","it")
Thanks.
Here's an approach that also works with multiple chunks in a vector:
vec <- c("This","is","some","text","with","<marked","up","chunks>[L]","in","it")
from <- grep("<", vec)
to <- grep(">", vec)
idx <- mapply(seq, from, to, SIMPLIFY = FALSE)
new_strings <- sapply(idx, function(x)
paste(vec[x], collapse = " "))
replacement <- unlist(mapply(function(x, y) c(y, rep(NA, length(x) - 1)),
idx, new_strings, SIMPLIFY = FALSE))
new_vec <- "attributes<-"(na.omit(replace(vec, unlist(idx), replacement)), NULL)
[1] "This" "is"
[3] "some" "text"
[5] "with" "<marked up chunks>[L]"
[7] "in" "it"
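A different way to get the same result is to give every word a group id and paste each group together. This is only a sketch; it assumes chunks are not nested and that every "<" has a matching ">" later in the vector.

```r
vec <- c("This","is","some","text","with","<marked","up","chunks>[L]","in","it")

opens  <- grepl("<", vec, fixed = TRUE)
closes <- grepl(">", vec, fixed = TRUE)

# Count the brackets seen strictly before each word
opens_before  <- cumsum(opens)  - opens
closes_before <- cumsum(closes) - closes

# A word continues a chunk if a "<" is still unclosed when we reach it
continues <- opens_before > closes_before

# Every word that does not continue a chunk starts a new group
group <- cumsum(!continues)

merged <- unname(vapply(split(vec, group), paste, character(1), collapse = " "))
merged
```

Since the grouping is computed in one pass, multiple chunks in the same vector are handled without any special casing.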