This question already has an answer here:
strsplit on all spaces and punctuation except apostrophes [duplicate]
(1 answer)
Closed 7 years ago.
I'm trying to turn a character vector novel.lower.mid into a list of single words. So far, this is the code I've used:
midnight.words.l <- strsplit(novel.lower.mid, "\\W")
This produces a list of all the words. However, it splits everything, including contractions. The word "can't" becomes "can" and "t". How do I make sure those words aren't separated, or that the function just ignores the apostrophe?
We can use
library(stringr)
str_extract_all(novel.lower.mid, "\\b[[:alnum:]']+\\b")
Or
strsplit(novel.lower.mid, "(?!')\\W", perl=TRUE)
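For example, with a small stand-in for novel.lower.mid (a hypothetical sample, since the original data isn't shown), both calls keep contractions intact:
library(stringr)
novel.lower.mid <- c("i can't go", "she won't stay")  # hypothetical sample
str_extract_all(novel.lower.mid, "\\b[[:alnum:]']+\\b")
# [[1]]
# [1] "i"     "can't" "go"
#
# [[2]]
# [1] "she"   "won't" "stay"
strsplit(novel.lower.mid, "(?!')\\W", perl = TRUE)
# same words per element, returned as a list by strsplit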
If you just want your current "\W" split to not include apostrophes, negate \w and ':
novel.lower.mid <- c("I won't eat", "green eggs and", "ham")
strsplit(novel.lower.mid, "[^\\w']", perl=T)
# [[1]]
# [1] "I" "won't" "eat"
#
# [[2]]
# [1] "green" "eggs" "and"
#
# [[3]]
# [1] "ham"
Related
I have an R list of approx. 90 character vectors (representing 90 documents), each containing several author names. As a means to stem (or normalize) the names, I'd like to keep only the first character after the comma and space in each element. So, for example, "Smith, Joe" would become "Smith, J" (or "Smith J" would be fine).
1) I've tried using lapply with str_sub, but I can't seem to specify keeping one character past the comma (each element has different character length). 2) I also tried using lapply to split on the comma and make the last and first names separate elements, then using modify_depth to apply str_sub, but I can't figure out how to specifically use the str_sub only on the second element.
Fake sample to replicate issue.
doc1 = c("King, Stephen", "Martin, George")
doc2 = c("Clancy, Tom", "Patterson, James", "Stine, R.L.")
author = list(doc1,doc2)
What I've tried:
myfun1 = function(x,arg1){str_split(x, ", ")}
author = lapply(author, myfun1)
myfun2 = function(x,arg1){str_sub(x, end = 1L)}
f2 = modify_depth(author, myfun2, .depth = 2)
f2
[[1]]
[[1]][[1]]
[1] "K" "S"
[[1]][[2]]
[1] "M" "G"
Ultimately, I'm hoping that after applying a solution (maybe including unite()), the result will be as follows:
[[1]]
[[1]][[1]]
[1] "King S"
[[1]][[2]]
[1] "Martin G"
lapply( author, function(x) gsub( "(^.*, [A-Z]).*$", "\\1", x))
# [[1]]
# [1] "King, S" "Martin, G"
#
# [[2]]
# [1] "Clancy, T" "Patterson, J" "Stine, R"
What it does:
lapply loops over the list of author vectors
gsub replaces the part of each element matched by the regex "(^.*, [A-Z]).*$" with the first capture group (the part between the round brackets).
the regex "(^.*, [A-Z]).*$" captures everything from the start of the string (^.*) up to and including the first "comma, space, capital letter" (, [A-Z]); the trailing .*$ matches the rest, which is dropped.
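If you'd rather stay in stringr (as in the original attempt), the same regex works with str_replace; a sketch, reusing the original author list (list(doc1, doc2)) from above:
library(stringr)
lapply(author, function(x) str_replace(x, "(^.*, [A-Z]).*$", "\\1"))
# identical to the gsub result above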
This question already has answers here:
Complete word matching using grepl in R
(3 answers)
Closed 3 years ago.
I would like to use str_extract_all to extract specific text strings from many columns of a spreadsheet containing error descriptions. A sample list:
fire_match <- c('fire', 'burned', 'burnt', 'burn', 'injured', 'injury', 'hurt', 'dangerous',
'accident', 'collided', 'collide', 'crashed', 'crash', 'smolder', 'flame', 'melting',
'melted', 'melt', 'danger')
My code technically does what it is supposed to do, but I am also extracting (for example) 'fire' from 'misfire', which is incorrect. I am also having trouble making the matching case-insensitive.
This is a direct example of what is getting me 90% of the way there:
fire$Cause.Trigger <- str_extract_all(CAUSE_TEXT, paste(fire_match, collapse="|") )
My desired result is:
CAUSE_TEXT <- c("something caught fire", "something misfired",
"something caught Fire", "Injury occurred")
something caught fire -> fire
something misfired -> N/A
something caught Fire -> fire
Injury occurred -> injury
You can just add \b to your individual terms to make sure they match at a word boundary.
pattern <- paste0("\\b", paste(fire_match , collapse="\\b|\\b"), "\\b")
str_extract_all(CAUSE_TEXT, regex(pattern, ignore_case = TRUE))
# [[1]]
# [1] "fire"
# [[2]]
# character(0)
# [[3]]
# [1] "Fire"
# [[4]]
# [1] "Injury"
Split the text by space unless a group of words matches one of the given phrases.
If a group of words does match, keep it together as a single element.
text <- c('considerate and helpful','not bad at all','this is helpful')
pattern <- c('considerate and helpful','not bad')
Desired output:
considerate and helpful, not bad, at, all, this, is, helpful
Thank you for the help!
Of course, just put the phrases in front of \w+ in the alternation:
library("stringr")
text <- c('considerate and helpful','not bad at all','this is helpful')
parts <- str_extract_all(text, "considerate and helpful|not bad|\\w+")
parts
Which yields
[[1]]
[1] "considerate and helpful"
[[2]]
[1] "not bad" "at" "all"
[[3]]
[1] "this" "is" "helpful"
It does not split on whitespace but rather extracts the "words".
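If the multi-word phrases live in a character vector (as pattern does in the question), the same regex can be built programmatically; a small sketch:
library(stringr)
text <- c('considerate and helpful', 'not bad at all', 'this is helpful')
pattern <- c('considerate and helpful', 'not bad')
# put the phrases first so they win over the single-word \w+ fallback
rx <- paste(c(pattern, "\\w+"), collapse = "|")
str_extract_all(text, rx)
# same output as above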
I have a vector of character strings, where each string is a comma-separated list of species names (i.e. "Genus species"). Each string can have a variable number of species in it (e.g. as shown in the example below, the number of species in a given string ranges from 1 to 3).
trees <- c("Erythrina poeppigiana", "Erythrina poeppigiana, Juglans regia x Juglans nigra", "Erythrina poeppigiana, Juglans regia x Juglans nigra, Chloroleucon eurycyclum")
I wish to obtain a vector of character strings of the same length, but where each string is a comma-separated list of the genus portions of the names only:
genera <- c("Erythrina", "Erythrina, Juglans", "Erythrina, Juglans, Chloroleucon")
The screwy one is the "Juglans regia x Juglans nigra" hybrid. This should just come out as "Juglans", as it is all contained between two commas and is therefore just one species. In hybrids like this, the genus is always the same on both sides of the "x", so taking just the first word in that portion of the string is fine, just like with the more standard cases. However, solutions that attempt to pull out "every other word" won't work because of these hybrids.
My attempt was to first strsplit by ", " to separate out the individual species names, then strsplit again by " " to separate out the genus names:
split.list <- sapply(strsplit(trees, split=", "), strsplit, 1, split=" ")
split.list
[[1]]
[[1]][[1]]
[1] "Erythrina" "poeppigiana"
[[2]]
[[2]][[1]]
[1] "Erythrina" "poeppigiana"
[[2]][[2]]
[1] "Juglans" "regia" "x" "Juglans" "nigra"
[[3]]
[[3]][[1]]
[1] "Erythrina" "poeppigiana"
[[3]][[2]]
[1] "Juglans" "regia" "x" "Juglans" "nigra"
[[3]][[3]]
[1] "Chloroleucon" "eurycyclum"
But then the indexing to pull out the genus names and recombine is quite complicated (and I can't even figure it out!). Is there a cleaner solution for an ordered split and recombination?
It would also be acceptable to leverage the fact that genus names are the only capitalized words in every string. Maybe a regex that pulls out just the capitalized words?
Here is an idea via base R: split on spaces and keep every other word. Note that this only works when there are no hybrid names (the output below is for a trees vector without the hybrid):
sapply(strsplit(trees, ' '), function(i) toString(i[c(TRUE, FALSE)]))
#[1] "Erythrina" "Erythrina, Terminalia" "Erythrina, Terminalia, Chloroleucon"
EDIT
Further to your comment, for the new trees you can simply do:
sapply(strsplit(trees, ', '), function(i) toString(sub('\\s+.*', '', i)))
#[1] "Erythrina, Juglans" "Erythrina"
#[3] "Erythrina, Juglans, Chloroleucon"
Two related questions. I have vectors of text data such as
"a(b)jk(p)" "ipq" "e(ijkl)"
and want to easily separate it into a vector containing the text OUTSIDE the parentheses:
"ajk" "ipq" "e"
and a vector containing the text INSIDE the parentheses:
"bp" "" "ijkl"
Is there any easy way to do this? An added difficulty is that these can get quite large and have a large (unlimited) number of parentheses. Thus, I can't simply grab text "pre/post" the parentheses and need a smarter solution.
Text outside the parentheses
> x <- c("a(b)jk(p)" ,"ipq" , "e(ijkl)")
> gsub("\\([^()]*\\)", "", x)
[1] "ajk" "ipq" "e"
Text inside the parentheses
> x <- c("a(b)jk(p)" ,"ipq" , "e(ijkl)")
> gsub("(?<=\\()[^()]*(?=\\))(*SKIP)(*F)|.", "", x, perl=T)
[1] "bp" "" "ijkl"
The (?<=\\()[^()]*(?=\\)) part matches all the characters present inside the brackets, and the following (*SKIP)(*F) forces that match to fail. The engine then tries the pattern after the | symbol against the remaining string, so the dot . matches all the characters which were not already skipped. Replacing all the matched characters with an empty string leaves only the text present inside the brackets.
> gsub("\\(([^()]*)\\)|.", "\\1", x, perl=T)
[1] "bp" "" "ijkl"
This regex captures all the characters present inside the brackets and also matches every other character. The |. alternative takes care of the remaining characters other than the captured ones, so replacing each match with the contents of capture group 1 gives the desired output.
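Another base R route that may be easier to read is to extract the bracketed pieces with regmatches()/gregexpr() and paste them back together; a sketch:
x <- c("a(b)jk(p)", "ipq", "e(ijkl)")
m <- regmatches(x, gregexpr("(?<=\\()[^()]*(?=\\))", x, perl = TRUE))
sapply(m, paste, collapse = "")
# [1] "bp"   ""     "ijkl"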
The rm_round function in the qdapRegex package I maintain was born to do this:
First we'll get and load the package via pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(qdapRegex)
## Then we can use it to remove and extract the parts you want:
x <-c("a(b)jk(p)", "ipq", "e(ijkl)")
rm_round(x)
## [1] "ajk" "ipq" "e"
rm_round(x, extract=TRUE)
## [[1]]
## [1] "b" "p"
##
## [[2]]
## [1] NA
##
## [[3]]
## [1] "ijkl"
To condense b and p use:
sapply(rm_round(x, extract=TRUE), paste, collapse="")
## [1] "bp" "NA" "ijkl"