multiple ordered strsplit, then recombine

I have a vector of character strings, where each string is a comma-separated list of species names (i.e. Genus species). Each string can contain a variable number of species (in the example below, between 1 and 3).
trees <- c("Erythrina poeppigiana", "Erythrina poeppigiana, Juglans regia x Juglans nigra", "Erythrina poeppigiana, Juglans regia x Juglans nigra, Chloroleucon eurycyclum")
I wish to obtain a vector of character strings of the same length, where each string is a comma-separated list of only the genus portions of the names:
genera <- c("Erythrina", "Erythrina, Juglans", "Erythrina, Juglans, Chloroleucon")
The screwy species is the "Juglans regia x Juglans nigra" hybrid species. This should just come out as "Juglans", as it is all contained between two commas and is therefore just one species. In hybrids like this, the genus is always the same on both sides of the "x", so just the first word in that portion of the string is fine, just like with the more standard cases. However, solutions that attempt to pull out "every other word" won't work because of these hybrids.
My attempt was to first strsplit by ", " to separate out the individual species names, then strsplit again by " " to separate out the genus names:
split.list <- sapply(strsplit(trees, split=", "), strsplit, 1, split=" ")
split.list
[[1]]
[[1]][[1]]
[1] "Erythrina" "poeppigiana"
[[2]]
[[2]][[1]]
[1] "Erythrina" "poeppigiana"
[[2]][[2]]
[1] "Juglans" "regia" "x" "Juglans" "nigra"
[[3]]
[[3]][[1]]
[1] "Erythrina" "poeppigiana"
[[3]][[2]]
[1] "Juglans" "regia" "x" "Juglans" "nigra"
[[3]][[3]]
[1] "Chloroleucon" "eurycyclum"
But then the indexing to pull out the genus names and recombine is quite complicated (and I can't even figure it out!). Is there a cleaner solution for an ordered split and recombination?
It would also be acceptable to leverage the fact that genus names are the only capitalized words in every string. Maybe a regex that pulls out just the words starting with a capital letter?

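Here is an idea via base R that keeps every other word (the output below is for the original example data, in which every species name was exactly two words):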
sapply(strsplit(trees, ' '), function(i) toString(i[c(TRUE, FALSE)]))
#[1] "Erythrina" "Erythrina, Terminalia" "Erythrina, Terminalia, Chloroleucon"
EDIT
Further to your comment, for the new trees (with the hybrid), you can simply split on ', ' and drop everything after the first word of each piece:
sapply(strsplit(trees, ', '), function(i) toString(sub('\\s+.*', '', i)))
#[1] "Erythrina" "Erythrina, Juglans"
#[3] "Erythrina, Juglans, Chloroleucon"
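If a single vectorised call is preferred, the same idea can be sketched with one gsub. This is only an illustrative alternative, not part of the answer above; it assumes the pieces are separated by commas and that the genus is the first run of characters in each piece containing neither a comma nor a space.

```r
trees <- c("Erythrina poeppigiana",
           "Erythrina poeppigiana, Juglans regia x Juglans nigra",
           "Erythrina poeppigiana, Juglans regia x Juglans nigra, Chloroleucon eurycyclum")

# ([^, ]+) captures the first word of a comma-separated piece;
# [^,]* then swallows the rest of that piece up to the next comma.
# Replacing each match with the captured word keeps only the genus,
# while the ", " separators are never matched and so stay in place.
genera <- gsub("([^, ]+)[^,]*", "\\1", trees)
genera
#[1] "Erythrina" "Erythrina, Juglans"
#[3] "Erythrina, Juglans, Chloroleucon"
```

Because nothing is split apart, there is no nested list to index and recombine.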

Related

Removing unwanted parts of strings in a list, and combining the pieces into a single string in R

I am trying to take a list of strings, remove everything except capital letters, and output a list of strings without any spaces or breaks.
Unfortunately, I have been trying to use str_extract_all(), but whenever the original string contains characters other than capital letters, it outputs the relevant pieces separately, as a list of character vectors.
Can anyone please suggest a way to get the desired output?
# Some example data:
a <- list("n[28.0313]MVNNGHSFNVEYDDSQDK[28.0313]AVLK[28.0313]D_+4",
"SLGKVGTRC[71.0371]CTK[28.0313]PESER_+4",
"n[28.0313]AVVQDPALK[28.0313]PLALVY_+3",
"n[28.0313]TCVADESHAGC[71.0371]EK[28.0313]_+2")
# The desired output:
list("MVNNGHSFNVEYDDSQDKAVLKD",
"SLGKVGTRCCTKPESER",
"AVVQDPALKPLALVY",
"TCVADESHAGCEK")
# What I've tried so far:
a %>% str_extract_all("[A-Z]+")
[[1]]
[1] "MVNNGHSFNVEYDDSQDK" "AVLK" "D"
[[2]]
[1] "SLGKVGTRC" "CTK" "PESER"
[[3]]
[1] "AVVQDPALK" "PLALVY"
[[4]]
[1] "TCVADESHAGC" "EK"
# Not what I want.
I need to find a way to isolate the strings and combine them, but I'm at the limit of my R knowledge.

As str_extract_all() returns a list with several pieces per element, we can loop over that list and paste the pieces of each element together
library(dplyr)
library(stringr)
library(purrr)
a %>%
str_extract_all("[A-Z]+") %>%
map_chr(str_c, collapse="")
-output
[1] "MVNNGHSFNVEYDDSQDKAVLKD" "SLGKVGTRCCTKPESER"
[3] "AVVQDPALKPLALVY" "TCVADESHAGCEK"
Or just use gsub to match all characters other than upper-case letters and replace them with an empty string
gsub("[^A-Z]+", "", a)
[1] "MVNNGHSFNVEYDDSQDKAVLKD" "SLGKVGTRCCTKPESER" "AVVQDPALKPLALVY" "TCVADESHAGCEK"
or with str_remove_all
str_remove_all(a, "[^A-Z]+")
[1] "MVNNGHSFNVEYDDSQDKAVLKD" "SLGKVGTRCCTKPESER" "AVVQDPALKPLALVY" "TCVADESHAGCEK"
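The output is a character vector; to get a list with one string per element, as in the desired output, convert it with as.list (wrapping it in list() would instead give a list of length one holding the whole vector)
as.list(str_remove_all(a, "[^A-Z]+"))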
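For readers without stringr and purrr, the extract-then-recombine route can also be sketched in base R. This is an illustrative equivalent rather than part of the answer; the variable names x, pieces, joined and result are made up for the example, and perl = TRUE is used only so that [A-Z] means the ASCII capitals regardless of locale.

```r
a <- list("n[28.0313]MVNNGHSFNVEYDDSQDK[28.0313]AVLK[28.0313]D_+4",
          "SLGKVGTRC[71.0371]CTK[28.0313]PESER_+4",
          "n[28.0313]AVVQDPALK[28.0313]PLALVY_+3",
          "n[28.0313]TCVADESHAGC[71.0371]EK[28.0313]_+2")

# Flatten the list of single strings into a character vector.
x <- unlist(a)

# One character vector of capital-letter runs per input string.
pieces <- regmatches(x, gregexpr("[A-Z]+", x, perl = TRUE))

# Glue the runs of each string back together, then return a list.
joined <- vapply(pieces, paste, character(1), collapse = "")
result <- as.list(joined)
joined
#[1] "MVNNGHSFNVEYDDSQDKAVLKD" "SLGKVGTRCCTKPESER"
#[3] "AVVQDPALKPLALVY" "TCVADESHAGCEK"
```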

How to iterate through an R list of character vectors to modify each element by keeping all characters up to and including one character past comma

I have an R list of approx. 90 character vectors (representing 90 documents), each containing several author names. As a means to stem (or normalize, what have you) the names, I'd like to drop all characters after the white-space and first character just past the comma in each element. So, for example, "Smith, Joe" would become "Smith, J" (or "Smith J" would be fine).
1) I've tried using lapply with str_sub, but I can't seem to specify keeping one character past the comma (each element has different character length). 2) I also tried using lapply to split on the comma and make the last and first names separate elements, then using modify_depth to apply str_sub, but I can't figure out how to specifically use the str_sub only on the second element.
Fake sample to replicate issue.
doc1 = c("King, Stephen", "Martin, George")
doc2 = c("Clancy, Tom", "Patterson, James", "Stine, R.L.")
author = list(doc1,doc2)
What I've tried:
myfun1 = function(x,arg1){str_split(x, ", ")}
author = lapply(author, myfun1)
myfun2 = function(x,arg1){str_sub(x, end = 1L)}
f2 = modify_depth(author, myfun2, .depth = 2)
f2
[[1]]
[[1]][[1]]
[1] "K" "S"
[[1]][[2]]
[1] "M" "G"
Ultimately, I'm hoping after applying a solution, including maybe using unite(), the result will be as follows:
[[1]]
[[1]][[1]]
[1] "King S"
[[1]][[2]]
[1] "Martin G"

lapply( author, function(x) gsub( "(^.*, [A-Z]).*$", "\\1", x))
# [[1]]
# [1] "King, S" "Martin, G"
#
# [[2]]
# [1] "Clancy, T" "Patterson, J" "Stine, R"
What it does:
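lapply loops over the list of author vectors.
gsub replaces the part of each element matched by the regex "(^.*, [A-Z]).*$" (here, the whole string) with the first group (the part between the round brackets).
The regex "(^.*, [A-Z]).*$" puts everything from the start of the string (^.*) up to and including 'comma, space, capital letter' (, [A-Z]) into a group; because .* is greedy, that is the last such occurrence if there are several. The trailing .*$ matches the rest of the string, which is dropped.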
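The question also mentions that "Smith J" (without the comma) would be acceptable. A minimal sketch of that variant, assuming the same "Surname, Given" layout, uses two capture groups instead of one; the name stemmed is made up for the example.

```r
doc1 <- c("King, Stephen", "Martin, George")
doc2 <- c("Clancy, Tom", "Patterson, James", "Stine, R.L.")
author <- list(doc1, doc2)

# Group 1 is the surname (everything before ", "),
# group 2 is the first capital letter after it.
# The replacement "\\1 \\2" joins them with a space, dropping the comma.
stemmed <- lapply(author, function(x) sub("^(.*), ([A-Z]).*$", "\\1 \\2", x))
stemmed
# [[1]]
# [1] "King S" "Martin G"
#
# [[2]]
# [1] "Clancy T" "Patterson J" "Stine R"
```

sub is enough here because the pattern is anchored and can match at most once per string.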

Regex: how to keep all digits when splitting a string?

Question
Using a regular expression, how do I keep all digits when splitting a string?
Overview
I would like to split each element within the character vector sample.text into two elements: one of only digits and one of only the text.
Current Attempt is Dropping Last Digit
This regular expression - \\d\\s{1} - inside of base::strsplit() removes the last digit. Below is my attempt, along with my desired output.
# load necessary data -----
sample.text <-
c("111110 Soybean Farming", "0116 Soybeans")
# split string by digit and one space pattern ------
strsplit(sample.text, split = "\\d\\s{1}")
# [[1]]
# [1] "11111" "Soybean Farming"
#
# [[2]]
# [1] "011" "Soybeans"
# desired output --------
# [[1]]
# [1] "111110" "Soybean Farming"
#
# [[2]]
# [1] "0116" "Soybeans"
# end of script #
Any advice on how I can split sample.text to keep all digits would be much appreciated! Thank you.

Because you're splitting on \\d, the digit there is consumed in the regex, and not present in the output. Use lookbehind for a digit instead:
strsplit(sample.text, split = "(?<=\\d) ", perl=TRUE)
http://rextester.com/GDVFU71820
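To see the difference side by side, here is a small sketch that runs both patterns on the question's data. The names consumed and kept are made up for the example.

```r
sample.text <- c("111110 Soybean Farming", "0116 Soybeans")

# The digit is part of the delimiter here, so it is removed along with the space.
consumed <- strsplit(sample.text, split = "\\d\\s")

# The lookbehind only checks that a digit precedes the space;
# the delimiter itself is just the space, so the digit survives.
kept <- strsplit(sample.text, split = "(?<=\\d) ", perl = TRUE)

consumed[[1]]
# [1] "11111" "Soybean Farming"
kept[[1]]
# [1] "111110" "Soybean Farming"
```

Note that the space between "Soybean" and "Farming" is not a split point in either case, because it is not preceded by a digit.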

Some alternative solutions, using very simple pattern matching on the first occurrence of space:
1) Indirectly by using sub to substitute your own delimiter, then strsplit on your delimiter:
E.g. you can substitute ';' for the first space (if you know that character does not exist in your data):
strsplit( sub(' ', ';', sample.text), split=';')
2) Using regexpr and regmatches
You can effectively match on the first " " (space character), and split as follows:
regmatches(sample.text, regexpr(" ", sample.text), invert = TRUE)
Result is a list, if that's what you are after per your sample desired output:
[[1]]
[1] "111110" "Soybean Farming"
[[2]]
[1] "0116" "Soybeans"
3) Using stringr library:
library(stringr)
str_split_fixed(sample.text, " ", 2) #outputs a character matrix
[,1] [,2]
[1,] "111110" "Soybean Farming"
[2,] "0116" "Soybeans"

Extract text in parentheses in R

Two related questions. I have vectors of text data such as
"a(b)jk(p)" "ipq" "e(ijkl)"
and want to easily separate it into a vector containing the text OUTSIDE the parentheses:
"ajk" "ipq" "e"
and a vector containing the text INSIDE the parentheses:
"bp" "" "ijkl"
Is there any easy way to do this? An added difficulty is that these can get quite large and have a large (unlimited) number of parentheses. Thus, I can't simply grab text "pre/post" the parentheses and need a smarter solution.

Text outside the parenthesis
> x <- c("a(b)jk(p)" ,"ipq" , "e(ijkl)")
> gsub("\\([^()]*\\)", "", x)
[1] "ajk" "ipq" "e"
Text inside the parenthesis
> x <- c("a(b)jk(p)" ,"ipq" , "e(ijkl)")
> gsub("(?<=\\()[^()]*(?=\\))(*SKIP)(*F)|.", "", x, perl=T)
[1] "bp" "" "ijkl"
The (?<=\\()[^()]*(?=\\)) part matches all the characters inside the brackets, and the following (*SKIP)(*F) makes that match fail and skips past it. The regex engine then tries the pattern after the | symbol against the remaining string, so the dot . matches every character that was not skipped. Replacing all the matched characters with an empty string leaves only the text inside the brackets.
> gsub("\\(([^()]*)\\)|.", "\\1", x, perl=T)
[1] "bp" "" "ijkl"
This regex captures the characters inside each pair of brackets in group 1, while the |. alternative matches every other character, one at a time. Replacing each match with the contents of group 1 (which is empty for the characters matched by the dot) gives the desired output.
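A third way to get the text inside the brackets is to extract the pieces first and then paste them together. The following is a sketch in base R, not part of the answer above; the names inside_pieces, inside and outside are made up for the example.

```r
x <- c("a(b)jk(p)", "ipq", "e(ijkl)")

# Find every run of non-bracket characters that sits between "(" and ")".
inside_pieces <- regmatches(x, gregexpr("(?<=\\()[^()]*(?=\\))", x, perl = TRUE))

# Strings without brackets yield character(0), which paste() collapses to "".
inside <- vapply(inside_pieces, paste, character(1), collapse = "")

# Text outside the brackets, as in the first gsub above.
outside <- gsub("\\([^()]*\\)", "", x)

inside
# [1] "bp" "" "ijkl"
outside
# [1] "ajk" "ipq" "e"
```

Keeping inside_pieces around is handy when the individual bracketed pieces are needed separately.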

The rm_round function in the qdapRegex package I maintain was born to do this:
First we'll get and load the package via pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(qdapRegex)
## Then we can use it to remove and extract the parts you want:
x <-c("a(b)jk(p)", "ipq", "e(ijkl)")
rm_round(x)
## [1] "ajk" "ipq" "e"
rm_round(x, extract=TRUE)
## [[1]]
## [1] "b" "p"
##
## [[2]]
## [1] NA
##
## [[3]]
## [1] "ijkl"
To condense b and p use:
sapply(rm_round(x, extract=TRUE), paste, collapse="")
## [1] "bp" "NA" "ijkl"

Using R to compare two words and find letters unique to second word (across c. 6000 cases)

I have a dataframe comprising two columns of words. For each row I'd like to identify any letters that occur in only the word in the second column e.g.
carpet carpelt #return 'l'
bag flag #return 'f' & 'l'
dog dig #return 'i'
I'd like to use R to do this automatically as I have 6126 rows.
As an R newbie, the best I've got so far is this, which gives me the unique letters across both words (and is obviously very clumsy):
x<-(strsplit("carpet", ""))
y<-(strsplit("carpelt", ""))
z<-list(l1=x, l2=y)
unique(unlist(z))
Any help would be much appreciated.

The function you’re searching for is setdiff:
chars_for = function (str)
strsplit(str, '')[[1]]
result = setdiff(chars_for(word2), chars_for(word1))
(Note the inverted order of the arguments in setdiff.)
To apply it to the whole data.frame, called x:
apply(x, 1, function (words) setdiff(chars_for(words[2]), chars_for(words[1])))
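As a self-contained sketch of the above, assuming a data frame with hypothetical column names word1 and word2, Map can be used instead of apply so that the result is always a list (apply may simplify its result when every row returns the same number of letters):

```r
# Hypothetical data frame holding the question's three example rows.
words <- data.frame(word1 = c("carpet", "bag", "dog"),
                    word2 = c("carpelt", "flag", "dig"),
                    stringsAsFactors = FALSE)

# Split a single string into its individual characters.
chars_for <- function(str) strsplit(str, "")[[1]]

# For each row: letters of word2 that do not occur in word1.
unique_to_second <- Map(function(w1, w2) setdiff(chars_for(w2), chars_for(w1)),
                        words$word1, words$word2)

# Map names the result after word1; drop the names for a plain list.
unique_to_second <- unname(unique_to_second)
unique_to_second
# [[1]]
# [1] "l"
#
# [[2]]
# [1] "f" "l"
#
# [[3]]
# [1] "i"
```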

Use regex :) Paste your word with brackets [] and then use replace function for regex. This regex finds any letter from those in brackets and replaces it with empty string (you can say that it "removes" these letters).
require(stringi)
x <- c("carpet","bag","dog")
y <- c("carplet", "flag", "smog")
pattern <- stri_paste("[",x,"]")
pattern
## [1] "[carpet]" "[bag]" "[dog]"
stri_replace_all_regex(y, pattern, "")
## [1] "l" "fl" "sm"

x <- c("carpet","bag","dog")
y <- c("carpelt", "flag", "dig")
Following (somewhat) with what you were going for with strsplit, you could do
> sx <- strsplit(x, "")
> sy <- strsplit(y, "")
> lapply(seq_along(sx), function(i) sy[[i]][ !sy[[i]] %in% sx[[i]] ])
#[[1]]
#[1] "l"
#
#[[2]]
#[1] "f" "l"
#
#[[3]]
#[1] "i"
This uses %in% to logically match the characters in y with the characters in x. I negate the matching with ! to determine those characters that are in y but not in x.
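One difference from setdiff is worth knowing: the %in% version keeps every occurrence of a letter, whereas setdiff returns each letter once. A small sketch with a made-up word pair ("bag" and "fluff", not from the question) shows this:

```r
sx <- strsplit("bag", "")[[1]]
sy <- strsplit("fluff", "")[[1]]

# Logical indexing keeps repeated letters in their original positions.
with_repeats <- sy[!sy %in% sx]

# setdiff treats its arguments as sets, so duplicates are dropped.
distinct <- setdiff(sy, sx)

with_repeats
# [1] "f" "l" "u" "f" "f"
distinct
# [1] "f" "l" "u"
```

Which of the two is appropriate depends on whether repeated letters matter for the comparison.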