How to extract several substrings with a for loop in R - r

I have the following 100 strings:
[3] "Department_Complementary_Demand_Converted_Sum"
[4] "Department_Home_Demand_Converted_Sum"
[5] "Department_Store A_Demand_Converted_Sum"
[6] "Department_Store B_Demand_Converted_Sum"
...
[100] "Department_Unisex_Demand_Converted_Sum"
Obviously, I could use substr() on every string with different start and end indices. But as one can see, all the strings start with Department_ and end with _Demand_Converted_Sum. I only want to extract what's in between. If there were a way to always start at index 11 from the left and end at index 21 from the right, I could just run a for loop over all 100 strings above.
Example
Given input: Department_Unisex_Demand_Converted_Sum
Expected output: Unisex

Looks like a classic case for lookarounds:
library(stringr)
str_extract(str, "(?<=Department_)[^_]+(?=_)")
[1] "Complementary" "Home" "Store A"
Data:
str <- c("Department_Complementary_Demand_Converted_Sum",
"Department_Home_Demand_Converted_Sum",
"Department_Store A_Demand_Converted_Sum")

Using strsplit(),
sapply(strsplit(str, '_'), '[', 2)
# [1] "Complementary" "Home" "Store A"
or stringi::stri_sub_all().
unlist(stringi::stri_sub_all(str, 12, -22))
# [1] "Complementary" "Home" "Store A"
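Since the prefix and suffix are fixed, base R alone may be enough. A minimal sketch, assuming every string really starts with Department_ and ends with _Demand_Converted_Sum (the names x, middle_regex and middle_substr are only illustrative):

```r
x <- c("Department_Complementary_Demand_Converted_Sum",
       "Department_Home_Demand_Converted_Sum",
       "Department_Store A_Demand_Converted_Sum")

# Capture whatever sits between the fixed prefix and the fixed suffix
middle_regex <- sub("^Department_(.*)_Demand_Converted_Sum$", "\\1", x)

# Same idea with fixed offsets: skip the 11-character prefix and
# drop the 21-character suffix, counted from each string's own length
middle_substr <- substr(x, 12, nchar(x) - 21)

middle_regex
middle_substr
```

Both calls are vectorised, so no explicit for loop is needed.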

Related

How to iterate through an R list of character vectors to modify each element by keeping all characters up to and including one character past comma

I have an R list of approx. 90 character vectors (representing 90 documents), each containing several author names. As a means to stem (or normalize, what have you) the names, I'd like to drop all characters after the white-space and first character just past the comma in each element. So, for example, "Smith, Joe" would become "Smith, J" (or "Smith J" would be fine).
1) I've tried using lapply with str_sub, but I can't seem to specify keeping one character past the comma (each element has different character length). 2) I also tried using lapply to split on the comma and make the last and first names separate elements, then using modify_depth to apply str_sub, but I can't figure out how to specifically use the str_sub only on the second element.
Fake sample to replicate issue.
doc1 = c("King, Stephen", "Martin, George")
doc2 = c("Clancy, Tom", "Patterson, James", "Stine, R.L.")
author = list(doc1,doc2)
What I've tried:
myfun1 = function(x,arg1){str_split(x, ", ")}
author = lapply(author, myfun1)
myfun2 = function(x,arg1){str_sub(x, end = 1L)}
f2 = modify_depth(author, myfun2, .depth = 2)
f2
[[1]]
[[1]][[1]]
[1] "K" "S"
[[1]][[2]]
[1] "M" "G"
Ultimately, I'm hoping after applying a solution, including maybe using unite(), the result will be as follows:
[[1]]
[[1]][[1]]
[1] "King S"
[[1]][[2]]
[1] "Martin G"
lapply( author, function(x) gsub( "(^.*, [A-Z]).*$", "\\1", x))
# [[1]]
# [1] "King, S" "Martin, G"
#
# [[2]]
# [1] "Clancy, T" "Patterson, J" "Stine, R"
What it does:
lapply loops over the list of authors.
gsub replaces the part of each vector element matched by the regex "(^.*, [A-Z]).*$" with the first group (the part between the round brackets).
The regex "(^.*, [A-Z]).*$" captures everything from the start (^.*) up to and including the 'comma, space, capital letter' sequence (, [A-Z]) in that group.
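If the comma-free form ("Smith J") is preferred, two capture groups can be used instead. This is only a sketch on the sample data from the question; the name initials is illustrative, and it assumes each name contains exactly one "comma space" separator:

```r
author <- list(c("King, Stephen", "Martin, George"),
               c("Clancy, Tom", "Patterson, James", "Stine, R.L."))

# Group 1: the surname (everything before the comma)
# Group 2: the first character after "comma space"
initials <- lapply(author, function(x) sub("^([^,]+), (.).*$", "\\1 \\2", x))
initials
```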

How to match everything except for digits followed by a space and ONLY digits followed by a space?

The problem
What the header says, basically. Given a string, I need to extract from it everything that is not a leading number followed by a space. So, given this string
"420 species of grass"
I would like to get
"species of grass"
But, given a string with a number not in the beginning, like so
"The clock says it is 420"
or a string with a number not followed by a space, like so
"It is 420 already"
I would like to get back the same string, with the number preserved
"The clock says it is 420"
"It is 420 already"
What I have tried
Matching a leading number followed by a space works as expected:
library(stringr)
str_extract_all("420 species of grass", "^\\d+(?=\\s)")
[[1]]
[1] "420"
> str_extract_all("The clock says it is 420", "^\\d+(?=\\s)")
[[1]]
character(0)
> str_extract_all("It is 420 already", "^\\d+(?=\\s)")
[[1]]
character(0)
But, when I try to match anything but a leading number followed by a space, it doesn't:
> str_extract_all("420 species of grass", "[^(^\\d+(?=\\s))]+")
[[1]]
[1] "species" "of" "grass"
> str_extract_all("The clock says it is 420", "[^(^\\d+(?=\\s))]+")
[[1]]
[1] "The" "clock" "says" "it" "is"
> str_extract_all("It is 420 already", "[^(^\\d+(?=\\s))]+")
[[1]]
[1] "It" "is" "already"
It seems this regex matches anything but digits AND spaces instead.
How do I fix this?
I think @Douglas's answer is more concise; however, I guess your actual case is more complicated, so you may want to check ?regexpr, which can identify the starting position of your specific pattern.
A method using a for loop is below:
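strings <- list("420 species of grass",
                "The clock says it is 420",
                "It is 420 already")
extract <- function(x) {
  y <- vector('list', length(x))
  for (i in seq_along(x)) {
    # regexpr() returns the start of the match, or -1 if there is none
    m <- regexpr("^\\d+ ", x[[i]])
    if (m[[1]] == -1) {
      # no leading number followed by a space: keep the string as is
      y[[i]] <- x[[i]]
    } else {
      # drop the matched prefix (digits plus the space)
      y[[i]] <- substr(x[[i]], attr(m, "match.length") + 1, nchar(x[[i]]))
    }
  }
  return(y)
}
> extract(strings)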
[[1]]
[1] "species of grass"
[[2]]
[1] "The clock says it is 420"
[[3]]
[1] "It is 420 already"
I think the easiest way to do this is by removing the numbers instead of extracting the desired pattern:
library(stringr)
strings <- c("420 species of grass", "The clock says it is 420", "It is 420 already")
str_remove(strings, pattern = "^\\d+\\s")
[1] "species of grass" "The clock says it is 420" "It is 420 already"
An easy way out is to replace any digits followed by whitespace at the very start of the string, using this regex,
^\d+\s+
with an empty string.
Sample R code using sub:
sub("^\\d+\\s+", "", "420 species of grass")
sub("^\\d+\\s+", "", "The clock says it is 420")
sub("^\\d+\\s+", "", "It is 420 already")
Prints,
[1] "species of grass"
[1] "The clock says it is 420"
[1] "It is 420 already"
Alternatively, to achieve the same result by matching, you can use the following regex and capture the contents of group 1,
^(?:\d+\s+)?(.*)$
Also, metacharacters placed inside a character class lose their special meaning. In [^(^\\d+(?=\\s))]+ the parentheses, ^, +, ? and = of the lookahead are treated as literal characters, while \\d and \\s still stand for digits and whitespace, so the negated class simply excludes all of those characters and your regex becomes incorrect.
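The behaviour of that character class can be checked on a small input. The sketch below uses base R with perl = TRUE rather than stringr, and the test string x is made up for illustration:

```r
x <- "a(b) 1+c"

# The bracket expression is the one from the question. Inside [...] the
# parentheses, +, ? and = are ordinary characters, so the negated class
# excludes them along with digits (\\d) and whitespace (\\s).
tokens <- regmatches(x, gregexpr("[^(^\\d+(?=\\s))]+", x, perl = TRUE))[[1]]
tokens
```

Only the letters survive, which matches what the question observed: the pattern splits on digits and spaces instead of acting as a lookahead.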
Edit:
Although the solution using sub is better, if you want a match-based solution in R, you need to use str_match instead of str_extract_all, and to access the contents of group 1 you need to use [,2].
R code using match:
library(stringr)
print(str_match("420 species of grass", "^(?:\\d+\\s+)?(.*)$")[,2])
print(str_match("The clock says it is 420", "^(?:\\d+\\s+)?(.*)$")[,2])
print(str_match("It is 420 already", "^(?:\\d+\\s+)?(.*)$")[,2])
Prints,
[1] "species of grass"
[1] "The clock says it is 420"
[1] "It is 420 already"

conditional concatenation in R

I have a vector like this:
> myarray
[1] "AA\tThis is "
[2] "\tthe "
[3] "\tbeginning."
[4] "BB\tA string of "
[5] "\tcharacters."
[6] "CC\tA short line."
[7] "DD\tThe "
[8] "\tend."
I am trying to write a function that processes the above to generate this:
> myoutput
[1] "AA\tThis is the beginning."
[2] "BB\tA string of characters."
[3] "CC\tA short line."
[4] "DD\tThe end."
This is doable by looping through the rows and using an if statement to concatenate the current row with the last one if it starts with a \t. I was wondering if there is a more efficient way of achieving the same result.
# Create your example data
myarray <- c("AA\tThis is ", "\tthe ", "\tbeginning", "BB\tA string of ", "\tcharacters.", "CC\tA short line.", "DD\tThe ", "\tend")
# Find where each "sentence" starts based on detecting
# that the first character isn't \t
starts <- grepl("^[^\t]", myarray)
# Create a grouping variable
id <- cumsum(starts)
# Remove the leading \t as that seems like what your example output wants
tmp <- sub("^\t", "", myarray)
# split into groups and paste the groups together
sapply(split(tmp, id), paste, collapse = "")
And running it we get
> sapply(split(tmp, id), paste, collapse = "")
1 2
"AA\tThis is the beginning" "BB\tA string of characters."
3 4
"CC\tA short line." "DD\tThe end"
Another option is to paste everything together, prefix AA, BB, etc. with an extra marker such as ##, and then strsplit on that marker:
#Data
myarray <- c("AA\tThis is ", "\tthe ", "\tbeginning", "BB\tA string of ",
"\tcharacters.", "CC\tA short line.", "DD\tThe ", "\tend")
strsplit(gsub("([A-Z]{2})","##\\1",
paste(sub("^\t","", myarray), collapse = "")),"##")[[1]][-1]
# [1] "AA\tThis is the beginning"
# [2] "BB\tA string of characters."
# [3] "CC\tA short line."
# [4] "DD\tThe end"
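The cumsum/split idea can also be wrapped in a small helper. This is a sketch only: the function name join_continuations is hypothetical, and it assumes that every continuation line starts with a tab, as in the question's data:

```r
join_continuations <- function(lines) {
  # A new record starts wherever the line does not begin with a tab
  id <- cumsum(grepl("^[^\t]", lines))
  pieces <- split(sub("^\t", "", lines), id)
  # vapply guarantees a character vector; unname drops the group labels
  unname(vapply(pieces, paste, character(1), collapse = ""))
}

myarray <- c("AA\tThis is ", "\tthe ", "\tbeginning.", "BB\tA string of ",
             "\tcharacters.", "CC\tA short line.", "DD\tThe ", "\tend.")
result <- join_continuations(myarray)
result
```

Unlike the ## marker approach, this does not depend on the record labels being two capital letters.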

R: Explode string but keep quoted text as a single word

I encountered this question:
PHP explode the string, but treat words in quotes as a single word
and similar questions dealing with using a regex to explode words in a sentence, separated by a space, but keeping quoted text intact (as a single word).
I would like to do the same in R. I have attempted to copy-paste the regular expression into stri_split in the stringi package as well as strsplit in base R, but, as I suspected, the regular expression uses a format R does not recognize. The error is:
Error: '\S' is an unrecognized escape in character string...
The desired output would be:
mystr <- '"preceded by itself in quotation marks forms a complete sentence" preceded by itself in quotation marks forms a complete sentence'
myfoo(mystr)
[1] "preceded by itself in quotation marks forms a complete sentence" "preceded" "by" "itself" "in" "quotation" "marks" "forms" "a" "complete" "sentence"
Trying: strsplit(mystr, '/"(?:\\\\.|(?!").)*%22|\\S+/') gives:
Error in strsplit(mystr, "/\"(?:\\\\.|(?!\").)*%22|\\S+/") :
invalid regular expression '/"(?:\\.|(?!").)*%22|\S+/', reason 'Invalid regexp'
A simple option would be to use scan:
> x <- scan(what = "", text = mystr)
Read 11 items
> x
[1] "preceded by itself in quotation marks forms a complete sentence"
[2] "preceded"
[3] "by"
[4] "itself"
[5] "in"
[6] "quotation"
[7] "marks"
[8] "forms"
[9] "a"
[10] "complete"
[11] "sentence"
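A regex route is also possible in R. The errors in the question have two likely causes: in an R string literal the backslash has to be doubled ("\\S"), and the leading and trailing / are PHP delimiters rather than part of the pattern. Note also that strsplit splits on the pattern, whereas here the tokens themselves should be extracted. A minimal sketch with a simplified pattern (not the PHP one), assuming the quoted text contains no escaped quotes:

```r
mystr <- '"preceded by itself in quotation marks forms a complete sentence" preceded by itself in quotation marks forms a complete sentence'

# Match either a double-quoted run or a run of non-whitespace characters.
# Note the doubled backslash: "\\S" in an R string is the regex \S.
tokens <- regmatches(mystr, gregexpr('"[^"]*"|\\S+', mystr, perl = TRUE))[[1]]

# Drop the surrounding quotes from quoted tokens
tokens <- gsub('^"|"$', "", tokens)
tokens
```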

removing everything after first 'backslash' in a string

I have a vector like below
vec <- c("abc\edw\www", "nmn\ggg", "rer\qqq\fdf"......)
I want to remove everything from the first backslash onward, like below
newvec <- c("abc","nmn","rer")
Thank you.
My original vector is as below (only the head)
[1] "peoria ave\nste \npeoria"
[2] "wood dr\nphoenix"
[3] "central ave\nphoenix"
[4] "southern ave\nphoenix"
[5] "happy valley rd\nste \nglendaleaz "
[6] "the americana at brand\n americana way\nglendale"
Here the problem is that my original CSV file does not contain backslashes, but when I read it, backslashes appear. The original CSV file is as below
[1] "peoria ave
nste
npeoria"
[2] "wood dr
nphoenix"
As you can see, they are actually separated by "ENTER" (line breaks), but when I read the file into R using read.csv(), they are replaced by backslashes.
Another solution:
sub("\\\\.*", "", x)
vec <- c("abc\\edw\\www", "nmn\\ggg", "rer\\qqq\\fdf")
sub("([^\\\\])\\\\.*","\\1", vec)
[1] "abc" "nmn" "rer"
strsplit(vec, "\\\\") should do the job.
To select the first element, use [[1]][1]; for the second, [[1]][2].
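One caveat about the real data shown in the question: the "\n" that R prints is its escape for a newline character, not a literal backslash followed by n, so patterns that look for a backslash will not match it. A minimal sketch, assuming the goal is to keep only the text before the first line break (first_line is an illustrative name):

```r
vec <- c("peoria ave\nste \npeoria", "wood dr\nphoenix", "central ave\nphoenix")

# Split on the literal newline character and keep the first piece
first_line <- vapply(strsplit(vec, "\n", fixed = TRUE), "[", character(1), 1)
first_line
```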

Resources