split strings by pattern without deleting pattern strings - r

I have a pattern consisting of "p" followed by one or more "r", e.g. pr, prr, pr...r. I would like to split a string into the non-pattern pieces and ALL pattern pieces, without deleting the pattern. strsplit() splits correctly but deletes every pr..r, while stringr::str_extract_all() extracts the pattern matches but drops the non-pattern pieces.
Is there a way to simply keep all the pieces while singling out the pattern matches?
x<-c("zprzzzprrrrrzpzr")
"z" "pr" "zzz" "prrrrr" "zpzr" # desired output; keep original character order

This is a bit hacky, but you can do one replacement to mark off the values you want with a separator character and then split on that separator. For example:
unlist(strsplit(gsub("(pr+)","~\\1~", x), "~"))
# [1] "z" "pr" "zzz" "prrrrr" "zpzr"
which will work fine if you don't have "~" in your string.
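If a safe separator character cannot be guaranteed, a base-R alternative (just a sketch; invert = NA needs a reasonably recent R) asks regmatches() for the matched and unmatched pieces in one go:
regmatches(x, gregexpr("pr+", x), invert = NA)[[1]]
# [1] "z"      "pr"     "zzz"    "prrrrr" "zpzr"
Note that empty strings appear at the edges if the string starts or ends with a match.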

Here is a way using stringr. I would hope there is a way to make this a bit more concise.
Locate the pattern with str_locate_all()
Add one to each end position, so the ends become the starts of the pieces that follow
Add the start of the string (1) and the position just past its end (nchar(x) + 1), giving the full set of split points
Use the vectorized str_sub() to extract the pieces between consecutive split points
library(stringr)
x <- c("zprzzzprrrrrzpzr")
locs <- str_locate_all(x, "(pr+)")[[1]]
locs[,2] <- locs[,2] + 1
locs_all <- sort(c(1, locs, nchar(x) + 1))
str_sub(x, head(locs_all, -1), tail(locs_all, -1) - 1)
# [1] "z"      "pr"     "zzz"    "prrrrr" "zpzr"


Trimming String up to a Specific Character in R

I have a vector of strings in R that represent NCAA basketball players' names.
The names are displayed as below:
(Justin Ahrens\justin-ahrens-1
Kyle Ahrens\kyle-ahrens-1
...
Zavier Simpson\zavier-simpson-1)
How can I trim these names so I get "Justin Ahrens" and "Zavier Simpson" rather than what I have now?
I am assuming that different rows in your example correspond to different elements in the vector of strings. Then you can apply the following:
string_vector <- c("(Justin Ahrens\\justin-ahrens-1",
"Kyle Ahrens\\kyle-ahrens-1",
"Zavier Simpson\\zavier-simpson-1)")
gsub("\\(|\\)|\\\\.+$", "", string_vector)
Here you are getting rid of any parentheses (\\(|\\)) and of everything from the backslash to the end of the string, backslash included (\\\\.+$).
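Applied to string_vector, this should leave just the display names:
# [1] "Justin Ahrens"  "Kyle Ahrens"    "Zavier Simpson"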
It depends how your names are presented in the data, so I will offer a couple of possible solutions using stringr
library(stringr)
library(magrittr)   # for the %>% pipe used below
## Case 1: names are vectorised
players1<- c("Justin Ahrens\\justin-ahrens-1",
"Kyle Ahrens\\kyle-ahrens-1",
"Zavier Simpson\\zavier-simpson-1")
# Use str_extract
players1 %>% str_extract("[^\\\\]+")
## Case 2: one long character string containing multiple names and newlines
players2 <- ("(Justin Ahrens\\justin-ahrens-1\nKyle Ahrens\\kyle-ahrens-1\nZavier Simpson\\zavier-simpson-1)")
# Split the string on newlines (alternatively on "1" if there are no newline characters),
# then map str_extract over the pieces
players2 %>%
  str_split("\n") %>%
  purrr::map(~ str_extract(.x, "[^(\\\\]+")) %>%
  unlist()
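Both cases should come back as the same cleaned character vector:
# [1] "Justin Ahrens"  "Kyle Ahrens"    "Zavier Simpson"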

Extract text in two columns from a string

I have a table where one column has data like this:
table$test_string<- "[projectname](https://somewebsite.com/projectname/Abc/xyz-09)"
1.) I am trying to extract the first part of this string, the part inside the square brackets, into one column, i.e.
table$project_name <- "projectname"
using the regex:
project_name <- "^\\[|(?:[a-zA-Z]|[0-9])+|\\]$"
table$project_name <- str_extract(table$test_string, project_name)
If I test the regex on a single value (one row) of the table, the above regex works when I use
str_extract_all(table$test_string, project_name)[[1]][2].
However, I get NA when I apply the regex pattern to the whole table, and an error if I use str_extract_all.
2.) The second part of the string is a URL that should go into another column:
table$url_link <- "https://somewebsite.com/projectname/Abc/xyz-09"
I am using the following regular expression for the URL:
url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
table$url_link <- str_extract(table$test_string, url_pattern)
and this works on the whole table; however, I still get the trailing parenthesis ')' in the url link.
What am I missing here? Why does the first regex work on a single value but not on the whole table?
And for the URL, how do I avoid the trailing parenthesis?
It feels like you could simplify things considerably by using parentheses as capture groups. For example:
test_string<- "[projectname](https://somewebsite.com/projectname/Abc/xyz-09)"
regex <- "\\[(.*)\\]\\((.*)\\)"
gsub(regex, "\\1", test_string)
#> [1] "projectname"
gsub(regex, "\\2", test_string)
#> [1] "https://somewebsite.com/projectname/Abc/xyz-09"
We can make use of convenient functions from qdapRegex
library(qdapRegex)
rm_round(test_string, extract = TRUE)[[1]]
#[1] "https://somewebsite.com/projectname/Abc/xyz-09"
rm_square(test_string, extract = TRUE)[[1]]
#[1] "projectname"
data
test_string<- "[projectname](https://somewebsite.com/projectname/Abc/xyz-09)"

R, str_replace, gsub, how to replace a vector of characters for another vector of characters?

I'm trying to "delete" specific characters from strings in several rows.
I was able to extract the specific characters I want to "delete" from the column, but I'm not able to replace them all with "".
I've tried some options with mapvalues, gsub and str_replace, but I haven't had any luck.
#Example data
test_col<-data.frame(sequence=c("ATGCRYSW\n",
"ATGCRYSW\\n",
"ATGCRYSW\r\n",
"ATGCRYSW\r\nATGCRYSW",
"ATGCRYSW"),
stringsAsFactors = FALSE)
#vector of allowed characters in strings
permitted_seq_chars<-c("A","C","G","T","R","Y","S","W","K",
"M","B","D","H","V","N","+","-","X")
#get all the unique characters in column of interest
all_unique_source_seq_chars<-unique(unlist(strsplit(test_col[["sequence"]],
split ="")))
#subset invalid characters
all_unique_source_seq_invalid_chars<-setdiff(all_unique_source_seq_chars,
permitted_seq_chars )
# 'Delete' invalid characters one by one. So far this is the only way I've been
# able to do it, but I would like to not depend on fixed values if new ones
# arise in the future
str_replace_all(test_col$sequence, c("\n" = "",
                                     "\\" = "",
                                     "n" = ""))
Is there any way to do this recursively, just looking at all_unique_source_seq_invalid_chars?
An option would be to paste the permitted characters into a negated character class and replace everything outside it with blank ("") in gsub. Most regex metacharacters are treated literally inside a character class, but "-" has to sit at the start or end of the class so it is not read as a range:
pat <- paste0("[^", gsub("\\s{2,}", " ", paste(permitted_seq_chars, collapse="")), "]")
gsub(pat, "", test_col$sequence)
#[1] "ATGCRYSW" "ATGCRYSW" "ATGCRYSW"
#[4] "ATGCRYSWATGCRYSW" "ATGCRYSW"

What is the best way in R to identify the first character in a string?

I am trying to find a way to loop through some data in R that contains both numbers and letters and, once the first letter is found, return everything from that point on. For example:
column
000HU89
87YU899
902JUK8
result
HU89
YU899
JUK8
I have tried stringr::str_detect / grepl, but the first letter is by nature unknown, so I am having difficulty pulling it out.
We could use str_extract
stringr::str_extract(x, "[A-Z].*")
#[1] "HU89" "YU899" "JUK8"
data
x <- c("000HU89", "87YU899", "902JUK8")
Ronak's answer is simple.
Though I would also like to provide another method:
column <-c("000HU89", "87YU899" ,"902JUK8")
# Get First character
first<-c(strsplit(gsub("[[:digit:]]","",column),""))[[1]][1]
# Find the location of first character
loc<-gregexpr(pattern =first,column)[[1]][1]
# Extract everything from that chacracter to the right
substring(column, loc, last = 1000000L)
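A more compact variant of the same idea (just a sketch): let a character class locate the first letter directly, then take the substring from there.
pos <- regexpr("[A-Za-z]", column)
substring(column, pos)
# [1] "HU89"  "YU899" "JUK8"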
We can use sub from base R to match one or more digits (\\d+) at the start (^) of the string and replace with blank ("")
sub("^\\d+", "", x)
#[1] "HU89" "YU899" "JUK8"
data
x <- c("000HU89", "87YU899", "902JUK8")
In base R we can do
x <- c("000HU89", "87YU899", "902JUK8")
regmatches(x, regexpr("\\D.+", x))
# [1] "HU89" "YU899" "JUK8"

gsubfn on data frame

Search-and-replace an element in a data frame given a list of replacements.
Code:
testing123tmp <- data.frame(x=c("it's", "not", "working"))
testing123tmp$x <- as.character(testing123tmp$x)
tmp <- list("it's" = "hey", "working"="dead")
apply(testing123tmp,2,function(x) gsubfn('.', tmp, x))
Expected Output:
x
[1,] hey
[2,] not
[3,] dead
My current output:
x
[1,] "it's"
[2,] "not"
[3,] "working"
I've been looking around for possible solutions with chartr and gsub, but would like to keep the code short, since multiple gsub calls would otherwise be required. Also, my variable tmp can scale to many replacement pairs, such as:
tmp <- list("it's" = "hey",
            "working" = "dead",
            "other" = "other1",
            .. = .. ,
            .. = .. ,
            .. = .. )
Edit/Update #1:
I would also like a solution that uses gsubfn as above and keeps the result in a data frame.
The issues are these:
The dot matches only one character, so it will never match an entire string unless that string is a single character; therefore no name in tmp will ever be matched. Use ".*" to match the entire string. If you wanted to match words instead, i.e. if a component of x could contain several whitespace-separated words (say "it's not") and we still wanted to match it's, then use "\\S+" (a sketch of this variant follows at the end of this answer). There are other variations one could imagine as well, and this gives a framework that encompasses many of them.
The third argument to gsubfn can already be a vector and gsubfn will iterate over it, so apply is not necessary. (It will still work with apply, it is just unneeded.)
To keep everything in a data frame, one easy way is to use transform as shown below (or alternately transform2, also in the gsubfn package). The x will automatically refer to the x column of the testing123tmp data frame, and transform will produce a new data frame rather than overwriting the original. If you want to keep them separate, assign the result of transform to a new name; if you want to overwrite testing123tmp, assign it back to testing123tmp.
We can use stringsAsFactors = FALSE to avoid generating factor columns:
testing123tmp <- data.frame(x=c("it's", "not", "working"), stringsAsFactors = FALSE)
Thus we can reduce the code to:
transform(testing123tmp, y = gsubfn(".*", tmp, x))
giving the following data.frame:
        x    y
1    it's  hey
2     not  not
3 working dead
If we wanted to overwrite the x column rather than keep separate input and output columns we could have used x = ... in the transform statement instead of y = ... .
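For the word-level variant mentioned in the first point above, the same call with the "\\S+" pattern should replace each whitespace-separated word that has an entry in tmp and leave the others untouched (a sketch with a hypothetical two-word value, not from the original answer):
gsubfn("\\S+", tmp, c("it's not", "working"))
# expected: [1] "hey not" "dead"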
You may write
gsubfn(".*", tmp, testing123tmp$x)
# [1] "hey" "not" "dead"
and then
testing123tmp$x <- gsubfn(".*", tmp, testing123tmp$x)
As for your approach, there was no need for apply, since gsubfn is vectorized over that argument; the real problem was the pattern ., which matches only one character, while it's and working are longer than one character.
However, if you are replacing one word with another word, then there is no need for regex. For instance,
idx <- testing123tmp$x %in% names(tmp)
testing123tmp$x[idx] <- unlist(tmp)[testing123tmp$x[idx]]
should work faster. If the task is more involved, then I guess
library(stringr)
str_replace_all(testing123tmp$x, unlist(tmp))
# [1] "hey" "not" "dead"
should be more robust than gsubfn as you don't need to deal with patterns like .*.
