I have the following character vector in R. Is there a way to extract only the text that comes after [SQ]?
Input
df # df is a character
[1] "[Mi][OD][SQ]Nice message1."
[2] "[Mi][OD][SQ]Nice message2."
[3] "[RO] ERROR: Could not SQLExecDirect 'SELECT * FROM "
Expected output
df
[1] Nice message1. Nice message2
In case there are more [SQ] like below
df # df is a character
[1] "[Mi][OD][SQ]Nice message1."
[2] "[Mi][OD][SQ]Nice message2."
[3] "[RO] ERROR: Could not SQLExecDirect 'SELECT * FROM "
[4] "[Mi][OD][SQ]Nice message3."
Expected output
df
[1] Nice message1. Nice message2. Nice message3
One option is to use str_extract to extract the substring, then wrap it in na.omit to drop the NA elements that occur when a string has no match. Here, a regex lookbehind, (?<=\\[SQ\\]), checks that the pattern [SQ] precedes the match, and .* extracts the characters that follow it
library(stringr)
as.vector(na.omit(str_extract(df, "(?<=\\[SQ\\]).*")))
#[1] "Nice message1." "Nice message2." "Nice message3."
If it needs to be a single string, use str_c to collapse the strings. Each extracted message already ends with a period, so a single space is enough as the separator
str_c(na.omit(str_extract(df, "(?<=\\[SQ\\]).*")), collapse = ' ')
#[1] "Nice message1. Nice message2. Nice message3."
data
df <- c("[Mi][OD][SQ]Nice message1.", "[Mi][OD][SQ]Nice message2.",
"[RO] ERROR: Could not SQLExecDirect 'SELECT * FROM ", "[Mi][OD][SQ]Nice message3."
)
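For comparison, here is a minimal base R sketch of the same idea that does not need stringr. It assumes the tag is always the literal text [SQ] and that everything after the last [SQ] in an element is the message; the names has_sq, msgs and single are only illustrative.

```r
# Example data copied from the question
df <- c("[Mi][OD][SQ]Nice message1.", "[Mi][OD][SQ]Nice message2.",
        "[RO] ERROR: Could not SQLExecDirect 'SELECT * FROM ",
        "[Mi][OD][SQ]Nice message3.")

# Keep only the elements that contain the literal tag [SQ]
has_sq <- grepl("[SQ]", df, fixed = TRUE)

# Drop everything up to and including the last [SQ] in each kept element
msgs <- sub("^.*\\[SQ\\]", "", df[has_sq])

# Collapse the messages into a single string
single <- paste(msgs, collapse = " ")

print(msgs)
print(single)
```

Filtering with grepl first avoids the NA handling that str_extract needs for the non-matching element.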
Related
I want to change the rownames of cov_stats, such that it contains a substring of the FileName column values. I only want to retain the string that begins with "SRR" followed by 8 digits (e.g., SRR18826803).
cov_list <- list.files(path="./stats/", full.names=T)
cov_stats <- rbindlist(sapply(cov_list, fread, simplify=F), use.names=T, idcol="FileName")
rownames(cov_stats) <- gsub("^\.\/\SRR*_\stats.\txt", "SRR*", cov_stats[["FileName"]])
Second attempt
rownames(cov_stats) <- gsub("^SRR[:digit:]*", "", cov_stats[["FileName"]])
Original strings
> cov_stats[["FileName"]]
[1] "./stats/SRR18826803_stats.txt" "./stats/SRR18826804_stats.txt"
[3] "./stats/SRR18826805_stats.txt" "./stats/SRR18826806_stats.txt"
[5] "./stats/SRR18826807_stats.txt" "./stats/SRR18826808_stats.txt"
Desired substring output
[1] "SRR18826803" "SRR18826804"
[3] "SRR18826805" "SRR18826806"
[5] "SRR18826807" "SRR18826808"
Would this work for you?
library(stringr)
stringr::str_extract(cov_stats[["FileName"]], "SRR.{0,8}")
You can use
rownames(cov_stats) <- sub("^\\./stats/(SRR\\d{8}).*", "\\1", cov_stats[["FileName"]])
Details:
^ - start of string
\./stats/ - ./stats/ string
(SRR\d{8}) - Group 1 (\1): SRR string and then eight digits
.* - the rest of the string till its end.
Note that sub is used (not gsub) because there is only one expected replacement operation in the input string (since the regex matches the whole string).
A reproducible example:
cov_stats <- c("./stats/SRR18826803_stats.txt", "./stats/SRR18826804_stats.txt", "./stats/SRR18826805_stats.txt", "./stats/SRR18826806_stats.txt", "./stats/SRR18826807_stats.txt")
sub("^\\./stats/(SRR\\d{8}).*", "\\1", cov_stats)
## => [1] "SRR18826803" "SRR18826804" "SRR18826805" "SRR18826806" "SRR18826807"
An equivalent extraction stringr approach:
library(stringr)
rownames(cov_stats) <- str_extract(cov_stats[["FileName"]], "SRR\\d{8}")
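As a side note on why the two attempts in the question fail: inside an R string literal, "\." and "\/" are not valid escapes (a regex backslash has to be written as "\\"), and the POSIX class [:digit:] only works inside a bracket expression, i.e. [[:digit:]]. A small base R sketch of the extraction, using a shortened copy of the file names from the question (the names files and ids are only illustrative):

```r
# Two of the file names from the question
files <- c("./stats/SRR18826803_stats.txt", "./stats/SRR18826804_stats.txt")

# Locate "SRR" followed by exactly eight digits in each string
m <- regexpr("SRR[[:digit:]]{8}", files)

# Pull out the matched substrings
ids <- regmatches(files, m)
print(ids)
```

Note that regmatches silently drops elements without a match, so the result can be shorter than the input; str_extract returns NA in that case instead.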
I need to extract a string that spans across multiple lines on an object.
The object:
> text <- paste("abc \nd \ne")
> cat(text)
abc
d
e
With str_extract_all I can extract all the text between ‘a’ and ‘c’, for example.
> str_extract_all(text, "a.*c")
[[1]]
[1] "abc"
Using the function ‘regex’ and the argument ‘multiline’ set to TRUE, I can extract a string across multiple lines. In this case, I can extract the first character of multiple lines.
> str_extract_all(text, regex("^."))
[[1]]
[1] "a"
> str_extract_all(text, regex("^.", multiline = TRUE))
[[1]]
[1] "a" "d" "e"
But when I try to extract "every character between a and d" (a regex that spans across multiple lines), the output is "character(0)".
> str_extract_all(text, regex("a.*d", multiline = TRUE))
[[1]]
character(0)
The desired output is:
“abcd”
How to get it with stringr?
dplyr:
library(dplyr)
library(stringr)
data.frame(text) %>%
mutate(new = lapply(str_extract_all(text, "(?!e)\\w"), paste0, collapse = ""))
text new
1 abc \nd \ne abcd
Here we use the character class \\w, which does not include the new line metacharacter \n. The negative lookahead (?!e) makes sure the e is not matched.
Without dplyr (still using stringr):
unlist(lapply(str_extract_all(text, "(?!e)\\w"), paste0, collapse = ""))
[1] "abcd"
str_remove_all(text,"\\s\\ne?")
[1] "abcd"
OR
paste0(trimws(strsplit(text, "\\ne?")[[1]]), collapse="")
[1] "abcd"
The answers above remove line breaks. So a two-step approach can produce the desired output 'abcd'.
1 - Use str_remove_all or gsub to remove the line breaks (in this case, also removing blank spaces).
2 - Use str_extract_all to get the desired output ('abcd' in this case).
> text %>%
+ str_remove_all("\\s\\n") %>%
+ str_extract_all("a.*d")
[[1]]
[1] "abcd"
Short regex reference:
\n - newline (line feed)
\s - any whitespace
\r - carriage return
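The reason "a.*d" returns character(0) in the question is that . does not match a newline by default, and multiline = TRUE only changes where ^ and $ can match. Below is a hedged base R sketch that lets the pattern span lines with the inline (?s) flag (PCRE, hence perl = TRUE) and then strips the whitespace; the names spanning and result are only illustrative.

```r
# Same object as in the question
text <- "abc \nd \ne"

# (?s) makes '.' match newlines too, so the match can cross line breaks
spanning <- regmatches(text, regexpr("(?s)a.*d", text, perl = TRUE))

# The raw match still contains the space and the newline; remove all whitespace
result <- gsub("[[:space:]]", "", spanning)
print(result)
```

In stringr, the corresponding switch should be regex("a.*d", dotall = TRUE); the whitespace still has to be removed afterwards to get "abcd".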
Update:
In base R to get the desired output abcd:
text <- gsub("[\r\n]|[[:blank:]]", "", text)
substr(text,1, nchar(text)-1)
[1] "abcd"
First answer:
We can use gsub:
gsub("[\r\n]|[[:blank:]]", "", text)
[1] "abcde"
I have a vector like this:
> myarray
[1] "AA\tThis is "
[2] "\tthe "
[3] "\tbeginning."
[4] "BB\tA string of "
[5] "\tcharacters."
[6] "CC\tA short line."
[7] "DD\tThe "
[8] "\tend."
I am trying to write a function that processes the above to generate this:
> myoutput
[1] "AA\tThis is the beginning."
[2] "BB\tA string of characters."
[3] "CC\tA short line."
[4] "DD\tThe end."
This is doable by looping through the rows and using an if statement to concatenate the current row with the last one if it starts with a \t. I was wondering if there is a more efficient way of achieving the same result.
# Create your example data
myarray <- c("AA\tThis is ", "\tthe ", "\tbeginning.", "BB\tA string of ", "\tcharacters.", "CC\tA short line.", "DD\tThe ", "\tend.")
# Find where each "sentence" starts based on detecting
# that the first character isn't \t
starts <- grepl("^[^\t]", myarray)
# Create a grouping variable
id <- cumsum(starts)
# Remove the leading \t as that seems like what your example output wants
tmp <- sub("^\t", "", myarray)
# Split into groups and paste each group together
sapply(split(tmp, id), paste, collapse = "")
And running it we get
> sapply(split(tmp, id), paste, collapse = "")
1 2
"AA\tThis is the beginning." "BB\tA string of characters."
3 4
"CC\tA short line." "DD\tThe end."
Another option is to paste everything into one string, prefix each two-letter label (AA, BB, etc.) with a marker such as ##, and then strsplit on that marker. This assumes that two consecutive capital letters never occur inside the text itself:
#Data
myarray <- c("AA\tThis is ", "\tthe ", "\tbeginning.", "BB\tA string of ",
"\tcharacters.", "CC\tA short line.", "DD\tThe ", "\tend.")
strsplit(gsub("([A-Z]{2})","##\\1",
paste(sub("^\t","", myarray), collapse = "")),"##")[[1]][-1]
# [1] "AA\tThis is the beginning."
# [2] "BB\tA string of characters."
# [3] "CC\tA short line."
# [4] "DD\tThe end."
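Since the question asks for a function, the cumsum grouping idea above can be wrapped up as follows. This is only a sketch: the name join_continuations is made up for illustration, and it assumes a continuation line is exactly one that starts with a tab.

```r
# Join continuation lines (those starting with a tab) onto the line before them
join_continuations <- function(x) {
  starts <- !startsWith(x, "\t")      # TRUE where a new record begins
  pieces <- sub("^\t", "", x)         # drop the leading tab of continuation lines
  groups <- split(pieces, cumsum(starts))
  # vapply guarantees one string per group; unname drops the group labels
  unname(vapply(groups, paste, character(1), collapse = ""))
}

myarray <- c("AA\tThis is ", "\tthe ", "\tbeginning.", "BB\tA string of ",
             "\tcharacters.", "CC\tA short line.", "DD\tThe ", "\tend.")
myoutput <- join_continuations(myarray)
print(myoutput)
```

Everything here is vectorised, so there is no explicit loop over rows; vapply is used instead of sapply only to make the return type predictable.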
I encountered this question:
PHP explode the string, but treat words in quotes as a single word
and similar questions that use a regex to split a sentence into words on spaces while keeping quoted text intact (as a single word).
I would like to do the same in R. I have attempted to copy-paste the regular expression into stri_split in the stringi package as well as strsplit in base R but, as I suspected, the regular expression uses a format R does not recognize. The error is:
Error: '\S' is an unrecognized escape in character string...
The desired output would be:
mystr <- '"preceded by itself in quotation marks forms a complete sentence" preceded by itself in quotation marks forms a complete sentence'
myfoo(mystr)
[1] "preceded by itself in quotation marks forms a complete sentence" "preceded" "by" "itself" "in" "quotation" "marks" "forms" "a" "complete" "sentence"
Trying: strsplit(mystr, '/"(?:\\\\.|(?!").)*%22|\\S+/') gives:
Error in strsplit(mystr, "/\"(?:\\\\.|(?!\").)*%22|\\S+/") :
invalid regular expression '/"(?:\\.|(?!").)*%22|\S+/', reason 'Invalid regexp'
A simple option would be to use scan:
> x <- scan(what = "", text = mystr)
Read 11 items
> x
[1] "preceded by itself in quotation marks forms a complete sentence"
[2] "preceded"
[3] "by"
[4] "itself"
[5] "in"
[6] "quotation"
[7] "marks"
[8] "forms"
[9] "a"
[10] "complete"
[11] "sentence"
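If a regex is preferred over scan, the pattern has to be adapted for R: the PHP delimiters (the leading and trailing /) are dropped, and every regex backslash is doubled inside the R string, which is what the '\S' error is about. Below is a simplified sketch that matches tokens instead of splitting; the function name myfoo is taken from the question, and the pattern deliberately ignores escaped quotes inside quoted text.

```r
# Return quoted phrases as single tokens and everything else split on whitespace
myfoo <- function(x) {
  # Either a double-quoted run without inner quotes, or a run of non-whitespace
  tokens <- regmatches(x, gregexpr('"[^"]*"|\\S+', x, perl = TRUE))[[1]]
  # Strip the surrounding double quotes from quoted tokens
  gsub('^"|"$', "", tokens)
}

mystr <- '"preceded by itself in quotation marks forms a complete sentence" preceded by itself in quotation marks forms a complete sentence'
out <- myfoo(mystr)
print(out)
```

Matching the tokens with gregexpr and regmatches sidesteps strsplit, which can only describe the separators and not the pieces to keep.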
Q: How can I replace underscores "_" with backslash-underscores "\_" in an R string? I'd prefer to use the stringr package.
Also, can anyone explain why line 5 below fails to get the desired result? I was almost certain that would work.
library(stringr)
s <- "foo_bar_baz"
str_replace_all(s, "_", 5) # [1] "foo5bar5baz"
str_replace_all(s, "_", "\_") # Error: '\_' is an unrecognized escape in character string starting ""\_"
str_replace_all(s, "_", "\\_") # [1] "foo_bar_baz"
str_replace_all(s, "_", "\\\_") # Error: '\_' is an unrecognized escape in character string starting ""\\\_"
str_replace_all(s, "_", "\\\\_") # [1] "foo\\_bar\\_baz"
Context: I'm making a LaTeX table using xtable and need to sanitize my column names since they all have underscores and break LaTeX.
It is much easier than that: replace a literal string with a literal string using fixed("_"); no regex is needed.
> library(stringr)
> s <- "foo_bar_baz"
> str_replace_all(s, fixed("_"), "\\_")
[1] "foo\\_bar\\_baz"
And if you use cat:
> cat(str_replace_all(s, fixed("_"), "\\_"))
foo\_bar\_baz>
You will see that the result actually contains a single backslash before each underscore; the doubled backslash only appears when R prints the string.
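The same distinction can be checked in base R. In this sketch (variable names are only illustrative), the fixed = TRUE call treats the replacement literally, while the regex call needs the backslash escaped once more because a regex replacement string uses the backslash as its own escape character. That is also why "\\_" in the question gives back plain underscores.

```r
s <- "foo_bar_baz"

# Literal replacement: "\\_" is one backslash plus an underscore
literal <- gsub("_", "\\_", s, fixed = TRUE)

# Regex replacement: the backslash itself must be escaped, hence four in the source
regex_version <- gsub("_", "\\\\_", s)

# cat shows the real characters; print would show escaped backslashes
cat(literal, "\n", sep = "")
print(nchar(literal))
```

Counting characters with nchar is a quick way to confirm that only one backslash was added per underscore: the 11-character input grows by two.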