add space in string when meeting a given pattern - r

I have a string as follows:
a<-c("AbcDef(123)")
> a
[1] "AbcDef(123)"
Is there any efficient way to transform it as
a<-c("Abc Def (123)")
In other words, I would like to add a space when meeting a upper case or a special character ( .

one possibility :
gsub("(?<=[^A-Z(])(?=[A-Z(])", " ", a, perl=T)

Mine's a bit kludgy and uses two gsubs. The inner gsub adds spaces, the outer gsub removes the leading whitespace.
a <- "AbcDef(123)"
gsub("^\\s", "", gsub("([A-Z(])", " \\1", a))

Try this:
gsub("(?<=.)([A-Z(])", " \\1", a, perl = TRUE)
giving:
[1] "Abc Def (123)"
If the string with spaces has no one-character pieces it can be simplified to this:
gsub("(.)([A-Z(])", "\\1 \\2", a)

Related

Delete string parts within delimiter

I have a string as"dfgdf" sa"2323":
a <- "as\"dfgdf\" sa\"2323\""
The delimiter (same for the start and the end) here is ". So what I want is to get a string were everything is deleted within delimiter but not delimiter itself. So the end result string should look like as"" sa""
You could match " and forget what is matched using \K
Then use a negated character class matching any char except " or a whitespace character and use lookarounds to assert " to the right.
Use perl=TRUE to enable Perl-like regular expressions.
a <- "as\"dfgdf\" sa\"2323\""
gsub('"\\K[^"\\s]+(?=")', "", a, perl=TRUE)
Output
[1] "as\"\" sa\"\""
R demo
Here is another base R option using paste0 + strsplit
s <- paste0(paste0(unlist(strsplit(a, '"\\w+"')), '""'), collapse = "")
which gives
> s
[1] "as\"\" sa\"\""
> cat(s)
as"" sa""
Here is one option with a regex lookaround to match a word (\\w+) that succeeds a double quote and precedes one as pattern and is replaced by blank ("")
cat(gsub('(?<=")\\w+(?=")', "", a, perl = TRUE), "\n")
#as"" sa""
Or without regex lookaround
cat(gsub('"\\w+"', '""', a), "\n")
#as"" sa""
I also found a way with stringr library:
library(stringr)
a <- "as\"dfgdf\" sa\"2323\""
result <- str_replace_all(a, "\".*?\"", "\"\"")
cat(result)

How to insert a white space before open bracket

I have a string 3.4(2.5-4.7), I want to insert a white space before the open bracket "(" so that the string becomes 3.4 (2.5-4.7).
Any idea how this could be done in R?
x <- "3.4(2.5-4.7)"
sub("(.*)(?=\\()", "\\1 ", x, perl = T)
[1] "3.4 (2.5-4.7)"
This regex is based on lookahead: it creates one capturing group subsuming everything up until the lookahead, namely, the opening parenthesis (?=\\()), recalls it and inserts one whitespace after it in the replacement argument to sub (which is enough unless you have more than one such substitution per string, in which case gsubis needed). The argument perl = Tneeds to be added to enable the lookahead.
EDIT:
If you have a string like this:
x <- "3.4(2.5to4.7)"
the regex gets slightly more complex; the underlying idea though remains the same: you divide the string into different captruing groups (...), which you then recall using appropriate backreference in the replacement argument while adding the sought spaces:
sub("(.*)(\\(\\d+\\.\\d+)(to)(\\d+\\.\\d+\\))", "\\1 \\2 \\3 \\4", x)
[1] "3.4 (2.5 to 4.7)"
EDIT2:
x <- '3.4(2.5,4.7)'
sub("(.*)(\\(\\d+\\.\\d+)(,)(\\d+\\.\\d+\\))", "\\1 \\2\\3 \\4", x)
[1] "3.4 (2.5, 4.7)"
EDIT3:
x <- '3(2,4)'
sub("(.*)(\\(\\d+)(,)(\\d+)", "\\1 \\2\\3 \\4", x)
A very short way uses sub, which will substitute the first open bracket ( with a space followed by an open bracket, i.e. (.
x <- '3.4(2.5-4.7)'
sub("\\(", " (", x)
# [1] "3.4 (2.5-4.7)"
Alternatively, you can specify the argument fixed = TRUE which considers the pattern as fixed and not as a regular expression.
x <- '3.4(2.5-4.7)'
sub("(", " (", x, fixed = TRUE)
# [1] "3.4 (2.5-4.7)"
Try
gsub('(.*)(\\(.*\\))', '\\1 \\2', '3.4(2.5-4.7)')
#[1] "3.4 (2.5-4.7)"
The way the regex works is that it creates two groups. The first group (.*) it takes all elements and the second group (\\(.*\\)) takes all elements after the parenthesis. Note that we need to escape the parenthesis so we use \\(. We then join those two groups with a space between them \\1 \\2

How to throw out spaces and underscores only from the beginning of the string?

I want to ignore the spaces and underscores in the beginning of a string in R.
I can write something like
txt <- gsub("^\\s+", "", txt)
txt <- gsub("^\\_+", "", txt)
But I think there could be an elegant solution
txt <- " 9PM 8-Oct-2014_0.335kwh "
txt <- gsub("^[\\s+|\\_+]", "", txt)
txt
The output should be "9PM 8-Oct-2014_0.335kwh ". But my code gives " 9PM 8-Oct-2014_0.335kwh ".
How can I fix it?
You could bundle the \s and the underscore only in a character class and use quantifier to repeat that 1+ times.
^[\s_]+
Regex demo
For example:
txt <- gsub("^[\\s_]+", "", txt, perl=TRUE)
Or as #Tim Biegeleisen points out in the comment, if only the first occurrence is being replaced you could use sub instead:
txt <- sub("[\\s_]+", "", txt, perl=TRUE)
Or using a POSIX character class
txt <- sub("[[:space:]_]+", "", txt)
More info about perl=TRUE and regular expressions used in R
R demo
The stringr packages offers some task specific functions with helpful names. In your original question you say you would like to remove whitespace and underscores from the start of your string, but in a comment you imply that you also wish to remove the same characters from the end of the same string. To that end, I'll include a few different options.
Given string s <- " \t_blah_ ", which contains whitespace (spaces and tabs) and underscores:
library(stringr)
# Remove whitespace and underscores at the start.
str_remove(s, "[\\s_]+")
# [1] "blah_ "
# Remove whitespace and underscores at the start and end.
str_remove_all(s, "[\\s_]+")
# [1] "blah"
In case you're looking to remove whitespace only – there are, after all, no underscores at the start or end of your example string – there are a couple of stringr functions that will help you keep things simple:
# `str_trim` trims whitespace (\s and \t) from either or both sides.
str_trim(s, side = "left")
# [1] "_blah_ "
str_trim(s, side = "right")
# [1] " \t_blah_"
str_trim(s, side = "both") # This is the default.
# [1] "_blah_"
# `str_squish` reduces repeated whitespace anywhere in string.
s <- " \t_blah blah_ "
str_squish(s)
# "_blah blah_"
The same pattern [\\s_]+ will also work in base R's sub or gsub, with some minor modifications, if that's your jam (see Thefourthbird`s answer).
You can use stringr as:
txt <- " 9PM 8-Oct-2014_0.335kwh "
library(stringr)
str_trim(txt)
[1] "9PM 8-Oct-2014_0.335kwh"
Or the trimws in Base R
trimws(txt)
[1] "9PM 8-Oct-2014_0.335kwh"

How to change the word separator character in a vector?

I have a character vector consisting of the following style:
mylist <- c('John Myer Stewert','Steve',' Michael Boris',' Daniel and Frieds','Michael-Myer')
I'm trying to create a character vector like this:
mylist <- c('John+Myer+Stewert','Steve',' Michael+Boris',' Daniel+and+Frieds','Michael+Myer')
I have tried:
test <- cat(paste(shQuote(mylist , type="cmd"), collapse="+"))
That seems wrong. How can I change the word separator in mylist as shown above?
You could use chartr(). Just re-use the + sign for both space and - characters.
chartr(" -", "++", trimws(mylist))
# [1] "John+Myer+Stewert" "Steve" "Michael+Boris"
# [4] "Daniel+and+Frieds" "Michael+Myer"
Note that I also trimmed the leading whitespace since there is really no need to keep it.
We can use gsub by matching the space (" ") as pattern and replace it with "+".
gsub(" ", "+", trimws(mylist))
#[1] "John+Myer+Stewert" "Steve" "Michael+Boris"
#[4] "Daniel+and+Frieds" "Michael-Myer"
I assumed that the leading spaces as typo. If it is not, we can either use regex lookarounds
gsub("(?<=[a-z])[ -](?=[[:alpha:]])", "+", mylist, perl = TRUE)
#[1] "John+Myer+Stewert" "Steve" " Michael+Boris"
#[4] " Daniel+and+Frieds" "Michael+Myer"
Or some PCRE regex
gsub("(^ | $)(*SKIP)(*F)|[ -]", "+", mylist, perl = TRUE)
#[1] "John+Myer+Stewert" "Steve" " Michael+Boris"
#[4] " Daniel+and+Frieds" "Michael+Myer"
You can use the package stringr.
library(stringr)
str_replace_all(trimws(mylist), "[ -]", "+")
#[1] "John+Myer+Stewert" "Steve" "Michael+Boris"
#[4] "Daniel+and+Frieds" "Michael+Myer"
Between [] we specify what we want to replace with +. In this case, that is a single white space and -. I used trimws from Akrun's answer to get rid of the extra white space in the beginning of some elements in your string.
This is yet another alternative.
library(stringi)
stri_replace_all_regex(trimws(mylist), "[ -]", "+")

Removing punctuation between two words

I have a data frame (df) and I would like to remove punctuation.
However there an issue with dot between 2 words and at the end of one word like this:
test.
test1.test2
I use this to remove the punctuation:
library(tm)
removePunctuation(df)
and the result I take is this:
test
test1test2
but I would like to take this as result:
test
test1 test2
How is it possible to have a space between two words in the removing process?
You can use chartr for single character substitution:
chartr(".", " ", c("test1.test2"))
# [1] "test1 test2"
#akrun suggested trimws to remove the space at the end of your test string:
str <- c("test.", "test1.test2")
trimws(chartr(".", " ", str))
# [1] "test" "test1 test2"
We can use gsub to replace the . with a white space and remove the trailing/leading spaces (if any) with trimws.
trimws(gsub('[.]', ' ', str1))
#[1] "test" "test1 test2"
NOTE: In regex, . by itself means any character. So we should either keep it inside square brackets[.]) or escape it (\\.) or with option fixed=TRUE
trimws(gsub('.', ' ', str1, fixed=TRUE))
data
str1 <- c("test.", "test1.test2")
you can also use strsplit:
a <- "test."
b <- "test1.test2"
do.call(paste, as.list(strsplit(a, "\\.")[[1]]))
[1] "test"
do.call(paste, as.list(strsplit(b, "\\.")[[1]]))
[1] "test1 test2"

Resources