Removing punctuation between two words - r

I have a data frame (df) and I would like to remove punctuation.
However, there is an issue with a dot between two words and at the end of a word, like this:
test.
test1.test2
I use this to remove the punctuation:
library(tm)
removePunctuation(df)
and the result I take is this:
test
test1test2
but I would like to take this as result:
test
test1 test2
How is it possible to have a space between two words in the removing process?

You can use chartr for single character substitution:
chartr(".", " ", c("test1.test2"))
# [1] "test1 test2"
@akrun suggested trimws to remove the space at the end of your test string:
str <- c("test.", "test1.test2")
trimws(chartr(".", " ", str))
# [1] "test" "test1 test2"
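Since the question mentions a data frame, the same idea can be applied column by column. This is only a minimal sketch: the data frame below and its column names (id, text) are made up for illustration, and it assumes the text lives in character columns.

```r
# Hypothetical data frame; the column names are assumptions for illustration.
df <- data.frame(id = 1:2,
                 text = c("test.", "test1.test2"),
                 stringsAsFactors = FALSE)

# Apply the substitution only to character columns, leaving the others untouched.
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], function(col) trimws(chartr(".", " ", col)))

df$text
# [1] "test"        "test1 test2"
```

Restricting the replacement to character columns avoids coercing numeric columns to text.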

We can use gsub to replace the . with a white space and remove the trailing/leading spaces (if any) with trimws.
trimws(gsub('[.]', ' ', str1))
#[1] "test" "test1 test2"
NOTE: In regex, . by itself means any character. So we should either keep it inside square brackets ([.]), escape it (\\.), or use the option fixed=TRUE:
trimws(gsub('.', ' ', str1, fixed=TRUE))
data
str1 <- c("test.", "test1.test2")
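To see why the dot needs special handling, it may help to compare the unescaped pattern with the three literal forms side by side. A small sketch using the same str1 as above; the variable names are just for illustration.

```r
str1 <- c("test.", "test1.test2")

# Unescaped: "." matches every character, so each string turns into spaces only.
all_blank <- gsub(".", " ", str1)
grepl("^ +$", all_blank)
# [1] TRUE TRUE

# Three equivalent ways to match a literal dot.
bracketed <- gsub("[.]", " ", str1)
escaped   <- gsub("\\.", " ", str1)
literal   <- gsub(".", " ", str1, fixed = TRUE)

identical(bracketed, escaped) && identical(escaped, literal)
# [1] TRUE
```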

You can also use strsplit:
a <- "test."
b <- "test1.test2"
do.call(paste, as.list(strsplit(a, "\\.")[[1]]))
[1] "test"
do.call(paste, as.list(strsplit(b, "\\.")[[1]]))
[1] "test1 test2"
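The strsplit calls above handle one string at a time. A possible way to do the same for a whole vector is sketched below; it uses fixed = TRUE instead of the escaped regex, which is my own substitution and not part of the answer.

```r
x <- c("test.", "test1.test2")

# Split each element on a literal dot, then re-join the pieces with a space.
pieces <- strsplit(x, ".", fixed = TRUE)
joined <- vapply(pieces, paste, character(1), collapse = " ")

joined
# [1] "test"        "test1 test2"
```

strsplit drops a match at the very end of a string, which is why no trailing space appears for "test.".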

Related

How to throw out spaces and underscores only from the beginning of the string?

I want to ignore the spaces and underscores in the beginning of a string in R.
I can write something like
txt <- gsub("^\\s+", "", txt)
txt <- gsub("^\\_+", "", txt)
But I think there could be an elegant solution
txt <- " 9PM 8-Oct-2014_0.335kwh "
txt <- gsub("^[\\s+|\\_+]", "", txt)
txt
The output should be "9PM 8-Oct-2014_0.335kwh ". But my code gives " 9PM 8-Oct-2014_0.335kwh ".
How can I fix it?
You could bundle the \s and the underscore only in a character class and use quantifier to repeat that 1+ times.
^[\s_]+
For example:
txt <- gsub("^[\\s_]+", "", txt, perl=TRUE)
Or, as @Tim Biegeleisen points out in the comments, since only the first occurrence is being replaced, you could use sub instead:
txt <- sub("^[\\s_]+", "", txt, perl=TRUE)
Or using a POSIX character class:
txt <- sub("^[[:space:]_]+", "", txt)
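The ^ anchor matters as soon as some strings have nothing to strip at the start. The sketch below uses a made-up two-element vector (not from the question) to show how an unanchored sub would then remove the first interior space instead.

```r
# Hypothetical input: the second string has no leading space or underscore.
txt <- c(" _9PM 8-Oct-2014_0.335kwh ", "9PM 8-Oct-2014_0.335kwh ")

# Anchored: only a run at the very start is removed.
anchored <- sub("^[\\s_]+", "", txt, perl = TRUE)
anchored
# [1] "9PM 8-Oct-2014_0.335kwh " "9PM 8-Oct-2014_0.335kwh "

# Unanchored: the first run anywhere is removed, which damages the second string.
unanchored <- sub("[\\s_]+", "", txt, perl = TRUE)
unanchored
# [1] "9PM 8-Oct-2014_0.335kwh " "9PM8-Oct-2014_0.335kwh "
```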
The stringr package offers some task-specific functions with helpful names. In your original question you say you would like to remove whitespace and underscores from the start of your string, but in a comment you imply that you also wish to remove the same characters from the end of the same string. To that end, I'll include a few different options.
Given string s <- " \t_blah_ ", which contains whitespace (spaces and tabs) and underscores:
library(stringr)
# Remove whitespace and underscores at the start.
str_remove(s, "[\\s_]+")
# [1] "blah_ "
# Remove all whitespace and underscores, wherever they occur (here, only at the start and end).
str_remove_all(s, "[\\s_]+")
# [1] "blah"
In case you're looking to remove whitespace only – there are, after all, no underscores at the start or end of your example string – there are a couple of stringr functions that will help you keep things simple:
# `str_trim` trims whitespace (spaces, tabs, newlines) from either or both sides.
str_trim(s, side = "left")
# [1] "_blah_ "
str_trim(s, side = "right")
# [1] " \t_blah_"
str_trim(s, side = "both") # This is the default.
# [1] "_blah_"
# `str_squish` reduces repeated whitespace anywhere in string.
s <- " \t_blah blah_ "
str_squish(s)
# "_blah blah_"
The same pattern [\\s_]+ will also work in base R's sub or gsub, with some minor modifications, if that's your jam (see The fourth bird's answer).
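For reference, a base R version of trimming both whitespace and underscores from the ends only might look like the sketch below. The anchored alternation is one way to do it; the `whitespace` argument of trimws is, as far as I know, available from R 3.6.0 onwards, so treat that part as an assumption about your R version.

```r
s <- " \t_blah_ "

# Anchored alternation: strip the run at the start or the run at the end.
both_ends <- gsub("^[[:space:]_]+|[[:space:]_]+$", "", s)
both_ends
# [1] "blah"

# trimws() with a custom character class (assumes R >= 3.6.0).
left_only <- trimws(s, which = "left", whitespace = "[ \t\r\n_]")
left_only
# [1] "blah_ "
```

Unlike str_remove_all, both calls leave underscores in the middle of the string alone.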
You can use stringr as:
txt <- " 9PM 8-Oct-2014_0.335kwh "
library(stringr)
str_trim(txt)
[1] "9PM 8-Oct-2014_0.335kwh"
Or the trimws in Base R
trimws(txt)
[1] "9PM 8-Oct-2014_0.335kwh"

Need to trim last character string only if is blank or "."

I have a large vector of words read from an Excel file. Some of those records end with a space or a period ("."). Only in those cases, I need to trim those characters.
Example:
"depresion" "tristeza."
"nostalgia" "preocupacion."
"enojo." "soledad "
"frustracion" "desesperacion "
"angustia." "desconocidos."
Notice some words end normal without "." or " ".
Is there a way to do that?
I have this
substr(conceptos, 1, nchar(conceptos)-1)
to drop the last character (conceptos is this long vector), but it does so for every word.
Thanks for any advice.
We can use sub to match zero or more . or spaces at the end of the string and replace them with blank ("")
sub("(\\.| )*$", "", v1)
#[1] "depresion" "tristeza" "nostalgia" "preocupacion" "enojo"
#[6] "soledad" "frustracion" "desesperacion"
#[9] "angustia" "desconocidos"
data
v1 <- c("depresion","tristeza.","nostalgia","preocupacion.",
"enojo.","soledad ","frustracion","desesperacion ",
"angustia.","desconocidos.")
Regular expressions are good for this:
library(stringr)
x = c("depresion", "tristeza.", "nostalgia", "preocupacion.",
"enojo.", "soledad ", "frustracion", "desesperacion ",
"angustia.", "desconocidos.")
x_replaced = str_replace(x, "(\\.|\\s)$", "")
The pattern (\\.|\\s)$ will match a . or any whitespace that occurs right at the end of the string.
Try this:
trimmed <- trimws(conceptos)
ifelse(endsWith(trimmed, "."), substr(trimmed, 1, nchar(trimmed) - 1), trimmed)
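The answers above differ in how many trailing characters they remove: a pattern with a quantifier strips the whole trailing run, while a single-character pattern strips only one. A short sketch with made-up inputs (not from the question) shows the difference.

```r
# Hypothetical inputs with more than one trailing character to strip.
v <- c("enojo. ", "soledad", "fin...")

# Single character before the end of the string: removes at most one character.
one_char <- sub("[. ]$", "", v)
one_char
# [1] "enojo."  "soledad" "fin.."

# One or more characters before the end: removes the whole trailing run.
all_trailing <- sub("[. ]+$", "", v)
all_trailing
# [1] "enojo"   "soledad" "fin"
```

Which of the two is wanted depends on whether entries like "enojo. " can occur in the data.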

Put space after a specific word in a string vector in R

I want to put a space after a specific character in a string vector in R.
Example:
Text <-"<U+00A6>Word"
My goal is to put a space after the ">" to separate the string into two parts, giving: <U+00A6> Word
I tried with gsub, but I do not have the right idea:
Text = gsub("<*", " ", Text)
But that only puts a space after each character.
Can you advise on that?
You can use this:
sub(">", "> ", Text)
# [1] "<U+00A6> Word"
or this (without repeating the >):
sub("(?<=>)", " ", Text, perl = TRUE)
# [1] "<U+00A6> Word"
If you just want to extract Word, you can use:
sub(".*>", "", Text)
# [1] "Word"
We can use str_extract to extract the word after the >
library(stringr)
str_extract(Text, "(?<=>)\\w+")
#[1] "Word"
Or another option is strsplit
strsplit(Text, ">")[[1]][2]
#[1] "Word"
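If the vector can contain several such codes per string, or other ">" characters that should be left alone, one option is to match the whole <U+XXXX> token and re-insert it with a trailing space. This is a sketch under the assumption that the codes always have four hexadecimal digits; the extra test strings are made up.

```r
# Hypothetical inputs: one code, two codes, and an unrelated ">".
Text <- c("<U+00A6>Word", "<U+00A6>One<U+00A7>Two", "x>y")

# Capture each <U+XXXX> token and put it back followed by a space.
spaced <- gsub("(<U\\+[0-9A-Fa-f]{4}>)", "\\1 ", Text)

spaced
# [1] "<U+00A6> Word"            "<U+00A6> One<U+00A7> Two"
# [3] "x>y"
```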

How to change the word separator character in a vector?

I have a character vector consisting of the following style:
mylist <- c('John Myer Stewert','Steve',' Michael Boris',' Daniel and Frieds','Michael-Myer')
I'm trying to create a character vector like this:
mylist <- c('John+Myer+Stewert','Steve',' Michael+Boris',' Daniel+and+Frieds','Michael+Myer')
I have tried:
test <- cat(paste(shQuote(mylist , type="cmd"), collapse="+"))
That seems wrong. How can I change the word separator in mylist as shown above?
You could use chartr(). Just re-use the + sign for both space and - characters.
chartr(" -", "++", trimws(mylist))
# [1] "John+Myer+Stewert" "Steve" "Michael+Boris"
# [4] "Daniel+and+Frieds" "Michael+Myer"
Note that I also trimmed the leading whitespace since there is really no need to keep it.
We can use gsub by matching the space (" ") as pattern and replace it with "+".
gsub(" ", "+", trimws(mylist))
#[1] "John+Myer+Stewert" "Steve" "Michael+Boris"
#[4] "Daniel+and+Frieds" "Michael-Myer"
I assumed that the leading spaces are a typo. If they are not, we can use regex lookarounds
gsub("(?<=[a-z])[ -](?=[[:alpha:]])", "+", mylist, perl = TRUE)
#[1] "John+Myer+Stewert" "Steve" " Michael+Boris"
#[4] " Daniel+and+Frieds" "Michael+Myer"
Or some PCRE regex
gsub("(^ | $)(*SKIP)(*F)|[ -]", "+", mylist, perl = TRUE)
#[1] "John+Myer+Stewert" "Steve" " Michael+Boris"
#[4] " Daniel+and+Frieds" "Michael+Myer"
You can use the package stringr.
library(stringr)
str_replace_all(trimws(mylist), "[ -]", "+")
#[1] "John+Myer+Stewert" "Steve" "Michael+Boris"
#[4] "Daniel+and+Frieds" "Michael+Myer"
Between [] we specify what we want to replace with +. In this case, that is a single white space and -. I used trimws from Akrun's answer to get rid of the extra white space in the beginning of some elements in your string.
This is yet another alternative.
library(stringi)
stri_replace_all_regex(trimws(mylist), "[ -]", "+")
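The base R variants above can be wrapped in a small helper. The function name replace_separators and the extra "Anna - Lee" element are my own additions for illustration; the sketch also collapses a run of separators into a single +, which goes slightly beyond what was asked.

```r
mylist <- c('John Myer Stewert', 'Steve', ' Michael Boris',
            ' Daniel and Frieds', 'Michael-Myer', 'Anna - Lee')

# Hypothetical helper: trim the ends, then replace each run of spaces/hyphens.
# The hyphen is placed last in the class so it is treated literally.
replace_separators <- function(x, sep = "+") {
  gsub("[ -]+", sep, trimws(x))
}

result <- replace_separators(mylist)
result
# [1] "John+Myer+Stewert" "Steve"             "Michael+Boris"
# [4] "Daniel+and+Frieds" "Michael+Myer"      "Anna+Lee"
```

Unlike chartr, which maps characters one to one, the quantifier here turns " - " into a single separator.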

add space in string when meeting a given pattern

I have a string as follows:
a<-c("AbcDef(123)")
> a
[1] "AbcDef(123)"
Is there any efficient way to transform it as
a<-c("Abc Def (123)")
In other words, I would like to add a space before each uppercase letter or opening parenthesis "(", except at the start of the string.
One possibility:
gsub("(?<=[^A-Z(])(?=[A-Z(])", " ", a, perl = TRUE)
Mine's a bit kludgy and uses two gsubs. The inner gsub adds spaces, the outer gsub removes the leading whitespace.
a <- "AbcDef(123)"
gsub("^\\s", "", gsub("([A-Z(])", " \\1", a))
Try this:
gsub("(?<=.)([A-Z(])", " \\1", a, perl = TRUE)
giving:
[1] "Abc Def (123)"
If no piece of the resulting string is a single character long, the lookbehind is not needed and this can be simplified to:
gsub("(.)([A-Z(])", "\\1 \\2", a)
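The patterns above behave the same on the example but differ on runs of capitals. The sketch below compares two of them on a made-up acronym case; the second pattern, which only splits after a lowercase letter or digit, is my own variation and not from the answers.

```r
# Hypothetical inputs: the original example plus a string with an acronym.
a <- c("AbcDef(123)", "HTTPServer(1)")

# Space before every capital or "(" that is not at the start of the string.
every_upper <- gsub("(?<=.)([A-Z(])", " \\1", a, perl = TRUE)
every_upper
# [1] "Abc Def (123)"      "H T T P Server (1)"

# Space only where a lowercase letter or digit is followed by a capital or "(".
boundary_only <- gsub("([a-z0-9])([A-Z(])", "\\1 \\2", a)
boundary_only
# [1] "Abc Def (123)"  "HTTPServer (1)"
```

Which behaviour is preferable depends on whether acronyms should be kept together.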
