replacing "." with a character in R - r

I have a string that is
NAME = "Brad.Pitt"
I want to replace "." with a space (" "). I tried using sub, gsub, str_replace, str_replace all. They aren't working fine. Is there anything that I am missing out ?

Try:
NAME = "Brad.Pitt"
gsub("\\.", " ", NAME)

The main thing missing is that all the commands you mention regard the pattern as a regular expression, not a fixed string and, in particular, in a regular expression dot matches any character -- not just dot. You can find out more about these by reading ?"regular expression"
Here are some alternatives. (1) The first one uses fixed=TRUE so that the special characters in the first argument are not interpreted as regular expression characters. We used sub which substitutes one occurrence but if there were multiple occurrences and you wanted to replace them all with space then use gsub in place of sub. (2) The second escapes the dot with square brackets so it is not interpreted as a regular expression character. Another way of escaping dot is to preface it with a backslash so using R 4.0 r"{...}" notation we can use r"{\.}" as the pattern or without that notation see the other answer. (3) chartr does not use regular expressions at all. It will replace every dot with space. (4) substr<- replaces the 5th character with space. (5) scan read in the words separately creating a character vector with two components and then paste pastes them back together with a space between them. If there were multiple dots then it would replace each with a space.
# 1
sub(".", " ", NAME, fixed = TRUE)
## [1] "Brad Pitt"
# 2
sub("[.]", " ", NAME)
## [1] "Brad Pitt"
# 3
chartr(".", " ", NAME)
## [1] "Brad Pitt"
# 4
`substr<-`(NAME, 5, 5, ' ')
## [1] "Brad Pitt"
# 5
paste(scan(text = NAME, what = "", sep = ".", quiet = TRUE), collapse = " ")
## [1] "Brad Pitt"
Headings
Also note that if the reason you have the dot is that you read it in as a heading then you can use check.names = FALSE to avoid that. Compare the headings in the output of the two read.csv commands below.
Lines <- "BRAD PITT
1
2"
read.csv(text = Lines)
## BRAD.PITT
## 1 1
## 2 2
read.csv(text = Lines, check.names = FALSE)
## BRAD PITT
## 1 1
## 2 2

Related

Replace matched patterns in a string based on condition

I have a text string containing digits, letters and spaces. Some of its substrings are month abbreviations. I want to perform a condition-based pattern replacement, namely to enclose a month abbreviation in whitespaces if and only if a given condition is fulfilled. As an example, let the condition be as follows: "preceeded by a digit and succeeded by a letter".
I tried stringr package but I fail to combine the functions str_replace_all() and str_locate_all():
# Input:
txt = "START1SEP2 1DECX JANEND"
# Desired output:
# "START1SEP2 1 DEC X JANEND"
# (A) What I could do without checking the condition:
library(stringr)
patt_month = paste("(", paste(toupper(month.abb), collapse = "|"), ")", sep='')
str_replace_all(string = txt, pattern = patt_month, replacement = " \\1 ")
# "START1 SEP 2 1 DEC X JAN END"
# (B) But I actually only need replacements inside the condition-based bounds:
str_locate_all(string = txt, pattern = paste("[0-9]", patt_month, "[A-Z]", sep=''))[[1]]
# start end
# [1,] 12 16
# To combine (A) and (B), I'm currently using an ugly for() loop not shown here and want to get rid of it
You are looking for lookarounds:
(?<=\d)DEC(?=[A-Z])
See a demo on regex101.com.
Lookarounds make sure a certain position is matched without consuming any characters. They are available in front of sth. (called lookbehind) or to make sure anything that follows is of a certain type (called lookahead). You have positive and negative ones on both sides, thus you have four types (pos./neg. lookbehind/-ahead).
A short memo:
(?=...) is a pos. lookahead
(?!...) is a neg. lookahead
(?<=...) is a pos. lookbehind
(?<!...) is a neg. lookbehind
A Base R version
patt_month <- capture.output(cat(toupper(month.abb),"|"))#concatenate all month.abb with OR
pat <- paste0("(\\s\\d)(", patt_month, ")([A-Z]\\s)")#make it a three group thing
gsub(pattern = pat, replacement = "\\1 \\2 \\3", txt, perl =TRUE)#same result as above
Also works for txt2 <- "START1SEP2 1JANY JANEND" out of the box.
[1] "START1SEP2 1 JAN Y JANEND"

How to throw out spaces and underscores only from the beginning of the string?

I want to ignore the spaces and underscores in the beginning of a string in R.
I can write something like
txt <- gsub("^\\s+", "", txt)
txt <- gsub("^\\_+", "", txt)
But I think there could be an elegant solution
txt <- " 9PM 8-Oct-2014_0.335kwh "
txt <- gsub("^[\\s+|\\_+]", "", txt)
txt
The output should be "9PM 8-Oct-2014_0.335kwh ". But my code gives " 9PM 8-Oct-2014_0.335kwh ".
How can I fix it?
You could bundle the \s and the underscore only in a character class and use quantifier to repeat that 1+ times.
^[\s_]+
Regex demo
For example:
txt <- gsub("^[\\s_]+", "", txt, perl=TRUE)
Or as #Tim Biegeleisen points out in the comment, if only the first occurrence is being replaced you could use sub instead:
txt <- sub("[\\s_]+", "", txt, perl=TRUE)
Or using a POSIX character class
txt <- sub("[[:space:]_]+", "", txt)
More info about perl=TRUE and regular expressions used in R
R demo
The stringr packages offers some task specific functions with helpful names. In your original question you say you would like to remove whitespace and underscores from the start of your string, but in a comment you imply that you also wish to remove the same characters from the end of the same string. To that end, I'll include a few different options.
Given string s <- " \t_blah_ ", which contains whitespace (spaces and tabs) and underscores:
library(stringr)
# Remove whitespace and underscores at the start.
str_remove(s, "[\\s_]+")
# [1] "blah_ "
# Remove whitespace and underscores at the start and end.
str_remove_all(s, "[\\s_]+")
# [1] "blah"
In case you're looking to remove whitespace only – there are, after all, no underscores at the start or end of your example string – there are a couple of stringr functions that will help you keep things simple:
# `str_trim` trims whitespace (\s and \t) from either or both sides.
str_trim(s, side = "left")
# [1] "_blah_ "
str_trim(s, side = "right")
# [1] " \t_blah_"
str_trim(s, side = "both") # This is the default.
# [1] "_blah_"
# `str_squish` reduces repeated whitespace anywhere in string.
s <- " \t_blah blah_ "
str_squish(s)
# "_blah blah_"
The same pattern [\\s_]+ will also work in base R's sub or gsub, with some minor modifications, if that's your jam (see Thefourthbird`s answer).
You can use stringr as:
txt <- " 9PM 8-Oct-2014_0.335kwh "
library(stringr)
str_trim(txt)
[1] "9PM 8-Oct-2014_0.335kwh"
Or the trimws in Base R
trimws(txt)
[1] "9PM 8-Oct-2014_0.335kwh"

Regex: how to keep all digits when splitting a string?

Question
Using a regular expression, how do I keep all digits when splitting a string?
Overview
I would like to split each element within the character vector sample.text into two elements: one of only digits and one of only the text.
Current Attempt is Dropping Last Digit
This regular expression - \\d\\s{1} - inside of base::strsplit() removes the last digit. Below is my attempt, along with my desired output.
# load necessary data -----
sample.text <-
c("111110 Soybean Farming", "0116 Soybeans")
# split string by digit and one space pattern ------
strsplit(sample.text, split = "\\d\\s{1}")
# [[1]]
# [1] "11111" "Soybean Farming"
#
# [[2]]
# [1] "011" "Soybeans"
# desired output --------
# [[1]]
# [1] "111110" "Soybean Farming"
#
# [[2]]
# [1] "0116" "Soybeans"
# end of script #
Any advice on how I can split sample.text to keep all digits would be much appreciated! Thank you.
Because you're splitting on \\d, the digit there is consumed in the regex, and not present in the output. Use lookbehind for a digit instead:
strsplit(sample.text, split = "(?<=\\d) ", perl=TRUE)
http://rextester.com/GDVFU71820
Some alternative solutions, using very simple pattern matching on the first occurrence of space:
1) Indirectly by using sub to substitute your own delimiter, then strsplit on your delimiter:
E.g. you can substitute ';' for the first space (if you know that character does not exist in your data):
strsplit( sub(' ', ';', sample.text), split=';')
2) Using regexpr and regmatches
You can effectively match on the first " " (space character), and split as follows:
regmatches(sample.text, regexpr(" ", sample.text), invert = TRUE)
Result is a list, if that's what you are after per your sample desired output:
[[1]]
[1] "111110" "Soybean Farming"
[[2]]
[1] "0116" "Soybeans"
3) Using stringr library:
library(stringr)
str_split_fixed(sample.text, " ", 2) #outputs a character matrix
[,1] [,2]
[1,] "111110" "Soybean Farming"
[2,] "0116" "Soybeans"

Regex expression starting from a certain character

Example: "example._AL(5)._._4500_GRE/Jan_2018"
I am trying to extract text from the above string containing parentheses. I wanna extract everything starting from AL.
Output should look like: "AL(5)._._4500_GRE/Jan_2018"
There is some question on what we can assume is known but here are a few variations which make various assumptions.
1) word( This removes everything prior to the first word followed by a parenthesis.
"^" matches the start of string
".*?" is the shortest match of anything provided we still match rest of regex
"\\w+" matches a word
"\\(" matches a left paren
(...) forms a capture group which the replacement string can refer to as "\\1"
Code
x <- "example.AL(5)._._4500_GRE/Jan_2018"
sub("^.*?(\\w+\\()", "\\1", x)
## [1] "AL(5)._._4500_GRE/Jan_2018"
1a) or matching a word followed by ( followed by anything and extracting that:
library(gsubfn)
strapplyc(x, "\\w+\\(.*", simplify = TRUE)
## [1] "AL(5)._._4500_GRE/Jan_2018"
2) AL( or if we know that the word is AL then:
sub("^.*?(AL\\(.*)", "\\1", x)
## [1] "AL(5)._._4500_GRE/Jan_2018"
3) remove up to 1st dot or if we know that the part to be removed is the part before and including the first dot:
sub("^.*?\\.", "", x)
## [1] "AL(5)._._4500_GRE/Jan_2018"
4) dot separated fields If the format of the input is dot-separated fields we can parse them all out at once like this:
read.table(text = x, sep = ".", as.is = TRUE)
## V1 V2 V3 V4
## 1 example AL(5) _ _4500_GRE/Jan_2018

Extract date from given string in r

string<-c("Posted 69 months ago (7/4/2011)")
library(gsubfn)
strapplyc(string, "(.*)", simplify = TRUE)
I apply above function but nothing happens.
In this I want to extract only date part i.e 7/4/2011.
The first one shows how to fix the code in the question to give the desired answer. The next 2 solutions are the same except they use different regular expressions. The fourth solution shows how to do it with gsub. The fifth breaks the gsub into two sub calls and the sixth uses read.table.
1) Escape parens The problem is that ( and ) have special meaning in regular expressions so you must escape them if you want to match them literally. By using "[(]" as we do below (or writing them as "\\(" ) they are matched literally. The inner parentheses define the capture group as we don't want that group to include the literal parentheses themselves:
strapplyc(string, "[(](.*)[)]", simplify = TRUE)
## [1] "7/4/2011"
2) Match content Another way to do it is to match the data itself rather than the surrounding parentheses. Here "\\d+" matches one or more digits:
strapplyc(string, "\\d+/\\d+/\\d+", simplify = TRUE)
## [1] "7/4/2011"
You could specify the number of digits if you want to be even more specific but it seems unnecessary here if the data looks similar to that in the question.
3) Match 8 or more digits and slashes Given that there are no other sequences of 8 or more characters consisting only of slashes and digits in the rest of the string we could just pick out that:
strapplyc(string, "[0-9/]{8,}", simplify = TRUE)
## [1] "7/4/2011"
4) Remove text before and after Another way of doing it is to remove everything up to the ( and after the ) like this:
gsub(".*[(]|[)].*", "", string)
## [1] "7/4/2011"
5) sub This is the same as (4) except it breaks the gsub into two sub invocations, one removing everything up to ( and the other removing ) onwards. The regular expressions are therefore slightly simpler.
sub(".*\\(", "", sub("\\).*", "", string))
6) read.table This solution uses no regular expressions at all. It defines sep and comment.char in read.table so that the second column of the result of read.table is the required date or dates.
read.table(text = string, sep = "(", comment.char = ")", as.is = TRUE)$V2
## [1] "7/4/2011"
Note: Note that you don't need the c in defining string
string <- c("Posted 69 months ago (7/4/2011)")
string2 <- "Posted 69 months ago (7/4/2011)"
identical(string, string2)
## [1] TRUE
We can do this with gsub by matching one or more characters that are not a ( ([^(]+) from the start (^) of the string or | the ) at the end ($) of the string and replace it with ""
gsub("[^[^(]+\\(|\\)$", "", string)
#[1] "7/4/2011"
Or using capture groups
sub("^[^(]+\\(([^)]+).*", "\\1", string)
#[1] "7/4/2011"
Or with str_extract, we match one or more characters that are not a ) ([^)]+) that follows the ( ((?<=[(]))
library(stringr)
str_extract(string, "(?<=[(])[^)]+")
#[1] "7/4/2011"

Resources