Using regular expressions in string replacement - R

I have a broken CSV file that I am attempting to read into R and repair using a regular expression.
The reason it is broken is that some of its fields include a comma but are not wrapped in double quotes. So I have to use a regular expression to find those fields and wrap them in double quotes.
Here is an example of the data source:
DataField1,DataField2,Price
ID1,Value1,
ID2,Value2,$500.00
ID3,Value3,$1,250.00
So you can see that in the third row, the Price field contains a comma but is not wrapped in double quotes. This breaks the read.table function.
My approach is to use readLines and str_replace_all to wrap any price containing a comma in double quotes, but I am not good at regular expressions and am stuck.
vector <- readLines(file)
vector_temp <- str_replace_all(vector, ",\\$[0-9]+,\\d{3}\\.\\d{2}", ",\"\\$[0-9]+,\\d{3}\\.\\d{2}\"")
I want the output to be:
DataField1,DataField2,Price
ID1,Value1,
ID2,Value2,$500.00
ID3,Value3,"$1,250.00"
With this format, I can read the file into R.
Appreciate any help!

lines <- readLines(textConnection("DataField1,DataField2,Price
ID1,Value1,
ID2,Value2,$500.00
ID3,Value3,$1,250.00"))

library(stringi)
library(tidyverse)

stri_split_regex(lines, ",", n = 3, simplify = TRUE) %>%
  as_data_frame() %>%
  docxtractr::assign_colnames(1)
## DataField1 DataField2 Price
## 1 ID1 Value1
## 2 ID2 Value2 $500.00
## 3 ID3 Value3 $1,250.00
From there you can use readr::write_csv() or write.csv().
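For instance (a minimal sketch, with df standing in for the repaired data frame produced by the pipeline above), write_csv() quotes any field that contains the delimiter, which is exactly the repair the question asks for:
library(tibble)
library(readr)
# Stand-in for the result of the pipeline above:
df <- tibble(DataField1 = c("ID1", "ID2", "ID3"),
             DataField2 = c("Value1", "Value2", "Value3"),
             Price      = c("", "$500.00", "$1,250.00"))
# "$1,250.00" is written wrapped in double quotes because it contains a comma:
write_csv(df, "fixed.csv")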

The extra facilities in the stringi or stringr packages do not seem needed; gsub is perfectly suited for this. You just need to understand capture groups, made with paired parentheses (brackets to Brits), and the double-backslash-n convention for referring to capture-group matches in the replacement argument:
txt <- "DataField1,DataField2,Price, extra
ID1,Value1, ,
ID2,Value2,$500.00,
ID3,Value3,$1,250.00, o"
vector <- gsub("([$][0-9]{1,3}([,]([0-9]{3})){0,10}([.][0-9]{0,2}))",
               "\"\\1\"", readLines(textConnection(txt)))
> read.csv(text=vector)
DataField1 DataField2 Price extra
1 ID1 Value1
2 ID2 Value2 $500.00
3 ID3 Value3 $1,250.00 o
You are putting quotes around a specific sequence: a dollar sign, one to three digits, possibly repeated groups of a comma plus three digits, and a period with up to two digits. There may be earlier SO questions about formatting numbers as "currency".
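For completeness, a sketch of the same fix with str_replace_all(), closer to the question's original attempt; the key point is that the replacement must refer back to the match with a \\1 capture-group backreference rather than repeat the pattern as literal text. Here vector is the character vector returned by readLines in the question:
library(stringr)
vector_temp <- str_replace_all(vector,
                               "(\\$[0-9]{1,3}(,[0-9]{3})*(\\.[0-9]{2})?)",
                               "\"\\1\"")
This also quotes prices without commas, such as $500.00, which is harmless since read.csv handles quoted fields either way.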

Here are some solutions:
1) read.pattern This uses read.pattern from the gsubfn package to read in a file (assumed to be called sc.csv) such that the capture groups, i.e. the parenthesized portions of the pattern, become the fields. It reads in the file and processes it all in one step, so it is not necessary to use readLines first.
The ^(.*?), that begins the pattern matches everything from the start until the first comma. Then (.*?), matches up to the next comma, and finally (.*)$ matches everything else to the end. Normally * is greedy, i.e. it matches as much as it can, but the question mark after it makes it ungreedy (lazy). We need to specify perl = TRUE so that Perl regular expressions are used, since by default gsubfn uses tcl regular expressions based on Henry Spencer's regex parser, which does not support *?. If you would rather have character columns instead of factor, add the as.is = TRUE argument to read.pattern.
The final line of code removes the $ and , characters from the Price column and converts it to numeric. (Omit this line if you actually want it formatted.)
library(gsubfn)
DF <- read.pattern("sc.csv", pattern = "^(.*?),(.*?),(.*)$", perl = TRUE, header = TRUE)
DF$Price <- as.numeric(gsub("[$,]", "", DF$Price)) ##
giving:
> DF
DataField1 DataField2 Price
1 ID1 Value1 NA
2 ID2 Value2 500
3 ID3 Value3 1250
2) sub This uses a very simple regular expression (just a single-character match) and no packages. Using vector as defined in the question, it replaces the first two commas with semicolons. The result can then be read in using sep = ";":
read.table(text = sub(",", ";", sub(",", ";", vector)), header = TRUE, sep = ";")
Add the line marked ## in (1) if you want numeric prices.
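A quick sanity check on the problem row shows the nested sub() calls converting exactly the first two commas and leaving the one inside the price alone:
sub(",", ";", sub(",", ";", "ID3,Value3,$1,250.00"))
## [1] "ID3;Value3;$1,250.00"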

Related

Replace matched patterns in a string based on condition

I have a text string containing digits, letters and spaces. Some of its substrings are month abbreviations. I want to perform a condition-based pattern replacement, namely to enclose a month abbreviation in whitespace if and only if a given condition is fulfilled. As an example, let the condition be: "preceded by a digit and followed by a letter".
I tried the stringr package but I fail to combine the functions str_replace_all() and str_locate_all():
# Input:
txt = "START1SEP2 1DECX JANEND"
# Desired output:
# "START1SEP2 1 DEC X JANEND"
# (A) What I could do without checking the condition:
library(stringr)
patt_month = paste("(", paste(toupper(month.abb), collapse = "|"), ")", sep='')
str_replace_all(string = txt, pattern = patt_month, replacement = " \\1 ")
# "START1 SEP 2 1 DEC X JAN END"
# (B) But I actually only need replacements inside the condition-based bounds:
str_locate_all(string = txt, pattern = paste("[0-9]", patt_month, "[A-Z]", sep=''))[[1]]
# start end
# [1,] 12 16
# To combine (A) and (B), I'm currently using an ugly for() loop (not shown here) and want to get rid of it
You are looking for lookarounds:
(?<=\d)DEC(?=[A-Z])
See a demo on regex101.com.
Lookarounds make sure a certain position is matched without consuming any characters. They can assert what comes before a position (called a lookbehind) or what follows it (called a lookahead). Both come in positive and negative variants, so there are four types (pos./neg. lookbehind/lookahead).
A short memo:
(?=...) is a pos. lookahead
(?!...) is a neg. lookahead
(?<=...) is a pos. lookbehind
(?<!...) is a neg. lookbehind
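A minimal sketch applying this in R (patt_month is the alternation built in the question; base R needs perl = TRUE for lookarounds):
patt_month <- paste0("(", paste(toupper(month.abb), collapse = "|"), ")")
txt <- "START1SEP2 1DECX JANEND"
# only DEC is preceded by a digit and followed by a letter, so only it is padded
gsub(paste0("(?<=[0-9])", patt_month, "(?=[A-Z])"), " \\1 ", txt, perl = TRUE)
# [1] "START1SEP2 1 DEC X JANEND"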
A Base R version
patt_month <- capture.output(cat(toupper(month.abb), sep = "|")) # concatenate all month.abb with OR
pat <- paste0("(\\s\\d)(", patt_month, ")([A-Z]\\s)") # make it a three-group thing
gsub(pattern = pat, replacement = "\\1 \\2 \\3", txt, perl = TRUE) # same result as above
It also works for txt2 <- "START1SEP2 1JANY JANEND" out of the box:
[1] "START1SEP2 1 JAN Y JANEND"

Apply a regex only to the first word of a phrase (defined with spaces)

I have this regex to separate letters from numbers (and symbols) in a word: (?<=[a-zA-Z])(?=([[0-9]|[:punct:]])). My test string is: "CALLE15 CRA22".
I want to apply this regex only to the first word of that sentence (words are delimited by spaces). Namely, I want to apply it only to "CALLE15".
One solution is to split the string (sentence) into words and then apply the regex to the first word, but I want to do it all in one regex. Another solution is to use stringr::str_replace() (or sub()), which replaces only the first match, but I need stringr::str_replace_all() (or gsub()) for other reasons.
What I need is to insert a space between the letters and the numbers, which I do with the replacement function. The outcome I want is "CALLE 15 CRA22", with the possibility of "CALLE15 CRA 22" as well. I have tried a lot of positions for the space and nothing works, not even the ^ at the beginning.
https://rubular.com/r/7dxsHdOA3avTdX
Thanks for your help!!!!
I am unsure about your problem statement (see my comment above), but the following reproduces your expected output and uses str_replace_all:
ss <- "CALLE15 CRA22"
library(stringr)
str_replace_all(ss, "^([A-Za-z]+)(\\d+)(\\s.+)$", "\\1 \\2\\3")
#[1] "CALLE 15 CRA22"
Update
To reproduce the output of the sample string from the comment above
ss <- "CLL.6 N 5-74NORTE"
pat <- c(
  "(?<=[A-Za-z])(?![A-Za-z])",
  "(?<![A-Za-z])(?=[A-Za-z])",
  "(?<=[0-9])(?![0-9])",
  "(?<![0-9])(?=[0-9])")
library(stringr)
str_split(ss, sprintf("(%s)", paste(pat, collapse = "|"))) %>%
  unlist() %>%
  .[nchar(trimws(.)) > 0] %>%
  paste(collapse = " ")
#[1] "CLL . 6 N 5 - 74 NORTE"

Separating multiple value numbers (with characters) and text

I have a file in Excel that has, as an example, text such as "4.56/505AB" in a cell. The numbers all vary, as does the length of the text, so the text can be a single character or multiple characters, and the numbers can contain characters such as a decimal point or a slash.
The ideal, separated format for this example would be: column 1 = 4.56/505, column 2 = AB.
What I've tried:
"Split_Text" in Excel, which removed the special characters from the number, and resulted in the following output: column 1 = 456505, column 2 = ./AB
R with the "G_sub" command, which resulted in: [1] " 4 . 56 / 505 AB"
Is there a way to take these methods further, or will this be a manual fix? Thank you!
Assuming the first uppercase letter is the beginning of the second column
df <- data.frame(c1 = c("4.56/505AB", "1.23/202CD"))
library(stringr)
df$c2 <- str_extract(df$c1, "[^[A-Z]]+")
df$c3 <- str_extract(df$c1, "[A-Z]+")
df
# c1 c2 c3
# 1 4.56/505AB 4.56/505 AB
# 2 1.23/202CD 1.23/202 CD
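Under the same assumption, a base R sketch without stringr:
df$c2 <- sub("[A-Z]+$", "", df$c1)   # drop the trailing uppercase letters
df$c3 <- sub("^[^A-Z]+", "", df$c1)  # drop everything before the first uppercase letter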
1) sub/read.table Match the leading characters and the trailing characters with the two capture groups and separate them with a semicolon. Then read that in using read.table. No packages are used.
x <- "4.56/505AB"
pat <- "^([0-9.,/]+)(.*)"
read.table(text = sub(pat, "\\1;\\2", x), sep = ";", as.is = TRUE)
## V1 V2
## 1 4.56/505 AB
The result has character columns, but if you prefer factor then omit the as.is = TRUE. We have also assumed there are no semicolons in the input; if there are, use some other character that does not appear in the input in place of the semicolon (in the two places where it appears).
1a) If we can assume that the second column always starts with a letter, then we could just replace the first letter encountered with a semicolon followed by that letter and then read it in using read.table. This has the advantage of using a slightly simpler pattern.
read.table(text = sub("([[:alpha:]])", ";\\1", x), sep = ";", as.is = TRUE)
2) read.pattern Using the same input x and pattern pat it is even shorter using read.pattern in the gsubfn package:
library(gsubfn)
read.pattern(text = x, pattern = pat, as.is = TRUE)
## V1 V2
## 1 4.56/505 AB

concatenate a string that contains backquote characters R

I have a string that contains backquotes, which mess up the c() (concatenate) function. If you try to concatenate with backticks, c() doesn't like it:
a <- c(`table`, `chair`, `desk`)
Error: object 'chair' not found
So I can create the variable:
bad.string <- "`table`, `chair`, `desk`"
a <- gsub("`", "", bad.string)
That gives the single string "table, chair, desk".
The end result should instead be an object like:
good.object <- c("table", "chair", "couch", "lamp", "stool")
I don't know why the backquotes cause c() to break, but how can I fix the string so it no longer has the illegal characters?
Try:
good.string <- trimws(unlist(strsplit(gsub("`", "", bad.string), ",")))
Here gsub() removes the backticks, strsplit() splits the single string at the commas into a list of strings, unlist() converts that list into a character vector, and trimws() deletes leading and trailing whitespace.
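A quick check with the string from the question:
bad.string <- "`table`, `chair`, `desk`"
trimws(unlist(strsplit(gsub("`", "", bad.string), ",")))
# [1] "table" "chair" "desk"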
From the documentation on quotes (?Quotes), backticks are reserved for non-standard variable names, such as:
`the dog` <- 1:5
`the dog`
# [1] 1 2 3 4 5
So when you use c() here, R is doing nothing wrong: it looks at all the variables inside c() and tries to find them, causing the error.
If this is a vector you wrote, just replace all of the backticks with single or double quotes.
If it is somehow being generated in R, bring the entire thing out as a string, then use gsub() and eval(parse()):
eval(parse(text = gsub('\`',"\'","c(`table`, `chair`, `desk`)")))
[1] "table" "chair" "desk"
EDIT: For the new example of bad_string
You have to replace all of the backticks with double quotes; then you can read the result with read.csv(). This is a little janky, though, as it gives back a row vector, so we transpose it to get a column vector:
bad_string <- "`table`, `chair`, `desk`"
okay_string <- gsub('\`', '\"', bad_string)
okay_string
# [1] "\"table\", \"chair\", \"desk\""
t(read.csv(text = okay_string, header = FALSE, strip.white = TRUE))
# [,1]
# V1 "table"
# V2 "chair"
# V3 "desk"

subsetting data with only the entries within the parentheses

How can I subset the data so that the description column contains only the entries within the parentheses?
data =
ID       description                                control
1814668  glycoprotein 2 (Gp2) (Fy2)                 LMN_2904435
1791634  claudin 10 (Cldn10), transcript variant 1  ILMN_1214954 NM
1790993  claudin 10 (Cldn10), transcript variant 2  ILMN_2515816
output
ID       description  control
1814668  Gp2, Fy2     LMN_2904435
1791634  Cldn10       ILMN_1214954 NM
1790993  Cldn10       ILMN_2515816
You could try
df2$description <- gsub('.*\\(([^)]+)\\).*', '\\1', df2$description)
Or use bracketXtract from qdap
library(qdap)
unlist(bracketXtract(df2$description, 'round'))
Or
library(qdapRegex)
unlist(rm_round(df2$description, extract=TRUE))
Update
Based on the new dataset "df2N",
df2N$description <- sapply(rm_round(df2N$description, extract = TRUE), toString)
Or using str_extract_all
library(stringr)
sapply(str_extract_all(df2N$description,
                       perl('(?<=\\()[^)]+(?=\\))')), toString)
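The same extraction also works in base R with gregexpr()/regmatches() and the identical lookaround pattern (a sketch, with df2N rebuilt from the question's data):
df2N <- data.frame(description = c("glycoprotein 2 (Gp2) (Fy2)",
                                   "claudin 10 (Cldn10), transcript variant 1",
                                   "claudin 10 (Cldn10), transcript variant 2"))
sapply(regmatches(df2N$description,
                  gregexpr("(?<=\\()[^)]+(?=\\))", df2N$description, perl = TRUE)),
       toString)
# [1] "Gp2, Fy2" "Cldn10"   "Cldn10"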
Probably not as great as @akrun's solutions, but here is another option using the function gsub (twice...) from base R:
df2$description <- gsub("^,\\s|,\\s$", "",
                        gsub("^[^(]*\\(|\\)[^()]*\\(|\\)[^(]*$", ", ",
                             df2$description, perl = TRUE))
#[1] "Gp2, Fy2" "Cldn10"   "Cldn10"
First, the inner gsub() tells R to search for either:
^[^(]*\\(: anything that is not an opening bracket, at the beginning of the string, up to and including an opening bracket
\\)[^()]*\\(: a closing bracket followed by anything that is not a bracket, up to and including an opening bracket
\\)[^(]*$: a closing bracket followed by anything that is not an opening bracket, through to the end of the string
and replaces each match with a comma followed by a space.
Second, the outer gsub() replaces the "comma followed by a space" at the beginning and at the end of the string with an empty string.
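To see the two steps separately, here is the intermediate result of the inner gsub() on the first description, before the outer gsub() trims the leading and trailing ", ":
gsub("^[^(]*\\(|\\)[^()]*\\(|\\)[^(]*$", ", ",
     "glycoprotein 2 (Gp2) (Fy2)", perl = TRUE)
#[1] ", Gp2, Fy2, "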
