Concatenate a string that contains backquote characters in R

I have a string that contains backquotes, which mess up the concatenate function. If you try to concatenate elements wrapped in backticks, c() doesn't like it:
a <- c(`table`, `chair`, `desk`)
Error: object 'chair' not found
So I can create the variable:
bad.string <- "`table`, `chair`, `desk`"
a <- gsub("`", "", bad.string)
That gives a string "table, chair, desk".
The result should then be like:
good.object <- c("table", "chair", "desk")
I don't know why the backquotes break the concatenate function, but how can I transform the string so it doesn't contain the illegal characters?

Try:
good.string <- trimws(unlist(strsplit(gsub("`", "", bad.string), ",")))
Here gsub() removes the backticks, strsplit() splits the single string into a list of strings at each comma, unlist() converts that list of strings into a character vector, and trimws() deletes leading and trailing whitespace.
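Step by step, the same pipeline with intermediate results as comments:
bad.string <- "`table`, `chair`, `desk`"
no.ticks <- gsub("`", "", bad.string)    # "table, chair, desk"
pieces <- strsplit(no.ticks, ",")        # list(c("table", " chair", " desk"))
good.string <- trimws(unlist(pieces))
good.string
# [1] "table" "chair" "desk"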

From the documentation on quotes, back ticks are reserved for non-standard variable names such as
`the dog` <- 1:5
`the dog`
# [1] 1 2 3 4 5
So when you try to concatenate, R is doing nothing wrong: it treats each backquoted name in c() as a variable and tries to find it, causing the error.
If this is a vector you wrote, just copy replace all of the backticks with single or double quotes.
If this is somehow being generated in R, bring the entire thing out as a string, then use gsub() and eval(parse())
eval(parse(text = gsub("`", "'", "c(`table`, `chair`, `desk`)")))
[1] "table" "chair" "desk"
EDIT: For the new example of bad.string
You have to go through and replace all of the backticks with double quotes; then you can read the string with read.csv(). This is a little janky, though, as it gives back a row vector, so we transpose it to get a column vector.
bad_string <- "`table`, `chair`, `desk`"
okay_string <- gsub("`", '"', bad_string)
okay_string
# [1] "\"table\", \"chair\", \"desk\""
t(read.csv(text = okay_string,header=FALSE, strip.white = TRUE))
# [,1]
# V1 "table"
# V2 "chair"
# V3 "desk"
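If you would rather have a plain character vector than a matrix, wrapping the result in as.vector() should work (assuming R 4.0+, where read.csv() returns character columns by default):
as.vector(t(read.csv(text = okay_string, header = FALSE, strip.white = TRUE)))
# [1] "table" "chair" "desk"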

extract substring in R

Suppose I have the string "S[+229]EC[+57]VDSTDNSSK[+229]PSSEPTSHVAR" and need to get a vector of the bracketed numbers it contains, e.g. "[+229]" and "[+57]".
Is there a convenient way in R to do this?
Using base R, try:
> unlist(regmatches(x, gregexpr("\\[\\+\\d+\\]", x)))
[1] "[+229]" "[+57]" "[+229]"
Or you can use
> gsub(".*?(\\[.*\\]).*", "\\1", gsub("\\].*?\\[", "] | [", x))
[1] "[+229] | [+57] | [+229]"
We can use str_extract_all from stringr
stringr::str_extract_all(x, "\\[\\+\\d+\\]")[[1]]
#[1] "[+229]" "[+57]" "[+229]"
Wrap it in unique if you need only unique values.
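That is:
unique(stringr::str_extract_all(x, "\\[\\+\\d+\\]")[[1]])
# [1] "[+229]" "[+57]"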
Similarly, in base R using regmatches and gregexpr
regmatches(x, gregexpr("\\[\\+\\d+\\]", x))[[1]]
data
x <- "S[+229]EC[+57]VDSTDNSSK[+229]PSSEPTSHVAR"
Seems like you want to remove the alphabetical characters, so
gsub("[[:alpha:]]", "", x)
where [:alpha:] is the class of alphabetical (lower-case and upper-case) characters, [[:alpha:]] says 'match any single alphabetical character', and gsub() says substitute, globally, any alphabetical character with the empty string "". This seems better than trying to match bracketed numbers, which requires figuring out which characters need to be escaped with a (double!) \\.
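With the example string this gives:
x <- "S[+229]EC[+57]VDSTDNSSK[+229]PSSEPTSHVAR"
gsub("[[:alpha:]]", "", x)
# [1] "[+229][+57][+229]"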
If the intention is to return the unique bracketed numbers, then the approach is to extract the matches (rather than remove the unwanted characters). Instead of using gsub() to substitute matches to a regular expression with another value, I'll use gregexpr() to identify the matches, and regmatches() to extract the matches. Since numbers always occur in [], I'll simplify the regular expression to match one or more (+) characters from the collection +[:digit:].
> xx <- regmatches(x, gregexpr("[+[:digit:]]+", x))
> xx
[[1]]
[1] "+229" "+57" "+229"
xx is a list of length equal to the length of x. I'll write a function that, for any element of this list, makes the values unique, surrounds the values with [ and ], and concatenates them
fun <- function(x)
    paste0("[", unique(x), "]", collapse = "")
This needs to be applied to each element of the list, and simplified to a vector, a task for sapply().
> sapply(xx, fun)
[1] "[+229][+57]"
A minor improvement is to use vapply(), so that the result is robust (always returning a character vector with length equal to x) to zero-length inputs
> x = character()
> xx <- regmatches(x, gregexpr("[+[:digit:]]+", x))
> sapply(xx, fun) # Hey, this returns a list :(
list()
> vapply(xx, fun, "character") # vapply() deals with 0-length inputs
character(0)

Drop char code from start of string in a whole column

I'm trying to drop a euro character code from the start of a column. The column was ingested as character by readr, but I need to convert it to integers.
data$price[1:3]
[1] "\u0080343,000.00" "\u0080185,000.00" "\u0080438,500.00"
So I need to get rid of \u0080 from the start (and the , and the ., but we'll deal with those later).
I tried:
data$price <- sub("\u0080", "", data$price)
-- no change(!!!)
data$price <- substr(data$price, 7, 100)
-- invalid multibyte string, element 1 (???)
I'd like to get to:
343000, 185000, 438500
But not sure how to get there. Any wisdom would be much appreciated!
You can tell R to use the exact text rather than regular expressions by using the fixed = TRUE option.
price <- c("\u0080343,000.00", "\u0080185,000.00", "\u0080438,500.00")
sub("\u0080", "", price, fixed = TRUE)
[1] "343,000.00" "185,000.00" "438,500.00"
To remove the comma and convert to an integer, you can use gsub.
as.integer(gsub(",", "", sub("\u0080", "", price, fixed = TRUE)))
[1] 343000 185000 438500
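As an aside, since the question mentions readr: its parse_number() drops the currency symbol and the grouping commas in one step (it returns doubles, so wrap it in as.integer() if you really need integers):
library(readr)
parse_number(price)
# [1] 343000 185000 438500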
You can do this:
gsub("[^ -~]+", "", price)
"343,000.00" "185,000.00" "438,500.00"
Explanation:
The Euro sign is a non-ASCII character. So to get rid of it in the values in price, we define a character class of printable ASCII characters with [ -~]; by negating the class through the caret ^ we match non-ASCII characters (such as €). This pattern is matched in gsub() and replaced by "", i.e., nothing.
To convert to integer, proceed as in @Adam's answer. To convert to numeric, you can do this:
as.numeric(gsub(",", "", gsub("[^ -~]+", "", price)))

Using regular expression in string replacement

I have a broken csv file that I am attempting to read into R and repair using a regular expression.
The reason it is broken is that it contains some fields which include a comma but are not wrapped in double quotes. So I have to use a regular expression to find these fields and wrap them in double quotes.
Here is an example of the data source:
DataField1,DataField2,Price
ID1,Value1,
ID2,Value2,$500.00
ID3,Value3,$1,250.00
So you can see that in the third row, the Price field contains a comma but it is not wrapped in double quotes. This breaks the read.table function.
My approach is to use readLines and str_replace_all to wrap the price with commas in double quotes. But I am not good at regular expressions and stuck.
vector <- readLines(file)
vector_temp <- str_replace_all(vector, ",\\$[0-9]+,\\d{3}\\.\\d{2}", ",\"\\$[0-9]+,\\d{3}\\.\\d{2}\"")
I want the output to be:
DataField1,DataField2,Price
ID1,Value1,
ID2,Value2,$500.00
ID3,Value3,"$1,250.00"
With this format, I can read into R.
Appreciate any help!
lines <- readLines(textConnection(object="DataField1,DataField2,Price
ID1,Value1,
ID2,Value2,$500.00
ID3,Value3,$1,250.00"))
library(stringi)
library(tidyverse)
stri_split_regex(lines, ",", n=3, simplify=TRUE) %>%
  as_data_frame() %>%
  docxtractr::assign_colnames(1)
## DataField1 DataField2 Price
## 1 ID1 Value1
## 2 ID2 Value2 $500.00
## 3 ID3 Value3 $1,250.00
from there you can readr::write_csv() or write.csv()
The extra facilities in the stringi or stringr packages do not seem needed; gsub seems perfectly suited for this. You just need to understand capture groups, created with paired parentheses (brackets to Brits), and the double-backslash convention (\\1, \\2, ...) for referring to capture-group matches in the replacement argument:
txt <- "DataField1,DataField2,Price, extra
ID1,Value1, ,
ID2,Value2,$500.00,
ID3,Value3,$1,250.00, o"
vector <- gsub("([$][0-9]{1,3}([,]([0-9]{3})){0,10}([.][0-9]{0,2}))", "\"\\1\"", readLines(textConnection(txt)))
> read.csv(text=vector)
DataField1 DataField2 Price extra
1 ID1 Value1
2 ID2 Value2 $500.00
3 ID3 Value3 $1,250.00 o
You are putting quotes around a specific sequence: digits, possibly followed by repeated groups of a comma plus three digits, and an optional period with two digits. There might be earlier SO questions about formatting as "currency".
Here are some solutions:
1) read.pattern This uses read.pattern in the gsubfn package to read in a file (assumed to be called sc.csv) such that the capture groups, i.e. the parenthesized portions, of the pattern are the fields. This will read in the file and process it all in one step so it is not necessary to use readLines first.
The ^(.*?), that begins the pattern will match everything from the start until the first comma. Then (.*?), will match up to the next comma, and finally (.*)$ will match everything else to the end. Normally * is greedy, i.e. it matches as much as it can, but the question mark after it makes it ungreedy. We need to specify perl = TRUE so that it uses Perl regular expressions, since by default gsubfn uses tcl regular expressions based on Henry Spencer's regex parser, which does not support *?. If you would rather have character columns instead of factor, add the as.is = TRUE argument to read.pattern.
The final line of code removes the $ and , characters from the Price column and converts it to numeric. (Omit this line if you actually want it formatted.)
library(gsubfn)
DF <- read.pattern("sc.csv", pattern = "^(.*?),(.*?),(.*)$", perl = TRUE, header = TRUE)
DF$Price <- as.numeric(gsub("[$,]", "", DF$Price)) ##
giving:
> DF
DataField1 DataField2 Price
1 ID1 Value1 NA
2 ID2 Value2 500
3 ID3 Value3 1250
2) sub This uses a very simple regular expression (just a single-character match) and no packages. Using vector as defined in the question, it replaces the first two commas with semicolons. Then the text can be read in using sep = ";":
read.table(text = sub(",", ";", sub(",", ";", vector)), header = TRUE, sep = ";")
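To see why this works, look at the intermediate result: only the first two commas become semicolons, so the comma inside the price survives as part of the data:
sub(",", ";", sub(",", ";", vector))
# [1] "DataField1;DataField2;Price" "ID1;Value1;"
# [3] "ID2;Value2;$500.00"          "ID3;Value3;$1,250.00"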
Add the line marked ## in (1) if you want numeric prices.

String processing in R (find and replace)

This is an example of my data.frame
no string
1 abc&URL_drf
2 abcdef&URL_efg
I need to replace the pattern *&URL (everything up to and including &URL) with "". So I need this result:
no string
1 _drf
2 _efg
In Excel, I can easily get this result by using '*&URL' in the 'Find and Replace' function.
However, I cannot find an effective method in R.
In R, my approach is below.
First, I split the string using strsplit(df$string, "&URL") and then selected the second piece, as sketched below. I don't think that is an effective way.
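For reference, that approach would look something like this (split on "&URL", then take the second piece of each element):
sapply(strsplit(df$string, "&URL"), `[`, 2)
# [1] "_drf" "_efg"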
Is there a more effective method?
# data
df <- read.table(text="no string
1 abc&URL_drf
2 abcdef&URL_efg", header=T, as.is=T)
# `gsub` function is to substitute the unwanted string with nothing,
# thus the `""`. The pattern of unwanted string was written in
# regular expressions.
df$string <- gsub("[a-z]+(&URL)", "", df$string)
# you get
no string
1 1 _drf
2 2 _efg
I suggest you use the grep function.
The grep function takes your regex as the first argument and the input vector as the second argument. If you pass value=TRUE, grep returns a vector with copies of the actual elements of the input vector that could be (partially) matched.
So in your case:
grep("[a-z]+(&URL)", df$string, perl=TRUE, value=TRUE)
Another approach:
df <- transform(df, string = sub(".*&URL", "", string))
# no string
# 1 1 _drf
# 2 2 _efg

Extract first X Numbers from Text Field using Regex

I have strings that looks like this.
x <- c("P2134.asfsafasfs","P0983.safdasfhdskjaf","8723.safhakjlfds")
I need to end up with:
"2134", "0983", and "8723"
Essentially, I need to extract the first four characters that are numbers from each element. Some elements begin with a letter, which prevents me from using a simple substring() call.
I guess technically, I could do something like:
x <- gsub("^P","",x)
x <- substr(x,1,4)
But I want to know how I would do this with regex!
You could use str_match from the stringr package:
library(stringr)
print(c(str_match(x, "\\d\\d\\d\\d")))
# [1] "2134" "0983" "8723"
You can do this with base R's sub too.
> sub('.?([0-9]{4}).*', '\\1', x)
[1] "2134" "0983" "8723"
I used sub instead of gsub to ensure I only got the first match. .? matches any single character, optionally (similar to just ., but then it wouldn't match the case without the leading P). The () signify a group that I reference in the replacement '\\1'. If there were multiple sets of () I could reference them too, with '\\2' and so on, as illustrated below. Inside the group (you had this syntax correct) I want only digits, and I want exactly 4 of them. The final piece says zero or more trailing characters of any type.
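To illustrate the multiple-group case, here is a throwaway example with two capture groups, the second referenced as '\\2':
sub('^(.?)([0-9]{4}).*', 'prefix=\\1 digits=\\2', "P2134.asfsafasfs")
# [1] "prefix=P digits=2134"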
Your syntax was working, but you were replacing something with itself so you wind up with the same output.
This will get you the first four digits of a string, regardless of where in the string they appear.
mapply(function(x, m) paste0(x[m], collapse=""),
       strsplit(x, ""),
       lapply(gregexpr("\\d", x), "[", 1:4))
Breaking the above line down into pieces:
# this will get you a list of matches of digits, and their location in each x
matches <- gregexpr("\\d", x)
# this gets you each individual digit
matches <- lapply(matches, "[", 1:4)
# individual characters of x
splits <- strsplit(x, "")
# get the appropriate string
mapply(function(x, m) paste0(x[m], collapse=""), splits, matches)
Another group-capturing approach, which doesn't assume exactly 4 digits:
x <- c("P2134.asfsafasfs","P0983.safdasfhdskjaf","8723.safhakjlfds")
gsub("(^[^0-9]*)(\\d+)([^0-9].*)", "\\2", x)
## [1] "2134" "0983" "8723"
