Issue with a column containing special characters - R

I have a data frame in R that contains a character column with values as follows:
"\"121.29\""
"\"288.1\""
"\"120\""
"\"V132.3\""
"\"800\""
I am trying to get rid of the extra " and \ and retain clean values as below
121.29
288.10
120.00
V132.30
800.00
I tried gsub("([\\])","", x) and also the str_replace_all() function, but so far no luck. I would much appreciate it if anybody could help me resolve this issue. Thanks in advance.

Try
gsub('\\"',"",x)
[1] "121.29" "288.1" "120" "V132.3" "800"
Since the fourth entry is not numeric and an atomic vector can only contain entries of the same mode, the entries are all characters in this case (the most flexible mode capable of storing the data). So the printed result will still show quotes around each entry.
Because \ is a special character in R strings, it needs to be escaped with another backslash, so the pattern that gsub() actually receives as its first argument is \". Moreover, as suggested by @rawr, one can use single quotes around the pattern so that the double quote itself does not need to be escaped.
An alternative would be to use double quotes and escape them, too:
gsub("\\\"","",x)
which yields the same result.
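If the goal is also to pad the cleaned values to two decimals, as in the desired output, here is a minimal sketch; the vector x and the prefix/number split are my own illustration, assuming any leading letters (such as the "V") should be preserved:
x <- c("\"121.29\"", "\"288.1\"", "\"120\"", "\"V132.3\"", "\"800\"")
clean <- gsub('\\"', "", x)                        # drop the embedded double quotes
prefix <- gsub("[0-9.]+$", "", clean)              # non-numeric prefix, e.g. "V"
number <- as.numeric(gsub("^[^0-9]*", "", clean))  # numeric part
paste0(prefix, sprintf("%.2f", number))            # pad to two decimals
#[1] "121.29"  "288.10"  "120.00"  "V132.30" "800.00"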
Hope this helps.

Related

How to remove "\" from paste function output with quotation marks?

I'm working with the following code:
Y_Columns <- c("Y.1.1")
paste('{"ImportId":"', Y_Columns, '"}', sep = "")
The paste function produces the following output:
"{\"ImportId\":\"Y.1.1\"}"
How do I get the paste function to omit the \, such that the output is:
"{"ImportId":"Y.1.1"}"
Thank you for your help.
Note: I did do a search on SO to see if there were any Q's that asked "what is an escape character in R". But I didn't review all the 160 answers, only the first 20.
This is one way of demonstrating what I wrote in my comment:
out <- paste('{"ImportId":"', Y_Columns, '"}', sep = "")
out
#[1] "{\"ImportId\":\"Y.1.1\"}"
?print
print(out,quote=FALSE)
#[1] {"ImportId":"Y.1.1"}
Both R and regex patterns use escape characters to allow special characters to be displayed in print output or input. (And sometimes regex patterns need to have doubled escapes.) R has a few characters that need to be "escaped" in certain situations. You illustrated one such situation: including a double-quote character inside a result that will be printed with surrounding double-quotes. If you were intending to include any single quotes inside a character value that was delimited by single quotes at the time of creation, they would have needed to be escaped as well.
out2 <- '\'quoted\''
nchar(out2)
#[1] 8 ... note that neither the surrounding single-quotes nor the backslashes get counted
> out2
[1] "'quoted'" ... and the default output quote-char is a double-quote.
Here's a good Q&A to review:How to replace '+' using gsub() function in R
It has two answers, both useful: one shows how to double escape a special character and the other shows how to use teh fixed argument to get around that requirement.
And another potentially useful Q&A on the topic of handling Windows paths:
File path issues in R using Windows ("Hex digits in character string" error)
And some further useful reading suggestions: look at the series of help pages whose names start with capital letters. (Since I can never remember which one has which nugget of essential information, I tried ?Syntax first; it has a "See Also" list of essential reading: Arithmetic, Comparison, Control, Extract, Logic, NumericConstants, Paren, Quotes, Reserved. I then realized that what I wanted to refer you to was most likely ?Quotes, where all the R-specific escape sequences are listed.)

grepping special characters in R

I have a variable named full.path, and I am checking whether the string it contains includes certain special characters or not.
In my code below, I am trying to grep for some special characters. Even though those characters are not in the string, the output I get is still TRUE.
Could someone explain and help? Thanks in advance.
full.path <- "/home/xyz"
#This returns TRUE :(
grepl("[?.,;:'-_+=()!##$%^&*|~`{}]", full.path)
By plugging this regex into https://regexr.com/ I was able to spot the issue: if you have - in the middle of a character class, you create a range. The range from ' (ASCII 39) to _ (ASCII 95) happens to include the digits, the uppercase letters and several punctuation characters, including /, which is why "/home/xyz" produces a spurious match.
To avoid this behaviour, you can put - first in the character class, which is how you signal you want to actually match - and not a range:
> grepl("[-?.,;:'_+=()!##$%^&*|~`{}]", full.path)
[1] FALSE
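As a quick sketch of the range behaviour (my own illustration, not part of the original answer):
grepl("['-_]", "/home/xyz")   # '-_ spans ASCII 39 to 95, which includes "/"
#[1] TRUE
grepl("[-'_]", "/home/xyz")   # a leading "-" is literal, so only ', - and _ are matched
#[1] FALSE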

Using grep to filter rows with two or more patterns in the string in R

I need to index all the rows that have a string beginning with either "B-" or "B^" in one of the columns. I tried a bunch of combinations, but I suspect it is not working because the "-" and "^" signs are also special characters in the grep pattern.
dataset[grep('^(B-|B^)[^B-|B^]*$', dataset$Col1),]
With the above script, rows beginning with "B^" are not being extracted. Please suggest a smart way to handle this.
You can escape the special characters with \\ in the grep pattern:
dataset[grep('^(B\\-|B\\^)[^B\\-|B\\^]*$', dataset$Col1),]
For further explanation, the ^ matches the beginning of a string as an anchor, so you have to escape it to match a literal ^ in the middle of the pattern. The [] form a character class, so [^B\\-|B\\^]* matches any character that is not a B, -, |, or ^. It is unnecessary here.
The simplified regex is:
dataset[grep('^(B-|B\\^)', dataset$Col1),]
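A quick sketch with made-up data (the column name Col1 is from the question; the values are my own illustration):
dataset <- data.frame(Col1 = c("B-123", "B^456", "BX789", "A-000"), stringsAsFactors = FALSE)
dataset[grep('^(B-|B\\^)', dataset$Col1), , drop = FALSE]
#   Col1
#1 B-123
#2 B^456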

Losing information when converting from character to numerical

I'm trying to convert characters like "9.230" to a numeric type.
First I erased the dots, because it was returning NA, and then I converted to numeric.
The problem is that when I convert to numeric I lose the trailing zero:
Example:
a<-9.230
as.numeric(gsub(".","",a,fixed=TRUE))
Returns: 923
Does anyone know how to avoid this?
You assigned the number 9.230, which is the same as 9.23. How is the system supposed to know that there was a trailing zero? If you want to transform a string, work with the string "9.230".
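For instance, a minimal sketch of working with the string directly (format() with nsmall is my suggestion here, not part of the original answer):
s <- "9.230"
as.numeric(s)                      # the numeric value cannot carry the trailing zero
#[1] 9.23
format(as.numeric(s), nsmall = 3)  # re-format as text with three decimals
#[1] "9.230"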
Look at the result of
a<-9.230
gsub(".","",a,fixed=TRUE)
#[1] "923"
The question will be: why? Because fixed = TRUE was used in the call to gsub(), the . is treated literally and replaced by the second argument of gsub(), which is "".
That is basically why as.numeric(gsub(".", "", a, fixed = TRUE)) results in 923.
There is another point: how was a <- 9.230 changed to character inside gsub()? This is explained in the R documentation for gsub():
Arguments: x, text
a character vector where matches are sought, or an object
which can be coerced by as.character to a character vector. Long
vectors are supported.
Final question: how to avoid such behavior?
Don't use gsub(). Use sprintf("%.3f", a).
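A quick sketch of the two behaviours side by side (a is the numeric value from the question):
a <- 9.230
as.numeric(gsub(".", "", a, fixed = TRUE))  # a is coerced to "9.23", then the dot is stripped
#[1] 923
sprintf("%.3f", a)                          # formats the number with three decimals
#[1] "9.230"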

How to use wildcard in gsub replacement

I have a column of strings, e.g.
strings <- c("SometextPO0001moretext", "SometextPO0008moretext")
The 'sometext' and 'moretext' portions are variable in length. I want to remove the PO000* portion of the strings, where * is a wildcard. I've tried
gsub("PO000*", "", strings)
and Googled quite a bit but surprisingly haven't found an answer to this seemingly simple question. Since the last character varies, I would like to be able to do the removal this way vs. hard-coding a large number of variants. Any help would be appreciated!
For a single wildcard, you need to use a . instead. The * you used means "repeat the preceding character 0 or more times", and the preceding character here was 0.
gsub("PO000.", "", strings) would remove both PO0001 and PO0008
I think it should be gsub("PO000\\d{1}", "", strings), where \\d{1} matches exactly one digit.
And the result is :
[1] "Sometextmoretext" "Sometextmoretext"
