I want to escape single quotation marks in a vector (for example to execute a SQL query), but I'm unable to do so.
Example code:
strings <- c("ef'gh")
strings2 <- gsub("'", "\'", strings)
cat(strings2)
print(strings2)
Expected output of either cat or print:
ef\'gh
"ef\'gh"
But none of the above. I tried multiple other combinations already, with different ways of escaping quotation, without success.
You may use the following code:
strings <- c("ef'gh")
strings2 <- sub("\'", "\\\\'", strings)
cat(strings2)
[1] ef\'gh
print(strings2)
[1] "ef\\'gh"
Related
I just learnt R and was trying to clean data for analysis using R using string manipulation using the code given below for Amount_USD column of a table. I could not find why changes were not made. Please help.
Code:
csv_file2$Amount_USD <- ifelse(str_sub(csv_file$Amount_USD,1,10) == "\\\xc2\\\xa0",
str_sub(csv_file$Amount_USD,12,-1),csv_file2$Amount_USD)
Result:
\\xc2\\xa010,000,000
\\xc2\\xa016,200,000
\\xc2\\xa019,350,000
Expected Result:
10,000,000
16,200,000
19,350,000
You could use the following code, but maybe there is a more compact way:
vec <- c("\\xc2\\xa010,000,000", "\\xc2\\xa016,200,000", "\\xc2\\xa019,350,000")
gsub("(\\\\x[[:alpha:]]\\d\\\\x[[:alpha:]]0)([d,]*)", "\\2", vec)
[1] "10,000,000" "16,200,000" "19,350,000"
A compact way to extract the numbers is by using str_extract and negative lookahead:
library(stringr)
str_extract(vec, "(?!0)[\\d,]+$")
[1] "10,000,000" "16,200,000" "19,350,000"
How this works:
(?!0): this is negative lookahead to make sure that the next character is not 0
[\\d,]+$: a character class allowing only digits and commas to occur one or more times right up to the string end $
Alternatively:
str_sub(vec, start = 9)
There were a few minor issues with your code.
The main one being two unneeded backslashes in your matching statement. This also leads to a counting error in your first str_sub(), where you should be getting the first 8 characters not 10. Finally, you should be getting the substring from the next character after the text you want to match (i.e. position 9, not 12). The following should work:
csv_file2$Amount_USD <- ifelse(str_sub(csv_file$Amount_USD,1,8) == "\\xc2\\xa0", str_sub(csv_file$Amount_USD,9,-1),csv_file2$Amount_USD)
However, I would have done this with a more compact gsub than provided above. As long as the text at the start to remove is always going to be "\\xc2\\xa0", you can simply replace it with nothing. Note that for gsub you will need to escape all the backslashes, and hence you end up with:
csv_file2$Amount_USD <- gsub("\\\\xc2\\\\xa0", replacement = "", csv_file2$Amount_USD)
Personally, especially if you plan to do any sort of mathematics with this column, I would go the additional step and remove the commas, and then coerce the column to be numeric:
csv_file2$Amount_USD <- as.numeric(gsub("(\\\\xc2\\\\xa0)|,", replacement = "", csv_file2$Amount_USD))
In R, I am having trouble replacing a substring that has punctuation. Ie within the string "r.Export", I am trying to replace "r." with "Report.". I've used gsub and below is my code:
string <- "r.Export"
short <- "r."
replacement <- "Report."
gsub(short,replacement,string)
The desired output is: "Report.Export" however gsub seems to replace the second r such that the output is:
Report.ExpoReport.
Using sub() instead is not a solution either because I am doing multiple gsubs where sometimes the string to be replaced is:
short <- "o."
So, then the o's in r.Export are replaced anyway and it becomes a complete mess.
string <- "r.Export"
short <- "r\\."
replacement <- "Report."
gsub(short,replacement,string)
Returns:
[1] "Report.Export"
Or, using fixed=TRUE:
string <- "r.Export"
short <- "r."
replacement <- "Report."
gsub(short,replacement,string, fixed=TRUE)
Returns:
[1] "Report.Export"
Explanation: Without the fixed=TRUE argument, gsub expects a regular expression as first argument. And with regular expressions . is a placeholder for 'any character'. If you want the literal . (period) you have to use either \\. (i.e. escaping the period) or the aforementioned argument fixed=TRUE
Since you have characters in your pattern (.) which has a special meaning in regex use fixed = TRUE which matches the string as is.
gsub(short,replacement,string, fixed = TRUE)
#[1] "Report.Export"
I might actually add word boundaries and lookaheads to the mix here, to ensure as targeted a match as possible:
string <- "r.Export"
replacement <- "Report."
output <- gsub("\\br\\.(?=\\w)", replacement, string, perl=TRUE)
output
[1] "Report.Export"
This approach ensures that we only match r. when the r is preceded by whitespace or is the start of the string, and also when what follows the dot is another word. Consider the sentence The project r.Export needed a programmer. We wouldn't want to replace the final r. in this case.
We can use sub
sub(short,replacement,string, fixed = TRUE)
#[1] "Report.Export"
How to convert the following string in R :
this_isastring_12(=32)
so that only the following is kept
isastring_12
Eg
f('this_isastring_12(=32)') returns 'isastring_12'
This should work on other strings with a similar structure, but different characters
Another example with a different string of similar structure
f('something_here_3(=1)') returns 'here_3'
We can use sub to extract everything from first underscore to opening round bracket in the text.
sub(".*?_(.*)\\(.*", "\\1", x)
#[1] "isastring_12" "here_3" "string_4"
where x is
x <- c("this_isastring_12(=32)", "something_here_3(=1)", "another_string_4(=1)")
You could use the package unglue.
Borrowing Ronak's data :
x <- c("this_isastring_12(=32)", "something_here_3(=1)", "another_string_4(=1)")
library(unglue)
unglue_vec(x, "{=.*?}_{res}({=.*?})")
#> [1] "isastring_12" "here_3" "string_4"
{=.*?} matches anything until what's next is matched, but doesn't extract anything because there's no lhs to the equality
{res}, where the name res could be replaced by anything, matches anything, and extracts it
outside of curly braces, no need to escape characters
unglue_vec() returns an atomic vector of the matches
I am interested to assign names to list elements. To do so I execute the following code:
file_names <- gsub("\\..*", "", doc_csv_names)
print(file_names)
"201409" "201412" "201504" "201507" "201510" "201511" "201604" "201707"
names(docs_data) <- file_names
In this case the name of the list element appears with ``.
docs_data$`201409`
However, in this case the name of the list element appears in the following way:
names(docs_data) <- paste("name", 1:8, sep = "")
docs_data$name1
How can I convert the gsub() result to receive the latter naming pattern without quotes?
gsub() and paste () seem to produce the same class () object. What is the difference?
Both gsub and paste return character objects. They are different because they are completely different functions, which you seem to know based on their usage (gsub replaces instances of your pattern with a desired output in a string of characters, while paste just... pastes).
As for why you get the quotations, that has nothing to do with gsub and everything to do with the fact that you are naming variables/columns with numbers. Indeed, try
names(docs_data) <- paste(1:8)
and you'll realize you have the same problem when invoking the naming pattern. It basically has to do with the fact that R doesn't want to be confused about whether a number is really a number or a variable because that would be chaos (how can 1 refer to a variable and also the number 1?), so what it does in such cases is change a number 1 into the character "1", which can be given names. For example, note that
> 1 <- 3
Error in 1 <- 3 : invalid (do_set) left-hand side to assignment
> "1" <- 3 #no problem!
So R is basically correcting that for you! This is not a problem when you name something using characters. Finally, an easy fix: just add a character in front of the numbers of your naming pattern, and you'll be able to invoke them without the quotations. For example:
file_names <- paste("file_",gsub("\\..*", "", doc_csv_names),sep="")
Should do the trick (or just change the "file_" into whatever you want as long as it's not empty, cause then you just have numbers left and the same problem)!
Thanks for grep using a character vector with multiple patterns, I figured out my own problem as well.
The question here was how to find multiple values by using grep function,
and the solution was either these:
grep("A1| A9 | A6")
or
toMatch <- c("A1", "A9", "A6")
matches <- unique (grep(paste(toMatch,collapse="|")
So I used the second suggestion since I had MANY values to search for.
But I'm curious why c() or for loop doesn't work out instead of |.
Before I researched the possible solution in stackoverflow and found recommendations above, I tried out two alternatives that I'll demonstrate below:
First, what I've written in R was something like this:
find.explore.l<-lapply(text.words.bl ,function(m) grep("^explor",m))
But then I had to 'grep' many words, so I tried out this
find.explore.l<-lapply(text.words.bl ,function(m) grep(c("A1","A2","A3"),m))
It didn't work, so I tried another one(XXX is the list of words that I'm supposed to find in the text)
for (i in XXX){
find.explore.l<-lapply(text.words.bl ,function(m) grep("XXX[i]"),m))
.......(more lines to append lines etc)
}
and it seemed like R tried to match XXX[i] itself, not the words inside.
Why can't c() and for loop for grep return right results?
Someone please let me know! I'm so curious :P
From the documentation for the pattern= argument in the grep() function:
Character string containing a regular expression (or character string for fixed = TRUE) to be matched in the given character vector. Coerced by as.character to a character string if possible. If a character vector of length 2 or more is supplied, the first element is used with a warning. Missing values are allowed except for regexpr and gregexpr.
This confirms that, as #nrussell said in a comment, grep() is not vectorized over the pattern argument. Because of this, c() won't work for a list of regular expressions.
You could, however, use a loop, you just have to modify your syntax.
toMatch <- c("A1", "A9", "A6")
# Loop over values to match
for (i in toMatch) {
grep(i, text)
}
Using "XXX[i]" as your pattern doesn't work because it's interpreting that as a regular expression. That is, it will match exactly XXXi. To reference an element of a vector of regular expressions, you would simply use XXX[i] (note the lack of surrounding quotes).
You can apply() this, but in a slightly different way than you had done. You apply it to each regex in the list, rather than each text string.
lapply(toMatch, function(rgx, text) grep(rgx, text), text = text)
However, the best approach would be, as you already have in your post, to use
matches <- unique(grep(paste(toMatch, collapse = "|"), text))
Consider that:
XXX <- c("a", "b", "XXX[i]")
grep("XXX[i]", XXX, value=T)
character(0)
grep("XXX\\[i\\]", XXX, value=T)
[1] "XXX[i]"
What is R doing? It is using special rules for the first argument of grep. The brackets are considered special characters ([ and ]). I put in two backslashes to tell R to consider them regular brackets. And imgaine what would happen if I put that last expression into a for loop? It wouldn't do what I expected.
If you would like a for loop that goes through a character vector of possible matches, take out the quotes in the grep function.
#if you want the match returned
matches <- c("a", "b")
for (i in matches) print(grep(i, XXX, value=T))
[1] "a"
[1] "b"
#if you want the vector location of the match
for (i in matches) print(grep(i, XXX))
[1] 1
[1] 2
As the comments point out, grep(c("A1","A2","A3"),m)) is violating the grep required syntax.