I don't understand the meaning of quote="" or quote="\"'" in the count.fields function. Can someone please explain the use of the quote argument and the difference between these two values?
Consider the text file
one two
'three four'
"file six"
seven "eight nine"
which we can create with
lines <- c(
"one two",
"'three four'",
"\"file six\"",
"seven \"eight nine\"")
writeLines(lines, "test.txt")
The quote= parameter tells R which characters can start and end quoted values within the file. We can ignore quotes altogether by setting quote="". Doing that, we see
count.fields("test.txt", quote="")
# [1] 2 2 2 3
so it's interpreting the spaces as starting new fields, and each word is its own field. This might be useful if you have fields that contain quote characters for something other than delimiting strings, such as last names like O'Brian and measurements like 5'6". If we say that only double quotes delimit string values, we get
count.fields("test.txt", quote="\"")
# [1] 2 2 1 2
So here the first two lines are the same but line 3 is considered to have just one value. The space between the quotes does not start a new field.
The default is to use either double quotes or single quotes which gives
count.fields("test.txt")
# [1] 2 1 1 2
So now the second line is treated like the third line, as having just one value.
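To compare the three settings side by side, here is a minimal sketch. It recreates the example file in a temporary location and runs count.fields once per quote value; the temporary path and the names none, double_only and both are my own labels, not anything count.fields requires.

```r
# Recreate the example file in a temporary location
lines <- c(
  "one two",
  "'three four'",
  "\"file six\"",
  "seven \"eight nine\"")
path <- tempfile(fileext = ".txt")
writeLines(lines, path)

# One count.fields() call per quote setting (the labels are illustrative)
quote_sets <- c(none = "", double_only = "\"", both = "\"'")
counts <- lapply(quote_sets, function(q) count.fields(path, quote = q))
counts
# $none        2 2 2 3
# $double_only 2 2 1 2
# $both        2 1 1 2
```

The last entry uses the same value as the default, so it should agree with calling count.fields(path) with no quote argument.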
cat is often a good way to show what you are dealing with when you have quotes inside quotes.
> cat("Nothing:", "", "\n")
Nothing:
> cat("Something:", "\"'", "\n")
Something: "'
The first example, quote="", specifies that nothing in the file should be treated as a quote character.
The second example, quote="\"'", specifies that both " and ' are potential quoting characters.
The \ backslash is used to 'escape' the following character so \" is treated literally as " instead of closing off the argument to quote= prematurely.
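A quick way to convince yourself that the backslash is only part of how the string is typed, not part of the string itself, is to inspect the value directly. This is a small sketch; the variable name q is just for illustration.

```r
q <- "\"'"

nchar(q)              # 2: the backslash is not stored in the string
strsplit(q, "")[[1]]  # the two individual characters, " and '
identical(q, '"\'')   # TRUE: the same string typed with single-quote delimiters
```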
Related
How to select the number in a text?
I want to convert English number words in a text into digits. For example, in the text "……one telephone……", I want to change the English number "one" into "1", but I do not want to change "telephone" into "teleph1".
Is it right to select the number only when it has a space before and after it? How can I do that?
To avoid replacing parts of other words into numbers you can include word boundaries in the search pattern. Some functions have a dedicated option for this but generally you can just use the special character \\b to indicate a word boundary as long as the function supports regular expressions.
For example, \\bone\\b will only match "one" if it is not part of another word. That way you can apply it to your character string "……one telephone……" without having to rely on spaces as delimiter between words.
With the stringr package (part of the Tidyverse), the replacement might look like this:
# define test string
x <- "……one telephone……"
# define dictionary for replacements with \\b indicating word boundaries
dict <- c(
"\\bone\\b" = "1",
"\\btwo\\b" = "2",
"\\bthree\\b" = "3"
)
# replace matches in x
stringr::str_replace_all(x, dict)
#> [1] "……1 telephone……"
Created on 2022-11-11 with reprex v2.0.2
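If you would rather not depend on stringr, the same word-boundary idea can be sketched in base R by looping over a small dictionary. The test sentence and the dictionary below are made up for illustration, and perl = TRUE is my choice here, not a requirement.

```r
# Hypothetical test string: only the standalone words should change
x <- "one telephone, two phones, someone"

# Hypothetical dictionary: names are the words to find, values the digits
dict <- c(one = "1", two = "2", three = "3")

# Wrap each word in \\b ... \\b so that matches inside longer words are skipped
for (word in names(dict)) {
  x <- gsub(paste0("\\b", word, "\\b"), dict[[word]], x, perl = TRUE)
}
x
# [1] "1 telephone, 2 phones, someone"
```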
Here is one way:
gsub(" one ", " 1 ", ".. one telephone ..")
You may need more rules than "space before and space after" (e.g. punctuation). Here is an example that handles a blank or a punctuation character before "one":
gsub("([[:punct:]]|[[:blank:]])one ", "\\11 ", "..one telephone ..")
You can do something similar after "one".
The \\1 in the second argument refers to whatever is matched inside ( ... ) in the first argument; the 1 that follows it is the literal replacement digit.
Check the documentation of gsub to learn more about regular expressions.
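As a sketch of the "do something similar after" idea, the pattern below captures a delimiter on both sides of "one" and puts both back around the digit. The test string is made up, and allowing the start or end of the string as a delimiter is my own addition.

```r
x <- "..one telephone, one."

# Group 1: start of string, or a punctuation/blank character before "one"
# Group 2: a punctuation/blank character after "one", or end of string
pattern <- "(^|[[:punct:][:blank:]])one([[:punct:][:blank:]]|$)"

# \\1 and \\2 re-insert the captured delimiters around the literal 1
res <- gsub(pattern, "\\11\\2", x)
res
# [1] "..1 telephone, 1."
```

One limitation of this approach: because each match consumes its delimiters, two occurrences separated by a single space ("one one") would not both be replaced. The \\b word-boundary approach avoids that.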
I have the following dataframe:
df<-c("red apples,(golden,red delicious),bananas,(cavendish,lady finger),golden pears","yellow pineapples,red tomatoes,(roma,vine),orange carrots")
I want to remove the word preceding a comma and parentheses so my output would yield:
[1] "golden,red delicious),cavendish,lady finger),golden pears" "yellow pineapples,roma,vine),orange carrots"
Ideally, the right parenthesis would be removed as well. But I can manage that delete with gsub.
I feel like a lookbehind might work but can't seem to code it correctly.
Thanks!
edit: I amended the dataframe so that the word I want deleted is a string of two words.
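We can use base R with gsub to remove the characters. We match a word (\\w+), followed by one or more spaces (\\s+), followed by another word (\\w+), a comma (,) and an opening parenthesis (\\(), and replace the match with an empty string (""). Note that this pattern only matches two-word items, so the single word "bananas" is left in place:
gsub("\\w+\\s+\\w+,\\(", "", df)
#[1] "golden,red delicious),bananas,(cavendish,lady finger),golden pears"
#[2] "yellow pineapples,roma,vine),orange carrots"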
Or, if the comma is what delimits the items, we can build the pattern from characters that are not a comma, which also handles single-word items such as "bananas":
gsub("[^,]+,\\(", "", df)
#[1] "golden,red delicious),cavendish,lady finger),golden pears"
#[2] "yellow pineapples,roma,vine),orange carrots"
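Since the question also asks about dropping the right parentheses, here is a sketch that chains a second gsub onto the comma-based pattern. The intermediate names step1 and result are only for illustration.

```r
df <- c("red apples,(golden,red delicious),bananas,(cavendish,lady finger),golden pears",
        "yellow pineapples,red tomatoes,(roma,vine),orange carrots")

# Step 1: drop each item that directly precedes ",("
step1 <- gsub("[^,]+,\\(", "", df)

# Step 2: drop the remaining right parentheses (fixed = TRUE matches ")" literally)
result <- gsub(")", "", step1, fixed = TRUE)
result
#[1] "golden,red delicious,cavendish,lady finger,golden pears"
#[2] "yellow pineapples,roma,vine,orange carrots"
```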
Using the tidyverse package stringr, I was able to make your data appear the way you'd want it with two function calls separated by a pipe. The pipe comes from the package magrittr which loads with dplyr and/or tidyverse.
I used stringr::str_replace_all to perform two substitutions which remove the words you wanted to take out. Note the syntax for multiple substitutions within this function.
str_replace_all(string, c("first string to get rid of" = "string to replace it with", "second string to get rid of" = "second replacement string"))
You might find it more intuitive to combine all the "get rid of strings" first followed by combining the replacement strings, but each element within the c() is the string to be replaced (in quotes) connected to its replacement (also in quotes) with "=". Each of those replaced=replacement pairs is separated by a comma.
Using str_replace_all, I first took out all text which starts with "," and ends with ",(" using the regular expression ",[a-z ]+,\\(", which matches a comma, followed by one or more lowercase letters and spaces (allowing chunks with multiple words to be detected), followed by ",(". Note the escape for the "(". If you thought there might be capital letters you would use [a-zA-Z ] instead. In either case, note the space before the "]".
Because you wanted to take out the word, but not the comma preceding it, I replaced the removed text with ",".
This doesn't remove "red apples" in the first string because it doesn't follow a comma. The expression "^[a-z ]+,\\(" refers to any number of lowercase letters and spaces coming before ",(" at the beginning of the string (the ^ "anchors" your pattern to the beginning of the string). Therefore it removes "red apples" or any other example where the text you want to remove starts the string. For these cases, it makes sense to replace it with nothing ("") because you want the first character of the remaining string to appear at the beginning.
Together, the two substitutions remove the offending text whether it starts the string or is in the middle of it or ends it so in that sense it's more or less generalized.
str_remove_all("\\)") removes the right parentheses throughout
library(stringr)
library(magrittr)
df <- c("red apples,(golden,red delicious),bananas,(cavendish,lady finger),golden pears",
        "yellow pineapples,red tomatoes,(roma,vine),orange carrots")
str_replace_all(df, c(",[a-z ]+,\\(" = ",",
"^[a-z ]+,\\(" = "")) %>%
str_remove_all("\\)")
[1] "golden,red delicious,cavendish,lady finger,golden pears"
[2] "yellow pineapples,roma,vine,orange carrots"
I have two data frames containing the same information. The first contains a unique identifier. I would like to use dplyr::inner_join to match by title.
Unfortunately, one of the data frames uses {"} to signify a quote and the other simply uses a single quote.
For example, I would like to match the two titles shown below.
The {"}Level of Readiness{"} for HCV treatment
The 'Level of Readiness' for HCV treatment
You can turn them into single quotes using gsub, but you need to enclose {"} with single quotes and ' with double quotes. Note that fixed = TRUE treats '{"}' as a literal string instead of a regular expression:
gsub('{"}', "'", 'The {"}Level of Readiness{"} for HCV treatment', fixed = TRUE)
# [1] "The 'Level of Readiness' for HCV treatment"
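To show how the cleaned titles then line up for a join, here is a sketch with two made-up one-row data frames. The column names id, title and score are hypothetical, and it uses base merge() instead of dplyr::inner_join so that it runs without extra packages; after the same gsub step, inner_join(a, b, by = "title") should behave the same way.

```r
# Hypothetical data: `a` holds the identifier and uses {"}, `b` uses single quotes
a <- data.frame(id = 1L,
                title = 'The {"}Level of Readiness{"} for HCV treatment',
                stringsAsFactors = FALSE)
b <- data.frame(title = "The 'Level of Readiness' for HCV treatment",
                score = 0.5,
                stringsAsFactors = FALSE)

# Normalise the quoting in `a` so both title columns use single quotes
a$title <- gsub('{"}', "'", a$title, fixed = TRUE)

# Inner join on the cleaned title (base R equivalent of an inner join)
joined <- merge(a, b, by = "title")
joined
```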
I have this result with my data after a write.csv in R:
Last_Name,Sales,Country,Quarter
Smith,$16,753.00 ,UK,Qtr 3
Johnson,$14,808.00 ,USA,Qtr 4
Williams,$10,644.00 ,UK,Qtr 2
and I want this result (which is the original format of my data):
Last_Name,Sales,Country,Quarter
Smith,"$16,753.00 ",UK,Qtr 3
Johnson,"$14,808.00 ",USA,Qtr 4
Williams,"$10,644.00 ",UK,Qtr 2
because obviously I have some problems with amounts!
but I don't want:
"Last_Name","Sales","Country","Quarter"
"Smith","$16,753.00 ","UK","Qtr 3"
"Johnson","$14,808.00 ","USA","Qtr 4"
"Williams","$10,644.00 ","UK","Qtr 2"
Any ideas?
Try quoting the sales column by itself:
df$Sales <- paste0("\"", df$Sales, "\"")
Then call write.csv with quote = FALSE. Alternatively, you can specify which columns to quote in your call to write.csv:
write.csv(file="out.csv", df, quote=c(2))
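Here is a sketch that tries both suggestions on a small data frame rebuilt from the question. The temporary file paths and the names df_a, path_a and path_b are my own, and row.names = FALSE is added on the assumption that you do not want a row-name column in the file.

```r
df <- data.frame(
  Last_Name = c("Smith", "Johnson", "Williams"),
  Sales     = c("$16,753.00 ", "$14,808.00 ", "$10,644.00 "),
  Country   = c("UK", "USA", "UK"),
  Quarter   = c("Qtr 3", "Qtr 4", "Qtr 2"),
  stringsAsFactors = FALSE)

# Option A: add the quotes to Sales by hand, then write with no quoting at all
df_a <- df
df_a$Sales <- paste0("\"", df_a$Sales, "\"")
path_a <- tempfile(fileext = ".csv")
write.csv(df_a, path_a, quote = FALSE, row.names = FALSE)
lines_a <- readLines(path_a)
lines_a[1:2]
# [1] "Last_Name,Sales,Country,Quarter"  "Smith,\"$16,753.00 \",UK,Qtr 3"

# Option B: let write.csv quote only column 2
path_b <- tempfile(fileext = ".csv")
write.csv(df, path_b, quote = 2, row.names = FALSE)
lines_b <- readLines(path_b)
lines_b[2]
# [1] "Smith,\"$16,753.00 \",UK,Qtr 3"
```

Note that, according to ?write.table, column names are still quoted when quote is a numeric vector, so Option B quotes the header line; only Option A reproduces the requested file exactly.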
This is the default behavior of data.table::fwrite, which only quotes columns as needed (in your case, to disambiguate the internal comma of the Sales field):
library(data.table)
fwrite(df)
# Last_Name,Sales,Country,Quarter
# Smith,"$16,753.00 ",UK,Qtr 3
# Johnson,"$14,808.00 ",USA,Qtr 4
# Williams,"$10,644.00 ",UK,Qtr 2
I'm just writing to stdout for convenience; of course you can specify an output file as the second argument (file). You can also control this behavior with the quote argument; the default, "auto", works as follows (from ?fwrite):
When "auto", character fields, factor fields and column names will only be surrounded by double quotes when they need to be; i.e., when the field contains the separator sep, a line ending \n, the double quote itself or (when list columns are present) sep2[2] (see sep2 below).
As a bonus, of course, fwrite is way (can be upwards of 100x) faster than write.csv.
I'm trying to paste a series of string of characters like this:
paste0("//*[#id=",'"set_',1,'_div"]/a')
[1] "//*[#id=\"set_1_div\"]/a"
How can I get rid of the "\"? This is my expected outcome
[1] "//*[#id="set_1_div"]/a"
Thanks a lot
The backslash designates that the next character is 'escaped', i.e., it should be interpreted not as part of the syntax but as a literal character. When a character string is printed, it is shown in quotes, so the escape sign (backslash) is included in the display. However, using cat you can easily see that the backslashes are not actually part of the character string:
> x <- paste0("//*[#id=",'"set_',1,'_div"]/a')
> x
[1] "//*[#id=\"set_1_div\"]/a"
> cat(x)
//*[#id="set_1_div"]/a
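A few other ways to check this, as a small sketch: search the string for a literal backslash, or display it without the surrounding quotes.

```r
x <- paste0("//*[#id=", '"set_', 1, '_div"]/a')

# fixed = TRUE searches for a literal backslash; none is found
has_backslash <- grepl("\\", x, fixed = TRUE)
has_backslash
# [1] FALSE

writeLines(x)            # raw characters, like cat(x) followed by a newline
print(x, quote = FALSE)  # the print method without quoting or escaping
```

So the string can be passed on as it is; the backslashes only appear in the quoted display.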