Match text strings containing quotation marks which are encoded differently - r

I have two data frames containing the same information. The first contains a unique identifier. I would like to use dplyr::inner_join to match by title.
Unfortunately, one of the data frames uses {"} to signify a quote and the other simply uses a single quote.
For example, I would like to match the two titles shown below.
The {"}Level of Readiness{"} for HCV treatment
The 'Level of Readiness' for HCV treatment

You can turn them into single quotes using gsub, but you need to enclose {"} with single quotes and ' with double quotes. Note that fixed = TRUE treats '{"}' as a literal string instead of a regular expression:
gsub('{"}', "'", 'The {"}Level of Readiness{"} for HCV treatment', fixed = TRUE)
# [1] "The 'Level of Readiness' for HCV treatment"
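To use this for the join itself, one option is to normalise the title column before matching. The sketch below uses base R's merge() as a stand-in for dplyr::inner_join so that it runs without extra packages; the data frames and their column names (id, title, value) are made up for illustration and are not from the question.

```r
# Hypothetical example data; the column names are assumptions for illustration
df_ids <- data.frame(
  id = 1:2,
  title = c('The {"}Level of Readiness{"} for HCV treatment', "Another title"),
  stringsAsFactors = FALSE
)
df_other <- data.frame(
  title = c("The 'Level of Readiness' for HCV treatment", "Another title"),
  value = c(10, 20),
  stringsAsFactors = FALSE
)

# Normalise the {"} markers to single quotes, then join on the cleaned title
df_ids$title <- gsub('{"}', "'", df_ids$title, fixed = TRUE)
matched <- merge(df_ids, df_other, by = "title")  # inner join by default
matched
```

With dplyr loaded, the last step would presumably become inner_join(df_ids, df_other, by = "title") once the titles have been cleaned in the same way.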

Related

Add a character to each word within a sentence in R

I have sentences in R (they represent column names of an SQL query), like the following:
sample_sentence <- "CITY, AGE,NAME,LAST_NAME, COUNTRY"
I would need to add a character(s) like "k." in front of every word of the sentence. Notice how sometimes words within the sentence may be separated by a comma and a space, but sometimes just by a comma.
The desired output would be:
new_sentence <- "k.CITY, k.AGE,k.NAME,k.LAST_NAME, k.COUNTRY"
I would prefer to achieve this without using a for loop. I saw this post, "Add a character to the start of every word", but there they work with a vector and I can't figure out how to apply that code to my example.
Thanks
sample_sentence <- "CITY, AGE,NAME,LAST_NAME, COUNTRY"
gsub(pattern = "(\\w+)", replacement = "k.\\1", sample_sentence)
# [1] "k.CITY, k.AGE,k.NAME,k.LAST_NAME, k.COUNTRY"
Explanation: in a regex, \\w+ matches one or more "word" characters (letters, digits, and underscore), and wrapping it in () makes it a capturing group. gsub replaces every match with k.\\1, where \\1 stands for the text captured by the first set of ().
A possible solution:
sample_sentence <- "CITY, AGE,NAME,LAST_NAME, COUNTRY"
paste0("k.", gsub("(,\\s*)", "\\1k\\.", sample_sentence))
#> [1] "k.CITY, k.AGE,k.NAME,k.LAST_NAME, k.COUNTRY"
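The two answers agree on the sample input but differ on other inputs. The sketch below checks both against the desired output and then tries a made-up column list ("t.CITY,AGE", not from the question) in which a name already contains a dot, to show where they diverge.

```r
sample_sentence <- "CITY, AGE,NAME,LAST_NAME, COUNTRY"
expected <- "k.CITY, k.AGE,k.NAME,k.LAST_NAME, k.COUNTRY"

# Approach 1: prefix every run of word characters
by_word <- gsub("(\\w+)", "k.\\1", sample_sentence)
# Approach 2: prefix the start of the string and whatever follows each comma
by_comma <- paste0("k.", gsub("(,\\s*)", "\\1k.", sample_sentence))

identical(by_word, expected)   # TRUE
identical(by_comma, expected)  # TRUE

# Hypothetical input where a name already contains a dot
qualified <- "t.CITY,AGE"
word_result  <- gsub("(\\w+)", "k.\\1", qualified)
comma_result <- paste0("k.", gsub("(,\\s*)", "\\1k.", qualified))
word_result   # "k.t.k.CITY,k.AGE": the dot splits the name into two words
comma_result  # "k.t.CITY,k.AGE": only comma-separated items get the prefix
```

Which behaviour is preferable depends on whether the names can contain characters outside \\w.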

How to extract a specific number of characters before and after a keyword in R?

I have a dataframe (data) with a column containing text from reports (data$Report_Text). I need to extract 40 characters before and after a keyword (including the keyword) for each row and store as a new column in the dataframe.
So far I have this for the characters before (ideally would like to store the text before + after in one column, but if that isn't possible I can do two columns):
data$characters <- sub('.*?(\\d{40}) keyword', "", data$Report_Text)
However when I run this, it gives me all of the text before the keyword, not just 40 characters. Where am I going wrong?
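data$characters <- gsub("^.*(.{40}keyword.{40}).*$", "\\1", data$Report_Text)
You can replace the . before {40} with \\d (digits only) or another character class of your choice. Because .{40} requires exactly 40 characters, rows with fewer than 40 characters on either side of the keyword do not match and are returned unchanged.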
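If some reports have fewer than 40 characters around the keyword, a bounded quantifier such as .{0,40} is one way to cope. The following is only a sketch: extract_context is a hypothetical helper name, the keyword is treated as a regular expression, and rows without the keyword are assumed to become NA.

```r
# Hypothetical helper: up to `width` characters on each side of the first
# occurrence of `keyword`; NA where the keyword does not occur
extract_context <- function(text, keyword, width = 40) {
  pattern <- paste0(".{0,", width, "}", keyword, ".{0,", width, "}")
  m <- regexpr(pattern, text, perl = TRUE)
  out <- rep(NA_character_, length(text))
  out[m != -1] <- regmatches(text, m)  # regmatches returns matched rows only
  out
}

# Made-up report text for illustration
long_report <- paste0(strrep("a", 50), "keyword", strrep("b", 50))
reports <- c(long_report, "xx keyword yy", "nothing relevant")
context <- extract_context(reports, "keyword")
nchar(context[1])  # 87: 40 characters + the 7-character keyword + 40 characters
context[2]         # "xx keyword yy"
context[3]         # NA
```

Applied to the question's data this would be something like data$characters <- extract_context(data$Report_Text, "keyword"). Note that . does not match a newline under perl = TRUE, so multi-line reports would need the pattern adjusted.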

How to split strings of different lengths followed by numbers in R?

This is on request to ask a new question based on a previous one I asked here: I have a list of compounds like the following:
Ag0.05Zr1.0
Al0.11W1.0
Al0.18Cr1.0
AlFe
AlFe0.2NiCuCoCr
AlFe0.2NiCuCoCz
AlFeNi
AlFeNiCo
AlFeNiCrCo
AlFeNiCrCoCu0.2
AlFeNiCu0.2CoCr
Cr1.0Mo0.33
U0.33Zr1.0
V0.33W1.0
V1.0W1.0
I need to split the compounds and place an underscore between element names. An element name is either a single capital letter or one capital letter followed by a lowercase letter. So far I have managed to place a '_' between two-letter element names. However, whenever a single-letter element is involved, I get the following:
"Element V1.0 in compound Co_1.0_Cu_1.0_Fe_1.0_5_Nb_0.5_6_Ni_1.0 entered does not exist!". (Error message generated by the code) However I want it to be the following:
"V_1.0_Co_1.0_Cu_1.0_Fe_1.05_Nb_0.56_Ni_1.0"
So it is not only grouping single letters with numbers, it is also failing to identify all numbers grouped together. Can anyone please help? I used the following code to achieve this:
elem = gsub("(?<=[a-z0-9])(?=[0-9A-Z])", "_", elem, perl = TRUE)
where elem is my list of compounds.
The numbers are basically element fractions, So the error message should read like:
Error: Element 'Cz' (or 'Z' a single letter element which does not exist in the periodic table), " in compound xyz entered does not exist!
You can combine several alternative patterns in a single gsub call using |.
# data
elem <- c("Ag0.05Zr1.0", "Al0.11W1.0", "Al0.18Cr1.0", "AlFe", "AlFe0.2NiCuCoCr",
"AlFe0.2NiCuCoCz", "AlFeNi", "AlFeNiCo", "AlFeNiCrCo", "AlFeNiCrCoCu0.2",
"AlFeNiCu0.2CoCr", "Cr1.0Mo0.33", "U0.33Zr1.0", "V0.33W1.0", "V1.0W1.0")
# solution
elem <- gsub("(?<=[a-z0-9])(?=[A-Z])|(?<=[a-z])(?=[0-9])|(?<=[A-Z])(?=[0-9])",
"_", elem, perl = TRUE)
elem
"Ag_0.05_Zr_1.0"
"Al_0.11_W_1.0"
"Al_0.18_Cr_1.0"
"Al_Fe"
"Al_Fe_0.2_Ni_Cu_Co_Cr"
"Al_Fe_0.2_Ni_Cu_Co_Cz"
"Al_Fe_Ni"
"Al_Fe_Ni_Co"
"Al_Fe_Ni_Cr_Co"
"Al_Fe_Ni_Cr_Co_Cu_0.2"
"Al_Fe_Ni_Cu_0.2_Co_Cr"
"Cr_1.0_Mo_0.33"
"U_0.33_Zr_1.0"
"V_0.33_W_1.0"
"V_1.0_W_1.0"
I made two changes to the gsub in the original post:
1. I split the pattern (?<=[a-z0-9])(?=[0-9A-Z]) into two alternatives, (?<=[a-z0-9])(?=[A-Z])|(?<=[a-z])(?=[0-9]), because the original pattern also inserted "_" between consecutive digits (e.g., "Cr_1.0_Mo_0.3_3").
2. I added a third alternative, (?<=[A-Z])(?=[0-9]), to insert "_" between a capital letter and a number.
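To see the effect of the two changes side by side, the sketch below runs the original and the revised pattern on two of the compounds and then splits the result on "_". The variable names are chosen here for illustration.

```r
original_pattern <- "(?<=[a-z0-9])(?=[0-9A-Z])"
revised_pattern  <- "(?<=[a-z0-9])(?=[A-Z])|(?<=[a-z])(?=[0-9])|(?<=[A-Z])(?=[0-9])"

compounds <- c("Cr1.0Mo0.33", "V1.0W1.0")

before <- gsub(original_pattern, "_", compounds, perl = TRUE)
after  <- gsub(revised_pattern,  "_", compounds, perl = TRUE)

before  # "Cr_1.0_Mo_0.3_3" "V1.0_W1.0"
after   # "Cr_1.0_Mo_0.33"  "V_1.0_W_1.0"

# Splitting on the underscore gives element names and fractions as tokens
tokens <- strsplit(after, "_", fixed = TRUE)
tokens[[2]]  # "V" "1.0" "W" "1.0"
```

The original pattern both splits "0.33" and leaves single-letter elements such as V attached to their fraction, which matches the two problems described in the question.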

What is the meaning of quote = "" in count.fields function in R?

I don't understand the meaning of quote="" or quote="\"'" in the count.fields function. Can someone please explain the use of the quote argument and the difference between these two values?
Consider the text file
one two
'three four'
"file six"
seven "eight nine"
which we can create with
lines <- c(
"one two",
"'three four'",
"\"file six\"",
"seven \"eight nine\"")
writeLines(lines, "test.txt")
The quote= parameter lets R know which characters can start and end quoted values within the file. We can ignore quotes altogether by setting quote="". Doing that we see
count.fields("test.txt", quote="")
# [1] 2 2 2 3
so it's interpreting the spaces as starting new fields and each word is its own field. This might be useful if you have fields that contain quote characters for things other than delimiting strings, such as last names like o'Brian and measurements like 5'6". If we say that only double quotes delimit string values, we get
count.fields("test.txt", quote="\"")
# [1] 2 2 1 2
So here the first two lines are the same but line 3 is considered to have just one value. The space between the quotes does not start a new field.
The default is to use either double quotes or single quotes which gives
count.fields("test.txt")
# [1] 2 1 1 2
So now the second line is treated like the third line, as having just one value.
cat is often a good way to show what you are dealing with when you have quotes inside quotes.
> cat("Nothing:", "", "\n")
Nothing:
> cat("Something:", "\"'", "\n")
Something: "'
The first example of quote="" is specifying you have no quotes in the file.
The second example of quote="\"'" is specifying you have " or ' as potential quoting characters.
The \ backslash is used to 'escape' the following character so \" is treated literally as " instead of closing off the argument to quote= prematurely.
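A quick way to convince yourself that the backslash is only R string syntax is to inspect the value itself. The sketch below recreates the four-line example file in a temporary location (the tempfile is an assumption made so the snippet is self-contained) and compares the default with an explicitly supplied quote string.

```r
quote_chars <- "\"'"
n_chars <- nchar(quote_chars)
n_chars                             # 2: a double quote and a single quote
same_string <- identical(quote_chars, '"\'')
same_string                         # TRUE: same value written with the other delimiter

# Recreate the example file from above in a temporary location
tmp <- tempfile(fileext = ".txt")
writeLines(c("one two", "'three four'", "\"file six\"", "seven \"eight nine\""), tmp)

default_counts  <- count.fields(tmp)
explicit_counts <- count.fields(tmp, quote = quote_chars)
identical(default_counts, explicit_counts)  # TRUE: "\"'" is the default
unlink(tmp)
```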

How to remove starting(suffix) special character("_") from column names [duplicate]

After I collapse my rows and separate using a semicolon, I'd like to delete the semicolons at the front and back of my string. Multiple semicolons represent blanks in a cell. For example an observation may look as follows after the collapse:
;TX;PA;CA;;;;;;;
I'd like the cell to look like this:
TX;PA;CA
Here is my collapse code:
new_df <- group_by(old_df, unique_id) %>% summarize_each(funs(paste(., collapse = ';')))
If I try to gsub for semicolon it removes all of them. If I remove the end character it just removes one of the semicolons. Any ideas on how to remove all at the beginning and end, but leaving the ones in between the observations? Thanks.
Use the regular expression ^;+|;+$:
x <- ";TX;PA;CA;;;;;;;"
gsub("^;+|;+$", "", x)
The ^ anchors the match to the start of the string, the + means one or more repetitions of the preceding ;, and $ anchors the match to the end of the string. The | means "OR". Combined, the pattern matches one or more ; at the start of the string OR one or more ; at the end of the string, and gsub replaces those with an empty string.
The stringi package allows you to specify patterns which you wish to preserve and trim everything else. If you only have letters there (though you could specify other pattern too), you could simply do
stringi::stri_trim_both(";TX;PA;CA;;;;;;;", "\\p{L}")
## [1] "TX;PA;CA"
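Since the cells come from a whole column, it may help to see the base R pattern applied to a vector. This is a small sketch with made-up values, including a cell that consists only of semicolons; it also shows why gsub rather than sub is needed here.

```r
# Made-up cells for illustration
cells <- c(";TX;PA;CA;;;;;;;", "NY", ";;;", "A;;B;")

trimmed <- gsub("^;+|;+$", "", cells)
trimmed  # "TX;PA;CA" "NY" "" "A;;B"

# sub() replaces only the first match, so the trailing run survives
first_only <- sub("^;+|;+$", "", ";TX;PA;CA;;;;;;;")
first_only  # "TX;PA;CA;;;;;;;"
```

Interior runs such as the ";;" in "A;;B" are left alone, because neither alternative can match away from the ends of the string.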
