Lower Case Certain Words in R

I need to convert certain words to lower case. I am working with a list of movie titles, where prepositions and articles are normally lower case if they are not the first word in the title. If I have the vector:
movies = c('The Kings Of Summer', 'The Words', 'Out Of The Furnace', 'Me And Earl And The Dying Girl')
What I need is this:
movies_updated = c('The Kings of Summer', 'The Words', 'Out of the Furnace', 'Me and Earl and the Dying Girl')
Is there an elegant way to do this without using a long series of gsub(), as in:
movies_updated = gsub(' In ', ' in ', movies)
movies_updated = gsub(' In', ' in', movies_updated)
movies_updated = gsub(' Of ', ' of ', movies_updated)
movies_updated = gsub(' Of', ' of', movies_updated)
movies_updated = gsub(' The ', ' the ', movies_updated)
movies_updated = gsub(' The', ' the', movies_updated)
And so on.

In effect, it appears that you are interested in converting your text to title case. The stringi package provides stri_trans_totitle, but note that it capitalises every word:
> stringi::stri_trans_totitle(c('The Kings of Summer', 'The Words', 'Out of the Furnace'))
[1] "The Kings Of Summer" "The Words"           "Out Of The Furnace"
An alternative approach, which keeps articles and prepositions in lower case, makes use of the toTitleCase function available in the tools package:
> tools::toTitleCase(c('The Kings of Summer', 'The Words', 'Out of the Furnace'))
[1] "The Kings of Summer" "The Words"           "Out of the Furnace"

Though I like @Konrad's answer for its succinctness, I'll offer an alternative that is more literal and manual.
movies = c('The Kings Of Summer', 'The Words', 'Out Of The Furnace',
'Me And Earl And The Dying Girl')
gr <- gregexpr("(?<!^)\\b(of|in|the)\\b", movies, ignore.case = TRUE, perl = TRUE)
mat <- regmatches(movies, gr)
regmatches(movies, gr) <- lapply(mat, tolower)
movies
# [1] "The Kings of Summer" "The Words"
# [3] "Out of the Furnace" "Me And Earl And the Dying Girl"
The tricks of the regular expression:
(?<!^) ensures we don't match a word at the beginning of a string. Without this, the first The of movies 1 and 2 will be down-cased.
\\b sets up word-boundaries, such that in in the middle of Dying will not match. This is slightly more robust than your use of space, since hyphens, commas, etc, will not be spaces but do indicate the beginning/end of a word.
(of|in|the) matches any one of of, in, or the. More patterns can be added with separating pipes |, as sketched below.
Once identified, it's as simple as replacing them with down-cased versions.
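For instance, a hedged sketch of that extension: only and is added to the alternation, everything else is unchanged from the code above.
movies = c('The Kings Of Summer', 'The Words', 'Out Of The Furnace',
'Me And Earl And The Dying Girl')
# "and" added to the alternation so the fourth title is fully handled as well
gr <- gregexpr("(?<!^)\\b(of|in|and|the)\\b", movies, ignore.case = TRUE, perl = TRUE)
regmatches(movies, gr) <- lapply(regmatches(movies, gr), tolower)
movies
# [1] "The Kings of Summer" "The Words"
# [3] "Out of the Furnace" "Me and Earl and the Dying Girl"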

Another example of how to turn certain words to lower case with gsub (with a PCRE regex):
movies = c('The Kings Of Summer', 'The Words', 'Out Of The Furnace', 'Me And Earl And The Dying Girl')
gsub("(?!^)\\b(Of|In|The)\\b", "\\L\\1", movies, perl=TRUE)
See the R demo
Details:
(?!^) - not at the start of the string (it does not matter if we use a lookahead or lookbehind here since the pattern inside is a zero-width assertion)
\\b - find leading word boundary
(Of|In|The) - capture Of or In or The into Group 1
\\b - ensure there is a trailing word boundary.
The replacement contains the lowercasing operator \L that turns all the chars in the first backreference value (the text captured into Group 1) to lower case.
Note that this can turn out to be a more flexible approach than using tools::toTitleCase. Inside that function, the part of the code that keeps specific words in lower case is:
## These should be lower case except at the beginning (and after :)
lpat <- "^(a|an|and|are|as|at|be|but|by|en|for|if|in|is|nor|not|of|on|or|per|so|the|to|v[.]?|via|vs[.]?|from|into|than|that|with)$"
If you only need to apply lowercasing and do not care about the other logic in the function, it might be enough to add these alternatives (without the ^ and $ anchors) to the regex at the top of this answer.
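As a hedged sketch of how the two ideas could be combined (the short word list below is illustrative, not the full list from toTitleCase), you can build one alternation from a word vector and reuse the \L replacement shown above:
movies = c('The Kings Of Summer', 'The Words', 'Out Of The Furnace',
'Me And Earl And The Dying Girl')
# Illustrative word list only; extend it as needed
small_words <- c("a", "an", "and", "as", "at", "but", "by", "for",
"in", "into", "nor", "of", "on", "or", "the", "to", "with")
pat <- paste0("(?!^)\\b(", paste(small_words, collapse = "|"), ")\\b")
gsub(pat, "\\L\\1", movies, perl = TRUE, ignore.case = TRUE)
# [1] "The Kings of Summer" "The Words"
# [3] "Out of the Furnace" "Me and Earl and the Dying Girl"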

Related

Find closing parenthesis with regex in r

I have several strings with unmatched opening and closing parentheses. I managed to remove an opening parenthesis when there is no closing one, but I cannot manage to remove a closing parenthesis when there is no opening one. I want to leave strings with matching parentheses alone.
string1 = "This (is solved"
string2 = "This is (fine)"
string3 = "This is the problem)"
This is how I was able to handle the first problem case (an opening parenthesis with no closing one):
str_remove(data, "[(](?!.*[)])")
But I cannot seem to turn it around. The following grabs all closing parentheses, not just the one without an opening one.
"(?!.*[(])[)]"
Any ideas are appreciated!
If you do not need to handle nested paired (balanced) parentheses, you can use
gsub("(\\([^()]*\\))|[()]", "\\1", string)
See the regex demo. Details:
(\([^()]*\)) - Group 1 (\1 refers to this group value): (, then zero or more chars other than ( and ), and then a ) char
| - or
[()] - a ( or ) char.
See the R demo:
x <- c("This (is solved", "This is (fine)", "This is the problem)")
gsub("(\\([^()]*\\))|[()]", "\\1", x)
# => [1] "This is solved" "This is (fine)" "This is the problem"
If the parentheses can be nested, you can use
gsub("(\\((?:[^()]++|(?1))*\\))|[()]", "\\1", string, perl=TRUE)
See this regex demo. Details:
(\((?:[^()]++|(?1))*\)) - Group 1:
\( - a ( char
(?:[^()]++|(?1))* - zero or more sequences of either one or more chars other than ( and ), or the whole Group 1 pattern, recursed
\) - a ) char
|[()] - or a ( / ) char.
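As a quick, hedged check of the recursive version on made-up strings (the first contains a nested, balanced pair; the second a stray closing parenthesis):
y <- c("keep (a (nested) pair)", "drop this one)")
gsub("(\\((?:[^()]++|(?1))*\\))|[()]", "\\1", y, perl = TRUE)
# [1] "keep (a (nested) pair)" "drop this one"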

Stringr str_replace_all misses repeated terms

I'm having an issue with the stringr::str_replace_all function. I'm trying to replace all instances of iv with insuredvehicle, but the function only seems to catch the first term.
temp_data <- data.table(text = 'the driver of the 1st vehicle hit the iv iv at a stop')
temp_data[, new_text := stringr::str_replace_all(pattern = ' iv ', replacement = ' insuredvehicle ', string = text)]
The outcome looks like the following, which missed the 2nd iv term:
1: the driver of the 1st vehicle hit the insuredvehicle iv at a stop
I believe the issue is that the two instances share a space, which is part of the search pattern. I included the spaces because I want to replace the standalone iv term, and not the iv inside words like driver.
I DON'T want to simply consolidate the repeated terms to 1. I'd like the result to look like:
1: the driver of the 1st vehicle hit the insuredvehicle insuredvehicle at a stop
I'd appreciate any help getting this to work!
Maybe include a word boundary in your regex, then remove the white space from the pattern and replacement? That is ideal when you want to match only a full word, not parts of words, while staying away from these blank-space issues.
\\b seems to do the trick:
temp_data[, new_text := stringr::str_replace_all(pattern = '\\biv\\b', replacement = 'insuredvehicle', string = text)]
new_text
1: the driver of the 1st vehicle hit the insuredvehicle insuredvehicle at a stop
You can use lookarounds:
temp_data[, new_text := stringr::str_replace_all(pattern = '(?<= )iv(?= )', replacement = 'insuredvehicle', string = text)]
Output:
"the driver of the 1st vehicle hit the insuredvehicle insuredvehicle at a stop"
Use gsub:
gsub("\\biv\\b", "insuredvehicle", temp_data$text)
[1] "the driver of the 1st vehicle hit the uninsuredvehicle uninsuredvehicle at a stop"
Use space boundaries:
temp_data <- data.table(text = 'the driver of the 1st vehicle hit the iv iv at a stop')
temp_data[, new_text := stringr::str_replace_all(pattern = '(?<!\\S)iv(?!\\S)', replacement = 'insuredvehicle', string = text)]
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
(?<! look behind to see if there is not:
--------------------------------------------------------------------------------
\S non-whitespace (all but \n, \r, \t, \f,
and " ")
--------------------------------------------------------------------------------
) end of look-behind
--------------------------------------------------------------------------------
iv 'iv'
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
\S non-whitespace (all but \n, \r, \t, \f,
and " ")
--------------------------------------------------------------------------------
) end of look-ahead
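To make the difference between the two boundary styles concrete, here is a hedged illustration on a made-up sentence: \b also fires inside "iv." and "iv-car", while the whitespace lookarounds only touch a free-standing iv.
library(stringr)
s <- "exhibit iv. the iv-car hit the iv at a stop"
str_replace_all(s, "\\biv\\b", "insuredvehicle")
# [1] "exhibit insuredvehicle. the insuredvehicle-car hit the insuredvehicle at a stop"
str_replace_all(s, "(?<!\\S)iv(?!\\S)", "insuredvehicle")
# [1] "exhibit iv. the iv-car hit the insuredvehicle at a stop"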

Finding a word with a condition in a vector with regex in R (perl)

I would like to find the rows in a vector that contain the word 'RT' (or 'R'), but not when 'RT' is preceded by 'no'.
The word RT may be preceded by nothing, a space, a dot, etc.
With the regex, I tried :
grep("(?<=[no] )RT", aaa,ignore.case = FALSE, perl = T)
Which was giving me all the rows with "no RT".
and
grep("(?=[^no].*)RT",aaa , perl = T)
which was giving me all the rows containing 'RT' with and without 'no' at the beginning.
What is my mistake? I thought the ^ was giving everything but the character that follows it.
Example:
aaa = c("RT alone", "no RT", "CT/RT", "adj.RTx", "RT/CT", "lang, RT+", "npo RT")
(?<=[no] )RT matches any RT that is immediately preceded with "n " or "o ".
You should use a negative lookbehind,
"(?<!no )RT"
See the regex demo.
Or, if you need to check for a whole word no,
"(?<!\\bno )RT"
See this regex demo.
Here, (?<!no ) makes sure there is no "no " immediately to the left of the current location, and only then is RT consumed.
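For instance, running the whole-word version against the example vector should return every row except "no RT" (a hedged sketch; only the example data above is used):
aaa = c("RT alone", "no RT", "CT/RT", "adj.RTx", "RT/CT", "lang, RT+", "npo RT")
grep("(?<!\\bno )RT", aaa, perl = TRUE)
# [1] 1 3 4 5 6 7
grep("(?<!\\bno )RT", aaa, perl = TRUE, value = TRUE)
# [1] "RT alone"  "CT/RT"  "adj.RTx"  "RT/CT"  "lang, RT+"  "npo RT"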

Text Mining R Package & Regex to Replace Smart Curly Quotes

I've got a bunch of texts like the one below with different smart quotes, both single and double. All I could manage with the packages I'm aware of is to remove those characters, but I want them replaced with normal straight quotes.
textclean::replace_non_ascii("You don‘t get “your” money’s worth")
Received Output: "You dont get your moneys worth"
Expected Output: "You don't get "your" money's worth"
I would also appreciate a regex to replace all such quotes in one shot.
Thanks!
Use two gsub operations: 1) to replace double curly quotes, 2) to replace single quotes:
> gsub("[“”]", "\"", gsub("[‘’]", "'", text))
[1] "You don't get \"your\" money's worth"
See the online R demo. Tested in both Linux and Windows, and works the same.
The [“”] construct is a positive character class that matches any single char defined in the class.
To normalize all chars similar to double quotes, you might want to use
> sngl_quot_rx = "[ʻʼʽ٬‘’‚‛՚︐]"
> dbl_quot_rx = "[«»״“”„‟≪≫《》〝〞〟＂″‶]"
> res = gsub(dbl_quot_rx, "\"", gsub(sngl_quot_rx, "'", `Encoding<-`(text, "UTF-8")))
> cat(res, sep="\n")
You don't get "your" money's worth
Here, [«»״“”„‟≪≫《》〝〞〟＂″‶] matches
« 00AB LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
» 00BB RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
״ 05F4 HEBREW PUNCTUATION GERSHAYIM
“ 201C LEFT DOUBLE QUOTATION MARK
” 201D RIGHT DOUBLE QUOTATION MARK
„ 201E DOUBLE LOW-9 QUOTATION MARK
‟ 201F DOUBLE HIGH-REVERSED-9 QUOTATION MARK
≪ 226A MUCH LESS-THAN
≫ 226B MUCH GREATER-THAN
《 300A LEFT DOUBLE ANGLE BRACKET
》 300B RIGHT DOUBLE ANGLE BRACKET
〝 301D REVERSED DOUBLE PRIME QUOTATION MARK
〞 301E DOUBLE PRIME QUOTATION MARK
〟 301F LOW DOUBLE PRIME QUOTATION MARK
" FF02 FULLWIDTH QUOTATION MARK
″ 2033 DOUBLE PRIME
‶ 2036 REVERSED DOUBLE PRIME
The [ʻʼʽ٬‘’‚‛՚︐] is used to normalize some chars similar to single quotes:
ʻ 02BB MODIFIER LETTER TURNED COMMA
ʼ 02BC MODIFIER LETTER APOSTROPHE
ʽ 02BD MODIFIER LETTER REVERSED COMMA
٬ 066C ARABIC THOUSANDS SEPARATOR
‘ 2018 LEFT SINGLE QUOTATION MARK
’ 2019 RIGHT SINGLE QUOTATION MARK
‚ 201A SINGLE LOW-9 QUOTATION MARK
‛ 201B SINGLE HIGH-REVERSED-9 QUOTATION MARK
՚ 055A ARMENIAN APOSTROPHE
︐ FE10 PRESENTATION FORM FOR VERTICAL COMMA
There's a function in {proustr} to normalize punctuation, called pr_normalize_punc():
https://github.com/ColinFay/proustr#pr_normalize_punc
It turns:
=> ″‶« »“”`´„“ into "
=> ՚ ’ into '
=> … into ...
For example:
library(proustr)
a <- data.frame(text = "Il l՚a dit : « La ponctuation est chelou » !")
pr_normalize_punc(a, text)
# A tibble: 1 x 1
text
* <chr>
1 "Il l'a dit : \"La ponctuation est chelou\" !"
For your text:
pr_normalize_punc(data.frame( text = "You don‘t get “your” money’s worth"), text)
# A tibble: 1 x 1
text
* <chr>
1 "You don‘t get \"your\" money's worth"
We can use gsub here for a base R option, replacing each kind of curly quote in turn.
text <- "You don‘t get “your” money’s worth"
new_text <- gsub("“(.*?)”", "\"\\1\"", text)
new_text <- gsub("’", "'", new_text)
new_text
[1] "You don‘t get \"your\" money's worth"
I have assumed here that your curly quotes are always balanced, i.e. they always wrap a word. If not, then you might have to do more work.
Doing a blanket replacement of opening/closing double curly quotes may not play out as intended, if you want them to remain as is when not quoting a word.
Demo
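As a hedged illustration of that caveat, with a deliberately unbalanced opening quote (made-up string):
unbalanced <- "he said “wait"
gsub("“(.*?)”", "\"\\1\"", unbalanced)  # left untouched: "he said “wait"
gsub("[“”]", "\"", unbalanced)          # converted anyway: "he said \"wait"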

Replace rogue double-quotes in vector in R

I have a broken CSV file with long text fields containing both double quotes and commas. I've been able to clean it up to some extent and now have tab-separated fields as a vector of whole lines (each value is a line).
head(temp, 2)
[1] "\"org_order\"\t\"organizations.api_path\"\t\"permalink\"\t\"api_path\"\t\"web_path\"\t\"name\"\t\"also_known_as\"\t\"short_description\"\t\"description\"\t\"profile_image_url\"\t\"primary_role\"\t\"role_company\"\t\"role_investor\"\t\"role_group\"\t\"role_school\"\t\"founded_on\"\t\"founded_on_trust_code\"\t\"is_closed\"\t\"closed_on\"\t\"closed_on_trust_code\"\t\"num_employees_min\"\t\"num_employees_max\"\t\"stock_exchange\"\t\"stock_symbol\"\t\"total_funding_usd\"\t\"number_of_investments\"\t\"homepage_url\"\t\"created_at\"\t\"updated_at\""
[2] "1\t\"organizations/care1st-health-plan-arizona\"\t\"care1st-health-plan-arizona\"\t\"organizations/care1st-health-plan-arizona\"\t\"organization/care1st-health-plan-arizona\"\t\"Care1st Health Plan Arizona\"\t\"\"\t\"Care1st Health Plan Arizona provides high quality health care services.\"\t\"Care1st is a health plan providing support and services to meet the health care needs of eligible members enrolled in KidsCare, AHCCCS, and DDD.\"\t\"http://public.crunchbase.com/t_api_images/v1475743278/m2teurxnhkwacygzdn2m.png\"\t\"company\"\t\"\"\t\"\"\t\"\"\t\"\"\t\"2003-01-01\"\t\"4\"\t\"FALSE\"\t\"\"\t\"0\"\t\"251\"\t\"500\"\t\"\"\t\"\"\t\"0\"\t\"0\"\t\"\"\t\"1475743348\"\t\"1475899305\""
I then write temp to a file and read it back (which I've found much faster than textConnection). However, read.table("temp", sep = "\t", quote = "\"", encoding = "UTF-8", colClasses = "character") chokes on certain lines and gives me messages such as:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec
= dec, : line 66951 did not have 29 elements
I think this is due to rogue double quotes, as in the following line (rogue quote can be found immediately after "TripAdvisor de la sant?").
temp[66951]
[1] "67654\t\"organizations/docotop\"\t\"docotop\"\t\"organizations/docotop\"\t\"organization/docotop\"\t\"DOCOTOP\"\t\"\"\t\"Le 'TripAdvisor de la sant?\" est arriv?. Docotop permet de trouver le meilleur professionnel de sant?gr?e ?la communaut?de patients\"\t\"\"\t\"http://public.crunchbase.com/t_api_images/v1455271104/ry9lhcfezcmemoifp92h.png\"\t\"company\"\t\"TRUE\"\t\"\"\t\"\"\t\"\"\t\"2015-11-17\"\t\"7\"\t\"\"\t\"\"\t\"0\"\t\"1\"\t\"10\"\t\"EURONEXT\"\t\"\"\t\"0\"\t\"0\"\t\"http://docotop.com/\"\t\"1455271299\"\t\"1473443321\""
I propose to replace rogue double quotes by single quotes, but I have to leave expected quotes in place. Quotes are expected right before or after a separator (tab) and at the beginning (first line only) and the end of a line. I've written the following attempt at regex with lookarounds for tab and line start and end, but it doesn't work:
temp <- gsub("(?<![^\t])\"(?![\t$])", "'", temp, perl = T)
EDIT: I tried @akrun's solution, but get:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec
= dec, : line 181 did not have 29 elements
The line in question (which didn't cause an error before):
temp[181]
[1] "198\torganizations/playfusion\tplayfusion\torganizations/playfusion\torganization/playfusion\tPlayFusion\t\tPlayFusion is a developer of computer games.\tPlayFusion is pioneering the next generation of connected interactive entertainment. PlayFusion's proprietary technology platform fuses video games, robotics, toys, and trans-media entertainment. The company is currently working on its own original IP to trail-blaze its vision ahead of opening its platform to others. PlayFusion is an independent, employee-owned company with offices in Cambridge and Derby in the UK, Douglas in the Isle of Man, and New York and San Francisco in the USA.\thttp://public.crunchbase.com/t_api_images/v1475688372/xnhrd4t254pxj6yxegzt.png\tcompany\t\t\t\t\t2015-01-01\t4\tFALSE\t\t0\t11\t50\t\t\t0\t0\thttp://playfusion.com/#intro\t1475688521\t1475899292"
Your (?<![^\t])"(?![\t$]) regex matches a " that is not preceded with a char other than a tab (so, there must be a tab or start of string before the "), and that is not followed with a tab or $ symbol.
So, the ^ and $ inside character classes lose their anchor meaning.
Replace the character classes with alternation groups:
gsub("(?<!\t|^)\"(?!\t|$)", "'", temp, perl=TRUE)
The (?<!\t|^) lookbehind requires that the " is not at the start of the string and is not preceded with a tab.
The (?!\t|$) lookahead requires that the " is not at the end of the string ($) and is not followed with a tab char.
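As a minimal, hedged sketch on a made-up line (tabs written as \t): only the quotes inside a field are rewritten, while the field-delimiting quotes stay put.
x <- "1\t\"He said \"hi\" to me\"\t\"ok\""
gsub('(?<!\t|^)"(?!\t|$)', "'", x, perl = TRUE)
# [1] "1\t\"He said 'hi' to me\"\t\"ok\""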
