How do I detect and delete abbreviations with regex in R? - r

I have a column with the following kind of strings:
Author
Achebe, Chinua. Ach
Akbar, M.j. Akb
Alanahally, Srikrishna. Ala
These are names of authors with a three-letter abbreviation at the end, usually after two spaces. I only want to match it at the end of the string, because if I just look for three-letter words, author names like Jon and Sam will be deleted too. I want to eliminate the abbreviation. I wrote the following regex to detect and delete these:
data$Author <- gsub("\\s([A-Z]+[A-Za-z]{2})\\s", "", data$Author)
What do I change in this so that I can delete these three letter abbreviations?

Your \\s at the end of the pattern requires whitespace after the three letters, and none of the samples here have that.
You cannot simply remove it or replace it with \\s*, as either change is too permissive (and breaks things):
gsub("\\s([A-Z]+[A-Za-z]{2})", "", authors)
# [1] "Achebe,nua. " "Akbar, M.j. " "Alanahally,krishna. "
Instead, you can add a word boundary, \\b:
gsub("\\s([A-Z]+[A-Za-z]{2})\\b", "", authors)
# [1] "Achebe, Chinua. " "Akbar, M.j. " "Alanahally, Srikrishna. "
Or anchor the match to the end of the string with $:
gsub("\\s([A-Z]+[A-Za-z]{2})$", "", authors)
# [1] "Achebe, Chinua. " "Akbar, M.j. " "Alanahally, Srikrishna. "
(though I think this might be over-constraining).
Data
authors <- c("Achebe, Chinua.  Ach", "Akbar, M.j.  Akb", "Alanahally, Srikrishna.  Ala")
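A possible variant, not taken from the answers above: if every abbreviation follows the period that closes the author's name, the period can be required in the pattern so that a short final name such as "Jon" is left alone. This is only a sketch. It assumes two spaces before the abbreviation, as described in the question, and the helper name `strip_abbrev` and the "Smith, Jon" entry are made up for illustration.

```r
# Hypothetical helper: drop a trailing three-letter abbreviation, but only
# when it follows the period that ends the author's name.
strip_abbrev <- function(x) {
  # (\\.) captures the period so it can be kept via \\1;
  # \\s+ absorbs one or more spaces before the abbreviation;
  # $ restricts the match to the end of the string.
  sub("(\\.)\\s+[A-Z][a-z]{2}$", "\\1", x, perl = TRUE)
}

# "Smith, Jon" is an invented entry with no abbreviation to remove
authors <- c("Achebe, Chinua.  Ach", "Akbar, M.j.  Akb", "Smith, Jon")
result <- strip_abbrev(authors)
result
# Expected values: "Achebe, Chinua.", "Akbar, M.j.", "Smith, Jon"
```

Because the pattern needs a literal period before the abbreviation, an entry without one is returned unchanged rather than losing its last word.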

Try this with find-and-replace syntax:
Find: \s?\s\w+$
Replace: leave it empty

Related

Inverting a regex pattern in R to match all left and between a giving set of strings

Hello everyone I hope you guys are having a good one,
I have multiple long strings of text in a dataset. I am trying to capture all text between, after, and before a set of words, which I will refer to as keywords:
keywords= UHJ, uhj, AXY, axy, YUI, yui, OPL, opl, UJI, uji
if I have the following string:
UHJ This is only a test to AXY check regex in a YUI educational context so OPL please be kind UJI
The following regex will easily match my keywords:
UHJ|uhj|AXY|axy|YUI|yui|OPL|opl|UJI|uji
but since I am interested in capturing everything between, after, and before those words, I want in some way to capture the inverse of my regex, so that I can have something like this:
I have tried the following:
[^UHJ|uhj|AXY|axy|YUI|yui|OPL|opl|UJI|uji]
with no luck. The keywords may change in the future, so I would appreciate a regex that works in R and achieves my desired output.
The simplest solution is probably just to split by your pattern. (Note this includes an empty string if the text starts with a keyword.)
x <- "UHJ This is only a test to AXY check regex in a YUI educational context so OPL please be kind UJI"
keywords <- "UHJ|uhj|AXY|axy|YUI|yui|OPL|opl|UJI|uji"
strsplit(x, keywords)
# [[1]]
# [1] "" " This is only a test to "
# [3] " check regex in a " " educational context so "
# [5] " please be kind "
Other options would be to use regmatches() with invert = TRUE. (This includes empty strings if the text starts or ends with a keyword.)
regmatches(
x,
gregexpr(keywords, x, perl = TRUE),
invert = TRUE
)
# [[1]]
# [1] "" " This is only a test to "
# [3] " check regex in a " " educational context so "
# [5] " please be kind " ""
Or stringr::str_extract_all() with your pattern in both a lookbehind and a lookahead. (This doesn't include empty strings.)
library(stringr)
str_extract_all(
x,
str_glue("(?<={keywords}).+?(?={keywords})")
)
# [[1]]
# [1] " This is only a test to " " check regex in a "
# [3] " educational context so " " please be kind "
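Since the question mentions that the keywords may change, one option is to keep them in a character vector and build the pattern from it. The sketch below builds on the strsplit approach. Matching case-insensitively, adding word boundaries, and trimming the pieces are all assumptions about what is wanted here, not something the answer above states.

```r
# Keywords as a plain character vector, so the list is easy to change later
keywords <- c("UHJ", "AXY", "YUI", "OPL", "UJI")

# \\b keeps a keyword from matching inside a longer word (e.g. "opl" in "people");
# (?i) makes the match case-insensitive and requires perl = TRUE
pattern <- paste0("(?i)\\b(?:", paste(keywords, collapse = "|"), ")\\b")

x <- "UHJ This is only a test to AXY check regex in a YUI educational context so OPL please be kind UJI"

# Split on the keywords, then trim surrounding whitespace from each piece
pieces <- trimws(strsplit(x, pattern, perl = TRUE)[[1]])

# Drop the empty piece that appears when the text starts with a keyword
pieces <- pieces[nzchar(pieces)]
pieces
# Expected pieces, in order:
#   "This is only a test to"
#   "check regex in a"
#   "educational context so"
#   "please be kind"
```

If the keywords could contain regex metacharacters, they would need escaping before being pasted into the pattern.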

Remove whitespace after a symbol (hyphen) in R

I'm trying to remove the hyphen that divides a word from a string. For example, the word example: "for exam- ple this".
a <- "for exam- ple this"
How could I join them?
I have tried to remove the hyphen using this command:
str_replace_all(a, "-", "")
But I got this back:
"for exam ple this"
It does not return the word joined. I have also tried this:
str_replace_all(a, "- ", "") but I get nothing.
Therefore I thought of first removing the white space after the hyphen to get the following
"for exam-ple this"
and then eliminating the hyphen.
Can you help me?
Here is an option with sub where we match the - followed by zero or more spaces (\\s*) and replace with -
sub("-\\s*", "-", a)
#[1] "for exam-ple this"
sub only replaces the first match. To handle every such hyphen in the string, use gsub instead
gsub("-\\s*", "-", a)
str_replace_all(a, "- ", "-")
If you are just trying to remove the whitespace after a symbol then Ricardo's answer is sufficient. If you want to remove an unknown amount of whitespace after a hyphen consider
str_replace_all(a, "- +", "-")
#[1] "for exam-ple this"
b <- "for exam-    ple this"
str_replace_all(b, "- +", "-")
#[1] "for exam-ple this"
EDIT --- Explanation
The "+" is a quantifier, part of regular expression syntax, that tells R how to match a string. Specifically, "+" means to match the preceding character (or group/set) 1 or more times. You can find out more about regular expressions here.
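The answers above stop at "exam-ple", while the question's end goal is the joined word. As a minimal sketch, the hyphen and the whitespace after it can be removed in one step by replacing the match with an empty string. The second input string is an invented example to show that ordinary hyphenated words are left alone.

```r
a <- "for exam- ple this"

# Remove the hyphen together with the whitespace that follows it.
# \\s+ requires at least one whitespace character, so a hyphen inside an
# ordinary compound such as "well-known" is not touched.
joined <- gsub("-\\s+", "", a)
joined
# [1] "for example this"

# Invented example with a compound word and a wider gap after the hyphen
mixed <- gsub("-\\s+", "", "a well-known exam-  ple")
mixed
# [1] "a well-known example"
```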

How to throw out spaces and underscores only from the beginning of the string?

I want to ignore the spaces and underscores in the beginning of a string in R.
I can write something like
txt <- gsub("^\\s+", "", txt)
txt <- gsub("^\\_+", "", txt)
But I think there could be an elegant solution
txt <- " 9PM 8-Oct-2014_0.335kwh "
txt <- gsub("^[\\s+|\\_+]", "", txt)
txt
The output should be "9PM 8-Oct-2014_0.335kwh ". But my code gives " 9PM 8-Oct-2014_0.335kwh ".
How can I fix it?
You could bundle the \s and the underscore only in a character class and use quantifier to repeat that 1+ times.
^[\s_]+
Regex demo
For example:
txt <- gsub("^[\\s_]+", "", txt, perl=TRUE)
Or, as @Tim Biegeleisen points out in the comments, since only the first occurrence is being replaced you could use sub instead:
txt <- sub("^[\\s_]+", "", txt, perl=TRUE)
Or using a POSIX character class
txt <- sub("^[[:space:]_]+", "", txt)
More info about perl=TRUE and regular expressions used in R
R demo
The stringr package offers some task-specific functions with helpful names. In your original question you say you would like to remove whitespace and underscores from the start of your string, but in a comment you imply that you also wish to remove the same characters from the end of the same string. To that end, I'll include a few different options.
Given string s <- " \t_blah_ ", which contains whitespace (spaces and tabs) and underscores:
library(stringr)
# Remove whitespace and underscores at the start.
str_remove(s, "^[\\s_]+")
# [1] "blah_ "
# Remove whitespace and underscores at the start and end.
str_remove_all(s, "^[\\s_]+|[\\s_]+$")
# [1] "blah"
In case you're looking to remove whitespace only – there are, after all, no underscores at the start or end of your example string – there are a couple of stringr functions that will help you keep things simple:
# `str_trim` trims whitespace (spaces, tabs, newlines) from either or both sides.
str_trim(s, side = "left")
# [1] "_blah_ "
str_trim(s, side = "right")
# [1] " \t_blah_"
str_trim(s, side = "both") # This is the default.
# [1] "_blah_"
# `str_squish` reduces repeated whitespace anywhere in string.
s <- " \t_blah blah_ "
str_squish(s)
# [1] "_blah blah_"
The same pattern [\\s_]+ will also work in base R's sub or gsub with perl = TRUE, if that's your jam (see The fourth bird's answer).
You can use stringr as:
txt <- " 9PM 8-Oct-2014_0.335kwh "
library(stringr)
str_trim(txt)
# [1] "9PM 8-Oct-2014_0.335kwh"
Or the trimws in Base R
trimws(txt)
# [1] "9PM 8-Oct-2014_0.335kwh"
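To make the role of the ^ anchor concrete, here is a small sketch in base R. The helper name `trim_leading` is hypothetical, and the sample string with a leading underscore is an assumption, since the string shown in the question has no visible underscore at the start.

```r
# Hypothetical helper: strip whitespace and underscores from the start only
trim_leading <- function(x) {
  # ^ anchors the match to the start of the string;
  # [\\s_]+ matches one or more whitespace characters or underscores
  sub("^[\\s_]+", "", x, perl = TRUE)
}

# Assumed sample: leading space, underscore, space; trailing space kept
txt <- " _ 9PM 8-Oct-2014_0.335kwh "
trimmed <- trim_leading(txt)
trimmed
# [1] "9PM 8-Oct-2014_0.335kwh "

# With nothing to strip at the start, the string comes back unchanged,
# including its internal space and underscore
unchanged <- trim_leading("9PM 8-Oct-2014_0.335kwh")
unchanged
# [1] "9PM 8-Oct-2014_0.335kwh"
```

Without the anchor, sub would remove the first run of whitespace or underscores wherever it occurs, which for the second string would be the space after "9PM".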

How to put a space in between a list of strings?

This is my current dataset:
c("Jetstar","Qantas", "QantasLink","RegionalExpress","TigerairAustralia",
"VirginAustralia","VirginAustraliaRegionalAirlines","AllAirlines",
"Qantas-allQFdesignatedservices","VirginAustralia-allVAdesignatedservices")
I want to add a space between the words in each airline name.
For this I tried the following code:
airlines$airline <- gsub("([[:lower:]]) ([[:upper:]])", "\\1 \\2", airlines$airline)
But I got the text in the same format as before.
My desired output is as below:
txt <- c("Jetstar","Qantas", "QantasLink","RegionalExpress","TigerairAustralia",
"VirginAustralia","VirginAustraliaRegionalAirlines","AllAirlines",
"Qantas-allQFdesignatedservices","VirginAustralia-allVAdesignatedservices")
You need two different sorts of rules: one for the spaces at the case changes and another for recurring words ("designated", "services") or symbols ("-"). Start with a pattern that matches a lowercase character followed by an uppercase character (each identified with a character class such as "[A-Z]"), wrap each in a capture group (parentheses around a section of the pattern), and insert a space between the two groups in the replacement. See the Details section of ?regex for a quick description of character classes and capture groups:
gsub("([a-z])([A-Z])", "\\1 \\2", txt)
You then pass that result to a second gsub call that adds a space before any of the recurring words or symbols you also want separated:
gsub("(-|all|designated|services)", " \\1", # second pattern and sub for "specials"
gsub("([a-z])([A-Z])", "\\1 \\2", txt)) #first pattern and sub for case changes
# [1] "Jetstar"
# [2] "Qantas"
# [3] "Qantas Link"
# [4] "Regional Express"
# [5] "Tigerair Australia"
# [6] "Virgin Australia"
# [7] "Virgin Australia Regional Airlines"
# [8] "All Airlines"
# [9] "Qantas - all QF designated services"
# [10] "Virgin Australia - all VA designated services"
I see that someone upvoted my earlier answer to Splitting CamelCase in R which was similar, but this one had a few more wrinkles to iron out.
This could (almost) do the trick
gsub("([A-Z])", " \\1", airlines)
Borrowed from: splitting-camelcase-in-r
Of course names like Qantas-allQFd… will still pose a problem because of the two consecutive uppercase letters ("QF") in the second part of the string.
I have tried to figure it out and I have come up with something:
library(stringr)
data_vec<- c("Jetstar","Qantas", "QantasLink","RegionalExpress","TigerairAustralia",
"VirginAustralia","VirginAustraliaRegionalAirlines","AllAirlines",
"Qantas-allQFdesignatedservices","VirginAustralia-allVAdesignatedservices")
# The lookbehind (?<=...) is only supported with perl = TRUE
str_trim(gsub("(?<=[A-Z]{2})([a-z]{1})", " \\1", gsub("([A-Z]{1,2})", " \\1", data_vec), perl = TRUE))
I Hope this helps.
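The steps from these answers can be combined into one helper. The following is only a sketch: the function name `split_names`, its `words` argument, and the extra rule for acronyms such as "QF" are assumptions made for illustration, and run-together lowercase words can only be separated when they are listed explicitly.

```r
# Hypothetical helper combining the steps discussed above
split_names <- function(x, words = c("designated", "services")) {
  # 1. Space at a lowercase -> uppercase boundary ("QantasLink")
  x <- gsub("([a-z])([A-Z])", "\\1 \\2", x, perl = TRUE)
  # 2. Space between a run of capitals and a following lowercase word ("QFdesignated")
  x <- gsub("([A-Z]{2,})([a-z])", "\\1 \\2", x, perl = TRUE)
  # 3. Space out hyphens
  x <- gsub("-", " - ", x, fixed = TRUE)
  # 4. Space before each explicitly listed run-together word
  if (length(words) > 0) {
    word_pattern <- paste0("(", paste(words, collapse = "|"), ")")
    x <- gsub(word_pattern, " \\1", x)
  }
  # 5. Collapse repeated spaces and trim the ends
  trimws(gsub("\\s+", " ", x))
}

spaced <- split_names(c("QantasLink", "AllAirlines", "Qantas-allQFdesignatedservices"))
spaced
# Expected values:
#   "Qantas Link"
#   "All Airlines"
#   "Qantas - all QF designated services"
```

Step 2 treats the whole run of capitals as one acronym, so a name like "XMLHttp" would be split in the wrong place. For the airline data here that case does not occur.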

separating last sentence from a string in R

I have a vector of strings and I want to separate the last sentence from each string in R.
Sentences may end with full stops (.) or even exclamation marks (!), so I am unsure how to separate the last sentence from a string in R.
You can use strsplit to get the last sentence from each string as shown:
## paragraph <- "Your vector here"
result <- strsplit(paragraph, "\\.|\\!|\\?")
last.sentences <- sapply(result, function(x) {
trimws((x[length(x)]))
})
Provided that your input is clean enough (in particular, that there are spaces between the sentences), you can use:
sub(".*(\\.|\\?|\\!) ", "", trimws(yourvector))
It finds the longest substring ending with a punctuation mark and a space and removes it.
I added trimws just in case there are trailing spaces in some of your strings.
Example:
u <- c("This is a sentence. And another sentence!",
"By default R regexes are greedy. So only the last sentence is kept. You see ? ",
"Single sentences are not a problem.",
"What if there are no spaces between sentences?It won't work.",
"You know what? Multiple marks don't break my solution!!",
"But if they are separated by spaces, they do ! ! !")
sub(".*(\\.|\\?|\\!) ", "", trimws(u))
# [1] "And another sentence!"
# [2] "You see ?"
# [3] "Single sentences are not a problem."
# [4] "What if there are no spaces between sentences?It won't work."
# [5] "Multiple marks don't break my solution!!"
# [6] "!"
This regex anchors to the end of the string with $ and allows an optional '.' or '!' at the end. At the front, the positive lookbehind (?<=...) requires ". " or "! " (the end of the prior sentence) without including it in the match, and ^ covers the case of a single sentence. Because [^.!]+ cannot cross a '.' or '!', the match cannot start at an earlier sentence; with .+ in its place, the match would begin at ^ and return the whole string.
s <- "Sentences may end with full stops(.) or even exclamatory marks(!). Hence i am confused as to how to separate the last sentence from a string in R."
library(stringr)
str_extract(s, "(?<=(\\.\\s|\\!\\s|^))[^.!]+(\\.|\\!)?$")
yields
# [1] "Hence i am confused as to how to separate the last sentence from a string in R."
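Another way to approach this in base R, sketched here under the assumption that sentences are separated by whitespace after '.', '!' or '?', is to split after the punctuation and keep the final piece. The helper name `last_sentence` is hypothetical.

```r
# Hypothetical helper: return the last sentence of each string
last_sentence <- function(x) {
  # Split on whitespace that follows ., ! or ?. The lookbehind keeps the
  # punctuation attached to the sentence it ends (needs perl = TRUE).
  parts <- strsplit(trimws(x), "(?<=[.!?])\\s+", perl = TRUE)
  # Take the last piece of each split; empty input yields ""
  vapply(parts, function(p) if (length(p) > 0) p[length(p)] else "", character(1))
}

u <- c("This is a sentence. And another sentence!",
       "Single sentences are not a problem.",
       "You know what? Multiple marks don't break my solution!!")
last <- last_sentence(u)
last
# Expected values:
#   "And another sentence!"
#   "Single sentences are not a problem."
#   "Multiple marks don't break my solution!!"
```

Like the other answers, this relies on clean input: an abbreviation such as "e.g. " inside the last sentence would be treated as a sentence break.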

Resources