R searching for specific string patterns (part 2) - r

I have a previous question listed here (Searching for specific string pattern) but there are some additional questions that I have.
Previously, I thought my file naming convention was only of these formats:
"aaaaa-ttttt-eeee-q4-2015-file"
"aaaaaa-fffff-3333-q2-2012-file"
or specifically, it is the quarter followed by "-" then year.
However, upon further investigation, the files have other variations such as:
"aaaaaa-f2q09-bbbbb"
"aaaaaa-f2q2008-bbbbb"
"aaaaaa-f4q-2008-fffff"
"f4q-aaaaa-eeeeee-2008"
"q2-aaaaaaaaa-eeeeeee-2005"
"aaaaaaaa-3q-2008-rrrrrrr"
For all of the above, I would likewise like to extract the year and quarter. I'm not sure whether there is general code that can extract them all in one go, or whether I have to write several sets of code and run them in waves. I'm not very familiar with the sub function in R and would appreciate it if someone could point me to a website with detailed explanations and examples, so that I can write my own code to extract this information.
Ultimately, the code should parse all those strings and output something like: year = 2005, quarter = q4 etc.

Try this. It uses regexpr to find the location of the match and regmatches to return the matched text. Be aware that it is very susceptible to pulling out incorrect data: for the quarter, it returns the first instance of a digit 1-4 that is either followed or preceded by a q. If there is any other information that can make these matches more specific, then I suggest including it.
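input <- c("aaaaaa-f2q09-bbbbb",
           "aaaaaa-f2q2008-bbbbb",
           "aaaaaa-f4q-2008-fffff",
           "f4q-aaaaa-eeeeee-2008",
           "q2-aaaaaaaaa-eeeeeee-2005",
           "aaaaaaaa-3q-2008-rrrrrrr")
# Quarter: a digit 1-4 directly before or after a "q"
quarter <- regmatches(input, regexpr("[1-4]q|q[1-4]", input))
# Year: "q" plus four digits, "q" plus two digits, or a bare four-digit number
year <- regmatches(input, regexpr("q\\d{4}|q\\d{2}|\\d{4}", input))
# Drop the leading "q" picked up by the first two alternatives
year <- gsub("q", "", year)
# Expand two-digit years, e.g. "09" becomes "2009"
year <- sub("\\b(\\d{2})\\b", "20\\1", year)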
There are also lots of issues with the year matching, because three different formats are possible: "q09", "q2008" and "2008". Because regexpr returns only the first match in the string, the q\d{4} alternative is needed to pull back the full year in the q2008 example; without it, q\d{2} would match just "q20".
The final sub call replaces a standalone two-digit match with "20" followed by the match itself; \\1 refers to whatever the parenthesised group (\\d{2}) captured.
Test it and comment with any mistakes.
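To get closer to the "year = 2005, quarter = q4" output the question asks for, the same patterns can be wrapped in a function that returns a data frame. This is only a sketch: extract_period() and first_match() are hypothetical helper names, the last file name is made up to show the no-match case, and two-digit years are assumed to mean 20xx. Unlike a bare regmatches call, it keeps one row per input and fills NA where nothing matches.

```r
# Sketch only: extract_period() and first_match() are hypothetical helper names.
extract_period <- function(x) {
  # First match of `pattern` in each element of x; NA where nothing matches.
  first_match <- function(pattern) {
    m <- regexpr(pattern, x)
    out <- rep(NA_character_, length(x))
    out[as.integer(m) != -1L] <- regmatches(x, m)
    out
  }

  quarter <- first_match("[1-4]q|q[1-4]")
  # Normalise "2q" and "q2" to "q2"; keep NA where no quarter was found.
  quarter <- ifelse(is.na(quarter), NA_character_,
                    paste0("q", gsub("q", "", quarter)))

  year <- first_match("q\\d{4}|q\\d{2}|\\d{4}")
  year <- sub("q", "", year)                # drop the leading "q", if any
  year <- sub("^(\\d{2})$", "20\\1", year)  # assumption: two-digit years are 20xx

  data.frame(file = x, year = as.integer(year), quarter = quarter,
             stringsAsFactors = FALSE)
}

input <- c("aaaaaa-f2q09-bbbbb",
           "aaaaaa-f2q2008-bbbbb",
           "aaaaaa-f4q-2008-fffff",
           "f4q-aaaaa-eeeeee-2008",
           "q2-aaaaaaaaa-eeeeeee-2005",
           "aaaaaaaa-3q-2008-rrrrrrr",
           "no-period-here")  # made-up name with no year or quarter
result <- extract_period(input)
print(result)
```

The same caveat as above applies: any earlier four-digit run in a file name (for example "3333") would be picked up as the year, so the patterns should be tightened if such names occur.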

Related

Replacing Content of a column with part of that column's content

I'd like to replace the content of a column in a data frame with only a specific word in that column.
The column always looks like this:
Place(fullName='Würzburg, Germany', name='Würzburg', type='city', country='Germany', countryCode='DE')
Place(fullName='Iphofen, Deutschland', name='Iphofen', type='city', country='Germany', countryCode='DE')
I'd like to extract the city name (in this case Würzburg or Iphofen) into a new column, or replace the entire row with the name of the town. There are many different towns so having a gsub-command for every city name will be tough.
Is there a way to maybe just use a gsub and tell Rstudio to replace whatever it finds inside the first two ' '?
Might it be possible to tell it, "give me the word after name=' until the next '"?
I'm very new to using R so I'm kind of out of ideas.
Thanks a lot for any help!
I know of the gsub command, but I don't think it will be the most appropriate in this case.
Yes, with a regular expression you can do exactly that:
string <- "Place(fullName='Würzburg, Germany', name='Würzburg', type='city', country='Germany', countryCode='DE')"
city <- gsub(".*name='(.*?)'.*", "\\1", string)
The regular expression says "match any characters followed by name=', then capture any characters until the next ' and then match any additional characters". Then you replace all of that with just the captured characters ("\\1").
The parentheses mean "capture this part", and the value becomes "\\1". (You can do multiple captures, with subsequent captures being \\2, \\3, etc.)
Note the question mark in (.*?). This means "match as little as possible while still satisfying the rest of the regex". If you don't include the question mark, the regular expression will match "greedily" and you will capture the entire rest of the line instead of just the city since that would also satisfy the regular expression.
More about regular expression (specific to R) can be found here
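The difference between the lazy and greedy capture can be seen side by side in a small sketch. It uses the second example row from the question, passes perl = TRUE so that the lazy quantifier follows PCRE semantics (an assumption made here for predictability; the answer above uses the default engine), and adds a third variant with a negated character class that needs no lazy quantifier at all.

```r
x <- "Place(fullName='Iphofen, Deutschland', name='Iphofen', type='city', country='Germany', countryCode='DE')"

# Lazy capture: stops at the first ' after name='
lazy <- sub(".*name='(.*?)'.*", "\\1", x, perl = TRUE)

# Greedy capture: runs on to the last ' in the string
greedy <- sub(".*name='(.*)'.*", "\\1", x, perl = TRUE)

# Alternative: capture only characters that are not a quote
negated <- sub(".*name='([^']*)'.*", "\\1", x)

print(lazy)
print(greedy)
print(negated)
```

Because sub is vectorised, the same call works on a whole column, for example df$city <- sub(".*name='([^']*)'.*", "\\1", df$place), where df and place are placeholder names.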

I am using R code to count for a specific word occurrence in a string. How can I update it to count if the word's synonyms are used?

I'm using the following code to find if the word "assist" is used in a string variable.
string <- c("assist")
assist <-
  (1:nrow(df) %in% c(sapply(string, grep, df$textvariable, fixed = TRUE))) + 0
sum(assist)
If I also wanted to check if synonyms such as "help" and "support" are used in the string, how can I update the code? So if either of these synonyms are used, I want to code it as 1. If neither of these words are used, I want to code it as 0. It doesn't matter if all of the words appear in the string or how many times they are used.
I tried changing it to
string<- c("assist", "help", "support")
But it looks like it is searching for strings in which all of these words are used?
I'd appreciate your help!
Thank you
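One possible way to code a row as 1 when any of the words appears is to join the words with the regex alternation operator "|" and use grepl. This is a minimal sketch under assumptions: the data frame below is made up to stand in for df, and matching is by substring, so "helpful" would also count unless the pattern is wrapped in \\b word boundaries.

```r
# Made-up data standing in for the asker's df$textvariable.
df <- data.frame(
  textvariable = c("Please assist me", "I need help",
                   "Nothing relevant", "They support us"),
  stringsAsFactors = FALSE
)

words <- c("assist", "help", "support")
pattern <- paste(words, collapse = "|")  # "assist|help|support"

# 1 if at least one of the words occurs in the row, otherwise 0
df$assist <- as.integer(grepl(pattern, df$textvariable))
print(df)
print(sum(df$assist))
```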

Removing part of strings within a column

I have a column within a data frame with a series of identifiers in, a letter and 8 numbers, i.e. B15006788.
Is there a way to remove all instances of B15.... to make them empty cells (there’s thousands of variations of numbers within each category) but keep B16.... etc?
I know that if there was just one thing I wanted to remove, like the B15, I could do:
sub("B15", "", df$col)
But I'm not sure how to remove a set number of characters/numbers (or even all subsequent characters after B15).
Thanks in advance :)
Welcome to SO! This is a job for a regular expression (regex). You can use base R, as I show here, or look into the stringr package for handy tools that are easier to understand. You can also look up regex rules to help define what you want to match. For what you ask, the following example should help:
testStrings <- c("KEEPB15", "KEEPB15A", "KEEPB15ABCDE")
gsub("B15.{2}", "", testStrings)
gsub is the base R function to replace a pattern with something else in one or a series of inputs. To test our regex I created the testStrings vector for different examples.
Breaking down the regex, "B15" is the literal text you are looking for. The "." means any single character, and "{2}" says how many of those characters to match after "B15". You can change the count as needed. If you want to remove everything after "B15", replace the pattern with "B15.*"; the "*" means "zero or more of the preceding item", so ".*" matches everything up to the end of the string.
edit: If you want to specify that "B15" must be at the start of the string, you can add "^" to the start of the pattern like so: "^B15.{2}"
https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf has info on the different regexes you can write to be more particular.
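Applied to identifiers shaped like the ones in the question, the anchored "everything after B15" pattern could look like this. The identifiers are made up for illustration, and ids stands in for df$col.

```r
# Made-up identifiers; in practice this would be df$col.
ids <- c("B15006788", "B16001234", "B15999999", "AB150000")

# "^B15" anchors the match at the start; ".*" swallows the rest of the string,
# so every identifier beginning with B15 becomes an empty string.
cleaned <- sub("^B15.*", "", ids)
print(cleaned)
```

If missing values are preferred over empty strings, ids[grepl("^B15", ids)] <- NA is an alternative.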

r Regular expression for extracting UK postcode from an address is not ordered

I'm trying to extract UK postcodes from address strings in R, using the regular expression provided by the UK government here.
Here is my function:
address_to_postcode <- function(addresses) {
  # 1. Convert addresses to upper case
  addresses <- toupper(addresses)
  # 2. Regular expression for UK postcodes:
  pcd_regex <- "([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) {0,1}[0-9][A-Za-z]{2})"
  # 3. Check if a postcode is present in each address or not (return TRUE if present, else FALSE)
  present <- grepl(pcd_regex, addresses)
  # 4. Extract postcodes matching the regular expression for a valid UK postcode
  postcodes <- regmatches(addresses, regexpr(pcd_regex, addresses))
  # 5. Return NA where an address does not contain a (valid format) UK postcode
  postcodes_out <- list()
  postcodes_out[present] <- postcodes
  postcodes_out[!present] <- NA
  # 6. Return the results in a vector (should be same length as input vector)
  return(do.call(c, postcodes_out))
}
According to the guidance document, the logic this regular expression looks for is as follows:
"GIR 0AA" OR
One letter followed by either one or two numbers OR
One letter followed by a second letter that must be one of ABCDEFGHJKLMNOPQRSTUVWXY (i.e..not I) and then followed by either one or two numbers OR
One letter followed by one number and then another letter OR
A two part post code where the first part must be One letter followed by a second letter that must be one of ABCDEFGHJKLMNOPQRSTUVWXY (i.e..not I) and then followed by one number and optionally a further letter after that
AND The second part (separated by a space from the first part) must be One number followed by two letters.
A combination of upper and lower case characters is allowed.
Note: the length is determined by the regular expression and is between 2 and 8 characters.
My problem is that this logic is not completely preserved when using the regular expression without the ^ and $ anchors (as I have to do in this scenario because the postcode could be anywhere within the address strings); what I'm struggling with is how to preserve the order and number of characters for each segment in a partial (as opposed to complete) string match.
Consider the following example:
> address_to_postcode("1A noplace road, random city, NR1 2PK, UK")
[1] "NR1 2PK"
According to the logic in the guideline, the second letter in the postcode cannot be 'z' (and there are some other exclusions too); however look what happens when I add a 'z':
> address_to_postcode("1A noplace road, random city, NZ1 2PK, UK")
[1] "Z1 2PK"
... whereas in this case I would expect the output to be NA.
Adding the anchors (for a different usage case) doesn't seem to help as the 'z' is still accepted even though it is in the wrong place:
> grepl("^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) {0,1}[0-9][A-Za-z]{2})$", "NZ1 2PK")
[1] TRUE
Two questions:
1. Have I misunderstood the logic of the regular expression, and
2. If not, how can I correct it (i.e. why aren't the specified letter and character ranges exclusive to their position within the regular expression)?
Edit
Since posting this answer, I dug deeper into the UK government's regex and found even more problems. I posted another answer here that describes all the issues and provides alternatives to their poorly formatted regex.
Note
Please note that I'm posting the raw regex here. You'll need to escape certain characters (like backslashes \) when porting to R.
Issues
You have many issues here, all of which are caused by whoever created the document you're retrieving your regex from or the coder that created it.
1. The space character
My guess is that when you copied the regular expression from the link you provided it converted the space character into a newline character and you removed it (that's exactly what I did at first). You need to, instead, change it to a space character.
^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})$
(The space to restore is the one immediately before the final [0-9][A-Za-z]{2}.)
2. Boundaries
You need to remove the anchors ^ and $ as these indicate start and end of line. Instead, wrap your regex in (?:) and place a \b (word boundary) on either end as the following shows. In fact, the regex in the documentation is incorrect (see Side note for more information) as it will fail to anchor the pattern properly.
See regex in use here
\b(?:([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2}))\b
(Added: \b(?: at the very start and )\b at the very end.)
3. Character class oversight
There's a missing - in the character class as pointed out by @deadcrab in his answer here.
\b(?:([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2}))\b
(The restored - is in the first character class of the third alternative: [AZa-z] becomes [A-Za-z].)
4. They made the wrong character class optional!
In the documentation it clearly states:
A two part post code where the first part must be:
One letter followed by a second letter that must be one of ABCDEFGHJKLMNOPQRSTUVWXY (i.e..not I) and then followed by one number and optionally a further letter after that
They made the wrong character class optional!
\b(?:([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2}))\b
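(In the last alternative, [A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z], the class made optional is [0-9]?; it should be the trailing [A-Za-z] that is optional.)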
5. The whole thing is just awful...
There are so many things wrong with this regex that I just decided to rewrite it. It can very easily be simplified to perform a fraction of the steps it currently takes to match text.
\b(?:[A-Za-z][A-HJ-Ya-hj-y]?[0-9][0-9A-Za-z]? [0-9][A-Za-z]{2}|[Gg][Ii][Rr] 0[Aa]{2})\b
Answer
As mentioned in the comments below my answer, some postcodes are missing the space character. For missing spaces in the postcodes (e.g. NR12PK), simply add a ? after the spaces as shown in the regex below:
\b(?:[A-Za-z][A-HJ-Ya-hj-y]?[0-9][0-9A-Za-z]? ?[0-9][A-Za-z]{2}|[Gg][Ii][Rr] ?0[Aa]{2})\b
(The added ? follows each of the two spaces.)
You may also shorten the regex above as follows and use a case-insensitive flag (in R, ignore.case = TRUE in base functions such as regexpr, or regex(pattern, ignore_case = TRUE) in stringr, depending on the method used):
\b(?:[A-Z][A-HJ-Y]?[0-9][0-9A-Z]? ?[0-9][A-Z]{2}|GIR ?0A{2})\b
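Ported to R, the shortened pattern could be used roughly as follows. This is a sketch under assumptions: extract_postcode() is a hypothetical name, the backslashes are doubled for R string syntax, perl = TRUE is passed so that \b and (?:) follow PCRE semantics, and the third address is made up to show the no-space case. It returns NA where no postcode-shaped text is found, so the output has the same length as the input.

```r
# Hypothetical helper: first postcode-shaped match per address, NA if none.
extract_postcode <- function(addresses) {
  pattern <- "\\b(?:[A-Z][A-HJ-Y]?[0-9][0-9A-Z]? ?[0-9][A-Z]{2}|GIR ?0A{2})\\b"
  m <- regexpr(pattern, addresses, ignore.case = TRUE, perl = TRUE)
  out <- rep(NA_character_, length(addresses))
  # regmatches() returns only the successful matches, in order
  out[as.integer(m) != -1L] <- regmatches(addresses, m)
  out
}

addresses <- c("1A noplace road, random city, NR1 2PK, UK",
               "1A noplace road, random city, NZ1 2PK, UK",
               "flat 2, somewhere, nr12pk")  # made-up address without a space
postcodes <- extract_postcode(addresses)
print(postcodes)
```

The word boundary is what stops "NZ1 2PK" from being reported as "Z1 2PK": there is no boundary between N and Z, so the match cannot start at the Z.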
Note
Please note that regular expressions only validate the possible format(s) of a string and cannot actually identify whether or not a postcode legitimately exists. For this, you should use an API. There are also some edge-cases where this regex will not properly match valid postcodes. For a list of these postcodes, please see this Wikipedia article.
The regex below additionally matches the following (make it case-insensitive to match lowercase variants as well):
British Overseas Territories
British Forces Post Office
Although they've recently changed it to align with the British postcode system to BF, followed by a number (starting with BF1), they're considered optional alternative postcodes
Special cases outlined in that article (as well as SAN TA1 - a valid postcode for Santa!)
See this regex in use here.
\b(?:(?:[A-Z][A-HJ-Y]?[0-9][0-9A-Z]?|ASCN|STHL|TDCU|BBND|[BFS]IQ{2}|GX11|PCRN|TKCA) ?[0-9][A-Z]{2}|GIR ?0A{2}|SAN ?TA1|AI-?[0-9]{4}|BFPO[ -]?[0-9]{2,3}|MSR[ -]?1(?:1[12]|[23][135])0|VG[ -]?11[1-6]0|[A-Z]{2} ? [0-9]{2}|KY[1-3][ -]?[0-2][0-9]{3})\b
I would also recommend anyone implementing this answer to read this StackOverflow question titled UK Postcode Regex (Comprehensive).
Side note
The documentation you linked to (Bulk Data Transfer: Additional Validation for CAS Upload - Section 3. UK Postcode Regular Expression) actually has an improperly written regular expression.
As mentioned in the Issues section, they should have:
Wrapped the entire expression in (?:) and placed the anchors around the non-capturing group. Their regular expression, as it stands, will fail in some cases as seen here.
The regular expression is also missing - in one of the character classes
It also made the wrong character class optional.
Here is my regular expression (Python):
import re
txt = "0288, Bishopsgate, London Borough of Tower Hamlets, London, Greater London, England, EC2M 4QP, United Kingdom"
matches = re.findall(r'[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}', txt)

regular expression for date with Starting and Ending date

I am using the regular expression of the date for the format "MM/DD/YYYY" like
"^(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])[- /.](19|20)\d\d$"
It's working fine, no problem. Now I want to limit the year to between 1950 and 2050. How can I do this? Can anyone help me?
So the answer depends on how you want to accomplish the task.
Your current Regex search pattern is going to match on most dates in the format "MM/DD/YYYY" in the 20th and 21st century. So one approach is to loop through the resulting matches, which are represented as string values at this point, and parse each string into a DateTime. Then you can do some range validation checking.
(Note: I removed the beginning ^ and ending $ from your original to make my example work)
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

string input = "This is one date 07/04/1776 and this is another 12/07/1941. Today is 08/10/2019.";
string pattern = "(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])[- /.](19|20)\\d\\d";
List<DateTime> list = new List<DateTime>();
foreach (Match match in Regex.Matches(input, pattern))
{
    Console.WriteLine(match.Value);
    DateTime result;
    // Parse the matched text, then keep it only if the year is in range
    if (DateTime.TryParse(match.Value, out result))
    {
        if (result.Year >= 1950 && result.Year <= 2050)
        {
            list.Add(result);
        }
    }
}
Console.WriteLine("Number of valid dates: {0}", list.Count);
This code outputs the following. Note that 1776 is not matched; the other two dates are matched, but only the last one is added to the list.
12/07/1941
08/10/2019
Number of valid dates: 1
Although this approach has some drawbacks, such as looping over the results a second time to try and do the range validation, there are some advantages as well.
The built-in DateTime methods in the framework are easier to deal with, rather than constantly adjusting the Regex search pattern as your acceptable range can move over time.
By range checking afterward, you could also simplify your Regex search pattern to be more inclusive, perhaps even getting all dates.
A simpler Regex search pattern is easier to maintain, and also makes clear the intent of the code. Regex can be confusing and tricky to decipher the meaning, especially for less experienced coders.
Complex Regex search patterns can introduce subtle bugs. Make sure you have good unit tests wrapped around your code.
Of course, your other approach is to adjust the Regex search pattern so that you don't have to parse and check afterwards. In most cases this is going to be the best option. The search pattern then never returns values outside the range, so you don't have to loop or do any additional checking at that point. Just remember those unit tests!
As @skywalker pointed out in his answer, this pattern should work for you.
string pattern = "(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])[- /.](19[5-9][0-9]|20[0-4][0-9]|2050)";
Years from 1950 to 2050, both inclusive, can be matched using 19[5-9][0-9]|20[0-4][0-9]|2050
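The year alternation can be checked at the boundaries with a few test dates. The sketch below is written in R (not the C# of the answer above) to stay with the language of the rest of this page; the test dates are made up, and the ^ and $ anchors from the original question are kept so that each string must be a complete date.

```r
pattern <- paste0(
  "^(0[1-9]|1[012])[- /.]",            # month 01-12
  "(0[1-9]|[12][0-9]|3[01])[- /.]",    # day 01-31
  "(19[5-9][0-9]|20[0-4][0-9]|2050)$"  # year 1950-2050
)

dates <- c("12/07/1941", "12/31/1949", "01/01/1950",
           "08/10/2019", "12/31/2050", "01/01/2051")
in_range <- grepl(pattern, dates)
print(data.frame(date = dates, in_range = in_range))
```

Like the original pattern, this only checks the shape of the date; it still accepts impossible dates such as 02/31/2000.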
