Formatting UK Postcodes in R

I am trying to format UK postcodes in R that arrive as a vector of inconsistently formatted strings.
For example, I have the following postcodes:
postcodes<-c("IV41 8PW","IV408BU","kY11..4hJ","KY1.1UU","KY4 9RW","G32-7EJ")
How do I write generic code that converts the entries of the above vector into:
c("IV41 8PW","IV40 8BU","KY11 4HJ","KY1 1UU","KY4 9RW","G32 7EJ")
That is, the first part of the postcode is separated from the second part by a single space, and all letters are capitals.
EDIT: the second part of the postcode is always the last 3 characters (a number followed by two letters).

I couldn't come up with a smart regex solution so here is a split-apply-combine approach.
sapply(strsplit(sub('^(.*?)(...)$', '\\1:\\2', postcodes), ':', fixed = TRUE), function(x) {
paste0(toupper(trimws(x, whitespace = '[.\\s-]')), collapse = ' ')
})
#[1] "IV41 8PW" "IV40 8BU" "KY11 4HJ" "KY1 1UU" "KY4 9RW" "G32 7EJ"
The logic here is that we insert a : (or any character that is not in the data) in the string between the 1st and 2nd part. Split the string on :, remove unnecessary characters, get it in upper case and combine it in one string.
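To make the three steps visible, here is a minimal sketch that stores each intermediate result. The variable names (marked, parts, cleaned) are only illustrative, and the trimming uses a plain gsub rather than trimws, which is a swapped-in but equivalent step for this data.

```r
postcodes <- c("IV41 8PW", "IV408BU", "kY11..4hJ")

# Step 1: mark the boundary before the last three characters with ":".
marked <- sub("^(.*?)(...)$", "\\1:\\2", postcodes)

# Step 2: split each string on the marker.
parts <- strsplit(marked, ":", fixed = TRUE)

# Step 3: strip leading/trailing dots, spaces and hyphens from each part,
# upper-case it, and join the two parts with a single space.
cleaned <- vapply(parts, function(p) {
  p <- gsub("^[.[:space:]-]+|[.[:space:]-]+$", "", p)
  paste(toupper(p), collapse = " ")
}, character(1))

marked
cleaned
```

Looking at marked shows where the separator landed before any cleaning happens, which helps when an input does not come out as expected.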

One approach:
1. Convert to uppercase.
2. Extract the alphanumeric characters.
3. Paste them back together with a space before the last three characters.
The code would then be:
library(stringr)
postcodes<-c("IV41 8PW","IV408BU","kY11..4hJ","KY1.1UU","KY4 9RW","G32-7EJ")
postcodes <- str_to_upper(postcodes)
sapply(str_extract_all(postcodes, "[:alnum:]"), function(x)paste(paste0(head(x,-3), collapse = ""), paste0(tail(x,3), collapse = "")))
# [1] "IV41 8PW" "IV40 8BU" "KY11 4HJ" "KY1 1UU" "KY4 9RW" "G32 7EJ"

You can remove everything that is not a word character with \\W (or [^[:alnum:]_]) and then insert a space before the last 3 characters by matching (.{3})$ and replacing it with a space followed by \\1.
sub("(.{3})$", " \\1", toupper(gsub("\\W+", "", postcodes)))
#sub("(...)$", " \\1", toupper(gsub("\\W+", "", postcodes))) #Alternative
#sub("(?=.{3}$)", " ", toupper(gsub("\\W+", "", postcodes)), perl=TRUE) #Alternative
#[1] "IV41 8PW" "IV40 8BU" "KY11 4HJ" "KY1 1UU" "KY4 9RW" "G32 7EJ"
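If this is needed in more than one place, the same idea could be wrapped in a small function. The sketch below is only an illustration: the name format_postcode and the rule of returning NA for inputs with fewer than 5 alphanumeric characters are assumptions, not part of the answer above. It also strips underscores, which \\W would keep.

```r
# Hypothetical helper (name and NA rule are assumptions).
format_postcode <- function(x) {
  # Keep letters and digits only, then upper-case.
  cleaned <- toupper(gsub("[^[:alnum:]]+", "", x))
  # Insert a space before the final three characters.
  out <- sub("(.{3})$", " \\1", cleaned)
  # Assumption: fewer than 5 characters cannot be a complete postcode.
  out[which(nchar(cleaned) < 5)] <- NA_character_
  out
}

postcodes <- c("IV41 8PW", "IV408BU", "kY11..4hJ", "KY1.1UU", "KY4 9RW", "G32-7EJ")
format_postcode(postcodes)
```

This only normalises the layout; it does not check that the result is a real postcode.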

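# Option 1 using regex:
# Upper-case, collapse runs of non-word characters to one space,
# then insert a space before the final digit-letter-letter group
# when the separator is missing.
res1 <- gsub(
  "(\\w+)(\\d[[:upper:]]\\w+$)",
  "\\1 \\2",
  gsub("\\W+", " ", toupper(postcodes))
)
# Option 2 using substrings:
# Same clean-up, then paste everything before the last three
# characters to the last three characters.
res2 <- vapply(
  trimws(gsub("\\W+", " ", toupper(postcodes))),
  function(ir) {
    paste(
      trimws(substr(ir, 1, nchar(ir) - 3)),
      substr(ir, nchar(ir) - 2, nchar(ir))
    )
  },
  character(1),
  USE.NAMES = FALSE
)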

Related

Extract portion of string with punctuation

I have a string:
string <- "newdatat.scat == \"RDS16\" ~ \"Asthma\","
and I want to extract separately:
RDS16
Asthma
What I've tried so far is:
extract <- str_extract(string,'~."(.+)')
but I am only able to get:
~ \"Asthma\",
If you have an answer, can you also kindly explain the regex behind it? I'm having a hard time converting string patterns to regex.
If you just need to extract sections surrounded by ", then you can use something like the following. The pattern ".*?" matches first ", then .*? meaning as few characters as possible, before finally matching another ". This will get you the strings including the " double quotes; you then just have to remove the double quotes to clean up.
Note that str_extract_all is used to return all matches, and that it returns a list of character vectors so we need to index into the list before removing the double quotes.
library(stringr)
string <- "newdatat.scat == \"RDS16\" ~ \"Asthma\","
str_extract_all(string, '".*?"') %>%
`[[`(1) %>%
str_remove_all('"')
#> [1] "RDS16" "Asthma"
Created on 2021-06-21 by the reprex package (v1.0.0)
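Since the question asks for the regex to be explained, a small comparison of the greedy and the lazy quantifier may help. This is only a sketch in base R; the variable names greedy and lazy are illustrative.

```r
string <- "newdatat.scat == \"RDS16\" ~ \"Asthma\","

# Greedy: .* grabs as much as possible, so the match runs from the
# first double quote to the last one.
greedy <- regmatches(string, regexpr('".*"', string, perl = TRUE))

# Lazy: .*? grabs as little as possible, so the match stops at the
# next double quote.
lazy <- regmatches(string, regexpr('".*?"', string, perl = TRUE))

greedy
lazy
```

The greedy version returns one long match spanning both quoted values, while the lazy version returns only the first quoted value, which is why the lazy form is the one to repeat over the string.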
Base R solutions:
# Solution 1:
# Extract the quoted substrings (quotes still attached):
# dirtyStrings => list with one character vector per input string
dirtyStrings <- regmatches(string, gregexpr('".*?"', string))
# Strip the quotes from every match and flatten the list:
# result => character vector
result <- gsub('"', '', unlist(dirtyStrings), fixed = TRUE)
# Solution 2:
# Same as above, less generic -- assumes all strings
# will follow the same pattern: result => character vector
result <- unlist(
  lapply(
    # Drop everything up to "==", then split on "~"
    strsplit(gsub(".*==", "", string), "~"),
    # Remove every non-word character (quotes, spaces, commas)
    function(x) gsub("\\W+", "", x)
  )
)
You can capture the two values in two separate columns.
In stringr use str_match -
string <- "newdatat.scat == \"RDS16\" ~ \"Asthma\","
stringr::str_match(string, '"(\\w+)" ~ "(\\w+)"')[, -1, drop = FALSE]
# [,1] [,2]
#[1,] "RDS16" "Asthma"
Or in base R use strcapture
strcapture('"(\\w+)" ~ "(\\w+)"', string,
proto = list(col1 = character(), col2 = character()))
# col1 col2
#1 RDS16 Asthma
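The original attempt returned the whole match because extraction functions return everything the pattern matched, not just the part in parentheses. A minimal base R sketch of pulling out only the capture groups, assuming the same string as above (m and values are illustrative names):

```r
string <- "newdatat.scat == \"RDS16\" ~ \"Asthma\","

# regexec() records the position of the full match and of each
# parenthesised group; regmatches() turns those into strings.
m <- regmatches(string, regexec('"([^"]+)" ~ "([^"]+)"', string))

# Element 1 is the full match, the remaining elements are the groups.
values <- m[[1]][-1]
values
```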

Remove first place comma and space between two texts and the last comma or space

I have joined multiple columns from a data frame into a single column. Because of the formatting I now have some issues: I want to remove a comma if it is in the first or last position, and I also want to delete the spaces between the joined texts.
E.g. if the combined string is:
, this is test, dd,pqr, then it should be converted to this is test,dd,pqr
df <- as.data.frame(rbind(c('11061002','11862192','11083069'),
c(" ",'1234567','452589'),
c("fs"," ","dd"," ")))
df$f1 <-paste0(df$V1,
',',
" ",
df$V2,
',',
" ",
df$V3,',',df$V4)
df_1 <- as.data.frame(df[,c(5)])
names(df_1)[1] <-"f1"
expected output is :
11061002,11862192,11083069,11061002 (No spaces)
1234567,452589
fs,dd
Using double gsub :
gsub(',{2,}', ',', gsub('^,|,$| ', '', trimws(df_1$f1)))
#[1] "11061002,11862192,11083069,11061002" "1234567,452589" "fs,dd"
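,{2,} - replaces 2 or more consecutive commas with a single comma.
^, - removes a comma at the start.
,$ - removes a comma at the end.
The space in the inner pattern removes every space from the string; trimws() first strips leading and trailing whitespace so that ^, and ,$ can match.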
It seems that you have a double space in the third row. One way to approach this is to use apply with MARGIN = 1 to do a rowwise operation (in your case, paste), i.e.
apply(df, 1, function(i)paste(i[!i %in% c(' ', ' ')], collapse = ','))
#[1] "11061002,11862192,11083069" "1234567,452589" "fs,dd"
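A variation on the rowwise idea that does not depend on how many spaces a blank cell contains: trim each cell and keep only the non-empty ones. This is a sketch under the assumption that blank cells hold nothing but whitespace; the three-column data frame below is a simplified stand-in for the one in the question.

```r
# Simplified stand-in data (assumption: blanks are whitespace only).
df <- data.frame(
  V1 = c("11061002", " ", "fs"),
  V2 = c("11862192", "1234567", "  "),
  V3 = c("11083069", "452589", "dd"),
  stringsAsFactors = FALSE
)

joined <- apply(df, 1, function(row) {
  vals <- trimws(row)                        # strip surrounding whitespace
  paste(vals[nzchar(vals)], collapse = ",")  # drop empty cells, then join
})
joined <- unname(joined)
joined
```

Building the string from the cells avoids having to clean up stray commas afterwards.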

gsub: Keep a given character only if between two digits/letters

I have a list of addresses that I would like to split into two arrays:
Address line (keeping special characters such as "-" whenever between two letters - c.f. text.2)
House number (keeping special characters such as "-" whenever between two digits)
Here is an example:
text.1 <- "CALLE COMPOSITOR LEHMBERG RUIZ 19-21"
text.2 <- "CALLE COMPOSITOR LEHMBERG-RUIZ 19-21"
To extract the house numbers, I tried using gsub("[^0-9\\-]", "", x) which works fine for text.1 but not as well as desired for text.2:
> gsub("[^0-9\\-]", "", text.1)
[1] "19-21"
> gsub("[^0-9\\-]", "", text.2)
[1] "-19-21"
To extract the address line I used gsub("[0-9]", "", x) yielding a similar problem.
I could circumvent this issue with the following code:
ifelse( substr( gsub("[^0-9\\-]", "", x ), 1, 1 ) == "-" ,
substr( gsub("[^0-9\\-]", "", x), 2, nchar( gsub("[^0-9\\-]", "", x) )
)
, gsub("[^0-9\\-]", "", x)
)
yielding "19-21" for both x = text.1 and x = text.2. However, as one can tell it is not very elegant.
My question would be if there is an "elegant" way to solve this issue (e.g. using gsub in a cleverer fashion)?
We can use a regular expression to SKIP when the pattern is true and remove all other characters
gsub("(\\d+)-(\\d+)(*SKIP)(*F)|.", "", text.1, perl = TRUE)
#[1] "19-21"
gsub("(\\d+)-(\\d+)(*SKIP)(*F)|.", "", text.2, perl = TRUE)
#[1] "19-21"
I would advise using str_extract instead of gsub in your case. You could do as follows:
library(stringr)
str_extract(text.1,"[0-9]{1,3}\\-[0-9]{1,3}")
[1] "19-21"
str_extract(text.2,"[0-9]{1,3}\\-[0-9]{1,3}")
[1] "19-21"
str_extract(text.1,"[^0-9][A-Z\\-\\s]+")
[1] "CALLE COMPOSITOR LEHMBERG RUIZ "
str_extract(text.2,"[^0-9][A-Z\\-\\s]+")
[1] "CALLE COMPOSITOR LEHMBERG-RUIZ "
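For completeness, both pieces can also be obtained in base R with one pattern: extract the house number, then delete that same match to get the address line. This is a sketch that assumes every address contains exactly one house number made of digits optionally joined by hyphens; the names house and street are illustrative.

```r
text.1 <- "CALLE COMPOSITOR LEHMBERG RUIZ 19-21"
text.2 <- "CALLE COMPOSITOR LEHMBERG-RUIZ 19-21"
x <- c(text.1, text.2)

# Digits, optionally followed by one or more "-digits" groups.
pattern <- "\\d+(-\\d+)*"

# House number: the first match of the pattern in each string.
house <- regmatches(x, regexpr(pattern, x))

# Address line: remove that match and trim the leftover whitespace.
street <- trimws(sub(pattern, "", x))

house
street
```

Note that regmatches drops elements without a match, so house would be shorter than x if some address had no number.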

Extract words that include numbers in R

I have words that include numbers within them, or that begin or end with numbers. How do I extract only those?
s <- c("An ex4mple", "anothe 3xample", "A thir7", "And sentences w1th w0rds as w3ll")
Expected output:
c("ex4mple", "3xample", "thir7", "w1th w0rds w3ll")
Words could include more than one number.
We can split the strings on whitespace into a list and loop through the elements with sapply. For each element, grep matches the words that consist only of letters from start (^) to end ($); with invert = TRUE and value = TRUE it returns the words that do not fit that pattern, which we then paste together.
sapply(strsplit(s, "\\s+"), function(x)
paste(grep("^[A-Za-z]+$", x, invert = TRUE, value = TRUE), collapse=' '))
#[1] "ex4mple" "3xample" "thir7" "w1th w0rds w3ll"
Or we can use str_extract_all from stringr:
library(stringr)
sapply(str_extract_all(s, '[A-Za-z]*[0-9]+[A-Za-z]*'), paste, collapse=' ')
#[1] "ex4mple" "3xample" "thir7" "w1th w0rds w3ll"
data
s <- c("An ex4mple", "anothe 3xample", "A thir7", "And sentences w1th w0rds as w3ll")
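The question notes that a word may contain more than one number. A word such as a1b2c has digits separated by letters, and a pattern of the form letters-digits-letters would split it into two matches. A minimal base R sketch that allows any mix of letters and digits around at least one digit (the extra test string and the names words and res are assumptions for illustration):

```r
s <- c("An ex4mple", "anothe 3xample", "A thir7",
       "And sentences w1th w0rds as w3ll", "mixed a1b2c here")

# Any run of letters/digits that contains at least one digit.
words <- regmatches(s, gregexpr("[[:alnum:]]*[0-9][[:alnum:]]*", s))

# Collapse the matches of each input string into one string.
res <- vapply(words, paste, character(1), collapse = " ")
res
```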

How to remove duplicate words in a certain pattern from a string in R

I aim to remove duplicate words only in parentheses from string sets.
a = c( 'I (have|has|have) certain (words|word|worded|word) certain',
'(You|You|Youre) (can|cans|can) do this (works|works|worked)',
'I (am|are|am) (sure|sure|surely) you know (what|when|what) (you|her|you) should (do|do)' )
What I want to get is just like this
a
[1]'I (have|has) certain (words|word|worded) certain'
[2]'(You|Youre) (can|cans) do this (works|worked)'
[3]'I (am|are) pretty (sure|surely) you know (what|when) (you|her) should (do|)'
In order to get the result, I used a code like this
a = gsub('\\|', " | ", a)
a = gsub('\\(', "( ", a)
a = gsub('\\)', " )", a)
a = vapply(strsplit(a, " "), function(x) paste(unique(x), collapse = " "), character(1L))
However, it resulted in undesirable outputs.
a
[1] "I ( have | has ) certain words word worded"
[2] "( You | Youre ) can cans do this works worked"
[3] "I ( am | are ) sure surely you know what when her should do"
Why did my code remove parentheses located in the latter part of strings?
What should I do for the result I want?
We can use gsubfn. The idea is to select the characters inside the brackets by matching the opening bracket (\\(, escaped because it is a metacharacter) followed by one or more characters that are not a closing bracket ([^)]+), captured as a group. In the replacement function, we split the captured characters (x) with strsplit, unlist the result, keep the unique elements, and paste them back together.
library(gsubfn)
gsubfn("\\(([^)]+)", ~paste0("(", paste(unique(unlist(strsplit(x,
"[|]"))), collapse="|")), a)
#[1] "I (have|has) certain (words|word|worded) certain"
#[2] "(You|Youre) (can|cans) do this (works|worked)"
#[3] "I (am|are) (sure|surely) you know (what|when) (you|her) should (do)"
The answer above is more straightforward, but you can also try:
library(stringi)
library(stringr)
a_new <- gsub("[|]","-",a) # replace | because it causes issues during the replacement later
a1 <- str_extract_all(a_new,"[(](.*?)[)]") # extract the "units"
# some magic using stringi::stri_extract_all_words()
a2 <- unlist(lapply(a1,function(x) unlist(lapply(stri_extract_all_words(x), function(y) paste(unique(y),collapse = "|")))))
# prepare replacement
names(a2) <- unlist(a1)
# replacement and finalization
str_replace_all(a_new, a2)
[1] "I (have|has) certain (words|word|worded) certain"
[2] "(You|Youre) (can|cans) do this (works|worked)"
[3] "I (am|are) (sure|surely) you know (what|when) (you|her) should (do)"
The idea is to extract the words within the brackets as unit. Then remove the duplicates and replace the old unit with the updated.
A longer but more explicit attempt:
a = c( 'I (have|has|have) certain (words|word|worded|word) certain',
'(You|You|Youre) (can|cans|can) do this (works|works|worked)',
'I (am|are|am) (sure|sure|surely) you know (what|when|what) (you|her|you) should (do|do)' )
trim <- function(x) gsub("^\\s+|\\s+$", "", x)
# blank output
new_a <- c()
for (sentence in seq_along(a)) {
  # split on parentheses and spaces so each bracketed group is one token
  split <- trim(unlist(strsplit(a[sentence], "[( )]")))
  newsentence <- c()
  for (i in split) {
    # unique alternatives of the current token
    j1 <- unique(trim(unlist(strsplit(gsub("\\|", " ", i), " "))))
    if (length(j1) == 0) next
    if (length(j1) > 1) {
      # several alternatives left: rebuild the bracketed group
      newsentence <- c(newsentence, paste0("(", paste(j1, collapse = "|"), ")"))
    } else {
      # a plain word, or a group reduced to a single alternative
      newsentence <- c(newsentence, j1[1])
    }
  }
  newsentence <- paste(newsentence, collapse = " ")
  print(newsentence)
  new_a <- c(new_a, newsentence)
}
# [1] "I (have|has) certain (words|word|worded) certain"
# [2] "(You|Youre) (can|cans) do this (works|worked)"
# [3] "I (am|are) (sure|surely) you know (what|when) (you|her) should do"
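The group-by-group idea used in the answers above can also be sketched in base R without extra packages, by locating each bracketed group and replacing it in place. The helper name dedupe_group is hypothetical, and the sketch assumes groups are never nested.

```r
a <- c("I (have|has|have) certain (words|word|worded|word) certain",
       "(You|You|Youre) (can|cans|can) do this (works|works|worked)")

# Hypothetical helper: de-duplicate the alternatives of one "(x|y|x)" group.
dedupe_group <- function(g) {
  inner <- substr(g, 2, nchar(g) - 1)                      # drop "(" and ")"
  alts <- unique(strsplit(inner, "|", fixed = TRUE)[[1]])  # unique alternatives
  paste0("(", paste(alts, collapse = "|"), ")")
}

# Locate every bracketed group, then overwrite each with its cleaned form.
m <- gregexpr("\\([^)]+\\)", a)
regmatches(a, m) <- lapply(
  regmatches(a, m),
  function(groups) vapply(groups, dedupe_group, character(1), USE.NAMES = FALSE)
)
a
```

Because unique() is applied inside each group only, words outside the brackets and the brackets themselves are left untouched.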

Resources