R extract variables with regex

I have a character column that needs to be separated with regex. Here is an example of the raw data:
data_raw <- tribble(
~census_geo,
"Division No. 1, Subd. V (SNO), Newfoundland and Labrador",
"Portugal Cove South (T), Newfoundland and Labrador",
"Division No. 1, Subd. U, Reserve (SNO), Newfoundland and Labrador")
We have three columns to be extracted. The first is everything before the brackets. The second column is the word inside the bracket. The last column is everything after the last comma (or everything after the word in the bracket). Here is an example of what the clean output would look like:
data_clean <- tribble(
~csd_name, ~csd_type, ~province,
"Division No. 1, Subd. V", "SNO", "Newfoundland and Labrador",
"Portugal Cove South", "T", "Newfoundland and Labrador",
"Division No. 1, Subd. U, Reserve", "SNO", "Newfoundland and Labrador")
I can extract the second column (the word inside the brackets) with this code:
data_raw %>%
mutate(csd_type = str_extract(census_geo, pattern = "(?<=\\().*(?=\\))"))
But I can't get the other two columns.
Any help would be greatly appreciated.

You can use tidyr's extract and pass regular expressions to extract the relevant text in different columns.
tidyr::extract(data_raw, census_geo, c('csd_name', 'csd_type', 'province'),
'(.*) \\((.*)\\),\\s*(.*)')
# csd_name csd_type province
# <chr> <chr> <chr>
#1 Division No. 1, Subd. V SNO Newfoundland and Labrador
#2 Portugal Cove South T Newfoundland and Labrador
#3 Division No. 1, Subd. U, Reserve SNO Newfoundland and Labrador
You can achieve the same result in base R with strcapture :
strcapture('(.*) \\((.*)\\),\\s*(.*)', data_raw$census_geo,
proto = list(csd_name = character(), csd_type = character(),
province = character()))
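To see how the three capture groups behave, here is a minimal base R sketch on the question's sample strings plus one string with no brackets. That last row is an assumption added for illustration and is not part of the original data. Rows the pattern does not match should come back as NA in every column.

```r
census_geo <- c(
  "Division No. 1, Subd. V (SNO), Newfoundland and Labrador",
  "Portugal Cove South (T), Newfoundland and Labrador",
  "Division No. 1, Subd. U, Reserve (SNO), Newfoundland and Labrador",
  "No brackets here"  # hypothetical row, not in the original data
)

# The first (.*) is greedy: it extends as far as it can while still
# leaving " (type), province" for the rest of the pattern to match.
parts <- strcapture(
  "(.*) \\((.*)\\),\\s*(.*)",
  census_geo,
  proto = data.frame(csd_name = character(), csd_type = character(),
                     province = character(), stringsAsFactors = FALSE)
)
print(parts)
```

Checking for NA rows afterwards is a cheap way to spot inputs that do not follow the expected "name (type), province" layout.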

I know you already selected Ronak Shah's answer (which was very nice btw), but I wanted to just show an approach with tidyr's separate:
library(tidyr)
data_raw %>%
separate(
col = census_geo,
into = c('csd_name', 'csd_type', 'province'),
sep = '(\\s\\(|\\),\\s)'
)
The \\s is for the white space, the \\( for the parenthesis, and the | for splitting the two distinct patterns to look for.
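The same separator pattern can be tried out in base R with strsplit, which makes it easy to see where each string is cut. This is only a sketch on the question's three sample strings; it assumes every string contains exactly one " (" and one "), ", otherwise the number of pieces will differ from three.

```r
x <- c(
  "Division No. 1, Subd. V (SNO), Newfoundland and Labrador",
  "Portugal Cove South (T), Newfoundland and Labrador",
  "Division No. 1, Subd. U, Reserve (SNO), Newfoundland and Labrador"
)

# Split at either " (" or "), " -- the two alternatives joined by |.
pieces <- strsplit(x, "\\s\\(|\\),\\s")

# Stack the pieces into rows and name the columns.
out <- as.data.frame(do.call(rbind, pieces), stringsAsFactors = FALSE)
names(out) <- c("csd_name", "csd_type", "province")
print(out)
```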

Just in case OP is interested to see how the original approach with str_extract would work for all three separate columns using negated character classes [^)(] and [^,]:
data_raw %>%
mutate(
csd_name = str_extract(census_geo, "^[^)(]+(?=\\s)"),
csd_type = str_extract(census_geo, "(?<=\\()[^)(]+(?=\\))"),
csd_province = str_extract(census_geo, "(?<=,\\s)[^,]+$")) %>%
select(-census_geo)
# A tibble: 3 x 3
csd_name csd_type csd_province
<chr> <chr> <chr>
1 Division No. 1, Subd. V SNO Newfoundland and Labrador
2 Portugal Cove South T Newfoundland and Labrador
3 Division No. 1, Subd. U, Reserve SNO Newfoundland and Labrador

Related

Rowwise extract common substrings from two columns in a data frame

I want to match cities with regions in a data frame. The columns are a little bit messy, so I would like to extract the names of the cities / regions that appear in two columns as in the following example.
A <- c("Berlin",
"Hamburg",
"Munich",
"Stuttgart",
"Rhein Main Frankfurt",
"Hannover")
B <- c("Berlin Brandenburg",
"Hamburg",
"Munich Bayern",
"Region Stuttgart",
"Main Rhein Darmstadt",
"Wiesbaden")
The resulting column / data frame should look like this:
result <- c("Berlin",
"Hamburg",
"Munich",
"Stuttgart",
"Rhein Main",
NA
)
df <- data.frame(A, B, result)
...while it doesn't matter whether it's "Rhein Main" or "Main Rhein".
Thank you for your help!
Maybe I am missing a smart regex trick but one option would be to split strings into words and find the common words using intersect.
df$Result <- mapply(function(x, y) paste0(intersect(x, y), collapse = " "),
strsplit(df$A, '\\s+'), strsplit(df$B, '\\s+'))
df
# A B Result
#1 Berlin Berlin Brandenburg Berlin
#2 Hamburg Hamburg Hamburg
#3 Munich Munich Bayern Munich
#4 Stuttgart Region Stuttgart Stuttgart
#5 Rhein Main Frankfurt Main Rhein Darmstadt Rhein Main
#6 Hannover Wiesbaden
This returns empty string when there is no match. You can turn the empty string to NA if needed.
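A minimal sketch of that last step, using plain vectors A and B from the question rather than the data frame: empty results are replaced with NA_character_ after the intersect.

```r
A <- c("Berlin", "Hamburg", "Munich", "Stuttgart",
       "Rhein Main Frankfurt", "Hannover")
B <- c("Berlin Brandenburg", "Hamburg", "Munich Bayern",
       "Region Stuttgart", "Main Rhein Darmstadt", "Wiesbaden")

# Words shared by both strings, in the order they appear in A.
common <- mapply(function(x, y) paste(intersect(x, y), collapse = " "),
                 strsplit(A, "\\s+"), strsplit(B, "\\s+"))
common <- unname(common)

# No shared word gives "", which is turned into a missing value here.
common[common == ""] <- NA_character_
print(common)
```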

Extracting substring by positions in pipe

I would like to extract a substring from every row of the id column of a tibble. I am always interested in the region between the 1st and 3rd space of the original id. The resulting substrings, Zoe Boston and Jane Rome, would go into a new column, name.
I tried to get the positions of "spaces" in every id with str_locate_all and then use positions to use str_sub. However I cannot extract the positions correctly.
data <- tibble(id = c("#1265746 Zoe Boston 58962 st. Victory cont_1.0)", "#958463279246 Jane Rome 874593.01 musician band: XYZ 985147") ) %>%
mutate(coor = str_locate_all(id, "\\s"),
name = str_sub(id, start = coor[[1]], end = coor[[3]] ) )
You can use regex to extract what you want.
Assuming you have stored your tibble in data, you can use sub to extract 1st and 2nd word.
sub('^#\\w+\\s(\\w+\\s\\w+).*', '\\1', data$id)
#[1] "Zoe Boston" "Jane Rome"
^# - starts with hash
\\w+ - A word
\\s - Whitespace
( - start of capture group
\\w+ - A word
followed by \\s - whitespace
\\w+ - another word
) - end of capture group.
.* - remaining string.
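One thing to keep in mind with sub is that a string that does not match the pattern is returned unchanged rather than as NA. A small sketch of a guard with grepl follows; the second id is a made-up non-matching value added for illustration.

```r
ids <- c("#1265746 Zoe Boston 58962 st. Victory cont_1.0)",
         "no-hash-here")  # hypothetical id that does not fit the pattern

pattern <- "^#\\w+\\s(\\w+\\s\\w+).*"

# Keep the captured group where the pattern matches, NA elsewhere.
name <- ifelse(grepl(pattern, ids), sub(pattern, "\\1", ids), NA_character_)
print(name)
```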
The str_locate is more complex, since it first returns the position of whitespace then you need to select the end of 1st whitespace and start of 3rd and then use str_sub to extract text between those positions.
library(dplyr)
library(stringr)
library(purrr)
data %>%
mutate(coor = str_locate_all(id, "\\s"),
start = map_dbl(coor, `[`, 1) + 1,
end = map_dbl(coor, `[`, 3) - 1,
name = str_sub(id, start, end))
# A tibble: 2 x 2
# id name
# <chr> <chr>
#1 #1265746 Zoe Boston 58962 st. Victory cont_1.0) Zoe Boston
#2 #958463279246 Jane Rome 874593.01 musician band: XYZ 985147 Jane Rome
Another possible solution using stringr and purrr packages
library(stringr)
library(purrr)
library(dplyr)
data %>%
mutate(name = map_chr(str_split(id, " "), ~paste(unlist(.)[2:3], collapse = " ")))
Explanation:
in str_split(id, " ") we create a list of the terms that are separated inside id by a whitespace
map_chr is useful to take each one of these lists, and apply the following function to them: unlist the list, take the elements in positions 2 and 3 (which are the name we want) and then collapse them with a whitespace between them
Output
# A tibble: 2 x 2
# id name
# <chr> <chr>
# 1 #1265746 Zoe Boston 58962 st. Victory cont_1.0) Zoe Boston
# 2 #958463279246 Jane Rome 874593.01 musician band: XYZ 985147 Jane Rome
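For comparison, the split-and-paste idea can also be sketched in base R with strsplit and vapply, without any packages. It assumes, like the answer above, that fields are separated by single spaces, so pieces 2 and 3 are the text between the 1st and 3rd space.

```r
id <- c("#1265746 Zoe Boston 58962 st. Victory cont_1.0)",
        "#958463279246 Jane Rome 874593.01 musician band: XYZ 985147")

# Split each id on single spaces, then glue pieces 2 and 3 back together.
name <- vapply(strsplit(id, " ", fixed = TRUE),
               function(p) paste(p[2:3], collapse = " "),
               character(1))
print(name)
```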

Replace lowercase in names, not in surnames

I have a problem with a database with names of persons. I want to put the names in abbreviation but not the last names. The last name is separated from the name by a comma and the different people are separated from each other by a semicolon, like this example:
Michael, Jordan; Bird, Larry;
If the name is a single word, the code would be like this:
breve$autor <- str_replace_all(breve$autor, "[:lower:]{1,}\\;", ".\\;")
Result with this code:
Michael, J.; Bird, L.;
The problem is in compound names. With this code, the name:
Jordan, Michael Larry;
It would be:
Jordan, Michael L.;
Could someone tell me how to remove all lowercase letters that are between the comma and the semicolon, so that it looks like this:
Jordan, M.L.;
Here is another solution:
x1 <- 'Michael, Jordan; Bird, Larry;'
x2 <- 'Jordan, Michael Larry;'
gsub('([A-Z])[a-z]+(?=[ ;])', '\\1.', x1, perl = TRUE)
# [1] "Michael, J.; Bird, L.;"
gsub('([A-Z])[a-z]+(?=[ ;])', '\\1.', x2, perl = TRUE)
# [1] "Jordan, M. L.;"
Surnames are followed by a comma, while the parts of the given names are followed by a space or a semicolon. Here I use (?=[ ;]) to make sure that the character following the matched pattern is a space or a semicolon.
To remove the space between M. and L., an additional step is needed:
gsub('\\. ', '.', gsub('([A-Z])[a-z]+(?=[ ;])', '\\1.', x2, perl = TRUE))
# [1] "Jordan, M.L.;"
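The two steps can probably be folded into one pattern by letting the match also consume the space between two given names. The sketch below is a variation on the answer's regex, not the answer's own code, and the helper name abbreviate_given is made up for illustration. It assumes every name part starts with one capital letter followed by lowercase ASCII letters.

```r
# Hypothetical helper: abbreviate given names, leave surnames alone.
abbreviate_given <- function(x) {
  # Either swallow a space that is followed by another capitalised word,
  # or require (without consuming it) a semicolon right after the word.
  gsub("([A-Z])[a-z]+(?: (?=[A-Z])|(?=;))", "\\1.", x, perl = TRUE)
}

abbreviate_given("Michael, Jordan; Bird, Larry;")
abbreviate_given("Jordan, Michael Larry;")
```

Surnames are never touched because they are followed by a comma, which matches neither alternative.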
There must be a regular expression that will do this, of course. But that magic is a little beyond me. So here is an approach with simple string manipulation in a data frame using tidyverse functions.
library(stringr)
library(dplyr)
library(tidyr)
ballers <- "Michael, Jordan; Bird, Larry;"
mj <- "Jordan, Michael Larry"
c(ballers, mj) %>%
#split the players
str_split(., ";", simplify = TRUE) %>%
# remove white space
str_trim() %>%
#transpose to get players in a column
t %>%
#split again into last name and first + middle (if any)
str_split(",", simplify = TRUE) %>%
# convert to a tibble
as_tibble() %>%
# remove more white space
mutate(V2=str_trim(V2)) %>%
# remove empty rows (these can be avoided by different manipulation upstream)
filter(!V1 == "") %>%
# name the columns
rename("Last"=V1, "First_two"=V2) %>%
# separate the given names into first and middle (if any)
separate(First_two, into = c("First", "Middle"), sep = " ") %>%
# abbreviate to first letter
mutate(First_i=abbreviate(First, 1)) %>%
# abbreviate, but take into account that middle name might be missing
mutate(Middle_i=ifelse(!is.na(Middle), paste0(abbreviate(Middle, 1), "."), "")) %>%
# combine the First and middle initals
mutate(Initials=paste(First_i, Middle_i, sep=".")) %>%
# make the desired Last, F.M. vector
mutate(Final=paste(Last, Initials, sep=", "))
# A tibble: 3 x 7
Last First Middle First_i Middle_i Initials Final
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Michael Jordan NA J "" J. Michael, J.
2 Jordan Michael Larry M L. M.L. Jordan, M.L.
3 Bird Larry NA L "" L. Bird, L.
Warning message:
Expected 2 pieces. Missing pieces filled with `NA` in 2 rows [1, 3].
Much longer than a regex.
There will probably be a better way to do this, but I managed to get it to work using the stringr and tibble packages.
library(stringr)
library(tibble)
library(dplyr)  # provides pull()
names <- 'Jordan, Michael; Bird, Larry; Obama, Barack; Bush, George Walker'
df <- as_tibble(str_split(unlist(str_split(names, '; ')), ', ', simplify = TRUE))
df[, 2] <- gsub('[a-z]+', '.', pull(df[, 2]))
This code generates the tibble df, which has the following contents:
# A tibble: 4 x 2
V1 V2
<chr> <chr>
1 Jordan M.
2 Bird L.
3 Obama B.
4 Bush G. W.
The names are first split into last and given names and stored in a data frame so that the gsub() call does not operate on the last names. Then, gsub() searches the given names for runs of lowercase letters and replaces each run with a single period.
Then, you can call str_c(str_c(pull(df[, 1]), ', ', pull(df[, 2])), collapse = '; ') (or str_c(pull(unite(df, full, c('V1', 'V2'), sep = ', ')), collapse = '; ') if you already have the tidyr package loaded) to return the string "Jordan, M.; Bird, L.; Obama, B.; Bush, G. W.".
...also, did you mean Michael Jordan, not Jordan Michael? lol
Here's one that uses gsub twice. The inner one is for names with no middle names and the outer is for names that have a middle name.
x = c("Michael, Jordan; Jordan, Michael Larry; Bird, Larry;")
gsub(", ([A-Z])[a-z]+ ([A-Z])[a-z]+;", ", \\1.\\2.;", gsub(", ([A-Z])[a-z]+;", ", \\1.;", x))
#[1] "Michael, J.; Jordan, M.L.; Bird, L.;"

Merge dataframes based on regex condition

This problem involves R. I have two dataframes, represented by this minimal reproducible example:
a <- data.frame(geocode_selector = c("36005", "36047", "36061", "36081", "36085"), county_name = c("Bronx", "Kings", "New York", "Queens", "Richmond"))
b <- data.frame(geocode = c("360050002001002", "360850323001019"), jobs = c("4", "204"))
An example to help communicate the very specific operation I am trying to perform: the geocode_selector column in dataframe a contains the FIPS county codes of the five boroughs of NY. The geocode column in dataframe b is the 15-digit ID of a specific Census block. The first five digits of a geocode match a more general geocode_selector, indicating which county the Census block is located in. I want to add a column to b specifying which county each census block falls under, based on which geocode_selector each geocode in b matches with.
Generally, I'm trying to merge dataframes based on a regex condition. Ideally, I'd like to perform a full merge carrying all of the columns of a over to b and not just the county_name.
I tried something along the lines of:
b[, "county_name"] <- NA
for (i in 1:nrow(b)) {
for (j in 1:nrow(a)) {
if (grepl(a$geocode_selector[j], b$geocode[i]) == TRUE) {
b$county_name[i] <- a$county_name[j]
}
}
}
but it took an extremely long time for the large datasets I am actually processing and the finished product was not what I wanted.
Any insight on how to merge dataframes conditionally based on a regex condition would be much appreciated.
You could do this...
b$geocode_selector <- substr(b$geocode,1,5)
b2 <- merge(b, a, all.x=TRUE) #by default it will merge on common column names
b2
geocode_selector geocode jobs county_name
1 36005 360050002001002 4 Bronx
2 36085 360850323001019 204 Richmond
If you wish, you can delete the geocode_selector column from b2 with b2[,1] <- NULL
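If only one column needs to be carried over, a lookup with match is a possible alternative to merge. The sketch below rebuilds the question's two data frames and assumes the selector is always the first five characters of geocode. Unlike merge, it keeps the rows of b in their original order.

```r
a <- data.frame(
  geocode_selector = c("36005", "36047", "36061", "36081", "36085"),
  county_name = c("Bronx", "Kings", "New York", "Queens", "Richmond"),
  stringsAsFactors = FALSE
)
b <- data.frame(
  geocode = c("360050002001002", "360850323001019"),
  jobs = c("4", "204"),
  stringsAsFactors = FALSE
)

# Position in a of each block's five-character county prefix (NA if absent).
idx <- match(substr(b$geocode, 1, 5), a$geocode_selector)
b$county_name <- a$county_name[idx]
print(b)
```

Both this and the merge approach compare fixed prefixes rather than running a regex per pair of rows, which is what made the nested loop slow.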
We can use sub to create the 'geocode_selector' and then do the join
library(data.table)
setDT(a)[as.data.table(b)[, geocode_selector := sub('^(.{5}).*', '\\1', geocode)],
on = .(geocode_selector)]
# geocode_selector county_name geocode jobs
#1: 36005 Bronx 360050002001002 4
#2: 36085 Richmond 360850323001019 204
This is a great opportunity to use dplyr. I also tend to like the string handling functions in stringr, such as str_sub.
library(dplyr)
library(stringr)
# tibble() replaces the deprecated data_frame(); it is re-exported by dplyr
a <- tibble(geocode_selector = c("36005", "36047", "36061", "36081", "36085"),
county_name = c("Bronx", "Kings", "New York", "Queens", "Richmond"))
b <- tibble(geocode = c("360050002001002", "360850323001019"),
jobs = c("4", "204"))
b %>%
mutate(geocode_selector = str_sub(geocode, end = 5)) %>%
inner_join(a, by = "geocode_selector")
#> # A tibble: 2 x 4
#> geocode jobs geocode_selector county_name
#> <chr> <chr> <chr> <chr>
#> 1 360050002001002 4 36005 Bronx
#> 2 360850323001019 204 36085 Richmond

approximate string matching on condition of a match in a separate field in R

I have two dataframes from which I would like to carry out approximate string matching.
> df1
Source Name Country
A Glen fiddich United Kingdom
A Talisker dark storm United Kingdom
B johnney walker United states
D veuve clicquot brut France
E nicolas feuillatte brut France
C glen morangie united kingdom
B Talisker 54 degrees United kingdom
F Talisker dark storm United states
The second data frame:
> df2
Source Name Country
A smirnoff ice Russia
A Talisker daek strome United Kingdom
B johnney walker United states
D veuve clicquot brut Australia
E nicolea feuilate brut Italy
C glen morangie united kingdom
B Talisker 54 degrees United kingdom
The key column for the approximate matching between the two data frames is "Name". Because of the relationship between the columns, it is important to select only those approximate matches that also have a match on the "Country" column. The extract of the code I am using is below:
library(stringdist)
# Jaro-Winkler distance between every df1 name (rows) and df2 name (columns)
dist.mat <- stringdistmatrix(tolower(df1$Name), tolower(df2$Name), method = "jw",
nthread = getOption("sd_num_thread"))
min.dist <- apply(dist.mat, 1, min)
matched <- data.frame(df1$Name,
as.character(apply(dist.mat, 1, function(x) df2$Name[which(x == min(x))])),
apply(dist.mat, 1, which.min), "jw", apply(dist.mat, 1, min))
colnames(matched) <- c("to_be_matched", "closest_match", "index_closest_match",
"distance_method", "distance")
The code above only executes approximate match between df1 and df2 based on data in the "Name" column. What I want to do is have the approximate match on "Name" column selected on the condition that for the two values, there is a match on the "Country" column.
You should check out the fuzzywuzzy library for Python, which has excellent fuzzy text matching capabilities. Assuming df1 and df2 are pandas DataFrames, I would iterate through the unique countries and look for matches that pass a certain fuzz threshold score, like the following:
from fuzzywuzzy import fuzz, process
matches = []
for country in df1['Country'].unique().tolist():
    dfm1 = df1[df1['Country'] == country]
    dfm2 = df2[df2['Country'] == country]
    candidates = dfm2['Name'].tolist()
    # pair each name with its best candidate (None if nothing scores >= 90)
    matches.append(dfm1['Name'].apply(lambda x: (x, process.extractOne(x, candidates, score_cutoff=90))))
You can tweak the scorer input in order to get the matches the way you like it.
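Since the question is about R and the answer above is written in Python, here is a rough base R sketch of the same idea: compute all pairwise distances, blank out pairs whose countries differ, then pick the closest remaining candidate. It swaps in utils::adist (generalized Levenshtein distance) for the Jaro-Winkler distance used in the question so that no package is needed, compares countries case-insensitively (an assumption, since the sample data mixes "United Kingdom" and "United kingdom"), and uses a small subset of the sample rows.

```r
df1 <- data.frame(
  Name = c("Talisker dark storm", "nicolas feuillatte brut", "Talisker dark storm"),
  Country = c("United Kingdom", "France", "United states"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Name = c("smirnoff ice", "Talisker daek strome",
           "nicolea feuilate brut", "Talisker 54 degrees"),
  Country = c("Russia", "United Kingdom", "Italy", "United kingdom"),
  stringsAsFactors = FALSE
)

# Edit distance between every pair of names (rows: df1, columns: df2).
d <- adist(tolower(df1$Name), tolower(df2$Name))

# Rule out pairs whose countries differ.
same_country <- outer(tolower(df1$Country), tolower(df2$Country), "==")
d[!same_country] <- NA

# For each df1 row, index of the closest remaining candidate, or NA if none.
best <- apply(d, 1, function(row) {
  if (all(is.na(row))) NA_integer_ else which.min(row)
})

matched <- data.frame(
  to_be_matched = df1$Name,
  closest_match = df2$Name[best],
  distance = d[cbind(seq_along(best), best)],
  stringsAsFactors = FALSE
)
print(matched)
```

With stringdist installed, the adist call could be replaced by the question's stringdistmatrix call; the masking step stays the same. Rows without any candidate in the same country end up with NA instead of a forced match.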
