Extract house number from (address) string using R

I want to parse apart (extract) addresses into HouseNumber and Streetname.
I should later be able to write the extracted "values" into new columns (shops$HouseNumber and shops$Streetname).
So let's say I have a data frame called "shops":
> shops
Name city street
1 Something Fakecity New Street 3
2 SomethingOther Fakecity Some-Complicated-Casestreet 1-3
3 SomethingDifferent Fakecity Fake Street 14a
Is there a way to split the street column into two vectors, one with the street names and one with the house numbers (including cases like "1-3" and "14a"), so that the result can be assigned back to the data frame and look like this?
> shops
Name city Streetname HouseNumber
1 Something Fakecity New Street 3
2 SomethingOther Fakecity Some-Complicated-Casestreet 1-3
3 SomethingDifferent Fakecity Fake Street 14a
Example: Easyfakestreet 5 --> Easyfakestreet , 5
It gets slightly complicated by the fact that some of my street strings are hyphenated and some house numbers have non-numerical components.
Examples: New Street 3 --> ['New Street', '3']
Some-Complicated-Casestreet 1-3 --> ['Some-Complicated-Casestreet','1-3']
Fake Street 14a --> ['Fake Street', '14a']
I would appreciate some help!

Here's a possible tidyr solution
library(tidyr)
extract(shops, "street", c("Streetname", "HouseNumber"), "(\\D+)(\\d.*)")
# Name city Streetname HouseNumber
# 1 Something Fakecity New Street 3
# 2 SomethingOther Fakecity Some-Complicated-Casestreet 1-3
# 3 SomethingDifferent Fakecity Fake Street 14a
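One detail worth checking: the first group of that pattern, (\\D+), also captures the space in front of the house number, so Streetname may carry trailing whitespace. A minimal base R sketch of one way around that, assuming a lazy first group and perl = TRUE (the names x, m and parts are illustrative and not part of the answer):

```r
# Sample street strings from the question
x <- c("New Street 3", "Some-Complicated-Casestreet 1-3", "Fake Street 14a")

# The lazy \\D+? takes as few non-digits as it can, \\s* swallows the
# separating space, and (\\d.*) takes everything from the first digit on
m <- regmatches(x, regexec("^(\\D+?)\\s*(\\d.*)$", x, perl = TRUE))

# Each element of m is c(full match, group 1, group 2); bind them row-wise
parts <- do.call(rbind, m)
Streetname  <- parts[, 2]
HouseNumber <- parts[, 3]
```

The two vectors could then be assigned to shops$Streetname and shops$HouseNumber. Strings with no digit at all yield an empty match and would need separate handling.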

You can try:
shops$Streetname <- gsub("(.+)\\s[^ ]+$","\\1", shops$street)
shops$HouseNumber <- gsub(".+\\s([^ ]+)$","\\1", shops$street)
data
shops$street
#[1] "New Street 3" "Some-Complicated-Casestreet 1-3" "Fake Street 14a"
results
shops$Streetname
#[1] "New Street" "Some-Complicated-Casestreet" "Fake Street"
shops$HouseNumber
#[1] "3" "1-3" "14a"

Create a pattern with two capture groups that match the street and the number, then use sub to replace the whole string with each backreference in turn. No packages are needed:
pat <- "(.*) (\\d.*)"
transform(shops,
street = sub(pat, "\\1", street),
HouseNumber = sub(pat, "\\2", street)
)
giving:
Name city street HouseNumber
1 Something Fakecity New Street 3
2 SomethingOther Fakecity Some-Complicated-Casestreet 1-3
3 SomethingDifferent Fakecity Fake Street 14a
Note:
1) We used this for shops:
shops <-
structure(list(Name = c("Something", "SomethingOther", "SomethingDifferent"
), city = c("Fakecity", "Fakecity", "Fakecity"), street = c("New Street 3",
"Some-Complicated-Casestreet 1-3", "Fake Street 14a")), .Names = c("Name",
"city", "street"), class = "data.frame", row.names = c(NA, -3L))
2) David Arenburg's pattern, "(\\D+)(\\d.*)" from the tidyr answer, could be used here instead; just set pat to it. The pattern above has the advantage that it allows street names with embedded numbers, whereas David's has the advantage that the space before the house number may be missing.
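The trade-off described in note 2 can be seen on two made-up inputs. This is only a sketch: the strings "Road 66 12a" and "Mainstreet5" and the variable names are assumptions chosen for illustration, not data from the question.

```r
pat_space <- "(.*) (\\d.*)"    # pattern from this answer: needs a space before the number
pat_digit <- "(\\D+)(\\d.*)"   # pattern from the tidyr answer: splits at the first digit

# Made-up cases: a street name with an embedded number, and a missing space
x <- c("Road 66 12a", "Mainstreet5")

street_space <- sub(pat_space, "\\1", x)
number_space <- sub(pat_space, "\\2", x)
street_digit <- sub(pat_digit, "\\1", x)
number_digit <- sub(pat_digit, "\\2", x)
```

With pat_space the first string splits into "Road 66" and "12a", but the second string does not match at all and is returned unchanged by sub. With pat_digit the second string splits into "Mainstreet" and "5", while the first is cut at the first digit, giving "Road " and "66 12a".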

You could use the package unglue
library(unglue)
unglue_unnest(shops, street, "{street} {value=\\d.*}")
#> Name city street value
#> 1 Something Fakecity New Street 3
#> 2 SomethingOther Fakecity Some-Complicated-Casestreet 1-3
#> 3 SomethingDifferent Fakecity Fake Street 14a
Created on 2019-10-08 by the reprex package (v0.3.0)

This is a very complex issue for international addresses. A PHP example:
$re = '/(\d+[\d\/\-\. ,]*[ ,\d\-\w]{0,2} )/m';
$str = '234 Test Road, Testville
456b Tester Road, Testville
789 c Tester Road, Testville
Mystreet 14a
123/3 dsdsdfs
Roobertinkatu 36-40
Flats 1-24 Acacia Avenue
Apartment 9D, 1 Acacia Avenue
Flat 24, 1 Acacia Avenue
Moscow Street, plot,23 building 2
Apartment 5005 no. 7 lane 31 Wuming Rd
Quinta da Redonda Lote 3 - 1 º
102 - 3 Esq
Av 1 Maio 16,2 dt,
Rua de Ceuta Lote 1 Loja 5
11334 Nc Highway 72 E ';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
// Print the entire match result
var_dump($matches);
Output example
https://regex101.com/r/WVPBji/1

Related

New Column Based on Conditions

To set the scene, I have a set of data where two columns of the data have been mixed up. To give a simple example:
df1 <- data.frame(Name = c("Bob", "John", "Mark", "Will"), City=c("Apple", "Paris", "Orange", "Berlin"), Fruit=c("London", "Pear", "Madrid", "Orange"))
df2 <- data.frame(Cities = c("Paris", "London", "Berlin", "Madrid", "Moscow", "Warsaw"))
As a result, we have two small data sets:
> df1
Name City Fruit
1 Bob Apple London
2 John Paris Pear
3 Mark Orange Madrid
4 Will Berlin Orange
> df2
Cities
1 Paris
2 London
3 Berlin
4 Madrid
5 Moscow
6 Warsaw
My aim is to create a new column where the cities are in the correct place using df2. I am a bit new to R so I don't know how this would work.
I don't really know where to even start with this sort of a problem. My full dataset is much larger and it would be good to have an efficient method of unpicking this issue!
If only the 'City' and 'Fruit' values are mixed up, we can loop over the rows, create a logical vector marking the values that match 'Cities' from 'df2', and rebuild each row so that the matched value comes second:
df1[] <- t(apply(df1, 1, function(x)
{
i1 <- x %in% df2$Cities
i2 <- !i1
x1 <- x[i2]
c(x1[1], x[i1], x1[2])}))
-output
> df1
Name City Fruit
1 Bob London Apple
2 John Paris Pear
3 Mark Madrid Orange
4 Will Berlin Orange
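For larger data, the same correction can be sketched without a row-wise apply, using a logical index and a column swap. This assumes that each mixed-up row has the city sitting in the Fruit column; the variable swap is an illustrative name.

```r
df1 <- data.frame(Name  = c("Bob", "John", "Mark", "Will"),
                  City  = c("Apple", "Paris", "Orange", "Berlin"),
                  Fruit = c("London", "Pear", "Madrid", "Orange"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Cities = c("Paris", "London", "Berlin", "Madrid", "Moscow", "Warsaw"),
                  stringsAsFactors = FALSE)

# Rows where City is not a known city but Fruit is
swap <- !(df1$City %in% df2$Cities) & df1$Fruit %in% df2$Cities

# Exchange the two columns for those rows only (assignment is positional)
df1[swap, c("City", "Fruit")] <- df1[swap, c("Fruit", "City")]
```

Rows where neither value is a known city are left untouched, so they can be inspected separately.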
Using the dplyr package, this solution looks up the City and Fruit values in df1 and takes the one that exists in the df2 cities list.
If neither of the two is a city name, an empty string is returned; you can replace that with anything you prefer.
library(dplyr)
df1$corrected_City <- case_when(df1$City %in% df2$Cities ~ df1$City,
df1$Fruit%in% df2$Cities ~ df1$Fruit,
TRUE ~ "")
Output: a new column is created, as you wanted, with the city name for each row.
> df1
Name City Fruit corrected_City
1 Bob Apple London London
2 John Paris Pear Paris
3 Mark Orange Madrid Madrid
4 Will Berlin Orange Berlin
Another way is:
library(dplyr)
library(tidyr)
df1 %>%
mutate(across(1:3, ~case_when(. %in% df2$Cities ~ .), .names = 'new_{col}')) %>%
unite(New_Col, starts_with('new'), na.rm = TRUE, sep = ' ')
Name City Fruit New_Col
1 Bob Apple London London
2 John Paris Pear Paris
3 Mark Orange Madrid Madrid
4 Will Berlin Orange Berlin

Join each term with list of keywords

Probably a simple problem and you can help me quickly.
I have a vector with all the terms contained in a list of keywords. Now I want to join each term with all keywords that contain this term. Here's an example
vec <- c("small", "boat", "river", "house", "tour", "a", "on", "the", "houseboat", …)
keywords <- c("small boat tour", "a house on the river", "a houseboat", …)
The expected result looks like:
keywords terms
small boat tour small
small boat tour boat
small boat tour tour
a house on the river a
a house on the river house
a house on the river on
a house on the river the
a house on the river river
a houseboat a
a houseboat houseboat
You can use expand.grid to get all combinations, wrap the words of vec in word boundaries, grepl and filter, i.e.
df1 <- expand.grid(vec, keywords)
df1[mapply(grepl, paste0('\\b' ,df1$Var1, '\\b'), df1$Var2),]
Var1 Var2
1 small small boat tour
2 boat small boat tour
5 tour small boat tour
12 river a house on the river
13 house a house on the river
15 a a house on the river
16 on a house on the river
17 the a house on the river
24 a a houseboat
27 houseboat a houseboat
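A short sketch of why the word boundaries matter here, using the term "house" and two of the sample keywords (the variable names are illustrative):

```r
# Without boundaries, "house" is found inside "houseboat"
plain_hit <- grepl("house", "a houseboat")

# With \\b on both sides, only the standalone word matches
bounded_miss <- grepl("\\bhouse\\b", "a houseboat")
bounded_hit  <- grepl("\\bhouse\\b", "a house on the river")
```

Here plain_hit is TRUE, bounded_miss is FALSE and bounded_hit is TRUE, which is why "house" is not paired with "a houseboat" in the result above.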
You can do a fuzzyjoin::fuzzy_join using stringr::str_detect as the matching function, and adding \\b word boundaries to each word in vec.
vec <- data.frame(terms = c("small", "boat", "river", "house", "tour", "a", "on", "the", "houseboat"))
keywords <- data.frame(keywords = c("small boat tour", "a house on the river", "a houseboat"))
fuzzyjoin::fuzzy_inner_join(keywords, vec, by = c("keywords" = "terms"),
match_fun = \(x, y) stringr::str_detect(x, paste0("\\b", y, "\\b")))
output
keywords terms
1 small boat tour small
2 small boat tour boat
3 small boat tour tour
4 a house on the river river
5 a house on the river house
6 a house on the river a
7 a house on the river on
8 a house on the river the
9 a houseboat a
10 a houseboat houseboat
One way is to use strsplit and intersect.
. <- lapply(strsplit(keywords, " ", TRUE), intersect, vec)
data.frame(keywords = rep(keywords, lengths(.)), terms = unlist(.))
# keywords terms
#1 small boat tour small
#2 small boat tour boat
#3 small boat tour tour
#4 a house on the river a
#5 a house on the river house
#6 a house on the river on
#7 a house on the river the
#8 a house on the river river
#9 a houseboat a
#10 a houseboat houseboat
In case vec contains all terms of the keywords, there is no need for a join.
. <- lapply(strsplit(keywords, " ", TRUE), unique)
data.frame(keywords = rep(keywords, lengths(.)), terms = unlist(.))
The expected result can be reproduced by
library(data.table)
data.table(keywords)[, .(terms = tstrsplit(keywords, "\\W+")), by = keywords]
keywords terms
1: small boat tour small
2: small boat tour boat
3: small boat tour tour
4: a house on the river a
5: a house on the river house
6: a house on the river on
7: a house on the river the
8: a house on the river river
9: a houseboat a
10: a houseboat houseboat
This is a rather simple answer which is an interpretation of OP's sentence
I have a vector with all the terms contained in a list of keywords.
The important point is all the terms. So, it is assumed that we can just split the keywords into separate terms.
Note that the regex \\W+ is used to separate the terms in case there is more than one non-word character between the terms, e.g., ", ".
However, if the vector intentionally does not contain all terms, we need to subset the result, e.g.
vec <- c("small", "boat", "river", "house", "tour", "houseboat")
data.table(keywords)[, .(terms = tstrsplit(keywords, "\\W+")), by = keywords][
terms %in% vec]
keywords terms
1: small boat tour small
2: small boat tour boat
3: small boat tour tour
4: a house on the river house
5: a house on the river river
6: a houseboat houseboat
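The note on \\W+ above can be illustrated with base strsplit. The input string is made up for this sketch (a comma plus a double space) and does not come from the question.

```r
# Made-up keyword with a comma and a double space between terms
x <- "small, boat  tour"

# Splitting on a single literal space keeps the comma and yields an empty string
by_space <- strsplit(x, " ", fixed = TRUE)[[1]]

# Splitting on runs of non-word characters yields clean terms
by_nonword <- strsplit(x, "\\W+")[[1]]
```

by_space is c("small,", "boat", "", "tour"), whereas by_nonword is c("small", "boat", "tour").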
Here is a base R one-liner with strsplit + stack
> with(keywords, rev(stack(setNames(strsplit(keywords, " "), keywords))))
ind values
1 small boat tour small
2 small boat tour boat
3 small boat tour tour
4 a house on the river a
5 a house on the river house
6 a house on the river on
7 a house on the river the
8 a house on the river river
9 a houseboat a
10 a houseboat houseboat

spell out the direction of a street name

I tried to correct the street names of a data frame using stringr package, spelling out "S." to "South" or "E" to "East" as well as "st." to "Street". The sample data is below.
df = data.frame(street = c('333 S. HOPE STREET', '21 South Hope Street', '54 Hope PKWY', '60C/O St.'))
This is my code.
df2 <- df %>% mutate(street2 = str_replace(street, 'S', "South"),
street2 = str_replace_all(street2, 'PKWY', "PARKWAY"),
street2 = str_replace_all(street2, 'st.', "Street"))
It returns the following result.
street street2
333 S. HOPE STREET 333 South. HOPE STREET
21 South Hope Street 21 Southouth Hope Street
54 Hope PKWY 54 Hope PARKWAY
60C/O St. 60C/O Southt.
This is the result I desire. I'm not sure where I went wrong.
street street2
333 S. HOPE STREET 333 South HOPE STREET
21 South Hope Street 21 South Hope Street
54 Hope PKWY 54 Hope PARKWAY
60C/O St. 60C/O Street
Don't forget to escape the dots! In a regex-pattern, the . matches (almost) any character. If you mean a literal dot, you have to escape the dot, with a \ (which you also have to escape with another \).
So:
df %>% mutate(street2 = str_replace(street, 'S\\.', "South"),
street2 = str_replace_all(street2, 'PKWY', "PARKWAY"),
street2 = str_replace_all(street2, 'St\\.', "Street"))
will result in
# street street2
# 1 333 S. HOPE STREET 333 South HOPE STREET
# 2 21 South Hope Street 21 South Hope Street
# 3 54 Hope PKWY 54 Hope PARKWAY
# 4 60C/O St. 60C/O Street
For more readable results, you can use stringr::str_to_title:
df %>% mutate(street2 = str_replace(street, 'S\\.', "South"),
street2 = str_replace_all(street2, 'PKWY', "PARKWAY"),
street2 = str_replace_all(street2, 'St\\.', "Street") ) %>%
mutate_all( ., str_to_title )
# street street2
# 1 333 S. Hope Street 333 South Hope Street
# 2 21 South Hope Street 21 South Hope Street
# 3 54 Hope Pkwy 54 Hope Parkway
# 4 60c/O St. 60c/O Street
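To see what an unescaped dot does, here is a small base R sketch on a made-up address that contains no "st." at all. The string "Western Ave" and the variable names are assumptions for illustration; fixed = TRUE is shown as an alternative to escaping.

```r
x <- "Western Ave"  # made-up address without a literal "st."

# Unescaped: "." matches any character, so "ste" inside "Western" is replaced
unescaped <- sub("st.", "Street", x)

# Escaped: only a literal "st." matches, so nothing changes
escaped <- sub("st\\.", "Street", x)

# fixed = TRUE treats the whole pattern as a literal string
literal <- sub("st.", "Street", x, fixed = TRUE)
```

unescaped becomes "WeStreetrn Ave", while escaped and literal both stay "Western Ave".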

Construct a vector of names from data frame using R

I have a big data frame that contains data about the outcomes of sports matches. I want to try and extract specific data from the data frame depending on certain criteria. Here's a quick example of what I mean...
Imagine I have a data frame df, which displays data about specific football matches of a tournament on each row, like so:
Winner_Teams Win_Capt_Nm Win_Country Loser_teams Lose_Capt_Nm Lose_Country
1 Man utd John England Barcalona Carlos Spain
2 Liverpool Steve England Juventus Mario Italy
3 Man utd John Scotland R Madrid Juan Spain
4 Paris SG Teirey France Chelsea Mark England
So, for example, in row [1] Man utd won against Barcalona, Man utd's captain's name was John and he is from England. Barcalona's (the losers of the match) captain's name was Carlos and he is from Spain.
I want to construct a vector with the names of all English players in the tournament, where the output should look something like this:
[1] "John" "Mark" "Steve"
Here's what I've tried so far...
My first step was to create a data frame that discards all the matches that don't have English captains
> England_player <- data.frame(filter(df, Win_Country == "England" | Lose_Country == "England"))
> England_player
Winner_Teams Win_Capt_Nm Win_Country Loser_teams Lose_Capt_Nm Lose_Country
1 Man utd John England Barcalona Carlos Spain
2 Liverpool Steve England Juventus Mario Italy
3 Paris SG Teirey France Chelsea Mark England
Then I used select() on England_player to isolate just the names:
> England_player_names <- select(England_player, Win_Capt_Nm, Lose_Capt_Nm)
> England_player_names
Win_Capt_Nm Lose_Capt_Nm
1 John Carlos
2 Steve Mario
3 Teirey Mark
And then I get stuck! As you can see, the output displays the English winner's name and the name of his opponent... which is not what I want!
It's easy to just read the names off this data frame.. but the data frame I'm working with is large, so just reading the values is no good!
Any suggestions as to how I'd do this?
Take the union of the English winning captains and the English losing captains:
english.players <- union(df$Win_Capt_Nm[df$Win_Country == 'England'], df$Lose_Capt_Nm[df$Lose_Country == 'England'])
[1] "John" "Steve" "Mark"
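An alternative sketch of the same idea stacks winners and losers into one long table of captains and then filters once. Only the four relevant columns of the sample data are rebuilt here, and the names captains and english_players are illustrative.

```r
# The captain and country columns from the question's sample data
df <- data.frame(Win_Capt_Nm  = c("John", "Steve", "John", "Teirey"),
                 Win_Country  = c("England", "England", "Scotland", "France"),
                 Lose_Capt_Nm = c("Carlos", "Mario", "Juan", "Mark"),
                 Lose_Country = c("Spain", "Italy", "Spain", "England"),
                 stringsAsFactors = FALSE)

# One row per captain appearance, regardless of winning or losing
captains <- data.frame(name    = c(df$Win_Capt_Nm, df$Lose_Capt_Nm),
                       country = c(df$Win_Country, df$Lose_Country),
                       stringsAsFactors = FALSE)

# Unique English names, sorted alphabetically
english_players <- sort(unique(captains$name[captains$country == "England"]))
```

This gives "John" "Mark" "Steve", the order shown in the question.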

Lookup values in a vectorized way

I keep reading about the importance of vectorized functionality so hopefully someone can help me out here.
Say I have a data frame with two columns: name and ID. Now I also have another data frame with name and birthplace, but this data frame is much larger than the first, and contains some but not all of the names from the first data frame. How can I add a third column to the first table that is populated with birthplaces looked up using the second table?
What I have now is:
corresponding.birthplaces <- sapply(table1$Name,
function(name){return(table2$Birthplace[table2$Name==name])})
This seems inefficient. Thoughts? Does anyone know of a good book/resource for using R 'properly'? I get the feeling that I generally think in the least computationally efficient manner conceivable.
Thanks :)
See ?merge, which performs a database-like merge or join.
Here is an example:
set.seed(2)
d1 <- data.frame(ID = 1:5, Name = c("Bill","Bob","Jessica","Jennifer","Robyn"))
d2 <- data.frame(Name = c("Bill", "Gavin", "Bob", "Joris", "Jessica", "Andrie",
"Jennifer","Joshua","Robyn","Iterator"),
Birthplace = sample(c("London","New York",
"San Francisco", "Berlin",
"Tokyo", "Paris"), 10, rep = TRUE))
which gives:
> d1
ID Name
1 1 Bill
2 2 Bob
3 3 Jessica
4 4 Jennifer
5 5 Robyn
> d2
Name Birthplace
1 Bill New York
2 Gavin Tokyo
3 Bob Berlin
4 Joris New York
5 Jessica Paris
6 Andrie Paris
7 Jennifer London
8 Joshua Paris
9 Robyn San Francisco
10 Iterator Berlin
Then we use merge() to do the join:
> merge(d1, d2)
Name ID Birthplace
1 Bill 1 New York
2 Bob 2 Berlin
3 Jennifer 4 London
4 Jessica 3 Paris
5 Robyn 5 San Francisco
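Note that merge() by default keeps only names found in both tables and sorts the result by the key. Since the question says some names may be missing from the second table, here is a hedged sketch of two ways to keep every row of the first table: a vectorised lookup with match(), and merge() with all.x = TRUE. The small tables and the name "Zed" are made up for illustration.

```r
# Made-up tables: "Zed" has no entry in the lookup table
d1 <- data.frame(ID = 1:3, Name = c("Bill", "Zed", "Bob"),
                 stringsAsFactors = FALSE)
d2 <- data.frame(Name = c("Bob", "Gavin", "Bill"),
                 Birthplace = c("Berlin", "Tokyo", "New York"),
                 stringsAsFactors = FALSE)

# match() returns, for each d1 name, its row position in d2 (NA if absent);
# indexing with it keeps d1's row order
d1$Birthplace <- d2$Birthplace[match(d1$Name, d2$Name)]

# Left join with merge(): all.x = TRUE keeps unmatched d1 rows with NA
merged <- merge(d1[c("ID", "Name")], d2, all.x = TRUE)
```

d1$Birthplace is then "New York", NA, "Berlin". The match() form assumes each name occurs at most once in the lookup table, because only the first match is used.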
