Merge data frames based on partial matches of address strings - r

I'm looking for a way to match two different address data frames. They both contain a string of text (the 'Line' column in my example), a postcode/zip code type identifier (the 'PC' column) and a unique Ref or ID code. I need the resulting matches in a new data frame with a format along the lines of: DF1$Line, DF1$PC, DF2$Line, DF2$PC, Ref, ID, plus some sort of numeric score detailing the strength of each match (this is based on the example code below).
My actual dataset contains several thousand records. I have been playing with the idea of using the 'PC' column to subset both datasets and then performing some sort of matching within each subset, but the resulting matches I get are completely wrong.
Here is a made-up dataset that resembles my data (in these examples the rows in each dataset correspond to each other; my real data is unfortunately not formatted like this).
DF1 <- data.frame(
  Line = c("64 London Street, Jasper", "46 London Road, Flat 2, Jasper",
           "99 York Parade, Yorkie", "99 Parade Road, Placename",
           "29 Road Street, Townplace", "92 Parade Street, Yorky"),
  PC = c("ZZ1 4TY", "ZZ1 4TY", "PP1 9TR", "ZZ1 4TY", "PP1 9TR", "PP1 9RT"),
  Ref = c("123451", "567348", "23412", "98734", "43223", "32453")
)
and
DF2 <- data.frame(
  Line = c("64 London St, Jasper", "Flat 2, 46 Road, London, Jasper",
           "99 York Parade, Yorky", "99 Parade Road, Placenames",
           "Flat 3, 29 Road Street, Townplace, Townplace", "92 Street, Parade, Yorkie"),
  PC = c("ZZ1 4TY", "ZZ1 4TY", "PP1 9TR", "ZZ1 4TY", "PP1 9TR", "PP1 9RT"),
  ID = c("ABGED", "GGFRW", "UYTER", "RTERF", "WERWE", "OYUIY")
)
Any help resolving this would be very much appreciated, as would any metric that helps me quantify how precise the matches are. Thanks.

Here is my base R solution; let me know if I've got it right.
DF3 <- merge(DF1, DF2, by = "PC")
DF3[!duplicated(DF3$Ref) , ]
PC Line.x Ref Line.y ID
1 PP1 9RT 92 Parade Street, Yorky 32453 92 Street, Parade, Yorkie OYUIY
2 PP1 9TR 99 York Parade, Yorkie 23412 99 York Parade, Yorky UYTER
4 PP1 9TR 29 Road Street, Townplace 43223 99 York Parade, Yorky UYTER
6 ZZ1 4TY 64 London Street, Jasper 123451 64 London St, Jasper ABGED
9 ZZ1 4TY 46 London Road, Flat 2, Jasper 567348 64 London St, Jasper ABGED
12 ZZ1 4TY 99 Parade Road, Placename 98734 64 London St, Jasper ABGED

I would consider first evaluating potential matches using agrep:
# collect the candidate matches for each DF1 address in a list
# (assigning a single object inside the loop would overwrite the previous result)
matchDF1 <- vector("list", nrow(DF1))
for (i in seq_len(nrow(DF1))) {
  matchDF1[[i]] <- agrep(pattern = DF1$Line[i], x = DF2$Line,
                         max.distance = 0.5, value = TRUE)
}
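To also get a numeric measure of match strength, one option (a sketch only, not a definitive method) is to block on the PC column and score each candidate with a normalised edit distance from base R's adist(), keeping the best-scoring DF2 row per DF1 row. The toy data is repeated here so the snippet runs on its own:

```r
# toy data from the question
DF1 <- data.frame(
  Line = c("64 London Street, Jasper", "46 London Road, Flat 2, Jasper",
           "99 York Parade, Yorkie", "99 Parade Road, Placename",
           "29 Road Street, Townplace", "92 Parade Street, Yorky"),
  PC = c("ZZ1 4TY", "ZZ1 4TY", "PP1 9TR", "ZZ1 4TY", "PP1 9TR", "PP1 9RT"),
  Ref = c("123451", "567348", "23412", "98734", "43223", "32453")
)
DF2 <- data.frame(
  Line = c("64 London St, Jasper", "Flat 2, 46 Road, London, Jasper",
           "99 York Parade, Yorky", "99 Parade Road, Placenames",
           "Flat 3, 29 Road Street, Townplace, Townplace", "92 Street, Parade, Yorkie"),
  PC = c("ZZ1 4TY", "ZZ1 4TY", "PP1 9TR", "ZZ1 4TY", "PP1 9TR", "PP1 9RT"),
  ID = c("ABGED", "GGFRW", "UYTER", "RTERF", "WERWE", "OYUIY")
)

# for each DF1 row, compare only against DF2 rows sharing the same PC and
# score candidates as 1 - edit distance / length of the longer string
# (1 = identical strings, 0 = entirely different)
matches <- do.call(rbind, lapply(seq_len(nrow(DF1)), function(i) {
  candidates <- DF2[DF2$PC == DF1$PC[i], ]
  if (nrow(candidates) == 0) return(NULL)
  d <- adist(DF1$Line[i], candidates$Line)[1, ]
  score <- 1 - d / pmax(nchar(DF1$Line[i]), nchar(candidates$Line))
  best <- which.max(score)
  data.frame(Line1 = DF1$Line[i], PC = DF1$PC[i], Ref = DF1$Ref[i],
             Line2 = candidates$Line[best], ID = candidates$ID[best],
             score = score[best])
}))
matches
```

The score column then gives you a rough precision metric, and rows below some threshold (say 0.5) can be flagged for manual review.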

Related

Combine two rows in R that are separated

I am trying to clean the dataset so that all data is in its appropriate cell by combining rows that have been oddly separated. One nuance of the dataset is that some rows are correctly coded and some are not.
Here is an example of the data:
Rank  Store  City           Address                      Transactions  Avg. Value  Dollar Sales Amt.
40    1404   State College  Hamilton Square Shop Center
                            230 W Hamilton Ave           155548        52.86       8263499
41    2310   Springfield    149 Baltimore Pike           300258        27.24       8211137
42    2514   Erie           Yorktown Centre
                            2501 West 12th Street        190305        41.17       7862624
Here is an example of how I want the data:
Rank  Store  City           Address                                          Transactions  Avg. Value  Dollar Sales Amt.
40    1404   State College  Hamilton Square Shop Center, 230 W Hamilton Ave  155548        52.86       8263499
41    2310   Springfield    149 Baltimore Pike                               300258        27.24       8211137
42    2514   Erie           Yorktown Centre, 2501 West 12th Street           190305        41.17       7862624
Is there an Excel or R function to fix this, or does anyone know how to write an R function to correct it?
I read about the CONCATENATE function in Excel and realized it was not going to accomplish anything. I figured an R function would be the only way to fix this.
The CONCATENATE function will work here; alternatively, in Excel you can select the columns and use the merge formula under the Formulas option to complete the task.
I recommend checking how the file is being parsed. From the example data you provided, it looks like the address column is being split on ", ", with the overflow going to the next line. Based on this assumption alone, below is a potential solution using the tidyverse:
library(tidyverse)
original_data <- tibble(
  Rank = c(40, NA, 41, 42, NA),
  Store = c(1404, NA, 2310, 2514, NA),
  City = c("State College", NA, "Springfield", "Erie", NA),
  Address = c("Hamilton Square Shop Center", "230 W Hamilton Ave",
              "149 Baltimore Pike", "Yorktown Centre", "2501 West 12th Street"),
  Transactions = c(NA, 155548, 300258, NA, 190305),
  `Avg. Value` = c(NA, 52.86, 27.24, NA, 41.17),
  `Dollar Sales Amt.` = c(NA, 8263499, 8211137, NA, 7862624)
)
new_data <- original_data %>%
  fill(Rank:City) %>%
  group_by_at(vars(Rank:City)) %>%
  mutate(Address1 = lag(Address)) %>%
  slice(n()) %>%
  ungroup() %>%
  mutate(Address = if_else(is.na(Address1), Address,
                           str_c(Address1, Address, sep = ", "))) %>%
  select(Rank:`Dollar Sales Amt.`)
new_data

Combine every two rows of data in R

I have a csv file that I have read in, but I now need to combine every two rows. There is a total of 2000 rows and I need to reduce that to 1000. Every pair of rows has the same account number in one column and the address split across the two rows in another. Two rows are taken up for each observation, and I want to combine the two address rows into one. For example, rows 1 and 2 are Acct# 1234 and contain "123 Hollywood Blvd" and "LA California 90028" on their own lines respectively.
Using the tidyverse, you can group_by the Acct number and summarise with str_c:
library(tidyverse)
df %>%
  group_by(Acct) %>%
  summarise(Address = str_c(Address, collapse = " "))
# A tibble: 2 × 2
Acct Address
<dbl> <chr>
1 1234 123 Hollywood Blvd LA California 90028
2 4321 55 Park Avenue NY New York State 6666
Data:
df <- data.frame(
  Acct = c(1234, 1234, 4321, 4321),
  Address = c("123 Hollywood Blvd", "LA California 90028",
              "55 Park Avenue", "NY New York State 6666")
)
It can be fairly simple with data.table package:
# assuming `dataset` is the name of your dataset, the account number column is 'actN'
# and the address column is 'adr'
library(data.table)
dataset2 <- data.table(dataset)[, .(whole = paste0(adr, collapse = ", ")), by = .(actN)]
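For completeness, the same collapse can be done in base R with aggregate(), using the same assumed column names actN and adr (shown here with made-up data matching the question's example):

```r
# two rows per account: account number repeated, address split across the rows
dataset <- data.frame(
  actN = c(1234, 1234, 4321, 4321),
  adr = c("123 Hollywood Blvd", "LA California 90028",
          "55 Park Avenue", "NY New York State 6666")
)

# collapse the address lines per account into a single string;
# aggregate() passes `collapse` through to paste()
dataset2 <- aggregate(adr ~ actN, data = dataset, FUN = paste, collapse = ", ")
```

This reduces the 2000 rows to 1000, one per account.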

how to find top highest number in R

I'm new to R coding. I want to find the code for this question: display the city name and the total attendance of the five top-attendance stadiums. I have a dataframe, worldcupmatches. Please, if anyone can help me out.
Since you have not provided us a subset of your data (which is strongly recommended), I will create a tiny dataset with city names and attendance like so:
df = data.frame(city = c("London", "Liverpool", "Manchester", "Birmingham"),
                attendance = c(2390, 1290, 8734, 5433))
Then your problem can easily be solved. For example, one of the base R approaches is:
df[order(df$attendance, decreasing = T), ]
You could also use dplyr which makes things look a little tidier:
library(dplyr)
df %>% arrange(desc(attendance))
Output of both methods is your original data, but ordered from the highest to the lowest attendance:
city attendance
3 Manchester 8734
4 Birmingham 5433
1 London 2390
2 Liverpool 1290
If you specifically want to display a certain number of cities (or stadiums) with top highest attendance, you could do:
df[order(df$attendance, decreasing = T), ][1:3, ] # 1:3 takes the top 3 stadiums
city attendance
3 Manchester 8734
4 Birmingham 5433
1 London 2390
Again, the dplyr approach looks much nicer:
df %>% slice_max(n = 3, order_by = attendance)
city attendance
1 Manchester 8734
2 Birmingham 5433
3 London 2390

Splitting character object using vector of delimiters

I have a large number of text files. Each file is stored as an observation in a dataframe. Each observation contains multiple fields so there is some structure in each object. I'm looking to split each based on the structured information within each file.
Data is currently in the following structure (simplified):
a <- c("Name: John Doe Age: 50 Address Please give full address 22 Main Street, New York")
b <- c("Name: Jane Bloggs Age: 42 Address Please give full address 1 Lower Street, London")
df <- data.frame(rawtext = c(a,b))
I'd like to split each observation into individual variable columns. It should end up looking like this:
Name Age Address
John Doe 50 22 Main Street, New York
Jane Bloggs 42 1 Lower Street, London
I thought that this could be done fairly simply using a pre-defined vector of delimiters since each text object is structured. I have tried using stringr and str_split() but this doesn't handle the vector input. e.g.
delims <- c("Name:", "Age", "Address Please give full address")
str_split(df$rawtext, delims)
I'm perhaps trying to oversimplify here. The only other approach I can think of is to loop through each observation and extract all text after delims[1] and before delims[2] (and so on) for all fields.
e.g. the following bodge would get me the name field based on the delimiters:
sub(paste0(".*", delims[1]), "", df$rawtext[1]) %>% sub(paste0(delims[2], ".*"), "", .)
[1] " John Doe "
This feels extremely inefficient. Is there a better way that I'm missing?
A tidyverse solution:
library(tidyverse)
delims <- c("Name", "Age", "Address Please give full address")
df %>%
  mutate(rawtext = str_remove_all(rawtext, ":")) %>%
  separate(rawtext, c("x", delims), sep = paste(delims, collapse = "|"), convert = T) %>%
  mutate(across(where(is.character), str_squish), x = NULL)
# # A tibble: 2 x 3
# Name Age `Address Please give full address`
# <chr> <dbl> <chr>
# 1 John Doe 50 22 Main Street, New York
# 2 Jane Bloggs 42 1 Lower Street, London
Note: convert = T in separate() converts Age from character to numeric ignoring leading/trailing whitespaces.
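If you'd rather avoid extra packages, the same split-on-delimiters idea can be sketched in base R with strsplit(). Note the delimiters here include the colons, so no pre-cleaning is needed:

```r
df <- data.frame(rawtext = c(
  "Name: John Doe Age: 50 Address Please give full address 22 Main Street, New York",
  "Name: Jane Bloggs Age: 42 Address Please give full address 1 Lower Street, London"
))

delims <- c("Name:", "Age:", "Address Please give full address")

# split each row on any of the delimiters, drop the empty leading piece,
# and trim the surrounding whitespace from each field
parts <- strsplit(df$rawtext, paste(delims, collapse = "|"))
fields <- do.call(rbind, lapply(parts, function(p) trimws(p[-1])))
out <- setNames(as.data.frame(fields), c("Name", "Age", "Address"))
```

Unlike separate(..., convert = T), this leaves Age as character; wrap it in as.numeric() if needed.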

Regex, Separate according to punctuation R?

I know this is a regex question, which has probably been answered but I cannot figure out the answer to this particular question. I have a dataset of 5000 addresses, and some of the addresses are presented as:
199 REEDSDALE ROAD MILTON, MA (42.252352, -71.075213)
2014 WASHINGTON STREET NEWTON, MA (42.332339, -71.246592)
75 FRANCIS STREET BOSTON, MA (42.335954, -71.107661)
235 NORTH PEARL STREET BROCKTON, MA (42.09707, -71.065645)
41 HIGHLAND AVENUE WINCHESTER, MA (42.465496, -71.121408)
The first comma separates the city from the state, but there are also latitude and longitude coordinates in parentheses. I am interested in getting the coordinates into two columns, latitude and longitude:
lat lon
42.252352 -71.075213
42.332339 -71.246592
42.335954 -71.107661
42.09707 -71.065645
42.465496 -71.121408
Any and all help is appreciated!
One option is to extract the numeric part with a regex lookaround
library(tidyverse)
tibble(lat = str_extract(lines, "(?<=\\()-?[0-9.]+"),
       lon = str_extract(lines, "-?[0-9.]+(?=\\))"))
# A tibble: 5 x 2
# lat lon
# <chr> <chr>
#1 42.252352 -71.075213
#2 42.332339 -71.246592
#3 42.335954 -71.107661
#4 42.09707 -71.065645
#5 42.465496 -71.121408
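Since str_extract() returns character vectors (as the <chr> column types show), you may want numeric columns; a small follow-up sketch, assuming `lines` holds the address strings:

```r
library(stringr)

lines <- c("199 REEDSDALE ROAD MILTON, MA (42.252352, -71.075213)",
           "2014 WASHINGTON STREET NEWTON, MA (42.332339, -71.246592)")

# extract the parenthesised coordinates and convert them to numeric
coords <- data.frame(lat = as.numeric(str_extract(lines, "(?<=\\()-?[0-9.]+")),
                     lon = as.numeric(str_extract(lines, "-?[0-9.]+(?=\\))")))
```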
Or with read.csv: after using gsub to remove the characters up to and including the opening ( as well as the trailing ), the remaining "," serves as the separator for read.csv to split the text into two columns:
read.csv(text = gsub("^[^(]+\\(|\\)$", "", lines), header=FALSE,
col.names = c("lat", "lon"))
# lat lon
#1 42.25235 -71.07521
#2 42.33234 -71.24659
#3 42.33595 -71.10766
#4 42.09707 -71.06565
#5 42.46550 -71.12141
data
lines <- readLines("file.txt")
