spell out the direction of a street name - r

I tried to correct the street names in a data frame using the stringr package, spelling out "S." as "South", "E" as "East", and "st." as "Street". The sample data is below.
df = data.frame(street = c('333 S. HOPE STREET', '21 South Hope Street', '54 Hope PKWY', '60C/O St.'))
This is my code.
df2 <- df %>% mutate(street2 = str_replace(street, 'S', "South"),
street2 = str_replace_all(street2, 'PKWY', "PARKWAY"),
street2 = str_replace_all(street2, 'st.', "Street"))
It returns the following result.
street street2
333 S. HOPE STREET 333 South. HOPE STREET
21 South Hope Street 21 Southouth Hope Street
54 Hope PKWY 54 Hope PARKWAY
60C/O St. 60C/O Southt.
This is the result I want. I'm not sure where I went wrong.
street street2
333 S. HOPE STREET 333 South HOPE STREET
21 South Hope Street 21 South Hope Street
54 Hope PKWY 54 Hope PARKWAY
60C/O St. 60C/O Street

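Don't forget to escape the dots! In a regex pattern, an unescaped . matches (almost) any character. To match a literal dot, escape it with a backslash; because the pattern is written as an R string, that backslash must itself be escaped with another backslash (\\.). Patterns are also case-sensitive, so 'st.' does not match 'St.'.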
So:
df %>% mutate(street2 = str_replace(street, 'S\\.', "South"),
street2 = str_replace_all(street2, 'PKWY', "PARKWAY"),
street2 = str_replace_all(street2, 'St\\.', "Street"))
will result in
# street street2
# 1 333 S. HOPE STREET 333 South HOPE STREET
# 2 21 South Hope Street 21 South Hope Street
# 3 54 Hope PKWY 54 Hope PARKWAY
# 4 60C/O St. 60C/O Street
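To see the effect of the escape in isolation, here is a minimal sketch using base R's sub(), which replaces the first match just as str_replace does. The variable names are only illustrative, and the street vector is a subset of the sample data.

```r
street <- c("333 S. HOPE STREET", "21 South Hope Street", "60C/O St.")

# Unescaped: "." matches any character, so "S." also matches "So" and "St"
unescaped <- sub("S.", "South", street)

# Escaped: "\\." matches only a literal dot
escaped <- sub("S\\.", "South", street)

# fixed = TRUE treats the whole pattern as a literal string, no escaping needed
literal <- sub("S.", "South", street, fixed = TRUE)

unescaped
escaped
literal
```

In stringr, wrapping the pattern in fixed() plays the same role as fixed = TRUE here.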
For more readable results, you can use stringr::str_to_title:
df %>% mutate(street2 = str_replace(street, 'S\\.', "South"),
street2 = str_replace_all(street2, 'PKWY', "PARKWAY"),
street2 = str_replace_all(street2, 'St\\.', "Street") ) %>%
mutate_all( ., str_to_title )
# street street2
# 1 333 S. Hope Street 333 South Hope Street
# 2 21 South Hope Street 21 South Hope Street
# 3 54 Hope Pkwy 54 Hope Parkway
# 4 60c/O St. 60c/O Street
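If the abbreviations may appear with or without the trailing dot (the question mentions both "S." and "E"), word boundaries help avoid touching the "S" in "South" or "STREET". The following is only a sketch in base R, assuming an abbreviation is always followed by whitespace or the end of the string; the patterns are illustrative, not a complete address normaliser.

```r
street <- c("333 S. HOPE STREET", "21 South Hope Street",
            "54 Hope PKWY", "60C/O St.")

# \\b is a word boundary; (?=\\s|$) requires whitespace or end of string next.
# perl = TRUE is needed for the lookahead.
cleaned <- gsub("\\bS\\.?(?=\\s|$)", "South", street, perl = TRUE)
cleaned <- gsub("\\bE\\.?(?=\\s|$)", "East", cleaned, perl = TRUE)
cleaned <- gsub("\\bSt\\.(?=\\s|$)", "Street", cleaned, perl = TRUE)
cleaned <- gsub("\\bPKWY\\b", "PARKWAY", cleaned, perl = TRUE)
cleaned
```

A single letter such as "E" can still be ambiguous in real addresses (for example a unit label), so patterns like these are worth checking against your own data.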

Related

Is it possible to convert lines from a text file into columns to get a dataframe?

I have a text file containing information on book title, author name, and country of birth, which appear on separate lines as shown below:
Oscar Wilde
De Profundis
Ireland
Nathaniel Hawthorn
Birthmark
USA
James Joyce
Ulysses
Ireland
Walt Whitman
Leaves of Grass
USA
Is there any way to convert the text to a dataframe with these three items appearing as different columns:
ID Author Book Country
1 "Oscar Wilde" "De Profundis" "Ireland"
2 "Nathaniel Hawthorn" "Birthmark" "USA"
There are built-in functions for dealing with this kind of data (here xx is the sample text held in a single string, as defined in the test data further down):
data.frame(scan(text=xx, multi.line=TRUE,
what=list(Author="", Book="", Country=""), sep="\n"))
# Author Book Country
#1 Oscar Wilde De Profundis Ireland
#2 Nathaniel Hawthorn Birthmark USA
#3 James Joyce Ulysses Ireland
#4 Walt Whitman Leaves of Grass USA
You can create a 3-column matrix from one column of data.
dat <- read.table('data.txt', sep = ',')
result <- matrix(dat$V1, ncol = 3, byrow = TRUE) |>
data.frame() |>
setNames(c('Author', 'Book', 'Country'))
result <- cbind(ID = 1:nrow(result), result)
result
# ID Author Book Country
#1 1 Oscar Wilde De Profundis Ireland
#2 2 Nathaniel Hawthorn Birthmark USA
#3 3 James Joyce Ulysses Ireland
#4 4 Walt Whitman Leaves of Grass USA
There aren't any built in functions that handle data like this. But you can reshape your data after importing.
#Test data
xx <- "Oscar Wilde
De Profundis
Ireland
Nathaniel Hawthorn
Birthmark
USA
James Joyce
Ulysses
Ireland
Walt Whitman
Leaves of Grass
USA"
writeLines(xx, "test.txt")
And then the code
library(dplyr)
library(tidyr)
lines <- read.csv("test.txt", header=FALSE)
lines %>%
mutate(
rid = ((row_number()-1) %% 3)+1,
pid = (row_number()-1) %/%3 + 1) %>%
mutate(col=case_when(rid==1~"Author",rid==2~"Book", rid==3~"Country")) %>%
select(-rid) %>%
pivot_wider(names_from=col, values_from=V1)
Which returns
# A tibble: 4 x 4
pid Author Book Country
<dbl> <chr> <chr> <chr>
1 1 Oscar Wilde De Profundis Ireland
2 2 Nathaniel Hawthorn Birthmark USA
3 3 James Joyce Ulysses Ireland
4 4 Walt Whitman Leaves of Grass USA
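The rid and pid columns above come from integer arithmetic on the row number. A small sketch, assuming six input lines (two records of three fields each), shows what the two expressions produce:

```r
n <- 6                      # six lines = two records of three fields
i <- seq_len(n)

rid <- (i - 1) %% 3 + 1     # position within a record: 1 2 3 1 2 3
pid <- (i - 1) %/% 3 + 1    # record number:            1 1 1 2 2 2

# Map the position within a record to a column name
field <- c("Author", "Book", "Country")[rid]
data.frame(line = i, rid, pid, field)
```

Because everything depends on the fixed record length of 3, a file with a missing or extra line would shift every later record.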

Cleaning addresses - add last token in street name (Ave, St,..) where missing, based on other records [closed]

In the example data below, some addresses are missing the last 'token' making up the street name - ave, st, dr, etc. I'm using OSM for geocoding and I find these records get a hit, but often in some other country. I would like to clean them further by adding the most likely missing token based on other records in the data.
valid_ends <- c("AVE", "ST", "EXT", "BLVD")
data.frame(address = c("75 NEW PARK AVE", "245 NEW PARK AVE", "42 NEW PARK",
"934 NEW PARK ST", "394 NEW PARK", "34 ASYLUM ST",
"42 ASYLUM", "953 ASYLUM AVE", "23 ASYLUM ST",
"65 WASHINGTON AVE EXT", "94 WASHINGTON AVE")) %>%
mutate(addr_tokens = str_split(address, " ")) %>%
mutate(addr_fix = NA)
Desired result: a new character column ("addr_fix") added to the above, containing an "augmented" address for records 3, 5 and 7 ("AVE", "AVE" and "ST" respectively). A record is augmented when its last address token is not contained in valid_ends. The appended token is the one that occurs most frequently for that street (streets are matched by removing the numeric first token and the valid end tokens from the addresses in the dataset).
A little messy, but this approach should work:
Start by getting the "core address" (the street name without its suffix) and copying the suffix or "valid end", if there is one, to end. This assumes the question's data frame has been stored as df:
valid_ends_rgx <- paste0(valid_ends, collapse = "|")
df2 <- df %>%
mutate(has_valid_end = str_detect(address, valid_ends_rgx),
core_addr =
str_remove_all(address, valid_ends_rgx) %>%
str_trim() %>%
str_remove("\\d+ "),
end = str_match(address, valid_ends_rgx)[, 1]
)
df2
# A tibble: 11 x 4
address has_valid_end core_addr end
<chr> <lgl> <chr> <chr>
1 75 NEW PARK AVE TRUE NEW PARK AVE
2 245 NEW PARK AVE TRUE NEW PARK AVE
3 42 NEW PARK FALSE NEW PARK NA
4 934 NEW PARK ST TRUE NEW PARK ST
5 394 NEW PARK FALSE NEW PARK NA
6 34 ASYLUM ST TRUE ASYLUM ST
7 42 ASYLUM FALSE ASYLUM NA
8 953 ASYLUM AVE TRUE ASYLUM AVE
9 23 ASYLUM ST TRUE ASYLUM ST
10 65 WASHINGTON AVE EXT TRUE WASHINGTON AVE
11 94 WASHINGTON AVE TRUE WASHINGTON AVE
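One caveat with the pattern above: it is not anchored, so a suffix such as "ST" also matches inside a longer word. A possible refinement, sketched here in base R with a made-up address ("12 STATE") that is not in the question's data, is to require the suffix to be a whole token at the end of the string:

```r
valid_ends <- c("AVE", "ST", "EXT", "BLVD")
addr <- c("934 NEW PARK ST", "12 STATE", "65 WASHINGTON AVE EXT")

loose  <- paste0(valid_ends, collapse = "|")
strict <- paste0("\\b(", loose, ")$")   # whole word, at the end of the address

loose_hit  <- grepl(loose, addr, perl = TRUE)   # "STATE" contains "ST"
strict_hit <- grepl(strict, addr, perl = TRUE)  # only a final token counts
data.frame(addr, loose_hit, strict_hit)
```

The anchored pattern only checks the final token, so an address like "65 WASHINGTON AVE EXT" would report "EXT" rather than "AVE" if it were also used to extract the ending.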
Find the most common valid ending for each street:
replacements <- df2 %>%
group_by(core_addr, end) %>%
summarise(end_ct = n()) %>%
group_by(core_addr) %>%
summarise(most_end = end[which.max(end_ct)])
# A tibble: 3 x 2
core_addr most_end
<chr> <chr>
1 ASYLUM ST
2 NEW PARK AVE
3 WASHINGTON AVE
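Note that the grouped count above also counts the NA endings as a group of their own, which works here only because NA is never the most frequent value. As an alternative sketch in base R (not the answer's method), table() drops NA by default, so only real suffixes compete. The vectors below are copied from the core_addr and end columns for two of the streets.

```r
core   <- c("NEW PARK", "NEW PARK", "NEW PARK", "NEW PARK", "NEW PARK",
            "ASYLUM", "ASYLUM", "ASYLUM", "ASYLUM")
suffix <- c("AVE", "AVE", NA, "ST", NA,
            "ST", NA, "AVE", "ST")

# For each street, tabulate the non-missing suffixes and keep the most frequent
most_end <- tapply(suffix, core, function(s) names(which.max(table(s))))
most_end
```

A street whose suffix is always missing would need separate handling, since there is nothing to tabulate for it.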
Update the address fields with missing ends, based on the most_end field in replacements:
df2 %>%
left_join(replacements, by = "core_addr") %>%
transmute(
address = if_else(has_valid_end, address, str_c(address, most_end, sep = " "))
)
# A tibble: 11 x 1
address
<chr>
1 75 NEW PARK AVE
2 245 NEW PARK AVE
3 42 NEW PARK AVE
4 934 NEW PARK ST
5 394 NEW PARK AVE
6 34 ASYLUM ST
7 42 ASYLUM ST
8 953 ASYLUM AVE
9 23 ASYLUM ST
10 65 WASHINGTON AVE EXT
11 94 WASHINGTON AVE

Merge two datasets

I create a node list as follows:
name <- c("Joe","Frank","Peter")
city <- c("New York","Detroit","Maimi")
age <- c(24,55,65)
node_list <- data.frame(name,age,city)
node_list
name age city
1 Joe 24 New York
2 Frank 55 Detroit
3 Peter 65 Maimi
Then I create an edge list as follows:
from <- c("Joe","Frank","Peter","Albert")
to <- c("Frank","Albert","James","Tony")
to_city <- c("Detroit","St. Louis","New York","Carson City")
edge_list <- data.frame(from,to,to_city)
edge_list
from to to_city
1 Joe Frank Detroit
2 Frank Albert St. Louis
3 Peter James New York
4 Albert Tony Carson City
Notice that the names in the node list and edge list do not overlap 100%. I want to create a master node list of all the names, capturing city information as well. This is my dplyr attempt to do this:
new_node <- edge_list %>%
gather("from_to", "name", from, to) %>%
distinct(name) %>%
full_join(node_list)
new_node
name age city
1 Joe 24 New York
2 Frank 55 Detroit
3 Peter 65 Maimi
4 Albert NA <NA>
5 James NA <NA>
6 Tony NA <NA>
I need to figure out how to add to_city information. What do I need to add to my dplyr code to make this happen? Thanks.
Join twice, once on to and once on from, with the irrelevant columns subsetted out:
library(dplyr)
# tibble() replaces the deprecated data_frame()
node_list <- tibble(name = c("Joe", "Frank", "Peter"),
                    city = c("New York", "Detroit", "Maimi"),
                    age = c(24, 55, 65))
edge_list <- tibble(from = c("Joe", "Frank", "Peter", "Albert"),
                    to = c("Frank", "Albert", "James", "Tony"),
                    to_city = c("Detroit", "St. Louis", "New York", "Carson City"))
node_list %>%
full_join(select(edge_list, name = to, city = to_city)) %>%
full_join(select(edge_list, name = from))
#> Joining, by = c("name", "city")
#> Joining, by = "name"
#> # A tibble: 6 x 3
#> name city age
#> <chr> <chr> <dbl>
#> 1 Joe New York 24.
#> 2 Frank Detroit 55.
#> 3 Peter Maimi 65.
#> 4 Albert St. Louis NA
#> 5 James New York NA
#> 6 Tony Carson City NA
In this case the second join doesn't do anything because everybody is already included, but it would insert anyone who only existed in the from column.
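To illustrate when the second join matters, here is a small sketch with made-up data in which one name ("Zed", hypothetical) appears only in the from column. It uses base R's merge(..., all = TRUE) as a stand-in for full_join, so it runs without dplyr.

```r
node_list <- data.frame(name = c("Joe", "Frank"),
                        city = c("New York", "Detroit"),
                        stringsAsFactors = FALSE)
edge_list <- data.frame(from = c("Joe", "Zed"),
                        to = c("Frank", "Joe"),
                        to_city = c("Detroit", "New York"),
                        stringsAsFactors = FALSE)

# First join: everyone in `to`, with their city (joins on name and city)
to_nodes <- data.frame(name = edge_list$to, city = edge_list$to_city,
                       stringsAsFactors = FALSE)
step1 <- merge(node_list, to_nodes, all = TRUE)

# Second join: everyone in `from`; "Zed" is new and gets NA for city
from_nodes <- data.frame(name = edge_list$from, stringsAsFactors = FALSE)
step2 <- merge(step1, from_nodes, all = TRUE)
step2
```

Here the first join adds nothing new, while the second adds a row for "Zed" with a missing city.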

Extract House Number from (address) string using r

I want to parse apart (extract) addresses into HouseNumber and Streetname.
I should later be able to write the extracted "values" into new columns (shops$HouseNumber and shops$Streetname).
So let's say I have a data frame called "shops":
> shops
Name city street
1 Something Fakecity New Street 3
2 SomethingOther Fakecity Some-Complicated-Casestreet 1-3
3 SomethingDifferent Fakecity Fake Street 14a
Is there a way to split the street column into two, one with the street names and one with the house numbers (including cases like "1-3" and "14a"), so that the result can be assigned back to the data frame and look like this?
> shops
Name city Streetname HouseNumber
1 Something Fakecity New Street 3
2 SomethingOther Fakecity Some-Complicated-Casestreet 1-3
3 SomethingDifferent Fakecity Fake Street 14a
Example: Easyfakestreet 5 --> Easyfakestreet , 5
It gets slightly complicated by the fact that some of my street strings will have hyphenated street addresses and have non numerical components.
Examples: New Street 3 --> ['New Street', '3 ']
Some-Complicated-Casestreet 1-3 --> ['Some-Complicated-Casestreet','1-3']
Fake Street 14a --> ['Fake Street', '14a']
I would appreciate some help!
Here's a possible tidyr solution
library(tidyr)
extract(shops, "street", c("Streetname", "HouseNumber"), "(\\D+)(\\d.*)")
# Name city Streetname HouseNumber
# 1 Something Fakecity New Street 3
# 2 SomethingOther Fakecity Some-Complicated-Casestreet 1-3
# 3 SomethingDifferent Fakecity Fake Street 14a
You can try:
shops$Streetname <- gsub("(.+)\\s[^ ]+$","\\1", shops$street)
shops$HouseNumber <- gsub(".+\\s([^ ]+)$","\\1", shops$street)
Data:
shops$street
#[1] "New Street 3" "Some-Complicated-Casestreet 1-3" "Fake Street 14a"
Results:
shops$Streetname
#[1] "New Street" "Some-Complicated-Casestreet" "Fake Street"
shops$HouseNumber
#[1] "3" "1-3" "14a"
Create a pattern with capture groups that match both the street and the number, and then use sub to replace the whole match with each backreference in turn. No packages are needed:
pat <- "(.*) (\\d.*)"
transform(shops,
street = sub(pat, "\\1", street),
HouseNumber = sub(pat, "\\2", street)
)
giving:
Name city street HouseNumber
1 Something Fakecity New Street 3
2 SomethingOther Fakecity Some-Complicated-Casestreet 1-3
3 SomethingDifferent Fakecity Fake Street 14a
Note:
1) We used this for shops:
shops <-
structure(list(Name = c("Something", "SomethingOther", "SomethingDifferent"
), city = c("Fakecity", "Fakecity", "Fakecity"), street = c("New Street 3",
"Some-Complicated-Casestreet 1-3", "Fake Street 14a")), .Names = c("Name",
"city", "street"), class = "data.frame", row.names = c(NA, -3L))
2) David Arenburg's pattern (the one in the tidyr answer above, "(\\D+)(\\d.*)") could alternatively be used here; just set pat to it. The pattern above has the advantage that it allows street names with embedded numbers, whereas David's has the advantage that the space before the house number may be missing.
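The trade-off between the two patterns can be seen on two made-up inputs, one with a number inside the street name and one with no space before the house number. This is only a sketch; the variable names and sample strings are illustrative.

```r
pat_space   <- "(.*) (\\d.*)"    # needs a space before the house number
pat_nospace <- "(\\D+)(\\d.*)"   # street name must not contain digits

x <- c("Route 66 Road 12", "Mainstreet12")

street_a <- sub(pat_space, "\\1", x)
number_a <- sub(pat_space, "\\2", x)
street_b <- sub(pat_nospace, "\\1", x)
number_b <- sub(pat_nospace, "\\2", x)

data.frame(x, street_a, number_a, street_b, number_b)
```

When a pattern does not match at all, sub() returns the input unchanged, which is why "Mainstreet12" comes back whole under the first pattern. The second pattern splits "Route 66 Road 12" at the first digit, so the street name is cut short.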
You could use the package unglue
library(unglue)
unglue_unnest(shops, street, "{street} {value=\\d.*}")
#> Name city street value
#> 1 Something Fakecity New Street 3
#> 2 SomethingOther Fakecity Some-Complicated-Casestreet 1-3
#> 3 SomethingDifferent Fakecity Fake Street 14a
Created on 2019-10-08 by the reprex package (v0.3.0)
International addresses make this a very complex problem. One attempt, in PHP:
$re = '/(\d+[\d\/\-\. ,]*[ ,\d\-\w]{0,2} )/m';
$str = '234 Test Road, Testville
456b Tester Road, Testville
789 c Tester Road, Testville
Mystreet 14a
123/3 dsdsdfs
Roobertinkatu 36-40
Flats 1-24 Acacia Avenue
Apartment 9D, 1 Acacia Avenue
Flat 24, 1 Acacia Avenue
Moscow Street, plot,23 building 2
Apartment 5005 no. 7 lane 31 Wuming Rd
Quinta da Redonda Lote 3 - 1 º
102 - 3 Esq
Av 1 Maio 16,2 dt,
Rua de Ceuta Lote 1 Loja 5
11334 Nc Highway 72 E ';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
// Print the entire match result
var_dump($matches);
Output example
https://regex101.com/r/WVPBji/1

R - Finding a specific row of a dataframe and then adding data from that row to another data frame

I have two dataframes. One is full of data about individuals, including their street name and house number but not their house size. The other has information about each house, including street name, house number and house size, but no data on the individuals living there. I'd like to add the size information to the first dataframe as a new column so I can see the house size for each individual.
I have over 200,000 individuals and around 100,000 houses, and the methods I've tried so far (cutting down the second dataframe for each individual) are painfully slow. Is there an efficient way to do this? Thank you.
Using @jazzurro's example (df1 and df2, defined in the answer below), another option for larger datasets is data.table:
library(data.table)
setkey(setDT(df1), street, num)
setkey(setDT(df2), street, num)
df2[df1]
# size street num person
#1: large liliha st 3 bob
#2: NA mahalo st 32 dan
#3: small makiki st 15 ana
#4: NA nehoa st 11 ellen
#5: medium nuuanu ave 8 cathy
Here is my suggestion. Since no sample data was provided, I created some based on your description. In future, please include sample data and your code: you are more likely to receive help, and it saves people time. You have two key variables for merging the two data frames: street name and house number. Here, I chose to keep all data points in df1.
df1 <- data.frame(person = c("ana", "bob", "cathy", "dan", "ellen"),
street = c("makiki st", "liliha st", "nuuanu ave", "mahalo st", "nehoa st"),
num = c(15, 3, 8, 32, 11),
stringsAsFactors = FALSE)
#person street num
#1 ana makiki st 15
#2 bob liliha st 3
#3 cathy nuuanu ave 8
#4 dan mahalo st 32
#5 ellen nehoa st 11
df2 <- data.frame(size = c("small", "large", "medium"),
street = c("makiki st", "liliha st", "nuuanu ave"),
num = c(15, 3, 8),
stringsAsFactors = FALSE)
# size street num
#1 small makiki st 15
#2 large liliha st 3
#3 medium nuuanu ave 8
library(dplyr)
left_join(df1, df2)
# street num person size
#1 makiki st 15 ana small
#2 liliha st 3 bob large
#3 nuuanu ave 8 cathy medium
#4 mahalo st 32 dan <NA>
#5 nehoa st 11 ellen <NA>
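If neither dplyr nor data.table is available, the same lookup can be sketched in base R with match(), which finds all rows in one vectorised call instead of filtering the house table once per individual. This assumes each street and number pair occurs at most once in df2; the key1 and key2 names are only illustrative.

```r
df1 <- data.frame(person = c("ana", "bob", "cathy", "dan", "ellen"),
                  street = c("makiki st", "liliha st", "nuuanu ave", "mahalo st", "nehoa st"),
                  num = c(15, 3, 8, 32, 11),
                  stringsAsFactors = FALSE)
df2 <- data.frame(size = c("small", "large", "medium"),
                  street = c("makiki st", "liliha st", "nuuanu ave"),
                  num = c(15, 3, 8),
                  stringsAsFactors = FALSE)

# Build one composite key per row, then look up every individual at once
key1 <- paste(df1$street, df1$num, sep = "|")
key2 <- paste(df2$street, df2$num, sep = "|")
df1$size <- df2$size[match(key1, key2)]   # NA where no house matches
df1
```

Base R's merge(df1, df2, by = c("street", "num"), all.x = TRUE) is another way to express the same left join.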

Resources