Use R and regex to keep only desired comma in address string - r

I would like to split a list of address strings into two columns, splitting between City and State.
For example, say I have two address strings:
addr1 <- "123 ABC street Lot 10, Fairfax, VA 22033"
addr2 <- "123 ABC street Fairfax, VA 22033"
How would I use regex in R to remove the 'unexpected' comma between Lot 10 and Fairfax, so that the only comma remaining in any given address string is the comma separating City and State?
My desired result is a dataframe with the address string split into two columns on the abovementioned comma:

There are two ways to expand on Tim's answer:
Zip+4 zip codes (US only?); and
"state" of not-2-letters ... really, just looking for the word-boundary instead of hard-coding "2 letters" (not sure if/when this is a factor ... does anybody write a non-2-letter state?)
addresses <- c("123 ABC street Lot 10, Fairfax, VA 22033", "123 ABC street Fairfax, VA 22033")
sub("\\b[[:alpha:]]+\\s+[[:digit:]]{5}(-[[:digit:]]{4})?$", "", addresses)
# [1] "123 ABC street Lot 10, Fairfax, " "123 ABC street Fairfax, "
sub(".*(\\b[[:alpha:]]+\\s+[[:digit:]]{5}(-[[:digit:]]{4})?$)", "\\1", addresses)
# [1] "VA 22033" "VA 22033"
We can remove commas (gsub(",","",...)) and trim whitespace (trimws(...)) separately.
out <- data.frame(
X1 = sub("\\b[[:alpha:]]+\\s+[[:digit:]]{5}(-[[:digit:]]{4})?$", "", addresses),
X2 = sub(".*(\\b[[:alpha:]]+\\s+[[:digit:]]{5}(-[[:digit:]]{4})?$)", "\\1", addresses)
)
out[] <- lapply(out, function(x) trimws(gsub(",", "", x)))
out
# X1 X2
# 1 123 ABC street Lot 10 Fairfax VA 22033
# 2 123 ABC street Fairfax VA 22033
(Though one may argue for a more-careful removal of commas. shrug)

Assuming you just want to split the address before the final state and zip code, you may use sub as follows:
df$X1 <- sub(", [A-Z]{2} \\d{5}$", "", df$address)
df$X2 <- sub("^.*([A-Z]{2} \\d{5})$", "\\1", df$address)
df
X1 X2
1 123 ABC street Lot 10, Fairfax VA 22033
2 123 ABC street Fairfax VA 22033
Data:
df <- data.frame(address=c("123 ABC street Lot 10, Fairfax, VA 22033",
"123 ABC street Fairfax, VA 22033"), stringsAsFactors=FALSE)

Related

Splitting character object using vector of delimiters

I have a large number of text files. Each file is stored as an observation in a dataframe. Each observation contains multiple fields so there is some structure in each object. I'm looking to split each based on the structured information within each file.
Data is currently in the following structure (simplified):
a <- c("Name: John Doe Age: 50 Address Please give full address 22 Main Street, New York")
b <- c("Name: Jane Bloggs Age: 42 Address Please give full address 1 Lower Street, London")
df <- data.frame(rawtext = c(a,b))
I'd like to split each observation into individual variable columns. It should end up looking like this:
Name Age Address
John Doe 50 22 Main Street, New York
Jane Bloggs 42 1 Lower Street, London
I thought that this could be done fairly simply using a pre-defined vector of delimiters since each text object is structured. I have tried using stringr and str_split() but this doesn't handle the vector input. e.g.
delims <- c("Name:", "Age", "Address Please give full address")
str_split(df$rawtext, delims)
I'm perhaps trying to oversimplify here. The only other approach I can think of is to loop through each observation and extract all text after delims[1] and before delims[2] (and so on) for all fields.
e.g. the following bodge would get me the name field based on the delimiters:
sub(paste0(".*", delims[1]), "", df$rawtext[1]) %>% sub(paste0(delims[2], ".*"), "", .)
[1] " John Doe "
This feels extremely inefficient. Is there a better way that I'm missing?
A tidyverse solution:
library(tidyverse)
delims <- c("Name", "Age", "Address Please give full address")
df %>%
mutate(rawtext = str_remove_all(rawtext, ":")) %>%
separate(rawtext, c("x", delims), sep = paste(delims, collapse = "|"), convert = T) %>%
mutate(across(where(is.character), str_squish), x = NULL)
# # A tibble: 2 x 3
# Name Age `Address Please give full address`
# <chr> <dbl> <chr>
# 1 John Doe 50 22 Main Street, New York
# 2 Jane Bloggs 42 1 Lower Street, London
Note: convert = T in separate() converts Age from character to numeric ignoring leading/trailing whitespaces.

how to extract sub strings from a string like "Airport West 1/26 Cameron St 3 br t $830000 S Nelson Alexander" using r and stringr

I have some property sale data downloaded from Internet. It is a PDF file. When I copy and paste the data into a text file, it looks like this:
> a
[1] "Airport West 1/26 Cameron St 3 br t $830000 S Nelson Alexander" "Albert Park 106 Graham St 2 br h $0 SP RT Edgar"
Let's take the first line as an example. Every row is a record of a property, including suburb (Airport West), address (1/26 Cameron St), the count of bedrooms (3), property type (t), price ($830000), sale type (S). The last one (Nelson) is about the agent, which I do not need here.
I want to analyse this data. I need to extract the information first. I hope I can get the data like this: (b is a data frame)
> b
Suburb Address Bedroom PropertyType Price SoldType
1 Airport West 1/26 Cameron St 3 t 830000 S
2 Albert Park 106 Graham St 2 h 0 SP
Could anyone please tell me how to use stringr package or other methods to split the long string into the sub strings that I need?
1) gsubfn::read.pattern read.pattern in the gsubfn package takes a regular expression whose capture groups (the parts within parentheses) are taken to be the fields of the input and a data frame is created to assemble them.
library(gsubfn)
pat <- "^(.*?) (\\d.*?) (\\d) br (.) [$](\\d+) (\\w+) .*"
cn <- c("Suburb", "Address", "Bedroom", "PropertyType", "Price", "SoldType")
read.pattern(text = a, pattern = pat, col.names = cn, as.is = TRUE)
giving this data.frame:
Suburb Address Bedroom PropertyType Price SoldType
1 Airport West 1/26 Cameron St 3 t 830000 S
2 Albert Park 106 Graham St 2 h 0 SP
2) no packages This could also be done without any packages like this (pat and cn are from above):
replacement <- "\\1,\\2,\\3,\\4,\\5,\\6"
read.table(text = sub(pat, replacement, a), col.names = cn, as.is = TRUE, sep = ",")
Note: The input a in reproducible form is:
a <- c("Airport West 1/26 Cameron St 3 br t $830000 S Nelson Alexander",
"Albert Park 106 Graham St 2 br h $0 SP RT Edgar")

how to read text files and create a data frame in R

Need to read the txt file in
https://raw.githubusercontent.com/fonnesbeck/Bios6301/master/datasets/addr.txt
and convert them into a data frame R with column number as: LastName, FirstName, streetno, streetname, city, state, and zip...
Tried to use sep command to separate them but failed...
Expanding on my comments, here's another approach. You may need to tweak some of the code if your full data set has a wider range of patterns to account for.
library(stringr) # For str_trim
# Read string data and split into data frame
dat = readLines("addr.txt")
dat = as.data.frame(do.call(rbind, strsplit(dat, split=" {2,10}")), stringsAsFactors=FALSE)
names(dat) = c("LastName", "FirstName", "address", "city", "state", "zip")
# Separate address into number and street (if streetno isn't always numeric,
# or if you don't want it to be numeric, then just remove the as.numeric wrapper).
dat$streetno = as.numeric(gsub("([0-9]{1,4}).*","\\1", dat$address))
dat$streetname = gsub("[0-9]{1,4} (.*)","\\1", dat$address)
# Clean up zip
dat$zip = gsub("O","0", dat$zip)
dat$zip = str_trim(dat$zip)
dat = dat[,c(1:2,7:8,4:6)]
dat
LastName FirstName streetno streetname city state zip
1 Bania Thomas M. 725 Commonwealth Ave. Boston MA 02215
2 Barnaby David 373 W. Geneva St. Wms. Bay WI 53191
3 Bausch Judy 373 W. Geneva St. Wms. Bay WI 53191
...
41 Wright Greg 791 Holmdel-Keyport Rd. Holmdel NY 07733-1988
42 Zingale Michael 5640 S. Ellis Ave. Chicago IL 60637
Try this.
x<-scan("https://raw.githubusercontent.com/fonnesbeck/Bios6301/master/datasets/addr.txt" ,
what = list(LastName="", FirstName="", streetno="", streetname="", city="", state="",zip=""))
data<-as.data.frame(x)
I found it easiest to fix up the file into a csv by adding the commas where they belong, then read it.
## get the page as text
txt <- RCurl::getURL(
"https://raw.githubusercontent.com/fonnesbeck/Bios6301/master/datasets/addr.txt"
)
## fix the EOL (end-of-line) markers
g1 <- gsub(" \n", "\n", txt, fixed = TRUE)
## read it
df <- read.csv(
## add most comma-separators, then the last for the house number
text = gsub("(\\d+) (\\D+)", "\\1,\\2", gsub("\\s{2,}", ",", g1)),
header = FALSE,
## set the column names
col.names = c("LastName", "FirstName", "streetno", "streetname", "city", "state", "zip")
)
## result
head(df)
# LastName FirstName streetno streetname city state zip
# 1 Bania Thomas M. 725 Commonwealth Ave. Boston MA O2215
# 2 Barnaby David 373 W. Geneva St. Wms. Bay WI 53191
# 3 Bausch Judy 373 W. Geneva St. Wms. Bay WI 53191
# 4 Bolatto Alberto 725 Commonwealth Ave. Boston MA O2215
# 5 Carlstrom John 933 E. 56th St. Chicago IL 60637
# 6 Chamberlin Richard A. 111 Nowelo St. Hilo HI 96720
Here your problem is not how to use R to read in this data, but rather it's that your data is not sufficiently structured using regular delimiters between the variable-length fields you have as inputs. In addition, the zip code field contains some alpha "O" characters that should be "0".
So here is a way to use regular expression substitution to add in delimiters, and then parse the delimited text using read.csv(). Note that depending on exceptions in your full set of text, you may need to adjust the regular expressions. I have done them step by step here to make it clear what is being done and so that you can adjust them as you find exceptions in your input text. (For instance, some city names like `Wms. Bay" are two words.)
addr.txt <- readLines("https://raw.githubusercontent.com/fonnesbeck/Bios6301/master/datasets/addr.txt")
addr.txt <- gsub("\\s+O(\\d{4})", " 0\\1", addr.txt) # replace O with 0 in zip
addr.txt <- gsub("(\\s+)([A-Z]{2})", ", \\2", addr.txt) # state
addr.txt <- gsub("\\s+(\\d{5}(\\-\\d{4}){0,1})\\s*", ", \\1", addr.txt) # zip
addr.txt <- gsub("\\s+(\\d{1,4})\\s", ", \\1, ", addr.txt) # streetno
addr.txt <- gsub("(^\\w*)(\\s+)", "\\1, ", addr.txt) # LastName (FirstName)
addr.txt <- gsub("\\s{2,}", ", ", addr.txt) # city, by elimination
addr <- read.csv(textConnection(addr.txt), header = FALSE,
col.names = c("LastName", "FirstName", "streetno", "streetname", "city", "state", "zip"),
stringsAsFactors = FALSE)
head(addr)
## LastName FirstName streetno streetname city state zip
## 1 Bania Thomas M. 725 Commonwealth Ave. Boston MA 02215
## 2 Barnaby David 373 W. Geneva St. Wms. Bay WI 53191
## 3 Bausch Judy 373 W. Geneva St. Wms. Bay WI 53191
## 4 Bolatto Alberto 725 Commonwealth Ave. Boston MA 02215
## 5 Carlstrom John 933 E. 56th St. Chicago IL 60637
## 6 Chamberlin Richard A. 111 Nowelo St. Hilo HI 96720

How do I make this nested for loop work faster

My data is as shown below:
txt$txt:
my friend stays in adarsh nagar
I changed one apple one samsung S3 n one sony experia z.
Hi girls..Friends meet at bangalore
what do u think of ccd at bkc
I have an exhaustive list of city names. Listing few of them below:
city:
ahmedabad
adarsh nagar
airoli
bangalore
bangaladesh
banerghatta Road
bkc
calcutta
I am searching for city names (from the "city" list I have) in txt$txt and extracting them into another column if they are present. So the simple loop below works for me... but it's taking a lot of time on the bigger dataset.
for(i in 1:nrow(txt)){
a <- c()
for(j in 1:nrow(city)){
a[j] <- grepl(paste("\\b",city[j,1],"\\b", sep = ""),txt$txt[i])
}
txt$city[i] <- ifelse(sum(a) > 0, paste(city[which(a),1], collapse = "_"), "NONE")
}
I tried to use an apply function, and this is the maximum i could get to.
apply(as.matrix(txt$txt), 1, function(x){ifelse(sum(unlist(strsplit(x, " ")) %in% city[,1]) > 0, paste(unlist(strsplit(x, " "))[which(unlist(strsplit(x, " ")) %in% city[,1])], collapse = "_"), "NONE")})
[1] "NONE" "NONE" "bangalore" "bkc"
Desired Output:
> txt
txt city
1 my friend stays in adarsh nagar adarsh nagar
2 I changed one apple one samsung S3 n one sony experia z. NONE
3 Hi girls..Friends meet at bangalore bangalore
4 what do u think of ccd at bkc bkc
I want a faster process in R, which does the same thing what the for loop above does. Please advise. Thanks
Here's a possibility using stri_extract_first_regex from stringi package:
library(stringi)
# prepare some data
df <- data.frame(txt = c("in adarsh nagar", "sony experia z", "at bangalore"))
city <- c("ahmedabad", "adarsh nagar", "airoli", "bangalore")
df$city <- stri_extract_first_regex(str = df$txt, regex = paste(city, collapse = "|"))
df
# txt city
# 1 in adarsh nagar adarsh nagar
# 2 sony experia z <NA>
# 3 at bangalore bangalore
This should be much faster:
bigPattern <- paste('(\\b',city[,1],'\\b)',collapse='|',sep='')
txt$city <- sapply(regmatches(txt$txt,gregexpr(bigPattern,txt$txt)),FUN=function(x) ifelse(length(x) == 0,'NONE',paste(unique(x),collapse='_')))
Explanation:
in the first line we build a big regular expression matching all the cities, e.g. :
(\\bahmedabad\\b)|(\\badarsh nagar\\b)|(\\bairoli\\b)| ...
Then we use gregexpr in combination with regmatches, in this way we get a list of the matches for each element in txt$txt.
Finally, with a simple sapply, for each element of the list we concatenate the matched cities (after removing the duplicates i.e. cities mentioned more than one time).
Try this:
# YOUR DATA
##########
txt <- readLines(n = 4)
my friend stays in adarsh nagar and airoli
I changed one apple one samsung S3 n one sony experia z.
Hi girls..Friends meet at bangalore
what do u think of ccd at bkc
city <- readLines(n = 8)
ahmedabad
adarsh nagar
airoli
bangalore
bangaladesh
banerghatta Road
bkc
calcutta
# MATCHING
##########
matches <- unlist(setNames(lapply(city, grep, x = txt, fixed = TRUE),
city))
(res <- (sapply(1:length(txt), function(x)
paste0(names(matches)[matches == x], collapse = "___"))))
# [1] "adarsh nagar___airoli" ""
# [3] "bangalore" "bkc"

Split street address into street number and street name in r

I want to split a street address into street name and street number in r.
My input data has a column that reads for example
Street.Addresses
205 Cape Road
32 Albany Street
cnr Kempston/Durban Roads
I want to split the street number and street name into two separate columns, so that it reads:
Street Number Street Name
205 Cape Road
32 Albany Street
cnr Kempston/Durban Roads
Is it in anyway possible to split the numeric value from the non numeric entries in a factor/string in R?
Thank you
you can try:
y <- lapply(strsplit(x, "(?<=\\d)\\b ", perl=T), function(x) if (length(x)<2) c("", x) else x)
y <- do.call(rbind, y)
colnames(y) <- c("Street Number", "Street Name")
hth
I'm sure that someone is going to come along with a cool regex solution with lookaheads and so on, but this might work for you:
X <- c("205 Cape Road", "32 Albany Street", "cnr Kempston/Durban Roads")
nonum <- grepl("^[^0-9]", X)
X[nonum] <- paste0(" \t", X[nonum])
X[!nonum] <- gsub("(^[0-9]+ )(.*)", "\\1\t\\2", X[!nonum])
read.delim(text = X, header = FALSE)
# V1 V2
# 1 205 Cape Road
# 2 32 Albany Street
# 3 NA cnr Kempston/Durban Roads
Here is another way:
df <- data.frame (Street.Addresses = c ("205 Cape Road", "32 Albany Street", "cnr Kempston/Durban Roads"),
stringsAsFactors = F)
new_df <- data.frame ("Street.Number" = character(),
"Street.Name" = character(),
stringsAsFactors = F)
for (i in 1:nrow (df)) {
new_df [i,"Street.Number"] <- unlist(strsplit (df[["Street.Addresses"]], " ")[i])[1]
new_df [i,"Street.Name"] <- paste (unlist(strsplit (df[["Street.Addresses"]], " ")[i])[-1], collapse = " ")
}
> new_df
Street.Number Street.Name
1 205 Cape Road
2 32 Albany Street
3 cnr Kempston/Durban Roads

Resources