I want to split a street address into street name and street number in R.
My input data has a column that reads, for example:
Street.Addresses
205 Cape Road
32 Albany Street
cnr Kempston/Durban Roads
I want to split the street number and street name into two separate columns, so that it reads:
Street Number    Street Name
205              Cape Road
32               Albany Street
                 cnr Kempston/Durban Roads
Is it in any way possible to split the numeric value from the non-numeric entries in a factor/string in R?
Thank you
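You can try:
# x is the character vector of addresses (apply as.character() first if it is a factor)
y <- lapply(strsplit(x, "(?<=\\d)\\b ", perl = TRUE),
            function(parts) if (length(parts) < 2) c("", parts) else parts)
y <- do.call(rbind, y)
colnames(y) <- c("Street Number", "Street Name")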
hth
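To see what the lookbehind split above produces before and after the padding step, here is a minimal sketch using the three sample addresses from the question. The vector name x and its contents are assumed here, since the answer does not define them.

```r
# Sample addresses from the question; the name `x` is an assumption
x <- c("205 Cape Road", "32 Albany Street", "cnr Kempston/Durban Roads")

# Split at a space that directly follows a digit
parts <- strsplit(x, "(?<=\\d)\\b ", perl = TRUE)

# Addresses without a leading number come back as a single piece
n_pieces <- lengths(parts)

# Pad single-piece results with an empty street number, then bind into a matrix
padded <- lapply(parts, function(p) if (length(p) < 2) c("", p) else p)
y <- do.call(rbind, padded)
colnames(y) <- c("Street Number", "Street Name")
```

The result is a character matrix, so wrap it in as.data.frame() if a data frame is needed.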
I'm sure that someone is going to come along with a cool regex solution with lookaheads and so on, but this might work for you:
X <- c("205 Cape Road", "32 Albany Street", "cnr Kempston/Durban Roads")
nonum <- grepl("^[^0-9]", X)
X[nonum] <- paste0(" \t", X[nonum])
X[!nonum] <- gsub("(^[0-9]+ )(.*)", "\\1\t\\2", X[!nonum])
read.delim(text = X, header = FALSE)
#    V1                        V2
# 1 205                 Cape Road
# 2  32             Albany Street
# 3  NA cnr Kempston/Durban Roads
Here is another way:
df <- data.frame(Street.Addresses = c("205 Cape Road", "32 Albany Street", "cnr Kempston/Durban Roads"),
                 stringsAsFactors = FALSE)
new_df <- data.frame(Street.Number = character(),
                     Street.Name = character(),
                     stringsAsFactors = FALSE)
# Split each address on spaces once, outside the loop
parts <- strsplit(df[["Street.Addresses"]], " ")
for (i in seq_len(nrow(df))) {
  # The first token becomes the number, even when it is not numeric (e.g. "cnr")
  new_df[i, "Street.Number"] <- parts[[i]][1]
  new_df[i, "Street.Name"] <- paste(parts[[i]][-1], collapse = " ")
}
> new_df
  Street.Number           Street.Name
1           205             Cape Road
2            32         Albany Street
3           cnr Kempston/Durban Roads
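If non-numeric first words such as "cnr" should stay in the street name, as the question's desired output suggests, a vectorized variant without a loop is one option. This is only a sketch: the names addr, has_num and result are made up for illustration, and it assumes a street number is a run of digits at the very start of the string.

```r
# Sample addresses from the question; `addr` is an assumed name
addr <- c("205 Cape Road", "32 Albany Street", "cnr Kempston/Durban Roads")

# TRUE where the address starts with digits followed by whitespace
has_num <- grepl("^[0-9]+\\s", addr)

# Keep the leading digits as the number, or "" when there are none
street_number <- ifelse(has_num, sub("^([0-9]+)\\s+.*$", "\\1", addr), "")

# Drop the leading digits from the name, or keep the address unchanged
street_name <- ifelse(has_num, sub("^[0-9]+\\s+", "", addr), addr)

result <- data.frame(Street.Number = street_number,
                     Street.Name = street_name,
                     stringsAsFactors = FALSE)
```

Numbers with suffixes (for example "12A") would not be recognised by this pattern and would need a looser regex.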
I would like to split a list of address strings into two columns, splitting between City and State.
For example, say I have two address strings:
addr1 <- "123 ABC street Lot 10, Fairfax, VA 22033"
addr2 <- "123 ABC street Fairfax, VA 22033"
How would I use regex in R to remove the 'unexpected' comma between Lot 10 and Fairfax, so that the only comma remaining in any given address string is the comma separating City and State?
My desired result is a dataframe with the address string split into two columns on the abovementioned comma:
There are two ways to expand on Tim's answer:
1. support ZIP+4 zip codes (US only?); and
2. allow a "state" that is not exactly two letters, by looking for the word boundary instead of hard-coding "2 letters" (not sure if/when this is a factor ... does anybody write a non-2-letter state?)
addresses <- c("123 ABC street Lot 10, Fairfax, VA 22033", "123 ABC street Fairfax, VA 22033")
sub("\\b[[:alpha:]]+\\s+[[:digit:]]{5}(-[[:digit:]]{4})?$", "", addresses)
# [1] "123 ABC street Lot 10, Fairfax, " "123 ABC street Fairfax, "
sub(".*(\\b[[:alpha:]]+\\s+[[:digit:]]{5}(-[[:digit:]]{4})?$)", "\\1", addresses)
# [1] "VA 22033" "VA 22033"
We can remove commas (gsub(",","",...)) and trim whitespace (trimws(...)) separately.
out <- data.frame(
X1 = sub("\\b[[:alpha:]]+\\s+[[:digit:]]{5}(-[[:digit:]]{4})?$", "", addresses),
X2 = sub(".*(\\b[[:alpha:]]+\\s+[[:digit:]]{5}(-[[:digit:]]{4})?$)", "\\1", addresses)
)
out[] <- lapply(out, function(x) trimws(gsub(",", "", x)))
out
#                              X1       X2
# 1 123 ABC street Lot 10 Fairfax VA 22033
# 2        123 ABC street Fairfax VA 22033
(Though one may argue for a more-careful removal of commas. shrug)
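One possible reading of "more careful" is to drop only the commas that are followed by another comma, so the final comma (the one before the state) survives and can be used as the split point, which is what the question asked for. A minimal sketch, assuming every address contains at least one comma; the names cleaned, pieces and out2 are illustrative only.

```r
addresses <- c("123 ABC street Lot 10, Fairfax, VA 22033",
               "123 ABC street Fairfax, VA 22033")

# Remove every comma that still has another comma somewhere after it
cleaned <- gsub(",(?=.*,)", "", addresses, perl = TRUE)

# Split on the single remaining comma (and any whitespace after it)
pieces <- strsplit(cleaned, ",\\s*")

out2 <- data.frame(X1 = vapply(pieces, `[`, character(1), 1),
                   X2 = vapply(pieces, `[`, character(1), 2),
                   stringsAsFactors = FALSE)
```

Because the split happens on the comma itself, this variant does not depend on the zip code format at all.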
Assuming you just want to split the address before the final state and zip code, you may use sub as follows:
df$X1 <- sub(", [A-Z]{2} \\d{5}$", "", df$address)
df$X2 <- sub("^.*([A-Z]{2} \\d{5})$", "\\1", df$address)
df
                              X1       X2
1 123 ABC street Lot 10, Fairfax VA 22033
2         123 ABC street Fairfax VA 22033
Data:
df <- data.frame(address=c("123 ABC street Lot 10, Fairfax, VA 22033",
"123 ABC street Fairfax, VA 22033"), stringsAsFactors=FALSE)
I have a dataframe in R concerning houses. This is a small sample:
Address                           Type       Rent
Glasgow;Scotland                  House      1500
High Street;Edinburgh;Scotland    Apartment  1000
Dundee;Scotland                   Apartment  800
South Street;Dundee;Scotland      House      900
I would like to just pull out the last two instances of the Address column into a City and County column in my dataframe.
I have used mutate and strsplit to split this column:
data <- mutate(dataframe, split_add = strsplit(dataframe$Address, ";"))
I now have a new column in my dataframe which resembles the following:
split_add
c("Glasgow","Scotland")
c("High Street","Edinburgh","Scotland")
c("Dundee","Scotland")
c("South Street","Dundee","Scotland")
How do I extract the last 2 elements of each of these vectors into columns "City" and "County"?
I attempted:
data <- mutate(data, city = split_add[-2])
thinking it would take the second element from the end of each vector, but this did not work.
Using tidyr::separate() with the fill = "left" option is probably your best bet:
dataframe <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Address Type Rent
Glasgow;Scotland House 1500
'High Street;Edinburgh;Scotland' Apartment 1000
Dundee;Scotland Apartment 800
'South Street;Dundee;Scotland' House 900
")
library(tidyr)
separate(dataframe, Address, into = c("Street", "City", "County"),
sep = ";", fill = "left")
# Street City County Type Rent
# 1 <NA> Glasgow Scotland House 1500
# 2 High Street Edinburgh Scotland Apartment 1000
# 3 <NA> Dundee Scotland Apartment 800
# 4 South Street Dundee Scotland House 900
I was thinking about another way of dealing with this problem.
1. Create a data frame from the address data
test_data <- data.frame(address = c("Glasgow, Scotland",
"High Street, Edinburgh, Scotland",
"Dundee, Scotland",
"South Street, Dundee, Scotland"), stringsAsFactors = FALSE)
2. Use separate() from tidyr to split the column
library(tidyr)
# Two-part addresses are padded on the right with NA, so c3 is NA for them
new_test <- test_data %>% separate(address, c("c1", "c2", "c3"), sep = ", ", fill = "right")
3. Use dplyr and ifelse() to keep only the last two parts
library(dplyr)
new_test %>%
mutate(city = ifelse(is.na(c3), c1, c2), county = ifelse(is.na(c3), c2, c3)) %>%
select(city, county)
The final data looks like this.
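Assuming that you're using dplyr:
# split_add is a list column, so tail() has to be applied to each element
data <- mutate(dataframe,
               split_add = strsplit(Address, ";"),
               City = sapply(split_add, function(p) tail(p, 2)[1]),
               County = sapply(split_add, function(p) tail(p, 1)))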
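The same "last two pieces" idea can be sketched in base R without dplyr or tidyr. The vector below reuses the sample addresses from the question; the names parts, last_two, city and county are illustrative, and the sketch assumes every address has at least two pieces (vapply() stops with an error otherwise).

```r
address <- c("Glasgow;Scotland", "High Street;Edinburgh;Scotland",
             "Dundee;Scotland", "South Street;Dundee;Scotland")

# One character vector of pieces per address
parts <- strsplit(address, ";", fixed = TRUE)

# tail(p, 2) keeps the last two pieces; vapply() returns a 2 x n matrix
last_two <- vapply(parts, function(p) tail(p, 2), character(2))

city <- last_two[1, ]
county <- last_two[2, ]
```

The two vectors can then be attached to the original data frame with ordinary column assignment.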
I am relatively new to R. I have written the following code. However, because it uses a for-loop, it is slow. I am not too familiar with packages that will convert this for-loop into a more efficient solution (apply functions?).
What my code does is this: it is trying to extract country names from a variable based on another dataframe that has all countries.
For instance, this is what data looks like:
country    Institution
           edmonton general hospital
           ontario, canada
           miyazaki, japan
           department of head
This is what countries looks like:
Name       Code
algeria    dz
canada     ca
japan      jp
kenya      ke
# string match the countries
for(i in 1:nrow(data))
{
for (j in 1:nrow(countries))
{
data$country[i] <- ifelse(str_detect(string = data$Institution[i], pattern = paste0("\\b", countries$Name[j], "\\b")), countries$Name[j], data$country[i])
}
}
The above code changes data so that it looks like this:
country    Institution
           edmonton general hospital
canada     ontario, canada
japan      miyazaki, japan
           department of head
How can I replace my for-loop with something faster that preserves the same behaviour?
Thanks.
You can do a one-liner with str_extract. We'll paste the country names together with word boundaries and concatenate them with a regex | or operator.
library(stringr)
data$country = str_extract(data$Institution, paste0(
"\\b", country$Name, "\\b", collapse = "|"
))
data
#                 Institution country
# 1 edmonton general hospital    <NA>
# 2           ontario, canada  canada
# 3           miyazaki, japan   japan
# 4        department of head    <NA>
Using this data:
country <- read.table(text = " Name Code
algeria dz
canada ca
japan jp
kenya ke",
stringsAsFactors = FALSE, header = TRUE)
data <- data.frame(Institution = c("edmonton general hospital",
"ontario, canada",
"miyazaki, japan",
"department of head"))
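For readers who prefer to avoid the stringr dependency, roughly the same alternation pattern can be used with base R's regexpr() and regmatches(). This is a sketch under that assumption, not the answer's own method; institution, country_names and the other names are illustrative.

```r
institution <- c("edmonton general hospital", "ontario, canada",
                 "miyazaki, japan", "department of head")
country_names <- c("algeria", "canada", "japan", "kenya")

# One pattern of the form \b(algeria|canada|japan|kenya)\b
pattern <- paste0("\\b(", paste(country_names, collapse = "|"), ")\\b")

# Position of the first match in each string, or -1 when there is none
m <- regexpr(pattern, institution, perl = TRUE)

# regmatches() returns only the matched substrings, so fill them in by index
country <- rep(NA_character_, length(institution))
country[m > 0] <- regmatches(institution, m)
```

Like str_extract(), this keeps only the first country found in each string.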
The data:
library(data.table)  # provides setDT() and :=
countries <- setDT(read.table(text = " Name Code
algeria dz
canada ca
japan jp
kenya ke",
stringsAsFactors = FALSE, header = TRUE))
data <- setDT(list(country = array(dim = 2), Institution =
c("edmonton general hospital ontario, canada",
"miyazaki, japan department of head")))
I use data.table for syntax convenience, but you can surely do otherwise; the main idea is to use just one loop together with grepl:
data[,country := as.character(country)]
for( x in unique(countries$Name)){data[grepl(x,data$Institution),country := x]}
> data
country Institution
1: canada edmonton general hospital ontario, canada
2: japan miyazaki, japan department of head
You could add the tolower function to avoid case problems: grepl(tolower(x), tolower(data$Institution))
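As an alternative to wrapping both sides in tolower(), grepl() has an ignore.case argument, and word boundaries can be added so that a country name is not matched inside a longer word. A small sketch with made-up mixed-case input; the vectors below are hypothetical and not from the question.

```r
# Hypothetical mixed-case input, for illustration only
institution <- c("Ontario, CANADA", "Miyazaki, Japan", "Department of Head")
country_names <- c("canada", "japan")

country <- rep(NA_character_, length(institution))
for (nm in country_names) {
  # Case-insensitive whole-word match for the current country name
  hit <- grepl(paste0("\\b", nm, "\\b"), institution,
               ignore.case = TRUE, perl = TRUE)
  country[hit] <- nm
}
```

This still loops only over the countries, not over the rows.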
I have all the Google Maps API requests in a column, but when I try to loop over them to call and parse each one, I get an error. If I don't use a loop and do it manually, it works.
a <- c("1780 N Washington Ave Scranton PA 18509", "1858 Hunt Ave Bronx NY 10462", "140 N Warren St Trenton NJ 08608-1308")
# An API key needs to be added to run this:
w <- c("https://maps.googleapis.com/maps/api/distancematrix/json?units=imperial&origins=19+East+34th+Street+New York+NY+10016&destinations=1780+N+Washington+Ave+Scranton+PA+18509&mode=transit&language=fr-FR&key=API_KEY_HERE",
"https://maps.googleapis.com/maps/api/distancematrix/json?units=imperial&origins=19+East+34th+Street+New York+NY+10016&destinations=1858+Hunt+Ave+Bronx+NY+10462&mode=transit&language=fr-FR&key=API_KEY_HERE",
"https://maps.googleapis.com/maps/api/distancematrix/json?units=imperial&origins=19+East+34th+Street+New York+NY+10016&destinations=140+N+Warren+St+Trenton+NJ+08608-1308&mode=transit&language=fr-FR&key=API_KEY_HERE")
df <- data.frame(a,w)
for (i in cpghq) {
url <- df$w
testdf <- jsonlite::fromJSON(url, simplifyDataFrame = TRUE)
list <- unlist(testdf$rows)
transit_time <- as.data.frame(t(as.data.frame(list)))
cpghq$transit_time <- transit_time
The error I get is:
Error: lexical error: invalid char in json text.
https://maps.googleapis.com/map
(right here) ------^
My API call was wrong because "New York" contains a space. I fixed it using gsub("[[:space:]]", "+", a), but utils::URLencode() would also have worked.
Build the API call
a <- c("1780 N Washington Ave Scranton PA 18509", "1858 Hunt Ave Bronx NY 10462", "140 N Warren St Trenton NJ 08608-1308")
fix_address <- gsub("[[:space:]]", "+", a)
key <- "YOUR_GOOGLE_API_KEY_HERE"
travel_mode <- "transit"
root <- "https://maps.googleapis.com/maps/api/distancematrix/json?units=imperial&origins="
api_call <- paste0(root,"350+5th+Ave+New+York+NY+10118",
"&destinations=",
fix_address,
"&mode=",
travel_mode,
"&language=en-EN",
"&key=", key)
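The two ways of handling spaces mentioned above can be compared side by side. This is a minimal sketch using one of the addresses from the question; plus_encoded and percent_encoded are illustrative names.

```r
a <- c("1780 N Washington Ave Scranton PA 18509", "1858 Hunt Ave Bronx NY 10462")

# Option 1: replace each whitespace character with "+"
plus_encoded <- gsub("[[:space:]]", "+", a)

# Option 2: percent-encode each address; a space becomes %20
percent_encoded <- vapply(a, utils::URLencode, character(1),
                          reserved = TRUE, USE.NAMES = FALSE)
```

Percent-encoding with reserved = TRUE also escapes characters such as "&" or "#", which a plain space replacement would leave untouched.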
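My problem with the loop was very simple: I wasn't using lapply().
I now fetch each call from the Google Maps API with RCurl::getURL() and parse the responses with RJSONIO::fromJSON():
library(RJSONIO)
# Get the JSON returned by Google, one request per URL
doc <- lapply(api_call, RCurl::getURL)
# Parse each JSON string into an R list
parsed <- lapply(doc, RJSONIO::fromJSON)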
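The underlying point is that each request has to receive exactly one URL, not the whole vector. The sketch below shows that pattern offline: fake_fetch is a hypothetical stand-in for a real call such as RCurl::getURL(), and the example.com URLs are placeholders.

```r
# Hypothetical stand-in for a network call; it insists on a single URL
fake_fetch <- function(url) {
  if (length(url) != 1) stop("expected exactly one URL")
  sprintf('{"requested": "%s"}', url)
}

urls <- c("https://example.com/a", "https://example.com/b")

# lapply() hands one URL at a time to the fetcher; a failure yields NA
responses <- lapply(urls, function(u) {
  tryCatch(fake_fetch(u), error = function(e) NA_character_)
})
```

Wrapping each call in tryCatch() means one bad address does not abort the remaining requests.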
As pointed out in my other answer to you, you can also use my googleway package to do the work for you.
library(googleway)
key <- "your_api_key"
a <- c("1780 N Washington Ave Scranton PA 18509",
"1858 Hunt Ave Bronx NY 10462",
"140 N Warren St Trenton NJ 08608-1308")
google_distance(origins = "350 5th Ave New York NY 10188",
destinations = as.list(a),
mode = "transit",
key = key,
simplify = TRUE)
# $destination_addresses
# [1] "1780 N Washington Ave, Scranton, PA 18509, USA" "1858 Hunt Ave, Bronx, NY 10462, USA"
# [3] "140 N Warren St, Trenton, NJ 08608, USA"
#
# $origin_addresses
# [1] "Empire State Building, 350 5th Ave, New York, NY 10118, USA"
#
# $rows
# elements
# 1 ZERO_RESULTS, OK, OK, NA, 19.0 km, 95.8 km, NA, 18954, 95773, NA, 54 mins, 1 hour 44 mins, NA, 3242, 6260
#
# $status
# [1] "OK"
I need to read the text file at
https://raw.githubusercontent.com/fonnesbeck/Bios6301/master/datasets/addr.txt
and convert it into an R data frame with the columns LastName, FirstName, streetno, streetname, city, state, and zip.
I tried to use the sep argument to separate the fields, but failed.
Expanding on my comments, here's another approach. You may need to tweak some of the code if your full data set has a wider range of patterns to account for.
library(stringr) # For str_trim
# Read string data and split into data frame
dat = readLines("addr.txt")
dat = as.data.frame(do.call(rbind, strsplit(dat, split=" {2,10}")), stringsAsFactors=FALSE)
names(dat) = c("LastName", "FirstName", "address", "city", "state", "zip")
# Separate address into number and street (if streetno isn't always numeric,
# or if you don't want it to be numeric, then just remove the as.numeric wrapper).
dat$streetno = as.numeric(gsub("([0-9]{1,4}).*","\\1", dat$address))
dat$streetname = gsub("[0-9]{1,4} (.*)","\\1", dat$address)
# Clean up zip
dat$zip = gsub("O","0", dat$zip)
dat$zip = str_trim(dat$zip)
dat = dat[,c(1:2,7:8,4:6)]
dat
   LastName  FirstName  streetno  streetname           city      state  zip
1  Bania     Thomas M.  725       Commonwealth Ave.    Boston    MA     02215
2  Barnaby   David      373       W. Geneva St.        Wms. Bay  WI     53191
3  Bausch    Judy       373       W. Geneva St.        Wms. Bay  WI     53191
...
41 Wright    Greg       791       Holmdel-Keyport Rd.  Holmdel   NY     07733-1988
42 Zingale   Michael    5640      S. Ellis Ave.        Chicago   IL     60637
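The core of this approach can be tried without downloading the file. The two lines below are hypothetical and only mimic the layout suggested by the output above (fields separated by two or more spaces); the real file's spacing may differ.

```r
# Hypothetical lines imitating the file layout; not read from addr.txt
lines <- c("Bania  Thomas M.  725 Commonwealth Ave.  Boston  MA  O2215",
           "Barnaby  David  373 W. Geneva St.  Wms. Bay  WI  53191")

# Runs of two or more spaces separate fields; single spaces stay inside a field
fields <- do.call(rbind, strsplit(lines, " {2,}"))

# The third field holds "number street"; split it at the first space
streetno <- sub("^([0-9]+) .*$", "\\1", fields[, 3])
streetname <- sub("^[0-9]+ ", "", fields[, 3])

# Replace a letter O typed in place of a zero in the zip code
zip <- gsub("O", "0", fields[, 6], fixed = TRUE)
```

Splitting on wide gaps is what lets two-word values such as "Wms. Bay" or "Thomas M." survive as single fields.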
Try this.
x<-scan("https://raw.githubusercontent.com/fonnesbeck/Bios6301/master/datasets/addr.txt" ,
what = list(LastName="", FirstName="", streetno="", streetname="", city="", state="",zip=""))
data<-as.data.frame(x)
I found it easiest to fix up the file into a csv by adding the commas where they belong, then read it.
## get the page as text
txt <- RCurl::getURL(
"https://raw.githubusercontent.com/fonnesbeck/Bios6301/master/datasets/addr.txt"
)
## fix the EOL (end-of-line) markers
g1 <- gsub(" \n", "\n", txt, fixed = TRUE)
## read it
df <- read.csv(
## add most comma-separators, then the last for the house number
text = gsub("(\\d+) (\\D+)", "\\1,\\2", gsub("\\s{2,}", ",", g1)),
header = FALSE,
## set the column names
col.names = c("LastName", "FirstName", "streetno", "streetname", "city", "state", "zip")
)
## result
head(df)
#   LastName    FirstName   streetno  streetname         city      state  zip
# 1 Bania       Thomas M.   725       Commonwealth Ave.  Boston    MA     O2215
# 2 Barnaby     David       373       W. Geneva St.      Wms. Bay  WI     53191
# 3 Bausch      Judy        373       W. Geneva St.      Wms. Bay  WI     53191
# 4 Bolatto     Alberto     725       Commonwealth Ave.  Boston    MA     O2215
# 5 Carlstrom   John        933       E. 56th St.        Chicago   IL     60637
# 6 Chamberlin  Richard A.  111       Nowelo St.         Hilo      HI     96720
Here your problem is not how to use R to read in this data, but rather it's that your data is not sufficiently structured using regular delimiters between the variable-length fields you have as inputs. In addition, the zip code field contains some alpha "O" characters that should be "0".
So here is a way to use regular expression substitution to add in delimiters, and then parse the delimited text using read.csv(). Note that depending on exceptions in your full set of text, you may need to adjust the regular expressions. I have done them step by step here to make it clear what is being done and so that you can adjust them as you find exceptions in your input text. (For instance, some city names like "Wms. Bay" are two words.)
addr.txt <- readLines("https://raw.githubusercontent.com/fonnesbeck/Bios6301/master/datasets/addr.txt")
addr.txt <- gsub("\\s+O(\\d{4})", " 0\\1", addr.txt) # replace O with 0 in zip
addr.txt <- gsub("(\\s+)([A-Z]{2})", ", \\2", addr.txt) # state
addr.txt <- gsub("\\s+(\\d{5}(\\-\\d{4}){0,1})\\s*", ", \\1", addr.txt) # zip
addr.txt <- gsub("\\s+(\\d{1,4})\\s", ", \\1, ", addr.txt) # streetno
addr.txt <- gsub("(^\\w*)(\\s+)", "\\1, ", addr.txt) # LastName (FirstName)
addr.txt <- gsub("\\s{2,}", ", ", addr.txt) # city, by elimination
addr <- read.csv(textConnection(addr.txt), header = FALSE,
col.names = c("LastName", "FirstName", "streetno", "streetname", "city", "state", "zip"),
stringsAsFactors = FALSE)
head(addr)
##   LastName    FirstName   streetno  streetname         city      state  zip
## 1 Bania       Thomas M.   725       Commonwealth Ave.  Boston    MA     02215
## 2 Barnaby     David       373       W. Geneva St.      Wms. Bay  WI     53191
## 3 Bausch      Judy        373       W. Geneva St.      Wms. Bay  WI     53191
## 4 Bolatto     Alberto     725       Commonwealth Ave.  Boston    MA     02215
## 5 Carlstrom   John        933       E. 56th St.        Chicago   IL     60637
## 6 Chamberlin  Richard A.  111       Nowelo St.         Hilo      HI     96720
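To follow what the substitution steps do, it can help to run them on a single line and inspect the final string. The line below is hypothetical and only imitates the file's layout; the regular expressions are the ones from the answer, applied in the same order.

```r
# Hypothetical input line imitating the layout of addr.txt
s <- "Bania  Thomas M.  725 Commonwealth Ave.  Boston  MA  O2215"

s <- gsub("\\s+O(\\d{4})", " 0\\1", s)                      # O -> 0 in zip
s <- gsub("(\\s+)([A-Z]{2})", ", \\2", s)                   # state
s <- gsub("\\s+(\\d{5}(\\-\\d{4}){0,1})\\s*", ", \\1", s)   # zip
s <- gsub("\\s+(\\d{1,4})\\s", ", \\1, ", s)                # streetno
s <- gsub("(^\\w*)(\\s+)", "\\1, ", s)                      # LastName
s <- gsub("\\s{2,}", ", ", s)                               # city, by elimination

# Every field is now separated by ", "
fields <- strsplit(s, ", ", fixed = TRUE)[[1]]
```

Printing s after each step is a quick way to find which pattern needs adjusting when a line in the real file does not fit.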