I need to read the txt file at
https://raw.githubusercontent.com/fonnesbeck/Bios6301/master/datasets/addr.txt
and convert it into a data frame in R with the columns LastName, FirstName, streetno, streetname, city, state, and zip...
I tried to use the sep argument to separate the fields but failed...
Expanding on my comments, here's another approach. You may need to tweak some of the code if your full data set has a wider range of patterns to account for.
library(stringr) # For str_trim
# Read string data and split into data frame
dat = readLines("addr.txt")
dat = as.data.frame(do.call(rbind, strsplit(dat, split=" {2,10}")), stringsAsFactors=FALSE)
names(dat) = c("LastName", "FirstName", "address", "city", "state", "zip")
# Separate address into number and street (if streetno isn't always numeric,
# or if you don't want it to be numeric, then just remove the as.numeric wrapper).
dat$streetno = as.numeric(gsub("([0-9]{1,4}).*","\\1", dat$address))
dat$streetname = gsub("[0-9]{1,4} (.*)","\\1", dat$address)
# Clean up zip
dat$zip = gsub("O","0", dat$zip)
dat$zip = str_trim(dat$zip)
dat = dat[,c(1:2,7:8,4:6)]
dat
LastName FirstName streetno streetname city state zip
1 Bania Thomas M. 725 Commonwealth Ave. Boston MA 02215
2 Barnaby David 373 W. Geneva St. Wms. Bay WI 53191
3 Bausch Judy 373 W. Geneva St. Wms. Bay WI 53191
...
41 Wright Greg 791 Holmdel-Keyport Rd. Holmdel NY 07733-1988
42 Zingale Michael 5640 S. Ellis Ave. Chicago IL 60637
Try this.
x <- scan("https://raw.githubusercontent.com/fonnesbeck/Bios6301/master/datasets/addr.txt",
          what = list(LastName="", FirstName="", streetno="", streetname="", city="", state="", zip=""))
data <- as.data.frame(x)
I found it easiest to fix up the file into a csv by adding the commas where they belong, then read it.
## get the page as text
txt <- RCurl::getURL(
"https://raw.githubusercontent.com/fonnesbeck/Bios6301/master/datasets/addr.txt"
)
## fix the EOL (end-of-line) markers
g1 <- gsub(" \n", "\n", txt, fixed = TRUE)
## read it
df <- read.csv(
  ## add most comma-separators, then the last for the house number
  text = gsub("(\\d+) (\\D+)", "\\1,\\2", gsub("\\s{2,}", ",", g1)),
  header = FALSE,
  ## set the column names
  col.names = c("LastName", "FirstName", "streetno", "streetname", "city", "state", "zip")
)
## result
head(df)
# LastName FirstName streetno streetname city state zip
# 1 Bania Thomas M. 725 Commonwealth Ave. Boston MA O2215
# 2 Barnaby David 373 W. Geneva St. Wms. Bay WI 53191
# 3 Bausch Judy 373 W. Geneva St. Wms. Bay WI 53191
# 4 Bolatto Alberto 725 Commonwealth Ave. Boston MA O2215
# 5 Carlstrom John 933 E. 56th St. Chicago IL 60637
# 6 Chamberlin Richard A. 111 Nowelo St. Hilo HI 96720
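One caveat visible in this output: the zip column still carries the letter "O" from the source (e.g. O2215). A one-line cleanup along the lines of the other answers might be:
## fix letter O -> digit 0 in the zip column (as.character() in case zip was read as a factor)
df$zip <- gsub("O", "0", as.character(df$zip))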
Here your problem is not how to read this data into R, but rather that your data is not structured with regular delimiters between the variable-length fields you have as inputs. In addition, the zip code field contains some alpha "O" characters that should be "0".
So here is a way to use regular expression substitution to add delimiters, and then parse the delimited text using read.csv(). Note that depending on exceptions in your full set of text, you may need to adjust the regular expressions. I have done them step by step here to make it clear what is being done and so that you can adjust them as you find exceptions in your input text. (For instance, some city names, like "Wms. Bay", are two words.)
addr.txt <- readLines("https://raw.githubusercontent.com/fonnesbeck/Bios6301/master/datasets/addr.txt")
addr.txt <- gsub("\\s+O(\\d{4})", " 0\\1", addr.txt) # replace O with 0 in zip
addr.txt <- gsub("(\\s+)([A-Z]{2})", ", \\2", addr.txt) # state
addr.txt <- gsub("\\s+(\\d{5}(\\-\\d{4}){0,1})\\s*", ", \\1", addr.txt) # zip
addr.txt <- gsub("\\s+(\\d{1,4})\\s", ", \\1, ", addr.txt) # streetno
addr.txt <- gsub("(^\\w*)(\\s+)", "\\1, ", addr.txt) # LastName (FirstName)
addr.txt <- gsub("\\s{2,}", ", ", addr.txt) # city, by elimination
addr <- read.csv(textConnection(addr.txt), header = FALSE,
col.names = c("LastName", "FirstName", "streetno", "streetname", "city", "state", "zip"),
stringsAsFactors = FALSE)
head(addr)
## LastName FirstName streetno streetname city state zip
## 1 Bania Thomas M. 725 Commonwealth Ave. Boston MA 02215
## 2 Barnaby David 373 W. Geneva St. Wms. Bay WI 53191
## 3 Bausch Judy 373 W. Geneva St. Wms. Bay WI 53191
## 4 Bolatto Alberto 725 Commonwealth Ave. Boston MA 02215
## 5 Carlstrom John 933 E. 56th St. Chicago IL 60637
## 6 Chamberlin Richard A. 111 Nowelo St. Hilo HI 96720
I would like to split a list of address strings into two columns, splitting between City and State.
For example, say I have two address strings:
addr1 <- "123 ABC street Lot 10, Fairfax, VA 22033"
addr2 <- "123 ABC street Fairfax, VA 22033"
How would I use regex in R to remove the 'unexpected' comma between Lot 10 and Fairfax, so that the only comma remaining in any given address string is the comma separating City and State?
My desired result is a data frame with the address string split into two columns on the abovementioned comma.
There are two ways to expand on Tim's answer:
1. Zip+4 zip codes (US only?); and
2. a "state" that is not 2 letters ... really, just look for the word boundary instead of hard-coding "2 letters" (not sure if/when this is a factor ... does anybody write a non-2-letter state?).
addresses <- c("123 ABC street Lot 10, Fairfax, VA 22033", "123 ABC street Fairfax, VA 22033")
sub("\\b[[:alpha:]]+\\s+[[:digit:]]{5}(-[[:digit:]]{4})?$", "", addresses)
# [1] "123 ABC street Lot 10, Fairfax, " "123 ABC street Fairfax, "
sub(".*(\\b[[:alpha:]]+\\s+[[:digit:]]{5}(-[[:digit:]]{4})?$)", "\\1", addresses)
# [1] "VA 22033" "VA 22033"
We can remove commas (gsub(",","",...)) and trim whitespace (trimws(...)) separately.
out <- data.frame(
X1 = sub("\\b[[:alpha:]]+\\s+[[:digit:]]{5}(-[[:digit:]]{4})?$", "", addresses),
X2 = sub(".*(\\b[[:alpha:]]+\\s+[[:digit:]]{5}(-[[:digit:]]{4})?$)", "\\1", addresses)
)
out[] <- lapply(out, function(x) trimws(gsub(",", "", x)))
out
# X1 X2
# 1 123 ABC street Lot 10 Fairfax VA 22033
# 2 123 ABC street Fairfax VA 22033
(Though one may argue for a more-careful removal of commas. shrug)
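For instance, one more conservative reading (a sketch, replacing the gsub() line above) would drop only the trailing comma that the split leaves behind in X1, keeping any interior commas intact:
out$X1 <- trimws(sub(",\\s*$", "", out$X1))  # strip only the dangling comma at the end
out$X2 <- trimws(out$X2)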
Assuming you just want to split the address before the final state and zip code, you may use sub as follows:
df$X1 <- sub(", [A-Z]{2} \\d{5}$", "", df$address)
df$X2 <- sub("^.*([A-Z]{2} \\d{5})$", "\\1", df$address)
df
X1 X2
1 123 ABC street Lot 10, Fairfax VA 22033
2 123 ABC street Fairfax VA 22033
Data:
df <- data.frame(address=c("123 ABC street Lot 10, Fairfax, VA 22033",
"123 ABC street Fairfax, VA 22033"), stringsAsFactors=FALSE)
I have a FULLNAME column that I want to split into 3 columns: LASTNAME, FIRSTNAME, MIDDLE_NAME_INITIAL.
Different cases are included in the example below. I think it is easier to look at my code than at my description.
df <- data.frame(FULLNAME = c("John, Smith J.",
"David, Cameron",
"Adam-Steve, Johnson M.",
"Antonio, Zang-Chi K",
"Joan Philippe, Luis Carlos",
"Dave, Jr., Danny Rock",
"Jake, Joan-Anberto",
"Annie, L.K Selena",
"Anna, P. Zhei"))
Output:
LASTNAME FIRSTNAME MIDDLE_NAME_INITIAL
1 John Smith J.
2 David Cameron
3 Adam-Steve Johnson M.
4 Antonio Zang-Chi K
5 Joan Philippe Luis Carlos
6 Dave, Jr. Danny Rock
7 Jake Joan-Anberto
8 Annie Selena L.K
9 Anna Zhei P.
I have googled around and found this here.
I tried different ways; one of them is pattern = "(.+),\\s*(.+)\\s+(.+)", but it failed to produce the expected output.
Any recommendation would be appreciated.
Requires PCRE-style regular expression support. So, yeah...
/
^ # start at the beginning of the string
(
\w+ # first name
(?:[- ]\w+)* # optional second part of first name
(?:,(?![^,]*$)\s[\w.]+)? # optional comma-separated addendum to 1st name
)
,\s # delimiting comma and space
(?= # assert existence of last name
.*? # bridge gap to last name (pre-initials)
(\w{2,}(?:-\w{2,})*) # (optionally multi-part) last name
)
(?= # assert existence of optional initials
(?:.*?\b(\w\.\w\b|\w\b\.?|(?<!-)\w+$))? # optional initials or middle name
)
/x # flag: enable free-spacing mode for expression
See demo.
I have no idea about R; this is just an example of how to capture the different name parts, so far as possible.
Edit: updated the expression to treat additional name parts like middle name initials.
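For what it's worth, here is a hedged sketch of how the pattern above could be applied from R, where PCRE features (the lookarounds and lookbehind) are available via perl = TRUE; it assumes a recent enough R in which regexec() accepts perl = TRUE, and it only shows how to run the pattern and pull out the three capture groups. Whether the groups line up with the desired columns for every row is best checked against the linked demo.
## the free-spacing expression above, collapsed to one line and with backslashes doubled
pat <- paste0(
  "^(\\w+(?:[- ]\\w+)*(?:,(?![^,]*$)\\s[\\w.]+)?),\\s",   # last name, delimiting comma and space
  "(?=.*?(\\w{2,}(?:-\\w{2,})*))",                        # lookahead: first name
  "(?=(?:.*?\\b(\\w\\.\\w\\b|\\w\\b\\.?|(?<!-)\\w+$))?)"  # lookahead: optional initials / middle name
)
full <- as.character(df$FULLNAME)   # df from the question; FULLNAME may be a factor
m <- regmatches(full, regexec(pat, full, perl = TRUE))
## element 1 of each match is the full match; elements 2-4 are the capture groups
parts <- do.call(rbind, lapply(m, function(g) g[2:4]))
colnames(parts) <- c("LASTNAME", "FIRSTNAME", "MIDDLE_NAME_INITIAL")
parts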
Because your data is not in a fixed column-wise order, I think there is too much conditional logic to try to capture all this in a maintainable regex in R. I'm not even sure how you could tell which names are first names and which are middle names when initials are not used since the ordering is inconsistent.
However, based on the rules implied by how you have manually parsed the names, here is some code that can replicate these rules:
extract_initials <- function(x)
{
  # split on spaces and drop empty tokens
  y <- lapply(strsplit(x, " "), function(z) z[nzchar(z)])
  sapply(y, function(z){
    if(length(z) == 1) return("")
    else if(!all(grepl("[a-z]", z)))
      # tokens with no lowercase letters are treated as initials (e.g. "J.", "L.K")
      return(paste(grep("[a-z]", z, invert = TRUE, value = TRUE), collapse = " "))
    else
      # otherwise the last word is taken as the middle name
      return(paste(z[length(z)], collapse = " "))
  })
}
extract_first <- function(x)
{
  y <- lapply(strsplit(x, " "), function(z) z[nzchar(z)])
  sapply(y, function(z){
    if(length(z) == 1) return(z)
    else if(!all(grepl("[a-z]", z)))
      # keep the tokens that do contain lowercase letters as the first name
      return(paste(grep("[a-z]", z, value = TRUE), collapse = " "))
    else
      # otherwise everything but the last word is the first name
      return(paste(z[-length(z)], collapse = " "))
  })
}
split_name <- function(x)
{
  # split each full name on its last comma only (the lookahead requires perl = TRUE)
  partlist <- strsplit(x, ",(?=[^,]*$)", perl = TRUE)
  surnames <- sapply(partlist, `[`, 1)
  forenames <- sapply(partlist, `[`, 2)
  data.frame(surname = surnames,
             first = extract_first(forenames),
             middle = extract_initials(forenames),
             stringsAsFactors = FALSE)
}
and it works as simply as this:
split_name(df$FULLNAME)
#> surname first middle
#> 1 John Smith J.
#> 2 David Cameron
#> 3 Adam-Steve Johnson M.
#> 4 Antonio Zang-Chi K
#> 5 Joan Philippe Luis Carlos
#> 6 Dave, Jr. Danny Rock
#> 7 Jake Joan-Anberto
#> 8 Annie Selena L.K
#> 9 Anna Zhei P.
Created on 2020-03-20 by the reprex package (v0.3.0)
Try this expression:
([\w\s.,-]+)(?:[^,]*,\s){1,}([\w.-]+)\s*([\w.-]*)
Here you can see how it works: https://regexr.com/50oef
I don't know the R language, so let me show an example using Java:
List<String> items = Arrays.asList(
        "John, Smith J.",
        "David, Cameron",
        "Adam-Steve, Johnson M.",
        "Antonio, Zang-Chi K",
        "Joan Philippe, Luis Carlos",
        "Dave, Jr., Danny Rock",
        "Jake, Joan-Anberto",
        "Annie, L.K Selena",
        "Anna, P. Zhei");
Pattern regex = Pattern.compile("([\\w\\s.,-]+)(?:[^,]*,\\s){1,}([\\w.-]+)\\s*([\\w.-]*)");
int k = 0;
for (String item : items) {
    Matcher m = regex.matcher(item);
    if (m.find()) {
        String group1 = m.group(1);
        String group2 = m.group(2);
        String group3 = m.group(3);
        boolean initialsInGroup2 = group2.contains(".");
        boolean initialsInGroup3 = group3.contains(".");
        System.out.println(++k
                + (!"".equals(group1) ? String.format("%15s", group1) : "")
                + (!"".equals(group2) ? String.format("%15s", initialsInGroup2 ? group3 : group2) : "")
                + (!"".equals(group3) ? String.format("%10s", initialsInGroup3 ? group3 : initialsInGroup2 ? group2 : group3) : ""));
    }
}
Output:
1 John Smith J.
2 David Cameron
3 Adam-Steve Johnson M.
4 Antonio Zang-Chi K
5 Joan Philippe Luis Carlos
6 Dave, Jr. Danny Rock
7 Jake Joan-Anberto
8 Annie Selena L.K
9 Anna Zhei P.
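A rough R translation of the same idea (the pattern plus the "."-based column swap from the Java loop) might look like the sketch below. It has only been checked against the sample data above, assumes the df from the question, and again assumes a regexec() that accepts perl = TRUE.
pat <- "([\\w\\s.,-]+)(?:[^,]*,\\s){1,}([\\w.-]+)\\s*([\\w.-]*)"
full <- as.character(df$FULLNAME)
m <- regmatches(full, regexec(pat, full, perl = TRUE))
out <- t(sapply(m, function(g) {
  g1 <- trimws(g[2]); g2 <- g[3]; g3 <- g[4]
  # if a group contains a ".", treat it as initials and swap, as in the Java code
  first  <- if (grepl(".", g2, fixed = TRUE)) g3 else g2
  middle <- if (grepl(".", g3, fixed = TRUE)) g3 else if (grepl(".", g2, fixed = TRUE)) g2 else g3
  c(LASTNAME = g1, FIRSTNAME = first, MIDDLE_NAME_INITIAL = middle)
}))
out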
I have some property sale data downloaded from Internet. It is a PDF file. When I copy and paste the data into a text file, it looks like this:
> a
[1] "Airport West 1/26 Cameron St 3 br t $830000 S Nelson Alexander" "Albert Park 106 Graham St 2 br h $0 SP RT Edgar"
Let's take the first line as an example. Every row is a record of a property, including suburb (Airport West), address (1/26 Cameron St), the count of bedrooms (3), property type (t), price ($830000), sale type (S). The last one (Nelson) is about the agent, which I do not need here.
I want to analyse this data, so I need to extract the information first. I hope I can get the data into a form like this (b is a data frame):
> b
Suburb Address Bedroom PropertyType Price SoldType
1 Airport West 1/26 Cameron St 3 t 830000 S
2 Albert Park 106 Graham St 2 h 0 SP
Could anyone please tell me how to use the stringr package or other methods to split the long string into the substrings that I need?
1) gsubfn::read.pattern: read.pattern in the gsubfn package takes a regular expression whose capture groups (the parts within parentheses) are taken to be the fields of the input, and assembles them into a data frame.
library(gsubfn)
pat <- "^(.*?) (\\d.*?) (\\d) br (.) [$](\\d+) (\\w+) .*"
cn <- c("Suburb", "Address", "Bedroom", "PropertyType", "Price", "SoldType")
read.pattern(text = a, pattern = pat, col.names = cn, as.is = TRUE)
giving this data.frame:
Suburb Address Bedroom PropertyType Price SoldType
1 Airport West 1/26 Cameron St 3 t 830000 S
2 Albert Park 106 Graham St 2 h 0 SP
2) no packages: This could also be done without any packages, like this (pat and cn are from above):
replacement <- "\\1,\\2,\\3,\\4,\\5,\\6"
read.table(text = sub(pat, replacement, a), col.names = cn, as.is = TRUE, sep = ",")
Note: The input a in reproducible form is:
a <- c("Airport West 1/26 Cameron St 3 br t $830000 S Nelson Alexander",
"Albert Park 106 Graham St 2 br h $0 SP RT Edgar")
I want to split a street address into street name and street number in r.
My input data has a column that reads for example
Street.Addresses
205 Cape Road
32 Albany Street
cnr Kempston/Durban Roads
I want to split the street number and street name into two separate columns, so that it reads:
Street Number    Street Name
205              Cape Road
32               Albany Street
                 cnr Kempston/Durban Roads
Is it in any way possible to split the numeric values from the non-numeric entries in a factor/string in R?
Thank you
you can try:
y <- lapply(strsplit(x, "(?<=\\d)\\b ", perl=T), function(x) if (length(x)<2) c("", x) else x)
y <- do.call(rbind, y)
colnames(y) <- c("Street Number", "Street Name")
hth
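Here x is assumed to hold the addresses from the question as a character vector, e.g.:
x <- c("205 Cape Road", "32 Albany Street", "cnr Kempston/Durban Roads")
## after running the three lines above, y should be a character matrix roughly like:
##      Street Number Street Name
## [1,] "205"         "Cape Road"
## [2,] "32"          "Albany Street"
## [3,] ""            "cnr Kempston/Durban Roads"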
I'm sure that someone is going to come along with a cool regex solution with lookaheads and so on, but this might work for you:
X <- c("205 Cape Road", "32 Albany Street", "cnr Kempston/Durban Roads")
nonum <- grepl("^[^0-9]", X)
X[nonum] <- paste0(" \t", X[nonum])
X[!nonum] <- gsub("(^[0-9]+ )(.*)", "\\1\t\\2", X[!nonum])
read.delim(text = X, header = FALSE)
# V1 V2
# 1 205 Cape Road
# 2 32 Albany Street
# 3 NA cnr Kempston/Durban Roads
Here is another way:
df <- data.frame(Street.Addresses = c("205 Cape Road", "32 Albany Street", "cnr Kempston/Durban Roads"),
                 stringsAsFactors = FALSE)
new_df <- data.frame("Street.Number" = character(),
                     "Street.Name" = character(),
                     stringsAsFactors = FALSE)
for (i in 1:nrow(df)) {
  new_df[i, "Street.Number"] <- unlist(strsplit(df[["Street.Addresses"]], " ")[i])[1]
  new_df[i, "Street.Name"] <- paste(unlist(strsplit(df[["Street.Addresses"]], " ")[i])[-1], collapse = " ")
}
> new_df
Street.Number Street.Name
1 205 Cape Road
2 32 Albany Street
3 cnr Kempston/Durban Roads
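For larger data, a vectorized sketch of the same split (no row-by-row loop), assuming the df defined above; has_no is just a helper name used here:
has_no <- grepl("^\\d+\\s", df$Street.Addresses)          # rows that start with a street number
data.frame(Street.Number = ifelse(has_no, sub("^(\\d+)\\s.*$", "\\1", df$Street.Addresses), ""),
           Street.Name   = sub("^\\d+\\s+", "", df$Street.Addresses),
           stringsAsFactors = FALSE)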
I have a CSV file like
AdvertiserName,CampaignName
Wells Fargo,Gary IN MetroChicago IL Metro
EMC,Los Angeles CA MetroBoston MA Metro
Apple,Cupertino CA Metro
Desired Output in R
AdvertiserName,City,State
Wells Fargo,Gary,IN
Wells Fargo,Chicago,IL
EMC,Los Angeles,CA
EMC,Boston,MA
Apple,Cupertino,CA
I have done it like this:
record <- read.csv("C:/Users/Administrator/Downloads/Campaignname.csv",header=TRUE)
ad <- record$AdvertiserName
camp <- record$CampaignName
read.table(text=gsub('Metro', '\n', c), col.names=c('City', 'State'))
It throws an error.
How to get the desired result?
Thanks in advance.
You can do this for example:
## read the csv file (replace the text = argument with file = and your file name)
xx <- read.table(text ='AdvertiserName,CampaignName
Wells Fargo,Gary INMetro Chicago IL Metro
EMC,Los Angeles CAMetro Boston MA Metro',sep=',',header=TRUE)
## use regular expression to create city and state variables
## rows are separated by ":"
## columns are separated by a comma ","
res <-
gsub('(.*) ([A-Z]{2})*Metro (.*) ([A-Z]{2}) .*','\\1,\\2:\\3,\\4',
xx$CampaignName)
## use strsplit to extract rows and columns
## (this is compact code!)
yy <-
Map(function(x,y)
cbind.data.frame(y,do.call(rbind,strsplit(x,','))),
strsplit(res,':'),xx$AdvertiserName)
## create the final data.frame and set names
res <- do.call(rbind,yy)
setNames(res, c('AdvertiserName','City','State'))
AdvertiserName City State
1 Wells Fargo Gary IN
2 Wells Fargo Chicago IL
3 EMC Los Angeles CA
4 EMC Boston MA
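A base-R alternative sketch, using the question's original data and assuming every campaign segment ends in "<City> <ST> Metro" as in the sample (the sample data is rebuilt inline so the sketch is self-contained):
record <- data.frame(AdvertiserName = c("Wells Fargo", "EMC", "Apple"),
                     CampaignName = c("Gary IN MetroChicago IL Metro",
                                      "Los Angeles CA MetroBoston MA Metro",
                                      "Cupertino CA Metro"),
                     stringsAsFactors = FALSE)
## split each campaign on the literal word "Metro", then peel off the trailing state code
segs <- strsplit(record$CampaignName, "Metro", fixed = TRUE)
out <- do.call(rbind, Map(function(adv, s) {
  s <- trimws(s)
  s <- s[nzchar(s)]
  data.frame(AdvertiserName = adv,
             City  = sub("\\s+[A-Z]{2}$", "", s),
             State = sub(".*\\s([A-Z]{2})$", "\\1", s),
             stringsAsFactors = FALSE)
}, record$AdvertiserName, segs))
out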