I have multiple columns of addresses that may contain duplicated information (but generally not exactly duplicated information). The following code provides an example of my issue:
id= c(1, 2)
add1 = c("21ST AVE", "5TH ST")
add2 = c("21ST AVE BLAH ST", "EAST BLAH BLVD")
df = data.frame(id, add1, add2)
df$combined = paste(add1, add2)
df
This gives the following result:
id add1 add2 combined
1 1 21ST AVE 21ST AVE BLAH ST 21ST AVE 21ST AVE BLAH ST
2 2 5TH ST EAST BLAH BLVD 5TH ST EAST BLAH BLVD
The result I need is the following:
id add1 add2 combined
1 1 21ST AVE 21ST AVE BLAH ST 21ST AVE BLAH ST
2 2 5TH ST EAST BLAH BLVD 5TH ST EAST BLAH BLVD
I wish to identify whether what's in add1 is contained in add2. If add2 already contains the information that add1 provides, then I either want to avoid combining those particular column values, or delete the repeated information in the combined column (which I believe would require solving a different issue: repeated phrases within a string). I have not been able to find an example of matching column values that are 'contained in' rather than 'exact', and I'm working with over 500K cases in a dataset where this issue is a common occurrence. Any help is appreciated.
We split the second and third columns on one or more spaces (\\s+), then paste the union of the corresponding rows with mapply to create the 'combined' column. (Note that union() keeps only unique tokens, so a word repeated within a single address would also be collapsed.)
lst <- lapply(df[2:3], function(x) strsplit(as.character(x), "\\s+"))
df$combined <- mapply(function(x,y) paste(union(x, y), collapse=" "), lst$add1, lst$add2)
df$combined
#[1] "21ST AVE BLAH ST" "5TH ST EAST BLAH BLVD"
Or another option is gsub, collapsing a repeated multi-word phrase with a backreference:
gsub("((\\w+\\s*){2,})\\1", "\\1", do.call(paste, df[2:3]))
#[1] "21ST AVE BLAH ST" "5TH ST EAST BLAH BLVD"
Here's one way to accomplish this, where the ifelse tests whether add1 is in add2; if so, it keeps add2 alone, otherwise it combines the two:
id= c(1, 2)
add1 = c("21ST AVE", "5TH ST")
add2 = c("21ST AVE BLAH ST", "EAST BLAH BLVD")
df = data.frame(id, add1, add2, stringsAsFactors = FALSE)
require(stringr)
require(dplyr)
df %>% mutate(combined = ifelse(str_detect(add2, add1),
                                add2,
                                str_c(add1, add2, sep = " ")))
Output:
id add1 add2 combined
1 1 21ST AVE 21ST AVE BLAH ST 21ST AVE BLAH ST
2 2 5TH ST EAST BLAH BLVD 5TH ST EAST BLAH BLVD
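One caveat worth noting: str_detect() interprets add1 as a regular expression, so an address containing metacharacters such as ( or + could error or mismatch across 500K rows. Wrapping the pattern in stringr's fixed() forces a literal substring test; a minimal sketch on the same data:
df %>% mutate(combined = ifelse(str_detect(add2, fixed(add1)),
                                add2,
                                str_c(add1, add2, sep = " ")))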
I have a csv file that I have read in, and I now need to combine every two rows. There are 2000 rows in total, which I need to reduce to 1000 rows. Every two rows share the same account number in one column and have the address split across the two rows in another. Two rows are taken up for each observation, and I want to combine the two address rows into one. For example, rows 1 and 2 are Acct# 1234 and have 123 Hollywood Blvd and LA California 90028 on their own lines, respectively.
Using the tidyverse, you can group_by the Acct number and summarise with str_c:
library(tidyverse)
df %>%
group_by(Acct) %>%
summarise(Address = str_c(Address, collapse = " "))
# A tibble: 2 × 2
Acct Address
<dbl> <chr>
1 1234 123 Hollywood Blvd LA California 90028
2 4321 55 Park Avenue NY New York State 6666
Data:
df <- data.frame(
Acct = c(1234, 1234, 4321, 4321),
Address = c("123 Hollywood Blvd", "LA California 90028",
"55 Park Avenue", "NY New York State 6666")
)
It can be fairly simple with the data.table package:
# assuming `dataset` is the name of your dataset, the column with the account number is called 'actN' and the column with the address is 'adr'
library(data.table)
dataset2 <- data.table(dataset)[, .(whole = paste0(adr, collapse = ", ")), by = .(actN)]
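For instance, a usage sketch on the df defined above (with its Acct/Address column names in place of the actN/adr placeholders):
library(data.table)
data.table(df)[, .(Address = paste(Address, collapse = " ")), by = Acct]
#    Acct                                Address
# 1: 1234 123 Hollywood Blvd LA California 90028
# 2: 4321  55 Park Avenue NY New York State 6666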
I have a character column that needs to be separated with regex. Here is an example of the raw data:
data_raw <- tribble(
~census_geo,
"Division No. 1, Subd. V (SNO), Newfoundland and Labrador",
"Portugal Cove South (T), Newfoundland and Labrador",
"Division No. 1, Subd. U, Reserve (SNO), Newfoundland and Labrador")
We have three columns to extract. The first is everything before the brackets, the second is the word inside the brackets, and the last is everything after the final comma (equivalently, everything after the bracketed word). Here is an example of what the clean output would look like:
data_clean <- tribble(
~csd_name, ~csd_type, ~province,
"Division No. 1, Subd. V", "SNO", "Newfoundland and Labrador",
"Portugal Cove South", "T", "Ontario",
"Division No. 1, Subd. U, Reserve", "SNO", "Newfoundland and Labrador")
I can extract the last column with this code:
data_raw %>%
mutate(csd_type = str_extract(census_geo, pattern = "(?<=\\().*(?=\\))"))
But I can't get the other two columns.
Any help would be greatly appreciated.
You can use tidyr's extract and pass a regular expression to capture the relevant text into different columns.
tidyr::extract(data_raw, census_geo, c('csd_name', 'csd_type', 'province'),
'(.*) \\((.*)\\),\\s*(.*)')
# csd_name csd_type province
# <chr> <chr> <chr>
#1 Division No. 1, Subd. V SNO Newfoundland and Labrador
#2 Portugal Cove South T Newfoundland and Labrador
#3 Division No. 1, Subd. U, Reserve SNO Newfoundland and Labrador
You can achieve the same result in base R with strcapture:
strcapture('(.*) \\((.*)\\),\\s*(.*)', data_raw$census_geo,
proto = list(csd_name = character(), csd_type = character(),
province = character()))
I know you already selected Ronak Shah's answer (which was very nice, btw), but I wanted to show an approach with tidyr's separate:
library(tidyr)
data_raw %>%
separate(
col = census_geo,
into = c('csd_name', 'csd_type', 'province'),
sep = '(\\s\\(|\\),\\s)'
)
The \\s matches the white space, the \\( the parenthesis, and the | separates the two distinct delimiter patterns to split on.
Just in case OP is interested to see how the original approach with str_extract would work for all three separate columns, using negative character classes [^)(] and [^,]:
data_raw %>%
mutate(
csd_name = str_extract(census_geo, "^[^)(]+(?=\\s)"),
csd_type = str_extract(census_geo, "(?<=\\()[^)(]+(?=\\))"),
csd_province = str_extract(census_geo, "(?<=,\\s)[^,]+$")) %>%
select(-census_geo)
# A tibble: 3 x 3
csd_name csd_type csd_province
<chr> <chr> <chr>
1 Division No. 1, Subd. V SNO Newfoundland and Labrador
2 Portugal Cove South T Newfoundland and Labrador
3 Division No. 1, Subd. U, Reserve SNO Newfoundland and Labrador
I have some property sale data downloaded from the Internet. It is a PDF file. When I copy and paste the data into a text file, it looks like this:
> a
[1] "Airport West 1/26 Cameron St 3 br t $830000 S Nelson Alexander" "Albert Park 106 Graham St 2 br h $0 SP RT Edgar"
Let's take the first line as an example. Every row is a record of a property, including suburb (Airport West), address (1/26 Cameron St), number of bedrooms (3), property type (t), price ($830000), and sale type (S). The last part (Nelson Alexander) is the agent, which I do not need here.
I want to analyse this data, so I need to extract the information first. I hope I can get the data like this (b is a data frame):
> b
Suburb Address Bedroom PropertyType Price SoldType
1 Airport West 1/26 Cameron St 3 t 830000 S
2 Albert Park 106 Graham St 2 h 0 SP
Could anyone please tell me how to use stringr package or other methods to split the long string into the sub strings that I need?
1) gsubfn::read.pattern: read.pattern in the gsubfn package takes a regular expression whose capture groups (the parts within parentheses) are taken to be the fields of the input, and assembles them into a data frame.
library(gsubfn)
pat <- "^(.*?) (\\d.*?) (\\d) br (.) [$](\\d+) (\\w+) .*"
cn <- c("Suburb", "Address", "Bedroom", "PropertyType", "Price", "SoldType")
read.pattern(text = a, pattern = pat, col.names = cn, as.is = TRUE)
giving this data.frame:
Suburb Address Bedroom PropertyType Price SoldType
1 Airport West 1/26 Cameron St 3 t 830000 S
2 Albert Park 106 Graham St 2 h 0 SP
2) No packages: this could also be done without any packages, like this (pat and cn are from above):
replacement <- "\\1,\\2,\\3,\\4,\\5,\\6"
read.table(text = sub(pat, replacement, a), col.names = cn, as.is = TRUE, sep = ",")
Note: The input a in reproducible form is:
a <- c("Airport West 1/26 Cameron St 3 br t $830000 S Nelson Alexander",
"Albert Park 106 Graham St 2 br h $0 SP RT Edgar")
I built all the Google Maps API requests, but when I try to loop to call and parse them, I get an error. If I don't use a loop and do it manually, it works.
a <- c("1780 N Washington Ave Scranton PA 18509", "1858 Hunt Ave Bronx NY 10462", "140 N Warren St Trenton NJ 08608-1308")
#API Key need to be added to run:
w <- c("https://maps.googleapis.com/maps/api/distancematrix/json?units=imperial&origins=19+East+34th+Street+New York+NY+10016&destinations=1780+N+Washington+Ave+Scranton+PA+18509&mode=transit&language=fr-FR&key=API_KEY_HERE",
"https://maps.googleapis.com/maps/api/distancematrix/json?units=imperial&origins=19+East+34th+Street+New York+NY+10016&destinations=1858+Hunt+Ave+Bronx+NY+10462&mode=transit&language=fr-FR&key=API_KEY_HERE",
"https://maps.googleapis.com/maps/api/distancematrix/json?units=imperial&origins=19+East+34th+Street+New York+NY+10016&destinations=140+N+Warren+St+Trenton+NJ+08608-1308&mode=transit&language=fr-FR&key=API_KEY_HERE")
df <- data.frame(a,w)
for (i in cpghq) {
  url <- df$w
  testdf <- jsonlite::fromJSON(url, simplifyDataFrame = TRUE)
  list <- unlist(testdf$rows)
  transit_time <- as.data.frame(t(as.data.frame(list)))
  cpghq$transit_time <- transit_time
}
The error I get is:
Error: lexical error: invalid char in json text.
https://maps.googleapis.com/map
(right here) ------^
My API call was wrong because "New York" has a space. I fixed it using gsub("[[:space:]]", "+", a), but utils::URLencode() would also have worked.
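For reference, a minimal illustration of the utils::URLencode() alternative; it percent-encodes the spaces that made the original URL invalid:
utils::URLencode("19 East 34th Street New York NY 10016")
#[1] "19%20East%2034th%20Street%20New%20York%20NY%2010016"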
Build the API call
a <- c("1780 N Washington Ave Scranton PA 18509", "1858 Hunt Ave Bronx NY 10462", "140 N Warren St Trenton NJ 08608-1308")
fix_address <- gsub("[[:space:]]", "+", a)
key <- "YOUR_GOOGLE_API_KEY_HERE"
travel_mode <- "transit"
root <- "https://maps.googleapis.com/maps/api/distancematrix/json
units=imperial&origins="
api_call <- paste0(root,"350+5th+Ave+New+York+NY+10118",
"&destinations=",
fix_address,
"&mode=",
travel_mode,
"&language=en-EN",
"&key=", key)
My problem with the loop was very simple: I wasn't using lapply().
Now fetch the JSON returns with RCurl::getURL and parse them with RJSONIO::fromJSON:
require("RJSONIO")
# Get JSON returns from Google
doc <- lapply(api_call, RCurl::getURL)
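The snippet above only fetches the raw JSON strings; a minimal sketch of the parsing step follows, assuming the documented rows/elements nesting of a Distance Matrix response (field access may need adjusting depending on how RJSONIO simplifies the nested objects):
parsed <- lapply(doc, fromJSON)
# Pull the transit duration out of each parsed response
transit_time <- sapply(parsed, function(x) x$rows[[1]]$elements[[1]]$duration$text)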
As pointed out in my other answer to you, you can also use my googleway package to do the work for you.
library(googleway)
key <- "your_api_key"
a <- c("1780 N Washington Ave Scranton PA 18509",
"1858 Hunt Ave Bronx NY 10462",
"140 N Warren St Trenton NJ 08608-1308")
google_distance(origins = "350 5th Ave New York NY 10188",
destinations = as.list(a),
mode = "transit",
key = key,
simplify = TRUE)
# $destination_addresses
# [1] "1780 N Washington Ave, Scranton, PA 18509, USA" "1858 Hunt Ave, Bronx, NY 10462, USA"
# [3] "140 N Warren St, Trenton, NJ 08608, USA"
#
# $origin_addresses
# [1] "Empire State Building, 350 5th Ave, New York, NY 10118, USA"
#
# $rows
# elements
# 1 ZERO_RESULTS, OK, OK, NA, 19.0 km, 95.8 km, NA, 18954, 95773, NA, 54 mins, 1 hour 44 mins, NA, 3242, 6260
#
# $status
# [1] "OK"
I need to read the txt file at
https://raw.githubusercontent.com/fonnesbeck/Bios6301/master/datasets/addr.txt
and convert it into a data frame in R with the columns: LastName, FirstName, streetno, streetname, city, state, and zip.
I tried to use a sep argument to separate the fields, but failed.
Expanding on my comments, here's another approach. You may need to tweak some of the code if your full data set has a wider range of patterns to account for.
library(stringr) # For str_trim
# Read string data and split into data frame
dat = readLines("addr.txt")
dat = as.data.frame(do.call(rbind, strsplit(dat, split=" {2,10}")), stringsAsFactors=FALSE)
names(dat) = c("LastName", "FirstName", "address", "city", "state", "zip")
# Separate address into number and street (if streetno isn't always numeric,
# or if you don't want it to be numeric, then just remove the as.numeric wrapper).
dat$streetno = as.numeric(gsub("([0-9]{1,4}).*","\\1", dat$address))
dat$streetname = gsub("[0-9]{1,4} (.*)","\\1", dat$address)
# Clean up zip
dat$zip = gsub("O","0", dat$zip)
dat$zip = str_trim(dat$zip)
dat = dat[,c(1:2,7:8,4:6)]
dat
LastName FirstName streetno streetname city state zip
1 Bania Thomas M. 725 Commonwealth Ave. Boston MA 02215
2 Barnaby David 373 W. Geneva St. Wms. Bay WI 53191
3 Bausch Judy 373 W. Geneva St. Wms. Bay WI 53191
...
41 Wright Greg 791 Holmdel-Keyport Rd. Holmdel NY 07733-1988
42 Zingale Michael 5640 S. Ellis Ave. Chicago IL 60637
Try this.
x<-scan("https://raw.githubusercontent.com/fonnesbeck/Bios6301/master/datasets/addr.txt" ,
what = list(LastName="", FirstName="", streetno="", streetname="", city="", state="",zip=""))
data<-as.data.frame(x)
I found it easiest to fix up the file into a csv by adding the commas where they belong, then read it.
## get the page as text
txt <- RCurl::getURL(
"https://raw.githubusercontent.com/fonnesbeck/Bios6301/master/datasets/addr.txt"
)
## fix the EOL (end-of-line) markers
g1 <- gsub(" \n", "\n", txt, fixed = TRUE)
## read it
df <- read.csv(
## add most comma-separators, then the last for the house number
text = gsub("(\\d+) (\\D+)", "\\1,\\2", gsub("\\s{2,}", ",", g1)),
header = FALSE,
## set the column names
col.names = c("LastName", "FirstName", "streetno", "streetname", "city", "state", "zip")
)
## result
head(df)
# LastName FirstName streetno streetname city state zip
# 1 Bania Thomas M. 725 Commonwealth Ave. Boston MA O2215
# 2 Barnaby David 373 W. Geneva St. Wms. Bay WI 53191
# 3 Bausch Judy 373 W. Geneva St. Wms. Bay WI 53191
# 4 Bolatto Alberto 725 Commonwealth Ave. Boston MA O2215
# 5 Carlstrom John 933 E. 56th St. Chicago IL 60637
# 6 Chamberlin Richard A. 111 Nowelo St. Hilo HI 96720
Here your problem is not how to use R to read in this data; rather, it's that your data is not sufficiently structured, with no regular delimiters between the variable-length fields. In addition, the zip code field contains some alpha "O" characters that should be "0".
So here is a way to use regular expression substitution to add in delimiters, and then parse the delimited text using read.csv(). Note that depending on exceptions in your full set of text, you may need to adjust the regular expressions. I have done them step by step here to make it clear what is being done, and so that you can adjust them as you find exceptions in your input text. (For instance, some city names, like "Wms. Bay", are two words.)
addr.txt <- readLines("https://raw.githubusercontent.com/fonnesbeck/Bios6301/master/datasets/addr.txt")
addr.txt <- gsub("\\s+O(\\d{4})", " 0\\1", addr.txt) # replace O with 0 in zip
addr.txt <- gsub("(\\s+)([A-Z]{2})", ", \\2", addr.txt) # state
addr.txt <- gsub("\\s+(\\d{5}(\\-\\d{4}){0,1})\\s*", ", \\1", addr.txt) # zip
addr.txt <- gsub("\\s+(\\d{1,4})\\s", ", \\1, ", addr.txt) # streetno
addr.txt <- gsub("(^\\w*)(\\s+)", "\\1, ", addr.txt) # LastName (FirstName)
addr.txt <- gsub("\\s{2,}", ", ", addr.txt) # city, by elimination
addr <- read.csv(textConnection(addr.txt), header = FALSE,
col.names = c("LastName", "FirstName", "streetno", "streetname", "city", "state", "zip"),
stringsAsFactors = FALSE)
head(addr)
## LastName FirstName streetno streetname city state zip
## 1 Bania Thomas M. 725 Commonwealth Ave. Boston MA 02215
## 2 Barnaby David 373 W. Geneva St. Wms. Bay WI 53191
## 3 Bausch Judy 373 W. Geneva St. Wms. Bay WI 53191
## 4 Bolatto Alberto 725 Commonwealth Ave. Boston MA 02215
## 5 Carlstrom John 933 E. 56th St. Chicago IL 60637
## 6 Chamberlin Richard A. 111 Nowelo St. Hilo HI 96720