Identify vector elements that contain ONLY certain strings in R

I have a long list of addresses. Some of them contain only CA, or USA, or both.
I need to convert those entries to NA and leave the others intact.
An example, I have the vector like below:
loc = c('CA, USA',
'USA',
'2 main st CA',
'35 1st ave CA, USA',
'CA')
What I need is:
loc = c( NA, NA, '2 main st CA',
'35 1st ave CA, USA', NA)
This is just an example. The actual list is very long.
Thanks a lot in advance.

nchar counts the characters in each element of a character vector, so the short "CA"/"USA"-only entries can be filtered by length.
ifelse(nchar(string) > 7, string, NA) #to account for spaces
string<-c('CA, USA',
'USA',
'2 main st CA',
'35 1st ave CA, USA',
'CA')
string
[1] "CA, USA" "USA" "2 main st CA"
[4] "35 1st ave CA, USA" "CA"
ifelse(nchar(string) > 7, string, NA)
[1] NA NA "2 main st CA"
[4] "35 1st ave CA, USA" NA
Or you can collapse all strings using:
st <- gsub(" ", "", gsub(",", "", string))
st
[1] "CAUSA" "USA" "2mainstCA" "351staveCAUSA"
[5] "CA"
replace(string, nchar(st) < 6, NA)
[1] NA NA "2 main st CA"
[4] "35 1st ave CA, USA" NA
Or, if you know your criteria exactly, match them directly on the collapsed strings:
ifelse(grepl("^(USA|CA|USACA|CAUSA)$", st), NA, string)
[1] NA NA "2 main st CA"
[4] "35 1st ave CA, USA" NA

If the pattern you want to retain always starts with a number, then you can use this
> loc[grep("^\\d", loc, invert = T)] <- NA
> loc
[1] NA NA "2 main st CA" "35 1st ave CA, USA" NA
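If the only "empty" values are CA/USA tokens, a single anchored regex on the original strings avoids the length heuristic and the collapsing step entirely. A sketch on the sample data (the pattern assumes the tokens are separated by a comma and optional spaces):

```r
# Match strings made up only of "CA"/"USA", optionally joined by ", "
loc <- c('CA, USA', 'USA', '2 main st CA', '35 1st ave CA, USA', 'CA')
loc[grepl("^(CA|USA)(,\\s*(CA|USA))?$", loc)] <- NA
loc
# [1] NA  NA  "2 main st CA"  "35 1st ave CA, USA"  NA
```

Anchoring with ^ and $ is what keeps "2 main st CA" safe: the pattern must account for the whole string, not just contain the tokens.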

Related

Use R and regex to keep only desired comma in address string

I would like to split a list of address strings into two columns, splitting between City and State.
For example, say I have two address strings:
addr1 <- "123 ABC street Lot 10, Fairfax, VA 22033"
addr2 <- "123 ABC street Fairfax, VA 22033"
How would I use regex in R to remove the 'unexpected' comma between Lot 10 and Fairfax, so that the only comma remaining in any given address string is the comma separating City and State?
My desired result is a data frame with the address string split into two columns on that comma.
There are two ways to expand on Tim's answer:
Zip+4 zip codes (US only?); and
a "state" that is not 2 letters ... really, just looking for the word boundary instead of hard-coding "2 letters" (not sure if/when this is a factor ... does anybody write a non-2-letter state?)
addresses <- c("123 ABC street Lot 10, Fairfax, VA 22033", "123 ABC street Fairfax, VA 22033")
sub("\\b[[:alpha:]]+\\s+[[:digit:]]{5}(-[[:digit:]]{4})?$", "", addresses)
# [1] "123 ABC street Lot 10, Fairfax, " "123 ABC street Fairfax, "
sub(".*(\\b[[:alpha:]]+\\s+[[:digit:]]{5}(-[[:digit:]]{4})?$)", "\\1", addresses)
# [1] "VA 22033" "VA 22033"
We can remove commas (gsub(",","",...)) and trim whitespace (trimws(...)) separately.
out <- data.frame(
X1 = sub("\\b[[:alpha:]]+\\s+[[:digit:]]{5}(-[[:digit:]]{4})?$", "", addresses),
X2 = sub(".*(\\b[[:alpha:]]+\\s+[[:digit:]]{5}(-[[:digit:]]{4})?$)", "\\1", addresses)
)
out[] <- lapply(out, function(x) trimws(gsub(",", "", x)))
out
# X1 X2
# 1 123 ABC street Lot 10 Fairfax VA 22033
# 2 123 ABC street Fairfax VA 22033
(Though one may argue for a more-careful removal of commas. shrug)
Assuming you just want to split the address before the final state and zip code, you may use sub as follows:
df$X1 <- sub(", [A-Z]{2} \\d{5}$", "", df$address)
df$X2 <- sub("^.*([A-Z]{2} \\d{5})$", "\\1", df$address)
df
X1 X2
1 123 ABC street Lot 10, Fairfax VA 22033
2 123 ABC street Fairfax VA 22033
Data:
df <- data.frame(address=c("123 ABC street Lot 10, Fairfax, VA 22033",
"123 ABC street Fairfax, VA 22033"), stringsAsFactors=FALSE)

Using unite and str_to_title: NAs united as strings go undetected as NAs in R

I have NAs in two columns that were united. Before uniting, I used str_to_title to create uniformity in the values.
The issue is that now the NAs are not registered as NAs; they've been united as strings, e.g.:
City State City, State
Denver CO Denver, CO
NA NA NA, NA
Los Angeles CA Los Angeles, CA
I tried hard coding it using df[df$col == "NA, NA"] <- NA and that didn't work.
We can create an index and then update
library(stringr)
i1 <- !(is.na(df$City) & is.na(df$State))
df$City_State[i1] <- with(df[i1,], paste(City, State, sep=', '))
If we use str_c, then it would return NA if there is any NA
with(df, str_c(City, State, sep=", "))
#[1] "Denver, CO" NA "Los Angeles, CA"
data
df <- structure(list(City = c("Denver", NA, "Los Angeles"),
State = c("CO",
NA, "CA")), class = "data.frame", row.names = c(NA, -3L))
We can use tidyr::unite with na.rm = TRUE to ignore NA values while pasting.
tidyr::unite(df, CityState, City, State, na.rm = TRUE, remove = FALSE, sep = ",")
# CityState City State
#1 Denver,CO Denver CO
#2 <NA> <NA>
#3 Los Angeles,CA Los Angeles CA
Make sure that City and State are of type characters and not factors when using unite.
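If the columns have already been united, the literal "NA, NA" strings can also be repaired after the fact. The asker's attempt failed because it indexed the whole data frame rather than the column; a minimal sketch of the corrected assignment (column name assumed from the question):

```r
# "NA" tokens were pasted as text during unite; match and replace in the column
df <- data.frame(City_State = c("Denver, CO", "NA, NA", "Los Angeles, CA"),
                 stringsAsFactors = FALSE)
df$City_State[df$City_State == "NA, NA"] <- NA
df$City_State
# [1] "Denver, CO"      NA                "Los Angeles, CA"
```

The difference from df[df$col == "NA, NA"] <- NA is that the subset on the left must be the vector df$City_State, not the data frame.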

Find column value contained in another column R

I have multiple columns of addresses, where they may contain duplicated information (but generally will not have exactly duplicated information).
The following code will provide an example of my issue,
id= c(1, 2)
add1 = c("21ST AVE", "5TH ST")
add2 = c("21ST AVE BLAH ST", "EAST BLAH BLVD")
df = data.frame(id, add1, add2)
df$combined = paste(add1, add2)
df
This gives the following result,
id add1 add2 combined
1 1 21ST AVE 21ST AVE BLAH ST 21ST AVE 21ST AVE BLAH ST
2 2 5TH ST EAST BLAH BLVD 5TH ST EAST BLAH BLVD
The conclusion I need is the following,
id add1 add2 combined
1 1 21ST AVE 21ST AVE BLAH ST 21ST AVE BLAH ST
2 2 5TH ST EAST BLAH BLVD 5TH ST EAST BLAH BLVD
I wish to identify if what's in add1 is contained in add2. If I find that add2 contains the same information that add1 provides, then I either want to avoid combining those particular column values or delete the repeated information in the combined column (which I believe would require solving a different issue of repeated phrases in a string). I have not been able to find an example of finding column values that are 'contained in' rather than 'exact' - and I'm working with over 500K cases in a dataset where this issue is a common occurrence. Any help is appreciated.
We split the second and third column by one or more space (\\s+), then paste the union of the corresponding rows with mapply to create the 'combined'
lst <- lapply(df[2:3], function(x) strsplit(as.character(x), "\\s+"))
df$combined <- mapply(function(x,y) paste(union(x, y), collapse=" "), lst$add1, lst$add2)
df$combined
#[1] "21ST AVE BLAH ST" "5TH ST EAST BLAH BLVD"
Or another option is gsub
gsub("((\\w+\\s*){2,})\\1", "\\1", do.call(paste, df[2:3]))
#[1] "21ST AVE BLAH ST" "5TH ST EAST BLAH BLVD"
Here's one way to accomplish this, where the ifelse tests whether add1 is in add2; if so, it keeps add2 alone, otherwise it combines them:
id= c(1, 2)
add1 = c("21ST AVE", "5TH ST")
add2 = c("21ST AVE BLAH ST", "EAST BLAH BLVD")
df = data.frame(id, add1, add2, stringsAsFactors = F)
require(stringr)
require(dplyr)
df %>% mutate(combined = ifelse(str_detect(add2, fixed(add1)),
add2,
str_c(add1, add2, sep = " ")))
Output:
id add1 add2 combined
1 1 21ST AVE 21ST AVE BLAH ST 21ST AVE BLAH ST
2 2 5TH ST EAST BLAH BLVD 5TH ST EAST BLAH BLVD
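For 500K rows with no extra packages, a base-R sketch of the same idea: fixed-string matching sidesteps regex metacharacters that can appear in addresses (periods, parentheses), and mapply pairs each add1 with its own add2:

```r
df <- data.frame(id = c(1, 2),
                 add1 = c("21ST AVE", "5TH ST"),
                 add2 = c("21ST AVE BLAH ST", "EAST BLAH BLVD"),
                 stringsAsFactors = FALSE)
# Row-wise: is add1 a literal substring of add2?
dup <- mapply(grepl, df$add1, df$add2, MoreArgs = list(fixed = TRUE))
df$combined <- unname(ifelse(dup, df$add2, paste(df$add1, df$add2)))
df$combined
# [1] "21ST AVE BLAH ST"      "5TH ST EAST BLAH BLVD"
```

fixed = TRUE also makes the comparison faster than regex matching, which matters at this scale.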

R: JSONlite Loop Issue

I got all the "Google Maps API requests" in a row, but when I try to loop to call and parse them, I get an error. If I don't use a loop and do it manually, it works.
a <- c("1780 N Washington Ave Scranton PA 18509", "1858 Hunt Ave Bronx NY 10462", "140 N Warren St Trenton NJ 08608-1308")
#API Key need to be added to run:
w <- c("https://maps.googleapis.com/maps/api/distancematrix/json?units=imperial&origins=19+East+34th+Street+New York+NY+10016&destinations=1780+N+Washington+Ave+Scranton+PA+18509&mode=transit&language=fr-FR&key=API_KEY_HERE",
"https://maps.googleapis.com/maps/api/distancematrix/json?units=imperial&origins=19+East+34th+Street+New York+NY+10016&destinations=1858+Hunt+Ave+Bronx+NY+10462&mode=transit&language=fr-FR&key=API_KEY_HERE",
"https://maps.googleapis.com/maps/api/distancematrix/json?units=imperial&origins=19+East+34th+Street+New York+NY+10016&destinations=140+N+Warren+St+Trenton+NJ+08608-1308&mode=transit&language=fr-FR&key=API_KEY_HERE")
df <- data.frame(a,w)
for (i in cpghq) {
url <- df$w
testdf <- jsonlite::fromJSON(url, simplifyDataFrame = TRUE)
list <- unlist(testdf$rows)
transit_time <- as.data.frame(t(as.data.frame(list)))
cpghq$transit_time <- transit_time
}
The error I get is:
Error: lexical error: invalid char in json text.
https://maps.googleapis.com/map
(right here) ------^
My API call was wrong because "New York" has a space. I fixed it using gsub("[[:space:]]", "+", a), but utils::URLencode() would also have worked.
Build the API call
a <- c("1780 N Washington Ave Scranton PA 18509", "1858 Hunt Ave Bronx NY 10462", "140 N Warren St Trenton NJ 08608-1308")
fix_address <- gsub("[[:space:]]", "+", a)
key <- "YOUR_GOOGLE_API_KEY_HERE"
travel_mode <- "transit"
root <- "https://maps.googleapis.com/maps/api/distancematrix/json?units=imperial&origins="
api_call <- paste0(root,"350+5th+Ave+New+York+NY+10118",
"&destinations=",
fix_address,
"&mode=",
travel_mode,
"&language=en-EN",
"&key=", key)
My problem with the loop was very simple: I wasn't using lapply().
Now send the calls with RCurl::getURL and parse the returns with RJSONIO::fromJSON:
require("RJSONIO")
# Get json returns from Google
doc <- lapply(api_call, RCurl::getURL)
As pointed out in my other answer to you, you can also use my googleway package to do the work for you.
library(googleway)
key <- "your_api_key"
a <- c("1780 N Washington Ave Scranton PA 18509",
"1858 Hunt Ave Bronx NY 10462",
"140 N Warren St Trenton NJ 08608-1308")
google_distance(origins = "350 5th Ave New York NY 10188",
destinations = as.list(a),
mode = "transit",
key = key,
simplify = T)
# $destination_addresses
# [1] "1780 N Washington Ave, Scranton, PA 18509, USA" "1858 Hunt Ave, Bronx, NY 10462, USA"
# [3] "140 N Warren St, Trenton, NJ 08608, USA"
#
# $origin_addresses
# [1] "Empire State Building, 350 5th Ave, New York, NY 10118, USA"
#
# $rows
# elements
# 1 ZERO_RESULTS, OK, OK, NA, 19.0 km, 95.8 km, NA, 18954, 95773, NA, 54 mins, 1 hour 44 mins, NA, 3242, 6260
#
# $status
# [1] "OK"
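As a sketch of the URLencode route mentioned above: build each URL with utils::URLencode(reserved = TRUE) so spaces become %20, then loop with lapply. The actual fetch-and-parse line is left commented out since it needs a live API key; the query parameters mirror the question:

```r
a <- c("1780 N Washington Ave Scranton PA 18509",
       "1858 Hunt Ave Bronx NY 10462")
origin <- utils::URLencode("19 East 34th Street New York NY 10016", reserved = TRUE)
dests  <- vapply(a, utils::URLencode, character(1), reserved = TRUE)
# paste0 is vectorized over dests, giving one URL per destination
urls <- paste0("https://maps.googleapis.com/maps/api/distancematrix/json",
               "?units=imperial&origins=", origin,
               "&destinations=", dests,
               "&mode=transit&key=API_KEY_HERE")
# results <- lapply(urls, jsonlite::fromJSON)  # one request + parse per URL
```

The lexical error in the question came from the raw space in "New York": fromJSON saw an invalid URL, failed to fetch, and then tried to parse the URL string itself as JSON.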

Split street address into street number and street name in r

I want to split a street address into street name and street number in r.
My input data has a column that reads for example
Street.Addresses
205 Cape Road
32 Albany Street
cnr Kempston/Durban Roads
I want to split the street number and street name into two separate columns, so that it reads:
Street Number Street Name
205 Cape Road
32 Albany Street
cnr Kempston/Durban Roads
Is it at all possible to split the numeric values from the non-numeric entries in a factor/string in R?
Thank you
You can try (using the sample addresses as x):
x <- c("205 Cape Road", "32 Albany Street", "cnr Kempston/Durban Roads")
y <- lapply(strsplit(x, "(?<=\\d)\\b ", perl = TRUE), function(x) if (length(x) < 2) c("", x) else x)
y <- do.call(rbind, y)
colnames(y) <- c("Street Number", "Street Name")
hth
I'm sure that someone is going to come along with a cool regex solution with lookaheads and so on, but this might work for you:
X <- c("205 Cape Road", "32 Albany Street", "cnr Kempston/Durban Roads")
nonum <- grepl("^[^0-9]", X)
X[nonum] <- paste0(" \t", X[nonum])
X[!nonum] <- gsub("(^[0-9]+ )(.*)", "\\1\t\\2", X[!nonum])
read.delim(text = X, header = FALSE)
# V1 V2
# 1 205 Cape Road
# 2 32 Albany Street
# 3 NA cnr Kempston/Durban Roads
Here is another way:
df <- data.frame(Street.Addresses = c("205 Cape Road", "32 Albany Street", "cnr Kempston/Durban Roads"),
stringsAsFactors = FALSE)
new_df <- data.frame(Street.Number = character(),
Street.Name = character(),
stringsAsFactors = FALSE)
for (i in 1:nrow(df)) {
new_df[i, "Street.Number"] <- unlist(strsplit(df[["Street.Addresses"]], " ")[i])[1]
new_df[i, "Street.Name"] <- paste(unlist(strsplit(df[["Street.Addresses"]], " ")[i])[-1], collapse = " ")
}
> new_df
Street.Number Street.Name
1 205 Cape Road
2 32 Albany Street
3 cnr Kempston/Durban Roads
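A vectorized sub()-based sketch that avoids the row-by-row loop: capture a leading run of digits as the number, strip it to get the name, and leave NA where the address has no leading number (matching the desired blank for the "cnr" row):

```r
X <- c("205 Cape Road", "32 Albany Street", "cnr Kempston/Durban Roads")
# Number only where the string starts with digits followed by a space
num  <- ifelse(grepl("^[0-9]+ ", X), sub("^([0-9]+) .*", "\\1", X), NA)
# sub leaves non-matching strings unchanged, so "cnr ..." passes through whole
name <- sub("^[0-9]+ ", "", X)
data.frame(Street.Number = num, Street.Name = name)
#   Street.Number               Street.Name
# 1           205                 Cape Road
# 2            32             Albany Street
# 3          <NA> cnr Kempston/Durban Roads
```

Because both sub() calls are vectorized, this scales to long address columns without an explicit loop.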
