Update a field if the value of a pattern is true - r

This is my first question so please excuse the mistakes.
I have a dataframe where the address is in one line and has many missing values and several errors.
Address
Braemor Drive, Clontarf, Co.Dublin
Meadow Avenue, Dundrum
Philipsburgh Avenue, Marino
Myrtle Square, The Coast
I would like to add a new field, "District", based on whether the address contains certain values. For example, if it contains Marino, Fairview or Clontarf, the District should be Dublin 3.
Dublin3 <- c("Marino", "Fairview", "Clontarf")
matches <- unique (grep(paste(Dublin3,collapse="|"),
DubPPReg$Address, value=TRUE))
Using R, how can I update the value of District where the match is true?

# I've created example data frame with column Adress
df <- data.frame(Adress = c("Braemor Drive",
"Clontarf",
"Co.Dublin",
"Meadow Avenue",
"Dundrum",
"Philipsburgh Avenue",
"Marino",
"Myrtle Square", "The Coast"))
# And the vector Dublin3
Dublin3 <- c("Marino", "Fairview", "Clontarf")
# Match the values in column Adress against the vector Dublin3 (exact matches only)
df$District <- ifelse(df$Adress %in% Dublin3, "Dublin 3",FALSE)
df
Adress District
1 Braemor Drive FALSE
2 Clontarf Dublin 3
3 Co.Dublin FALSE
4 Meadow Avenue FALSE
5 Dundrum FALSE
6 Philipsburgh Avenue FALSE
7 Marino Dublin 3
8 Myrtle Square FALSE
9 The Coast FALSE
Instead of FALSE you can choose something else (e.g. NA).
Edited: If your data are in vector
df <- c("Braemor Drive, Churchtown, Co.Dublin",
"Meadow Avenue, Clontarf, Dublin 14",
"Sallymount Avenue, Ranelagh", "Philipsburgh Avenue, Marino")
Which looks like this
df
[1] "Braemor Drive, Churchtown, Co.Dublin"
[2] "Meadow Avenue, Clontarf, Dublin 14"
[3] "Sallymount Avenue, Ranelagh"
[4] "Philipsburgh Avenue, Marino"
You can find your matches using grepl like this
match <- ifelse(grepl("Marino|Fairview|Clontarf", df, ignore.case = TRUE), "Dublin 3", FALSE)
and output is
[1] "FALSE" "Dublin 3" "FALSE" "Dublin 3"
This means that at least one of the names you are looking for (i.e. Marino, Fairview or Clontarf) appears in the second and fourth elements of df.
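To tie this back to the original question, here is a minimal sketch of updating a District column in place. It assumes the data frame is called DubPPReg with a character column Address, as in the question; the four sample rows are copied from the question, and NA is used for rows that do not match.

```r
# Sample rows copied from the question
DubPPReg <- data.frame(
  Address = c("Braemor Drive, Clontarf, Co.Dublin",
              "Meadow Avenue, Dundrum",
              "Philipsburgh Avenue, Marino",
              "Myrtle Square, The Coast"),
  stringsAsFactors = FALSE
)

Dublin3 <- c("Marino", "Fairview", "Clontarf")

# Build one regular expression: "Marino|Fairview|Clontarf"
pattern <- paste(Dublin3, collapse = "|")

# Start with NA everywhere, then fill in only the rows where the pattern matches
DubPPReg$District <- NA_character_
is_dublin3 <- grepl(pattern, DubPPReg$Address, ignore.case = TRUE)
DubPPReg$District[is_dublin3] <- "Dublin 3"

DubPPReg
```

Because only the matching rows are assigned, the same three lines can be repeated for other districts without overwriting earlier results.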

Related

Appending multiple nested lists to a dataframe in R

I have a list of ambiguous addresses that I need to return full geocode information for.
Only issue is that what I get is a large list of nested lists (JSON)
I want to be able to get a data frame that contains the key information, i.e.
IDEAL OUTPUT
Original_Address, StreetNum, StreetName, Suburb, town_city, locality, Postcode, geo_xCord, Country, Postcode
I almost wonder if this is just too difficult and if there is an easier method that I haven't considered.
I basically just need to be able to spit out the key address elements for each address I have.
# Stack Overflow Example -------------------------------------------
library(ggmap) # provides register_google() and geocode()
random_addresses <- c('27 Hall Street, Wellington',
'52 Ethan Street, New Zealand',
'13 Epsom Street, Auckland',
'42 Elden Drive, New Zealand')
register_google(key = "MYAPIKEY")
place_lookup <- geocode(random_addresses, output = "all")
print(place_lookup[1])
>>>
[[1]]$results
[[1]]$results[[1]]
[[1]]$results[[1]]$address_components
[[1]]$results[[1]]$address_components[[1]]
[[1]]$results[[1]]$address_components[[1]]$long_name
[1] "27"
[[1]]$results[[1]]$address_components[[1]]$short_name
[1] "27"
[[1]]$results[[1]]$address_components[[1]]$types
[[1]]$results[[1]]$address_components[[1]]$types[[1]]
[1] "street_number"
[[1]]$results[[1]]$address_components[[2]]
[[1]]$results[[1]]$address_components[[2]]$long_name
[1] "Hall Street"
[[1]]$results[[1]]$address_components[[2]]$short_name
[1] "Hall St"
[[1]]$results[[1]]$address_components[[2]]$types
[[1]]$results[[1]]$address_components[[2]]$types[[1]]
[1] "route"
[[1]]$results[[1]]$address_components[[3]]
[[1]]$results[[1]]$address_components[[3]]$long_name
[1] "Newtown"
[[1]]$results[[1]]$address_components[[3]]$short_name
[1] "Newtown"
[[1]]$results[[1]]$address_components[[3]]$types
[[1]]$results[[1]]$address_components[[3]]$types[[1]]
[1] "political"
[[1]]$results[[1]]$address_components[[3]]$types[[2]]
[1] "sublocality"
[[1]]$results[[1]]$address_components[[3]]$types[[3]]
[1] "sublocality_level_1"
[[1]]$results[[1]]$address_components[[4]]
[[1]]$results[[1]]$address_components[[4]]$long_name
[1] "Wellington"
[[1]]$results[[1]]$address_components[[4]]$short_name
[1] "Wellington"
[[1]]$results[[1]]$address_components[[4]]$types
[[1]]$results[[1]]$address_components[[4]]$types[[1]]
[1] "locality"
[[1]]$results[[1]]$address_components[[4]]$types[[2]]
[1] "political"
[[1]]$results[[1]]$address_components[[5]]
[[1]]$results[[1]]$address_components[[5]]$long_name
[1] "Wellington"
[[1]]$results[[1]]$address_components[[5]]$short_name
[1] "Wellington"
[[1]]$results[[1]]$address_components[[5]]$types
[[1]]$results[[1]]$address_components[[5]]$types[[1]]
[1] "administrative_area_level_1"
[[1]]$results[[1]]$address_components[[5]]$types[[2]]
[1] "political"
[[1]]$results[[1]]$address_components[[6]]
[[1]]$results[[1]]$address_components[[6]]$long_name
[1] "New Zealand"
[[1]]$results[[1]]$address_components[[6]]$short_name
[1] "NZ"
[[1]]$results[[1]]$address_components[[6]]$types
[[1]]$results[[1]]$address_components[[6]]$types[[1]]
[1] "country"
[[1]]$results[[1]]$address_components[[6]]$types[[2]]
[1] "political"
[[1]]$results[[1]]$address_components[[7]]
[[1]]$results[[1]]$address_components[[7]]$long_name
[1] "6021"
[[1]]$results[[1]]$address_components[[7]]$short_name
[1] "6021"
[[1]]$results[[1]]$address_components[[7]]$types
[[1]]$results[[1]]$address_components[[7]]$types[[1]]
[1] "postal_code"
[[1]]$results[[1]]$formatted_address
[1] "27 Hall Street, Newtown, Wellington 6021, New Zealand"
[[1]]$results[[1]]$geometry
[[1]]$results[[1]]$geometry$bounds
[[1]]$results[[1]]$geometry$bounds$northeast
[[1]]$results[[1]]$geometry$bounds$northeast$lat
[1] -41.31066
[[1]]$results[[1]]$geometry$bounds$northeast$lng
[1] 174.7768
[[1]]$results[[1]]$geometry$bounds$southwest
[[1]]$results[[1]]$geometry$bounds$southwest$lat
[1] -41.31081
[[1]]$results[[1]]$geometry$bounds$southwest$lng
[1] 174.7766
[[1]]$results[[1]]$geometry$location
[[1]]$results[[1]]$geometry$location$lat
[1] -41.31074
[[1]]$results[[1]]$geometry$location$lng
[1] 174.7767
[[1]]$results[[1]]$geometry$location_type
[1] "ROOFTOP"
[[1]]$results[[1]]$geometry$viewport
[[1]]$results[[1]]$geometry$viewport$northeast
[[1]]$results[[1]]$geometry$viewport$northeast$lat
[1] -41.30932
[[1]]$results[[1]]$geometry$viewport$northeast$lng
[1] 174.778
[[1]]$results[[1]]$geometry$viewport$southwest
[[1]]$results[[1]]$geometry$viewport$southwest$lat
[1] -41.31202
[[1]]$results[[1]]$geometry$viewport$southwest$lng
[1] 174.7753
[[1]]$results[[1]]$place_id
[1] "ChIJiynBCOOvOG0RMx429ZNDR3A"
[[1]]$results[[1]]$types
[[1]]$results[[1]]$types[[1]]
[1] "premise"
[[1]]$status
[1] "OK"
---
You can explore the nested lists with the viewer in RStudio or with listviewer::jsonedit, then drill down to the information you want. The basic idea is to use unnest_wider to spread the list into columns and select the ones you need, then unnest_longer to tease out the nested lists so you can iterate through them.
library(tidyverse)
map(random_addresses, ~geocode(.x, output = "all") %>%
# results is name of list with desired information, create tibble for unnest
tibble(output = .$results) %>%
# Create tibble with address_components as column-list
unnest_wider(output) %>%
dplyr::select(address_components) %>%
# Get address_components as list of lists, each list to df
unnest_longer(., col = "address_components") %>%
map_dfr(., ~.x) %>%
# types is the type of information. It is listed so unlist
mutate(types = unlist(types)) %>%
# Choose the information to keep
filter(types %in% c("street_number", "route")) %>%
# Choose the format of data
select(long_name, types) %>%
# Put in wide form
pivot_wider(names_from = "types", values_from = "long_name")
) %>%
bind_rows # create master df
Before the filtering step, each list element holds a tibble like this one:
[[4]]
# A tibble: 13 × 3
long_name short_name types
<chr> <chr> <chr>
1 New Zealand NZ country
2 New Zealand NZ political
3 42 42 street_number
4 Elden Drive Elden Dr route
5 Saddle River Saddle River locality
6 Saddle River Saddle River political
7 Bergen County Bergen County administrative_area_level_2
8 Bergen County Bergen County political
9 New Jersey NJ administrative_area_level_1
10 New Jersey NJ political
11 United States US country
12 United States US political
13 07458 07458 postal_code
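If you would rather not depend on the tidyverse, the same drilling down can be sketched in base R. The example below does not call the API: mock_result is a hand-built list that only mimics the shape of one element of place_lookup as printed in the question, and extract_components is a hypothetical helper name, not a ggmap function.

```r
# Hand-built stand-in for one element of place_lookup (shape copied from the printout)
mock_result <- list(
  results = list(list(
    address_components = list(
      list(long_name = "27", short_name = "27", types = list("street_number")),
      list(long_name = "Hall Street", short_name = "Hall St", types = list("route")),
      list(long_name = "Wellington", short_name = "Wellington",
           types = list("locality", "political")),
      list(long_name = "6021", short_name = "6021", types = list("postal_code"))
    ),
    formatted_address = "27 Hall Street, Newtown, Wellington 6021, New Zealand"
  )),
  status = "OK"
)

# Hypothetical helper: return a one-row data frame with one column per wanted type
extract_components <- function(res,
                               wanted = c("street_number", "route",
                                          "locality", "postal_code")) {
  out <- setNames(rep(NA_character_, length(wanted)), wanted)
  # Only look inside responses that succeeded and contain at least one result
  if (identical(res$status, "OK") && length(res$results) > 0) {
    # Use the first (best) result only
    for (comp in res$results[[1]]$address_components) {
      hit <- intersect(unlist(comp$types), wanted)
      out[hit] <- comp$long_name
    }
  }
  as.data.frame(as.list(out), stringsAsFactors = FALSE)
}

row <- extract_components(mock_result)
row
```

With real data, something like do.call(rbind, lapply(place_lookup, extract_components)) should give one row per input address, with NA wherever a component is missing.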

How to get global environment variables in a vector in R? [duplicate]

I have a csv data file with 50000+ records stored in dataframe 'data'. I am creating data subsets based on 2 factors Segment & Market with below values:
customer_segments <- c('Consumer','Corporate','Home Office')
markets <- c('Africa','APAC','Canada','EMEA','EU','LATAM','US')
To get all subsets with 21 combinations for Market & Segment, I am using below nested for loops with assign & paste functions:
for(i in 1:length(markets)){
for(j in 1:length(customer_segments)){
assign(paste(markets[i],customer_segments[j],sep='_'),data[(data$Market == markets[i]) & (data$Segment == customer_segments[j]), ])
}
}
This creates 21 dataframes & assigns them a name accordingly like Canada_Home Office etc.
Problem is I want to iterate over all these 21 dataframes to aggregate 3 attributes: Sales, Quantity & Profit on each but not sure how to address these dataframes in a loop? Maybe if I get all 21 dataframes in a vector I can iterate, but not sure if this is the best option.
Create combination of markets and customer_segments using expand.grid().
df <- expand.grid(markets, customer_segments)
head(df)
# Var1 Var2
# 1 Africa Consumer
# 2 APAC Consumer
# 3 Canada Consumer
# 4 EMEA Consumer
# 5 EU Consumer
# 6 LATAM Consumer
Vector of the combination of markets and customer_segments
df1 <- as.vector(paste(df$Var1,df$Var2, sep = " "))
df1
# [1] "Africa Consumer" "APAC Consumer" "Canada Consumer"
# [4] "EMEA Consumer" "EU Consumer" "LATAM Consumer"
# [7] "US Consumer" "Africa Corporate" "APAC Corporate"
# [10] "Canada Corporate" "EMEA Corporate" "EU Corporate"
# [13] "LATAM Corporate" "US Corporate" "Africa Home Office"
# [16] "APAC Home Office" "Canada Home Office" "EMEA Home Office"
# [19] "EU Home Office" "LATAM Home Office" "US Home Office"
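An alternative to creating 21 separate variables is to keep the subsets in one named list and iterate over that. The sketch below uses a small made-up data frame in place of the real 50000-row data; only the column names Market, Segment, Sales, Quantity and Profit are taken from the question.

```r
# Toy stand-in for the real `data` (values are invented for illustration)
data <- data.frame(
  Market   = c("Africa", "Africa", "APAC", "APAC", "US"),
  Segment  = c("Consumer", "Consumer", "Corporate", "Consumer", "Home Office"),
  Sales    = c(100, 50, 200, 10, 70),
  Quantity = c(1, 2, 3, 4, 5),
  Profit   = c(10, 5, 20, 1, 7),
  stringsAsFactors = FALSE
)

# One list element per Market/Segment combination that actually occurs,
# named like "Africa_Consumer" to mirror the names used in the question
groups  <- interaction(data$Market, data$Segment, sep = "_", drop = TRUE)
subsets <- split(data, groups)

# Iterate over the list instead of over 21 global variables
totals <- lapply(subsets, function(d) colSums(d[, c("Sales", "Quantity", "Profit")]))
totals[["Africa_Consumer"]]

# If only the totals are needed, aggregate() does the grouping and summing in one call
summary_df <- aggregate(cbind(Sales, Quantity, Profit) ~ Market + Segment,
                        data = data, FUN = sum)
```

Keeping the subsets in a list means lapply or a for loop over names(subsets) can reach every data frame without assign or get.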

Extract address components from coordinates

I'm trying to reverse geocode with R. I first used ggmap but couldn't get it to work with my API key. Now I'm trying it with googleway.
newframe[,c("Front.lat","Front.long")]
Front.lat Front.long
1 -37.82681 144.9592
2 -37.82681 145.9592
newframe$address <- apply(newframe, 1, function(x){
google_reverse_geocode(location = as.numeric(c(x["Front.lat"],
x["Front.long"])),
key = "xxxx")
})
This extracts the variables as a list but I can't figure out the structure.
I'm struggling to figure out how to extract the address components listed below as variables in newframe
postal_code, administrative_area_level_1, administrative_area_level_2, locality, route, street_number
I would prefer each address component as a separate variable.
Google's API returns the response in JSON, which naturally forms nested lists when translated into R. Internally in googleway this is done through jsonlite::fromJSON()
In googleway I've given you the choice of returning the raw JSON or a list, through using the simplify argument.
I've deliberately returned ALL the data from Google's response and left it up to the user to extract the elements they're interested in through usual list-subsetting operations.
Having said all that, in the development version of googleway I've written a few functions to help accessing elements of various API calls. Here are three of them that may be useful to you
## Install the development version
# devtools::install_github("SymbolixAU/googleway")
library(googleway)
res <- google_reverse_geocode(
location = c(df[1, 'Front.lat'], df[1, 'Front.long']),
key = apiKey
)
geocode_address(res)
# [1] "45 Clarke St, Southbank VIC 3006, Australia"
# [2] "Bank Apartments, 275-283 City Rd, Southbank VIC 3006, Australia"
# [3] "Southbank VIC 3006, Australia"
# [4] "Melbourne VIC, Australia"
# [5] "South Wharf VIC 3006, Australia"
# [6] "Melbourne, VIC, Australia"
# [7] "CBD & South Melbourne, VIC, Australia"
# [8] "Melbourne Metropolitan Area, VIC, Australia"
# [9] "Victoria, Australia"
# [10] "Australia"
geocode_address_components(res)
# long_name short_name types
# 1 45 45 street_number
# 2 Clarke Street Clarke St route
# 3 Southbank Southbank locality, political
# 4 Melbourne City Melbourne administrative_area_level_2, political
# 5 Victoria VIC administrative_area_level_1, political
# 6 Australia AU country, political
# 7 3006 3006 postal_code
geocode_type(res)
# [[1]]
# [1] "street_address"
#
# [[2]]
# [1] "establishment" "general_contractor" "point_of_interest"
#
# [[3]]
# [1] "locality" "political"
#
# [[4]]
# [1] "colloquial_area" "locality" "political"
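To get each component as a separate variable, a table shaped like the geocode_address_components() output can be reshaped into a single wide row. This is only a sketch: the components data frame below is typed in by hand from the output shown above, and it assumes that types is a list column holding one character vector per row, which may not match what your googleway version returns.

```r
# Hand-typed copy of the components table shown above
components <- data.frame(
  long_name  = c("45", "Clarke Street", "Southbank", "Melbourne City",
                 "Victoria", "Australia", "3006"),
  short_name = c("45", "Clarke St", "Southbank", "Melbourne",
                 "VIC", "AU", "3006"),
  stringsAsFactors = FALSE
)
# Assumption: types is a list column, one character vector per component
components$types <- list(
  "street_number", "route", c("locality", "political"),
  c("administrative_area_level_2", "political"),
  c("administrative_area_level_1", "political"),
  c("country", "political"), "postal_code"
)

# Keep only the first type of each component as its column name
first_type <- vapply(components$types, `[`, character(1), 1)

# One row, one column per component type
wide <- as.data.frame(as.list(setNames(components$long_name, first_type)),
                      stringsAsFactors = FALSE)
wide
```

Using only the first type drops the repeated "political" label, so every column name is unique.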
After reverse geocoding into newframe$address the address components could be extracted further as follows:
# Make a boolean array of the valid ("OK" status) responses (other statuses may be "NO_RESULTS", "REQUEST_DENIED" etc).
sel <- sapply(c(1: nrow(newframe)), function(x){
newframe$address[[x]]$status == 'OK'
})
# Get the address_components of the first result (i.e. best match) returned per geocoded coordinate.
# Index with which(sel) so that rows with a non-"OK" status are actually skipped.
address.components <- sapply(which(sel), function(x){
newframe$address[[x]]$results[1,]$address_components
})
# Get all possible component types.
all.types <- unique(unlist(sapply(c(1: length(address.components)), function(x){
unlist(lapply(address.components[[x]]$types, function(l) l[[1]]))
})))
# Get "long_name" values of the address_components for each type present (the other option is "short_name").
all.values <- lapply(c(1: length(address.components)), function(x){
types <- unlist(lapply(address.components[[x]]$types, function(l) l[[1]]))
matches <- match(all.types, types)
values <- address.components[[x]]$long_name[matches]
})
# Bind results into a dataframe.
all.values <- do.call("rbind", all.values)
all.values <- as.data.frame(all.values)
names(all.values) <- all.types
# Add columns and update original data frame.
newframe[, all.types] <- NA
newframe[sel,][, all.types] <- all.values
Note that I've only kept the first type given per component, effectively skipping the "political" type as it appears in multiple components and is likely superfluous e.g. "administrative_area_level_1, political".
You can use ggmap::revgeocode easily; look below:
library(ggmap)
df <- cbind(df,do.call(rbind,
lapply(1:nrow(df),
function(i)
revgeocode(as.numeric(
df[i,2:1]), output = "more")
[c("administrative_area_level_1","locality","postal_code","address")])))
#output:
df
# Front.lat Front.long administrative_area_level_1 locality
# 1 -37.82681 144.9592 Victoria Southbank
# 2 -37.82681 145.9592 Victoria Noojee
# postal_code address
# 1 3006 45 Clarke St, Southbank VIC 3006, Australia
# 2 3833 Cec Dunns Track, Noojee VIC 3833, Australia
You can add "route" and "street_number" to the variables that you want to extract, but as you can see, the second address does not have a street number, and that will cause an error.
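One way around the missing street number is to fill absent fields with NA before binding the rows. The sketch below does not call revgeocode: r1 and r2 are hand-made one-row data frames standing in for its output = "more" results, and pick_fields is a hypothetical helper name.

```r
# Hypothetical helper: select the wanted columns, using NA for any that are absent
pick_fields <- function(x, wanted) {
  out <- as.list(setNames(rep(NA_character_, length(wanted)), wanted))
  present <- intersect(wanted, names(x))
  out[present] <- lapply(x[present], as.character)
  as.data.frame(out, stringsAsFactors = FALSE)
}

# Hand-made stand-ins for two reverse-geocoding results; the second has no street_number
r1 <- data.frame(street_number = "45", route = "Clarke Street",
                 locality = "Southbank", stringsAsFactors = FALSE)
r2 <- data.frame(route = "Cec Dunns Track", locality = "Noojee",
                 stringsAsFactors = FALSE)

wanted <- c("street_number", "route", "locality")
res <- do.call(rbind, lapply(list(r1, r2), pick_fields, wanted = wanted))
res
```

Every row then has the same columns, so rbind and cbind no longer fail when a component is missing.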
Note: You may also use sub and extract the information from the address.
Data:
df <- structure(list(Front.lat = c(-37.82681, -37.82681), Front.long =
c(144.9592, 145.9592)), .Names = c("Front.lat", "Front.long"), class = "data.frame",
row.names = c(NA, -2L))

Retrieving latitude/longitude coordinates for cities/countries that have since changed names?

Say I have a vector of cities and countries, which may or may not include names of places that have since changed names:
locations <- c("Paris, France", "Sarajevo, Yugoslavia", "Rome, Italy", "Leningrad, Soviet Union", "St Petersburg, Russia")
The problem is that I can't use something like ggmap::geocode since it doesn't appear to work well for locations whose names have changed:
ggmap::geocode(locations, source = "dsk")
lon lat
1 2.34880 48.85341 #Works for Paris
2 NA NA #Didn't work for Sarajevo
3 12.48390 41.89474 #Works for Rome
4 98.00000 60.00000 #Didn't work for the old name of St Petersburg seems to just get the center of Russia
5 30.26417 59.89444 #Worked for St Petersburg
Are there alternative functions I could use? If I have to "update" the names of the cities & countries, is there an easy method of going through this? I have hundreds of locations for which I want to collect the longitude and latitude coordinates.
This might not be what you had in mind, but if you use the exact same code with only the city names (and not the countries), at least the two cases that you mentioned (Sarajevo and Leningrad) seem to work fine. You could try to run the function with a modified locations vector including just the city names, and see if you still get errors. Something like this:
(cities <- gsub(',.*', '', locations))
## [1] "Paris" "Sarajevo" "Rome" "Leningrad" "St Petersburg"
cbind(ggmap::geocode(cities, source = 'dsk'), cities)
## lon lat cities
## 1 2.34880 48.85341 Paris
## 2 18.35644 43.84864 Sarajevo
## 3 12.48390 41.89474 Rome
## 4 30.26417 59.89444 Leningrad
## 5 30.26417 59.89444 St Petersburg
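If dropping the country is not enough for some entries, the names can be "updated" with a small lookup table before geocoding. This is only a sketch: the renames vector is hand-built for the two outdated entries in the question and would have to be extended by hand for other places.

```r
locations <- c("Paris, France", "Sarajevo, Yugoslavia", "Rome, Italy",
               "Leningrad, Soviet Union", "St Petersburg, Russia")

# Hand-built lookup: names are the old labels, values are the current ones
renames <- c(
  "Leningrad, Soviet Union" = "St Petersburg, Russia",
  "Sarajevo, Yugoslavia"    = "Sarajevo, Bosnia and Herzegovina"
)

# Replace only the entries found in the lookup; leave everything else untouched
is_old  <- locations %in% names(renames)
updated <- locations
updated[is_old] <- unname(renames[locations[is_old]])

updated
```

The updated vector can then be passed to the geocoding function in place of locations, and the original vector is kept for joining the coordinates back.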

Split column with multiple delimiters

I am trying to determine in R how to split a column that has multiple fields with multiple delimiters.
From an API, I get a column in a data frame called "Location". It has multiple location identifiers in it. Here is an example of one entry. (edit- I added a couple more)
6540 BENNINGTON AVE
Kansas City, MO 64133
(39.005620414000475, -94.50998643299965)
4284 E 61ST ST
Kansas City, MO 64130
(39.014638172000446, -94.5335298549997)
3002 SPRUCE AVE
Kansas City, MO 64128
(39.07083265200049, -94.53320606399967)
6022 E Red Bridge Rd
Kansas City, MO 64134
(38.92458893200046, -94.52090062499968)
So the above is the entry in row 1-4, column "location".
I want to split this into address, city, state, zip, lat and long columns. Some fields are separated by a space or tab, while others are separated by a comma. Also, nothing is fixed width.
I have looked at the reshape package, but it seems I need a single delimiter. I can't use space (or can I?) as the address has spaces in it.
Thoughts?
If the data you have is not like this, let everyone know by adding code we can copy and paste into R to reproduce your data (see how this sample data can be easily copied and pasted into R?)
Sample data:
location <- c(
"6540 BENNINGTON AVE
Kansas City, MO 64133
(39.005620414000475, -94.50998643299965)",
"456 POOH LANE
New York City, NY 10025
(40, -90)")
location
#[1] "6540 BENNINGTON AVE\nKansas City, MO 64133\n(39.005620414000475, -94.50998643299965)"
#[2] "456 POOH LANE\nNew York City, NY 10025\n(40, -90)"
A solution:
# Insert a comma between the state abbreviation and the zip code
step1 <- gsub("([[:alpha:]]{2}) ([[:digit:]]{5})", "\\1,\\2", location)
# get rid of parentheses
step2 <- gsub("\\(|\\)", "", step1)
# split on "\n", ",", and ", "
strsplit(step2, "\n|,|, ")
#[[1]]
#[1] "6540 BENNINGTON AVE" "Kansas City" "MO"
#[4] "64133" "39.005620414000475" "-94.50998643299965"
#[[2]]
#[1] "456 POOH LANE" "New York City" "NY" "10025"
#[5] "40" "-90"
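To go from the split pieces to the columns asked for, the list can be bound into a data frame. The sketch below repeats the sample data and the two gsub steps so that it runs on its own, and it assumes every entry splits into exactly six pieces; the split pattern "\n|, ?" is a slight variation that treats the space after a comma as optional.

```r
location <- c(
  "6540 BENNINGTON AVE\nKansas City, MO 64133\n(39.005620414000475, -94.50998643299965)",
  "456 POOH LANE\nNew York City, NY 10025\n(40, -90)"
)

# Same two cleaning steps as above
step1 <- gsub("([[:alpha:]]{2}) ([[:digit:]]{5})", "\\1,\\2", location)
step2 <- gsub("\\(|\\)", "", step1)

# Split on a line break, or on a comma with an optional trailing space
parts <- strsplit(step2, "\n|, ?")

# Assumes each entry yields exactly six pieces
parsed <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(parsed) <- c("address", "city", "state", "zip", "lat", "lon")

# Coordinates become numeric; zip stays character so leading zeros survive
parsed$lat <- as.numeric(parsed$lat)
parsed$lon <- as.numeric(parsed$lon)
parsed
```

Checking lengths(parts) before the rbind is a cheap way to spot entries that do not follow the expected layout.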
Here is an example with the stringr package.
Using @Frank's example data from above, you can do:
library(stringr)
address <- str_match(location,
"(^[[:print:]]+)[[:space:]]([[:alpha:]. ]+), ([[:alpha:]]{2}) ([[:digit:]]{5})[[:space:]][(]([[:digit:].-]+), ([[:digit:].-]+)")
address <- data.frame(address[,-1]) # get rid of the first column which has the full match
names(address) <- c("address", "city", "state", "zip", "lat", "lon")
> address
address city state zip lat lon
1 6540 BENNINGTON AVE Kansas City MO 64133 39.005620414000475 -94.50998643299965
2 456 POOH LANE New York City NY 10025 40 -90
Note that this is pretty specific to the format of the one entry given. It would need to be tweaked if there is variation in any number of ways.
This takes everything from the start of the string up to the first line break as address (the [[:print:]] class matches spaces but not newlines, so the street address can contain spaces). The next set of letters, spaces and periods up until the next comma is given to city. After the comma and a space, the next two letters are given to state. Following a space, the next five digits make up the zip field. Finally, the two following runs of digits, periods and/or minus signs are assigned to lat and lon respectively.

Resources