Appending multiple nested lists to a data frame in R

I have a list of ambiguous addresses for which I need to return full geocode information.
The only issue is that what I get back is a large list of nested lists (parsed JSON).
I want to be able to get a data frame that contains the key information, i.e.
IDEAL OUTPUT
Original_Address, StreetNum, StreetName, Suburb, town_city, locality, Postcode, geo_xCord, Country
I almost wonder if this is just too difficult and if there is an easier method that I haven't considered.
I basically just need to be able to spit out the key address elements for each address I have.
# Stack Overflow Example -------------------------------------------
library(ggmap) # provides register_google() and geocode()
random_addresses <- c('27 Hall Street, Wellington',
'52 Ethan Street, New Zealand',
'13 Epsom Street, Auckland',
'42 Elden Drive, New Zealand')
register_google(key = "MYAPIKEY")
place_lookup <- geocode(random_addresses, output = "all")
print(place_lookup[1])
>>>
[[1]]$results
[[1]]$results[[1]]
[[1]]$results[[1]]$address_components
[[1]]$results[[1]]$address_components[[1]]
[[1]]$results[[1]]$address_components[[1]]$long_name
[1] "27"
[[1]]$results[[1]]$address_components[[1]]$short_name
[1] "27"
[[1]]$results[[1]]$address_components[[1]]$types
[[1]]$results[[1]]$address_components[[1]]$types[[1]]
[1] "street_number"
[[1]]$results[[1]]$address_components[[2]]
[[1]]$results[[1]]$address_components[[2]]$long_name
[1] "Hall Street"
[[1]]$results[[1]]$address_components[[2]]$short_name
[1] "Hall St"
[[1]]$results[[1]]$address_components[[2]]$types
[[1]]$results[[1]]$address_components[[2]]$types[[1]]
[1] "route"
[[1]]$results[[1]]$address_components[[3]]
[[1]]$results[[1]]$address_components[[3]]$long_name
[1] "Newtown"
[[1]]$results[[1]]$address_components[[3]]$short_name
[1] "Newtown"
[[1]]$results[[1]]$address_components[[3]]$types
[[1]]$results[[1]]$address_components[[3]]$types[[1]]
[1] "political"
[[1]]$results[[1]]$address_components[[3]]$types[[2]]
[1] "sublocality"
[[1]]$results[[1]]$address_components[[3]]$types[[3]]
[1] "sublocality_level_1"
[[1]]$results[[1]]$address_components[[4]]
[[1]]$results[[1]]$address_components[[4]]$long_name
[1] "Wellington"
[[1]]$results[[1]]$address_components[[4]]$short_name
[1] "Wellington"
[[1]]$results[[1]]$address_components[[4]]$types
[[1]]$results[[1]]$address_components[[4]]$types[[1]]
[1] "locality"
[[1]]$results[[1]]$address_components[[4]]$types[[2]]
[1] "political"
[[1]]$results[[1]]$address_components[[5]]
[[1]]$results[[1]]$address_components[[5]]$long_name
[1] "Wellington"
[[1]]$results[[1]]$address_components[[5]]$short_name
[1] "Wellington"
[[1]]$results[[1]]$address_components[[5]]$types
[[1]]$results[[1]]$address_components[[5]]$types[[1]]
[1] "administrative_area_level_1"
[[1]]$results[[1]]$address_components[[5]]$types[[2]]
[1] "political"
[[1]]$results[[1]]$address_components[[6]]
[[1]]$results[[1]]$address_components[[6]]$long_name
[1] "New Zealand"
[[1]]$results[[1]]$address_components[[6]]$short_name
[1] "NZ"
[[1]]$results[[1]]$address_components[[6]]$types
[[1]]$results[[1]]$address_components[[6]]$types[[1]]
[1] "country"
[[1]]$results[[1]]$address_components[[6]]$types[[2]]
[1] "political"
[[1]]$results[[1]]$address_components[[7]]
[[1]]$results[[1]]$address_components[[7]]$long_name
[1] "6021"
[[1]]$results[[1]]$address_components[[7]]$short_name
[1] "6021"
[[1]]$results[[1]]$address_components[[7]]$types
[[1]]$results[[1]]$address_components[[7]]$types[[1]]
[1] "postal_code"
[[1]]$results[[1]]$formatted_address
[1] "27 Hall Street, Newtown, Wellington 6021, New Zealand"
[[1]]$results[[1]]$geometry
[[1]]$results[[1]]$geometry$bounds
[[1]]$results[[1]]$geometry$bounds$northeast
[[1]]$results[[1]]$geometry$bounds$northeast$lat
[1] -41.31066
[[1]]$results[[1]]$geometry$bounds$northeast$lng
[1] 174.7768
[[1]]$results[[1]]$geometry$bounds$southwest
[[1]]$results[[1]]$geometry$bounds$southwest$lat
[1] -41.31081
[[1]]$results[[1]]$geometry$bounds$southwest$lng
[1] 174.7766
[[1]]$results[[1]]$geometry$location
[[1]]$results[[1]]$geometry$location$lat
[1] -41.31074
[[1]]$results[[1]]$geometry$location$lng
[1] 174.7767
[[1]]$results[[1]]$geometry$location_type
[1] "ROOFTOP"
[[1]]$results[[1]]$geometry$viewport
[[1]]$results[[1]]$geometry$viewport$northeast
[[1]]$results[[1]]$geometry$viewport$northeast$lat
[1] -41.30932
[[1]]$results[[1]]$geometry$viewport$northeast$lng
[1] 174.778
[[1]]$results[[1]]$geometry$viewport$southwest
[[1]]$results[[1]]$geometry$viewport$southwest$lat
[1] -41.31202
[[1]]$results[[1]]$geometry$viewport$southwest$lng
[1] 174.7753
[[1]]$results[[1]]$place_id
[1] "ChIJiynBCOOvOG0RMx429ZNDR3A"
[[1]]$results[[1]]$types
[[1]]$results[[1]]$types[[1]]
[1] "premise"
[[1]]$status
[1] "OK"
---

You can explore the nested lists with the viewer in RStudio or with listviewer::jsonedit(), and then drill down to the desired information. The approach below uses unnest_wider() to spread the list into columns, so that the desired columns can be selected, and unnest_longer() to tease out the nested lists so that they can be iterated over.
library(tidyverse)
library(ggmap) # for geocode()
map(random_addresses, ~geocode(.x, output = "all") %>%
# results is name of list with desired information, create tibble for unnest
tibble(output = .$results) %>%
# Create tibble with address_components as column-list
unnest_wider(output) %>%
dplyr::select(address_components) %>%
# Get address_components as list of lists, each list to df
unnest_longer(., col = "address_components") %>%
map_dfr(., ~.x) %>%
# types is the type of information. It is listed so unlist
mutate(types = unlist(types)) %>%
# Choose the information to keep
filter(types %in% c("street_number", "route")) %>%
# Choose the format of data
select(long_name, types) %>%
# Put in wide form
pivot_wider(names_from = "types", values_from = "long_name")
) %>%
bind_rows # create master df
Before the filtering step, this gives you one tibble per address, for example:
[[4]]
# A tibble: 13 × 3
long_name short_name types
<chr> <chr> <chr>
1 New Zealand NZ country
2 New Zealand NZ political
3 42 42 street_number
4 Elden Drive Elden Dr route
5 Saddle River Saddle River locality
6 Saddle River Saddle River political
7 Bergen County Bergen County administrative_area_level_2
8 Bergen County Bergen County political
9 New Jersey NJ administrative_area_level_1
10 New Jersey NJ political
11 United States US country
12 United States US political
13 07458 07458 postal_code
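As a rough alternative sketch, the same nested structure can also be flattened with base R only. The `mock` object below is a hand-built stand-in that mimics the shape printed in the question (it is not real API output), and `components_to_row` is a hypothetical helper name. It assumes the first result is the best match and returns NA for any component that is absent.

```r
# Hand-built stand-in for one element of geocode(..., output = "all");
# real responses carry more fields than shown here.
mock <- list(
  results = list(list(
    address_components = list(
      list(long_name = "27", short_name = "27", types = list("street_number")),
      list(long_name = "Hall Street", short_name = "Hall St", types = list("route")),
      list(long_name = "Wellington", short_name = "Wellington",
           types = list("locality", "political")),
      list(long_name = "6021", short_name = "6021", types = list("postal_code"))
    ),
    geometry = list(location = list(lat = -41.31074, lng = 174.7767))
  )),
  status = "OK"
)

# Hypothetical helper: one-row data frame built from the first result.
components_to_row <- function(x, original_address) {
  wanted <- c("street_number", "route", "locality", "postal_code", "country")
  row <- as.list(setNames(rep(NA_character_, length(wanted)), wanted))
  row$lat <- NA_real_
  row$lng <- NA_real_
  if (identical(x$status, "OK") && length(x$results) > 0) {
    best <- x$results[[1]]
    for (comp in best$address_components) {
      # A component may list several types; keep it if any of them is wanted
      hit <- intersect(wanted, unlist(comp$types))
      if (length(hit) > 0) row[[hit[1]]] <- comp$long_name
    }
    row$lat <- best$geometry$location$lat
    row$lng <- best$geometry$location$lng
  }
  data.frame(original_address = original_address, row, stringsAsFactors = FALSE)
}

out <- components_to_row(mock, "27 Hall Street, Wellington")

# Several lookups (here the mock plus a made-up failed lookup) stacked together
failed <- list(results = list(), status = "ZERO_RESULTS")
all_rows <- do.call(rbind, Map(components_to_row,
                               list(mock, failed),
                               c("27 Hall Street, Wellington", "nowhere")))
```

With real data, the same `Map()` call could presumably be given `place_lookup` and `random_addresses` from the question in place of the mock inputs.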

Related

Subsetting elements in a list and placing them in a data frame

I have a list ("listanswer") that looks something like this:
> str(listanswer)
List of 100
$ : chr [1:3] "" "" "\t\t"
$ : chr [1:5] "" "Dr. Smith" "123 Fake Street" "New York, ZIPCODE 1" ...
$ : chr [1:5] "" "Dr. Jones" "124 Fake Street" "New York, ZIPCODE 2" ...
> listanswer
[[1]]
[1] "" "" "\t\t"
[[2]]
[1] "" "Dr. Smith" "123 Fake Street" "New York"
[5] "ZIPCODE 1"
[[3]]
[1] "" "Dr. Jones" "124 Fake Street," "New York"
[5] "ZIPCODE2"
For each element in this list, I noticed the following pattern within the sub-elements:
# first sub-element is always empty
> listanswer[[2]][[1]]
[1] ""
# second sub-element is the name
> listanswer[[2]][[2]]
[1] "Dr. Smith"
# third sub-element is always the address
> listanswer[[2]][[3]]
[1] "123 Fake Street"
# fourth sub-element is always the city
> listanswer[[2]][[4]]
[1] "New York"
# fifth sub-element is always the ZIP
> listanswer[[2]][[5]]
[1] "ZIPCODE 1"
I want to create a data frame that contains the information from this list in row format. For example:
id name address city ZIP
1 2 Dr. Smith 123 Fake Street New York ZIPCODE 1
2 3 Dr. Jones 124 Fake Street New York ZIPCODE 2
I thought of the following way to do this:
name = sapply(listanswer,function(x) x[2])
address = sapply(listanswer,function(x) x[3])
city = sapply(listanswer,function(x) x[4])
zip = sapply(listanswer,function(x) x[5])
final_data = data.frame(name, address, city, zip)
id = 1:nrow(final_data)
My Question: I just wanted to confirm - Is this the correct way to reference sub-elements in lists?
If it works, it's the correct way, although there might be a more efficient or more readable way to do the same thing.
Another way to do this is to create an empty data frame with your columns and add rows to it, i.e.
#create an empty data frame
df <- data.frame(matrix(ncol = 4, nrow = 0))
colnames(df) <- c("name", "address", "city", "zip")
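#add rows; a for loop is used so that the assignment changes df itself
#(inside a function passed to lapply it would only change a local copy)
for (x in listanswer) {
  df[nrow(df) + 1,] <- x[2:5]
}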
This is simply another way to solve the same problem. Readability is a personal preference, and there's nothing wrong with your solution either.
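A third variant, sketched here under the assumption that complete records always have exactly five sub-elements, binds all rows in one step instead of growing the data frame. The small `listanswer` below is a stand-in adapted from the question's printout, not the real scraped data.

```r
# Small stand-in for the list in the question
listanswer <- list(
  c("", "", "\t\t"),
  c("", "Dr. Smith", "123 Fake Street", "New York", "ZIPCODE 1"),
  c("", "Dr. Jones", "124 Fake Street", "New York", "ZIPCODE 2")
)

# Keep only complete records (five sub-elements), remembering their positions
keep <- which(lengths(listanswer) == 5)

# Stack sub-elements 2 to 5 of each record into a character matrix, one row each
mat <- do.call(rbind, lapply(listanswer[keep], function(x) x[2:5]))

final_data <- data.frame(id = keep, mat, stringsAsFactors = FALSE)
names(final_data) <- c("id", "name", "address", "city", "zip")
```

Here `id` records the position of each record in the original list, which matches the ids 2 and 3 in the desired output of the question.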
If this is based on your elephant question, for businesses in Vancouver, then this mostly works.
library(rvest)
library(dplyr) # for bind_rows()
url<-"Website/british-columbia/"
page <-read_html(url)
#find the div tab of class=one_third
b = page %>% html_nodes("div.one_third")
listanswer <- b %>% html_text() %>% strsplit("\\n")
#listanswer2 <- b %>% html_text2() %>% strsplit("\\n")
listanswer[[1]]<-NULL #remove first blank record
rows<-lapply(listanswer, function(element){
vect<-element[-1] #remove first blank field
cityindex<-as.integer(grep("Vancouver", vect)) #find city field
#add some error checking and corrections
if(length(cityindex)==0) {
cityindex <- length(vect)-1 }
else if(length(cityindex)>1) {
cityindex <- cityindex[2] }
#get the fields of interest
address <- vect[cityindex-1]
city<-vect[cityindex]
phone <- vect[cityindex+1]
if( cityindex < 3) {
cityindex <- 3
} #error check
#first groups combine into 1 name
name <- toString(vect[1:(cityindex-2)])
data.frame(name, address, city, phone)
})
answer<-bind_rows(rows)
#clean up
answer$phone <- sub("Website", "", answer$phone)
answer
This still needs some clean-up to handle the inconsistencies, but it should be 80-90% complete.

reading address and lat,long from xml_node in R (mapsapi package)

I'm trying to get information about an address using the mapsapi package in R.
My code looks as follows:
library(mapsapi)
library(xml2) # provides xml_child()
library(XML)
library(RCurl)
string <- "Pariser Platz 1, 10117 Berlin"
test <- mp_geocode(string)
xml <- xml_child(test[[string]],2)
xml
Now I'm getting this kind of xml file:
{xml_node}
<result>
[1] <type>street_address</type>
[2] <formatted_address>Pariser Platz 1, 10117 Berlin, Germany</formatted_address>
[3] <address_component>\n <long_name>1</long_name>\n <short_name>1</short_name>\n <type>street_number</type>\n</address_component>
[4] <address_component>\n <long_name>Pariser Platz</long_name>\n <short_name>Pariser Platz</short_name>\n <type>route</type>\n</address_component>
[5] <address_component>\n <long_name>Mitte</long_name>\n <short_name>Mitte</short_name>\n <type>political</type>\n <type>sublocality</type>\n <type>sublocality_level_1</type>\n</address_component>
[6] <address_component>\n <long_name>Berlin</long_name>\n <short_name>Berlin</short_name>\n <type>locality</type>\n <type>political</type>\n</address_component>
[7] <address_component>\n <long_name>Berlin</long_name>\n <short_name>Berlin</short_name>\n <type>administrative_area_level_1</type>\n <type>political</type>\n</address_component>
[8] <address_component>\n <long_name>Germany</long_name>\n <short_name>DE</short_name>\n <type>country</type>\n <type>political</type>\n</address_component>
[9] <address_component>\n <long_name>10117</long_name>\n <short_name>10117</short_name>\n <type>postal_code</type>\n</address_component>
[10] <geometry>\n <location>\n <lat>52.5160964</lat>\n <lng>13.3779369</lng>\n </location>\n <location_type>ROOFTOP</location_type>\n <viewport>\n <southwest>\n <lat>52.5147474</lat>\n <lng>13.37658 ...
[11] <place_id>ChIJnYvtVcZRqEcRl6Kftq66b6Y</place_id>
So how can I export the street number, address, city, zip, lat and long out of this xml into decent variables?
Thanks for your help!
regards
I've made accessing this type of information easy in my googleway package
library(googleway)
## you're using Google's API, and they require you to have an API key
## so you'll need to get one
set_key("GOOGLE_API_KEY")
## perform query
res <- google_geocode("Pariser Platz 1, 10117 Berlin")
With the res result you can use geocode_coordinates() to extract the coordinates, and geocode_address_components() to get the address components, including the street number:
## coordinates
geocode_coordinates(res)
# lat lng
# 1 52.5161 13.37794
geocode_address_components(res)
# long_name short_name types
# 1 1 1 street_number
# 2 Pariser Platz Pariser Platz route
# 3 Mitte Mitte political, sublocality, sublocality_level_1
# 4 Berlin Berlin locality, political
# 5 Berlin Berlin administrative_area_level_1, political
# 6 Germany DE country, political
# 7 10117 10117 postal_code
You can look at str(res) to see the full list of items returned from Google's API
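If you would rather stay with the XML node shown in the question, a minimal sketch using XPath queries from the xml2 package might look like the following. The XML string is an abbreviated stand-in for the `<result>` node printed above, and `component` is a hypothetical helper name.

```r
library(xml2)

# Abbreviated stand-in for the <result> node shown in the question
result <- read_xml(
  "<result>
     <address_component><long_name>1</long_name><short_name>1</short_name>
       <type>street_number</type></address_component>
     <address_component><long_name>Pariser Platz</long_name>
       <short_name>Pariser Platz</short_name><type>route</type></address_component>
     <address_component><long_name>Berlin</long_name><short_name>Berlin</short_name>
       <type>locality</type><type>political</type></address_component>
     <address_component><long_name>10117</long_name><short_name>10117</short_name>
       <type>postal_code</type></address_component>
     <geometry><location><lat>52.5160964</lat><lng>13.3779369</lng></location></geometry>
   </result>"
)

# Hypothetical helper: long_name of the component that carries a given <type>;
# returns NA when no such component exists
component <- function(node, type) {
  xpath <- sprintf(".//address_component[type='%s']/long_name", type)
  xml_text(xml_find_first(node, xpath))
}

street_number <- component(result, "street_number")
street        <- component(result, "route")
city          <- component(result, "locality")
zip           <- component(result, "postal_code")
lat <- xml_double(xml_find_first(result, ".//geometry/location/lat"))
lng <- xml_double(xml_find_first(result, ".//geometry/location/lng"))
```

The same `component()` calls should work on the `xml` object from the question, since it is an xml2 node with the same element names.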
Alternatively, you can also use ggmap::geocode():
> library(ggmap)
> geocode(location = "Pariser Platz 1, 10117 Berlin", output = 'latlon' )
Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Pariser%20Platz%201,%2010117%20Berlin&sensor=false
lon lat
1 13.37794 52.5161
Changing the output parameter can give you a very detailed list output (if required):
> geocode(location = "Pariser Platz 1, 10117 Berlin", output = 'all' )
Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Pariser%20Platz%201,%2010117%20Berlin&sensor=false
$results
$results[[1]]
$results[[1]]$address_components
$results[[1]]$address_components[[1]]
$results[[1]]$address_components[[1]]$long_name
[1] "1"
$results[[1]]$address_components[[1]]$short_name
[1] "1"
$results[[1]]$address_components[[1]]$types
[1] "street_number"
$results[[1]]$address_components[[2]]
$results[[1]]$address_components[[2]]$long_name
[1] "Pariser Platz"
$results[[1]]$address_components[[2]]$short_name
[1] "Pariser Platz"
$results[[1]]$address_components[[2]]$types
[1] "route"
$results[[1]]$address_components[[3]]
$results[[1]]$address_components[[3]]$long_name
[1] "Mitte"
$results[[1]]$address_components[[3]]$short_name
[1] "Mitte"
$results[[1]]$address_components[[3]]$types
[1] "political" "sublocality" "sublocality_level_1"
$results[[1]]$address_components[[4]]
$results[[1]]$address_components[[4]]$long_name
[1] "Berlin"
$results[[1]]$address_components[[4]]$short_name
[1] "Berlin"
$results[[1]]$address_components[[4]]$types
[1] "locality" "political"
$results[[1]]$address_components[[5]]
$results[[1]]$address_components[[5]]$long_name
[1] "Berlin"
$results[[1]]$address_components[[5]]$short_name
[1] "Berlin"
$results[[1]]$address_components[[5]]$types
[1] "administrative_area_level_1" "political"
$results[[1]]$address_components[[6]]
$results[[1]]$address_components[[6]]$long_name
[1] "Germany"
$results[[1]]$address_components[[6]]$short_name
[1] "DE"
$results[[1]]$address_components[[6]]$types
[1] "country" "political"
$results[[1]]$address_components[[7]]
$results[[1]]$address_components[[7]]$long_name
[1] "10117"
$results[[1]]$address_components[[7]]$short_name
[1] "10117"
$results[[1]]$address_components[[7]]$types
[1] "postal_code"
$results[[1]]$formatted_address
[1] "Pariser Platz 1, 10117 Berlin, Germany"
$results[[1]]$geometry
$results[[1]]$geometry$location
$results[[1]]$geometry$location$lat
[1] 52.5161
$results[[1]]$geometry$location$lng
[1] 13.37794
$results[[1]]$geometry$location_type
[1] "ROOFTOP"
$results[[1]]$geometry$viewport
$results[[1]]$geometry$viewport$northeast
$results[[1]]$geometry$viewport$northeast$lat
[1] 52.51745
$results[[1]]$geometry$viewport$northeast$lng
[1] 13.37929
$results[[1]]$geometry$viewport$southwest
$results[[1]]$geometry$viewport$southwest$lat
[1] 52.51475
$results[[1]]$geometry$viewport$southwest$lng
[1] 13.37659
$results[[1]]$place_id
[1] "ChIJnYvtVcZRqEcRl6Kftq66b6Y"
$results[[1]]$types
[1] "street_address"
$status
[1] "OK"
You can find more info in the function help section.
Sometimes the call may fail with the following message:
Warning message:
geocode failed with status OVER_QUERY_LIMIT, location = "Pariser Platz 1, 10117 Berlin"
Generally if you try after a few seconds it works fine. You can always check the remaining queries left in your quota with geocodeQueryCheck:
> geocodeQueryCheck()
2490 geocoding queries remaining.

How to get global environment variables in a vector in R? [duplicate]

This question already has answers here:
How do I make a list of data frames?
(10 answers)
Closed 5 years ago.
I have a csv data file with 50000+ records stored in dataframe 'data'. I am creating data subsets based on 2 factors Segment & Market with below values:
customer_segments <- c('Consumer','Corporate','Home Office')
markets <- c('Africa','APAC','Canada','EMEA','EU','LATAM','US')
To get all 21 subsets for the combinations of Market & Segment, I am using the nested for loops below with the assign & paste functions:
for(i in 1:length(markets)){
for(j in 1:length(customer_segments)){
assign(paste(markets[i],customer_segments[j],sep='_'),data[(data$Market == markets[i]) & (data$Segment == customer_segments[j]), ])
}
}
This creates 21 dataframes & assigns each of them a name accordingly, like Canada_Home Office etc.
The problem is that I want to iterate over all these 21 dataframes to aggregate 3 attributes (Sales, Quantity & Profit) on each, but I am not sure how to address these dataframes in a loop. Maybe if I get all 21 dataframes in a vector I can iterate, but I am not sure if this is the best option.
Create combination of markets and customer_segments using expand.grid().
df <- expand.grid(markets, customer_segments)
head(df)
# Var1 Var2
# 1 Africa Consumer
# 2 APAC Consumer
# 3 Canada Consumer
# 4 EMEA Consumer
# 5 EU Consumer
# 6 LATAM Consumer
Vector of the combination of markets and customer_segments
df1 <- as.vector(paste(df$Var1,df$Var2, sep = " "))
df1
# [1] "Africa Consumer" "APAC Consumer" "Canada Consumer"
# [4] "EMEA Consumer" "EU Consumer" "LATAM Consumer"
# [7] "US Consumer" "Africa Corporate" "APAC Corporate"
# [10] "Canada Corporate" "EMEA Corporate" "EU Corporate"
# [13] "LATAM Corporate" "US Corporate" "Africa Home Office"
# [16] "APAC Home Office" "Canada Home Office" "EMEA Home Office"
# [19] "EU Home Office" "LATAM Home Office" "US Home Office"
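In line with the linked duplicate about keeping data frames in a list, one possible sketch avoids `assign()` altogether: `split()` builds a named list of subsets that can be iterated over directly. The small `data` frame below is made up for illustration and only borrows the column names mentioned in the question.

```r
# Small made-up data set with the columns named in the question
data <- data.frame(
  Market   = c("US", "US", "EU", "EU", "US"),
  Segment  = c("Consumer", "Consumer", "Corporate", "Consumer", "Corporate"),
  Sales    = c(100, 50, 200, 75, 20),
  Quantity = c(1, 2, 3, 4, 5),
  Profit   = c(10, 5, 20, 7, 2),
  stringsAsFactors = FALSE
)

# One data frame per Market/Segment pair, held in a named list;
# drop = TRUE omits combinations that have no rows
subsets <- split(data, list(data$Market, data$Segment), sep = "_", drop = TRUE)

# Iterate over the list: total Sales, Quantity and Profit for each subset
totals <- t(sapply(subsets, function(d) {
  colSums(d[, c("Sales", "Quantity", "Profit")])
}))
```

Each subset is then available by name, e.g. `subsets[["US_Consumer"]]`. If only the totals are needed, `aggregate(cbind(Sales, Quantity, Profit) ~ Market + Segment, data, sum)` gives a similar summary without creating the subsets first.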

Extract address components from coordinates

I'm trying to reverse geocode with R. I first used ggmap but couldn't get it to work with my API key. Now I'm trying it with googleway.
newframe[,c("Front.lat","Front.long")]
Front.lat Front.long
1 -37.82681 144.9592
2 -37.82681 145.9592
newframe$address <- apply(newframe, 1, function(x){
google_reverse_geocode(location = as.numeric(c(x["Front.lat"],
x["Front.long"])),
key = "xxxx")
})
This extracts the variables as a list but I can't figure out the structure.
I'm struggling to figure out how to extract the address components listed below as variables in newframe
postal_code, administrative_area_level_1, administrative_area_level_2, locality, route, street_number
I would prefer each address component as a separate variable.
Google's API returns the response as JSON, which, when translated into R, naturally forms nested lists. Internally in googleway this is done through jsonlite::fromJSON().
In googleway I've given you the choice of returning the raw JSON or a list, through the simplify argument.
I've deliberately returned ALL the data from Google's response and left it up to the user to extract the elements they're interested in through the usual list-subsetting operations.
Having said all that, in the development version of googleway I've written a few functions to help access elements of various API calls. Here are three of them that may be useful to you:
## Install the development version
# devtools::install_github("SymbolixAU/googleway")
res <- google_reverse_geocode(
location = c(df[1, 'Front.lat'], df[1, 'Front.long']),
key = apiKey
)
geocode_address(res)
# [1] "45 Clarke St, Southbank VIC 3006, Australia"
# [2] "Bank Apartments, 275-283 City Rd, Southbank VIC 3006, Australia"
# [3] "Southbank VIC 3006, Australia"
# [4] "Melbourne VIC, Australia"
# [5] "South Wharf VIC 3006, Australia"
# [6] "Melbourne, VIC, Australia"
# [7] "CBD & South Melbourne, VIC, Australia"
# [8] "Melbourne Metropolitan Area, VIC, Australia"
# [9] "Victoria, Australia"
# [10] "Australia"
geocode_address_components(res)
# long_name short_name types
# 1 45 45 street_number
# 2 Clarke Street Clarke St route
# 3 Southbank Southbank locality, political
# 4 Melbourne City Melbourne administrative_area_level_2, political
# 5 Victoria VIC administrative_area_level_1, political
# 6 Australia AU country, political
# 7 3006 3006 postal_code
geocode_type(res)
# [[1]]
# [1] "street_address"
#
# [[2]]
# [1] "establishment" "general_contractor" "point_of_interest"
#
# [[3]]
# [1] "locality" "political"
#
# [[4]]
# [1] "colloquial_area" "locality" "political"
After reverse geocoding into newframe$address the address components could be extracted further as follows:
# Make a boolean array of the valid ("OK" status) responses (other statuses may be "NO_RESULTS", "REQUEST_DENIED" etc).
sel <- sapply(c(1: nrow(newframe)), function(x){
newframe$address[[x]]$status == 'OK'
})
# Get the address_components of the first result (i.e. best match) returned per successfully geocoded coordinate.
# which(sel) gives the row positions with an "OK" status, so failed lookups are skipped rather than misaligned.
address.components <- sapply(which(sel), function(x){
newframe$address[[x]]$results[1,]$address_components
})
# Get all possible component types.
all.types <- unique(unlist(sapply(c(1: length(address.components)), function(x){
unlist(lapply(address.components[[x]]$types, function(l) l[[1]]))
})))
# Get "long_name" values of the address_components for each type present (the other option is "short_name").
all.values <- lapply(c(1: length(address.components)), function(x){
types <- unlist(lapply(address.components[[x]]$types, function(l) l[[1]]))
matches <- match(all.types, types)
values <- address.components[[x]]$long_name[matches]
})
# Bind results into a dataframe.
all.values <- do.call("rbind", all.values)
all.values <- as.data.frame(all.values)
names(all.values) <- all.types
# Add columns and update original data frame.
newframe[, all.types] <- NA
newframe[sel,][, all.types] <- all.values
Note that I've only kept the first type given per component, effectively skipping the "political" type as it appears in multiple components and is likely superfluous e.g. "administrative_area_level_1, political".
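The alignment step in the code above relies on `match()` returning NA for a type that a given result lacks. A stripped-down sketch of just that step, using made-up component tables instead of real API responses, could look like this:

```r
# Made-up components for two results; the second has no street number
comp1 <- data.frame(long_name = c("45", "Clarke Street", "3006"),
                    type = c("street_number", "route", "postal_code"),
                    stringsAsFactors = FALSE)
comp2 <- data.frame(long_name = c("Cec Dunns Track", "3833"),
                    type = c("route", "postal_code"),
                    stringsAsFactors = FALSE)
components <- list(comp1, comp2)

# Every component type seen in any result, in order of first appearance
all.types <- unique(unlist(lapply(components, function(d) d$type)))

# match() gives NA where a type is absent, so indexing by it pads with NA
rows <- lapply(components, function(d) d$long_name[match(all.types, d$type)])

aligned <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(aligned) <- all.types
```

Because every row is indexed by the same `all.types` vector, the rows have equal length and can be bound safely even when a component is missing.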
You can also use ggmap::revgeocode() for this; see below:
library(ggmap)
df <- cbind(df,do.call(rbind,
lapply(1:nrow(df),
function(i)
revgeocode(as.numeric(
df[i,2:1]), output = "more")
[c("administrative_area_level_1","locality","postal_code","address")])))
#output:
df
# Front.lat Front.long administrative_area_level_1 locality
# 1 -37.82681 144.9592 Victoria Southbank
# 2 -37.82681 145.9592 Victoria Noojee
# postal_code address
# 1 3006 45 Clarke St, Southbank VIC 3006, Australia
# 2 3833 Cec Dunns Track, Noojee VIC 3833, Australia
You can add "route" and "street_number" to the variables that you want to extract but as you can see the second address does not have street number and that will cause an error.
Note: You may also use sub and extract the information from the address.
Data:
df <- structure(list(Front.lat = c(-37.82681, -37.82681), Front.long =
c(144.9592, 145.9592)), .Names = c("Front.lat", "Front.long"), class = "data.frame",
row.names = c(NA, -2L))

Update a field if the value of a pattern is true

This is my first question so please excuse the mistakes.
I have a dataframe where the address is in one line and has many missing values and several errors.
Address
Braemor Drive, Clontarf, Co.Dublin
Meadow Avenue, Dundrum
Philipsburgh Avenue, Marino
Myrtle Square, The Coast
I would like to add a new field "District", if the value of the address contains certain values for example if it contains Marino, Fairview or Clontarf the District should be Dublin 3.
Dublin3 <- c("Marino", "Fairview", "Clontarf")
matches <- unique (grep(paste(Dublin3,collapse="|"),
DubPPReg$Address, value=TRUE))
Using R, how can I update the value of District where the match is true?
# I've created example data frame with column Adress
df <- data.frame(Adress = c("Braemor Drive",
"Clontarf",
"Co.Dublin",
"Meadow Avenue",
"Dundrum",
"Philipsburgh Avenue",
"Marino",
"Myrtle Square", "The Coast"))
# And vector Dublin
Dublin3 <- c("Marino", "Fairview", "Clontarf")
# Match names in column Adress and vector Dublin 3
df$District <- ifelse(df$Adress %in% Dublin3, "Dublin 3",FALSE)
df
Adress District
1 Braemor Drive FALSE
2 Clontarf Dublin 3
3 Co.Dublin FALSE
4 Meadow Avenue FALSE
5 Dundrum FALSE
6 Philipsburgh Avenue FALSE
7 Marino Dublin 3
8 Myrtle Square FALSE
9 The Coast FALSE
Instead of FALSE you can choose something else (e.g. NA).
Edited: If your data are in a vector
df <- c("Braemor Drive, Churchtown, Co.Dublin",
"Meadow Avenue, Clontarf, Dublin 14",
"Sallymount Avenue, Ranelagh", "Philipsburgh Avenue, Marino")
Which looks like this
df
[1] "Braemor Drive, Churchtown, Co.Dublin"
[2] "Meadow Avenue, Clontarf, Dublin 14"
[3] "Sallymount Avenue, Ranelagh"
[4] "Philipsburgh Avenue, Marino"
You can find your matches using grepl like this
match <- ifelse(grepl("Marino|Fairview|Clontarf", df, ignore.case = TRUE), "Dublin 3",FALSE)
and output is
[1] "FALSE" "Dublin 3" "FALSE" "Dublin 3"
Which means that one or all of the matching names that you are looking for (i.e. Marino, Fairview or Clontarf) are in second and fourth row in df.
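Applied to a data frame column as in the question, the same idea might be sketched as follows. The `DubPPReg` data frame here is a four-row stand-in built from the addresses in the question, and NA is used instead of FALSE for non-matching rows so that the column stays a plain character vector.

```r
# Four-row stand-in for the data frame in the question
DubPPReg <- data.frame(
  Address = c("Braemor Drive, Clontarf, Co.Dublin",
              "Meadow Avenue, Dundrum",
              "Philipsburgh Avenue, Marino",
              "Myrtle Square, The Coast"),
  stringsAsFactors = FALSE
)
Dublin3 <- c("Marino", "Fairview", "Clontarf")

# TRUE where the address mentions any of the Dublin 3 place names
is_dublin3 <- grepl(paste(Dublin3, collapse = "|"), DubPPReg$Address,
                    ignore.case = TRUE)

# Start with missing values, then fill in only the matching rows
DubPPReg$District <- NA_character_
DubPPReg$District[is_dublin3] <- "Dublin 3"
```

Further districts could be added by repeating the last assignment with another vector of place names, which leaves already assigned rows untouched unless they match again.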
