How to split address column in R

I have an address column in a data frame like the one below:
Address
101 Marietta Street NorthWest Atlanta GA 30303
Now I want to split it into four separate columns:
Address                        City     State  Zip
101 Marietta Street NorthWest  Atlanta  GA     30303
It is guaranteed that the last value in the address column is the zip code, the second-to-last is the state, the third-to-last is the city, and everything before that is the address. So I am thinking I can split the address values on spaces and extract the values from the rear.
How can I do this?

We can use tidyr::extract to capture the last three words in separate columns and keep the remaining text as Address. The greedy (.+) takes as much as it can, so the three \w+ groups match the last three words:
tidyr::extract(df, Address, c("Address", "City", "State", "Zip"),
               regex = "(.+) (\\w+) (\\w+) (\\w+)")
#                         Address    City State   Zip
# 1 101 Marietta Street NorthWest Atlanta    GA 30303
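The approach the question describes (split on spaces and take values from the rear) can also be sketched in base R. This is only a minimal illustration: the helper name split_from_rear is hypothetical, and it assumes every address has at least four space-separated words and that city, state, and zip are each a single word (a city such as "New York" would break it).

```r
# Example data taken from the question
df <- data.frame(Address = "101 Marietta Street NorthWest Atlanta GA 30303",
                 stringsAsFactors = FALSE)

# Hypothetical helper: split one address on spaces and read fields from the end
split_from_rear <- function(x) {
  parts <- strsplit(x, " +")[[1]]
  n <- length(parts)
  stopifnot(n >= 4)  # assumption: street part + city + state + zip
  c(Address = paste(parts[seq_len(n - 3)], collapse = " "),
    City    = parts[n - 2],
    State   = parts[n - 1],
    Zip     = parts[n])
}

# Apply to every row and bind the named vectors into a data frame
result <- as.data.frame(do.call(rbind, lapply(df$Address, split_from_rear)),
                        stringsAsFactors = FALSE)
print(result)
```

Zip stays a character column here, which keeps any leading zeros intact.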

Related

How to geocode address given two streets and number, specific case Colombian addresses

How can I geocode an address with the HERE API when it is given as two streets and a number? Colombian addresses are given by the main street, the intersecting street, and the house number.
Example
Calle 27 Carrera 25 24 , Bogota, Colombia is lat:4.62331 lng:-74.07728
https://geocode.search.hereapi.com/v1/geocode?q=Calle+27+Calle+25-34+Bogota&apiKey=*****
This gives the wrong latitude and longitude: the right location is (4.62331,-74.07728), but it returns (4.62638,-74.08522).

Changing the value in one column based on a subset of values of another column in r

I have a dataset that contains the columns city and country. Some of the country values are incorrectly labelled as 'Other'. I know this because some of the city values contain labels like saddle lake (Canada). Is there a way to search for a substring in the city value and change the value in country accordingly? That is, search for any city value containing the word 'Canada' and change country to 'Canada'. I'd like to do this for multiple countries, including the USA and UK, which might mean my search needs an 'or' element to cover usa, US, USA, etc.
Current dataset:
City - Country
Saddle(Canada) - Other
Dublin - Other
Detroit - USA
Vancouver - Canada
NYC: US - Other
Output:
Saddle(Canada) - Canada
Dublin - Other
Detroit - USA
Vancouver - Canada
NYC: US - USA
I've played around with if statements using grep() but no success.
Edit: some code I have tried:
for (i in Data$city){
if (Data$city == '.*canada*.'){
Data$country = Canada
}}
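One possible way to do this, sketched under assumptions: the column names City and Country follow the table above, the pattern list is hypothetical and would need extending for real data, and only rows currently labelled 'Other' are overwritten. grepl() returns a logical vector for the whole column, so no explicit loop over rows or if statement is needed.

```r
# Example data reconstructed from the question
Data <- data.frame(
  City    = c("Saddle(Canada)", "Dublin", "Detroit", "Vancouver", "NYC: US"),
  Country = c("Other", "Other", "USA", "Canada", "Other"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: country label -> regex; "|" provides the 'or' element
patterns <- c(
  Canada = "canada",
  USA    = "\\b(usa|us)\\b",
  UK     = "\\b(uk|united kingdom)\\b"
)

for (country in names(patterns)) {
  # Only touch rows that are labelled 'Other' and whose city text matches
  hit <- Data$Country == "Other" &
    grepl(patterns[[country]], Data$City, ignore.case = TRUE, perl = TRUE)
  Data$Country[hit] <- country
}

print(Data)
```

The \\b word boundaries keep "us" from matching inside longer words such as "Brussels".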

How do I convert city names to time zones?

Sorry if this is repetitive, but I've looked everywhere and can't seem to find anything that addresses my specific problem in R. I have a column with city names:
cities <-data.frame(c("Sydney", "Dusseldorf", "LidCombe", "Portland"))
colnames(cities)[1]<-"CityName"
Ideally I'd like to attach a column with either the lat/long for each city or the time zone. I have tried using the "ggmap" package in R, but my request exceeds the maximum number of requests they allow per day. I found the "geonames" package that converts lat/long to timezones, so if I get the lat/long for the city I should be able to take it from there.
Edit to address potential duplicate question: I would like to do this without using the ggmap package, as I have too many rows and they have a maximum # of requests per day.

You can get the coordinates of many major cities from the world.cities data set in the maps package.
## Changing your data to a vector
cities <- c("Sydney", "Dusseldorf", "LidCombe", "Portland")
## Load up data
library(maps)
data(world.cities)
world.cities[match(cities, world.cities$name), ]
            name country.etc     pop    lat   long capital
36817     Sydney   Australia 4444513 -33.87 151.21       0
10026 Dusseldorf     Germany  573521  51.24   6.79       0
NA          <NA>        <NA>      NA     NA     NA      NA
29625   Portland   Australia    8757 -38.34 141.59       0
Note: LidCombe was not included.
Warning: For many names, there is more than one world city. For example,
world.cities[grep("Portland", world.cities$name), ]
          name country.etc    pop    lat    long capital
29625 Portland   Australia   8757 -38.34  141.59       0
29626 Portland         USA 542751  45.54 -122.66       0
29627 Portland         USA  62882  43.66  -70.28       0
Of course the two in the USA are Portland, Maine and Portland, Oregon.
match returns only the first matching row. You may need to use more information than just the name to get a good result.
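As a rough illustration of using more than the name, the sketch below joins on both city name and country and then keeps the most populous candidate. It uses a small hand-copied lookup table (rows taken from the output above) instead of the maps package so that it runs on its own; the wanted data frame and the "largest population wins" rule are assumptions, not part of the original answer.

```r
# Small stand-in for world.cities, copied from the output shown above
lookup <- data.frame(
  name        = c("Sydney", "Portland", "Portland", "Portland"),
  country.etc = c("Australia", "Australia", "USA", "USA"),
  pop         = c(4444513, 8757, 542751, 62882),
  lat         = c(-33.87, -38.34, 45.54, 43.66),
  long        = c(151.21, 141.59, -122.66, -70.28),
  stringsAsFactors = FALSE
)

# Hypothetical input: the cities we want, now with a country attached
wanted <- data.frame(
  name        = c("Sydney", "Portland"),
  country.etc = c("Australia", "USA"),
  stringsAsFactors = FALSE
)

# Join on both keys, then keep the largest city per name/country pair
candidates <- merge(wanted, lookup, by = c("name", "country.etc"))
candidates <- candidates[order(candidates$name, -candidates$pop), ]
best <- candidates[!duplicated(candidates[c("name", "country.etc")]), ]

print(best)
```

With the real data, lookup would be replaced by world.cities after library(maps) and data(world.cities).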

Using regex on a vector list in R

I have a vector with character strings in it. The vector has over 6000 rivers in it but we can use the following as an example:
Names <- c("Baker R", "Colorado R", "Missouri R")
I am then matching these river names to a list that contains their full names. As an example, the other list contains names such as:
station_nm <- c("North Creek River", "Baker River at Wentworth", "Lostine River at Baker Road", "Colorado River at North Street", "Missouri River")
In order to find the full names of the stations for the river names in "Names" I have:
station_nm <- grep(paste(Names, collapse = "|"), ALLsites$station_nm, ignore.case = TRUE, perl = TRUE, value = TRUE)
Continuing with the example, this returns: Baker River at Wentworth, Lostine River at Baker Road, Colorado River at North Street, Missouri River. It does not return North Creek River, as this is not listed in the "Names" vector. This is what I want.
However, I want to restrict the rivers that it returns to only Baker River at Wentworth, Colorado River at North Street, Missouri River. I don't want to include names for which there is something before it, i.e. Lostine River at Baker Road.
I believe this should involve some sort of negative look behind but I don't know how to write this with the vector "Names".
Thank you for any help!

You just have to prepend the values in Names with a ^ meaning "has to start with":
grep(paste0("^", Names, "iver", collapse = "|"), station_nm,
ignore.case = TRUE, value = TRUE)
# [1] "Baker River at Wentworth" "Colorado River at North Street" "Missouri River"
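A self-contained variant of the same idea, as a sketch: it assumes every entry in Names ends in " R" as an abbreviation of " River", expands that explicitly, and anchors the whole alternation once with ^. The trailing \\b is an added assumption so that a name like "Baker Riverside" would not match.

```r
# Example vectors from the question
Names <- c("Baker R", "Colorado R", "Missouri R")
station_nm <- c("North Creek River", "Baker River at Wentworth",
                "Lostine River at Baker Road",
                "Colorado River at North Street", "Missouri River")

# Expand the trailing " R" abbreviation to " River"
full_names <- sub(" R$", " River", Names)

# One anchored alternation: the string must START with one of the names
pattern <- paste0("^(?:", paste(full_names, collapse = "|"), ")\\b")

matches <- grep(pattern, station_nm, ignore.case = TRUE, perl = TRUE,
                value = TRUE)
print(matches)
```

Because the match is anchored at the start of the string, no lookbehind is needed to exclude "Lostine River at Baker Road".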

Can't match two dataframe values

I am not sure why the values in two data frames do not match each other.
I have a data frame named fileUpload which looks like this (the cols are aligned correctly):
Destination City  Year  Adults
Amsterdam         2015  2
Amsterdam         2016  2
Amsterdam         2015  2
Amsterdam         2016  2
Amsterdam         2015  3
There is a space after each city name.
I have another dataframe that is not uploaded, like this:
cities <- read.csv(text = "
City,Lat,Long,Pop
Amsterdam ,4.8952,52.3702,779808
Bali ,115.1889,-8.4095,4225000")
I need to merge the two data frames, but the city values do not match (the merge returns NA). Checking with fileUpload %in% cities returns FALSE.
I tried removing the space after the city name, but that did not work either.
The typeof(df$city) for both is integer.
How can I make the city names match?

As pointed out in the comments, you should convert the city columns from factors to character strings before merging (typeof() reports integer because factors are stored as integer codes):
mergedCities <- merge(fileUpload, cities, by.x = "Destination City", by.y = "City", all = TRUE)
Use the all parameter (or all.x / all.y) to specify whether to keep all cities, only those from x or from y, or only the cities present in both.
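A minimal sketch of the conversion step, under assumptions: the column is named Destination.City (the name read.csv would produce by default for a header with a space), the sample rows are reconstructed from the question, and trailing spaces are the only other difference between the keys. It converts both key columns to character, trims whitespace with trimws(), and then merges.

```r
# Reconstructed sample data; the city column is a factor, as in the question
fileUpload <- data.frame(
  Destination.City = factor(c("Amsterdam ", "Amsterdam ", "Amsterdam")),
  Year   = c(2015, 2016, 2015),
  Adults = c(2, 2, 3)
)

cities <- read.csv(text = "City,Lat,Long,Pop
Amsterdam ,4.8952,52.3702,779808
Bali ,115.1889,-8.4095,4225000", stringsAsFactors = TRUE)

# Convert factors to character and strip leading/trailing whitespace
fileUpload$Destination.City <- trimws(as.character(fileUpload$Destination.City))
cities$City <- trimws(as.character(cities$City))

# all.x = TRUE keeps every row of fileUpload, matched or not
mergedCities <- merge(fileUpload, cities,
                      by.x = "Destination.City", by.y = "City",
                      all.x = TRUE)
print(mergedCities)
```

all.x = TRUE is used here so that unmatched uploads would show up with NA rather than silently disappear; all = TRUE would additionally keep Bali.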
