Splitting a full address column into multiple columns - r

I have a dataframe with the following column structure (1000+ rows in total):
addressfull
POINT(3.124537653 32.179354012)||DEF_32||molengraaf 20, 1689 GL Utrecht, Netherlands||15||map
POINT(3.124537680 32.179354014)||DEF_32||winkellaan 67, 5788 BG Amsterdam, Netherlands||13||map
POINT(3.124537653 32.179354012)||DEF_32||vermeerstraat 18, 0932 DC Rotterdam, Netherlands||11||map
POINT(2.915206183 24.315583523)||DEF_32||--||13||map
POINT (2.900824999999923 34.3175721)||DEF_84||Zandhorstlaan 122, 0823 GT Ochtrup, Germany||17||map
structure(list(addressfull = structure(c(3L, 5L, 4L, 2L, 1L), .Label = c("POINT (2.900824999999923 34.3175721)||DEF_84||Zandhorstlaan 122, 0823 GT Ochtrup, Germany||17||map",
"POINT(2.915206183 24.315583523)||DEF_32||--||13||map", "POINT(3.124537653 32.179354012)||DEF_32||molengraaf 20, 1689 GL Utrecht, Netherlands||15||map",
"POINT(3.124537653 32.179354012)||DEF_32||vermeerstraat 18, 0932 DC Rotterdam, Netherlands||11||map",
"POINT(3.124537680 32.179354014)||DEF_32||winkellaan 67, 5788 BG Amsterdam, Netherlands||13||map"
), class = "factor")), class = "data.frame", row.names = c(NA,
-5L))
The column contains a location, street, house number, zip code, city, and country. I want to split the column addressfull into multiple columns with R, for example:
street         house number  zip      city       country
molengraaf     20            1689 GL  Utrecht    Netherlands
winkellaan     67            5788 BG  Amsterdam  Netherlands
vermeerstraat  18            0932 DC  Rotterdam  Netherlands
na             na            na       na         na
Zandhorstlaan  122           0823 GT  Ochtrup    Germany
I have read the tidyr and stringr documentation. I can see a pattern for splitting (by ")", by "|" from position x, and by ","), but I can't figure out the correct code to split the column into multiple columns.
Can someone help me?

You could brute force it using sub for a base R approach:
df$street <- sub("^(\\S+)\\s+.*$", "\\1", df$addressfull)
df$`house number` <- sub("^\\S+\\s+(\\d+).*$", "\\1", df$addressfull)
df$zip <- sub("^\\S+\\s+\\d+,\\s*(\\d+\\s+[A-Z]+).*$", "\\1", df$addressfull)
df$city <- sub("^.*?(\\S+),\\s*\\S+$", "\\1", df$addressfull)
df$country <- sub("^.*,\\s*(\\S+)$", "\\1", df$addressfull)
df
                                  addressfull     street house number     zip
1 molengraaf 20, 1689 GL Utrecht, Netherlands molengraaf           20 1689 GL
     city     country
1 Utrecht Netherlands
Data:
df <- data.frame(addressfull=c("molengraaf 20, 1689 GL Utrecht, Netherlands"),
                 stringsAsFactors=FALSE)
This assumes that we have already isolated just the address text. To do that, consider:
text <- "POINT(3.124537653 32.179354012)||DEF_32||molengraaf 20, 1689 GL Utrecht, Netherlands||15||map"
addresfull <- unlist(strsplit(text, "\\|\\|"))[3]
addresfull
[1] "molengraaf 20, 1689 GL Utrecht, Netherlands"
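To apply the same idea to a whole column rather than a single string, one option is to isolate the address field with strsplit() and then capture all five parts in one pass with utils::strcapture(). This is only a minimal sketch: it assumes every real address follows the "street number, zip city, country" layout shown in the question, and the data frame name `addresses` and the pattern are illustrative choices, not part of the original answer. Rows that do not match the pattern (such as "--") come back as NA.

```r
# Illustrative data in the same shape as the question (names are assumptions)
addresses <- data.frame(
  addressfull = c(
    "POINT(3.124537653 32.179354012)||DEF_32||molengraaf 20, 1689 GL Utrecht, Netherlands||15||map",
    "POINT(2.915206183 24.315583523)||DEF_32||--||13||map",
    "POINT (2.900824999999923 34.3175721)||DEF_84||Zandhorstlaan 122, 0823 GT Ochtrup, Germany||17||map"
  ),
  stringsAsFactors = FALSE
)

# The third "||"-separated field holds the address text
fields  <- strsplit(as.character(addresses$addressfull), "||", fixed = TRUE)
address <- vapply(fields, `[`, character(1), 3)

# One capture group per target column: street, number, zip, city, country
pattern <- "^(\\S+)\\s+(\\d+),\\s*(\\d{4}\\s+[A-Z]{2})\\s+(.+),\\s*(.+)$"
proto <- data.frame(
  street = character(), house_number = character(), zip = character(),
  city = character(), country = character(),
  stringsAsFactors = FALSE
)

# Non-matching rows are filled with NA in every column
parts <- strcapture(pattern, address, proto, perl = TRUE)
parts
```

Street names containing spaces would not match this pattern, so the first capture group would need loosening for such data.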

A stringr solution is this:
library(stringr)
addresssplit <- data.frame(
street = str_extract(addressfull$addressfull, "(?<=DEF_\\d{2}\\|\\|)\\w+\\b"),
number = str_extract(addressfull$addressfull, "\\d{1,}(?=,)"),
zip = str_extract(addressfull$addressfull, "(?<=\\s)\\d{4}\\s[A-Z]{2}"),
city = str_extract(addressfull$addressfull, "(?<=\\d{4}\\s[A-Z]{2}\\s)\\w+"),
country = str_extract(addressfull$addressfull, "(?<=[a-z]\\b,\\s)\\w+\\b")
)
RESULT:
addresssplit
         street number     zip      city     country
1    molengraaf     20 1689 GL   Utrecht Netherlands
2    winkellaan     67 5788 BG Amsterdam Netherlands
3 vermeerstraat     18 0932 DC Rotterdam Netherlands
4          <NA>   <NA>    <NA>      <NA>        <NA>
5 Zandhorstlaan    122 0823 GT   Ochtrup     Germany
DATA:
addressfull <- structure(list(addressfull = structure(c(3L, 5L, 4L, 2L, 1L), .Label = c("POINT (2.900824999999923 34.3175721)||DEF_84||Zandhorstlaan 122, 0823 GT Ochtrup, Germany||17||map",
"POINT(2.915206183 24.315583523)||DEF_32||--||13||map", "POINT(3.124537653 32.179354012)||DEF_32||molengraaf 20, 1689 GL Utrecht, Netherlands||15||map",
"POINT(3.124537653 32.179354012)||DEF_32||vermeerstraat 18, 0932 DC Rotterdam, Netherlands||11||map",
"POINT(3.124537680 32.179354014)||DEF_32||winkellaan 67, 5788 BG Amsterdam, Netherlands||13||map"
), class = "factor")), class = "data.frame", row.names = c(NA,
-5L))

This would be a tidyverse approach to the problem:
library(tidyverse)
df <- structure(list(addressfull = structure(c(3L, 5L, 4L, 2L, 1L), .Label = c("POINT (2.900824999999923 34.3175721)||DEF_84||Zandhorstlaan 122, 0823 GT Ochtrup, Germany||17||map",
"POINT(2.915206183 24.315583523)||DEF_32||--||13||map", "POINT(3.124537653 32.179354012)||DEF_32||molengraaf 20, 1689 GL Utrecht, Netherlands||15||map",
"POINT(3.124537653 32.179354012)||DEF_32||vermeerstraat 18, 0932 DC Rotterdam, Netherlands||11||map",
"POINT(3.124537680 32.179354014)||DEF_32||winkellaan 67, 5788 BG Amsterdam, Netherlands||13||map"
), class = "factor")), class = "data.frame", row.names = c(NA,
-5L))
df %>% separate(addressfull, sep = "\\|\\|", into = c("Coords", "DEF", "ADDRESS"),extra = "drop") %>%
select(ADDRESS) %>%
separate(ADDRESS, sep = ",", into = c("street", "city", "country")) %>%
separate(street, sep = "(?= \\d)", into = c("street", "house_number")) %>%
separate(city, sep = "(?<=[A-Z][A-Z])", into = c("zip", "city"))
#> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [4].
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [4].
#>          street house_number      zip       city      country
#> 1    molengraaf           20  1689 GL    Utrecht  Netherlands
#> 2    winkellaan           67  5788 BG  Amsterdam  Netherlands
#> 3 vermeerstraat           18  0932 DC  Rotterdam  Netherlands
#> 4            --         <NA>     <NA>       <NA>         <NA>
#> 5 Zandhorstlaan          122  0823 GT    Ochtrup      Germany
Created on 2020-02-27 by the reprex package (v0.3.0)
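In the output above, row 4 still carries the placeholder "--" in street, whereas the question asked for missing values there. A small follow-up step could convert it. The sketch below assumes dplyr is available and uses a hypothetical data frame `res` standing in for the result of the pipeline.

```r
library(dplyr)

# Hypothetical stand-in for the pipeline result (two rows only)
res <- data.frame(
  street = c("molengraaf", "--"),
  house_number = c("20", NA),
  stringsAsFactors = FALSE
)

# na_if() replaces values equal to "--" with NA and leaves the rest untouched
res <- mutate(res, street = na_if(street, "--"))
res
```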

Related

Transform data to long with grouped columns

For this week's TidyTuesday challenge, I am unable to reshape the paired columns into long format with pivot_longer from tidyr, something I have done before. Here is my code; I do not understand why it throws an error instead of giving the result I want.
library(tidyverse)
tuesdata <- tidytuesdayR::tt_load(2023, week = 7)
age_gaps <- tuesdata$age_gaps
df_long <- age_gaps %>%
pivot_longer(cols= actor_1_name:actor_2_name, names_to = "actornumber", values_to = "actorname") %>%
pivot_longer(cols= character_1_gender:character_2_gender, names_to = "gendernumber", values_to = "gender") %>%
pivot_longer(cols= actor_1_age:actor_2_age, names_to = "agenumber", values_to = "age") %>%
select(movie_name, release_year, director, age_difference, actorname, gender, age)
The initial data has 1155 rows, so after reshaping I expect 1155 x 2 = 2310 rows, with one row per actor holding the name and the related information such as age and birthdate. The code does not give that result, and I would like to know why and how to fix it. Thank you in advance.
Example data (first 6 rows)
age_gaps <- structure(list(movie_name = c("Harold and Maude", "Venus", "The Quiet American",
"The Big Lebowski", "Beginners", "Poison Ivy"), release_year = c(1971,
2006, 2002, 1998, 2010, 1992), director = c("Hal Ashby", "Roger Michell",
"Phillip Noyce", "Joel Coen", "Mike Mills", "Katt Shea"), age_difference = c(52,
50, 49, 45, 43, 42), couple_number = c(1, 1, 1, 1, 1, 1), actor_1_name = c("Ruth Gordon",
"Peter O'Toole", "Michael Caine", "David Huddleston", "Christopher Plummer",
"Tom Skerritt"), actor_2_name = c("Bud Cort", "Jodie Whittaker",
"Do Thi Hai Yen", "Tara Reid", "Goran Visnjic", "Drew Barrymore"
), character_1_gender = c("woman", "man", "man", "man", "man",
"man"), character_2_gender = c("man", "woman", "woman", "woman",
"man", "woman"), actor_1_birthdate = structure(c(-26725, -13666,
-13442, -14351, -14629, -13278), class = "Date"), actor_2_birthdate = structure(c(-7948,
4536, 4656, 2137, 982, 1878), class = "Date"), actor_1_age = c(75,
74, 69, 68, 81, 59), actor_2_age = c(23, 24, 20, 23, 38, 17)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
You could set ".value" in names_to and supply one of names_sep or names_pattern to specify how the column names should be split.
library(tidyr)
age_gaps %>%
pivot_longer(actor_1_name:actor_2_age,
names_prefix = "(actor|character)_",
names_to = c("actor", ".value"),
names_sep = '_')
# A tibble: 12 × 10
   movie_name         release_year director      age_difference couple_number actor name                gender birthdate    age
   <chr>                     <dbl> <chr>                  <dbl>         <dbl> <chr> <chr>               <chr>  <date>     <dbl>
 1 Harold and Maude           1971 Hal Ashby                 52             1 1     Ruth Gordon         woman  1896-10-30    75
 2 Harold and Maude           1971 Hal Ashby                 52             1 2     Bud Cort            man    1948-03-29    23
 3 Venus                      2006 Roger Michell             50             1 1     Peter O'Toole       man    1932-08-02    74
 4 Venus                      2006 Roger Michell             50             1 2     Jodie Whittaker     woman  1982-06-03    24
 5 The Quiet American         2002 Phillip Noyce             49             1 1     Michael Caine       man    1933-03-14    69
 6 The Quiet American         2002 Phillip Noyce             49             1 2     Do Thi Hai Yen      woman  1982-10-01    20
 7 The Big Lebowski           1998 Joel Coen                 45             1 1     David Huddleston    man    1930-09-17    68
 8 The Big Lebowski           1998 Joel Coen                 45             1 2     Tara Reid           woman  1975-11-08    23
 9 Beginners                  2010 Mike Mills                43             1 1     Christopher Plummer man    1929-12-13    81
10 Beginners                  2010 Mike Mills                43             1 2     Goran Visnjic       man    1972-09-09    38
11 Poison Ivy                 1992 Katt Shea                 42             1 1     Tom Skerritt        man    1933-08-25    59
12 Poison Ivy                 1992 Katt Shea                 42             1 2     Drew Barrymore      woman  1975-02-22    17
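One likely reason the chained pivot_longer calls in the question do not give two rows per movie is that each call doubles the rows independently, so the name, gender and age values are crossed rather than kept paired. The toy example below is a sketch of that effect: the data frame `toy` and its values are made up for illustration, and it uses names_pattern (instead of names_prefix plus names_sep) as one alternative way to split the column names.

```r
library(tidyr)

# Made-up data with two paired column groups
toy <- data.frame(
  movie = c("A", "B"),
  actor_1_name = c("x1", "y1"),
  actor_2_name = c("x2", "y2"),
  actor_1_age = c(50, 40),
  actor_2_age = c(30, 20),
  stringsAsFactors = FALSE
)

# Chaining two pivots crosses names with ages: 2 rows * 2 * 2 = 8 rows
step1 <- pivot_longer(toy, c(actor_1_name, actor_2_name),
                      names_to = "name_col", values_to = "name")
chained <- pivot_longer(step1, c(actor_1_age, actor_2_age),
                        names_to = "age_col", values_to = "age")
nrow(chained)

# ".value" keeps each actor's name and age together: 2 rows * 2 = 4 rows
paired <- pivot_longer(toy, -movie,
                       names_to = c("actor", ".value"),
                       names_pattern = "actor_(\\d)_(.*)")
nrow(paired)
```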

R: Is there a way to select a column according to the current year?

Say you have a database like gapminder with the population per country. Even though the current year is 2021, you also have predictions for the following years to come.
location 2020.0 2021.0 2022.0
Canada 5 7 9
China 23 34 54
Congo 1 2 3
and another database like this, vaccins
location date amount_of_vaccins
Canada 2020-01-02 50
China 2021-05-03 59
Congo 2022-03-05 34
How can I merge the population of each country into the second database, using the year of each date in the second database to pick the population?
I managed to merge them by country like this:
merge(gapminder,vaccins, by = "location")
but I'm getting this
location date amount_of_vaccins 2020.0 2021.0 2022.0
Canada 2020-01-02 50 5 7 9
China 2021-05-03 59 23 34 54
Congo 2022-03-05 34 1 2 3
I'd like to have only a new variable giving the population of the country according to the year. Thank you.
You could do something like this with tidyverse.
library(tidyverse)
df1 <- df1 %>%
pivot_longer(!location, names_to = "date", values_to = "population") %>%
dplyr::mutate(year = str_sub(date, 1, 4))
df2 %>%
dplyr::mutate(year = str_sub(date, end = 4)) %>%
dplyr::left_join(., df1, by = c("location", "year")) %>%
dplyr::select(-c(date.y, year)) %>%
dplyr::rename(date = date.x)
Output
location date amount_of_vaccins population
1 Canada 2020-01-02 50 5
2 China 2021-05-03 59 34
3 Congo 2022-03-05 54 3
Data
df1 <-
structure(
list(
location = c("Canada", "China", "Congo"),
`2020.0` = c(5, 23, 1),
`2021.0` = c(7, 34, 2),
`2022.0` = c(9, 54, 3)
),
class = "data.frame",
row.names = c(NA,-3L)
)
df2 <-
structure(
list(
location = c("Canada", "China", "Congo"),
date = c("2020-01-02",
"2021-05-03", "2022-03-05"),
amount_of_vaccins = c(50, 59, 54)
),
class = "data.frame",
row.names = c(NA,-3L)
)
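For comparison, the same lookup can be sketched in base R without reshaping, by turning the wide table into a matrix and indexing it by (location, year column) pairs. The object names `population` and `vaccins` are illustrative, the values are the ones from the question, and the sketch assumes the year columns are always named like "2020.0".

```r
# Wide population table; check.names = FALSE keeps "2020.0" style names
population <- data.frame(
  location = c("Canada", "China", "Congo"),
  `2020.0` = c(5, 23, 1),
  `2021.0` = c(7, 34, 2),
  `2022.0` = c(9, 54, 3),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
vaccins <- data.frame(
  location = c("Canada", "China", "Congo"),
  date = c("2020-01-02", "2021-05-03", "2022-03-05"),
  amount_of_vaccins = c(50, 59, 34),
  stringsAsFactors = FALSE
)

# Numeric matrix with locations as row names and years as column names
pop <- as.matrix(population[-1])
rownames(pop) <- population$location

# Build the column name that matches each date's year, e.g. "2021.0"
year_col <- paste0(format(as.Date(vaccins$date), "%Y"), ".0")

# A two-column character matrix indexes the matrix by (row name, column name)
vaccins$population <- pop[cbind(vaccins$location, year_col)]
vaccins
```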

Rearranging order of data frame rows based on a vector

I have a data frame with a node.id column and a vector x = c(1, 156, 153, 3, 185). The vector corresponds to the node.id column, and I would like to rearrange the rows of the data frame to match the order of the vector, so the rows should appear in the order node.id = 1, then 156, 153, 3, 185. Hopefully I explained this well enough.
We can use match, with x as the first argument so that the rows come out in the order of x:
df1[match(x, df1$node.id),]
# x lon lat node.id name
#1 1 -122.41 37.70 1 San Francisco
#4 22 -117.16 32.71 156 San Diego
#3 21 -118.24 34.05 153 Los Angeles
#2 3 -115.14 36.16 3 Las Vegas
#5 26 -112.08 38.77 185 Richfield
data
df1 <- structure(list(x = c(1, 3, 21, 22, 26), lon = c(-122.41, -115.14,
-118.24, -117.16, -112.08), lat = c(37.7, 36.16, 34.05, 32.71,
38.77), node.id = c(1, 3, 153, 156, 185), name = structure(c(5L,
1L, 2L, 4L, 3L), .Label = c("Las Vegas", "Los Angeles", "Richfield",
"San Diego", "San Francisco"), class = "factor")),
class = "data.frame", row.names = c(NA,
-5L))
x <- c(1, 156, 153, 3, 185)
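The argument order in match() matters here: match(a, b) returns, for each element of a, its position in b, so the target order has to come first. With the data above both orders happen to give the same result, because the permutation only swaps two rows. The small made-up example below (the names `nodes` and `target` are illustrative) shows a case where they differ.

```r
# Made-up data where the target order is a 3-cycle, not a simple swap
nodes <- data.frame(
  node.id = c(10, 20, 30),
  name = c("a", "b", "c"),
  stringsAsFactors = FALSE
)
target <- c(20, 30, 10)

# Position of each target id within node.id: rows come out in target order
reordered <- nodes[match(target, nodes$node.id), ]
reordered$node.id

# Swapping the arguments gives the inverse permutation instead
swapped <- nodes[match(nodes$node.id, target), ]
swapped$node.id
```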

How to append 2 data sets one below the other having slightly different column names?

Data set1:
ID Name Territory Sales
1 Richard NY 59
8 Sam California 44
Data set2:
Terr ID Name Comments
LA 5 Rick yes
MH 11 Oly no
I want the final data set to have only the columns of the first data set, treating Terr as the same column as Territory and dropping the Comments column.
Final data should look like:
ID Name Territory Sales
1 Richard NY 59
8 Sam California 44
5 Rick LA NA
11 Oly MH NA
Thanks in advance
A possible solution:
# create a named vector with names from 'set2'
# with the positions of the matching columns in 'set1'
nms2 <- sort(unlist(sapply(names(set2), agrep, x = names(set1))))
# only keep the columns in 'set2' for which a match is found
# and give them the same names as in 'set1'
set2 <- setNames(set2[names(nms2)], names(set1[nms2]))
# bind the two dataset together
# option 1:
library(dplyr)
bind_rows(set1, set2)
# option 2:
library(data.table)
rbindlist(list(set1, set2), fill = TRUE)
which gives (dplyr-output shown):
ID Name Territory Sales
1 1 Richard NY 59
2 8 Sam California 44
3 5 Rick LA NA
4 11 Oly MH NA
Used data:
set1 <- structure(list(ID = c(1L, 8L),
Name = c("Richard", "Sam"),
Territory = c("NY", "California"),
Sales = c(59L, 44L)),
.Names = c("ID", "Name", "Territory", "Sales"), class = "data.frame", row.names = c(NA, -2L))
set2 <- structure(list(Terr = c("LA", "MH"),
ID = c(5L, 11L),
Name = c("Rick", "Oly"),
Comments = c("yes", "no")),
.Names = c("Terr", "ID", "Name", "Comments"), class = "data.frame", row.names = c(NA, -2L))
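When the column mapping is known in advance, an explicit rename may be easier to audit than fuzzy matching with agrep. The following base R sketch assumes that Terr is the only column that needs renaming, and fills any column of set1 that is absent from set2 with NA before binding.

```r
set1 <- data.frame(
  ID = c(1L, 8L), Name = c("Richard", "Sam"),
  Territory = c("NY", "California"), Sales = c(59L, 44L),
  stringsAsFactors = FALSE
)
set2 <- data.frame(
  Terr = c("LA", "MH"), ID = c(5L, 11L),
  Name = c("Rick", "Oly"), Comments = c("yes", "no"),
  stringsAsFactors = FALSE
)

# Explicit mapping: Terr in set2 corresponds to Territory in set1
names(set2)[names(set2) == "Terr"] <- "Territory"

# Add columns that exist only in set1 (here: Sales) as NA
missing_cols <- setdiff(names(set1), names(set2))
set2[missing_cols] <- NA

# Keep only set1's columns, in set1's order, then stack the rows
combined <- rbind(set1, set2[names(set1)])
combined
```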

Populating new variable from ddply within old data frame in R

I have a data.frame which looks like this (in reality 1M rows):
> df
                 R.DMA.NAMES quarter     daypart allpersons.imp rate                    station  spot.id
1 Wilkes.Barre.Scranton.Hztn  Q22014   afternoon            0.0   30                       WSWB 13048713
2                  Nashville  Q12014   primetime            0.0   50              COM NASHVILLE 11969260
3             Seattle.Tacoma  Q12014   primetime            6.1   51 ESPN SEATTLE, EVERETT ZONE 11898905
4               Jacksonville  Q42013 late fringe            2.3  130          Jacksonville WAWS 11617447
5                    Detroit  Q22014   overnight            0.0    0                       WKBD 12571421
6         South.Bend.Elkhart  Q42013   primetime           11.5  325                       WBND 11741171
dput(df)
structure(list(R.DMA.NAMES = c("Wilkes.Barre.Scranton.Hztn",
"Nashville", "Seattle.Tacoma", "Jacksonville", "Detroit", "South.Bend.Elkhart"
), quarter = structure(c(3L, 1L, 1L, 6L, 3L, 6L), .Label = c("Q12014",
"Q22013", "Q22014", "Q32013", "Q32014", "Q42013"), class = "factor"),
daypart = c("afternoon", "primetime", "primetime", "late fringe",
"overnight", "primetime"), allpersons.imp = c(0, 0, 6.1,
2.3, 0, 11.5), rate = c(30, 50, 51, 130, 0, 325), station = c("WSWB",
"COM NASHVILLE", "ESPN SEATTLE, EVERETT ZONE", "Jacksonville WAWS",
"WKBD", "WBND"), spot.id = c(13048713L, 11969260L, 11898905L,
11617447L, 12571421L, 11741171L)), .Names = c("R.DMA.NAMES",
"quarter", "daypart", "allpersons.imp", "rate", "station", "spot.id"
), row.names = c(NA, -6L), class = "data.frame")
I am using a ddply function to perform a calculation:
ddply(df, .(R.DMA.NAMES, station, quarter), function (x) {
cpi = sum(x$rate) / sum(x$allpersons.imp)
})
This creates a new data.frame which looks like this:
R.DMA.NAMES station quarter V1
1 Detroit WKBD Q22014 NaN
2 Jacksonville Jacksonville WAWS Q42013 56.521739
3 Nashville COM NASHVILLE Q12014 Inf
4 Seattle.Tacoma ESPN SEATTLE, EVERETT ZONE Q12014 8.360656
5 South.Bend.Elkhart WBND Q42013 28.260870
6 Wilkes.Barre.Scranton.Hztn WSWB Q22014 Inf
What I'd like to do is create a new column called "cpi" in my original df i.e. the applicable "cpi" value should appear against the particular row. Of course, the same value will repeat many times i.e. 8.36 will appear for every row which contains "Seattle.Tacoma" for R.DMA.NAMES, "ESPN SEATTLE, EVERETT ZONE" for station and Q12014 for quarter. I tried several things including:
transform(df, cpi = ddply(df, .(R.DMA.NAMES, station, quarter), function (x) {
cpi = sum(df$rate) / sum(df$allpersons.imp)
})
But this didn't work. Can someone explain why?
Use transform within ddply:
ddply(df, .(R.DMA.NAMES, station, quarter),
transform, cpi = sum(rate) / sum(allpersons.imp))
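If plyr is not available, a similar grouped column can be sketched in base R with ave(), which returns a group-level summary repeated for every row of the group. The data frame `spots` and its single grouping column are made up for illustration; with the real data the three grouping columns would all be passed to ave().

```r
# Made-up data: two rows in market A, one row in market B
spots <- data.frame(
  market = c("A", "A", "B"),
  rate = c(10, 30, 50),
  imp = c(2, 2, 10),
  stringsAsFactors = FALSE
)

# Group sums repeated per row, then divided elementwise
spots$cpi <- ave(spots$rate, spots$market, FUN = sum) /
  ave(spots$imp, spots$market, FUN = sum)
spots
```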
