I am not sure about the terminology in the question title, but I have a data frame containing unstructured addresses of institutions, and I would like to extract their countries using a lookup table that allows multiple possible matches per country.
The addresses might look like:
address
xxx US
xxx USA yy
xxx United States yy
xxx UK
xxx United Kingdom yy
Note that the country does not necessarily come at the end of the string.
The pipeline would match and extract whatever might be a country name (from a list of around 20 countries) and return a clean country name for each address.
(df <- tribble(
~address, ~clean_country,
"xxx US", "United States",
"xxx, USA yy", "United States",
"xxx United States, yy", "United States",
"xxx UK", "United Kingdom",
"xxx United Kingdom yy", "United Kingdom",
"xxx zz yy", NA_character_,
))
I am thinking of creating a lookup table as a data.frame with two columns:
(lookup <- tribble(
~country, ~matches,
"United States", "US|USA|United States",
"United Kingdom", "UK|United Kingdom"
))
and then checking with regex whether any of the vertical-bar-separated matches can be found in the
df$address column, then appending the column country as clean_country in df.
Of course, I am also interested in solutions following other strategies. The more (memory) efficient the better, because the data set is relatively big.
Using the lookup table approach, you can extract the country name from address with str_extract and replace it with the country name from the lookup table.
library(stringr)
str_replace_all(str_extract(df$address, str_c(lookup$matches, collapse = '|')),
setNames(lookup$country, lookup$matches))
#[1] "United States" "United States" "United States"
#[4] "United Kingdom" "United Kingdom" NA
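One caveat with the pattern above: a bare "US" or "UK" can also match inside a longer word. Below is a minimal base-R sketch of the same lookup idea with word boundaries added. It assumes the lookup table from the question; the "xxx AUSTIN yy" address is made up purely to illustrate the false-positive case.

```r
# Lookup table from the question, built with base R only
lookup <- data.frame(
  country = c("United States", "United Kingdom"),
  matches = c("US|USA|United States", "UK|United Kingdom"),
  stringsAsFactors = FALSE
)

# Example addresses; the last one is a hypothetical false-positive candidate
address <- c("xxx US", "xxx, USA yy", "xxx United States, yy",
             "xxx UK", "xxx United Kingdom yy", "xxx zz yy", "xxx AUSTIN yy")

clean_country <- rep(NA_character_, length(address))
for (i in seq_len(nrow(lookup))) {
  # Wrap the alternatives in \b...\b so "US" inside "AUSTIN" is not a hit
  pattern <- paste0("\\b(", lookup$matches[i], ")\\b")
  hit <- is.na(clean_country) & grepl(pattern, address, perl = TRUE)
  clean_country[hit] <- lookup$country[i]
}
clean_country
```

The loop runs once per country (about 20 iterations for the data described), and each iteration is a single vectorised grepl() call over the whole column, so no intermediate copy of the addresses is built.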
I have a string that contains country and other region names. I am only interested in the country names and would ideally like to add several columns, each of which contains a country name listed in the string. Here is example code showing how the data frame is set up:
df <- data.frame(id = c(1,2,3),
country = c("Cote d'Ivoire Africa Developing Economies West Africa",
"South Africa United Kingdom Africa BRICS Countries",
"Myanmar Gambia Bangladesh Netherlands Africa Asia"))
If I only split the string by space, those countries which contain a space get lost (e.g. "United Kingdom"). See here:
library(tidyr)
df2 <- separate(df, country, paste0("C",3:8), sep=" ")
Therefore, I tried to look up country names using the world.cities dataset. However, this only seems to loop through the string until there is a non-country name. See here:
library(maps)
library(stringr)
all_countries <- str_c(unique(world.cities$country.etc), collapse = "|")
df$c1 <- sapply(str_extract_all(df$country, all_countries), toString)
I am wondering whether it's possible to use the space as a delimiter but define exceptions (like "United Kingdom"). This might obviously require some manual work, but appears to be the most feasible solution to me. Does anyone know how to define such exceptions? I am of course also open to and thankful for any other solutions.
UPDATE:
I figured out another solution using the countrycode package:
library(countrycode)
countries <- data.frame(countryname_dict)
countries$continent <- countrycode(sourcevar = countries[["country.name.en"]],
origin = "country.name.en",
destination = "continent")
africa <- countries[ which(countries$continent=='Africa'), ]
library(stringr)
pat <- paste0("\\b", paste(africa$country.name.en , collapse="\\b|\\b"), "\\b")
df$country_list <- str_extract_all(df$country, regex(pat, ignore_case = TRUE))
You could do:
library(stringi)
vec <- stri_trans_general(countrycode::codelist$country.name.en, id = "Latin-ASCII")
stri_extract_all(df$country,regex = sprintf(r"(\b(%s)\b)",stri_c(vec,collapse = "|")))
[[1]]
[1] "Cote d'Ivoire"
[[2]]
[1] "South Africa" "United Kingdom"
[[3]]
[1] "Gambia" "Bangladesh" "Netherlands"
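If the goal is one column per country, as the question asks, the extracted list can be padded and bound into a matrix. The following is a base-R sketch only: the short `countries` vector is a hand-written stand-in for a real country list (an assumption, not taken from countrycode), and the column names `c1`, `c2`, ... are made up.

```r
# Hand-written stand-in for a full country list (assumption for illustration)
countries <- c("Cote d'Ivoire", "South Africa", "United Kingdom",
               "Myanmar", "Gambia", "Bangladesh", "Netherlands")

# Put longer names first so that a name which is a prefix of another
# (e.g. "Guinea" vs "Guinea-Bissau") does not win the alternation
countries <- countries[order(nchar(countries), decreasing = TRUE)]
pat <- paste0("\\b(", paste(countries, collapse = "|"), ")\\b")

x <- c("Cote d'Ivoire Africa Developing Economies West Africa",
       "South Africa United Kingdom Africa BRICS Countries",
       "Myanmar Gambia Bangladesh Netherlands Africa Asia")

# All matches per string, in order of appearance
found <- regmatches(x, gregexpr(pat, x, perl = TRUE))

# Pad every result with NA to a common length and bind into columns
n <- max(lengths(found))
cols <- do.call(rbind, lapply(found, function(v) {
  c(v, rep(NA_character_, n - length(v)))
}))
colnames(cols) <- paste0("c", seq_len(n))
cols
```

The resulting character matrix can be attached to the original data frame with cbind(df, cols).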
I have a dataset of map data using the following:
worldMap_df <- map_data("world") %>%
rename(Economy = region) %>%
filter(Economy != "Antarctica") %>%
mutate(Economy = str_replace_all(Economy,
c("Brunei" = "Brunei Darussalam",
"Macedonia" = "Macedonia, FYR",
"Puerto Rico" = "Puerto Rico US",
"Russia" = "Russian Federation",
"UK" = "United Kingdom",
"USA" = "United States",
"Palestine" = "West Bank and Gaza",
"Saint Lucia" = "St Lucia",
"East Timor" = "Timor-Leste")))
There are a number of countries (under Economy) that I am trying to use str_replace_all to concatenate. One example is observations for which Economy is either "Trinidad" or "Tobago".
I've used the following but this seems to only partially re-label observations:
trin_tobago_vector <- c("Trinidad", "Tobago")
worldMap_df$Economy <- str_replace_all(worldMap_df$Economy, trin_tobago_vector, "Trinidad and Tobago")
However, certain observations still have Trinidad and Tobago under Economy whilst others remain Trinidad OR Tobago. Can anyone see what I'm doing wrong here?
You supply str_replace_all with a pattern that is a vector: trin_tobago_vector. The pattern is recycled along your 'Economy' column, so the first element is checked against "Trinidad", the second against "Tobago", the third against "Trinidad", and so on. You should do this replacement in two steps instead:
worldMap_df$Economy <- str_replace_all(worldMap_df$Economy, "^Trinidad$", "Trinidad and Tobago")
worldMap_df$Economy <- str_replace_all(worldMap_df$Economy, "^Tobago$", "Trinidad and Tobago")
or use a named vector:
trin_tobago_vector <- c("^Trinidad$" = "Trinidad and Tobago", "^Tobago$" = "Trinidad and Tobago")
worldMap_df$Economy <- str_replace_all(worldMap_df$Economy, trin_tobago_vector)
The ^ and $ inside the pattern vector make sure that only the literal strings "Trinidad" and "Tobago" are replaced.
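Since only whole values are being relabelled here, the same result can also be had without any regex. A small base-R sketch, using a made-up `economy` vector in place of worldMap_df$Economy:

```r
# Made-up stand-in for worldMap_df$Economy
economy <- c("Trinidad", "Trinidad", "Tobago", "Tobago",
             "Trinidad and Tobago", "Chile")

# %in% tests each element against the whole set, so nothing is recycled
# position by position the way a vector of patterns would be
economy[economy %in% c("Trinidad", "Tobago")] <- "Trinidad and Tobago"
economy
```

Because %in% compares complete strings, an existing "Trinidad and Tobago" value is left untouched.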
I want to regroup US states by regions and thus I need to define a "US state" -> "US Region" mapping function, which is done by setting up an appropriate data frame.
The basis is this exercise (apparently this is a map of the "Commonwealth of the Fallout"):
One starts off with an original list in raw form:
Alabama = "Gulf"
Arizona = "Four States"
Arkansas = "Texas"
California = "South West"
Colorado = "Four States"
Connecticut = "New England"
Delaware = "Columbia"
which eventually leads to this R code:
us_state <- c("Alabama","Arizona","Arkansas","California","Colorado","Connecticut",
"Delaware","District of Columbia","Florida","Georgia","Idaho","Illinois","Indiana",
"Iowa","Kansas","Kentucky","Louisiana","Maine","Maryland","Massachusetts","Michigan",
"Minnesota","Mississippi","Missouri","Montana","Nebraska","Nevada","New Hampshire",
"New Jersey","New Mexico","New York","North Carolina","North Dakota","Ohio","Oklahoma",
"Oregon","Pennsylvania","Rhode Island","South Carolina","South Dakota","Tennessee",
"Texas","Utah","Vermont","Virginia","Washington","West Virginia","Wisconsin","Wyoming")
us_region <- c("Gulf","Four States","Texas","South West","Four States","New England",
"Columbia","Columbia","Gulf","Southeast","North West","Midwest","Midwest","Plains",
"Plains","East Central","Gulf","New England","Columbia","New England","Midwest",
"Midwest","Gulf","Plains","North","Plains","South West","New England","Eastern",
"Four States","Eastern","Southeast","North","East Central","Plains","North West",
"Eastern","New England","Southeast","North","East Central","Texas","Four States",
"New England","Columbia","North West","Eastern","Midwest","North")
us_state_to_region_map <- data.frame(us_state, us_region, stringsAsFactors=FALSE)
which is supremely ugly and unmaintainable as the State -> Region mapping is effectively
obfuscated.
I actually wrote a Perl program to generate the above from the original list.
In Perl, one would write things like:
#!/usr/bin/perl
$mapping = {
"Alabama"=> "Gulf",
"Arizona"=> "Four States",
"Arkansas"=> "Texas",
"California"=> "South West",
"Colorado"=> "Four States",
"Connecticut"=> "New England",
...etc...etc...
"West Virginia"=> "Eastern",
"Wisconsin"=> "Midwest",
"Wyoming"=> "North" };
which is maintainable because one can verify the mapping on a line-by-line basis.
There must be something similar to this Perl goodness in R?
It seems a bit open to interpretation as to what you're looking for.
Is the mapping meant to be a function-type thing, such that a call would return the region or vice versa (e.g. similar to a function call mapping("alabama") => "Gulf")?
I am reading the question as looking for dictionary-style storage, which in R could be obtained with an equivalent named list:
ncountry <- 49
mapping <- as.list(c("Gulf","Four States",
...
,"Midwest","North"))
names(mapping) <- c("Alabama","Arizona",
...
,"Wisconsin","Wyoming")
mapping[["Pennsylvania"]]
[1] "Eastern"
This could be performed in a single call
mapping <- list("Alabama" = "Gulf",
"Arizona" = "Four States",
...,
"Wisconsin" = "Midwest",
"Wyoming" = "North")
Which makes it very simple to check that the mapping is working as expected. This doesn't convert nicely to a 2 column data.frame however, which we would then obtain using
mapping_df <- data.frame(region = unlist(mapping), state = names(mapping))
Note that "not nicely" simply means that as.data.frame doesn't translate the input into a 2-column output.
Alternatively, just using a named character vector would likely be fine too:
mapping_c <- c("Alabama" = "Gulf",
"Arizona" = "Four States",
...,
"Wisconsin" = "Midwest",
"Wyoming" = "North")
which would be converted to a data.frame in almost the same fashion
mapping_df_c <- data.frame(region = mapping_c, state = names(mapping_c))
Note however a slight difference in the two choices of storage. While referencing an entry that exists using either single brackets [ or double brackets [[ works just fine
#Works:
mapping_c["Pennsylvania"] == mapping["Pennsylvania"]
#output
Pennsylvania
TRUE
mapping_c[["Pennsylvania"]] == mapping[["Pennsylvania"]]
[1] TRUE
But when referencing unknown entries these differ slightly in behaviour
#works sorta:
mapping_c["hello"] == mapping["hello"]
#output
$<NA>
NULL
#Does not work:
mapping_c[["hello"]] == mapping[["hello"]]
Error in mapping_c[["hello"]] : subscript out of bounds
If you are converting your input into a data.frame this is not an issue, but it is worth being aware of this, so you obtain the behaviour expected.
Of course you could use a function call to create a proper dictionary with a simple switch statement. I don't think that would be any prettier though.
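As a sketch of the function-style lookup mentioned at the start of this answer, a named character vector can be wrapped in a small helper. The name `lookup_region` is hypothetical, only the seven states listed in the question are included, and "Atlantis" is a made-up unknown key.

```r
# One state per line, so the mapping can be checked line by line
state_to_region <- c(
  "Alabama"     = "Gulf",
  "Arizona"     = "Four States",
  "Arkansas"    = "Texas",
  "California"  = "South West",
  "Colorado"    = "Four States",
  "Connecticut" = "New England",
  "Delaware"    = "Columbia"
)

# Hypothetical helper: vectorised lookup by name.
# Single-bracket indexing returns NA for unknown states instead of an error.
lookup_region <- function(states) {
  unname(state_to_region[states])
}

lookup_region(c("Delaware", "Alabama", "Atlantis"))

# The same vector converts directly to the two-column data frame
us_state_to_region_map <- data.frame(
  us_state  = names(state_to_region),
  us_region = unname(state_to_region),
  stringsAsFactors = FALSE
)
```

Using single brackets here is deliberate: it gives the "returns NA" behaviour for unknown keys rather than the "subscript out of bounds" error that double brackets raise.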
If us_region is a named list...
us_region <- list(Alabama = "Gulf",
Arizona = "Four States",
Arkansas = "Texas",
California = "South West",
Colorado = "Four States",
Connecticut = "New England",
Delaware = "Columbia")
Then,
us_state_to_region_map <- data.frame(us_state = names(us_region),
us_region = sapply(us_region, c),
stringsAsFactors = FALSE)
and, as a bonus, you also get the states as row names...
us_state_to_region_map
us_state us_region
Alabama Alabama Gulf
Arizona Arizona Four States
Arkansas Arkansas Texas
California California South West
Colorado Colorado Four States
Connecticut Connecticut New England
Delaware Delaware Columbia
As @tim-biegeleisen says, it could be more appropriate to maintain this dataset in a database, a CSV file or a spreadsheet and open it in R (with readxl::read_excel(), readr::read_csv(), ...).
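A minimal sketch of the external-file route, using base R's read.csv() rather than the readr or readxl functions named above. The file is written to a temporary path here only so the example is self-contained; in practice it would be a file kept alongside the script, and the column names `state` and `region` are assumptions.

```r
# Write a small CSV to a temporary file (stand-in for a maintained file)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "state,region",
  "Alabama,Gulf",
  "Arizona,Four States",
  "Arkansas,Texas",
  "California,South West",
  "Colorado,Four States",
  "Connecticut,New England",
  "Delaware,Columbia"
), csv_path)

# Read it back as a two-column data frame of character columns
mapping_df <- read.csv(csv_path, stringsAsFactors = FALSE)
mapping_df
```

Each line of the file is one state/region pair, which keeps the mapping as easy to verify as the Perl hash in the question.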
However, if you want to write it directly in your code, you can use tibble::tribble(), which lets you write a data frame row by row:
library(tibble)
tribble(~ state, ~ region,
"Alabama", "Gulf",
"Arizona", "Four States",
(...)
"Wisconsin", "Midwest",
"Wyoming", "North")
One option could be to create a data frame in wide format (your initial list makes it very straightforward and this maintains a very obvious mapping. It is actually quite similar to your Perl code), then transform it to the long format:
library(tidyr)
data.frame(
Alabama = "Gulf",
Arizona = "Four States",
Arkansas = "Texas",
California = "South West",
Colorado = "Four States",
Connecticut = "New England",
Delaware = "Columbia",
stringsAsFactors = FALSE
) %>%
gather("us_state", "us_region") # transform to long format
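If you would rather avoid the tidyr dependency, base R's stack() performs a similar wide-to-long reshape on a named list. This is a sketch under that assumption, using only the seven states from the question; the renaming step is needed because stack() labels its output columns `values` and `ind`.

```r
us_region <- list(
  Alabama     = "Gulf",
  Arizona     = "Four States",
  Arkansas    = "Texas",
  California  = "South West",
  Colorado    = "Four States",
  Connecticut = "New England",
  Delaware    = "Columbia"
)

# stack() returns a data frame with columns `values` and `ind` (a factor)
long <- stack(us_region)

# Rename and convert to plain character columns
us_state_to_region_map <- data.frame(
  us_state  = as.character(long$ind),
  us_region = as.character(long$values),
  stringsAsFactors = FALSE
)
us_state_to_region_map
```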
I want to find states with exactly two Os in the name. I tried this:
> data(state)
> index=grep('o.*o',state.name)
> state.name[index]
"Colorado" "North Carolina" "North Dakota" "South Carolina" "South Dakota"
Problem: there are three Os in "Colorado" and I don't want it. How can I revise my regex?
I also want to do three Os:
> data(state)
> index=grep('o.*o.*o',state.name)
> state.name[index]
"Colorado"
Is there a simpler way to do this?
You can do:
grep('^([^o]*o[^o]*){2}$', state.name, value = TRUE)
# [1] "North Carolina" "North Dakota"
# [3] "South Carolina" "South Dakota"
grep('^([^o]*o[^o]*){3}$', state.name, value = TRUE)
# [1] "Colorado"
and as GSee suggested below, you can add ignore.case = TRUE if you want to include states with a capital O like Ohio, Oklahoma, and Oregon.
Michael's response is definitely more elegant, but here's the brute-force method:
state.name[sapply(strsplit(tolower(state.name), NULL), function(x) sum(x %in% "o") == 2)]
You should ensure that the other characters that you're matching, besides the two matching Os, are not Os:
grep("^[^o]*o[^o]*o[^o]*$", state.name, value = TRUE)
Solution using ?gregexpr: a little ugly, but it generalizes well to other regexes. (Don't forget the capital O in Ohio.)
state.name[sapply(state.name,function(x) length(unlist(gregexpr("o|O",x)))) == 2]
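One thing to watch with the gregexpr() approach: when there is no match it returns -1, so the length of the result is 1 rather than 0. That does not affect a test for exactly two, but it would give wrong answers for counts of zero or one. A sketch of a small helper that avoids this by counting the matched substrings instead (the name `count_letter` is made up):

```r
# Hypothetical helper: count case-insensitive occurrences of a pattern.
# regmatches() yields character(0) for non-matches, so the count is 0.
count_letter <- function(x, letter) {
  lengths(regmatches(x, gregexpr(letter, x, ignore.case = TRUE)))
}

two_os   <- state.name[count_letter(state.name, "o") == 2]
three_os <- state.name[count_letter(state.name, "o") == 3]
two_os
three_os
```

The same helper works for any other letter or short regex by changing the second argument.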
Count the number of o's in each state name:
State <- c("North Dakota","Ohio","Colorado","South Dakota")
nos <- nchar(gsub("[^oO]","",State))
State[nos==2]
State[nos==3]