How to extract a vector element without using the index - r

I have this vector containing names of different countries:
Contries<-c("United States", "India", "Brazil", "France", "Mali", "Australia")
I want to extract only "France", and then all country names except France, using the name "France" itself rather than its index.
I tried these but they don't work.
Contries["France"]
Contries[!"France"]
Finally, I want to extract the names of all countries except France and Mali
Contries[!c("France","Mali")]
How can I do that?

You can try
> Contries[-match(c("France", "Mali"), Contries)]
[1] "United States" "India" "Brazil" "Australia"

Contries[!Contries %in% c("France", "Mali")]
#> [1] "United States" "India" "Brazil" "Australia"
Created on 2021-07-04 by the reprex package (v2.0.0)

Contries<-c("United States", "India", "Brazil", "France", "Mali", "Australia")
Include
Contries[Contries == "France"]
Exclude single element
Contries[Contries != "France"]
Exclude multiple elements
Contries[!Contries %in% c("France", "Mali")]
Contries[Contries %in% setdiff(Contries, c("France", "Mali"))]
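The attempts in the question fail because a character subscript looks up element names rather than values, and `!` is not defined for character strings. A minimal sketch of the difference, using only base R; the variable names `by_name`, `only_france` and `without_two` are made up for illustration:

```r
Contries <- c("United States", "India", "Brazil", "France", "Mali", "Australia")

# A character subscript is matched against names(); this vector has no names,
# so the lookup finds nothing and returns NA.
by_name <- Contries["France"]

# Comparing the values produces a logical mask that can be used for subsetting.
only_france <- Contries[Contries == "France"]

# setdiff() drops the listed values directly (note that it also drops duplicates).
without_two <- setdiff(Contries, c("France", "Mali"))
```

Because `setdiff()` removes duplicates, the `%in%` form is the safer choice when the vector may contain repeated values that should be kept.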

Although a little more code, you can also do this with dplyr.
Include
library(dplyr)
country <-
as.data.frame(Contries) %>%
dplyr::filter(Contries == "France") %>%
unlist() %>%
as.character()
Exclude single element
country <-
as.data.frame(Contries) %>%
dplyr::filter(Contries != "France") %>%
unlist() %>%
as.character()
Exclude multiple elements
country <-
as.data.frame(Contries) %>%
dplyr::filter(!Contries %in% c("France", "Mali")) %>%
unlist() %>%
as.character()

Related

Combining only certain cells in two columns in R

I have two columns in R, one named Province.State, the other Country.Region. Province.State contains the names of subregions (from states and provinces to overseas territories) within countries (but not all), whereas Country.Region names the country. Here's my dilemma: I need to aggregate my data by country, with the overseas territories treated as separate countries, for the per capita calculation that I do later. So I want to create a single column that lists the country if Province.State names a province/state within the country, or the name of the overseas territory if Province.State names a territory.
Sample from my data set:
Province.State <- c("Australian Capital Territory", "New South Wales", "Northern Territory", "Queensland", "South Australia", "Tasmania", "Victoria", "Western Australia", "", "", "", "Faroe Islands", "Greenland")
Country.Region <- c("Australia", "Australia", "Australia", "Australia", "Australia", "Australia", "Australia", "Australia", "Austria", "Azerbaijan", "Denmark", "Denmark", "Denmark")
df <- data.frame(Province.State, Country.Region)
In this example, I want to aggregate the data for Australia. However, the data for the Faroe Islands and Greenland need to be separate from Denmark proper. I've tried to paste the data df$country <- paste(df$Province.State, Country.Region, sep = " ") but that only compounded my issue as I now had to go back and change a number of individual cells.
Edit: I do have a second data set with the names of the countries I want. It looks like this:
Country <- c("Australia", "Austria", "Azerbaijan", "Denmark", "Faroe Islands", "Greenland")
seconddf <- data.frame(Country)
What I would like to do is create a column in df that looks like this:
df$nation <- c("Australia", "Australia", "Australia", "Australia", "Australia", "Australia", "Australia", "Australia", "Austria", "Azerbaijan", "Denmark", "Faroe Islands", "Greenland")
Is there a way to do this or am I stuck doing it by hand? Thank you for any help on this vexing problem.
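One possible approach, sketched here under the assumption that every territory to be kept separate appears verbatim in seconddf$Country, is to take the Province.State value when it is in that list and fall back to Country.Region otherwise. stringsAsFactors = FALSE is set explicitly so that ifelse() returns strings rather than factor codes on older R versions.

```r
Province.State <- c("Australian Capital Territory", "New South Wales", "Northern Territory",
                    "Queensland", "South Australia", "Tasmania", "Victoria",
                    "Western Australia", "", "", "", "Faroe Islands", "Greenland")
Country.Region <- c("Australia", "Australia", "Australia", "Australia", "Australia",
                    "Australia", "Australia", "Australia", "Austria", "Azerbaijan",
                    "Denmark", "Denmark", "Denmark")
df <- data.frame(Province.State, Country.Region, stringsAsFactors = FALSE)

Country <- c("Australia", "Austria", "Azerbaijan", "Denmark", "Faroe Islands", "Greenland")
seconddf <- data.frame(Country, stringsAsFactors = FALSE)

# Keep the province name only when it is one of the wanted "countries";
# otherwise use the parent country.
df$nation <- ifelse(df$Province.State %in% seconddf$Country,
                    df$Province.State,
                    df$Country.Region)
```

This would mislabel a province whose name happens to equal a country name in the list, so the result is worth checking against the full data set.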

R find two words into same string

I want to create a single regex or str_detect (if possible) to search through text strings and determine if two words (countries) occur in the same string based on one country list, but testing without repetition. For example:
latam <- c("BRAZIL", "MEXICO", "CHILE", "ARGENTINA", "COLOMBIA", "CUBA", "VENEZUELA", "PERU", "COSTA RICA", "ECUADOR", "URUGUAY", "BOLIVIA", "PARAGUAY", "GUATEMALA", "EL SALVADOR", "PANAMA", "NICARAGUA", "DOMINICAN REPUBLIC", "HONDURAS", "HAITI")
example_string <- c("USA;BRAZIL", "USA;BRAZIL;ARGENTINA", "BRAZIL;BRAZIL;ARGENTINA", "BRAZIL;ARGENTINA", "BRAZIL;BRAZIL", "BRAZIL;BRAZIL;ARGENTINA;ARGENTINA")
Testing example_string, the desired output is: FALSE, TRUE, TRUE, TRUE, FALSE, TRUE.
A data.table solution (there is surely a regex for this, but...):
library(data.table)
a <- data.table(example_string)
create a data.table with the test string
a[, countries := sapply(example_string, strsplit, ";")]
split the countries, so each one can be tested individually
a[, sum(unique(countries[[1]]) %in% latam) >= 2, by = example_string]
Count how many unique countries in each row are in the latam list and check whether that count is at least 2.
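The same split, de-duplicate and count logic can be sketched in base R without data.table. The latam list is shortened here for illustration, and `parts` and `has_two` are made-up names:

```r
# Shortened country list, for illustration only.
latam <- c("BRAZIL", "MEXICO", "CHILE", "ARGENTINA")
example_string <- c("USA;BRAZIL", "USA;BRAZIL;ARGENTINA", "BRAZIL;BRAZIL;ARGENTINA",
                    "BRAZIL;ARGENTINA", "BRAZIL;BRAZIL", "BRAZIL;BRAZIL;ARGENTINA;ARGENTINA")

# Split each string on ";" to get one character vector per input string.
parts <- strsplit(example_string, ";", fixed = TRUE)

# For each string, count the distinct countries that are in latam
# and test whether there are at least two.
has_two <- vapply(parts, function(p) sum(unique(p) %in% latam) >= 2, logical(1))
```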

How to match a lookup table by detecting possibly multiple substrings as in matching "US|USA|United States" with `abc,United States,xzy` in R?

I am not very sure about the terminology in the question title but I have a data frame containing non-structured addresses of institutions and I would like to extract their countries with a lookup table with multiple possible matches.
the addresses might look like:
address
xxx US
xxx USA yy
xxx United States yy
xxx UK
xxx United Kingdom yy
Note that the country does not necessarily come at the end of the string.
The pipeline would be matching and extracting whatever might be a country name (from a list of around 20 countries) and return a clean country name for each country.
(df <- tribble(
~address, ~clean_country,
"xxx US", "United States",
"xxx, USA yy", "United States",
"xxx United States, yy", "United States",
"xxx UK", "United Kingdom",
"xxx United Kingdom yy", "United Kingdom",
"xxx zz yy", NA_character_,
))
I am thinking of creating a lookup table as a data.frame with two columns:
(lookup <- tribble(
~country, ~matches,
"United States", "US|USA|United States",
"United Kingdom", "UK|United Kingdom"
))
and then checking with regex whether any of the vertical-bar-separated matches can be found in the df$address column, then appending the country column to df as clean_country.
Of course, I am also interested in solutions that follow other strategies. The more (memory-)efficient the better, because the data set is relatively big.
Using the lookup table approach, you can extract the country name from address using str_extract and replace it with the country name from the lookup table.
library(stringr)
str_replace_all(str_extract(df$address, str_c(lookup$matches, collapse = '|')),
setNames(lookup$country, lookup$matches))
#[1] "United States" "United States" "United States"
#[4] "United Kingdom" "United Kingdom" NA
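A short alternative such as "US" can also match inside a longer word. A minimal base R sketch that wraps each alternative set in word boundaries is shown below; the address "BUSINESS park" is a hypothetical case added only to show the difference, and `address` and `clean_country` are plain vectors rather than the tibble columns from the question:

```r
address <- c("xxx US", "xxx, USA yy", "xxx United States, yy",
             "xxx UK", "xxx United Kingdom yy", "xxx zz yy",
             "BUSINESS park")  # hypothetical address containing "US" inside a word

lookup <- data.frame(
  country = c("United States", "United Kingdom"),
  matches = c("US|USA|United States", "UK|United Kingdom"),
  stringsAsFactors = FALSE
)

clean_country <- rep(NA_character_, length(address))
for (i in seq_len(nrow(lookup))) {
  # \\b requires each alternative to start and end at a word boundary.
  pattern <- paste0("\\b(", lookup$matches[i], ")\\b")
  # Only fill rows that have not already been assigned a country.
  hit <- is.na(clean_country) & grepl(pattern, address, perl = TRUE)
  clean_country[hit] <- lookup$country[i]
}
```

The loop runs once per lookup row rather than once per address, which keeps it cheap for a list of around 20 countries.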

How to identify all country names mentioned in a string and split accordingly?

I have a string that contains country and other region names. I am only interested in the country names and would ideally like to add several columns, each of which contains a country name listed in the string. Here is example code showing how the dataframe is set up:
df <- data.frame(id = c(1,2,3),
country = c("Cote d'Ivoire Africa Developing Economies West Africa",
"South Africa United Kingdom Africa BRICS Countries",
"Myanmar Gambia Bangladesh Netherlands Africa Asia"))
If I only split the string by space, those countries which contain a space get lost (e.g. "United Kingdom"). See here:
df2 <- separate(df, country, paste0("C",3:8), sep=" ")
Therefore, I tried to look up country names using the world.cities dataset. However, this only seems to loop through the string until it reaches a non-country name. See here:
library(maps)
library(stringr)
all_countries <- str_c(unique(world.cities$country.etc), collapse = "|")
df$c1 <- sapply(str_extract_all(df$country, all_countries), toString)
I am wondering whether it's possible to use the space as a delimiter but define exceptions (like "United Kingdom"). This might obviously require some manual work, but appears to be the most feasible solution to me. Does anyone know how to define such exceptions? I am of course also open to and thankful for any other solutions.
UPDATE:
I figured out another solution using the countrycode package:
library(countrycode)
countries <- data.frame(countryname_dict)
countries$continent <- countrycode(sourcevar = countries[["country.name.en"]],
origin = "country.name.en",
destination = "continent")
africa <- countries[ which(countries$continent=='Africa'), ]
library(stringr)
pat <- paste0("\\b", paste(africa$country.name.en , collapse="\\b|\\b"), "\\b")
df$country_list <- str_extract_all(df$country, regex(pat, ignore_case = TRUE))
You could do:
library(stringi)
vec <- stri_trans_general(countrycode::codelist$country.name.en, id = "Latin-ASCII")
stri_extract_all(df$country,regex = sprintf(r"(\b(%s)\b)",stri_c(vec,collapse = "|")))
[[1]]
[1] "Cote d'Ivoire"
[[2]]
[1] "South Africa" "United Kingdom"
[[3]]
[1] "Gambia" "Bangladesh" "Netherlands"

Missing observations when using str_replace_all

I have a dataset of map data using the following:
worldMap_df <- map_data("world") %>%
rename(Economy = region) %>%
filter(Economy != "Antarctica") %>%
mutate(Economy = str_replace_all(Economy,
c("Brunei" = "Brunei Darussalam",
"Macedonia" = "Macedonia, FYR",
"Puerto Rico" = "Puerto Rico US",
"Russia" = "Russian Federation",
"UK" = "United Kingdom",
"USA" = "United States",
"Palestine" = "West Bank and Gaza",
"Saint Lucia" = "St Lucia",
"East Timor" = "Timor-Leste")))
There are a number of countries (under Economy) that I am trying to use str_replace_all to concatenate. One example is observations for which Economy is either "Trinidad" or "Tobago".
I've used the following but this seems to only partially re-label observations:
trin_tobago_vector <- c("Trinidad", "Tobago")
worldMap_df$Economy <- str_replace_all(worldMap_df$Economy, trin_tobago_vector, "Trinidad and Tobago")
However, certain observations still have Trinidad and Tobago under Economy whilst others remain Trinidad OR Tobago. Can anyone see what I'm doing wrong here?
You supply str_replace_all with a pattern that is a vector: trin_tobago_vector. stringr recycles that vector along your 'Economy' column, so the first element is checked against "Trinidad", the second against "Tobago", the third against "Trinidad", and so on; a row is only relabelled when it happens to line up with the matching pattern. You should do this replacement in two steps instead:
worldMap_df$Economy <- str_replace_all(worldMap_df$Economy, "^Trinidad$", "Trinidad and Tobago")
worldMap_df$Economy <- str_replace_all(worldMap_df$Economy, "^Tobago$", "Trinidad and Tobago")
or use a named vector:
trin_tobago_vector <- c("^Trinidad$" = "Trinidad and Tobago", "^Tobago$" = "Trinidad and Tobago")
worldMap_df$Economy <- str_replace_all(worldMap_df$Economy, trin_tobago_vector)
The ^ and $ anchors inside the pattern vector make sure that only values that are exactly "Trinidad" or "Tobago" are replaced, so a value that is already "Trinidad and Tobago" is left untouched.
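Because only whole values need relabelling here, the same result can also be sketched without regular expressions by assigning through a %in% mask. The `Economy` vector below is a small made-up stand-in for worldMap_df$Economy:

```r
# Made-up stand-in for the Economy column.
Economy <- c("Trinidad", "Tobago", "Trinidad and Tobago", "Tobago", "Brazil")

# Relabel every value that is exactly "Trinidad" or "Tobago";
# all other values, including "Trinidad and Tobago", are left as they are.
Economy[Economy %in% c("Trinidad", "Tobago")] <- "Trinidad and Tobago"
```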
