Missing observations when using str_replace_all - r

I have a dataset of map data using the following:
worldMap_df <- map_data("world") %>%
rename(Economy = region) %>%
filter(Economy != "Antarctica") %>%
mutate(Economy = str_replace_all(Economy,
c("Brunei" = "Brunei Darussalam",
"Macedonia" = "Macedonia, FYR",
"Puerto Rico" = "Puerto Rico US",
"Russia" = "Russian Federation",
"UK" = "United Kingdom",
"USA" = "United States",
"Palestine" = "West Bank and Gaza",
"Saint Lucia" = "St Lucia",
"East Timor" = "Timor-Leste")))
There are several countries (under Economy) that I am trying to merge under a single name using str_replace_all. One example is the observations for which Economy is either "Trinidad" or "Tobago".
I've used the following but this seems to only partially re-label observations:
trin_tobago_vector <- c("Trinidad", "Tobago")
worldMap_df$Economy <- str_replace_all(worldMap_df$Economy, trin_tobago_vector, "Trinidad and Tobago")
However, only some observations end up with Trinidad and Tobago under Economy, whilst others remain Trinidad or Tobago. Can anyone see what I'm doing wrong here?

You are passing str_replace_all a pattern that is a vector: trin_tobago_vector. The pattern is recycled along your 'Economy' column, so the first element is checked against "Trinidad", the second against "Tobago", the third against "Trinidad" again, and so on. An observation is therefore only replaced when it happens to line up with the matching pattern. Do the replacement in two steps instead:
worldMap_df$Economy <- str_replace_all(worldMap_df$Economy, "^Trinidad$", "Trinidad and Tobago")
worldMap_df$Economy <- str_replace_all(worldMap_df$Economy, "^Tobago$", "Trinidad and Tobago")
or use a named vector:
trin_tobago_vector <- c("^Trinidad$" = "Trinidad and Tobago", "^Tobago$" = "Trinidad and Tobago")
worldMap_df$Economy <- str_replace_all(worldMap_df$Economy, trin_tobago_vector)
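The ^ and $ inside the patterns make sure that only strings that are exactly "Trinidad" or "Tobago" are replaced. Without the anchors, the second pattern would also match the "Tobago" inside an already replaced "Trinidad and Tobago".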
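If no regular expression is needed at all, the same relabelling can be sketched in base R with %in%, which compares whole strings. This is only an illustration: the economy vector below is a made-up stand-in for worldMap_df$Economy, and merge_these is a hypothetical name.

```r
# Toy stand-in for worldMap_df$Economy (hypothetical values)
economy <- c("Trinidad", "Tobago", "Tobago", "Trinidad",
             "Trinidad and Tobago", "Peru")

# Labels that should be merged under one name
merge_these <- c("Trinidad", "Tobago")

# %in% tests every element against the whole set, so nothing depends on
# the order in which the observations appear
economy[economy %in% merge_these] <- "Trinidad and Tobago"

print(economy)
```

Because %in% only matches complete strings, an existing "Trinidad and Tobago" is left alone without needing ^ and $.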

Related

Combining only certain cells in two columns in R

I have two columns in R, one named Province.State, the other Country.Region. Province.State contains the names of subregions (from states and provinces to overseas territories) within countries (but not all), whereas Country.Region names the country. Here's my dilemma: I need to aggregate my data by country, with the overseas territories treated as separate countries for the per capita calculation that I do later. So I want to create a single column listing the country if Province.State lists a province/state within the country, or the name of the overseas territory if Province.State lists the name of the territory.
Sample from my data set:
Province.State <- c("Australian Capital Territory", "New South Wales", "Northern Territory", "Queensland", "South Australia", "Tasmania", "Victoria", "Western Australia", "", "", "", "Faroe Islands", "Greenland")
Country.Region <- c("Australia", "Australia", "Australia", "Australia", "Australia", "Australia", "Australia", "Australia", "Austria", "Azerbaijan", "Denmark", "Denmark", "Denmark")
df <- data.frame(Province.State, Country.Region)
In this example, I want to aggregate the data for Australia. However, the data for the Faroe Islands and Greenland need to be separate from Denmark proper. I've tried to paste the data df$country <- paste(df$Province.State, Country.Region, sep = " ") but that only compounded my issue as I now had to go back and change a number of individual cells.
Edit: I do have a second data set with the names of the countries I want. It looks like this:
Country <- c("Australia", "Austria", "Azerbaijan", "Denmark", "Faroe Islands", "Greenland")
seconddf <- data.frame(Country)
What I would like to do is create a column in df that looks like this:
df$nation <- c("Australia", "Australia", "Australia", "Australia", "Australia", "Australia", "Australia", "Australia", "Austria", "Azerbaijan", "Denmark", "Faroe Islands", "Greenland")
Is there a way to do this or am I stuck doing it by hand? Thank you for any help on this vexing problem.
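One possible approach, sketched here under the assumption that seconddf$Country lists exactly the names that should count as separate nations: keep Province.State when it appears in that list, otherwise fall back to Country.Region. The column name nation comes from the question; everything else is base R.

```r
Province.State <- c("Australian Capital Territory", "New South Wales", "Northern Territory", "Queensland", "South Australia", "Tasmania", "Victoria", "Western Australia", "", "", "", "Faroe Islands", "Greenland")
Country.Region <- c("Australia", "Australia", "Australia", "Australia", "Australia", "Australia", "Australia", "Australia", "Austria", "Azerbaijan", "Denmark", "Denmark", "Denmark")
df <- data.frame(Province.State, Country.Region)

Country <- c("Australia", "Austria", "Azerbaijan", "Denmark", "Faroe Islands", "Greenland")
seconddf <- data.frame(Country)

# as.character() guards against factor columns in older R versions
province <- as.character(df$Province.State)
country  <- as.character(df$Country.Region)

# Use the province/territory name only when it is one of the wanted nations
df$nation <- ifelse(province %in% seconddf$Country, province, country)

print(df$nation)
```

Rows with an empty Province.State, or with a state that is not in seconddf, keep the country name, so the Australian states collapse to "Australia" while the Faroe Islands and Greenland stay separate.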

How to extract a vector element without using the index

I have this vector containing names of different countries:
Contries<-c("United States", "India", "Brazil", "France", "Mali", "Australia")
I want to extract only "France", and then I want to extract all country names except France, using the name "France" itself rather than its index.
I tried these but they don't work.
Contries["France"]
Contries[!"France"]
Finally, I want to extract the names of all countries except France and Mali
Contries[!c("France","Mali")]
How can I do that?
You can try
> Contries[-match(c("France", "Mali"), Contries)]
[1] "United States" "India" "Brazil" "Australia"
Contries[!Contries %in% c("France", "Mali")]
#> [1] "United States" "India" "Brazil" "Australia"
Created on 2021-07-04 by the reprex package (v2.0.0)
Contries<-c("United States", "India", "Brazil", "France", "Mali", "Australia")
Include
Contries[Contries == "France"]
Exclude single element
Contries[Contries != "France"]
Exclude multiple elements
Contries[!Contries %in% c("France", "Mali")]
Contries[Contries %in% setdiff(Contries, c("France", "Mali"))]
Although a little more code, you can also do this with dplyr.
Include
library(dplyr)
country <-
as.data.frame(Contries) %>%
dplyr::filter(Contries == "France") %>%
unlist() %>%
as.character()
Exclude single element
country <-
as.data.frame(Contries) %>%
dplyr::filter(Contries != "France") %>%
unlist() %>%
as.character()
Exclude multiple elements
country <-
as.data.frame(Contries) %>%
dplyr::filter(!Contries %in% c("France", "Mali")) %>%
unlist() %>%
as.character()
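As a side note on why the attempts in the question fail: indexing with a character string looks up element names, and this vector has none. A small sketch, assuming you are free to give the vector names (the variable named is hypothetical):

```r
Contries <- c("United States", "India", "Brazil", "France", "Mali", "Australia")

# Character indexing searches names(Contries), which is NULL, so this is NA
no_names <- Contries["France"]

# Give every element its own value as a name; name-based lookup then works
named <- setNames(Contries, Contries)
only_france <- unname(named["France"])

# Exclusion by name: drop the unwanted names, then index with what is left
without_two <- unname(named[setdiff(names(named), c("France", "Mali"))])

print(only_france)
print(without_two)
```

For a plain unnamed vector, the logical comparisons shown in the answers above remain the simplest route.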

R find two words into same string

I want to create a single regex or str_detect (if possible) to search through text strings and determine if two words (countries) occur in the same string based on one country list, but testing without repetition. For example:
latam <- c("BRAZIL", "MEXICO", "CHILE", "ARGENTINA", "COLOMBIA", "CUBA", "VENEZUELA", "PERU", "COSTA RICA", "ECUADOR", "URUGUAY", "BOLIVIA", "PARAGUAY", "GUATEMALA", "EL SALVADOR", "PANAMA", "NICARAGUA", "DOMINICAN REPUBLIC", "HONDURAS", "HAITI")
example_string <- c("USA;BRAZIL", "USA;BRAZIL;ARGENTINA", "BRAZIL;BRAZIL;ARGENTINA", "BRAZIL;ARGENTINA", "BRAZIL;BRAZIL", "BRAZIL;BRAZIL;ARGENTINA;ARGENTINA")
Testing example_string, the desired output is: FALSE, TRUE, TRUE, TRUE, FALSE, TRUE.
A data.table solution (there is surely a regex for this, but...):
library(data.table)
a <- data.table(example_string)
This creates a data.table holding the test strings.
a[, countries := sapply(example_string, strsplit, ";")]
This splits each string into its countries, so each one can be tested individually.
a[, sum(unique(countries[[1]]) %in% latam) >= 2, by = example_string]
This counts how many unique countries in each row are in the latam list and checks whether that number is greater than or equal to 2.
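The same idea (split, deduplicate, count matches) can also be sketched in base R without data.table. The latam list is shortened here for brevity, and has_two_latam is a hypothetical helper name.

```r
# Shortened version of the latam list from the question
latam <- c("BRAZIL", "MEXICO", "CHILE", "ARGENTINA", "COLOMBIA")
example_string <- c("USA;BRAZIL", "USA;BRAZIL;ARGENTINA", "BRAZIL;BRAZIL;ARGENTINA",
                    "BRAZIL;ARGENTINA", "BRAZIL;BRAZIL", "BRAZIL;BRAZIL;ARGENTINA;ARGENTINA")

# TRUE when at least two distinct countries of the string are in the list
has_two_latam <- function(x, countries) {
  parts <- strsplit(x, ";", fixed = TRUE)
  vapply(parts, function(p) sum(unique(p) %in% countries) >= 2, logical(1))
}

result <- has_two_latam(example_string, latam)
print(result)
```

Calling unique() before %in% is what prevents "BRAZIL;BRAZIL" from counting as two countries.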

Is there syntactic sugar to define a data frame in R

I want to regroup US states by regions and thus I need to define a "US state" -> "US Region" mapping function, which is done by setting up an appropriate data frame.
The basis is this exercise (apparently this is a map of the "Commonwealth of the Fallout"):
One starts off with an original list in raw form:
Alabama = "Gulf"
Arizona = "Four States"
Arkansas = "Texas"
California = "South West"
Colorado = "Four States"
Connecticut = "New England"
Delaware = "Columbia"
which eventually leads to this R code:
us_state <- c("Alabama","Arizona","Arkansas","California","Colorado","Connecticut",
"Delaware","District of Columbia","Florida","Georgia","Idaho","Illinois","Indiana",
"Iowa","Kansas","Kentucky","Louisiana","Maine","Maryland","Massachusetts","Michigan",
"Minnesota","Mississippi","Missouri","Montana","Nebraska","Nevada","New Hampshire",
"New Jersey","New Mexico","New York","North Carolina","North Dakota","Ohio","Oklahoma",
"Oregon","Pennsylvania","Rhode Island","South Carolina","South Dakota","Tennessee",
"Texas","Utah","Vermont","Virginia","Washington","West Virginia ","Wisconsin","Wyoming")
us_region <- c("Gulf","Four States","Texas","South West","Four States","New England",
"Columbia","Columbia","Gulf","Southeast","North West","Midwest","Midwest","Plains",
"Plains","East Central","Gulf","New England","Columbia","New England","Midwest",
"Midwest","Gulf","Plains","North","Plains","South West","New England","Eastern",
"Four States","Eastern","Southeast","North","East Central","Plains","North West",
"Eastern","New England","Southeast","North","East Central","Texas","Four States",
"New England","Columbia","North West","Eastern","Midwest","North")
us_state_to_region_map <- data.frame(us_state, us_region, stringsAsFactors=FALSE)
which is supremely ugly and unmaintainable, as the State -> Region mapping is effectively obfuscated.
I actually wrote a Perl program to generate the above from the original list.
In Perl, one would write things like:
#!/usr/bin/perl
$mapping = {
"Alabama"=> "Gulf",
"Arizona"=> "Four States",
"Arkansas"=> "Texas",
"California"=> "South West",
"Colorado"=> "Four States",
"Connecticut"=> "New England",
...etc...etc...
"West Virginia "=> "Eastern",
"Wisconsin"=> "Midwest",
"Wyoming"=> "North" };
which is maintainable because one can verify the mapping on a line-by-line basis.
There must be something similar to this Perl goodness in R?
What you're looking for seems a bit open to interpretation.
Is the mapping meant to be a function-like thing, such that a call would return the region or vice versa (e.g. similar to a function call mapping("alabama") => "Gulf")?
I read the question as asking for dictionary-style storage, which in R can be obtained with a named list:
ncountry <- 49
mapping <- as.list(c("Gulf","Four States",
...
,"Midwest","North"))
names(mapping) <- c("Alabama","Arizona",
...
,"Wisconsin","Wyoming")
mapping[["Pennsylvania"]]
[1] "Eastern"
This could be performed in a single call
mapping <- list("Alabama" = "Gulf",
"Arizona" = "Four States",
...,
"Wisconsin" = "Midwest",
"Wyoming" = "North")
Which makes it very simple to check that the mapping is working as expected. This doesn't convert nicely to a 2 column data.frame however, which we would then obtain using
mapping_df <- data.frame(region = unlist(mapping), state = names(mapping))
Note that "not nicely" simply means that as.data.frame doesn't translate the input into a 2 column output.
Alternatively just using a named character vector would likely be fine too
mapping_c <- c("Alabama" = "Gulf",
"Arizona" = "Four States",
...,
"Wisconsin" = "Midwest",
"Wyoming" = "North")
which would be converted to a data.frame in almost the same fashion
mapping_df_c <- data.frame(region = mapping_c, state = names(mapping_c))
Note however a slight difference in the two choices of storage. While referencing an entry that exists using either single brackets [ or double brackets [[ works just fine
#Works:
mapping_c["Pennsylvania"] == mapping["Pennsylvania"]
#output
Pennsylvania
TRUE
mapping_c[["Pennsylvania"]] == mapping[["Pennsylvania"]]
[1] TRUE
But when referencing unknown entries these differ slightly in behaviour
#works sorta:
mapping_c["hello"] == mapping["hello"]
#output
$<NA>
NULL
#Does not work:
mapping_c[["hello"]] == mapping[["hello"]]
Error in mapping_c[["hello"]] : subscript out of bounds
If you are converting your input into a data.frame this is not an issue, but it is worth being aware of this, so you obtain the behaviour expected.
Of course you could use a function call to create a proper dictionary with a simple switch statement. I don't think that would be any prettier though.
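For completeness, a minimal sketch of what such a switch-based lookup might look like. The function name state_to_region is hypothetical, and only the first few states from the question are filled in.

```r
# Hypothetical lookup function built on switch(); the final unnamed
# argument is the default returned for unknown states
state_to_region <- function(state) {
  switch(state,
         "Alabama"     = "Gulf",
         "Arizona"     = "Four States",
         "Arkansas"    = "Texas",
         "California"  = "South West",
         "Colorado"    = "Four States",
         "Connecticut" = "New England",
         "Delaware"    = "Columbia",
         NA_character_)
}

print(state_to_region("Alabama"))

# switch() handles one value at a time, so wrap it for vectors
regions <- vapply(c("Arizona", "Hawaii"), state_to_region, character(1))
print(regions)
```

The mapping stays readable line by line, but unlike a named vector it is not vectorised on its own, which is part of why it is arguably no prettier.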
If us_region is a named list...
us_region <- list(Alabama = "Gulf",
Arizona = "Four States",
Arkansas = "Texas",
California = "South West",
Colorado = "Four States",
Connecticut = "New England",
Delaware = "Columbia")
Then,
us_state_to_region_map <- data.frame(us_state = names(us_region),
us_region = sapply(us_region, c),
stringsAsFactors = FALSE)
and, as a bonus, you also get the states as row names...
us_state_to_region_map
us_state us_region
Alabama Alabama Gulf
Arizona Arizona Four States
Arkansas Arkansas Texas
California California South West
Colorado Colorado Four States
Connecticut Connecticut New England
Delaware Delaware Columbia
As @tim-biegeleisen says, it could be more appropriate to maintain this dataset in a database, a CSV file or a spreadsheet and open it in R (with readxl::read_excel(), readr::read_csv(), ...).
However, if you want to write it directly in your code, you can use tibble::tribble(), which lets you write a data frame row by row:
library(tibble)
tribble(~ state, ~ region,
"Alabama", "Gulf",
"Arizona", "Four States",
(...)
"Wisconsin", "Midwest",
"Wyoming", "North")
One option could be to create a data frame in wide format (your initial list makes it very straightforward and this maintains a very obvious mapping. It is actually quite similar to your Perl code), then transform it to the long format:
library(tidyr)
data.frame(
Alabama = "Gulf",
Arizona = "Four States",
Arkansas = "Texas",
California = "South West",
Colorado = "Four States",
Connecticut = "New England",
Delaware = "Columbia",
stringsAsFactors = FALSE
) %>%
gather("us_state", "us_region") # transform to long format
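In the spirit of the CSV suggestion mentioned earlier, the mapping can also be kept inline as CSV text and parsed with base R's read.csv(text = ...). This is a sketch using only the first seven states from the question; region_of is a hypothetical name for the derived lookup vector.

```r
# One "state,region" pair per line, so the mapping can be checked by eye
us_state_to_region_map <- read.csv(text = "us_state,us_region
Alabama,Gulf
Arizona,Four States
Arkansas,Texas
California,South West
Colorado,Four States
Connecticut,New England
Delaware,Columbia", stringsAsFactors = FALSE, strip.white = TRUE)

# Optional: a named vector for dictionary-style lookups
region_of <- setNames(us_state_to_region_map$us_region,
                      us_state_to_region_map$us_state)

print(us_state_to_region_map)
print(region_of[["Colorado"]])
```

This needs no extra packages, and moving the text into a real CSV file later only means swapping the text argument for a file path.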

Substitute multiple values in order [duplicate]

I'm currently cleaning some country-based data. I have approximately 1000 entries and need to replace all country codes with full country names. Some examples of the codes are below:
"SL/L/N", "Sierra Leone", "L", "Lib/Nepal", "SL2/ Nepal", "SL2/L"
My code converts all of the codes/countries correctly except one. The issue I have is that "L" stands for "Liberia" and so needs substituting, but I can't differentiate between an "L" that is within a word, e.g. "Sri Lanka", and one that stands for "Liberia". I tried using forward slashes as identifying features in the code below, but it returns NA for the "L" entries:
lut = c("Lib" = "Liberia", "Sri lanka" = "Sri Lanka", "WACC" = "West Africa", "W.Africa" = "West Africa", "SL2" = "Sri Lanka", "N" = "Nepal", "SL" = "Sierra Leone", "/L" = "/Liberia", "/L/" = "/Liberia/", "/L" = "/Liberia")
countryData$Country <- lut[countryData$Country]
Any help in turning the correct "L"s into "Liberia" but leaving "Sri Lanka" and "Sierra Leone" untouched is gratefully received.
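One way this could be approached, sketched under the assumption that codes are always separated by "/": split each entry on the slash, translate only the pieces that match a code exactly, and paste the pieces back together. The lut below is a simplified, hypothetical version of the one in the question, and expand_codes is an illustrative helper name.

```r
# Simplified lookup table: one entry per bare code (assumed, not the original lut)
lut <- c("Lib" = "Liberia", "L" = "Liberia", "SL" = "Sierra Leone",
         "SL2" = "Sri Lanka", "N" = "Nepal")

codes <- c("SL/L/N", "Sierra Leone", "L", "Lib/Nepal", "SL2/ Nepal", "SL2/L")

expand_codes <- function(x) {
  # Split on "/" and drop stray spaces around each piece
  parts <- trimws(strsplit(x, "/", fixed = TRUE)[[1]])
  # Translate only pieces that are exactly a known code
  known <- parts %in% names(lut)
  parts[known] <- lut[parts[known]]
  paste(parts, collapse = "/")
}

cleaned <- vapply(codes, expand_codes, character(1), USE.NAMES = FALSE)
print(cleaned)
```

Because each piece is compared as a whole string, the "L" inside "Sri Lanka" or "Sierra Leone" is never touched.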
