I have two columns in R, one named Province.State, the other Country.Region. Province.State contains the names of subregions (from states and provinces to overseas territories) for some, but not all, countries, whereas Country.Region names the country. Here's my dilemma: I need to aggregate my data by country, with the overseas territories treated as separate countries for the per capita calculation that I do later. So I want to create a single column that lists the country if Province.State names a province/state within the country, or the name of the overseas territory if Province.State names a territory.
Sample from my data set:
Province.State <- c("Australian Capital Territory", "New South Wales", "Northern Territory", "Queensland", "South Australia", "Tasmania", "Victoria", "Western Australia", "", "", "", "Faroe Islands", "Greenland")
Country.Region <- c("Australia", "Australia", "Australia", "Australia", "Australia", "Australia", "Australia", "Australia", "Austria", "Azerbaijan", "Denmark", "Denmark", "Denmark")
df <- data.frame(Province.State, Country.Region)
In this example, I want to aggregate the data for Australia. However, the data for the Faroe Islands and Greenland need to be kept separate from Denmark proper. I've tried pasting the columns together with df$country <- paste(df$Province.State, Country.Region, sep = " "), but that only compounded my issue, as I then had to go back and change a number of individual cells.
Edit: I do have a second data set with the names of the countries I want. It looks like this:
Country <- c("Australia", "Austria", "Azerbaijan", "Denmark", "Faroe Islands", "Greenland")
seconddf <- data.frame(Country)
What I would like to do is create a column in df that looks like this:
df$nation <- c("Australia", "Australia", "Australia", "Australia", "Australia", "Australia", "Australia", "Australia", "Austria", "Azerbaijan", "Denmark", "Faroe Islands", "Greenland")
Is there a way to do this or am I stuck doing it by hand? Thank you for any help on this vexing problem.
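One possible approach, sketched here under the assumption that every overseas territory to be kept separate appears in seconddf$Country and that no ordinary province shares a name with an entry there: use the territory name when Province.State is found in that list, and fall back to Country.Region otherwise. The helper name is_territory is introduced only for illustration.

```r
Province.State <- c("Australian Capital Territory", "New South Wales", "Northern Territory",
                    "Queensland", "South Australia", "Tasmania", "Victoria",
                    "Western Australia", "", "", "", "Faroe Islands", "Greenland")
Country.Region <- c("Australia", "Australia", "Australia", "Australia", "Australia",
                    "Australia", "Australia", "Australia", "Austria", "Azerbaijan",
                    "Denmark", "Denmark", "Denmark")
df <- data.frame(Province.State, Country.Region, stringsAsFactors = FALSE)
seconddf <- data.frame(Country = c("Australia", "Austria", "Azerbaijan", "Denmark",
                                   "Faroe Islands", "Greenland"),
                       stringsAsFactors = FALSE)

# TRUE where the subregion is itself listed as a country in the second data set
# (is_territory is a hypothetical helper name)
is_territory <- df$Province.State %in% seconddf$Country

# Use the territory name where it is listed, otherwise the parent country
df$nation <- ifelse(is_territory,
                    as.character(df$Province.State),
                    as.character(df$Country.Region))
df$nation
```

The new column can then serve as the grouping variable for the aggregation, for example in aggregate() or dplyr::group_by().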
I have this vector containing names of different countries:
Contries<-c("United States", "India", "Brazil", "France", "Mali", "Australia")
I want to extract only "France", and then I want to extract all country names except France, using the name "France" itself rather than its index.
I tried these but they don't work.
Contries["France"]
Contries[!"France"]
Finally, I want to extract the names of all countries except France and Mali
Contries[!c("France","Mali")]
How can I do that?
You can try
> Contries[-match(c("France", "Mali"), Contries)]
[1] "United States" "India" "Brazil" "Australia"
Contries[!Contries %in% c("France", "Mali")]
#> [1] "United States" "India" "Brazil" "Australia"
Created on 2021-07-04 by the reprex package (v2.0.0)
Contries<-c("United States", "India", "Brazil", "France", "Mali", "Australia")
Include
Contries[Contries == "France"]
Exclude single element
Contries[Contries != "France"]
Exclude multiple elements
Contries[!Contries %in% c("France", "Mali")]
Contries[Contries %in% setdiff(Contries, c("France", "Mali"))]
Although a little more code, you can also do this with dplyr.
Include
library(dplyr)
country <-
as.data.frame(Contries) %>%
dplyr::filter(Contries == "France") %>%
unlist() %>%
as.character()
Exclude single element
country <-
as.data.frame(Contries) %>%
dplyr::filter(Contries != "France") %>%
unlist() %>%
as.character()
Exclude multiple elements
country <-
as.data.frame(Contries) %>%
dplyr::filter(!Contries %in% c("France", "Mali")) %>%
unlist() %>%
as.character()
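As a brief illustration of why the attempts in the question fail while the answers above work: indexing with a character string looks up element names, not values, and this vector has no names. The sketch below uses only base R; the variable names by_name, only_france and without_two are made up for the example.

```r
Contries <- c("United States", "India", "Brazil", "France", "Mali", "Australia")

# Character indexing searches names(Contries), which is NULL here, so the result is NA
by_name <- Contries["France"]

# Comparing values gives a logical mask, which is what the subsetting needs
only_france <- Contries[Contries == "France"]

# %in% builds the mask for several values at once; ! inverts it
without_two <- Contries[!Contries %in% c("France", "Mali")]
without_two
```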
I want to create a single regex or str_detect (if possible) to search through text strings and determine if two words (countries) occur in the same string based on one country list, but testing without repetition. For example:
latam <- c("BRAZIL", "MEXICO", "CHILE", "ARGENTINA", "COLOMBIA", "CUBA", "VENEZUELA", "PERU", "COSTA RICA", "ECUADOR", "URUGUAY", "BOLIVIA", "PARAGUAY", "GUATEMALA", "EL SALVADOR", "PANAMA", "NICARAGUA", "DOMINICAN REPUBLIC", "HONDURAS", "HAITI")
example_string <- c("USA;BRAZIL", "USA;BRAZIL;ARGENTINA", "BRAZIL;BRAZIL;ARGENTINA", "BRAZIL;ARGENTINA", "BRAZIL;BRAZIL", "BRAZIL;BRAZIL;ARGENTINA;ARGENTINA")
Testing example_string, the desired output is: FALSE, TRUE, TRUE, TRUE, FALSE, TRUE.
A data.table solution (there is surely a regex for this, but...):
library(data.table)
a <- data.table(example_string)
Create a data.table with the test strings.
a[, countries := sapply(example_string, strsplit, ";")]
Split the countries so that each one can be tested individually.
a[, sum(unique(countries[[1]]) %in% latam) >= 2, by = example_string]
Count how many unique countries in each row are in the latam list and check whether that count is greater than or equal to 2.
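The same idea (split, de-duplicate, count matches) can also be sketched in base R without data.table. This is only an alternative illustration, not a regex solution; the name has_two_latam is made up for the example.

```r
latam <- c("BRAZIL", "MEXICO", "CHILE", "ARGENTINA", "COLOMBIA", "CUBA", "VENEZUELA",
           "PERU", "COSTA RICA", "ECUADOR", "URUGUAY", "BOLIVIA", "PARAGUAY",
           "GUATEMALA", "EL SALVADOR", "PANAMA", "NICARAGUA", "DOMINICAN REPUBLIC",
           "HONDURAS", "HAITI")
example_string <- c("USA;BRAZIL", "USA;BRAZIL;ARGENTINA", "BRAZIL;BRAZIL;ARGENTINA",
                    "BRAZIL;ARGENTINA", "BRAZIL;BRAZIL",
                    "BRAZIL;BRAZIL;ARGENTINA;ARGENTINA")

# Split each string on ";", drop repeated countries, and count how many are in latam
has_two_latam <- vapply(
  strsplit(example_string, ";", fixed = TRUE),
  function(countries) sum(unique(countries) %in% latam) >= 2,
  logical(1)
)
has_two_latam
#> [1] FALSE  TRUE  TRUE  TRUE FALSE  TRUE
```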
UPDATE: It turns out that as.character has a 500-character "soft limit" when coercing a list containing a long character vector.
Hi everyone:
I have a long character vector of locations in R:
a=c("CANADA", "UNITED.STATES", "VIETNAM", "TAIWAN", "RUSSIAN.FEDERATION", "SENEGAL", "SOUTH.AFRICA", "MALAWI", "SLOVENIA", "BELGIUM", "ISRAEL", "HONG.KONG", "FRANCE", "PHILIPPINES", "MYANMAR", "GERMANY", "UKRAINE", "CENTRAL.AFRICAN.REPUBLIC", "COTE.D.IVOIRE", "JAPAN", "ZAMBIA", "SOUTH.KOREA", "DEM.REP.OF.CONGO", "SPAIN", "SWEDEN", "BOTSWANA", "AUSTRALIA", "CHINA", "MALAYSIA", "PAKISTAN", "ITALY", "CAMEROON", "BRAZIL", "CUBA", "DENMARK", "UGANDA", "THAILAND", "CYPRUS", "GHANA", "TANZANIA", "KENYA","MONGOLIA", "INDIA")
But when I try to convert "list(a)" to a single string using as.character (in order to store it as a whole), with this command:
as.character(list(a))
a "\n" is automatically inserted between the locations "KENYA" and "MONGOLIA". The output looks like this:
[1] "list(c(\"CANADA\", \"UNITED.STATES\", \"VIETNAM\", \"TAIWAN\", \"RUSSIAN.FEDERATION\", \"SENEGAL\", \"SOUTH.AFRICA\", \"MALAWI\", \"SLOVENIA\", \"BELGIUM\", \"ISRAEL\", \"HONG.KONG\", \"FRANCE\", \"PHILIPPINES\", \"MYANMAR\", \"GERMANY\", \"UKRAINE\", \"CENTRAL.AFRICAN.REPUBLIC\", \"COTE.D.IVOIRE\", \"JAPAN\", \"ZAMBIA\", \"SOUTH.KOREA\", \"DEM.REP.OF.CONGO\", \"SPAIN\", \"SWEDEN\", \"BOTSWANA\", \"AUSTRALIA\", \"CHINA\", \"MALAYSIA\", \"PAKISTAN\", \"ITALY\", \"CAMEROON\", \"BRAZIL\", \"CUBA\", \"DENMARK\", \"UGANDA\", \"THAILAND\", \"CYPRUS\", \"GHANA\", \"TANZANIA\", \"KENYA\", \n\"MONGOLIA\", \"INDIA\"))"
Notice the inserted "\n". When I deleted the first element (CANADA), the "\n" moved to between "TANZANIA" and "KENYA", so it looks like the "\n" is always inserted before the 42nd element.
However, when I create a vector of sequential number strings, that is, a=c("1","2"..."41","42","43"), and do the same conversion with "as.character(list(a))",
no "\n" is created!
I'm confused. Does anyone know why?
Because you are coercing a list object into a character string. If you want to convert the vector into a single string, try (as @Parfait suggested):
paste0(a, collapse = ",")
If you don't want a comma-delimited string, then change the value in the collapse argument.
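A minimal sketch of the round trip, using a shortened, hypothetical subset of the location vector: paste0 with collapse yields one string with no inserted newline, and strsplit recovers the original elements, assuming none of them contains the delimiter.

```r
# Shortened stand-in for the full location vector in the question
a <- c("CANADA", "UNITED.STATES", "KENYA", "MONGOLIA", "INDIA")

# Collapse into one comma-delimited string for storage
s <- paste0(a, collapse = ",")
s
#> [1] "CANADA,UNITED.STATES,KENYA,MONGOLIA,INDIA"

# Split on the same delimiter to get the vector back
back <- strsplit(s, ",", fixed = TRUE)[[1]]
identical(back, a)
#> [1] TRUE
```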
I have a dataset of map data, created using the following:
worldMap_df <- map_data("world") %>%
  rename(Economy = region) %>%
  filter(Economy != "Antarctica") %>%
  mutate(Economy = str_replace_all(Economy,
                                   c("Brunei" = "Brunei Darussalam",
                                     "Macedonia" = "Macedonia, FYR",
                                     "Puerto Rico" = "Puerto Rico US",
                                     "Russia" = "Russian Federation",
                                     "UK" = "United Kingdom",
                                     "USA" = "United States",
                                     "Palestine" = "West Bank and Gaza",
                                     "Saint Lucia" = "St Lucia",
                                     "East Timor" = "Timor-Leste")))
There are a number of countries (under Economy) that I am trying to merge under a single name using str_replace_all. One example is observations for which Economy is either "Trinidad" or "Tobago".
I've used the following but this seems to only partially re-label observations:
trin_tobago_vector <- c("Trinidad", "Tobago")
worldMap_df$Economy <- str_replace_all(worldMap_df$Economy, trin_tobago_vector, "Trinidad and Tobago")
However, certain observations still have Trinidad and Tobago under Economy whilst others remain Trinidad OR Tobago. Can anyone see what I'm doing wrong here?
You supply str_replace_all with a pattern that is a vector: trin_tobago_vector. It will then iterate over your 'Economy' column and check the first element with "Trinidad", the second element with "Tobago", the third with "Trinidad", and so on. You should do this replacement in two steps instead:
worldMap_df$Economy <- str_replace_all(worldMap_df$Economy, "^Trinidad$", "Trinidad and Tobago")
worldMap_df$Economy <- str_replace_all(worldMap_df$Economy, "^Tobago$", "Trinidad and Tobago")
or use a named vector:
trin_tobago_vector <- c("^Trinidad$" = "Trinidad and Tobago", "^Tobago$" = "Trinidad and Tobago")
worldMap_df$Economy <- str_replace_all(worldMap_df$Economy, trin_tobago_vector)
The ^ and $ inside the pattern vector make sure that only the literal strings "Trinidad" and "Tobago" are replaced.
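When the whole value has to match, as here, the relabelling can also be sketched in base R with %in%, which avoids regular expressions altogether. The vector Economy below is a small made-up stand-in for worldMap_df$Economy.

```r
# Hypothetical stand-in for worldMap_df$Economy
Economy <- c("Trinidad", "Tobago", "Trinidad", "Togo", "Trinidad and Tobago")

# Overwrite only the elements that are exactly "Trinidad" or exactly "Tobago"
Economy[Economy %in% c("Trinidad", "Tobago")] <- "Trinidad and Tobago"
Economy
```

Because %in% compares whole strings, values such as "Togo" or an already relabelled "Trinidad and Tobago" are left untouched.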
I'm currently cleaning some country-based data. I have approximately 1000 entries and need to replace all country codes with full country names. Some examples of the codes are below:
"SL/L/N", "Sierra Leone", "L", "Lib/Nepal", "SL2/ Nepal", "SL2/L"
My code converts all of the codes/countries correctly except one. The issue I have is that "L" stands for "Liberia" and so needs substituting, but I can't differentiate between "L"s that are within a word, e.g. "Sri Lanka", and those that stand for "Liberia". I tried using forward slashes as identifying features in the code below, but it returns NA for the "L" entries:
lut = c("Lib" = "Liberia", "Sri lanka" = "Sri Lanka", "WACC" = "West Africa", "W.Africa" = "West Africa", "SL2" = "Sri Lanka", "N" = "Nepal", "SL" = "Sierra Leone", "/L" = "/Liberia", "/L/" = "/Liberia/", "/L" = "/Liberia")
countryData$Country <- lut[countryData$Country]
Any help in turning the correct "L"s into "Liberia" but leaving "Sri Lanka" and "Sierra Leone" untouched is gratefully received.
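One way to avoid matching an "L" inside a longer word is to split each entry on "/" and look up every whole token in the table, so that only a token that is exactly "L" becomes "Liberia". The sketch below is an illustration under assumptions: the lookup table is a trimmed version of the one in the question with a plain "L" entry added, and expand_codes is a hypothetical helper name.

```r
# Trimmed lookup table; the plain "L" entry is an addition for this sketch
lut <- c("Lib" = "Liberia", "L" = "Liberia", "SL" = "Sierra Leone",
         "SL2" = "Sri Lanka", "N" = "Nepal")

# Hypothetical helper: translate each "/"-separated token that is a known code
expand_codes <- function(x, lut) {
  vapply(strsplit(x, "/", fixed = TRUE), function(tokens) {
    tokens <- trimws(tokens)            # "SL2/ Nepal" -> "SL2", "Nepal"
    hit <- tokens %in% names(lut)       # whole-token matches only
    tokens[hit] <- lut[tokens[hit]]
    paste(tokens, collapse = "/")
  }, character(1))
}

codes <- c("SL/L/N", "Sierra Leone", "L", "Lib/Nepal", "SL2/ Nepal", "SL2/L")
expanded <- expand_codes(codes, lut)
expanded
```

Full names such as "Sierra Leone" are not keys in the table, so they pass through unchanged.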