R loop with a character list

The problem here is similar to a previous one, but here we do not need to do any computation, just build lists.
I have some lists of world regions:
list.asia <- c("Central Asia", "Eastern Asia", "South-eastern Asia", "Southern Asia", "Western Asia")
list.africa <- c("Northern Africa", "Sub-Saharan Africa", "Eastern Africa", "Middle Africa", "Southern Africa", "Western Africa")
I use the R package ISOcodes (library("ISOcodes")) to produce lists of countries in ISO Alpha-3 format, as follows:
region <- subset(UN_M.49_Regions, Name %in% list.asia)
subset <- subset(UN_M.49_Countries, Code %in% unlist(strsplit(region$Children, ", ")))
subset$ISO_Alpha_3
This example, with list.asia, gives the expected result:
[1] "AFG" "ARM" "AZE" "BHR" "BGD" "BTN" "BRN" "KHM" "CHN" "HKG" "MAC" "CYP" "PRK"
[14] "GEO" "IND" "IDN" "IRN" "IRQ" "ISR" "JPN" "JOR" "KAZ" "KWT" "KGZ" "LAO" "LBN"
[27] "MYS" "MDV" "MNG" "MMR" "NPL" "OMN" "PAK" "PHL" "QAT" "KOR" "SAU" "SGP" "LKA"
[40] "PSE" "SYR" "TJK" "THA" "TLS" "TUR" "TKM" "ARE" "UZB" "VNM" "YEM"
which can easily be saved as follows:
countries.list.asia <- subset$ISO_Alpha_3
The problem is that I have a lot of regions and would prefer to use a loop.
To keep it simple, let's say I only have two lists, list.asia and list.africa. I regroup them in a new list.continent:
list.continent <- c("list.asia","list.africa")
and then I "loop" the list production, which does not work:
for(i in list.continent){
  list.loop <- sym(i)
  region <- subset(UN_M.49_Regions, Name %in% list.loop)
  subset <- subset(UN_M.49_Countries, Code %in% unlist(strsplit(region$Children, ", ")))
  paste("countries", list.loop, sep=".") <- subset$ISO_Alpha_3
  rm(region, subset, list.loop)
}
The expected results (in this case) are two new objects (class list) called countries.list.asia and countries.list.africa containing the ISO Alpha-3 codes of the countries present in these regions.
I tried replacing list.loop with !!list.loop or as.list(list.loop), but nothing works. Any idea?

Consider using an overall list rather than trying to save objects to the global environment iteratively, and use a function to return the needed output, which also avoids having to remove helper objects. In R, the list + function combination can be encapsulated with lapply (or its wrapper sapply, used here so the result keeps the list names):
# NAMED LIST OF ACTUAL OBJECTS (NOT CHARACTER VECTOR)
list.continent <- list(list.asia = list.asia, list.africa = list.africa)
# BUILD NEW LIST OF SUBSETTED ITEMS
new_list.continent <- sapply(list.continent, function(item) {
  region <- subset(UN_M.49_Regions, Name %in% item)
  sub <- subset(UN_M.49_Countries, Code %in% unlist(strsplit(region$Children, ", ")))
  return(sub$ISO_Alpha_3)
}, simplify = FALSE)
# SHOW OBJECT CONTENTS
new_list.continent$list.asia
new_list.continent$list.africa
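If you really do need standalone objects in the global environment (the question's expected result), the named list can be exported afterwards. A minimal sketch using base list2env(); the "countries." prefix is added here only to match the names the question asked for:
# OPTIONAL: export the list elements as individual global objects
names(new_list.continent) <- paste("countries", names(new_list.continent), sep = ".")
list2env(new_list.continent, envir = .GlobalEnv)
# countries.list.asia and countries.list.africa now exist as separate objects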

Related

How to identify all country names mentioned in a string and split accordingly?

I have a string that contains country and other region names. I am only interested in the country names and would ideally like to add several columns, each of which contains a country name listed in the string. Here is example code for the way the dataframe is set up:
df <- data.frame(id = c(1,2,3),
                 country = c("Cote d'Ivoire Africa Developing Economies West Africa",
                             "South Africa United Kingdom Africa BRICS Countries",
                             "Myanmar Gambia Bangladesh Netherlands Africa Asia"))
If I only split the string by spaces, countries whose names contain a space get lost (e.g. "United Kingdom"). See here:
library(tidyr)
df2 <- separate(df, country, paste0("C", 3:8), sep = " ")
Therefore, I tried to look up country names using the world.cities dataset. However, this only seems to loop through the string until there is a non-country name. See here:
library(maps)
library(stringr)
all_countries <- str_c(unique(world.cities$country.etc), collapse = "|")
df$c1 <- sapply(str_extract_all(df$country, all_countries), toString)
I am wondering whether it's possible to use the space as a delimiter but define exceptions (like "United Kingdom"). This might obviously require some manual work, but it appears to be the most feasible solution to me. Does anyone know how to define such exceptions? I am of course also open to, and thankful for, any other solutions.
UPDATE:
I figured out another solution using the countrycode package:
library(countrycode)
countries <- data.frame(countryname_dict)
countries$continent <- countrycode(sourcevar = countries[["country.name.en"]],
                                   origin = "country.name.en",
                                   destination = "continent")
africa <- countries[ which(countries$continent=='Africa'), ]
library(stringr)
pat <- paste0("\\b", paste(africa$country.name.en , collapse="\\b|\\b"), "\\b")
df$country_list <- str_extract_all(df$country, regex(pat, ignore_case = TRUE))
You could do:
library(stringi)
vec <- stri_trans_general(countrycode::codelist$country.name.en, id = "Latin-ASCII")
stri_extract_all(df$country,regex = sprintf(r"(\b(%s)\b)",stri_c(vec,collapse = "|")))
[[1]]
[1] "Cote d'Ivoire"
[[2]]
[1] "South Africa" "United Kingdom"
[[3]]
[1] "Gambia" "Bangladesh" "Netherlands"

Is there syntactic sugar to define a data frame in R

I want to regroup US states by regions and thus I need to define a "US state" -> "US Region" mapping function, which is done by setting up an appropriate data frame.
The basis is this exercise (apparently this is a map of the "Commonwealth of the Fallout"):
One starts off with an original list in raw form:
Alabama = "Gulf"
Arizona = "Four States"
Arkansas = "Texas"
California = "South West"
Colorado = "Four States"
Connecticut = "New England"
Delaware = "Columbia"
which eventually leads to this R code:
us_state <- c("Alabama","Arizona","Arkansas","California","Colorado","Connecticut",
"Delaware","District of Columbia","Florida","Georgia","Idaho","Illinois","Indiana",
"Iowa","Kansas","Kentucky","Louisiana","Maine","Maryland","Massachusetts","Michigan",
"Minnesota","Mississippi","Missouri","Montana","Nebraska","Nevada","New Hampshire",
"New Jersey","New Mexico","New York","North Carolina","North Dakota","Ohio","Oklahoma",
"Oregon","Pennsylvania","Rhode Island","South Carolina","South Dakota","Tennessee",
"Texas","Utah","Vermont","Virginia","Washington","West Virginia ","Wisconsin","Wyoming")
us_region <- c("Gulf","Four States","Texas","South West","Four States","New England",
"Columbia","Columbia","Gulf","Southeast","North West","Midwest","Midwest","Plains",
"Plains","East Central","Gulf","New England","Columbia","New England","Midwest",
"Midwest","Gulf","Plains","North","Plains","South West","New England","Eastern",
"Four States","Eastern","Southeast","North","East Central","Plains","North West",
"Eastern","New England","Southeast","North","East Central","Texas","Four States",
"New England","Columbia","North West","Eastern","Midwest","North")
us_state_to_region_map <- data.frame(us_state, us_region, stringsAsFactors=FALSE)
which is supremely ugly and unmaintainable, as the State -> Region mapping is effectively obfuscated.
I actually wrote a Perl program to generate the above from the original list.
In Perl, one would write things like:
#!/usr/bin/perl
$mapping = {
"Alabama"=> "Gulf",
"Arizona"=> "Four States",
"Arkansas"=> "Texas",
"California"=> "South West",
"Colorado"=> "Four States",
"Connecticut"=> "New England",
...etc...etc...
"West Virginia "=> "Eastern",
"Wisconsin"=> "Midwest",
"Wyoming"=> "North" };
which is maintainable because one can verify the mapping on a line-by-line basis.
There must be something similar to this Perl goodness in R?
It seems a bit open for interpretation as to what you're looking for.
Is the mapping meant to be a function-type thing such that a call would return the region, or vice versa (e.g. similar to a function call mapping("alabama") => "Gulf")?
I read the question as looking more for dictionary-style storage, which in R could be obtained with an equivalent named list:
ncountry <- 49
mapping <- as.list(c("Gulf","Four States",
...
,"Midwest","North"))
names(mapping) <- c("Alabama","Arizona",
...
,"Wisconsin","Wyoming")
mapping[["Pennsylvania"]]
[1] "Eastern"
This could be performed in a single call
mapping <- list("Alabama" = "Gulf",
"Arizona" = "Four States",
...,
"Wisconsin" = "Midwest",
"Wyoming" = "North")
Which makes it very simple to check that the mapping is working as expected. This doesn't convert nicely to a 2-column data.frame, however, which we would then obtain using:
mapping_df <- data.frame(region = unlist(mapping), state = names(mapping))
note "not nicely" simply means as.data.frame doesn't translate the input into a 2 column output.
Alternatively just using a named character vector would likely be fine too
mapping_c <- c("Alabama" = "Gulf",
"Arizona" = "Four States",
...,
"Wisconsin" = "Midwest",
"Wyoming" = "North")
which would be converted to a data.frame in almost the same fashion
mapping_df_c <- data.frame(region = mapping_c, state = names(mapping_c))
Note, however, a slight difference between the two choices of storage. While referencing an entry that exists using either single brackets [ or double brackets [[ works just fine:
#Works:
mapping_c["Pennsylvania"] == mapping["Pennsylvania"]
#output
Pennsylvania
TRUE
mapping_c[["Pennsylvania"]] == mapping[["Pennsylvania"]]
[1] TRUE
But when referencing unknown entries, the two differ slightly in behaviour:
#works sorta:
mapping_c["hello"] == mapping["hello"]
#output
$<NA>
NULL
#Does not work:
mapping_c[["hello"]] == mapping[["hello"]]
Error in mapping_c[["hello"]] : subscript out of bounds
If you are converting your input into a data.frame this is not an issue, but it is worth being aware of, so you obtain the behaviour you expect.
Of course you could use a function call to create a proper dictionary with a simple switch statement. I don't think that would be any prettier though.
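For illustration, a minimal sketch of that switch-based approach (most states elided):
state_to_region <- function(state) {
  switch(state,
         "Alabama" = "Gulf",
         "Arizona" = "Four States",
         "Wisconsin" = "Midwest",
         "Wyoming" = "North",
         NA_character_) # default for unmapped states
}
state_to_region("Arizona")
# [1] "Four States"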
If us_region is a named list...
us_region <- list(Alabama = "Gulf",
Arizona = "Four States",
Arkansas = "Texas",
California = "South West",
Colorado = "Four States",
Connecticut = "New England",
Delaware = "Columbia")
Then,
us_state_to_region_map <- data.frame(us_state = names(us_region),
us_region = sapply(us_region, c),
stringsAsFactors = FALSE)
and, as a bonus, you also get the states as row names...
us_state_to_region_map
us_state us_region
Alabama Alabama Gulf
Arizona Arizona Four States
Arkansas Arkansas Texas
California California South West
Colorado Colorado Four States
Connecticut Connecticut New England
Delaware Delaware Columbia
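Equivalently, unlist() collapses the named list to a character vector, so this sketch builds the same frame:
us_state_to_region_map <- data.frame(us_state = names(us_region),
                                     us_region = unlist(us_region),
                                     stringsAsFactors = FALSE)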
As @tim-biegeleisen says, it could be more appropriate to maintain this dataset in a database, a CSV file or a spreadsheet and open it in R (with readxl::read_excel(), readr::read_csv(), ...).
However, if you want to write it directly in your code you can use tibble::tribble(), which allows you to write a dataframe row by row:
library(tibble)
tribble(~ state, ~ region,
"Alabama", "Gulf",
"Arizona", "Four States",
(...)
"Wisconsin", "Midwest",
"Wyoming", "North")
One option could be to create a data frame in wide format (your initial list makes this very straightforward, and it maintains a very obvious mapping; it is actually quite similar to your Perl code), then transform it to long format:
library(tidyr)
data.frame(
  Alabama = "Gulf",
  Arizona = "Four States",
  Arkansas = "Texas",
  California = "South West",
  Colorado = "Four States",
  Connecticut = "New England",
  Delaware = "Columbia",
  stringsAsFactors = FALSE
) %>%
  gather("us_state", "us_region") # transform to long format

Remove Trailing Whitespace and Consolidate Potentially Duplicated Factors in R [duplicate]

This question already has answers here:
How can I trim leading and trailing white space?
(15 answers)
Closed 3 years ago.
I want to change a level name (e.g. "Africa ") to another already available level (e.g. "Africa") in a categorical variable; that is, some levels carry the same descriptor with trailing whitespace while others do not. These values, in the Continent column, are currently stored as a factor in a dataframe.
I tried a series of nested ifelse calls, but I got weird results:
data.CONTINENT$Continent_R <- ifelse(data.CONTINENT$Continent=="Africa ", "Africa",
                              ifelse(data.CONTINENT$Continent=="Asia ", "Asia",
                              ifelse(data.CONTINENT$Continent=="Europe ", "Europe",
                              ifelse(data.CONTINENT$Continent=="Europe ", "Europe",
                              ifelse(data.CONTINENT$Continent=="Multi ", "Multi",
                              ifelse(data.CONTINENT$Continent=="North America ", "North America",
                              ifelse(data.CONTINENT$Continent=="South America ", "South America",
                                     data.CONTINENT$Continent)))))))
table(data.CONTINENT$Continent_R)
The prior code did not produce the levels I expected. Any advice will be greatly appreciated.
I would use the amazing forcats package.
library(forcats)
data.CONTINENT$Continent_R <- fct_collapse(data.CONTINENT$Continent,
                                           Africa = c("Africa", "Africa "),
                                           `South America` = c("South America", "South America "))
Programmatically, if all you wanted to do was remove the trailing whitespace, you could do something like:
# the regex '\\s+$' matches any run of whitespace at the end of the string
data.CONTINENT$Continent_R %>% fct_relabel(~ gsub("\\s+$", "", .x))
If all you're trying to do is remove whitespace, just use the base trimws function (or stringr::str_trim, although I don't know what advantage it has, if any). Replace the levels with their trimmed versions.
You didn't include a reproducible version of data, so I'm creating it by pasting continent names with randomly sampled empty strings or single spaces.
set.seed(123)
data.CONTINENT <- data.frame(
  Continent = paste0(sample(c("Africa", "Asia", "South America"), 10, replace = T),
                     sample(c("", " "), 10, replace = T)),
  stringsAsFactors = TRUE # needed in R >= 4.0, where strings no longer default to factors
)
levels(data.CONTINENT$Continent)
#> [1] "Africa" "Asia" "Asia " "South America"
#> [5] "South America "
Version one: replace the labels with their trimmed versions, and set it back to being a factor.
factor(data.CONTINENT$Continent, labels = trimws(levels(data.CONTINENT$Continent)))
#> [1] South America South America South America Asia South America
#> [6] Asia Asia Asia South America Africa
#> Levels: Africa Asia South America
Version two: use forcats and just pass the name of the function you want applied to the labels. This gives the same output as above.
forcats::fct_relabel(data.CONTINENT$Continent, trimws)
There are a lot of potential approaches here. You could:
Manually replace them one at a time:
data.CONTINENT$Continent[which(data.CONTINENT$Continent=="Africa ")] <- "Africa"
Use a look-up table to replace them all at once:
lut <- data.frame(old = c('Africa ', 'South America '),
                  new = c('Africa', 'South America'))
# copy data to a new column to avoid over-writing data
data.CONTINENT$Continent_R <- data.CONTINENT$Continent
# replace only the 'old' values with the 'new' values in the look-up-table
data.CONTINENT$Continent_R[which(data.CONTINENT$Continent %in% lut$old)] <-
  lut$new[match(data.CONTINENT$Continent[which(data.CONTINENT$Continent %in% lut$old)], lut$old)]
# You may want to re-factor the column afterwards if you use it as a factor variable, so that old levels that are no longer present get dropped.
If the only issues are extra spaces before and/or after entries, then you can just use the trimws() function.
Use the dplyr::recode() function.
data.CONTINENT$Continent_R <- dplyr::recode(data.CONTINENT$Continent, 'Africa ' = 'Africa', 'South America ' = 'South America')
And there are probably 20 other ways of doing things, using functions like dplyr's joins or switch.
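For completeness, a minimal base-R sketch: assigning trimmed values back to levels() merges any levels that become duplicates, so the whitespace variants collapse in place:
# trim whitespace directly on the factor levels; duplicated results are merged
levels(data.CONTINENT$Continent) <- trimws(levels(data.CONTINENT$Continent))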

improve nested ifelse statement in r

I have more than 10k address records that look like "XXX street, city, state, US", stored in a character vector.
I want to group them by state, so I used nested ifelse calls to build an address data.frame with two variables, add_info and state.
library(stringr)
for (i in 1:nrow(address)) {
  ifelse(str_detect(address, 'Alabama'), address[i, state]='Alabama',
  ifelse(str_detect(address, 'Alaska'), address[i, state]='Alaska',
  ifelse(str_detect(address, 'Arizona'), address[i, state]='Arizona',
  ...
  ifelse(str_detect(address, 'Wyoming'), address[i, state]='Wyoming', address[i, state]=NA)...)
}
Of course, this is extremely inefficient, but I don't know how to rewrite this nested ifelse. Any ideas?
There are many ways to approach this problem. This is one approach assuming that your address string always contains the full spelling of only one US state.
library(stringr)
# Get a list of all states
state.list = scan(text = "Alabama, Alaska, Arizona, Arkansas, California, Colorado, Connecticut, Delaware, Florida, Georgia, Hawaii, Idaho, Illinois, Indiana, Iowa, Kansas, Kentucky, Louisiana, Maine, Maryland, Massachusetts, Michigan, Minnesota, Mississippi, Missouri, Montana, Nebraska, Nevada, New Hampshire, New Jersey, New Mexico, New York, North Carolina, North Dakota, Ohio, Oklahoma, Oregon, Pennsylvania, Rhode Island, South Carolina, South Dakota, Tennessee, Texas, Utah, Vermont, Virginia, Washington, West Virginia, Wisconsin, Wyoming", what = "", sep = ",", strip.white = T)
# Extract state from vector address using library(stringr)
state = unlist(sapply(address, function(x) state.list[str_detect(x, state.list)]))
# Generate fake data to test
fake.address = paste0(replicate(10, paste(sample(c(0:9, LETTERS), 10, replace=TRUE), collapse="")),
                      sample(state.list, 20, rep = T),
                      replicate(10, paste(sample(c(0:9, LETTERS), 10, replace=TRUE), collapse="")))
# Test using fake address
unlist(sapply(fake.address, function(x) state.list[str_detect(x, state.list)]))
Output for fake address
O4H8V0NYEHColoradoA5K5XK35LX 44NDPQVMZ8UtahMY0I4M3086 LJ0LJW8BOBFloridaP5H2QW8B81 521IHHC1MFCaliforniaG7QTYCJRO5
"Colorado" "Utah" "Florida" "California"
YESTB7R6EPRhode IslandXEEGD4GEY3 5OHN2BR29HKansasCOKR9DY1WJ 4UXNJQW0QKNew MexicoH9GVQR3ZFY 5SYELTKO5HTexas3ONM1HU1VB
"Rhode Island" "Kansas" "New Mexico" "Texas"
Z8MKKL7K1RWashingtonGEBS7LJUU0 WPRSQEI2CNIndiana141S0Z1M2E O4H8V0NYEHNorth DakotaA5K5XK35LX 44NDPQVMZ8New HampshireMY0I4M3086
"Washington" "Indiana" "North Dakota" "New Hampshire"
LJ0LJW8BOBWest VirginiaP5H2QW8B811 LJ0LJW8BOBWest VirginiaP5H2QW8B812 521IHHC1MFNew JerseyG7QTYCJRO5 YESTB7R6EPWisconsinXEEGD4GEY3
"Virginia" "West Virginia" "New Jersey" "Wisconsin"
5OHN2BR29HOregonCOKR9DY1WJ 4UXNJQW0QKOhioH9GVQR3ZFY 5SYELTKO5HRhode Island3ONM1HU1VB Z8MKKL7K1ROklahomaGEBS7LJUU0
"Oregon" "Ohio" "Rhode Island" "Oklahoma"
WPRSQEI2CNIowa141S0Z1M2E
"Iowa"
Edit: use the following function, based on agrep(), for fuzzy matching; it should work with minor spelling mistakes. The code calls the index-assign operator [<- functionally:
unlist(sapply(fake.address, function(x) state.list[`[<-`((L <- as.logical(sapply(state.list, function(s) agrep(s, x)*1))), is.na(L), F)]))
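A more readable sketch of the same agrep() idea, unrolled into a small helper:
# fuzzy-match one address against every state name with agrep()
fuzzy_state <- function(x, states = state.list) {
  hits <- vapply(states, function(s) length(agrep(s, x)) > 0, logical(1))
  states[hits]
}
unlist(lapply(fake.address, fuzzy_state))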
Assuming that your formatting is consistent (per Joran's comment above), you could just parse with strsplit and then use data.frame:
address1<-"410 West Street, Small Town, MN, US"
address2<-"5844 Green Street, Foo Town, NY, US"
address3<-"875 Cardinal Lane, Placeville, CA, US"
vector<-c(address1,address2,address3)
df<-t(data.frame(strsplit(vector,", "))
colnames(df)<-c("Number","City","State","Country")
rownames(df)<-NULL
df
which produces:
Number City State Country
[1,] "410 West Street" "Small Town" "MN" "US"
[2,] "5844 Green Street" "Foo Town" "NY" "US"
[3,] "875 Cardinal Lane" "Placeville" "CA" "US"
There are several methods.
First we need some sample data.
# some sample data
set.seed(123)
dat <- data.frame(addr=sprintf('123 street, Townville, %s, US',
                               sample(state.name, 25, replace=T)),
                  stringsAsFactors=F)
If your data is super regular like that:
# the easy way, split on commas:
matrix(unlist(strsplit(dat$addr, ',')), ncol=4, byrow=T)
Method 2: use grep to search for values. This works even if there are no commas, or different delimiters in different rows (as long as the states always appear spelled the same way).
# get a list of state name matches; need to match ', state name,' otherwise
# West Virginia counts as Virginia...
matches <- sapply(paste0(', ', state.name, ','), grep, dat$addr)
# now pair up the state name with the row it matches to
state_df <- data.frame(state=rep(state.name, sapply(matches, length)),
                       row=unname(unlist(matches)),
                       stringsAsFactors=F)
# reorder based on position in original data.frame, and there you go!
dat$state <- state_df[order(state_df$row), 'state']
This seemed to be working in my tests:
just.ST <- gsub( paste0(".+(", paste(state.name,collapse="|"), ").+$"),
"\\1", address)
As mentioned in comments and illustrated in other answers, state.name should be available by default. It does have the deficiency that in case of a non-match it returns the whole string, but you can probably use:
is.na(just.ST) <- nchar(just.ST) > max(nchar(state.name))
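For instance, with a hypothetical two-element input, the non-match comes back as NA:
address <- c("410 West Street, Small Town, Minnesota, US",
             "this string mentions no state at all")
just.ST <- gsub(paste0(".+(", paste(state.name, collapse="|"), ").+$"), "\\1", address)
is.na(just.ST) <- nchar(just.ST) > max(nchar(state.name))
just.ST
# [1] "Minnesota" NA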

R: finding specific numbers of characters in a character array

I want to find states with exactly two Os in the name. I tried this:
> data(state)
> index=grep('o.*o',state.name)
> state.name[index]
"Colorado" "North Carolina" "North Dakota" "South Carolina" "South Dakota"
Problem: there are three Os in "Colorado" and I don't want it. How can I revise my regex?
I also want to do three Os:
> data(state)
> index=grep('o.*o.*o',state.name)
> state.name[index]
"Colorado"
Is there a simpler way to do this?
You can do:
grep('^([^o]*o[^o]*){2}$', state.name, value = TRUE)
# [1] "North Carolina" "North Dakota"
# [3] "South Carolina" "South Dakota"
grep('^([^o]*o[^o]*){3}$', state.name, value = TRUE)
# [1] "Colorado"
and as GSee suggested, you can add ignore.case = TRUE if you want to include states with a capital O like Ohio, Oklahoma, and Oregon.
Michael's response is definitely more elegant, but here's the brute-force method:
state.name[sapply(strsplit(tolower(state.name), NULL), function(x) sum(x %in% "o") == 2)]
You should ensure that the other characters that you're matching, besides the two matching Os, are not Os:
grep("^[^o]*o[^o]*o[^o]*$", state.name, value = TRUE)
A solution using ?gregexpr: a little ugly, but it generalizes well to other regexes. (Don't forget the capital O in Ohio.)
state.name[sapply(state.name,function(x) length(unlist(gregexpr("o|O",x)))) == 2]
Count the number of Os in each state name:
State <- c("North Dakota","Ohio","Colorado","South Dakota")
nos <- nchar(gsub("[^oO]","",State))
State[nos==2]
State[nos==3]
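The same count with stringr (a sketch; str_count does the counting, and tolower() keeps Ohio's capital O in the tally):
library(stringr)
nos <- str_count(tolower(State), "o")
State[nos == 2]
State[nos == 3]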
