Remove Trailing Whitespace and Consolidate Potentially Duplicated Factors in R

I want to change a level name (e.g. "Africa ") to another level that already exists (e.g. "Africa") in a categorical variable: some levels carry trailing whitespace while others with the same descriptor do not. These values, in the Continent column, are currently stored as a factor in a data frame.
I tried a series of nested ifelse() calls on my gigantic dataset, but I got weird results:
data.CONTINENT$Continent_R <- ifelse(data.CONTINENT$Continent == "Africa ", "Africa",
                              ifelse(data.CONTINENT$Continent == "Asia ", "Asia",
                              ifelse(data.CONTINENT$Continent == "Europe ", "Europe",
                              ifelse(data.CONTINENT$Continent == "Multi ", "Multi",
                              ifelse(data.CONTINENT$Continent == "North America ", "North America",
                              ifelse(data.CONTINENT$Continent == "South America ", "South America",
                                     data.CONTINENT$Continent))))))
table(data.CONTINENT$Continent_R)
Here is what I got based on the prior code: not the cleaned-up levels I expected. Any advice will be greatly appreciated.

I would use the amazing forcats package. (As an aside, nested ifelse() misbehaves here because ifelse() drops attributes, so it does not play well with factor columns.)
library(forcats)
data.CONTINENT$Continent_R <- fct_collapse(data.CONTINENT$Continent,
                                           Africa = c("Africa", "Africa "),
                                           `South America` = c("South America", "South America "))
Programmatically, if all you wanted to do was remove the trailing whitespace, you could do something like:
# the regex '\\s+$' matches one or more whitespace characters at the end of the string
data.CONTINENT$Continent_R %>% fct_relabel(~ gsub("\\s+$", "", .x))

If all you're trying to do is remove whitespace, just use the base trimws function (or stringr::str_trim, although I don't know what advantage it has, if any). Replace the levels with their trimmed versions.
You didn't include a reproducible version of the data, so I'm creating one by pasting continent names together with randomly sampled empty strings or single spaces.
set.seed(123)
data.CONTINENT <- data.frame(
  Continent = paste0(sample(c("Africa", "Asia", "South America"), 10, replace = TRUE),
                     sample(c("", " "), 10, replace = TRUE)),
  stringsAsFactors = TRUE  # needed in R >= 4.0, where data.frame() no longer converts strings to factors
)
levels(data.CONTINENT$Continent)
#> [1] "Africa" "Asia" "Asia " "South America"
#> [5] "South America "
Version one: replace the labels with their trimmed versions, and set it back to being a factor.
factor(data.CONTINENT$Continent, labels = trimws(levels(data.CONTINENT$Continent)))
#> [1] South America South America South America Asia South America
#> [6] Asia Asia Asia South America Africa
#> Levels: Africa Asia South America
Version two: use forcats and just pass the name of the function you need applied to the labels. Gets same output as above.
forcats::fct_relabel(data.CONTINENT$Continent, trimws)

There are a lot of potential approaches here. You could:
Manually replace them one at a time:
data.CONTINENT$Continent[which(data.CONTINENT$Continent=="Africa ")] <- "Africa"
Use a look-up table to replace them all at once:
lut <- data.frame(old = c("Africa ", "South America "),
                  new = c("Africa", "South America"))
# copy data to a new column to avoid over-writing the original
data.CONTINENT$Continent_R <- data.CONTINENT$Continent
# replace only the 'old' values with the 'new' values from the look-up table
idx <- which(data.CONTINENT$Continent %in% lut$old)
data.CONTINENT$Continent_R[idx] <- lut$new[match(data.CONTINENT$Continent[idx], lut$old)]
# You may want to re-factor the column afterwards (e.g. with droplevels()) if you use it
# as a factor variable, so that old levels that are no longer present get removed.
If the only issues are extra spaces before and/or after entries, then you can just use the trimws() function.
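For example (a minimal sketch: assigning duplicated labels via levels<- merges the corresponding old levels into one):
levels(data.CONTINENT$Continent_R) <- trimws(levels(data.CONTINENT$Continent_R))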
Use the dplyr::recode() function.
data.CONTINENT$Continent_R <- dplyr::recode(data.CONTINENT$Continent, 'Africa ' = 'Africa', 'South America ' = 'South America')
And there are probably 20 other ways of doing things, with functions like dplyr::left_join() or switch(); a join-based sketch follows below.
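For completeness, a hedged sketch of the join idea, reusing the lut look-up table from above (column names are illustrative; coalesce() keeps values that had no match unchanged):
library(dplyr)
data.CONTINENT %>%
  left_join(lut, by = c("Continent" = "old")) %>%
  mutate(Continent_R = coalesce(new, as.character(Continent))) %>%
  select(-new)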

Related

Most commonly mentioned countries in the corpus; extracting country names from abstracts R

I have a corpus of a couple of thousand documents and I'm trying to find the most commonly mentioned countries in the abstracts.
The library countrycode seems to have a comprehensive list of country names I can match against:
# country.name.alt shows multiple potential namings for 'Congo' (yay!):
# install.packages("countrycode")
library(dplyr)
countrycode::countryname_dict |> filter(grepl('congo', tolower(country.name.alt)))
# Also seems to work for ones like "China"/"People's Republic of China"
A reprex of the data looks something like this:
df <- data.frame(entry_number = 1:5,
                 text = c("a few paragraphs that might contain the country name congo or democratic republic of congo",
                          "More text that might contain myanmar or burma, as well as thailand",
                          "sentences that do not contain a country name can be returned as NA",
                          "some variant of U.S or the united states",
                          "something with an accent samóoa"))
I want to reduce each entry in the column "text" to contain only a country name. Ideally something like this (note the repeat entry number):
desired_df <- data.frame(entry_number = c(1, 2, 2, 3, 4, 5),
                         text = c("congo",
                                  "myanmar",
                                  "thailand",
                                  NA,
                                  "united states",
                                  "samoa"))
I've made attempts with str_extract and various other approaches, all of which failed! The corpus is in English, but the international alphabets included in countrycode::countryname_dict$country.name.alt do throw regex errors. countrycode::countryname_dict$country.name.alt contains all the alternatives that countrycode::countryname_dict$country.name.en does not...
Open to any approach (dplyr, data.table, ...) that answers the initial question of how many times each country is mentioned in the corpus. The only requirement is that it be as robust as possible to different potential country names, accents, and any other hidden catches!
Thanks community!
P.S. I have reviewed the following questions, but had no luck with my own example:
Matching an extracting country name from character string in R
extract country names (or other entity) from column
Extracting country names in R
Extracting Country Name from Author Affiliations
This seems to work well on the example data.
library(tidyverse)
all_country <- countrycode::countryname_dict %>%
  filter(grepl('[A-Za-z]', country.name.alt)) %>%
  pull(country.name.alt) %>%
  tolower()
pattern <- str_c(all_country, collapse = '|')
df %>%
  mutate(country = str_extract_all(tolower(text), pattern)) %>%
  select(-text) %>%
  unnest(country, keep_empty = TRUE)
# entry_number country
# <int> <chr>
#1 1 congo
#2 1 democratic republic of congo
#3 2 myanma
#4 2 burma
#5 2 thailand
#6 3 NA
#7 4 united states
#8 5 samóoa
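Since the ultimate goal was how many times each country is mentioned across the corpus, a small follow-up sketch that tallies the unnested matches from the pipeline above:
df %>%
  mutate(country = str_extract_all(tolower(text), pattern)) %>%
  unnest(country) %>%
  count(country, sort = TRUE)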

How to identify all country names mentioned in a string and split accordingly?

I have a string that contains country and other region names. I am only interested in the country names and would ideally like to add several columns, each of which contains a country name listed in the string. Here is exemplary code for the way the data frame is set up:
df <- data.frame(id = c(1, 2, 3),
                 country = c("Cote d'Ivoire Africa Developing Economies West Africa",
                             "South Africa United Kingdom Africa BRICS Countries",
                             "Myanmar Gambia Bangladesh Netherlands Africa Asia"))
If I only split the string by space, those countries which contain a space get lost (e.g. "United Kingdom"). See here:
library(tidyr)
df2 <- separate(df, country, paste0("C", 3:8), sep = " ")
Therefore, I tried to look up country names using the world.cities dataset. However, this only seems to work through the string until it hits a non-country name. See here:
library(maps)
library(stringr)
all_countries <- str_c(unique(world.cities$country.etc), collapse = "|")
df$c1 <- sapply(str_extract_all(df$country, all_countries), toString)
I am wondering whether it's possible to use the space a delimiter but define exceptions (like "United Kingdom"). This might obviously require some manual work, but appears to be most feasible solution to me. Does anyone know how to define such exceptions? I am of course also open to and thankful for any other solutions.
UPDATE:
I figured out another solution using the countrycode package:
library(countrycode)
countries <- data.frame(countryname_dict)
countries$continent <- countrycode(sourcevar = countries[["country.name.en"]],
                                   origin = "country.name.en",
                                   destination = "continent")
africa <- countries[which(countries$continent == 'Africa'), ]
library(stringr)
pat <- paste0("\\b", paste(africa$country.name.en, collapse = "\\b|\\b"), "\\b")
df$country_list <- str_extract_all(df$country, regex(pat, ignore_case = TRUE))
You could do:
library(stringi)
vec <- stri_trans_general(countrycode::codelist$country.name.en, id = "Latin-ASCII")
stri_extract_all(df$country, regex = sprintf(r"(\b(%s)\b)", stri_c(vec, collapse = "|")))
[[1]]
[1] "Cote d'Ivoire"

[[2]]
[1] "South Africa"   "United Kingdom"

[[3]]
[1] "Gambia"      "Bangladesh"  "Netherlands"

R find replace in data frame

I tried to find an answer for this in other posts but nothing seemed to be working.
I have a data set where people answered the city they were in using a free response format. Therefore for each city, people identified in many different ways. For example, those living in Atlanta might have written "Atlanta", "atlanta", "Atlanta, GA" and so on.
There are 12 cities represented in this data set. I'm trying to clean this variable so each city is written consistently. Is there a way to do this efficiently for each city?
I've tried mutate_if and str_replace_all but can't seem to figure it out (see my code below)
all_data_city <- mutate_if(all_data_city, is.character,
                           str_replace_all, pattern = "Atlanta, GA",
                           replacement = "Atlanta")
all_data_city %>%
  str_replace_all(c("Atlanta, GA" & "HCA Atlanta" & "HCC Atlanta" &
                    "Suwanee" & "Suwanee, GA" & "suwanee"), = "Atlanta")
If we need to pass a vector of elements to be replaced, paste them together with | as pattern and replace with 'Atlanta'
library(dplyr)
library(stringr)
pat <- str_c(c("Atlanta, GA", "HCA Atlanta", "HCC Atlanta",
               "Suwanee", "Suwanee, GA", "suwanee"), collapse = "|")
# apply the replacement to every character column of the data frame
all_data_city %>%
  mutate(across(where(is.character), ~ str_replace_all(.x, pat, "Atlanta")))
Using a reproducible example with iris
iris %>%
  transmute(Species = str_replace_all(Species,
                                      str_c(c("set", "versi"), collapse = "|"), "hello")) %>%
  pull(Species) %>%
  unique
pull(Species) %>%
unique
#[1] "helloosa" "hellocolor" "virginica"
Questions on data cleaning are difficult to answer, as answers strongly depend on the data.
Proposed solutions may work for a (small) sample dataset but may fail for a (large) production dataset.
In this case, I see two possible approaches:
Collecting all possible ways of writing a city's name and replacing these different variants with the desired city name. This can be achieved with str_replace() or with a join; it is safe but tedious (a join-based sketch follows the example below).
Looking for a matching character string within the city name and replace if found.
Below is a blueprint which can be extended for other use cases. For demonstration, a data.frame with one column city is created:
library(dplyr)
library(stringr)
data.frame(city = c("Atlanta, GA", "HCA Atlanta", "HCC Atlanta",
                    "Suwanee", "Suwanee, GA", "suwanee", "Atlantic City")) %>%
  mutate(city_new = case_when(
    str_detect(city, regex("Atlanta|Suwanee", ignore_case = TRUE)) ~ "Atlanta",
    TRUE ~ as.character(city)
  ))
           city      city_new
1   Atlanta, GA       Atlanta
2   HCA Atlanta       Atlanta
3   HCC Atlanta       Atlanta
4       Suwanee       Atlanta
5   Suwanee, GA       Atlanta
6       suwanee       Atlanta
7 Atlantic City Atlantic City
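And a hedged sketch of the join variant of approach 1 (the variant list here is illustrative, not exhaustive; coalesce() leaves unmatched cities unchanged):
variants <- data.frame(city = c("Atlanta, GA", "HCA Atlanta", "HCC Atlanta",
                                "Suwanee", "Suwanee, GA", "suwanee"),
                       city_new = "Atlanta")
data.frame(city = c("Atlanta, GA", "HCA Atlanta", "Atlantic City")) %>%
  left_join(variants, by = "city") %>%
  mutate(city_new = coalesce(city_new, city))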

Sorting character vectors with custom order in R

I'm solving a task for my online course in R. We have the following two vectors:
Country<-c("Egypt","Peru","Belgium","Bulgaria","China","Russia")
Capital<-c("Brussels","Kairo","Moscow","Beijing","Sofia","Lima")
The task is to order the vectors and output:
Capital is the capital of Country
in the console, sorted in the right order. I've solved the task using the cat()-function:
cat(Capital[1]," is the capital of ",Country[3])
Is there a better way to do it, instead of calling the cat() function for every country-capital pair?
We could attempt a more "sophisticated" approach.
First we get a list with countries and their capitals from the internet using the rvest package, e.g.
library(rvest)
doc <- read_html("http://techslides.com/list-of-countries-and-capitals")
countries <- as.data.frame(html_table(doc, fill=TRUE, header=TRUE))
> head(countries, 3)
   Country.Name Capital.Name Capital.Latitude Capital.Longitude Country.Code Continent.Name
1   Afghanistan        Kabul         34.51667          69.18333           AF           Asia
2 Aland Islands    Mariehamn         60.11667          19.90000           AX         Europe
3       Albania       Tirana         41.31667          19.81667           AL         Europe
Using your country vector
Country <- c("Egypt", "Peru", "Belgium", "Bulgaria", "China", "Russia")
matching it against the countries data frame yields
ind <- countries$Country.Name %in% Country
paste(countries$Capital.Name[ind], 'is the capital of',
      countries$Country.Name[ind])
[1] "Brussels is the capital of Belgium" "Sofia is the capital of Bulgaria"
[3] "Beijing is the capital of China"    "Cairo is the capital of Egypt"
[5] "Lima is the capital of Peru"        "Moscow is the capital of Russia"

State name to abbreviation

I have a large file with a variable state that has full state names. I would like to replace it with the state abbreviations (that is, "NY" for "New York"). Is there an easy way to do this (apart from using several if-else commands)? Maybe using a replace() statement?
R has two built-in constants that might help: state.abb with the abbreviations, and state.name with the full names. Here is a simple usage example:
> x <- c("New York", "Virginia")
> state.abb[match(x,state.name)]
[1] "NY" "VA"
1) grep the full name from state.name and use that to index into state.abb:
state.abb[grep("New York", state.name)]
## [1] "NY"
1a) or using which:
state.abb[which(state.name == "New York")]
## [1] "NY"
2) or create a vector of state abbreviations whose names are the full names and index into it using the full name:
setNames(state.abb, state.name)["New York"]
## New York
## "NY"
Unlike (1), this one works even if "New York" is replaced by a vector of full state names, e.g. setNames(state.abb, state.name)[c("New York", "Idaho")]
Old post I know, but wanted to throw mine in there. I learned on tidyverse, so for better or worse I avoid base R when possible. I wanted one with DC too, so first I built the crosswalk:
library(tidyverse)
st_crosswalk <- tibble(state = state.name) %>%
  bind_cols(tibble(abb = state.abb)) %>%
  bind_rows(tibble(state = "District of Columbia", abb = "DC"))
Then I joined it to my data:
left_join(data, st_crosswalk, by = "state")
I found the built-in state.name and state.abb cover only the 50 states. I got a bigger table (including DC and so on) from the web (e.g., this link: http://www.infoplease.com/ipa/A0110468.html) and pasted it into a .csv file named States.csv. I then load the state names and abbreviations from this file instead of using the built-ins. The rest is quite similar to @Aniko's answer.
library(dplyr)
library(stringr)
library(stringdist)
# load data
data <- c("NY", "New York", "NewYork")
data <- toupper(data)
# load state names and abbreviations from the hand-made file described above
State.data <- read.csv('States.csv')
State <- toupper(State.data$State)
Stateabb <- as.vector(State.data$Abb)
# match data against state names; a misspelling of 1 letter is allowed
match <- amatch(data, State, maxDist = 1)
data[!is.na(match)] <- Stateabb[na.omit(match)]
There's a small difference between match and amatch in how they calculate the distance from one word to another. See P25-26 here http://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf
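A quick illustration of that difference (a sketch; the strings are made up): match() needs an exact hit, while amatch() tolerates a small edit distance.
library(stringdist)
match("NEWYORK", c("NEW YORK", "NEVADA"))               # NA: no exact match
amatch("NEWYORK", c("NEW YORK", "NEVADA"), maxDist = 1) # 1: within one edit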
You can also use base::abbreviate if you don't have US state names. This won't give you equally sized abbreviations unless you increase minlength.
abbreviate(state.name, minlength = 1)
Here is another way of doing it, in case you have more than one state in your data and you want to replace the names with the corresponding abbreviations.
# create a data frame of state names
states_df <- c("Alabama", "California", "Nevada", "New York",
               "Oregon", "Texas", "Utah", "Washington")
states_df <- as.data.frame(states_df)
The output is
> print(states_df)
   states_df
1    Alabama
2 California
3     Nevada
4   New York
5     Oregon
6      Texas
7       Utah
8 Washington
Now, using the built-in state.abb constant, you can easily convert the names into abbreviations (and back again with state.name).
states_df$state_code <- state.abb[match(states_df$states_df, state.name)]
> print(states_df)
   states_df state_code
1    Alabama         AL
2 California         CA
3     Nevada         NV
4   New York         NY
5     Oregon         OR
6      Texas         TX
7       Utah         UT
8 Washington         WA
If matching state names to abbreviations (or the other way around) is something you have to do frequently, you could put Aniko's solution in a function in your .Rprofile or in a package:
state_to_st <- function(x){
  c(state.abb, 'DC')[match(x, c(state.name, 'District of Columbia'))]
}
st_to_state <- function(x){
  c(state.name, 'District of Columbia')[match(x, c(state.abb, 'DC'))]
}
Using those functions as part of a dplyr chain (enframe() comes from the tibble package):
library(tibble)
library(dplyr)
enframe(state.name, value = 'state_name') %>%
  mutate(state_abbr = state_to_st(state_name))
