I tried to find an answer for this in other posts but nothing seemed to be working.
I have a data set in which people reported the city they were in using a free-response field, so each city is written in many different ways. For example, those living in Atlanta might have written "Atlanta", "atlanta", "Atlanta, GA" and so on.
There are 12 cities represented in this data set. I'm trying to clean this variable so each city is written consistently. Is there a way to do this efficiently for each city?
I've tried mutate_if and str_replace_all but can't seem to figure it out (see my code below)
all_data_city <- mutate_if(all_data_city, is.character,
str_replace_all, pattern = "Atlanta, GA",
replacement = "Atlanta")
all_data_city %>%
str_replace_all(c("Atlanta, GA" & "HCA Atlanta" & "HCC Atlanta" &
"Suwanee" & "Suwanee, GA" & "suwanee"), = "Atlanta")
To replace several variants at once, paste them together with | to form a single regex pattern and replace every match with 'Atlanta':
library(dplyr)
library(stringr)
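# list longer variants first: with "Suwanee" before "Suwanee, GA",
# the result would be "Atlanta, GA" instead of "Atlanta"
pat <- str_c(c("Atlanta, GA", "HCA Atlanta", "HCC Atlanta",
               "Suwanee, GA", "Suwanee", "suwanee"), collapse = "|")
# apply the replacement to every character column of the data frame
all_data_city %>%
  mutate(across(where(is.character), ~ str_replace_all(.x, pat, "Atlanta")))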
Using a reproducible example with iris
iris %>%
transmute(Species = str_replace_all(Species,
str_c(c("set", "versi"), collapse="|"), "hello")) %>%
pull(Species) %>%
unique
#[1] "helloosa" "hellocolor" "virginica"
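One caveat with a plain alternation is that it also replaces matches inside longer strings. A minimal base R sketch, assuming each value should be replaced only when the whole string is one of the known variants (the `variants` and `x` vectors are illustrative, not taken from the real data):

```r
# Known spelling variants for one city (taken from the question).
variants <- c("Atlanta, GA", "HCA Atlanta", "HCC Atlanta", "Suwanee, GA", "Suwanee")

# Anchor the alternation with ^ and $ so only a complete value is replaced.
pat <- paste0("^(", paste(variants, collapse = "|"), ")$")

x <- c("Suwanee, GA", "suwanee", "HCA Atlanta", "Atlantic City")

# ignore.case = TRUE also catches lower-case spellings such as "suwanee".
cleaned <- sub(pat, "Atlanta", x, ignore.case = TRUE)
print(cleaned)
# [1] "Atlanta"       "Atlanta"       "Atlanta"       "Atlantic City"
```

Because the pattern is anchored, the order of the alternatives no longer matters, and unrelated values such as "Atlantic City" are left alone.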
Questions on data cleaning are difficult to answer, as answers strongly depend on the data.
Proposed solutions may work for a (small) sample dataset but may fail for a (large) production dataset.
In this case, I see two possible approaches:
1. Collect all possible spellings of a city's name and replace each variant with the desired city name. This can be done with str_replace() or by joining against a lookup table. This is safe but tedious.
2. Look for a characteristic character string within the city name and replace the whole value if it is found.
Below is a blueprint which can be extended to other use cases. For demonstration, a data.frame with one column, city, is created:
library(dplyr)
library(stringr)
data.frame(city = c("Atlanta, GA", "HCA Atlanta", "HCC Atlanta",
"Suwanee", "Suwanee, GA", "suwanee", "Atlantic City")) %>%
mutate(city_new = case_when(
str_detect(city, regex("Atlanta|Suwanee", ignore_case = TRUE)) ~ "Atlanta",
TRUE ~ as.character(city)
)
)
           city      city_new
1   Atlanta, GA       Atlanta
2   HCA Atlanta       Atlanta
3   HCC Atlanta       Atlanta
4       Suwanee       Atlanta
5   Suwanee, GA       Atlanta
6       suwanee       Atlanta
7 Atlantic City Atlantic City
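The first approach (collecting every variant and joining) can be sketched in base R as follows. This is only an illustration: `city_lookup` is a hypothetical lookup table that you would have to fill in for all 12 cities, and the keys are lower-cased and trimmed before matching so that fewer variants need to be listed.

```r
# Hypothetical lookup table: one row per known (lower-case) spelling variant.
city_lookup <- data.frame(
  variant = c("atlanta", "atlanta, ga", "hca atlanta", "suwanee", "suwanee, ga"),
  clean   = "Atlanta",
  stringsAsFactors = FALSE
)

raw <- c("Atlanta", "atlanta", "Atlanta, GA", "HCA Atlanta ", "Atlantic City")

# Normalise case and surrounding whitespace before looking the value up.
key <- tolower(trimws(raw))
idx <- match(key, city_lookup$variant)

# Keep the original value when no variant matches.
cleaned <- ifelse(is.na(idx), raw, city_lookup$clean[idx])
print(cleaned)
# [1] "Atlanta"       "Atlanta"       "Atlanta"       "Atlanta"       "Atlantic City"
```

Values that fall through unchanged are easy to list afterwards with `unique(raw[is.na(idx)])`, which helps when extending the table.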
I need to create a new column, "Group", based on keywords.
I tried using %in% but did not get the result I expected.
I want to add an extra column named "Group" to my df data frame.
In this column, I want to label every row according to the keywords it contains
(taken from the keyword vectors below, or possibly from a separate keywords data frame).
For example:
library(tibble)
df <- tibble(Title = c("Iran: How we are uncovering the protests and crackdowns",
"Deepak Nirula: The man who brought burgers and pizzas to India",
"Phil Foden: Manchester City midfielder signs new deal with club until 2027",
"The Danish tradition we all need now",
"Slovakia LGBT attack"),
Text = c("Iranian authorities have been disrupting the internet service in order to limit the flow of information and control the narrative, but Iranians are still sending BBC Persian videos of protests happening across the country via messaging apps. Videos are also being posted frequently on social media.
Before a video can be used in any reports, journalists need to establish where and when it was filmed.They can pinpoint the location by looking for landmarks and signs in the footage and checking them against satellite images, street-level photos and previous footage. Weather reports, the position of the sun and the angles of shadows it creates can be used to confirm the timing.",
"For anyone who grew up in capital Delhi during the 1970s and 1980s, Nirula's - run by the family of Deepak Nirula who died last week - is more than a restaurant. It's an emotion.
The restaurant transformed the eating-out culture in the city and introduced an entire generation to fast food, American style, before McDonald's and KFC came into the country. For many it was synonymous with its hot chocolate fudge.",
"Stockport-born Foden, who has scored two goals in 18 caps for England, has won 11 trophies with City, including four Premier League titles, four EFL Cups and the FA Cup.He has also won the Premier League Young Player of the Season and PFA Young Player of the Year awards in each of the last two seasons.
City boss Pep Guardiola handed him his debut as a 17-year-old and Foden credited the Spaniard for his impressive development over the last five years.",
"Norwegian playwright and poet Henrik Ibsen popularised the term /friluftsliv/ in the 1850s to describe the value of spending time in remote locations for spiritual and physical wellbeing. It literally translates to /open-air living/, and today, Scandinavians value connecting to nature in different ways – something we all need right now as we emerge from an era of lockdowns and inactivity.",
"The men were shot dead in the capital Bratislava on Wednesday, in a suspected hate crime.Organisers estimated that 20,000 people took part in the vigil, mourning the men's deaths and demanding action on LGBT rights.Slovak President Zuzana Caputova, who has raised the rainbow flag over her office, spoke at the event.")
)
keyword1 <- c("authorities", "Iranian", "Iraq", "control", "Riots")
keyword2 <- c("McDonald's","KFC", "McCafé", "fast food")
keyword3 <- c("caps", "trophies", "season", "seasons")
keyword4 <- c("travel", "landscape", "living", "spiritual")
keyword5 <- c("LGBT", "lesbian", "les", "rainbow", "Gay", "Bisexual","Transgender")
The new "Group" column should be filled as follows:
if a row matches keyword1, label it "Politics",
if it matches keyword2, label it "Food",
if it matches keyword3, label it "Sport",
if it matches keyword4, label it "Travel",
if it matches keyword5, label it "LGBT".
Can the matching also ignore case?
Below is expected output
Title          Text          Group
Iran: How..    Iranian...    Politics
Deepak Nir..   For any...    Food
Phil Foden..   Stockpo...    Sport
The Danish..   Norwegi...    Travel
Slovakia L..   The men...    LGBT
Thanks to everyone who takes the time to help.
You could try this (note that the matching is case-sensitive):
library(dplyr)
df %>%
  rowwise() %>%
  mutate(
    ## add column with words found in title or text (splitting by non-word character):
    words = list(strsplit(split = '\\W', paste(Title, Text)) %>% unlist),
    group = {
      categories <- list(keyword1, keyword2, keyword3, keyword4, keyword5)
      ## i indexes those items (=keyword vectors) of list 'categories'
      ## which share at least one word with column Title or Text (so that length > 0)
      i <- categories %>% lapply(\(category) length(intersect(unlist(words), category))) %>% as.logical
      ## pick group name via index; join with ',' if more than one category applies
      c('Politics', 'Food', 'Sport', 'Travel', 'LGBT')[i] %>% paste(collapse = ',')
    }
  )
output:
## # A tibble: 5 x 4
## # Rowwise:
##   Title                                                        Text  words group
##   <chr>                                                        <chr> <lis> <chr>
## 1 Iran: How we are uncovering the protests and crackdowns      "Ira~ <chr> Poli~
## 2 Deepak Nirula: The man who brought burgers and pizzas to In~ "For~ <chr> Food
## 3 Phil Foden: Manchester City midfielder signs new deal with ~ "Sto~ <chr> Sport
## 4 The Danish tradition we all need now                         "Nor~ <chr> Trav~
## 5 Slovakia LGBT attack                                         "The~ <chr> LGBT
Check this out - the basic idea is to define all keyword* case-insensitively (hence the (?i) in the patterns) as alternation patterns (hence the | for collapsing) with word boundaries (hence the \\b before and after the alternatives, to ensure that "caps" is matched but not for example "capsize") and use nested ifelse statements to assign the Group labels:
library(tidyverse)
df %>%
mutate(
All = str_c(Title, Text, sep = " "),  # the separator keeps the last title word apart from the first text word
Group = ifelse(str_detect(All, str_c("(?i)\\b(", str_c(keyword1, collapse = "|"), ")\\b")), "Politics",
ifelse(str_detect(All, str_c("(?i)\\b(", str_c(keyword2, collapse = "|"), ")\\b")), "Food",
ifelse(str_detect(All, str_c("(?i)\\b(", str_c(keyword3, collapse = "|"), ")\\b")), "Sport",
ifelse(str_detect(All, str_c("(?i)\\b(", str_c(keyword4, collapse = "|"), ")\\b")), "Travel", "LGBT"))))
) %>%
select(Group)
# A tibble: 5 × 1
Group
<chr>
1 Politics
2 Food
3 Sport
4 Travel
5 LGBT
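Note that the nested ifelse above labels every row that matches none of the first four keyword sets as "LGBT". A minimal base R sketch of an alternative, assuming the keyword vectors are kept in a named list and that rows without any match should be NA (the `keywords` and `texts` objects here are shortened, illustrative stand-ins for the real data):

```r
# Named list: the names are the group labels, the values are the keywords.
keywords <- list(
  Politics = c("authorities", "Iranian", "control"),
  Food     = c("KFC", "fast food"),
  Sport    = c("caps", "trophies", "seasons")
)

texts <- c("Iranian authorities disrupt the internet",
           "Before kfc arrived, fast food was rare",
           "He has 18 caps for England",
           "Nothing relevant here")

# One word-bounded alternation pattern per group.
patterns <- vapply(keywords,
                   function(k) paste0("\\b(", paste(k, collapse = "|"), ")\\b"),
                   character(1))

# Logical matrix: one row per text, one column per group.
hits <- do.call(cbind, lapply(patterns, function(p) {
  grepl(p, texts, ignore.case = TRUE, perl = TRUE)
}))

# Index of the first matching group in each row, or NA if none matches.
first <- apply(hits, 1, function(r) which(r)[1])
group <- names(keywords)[first]
print(group)
# [1] "Politics" "Food"     "Sport"    NA
```

The priority of the groups is simply the order of the list, so reordering `keywords` changes which label wins when several groups match.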
I have a corpus of a couple of thousand documents and I'm trying to find the most commonly mentioned countries in the abstracts.
The library countrycode seems to have a comprehensive list of country names I can match against:
# country.name.alt shows multiple potential namings for 'Congo' (yay!):
install.packages("countrycode")
library(dplyr)
countrycode::countryname_dict |> filter(grepl('congo', tolower(country.name.alt)))
# Also seems to work for ones like "China"/"People's Republic of China"
A reprex of the data looks something like this:
df <- data.frame(entry_number = 1:5,
text = c("a few paragraphs that might contain the country name congo or democratic republic of congo",
"More text that might contain myanmar or burma, as well as thailand",
"sentences that do not contain a country name can be returned as NA",
"some variant of U.S or the united states",
"something with an accent samóoa"))
I want to reduce each entry in the column "text" to contain only a country name. Ideally something like this (note the repeat entry number):
desired_df <- data.frame(entry_number = c(1, 2, 2, 3, 4, 5),
text = c("congo",
"myanmar",
"thailand",
NA,
"united states",
"samoa"))
I've tried str_extract and various other approaches without success. The corpus is in English, but the international alphabets included in countrycode::countryname_dict$country.name.alt do throw regex errors. countrycode::countryname_dict$country.name.alt contains all the alternatives that countrycode::countryname_dict$country.name.en does not...
Open to any approach (dplyr,data.table...) that answers the initial question of how many times each country is mentioned in the corpus. Only requirement is that it is as robust as possible to different potential country names, accents and any other hidden catches!
Thanks community!
P.S, I have reviewed the following questions but no luck with my own example:
Matching an extracting country name from character string in R
extract country names (or other entity) from column
Extracting country names in R
Extracting Country Name from Author Affiliations
This seems to work well on the example data.
library(tidyverse)
all_country <- countrycode::countryname_dict %>%
filter(grepl('[A-Za-z]', country.name.alt)) %>%
pull(country.name.alt) %>%
tolower()
pattern <- str_c(all_country, collapse = '|')
df %>%
mutate(country = str_extract_all(tolower(text), pattern)) %>%
select(-text) %>%
unnest(country, keep_empty = TRUE)
# entry_number country
# <int> <chr>
#1 1 congo
#2 1 democratic republic of congo
#3 2 myanma
#4 2 burma
#5 2 thailand
#6 3 NA
#7 4 united states
#8 5 samóoa
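For counting mentions, two refinements may be worth considering: escaping regex metacharacters in the names, and matching whole words only, with longer names tried first. The following is a rough base R sketch under those assumptions. `names_alt` is a small made-up stand-in for the dictionary column, `escape_regex` is a hypothetical helper, and accents are not handled here.

```r
# Small stand-in for the (lower-cased) dictionary of country names.
names_alt <- c("congo", "democratic republic of congo", "myanmar",
               "burma", "thailand", "u.s.")

# Hypothetical helper: put a backslash in front of regex metacharacters.
escape_regex <- function(x) gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", x)

# Try longer names first, and require non-word characters on both sides.
ordered <- names_alt[order(nchar(names_alt), decreasing = TRUE)]
pattern <- paste0("(?<!\\w)(",
                  paste(escape_regex(ordered), collapse = "|"),
                  ")(?!\\w)")

text <- c("the country name congo or democratic republic of congo",
          "myanmar or burma, as well as thailand",
          "no country name here",
          "trade with the u.s. grew")

# All matches per document, then one overall frequency table.
m <- regmatches(text, gregexpr(pattern, text, perl = TRUE))
counts <- table(unlist(m))
print(counts)
```

Documents without a match give an empty character vector in `m`, so they simply do not contribute to `counts`.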
I have a large dataframe of published articles for which I would like to extract all articles relating to a few authors specified in a separate list. The authors in the dataframe are grouped together in one column separated by a ; . Not all authors need to match, I would like to extract any article which has one author matched to the list. An example is below.
Title<-c("A", "B", "C")
AU<-c("Mark; John; Paul", "Simone; Lily; Poppy", "Sarah; Luke")
df<-cbind(Title, AU)
authors<-as.character(c("Mark", "John", "Luke"))
df[sapply(strsplit((as.character(df$AU)), "; "), function(x) any(authors %in% x)),]
I would expect to return;
Title AU
A Mark; John
C Sarah; Luke
However, with my large dataframe this command does not return every matching row: it only returns rows which have a single author in AU, not rows with several authors.
Here is a dput from my larger dataframe of 5 rows
structure(list(AU = c("FOOKES PG;DEARMAN WR;FRANKLIN JA", "SIMS DG;DOWNHAM MAPS;MCQUILLIN J;GARDNER PS",
"TURNER BR", "BUTLER J;MARSH H;GOODARZI F", "OVERTON M"), TI = c("SOME ENGINEERING ASPECTS OF ROCK WEATHERING WITH FIELD EXAMPLES FROM DARTMOOR AND ELSEWHERE",
"RESPIRATORY SYNCYTIAL VIRUS INFECTION IN NORTH-EAST ENGLAND",
"TECTONIC AND CLIMATIC CONTROLS ON CONTINENTAL DEPOSITIONAL FACIES IN THE KAROO BASIN OF NORTHERN NATAL, SOUTH AFRICA",
"WORLD COALS: GENESIS OF THE WORLD'S MAJOR COALFIELDS IN RELATION TO PLATE TECTONICS",
"WEATHER AND AGRICULTURAL CHANGE IN ENGLAND, 1660-1739"), SO = c("QUARTERLY JOURNAL OF ENGINEERING GEOLOGY",
"BRITISH MEDICAL JOURNAL", "SEDIMENTARY GEOLOGY", "FUEL", "AGRICULTURAL HISTORY"
), JI = c("Q. J. ENG. GEOL.", "BRIT. MED. J.", "SEDIMENT. GEOL.",
"FUEL", "AGRICULTURAL HISTORY")
An option with str_extract_all:
library(dplyr)
library(stringr)
df %>%
mutate(Names = str_extract_all(Names, str_c(authors, collapse="|"))) %>%
filter(lengths(Names) > 0)
# Title Names
#1 A Mark, John
#2 C Luke
data
Names <- AU  # the author column is called Names in these answers
df <- data.frame(Title, Names)
In base R you can subset it like so:
df[sapply(strsplit(as.character(df$Names), "; "), function(x) any(authors %in% x)), ]
Title Names
1 A Mark; John; Paul
3 C Sarah; Luke
This can be accomplished by subsetting on those Names that match the pattern specified in the first argument to the function grepl:
df[grepl(paste0(authors, collapse = "|"), df[,2]),]
Title Names
[1,] "A" "Mark; John; Paul"
[2,] "C" "Sarah; Luke"
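Two things to watch for here: grepl matches substrings (so "Mark" would also match "Markus"), and the dput in the question separates authors with ";" and no space, which a split on "; " misses. A minimal base R sketch that matches whole author names and tolerates optional spaces around the separator (the fourth row, "D", is added purely for illustration):

```r
df <- data.frame(
  Title = c("A", "B", "C", "D"),
  AU = c("Mark; John; Paul", "Simone; Lily; Poppy", "Sarah; Luke", "Anna;Markus"),
  stringsAsFactors = FALSE
)
authors <- c("Mark", "John", "Luke")

# Split on ';' with any amount of whitespace on either side.
au_list <- strsplit(df$AU, "\\s*;\\s*")

# Keep a row if at least one complete author name is in the list.
keep <- vapply(au_list, function(x) any(trimws(x) %in% authors), logical(1))
result <- df[keep, ]
print(result)
```

Row "D" is dropped because "Markus" is compared as a whole name rather than searched for the substring "Mark".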
I want to change a level name (e.g. "Africa ") to another, already existing level (e.g. "Africa") in a categorical variable: some levels have trailing whitespace while otherwise identical ones do not. These values, in the Continent column, are currently stored as factors in a dataframe.
Here is the output from my gigantic dataset:
I tried a series of nested ifelse calls, but I got weird results:
data.CONTINENT$Continent_R<- ifelse (data.CONTINENT$Continent=="Africa ","Africa",
ifelse (data.CONTINENT$Continent=="Asia ","Asia",
ifelse (data.CONTINENT$Continent=="Europe ","Europe",
ifelse (data.CONTINENT$Continent=="Europe ","Europe",
ifelse (data.CONTINENT$Continent=="Multi ","Multi",
ifelse (data.CONTINENT$Continent=="North America ","North America",
ifelse (data.CONTINENT$Continent=="South America ","South America",
data.CONTINENT$Continent))))))); table (data.CONTINENT$Continent_R)
Here is what I got based on the prior code:
Any Advice will be greatly appreciated.
I would use the amazing forcats package.
library(forcats)
data.CONTINENT$Continent_R <- fct_collapse(data.CONTINENT$Continent,
  Africa = c("Africa", "Africa "),
  `South America` = c("South America", "South America "))
Programmatically, if all you want to do is remove the trailing whitespace, you could do something like:
# the regex '\\s+$' matches one or more whitespace characters at the end of the string
fct_relabel(data.CONTINENT$Continent, ~ gsub("\\s+$", "", .x))
If all you're trying to do is remove whitespace, just use the base trimws function (or stringr::str_trim, although I don't know what advantage it has, if any). Replace the levels with their trimmed versions.
You didn't include a reproducible version of data, so I'm creating it by pasting continent names with randomly sampled empty strings or single spaces.
set.seed(123)
data.CONTINENT <- data.frame(
  Continent = paste0(sample(c("Africa", "Asia", "South America"), 10, replace = TRUE),
                     sample(c("", " "), 10, replace = TRUE)),
  stringsAsFactors = TRUE  # needed in R >= 4.0 so that Continent is a factor
)
levels(data.CONTINENT$Continent)
#> [1] "Africa" "Asia" "Asia " "South America"
#> [5] "South America "
Version one: replace the labels with their trimmed versions, and set it back to being a factor.
factor(data.CONTINENT$Continent, labels = trimws(levels(data.CONTINENT$Continent)))
#> [1] South America South America South America Asia South America
#> [6] Asia Asia Asia South America Africa
#> Levels: Africa Asia South America
Version two: use forcats and just pass the name of the function you need applied to the labels. Gets same output as above.
forcats::fct_relabel(data.CONTINENT$Continent, trimws)
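For completeness, a minimal base R sketch of the same idea without creating a new factor: assigning trimmed names to `levels()` merges levels that become identical. The `continent` vector is a made-up example.

```r
continent <- factor(c("Africa", "Africa ", "Asia", "South America ", "South America"))

# Five levels before trimming, because "Africa" and "Africa " are distinct.
n_before <- nlevels(continent)

# Assigning duplicated level names merges the corresponding levels.
levels(continent) <- trimws(levels(continent))

print(n_before)          # 5
print(levels(continent)) # "Africa" "Asia" "South America"
```

This modifies the factor in place, so work on a copy if the original column should be kept.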
There are a lot of potential approaches here. You could:
Manually replace them one at a time:
data.CONTINENT$Continent[which(data.CONTINENT$Continent=="Africa ")] <- "Africa"
Use a look-up table to replace them all at once:
lut <- data.frame(old = c('Africa ', 'South America '),
new = c('Africa', 'South America'))
# copy data to a new column to avoid over-writing data
data.CONTINENT$Continent_R <- data.CONTINENT$Continent
# replace only the 'old' values with the 'new' values in the look-up-table
data.CONTINENT$Continent_R[which(data.CONTINENT$Continent %in% lut$old)] <- lut$new[match(data.CONTINENT$Continent[which(data.CONTINENT$Continent %in% lut$old)], lut$old)]
# You may want to re-factor the column after this if you want to use it as a factor variable so as to remove the old factors that are no longer present.
If the only issues are extra spaces before and/or after entries, then you can just use the trimws() function.
Use the dplyr::recode() function.
data.CONTINENT$Continent_R <- dplyr::recode(data.CONTINENT$Continent, 'Africa ' = 'Africa', 'South America ' = 'South America')
And there are probably 20 other ways of doing this using functions like dplyr::left_join or switch.
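The look-up idea can also be written more compactly with a named character vector. This is a sketch only, assuming the column has first been converted to character with as.character(); `recode_map` and `continent` are illustrative names.

```r
# Names are the old values, elements are the replacements.
recode_map <- c("Africa " = "Africa", "South America " = "South America")

continent <- c("Africa ", "Asia", "South America ", "Africa")

# Replace only the values that appear as names in the map.
hit <- continent %in% names(recode_map)
continent[hit] <- unname(recode_map[continent[hit]])
print(continent)
# [1] "Africa"        "Asia"          "South America" "Africa"
```

Values not listed in the map, such as "Asia", are left untouched.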
I have a large file with a variable state that contains full state names. I would like to replace them with the state abbreviations (that is, "NY" for "New York"). Is there an easy way to do this (apart from using several if-else commands)? Maybe using replace()?
R has two built-in constants that might help: state.abb with the abbreviations, and state.name with the full names. Here is a simple usage example:
> x <- c("New York", "Virginia")
> state.abb[match(x,state.name)]
[1] "NY" "VA"
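Names that are not in state.name come back as NA, and the match is case-sensitive. A small sketch of one way to handle both, assuming unmatched values should simply be kept as they are:

```r
x <- c("New York", "virginia", "Puerto Rico")

# Compare in lower case so that "virginia" still matches "Virginia".
abb <- state.abb[match(tolower(x), tolower(state.name))]

# Fall back to the original value where no state matched.
result <- ifelse(is.na(abb), x, abb)
print(result)
# [1] "NY"          "VA"          "Puerto Rico"
```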
1) grep the full name from state.name and use that to index into state.abb:
state.abb[grep("New York", state.name)]
## [1] "NY"
1a) or using which:
state.abb[which(state.name == "New York")]
## [1] "NY"
2) or create a vector of state abbreviations whose names are the full names and index into it using the full name:
setNames(state.abb, state.name)["New York"]
## New York
## "NY"
Unlike (1), this one works even if "New York" is replaced by a vector of full state names, e.g. setNames(state.abb, state.name)[c("New York", "Idaho")]
Old post I know, but wanted to throw mine in there. I learned on tidyverse, so for better or worse I avoid base R when possible. I wanted one with DC too, so first I built the crosswalk:
library(tidyverse)
st_crosswalk <- tibble(state = state.name) %>%
bind_cols(tibble(abb = state.abb)) %>%
bind_rows(tibble(state = "District of Columbia", abb = "DC"))
Then I joined it to my data:
left_join(data, st_crosswalk, by = "state")
I found that the built-in state.name and state.abb cover only the 50 states. I got a bigger table (including DC and so on) online (e.g., this link: http://www.infoplease.com/ipa/A0110468.html) and pasted it into a .csv file named States.csv. I then load the state names and abbreviations from this file instead of using the built-ins. The rest is quite similar to @Aniko's answer.
library(dplyr)
library(stringr)
library(stringdist)
# setwd(...)  # set the working directory to the folder that contains States.csv
# load data
data = c("NY", "New York", "NewYork")
data = toupper(data)
# load state name and abbr.
State.data = read.csv('States.csv')
State = toupper(State.data$State)
Stateabb = as.vector(State.data$Abb)
# match data with state names; a misspelling of one letter is allowed
match = amatch(data, State, maxDist=1)
data[ !is.na(match) ] = Stateabb[ na.omit( match ) ]
There's a small difference between match and amatch in how they calculate the distance from one word to another. See P25-26 here http://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf
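If installing stringdist is not an option, a similar tolerance can be approximated in base R with utils::adist, which computes a generalized Levenshtein distance (not the same default metric as amatch, so results may differ in edge cases). A minimal sketch using the built-in state.name instead of States.csv:

```r
data <- toupper(c("New York", "NewYork", "Texsas", "Ontario"))

# Edit distance from every input to every state name (rows = inputs).
d <- adist(data, toupper(state.name))

# Closest state for each input, and the distance to it.
best <- apply(d, 1, which.min)
dist <- d[cbind(seq_along(data), best)]

# Accept the closest state only if at most one edit is needed.
result <- ifelse(dist <= 1, state.abb[best], NA_character_)
print(result)
# [1] "NY" "NY" "TX" NA
```

Inputs that are already abbreviations (such as "NY") are not handled by this sketch and would need a separate exact match against state.abb.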
You can also use base::abbreviate if you don't have US state names. This won't give you equally sized abbreviations unless you increase minlength.
state.name %>% base::abbreviate(minlength = 1)
Here is another way of doing it in case you have more than one state in your data and you want to replace the names with the corresponding abbreviations.
#creating a list of names
states_df <- c("Alabama","California","Nevada","New York",
"Oregon","Texas", "Utah","Washington")
states_df <- as.data.frame(states_df)
The output is
> print(states_df)
   states_df
1    Alabama
2 California
3     Nevada
4   New York
5     Oregon
6      Texas
7       Utah
8 Washington
Now, using the built-in state.abb and state.name vectors, you can easily convert the names into abbreviations, and vice versa.
states_df$state_code <- state.abb[match(states_df$states_df, state.name)]
> print(states_df)
   states_df state_code
1    Alabama         AL
2 California         CA
3     Nevada         NV
4   New York         NY
5     Oregon         OR
6      Texas         TX
7       Utah         UT
8 Washington         WA
If matching state names to abbreviations or the other way around is something you have to do frequently, you could put Aniko's solution in a function in a .Rprofile or a package:
state_to_st <- function(x){
c(state.abb, 'DC')[match(x, c(state.name, 'District of Columbia'))]
}
st_to_state <- function(x){
c(state.name, 'District of Columbia')[match(x, c(state.abb, 'DC'))]
}
Using that function as a part of a dplyr chain:
library(dplyr)
library(tibble)
enframe(state.name, value = 'state_name') %>%
  mutate(state_abbr = state_to_st(state_name))