I need to clean up the gender and dates columns of the dataset found here.
They apparently contain some misspellings and ambiguities. I am new to R and data cleaning, so I am not sure how to go about this. For starters, I have tried to correct the misspellings using
factor(data$artist_data$gender)
str_replace_all(data$artist_data$gender, pattern = "femle", replacement = "Female")
str_replace_all(data$artist_data$gender, pattern = "f.", replacement = "Female")
str_replace_all(data$artist_data$gender, pattern = "F.", replacement = "Female")
str_replace_all(data$artist_data$gender, pattern = "female", replacement = "Female")
But it doesn't seem to work, as I still have f., F. and femle in my output. Secondly, there seem to be empty cells inside. Do I need to remove them, or is it alright to leave them there? If I need to remove them, how?
Thirdly, for the dates column, how do I make it clearer? That is, change the format of born in xxxx to maybe xxxx-yyyy if the person has died, or xxxx-present if still alive. E.g. for born in 1940, is it safe to assume that they are still alive? Also, one of the entries has the word active in it. I would like to make this data more straightforward.
Please help,
Thank you.
We have to escape the dot in f. and F., since an unescaped . is a regex metacharacter that matches any single character. (Note also that str_replace_all() returns a new vector rather than modifying its input, so the result must be assigned back; that is why the original attempts appeared to have no effect.)
library(dplyr)
library(stringr)
library(tibble)
pattern <- paste(c("f\\.", "F\\.", "female", "femle"), collapse = "|")
df[[2]] %>%
  mutate(gender = str_replace(string = gender,
                              pattern = pattern,
                              replacement = "Female")) %>%
  as_tibble()
name gender dates placeOfBirth placeOfDeath
<chr> <chr> <chr> <chr> <chr>
1 Abakanowicz, Magdalena Female born 1930 Polska ""
2 Abbey, Edwin Austin Male 1852–1911 Philadelphia, United States "London, United Kingdom"
3 Abbott, Berenice Female 1898–1991 Springfield, United States "Monson, United States"
4 Abbott, Lemuel Francis Male 1760–1803 Leicestershire, United Kingdom "London, United Kingdom"
5 Abrahams, Ivor Male born 1935 Wigan, United Kingdom ""
6 Absalon Male 1964–1993 Tel Aviv-Yafo, Yisra'el "Paris, France"
7 Abts, Tomma Female born 1967 Kiel, Deutschland ""
8 Acconci, Vito Male born 1940 New York, United States ""
9 Ackling, Roger Male 1947–2014 Isleworth, United Kingdom ""
10 Ackroyd, Norman Male born 1938 Leeds, United Kingdom ""
# ... with 3,522 more rows
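The dates column from the question is still in mixed formats. A minimal sketch of one way to normalise it with case-based logic (toy values mirroring the patterns visible above; note that treating born YYYY as YYYY-present is an assumption about the data, not a fact the dataset guarantees):

```r
library(dplyr)
library(stringr)

# Toy data mimicking the three patterns seen in the dates column:
artists <- tibble::tibble(dates = c("born 1930", "1852–1911", "active 1960"))

artists <- artists %>%
  mutate(dates = if_else(
    str_detect(dates, "^(born|active)"),
    # keep only the four-digit year, then append "-present":
    str_c(str_extract(dates, "\\d{4}"), "-present"),
    dates  # already in "YYYY–YYYY" form
  ))
artists$dates
# "1930-present" "1852–1911" "1960-present"
```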
This is what my dataset currently looks like. I'm hoping to add a column with the country names that correspond to the 'paragraph' column, but I don't even know how to start. Should I upload a list of all country names and then use the match function?
Any suggestions for a more optimal way would be appreciated! Thank you.
The output of dput(head(dataset, 20)) is as follows:
structure(list(category = c("State Ownership and Privatization;...row.names = c(NA, 20L), class = "data.frame")
Use the package "countrycode":
Toy data:
df <- data.frame(
  entry_number = 1:5,
  text = c("a few paragraphs that might contain the country name congo or democratic republic of congo",
           "More text that might contain myanmar or burma, as well as thailand",
           "sentences that do not contain a country name can be returned as NA",
           "some variant of U.S or the united states",
           "something with an accent samóoa"))
This is how you can match the country names in a separate column:
library(tidyr)
library(dplyr)
#install.packages("countrycode")
library(countrycode)
all_country <- countryname_dict %>%
  # filter out non-ASCII country names:
  filter(grepl('[A-Za-z]', country.name.alt)) %>%
  # define column `country.name.alt` as an atomic vector:
  pull(country.name.alt) %>%
  # change to lower-case:
  tolower()
# define alternation pattern of all country names:
library(stringr)
pattern <- str_c(all_country, collapse = '|') # A huge alternation pattern!
df %>%
  # extract country name matches:
  mutate(country = str_extract_all(tolower(text), pattern))
entry_number text
1 1 a few paragraphs that might contain the country name congo or democratic republic of congo
2 2 More text that might contain myanmar or burma, as well as thailand
3 3 sentences that do not contain a country name can be returned as NA
4 4 some variant of U.S or the united states
5 5 something with an accent samóoa
country
1 congo, democratic republic of congo
2 myanma, burma, thailand
3
4 united states
5 samóoa
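One caveat: str_extract_all() produces a list-column, because a row can contain several matches (rows 1 and 2 above). A self-contained sketch of flattening it into a comma-separated string, using a toy two-name pattern as a stand-in for the huge countrycode alternation:

```r
library(dplyr)
library(stringr)

toy <- tibble::tibble(text = c("congo and thailand", "no match here"))
pattern <- "congo|thailand"   # stand-in for the full countryname_dict pattern

toy <- toy %>%
  mutate(country = sapply(str_extract_all(tolower(text), pattern), toString))
toy$country
# "congo, thailand" ""
```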
Goal: To change a column of NAs in one dataframe based on a "key" in another dataframe (something like a VLookUp, except only in R)
Given df1 here (For Simplicity's sake, I just have 6 rows. The key I have is 50 rows for 50 states):
Index  State_Name  Abbreviation
1      California  CA
2      Maryland    MD
3      New York    NY
4      Texas       TX
5      Virginia    VA
6      Washington  WA
And given df2 here (This is just an example. The real dataframe I'm working with has a lot more rows) :
Index  State  Article
1      NA     Texas governor, Abbott, signs new abortion bill
2      NA     Effort to recall California governor Newsome loses steam
3      NA     New York governor, Cuomo, accused of manipulating Covid-19 nursing home data
4      NA     Hogan (Maryland, R) announces plans to lift statewide Covid restrictions
5      NA     DC statehood unlikely as Manchin opposes
6      NA     Amazon HQ2 causing housing prices to soar in northern Virginia
Task: To create an R function that loops over each df2$Article row, cross-references the state mentioned there with df1$State_Name, and replaces the NA in df2$State with the respective df1$Abbreviation. I know it's quite a mouthful. I'm stuck on how to start and finish this puzzle. Hard-coding is not an option, as the real datasheet has thousands of rows like this and will keep growing as we add more articles to text-scrape.
The output should look like:
Index  State  Article
1      TX     Texas governor, Abbott, signs new abortion bill
2      CA     Effort to recall California governor Newsome loses steam
3      NY     New York governor, Cuomo, accused of manipulating Covid-19 nursing home data
4      MD     Hogan (Maryland, R) announces plans to lift statewide Covid restrictions
5      NA     DC statehood unlikely as Manchin opposes
6      VA     Amazon HQ2 causing housing prices to soar in northern Virginia
Note: The fifth entry with DC is intended to be NA.
Any links to guides, and/or any advice on how to code this is most appreciated. Thank you!
You can create a regex pattern from State_Name and use str_extract to extract it from Article. Then use match to get the corresponding Abbreviation from df1.
library(stringr)
df2$State <- df1$Abbreviation[match(str_extract(df2$Article,
str_c(df1$State_Name, collapse = '|')), df1$State_Name)]
df2$State
#[1] "TX" "CA" "NY" "MD" NA "VA"
You can also use inbuilt state.name and state.abb instead of df1 to get state name and abbreviations.
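For example, a sketch using those built-in vectors (the article strings here are made up):

```r
library(stringr)

articles <- c("Texas governor signs bill",
              "DC statehood unlikely as Manchin opposes")
# state.name and state.abb ship with base R and are index-aligned:
abbr <- state.abb[match(str_extract(articles, str_c(state.name, collapse = "|")),
                        state.name)]
abbr
# "TX" NA
```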
Here's a way to do this with a for loop -
for(i in seq(nrow(df1))) {
  inds <- grep(df1$State_Name[i], df2$Article)
  if(length(inds)) df2$State[inds] <- df1$Abbreviation[i]
}
df2
# Index State Article
#1 1 TX Texas governor, Abbott, signs new abortion bill
#2 2 CA Effort to recall California governor Newsome loses steam
#3 3 NY New York governor, Cuomo, accused of manipulating Covid-19 nursing home data
#4 4 MD Hogan (Maryland, R) announces plans to lift statewide Covid restrictions
#5 5 <NA> DC statehood unlikely as Manchin opposes
#6 6 VA Amazon HQ2 causing housing prices to soar in northern Virginia
Not as concise as above but a Base R approach:
# Unlist, handling 0-length vectors: list_2_vec => function()
list_2_vec <- function(lst){
  # Coerce 0-length vectors to NA values of the appropriate type:
  # .zero_to_nas => function()
  .zero_to_nas <- function(x){
    if(identical(x, character(0))){
      NA_character_
    }else if(identical(x, integer(0))){
      NA_integer_
    }else if(identical(x, numeric(0))){
      NA_real_
    }else if(identical(x, complex(0))){
      NA_complex_
    }else if(identical(x, logical(0))){
      NA
    }else{
      x
    }
  }
  # Unlist cleaned list: res => vector
  res <- unlist(lapply(lst, .zero_to_nas))
  # Explicitly define return object: vector => GlobalEnv()
  return(res)
}
# Classify each article as belonging to the appropriate state:
# clean_df => data.frame
clean_df <- transform(
  df2,
  State = df1$Abbreviation[
    match(
      list_2_vec(
        regmatches(
          Article,
          gregexpr(
            paste0(df1$State_Name, collapse = "|"), Article
          )
        )
      ),
      df1$State_Name
    )
  ]
)
# Data:
df1 <- structure(list(Index = 1:6, State_Name = c("California", "Maryland",
"New York", "Texas", "Virginia", "Washington"), Abbreviation = c("CA",
"MD", "NY", "TX", "VA", "WA")), class = "data.frame", row.names = c(NA, -6L))
df2 <- structure(list(Index = 1:6, State = c(NA, NA, NA, NA, NA, NA),
Article = c("Texas governor, Abbott, signs new abortion bill",
"Effort to recall California governor Newsome loses steam",
"New York governor, Cuomo, accused of manipulating Covid-19 nursing home data",
"Hogan (Maryland, R) announces plans to lift statewide Covid restrictions",
"DC statehood unlikely as Manchin opposes", "Amazon HQ2 causing housing prices to soar in northern Virginia"
)), class = "data.frame", row.names = c(NA, -6L))
I have data that is within one cell, separated by spaces.
For example, there is one column with city name such as "New York, NY" and then another column with the zip codes "12345 67891 23456".
What is a good method for separating this single row so that it could become three rows, with each having "New York, NY" and then having a single zip code associated?
Try this:
library(dplyr)
library(tidyr)
tibble(city = "New York, NY", zipcodes = "12345 67891 23456") %>%
  mutate(zipcodes = strsplit(zipcodes, "\\s+")) %>%
  unnest(zipcodes)
# # A tibble: 3 x 2
# city zipcodes
# <chr> <chr>
# 1 New York, NY 12345
# 2 New York, NY 67891
# 3 New York, NY 23456
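tidyr also provides separate_rows(), which performs the split and the unnest in a single step; a sketch of the same operation:

```r
library(dplyr)
library(tidyr)

res <- tibble(city = "New York, NY", zipcodes = "12345 67891 23456") %>%
  separate_rows(zipcodes, sep = "\\s+")
res
# a 3-row tibble, one zip code per row
```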
Base R:
dat <- data.frame(city = "New York, NY", zipcodes = "12345 67891 23456", stringsAsFactors = FALSE)
zips <- strsplit(dat$zipcodes, "\\s+")
data.frame(city = rep(dat$city, times = lengths(zips)), zipcode = unlist(zips))
# city zipcode
# 1 New York, NY 12345
# 2 New York, NY 67891
# 3 New York, NY 23456
One premise of this answer is that the zip codes are separated by one or more whitespace characters (space, tab, etc.). If there are legitimate spaces within codes (true in many countries), then @ThomasIsCoding's approach may be a better start, in that it attempts to extract the specific elements. Both will fail where zip codes are alphanumeric and contain a space; for instance, the UK has BS2 0JA as a postal code. In that case, you'll need a lot more logic to extract them safely.
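For completeness, with more than one input row, lengths(zips) becomes a vector and rep(..., times = lengths(zips)) repeats each city by its own zip count. An illustrative sketch with made-up data:

```r
dat <- data.frame(city = c("New York, NY", "Boston, MA"),
                  zipcodes = c("12345 67891", "02108"),
                  stringsAsFactors = FALSE)
zips <- strsplit(dat$zipcodes, "\\s+")
out <- data.frame(city = rep(dat$city, times = lengths(zips)),
                  zipcode = unlist(zips),
                  stringsAsFactors = FALSE)
out
#           city zipcode
# 1 New York, NY   12345
# 2 New York, NY   67891
# 3   Boston, MA   02108
```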
If you are using base R, do you mean this kind of output?
s <- "New York, NY 12345 67891 23456"
data.frame(addr = paste0(gsub("(.*?\\s)\\d.*","\\1",s), unlist(regmatches(s,gregexpr("\\d+",s)))))
yielding
addr
1 New York, NY 12345
2 New York, NY 67891
3 New York, NY 23456
I am relatively new to R. I have written the following code. However, because it uses a for-loop, it is slow. I am not too familiar with packages that will convert this for-loop into a more efficient solution (apply functions?).
What my code does is this: it is trying to extract country names from a variable based on another dataframe that has all countries.
For instance, this is what data looks like:
country Institution
edmonton general hospital
ontario, canada
miyazaki, japan
department of head
This is what countries looks like:
Name Code
algeria dz
canada ca
japan jp
kenya ke
# string match the countries
for(i in 1:nrow(data)) {
  for (j in 1:nrow(countries)) {
    data$country[i] <- ifelse(
      str_detect(string = data$Institution[i],
                 pattern = paste0("\\b", countries$Name[j], "\\b")),
      countries$Name[j],
      data$country[i])
  }
}
The above code changes data so that it looks like this:
country Institution
edmonton general hospital
canada ontario, canada
japan miyazaki, japan
department of head
How can I convert my for-loop into something more efficient while preserving the same functionality?
Thanks.
You can do a one-liner with str_extract. We'll wrap the country names in word boundaries and join them with the regex | (or) operator.
library(stringr)
data$country = str_extract(data$Institution, paste0(
"\\b", country$Name, "\\b", collapse = "|"
))
data
# Institution country
# 1 edmonton general hospital <NA>
# 2 ontario, canada canada
# 3 miyazaki, japan japan
# 4 department of head <NA>
Using this data:
country <- read.table(text = " Name Code
algeria dz
canada ca
japan jp
kenya ke",
stringsAsFactors = FALSE, header = TRUE)
data <- data.frame(Institution = c("edmonton general hospital",
"ontario, canada",
"miyazaki, japan",
"department of head"))
The data:
countries <- setDT(read.table(text = " Name Code
algeria dz
canada ca
japan jp
kenya ke",
stringsAsFactors = FALSE, header = TRUE))
data <- setDT(list(country = array(dim = 2), Institution =
c("edmonton general hospital ontario, canada",
"miyazaki, japan department of head")))
I use data.table for syntax convenience, but you can surely do otherwise; the main idea is to use just one loop and grepl:
data[,country := as.character(country)]
for( x in unique(countries$Name)){data[grepl(x,data$Institution),country := x]}
> data
country Institution
1: canada edmonton general hospital ontario, canada
2: japan miyazaki, japan department of head
You could add the tolower function to avoid case-sensitivity problems: grepl(tolower(x), tolower(data$Institution)).
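Alternatively, grepl() has a built-in ignore.case argument, which avoids calling tolower() twice; a minimal sketch:

```r
# Case-insensitive matching without lower-casing both sides:
grepl("canada", "Ontario, CANADA", ignore.case = TRUE)
# [1] TRUE
```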
I have an r data frame that contains U.S. state and county names in one column. The data is in the format:
United States - State name - County name
where each cell is a unique county. For example:
United States - North Carolina - Wake County
United States - North Carolina - Warren County
etc.
I need to break the column into 2 columns, one containing just the state name and the other containing just the county name. I've experimented with sub and gsub but am getting no results. I understand this is probably a simple matter for r experts but I'm a newbie. I would be most grateful if anyone can point me in the right direction.
You can use tidyr's separate function:
library(tidyr)
df <- separate(df, currentColumn, into = c("Country", "State", "County"), sep = " - ")
This assumes the data is as you show in your question (including United States as the country), that your data frame is called df, and that the current column with the data is called currentColumn.
Example:
df <- data.frame(currentColumn = c("United States - North Carolina - Wake County",
"United States - North Carolina - Warren County"), val = rnorm(2))
df
# currentColumn val
#1 United States - North Carolina - Wake County 0.8173619
#2 United States - North Carolina - Warren County 0.4941976
separate(df, currentColumn, into = c("Country", "State", "County"), sep = " - ")
# Country State County val
#1 United States North Carolina Wake County 0.8173619
#2 United States North Carolina Warren County 0.4941976
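separate() interprets sep as a regular expression, so if the spacing around the hyphens is ever inconsistent, a pattern such as \s*-\s* is more forgiving. A sketch with deliberately uneven spacing:

```r
library(tidyr)

df <- data.frame(currentColumn = "United States-North Carolina -  Wake County")
res <- separate(df, currentColumn, into = c("Country", "State", "County"),
                sep = "\\s*-\\s*")
res
#         Country          State      County
# 1 United States North Carolina Wake County
```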
Using read.table, and assuming your data is in df$var
read.table(text=df$var,sep="-",strip.white=TRUE,
col.names=c("Country","State","County"))
If speed is an issue, then strsplit will be a lot quicker:
setNames(data.frame(do.call(rbind,strsplit(df$var,split=" - "))),
c("Country","State","County"))
Both give:
# Country State County
#1 United States North Carolina Wake County
#2 United States North Carolina Warren County