Clean duplicate phone numbers in R dataframe column

We have a dataframe with a Phone column that contains phone numbers; however, the phone numbers are duplicated in many of the rows:
structure(list(Title = c("Head Coach", "Athletic Trainer", "Head Coach",
"Assistant Coach", "Student Assistant", "Head Men's Basketball Coach", "Coach"
), Phone = c("(904) 256-7242\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t(904) 256-7242",
"256-765-5020\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t256-765-5020",
NA, "765.285.8142\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t765.285.8142",
"", "549-5849\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t549-5849", "516-302-1039"
)), row.names = c(1L,2L, 3L,4L,5L,6L,7L ), class = "data.frame")
Title Phone
1 Head Coach (904) 256-7242\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t(904) 256-7242
2 Athletic Trainer 256-765-5020\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t256-765-5020
3 Head Coach <NA>
4 Assistant Coach 765.285.8142\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t765.285.8142
5 Student Assistant
6 Head Men's Basketball Coach 549-5849\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t549-5849
7 Coach 516-302-1039
The correct output would remove phone number duplicates:
structure(list(Title = c("Head Coach", "Athletic Trainer", "Head Coach",
"Assistant Coach", "Student Assistant", "Head Men's Basketball Coach", "Coach"
), Phone = c("(904) 256-7242",
"256-765-5020",
NA, "765.285.8142",
"", "549-5849", "516-302-1039"
)), row.names = c(1L,2L, 3L,4L,5L,6L,7L ), class = "data.frame")
Typically I would share our progress on this, but quite frankly we are lost as to how to even get started. This seems like a very difficult problem, especially given that (a) \r\n\t\t\t\ sequences appear in the strings, (b) there are NA and missing values, (c) not every row is duplicated, and (d) the formats differ (some with area codes, some with ., some with -, some with ()). Any recommendations on how to clean this column?

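# Drop everything from the first carriage return onward, keeping only the first copy of the number
df$Phone <- sub('\r.*', '', df$Phone)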
Title Phone
1 Head Coach (904) 256-7242
2 Athletic Trainer 256-765-5020
3 Head Coach <NA>
4 Assistant Coach 765.285.8142
5 Student Assistant
6 Head Men's Basketball Coach 549-5849
7 Coach 516-302-1039

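We could replace each run of whitespace characters with a delimiter (,) using gsub, split at that delimiter, and extract the first element. Note that the empty string in row 5 becomes NA with this approach.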
df1$Phone <- sapply(strsplit(gsub("[\r\n\t]+", ",", df1$Phone), ","), \(x) x[1])
-output
df1$Phone
[1] "(904) 256-7242" "256-765-5020" NA
[4] "765.285.8142" NA "549-5849" "516-302-1039"
Another option is trimws: set the whitespace argument to a pattern that matches one or more of [\r\n\t] followed by any other characters (.*). This keeps the empty string in row 5 as it is.
trimws(df1$Phone, whitespace = "[\r\n\t]+.*")
[1] "(904) 256-7242" "256-765-5020" NA
[4] "765.285.8142" "" "549-5849" "516-302-1039"
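If the two copies in a cell could ever differ, cutting at the first control character would silently drop the second number. Below is a minimal base R sketch that splits on the whitespace run and keeps the distinct values instead. The helper name dedupe_phone and the "; " separator are illustrative assumptions, not part of the answers above.

```r
# Small sample in the same shape as the Phone column (duplicates, NA, empty string)
phone <- c("(904) 256-7242\r\n\t\t\r\n\t\t\t(904) 256-7242",
           NA, "", "516-302-1039")

# Hypothetical helper: split each cell on runs of \r, \n, \t and keep unique parts
dedupe_phone <- function(x) {
  vapply(strsplit(x, "[\r\n\t]+"), function(parts) {
    # strsplit() returns a single NA for a missing input; keep it missing
    if (length(parts) == 1 && is.na(parts)) return(NA_character_)
    parts <- unique(trimws(parts))
    parts <- parts[nzchar(parts)]
    # An empty input yields character(0), which paste() collapses to ""
    paste(parts, collapse = "; ")
  }, character(1))
}

cleaned <- dedupe_phone(phone)
print(cleaned)
```

With this approach NA stays NA and "" stays "", and a cell holding two different numbers would keep both, joined by the separator.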

Related

using key word to label a new column in R

I need to create a new column "Group" in my df data frame based on keywords.
I tried using %in% but did not get the result I expected.
In this column, I want to label every row using some keywords
(from the keyword vectors, or possibly from a separate keywords dataframe).
For example:
library(tibble)
df <- tibble(Title = c("Iran: How we are uncovering the protests and crackdowns",
"Deepak Nirula: The man who brought burgers and pizzas to India",
"Phil Foden: Manchester City midfielder signs new deal with club until 2027",
"The Danish tradition we all need now",
"Slovakia LGBT attack"),
Text = c("Iranian authorities have been disrupting the internet service in order to limit the flow of information and control the narrative, but Iranians are still sending BBC Persian videos of protests happening across the country via messaging apps. Videos are also being posted frequently on social media.
Before a video can be used in any reports, journalists need to establish where and when it was filmed.They can pinpoint the location by looking for landmarks and signs in the footage and checking them against satellite images, street-level photos and previous footage. Weather reports, the position of the sun and the angles of shadows it creates can be used to confirm the timing.",
"For anyone who grew up in capital Delhi during the 1970s and 1980s, Nirula's - run by the family of Deepak Nirula who died last week - is more than a restaurant. It's an emotion.
The restaurant transformed the eating-out culture in the city and introduced an entire generation to fast food, American style, before McDonald's and KFC came into the country. For many it was synonymous with its hot chocolate fudge.",
"Stockport-born Foden, who has scored two goals in 18 caps for England, has won 11 trophies with City, including four Premier League titles, four EFL Cups and the FA Cup.He has also won the Premier League Young Player of the Season and PFA Young Player of the Year awards in each of the last two seasons.
City boss Pep Guardiola handed him his debut as a 17-year-old and Foden credited the Spaniard for his impressive development over the last five years.",
"Norwegian playwright and poet Henrik Ibsen popularised the term /friluftsliv/ in the 1850s to describe the value of spending time in remote locations for spiritual and physical wellbeing. It literally translates to /open-air living/, and today, Scandinavians value connecting to nature in different ways – something we all need right now as we emerge from an era of lockdowns and inactivity.",
"The men were shot dead in the capital Bratislava on Wednesday, in a suspected hate crime.Organisers estimated that 20,000 people took part in the vigil, mourning the men's deaths and demanding action on LGBT rights.Slovak President Zuzana Caputova, who has raised the rainbow flag over her office, spoke at the event.")
)
keyword1 <- c("authorities", "Iranian", "Iraq", "control", "Riots")
keyword2 <- c("McDonald's","KFC", "McCafé", "fast food")
keyword3 <- c("caps", "trophies", "season", "seasons")
keyword4 <- c("travel", "landscape", "living", "spiritual")
keyword5 <- c("LGBT", "lesbian", "les", "rainbow", "Gay", "Bisexual","Transgender")
I need to mutate a new column "Group" based on those keywords:
if a row matches keyword1, label it "Politics",
if a row matches keyword2, label it "Food",
if a row matches keyword3, label it "Sport",
if a row matches keyword4, label it "Travel",
if a row matches keyword5, label it "LGBT".
Can the matching also ignore case?
Below is expected output
Title          Text        Group
Iran: How..    Iranian...  Politics
Deepak Nir..   For any...  Food
Phil Foden..   Stockpo...  Sport
The Danish..   Norwegi...  Travel
Slovakia L..   The men...  LGBT
Thanks to everyone who spends time on this.
You could try this:
library(dplyr)
df %>%
  rowwise %>%
  mutate(
    ## add column with words found in title or text (splitting by non-word character):
    words = list(strsplit(split = '\\W', paste(Title, Text)) %>% unlist),
    group = {
      categories <- list(keyword1, keyword2, keyword3, keyword4, keyword5)
      ## i indexes those items (=keyword vectors) of list 'categories'
      ## which share at least one word with column Title or Text (so that length > 0)
      i <- categories %>% lapply(\(category) length(intersect(unlist(words), category))) %>% as.logical
      ## pick group name via index; join with ',' if more than one category applies
      c('Politics', 'Food', 'Sport', 'Travel', 'LGBT')[i] %>% paste(collapse = ',')
    }
  )
output:
## # A tibble: 5 x 4
## # Rowwise:
## Title Text words group
## <chr> <chr> <lis> <chr>
## 1 Iran: How we are uncovering the protests and crackdowns "Ira~ <chr> Poli~
## 2 Deepak Nirula: The man who brought burgers and pizzas to In~ "For~ <chr> Food
## 3 Phil Foden: Manchester City midfielder signs new deal with ~ "Sto~ <chr> Sport
## 4 The Danish tradition we all need now "Nor~ <chr> Trav~
## 5 Slovakia LGBT attack "The~ <chr> LGBT
The basic idea is to define all keyword* vectors case-insensitively (hence the (?i) in the patterns) as alternation patterns (hence the | for collapsing) with word boundaries (hence the \\b before and after the alternatives, to ensure that "caps" is matched but not, for example, "capsize") and to use nested ifelse statements to assign the Group labels. Note that the final ifelse assigns "LGBT" to every row that matches none of the first four keyword vectors, so keyword5 itself is never tested:
library(tidyverse)
df %>%
mutate(
All = str_c(Title, Text),
Group = ifelse(str_detect(All, str_c("(?i)\\b(", str_c(keyword1, collapse = "|"), ")\\b")), "Politics",
ifelse(str_detect(All, str_c("(?i)\\b(", str_c(keyword2, collapse = "|"), ")\\b")), "Food",
ifelse(str_detect(All, str_c("(?i)\\b(", str_c(keyword3, collapse = "|"), ")\\b")), "Sport",
ifelse(str_detect(All, str_c("(?i)\\b(", str_c(keyword4, collapse = "|"), ")\\b")), "Travel", "LGBT"))))
) %>%
select(Group)
# A tibble: 5 × 1
Group
<chr>
1 Politics
2 Food
3 Sport
4 Travel
5 LGBT
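The same word-boundary, case-insensitive matching can also be written in base R so that every keyword set is tested explicitly and unmatched rows stay NA. This is only a sketch under assumptions: the function name label_group, the shortened keyword list, and the sample sentences are made up for illustration, and the keywords are assumed to contain no regex metacharacters.

```r
# Named list: each name is the label, each element its keyword vector (shortened here)
keywords <- list(
  Politics = c("authorities", "Iranian"),
  Food     = c("KFC", "fast food"),
  LGBT     = c("LGBT", "rainbow")
)

# Made-up sample text, one element per row
text <- c("Iranian authorities disrupt the internet",
          "before McDonald's and kfc came into the country",
          "raised the rainbow flag over her office",
          "nothing relevant here")

# Hypothetical helper: the first matching label wins; rows with no match stay NA
label_group <- function(text, keywords) {
  group <- rep(NA_character_, length(text))
  for (label in names(keywords)) {
    pattern <- paste0("\\b(", paste(keywords[[label]], collapse = "|"), ")\\b")
    hit <- grepl(pattern, text, ignore.case = TRUE, perl = TRUE)
    group[is.na(group) & hit] <- label
  }
  group
}

groups <- label_group(text, keywords)
print(groups)
```

The order of the list decides which label a row gets when several keyword sets match, which mirrors the order of the nested ifelse calls.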

Stack top two rows (including column name) onto other dataframe

I have two data frames:
df<-structure(list(`Active Contact*` = "Entries must be in a Yes or No format. Only active staff may be added to a protocol.",
`First Name*` = "Free text field. [255]", `Middle Name
` = "Free text field. [255]",
`Last Name*` = "Free text field. [255]", `Email**
` = "This field is required when the contact is a user or the contact has any of the Receives Broadcast Emails, Receives Notifications, or Receives Administrative System Notifications settings set to Yes.\r\nThis field must be mapped if Email is selected in the Unique Identifier field. Entries must be unique across all contacts (both active and inactive) and must be in a valid email format (abc#efg.zyx). [254]"), row.names = c(NA,
-1L), class = c("tbl_df", "tbl", "data.frame"))
df2<-structure(list(ActiveContact = c("Yes", "Yes", "Yes", "Yes",
"Yes", "Yes", "Yes"), fname = c("practice", "practice", "practice",
"practice", "practice", "practice", "practice"), middlename = c(NA,
NA, NA, NA, NA, NA, NA), lname = c("PI", "research nurse", "research nurse",
"research nurse", "regulatory", "regulatory", "regulatory"),
email = c("ppi#lifespan.org", "prn#lifespan.org", "prn#lifespan.org",
"prn#lifespan.org", "preg#lifespan.org", "preg#lifespan.org",
"preg#lifespan.org")), row.names = c(NA, -7L), class = c("tbl_df",
"tbl", "data.frame"))
I need to use the column names from df, and also the first row from df, as the column names and first row in df2 (replacing the column names from df2, and pushing the first row in df2 down 1 row to fit).
My expected output would be:
I know the column names are terrible (weird symbols and spaces and things I hate), and also I know the first row that I need is full of all sorts of stuff I typically hate, but I need this for my output format.
Thank you!
You can try to row bind them, simultaneously renaming the columns of df2
rbind(df,setNames(df2,names(df)))
Output:
`Active Contact*` `First Name*` `Middle Name\n ~ `Last Name*` `Email**\n \n ~
<chr> <chr> <chr> <chr> <chr>
1 Entries must be in a Yes or No fo~ Free text field~ Free text field. [255] Free text fie~ "This field is required when the contact is a us~
2 Yes practice NA PI "ppi#lifespan.org"
3 Yes practice NA research nurse "prn#lifespan.org"
4 Yes practice NA research nurse "prn#lifespan.org"
5 Yes practice NA research nurse "prn#lifespan.org"
6 Yes practice NA regulatory "preg#lifespan.org"
7 Yes practice NA regulatory "preg#lifespan.org"
8 Yes practice NA regulatory "preg#lifespan.org"
Alternatively, rename the columns of df2 first and then bind the rows:
names(df2) <- names(df)
df3 <- rbind(df, df2)
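Both answers rely on the two data frames having the same number of columns in the same order, since the renaming is purely positional. Here is a small self-contained sketch of that idea with a guard added. The data frames template and records are made-up stand-ins for df and df2.

```r
# Made-up stand-ins: one description row with awkward names, and the real records
template <- data.frame(`First Name*` = "Free text field. [255]",
                       `Email**` = "Must be a valid email.",
                       check.names = FALSE, stringsAsFactors = FALSE)
records <- data.frame(fname = c("practice", "practice"),
                      email = c("a@example.org", "b@example.org"),
                      stringsAsFactors = FALSE)

# Renaming is positional, so fail early if the column counts differ
stopifnot(ncol(template) == ncol(records))

# Give records the template's names, then stack the template row on top
stacked <- rbind(template, setNames(records, names(template)))
print(stacked)
```

Because setNames returns a renamed copy, records itself keeps its original column names.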

Repetition when scraping using rvest in R

I am trying to scrape text using rvest in R, and df1 is the output. For News 2, the text was split into 3 rows, which causes News 1 to repeat for 2 extra rows. How can I join News 2 into 1 complete sentence?
> dput(df1)
structure(list(`News 1` = c("Nike faces social media storm in China over Xinjiang statement",
"Nike faces social media storm in China over Xinjiang statement",
"Nike faces social media storm in China over Xinjiang statement"
), `News 2` = c("Biden calls for assault weapon ban after", "Colorado",
"shooting")), row.names = c(NA, -3L), class = c("tbl_df", "tbl",
"data.frame"))
My desired output has the whole sentence in the same row:
> df1
News 1 News 2
1 Nike faces social media storm in China over Biden calls for assault weapon ban
Xinjiang statement after Colorado shooting
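One possible way to get this shape, sketched in base R under stated assumptions: the fragments of each column are already in reading order, and repeated values within a column are scrape duplicates that can be dropped. The variable name collapsed is illustrative.

```r
# The data from the question, rebuilt as a plain data.frame
df1 <- data.frame(
  `News 1` = rep("Nike faces social media storm in China over Xinjiang statement", 3),
  `News 2` = c("Biden calls for assault weapon ban after", "Colorado", "shooting"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# For each column: drop repeated values, then join the remaining pieces with a space
pieces <- lapply(df1, function(col) paste(unique(col), collapse = " "))

# Rebuild a one-row data frame, keeping the original column names
collapsed <- do.call(data.frame,
                     c(pieces, check.names = FALSE, stringsAsFactors = FALSE))
print(collapsed)
```

Note that unique() would also drop a fragment that legitimately occurs twice in the same sentence, so this only suits data where repeats are artifacts of the scrape.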

How to extract matching values from a column in a dataframe when semicolons are present in R?

I have a large dataframe of published articles for which I would like to extract all articles relating to a few authors specified in a separate list. The authors in the dataframe are grouped together in one column separated by a ; . Not all authors need to match, I would like to extract any article which has one author matched to the list. An example is below.
Title<-c("A", "B", "C")
AU<-c("Mark; John; Paul", "Simone; Lily; Poppy", "Sarah; Luke")
df<-data.frame(Title, AU)
authors<-as.character(c("Mark", "John", "Luke"))
df[sapply(strsplit((as.character(df$AU)), "; "), function(x) any(authors %in% x)),]
I would expect this to return:
Title AU
A Mark; John; Paul
C Sarah; Luke
However with my large dataframe this command does not work to return all AU, it only returns rows which have a single AU not multiple ones.
Here is a dput from my larger dataframe of 5 rows
structure(list(AU = c("FOOKES PG;DEARMAN WR;FRANKLIN JA", "SIMS DG;DOWNHAM MAPS;MCQUILLIN J;GARDNER PS",
"TURNER BR", "BUTLER J;MARSH H;GOODARZI F", "OVERTON M"), TI = c("SOME ENGINEERING ASPECTS OF ROCK WEATHERING WITH FIELD EXAMPLES FROM DARTMOOR AND ELSEWHERE",
"RESPIRATORY SYNCYTIAL VIRUS INFECTION IN NORTH-EAST ENGLAND",
"TECTONIC AND CLIMATIC CONTROLS ON CONTINENTAL DEPOSITIONAL FACIES IN THE KAROO BASIN OF NORTHERN NATAL, SOUTH AFRICA",
"WORLD COALS: GENESIS OF THE WORLD'S MAJOR COALFIELDS IN RELATION TO PLATE TECTONICS",
"WEATHER AND AGRICULTURAL CHANGE IN ENGLAND, 1660-1739"), SO = c("QUARTERLY JOURNAL OF ENGINEERING GEOLOGY",
"BRITISH MEDICAL JOURNAL", "SEDIMENTARY GEOLOGY", "FUEL", "AGRICULTURAL HISTORY"
), JI = c("Q. J. ENG. GEOL.", "BRIT. MED. J.", "SEDIMENT. GEOL.",
"FUEL", "AGRICULTURAL HISTORY")
An option with str_extract_all
library(dplyr)
library(stringr)
df %>%
mutate(Names = str_extract_all(Names, str_c(authors, collapse="|"))) %>%
filter(lengths(Names) > 0)
# Title Names
#1 A Mark, John
#2 C Luke
data
df <- data.frame(Title, Names = AU)
In base R you can subset it like so:
df[sapply(strsplit(as.character(df$Names), "; "), function(x) any(authors %in% x)),]
Title Names
1 A Mark; John; Paul
3 C Sarah; Luke
This can be accomplished by subsetting on those Names that match the pattern specified in the first argument to the function grepl:
df[grepl(paste0(authors, collapse = "|"), df[,2]),]
Title Names
[1,] "A" "Mark; John; Paul"
[2,] "C" "Sarah; Luke"
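In the dput sample from the larger dataframe, the authors are separated by ";" with no following space, so splitting on "; " leaves every multi-author string as a single unsplit element. A hedged base R sketch that tolerates both forms is shown below. The sample rows and the authors vector are shortened, made-up stand-ins.

```r
# Made-up rows mixing both separator styles: ";" and "; "
df <- data.frame(TI = c("A", "B", "C"),
                 AU = c("FOOKES PG;DEARMAN WR", "TURNER BR", "BUTLER J; MARSH H"),
                 stringsAsFactors = FALSE)
authors <- c("DEARMAN WR", "MARSH H")

# Split on a semicolon followed by optional whitespace
author_list <- strsplit(df$AU, ";[[:space:]]*")

# Keep rows where at least one split name is exactly one of the wanted authors
keep <- vapply(author_list, function(x) any(x %in% authors), logical(1))
result <- df[keep, ]
print(result)
```

Matching whole split names with %in% also avoids the partial matches that a grepl alternation pattern can produce when one name is a substring of another.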

how to quickly expand a dataframe using the lists inside the dataframe

I have a dataframe that contains one column with a unique string identifier, another column with a simple string/keyword, and a third column that is a string separated by commas ("categories"). This dataframe has x rows and the categories string in the 3rd column may have any number of commas. I want to split the categories by commas, append the keyword string to each of those separated categories, then create a new dataframe that consists of a column for the unique string identifier and a column for each new string that was created.
Here's an example of my starting DF:
startDF <- data.frame(uq_id = c("44ffd", "t3dd", "rrk33--ds", "limmt3"),
keyword = c("citizen", "river", "mouse", "hello"),
categories = c("App, Restaurant, Hotel", "Field, Place", "Movie", "App, Hotel, Theater, Show"))
And here's what I'd like the final DF to look like:
endDF <- data.frame(uq_id = c("44ffd", "44ffd", "44ffd", "t3dd", "t3dd", "rrk33--ds", "limmt3", "limmt3", "limmt3", "limmt3"),
combo = c("citizen App", "citizen Restaurant", "citizen Hotel", "river Field", "river Place", "mouse Movie",
"hello App", "hello Hotel", "hello Theater", "hello Show"))
Currently, I'm looping through each element of the DF and creating this new dataframe row by row, but that is slow and I feel like there must be a better way using apply, strsplit, paste, etc.
Is there a quick and simple solution for this? Thanks!
Using tidyverse, we can first separate each category into its own row with separate_rows and then unite it with the keyword column.
library(tidyverse)
startDF %>%
separate_rows(categories) %>%
unite(combo, keyword, categories, sep = " ")
# uq_id combo
#1 44ffd citizen App
#2 44ffd citizen Restaurant
#3 44ffd citizen Hotel
#4 t3dd river Field
#5 t3dd river Place
#6 rrk33--ds mouse Movie
#7 limmt3 hello App
#8 limmt3 hello Hotel
#9 limmt3 hello Theater
#10 limmt3 hello Show
A base R method is to split categories on the comma, repeat uq_id according to the number of categories in each row, and build a new dataframe by pasting each keyword onto its categories with mapply.
list_cat <- strsplit(startDF$categories, ", ", fixed = TRUE)
data.frame(uq_id = rep(startDF$uq_id, lengths(list_cat)),
           combo = unlist(mapply(paste, startDF$keyword, list_cat, SIMPLIFY = FALSE), use.names = FALSE))
Create startDF with stringsAsFactors = FALSE (the default since R 4.0.0) so that the columns are characters instead of factors.
A different tidyverse possibility could be:
startDF %>%
mutate(categories = strsplit(as.character(categories), ", ", fixed = TRUE)) %>%
unnest(categories) %>%
transmute(uq_id = uq_id,
combo = paste(keyword, categories, sep = " "))
uq_id combo
1 44ffd citizen App
2 44ffd citizen Restaurant
3 44ffd citizen Hotel
4 t3dd river Field
5 t3dd river Place
6 rrk33--ds mouse Movie
7 limmt3 hello App
8 limmt3 hello Hotel
9 limmt3 hello Theater
10 limmt3 hello Show
