Use regex to replace duplicate phrases

I need to parse large data files and for reasons unknown the addresses are sometimes repeated, like this:
d<- data.table(name = c("bill", "tom"), address = c("35 Valerie Avenue / 35 Valerie Avenue", "702 / 9 Paddock Street / 702 / 9 Paddock Street"))
I have figured out how to de-dupe the easy ones (e.g. "35 Valerie Avenue / 35 Valerie Avenue") with the following:
replace.dupe.addresses<- function(x){
rep_expr<- "^(.*)/(.*)$"
idx<- grepl("/",x) & (trimws(sub(rep_expr, "\\2", x)) == trimws(sub(rep_expr, "\\1",x)))
x[idx]<- trimws(sub(rep_expr, "\\1",x[idx]))
x
}
d[,address := replace.dupe.addresses(address)]
But this doesn't work for addresses where the critical '/' is further embedded. I tried this as my regex:
rep_expr<- "^(.*)[:alpha:][:space:]?/(.*)$"
but it doesn't work either. What regular expression would capture both of these repeating phrases?

See if this works for your dataset
library(data.table)
d[, .(name, address = lapply(strsplit(address, " / "), function(x)
paste(x[!duplicated(x)], collapse=" / "))), by=.I]
name address
1: bill 35 Valerie Avenue
2: tom 702 / 9 Paddock Street

Please check the code below:
library(dplyr)
library(tidyr)
d %>% separate_rows(address, sep = '\\/') %>% mutate(address=trimws(address)) %>%
group_by(name, address) %>% slice_head(n=1) %>% group_by(name) %>%
mutate(address=paste(address, collapse = '/')) %>% slice_head(n=1)
Created on 2023-01-27 with reprex v2.0.2
# A tibble: 2 × 2
# Groups: name [2]
name address
<chr> <chr>
1 bill 35 Valerie Avenue
2 tom 702/9 Paddock Street

Split on forward slash, then get unique and paste it back:
sapply(strsplit(d$address, " / ", fixed = TRUE),
function(i) paste(unique(i), collapse = "/"))
# [1] "35 Valerie Avenue" "702/9 Paddock Street"

Try using a regex
gsub("((?:\\S+\\s+/\\s+)?(?:\\S+\\s+){2}\\S+)\\s+/\\s+\\1+", "\\1", d$address)
[1] "35 Valerie Avenue" "702 / 9 Paddock Street"
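For the specific case in the question, where the whole address is repeated exactly once around a single " / ", a shorter backreference pattern may be enough. This is only a sketch under that assumption; the helper name dedupe_address and the third test address are made up for illustration.

```r
# Hypothetical helper: collapse "X / X" to "X" when the whole string is one phrase repeated twice.
dedupe_address <- function(x) {
  # ^(.+) captures a candidate phrase; \\1 then requires the same text again after " / "
  sub("^(.+) / \\1$", "\\1", x, perl = TRUE)
}

addresses <- c(
  "35 Valerie Avenue / 35 Valerie Avenue",
  "702 / 9 Paddock Street / 702 / 9 Paddock Street",
  "12 Main Street / Unit 4"  # invented example: not a duplicate, so it should be left alone
)
result <- dedupe_address(addresses)
print(result)
```

Because the pattern is anchored at both ends, strings that are not an exact repeat are returned unchanged. Addresses repeated three or more times, or with inconsistent spacing around the slash, would need a looser pattern.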

Related

Extract keys and values from hstore string in R

I'm working in R with strings obtained from OpenStreetMap and stored using the hstore data type. For example:
"comment"=>"Removed junction=roundabout as some entrances have right of way. See http://wiki.openstreetmap.org/wiki/Tag:junction%3Droundabout","lit"=>"yes","maxspeed"=>"30 mph","oneway"=>"yes","surface"=>"asphalt"
I would like to create a regex (or any other approach is fine) to extract all keys and all values. Please notice that keys and values could contain the characters =, >, \", or ,. The ideal approach should use only functions implemented in base-R packages.
EDIT - Expected output
If the input is
"comment"=>"Removed junction=roundabout as some entrances have right of way. See http://wiki.openstreetmap.org/wiki/Tag:junction%3Droundabout","lit"=>"yes","maxspeed"=>"30 mph","oneway"=>"yes","surface"=>"asphalt"
then the expected output should be something like
keys: "comment", "lit", "maxspeed", "oneway", "surface"
values: "Removed junction=roundabout as some entrances have right of way. See http://wiki.openstreetmap.org/wiki/Tag:junction%3Droundabout", "yes", "30 mph", "yes", "asphalt"
If the input is
"lit"=>"no","foot"=>"designated","horse"=>"designated","bicycle"=>"yes","surface"=>"gravel","old_name"=>"Freshwater, Yarmouth & Newport Railway","prow_ref"=>"F61"
then the output should be like
keys: "lit", "foot", "horse", "bicycle", "surface", "old_name", "prow_ref"
values: "no", "designated", "designated", "yes", "gravel", "Freshwater, Yarmouth & Newport Railway", "F61"

We can use tidyverse approaches
library(dplyr)
library(tidyr)
library(stringr)
tibble(str1) %>%
separate_rows(str1, sep = '"[^"]+"(*SKIP)(*FAIL)|,') %>%
separate(str1, into = c('key', 'value'), sep= '"=>"') %>%
mutate(across(everything(), ~ str_remove_all(.x, pattern = '"')))
Output:
# A tibble: 12 x 2
# key value
# <chr> <chr>
# 1 comment Removed junction=roundabout as some entrances have right of way. See http://wiki.openstreetmap.org/wiki/Tag:junction%3Droundabout
# 2 lit yes
# 3 maxspeed 30 mph
# 4 oneway yes
# 5 surface asphalt
# 6 lit no
# 7 foot designated
# 8 horse designated
# 9 bicycle yes
#10 surface gravel
#11 old_name Freshwater, Yarmouth & Newport Railway
#12 prow_ref F61
Data:
str1 <- c("\"comment\"=>\"Removed junction=roundabout as some entrances have right of way. See http://wiki.openstreetmap.org/wiki/Tag:junction%3Droundabout\",\"lit\"=>\"yes\",\"maxspeed\"=>\"30 mph\",\"oneway\"=>\"yes\",\"surface\"=>\"asphalt\"",
"\"lit\"=>\"no\",\"foot\"=>\"designated\",\"horse\"=>\"designated\",\"bicycle\"=>\"yes\",\"surface\"=>\"gravel\",\"old_name\"=>\"Freshwater, Yarmouth & Newport Railway\",\"prow_ref\"=>\"F61\""
)

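A base R option with gsub() and read.table(), where string is assumed to hold the second example input:
read.table(text=gsub(',?([^,]+)=>',"\n\\1:", string, perl = TRUE), sep=":",
col.names = c("Key", "value"))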
Key value
1 lit no
2 foot designated
3 horse designated
4 bicycle yes
5 surface gravel
6 old_name Freshwater, Yarmouth & Newport Railway
7 prow_ref F61
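Since the question asks for base R only and notes that keys and values may themselves contain =, >, commas or escaped quotes, one possible sketch is to match each quoted key/value pair directly rather than splitting on separators. The function name parse_hstore is hypothetical, and the pattern assumes every key and value is double-quoted with backslash escapes.

```r
# Hypothetical helper: extract "key"=>"value" pairs from one hstore-style string.
parse_hstore <- function(x) {
  # One quoted token: any run of non-quote, non-backslash characters or backslash-escaped characters
  tok <- '"((?:[^"\\\\]|\\\\.)*)"'
  pair <- paste0(tok, "=>", tok)
  # All complete pairs found in the string
  pairs <- regmatches(x, gregexpr(pair, x, perl = TRUE))[[1]]
  data.frame(
    key = sub(pair, "\\1", pairs, perl = TRUE),
    value = sub(pair, "\\2", pairs, perl = TRUE),
    stringsAsFactors = FALSE
  )
}

x <- '"lit"=>"no","old_name"=>"Freshwater, Yarmouth & Newport Railway","prow_ref"=>"F61"'
parsed <- parse_hstore(x)
print(parsed)
```

This sketch handles one string at a time (wrap it in lapply() for a vector) and does not cover unquoted values such as NULL, so it should be checked against the real data before use.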

How to concatenate multiple columns with separators but ignore some of columns based on condition in R?

Hi there, I would like to concatenate columns containing strings, blanks or NAs, using ";" as the separator.
Take the example below:
Actor1<- c("Driver","NA","","")
Actor2<- c("President","Zombie","","")
Actor3<- c("CEO","Devil","","")
Actor4<-c("Priest","","Killer","Mayor")
df_ex <-data.frame(Actor1, Actor2, Actor3, Actor4)
I tried this:
df_ex %>%
mutate(combined= paste0(Actor1,";",Actor2,";",Actor3,";",Actor4))
but the result is obviously wrong. For example, for
df_ex[3,]
the value in the combined column is:
;;;Killer
whereas I would expect:
Killer
Note: there are NAs and blanks ("") as well, which I'd like to ignore.
Thanks in advance,
cheers

I am far far away from being a regex expert, but I'll put here a tidyverse approach:
Actor1 <- c("Driver","NA","","")
Actor2 <- c("President","Zombie","","")
Actor3 <- c("CEO","Devil","","")
Actor4 <-c("Priest","","Killer","Mayor")
library(tidyverse)
data.frame(Actor1, Actor2, Actor3, Actor4) %>%
mutate_all(~str_replace(., pattern = "NA", replacement = "")) %>%
unite(col = "combined", sep = ";", remove = F) %>%
mutate(combined = str_replace_all(combined, pattern = "^[:punct:]|[:punct:]$|[:punct:]{2,}", replacement = "")) %>%
select(-combined, everything(.), combined)
#> Actor1 Actor2 Actor3 Actor4 combined
#> 1 Driver President CEO Priest Driver;President;CEO;Priest
#> 2 Zombie Devil Zombie;Devil
#> 3 Killer Killer
#> 4 Mayor Mayor
If you want just some of the columns, you can pass them in unite:
data.frame(Actor1, Actor2, Actor3, Actor4) %>%
mutate_all(~str_replace(., pattern = "NA", replacement = "")) %>%
unite(Actor2, Actor4, col = "combined", sep = ";", remove = F) %>%
mutate(combined = str_replace_all(combined, pattern = "^[:punct:]|[:punct:]$|[:punct:]{2,}", replacement = "")) %>%
select(-combined, everything(.), combined)
#> Actor1 Actor2 Actor3 Actor4 combined
#> 1 Driver President CEO Priest President;Priest
#> 2 Zombie Devil Zombie
#> 3 Killer Killer
#> 4 Mayor Mayor

Actor1<- c("Driver","NA","","")
Actor2<- c("President","Zombie","","")
Actor3<- c("CEO","Devil","","")
Actor4<-c("Priest","","Killer","Mayor")
matrix_ex <-cbind(Actor1, Actor2, Actor3, Actor4)
#apply(df_ex,1,paste,collapse=";")
x<-apply(matrix_ex,1,function(x){paste(x[!(is.na(x)|x==""|x=="NA")],collapse=";")})
x
# [1] "Driver;President;CEO;Priest" "Zombie;Devil" "Killer" "Mayor"
cat(paste(x,collapse="\n"))
# Driver;President;CEO;Priest
# Zombie;Devil
# Killer
# Mayor
To answer the comments:
df_ex <-data.frame(Actor1=Actor1, Actor2=Actor2, Actor3=Actor3, Actor4=Actor4,rnorm(4))
df_ex$concat<-apply(df_ex[c("Actor1","Actor3")],1,function(x){paste(x[!(is.na(x)|x==""|x=="NA")],collapse=";")})
df_ex$concat
df_ex$concat2<-apply(df_ex[c(1,3)],1,function(x){paste(x[!(is.na(x)|x==""|x=="NA")],collapse=";")})
df_ex$concat2

You can try the code below, using do.call + paste
df_ex$combine <- gsub("\\bNA;?\\b|;{2,}|;$","",do.call(paste,c(df_ex,sep = ";")))
such that
> df_ex
Actor1 Actor2 Actor3 Actor4 combine
1 Driver President CEO Priest Driver;President;CEO;Priest
2 NA Zombie Devil Zombie;Devil
3 Killer Killer
4 Mayor Mayor

Splitting strings in between 3rd and 4th characters in R

I'm grabbing information from Wikipedia on Canadian Forward Sortation Areas (FSAs - those are the first 3 characters of postal codes in Canada) and what cities/areas they belong to. An example of this information is below:
library(rvest)
library(tidyverse)
URL <- paste0("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_", "K")
FSAs <- URL %>%
read_html() %>%
html_nodes(xpath = "//td") %>%
html_text()
head(FSAs)
[1] "K1AGovernment of CanadaOttawa and Gatineau offices (partly in QC)\n" "K2AOttawa(Highland Park / McKellar Park /Westboro /Glabar Park /Carlingwood)\n"
[3] "K4AOttawa(Fallingbrook)\n" "K6AHawkesbury\n"
[5] "K7ASmiths Falls\n" "K8APembrokeCentral and northern subdivisions\n"
The problem I'm facing is that I would like a data frame with the first 3 characters of each string in one column and the rest of the information in another. I thought there would be a solution involving a stringr function like str_split(), but this removes the matched pattern (the first 3 characters), which I of course don't want. In effect, I'm looking to split each string between its 3rd and 4th characters.
I've figured out this solution, with the last bit borrowed from this answer, but it's incredibly hackish. Is there a better way of doing this?
FSAs %>%
enframe(name = NULL) %>%
separate(value, c(NA, "Location"), sep = "^...", remove = FALSE) %>%
separate(value, c("FSA", NA), sep = "(?<=\\G...)")
# A tibble: 195 x 2
FSA Location
<chr> <chr>
1 K1A "Government of CanadaOttawa and Gatineau offices (partly in QC)\n"
2 K2A "Ottawa(Highland Park / McKellar Park /Westboro /Glabar Park /Carlingwood)\n"
3 K4A "Ottawa(Fallingbrook)\n"
4 K6A "Hawkesbury\n"
5 K7A "Smiths Falls\n"
6 K8A "PembrokeCentral and northern subdivisions\n"
7 K9A "Cobourg\n"
8 K1B "Ottawa(Blackburn Hamlet / Pine View / Sheffield Glen)\n"
9 K2B "Ottawa(Britannia /Whitehaven / Bayshore / Pinecrest)\n"
10 K4B "Ottawa(Navan)\n"
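Because the split point is a fixed character position rather than a pattern, one possible simplification is to slice by position with base R. The sketch below uses three of the sample strings from the question; the names fsa_raw and fsa_df are made up for illustration, and trimming the trailing newline is an added assumption about what is wanted.

```r
# A few of the scraped strings shown in the question
fsa_raw <- c("K6AHawkesbury\n", "K7ASmiths Falls\n", "K4AOttawa(Fallingbrook)\n")

fsa_df <- data.frame(
  FSA = substr(fsa_raw, 1, 3),               # characters 1 to 3
  Location = trimws(substring(fsa_raw, 4)),  # character 4 onward, with the trailing "\n" trimmed
  stringsAsFactors = FALSE
)
print(fsa_df)
```

Within a tidyverse pipeline, tidyr's separate() also accepts a numeric sep, which it interprets as a character position, so separate(value, c("FSA", "Location"), sep = 3) may do the same job in one call.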

Replace lowercase in names, not in surnames

I have a problem with a database of people's names. I want to abbreviate the first names but not the last names. The last name is separated from the first name by a comma, and different people are separated from each other by a semicolon, as in this example:
Michael, Jordan; Bird, Larry;
If the first name is a single word, this code works:
breve$autor <- str_replace_all(breve$autor, "[:lower:]{1,}\\;", ".\\;")
Result with this code:
Michael, J.; Bird, L.;
The problem is with compound names. With this code, the name:
Jordan, Michael Larry;
becomes:
Jordan, Michael L.;
Could someone tell me how to remove all lowercase letters between the comma and the semicolon, so that it looks like this:
Jordan, M.L.;

Here is another solution:
x1 <- 'Michael, Jordan; Bird, Larry;'
x2 <- 'Jordan, Michael Larry;'
gsub('([A-Z])[a-z]+(?=[ ;])', '\\1.', x1, perl = TRUE)
# [1] "Michael, J.; Bird, L.;"
gsub('([A-Z])[a-z]+(?=[ ;])', '\\1.', x2, perl = TRUE)
# [1] "Jordan, M. L.;"
Surnames are followed by , while the parts of the given names are followed by a space or ;. Here I use (?=[ ;]) to make sure that the character following the matched pattern is a space or a semicolon.
To remove the space between M. and L., an additional step is needed:
gsub('\\. ', '.', gsub('([A-Z])[a-z]+(?=[ ;])', '\\1.', x2, perl = TRUE))
# [1] "Jordan, M.L.;"
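The two steps can probably be folded into one pattern by consuming a trailing space as part of the match while only looking ahead for the semicolon. This is a sketch, not a tested general solution: the function name abbreviate_given_names is hypothetical, and it assumes single-word surnames and plain A-Z/a-z letters, as in the examples above.

```r
# Hypothetical helper: turn each given name into an initial, leaving surnames intact.
abbreviate_given_names <- function(x) {
  # A capitalised word followed by a space (consumed) or by ";" (only looked at).
  # Surnames are followed by "," so they never match.
  gsub("([A-Z])[a-z]+(?: |(?=;))", "\\1.", x, perl = TRUE)
}

x1 <- "Michael, Jordan; Bird, Larry;"
x2 <- "Jordan, Michael Larry;"
out1 <- abbreviate_given_names(x1)
out2 <- abbreviate_given_names(x2)
print(c(out1, out2))
```

A surname made of two words (for example "Van Damme") would have its first word abbreviated as well, so such names need separate handling.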

There must be a regular expression that will do this, of course. But that magic is a little beyond me. So here is an approach with simple string manipulation in a data frame using tidyverse functions.
library(stringr)
library(dplyr)
library(tidyr)
ballers <- "Michael, Jordan; Bird, Larry;"
mj <- "Jordan, Michael Larry"
c(ballers, mj) %>%
#split the players
str_split(., ";", simplify = TRUE) %>%
# remove white space
str_trim() %>%
#transpose to get players in a column
t %>%
#split again into last name and first + middle (if any)
str_split(",", simplify = TRUE) %>%
# convert to a tibble
as_tibble() %>%
# remove more white space
mutate(V2=str_trim(V2)) %>%
# remove empty rows (these can be avoided by different manipulation upstream)
filter(!V1 == "") %>%
# name the columns
rename("Last"=V1, "First_two"=V2) %>%
# separate the given names into first and middle (if any)
separate(First_two,into=c("First", "Middle"), sep=" ") %>%
# abbreviate to first letter
mutate(First_i=abbreviate(First, 1)) %>%
# abbreviate, but take into account that middle name might be missing
mutate(Middle_i=ifelse(!is.na(Middle), paste0(abbreviate(Middle, 1), "."), "")) %>%
# combine the first and middle initials
mutate(Initials=paste(First_i, Middle_i, sep=".")) %>%
# make the desired Last, F.M. vector
mutate(Final=paste(Last, Initials, sep=", "))
# A tibble: 3 x 7
Last First Middle First_i Middle_i Initials Final
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Michael Jordan NA J "" J. Michael, J.
2 Jordan Michael Larry M L. M.L. Jordan, M.L.
3 Bird Larry NA L "" L. Bird, L.
Warning message:
Expected 2 pieces. Missing pieces filled with `NA` in 2 rows [1, 3].
Much longer than a regex.

There will probably be a better way to do this, but I managed to get it to work using the stringr and tibble packages.
library(stringr)
library(tibble)
names <- 'Jordan, Michael; Bird, Larry; Obama, Barack; Bush, George Walker'
df <- as_tibble(str_split(unlist(str_split(names, '; ')), ', ', simplify = TRUE))
df[, 2] <- gsub('[a-z]+', '.', pull(df[, 2]))
This code generates the tibble df, which has the following contents:
# A tibble: 4 x 2
V1 V2
<chr> <chr>
1 Jordan M.
2 Bird L.
3 Obama B.
4 Bush G. W.
The names are first split into last and given names and stored in a data frame so that the gsub() call does not operate on the last names. Then, gsub() searches for runs of lowercase letters in the given names (the second column) and replaces each run with a single .
Then, you can call str_c(str_c(pull(df[, 1]), ', ', pull(df[, 2])), collapse = '; ') (or str_c(pull(unite(df, full, c('V1', 'V2'), sep = ', ')), collapse = '; ') if you already have the tidyr package loaded) to return the string "Jordan, M.; Bird, L.; Obama, B.; Bush, G. W.".
...also, did you mean Michael Jordan, not Jordan Michael? lol

Here's one that uses gsub twice. The inner one is for names with no middle names and the outer is for names that have a middle name.
x = c("Michael, Jordan; Jordan, Michael Larry; Bird, Larry;")
gsub(", ([A-Z])[a-z]+ ([A-Z])[a-z]+;", ", \\1.\\2.;", gsub(", ([A-Z])[a-z]+;", ", \\1.;", x))
#[1] "Michael, J.; Jordan, M.L.; Bird, L.;"

R data splitting unicodes

I have some data that I want to split into columns:
price_list <- c("Vegetables", " Garlic Desi<U+062A><U+06BE><U+0648><U+0645> <U+062F><U+06CC><U+0633><U+06CC> 140 per kg ",
" Fresh-bean<U+0641><U+0631><U+0627><U+0634><U+0628><U+06CC><U+0646> — per kg ",
"Fruits",
" Apple Kala Kolu Irani<U+0633><U+06CC><U+0628> <U+06A9><U+0627><U+0644><U+0627> <U+06A9><U+0648><U+0644><U+0648> <U+0627><U+06CC><U+0631><U+0627><U+0646><U+06CC> 168 per kg ",
" Apple golden 115 per kg ",
" Banana (I)<U+06A9><U+06CC><U+0644><U+0627> <U+0627><U+0646><U+0688><U+06CC><U+0646> 182 per dozen ",
"Others",
" Chicken<U+0645><U+0631><U+063A><U+06CC> <U+0634><U+06CC><U+0648><U+0631> 170 per kg ",
" Egg<U+0627><U+0646><U+0688><U+06D2> <U+0634><U+06CC><U+0648><U+0631> 95 per dozen "
)
I tried the following, but the Unicode escapes are causing problems:
library(stringr)
regexp <- "[[:digit:]]+"
rprice <- str_extract(df$price_list, regexp)
df$price <- data.frame(rprice)
The desired output looks like this:
Name Unicode Price Quantity
Vegetables
Fresh-bean فراشبین NA kg
Fruits
Apple golden NA 115 kg
Others
Egg انڈے شیور NA dozen
This forum is really helpful and has saved me hundreds of hours, thanks.
url <- "https://ictadministration.gov.pk/services/price-list/"
Complete code:
library(rvest)
scraping_wiki <- read_html("https://ictadministration.gov.pk/services/price-list/")
library(magrittr)
price_date <- scraping_wiki %>%
html_nodes(".tm-article-content > ol:nth-child(1) > div:nth-child(1)") %>%
html_text()%>%
strsplit(split = "\n") %>%
unlist() %>%
.[. != ""]
price_date <- gsub(":", "", price_date)
price_list <- scraping_wiki %>%
html_nodes(".xl-tbl") %>%
html_text() %>%
strsplit(split = "\n") %>%
unlist() %>%
.[. != ""]

Wow, messy. This gets you close:
library(dplyr)
library(stringr)
unis <- price_list %>% str_extract(pattern = "<[[:print:]]*>")
words <- price_list %>% str_extract(pattern = "[A-Z a-z<]*") %>% gsub("<U", "", x = .)
price <- price_list %>% str_extract(pattern = "[0-9]* per") %>% gsub("per", "", x = .)
quant <- price_list %>% str_extract(pattern = "per [a-z]*")
df <- tibble(Name = words, Unicode = unis, Price = price, Quantity = quant)
Result:
> head(df)
# A tibble: 6 x 4
Name Unicode Price Quantity
<chr> <chr> <chr> <chr>
1 Vegetables NA NA NA
2 " Garlic Desi" <U+062A><U+06BE><U+0648><U+0645> <U+062F><U+06CC><U+0633><U+06CC> "140~ per kg
3 " Fresh" <U+0641><U+0631><U+0627><U+0634><U+0628><U+06CC><U+0646> " " per kg
4 Fruits NA NA NA
5 " Apple Kala Kolu Irani" <U+0633><U+06CC><U+0628> <U+06A9><U+0627><U+0644><U+0627> <U+06A9><U+~ "168~ per kg
6 " Apple golden " NA "115~ per kg
I'm not a regex genius, so I'm sure there must be a cleaner way.
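The desired output in the question shows the actual Urdu text rather than the <U+XXXX> codes, so a further step may be to decode those codes back into characters. The sketch below assumes the codes appear literally in the strings exactly as printed in the question; the function name decode_u_escapes is made up for illustration.

```r
# Hypothetical helper: replace literal "<U+XXXX>" codes with the characters they stand for.
decode_u_escapes <- function(x) {
  vapply(x, function(s) {
    # Find every <U+XXXX> code in this string
    codes <- regmatches(s, gregexpr("<U\\+[0-9A-Fa-f]{4,6}>", s))[[1]]
    for (code in unique(codes)) {
      hex <- gsub("[<U+>]", "", code)               # keep only the hex digits
      ch <- intToUtf8(strtoi(hex, base = 16L))      # code point -> UTF-8 character
      s <- gsub(code, ch, s, fixed = TRUE)
    }
    s
  }, character(1), USE.NAMES = FALSE)
}

decoded <- decode_u_escapes(c("Fruits", " Egg<U+0627><U+0646><U+0688><U+06D2> 95 per dozen "))
print(decoded)
```

Strings without any codes are returned unchanged, so the function can be applied to the whole price_list before or after the columns are extracted.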

Here's a functional approach. It's always good to learn to find a work around with functions.
Following are the steps:
1. Clean the price_list and keep the name, number and quantity.
2. Write functions that do that.
3. Apply functions on the new data frame.
library(stringr)
library(data.table)
# clean text
clean_list <- lapply(price_list, function(i) gsub("<[^>]+>", "",i))
clean_list <- lapply(clean_list, function(i) gsub('per','',i))
clean_list <- lapply(clean_list, str_trim)
# convert list to data frame
df <- data.table(do.call('rbind', clean_list))
colnames(df) <- 'text'
# helper functions
get_number <- function(j)
{
p1 <- unlist(strsplit(j, ' '))
p2 <- grepl('\\d+',p1)
if(sum(as.integer(p2)) ==1) return (grep('\\d+',p1,value = T))
else return (0)
}
get_quantity <- function(j)
{
p1 <- unlist(strsplit(j, ' '))
p2 <- grepl('kg|dozen',p1)
if(sum(as.integer(p2)) ==1) return (grep('kg|dozen',p1,value = T))
else return (NA)
}
# apply functions and get output
df[,Name := sapply(text, function(i) unlist(strsplit(i, ' '))[1])]
df[,Price := sapply(text, get_number)]
df[,Quantity := sapply(text, get_quantity)]
df[,Unicode := sapply(price_list, function(x) str_extract(string = x, pattern = '<[[:print:]]*>'))]
head(df)
text Name Price Quantity Unicode
1 Vegetables Vegetables 0 NA NA
2 Garlic Desi 140 kg Garlic Desi 140 kg <U+062A><U+06BE><U+0648><U+0645> <U+062F><U+06CC><U+0633><U+06CC>
3 Fresh-bean — kg Fresh-bean 0 kg <U+0641><U+0631><U+0627><U+0634><U+0628><U+06CC><U+0646>
4 Fruits Fruits 0 NA NA
5 Apple Kala Kolu Irani 168 kg Apple Kala Kolu Irani 168 kg <U+0633><U+06CC><U+0628> <U+06A9><U+0627><U+0644><U+0627> <U+06A9><U+0648><U+0644><…
6 Apple golden 115 kg Apple golden 115 kg NA
