Rowwise extract common substrings from two columns in a data frame - r

I want to match cities with regions in a data frame. The columns are a little bit messy, so I would like to extract the names of the cities / regions that appear in both columns, as in the following example.
A <- c("Berlin",
"Hamburg",
"Munich",
"Stuttgart",
"Rhein Main Frankfurt",
"Hannover")
B <- c("Berlin Brandenburg",
"Hamburg",
"Munich Bayern",
"Region Stuttgart",
"Main Rhein Darmstadt",
"Wiesbaden")
The resulting column / data frame should look like this:
result <- c("Berlin",
"Hamburg",
"Munich",
"Stuttgart",
"Rhein Main",
NA
)
df <- data.frame(A, B, result)
...while it doesn't matter whether it's "Rhein Main" or "Main Rhein".
Thank you for your help!

Maybe I am missing a smart regex trick but one option would be to split strings into words and find the common words using intersect.
df$Result <- mapply(function(x, y) paste0(intersect(x, y), collapse = " "),
                    strsplit(df$A, '\\s+'), strsplit(df$B, '\\s+'))
df
# A B Result
#1 Berlin Berlin Brandenburg Berlin
#2 Hamburg Hamburg Hamburg
#3 Munich Munich Bayern Munich
#4 Stuttgart Region Stuttgart Stuttgart
#5 Rhein Main Frankfurt Main Rhein Darmstadt Rhein Main
#6 Hannover Wiesbaden
This returns an empty string when there is no match; you can convert the empty string to NA afterwards if needed, as shown below.
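A minimal follow-up, assuming Result was created by the mapply call above:
df$Result[df$Result == ""] <- NA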

Related

Separate a string of multiple dates and names in R

I have a dataframe with 2 columns, where the first column lists companies and the second column contains strings of multiple dates and company names, as follows:
data=data.frame('Company'=(c("A","B","C")),
'Bank'=c("1/13/2020 Bank A 5/12/2020 Bank H C 11/9/2020 HelloBank",
"2/14/2020 HopeBank 1/9/2020 Liberty Bank SA",
"10/18/2020 Securities"))
I would like to separate column "Bank" into multiple columns of Dates and Bank Names, such that:
data=data.frame('Company'=(c("A","B","C")),
"Date1"=(c("1/13/2020","2/14/2020","10/18/2020")),
'Bank1'=c("Bank A", "HopeBank","Securities"),
"Date2"=(c("5/12/2020","1/9/2020",NA)),
'Bank2'=c("Bank H C", "Liberty Bank SA",NA),
"Date3"=(c("11/9/2020 ",NA,NA)),
'Bank3'=c("HelloBank", NA,NA))
I have tried using library(stringr) but the formats of the dates are not consistent. Also, I do not know how many variables I will need in the final dataframe, and some of the strings in the "Bank" column are very long (up to 824 nchar).
I have also tried using separate from tidyr but without success.
Here is a base R option using strsplit:
v <- strsplit(data$Bank, "\\s(?=(\\d+\\/))|(?<=\\d)\\s", perl = TRUE)
data <- cbind(
  data[1],
  `colnames<-`(
    do.call(rbind, lapply(v, `length<-`, max(lengths(v)))),
    paste0(c("Date", "Bank"), rep(1:(max(lengths(v)) / 2), each = 2))
  )
)
which gives
> data
Company Date1 Bank1 Date2 Bank2 Date3 Bank3
1 A 1/13/2020 Bank A 5/12/2020 Bank H C 11/9/2020 HelloBank
2 B 2/14/2020 HopeBank 1/9/2020 Liberty Bank SA <NA> <NA>
3 C 10/18/2020 Securities <NA> <NA> <NA> <NA>
If you don't know how many banks there might be in each row, you are better off creating a dataframe in long format. Something like this will do it, using the tidyverse...
library(tidyverse)
data_long <- data %>%
  mutate(Bank = str_replace_all(Bank, "( \\d+/)", "#\\1"),  # add markers between banks
         Bank = str_split(Bank, "#")) %>%                   # split at markers
  unnest(Bank) %>%                                          # convert to one row per entry
  mutate(Bank = str_squish(Bank)) %>%                       # trim white space
  separate(Bank, into = c("Date", "BankName"), sep = " ", extra = "merge")
data_long
Company Date BankName
<chr> <chr> <chr>
1 A 1/13/2020 Bank A
2 A 5/12/2020 Bank H C
3 A 11/9/2020 HelloBank
4 B 2/14/2020 HopeBank
5 B 1/9/2020 Liberty Bank SA
6 C 10/18/2020 Securities
You might then want to convert Date into date format.
If you really want it in wide format, use pivot_wider.
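A rough sketch of both steps, assuming data_long from above (with pivot_wider the columns come out as Date_1, BankName_1, ... rather than Date1, Bank1, ...):
data_long <- data_long %>%
  mutate(Date = as.Date(Date, format = "%m/%d/%Y"))  # dates are m/d/Y in the example

data_wide <- data_long %>%
  group_by(Company) %>%
  mutate(n = row_number()) %>%  # index of the bank within each company
  ungroup() %>%
  pivot_wider(names_from = n, values_from = c(Date, BankName))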

How do I extract part of the data from a column and paste it in another column using R?

I want to extract part of the data from a column and paste it in another column using R.
My data looks like this:
names <- c("Sia","Ryan","J","Ricky")
country <- c("London +1234567890","Paris", "Sydney +0123458796", "Delhi")
mobile <- c(NULL,+3579514862,NULL,+5554848123)
data <- data.frame(names,country,mobile)
data
> data
names country mobile
1 Sia London +1234567890 NULL
2 Ryan Paris +3579514862
3 J Sydney +0123458796 NULL
4 Ricky Delhi +5554848123
I would like to separate phone number from country column wherever applicable and put it into mobile.
The output should be,
> data
names country mobile
1 Sia London +1234567890
2 Ryan Paris +3579514862
3 J Sydney +0123458796
4 Ricky Delhi +5554848123
You can use the tidyverse, which includes tidyr's separate() function.
Note that I use NA instead of NULL inside the mobile vector.
Also, I use stringsAsFactors = FALSE when creating the data frame to avoid working with factors.
names <- c("Sia","Ryan","J","Ricky")
country <- c("London +1234567890","Paris", "Sydney +0123458796", "Delhi")
mobile <- c(NA, "+3579514862", NA, "+5554848123")
data <- data.frame(names,country,mobile, stringsAsFactors = F)
library(tidyverse)
data %>%
  as_tibble() %>%
  separate(country, c("country", "number"), sep = " ", fill = "right") %>%
  mutate(mobile = coalesce(mobile, number)) %>%
  select(-number)
# A tibble: 4 x 3
names country mobile
<chr> <chr> <chr>
1 Sia London +1234567890
2 Ryan Paris +3579514862
3 J Sydney +0123458796
4 Ricky Delhi +5554848123
EDIT
If you want to do this without the pipes (which I would not recommend because the code becomes much harder to read), you can do this:
select(mutate(separate(as_tibble(data), country, c("country", "number"), sep = " ", fill = "right"), mobile = coalesce(mobile, number)), -number)
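For comparison, a base R sketch of the same idea; it assumes the phone number, when present, is always a trailing "+digits" token at the end of country, as in the example data:
# pull a trailing "+digits" token out of country (NA when there is none)
num <- ifelse(grepl("\\+[0-9]+$", data$country),
              sub("^.*(\\+[0-9]+)$", "\\1", data$country), NA)
data$mobile  <- ifelse(is.na(data$mobile), num, data$mobile)  # fill gaps in mobile
data$country <- sub("\\s*\\+[0-9]+$", "", data$country)       # drop the number from country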

R observation strsplit - multiple values in columns

I have a dataframe in R concerning houses. This is a small sample:
Address Type Rent
Glasgow;Scotland House 1500
High Street;Edinburgh;Scotland Apartment 1000
Dundee;Scotland Apartment 800
South Street;Dundee;Scotland House 900
I would like to pull out just the last two elements of the Address column into City and County columns in my dataframe.
I have used mutate and strsplit to split this column:
data <- mutate(dataframe, split_add = strsplit(dataframe$Address, ";"))
I now have a new column in my dataframe which resembles the following:
split_add
c("Glasgow","Scotland")
c("High Street","Edinburgh","Scotland")
c("Dundee","Scotland")
c("South Street","Dundee","Scotland")
How do I extract the last 2 elements of each of these vector observations into columns "City" and "County"?
I attempted:
data <- mutate(data, city = split_add[-2])
thinking it would take the second element from the end of the vectors, but this did not work.
Using tidyr::separate() with the fill = "left" option is probably your best bet...
dataframe <- read.table(header = T, stringsAsFactors = F, text = "
Address Type Rent
Glasgow;Scotland House 1500
'High Street;Edinburgh;Scotland' Apartment 1000
Dundee;Scotland Apartment 800
'South Street;Dundee;Scotland' House 900
")
library(tidyr)
separate(dataframe, Address, into = c("Street", "City", "County"),
         sep = ";", fill = "left")
# Street City County Type Rent
# 1 <NA> Glasgow Scotland House 1500
# 2 High Street Edinburgh Scotland Apartment 1000
# 3 <NA> Dundee Scotland Apartment 800
# 4 South Street Dundee Scotland House 900
I am thinking of another way of dealing with this problem.
1. Create a dataframe with the split_add column data:
c("Glasgow","Scotland")
c("High Street","Edinburgh","Scotland")
c("Dundee","Scotland")
c("South Street","Dundee","Scotland")
test_data <- data.frame(split_add <- c("Glasgow, Scotland",
"High Street, Edinburgh, Scotland",
"Dundee, Scotland",
"South Street, Dundee, Scotland"),stringsAsFactors = F)
names(test_data) <- "address"
2. Use separate() from tidyr to split the column:
library(tidyr)
new_test <- test_data %>% separate(address,c("c1","c2","c3"), sep=",")
3. Use dplyr and ifelse() to keep only the last two columns:
library(dplyr)
new_test %>%
  mutate(city = ifelse(is.na(c3), c1, c2),
         county = ifelse(is.na(c3), c2, c3)) %>%
  select(city, county)
The final data then contains just the city and county columns.
Assuming that you're using dplyr:
data <- mutate(dataframe,
               split_add = strsplit(Address, ";"),
               City = sapply(split_add, function(x) tail(x, 2)[1]),
               County = sapply(split_add, function(x) tail(x, 1)))

approximate string matching on condition of a match in a separate field in R

I have two dataframes from which I would like to carry out approximate string matching.
> df1
Source Name Country
A Glen fiddich United Kingdom
A Talisker dark storm United Kingdom
B johnney walker United states
D veuve clicquot brut France
E nicolas feuillatte brut France
C glen morangie united kingdom
B Talisker 54 degrees United kingdom
F Talisker dark storm United states
The second data frame:
> df2
Source Name Country
A smirnoff ice Russia
A Talisker daek strome United Kingdom
B johnney walker United states
D veuve clicquot brut Australia
E nicolea feuilate brut Italy
C glen morangie united kingdom
B Talisker 54 degrees United kingdom
The key column for the approximate matching between the two data frames is "Name". Because of the relationship between the columns for the observations, it is important to select approximate matches that also have a match in the "Country" column. An extract of the code I am using is below:
dist.mat <- stringdistmatrix(tolower(df1$Name), tolower(df2$Name), method = "jw",
                             nthread = getOption("sd_num_thread"))
min.dist <- apply(dist.mat, 1, min)
matched <- data.frame(df1$Name,
                      as.character(apply(dist.mat, 1, function(x) df2$Name[which(x == min(x))])),
                      apply(dist.mat, 1, which.min), "jw", apply(dist.mat, 1, min))
colnames(matched) <- c("to_be_matched", "closest_match", "index_closest_match",
                       "distance_method", "distance")
The code above only executes approximate match between df1 and df2 based on data in the "Name" column. What I want to do is have the approximate match on "Name" column selected on the condition that for the two values, there is a match on the "Country" column.
You could check out Python's fuzzywuzzy library, which has excellent fuzzy text matching capabilities. I would iterate through the unique countries and look for matches that pass a certain fuzz threshold score, like the following:
from fuzzywuzzy import fuzz, process

matches = []
for country in df1['Country'].unique().tolist():
    dfm1 = df1[df1['Country'] == country]
    dfm2 = df2[df2['Country'] == country]
    candidates = dfm2['Name'].tolist()
    matches.append(dfm1['Name'].apply(lambda x: process.extractOne(x, candidates, score_cutoff=90)))
You can tweak the scorer input in order to get the matches the way you like it.
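If you want to stay in R, here is a rough per-country sketch with the stringdist package (not the asker's exact code; it assumes the key columns are called Name and Country, and compares Country case-insensitively because the example data mixes "United Kingdom" and "united kingdom"):
library(stringdist)

# match only within the same country, then keep the closest Name per row of df1
matched <- do.call(rbind, lapply(unique(tolower(df1$Country)), function(cty) {
  d1 <- df1[tolower(df1$Country) == cty, ]
  d2 <- df2[tolower(df2$Country) == cty, ]
  if (nrow(d1) == 0 || nrow(d2) == 0) return(NULL)
  dm <- stringdistmatrix(tolower(d1$Name), tolower(d2$Name), method = "jw")
  data.frame(to_be_matched = d1$Name,
             closest_match = d2$Name[apply(dm, 1, which.min)],
             distance      = apply(dm, 1, min))
}))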

How do I make this nested for loop work faster

My data is as shown below:
txt$txt:
my friend stays in adarsh nagar
I changed one apple one samsung S3 n one sony experia z.
Hi girls..Friends meet at bangalore
what do u think of ccd at bkc
I have an exhaustive list of city names. Listing few of them below:
city:
ahmedabad
adarsh nagar
airoli
bangalore
bangaladesh
banerghatta Road
bkc
calcutta
I am searching for city names (from the "city" list I have) in txt$txt and extracting them into another column if they are present. So the simple loop below works for me... but it's taking a lot of time on the bigger dataset.
for(i in 1:nrow(txt)){
  a <- c()
  for(j in 1:nrow(city)){
    a[j] <- grepl(paste("\\b", city[j, 1], "\\b", sep = ""), txt$txt[i])
  }
  txt$city[i] <- ifelse(sum(a) > 0, paste(city[which(a), 1], collapse = "_"), "NONE")
}
I tried to use an apply function, and this is the furthest I could get:
apply(as.matrix(txt$txt), 1, function(x){ifelse(sum(unlist(strsplit(x, " ")) %in% city[,1]) > 0, paste(unlist(strsplit(x, " "))[which(unlist(strsplit(x, " ")) %in% city[,1])], collapse = "_"), "NONE")})
[1] "NONE" "NONE" "bangalore" "bkc"
Desired Output:
> txt
txt city
1 my friend stays in adarsh nagar adarsh nagar
2 I changed one apple one samsung S3 n one sony experia z. NONE
3 Hi girls..Friends meet at bangalore bangalore
4 what do u think of ccd at bkc bkc
I want a faster process in R that does the same thing as the for loop above. Please advise. Thanks!
Here's a possibility using stri_extract_first_regex from stringi package:
library(stringi)
# prepare some data
df <- data.frame(txt = c("in adarsh nagar", "sony experia z", "at bangalore"))
city <- c("ahmedabad", "adarsh nagar", "airoli", "bangalore")
df$city <- stri_extract_first_regex(str = df$txt, regex = paste(city, collapse = "|"))
df
# txt city
# 1 in adarsh nagar adarsh nagar
# 2 sony experia z <NA>
# 3 at bangalore bangalore
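Note that stri_extract_first_regex only returns the first city per row; if a row can mention several cities (as in the desired output above), a possible variant is stri_extract_all_regex plus a paste step, with word boundaries added to the pattern (a sketch, assuming the df and city objects from the snippet above):
hits <- stri_extract_all_regex(df$txt, paste0("\\b(", paste(city, collapse = "|"), ")\\b"))
df$city <- sapply(hits, function(x)
  if (all(is.na(x))) "NONE" else paste(unique(x), collapse = "_"))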
This should be much faster:
bigPattern <- paste('(\\b',city[,1],'\\b)',collapse='|',sep='')
txt$city <- sapply(regmatches(txt$txt,gregexpr(bigPattern,txt$txt)),FUN=function(x) ifelse(length(x) == 0,'NONE',paste(unique(x),collapse='_')))
Explanation:
In the first line we build one big regular expression matching all the cities, e.g.:
(\\bahmedabad\\b)|(\\badarsh nagar\\b)|(\\bairoli\\b)| ...
Then we use gregexpr in combination with regmatches; this gives a list of the matches for each element of txt$txt.
Finally, with a simple sapply, for each element of the list we concatenate the matched cities (after removing duplicates, i.e. cities mentioned more than once).
Try this:
# YOUR DATA
##########
txt <- readLines(n = 4)
my friend stays in adarsh nagar and airoli
I changed one apple one samsung S3 n one sony experia z.
Hi girls..Friends meet at bangalore
what do u think of ccd at bkc
city <- readLines(n = 8)
ahmedabad
adarsh nagar
airoli
bangalore
bangaladesh
banerghatta Road
bkc
calcutta
# MATCHING
##########
matches <- unlist(setNames(lapply(city, grep, x = txt, fixed = TRUE), city))
(res <- sapply(seq_along(txt), function(x)
  paste0(names(matches)[matches == x], collapse = "___")))
# [1] "adarsh nagar___airoli" ""
# [3] "bangalore" "bkc"
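If you want the same NONE placeholder as in the desired output above, one extra line will do it (assuming res from the snippet above):
res[res == ""] <- "NONE"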
