Merge rows and values in R based on condition - r

I have the following sample data:
# data
school = c('ABC University','ABC Uni','DFG University','DFG U')
applicant = c(2000,3100,210,2000)
students = c(100,2000,300,4000)
df = data.frame(school,applicant,students)
I want to merge to this:
| school         | applicant | students |
|----------------|-----------|----------|
| ABC University | 5100      | 2100     |
| DFG University | 2210      | 4300     |
I ran this code:
df$school[df$school == 'ABC Uni'] = 'ABC University'
But it gives me ABC University twice instead of merging them together.

This depends on what the rest of your strings look like, but you could use grep with ^, which anchors the pattern to the start of the string.
df[grep('^ABC U', df$school), 'school'] <- 'ABC University'
df[grep('^DFG U', df$school), 'school'] <- 'DFG University'
Then aggregate as usual:
aggregate(cbind(applicant, students) ~ school, df, sum)
# school applicant students
# 1 ABC University 5100 2100
# 2 DFG University 2210 4300
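If there are more than a couple of spelling variants, a named lookup vector may scale better than one grep line per school. The following is a minimal base R sketch of that idea; the `canonical` vector is a hypothetical, hand-maintained mapping that is not part of the question.

```r
# Sample data from the question
df <- data.frame(
  school    = c("ABC University", "ABC Uni", "DFG University", "DFG U"),
  applicant = c(2000, 3100, 210, 2000),
  students  = c(100, 2000, 300, 4000),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: variant spelling -> canonical name
canonical <- c("ABC Uni" = "ABC University", "DFG U" = "DFG University")

# Rewrite only the rows whose name appears in the lookup
hit <- df$school %in% names(canonical)
df$school[hit] <- unname(canonical[df$school[hit]])

# Sum the numeric columns per canonical school name
merged <- aggregate(cbind(applicant, students) ~ school, data = df, FUN = sum)
merged
```

Names that are not in the lookup are left untouched, so new variants show up as separate rows in the result instead of being merged silently.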

Here is a dplyr/stringr solution:
library(dplyr)
library(stringr)
df %>%
mutate(school = str_replace_all(school, c(
"^ABC Uni$" = "ABC University",
"^DFG U$" = "DFG University"))) %>%
group_by(school) %>%
summarise(across(c(applicant, students), sum))
output:
school applicant students
<chr> <dbl> <dbl>
1 ABC University 5100 2100
2 DFG University 2210 4300

Related

How to fuzzy lookup a string in one column to another column ignoring sub-setted words

I have the following 2 dataframes vendor_list and firm_list:
MARKET_ID <- c(1,2,3,4,5)
MARKET_NAME <- c("DELHI","MUMBAI","BANGALORE","KOLKATA","CHENNAI")
vendor_list <- data.frame(MARKET_ID,MARKET_NAME)
MARKET_NAME <- c("DELHI MUNICIPAL CORP","DELHI","MUMBAI","BENGALURU","BANGALORES","CITYKOLKATA")
POPULATION <- c(1000,2000,3000,4000,5000,6000)
firm_list <- data.frame(MARKET_NAME,POPULATION)
I need to look up each string in the MARKET_NAME column of vendor_list within the MARKET_NAME column of firm_list, subject to one condition:
It should only count as a match if the string appears as a standalone word, i.e. it must not be part of a longer word.
So,
The match of DELHI to DELHI MUNICIPAL CORP is TRUE
The match of DELHI to DELHI is TRUE
The match of BANGALORE to BANGALORES is FALSE as BANGALORE is a sub-set of BANGALORES
The match of KOLKATA to CITYKOLKATA is FALSE as KOLKATA is a sub-set of CITYKOLKATA
Thus, the final dataframe final_market_info after lookup should look like this:
| MARKET_ID| MARKET_NAME.x | MARKET_NAME.y | POPULATION |
| 1 | DELHI | DELHI MUNICIPAL CORP| 1000 |
| 1 | DELHI | DELHI | 2000 |
| 2 | MUMBAI | MUMBAI | 3000 |
I tried stringdist_join from the fuzzyjoin package with the lcs and jw methods, but it did not give the result shown above.
Is this what you need?
firm_list %>%
mutate(match = str_extract(MARKET_NAME, str_c("\\b", vendor_list$MARKET_NAME, collapse = "|", "\\b"))) %>%
left_join(., vendor_list %>% rename(match = MARKET_NAME), by = "match")
MARKET_NAME POPULATION match MARKET_ID
1 DELHI MUNICIPAL CORP 1000 DELHI 1
2 DELHI 2000 DELHI 1
3 MUMBAI 3000 MUMBAI 2
4 BENGALURU 4000 <NA> NA
5 BANGALORES 5000 <NA> NA
6 CITYKOLKATA 6000 <NA> NA
The point here is that each element of vendor_list$MARKET_NAME is wrapped in word-boundary markers (\\b) so that only whole-word matches count, and the wrapped elements are then concatenated into a single alternation pattern with |.
To remove the rows without matches, use inner_join instead of left_join:
firm_list %>%
mutate(match = str_extract(MARKET_NAME, str_c("\\b", vendor_list$MARKET_NAME, collapse = "|", "\\b"))) %>%
inner_join(., vendor_list %>% rename(match = MARKET_NAME), by = "match")
MARKET_NAME POPULATION match MARKET_ID
1 DELHI MUNICIPAL CORP 1000 DELHI 1
2 DELHI 2000 DELHI 1
3 MUMBAI 3000 MUMBAI 2
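To see the word-boundary behaviour in isolation, here is a small base R sketch that builds the same kind of alternation pattern and tests it against the firm names from the question. It uses grepl with perl = TRUE rather than stringr, which is a substitution made here to keep the example dependency-free.

```r
vendor <- c("DELHI", "MUMBAI", "BANGALORE", "KOLKATA", "CHENNAI")
firm   <- c("DELHI MUNICIPAL CORP", "DELHI", "MUMBAI",
            "BENGALURU", "BANGALORES", "CITYKOLKATA")

# One alternation pattern with \\b on both sides of every vendor name
pattern <- paste0("\\b", vendor, "\\b", collapse = "|")

# TRUE only where a vendor name occurs as a whole word
is_match <- grepl(pattern, firm, perl = TRUE)
data.frame(firm, is_match)
```

BANGALORES and CITYKOLKATA fail because the character next to the vendor name is a letter, so there is no word boundary at that position.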

Joining Dataframes in R, Matching Patterns in Strings

Two big real life tables to join up, but here's a little reprex:
I've got a table of small strings and I want to left join on a second table, with the join being based on whether or not these small strings can be found inside the bigger strings on the second table.
df_1 <- data.frame(index = 1:5,
keyword = c("john", "ella", "mil", "nin", "billi"))
df_2 <- data.frame(index_2 = 1001:1008,
name = c("John Coltrane", "Ella Fitzgerald", "Miles Davis", "Billie Holliday",
"Nina Simone", "Bob Smith", "John Brown", "Tony Montana"))
df_results_i_want <- data.frame(index = c(1, 1:5),
keyword = c("john", "john", "ella", "mil", "nin", "billi"),
index_2 = c(1001, 1007, 1002, 1003, 1005, 1004),
name = c("John Coltrane", "John Brown", "Ella Fitzgerald",
"Miles Davis", "Nina Simone", "Billie Holliday"))
Seems like a str_detect() call and a left_join() call might be part of the solution - ie I'm hoping for something like:
library(tidyverse)
df_results <- df_1 |> left_join(df_2, join_by(blah blah str_detect() blah blah))
I'm using dplyr 1.1 so I can use join_by(), but I'm not sure of the correct way to get what I need - can anyone help please?
I suppose I could do a simple cross join using tidyr::crossing() and then do the str_detect() stuff afterwards (and filter out things that don't match)
df_results <- df_1 |>
crossing(df_2) |>
mutate(match = str_detect(name, fixed(keyword, ignore_case = TRUE))) |>
filter(match) |>
select(-match)
but in my real life example, the cross join would produce an absolutely enormous table that would overwhelm my PC.
Thank you.
You can try fuzzyjoin::regex_join():
library(fuzzyjoin)
regex_join(df_2, df_1, by = c("name" = "keyword"), ignore_case = TRUE)
Output:
index_2 name index keyword
1 1001 John Coltrane 1 john
2 1002 Ella Fitzgerald 2 ella
3 1003 Miles Davis 3 mil
4 1004 Billie Holliday 5 billi
5 1005 Nina Simone 4 nin
6 1007 John Brown 1 john
join_by() supports inequality joins but not pattern-based (inexact) matching, so you can use fuzzyjoin instead:
library(dplyr)
library(stringr)
library(fuzzyjoin)
df_2 %>%
mutate(name = tolower(name)) %>%
fuzzy_left_join(df_1, ., by = c(keyword = "name"),
match_fun = \(x, y) str_detect(y, x))
index keyword index_2 name
1 1 john 1001 john coltrane
2 1 john 1007 john brown
3 2 ella 1002 ella fitzgerald
4 3 mil 1003 miles davis
5 4 nin 1005 nina simone
6 5 billi 1004 billie holliday
We can use SQL to do that.
library(sqldf)
sqldf("select * from [df_1] A
left join [df_2] B on B.name like '%' || A.keyword || '%'")
giving:
index keyword index_2 name
1 1 john 1001 John Coltrane
2 1 john 1007 John Brown
3 2 ella 1002 Ella Fitzgerald
4 3 mil 1003 Miles Davis
5 4 nin 1005 Nina Simone
6 5 billi 1004 Billie Holliday
It can be placed in a pipeline like this:
library(magrittr)
library(sqldf)
df_1 %>%
{ sqldf("select * from [.] A
left join [df_2] B on B.name like '%' || A.keyword || '%'")
}
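If the concern is mainly the memory cost of materialising the full cross join, one option is to loop over the keywords and keep only the matching rows for each. The sketch below is base R only and is not taken from any of the answers above. It still performs one substring search per keyword over all names, so it saves memory rather than comparisons.

```r
df_1 <- data.frame(index = 1:5,
                   keyword = c("john", "ella", "mil", "nin", "billi"),
                   stringsAsFactors = FALSE)
df_2 <- data.frame(index_2 = 1001:1008,
                   name = c("John Coltrane", "Ella Fitzgerald", "Miles Davis",
                            "Billie Holliday", "Nina Simone", "Bob Smith",
                            "John Brown", "Tony Montana"),
                   stringsAsFactors = FALSE)

# Lower-case the names once so the fixed-string search ignores case
name_lower <- tolower(df_2$name)

pieces <- lapply(seq_len(nrow(df_1)), function(i) {
  rows <- which(grepl(df_1$keyword[i], name_lower, fixed = TRUE))
  # Left-join behaviour: keep an unmatched keyword with NA on the right side
  if (length(rows) == 0) rows <- NA_integer_
  data.frame(index   = df_1$index[i],
             keyword = df_1$keyword[i],
             index_2 = df_2$index_2[rows],
             name    = df_2$name[rows],
             stringsAsFactors = FALSE)
})
df_results <- do.call(rbind, pieces)
df_results
```

Only the matched pairs are ever held in memory, at the cost of an explicit loop over the keywords.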

Count words in string, grouped by year

I'm trying to find popular words in a string using R, which is probably easiest to explain with an example.
Taking this as the input (with millions of entries, where each date can appear thousands of times)
IncorporationDate CompanyName
3007931 2003-05-12 OUTLANE BUSINESS CONSULTANTS LIMITED
692999 2013-03-28 AGB SERVICES ANGLIA LIMITED
2255234 2008-05-22 CIDA INTERNATIONAL LIMITED
310577 2017-09-19 FA IT SERVICES LIMITED
2020738 2012-09-03 THE SPARES SHOP LIMITED
2776144 2006-02-03 ANGELVIEW PROPERTIES LIMITED
2420435 2017-10-17 SHANE WARD TM LIMITED
2523165 2014-06-04 THE INDEPENDENT GIN COMPANY LTD
2594847 2015-05-05 AIA ENGINEERING LTD
2701395 2015-05-27 LAURA BRIDGES LIMITED
I want to find the top 10 most popular words used in each year, with the result looking something like this:
| Year | Top1 | Top1_Count | Top2 | Top2_Count | ...
| ---- | ------- | ---------- | ---- | ---------- |
| 2017 | LIMITED | 2 | IT | 1 |
| ...
The closest I've got so far is:
words <- data.frame(table(unlist(strsplit(tolower(df$SText), " "))))
but that loses the year data, only giving a full total across the entire data frame.
I've also played around with summarize from dplyr, but haven't found a way to get it to do what I want.
Edit: using the answer from @maurits-evers I've got a bit further, and found the top 10 using this:
top_words_by_year <- words_by_year %>% group_by(year) %>% top_n(n = 10, wt = n)
I'm just trying to figure out how to get it into the shape I need.
Thanks
You could do something like this:
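library(tidyverse)
df %>%
  # extract the four-digit year from the date string
  mutate(year = format(as.Date(IncorporationDate, format = "%Y-%m-%d"), "%Y")) %>%
  group_by(year) %>%
  # split each company name into a list of words, then one row per word
  mutate(words = strsplit(as.character(CompanyName), " ")) %>%
  unnest(words) %>%
  count(year, words)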
# year words n
#<chr> <chr> <int>
#1 2003 BUSINESS 1
#2 2003 CONSULTANTS 1
#3 2003 LIMITED 1
#4 2003 OUTLANE 1
#5 2006 ANGELVIEW 1
#6 2006 LIMITED 1
#7 2006 PROPERTIES 1
#8 2008 CIDA 1
#9 2008 INTERNATIONAL 1
#10 2008 LIMITED 1
## ... with 26 more rows
Explanation: Extract year from IncorporationDate, group by year, split CompanyName into words, unnest, and count the number of words per year.
Sample data
df <- read.table(text =
"IncorporationDate CompanyName
3007931 2003-05-12 'OUTLANE BUSINESS CONSULTANTS LIMITED'
692999 2013-03-28 'AGB SERVICES ANGLIA LIMITED'
2255234 2008-05-22 'CIDA INTERNATIONAL LIMITED'
310577 2017-09-19 'FA IT SERVICES LIMITED'
2020738 2012-09-03 'THE SPARES SHOP LIMITED'
2776144 2006-02-03 'ANGELVIEW PROPERTIES LIMITED'
2420435 2017-10-17 'SHANE WARD TM LIMITED'
2523165 2014-06-04 'THE INDEPENDENT GIN COMPANY LTD'
2594847 2015-05-05 'AIA ENGINEERING LTD'
2701395 2015-05-27 'LAURA BRIDGES LIMITED'", header = TRUE)
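For the wide shape asked for in the question (Year, Top1, Top1_Count, Top2, ...), the following base R sketch may help. It is not from the answer above: it uses a four-row subset of the sample data, and `top_n_words` is a hypothetical parameter set to 2 here (the question wants 10). Ties are broken in whatever order `sort()` returns them.

```r
df <- data.frame(
  IncorporationDate = c("2017-09-19", "2017-10-17", "2015-05-05", "2015-05-27"),
  CompanyName = c("FA IT SERVICES LIMITED", "SHANE WARD TM LIMITED",
                  "AIA ENGINEERING LTD", "LAURA BRIDGES LIMITED"),
  stringsAsFactors = FALSE
)
top_n_words <- 2  # hypothetical; use 10 for the question's layout

# Group the company names by the year part of the date string
year <- substr(df$IncorporationDate, 1, 4)
names_by_year <- split(df$CompanyName, year)

rows <- lapply(names(names_by_year), function(y) {
  words  <- unlist(strsplit(names_by_year[[y]], " ", fixed = TRUE))
  counts <- sort(table(words), decreasing = TRUE)
  # Indexing past the end yields NA, which pads years with few words
  top_words  <- names(counts)[seq_len(top_n_words)]
  top_counts <- as.integer(counts)[seq_len(top_n_words)]
  out <- data.frame(Year = y, stringsAsFactors = FALSE)
  for (k in seq_len(top_n_words)) {
    out[[paste0("Top", k)]] <- top_words[k]
    out[[paste0("Top", k, "_Count")]] <- top_counts[k]
  }
  out
})
wide <- do.call(rbind, rows)
wide
```

On the real data set with millions of rows, the per-year `table()` call is likely to be the expensive step, so this is a shape illustration rather than a tuned solution.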

"Select query equivalent" in R to select from one data frame and paste values to the second frame accordingly

I have two data frames:
temp <- data.frame(
team1 = c("Chennai Super Kings","Deccan Chargers","Delhi Daredevils"),
team2 = c("Mumbai Indians","Royal Challengers Bangalore","Gujarat Lions")
)
teamdata <- data.frame(
teamname=c("Chennai Super Kings","Deccan Chargers","Delhi Daredevils",
"Mumbai Indians","Royal Challengers Bangalore","Gujarat Lions"),
matchesplayed = c("100","200","300","400","500","600"),
matcheswon = c("50","100","150","200","250","300")
)
In the temp data frame I want to add variables such as team1matchesplayed and team2matchesplayed or team1matcheswon and team2matcheswon according to the name of the team in variables team1 and team2 of the temp dataframe. The values should be populated from teamdata data frame. New columns should be generated in the temp data frame.
P.S: This is my first question on here and may not be the best representation. Apologies: Sorry for attaching images earlier. Thank you for pointing it out.
Simply merge twice on both team1 and team2 respectively:
# NESTED MERGE
mdf <- merge(merge(temp, teamdata, by.x=c("team1"), by.y=c("teamname"), all.x=TRUE),
teamdata, by.x=c("team2"), by.y=c("teamname"), all.x=TRUE)
# RENAME COLUMNS
mdf <- setNames(mdf, c("team2", "team1", "team1_matchesplayed", "team1_matcheswon",
"team2_matchesplayed", "team2_matcheswon"))
# REORDER COLUMNS
mdf <- mdf[c("team1", "team2", "team1_matchesplayed", "team2_matchesplayed",
"team1_matcheswon", "team2_matcheswon")]
mdf
# team1 team2 team1_matchesplayed team2_matchesplayed team1_matcheswon team2_matcheswon
# 1 Delhi Daredevils Gujarat Lions 300 600 150 300
# 2 Chennai Super Kings Mumbai Indians 100 400 50 200
# 3 Deccan Chargers Royal Challengers Bangalore 200 500 100 250
> library(sqldf)
Loading required package: gsubfn
Loading required package: proto
Loading required package: RSQLite
> temp2=sqldf("select temp.*,matchesplayed as team1matchesplayed,matcheswon as team1matcheswon from temp,teamdata where temp.team1=teamdata.teamname")
> temp2
team1 team2 team1matchesplayed
1 Chennai Super Kings Mumbai Indians 100
2 Deccan Chargers Royal Challengers Bangalore 200
3 Delhi Daredevils Gujarat Lions 300
team1matcheswon
1 50
2 100
3 150
> temp3=sqldf("select temp2.*,matchesplayed as team2matchesplayed,matcheswon as team2matcheswon from temp2,teamdata where temp2.team2=teamdata.teamname")
> temp3
team1 team2 team1matchesplayed
1 Chennai Super Kings Mumbai Indians 100
2 Deccan Chargers Royal Challengers Bangalore 200
3 Delhi Daredevils Gujarat Lions 300
team1matcheswon team2matchesplayed team2matcheswon
1 50 400 200
2 100 500 250
3 150 600 300
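A third option that avoids both the double merge and SQL is a direct lookup with match(). This is a minimal base R sketch using the data from the question; the column names follow the team1matchesplayed pattern the question asks for, and the values stay character strings because that is how teamdata defines them.

```r
temp <- data.frame(
  team1 = c("Chennai Super Kings", "Deccan Chargers", "Delhi Daredevils"),
  team2 = c("Mumbai Indians", "Royal Challengers Bangalore", "Gujarat Lions"),
  stringsAsFactors = FALSE
)
teamdata <- data.frame(
  teamname = c("Chennai Super Kings", "Deccan Chargers", "Delhi Daredevils",
               "Mumbai Indians", "Royal Challengers Bangalore", "Gujarat Lions"),
  matchesplayed = c("100", "200", "300", "400", "500", "600"),
  matcheswon    = c("50", "100", "150", "200", "250", "300"),
  stringsAsFactors = FALSE
)

for (side in c("team1", "team2")) {
  # Row of teamdata that corresponds to each team name in this column
  idx <- match(temp[[side]], teamdata$teamname)
  temp[[paste0(side, "matchesplayed")]] <- teamdata$matchesplayed[idx]
  temp[[paste0(side, "matcheswon")]]    <- teamdata$matcheswon[idx]
}
temp
```

Unlike merge(), this keeps the original row order of temp, and a team name missing from teamdata simply yields NA in the new columns.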

Concatenate rows in R depending on specific row value range

I have two data frames:
df
set.seed(10)
df <- data.frame(Name = c("Bob","John","Jane","John","Bob","Jane","Jane"),
Date=as.Date(c("2014-06-04", "2013-12-04", "2013-11-04" , "2013-12-06" ,
"2014-01-09", "2014-03-21", "2014-09-24")), Degrees= rnorm(7, mean=32, sd=32))
Name | Date | Degrees
Bob | 2014-06-04 | 50.599877
John | 2013-12-04 | 44.103919
Jane | 2013-11-04 | 6.117422
John | 2013-12-06 | 30.826633
Bob | 2014-01-09 | 59.425444
Jane | 2014-03-21 | 62.473418
Jane | 2014-09-24 | 11.341562
df2
df2 <- data.frame(Name = c("Bob","John","Jane"),
Date=as.Date(c("2014-03-01", "2014-01-20", "2014-06-07")),
Weather = c("Good weather","Bad weather", "Good weather"))
Name | Date | Weather
Bob | 2014-03-01 | Good weather
John | 2014-01-20 | Bad weather
Jane | 2014-06-07 | Good weather
I would like to extract the following:
Name | Date | Weather | Degrees (until this Date) | Other measures
Bob | 2014-03-01 | Good weather | 59.425444 | 50.599877
John | 2014-01-20 | Bad weather | 44.103919, 30.826633 |
Jane | 2014-06-07 | Good weather | 6.117422, 62.473418 | 11.341562
Which is a merge between both df and df2, with:
"Degrees (until this Date)" concatenates from df$Degrees up until the date of df2$Date;
the value of "Other measures" is whatever measures are on df$Degrees after the date of df2$Date.
Another alternative:
#a grouping variable to use for identical splitting
nms = unique(c(as.character(df$Name), as.character(df2$Name)))
#split data
dates = split(df$Date, factor(df$Name, nms))
degrees = split(df$Degrees, factor(df$Name, nms))
thresholds = split(df2$Date, factor(df2$Name, nms))
#mapply the condition
res = do.call(rbind.data.frame,
Map(function(date, thres, deg)
tapply(deg, factor(date <= thres, c(TRUE, FALSE)),
paste0, collapse = ", "),
dates, thresholds, degrees))
#bind with df2
cbind(df2, setNames(res[match(df2$Name, row.names(res)), ], c("Degrees", "Other")))
# Name Date Weather Degrees Other
#Bob Bob 2014-03-01 Good weather 41.4254440501603 32.5998774701384
#John John 2014-01-20 Bad weather 26.10391865379, 12.826633094921 <NA>
#Jane Jane 2014-06-07 Good weather -11.8825775975204, 44.4734176224054 -6.65843761374357
Here's one approach:
library(dplyr)
library(tidyr)
library(magrittr)
res <-
left_join(df, df2 %>% select(Name, Date, Weather), by = "Name") %>%
mutate(paste = factor(Date.x <= Date.y, labels = c("before", "other"))) %>%
group_by(Name, paste) %>%
mutate(Degrees = paste(Degrees, collapse = ", ")) %>%
distinct() %>%
spread(paste, Degrees) %>%
group_by(Name, Date.y, Weather) %>%
summarise(other = other[1], before = before[2]) %>%
set_names(c("Name", "Date" , "Weather", "Degrees (until this Date)" , "Other measures"))
res[is.na(res)] <- ""
res
# Name Date Weather Degrees (until this Date) Other measures
# 1 Bob 2014-03-01 Good weather 41.4254440501603 32.5998774701384
# 2 Jane 2014-06-07 Good weather -11.8825775975204, 44.4734176224054 -6.65843761374357
# 3 John 2014-01-20 Bad weather 26.10391865379, 12.826633094921
There may be room for improvements, but anyway.
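For comparison, the same before/after split can be written as a plain loop over the rows of df2. This is a base R sketch under two assumptions: the Degrees values are copied from the table in the question instead of being drawn with rnorm, and the output column names `Until` and `Other` are shortened stand-ins for the headers in the question.

```r
df <- data.frame(
  Name = c("Bob", "John", "Jane", "John", "Bob", "Jane", "Jane"),
  Date = as.Date(c("2014-06-04", "2013-12-04", "2013-11-04", "2013-12-06",
                   "2014-01-09", "2014-03-21", "2014-09-24")),
  # Values copied from the question's table (assumption: not re-simulated)
  Degrees = c(50.599877, 44.103919, 6.117422, 30.826633,
              59.425444, 62.473418, 11.341562),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Name = c("Bob", "John", "Jane"),
  Date = as.Date(c("2014-03-01", "2014-01-20", "2014-06-07")),
  Weather = c("Good weather", "Bad weather", "Good weather"),
  stringsAsFactors = FALSE
)

until <- character(nrow(df2))
other <- character(nrow(df2))
for (i in seq_len(nrow(df2))) {
  same_name <- df$Name == df2$Name[i]
  before    <- df$Date <= df2$Date[i]
  # paste() with collapse returns "" when nothing is selected
  until[i] <- paste(df$Degrees[same_name & before],  collapse = ", ")
  other[i] <- paste(df$Degrees[same_name & !before], collapse = ", ")
}
result <- cbind(df2, Until = until, Other = other, stringsAsFactors = FALSE)
result
```

The loop keeps the row order of df2 and leaves an empty string where a person has no measurements after the cut-off date, as for John.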

Resources