I have a table (table 1) with a bunch of cities (punctuation, capitalization and spaces have been removed).
I want to scan through the second table (table 2) and pull out the first record that exactly matches the city or contains it anywhere within the string.
# Table 1
city1
1 waterloo
2 kitchener
3 toronto
4 guelph
5 ottawa
# Table 2
city2
1 waterlookitchener
2 toronto
3 hamilton
4 cityofottawa
This would give the third table, shown below.
# Table 3
city1 city2
1 waterloo waterlookitchener
2 kitchener waterlookitchener
3 toronto toronto
4 guelph <N/A>
5 ottawa cityofottawa
I believe there are more sophisticated ways of completing your task, but here is a simple approach using the tidyverse.
library(tidyverse)  # read_table2() comes from readr, %>% from magrittr

df <- read_table2("city1
waterloo
kitchener
toronto
guelph
ottawa")
df2 <- read_table2("city2
waterlookitchener
toronto
hamilton
cityofottawa")
df3 <- df$city1 %>%
  lapply(grep, df2$city2, value = TRUE) %>%
  lapply(function(x) if (identical(x, character(0))) NA_character_ else x) %>%
  unlist()

df3 <- cbind(df, df3)
Step by step:

1. Search for every element of df$city1 in df2$city2 (partial or complete match) and return the matching element of df2$city2. See ?grep for more information.
2. Replace the character(0) results (element not found) with NA, as illustrated in the snippet after this list. See How to convert character(0) to NA in a list with R language? for details.
3. Convert the list into a vector (unlist).
4. Attach the result to the list of cities (cbind).
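To see why step 2 is needed: grep() returns a zero-length character vector, not NA, when a pattern has no match, and unlist() would silently drop those entries, misaligning the cbind():

grep("guelph", df2$city2, value = TRUE)
#> character(0)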
You can also try using fuzzyjoin. In this case, you can use the function stri_detect_fixed from the stringi package to detect at least one occurrence of a fixed pattern in a string.
library(fuzzyjoin)
library(stringi)
library(dplyr)
fuzzy_right_join(table2, table1, by = c("city2" = "city1"),
                 match_fun = stri_detect_fixed) %>%
  select(city1, city2)
Output
city1 city2
1 waterloo waterlookitchener
2 kitchener waterlookitchener
3 toronto toronto
4 guelph <NA>
5 ottawa cityofottawa
Data
table1 <- structure(list(city1 = c("waterloo", "kitchener", "toronto",
"guelph", "ottawa")), class = "data.frame", row.names = c(NA,
-5L))
table2 <- structure(list(city2 = c("waterlookitchener", "toronto", "hamilton",
"cityofottawa")), class = "data.frame", row.names = c(NA, -4L
))
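A note on orientation, since it is easy to get backwards: fuzzyjoin calls match_fun with the columns in the order given to by, so here it is stri_detect_fixed(city2, city1), i.e. it looks for city1 inside city2. A quick check with the data above:

stri_detect_fixed("waterlookitchener", "waterloo")
#> [1] TRUE
stri_detect_fixed("hamilton", "guelph")
#> [1] FALSE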
# Related
I have two big real-life tables to join up, but here's a little reprex:
I've got a table of small strings, and I want to left join a second table onto it, with the join based on whether these small strings can be found inside the bigger strings of the second table.
df_1 <- data.frame(index = 1:5,
                   keyword = c("john", "ella", "mil", "nin", "billi"))

df_2 <- data.frame(index_2 = 1001:1008,
                   name = c("John Coltrane", "Ella Fitzgerald", "Miles Davis",
                            "Billie Holliday", "Nina Simone", "Bob Smith",
                            "John Brown", "Tony Montana"))

df_results_i_want <- data.frame(index = c(1, 1:5),
                                keyword = c("john", "john", "ella", "mil", "nin", "billi"),
                                index_2 = c(1001, 1007, 1002, 1003, 1005, 1004),
                                name = c("John Coltrane", "John Brown", "Ella Fitzgerald",
                                         "Miles Davis", "Nina Simone", "Billie Holliday"))
Seems like a str_detect() call and a left_join() call might be part of the solution, i.e. I'm hoping for something like:
library(tidyverse)
df_results <- df_1 |> left_join(df_2, join_by(blah blah str_detect() blah blah))
I'm using dplyr 1.1 so I can use join_by(), but I'm not sure of the correct way to get what I need - can anyone help please?
I suppose I could do a simple cross join using tidyr::crossing() and then do the str_detect() stuff afterwards (and filter out things that don't match)
df_results <- df_1 |>
  crossing(df_2) |>
  mutate(match = str_detect(name, fixed(keyword, ignore_case = TRUE))) |>
  filter(match) |>
  select(-match)
but in my real life example, the cross join would produce an absolutely enormous table that would overwhelm my PC.
Thank you.
You can try fuzzyjoin::regex_join():
library(fuzzyjoin)
regex_join(df_2, df_1, by = c("name" = "keyword"), ignore_case = TRUE)
Output:
index.x name index.y keyword
1 1001 John Coltrane 1 john
2 1002 Ella Fitzgerald 2 ella
3 1003 Miles Davis 3 mil
4 1004 Billie Holliday 5 billi
5 1005 Nina Simone 4 nin
6 1007 John Brown 1 john
join_by() does not support this kind of inexact join (only inequality conditions), but you can use fuzzyjoin:
library(dplyr)
library(stringr)
library(fuzzyjoin)

df_2 %>%
  mutate(name = tolower(name)) %>%
  fuzzy_left_join(df_1, ., by = c(keyword = "name"),
                  match_fun = \(x, y) str_detect(y, x))
index keyword index_2 name
1 1 john 1001 john coltrane
2 1 john 1007 john brown
3 2 ella 1002 ella fitzgerald
4 3 mil 1003 miles davis
5 4 nin 1005 nina simone
6 5 billi 1004 billie holliday
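One caveat (my addition): with this match_fun, str_detect() treats each keyword as a regular expression. If keywords may contain metacharacters, a literal match via stringr::fixed() is safer:

match_fun = \(x, y) str_detect(y, fixed(x))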
We can use SQL to do that.
library(sqldf)
sqldf("select * from [df_1] A
left join [df_2] B on B.name like '%' || A.keyword || '%'")
giving:
index keyword index_2 name
1 1 john 1001 John Coltrane
2 1 john 1007 John Brown
3 2 ella 1002 Ella Fitzgerald
4 3 mil 1003 Miles Davis
5 4 nin 1005 Nina Simone
6 5 billi 1004 Billie Holliday
It can be placed in a pipeline like this:
library(magrittr)
library(sqldf)
df_1 %>%
  { sqldf("select * from [.] A
           left join [df_2] B on B.name like '%' || A.keyword || '%'")
  }
I'm tidying some data that I read into R from a PDF using tabulizer. Unfortunately some cells haven't been read properly. In column 9 (Split 5 at 37.1km) rows 3 and 4 contain information that should have ended up in column 10 (Final Time).
How do I separate that column (9) just for these rows and paste the necessary data into an already existing column (10)?
I know how to use the tidyr::separate function but can't figure out how (and if) to apply it here. Any help and guidance will be appreciated.
structure(list(Rank = c("23", "24", "25", "26"), `Race Number` = c("13",
"11", "29", "30"), Name = c("FOSS Tobias S.", "McNULTY Brandon",
"BENNETT George", "KUKRLE Michael"), `NOC Code` = c("NOR", "USA",
"NZL", "CZE"), `Split 1 at 9.7km` = c("13:47.65(22)", "13:28.23(15)",
"14:05.46(30)", "14:05.81(32)"), `Split 2 at 15.0km` = c("19:21.16(22)",
"19:04.80(18)", "19:47.53(31)", "19:48.77(32)"), `Split 3 at 22.1km` = c("29:17.44(24)",
"29:01.94(20)", "29:58.88(28)", "29:58.09(27)"), `Split 4 at 31.8km` = c("44:06.82(24)",
"43:51.67(23)", "44:40.28(25)", "44:42.74(26)"), `Split 5 at 37.1km` = c("49:49.65(24)",
"49:40.49(23)", "50:21.82(25)1:00:28.39 (25)", "50:30.02(26)1:00:41.55 (26)"
), `Final Time` = c("59:51.68 (23)", "59:57.73 (24)", "", ""),
`Time Behind` = c("+4:47.49", "+4:53.54", "+5:24.20", "+5:37.36"
), `Average Speed` = c("44.302", "44.228", "43.854", "43.696"
)), class = "data.frame", row.names = c(NA, -4L))
My answer is not really fancy, but it does the job for any number of rows. It works as long as each time ends with a number in brackets.
# dummy df
df <- data.frame("split" = c("49:49.65(24)", "49:40.49(23)",
                             "50:21.82(25)1:00:28.39 (25)", "50:30.02(26)1:00:41.55 (26)"),
                 "final" = c("59:51.68 (23)", "59:57.73 (24)", "", ""))

# glue the two columns together, then split at every ")"
# (fixed = TRUE treats ")" literally rather than as a regex token)
merge_strings <- paste0(df$split, df$final)
split_strings <- strsplit(merge_strings, ")", fixed = TRUE)

# first piece is the split time, second piece is the final time;
# re-append the ")" that strsplit consumed
df$split <- paste0(unlist(lapply(split_strings, "[[", 1)), ")")
df$final <- paste0(unlist(lapply(split_strings, "[[", 2)), ")")
This gives:
split final
1 49:49.65(24) 59:51.68 (23)
2 49:40.49(23) 59:57.73 (24)
3 50:21.82(25) 1:00:28.39 (25)
4 50:30.02(26) 1:00:41.55 (26)
Calling your dataframe df:
library(tidyr)
library(dplyr)
df %>%
  separate(`Split 5 at 37.1km`, into = c("Split 5 at 37.1km", "aux"), sep = "\\)") %>%
  mutate(`Final Time` = coalesce(if_else(`Final Time` != "", `Final Time`, NA_character_),
                                 paste0(aux, ")")),
         aux = NULL,
         `Split 5 at 37.1km` = paste0(`Split 5 at 37.1km`, ")"))
Rank Race Number Name NOC Code Split 1 at 9.7km Split 2 at 15.0km Split 3 at 22.1km Split 4 at 31.8km Split 5 at 37.1km Final Time
1 23 13 FOSS Tobias S. NOR 13:47.65(22) 19:21.16(22) 29:17.44(24) 44:06.82(24) 49:49.65(24) 59:51.68 (23)
2 24 11 McNULTY Brandon USA 13:28.23(15) 19:04.80(18) 29:01.94(20) 43:51.67(23) 49:40.49(23) 59:57.73 (24)
3 25 29 BENNETT George NZL 14:05.46(30) 19:47.53(31) 29:58.88(28) 44:40.28(25) 50:21.82(25) 1:00:28.39 (25)
4 26 30 KUKRLE Michael CZE 14:05.81(32) 19:48.77(32) 29:58.09(27) 44:42.74(26) 50:30.02(26) 1:00:41.55 (26)
Time Behind Average Speed
1 +4:47.49 44.302
2 +4:53.54 44.228
3 +5:24.20 43.854
4 +5:37.36 43.696
You could use dplyr and stringr:
library(dplyr)
library(stringr)
data %>%
  mutate(`Final Time` = ifelse(`Final Time` == "",
                               str_remove(`Split 5 at 37.1km`, "\\d+:\\d+\\.\\d+\\(\\d+\\)"),
                               `Final Time`),
         `Split 5 at 37.1km` = str_extract(`Split 5 at 37.1km`, "\\d+:\\d+\\.\\d+\\(\\d+\\)"))
which returns
Rank Race Number Name NOC Code Split 1 at 9.7km Split 2 at 15.0km Split 3 at 22.1km Split 4 at 31.8km
1 23 13 FOSS Tobias S. NOR 13:47.65(22) 19:21.16(22) 29:17.44(24) 44:06.82(24)
2 24 11 McNULTY Brandon USA 13:28.23(15) 19:04.80(18) 29:01.94(20) 43:51.67(23)
3 25 29 BENNETT George NZL 14:05.46(30) 19:47.53(31) 29:58.88(28) 44:40.28(25)
4 26 30 KUKRLE Michael CZE 14:05.81(32) 19:48.77(32) 29:58.09(27) 44:42.74(26)
Split 5 at 37.1km Final Time Time Behind Average Speed
1 49:49.65(24) 59:51.68 (23) +4:47.49 44.302
2 49:40.49(23) 59:57.73 (24) +4:53.54 44.228
3 50:21.82(25) 1:00:28.39 (25) +5:24.20 43.854
4 50:30.02(26) 1:00:41.55 (26) +5:37.36 43.696
I like to use regex and stringr. Whilst there's some suboptimal code here, the key step is str_match(). Using this we can select the two substrings we want: the first time and the second time. If a cell holds only one time, the pattern fails to match and every group comes back NA, so we can fill in the columns based on where the NAs occur.
The regex string is as follows: ^((\\d+:)?\\d{2}:\\d{2}.\\d{2}\\(\\d+\\))((\\d+:)?\\d{2}:\\d{2}.\\d{2} ?\\(\\d+\\))$. Here we have four capture groups: the first and third capture the two whole times respectively, while the second and fourth match the optional hour prefix (this ensures that times over an hour are completely captured). Additionally we allow an optional space before the bracketed rank.
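To make the groups concrete, here is what str_match() returns on one of the problematic cells (output shown approximately); columns 2 and 4 hold the two whole times:

library(stringr)
str_match("50:21.82(25)1:00:28.39 (25)",
          "^((\\d+:)?\\d{2}:\\d{2}.\\d{2}\\(\\d+\\))((\\d+:)?\\d{2}:\\d{2}.\\d{2} ?\\(\\d+\\))$")
#>      [,1]                          [,2]           [,3] [,4]              [,5]
#> [1,] "50:21.82(25)1:00:28.39 (25)" "50:21.82(25)" NA   "1:00:28.39 (25)" "1:"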
My code is as follows:
library(tidyverse)
data <- structure(list(Rank = c("23", "24", "25", "26"), `Race Number` = c("13",
"11", "29", "30"), Name = c("FOSS Tobias S.", "McNULTY Brandon",
"BENNETT George", "KUKRLE Michael"), `NOC Code` = c("NOR", "USA",
"NZL", "CZE"), `Split 1 at 9.7km` = c("13:47.65(22)", "13:28.23(15)",
"14:05.46(30)", "14:05.81(32)"), `Split 2 at 15.0km` = c("19:21.16(22)",
"19:04.80(18)", "19:47.53(31)", "19:48.77(32)"), `Split 3 at 22.1km` = c("29:17.44(24)",
"29:01.94(20)", "29:58.88(28)", "29:58.09(27)"), `Split 4 at 31.8km` = c("44:06.82(24)",
"43:51.67(23)", "44:40.28(25)", "44:42.74(26)"), `Split 5 at 37.1km` = c("49:49.65(24)",
"49:40.49(23)", "50:21.82(25)1:00:28.39 (25)", "50:30.02(26)1:00:41.55 (26)"
), `Final Time` = c("59:51.68 (23)", "59:57.73 (24)", "", ""),
`Time Behind` = c("+4:47.49", "+4:53.54", "+5:24.20", "+5:37.36"
), `Average Speed` = c("44.302", "44.228", "43.854", "43.696"
)), class = "data.frame", row.names = c(NA, -4L))
# Take data and use a matching string to the regex pattern
data |>
  mutate(match = map(`Split 5 at 37.1km`,
                     ~unlist(str_match(., "^((\\d+:)?\\d{2}:\\d{2}.\\d{2}\\(\\d+\\))((\\d+:)?\\d{2}:\\d{2}.\\d{2} ?\\(\\d+\\))$")))) |>
  # Grab the strings that match the whole first and second/final times
  mutate(match1 = map(match, ~.[[2]]), match2 = map(match, ~.[[4]]), .keep = "unused") |>
  # Check where the NAs are and put into the dataframe accordingly
  mutate(`Split 5 at 37.1km` = ifelse(is.na(match1), `Split 5 at 37.1km`, match1),
         `Final Time` = ifelse(is.na(match2), `Final Time`, match2), .keep = "unused")
#> Rank Race Number Name NOC Code Split 1 at 9.7km Split 2 at 15.0km
#> 1 23 13 FOSS Tobias S. NOR 13:47.65(22) 19:21.16(22)
#> 2 24 11 McNULTY Brandon USA 13:28.23(15) 19:04.80(18)
#> 3 25 29 BENNETT George NZL 14:05.46(30) 19:47.53(31)
#> 4 26 30 KUKRLE Michael CZE 14:05.81(32) 19:48.77(32)
#> Split 3 at 22.1km Split 4 at 31.8km Split 5 at 37.1km Final Time
#> 1 29:17.44(24) 44:06.82(24) 49:49.65(24) 59:51.68 (23)
#> 2 29:01.94(20) 43:51.67(23) 49:40.49(23) 59:57.73 (24)
#> 3 29:58.88(28) 44:40.28(25) 50:21.82(25) 1:00:28.39 (25)
#> 4 29:58.09(27) 44:42.74(26) 50:30.02(26) 1:00:41.55 (26)
#> Time Behind Average Speed
#> 1 +4:47.49 44.302
#> 2 +4:53.54 44.228
#> 3 +5:24.20 43.854
#> 4 +5:37.36 43.696
Created on 2021-07-28 by the reprex package (v2.0.0)
Note that in the above I use the base pipe |>, available from R 4.1 onwards; it can be replaced with the magrittr pipe %>% if you are on an earlier R version.
I am trying to avoid writing a long nested ifelse statement in excel.
I am working with two datasets. The first contains county abbreviations and names:
Abbre
COUNTY_NAME
1 AD Adams
2 AS Asotin
3 BE Benton
4 CH Chelan
5 CM Clallam
6 CR Clark
And another data set that contains the county abbreviation and votes.
CountyCode Votes
1 WM 97
2 AS 14
3 WM 163
4 WM 144
5 SJ 21
For the second table, how do I convert the CountyCode (abbreviation) into the full spelled-out county name and add that as a new column?
I have been trying to solve this unsuccessfully using grep, match, and %in%. Clearly I am missing something, and any insight would be greatly appreciated.
We can use a join:
library(dplyr)
library(tidyr)
df2 <- df2 %>%
  left_join(Abbre %>%
              separate(COUNTY_NAME, into = c("CountyCode", "FullName")),
            by = "CountyCode")
Or use base R
tmp <- read.table(text = Abbre$COUNTY_NAME, header = FALSE,
                  col.names = c("CountyCode", "FullName"))
df2 <- merge(df2, tmp, by = 'CountyCode', all.x = TRUE)
Another base R option using match() (this assumes the lookup table, here df1, already has separate Abbre and COUNTY_NAME columns):
df2$COUNTY_NAME <- with(
  df1,
  COUNTY_NAME[match(df2$CountyCode, Abbre)]
)
gives
> df2
CountyCode Votes COUNTY_NAME
1 WM 97 <NA>
2 AS 14 Asotin
3 WM 163 <NA>
4 WM 144 <NA>
5 SJ 21 <NA>
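The <NA> rows appear because match() returns NA for codes that have no entry in the lookup vector, for example:

match(c("WM", "AS", "SJ"), c("AD", "AS", "BE", "CH", "CM", "CR"))
#> [1] NA  2 NA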
A data.table option
> setDT(df1)[setDT(df2), on = .(Abbre = CountyCode)]
Abbre COUNTY_NAME Votes
1: WM <NA> 97
2: AS Asotin 14
3: WM <NA> 163
4: WM <NA> 144
5: SJ <NA> 21
I am trying to rearrange a data set and then sort it on multiple variables. For example, right now I have something that looks like this:
ID Name Class 1 Class2 Monday 7-8 Monday 8-9
1 Brad Chem Bio Monday 7-8 NA
2 Charlene Acct NA NA Monday 8-9
3 Carly Philosophy Physics NA NA
4 Jess Chem Acct Monday 7-8 Monday 8-9
And sort the data like this:
Class Monday 7-8 Monday 8-9
Acct Jess Charlene, Jess
Bio Brad NA
Chem Brad, Jess Jess
Philosophy NA NA
Physics NA NA
I have tried separating all of the variables into different spreadsheets and then merging them, but I can't figure out how to sort the names based on both class and time, and it is proving incredibly difficult. The actual database is composed of about 70 different time options, 80 different people, and 150 different class names (chem, bio, etc.), so I can't go in and create this individually.
A tidyr solution:
df1 %>%
  gather(class_col, Class, 'Class.1', 'Class2') %>%
  filter(!is.na(Class)) %>%
  gather(date_col, date, 'Monday.7.8', 'Monday.8.9') %>%
  group_by(Class, date) %>%
  summarize(Name = paste(Name, collapse = ", ")) %>%
  spread(date, Name) %>%
  select(-`<NA>`)
# # A tibble: 5 x 3
# # Groups: Class [5]
# Class `Monday 7-8` `Monday 8-9`
# * <chr> <chr> <chr>
# 1 Acct Jess Charlene, Jess
# 2 Bio Brad <NA>
# 3 Chem Brad, Jess Jess
# 4 Philosophy <NA> <NA>
# 5 Physics <NA> <NA>
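Since gather() and spread() are superseded, here is a hedged translation to the pivot_* verbs (tidyr >= 1.0); the column names are assumed to match the gather() calls above:

library(dplyr)
library(tidyr)

df1 %>%
  pivot_longer(c(Class.1, Class2), names_to = "class_col", values_to = "Class") %>%
  filter(!is.na(Class)) %>%
  pivot_longer(c(Monday.7.8, Monday.8.9), names_to = "date_col", values_to = "date") %>%
  group_by(Class, date) %>%
  summarize(Name = paste(Name, collapse = ", "), .groups = "drop") %>%
  pivot_wider(names_from = date, values_from = Name) %>%
  select(-`NA`)   # drop the column built from NA dates, as select(-`<NA>`) did above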
Here is some base R code for this task:
dat <- data.frame(
  name = c("Brad", "Charlene", "Carly", "Jess"),
  class1 = c("Chem", "Acct", "Philosophy", "Chem"),
  class2 = c("Bio", NA, "Physics", "Acct"),
  monday7.8 = c("monday7.8", NA, NA, "monday7.8"),
  monday8.9 = c(NA, "monday8.9", NA, "monday8.9"),
  stringsAsFactors = FALSE
)

# "Bio" included so that every class in the data gets a row
classes <- c("Chem", "Acct", "Bio", "Philosophy", "Physics")
times <- c("monday7.8", "monday8.9")
ret <- expand.grid(class = classes, time = times, stringsAsFactors = FALSE)

# for one class/time pair, collapse the names of everyone attending
one_alloc <- function(cl, tm, dat) {
  idx <- which(!is.na(dat[, tm]) & (dat[, "class1"] == cl | dat[, "class2"] == cl))
  if (length(idx) > 0) return(paste(dat[idx, "name"], collapse = ", ")) else return(NA)
}
one_alloc <- Vectorize(one_alloc, vectorize.args = c("cl", "tm"))

ret[, "names"] <- one_alloc(cl = ret[, "class"], tm = ret[, "time"], dat = dat)
ret <- reshape(ret, timevar = "time", idvar = "class", direction = "wide")
ret
How do I get a dataframe like this:
soccer_player country position
"sam" USA left defender
"jon" USA right defender
"sam" USA left midfielder
"jon" USA offender
"bob" England goalie
"julie" England central midfielder
"jane" England goalie
To look like this (country with the counts of unique players per country):
country player_count
USA 2
England 3
The obvious complication is that there are multiple observations per player, so I cannot simply do table(df$country): that counts observations per country, not unique players.
I have been playing with the table() and merge() functions but have not had any luck.
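For illustration, the naive count returns observation counts rather than unique player counts:

table(df$country)
#>
#> England     USA
#>       3       4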
Here's one way:
as.data.frame(table(unique(df[-3])$country))
# Var1 Freq
# 1 England 3
# 2 USA 2
Drop the third column, remove any duplicate Country-Name pairs, then count the occurrences of each country.
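The intermediate step, for illustration: unique(df[-3]) keeps one row per player-country pair, which is what gets tallied:

unique(df[-3])
#>   soccer_player country
#> 1           sam     USA
#> 2           jon     USA
#> 5           bob England
#> 6         julie England
#> 7          jane England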
The distinct() and count() verbs of dplyr (v 0.3 and later) provide a compact solution:
Data:
dd <- read.csv(text='
soccer_player,country,position
"sam",USA,left defender
"jon",USA,right defender
"sam",USA,left midfielder
"jon",USA,offender
"bob",England,goalie
"julie",England,central midfielder
"jane",England,goalie')
Code:
library(dplyr)
dd %>%
  distinct(soccer_player, country) %>%
  count(country)
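which should give something like:

#>   country n
#> 1 England 3
#> 2     USA 2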
Without using any packages you can do:
List <- by(df, df$country, function(x) length(unique(x$soccer_player)))
DataFrame <- do.call(rbind, lapply(names(List), function(x)
  data.frame(country = x, player_count = List[[x]])))
# country player_count
#1 England 3
#2 USA 2
It's easier with something like data.table:
library(data.table)
dt <- data.table(df)
dt[, list(player_count = length(unique(soccer_player))), by = country]
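data.table also ships a uniqueN() helper that expresses the same count more compactly:

dt[, .(player_count = uniqueN(soccer_player)), by = country]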
Here is an sqldf solution:
library(sqldf)
sqldf("select country, count(distinct soccer_player) player_count
from df
group by country")
## country player_count
## 1 England 3
## 2 USA 2
and here is a base R solution:
as.data.frame(xtabs(~ country, unique(df[1:2])), responseName = "player_count")
## country player_count
## 1 England 3
## 2 USA 2
One more base R option, using aggregate:
> aggregate(soccer_player ~ country, dd, FUN = function(x) length(unique(x)))
# country soccer_player
#1 England 3
#2 USA 2