I have a column in one dataframe with city and state names in it:
ac <- c("san francisco ca", "pittsburgh pa", "philadelphia pa", "washington dc", "new york ny", "aliquippa pa", "gainesville fl", "manhattan ks")
ac <- as.data.frame(ac)
I would like to search for the values in ac$ac in another data frame column, d$description and return the value of column id if there is a match.
dput(df)
structure(list(month = c(202110L, 201910L, 202005L, 201703L,
201208L, 201502L), id = c(100559687L, 100558763L, 100558934L,
100558946L, 100543422L, 100547618L), description = c("residential local telephone service local with more san francisco ca flat rate with eas package plan includes voicemail call forwarding call waiting caller id call restriction three way calling id block speed dialing call return call screening modem rental voip transmission telephone access line 34 95 modem rental 7 00 total 41 95",
"digital video programming service multilatino ultra bensalem pa service includes digital economy multilatino digital preferred tier and certain additonal digital channels coaxial cable transmission",
"residential all distance telephone service unlimited voice only harrisburg pa flat rate with eas only features call waiting caller id caller id with call waiting call screening call forwarding call forwarding selective call return 69 3 way calling anonymous call rejection repeat dialing speed dial caller id blocking coaxial cable transmission",
"residential all distance telephone service unlimited voice only pittsburgh pa flat rate with eas only features call waiting caller id caller id with call waiting call screening call forwarding call forwarding selective call return 69 3 way calling anonymous call rejection repeat dialing speed dial caller id blocking",
"local spot advertising 30 second advertisement austin tx weekday 6 am 6 pm other audience demographic w18 49 number of rating points for daypart 0 29 average cpp 125",
"residential public switched toll interstate manhattan ks ks plan area residence switched toll base period average revenue per minute 0 18 minute online"
)), row.names = c(1L, 1245L, 3800L, 10538L, 20362L, 50000L), class = "data.frame")
I have tried to do this via accessing the row indexes of the matches via the following methods:
which(ac$ac %in% df$description)--this returns integer(0).
grep(ac$ac, df$description, value = FALSE)--this returns the first index, 1. But this isn't vectorized.
str_detect(string = ac$ac, pattern = df$description) -- but this returns all FALSE which is incorrect.
My question: how do I search for ac$ac in df$description and return the corresponding value of df$id in the event of a match? Note that the vectors are not of the same length. I am looking for ALL matches, not just the first. I would prefer something simple and fast, because the actual datasets that I will be using have over 100k rows each but any suggestions or ideas are welcome. Thanks.
Edit. Due to Andre's initial answer below, the name of the question was changed to account for the change in the scope of the question.
Edit (12/7): bounty added to generate additional interest and a fast, efficient scalable solution.
Edit (12/8): Clarification--I would like to be able to add the id variable from df to the ac dataframe, as in ac$id.
The simplest solutions are usually the fastest!
Here is my suggestion:
str = paste0(ac, collapse="|")
df$id[grep(str, df$description)]
But you can also this way
df$id[as.logical(rowSums(!is.na(sapply(ac, function(x) stringr::str_match(df$description, x)))))]
Or this way
df$id[grepl(str, df$description, perl=T)]
However, it has to be compared. By the way, I added suggestions from #Andre Wildberg and #Martina C. Arnolda.
Below is the Benchmark.
str = paste0(ac, collapse="|")
fFiolka1 = function() df$id[grep(str, df$description)]
fFiolka2 = function() df$id[as.logical(rowSums(!is.na(sapply(ac, function(x) stringr::str_match(df$description, x)))))]
fFiolka3 = function() df$id[grepl(str, df$description, perl=T)]
fWildberg1 = function() df$id[unlist(sapply(ac, function(x) grep(x, df$description)))]
fWildberg2 = function() df$id[as.logical(rowSums(sapply(ac, function(x) stri_detect_regex(df$description, x))))]
fArnolda1 = function() df[grep(str, df$description), ]["id"]
fArnolda2 = function() df[stringi::stri_detect_regex(df$description, str), ]["id"]
fArnolda3 = function() df %>% filter(description %>% str_detect(str)) %>% select(id)
library(microbenchmark)
ggplot2::autoplot(microbenchmark(
fFiolka1(), fFiolka2(), fFiolka3(),
fWildberg1(), fWildberg2(),
fArnolda1(), fArnolda2(), fArnolda3(),
times=100))
Note, for the sake of simplicity I left ac as a vector !.
ac <- c("san francisco ca", "pittsburgh pa", "philadelphia pa", "washington dc", "new york ny", "aliquippa pa", "gainesville fl", "manhattan ks")
Special update for #jvalenti
OKAY. Now I understand better what you want to achieve. However, in order to fully show the best solution, I have slightly modified your data. Here they are
library(tidyverse)
ac <- c("san francisco ca", "pittsburgh pa", "philadelphia pa", "washington dc", "new york ny", "aliquippa pa", "gainesville fl", "manhattan ks")
ac = tibble(ac = ac)
df = structure(list(
month = c(202110L, 201910L, 202005L, 201703L, 201208L, 201502L),
id = c(100559687L, 100558763L, 100558934L, 100558946L, 100543422L, 100547618L),
description = c(
"residential local telephone pittsburgh pa local with more san francisco ca flat rate with eas philadelphia pa plan includes voicemail call forwarding call waiting caller id call restriction three way calling id block speed dialing call return call screening modem rental voip transmission telephone access line 34 95 modem rental 7 00 total 41 95",
"digital video san francisco ca pittsburgh pa multilatino ultra bensalem pa service includes digital economy multilatino digital preferred tier and certain additonal digital channels coaxial cable transmission",
"residential all distance telephone pittsburgh pa unlimited voice only harrisburg pa flat rate with eas only features call waiting caller id caller id with call waiting call screening call forwarding call forwarding selective call return 69 3 way calling anonymous call rejection repeat dialing speed dial caller id blocking coaxial cable transmission",
"residential all distance telephone pittsburgh pa unlimited voice philadelphia pa san francisco ca pa flat rate with eas only features call waiting caller id caller id with call waiting call screening call forwarding call forwarding selective call return 69 3 way calling anonymous call rejection repeat dialing speed dial caller id blocking",
"local spot advertising 30 second advertisement austin tx weekday 6 am 6 pm other audience demographic w18 49 number of rating points for daypart 0 29 average cpp 125",
"residential public switched toll pittsburgh pa manhattan ks ks plan area residence switched toll base san philadelphia pa ca average revenue per minute 0 18 minute online"
)), row.names = c(1L, 1245L, 3800L, 10538L, 20362L, 50000L), class = "data.frame")
Below you will find four different solutions. One based on the for loop, two solutions based on the functions from the dplyr package, and yet a function from the collapse package.
fSolition1 = function(){
id = vector("list", nrow(ac))
for(i in seq_along(ac$ac)){
id[[i]] = df$id[grep(ac$ac[i], df$description)]
}
ac %>% mutate(id = id) %>% unnest(id)
}
fSolition1()
fSolition2 = function(){
ac %>% group_by(ac) %>%
mutate(id = list(df$id[grep(ac, df$description)])) %>%
unnest(id)
}
fSolition2()
fSolition3 = function(){
ac %>% rowwise(ac) %>%
mutate(id = list(df$id[grep(ac, df$description)])) %>%
unnest(id)
}
fSolition3()
fSolition4 = function(){
ac %>%
collapse::ftransform(id = lapply(ac, function(x) df$id[grep(x, df$description)])) %>%
unnest(id)
}
fSolition4()
Note that for the given data, all functions that return the following table as a result
# A tibble: 12 x 2
ac id
<chr> <int>
1 san francisco ca 100559687
2 san francisco ca 100558763
3 san francisco ca 100558946
4 pittsburgh pa 100559687
5 pittsburgh pa 100558763
6 pittsburgh pa 100558934
7 pittsburgh pa 100558946
8 pittsburgh pa 100547618
9 philadelphia pa 100559687
10 philadelphia pa 100558946
11 philadelphia pa 100547618
12 manhattan ks 100547618
It's time for a benchmark
library(microbenchmark)
ggplot2::autoplot(microbenchmark(
fSolition1(), fSolition2(), fSolition3(), fSolition4(), times=100))
It is perhaps no surprise to anyone that the collapse based solution is the fastest. However, second place may be a big surprise. The good old solution based on the for function is in second place!! Anyone else want to say that for is slow?
Special update for #Gwang-Jin Kim
The actions on vectors did not change much. Look below.
df_ac = ac$ac
df_decription = df$description
df_id = df$id
fSolition5 = function(){
id = vector("list", length = length(df_ac))
for(i in seq_along(df_ac)){
id[[i]] = df_id[grep(df_ac[i], df_decription)]
}
ac %>% mutate(id = id) %>% unnest(id)
}
fSolition5()
library(microbenchmark)
ggplot2::autoplot(microbenchmark(
fSolition1(), fSolition2(), fSolition3(), fSolition4(), fSolition5(), times=100))
But the combination of for and ftransform can be surprising !!!
fSolition6 = function(){
id = vector("list", nrow(ac))
for(i in seq_along(ac$ac)){
id[[i]] = df$id[grep(ac$ac[i], df$description)]
}
ac %>% collapse::ftransform(id = id) %>% unnest(id)
}
fSolition6()
library(microbenchmark)
ggplot2::autoplot(microbenchmark(
fSolition1(), fSolition2(), fSolition3(), fSolition4(), fSolition5(), fSolition6(), times=100))
Last update for #jvalenti
Dear jvaleniti, in your question you wrote I have a column in one dataframe with city and state names and then I will be using have over 100k rows. My conclusion is that it is very likely that a given city will appear several times in your variable description.
However, in the comment you wrote I don't want to change the number of rows in ac
So what kind of results do you expect? Let's see what can be done with it.
Solution 1 - we return all id as a list of vectors
ac %>% collapse::ftransform(id = map(ac, ~df$id[grep(.x, df$description)]))
# # A tibble: 8 x 2
# ac id
# * <chr> <list>
# 1 san francisco ca <int [3]>
# 2 pittsburgh pa <int [5]>
# 3 philadelphia pa <int [3]>
# 4 washington dc <int [0]>
# 5 new york ny <int [0]>
# 6 aliquippa pa <int [0]>
# 7 gainesville fl <int [0]>
# 8 manhattan ks <int [1]>
Solution 2 - we only return the first id
ac %>% collapse::ftransform(id = map_int(ac, ~df$id[grep(.x, df$description)][1]))
# # A tibble: 8 x 2
# ac id
# * <chr> <int>
# 1 san francisco ca 100559687
# 2 pittsburgh pa 100559687
# 3 philadelphia pa 100559687
# 4 washington dc NA
# 5 new york ny NA
# 6 aliquippa pa NA
# 7 gainesville fl NA
# 8 manhattan ks 100547618
Solution 3 - we only return the last id
ac %>%
collapse::ftransform(id = map_int(ac, function(x) {
idx = grep(x, df$description)
ifelse(length(idx)>0, df$id[idx[length(idx)]], NA)}))
# # A tibble: 8 x 2
# ac id
# * <chr> <int>
# 1 san francisco ca 100558946
# 2 pittsburgh pa 100547618
# 3 philadelphia pa 100547618
# 4 washington dc NA
# 5 new york ny NA
# 6 aliquippa pa NA
# 7 gainesville fl NA
# 8 manhattan ks 100547618
Solution 4 - or maybe you would like to choose any id from all possible
ac %>%
collapse::ftransform(id = map_int(ac, function(x) {
idx = grep(x, df$description)
ifelse(length(idx)==0, NA, ifelse(length(idx)==1, df$id[idx], df$id[sample(idx, 1)]))}))
# # A tibble: 8 x 2
# ac id
# * <chr> <int>
# 1 san francisco ca 100558763
# 2 pittsburgh pa 100559687
# 3 philadelphia pa 100547618
# 4 washington dc NA
# 5 new york ny NA
# 6 aliquippa pa NA
# 7 gainesville fl NA
# 8 manhattan ks 100547618
Solution 5 - if you accidentally wanted to see all the id's and wanted to keep the number of ac lines at the same time
ac %>%
collapse::ftransform(id = map(ac, function(x) {
idx = grep(x, df$description)
if(length(idx)==0) tibble(id = NA, idn = "id1") else tibble(
id = df$id[idx],
idn = paste0("id",1:length(id)))})) %>%
unnest(id) %>%
pivot_wider(ac, names_from = idn, values_from = id)
# # A tibble: 8 x 6
# ac id1 id2 id3 id4 id5
# <chr> <int> <int> <int> <int> <int>
# 1 san francisco ca 100559687 100558763 100558946 NA NA
# 2 pittsburgh pa 100559687 100558763 100558934 100558946 100547618
# 3 philadelphia pa 100559687 100558946 100547618 NA NA
# 4 washington dc NA NA NA NA NA
# 5 new york ny NA NA NA NA NA
# 6 aliquippa pa NA NA NA NA NA
# 7 gainesville fl NA NA NA NA NA
# 8 manhattan ks 100547618 NA NA NA NA
Unfortunately, the description provided by you does not indicate which of the above five solutions is an acceptable solution for you. You will have to decide for yourself.
Perhaps this is an option?
ac$id <- sapply(ac$ac, function(x) d$id[grep(x, d$description)])
# ac id
# 1 san francisco ca 100559687
# 2 pittsburgh pa 100558946
# 3 philadelphia pa
# 4 washington dc
# 5 new york ny
# 6 aliquippa pa
# 7 gainesville fl
# 8 manhattan ks 100547618
Try this sapply with grep.
df$id[ unlist( sapply( ac$ac, function(x) grep(x, df$description ) ) ) ]
[1] 100559687 100558946 100547618
EDIT, try stri_detect_regex from stringi. Should be 2-5 times faster.
library(stringi)
df$id[ as.logical( rowSums( sapply( ac$ac, function(x)
stri_detect_regex( df$description, x ) ) ) ) ]
[1] 100559687 100558946 100547618
Microbenchmark on an extended data set with 1.728M rows:
Memory should not be a problem unless you are using a system with less than 4Gb RAM total.
nrow(df)
[1] 1728000
library(microbenchmark)
microbenchmark(
"grep1" = { res <- sapply(ac$ac, function(x) df$id[grep(x, df$description)]) },
"grep2" = { res <- df$id[ unlist( sapply( ac$ac, function(x) grep(x, df$description ) ) ) ] },
"stringi" = { res <- df$id[ as.logical( rowSums( sapply( ac$ac, function(x) stri_detect_regex( df$description, x ) ) ) ) ] }, times=10 )
Unit: seconds
expr min lq mean median uq max neval cld
grep1 96.90757 97.98706 100.13299 99.05837 101.99050 107.04312 10 b
grep2 97.51382 97.66425 100.00610 99.20753 101.17921 106.86661 10 b
stringi 46.15548 46.65894 48.68073 47.29635 50.15713 53.50351 10 a
Memory footprint during microbenchmark:
Path: /Library/Frameworks/R.framework/Versions/4.0/Resources/bin/exec/R
Physical footprint: 638.3M
Physical footprint (peak): 1.8G
You can use regex_inner_join from package fuzzyjoin
> library(fuzzyjoin)
> regex_inner_join(df, ac, by = c(description = "ac"))
month id
1 202110 100559687
2 201703 100558946
3 201502 100547618
description
1 residential local telephone service local with more san francisco ca flat rate with eas package plan includes voicemail call forwarding call waiting caller id call restriction three way calling id block speed dialing call return call screening modem rental voip transmission telephone access line 34 95 modem rental 7 00 total 41 95
2 residential all distance telephone service unlimited voice only pittsburgh pa flat rate with eas only features call waiting caller id caller id with call waiting call screening call forwarding call forwarding selective call return 69 3 way calling anonymous call rejection repeat dialing speed dial caller id blocking
3 residential public switched toll interstate manhattan ks ks plan area residence switched toll base period average revenue per minute 0 18 minute online
ac
1 san francisco ca
2 pittsburgh pa
3 manhattan ks
Checking using a regular expression and non-expensive functions should be fast:
First, we generate the pattern to be checked: ac_regex <- paste(ac$ac, collapse = "|").
There are several ways to detect matches in description and subset. Here are three:
# 1 grep()
df[grep(ac_regex, df$description), ]["id"],
# 2 stringi::stri_detect_*()
df[stri_detect_regex(df$description, ac_regex), ]["id"],
# 3 stringr::str_detect() + tidy subsetting
df %>% filter(description %>% str_detect(ac_regex)) %>% select(id),
All three return the desired subset of df:
id
1 100559687
2 100558946
3 100547618
(You need the packages tidyverse and stringi for options 2 and 3.)
Let's benchmark (using package bench):
bench::mark(
base_grep = df[grep(ac_regex, df$description), ]["id"],
base_stringi = df[stringi::stri_detect_regex(df$description, ac_regex), ]["id"],
tidy = df %>% filter(description %>% str_detect(ac_regex)) %>% select(id),
check = F
)
expression median
<bch:expr> <bch:tm>
1 base_grep 146.61µs
2 base_stringi 119.6µs
3 tidy 1.99ms
I'd go with stringi!
First there is no c$c assignment in the provided code. All the data is assigned to a variable called c. This variable does not have any c members (c$c) you are trying to work with.
Second it is a very bad practice to assign any data to variables called as the basic functions of R c <- c(...).
I wonder if there is an simpler way than writing if...else... for the following case. I have a dataframe and I only want the rows with number in column "percentage" >=95. Moreover, for one object, if there is multiple rows fitting this criteria, I only want the largest one(s). If there are more than one largest ones, I would like to keep all of them.
For example:
object city street percentage
A NY Sun 100
A NY Malino 97
A NY Waterfall 100
B CA Washington 98
B WA Lieber 95
C NA Moon 75
Then I'd like the result shows:
object city street percentage
A NY Sun 100
A NY Waterfall 100
B CA Washington 98
I am able to do it using if else statement, but I feel there should be some smarter ways to say: 1. >=95 2. if more than one, choose the largest 3. if more than one largest, choose them all.
You can do this by creating an variable that indicates the rows that have the maximum percentage for each of the objects. We can then use this indicator to subset the data.
# your data
dat <- read.table(text = "object city street percentage
A NY Sun 100
A NY Malino 97
A NY Waterfall 100
B CA Washington 98
B WA Lieber 95
C NA Moon 75", header=TRUE, na.strings="", stringsAsFactors=FALSE)
# create an indicator to identify the rows that have the maximum
# percentage by object
id <- with(dat, ave(percentage, object, FUN=function(i) i==max(i)) )
# subset your data - keep rows that are greater than 95 and have the
# maximum group percentage (given by id equal to one)
dat[dat$percentage >= 95 & id , ]
This works by the addition statement creating a logical, which can then be used to subset the rows of dat.
dat$percentage >= 95 & id
#[1] TRUE FALSE TRUE TRUE FALSE FALSE
Or putting these together
with(dat, dat[percentage >= 95 & ave(percentage, object,
FUN=function(i) i==max(i)) , ])
# object city street percentage
# 1 A NY Sun 100
# 3 A NY Waterfall 100
# 4 B CA Washington 98
You could do this also in data.table using the same approach by #user20650
library(data.table)
setDT(dat)[dat[,percentage==max(percentage) & percentage >=95, by=object]$V1,]
# object city street percentage
#1: A NY Sun 100
#2: A NY Waterfall 100
#3: B CA Washington 98
Or using dplyr
dat %>%
group_by(object) %>%
filter(percentage==max(percentage) & percentage >=95)
Following works:
ddf2 = ddf[ddf$percentage>95,]
ddf3 = ddf2[-c(1:nrow(ddf2)),]
for(oo in unique(ddf2$object)){
tempdf = ddf2[ddf2$object == oo, ]
maxval = max(tempdf$percentage)
tempdf = tempdf[tempdf$percentage==maxval,]
for(i in 1:nrow(tempdf)) ddf3[nrow(ddf3)+1,] = tempdf[i,]
}
ddf3
object city street percentage
1 A NY Sun 100
3 A NY Waterfall 100
4 B CA Washington 98