How to modify a variable iteratively using data.table?

I'm hoping someone can help me figure out how to modify one variable multiple times in data.table, or find a similar approach that would work for big data.
I have a dataset with strings (addresses to be exact, but the exact contents aren't important), such as:
library(data.table)
library(stringr)
# example addresses although you can imagine other types of strings here
addr <- data.table(street = c('1 main street',
'99 madison avenue',
'340 circle court'))
I have a second dataset with a column of patterns to search for in these strings (i.e. in the addr dataset) and another column holding the replacement strings. For example:
# example of patterns to search for and what I want to replace them with
abbrev <- data.table(full = c('street', 'avenue', 'circle', 'court'),
abbrev = c('st', 'ave', 'cir', 'ct'))
The actual datasets are much larger: millions of addresses and 300+ abbreviations I want to check each address for.
It'd be fairly simple to do this in a loop, but because of the size, I'd like to use data.table and probably an apply function to make this process more efficient.
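For reference, the plain loop I have in mind would look something like this (just a sketch; it makes one pass over the whole column per abbreviation):
for (i in seq_len(nrow(abbrev))) {
  addr[, street := str_replace_all(street,
                                   paste0(" ", abbrev$full[i]),
                                   paste0(" ", abbrev$abbrev[i]))]
}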
I'm struggling to figure out how to write this exactly. I want something like the following:
# duplicate addresses so we can compare to changes
addr[, orig.street := street]
# function to substitute abbreviations we want
standardize <- function(word, shorter) {
addr[, street := str_replace_all(street,
paste0(" ", word),
paste0(" ", shorter))]
}
# now run function for all abbreviations we want
addr[, street := mapply(FUN = standardize,
word = abbrev$full,
shorter = abbrev$abbrev,
SIMPLIFY = FALSE, USE.NAMES = FALSE)]
When I run this in RStudio, it returns the error, "Supplied 4 items to be assigned to 3 items of column 'street'. The RHS length must either be 1 (single values are ok) or match the LHS length exactly. If you wish to 'recycle' the RHS please use rep() explicitly to make this intent clear to readers of your code."
However, it actually does give me what I want, despite the error:
# it breaks but I do get the desired outcome:
street orig.street
1: 1 main st 1 main street
2: 99 madison ave 99 madison avenue
3: 340 cir ct 340 circle court
I feel like there must be a solution I'm missing, but I haven't figured it out. Any help would be greatly appreciated.

You could use stri_replace_all_fixed along with its argument vectorize_all = FALSE from library(stringi):
library(data.table)
library(stringi)
addr <- data.table(orig_street = c('1 main street',
'99 madison avenue',
'340 circle court'))
abbrev <- data.table(full = c('street', 'avenue', 'circle', 'court'),
abbrev = c('st', 'ave', 'cir', 'ct'))
addr[, street := stri_replace_all_fixed(orig_street, abbrev$full, abbrev$abbrev, vectorize_all = FALSE)]
> addr
orig_street street
1: 1 main street 1 main st
2: 99 madison avenue 99 madison ave
3: 340 circle court 340 cir ct
Please also see this related answer and note that library(stringr) imports library(stringi).
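If you prefer to stay within stringr, str_replace_all() also accepts a named vector of pattern/replacement pairs, so one pass over the column handles all abbreviations. A minimal sketch on the same data (the \\b word boundaries are my addition, to avoid replacing inside longer words such as "courteous"):
library(stringr)
# names are regex patterns, values are their replacements
repl <- setNames(abbrev$abbrev, paste0("\\b", abbrev$full, "\\b"))
addr[, street := str_replace_all(orig_street, repl)]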

An alternative is a Reduce method:
addr[, street2 := Reduce(function(txt, i) gsub(paste0("\\b", abbrev$full[i], "\\b"), abbrev$abbrev[i], txt),
seq_len(nrow(abbrev)), init = street)][]
# street street2
# <char> <char>
# 1: 1 main street 1 main st
# 2: 99 madison avenue 99 madison ave
# 3: 340 circle court 340 cir ct
Note:
I explicitly add word-boundaries (\\b) to the gsub regex so that we don't inadvertently replace a portion of a word. I think we need this instead of fixed=TRUE because gsub("court", "ct", "courteous", fixed = TRUE) returns "cteous".
If we tried an apply family (over abbrev), we would see the updated value for each pattern but not know (without extra work) which one made a change; further, if more than one abbreviation pattern can apply to the same address (possible in general, perhaps not here), then each pattern/replacement needs to operate on the result of the previous replacement, which the *apply functions cannot do (as easily).
Unfortunately, Reduce does not easily iterate over rows of a frame, so we iterate over row indices (seq_len(nrow(abbrev))).
However, I can't help but feel that the last row should really be "340 circle ct". In which case, if we assume that the abbrev is at the end of the string, we can use that instead:
addr[, street3 := Reduce(function(txt, i) gsub(paste0("\\b", abbrev$full[i], "\\s*$"), abbrev$abbrev[i], txt),
seq_len(nrow(abbrev)), init = street)][]
# street street2 street3
# <char> <char> <char>
# 1: 1 main street 1 main st 1 main st
# 2: 99 madison avenue 99 madison ave 99 madison ave
# 3: 340 circle court 340 cir ct 340 circle ct

Related

How do I find most frequent words by each observation in R?

I am very new to NLP, so please don't judge me too strictly.
I have a very big data frame of customer feedback, and my goal is to analyze it. I tokenized the words in the feedback and deleted stop words (SMART). Now I need to produce a table of the most and least frequently used words.
The code looks like this:
library(tokenizers)
library(stopwords)
words_as_tokens <-
tokenize_words(dat$description,
stopwords = stopwords(language = "en", source = "smart"))
The data frame looks like this: there are many feedback entries (the variable "description") and the customers who gave them (each customer is not unique; they can repeat). I want a table with 3 columns: a) customer name, b) word, c) its frequency, ranked in decreasing order of frequency.
Try this
library(tokenizers)
library(stopwords)
library(tidyverse)
# count freq of words
words_as_tokens <- setNames(
  lapply(sapply(dat$description,
                tokenize_words,
                stopwords = stopwords(language = "en", source = "smart")),
         function(x) as.data.frame(sort(table(x), TRUE), stringsAsFactors = FALSE)),
  dat$name)
# tidyverse's job
df <- words_as_tokens %>%
  bind_rows(.id = "name") %>%
  rename(word = x)
# output
df
# name word Freq
# 1 John experience 2
# 2 John word 2
# 3 John absolutely 1
# 4 John action 1
# 5 John amazon 1
# 6 John amazon.ae 1
# 7 John answering 1
# ....
# 42 Alex break 2
# 43 Alex nice 2
# 44 Alex times 2
# 45 Alex 8 1
# 46 Alex accent 1
# 47 Alex africa 1
# 48 Alex agents 1
# ....
Data
dat <- data.frame(name = c("John", "Alex"),
description = c("Unprecedented. The perfect word to describe Amazon. In every positive sense of that word! All because of one man - Jeff Bezos. What an entrepreneur! What a vision! This is from personal experience. Let me explain. I had given up all hope, after a horrible experience with Amazon.ae (formerly Souq.com) - due to a Herculean effort to get an order cancelled and the subsequent refund issued. I have never faced such a feedback-resistant team in my life! They were robotically answering my calls and sending me monotonous, unhelpful emails, followed by absolutely zero action!",
"Not only does Amazon have great products but their Customer Service for the most part is wonderful. Although most times you are outsourced to a different country, I personally have found that when I call it's either South Africa or Philippines and they speak so well, understand me and my NY accent and are quite nice. Let’s face it. Most times you are calling CS with a problem or issue. These agents have to listen to 8 hours of complaints so they themselves need a break. No matter how annoyed I am I try to be on my best behavior and as nice as can be because they too need a break with how nasty we as a society can be."), stringsAsFactors = F)
You can also try quanteda, as follows:
library(quanteda)
library(quanteda.textstats)
# define a corpus object to store your initial documents
mycorpus = corpus(dat$description)
# convert the corpus to a Document-Feature Matrix
mydfm = dfm( mycorpus,
tolower = TRUE,
remove = stopwords(), # this removes English stopwords
remove_punct = TRUE, # this removes punctuation
remove_numbers = TRUE, # this removes digits
remove_symbols = TRUE, # this removes symbols
remove_url = TRUE ) # this removes urls
# calculate word frequencies and return a data.frame
word_frequencies = textstat_frequency( mydfm )
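The question also asks for frequencies per customer. One possible follow-up (a sketch; it assumes the documents in mydfm are still in the same order as the rows of dat) is to pass the customer names as the grouping variable:
# word frequencies per customer: the result gains a "group" column
word_frequencies_by_customer = textstat_frequency( mydfm, groups = dat$name )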

Sampling by group without repetition using data.table

I'll use a hypothetical scenario to illustrate the question. Here's a table with musicians and the instrument they play and a table with the composition for a band:
musicians <- data.table(
instrument = rep(c('bass','drums','guitar'), each = 4),
musician = c('Chas','John','Paul','Stuart','Andy','Paul','Peter','Ringo','George','John','Paul','Ringo')
)
band.comp <- data.table(
instrument = c('bass','drums','guitar'),
n = c(2,1,2)
)
To avoid arguments about who is best with which instrument, the band will be assembled by sortition. Here's how I'm doing it:
musicians[band.comp, on = 'instrument'][, sample(musician, n), by = instrument]
instrument V1
1: bass Paul
2: bass Chas
3: drums Andy
4: guitar Paul
5: guitar George
The problem is: since there are musicians who play more than one instrument, it can happen that one person is drawn more than once.
One can build a for loop that, for each subsequent subset of instruments, draws musicians and then eliminates them from the rest of the table (a sketch of that loop is shown below). But I would like suggestions on how to do this using data.table, mainly because the kind of problem I need to solve in real life with this logic involves databases with hundreds of thousands of rows, and also because I'm trying to better understand the data.table syntax.
As a reference, I tried some tips from Andrew Brooks blog, but couldn't come up with a solution.
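For concreteness, the loop described above might look something like this (just a sketch; it can still fail when an early draw leaves too few musicians for a later instrument):
set.seed(1)
pool <- copy(musicians)                        # shrinking pool of available musicians
picked <- vector("list", nrow(band.comp))
for (i in seq_len(nrow(band.comp))) {
  drawn <- pool[instrument == band.comp$instrument[i]][sample(.N, band.comp$n[i])]
  picked[[i]] <- drawn
  pool <- pool[!musician %in% drawn$musician]  # a drawn musician can't be picked again
}
rbindlist(picked)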
This can be a solution: first you select one instrument per musician, and then you sample the musicians for the band. However, once each musician is assigned a single instrument, the required sample size may exceed the remaining population for some instrument, and then you will get an error (although this may not be a problem in your real data).
musicians[, .(instrument = sample(instrument, 1)), by = musician][band.comp, on = 'instrument'][, sample(musician, n), by = instrument]
You could expand the band comp into sum(band.comp$n) distinct positions and keep sampling until you find a feasible composition:
roles = musicians[,
CJ(posn = 1:band.comp[.BY, on=.(instrument), x.n], musician = musician)
, by=instrument]
set.seed(1)
while (TRUE){
roles[sample(1:.N), keep := !duplicated(.SD, by="musician") & !duplicated(.SD, by=c("instrument", "posn"))][]
if (sum(roles$keep) == sum(band.comp$n)) break
}
setorder(roles[keep == TRUE, !"keep"])[]
instrument posn musician
1: bass 1 Stuart
2: bass 2 John
3: drums 1 Andy
4: guitar 1 George
5: guitar 2 Paul
There's probably something you could do with linear programming or a bipartite graph to answer the question of whether a feasible comp exists, but it's unclear what "sampling" even means in terms of the distribution over feasible comps.
I came across a relevant post, Randomly draw rows from dataframe based on unique values and column values, and eddi's answer there is perfect for this OP:
#keep number of musicians per instrument in 1 data.table
musicians[band.comp, n:=n, on=.(instrument)]
#for storing the musician that has been sampled so far
m <- c()
musicians[, {
#exclude sampled musician before sampling
res <- .SD[!musician %chin% m][sample(.N, n[1L])]
m <- c(m, res$musician)
res
}, by=.(instrument)]
sample output:
instrument musician n
1: bass Stuart 2
2: bass Chas 2
3: drums Paul 1
4: guitar John 2
5: guitar Ringo 2
Or more succinctly with error handling as well:
m <- c()
musicians[
band.comp,
on=.(instrument),
j={
s <- setdiff(musician, m)
if (length(s) < n) stop(paste("Not enough musicians playing", .BY))
res <- sample(s, n)
m <- c(m, res)
res
},
by=.EACHI]

Fuzzy Address matching R

Yeah, it's been asked before, but I can't find a thread that provides a simple, clean answer to this question.
I have example data below. There are two columns: col1 is the current address, col2 is an address I am told is 'better' than the current address. I need to see how much 'better' the second column is than the first. Most of the time, the second is better because it contains secondary information that the first is lacking, such as an apartment number.
test <- as.data.frame(matrix(c(
"742 Evergreen Terrace" , "742 Evergreen Terrace Apt 3" ,
"31 Spooner Street #42" , "31 Spooner Street",
"129 W 81st Street" , "129 W 81st Street Apt 5A" ,
"245 E 73rd Street", "245 E 73rd Street Apt 6") , ncol=2, byrow=TRUE,
dimnames=list(NULL, c("old_addr" , "new_addr"))) ,stringsAsFactors=FALSE)
There is an answer I found here that gets close to what I would like:
Fuzzy match row in one column with same row in next column
I need to create a third column that is a simple 1/0 variable that == 1 if it's an approximate match, and 0 if not. I need to be able to specify threshold for approximate matching.
For my first example - 742 Evergreen Terrace vs 742 Evergreen Terrace Apt 3, the length differs by six. I need to be able to specify a length difference of six, or eight, or whatever.
I looked at agrep, but I need to compare the data in two columns within the same row, and it does not allow for that. I have also tried lapply, but its results make me think it is cycling through all of the data in the entire column, whereas I need row-by-row comparisons. I also don't understand max.distance: with the code below and a max of 1 (if I understand correctly, meaning one unit of edit or change is allowed), it should be throwing errors, but it only does in one case.
agrep(test$old_addr, test$new_addr, max.distance = 0.1, ignore.case = TRUE)
test$fuzz_match <- lapply(test$old_addr , agrep , x =
test$new_addr , max.distance = 1 , ignore.case = TRUE)
Any help is appreciated, thank you!
You can calculate the Levenshtein distance between each pair. Then what you need to decide is how large must the distance be for the two not to be the same address.
test$lev_dist <- mapply(adist, test$old_addr, test$new_addr)
test$same_addr <- test$lev_dist < 5
test
# old_addr new_addr lev_dist same_addr
# 1 742 Evergreen Terrace 742 Evergreen Terrace Apt 3 6 FALSE
# 2 31 Spooner Street #42 31 Spooner Street 4 TRUE
# 3 129 W 81st Street 129 W 81st Street Apt 5A 7 FALSE
# 4 245 E 73rd Street 245 E 73rd Street Apt 6 6 FALSE
You can use agrep() together with mapply() in a similar manner.
test$agrep_match <- mapply(agrep, test$old_addr, test$new_addr)
test$agrep_match <- lengths(test$agrep_match) == 1
test
# old_addr new_addr agrep_match
# 1 742 Evergreen Terrace 742 Evergreen Terrace Apt 3 TRUE
# 2 31 Spooner Street #42 31 Spooner Street FALSE
# 3 129 W 81st Street 129 W 81st Street Apt 5A TRUE
# 4 245 E 73rd Street 245 E 73rd Street Apt 6 TRUE
agrep() is also based on Levenshtein distance, but has a bunch of different options for adjusting the threshold, as I'm sure you've found.
There are other difference measures than Levenshtein that might be better suited for this application. Package stringdist has a number of other string distance metrics available.
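For example, here is a sketch using the Jaro-Winkler distance from stringdist on the same test data (the 0.1 cutoff is only an illustration; tune it to your data):
library(stringdist)
test$jw_dist <- stringdist(test$old_addr, test$new_addr, method = "jw")
test$match_flag <- as.integer(test$jw_dist < 0.1)  # the 1/0 indicator asked for, with an adjustable threshold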

How can I fuzzy match strings from two datasets?

I've been working on a way to join two datasets based on an imperfect string, such as the name of a company. In the past I had to match two very dirty lists: one had names and financial information, the other had names and addresses. Neither had unique IDs to match on! ASSUME THAT CLEANING HAS ALREADY BEEN APPLIED AND THERE MAY BE TYPOS AND INSERTIONS.
So far agrep is the closest tool I've found that might work. I can use Levenshtein distances with agrep, which measure the number of deletions, insertions and substitutions between two strings. agrep will return the string with the smallest distance (the most similar).
However, I've been having trouble turning this command from a single value to apply it to an entire data frame. I've crudely used a for loop to repeat the AGREP function, but there's gotta be an easier way.
See the following code:
a<-data.frame(name=c('Ace Co','Bayes', 'asd', 'Bcy', 'Baes', 'Bays'),price=c(10,13,2,1,15,1))
b<-data.frame(name=c('Ace Co.','Bayes Inc.','asdf'),qty=c(9,99,10))
for (i in 1:6){
a$x[i] = agrep(a$name[i], b$name, value = TRUE, max = list(del = 0.2, ins = 0.3, sub = 0.4))
a$Y[i] = agrep(a$name[i], b$name, value = FALSE, max = list(del = 0.2, ins = 0.3, sub = 0.4))
}
Here is a solution using the fuzzyjoin package. It uses dplyr-like syntax and stringdist as one of the possible types of fuzzy matching.
As suggested by @C8H10N4O2, the stringdist method = "jw" creates the best matches for your example.
As suggested by @dgrtwo, the developer of fuzzyjoin, I used a large max_dist and then used dplyr::group_by and dplyr::slice_min to get only the best match with minimum distance. (slice_min replaces the older top_n; if the original order is important and not alphabetical, use mutate(rank = row_number(dist)) %>% filter(rank == 1).)
a <- data.frame(name = c('Ace Co', 'Bayes', 'asd', 'Bcy', 'Baes', 'Bays'),
price = c(10, 13, 2, 1, 15, 1))
b <- data.frame(name = c('Ace Co.', 'Bayes Inc.', 'asdf'),
qty = c(9, 99, 10))
library(fuzzyjoin); library(dplyr);
stringdist_join(a, b,
by = "name",
mode = "left",
ignore_case = FALSE,
method = "jw",
max_dist = 99,
distance_col = "dist") %>%
group_by(name.x) %>%
slice_min(order_by = dist, n = 1)
#> # A tibble: 6 x 5
#> # Groups: name.x [6]
#> name.x price name.y qty dist
#> <fctr> <dbl> <fctr> <dbl> <dbl>
#> 1 Ace Co 10 Ace Co. 9 0.04761905
#> 2 Bayes 13 Bayes Inc. 99 0.16666667
#> 3 asd 2 asdf 10 0.08333333
#> 4 Bcy 1 Bayes Inc. 99 0.37777778
#> 5 Baes 15 Bayes Inc. 99 0.20000000
#> 6 Bays 1 Bayes Inc. 99 0.20000000
The solution depends on the desired cardinality of your matching a to b. If it's one-to-one, you will get the three closest matches above. If it's many-to-one, you will get six.
One-to-one case (requires assignment algorithm):
When I've had to do this before I treat it as an assignment problem with a distance matrix and an assignment heuristic (greedy assignment used below). If you want an "optimal" solution you'd be better off with optim.
I'm not familiar with AGREP, but here's an example using stringdist for your distance matrix.
library(stringdist)
d <- expand.grid(a$name,b$name) # Distance matrix in long form
names(d) <- c("a_name","b_name")
d$dist <- stringdist(d$a_name,d$b_name, method="jw") # String edit distance (use your favorite function here)
# Greedy assignment heuristic (Your favorite heuristic here)
greedyAssign <- function(a,b,d){
x <- numeric(length(a)) # assignment variable: 0 for unassigned but assignable,
# 1 for already assigned, -1 for unassigned and unassignable
while(any(x==0)){
min_d <- min(d[x==0]) # identify closest pair, arbitrarily selecting 1st if multiple pairs
a_sel <- a[d==min_d & x==0][1]
b_sel <- b[d==min_d & a == a_sel & x==0][1]
x[a==a_sel & b == b_sel] <- 1
x[x==0 & (a==a_sel|b==b_sel)] <- -1
}
cbind(a=a[x==1],b=b[x==1],d=d[x==1])
}
data.frame(greedyAssign(as.character(d$a_name),as.character(d$b_name),d$dist))
Produces the assignment:
a b d
1 Ace Co Ace Co. 0.04762
2 Bayes Bayes Inc. 0.16667
3 asd asdf 0.08333
I'm sure there's a much more elegant way to do the greedy assignment heuristic, but the above works for me.
Many-to-one case (not an assignment problem):
do.call(rbind, unname(by(d, d$a_name, function(x) x[x$dist == min(x$dist),])))
Produces the result:
a_name b_name dist
1 Ace Co Ace Co. 0.04762
11 Baes Bayes Inc. 0.20000
8 Bayes Bayes Inc. 0.16667
12 Bays Bayes Inc. 0.20000
10 Bcy Bayes Inc. 0.37778
15 asd asdf 0.08333
Edit: use method="jw" to produce desired results. See help("stringdist-package")
I am not sure if this is a useful direction for you, John Andrews, but it gives you another tool (from the RecordLinkage package) and might help.
install.packages("ipred")
install.packages("evd")
install.packages("RSQLite")
install.packages("ff")
install.packages("ffbase")
install.packages("ada")
install.packages("~/RecordLinkage_0.4-1.tar.gz", repos = NULL, type = "source")
require(RecordLinkage) # it is not on CRAN so you must load source from Github, and there are 7 dependent packages, as per above
compareJW <- function(string, vec, cutoff) {
require(RecordLinkage)
jarowinkler(string, vec) > cutoff
}
a<-data.frame(name=c('Ace Co','Bayes', 'asd', 'Bcy', 'Baes', 'Bays'),price=c(10,13,2,1,15,1))
b<-data.frame(name=c('Ace Co.','Bayes Inc.','asdf'),qty=c(9,99,10))
a$name <- as.character(a$name)
b$name <- as.character(b$name)
test <- compareJW(string = a$name, vec = b$name, cutoff = 0.8) # pick your level of cutoff, of course
data.frame(name = a$name, price = a$price, test = test)
> data.frame(name = a$name, price = a$price, test = test)
name price test
1 Ace Co 10 TRUE
2 Bayes 13 TRUE
3 asd 2 TRUE
4 Bcy 1 FALSE
5 Baes 15 TRUE
6 Bays 1 FALSE
Fuzzy Matching
Approximate String Matching is approximately matching one string to another. e.g. banana and bananas.
Fuzzy Matching is finding an approximate pattern in a string. e.g. banana within bananas in pyjamas.
Method: Basic (Bitap ≈ Levenshtein)
b$name <- lapply(b$name, agrep, a$name, value=TRUE); merge(a,b)
Method: Advanced (see ?stringdist::stringdist-metrics)
fuzzyjoin::stringdist_join(a, b, mode='full', by=c('name'), method='lv')
Method: Fuzzy Match (TRE)
agrep2 <- function(pattern, x) x[which.min(adist(pattern, x, partial=TRUE))]; b$name <- lapply(b$name, agrep2, a$name); merge(a, b)
Run yourself
# Data
a <- data.frame(name=c('Ace Co.', 'Bayes Inc.', 'asdf'), qty=c(9,99,10))
b <- data.frame(name=c('Ace Company', 'Bayes', 'asd', 'Bcy', 'Baes', 'Bays'), price=c(10,13,2,1,15,1))
# Basic
c <- b
c$name.b <- c$name
c$name <- lapply(c$name, agrep, a$name, value=TRUE)
merge(a, c, all.x=TRUE)
# Advanced
fuzzyjoin::stringdist_join(a, b, mode='full')
# Fuzzy Match
c <- b
c$name.b <- c$name
c$name <- lapply(c$name, function(pattern, x) x[which.min(adist(pattern, x, partial=TRUE))], a$name)
merge(a, c)
I agree with the answer above ("Not familiar with AGREP but here's an example using stringdist for your distance matrix"), but adding the signature function below, from Merging Data Sets Based on Partially Matched Data Elements, can make the matching more accurate, since the Levenshtein calculation is sensitive to position, addition, and deletion.
##Here's where the algorithm starts...
##I'm going to generate a signature from country names to reduce some of the minor differences between strings
##In this case, convert all characters to lower case, sort the words alphabetically, and then concatenate them with no spaces.
##So for example, United Kingdom would become kingdomunited
##We might also remove stopwords such as 'the' and 'of'.
signature=function(x){
sig=paste(sort(unlist(strsplit(tolower(x)," "))),collapse='')
return(sig)
}
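One way to use it (a sketch, reusing the stringdist distance-matrix idea from the answer above; the sig columns are illustrative names): apply the signature to both name columns first, then compute the distances on the signatures rather than on the raw names.
a$sig <- vapply(as.character(a$name), signature, character(1))
b$sig <- vapply(as.character(b$name), signature, character(1))
# distance matrix in long form, but on the cleaned-up signatures
d <- expand.grid(a_sig = a$sig, b_sig = b$sig, stringsAsFactors = FALSE)
d$dist <- stringdist::stringdist(d$a_sig, d$b_sig, method = "jw")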
I use lapply for those circumstances:
yournewvector <- lapply(yourvector$yourvariable, agrep, yourothervector$yourothervariable, max.distance = 0.01)
Then, writing it out as a CSV is not so straightforward:
write.csv(matrix(yournewvector, ncol=1), file="yournewvector.csv", row.names=FALSE)
Here is what I used for getting the number of times a company appears in a list even though the company names are inexact matches:
step.1 Install phonics Package
step.2 create a new column called "soundexcodes" in "mylistofcompanynames"
step.3 Use soundex function to return soundex codes of the company names in "soundexcodes"
step.4 Copy the company names AND corresponding soundex code into a new file (2 columns called "companynames" and "soundexcode") called "companysoundexcodestrainingfile"
step.5 Remove duplicates of soundexcodes in "companysoundexcodestrainingfile"
step.6 Go through the list of remaining company names and change each name to the way you want it to appear in your original company list, for example:
Amazon Inc A625 can be Amazon A625
Accenture Limited A455 can be Accenture A455
step.7 Perform a left_join (or a simple vlookup) between companysoundexcodestrainingfile$soundexcodes and mylistofcompanynames$soundexcodes by "soundexcodes"
step.8 The result should have the original list with a new column called "co.y" which has the name of the company the way you left it in the training file.
step.9 Sort "co.y" and check whether most of the company names are matched correctly; if so, replace the old company names with the new ones given by the vlookup of the soundex code.
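A rough sketch of those steps in code (an illustration only: it assumes the phonics and dplyr packages, codes just the first word of each name because soundex() works on single words, and simulates the hand-edited training file with distinct()):
library(phonics)
library(dplyr)
mylistofcompanynames <- data.frame(
  companyname = c("Amazon Inc", "Amazon", "Accenture Limited", "Accenture"),
  stringsAsFactors = FALSE)
# steps 2-3: soundex code of the first word of each company name
mylistofcompanynames$soundexcodes <- soundex(sub(" .*", "", mylistofcompanynames$companyname))
# steps 4-6: keep one preferred name per soundex code (edited by hand in practice)
training <- mylistofcompanynames %>%
  distinct(soundexcodes, .keep_all = TRUE) %>%
  rename(co.y = companyname)
# steps 7-8: join the preferred name back onto the original list as "co.y"
left_join(mylistofcompanynames, training, by = "soundexcodes")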

Merging datasets by name when names have different formats in R

I have two different dataframes in R that I am trying to merge together. One is just a set of names and the other is a set of names with corresponding information about each person.
So say I want to take this first dataframe:
Name
1. Blow, Joe
2. Smith, John
3. Jones, Tom
etc....
and merge it to this one:
DonorName CandidateName DonationAmount CandidateParty
1 blow joe Bush, George W 3,000 Republican
2 guy some Obama, Barack 5,000 Democrat
3 smith john Reid, Harry 4,000 Democrat
such that I'd have a new list that includes only people on my first list with the information from the second. Were the two "Name" values formatted in the same way, I could just use merge(), but would there be a way to somehow use agrep() or pmatch() to do this?
Also, the 2nd dataframe I'm working with has about 25 million rows in it and 6 columns, so would making a for loop be the fastest way to go about this?
Reproducible versions of the example data:
first <- data.frame(Name=c("Blow, Joe","Smith, John","Jones, Tom"),
stringsAsFactors=FALSE)
second <- read.csv(text="
DonorName|CandidateName|DonationAmount|CandidateParty
blow joe|Bush, George W|3,000|Republican
guy some|Obama, Barack|5,000|Democrat
smith john|Reid, Harry|4,000|Democrat",header=TRUE,sep="|",
stringsAsFactors=FALSE)
solution:
first$DonorName <- gsub(", "," ",tolower(first$Name),fixed=TRUE)
require(dplyr)
result <- inner_join(first,second,by="DonorName")
will give you what you need if the data is as you've provided it.
result
Name DonorName CandidateName DonationAmount CandidateParty
1 Blow, Joe blow joe Bush, George W 3,000 Republican
2 Smith, John smith john Reid, Harry 4,000 Democrat
"fast way to go about this"
The dplyr method as above:
f_dplyr <- function(left,right){
left$DonorName <- gsub(", "," ",tolower(left$Name),fixed=TRUE)
inner_join(left,right,by="DonorName")
}
data.table method, setting key on first.
f_dt <- function(left,right){
left[,DonorName := gsub(", "," ",tolower(Name),fixed=TRUE)]
setkey(left,DonorName)
left[right,nomatch=0L]
}
data.table method, setting both keys.
f_dt2 <- function(left,right){
left[,DonorName := gsub(", "," ",tolower(Name),fixed=TRUE)]
setkey(left,DonorName)
setkey(right,DonorName)
left[right,nomatch=0L]
}
base method relying on sapply:
f_base <- function(){
second[second$DonorName %in%
sapply(tolower(first[[1]]), gsub, pattern = ",", replacement = "", fixed = TRUE), ]
}
let's make the second df a bit more realistic at 1M obs for a fairish comparison:
second <- cbind(second[rep(1:3,1000000),],data.frame(varn= 1:1000000))
left <- as.data.table(first)
right <- as.data.table(second)
library(microbenchmark)
microbenchmark(
f_base(),
f_dplyr(first,second),
f_dt(left,right),
f_dt2(left,right),
times=20)
And we get:
Unit: milliseconds
expr min lq median uq max neval
f_base() 2880.6152 3031.0345 3097.3776 3185.7903 3904.4649 20
f_dplyr(first, second) 292.8271 362.7379 454.6864 533.9147 774.1897 20
f_dt(left, right) 489.6288 531.4152 605.4148 788.9724 1340.0016 20
f_dt2(left, right) 472.3126 515.4398 552.8019 659.7249 901.8133 20
On my machine, with this (admittedly contrived) example we gain about 2.5 seconds over the base method. sapply simplifies and doesn't scale very well in my experience... this gap likely gets bigger when you increase the number of unique groups in first and second.
Please feel free to edit if you come up with more efficient use. I don't pretend to know, but I always try to learn something.
Without dplyr:
second[second$DonorName %in%
sapply(tolower(first[[1]]), gsub, pattern = ",", replacement = "", fixed = TRUE), ]
Result:
# DonorName CandidateName DonationAmount CandidateParty
# 1 blow joe Bush, George W 3,000 Republican
# 3 smith john Reid, Harry 4,000 Democrat
