Merging datasets by name when names have different formats in R

I have two different dataframes in R that I am trying to merge together. One is just a set of names and the other is a set of names with corresponding information about each person.
So say I want to take this first dataframe:
Name
1. Blow, Joe
2. Smith, John
3. Jones, Tom
etc....
and merge it to this one:
  DonorName   CandidateName DonationAmount CandidateParty
1 blow joe   Bush, George W          3,000     Republican
2 guy some    Obama, Barack          5,000       Democrat
3 smith john    Reid, Harry          4,000       Democrat
such that I'd have a new list that includes only people on my first list with the information from the second. Were the two "Name" values formatted in the same way, I could just use merge(), but would there be a way to somehow use agrep() or pmatch() to do this?
Also, the 2nd dataframe I'm working with has about 25 million rows in it and 6 columns, so would making a for loop be the fastest way to go about this?
Reproducible versions of the example data:
first <- data.frame(Name = c("Blow, Joe", "Smith, John", "Jones, Tom"),
                    stringsAsFactors = FALSE)
second <- read.csv(text="
DonorName|CandidateName|DonationAmount|CandidateParty
blow joe|Bush, George W|3,000|Republican
guy some|Obama, Barack|5,000|Democrat
smith john|Reid, Harry|4,000|Democrat",header=TRUE,sep="|",
stringsAsFactors=FALSE)

Solution:
first$DonorName <- gsub(", "," ",tolower(first$Name),fixed=TRUE)
require(dplyr)
result <- inner_join(first,second,by="DonorName")
will give you what you need if the data is as you've provided it.
result
         Name  DonorName  CandidateName DonationAmount CandidateParty
1   Blow, Joe   blow joe Bush, George W          3,000     Republican
2 Smith, John smith john    Reid, Harry          4,000       Democrat
"fast way to go about this"
The dplyr method as above:
f_dplyr <- function(left, right){
  left$DonorName <- gsub(", ", " ", tolower(left$Name), fixed = TRUE)
  inner_join(left, right, by = "DonorName")
}
data.table method, setting the key on the left table only:
f_dt <- function(left, right){
  left[, DonorName := gsub(", ", " ", tolower(Name), fixed = TRUE)]
  setkey(left, DonorName)
  left[right, nomatch = 0L]
}
data.table method, setting keys on both tables:
f_dt2 <- function(left, right){
  left[, DonorName := gsub(", ", " ", tolower(Name), fixed = TRUE)]
  setkey(left, DonorName)
  setkey(right, DonorName)
  left[right, nomatch = 0L]
}
base method relying on sapply:
f_base <- function(){
  second[second$DonorName %in%
           sapply(tolower(first[[1]]), gsub, pattern = ",", replacement = "", fixed = TRUE), ]
}
Let's make the second data frame a bit more realistic, at 3M rows here, for a fairer comparison:
second <- cbind(second[rep(1:3,1000000),],data.frame(varn= 1:1000000))
left <- as.data.table(first)
right <- as.data.table(second)
library(microbenchmark)
microbenchmark(
  f_base(),
  f_dplyr(first, second),
  f_dt(left, right),
  f_dt2(left, right),
  times = 20)
And we get:
Unit: milliseconds
                   expr       min        lq    median        uq       max neval
               f_base() 2880.6152 3031.0345 3097.3776 3185.7903 3904.4649    20
 f_dplyr(first, second)  292.8271  362.7379  454.6864  533.9147  774.1897    20
      f_dt(left, right)  489.6288  531.4152  605.4148  788.9724 1340.0016    20
     f_dt2(left, right)  472.3126  515.4398  552.8019  659.7249  901.8133    20
On my machine, with this (admittedly contrived) example, we gain about 2.5 seconds over the base method. In my experience sapply doesn't scale very well, and the gap likely widens as the number of unique groups in first and second increases.
Please feel free to edit if you come up with more efficient use. I don't pretend to know, but I always try to learn something.
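The question also asks about agrep(). If the two formats can't be normalized into an identical join key, base approximate matching is an option, but note that it scans every pattern against every row of second, so it is only viable when first is small. A minimal sketch (not part of the benchmark above; max.distance = 0.1 is an assumed tolerance):
first$DonorName <- gsub(", ", " ", tolower(first$Name), fixed = TRUE)
# agrep() returns the indices of second$DonorName within max.distance of each pattern
idx <- unique(unlist(lapply(first$DonorName, agrep,
                            x = second$DonorName, max.distance = 0.1)))
second[idx, ]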

Without dplyr:
second[second$DonorName %in%
         sapply(tolower(first[[1]]), gsub, pattern = ",", replacement = "", fixed = TRUE), ]
Result:
#    DonorName  CandidateName DonationAmount CandidateParty
# 1   blow joe Bush, George W          3,000     Republican
# 3 smith john    Reid, Harry          4,000       Democrat
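Since the question notes merge() would work once the names match, a base-R sketch that keeps the columns of both frames is to normalize the key first and then merge (an inner join by default):
first$DonorName <- gsub(", ", " ", tolower(first$Name), fixed = TRUE)
merge(first, second, by = "DonorName")  # base-R equivalent of the inner_join above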

Related

How to modify a variable iteratively using data.table?

I'm hoping someone can help me figure out how to modify one variable multiple times in data.table, or find a similar approach that would work for big data.
I have a dataset with strings (addresses to be exact, but the exact contents aren't important), such as:
library(data.table)
library(stringr)
# example addresses although you can imagine other types of strings here
addr <- data.table(street = c('1 main street',
                              '99 madison avenue',
                              '340 circle court'))
I have another dataset with a column of patterns that I want to search for in these strings (i.e. in the addr dataset) and substitute with replacement strings kept in another column of this second dataset. For example:
# example of patterns to search for and what I want to replace them with
abbrev <- data.table(full = c('street', 'avenue', 'circle', 'court'),
                     abbrev = c('st', 'ave', 'cir', 'ct'))
The actual datasets are much larger: millions of addresses and 300+ abbreviations I want to check each address for.
It'd be fairly simple to do this in a loop, but because of the size, I'd like to use data.table and probably an apply function to make this process more efficient.
I'm struggling to figure out how to write this exactly. I want something like the following:
# duplicate addresses so we can compare to changes
addr[, orig.street := street]
# function to substitute abbreviations we want
standardize <- function(word, shorter) {
  addr[, street := str_replace_all(street,
                                   paste0(" ", word),
                                   paste0(" ", shorter))]
}
# now run function for all abbreviations we want
addr[, street := mapply(FUN = standardize,
                        word = abbrev$full,
                        shorter = abbrev$abbrev,
                        SIMPLIFY = FALSE, USE.NAMES = FALSE)]
When I run this in Rstudio, this is returning the error, "Supplied 4 items to be assigned to 3 items of column 'street'. The RHS length must either be 1 (single values are ok) or match the LHS length exactly. If you wish to 'recycle' the RHS please use rep() explicitly to make this intent clear to readers of your code."
However it actually does give me what I want, despite the error:
# it breaks but I do get the desired outcome:
           street       orig.street
1:      1 main st     1 main street
2: 99 madison ave 99 madison avenue
3:     340 cir ct  340 circle court
I feel like there must be a solution I'm missing, but I haven't figured it out. Any help would be greatly appreciated.
You could use stri_replace_all_fixed along with its argument vectorize_all = FALSE from library(stringi):
library(data.table)
library(stringi)
addr <- data.table(orig_street = c('1 main street',
                                   '99 madison avenue',
                                   '340 circle court'))
abbrev <- data.table(full = c('street', 'avenue', 'circle', 'court'),
                     abbrev = c('st', 'ave', 'cir', 'ct'))
addr[, street := stri_replace_all_fixed(orig_street, abbrev$full, abbrev$abbrev, vectorize_all = FALSE)]
> addr
         orig_street         street
1:     1 main street      1 main st
2: 99 madison avenue 99 madison ave
3:  340 circle court     340 cir ct
Please also see this related answer and note that library(stringr) imports library(stringi).
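For reference, stringr can express the same multi-replacement by passing a named vector to str_replace_all(). A sketch (the patterns are treated as regular expressions here, which is harmless for plain words; street_str is just an illustrative column name):
library(stringr)
# names of the vector are the patterns, values are the replacements
addr[, street_str := str_replace_all(orig_street, setNames(abbrev$abbrev, abbrev$full))]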
An alternative is a Reduce method:
addr[, street2 := Reduce(function(txt, i) gsub(paste0("\\b", abbrev$full[i], "\\b"), abbrev$abbrev[i], txt),
                         seq_len(nrow(abbrev)), init = street)][]
#               street        street2
#               <char>         <char>
# 1:     1 main street      1 main st
# 2: 99 madison avenue 99 madison ave
# 3:  340 circle court     340 cir ct
Note:
I explicitly add word-boundaries (\\b) to the gsub regex so that we don't inadvertently replace a portion of a word. I think we need this instead of fixed=TRUE because gsub("court", "ct", "courteous", fixed = TRUE) returns "cteous".
If we tried an apply family (on abbrev), then we would see the updated value for each of the patterns, but not know (without extra work) which one had the change; further, if it's possible (in general, perhaps not here) for more than one abbreviation pattern to be useful, then we need to apply each pattern/replacement on the results of the previous replacement, which *apply cannot do (as easily).
Unfortunately, Reduce does not easily iterate over rows of a frame, so we iterate over row indices (seq_len(nrow(abbrev))).
However, I can't help but feel that the last row should really be "340 circle ct". In which case, if we assume that the abbrev is at the end of the string, we can use that instead:
addr[, street3 := Reduce(function(txt, i) gsub(paste0("\\b", abbrev$full[i], "\\s*$"), abbrev$abbrev[i], txt),
                         seq_len(nrow(abbrev)), init = street)][]
#               street        street2        street3
#               <char>         <char>         <char>
# 1:     1 main street      1 main st      1 main st
# 2: 99 madison avenue 99 madison ave 99 madison ave
# 3:  340 circle court     340 cir ct  340 circle ct

Sampling by group without repetition using data.table

I'll use a hypothetical scenario to illustrate the question. Here's a table with musicians and the instrument they play and a table with the composition for a band:
musicians <- data.table(
  instrument = rep(c('bass', 'drums', 'guitar'), each = 4),
  musician = c('Chas', 'John', 'Paul', 'Stuart', 'Andy', 'Paul', 'Peter', 'Ringo', 'George', 'John', 'Paul', 'Ringo')
)
band.comp <- data.table(
  instrument = c('bass', 'drums', 'guitar'),
  n = c(2, 1, 2)
)
To avoid arguments about who is best with which instrument, the band will be assembled by sortition. Here's how I'm doing:
musicians[band.comp, on = 'instrument'][, sample(musician, n), by = instrument]
   instrument     V1
1:       bass   Paul
2:       bass   Chas
3:      drums   Andy
4:     guitar   Paul
5:     guitar George
The problem is: since there are musicians who play more than one instrument, it can happen that one person is drawn more than once.
One can build a for loop that, for each subsequent subset of instruments, draws musicians and then eliminates those from the rest of the table. But I would like suggestions on how to do this using data.table. Mainly because the kind of problem I need to solve in real life with this logic involves databases with hundreds of thousands of rows. And also because I'm trying to better understand the data.table syntax.
As a reference, I tried some tips from Andrew Brooks blog, but couldn't come up with a solution.
This can be a solution: first you select an instrument per musician, and then you select the musicians for your sample. Note that after fixing one instrument per musician, the required sample size may exceed the remaining population, in which case sample() will throw an error (though with your real data this may not be a problem).
musicians[, .(instrument = sample(instrument, 1)), by = musician][
  band.comp, on = 'instrument'][, sample(musician, n), by = instrument]
You could expand the band comp into sum(band.comp$n) distinct positions and keep sampling until you find a feasible composition:
roles = musicians[,
  CJ(posn = 1:band.comp[.BY, on = .(instrument), x.n], musician = musician),
  by = instrument]
set.seed(1)
while (TRUE){
  roles[sample(1:.N), keep := !duplicated(.SD, by = "musician") & !duplicated(.SD, by = c("instrument", "posn"))][]
  if (sum(roles$keep) == sum(band.comp$n)) break
}
setorder(roles[keep == TRUE, !"keep"])[]
   instrument posn musician
1:       bass    1   Stuart
2:       bass    2     John
3:      drums    1     Andy
4:     guitar    1   George
5:     guitar    2     Paul
There's probably something you could do with linear programming or a bipartite graph to answer the question of whether a feasible comp exists, but it's unclear what "sampling" even means in terms of the distribution over feasible comps.
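As a sketch of that feasibility check (using igraph's max_bipartite_match() on the roles table built above; this only tests existence, it says nothing about sampling uniformly over feasible comps):
library(igraph)
# one node per position slot ("bass.1", "bass.2", ...) and one per musician;
# an edge means the musician plays that slot's instrument
edges <- roles[, .(slot = paste(instrument, posn, sep = "."), musician)]
g <- graph_from_data_frame(edges, directed = FALSE)
V(g)$type <- V(g)$name %in% edges$musician  # bipartite partition: FALSE = slot, TRUE = musician
# a matching that covers every slot means a feasible composition exists
max_bipartite_match(g)$matching_size == sum(band.comp$n)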
Came across a relevant post: Randomly draw rows from dataframe based on unique values and column values and eddi's answer is perfect for this OP:
# keep the number of musicians per instrument in one data.table
musicians[band.comp, n := n, on = .(instrument)]
# for storing the musicians that have been sampled so far
m <- c()
musicians[, {
  # exclude already-sampled musicians before sampling
  res <- .SD[!musician %chin% m][sample(.N, n[1L])]
  m <- c(m, res$musician)
  res
}, by = .(instrument)]
sample output:
   instrument musician n
1:       bass   Stuart 2
2:       bass     Chas 2
3:      drums     Paul 1
4:     guitar     John 2
5:     guitar    Ringo 2
Or more succinctly with error handling as well:
m <- c()
musicians[
  band.comp,
  on = .(instrument),
  j = {
    s <- setdiff(musician, m)
    if (length(s) < n) stop(paste("Not enough musicians playing", .BY))
    res <- sample(s, n)
    m <- c(m, res)
    res
  },
  by = .EACHI]

I would like to find duplicates in a data frame based on name and last name, using partial string matching

This is an example data frame, and I just want to see if there is any function that can find duplicates using partial string matching.
df
name   last
Joseph Smith
Jose   Smith
Joseph Smit
Maria  Cruz
maria  cru
Mari   Cruz
Data Prep
Using dplyr, first concatenate the first and last name into a whole name
library(dplyr)
df1 <- df %>%
  rowwise() %>%                                         # rowwise operation
  mutate(whole = paste0(name, last, collapse = "")) %>% # concatenate first and last name by row
  ungroup()                                             # remove rowwise grouping
Output
    name  last       whole
1 Joseph Smith JosephSmith
2   Jose Smith   JoseSmith
3 Joseph  Smit  JosephSmit
4  Maria  Cruz   MariaCruz
5  maria   cru    mariacru
6   Mari  Cruz    MariCruz
Grouping similar strings
This recursive function uses agrepl, the logical form of approximate grep, to find groups of similar strings and label each group grp. NOTE: the tolerance to differences among strings is set by max.distance; lower numbers are more stringent.
desired <- NULL
grp <- 1
special <- function(x, y, grp) {
  if (nrow(y) < 1) { # if y is empty, return the accumulated data
    return(x)
  } else {
    similar <- agrepl(y$whole[1], y$whole, max.distance = 0.4) # find similar occurring strings
    x <- rbind(x, y[similar,] %>% mutate(grp = grp))           # save similar strings
    y <- setdiff(y, y[similar,])                               # remaining non-similar strings
    special(x, y, grp + 1)                                     # run function again on non-similar strings
  }
}
desired <- special(desired, df1, grp)
desired <- special(desired, df1, grp)
Output
    name  last       whole grp
1 Joseph Smith JosephSmith   1
2   Jose Smith   JoseSmith   1
3 Joseph  Smit  JosephSmit   1
4  Maria  Cruz   MariaCruz   2
5  maria   cru    mariacru   2
6   Mari  Cruz    MariCruz   2
To get rid of whole
df2 <- df1 %>% select(-whole)
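An alternative sketch for the grouping step: cluster on pairwise edit distance with base adist() and hclust() instead of recursing with agrepl(). The cut height h is a tuning choice (like max.distance above), and grp2 is just an illustrative column name:
d  <- adist(tolower(df1$whole))               # pairwise Levenshtein distance matrix
hc <- hclust(as.dist(d), method = "complete")
df1$grp2 <- cutree(hc, h = 3)                 # names within 3 edits of each other share a group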

R - Is the result of tapply always in alphabetical order

I work with the dataframe df
Name = c("Albert", "Caeser", "Albert", "Frank")
Earnings = c(1000,2000,1000,5000)
df = data.frame(Name, Earnings)
Name   Earnings
Albert     1000
Caeser     2000
Albert     1000
Frank      5000
If I use the tapply function
result <- tapply(df$Earnings, df$Name, sum)
I get this result:
Albert Caeser  Frank
  2000   2000   5000
Are there any circumstances, under which the table "result" would not be ordered alphabetically, if I use the tapply function as described above?
When I tried to find an answer, I changed the order of the rows:
Name   Earnings
Frank      5000
Caeser     2000
Albert     1000
Albert     1000
but still get the same result.
I use multiple functions where I calculate with the output of tapply calculations and I have to be absolutely sure, that the output is always delivered in the same order.
Normally the output is ordered, but you can come up with examples where it is not, for example factors whose levels are not in alphabetical order.
df <- data.frame(Name = factor(c('Ben', 'Al'), levels = c('Ben', 'Al')),
                 Earnings = c(1, 4))
tapply(df$Earnings, df$Name, sum)
## Ben Al
## 1 4
In that case you can either use as.character or (probably safer) order the result afterwards.
tapply(df$Earnings, as.character(df$Name), sum)
## Al Ben
## 4 1
result <- tapply(df$Earnings, df$Name, sum)
result[order(names(result))]
## Al Ben
## 4 1
Another possible problem can be leading spaces:
df <- data.frame(Name = c(' Ben', 'Al'),
                 Earnings = c(1, 4))
tapply(df$Earnings, df$Name, sum)
## Ben Al
## 1 4
In that case, just remove all leading spaces to get results ordered.
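For example, with base trimws():
tapply(df$Earnings, trimws(df$Name), sum)
## Al Ben
##  4   1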
You can order tapply output as you would order any named vector in R, using sort().
> result
Albert Caeser  Frank
  2000   2000   5000
> sort(result, decreasing = TRUE)
 Frank Albert Caeser
  5000   2000   2000
Depending on what you want to order by, you can either sort the values as shown above (leaving decreasing at its default of FALSE, i.e. sort(result), gives the values in increasing order), or sort the names. This will deliver the results in reverse alphabetical order of name:
result[sort(names(result), decreasing = TRUE)]
 Frank Caeser Albert
  5000   2000   2000
What else would you like to sort and order by?

Perform multiple summary functions and return a dataframe

I have a data set that includes a whole bunch of data about students, including their current school, zipcode of former residence, and a score:
students <- read.table(text = "zip school score
43050 'Hunter' 202.72974236
48227 'NYU' 338.49571519
48227 'NYU' 223.48658339
32566 'CCNY' 310.40666224
78596 'Columbia' 821.59318662
78045 'Columbia' 853.09842034
60651 'Lang' 277.48624384
32566 'Lang' 315.49753763
32566 'Lang' 80.296556533
94941 'LIU' 373.53839238
",header = TRUE,sep = "")
I want a heap of summary data about it, per school. How many students from each school are in the data set, how many unique zipcodes per school, average and cumulative score. I know I can get this by using tapply to create a bunch of tmp frames:
tmp.mean <- data.frame(tapply(students$score, students$school, mean))
tmp.sum <- data.frame(tapply(students$score, students$school, sum))
tmp.unique.zip <- data.frame(tapply(students$zip, students$school, function(x) length(unique(x))))
tmp.count <- data.frame(tapply(students$zip, students$school, function(x) length(x)))
Giving them better column names:
colnames(tmp.unique.zip) <- c("Unique zips")
colnames(tmp.count) <- c("Count")
colnames(tmp.mean) <- c("Mean Score")
colnames(tmp.sum) <- c("Total Score")
And using cbind to tie them all back together again:
school.stats <- cbind(tmp.mean, tmp.sum, tmp.unique.zip, tmp.count)
I think the cleaner way to do this is:
library(plyr)
school.stats <- ddply(students, .(school), summarise,
                      record.count = length(score),
                      unique.r.zips = length(unique(zip)),
                      mean.dist = mean(score),
                      total.dist = sum(score)
)
The resulting data looks about the same (actually, the ddply approach is cleaner and includes the schools as a column instead of as row names). Two questions: is there better way to find out how many records there are associated with each school? And, am I using ddply efficiently here? I'm new to it.
If performance is an issue, you can also use data.table
require(data.table)
tab_s <- data.table(students)
setkey(tab_s, school)
tab_s[, list(total = sum(score),
             avg = mean(score),
             unique.zips = length(unique(zip)),
             records = length(score)),
      by = "school"]
     school     total      avg unique.zips records
1:   Hunter  202.7297 202.7297           1       1
2:      NYU  561.9823 280.9911           1       2
3:     CCNY  310.4067 310.4067           1       1
4: Columbia 1674.6916 837.3458           2       2
5:     Lang  673.2803 224.4268           2       3
6:      LIU  373.5384 373.5384           1       1
Comments seem to be in general agreement: this looks good.
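For readers arriving at this later: plyr has largely been superseded by dplyr, where the same summary would read roughly as follows (a sketch of the equivalent, not from the original answers):
library(dplyr)
students %>%
  group_by(school) %>%
  summarise(record.count  = n(),
            unique.r.zips = n_distinct(zip),
            mean.dist     = mean(score),
            total.dist    = sum(score))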
