How can I fuzzy match strings from two datasets? - r

I've been working on a way to join two datasets based on an imperfect string, such as a company name. In the past I had to match two very dirty lists: one list had names and financial information, the other had names and addresses. Neither had unique IDs to match on! ASSUME THAT CLEANING HAS ALREADY BEEN APPLIED AND THAT THERE MAY BE TYPOS AND INSERTIONS.
So far, the base agrep function is the closest tool I've found that might work. I can use Levenshtein distances with agrep, which measure the number of deletions, insertions, and substitutions between two strings. agrep will return the string with the smallest distance (the most similar).
However, I've been having trouble turning this command from a single value into something I can apply to an entire data frame. I've crudely used a for loop to repeat the agrep function, but there must be an easier way.
See the following code:
a<-data.frame(name=c('Ace Co','Bayes', 'asd', 'Bcy', 'Baes', 'Bays'),price=c(10,13,2,1,15,1))
b<-data.frame(name=c('Ace Co.','Bayes Inc.','asdf'),qty=c(9,99,10))
for (i in 1:6){
  a$x[i] = agrep(a$name[i], b$name, value = TRUE, max = list(del = 0.2, ins = 0.3, sub = 0.4))
  a$Y[i] = agrep(a$name[i], b$name, value = FALSE, max = list(del = 0.2, ins = 0.3, sub = 0.4))
}
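For illustration, the loop can probably be collapsed with sapply while keeping the same agrep call; here is a minimal sketch (mine, not from the original post) that also guards against agrep() returning zero or several hits:
a$x <- sapply(as.character(a$name), function(nm) {
  hit <- agrep(nm, b$name, value = TRUE,
               max.distance = list(deletions = 0.2, insertions = 0.3, substitutions = 0.4))
  if (length(hit)) hit[1] else NA_character_   # keep the first hit, or NA when nothing matches
})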

Here is a solution using the fuzzyjoin package. It uses dplyr-like syntax and stringdist as one of the possible types of fuzzy matching.
As suggested by @C8H10N4O2, the stringdist method = "jw" produces the best matches for your example.
As suggested by @dgrtwo, the developer of fuzzyjoin, I used a large max_dist and then dplyr::group_by and dplyr::slice_min to keep only the best match with the minimum distance. (slice_min replaces the older top_n; if the original order matters rather than alphabetical order, use mutate(rank = row_number(dist)) %>% filter(rank == 1).)
a <- data.frame(name = c('Ace Co', 'Bayes', 'asd', 'Bcy', 'Baes', 'Bays'),
                price = c(10, 13, 2, 1, 15, 1))
b <- data.frame(name = c('Ace Co.', 'Bayes Inc.', 'asdf'),
                qty = c(9, 99, 10))

library(fuzzyjoin); library(dplyr)

stringdist_join(a, b,
                by = "name",
                mode = "left",
                ignore_case = FALSE,
                method = "jw",
                max_dist = 99,
                distance_col = "dist") %>%
  group_by(name.x) %>%
  slice_min(order_by = dist, n = 1)
#> # A tibble: 6 x 5
#> # Groups:   name.x [6]
#>   name.x price name.y       qty       dist
#>   <fctr> <dbl> <fctr>     <dbl>      <dbl>
#> 1 Ace Co    10 Ace Co.        9 0.04761905
#> 2 Bayes     13 Bayes Inc.    99 0.16666667
#> 3 asd        2 asdf          10 0.08333333
#> 4 Bcy        1 Bayes Inc.    99 0.37777778
#> 5 Baes      15 Bayes Inc.    99 0.20000000
#> 6 Bays       1 Bayes Inc.    99 0.20000000

The solution depends on the desired cardinality of your matching a to b. If it's one-to-one, you will get the three closest matches above. If it's many-to-one, you will get six.
One-to-one case (requires assignment algorithm):
When I've had to do this before, I treat it as an assignment problem with a distance matrix and an assignment heuristic (greedy assignment is used below). If you want an "optimal" solution, you'd be better off with optim.
I'm not familiar with AGREP, but here's an example using stringdist for your distance matrix.
library(stringdist)

d <- expand.grid(a$name, b$name)   # distance matrix in long form
names(d) <- c("a_name", "b_name")
d$dist <- stringdist(d$a_name, d$b_name, method = "jw")   # string edit distance (use your favorite function here)

# Greedy assignment heuristic (your favorite heuristic here)
greedyAssign <- function(a, b, d) {
  x <- numeric(length(a))   # assignment variable: 0 = unassigned but assignable,
                            # 1 = already assigned, -1 = unassigned and unassignable
  while (any(x == 0)) {
    min_d <- min(d[x == 0])   # identify the closest pair, arbitrarily selecting the 1st if there are several
    a_sel <- a[d == min_d & x == 0][1]
    b_sel <- b[d == min_d & a == a_sel & x == 0][1]
    x[a == a_sel & b == b_sel] <- 1
    x[x == 0 & (a == a_sel | b == b_sel)] <- -1
  }
  cbind(a = a[x == 1], b = b[x == 1], d = d[x == 1])
}

data.frame(greedyAssign(as.character(d$a_name), as.character(d$b_name), d$dist))
Produces the assignment:
       a          b       d
1 Ace Co    Ace Co. 0.04762
2  Bayes Bayes Inc. 0.16667
3    asd       asdf 0.08333
I'm sure there's a much more elegant way to do the greedy assignment heuristic, but the above works for me.
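If an exact (rather than greedy) one-to-one assignment is wanted, one option is the Hungarian method from the clue package; this is a hedged sketch of mine (clue::solve_LSAP is assumed to be installed), not part of the original answer, and it minimises the total Jaro-Winkler distance over all pairings:
library(stringdist)
library(clue)   # provides solve_LSAP()

# solve_LSAP() needs nrow <= ncol, so put the shorter list (b) on the rows
dm  <- stringdistmatrix(as.character(b$name), as.character(a$name), method = "jw")
idx <- as.integer(solve_LSAP(dm))   # for each row of b, the index of the assigned name in a

data.frame(a = a$name[idx],
           b = b$name,
           d = dm[cbind(seq_along(idx), idx)])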
Many-to-one case (not an assignment problem):
do.call(rbind, unname(by(d, d$a_name, function(x) x[x$dist == min(x$dist),])))
Produces the result:
   a_name     b_name    dist
1  Ace Co    Ace Co. 0.04762
11   Baes Bayes Inc. 0.20000
8   Bayes Bayes Inc. 0.16667
12   Bays Bayes Inc. 0.20000
10    Bcy Bayes Inc. 0.37778
15    asd       asdf 0.08333
Edit: use method="jw" to produce desired results. See help("stringdist-package")

I am not sure if this is a useful direction for you, John Andrews, but it gives you another tool (from the RecordLinkage package) and might help.
install.packages("ipred")
install.packages("evd")
install.packages("RSQLite")
install.packages("ff")
install.packages("ffbase")
install.packages("ada")
install.packages("~/RecordLinkage_0.4-1.tar.gz", repos = NULL, type = "source")
require(RecordLinkage) # it is not on CRAN so you must load source from Github, and there are 7 dependent packages, as per above
compareJW <- function(string, vec, cutoff) {
  require(RecordLinkage)
  jarowinkler(string, vec) > cutoff
}
a<-data.frame(name=c('Ace Co','Bayes', 'asd', 'Bcy', 'Baes', 'Bays'),price=c(10,13,2,1,15,1))
b<-data.frame(name=c('Ace Co.','Bayes Inc.','asdf'),qty=c(9,99,10))
a$name <- as.character(a$name)
b$name <- as.character(b$name)
test <- compareJW(string = a$name, vec = b$name, cutoff = 0.8) # pick your level of cutoff, of course
data.frame(name = a$name, price = a$price, test = test)
> data.frame(name = a$name, price = a$price, test = test)
     name price  test
1  Ace Co    10  TRUE
2   Bayes    13  TRUE
3     asd     2  TRUE
4     Bcy     1 FALSE
5    Baes    15  TRUE
6    Bays     1 FALSE
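As a possible extension (a sketch of mine, not part of the answer): instead of a TRUE/FALSE cutoff, the same jarowinkler() scores can be used to pick the closest b$name for each a$name, assuming RecordLinkage is loaded as above:
best <- sapply(a$name, function(nm) {
  sims <- jarowinkler(nm, b$name)   # similarity of nm to every name in b
  b$name[which.max(sims)]           # keep the most similar one
})
data.frame(name = a$name, best_match = best, stringsAsFactors = FALSE)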

Fuzzy Matching
Approximate String Matching is approximately matching one whole string to another, e.g. banana and bananas.
Fuzzy Matching is finding an approximate pattern within a string, e.g. banana within bananas in pyjamas.
Method: Basic (Bitap ≈ Levenshtein)
R implementation:
b$name <- lapply(b$name, agrep, a$name, value=TRUE); merge(a,b)

Method: Advanced (see ?stringdist::stringdist-metrics)
R implementation:
fuzzyjoin::stringdist_join(a, b, mode='full', by=c('name'), method='lv')

Method: Fuzzy Match (TRE)
R implementation:
agrep2 <- function(pattern, x) x[which.min(adist(pattern, x, partial=TRUE))]
b$name <- lapply(b$name, agrep2, a$name); merge(a, b)
Run it yourself:
# Data
a <- data.frame(name=c('Ace Co.', 'Bayes Inc.', 'asdf'), qty=c(9,99,10))
b <- data.frame(name=c('Ace Company', 'Bayes', 'asd', 'Bcy', 'Baes', 'Bays'), price=c(10,13,2,1,15,1))
# Basic
c <- b
c$name.b <- c$name
c$name <- lapply(c$name, agrep, a$name, value=TRUE)
merge(a, c, all.x=TRUE)
# Advanced
fuzzyjoin::stringdist_join(a, b, mode='full')
# Fuzzy Match
c <- b
c$name.b <- c$name
c$name <- lapply(c$name, function(pattern, x) x[which.min(adist(pattern, x, partial=TRUE))], a$name)
merge(a, c)

I agree with the answer above ("Not familiar with AGREP but here's example using stringdist for your distance matrix."), but adding the signature function below, from Merging Data Sets Based on Partially Matched Data Elements, will be more accurate, since the Levenshtein calculation is based on position, addition, and deletion.
## Here's where the algorithm starts...
## I'm going to generate a signature from country names to reduce some of the minor differences between strings.
## In this case, convert all characters to lower case, sort the words alphabetically, and then concatenate them with no spaces.
## So for example, United Kingdom would become kingdomunited.
## We might also remove stopwords such as 'the' and 'of'.
signature <- function(x) {
  sig <- paste(sort(unlist(strsplit(tolower(x), " "))), collapse = '')
  return(sig)
}
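Here is a hedged sketch (mine, not from the linked post) of how the signature could then be used with the data from the question: compute signatures for both name columns and run stringdist on the signatures instead of the raw names.
library(stringdist)

a$sig <- sapply(as.character(a$name), signature)
b$sig <- sapply(as.character(b$name), signature)

# Match each signature in a to the closest signature in b
dm <- stringdistmatrix(a$sig, b$sig, method = "lv")
data.frame(a_name = a$name,
           b_name = b$name[apply(dm, 1, which.min)],
           dist   = apply(dm, 1, min))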

I use lapply in those circumstances:
yournewvector <- lapply(yourvector$yourvariable, agrep, yourothervector$yourothervariable, max.distance = 0.01)
Writing it out as a CSV afterwards is not so straightforward:
write.csv(matrix(yournewvector, ncol = 1), file = "yournewvector.csv", row.names = FALSE)

Here is what I used to count how many times a company appears in a list even though the company names are inexact matches (a code sketch of these steps follows the list):
Step 1: Install the phonics package.
Step 2: Create a new column called "soundexcodes" in "mylistofcompanynames".
Step 3: Use the soundex function to fill "soundexcodes" with the soundex codes of the company names.
Step 4: Copy the company names AND the corresponding soundex codes into a new file (two columns called "companynames" and "soundexcodes") called "companysoundexcodestrainingfile".
Step 5: Remove duplicate soundex codes in "companysoundexcodestrainingfile".
Step 6: Go through the list of remaining company names and change each name to the form you want it to appear in your original list, for example:
Amazon Inc A625 can be Amazon A625
Accenture Limited A455 can be Accenture A455
Step 7: Perform a left_join (or a simple VLOOKUP) between companysoundexcodestrainingfile$soundexcodes and mylistofcompanynames$soundexcodes by "soundexcodes".
Step 8: The result should have the original list with a new column called "co.y", which holds the company name the way you left it in the training file.
Step 9: Sort by "co.y" and check whether most of the company names are matched correctly; if so, replace the old company names with the new ones given by the soundex-code lookup.
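Here is that hedged sketch; the data and intermediate object names are illustrative only (not from the answer), and it assumes the phonics and dplyr packages are installed:
library(phonics)
library(dplyr)

# Steps 2-3: soundex codes for a toy list of company names
mylistofcompanynames <- data.frame(
  companynames = c("Amazon Inc", "Amazon Incorporated", "Accenture Limited", "Accenture Ltd"),
  stringsAsFactors = FALSE)
mylistofcompanynames$soundexcodes <- soundex(gsub("[^A-Za-z]", "", mylistofcompanynames$companynames))

# Steps 4-6: one row per soundex code; edit co.y by hand to the preferred spelling
companysoundexcodestrainingfile <- mylistofcompanynames %>%
  distinct(soundexcodes, .keep_all = TRUE) %>%
  transmute(soundexcodes, co.y = companynames)

# Steps 7-8: attach the canonical name back to the full list
left_join(mylistofcompanynames, companysoundexcodestrainingfile, by = "soundexcodes")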

Related

How to lookup items from one dataframe in another dataframe based on a custom 'matching' function

I have two data frames: sold1 and sold2.
They both hold data on sold house prices, but from different sources. As it turns out, the data may not match even for the same property.
I'd like to compare each element in sold1 to sold2 and exclude those in sold1 that are already present (duplicated) in sold2.
The test for a 'duplicate' in this instance would be a combination of:
sold1.price is within 10% of sold2.price
sold2.address is contained in sold1.address
sold2.dateOfSale is no earlier than sold1.dateOfSale
Two questions:
Why does the 'price' field change class when accessed in a function?
Is there a more elegant way to do this without using apply twice (very slow on two large dataframes)?
sold1 <- data.frame(
  price = c(100000, 150000, 200000, 250000, 300000, 400000),
  address = c("Widmore Road, Bromley", "River Quaggy Apartments", "Meadowcroft Way, Orwell, SG8",
              "Freelands Road, Bromley", "Nascot Street, London, W12", "Priory Terrace, South Hampstead, NW6"),
  dateOfSale = c(as.Date("2020-01-01"), as.Date("2020-02-03"), as.Date("2020-03-05"),
                 as.Date("2020-04-06"), as.Date("2020-05-08"), as.Date("2020-06-12"))
)

sold2 <- data.frame(
  price = c(100000, 150000, 210000, 251000, 300000),
  address = c("Random Road, Bromley", "Random2 Road Apartments", "Meadowcroft Way",
              "Freelands Road", "Random street London, W12"),
  dateOfSale = c(as.Date("2020-01-01"), as.Date("2020-04-03"), as.Date("2020-03-25"),
                 as.Date("2020-04-26"), as.Date("2019-05-08"))
)

FLR_Check_Match <- function(s2, s1row) {
  checkRes <- TRUE
  # Price within tolerance
  checkRes <- checkRes && as.numeric(s1row["price"]) >= as.numeric(s2["price"]) * .9 &&
    as.numeric(s1row["price"]) <= as.numeric(s2["price"]) * 1.1
  # Address close match
  checkRes <- checkRes && grepl(s2["address"], s1row["address"], ignore.case = TRUE)
  # Date of sale in sold2 (s2) is not earlier than the date of sale in sold1 (s1)
  checkRes <- checkRes && as.Date(s2["dateOfSale"]) >= as.Date(s1row["dateOfSale"])
  return(checkRes)
}

FCheck_Sold_Dups <- function(s1, s2) {
  print(class(s1["price"]))  # character
  # For each element in s2, check whether there is a match to s1
  excV <- apply(s2, 1, FLR_Check_Match, s1)
  result <- any(excV)
  return(result)
}

sold1$exclude <- apply(sold1, 1, FCheck_Sold_Dups, sold2)
1. Why does the 'price' field change class when accessed in a function?
apply coerces the data frame to a matrix.
From ?apply:
If X is not an array but an object of a class with a non-null dim value (such as a data frame), apply attempts to coerce it to an array via as.matrix.
This can also be verified from the source code of apply:
apply
function (X, MARGIN, FUN, ...)
{
    FUN <- match.fun(FUN)
    dl <- length(dim(X))
    if (!dl)
        stop("dim(X) must have a positive length")
    if (is.object(X))
        X <- if (dl == 2L)
            as.matrix(X)   # <- here
        else as.array(X)
    .....
    .....
Since a matrix can hold data of only one type, everything is converted to character. Hence, you have to coerce the values back to numeric inside the function.
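A tiny illustration of that coercion (my example, not from the answer):
mixed <- data.frame(price = c(100, 200), address = c("a", "b"))
apply(mixed, 1, function(row) class(row["price"]))
# [1] "character" "character"   <- as.matrix() turned every cell into character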
2. Is there a more elegant way to do this without using apply twice (very slow on two large dataframes)?
We cannot really avoid comparing every row of sold1 to sold2 (perhaps fuzzy matching/joining could help), so with very large data most approaches will be slow. Here is one way that is arguably more elegant, faster, and shorter. I added prefixes to the column names of the sold1 and sold2 data frames to differentiate between them.
library(dplyr)

names(sold1) <- paste0('sold1_', names(sold1))
names(sold2) <- paste0('sold2_', names(sold2))

tidyr::crossing(sold1, sold2) %>%
  group_by(sold1_address) %>%
  # Check if the address matches
  summarise(address_check = any(stringr::str_detect(sold1_address, stringr::regex(sold2_address, ignore_case = TRUE))),
            # Check if the price is in range
            price_check = any(data.table::between(sold1_price, sold2_price * .9, sold2_price * 1.1)),
            # Check if the date is in range
            date_check = any(sold2_dateOfSale >= sold1_dateOfSale),
            # TRUE only if all three conditions are satisfied
            exclude = address_check & price_check & date_check) %>%
  select(sold1_address, exclude) %>%
  left_join(sold1, by = 'sold1_address')
#   sold1_address                        exclude sold1_price sold1_dateOfSale
#   <chr>                                <lgl>         <dbl> <date>
# 1 Freelands Road, Bromley              TRUE         250000 2020-04-06
# 2 Meadowcroft Way, Orwell, SG8         TRUE         200000 2020-03-05
# 3 Nascot Street, London, W12           FALSE        300000 2020-05-08
# 4 Priory Terrace, South Hampstead, NW6 FALSE        400000 2020-06-12
# 5 River Quaggy Apartments              FALSE        150000 2020-02-03
# 6 Widmore Road, Bromley                FALSE        100000 2020-01-01
Note that I have created three separate columns (address_check, price_check, and date_check) for clarity and to keep each condition separate; they are not strictly needed, and the conditions can be combined into one if required.
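For comparison, here is a hedged sketch of mine (not part of the answer) expressing the same three conditions with fuzzyjoin::fuzzy_join, which accepts one match function per column pair; the column names follow the prefixed frames above, and it still evaluates all row pairs internally, so it is not necessarily faster:
library(fuzzyjoin)

matched <- fuzzy_join(
  sold1, sold2,
  by = c("sold1_price"      = "sold2_price",
         "sold1_address"    = "sold2_address",
         "sold1_dateOfSale" = "sold2_dateOfSale"),
  match_fun = list(
    function(x, y) x >= y * 0.9 & x <= y * 1.1,                                    # price within 10%
    function(x, y) stringr::str_detect(x, stringr::regex(y, ignore_case = TRUE)),  # sold2 address contained in sold1 address
    function(x, y) y >= x                                                          # sold2 date not earlier than sold1 date
  ),
  mode = "inner")

sold1$exclude <- sold1$sold1_address %in% matched$sold1_address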

How to explicitly build sparse stringdistmatrix to avoid running out of memory?

I need to match a large number of slightly varying restaurant names in the "data" vector to the appropriate entries in the "match" vector.
The stringdistmatrix function in the stringdist package is great, but it runs out of memory at around 10k x 10k, and my data is larger.
I tried as(stringdistmatrix(data, match), 'sparseMatrix'), which would give the hoped-for result, but it runs out of memory. Hence, I would like to explicitly index the pairs with sparseMatrix(i, j, x, dims, dimnames), with x calculated by adist() or a similar string distance, in the hope that it would fit in memory.
data <- c("McDonalds", "MacDonalds", "Mc Donald's", "Wendy's", "Wendys", "Wendy",
"Chipotle", "Chipotle's")
match <- c("McDonalds", "Wendys", "Chipotle")
Trying:
library(Matrix)
library(stringdist)

idx <- expand.grid(a = data, b = match)
idx$row <- match(idx$a, idx$b)
idx$col <- match(idx$b, idx$a)

sparseMatrix(i = idx$row,
             j = idx$col,
             x = ifelse(adist(data, match) < 2, 1, 0),
             dims = c(7, 3),
             dimnames = list(data, match))
Hoped for output to match:
library(stringdist)
as(ifelse(stringdistmatrix(data,match)<2,1,0),'sparseMatrix')
If I understand your question correctly, your task is to match dirty strings with clean strings. You do not need the whole matrix for that (and it would indeed not be sparse). Instead you can use amatch.
library(stringdist)
data <- c("McDonalds", "MacDonalds", "Mc Donald's", "Wendy's", "Wendys", "Wendy",
"Chipotle", "Chipotle's")
match <- c("McDonalds", "Wendys", "Chipotle")
i <- amatch(data, match, method="osa",maxDist=2)
data.frame(data=data, matched_data = match[i], stringsAsFactors = FALSE)
         data matched_data
1   McDonalds    McDonalds
2  MacDonalds    McDonalds
3 Mc Donald's    McDonalds
4     Wendy's       Wendys
5      Wendys       Wendys
6       Wendy       Wendys
7    Chipotle     Chipotle
8  Chipotle's     Chipotle
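If a 0/1 incidence matrix is still wanted afterwards, the indices returned by amatch() can feed Matrix::sparseMatrix() directly, without ever forming the dense distance matrix; this is a sketch of mine, not part of the answer:
library(Matrix)

keep <- !is.na(i)   # rows of 'data' that found a match within maxDist
sparseMatrix(i = which(keep), j = i[keep], x = 1,
             dims = c(length(data), length(match)),
             dimnames = list(data, match))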

Cluster sequences of strings in R [duplicate]

This question already has answers here:
Text clustering with Levenshtein distances (4 answers)
Closed 6 years ago.
I have the following data:
attributes <- c("apple-water-orange", "apple-water", "apple-orange", "coffee", "coffee-croissant", "green-red-yellow", "green-red-blue", "green-red","black-white","black-white-purple")
attributes
attributes
1 apple-water-orange
2 apple-water
3 apple-orange
4 coffee
5 coffee-croissant
6 green-red-yellow
7 green-red-blue
8 green-red
9 black-white
10 black-white-purple
What I want is another column that assigns a category to each row, based on observation similarity.
category <- c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4)
df <- data.frame(attributes, category)
attributes category
1 apple-water-orange 1
2 apple-water 1
3 apple-orange 1
4 coffee 2
5 coffee-croissant 2
6 green-red-yellow 3
7 green-red-blue 3
8 green-red 3
9 black-white 4
10 black-white-purple 4
It is clustering in the broader sense, but I think most clustering methods are for numeric data only, and one-hot encoding has a lot of disadvantages (that's what I read on the internet).
Does anyone have an idea how to do this task? Maybe some word-matching approaches?
It would be also great if I could adjust degree of similarity (rough vs. decent "clustering") based on a parameter.
Thanks in advance for any idea!
So I have whipped up two possibilities. Option 1 uses "one-hot encoding", which is simple and straightforward so long as, for example, apple/apples are as different from each other as apple/orange. I use the Jaccard index as the distance metric because it does reasonably well with overlapping sets. Option 2 uses a local sequence alignment algorithm and should be quite robust against things like apple/apples vs. apple/orange; it also has more tuning parameters, which could take time to optimize for your problem.
library(reshape2)
library(proxy)
library(gplots)   # for heatmap.2()

attributes <- c("apple-water-orange", "apple-water", "apple-orange", "coffee",
                "coffee-croissant", "green-red-yellow", "green-red-blue",
                "green-red", "black-white", "black-white-purple")

dat <- data.frame(attr = attributes, row.names = paste("id", seq_along(attributes), sep = ""))

attributesList <- strsplit(attributes, "-")
df <- data.frame(id = paste("id", rep(seq_along(attributesList), sapply(attributesList, length)), sep = ""),
                 word = unlist(attributesList))
df.wide <- dcast(data = df, word ~ id, length)
rownames(df.wide) <- df.wide[, 1]
df.wide <- as.matrix(df.wide[, -1])

df.dist <- dist(t(df.wide), method = "jaccard")
plot(hclust(df.dist))
abline(h = c(0.6, 0.8))
heatmap.2(df.wide, trace = "none", col = rev(heat.colors(15)))

res <- merge(dat, data.frame(cat1 = cutree(hclust(df.dist), h = 0.8)), by = "row.names")
res <- merge(res, data.frame(cat2 = cutree(hclust(df.dist), h = 0.6)), by.y = "row.names", by.x = "Row.names")
res
You'll see you can control the granularity of the categorization by adjusting where you cut the dendrogram.
Here is a method using the "Smith-Waterman" (local) alignment.
Biostrings is part of the Bioconductor project. The SW algorithm finds the optimal local (non-end-to-end) alignment of two sequences (strings). In this case you can again use cutree to set your categories but you can also tune the scoring function to suit your needs.
library(Biostrings)

strList <- lapply(attributes, BString)
swDist <- matrix(apply(expand.grid(seq_along(strList), seq_along(strList)), 1, function(x) {
  pairwiseAlignment(strList[[x[1]]], strList[[x[2]]], type = "local")@score
}), nrow = 10)

heatmap.2(swDist, trace = "none", col = rev(heat.colors(15)),
          labRow = paste("id", 1:10, sep = ""), labCol = paste("id", 1:10, sep = ""))
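To turn those similarity scores into categories, a hedged continuation (mine, not in the original answer) is to rescale the scores into a distance and cut the tree as before:
# Higher alignment score = more similar, so convert to a distance in [0, 1]
swD <- as.dist(1 - swDist / max(swDist))
hc  <- hclust(swD)
plot(hc)
cutree(hc, k = 4)   # or cut by height, as in the Jaccard example above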

How to use R to check data consistency (make sure no contradiction between case and value)?

Let's say I have:
Person  Movie    Rating
Sally   Titanic  4
Bill    Titanic  4
Rob     Titanic  4
Sue     Cars     8
Alex    Cars     **9**
Bob     Cars     8
As you can see, there is a contradiction for Alex. All entries for the same movie should have the same rating, but there was a data entry error for Alex. How can I use R to solve this? I've been thinking about it for a while, but I can't figure it out. Do I have to just do it manually in Excel or something? Is there a command in R that will return all the cases where there are data contradictions between two columns?
Perhaps I could have R do a Boolean check of whether all the ratings for a movie match the rating of its first occurrence? For all that return "no", I could go look at them manually. How would I write this function?
Thanks
Here's a data.table solution
Define the function
Myfunc <- function(x) {
  temp <- table(x)
  names(temp)[which.max(temp)]
}
library(data.table)
Create a column with the correct rating (by reference)
setDT(df)[, CorrectRating := Myfunc(Rating), Movie][]
# Person Movie Rating CorrectRating
# 1: Sally Titanic 4 4
# 2: Bill Titanic 4 4
# 3: Rob Titanic 4 4
# 4: Sue Cars 8 8
# 5: Alex Cars 9 8
# 6: Bob Cars 8 8
Or If you want to remove the "bad" ratings
df[Rating == CorrectRating][]
# Person Movie Rating CorrectRating
# 1: Sally Titanic 4 4
# 2: Bill Titanic 4 4
# 3: Rob Titanic 4 4
# 4: Sue Cars 8 8
# 5: Bob Cars 8 8
It looks like, within each group defined by "Movie", you're looking for any instances of Rating that are not the same as the most common value.
You can solve this using dplyr (which is good at "group by one column, then perform an operation within each group"), along with the "Mode" function defined in this answer, which finds the most common item in a vector:
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
library(dplyr)
dat %>% group_by(Movie) %>% filter(Rating != Mode(Rating))
This finds all the cases where a row does not agree with the rest of the group. If you instead want to remove them, you can do:
newdat <- dat %>% group_by(Movie) %>% filter(Rating == Mode(Rating))
If you want to fix them, do
newdat <- dat %>% group_by(Movie) %>% mutate(Rating = Mode(Rating))
You can test the above with a reproducible version of your data:
dat <- data.frame(Person = c("Sally", "Bill", "Rob", "Sue", "Alex", "Bob"),
Movie = rep(c("Titanic", "Cars"), each = 3),
Rating = c(4, 4, 4, 8, 9, 8))
If the goal is to see whether all the values within a group are the same (or whether there are differences), this can be a simple application of tapply (or aggregate, etc.) with a function like var (or one that computes the range). If all the values are the same, the variance and the range will be 0; any other value (outside of rounding error) means there must be a value that is different. The which function can help identify the group/individual.
tapply(dat$Rating, dat$Movie, FUN=var)
which(.Last.value > 0.00001)
tapply(dat$Rating, dat$Movie, FUN=function(x)diff(range(x)))
which(.Last.value != 0)
which( abs(dat$Rating - ave(dat$Rating, dat$Movie)) > 0)
which.max( abs(dat$Rating - ave(dat$Rating, dat$Movie)) )
dat[.Last.value,]
I would add a variable for the mode so I can see if there is anything weird going on with the data, like missing data, text, or many different answers instead of the rare anomaly, etc. I used "x" as your dataset.
# One of many functions to find the mode; you could use any other
modefunc <- function(x) {
  names(table(x))[table(x) == max(table(x))]
}

# Add a variable for the mode, split by Movie
x$mode <- ave(x = x$Rating, x$Movie, FUN = modefunc)

# Do whatever you want with the records that are different
x[x$Rating != x$mode, ]
If you prefer, you can substitute any other function that computes the mode.

Perform multiple summary functions and return a dataframe

I have a data set that includes a whole bunch of data about students, including their current school, zipcode of former residence, and a score:
students <- read.table(text = "zip school score
43050 'Hunter' 202.72974236
48227 'NYU' 338.49571519
48227 'NYU' 223.48658339
32566 'CCNY' 310.40666224
78596 'Columbia' 821.59318662
78045 'Columbia' 853.09842034
60651 'Lang' 277.48624384
32566 'Lang' 315.49753763
32566 'Lang' 80.296556533
94941 'LIU' 373.53839238
",header = TRUE,sep = "")
I want a heap of summary data about it, per school: how many students from each school are in the data set, how many unique zip codes per school, and the average and cumulative score. I know I can get this by using tapply to create a bunch of tmp frames:
tmp.mean <- data.frame(tapply(students$score, students$school, mean))
tmp.sum <- data.frame(tapply(students$score, students$school, sum))
tmp.unique.zip <- data.frame(tapply(students$zip, students$school, function(x) length(unique(x))))
tmp.count <- data.frame(tapply(students$zip, students$school, function(x) length(x)))
Giving them better column names:
colnames(tmp.unique.zip) <- c("Unique zips")
colnames(tmp.count) <- c("Count")
colnames(tmp.mean) <- c("Mean Score")
colnames(tmp.sum) <- c("Total Score")
And using cbind to tie them all back together again:
school.stats <- cbind(tmp.mean, tmp.sum, tmp.unique.zip, tmp.count)
I think the cleaner way to do this is:
library(plyr)
school.stats <- ddply(students, .(school), summarise,
                      record.count = length(score),
                      unique.r.zips = length(unique(zip)),
                      mean.dist = mean(score),
                      total.dist = sum(score)
)
The resulting data looks about the same (actually, the ddply approach is cleaner and includes the schools as a column instead of as row names). Two questions: is there a better way to find out how many records are associated with each school? And am I using ddply efficiently here? I'm new to it.
If performance is an issue, you can also use data.table
require(data.table)
tab_s <- data.table(students)
setkey(tab_s, school)
tab_s[, list(total = sum(score),
             avg = mean(score),
             unique.zips = length(unique(zip)),
             records = length(score)),
      by = "school"]
     school     total      avg unique.zips records
1:   Hunter  202.7297 202.7297           1       1
2:      NYU  561.9823 280.9911           1       2
3:     CCNY  310.4067 310.4067           1       1
4: Columbia 1674.6916 837.3458           2       2
5:     Lang  673.2803 224.4268           2       3
6:      LIU  373.5384 373.5384           1       1
Comments seem to be in general agreement: this looks good.
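For comparison (my addition, not from the thread), the same summary in dplyr, which has largely superseded plyr::ddply:
library(dplyr)

students %>%
  group_by(school) %>%
  summarise(record.count  = n(),
            unique.r.zips = n_distinct(zip),
            mean.score    = mean(score),
            total.score   = sum(score))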
