I have the following data:
attributes <- c("apple-water-orange", "apple-water", "apple-orange", "coffee", "coffee-croissant", "green-red-yellow", "green-red-blue", "green-red","black-white","black-white-purple")
1 apple-water-orange
2 apple-water
3 apple-orange
4 coffee
5 coffee-croissant
6 green-red-yellow
7 green-red-blue
8 green-red
9 black-white
10 black-white-purple
What I want is another column, that assigns a category to each row, based on observation similarity.
category <- c(1,1,1,2,2,3,3,3,4,4)
df <- as.data.frame(cbind(attributes, category))
attributes category
1 apple-water-orange 1
2 apple-water 1
3 apple-orange 1
4 coffee 2
5 coffee-croissant 2
6 green-red-yellow 3
7 green-red-blue 3
8 green-red 3
9 black-white 4
10 black-white-purple 4
It is clustering in the broader sense, but I think most clustering methods are for numeric data only, and one-hot encoding has a lot of disadvantages (that's what I read on the internet).
Does anyone have an idea how to do this task? Maybe some word-matching approaches?
It would be also great if I could adjust degree of similarity (rough vs. decent "clustering") based on a parameter.
Thanks in advance for any idea!
I have put together two possibilities. Option 1 uses one-hot encoding, which is simple and straightforward as long as you are happy to treat apple/apples as being just as different as apple/orange. I use the Jaccard index as the distance metric because it does reasonably well with overlapping sets. Option 2 uses a local sequence alignment algorithm and should be quite robust against things like apple/apples vs. apple/orange; it also has more tuning parameters, which could take time to optimize for your problem.
# The library calls were missing from the post; these three are assumed
library(reshape2)  # dcast
library(proxy)     # dist() with method="jaccard" (base dist() has no such method)
library(gplots)    # heatmap.2
attributes <- c("apple-water-orange", "apple-water", "apple-orange", "coffee",
                "coffee-croissant", "green-red-yellow", "green-red-blue",
                "green-red", "black-white", "black-white-purple")
dat <- data.frame(attr=attributes, row.names = paste("id", seq_along(attributes), sep=""))
attributesList <- strsplit(attributes, "-")
# long format: one row per (id, word) pair
df <- data.frame(id=paste("id", rep(seq_along(attributesList), sapply(attributesList, length)), sep=""),
                 word=unlist(attributesList))
# wide 0/1 matrix: words in rows, ids in columns
df.wide <- dcast(data=df, word ~ id, length)
rownames(df.wide) <- df.wide[, 1]
df.wide <- as.matrix(df.wide[, -1])
df.dist <- dist(t(df.wide), method="jaccard")
plot(hclust(df.dist))
abline(h=c(0.6, 0.8))
heatmap.2(df.wide, trace="none", col=rev(heat.colors(15)))
res <- merge(dat, data.frame(cat1=cutree(hclust(df.dist), h=0.8)), by="row.names")
res <- merge(res, data.frame(cat2=cutree(hclust(df.dist), h=0.6)), by.y="row.names", by.x="Row.names")
You'll see you can control the granularity of the categorization by adjusting where you cut the dendrogram.
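For anyone who wants to try the idea without installing extra packages, here is a minimal base R sketch of the same approach. It assumes that base dist() with method = "binary" is an acceptable stand-in for the Jaccard distance (on 0/1 data the two coincide); the names onehot, coarse and fine are only illustrative.

```r
attributes <- c("apple-water-orange", "apple-water", "apple-orange", "coffee",
                "coffee-croissant", "green-red-yellow", "green-red-blue",
                "green-red", "black-white", "black-white-purple")

# One-hot encode: one row per string, one column per distinct word
tokens <- strsplit(attributes, "-")
vocab  <- sort(unique(unlist(tokens)))
onehot <- t(sapply(tokens, function(tk) as.integer(vocab %in% tk)))
dimnames(onehot) <- list(attributes, vocab)

# "binary" distance on 0/1 rows is the Jaccard distance
d  <- dist(onehot, method = "binary")
hc <- hclust(d)  # complete linkage by default

# The cut height is the tuning knob: higher = rougher grouping
coarse <- cutree(hc, h = 0.8)
fine   <- cutree(hc, h = 0.6)
data.frame(attributes, coarse, fine, row.names = NULL)
```

With this data the cut at 0.8 reproduces the four categories from the question, while the cut at 0.6 splits the apple group further.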
Here is a method using the Smith-Waterman (local) alignment algorithm.
Biostrings is part of the Bioconductor project. The SW algorithm finds the optimal local (non-end-to-end) alignment of two sequences (strings). In this case you can again use cutree to set your categories, but you can also tune the scoring function to suit your needs. Note that the matrix below holds alignment scores, so higher values mean more similar strings.
library(Biostrings)  # in recent Bioconductor releases pairwiseAlignment() may live in the pwalign package instead
library(gplots)      # heatmap.2
strList <- lapply(attributes, BString)
# score every pair of strings; the score is stored in the "score" slot of the alignment
swDist <- matrix(apply(expand.grid(seq_along(strList), seq_along(strList)), 1, function(x) {
  pairwiseAlignment(strList[[x[1]]], strList[[x[2]]], type="local")@score
}), nrow = 10)
heatmap.2(swDist, trace="none", col = rev(heat.colors(15)),
          labRow = paste("id", 1:10, sep=""), labCol = paste("id", 1:10, sep=""))
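If Bioconductor is not an option, a rough base R variant of the same idea is sketched below. It swaps Smith-Waterman for plain Levenshtein distance via adist(), which is a different and cruder measure, and the choice of average linkage and k = 4 is an assumption made purely for illustration.

```r
attributes <- c("apple-water-orange", "apple-water", "apple-orange", "coffee",
                "coffee-croissant", "green-red-yellow", "green-red-blue",
                "green-red", "black-white", "black-white-purple")

# Pairwise Levenshtein distances between the raw strings
lev <- adist(attributes)
dimnames(lev) <- list(attributes, attributes)

# Hierarchical clustering on the edit distances
hc <- hclust(as.dist(lev), method = "average")

# Ask for a fixed number of groups instead of a cut height
groups <- cutree(hc, k = 4)
split(attributes, groups)
```

Whether the groups match the intended categories depends on the strings, so it is worth inspecting plot(hc) before settling on k or a cut height.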
Let's say I have:
Person Movie Rating
Sally Titanic 4
Bill Titanic 4
Rob Titanic 4
Sue Cars 8
Alex Cars **9**
Bob Cars 8
As you can see, there is a contradiction for Alex. Every row for the same movie should have the same rating, but there was a data entry error for Alex. How can I use R to solve this? I've been thinking about it for a while, but I can't figure it out. Do I have to just do it manually in Excel or something? Is there a command in R that will return all the cases where there are data contradictions between two columns?
Perhaps I could have R do a boolean check if all the Movie cases match the first rating of its first iteration? For all that returns "no," I can go look at it manually? How would I write this function?
Here's a data.table solution
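Define the function
library(data.table)
Myfunc <- function(x) {
  temp <- table(x)
  as.numeric(names(temp)[which.max(temp)])  # most frequent value (assumed completion of the truncated body)
}
Create a column with the correct rating (by reference)
setDT(df)[, CorrectRating := Myfunc(Rating), Movie][]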
# Person Movie Rating CorrectRating
# 1: Sally Titanic 4 4
# 2: Bill Titanic 4 4
# 3: Rob Titanic 4 4
# 4: Sue Cars 8 8
# 5: Alex Cars 9 8
# 6: Bob Cars 8 8
Or, if you want to remove the "bad" ratings
df[Rating == CorrectRating][]
# Person Movie Rating CorrectRating
# 1: Sally Titanic 4 4
# 2: Bill Titanic 4 4
# 3: Rob Titanic 4 4
# 4: Sue Cars 8 8
# 5: Bob Cars 8 8
It looks like, within each group defined by "Movie", you're looking for any instances of Rating that are not the same as the most common value.
You can solve this using dplyr (which is good at "group by one column, then perform an operation within each group"), along with the "Mode" function defined in this answer that finds the most common item in a vector:
library(dplyr)
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
dat %>% group_by(Movie) %>% filter(Rating != Mode(Rating))
This finds all the cases where a row does not agree with the rest of the group. If you instead want to remove them, you can do:
newdat <- dat %>% group_by(Movie) %>% filter(Rating == Mode(Rating))
If you want to fix them, do
newdat <- dat %>% group_by(Movie) %>% mutate(Rating = Mode(Rating))
You can test the above with a reproducible version of your data:
dat <- data.frame(Person = c("Sally", "Bill", "Rob", "Sue", "Alex", "Bob"),
Movie = rep(c("Titanic", "Cars"), each = 3),
Rating = c(4, 4, 4, 8, 9, 8))
If the goal is to see if all the values within a group are the same (or if there are some differences) then this can be a simple application of tapply (or aggregate, etc.) used with a function like var (or compute the range). If all the values are the same then the variance and range will be 0. If it is any other value (outside of rounding error) then there must be a value that is different. The which function can help identify the group/individual.
tapply(dat$Rating, dat$Movie, FUN=var)
which(.Last.value > 0.00001)
tapply(dat$Rating, dat$Movie, FUN=function(x)diff(range(x)))
which(.Last.value != 0)
which( abs(dat$Rating - ave(dat$Rating, dat$Movie)) > 0)
which.max( abs(dat$Rating - ave(dat$Rating, dat$Movie)) )
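A variation on the same tapply idea that avoids any rounding tolerance is to count the distinct values per group. This is only a sketch; n_distinct and inconsistent are illustrative names, and dat is the reproducible data frame from the earlier answer.

```r
dat <- data.frame(Person = c("Sally", "Bill", "Rob", "Sue", "Alex", "Bob"),
                  Movie  = rep(c("Titanic", "Cars"), each = 3),
                  Rating = c(4, 4, 4, 8, 9, 8))

# Number of distinct ratings per movie; anything above 1 is a contradiction
n_distinct   <- tapply(dat$Rating, dat$Movie, function(x) length(unique(x)))
inconsistent <- names(n_distinct)[n_distinct > 1]

# Rows belonging to movies that need a manual look
dat[dat$Movie %in% inconsistent, ]
```

This works for non-numeric columns too, where var and range would not apply.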
I would add a variable for the mode so I can see whether anything odd is going on with the data, such as missing values, text, or many different answers rather than the occasional anomaly. I use "x" as your dataset.
# one of many functions to find mode, could use any other
modefunc <- function(x){
  # body assumed (the original was truncated): most frequent value, first one wins ties
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
# add variable for mode split by Movie
x$mode <- ave(x = x$Rating,x$Movie,FUN = modefunc)
# do whatever you want with the records that are different
x[x$Rating != x$mode, ]
If you want a different definition of the mode, any other mode function can be swapped in.
I've been working on a way to join two datasets based on an imperfect string, such as the name of a company. In the past I had to match two very dirty lists: one list had names and financial information, the other had names and addresses. Neither had unique IDs to match on! ASSUME THAT CLEANING HAS ALREADY BEEN APPLIED AND THERE MAY BE TYPOS AND INSERTIONS.
So far agrep is the closest tool I've found that might work. It uses a generalized Levenshtein distance, which counts the deletions, insertions and substitutions needed to turn one string into another, and returns the strings that match within the allowed distance.
However, I've been having trouble going from matching a single value to applying this to an entire data frame. I've crudely used a for loop to repeat the agrep call, but there's gotta be an easier way.
See the following code:
a<-data.frame(name=c('Ace Co','Bayes', 'asd', 'Bcy', 'Baes', 'Bays'),price=c(10,13,2,1,15,1))
b<-data.frame(name=c('Ace Co.','Bayes Inc.','asdf'),qty=c(9,99,10))
for (i in 1:6){
  a$x[i] = agrep(a$name[i], b$name, value = TRUE, max = list(del = 0.2, ins = 0.3, sub = 0.4))
  a$Y[i] = agrep(a$name[i], b$name, value = FALSE, max = list(del = 0.2, ins = 0.3, sub = 0.4))
}
Here is a solution using the fuzzyjoin package. It uses dplyr-like syntax and stringdist as one of the possible types of fuzzy matching.
As suggested by @C8H10N4O2, the stringdist method="jw" creates the best matches for your example.
As suggested by @dgrtwo, the developer of fuzzyjoin, I used a large max_dist and then used dplyr::group_by and dplyr::slice_min to get only the best match with minimum distance. (slice_min replaces the older top_n; if the original order is important and not alphabetical, use mutate(rank = row_number(dist)) %>% filter(rank == 1).)
a <- data.frame(name = c('Ace Co', 'Bayes', 'asd', 'Bcy', 'Baes', 'Bays'),
price = c(10, 13, 2, 1, 15, 1))
b <- data.frame(name = c('Ace Co.', 'Bayes Inc.', 'asdf'),
qty = c(9, 99, 10))
library(fuzzyjoin); library(dplyr);
stringdist_join(a, b,
by = "name",
mode = "left",
ignore_case = FALSE,
method = "jw",
max_dist = 99,
distance_col = "dist") %>%
group_by(name.x) %>%
slice_min(order_by = dist, n = 1)
#> # A tibble: 6 x 5
#> # Groups: name.x [6]
#> name.x price name.y qty dist
#> <fctr> <dbl> <fctr> <dbl> <dbl>
#> 1 Ace Co 10 Ace Co. 9 0.04761905
#> 2 Bayes 13 Bayes Inc. 99 0.16666667
#> 3 asd 2 asdf 10 0.08333333
#> 4 Bcy 1 Bayes Inc. 99 0.37777778
#> 5 Baes 15 Bayes Inc. 99 0.20000000
#> 6 Bays 1 Bayes Inc. 99 0.20000000
The solution depends on the desired cardinality of your matching a to b. If it's one-to-one, you will get the three closest matches above. If it's many-to-one, you will get six.
One-to-one case (requires assignment algorithm):
When I've had to do this before I treat it as an assignment problem with a distance matrix and an assignment heuristic (greedy assignment used below). If you want an "optimal" solution you'd be better off with optim.
Not familiar with AGREP but here's example using stringdist for your distance matrix.
library(stringdist)
d <- expand.grid(a$name,b$name) # Distance matrix in long form
names(d) <- c("a_name","b_name")
d$dist <- stringdist(d$a_name,d$b_name, method="jw") # String edit distance (use your favorite function here)
# Greedy assignment heuristic (Your favorite heuristic here)
greedyAssign <- function(a,b,d){
  x <- numeric(length(a)) # assgn variable: 0 for unassigned but assignable,
  # 1 for already assigned, -1 for unassigned and unassignable
  # the loop and the return value below are reconstructed; the post was truncated here
  while(any(x==0)){
    min_d <- min(d[x==0]) # identify closest pair, arbitrarily selecting 1st if multiple pairs
    a_sel <- a[d==min_d & x==0][1]
    b_sel <- b[d==min_d & a == a_sel & x==0][1]
    x[a==a_sel & b == b_sel] <- 1
    x[x==0 & (a==a_sel|b==b_sel)] <- -1
  }
  cbind(a=a[x==1],b=b[x==1],d=d[x==1])
}
data.frame(greedyAssign(as.character(d$a_name),as.character(d$b_name),d$dist))
Produces the assignment:
a b d
1 Ace Co Ace Co. 0.04762
2 Bayes Bayes Inc. 0.16667
3 asd asdf 0.08333
I'm sure there's a much more elegant way to do the greedy assignment heuristic, but the above works for me.
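For what it's worth, the same greedy idea can be written against a plain distance matrix, which some may find easier to follow. This is a sketch under a few assumptions: greedy_match is a made-up name, the toy matrix is invented for illustration, and the input is assumed to contain no NA values.

```r
# Greedy one-to-one matching on a distance matrix (rows = a, columns = b)
greedy_match <- function(dm) {
  picked <- list()
  while (any(is.finite(dm))) {
    # closest remaining pair; the first one wins on ties
    idx <- which(dm == min(dm), arr.ind = TRUE)[1, ]
    i <- idx[["row"]]
    j <- idx[["col"]]
    picked[[length(picked) + 1]] <- data.frame(row = i, col = j, dist = dm[i, j])
    # block the chosen row and column so neither is used again
    dm[i, ] <- Inf
    dm[, j] <- Inf
  }
  do.call(rbind, picked)
}

# Toy distances: 4 candidates in "a", 2 targets in "b"
dm <- rbind(c(0.10, 0.90),
            c(0.20, 0.30),
            c(0.80, 0.25),
            c(0.70, 0.60))
res <- greedy_match(dm)
res
```

Here row 1 takes column 1 (0.10) and row 3 takes column 2 (0.25); rows 2 and 4 stay unmatched because both targets are used up. Like any greedy heuristic, this is not guaranteed to minimise the total distance.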
Many-to-one case (not an assignment problem):
do.call(rbind, unname(by(d, d$a_name, function(x) x[x$dist == min(x$dist),])))
Produces the result:
a_name b_name dist
1 Ace Co Ace Co. 0.04762
11 Baes Bayes Inc. 0.20000
8 Bayes Bayes Inc. 0.16667
12 Bays Bayes Inc. 0.20000
10 Bcy Bayes Inc. 0.37778
15 asd asdf 0.08333
Edit: use method="jw" to produce desired results. See help("stringdist-package")
I am not sure if this is a useful direction for you, John Andrews, but it gives you another tool (from the RecordLinkage package) and might help.
install.packages("~/RecordLinkage_0.4-1.tar.gz", repos = NULL, type = "source")
require(RecordLinkage) # it is not on CRAN so you must load source from Github, and there are 7 dependent packages, as per above
compareJW <- function(string, vec, cutoff) {
  jarowinkler(string, vec) > cutoff
}
a<-data.frame(name=c('Ace Co','Bayes', 'asd', 'Bcy', 'Baes', 'Bays'),price=c(10,13,2,1,15,1))
b<-data.frame(name=c('Ace Co.','Bayes Inc.','asdf'),qty=c(9,99,10))
a$name <- as.character(a$name)
b$name <- as.character(b$name)
test <- compareJW(string = a$name, vec = b$name, cutoff = 0.8) # pick your level of cutoff, of course
data.frame(name = a$name, price = a$price, test = test)
> data.frame(name = a$name, price = a$price, test = test)
name price test
1 Ace Co 10 TRUE
2 Bayes 13 TRUE
3 asd 2 TRUE
4 Bcy 1 FALSE
5 Baes 15 TRUE
6 Bays 1 FALSE
Fuzzy Matching
Approximate String Matching is approximately matching one string to another. e.g. banana and bananas.
Fuzzy Matching is finding an approximate pattern in a string. e.g. banana within bananas in pyjamas.
R Implementation
b$name <- lapply(b$name, agrep, a$name, value=TRUE); merge(a,b)
fuzzyjoin::stringdist_join(a, b, mode='full', by=c('name'), method='lv')
Fuzzy Match
agrep2 <- function(pattern, x) x[which.min(adist(pattern, x, partial=TRUE))]; b$name <- lapply(b$name, agrep2, a$name); merge(a, b)
Run yourself
# Data
a <- data.frame(name=c('Ace Co.', 'Bayes Inc.', 'asdf'), qty=c(9,99,10))
b <- data.frame(name=c('Ace Company', 'Bayes', 'asd', 'Bcy', 'Baes', 'Bays'), price=c(10,13,2,1,15,1))
# Basic
c <- b
c$name.b <- c$name
c$name <- lapply(c$name, agrep, a$name, value=TRUE)
merge(a, c, all.x=TRUE)
# Advanced
fuzzyjoin::stringdist_join(a, b, mode='full')
# Fuzzy Match
c <- b
c$name.b <- c$name
c$name <- lapply(c$name, function(pattern, x) x[which.min(adist(pattern, x, partial=TRUE))], a$name)
merge(a, c)
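The fuzzy-match step can also be done in one vectorised call, because adist() accepts two vectors and returns the whole distance matrix. The sketch below uses the data from the question (so a holds the names to look up and b is the reference list, the reverse of the naming just above); dm, best and matched are illustrative names.

```r
a <- data.frame(name = c("Ace Co", "Bayes", "asd", "Bcy", "Baes", "Bays"),
                price = c(10, 13, 2, 1, 15, 1), stringsAsFactors = FALSE)
b <- data.frame(name = c("Ace Co.", "Bayes Inc.", "asdf"),
                qty = c(9, 99, 10), stringsAsFactors = FALSE)

# Rows = names in a, columns = names in b.
# partial = TRUE scores each a-name against the best-matching substring of each b-name
dm <- adist(a$name, b$name, partial = TRUE)

# Index of the closest b-name for every a-name (first one wins on ties)
best <- apply(dm, 1, which.min)

matched <- data.frame(name_a = a$name, price = a$price,
                      name_b = b$name[best], qty = b$qty[best],
                      dist = dm[cbind(seq_along(best), best)])
matched
```

Keeping the dist column makes it easy to filter out weak matches afterwards, for example with matched[matched$dist <= 1, ].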
I agree with the answer above that uses stringdist for the distance matrix, but I would add a signature function like the one below, taken from "Merging Data Sets Based on Partially Matched Data Elements". Matching on signatures should be more accurate, because the Levenshtein distance depends on character positions, insertions and deletions, and so is sensitive to differences such as word order.
##Here's where the algorithm starts...
##I'm going to generate a signature from country names to reduce some of the minor differences between strings
##In this case, convert all characters to lower case, sort the words alphabetically, and then concatenate them with no spaces.
##So for example, United Kingdom would become kingdomunited
##We might also remove stopwords such as 'the' and 'of'.
signature=function(x){
  sig=paste(sort(unlist(strsplit(tolower(x)," "))),collapse='')
  return(sig)
}
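To show what the signature buys you, here is a small vectorised sketch. make_signature is a hypothetical helper name, not something from the linked post, and it only handles single spaces between words.

```r
# Lower-case, split on spaces, sort the words, and glue them back together
make_signature <- function(x) {
  vapply(strsplit(tolower(x), " ", fixed = TRUE),
         function(words) paste(sort(words), collapse = ""),
         character(1))
}

sigs <- make_signature(c("United Kingdom", "Kingdom United", "united kingdom"))
sigs

# All three variants collapse to the same key, so they can be matched exactly
length(unique(sigs)) == 1
```

The signatures can then be fed to stringdist or adist instead of the raw names, so that word order and case no longer inflate the distance.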
I use lapply for those circumstances:
yournewvector <- lapply(yourvector$yourvariable, agrep, yourothervector$yourothervariable, max.distance=0.01)
then to write it as a csv it's not so straightforward:
write.csv(matrix(yournewvector, ncol=1), file="yournewvector.csv", row.names=FALSE)
Here is what I used for getting number of times a company appears in a list though the company names are inexact matches,
step.1 Install phonics Package
step.2 create a new column called "soundexcodes" in "mylistofcompanynames"
step.3 Use soundex function to return soundex codes of the company names in "soundexcodes"
step.4 Copy the company names AND corresponding soundex code into a new file (2 columns called "companynames" and "soundexcode") called "companysoundexcodestrainingfile"
step.5 Remove duplicates of soundexcodes in "companysoundexcodestrainingfile"
step.6 Go through the list of remaining company names and change the names as you want it to appear in your original company
Amazon Inc A625 can be Amazon A625
Accenture Limited A455 can be Accenture A455
step.7 Perform a left_join (or a simple vlookup) between companysoundexcodestrainingfile$soundexcodes and mylistofcompanynames$soundexcodes by "soundexcodes"
step.8 The result should have the original list with a new column called "co.y", which has the name of the company the way you left it in the training file.
step.9 Sort "co.y" and check whether most of the company names are matched correctly; if so, replace the old company names with the new ones given by the vlookup of the soundex code.
In the graph below,
Is it possible to create same graph with less lines of codes? I mean, since each Figs. A-D has different label settings, I have to write settings for each Fig. which makes it longer.
The graph below is produced with the data in pdf device.
Any help with these issues is highly appreciated.(Newbie to R!). Since all the code is too long to post here, I have posted a part relevant to the problem here for Fig.C
axis(2,tick=T,at=seq(0.0,2.4,by=0.6),label= seq(0.0,2.4,by=0.6))
axis(3,xlim=c(0,24),tick=TRUE,at= seq(0,24,by=6),label=seq(0,24,by=6),col.axis="violetred4",col="violetred4")
axis(4,tick=TRUE,at= label1,label=label1,col.axis="violetred4",col="violetred4")
You ask several questions in the same post, so I will try to answer just one: how to simplify your code, or rather how to call it once for each letter. I think it is better to put your data in the long format. For example, this will create a list of 4 elements:
ll <- lapply(LETTERS[1:4],function(let){
dat.let <- dat[,grepl(let,colnames(dat))]
dd <- reshape(dat.let,direction ='long',
dd$time <- factor(dd$time)
dd$Type <- let
ll is a list of 4 data frames, each of which looks like:
time TimeVariable Variable id Type
1.1 1 0 0 1 A
2.1 1 0 5 2 A
3.1 1 8 110 3 A
4.1 1 16 0 4 A
5.1 1 NA NA 5 A
6.1 1 NA NA 6 A
Then you can use it like this for example :
layout(matrix(1:4, 2, 2, byrow = TRUE))
Type <- unique(dat$Type)
dat <- subset(data,time==2) <- max(dat$Variable,na.rm=TRUE)
dat <- subset(data,time==2)
axis(2,tick=T,at=seq(0.0,2.4,by=0.6),label= seq(0.0,2.4,by=0.6))
dat <- subset(data,time==1)
axis(3,xlim=c(0,24),tick=TRUE,at= seq(0,24,by=6),label=seq(0,24,by=6),col.axis="violetred4",col="violetred4")
axis(4,tick=TRUE,at= label1,label=label1,col.axis="violetred4",col="violetred4")
Another advantage of the long data format is that you can use ggplot2 and facet_wrap, for example.
## transform your data to a data.frame
dat.l <- do.call(rbind, ll)
ggplot(subset(dat.l,time !=1)) +
geom_polygon(data=subset(dat.l,time ==1),
geom_line(data=subset(dat.l,time ==1),
I have a dataframe of the basic form:
> head(
NAC cOF3 APir Pu Tu V2.3 mOF3 DGpf
1 6.314770 6.181188 6.708971 6.052134 6.546938 6.079848 6.640716 6.263770
2 8.825595 8.740217 9.532026 8.919598 8.776969 8.843287 8.631505 9.053732
3 5.518933 5.982044 5.632379 5.712680 5.655525 5.580141 5.750969 6.119935
4 6.063098 6.700194 6.255736 5.124315 6.133631 5.891009 6.070467 6.062815
5 8.931570 9.048621 9.258875 8.681762 8.680993 9.040971 8.785271 9.122226
6 5.694149 5.356218 5.608698 5.894171 5.629965 5.759247 5.929289 6.092337
I would like to perform t-tests of each column versus all other columns and save the subsequent p-values to a variable in some variation of the following:
#run tests
test.result = mapply(t.test, one.column, other.columns)
#store p-values
p.values = stack(mapply(function(x, y)
+ t.test(x,y)$p.value, one.column, other.columns))
Or would aov() be a better option for such an analysis? In any case, I would like to know how to streamline doing it using t-tests.
Here's one solution:
Read in the data:
dat <- read.table(text='NAC cOF3 APir Pu Tu V2.3 mOF3 DGpf
1 6.314770 6.181188 6.708971 6.052134 6.546938 6.079848 6.640716 6.263770
2 8.825595 8.740217 9.532026 8.919598 8.776969 8.843287 8.631505 9.053732
3 5.518933 5.982044 5.632379 5.712680 5.655525 5.580141 5.750969 6.119935
4 6.063098 6.700194 6.255736 5.124315 6.133631 5.891009 6.070467 6.062815
5 8.931570 9.048621 9.258875 8.681762 8.680993 9.040971 8.785271 9.122226
6 5.694149 5.356218 5.608698 5.894171 5.629965 5.759247 5.929289 6.092337')
Get all possible pairwise combinations:
com <- combn(colnames(dat), 2)
Get the p-values
p <- apply(com, 2, function(x) t.test(dat[,x[1]], dat[,x[2]])$p.val)
Put into a data frame:
data.frame(comparison = paste(com[1,], com[2,], sep = ' vs. '), p.value = p)
An even better solution is to use melt from the reshape package and pairwise.t.test:
with(melt(dat), pairwise.t.test(value, variable, p.adjust.method = 'none'))
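If you would rather not load reshape, base R's stack() produces the same long layout. A minimal sketch, assuming the data frame is read in as above; long and res are illustrative names.

```r
dat <- read.table(text = "NAC cOF3 APir Pu Tu V2.3 mOF3 DGpf
1 6.314770 6.181188 6.708971 6.052134 6.546938 6.079848 6.640716 6.263770
2 8.825595 8.740217 9.532026 8.919598 8.776969 8.843287 8.631505 9.053732
3 5.518933 5.982044 5.632379 5.712680 5.655525 5.580141 5.750969 6.119935
4 6.063098 6.700194 6.255736 5.124315 6.133631 5.891009 6.070467 6.062815
5 8.931570 9.048621 9.258875 8.681762 8.680993 9.040971 8.785271 9.122226
6 5.694149 5.356218 5.608698 5.894171 5.629965 5.759247 5.929289 6.092337",
                  header = TRUE)

# stack() gives two columns: "values" (the numbers) and "ind" (the column name)
long <- stack(dat)

# All pairwise comparisons at once; returns a lower-triangular matrix of p-values
res <- pairwise.t.test(long$values, long$ind, p.adjust.method = "none")
res$p.value
```

Note that pairwise.t.test pools the standard deviation across all groups by default, so its p-values will not be identical to those from separate t.test calls. With 28 comparisons it is also worth considering a correction such as p.adjust.method = "holm".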
If you want to pair just the first with all other columns, you can also use this:
x <- sapply(dat[,-1], function(x) t.test(x, dat[,1])$p.value)
data.frame(variable = names(x), p.value = as.numeric(x))
I'd like to classify the values of a data frame according to two columns. Let's say, I've got the following data frame:
my.df <- data.frame(a=c(1:20), b=c(61:80))
And now I want to subdivide it into 8 areas by dividing the 2D-scatterplot into 4 equal parts and then overlaying a rectangle in the middle that would consist of a quarter of each of the 4 parts. So far I've been using the following tedious way:
ar <- range(my.df$a)
br <- range(my.df$b)
aint <- seq(ar[1], ar[2], by=(ar[2]-ar[1])/4)
bint <- seq(br[1], br[2], by=(br[2]-br[1])/4)
my.df$z <- NA
my.df[which(my.df$a < aint[3] & my.df$b < bint[3]),"z"] <- 1
my.df[which(my.df$a < aint[3] & my.df$b >= bint[3]),"z"] <- 2
my.df[which(my.df$z == 1 & my.df$a >= aint[2] & my.df$b >= bint[2]),"z"] <- 5
I am sure there must be a way to do it in a neater and more general way, i.e. by writing a general function, but I am struggling to write one myself.
Also, I was surprised to see that after all of this, the class of the column z is automatically set to shingle. Why is that? How does R "know" that this is a shingle?
I'd approach it by cutting it into 16 groups first (x and y into 4 groups independently) and then combining them back together into fewer groups.
my.df$a.q <- cut(my.df$a, breaks=4, labels=1:4)
my.df$b.q <- cut(my.df$b, breaks=4, labels=1:4)
my.df$a.b.q <- paste(my.df$a.q, my.df$b.q, sep=".")
my.df$z <- c("1.1"=1, "1.2"=1, "1.3"=2, "1.4"=2,
"2.1"=1, "2.2"=3, "2.3"=4, "2.4"=2,
"3.1"=5, "3.2"=6, "3.3"=7, "3.4"=8,
"4.1"=5, "4.2"=5, "4.3"=8, "4.4"=8)[my.df$a.b.q]
This seems reasonable
plot(my.df$a, my.df$b, col=my.df$z)
With some data with more coverage:
my.df <- data.frame(a=runif(1000, 1, 20), b=runif(1000, 61, 80))
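Since the question asked for something more general, the cut-and-lookup idea can be wrapped in a small function. This is only a sketch: classify_grid and its arguments are made-up names, and the lookup table is the one from the answer above.

```r
# Bin x and y into n equal-width intervals each, then map every grid cell to a zone
classify_grid <- function(x, y, lookup, n = 4) {
  xq <- cut(x, breaks = n, labels = FALSE)  # integer bin index 1..n
  yq <- cut(y, breaks = n, labels = FALSE)
  unname(lookup[paste(xq, yq, sep = ".")])
}

# Cell -> zone table, keyed as "<x bin>.<y bin>"
lookup <- c("1.1"=1, "1.2"=1, "1.3"=2, "1.4"=2,
            "2.1"=1, "2.2"=3, "2.3"=4, "2.4"=2,
            "3.1"=5, "3.2"=6, "3.3"=7, "3.4"=8,
            "4.1"=5, "4.2"=5, "4.3"=8, "4.4"=8)

my.df <- data.frame(a = 1:20, b = 61:80)
my.df$z <- classify_grid(my.df$a, my.df$b, lookup)
table(my.df$z)
```

Changing the layout of the zones then only means editing the lookup vector, and a finer grid only needs a different n together with a matching table.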