Using Jaccard Coefficient for measuring string similarity - similarity

I got a test and a training dataset which should be used for string similarity measurement. Here I have given few lines of the dataset,
Brandon Bass ||| what the hell is Brandon bass thinking ||| Brandon Bass Has 5 Personal Fouls ||| False
Sac ||| Congrats to Sac Kings fans ||| why yall forcing the kings to stay in sac town smh ||| False
Stella ||| hello Stella can you follow me please ||| STELLA DO U HATE ME ||| False
The data file has 50 entries of the form
TOPIC ||| TWEET_SENT_1 ||| TWEET_SENT_2 ||| HAVE_SIMILAR_MEANING
TOPIC – Twitter topic
TWEET_SENT_1 – Tweet sentence 1
TWEET_SENT_2 – Tweet sentence 2
HAVE_SIMILAR_MEANING – a binary label (True – two sentences are similar, false – two sentences are not similar) assigned by a human annotator
We need to divide the data set into two: training set (35 samples) and test set (15 samples) and have to use the training set for parameter tuning of the algorithms. And test with the test set using the best tuned parameter.
If the algorithm is Jaccard Coefficient
how can I perform this task? Can someone please let me know the approach that I can use.

Jaccard similarity is a measure of how two sets (of n-grams in your case) are similar. There is no "tuning" to be done here, except for the threshold at which you decide that two strings are similar or not.
For example if you have 2 strings abcde and abdcde it works as follow :
ngrams (n=2) : 'abcde' & 'abdcde'
ab bc cd de dc bd
A 1 1 1 1 0 0
B 1 0 1 1 1 1
J(A, B) = (A∩B) / (A∪B)
J(A, B) = (3 / 6) = 0.5
There is also the Jaccard distance which captures the dissimilarity between two sets, and is calculated by taking one minus the Jaccard coeeficient (in this case, 1 - 0.5 = 0.5)
So, for you problem, I would use the training set with the labels in order to define the proper threshold for which your strings are considered similar/dissimilar.

Related

Clustering or multiple correspondence analysis for a set of check-all-that-apply questions in R

I have a dataset that contains a set of check-all-that-apply questions. Let's say it goes like this
Q1. "Which sports do you play"
A. Football B. Basketball C. ... etc. etc. etc.
Q2. "Which of the following elements motivated you to play sports?"
A. Competitiveness B. Achievement C. ... etc. etc. etc.
There are 4 goals I want to achieve:
Understand which sports are "close" to each other
Understand which elements are "close" to each other
Understand which sports are associated with which elements
Categorize participants considering their participation preferences in sports (e.g. stamina sports players vs. agility
sports player vs. etc.)
Is it valid to do K-means clustering on a table like this (note: the actual data is much larger; this table is only for demo)? Why and why not?
Is multiple correspondence analysis a better way? Why and why not?
football <- c(1,1,1,1,0,0,0,0)
basketball <- c(1,1,0,0,0,0,0,1)
other <- c(1,0,0,0,0,0,0,0)
df<- data.frame(football, basketball, other)
m = as.matrix(df)
t(m) %*% m / colSums(m)
# football basketball other
# football 1.0000000 0.5 0.2500000
# basketball 0.6666667 1.0 0.3333333
# other 1.0000000 1.0 1.0000000

Grouping words that are similar

CompanyName <- c('Kraft', 'Kraft Foods', 'Kfraft', 'nestle', 'nestle usa', 'GM', 'general motors', 'the dow chemical company', 'Dow')
I want to get either:
CompanyName2
Kraft
Kraft
Kraft
nestle
nestle
general motors
general motors
Dow
Dow
But would be absolutely fine with:
CompanyName2
1
1
1
2
2
3
3
I see algorithms for getting the distance between two words, so if I had just one weird name I would compare it to all other names and pick the one with the lowest distance. But I have thousands of names and want to group them all into groups.
I do not know anything about elastic search, but would one of the functions in the elastic package or some other function help me out here?
I'm sorry there's no programming here. I know. But this is way out of my area of normal expertise.
Solution: use string distance
You're on the right track. Here is some R code to get you started:
install.packages("stringdist") # install this package
library("stringdist")
CompanyName <- c('Kraft', 'Kraft Foods', 'Kfraft', 'nestle', 'nestle usa', 'GM', 'general motors', 'the dow chemical company', 'Dow')
CompanyName = tolower(CompanyName) # otherwise case matters too much
# Calculate a string distance matrix; LCS is just one option
?"stringdist-metrics" # see others
sdm = stringdistmatrix(CompanyName, CompanyName, useNames=T, method="lcs")
Let's take a look. These are the calculated distances between strings, using Longest Common Subsequence metric (try others, e.g. cosine, Levenshtein). They all measure, in essence, how many characters the strings have in common. Their pros and cons are beyond this Q&A. You might look into something that gives a higher similarity value to two strings that contain the exact same substring (like dow)
sdm[1:5,1:5]
kraft kraft foods kfraft nestle nestle usa
kraft 0 6 1 9 13
kraft foods 6 0 7 15 15
kfraft 1 7 0 10 14
nestle 9 15 10 0 4
nestle usa 13 15 14 4 0
Some visualization
# Hierarchical clustering
sdm_dist = as.dist(sdm) # convert to a dist object (you essentially already have distances calculated)
plot(hclust(sdm_dist))
If you want to group then explicitly into k groups, use k-medoids.
library("cluster")
clusplot(pam(sdm_dist, 5), color=TRUE, shade=F, labels=2, lines=0)

R - Combinations of a dataframe with constraints

I'm trying to make an R script for fantasy football (proper UK football, not hand egg :-)) where I can input a list of players in a csv and it will spit out every 11-player combination, which meet various constraints.
Here's my sample dataframe:
df <- read.csv("Filename.csv",
header = TRUE)
> print(df)
Name Positon Team Salary
1 Eric Dier D TOT 9300000
2 Erik Pieters D STO 9200000
3 Christian Fuchs D LEI 9100000
4 Héctor Bellerín D ARS 9000000
5 Charlie Daniels D BOU 9000000
6 Ben Davies D TOT 8900000
7 Federico Fernández D SWA 8800000
8 Per Mertesacker D ARS 8800000
9 Alberto Moreno D LIV 8700000
10 Chris Smalling D MUN 8700000
11 Seamus Coleman D EVE 8700000
12 Jan Vertonghen D TOT 8700000
13 Romelu Lukaku F EVE 12700000
14 Harry Kane F TOT 12500000
15 Max Gradel F BOU 11900000
16 Alexis Sánchez F ARS 11300000
17 Jamie Vardy F LEI 11200000
18 Theo Walcott F ARS 10700000
19 Olivier Giroud F ARS 10700000
20 Wilfried Bony F MCI 10000000
21 Kristoffer Nordfeldt G SWA 7000000
22 Joe Hart G MCI 6800000
23 Jack Rose G WBA 6600000
24 Asmir Begovic G CHE 6600000
25 Mesut Özil M ARS 15600000
26 Riyad Mahrez M LEI 15200000
27 Ross Barkley M EVE 13300000
28 Dimitri Payet M WHM 12800000
29 Willian M CHE 12500000
30 Bertrand Traore M CHE 12500000
31 Kevin De Bruyne M MCI 12400000
And the constraints are as follows:
1) The total salary of each 11-player lineup cannot exceed 100,000,000
2) There can only be a maximum of four players from one team. E.g. four player from 'CHE' (Chelsea).
3) There is a limit of how many players within each 11-player lineup can be from each position. There can be:
1 G (goalkeeper), 3 to 4 D (defender), 3 to 5 M (midfielder), 1 to 3 F (forward)
I'd like every 11 player combination that meets the above contraints to be returned. Order is not important (e.g. 1,2,3 is considered the same as 2,1,3 and shouldn't be duplicated) and a player can appear in more than one lineup.
I've done a fair bit of research and played around but can't seem to get anywhere with this. I'm new to R. I don't expect anyone to nail this for me, but if someone could point a newbie like myself in the right direction it would be much appreciated.
Thanks.
This can be solved as linear integer program using the library LPSolve.
This kind of problems are very well solvable -- opposed to what has been written before -- as typical the number of solutions are much smaller than the domain size.
You can add for each Player a zero one variable, whether or not that player is in the team.
The package can be installed using
install.packages("lpSolve")
install.packages("lpSolveAPI")
The documentation can be found at: https://cran.r-project.org/web/packages/lpSolve/lpSolve.pdf
First constraint sum of players 11
The salary is basically a sum of all players variable multiplied by the salary column and so on....
To get a proper solutions you need to specify in
lp.solve(all.bin=TRUE
Such that all variables referring to players are either zero or one.
( I understood that you are trying to learn, that's why I refrain from giving a full solution)
EDIT
As I got down-voted probably because of not giving the full solution. Kind of sad as as the original author explicitly wrote that he doesn't expect a full solution
library(lpSolve)
df <- read.csv("/tmp/football.csv",header = TRUE,sep=";")
f.obj <- rep(1,nrow(df))
f.con <-
matrix(c(f.obj <- rep(1,nrow(df)),
as.vector(df$Salary),
(df$Positon=="G") *1.0,
(df$Positon=="D") *1.0,
(df$Positon=="D") *1.0,
(df$Positon=="M") *1.0,
(df$Positon=="M") *1.0,
(df$Positon=="F") *1.0,
(df$Positon=="F") *1.0),nrow=9,byrow= TRUE)
f.dir <- c("==", "<=","==",">=","<=",">=","<=",">=","<=")
f.rhs<- c(11, #number players
100000000, #salary
1 , #Goalkeeper
3 , # def min
4 , # def max
3 , # mdef min
5, # mdef max
1, # for, min
3 # wor, max
)
solutions <- lp ("max", f.obj, f.con, f.dir, f.rhs,all.bin=TRUE)
I didn't add the Team Constraint as it wouldn't have provided any additionally insights here....
** EDIT2 **
This might come handy if you change your data set
R lpsolve binary find all possible solutions
A brute-force way to tackle this, (which is also beautifully parallelizable and guarantees you all possible combinations) is to calculate all 11-player permutations and then filter out the combinations that don't conform to your limits in a stepwise manner.
To make a program like this fit into your computer's memory, give each player a unique integer ID and create vectors of IDs as team sets. When you then implement your filters your functions can refer to the player info by that ID in a single dataframe.
Say df is your data frame with all player data.
df$id <- 1:nrow(df)
Get all combinations of ids:
# This will take a long time or run out of memory!
# In my 2.8Gz laptop took 466 seconds just for your 31 players
teams <- combn(df$id, 11)
Of course, if your dataframe is big (like hundreds of players) this implementation could take impossibly long to finish. You probably would be better off just sampling 11-sets from your player set without replacement and construct teams in an "on demand" fashion.
A more clever way is to partition your dataset according to player position into - one for goalkeepers, one for defence, etc. And then use the above approach to create permutations of different players from each position and combine the end results. It would take ridiculously less amount of time, it would still be parallelizable and exhaustive (give you all possible combinations).

CART Methodology for data with mutually exhaustive rows

I am trying to use CART to analyse a data set whose each row is a segment, for example
Segment_ID | Attribute_1 | Attribute_2 | Attribute_3 | Attribute_4 | Target
1 2 3 100 3 0.1
2 0 6 150 5 0.3
3 0 3 200 6 0.56
4 1 4 103 4 0.23
Each segment has a certain population from the base data (irrelevant to my final use).
I want to condense, for example in the above case, the 4 segments into 2 big segments, based on the 4 attributes and on the target variable. I am currently dealing with 15k segments and want only 10 segments with each of the final segment based on target and also having a sensible attribute distribution.
Now, pardon my if I am wrong but CHAID on SPSS (if not using autogrow) will generally split the data into 70:30 ratio where it builds the tree on 70% of the data and tests on the remaining 30%. I can't use this approach since I need all my segments in the data to be included. I essentially want to club these segments into a a few big segments as explained before. My question is whether I can use CART (rpart in R) for the same. There is an explicit option 'subset' in the rpart function in R but I am not sure whether not mentioning it will ensure CART utilizing 100% of my data. I am relatively new to R and hence a very basic question.

Testing recurrences and orders in strings matlab

I have observed nurses during 400 episodes of care and recorded the sequence of surfaces contacts in each.
I categorised the surfaces into 5 groups 1:5 and calculated the probability density functions of touching any one of 1:5 (PDF).
PDF=[ 0.255202629 0.186199343 0.104052574 0.201533406 0.253012048]
I then predicted some 1000 sequences using:
for i=1:1000 % 1000 different nurses
seq(i,1:end)=randsample(1:5,max(observed_seq_length),'true',PDF);
end
eg.
seq = 1 5 2 3 4 2 5 5 2 5
stairs(1:max(observed_seq_length),seq) hold all
I'd like to compare my empirical sequences with my predicted one. What would you suggest to be the best strategy or property to look at?
Regards,
EDIT: I put r as a tag as this may well fall more easily under that category due to the nature of the question rather than the matlab code.

Resources