Grouping words that are similar - r

CompanyName <- c('Kraft', 'Kraft Foods', 'Kfraft', 'nestle', 'nestle usa', 'GM', 'general motors', 'the dow chemical company', 'Dow')
I want to get either:
CompanyName2
Kraft
Kraft
Kraft
nestle
nestle
general motors
general motors
Dow
Dow
But would be absolutely fine with:
CompanyName2
1
1
1
2
2
3
3
I see algorithms for getting the distance between two words, so if I had just one weird name I would compare it to all other names and pick the one with the lowest distance. But I have thousands of names and want to group them all into groups.
I do not know anything about elastic search, but would one of the functions in the elastic package or some other function help me out here?
I'm sorry there's no programming here. I know. But this is way out of my area of normal expertise.

Solution: use string distance
You're on the right track. Here is some R code to get you started:
install.packages("stringdist") # install this package
library("stringdist")
CompanyName <- c('Kraft', 'Kraft Foods', 'Kfraft', 'nestle', 'nestle usa', 'GM', 'general motors', 'the dow chemical company', 'Dow')
CompanyName = tolower(CompanyName) # otherwise case matters too much
# Calculate a string distance matrix; LCS is just one option
?"stringdist-metrics" # see others
sdm = stringdistmatrix(CompanyName, CompanyName, useNames=T, method="lcs")
Let's take a look. These are the calculated distances between strings, using Longest Common Subsequence metric (try others, e.g. cosine, Levenshtein). They all measure, in essence, how many characters the strings have in common. Their pros and cons are beyond this Q&A. You might look into something that gives a higher similarity value to two strings that contain the exact same substring (like dow)
sdm[1:5,1:5]
kraft kraft foods kfraft nestle nestle usa
kraft 0 6 1 9 13
kraft foods 6 0 7 15 15
kfraft 1 7 0 10 14
nestle 9 15 10 0 4
nestle usa 13 15 14 4 0
Some visualization
# Hierarchical clustering
sdm_dist = as.dist(sdm) # convert to a dist object (you essentially already have distances calculated)
plot(hclust(sdm_dist))
If you want to group then explicitly into k groups, use k-medoids.
library("cluster")
clusplot(pam(sdm_dist, 5), color=TRUE, shade=F, labels=2, lines=0)

Related

find every combination of elements in a column of a dataframe, which add up to a given sum in R

I'm trying to ease my life by writing a menu creator, which is supposed to permutate a weekly menu from a list of my favourite dishes, in order to get a little bit more variety in my life.
I gave every dish a value of how many days it approximately lasts and tried to arrange the dishes to end up with menus worth 7 days of food.
I've already tried solutions for knapsack functions from here, including dynamic programming, but I'm not experienced enough to get the hang of it. This is because all of these solutions are targeting only the most efficient option and not every combination, which fills the Knapsack.
library(adagio)
#create some data
dish <-c('Schnitzel','Burger','Steak','Salad','Falafel','Salmon','Mashed potatoes','MacnCheese','Hot Dogs')
days_the_food_lasts <- c(2,2,1,1,3,1,2,2,4)
price_of_the_food <- c(20,20,40,10,15,18,10,15,15)
data <- data.frame(dish,days_the_food_lasts,price_of_the_food)
#give each dish a distinct id
data$rownumber <- (1:nrow(data))
#set limit for how many days should be covered with the dishes
food_needed_for_days <- 7
#knapsack function of the adagio library as an example, but all other solutions I found to the knapsackproblem were the same
most_exspensive_food <- knapsack(days_the_food_lasts,price_of_the_food,food_needed_for_days)
data[data$rownumber %in% most_exspensive_food$indices, ]
#output
dish days_the_food_lasts price_of_the_food rownumber
1 Schnitzel 2 20 1
2 Burger 2 20 2
3 Steak 1 40 3
4 Salad 1 10 4
6 Salmon 1 18 6
Simplified:
I need a solution to a single objective single Knapsack problem, which returns all possible combinations of dishes which add up to 7 days of food.
Thank you very much in advance

Comparing rows in the same R dataframe

I want to take the nth row in a dataframe and compare it to all rows that are not the nth row and return how many of this columns match and/or mismatch.
I tried the match function and ifelse for single observations but I haven't been able to replicate it for the entire dataframe.
The dataset Superstore contains the order priority, customer name, ship mode, customer segment and category. It looks like this:
> head(df2)
Order.Priority Customer.Name Ship.Mode Customer.Segment Product.Category
1 Not Specified Dana Teague Regular Air Corporate Office Supplies
2 Critical Vanessa Boyer Regular Air Consumer Office Supplies
3 Critical Wesley Tate Regular Air Corporate Technology
4 High Brian Grady Delivery Truck Corporate Furniture
5 Medium Kristine Connolly Delivery Truck Corporate Furniture
6 High Emily Britt Regular Air Corporate Office Supplies
The code I tried (extracting relevant columns):
df <- read.csv("Superstore.csv", header = TRUE)
df2 <- df[,c(2,4,5,6,7)]
match(df2[2,],df2[1,],nomatch = 0)
This returns:
> match(df2[2,],df2[1,],nomatch = 0)
[1] 0 0 3 0 5
Using ifelse I get:
> ifelse(df2[1,]==df2[2,],1,0)
Order.Priority Customer.Name Ship.Mode Customer.Segment Product.Category
1 0 0 1 0 1
Like I said, this is exactly the result I need, but I haven't been able to replicate for the whole dataframe.

R data cleaning

I have a dataframe (df1) scrapped as a single column of data .
1
2 Amazon Pantry
3 Best Sellerin Soaps & Hand Wash
4
5 Palmolive Hygiene-Plus Sensitive Liquid Hand Wash, 300ml
6 Palmolive Hygiene-Plus Sensitive Liquid Hand Wash, 300ml
7 £0.90
8 ?
9
10 Palmolive Naturals Nourishing Liquid Hand Wash, 300ml
11 Palmolive Naturals Nourishing Liquid Hand Wash, 300ml
12 £0.90
13 ?
14
15 L'Oreal Men Expert Carbon Protect Deodorant 250ml
16 L'Oreal Men Expert Carbon Protect Deodorant 250ml
17 £1.50
In order to clean the data i tried using the below commands such that to get Product and pricing information in 2 separate columns . Can someone let me know if there is an alternate way of doing it .
install.packages("splitstackshape")
newdf <- cSplit(df1, "Amazon_Normal_Text2", direction = "long")
this is just a thought process...
everytime there's a "ml," extract information until ml going backward until there is a space and store that into volume variable. (substr)
extract information from £ to the end of the string and store that into price variable. (grep, regex, nchar)
extract from beginning of string until the character location for volume occurrence into product variable (substr, nchar)
look into substr, nchar, grep, regex

R - Combinations of a dataframe with constraints

I'm trying to make an R script for fantasy football (proper UK football, not hand egg :-)) where I can input a list of players in a csv and it will spit out every 11-player combination, which meet various constraints.
Here's my sample dataframe:
df <- read.csv("Filename.csv",
header = TRUE)
> print(df)
Name Positon Team Salary
1 Eric Dier D TOT 9300000
2 Erik Pieters D STO 9200000
3 Christian Fuchs D LEI 9100000
4 Héctor Bellerín D ARS 9000000
5 Charlie Daniels D BOU 9000000
6 Ben Davies D TOT 8900000
7 Federico Fernández D SWA 8800000
8 Per Mertesacker D ARS 8800000
9 Alberto Moreno D LIV 8700000
10 Chris Smalling D MUN 8700000
11 Seamus Coleman D EVE 8700000
12 Jan Vertonghen D TOT 8700000
13 Romelu Lukaku F EVE 12700000
14 Harry Kane F TOT 12500000
15 Max Gradel F BOU 11900000
16 Alexis Sánchez F ARS 11300000
17 Jamie Vardy F LEI 11200000
18 Theo Walcott F ARS 10700000
19 Olivier Giroud F ARS 10700000
20 Wilfried Bony F MCI 10000000
21 Kristoffer Nordfeldt G SWA 7000000
22 Joe Hart G MCI 6800000
23 Jack Rose G WBA 6600000
24 Asmir Begovic G CHE 6600000
25 Mesut Özil M ARS 15600000
26 Riyad Mahrez M LEI 15200000
27 Ross Barkley M EVE 13300000
28 Dimitri Payet M WHM 12800000
29 Willian M CHE 12500000
30 Bertrand Traore M CHE 12500000
31 Kevin De Bruyne M MCI 12400000
And the constraints are as follows:
1) The total salary of each 11-player lineup cannot exceed 100,000,000
2) There can only be a maximum of four players from one team. E.g. four player from 'CHE' (Chelsea).
3) There is a limit of how many players within each 11-player lineup can be from each position. There can be:
1 G (goalkeeper), 3 to 4 D (defender), 3 to 5 M (midfielder), 1 to 3 F (forward)
I'd like every 11 player combination that meets the above contraints to be returned. Order is not important (e.g. 1,2,3 is considered the same as 2,1,3 and shouldn't be duplicated) and a player can appear in more than one lineup.
I've done a fair bit of research and played around but can't seem to get anywhere with this. I'm new to R. I don't expect anyone to nail this for me, but if someone could point a newbie like myself in the right direction it would be much appreciated.
Thanks.
This can be solved as linear integer program using the library LPSolve.
This kind of problems are very well solvable -- opposed to what has been written before -- as typical the number of solutions are much smaller than the domain size.
You can add for each Player a zero one variable, whether or not that player is in the team.
The package can be installed using
install.packages("lpSolve")
install.packages("lpSolveAPI")
The documentation can be found at: https://cran.r-project.org/web/packages/lpSolve/lpSolve.pdf
First constraint sum of players 11
The salary is basically a sum of all players variable multiplied by the salary column and so on....
To get a proper solutions you need to specify in
lp.solve(all.bin=TRUE
Such that all variables referring to players are either zero or one.
( I understood that you are trying to learn, that's why I refrain from giving a full solution)
EDIT
As I got down-voted probably because of not giving the full solution. Kind of sad as as the original author explicitly wrote that he doesn't expect a full solution
library(lpSolve)
df <- read.csv("/tmp/football.csv",header = TRUE,sep=";")
f.obj <- rep(1,nrow(df))
f.con <-
matrix(c(f.obj <- rep(1,nrow(df)),
as.vector(df$Salary),
(df$Positon=="G") *1.0,
(df$Positon=="D") *1.0,
(df$Positon=="D") *1.0,
(df$Positon=="M") *1.0,
(df$Positon=="M") *1.0,
(df$Positon=="F") *1.0,
(df$Positon=="F") *1.0),nrow=9,byrow= TRUE)
f.dir <- c("==", "<=","==",">=","<=",">=","<=",">=","<=")
f.rhs<- c(11, #number players
100000000, #salary
1 , #Goalkeeper
3 , # def min
4 , # def max
3 , # mdef min
5, # mdef max
1, # for, min
3 # wor, max
)
solutions <- lp ("max", f.obj, f.con, f.dir, f.rhs,all.bin=TRUE)
I didn't add the Team Constraint as it wouldn't have provided any additionally insights here....
** EDIT2 **
This might come handy if you change your data set
R lpsolve binary find all possible solutions
A brute-force way to tackle this, (which is also beautifully parallelizable and guarantees you all possible combinations) is to calculate all 11-player permutations and then filter out the combinations that don't conform to your limits in a stepwise manner.
To make a program like this fit into your computer's memory, give each player a unique integer ID and create vectors of IDs as team sets. When you then implement your filters your functions can refer to the player info by that ID in a single dataframe.
Say df is your data frame with all player data.
df$id <- 1:nrow(df)
Get all combinations of ids:
# This will take a long time or run out of memory!
# In my 2.8Gz laptop took 466 seconds just for your 31 players
teams <- combn(df$id, 11)
Of course, if your dataframe is big (like hundreds of players) this implementation could take impossibly long to finish. You probably would be better off just sampling 11-sets from your player set without replacement and construct teams in an "on demand" fashion.
A more clever way is to partition your dataset according to player position into - one for goalkeepers, one for defence, etc. And then use the above approach to create permutations of different players from each position and combine the end results. It would take ridiculously less amount of time, it would still be parallelizable and exhaustive (give you all possible combinations).

Testing recurrences and orders in strings matlab

I have observed nurses during 400 episodes of care and recorded the sequence of surfaces contacts in each.
I categorised the surfaces into 5 groups 1:5 and calculated the probability density functions of touching any one of 1:5 (PDF).
PDF=[ 0.255202629 0.186199343 0.104052574 0.201533406 0.253012048]
I then predicted some 1000 sequences using:
for i=1:1000 % 1000 different nurses
seq(i,1:end)=randsample(1:5,max(observed_seq_length),'true',PDF);
end
eg.
seq = 1 5 2 3 4 2 5 5 2 5
stairs(1:max(observed_seq_length),seq) hold all
I'd like to compare my empirical sequences with my predicted one. What would you suggest to be the best strategy or property to look at?
Regards,
EDIT: I put r as a tag as this may well fall more easily under that category due to the nature of the question rather than the matlab code.

Resources