Calculate Top 10 from range of sets (survey rankings) - math

I'm creating a survey that a number of users will participate in. They will have to place 20 fruits in a Top 10 ranking format. Each user will only be able to rank 10 fruits, so there will be 10 fruits that don't get a rank.
How can I collate this information to output an overall Top 10 as ranked by all users in the survey? A formula or even a piece of JS/Python code to loop over the rows would be great.
Below is an example of the table/spreadsheet that I am going to receive.
I initially thought that summing all the rankings and then ordering by lowest total first would give the standings in a correct order*, but using that method Kiwi would come out on top even though it received no rankings at all. Also, Mango would come before Banana even though Banana received a first-place ranking.
| Fruit | Participant A | Participant B | Participant C | Participant D | Participant E | SUM |
| Apple | 8 | 4 | | 8 | 8 | 28 |
| Banana | 5 | | | 1 | | 6 |
| Blackberry | 6 | | | | 6 | 12 |
| Blueberry | 4 | 5 | 7 | 2 | | 18 |
| Cherry | | 8 | | 10 | | 18 |
| Fig | 3 | 3 | 10 | | | 16 |
| Grape | 1 | | 9 | 7 | 9 | 26 |
| Grapefruit | | 2 | | 4 | 4 | 10 |
| Kiwi | | | | | | 0 |
| Lychee | | | 3 | | | 3 |
| Mango | | | 5 | | | 5 |
| Nectarine | | 6 | 1 | 9 | | 16 |
| Orange | | 10 | | | 3 | 13 |
| Papaya | 7 | | 2 | | 10 | 19 |
| Peach | | 7 | | 3 | 7 | 17 |
| Pineapple | | 1 | 8 | | | 9 |
| Pomegranate | 2 | | 6 | 6 | | 14 |
| Raspberry | 9 | 9 | 4 | | 2 | 24 |
| Strawberry | | | | 5 | 5 | 10 |
| Watermelon | 10 | | | | 1 | 11 |
*I imagine there are many ways to do this, so there might not be a single correct way.

I have solved this by creating a scoring system dependent on the rank of each fruit, and have expanded it to also add a weighted score to higher-ranked fruit. I'm going to write this in JavaScript because that is how I tested it, but it can also easily be done in Google Sheets.
To start, create a base point array that converts each rank into a corresponding number of points awarded to the item. This can either be restricted so that only the top x ranked items receive points, or cover the total number of items to be ranked. The rank is then reversed so that 1st place obtains the highest score and last place the lowest.
These points can then be applied to each set of rankings to give every item a score that can be summed, with items that haven't been ranked receiving a score of 0.
However, this can produce a skewed list in which items ranked at a low position many times easily overtake items ranked at a high position only a few times.
To add weighting to the points, use an exponential formula that inflates the higher-ranked points more than the lower-ranked ones. This can be achieved either by raising the base point value to the power of a coefficient, or by raising the coefficient to the power of the base point value. In my opinion the latter produces more of a curve, but the coefficient needs tuning, otherwise the gap between the largest and smallest point scores can become too big.
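For example, with top_x = 10 and a coefficient of 1.3 (the values used in the code below), a 1st-place ranking is worth round(1.3^10) = 14 points and a 10th-place ranking round(1.3^1) = 1 point, compared with 10 and 1 under the base scheme.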
I will try to create a Google Sheet with the formula included, but for now the JavaScript code is below.
Hopefully someone else finds this helpful!
const items = ['Apple', 'Banana', 'Blackberry', 'Blueberry', 'Cherry', 'Fig', 'Grape', 'Grapefruit', 'Kiwi', 'Lychee', 'Mango', 'Nectarine', 'Orange', 'Papaya', 'Peach', 'Pineapple', 'Pomegranate', 'Raspberry', 'Strawberry', 'Watermelon'];
const results = [
  ['Grape', 'Pomegranate', 'Fig', 'Blueberry', 'Banana', 'Blackberry', 'Papaya', 'Apple', 'Raspberry', 'Watermelon'],
  ['Pineapple', 'Grapefruit', 'Fig', 'Apple', 'Blueberry', 'Nectarine', 'Peach', 'Cherry', 'Raspberry', 'Orange'],
  ['Nectarine', 'Papaya', 'Lychee', 'Raspberry', 'Mango', 'Pomegranate', 'Blueberry', 'Pineapple', 'Grape', 'Fig'],
  ['Banana', 'Blueberry', 'Peach', 'Grapefruit', 'Strawberry', 'Pomegranate', 'Grape', 'Apple', 'Nectarine', 'Cherry'],
  ['Watermelon', 'Raspberry', 'Orange', 'Grapefruit', 'Strawberry', 'Blackberry', 'Peach', 'Apple', 'Grape', 'Papaya']
];
const top_x = 10 /* items.length */;
const exponential_coefficient = 1.3;
const sortObject = (object) =>
  Object.fromEntries(Object.entries(object).sort(([, a], [, b]) => a - b).reverse());
const base_points = new Array(top_x).fill(null).map((value, index) => Math.abs(index - top_x));
const exponential_points = base_points.map((points) => Math.round(exponential_coefficient ** points));
const base_scores = items.reduce((acc, curr) => ((acc[curr] = 0), acc), {});
const exponential_scores = items.reduce((acc, curr) => ((acc[curr] = 0), acc), {});
results.forEach((set) =>
  set.forEach((item, index) => {
    base_scores[item] += base_points[index] || 0;
    exponential_scores[item] += exponential_points[index] || 0;
  })
);
console.log({ top_x, base_points, exponential_points, base_scores: sortObject(base_scores), exponential_scores: sortObject(exponential_scores) });

Related

dcast - error: Aggregate function missing

A little background information regarding my question: I have run a trial with 2 different materials, using 2x2 settings. Each treatment was performed in duplo, resulting in a total number of 2x2x2x2 = 16 runs in my dataset. The dataset has the following headings, in which repetition is either 1 or 2 (as it was performed in duplo).
| Run | Repetition | Material | Air speed | Class. speed | Parameter of interest |
I would like to transform this into a dataframe/table which has the following headings, resulting in 8 columns:
| Run | Material | Air speed | Class. speed | Parameter of interest from repetition 1 | Parameter of interest from repetition 2 |
This means that each treatment (combination of material, setting 1 and setting 2) is only shown once, and the parameter of interest is shown twice.
I have a dataset which looks as follows:
code rep material airspeed classifier_speed fine_fraction
1 L17 1 lupine 50 600 1
2 L19 2 lupine 50 600 6
3 L16 1 lupine 60 600 9
4 L22 2 lupine 60 600 12
5 L18 1 lupine 50 1200 4
6 L21 2 lupine 50 1200 6
I have melted it as follows:
melt1 <- melt(duplo_selection, id.vars = c("material", "airspeed", "classifier_speed", "rep"),
measure.vars=c("fine_fraction"))
and then tried to cast it as follows:
cast <- dcast(melt1, material + airspeed + classifier_speed ~ variable, value.var = "value")
This gives the following message:
Aggregate function missing, defaulting to 'length'
and this dataframe, in which the parameter of interest is counted rather than both values being presented.
Thanks for your effort and time trying to help me out. After a little puzzling, I found out what I had to do.
I added a replicate column to each observation, with a value of either 1 or 2, as the trial was performed in duplo.
Via the code
cast <- dcast(duplo_selection, material + airspeed + classifier_speed ~ replicate, value.var = "fine_fraction")
I came to the 5 x 8 table I was looking for.
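For reference, the same reshape can also be written with tidyr's pivot_wider. This is only a sketch, assuming the duplo_selection data frame shown above with its rep column, and not part of the original answer:
library(tidyr)
pivot_wider(duplo_selection,
            id_cols = c(material, airspeed, classifier_speed),
            names_from = rep,
            values_from = fine_fraction,
            names_prefix = "fine_fraction_rep")
Each treatment then appears once, with columns fine_fraction_rep1 and fine_fraction_rep2 holding the two repetitions.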

Estimating Probabilities in Perudo

I am new to coding and using R. I am working on a project to simulate the game Liar's Dice, also known as Perudo, and have some questions about creating the simulation.
Basically, the game consists of two or more players rolling five dice in a cup, turning it over, and making bids on how many of a certain side they think are on the table. You can look at your own dice, but not anyone else's. To make a bid, on your turn you would say "two 5's," which would mean there are at least two dice that landed on 5. Each bid must either increase the side or the amount. So if you said "two 5's," I could then say "two 6's" or "three 3's" on my turn.
When you believe the last bid is incorrect, you say "Liar" on your turn, and everyone reveals their dice. If you were wrong, you lose a die, but if you were right, the last bidder loses a die. This continues until there is only one player left who has dice.
First, I decided to create a function called cup() which rolls a cup of five six-sided dice.
cup <- function(sides = 6, dice = 5){
  sample(1:sides, size = dice, replace = TRUE)
}
Next, with a little assistance, I created a new function called cups() which rolls three cups for three players.
cups <- function(players = 3, sides = 6, dice = 5){
  out <- cup(sides, dice)
  for(i in 2:players){
    out <- rbind(out, cup(sides, dice))
  }
  rownames(out) <- paste0("P", 1:players)
  return(out)
}
What I want to accomplish next is to create a table of probabilities of possible dice outcomes. In other words, what's the probability of there being at least two of a side given fifteen dice (five for each player) in play? And then the probability of there being three, four, five, etc. all the way up to fifteen in this case.
My question is how would I go about doing this in R? And what direction should I go in after getting the probabilities in R?
Here is an empirical process for determining the percentage outcomes of all the same, 4 the same, 3 the same, 2 the same, and none the same upon rolling 5 dice:
library(gtools) # package with permutations function
allcombos <- permutations(6, 5, repeats.allowed = TRUE) # all 6^5 ordered rolls, repetition allowed
alluniques <- apply(allcombos, 1, unique) # uniques for each combo
alllengths <- sapply(alluniques, length) # lengths for each combo imputes num repeats
alllengths2 <- as.factor(alllengths) # convert to factor to count unique
allsum <- summary(alllengths2) # sum by num uniques
allsum
1 2 3 4 5 # 1=all same, 2=4 same, 3=3 same, 4=2 same, 5=all different
6 450 3000 3600 720
totsum <- sum(allsum)
allfrac <- allsum / totsum
allpercent <- allfrac * 100
allpercent
1 2 3 4 5
0.07716049 5.78703704 38.58024691 46.29629630 9.25925926 # percentage breakout
There is no doubt an analytical solution, but I don't know what it is. You could use standard probability calculations to estimate specific outcomes among multiple players, e.g. P(at least one four-of-a-kind | 3 players), or run some simulations.
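For example, a quick simulation along those lines might look like the sketch below (my own rough illustration, not part of the answer above; the chosen face and the threshold k = 4 are arbitrary):
# Estimate P(at least k dice show a chosen face) with 15 dice in play
set.seed(1)
n_sims <- 100000
n_dice <- 15            # three players with five dice each
sides  <- 6
k      <- 4
hits <- replicate(n_sims, sum(sample(1:sides, n_dice, replace = TRUE) == 1))
mean(hits >= k)         # should be close to 1 - pbinom(k - 1, n_dice, 1/sides)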
Here's likely more than you asked for, but it focuses on the number of sides on the dice, the total number of dice, and the probability of rolling Nrolled or more of a given side.
dicegame <- function(Nsides = 6,
                     Ndice = 5,
                     Nrolled = 1,
                     verbose = FALSE)
{
  total_possible_outcomes <- choose(Nsides + Ndice - 1, Ndice)
  outcomes_matrix <- t(combn(Nsides + Ndice - 1,
                             Ndice,
                             sort)) - matrix(rep(c(0:(Ndice - 1)),
                                                 each = total_possible_outcomes),
                                             nrow = total_possible_outcomes)
  chances <- sum(apply(outcomes_matrix, 1, function(x) sum(x == 2)) >= Nrolled) / total_possible_outcomes
  if(verbose) {
    cat(paste("Number of dice",
              Ndice,
              "each with", Nsides, "sides",
              "chances of rolling", Nrolled,
              "\n or more of any one side are:\n"))
  }
  return(chances)
  # return(total_possible_outcomes)
  # return(outcomes_matrix)
}
dicegame(verbose = TRUE)
#> Number of dice 5 each with 6 sides chances of rolling 1
#> or more of any one side are:
#> [1] 0.5
dicegame(6, 15, 10)
#> [1] 0.01625387
Using probability, we can show that the chance of getting a given value exactly n times, with N = players * dice dice in play and s sides per die, is:
P(X = n) = choose(N, n) * (1/s)^n * (1 - 1/s)^(N - n)
We can easily write this into an R function:
prob_get_n <- function(ntimes, players = 3, dice = 5, sides = 6){
  if(missing(ntimes)) ntimes <- 0:(players*dice)
  choose(players*dice, ntimes) * (1 - 1/sides)^((players*dice) - ntimes) * sides^(-ntimes)
}
Notice that this function is by construction vectorised, i.e. it accepts inputs such as 1:2 or c(9, 5).
prob_get_n() -> probs
data.frame(ntimes=1:length(probs)-1, probs=probs,or_more= rev(cumsum(rev(probs))))
ntimes probs or_more
1 0 6.490547e-02 1.000000e+00
2 1 1.947164e-01 9.350945e-01
3 2 2.726030e-01 7.403781e-01
4 3 2.362559e-01 4.677751e-01
5 4 1.417535e-01 2.315192e-01
6 5 6.237156e-02 8.976567e-02
7 6 2.079052e-02 2.739411e-02
8 7 5.346134e-03 6.603585e-03
9 8 1.069227e-03 1.257451e-03
10 9 1.663242e-04 1.882242e-04
11 10 1.995890e-05 2.190005e-05
12 11 1.814445e-06 1.941153e-06
13 12 1.209630e-07 1.267076e-07
14 13 5.582909e-09 5.744548e-09
15 14 1.595117e-10 1.616385e-10
16 15 2.126822e-12 2.126822e-12
Edit
Or we can use R's built-in dbinom function to get the distribution and pbinom to get the cumulative probabilities:
probs <- function(ntimes, players = 3, dice = 5, sides = 6){
  if(missing(ntimes)) ntimes <- 0:(players*dice)
  data.frame(ntimes = ntimes,
             probs = dbinom(ntimes, players*dice, 1/sides),
             or_more = 1 - pbinom(ntimes - 1, players*dice, 1/sides))
}
ntimes probs or_more
1 0 6.490547e-02 1.000000e+00
2 1 1.947164e-01 9.350945e-01
3 2 2.726030e-01 7.403781e-01
4 3 2.362559e-01 4.677751e-01
5 4 1.417535e-01 2.315192e-01
6 5 6.237156e-02 8.976567e-02
7 6 2.079052e-02 2.739411e-02
8 7 5.346134e-03 6.603585e-03
9 8 1.069227e-03 1.257451e-03
10 9 1.663242e-04 1.882242e-04
11 10 1.995890e-05 2.190005e-05
12 11 1.814445e-06 1.941153e-06
13 12 1.209630e-07 1.267076e-07
14 13 5.582909e-09 5.744548e-09
15 14 1.595117e-10 1.616385e-10
16 15 2.126822e-12 2.126743e-12

R Looping: Assign record to class with least existing records

I have a group of individuals that I am distributing items to in an effort to move toward even distribution of total items across individuals.
Each individual can receive only certain item types.
The starting distribution of items is not equal.
The number of available items of each type is known, and must be fully exhausted.
df contains an example format for the person data. Note that Chuck has 14 items total, not 14 bats and 14 gloves.
df<-data.frame(person=c("Chuck","Walter","Mickey","Vince","Walter","Mickey","Vince","Chuck"),alloweditem=c("bat","bat","bat","bat","ball","ball","glove","glove"),startingtotalitemspossessed=c(14,9,7,12,9,7,12,14))
otherdf contains an example format for the items and the number of each needing assignment.
otherdf<-data.frame(item=c("bat","ball","glove"),numberneedingassignment=c(3,4,7))
Is there a best method for coding this form of item distribution? I imagine the steps to be:
1) Check which person that can receive a given item has the lowest total items assigned. Break a tie at random.
2) Assign 1 of the given item to this person.
3) Update the startingtotalitemspossessed for the person receiving the item.
4) Update the remaining number of the item left to assign.
5) Stop this loop for a given item if the total remaining is 0, and move to the next item.
Below is a partial representation of something like how I'd imagine this working as a view inside the loop, left to right.
Note: The number of items and people is very large. If possible, a method that would scale to any given number of people or items would be ideal!
Thank you in advance for your help!
I'm sure there are better ways, but here is an example:
df <- data.frame(person = c("Chuck","Walter","Mickey","Vince","Walter","Mickey","Vince","Chuck"),
                 alloweditem = c("bat","bat","bat","bat","ball","ball","glove","glove"),
                 total = c(14,9,7,12,9,7,12,14))
print(df)
## person alloweditem total
## 1 Chuck bat 14
## 2 Walter bat 9
## 3 Mickey bat 7
## 4 Vince bat 12
## 5 Walter ball 9
## 6 Mickey ball 7
## 7 Vince glove 12
## 8 Chuck glove 14
otherdf <- data.frame(item = c("bat","ball","glove"),
                      numberneedingassignment = c(3,4,7))
# Items in queue
queue <- rep(otherdf$item, otherdf$numberneedingassignment)
for (i in 1:length(queue)) {
  # Find person with the lowest starting total
  personToBeAssigned <- df[df$alloweditem == queue[i] &
                             df$total == min(df[df$alloweditem == queue[i], 3]), 1][1]
  df[df$person == personToBeAssigned & df$alloweditem == queue[i], 3] <-
    df[df$person == personToBeAssigned & df$alloweditem == queue[i], 3] + 1
}
print(df)
## person alloweditem total
## 1 Chuck bat 14
## 2 Walter bat 10
## 3 Mickey bat 9
## 4 Vince bat 12
## 5 Walter ball 10
## 6 Mickey ball 10
## 7 Vince glove 17
## 8 Chuck glove 16

Data Cleaning in R: remove test customer names

I am handling customer data that has customer first and last names. I want to clean the names of any random keystrokes. Test accounts are jumbled into the dataset and have junk names. For example, in the data below I want to remove customers 2, 5, 9, 10, 12, etc. I would appreciate your help.
Customer Id FirstName LastName
1 MARY MEYER
2 GFRTYUIO UHBVYY
3 CHARLES BEAL
4 MARNI MONTANEZ
5 GDTDTTD DTTHDTHTHTHD
6 TIFFANY BAYLESS
7 CATHRYN JONES
8 TINA CUNNINGHAM
9 FGCYFCGCGFC FGCGFCHGHG
10 ADDHJSDLG DHGAHG
11 WALTER FINN
12 GFCTFCGCFGC CG GFCGFCGFCGF
13 ASDASDASD AASDASDASD
14 TYKTYKYTKTY YTKTYKTYK
15 HFHFHF HAVE
16 REBECCA CROSSWHITE
17 GHSGHG HGASGH
18 JESSICA TREMBLEY
19 GFRTYUIO UHBVYY
20 HUBHGBUHBUH YTVYVFYVYFFV
21 HEATHER WYRICK
22 JASON SPLICHAL
23 RUSTY OWENS
24 DUSTIN WILLIAMS
25 GFCGFCFGCGFC GRCGFXFGDGF
26 QWQWQW QWQWWW
27 LIWNDVLIHWDV LIAENVLIHEAV
28 DARLENE SHORTRIDGE
29 BETH HDHDHDH
30 ROBERT SHIELDS
31 GHERDHBXFH DFHFDHDFH
32 ACE TESSSSSRT
33 ALLISON AWTREY
34 UYGUGVHGVGHVG HGHGVUYYU
35 HCJHV FHJSEFHSIEHF
The problem seems to be that you'd need a solid definition of improbable names, and that is not really related to R. Anyway, I suggest you go by the first names and remove all those that are not plausible. As a source of plausible first names (a positive list), you could use e.g. the SSA Baby Name Database. This should work reasonably well to filter out English first names. If you have more location-specific needs for first names, look online for other baby name databases and try to scrape them as a positive list.
Once you have them in a vector named positiveNames, filter out all non-positive names like this:
data_new <- data_original[data_original$FirstName %in% positiveNames, ]
My approach is the following:
1) Merge FirstName and LastName into a single string, strname.
Then, count the number of letters for each strname.
2) At this point, we find that real names, like "MARNIMONTANEZ", are composed of a variety of letters: two 'M'; two 'A'; one 'R'; one 'I'; three 'N'; one 'O'; one 'T'; one 'E'; one 'Z'.
And we find that fake names, like "GFCTFCGCFGCCGGFCGFCGFCGF", are composed of a few letters repeated many times: eight 'G'; seven 'F'; eight 'C'; one 'T'.
3) The pattern to distinguish real names from fake names becomes clear:
real names are characterized by a greater variety of letters. We can measure this by creating a variable check_real computed as: number of unique letters / total string length
fake names are characterized by a few letters repeated several times. We can measure this by creating a variable check_fake computed as: average frequency of each letter
4) Finally, we just have to define a threshold for each variable that identifies an anomaly. When a threshold is triggered, the corresponding flag_real or flag_fake is set.
if flag_real == 1 & flag_fake == 0, the name is real
if flag_real == 0 & flag_fake == 1, the name is fake
In the rare cases where both flags are triggered (i.e. flag_real == 1 & flag_fake == 1), you have to investigate the record manually and tune the thresholds.
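A minimal sketch of those checks in R (my own addition; the threshold values real_cut and fake_cut are only placeholders that you would tune on your data):
check_flags <- function(first, last, real_cut = 0.5, fake_cut = 3) {
  strname <- toupper(gsub(" ", "", paste0(first, last)))        # 1) merge the names
  letter_vec <- strsplit(strname, "")[[1]]
  check_real <- length(unique(letter_vec)) / nchar(strname)     # unique letters / total length
  check_fake <- mean(table(letter_vec))                         # average frequency of each letter
  c(flag_real = as.integer(check_real >= real_cut),
    flag_fake = as.integer(check_fake >= fake_cut))
}
check_flags("MARNI", "MONTANEZ")             # real name: flag_real = 1, flag_fake = 0
check_flags("GFCTFCGCFGC", "CG GFCGFCGFCGF") # fake name: flag_real = 0, flag_fake = 1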
You can calculate the variability strength of the full name (FirstName and LastName combined) as the number of unique letters in the full name divided by the total number of characters in the full name. Then just remove the names that have low variability strength; these are the names where the same random keystrokes are repeated frequently.
I did this using the charToRaw function, because it is very fast, together with the dplyr library, as below:
# Building Test Data
df <- data.frame(CustomerId = c(1, 2, 3, 4, 5, 6, 7),
                 FirstName = c("MARY", "FGCYFCGCGFC", "GFCTFCGCFGC", "ASDASDASD", "GDTDTTD", "WALTER", "GFCTFCGCFGC"),
                 LastName = c("MEYER", "FGCGFCHGHG", "GFCGFCGFCGF", "AASDASDASD", "DTTHDTHTHTHD", "FINN", "CG GFCGFCGFCGF"),
                 stringsAsFactors = FALSE)
#test data: df
# CustomerId FirstName LastName
#1 1 MARY MEYER
#2 2 FGCYFCGCGFC FGCGFCHGHG
#3 3 GFCTFCGCFGC GFCGFCGFCGF
#4 4 ASDASDASD AASDASDASD
#5 5 GDTDTTD DTTHDTHTHTHD
#6 6 WALTER FINN
#7 7 GFCTFCGCFGC CG GFCGFCGFCGF
library(dplyr)
df %>%
  ## Combining FirstName and LastName
  mutate(FullName = paste(FirstName, gsub(" ", "", LastName, fixed = TRUE))) %>%
  group_by(FullName) %>%
  ## Calculating variability strength for each full name
  mutate(Variability = length(unique(as.integer(charToRaw(FullName)))) / nchar(FullName)) %>%
  ## Filtering full names; I set the cutoff at 0.40 (you can change this),
  ## meaning we keep full names with variability strength greater than or equal to 0.40
  filter(Variability >= 0.40)
# A tibble: 2 x 5
# Groups: FullName [2]
# CustomerId FirstName LastName FullName Variability
# <dbl> <chr> <chr> <chr> <dbl>
#1 1 MARY MEYER MARY MEYER 0.6000000
#2 6 WALTER FINN WALTER FINN 0.9090909
I tried to combine the suggestions in the below code. Thanks everyone for the help.
# load required libraries
library(hunspell)
library(dplyr)
# read data in dataframe df
df <- data.frame(CustomerId = c(1, 2, 3, 4, 5, 6, 7, 8),
                 FirstName = c("MARY", " ALBERT SAM", "FGCYFCGCGFC", "GFCTFCGCFGC", "ASDASDASD", "GDTDTTD", "WALTER", "GFCTFCGCFGC"),
                 LastName = c("MEYER", "TEST", "FGCGFCHGHG", "GFCGFCGFCGF", "AASDASDASD", "DTTHDTHTHTHD", "FINN", "CG GFCGFCGFCGF"),
                 stringsAsFactors = FALSE)
# Keep unique names
df<-distinct(df,FirstName, LastName, .keep_all = TRUE)
# Spell check using hunspell
df$flag <- hunspell_check(df$FirstName) | hunspell_check(as.character(df$LastName))
# remove middle names
df$FirstNameOnly<-gsub(" .*","",df$FirstName)
# SSA name data using https://www.ssa.gov/oact/babynames/names.zip
# unzip files in folder named names
files<-list.files("/names",pattern="*.txt")
ssa_names<- do.call(rbind, lapply(files, function(x) read.csv(x,
col.names = c("Name","Gender","Frequency"),stringsAsFactors = FALSE)))
# Change SSA names to uppercase
ssa_names$Name <- toupper(ssa_names$Name)
# Flag for SSA names
df$flag_SSA<-ifelse(df$FirstNameOnly %in% ssa_names$Name,TRUE,FALSE)
rm(ssa_names)
# remove spaces and concatenate first name and last name
df$strname<-gsub(" ","",paste(df$FirstName,df$LastName, sep = ""))
# Name string length
df$len<-nchar(df$strname)
# Unique string length
for(n in 1:nrow(df))
{
  df$ulen[n] <- length(unique(strsplit(df$strname[n], "")[[1]]))
}
# Ratio variable for unique string length over total string length
df$ratio<-ifelse(df$len==0,0,df$ulen/df$len)
# Histogram to determine cutoff ratio
hist(df$ratio)
test<-df[df$ratio<.4 & df$flag_SSA==FALSE & df$flag==FALSE,]

which R function to use for Text Auto-Correction?

I have a CSV document with 2 columns, which contain Commodity Category and Commodity Name.
Ex:
Sl.No. Commodity Category Commodity Name
1 Stationary Pencil
2 Stationary Pen
3 Stationary Marker
4 Office Utensils Chair
5 Office Utensils Drawer
6 Hardware Monitor
7 Hardware CPU
and I have another csv file which contains various Commodity names.
Ex:
Sl.No. Commodity Name
1 Pancil
2 Pencil-HB 02
3 Pencil-Apsara
4 Pancil-Nataraj
5 Pen-Parker
6 Pen-Reynolds
7 Monitor-X001RL
The output I would like is to standardise the commodity names and classify them into their respective Commodity Categories, as shown below:
Sl.No. Commodity Name Commodity Category
1 Pencil Stationary
2 Pencil Stationary
3 Pencil Stationary
4 Pancil Stationary
5 Pen Stationary
6 Pen Stationary
7 Monitor Hardware
Step 1) I first have to use NLTK (text mining methods) and clean the data so as to separate "Pencil" from "Pencil-HB 02".
Step 2) After cleaning, I have to use an approximate string matching technique, i.e. agrep(), to match the pattern "Pencil *" or to correct "Pancil" to "Pencil".
Step 3) Once the pattern is corrected, I have to categorise. No idea how.
This is what I have thought about. I started with step 2 and I'm stuck there.
I'm not finding an exact method to code this.
Is there any way to get the output as required?
If yes please suggest me the method I can proceed with.
You could use the stringdist package. The correct function below corrects the Commodity.Name in file2 based on the distance of each item to the different CName values.
Then a left_join is used to join the two tables.
I also noticed that there are some misclassifications if I use the default options for stringdistmatrix. You can try changing the weight argument of stringdistmatrix for a better correction result.
> library(dplyr)
> library(stringdist)
>
> file1 <- read.csv("/Users/Randy/Desktop/file1.csv")
> file2 <- read.csv("/Users/Randy/Desktop/file2.csv")
>
> head(file1)
Sl.No. Commodity.Category Commodity.Name
1 1 Stationary Pencil
2 2 Stationary Pen
3 3 Stationary Marker
4 4 Office Utensils Chair
5 5 Office Utensils Drawer
6 6 Hardware Monitor
> head(file2)
Sl.No. Commodity.Name
1 1 Pancil
2 2 Pencil-HB 02
3 3 Pencil-Apsara
4 4 Pancil-Nataraj
5 5 Pen-Parker
6 6 Pen-Reynolds
>
> CName <- levels(file1$Commodity.Name)
> correct <- function(x){
+ factor(sapply(x, function(z) CName[which.min(stringdistmatrix(z, CName, weight=c(1,0.1,1,1)))]), CName)
+ }
>
> correctedfile2 <- file2 %>%
+ transmute(Commodity.Name.Old = Commodity.Name, Commodity.Name = correct(Commodity.Name))
>
> correctedfile2 %>%
+ inner_join(file1[,-1], by="Commodity.Name")
Commodity.Name.Old Commodity.Name Commodity.Category
1 Pancil Pencil Stationary
2 Pencil-HB 02 Pencil Stationary
3 Pencil-Apsara Pencil Stationary
4 Pancil-Nataraj Pencil Stationary
5 Pen-Parker Pen Stationary
6 Pen-Reynolds Pen Stationary
7 Monitor-X001RL Monitor Hardware
If you need the "Others" category, you just need to play with the weights.
I added a row "Diesel" in file2. Then compute the score using stringdist with customized weights (you should try varying the values). If the score is large than 2 (this value is related to how the weights are assigned), it doesn't correct anything.
PS: as we don't know all the possible labels, we have to do as.character to convect factor to character.
PS2: I am also using tolower for case insensitive scoring.
> head(file2)
Sl.No. Commodity.Name
1 1 Diesel
2 2 Pancil
3 3 Pencil-HB 02
4 4 Pencil-Apsara
5 5 Pancil-Nataraj
6 6 Pen-Parker
>
> CName <- levels(file1$Commodity.Name)
> CName.lower <- tolower(CName)
> correct_1 <- function(x){
+ scores = stringdistmatrix(tolower(x), CName.lower, weight=c(1,0.001,1,0.5))
+ if (min(scores)>2) {
+ return(x)
+ } else {
+ return(as.character(CName[which.min(scores)]))
+ }
+ }
> correct <- function(x) {
+ sapply(as.character(x), correct_1)
+ }
>
> correctedfile2 <- file2 %>%
+ transmute(Commodity.Name.Old = Commodity.Name, Commodity.Name = correct(Commodity.Name))
>
> file1$Commodity.Name = as.character(file1$Commodity.Name)
> correctedfile2 %>%
+ left_join(file1[,-1], by="Commodity.Name")
Commodity.Name.Old Commodity.Name Commodity.Category
1 Diesel Diesel <NA>
2 Pancil Pencil Stationary
3 Pencil-HB 02 Pencil Stationary
4 Pencil-Apsara Pencil Stationary
5 Pancil-Nataraj Pencil Stationary
6 Pen-Parker Pen Stationary
7 Pen-Reynolds Pen Stationary
8 Monitor-X001RL Monitor Hardware
There is an 'approximate string matching' function amatch() in {stringdist} (at least in 0.9.4.6) that returns the most probable match from a pre-defined set of words. It has a parameter maxDist that sets the maximum distance to be matched, and a nomatch parameter that can be used for the 'other' category. Otherwise method, weights, etc. can be set similarly to stringdistmatrix().
So, your original problem can be solved like this using a tidyverse compatible solution:
library(dplyr)
library(stringdist)
# Reading the files
file1 <- readr::read_csv("file1.csv")
file2 <- readr::read_csv("file2.csv")
# Getting the commodity names in a vector
commodities <- file1 %>% distinct(`Commodity Name`) %>% pull()
# Finding the closest string match of the commodities, and joining the file containing the categories
file2 %>%
mutate(`Commodity Name` = commodities[amatch(`Commodity Name`, commodities, maxDist = 5)]) %>%
left_join(file1, by = "Commodity Name")
This will return a data frame that contains the corrected commodity name and category. If the original Commodity name is more than 5 characters away (simplified explanation of string distance) from any of the possible commodity names, the corrected name will be NA.
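If you also want an explicit "Others" bucket rather than NA, one way is to keep the amatch index and fall back to a label when it is NA. This is only a sketch building on the code above, not part of the original answer:
library(dplyr)
library(stringdist)
idx <- amatch(file2$`Commodity Name`, commodities, maxDist = 5)   # NA when nothing is close enough
file2 %>%
  mutate(`Commodity Name` = coalesce(commodities[idx], "Others")) %>%
  left_join(file1, by = "Commodity Name")
Rows labelled "Others" will simply get an NA Commodity Category after the join.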
