Match Sentences in R - r

I have 2 tables. Table1 is a smaller table with around 10K values.
Table 1 (Sample):
KeyWords PageView
Phillips Trimmer 123
Buy Samsung Mobile 45
Ripe yellow Banana 63
Pepsi 140
Table 2 contains 1 Million Values.
Table 2 (Sample):
KeyWords PageView
Electric Trimmer 123
Samsung Mobile 45
Yellow Ripe Banana 63
Samsung S6 304
Banana 105
Phillips 209
Trimmer Phillips 29
Now I want to take all words from table 1 and look into table 2 an find the best match. The order of words should not have a big influence in the match i.e "Ripe yellow Banana" should match perfectly with "Yellow Ripe Banana". "Buy Samsung Mobile" should match with "Samsung Mobile" and with "Samsung S6".
The final output should look like this.
Table 3:
Word PageView Match
Phillips Trimmer 123 Trimmer Phillips
Buy Samsung Mobile 45 Samsung Mobile
Ripe yellow Banana 63 Yellow Ripe Banana
Pepsi 140 NA
Would really appreciate if we can Stem and Tokenize the sentence before doing a match.
I have tried the following but its not working properly and the loop takes quite a bit of time.
file_1$match <- ""
for(i in 1:dim(file_1)[1]) {
print(i)
x <- grep(file_1$Keywords[i],file_2$Keyword,value = T, ignore.case = m
T,useBytes = T)
x <- paste0(x,"")
file_1$match[i] <- x
}
I have tried using 'agrep' as well varying the 'max.distance' parameter. The results were not as expected.

Edited: I used the "apply" function to perform the following on every row of tab1:
the function inside "apply", takes x[1] which is keyword (let's say "Ripe Yellow Banana"), strsplit splits it by space ("Ripe" "Yellow" "Banana"), sapply performs grepl on each of these splits to see if the pattern exists in tab2. so you will have 3 column of true falses for "Ripe" "Yellow" "Banana". next step is to count the number of trues for each row and output the tab2 with that row number. I also put a if statement to give NA's if the max number of trues is 0:
tab1<-data.frame(Keyword=c("Phillips Trimmer",
"Buy Samsung Mobile","Ripe Yellow Banana","Pepsi"),
PageView=c(123,45,63,140))
tab2<-data.frame(Keyword=c("Electric Trimmer","Samsung Mobile",
"Yellow Ripe Banana","Samsung S6","Banana",
"Phillips","Trimmer Phillips","Buy Trimmer Philips"),
PageView=c(123,45,63,304,105,209,29,21))
tab2$StrLen<-apply(tab2,1,function(x)length(unlist(strsplit(x[1], " "))))
tab1$BestMatch<-apply(tab1,1,function(x){
a <-sapply(unlist(strsplit(x[1], " ")), grepl, tab2$Keyword)
a<-cbind(a,TRUECnt=rowSums(a==TRUE))
a<-as.data.frame(a)
a$StrLen <- length(unlist(strsplit(x[1], " ")))
if (max(a$TRUECnt)==0){
return(NA)
}
return(as.character(tab2[which(a$TRUECnt==max(a$TRUECnt) &
tab2$StrLen <= a$StrLen),]$Keyword))
})
View(tab1) View(tab1)
# Keyword PageView BestMatch
# 1 Phillips Trimmer 123 Trimmer Phillips
# 2 Buy Samsung Mobile 45 Samsung Mobile
# 3 Ripe Yellow Banana 63 Yellow Ripe Banana
# 4 Pepsi 140 <NA>

Related

Calculate Top 10 from range of sets (survey rankings)

I'm creating a survey that a number of users will participate in. They will have to place 20 fruits in a Top 10 ranking format. Each user will only be able to rank 10 fruits, so there will be 10 fruits that don't get a rank.
How can I collate this information to output an overall Top 10 as ranked by all users in the survey? A formula or even piece of JS/Python code to loop over the rows would be great
Below is an example of the table/spreadsheet that I am going to receive.
I initially thought that summing all the rankings and then ordering by lowest total first would give the standings in a correct order*, but using that method would mean that Kiwi would come out on top even though it received none rankings. Also Mango would come before Banana even though Banana recieved a first place ranking.
Participant A
Participant B
Participant C
Participant D
Participant E
SUM
Apple
8
4
8
8
28
Banana
5
1
6
Blackberry
6
6
12
Blueberry
4
5
7
2
18
Cherry
8
10
18
Fig
3
3
10
16
Grape
1
9
7
9
26
Grapefruit
2
4
4
10
Kiwi
0
Lychee
3
3
Mango
5
5
Nectarine
6
1
9
16
Orange
10
3
13
Papaya
7
2
10
19
Peach
7
3
7
17
Pineapple
1
8
9
Pomegranate
2
6
6
14
Raspberry
9
9
4
2
24
Strawberry
5
5
10
Watermelon
10
1
11
*I imagine there are many ways to do this, so might not be a single correct way
I have solved this by creating a scoring system dependant on the rank of each fruit, and have expanded this to also add a weighted score to higher ranked fruit. I'm going to write this for JavaScript because that is how I tested it, but it can also be easily done in Google Sheets.
To start, create a base point array that will convert the rank into a corresponding point which will be awarded to the item. This can either be restricted to only award the top 𝑥 amount of ranked items with points, or the total number of items to be ranked. Then reverse the rank so that 1st obtains the highest score and last obtains the lowest score.
This can then be attributed to each set of rankings to give each item a score that can be summed. Taking into account items that haven't been ranked receiving a score of 0.
But this can create a skewed list where items that are ranked at a low position many times can easily over take items ranked at a high position only a few times.
To add weighting to the points simply use an exponential formula that will then inflate the higher ranked points more than the lower ranked points. This can be achieved by using either the base point value to the power of a coefficient, or the coefficient to the power of the base point value. In my opinion the latter produces more of a curve, but the coefficient must be tested otherwise the gap between the largest and smallest point scores can be too big.
I will try and create a Google Sheets with the formula included, but for now the JavaScript code is below.
Hopefully someone else finds this helpful!
const items = ['Apple', 'Banana', 'Blackberry', 'Blueberry', 'Cherry', 'Fig', 'Grape', 'Grapefruit', 'Kiwi', 'Lychee', 'Mango', 'Nectarine', 'Orange', 'Papaya', 'Peach', 'Pineapple', 'Pomegranate', 'Raspberry', 'Strawberry', 'Watermelon'];
const results = [
['Grape', 'Pomegranate', 'Fig', 'Blueberry', 'Banana', 'Blackberry', 'Papaya', 'Apple', 'Raspberry', 'Watermelon'],
['Pineapple', 'Grapefruit', 'Fig', 'Apple', 'Blueberry', 'Nectarine', 'Peach', 'Cherry', 'Raspberry', 'Orange'],
['Nectarine', 'Papaya', 'Lychee', 'Raspberry', 'Mango', 'Pomegranate', 'Blueberry', 'Pineapple', 'Grape', 'Fig'],
['Banana', 'Blueberry', 'Peach', 'Grapefruit', 'Strawberry', 'Pomegranate', 'Grape', 'Apple', 'Nectarine', 'Cherry'],
['Watermelon', 'Raspberry', 'Orange', 'Grapefruit', 'Strawberry', 'Blackberry', 'Peach', 'Apple', 'Grape', 'Papaya']
];
const top_x = 10 /* items.length */;
const exponential_coefficient = 1.3;
const sortObject = (object) =>
Object.fromEntries(Object.entries(object).sort(([, a], [, b]) => a - b).reverse());
const base_points = new Array(top_x).fill(null).map((value, index) => Math.abs(index - top_x));
const exponential_points = base_points.map((points) => Math.round(exponential_coefficient ** points));
const base_scores = items.reduce((acc, curr) => ((acc[curr] = 0), acc), {});
const exponential_scores = items.reduce((acc, curr) => ((acc[curr] = 0), acc), {});
results.forEach((set) =>
set.forEach((item, index) => {
base_scores[item] += base_points[index] || 0;
exponential_scores[item] += exponential_points[index] || 0;
})
);
console.log({ top_x, base_points, exponential_points, base_scores: sortObject(base_scores), exponential_scores: sortObject(exponential_scores) });

Group similar strings and change their values to something common while retaining the individual rows

I have receipt data and there are descriptions of items but some are pretty similar and I'd like to code those similar ones with he same value to increase the chances of finding associations in the data. For example:
Strawberries
Premium Strawberries
Premium Strawberries
Hass Avocado
Mini Avocado
I'd like to have:
Strawberries
Strawberries
Strawberries
Avocado
Avocado
Something to that effect but I'm open to suggestions for sure.All I can think of is that some sort of fuzzy search might be what I need I just don't know how to implement that?
Thanks, once again!
One possible way are string distances. But be careful because they do not capture any meaning, just the similarity between actual strings. Below example could work like some heuristic, but pay attention to last example. Higher the threshold in cutree, less groups you will have, and probably more wrongly classified examples. Ergo lower threshold means that you are more strict, and possibly missing good solutions:
th <- 0.35 ## between 0 and 1
roles <- c("Strawberies","strawberries","Mini strawberries","Avocado","Hass avocado","Not Avocado")
mat <- stringdist::stringdistmatrix(roles,roles,method = "jw",p=0.025,nthread = parallel::detectCores())
colnames(mat) <- roles
rownames(mat) <- roles
t <- hclust(as.dist(mat),method = "single")
memb <- cutree(t,h=th)
df <- data.frame(a=c(roles),b=c(memb),stringsAsFactors = F)
df$to <- plyr::mapvalues(df$b,from=1:length(unique(memb)),to=df$a[!duplicated(df$b)])
prior <- data.frame(str=roles,to=df$to,stringsAsFactors = F)
prior
str to
1 Strawberies Strawberies
2 strawberries Strawberies
3 Mini strawberries Strawberies
4 Avocado Avocado
5 Hass avocado Avocado
6 Not Avocado Avocado
Assuming you receipt is a dataframe df, i.e.,
df <- data.frame("Strawberries",
"Premium Strawberries",
"Premium Strawberries",
"Hass Avocado",
"Mini Avocado",stringsAsFactors = F)
then maybe you can achieve it via
res <- gsub(".*\\s(\\w)","\\1",df$name)
yielding
>res
[1] "Strawberries" "Strawberries" "Strawberries" "Avocado"
[5] "Avocado"

Fuzzy Address matching R

Yeah, it's been asked before, but I can't find a thread that provides a simple, clean answer to this question.
I have example data below - I have two columns, col1 is the current address, col2 is an address I am told is 'better' than the current address. I need to see how much 'better' the second column is over the first. Most of the time, the second is better b/c it contains secondary information that the first is lacking, such as apartment number.
test <- as.data.frame(matrix(c(
"742 Evergreen Terrace" , "742 Evergreen Terrace Apt 3" ,
"31 Spooner Street #42" , "31 Spooner Street",
"129 W 81st Street" , "129 W 81st Street Apt 5A" ,
"245 E 73rd Street", "245 E 73rd Street Apt 6") , ncol=2, byrow=TRUE,
dimnames=list(NULL, c("old_addr" , "new_addr"))) ,stringsAsFactors=FALSE)
There is an answer I found here that gets close to what I would like:
Fuzzy match row in one column with same row in next column
I need to create a third column that is a simple 1/0 variable that == 1 if it's an approximate match, and 0 if not. I need to be able to specify threshold for approximate matching.
For my first example - 742 Evergreen Terrace vs 742 Evergreen Terrace Apt 3, the length differs by six. I need to be able to specify a length difference of six, or eight, or whatever.
I looked at agrep, but I need to compare two columns data within the same row, and it does not allow for that. I have also tried lapply, but its results make me think it is cycling through all data in the entire column, and I need row by row comparisons. Also max distance I do not understand, with the ifelse below and a max of 1 (if I understand this correctly to be 1 == there can be one unit of edit or change), it should be throwing errors but it only does in one case.
agrep(test$old_addr, test$new_addr, max.distance = 0.1, ignore.case = TRUE)
test$fuzz_match <- lapply(test$old_addr , agrep , x =
test$new_addr , max.distance = 1 , ignore.case = TRUE)
Any help is appreciated, thank you!
You can calculate the Levenshtein distance between each pair. Then what you need to decide is how large must the distance be for the two not to be the same address.
test$lev_dist <- mapply(adist, test$old_addr, test$new_addr)
test$same_addr <- test$lev_dist < 5
test
# old_addr new_addr lev_dist same_addr
# 1 742 Evergreen Terrace 742 Evergreen Terrace Apt 3 6 FALSE
# 2 31 Spooner Street #42 31 Spooner Street 4 TRUE
# 3 129 W 81st Street 129 W 81st Street Apt 5A 7 FALSE
# 4 245 E 73rd Street 245 E 73rd Street Apt 6 6 FALSE
You can use agrep() together with mapply() in a similar manner.
test$agrep_match <- mapply(agrep, test$old_addr, test$new_addr)
test$agrep_match <- lengths(test$agrep_match) == 1
test
# old_addr new_addr agrep_match
# 1 742 Evergreen Terrace 742 Evergreen Terrace Apt 3 TRUE
# 2 31 Spooner Street #42 31 Spooner Street FALSE
# 3 129 W 81st Street 129 W 81st Street Apt 5A TRUE
# 4 245 E 73rd Street 245 E 73rd Street Apt 6 TRUE
agrep() is also based on Levenshtein distance, but has a bunch of different options for adjusting the threshold, as I'm sure you've found.
There are other difference measures than Levenshtein that might be better suited for this application. Package stringdist has a number of other string distance metrics available.

Using tapply with multiple categories is not working

I am trying to use tapply in R so that it gives me a summary of averages for data that correspond to two categories (category_name and brand_name)
It is working fine when I include one category, i.e.:
brandavg <- tapply(train$price, train$brand_name, mean)
head(brandavg, 10)
Looks something like:
"" "brandavg" "brandstd"
"% Pure" 13.85 5.53719480565651
"10.Deep" 21 NA
"21men" 10 NA
But when I try to run it with two categories, using this code:
brandcatavg <- tapply(train$price, list(train$brand_name, train$category_name), mean)
head(brandcatavg, 10)
I get along list of category names (far more than 10!), but no averages.
I'm sure this has a really easy solution that I'm missing, but I'm hitting my head on the wall trying to figure it out.
Current code attempts have been:
brandcatavg <- tapply(train$price, list(train$brand_name, train$category_name), mean)
brandcatavg <- with(train, tapply(price, list(brand_name, category_name), mean))
Both yield something that looks like:
Vintage & Collectibles/Paper Ephemera/Postcard Vintage & Collectibles/Paper Ephemera/Stamps
Vintage & Collectibles/Serving/Bowl Vintage & Collectibles/Serving/Casserole
Vintage & Collectibles/Serving/Cream and Sugar Set Vintage & Collectibles/Serving/Dinnerware Set
Vintage & Collectibles/Serving/Flatware Vintage & Collectibles/Serving/Glassware
Vintage & Collectibles/Serving/Mug Vintage & Collectibles/Serving/Other
Vintage & Collectibles/Serving/Pitcher Vintage & Collectibles/Serving/Plate
Vintage & Collectibles/Serving/Salt and Pepper Shakers Vintage & Collectibles/Serving/Teacup
Vintage & Collectibles/Serving/Teapot Vintage & Collectibles/Serving/Tray
Vintage & Collectibles/Serving/Tumbler Vintage & Collectibles/Supplies/Bead
(Note that I cut the output for length...)
When I pull the first 10 columns it looks like:
train_id name item_condition_id category_name
1 0 MLB Cincinnati Reds T Shirt Size XL 3 Men/Tops/T-shirts
2 1 Razer BlackWidow Chroma Keyboard 3 Electronics/Computers & Tablets/Components & Parts
3 2 AVA-VIV Blouse 1 Women/Tops & Blouses/Blouse
4 3 Leather Horse Statues 1 Home/Home Décor/Home Décor Accents
5 4 24K GOLD plated rose 1 Women/Jewelry/Necklaces
6 5 Bundled items requested for Ruie 3 Women/Other/Other
7 6 Acacia pacific tides santorini top 3 Women/Swimwear/Two-Piece
8 7 Girls cheer and tumbling bundle of 7 3 Sports & Outdoors/Apparel/Girls
9 8 Girls Nike Pro shorts 3 Sports & Outdoors/Apparel/Girls
10 9 Porcelain clown doll checker pants VTG 3 Vintage & Collectibles/Collectibles/Doll
brand_name price shipping
1 10 1
2 Razer 52 0
3 Target 10 1
4 35 1
5 44 0
6 59 0
7 Acacia Swimwear 64 0
8 Soffe 6 1
9 Nike 19 0
10 8 0
item_description
1 No description yet
2 This keyboard is in great condition and works like it came out of the box. All of the ports are tested and work perfectly. The lights are customizable via the Razer Synapse app on your PC.
3 Adorable top with a hint of lace and a key hole in the back! The pale pink is a 1X, and I also have a 3X available in white!
4 New with tags. Leather horses. Retail for [rm] each. Stand about a foot high. They are being sold as a pair. Any questions please ask. Free shipping. Just got out of storage
5 Complete with certificate of authenticity
6 Banana republic bottoms, Candies skirt with matching blazer,Amy Byers suit, Loft bottoms and cami top.
7 Size small but straps slightly shortened to fit xs, besides that, perfect condition
8 You get three pairs of Sophie cheer shorts size small and medium girls and two sports bra/boy shorts spandex matching sets in small and medium girls. All items total retail for [rm] in store and you can take him today for less than the price of one item at the store!)
9 Girls Size small Plus green. Three shorts total.
10 I realized his pants are on backwards after the picture. They were very dirty so I hand washed them. He has a stuffed body and painted porcelain head, hands and feet. Back before clowns were too scary. 9" tall. No chips or cracks but minor paint loss in a few places. Clown Circus Doll Collectible
(dput is giving me funky results.)
Eventually got this to work with the code:
library(dplyr)
brandcatavg <- train %>%
group_by(category_name, brand_name) %>%
summarise(averageprice = mean(price, na.rm=TRUE))
You are close,
Try:
with(train, tapply(price, list(brand_name, category_name), mean))

How to use chained ifelse and grepl?

I have a database of thoroughbred names that is structured as follows:
HorseName <- c("Grey emperor", "Smokey grey", "Gaining greys", "chestnut", "Glowing Chestnuts", "Ruby red", "My fair lady", "Man of war")
Number <- seq(1:8)
df <- data.frame(HorseName, Number)
I now wish to search for occurences of colours within each horse's name. Specifically, I wish to select all the instances of 'grey' and 'chestnut', creating a new column that identifies these colours. Any other names can be simply 'other' Unfortunately, the names are not consistent, with plurals included and varying case formats. How would I go about doing this in R?
My anticipated output would be:
df$Type <- c("Grey", "Grey", "Grey", "Chestnut", "Chestnut", "Other", "Other", "Other")
I am familiar with chained ifelse statements but unsure how to handle the plural occurences and case sensitivities!
In case you are interested in other ways to do this, here's a tidyverse alternative which has the same end result as #amrrs answer.
library(tidyverse)
library(stringr)
df %>%
mutate(Type = str_extract(str_to_lower(HorseName), "grey|chestnut")) %>%
mutate(Type = str_to_title(if_else(is.na(Type), "other", Type)))
#> HorseName Number Type
#> 1 Grey emperor 1 Grey
#> 2 Smokey grey 2 Grey
#> 3 Gaining greys 3 Grey
#> 4 chestnut 4 Chestnut
#> 5 Glowing Chestnuts 5 Chestnut
#> 6 Ruby red 6 Other
#> 7 My fair lady 7 Other
#> 8 Man of war 8 Other
Converting all the input text df$HorseName to lower case before pattern matching with grepl (using lower-cased pattern) solves this problem.
> df$Type <- ifelse(grepl('grey',tolower(df$HorseName)),'Grey',
+ ifelse(grepl('chestnut',tolower(df$HorseName)),'Chestnut',
+ 'others'))
> df
HorseName Number Type
1 Grey emperor 1 Grey
2 Smokey grey 2 Grey
3 Gaining greys 3 Grey
4 chestnut 4 Chestnut
5 Glowing Chestnuts 5 Chestnut
6 Ruby red 6 others
7 My fair lady 7 others
8 Man of war 8 others
>

Resources