Recent developments in recoding repeated variables in R?

So I have a long dataset of sequences.
Every column (from t1 to t...n) has the same levels or categories.
There are more than 200 categories or levels and 144 columns (variables) in total.
id t1 t2 t3 t...n
"1" "eating" "tv" "conversation" "..."
"2" "sleep" "driving" "relaxing" "..."
"3" "drawing" "kissing" "knitting" "..."
"..." "..." "..." "..." "..."
Variable t1 has the same levels as t2, and so on.
What I need is a loop-style recoding for each column (but without writing an explicit loop).
I would like to avoid the usual
seq$t1[seq$t1 == "drawing"] <- 'leisure'
seq$t1[seq$t1 == "eating"] <- 'meal'
seq$t1[seq$t1 == "sleep"] <- 'personal care'
seq$t1[seq$t1 == "..."] <- ...
The most convenient recoding style would be something like
c('leisure') = c('drawing', 'tv', ...)
That would help me to better cluster variables into bigger categories.
Are there any new, easier recoding methods in R that have appeared lately?
What would you advise me to use?
This is a sample of my real dataset: six repeated observations (in columns) for 10 respondents (in rows).
dtaSeq = structure(c("Wash and dress", "Eating", "Various arrangements", "Cleaning dwelling", "Ironing", "Activities related to sports",
"Eating", "Eating", "Other specified construction and repairs",
"Other specified physical care & supervision of a child", "Wash and dress",
"Filling in the time use diary", "Food preparation", "Wash and dress",
"Ironing", "Travel related to physical exercise", "Eating", "Eating",
"Other specified construction and repairs", "Other specified physical care & supervision of a child",
"Wash and dress", "Filling in the time use diary", "Food preparation",
"Wash and dress", "Food preparation", "Wash and dress", "Eating",
"Eating", "Other specified construction and repairs", "Other specified physical care & supervision of a child",
"Wash and dress", "Filling in the time use diary", "Baking",
"Teaching the child", "Food preparation", "Wash and dress", "Eating",
"Eating", "Other specified construction and repairs", "Other specified physical care & supervision of a child",
"Dish washing", "Unspecified TV watching", "Reading periodicals",
"Teaching the child", "Food preparation", "Reading periodicals",
"Eating", "Eating", "Other specified construction and repairs",
"Feeding the child", "Laundry", "Unspecified TV watching", "Cleaning dwelling",
"Teaching the child", "Eating", "Eating", "Eating", "Eating",
"Other specified construction and repairs", "Feeding the child"),
.Dim = c(10L, 6L), .Dimnames = list(c("1", "2", "3", "4",
"5", "6", "7", "8", "9", "10"), c("act1.050", "act1.051", "act1.052",
"act1.053", "act1.054", "act1.055")))

As far as I know, the car package can handle strings or characters in its recode() function, but I'm not sure. An alternative could be the sjmisc package, making a detour: convert the strings to numeric values and set value labels back afterwards:
library(sjmisc)
dtaSeq <- as.data.frame(dtaSeq)
# convert to values
dtaSeq.values <- to_value(dtaSeq)
# random recode example, use your own values for clustering here
dtaSeq.values <- rec(dtaSeq.values, "1:3=1; 4:6=2; else=3")
# set value labels, these will be added as attributes
dtaSeq.values <- set_val_labels(dtaSeq.values, c("meal", "leisure", "personal care"))
# replace numeric values with associated label attributes
dtaSeq.values <- to_label(dtaSeq.values)
Result:
> head(dtaSeq.values)
act1.050 act1.051 act1.052 act1.053 act1.054 act1.055
1 personal care personal care leisure personal care meal leisure
2 meal meal meal meal personal care personal care
3 personal care meal meal meal leisure meal
4 meal personal care leisure personal care personal care leisure
5 leisure leisure meal leisure leisure meal
6 meal personal care leisure personal care leisure meal
An advantage of the sjmisc rec() function is that, if you have a data frame with variables of similar "structure", you can recode the complete data frame with just one call to rec().
Does this help you?

You don't seem to have fully specified recoding rules for your real data,
so I made some up:
recodes <- list("meals" = c("Eating"),
                "leisure" = c("Reading periodicals",
                              "Unspecified TV watching"),
                "child care" = c("Feeding the child", "Teaching the child"),
                "house care" = c("Food preparation", "Dish washing",
                                 "Cleaning dwelling", "Ironing"))
Here's a general-purpose recoding function. car::recode does work,
but I find it a little clumsy. There's also plyr::revalue, but
it's one-to-one, not many-to-one.
recodeFun <- function(x) {
  # replace every value found in a recode group with that group's name
  for (i in seq_along(recodes)) {
    x[x %in% recodes[[i]]] <- names(recodes)[i]
  }
  return(x)
}
d2 <- recodeFun(dtaSeq)
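If you prefer to avoid the explicit loop, the same many-to-one mapping can be flattened into a named lookup vector; here is a base-R sketch of that idea, reusing the recodes list from above:
# names are the old values, elements are the new category labels
lookup <- setNames(rep(names(recodes), lengths(recodes)), unlist(recodes))
d3 <- dtaSeq
hit <- d3 %in% names(lookup)  # cells that have a recode rule
d3[hit] <- lookup[d3[hit]]    # character indexing does the many-to-one map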


How do I create a row when grep doesn't find a match?

After a decade of lurking on Stack Overflow, I'm finally dipping my toes in to ask for help! Apologies for any mistakes!
I'm extracting tables from Word to create my own data frame. There are about 50 documents, all with the same table, but the data isn't mine, and it is a little messy, to put it mildly.
The table is 2 columns (Name, Values) by 60 rows, and the df$Name contents are often written wrongly, or rows are missing altogether. This is not my data, so editing it is not an option.
My problem is: I want to bind each Word doc's data together, so they need to have the same columns. I will be transposing the data so Name becomes the header and Value becomes row 1. Because the df$Name contents are messy, I used grep to extract the rows I wanted. (Previously I tried extracting by row number, but the row numbers changed between Word docs.)
These are all the values in df$Name that should be present.
Col <- c("Top Film / Web code (if applicable)", "Base Film / Web code (if applicable)", "Top Label / Sleeves code", "Base Label code", "Promotional Label code", "Trays Code", "SRP code", "SRP label code", "Packing format (overwrap, MAP, VAC)", "Vac pressure (if applicable)", "Die set", "Optimal running speed (max)", "Gas mix (if applicable)", "Pressure for Leaker checks (bar)", "Frequency of checks", "Metal Detection Limits", "No. of Units per pack","Pack weight", "Claims", "Shelf Life Of Product From Pack / Slice", "Date code format", "Health Mark", "UK & EU Address","“e” mark present", "Weight present", "Top Label Placement", "Base Label Placement", "Promo Label Placement", "Barcode (if applicable)", "No. of Packs per SRP/Basket","Weight of outercase", "Max No. of SRP/Baskets per pallet")
## use grep to get R to search for similar words present in all Word docs
toMatch <- c("Top Film","Base Film", "Top Label", "Base Label", "Promotional", "Trays", "SRP","Packing format", "Vac pressure", "Die set", "Optimal running speed", "Gas mix",
"Pressure", "Frequency", "Metal Detection Limits", "per pack",
"Pack weight",
"Claims", "Shelf Life", "Date code", "Health", "Address",
"“e”", "Weight present", "Top Label Place", "Base Label Place", "Promo Label Place", "Barcode","No. of Packs","outercase", "Max No.")
tab_select <- unique(df[grep(paste(toMatch, collapse = "|"),
                             df$Name, ignore.case = TRUE), ])
Using grep like this is pretty successful, but if a value is missing, there's no sign of it. In this case "Trays Code" was not present, but I need a blank "Trays Code" row (with NA in Value) to be created. Adding one in manually doesn't help, as it ends up at the bottom of the table, and I need the rows to stay in the right order.
Is there a way to get grep to match, but also create a row with NA if there are no matches?
I tried making a separate table with the correct column names and using dplyr to join, hoping any duplicates would disappear, but the slightly differing names in df$Name and Col mean more duplicates.
I'm not sure if I should be looping through each pattern and creating a row if there's no match; I'm just wary of ending up with loops in loops in loops.
At the moment, this one grep call uses multiple patterns, and some patterns pick up multiple rows of data, which might complicate things.
How about this:
df <- data.frame(Name = c("Top Film / Web code (if applicable)", "Base Film / Web code (if applicable)", "Top Label / Sleeves code", "Base Label code", "Promotional Label code", "Trays Code", "SRP code", "SRP label code", "Packing format (overwrap, MAP, VAC)", "Vac pressure (if applicable)", "Die set", "Optimal running speed (max)", "Gas mix (if applicable)", "Pressure for Leaker checks (bar)", "Frequency of checks", "Metal Detection Limits", "No. of Units per pack","Pack weight", "Claims", "Shelf Life Of Product From Pack / Slice", "Date code format", "Health Mark", "UK & EU Address","“e” mark present", "Weight present", "Top Label Placement", "Base Label Placement", "Promo Label Placement", "Barcode (if applicable)", "No. of Packs per SRP/Basket","Weight of outercase", "Max No. of SRP/Baskets per pallet"))
toMatch <- c("Top Film","Base Film", "Top Label", "Base Label", "Promotional", "Trays", "SRP","Packing format", "Vac pressure", "Die set", "Optimal running speed", "Gas mix", "Pressure", "Frequency", "Metal Detection Limits", "per pack", "Pack weight", "Claims", "Shelf Life", "Date code", "Health", "Address", "“e”", "Weight present", "Top Label Place", "Base Label Place", "Promo Label Place", "Barcode","No. of Packs","outercase", "Max No.")
df$Value <- 1:nrow(df)
df$Name[6] <- "Not Matched"
out <- lapply(toMatch, function(x){
  if(any(grepl(x, df$Name))){
    df[grep(x, df$Name), ]
  }else{
    data.frame(Name = x, Value = NA)
  }
})
out <- do.call(rbind, out)
head(out, n=10)
#> Name Value
#> 1 Top Film / Web code (if applicable) 1
#> 2 Base Film / Web code (if applicable) 2
#> 3 Top Label / Sleeves code 3
#> 26 Top Label Placement 26
#> 4 Base Label code 4
#> 27 Base Label Placement 27
#> 5 Promotional Label code 5
#> 110 Trays NA
#> 7 SRP code 7
#> 8 SRP label code 8
Created on 2023-01-08 by the reprex package (v2.0.1)
Note that I changed the sixth observation of Name in df to "Not Matched" (in the original data it matches "Trays") to show what happens when there is no match; you can see the resulting NA row for "Trays" in the output.
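Since the question mentions transposing so that Name becomes the header and Value becomes row 1, a minimal base-R sketch of that last step (using the out built above) could be:
wide <- as.data.frame(t(out$Value))
names(wide) <- out$Name
Note that a pattern matching several rows will produce duplicated column names, so those may need resolving first.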

How to count how many strings are in a column

I have a dataset with a column "amenities" and I want to count how many amenities are in each row.
> airbnbT$amenities[1]
[1] ["Essentials", "Refrigerator", "Shampoo", "TV", "Dedicated workspace", "Hangers", "Iron", "Long term stays allowed", "Dishes and silverware", "First aid kit", "Free parking on premises", "Hair dryer", "Patio or balcony", "Washer", "Dryer", "Cooking basics", "Coffee maker", "Private entrance", "Hot water", "Fire extinguisher", "Wifi", "Air conditioning", "Hot tub", "Kitchen", "Microwave", "Oven", "Smoke alarm"]
14673 Levels: ["Air conditioning", "Baby bath", "Long term stays allowed", "Baby monitor"] ...
> class(airbnbT$amenities[1])
[1] "factor"
Here for row 1, there are 27 amenities.
Is there a way to count the commas (",") in each row? That would give the number of amenities.
Try str_count from the stringr package. You will need to add 1 since there will be one fewer comma than the number of amenities:
library(stringr)
airbnbT$amenities_count = str_count(airbnbT$amenities,",") + 1
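An alternative sketch that counts the items directly (assuming the bracketed, comma-separated format shown above): split on commas and take the lengths. The caveat is similar either way: an entry like "[]" with no amenities would still count as 1, so check for it separately.
amen <- as.character(airbnbT$amenities)  # the column is a factor
airbnbT$amenities_count2 <- lengths(strsplit(amen, ","))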

How to extract keywords below and above a given text in an article

I have this character vector of lines from a journal:
test_1 <- c(" Journal of Neonatal Nursing 27 (2021) 106–110",
" Contents lists available at ScienceDirect",
" Journal of Neonatal Nursing",
" journal homepage: www.elsevier.com/locate/jnn",
"Comparison of inter-facility transports of critically ill neonates who died",
"after admission vs. survivors", "Robert Schultz a, *, Jennifer Berk-King a, Laura Wallace a, Girija Natarajan a, b",
"a", " Children’s Hospital of Michigan, Detroit, MI, USA",
"b", " Division of Neonatology, Wayne State University School of Medicine, Detroit, MI, USA",
"A R T I C L E I N F O A B S T R A C T",
"Keywords: Objective: To compare characteristics before, during and after inter-facility transports (IFT), and changes in the",
"Inter-facility transport Transport Risk Index of Physiologic Stability (TRIPS) before and after inter-facility transports (IFT) in infants",
"Neonatal intensive care who died within 7 days of admission to a level IV NICU versus matched survivors.",
"Mortality", " Study design: This retrospective case-control study included infants who died within 7 days of IFT and controls",
" matched for gestational age and reason for admission. Unplanned events were temperature or respiratory de­",
" rangements. Therapeutic interventions included increased respiratory support, resuscitation or blood product",
" transfusion.",
" Results: Our cohort was predominantly preterm and male. Cases had a higher rate of resuscitation, lower Apgar",
" scores, more respiratory acidosis, lower BP and higher TRIPS, compared to controls. Deterioration in TRIPS was",
" independently associated with male gender and unplanned events; not with patient group.",
" Conclusions: Rates of unplanned events, therapeutic interventions, and deterioration in TRIPS following IFT by a",
" transport team are comparable in cases and controls.",
" outcomes. The Transport Risk Index of Physiologic Stability (TRIPS) is",
"1. Introduction an assessment measure of infant status before and after transport (Lee"
)
I want to extract the keywords from these lines, which are Inter-facility transport, Neonatal intensive care, Mortality. I've tried to get the line which has "Keywords" with test_1[str_detect(test_1, "^Keywords:")]. Now I want to get all the keywords below this line and above "1. Introduction".
What regex or stringr functions will do this?
Thanks
If I understood correctly, you are sort of scanning the PDF downloaded from here. I think you should find a better way to scan your PDFs.
Until then, the best option could be this:
library(stringr)
# get the line after ^Keywords:
start <- which(str_detect(test_1, "^Keywords:")) + 1
# get the line before ^1. Introduction (dot escaped so it matches literally)
end <- which(str_detect(test_1, "^1\\. Introduction")) - 1
# get the lines in between
x <- test_1[start:end]
# keep the left-hand keyword column (first 60 characters) and trim whitespace
x <- str_trim(str_sub(x, 1, 60))
x <- x[x != ""]
x
#> [1] "Inter-facility transport" "Neonatal intensive care" "Mortality"
EDIT:
You can define a function to find the index of the line at which Keywords occurs and the indices of the lines below that line:
find_keywords <- function(pattern, text) {
  index <- which(grepl(pattern, text))
  sort(c(index + 1, index + 2, index + 3)) # if you suspect more than three keywords, extend with `index + ...`
}
Based on that function, you can extract the keywords:
library(stringr)
str_extract(test_1[find_keywords(pattern = "^Keywords:", text = test_1)], "^\\S+")
[1] "Inter-facility" "Neonatal" "Mortality"

Subset datasets by variable before using expand.grid to calculate distance matrix

I have two datasets. One dataset has about ~30k rows, and the second dataset has ~60k rows. The smaller dataset (df1) has a unique identifier (upc), which is critical to my analysis.
The larger dataset (df2) does not have this unique identifier, but it does have a descriptive variable (product_title) that can be matched with a similar description variable in df1 and used to infer the unique identifier.
I am trying to keep things simple, so I used expand.grid.
library(stringdist)
df1_titles <- unique(df1$product_title) # list of 30k titles
df2_titles <- unique(df2$product_title) # list of 60k titles
r <- expand.grid(df1_titles, df2_titles) # all title pairs
names(r) <- c("df1_titles", "df2_titles")
r$dist <- stringdist(r$df1_titles, r$df2_titles, method = "jw") # calculate distance
r <- r[order(r$dist), ]
r <- r[!duplicated(r$df1_titles), ]
r <- subset(r, dist < .10)
Unfortunately, R struggles to expand such a large grid (30k × 60k is roughly 1.8 billion pairs). So I had the idea to use a second variable (c1) in both datasets to constrain expand.grid to similar items. Let's assume the category values are the same in both datasets.
While I know it's not recommended to create data frames using a loop, I didn't have a better idea for how to subset the data for expand.grid, so I tried this approach anyway:
categories<-c("Beauty","Personal Care","Grocery","Household Essentials") # Variable with categories to subset
for (i in seq_along(categories)) {
df1_sub<-subset(wmt,category==categories[i])
df2_sub<-subset(m,category==categories[i])
df1_titles<-unique(df1_sub$product_title)
df2_titles<-unique(df2_sub$product_title)
### HOW DO I CREATE A LIST/GRID DYNAMICALLY? ### <-expand.grid(df1_titles,df2_titles)
}
After creating these grids, the plan would be to grab the unique identifier upc from df1 and assign it to matches in df2 before consolidating the datasets.
I'm sure there is a better way to do this, and I hope that identifying a better way to reduce data frames to relevant subsets before using expand.grid will be helpful to others!
dput(sample_n(subset(df1,select=c(product_title,c1)),50))
structure(list(product_title = c("Sriracha Hot Chili Sauce Single Packets 25 Count .25 oz each (3 Items Per Order, not per case)", "Duncan Hines Double Fudge Decadent Brownie Mix 17.6 oz by Duncan Hines", "Mikee Tropical Teriyaki Sauce, 20 oz, (Pack of 12)", "NESQUIK Strawberry Low Fat Milk 6-8 fl. oz. Bottles", "Dove Nutritive Solutions Conditioner, Coconut & Hydration 12 oz (Pack of 12)", "FLORATA 24\" Long Straight Velcro Wrap Around Ponytail Hair Extensions", "Bing Cherries, Dried (16 oz, ZIN: 527111) - 3-Pack", "San-J Tamari Brown Sesame Crackers, 3.7 oz (Pack of 12)", "PERDUE HARVESTLAND Breaded Chicken Breast Nugget (22 oz.)", "Fray Bentos Just Chicken Pie (425g) - Pack of 6", "Product of Thomas Coffee Regular Roast, Portion Packs (64 ct.) - Ground Coffee [Bulk Savings]", "Bombay Basmati Rice White, 2 LB (Pack of 12)", "Herbs for Kids, Sugar Free Elderberry Syrup, Cherry-Berry Flavor, 4 fl oz (pack of 3)", "Grain Millers BG13916 Grain Millers Rolled Oats No. 5 - 1x50LB", "Tuning Fork C 512 C512 SURGICAL MEDICAL INSTRUMENTS NEW", "Garnier Fructis Style Pure Clean Finishing Paste, All Hair Types, 2 oz. (Packaging May Vary) (Pack of 8)", "Stretch Island Organic Fruit Strips Grape -- 6 Pocket-Sized Fruit Strips pack of 6", "Torani Cinnamon Syrup 750ml", "JFC Nori Maki Arare Crackers 3 oz each (6 Items Per Order)", "FLORATA Ponytail Buns Wrap Bun Chignon Hair Extensions Wavy Curly Wedding Donut Hair Extensions Hairpiece Wig", "Kenra Platinum Hot Spray #20 8oz, PACK OF 8", "GBS Red and Black Shampoo Scalp Massage Brushes Plus 1 Soft Pocket Brush Made In USA 3 Pack Promotes Healthy Hair Growth Compliments Any Shampoo and Conditioner", "Clairol Professional Creme Permanent Developer - 20 volume (Size : 2 oz)", "Garnier Nutrisse Ultra Color Permanent Haircolor R3 Light Intense Auburn 1.0 ea(pack of 12)", "Kemps Swiss Style Chocolate Low Fat Milk, 1 gal", "Aussie Kids 3n1 Shampoo, Conditioner, & Bodywash with Pump Coral Reef Cupcake 29.2 oz.(pack of 4)", "Dequmana Gordal Olives, 12 Oz", "Duncan Hines Caramel Creamy Home-Style Frosting 16 Oz Canister", "Goya Goya Mole, 9 oz", "Fruit Roll-Ups Fruit Flavored Snacks Variety Pack (Pack of 16)", "Wild Huckleberry Mountain Huckleberry Barbecue Sauce", "La Flor Spicy Hot Seasoned Salt, 13 oz", "Clairol Nice n Easy Hair Color #79 Dark Brown, UK Loving Care (Pack of 3) + Beyond BodiHeat Patch, 1 Ct", "White Vinegar Liquid ''1 gallon, 4 Count, Liquid''", "Metallic Gold Dried Canella Berries - 6 oz Bunch", "La Flor Adobo All-Purpose Seasoning, 13 oz", "Marlos Bakeshop Marlos Bakeshop Biscotti, 1.25 oz", "Sam's Choice Frozen Burrito Bowl, Fajita Vegetable, 12.5 oz", "Conchita guava marmalade 14.1 oz Pack of 3", "HC Industries Kids Organics Kids Organics Shampoo, 12 oz", "6 Pack - Head & Shoulders Full & Thick 2-in-1 Anti-Dandruff Shampoo + Conditioner 32.1 oz", "Ice Breakers, Wintergreen Mints Tin, 1.5 Oz (Pack of 8)", "Mason Pearson - Boar Bristle & Nylon - Medium Junior Military Nylon & Bristle Hair Brush (Dark Ruby) -1pc", "Dove Nutritive Solutions Revival Cleansing Shampoo, 20.4 oz", "Boston's Best 12 Ct Jamaican Me Crazy", "Ultimate Baker Edible Glitter Mix It Up (1x3oz)", "Nori Maki Arare Rice Crackers with Seaweed 5 oz per Pack (1 Pack)", "H&S 2in1 MENS REFRESH POO 13.5oz-Pack of 5", "Keebler Club Mini Crackers, Multi-Grain, 11 Ounce (Pack of 20)", "Briess Sparkling Amber Liquid Malt Extract (30 Pound Pail)"),
c1 = c("Grocery", "Grocery", "Grocery", "Grocery", "Personal Care",
"Beauty", "Grocery", "Grocery", "Grocery", "Grocery", "Grocery",
"Grocery", "Grocery", "Grocery", "Beauty", "Beauty", "Grocery",
"Grocery", "Grocery", "Beauty", "Beauty", "Beauty", "Beauty",
"Beauty", "Grocery", "Beauty", "Grocery", "Grocery", "Grocery",
"Grocery", "Grocery", "Grocery", "Beauty", "Grocery", "Grocery",
"Grocery", "Grocery", "Grocery", "Grocery", "Personal Care",
"Beauty", "Grocery", "Beauty", "Beauty", "Grocery", "Grocery",
"Grocery", "Beauty", "Grocery", "Grocery")), row.names = c(16523L, 111871L, 28667L, 32067L, 8269L, 11076L, 50328L, 47200L, 99415L, 100031L, 39011L, 104854L, 29516L, 104643L, 3486L, 9689L, 52157L, 28995L, 47000L, 10895L, 3035L, 4992L, 3589L, 4276L, 32212L, 6055L, 22991L, 110279L, 27436L, 52282L, 14879L, 25710L, 6989L, 30133L, 51068L, 25490L, 45685L, 99073L, 18547L, 4991L, 5792L, 36241L, 10237L, 1430L, 40383L, 112458L, 46261L, 5875L, 46597L, 108099L ), class = "data.frame")
dput(sample_n(subset(df2,select=c(product_title,c1)),50))
structure(list(product_title = c("Drive Medical Heavy Duty Bariatric Plastic Seat Transfer Bench",
"Always Pure & Clean Ultra Thin Feminine Pads With Wings, Super Long",
"Patriot Candles Jar Candle Apple Clove Red", "Nature's Bounty Cardio-Health Probiotic Capsules",
"Finest Nutrition Biotin Plus Keratin", "Dr. Scholl's DuraGel Corn Remover",
"Humm Coconut Lime Kombucha 14 oz", "OneTouch Ultra Blue Test Strips",
"Kellogg's Rice Krispies Treats Bars M&M's", "Westbrae Natural Organic Chili Beans",
"Neutrogena Rapid Clear Acne Eliminating Spot Treatment Gel - 0.5 fl oz",
"Harris Bed Bug Killer", "Quart Storage Bags - 80ct - Up&Up cent (Compare to Ziploc Storage Bags)",
"Care Free Curl Gold Instant Curl Activator", "Purple Dessert Plate",
"Wexford Big Bubble Plastic Mailer 2", "L'Oreal Paris Advanced Haircare Total Repair Extreme Emergency Recovery Mask",
"Soap & Glory Spectaculips Matteallic Lip Cream Bronze Girl,Bronze Girl",
"No7 Instant Results Purifying Heating Mask - 2.5oz", "NuMe Classic Curling Wand",
"Revlon ColorSilk ColorStay Nourishing Conditioner Glowing Blonde",
"Weiman Lemon Oil Furniture Polish Lemon", "Dunkin' Donuts Ground Coffee Hazelnut",
"CocoaVia Cocoa Extract 375mg, Capsules", "Triple Paste AF Antifungal Ointment",
"Welch's Halloween Fruit Snacks 0.5oz 28 ct", "Studio 35 Purifying Natural Facial Wipes",
"Magnum Double Raspberry Mini Ice Cream Bars - 3ct", "CHI Twisted Fabric Finishing Paste",
"Creme Of Nature Argan Oil Intensive Conditioning Hair Treatment",
"Exergen Temporal Artery Thermometer", "Tolerex Formulated Liquid Diet Elemental Powder 6 Pack Unflavored",
"Gerber Nature Select 2nd Foods Nutritious Dinner Baby Food Chicken Noodle",
"Abreva Cold Sore Cream", "Super Macho Vitality and Stamina Dietary Supplement Softgel",
"M&M's Peanut Chocolates Halloween Ghoul's Mix - 3.27oz", "TruMoo protein milk cookies n' cream - 14 fl oz",
"DISNEY 25 Inch Plush Toy Assorted", "Beauty Infusion HYDRATING Manuka Honey & Collagen Sheet Mask",
"Edge Shave Gel, Twin Pack Sensitive Skin", "Haribo Sour Gold Bears Resealable Stand Up Pouch Pineapple",
"Jarrow Formulas Extra Virgin Coconut Oil, 1000mg, Softgels",
"Bliss Pore Patrol Oil-Free Hydrator with Willow Bark - 1.7oz",
"Airheads Candy Bites Watermelon", "Thrive Market Organic Sprouted Quinoa",
"Garnier Fructis Curl Stretch Loosening Pudding", "Systane Nighttime Lubricant Eye Ointment",
"SOHO Resort Organizer", "Enfamil Enfacare Lipil Infant Formula Powder",
"Fancy Feast Flaked Gourmet Cat Food Tuna"), c1 = c("Home Health Care Solutions",
"Personal Care", "Household Essentials", "Vitamin & Supplements",
"Vitamin & Supplements", "Personal Care", "Grocery", "Home Health Care Solutions",
"Grocery", "Grocery", "Beauty", "Household Essentials", "Household Essentials",
"Beauty", "Household Essentials", "Household Essentials", "Beauty",
"Beauty", "Beauty", "Beauty", "Beauty", "Household Essentials",
"Grocery", "Vitamin & Supplements", "Personal Care", "Grocery",
"Beauty", "Grocery", "Beauty", "Personal Care", "Personal Care",
"Home Health Care Solutions", "Grocery", "Personal Care", "Vitamin & Supplements",
"Grocery", "Grocery", "Baby, Kids & Toys", "Beauty", "Personal Care",
"Grocery", "Vitamin & Supplements", "Beauty", "Grocery", "Grocery",
"Beauty", "Personal Care", "Beauty", "Grocery", "Household Essentials"
)), row.names = c(39590L, 6987L, 13810L, 19403L, 26966L, 446L,
41599L, 28238L, 7622L, 19653L, 16458L, 18164L, 738L, 19819L,
43731L, 13310L, 17113L, 29729L, 29725L, 38903L, 25464L, 10048L,
42932L, 41179L, 37568L, 5830L, 14276L, 20526L, 31614L, 20119L,
40084L, 25978L, 1573L, 25121L, 3660L, 8850L, 10201L, 43313L,
17973L, 40423L, 10299L, 37320L, 32177L, 18491L, 32860L, 30439L,
24518L, 21579L, 24597L, 14687L), class = "data.frame")
Your idea is good. One realization of it would be:
library(stringdist)
df2$upc <- NA
df2$fuzz <- NA
for (ctg in unique(df2$c1)) {
  # distance matrix between the df1 and df2 titles of this category
  d <- stringdistmatrix(df1[df1$c1 == ctg, "product_title"], df2[df2$c1 == ctg, "product_title"], method = "jw")
  fuzz <- apply(d, 2, min)  # smallest distance for each df2 title
  passThr <- fuzz < 0.1     # only accept sufficiently close matches
  df2$fuzz[df2$c1 == ctg] <- fuzz
  df2$upc[df2$c1 == ctg][passThr] <- df1[df1$c1 == ctg, "upc"][apply(d, 2, which.min)][passThr]
}
So each row in df2 gets assigned the upc value from the df1 row whose product_title has the smallest distance to df2's product_title. How well this works will depend on the number of categories, length(unique(df2$c1)): the more of them, the faster the loop.
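To inspect what the loop produced, a quick check (assuming df1 carries the upc column as described):
head(df2[!is.na(df2$upc), c("product_title", "upc", "fuzz")])  # rows matched under the 0.1 threshold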
Consider extending your expand.grid approach: build a list of data frames of nested merge results, then row-bind them all at once outside the loop.
# Variable with categories to subset
categories <- c("Beauty", "Personal Care", "Grocery", "Household Essentials")
df_list <- vector("list", length = length(categories))
for (i in seq_along(categories)) {
  df1_sub <- subset(wmt, category == categories[i])
  df2_sub <- subset(m, category == categories[i])
  df1_titles <- unique(df1_sub$product_title)
  df2_titles <- unique(df2_sub$product_title)
  # CREATE GRID FOR THIS CATEGORY
  r <- expand.grid(df1_titles = df1_titles, df2_titles = df2_titles, stringsAsFactors = FALSE)
  r$dist <- stringdist(r$df1_titles, r$df2_titles, method = "jw")
  r <- r[order(r$dist), ]
  r <- r[!duplicated(r$df1_titles), ]
  r <- subset(r, dist < .10)
  # ASSIGN NESTED MERGE TO LIST ELEMENT
  df_list[[i]] <- merge(merge(r, df1, by.x = "df1_titles", by.y = "product_title"),
                        df2, by.x = "df2_titles", by.y = "product_title")
}
# ROW BIND ALL DF ELEMENTS
final_df <- do.call(rbind, df_list)
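A packaged alternative worth knowing about: the fuzzyjoin package wraps this match-by-string-distance pattern. A sketch (it still compares all pairs internally, so blocking by category first remains the key memory saver):
library(fuzzyjoin)
# join titles whose Jaro-Winkler distance is below 0.10; upc comes along from df1
matched <- stringdist_inner_join(df1, df2, by = "product_title",
                                 method = "jw", max_dist = 0.10,
                                 distance_col = "dist")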

Words in sentences and their nearest neighbors in a lexicon

I have the following data frame:
sent <- data.frame(words = c("just right size", "size love quality", "laptop worth price",
                             "price amazing user", "explanation complex what", "easy set",
                             "product best buy", "buy priceless when"),
                   user = c(1, 2, 3, 4, 5, 6, 7, 8))
The sent data frame looks like this:
words user
just right size 1
size love quality 2
laptop worth price 3
price amazing user 4
explanation complex what 5
easy set 6
product best buy 7
buy priceless when 8
I need to remove the word at the beginning of a sentence when it is the same as the word at the end of the previous sentence.
For example, we have the sentences "just right size" and "size love quality", so I need to remove the word "size" at the second user position.
Likewise, for the sentences "laptop worth price" and "price amazing user", I need to remove the word "price" at the fourth user position.
Can anyone help me? I'd appreciate any help. Thank you very much in advance.
You could extract the "first" word of each succeeding row and the "last" word of the current row from the "words" column using sub. If the words are the same, remove the first word from the succeeding row; otherwise keep it as is (ifelse(...)):
w1 <- sub(' .*', '', sent$words[-1])          # first word of rows 2..n
w2 <- sub('.* ', '', sent$words[-nrow(sent)]) # last word of rows 1..(n-1)
sent$words <- as.character(sent$words)
sent$words
#[1] "just right size" "size love quality"
#[3] "laptop worth price" "price amazing user"
#[5] "explanation complex what" "easy set"
#[7] "product best buy" "buy priceless when"
sent$words[-1] <- with(sent, ifelse(w1 == w2, sub('\\w+ ', '', words[-1]),
                                    words[-1]))
sent$words
#[1] "just right size" "love quality"
#[3] "laptop worth price" "amazing user"
#[5] "explanation complex what" "easy set"
#[7] "product best buy" "priceless when"
