Best Match of each element in of a string with many strings - r
Input Data:
a <- c("coca cola","hot coffee","Running Shoes","Table cloth",
”mobile phones under 5000”,”Amazon kindle”)
b <- c("running shoes","plastic cup","pizza","Let’s go to hill","motor van",
"coffee table","drinking coffee on a rainy day",”Best mobile phones under 10000”,
”kindle e-books”,”Coffee Cup”)
Match each word of each sentence of a vector (here vector a) to all strings in a separate vector(here vector b) word by word and get the best match.
Logic:
All sentence of vector “a” has to be matched with all sentences of vector “b” word by word and a percentage has to be calculated.
There can be only one best match per sentence of vector “a”.
Example 1: “Running Shoes” in vector “a” matched perfectly with “Running Shoes” in vector “b” and the percentage_match is 100% (since both the words matched)
Example 2: the best match of “hot coffee” may be “drinking coffee on a rainy day” or “coffee table” or “Coffee cup” and the percentage_match is 50% (since only “coffee”, matched out of “hot coffee” in all the cases). In such scenario, where there is more than one contender with same max percentage_match, we will choose the best match with the lowest string length i.e “coffee table” and “coffee cup” gets priority over “drinking coffee on a rainy day”. Even after doing this, there is a tie, we are free to choose any thing (i.e either of “Coffee Table” or “Coffee cup”, can be the best match for “hot coffee”.
Code Tried:
as <- strsplit(a, " ")
bs <- strsplit(b, " ")
matchFun <- function(x, y) length(intersect(x, y)) / length(x) * 100
mx <- outer(as, bs, Vectorize(matchFun))
m <- apply(mx, 1, which.max) # the maximum column of each row
z <- unlist(apply(mx, 1, function(x) x[which.max(x)])) # maximum percentage
z[z == 0] <- NA # this gives you the NA if you want it
data.frame(a, Matching_String=b[m], match_perc=z)
Problem faced: Since my actual data is very big (more than 2 million records are to be matched with 1 Mn record), this code takes forever.
Here's one way to do this using stringdistmatrix from package stringdist. Basically, we are calculating the distance between the strings in a and b. Then we keep the smallest distance. There will always be a match, even if the distance is a high number. One thing you could do is establish a minimum distance, or NA otherwise.
library(stringdist)
m <- stringdistmatrix(tolower(a), tolower(b), method = "qgram")
b[apply(m, 1, which.min)]
#[1] "plastic cup" "coffee table" "running shoes"
#[4] "coffee table" "best mobile phones under 10000" "kindle e-books"
Related
R: grep multiple strings at once
I have a data frame with 1 variable and 5,000 rows, where each element is a string. 1. "Am open about my feelings." 2. "Take charge." 3. "Talk to a lot of different people at parties." 4. "Make friends easily." 5. "Never at a loss for words." 6. "Don't talk a lot." 7. "Keep in the background." ..... 5000. "Speak softly." I need to find and output row numbers that correspond to 3 specific elements. Currently, I use the following: grep("Take charge." , df[,1]) grep("Make friends easily.", df[,1]) grep("Make friends easily.", df[,1]) And get the following output: [1] 2 [2] 4 [3] 5000 Question 1. Is there a way to make syntax more succinct, so I do not have to use grep and df[,1] on every single line? Questions 2. If so, how to output a single numerical array of the necessary row positions, so the result would look something like this? 2, 4, 5000 What I tried so far. grep("Take charge." , "Make friends easily.","Make friends easily.", df[,1]) # this didn't work I tried to create a vector, called m1, that contains all three elements and then grep(m1, df[,1]) # this didn't work either
Since these are exact matches use this where phrases is a character vector of the phrases you want to match: match(phrases, df[, 1]) This also works provided no phrase is a substring of another phrase: grep(phrases, df[, 1])
Find specific strings, count their frequency in a given text, and report it as a proportion of the number of words
Trying to write a function in R that would : 1) look through each observation's string variables 2) identify and count certain strings that the user defines 3) report the findings as a proportion of the total number of words each observation contains. Here's a sample dataset: df <- data.frame(essay1=c("OMG. american sign language. knee-slides in leather pants", "my face looks totally different every time. lol."), essay2=c("cheez-its and dried cranberries. sparkling apple juice is pretty\ndamned coooooool too.<br />\nas for music, movies and books: the great american authors, mostly\nfrom the canon, fitzgerald, vonnegut, hemmingway, hawthorne, etc.\nthen of course the europeans, dostoyevski, joyce, the romantics,\netc. also, one of the best books i have read is all quiet on the\nwestern front. OMG. I really love that. lol", "at i should have for dinner\nand when; some random math puzzle, which I loooooove; what it means to be alive; if\nthe meaning of life exists in the first place; how the !##$ can the\npolitical mess be fixed; how the %^&* can the education system\nbe fixed; my current game design project; my current writing Lol"), essay3=c("Lol. I enjoy life and then no so sure what else to say", "how about no?")) The furtherest I managed to get is this function: find.query <- function(char.vector, query){ which.has.query <- grep(query, char.vector, ignore.case = TRUE) length(which.has.query) != 0 } profile.has.query <- function(data.frame, query){ query <- tolower(query) has.query <- apply(data.frame, 1, find.query, query=query) return(has.query) } This allows the user to detect if a given value is in the 'essay' for a given used, but that's not enough for the three goals outlined above. What this function would ideally do is to count the number of words identified, then divide that count by the total count of words in the overall essays (row sum of counts for each user). Any advice on how to approach this?
Using the stringi package as in this post: How do I count the number of words in a text (string) in R? library(stringi) words.identified.over.total.words <- function(dataframe, query){ # make the query all lower-case query <- tolower(query) # count the total number of words total.words <- apply(dataframe, 2, stri_count, regex = "\\S+") # count the number of words matching query number.query <- apply(dataframe, 2, stri_count, regex = query) # divide the number of words identified by total words for each column final.result <- colSums(number.query) / colSums(total.words) return(final.result) } (The df in your question has each essay in a column, so the function sums each column. However, in the text of your question you say you want row sums. If the input data frame was meant to have one essay per row, then you can change the function to reflect that.)
Getting approximately unique values from character vector
Identifying unique values is straight forward when the data is well behaved. Here I am looking for an approach to get a list of approximately unique values from a character vector. Let x be a vector with slightly different names for an entity, e.g. Kentucky loader may appear as Kentucky load or Kentucky loader (additional info) or somewhat similar. x <- c("Kentucky load" , "Kentucky loader (additional info)", "CarPark Gifhorn (EAP)", "Car Park Gifhorn (EAP) new 1.5.2012", "Center Kassel (neu 01.01.2014)", "HLLS Bremen (EAP)", "HLLS Bremen (EAP) new 06.2013", "Hamburg total sum (abc + TBL)", "Hamburg total (abc + TBL) new 2012") What I what to get out is something like: c("Kentucky loader" , "Car Park Gifhorn (EAP)", "Center Kassel (neu 01.01.2014)", "HLLS Bremen (EAP)", "Hamburg total (abc + TBL)") Idea Calculate some similarity measure between all strings (e.g. Levenshtein distance) Use longest common subset method Somehow :( decide which strings belong together based on this information. But I guess this will be a standard task (for those R users working with "dirty" data regularly), so I assume there will be a set of standard approaches to it. Does someone have a hint or is there a package that does this?
As #Jaap said, try playing with OpenRefine. The data carpentry course is pretty good. If you do want to stay in R, here's a solution for your example, using agrepl: z <- sapply(x, function(z) agrepl(z, x, max.distance = 0.2)) apply(z, 1, function(myz) x[myz][which.min(nchar(x[myz]))]) Which gives the smallest match in chars found for each member of x: [1] "Kentucky load" "Kentucky load" "CarPark Gifhorn (EAP)" [4] "CarPark Gifhorn (EAP)" "Center Kassel (neu 01.01.2014)" "HLLS Bremen (EAP)" [7] "HLLS Bremen (EAP)" "Hamburg total sum (abc + TBL)" "Hamburg total sum (abc + TBL)" This is good if you want to keep order of your vector to match others (or use on a column of a dataframe). You can call unique on this output to get your desired output.
Counting positive smiles in string using R
In src$Review each row is filled with text in Russian. I want to count the number of positive smiles in each row. For example, in "My apricot is orange)) (for sure)" I want to count not just the quantity of outbound brackets (i.e., excluding general brackets in "(for sure)"), but the amount of positive smiling characters ("))" — at least two outbound brackets, number of ":)", ":-)"). So, it works only if at least two outbound brackets are exhibited. Assume there is a string "I love this girl!)))) (she makes me happy) every day:):) :-)!" Here we count: )))) (4 units), ":)" (2 units), ":-)" (1 unit). After we combine the number of units (i.e., 7). Pay attention that we don't count brackets in "(she makes me happy)". Now I have following code in my script: smilecounts <- str_count(src$Review, "[))]") It counts only the total amount of bracket pairs ("()") (as I understand comparing data set and derivation of this command). I only need the total amount of ":)", ":-)", "))" (the total number of outbound brackets which display as "))" in rows) to be counted. For example, in ")))))" appear 5 outbound brackets, the condition of at least two outbound brackets together is satisfied, than we count the total amount of brackets in this part of text (i.e., 5 outbound brackets). Thank you so much for help in advance.
We can use regex lookarounds to extract the ) that follows a ) or : or :=, then use length to get the count. length(str_extract_all(str1, '(?<=\\)|\\!)\\)')[[1]]) #[1] 4 length(str_extract_all(str1, '(?<=:)\\)')[[1]]) #[1] 2 length(str_extract_all(str1, '(?<=:-)\\)')[[1]]) #[1] 1 Or this can be done using a loop pat <- c('(?<=\\)|\\!)\\)', '(?<=:)\\)', '(?<=:-)\\)') sum(sapply(lapply(pat, str_extract_all, string=str1), function(x) length(unlist(x)))) #[1] 7 data str1 <- "I love this girl!)))) (she makes me happy) every day:):) :-)!"
One way with regexpr and regmatches: vec <- "I love this girl!)))) (she makes me happy) every day:):) :-)!" Solution: #matches the locations of :-) or ))+ or :) a <- gregexpr(':-)+|))+|:)+', vec) #extracts those b <- regmatches(vec, a)[[1]] b #[1] "))))" ":)" ":)" ":-)" #table counts the instances b )))) :-) :) 1 1 2 Then I suppose you could count the number of single )s using nchar(b[1]) [1] 4 Or in a more automated way: tab <- table(b) #the following means "if a name of the table consists only of ) then #count the number of )s" tab2 <- ifelse(gsub(')','', names(table(b)))=='', nchar(names(table(b))), table(b)) names(tab2) <- names(tab) > tab2 )))) :-) :) 4 1 2
lpSolve in R with Character and Column Sum Contraints
I have a list of rooms, the rooms' maximum square feet, programs, the programs' maximum square feet, and the values of how well a room matches (match #) the intended program use. With help, I have been able to maximize match # and square feet use for one program per room. However, I would like to take this analysis one step further and allow for multiple programs in the same room or multiples of the same program if it has the highest match #, so long as the multiples still fit within the square foot requirements. Moreover, I would like to tell lpSolve overall that I only want "x" number of Offices, "y" number of Studios, etc. throughout the entire building. Here is my data and code thus far: program.size <- c(120,320,300,800,500,1000,500,1000,1500,400,1500,2000) room.size <- c(1414,682,1484,2938,1985,1493,427,1958,708,581,1485,652,727,2556,1634,187,2174,205,1070,2165,1680,1449,1441,2289,986,298,590,2925) (obj.vals <- matrix(c(3,4,2,8,3,7,4,8,6,4,7,7, 3,4,2,8,3,7,4,8,6,4,7,7, 4,5,3,7,4,6,5,7,5,3,6,6, 2,3,1,7,2,6,3,7,7,5,6,6, 4,5,3,7,4,6,5,7,5,3,6,6, 3,6,4,8,5,7,4,8,7,7,7,7, 3,4,2,8,3,7,4,8,6,4,7,7, 4,5,3,7,4,6,5,7,5,3,6,6, 6,7,5,7,6,6,7,7,5,3,6,6, 6,7,5,7,6,6,7,7,5,3,6,6, 5,6,6,6,5,7,8,6,4,2,5,5, 6,7,5,7,6,6,7,7,5,3,6,6, 6,7,5,7,6,6,7,7,5,3,6,6, 3,4,4,8,3,9,6,8,6,4,7,7, 3,4,2,6,3,5,4,6,6,4,5,5, 4,5,3,5,4,4,5,5,5,3,4,4, 5,6,4,8,5,7,6,8,6,4,7,7, 5,6,4,8,5,7,6,8,6,4,7,7, 4,5,5,7,4,8,7,7,5,3,6,6, 5,6,4,8,5,7,6,8,6,4,7,7, 3,4,2,6,3,5,4,6,6,4,5,5, 5,6,4,8,5,7,6,8,6,4,7,7, 5,6,4,8,5,7,6,8,6,4,7,7, 5,4,4,6,5,5,6,6,6,6,7,5, 6,5,5,5,6,4,5,5,5,7,6,4, 4,5,3,7,4,6,5,7,7,5,6,6, 6,5,5,5,6,4,5,5,5,7,6,4, 3,4,4,6,3,7,6,6,6,4,5,5), nrow=12)) rownames(obj.vals) <- c("Enclosed Offices", "Open Office", "Reception / Greeter", "Studio / Classroom", "Conference / Meeting Room", "Gallery", "Public / Lobby / Waiting", "Collaborative Space", "Mechanical / Support", "Storage / Archives", "Fabrication", "Performance") (obj.adj <- obj.vals * outer(program.size, room.size, "<=")) nr <- nrow(obj.adj) nc <- ncol(obj.adj) library(lpSolve) obj <- as.vector(obj.adj) con <- t(1*sapply(1:nc, function(x) rep(1:nc == x, each=nr))) dir <- rep("<=", nc) rhs <- rep(1, nc) mod <- lp("max", obj, con, dir, rhs, all.bin=TRUE) final <- matrix(mod$solution, nrow=nr) And so now my question is how can I allow the solver to maximize square foot use and match # within each room (column) and allow either multiple of the same programs or a combination of programs to accomplish this? I know I would have to lift the "<= 1" restriction in "mod", but I can't figure out how to let it find the best fit in each room and then ultimately, overall. The solution that should come for room [,1] is: $optimum 33 And it's going to try to fit 11 Enclosed Offices within the room which scores a much higher optimum match # than 1 Collaborative Space (8 matches) and 1 Storage / Archives (4 matches) for a total of 12 matches. And so this leads to my next question about limiting the overall number of certain programs within my solution matrix. I assume it would include some kind of as.numeric(data$EnclosedOffices "<=" 5) but I also can't figure out how to limit that. These numbers would be different for all the programs. Thanks for any and all help and feel free to ask for any clarification. Update: Constraints Maximize the match # (obj.vals) for each room. Maximize the program.size square feet within each room.size square feet. Do this by either using the same program multiple times (5 x Enclosed Offices) or combining programs (1 Collaborative Space and 1 Performance). Restrict and/or force the number of programs returned in the solution. The programs can be split up among rooms so long as it doesn't go over the maximum number I provide for that program in all the rooms. (Only 5 Enclosed Offices, 8 Studios/Classrooms, 1 Fabrication, etc. can be selected across all 28 columns.
If you use the R package lpSolveAPI (a wrapper for lpSolve) then it becomes a little easier. First, take a look at the mathematical formulation (an Integer Program) and then I show you the code to solve your problem. Formulation Let X_r_p be the decision variable that takes on positive integer values. X_r_p = Number of programs of type p assigned to room r (In all your problem will have 28*12=336 decision variables) Objective Function Maximize matching score Max sum(r) sum(p) C_r_p * X_r_p # Here C_r_p is the score of assigning p to room r Subject to Room Area Limitation Constraint Sum(p) Max_area_p * X_r_p <= Room Size (r) for each room r (We will have 28 such constraints) Restrict the Number of programs Constraint Sum(r) X_r_p <= Max_allowable(p) for each program p (We will have 12 such constraints) X_r_p >= 0, Integer That is the formulation in all. 336 columns and 40 rows. Implementatin in R Here's an implementation in R, using lpSolveAPI. Note: Since the OP didn't provide the max_allowable programs in the building, I generated my own data for max_programs. program.size <- c(120,320,300,800,500,1000,500,1000,1500,400,1500,2000) room.size <- c(1414,682,1484,2938,1985,1493,427,1958,708,581,1485,652,727,2556,1634,187,2174,205,1070,2165,1680,1449,1441,2289,986,298,590,2925) (obj.vals <- matrix(c(3,4,2,8,3,7,4,8,6,4,7,7,3,4,2,8,3,7,4,8,6,4,7,7, 4,5,3,7,4,6,5,7,5,3,6,6,2,3,1,7,2,6,3,7,7,5,6,6, 4,5,3,7,4,6,5,7,5,3,6,6, 3,6,4,8,5,7,4,8,7,7,7,7,3,4,2,8,3,7,4,8,6,4,7,7, 4,5,3,7,4,6,5,7,5,3,6,6, 6,7,5,7,6,6,7,7,5,3,6,6,6,7,5,7,6,6,7,7,5,3,6,6, 5,6,6,6,5,7,8,6,4,2,5,5, 6,7,5,7,6,6,7,7,5,3,6,6, 6,7,5,7,6,6,7,7,5,3,6,6, 3,4,4,8,3,9,6,8,6,4,7,7, 3,4,2,6,3,5,4,6,6,4,5,5, 4,5,3,5,4,4,5,5,5,3,4,4, 5,6,4,8,5,7,6,8,6,4,7,7, 5,6,4,8,5,7,6,8,6,4,7,7, 4,5,5,7,4,8,7,7,5,3,6,6, 5,6,4,8,5,7,6,8,6,4,7,7, 3,4,2,6,3,5,4,6,6,4,5,5, 5,6,4,8,5,7,6,8,6,4,7,7, 5,6,4,8,5,7,6,8,6,4,7,7, 5,4,4,6,5,5,6,6,6,6,7,5, 6,5,5,5,6,4,5,5,5,7,6,4, 4,5,3,7,4,6,5,7,7,5,6,6, 6,5,5,5,6,4,5,5,5,7,6,4, 3,4,4,6,3,7,6,6,6,4,5,5), nrow=12)) rownames(obj.vals) <- c("Enclosed Offices", "Open Office", "Reception / Greeter", "Studio / Classroom", "Conference / Meeting Room", "Gallery", "Public / Lobby / Waiting", "Collaborative Space", "Mechanical / Support", "Storage / Archives", "Fabrication", "Performance") For each of the 12 programs, let's set a maximum number of repetitions that can be assigned to all rooms combined. Note this is something that I added, since this data was not provided by the OP. (Restrict them from getting assigned to too many rooms.) max_programs <- c(1,2,3,1,5,2,3,4,1,3,1,2) library(lpSolveAPI) nrooms <- 28 nprgs <- 12 ncol = nrooms*nprgs lp_matching <- make.lp(ncol=ncol) #we want integer assignments set.type(lp_matching, columns=1:ncol, type = c("integer")) # sum r,p Crp * Xrp set.objfn(lp_matching, obj.vals) #28 rooms * 12 programs lp.control(lp_matching,sense='max') #' Set Max Programs constraints #' No more than max number of programs over all the rooms #' X1p + x2p + x3p ... + x28p <= max(p) for each p Add_Max_program_constraint <- function (prog_index) { prog_cols <- (0:(nrooms-1))*nprgs + prog_index add.constraint(lp_matching, rep(1,nrooms), indices=prog_cols, rhs=max_programs[prog_index]) } #Add a max_number constraint for each program lapply(1:nprgs, Add_Max_program_constraint) #' Sum of all the programs assigned to each room, over all programs #' area_1 * Xr1+ area 2* Xr2+ ... + area12* Xr12 <= room.size[r] for each room Add_room_size_constraint <- function (room_index) { room_cols <- (room_index-1)*nprgs + (1:nprgs) #relevant columns for a given room add.constraint(lp_matching, xt=program.size, indices=room_cols, rhs=room.size[room_index]) } #Add a max_number constraint for each program lapply(1:nrooms, Add_room_size_constraint) To solve this: > solve(lp_matching) > get.objective(lp_matching) [1] 195 get.variables(lp_matching) # to see which programs went to which rooms > print(lp_matching) Model name: a linear program with 336 decision variables and 40 constraints You can also write the IP model to a file to examine it: #Give identifiable column and row names rp<- t(outer(1:nrooms, 1:nprgs, paste, sep="_")) rp_vec <- paste(abc, sep="") colnames<- paste("x_",rp_vec, sep="") # RowNames rownames1 <- paste("MaxProg", 1:nprgs, sep="_") rownames2 <- paste("Room", 1:nrooms, "AreaLimit", sep="_") dimnames(lp_matching) <- list(c(rownames1, rownames2), colnames) write.lp(lp_matching,filename="room_matching.lp") Hope that helps. Update based on follow-up questions Follow up Question 1: Alter the code to ensure that every room has at least one program. Add the following constraint set: X_r_p >= 1 for all r Note: Since this is a maximization problem, the optimal solution should honor this constraint by default. In other words, it will always assign a program to any room if it could, assuming positive scores for assigning. Follow up question 2: Another question is if I can ask it to have more than 28 programs in total? For instance, if I want 28 Enclosed Offices they almost all could fit in the one room with 2938 square feet. How then can I ask R to still find other programs if the max is set for 28? In order to achieve this goal, you can do it a bit differently. Do not have the sum of all programs <= 28 constraint at all. (If you note in the solution above, my constraints are slightly different.) The constraint: Sum(r) X_r_p <= Max_allowable(p) for each program p only limits the max per type of program. There is no limit on the total. Also, you don't have to write one such constraint for each type of program. Write this constraints only if you want to limit its occurrence. To generalize this, you can set the lower and upper bounds for the total of each type of programs. This will give you very fine control over the assignments. min_allowable(p) <= sum(over r) X_r_p <= max_allowable(p) for any program type p