Getting approximately unique values from character vector - r

Identifying unique values is straightforward when the data is well behaved. Here I am looking for an approach to get a list of approximately unique values from a character vector.
Let x be a vector with slightly different names for an entity, e.g. Kentucky loader may appear as Kentucky load, or Kentucky loader (additional info), or something similar.
x <- c("Kentucky load" ,
"Kentucky loader (additional info)",
"CarPark Gifhorn (EAP)",
"Car Park Gifhorn (EAP) new 1.5.2012",
"Center Kassel (neu 01.01.2014)",
"HLLS Bremen (EAP)",
"HLLS Bremen (EAP) new 06.2013",
"Hamburg total sum (abc + TBL)",
"Hamburg total (abc + TBL) new 2012")
What I want to get out is something like:
c("Kentucky loader" ,
"Car Park Gifhorn (EAP)",
"Center Kassel (neu 01.01.2014)",
"HLLS Bremen (EAP)",
"Hamburg total (abc + TBL)")
Idea
Calculate some similarity measure between all strings (e.g. Levenshtein distance)
Use the longest common substring method
Somehow :( decide which strings belong together based on this information.
But I guess this will be a standard task (for those R users working with "dirty" data regularly), so I assume there will be a set of standard approaches to it.
Does someone have a hint or is there a package that does this?

As @Jaap said, try playing with OpenRefine. The Data Carpentry course on it is pretty good.
If you do want to stay in R, here's a solution for your example, using agrepl:
z <- sapply(x, function(z) agrepl(z, x, max.distance = 0.2))
apply(z, 1, function(myz) x[myz][which.min(nchar(x[myz]))])
This gives the shortest match (in characters) found for each member of x:
[1] "Kentucky load" "Kentucky load" "CarPark Gifhorn (EAP)"
[4] "CarPark Gifhorn (EAP)" "Center Kassel (neu 01.01.2014)" "HLLS Bremen (EAP)"
[7] "HLLS Bremen (EAP)" "Hamburg total sum (abc + TBL)" "Hamburg total sum (abc + TBL)"
This is good if you want to keep the order of your vector to match others (or to use it on a column of a data frame).
You can call unique on this output to get your desired output.
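For example, combining both steps and de-duplicating (reusing z from above):
canonical <- apply(z, 1, function(myz) x[myz][which.min(nchar(x[myz]))])
unique(canonical)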

Related

Best Match of each element of a string with many strings

Input Data:
a <- c("coca cola","hot coffee","Running Shoes","Table cloth",
”mobile phones under 5000”,”Amazon kindle”)
b <- c("running shoes","plastic cup","pizza","Let’s go to hill","motor van",
"coffee table","drinking coffee on a rainy day",”Best mobile phones under 10000”,
”kindle e-books”,”Coffee Cup”)
Match each word of each sentence of a vector (here vector a) to all strings in a separate vector (here vector b), word by word, and get the best match.
Logic:
Every sentence of vector "a" has to be matched with all sentences of vector "b" word by word, and a percentage has to be calculated.
There can be only one best match per sentence of vector "a".
Example 1: "Running Shoes" in vector "a" matched perfectly with "running shoes" in vector "b", and the percentage_match is 100% (since both words matched).
Example 2: the best match of "hot coffee" may be "drinking coffee on a rainy day", "coffee table", or "Coffee Cup", and the percentage_match is 50% (since only "coffee" matched out of "hot coffee" in all these cases). In such a scenario, where there is more than one contender with the same maximum percentage_match, we choose the best match with the lowest string length, i.e. "coffee table" and "Coffee Cup" get priority over "drinking coffee on a rainy day". If even after doing this there is a tie, we are free to choose either one (i.e. either "coffee table" or "Coffee Cup" can be the best match for "hot coffee").
Code Tried:
as <- strsplit(a, " ")
bs <- strsplit(b, " ")
matchFun <- function(x, y) length(intersect(x, y)) / length(x) * 100
mx <- outer(as, bs, Vectorize(matchFun))
m <- apply(mx, 1, which.max) # the maximum column of each row
z <- unlist(apply(mx, 1, function(x) x[which.max(x)])) # maximum percentage
z[z == 0] <- NA # this gives you the NA if you want it
data.frame(a, Matching_String=b[m], match_perc=z)
Problem faced: Since my actual data is very big (more than 2 million records have to be matched against 1 million records), this code takes forever.
Here's one way to do this using stringdistmatrix from package stringdist. Basically, we calculate the distance between the strings in a and b, then keep the smallest distance for each element of a. There will always be a match, even if the distance is large; one thing you could do is set a distance threshold and return NA when no candidate is close enough.
library(stringdist)
m <- stringdistmatrix(tolower(a), tolower(b), method = "qgram")
b[apply(m, 1, which.min)]
#[1] "plastic cup" "coffee table" "running shoes"
#[4] "coffee table" "best mobile phones under 10000" "kindle e-books"

Efficient string similarity grouping

Setting:
I have data on people, and their parent's names, and I want to find siblings (people with identical parent names).
pdata<-data.frame(parents_name=c("peter pan + marta steward",
"pieter pan + marta steward",
"armin dolgner + jane johanna dough",
"jack jackson + sombody else"))
The expected output here would be a column indicating that the first two observations belong to family X, while the third and fourth observations are each in a separate family. E.g.:
person_id parents_name family_id
1 "peter pan + marta steward", 1
2 "pieter pan + marta steward", 1
3 "armin dolgner + jane johanna dough", 2
4 "jack jackson + sombody else" 3
Current approach:
I am flexible regarding the distance metric. Currently I use Levenshtein edit distance to match observations, allowing for two-character differences. Other variants, such as largest common substring, would be fine if they run faster.
For smaller subsamples I use stringdist::stringdist in a loop or stringdist::stringdistmatrix, but this is getting increasingly inefficient as sample size increases.
The matrix version explodes once a certain sample size is used. My terribly inefficient attempt at looping is here:
#create data of the same complexity using random last names
#(4 million obs and ~1-3 kids per parent)
pdata<-data.frame(parents_name=paste0(rep(c("peter pan + marta ",
"pieter pan + marta ",
"armin dolgner + jane johanna ",
"jack jackson + sombody "),1e6),stringi::stri_rand_strings(4e6, 5)))
for (i in 1:nrow(pdata)) {
similar_fatersname0<-stringdist::stringdist(pdata$parents_name[i],pdata$parents_name[i:nrow(pdata)],nthread=4)<2
#[create grouping indicator]
}
My question: There should be substantial efficiency gains, e.g. because I could stop comparing strings once I have found them to be sufficiently different in something that is easier to assess, e.g. string length or the first word. The string-length variant already works and reduces complexity by a factor of ~3, but that is far too little. Any suggestions to reduce computation time are appreciated.
Remarks:
The strings are actually in Unicode, not the Latin alphabet (Devanagari)
Pre-processing to drop unused characters etc is done
There are two challenges:
A. The parallel execution of the Levenshtein distance measurements, instead of a sequential loop
B. The number of comparisons: if our source list has 4 million entries, theoretically we should run 16 trillion Levenshtein distance measurements, which is unrealistic even if we resolve the first challenge.
To make my use of language clear, here are our definitions:
we want to measure the Levenshtein distance between expressions.
every expression has two sections, the parent A full name and the parent B full name, which are separated by a plus sign
the order of the sections matters (i.e. two expressions 1 and 2 are identical if Parent A of expression 1 = Parent A of expression 2 and Parent B of expression 1 = Parent B of expression 2. Expressions will not be considered identical if Parent A of expression 1 = Parent B of expression 2 and Parent B of expression 1 = Parent A of expression 2)
a section (or a full name) is a series of words, which are separated by spaces or dashes and correspond to the first name and last name of a person
we assume the maximum number of words in a section is 6 (your example has sections of 2 or 3 words, I assume we can have up to 6)
the sequence of words in a section matters (the section is always a first name followed by a last name and never the last name first, e.g. Jack John and John Jack are two different persons).
there are 4 million expressions
expressions are assumed to contain only English characters. Numbers, spaces, punctuation, dashes, and any non-English character can be ignored
we assume the easy matches are already done (like the exact expression matches) and we do not have to search for exact matches
Technically, the goal is to find series of matching expressions in the 4-million-expression list. Two expressions are considered matching if their Levenshtein distance is less than 2.
Practically, we create two lists, which are exact copies of the initial 4-million-expression list. We call them the Left list and the Right list. Each expression is assigned an expression id before duplicating the list.
Our goal is to find entries in the Right list which have a Levenshtein distance of less than 2 to entries of the Left list, excluding the same entry (same expression id).
I suggest a two-step approach to resolve the two challenges separately. The first step will reduce the list of possible matching expressions, the second will simplify the Levenshtein distance measurement since we only look at very close expressions. The technology used is any traditional database server, because we need to index the data sets for performance.
CHALLENGE A
Challenge A consists of reducing the number of distance measurements. We start from a maximum of approximately 16 trillion (4 million squared) and we should not exceed a few tens or hundreds of millions.
The technique to use here consists of searching for at least one similar word in the complete expression. Depending on how the data is distributed, this will dramatically reduce the number of possible matching pairs. Alternatively, depending on the required accuracy of the result, we can also search for pairs with at least two similar words, or with at least half of similar words.
Technically, I suggest putting the expression list in a table. Add an identity column to create a unique id per expression, and create 12 character columns. Then parse the expressions and put each word of each section in a separate column. This will look like the following (I have not shown all 12 columns, but the idea is below):
|id | expression | sect_a_w_1 | sect_a_w_2 | sect_b_w_1 |sect_b_w_2 |
|1 | peter pan + marta steward | peter | pan | marta |steward |
There are empty columns (since there are very few expressions with 12 words) but it does not matter.
Then we replicate the table and create an index on every sect... column.
We run 12 joins which try to find similar words, something like
SELECT L.id, R.id
FROM left_table L JOIN right_table R
ON L.sect_a_w_1 = R.sect_a_w_1
AND L.id <> R.id
We collect the output in 12 temp tables and run a union query over the 12 tables to get a short list of all expressions which have a potential matching expression with at least one identical word. This is the solution to challenge A. We now have a short list of the most likely matching pairs. This list will contain millions of records (pairs of Left and Right entries), but not billions.
CHALLENGE B
The goal of challenge B is to compute a simplified Levenshtein distance in batch (instead of running it in a loop).
First we should agree on what a simplified Levenshtein distance is.
First, we agree that the Levenshtein distance of two expressions is the sum of the Levenshtein distances of all the words of the two expressions which have the same index. I mean the Levenshtein distance of two expressions is the distance of their two first words, plus the distance of their two second words, etc.
Secondly, we need to invent a simplified Levenshtein distance. I suggest using an n-gram approach with only grams of 2 characters which have an index absolute difference of less than 2.
E.g. the distance between peter and pieter is calculated as below:
Peter
1 = pe
2 = et
3 = te
4 = er
5 = r_
Pieter
1 = pi
2 = ie
3 = et
4 = te
5 = er
6 = r_
Peter and Pieter have 4 common 2-grams with an index absolute difference of less than 2: 'et', 'te', 'er', 'r_'. There are 6 possible 2-grams in the larger of the two words, so the distance is 6 - 4 = 2. (For comparison, the true Levenshtein distance between Peter and Pieter is 1, a single insertion of 'i', so this measure only approximates it.)
This approximation will not work in all cases, but I think in our situation it will work very well. If we are not satisfied with the quality of the results, we can try 3-grams or 4-grams, or allow a gram sequence difference larger than 2. But the idea is to execute far fewer calculations per pair than in the traditional Levenshtein algorithm.
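For readers who prefer to check this arithmetic in R rather than SQL, here is a minimal sketch of the simplified 2-gram distance described above (my own illustration, not part of the original answer):
# build the padded 2-grams of a word, e.g. "peter" -> "pe" "et" "te" "er" "r_"
gram2 <- function(word) {
  w <- paste0(tolower(word), "_")
  substring(w, 1:(nchar(w) - 1), 2:nchar(w))
}
# count grams shared at positions whose indices differ by less than 2,
# then subtract from the gram count of the longer word
simple_gram_dist <- function(a, b) {
  ga <- gram2(a); gb <- gram2(b)
  common <- 0
  for (i in seq_along(ga)) {
    if (any(gb == ga[i] & abs(seq_along(gb) - i) < 2)) common <- common + 1
  }
  max(length(ga), length(gb)) - common
}
simple_gram_dist("peter", "pieter")  # 2, matching the worked example above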
Then we need to convert this into a technical solution. What I have done before is the following:
First, isolate the words: since we only need to measure the distance between words, and then sum these distances per expression, we can further reduce the number of calculations by running a distinct select on the list of words (we have already prepared the list of words in the previous section).
This approach requires a mapping table which keeps track of the expression id, the section id, the word id and the word sequence number for each word, so that the original expression distance can be calculated at the end of the process.
We then have a new list which is much shorter, and contains a cross join of all words for which the 2-gram distance measure is relevant.
Then we want to batch process this 2-gram distance measurement, and I suggest doing it in a SQL join. This requires a pre-processing step which consists of creating a new temporary table which stores every 2-gram in a separate row and keeps track of the word id, the word sequence and the section type.
Technically this is done by slicing the list of words using a series (or a loop) of substring selects, like this (assuming the word list tables, of which there are two copies, one Left and one Right, contain the 2 columns word_id and word):
INSERT INTO left_gram_table (word_id, gram_seq, gram)
SELECT word_id, 1 AS gram_seq, SUBSTRING(word,1,2) AS gram
FROM left_word_table
And then
INSERT INTO left_gram_table (word_id, gram_seq, gram)
SELECT word_id, 2 AS gram_seq, SUBSTRING(word,2,2) AS gram
FROM left_word_table
Etc.
Something which will make "steward" look like this (assume the word id is 152):
| pk | word_id | gram_seq | gram |
| 1 | 152 | 1 | st |
| 2 | 152 | 2 | te |
| 3 | 152 | 3 | ew |
| 4 | 152 | 4 | wa |
| 5 | 152 | 5 | ar |
| 6 | 152 | 6 | rd |
| 7 | 152 | 7 | d_ |
Don't forget to create an index on the word_id, the gram and the gram_seq columns, and the distance can be calculated with a join of the left and the right gram list, where the ON looks like
ON L.gram = R.gram
AND ABS(L.gram_seq - R.gram_seq) < 2
AND L.word_id <> R.word_id
The distance is the length of the longer of the two words minus the number of matching grams. SQL is extremely fast at such a query, and I think a simple computer with 8 GB of RAM would easily handle several hundred million rows in a reasonable time frame.
And then it is only a matter of joining the mapping table to calculate the sum of the word-to-word distances in every expression, to get the total expression-to-expression distance.
You are using the stringdist package anyway; does stringdist::phonetic() suit your needs? It computes the soundex code for each string, e.g.:
phonetic(pdata$parents_name)
[1] "P361" "P361" "A655" "J225"
Soundex is a tried-and-true method (almost 100 years old) for hashing names, and that means you don't need to compare every single pair of observations.
You might want to go further and do soundex on the first name and last name separately for the father and the mother.
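A possible sketch of that refinement, assuming the two parents are separated by " + " as in the example data (the splitting logic is my assumption, not part of the answer):
library(stringdist)
parts  <- strsplit(as.character(pdata$parents_name), " + ", fixed = TRUE)
father <- trimws(sapply(parts, `[`, 1))
mother <- trimws(sapply(parts, `[`, 2))
# soundex each parent separately and combine into one blocking key
key <- paste(phonetic(father), phonetic(mother), sep = "|")
# only rows sharing the same key need a full stringdist comparison
split(seq_along(key), key)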
My suggestion is to use a data science approach to identify only similar (same cluster) names to compare using stringdist.
I have modified the code that generates "parents_name" a little, adding more variability in the first and second names, in a scenario closer to reality.
num<-4e6
#Random length
random_l<-round(runif(num,min = 5, max=15),0)
#Random strings in the first and second name
parent_rand_first<-stringi::stri_rand_strings(num, random_l)
order<-sample(1:num, num, replace=F)
parent_rand_second<-parent_rand_first[order]
#Paste first and second name
parents_name<-paste(parent_rand_first," + ",parent_rand_second)
parents_name[1:10]
Here the real analysis starts: first, extract features from the names, such as overall length, length of the first name, length of the second name, and the number of vowels and consonants in both the first and second name (and any other feature of interest).
After that, bind all these features and cluster the data.frame into a large number of clusters (e.g. 1000), as sketched below.
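The feature vectors used in the cbind() below are not computed in the original answer; a minimal sketch, assuming the first and second names are separated by " + " as in the generated data:
parts       <- strsplit(parents_name, " + ", fixed = TRUE)
first_name  <- trimws(sapply(parts, `[`, 1))
second_name <- trimws(sapply(parts, `[`, 2))
# count the characters of a given class by deleting everything else
count_class <- function(s, class_chars) nchar(gsub(paste0("[^", class_chars, "]"), "", s))
nchars             <- nchar(parents_name)
nchars_first       <- nchar(first_name)
nchars_second      <- nchar(second_name)
nvowels_first      <- count_class(first_name,  "aeiouAEIOU")
nvowels_second     <- count_class(second_name, "aeiouAEIOU")
nconsonants_first  <- count_class(first_name,  "b-df-hj-np-tv-zB-DF-HJ-NP-TV-Z")
nconsonants_second <- count_class(second_name, "b-df-hj-np-tv-zB-DF-HJ-NP-TV-Z")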
features<-cbind(nchars,nchars_first,nchars_second,nvowels_first,nvowels_second,nconsonants_first,nconsonants_second)
n_clusters<-1000
clusters<-kmeans(features,centers = n_clusters)
Apply stringdistmatrix only inside each cluster (each containing similar pairs of names):
dist_matrix<-NULL
for(i in 1:n_clusters)
{
cluster_i<-clusters$cluster==i
parents_name<-as.character(parents_name[cluster_i])
dist_matrix[[i]]<-stringdistmatrix(parents_name,parents_name,"lv")
}
In dist_matrix you have the distance between each pair of elements in the cluster, and you can assign the family_id using this distance.
Computing the distances within each cluster takes approximately 1 second in this example (depending on the size of the cluster); all the distances are computed in about 15 minutes.
WARNING: dist_matrix grows very fast; in your code it is better to analyze it inside the for loop, extract the family_id, and then discard it.
You may improve performance by not comparing all pairs of lines.
Instead, create a new variable that helps you decide whether a comparison is worth making.
For example, create a new variable "score" containing the ordered list of letters used in parents_name (for example, if the name is "peter pan + marta steward" then the score will be "ademnprstw"), and calculate the distance only between lines whose scores match.
Of course, you can find a score that better fits your needs, and refine it a little to enable comparison when not all the letters used are common.
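A minimal sketch of the letter-signature idea (my own illustration; the family-id assignment within each group is left as in the other answers):
# signature: the sorted set of distinct letters appearing in each name
letter_score <- function(s) {
  sapply(strsplit(tolower(s), ""), function(ch)
    paste(sort(unique(ch[ch %in% letters])), collapse = ""))
}
pdata$score <- letter_score(as.character(pdata$parents_name))
# compare with stringdist only inside groups that share the same signature
library(stringdist)
for (sc in unique(pdata$score)) {
  idx <- which(pdata$score == sc)
  if (length(idx) > 1) {
    d <- stringdistmatrix(as.character(pdata$parents_name[idx]),
                          as.character(pdata$parents_name[idx]), method = "lv")
    # [assign family ids within this group using d]
  }
}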
I faced the same performance issue a couple of years ago. I had to match people's duplicates based on their typed names. My dataset had 200k names and the matrix approach exploded. After searching for about a day for a better method, the method I'm proposing here did the job for me in a few minutes:
library(stringdist)
parents_name <- c("peter pan + marta steward",
"pieter pan + marta steward",
"armin dolgner + jane johanna dough",
"jack jackson + sombody else")
person_id <- 1:length(parents_name)
family_id <- vector("integer", length(parents_name))
#Looping through unassigned family ids
while(sum(family_id == 0) > 0){
ids <- person_id[family_id == 0]
dists <- stringdist(parents_name[family_id == 0][1],
parents_name[family_id == 0],
method = "lv")
matches <- ids[dists <= 3]
family_id[matches] <- max(family_id) + 1
}
result <- data.frame(person_id, parents_name, family_id)
That way the while loop will compare fewer strings on every iteration. From there, you might implement different performance boosters, like filtering the names by the same first letter before comparing, etc.
Making equivalence groups on a non-transitive relation does not make sense. If A is like B and B is like C, but A is not like C, how would you make families from that? Using something like soundex (that was Neal Fultz's idea, not mine) seems the only meaningful option, and it solves your performance problem too.
What I have used to reduce the permutations involved in this sort of name matching is to create a function that counts the syllables in the name (surname) involved, then store this in the database as a pre-processed value. This becomes a Syllable Hash function.
Then you can choose to group together words with the same number of syllables. (Although I use algorithms that allow a 1 or 2 syllable difference, which may be presented as legitimate spelling / typo errors... my research has found that 95% of misspellings share the same number of syllables.)
In this case Peter and Pieter would have the same syllable count (2), while Jones and Smith have only 1 (for example).
If your function does not get 1 syllable for Jones, then you may need to increase your tolerance to allow for at least 1 syllable difference in the Syllable Hash function grouping that you use. (To account for incorrect syllable function results, and to catch the matching surname correctly in the grouping)
My syllable-counting function may not apply completely, as you might need to cope with non-English letter sets (so I have not pasted the code; it's in C anyway). Mind you, the syllable count function does not have to be accurate in terms of the TRUE syllable count; it simply needs to act as a reliable hashing function, which it does. It is far superior to SoundEx, which relies on the first letter being accurate.
Give it a go, you might be surprised how much improvement you get by implementing a Syllable Hash function. You may have to ask SO for help getting the function into your language.
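The answer's syllable counter is in C and not shown; a rough R sketch of the idea (approximating the syllable count by the number of vowel groups, which, as noted above, only needs to be a consistent hash rather than a true count):
# approximate syllable count: number of consecutive-vowel groups in the name
count_syllables <- function(name) {
  m <- gregexpr("[aeiouy]+", tolower(name))
  sapply(m, function(pos) sum(pos > 0))
}
# group surnames whose counts differ by at most 1, then run the expensive
# string comparison only within each group
count_syllables(c("peter", "pieter", "steward"))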
If I understand correctly, you want to compare every parent pair (every row in the parents_name data frame) with all other pairs (rows), and keep rows that have a Levenshtein distance smaller than or equal to 2.
I have written the following code as a starting point:
pdata<-data.frame(parents_name=c("peter pan + marta steward",
"pieter pan + marta steward",
"armin dolgner + jane johanna dough",
"jack jackson + sombody else"))
fuzzy_match <- list()
system.time(for (i in 1:nrow(pdata)){
fuzzy_match[[i]] <- cbind(pdata, parents_name_2 = pdata[i,"parents_name"],
dist = as.integer(stringdist(pdata[i,"parents_name"], pdata$parents_name)))
fuzzy_match[[i]] <- fuzzy_match[[i]][fuzzy_match[[i]]$dist <= 2,]
})
fuzzy_final <- do.call(rbind, fuzzy_match)
Does it return what you wanted?
It reproduces your output. I guess you will have to decide on partial matching criteria; I kept the default agrep ones.
pdata$parents_name<-as.character(pdata$parents_name)
x00<-unique(lapply(pdata$parents_name,function(x) agrep(x,pdata$parents_name)))
x=c()
for (i in 1:length(x00)){
x=c(x,rep(i,length(x00[[i]])))
}
pdata$person_id=seq(1:nrow(pdata))
pdata$family_id=x

Find specific strings, count their frequency in a given text, and report it as a proportion of the number of words

Trying to write a function in R that would:
1) look through each observation's string variables
2) identify and count certain strings that the user defines
3) report the findings as a proportion of the total number of words each observation contains.
Here's a sample dataset:
df <- data.frame(essay1=c("OMG. american sign language. knee-slides in leather pants", "my face looks totally different every time. lol."),
essay2=c("cheez-its and dried cranberries. sparkling apple juice is pretty\ndamned coooooool too.<br />\nas for music, movies and books: the great american authors, mostly\nfrom the canon, fitzgerald, vonnegut, hemmingway, hawthorne, etc.\nthen of course the europeans, dostoyevski, joyce, the romantics,\netc. also, one of the best books i have read is all quiet on the\nwestern front. OMG. I really love that. lol", "at i should have for dinner\nand when; some random math puzzle, which I loooooove; what it means to be alive; if\nthe meaning of life exists in the first place; how the !##$ can the\npolitical mess be fixed; how the %^&* can the education system\nbe fixed; my current game design project; my current writing Lol"),
essay3=c("Lol. I enjoy life and then no so sure what else to say", "how about no?"))
The furthest I managed to get is this function:
find.query <- function(char.vector, query){
which.has.query <- grep(query, char.vector, ignore.case = TRUE)
length(which.has.query) != 0
}
profile.has.query <- function(data.frame, query){
query <- tolower(query)
has.query <- apply(data.frame, 1, find.query, query=query)
return(has.query)
}
This allows the user to detect whether a given value is in the 'essay' for a given user, but that's not enough for the three goals outlined above. What this function would ideally do is count the number of words identified, then divide that count by the total count of words in the overall essays (row sum of counts for each user).
Any advice on how to approach this?
Using the stringi package as in this post:
How do I count the number of words in a text (string) in R?
library(stringi)
words.identified.over.total.words <- function(dataframe, query){
# make the query all lower-case
query <- tolower(query)
# count the total number of words
total.words <- apply(dataframe, 2, stri_count, regex = "\\S+")
# count the number of words matching the query (lower-casing the text so the
# match is case-insensitive, consistent with the lower-cased query above)
number.query <- apply(dataframe, 2, function(col) stri_count(tolower(col), regex = query))
# divide the number of words identified by total words for each column
final.result <- colSums(number.query) / colSums(total.words)
return(final.result)
}
(The df in your question has each essay in a column, so the function sums each column. However, in the text of your question you say you want row sums. If the input data frame was meant to have one essay per row, then you can change the function to reflect that.)
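If the data really is one essay per row, a sketch of the row-wise variant under the same assumptions (stringi loaded as above):
words.identified.over.total.words.byrow <- function(dataframe, query){
  query <- tolower(query)
  # total words and query matches per row (essay columns summed within each row)
  total.words  <- apply(dataframe, 1, function(row) sum(stri_count(row, regex = "\\S+")))
  number.query <- apply(dataframe, 1, function(row) sum(stri_count(tolower(row), regex = query)))
  number.query / total.words
}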

lpSolve in R with Character and Column Sum Constraints

I have a list of rooms, the rooms' maximum square feet, programs, the programs' maximum square feet, and the values of how well a room matches (match #) the intended program use. With help, I have been able to maximize match # and square feet use for one program per room. However, I would like to take this analysis one step further and allow for multiple programs in the same room or multiples of the same program if it has the highest match #, so long as the multiples still fit within the square foot requirements. Moreover, I would like to tell lpSolve overall that I only want "x" number of Offices, "y" number of Studios, etc. throughout the entire building. Here is my data and code thus far:
program.size <- c(120,320,300,800,500,1000,500,1000,1500,400,1500,2000)
room.size <- c(1414,682,1484,2938,1985,1493,427,1958,708,581,1485,652,727,2556,1634,187,2174,205,1070,2165,1680,1449,1441,2289,986,298,590,2925)
(obj.vals <- matrix(c(3,4,2,8,3,7,4,8,6,4,7,7,
3,4,2,8,3,7,4,8,6,4,7,7,
4,5,3,7,4,6,5,7,5,3,6,6,
2,3,1,7,2,6,3,7,7,5,6,6,
4,5,3,7,4,6,5,7,5,3,6,6,
3,6,4,8,5,7,4,8,7,7,7,7,
3,4,2,8,3,7,4,8,6,4,7,7,
4,5,3,7,4,6,5,7,5,3,6,6,
6,7,5,7,6,6,7,7,5,3,6,6,
6,7,5,7,6,6,7,7,5,3,6,6,
5,6,6,6,5,7,8,6,4,2,5,5,
6,7,5,7,6,6,7,7,5,3,6,6,
6,7,5,7,6,6,7,7,5,3,6,6,
3,4,4,8,3,9,6,8,6,4,7,7,
3,4,2,6,3,5,4,6,6,4,5,5,
4,5,3,5,4,4,5,5,5,3,4,4,
5,6,4,8,5,7,6,8,6,4,7,7,
5,6,4,8,5,7,6,8,6,4,7,7,
4,5,5,7,4,8,7,7,5,3,6,6,
5,6,4,8,5,7,6,8,6,4,7,7,
3,4,2,6,3,5,4,6,6,4,5,5,
5,6,4,8,5,7,6,8,6,4,7,7,
5,6,4,8,5,7,6,8,6,4,7,7,
5,4,4,6,5,5,6,6,6,6,7,5,
6,5,5,5,6,4,5,5,5,7,6,4,
4,5,3,7,4,6,5,7,7,5,6,6,
6,5,5,5,6,4,5,5,5,7,6,4,
3,4,4,6,3,7,6,6,6,4,5,5), nrow=12))
rownames(obj.vals) <- c("Enclosed Offices", "Open Office", "Reception / Greeter", "Studio / Classroom",
"Conference / Meeting Room", "Gallery", "Public / Lobby / Waiting",
"Collaborative Space", "Mechanical / Support", "Storage / Archives",
"Fabrication", "Performance")
(obj.adj <- obj.vals * outer(program.size, room.size, "<="))
nr <- nrow(obj.adj)
nc <- ncol(obj.adj)
library(lpSolve)
obj <- as.vector(obj.adj)
con <- t(1*sapply(1:nc, function(x) rep(1:nc == x, each=nr)))
dir <- rep("<=", nc)
rhs <- rep(1, nc)
mod <- lp("max", obj, con, dir, rhs, all.bin=TRUE)
final <- matrix(mod$solution, nrow=nr)
And so now my question is how can I allow the solver to maximize square foot use and match # within each room (column) and allow either multiple of the same programs or a combination of programs to accomplish this? I know I would have to lift the "<= 1" restriction in "mod", but I can't figure out how to let it find the best fit in each room and then ultimately, overall.
The solution that should come for room [,1] is:
$optimum
33
And it's going to try to fit 11 Enclosed Offices within the room which scores a much higher optimum match # than 1 Collaborative Space (8 matches) and 1 Storage / Archives (4 matches) for a total of 12 matches.
And so this leads to my next question about limiting the overall number of certain programs within my solution matrix. I assume it would include some kind of
as.numeric(data$EnclosedOffices "<=" 5)
but I also can't figure out how to limit that. These numbers would be different for all the programs.
Thanks for any and all help and feel free to ask for any clarification.
Update:
Constraints
Maximize the match # (obj.vals) for each room.
Maximize the program.size square feet within each room.size square feet. Do this by either using the same program multiple times (5 x Enclosed Offices) or combining programs (1 Collaborative Space and 1 Performance).
Restrict and/or force the number of programs returned in the solution. The programs can be split up among rooms so long as it doesn't go over the maximum number I provide for that program across all the rooms. (Only 5 Enclosed Offices, 8 Studios / Classrooms, 1 Fabrication, etc. can be selected across all 28 columns.)
If you use the R package lpSolveAPI (a wrapper for lpSolve) then it becomes a little easier.
First, take a look at the mathematical formulation (an Integer Program) and then I show you the code to solve your problem.
Formulation
Let X_r_p be the decision variable that takes on positive integer values.
X_r_p = Number of programs of type p assigned to room r
(In all your problem will have 28*12=336 decision variables)
Objective Function
Maximize matching score
Max sum(r) sum(p) C_r_p * X_r_p # Here C_r_p is the score of assigning p to room r
Subject to
Room Area Limitation Constraint
Sum(p) Max_area_p * X_r_p <= Room Size (r) for each room r
(We will have 28 such constraints)
Restrict the Number of programs Constraint
Sum(r) X_r_p <= Max_allowable(p) for each program p
(We will have 12 such constraints)
X_r_p >= 0, Integer
That is the formulation in all. 336 columns and 40 rows.
Implementation in R
Here's an implementation in R, using lpSolveAPI.
Note: Since the OP didn't provide the max_allowable programs in the building, I generated my own data for max_programs.
program.size <- c(120,320,300,800,500,1000,500,1000,1500,400,1500,2000)
room.size <- c(1414,682,1484,2938,1985,1493,427,1958,708,581,1485,652,727,2556,1634,187,2174,205,1070,2165,1680,1449,1441,2289,986,298,590,2925)
(obj.vals <- matrix(c(3,4,2,8,3,7,4,8,6,4,7,7,3,4,2,8,3,7,4,8,6,4,7,7,
4,5,3,7,4,6,5,7,5,3,6,6,2,3,1,7,2,6,3,7,7,5,6,6, 4,5,3,7,4,6,5,7,5,3,6,6,
3,6,4,8,5,7,4,8,7,7,7,7,3,4,2,8,3,7,4,8,6,4,7,7, 4,5,3,7,4,6,5,7,5,3,6,6,
6,7,5,7,6,6,7,7,5,3,6,6,6,7,5,7,6,6,7,7,5,3,6,6, 5,6,6,6,5,7,8,6,4,2,5,5,
6,7,5,7,6,6,7,7,5,3,6,6, 6,7,5,7,6,6,7,7,5,3,6,6, 3,4,4,8,3,9,6,8,6,4,7,7,
3,4,2,6,3,5,4,6,6,4,5,5, 4,5,3,5,4,4,5,5,5,3,4,4, 5,6,4,8,5,7,6,8,6,4,7,7,
5,6,4,8,5,7,6,8,6,4,7,7, 4,5,5,7,4,8,7,7,5,3,6,6, 5,6,4,8,5,7,6,8,6,4,7,7,
3,4,2,6,3,5,4,6,6,4,5,5, 5,6,4,8,5,7,6,8,6,4,7,7, 5,6,4,8,5,7,6,8,6,4,7,7,
5,4,4,6,5,5,6,6,6,6,7,5, 6,5,5,5,6,4,5,5,5,7,6,4, 4,5,3,7,4,6,5,7,7,5,6,6,
6,5,5,5,6,4,5,5,5,7,6,4, 3,4,4,6,3,7,6,6,6,4,5,5), nrow=12))
rownames(obj.vals) <- c("Enclosed Offices", "Open Office", "Reception / Greeter", "Studio / Classroom",
"Conference / Meeting Room", "Gallery", "Public / Lobby / Waiting",
"Collaborative Space", "Mechanical / Support", "Storage / Archives",
"Fabrication", "Performance")
For each of the 12 programs, let's set a maximum number of repetitions that can be assigned to all rooms combined. Note this is something that I added, since this data was not provided by the OP. (Restrict them from getting assigned to too many rooms.)
max_programs <- c(1,2,3,1,5,2,3,4,1,3,1,2)
library(lpSolveAPI)
nrooms <- 28
nprgs <- 12
ncol = nrooms*nprgs
lp_matching <- make.lp(ncol=ncol)
#we want integer assignments
set.type(lp_matching, columns=1:ncol, type = c("integer"))
# sum r,p Crp * Xrp
set.objfn(lp_matching, obj.vals) #28 rooms * 12 programs
lp.control(lp_matching,sense='max')
#' Set Max Programs constraints
#' No more than max number of programs over all the rooms
#' X1p + x2p + x3p ... + x28p <= max(p) for each p
Add_Max_program_constraint <- function (prog_index) {
prog_cols <- (0:(nrooms-1))*nprgs + prog_index
add.constraint(lp_matching, rep(1,nrooms), indices=prog_cols, rhs=max_programs[prog_index])
}
#Add a max_number constraint for each program
lapply(1:nprgs, Add_Max_program_constraint)
#' Sum of all the programs assigned to each room, over all programs
#' area_1 * Xr1+ area 2* Xr2+ ... + area12* Xr12 <= room.size[r] for each room
Add_room_size_constraint <- function (room_index) {
room_cols <- (room_index-1)*nprgs + (1:nprgs) #relevant columns for a given room
add.constraint(lp_matching, xt=program.size, indices=room_cols, rhs=room.size[room_index])
}
#Add a room size constraint for each room
lapply(1:nrooms, Add_room_size_constraint)
To solve this:
> solve(lp_matching)
> get.objective(lp_matching)
[1] 195
get.variables(lp_matching) # to see which programs went to which rooms
> print(lp_matching)
Model name:
a linear program with 336 decision variables and 40 constraints
You can also write the IP model to a file to examine it:
#Give identifiable column and row names
rp<- t(outer(1:nrooms, 1:nprgs, paste, sep="_"))
rp_vec <- paste(abc, sep="")
colnames<- paste("x_",rp_vec, sep="")
# RowNames
rownames1 <- paste("MaxProg", 1:nprgs, sep="_")
rownames2 <- paste("Room", 1:nrooms, "AreaLimit", sep="_")
dimnames(lp_matching) <- list(c(rownames1, rownames2), colnames)
write.lp(lp_matching,filename="room_matching.lp")
Hope that helps.
Update based on follow-up questions
Follow up Question 1: Alter the code to ensure that every room has at least one program.
Add the following constraint set:
Sum(p) X_r_p >= 1 for each room r
Note: Since this is a maximization problem, the optimal solution should honor this constraint by default. In other words, it will always assign a program to any room if it could, assuming positive scores for assigning.
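If you do want to enforce it explicitly, a sketch in the same lpSolveAPI style (reusing lp_matching, nrooms and nprgs from above):
#' At least one program per room: sum over p of X_r_p >= 1 for each room r
Add_min_room_constraint <- function (room_index) {
  room_cols <- (room_index-1)*nprgs + (1:nprgs)
  add.constraint(lp_matching, rep(1, nprgs), type = ">=", indices = room_cols, rhs = 1)
}
lapply(1:nrooms, Add_min_room_constraint)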
Follow up question 2: Another question is if I can ask it to have more than 28 programs in total? For instance, if I want 28 Enclosed Offices they almost all could fit in the one room with 2938 square feet. How then can I ask R to still find other programs if the max is set for 28?
In order to achieve this goal, you can do it a bit differently. Do not have the sum of all programs <= 28 constraint at all. (If you note in the solution above, my constraints are slightly different.)
The constraint:
Sum(r) X_r_p <= Max_allowable(p) for each program p
only limits the max per type of program. There is no limit on the total. Also, you don't have to write one such constraint for each type of program; write this constraint only if you want to limit a program's occurrence.
To generalize this, you can set the lower and upper bounds for the total of each type of programs. This will give you very fine control over the assignments.
min_allowable(p) <= sum(over r) X_r_p <= max_allowable(p) for any program type p

How to create a word grouping report using R language and .Net?

I would like to create a simple application in C# that takes in a group of words, then returns all groupings of those individual words from a data set.
For example, given car and bike, return a list of groups/combinations of words (with the number of combinations found) from a data set.
To further clarify - given a category named "car", I would like to see a list of word groupings with the word "car". This category could also be several words rather than just one.
With a sample data set of:
CAR:
Another car for sale
Blue car on the horizon
For Sale - used car
this car is painted blue
should return
car : for sale : 2
car : blue : 2
I'd like to set a threshold, say 20 or greater, so if there are 20 or more instances of the word(s) with car, then display them: category, words, count, where only the category is known; the words and count are determined by the algorithm.
The data set is in a SQL Server 2008 table, and I was hoping to use something like a .Net implementation of R to accomplish this.
I am guessing that the best way to accomplish this may be with the R programming language, and am only now looking at R.Net.
I would prefer to do this with .Net, as that is what I am most familiar with, but open to suggestions.
Can someone with some experience with this lead me in the right direction?
Thanks.
It seems your question consists of 4 parts:
Getting data from SQL Server 2008
Extracting substrings from a set of strings
Setting a threshold for when to accept that number
Producing some document or other output (?) containing this.
For 1, I think that's a different question (see the RODBC package), but I won't be dealing with that here as that's not the main part of your question. You've left 4. a little vague and I think that's also peripheral to the meat of your question.
Part 2 can be easily dealt with using regular expressions:
countstring <- function(string, pattern){
stringcount <- sum(grepl(pattern, string, ignore.case=TRUE), na.rm=TRUE)
paste(deparse(substitute(string)), pattern, stringcount, sep=" : ")
}
This function basically takes a vector of strings and a pattern to search for. It finds which of them match and sums the number that do (i.e. the count). It then prints these together in one string. For example:
car <- c("Another car for sale", "Blue car on the horizon", "For Sale - used car", "this car is painted blue")
countstring(car, "blue")
## [1] "car : blue : 2"
Part 3 requires a small change to the function
countstring <- function(string, pattern, threshold=20){
stringcount <- sum(grepl(pattern, string, ignore.case=TRUE), na.rm=TRUE)
if(stringcount >= threshold){
paste(deparse(substitute(string)), pattern, stringcount, sep=" : ")
}
}
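Usage sketch: with the default threshold of 20 the function returns nothing for the tiny sample above, so pass a lower threshold to see output:
countstring(car, "blue", threshold = 2)
## [1] "car : blue : 2"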
