Given a collection of consumers competing for a limited resource, allocate that resource to maximize its applicability - NP-complete

Sorry the question title isn't very clear; this is a challenging question to ask without providing a more concrete example. Consider the following scenario:
I have a number of friends whose birthdays are coming up on dates (d1..dn), and I've come up with a number of gifts I'd like to purchase for them, with costs (c1..cn). Unfortunately, I only have a fixed amount of money (m) that I can save per day towards purchasing these gifts. The question I'd like to ask is:
What is the ideal distribution of savings per gift (mi, where the sum of mi from 1..n == m) that minimizes the aggregate deviance between my friends' birthdays and the dates on which I'll have saved enough money to purchase each gift?
What I'm looking for is either a solution to this problem, or a mapping to a solved problem that I can utilize to deterministically answer this question. Thanks for pondering it, and let me know if I can provide any additional clarification!
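To make the objective concrete, here is a minimal sketch of one possible reading of "aggregate deviance". It assumes (these are assumptions, not part of the question) that birthdays are counted in days from today, that every gift is saved for concurrently at its own daily rate mi, and that the deviance is the sum of absolute differences in days. The function and variable names are hypothetical.

```python
def aggregate_deviation(costs, birthdays, rates):
    """Sum over gifts of |day the gift becomes affordable - birthday|.

    costs[i]     -- price of gift i
    birthdays[i] -- birthday of friend i, in days from today (assumption)
    rates[i]     -- money saved per day towards gift i (the m_i)
    """
    total = 0.0
    for cost, birthday, rate in zip(costs, birthdays, rates):
        if rate <= 0:
            # A gift that is never saved for is never affordable.
            return float("inf")
        ready_day = cost / rate  # day on which gift i is fully funded
        total += abs(ready_day - birthday)
    return total


costs = [30.0, 60.0]
birthdays = [10, 20]

# Two ways of splitting m = 6 per day between the two gifts.
even_split = aggregate_deviation(costs, birthdays, [3.0, 3.0])
skewed_split = aggregate_deviation(costs, birthdays, [4.0, 2.0])
print(even_split, skewed_split)
```

This only scores a given split; finding the best split is the actual optimization problem being asked about.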

I think you've stated a form of a knapsack problem with some additional complications - the knapsack problem is NP-Complete (p 247, Garey and Johnson). The basic knapsack problem is where you have a number of objects each with a volume and a value - you want to fill a knapsack of fixed volume with the objects to maximize the value without exceeding the knapsack capacity.
The fact that you have stages (days) and resources (money), and that the resources change by day while you decide what purchases to make, would lead me to a dynamic programming solution technique rather than a straight optimization model.
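For reference, the basic knapsack problem described above is itself a standard dynamic programming exercise. The sketch below solves the plain 0/1 version only, assuming integer volumes; it does not model the birthday dates or the daily savings, and the function name is hypothetical.

```python
def knapsack(volumes, values, capacity):
    """Best total value of a subset of objects whose volume fits in capacity.

    best[c] holds the best value achievable with total volume at most c,
    using only the objects processed so far.
    """
    best = [0] * (capacity + 1)
    for volume, value in zip(volumes, values):
        # Walk capacities downwards so each object is used at most once.
        for c in range(capacity, volume - 1, -1):
            best[c] = max(best[c], best[c - volume] + value)
    return best[capacity]


print(knapsack([1, 3, 4, 5], [1, 4, 5, 7], 7))
```

The table has capacity + 1 entries, so the running time grows with the numeric capacity rather than only with the number of objects, which is consistent with the problem being NP-complete in general.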
Could you clarify in comments "minimizing the deviance"? I'm not sure I understand that part.
BTW, mathoverflow.com is probably not helpful for this. If you look at algorithm questions, 50 on stackoverflow and 50 on mathoverflow, you'll find the questions (and answers) on stackoverflow have a lot more in common with the problem you are considering. There is a new site called OR Exchange, but there's not a lot of traffic there yet.

Related

Using OptaPlanner to create school time tables with some tricky constraints

I'm going to use OptaPlanner to lay out time tables for a school.
We're laying out the time tables for a full semester and every week could, if necessary, be slightly different.
There are some tricky constraints to take into account:
1. Weekly schedules
The lectures in one subject should be spread out somewhat evenly over the semester.
We can't for example put 20 math lectures the first week and "be done" with math for this semester.
In fact, it's nice to have some weekly predictability
"Science year 2 has biology on Tuesday mornings"
This constraint must not be carved in stone however. Some weeks have to include work experience sessions, PE excursions, etc, in which case they must deviate from other weeks.
Problem
If I create a constraint that, say, gives -1soft for not scheduling a subject at the same time as the previous week, then OptaPlanner will waste a lot of time before it "accidentally" finds a good placement for a lecture, and even if it manages to converge so that each subject is scheduled at the same time every week, it will never ever manage to move the entire series of lectures by moving them one by one. (That local optimum will never be escaped.)
2. Cross student group subjects
There's a large correlation between student groups and courses. For example, all students in Science year 2 mostly take the same courses: Chemistry for Science year 2, Biology for Science year 2, ...
The exception being language courses.
Each student can choose to study French, German or Spanish. So Spanish for year 2 is studied by a cross section of Science year 2 students, and Social Studies year 2 students, etc.
From the experience of previous (manual) scheduling, the optimal solution is almost guaranteed to schedule all language classes in the same time slots. (If French is scheduled at 9 on Thursdays, then German and Spanish can be scheduled "for free" at 9 on Thursdays.)
Problem
There are many time slots in one semester, and the chances that OptaPlanner will discover a solution where all language lectures are scheduled at the same time by randomly moving individual lectures is small.
Also, similarly to problem 1: If OptaPlanner does manage to schedule French, German and Spanish at the same time, these "blocks" will never be moved elsewhere, since they are individual lectures, and the chances that all lectures will "randomly" move to the same new slot is tiny. Even with a large Tabu history length and so on.
My thoughts so far
As for problem 1 ("Weekly predictability") I'm thinking of doing the following:
In the construction phase for the full-semester-schedule I create a reduced version of the problem, that schedules (a reduced set of lectures) into a single "template week". Let's call it a "single-week-pre-scheduling". This template week is then repeated in the construction of the initial solution of the full semester which is the "real" planning entity.
The local search steps will then only focus on inserting PE excursions etc, and adjusting the schedule for the affected weeks.
As for problem 2, I'm thinking that the solution to problem 1 might solve this. In a 1 week schedule, it seems reasonable to assume that OptaPlanner will realize that language classes should be scheduled at the same time.
Regarding the local optimum settled by the single-week-pre-scheduling ("Biology is scheduled on Tuesday mornings"), I imagine that I could create a custom move operation that "bundles" these lectures into a single move. I have no idea how simple this is. I would really like to keep the code as simple as possible.
Questions
Are my thoughts reasonable? Is there a more clever way to approach these problems? If I have to create custom moves anyways, perhaps I don't need to construct a template-week?
Is there a way to assign hints or weights to moves? If so, I could perhaps generate moves with slightly larger weight that adjust the scheduling to adhere to predictable weeks and to language classes scheduled in the same time slots.
A question well asked!
With regards to your first problem, I suggest you take a look at OptaWeb Employee Rostering and the concept of rotations. A rotation is "how things generally are" and then Planner has the freedom to diverge from the rotation at a penalty. Once you understand the concept of the rotation from the UI, take a look at the planning entity Shift and how the rotation is implemented with the use of employee and rotationEmployee variables. Note that only the employee is an actual @PlanningVariable, with the rotationEmployee being fixed.
That means that you have to define your rotations manually, therefore doing the work of the solver yourself. However, since this operation is only done once a semester I assume, maybe the solution could be to have a simpler solver generate a reasonable general rotation first, and then a second solver would take it and figure out the specific necessary adjustments?
With regards to your second problem, rotations could help there too. But I'm thinking maybe some move filtering and custom moves to help OptaPlanner to either move all language classes, or none? Writing efficient custom moves is not easy, and filtering stock moves is cumbersome. So I would only do it when the potential of other options is exhausted. If you end up doing this, look for MoveIteratorFactory.
My answer is a little vague, as we do not get into the specifics of the domain model, but for the purposes of designing the overall solution, it hopefully gives enough clues.

R: Clustering customers based on similar product interests for an event

I have a dataset with a list of customers and their product preferences. Basically, it is a simple CSV with a column called "CUSTOMER" and 5 other columns called "PRODUCT_WANTED_A", "PRODUCT_WANTED_B" and so on.
I asked these customers if they were interested to know more about a particular product, and answers could be simply YES or NO (1 or 0 in the dataset). The dataset can be downloaded here. Obviously, there will be customers with many different interests, based on the mix of their YES or NO in these 5 columns.
My goal is to understand which customers are similar to others in such interests. This will help me manage an agenda of product presentations and, in each meeting, I would like to understand the best grouping for it. I started with a hierarchical plot like this:
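# Read the survey answers (one row per customer, 0/1 per product column).
customer_list <- read.csv("customers_products_wanted.csv", sep = ",", header = TRUE)
# Cluster on the product columns only; the CUSTOMER id column is not numeric.
product_cols <- setdiff(names(customer_list), "CUSTOMER")
customer.hclust <- hclust(dist(customer_list[, product_cols]))
plot(customer.hclust, labels = as.character(customer_list$CUSTOMER))
# rect.hclust() ships with the stats package, so no library() call is needed.
rect.hclust(customer.hclust, k = 5)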
This is the plot I got, asking for 5 clusters:
Tried the same, but with 10 clusters:
Question 1: I know it's always hard to tell, but looking at the charts and dataset, what would be your 'cut' to group customers? 5? 10?
I was reviewing the results, and in the same group, I had CUSTOMER112 with 1,0,1,0,1 as their preferences together with CUSTOMER110 (1,1,1,1,1), CUSTOMER106 (1,1,1,1,0) and so on. The "distance" may be right, but in a given group I have customers with some relevant differences in their preferences.
Question 2: I don't know if it's a case of total ignorance about clustering, the code I used or even the dataset. Based on your experience, what would be your approach for the best clustering in this case?
Any comments will be highly appreciated. As you can see, I made some effort, but I'm still in doubt.
Thanks a lot!
Ricardo
All answers were important, but with @Ben's video recommendation and @Samuel Tan's advice on breaking the customers into grids, I found a good way to handle it.
The video gave me a lot of insights about "noisy" variables in hierarchical clustering, and the grid recommendation helped me think on what the data is really trying to tell me.
That said, a basic data cleaning process eliminated all customers with no interest in any product (this is obvious, but I didn't pay attention to it at first). Then, I ignored customers with a specific interest (a single product). This was done because these customers wouldn't need to attend the workshop series I'm planning (they just want to hear about one product).
Evaluating all the others, interested in more than one product, I realized the product mix could point me to a better classification. From there, I grouped customers into 3 clusters: integration opportunities (2 or 3 products), convergence opportunities (4 products) and transformation opportunities (all products).
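A minimal sketch of that rule-based grouping, assuming each customer is a tuple of five 0/1 answers as in the dataset described in the question. The function name and the `None` return for excluded customers are illustrative choices, not part of the original analysis.

```python
def opportunity_group(preferences):
    """Map a customer's 0/1 product answers to a workshop group.

    Returns None for customers who were excluded from the workshops
    (no interest at all, or interest in a single product only).
    """
    wanted = sum(preferences)
    if wanted <= 1:
        return None
    if wanted <= 3:
        return "integration"      # 2 or 3 products
    if wanted < len(preferences):
        return "convergence"      # 4 products when 5 are offered
    return "transformation"       # all products


customers = {
    "CUSTOMER112": (1, 0, 1, 0, 1),
    "CUSTOMER110": (1, 1, 1, 1, 1),
    "CUSTOMER106": (1, 1, 1, 1, 0),
}
for name, prefs in customers.items():
    print(name, opportunity_group(prefs))
```

With this rule the three customers that the hierarchical clustering had put in one group end up in three different groups.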
Now it's clear to me which customers I should focus on for my workshops, and plan my post-workshop sales campaigns leveraging materials that target each customer group (integration, convergence, transformation).
Thanks for all the advice!
Ricardo

How to mitigate against bandwagon effect (voting behavior) in my ranking system?

What I mean by bandwagon effect is the following:
Already top-ranked items have a higher tendency to get voted on at all, possibly even to get upvoted.
What I am hoping to get is some concrete recommendations, at best based on your practical experience with a mathematical formula and in which situation it helped.
However, any useful pointers are more than welcome!
My ranking system
Please consider a ranking system at a website that has a reputation system and where users cast only upvotes on items and the ranking table is reset to start fresh every month.
Every user has one upvote per item within each month, and there is a reward for users who, within a certain month, upvoted an item that made it into the top ranks at the end of that month.
Users are told the following about what increases the weight of their upvote:
1)... the more reputation you have at the time of upvoting
2)... the fewer items you upvote within the current month (including the current upvote)
3)... the fewer upvotes that item already has within the current month before your own upvote
The ranking table is recalculated once a day and is visible to all.
Goal
I'd like to implement part 3) in an effort to correct the ranks of items where one cannot tell if some users just upvoted it because of the bandwagon effect (those users might hope to gain a "tactical" advantage simply by voting for what they perceive lots of other users have already upvoted).
Also, I hope to mitigate this way against the possible use of sock puppets that managed to attain some reputation, but upvote the same item or group of items.
Question
Is there a (maybe even tested?) mathematical formula that I could just apply on the time-ordered list of upvotes for each item to get a coefficient for each of those upvotes so that their weights will be corrected in a sensible fashion?
I'm thinking it's got to be something of a logarithmic function but I can't quite get a grip on it...
Thank you!
Edit
Zack says: "beyond a certain level of popularity, additional upvotes decrease the probability that something will be displayed"
To further clarify: what I am after is which actual mathematical approaches are worth trying out that will, in the form of a mathematical function, translate this decrease in pop (i.e., apply coefficients to the weights, see above) in a sensible, balanced manner.
My hope is someone has practical experience with such approaches in a similar or general situation to the one above.
Consider applying the "Indie Rock Peter Principle": beyond a certain level of popularity, additional upvotes decrease the probability that something will be displayed.
Term coined by Leonard Richardson in this paper. Indie Rock Peter is of course from Diesel Sweeties.
I have always disliked the bandwagon effect in voting systems, especially "most viewed" rankings in which simply clicking on a highly ranked item increases its rank. My solution to this problem, which I have never tested or seen implemented, would be to keep track of how an item was reached (and then voted for), and ignore (or greatly decrease the weight of) votes that came from any sorted-by-ranking page.
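Coming back to the logarithmic idea from the question, here is one untested sketch of a per-upvote coefficient. The specific formula 1 / (1 + ln(1 + k)), where k is the number of upvotes the item already had that month, is an assumption chosen only because it starts at 1 and decays slowly; it is not taken from any of the answers or from a known ranking system, and the function names are hypothetical.

```python
import math


def upvote_coefficients(num_upvotes):
    """Coefficient for each upvote in time order.

    The k-th earlier upvote count gives 1 / (1 + ln(1 + k)), so the
    first upvote (k = 0) keeps its full weight and later ones count less.
    """
    return [1.0 / (1.0 + math.log1p(k)) for k in range(num_upvotes)]


def corrected_score(base_weights):
    """Sum of time-ordered upvote weights after applying the coefficients.

    base_weights are the weights from rules 1) and 2), oldest upvote first.
    """
    coefficients = upvote_coefficients(len(base_weights))
    return sum(w * c for w, c in zip(base_weights, coefficients))


print(upvote_coefficients(5))
print(corrected_score([1.0, 1.0, 1.0]))
```

Because the coefficients never reach zero, late upvotes still count; swapping in a steeper function (for example 1 / (1 + k)) would punish late voters much harder, so the choice needs tuning against real voting data.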

How do you estimate a ROI for clearing technical debt?

I'm currently working with a fairly old product that's been saddled with a lot of technical debt from poor programmers and poor development practices in the past. We are starting to get better and the creation of technical debt has slowed considerably.
I've identified the areas of the application that are in bad shape and I can estimate the cost of fixing those areas, but I'm having a hard time estimating the return on investment (ROI).
The code will be easier to maintain and will be easier to extend in the future but how can I go about putting a dollar figure on these?
A good place to start looks like going back into our bug tracking system and estimating costs based on bugs and features relating to these "bad" areas. But that seems time consuming and may not be the best predictor of value.
Has anyone performed such an analysis in the past and have any advice for me?
Managers care about making $ through growth (first and foremost e.g. new features which attract new customers) and (second) through optimizing the process lifecycle.
Looking at your problem, your proposal falls in the second category: this will undoubtedly fall behind goal #1 (and thus get prioritized down even if this could save money... because saving money implies spending money (most of the time at least ;-))).
Now, putting a $ figure on the "bad technical debt" could be turned around into a more positive spin (assuming that the following applies in your case): "if we invest in reworking component X, we could introduce feature Y faster and thus get Z more customers".
In other words, evaluate the cost of technical debt against cost of lost business opportunities.
Sonar has a great plugin (technical debt plugin) to analyze your source code to look for just such a metric. While you may not specifically be able to use it for your build, as it is a Maven tool, it should provide some good metrics.
Here is a snippet of their algorithm:
Debt(in man days) =
    cost_to_fix_duplications +
    cost_to_fix_violations +
    cost_to_comment_public_API +
    cost_to_fix_uncovered_complexity +
    cost_to_bring_complexity_below_threshold
Where:
    Duplications = cost_to_fix_one_block * duplicated_blocks
    Violations = cost_to_fix_one_violation * mandatory_violations
    Comments = cost_to_comment_one_API * public_undocumented_api
    Coverage = cost_to_cover_one_of_complexity * uncovered_complexity_by_tests (80% of coverage is the objective)
    Complexity = cost_to_split_a_method * (function_complexity_distribution >= 8) + cost_to_split_a_class * (class_complexity_distribution >= 60)
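As a rough illustration, the formula can be written out as a small function. The dictionary keys are hypothetical names that mirror the terms above, and the unit costs in the example are made-up placeholders rather than the plugin's defaults; this is a sketch of the arithmetic, not the plugin's implementation.

```python
def technical_debt_days(metrics, unit_costs):
    """Add up the five debt terms of the formula, in man days."""
    duplications = unit_costs["fix_one_block"] * metrics["duplicated_blocks"]
    violations = unit_costs["fix_one_violation"] * metrics["mandatory_violations"]
    comments = unit_costs["comment_one_api"] * metrics["public_undocumented_api"]
    coverage = (unit_costs["cover_one_of_complexity"]
                * metrics["uncovered_complexity_by_tests"])
    # Counts of methods with complexity >= 8 and classes with complexity >= 60.
    complexity = (unit_costs["split_a_method"] * metrics["methods_with_complexity_ge_8"]
                  + unit_costs["split_a_class"] * metrics["classes_with_complexity_ge_60"])
    return duplications + violations + comments + coverage + complexity


# Placeholder unit costs in man days (assumptions for the example only).
unit_costs = {
    "fix_one_block": 0.25,
    "fix_one_violation": 0.125,
    "comment_one_api": 0.25,
    "cover_one_of_complexity": 0.25,
    "split_a_method": 0.5,
    "split_a_class": 8.0,
}
metrics = {
    "duplicated_blocks": 4,
    "mandatory_violations": 8,
    "public_undocumented_api": 4,
    "uncovered_complexity_by_tests": 8,
    "methods_with_complexity_ge_8": 2,
    "classes_with_complexity_ge_60": 1,
}
print(technical_debt_days(metrics, unit_costs))
```

Multiplying the result by a daily developer cost gives the dollar figure for the "cost to fix" side; it says nothing yet about the return.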
I think you're on the right track.
I've not had to calculate this but I've had a few discussions with a friend who manages a large software development organisation with a lot of legacy code.
One of the things we've discussed is generating some rough effort metrics from analysing VCS commits and using them to divide up a rough estimate of programmer hours. This was inspired by Joel Spolsky's Evidence-based Scheduling.
Doing such data mining would allow you to also identify clustering of when code is being maintained and compare that to bug completion in the tracking system (unless you are already blessed with a tight integration between the two and accurate records).
Proper ROI needs to calculate the full Return, so some things to consider are:
- decreased cost of maintenance (obviously)
- opportunity cost to the business of downtime or missed new features that couldn't be added in time for a release
- ability to generate new product lines due to refactorings
Remember, once you have a rule for deriving data, you can have arguments about exactly how to calculate things, but at least you have some figures to seed discussion!
I can only speak to how to do this empirically in an iterative and incremental process.
You need to gather metrics to estimate your demonstrated best cost/story-point. Presumably, this represents your system just after the initial architectural churn, when most of design trial-and-error has been done but entropy has had the least time to cause decay. Find the point in the project history when velocity/team-size is the highest. Use this as your cost/point baseline (zero-debt).
Over time, as technical debt accumulates, the velocity/team-size begins to decrease. The percentage decrease of this number with respect to your baseline can be translated into "interest" being paid on each new story point. (This is really interest paid on technical and knowledge debt)
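A minimal sketch of that calculation, assuming you have story points completed and team size per iteration. The function name and the sample numbers are hypothetical.

```python
def debt_interest(iterations):
    """Interest per iteration relative to the best demonstrated velocity.

    iterations -- list of (story_points_completed, team_size) tuples.
    Returns a list of fractions: 0.0 at the baseline iteration, and for
    example 0.25 when velocity per person is 25% below the baseline.
    """
    per_person = [points / team_size for points, team_size in iterations]
    baseline = max(per_person)  # the "zero-debt" reference point
    return [(baseline - velocity) / baseline for velocity in per_person]


history = [(40, 4), (36, 4), (30, 4)]
print(debt_interest(history))
```

Here the third iteration delivers 7.5 points per person against a baseline of 10, so a quarter of the capacity is being spent on "interest".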
Disciplined refactoring and annealing causes the interest on technical debt to stabilize at some value higher than your baseline. Think of this as the steady-state interest the product owner pays on the technical debt in the system. (The same concept applies to knowledge debt).
Some systems reach the point where the cost + interest on each new story point exceeds the value of the feature point being developed. This is when the system is bankrupt, and it's time to rewrite the system from scratch.
I think it's possible to use regression analysis to tease apart technical debt and knowledge debt (but I haven't tried it). For example, if you assume that technical debt correlates closely with some code metrics, e.g. code duplication, you could determine the degree to which the interest being paid is increasing because of technical debt versus knowledge debt.
+1 for jldupont's focus on lost business opportunities.
I suggest thinking about those opportunities as perceived by management. What do they think affects revenue growth -- new features, time to market, product quality? Relating debt paydown to those drivers will help management understand the gains.
Focusing on management perceptions will help you avoid false numeration. ROI is an estimate, and it is no better than the assumptions made in its estimation. Management will suspect solely quantitative arguments because they know there's some qualitative in there somewhere. For example, over the short term the real cost of your debt paydown is the other work the programmers aren't doing, rather than the cash cost of those programmers, because I doubt you're going to hire and train new staff just for this. Are the improvements in future development time or quality more important than features these programmers would otherwise be adding?
Also, make sure you understand the horizon for which the product is managed. If management isn't thinking about two years from now, they won't care about benefits that won't appear for 18 months.
Finally, reflect on the fact that management perceptions have allowed this product to get to this state in the first place. What has changed that would make the company more attentive to technical debt? If the difference is you -- you're a better manager than your predecessors -- bear in mind that your management team isn't used to thinking about this stuff. You have to find their appetite for it, and focus on those items that will deliver results they care about. If you do that, you'll gain credibility, which you can use to get them thinking about further changes. But appreciation of the gains might be a while in growing.
Being a mostly lone or small-team developer this is out of my field, but to me a great solution to find out where time is wasted is very, very detailed timekeeping, for example with a handy task-bar tool like this one that can even filter out when you go to the loo, and can export everything to XML.
It may be cumbersome at first, and a challenge to introduce to a team, but if your team can log every fifteen minutes they spend due to a bug, mistake or misconception in the software, you accumulate a basis of impressive, real-life data on what technical debt is actually costing in wages every month.
The tool I linked to is my favourite because it is dead simple (doesn't even require a data base) and provides access to every project/item through a task bar icon. Also entering additional information on the work carried out can be done there, and timekeeping is literally activated in seconds. (I am not affiliated with the vendor.)
It might be easier to estimate the amount it has cost you in the past. Once you've done that, you should be able to come up with an estimate for the future with ranges and logic even your bosses can understand.
That being said, I don't have a lot of experience with this kind of thing, simply because I've never yet seen a manager willing to go this far in fixing up code. It has always just been something we fix up when we have to modify bad code, so refactoring is effectively a hidden cost on all modifications and bug fixes.

Algorithm for similarity (of topic) of news items

I want to determine the similarity of the content of two news items, similar to Google News but different in the sense that I want to be able to determine what the basic topics are and then determine what topics are related.
So if an article was about Saddam Hussein, then the algorithm might recommend something about Donald Rumsfeld's business dealings in Iraq.
If you can just throw around key words like k-nearest neighbours and a little explanation about why they work (if you can) I will do the rest of the research and tweak the algorithm. Just looking for a place to get started, since I know someone out there must have tried something similar before.
First thoughts:
toss away noise words (and, you, is, the, some, ...).
count all other words and sort by quantity.
for each word in the two articles, add a score depending on the sum (or product or some other formula) of the quantities.
the score represents the similarity.
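The steps above can be sketched roughly as follows. The noise-word list is a tiny placeholder, and using the product of the two counts as the per-word score is just one of the options mentioned (sum, product or some other formula); all names are hypothetical.

```python
import re
from collections import Counter

# Placeholder noise-word list; a real one would be much longer.
NOISE_WORDS = {"and", "you", "is", "the", "some", "a", "of", "in", "to"}


def word_counts(text):
    """Count every non-noise word in the text, case-insensitively."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(word for word in words if word not in NOISE_WORDS)


def similarity_score(text_a, text_b):
    """Sum, over words present in both articles, of the product of counts."""
    counts_a = word_counts(text_a)
    counts_b = word_counts(text_b)
    shared = counts_a.keys() & counts_b.keys()
    return sum(counts_a[word] * counts_b[word] for word in shared)


print(similarity_score("Bill Gates and Microsoft",
                       "Microsoft is the Microsoft of Bill"))
```

The raw score grows with article length, so comparing scores across pairs of different sizes would need some normalization.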
It seems to me that an article primarily about Donald Rumsfeld would have those two words quite a bit, which is why I weight them in the article.
However, there may be an article mentioning Warren Buffett many times with Bill Gates once, and another mentioning both Bill Gates and Microsoft many times. The correlation there would be minimal.
Based on your comment:
So if an article was about Saddam Hussein, then the algorithm might recommend something about Donald Rumsfeld's business dealings in Iraq.
that wouldn't be the case unless the Saddam article also mentioned Iraq (or Donald).
That's where I'd start and I can see potential holes in the theory already (an article about Bill Gates would match closely with an article about Bill Clinton if their first names are mentioned a lot). This may well be taken care of by all the other words (Microsoft for one Bill, Hillary for the other).
I'd perhaps give it a test run before trying to introduce word-proximity functionality since that's going to make it very complicated (maybe unnecessarily).
One other possible improvement would be maintaining 'hard' associations (like always adding the word Afghanistan to articles with Osama bin Laden in them). But again, that requires extra maintenance for possibly dubious value since articles about Osama would almost certainly mention Afghanistan as well.
At the moment I am thinking of something like this.
Each non-noise-word is a dimension. Each article is represented by a vector where the words that don't appear are represented by zero and those that do appear get a value that is equal to the number of times they appear divided by the total words on the page. Then I can take Euclidean distance between each of the points in this space to get the similarity of any two articles.
The next step would be to determine clusters of the articles, and then determine a central point for each cluster. Then compute the Euclidean distance between any two clusters which gives the similarity of the topics.
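A minimal sketch of that representation, assuming each article has already been split into a list of words with the noise words removed (so "total words" here means the filtered total, which is an assumption). Vectors are stored sparsely as dicts, with missing words treated as zero. How the articles get assigned to clusters is left out; the names are hypothetical.

```python
import math
from collections import Counter


def tf_vector(words):
    """Relative frequency of each word: count divided by total words."""
    total = len(words)
    return {word: count / total for word, count in Counter(words).items()}


def euclidean(u, v):
    """Euclidean distance between two sparse vectors (missing word = 0)."""
    keys = u.keys() | v.keys()
    return math.sqrt(sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in keys))


def centroid(vectors):
    """Central point of a cluster: the per-word mean of its vectors."""
    keys = set().union(*vectors)
    return {k: sum(v.get(k, 0.0) for v in vectors) / len(vectors) for k in keys}


article_a = tf_vector(["iraq", "iraq", "oil", "war"])
article_b = tf_vector(["iraq", "rumsfeld", "oil", "business"])
print(euclidean(article_a, article_b))
print(centroid([article_a, article_b]))
```

Note that a smaller distance means more similar articles, so any ranking of related items has to sort ascending.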
Baaah I think by typing it out I solved my own problem. Of course only in a very high level way, I am sure when I get down to it I will find problems ... the devil is always in the detail.
But comments and improvements still highly appreciated.

Resources