Selecting optimal combinations

I have a problem that I am currently solving via brute force, but am looking for a more elegant solution. I have a system that runs various functions across multiple nodes. Each function is defined by a 'role'. Each 'role' can be configured so that one or more clients are allowed to hold it. Additionally, preference may be given to a particular client (or clients) over other clients.
The complexity comes in that it is also possible for 'roles' to be related to each other. For example, a client may only be able to hold 'RoleA' if they don't hold 'RoleB', or a client may only be able to hold 'RoleC' if they hold 'RoleD'. Additionally, roles can be related preferentially (i.e. it is preferred that a client holding 'RoleE' holds 'RoleF', but that this is not mandatory).
A client may advertise its willingness to hold any number of roles, but is not required to do so. For example, 'client1' may advertise for roles 'A', 'B', and 'C', while 'client2' may only advertise for roles 'A' and 'B'.
I have solved this problem in a brute force fashion, but obviously, as the number of related roles increases, solving it takes exponentially longer.
Currently, my algorithm is:
Work out all of the possible combinations of clients advertising a given role, and then assess that role in isolation to generate a list of legal combinations, ordered by preference.
Generate all possible combinations of the lists generated in the previous step, and iterate over these, deciding which is the 'most optimal' based on heuristics around mandatory, illegal, favoured, and unfavoured relationships of the group of roles. This is the part that explodes exponentially as the number of related roles increases.
I have tried some 'early out' approaches whereby a theoretical maximum possible 'score' is determined based on the role relationships, and processing stops as soon as we encounter a combination with a 'score' >= this maximum, but I'm wondering if there's a more mathematical solution. Any solution is presumably going to be an approximation of the optimal combination, but that is fine.
Ideally I need something that can run sub second.
Hopefully my explanation is not too vague and someone can point me in the right direction!
Thanks in advance.
Cam

Sounds like the Boolean satisfiability problem (SAT) with some extra complications. SAT is NP-complete, so no polynomial-time algorithm is known for it (and none exists unless P = NP); however, there are algorithms (mentioned in the link) that do much better than brute force on many practical instances.
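One way to improve on plain enumeration is branch and bound: assign roles one at a time, drop a partial assignment as soon as it breaks a mandatory or illegal relationship, and skip any branch whose best possible score cannot beat the best complete assignment found so far. The sketch below is only an illustration under simplifying assumptions that are not in the question: every role is held by exactly one client, `forbids` and `requires` relate pairs of roles, and preferences and bonuses are non-negative numbers. All names (`best_assignment`, `pref`, `bonus`, and so on) are made up for the example. The worst case is still exponential.

```python
def best_assignment(candidates, pref, forbids, requires, bonus):
    """Branch-and-bound search over role -> client assignments.

    candidates: dict role -> list of clients advertising that role
    pref:       dict (role, client) -> non-negative preference weight
    forbids:    list of (role_a, role_b); one client may not hold both
    requires:   list of (role_a, role_b); the holder of role_a must hold role_b
    bonus:      dict (role_a, role_b) -> non-negative bonus if one client holds both
    """
    roles = list(candidates)
    max_bonus = sum(bonus.values())

    # remaining[i] = optimistic preference score still obtainable from roles[i:]
    remaining = [0.0] * (len(roles) + 1)
    for i in range(len(roles) - 1, -1, -1):
        top = max((pref.get((roles[i], c), 0.0) for c in candidates[roles[i]]),
                  default=0.0)
        remaining[i] = remaining[i + 1] + top

    best = {"score": float("-inf"), "assign": None}

    def violates(assign):
        # Only pairs where both roles are already assigned can be checked.
        for a, b in forbids:
            if a in assign and b in assign and assign[a] == assign[b]:
                return True
        for a, b in requires:
            if a in assign and b in assign and assign[a] != assign[b]:
                return True
        return False

    def search(i, assign, score):
        # Upper bound: current score + best remaining preferences + every bonus.
        if score + remaining[i] + max_bonus <= best["score"]:
            return
        if i == len(roles):
            total = score + sum(w for (a, b), w in bonus.items()
                                if a in assign and assign.get(a) == assign.get(b))
            if total > best["score"]:
                best["score"], best["assign"] = total, dict(assign)
            return
        role = roles[i]
        # Try the most preferred clients first so good bounds are found early.
        ordered = sorted(candidates[role], key=lambda c: -pref.get((role, c), 0.0))
        for client in ordered:
            assign[role] = client
            if not violates(assign):
                search(i + 1, assign, score + pref.get((role, client), 0.0))
            del assign[role]

    search(0, {}, 0.0)
    return best["score"], best["assign"]
```

If sub-second answers matter more than optimality, the same search can be stopped after a time budget and the best assignment found so far returned.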

Related

Can One Time Passwords be used as identifiers?

If I have a bunch of OTPs mixed together and I know all of their generation seeds (the OTP URIs), can I group them by source URI?
I have a use case where I need the system to be 100% blind to the relationships in the data that it's passing around.
For example: if users enter OTPs from their smartphones instead of their logins, it should become very difficult to identify entries made by one user. If data is exported from the system to a party that has the OTP seeds, is it possible to reestablish each entry's ownership?

That's possible, but expensive. You will need to generate the codes for every seed you have (for the time or counter at which the code was entered) and then check whether any of them match.
Also, there is a chance of getting the same code from different seeds at some moment. To reduce this problem you can ask the user for several consecutive codes; this significantly decreases the probability of codes matching just by chance.
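A minimal sketch of that matching step, assuming standard time-based OTPs (HMAC-SHA1, 30-second step, 6 digits, as in RFC 4226 / RFC 6238) and assuming the entry time of each code is known. The function names and the `seeds` dictionary are invented for the example.

```python
import hashlib
import hmac
import struct


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HOTP value (RFC 4226) for a raw secret and an 8-byte counter."""
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F                      # dynamic truncation offset
    value = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(value % 10 ** digits).zfill(digits)


def candidate_seeds(seeds, code, timestamp, step=30, window=1):
    """Return the names of all seeds that could have produced `code`.

    seeds:     dict name -> raw secret bytes (hypothetical storage format)
    timestamp: Unix time at which the code was entered
    window:    how many time steps of clock drift to tolerate on each side
    """
    counter = int(timestamp) // step
    matches = []
    for name, secret in seeds.items():
        for c in range(max(0, counter - window), counter + window + 1):
            if hotp(secret, c) == code:
                matches.append(name)
                break
    return sorted(matches)
```

With 6-digit codes, each unrelated seed matches a given code with probability of roughly one in a million per time step checked, so with many seeds a single code can have several candidates. Intersecting the candidate sets of several consecutive codes narrows this down quickly.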

R geographic address validation

I am trying to calculate physical distances between geographic locations (addresses) with the mapdist function from the ggmap package in R. Apart from the uncomfortable fact that Google Maps allows only 2500 queries/session, I have to cope with misspelled or otherwise imperfect "addresses". The most typical problem is that the address strings are padded with extra information (floor, door, etc.), and it is very hard to detect any pattern in these that would allow applying a regular expression.
My goal is:
Check if the address string is recognizable to Google Maps;
If not, find a way to truncate it to an acceptable form, perhaps by dropping words step by step from the string.
Has anybody coped with this kind of problem?
Thanks.

There are a couple of factors running into each other here. One factor is the misspellings and other complexities related to addresses and the other is pinpointing (geocoding) a given address. Although they are related problems, each must be handled to accomplish your objectives.
There are numerous service providers out there that can do either or both with minimal cost involved. This can be found with a simple Google search. You can then investigate each to see if they match your use case and licensing requirements.
All of that considered, you'll want to get your address list cleaned up at a minimum. Doing that will enable you to utilize any number of geocoding providers.
Depending upon the size of your list, you can get your list cleaned up and geocoded for perhaps $20.
In the interest of full disclosure, I'm the founder of SmartyStreets. We provide a web interface (to help clean up the address list) as well as an API (which can be used on a continual basis to keep addresses clean). We also geocode your list at no extra charge. Further, we don't have any licensing restrictions on the number of lookups that can be performed during a given timeframe. (We have customers that hit us hundreds of millions of times per day.) The entire process of signing up and cleaning up your list takes just a few minutes.
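For the step-by-step truncation idea from the question, a rough sketch could look like the following. It is written in Python rather than R and is independent of any particular provider: `is_recognized` is a placeholder for whatever geocoding lookup is actually used, not a real API. It only handles extra information appended at the end of the string (floor, door, etc.).

```python
from typing import Callable, Optional


def longest_recognized_prefix(address: str,
                              is_recognized: Callable[[str], bool],
                              min_tokens: int = 2) -> Optional[str]:
    """Drop trailing words until the geocoder accepts the address.

    is_recognized is a caller-supplied function (hypothetical) that returns
    True when the geocoding service can resolve the given string.
    """
    tokens = address.replace(",", " ").split()
    for end in range(len(tokens), min_tokens - 1, -1):
        candidate = " ".join(tokens[:end])
        if is_recognized(candidate):
            return candidate
    return None  # nothing recognizable; flag for manual review
```

Each truncation attempt costs one lookup, so with a per-session query limit it is worth caching results and trying the full string first, as the loop does.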

how to generate unique and [pseudo] sequential GUIDs across multiple servers?

We are looking for solutions for generating Ids as per the title of this question.
for clarification:
we are using several different SQL Servers and application servers, any of which could be generating the id
we do not want to use a central ID-generating service/machine
we do not want to use DateTimes bitwise-converted to Guids because with so many machines there is a possibility of collision.
one possible solution is to assign each machine a start position, skip, and an offset, like this answer: https://stackoverflow.com/a/7916720/175127
this could very easily be our solution, but I'm hoping that someone among you might have a more elegant solution that better addresses some of the following issues:
one machine might end up assigning a lot more IDs and skip far ahead of the others. We might resync all the machines to have a new start position every day to help keep them on pace with each other, but this could result in a large amount of empty, unused IDs. We wish to minimize this.
we wish to see if it's possible to decrease the external dependency of each machine. At start they each have to find out how many machines there are, what the start point is, and they have to decide on unique offsets. I think having some form of central control to administer these things may be unavoidable.
the best that I could think of so far is to have a central machine that distributes ranges of Ids at a time. Each other machine grabs range-blocks as needed. If the central machine goes down, then we use the start-skip-offset system as a fallback.
Got any cool ideas SO?
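The range-block idea described above can be sketched roughly as follows. This is only an illustration: `BlockAllocator` stands in for the central machine (in practice it would be a database row or a service call), and the class names and block size are invented for the example.

```python
import threading


class BlockAllocator:
    """Stand-in for the central machine that hands out ID ranges."""

    def __init__(self, block_size=1000, start=1):
        self._block_size = block_size
        self._next = start
        self._lock = threading.Lock()

    def next_block(self):
        # Returns a half-open range [lo, hi) that no other caller will receive.
        with self._lock:
            lo = self._next
            self._next += self._block_size
            return lo, lo + self._block_size


class BlockIdGenerator:
    """Per-machine generator that only contacts the allocator when a block runs out."""

    def __init__(self, allocator):
        self._allocator = allocator
        self._lo = self._hi = 0

    def next_id(self):
        if self._lo >= self._hi:
            self._lo, self._hi = self._allocator.next_block()
        value = self._lo
        self._lo += 1
        return value
```

IDs are sequential within a block and roughly sequential across machines. The block size trades off how often machines must reach the central allocator against how many IDs are wasted when a machine stops before using up its block.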

Formulas to generate a unique id?

I would like to get a few ideas on generating unique IDs without using the GUID. Preferably I would like the unique value to be of type int32.
I'm looking for something that can be used as a database primary key as well as being URL friendly.
Can these be considered unique?
(int)DateTime.Now.Ticks
(int)DateTime.Now * RandomNumber
Any other ideas?
Thanks
EDIT: Well, I am trying to practise Domain Driven Design, and all my entities need to have an ID upon creation to be valid. I could in theory call into the DB to get an auto-incremented number, but would rather steer clear of this, as DB-related stuff would be leaking into the Domain.

It depends on how unique you need it to be and how many items you need to give IDs to. Your best bet may be assigning them sequentially; if you try to get fancy you'll likely run into the Birthday Paradox (collisions are more likely than you might expect) or (as in your case 1) above) be forced to limit the rate at which you can issue them.
Your 1) above is a little better than the 2) for most cases; it's rate limited--you can't issue more than 1 ID per tick--but not susceptible to the Birthday Paradox. Your 2) is just throwing bits away. Might be slightly better to XOR with the random number, but in any case I don't think the rand is buying you anything, just hiding the problem & making it harder to fix.
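To put a number on the Birthday Paradox remark: if IDs are drawn uniformly at random from the 2^32 possible int32 values, the usual approximation for the probability of at least one collision among n IDs is 1 - exp(-n(n-1) / (2 * 2^32)). The short sketch below just evaluates that formula; the function name is invented for the example.

```python
import math


def collision_probability(n: int, space: int = 2 ** 32) -> float:
    """Approximate chance that n uniformly random IDs contain a duplicate."""
    return 1.0 - math.exp(-n * (n - 1) / (2.0 * space))


for n in (1_000, 10_000, 77_163, 200_000):
    print(n, round(collision_probability(n), 4))
```

Under this approximation the collision probability reaches about 50% at roughly 77,000 IDs, which is far fewer than the 4 billion values the type can hold.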

Are these considered Globally Unique?
1) (int)DateTime.Now.Ticks
2) (int)DateTime.Now * RandomNumber
Neither option is globally unique.
Option 1 - This is only unique if you can guarantee no more than one ID is generated per tick. From your description, it does not sound like this would work.
Option 2 - Random numbers are pseudo random, but not guaranteed to be unique. With that already in mind, we can reduce the DateTime portion of this option to a similar problem to option 1.
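A further issue with Option 1 is the cast itself. Assuming the cast runs in an unchecked context and therefore keeps only the low 32 bits of the 64-bit tick count, and given that a .NET tick is 100 ns, the value repeats about every 7 minutes. The snippet below simulates that truncation in Python purely to show the arithmetic; `to_int32` is a made-up helper, not a .NET API.

```python
TICKS_PER_SECOND = 10_000_000  # one .NET tick is 100 nanoseconds


def to_int32(value: int) -> int:
    """Mimic an unchecked 64-bit to 32-bit signed cast: keep the low 32 bits."""
    low = value & 0xFFFFFFFF
    return low - 0x1_0000_0000 if low >= 0x8000_0000 else low


# Time after which the truncated tick count wraps around and repeats.
wrap_seconds = 2 ** 32 / TICKS_PER_SECOND

ticks = 638_000_000_000_000_000          # an arbitrary tick count
same = to_int32(ticks) == to_int32(ticks + 2 ** 32)
print(wrap_seconds, same)
```

So even at one ID per tick, two IDs issued 2^32 ticks (about 429.5 seconds) apart come out identical.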
If you want a globally unique ID that is an int32, one good way would be a synchronous service of some sort that returns sequential IDs. I guess it depends on what your definition of global means. If you had larger than an int32 to work with, and you mean global on a given network, then maybe you could use IP address with a sequence number appended, where the sequence number is generated synchronously across processes.
If you have other unique identifiers besides IP address, then that would obviously be a better choice for displaying as part of a URL.

You can use the RNGCryptoServiceProvider class, if you are using .NET
RNGCryptoServiceProvider Class

Way to infer the size of the userbase of a site from sampling taken usernames

Suppose you wanted to estimate the size of a userbase of a site which does not publicize this information.
Different usernames are taken with different probabilities. For instance, if the username 'nick' doesn't exist on the system, it's likely to have an extremely small userbase. If the username 'starbaby' is taken, it's likely to be a much larger site. It seems like a straightforward Bayesian problem.
There is the problem that different sites may have a different space of allowable usernames. The biggest problem would be the legality of common characters such as spaces, I imagine. Another issue that could taint the prior distribution is whether the site suggests names when the one you want is taken, or leaves you to think of a more creative name yourself.
How could you build a training set of the frequency of occurrence of usernames across different sized systems? Is there a way to use Bayes to do numeric estimation rather than classification into fixed-width buckets?

What you need to do is accurately estimate the probability that a certain user name is present given the number of users registered. Let's say N is the number of users, and for a given user name let u = 1 if that name is present and u = 0 if it is absent.
First of all, make the assumption that the probability distributions for each user name are independent of each other. This is not going to be true - and you've already come up with one reason why - but it will probably be necessary since it makes the data collection and the maths a lot easier.
You are going to need a lot of data from sites with registered user names and the total number of users of that site. Now, take any specific user name and imagine your data points on a 2D plot (with N on x and u on y): there's going to be one horizontal line of points at y=0 and another at y=1. You can either bin the x axis as you suggest and take the mean y coordinate of all the data points in the bin to get a discrete function, or you could try to fit the points on the graph to some class of functions. I don't really know what that class of functions would be - maybe some kind of power law? (I'm thinking of Zipf's law.)
You now have the probability distributions to apply Bayes' rule. I don't know what kind of prior for N you would want to use. A uniform distribution (up to some large number) would make no assumptions, but I would guess most sites have a small user base.
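As a rough sketch of that Bayes step, one could evaluate the posterior on a grid of candidate values of N. The likelihood used here is an assumption made for illustration, not something fitted from data: each user independently picks name u with a small per-user frequency f_u, so P(u = 1 | N) = 1 - (1 - f_u)^N. The function name, the frequencies and the grid are all invented for the example.

```python
import math


def posterior_over_n(observations, name_freq, candidate_sizes, prior=None):
    """Posterior over userbase size N on a grid, assuming independent names.

    observations:    dict name -> True if the name is taken on the site
    name_freq:       dict name -> assumed per-user frequency f_u of that name
    candidate_sizes: iterable of candidate N values
    prior:           optional dict N -> prior weight (uniform if omitted)
    """
    sizes = list(candidate_sizes)
    if prior is None:
        prior = {n: 1.0 for n in sizes}
    log_post = {}
    for n in sizes:
        logp = math.log(prior[n])
        for name, taken in observations.items():
            p_taken = 1.0 - (1.0 - name_freq[name]) ** n
            p_taken = min(max(p_taken, 1e-12), 1.0 - 1e-12)  # avoid log(0)
            logp += math.log(p_taken if taken else 1.0 - p_taken)
        log_post[n] = logp
    peak = max(log_post.values())
    weights = {n: math.exp(v - peak) for n, v in log_post.items()}
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}
```

With a fine enough grid (for example, logarithmically spaced sizes) this gives a numeric estimate of N, such as the posterior mean or mode, rather than a classification into a few fixed buckets.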
I suspect that in order to make this work, when you sample users from a site you will need to do so for a specific set of users. I'm betting that the popularity of user names is going to have a very long tail and so a random sample of users is going to give you a lot of very infrequently used names and therefore a lot of uninformative evidence.
EDIT: I had another thought; in most forums (and on StackOverflow) users have consecutive user ids, so you can use a single site with a large number of users to give you estimates for all smaller N.

I think this is a cool idea!
You may be able to put together a data set by using UserNameCheck.com for some different usernames and cross-referencing the results with the stated userbase sizes of those sites that give them out.
Note: that website does not seem to check if the usernames are valid for the site, so e.g. it thinks Gmail would let you register "nick#gmail.com" even though that's too short.

The only way is to get a large set of taken usernames on systems for which you know the size of the userbase. Data may be skewed in userbases where certain names are more common. Even a tiny userbase from a Lord of the Rings forum will likely contain the username Strider, for example.
