If I have a bunch of OTPs mixed together, and I know all of their generation seeds (the OTP URIs), can I group them by source URI?
I have a use case where I need the system to be 100% blind to the data relationships it is passing around.
For example: users enter OTPs from their smartphones instead of their logins, so it should become very difficult to identify the entries belonging to one user. If the data is exported to a system that has the OTP seeds, is it possible to re-establish each entry's ownership?
That's possible, but costly: you will need to generate the codes for every seed you have and then check for matches.
Also, there is a chance that two different seeds produce the same code at some moment. To avoid this problem you can ask the user for several consecutive codes, which greatly reduces the probability of a match occurring by chance.
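Here is a minimal sketch of that matching approach, assuming TOTP seeds and the third-party pyotp package (pip install pyotp); the seed values and user names are made up:

```python
import pyotp

seeds = {
    "user1": "JBSWY3DPEHPK3PXP",
    "user2": "KRSXG5CTMVRXEZLU",
}

def owners_of(code: str) -> list[str]:
    """Return every seed whose current TOTP matches `code`.

    Two seeds can emit the same 6-digit code at the same moment
    (roughly a 1-in-10^6 chance per pair), hence the advice above
    to check several consecutive codes before trusting a match.
    """
    return [user for user, seed in seeds.items()
            if pyotp.TOTP(seed).now() == code]

code = pyotp.TOTP(seeds["user1"]).now()
print(owners_of(code))  # usually ['user1'], more on a rare collision
```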
I am trying to calculate physical distances between geographic locations (addresses) with the mapdist function from the ggmap package in R. Apart from the uncomfortable fact that Google Maps allows only 2,500 queries per session, I have to cope with misspelled or otherwise imperfect "addresses". The most typical problem is that the exact address strings have extra information appended to them (floor, door, etc.), and it is very hard to detect any pattern in these additions that would allow applying a regular expression.
My goal is:
Check if the address string is recognizable to Google Maps;
If not, find a way to truncate it to an acceptable form, perhaps by stripping words from the string step by step.
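Something like this sketch is what I have in mind, in Python rather than R for brevity. It calls the Google Geocoding web API (the same service ggmap uses); "YOUR_KEY" is a placeholder and the query quota still applies:

```python
import requests  # pip install requests

def geocodable(address: str, key: str) -> bool:
    resp = requests.get(
        "https://maps.googleapis.com/maps/api/geocode/json",
        params={"address": address, "key": key},
    )
    return resp.json().get("status") == "OK"

def truncate_until_recognized(address: str, key: str) -> str | None:
    words = address.split()
    # Drop trailing tokens (floor, door, ...) one at a time until the
    # remainder geocodes, or give up when nothing is left.
    while words:
        candidate = " ".join(words)
        if geocodable(candidate, key):
            return candidate
        words.pop()
    return None
```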
Has anybody coped with this kind of problem?
Thanks.
There are a couple of factors running into each other here. One is the misspellings and other complexities in the addresses; the other is pinpointing (geocoding) a given address. Although they are related problems, each must be handled separately to accomplish your objectives.
There are numerous service providers out there that can do either or both with minimal cost involved. They can be found with a simple Google search, and you can then investigate each to see whether it matches your use case and licensing requirements.
All of that considered, you'll want to get your address list cleaned up at a minimum. Doing that will enable you to use any number of geocoding providers.
Depending upon the size of your list, you can get your list cleaned up and geocoded for perhaps $20.
In the interest of full disclosure, I'm the founder of SmartyStreets. We provide a web interface (to help clean up the address list) as well as an API (which can be used on a continual basis to keep addresses clean). We also geocode your list at no extra charge. Further, we don't have any licensing restrictions on the number of lookups that can be performed during a given timeframe. (We have customers that hit us hundreds of millions of times per day.) The entire process of signing up and cleaning up your list takes just a few minutes.
I suppose that, using some sort of cryptography and other trickery, it'd be possible to count how many unique computers have used my software.
For example, suppose there is a way to identify each computer, and my software tries to count how many computers have used it by having the copies connect with each other, which they do over the internet.
So let's say my software is downloaded on computer A, and then passed on,
like A>B>C. Now the copy at C needs to know somehow that there are three unique computers using it,
and the one at the end of A>B>D also needs to know there are three computers.
But if A>B>C>E and A>B>D>E, then E needs to know there are five unique computers.
Now I could make a system in which a unique id based on something about the computer (what would that be?) gets stored in an array on each computer; the software carries the array with it and shares it with others whenever it connects, then checks whether there are any new computers in the list, so in the end all copies know about all others, given enough connectivity.
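A minimal sketch of that naive scheme (the ids are made-up strings rather than real machine fingerprints):

```python
class Node:
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.seen = {node_id}          # every id this copy knows about

    def sync(self, other: "Node") -> None:
        # On contact, both sides keep the union of what they have seen.
        merged = self.seen | other.seen
        self.seen = merged
        other.seen = set(merged)

    def count(self) -> int:
        return len(self.seen)

a, b, c = Node("A"), Node("B"), Node("C")
a.sync(b)          # A > B
b.sync(c)          # B > C
print(c.count())   # 3, matching the A>B>C example
```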
However, from what I have learned about Bitcoin and cryptography, I have a feeling that there has to be another way besides storing a long string a million times (if there happen to be tons of computers).
Are you trying to count how many have ever used the program? Or how many are currently using the program? Or how many have used the program within some amount of time before now?
If your count includes computers that are not guaranteed to be accessible (e.g. if counting unique computers that have ever used the program, or that have used it since some time but are not necessarily online now), then it seems inevitable that you will need some centralized repository of the official accumulating list. Each computer would need to communicate with that centralized list and pass it some unique identifier. If you want to know which computers have used it since time T, you also need to track the time of each connection.
If you only want the number of computers that are currently using it (and accessible to each other), it might be possible for each one to interrogate the others dynamically at the point of time it wants to form a current count. But even then, you would need some centralized convention for how they reach out to communicate. Conceptually they are each dynamically joining a "set" and then leaving it again later. Even if that "set" were not always located in a fixed single location, still there would need to be conceptually one official "set" and each instance would need to be able to connect with the "set" to join it and later leave it. That implies a standardized point of contact and means of contact.
So I suspect what you might really want may not be quite possible in the way you were hoping. That said, if you still want to think further about it, you might want to learn more about peer-to-peer software such as BitTorrent and others.
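One further pointer on the "another way besides storing a long string a million times" hunch: if an approximate count is acceptable, a mergeable sketch such as HyperLogLog (my suggestion, not something from your question) estimates unique counts in a few kilobytes, and two sketches combine with a per-register max, so copies could gossip fixed-size sketches instead of ever-growing id lists. A from-scratch illustration, omitting the small/large-range corrections, so counts are approximate:

```python
import hashlib

class HyperLogLog:
    def __init__(self, p: int = 12):
        self.p = p
        self.m = 1 << p                      # number of registers (4096)
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item: str) -> None:
        x = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = x >> (64 - self.p)             # top p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other: "HyperLogLog") -> None:
        # Per-register max: this is what makes gossiping sketches work.
        self.registers = [max(a, b)
                          for a, b in zip(self.registers, other.registers)]

    def count(self) -> float:
        return self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)

h1, h2 = HyperLogLog(), HyperLogLog()
for i in range(10_000):
    h1.add(f"computer-{i}")
for i in range(5_000, 15_000):
    h2.add(f"computer-{i}")
h1.merge(h2)
print(round(h1.count()))  # roughly 15000, from two fixed-size sketches
```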
I have a problem that I am currently solving via brute force, but am looking for a more elegant solution. I have a system that runs various functions across multiple nodes. Each function is defined by a 'role'. Each 'role' can be defined so that one or more clients are allowed to hold it. Additionally, preference may be given to a particular client (or clients) over other clients.
The complexity comes in that it is also possible for 'roles' to be related to each other. For example, a client may only be able to hold 'RoleA' if they don't hold 'RoleB', or a client may only be able to hold 'RoleC' if they hold 'RoleD'. Additionally, roles can be related preferentially (i.e. it is preferred that a client holding 'RoleE' holds 'RoleF', but that this is not mandatory).
A client may advertise its willingness to hold any number of roles, but is not required to do so, e.g. 'client1' may advertise for roles 'A', 'B', and 'C', while 'client2' may only advertise for roles 'A' and 'B'.
I have solved this problem in a brute force fashion, but obviously, as the number of related roles increases, solving it takes exponentially longer.
Currently, my algorithm is:
Work out all of the possible combinations of clients advertising a given role, then assess that role in isolation to generate a list of legal combinations, ordered by preference.
Generate all possible combinations for the lists generated in the previous step, and iterate over these, deciding which is the 'most optimal' based on heuristics around mandatory, illegal, favoured, and unfavoured relationships of the group of roles. This is the part that explodes exponentially as the number of related roles increases.
I have tried some 'early out' approaches whereby a theoretical maximum possible 'score' is determined from the role relationships, and as soon as we encounter a combination whose 'score' reaches that maximum we stop processing, but I'm wondering if there's a more mathematical solution. Any solution is presumably going to be an approximation of the optimal combination, but that is fine.
Ideally I need something that can run sub second.
Hopefully my explanation is not too vague and someone can point me in the right direction!
Thanks in advance.
Cam
This sounds like the Boolean satisfiability problem (SAT) with some extra complications. SAT is NP-complete, so no polynomial-time algorithm for it is known; however, there are solvers (mentioned in the link) that do far better than brute force in practice. A sketch of how the hard constraints could be encoded follows.
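This uses the third-party python-sat package (pip install python-sat); the variable numbering and role pairs are illustrative, not taken from your system:

```python
from pysat.solvers import Glucose3

# Propositional variables: 1 = "client1 holds RoleA", and so on.
A, B, C, D = 1, 2, 3, 4

solver = Glucose3()
solver.add_clause([-A, -B])   # RoleA and RoleB are mutually exclusive
solver.add_clause([-C, D])    # holding RoleC requires holding RoleD
solver.add_clause([A, B])     # at least one of RoleA/RoleB must be held

if solver.solve():
    # Positive literals in the model are the assignments actually made.
    print(solver.get_model())  # e.g. [1, -2, -3, -4]
solver.delete()
```

The preferential relationships are soft constraints, which plain SAT does not express; a weighted MaxSAT solver (python-sat ships one, RC2, for example) can keep the hard clauses while maximizing the total weight of satisfied preferences, which matches the favoured/unfavoured scoring you describe.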
I'm designing an application where my Order objects need to have a sequential and user-friendly Id field. I'm avoiding the HiLo algorithm because of the rather large gaps it produces (see here). Naturally, Guid values would make my corporate users go bananas. I'm also avoiding Oracle sequences because of the major disadvantages of it:
(From: NHibernate POID Generators revealed)
> Post insert generators, as the name suggests, assign the ids after the entity is stored in the database. A select statement is executed against the database. They have many drawbacks, and in my opinion they must be used only on brownfield projects. Those generators are what WE DO NOT SUGGEST as the NH Team.
>
> Some of the drawbacks are the following:
>
> Unit of Work is broken with the use of those strategies. It doesn't matter if you're using FlushMode.Commit; each Save results in an insert statement against the DB. As a best practice, we should defer insertions to the commit, but using a post insert generator makes it commit on save (which is what UoW doesn't do).
>
> Those strategies nullify the batcher; you can't take advantage of sending multiple queries at once (as it must go to the database at the time of Save).
Any ideas/experience on implementing user-friendly IDs without major gaps between them?
Edit:
User-friendly Id fields are ones my corporate users can memorize, discuss, and even refer to in phone conversations about a particular Order, e.g. "I'm calling to find out why order #1625 was denied."
The Id doesn't need to be strictly gapless, but I am worried that my users would get confused when they see gaps like 100, 201, 305. For my older projects, I currently implement NHibernate using Oracle sequences, which occasionally skip a few values when exceptions are thrown but still keep a rather tidy order. The downside is that they break the Unit of Work, which results in an additional hit to the database for every Save command, with or without Session.Flush.
One option would be to keep a key-table that simply stores an incrementing value. This can introduce a few problems, namely possible locking issues as well as additional hits to the database.
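A minimal sketch of that key-table pattern, using sqlite3 from the standard library purely for illustration (the table and column names are made up; in production the increment would run inside the same transaction as the order insert):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_key (next_id INTEGER NOT NULL)")
conn.execute("INSERT INTO order_key VALUES (1)")

def next_order_id(conn: sqlite3.Connection) -> int:
    # The UPDATE write-locks the row, so concurrent callers serialize
    # here -- this is the locking cost mentioned above.
    with conn:
        conn.execute("UPDATE order_key SET next_id = next_id + 1")
        (value,) = conn.execute(
            "SELECT next_id - 1 FROM order_key").fetchone()
    return value

print(next_order_id(conn), next_order_id(conn))  # 1 2 -- no gaps
```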
Another option might be to refine what you mean by "User-friendly Id". This could consist of a combination of a Date/Time and a customer-specific sequence (or including the customer id as well). Also, your order id does not necessarily have to be the actual key on the table. There is nothing to say that you can't use a surrogate key with a separate "calculated" column which represents the order id.
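For illustration, a calculated display id of that shape might look like this (the format is made up, not a recommendation):

```python
from datetime import date

def friendly_order_id(customer_id: int, seq: int) -> str:
    # Date, then customer id, then a per-customer sequence number;
    # this lives in its own column, separate from the surrogate key.
    return f"{date.today():%Y%m%d}-{customer_id:04d}-{seq:03d}"

print(friendly_order_id(42, 7))  # e.g. "20240315-0042-007"
```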
The bottom-line is that it sounds like you want to use a surrogate key, but have the benefits of a natural key. It can be very difficult to have it both ways and a lot comes down to how you actually plan on using the data, how users interpret the data, and personal preference.
Suppose you wanted to estimate the size of a userbase of a site which does not publicize this information.
Different usernames are likely to have been claimed with different probabilities. For instance, if the username 'nick' doesn't exist on a system, it likely has an extremely small userbase. If the username 'starbaby' is taken, it's likely to be a much larger site. It seems like a straightforward Bayesian problem.
There is the problem that different sites may have a different space of allowable usernames. The biggest problem would be the legality of common characters such as spaces, I imagine. Another issue that could taint the prior distribution is whether the site suggests names when the one you want is taken, or leaves you to think of a more creative name yourself.
How could you build a training set of the frequency of occurrence of usernames across different sized systems? Is there a way to use Bayes to do numeric estimation rather than classification into fixed-width buckets?
What you need to do is accurately estimate the probability that a certain username is present given the number of registered users. Let's say N is the number of users, and for a given username let u = 1 if that username is present and u = 0 if it is absent.
First of all, make the assumption that the probability distributions for each user name are independent of each other. This is not going to be true - and you've already come up with one reason why - but it will probably be necessary since it makes the data collection and the maths a lot easier.
You are going to need a lot of data from sites with registered usernames, along with the total number of users on each site. Now, take any specific username and imagine your data points on a 2D plot (with N on x and u on y); there is going to be one horizontal line of points at y=0 and another at y=1. You can either bin the x axis as you suggest and take the mean y coordinate of all the data points in each bin to get a discrete function, or you could try to fit the points to some class of functions. I don't really know what that class of functions would be - maybe some kind of power law? (I'm thinking of Zipf's law.)
You now have the probability distributions to apply Bayes' rule. I don't know what kind of prior for N you would want to use. A uniform distribution (up to some large number) would make no assumptions, but I would guess most sites have a small user base.
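Here is a numeric sketch of that update over a grid of candidate N values, assuming the simple presence model P(name taken | N) = 1 - (1 - q)^N with a per-name popularity q; the q values and observations below are made up:

```python
import numpy as np

N_grid = np.arange(1, 10_000_001, 1_000)   # candidate userbase sizes
log_post = np.zeros(len(N_grid))           # uniform prior over the grid

# (per-name popularity q, was the name observed taken?)
observations = [(1e-3, True),    # common name, observed taken
                (1e-5, True),    # rare name, also taken
                (1e-4, False)]   # mid-popularity name, observed free

for q, taken in observations:
    log_p_free = N_grid * np.log1p(-q)     # log P(name free | N)
    log_post += np.log1p(-np.exp(log_p_free)) if taken else log_p_free

post = np.exp(log_post - log_post.max())   # back to probabilities
post /= post.sum()
print("posterior mean of N:", int((N_grid * post).sum()))
```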
I suspect that in order to make this work, when you sample usernames from a site you will need to do so for a specific, fixed set of names. I'm betting that the popularity of usernames is going to have a very long tail, so a random sample of users is going to give you a lot of very infrequently used names and therefore a lot of uninformative evidence.
EDIT: I had another thought; in most forums (and on StackOverflow) users have consecutive user ids, so you can use a single site with a large number of users to give you estimates for all smaller N.
I think this is a cool idea!
You may be able to put together a data set by using UserNameCheck.com for some different usernames and cross-referencing the results with the stated userbase sizes of those sites that give them out.
Note: that website does not seem to check whether the usernames are valid for the site; e.g. it thinks Gmail would let you register "nick@gmail.com" even though that's too short.
The only way is to get a large set of taken usernames on systems for which you know the size of the userbase. Data may be skewed in userbases where certain names are more common. Even a tiny userbase from a Lord of the Rings forum will likely contain the username Strider, for example.