I'm trying to create a data set for training a neural network for a sports application. I'm trying to capture the impact of player substitutions on points scored by a team. I have sets of substitutions (Jones for Smith), (Smith for Davis), etc. that I'm trying to represent with a unique number. For example, every time my data set included a Jones for Smith substitution, the function/program/hash would produce the same number.
I looked into hash codes (MD5, SHA), but these don't seem to be the right way to go. I'm sort of stumped on this one. If anyone has come across a similar situation or has some programming wizardry they would care to share, I would appreciate it. Thanks.
You could build a string of the primary keys, along the lines of substituted, substituted for, next substituted, next substituted for, etc., e.g. "Jones,Smith,Smith,Davis". An MD5 hash of this string, whilst not guaranteed to be unique, is probably going to be unique enough for your purposes.
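The idea above could be sketched roughly as follows in Python. The function name substitution_id and the (player in, player out) tuple layout are assumptions made for illustration, not something from the question; the string layout follows the "Jones,Smith,Smith,Davis" example.

```python
import hashlib

def substitution_id(substitutions):
    """Return a stable integer for a sequence of (player_in, player_out) pairs.

    Hypothetical helper: the name and the tuple layout are illustrative only.
    """
    # Flatten the pairs into one string such as "Jones,Smith,Smith,Davis".
    key = ",".join(name for pair in substitutions for name in pair)
    # MD5 gives a 128-bit digest; collisions are possible but very unlikely here.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16)

first = substitution_id([("Jones", "Smith"), ("Smith", "Davis")])
again = substitution_id([("Jones", "Smith"), ("Smith", "Davis")])
other = substitution_id([("Smith", "Jones")])
print(first == again, first == other)
```

The same set of substitutions always maps to the same number, and order matters: "Jones for Smith" and "Smith for Jones" get different IDs. A 128-bit integer is far too large to feed to a network directly, so it would probably still need to be mapped to a category index or embedding.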
Is there a way (an algorithm) to take someone's name (first name and surname) and turn it into a unique, non-reversible ID? I know I could just start at unique id = 1 and just add one. I was wondering if there was a way that I could generate IDs using someone's name.
I'm just after a pointer to the way to do it and I'll code it myself.
What you're asking for is called a hash function. Hash functions will always produce the same fixed-length output for a given input.
For non-cryptographic endeavours, SHA256 is often a good choice. It will produce a 256-bit output and is available natively in most standard libraries.
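As a minimal sketch of what that looks like with Python's standard hashlib module: the function name name_to_id and the normalisation step (trimming and lower-casing each part) are assumptions added for illustration, since the answer does not say how names should be cleaned up before hashing.

```python
import hashlib

def name_to_id(first_name, surname):
    """Derive a fixed-length hex ID from a name (hypothetical helper)."""
    # Assumed normalisation so "Jane Doe" and " jane DOE" give the same ID.
    normalized = " ".join(part.strip().lower() for part in (first_name, surname))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

jane_id = name_to_id("Jane", "Doe")
print(len(jane_id), jane_id == name_to_id(" jane", "DOE "))
```

The output is always 64 hex characters (256 bits), whatever the length of the name.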
No.
You say you want to generate a non-reversible ID from a name. While you can generate an ID from a name, this will always be reversible. That is because there is not enough information in a name.
Consider that any name is chosen from a set of 7 billion names (that's how many people there are), then the maximum possible Shannon entropy of an ID is about 32.7 bits. In other words, it will take about 2^32.7 tries before you have tried all names on Earth. For a computer, that's easy.
For example, if we created the IDs by hashing with SHA256, as Luke Joshua Park suggests, we could easily find the name corresponding to an ID using a regular brute force attack. My laptop needs only 0.5µs to compute a SHA256 hash. So to find the IDs of all humans on Earth, I would need less than an hour (7 billion hashes at 0.5µs each is about 58 minutes), even before using a somewhat beefy CPU.
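The arithmetic and the attack can be checked with a short sketch. The population and timing figures are the ones quoted above; the candidate list and the helper names sha256_id and recover_name are hypothetical, and a real attacker would iterate over a much larger list of names.

```python
import hashlib
import math

POPULATION = 7_000_000_000   # number of names assumed in the answer above
SECONDS_PER_HASH = 0.5e-6    # 0.5 µs per SHA256 hash, as quoted above

entropy_bits = math.log2(POPULATION)                  # upper bound on ID entropy
total_minutes = POPULATION * SECONDS_PER_HASH / 60    # time to hash every name

def sha256_id(name):
    return hashlib.sha256(name.encode("utf-8")).hexdigest()

def recover_name(target_id, candidate_names):
    """Brute force: hash each candidate until one matches the ID."""
    for name in candidate_names:
        if sha256_id(name) == target_id:
            return name
    return None

# Hypothetical candidate list, standing in for "all names on Earth".
candidates = ["Alice Smith", "Bob Jones", "Carol Davis"]
leaked_id = sha256_id("Bob Jones")
recovered = recover_name(leaked_id, candidates)
print(round(entropy_bits, 1), round(total_minutes, 1), recovered)
```

Mixing in a secret key (for example with an HMAC) is the usual way to stop this kind of guessing, but that goes beyond what the answer above proposes.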
I am trying to implement a minesweeper solver in Lisp. I know this is not a rare problem, but I didn't find any article that can help me with it. At the start I have a minefield as input, with numbers on the uncovered fields. The algorithm should finish when all mines are found. So, in every step I have to check which fields I can put in my list of mined fields, and choose one field from my list of not-mined fields and open it. Later I will check whether my list of mined fields is complete, and if yes, the algorithm is done. I would appreciate any help. I'm not asking for source code, but I need good ideas. I am not experienced with this kind of problem.
I HAVE to use the A* algorithm. And I don't need to open all unopened fields... I need to find the positions of all mined fields. And of course it has to be the SHORTEST path to do that. When I find the positions of all mined fields, the algorithm is finished. So, once more, I need to find all mined fields while opening as few fields as possible. And of course I need a heuristic for my algorithm that will help to choose one of all the safe unopened fields.
And that list of safe unopened fields needs to be determined after every opening. So I need to call the main function; that function will check whether I have found all mined fields, and if not, all safe adjacent unopened fields need to be added to the list of paths. The path with the best heuristic will then be chosen.
I did implement a minesweeper solver in my first year at the University so I can give you some tips. (This is not using A* algorithm)
Important - Not all positions are solvable.
Backtracking over the whole mine field is a bit complicated for advanced difficulties (complicated = takes some time; consider all the possibilities for placing 100 mines in a 30x30 field).
You can solve everything locally, in the same way a human solves the minesweeper. The potential of this is to give the users a hint how to continue instead of solving everything.
Example:
Have a separate mine field where you do the solving
Find all the unsolved cells that have a solved (number/ known mine) cell close enough (2 cell distance)
For every such cell, take a 5x5 neighborhood with the cell in the center, find every possibility (backtracking) and check whether the possibilities have something in common (mines/non-mines). If yes, you can flag the mines and uncover the non-mines.
Repeat while you can uncover something.
When you cannot uncover anything and the number of remaining mines is small enough, you can try backtracking over the whole field.
I hope I remember it correctly, I did some proofs why the 5x5 area is enough to check but it was almost 10 years ago.
You do not need the A* algorithm; its purpose is to find the shortest path in a graph (such as the shortest path between two places in a map, or the smallest amount of moves that will solve a puzzle). You will probably want to use a technique that is known as backtracking.
As long as there are unopened fields, pick an unopened field that is next to an open field, and tentatively flag it as a mine. Then, look at an unopened field that is adjacent to the previous one as well as to an opened field, and flag that one as a mine too, if this doesn't contradict the adjacent numbers - if it does, flag it as safe instead. Continue. Eventually, you will have looked at all unopened fields that surround the current area and have found one possible way of flagging the fields as safe or unsafe. However, this was based on several guesses, so now you need to go back to the last field where you made a guess and then make the opposite guess and then move forwards again to get another possible flag combination. Then, go even further back, revise your guesses, and so on. This can be implemented quite neatly with recursion. Eventually, you will have a collection of possible flag combinations. If you can find a field that is safe in all possible flag combinations, open that field. Otherwise, pick a field that is safe in as many flag combinations as possible.
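The backtracking described above can be sketched as follows. This is written in Python rather than Lisp, and only illustrates the idea. The board representation (a dict mapping opened cells to their numbers, plus the grid size) and all function names are assumptions. The sketch ignores the total mine count and enumerates every unopened cell that touches an opened one, so it is only practical for small frontiers.

```python
from itertools import product

def neighbors(cell, rows, cols):
    r, c = cell
    return [(r + dr, c + dc)
            for dr, dc in product((-1, 0, 1), repeat=2)
            if (dr, dc) != (0, 0) and 0 <= r + dr < rows and 0 <= c + dc < cols]

def consistent(opened, assignment, rows, cols):
    """Check that no opened number is contradicted by the partial flagging."""
    for cell, number in opened.items():
        hidden = [n for n in neighbors(cell, rows, cols) if n not in opened]
        mines = sum(assignment.get(n, False) for n in hidden)
        undecided = sum(n not in assignment for n in hidden)
        if mines > number or mines + undecided < number:
            return False
    return True

def flag_combinations(opened, rows, cols):
    """Enumerate every mine/safe flagging of the frontier that fits the numbers."""
    frontier = sorted({n for cell in opened
                       for n in neighbors(cell, rows, cols) if n not in opened})
    solutions = []

    def backtrack(index, assignment):
        if index == len(frontier):
            solutions.append(dict(assignment))
            return
        for is_mine in (True, False):          # guess, then the opposite guess
            assignment[frontier[index]] = is_mine
            if consistent(opened, assignment, rows, cols):
                backtrack(index + 1, assignment)
            del assignment[frontier[index]]    # undo the guess before going back

    backtrack(0, {})
    return frontier, solutions

def safety_scores(opened, rows, cols):
    """Fraction of flag combinations in which each frontier cell is safe."""
    frontier, solutions = flag_combinations(opened, rows, cols)
    if not solutions:
        return {}
    return {cell: sum(not s[cell] for s in solutions) / len(solutions)
            for cell in frontier}

# Hypothetical 2x3 board: the top row is opened, the bottom row is hidden.
opened = {(0, 0): 1, (0, 1): 1, (0, 2): 0}
scores = safety_scores(opened, rows=2, cols=3)
best_cell = max(scores, key=scores.get)
print(scores, best_cell)
```

A score of 1.0 means the cell is safe in every combination and can be opened; a score of 0.0 means it is a mine in every combination. When no cell reaches 1.0, picking the highest score corresponds to the "safe in as many flag combinations as possible" rule.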
I am constructing a hash table mod 17, for example, and I am trying to figure out an efficient way to deal with a repeating key value. Suppose I have a random number generator and I generate 1000 random numbers; there is a chance that some of those numbers will occur multiple times. My implementation would have an array of 17 slots, each holding a linked list, and keys would be stored in their respective slots.
I want to implement a failsafe checker function that ensures there are no repeating keys in the hash table. I have been looking this up on the internet and have not found a definite answer. My idea was to keep each linked list sorted and have a lookahead to check whether the number is already there. Does anyone know of a better idea?
Any thoughts and comments greatly appreciated.
If I understand correctly, you want multiple values for the same key? I think that is not possible. When you go to retrieve the value, which value would you choose?
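If the goal is instead to reject a key that is already stored, one common approach is to search the key's own chain before inserting. A minimal sketch, assuming integer keys and using Python lists in place of linked lists (the class name ChainedHashSet is made up for illustration):

```python
import random

class ChainedHashSet:
    """Hash table with separate chaining that refuses duplicate keys."""

    def __init__(self, slots=17):
        self.buckets = [[] for _ in range(slots)]

    def _bucket(self, key):
        # "mod 17" hashing: the slot is the key modulo the slot count.
        return self.buckets[key % len(self.buckets)]

    def contains(self, key):
        return key in self._bucket(key)

    def insert(self, key):
        """Insert key; return False and leave the table alone if it is present."""
        bucket = self._bucket(key)
        if key in bucket:          # only this one chain has to be searched
            return False
        bucket.append(key)
        return True

random.seed(0)                      # fixed seed so the run is repeatable
numbers = [random.randrange(500) for _ in range(1000)]
table = ChainedHashSet()
inserted = sum(table.insert(n) for n in numbers)
stored = sum(len(bucket) for bucket in table.buckets)
print(inserted, stored)
```

Only the chain for the key's slot is scanned, so keeping the chains sorted is optional. With 17 slots and many keys the chains get long, and a larger table keeps the duplicate check cheap.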
I apologize as I don't know whether this is more of a math question that belongs on mathoverflow or if it's a computer science question that belongs here.
That said, I believe I understand the fundamental difference between data, information, and knowledge. My understanding is that information carries both data and meaning. One thing that I'm not clear on is whether information is data. Is information considered a special kind of data, or is it something completely different?
The words data, information and knowledge are value-based concepts used to categorize, in a subjective fashion, the general "conciseness" and "usefulness" of a particular information set.
These words have no precise meaning because they are relative to the underlying purpose and methodology of information processing; in the field of information theory they have no meaning at all, because all three are the same thing: a collection of "information" (in the information-theoretic sense).
Yet they are useful, in context, to summarize the general nature of an information set as loosely explained below.
Information is obtained (or sometimes induced) from data, but it can be richer, as well as cleaner (whereby some values have been corrected) and "simpler" (whereby some irrelevant data has been removed). So in the set theory sense, Information is not a subset of Data, but a separate set [which typically intersects, somewhat, with the data but also can have elements of its own].
Knowledge (sometimes called insight) is yet another level up, it is based on information and too is not a [set theory] subset of information. Indeed Knowledge typically doesn't have direct reference to information elements, but rather tells a "meta story" about the information / data.
The unfounded idea that along the Data -> Information -> Knowledge chain, the higher levels are subsets of the lower ones, probably stems from the fact that there is [typically] a reduction of the volume of [IT sense] information. But qualitatively this info is different, hence no real [set theory] subset relationship.
Example:
Raw stock exchange data from Wall Street is ... Data
A "sea of data"! Someone has a hard time finding what he/she needs, directly, from this data. This data may need to be normalized. For example, the price info may sometimes be expressed as a text string with 1/32 of a dollar precision; in other cases prices may come as a true binary integer with 1/8 of a dollar precision. Also the field which indicates, say, the buyer ID or seller ID may include typos, and hence point to the wrong seller/buyer, etc.
A spreadsheet made from the above is ... Information
Various processes were applied to the data:
-cleaning / correcting various values
-cross referencing (for example looking up associated codes such as adding a column to display the actual name of the individual/company next to the Buyer ID column)
-merging when duplicate records pertaining to the same event (but say from different sources) are used to corroborate each other, but are also combined in one single record.
-aggregating: for example making the sum of all transaction values for a given stock (rather than showing all the individual transactions).
All this (and then some) turned the data into Information, i.e. a body of [IT sense] Information that is easily usable, where one can quickly find some "data", such as say the Opening and Closing rate for the IBM stock on June 8th 2009.
Note that while being more convenient to use, in part more exact/precise, and also boiled down, there is no real [IT sense] information in there which couldn't be located or computed from the original by relatively simple (if only painstaking) processes.
A financial analyst's report may contain ... knowledge
For example, the report may indicate [bogus example] that whenever the price of oil goes past a certain threshold, the value of gold starts declining, but then quickly spikes again, around the time the prices of coffee and tea stabilize. This particular insight constitutes knowledge. This knowledge may have been hidden in the data alone, all along, but only became apparent when one applied some fancy statistical analysis, and/or required the help of a human expert to find or confirm such patterns.
By the way, in the Information Theory sense of the word Information, "data", "information" and "knowledge" all contain [IT sense] information.
One could possibly get on the slippery slope of stating that "As we go up the chain the entropy decreases", but that is only loosely true because
entropy decrease is not directly or systematically tied to "usefulness for human"
(a typical example is that a zipped text file has less entropy yet is no fun reading)
there is effectively a loss of information (in addition to entropy loss)
(for example when data is aggregated the [IT sense] information about individual records gets lost)
there is, particularly in the case of Information -> Knowledge, a change in the level of abstraction
A final point (if I haven't confused everybody yet...) is the idea that the data->info->knowledge chain is effectively relative to the intended use/purpose of the [IT-sense] Information.
ewernli in a comment below provides the example of the spell checker, i.e. when the focus is on English orthography, the most insightful paper from a Wall Street genius is merely a string of words, effectively "raw data", some of it in need of improvement (along the orthography purpose chain).
Similarly, a linguist using thousands of newspaper articles which typically (we can hope...) contain at least some insight/knowledge (in the general sense) may just consider these articles raw data, which will help him/her automatically create a French-German lexicon (this will be information), and as he/she works on the project, he/she may discover a systematic semantic shift in the use of common words between the two languages, and hence gather insight into the distinct cultures.
Define information and data first, very carefully.
What is information and what is data is very dependent on context. An extreme example is a picture of you at a party which you email. For you it's information, but for the ISP it's just data to be passed on.
Sometimes just adding the right context changes data to information.
So, to answer your question: No, information is not a subset of data. It could be at least the following.
A superset, when you add context
A subset, needle-in-a-haystack issue
A function of the data, e.g. in a digest
There are probably more situations.
This is how I see it...
Data is dirty and raw. You'll probably have too much of it.
... Jason ... 27 ... Denton ...
Information is the data you need, organised and meaningful.
Jason.age=27
Jason.city=Denton
Knowledge is why there are wikis, blogs: to keep track of insights and experiences. Note that these are human (and community) attributes. Except for maybe a weird science project, no computer is on Facebook telling people what it believes in.
information is an enhancement of data:
data is inert
information is actionable
note that information without data is merely an opinion ;-)
Information could be data if you had some way of representing the additional content that makes it information. A program that tries to 'understand' written text might transform the input text into a format that allows for more complex processing of the meaning of that text. This transformed format is a kind of data that represents information, when understood in the context of the overall processing system. From outside the system it appears as data, whereas inside the system it is the information that is being understood.
Suppose you wanted to estimate the size of a userbase of a site which does not publicize this information.
Different usernames are likely to have been taken with different probabilities. For instance, if the username 'nick' doesn't exist on the system, it's likely to have an extremely small userbase. If the username 'starbaby' is taken, it's likely to be a much larger site. It seems like a straightforward Bayesian problem.
There is the problem that different sites may have a different space of allowable usernames. The biggest problem would be the legality of common characters such as spaces, I imagine. Another issue that could taint the prior distribution is whether the site suggests names when the one you want is taken, or leaves you to think of a more creative name yourself.
How could you build a training set of the frequency of occurrence of usernames across different sized systems? Is there a way to use Bayes to do numeric estimation rather than classification into fixed-width buckets?
What you need to do is accurately estimate the probability that a certain user name is present given the number of users registered. Let's say N is the number of users, and u = 1 if a given user name is present and 0 if it is absent.
First of all, make the assumption that the probability distributions for each user name are independent of each other. This is not going to be true - and you've already come up with one reason why - but it will probably be necessary since it makes the data collection and the maths a lot easier.
You are going to need a lot of data from sites with registered user names and the total number of users of each site. Now, take any specific user name and imagine your data points on a 2D plot (with N on x and u on y): there's going to be one horizontal line of points at y=0 and another at y=1. You can either bin the x axis as you suggest and take the mean y coordinate of all the data points in the bin to get a discrete function, or you could try to fit the points on the graph to some class of functions. I don't really know what that class of functions would be - maybe some kind of power law? (I'm thinking of Zipf's law).
You now have the probability distributions to apply Bayes' rule. I don't know what kind of prior for N you would want to use. A uniform distribution (up to some large number) would make no assumptions, but I would guess most sites have a small user base.
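To make the Bayes step concrete, here is a rough sketch that estimates N over a grid of candidate sizes instead of fixed buckets. Everything numeric in it is made up for illustration: the per-user name frequencies, the likelihood model P(name present | N) = 1 - (1 - f)^N, and the uniform prior over the grid are all assumptions, and in practice the likelihood would come from the fitted curves described above.

```python
import math

# Hypothetical chance that any one user picks each name (not measured data).
NAME_FREQUENCY = {"nick": 1e-3, "starbaby": 1e-6}

def p_present(name, n_users):
    """Assumed likelihood: the name is taken if at least one of N users chose it."""
    return 1.0 - (1.0 - NAME_FREQUENCY[name]) ** n_users

def posterior_over_n(observations, candidate_sizes):
    """Posterior over candidate N, treating user names as independent."""
    log_weights = []
    for n in candidate_sizes:
        log_w = 0.0                                   # uniform prior over the grid
        for name, present in observations.items():
            p = min(max(p_present(name, n), 1e-12), 1.0 - 1e-12)  # avoid log(0)
            log_w += math.log(p if present else 1.0 - p)
        log_weights.append(log_w)
    peak = max(log_weights)
    weights = [math.exp(w - peak) for w in log_weights]
    total = sum(weights)
    return [w / total for w in weights]

candidate_sizes = [10 ** k for k in range(1, 9)]      # 10 users up to 100 million
posterior = posterior_over_n({"nick": True, "starbaby": False}, candidate_sizes)
most_likely = candidate_sizes[posterior.index(max(posterior))]
print(most_likely)
```

With these invented frequencies, seeing 'nick' taken and 'starbaby' free points to a site with around ten thousand users. A finer grid, or a prior that favours small sites, slots into the same loop.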
I suspect that in order to make this work, when you sample users from a site you will need to do so for a specific set of users. I'm betting that the popularity of user names is going to have a very long tail and so a random sample of users is going to give you a lot of very infrequently used names and therefore a lot of uninformative evidence.
EDIT: I had another thought; in most forums (and on StackOverflow) users have consecutive user ids, so you can use a single site with a large number of users to give you estimates for all smaller N.
I think this is a cool idea!
You may be able to put together a data set by using UserNameCheck.com for some different usernames and cross-referencing the results with the stated userbase sizes of those sites that give them out.
Note: that website does not seem to check if the usernames are valid for the site, so e.g. it thinks Gmail would let you register "nick@gmail.com" even though that's too short.
The only way is to get a large set of taken usernames on systems for which you know the size of the userbase. Data may be skewed in userbases where certain names are more common. Even a tiny userbase from a Lord of the Rings forum will likely contain the username Strider, for example.