CLSR chip testing problem - recursion

Finding ONE good VLSI chip in a population of good and bad ones, by using the
pair test.
Chip A Chip B Conclusion
------- ------- ----------
B is good A is good both are good or both are bad
B is good A is bad at least one is bad
B is bad A is good at least one is bad
B is bad A is bad at least one is bad
Assumption : number of goods > number of bads
We can solve this in O(n) time complexity by splitting the population in half
every time and collecting one element of the GOOD, GOOD pair.
T(n) = T(n/2) + n/2
But to collect the pairs we need n/2 memory separately.
Can we do this in-place without using extra memory ??

The algorithm is based on the question, "can we remove this chip?" So, for each chip to be removed, we simply erase it from our linked list, in place (or rather, no place at all).

Related

Human readable alternative for UUIDs

I am working on a system that makes heavy use of pseudonyms to make privacy-critical data available to researchers. These pseudonyms should have the following properties:
They should not contain any information (e.g. time of creation, relation to other pseudonyms, encoded data, …).
It should be easy to create unique pseudonyms.
They should be human readable. That means they should be easy for humans to compare, copy, and understand when read out aloud.
My first idea was to use UUID4. They are quite good on (1) and (2), but not so much on (3).
An variant is to encode UUIDs with a wider alphabet, resulting in shorter strings (see for example shortuuid). But I am not sure whether this actually improves readability.
Another approach I am currently looking into is a paper from 2005 titled "An optimal code for patient identifiers" which aims to tackle exactly my problem. The algorithm described there creates 8-character pseudonyms with 30 bits of entropy. I would prefer to use a more widely reviewed standard though.
Then there is also the git approach: only display the first few characters of the actual pseudonym. But this would mean that a pseudonym could lose its uniqueness after some time.
So my question is: Is there any widely-used standard for human-readable unique ids?
Not aware of any widely-used standard for this. Here’s a non-widely-used one:
Proquints
https://arxiv.org/html/0901.4016
https://github.com/dsw/proquint
A UUID4 (128 bit) would be converted into 8 proquints. If that’s too much, you can take the last 64 bits of the UUID4 (= just take 64 random bits). This doesn’t make it magically lose uniqueness; only increases the likelihood of collisions, which was non-zero to begin with, and which you can estimate mathematically to decide if it’s still OK for your purposes.
Here you go UUID Readable
Generate Easy to Remember, Readable UUIDs, that are Shakespearean and Grammatically Correct Sentences
This article suggests to use the first few characters from a SHA-256 hash, similarly to what git does. UUIDs are typically based on SHA-1, so this is not all that different. The tradeoff between property (2) and (3) is in the number of characters.
With d being the number of digits, you get 2 ** (4 * d) identifiers in total, but the first collision is expected to happen after 2 ** (2 * d).
The big question is really not about the kind of identifier you use, it is how you handle collisions.

Possibilities of dividing a class in groups with several criteria

I have to divide a class of 50 students writing a dissertation in 10 different discussion groups of 5 members each. In theory, there are 1.35363x10^37 possible ways of doing this, which is just the result of {50!}/{(5!^10)*10!)}, if it is already decided that the groups will consist of 5.
However, each group is to be led by a facilitator. This reduces the number of possible combinations considerably, because each facilitaror has one field of expertise among 5 possible ones, which should be matched to the topics the students are writing about as much as possible. If there are three facilitators with competence A, three with competence B, two with competence C, one with competence D and one with competence E, and 15 students are assigned to A, 15 to B, 10 to C, 5 to D and 5 to E, the number of possible combinations comes down to 252 505.
But both students and facilitators keep advocating for the use of more criteria, instead of just focusing on field of expertise. For example, wanting to be in a group of students that know each other, or being in a group with a facilitator that has particular knowledge of a specific research method.
I am trying to illustrate my intuitive reasoning, which tells me that each new criteria increases the complexity/impossibility of the task, if the objective is a completely efficient solution. But I can't get my head around expressing this analytically in a satisfactory manner.
Is my reasoning correct, that adding criteria would reduce the amount of possibilities that can be discarded following the inclusion-exclusion principle, thus making the task more complex, adding possible combinations? I also think that if the criteria are not compatible (for example if students that know each other are writing about different topics, and there aren't enough competent facilitators), certain constraints become inviable.
You need to distinguish between computational complexity and human complexity. Adding constraints almost automatically increases the human complexity of the problem in the sense that it means that there is more to wrap your mind around. But -- it isn't true that the computational complexity increases. At least sometimes it decreases.
For example, say you have a set of 200 items and you want to determine if there is a subset of them which satisfy some constraint. Depending on the constraint, There might be no feasible way to do it. After all, 2^200 is much too large to brute-force. Now add the constraint that the subset needs to have exactly 3 elements. Now all of a sudden it is possible to brute force (just run through all 1,313,400 3-element subsets until you either find a solution or determine that none exist). This is enough to show that it isn't true that adding a constraint always makes a problem intrinsically more difficult. In the discrete case a new constraint can cut down on the size of the search space in a way that can be exploited. In the continuous cases it can reduce degrees of freedom and thus lower the dimension of the problem. This isn't to say that it always makes it easier. Probably as a rule of thumb, additional constraints tend to make a problem more difficult.
Your actual problem isn't spelled out enough to give concrete advice. One possibility (and one way to handle a proliferation of somewhat extraneous constraints) is to divide the constraints into hard constraints which need to be satisfied and soft constraints which are merely desired but not strictly needed. Turn it into an optimization problem: find the solution which maximizes the number of soft-constraints that are satisfied, subject to the condition that it satisfies the hard constraints. Perhaps you can formulate it as an integer programming problem and hopefully find an exact solution. Or, if it is easy to generate solutions that satisfy the hard constraints and it is easy to mutate one such solution to obtain another (e.g. swap two students who are in different groups), then an evolutionary algorithm would be a reasonable heuristic.

Game Theory with prediction

To impress two (german) professors i try to improve the game theory.
AI in Computergames.
Game Theory:  Intelligence is a well educated proven Answer to an Question.
This means a thoughtfull decision is choosing an act who leads to an optimal result.
Question -> Resolution -> Answer -> Test (Check)
For Example one robot is fighting another robot.
This robot has 3 choices:
-move forward
-hold position
-move backward
The resulting Programm is pretty simple
randomseed = initvalue;
while (one_is_alive)
{
choice = randomselect(options,probability);
do_choice(roboter);
}  
We are using pseudorandomness.
The test for success is simply did he elimate the opponent.
The robots have automatically shooting weapons :
struct weapon
{
range
damage
}
struct life
{
hitpoints
}
Now for some Evolution.
We let 2 robots fight each other and remember the randomseeds.
What is the sign of a succesfull Roboter ?
struct {
ownrandomseed;
list_of_opponentrandomseed; // the array of the beaten opponents.
}
Now the question is how do we choose the right strategy against an opponent ?
 
We assume we have for every possible seed-strategy the optimal anti-strategy.  
Now the only thing we have to do is to observe the numbers from the opponent
and calculate his seed value.Then we could choose the right strategy.
For cracking the random generator we can use the manual method :
http://alumni.cs.ucr.edu/~jsun/random-number.pdf
or the brute Force :
https://jazzy.id.au/2010/09/20/cracking_random_number_generators_part_1.html
It depends on the algorithm used to generate the (pseudo) random numbers. If the pseudo random number generator algorithm is known, you can guess the seed by observing a number of states (robot moves). This is similar to brute force guessing a password, used for encryption, as, some encryption algorithms are known as stream ciphers, and are basically (sometimes exactly), a one time pad that is used to obfuscate the data. Now, lets say that you know the pseudorandom number generator used is a simple lagged fibonacci generator. Then, you know that they are generating each number by calculating x(n) = x(n - 2) + x(n - 3) % 3. Therefore, by observing 3 different robot moves, you will then be able to predict all of the future moves. The seed, is the first 3 numbers supplied that give the sequence you observe. Now, most random number generators are not this simple, some have up to 1024 bit length seeds, and would be impossible for a modern computer to cycle through all of those possibilities in a brute force manner. So basically, what you would need to do, is to find out what PRNG algorithm is used, find out all possible initial seed values, and devise an algorithm to determine the seed the opponent robot is using based upon their actions. Depending on the algorithm, there are ways of guessing the seed faster than testing each and every one. If there is a faster way of guessing such a seed, this means that the PRNG in question is not suitable from cryptographic applications, as it means passwords are easier guessed. AES256 itself has a break, but it still takes theoretically 2 ^ 111 guesses (instead of the brute force 2 ^256 guesses), which means it has been broken, technically, but 2 ^ 111 is still way too many operations for modern computers to process in a meaningful time frame.
if the PRNG was lagged fibonacci (which is never used anymore, I am just giving a simple example) and you observed that the robot did option 0, then, 1, then 2... you would then know that the next thing the robot will do is... 1, since 0 + 1 % 3 = 1. You could also backtrack, and figure out what the initial values were for this PRNG, which represents the seed.

Why is Unix/Terminal faster than R?

I'm new to Unix, however, I have recently realized that very simple Unix commands can do very simple things to large data set very very quickly. My question is why are these Unix commands so fast relative to R?
Let's begin by assuming that the data is big, but not larger than the amount of RAM on your computer.
Computationally, I understand that Unix commands are likely faster than their R counterparts. However, I can't imagine that this would explain the entire time difference. After all basic R functions, like Unix commands, are written in low-level languages like C/C++.
I therefore suspect that the speed gains have to do with I/O. While I only have a basic understanding of how computers work, I do understand that to manipulate data it most first be read from disk (assuming the data is local). This is slow. However, regardless of whether you use R functions or Unix commands to manipulate data both most obtain the data from disk.
Therefore I suspect that how data is read from disk, if that even makes sense, is what is driving the time difference. Is that intuition correct?
Thanks!
UPDATE: Sorry for being vague. This was done on purpose, I was hoping to discuss this idea in general, rather than focus on a specific example.
Regardless, I'll generate an example of counting the number of rows
First I'll generate a big data set.
row = 1e7
col = 50
df<-matrix(rpois(row*col,1),row,col)
write.csv(df,"df.csv")
Doing it with Unix
time wc -l df.csv
real 0m12.261s
user 0m1.668s
sys 0m2.589s
Doing it with R
library(data.table)
system.time({ nrow(fread("df.csv")) })
...
user system elapsed
26.77 1.67 47.07
Notice that elapsed/real > user + system. This suggests that the CPU is waiting on the disk.
I suspected the slow speed of R has to do with reading the data in. It appears that I'm right:
system.time(fread("df.csv"))
user system elapsed
34.69 2.81 47.41
My question is how is the I/O different for Unix and R. Why?
I'm not sure what operations you're talking about, but in general, more complex processing systems like R use more complex internal data structures to represent the data being manipulated, and constructing these data structures can be a big bottleneck, significantly slower than the simple lines, words, and characters that Unix commands like grep tend to operate on.
Another factor (depending on how your scripts are set up) is whether you're processing the data one thing at a time, in "streaming mode", or reading everything into memory. Unix commands tend to be written to operate in pipelines, and to read a small piece of data (usually one line), process it, maybe write out a result, and move on to the next line. If, on the other hand, you read the entire data set into memory before processing it, then even if you do have enough RAM, allocating and organizing all the necessary memory can be very expensive.
[updated in response to your additional information]
Aha. So you were asking R to read the whole file into memory at once. That accounts for much of the difference. Let's talk about a few more things.
I/O. We can think about three ways of reading characters from a file, especially if the style of processing we're doing affects the way that's most convenient to do the reading.
Unbuffered small, random reads. We ask the operating system for 1 or a few characters at a time, and process them as we read them.
Unbuffered large, block-sized reads. We ask the operating for big chunks of memory -- usually of a size like 1k or 8k -- and chew on each chunk in memory before asking for the next chunk.
Buffered reads. Our programming language gives us a way of asking for as many characters as we want out of an intermediate buffer, and code that's built into the language ("library" code) automatically takes care of keeping that buffer full by reading large, block-sized chunks from the operating system.
Now, the important thing to know is that the operating system would much rather read big, block-sized chunks. So #1 can be drastically slower than 2 and 3. (I've seen factors of 10 or 100.) But no well-written programs use #1, so we can pretty much forget about it. As long as you're using 2 or 3, the I/O speed will be roughly the same. (In extreme cases, if you know what you're doing, you can get a little efficiency increase by using 2 instead of 3, if you can.)
Now let's talk about the way each program processes the data. wc has basically 5 steps:
Read characters one at a time. (I can assure you it uses method 3.)
For each character read, add one to the character count.
If the character read was a newline, add one to the line count.
If the character read was or wasn't a word-separator character, update the word count.
At the very end, print out the counts of lines, words, and/or characters, as requested.
So as you can see it's all I/O and very simple, character-based processing. (The only step that's at all complicated is 4. As an exercise, I once wrote a version of wc that contrived not to do all of steps 2, 3, and 4 inside the read loop if the user didn't ask for all the counts. My version did indeed run significantly faster if you invoked wc -c or wc -l. But obviously the code was significantly more complicated.)
In the case of R, on the other hand, things are quite a bit more complicated. First, you told it to read a CSV file. So as it reads, it has to find the newlines separating lines and the commas separating columns. That's roughly equivalent to the processing that wc has to do. But then, for each number that it finds, it has to convert it into an internal number that it can work with efficiently. For example, if somewhere in the CSV file occurs the sequence
...,12345,...
R is going to have to read those digits (as individual characters) and then do the equivalent of the math problem
1 * 10000 + 2 * 1000 + 3 * 100 + 4 * 10 + 5 * 1
to get the value 12345.
But there's more. You asked R to build a table. A table is a specific, highly regular data structure which orders all the data into rigid rows and columns for efficient lookup. To see how much work that can be, let's use a slightly far-fetched hypothetical real-world example.
Suppose you're a survey company and it's your job to ask people walking by on the street certain questions. But suppose that the questions are complicated enough that you need all the people seated in a classroom at once. (Suppose further that the people don't mind this inconvenience.)
But first you have to build that classroom. You're not sure how many people are going to walk by, so you build an ordinary classroom, with room for 5 rows of 6 desks for 30 people, and you haul in the desks, and the people start filing in, and after 30 people file in you notice there's a 31st, so what do you do? You could ask him to stand in the back, but you're kind of fixated on the rigid-rows-and-columns idea, so you ask the 31st person to wait, and you quickly call the builders and ask them to build a second 30-person classroom right next to the first, and now you can accept the 31st person and in fact 29 more for a total of 60, but then you notice a 61st person.
So you ask him to wait, and you call the builders back again, and you have them build two more classrooms, so now you've got a nice 2x2 grid of 30-person classrooms, but the people keep coming and soon enough the 121st person shows up and there's not enough room and you still haven't even started asking your survey questions yet.
So you call some fancier builders that know how to do steelwork and you have them build a big 5-story building next door with 50-person classrooms, 5 on each floor, for a total of 50 x 5 x 5 = 1,250 desks, and you have the first 120 people (who've been waiting patiently) file out of the old rooms into the new building, and now there's room for the 121st person and quite a few more behind him, and you hire some wreckers to demolish the old classrooms and recycle some of the materials, and the people keep coming and pretty soon there's 1,250 people in your new building waiting to be surveyed and the 1,251st has just showed up.
So you build a giant new skyscraper with 1,000 desks on each floor and 100 floors, and you demolish the old 5-story building, but the people keep coming, and how big did you say your big data set was? 1e7 x 50? So I don't think the 100-story building is going to be big enough, either. (And when you're all done with all this, the only "survey question" you're going to ask is "How many rows are there?")
Contrived as it may seem, this is actually not too bad an analogy for what R is having to do internally to build the table to store that data set in.
Meanwhile, Bob's discount survey company, who can only tell you how many people he surveyed and how many were men and women and in which age brackets, is down there on the streetcorner, and the people are filing by, and Bob is jotting down tally marks on his clipboards, and the people, once surveyed, are walking away and going about their business, and Bob isn't wasting time and money building any classrooms at all.
I don't know anything about R, but see if there's a way to construct an empty 1e7 x 50 matrix up front, and read the CSV file into it. You might find that significantly quicker. R will still have to do some building, but at least it won't have any false starts.

Prolog Accumulators. Are they really a "different" concept?

I am learning Prolog under my Artificial Intelligence Lab, from the source Learn Prolog Now!.
In the 5th Chapter we come to learn about Accumulators. And as an example, these two code snippets are given.
To Find the Length of a List
without accumulators:
len([],0).
len([_|T],N) :- len(T,X), N is X+1.
with accumulators:
accLen([_|T],A,L) :- Anew is A+1, accLen(T,Anew,L).
accLen([],A,A).
I am unable to understand, how the two snippets are conceptually different? What exactly an accumulator is doing different? And what are the benefits?
Accumulators sound like intermediate variables. (Correct me if I am wrong.) And I had already used them in my programs up till now, so is it really that big a concept?
When you give something a name, it suddenly becomes more real than it used to be. Discussing something can now be done by simply using the name of the concept. Without getting any more philosophical, no, there is nothing special about accumulators, but they are useful.
In practice, going through a list without an accumulator:
foo([]).
foo([H|T]) :-
foo(T).
The head of the list is left behind, and cannot be accessed by the recursive call. At each level of recursion you only see what is left of the list.
Using an accumulator:
bar([], _Acc).
bar([H|T], Acc) :-
bar(T, [H|Acc]).
At every recursive step, you have the remaining list and all the elements you have gone through. In your len/3 example, you only keep the count, not the actual elements, as this is all you need.
Some predicates (like len/3) can be made tail-recursive with accumulators: you don't need to wait for the end of your input (exhaust all elements of the list) to do the actual work, instead doing it incrementally as you get the input. Prolog doesn't have to leave values on the stack and can do tail-call optimization for you.
Search algorithms that need to know the "path so far" (or any algorithm that needs to have a state) use a more general form of the same technique, by providing an "intermediate result" to the recursive call. A run-length encoder, for example, could be defined as:
rle([], []).
rle([First|Rest],Encoded):-
rle_1(Rest, First, 1, Encoded).
rle_1([], Last, N, [Last-N]).
rle_1([H|T], Prev, N, Encoded) :-
( dif(H, Prev)
-> Encoded = [Prev-N|Rest],
rle_1(T, H, 1, Rest)
; succ(N, N1),
rle_1(T, H, N1, Encoded)
).
Hope that helps.
TL;DR: yes, they are.
Imagine you are to go from a city A on the left to a city B on the right, and you want to know the distance between the two in advance. How are you to achieve this?
A mathematician in such a position employs magic known as structural recursion. He says to himself, what if I'll send my own copy one step closer towards the city B, and ask it of its distance to the city? I will then add 1 to its result, after receiving it from my copy, since I have sent it one step closer towards the city, and will know my answer without having moved an inch! Of course if I am already at the city gates, I won't send any copies of me anywhere since I'll know that the distance is 0 - without having moved an inch!
And how do I know that my copy-of-me will succeed? Simply because he will follow the same exact rules, while starting from a point closer to our destination. Whatever value my answer will be, his will be one less, and only a finite number of copies of us will be called into action - because the distance between the cities is finite. So the total operation is certain to complete in a finite amount of time and I will get my answer. Because getting your answer after an infinite time has passed, is not getting it at all - ever.
And now, having found out his answer in advance, our cautious magician mathematician is ready to embark on his safe (now!) journey.
But that of course wasn't magic at all - it's all being a dirty trick! He didn't find out the answer in advance out of thin air - he has sent out the whole stack of others to find it for him. The grueling work had to be done after all, he just pretended not to be aware of it. The distance was traveled. Moreover, the distance back had to be traveled too, for each copy to tell their result to their master, the result being actually created on the way back from the destination. All this before our fake magician had ever started walking himself. How's that for a team effort. For him it could seem like a sweet deal. But overall...
So that's how the magician mathematician thinks. But his dual the brave traveler just goes on a journey, and counts his steps along the way, adding 1 to the current steps counter on each step, before the rest of his actual journey. There's no pretense anymore. The journey may be finite, or it may be infinite - he has no way of knowing upfront. But at each point along his route, and hence when ⁄ if he arrives at the city B gates too, he will know his distance traveled so far. And he certainly won't have to go back all the way to the beginning of the road to tell himself the result.
And that's the difference between the structural recursion of the first, and tail recursion with accumulator ⁄ tail recursion modulo cons ⁄ corecursion employed by the second. The knowledge of the first is built on the way back from the goal; of the second - on the way forth from the starting point, towards the goal. The journey is the destination.
see also:
Technical Report TR19: Unwinding Structured Recursions into Iterations. Daniel P. Friedman and David S. Wise (Dec 1974).
What are the practical implications of all this, you ask? Why, imagine our friend the magician mathematician needs to boil some eggs. He has a pot; a faucet; a hot plate; and some eggs. What is he to do?
Well, it's easy - he'll just put eggs into the pot, add some water from the faucet into it and will put it on the hot plate.
And what if he's already given a pot with eggs and water in it? Why, it's even easier to him - he'll just take the eggs out, pour out the water, and will end up with the problem he already knows how to solve! Pure magic, isn't it!
Before we laugh at the poor chap, we mustn't forget the tale of the centipede. Sometimes ignorance is bliss. But when the required knowledge is simple and "one-dimensional" like the distance here, it'd be a crime to pretend to have no memory at all.
accumulators are intermediate variables, and are an important (read basic) topic in Prolog because allow reversing the information flow of some fundamental algorithm, with important consequences for the efficiency of the program.
Take reversing a list as example
nrev([],[]).
nrev([H|T], R) :- nrev(T, S), append(S, [H], R).
rev(L, R) :- rev(L, [], R).
rev([], R, R).
rev([H|T], C, R) :- rev(T, [H|C], R).
nrev/2 (naive reverse) it's O(N^2), where N is list length, while rev/2 it's O(N).

Resources