I have a rather simple Hadoop question which I'll try to present with an example.
Say you have a list of strings and a large file, and you want each mapper to process a piece of the file and one of the strings in a grep-like program.
How are you supposed to do that? I am under the impression that the number of mappers is a result of the inputSplits produced. I could run subsequent jobs, one for each string, but it seems kinda... messy?
Edit: I am not actually trying to build a grep map-reduce version. I used it as an example of having 2 different inputs to a mapper. Let's just say that I have lists A and B and would like a mapper to work on 1 element from list A and 1 element from list B.
So given that the problem has no data dependency that would result in the need for chaining jobs, is my only option to somehow share all of list A across all mappers and then feed 1 element of list B to each mapper?
What I am trying to do is build some type of prefixed look-up structure for my data. So I have a giant text and a set of strings. This process has a strong memory bottleneck, therefore I was after 1 chunk of text/1 string per mapper.
Mappers should be able to work independently and without side effects. One way to parallelize is to have each mapper try to match a line against all patterns. Each input is only processed once!
Otherwise you could multiply each input line by the number of patterns, process each line with a single pattern, and run the reducer afterwards. A ChainMapper is the solution of choice here. But remember: a line will appear twice if it matches two patterns. Is that what you want?
In my opinion you should prefer the first scenario: Each mapper processes a line independently and checks it against all known patterns.
Hint: You can distribute the patterns with the DistributedCache feature to all mappers! ;-) Input should be split with the InputLineFormat.
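Purely as an illustration of that first scenario outside Hadoop's Java API, here is a Go sketch in which the small pattern list is shared read-only by every worker (playing the role of the DistributedCache) and the big input is split into chunks (playing the role of input splits); all names are made up:

package main

import (
	"fmt"
	"strings"
	"sync"
)

// Each "mapper" goroutine gets a chunk of lines (its input split)
// and matches every line against ALL patterns, which are shared
// read-only across workers.
func mapper(lines, patterns []string, out chan<- string, wg *sync.WaitGroup) {
	defer wg.Done()
	for _, line := range lines {
		for _, p := range patterns {
			if strings.Contains(line, p) {
				out <- line
				break // emit each matching line only once
			}
		}
	}
}

func main() {
	patterns := []string{"foo", "bar"}
	chunks := [][]string{
		{"a foo line", "nothing here"},
		{"bar here", "also nothing"},
	}

	out := make(chan string)
	var wg sync.WaitGroup
	for _, chunk := range chunks {
		wg.Add(1)
		go mapper(chunk, patterns, out, &wg)
	}
	go func() { wg.Wait(); close(out) }()

	for line := range out {
		fmt.Println(line) // every matching line, each printed once
	}
}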
A good friend had a great epiphany: what about chaining 2 mappers?
In the main program, run a job that fires up a mapper (no reducer). The input is the list of strings, and we can arrange things so that each mapper gets one string only.
In turn, the first mapper starts a new job, where the input is the text. It can communicate the string by setting a variable in the context.
Regarding your edit:
In general a mapper is not used to process 2 elements at once. It should process only one element at a time. The job should be designed in such a way that there could be a mapper for each input record and it would still run correctly!
Of course it is fine for a mapper to need some supporting information to process its input. This information can be passed in through the job configuration (Configuration.set(), for example). A larger set of data should be passed via the distributed cache.
Did you have a look at either of these options?
I'm not sure if I fully understood your problem, so please check for yourself whether this would work ;-)
BTW: an appreciative vote for my well-researched previous answer would be nice ;-)
My JSR352 batch job needs to read from a database, and then depending on the result flows to one of two pathways, each of which involves some more if/else scenarios. I wonder what the pros and cons would be between writing a single step with a large batchlet and several steps consisting of smaller batchlets. This job does not involve chunk steps with a chunk size larger than 1, as it needs to persist the read result immediately, in case there is any, before proceeding to other logic. The job will be run using Control-M; I wonder if using multiple smaller steps provides more control points.
From that description, I'd suggest the following.
Benefits of more, fine-grained steps
1. Restart
After a job failure, the default behavior on restart is to begin executing at the step where the previous job execution failed. So breaking the job up into more steps allows you to avoid writing the logic to resume where you left off and avoid re-processing, and may save execution time in the process.
2. Reuse
By encapsulating a discrete function as its own batchlet, you can potentially compose other steps in other jobs (or even later in this job) implemented with this same batchlet.
3. Extract logic into XML
By moving the transition logic into the transition elements, and extracting the conditional flow (e.g. <next on="RC1" to="step3"/>, etc.) into the job definition XML (JSL), you can introduce changes at a standard control point, without having to go into the Java source and find the right place.
Final Thoughts
You'll have to decide if those benefits are worth it for your case.
One more thought
I wouldn't automatically rule out the chunk step just because you are using a 1-item chunk, if you can still find benefits from the checkpointing or even possibly the skip/retry. (But that's probably a separate question.)
I'm starting to use RWMutex with a map in my Go project, since I now have more than one goroutine running at the same time, and while making all of the changes for that, a doubt came to my mind.
The thing is that I know we must use RLock when only reading, to allow other routines to do the same task, and Lock when writing, to block the map entirely. But what are we supposed to do when editing a previously created element in the map?
For example... Let's say I have a map[int]string where I do Lock, put "hello " inside and then Unlock. What if I want to add "world" to it? Should I do Lock, or can I do RLock?
You should approach the problem from another angle.
A simple rule of thumb you seem to understand just fine is
You need to protect the map from concurrent accesses when at least one of them is a modification.
Now the real question is what constitutes a modification of a map.
To answer it properly, it helps to notice that values stored in maps are not addressable, by design. This was engineered that way simply because maps internally have an intricate implementation which might move the values they contain around in memory, in order to provide (amortized) fast access time when the map's structure changes due to insertions and/or deletions of its elements.
The fact that map values are not addressable means you cannot do something like
m := make(map[int]string)
m[42] = "hello"
go mutate(&m[42]) // take a single element and go modifying it...
// ...while other parts of the program change _other_ values
m[123] = "blah blah"
The reason you are not allowed to do this is that the insertion operation m[123] = ... might trigger moving the storage of the map's elements around, and that might involve moving the storage of the element keyed by 42 to some other place in memory, pulling the rug out from under the feet of the goroutine running the mutate function.
So, in Go, maps really only support three operations:
Insert — or replace — an element;
Read an element;
Delete an element.
You cannot modify an element "in place"; you can only go in three steps:
Read the element;
Modify the variable containing the (read) copy;
Replace the element by the modified copy.
As you can now see, steps (1) and (3) are mere map accesses, and so the answer to your question is (hopefully) apparent: step (1) shall be done under at least a read lock, and step (3) shall be done under a write (exclusive) lock.
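To make that concrete, here is a minimal self-contained sketch of the read-modify-replace cycle (the guardedMap type and its Append method are made-up names, not a standard API):

package main

import (
	"fmt"
	"sync"
)

// guardedMap pairs a map with the RWMutex protecting it.
type guardedMap struct {
	mu sync.RWMutex
	m  map[int]string
}

// Append runs the three steps: read under the read lock,
// modify the local copy, replace under the write lock.
func (g *guardedMap) Append(key int, suffix string) {
	// Step 1: read the element (read lock suffices).
	g.mu.RLock()
	v := g.m[key]
	g.mu.RUnlock()

	// Step 2: modify the local copy (no lock needed).
	v += suffix

	// Step 3: replace the element (write lock required).
	g.mu.Lock()
	g.m[key] = v
	g.mu.Unlock()
}

func main() {
	g := &guardedMap{m: map[int]string{42: "hello "}}
	g.Append(42, "world")
	fmt.Println(g.m[42]) // hello world
}

One caveat: if another goroutine might write the same key between steps (1) and (3), the update above can be lost; in that case, hold the write lock across the whole cycle instead.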
In contrast, elements of other compound types (arrays, slices, and fields of struct types) do not have the restriction maps have: provided the storage of the "enclosing" variable is not relocated, it is fine to change its different elements concurrently by different goroutines.
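A tiny illustrative sketch of that difference:

package main

import (
	"fmt"
	"sync"
)

func main() {
	// Distinct slice elements may be written concurrently by
	// different goroutines, as long as the slice itself is not
	// grown or reallocated while they run.
	s := make([]string, 3)
	var wg sync.WaitGroup
	for i := range s {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			s[i] = fmt.Sprintf("value %d", i)
		}(i)
	}
	wg.Wait()
	fmt.Println(s) // [value 0 value 1 value 2]
}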
Since the only way to change the value associated with a key in the map is to reassign the changed value to the same key, that is a write/modification, so you have to obtain the write lock; simply using the read lock will not be sufficient.
Normally, seek commands are executed on a filter graph: they get called on the renderers in the graph, and the calls are passed upstream by filters until a filter that can handle the seek does the actual seek operation.
Could an individual filter seek the upstream filters connected to one or more of its input pins in the same way, without affecting the downstream portion of the graph in unexpected ways? I wouldn't expect there to be any graph state changes caused by calling IMediaSeeking.SetPositions upstream.
I'm assuming that all upstream filters are connected to the rest of the graph via this filter only.
Obviously the filter would need to be prepared to handle the resulting BeginFlush, EndFlush and NewSegment calls coming from upstream appropriately, and to distinguish samples that arrived before and after the seek operation. It would also need to set new sample times on its output samples so that they have consistent presentation times. Any other issues?
It is perfectly feasible to do what you require. I used this approach to build video and audio mixer filters for a video editor. A full description of the code is given in BBC White Papers 129 and 138, available from http://www.bbc.co.uk/rd
A rather ancient version of the code can be found on www.SourceForge.net if you search for AAFEditPack. The code is written in Delphi using DSPack to get access to the DirectShow headers. I did this because it makes it easier to handle COM object lifetimes, by implementing smart pointers by default. It should be fairly straightforward to transfer the ideas to a C++ implementation if that is what you use.
The filters keep lists of the sub-graphs (a section of a graph, but running in the same FilterGraph as the mixers). The filters implement a custom version of TBCPosPassThru which knows about the output pins of the sub-graph for each media clip. It handles passing on the seek commands to get each clip ready for replay when its point in the timeline is reached. The mixers handle the BeginFlush, EndFlush, NewSegment and EndOfStream calls for each sub-graph so they are kept happy. The editor uses only one FilterGraph that houses both video and audio graphs. Seeking commands are made by the graph on both the video and audio renderers, and these commands are passed upstream to the mixers, which implement them.
Sub-graphs that are not currently active are blocked by the mixer holding references to the samples they have delivered. This does not cause any problems for the FilterGraph because, as Roman R says, downstream filters only care about getting a consecutive stream of samples and do not know about what happens upstream.
Some key points you need to make sure of to avoid wasted debugging time are:
Your decoder filters need to be able to cue to the exact media frame or audio time. This is not as easy to do as you might expect, especially with compressed formats such as MPEG-2, which was designed for transmission and has no frame index in the files. If you do not do this, the filter may wait indefinitely to get a NewSegment call with the correct media times.
Your sub-graphs need to present a NewSegment time equal to the value you asked for in your seek command before delivering samples. Some decoders may seek to the nearest key frame, which is a bit unhelpful, and some are a bit arbitrary about the timings of their NewSegment and the following samples.
The start and stop times of each clip need to be within the duration of the file. It's probably not a good idea to police this in the DirectShow filter, because you would probably want to construct a timeline without needing to run the filter first. I did this in the component that manages the FilterGraph.
If you want to add sections from the same source file consecutively in the timeline, and have effects that span the transition, you need to have two instances of the sub-graph for that file; and if you have more than one transition for the same source file, your list needs to alternate the graphs for successive clips. This is because each sub-graph should only play monotonically: making lots of SetPositions calls would waste CPU cycles and would not work well with compressed files.
The filter's output pins define the entire seeking behaviour of the graph. The output sample time stamps (IMediaSample.SetTime) are implemented by the filter, so you need to get them correct without any missing time stamps. You can also set the MediaTime (IMediaSample.SetMediaTime) values if you like, although you have to be careful to get them correct or the graph may drop samples or stall.
Good luck with your development. If you need any more information please contact me through StackOverflow or DTSMedia.co.uk
I have read the series "Purely Functional Retrogames" (http://prog21.dadgum.com/23.html).
It discusses some interesting techniques to build a (semi-)pure game world update loop.
However, I have the following side question, which I cannot seem to get my head around:
Suppose you have a system where every enemy and every player is a separate actor, or a separate pure function.
Suppose they all get a WorldState as input, and output a new WorldState (or, if you're thinking in actor terms, send the new WorldState to the next actor, ending with, for instance, a "Game Render" actor).
Then there are two ways to go with this:
Either you start with one actor (for instance, the player), and feed him the current world.
Then you feed the new world to the next enemy, and so on, until all actors have transformed the world in turn. The last world is then the new world you can feed to the render loop. (Or, if you followed the above article, you end up with a list of events that have occurred in the world, which can be processed.)
A second way is to just give all actors the current WorldState at the same time. They generate changes, which may conflict (for instance, two enemies and the player could take the same coin in the same animation frame), and it is up to the game system to resolve these conflicts by processing the events. By processing all events, the game actor creates the new world, to be used in the next update frame.
I have a feeling I'm just confronted with the exact same "race condition" problem I wished to avoid by using pure functions with immutable data.
Any advice here?
I didn't read the article, but with the coin example you are creating a kind of global variable: you give a copy of the world state to all actors and you suppose that each actor will evaluate the game, make a decision and expect that its action will succeed, regardless of the last phase, which is conflict resolution. I would not call this a race condition but rather a "blind condition", and yes, this will not work.
I imagine that you came to this solution in order to allow parallelism, which is not available in solution 1. In my opinion, the problem is about responsibility.
The coin must belong to an actor (a server acting as a resource manager), as must everything in the application. This actor is solely responsible for deciding what will happen to the coin.
All requests (is there something to grab, grab it, drop something...) should be sent to this actor (one actor per cell, or per map, level or any split that makes sense for the game).
How you manage it is then up to you: serve all requests in receive order, or buffer them until a synchronization message comes and then make a random or priority-based decision... In any case the server will be able to reply to all actors with success or failure, without any risk of a race condition, since the server process runs (at least in Erlang) on a single core and processes one message at a time.
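For illustration, here is a minimal sketch of such a resource-manager actor in Go, with channels standing in for an Erlang-style mailbox (all names are made up):

package main

import "fmt"

// grabRequest asks the coin manager for the coin; the reply
// channel receives true only for the first successful grab.
type grabRequest struct {
	actor string
	reply chan bool
}

// coinManager owns the coin. It processes one request at a time
// from its mailbox, so conflicting grabs cannot race: exactly
// one actor is told it succeeded.
func coinManager(requests <-chan grabRequest) {
	taken := false
	for req := range requests {
		req.reply <- !taken
		taken = true
	}
}

func main() {
	requests := make(chan grabRequest)
	go coinManager(requests)

	// The player and two enemies all try to grab the same coin.
	for _, actor := range []string{"player", "enemy1", "enemy2"} {
		reply := make(chan bool)
		requests <- grabRequest{actor: actor, reply: reply}
		fmt.Printf("%s grabbed the coin: %v\n", actor, <-reply)
	}
}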
In addition to Pascal's answer: you can also address parallelization by splitting the (I assume huge) map into smaller chunks, each of which depends on the last state (or part of it, like an edge) of its neighbours. This allows you to distribute the game among many nodes.
I'm designing a game server and I have never done anything like this before. I was just wondering what a good structure for a packet would be data-wise? I am using TCP if it matters. Here's an example, and what I was considering using as of now:
(each value in brackets is a byte)
[Packet length][Action ID][Number of Parameters]
[Parameter 1 data length as int][Parameter 1 data type][Parameter 1 data (multi byte)]
[Parameter 2 data length as int][Parameter 2 data type][Parameter 2 data (multi byte)]
[Parameter n data length as int][Parameter n data type][Parameter n data (multi byte)]
Like I said, I really have never done anything like this before so what I have above could be complete bull, which is why I'm asking ;). Also, is passing the total packet length even necessary?
Passing the total packet length is a good idea. It might cost two more bytes, but you can peek and wait for the socket to have a full packet ready before receiving. That makes the code easier.
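For example, here is a hedged Go sketch of reading one length-prefixed packet; the 2-byte big-endian length used here is an illustrative choice, not a standard:

package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// readPacket reads one length-prefixed packet from r:
// a 2-byte big-endian length followed by that many payload bytes.
func readPacket(r io.Reader) ([]byte, error) {
	var length uint16
	if err := binary.Read(r, binary.BigEndian, &length); err != nil {
		return nil, err
	}
	payload := make([]byte, length)
	// ReadFull blocks until the whole payload has arrived,
	// so the caller never sees a partial packet.
	if _, err := io.ReadFull(r, payload); err != nil {
		return nil, err
	}
	return payload, nil
}

func main() {
	// Simulate a wire carrying one 5-byte packet.
	var wire bytes.Buffer
	binary.Write(&wire, binary.BigEndian, uint16(5))
	wire.WriteString("hello")

	p, err := readPacket(&wire)
	fmt.Println(string(p), err) // hello <nil>
}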
Overall, I agree with brazzy: a language-supplied serialization mechanism is preferable to anything self-made.
Other than that (I think you are using a C-ish language without serialization), I would put the packet ID as the first data member of the packet data structure. IMHO that's something of a convention, because the first data member of a struct is always at offset 0, so any struct can be downcast to that, identifying otherwise anonymous data.
Your compiler may or may not produce packed structures, but that way you can allocate a buffer, read the packet in, and then cast the structure depending on the first data member. If you are out of luck and it does not produce packed structures, be sure to have a serialization method for each struct that constructs it from the (obviously non-destination) memory.
Endianness is a factor, particularly in C-like languages. Be sure to make clear that packets always use the same endianness, or that you can identify a different endianness based on a signature or something. An odd thing that's very cool: C# and .NET seem to always hold data in little-endian convention when you access it as discussed in this post. I found that out when porting such an application to Mono on a SUN. Cool, but if you have that setup you should use the serialization means of C# anyway.
Other than that, your setup looks very okay!
Start by considering a much simpler basic wrapper: Tag, Length, Value (TLV). Your basic packet will look then like this:
[Tag] [Length] [Value]
Tag is a packet identifier (like your action ID).
Length is the packet length. You may need this to tell whether you have the full packet. It will also let you figure out how long the value portion is.
Value contains the actual data. The format of this can be anything.
In your case above, the value data contains a further series of TLV structures (parameter type, length, value). You don't actually need to send the number of parameters, as you can work it out from the data length by walking the data.
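As a rough illustration, here is a Go sketch of such nested TLV records; the 1-byte tag and 2-byte length used here are illustrative choices, not part of the TLV idea itself:

package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// writeTLV appends one Tag-Length-Value record: a 1-byte tag,
// a 2-byte big-endian length, then the value bytes.
func writeTLV(buf *bytes.Buffer, tag byte, value []byte) {
	buf.WriteByte(tag)
	binary.Write(buf, binary.BigEndian, uint16(len(value)))
	buf.Write(value)
}

// readTLV consumes one record and returns its tag and value.
func readTLV(buf *bytes.Buffer) (byte, []byte, error) {
	tag, err := buf.ReadByte()
	if err != nil {
		return 0, nil, err
	}
	var length uint16
	if err := binary.Read(buf, binary.BigEndian, &length); err != nil {
		return 0, nil, err
	}
	value := make([]byte, length)
	if _, err := io.ReadFull(buf, value); err != nil {
		return 0, nil, err
	}
	return tag, value, nil
}

func main() {
	// The value of the outer record is itself a series of TLV
	// records, one per parameter: no parameter count is needed.
	var params bytes.Buffer
	writeTLV(&params, 0x01, []byte("alice")) // e.g. a name parameter
	writeTLV(&params, 0x02, []byte{0x2A})    // e.g. a score byte

	var packet bytes.Buffer
	writeTLV(&packet, 0x10, params.Bytes()) // outer "action" record

	tag, value, _ := readTLV(&packet)
	fmt.Printf("action tag %#x, %d value bytes\n", tag, len(value))

	// Walk the nested parameters until the value is exhausted.
	inner := bytes.NewBuffer(value)
	for inner.Len() > 0 {
		t, v, _ := readTLV(inner)
		fmt.Printf("  param tag %#x value %v\n", t, v)
	}
}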
As others have said, I would put the packet ID (Tag) first. Unless you have cross-platform concerns, I would consider wrapping your application's serialised object in a TLV and sending it across the wire like that. If you make a mistake or want to change later, you can always create a new tag with a different structure.
See Wikipedia for more details on TLV.
To avoid reinventing the wheel, any serialization protocol will work for on-the-wire data (e.g. XML, JSON), and you might consider looking at BEEP for the basic protocol framework.
BEEP is summed up well in its FAQ document as 'kind of a "best hits" album of the tricks used by experienced application protocol designers since the early 80's.'
There's no reason to make something as complicated as that. I see that you have an action ID, so I suppose there would be a fixed number of actions.
For each action, you would define a data structure and put each of those values in the structure. To send it over the wire, you just allocate sum(sizeof(struct.i)) bytes for the elements of your structure. So your packet would look like this:
[action ID][item 1 (sizeof(item 1) bytes)][item 2 (sizeof(item 2) bytes)]...[item n (sizeof(item n) bytes)]
The idea is that you already know the size and type of each variable on each side of the connection, so you don't need to send that information.
For strings, you can just throw 'em in in null-terminated form, and then when you 'know' to look for a string based on your packet type, start reading and look for a null.
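As a sketch of this idea in Go (the moveAction layout is hypothetical): both sides agree on the layout up front, so only the action ID and raw values cross the wire:

package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// A hypothetical "move" action whose fixed layout both sides
// already know, so no lengths or types need to be sent.
const actionMove byte = 1

type moveAction struct {
	X, Y  int16
	Speed uint8
}

func main() {
	var wire bytes.Buffer

	// Encode: action ID, the fixed-size fields, then a
	// null-terminated string for the variable-length part.
	wire.WriteByte(actionMove)
	binary.Write(&wire, binary.BigEndian, moveAction{X: 10, Y: -3, Speed: 2})
	wire.WriteString("player1")
	wire.WriteByte(0) // string terminator

	// Decode: the action ID tells the receiver which fixed
	// layout to read; the string is read up to the null.
	id, _ := wire.ReadByte()
	var m moveAction
	binary.Read(&wire, binary.BigEndian, &m)
	name, _ := wire.ReadString(0)
	name = name[:len(name)-1] // drop the terminator

	fmt.Println(id, m, name) // 1 {10 -3 2} player1
}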
--
Another option would be to use '\r\n' to delineate your variables. That would incur some overhead, and you would have to use text rather than binary values for numbers. But that way you could just use readline to read each variable. Your packets would look like this (see the sketch after the list):
[action ID]
[item 1 (as text)]
...
[item n (as text)]
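A quick Go sketch of this text variant (the "move" packet contents are made up):

package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

func main() {
	// A three-line "move" packet in the text form described above.
	wire := "move\r\n10\r\n-3\r\n"

	r := bufio.NewReader(strings.NewReader(wire))
	readLine := func() string {
		line, _ := r.ReadString('\n')
		return strings.TrimRight(line, "\r\n")
	}

	action := readLine()             // action ID as text
	x, _ := strconv.Atoi(readLine()) // each variable is one line
	y, _ := strconv.Atoi(readLine())
	fmt.Println(action, x, y) // move 10 -3
}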
--
Finally, simply serializing objects and passing them down the wire is a good way to do this too, with the least amount of code to write. Remember that you don't want to prematurely optimize, and that includes network traffic as well. If it turns out you need to squeeze out a little more performance later on, you can go back and figure out a more efficient mechanism.
Also check out Google's protocol buffers, which are supposedly an extremely fast way to serialize data in a platform-neutral way, kind of like a binary XML, but without nested elements. There's also JSON, which is another platform-neutral encoding. Using protocol buffers or JSON would mean you wouldn't have to worry about how to specifically encode the messages.
Do you want the server to support multiple clients written in different languages? If not, it's probably not necessary to specify the structure exactly; instead use whatever facility for serializing data your language offers, simply to reduce the potential for errors.
If you do need the structure to be portable, the above looks OK, though you should specify stuff like endianness and text encoding as well in that case.