I have an issue regarding mrjob.
I'm using a Hadoop cluster with three datanodes, one namenode, and one jobtracker.
Starting from a nifty sample application, I wrote something like the following first_script.py:
for i in range(1,2000000):
print "My Line "+str(i)
This obviously writes a bunch of lines to stdout. The second script, second_script.py, contains the mrjob mapper and reducer.
Calling from a Unix (GNU) shell, I tried:
python first_script.py | python second_script.py -r hadoop
This gets the job done, but it first uploads the entire input to HDFS. Only once everything is uploaded does it start the second job.
So my question is:
Is it possible to force a stream? (Like sending EOF?)
Or did I get the whole thing wrong?
Obviously you have long since forgotten about this, but I'll reply anyway: no, it's not possible to force a stream. The whole Hadoop programming model is about taking files as input and outputting files (and possibly creating side effects, e.g. uploading the same stuff to a database).
It might help if you clarified a little more what you want to achieve. However, it sounds like you might want the contents of a pipe to be processed periodically, rather than waiting until the stream is finished. The stream can't be forced.
The reader of the pipe (your second_script.py) needs to break its stdin into chunks, using either
a fixed number of lines, as in this question and answer (see the sketch after this list), or
non-blocking reads and a preset idle period, or
a predetermined break sequence emitted by first_script.py, such as a 'blank' line consisting of only \0.
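A minimal sketch of the fixed-number-of-lines option (CHUNK_LINES and the process_chunk placeholder are assumptions, not part of mrjob):

import sys

CHUNK_LINES = 10000  # assumed chunk size; tune to your workload

def process_chunk(lines):
    # Placeholder: you might write the chunk to a temporary file here
    # and submit one mrjob run over that file.
    sys.stdout.write(''.join(lines))

chunk = []
for line in iter(sys.stdin.readline, ''):  # readline avoids read-ahead buffering
    chunk.append(line)
    if len(chunk) == CHUNK_LINES:
        process_chunk(chunk)
        chunk = []
if chunk:  # flush the final partial chunk at EOF
    process_chunk(chunk)

Each chunk becomes one complete, finite input, which is what the Hadoop programming model requires.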
Related
I have a list of ~90 zipped files. I have written a loop to unzip them (each becomes approximately 1 GB), do some computations, save the output for each file, and delete the unzipped file. One iteration of this process takes 30-60 minutes per file [not all files are exactly the same size].
I am not too worried about the time, as I can leave it working over the weekend. However, R doesn't manage to get all the way through. I left it on Friday night, but it only ran for 12 hours, so it processed only 30 of the 90 files.
I don't deal with this type of heavy processing often, but the same has happened in the past with analogous processes. Is there any command I need to insert in my loops to keep the computer from freezing during such intensive processing? I tried gc() at the end of the loop, to no avail.
Is there a list of “good practice” recommendations for this type of procedure?
If your session is freezing, you are likely running into a problem you need to isolate: it may be a single file, or you may be running short of memory and using swap extensively.
Regardless, here are some tips or ideas you could implement:
Write your code so that it processes one file as a single case, e.g. a function like process_gz_folder(). Then loop over the file paths and invoke the function each time; this keeps the global environment clean.
As you already tried, sometimes gc() can help, but it depends on the situation and on whether memory is actually being freed (after running rm(), for example). It could be used after invoking the function from the first point.
Are you keeping the results of each file in memory? Does this set of results grow with each iteration? If so, it may be taking up needed memory; storing the results to disk in a suitable format will let you accumulate them after each file has been processed.
To add to the prior point, if files produce outputs, make sure their names are appropriate, and consider adding a timestamp (e.g. inputfile_results_YYYYMMDD).
The code could check whether a file has already been processed and skip to the next one; this helps when restarting, especially if your method for checking whether a file is processed is the existence of its output (with timestamp!).
Use try() to make sure failures do not stop future iterations; however, it should produce warnings or output to notify you of a failure so that you can come back to it later.
A more abstract approach is to create a single script that processes a single file. It could just contain the function from the first point, preceded by setTimeLimit() with a time after which, if the file is not processed, the code stops running. Iterate over this script with a shell script that invokes the R script via Rscript, which can be passed arguments (file paths, for example); a sketch of such a driver follows this list. This approach may help avoid freezes, but it depends on you knowing and setting an acceptable time limit.
Determine whether the files are too large for memory; if so, the code may need to be made more memory efficient, or changed to process the data incrementally so as not to run out of memory.
Reduce other tasks on the computer that take resources and may contribute to a freeze.
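A driver along those lines could look like the following sketch (written in Python for brevity; the script name process_one.R, the input glob, and the one-hour limit are all assumptions):

import glob
import subprocess

# Run a fresh R session per file so memory is released between files,
# and kill any run that exceeds the time limit.
for path in sorted(glob.glob('data/*.gz')):  # assumed input location
    try:
        subprocess.run(['Rscript', 'process_one.R', path],
                       check=True, timeout=3600)  # 1 hour per file, an assumption
    except subprocess.TimeoutExpired:
        print('timed out, skipping:', path)
    except subprocess.CalledProcessError as err:
        print('failed:', path, 'exit code', err.returncode)

Because each file runs in its own process, one pathological file can no longer freeze the whole batch, and a rerun can skip files whose outputs already exist.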
These are just some ideas that spring to mind that may be worth considering in your example (given the info provided). It would help to see some code and understand what kind of processing you are doing on each file.
With as little information as you have provided, it is hard to tell what the problem really is.
If possible, I would first unzip and concatenate the files. Then preprocess the data and strip off all fields that are not required for analysis. The resulting file would then be used as input for R.
Also be aware that parsing the input strings as, e.g., timestamps may be quite time-consuming.
I am working on a project that involves GNU Radio/GRC and am not very familiar with the software. I am trying to output data to a serial port in GNU Radio using a block, but have not found a way to do so.
I was wondering if there is a pre-defined block that I can use to send this information to a serial port (USB on a Raspberry Pi 3), or if I have to create my own block, and if so, what that code would look like.
I have been able to write the data to a file using the File Sink to make sure I was getting data, and was wondering if the fix is something as simple as changing the File Sink to a serial port sink. See picture below:
http://imgur.com/a/BdaMZ
I also did some research and found a GitHub repo that looks like what I need; unfortunately, the repository it links to is no longer there. It did mention using pyserial, which I believe is meant for creating my own block in Python. The link to this repo is below:
https://github.com/jmalsbury/gr-pyserial
… was wondering if the fix is something as simple as changing the File Sink to a serial port sink.
Yes! Or no, it's even easier:
In fact, you could simply use your File Sink to write to e.g. /dev/ttyS0 (or /dev/ttyUSB0, or whatever the device name of your serial port is), but you'd first have to set up the serial port separately to behave the way you want. A way of doing that would be using stty, e.g.
stty -F /dev/ttyS0 115200
prior to running your flow graph.
Note that practically everything in your flow graph points to you not yet being proficient enough with GNU Radio to successfully exchange data. I can't cover everything here; please read the official Guided Tutorials, but:
In a flow graph like yours, where the IO is the inherently rate-limiting element, you must not use "Throttle". Throttle is really just a tool to keep a flow graph from consuming all your CPU (and to slow down simulations).
Giving your files a .grc ending is bad practice, as that ending is reserved for GNU Radio flow graphs.
Giving it a .txt ending is plainly misleading, since there's no text involved whatsoever. The "file format" (I wouldn't even call it a format) is really just plain binary numbers as your computer handles them, not decimal ASCII representations of those floating-point binary numbers.
I also did some research and found a GitHub repo that looks like what I need; unfortunately, the repository it links to is no longer there.
I don't know what you're referring to; https://github.com/jmalsbury/gr-pyserial exists perfectly well!
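If you do end up writing your own block, a minimal sketch of an embedded Python block using pyserial could look like this (the port name, baud rate, and float input type are assumptions; adapt them to your flow graph):

import numpy as np
import serial
from gnuradio import gr

class serial_sink(gr.sync_block):
    """Write incoming samples to a serial port via pyserial."""
    def __init__(self, port='/dev/ttyUSB0', baudrate=115200):  # assumed defaults
        gr.sync_block.__init__(self,
                               name='serial_sink',
                               in_sig=[np.float32],  # assuming a float stream
                               out_sig=None)  # sink block: no outputs
        self.ser = serial.Serial(port, baudrate)

    def work(self, input_items, output_items):
        # Send the raw float bytes, just like a File Sink would write them.
        self.ser.write(input_items[0].tobytes())
        return len(input_items[0])

This writes the same raw binary stream as a File Sink, so whatever reads the serial port must interpret the bytes accordingly.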
According to https://randomascii.wordpress.com/2013/11/04/exporting-arbitrary-data-from-xperf-etl-files/, wpaexporter.exe should be the right tool for exporting arbitrary data from an ETL trace.
I managed to prepare a profile with the right data but, unfortunately, wpaexporter keeps trying to translate addresses, even if "-symbols" is not given on the command line, generating some useless
/<ModuleName.dll>!<Symbols disabled>
warnings.
This is annoying because part of our application uses Delphi code that cannot generate symbols in a Microsoft-compatible format. With raw addresses, we would be able to find the Delphi symbols in the call stack using map files.
Is there a way to extract call stack addresses from a WPR trace?
Thanks, I completely missed the processing options of xperf...
In the meantime, I found that LogParser (https://www.microsoft.com/en-us/download/details.aspx?id=24659) can also export an ETL file to a CSV (with actual values as well):
LogParser.exe "Select * from file.etl" -i:ETW -o:CSV -oTsFormat "HH:mm:ss.ln" > output_file.csv
From what I have seen so far, LogParser output might be more suitable for automatic parsing (only one line per event in the file, no header), while xperf output is more suitable for human reading (tabular representation).
Yes. You can also use xperf.exe. Have you tried the actions option?
xperf -a stack should help here, I expect.
You can see detailed info with xperf -help processing.
So I have ffmpeg writing its progress to a text file, and I need to read the new values (lines) from said file. How should I approach this using Qt classes in order to minimize the amount of code I have to write?
I don't even have an idea where to start, other than doing ugly things like seeking to the end, storing that position, then seeking to the end again a bit later and comparing the new position to the previous one. It's unclear to me whether QTextStream can be used here, for instance.
I used the Win32 API's own interface for file system notification some time ago and that worked 100% reliably. Modern OSes provide notifications for file changes, and Qt incorporates such functionality as well. Specifically, for the purpose of tracking file changes I would use the QFileSystemWatcher::fileChanged signal to invoke a myFileReadNextBuffer() slot only when the file has actually changed. You would then still want to determine how many bytes were added by subtracting the previous file length from the new one. There is also a related question here: How to know when and which files are changed in the Windows filesystem with WinAPI.
If the file is only growing:
whether the file is text-based or not, I would open it in shared mode, read to the end, and read further to the new end each time a notification is received.
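A minimal sketch of that pattern, shown in Python with PyQt5 for brevity (the same classes exist in C++; the path argument and the printing are assumptions):

import sys
from PyQt5.QtCore import (QCoreApplication, QFile, QFileSystemWatcher,
                          QIODevice)

path = sys.argv[1]  # the file ffmpeg writes its progress to
app = QCoreApplication(sys.argv)

log = QFile(path)
log.open(QIODevice.ReadOnly | QIODevice.Text)
pos = log.size()  # start at the current end; only new lines matter
log.seek(pos)

def on_changed(_):
    global pos
    log.seek(pos)  # resume where the last read stopped
    data = bytes(log.readAll())  # whatever ffmpeg appended since then
    pos = log.pos()
    for line in data.decode(errors='replace').splitlines():
        print(line)  # replace with your actual progress handling

watcher = QFileSystemWatcher([path])
watcher.fileChanged.connect(on_changed)
sys.exit(app.exec_())

The bookkeeping is exactly the seek-and-compare idea from the question; the watcher merely replaces the polling.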
I would like to have an API system in which a POST message containing a CSV file is sent to a server/webserver/domain name. The file is used as input to an R function, which then produces a value that is sent back to the sender of the POST message.
One of the issues I have is that most of the solutions I have seen, such as rApache (http://rapache.net/), invoke R to run a script and take back the output. The problem is that my R script also loads some very large data files from disk, which are used as further inputs to create the final output.
When running R from the console with the large data files and all the relevant libraries already loaded, the final part (loading the user's input CSV, running the function, and creating the output) is reasonably quick. In other words, it seems highly inefficient to re-invoke R for each POST request, load all the relevant files, and then close it after creating the output. Keeping R constantly running with all the relevant files and libraries loaded, and only loading the given CSV file to run the final calculations, would be much more efficient. Is there a way to do this?
Shiny (http://shiny.rstudio.com/) looks like a close solution, since it always has R running in the background and may be able to take in POST requests, but it also has a lot of unnecessary overhead, which probably makes it too inefficient for my purposes.
Also, will this method be able to handle many POST messages arriving simultaneously?
As always, any help is much appreciated. Thanks in advance.
FastRWeb can accept POST requests and may be what you are looking for.