As the input data of Amazon-EMR, is the only legal format plain text? - hadoop-streaming

As quoted in the "Developer Guide" of Amazon EMR, the files in the input directory should be formatted as plain text. Does it mean that i cannot upload some binary files or .png files and parse them by python script?

Likely not. See for example: https://groups.google.com/a/cloudera.org/forum/?fromgroups=#!topic/cdh-user/AUUZ0DKiJGw
what you can do is to have an input data be the file names themselves (either in S3 or HDFS). The Hadoop streaming script will get file names as input that it can open and process as it sees fit.

Related

Which is the correct encoding for a degree character?

I have a line of code that alters text
temperature<-as.numeric(gsub("°.*","",temp))
R does not like the "°" character. When I save the file it says I need to use a different encoding.
I have tried all sorts of different encodings from the list, but they all save the code in some variation of
temperature<-as.numeric(gsub("??.*","",temp))
My current solution is to open the script in notepad and copy paste the code into rstudio. Which encoding do I need to save a ° in rstudio?
The full solution to this in rstudio was to go to file -> save with encoding -> select ISO-8859-1 -> check the box Set as default encoding for source files. Now the file opens properly with the degree character every time.

How to open .jl file

I want to open a .jl file and convert it to a readable file preferably in .xls format.
I do not have any any idea about Julia language.
Is there a file opener for jl files?
I came here with the same question, but since the file I was looking at was clearly JSON data, I did some more searching.
The .jl file extension also refers to JSON lines, sometimes instead a .jsonl extension.
More here: http://jsonlines.org/
You can search for .json to Excel to find a converter, e.g. https://json-csv.com/ (this worked fine on my JSON lines file).
A .jl file is a julia script.
It is source code.
Not data.
You can open it up in any text editor, e.g. notepad on windows.
However, it won't normally contain anything useful to you unless you want to edit that code.
(It might contain some array literals that you want, I guess)
Perhaps you mean to ask "How can I open a .jld file"?
Which is a julia HDF5 file.
In which case please ask another question.
As I see, Julia is a script language therefor the file can be opened in a text editor like Notepad++, Vim, etc. Do not use word processor (like LibreOffice Writer) if you want to modify it, but it's OK if you want to read only.
To get started:
https://docs.julialang.org/en/stable/

How to convert .dat + .sps to .sav on command line

I get a lot of datasets that arrive as .dat files with syntax files for converting to SPSS (.sps). I'm an R user, so I need to convert the .dat file into a .sav that R can read.
In the past, I've used PSPP to do this manually. (I can't afford SPSS!) But I'd MUCH prefer a programmatic solution.
I thought pspp-convert would do the trick, but there's something I'm not understanding about how that works in terms of inputting the syntax file:
My files are:
data.dat
data.sps (which correctly points to data.dat)
I tried
pspp-convert data.sps data.sav
But get
`data.sps' is not a system or portable file.
Makes sense since the input is supposed to be a portable file. Am I trying to do something beyond the scope of this CLI?
Generally speaking, there MUST be some way to apply an SPS file to a DAT file to get a SAV file (or any other portable file) back, right?
From an SPSS Statistics point of view, a .dat file extension most often means the data is in a fixed ASCII text format. You would need the accompanying codebook to tell you what variables to read and in what formats. The SPSS Statistics command syntax file (.sps) does this for you. But this file is simply the list of SPSS Statistics commands used to read the ASCII data. It is not a data file itself.
Elsewhere you've referenced these files as "portable files". An SPSS Statistics portable file (.por) is a very special case of an ASCII file; structured to be read and written by SPSS Statistics. In any case, if your preferred tool takes an SPSS Statistics portable file (.por), these *.dat files likely aren't it.
Assuming these *.dat files are fixed ASCII text files, you'll need to discern how the information therein is stored and then use a likely tool for reading ASCII text.

Merging EBCDIC converted files and pdf files into a single file and pushing to mainframes

I have two pdf files and two text files which are converted into ebcdif format. The two text files acts like cover files for the pdf files containing details like pdf name, number of pages, etc. in a fixed format.
Cover1.det, Firstpdf.pdf, Cover2.det, Secondpdf.pdf
Format of the cover file could be:
Firstpdf.pdf|22|03/31/2012
that is
pdfname|page num|date generated
which is then converted into ebcdic format.
I want to merge all these files in a single file in the order first text file, first pdf file, second text file, second pdf file.
The idea is to then push this single merged file into mainframes using scp.
1) How to merge above mentioned four files into a single file?
2) Do I need to convert pdf files also in ebcdic format ? If yes, how ?
3) As far as I know, mainframe files also need record length details during transit. How to find out record length of the file if at all I succeed in merging them in a single file ?
I remember reading somewhere that it could be done using put and append in ftp. However since I have to use scp, I am not sure how to achieve this merging.
Thanks for reading.
1) Why not use something like pkzip?
2) I don't think converting the pdf files to ebcdic is necessary or even possible. The files need to be transfered in Binary mode
3) Using pkzip and scp you will not need the record length
File merging could easily be achieved by using Cat command in unix with > and >> append operators.
Also, if the next file should start from a new line (as was my case) a blank echo could be inserted between files.

Flex: Write data to file to open in Excel

I have a Flex application with a couple of DataGrids with data. I'd like to save the data to a file so that the user can keep working with them in Excel, OpenOffice or Numbers.
I'm currently writing a csv file straight off, which opens well in OpenOffice or Numbers, but not in Excel. The problem is with the Swedish characters ÅÄÖ, which turn up as other characters when opening in Excel. Converting (in Notepad++) the csv-file to ANSI encoding makes the ÅÄÖ show up correctly in Excel.
Is there any way to write ANSI-encoded files straight from Flex?
Any other options for writing a file that can be opened in Excel and OpenOffice?
(I've looked at the as3xls library, but according to the comments those files cannot be opened in OpenOffice)
Using the writeMultiByte function from the ByteArray class allows you to specify a character set. See :
http://www.adobe.com/livedocs/flash/9.0/ActionScriptLangRefV3/flash/utils/ByteArray.html#writeMultiByte%28%29
There is also the option of the as3xls package at http://code.google.com/p/as3xls/. I like this as it comes out as a straight excel file that can also be easily opened in open office as well.

Resources