Sequencefiles which map a single key to multiple values - bigdata

I am trying to do some preprocessing on data that will be fed to LucidWorks Big Data for indexing. LWBD accepts SolrXML in the form of Sequencefile files. I want to create a Pig script which will take all the SolrXML files in a directory and output them in the format
filename_1 => <here goes some XML>
...
filename_N => <here goes some more XML>
Pig's native PigStorage() load function can automatically create a column that includes the name of the file from which the data was extracted, which ideally would look like this:
{"filename_1", "<here goes some XML>"}
...
{"filename_N", "<here goes some more XML>"}
However, PigStorage() also automatically uses '\n' as a line delimiter, so what I actually end up with is a bag that looks like this:
{"filename_1", "<some partial XML from file 1>"}
{"filename_1", "<some more partial XML from file 1>"}
{"filename_1", "<the end of file 1>"}
...
I'm sure you get the picture. My question is, if I were to write this bag to a SequenceFile, how would it be read by other applications? Could it be combined as
"filename_1" => "<some partial XML from file 1>
<some more partial XML from file 1>
<the end of file 1>"
, by the default handling of the application I feed it to? Or is there some post-processing that I can do to get it into this format? Thank you for your help.

Since I can't find anything about a builtin SequenceFile writer, I'm assuming you are using a UDF (and if you aren't, then you need to).
You'll have to group the files (by filename) ahead of time, and then send that to the writer UDF.
DESCRIBE xml ;
-- xml: {filename: chararray, xml_data: chararray}
B = FOREACH (GROUP xml BY filename)
GENERATE group AS filename, xml.xml_data AS all_xml_data ;
Depending on how you have written the SequenceFile writer, it may be easier to convert the all_xml_data bag ahead of time to a chararray using a Python UDF like:
#outputSchema('xml_complete: chararray')
def stringify(bag):
delim = ''
return delim.join(bag)
NOTE: It is important to realize that this way the order of the xml data will become jumbled. If possible based on your data, stringify can maybe be expanded upon the reorgize it.

Related

How to get rid of the original file path instead of name in PGP encryption with ChoPGP library?

I tried PGP encrypting a file in ChoPGP library. At the end of the process it shows embedded file name along with the whole original file name path.
But I thought it will show only the filename without the whole path. Which I intend to work on and figure out a way to do so?
Doing the following:
using (ChoPGPEncryptDecrypt pgp = new ChoPGPEncryptDecrypt())
{
pgp.EncryptFile(#"\\appdevtest\c$\appstest\Transporter\test\Test.txt",
#"\\appdevtest\c$\appstest\Transporter\test\OSCTestFile_ChoPGP.gpg",
#"\\appdevtest\c$\appstest\Transporter\pgpKey\PublicKey\atorres\atorres_publicKey.asc",
true,
true);
}
which will result in:
But I would like to only extract the Test.txt in the end something
like this:
Looking at this line from the ChoPGP sources: https://github.com/Cinchoo/ChoPGP/blob/7152c7385022823013324408e84cb7e25d33c3e7/ChoPGP/ChoPGPEncryptDecrypt.cs#L221
You may find out that it uses internal function GetFileName, which ends up with this for the FileStream: return ((FileStream)stream).Name;.
And this one, according to documentation, Gets the absolute path of the file opened in the FileStream..
So you should either make fork of ChoPGP and modify this line to extract just filename, or submit a pull request to the ChoPGP. Btw, it's just a wrapper around the BouncyCastle so you may use that instead.

Not able to access certain JSON properties in Autoloader

I have a JSON file that is loaded by two different Autoloaders.
One uses schema evolution and besides replacing spaces in the json property names, writes the json directly to a delta table, and I can see all the values are there properly.
In the second one I am mapping to a defined schema and only use a subset of properties. So use a lot of withColumn and then a select to narrows to my defined column list.
Autoloader definition:
df = (spark
.readStream
.format('cloudFiles')
.option('cloudFiles.format', 'json')
.option('multiLine', 'true')
.option('cloudFiles.schemaEvolutionMode','rescue')
.option('cloudFiles.includeExistingFiles','true')
.option('cloudFiles.schemaLocation', bronze_schema)
.option('cloudFiles.inferColumnTypes', 'true')
.option('pathGlobFilter','*.json')
.load(upload_path)
.transform(lambda df: remove_spaces_from_columns(df))
.withColumn(...
Writer:
df.writeStream.format('delta') \
.queryName(al_stream_name) \
.outputMode('append') \
.option('checkpointLocation', checkpoint_path) \
.option('mergeSchema', 'true') \
.trigger(once = True) \
.table(bronze_table)
Issue is that some of the source columns are ok load and I get their values, and others are constantly null in the output table.
For example:
.withColumn('vl_rating', col('risk_severity.value')) # works
.withColumn('status', col('status.name')) # always null
...
.select(
'rating',
'status',
...
json is quite simple, these are all string values, they are always populated. The same code works against another simular json file in another autoloader without issue.
I have run out of ideas to fault find on this. My imports are minimal, outside of Autoloader the JSON loads fine.
e.g
%python
import pyspark.sql.functions as psf
jsontest = spark.read.option('inferSchema','true').json('dbfs:....json')
df = jsontest.withColumn('status', psf.col('status.name')).select('status')
display(df)
Results in the values of the status.name property of the json file
Any ideas would be greatly appreciated.
I have found generally what is causing this. Interesting cause!
I am scanning a whole directory of json files, and the schema evolves over time (as expected). But when I clear out the autoloader schema and checkpoint directories and only scan the latest json file it all works correctly.
So what I surmise is that something in schema evolution with the older json files causes Autoloader to get into a state where it will not put certain properties into the stream to the writer.
If anyone has any recommendation on how to implement some data quality analysis in an Autoloader I would be most appreciative if you would share.

Save an Excel sheet as PDF programatically through powerbuilder

There is a requirement to save an excel sheet as a pdf file programmatically through powerbuilder (Powerbuilder 12.5.1).
I run the code below; however, I am not getting the right results. Please let me know if I should do something different.
OLEObject ole_excel;
ole_excel = create OLEObject;
IF ( ole_excel.ConnectToObject(ls_DocPath) = 0 ) THEN
ole_excel.application.activeworkbook.SaveAs(ls_DocPath,17);
ole_excel.application.activeworkbook.ExportAsFixedFormat(0,ls_DocPath);
END IF;
....... (Parsing values from excel)
DESTROY ole_excel;
I have searched through this community and others for a solution but no luck so far. I tried using two different commands that I found during this search. Both of them return a null object reference error. It would be great if someone can point me in the right direction.
It looks to me like you need to have a reference to the 'activeworkbook'. This would be of type OLEobject so the declaration would be similar to: OLEobject lole_workbook.
Then you need to set this to the active work book. Look for the VBA code on Excel (should be in the Excel help) for something like a 'getactiveworkbook' method. You would then (in PB) need to do something like
lole_workbook = ole_excel.application.activeworkbook
This gets the reference for PB to the activeworkbook. Then do you saveas and etc. like this lole_workbook.SaveAs(ls_DocPath,17)
workBook.saveAs() documentation says that saveAs() has the following parameters:
SaveAs(Filename, FileFormat, Password, WriteResPassword, ReadOnlyRecommended, CreateBackup, AccessMode, ConflictResolution, AddToMru, TextCodepage, TextVisualLayout, Local)
we need the two first params:
FileName - full path with filename and extension, for instance: c:\myfolder\file.pdf
FileFormat - predefined constant, that represents the target file format.
According to google (MS does not list pdf format constant for XLFileFormat), FileFormat for pdf is equal to 57
so, try to use the following call:
ole_excel.application.activeworkbook.SaveAs(ls_DocPath, 57);

reading from dynamically set filenames in R

I am writing a CGI script in Perl with a section of embedded R script which produces a graph. The original data filename is unknown as it has been uploaded by the CGI script and is stored in a Perl variable called $filename.
My question is that I now would like to open that file in R using read.table(). I am using Statistics::R and so I have tried:
my $R = Statistics::R->new();
$R->set('filename',$filename);
my $out1 = $R->run(
q`rm(list=ls())`,
# Fetch data
q`setwd("/var/www/uploads")`,
q`peakdata<-read.table(filename, sep="",col.names=c("mz","intensity","ionsscore","matched","query","index","hit"))`,
q`attach(peakdata)` ...etc
I can get this to work ONLY if I change $filename into something static and known like 'data.txt' before trying to open the file in read.table - is there a way for me to open a file with a variable for a name?
Thank you in advance.
One possible way to do this is by doing a little more work in Perl.
This is untested code, to give you some ideas:
my $filename = 'fileNameIGotFromSomewhere.txt'
my $source_dir = '/var/www/uploads';
my $file = "$source_dir/$fielname";
# make sure we can read it
unless ( -r $file ) {
die 'can read that data file: $!";
}
Then instead of $R->set, you could interpolate the file name into the R program. Where you've used the single-quote operator, use the double-quote operator instead:
So instead of:
q`peakdata<-read.table(filename, sep="",col.names= .... )`
Use:
qq`peakdata<-read.table($filename, sep="",col.names= .... )`
Now this looks like it would be inviting problems similar to SQL/Code Injections, so that's why I put int the logic to insure that the file exists and is readable. You might be able to think other checks to add to safeguard your use of user-supplied info.

How to get programatically the file path from a directory parameter in abap?

The transaction AL11 returns a mapping of "directory parameters" to file paths on the application server AFAIK.
The trouble with transaction AL11 is that its program only calls c modules, there's almost no trace of select statements or function calls to analize there.
I want the ability to do this dynamically, in my code, like for instance a function module that took "DATA_DIR" as input and "E:\usr\sap\IDS\DVEBMGS00\data" as output.
This thread is about a similar topic, but it doesn't help.
Some other guy has the same problem, and he explains it quite well here.
I strongly suspect that the only way to get these values is through the kernel directly. some of them can vary depending on the application server, so you probably won't be able to find them in the database. You could try this:
TYPE-POOLS abap.
TYPES: BEGIN OF t_directory,
log_name TYPE dirprofilenames,
phys_path TYPE dirname_al11,
END OF t_directory.
DATA: lt_int_list TYPE TABLE OF abaplist,
lt_string_list TYPE list_string_table,
lt_directories TYPE TABLE OF t_directory,
ls_directory TYPE t_directory.
FIELD-SYMBOLS: <l_line> TYPE string.
START-OF-SELECTION-OR-FORM-OR-METHOD-OR-WHATEVER.
* get the output of the program as string table
SUBMIT rswatch0 EXPORTING LIST TO MEMORY AND RETURN.
CALL FUNCTION 'LIST_FROM_MEMORY'
TABLES
listobject = lt_int_list.
CALL FUNCTION 'LIST_TO_ASCI'
EXPORTING
with_line_break = abap_true
IMPORTING
list_string_ascii = lt_string_list
TABLES
listobject = lt_int_list.
* remove the separators and the two header lines
DELETE lt_string_list WHERE table_line CO '-'.
DELETE lt_string_list INDEX 1.
DELETE lt_string_list INDEX 1.
* parse the individual lines
LOOP AT lt_string_list ASSIGNING <l_line>.
* If you're on a newer system, you can do this in a more elegant way using regular expressions
CONDENSE <l_line>.
SHIFT <l_line> LEFT DELETING LEADING '|'.
SHIFT <l_line> RIGHT DELETING TRAILING '|'.
SPLIT <l_line>+1 AT '|' INTO ls_directory-log_name ls_directory-phys_path.
APPEND ls_directory TO lt_directories.
ENDLOOP.
Try the following
data : dirname type DIRNAME_AL11.
CALL 'C_SAPGPARAM' ID 'NAME' FIELD 'DIR_DATA'
ID 'VALUE' FIELD dirname.
Alternatively if you wanted to use your own parameters(AL11->configure) then read these out of table user_dir.

Resources