Pig MultiStorage output to be appended to existing directories when I run it every day

I have a set of data on which I ran MultiStorage, split on the column 'type', and I now have these paths in HDFS: "/output/type1/", "/output/type2/", "/output/type3/" etc.
Every day I run a script that uses MultiStorage on the column 'type' to produce "/tmp/type1/", "/tmp/type2/", "/tmp/type3/" etc.
(The types produced daily can be fewer than or the same as the types already present in the master output.)
Since Pig does not allow me to give an output path that already exists, my daily script writes to /tmp/.
Is there a way to merge /tmp/ into /output/, under the right 'type' subdirectories?
For example, /tmp/type1/file should end up under /output/type1/ as /output/type1/file, and so on. That way I can delete /tmp and run the script again.
Any help is appreciated.
Thanks in advance.

Pig cannot manipulate directories beyond invoking fs commands, and mapping the temporary directories onto the final ones requires more than Pig can do. You can use the FileSystem API in a tiny Java program and run it separately or inside an Oozie workflow.
In addition, you need to make sure the appended files have different names from the existing ones. This is not the default behaviour, but you can achieve it with:
%declare timestamp `date +"%s"`;
SET mapreduce.output.basename '$timestamp';
/* here we use the timestamp to make the generated file names unique */
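Pig itself can only shell out to fs commands, so the merge step has to happen outside the Pig script. As a rough sketch (not the exact solution described above), the HDFS shell can move each day's per-type output into place; this assumes the hdfs CLI is available, that filenames are already unique thanks to the basename trick, and that type1/type2/type3 are only example values:
# Move each day's per-type output from /tmp into the master /output tree.
for t in type1 type2 type3; do
  hdfs dfs -mkdir -p "/output/$t"          # create the target subdirectory if missing
  hdfs dfs -mv "/tmp/$t/*" "/output/$t/"   # append the new files (names are unique)
  hdfs dfs -rm -r "/tmp/$t"                # remove the now-empty temp directory
done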

Related

scp_download to download multiple files based on a pattern?

I need to download many files from a server (specifically a Tectia server), ideally using the ssh package. The files all follow a predictable pattern across multiple subfolders. The file path is formatted like this
/directory/subfolder/A001/abcde001.csv
Where A001 counts up alongside the last 3 digits of the filename (/A002/abcde002.csv and so on)
In the vignette for scp_download it states that the files parameter may contain wildcards so I have tried to do something like
scp_download(session, "/directory/subfolder/A.*/abcde.*[.]csv", to=tempdir())
and
scp_download(session, "directory/subfolder/A\\d{3}/abcde\\d{3}[.]csv", to=tempdir())
but no matter which combination of patterns or wildcards I can think of (which isn't many) I only get something like
Warning: SSH warning: scp: /directory/subfolder/A\d{3}/abcde\d{3}[.]csv: No such file or directory
What I'm hoping to do is either find a way to do pattern matching here, or find a way to store the Tectia directory listing as a string that scp_download can read. I've made sure that my session connects properly, and the download works when I don't attempt any pattern matching.
I had the same problem. When you use * in your pattern it gets escaped when it is sent to the server; however, when you request a specific file name such as /directory/subfolder/A001/abcde001.csv, it works fine.
In the end I changed my code to follow these steps:
Get the list of files/folders by running ls through the ssh_exec_wait function and store the result in a variable.
Download the files in that variable one by one.
session <- ssh_connect("username@ip", passwd = "password")
files <- capture.output(ssh_exec_wait(session, command = 'ls /directory/subfolder/A001/*'))
dnc1 <- scp_download(session, files[1], to = paste0(getwd(), "/data/"))
dnc2 <- scp_download(session, files[2], to = paste0(getwd(), "/data/"))
dnc3 <- scp_download(session, files[3], to = paste0(getwd(), "/data/"))
The last three commands can be wrapped in a loop, since there could be hundreds or thousands of files.

Remove date from filename UNIX

I am working in UNIX and trying to write the following commands. I receive a source file daily whose name is in the format:
ONSITE_EXTR_ONSITE_EXTR_20170707
Since I receive a file every day, the name changes with the current date: ONSITE_EXTR_ONSITE_EXTR_20170708, ONSITE_EXTR_ONSITE_EXTR_20170709, and so on. I need to strip the date out of the filename and rename the file to ONSITE_EXTR_ONSITE_EXTR. After I have finished whatever data reading and processing I need to do, I need to change the name back, e.g. to ONSITE_EXTR_ONSITE_EXTR_20170707. Because the file is delivered daily, I can't hard-code the date in whatever commands I write. Any help would be greatly appreciated.
Depending on your toolchain, this may be as simple as running:
$ mv ONSITE_EXTR_ONSITE_EXTR_$(date +%Y%m%d) ONSITE_EXTR_ONSITE_EXTR
... before running the rest of your script, assuming you're using a Bash-like shell.
Having said that, you can simply use ONSITE_EXTR_ONSITE_EXTR_$(date +%Y%m%d) in your script wherever you access the file instead.
This all assumes the script runs on the same day, and in the same time zone, as the file is downloaded.
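Putting the two pieces together, a minimal wrapper could look like this; it is only a sketch, assuming a Bash-like shell and that processing happens the same day (and time zone) the file arrives:
#!/bin/bash
today=$(date +%Y%m%d)
# Strip the date so the processing steps can refer to a fixed name.
mv "ONSITE_EXTR_ONSITE_EXTR_${today}" ONSITE_EXTR_ONSITE_EXTR

# ... read and process ONSITE_EXTR_ONSITE_EXTR here ...

# Put the date back once the work is done.
mv ONSITE_EXTR_ONSITE_EXTR "ONSITE_EXTR_ONSITE_EXTR_${today}"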
If you were using bash and you had the file name in a variable, you could do:
IN="ONSITE_EXTR_ONSITE_EXTR_20170707"
echo ${IN:0:23}
to give ONSITE_EXTR_ONSITE_EXTR
Googling gives all sorts of guides here...
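If the fixed length ever changes, a suffix-stripping variant of the same idea avoids counting characters (again assuming bash):
IN="ONSITE_EXTR_ONSITE_EXTR_20170707"
echo "${IN%_*}"   # drops the trailing _20170707, leaving ONSITE_EXTR_ONSITE_EXTR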

How to force robot framework to pick robot files in sequential order?

I have robot files in a folder (tests) as shown below:
tests
1_robotfile1.robot
2_robotfile2.robot
3_robotfile3.robot
4_robotfile4.robot
5_robotfile5.robot
6_robotfile6.robot
7_robotfile7.robot
8_robotfile8.robot
9_robotfile9.robot
10_robotfile10.robot
11_robotfile11.robot
Now if I execute pybot root/user1/tests (from /root/users1/power), the robot files run in the following order:
tests
1_robotfile1.robot
10_robotfile10.robot
11_robotfile11.robot
2_robotfile2.robot
3_robotfile3.robot
4_robotfile4.robot
5_robotfile5.robot
6_robotfile6.robot
7_robotfile7.robot
8_robotfile8.robot
9_robotfile9.robot
I want to force Robot Framework to pick up the robot files in numeric order, i.e. 1, 2, 3, 4, 5, ...
Is there an option for this?
If you have the option of renaming your files, you just need to make sure that the prefix is sortable. For numbers, that means they should all have the same number of digits.
I recommend renaming your test cases to have three or four digits for the prefix:
001_robotfile1.robot
002_robotfile2.robot
003_robotfile3.robot
004_robotfile4.robot
005_robotfile5.robot
006_robotfile6.robot
007_robotfile7.robot
008_robotfile8.robot
009_robotfile9.robot
010_robotfile10.robot
011_robotfile11.robot
...
With that, they will sort in the order that you expect.
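If renaming the files by hand is tedious, a throwaway shell loop can do the zero-padding; this is only a sketch, assuming bash and the N_name.robot layout shown above:
cd tests
for f in [0-9]*_*.robot; do
  prefix=${f%%_*}   # numeric prefix, e.g. 1 or 10
  rest=${f#*_}      # remainder of the name
  mv -n "$f" "$(printf '%03d_' "$((10#$prefix))")$rest"   # 1_foo.robot -> 001_foo.robot
done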
Following Emna's answer, the RF docs (http://robotframework.org/robotframework/latest/RobotFrameworkUserGuide.html#execution-order) offer a few solutions.
So what you could do:
rename all the files so the numbering sorts correctly for a computer (001-test.robot instead of 1-test.robot); this may break internal references to other files (resources), makes it hard to add a test in between, and is error-prone when the execution order needs to change
tag the tests, as Emna suggests
following the idea from the RF docs, write a script that creates an argument file listing the suites in the proper order and pass it to the robot execution (see the sketch after this list); for 1000+ files it should not take more than a few seconds
design the tests not to depend on execution order, and use suite setup instead
good luck ;)
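For the argument-file route, a small sketch of the idea (assuming the same N_ prefixes; run_order.args is just a made-up name, and --argumentfile is a standard Robot Framework option):
# List the suites, sort numerically on the N_ prefix, and run them in that order.
ls tests/*.robot | sort -t/ -k2 -n > run_order.args
pybot --argumentfile run_order.args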
Tag the tests as foo and bar so you can run each test separately:
pybot -i foo tests
or
pybot -i bar tests
and run them in the order you want:
pybot -i bar tests && pybot -i foo tests
(with && the foo run only starts if the bar run succeeds; use ; instead to run both unconditionally)

Process many EDI files through single MFX

I've created a mapping in MapForce 2013 and exported the MFX file. Now, I need to be able to run the mapping using MapForce Server. The problem is, I need to specify both the input EDI file and the output file. As far as I can tell, the usage pattern is to run the mapping with MapForce server using the input/output configuration in the MFX itself, not passed in on the command line.
I suppose I could change the input/output to some standard file name and then just write the input file to that path before performing the mapping, and then grab the output from the standard output file path when the mapping is complete.
But I'd prefer to be able to do something like:
MapForceServer run -in=MyInputFile.txt -out=MyOutputFile.xml MyMapping.mfx > MyLogFile.txt
Is something like this possible? Perhaps using parameters within the mapping?
There are two options that I've come across in dealing with a similar situation.
Option 1 - If you set the input file to *.xml in the component settings, mapforceserver.exe will automatically process all matching files in the directory, assuming your source is XML (this should work for text just the same). Similar to the example below, you can set up a cleanup routine to move the files into another folder after processing.
Note: it looks in the folder where the schema file is located.
Option 2 - Since your output is XML you can use Altova's RaptorXML (rack up another license charge). You can generate the mapping as XSLT 2.0 code and use a batch file to execute it automatically, something like this:
::#echo off
:: Run the exported XSLT against every XML file in the folder, then move each
:: input file to "processed" or "error" depending on the RaptorXML exit code.
for %%f IN (*.xml) DO (
  RaptorXML xslt --xslt-version=2 --input="%%f" --output="out/%%f" %* "mymapping.xslt"
  if NOT errorlevel 1 move "%%f" processed
  if errorlevel 1 move "%%f" error
)
:: Wait 15 seconds, then re-invoke this script so the folder keeps being polled.
sleep 15
mymapping.bat
I tossed in a sleep command to loop the batch for rechecking every 15 seconds. Unfortunately this does not work if your output target is a database.

How to specify wildcards in a filename for an Amazon EMR job

If I run an EMR job and specify wildcards in the directory path it all works fine,
e.g.: s3n://mybucket/*/*/*/fileName.gz picks up all files named fileName.gz under subdirectories of mybucket.
However, when I specify a wildcard in the fileName, the EMR logs show an error that no match was found. It seems to treat the '*' character as a literal part of the fileName instead of as a wildcard,
e.g.: s3n://mybucket/Dir1/fileName.*.gz
gives back an error that no matches were found for fileName.*.gz in that directory.
How do we specify wildcards in the filename for an Amazon EMR job?
Just went through this myself. It is very useful to pass NON-globbed wildcard expressions from the start script to spark/pyspark, because the distribution mechanism inside the Spark program can be efficient when presented with something like this; note the globbing at both the directory level and the filename level:
df = spark.read.json('s3://my-bucket/archive/*/2014/7/G.*.json.bz2')
Not to mention, of course, that almost all of the time you want globbing to occur on the remote resource, not in your local launch environment.
The trick is to ensure that the initial shell variable does not get globbed when it is created and that it stays protected when presented to aws emr add-steps. Here is a simple launch script that assumes a cluster has already been created. To show it can be done, we also escape newlines to make it easier to see the args. Be careful, however, NOT to re-introduce extra whitespace when doing this!
# Use single quotes to stop globbing at the var level:
DATA_URI='s3://my-bucket/archive/*/2014/7/G.*.json.bz2'
# DO NOT add trailing slash to the output_uri. S3 will
# automatically create subdirs under that. e.g.
# --output_uri s3://$SRC_BUCKET/V4_t
# will be created and populated with many part-0000-... files.
# If you are not renaming or deleting the output_uri for each run,
# make sure your spark program uses overwrite mode for dataframe output e.g.
# dfx.write.mode("overwrite").json(output_uri)
# Careful to protect the DATA_URI arg by wrapping it single quotes:
aws emr add-steps \
--cluster-id j-3CDMYEF3NJGHR \
--steps Type=Spark,\
Name="myAnalytics",\
ActionOnFailure=CONTINUE,\
Args=[\
s3://$SRC_BUCKET/blunders.py,\
--game_data,\'$DATA_URI\',\
--output_uri,s3://$SRC_BUCKET/V4_t]
