scp_download to download multiple files based on a pattern? - r

I need to download many files from a server (specifically Tectia), ideally using the ssh package. These files all follow a predictable pattern across multiple subfolders. The file path is formatted like this:
/directory/subfolder/A001/abcde001.csv
Where A001 counts up alongside the last 3 digits of the filename (/A002/abcde002.csv and so on)
In the vignette for scp_download it states that the files parameter may contain wildcards so I have tried to do something like
scp_download(session, "/directory/subfolder/A.*/abcde.*[.]csv", to=tempdir())
and
scp_download(session, "/directory/subfolder/A\\d{3}/abcde\\d{3}[.]csv", to=tempdir())
but no matter which combination of patterns or wildcards I can think of (which isn't many) I only get something like
Warning: SSH warning: scp: /directory/subfolder/A\d{3}/abcde\d{3}[.]csv: No such file or directory
What I'm hoping to do is either find a way to do pattern matching here, or find a way to store the Tectia directories as strings to be read by scp_download. I've made sure that my session is connected properly: downloads work when I give an exact path without any pattern matching.

I had the same problem. When you use * in your pattern, it gets escaped when it is sent to the server. However, when you request a specific file name such as /directory/subfolder/A001/abcde001.csv, it works fine.
In the end I changed my code to follow these steps:
Get the list of files/folders using the ls command with the ssh_exec_wait function, and store it in a variable.
Download each file in that variable separately.
session <- ssh_connect("username@ip", passwd = "password")
files<-capture.output(ssh_exec_wait(session, command = 'ls /directory/subfolder/A001/*'))
dnc1<- scp_download(session, files[1], to = paste0(getwd(),"/data/"))
dnc2<- scp_download(session, files[2], to = paste0(getwd(),"/data/"))
dnc3<- scp_download(session, files[3], to = paste0(getwd(),"/data/"))
The last three commands can be put in a loop, since there could be hundreds or thousands of files.
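Since the paths in the question follow a predictable pattern, they can also be built as strings rather than listed with ls. The following is a minimal sketch of that string construction only, written in Python for illustration (in R, sprintf with the same %03d format yields the same strings). The function name remote_paths and the count of 3 are made up for the example.

```python
def remote_paths(count, base="/directory/subfolder"):
    """Build the predictable remote paths A001/abcde001.csv, A002/abcde002.csv, ..."""
    # {i:03d} zero-pads the counter to three digits in both the folder and the file name.
    return [f"{base}/A{i:03d}/abcde{i:03d}.csv" for i in range(1, count + 1)]


# Hypothetical count: the real number of subfolders is not stated in the question.
for path in remote_paths(3):
    print(path)
```

Each resulting string is an exact path, which is the form that is reported to work with scp_download.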

Related

DataStage Sequence job: how to process one file at a time when the files are in 7 different folders

DataStage: there are 7 folders in a path, and each folder contains 2 files. For example, the 2 files are named in the following format: test_s1_YYYYMMDD.txt and test_s1_YYYYMMDD.done. The paths for these files are:
user/test/test_s1/
user/test/test_s2/
...
...
..
user/test/test_s7/
Here s1, s2, ... s7 represent the different folders.
Each of these folders contains the 2 files mentioned above, so how can I process each file in a sequence job?
First you need a job to process a file and the filename needs to be a parameter of that job.
For the Sequence level you need two levels: an inner one for the two files within each folder and an outer one for the different directories.
For the inner one you can choose to build a loop with two iterations or simply add the processing job twice to the sequence (which will reduce complexity if there will always be two files).
The outer Sequence is a loop where you could parameterize the path in a way that the loop counter could be used to generate your 1-7 flexible path addon.
Check out more details on loops here
You can use the loop counter (stage_label.$Counter) to parameterize your job.
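The two-level loop described above can be sketched as follows. This is not DataStage code, only an illustration in Python of how a loop counter can generate the 1-7 path suffix. It assumes that every folder test_sN holds files named test_sN_YYYYMMDD.txt and test_sN_YYYYMMDD.done (the question only shows the s1 names), and the function name and date value are made up for the example.

```python
def files_to_process(date_stamp, folder_count=7, base="user/test"):
    """List the file paths in the order a nested sequence loop would visit them."""
    paths = []
    # Outer loop: the counter plays the role of stage_label.$Counter (1..7).
    for counter in range(1, folder_count + 1):
        folder = f"test_s{counter}"
        # Inner loop: the two files inside each folder.
        for extension in ("txt", "done"):
            paths.append(f"{base}/{folder}/{folder}_{date_stamp}.{extension}")
    return paths


# "20240101" is a hypothetical YYYYMMDD value.
for path in files_to_process("20240101"):
    print(path)
```

Each generated path corresponds to one invocation of the processing job with the file name passed as its parameter.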
How you process the files is an important decision, and it depends on what you want to do with them. Starting one or more jobs in a sequence for each file can add heavy overhead just from starting the jobs. Instead, try loading all files at once in a parallel job using the Sequential File stage.
In the Sequential File Stage, set the appropriate Format. You can also set everything to none to just put each row in one column and process that in a later job. This will make the reading very flexible and forgiving. If your files are all the same structure, define your columns as needed.
To select the files, use File Patterns. In the Options of the Sequential File Stage, choose to have a File Name Column so you can process the filenames in a later job. You might also want to add a Row Number Column.
This method works pretty fast.

Phalcon 4.x creates multiple folders to store cached data. What is the benefit of having recursive and multiple folders?

$hash = '123456789';
$fileName = "test.{$hash}.html";
The final directory structure will look like the following:
te/st/.1/23/45/67/test.123456789.html
It simply breaks the file name without its extension, test.123456789, into directories of two characters each, leaving off the last 2 characters.
To prevent a "too many files in one folder" error.
See https://docs.phalcon.io/4.0/en/cache#stream as well.
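The splitting rule described above can be sketched in a few lines. This is an illustration in Python of the described behaviour, not Phalcon's own code. The function name dir_from_file is made up, and the fallback for very short names is an assumption that the description does not cover.

```python
import os


def dir_from_file(file_name):
    """Turn a file name into nested two-character directories, as described above."""
    # Drop the extension: "test.123456789.html" -> "test.123456789".
    name, _extension = os.path.splitext(file_name)
    # Leave off the last 2 characters. Assumption: for names of 2 characters
    # or fewer, fall back to the first character so the result is not empty.
    start = name[:-2] or name[:1]
    # Split what remains into chunks of two characters.
    chunks = [start[i:i + 2] for i in range(0, len(start), 2)]
    return "/".join(chunks) + "/"


file_name = "test.123456789.html"
print(dir_from_file(file_name) + file_name)
# prints te/st/.1/23/45/67/test.123456789.html
```

Because different hashes produce different leading characters, the files spread across many small directories instead of accumulating in one.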

remove log information from report and save report in desired location

I am new to Robot Framework and wanted to see if I can get some simple code for a custom report; an answer to my specific problem would also be fine. I went through all the questions related to reports but could not find a specific answer. Currently my report contains the log, and I would like to remove the log information from the report and save the report in a specific location. I just want the PASS/FAIL information in my report. Can anyone give me an example of how to do this? I also need to know how to save my report in a different location. Any example would be helpful. Thank you in advance.
There is a tool called Rebot which is part of Robot Framework.
By default, Robot Framework creates XML reports. The XML reports are automatically converted into HTML reports by Rebot.
You can set the location of the output files in the execution by specifying the parameter --outputdir (and thus set a different base directory for outputs).
From the documentation:
All output files can be set using an absolute path, in which case they are created to the specified place, but in other cases, the path is considered relative to the output directory. The default output directory is the directory where the execution is started from, but it can be altered with the --outputdir (-d) option. The path set with this option is, again, relative to the execution directory, but can naturally be given also as an absolute path. Regardless of how a path to an individual output file is obtained, its parent directory is created automatically, if it does not exist already.
You can call Rebot yourself to control this conversion.
You can also run Rebot after the test was run in order to create new output on a different location.
See documentation in:
http://robotframework.org/robotframework/latest/RobotFrameworkUserGuide.html#post-processing-outputs
The following example shows how to store the HTML reports in a different location and include only partial data:
rebot --include smoke --name Smoke_Tests c:\results\output.xml --outputdir c:\version1.0\reports
In the example above, we process the file c:\results\output.xml, create a new report called Smoke_Tests that includes only tests with the tag smoke and save it to the output folder c:\version1.0\reports
In addition you can also set the location of the log file (HTML) from the execution.
The command line option --log (-l) determines where log files are created.
The command line option --report (-r) determines where report files are created.
Removing log lines can be done a bit differently. If you run rebot --help you'll get the following options:
    --removekeywords all|passed|for|wuks|name:<pattern> *  Remove keyword data
                          from all generated outputs. Keywords containing
                          warnings are not removed except in `all` mode.
                          all:     remove data from all keywords
                          passed:  remove data only from keywords in passed
                                   test cases and suites
                          for:     remove passed iterations from for loops
                          wuks:    remove all but the last failing keyword
                                   inside `BuiltIn.Wait Until Keyword Succeeds`
                          name:<pattern>:  remove data from keywords that match
                                   the given pattern. The pattern is matched
                                   against the full name of the keyword (e.g.
                                   'MyLib.Keyword', 'resource.Second Keyword'),
                                   is case, space, and underscore insensitive,
                                   and may contain `*` and `?` as wildcards.
                                   Examples: --removekeywords name:Lib.HugeKw
                                             --removekeywords name:myresource.*
    --flattenkeywords for|foritem|name:<pattern> *  Flattens matching keywords
                          in all generated outputs. Matching keywords get all
                          log messages from their child keywords and children
                          are discarded otherwise.
                          for:     flatten for loops fully
                          foritem: flatten individual for loop iterations
                          name:<pattern>:  flatten matched keywords using same
                                   matching rules as with
                                   `--removekeywords name:<pattern>`

How to create a new output file in R if a file with that name already exists?

I am trying to run an R-script file using windows task scheduler that runs it every two hours. What I am trying to do is gather some tweets through Twitter API and run a sentiment analysis that produces two graphs and saves it in a directory. The problem is, when the script is run again it replaces the already existing files with that name in the directory.
As an example, when I used the pdf("file") function, it ran fine the first time because no file with that name already existed in the directory. The problem is that I want the R script to run every other hour, so I need a solution that creates a new file in the directory instead of replacing the existing one, just like what happens when a file is downloaded multiple times from Google Chrome.
I'd just time-stamp the file name.
> library(lubridate)  # now() comes from the lubridate package
> filename = paste("output-",now(),sep="")
> filename
[1] "output-2014-08-21 16:02:45"
Use any of the standard date formatting functions to customise to taste - maybe you don't want spaces and colons in your file names:
> filename = paste("output-",format(Sys.time(), "%a-%b-%d-%H-%M-%S-%Y"),sep="")
> filename
[1] "output-Thu-Aug-21-16-03-30-2014"
If you want the behaviour of adding a number to the file name, then something like this:
serialNext = function(prefix){
    if(!file.exists(prefix)){return(prefix)}
    i=1
    repeat {
        f = paste(prefix,i,sep=".")
        if(!file.exists(f)){return(f)}
        i=i+1
    }
}
Usage. First, "foo" doesn't exist, so it returns "foo":
> serialNext("foo")
[1] "foo"
Write a file called "foo":
> cat("fnord",file="foo")
Now it returns "foo.1":
> serialNext("foo")
[1] "foo.1"
Create that, then it returns "foo.2" and so on...
> cat("fnord",file="foo.1")
> serialNext("foo")
[1] "foo.2"
This kind of thing can break if more than one process might be writing a new file though - if both processes check at the same time there's a window of opportunity where both processes don't see "foo.2" and think they can both create it. The same thing will happen with timestamps if you have two processes trying to write new files at the same time.
Both these issues can be resolved by generating a random UUID and pasting that on the filename, otherwise you need something that's atomic at the operating system level.
But for a job that runs every two hours I reckon a timestamp down to minutes is probably enough.
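The two race-free options mentioned above can be sketched as follows. This is an illustration in Python rather than R, and the function names are made up. It relies on exclusive-create mode, which fails if the file already exists, so checking and creating happen in a single step instead of two.

```python
import os
import tempfile
import uuid


def create_next(prefix):
    """Create `prefix`, or `prefix.1`, `prefix.2`, ... and return the name that was created."""
    candidate = prefix
    i = 0
    while True:
        try:
            # Mode "x" raises FileExistsError if the file already exists,
            # so two processes can never both claim the same name.
            with open(candidate, "x"):
                return candidate
        except FileExistsError:
            i += 1
            candidate = f"{prefix}.{i}"


def unique_name(prefix):
    """Alternative: append a random UUID so name collisions are practically impossible."""
    return f"{prefix}-{uuid.uuid4()}"


# Demonstration in a throwaway directory.
workdir = tempfile.mkdtemp()
print(os.path.basename(create_next(os.path.join(workdir, "foo"))))
print(os.path.basename(create_next(os.path.join(workdir, "foo"))))
```

Unlike serialNext, create_next leaves an empty file behind under the returned name, which is what reserves it for the caller.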
See ?files for file manipulation functions. You can check if file exists with file.exists, and then either rename the existing file, or create a different name for the new one.

how to specify wildcards in a filename for amazon EMR job

If I run an EMR job and specify wildcards in the directory path, it all works fine,
e.g. s3n://mybucket/*/*/*/fileName.gz picks up all files named fileName.gz under subdirectories of mybucket.
However, when I specify wildcards in the file name, the EMR logs show an error that no match was found. It seems to treat the '*' character as a literal character in the file name instead of as a wildcard,
e.g. s3n://mybucket/Dir1/fileName.*.gz
gives back an error that no matches were found for fileName.*.gz in that directory.
How do we specify wildcards in the file name for an Amazon EMR job?
Just went through this myself. It is very useful to pass NON-globbed wildcard expressions from the start script to spark/pyspark, because the distribution mechanism inside the spark program can be efficient when presented with something like this; note the globbing at both directory level and filename level:
df = spark.read.json('s3://my-bucket/archive/*/2014/7/G.*.json.bz2')
Not to mention of course that almost all the time you want globbing to occur on the remote resource, not your local launch environment.
The trick is to ensure that the initial shell variable does not get globbed when created and is also protected when presented to aws emr add-steps. Here is a simple launch script that assumes a cluster has been created. To show it can be done, we also escape newlines to make it easier to see the args. Be careful, however, NOT to re-introduce extra whitespace when doing this!
# Use single quotes to stop globbing at the var level:
DATA_URI='s3://my-bucket/archive/*/2014/7/G.*.json.bz2'
# DO NOT add trailing slash to the output_uri. S3 will
# automatically create subdirs under that. e.g.
# --output_uri s3://$SRC_BUCKET/V4_t
# will be created and populated with many part-0000-... files.
# If you are not renaming or deleting the output_uri for each run,
# make sure your spark program uses overwrite mode for dataframe output e.g.
# dfx.write.mode("overwrite").json(output_uri)
# Careful to protect the DATA_URI arg by wrapping it single quotes:
aws emr add-steps \
--cluster-id j-3CDMYEF3NJGHR \
--steps Type=Spark,\
Name="myAnalytics",\
ActionOnFailure=CONTINUE,\
Args=[\
s3://$SRC_BUCKET/blunders.py,\
--game_data,\'$DATA_URI\',\
--output_uri,s3://$SRC_BUCKET/V4_t]
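The same launch can also be assembled from Python. This is a minimal sketch, under the assumption that the command is later run from an argument list: because no shell is involved in that case, nothing can glob the pattern locally. The function name add_steps_command and the bucket name in the script and output URIs are hypothetical; the step string mirrors the one in the shell script above.

```python
def add_steps_command(cluster_id, script_uri, data_uri, output_uri):
    """Build the argument list for `aws emr add-steps` without going through a shell."""
    # Same step definition as the shell script, with the data URI kept in
    # single quotes inside the Args list exactly as the script does.
    step = (
        "Type=Spark,Name=myAnalytics,ActionOnFailure=CONTINUE,"
        f"Args=[{script_uri},--game_data,'{data_uri}',--output_uri,{output_uri}]"
    )
    return ["aws", "emr", "add-steps", "--cluster-id", cluster_id, "--steps", step]


command = add_steps_command(
    "j-3CDMYEF3NJGHR",
    "s3://my-bucket/blunders.py",                       # hypothetical script location
    "s3://my-bucket/archive/*/2014/7/G.*.json.bz2",
    "s3://my-bucket/V4_t",                              # hypothetical output location
)
print(command[-1])
# To launch for real (requires the AWS CLI and credentials):
#   import subprocess; subprocess.run(command, check=True)
```

Passing a list to subprocess.run keeps each element as one literal argument, so the asterisks reach the cluster unchanged.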
