Sorting by key in RocksDB

I'm trying to compare a huge amount of data in Java across two input folders, folder1 and folder2.
Each folder contains several files of around 10 MB each; I have about 100 files in each folder.
Each file contains key,value lines like the following (around 500 million lines in total per folder):
RFE023334343432-45,456677
RFE54667565765-5,465368
and so on...
First Step
First, each line of every file in folder1 is read and loaded into a RocksDB instance. In my example above,
key = RFE023334343432-45 and the corresponding
value = 456677
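The load step looks roughly like the sketch below, using the RocksJava API (the class name, DB path and line parsing are simplified placeholders, not my actual code):

import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

public class LoadFolder1 {
    public static void main(String[] args) throws IOException, RocksDBException {
        RocksDB.loadLibrary();
        try (Options options = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(options, "/tmp/folder1-db")) {
            // one put() per "key,value" line of every file in folder1 (placeholder path)
            for (File file : new File("folder1").listFiles()) {
                try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        int comma = line.indexOf(',');
                        db.put(line.substring(0, comma).getBytes(),     // e.g. RFE023334343432-45
                               line.substring(comma + 1).getBytes());   // e.g. 456677
                    }
                }
            }
        }
    }
}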
Second Step
Once my RocksDB is filled with the folder1 data,
for each line read from folder2 I call the get() method on the folder1 RocksDB to check whether the key extracted from the folder2 line exists in the database; get() returns null when the key does not exist.
Note that I cannot use the RocksDB keyMayExist() method, because it returns false positives when you manipulate such a huge amount of data.
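The lookup step, again only as a simplified sketch (same imports as above; db is the RocksDB handle opened in the previous snippet and folder2File stands for one of the folder2 files):

int missing = 0;
try (BufferedReader reader = new BufferedReader(new FileReader(folder2File))) {
    String line;
    while ((line = reader.readLine()) != null) {
        String key = line.substring(0, line.indexOf(','));
        // get() returns null when the key from folder2 is absent from the folder1 DB
        if (db.get(key.getBytes()) == null) {
            missing++;
        }
    }
}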
Performance is fine when the data in folder1 is ordered by key value.
But the run takes three times as long when the input data is not sorted (I shuffled it with a shell command). That is odd, because in my test
I copy my unsorted folder1 into folder2 (simply duplicating the folder), so even though folder1 is unsorted, folder2 is unsorted in exactly the same way as folder1.
My question is: how can I sort my RocksDB by key?

RocksDB always sorts data by key. You can use an iterator to read the K/V pairs from a RocksDB instance in key order. Here is the API to create an iterator: https://github.com/facebook/rocksdb/blob/v6.22.1/include/rocksdb/db.h#L709-L716
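With the Java bindings (RocksJava) the same ordered scan can be sketched as follows, assuming db is an open org.rocksdb.RocksDB instance; the linked header is the equivalent C++ API:

try (org.rocksdb.RocksIterator it = db.newIterator()) {
    // keys come back in sorted order, e.g. RFE023334343432-45 -> 456677
    for (it.seekToFirst(); it.isValid(); it.next()) {
        String key = new String(it.key());
        String value = new String(it.value());
        // process the pair here
    }
}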

Related

scp_download to download multiple files based on a pattern?

I need to download many files from a server (specifically Tectia), ideally using the ssh package. These files all follow a predictable pattern across multiple subfolders. The filepath is formatted like this:
/directory/subfolder/A001/abcde001.csv
Where A001 counts up alongside the last 3 digits of the filename (/A002/abcde002.csv and so on)
In the vignette for scp_download it states that the files parameter may contain wildcards so I have tried to do something like
scp_download(session, "/directory/subfolder/A.*/abcde.*[.]csv", to=tempdir())
and
scp_download(session, "directory/subfolder/A\\d{3}/abcde\\d{3}[.]csv", to=tempdir())
but no matter which combination of patterns or wildcards I can think of (which isn't many) I only get something like
Warning: SSH warning: scp: /directory/subfolder/A\d{3}/abcde\d{3}[.]csv: No such file or directory
What I'm hoping to do is either find a way to do pattern matching here, or find a way to store the Tectia directories as a string to be read by scp_download. I've made sure that my session is connected properly, and the download works when I don't attempt any pattern matching.
I had the same problem. The problem is that when you use * in your pattern, it gets escaped when you send it to the server. However, when you request a specific file name like /directory/subfolder/A001/abcde001.csv, it works fine.
Finally, I changed my code to follow these steps:
Get the list of files/folders using the ls command with the ssh_exec_wait function and store it in a variable.
Download the files in that variable separately:
session <- ssh_connect("username@ip", passwd = "password")
# capture the server-side `ls` output as a character vector of file paths
files <- capture.output(ssh_exec_wait(session, command = 'ls /directory/subfolder/A001/*'))
dnc1 <- scp_download(session, files[1], to = paste0(getwd(), "/data/"))
dnc2 <- scp_download(session, files[2], to = paste0(getwd(), "/data/"))
dnc3 <- scp_download(session, files[3], to = paste0(getwd(), "/data/"))
The last three commands can be wrapped in a loop, since there could be hundreds or thousands of files.

Datastage Sequence job- how to process each file at a time if those files are in 7 different folders

DataStage - There are 7 folders in a path, and each folder contains 2 files. For example, the 2 files are in the following format: filename = test_s1_YYYYMMDD.txt, test_s1_YYYYMMDD.done. The paths for these files are
user/test/test_s1/
user/test/test_s2/
...
user/test/test_s7/
where s1, s2, ..., s7 represent the different folders.
In these folders the 2 files mentioned above are present, so how can I process each file in a sequence job?
First you need a job to process a file and the filename needs to be a parameter of that job.
For the Sequence level you need two levels - an inner one for the two files within each folder and an outer one for the different directories.
For the inner one you can either build a loop with two iterations or simply add the processing job twice to the sequence (which reduces complexity in case it will always be two files).
The outer Sequence is a loop where you parameterize the path so that the loop counter can be used to generate your flexible s1-s7 path suffix.
Check out more details on loops here
You can use the loop counter (stage_label.$Counter) to parameterize your job.
Depending on what you want to do with the files, how you process them is an important decision. Starting a job (or several) in a sequence for each file can cause heavy overhead just for starting the jobs. Try loading all files at once in a parallel job using the Sequential File stage.
In the Sequential File stage, set the appropriate Format. You can also set everything to none to just put each row into one column and process that in a later job. This makes the reading very flexible and forgiving. If your files all have the same structure, define your columns as needed.
To select the files, use File Patterns. In the Options of the Sequential File Stage, choose to have a File Name Column so you can process the filenames in a later job. You might also want to add a Row Number Column.
This method works pretty fast.

Phalcon 4.x creates multiple folders to store cached data. What is the benefit of having recursive and multiple folders?

$hash = '123456789';
$fileName = "test.{$hash}.html";
the final directory structure will be like the following:
te/st/.1/23/45/67/test.123456789.html
It simply breaks the file name down into directories of two characters each, leaving off the last 2 characters of test.123456789.
To prevent the "too many files in one folder" error.
See https://docs.phalcon.io/4.0/en/cache#stream as well.
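As a rough illustration of that layout (this is not Phalcon's actual source, just a sketch in Java of the splitting rule described above):

String fileName = "test.123456789.html";
String key = fileName.substring(0, fileName.length() - ".html".length()); // "test.123456789"
String toSplit = key.substring(0, key.length() - 2);                      // drop the last 2 chars
StringBuilder path = new StringBuilder();
for (int i = 0; i < toSplit.length(); i += 2) {
    // one directory level per 2-character chunk
    path.append(toSplit, i, Math.min(i + 2, toSplit.length())).append('/');
}
path.append(fileName);
System.out.println(path); // te/st/.1/23/45/67/test.123456789.html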

U-SQL Ignore Empty Files

I receive a daily dump of files from a data provider. On occasion we receive empty files (20 bytes). Is there any way to automatically avoid processing or skip these files?
I have tried:
USING Extractors.Csv(skipFirstNRows:1, silent:true);
But I seem to get a vertex failure related to what I believe are the empty files.
We recently added a FILE.LENGTH property as a computed virtual column that you can use to filter out files of a certain size.
For example, the following should only operate on files that are larger than 20 bytes:
@data =
    EXTRACT
        // ... columns to extract
        , file_sz = FILE.LENGTH()
    FROM "/mydata/{*}"
    USING Extractors.Csv();

@res =
    SELECT *
    FROM @data
    WHERE file_sz > 20;

remove log information from report and save report in desire location

I am new to Robot Framework and wanted to see if I can get some simple code for a custom report; I am also fine with an answer to my problem. I went through all the questions related to reports but could not find a specific answer to my problem. Currently my report contains log information, and I want to remove that log information from the report and save the report in a specific location. I just want the PASS/FAIL information in my report. Can anyone give me an example of how I can overcome this problem? I also need to know how I can save my report in a different location. Any example would be helpful. Thank you in advance.
There is a tool called Rebot which is part of Robot Framework.
By default, Robot Framework creates XML reports. The XML reports are automatically converted into HTML reports by Rebot.
You can set the location of the output files in the execution by specifying the parameter --outputdir (and thus set a different base directory for outputs).
From the documentation:
All output files can be set using an absolute path, in which case they are created to the specified place, but in other cases, the path is considered relative to the output directory. The default output directory is the directory where the execution is started from, but it can be altered with the --outputdir (-d) option. The path set with this option is, again, relative to the execution directory, but can naturally be given also as an absolute path. Regardless of how a path to an individual output file is obtained, its parent directory is created automatically, if it does not exist already.
You can call Rebot yourself to control this conversion.
You can also run Rebot after the test was run in order to create new output on a different location.
See documentation in:
http://robotframework.org/robotframework/latest/RobotFrameworkUserGuide.html#post-processing-outputs
The following example shows how to store the HTML reports in a different location and including only partial data:
rebot --include smoke --name Smoke_Tests c:\results\output.xml --outputdir c:\version1.0\reports
In the example above, we process the file c:\results\output.xml, create a new report called Smoke_Tests that includes only tests with the tag smoke and save it to the output folder c:\version1.0\reports
In addition you can also set the location of the log file (HTML) from the execution.
The command line option --log (-l) determines where log files are created.
The command line option --report (-r) determines where report files are created.
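For example (the paths below are just placeholders), a single rebot call can redirect the report and disable the log entirely, NONE being the special value that turns an output off:
rebot --report c:\version1.0\reports\report.html --log NONE c:\results\output.xml
That combination keeps just the report with the PASS/FAIL overview, which is what the question asks for.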
Removing log lines can be done a bit differently. If you run rebot --help you'll get the following options:
--removekeywords all|passed|for|wuks|name:<pattern> *  Remove keyword data
                      from all generated outputs. Keywords containing
                      warnings are not removed except in `all` mode.
                      all:     remove data from all keywords
                      passed:  remove data only from keywords in passed
                               test cases and suites
                      for:     remove passed iterations from for loops
                      wuks:    remove all but the last failing keyword
                               inside `BuiltIn.Wait Until Keyword Succeeds`
                      name:<pattern>:  remove data from keywords that match
                               the given pattern. The pattern is matched
                               against the full name of the keyword (e.g.
                               'MyLib.Keyword', 'resource.Second Keyword'),
                               is case, space, and underscore insensitive,
                               and may contain `*` and `?` as wildcards.
                               Examples: --removekeywords name:Lib.HugeKw
                                         --removekeywords name:myresource.*
--flattenkeywords for|foritem|name:<pattern> *  Flattens matching keywords
                      in all generated outputs. Matching keywords get all
                      log messages from their child keywords and children
                      are discarded otherwise.
                      for:     flatten for loops fully
                      foritem: flatten individual for loop iterations
                      name:<pattern>:  flatten matched keywords using same
                               matching rules as with
                               `--removekeywords name:<pattern>`
