Not able to access certain JSON properties in Databricks Autoloader

I have a JSON file that is loaded by two different Autoloaders.
The first uses schema evolution and, apart from replacing spaces in the JSON property names, writes the JSON directly to a Delta table; I can see that all the values land there properly.
In the second I map to a defined schema and use only a subset of the properties, so there is a series of withColumn calls followed by a select that narrows to my defined column list.
Autoloader definition:
df = (spark
    .readStream
    .format('cloudFiles')
    .option('cloudFiles.format', 'json')
    .option('multiLine', 'true')
    .option('cloudFiles.schemaEvolutionMode', 'rescue')
    .option('cloudFiles.includeExistingFiles', 'true')
    .option('cloudFiles.schemaLocation', bronze_schema)
    .option('cloudFiles.inferColumnTypes', 'true')
    .option('pathGlobFilter', '*.json')
    .load(upload_path)
    .transform(lambda df: remove_spaces_from_columns(df))
    .withColumn(...
Writer:
df.writeStream.format('delta') \
    .queryName(al_stream_name) \
    .outputMode('append') \
    .option('checkpointLocation', checkpoint_path) \
    .option('mergeSchema', 'true') \
    .trigger(once=True) \
    .table(bronze_table)
The issue is that some of the source columns load fine and I get their values, while others are constantly null in the output table.
For example:
.withColumn('vl_rating', col('risk_severity.value')) # works
.withColumn('status', col('status.name')) # always null
...
.select(
'rating',
'status',
...
The JSON is quite simple: these are all string values and they are always populated. The same code works against a similar JSON file in another Autoloader without issue.
I have run out of ideas for fault-finding this. My imports are minimal, and outside of Autoloader the JSON loads fine, e.g.:
%python
import pyspark.sql.functions as psf
jsontest = spark.read.option('inferSchema','true').json('dbfs:....json')
df = jsontest.withColumn('status', psf.col('status.name')).select('status')
display(df)
This returns the values of the status.name property from the JSON file.
Any ideas would be greatly appreciated.

I have found, broadly, what is causing this. Interesting cause!
I am scanning a whole directory of JSON files, and the schema evolves over time (as expected). But when I clear out the Autoloader schema and checkpoint directories and scan only the latest JSON file, it all works correctly.
So what I surmise is that something in schema evolution with the older JSON files gets Autoloader into a state where it will not pass certain properties through the stream to the writer.
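If that theory is right, pinning the affected columns might sidestep the bad inference. A minimal sketch, assuming the struct layout of the newer files (the hint string is my guess and untested against this data):
df = (spark
    .readStream
    .format('cloudFiles')
    .option('cloudFiles.format', 'json')
    .option('cloudFiles.schemaLocation', bronze_schema)
    # pin 'status' to the expected struct so older inferences cannot shadow it
    .option('cloudFiles.schemaHints', 'status STRUCT<name: STRING>')
    .load(upload_path))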
If anyone has a recommendation on how to implement some data quality analysis in an Autoloader pipeline, I would be most appreciative if you would share it.
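One pattern worth trying (a sketch only, not validated against this pipeline; bronze_table and checkpoint_path reuse the names from the code above) is a foreachBatch writer that inspects each micro-batch before appending, e.g. counting rows whose mapped column arrived null while _rescued_data is populated:
from pyspark.sql import functions as F

def bronze_with_dq(batch_df, batch_id):
    # rows where the mapped column is null but raw data was rescued --
    # a hint that schema evolution diverted the property
    suspect = batch_df.filter(F.col('status').isNull()
                              & F.col('_rescued_data').isNotNull()).count()
    if suspect:
        print(f'batch {batch_id}: {suspect} rows lost status to _rescued_data')
    batch_df.write.format('delta').mode('append').saveAsTable(bronze_table)

df.writeStream.foreachBatch(bronze_with_dq) \
    .option('checkpointLocation', checkpoint_path) \
    .trigger(once=True) \
    .start()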

Related

Use AWK to safely search and replace URLs in a WordPress SQL dump

I am working on a web tool to mirror a WordPress installation into a development system.
The aim is to have a live system for production and a development system for testing. The web tool then offers a one-click sync between those systems.
Each system is standalone, with its own webroot, database and URL.
I am having trouble with the database dump, in which I have to find all references to the source URL and replace them with the URL of the destination (e.g. "www.example.com" -> "www-dev.example.com").
What I need to do is:
Find all occurrences of the URL and replace them with the new one.
If the match is also part of a serialized string, set the field separator, re-read the record, and update the stored string length so the serialized array stays valid.
In a first attempt I tried to solve this with a 'sed' command looking as follows: sed -i.orig 's/360\.example\.com/360-dev\.my\.example\.dev/g'.
This didn't work because the dump contains serialized arrays that embed the URL: the s:NN: prefix records each string's length, so a plain substitution of a URL with a different length corrupts the array. sed is no good for updating that string-length indicator.
My latest attempt is to use awk, as suggested here, because it is capable of arithmetic operations.
My awk script looks like this:
/360[.]example[.]com/ {
    sub("360.example.com", "360-dev.my.example.dev");
    if ($0 ~ /s:[[:digit:]]+:["](http[s]?:\/\/)?360[.]example[.]com["]/) {
        FS = "\"";
        $0 = $0;
        n = length($2) - 1;
        sub(/:[[:digit:]]+:/, ":" n ":");
    }
} 1
There seem to be some errors in my script, which I can't find. It does not replace all of the occurrences of the URL, and it completely skips the length-indicator update.
How can I fix my script to achieve what I want to do?
EDIT: (Added Input/Output samples)
The database dump consists of the whole WordPress database, with CREATE TABLE IF NOT EXISTS and INSERT statements for each table and record.
Normal (unserialized) occurrence:
(36, 'home', 'http://360.example.com/blogname', 'yes'),
should result in:
(36, 'home', 'http://360-dev.my.example.dev/blogname', 'yes'),
Serialized occurrence:
(404, 'wp-maintenance-mode', 'a:21:{s:6:"active";i:1;s:4:"time";i:0;s:4:"link";i:1;s:7:"support";i:0;s:10:"admin_link";i:1;s:7:"rewrite";s:0:"";s:6:"notice";i:1;s:4:"unit";i:1;s:5:"theme";i:0;s:8:"styleurl";s:69:"http://360.example.com/wp-content/themes/blogname/css/maintenance.css";s:5:"index";i:0;s:5:"title";s:0:"";s:6:"header";s:0:"";s:7:"heading";s:0:"";s:4:"text";s:12:"Example Text";s:7:"exclude";a:1:{i:0;s:0:"";}s:6:"bypass";i:0;s:4:"role";a:1:{i:0;s:13:"administrator";}s:13:"role_frontend";a:1:{i:0;s:13:"administrator";}s:5:"radio";i:0;s:4:"date";s:0:"";}', 'yes'),
Should result in:
(404, 'wp-maintenance-mode', 'a:21:{s:6:"active";i:1;s:4:"time";i:0;s:4:"link";i:1;s:7:"support";i:0;s:10:"admin_link";i:1;s:7:"rewrite";s:0:"";s:6:"notice";i:1;s:4:"unit";i:1;s:5:"theme";i:0;s:8:"styleurl";s:76:"http://360-dev.my.example.dev/wp-content/themes/blogname/css/maintenance.css";s:5:"index";i:0;s:5:"title";s:0:"";s:6:"header";s:0:"";s:7:"heading";s:0:"";s:4:"text";s:12:"Example Text";s:7:"exclude";a:1:{i:0;s:0:"";}s:6:"bypass";i:0;s:4:"role";a:1:{i:0;s:13:"administrator";}s:13:"role_frontend";a:1:{i:0;s:13:"administrator";}s:5:"radio";i:0;s:4:"date";s:0:"";}', 'yes'),
EDIT 2:
Now using wp-cli to do the task of search & replace.
I've got a multisite setup with blogs numbered (2, 3, 9).
Executing wp search-replace --url=360.example.com '360.example.com' '360-dev.my.example.dev' results in an error telling me that the single-site tables (wp_redirection_items and wp_redirection_groups) cannot be found.
This is true, because they do not exist under those names, but rather once per blog (e.g. wp_2_redirection_items and so on). The error results in over 9000 missed occurrences in the search & replace. It is possible to replace everything with wp search-replace --url=360.example.com '360.example.com' '360-dev.my.example.dev' wp_*, but it still throws the error.
As suggested by @archimiro, the task is now done by wp-cli.
But since I also have a multisite setup, which led to some errors, I had to figure out the command for a full-database search-and-replace task.
The final command:
wp search-replace --url=360.example.com '360.example.com' '360-dev.my.example.dev' wp_*
Without explicitly telling wp-cli to search and replace in ALL (wp_*) tables, it stops as soon as a "table not found" error is thrown.
Not awk or wp-cli either, but this is a PHP function I wrote that seems to work well.
function snr($search, $replace, $inputfile, $outputfile){
    // Plain text replacement first (this breaks the serialized lengths).
    $sql = file_get_contents($inputfile);
    $sql1 = str_replace($search, $replace, $sql);
    file_put_contents($outputfile, $sql1);
    // Split on every serialized-string marker "s:" that follows '{' or ';'.
    $serstrings = preg_split("/(?<=[{;])s:/", $sql1);
    foreach ($serstrings as $i => $serstring) {
        if (!!strpos($serstring, $replace)) {
            // Extract the quoted payload, neutralize escape sequences,
            // and measure its real length.
            $justString = str_replace("\\", "", str_replace("\\\\", "j", explode('\\";', explode(':\\"', $serstring)[1])[0]));
            $correct = strlen($justString);
            // Rewrite the leading length indicator of this chunk.
            $serstrings[$i] = preg_replace('/^\d+/', $correct, $serstrings[$i]);
        }
    }
    file_put_contents($outputfile, implode("s:", $serstrings));
}
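For the dump above it would be called with something like this (reusing the file names from the sed answer below):
snr('360.example.com', '360-dev.my.example.dev', 'com.sql', 'local.sql');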
I've used this in the past with success:
sed 's|360\.example\.com|360-dev\.my\.example\.dev|g' com.sql > local.sql
Edit: sorry not awk, but neither is wp-cli.

Remove log information from report and save report in desired location

I am new to Robot Framework and wanted to see if I can get some simple code for a custom report; a direct answer to my problem is also fine. I went through all the questions related to reports but could not find a specific answer. Currently my report contains log information, and I want to remove it so that the report holds only PASS/FAIL information. Can anyone give me an example of how to overcome this problem? I also need to know how I can save my report in a different location. Any example would be helpful. Thank you in advance.
There is a tool called Rebot which is part of Robot Framework.
By default, Robot Framework creates XML reports. The XML reports are automatically converted into HTML reports by Rebot.
You can set the location of the output files in the execution by specifying the parameter --outputdir (and thus set a different base directory for outputs).
From the documentation:
All output files can be set using an absolute path, in which case they are created to the specified place, but in other cases, the path is considered relative to the output directory. The default output directory is the directory where the execution is started from, but it can be altered with the --outputdir (-d) option. The path set with this option is, again, relative to the execution directory, but can naturally be given also as an absolute path. Regardless of how a path to an individual output file is obtained, its parent directory is created automatically, if it does not exist already.
You can call Rebot yourself to control this conversion.
You can also run Rebot after the test was run in order to create new output on a different location.
See documentation in:
http://robotframework.org/robotframework/latest/RobotFrameworkUserGuide.html#post-processing-outputs
The following example shows how to store the HTML reports in a different location and include only partial data:
rebot --include smoke --name Smoke_Tests c:\results\output.xml --outputdir c:\version1.0\reports
In the example above, we process the file c:\results\output.xml, create a new report called Smoke_Tests that includes only tests with the tag smoke, and save it to the output folder c:\version1.0\reports.
In addition, you can set the locations of the log and report files (HTML) for the execution itself:
The command line option --log (-l) determines where log files are created.
The command line option --report (-r) determines where report files are created.
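Putting those together, something along these lines (paths are illustrative) places both files wherever you want:
rebot --log c:\version1.0\reports\log.html --report c:\version1.0\reports\report.html c:\results\output.xml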
Removing log lines can be done a bit differently. If you run rebot --help you'll get the following options:
--removekeywords all|passed|for|wuks|name: * Remove keyword data
from all generated outputs. Keywords containing
warnings are not removed except in `all` mode.
all: remove data from all keywords
passed: remove data only from keywords in passed
test cases and suites
for: remove passed iterations from for loops
wuks: remove all but the last failing keyword
inside `BuiltIn.Wait Until Keyword Succeeds`
name:: remove data from keywords that match
the given pattern. The pattern is matched
against the full name of the keyword (e.g.
'MyLib.Keyword', 'resource.Second Keyword'),
is case, space, and underscore insensitive,
and may contain `*` and `?` as wildcards.
Examples: --removekeywords name:Lib.HugeKw
--removekeywords name:myresource.*
--flattenkeywords for|foritem|name: * Flattens matching keywords
in all generated outputs. Matching keywords get all
log messages from their child keywords and children
are discarded otherwise.
for: flatten for loops fully
foritem: flatten individual for loop iterations
name:: flatten matched keywords using same
matching rules as with
`--removekeywords name:`
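So, for the plain PASS/FAIL requirement, a command like this (untested, paths illustrative) strips keyword details from the report and skips the log entirely:
rebot --removekeywords all --log NONE --outputdir c:\version1.0\reports c:\results\output.xml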

SequenceFiles which map a single key to multiple values

I am trying to do some preprocessing on data that will be fed to LucidWorks Big Data for indexing. LWBD accepts SolrXML in the form of SequenceFile files. I want to create a Pig script which will take all the SolrXML files in a directory and output them in the format
filename_1 => <here goes some XML>
...
filename_N => <here goes some more XML>
Pig's native PigStorage() load function can automatically create a column that includes the name of the file from which the data was extracted, which ideally would look like this:
{"filename_1", "<here goes some XML>"}
...
{"filename_N", "<here goes some more XML>"}
However, PigStorage() also automatically uses '\n' as a line delimiter, so what I actually end up with is a bag that looks like this:
{"filename_1", "<some partial XML from file 1>"}
{"filename_1", "<some more partial XML from file 1>"}
{"filename_1", "<the end of file 1>"}
...
I'm sure you get the picture. My question is, if I were to write this bag to a SequenceFile, how would it be read by other applications? Could it be combined as
"filename_1" => "<some partial XML from file 1>
<some more partial XML from file 1>
<the end of file 1>"
, by the default handling of the application I feed it to? Or is there some post-processing that I can do to get it into this format? Thank you for your help.
Since I can't find anything about a built-in SequenceFile writer, I'm assuming you are using a UDF (and if you aren't, then you need to).
You'll have to group the files (by filename) ahead of time, and then send that to the writer UDF.
DESCRIBE xml ;
-- xml: {filename: chararray, xml_data: chararray}
B = FOREACH (GROUP xml BY filename)
    GENERATE group AS filename, xml.xml_data AS all_xml_data ;
Depending on how you have written the SequenceFile writer, it may be easier to convert the all_xml_data bag ahead of time to a chararray using a Python UDF like:
@outputSchema('xml_complete: chararray')
def stringify(bag):
    # Each element of a Pig bag reaches a Jython UDF as a tuple,
    # so join the first field of each tuple.
    delim = ''
    return delim.join(t[0] for t in bag)
NOTE: It is important to realize that the order of the XML data may become jumbled this way, since GROUP does not guarantee ordering inside the bag. If your data allows it, stringify could be expanded to reorganize it.
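For completeness, a sketch of how the UDF could be wired in, folding the conversion into the grouping statement (the file name stringify_udf.py and the alias udfs are illustrative):
REGISTER 'stringify_udf.py' USING jython AS udfs;

B = FOREACH (GROUP xml BY filename)
    GENERATE group AS filename, udfs.stringify(xml.xml_data) AS xml_complete ;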

Does sbt have something like gradle's processResources task with ReplaceTokens support?

We are moving to a Scala/SBT stack from Java/Gradle. Our Gradle builds leveraged a task called processResources and an Ant filter named ReplaceTokens to dynamically replace tokens in a checked-in .properties file without actually changing the .properties file (only changing the output). The Gradle task looks like:
processResources {
    def whoami = System.getProperty('user.name')
    def hostname = InetAddress.getLocalHost().getHostName()
    def buildTimestamp = new Date().format('yyyy-MM-dd HH:mm:ss z')
    filter ReplaceTokens, tokens: [
        "buildsig.version"    : project.version,
        "buildsig.classifier" : project.classifier,
        "buildsig.timestamp"  : buildTimestamp,
        "buildsig.user"       : whoami,
        "buildsig.system"     : hostname,
        "buildsig.tag"        : buildTag
    ]
}
This task locates all the template files in the src/main/resources directory, performs the requisite substitutions, and outputs the results to build/resources/main. In other words, it transforms src/main/resources/buildsig.properties from...
buildsig.version=#buildsig.version#
buildsig.classifier=#buildsig.classifier#
buildsig.timestamp=#buildsig.timestamp#
buildsig.user=#buildsig.user#
buildsig.system=#buildsig.system#
buildsig.tag=#buildsig.tag#
...to build/resources/main/buildsig.properties...
buildsig.version=1.6.5
buildsig.classifier=RELEASE
buildsig.timestamp=2013-05-06 09:46:52 PDT
buildsig.user=jenkins
buildsig.system=bobk-mbp.local
buildsig.tag=dev
Which, ultimately, finds its way into the WAR file at WEB-INF/classes/buildsig.properties. This works like a champ to record build specific information in a Properties file which gets loaded from the classpath at runtime.
What do I do in SBT to get something like this done? I'm new to Scala/SBT, so please forgive me if this seems a stupid question. At the end of the day, what I need is a means of pulling some information from the environment at build time and placing it into a properties file that is classpath-loadable at runtime. Any insights you can give to help me get this done are greatly appreciated.
The sbt-buildinfo plugin is a good option. The README shows how to define custom mappings and mappings that should run on each compile. In addition to the straightforward addition of normal settings like version shown there, you want a section like this:
buildInfoKeys ++= Seq[BuildInfoKey](
    "hostname" -> java.net.InetAddress.getLocalHost().getHostName(),
    "whoami" -> System.getProperty("user.name"),
    BuildInfoKey.action("buildTimestamp") {
        java.text.DateFormat.getDateTimeInstance.format(new java.util.Date())
    }
)
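At runtime the generated object can replace the classpath .properties lookup entirely. A sketch, assuming buildInfoPackage := "buildsig" is set in the build and the default keys (version etc.) are kept alongside the custom ones above:
import buildsig.BuildInfo

object BuildSig {
    // version comes from the standard keys; the rest are the custom keys above
    def summary: String =
        s"${BuildInfo.version} by ${BuildInfo.whoami} on ${BuildInfo.hostname} at ${BuildInfo.buildTimestamp}"
}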
Would the following be what you're looking for?
sbt-editsource: An SBT plugin for editing files
sbt-editsource is a text substitution plugin for SBT 0.11.x and greater. In a way, it's a poor man's sed(1) for SBT. It provides the ability to apply line-by-line substitutions to a source text file, producing an edited output file. It supports two kinds of edits:
- Variable substitution, where ${var} is replaced by a value.
- sed-like regular expression substitution.
This is from Community Plugins.

How to programmatically get the file path from a directory parameter in ABAP?

The transaction AL11 returns a mapping of "directory parameters" to file paths on the application server, AFAIK.
The trouble with transaction AL11 is that its program only calls C modules; there is almost no trace of SELECT statements or function calls to analyze there.
I want the ability to do this dynamically, in my code, like for instance a function module that took "DATA_DIR" as input and "E:\usr\sap\IDS\DVEBMGS00\data" as output.
This thread is about a similar topic, but it doesn't help.
Some other guy has the same problem, and he explains it quite well here.
I strongly suspect that the only way to get these values is through the kernel directly. Some of them can vary depending on the application server, so you probably won't be able to find them in the database. You could try this:
TYPE-POOLS abap.

TYPES: BEGIN OF t_directory,
         log_name  TYPE dirprofilenames,
         phys_path TYPE dirname_al11,
       END OF t_directory.

DATA: lt_int_list    TYPE TABLE OF abaplist,
      lt_string_list TYPE list_string_table,
      lt_directories TYPE TABLE OF t_directory,
      ls_directory   TYPE t_directory.

FIELD-SYMBOLS: <l_line> TYPE string.
START-OF-SELECTION-OR-FORM-OR-METHOD-OR-WHATEVER.

* get the output of the program as a string table
  SUBMIT rswatch0 EXPORTING LIST TO MEMORY AND RETURN.
  CALL FUNCTION 'LIST_FROM_MEMORY'
    TABLES
      listobject = lt_int_list.
  CALL FUNCTION 'LIST_TO_ASCI'
    EXPORTING
      with_line_break   = abap_true
    IMPORTING
      list_string_ascii = lt_string_list
    TABLES
      listobject        = lt_int_list.

* remove the separators and the two header lines
  DELETE lt_string_list WHERE table_line CO '-'.
  DELETE lt_string_list INDEX 1.
  DELETE lt_string_list INDEX 1.

* parse the individual lines
  LOOP AT lt_string_list ASSIGNING <l_line>.
*   on a newer system, this could be done more elegantly with regular expressions
    CONDENSE <l_line>.
    SHIFT <l_line> LEFT DELETING LEADING '|'.
    SHIFT <l_line> RIGHT DELETING TRAILING '|'.
    SPLIT <l_line>+1 AT '|' INTO ls_directory-log_name ls_directory-phys_path.
    APPEND ls_directory TO lt_directories.
  ENDLOOP.
Try the following:
DATA dirname TYPE dirname_al11.

CALL 'C_SAPGPARAM' ID 'NAME'  FIELD 'DIR_DATA'
                   ID 'VALUE' FIELD dirname.
Alternatively, if you want to use your own parameters (AL11 -> Configure), read them out of table USER_DIR.
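A minimal sketch of that lookup, assuming USER_DIR keeps the alias in ALIASS and the path in DIRNAME (these field names are from memory; verify the actual structure in SE11):
* illustrative only -- check the USER_DIR structure on your release
DATA l_path TYPE dirname_al11.

SELECT SINGLE dirname FROM user_dir
  INTO l_path
  WHERE aliass = 'MY_DATA_DIR'.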
