MarkLogic globally unique URIs

During an mlcp import of the aggregate XML file /space/data/big.xml, the default document URIs look like /space/data/big.xml-0-1, /space/data/big.xml-0-2, etc.
Since there is no unique element value in the document to use with -uri_id, and the URIs generated above are not globally unique, is there any option to get unique URIs (e.g. like the RDF-style c7f92bccb4e2bfdc-0-100.xml)?

You could apply a custom transformation module and construct a URI that concatenates a random value generated with a function such as sem.uuidString():
// mlcp custom transform: prefix each generated document URI with a UUID
function envelope(content, context)
{
  // content.uri arrives as the default URI, e.g. /space/data/big.xml-0-1
  content.uri = sem.uuidString() + "-" + content.uri;
  return content;
}
exports.transform = envelope;
Using a Custom Transformation
Once you install a custom transformation function on MarkLogic Server, you can apply it to your mlcp import or copy job using the following option:
-transform_module - The path to the module containing your transformation.
$ mlcp.sh import -mode local -host mlhost -port 8000 \
-username user -password password \
-input_file_path /space/mlcp-test/data \
-transform_module /example/mlcp-transform.sjs
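With the transform installed, each document URI gets a UUID prefix, so the loaded documents end up with globally unique URIs along these lines (the UUID values here are illustrative):
4ae33aba-4b16-4919-9956-b49a727a2f3d-/space/data/big.xml-0-1
0f5e7c21-8d3a-4b6f-9e2d-1c4a5b6d7e8f-/space/data/big.xml-0-2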


Not able to access certain JSON properties in Autoloader

I have a JSON file that is loaded by two different Autoloaders.
One uses schema evolution and, aside from replacing spaces in the JSON property names, writes the JSON directly to a Delta table; I can see all the values are there properly.
In the second one I am mapping to a defined schema and only use a subset of properties, so I use a lot of withColumn calls and then a select to narrow down to my defined column list.
Autoloader definition:
df = (spark
    .readStream
    .format('cloudFiles')
    .option('cloudFiles.format', 'json')
    .option('multiLine', 'true')
    .option('cloudFiles.schemaEvolutionMode', 'rescue')
    .option('cloudFiles.includeExistingFiles', 'true')
    .option('cloudFiles.schemaLocation', bronze_schema)
    .option('cloudFiles.inferColumnTypes', 'true')
    .option('pathGlobFilter', '*.json')
    .load(upload_path)
    .transform(lambda df: remove_spaces_from_columns(df))
    .withColumn(...
Writer:
df.writeStream.format('delta') \
    .queryName(al_stream_name) \
    .outputMode('append') \
    .option('checkpointLocation', checkpoint_path) \
    .option('mergeSchema', 'true') \
    .trigger(once=True) \
    .table(bronze_table)
The issue is that some of the source columns load fine and I get their values, while others are constantly null in the output table.
For example:
.withColumn('vl_rating', col('risk_severity.value')) # works
.withColumn('status', col('status.name')) # always null
...
.select(
'rating',
'status',
...
The JSON is quite simple: these are all string values, and they are always populated. The same code works against a similar JSON file in another Autoloader without issue.
I have run out of ideas for fault-finding this. My imports are minimal, and outside of Autoloader the JSON loads fine.
e.g.
%python
import pyspark.sql.functions as psf
jsontest = spark.read.option('inferSchema','true').json('dbfs:....json')
df = jsontest.withColumn('status', psf.col('status.name')).select('status')
display(df)
This returns the values of the status.name property of the JSON file.
Any ideas would be greatly appreciated.
I have found, broadly, what is causing this. Interesting cause!
I am scanning a whole directory of JSON files, and the schema evolves over time (as expected). But when I clear out the Autoloader schema and checkpoint directories and only scan the latest JSON file, it all works correctly.
So what I surmise is that something in schema evolution with the older JSON files gets Autoloader into a state where it will not put certain properties into the stream to the writer.
If anyone has a recommendation on how to implement some data quality analysis in an Autoloader, I would be most appreciative if you would share.
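One lightweight check, as a sketch (it assumes the bronze_schema and upload_path variables from above, and that cloudFiles.schemaEvolutionMode is 'rescue' as in the stream definition): in rescue mode, values that do not match the tracked schema land in the _rescued_data column instead of their typed columns, so surfacing rows where that column is non-null points at exactly the properties that stopped parsing.
%python
import pyspark.sql.functions as psf

# Sketch: flag rows whose properties were rescued rather than parsed
dq = (spark.readStream
    .format('cloudFiles')
    .option('cloudFiles.format', 'json')
    .option('cloudFiles.schemaEvolutionMode', 'rescue')
    .option('cloudFiles.schemaLocation', bronze_schema)  # same location as above
    .load(upload_path)
    .withColumn('rescued', psf.col('_rescued_data').isNotNull()))

display(dq.where('rescued'))  # inspect the offending rows and properties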

Build web graph with wget

I'm using wget with the -r (recursive) option to crawl and download all the pages starting from a root.
For debugging purposes I'd like to output which page routed me to another one, for example: https://stackoverflow.com/ -> https://stackoverflow.com/questions
Is there a way to do that?
Please note that I explicitly need to use wget.
The best solution I have found until now is to use the --warc-file option to export a WARC archive of my crawl. This format also stores the Referer header.
Using a Python library to read the output, I wrote the following simple script to export a CSV with source/target columns:
import warc

# Python 2 script using the `warc` package: wget's request records
# carry the Referer header, which gives us the source -> target edges
f = warc.open("crawler.warc")
for record in f:
    if record['WARC-Type'] != 'request':
        continue
    for line in record.payload:
        if line.startswith("Referer:"):
            # CSV row: referring page, then the page it led to
            print line.replace("Referer: ", "").strip('\n\r'), ",", record['WARC-Target-URI']
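For reference, a crawl that produces a matching archive might look something like this (the URL is a placeholder; --no-warc-compression keeps the output as plain crawler.warc rather than crawler.warc.gz, matching what the script opens):
wget -r --warc-file=crawler --no-warc-compression https://example.com/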

How does julia know what path separator and root directory to use?

Functions like joinpath use the appropriate OS-dependent separator when joining two paths (i.e. / on Linux, \\ on Windows, etc.). How do these functions know which separator to use?
Similarly, the root directory on Linux is /, but on Windows is probably C:\\. Is there a way to retrieve the OS-dependent root directory in Julia?
Note, I've had a look at the joinpath source on GitHub, and it appears to use an undocumented function pathsep(a,b) and a global variable path_separator_re, but I can't see how either of these works.
It uses the Sys.isunix and Sys.iswindows functions to conditionally define the correct path_separator_re variable, etc.
https://github.com/JuliaLang/julia/blob/5c3f58039525972b24930f356821af8299f70a26/base/path.jl#L19-L41
if Sys.isunix()
    # ...
    const path_separator_re = r"/+"
    # ...
    splitdrive(path::String) = ("", path)
elseif Sys.iswindows()
    # ...
    const path_separator_re = r"[/\\]+"
    # ...
    function splitdrive(path::String)
        m = match(r"^([^\\]+:|\\\\[^\\]+\\[^\\]+|\\\\\?\\UNC\\[^\\]+\\[^\\]+|\\\\\?\\[^\\]+:|)(.*)$", path)
        String(m.captures[1]), String(m.captures[2])
    end
else
    error("path primitives for this OS need to be defined")
end
For the root directory, check out the homedir function, which uses libuv to determine it.
https://github.com/JuliaLang/julia/blob/5c3f58039525972b24930f356821af8299f70a26/base/path.jl#L52-L77
help?> homedir
search: homedir
homedir() -> AbstractString
Return the current user's home directory.
| Note
|
| homedir determines the home directory via libuv's uv_os_homedir. For details (for example on how to specify the home
| directory via environment variables), see the uv_os_homedir documentation.
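So on a Unix system, for example, a REPL session would look roughly like this (the home directory shown is illustrative):
julia> joinpath("foo", "bar")
"foo/bar"

julia> Sys.iswindows()
false

julia> homedir()
"/home/alice"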

Windows batch script - parse and expand the variable to pass as a string to external program?

I want to use a relative file path as a command line argument, but as the example and assessment below will demonstrate, the variable passes \..\ as a literal string; it doesn't evaluate it.
Can I force the command line to parse and expand the variable as a string?
For example, I have an R script file I want to launch from the command line:
Set RPath=C:\Program Files\R\R-3.1.0\bin\Rscript.exe
SET RScript=%CD%\..\..\HCF_v9.R
SET SourceFile=%CD%\..\Source\
ECHO String used for Source Location - %SourceFile%
"%RPath%" "%RScript%" %SourceFile%
The inclusion of \..\ works in the call to R as an external program because the batch file can resolve its own commands.
The SourceFile variable, however, doesn't work because it hasn't expanded \..\; it has just included it as part of the string, and R can't process \..\
You can use for replaceable parameters (the %%~f modifier) to resolve the real path:
for %%a in ("..\..\HCF_v9.R") do set "RScript=%%~fa"
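Applied to the question's script, the same expansion resolves both variables before they are passed to R (a sketch using the paths from the question; the trailing backslash on the Source path is dropped so the quoted argument survives):
Set RPath=C:\Program Files\R\R-3.1.0\bin\Rscript.exe
for %%a in ("%CD%\..\..\HCF_v9.R") do set "RScript=%%~fa"
for %%a in ("%CD%\..\Source") do set "SourceFile=%%~fa"
ECHO Resolved Source Location - %SourceFile%
"%RPath%" "%RScript%" "%SourceFile%"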
@MC ND has provided the batch file approach; an R-centric approach would be to pass the current directory to R and modify it there.
:: batch file
Set RPath=C:\Program Files\R\R-3.1.0\bin\Rscript.exe
SET RScript=%CD%\..\..\HCF_v9.R
"%RPath%" "%RScript%" %CD%
# in R
srcpath <- commandArgs(TRUE)[1]
srcpath <- normalizePath(file.path(srcpath, "../Source"))

Easiest way to specify alternate transmogrifier _path?

I'm doing a content migration with collective.transmogrifier and I'm reading files off the file system with transmogrify.filesystem. Instead of importing the files "as is", I'd like to import them to a subdirectory in Plone. What is the easiest way to modify the _path?
For example, if the following exists:
/var/www/html/bar/index.html
I'd like to import to:
/Plone/foo/bar/index.html
In other words, import the contents of "html" to a subdirectory "foo". I see two options:
Use some blueprint in collective.transmogrifier to mangle _path.
Write some blueprint to mangle _path.
Am I missing anything easier?
Use the standard inserter blueprint to generate the paths; it accepts Python expressions and can replace keys in-place:
[manglepath]
blueprint = collective.transmogrifier.sections.inserter
key = string:_path
value = python:item['_path'].replace('/var/www/html', '/Plone/foo')
This takes the output of the value Python expression (which uses the item's existing _path) and stores it back under the same key.
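With the paths from the question, an item would be rewritten like this (illustrative):
# before: item['_path'] == '/var/www/html/bar/index.html'
# after:  item['_path'] == '/Plone/foo/bar/index.html'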
