Apache NiFi: ExtractAvroMetadata processor - bigdata

The properties section of the ExtractAvroMetadata processor indicates that, for the property 'Metadata Keys', we can use a comma-separated list to name the fields to extract from the Avro schema.
Has anyone already used that option? When I supply a list like doc,namespace it does not work.

The documentation says this:
"A comma-separated list of keys indicating key/value pairs to extract from the Avro file header. The key 'avro.schema' can be used to extract the full schema in JSON format, and 'avro.codec' can be used to extract the codec name if one exists."
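So you can only use keys that are present in the file header, such as the two keys mentioned. Names like doc and namespace are attributes inside the schema JSON, not header keys.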
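To illustrate the answer, here is a minimal sketch in plain Python. It is not NiFi's implementation: the HEADER dictionary, its values, and the extract_metadata helper are all hypothetical stand-ins for an Avro file header and a comma-separated key lookup.

```python
import json

# Hypothetical header metadata, standing in for a real Avro file header.
HEADER = {
    "avro.codec": "snappy",
    "avro.schema": json.dumps({
        "type": "record",
        "name": "Student",
        "namespace": "com.example",
        "doc": "A student record",
        "fields": [{"name": "id", "type": "long"}],
    }),
}


def extract_metadata(header, metadata_keys):
    """Return only the header entries named in a comma-separated key list."""
    wanted = [key.strip() for key in metadata_keys.split(",") if key.strip()]
    return {key: header[key] for key in wanted if key in header}


# 'doc' and 'namespace' are not header keys, so nothing is extracted.
missing = extract_metadata(HEADER, "doc,namespace")

# They live inside the schema JSON, which is reachable via 'avro.schema'.
schema = json.loads(extract_metadata(HEADER, "avro.schema")["avro.schema"])
doc, namespace = schema.get("doc"), schema.get("namespace")
```

In a flow, the equivalent approach would be to extract avro.schema and then parse the resulting JSON in a later step to read doc or namespace.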

Related

Can MLCP read input based on a condition

In MarkLogic, can we use MLCP to export/import/copy data based on a condition?
Example: read only the documents in which the student's subject element contains only maths.
Yes, you can apply the -query_filter option to restrict documents to those matching the filter query.
https://docs.marklogic.com/guide/mlcp/export#id_66898
The -query_filter option accepts a serialized XML cts:query or JSON cts.query as its value.
Controlling What is Exported, Copied, or Extracted
By default, mlcp exports all documents or all documents and metadata in the database, depending on whether you are exporting in document or archive format or copying the database. Several command line options are available to enable customization.
-query_filter - export/copy only documents matched by the specified cts query. You can use this option alone or in combination with a directory, collection or document selector filter.
-directory_filter - export only the documents in the listed database directories. You cannot use this option with -collection_filter or -document_selector.
-collection_filter - export only the documents in the listed collections. You cannot use this option with -directory_filter or -document_selector.
-document_selector - export only documents selected by the specified XPath expression. You cannot use this option with -directory_filter or -collection_filter. Use -path_namespace to define namespace prefixes.

boto DynamoDb query/scan ProjectionExpression syntax?

From the documentation, it says "By default, a Scan returns all of the data attributes for every item; however, you can use the ProjectionExpression parameter so that the Scan only returns some of the attributes, rather than all of them."
I am wondering if anyone knows what's the syntax for using the ProjectionExpression parameter with boto?
For example I have
leagueTable = Table('leagues', schema=[HashKey('leagueId', data_type=NUMBER)])
I want to use the ProjectionExpression parameter to scan the table and only get back the selected field.
According to the documentation at http://docs.pythonboto.org/en/latest/ref/dynamodb2.html#boto.dynamodb2.table.Table.scan , the attributes parameter will allow you to specify a tuple of attributes and only return those attributes in the result set.
However, this uses the AttributesToGet API, instead of the newer ProjectionExpression API you are referring to. ProjectionExpression will allow you to retrieve individual list or map elements. To use ProjectionExpression, you would have to use the low-level API for boto, which matches the low-level DynamoDB API closely. The scan documentation for this can be found at: http://docs.pythonboto.org/en/latest/ref/dynamodb2.html#boto.dynamodb2.layer1.DynamoDBConnection.scan
Hope that helps, good luck!
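As a rough sketch of what the low-level request needs, the helper below builds a ProjectionExpression together with an ExpressionAttributeNames map, using the parameter names from the DynamoDB low-level API. The build_projection helper and the 'name' attribute are hypothetical. How your boto version spells these arguments in its low-level scan call is an assumption to check against the reference linked above.

```python
def build_projection(attributes):
    """Build a ProjectionExpression and its ExpressionAttributeNames map.

    Every attribute gets a '#' placeholder so that attribute names which
    collide with DynamoDB reserved words are still accepted.
    """
    names = {}
    placeholders = []
    for position, attribute in enumerate(attributes):
        placeholder = "#a%d" % position
        names[placeholder] = attribute
        placeholders.append(placeholder)
    return ", ".join(placeholders), names


# 'name' is a hypothetical attribute on the leagues table.
expression, names = build_projection(["leagueId", "name"])

# Parameters named as in the DynamoDB low-level Scan API.
scan_params = {
    "TableName": "leagues",
    "ProjectionExpression": expression,
    "ExpressionAttributeNames": names,
}
```

Using placeholders for every attribute is a design choice that avoids having to know which names are reserved. Plain attribute names also work in a ProjectionExpression when they are not reserved words.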

How to list SolrCloud aliases?

In SolrCloud Collections API (https://cwiki.apache.org/confluence/display/solr/Collections+API), we can list collections using action:
/admin/collections?action=LIST
However, aliases are not included in this list. There is also no corresponding command for aliases (we can only CREATEALIAS or DELETEALIAS). How can I list aliases?
This feature does not seem to be implemented yet: https://issues.apache.org/jira/browse/SOLR-4968
However, you can use this command:
/admin/collections?action=CLUSTERSTATUS
Each collection will be listed together with the aliases that cover it. At the bottom of the XML there is also a separate node summarising all aliases and the collections they cover.
The alias list can also be fetched in JSON format using the following command:
[solr_server_hostname]:8983/solr/zookeeper?detail=true&path=/aliases.json
The "data" field in the returned JSON holds the aliases and the collections they map to.
For Solr 6.6+, you could use:
/solr/admin/collections?action=LISTALIASES
See https://solr.apache.org/guide/6_6/collections-api.html#CollectionsAPI-listaliases
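A small sketch of calling LISTALIASES from Python follows. The host, the helper names, and the sample response body are assumptions for illustration. In particular, the response shape (an "aliases" object mapping each alias to a comma-separated collection list) should be confirmed against your own server's output.

```python
import json
from urllib.parse import urlencode


def list_aliases_url(base_url):
    """Build the Collections API URL for LISTALIASES with JSON output."""
    query = urlencode({"action": "LISTALIASES", "wt": "json"})
    return "%s/admin/collections?%s" % (base_url.rstrip("/"), query)


def parse_aliases(body):
    """Map each alias to the list of collections it covers."""
    aliases = json.loads(body).get("aliases", {})
    return {alias: cols.split(",") for alias, cols in aliases.items()}


# Assumed local Solr instance; adjust host and port as needed.
url = list_aliases_url("http://localhost:8983/solr")

# Illustrative response body, not captured from a real server.
sample_body = (
    '{"responseHeader": {"status": 0},'
    ' "aliases": {"recent": "logs_2017_06,logs_2017_07"}}'
)
parsed = parse_aliases(sample_body)
```

To run it against a live server, the body could be fetched with urllib.request.urlopen(url).read() and passed to parse_aliases.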

How to find index for a field (if any)

I have some indexes in portal_catalog, for various types.
Given a portal_type and a fieldname, how can I find out the name of the index (if any) for that field?
Some relevant pointers to documentation about zcatalog might help me too!
Thanks..
There is no easy one-to-one way to determine this. In Plone 4, there are basically three different ways that an index in the catalog can obtain the information from your content type.
Index configuration
First and foremost, indexes can optionally be configured with the name(s) of the attributes or methods to index on a given object. Indexes generally have a getIndexSourceNames method that'll tell you what items they'll index.
Usually this is the same as the index id, but this is not a given. Generally, if your field accessor is listed in the result of getIndexSourceNames then that index will be indexing that field for a given type:
from Products.CMFCore.utils import getToolByName
catalog = getToolByName(context, 'portal_catalog')
for index in catalog.index_objects():
    # Report every index whose configured sources include the field accessor
    if field.accessor in index.getIndexSourceNames():
        print 'Index %s indexes %s' % (index.getId(), field.getName())
In the above example, I assumed you already have your field object in the variable field, and an actual instance of your type in the variable context.
Custom indexing adapters
When an object is indexed, the catalog also constructs a wrapper around the object and will look up indexing adapters for your object. These adapters are registered both for the indexed name (so the name listed in getIndexSourceNames) and an interface or class. If your content class implements that interface or has an indexing adapter directly registered for its class or a base class, the indexing adapter can be brought into play.
Indexing adapters are arbitrary snippets of code, and thus could call any field on your content object to produce their results. There is no programmatic way for you to determine if a given field on your content type will be used, or if any fields will be used at all.
The CMFPlone.CatalogTool module lists several examples of indexing adapters. These are all registered for Interface, meaning all objects:
allowedRolesAndUsers collects security information about your object.
getObjPositionInParent determines the position of the current object in its container. Thus, this indexer does not need any information from the object itself to determine its value.
sortable_title takes your content Title value and generates a value suitable for sorting catalog search results with. It normalizes the value, lowercases it, and prefixes numbers with leading zeros to make sorting on numbered titles easier.
Direct method access
Fields are basically generated methods on your content object. But your content class can also implement methods on its class. The same remarks as for custom indexing adapters apply here; these are arbitrary Python code so they could be using your content type fields, aggregating and mangling the information before passing it to the index.
The Archetypes BaseObject class (used in all Archetypes content types) defines the SearchableText method for example. This method takes all available fields with the searchable property set to True, tries to get each field value as plain text, and aggregates the results for the SearchableText index.
Conclusion
You can only make educated guesses about index contents as they relate to your fields. By introspecting index configuration, you won't see if there might be a custom indexer adapter masking your field (register a getField index adapter and it'll be used instead of directly calling getField). Custom indexers and class methods can still access your fields and pass on the information to a catalog index.
You just add an index for the method or attribute name that you want to use for the index value. There's nothing too tricky about it, and it can potentially all be done TTW.
If you need a bit more logic to compute the index value, check out this Stack Overflow question: Problem with plone.indexer and Dexterity

Reason for duplicate keys on http post?

Reading about HTTP POST on Wikipedia, I found the statement "This is a format for encoding key-value pairs with possibly duplicate keys." Is this correct, and if so, what is the reasoning? Why would a client ever post duplicate keys, and if a duplicate key is posted, how is the correct corresponding value retrieved on the server side?
To submit multiple values for the same thing.
In PHP, for example, you can name multiple input fields somedata[]. All values of the input boxes are then put in an array named somedata.
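The same idea can be seen outside PHP. As a small sketch using Python's standard library (the field names and values are made up for illustration), a form body with a repeated key is parsed into a list of values for that key:

```python
from urllib.parse import parse_qs, urlencode

# Encode a form body in which the key 'somedata[]' appears twice.
body = urlencode(
    {"somedata[]": ["first", "second"], "single": "x"},
    doseq=True,
)

# Parsing collects every value submitted under the same key into a list,
# so no value is lost and the submission order is preserved.
parsed = parse_qs(body)
values = parsed["somedata[]"]
```

Here every key maps to a list, so the server side does not pick one "correct" value; it receives all of them and decides how to use them.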
