Forms Recognizer - Unable to train forms in subfolder with label file - microsoft-cognitive

We have multiple different forms to train into one model. We created a separate subfolder for each form type and generated a label file for each form in its respective folder. When we train the model with "includeSubFolders" set to true and "useLabelFile" set to true, we get the following error:
{
  "modelInfo": {
    "modelId": "9d63e55b-23a5-43a4-a845-17864b35549d",
    "status": "invalid",
    "createdDateTime": "2020-04-03T11:15:14Z",
    "lastUpdatedDateTime": "2020-04-03T11:15:14Z"
  },
  "trainResult": {
    "averageModelAccuracy": 0.0,
    "errors": [
      {"code": "1001", "message": "Not supported case of IncludeSubFolders and UseLabelFile set to true simultaneously."}
    ]
  }
}
Any idea what causes this error?

As stated in the documentation, you need to direct the API to a sub-folder. Setting includeSubFolders = true together with useLabelFile = true is therefore not supported, which is exactly the error you are receiving.
The documentation states the following:
The labeled data feature has special input requirements beyond those
needed to train a custom model.
Make sure all the training documents are of the same format. If you
have forms in multiple formats, organize them into sub-folders based
on common format. When you train, you'll need to direct the API to a
sub-folder.
Notice:
When you train, you'll need to direct the API to a sub-folder.
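A minimal sketch of a v2.0 training request body that follows this advice, pointing sourceFilter.prefix at one sub-folder and leaving includeSubFolders off. The container URL, SAS token, and sub-folder name below are placeholders, not values from the question:

```python
import json

def build_train_request(sas_url, subfolder_prefix):
    # Direct the API at a single sub-folder via sourceFilter.prefix;
    # includeSubFolders stays False whenever useLabelFile is True.
    return {
        "source": sas_url,
        "sourceFilter": {"prefix": subfolder_prefix, "includeSubFolders": False},
        "useLabelFile": True,
    }

body = build_train_request(
    "https://<account>.blob.core.windows.net/<container>?<sas-token>",  # placeholder SAS URL
    "form-type-a/",  # hypothetical sub-folder name
)
print(json.dumps(body, indent=2))
```

This body would then be POSTed to the service's train-model endpoint once per sub-folder, producing one model per form type.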

Related

Telegraf MQTT consumer with multiple topics and json data

We use Telegraf to connect to an MQTT broker and subscribe to several topics. The data sent through is all JSON, but with different structures.
[[inputs.mqtt_consumer]]
  name_override = "devices"
  topics = [
    "devices/+/control",
  ]
  servers = ["${MQTT_SERVER_URL}"]
  tagexclude = ["host", "topic"]
  data_format = "json"
  json_name_key = ""
  json_time_key = "ts"
  json_time_format = "unix_ms"
  tag_keys = ["site"]
  json_string_fields = ["mode", "is_online"]
Do we need multiple different mqtt_consumer input plugins for different json structures, or can that be handled with the topic parser somehow? I'm struggling to find real world examples for this kind of setup.
Do we need multiple different mqtt_consumer input plugins for different json structures,
It depends on just how different the structure is. If the JSON is relatively flat, then you may not, but if it has different objects defined, I would suggest you use different inputs. Generally, if you have different structures then you probably have different tags and your ultimate time series metric will be different.
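If the payloads do differ structurally, the straightforward setup is one consumer block per topic family, each with its own parser options. A sketch of a second block to sit alongside the one above; the topic and field names here are hypothetical:

```toml
# A second, separate consumer for a topic family whose JSON has a
# different shape; topic and field names below are hypothetical.
[[inputs.mqtt_consumer]]
  name_override = "telemetry"
  topics = [
    "devices/+/telemetry",
  ]
  servers = ["${MQTT_SERVER_URL}"]
  tagexclude = ["host", "topic"]
  data_format = "json"
  json_time_key = "ts"
  json_time_format = "unix_ms"
  tag_keys = ["site"]
  json_string_fields = ["status"]
```

Each block produces its own measurement name (via name_override), so the differing tags and fields end up in separate time series rather than colliding.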

Submitting time data to wikibase with `wbcreateclaim` results in "invalid-snak" error

I am currently trying to populate a wikidata instance via POST requests. For this purpose I use the requests library in Python together with the MediaWiki API.
So far I managed to create claims with different datatypes (like String, Quantity, Wikidata items, Media ...). The general scheme I use is this (with different value strings for each datatype):
import requests

session = requests.Session()
# authenticate and obtain a csrf_token
parameters = {
    'action': 'wbcreateclaim',
    'format': 'json',
    'entity': 'Q1234',
    'snaktype': 'value',
    'property': 'P12',
    'value': '{"time": "+2022-02-19T00:00:00Z", "timezone": 0, "precision": 11, "calendarmodel": "http://www.wikidata.org/entity/Q1985727"}',
    'token': csrf_token,
    'bot': 1,
}
r = session.post(api_url, data=parameters)
print(r.json())
Every attempt to insert data of type time leads to an invalid-snak error (info: "invalid snak data.").
The following changes did not solve the problem:
- submitting the value string as a dictionary value (without the single quotes),
- putting the numeric values into (double) quotes,
- using a local item for the calendarmodel ("http://localhost:8181/entity/Q73"),
- adding before and after keys in the dictionary,
- omitting timezone, precision, calendarmodel, and combinations thereof,
- formatting the time string as 2022-02-19,
- submitting the request with administrator rights (though the error message does not suggest a problem with insufficient user rights).
Do you know, what I'm doing wrong here? What must the value field look like?
I am aware that special libraries and interfaces exist for these tasks, but I want to use the Wikidata API directly with the requests library in Python.
Thank you very much for your help!
Installed software versions:
MediaWiki: 1.36.3
PHP: 7.4.27
MariaDB 10.3.32-MariaDB-1:10.3.32+maria~focal
ICU 67.1
It works if the value string is generated from the dictionary via json.dumps().
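Concretely, the fix is to build the time value as a Python dict and let json.dumps() produce the string, instead of hand-writing the JSON:

```python
import json

# Build the time value as a dict and serialize it with json.dumps();
# hand-written JSON strings are easy to get subtly wrong.
time_value = {
    "time": "+2022-02-19T00:00:00Z",
    "timezone": 0,
    "precision": 11,
    "calendarmodel": "http://www.wikidata.org/entity/Q1985727",
}
value_json = json.dumps(time_value)

# value_json is then passed as the 'value' parameter of wbcreateclaim
print(value_json)
```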

Drupal 8 jsonapi: How to change the structure of returned json relationships included array?

I am sending a request to the jsonapi node endpoint, and using the following parameter to include the "relationship" objects of those nodes for user (post author data) and user picture (avatar):
include=uid,uid.user_picture
This returns to me the json data of the nodes as well as all of the relationship objects lumped together (regardless of type) into the "included" array. The included array ends up looking like this by default:
"included": [
  {
    "type": "user--user",
    "id": "4c717273-f903-4ffa-abe2-b5e5709ad727",
    "attributes": {
      "display_name": "cherylm1234",
      ...etc...
    }
  },
  {
    "type": "file--file",
    "id": "4c717273-f903-4ffa-abe2-b5e5709ad727",
    "attributes": {
      "filename": "Cheryl-Fun.jpg",
      ...etc...
    }
  }
]
Context: We are trying to integrate jsonapi with an iOS app as the client/consumer.
My problem: It's difficult to write data models in the iOS client for these relationships (without looping and sorting every response), since they are all lumped into the same level of the "included" array. Another problem is that each "type" has its own set of attributes/fields, so the iOS data models need to be built per "type".
Question: Is it possible to change this structure so that all included objects are sorted by type?
I would like for the "included" array to be indexed by "type" where all objects of that type would be inside that index. Maybe something like:
included['user--user'] = [
{user-object},
{user-object},
...etc.
],
included['file--file'] = [
{file-object},
{file-object},
....etc.
],
I'm assuming this would require a custom module with some sort of hook where I could loop and sort the data before returning it to the iOS app. But, haven't been able to find any documentation on this type of json response modification.
Any help would be much appreciated.
Drupal implements the JSON:API specification, which defines how a resource must be encoded, whether it is returned as primary data or included as a related resource.
If you want the data returned in another format, you would need to implement your own REST API in Drupal following another specification.
But if this is only about managing the data in a client, I would recommend using one of the existing libraries for the JSON:API specification. A list of implementations is provided at jsonapi.org/implementations/. The JSONAPI Swift package listed on that page seems to be well maintained.
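If the grouping does have to happen client-side, it is only a few lines of work. A sketch of the reshaping (in Python rather than Swift for brevity; the sample resources are made up):

```python
from collections import defaultdict

def group_included_by_type(included):
    # Reshape the flat JSON:API "included" array into a dict keyed by
    # resource type, so each type can be decoded with its own model.
    grouped = defaultdict(list)
    for resource in included:
        grouped[resource["type"]].append(resource)
    return dict(grouped)

included = [
    {"type": "user--user", "id": "1", "attributes": {"display_name": "cherylm1234"}},
    {"type": "file--file", "id": "2", "attributes": {"filename": "Cheryl-Fun.jpg"}},
]
grouped = group_included_by_type(included)
print(sorted(grouped))
```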

Many-to-Many JSON API URI naming

I'm wondering if the following naming convention would correctly fit the JSON API standard, as it is not specifically mentioned anywhere that I can find.
Given a many to many relationship between accounts and products, with an account-product resource storing pivot data between the two.
I would like to know how to handle the /api/v1/accounts/1/products relationship resource.
account > account-product > product
URLs:
/api/v1/accounts: returns account resources
/api/v1/account-products: returns account-product resources
/api/v1/accounts/1/products: returns account-product resources
OR
/api/v1/accounts/1/products: return product resources related to the account
The two arguments here being this:
Option 1: accounts/1/products should return the link between the accounts and the products; the ID should essentially act as a hyphen, i.e. accounts/1/products really means account-products.
Option 2: accounts/1/products should return the products related to the account and include the account-products resource as a mandatory relationship, because the resource named in the URI is product, not account-product.
The JSON:API specification is agnostic about URL design, so it mostly depends on where these URLs are used. The spec does, however, come with some recommendations on URL design; I assume you follow those, especially that /api/v1/accounts/1/products is used as a related resource link, e.g.
{
  "type": "accounts",
  "id": "1",
  "relationships": {
    "products": {
      "links": {
        "related": "/api/v1/accounts/1/products"
      }
    }
  }
}
In that case the spec is quite clear about what should be returned:
Related Resource Links
A “related resource link” provides access to resource objects linked
in a relationship. When fetched, the related resource object(s) are
returned as the response’s primary data.
For example, an article’s comments relationship could specify a link
that returns a collection of comment resource objects when retrieved
through a GET request.
https://jsonapi.org/format/#document-resource-object-related-resource-links
From how you describe your data structure, an account has many account-products, each of which belongs to a product. So it should return the related account-products. You may include the products they belong to by default.
What may confuse you is the concept of intermediate relations like "Has One Through" in some ORMs (e.g. Eloquent). The naming account-products suggests that this might be an example of such. Something like that is not supported by the JSON:API spec. Intermediate relationships should be modeled using a normal resource type. So in your case account-products would be a normal resource type like accounts and products.
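Under that modeling, the related resource link would return account-products as primary data, optionally with the products they belong to in "included". A hypothetical response sketch (all ids and attribute names made up):

```json
{
  "data": [
    {
      "type": "account-products",
      "id": "7",
      "relationships": {
        "product": {
          "data": { "type": "products", "id": "42" }
        }
      }
    }
  ],
  "included": [
    { "type": "products", "id": "42", "attributes": { "name": "Example product" } }
  ]
}
```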

Java API to query CommonCrawl to populate Digital Object Identifier (DOI) Database

I am attempting to create a database of Digital Object Identifier (DOI) found on the internet.
By manually searching the CommonCrawl Index Server I have obtained some promising results.
However I wish to develop a programmatic solution.
This may result in my process only needing to read the index files and not the underlying WARC data files.
The manual steps I wish to automate are these:-
1) For each currently available CommonCrawl index collection:
2) I search "Search a url in this collection: (Wildcards -- Prefix: http://example.com/* Domain: *.example.com)", e.g. link.springer.com/*
3) This returns almost 6 MB of JSON data that contains approx. 22K unique DOIs.
How can I browse all available CommonCrawl indexes instead of searching for specific URLs?
From reading the API documentation for CommonCrawl I cannot see how I can browse all the indexes to extract all DOIs for all domains.
UPDATE
I found this example java code https://github.com/Smerity/cc-warc-examples/blob/master/src/org/commoncrawl/examples/S3ReaderTest.java
that shows how to access a common crawl dataset.
However, when I run it I receive this exception:
"main" org.jets3t.service.S3ServiceException: Service Error Message. -- ResponseCode: 404, ResponseStatus: Not Found, XML Error Message: <?xml version="1.0" encoding="UTF-8"?><Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Key>common-crawl/crawl-data/CC-MAIN-2016-26/segments/1466783399106.96/warc/CC-MAIN-20160624154959-00160-ip-10-164-35-72.ec2.internal.warc.gz</Key><RequestId>1FEFC14E80D871DE</RequestId><HostId>yfmhUAwkdNeGpYPWZHakSyb5rdtrlSMjuT5tVW/Pfu440jvufLuuTBPC25vIPDr4Cd5x4ruSCHQ=</HostId></Error>
In fact, every file I try to read results in the same error. Why is that?
What are the correct Common Crawl URIs for their datasets?
The data set location changed more than a year ago; see the announcement. However, many examples and libraries still contain the old pointers. You can access the index files for all crawls back to 2013 at s3://commoncrawl/cc-index/collections/CC-MAIN-YYYY-WW/indexes/cdx-00xxx.gz - replace YYYY-WW with the year and week of the crawl, and expand xxx to 000-299 to get all 300 index parts. New crawl data is announced on the Common Crawl group, or read more about how to access the data.
To get the example code to work replace lines 24 and 25 with:
String fn = "crawl-data/CC-MAIN-2013-48/segments/1386163035819/warc/CC-MAIN-20131204131715-00000-ip-10-33-133-15.ec2.internal.warc.gz";
S3Object f = s3s.getObject("commoncrawl", fn, null, null, null, null, null, null);
Also note that the Common Crawl group has an updated example.
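The index key layout described above can be generated programmatically, which is handy when iterating over all 300 parts of a crawl. A minimal sketch that only builds the key names, assuming the path pattern quoted in the answer (the crawl id is just an example):

```python
def cdx_index_keys(crawl_id, parts=300):
    # Build the S3 object keys of all cdx index parts for one crawl,
    # following the cc-index/collections/.../indexes/cdx-00xxx.gz pattern.
    return [
        f"cc-index/collections/{crawl_id}/indexes/cdx-{i:05d}.gz"
        for i in range(parts)
    ]

keys = cdx_index_keys("CC-MAIN-2013-48")
print(keys[0])
print(keys[-1])
```

Each key can then be fetched from the commoncrawl bucket and scanned line by line for DOIs, without touching the much larger WARC files.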
