Java API to query CommonCrawl to populate Digital Object Identifier (DOI) Database - web-scraping

I am attempting to create a database of Digital Object Identifier (DOI) found on the internet.
By manually searching the CommonCrawl Index Server manually I have obtained some promising results.
However I wish to develop a programmatic solution.
This may result in my process only requiring to read the index files and not the underlying WARC data files.
The manual steps I wish to automate are these:-
1). for each CommonCrawl Currently available index collection(s):
2). I search ... "Search a url in this collection: (Wildcards -- Prefix: http://example.com/* Domain: *.example.com) " e.g. link.springer.com/*
3). this returns almost 6MB of json data that contains approx 22K unique DOIs.
How can I browse all available CommonCrawl indexes instead of searching for specific URLs?
From reading the API documentation for CommonCrawl I cannot see how I can browse all the indexes to extract all DOIs for all domains.
UPDATE
I found this example java code https://github.com/Smerity/cc-warc-examples/blob/master/src/org/commoncrawl/examples/S3ReaderTest.java
that shows how to access a common crawl dataset.
However when I run it I receive this exception
"main" org.jets3t.service.S3ServiceException: Service Error Message. -- ResponseCode: 404, ResponseStatus: Not Found, XML Error Message: <?xml version="1.0" encoding="UTF-8"?><Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Key>common-crawl/crawl-data/CC-MAIN-2016-26/segments/1466783399106.96/warc/CC-MAIN-20160624154959-00160-ip-10-164-35-72.ec2.internal.warc.gz</Key><RequestId>1FEFC14E80D871DE</RequestId><HostId>yfmhUAwkdNeGpYPWZHakSyb5rdtrlSMjuT5tVW/Pfu440jvufLuuTBPC25vIPDr4Cd5x4ruSCHQ=</HostId></Error>
In fact every file I try to read results in the same error. Why is that?
what is the correct common crawl uri's for their datasets?

The data set location has changed since more than one year, see announcement. However, many examples and libraries still contain the old pointers. You can access the index files for all crawls back to 2013 on s3://commoncrawl/cc-index/collections/CC-MAIN-YYYY-WW/indexes/cdx-00xxx.gz - replace YYYY-WW with year and week of the crawle and expand xxx to 000-299 to get all 300 index parts. New crawl data is announced on the Common Crawl group, or read more about how to access the data.

To get the example code to work replace lines 24 and 25 with:
String fn = "crawl-data/CC-MAIN-2013-48/segments/1386163035819/warc/CC-MAIN-20131204131715-00000-ip-10-33-133-15.ec2.internal.warc.gz";
S3Object f = s3s.getObject("commoncrawl", fn, null, null, null, null, null, null);
Also note that the commoncrawl group have an updated example.

Related

Recursive, Non-Dynamic (Refreshable) Web API via Power BI

I am trying to write a recursive web API call in PBI to collect all 27,515 records, the oDATA feed has a limit of 1,000 rows. I need this data to be refreshable in the PBI service, therefore these 28 requests via M code cannot be formulated in a dynamic way. PBI only allows for static or non-dynamic sources for refresh within the service. Below, I will share two pieces of M code, 1. one that is considered to be a dynamic data source (not what I need, but pulls all 27,515 records correctly) and 2. one that is a static data source (which is giving an incorrect number of 19,000 records, but is the type of data source that I need for this refreshing problem).
Noteworthy: Upon initial API call I receive a table named table "d" (in the photo below) with two rows one row it titled "results" which contains all of the data (1,000 rows) I need per request, the second row is titled "__next" which has the next API URL with an embedded skiptoken from the current calls worth of data. This skiptoken tells the API which rows to skip so that the next request doesn't deliver the data we have already collected.
Table d, Initial Table
M Code for Dynamic Data Source: This dynamic data source is pulling the correct number of records in 28 requests (up to 1,000 records per request) totaling 27,515 rows.
= List.Generate( ()=> Json.Document(Web.Contents("https://my_instance/odata/v2/Table?$format=JSON&$paging=snapshot"))[d],
each Record.HasFields(_, "results")= true,
each try Json.Document(Web.Contents(_[__next]))[d] otherwise [df=[__next="dummy_variable"]])
M Code for Static Data Source: This static data source is the type that I need for refreshing in PBI service (I confirmed it does refresh in the service), but is returning an incorrect number of rows, 19,000 versus 27,515. This code is calling 19 requests versus the needed 28 requests. I believe the error lies in the Query portion where I am attempting to call the next API URL with the skiptoken from the previous request.
= List.Generate( ()=> Json.Document(Web.Contents("https://my_instance/odata/v2/Table?$format=JSON&$paging=snapshot"))[d],
each Record.HasFields(_, "results")= true,
each try Json.Document(Web.Contents("https://my_instance/odata/v2/Table?$format=JSON&$paging=snapshot", [Query=[q=_[__next]]]))[d] otherwise [df=[__next="dummy_variable"]])
Does anyone see an error in the static code for iteratively calling each new request in the table [d] which has rows labeled [results] (all the data) and another row labeled [__next] which has the next URL with the skiptoken from the previous API call.
To be clear, in Web.Contents the url must be static, but you can freely use dynamic components in the RelativePath optional option argument (as in this simple example function) which is how you can generate dynamic web API queries that work in the service without the error you are seeing w.r.t. dynamic queries:
(current_page as text) =>
let
data = Web.Contents(
"https://my_instance/api/v2/endpoint", // static!
[
RelativePath = "?page="&current_page // dynamic!
]
)
in
data
So if you can split out the relative path of your _next parameter and feed it into such a function it will be OK for automatic refreshes in the Power BI service.

Submitting time data to wikibase with `wbcreateclaim` results in "invald-snak" error

I am currently trying to populate a wikidata instance via POST requests. For this purpose I use the requests library in Python together with the MediaWiki API.
So far I managed to create claims with different datatypes (like String, Quantity, Wikidata items, Media ...). The general scheme I use is this (with different value strings for each datatype):
import requests
session = requests.Session()
# authenticate and obtain a csrf_token
parameters = {
'action': 'wbcreateclaim',
'format': 'json',
'entity': 'Q1234',
'snaktype': 'value',
'property': 'P12',
'value': '{"time": "+2022-02-19T00:00:00Z", "timezone": 0, "precision": 11, "calendarmodel": "http://www.wikidata.org/entity/Q1985727"}',
'token': csrf_token,
'bot': 1,
}
r = session.post(api_url, data=parameters)
print(r.json())
Every attempt to insert data of type time leads to an invalid-snak error (info: "invalid snak data.").
The following changes did not solve the problem:
submitting the value string as dictionary value (without the single quotes),
putting the numeric values into (double) quotes,
using a local item for the calendarmodel ("http://localhost:8181/entity/Q73"),
adding before and after keys in the dictionary,
omitting timezone, precision, calendarmodel and combinations thereof,
formatting the time string as 2022-02-19,
submitting the request with administrator rights (though the error message does not
suggest a problem with insufficient user rights).
Do you know, what I'm doing wrong here? What must the value field look like?
I am aware of the fact that special libraries or interfaces for exist for these tasks. But I do want to use the Wikidata API directly with the requests library in Python.
Thank you very much for your help!
Installed software versions:
MediaWiki: 1.36.3
PHP: 7.4.27
MariaDB 10.3.32-MariaDB-1:10.3.32+maria~focal
ICU 67.1
It works if the value string is generated from the dictionary via json.dumps().

How to overwrite sqlite database for existing users in a Xamarin Forms app?

We have a Xamarin Forms app.
We have a few existing users which had been using the app since last year where the database file name is MyAppName.db
Our new app strategy requires us to include the .db (Sqlite) along with the app to ensure we can include persistent information and does not require internet when you install the app, meaning we hope to overwrite the db file for existing users.
After we publish this new change where the database file is now MyNewDbFile.db, our users complain that they do not see any data, the app does not crash tough.
We capture error reports and we can see a common error in the our tool stating about "row value missused", a quick search indicates the value are not present and so the "SELECT * FROM TABLE WHERE COLUMNAME" query does not work causing the exception.
We can confirm we are using
Xamarin.Forms version 4.5.x
sqlite-net-sqlcipher version 1.6.292
We do not have any complex logic and cannot see a very specific reason causing this as this is not produced when we test our apps from the Alpha channel or the TestFlight.
We investigated the heart of the problem and can confirm that the troublemaker is the device Culture information.
We have been testing with English as the device language, however the minute the app is used by folks with any other language except English the query causes exception.
We make use of the Xamarin Essentials library to get device location (which is optional).
The latitude and longitude returned are culture specific.
So if you are using a Spanish language on device, the format for the Latitude and Longitude information is a , (comma) and NOT a . (dot).
That is, a Latitude value is something similar to 77.1234
where it is separated by a . (dot)
However for users with different region settings on device, the format would change to
77,1234 as value where it is separated by a , (comma).
The query requires us to have the values as string and that when not separated by a . (dot) fails to execute:
Query:
if (deviceLocation != null)
queryBuilder.Append($" ORDER BY((Latitude - {deviceLocation.Latitude})*(Latitude - {deviceLocation.Latitude})) +((Longitude - {deviceLocation.Longitude})*(Longitude - {deviceLocation.Longitude})) ASC");
Where the deviceLocation object is of type Xamarin.Essentials.Location which has double Latitude and Longitude values but when used as a string or even if you explicitly do deviceLocation.Latitude.ToString(), it would return the formatted values as per the device CultureInfo.
We can confirm that the problem is due to a mismatch of the format type and causing the query to throw an exception and in return making the experience as if there is no data.

Why does the url property key in Firebase snapshot keep changing?

I have not seen any discussion or awareness so far that Firebase does in fact make available a unique identifier--in fact the full URL--to each specific data record via their "snapshot" which they return, i.e. the wrapper around the data record (accessed via snapshot.val()). By doing a basic property examination of the snapshot I discovered that the unique URL is available (see examples below). However, it seems that, for some reason, Firebase keeps changing the name of the key every few days, causing my application to break. I have to go in and re-discover the new URL property key and change it so that it will work again.
Here are three examples of how I have seen the key change so far. Each value is the same, but the key keeps changing over time (i.e.: "Wb", "Xb", "bc").:
getMemberBySnapshot - snapshot has prop Wb with value https://prototype1.firebaseio.com/users/-IwohKfw1l5F3gFqyJJ5
getMemberBySnapshot - snapshot has prop Xb with value https://prototype1.firebaseio.com/users/-IwohKfw1l5F3gFqyJJ5
getMemberBySnapshot - snapshot has prop bc with value https://prototype1.firebaseio.com/users/-IwohKfw1l5F3gFqyJJ5
I have read Firebase's suggestions that developers should use an email address if they want a unique key (what if my model does not use an email field? What if a user wants to change their email?), or Firebase suggests altenatively to retrieve all existing records and then search through them on the client. Neither of these solutions are satisfying. But I'm seeing that they do provide the unique URL to each data record in the 'snapshot'. Why do they not provide a stabilized key so that a developer can call it consistently???
Firebase.js is a compiled script. The names of internal variables will change every time we compile it and release a new version, so you should definitely not be relying on any properties that are not documented on our website.
For your specific case, you should be using:
snapshot.ref().toString()
in order to get the URL.

StructureGroup Details using the Content Delivery/Broker API

I am trying to get all the structure groups published in a given publication using the PublicationID. I am expecting to get the structure groups with StructureGroupCriteria by passing the Root Structure Group TCM ID but getting page ids (I am expecting SGs).
Now I am trying to loop through the list and get details of each structuregroup. I did not find any API (.net) to get these details and also the API is returning only Pages.
What I have done and working so far using StructureGroupCriteria, returns list of Page IDs instead of SG IDs
PublicationCriteria pubCriteria = new PublicationCriteria(pubID);
// Root StructureGroup TCM ID -- tcm:45-3-4
StructureGroupCriteria sgCriteria = new StructureGroupCriteria("tcm:45-3-4", true);
Criteria allSGsInPub = CriteriaFactory.And(pubCriteria, sgCriteria);
Query allSGs = new Query(allSGsInPub);
string[] sgInfo = allSGs.ExecuteQuery();
Response.Write("Total : " + sgInfo.Length);
foreach (string sgid in sgInfo ) {
// HOW DO I get the Structure Group Details here
//TCMURI sgURI = new TCMURI(sgid);
}
Q # 1 : How to get the all the structuregroups and individual structure group details? (May be something simple, I am not able to find right API).
Q # 2 : How can I get all the structuregroups using ItemTypeCriteria sgCriteria = new ItemTypeCriteria(4); // 4 is SG Item Type .
When I tried this option, the query worked successfully but no results returned. Is this the expected behavior and should we always use StructureGroupCriteria instead of ItemTypeCriteria?
The reason for this approach, I want to avoid using the Root StructureGroup ID which is required with the above code. But at the moment, none of the approaches returning StructureGroup information and I always get Page Information.
Tridion Version: 2011 SP1, .net API.
Note: When I publish I am checking the publish SG info checkbox and published successfully. On Broker DB side, I can see the information on the taxnonomy table as well.
I was playing with Odata service and accidentally I found that I can get all my structure group information from Odata web service.
/cd_webservice/odata.svc/StructureGroups?$filter=PublicationId%20eq%2045
Also, the results are returning child structure groups with a depth parameter.
Just to clarify , using Broker API it is not feasible to get the structure groups (my original question). However, the workaround solution is to use OData Service to get the Structure Groups.
I don't think you will get Structure Groups returned by the Query object.
According to the documentation, when you publish Structure Group information the Structure Group hierarchy is published to the Content Delivery side where it is stored as a taxonomy.
Have you tried using the Taxonomy APIs to get the information you need?

Resources