CKAN: Harvest special data sets - opendata

I want to import a specific set of data sets into my CKAN instance. With the CKAN harvester (http://docs.ckan.org/en/latest/harvesting.html) I am able to harvest from another CKAN instance, but I don't need all of its data sets.
Is it possible to harvest only specific data sets by their id?

Not without writing some code.
You could add a filter to the harvester. gather_stage() [1] is where it asks the remote CKAN for the latest edited packages (datasets) and creates a job for each one. Then fetch_stage() [2] runs for each of those jobs to download the package before it gets imported. You could add a filter in the fetch_stage, or alternatively change the gather_stage to ask for only a subset of packages; a rough sketch of such a subclass follows the links below.
[1] https://github.com/okfn/ckanext-harvest/blob/2.0-dataset-sources/ckanext/harvest/harvesters/ckanharvester.py#L136
[2] https://github.com/okfn/ckanext-harvest/blob/2.0-dataset-sources/ckanext/harvest/harvesters/ckanharvester.py#L199
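A rough, untested sketch of that kind of filter, assuming a plugin that subclasses CKANHarvester; WANTED_IDS and the choice to hook into fetch_stage are illustrative assumptions, and the harvester internals differ between ckanext-harvest versions:

import json
from ckanext.harvest.harvesters.ckanharvester import CKANHarvester

# Hypothetical whitelist of remote dataset ids; a real plugin would more
# likely read this from the harvest source configuration.
WANTED_IDS = {'dataset-id-1', 'dataset-id-2'}

class FilteredCKANHarvester(CKANHarvester):

    def fetch_stage(self, harvest_object):
        # Let the stock harvester download the remote dataset first.
        if not super(FilteredCKANHarvester, self).fetch_stage(harvest_object):
            return False
        # harvest_object.content now holds the dataset as JSON; returning
        # False here drops anything that is not on the whitelist, so it
        # never reaches the import stage.
        package_dict = json.loads(harvest_object.content)
        return package_dict.get('id') in WANTED_IDS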

As of today, 2016-06-06, this is not built in yet, but there is an open issue - Allow filtering of remote datasets to be harvested #155 - requesting exactly what you want.
On a side note, the CKAN Harvester option to include/exclude organizations #169 was merged on 2015-10-27, but as its title says, it only added organizations_filter_include and organizations_filter_exclude.

Related

How to import data online to Grakn

Is it possible to update and insert newly added data into Grakn in an online manner?
I have read this tutorial, https://dev.grakn.ai/docs/query/updating-data, but I could not find the answer there.
Thanks.
It would help to define what you mean by "online" - if it means what I think it does (keeping valid data/queryable data as you load), you should be able to see newly loaded data as soon as a transaction commits.
So, you can load data (https://dev.grakn.ai/docs/query/insert-query), and on commit you can see the committed data, and you can modify data (your link), which can modify committed data.
In general, you want to load data in many transactions, which allows you to see in-progress data sets being loaded in an "online" manner.
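For a concrete picture, here is a minimal sketch using the Grakn Python client (assuming the 1.x client API, a locally running server, and a keyspace with a person type; all names are illustrative):

from grakn.client import GraknClient

with GraknClient(uri="localhost:48555") as client:
    with client.session(keyspace="social_network") as session:
        # Each committed write transaction becomes visible to transactions
        # opened afterwards, which is what makes incremental loading "online".
        with session.transaction().write() as tx:
            tx.query('insert $p isa person, has name "alice";')
            tx.commit()
        # A read transaction opened after the commit sees the new data.
        with session.transaction().read() as tx:
            answers = list(tx.query('match $p isa person; get;'))
            print(len(answers))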

Making Glue delete source data after a job

AWS Glue is great for transforming data from a raw form into whichever format you need, and keeping the source and destination data sets synchronized.
However, I have a scenario where data lands in a 'landing area' bucket from untrusted external sources, and the first ETL step needs to be a data validation step which only allows valid data to pass to the data lake, while non-valid data is moved to a quarantine bucket for manual inspection.
Non-valid data includes:
bad file formats/encodings
unparseable contents
mismatched schemas
even some sanity checks on the data itself
The 'landing area' bucket is not part of the data lake; it is only a temporary dead drop for incoming data, so I need the validation job to delete the files from this bucket once it has moved them to the lake and/or quarantine buckets.
Is this possible with Glue? If the data is deleted from the source bucket, won't Glue end up removing it downstream in a subsequent update?
Am I going to need a different tool (e.g. StreamSets, NiFi, or Step Functions with AWS Batch) for this validation step, and only use Glue once the data is in the lake?
(I know I can set lifecycle rules on the bucket itself to delete the data after a certain time, like 24 hours, but in theory this could delete data before Glue has processed it, e.g. in case of a problem with the Glue job)
Please see purge_s3_path in the docs:
glueContext.purge_s3_path(s3_path, options={}, transformation_ctx="")
Deletes files from the specified Amazon S3 path recursively.
Also, make sure your AWSGlueServiceRole has s3:DeleteObject permissions.
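For example (the bucket and prefix are made up, and the retentionPeriod behaviour is worth double-checking against the Glue docs for your version):

# Remove everything under the landing prefix once the job has finished.
# retentionPeriod is in hours; 0 means even very recent files are deleted.
glueContext.purge_s3_path(
    "s3://my-landing-bucket/incoming/",
    options={"retentionPeriod": 0},
    transformation_ctx="purge_landing",
)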
Your Glue environment comes with boto3. You would be better off using the boto3 S3 client/resource to delete the landing files after you've completed processing the data via Glue.
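A minimal sketch of that approach (the bucket name and prefix are hypothetical):

import boto3

s3 = boto3.resource("s3")
landing_bucket = s3.Bucket("my-landing-bucket")

# Delete every object under the prefix the job has just finished processing.
landing_bucket.objects.filter(Prefix="incoming/2019-05-01/").delete()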

Delete a Google Storage folder including all versions of objects inside

Hi and thanks in advance. I want to delete a folder from Google Cloud Storage, including all the versions of all the objects inside. That's easy when you use gsutil from your laptop (you can just use the folder name as a prefix and pass the flag to delete all versions/generations of each object)...
...but I want this in a script that is triggered periodically (for example, while I'm on holiday). My current ideas are Apps Script and Google Cloud Functions (or Firebase Functions). The problem is that in these cases I don't have an interface as powerful as gsutil; I have to use the REST API, so I cannot say something like "delete everything with this prefix", nor "all the versions of this object". Thus the best I can do is:
a) List all the objects given a prefix. So for the prefix "myFolder" I receive:
myFolder/obj1 - generation 10
myFolder/obj1 - generation 15
myFolder/obj2 - generation 12
... and so on for hundreds of files and at least 1 generation/version per file.
b) For each file generation, delete it by giving the complete object name plus its generation.
As you can see, that seems like a lot of work. Do you know a better alternative?
Listing the objects you want to delete and deleting them is the only way to achieve what you want.
The only alternative is to use Lifecycle, which can delete objects for you automatically based on conditions, if those conditions satisfy your requirements.
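For the listing-and-deleting approach, here is a minimal sketch with the google-cloud-storage Python client (for example inside a Cloud Function; the bucket name is made up):

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")

# versions=True makes the listing include every generation of every object
# under the prefix, not just the live ones.
for blob in client.list_blobs(bucket, prefix="myFolder/", versions=True):
    # Passing the generation deletes that specific version.
    bucket.delete_blob(blob.name, generation=blob.generation)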

Adding and updating data in R packages

I'm writing my own R package to carry out some specific analyses for which I make a bunch of API calls to get some data from some websites. I have multiple keys for each API and I want to cycle them for two reasons:
Ensure I don't go over my daily limit
Depending on who is using the package, different keys may be used
All my keys are stored in a .csv file, api_details.csv. This file is read by a function that gets the latest usage statistics and returns the key with the most calls available. I can add the .csv file to the package/data folder and it is available when the package is loaded, but this presents two problems:
The .csv file is not read properly: all column names are pasted together into a single variable name, and all values are pasted together into a single observation per row.
As I continue working, I would like to add more keys (and perhaps more details about the keys) to the api_details.csv but I'm not sure about how to do that.
I can save the details as an .RData file but I'm not sure about how it would be updated or read outside of R (by other people). Using .csv means that anyone using the package can easily add/remove some keys.
What's the best method to address the two problems above?

Can Graphite (whisper) metrics be renamed?

I'm preparing to refactor some Graphite metric names, and would like to be able to preserve the historical data. Can the .wsp files be renamed (and possibly moved to new directories if the higher level components change)?
Example: group.subgroup1.metric is stored as:
/opt/graphite/storage/whisper/group/subgroup1/metric.wsp
Can I simply stop loading data and move metric.wsp to metricnew.wsp?
Can I move metric.wsp to whisper/group/subgroup2/metric.wsp?
Yes.
The storage architecture is pretty flexible. Rename/move/delete away; just make sure you update your storage-schemas and aggregation settings for the new location/pattern.
More advanced use cases, like merging into existing whisper files, can get tricky, but they can also be done with the help of the included scripts. The repository below contains an overview of the whisper scripts included. Check it out:
https://github.com/graphite-project/whisper
That said, it sounds like you don't already have existing data in the new target location, so you can just move the files.
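As a minimal sketch of the move itself (assuming carbon is stopped, or the metric is no longer receiving data, so nothing writes to the old path mid-move):

import os
import shutil

old = "/opt/graphite/storage/whisper/group/subgroup1/metric.wsp"
new = "/opt/graphite/storage/whisper/group/subgroup2/metric.wsp"

os.makedirs(os.path.dirname(new), exist_ok=True)  # create the target directory if needed
shutil.move(old, new)

If a file already exists at the target path, whisper-merge.py from the repository above is the script to look at instead.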
