Bulk export of Alfresco content

We are planning to export a large number of folders (sites) from Alfresco to a local disk.
I have been going through a lot of similar questions and tutorials, but I can't seem to understand how to initiate a download using the REST API.
This is my first time using it; can I get a step-by-step approach on how to tackle this?

Well there are many ways to download content out of Alfresco. If you have not already, I suggest looking at http://api-explorer.alfresco.com to understand the REST API.
You can download any object in Alfresco if you know its node reference. For example, suppose I have a file named test-0.txt and its node reference is as follows:
workspace://SpacesStore/0e61aa25-d181-4465-bef4-783932582636
I could use the REST API to download it, like this:
http://localhost:8080/alfresco/api/-default-/public/alfresco/versions/1/nodes/0e61aa25-d181-4465-bef4-783932582636/content
So, one strategy would be to traverse the nodes you want to export and then invoke that URL to download them.
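For example, here is a rough sketch (not production code) of that strategy in bash with curl and jq. It walks a folder with the v1 REST API's /nodes/{id}/children endpoint and downloads each file's content. It ignores result paging (skipCount/maxItems), so very large folders would need more work, and the base URL, credentials and starting node ID are placeholders you would replace:
#!/usr/bin/env bash
# Sketch: recursively export a folder from Alfresco using the v1 REST API.
# Requires jq; ignores paging, so large folders need skipCount/maxItems handling.
BASE='http://localhost:8080/alfresco/api/-default-/public/alfresco/versions/1'
AUTH='admin:admin'   # replace with real credentials

download_folder() {
  local folder_id="$1" target_dir="$2"
  mkdir -p "$target_dir"
  curl -s -u "$AUTH" "$BASE/nodes/$folder_id/children" \
    | jq -r '.list.entries[].entry | "\(.id)\t\(.isFolder)\t\(.name)"' \
    | while IFS=$'\t' read -r id is_folder name; do
        if [ "$is_folder" = "true" ]; then
          download_folder "$id" "$target_dir/$name"
        else
          curl -s -u "$AUTH" "$BASE/nodes/$id/content" -o "$target_dir/$name"
        fi
      done
}

download_folder '<starting-folder-node-id>' './export'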
Starting in Alfresco 5.2.1, Alfresco added a new endpoint called downloads. With it, you can request a download consisting of an arbitrary number of node references. So, if I have the following files:
test-0.txt: workspace://SpacesStore/0e61aa25-d181-4465-bef4-783932582636
test-1.txt: workspace://SpacesStore/6bdac77f-8499-4be3-9228-9aabf80ba3e3
test-2.txt: workspace://SpacesStore/a6861c8f-8444-4bce-87a2-191c56b6ec7c
test-3.txt: workspace://SpacesStore/118121e9-bd92-4dec-9de7-062e374e5fb5
I could ask Alfresco to create a download object (the actual content will be in ZIP format) consisting of all four of those files, like this:
curl --location --request POST 'http://localhost:8080/alfresco/api/-default-/public/alfresco/versions/1/downloads' \
--header 'Content-Type: application/json' \
--header 'Authorization: Basic YWRtaW46YWRtaW4=' \
--data-raw '{
  "nodeIds": [
    "0e61aa25-d181-4465-bef4-783932582636",
    "6bdac77f-8499-4be3-9228-9aabf80ba3e3",
    "a6861c8f-8444-4bce-87a2-191c56b6ec7c",
    "118121e9-bd92-4dec-9de7-062e374e5fb5"
  ]
}'
Alfresco will respond with something like:
{
  "entry": {
    "filesAdded": 0,
    "bytesAdded": 0,
    "totalBytes": 0,
    "id": "91456d9a-ed9e-493a-9efa-a1e49fbb578b",
    "totalFiles": 0,
    "status": "PENDING"
  }
}
Notice the status of PENDING. Alfresco is asynchronously building the ZIP we asked for. You can check on it by doing a GET on the download object, like:
http://localhost:8080/alfresco/api/-default-/public/alfresco/versions/1/downloads/91456d9a-ed9e-493a-9efa-a1e49fbb578b
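For example, with curl and the same Basic auth header as before:
curl --location --request GET 'http://localhost:8080/alfresco/api/-default-/public/alfresco/versions/1/downloads/91456d9a-ed9e-493a-9efa-a1e49fbb578b' --header 'Authorization: Basic YWRtaW46YWRtaW4='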
Once the status comes back as DONE, you can download the ZIP Alfresco prepared for you. Remember the node endpoint from the start of this post? It works here too. Just use the download ID in place of the node reference, like:
curl --location --request GET 'http://localhost:8080/alfresco/api/-default-/public/alfresco/versions/1/nodes/91456d9a-ed9e-493a-9efa-a1e49fbb578b/content' --header 'Authorization: Basic YWRtaW46YWRtaW4='
So, rather than individually downloading every object you are trying to export, you could batch them up and download multiple objects compressed as a ZIP.
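If it helps, here is a minimal sketch of that whole flow in bash. It assumes jq is installed, reuses the example node IDs and admin credentials from above, and does no error handling (a CANCELLED download would loop forever):
# Create the download, poll until it is DONE, then fetch the resulting ZIP.
BASE='http://localhost:8080/alfresco/api/-default-/public/alfresco/versions/1'
AUTH='admin:admin'

DOWNLOAD_ID=$(curl -s -u "$AUTH" -H 'Content-Type: application/json' \
  -d '{"nodeIds":["0e61aa25-d181-4465-bef4-783932582636","6bdac77f-8499-4be3-9228-9aabf80ba3e3"]}' \
  "$BASE/downloads" | jq -r '.entry.id')

until [ "$(curl -s -u "$AUTH" "$BASE/downloads/$DOWNLOAD_ID" | jq -r '.entry.status')" = "DONE" ]; do
  sleep 2
done

curl -s -u "$AUTH" "$BASE/nodes/$DOWNLOAD_ID/content" -o export.zip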
If you don't want to do it with straight REST, you might also consider using CMIS. You can get a client library for your preferred language from the Apache Chemistry project.
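For instance, Alfresco exposes a CMIS 1.1 browser binding at /alfresco/api/-default-/public/cmis/versions/1.1/browser, and you can run CMIS queries against it with plain curl. The cmisaction/statement parameter names below come from the CMIS 1.1 browser binding, so double-check them against the spec and your Alfresco version before relying on this:
# Hedged sketch: run a CMIS query through the browser binding
curl -s -u admin:admin \
  'http://localhost:8080/alfresco/api/-default-/public/cmis/versions/1.1/browser' \
  --data-urlencode 'cmisaction=query' \
  --data-urlencode 'statement=SELECT * FROM cmis:document'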

Related

Unable to Create Application through REST API

Normally we are able to work with the REST APIs related to an application, since the application has a method that lets us create a JWT token for authentication.
But we are unable to create an application, and we don't understand where we can get the token that authorizes us to create one.
Let me explain step by step how to do that.
1. Open the file {AMS_INSTALL_DIR}/webapps/root/WEB-INF/web.xml and replace the following line
<filter-class>io.antmedia.console.rest.AuthenticationFilter</filter-class>
with this one
<filter-class>io.antmedia.console.rest.JWTServerFilter</filter-class>
2. Open the file {AMS_INSTALL_DIR}/conf/red5.properties and replace the following lines
server.jwtServerControlEnabled=false
server.jwtServerSecretKey=
with these (you can use any 32-character alphanumeric key):
server.jwtServerControlEnabled=true
server.jwtServerSecretKey=WRITE_YOUR_32_CHARACTER_SECRET_KEY
For this sample we use cizvvh7f6ys0w3x0s1gzg6c2qzpk0gb9 as the secret key.
3. Restart the service:
sudo service antmedia restart
4. Generate a JWT token. There are plenty of libraries you can use to do this programmatically; the easiest way for now is the JWT Debugger. Our generated token is eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.e30.tA6sZwz_MvD9Nocf3Xv_DXhJaeTNgfsHPlg3RHEoZRk
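If you prefer to generate the token from a script instead of the JWT Debugger, here is a minimal bash sketch using openssl. The empty {} payload mirrors the token above, and the b64url helper is just a local convenience function of mine:
# Sketch: build an HS256 JWT with an empty payload using openssl
SECRET='cizvvh7f6ys0w3x0s1gzg6c2qzpk0gb9'   # the sample secret from step 2

b64url() { openssl base64 -A | tr '+/' '-_' | tr -d '='; }

HEADER=$(printf '%s' '{"alg":"HS256","typ":"JWT"}' | b64url)
PAYLOAD=$(printf '%s' '{}' | b64url)
SIGNATURE=$(printf '%s.%s' "$HEADER" "$PAYLOAD" \
  | openssl dgst -binary -sha256 -hmac "$SECRET" | b64url)

echo "$HEADER.$PAYLOAD.$SIGNATURE"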
5. Make the call to Create Application as follows:
curl -X POST -H "Content-Type: application/json" -H "Authorization:eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.e30.tA6sZwz_MvD9Nocf3Xv_DXhJaeTNgfsHPlg3RHEoZRk" "https://ovh36.antmedia.io:5443/rest/v2/applications/testapp"
The result should be something like {"success":true,"message":null,"dataId":null,"errorId":0}
The app should be created in a couple of seconds. You can get the list of applications with the following command:
curl -X GET -H "Content-Type: application/json" -H "Authorization:eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.e30.tA6sZwz_MvD9Nocf3Xv_DXhJaeTNgfsHPlg3RHEoZRk" "https://ovh36.antmedia.io:5443/rest/v2/applications"
References:
Web Panel REST Methods
Web Panel REST Methods JWT Documentation

Anti-Scraping bypass?

Hello,
I'm working on a scraper for this page: https://www.dirk.nl/
I'm trying to get the 'row-wrapper' div class in the Scrapy shell.
If I enter response.css('row-wrapper'), it gives me some random results; I think an anti-scraping system is involved. I need the hrefs from this class.
Any opinions on how I can move forward?
We would need a bit more data, like the response you receive and any code, if it's already set up.
But from the looks of it, it could be multiple things (from a 429 response blocking the request because of rate limiting, to the site's internal API XHR causing data not to be rendered on page load, etc.).
Before fetching any website for scraping purposes, try curl, Postman or Insomnia to see what type of response you are going to receive. Some servers and website architectures require certain cookies and headers while others don't. You simply have to do this research so you can make your scraping workflow efficient.
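For example, a quick first check with curl that discards the body and shows only the status line and response headers (the User-Agent value here is just an arbitrary browser-like string):
# Discard the body, print the status line and headers
curl -sS -o /dev/null -D - 'https://www.dirk.nl/' -H 'User-Agent: Mozilla/5.0'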
I ran curl https://www.dirk.nl/ and it returned data generated by the Nuxt framework. In this case that data is unusable, since Nuxt uses its own functionality to render it.
Instead, the best solution would be not to scrape the HTML but to get the data from the content API.
Something like this:
curl 'https://content-api.dirk.nl/misc/specific/culios.aspx?action=GetRecipe' \
-H 'accept: application/json, text/plain, */*' \
--data-raw '{"id":"11962"}' \
--compressed
Will return:
{"id":11962,"slug":"Muhammara kerstkrans","title":"Muhammara kerstkrans","subtitle":"", ...Rest of the data
I don't understand the language, but from my basic understanding this would be an API route for recipes.
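If you go down this route, you can pull fields straight out of that JSON instead of parsing HTML, for example with jq (the field names are the ones visible in the response above):
curl -s 'https://content-api.dirk.nl/misc/specific/culios.aspx?action=GetRecipe' \
  -H 'accept: application/json, text/plain, */*' \
  --data-raw '{"id":"11962"}' \
  --compressed | jq '{id, slug, title}'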

Linkedin get user feed

I'm trying to get my LinkedIn feed using this API:
https://linkedin.api-docs.io/v1.0/feed/42Hm9SaY2p2CGwPzp
I'm trying to use the request "GET /voyager/api/feed/updates" with this shell code:
curl --request GET \
  --url https://www.linkedin.com/voyager/api/feed/updates \
  --data '{}'
But I get this response: "CSRF check failed". I understand why LinkedIn responds with this, but how do I avoid it?
You're missing headers; see the API docs here: https://linkedin.api-docs.io/v1.0/feed and an explanation of how to get the headers here: https://towardsdatascience.com/using-browser-cookies-and-voyager-api-to-scrape-linkedin-via-python-25e4ae98d2a8
The API docs are a bit outdated and the data output format might be different; this is at least true for messaging/conversations, not sure about the feed.
Regarding the headers, I suggest trying apify.com and extracting them in real time from a browser instance (run Puppeteer, log in to LinkedIn, get the headers, save them).
Phantombuster will not allow you to use your own code, so it is not very useful.
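For reference, once you have the cookies, the request shape is roughly the following. This is a sketch only: the cookie values are placeholders you take from your own logged-in browser session, and per the linked article the csrf-token header must match the JSESSIONID cookie value, which is what resolves the "CSRF check failed" error:
curl 'https://www.linkedin.com/voyager/api/feed/updates' \
  -H 'cookie: li_at=<your li_at cookie>; JSESSIONID="ajax:1234567890"' \
  -H 'csrf-token: ajax:1234567890'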

Using R to call a Web Service: Send data and get the result table back in R

http://snomedct.t3as.org/ This is a web service that will analyse English clinical text and report any concepts that can be detected.
For example, for "I have a headache" it will identify headache as a Symptom.
Now what I would like to do is send the sentence to the web service through R, and get the table back from the web page into R for further analysis.
If we take their example curl command-line:
curl -s --request POST \
-H "Content-Type: application/x-www-form-urlencoded" \
--data-urlencode "The patient had a stroke." \
http://snomedct.t3as.org/snomed-coder-web/rest/v1.0/snomedctCodes
that can be translated to httr pretty easily.
The -s means "silent" (no progress meter or error messages) so we don't really have to translate that.
Any -H means to add a header to the request. This particular Content-Type header can be handled better with the encode parameter to httr::POST.
The --data-urlencode parameter says to URL encode that string and put it in the body of the request.
Finally, the URL is the resource to call.
library(httr)
result <- POST("http://snomedct.t3as.org/snomed-coder-web/rest/v1.0/snomedctCodes",
body="The patient had a stroke.",
encode="form")
Since you don't do this regularly, you can wrap the POST call with with_verbose() to see what's going on (look that up in the httr docs).
There are a ton of nuances one should technically handle after this (like checking the HTTP status code with stop_for_status(), warn_for_status() or even just status_code()), but for simplicity let's assume the call works (this one is their example, so it does work and returns a 200 HTTP status code, which is A Good Thing).
By default, that web service is returning JSON, so we need to convert it to an R object. While httr does built-in parsing, I like to use the jsonlite package to process the result:
dat <- jsonlite::fromJSON(content(result, as="text"), flatten=TRUE)
The fromJSON function takes a few parameters that are intended to help shape JSON into a reasonable R data structure (many APIs return horrible JSON and/or XML). This API would fit into the "horrible" category. The data in dat is pretty gnarly and further decoding of it would be a separate SO question.

How to POST the contents of a file using curl

I'd like to be able to post the contents of a file to a MediaWiki site. So far I can do it as so:
curl --cookie wikiCookies.txt --negotiate -k -X POST -u:<username> -g 'https://<someWikiSite>/api.php?action=edit&title=TestPage&text=HelloWorld&token=<someToken>&format=json'
This works fine, but it has its limitations because of the length of the url.
Suppose I had a file foo.txt, how could I post the contents of this file to a MediaWiki site so that I wouldn't have to add the entire file contents to the url?
I've found the MediaWiki API http://www.mediawiki.org/wiki/API:Edit#Editing_pages, but I haven't been able to figure out how to curl POST entire file contents with it.
I think this should be a fairly simple question for anyone with a good understanding of curl, but no matter what I try, I can't get it to work.
Try this:
--data "text=<some_wiki_tag>this is encoded wiki content</some_wiki_tag>&title=TestPage&text=HelloWorld&token=<someToken>&format=json"
I think what you need is -d, --data <data>.
If the <data> starts with @, then the rest should be a file name whose contents will be sent in the POST request (see the example after the man page excerpt below).
From the online curl man page:
-d, --data
(HTTP) Sends the specified data in a POST request to the HTTP server,
in the same way that a browser does when a user has filled in an HTML
form and presses the submit button. This will cause curl to pass the
data to the server using the content-type
application/x-www-form-urlencoded. Compare to -F, --form.
-d, --data is the same as --data-ascii. To post data purely binary, you should instead use the --data-binary option. To URL-encode the
value of a form field you may use --data-urlencode.
If any of these options is used more than once on the same command
line, the data pieces specified will be merged together with a
separating &-symbol. Thus, using '-d name=daniel -d skill=lousy' would
generate a post chunk that looks like 'name=daniel&skill=lousy'.
If you start the data with the letter @, the rest should be a file
name to read the data from, or - if you want curl to read the data
from stdin. The contents of the file must already be URL-encoded.
Multiple files can also be specified. Posting data from a file named
'foobar' would thus be done with --data @foobar. When --data is told
to read from a file like that, carriage returns and newlines will be
stripped out.
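Putting that together with the command from the question, something along these lines should work. The placeholders are the ones from the question, and --data-urlencode 'text@foo.txt' reads foo.txt and sends its URL-encoded contents as the text parameter, so nothing has to go in the URL:
curl --cookie wikiCookies.txt --negotiate -k -u:<username> \
  'https://<someWikiSite>/api.php' \
  --data 'action=edit&title=TestPage&format=json' \
  --data-urlencode 'token=<someToken>' \
  --data-urlencode 'text@foo.txt'
With --data present, curl sends a POST automatically, so -X POST is no longer needed.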
