Hello,
I'm working on a scraper for this page: https://www.dirk.nl/
In the Scrapy shell, I'm trying to select the 'row-wrapper' div class.
If I enter response.css('row-wrapper'), it gives me some random results; I think an anti-scraping system is involved. I need the hrefs from this class.
Any opinions on how I can move forward?
We would need a bit more data, such as the response you receive and any code you already have set up.
But from the looks of it, it could be any of several things (from a 429 response blocking the request because of rate limiting, to the site's internal API XHR causing data not to be rendered on page load, etc.).
Before fetching any website for scraping, try curl, Postman, or Insomnia to see what type of response you are going to receive. Some servers and website architectures require certain cookies and headers while others don't. You simply have to do this research so you can make your scraping workflow efficient.
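For example, a quick first probe could look like this (a sketch using standard curl flags):
# -s silences the progress meter, -o /dev/null discards the body, and
# -D - dumps the response headers so you can inspect the status code,
# Set-Cookie requirements, and any anti-bot hints before writing scraper code.
curl -s -o /dev/null -D - 'https://www.dirk.nl/'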
I ran curl https://www.dirk.nl/ and it returned markup generated by the Nuxt framework. That HTML is barely usable on its own, since Nuxt uses its own client-side functionality to render the data.
Instead, the best solution would be to skip the HTML and fetch the content API data directly.
Something like this:
curl 'https://content-api.dirk.nl/misc/specific/culios.aspx?action=GetRecipe' \
-H 'accept: application/json, text/plain, */*' \
--data-raw '{"id":"11962"}' \
--compressed
Will return:
{"id":11962,"slug":"Muhammara kerstkrans","title":"Muhammara kerstkrans","subtitle":"", ...Rest of the data
I don't speak the language, but from my basic understanding this is an API route for recipes.
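Since the original goal was the hrefs, once you're getting JSON back you can filter for them directly instead of parsing HTML. A sketch with jq, assuming href fields actually appear somewhere in the payload you end up using (the endpoint here is just the recipe example from above):
# Recursively walk the JSON and print every "href" value it contains.
curl -s 'https://content-api.dirk.nl/misc/specific/culios.aspx?action=GetRecipe' \
-H 'accept: application/json, text/plain, */*' \
--data-raw '{"id":"11962"}' \
--compressed | jq -r '.. | .href? // empty'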
I'm trying to deploy a Google Apps Script as a web app, but while I have no problem doing GET requests, I'm having trouble with POST requests.
My code is very simple:
function doGet(request) {
  // Reply to GET requests with a small JSON payload.
  var result = JSON.stringify({ data: 'Thanks, I received the GET request' });
  return ContentService.createTextOutput(result).setMimeType(ContentService.MimeType.JSON);
}

function doPost(request) {
  // Reply to POST requests the same way.
  var result = JSON.stringify({ data: 'Thanks, I received the POST request' });
  return ContentService.createTextOutput(result).setMimeType(ContentService.MimeType.JSON);
}
I deployed the web app with "Execute the app as: Me" and "Who has access to the app: Anyone, even anonymous". Every time I do some change I re-deploy it with a new version ("Project version: New").
After publishing, my curl GET request works perfectly:
> curl -L https://script.google.com/macros/s/$SCRIPT_ID/exec
{"data":"Thanks, I received the GET request"}
However, my POST request (curl -L -XPOST https://script.google.com/macros/s/$SCRIPT_ID/exec) just shows me a generic Google HTML page saying "Sorry, unable to open the file at this time. Please check the address and try again".
I tried sending some data and providing a content type, but nothing changed. I also tried changing the output type to just ContentService.createTextOutput("OK"), but it didn't work either. Curiously, deleting doPost changes the error message to "Script function not found: doPost" as expected. If it makes any difference, this script is attached to a Google spreadsheet.
Are there any special permissions I need to give to the script for POST requests specifically?
It seems that the problem was with my usage of curl, specifically a subtle difference between using -XPOST and not using it. As suggested by Tanaike, changing from:
curl -L -XPOST https://script.google.com/macros/s/$SCRIPT_ID/exec
to
curl -L -d '' https://script.google.com/macros/s/$SCRIPT_ID/exec
Solved the issue. Even though curl helpfully says "Unnecessary use of -X or --request, POST is already inferred" when I do a request with -XPOST and a payload, its behavior is different in the presence of redirects. -XPOST forces all subsequent requests after a redirect to be made using POST as a method. On the other hand, if I don't specify -XPOST, the requests after the first POST are made as GET requests. I don't know if this is curl's intended behavior, but it's certainly unintuitive.
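As an aside, when the opposite behavior is wanted, curl has dedicated switches to keep the POST method across redirects without resorting to -X (shown against the same endpoint; note that in this Apps Script case it would bring the original error back, since Google's redirect target expects a GET):
# --post301/--post302/--post303 stop curl from converting the POST into a
# GET when following those redirect codes; -d '' still makes it a POST.
curl -L --post301 --post302 --post303 -d '' https://script.google.com/macros/s/$SCRIPT_ID/exec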
We have a WordPress custom build and have integrated the Vimeo API to pull videos through to the website.
The setup is working but the API calls are taking 20 seconds. We have tested using Postman and they only take 1-2 seconds.
Is there a solution to this?
Use the fields parameter on your requests to tell the API to return only the metadata your application needs. Because Vimeo API responses can be quite large, especially when retrieving a list of videos, the fields parameter can significantly reduce the size of the response and, in turn, improve response times.
For example, let's say you're making a request to get the last 10 videos you uploaded. The request would look like this:
curl -X GET 'https://api.vimeo.com/me/videos?page=1&per_page=10' \
-H 'Accept: application/vnd.vimeo.*+json;version=3.4' \
-H 'Authorization: bearer [token]'
The response would return the full and complete video objects for all 10 videos, which can be quite large. However, if you only need some of the metadata, such as each video's name, description, and link on vimeo.com, then the same request with the fields param would look like this:
curl -X GET 'https://api.vimeo.com/me/videos?page=1&per_page=10&fields=uri,name,description,link' \
-H 'Accept: application/vnd.vimeo.*+json;version=3.4' \
-H 'Authorization: bearer [token]'
The fields parameter is documented here: https://developer.vimeo.com/api/common-formats#json-filter
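If you want to quantify the improvement (and compare it against the 20-second WordPress path), curl can report timings itself. A sketch using its standard --write-out variables, with the same placeholder token as above:
# -w prints the chosen timing variable after the transfer completes.
curl -s -o /dev/null -w 'total: %{time_total}s\n' \
'https://api.vimeo.com/me/videos?page=1&per_page=10&fields=uri,name,description,link' \
-H 'Accept: application/vnd.vimeo.*+json;version=3.4' \
-H 'Authorization: bearer [token]'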
I'm using "curl -L --post302 -request PUT --data-binary #file " to post a file to a redirected address. At the moment the redirection is not optional since it will allow for signed headers and a new destination. The GET version works well. The PUT version under a certain file size threshold works also. I need a way for the PUT to allow itself to be redirected without sending the file on the first request (to the redirectorURL) and then only send the file when the POST is redirected to a new URL. In other words, I don't want to transfer the same file twice. Is this possible? According to the RFC (https://www.rfc-editor.org/rfc/rfc2616#section-8.2) it appears that a server may send a 100 "with an undeclared wait for 100 (Continue) status, applies only to HTTP/1.1 requests without the client asking to send its payload" so what I'm asking for may be thwarted by the server. Is there a way around this with one curl call? If not, two curl calls?
Try curl -L -T file $URL as the more "proper" way to PUT that file. (Often repeated by me: -X and --request should be avoided if possible; they cause misery.)
curl will use "Expect: 100" by itself in this case, but you'll also probably learn that servers widely don't care about supporting that anyway so it'll most likely still end up having to PUT twice...
http://snomedct.t3as.org/ is a web service that will analyse English clinical text and report any concepts it can detect.
For example, given "I have a headache", it will identify "headache" as a Symptom.
Now what I would like to do is send the sentence to the web service through R and get the resulting table back into R for further analysis.
If we take their example curl command-line:
curl -s --request POST \
-H "Content-Type: application/x-www-form-urlencoded" \
--data-urlencode "The patient had a stroke." \
http://snomedct.t3as.org/snomed-coder-web/rest/v1.0/snomedctCodes
that can be translated to httr pretty easily.
The -s means "silent" (no progress meter or error messages) so we don't really have to translate that.
Any -H means to add a header to the request. This particular Content-Type header can be handled better with the encode parameter to httr::POST.
The --data-urlencode parameter says to URL encode that string and put it in the body of the request.
Finally, the URL is the resource to call.
library(httr)

# Send the sentence in the request body, form-encoded, mirroring the curl call above
result <- POST("http://snomedct.t3as.org/snomed-coder-web/rest/v1.0/snomedctCodes",
               body = "The patient had a stroke.",
               encode = "form")
Since you don't do this regularly, you can wrap the POST call with with_verbose() to see what's going on (look that up in the httr docs).
There are a ton of nuances one should technically handle after this (like checking the HTTP status code with stop_for_status(), warn_for_status(), or even just status_code()), but for simplicity let's assume the call works (this one is their example, so it does work and returns a 200 HTTP status code, which is A Good Thing).
By default, that web service returns JSON, so we need to convert it to an R object. While httr has built-in parsing, I like to use the jsonlite package to process the result:
dat <- jsonlite::fromJSON(content(result, as="text"), flatten=TRUE)
The fromJSON function takes a few parameters that are intended to help shape JSON into a reasonable R data structure (many APIs return horrible JSON and/or XML). This API would fit into the "horrible" category. The data in dat is pretty gnarly and further decoding of it would be a separate SO question.
What would the following cURL command look like as a generic (without cURL) http request?
feedUri="https://www.someservice.com/feeds\
?prettyprint=true"
curl $feedUri --silent \
--header "GData-Version: 2"
For example, how could such an HTTP request be expressed in the browser address bar? Particularly, how do I express the --header information if I were to just type out the plain HTTP request?
I don't know of any browser that lets you specify header information in the address bar. I believe there are plug-ins that let you do this, but I don't have any experience with them.
Here is one for firefox that looks promising:
https://addons.mozilla.org/en-US/firefox/addon/967
Basically, what you want to do is not a standard browser feature.
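For reference, stripped of cURL entirely, the request is just plain text on the wire. A sketch of sending it by hand with netcat (plain HTTP only; the https URL in the question would additionally need TLS, e.g. via openssl s_client):
# The raw HTTP/1.1 request: request line, Host header, the custom
# GData-Version header, then a blank line to terminate the headers.
printf 'GET /feeds?prettyprint=true HTTP/1.1\r\nHost: www.someservice.com\r\nGData-Version: 2\r\nConnection: close\r\n\r\n' | nc www.someservice.com 80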