Does anyone know a way (free or paid tool, software library, etc.) to scrape HTML and the HTTP responses? I've tried tools like Mozenda and Octoparse and they worked, but only for getting the HTML.
If you open a site with Chrome, for example, and open the developer tools, you can see the traffic and the responses in the Network tab; I need to capture that same data, but with a program.
I've tried replicating the POST request and sending it with Postman, and it worked, but I don't know how to automate this (replicating the HTTP headers sent would be the hard part, given that tokens expire).
Any kind of help or tip would be appreciated, thanks.
So, after reading the docs for Scrapy, Puppeteer, and Selenium, I can say it can be done with all three, although I believe the most straightforward way to do it would be with Puppeteer.
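For what it's worth, since Python comes up elsewhere in this thread, here's a minimal sketch of the Selenium route using Chrome's performance log, which surfaces the same Network events you see in the DevTools Network tab (the URL is a placeholder):

```python
import json
from selenium import webdriver
from selenium.common.exceptions import WebDriverException

# Ask ChromeDriver to record DevTools "performance" events, which include
# the Network.* messages shown in the DevTools Network tab.
options = webdriver.ChromeOptions()
options.set_capability("goog:loggingPrefs", {"performance": "ALL"})
driver = webdriver.Chrome(options=options)

driver.get("https://example.com")  # placeholder URL

for entry in driver.get_log("performance"):
    message = json.loads(entry["message"])["message"]
    if message["method"] == "Network.responseReceived":
        response = message["params"]["response"]
        print(response["status"], response["url"])
        try:
            # Fetch the response body over the DevTools protocol.
            body = driver.execute_cdp_cmd(
                "Network.getResponseBody",
                {"requestId": message["params"]["requestId"]},
            )
            print(body["body"][:200])
        except WebDriverException:
            pass  # some bodies (redirects, evicted entries) aren't retrievable

driver.quit()
```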
We didn't end up scraping that site, though; it was too much work, and we don't want to code a scraper from scratch since it's not that important for us.
Thanks, @Patrick Klein and @Gallaecio
This is something I have never seen before, and I do not know if my Google search skills are lacking, but I cannot find anything saying it is an actual way of specifying the HTTP verb for a request.
To give some context on where I have encountered this: I am working on a project to create a very basic LRS (Learning Record Store) to capture Statements from an Articulate Story.
I had Fiddler running to monitor the requests and noticed the Articulate Story tries to POST to a specified endpoint like so: 'endpoint/statements?method=PUT'
Anybody know what is up with this?
Upon further reading of the xAPI specification and the Articulate documentation, this is something Articulate does: it uses the xAPI "alternate request syntax", where environments that can only issue POST tunnel the intended verb through a method query parameter. See this link: [Implementing the Tin Can API to support Articulate content][1]
[1]: https://articulate.com/de-DE/support/article/Implementing-Tin-Can-API-to-Support-Articulate-Content
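If your basic LRS needs to honor that parameter, a minimal sketch (assuming Flask; the in-memory store and route are purely illustrative) could look like this:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
store = {}  # illustrative in-memory statement store

@app.route("/statements", methods=["GET", "POST", "PUT"])
def statements():
    # Articulate POSTs with ?method=PUT, so treat the query parameter
    # as the effective verb when present (xAPI "alternate request syntax").
    verb = request.args.get("method", request.method).upper()
    if verb in ("PUT", "POST"):
        stmt = request.get_json(force=True)
        store[stmt.get("id")] = stmt  # store/overwrite by statement id
        return "", 204
    return jsonify(list(store.values()))

if __name__ == "__main__":
    app.run()
```

Note that the full alternate request syntax in the spec also tunnels headers and the statement JSON through form fields, so a spec-complete LRS needs more than this simplified version.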
Earlier today, I was able to send snapshots to the Face API and get responses including faceAttributes describing emotion.
I'm using JavaScript via XMLHttpRequest.
Now, though I've not changed the code, I get 200 OK from the API calls, but the responseText and response properties are both "[]".
I'd like to troubleshoot to see what I'm doing wrong, but it seems like the only information available in the Cognitive Services portal relates to quota.
Where should I look for further analytics?
You'll get an empty response if the API does not detect a face in the image or if the image file is too large (>4MB). You can confirm by testing with an image you know previously worked. To get the best results, make sure the face is well-lit and all features are reasonably visible.
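One way to confirm this outside the browser is a minimal sketch with Python's requests (the region, key, and attribute list below are assumptions; use your own subscription's values). An image with no detectable face still returns 200, just with an empty JSON array, which matches the "[]" you're seeing:

```python
import requests

# Assumed values; substitute your subscription's region and key.
endpoint = "https://westus.api.cognitive.microsoft.com/face/v1.0/detect"
headers = {
    "Ocp-Apim-Subscription-Key": "YOUR_KEY",
    "Content-Type": "application/octet-stream",  # sending raw image bytes
}
params = {"returnFaceAttributes": "emotion"}

with open("snapshot.jpg", "rb") as f:
    resp = requests.post(endpoint, headers=headers, params=params, data=f)

print(resp.status_code)  # 200 even when no face is found
print(resp.json())       # [] means no face was detected
```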
Hello from Cognitive Services - Face API Team,
I wonder whether the problem occurs with one specific image or with all API calls?
For a quick check, you can try the image on the online demo [1].
[1] https://azure.microsoft.com/en-us/services/cognitive-services/face/
Unfortunately, troubleshooting from the external perspective is quite difficult, since you don't get any logs. The most common approach is to try to reproduce your problem using either the testing console (https://westus.dev.cognitive.microsoft.com/docs/services/563879b61984550e40cbbe8d/operations/563879b61984550f3039523b) or a tool such as curl or Fiddler, so that you can see the raw REST request and response.
With one of those tools you can try to change up your request, try to call a different API, make sure there are no additional details being returned in the body or response headers, etc.
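As a lightweight stand-in for Fiddler, a small helper like this (a sketch, using Python's requests and the httpbin.org test service) prints the raw exchange so you can compare headers and bodies between a working and a failing call:

```python
import requests

def dump_exchange(resp: requests.Response) -> None:
    """Print the raw request/response pair, similar to curl -v output."""
    req = resp.request
    print(f"> {req.method} {req.url}")
    for name, value in req.headers.items():
        print(f"> {name}: {value}")
    print(f"< HTTP {resp.status_code}")
    for name, value in resp.headers.items():
        print(f"< {name}: {value}")
    print(resp.text)

# Works with any requests call, e.g. the Face API sketch above.
dump_exchange(requests.post("https://httpbin.org/post", json={"ping": 1}))
```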
If all else fails, please open a support incident from the Azure management portal and we can work with you.
We are also working to improve the logging and troubleshooting capabilities, but it may be some time before you see improvements in this area.
I am new to screen scraping. When I use a proxy server and track the HTTP transactions, I can see my POST data revealed to me. So my doubts/questions here are:
1) Will it get stored on the server side, or will it be revealed only to the client side?
2) Do we have an option to encrypt the POST data in screen scraping?
3) Is it advisable to use screen scraping for banking applications?
I am using the screen-scraper tool (Enterprise version), which I downloaded from
http://www.screen-scraper.com/download/choose_version.php.
Thanks in advance.
My experience with scraping is that if you aren't doing anything super complex (like logging into a secure website such as an online banking site), then Python has some great libraries that will help you out a lot.
To answer your questions:
1) You may need to be more clear, but this really depends on your server/client architecture.
2) As a matter of fact, you do. urllib and urllib2 (built-in Python libraries) both support HTTPS, which encrypts your POST data in transit. For most applications, this will suffice.
3) I actually have done scraping on online banking sites! I'm not exactly familiar with that tool, but I would recommend using something a little different from a plain scraper. Selenium, a "web driver", allows you to simulate the use of a browser, meaning anything the browser does in the background to validate the session is automatically taken care of (see the sketch after the links below). The main problem I ran into while trying to scrape the banking site was the loss of important session data.
Selenium - https://pypi.python.org/pypi/selenium
Other libraries you may find useful are: urllib, urllib2, and Mechanize
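For illustration, here's a minimal Selenium sketch of the login flow from point 3 (the URL and form-field names are hypothetical; inspect the real login page for yours):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
# Hypothetical URL and field names for illustration only.
driver.get("https://bank.example.com/login")
driver.find_element(By.NAME, "username").send_keys("your-username")
driver.find_element(By.NAME, "password").send_keys("your-password")
driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()

# The real browser keeps cookies and session state alive for you,
# so subsequent navigation stays authenticated.
html = driver.page_source
driver.quit()
```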
I hope I was somewhat helpful!
I've used screen-scraper to scrape banking sites before. It interacts with the site just like your browser does: if the site uses encryption, the connection from screen-scraper to the site will be encrypted too.
If you have a client page sending data to screen-scraper, you probably should encrypt that. I generally just make the connection via SSH.
1) What do you mean by server side? Your proxy server or the screen-scraper software? Either of them can read/store your information.
2) If you are connecting through HTTPS, then your software should warn you about a malicious proxy server: https://security.stackexchange.com/questions/8145/does-https-prevent-man-in-the-middle-attacks-by-proxy-server
3) I don't think they have some logger they can read. But if you are concerned, you can try to write your own. There are libraries that let you read HTML easily with jQuery syntax (see the sketch below):
https://pypi.python.org/pypi/pyquery or XPath: http://net.tutsplus.com/tutorials/javascript-ajax/web-scraping-with-node-js/
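For example, a minimal pyquery sketch (the HTML here is made up for illustration):

```python
from pyquery import PyQuery as pq

html = """
<html><body>
  <a class="statement" href="/accounts/1">Account 1</a>
  <a class="statement" href="/accounts/2">Account 2</a>
</body></html>
"""

doc = pq(html)  # jQuery-style selectors over parsed HTML
for link in doc("a.statement").items():
    print(link.text(), link.attr("href"))
```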
I used to use whurl.heroku.com to make HTTP requests and share the responses with people. It's a great service for letting people see the results of requests themselves and test fixes.
It appears that whurl is going offline soon. Are there any good alternatives out there (besides hosting my own)?
Similar to what Mihai posted, I found Advanced REST Client, a Google Chrome app. I prefer ARC a little more as it's an app, so it doesn't take up space in my URL bar, and also because it's easier to use and has a richer saved-history feature than XHR Poster.
It seems someone is hosting Whurl again on heroku.com, https://gcurl.heroku.com/.
You can use the XHR Poster extension for Google Chrome. It contains many of whurl's functionalities.
It has a JSON pretty-print feature and handles all types of requests.
I'm attempting to write a script to log on (username and password) to a website and download a file. I've tried using the WebClient and WebBrowser classes to no avail; they don't seem to work for what I need them to do.
Does anyone else have any suggestions?
I'd suggest you look at this Stack Overflow thread. This is not specific to any programming language or platform (you didn't mention which language/platform you're using, so I can't offer any specific code advice), but in my answer I detail the basic approach you'll need to take for successful HTML screen scraping.
In a nutshell, first you'll need to find the right combination of HTTP headers, URLs, and POST data to fool the server into thinking your client app is a "real" browser. A tool like Fiddler, which allows you to see actual HTTP requests going over the wire and experimentally build new requests based on browser requests, is invaluable here.
Next you'll need to figure out, in your language and platform, how to produce that set of headers, URLs, and/or POST data. Typically, handling cookies and redirects is the hardest part, and most platforms have specialized classes (e.g. CookieContainer in .NET) to help with things like this.
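Since you didn't mention a platform, here's a minimal sketch of that approach in Python with requests (the URLs and form-field names are hypothetical; take the real ones from Fiddler):

```python
import requests

# A Session persists cookies across requests, much like CookieContainer in .NET.
session = requests.Session()

# Hypothetical login URL and form fields; copy the real ones from Fiddler.
login = session.post(
    "https://example.com/login",
    data={"username": "your-username", "password": "your-password"},
    headers={"User-Agent": "Mozilla/5.0"},  # some sites reject non-browser agents
    allow_redirects=True,  # follow the post-login redirect
)
login.raise_for_status()

# The session now carries the auth cookies, so the download is authenticated.
resp = session.get("https://example.com/files/report.pdf")
with open("report.pdf", "wb") as f:
    f.write(resp.content)
```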