How to scrape the same data from thousands of authenticated URLs? - web-scraping

I need to scrape data from more than 50 thousand different URLs (....com?cid=1&aid=23&...); only "cid" and "aid" change. I always need the same data fields, with the same selectors. What approach do you suggest? The webpage has SSO authentication presented as a browser prompt.
I am thinking of using the scrapy library, but with no previous scraping experience it could be an unreachable goal. I am able to do this with Selenium and WebDriver, but it takes too long.

Yes, you can do this with scrapy easily, and it will be faster than Selenium.
You can simply generate your URLs with a loop:
urls = ('http://www.example.com/page/{}'.format(i) for i in range(1,49999))
You can then use the FormRequest code shown in the scrapy documentation to fill in the form.
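A minimal sketch of what such a spider could look like. The cid/aid ranges, the CSS selectors, and the credentials below are placeholders, and the authentication handling is an assumption: if the browser prompt is HTTP basic auth, scrapy's built-in HttpAuthMiddleware (the http_user/http_pass spider attributes) may be enough; if it is a login form, a FormRequest in start_requests would replace that.

import scrapy

class DataSpider(scrapy.Spider):
    name = "data"
    # Placeholder credentials for scrapy's HttpAuthMiddleware (HTTP basic auth).
    http_user = "myuser"
    http_pass = "mypassword"

    def start_requests(self):
        # Placeholder id ranges; in practice the cid/aid pairs would come from a file or database.
        for cid in range(1, 100):
            for aid in range(1, 100):
                url = "https://example.com/page?cid={}&aid={}".format(cid, aid)
                yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # The same selectors for every page; these CSS selectors are made up.
        yield {
            "title": response.css("h1::text").get(),
            "value": response.css(".field-value::text").get(),
        }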

Related

Parsing Web page with R

This is my first time posting here. I do not have much experience (less than a week) with HTML parsing/web scraping, and I am having difficulty parsing this webpage:
https://www.jobsbank.gov.sg/
What I want to do is parse the content of all the available job listings on the site.
My approach:
1. Click search on an empty search bar, which will return all listed records. The resulting web page is: https://www.jobsbank.gov.sg/ICMSPortal/portlets/JobBankHandler/SearchResult.do
2. Provide the search result web address to R and identify all the job listing links.
3. Supply the job listing links to R and ask R to go to each listing and extract the content.
4. Look for the next page and repeat steps 2 and 3.
However, the problem is that the resulting webpage from step 1 does not take me to the search result page. Instead, it redirects me back to the home page.
Is there any way to overcome this problem?
Supposing I manage to get the web address for the search results, I intend to use the following code:
library(RCurl)  # getURLContent() comes from the RCurl package
base_url <- "https://www.jobsbank.gov.sg/ICMSPortal/portlets/JobBankHandler/SearchResult.do"
base_html <- getURLContent(base_url, cainfo = "cacert.pem")[[1]]
# Crude link extraction: split the page source on every "a href=" occurrence
links <- strsplit(base_html, "a href=")[[1]]
1. Learn to use the web developer tools in your web browser (hint: use Chrome or Firefox).
2. Learn about HTTP GET and HTTP POST requests.
3. Notice that the search box sends a POST request.
4. See what the Form Data parameters are (they seem to be {actionForm.checkValidRequest}: YES and {actionForm.keyWord}: my search string).
5. Construct a POST request, using one of the R HTTP packages, with that form data in it.
6. Hope the server doesn't care about cookies; if it does, capture the cookies and send them back.
Hence you end up using postForm from the RCurl package:
p = postForm(url, .params=list(checkValidRequest="YES", keyword="finance"))
And then just extract the table from p. Getting the next page involves constructing another form request with a bunch of different form parameters.
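For comparison, a rough sketch of the same idea in Python with the requests library; the form field names are copied from the Form Data noted above (whether the braces are part of the real field names is an assumption), and the Session object takes care of any cookies the server sets.

import requests

url = "https://www.jobsbank.gov.sg/ICMSPortal/portlets/JobBankHandler/SearchResult.do"
form_data = {
    # Field names as they appear in the browser's dev tools; adjust if the
    # braces are only dev-tools formatting rather than part of the names.
    "{actionForm.checkValidRequest}": "YES",
    "{actionForm.keyWord}": "finance",
}

with requests.Session() as session:
    # Visit the home page first so any required session cookies are set.
    session.get("https://www.jobsbank.gov.sg/")
    response = session.post(url, data=form_data)
    print(response.status_code, len(response.text))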
Basically, a web request is more than just a URL; there is a whole conversation going on between the browser and the server involving form parameters and cookies, and sometimes there are AJAX requests going on inside the web page, updating parts of it.
There are a lot of "I can't scrape this site" questions on SO, and although we could spoonfeed you the precise answer to this exact problem, I feel the world would be better served if we just told you to go learn about the HTTP protocol, forms, and cookies; then you'll understand how to use the tools better.
Note that I've never seen a job site or a financial site that likes you scraping its content. Although I can't see a warning about it on this site, that doesn't mean it isn't there, and I would be careful about breaking the Terms and Conditions of Use. Otherwise you might find all your requests failing.

Scrape ASP.NET Website with heavy JavaScript calls

I want to scrape this website - https://recorder.co.clark.nv.us/RecorderEcommerce/default.aspx.
I need to simulate clicking the 'Parcel #' link first, then entering a value (e.g. 1234) into the Parcel # textbox and clicking search.
I need to scrape the data in the table which is shown at the bottom.
I'd like to write this in ASP.NET so I can push the Parcel # and other parameters through as part of the request. Once I get the response back, I'm confident I can parse it myself; I'm just not sure exactly how I should send the original request, as it's not as simple as sending across parameters.
In your question you've mentioned both JavaScript and ASP.NET, so I'm not sure which technologies you're planning to use. I'd recommend the HtmlAgilityPack library. It has an option to download from a URL, and it will help with the parsing too.
// HtmlWeb.Load fetches the page over HTTP; HtmlDocument.Load expects a local file or stream.
HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("https://recorder.co.clark.nv.us/RecorderEcommerce/default.aspx");

How to Stream Through Large Amounts of Twitter Data?

I'll be working on a project that will require a live output of a number of tweets users have hash tagged on Twitter as well as their tweets. Something along the lines of MTV's Twitter Tracker: http://vma-twittertracker.mtv.com/live/#buzz.
What intrigued me about this site is how they can constantly make API calls to Twitter without breaching the request limit.
I'd appreciate if anyone could guide me on the most effective way to accomplish this. From the research I've carried out thus far, I presume I will need to use Twitter's Streaming API.
Since there is a chance that the number of tweets output to my page could be in the thousands (AJAX loaded), along with stats on the number of retweets/favourites, what would be the most scalable approach within my .NET site? Any examples or guidance would be appreciated.
Check out Linq2Twitter. It is a great wrapper around the Twitter API, and provides two mechanisms that will help you:
There is a search function that allows you to search for hashtags, etc., which will limit the amount of data you are getting back.
You have the option to request all the data since a certain tweet ID. You can therefore incrementally search the feed by performing a search and, in subsequent calls, searching from the ID you left off at.
I have used this many times to search the public feed and have not had any issues to date. I think the search function is key to not requesting too much. Good luck!
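A small sketch of that incremental since_id pattern, in Python for brevity (the answer's context is .NET/Linq2Twitter); twitter_search() below is a hypothetical stand-in for whatever client call actually performs the hashtag search.

import time

def twitter_search(query, since_id=None):
    # Hypothetical client call: return the tweets matching `query` that are
    # newer than `since_id`, each as a dict with at least an "id" key.
    return []

def poll_hashtag(query, interval_seconds=60):
    last_seen_id = None
    while True:
        tweets = twitter_search(query, since_id=last_seen_id)
        if tweets:
            # Remember the newest ID so the next call only asks for newer tweets.
            last_seen_id = max(t["id"] for t in tweets)
            for t in tweets:
                print(t["id"])
        time.sleep(interval_seconds)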
You can look into the Storm framework. Below are a few links for further reference:
http://storm-project.net/
https://github.com/nathanmarz/storm
Thanks for all your responses.
It looks like sites such as that one, which display a lot of Twitter stats/data, use approved third-party providers that have direct access to Twitter's Firehose API.
I have managed to get in contact with an approved provider to supply us with the feeds of data required (and it ain't cheap!).

Scrapy: How to recrawl a page after some time?

Being lazy, I'm trying to use scrapy instead of implementing my own scraping service with celery + requests (been there, done that). Let's say I have a list of N pages that I'd like to monitor. After retrieving page X and reading its content, I want to tell the system to rescan it some time later (depending on its content), say once two hours have passed.
Is such a thing possible with Scrapy?
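One possible approach, sketched below under the assumption that the spider itself is re-run periodically (e.g. from cron): keep a per-URL "next due" timestamp outside the crawl and only request the pages that are due, updating the timestamp from the page content in the callback. The schedule store here is a placeholder dict, and this is not necessarily the idiomatic Scrapy answer to rescheduling within a single long-running crawl.

import time
import scrapy

class MonitorSpider(scrapy.Spider):
    name = "monitor"
    # Placeholder schedule: url -> earliest time (epoch seconds) to fetch it again.
    # In practice this would live in a database or on disk between runs.
    schedule = {"http://example.com/page/1": 0, "http://example.com/page/2": 0}

    def start_requests(self):
        now = time.time()
        for url, due in self.schedule.items():
            if now >= due:
                yield scrapy.Request(url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        # Decide from the content when this page should be revisited,
        # e.g. two hours from now, and record it for the next run.
        self.schedule[response.url] = time.time() + 2 * 60 * 60
        yield {"url": response.url, "title": response.css("title::text").get()}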

Best Approach To Retrieve Search Result

I am trying to write a program that extracts shipping container information from a specific site. I've had success with several shipping companies' websites that use POST methods to submit searches. For those sites I have been using PHP's cURL library. However, this one site, http://www.cma-cgm.com/eBusiness/Tracking/, has been very difficult to interact with. I have tried using cURL, but all I retrieve is the surrounding HTML without the actual search results.
A sample container I am trying to track is CMAU1173561.
The actual tracking URL seems to be http://www.cma-cgm.com/eBusiness/Tracking/Default.aspx?ContNum=CMAU1173561&T=292012319448 where ContNum is the shipping container and T is a value constructed from current time.
I also noted the .aspx extension. What is the best approach for retrieving these search results programmatically?
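A minimal sketch of fetching that tracking URL from Python with requests, assuming a plain GET is enough once the query parameters are right. The T value below is just the sample from the question (its exact time-based format isn't known here), and the session's first visit to the tracking page is there in case the server expects its cookies.

import requests

tracking_url = "http://www.cma-cgm.com/eBusiness/Tracking/Default.aspx"
params = {
    "ContNum": "CMAU1173561",   # sample container number from the question
    "T": "292012319448",        # time-derived token; format/freshness is an assumption
}

with requests.Session() as session:
    # Hit the tracking page first so any session cookies are set.
    session.get("http://www.cma-cgm.com/eBusiness/Tracking/")
    response = session.get(tracking_url, params=params)
    print(response.status_code, len(response.text))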
