Being lazy, I'm trying to use scrapy instead of implementing my own scraping service using celery+requests (been there, done that). Let's say I have a list of N pages that I like to monitor. After retrieving page X and reading its content, I want to tell the system to rescan it sometime later (depending on its content), say once two hours have passed.
Is such a thing possible with Scrapy?
Related
I need to click in 100000 different url from websites to scrape different data about the website. The data has to be manually extracted because it is different on each website and it doesn't follow a pattern.
My question is, is there any program or script where I can paste the 100000 URL and it just open/preload some urls in tabs/windows so when I close one tab the next url opens in a new tab? This way I work in the main website tab that takes me 10 seconds to review and I click control+w and go to the next url.
This way I could save so much time instead of just clicking manually on each link and waiting for it to load.
You can use python web scraping or RPA(if you don't know basic python) and by logical steps, users can automate n number of tasks to be done.
Or you can also make use of 'pyautogui library' of python to click on visual things.
Thumbs up. If helped...
this is my first time posting here. I do not have much experience (less than a week) with html parsing/web scraping and have difficulties parsing this webpage:
https://www.jobsbank.gov.sg/
What I wan to do is to parse the content of all available job listing in the web.
my approach:
click search on an empty search bar which will return me all records listed. The resulting web page is: https://www.jobsbank.gov.sg/ICMSPortal/portlets/JobBankHandler/SearchResult.do
provide the search result web address to R and identify all the job listing links
supply the job listing links to R and ask R to go to each listing and extract the content.
look for next page and repeat step 2 and 3.
However, the problem is that the resulting webpage I got from step 1 does not direct me to the search result page. Instead, it will direct me back to the home page.
Is there anyway to overcome this problem?
Suppose I managed to get the web address for the search result, I intent to use the following code:
base_url <- "https://www.jobsbank.gov.sg/ICMSPortal/portlets/JobBankHandler/SearchResult.do"
base_html <- getURLContent(base_url,cainfo="cacert.pem")[[1]]
links <- strsplit(base_html,"a href=")[[1]]
Learn to use the web developer tools in your web browser (hint: Use Chrome or Firefox).
Learn about HTTP GET and HTTP POST requests.
Notice the search box sends a POST request.
See what the Form Data parameters are (they seem to be {actionForm.checkValidRequest}:YES
{actionForm.keyWord}:my search string )
Construct a POST request using one of the R http packages with that form data in.
Hope the server doesn't care about the cookies, if it does, get the cookies and feed it cookies.
Hence you end up using postForm from RCurl package:
p = postForm(url, .params=list(checkValidRequest="YES", keyword="finance")
And then just extract the table from p. Getting the next page involves constructing another form request with a bunch of different form parameters.
Basically, a web request is more than just a URL, there's all this other conversation going on between the browser and the server involving form parameters, cookies, sometimes there's AJAX requests going on internally to the web page updating parts.
There's a lot of "I can't scrape this site" questions on SO, and although we could spoonfeed you the precise answer to this exact problem, I do feel the world would be better served if we just told you to go learn about the HTTP protocol, and Forms, and Cookies, and then you'll understand how to use the tools better.
Note I've never seen a job site or a financial site that doesn't like you scraping its content - although I can't see a warning about it on this site, that doesn't mean it's not there and I would be careful about breaking the Terms and Conditions of Use. Otherwise you might find all your requests failing.
This question relates to WordPress's wp-cron function but is general enough to apply to any DB-intensive calculation.
I'm creating a site theme that needs to calculate a time-decaying rating for all content in the system at regular intervals. This rating determines the order of posts on the homepage, which is paged to allow visitors to potentially view all content. This rating value needs to be calculated frequently to make sure the site has fresh content listed in the proper order.
The rating calculation is not heavy but the rating needs to be calculated for, potentially, 1,000s of items and doing that hourly via wp-cron will start to cause problems for sites with lots of content. Ignoring the impact on page load (wp-cron processes requests on page loads once a certain interval has been reached), at some point the script will reach a time limit. Setting up the site to use "plain ol' cron" will solve the page loading issue but not the timeout one.
Assuming that I have no control over the sites that this will run on, what's the best way to handle this rating calculation on a regular basis? A few things that came to mind:
Only calculate the rating for the most recent 1,000 posts, assuming that the rest won't be seen much. I don't like the idea of ignoring all old content, though.
Calculate the first, say, 100 or so, then only calculate the rating for older groups if those pages are loaded. This might be hard to get right, though, and lead to incorrect listing and ratings (which isn't a huge problem for older content but something I'd like to avoid)
Batch process 100 or so at regular intervals, keeping track of the last one processed. This would cycle through the whole body of content eventually.
Any other ideas? Thanks in advance!
Depending on the host, you're in for a potentially sticky situation. Let me outline a couple of ideal cases and you can pick/choose where you need to.
Option 1
Mirror the database first and use a secondary app (WordPress or otherwise) to do the calculations asynchronously against that DB mirror. When they're done, they can update a static file in the project root, write data to a shared Memcached instance, trigger a POST to WordPress' admin_post endpoint to write some internal state, whatever.
The idea here is that you're removing your active site from the equation. The last thing you want to do is have a costly cron job lock the live site's database or cause queries to slow down as it does its indexing.
Option 2
Offload the calculation entirely to a separate application. Tracking ratings in real time with WordPress is a poor idea as it bypasses page caching and triggers an uncachable request every time a new rating comes in. Pushing this off to a second server means your WordPress site is super fast, and it also means you can have the second server do the calculations for you in the first place.
If you're already using something like Elastic Search on the site, you can add ratings as an added indexing facet. Then just update posts as ratings change, and use the ES API to query most popular posts later.
Alternatively, you can use a hosted service like Keen IO to record and aggregate ratings.
Option 3
Still use cron, but don't schedule it as a cron job in WordPress. Instead, write a WP CLI routine that does the reindexing for you. Then, schedule real cron jobs to process the job.
This has the advantage of using PHP's command line version, which can be configured to skip the timeouts and memory limits imposed on the FPM/CGI/whatever version used to serve the site. It also means you don't have to wait for site traffic to trigger the job - and a long-running job won't block other cron events within WordPress from firing.
If using this process, I would set the job to run hourly and, each hour, run a batch of 1/24th of the total posts in the database. You can keep track of offsets or even processed post IDs in the database, the point is just that you're silently re-indexing posts throughout the day.
I have a function which web-scraping all latest news from a website (approximately 10 news and the number of news is up to that website). Note that the news are in chronical order.
For example, yesterday I got 10 news and stored in database. Today I get 10 news but there are 3 news that are not available from yesterday (7 news stayed the same, 3 new).
My current approach is to extract each news till I find an old news (the 1st among 7 news) then I stop extracting and only update the field "lastUpdateDate" of the old news + add new news to the database. I think this approach is somehow complicated and it takes time.
Actually I'm getting news from 20 websites with same content structure (Moodle) so each request will last about 2 minutes, which my free host doesn't support.
Is it better if I delete all the news and then extracting everything from the start (this actually increments a huge amount of the ID numbers in the database)?
First, check to see if the website has a published API. If it has one, use it.
Second, check the website's terms of service, which may specifically and explicitly disallow scraping the website.
Third, look at a module in your programming language of choice that handles both the fetching of the pages and the extraction of the content from the pages. In Perl, you would start with WWW::Mechanize or Web::Scraper.
Whatever you do, don't fall into the trap that so many who post to StackOverflow fall into: Fetching the web page, and then trying to parse the content themselves, most often with regular expressions which is an inadequate tool for the job. Surf the SO tag html-parsing for tales of sorrow from those who have tried to roll their own HTML parsing systems instead of using existing tools.
Its depend on requirement if you want to show old news to the users or not.
For scraping you can create a custom local script for cron job which will grab the data from those news websites and will store into database.
You can also check through subject if its already exist of not.
Final make a custom news block which will show all the database feed.
Considering that most languages have webscraping functionality either built in, or made by others, this is more of a general web-scraping question.
I have a site in which I would like to pull information from about 6 different pages. This normally would not be that bad; unfortunately though, the information on these pages changes roughly every ten seconds, which could mean over 2000 queries an hour (which is simply not okay). There is no api to the website I have in mind either. Is there any possible efficient way to get the amount of information I need without flooding them with requests, or am I out of luck?
At best, the site might return you an HTTP 304 Not Modified in its header when you make a request - indicating that you need not download the page, as nothing has changed. If the site is set up to do so, this might help decrease bandwidth, but would still require the same number of requests.
If there's a consistent update schedule, then at least you know when to make the requests - but you'll still have to ask (i.e.: make a request) to find out what information has changed.