I am trying to scrape a page and I have all the code set up; I just got stuck on the "load more" button. The page is simple: it has items, but only a few. In a web browser, to view the rest of the items you click an HTML button with an onClick event; when it is pressed, more items are loaded, and so on until all of them are on the page, at which point the button disappears. As of now I send a request, store the response in a variable, then have BeautifulSoup parse it. How would I go about loading the rest of the items into that variable? Should I be taking a different approach?
Yes, you have to take a different approach. Let me explain why.
A "load more" button usually triggers a new request to the site's API; JavaScript code then renders the received data into the page you're viewing. BeautifulSoup cannot handle such cases on its own, because it only parses the HTML you hand it - you have to implement the pagination logic yourself.
You have two approaches in this case:
Use a scraper (or write code) that can evaluate JavaScript (a WebDriver-based tool such as Selenium, or Puppeteer) and write a script that walks the pages and crawls the resulting DOM;
Investigate the API that the "load more" button calls. If that API is transparent and easy to use, it is usually possible to crawl all the data you need through it directly, with the requests module alone, as sketched below.
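A minimal sketch of the second approach, assuming a hypothetical JSON endpoint that returns a list per page (the real URL and parameters must be read from the browser's dev tools, Network tab, while clicking "load more"):

```python
import requests

# Hypothetical endpoint: replace with the request the "load more"
# button actually fires (visible in the Network tab).
API_URL = "https://example.com/api/items"

items, page = [], 1
while True:
    resp = requests.get(API_URL, params={"page": page})
    resp.raise_for_status()
    batch = resp.json()
    if not batch:  # empty batch: the point where the button would disappear
        break
    items.extend(batch)
    page += 1

print(f"collected {len(items)} items")
```

If the endpoint returns HTML fragments instead of JSON, feed each batch to BeautifulSoup exactly as you do now.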
I have a lot of different scrapers, but all of them work with server-rendered pages or parse responses from API endpoints.
But now I have two very specific websites to scrape:
First.
A single page: you click the search button to get the first 10 items; to get the next 10 items you click the "Next" button. After 2-3 seconds the data in the search section is re-rendered. On clicking "Next" I get opaque, unparsable data from a Vaadin service, so the data can only be parsed from the rendered HTML page.
Second.
The same kind of single page with the same principles (click the search button to get the initial data, click the Next button to load new data). But additionally I need to click on every item to get all the data to scrape (I scrape some data from the rendered search results, plus from a modal window that opens after clicking each search result item).
Question: is it possible to scrape such websites with Scrapy and Splash? I know about Selenium, but it's quite heavy and slow; I need another solution. I've never worked with Splash, but if I am not mistaken it's possible to imitate a click via a Lua script.
I would suggest avoiding Splash and instead reproducing the underlying requests.
The main issue I see with going the Splash route: if there is no URL that lets you reach any page other than the first from a web browser, and since Splash (as far as I know) does not support resuming a previous rendering, every request to Splash would need to run a Lua script that clicks Next, waits, and repeats for N pages.
If reproducing the requests is out of the picture for some reason, using an interactive headless browser (Selenium, Puppeteer) instead of a rendering service (Splash) may be better.
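If the underlying requests can be reproduced, a minimal Scrapy sketch might look like the following. The endpoint, form fields and JSON shape are all assumptions; capture the real request the "Next" button fires in the browser's Network tab (with Vaadin this may not be feasible, as noted in the question):

```python
import scrapy


class SearchSpider(scrapy.Spider):
    name = "search"
    # Hypothetical endpoint: replace with the request "Next" actually sends.
    api_url = "https://example.com/search"

    def start_requests(self):
        yield scrapy.FormRequest(
            self.api_url, formdata={"page": "1"},
            callback=self.parse, cb_kwargs={"page": 1},
        )

    def parse(self, response, page):
        results = response.json().get("results", [])
        yield from results
        if results:  # keep paging until the server returns an empty batch
            yield scrapy.FormRequest(
                self.api_url, formdata={"page": str(page + 1)},
                callback=self.parse, cb_kwargs={"page": page + 1},
            )
```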
I am creating a project involving web scraping and web automation. I would like to first submit this form (http://rgsntl.rgs.cuhk.edu.hk/rws_prd_applx2/Public/tt_dsp_timetable.aspx) and then scrape the HTML page that comes up. The problem is I am not sure how to submit this form through a Go program.
I was previously experimenting with Selenium to emulate a web browser, but now I think there may be an easier way. I think I should be able to make a POST request to the same address that the "submit" button of this form posts to, and directly use the HTML page that is returned. The problem is that I cannot figure out which address the submit button makes its POST request to. Is there a way to monitor the address the button POSTs to when it is clicked? Also, if you see any flaws with my idea, please do let me know. Thank you.
Right-click the page and select the Inspect option, then select the Network tab.
When you fill in all the entries and click the submit button, a number of URLs will flash by. Select the top one; under its Headers tab you will see the request URL for the POST method.
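Once you have that URL, you can reproduce the request without a browser. A hedged sketch in Python follows (the same flow ports directly to Go's net/http): ASP.NET WebForms pages require the hidden __VIEWSTATE and __EVENTVALIDATION fields from the GET response to be echoed back in the POST. The visible field name and value below are assumptions; copy the real ones from the Network tab's form data.

```python
import requests
from bs4 import BeautifulSoup

URL = "http://rgsntl.rgs.cuhk.edu.hk/rws_prd_applx2/Public/tt_dsp_timetable.aspx"

with requests.Session() as s:
    # GET the form first: WebForms embeds hidden state fields
    # (__VIEWSTATE, __EVENTVALIDATION, ...) that must be sent back.
    soup = BeautifulSoup(s.get(URL).text, "html.parser")
    data = {
        inp["name"]: inp.get("value", "")
        for inp in soup.select("input[type=hidden]")
    }
    data["ddl_subject"] = "COMP"  # hypothetical control name and value

    result = s.post(URL, data=data)
    print(result.status_code, len(result.text))
```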
I'm trying to describe it in as few steps as possible:
I have Page1.aspx with a lot of controls, among them Preview and Save buttons. I also have Page2.aspx, which is the redirection target of a Preview button click.
Since I need all the control selections from Page1 to draw the preview on Page2, the redirection is done by setting the Preview button's PostBackUrl.
I also must have the preview shown in a new tab or window, so I used onClientClick="aspnetForm.target='_blank'" in the Preview button definition.
The Save button's click callback, after storing the data to a database, redirects to some Page0.aspx (the initial list of reports - the subject of the code).
The Preview button works fine - a preview renders in a new tab - but when I go back to the old tab and click Save, I see in the debugger that first Page2.aspx(?) and then Page1.aspx are loaded. All the data is stored in the DB, but although the Page0 redirection is executed, Page1.aspx stays loaded in the browser.
I have no idea what processes are behind this. Could someone who knows give me some insight? Or, if you consider my approach impossible to implement, suggest another way to achieve the same thing?
If it's of importance, everything on Page1 is located in an UpdatePanel.
Thank you very much for replying
In ASP.NET there are basically zero (0) circumstances in which you will ever send form data from one page to another. Although what exactly you are trying to accomplish is vague, you can consider some of the following:
Isolate unique operations/systems to a single page. If you have something like a User Profile, don't have three different aspx pages; just use a single page for the user or admin to manage that data / functions. Postback events are your friend.
Understand the difference between ViewState and traditional form data. I'm guessing that if you're trying to post form data from one page to another, you probably don't understand the point of ViewState. Using a single page to maintain temporary data that the user is currently working with is a great use for ViewState. If you want the data to appear on another page then you need to consider the data from the previous page as final and thus should be saved to a database or some other medium.
These are just some general guidelines because there is no exact answer to your problem without saying something generic like "You're doing it wrong." I would recommend starting by never again trying to post form data from one aspx page to another.
I have a performance issue with a two-page setup that is part of a workflow in a bigger system. This section is dedicated to rendering reports, allowing users to choose their own parameters.
Page1.aspx collects parameter information for a report. It takes the information submitted on a form and validates it. If it validates OK, it stores the selections in the DB as XML, then redirects to Page2.aspx with the run id in the query string. Simple enough, performance is great.
Page2.aspx pulls the ID out of the DB and hydrates a Crystal ReportDocument object (taking milliseconds) then we call ExportToHttpStream which then renders the report as a PDF or DOC or XLS download (output format is determined in Page1.aspx). The performance of the ExportToHttpStream method is very poor due to the way our reports are written and DB indexes on the target system. This is outwith my control at the moment but I am promised that they are being worked on.
So the problem is that when the submit button on Page1.aspx is pressed, the user experiences a very long delay before the download starts. It is then compounded by the user pressing the submit button again, thinking there is a problem.
I think what I need to do is have Page1.aspx redirect to Page2.aspx. Page2.aspx should render the master page furniture and a loading div, and the report should somehow render asynchronously in the background before the save dialogue automatically pops up. After this I'd like to change the loading div to a 'Report generated, click here to go back' message.
If this is the best way to achieve this, how can I load a full page, then request the report asynchronously? I'm open to any suggestions here.
You could use ajax to load the report on Page2.aspx and show a loading message while it's processing.
Look at the jQuery.load() method. This might be the easiest way to accomplish what you are trying to do.
Page1.aspx - collect parameters
Page2.aspx - report view, calls Page2Details.aspx via ajax.
Try loading Page2.aspx inside an iframe and use jQuery to display a waiting indicator, hiding it once Page2.aspx has finished downloading.
Whilst both answers gave me some ground to go and research in the right direction, my solution ended up using the fileDownload plugin from John Culviner to facilitate a similar approach:
jQuery fileDownload by John Culviner
This allowed me the following page structure:
Page1.aspx, gathers and validates parameters for the report and puts them into Oracle.
Page2.aspx, which is passed the run id (a pointer to the parameters in the DB) via the query string, sets up three hidden divs: Loading, Error and Success.
The script mentioned above is employed at this point. jQuery first makes the Loading div visible, then calls the plugin. The plugin dynamically creates an iframe and downloads the binary (xls/doc/pdf) from Page3.aspx, then fires a success or failure callback. The success callback is triggered by a cookie set at the end of the response from Page3.aspx.
I believe the plugin downloads using a 'text/plain' AJAX call in jQuery, working around the fact that there is no octet-stream equivalent in AJAX.
It works; it's not the cleanest solution by any means, and it doesn't degrade gracefully one bit, but it provides the users on our controlled intranet with an extremely responsive and pleasing UI.
I have an ASP.NET application with a search page, with the criteria and result display on the same page. I want to keep a copy of the populated search page and redistribute it later to the same user, upon a button click on another page. It's a kind of "return to search" button. How can I do that?
Here is some context:
The search criteria is made up of some basic controls, and the results are then (after postback) displayed in a GridView. I also have a master page. Simple as that.
Now consider the following scenario: the user can investigate the results by clicking links that show detail pages, and can drill down through quite a few detail pages with associated data. To get back to the search results he/she needs to click the browser's back button many times.
I would like to provide a "Back to search" button on the master page that allows to return to the populated search page with one click.
Note:
I cannot use the browser history in any way, because this must also work when the user has opened one of the detail views in another tab.
I have seen Keeping the Viewstate persistent and retrieve it on demand, but I hope there is an easier solution, because my grid is paginated and I also have more than one search page, of which I would like to return to just the last one used.
Thanks, Marcel
I can offer some logical ways to resolve this problem without using specialized ASP.NET features, if such exist:
1) Is there some way to save the search criteria in the GET request (the query string)? That way they are preserved as the user moves between pages.
2) Another way is to cache the search pattern (with all the filters and whatever else you need) somewhere - in a database, for example - and include a key in the GET request that points to this pattern.