I have a lot of different scrapers, but all of them work with server-rendered pages or parse responses from API endpoints.
Now, though, I have two very specific websites to scrape:
First.
A single page: we have to click a search button to get the first 10 items, and click a "Next" button to get the next 10. After 2-3 seconds the data in the search section is re-rendered. On clicking "Next" I only get opaque, unparsable data back from the Vaadin service, so the data can only be parsed from the rendered HTML page.
Second.
The same kind of single page with the same principles (click the search button to get the initial data, click the Next button to load more). But additionally I need to click on every item to get all the data to scrape (I scrape some data from the rendered search results, plus more from the modal window that opens after clicking each search result item).
Question: is it possible to scrape such websites with Scrapy and Splash? I know about Selenium, but it's quite heavy and slow, so I need another solution. I've never worked with Splash, but if I'm not mistaken it's possible to imitate clicks via a Lua script.
I would suggest avoiding Splash and instead reproducing the underlying requests.
The main issue I see with the Splash route is that, if there is no URL that reaches any page other than the first one, and since Splash (as far as I know) does not support resuming a previous rendering, each request to Splash would need to run a Lua script that clicks the search button, then clicks Next, waits, and repeats for N pages just to reach page N.
If reproducing the requests is out of the picture for some reason, using an interactive headless browser (Selenium, Puppeteer) instead of a rendering service (Splash) may be a better fit.
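For illustration, a rough sketch of the headless-browser route with Puppeteer might look like the following. The URL, selectors and page count are assumptions; the fixed 3-second wait mirrors the 2-3 s re-render described in the question, though waiting on a DOM change would be more robust.

    // Hedged sketch: click the search button, then page through results by
    // clicking "Next" and scraping the rendered HTML each time.
    const puppeteer = require('puppeteer');

    (async () => {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      await page.goto('https://example.com/search');        // placeholder URL

      await page.click('#searchButton');                    // assumed selector
      await page.waitForSelector('.search-result');         // assumed selector

      const results = [];
      for (let i = 0; i < 5; i++) {                         // first 5 pages as an example
        const items = await page.$$eval('.search-result', nodes =>
          nodes.map(n => n.textContent.trim()));
        results.push(...items);

        await page.click('#nextButton');                    // assumed selector
        await new Promise(resolve => setTimeout(resolve, 3000));  // wait out the re-render
      }

      console.log(results);
      await browser.close();
    })();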
Related
I am trying to scrape a page and I have all the code set up, but I got stuck on the "load more" button. The page is simple: it has items, but only a few are shown at first. In a web browser, to view the rest of the items you click an HTML button with an onClick handler; each press loads more items until all of them are on the page, at which point the button disappears. Right now I send a request, store the response in a variable, and have BeautifulSoup parse it. How would I go about loading the rest of the items into that variable? Should I be taking a different approach?
Yes, you have to take a different approach. Let me explain why.
A "Load more" button usually triggers a new request to the site's API, and JavaScript then renders the received data into the page you are looking at. BeautifulSoup cannot drive such interactions - you have to implement the paging logic yourself.
You have two approaches in this case:
Use a scraper (or write code) that can evaluate JavaScript (a webdriver such as Selenium, Puppeteer, etc.) and script the clicking and crawling of the resulting DOM;
Investigate the API that the "load more" button calls. If that API is transparent and easy to use, it is usually possible to crawl all the needed data through it directly (with nothing more than the requests module) - see the sketch below.
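The second approach above mentions Python's requests module; the same idea expressed with Node's built-in fetch looks roughly like this. The endpoint, parameter names and response shape are assumptions - find the real ones in the browser's Network tab when you click "load more".

    // Hedged sketch: page through a hypothetical "load more" endpoint until
    // it returns an empty batch, collecting every item along the way.
    const BASE = 'https://example.com/api/items';   // hypothetical endpoint

    async function fetchAllItems() {
      const items = [];
      let page = 1;
      while (true) {
        const resp = await fetch(`${BASE}?page=${page}&size=20`);
        const batch = await resp.json();             // assumed to be a JSON array
        if (batch.length === 0) break;               // no more items: stop paging
        items.push(...batch);
        page += 1;
      }
      return items;
    }

    fetchAllItems().then(items => console.log(items.length, 'items'));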
Hello, I'm trying to scrape data from https://eservicios2.aguascalientes.gob.mx/sop/geobras/UI/frmObrasTodas.aspx
I can get the data from the main page, but I don't know how to get the data from the form:
a) when I choose a row and ask for "Detalle" (detail), it goes to a form;
b) I don't know how to follow that link.
I need to get the data from each row. Can anybody help me?
The main issue and problem is that this is an ASP.NET website. When you select a row, that very likely fires a server-side event. You MIGHT be able to write some JavaScript to select a row, but the next step is even more of a challenge: once a row is selected you have to click a button, and that button runs server-side code which reads the selected row value - again on the server. This is unlike a simple website driven by hyperlinks.
ASP.NET sites are driven by VB.NET or C# code behind; they don't rely on hyperlinks or parameters in the URL for this kind of navigation.
So, after you select a row (perhaps possible in JS), you would then have to click the details button. This again can be done with JavaScript.
Say, in jQuery, like this:
$('#NameOfButton').click();
Because of this, ASP.NET sites don't use the simple HTML markup you may be used to. There are no "links" for each row - only code on the server that pulls the data from the database, renders it, and THEN sends it down as HTML markup.
The bottom line?
The site is not simple HTML with hyperlinks that you click on. When you click that button, the code behind (written in C# or VB.NET) runs on the server; there is no markup or client-side JavaScript you can follow to reach the data.
This means aspx websites are driven by code behind and, as a result, are rather difficult to scrape in an automated fashion. You can grab the page you are on, but since there are no hyperlinks to the additional data (such as the details), you don't have a simple URL to follow.
Worse yet, the setup code that runs when you select a single row also has to execute; only if all values are set up correctly BEFORE you hit the "details" button will this work. And note that the details page has no parameters in its URL, so not only does the right code behind have to run before the second page loads, the second page very likely also checks that the request came from the same site - you can NOT just type in a URL for the details page and expect it to work.
In fact, if you look even closer, when you hit the details button the page reloads and renders what is clearly a whole new page and layout - yet the URL does not change, and they are not even using an iframe.
That is because the site uses a server-side redirect. The tell-tale sign is that the URL stays the same while the page layout is completely different: the server navigated to a whole new page and sent it down to the client, but because the browser did not initiate that navigation, the address bar never changes. The server can send anything it wants to the client - including a whole new page - and you never see the URL change.
Again, this is typical of ASP.NET systems in which server-side code drives the website and there is not much client-side code.
You "might" be able to automate scraping. But you would need some custom code to select a given row, and then some code to click the details button. And that's going to be a REAL challenge, since any changes to the web page code (by you) also tend to be check for, and not allow server side.
Failing that, the only practical scraping approach may be to use desktop tooling that hosts a WHOLE instance of the web browser, lets you (the user) navigate to the page that displays the data, and then provides a "capture" button in your application that reads and parses out the data the way you are already doing for the main page.
I am completely new to web development. The question I have is rather simple (I guess), but after multiple hours of googling and experimenting I am still without a solution. The problem is probably not how to do it, but which keywords to search for.
I want to create a simple website (for testing I use Caddy Server). The website is a simple index.html file. On it I want to have 9 buttons, each of which becomes disabled once clicked. After refreshing the page, every client should also see the changes, so the button state has to be stored somewhere on the server.
There will also be another button that resets the page to its initial state (all buttons enabled). The purpose of the page is that 2 people can click buttons in turn until only one button is left enabled (the page reloads itself every second on every client). This will be used to select a map from a pool of 9 maps.
My main problem is storing the button states, so that after refreshing the page the buttons are still disabled if they were clicked, and all clients see them as disabled when they refresh. Do I have to implement a database for this, or store the button states in XML or JSON? Do I need JavaScript, jQuery, PHP or Ajax for this? I don't want to make it very complicated; if I need, for example, a full database, I will probably just give up.
What I'm asking for: any pointer in the right direction on how to implement a simple button that keeps its state after reloading the page would be much appreciated. I found a jQuery solution for this, but it does not work for me (the button does not preserve its state after refreshing; see here).
Thank you so much for any help!
Your server will need a data store (database) to save the desired value for each button.
Client Side
Set the disabled attribute on all relevant buttons in your HTML. On page load (client side), fetch the values from your server (database) and, depending on what comes back, call removeAttribute("disabled") on each button accordingly.
Server Side
Have your server set the disabled attribute on the HTML <button> elements, based on the values in your database, before serving the HTML to your client(s).
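A minimal client-side sketch of that idea, assuming a hypothetical /api/buttons endpoint that returns the current state of each button as JSON (e.g. {"1": true, "2": false, ...}, where true means still enabled); the map-button-N ids are placeholders:

    // On page load, fetch the stored states and enable/disable the buttons.
    window.addEventListener('DOMContentLoaded', function () {
      fetch('/api/buttons')                          // assumed endpoint
        .then(function (resp) { return resp.json(); })
        .then(function (states) {
          Object.keys(states).forEach(function (id) {
            var button = document.getElementById('map-button-' + id);  // assumed ids
            if (!button) return;
            if (states[id]) {
              button.removeAttribute('disabled');
            } else {
              button.setAttribute('disabled', 'disabled');
            }
          });
        });
    });

A click handler would then POST the clicked button's id back to the same endpoint so every other client sees it disabled on its next refresh, and the reset button would POST a request that re-enables everything.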
I have a performance issue with a 2-page setup that is part of a workflow in a bigger system. This section is dedicated to rendering reports, allowing users to choose their own parameters.
Page1.aspx collects parameter information for a report. It takes the information submitted on a form and validates it. If it validates OK, it stores the selections in the DB as XML, then redirects to Page2.aspx with the run id in the query string. Simple enough, performance is great.
Page2.aspx pulls the ID out of the DB and hydrates a Crystal ReportDocument object (taking milliseconds), then we call ExportToHttpStream, which renders the report as a PDF, DOC or XLS download (the output format is determined in Page1.aspx). The performance of the ExportToHttpStream method is very poor due to the way our reports are written and the DB indexes on the target system. This is outwith my control at the moment, but I am promised that they are being worked on.
So the problem is that when the submit button on Page1.aspx is pressed, the user experiences a very long delay before the download starts. This is then compounded by the user pressing the submit button again, thinking there is a problem.
I think what I need to do is have Page1.aspx redirect to Page2.aspx, have Page2.aspx render the master page furniture and a loading div, and have the report render asynchronously in the background before the save dialog automatically pops up; after this I'd like to change the loading div to a 'Report generated, click here to go back' message.
If this is the best way to achieve this, how can I load a full page, then request the report asynchronously? I'm open to any suggestions here.
You could use ajax to load the report on Page2.aspx and show a loading message while it's processing.
Look at the jQuery.load() method. This might be the easiest way to accomplish what you are trying to do.
Page1.aspx - collect parameters
Page2.aspx - report view, calls Page2Details.aspx via ajax.
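A rough sketch of that structure on Page2.aspx; the container id, the "please wait" markup and the parameter name are assumptions:

    // Show a waiting message, then pull the report markup in from
    // Page2Details.aspx via Ajax and replace the message with it.
    $(function () {
      var runId = new URLSearchParams(window.location.search).get('runid');
      $('#reportContainer').html('<p>Generating report, please wait...</p>');
      $('#reportContainer').load('Page2Details.aspx?runid=' + runId, function () {
        // Optionally swap in a "Report generated, click here to go back" link here.
      });
    });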
Try loading Page2.aspx inside an iframe and use jQuery to display a waiting indicator, hiding it after Page2.aspx finishes downloading.
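A rough sketch of the iframe idea; the indicator id is an assumption, and note that the iframe's load event may not fire for responses served as attachments, which is one reason the accepted answer below relies on a cookie instead.

    // Show a waiting indicator, load Page2.aspx in a hidden iframe, and hide
    // the indicator when the iframe reports that it has loaded.
    var runId = new URLSearchParams(window.location.search).get('runid');
    $('#waitIndicator').show();
    $('<iframe>', { src: 'Page2.aspx?runid=' + runId })
      .hide()
      .on('load', function () { $('#waitIndicator').hide(); })
      .appendTo('body');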
Whilst both answers gave me some ground to go out and research in the right direction, my solution ended up using the fileDownload plugin from John Culviner to facilitate a similar approach:
jQuery fileDownload by John Culviner
This allowed me the following page structure:
Page1.aspx gathers and validates the parameters for the report and puts them into Oracle.
Page2.aspx, which is passed the runid (a pointer to the parameters in the DB) via the query string, sets up 3 hidden divs: Loading, Error and Success.
The script mentioned above is employed at this point: jQuery first makes the Loading div visible and then calls the plugin. The plugin dynamically creates an iframe and downloads the binary (xls/doc/pdf) from Page3.aspx, then fires a success or failure callback. The success callback is triggered by means of a cookie set at the end of the response from Page3.aspx.
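The flow described above looks roughly like this; the div ids are assumptions, and the callback option names vary between versions of the fileDownload plugin.

    // Show the Loading div, start the download via the plugin, and switch to
    // the Success or Error div depending on which callback fires.
    $(function () {
      var runId = new URLSearchParams(window.location.search).get('runid');
      $('#divLoading').show();

      $.fileDownload('Page3.aspx?runid=' + runId, {
        successCallback: function () {
          // Fired when the plugin sees the cookie set at the end of the response.
          $('#divLoading').hide();
          $('#divSuccess').show();
        },
        failCallback: function () {
          $('#divLoading').hide();
          $('#divError').show();
        }
      });
    });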
I believe the plugin downloads using a 'text/plain' AJAX call in jQuery, avoiding the limitation that there is no octet-stream equivalent in AJAX.
It works; it's not the cleanest solution by any means and it doesn't degrade one bit, but it provides the users on our controlled intranet with an extremely responsive and pleasing UI.
In my ASP.NET application I have a requirement that when a user clicks a UI element we generate a PDF for them to download. This is currently implemented by doing a form post to an ashx page. That page inspects the form and then executes the correct server-side page, which results in either HTML or a PDF document of that page's HTML.
On the client I know ahead of time whether we are going to get a PDF or HTML. When it's HTML, I open a new window and direct the form post to that window, and all works well. When it's a PDF, I don't change the target for the form and it remains on the current page.
This works: the user is presented with a save dialog, and the current page is not changed or lost.
The problem is that generating the PDF takes anywhere from 1-15 seconds. What I want to do is pop up a please-wait dialog. Displaying the popup is going to be easy; what I'm not sure of is how I know when to close it. The popup will be a div in the current page.
The popup can have a client-side timer which polls the server for task completion. The long-running server task should update its progress in a database table or a server cache object that the polling service can read.
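For example, a minimal client-side poll might look like this; the TaskStatus.ashx handler, its response shape and the popup id are assumptions - on the server it would simply read the progress row or cache entry mentioned above.

    // Poll a hypothetical status endpoint once a second and close the
    // please-wait popup when the server reports the task has finished.
    function pollForCompletion(taskId) {
      var timer = setInterval(function () {
        $.getJSON('TaskStatus.ashx', { taskId: taskId }, function (status) {
          if (status.done) {            // assumed response: {"done": true}
            clearInterval(timer);
            $('#pleaseWaitPopup').hide();
          }
        });
      }, 1000);                         // poll once a second
    }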
A couple of old articles from MSDN Magazine; you should be able to apply the same concepts with newer libraries like ASP.NET Ajax:
Reporting Task Progress With ASP.NET 2.0
Simplify Task Progress with ASP.NET "Atlas"
Just have some JavaScript on the client side that shows an animated GIF for 1-15 seconds (your choice) and closes it after the designated time.
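For instance (the id and the 10-second figure are placeholders):

    // Show the spinner, then simply hide it again after a fixed delay.
    $('#loadingGif').show();
    setTimeout(function () { $('#loadingGif').hide(); }, 10000);   // 10 s - pick your own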
Gulzar's suggestion was spot on. I have a simple Ajax-enabled WCF service which checks a session variable. My ashx page sets the variable to false when it starts processing and then to true when it's done.
I think there might be a race condition if the client checks before we set the session item to false; however, there are ways around that if we modify the service to set the session item to false after a client gets an "I'm done" response.
The trick is still going to be figuring out what the interval on the client should be. If we set it too low, the user could save the file and then still see the "processing" message. I'm debating between half a second and a second; anything less than half a second seems unnecessary.
You said:
When it's a PDF I don't change the target for the form and it remains on the current page.
If that is the case, then the original page will be gone when the PDF is opened. In that situation I would have a loading animated GIF and open it with JavaScript in a div overlaying the rest of the page. You would not need to close it, so no timer or polling is needed; it would simply be gone when the page is gone.