Scrapy next page returns unrelated results - web-scraping

Trust you are doing well!
I'm scraping some web pages, and when I try to go to the next page I can't, because the next page's results have nothing to do with what I searched for on the first one.
An example:
First page, searching for: https://www.mister-auto.es/buscar/?q=corteco
Second page: https://www.mister-auto.es/buscar/?page=2
The problem I have is that the results on the second page bear no relation to what I searched for.
I'm using CrawlSpider with a LinkExtractor to follow the next-page links.
Could you give me a hand?
Thank you very much for your support.

The website you're scraping is dynamic: notice that the page-2 URL has dropped your q=corteco search term, so the pagination links your LinkExtractor follows are no longer tied to your search.
What you want is a tool like Puppeteer or Selenium that renders the page dynamically, clicks the pagination buttons, and extracts the content you want. Scrapy is a great tool for certain jobs, but it has its limitations here.
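For example, here is a minimal Selenium sketch in Python. The CSS selectors for the result titles and the next-page control are assumptions; inspect the actual page and adjust them before relying on this.

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("https://www.mister-auto.es/buscar/?q=corteco")

for _ in range(3):  # walk the first three result pages
    # Assumed selector for a product title on the results page
    for item in driver.find_elements(By.CSS_SELECTOR, ".product-title"):
        print(item.text)
    # Clicking the pager keeps the search context alive in the browser,
    # unlike following the bare ?page=2 URL
    next_links = driver.find_elements(By.CSS_SELECTOR, "a.next")  # assumed selector
    if not next_links:
        break
    next_links[0].click()

driver.quit()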

Related

Web scraping from a Google search page using HTML tag

I'm trying to do a Google search and get the first 5 results (title/URL) into an Excel document.
I tried using 'Data Scraping', but depending on the search term, Google will display a different page. Sometimes it will have videos, images, or related search terms. So most of the time I was not able to get all the results from the page, as UiPath would not recognize them, probably because of the different divs. My thought was to get them by HTML tag, since every title uses an H3, but I can't find a way to do that.
I also tried Find Children > Get Attributes, but with no success. I feel that might be the best way, though; I'm just not experienced enough with it to make it work. I tried for hours.
Has anyone had a similar problem and found a solution?
When I did this before, I had to do multiple scrapes to get the data. The first scrape gets the initial page of results, and then you can do a second to get the data from page 2 onward. I have had instances where I had to do multiple scrapes on the first page to get all the information, but after page 1 the data is consistent and easy to scrape. Hope this helps.
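If UiPath keeps tripping over the varying layout, the "grab every H3" idea is easy to test outside UiPath first. A rough Python sketch; Google's markup changes often and it may block automated clients, so treat the headers and parsing here as assumptions:

import requests
from bs4 import BeautifulSoup

resp = requests.get(
    "https://www.google.com/search",
    params={"q": "web scraping"},
    headers={"User-Agent": "Mozilla/5.0"},
)
soup = BeautifulSoup(resp.text, "html.parser")

# Every organic result title sits in an h3; take the first five
for h3 in soup.find_all("h3")[:5]:
    link = h3.find_parent("a")  # the title is usually wrapped in its result link
    print(h3.get_text(strip=True), link["href"] if link else "")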

Data being hidden and class names regenerated when scraping a web page using Beautiful Soup

I am trying to pull pricing data from a website, but each time the page is loaded, the class is regenerated to a different sequence of letters, and the price shows up blank instead of as a number. Is there a technique I can use to bypass this in any way? Thanks! Here are the lines of HTML as they appear when I inspect the element:
<div class="zlgJQq">$</div>
<div class="qFwqmC hkVukg2 njGalW"> </div>
Your help would be much appreciated!
Perhaps that website is actively discouraging you from scraping their data; that would explain the apparently random class names. You might want to read their terms of use to be sure that it's OK to scrape their site.
However, if the raw HTML does not contain the price data but it is visible when the page is rendered, then it's likely that JavaScript is inserting the prices after the page has loaded. Try enabling the developer tools in your browser and monitoring the network activity while the page loads. That might reveal that the site uses dynamic Ajax queries to populate the price data, in which case you can write code to interact with the Ajax resource directly.
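For example, if the network tab shows the prices arriving from a JSON endpoint, you can often call it directly. A hedged Python sketch; the endpoint URL and field name below are hypothetical, so substitute whatever you actually see in devtools:

import requests

resp = requests.get(
    "https://example.com/api/products/12345/price",  # hypothetical endpoint
    headers={"X-Requested-With": "XMLHttpRequest"},  # some sites check for this header
)
data = resp.json()
print(data["price"])  # hypothetical field name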
It's also possible that the price data is embedded somewhere in the HTML, possibly obfuscated, and then inserted dynamically by JavaScript.
Those are just a couple of suggestions; you will need to analyse the site to see whether automated scraping is feasible. If you can tell us which website you're dealing with, someone might be able to suggest something more specific.

WordPress code causing 404 errors

I am receiving 404 errors (showing in Google Search Console) that somehow relate to Facebook.
e.g. http://www.beerandcroissants.com/staying-in-mykonos-myconian-k-hotel/room-at-myconian-k-hotel-mykonos/%22https:/www.facebook.com/pages/Beer-and-croissants/1423705111261254
What seems to be happening is that one link (part of the whole link above), which relates to a photo on my blog, has my Facebook page URL appended to it. If I take the first part of the link above (before the Facebook part starts), I get a perfectly good link through to my site. If I take the second part (where the Facebook link starts), it takes me to my FB page. Again, fine.
Why are these two linking together like this? It seems to be what is causing the 404s. Is it something in my settings? It has only just started happening.
Going to Facebook directly and clicking on my post links takes me to the correct part of my blog.
I am not sure how to fix this. There are no broken links attached in Google for me to view either. They keep happening every day; I now have 295 crawl errors and growing.
Would appreciate any help that can be given to lead me in the right direction.
I've had this as a suggestion, but I don't know where to look for this code:
I'd say it may be coming from the following code on staying-in-mykonos-myconian-k-hotel/room-at-myconian-k-hotel-mykonos/ (it's also likely this is carried through the entire site):
https://www.facebook.com/pages/Beer-and-croissants/1423705111261254">
Note the " in the URL before https://www.facebook.
I suggest you go through your code looking for similar issues.
Could someone please assist me if they are able?
Greatly appreciated.
Thanks,
Kerri
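For what it's worth, a stray quote like that usually means a theme or plugin template is emitting a malformed href. A hypothetical reconstruction of the kind of markup that could produce these concatenated URLs:

<!-- One plausible broken form: a stray quote embedded inside the attribute
     value, which crawlers URL-encode as %22 and treat as part of one link -->
<a href="http://www.beerandcroissants.com/staying-in-mykonos-myconian-k-hotel/room-at-myconian-k-hotel-mykonos/&quot;https://www.facebook.com/pages/Beer-and-croissants/1423705111261254">Facebook</a>

<!-- Fixed: one properly quoted href per link -->
<a href="https://www.facebook.com/pages/Beer-and-croissants/1423705111261254">Facebook</a>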

Fill out a search on a website and screen-scrape the result in R

This is my first post, so if my question is too vague or unclear, please tell me so.
I'm trying to scrape a website with news articles for a research project, but the link to the modified search on that webpage won't work, because the intranet authentication spits out an error.
So my idea was to fill out the search form and use the resulting link to scrape the website.
Since my boss likes to work with R, he would like me to write an R script to do so, but I have no idea how, and I haven't found anything that works.
You need two packages: RCurl and XML.
The RCurl package is used for internet browsing. It can access HTML forms with GET or POST arguments, so with it you can log in or fill out any form.
The output from the server will be HTML. If you want to grep the links out of it, you can use the XML package; it helps to extract any data from XML or HTML format.
But before you start, you have to find out where the search form is in the webpage (and what arguments it expects). The Firefox browser can be useful here, with two add-ons: Live HTTP Headers and Firebug. With those you can inspect the webpage much more easily.
I know this does not fully solve your problem, but I can't say more, since it depends on the particular situation and webpage structure. I believe the tools I have mentioned are quite enough to achieve what you want; a small sketch to get you started follows.
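A minimal sketch with RCurl and XML. It assumes the search form submits a single q field via GET; the URL and field name are placeholders, so substitute what Live HTTP Headers shows you for the real form.

library(RCurl)
library(XML)

# Submit the search form; getForm() sends the fields as GET arguments
# (use postForm() instead if the form method is POST)
html <- getForm("https://example.com/search", q = "my search term")  # placeholder URL/field

# Parse the returned HTML and pull out all the link targets
doc   <- htmlParse(html, asText = TRUE)
links <- xpathSApply(doc, "//a/@href")
head(links)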
Best regards.

Picture or photo viewer on my Web site

Let's say I let the customer upload up to 5 pictures. I'm looking for a good way to let visitors see the images one by one.
I've seen some where there are thumbnails on the side/bottom (looking like a vertical/horizontal film strip) and the default picture is the large one displayed, and viewers can click the others to show those pictures.
This could possibly be an AJAX solution. I just couldn't come up with the right keywords to Google this custom Web component. Perhaps it is "photo gallery". But I would be more interested to know what solutions developers here use for their sites.
Perhaps lightbox is the keyword you're looking for: http://www.google.com/search?q=lightbox
Is something like Galleria what you're after?
It's all implemented in JavaScript, so it's simple to integrate.
There are several options, but on first thought I would reach for the fancybox jQuery plugin. The third example on their home page does exactly what you described. I've used this plugin a few times now and it's quite good.
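If you go the fancybox route, the usual pattern is a set of thumbnail links grouped into one gallery. A hedged sketch (the script/stylesheet paths and image file names are placeholders):

<link rel="stylesheet" href="jquery.fancybox.css" />
<script src="jquery.min.js"></script>
<script src="jquery.fancybox.js"></script>

<!-- Thumbnails that open the large images in an overlay; the shared
     rel value groups them into one browsable gallery -->
<a rel="gallery" href="photo1-large.jpg"><img src="photo1-thumb.jpg" /></a>
<a rel="gallery" href="photo2-large.jpg"><img src="photo2-thumb.jpg" /></a>

<script>
  $("a[rel=gallery]").fancybox();
</script>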
