I'm trying to learn how to use WebScraper.io so I can build a database from a webpage's data (for a project). But when I follow along with the video, I can't get the scrape to go through the pages, because my URL format is different from the one in the video.
Video Example
www.webpage/products/laptops?page=[1-20]
The webpage I want to scan
www.webpage/products/laptops/page/2/
So how would I create the Start URL for Web Scraper to go through the 20 pages?
When I try to use the example from the video, it only scans one page of my chosen webpage.
I have tried variations like
www.webpage/products/laptops/page/page=[1-20]/
www.webpage/products/laptops/page=[1-20]/
www.webpage/products/laptops?page=[1-20]/
but none of them seem to work. I'm stuck.
Could anybody provide me with any advice?
Thank you.
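If I understand Web Scraper's docs correctly, the `[1-20]` range token isn't tied to query strings; it can sit in a path segment too, so a Start URL like `www.webpage/products/laptops/page/[1-20]/` should work. As a sketch (plain Python, using the placeholder URL from the question), this is the expansion such a range produces:

```python
import re

def expand_url_range(pattern):
    """Expand a WebScraper.io-style [start-end] range token into a URL list.

    The page number can live in a query string (?page=[1-20]) or in a
    path segment (/page/[1-20]/); the expansion is the same either way.
    """
    match = re.search(r"\[(\d+)-(\d+)\]", pattern)
    if not match:
        return [pattern]
    start, end = int(match.group(1)), int(match.group(2))
    return [pattern.replace(match.group(0), str(n)) for n in range(start, end + 1)]

# Path-based pagination: the range token goes where the page number sits.
urls = expand_url_range("http://www.webpage/products/laptops/page/[1-20]/")
print(urls[0])   # http://www.webpage/products/laptops/page/1/
print(urls[-1])  # http://www.webpage/products/laptops/page/20/
```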
I have followed the tutorial on this website https://www.wp-tweaks.com/display-a-single-cell-from-google-sheets-wordpress/ which lets you dynamically display values from a Google spreadsheet on a WordPress page using a simple shortcode:
[get_sheet_value location="Cell Location"]
This solution worked seamlessly until a single page contained hundreds of those shortcodes (I basically need the whole content of the page to be editable via the spreadsheet). I started getting a 100% error rate by API method (based on the Google metrics), and the content was no longer displayed properly. I realize that sending hundreds of read requests on each page load is not ideal, will inevitably affect load performance, and that Google imposes quota limits too. Is there a way around this issue? For example, by pulling the values from the Google spreadsheet only once a day. Unfortunately, I don't have much coding experience, but I'm open to all solutions.
Thanks in advance!
You could publish the sheet to the web and embed it to your website:
In your sheet, go to File > Publish to the web
In the window that appears, click Embed.
Click Publish.
Copy the code in the text box and paste it into your site.
To show or hide parts of the spreadsheet, edit the HTML on your site.
It would look like this:
<iframe src="https://docs.google.com/spreadsheets/d/e/2PACX-1vR3UbHTtAkR8TGNtXU3o4hzkVVhSwhnckMp7tQVCl1Fds3AnU5WoUJZxTfJBZgcpBP0VqTJ9n_ptk6J/pubhtml?gid=1223818634&single=true&widget=true&headers=false"></iframe>
You could try reading the entire spreadsheet as a JSON file and parse it within your code.
https://www.freecodecamp.org/news/cjn-google-sheets-as-json-endpoint/
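To make that concrete, here is a hedged Python sketch of the JSON approach, assuming the sheet is published and using the gviz endpoint the article describes (Google wraps the JSON in a JavaScript call, so it has to be stripped first; `sheet_id` and `gid` are placeholders). The idea is to fetch once on a schedule, cache the result, and serve the cached values instead of issuing hundreds of API reads per page load:

```python
import json
import urllib.request

def parse_gviz(payload):
    """Strip the JavaScript wrapper Google puts around the gviz JSON payload."""
    start = payload.index("{")
    end = payload.rindex("}") + 1
    return json.loads(payload[start:end])

def fetch_sheet(sheet_id, gid="0"):
    """Fetch a published sheet once and return its rows as lists of cell values."""
    url = (f"https://docs.google.com/spreadsheets/d/{sheet_id}"
           f"/gviz/tq?tqx=out:json&gid={gid}")
    with urllib.request.urlopen(url) as resp:
        data = parse_gviz(resp.read().decode("utf-8"))
    # Empty cells come back as null, so guard before reading the value key.
    return [[(c or {}).get("v") for c in row["c"]] for row in data["table"]["rows"]]
```

On the WordPress side you would run something like this from a daily cron job and write the rows to a transient or an option, so page loads never touch Google at all.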
Hi!
Could someone tell me why, using requests.get(url) with different URLs, I am getting the same page? The story is:
I am scraping a webpage to see products by brand. So I am generating URLs based on the brand list on the page (I retrieve them with XPath).
For other pages this works; however, for this one it does not.
So I wonder: maybe there is some kind of scraping protection on the page? When I paste those generated URLs into Chrome, it gives me the page with the specific brand's products I need. However, with requests.get I end up at the same page.
Also, maybe you can share some easy-to-grasp info on how requests works? How does it reach the page source?
A zillion thanks to contributors!
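Not a full answer, but one common cause worth checking: requests runs no JavaScript and, by default, announces itself with a python-requests User-Agent, which some sites answer with a generic page or a redirect. A small sketch (the URL is a placeholder) showing the default header and how to send browser-like ones instead:

```python
import requests

# requests runs no JavaScript and identifies itself as "python-requests/x.y.z";
# some sites serve every such client the same generic page.
session = requests.Session()
prepared = session.prepare_request(
    requests.Request("GET", "https://example.com/brand/acme")
)
print(prepared.headers["User-Agent"])  # python-requests/<version>

# Browser-like headers often help when every URL "lands on the same page":
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language": "en-US,en;q=0.9",
}
# resp = session.get("https://example.com/brand/acme", headers=headers)
# resp.history lists any redirects; resp.url is where you actually ended up.
```

Comparing `resp.url` with the URL you requested is a quick way to confirm whether the site is silently redirecting non-browser clients.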
I trust you are doing well!
I'm scraping some web pages, and when I try to go to the next page I'm not able to, because the next page's results have nothing to do with what I searched for on the first one.
An example:
First page, searching for: https://www.mister-auto.es/buscar/?q=corteco
Second page: https://www.mister-auto.es/buscar/?page=2
The problem I have is that the results on the second page make no sense given what I searched for.
I'm using CrawlSpider with LinkExtractor to go to the next page.
Could you give me a hand?
Thank you very much for your support.
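One thing worth checking before reaching for a headless browser: the second-page URL in the question has dropped the `q=corteco` term, so the server no longer knows what was searched for. If the site accepts both parameters together (an assumption worth testing in a browser first), the links your LinkExtractor yields can be rewritten so every page keeps the search term:

```python
from urllib.parse import urlencode

def page_url(base, query, page):
    """Build a search-results URL that keeps the search term on every page."""
    return f"{base}?{urlencode({'q': query, 'page': page})}"

print(page_url("https://www.mister-auto.es/buscar/", "corteco", 2))
# https://www.mister-auto.es/buscar/?q=corteco&page=2
```

In a CrawlSpider you would apply this in `process_value` on the LinkExtractor, or simply yield these URLs yourself instead of following the raw pagination links.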
The website you're scraping is dynamic: when you change pages, the change is not reflected in the URL.
What you want is a tool like Puppeteer or Selenium to render the page dynamically, click buttons and extract the content you want. While it is a great tool for certain jobs, Scrapy has its limitations.
I am scraping data from a site, and each item has a related document URL. I want to scrape data from that document, which is available in HTML format after clicking the link. Right now, I've been using Google Sheets' IMPORTFEED to get the basic columns filled.
Is there a next step I could take to go into each respective URL, grab elements from the document, and populate the Google Sheet with them? The reason I'm using the RSS feed (instead of Python and BeautifulSoup) is that they actually offer an RSS feed.
I've looked, and haven't found a question that matches mine specifically.
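For what it's worth, the "visit each item URL and grab an element" step can be done with nothing but the Python standard library. A sketch under assumptions: the target element is a hypothetical `<h1>` (you would adapt the parser to whatever element you need), and writing the rows back to the Google Sheet (via the Sheets API or Apps Script) is left out:

```python
import urllib.request
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

def item_links(rss_xml):
    """Pull the <link> of every <item> out of an RSS feed."""
    root = ET.fromstring(rss_xml)
    return [item.findtext("link") for item in root.iter("item")]

class TitleGrabber(HTMLParser):
    """Collect the text of the first <h1> in a document."""
    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and self.title is None:
            self.in_h1 = True

    def handle_data(self, data):
        if self.in_h1:
            self.title = data.strip()
            self.in_h1 = False

def scrape_documents(rss_xml):
    """For each feed item, fetch its document and grab one element from it."""
    rows = []
    for url in item_links(rss_xml):
        with urllib.request.urlopen(url) as resp:
            parser = TitleGrabber()
            parser.feed(resp.read().decode("utf-8", "replace"))
            rows.append((url, parser.title))
    return rows
```

The resulting `(url, value)` rows are what you would then push into the sheet alongside the IMPORTFEED columns.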
I haven't personally tried this yet, but I've come across web scraping samples using Apps Script with UrlFetchApp.fetch. You can also check the XmlService sample, which is also related to scraping.
I have a little problem after changing the URL and server of my site. When a URL is shared, Facebook picks up shortcodes (hanacode and JW Player code) in the description. I want to hide them.
Is there any way to do it?
Visualization of the problem: http://i.stack.imgur.com/417xX.jpg