I am trying to scrape a list of events from the site http://www.cityoflondon.gov.uk/events/, but when scraping it with import.io I am only able to extract the first page.
How could I extract all pages at once?
You can extract data from this site with either a Crawler or Bulk Extract. The above website uses a very simple form of pagination:
http://www.cityoflondon.gov.uk/events/Pages/default.aspx
http://www.cityoflondon.gov.uk/events/Pages/default.aspx?start1=13
http://www.cityoflondon.gov.uk/events/Pages/default.aspx?start1=25
http://www.cityoflondon.gov.uk/events/Pages/default.aspx?start1=37
http://www.cityoflondon.gov.uk/events/Pages/default.aspx?start1=49
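The start1 offset appears to advance by 12 per page (13, 25, 37, 49), so the page URLs can be generated in a loop rather than listed by hand. A minimal Python sketch, assuming the increment stays at 12 and these four offsets cover all pages:

```python
# Generate the paginated event URLs; the step of 12 and the page count
# are assumptions based on the offsets observed above.
BASE = "http://www.cityoflondon.gov.uk/events/Pages/default.aspx"

urls = [BASE] + [f"{BASE}?start1={offset}" for offset in range(13, 50, 12)]
for url in urls:
    print(url)
```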
Here is a Data Set that I created for the above URLs that should contain all the relevant information.
319aebad-88ea-4053-a649-2087011ce041
If you have further questions about an individual website, please contact support@import.io.
Thanks!
Meg
Related
I'm trying to build a directory on my website and want to get that data from SERPs. The sites from my search results could have data on different pages.
For example, I want to build a directory of adult sports leagues in the US. I get my SERPs to gather my URLs for leagues. Then from that list, I want to search those individual URLs for: name of league, location, sports offered, contact info, description, etc.
Each website will have that info in different places, obviously. But I'd like to be able to get the data I'm looking for (which not every site will have) and put that in a CSV and then use it to build the directory on my website.
I'm not a coder, but I'm trying to find out whether this is even feasible, given my limited understanding of data scraping. I'd appreciate any feedback!
I've looked at some data-scraping software and put requests on Fiverr, with no response.
I have a situation where I need to extract tables from 13 different links, all with the same structure, and then append them into a single table with all the data. At first I extracted the links from a home page by copying each one from its hyperlink, then imported the data through the Web connector in Power BI. However, three months later I realized that those links change every quarter, while the link to the homepage where they are listed stays the same.
So I did some research and found this video on YouTube (https://www.youtube.com/watch?v=oxglJL0VWOI), which explains how to scrape the links from a website by building a table with the header of each link as one column and the link itself as another. That way, the links are updated automatically whenever I refresh the data.
The problem is that I can't figure out how to use these links to extract the data automatically, without copying them one by one and importing each through the Power BI Web connector (Web.BrowserContents). Can anyone give me a hint on how to implement this?
Thanks in advance!
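In Power BI the usual approach is to turn the table-extraction steps into a custom function and invoke it once per row of the links table, so every refreshed link is fetched automatically. As a sketch of the same two-step idea outside Power BI, here is a minimal Python version; the homepage URL and the link selector are hypothetical placeholders:

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Hypothetical homepage that lists the 13 quarterly links.
HOME_URL = "https://example.com/reports"

# Step 1: scrape the current links from the homepage.
html = requests.get(HOME_URL, timeout=30).text
soup = BeautifulSoup(html, "html.parser")
links = [a["href"] for a in soup.select("a.report-link")]  # hypothetical selector; absolute URLs assumed

# Step 2: read the identically structured table from each link
# and append them all into one table.
frames = [pd.read_html(link)[0] for link in links]
combined = pd.concat(frames, ignore_index=True)
print(combined.head())
```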
Since IMDb removed the ability to export other people's ratings, I have been trying to find a way to scrape them, but I am a programming noob. I followed this YouTube tutorial, which scrapes general IMDb data with R: https://www.youtube.com/watch?v=28pyEDV9mMw&t=0s. I managed to adapt it to scrape a ratings link, but it can only retrieve the first page of information, because the ratings pages use a pagination key in the URL that changes completely with each page, so I can't use the seq() function to loop over it. Is there a way to use the 'next' button to loop instead of a URL pattern? Is there a way to extract the pagination keys and apply them in a loop?
As an example, here's what two page URLs look like:
https://www.imdb.com/user/ur3954564/ratings?sort=date_added%2Cdesc&mode=detail&paginationKey=mfq5ijak6z7uymjwuuwsomnsegl34knnqsdztp6xeepepyyfxdfiwpol52uhtjimq3iwclnm7gq7uk2y4kjygzipmzztamxq7zbhw4m47iyfrvleknv4axfhhxudjs5nyx5ijd27q5aqjjg6bqac2wheaznk2ouqhjumdro5dntkvduvzupds7a3psdwsefgy5eeijwasj3vzh2p&lastPosition=100
https://www.imdb.com/user/ur3954564/ratings?sort=date_added%2Cdesc&mode=detail&paginationKey=mfq5ijak6z7uymjwuuwsomnsegl34knnqsdztp6xeepepyyfxdfiwpol52uhtjimq3iwclnm7gq7uk2y4kjygzaenj3tooxr7vch65447iyfrvleknv4axfhhxudjs5nyx6ifb24rrhqjjg6bqac2wheaznk2ouqhjumdro5dntkvduvzupds7enxphz4xkrobzbcb4wvm7y7dnp&lastPosition=200
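One way around the changing key is to stop constructing URLs altogether and follow the 'Next' link that each page embeds, since its href already carries the next paginationKey. The tutorial uses R (rvest can read that link's href the same way); below is a minimal Python sketch of the loop, where the CSS selectors are assumptions about the page markup:

```python
import requests
from bs4 import BeautifulSoup

url = "https://www.imdb.com/user/ur3954564/ratings"
session = requests.Session()
session.headers["User-Agent"] = "Mozilla/5.0"  # IMDb may reject clients with no user agent

titles = []
while url:
    soup = BeautifulSoup(session.get(url, timeout=30).text, "html.parser")

    # Grab whatever you need from the current page; this selector for
    # the rated titles is an assumption about the markup.
    titles += [a.get_text(strip=True) for a in soup.select("h3.lister-item-header a")]

    # Follow the 'Next' button instead of building the URL yourself;
    # 'a.next-page' is likewise an assumed selector.
    next_link = soup.select_one("a.next-page")
    url = "https://www.imdb.com" + next_link["href"] if next_link else None

print(len(titles), "titles scraped")
```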
I'm trying to scrape data from some Etsy shops using Google Sheets. Specifically, I'm looking to import the product names in the seller's reviews (outlined in red in this photo). I've successfully imported these names using the following formula where the seller's Etsy URL (https://www.etsy.com/shop/PaperLarkDesigns) is located in cell B4:
=IMPORTXML(B4, "//div[@class='flag-body hide-xs hide-sm']")
But the navigation for the product reviews is dynamically generated, so the formula only imports the titles from the first page of reviews, and there doesn't seem to be a URL that points the formula at a specific page of reviews.
Is there a way to denote which page of reviews the importXML formula should pull these titles from? Or is it not possible to pull data from a site using this type of navigation?
I'm new to the more complex formulas in Excel/Google Sheets, so thanks in advance for your help!
Unfortunately, the use of IMPORTXML in this situation is not possible.
According to the IMPORTXML documentation:
IMPORTXML imports data from any of various structured data types including XML, HTML, CSV, TSV, and RSS and ATOM XML feeds.
Therefore, the =IMPORTXML() formula you are using reads the HTML source of the page without executing any of the JavaScript associated with it, which is why you are unable to retrieve the data you want.
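You can confirm this by fetching the page's raw HTML outside the browser and running the same XPath against it: the review titles that JavaScript loads simply are not in the source. A quick Python check, for illustration:

```python
import requests
from lxml import html

raw = requests.get(
    "https://www.etsy.com/shop/PaperLarkDesigns",
    headers={"User-Agent": "Mozilla/5.0"},
    timeout=30,
).text
tree = html.fromstring(raw)

# The same XPath the IMPORTXML formula uses; against the raw,
# pre-JavaScript HTML it only matches what shipped with page one.
nodes = tree.xpath("//div[@class='flag-body hide-xs hide-sm']")
print(len(nodes), "matches in the raw HTML")
```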
Reference
IMPORTXML
I am trying to scrape a list of Sports venues from these two pages:
openplay.co.uk
and mylocalpitch.com
In the second one, the search results for venues are split into pages of 10. When I run a scraper on it, it picks up the first ten search results, but not the ones 'hidden' on the other pages.
I was using a scraping tool called import.io and it failed miserably. Is there a tool that can do this? Will I need to write my own?
I made a quick API to the site for you and managed to get more than 20 pages. If you visit the link below:
https://import.io/data/mine/?id=01ac4491-e40a-4e2b-a427-c057692e3d96
you will see a button called 'next page' that should get you the rest of the search results after the 10th result.
Let me know how you get on.