Data scraping from a list split into pages - web-scraping

I am trying to scrape a list of Sports venues from these two pages:
openplay.co.uk
and mylocalpitch.com
On the second site, the search results for venues are split into pages of 10 each. When I run a scraper on it, it picks up the first ten results but not the ones that are 'hidden' on the later pages.
I was using a scraping tool called import.io and it failed miserably. Is there a tool that can do this? Will I need to write my own?

I made a quick API to the site for you and managed to get more than 20 pages. If you visit the link below:
https://import.io/data/mine/?id=01ac4491-e40a-4e2b-a427-c057692e3d96
you can see a button called 'next page' that should get you the rest of the search results after the 10th result.
Let me know how you get on.
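
If you do end up writing your own, the core of it is small: request each results page in turn and follow the site's 'next page' link until there isn't one. Below is a minimal sketch in Python with requests and BeautifulSoup; the start URL and the CSS selectors (.venue-name, a.next) are placeholders, not the site's real markup, so inspect the pages and adjust them.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def scrape_venues(start_url):
    """Collect venue names from a paginated search by following the 'next' link."""
    url, venues = start_url, []
    while url:
        soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
        # '.venue-name' is a placeholder selector -- replace it with the real one.
        venues += [v.get_text(strip=True) for v in soup.select(".venue-name")]
        # 'a.next' is likewise a guess at the pagination control.
        nxt = soup.select_one("a.next")
        url = urljoin(url, nxt["href"]) if nxt and nxt.get("href") else None
    return venues

venues = scrape_venues("https://www.mylocalpitch.com/search")  # hypothetical search URL
print(len(venues))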

Related

Scraping other people's ratings from IMDb/avoiding the pagination keys in R?

Since IMDb removed the ability to export other people's ratings, I have been trying to find a way to scrape them, but I am a programming noob. I followed this YouTube tutorial, which scrapes general IMDb data with R: https://www.youtube.com/watch?v=28pyEDV9mMw&t=0s. I managed to adapt it to scrape a ratings link, but it can only retrieve the first page of information, because the ratings pages use a pagination key in the URL that completely changes with each page, so I can't use the seq() function to loop over them. Is there a way to loop using the 'next' button instead of a URL pattern? Or is there a way to extract the pagination keys and apply them in a loop?
As an example, here's what two page URLs look like:
https://www.imdb.com/user/ur3954564/ratings?sort=date_added%2Cdesc&mode=detail&paginationKey=mfq5ijak6z7uymjwuuwsomnsegl34knnqsdztp6xeepepyyfxdfiwpol52uhtjimq3iwclnm7gq7uk2y4kjygzipmzztamxq7zbhw4m47iyfrvleknv4axfhhxudjs5nyx5ijd27q5aqjjg6bqac2wheaznk2ouqhjumdro5dntkvduvzupds7a3psdwsefgy5eeijwasj3vzh2p&lastPosition=100
https://www.imdb.com/user/ur3954564/ratings?sort=date_added%2Cdesc&mode=detail&paginationKey=mfq5ijak6z7uymjwuuwsomnsegl34knnqsdztp6xeepepyyfxdfiwpol52uhtjimq3iwclnm7gq7uk2y4kjygzaenj3tooxr7vch65447iyfrvleknv4axfhhxudjs5nyx6ifb24rrhqjjg6bqac2wheaznk2ouqhjumdro5dntkvduvzupds7enxphz4xkrobzbcb4wvm7y7dnp&lastPosition=200
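
You don't need to predict the pagination keys at all: each ratings page already contains a 'Next' link whose href carries the key for the following page, so you can loop on that link until it disappears. The sketch below shows the pattern in Python (requests + BeautifulSoup); the same idea ports directly to rvest in R with html_element()/html_attr(). The CSS selectors here are assumptions about the lister markup, so check them against the actual page.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base = "https://www.imdb.com/user/ur3954564/ratings?sort=date_added%2Cdesc&mode=detail"
headers = {"User-Agent": "Mozilla/5.0"}  # IMDb tends to reject the default client

url, titles = base, []
while url:
    soup = BeautifulSoup(requests.get(url, headers=headers, timeout=30).text, "html.parser")
    # 'h3.lister-item-header a' is an assumed selector for the rated titles.
    titles += [a.get_text(strip=True) for a in soup.select("h3.lister-item-header a")]
    # The 'Next' button's href already contains the next paginationKey.
    nxt = soup.select_one("a.next-page")  # assumed class for the Next button
    url = urljoin(url, nxt["href"]) if nxt and nxt.get("href") else None

print(f"collected {len(titles)} titles")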

How to get all the reviews of the apps of Play Store using R?

I have a dataframe with the URLs of all the apps whose reviews I want to get.
I see that there is a way to do this using Python (How to perform web scraping to get all the reviews of an app in Google Play?), but I was not able to get it working.
Can I get all the reviews of the apps using R?
I wrote code that scrolls the webpage, but some apps have a huge number of reviews, and I want to collect reviews for a large number of apps.
Thus, scrolling the webpage is not a good way to get all the reviews.
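
If a Python detour is acceptable (the linked question covers the Python side), one commonly used option is the google-play-scraper package, whose reviews_all() helper pages through the review feed for you instead of scrolling; from R it can also be driven via reticulate. A minimal sketch, assuming that package is installed and using a made-up app id:

# pip install google-play-scraper
from google_play_scraper import reviews_all, Sort

# 'com.example.app' is a placeholder id -- substitute the id from each URL in your dataframe.
all_reviews = reviews_all(
    "com.example.app",
    lang="en",
    country="us",
    sort=Sort.NEWEST,
)
print(len(all_reviews))           # one dict per review
print(all_reviews[0]["content"])  # the review text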

Retrieve a number from each page of a paginated website

I have a list of approx. 36,000 URLs, ranging from https://www.fff.fr/la-vie-des-clubs/1/infos-cles to https://www.fff.fr/la-vie-des-clubs/36179/infos-cles (a few of those pages return 404 errors).
Each of those pages contains a number (the number of teams the soccer club has). In the HTML file, the number appears as <p class="number">5</p>.
Is there a reasonably simple way to compile an Excel or CSV file with the URL and the associated number of teams as a field?
I've tried looking into phantomJS but my method took 10 seconds to open a single webpage and I don't really want to spend 100 hours doing this. I was not able to figure out how (or whether it was at all possible) to use scraping tools such as import.io to do this.
Thanks!
For the goal you want to achieve, I can see two solutions:
Code it in Java: Jsoup + any CSV library
The 36,000+ URLs can be downloaded easily in a few minutes.
Use a tool like Portia from scrapinghub.com
Portia is a WYSIWYG tool that helps you quickly create your project and run it. They offer a free plan that can handle the 36,000+ links.
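
If Java feels heavy, the same job is also a short script: loop over the 36,000 ids, pull the <p class="number"> value from each page, and write one CSV row per URL. A sketch in Python with requests and BeautifulSoup (non-200 responses such as the 404s are simply skipped; add a small delay if you want to be polite to the server):

import csv
import requests
from bs4 import BeautifulSoup

with open("teams.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "teams"])
    for club_id in range(1, 36180):
        url = f"https://www.fff.fr/la-vie-des-clubs/{club_id}/infos-cles"
        resp = requests.get(url, timeout=30)
        if resp.status_code != 200:  # a few ids 404, as noted in the question
            continue
        node = BeautifulSoup(resp.text, "html.parser").select_one("p.number")
        writer.writerow([url, node.get_text(strip=True) if node else ""])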

How to scrape multiple pages with Import.io

I am trying to scrape a list of events from the site http://www.cityoflondon.gov.uk/events/, but when scraping it with import.io I am able to extract just the first page.
How could I extract all pages at once?
You can extract data from this site with either a Crawler or Bulk Extract. The website uses a very simple form of pagination:
http://www.cityoflondon.gov.uk/events/Pages/default.aspx
http://www.cityoflondon.gov.uk/events/Pages/default.aspx?start1=13
http://www.cityoflondon.gov.uk/events/Pages/default.aspx?start1=25
http://www.cityoflondon.gov.uk/events/Pages/default.aspx?start1=37
http://www.cityoflondon.gov.uk/events/Pages/default.aspx?start1=49
Here is a Data Set that I created for the above URLs that should contain all the relevant information.
319aebad-88ea-4053-a649-2087011ce041
If you have further questions about an individual website, please contact support@import.io
Thanks!
Meg
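
For reference, because the pagination is just a start1 offset (13, 25, 37, ... in steps of 12), the page URLs can also be generated and scraped directly outside import.io. A sketch in Python; the .event-item selector is a placeholder for whatever the listing markup actually is:

import requests
from bs4 import BeautifulSoup

base = "http://www.cityoflondon.gov.uk/events/Pages/default.aspx"
# The first page has no start1 parameter; later pages step by 12. Extend the range as needed.
urls = [base] + [f"{base}?start1={offset}" for offset in range(13, 50, 12)]

events = []
for url in urls:
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    # '.event-item' is a placeholder selector -- inspect the page for the real one.
    events += [e.get_text(strip=True) for e in soup.select(".event-item")]

print(f"{len(events)} events from {len(urls)} pages")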

How To Extract Page URLs From Any Website in Bulk?

I'm looking for a free solution/tool/software through which I can pull out all of a website's page URLs. The site has approx. 992,000 pages, so I need the URLs of all of them in an Excel sheet.
I'm using "site: mywebsite.com" and it gives me 992,000 results. I know I can make the max results per page 100, but that still doesn't make my life easier. Also, Google won't show any results beyond 1,000. I tried to use the Google API but without any luck, and tried sitemap generators, but they didn't work either.
You can use a crawler tool to crawl the entire website and save the URLs visited. Free tools include:
IRobotSoft: http://www.irobotsoft.com/help/irobot-manual.pdf. Use: CrawlWebsite (SourceSites, CallTask) function.
Scrapy: http://doc.scrapy.org/en/latest/intro/tutorial.html
Google limits search query results to 1,000. The only way a tool could really bypass this is to query subsets of the keyword, e.g. site:abc.com + random-word. Each random word returns fewer results, and with enough of these queries scraped and combined into a list, you can delete the duplicates and end up with a near-complete to complete list for the original search.
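
To make the Scrapy suggestion above concrete, here is a minimal spider that follows every internal link and records each URL it visits to a CSV feed. mywebsite.com is the placeholder domain from the question; for a site of roughly 992,000 pages you would also want to tune concurrency and politeness settings.

from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class UrlSpider(CrawlSpider):
    """Crawl a whole site and emit one row per URL visited."""
    name = "site_urls"
    allowed_domains = ["mywebsite.com"]        # placeholder domain from the question
    start_urls = ["https://mywebsite.com/"]
    rules = [Rule(LinkExtractor(), callback="parse_item", follow=True)]

    def parse_item(self, response):
        yield {"url": response.url}

if __name__ == "__main__":
    process = CrawlerProcess(settings={
        "FEEDS": {"urls.csv": {"format": "csv"}},  # write results straight to CSV
        "ROBOTSTXT_OBEY": True,
    })
    process.crawl(UrlSpider)
    process.start()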
