Retrieve a number from each page of a paginated website - web-scraping

I have a list of approx. 36,000 URLs, ranging from https://www.fff.fr/la-vie-des-clubs/1/infos-cles to https://www.fff.fr/la-vie-des-clubs/36179/infos-cles (a few of those pages return 404 errors).
Each of those pages contains a number (the number of teams the soccer club has). In the HTML, the number appears as <p class="number">5</p>.
Is there a reasonably simple way to compile an Excel or CSV file with the URL and the associated number of teams as a field?
I've tried looking into PhantomJS, but my method took 10 seconds to open a single webpage and I don't really want to spend 100 hours doing this. I was also unable to figure out how (or whether it was at all possible) to use scraping tools such as import.io to do this.
Thanks!

For the goal you want to achieve, I can see two solutions:
Code it in Java: Jsoup + any CSV library
The 36,000+ URLs can be downloaded in a matter of minutes.
Use a tool like Portia from scrapinghub.com
Portia is a WYSIWYG tool that helps you quickly create and run a scraping project. They offer a free plan that can handle the 36,000+ links.
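If you would rather script it directly, the same job also fits in a short Python script. Here is a minimal, untested sketch using requests, BeautifulSoup and the csv module, assuming every club page exposes the count in <p class="number"> exactly as described:

import csv
import requests
from bs4 import BeautifulSoup

BASE = "https://www.fff.fr/la-vie-des-clubs/{}/infos-cles"

with open("teams.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "teams"])
    for club_id in range(1, 36180):
        url = BASE.format(club_id)
        resp = requests.get(url, timeout=10)
        if resp.status_code == 404:  # a few club IDs no longer exist
            continue
        tag = BeautifulSoup(resp.text, "html.parser").find("p", class_="number")
        if tag is not None:
            writer.writerow([url, tag.get_text(strip=True)])

Run sequentially this would still take a few hours, so in practice you would parallelise the requests (e.g. with concurrent.futures.ThreadPoolExecutor) or go with the Jsoup route above; either way you are downloading plain HTML instead of driving a full browser like PhantomJS, which is where the 10 seconds per page was going.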

Related

Issue scraping financial data via xpath + tables

I'm trying to build a stock analysis spreadsheet in Google sheets by using the importXML function in conjunction with XPath (absolute) and importHTML function using tables to scrape financial data from www.morningstar.co.uk key ratios page for the corresponding companies I like to keep an eye on.
Example: https://tools.morningstar.co.uk/uk/stockreport/default.aspx?tab=10&vw=kr&SecurityToken=0P00007O1V%5D3%5D0%5DE0WWE%24%24ALL&Id=0P00007O1V&ClientFund=0&CurrencyId=BAS
=importxml(N9,"/html/body/div[2]/div[2]/form/div[4]/div/div[1]/div/div[3]/div[2]/div[2]/div/div[2]/table/tbody/tr/td[3]")
=INDEX(IMPORTHTML(N9,"table",12),3,2)
N9 being the cell containing the URL to the data source
I'm mainly using Morningstar as my data source because of the overwhelming amount of free information, but the links keep breaking: either the URL has changed slightly or the XPath hierarchy has been altered.
From what I've read so far, I'm guessing that busy websites such as these are dynamic and change often, which is why my static links keep breaking.
Is anyone able to suggest a solution, or confirm whether CSS selectors would be a more stable/reliable method of retrieving the data?
Many thanks in advance
I've tried short and long XPath expressions (copied from the dev tools in Chrome) and have frequently changed the URL to repair the link to the data source, but it keeps breaking shortly afterwards and I'm unable to retrieve any information.
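The root of the problem is that an absolute path like /html/body/div[2]/... breaks as soon as any wrapper element is added or removed upstream, whereas a relative expression anchored on a stable attribute keeps matching. A small self-contained Python/lxml sketch (hypothetical markup, not Morningstar's actual page) shows the difference; the same idea applies to the XPath you pass to IMPORTXML:

from lxml import html

# version 2 of the page gains an extra banner div at the top of <body>
page_v1 = "<html><body><div><table class='key-ratios'><tr><td>Revenue</td><td>42</td></tr></table></div></body></html>"
page_v2 = "<html><body><div class='banner'></div><div><table class='key-ratios'><tr><td>Revenue</td><td>42</td></tr></table></div></body></html>"

absolute = "/html/body/div[1]/table/tr/td[2]/text()"
relative = "//table[contains(@class,'key-ratios')]//td[2]/text()"

for doc in (html.fromstring(page_v1), html.fromstring(page_v2)):
    # the absolute path returns [] once the banner div is added;
    # the relative one still returns ['42']
    print(doc.xpath(absolute), doc.xpath(relative))

Whether you use CSS selectors or relative XPath, the key is to anchor on something the site is unlikely to change (an id, a data attribute, a distinctive class) rather than on the full element hierarchy, and to check in the page source that such an anchor actually exists.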

How to use URLs extracted from a website as data source for another table in Power BI

I have a situation where I need to extract tables from 13 different links, which all have the same structure, and then append them into a single table with all the data. So, at first, I extracted the links from a home page by copying them from the respective hyperlinks, and then imported the data through the Web connector in Power BI. However, 3 months later, I realized that those links change every quarter, while the homepage where they are listed stays the same.
So I did some research and found this video on YouTube (https://www.youtube.com/watch?v=oxglJL0VWOI), which explains how to scrape the links from a website by building a table with the link text as one column and the respective URL as another. That way, the links are updated automatically whenever I refresh the data.
The thing is that I'm having trouble figuring out how I can use these links to extract the data automatically, without having to copy them one by one and import each one through the Power BI Web connector (Web.BrowserContents). Can anyone give me a hint on how I can implement this?
Thanks in advance!
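In Power Query the usual approach is to turn the table-extraction step into a custom function and invoke it for each row of the links table, so a refresh picks up whatever links are currently listed. I don't have the report to test against, but the underlying logic is just "get the links, then fetch and append each one"; as a language-neutral illustration of that two-step pattern (hypothetical index URL and link filter), the same thing in Python would look like:

import pandas as pd
import requests
from bs4 import BeautifulSoup

INDEX_URL = "https://example.com/reports"  # hypothetical homepage listing the quarterly links

# step 1: scrape the current set of report links from the index page
index_html = requests.get(INDEX_URL, timeout=10).text
soup = BeautifulSoup(index_html, "html.parser")
links = [a["href"] for a in soup.select("a[href]") if "report" in a["href"]]

# step 2: fetch the first HTML table from each report and append them into one table
frames = [pd.read_html(url)[0] for url in links]
combined = pd.concat(frames, ignore_index=True)

In Power BI the equivalent of step 2 is a custom function (built from one example query and then parameterised on the URL), added to the scraped links table via "Invoke Custom Function" and expanded, which is essentially what the video's link table is meant to feed into.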

Need help in web scraping webpages and their links with an automatic function in R

I am interested in extracting data on paranormal activity reported in the news, so that I can analyze the space and time of the appearances for any correlations. This project is just for fun, to learn and use web scraping, text extraction, and spatial and temporal correlation analysis. So please forgive me for deciding on this topic; I wanted to do something interesting and challenging.
First, I found that this website has a collection of reported paranormal incidents; they have collections for 2009, 2010, 2011 and 2012.
The structure of the website goes like this: for every year they have 1..10 pages, and the links look like this
for year 2009
link http://paranormal.about.com/od/paranormalgeneralinfo/tp/2009-paranormal-activity.htm
On each page they have collected the stories under headings like this
Internal structure
Paranormal Activity, Posted 03-14-09
Each of these headlines has two pages inside it, which go like this
link http://paranormal.about.com/od/paranormalgeneralinfo/a/news_090314n.htm
On each of these pages they have the actual reported stories collected under various headlines, together with links to the original websites for those stories. I am interested in collecting that reported text and extracting information about the kind of paranormal activity (ghost, demon or UFO) and the time, date and place of the incidents. I wish to analyze this data for any spatial and temporal correlations. If UFOs or ghosts are real, they must show some behavior and correlations in space or time in their movements. That is the long version of the story...
I need help with web scraping the text from the pages described above. Below I have written code that follows one page and its links down to the final text I want. Can anyone let me know whether there is a better and more efficient way to get the clean text from the final page, and how to automate collecting the text by following all 10 pages for the whole of 2009?
library(XML)
#source of paranormal news from about.com
#first page to start
#2009 - http://paranormal.about.com/od/paranormalgeneralinfo/tp/2009-paranormal-activity.htm
pn.url<-"http://paranormal.about.com/od/paranormalgeneralinfo/tp/2009-paranormal-activity.htm"
pn.html<-htmlTreeParse(pn.url,useInternalNodes=T)
pn.h3=xpathSApply(pn.html,"//h3",xmlValue)
#extracting the links of the headlines to follow to the story
pn.h3.links=xpathSApply(pn.html,"//h3/a",xmlGetAttr,"href")
#Extracted the links of the Internal structure to follow ...
#Paranormal Activity, Posted 01-03-09 (following this head line)
#http://paranormal.about.com/od/paranormalgeneralinfo/a/news_090314n.htm
pn.l1.url<-pn.h3.links[1]
pn.l1.html<-htmlTreeParse(pn.l1.url,useInternalNodes=T)
pn.l1.links=xpathSApply(pn.l1.html,"//p/a",xmlGetAttr,"href")
#Extracted the links of the Internal structure to follow ...
#British couple has 'black-and-white-twins' twice (following this head line)
#http://www.msnbc.msn.com/id/28471626/
pn.l1.f1.url=pn.l1.links[7]
pn.l1.f1.html=htmlTreeParse(pn.l1.f1.url,useInternalNodes=T)
pn.l1.f1.text=xpathSApply(pn.l1.f1.html,"//text()[not(ancestor::script)][not(ancestor::style)]",xmlValue)
I sincerely thank you in advance for reading my post and for your time in helping me.
I will be grateful to any expert who would like to mentor me through this whole project.
Regards
Sathish
Try the Scrapy and BeautifulSoup libraries. Despite being Python-based, they are considered among the best in the scraping domain. You can use the command-line interface to connect the two; for more details about connecting R and Python, have a look here.
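To give an idea of what the BeautifulSoup route looks like for the same three-level flow as the R code above (year page, then each headline page, then the linked story text), here is an untested sketch; the selectors simply mirror the h3/a and p/a structure assumed in the question:

import requests
from bs4 import BeautifulSoup

YEAR_URL = "http://paranormal.about.com/od/paranormalgeneralinfo/tp/2009-paranormal-activity.htm"

def get_soup(url):
    return BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")

# level 1: the year page lists headlines inside <h3> elements
year_page = get_soup(YEAR_URL)
headline_links = [a["href"] for h3 in year_page.find_all("h3") for a in h3.find_all("a", href=True)]

stories = []
for headline_url in headline_links:
    # level 2: each headline page links to the original stories from <p><a> elements
    headline_page = get_soup(headline_url)
    story_links = [a["href"] for p in headline_page.find_all("p") for a in p.find_all("a", href=True)]
    for story_url in story_links:
        # level 3: grab the visible text of the story page, skipping script/style
        story_page = get_soup(story_url)
        for tag in story_page(["script", "style"]):
            tag.decompose()
        stories.append((story_url, story_page.get_text(" ", strip=True)))

Scrapy adds the crawling machinery on top of this (request scheduling, throttling, retries, pipelines), which starts to matter once you follow every story link across all ten pages for each year.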

Best approach for fetching news from websites?

I have a function which scrapes all the latest news from a website (approximately 10 news items; the exact number depends on the website). Note that the news items are in chronological order.
For example, yesterday I got 10 news items and stored them in the database. Today I get 10 items, but 3 of them were not there yesterday (7 items stayed the same, 3 are new).
My current approach is to extract each item until I find an old one (the first of the 7 existing items); then I stop extracting, update the "lastUpdateDate" field of the old items, and add the new items to the database. I think this approach is somewhat complicated and it takes time.
I'm actually getting news from 20 websites with the same content structure (Moodle), so each run takes about 2 minutes, which my free host doesn't support.
Is it better if I delete all the news and then extract everything from scratch (this actually burns through a huge number of auto-increment IDs in the database)?
First, check to see if the website has a published API. If it has one, use it.
Second, check the website's terms of service, which may specifically and explicitly disallow scraping the website.
Third, look at a module in your programming language of choice that handles both the fetching of the pages and the extraction of the content from the pages. In Perl, you would start with WWW::Mechanize or Web::Scraper.
Whatever you do, don't fall into the trap that so many who post to Stack Overflow fall into: fetching the web page and then trying to parse the content themselves, most often with regular expressions, which are an inadequate tool for the job. Browse the SO tag html-parsing for tales of sorrow from those who have tried to roll their own HTML parsing systems instead of using existing tools.
It depends on the requirements: whether you want to show old news to the users or not.
For scraping, you can create a custom local script, run as a cron job, which will grab the data from those news websites and store it in the database.
You can also check by subject whether an item already exists or not.
Finally, make a custom news block which will show the whole database feed.
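The "check by subject" step is really just an upsert keyed on something stable. As a sketch of that idea (SQLite here for a self-contained example; the table and column names are made up, and the item's URL is assumed to be a usable key):

import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("news.db")
conn.execute("""CREATE TABLE IF NOT EXISTS news (
    url TEXT PRIMARY KEY,   -- stable key: the item's URL (or a hash of the subject)
    title TEXT,
    lastUpdateDate TEXT)""")

def store_items(scraped_items):
    """scraped_items: list of (url, title) tuples scraped from one site's feed."""
    now = datetime.now(timezone.utc).isoformat()
    for url, title in scraped_items:
        # insert new items, or just bump lastUpdateDate when the key already exists;
        # nothing is deleted and re-inserted, so existing IDs stay stable
        conn.execute(
            """INSERT INTO news (url, title, lastUpdateDate) VALUES (?, ?, ?)
               ON CONFLICT(url) DO UPDATE SET lastUpdateDate = excluded.lastUpdateDate""",
            (url, title, now))
    conn.commit()

Because the feeds are in chronological order, you can still stop scraping at the first URL you already have; the upsert just makes the "does it exist" check cheap and avoids the delete-everything-and-reload approach.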

Using Yahoo! Pipes

Have you used pipes.yahoo.com to quickly and easily do... anything? I've recently created a quick mashup of StackOverflow tags (via rss) so that I can browse through new questions in fields I like to follow.
This has been around for some time, but I've just recently revisited it and I'm completely impressed with its ease of use. It's almost to the point where I could set up a pipe and then give a client privileges to go in and edit feed sources... and I didn't have to write more than a few lines of code.
So, what other practical uses can you think of for pipes?
It's nice for aggregating feeds, yes, but the other handy thing to do is filtering the feeds. A while back, I created a feed for Digg (before Digg fell into the Fark pit of despair). I didn't care about the overwhelming Apple and Ubuntu news, so I filtered those keywords out of Technology, which I then combined with the Science and World & Business feeds.
Anyway, you can do a lot more than just combine things. If you wanted to be smart about it, you could set up per-subfeed and whole-feed filters to give granular or over-arching filtering abilities as the news changes and you get bored with one topic or another.
The one thing I have really used Y! Pipes for (rather than just playing around with it) is to clean up item titles, merge and finally de-dupe the feeds I got from querying multiple blog search engines with the same search term. This is something I’ve done in several very different contexts, eg. for my own ego surfing, in another case for the planet site set up by some conference’s organisers to keep an eye on their conference’s buzz, etc. Highly recommended.
You can do tons of things with pipes. For example for sites like digg or reddit, you can make one to bypass the site and go directly to the linked article (rewriting the RSS).
I also like to filter webcomics' feeds to keep just the comics, and then mix them all into a single feed.
I've taken the liberty of copying your pipe and rearranging it a bit so that it's easier to add and remove tags:
Yahoo Pipe: StackOverflow Merge Tags
Tags are now listed in a string builder, so to add a tag you just have to hit the + button on the string builder and type in the tag preceded by a slash.
Well, Pipes is really fast and useful.
Other effective uses might be:
1) combine many feeds into one, then sort, filter and translate it.
2) geocode your favorite feeds and browse the items on an interactive map.
3) power widgets/badges on your web site.
4) grab the output of any Pipes as RSS, JSON, KML, and other formats.
This is by no means a comprehensive list.
One of my favorite things to do with Yahoo! Pipes is to aggregate multiple craigslist feeds into a single feed. You can make a feed out of any category or search criteria on craigslist. I live in a university town and am always on the lookout for tickets to sporting events, for example. I have a half-dozen craigslist searches all being combined into a single feed via Yahoo! Pipes. This works a lot better for me than simply monitoring the entire "Tickets" category, as it filters out most of the tickets I am not interested in. Yes, this is another aggregating-feeds example, but the craigslist usage is quite valuable because of the ability to aggregate feeds that are themselves based on searches.
I've used Pipes to translate blogs into English. I would have liked to use it to fetch the full text for blogs which only provide a summary of the content in the feed, but unfortunately they don't provide any input which fetches the content from a parameterizable source :-(.
Just stumbled on this while looking for ways to connect Excel to Pipes. A bit necromancer-ish, but here goes.
One thing I've done, is take an HTML page (science data) which has links to tons of CSV files for a bunch of Army Corps measurement stations. Each station has a big table of datafiles, all organized individually by month and year. I use YQL to parse out and organize the links to the individual CSV files in a way that Pipes can read them. Then, I use that as input into a Pipe, which has a user input for "Station" and "Date."
Using this, I can go to the Pipes page, type in those values and get the values only for a specific station and date, rather than have to find the station on a website, find the year and month in a big table, click the link, open the CSV file, and find the values for a day within that month's worth of data. I can even change the pipe to specify the hour, and the parameter, and then get a single value returned.
Now, I wish I could figure out how to program Excel so that I could use "=yahoo_function(station, datetime)" to place that value automatically into a cell, given the values of other columns!
