Web scraping: extracting article links in R

I would like to collect political articles from this newspaper website, https://www.seneweb.com/news/politique/ . There is no way to get links to older articles: the oldest one that shows up is from 2019, but the site's archive goes back further than that.
There is no option to load more, and I have inspected the network traffic for an API without finding anything useful.
Does anyone have ideas?
# Start page of the politics section
url <- "https://www.seneweb.com/news/politique/"
# Keep only internal links matching this pattern (a case-sensitive regular expression)
newgrel <- "/news/Politique/"
link <- Rcrawler::LinkExtractor(url, urlbotfiler = FALSE,
                                urlregexfilter = newgrel)$InternalLinks
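
One thing worth checking before giving up on the archive: many news sites expose their full history through a sitemap listed in robots.txt, even when the HTML section pages stop paginating. Below is a rough sketch in R; the /sitemap.xml location is only an assumption and should be replaced by whatever robots.txt actually lists.

# Sketch only: look for a sitemap in robots.txt; the sitemap.xml URL below is a guess
robots <- readLines("https://www.seneweb.com/robots.txt", warn = FALSE)
grep("^sitemap:", robots, value = TRUE, ignore.case = TRUE)

# If a sitemap (or sitemap index) is listed, read its <loc> entries with xml2
library(xml2)
sitemap <- read_xml("https://www.seneweb.com/sitemap.xml")   # hypothetical location
locs    <- xml_text(xml_find_all(sitemap, "//*[local-name() = 'loc']"))

# Keep only politics article URLs
politique <- grep("/news/politique/", locs, value = TRUE, ignore.case = TRUE)
head(politique)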

Related

Issue scraping financial data via xpath + tables

I'm trying to build a stock analysis spreadsheet in Google Sheets, using the IMPORTXML function with an absolute XPath and the IMPORTHTML function with tables, to scrape financial data from the www.morningstar.co.uk key ratios page for the companies I like to keep an eye on.
Example: https://tools.morningstar.co.uk/uk/stockreport/default.aspx?tab=10&vw=kr&SecurityToken=0P00007O1V%5D3%5D0%5DE0WWE%24%24ALL&Id=0P00007O1V&ClientFund=0&CurrencyId=BAS
=importxml(N9,"/html/body/div[2]/div[2]/form/div[4]/div/div[1]/div/div[3]/div[2]/div[2]/div/div[2]/table/tbody/tr/td[3]")
=INDEX(IMPORTHTML(N9,"table",12),3,2)
N9 being the cell containing the URL to the data source
I'm mainly using Morningstar as my data source because of the wealth of free information, but the links keep breaking: either the URL changes slightly or the XPath hierarchy is altered.
From what I've read so far, I'm guessing that busy websites like this are dynamic and change often, which is why my static references keep breaking.
Can anyone suggest a solution, or confirm whether CSS selectors would be a more stable and reliable way of retrieving the data?
Many thanks in advance
I've tried short and long XPaths (copied from the dev tools in Chrome) and frequently changed the URL to repair the link to the data source, but it keeps breaking shortly afterwards and I'm unable to retrieve any information.
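
One way to make the lookup less brittle, whether you stay in Sheets or move to a script, is to anchor on something the site keeps stable (a header label or CSS class) rather than an absolute chain of div indexes. Below is a rough R sketch with rvest; the header text used as the anchor is only an example, and if the table is injected by JavaScript a plain fetch will not see it at all.

library(rvest)

# Sketch only: anchor on a header label instead of "table 12" or a long absolute XPath
url  <- "https://tools.morningstar.co.uk/uk/stockreport/default.aspx?tab=10&vw=kr&SecurityToken=0P00007O1V%5D3%5D0%5DE0WWE%24%24ALL&Id=0P00007O1V&ClientFund=0&CurrencyId=BAS"
page <- read_html(url)

# Parse every table on the page, then keep the one whose header mentions the
# figure you want ("Dividend Yield" is only an example anchor)
tables     <- page %>% html_elements("table") %>% html_table()
key_ratios <- Filter(function(t) any(grepl("Dividend Yield", names(t), fixed = TRUE)),
                     tables)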

Is it possible to scrape multiple data points from multiple URLs with data on different pages into a CSV?

I'm trying to build a directory on my website and want to get that data from SERPs. The sites from my search results could have data on different pages.
For example, I want to build a directory of adult sports leagues in the US. I get my SERPs to gather my URLs for leagues. Then from that list, I want to search those individual URLs for: name of league, location, sports offered, contact info, description, etc.
Each website will have that info in different places, obviously. But I'd like to be able to get the data I'm looking for (which not every site will have) and put that in a CSV and then use it to build the directory on my website.
I'm not a coder, but I'm trying to find out whether this is even feasible given my limited understanding of data scraping. I would appreciate any feedback!
I've looked at some data scraping software and posted requests on Fiverr with no response.
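
It is feasible, but because every site marks things up differently, you (or whoever you hire) end up writing per-site extraction rules. To give a sense of the shape of the task, here is a rough R sketch in which the URLs and selectors are all placeholders:

library(rvest)

# Sketch only: URLs and selectors are placeholders; each league site needs its own rules
urls <- c("https://example-league-one.com", "https://example-league-two.com")  # from your SERPs

scrape_one <- function(url) {
  page <- tryCatch(read_html(url), error = function(e) NULL)
  if (is.null(page)) return(data.frame(url = url, name = NA, description = NA))
  data.frame(
    url = url,
    # the <title> and meta description are lowest-common-denominator guesses;
    # real sites need per-site selectors for league name, location, sports, contact info
    name        = page %>% html_element("title") %>% html_text2(),
    description = page %>% html_element("meta[name='description']") %>% html_attr("content"),
    stringsAsFactors = FALSE
  )
}

directory <- do.call(rbind, lapply(urls, scrape_one))
write.csv(directory, "league_directory.csv", row.names = FALSE)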

Software to scrape or crawl website URLs

I want to scrape/crawl (I'm not sure which is the better term) website URLs. For example, I want to get every URL from:
www.Site.com/posts.html that contains www.Site.com/2015-04-01/1
So I would enter www.Site.com into the software, set the depth to 2, and set the required URL text to www.Site.com/2015-04-01/1.
The software should:
go to: www.Site.com/posts.html
find matching URLs. Let's say it finds:
www.Site.com/2015-04-01/1/Working-Stuff.html
www.Site.com/2015-04-01/1/New-stuff.html
www.Site.com/2015-04-01/1/News.html
Then it goes to the first matched URL and looks for more URLs containing www.Site.com/2015-04-01/1.
So for example it would look like this:
Main site: `www.Site.com/posts.html`
1)www.Site.com/2015-04-01/1/Working-Stuff.html
1a) www.Site.com/2015-04-01/1/Break.htm
1b) www.Site.com/2015-04-01/1/How-to.htm
1c) www.Site.com/2015-04-01/1/Lets-say.htm
1d) www.Site.com/2015-04-01/1/Gamer-life.htm
2) www.Site.com/2015-04-01/1/New-stuff.html
2a) www.Site.com/2015-04-01/1/My-Story-about.htm
3) www.Site.com/2015-04-01/1/News.html
3a) www.Site.com/2015-04-01/1/Go-to-hell.htm
3b) www.Site.com/2015-04-01/1/Leave.htm
Of course I don't need the 1), 2), 2a) prefix grouping; I only want to grab the URLs.
I used:
A1 Website Scraper - but when I try to scrape from ......html it cuts off the .html part and doesn't give me the full URL list :/
[edited my previous slightly simplistic answer]
Screen scraping is the process of extracting data from a web page. The R package rvest is very good at screen scraping.
Web crawling is the process of traversing a website, moving from page to page. The R package RSelenium is very good at mimicking a user's movement from page to page, but only when you know the structure of the website.
It sounds like you want to crawl from page to page, starting from a head page and moving outward. I think you could code this up using a combination of the rvest and RSelenium packages; between the two of them you can customise the crawl and take any particular, unknown route.
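
For a plain HTML site (no JavaScript needed to reveal the links), rvest on its own can handle this kind of depth-limited crawl; RSelenium only becomes necessary when the pages are rendered in the browser. A rough sketch using the start page and required URL text from the question:

library(rvest)
library(xml2)

# Sketch only: a simple depth-limited crawl that keeps URLs containing a required substring
start   <- "http://www.Site.com/posts.html"   # start page from the question
pattern <- "www.Site.com/2015-04-01/1"        # required URL text
depth   <- 2

seen  <- character(0)
queue <- data.frame(url = start, level = 0, stringsAsFactors = FALSE)

while (nrow(queue) > 0) {
  current <- queue[1, ]
  queue   <- queue[-1, , drop = FALSE]
  page    <- tryCatch(read_html(current$url), error = function(e) NULL)
  if (is.null(page)) next

  links <- page %>% html_elements("a") %>% html_attr("href")
  links <- url_absolute(links[!is.na(links)], current$url)   # resolve relative links
  hits  <- setdiff(unique(links[grepl(pattern, links, fixed = TRUE)]), seen)
  seen  <- c(seen, hits)

  # only follow matched URLs until the requested depth is reached
  if (current$level + 1 < depth && length(hits) > 0) {
    queue <- rbind(queue, data.frame(url = hits, level = current$level + 1,
                                     stringsAsFactors = FALSE))
  }
}

seen   # the flat list of matching URLs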

Update information on a MediaWiki page from an RSS feed?

I administer a MediaWiki for a school group. We have a website in which students complete projects for virtual rewards. I want to put counters on a page of my wiki with statistical information (cumulative exp/coins, assignments completed, most productive student, etc.) about each of the seven groups that the students are divided into. It would be simple enough if the two sites were hosted on the same server, but they are not. I figure that an RSS feed with the statistical information may be a good way to get the info from the website server to the wiki server. How would I reference the information from the RSS feed in the wiki page?
Just to make sure my idea is clear, I would put in the feed something along the lines of:
[ATLAS]
exp=15000
coins=7500
eva=350
ip=150
dmg=500
[CERES]
exp=13000
;and so on
I would like to reference that in the wiki page. Is it doable?

Need help web scraping webpages and their links with an automated function in R

I am interested in extracting data on paranormal activity reported in the news, so that I can analyze the places and times of appearances for any correlations. This project is just for fun, to learn and use web scraping, text extraction, and spatial/temporal correlation analysis. So please forgive me for choosing this topic; I wanted to do something interesting and challenging.
First, I found that this website has a collection of reported paranormal incidents, with collections for 2009, 2010, 2011, and 2012.
The structure of the website is that each year has pages 1..10, and the links go like this
for year 2009:
http://paranormal.about.com/od/paranormalgeneralinfo/tp/2009-paranormal-activity.htm
On each page the stories are collected under headings like this:
Internal structure
Paranormal Activity, Posted 03-14-09
Each of these headlines has two pages inside it, like this:
http://paranormal.about.com/od/paranormalgeneralinfo/a/news_090314n.htm
Each of those pages contains the actual reported stories under various headlines, plus the links to the original websites carrying those stories. I am interested in collecting the reported text and extracting information about the kind of paranormal activity (ghost, demon, UFO) and the time, date, and place of each incident. I want to analyze this data for any spatial and temporal correlations: if UFOs or ghosts are real, they must show some behaviour and correlation in space or time in their movements. That is the long version of the story...
I need help web scraping the text from the pages described above. Below is the code I wrote to follow one page and its links down to the final text I want. Can anyone tell me whether there is a better, more efficient way to get clean text from the final page? I would also like to automate collecting the text by following all 10 pages for the whole of 2009.
library(XML)

# Source of paranormal news from about.com
# First page to start: the 2009 index
pn.url  <- "http://paranormal.about.com/od/paranormalgeneralinfo/tp/2009-paranormal-activity.htm"
pn.html <- htmlTreeParse(pn.url, useInternalNodes = TRUE)
pn.h3   <- xpathSApply(pn.html, "//h3", xmlValue)

# Extract the links of the headlines to follow to the story pages
pn.h3.links <- xpathSApply(pn.html, "//h3/a", xmlGetAttr, "href")

# Follow the first headline, e.g.
# http://paranormal.about.com/od/paranormalgeneralinfo/a/news_090314n.htm
pn.l1.url   <- pn.h3.links[1]
pn.l1.html  <- htmlTreeParse(pn.l1.url, useInternalNodes = TRUE)
pn.l1.links <- xpathSApply(pn.l1.html, "//p/a", xmlGetAttr, "href")

# Follow one of the story links, e.g.
# "British couple has 'black-and-white twins' twice"
# http://www.msnbc.msn.com/id/28471626/
pn.l1.f1.url  <- pn.l1.links[7]
pn.l1.f1.html <- htmlTreeParse(pn.l1.f1.url, useInternalNodes = TRUE)

# Keep only visible text (drop script and style content)
pn.l1.f1.text <- xpathSApply(pn.l1.f1.html,
                             "//text()[not(ancestor::script)][not(ancestor::style)]",
                             xmlValue)
I sincerely thank you in advance for reading my post and for your time helping me.
I would be grateful to any expert who would like to mentor me through this project.
Regards
Sathish
Try the Scrapy and BeautifulSoup libraries. Although they are Python based, they are considered among the best in the scraping domain. You can use the command line interface to connect the two; for more details about connecting R and Python, have a look here.
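
If you would rather stay in R, the single-page code above can be wrapped in a loop over the ten index pages. A rough sketch with the XML package follows; the URL pattern used for pages 2..10 is only a guess and needs to be checked against the real site.

library(XML)

# Sketch only: the index-page URLs for pages 2..10 are a guess -- check how the
# site actually numbers them (e.g. a "?page=2" suffix or distinct filenames)
base_url  <- "http://paranormal.about.com/od/paranormalgeneralinfo/tp/2009-paranormal-activity.htm"
page_urls <- c(base_url, paste0(base_url, "?page=", 2:10))   # hypothetical pattern

get_story_text <- function(url) {
  html <- htmlTreeParse(url, useInternalNodes = TRUE)
  xpathSApply(html,
              "//text()[not(ancestor::script)][not(ancestor::style)]",
              xmlValue)
}

all_text <- list()
for (idx in page_urls) {
  idx.html  <- htmlTreeParse(idx, useInternalNodes = TRUE)
  headlines <- xpathSApply(idx.html, "//h3/a", xmlGetAttr, "href")   # headline pages
  for (h in headlines) {
    h.html  <- htmlTreeParse(h, useInternalNodes = TRUE)
    stories <- xpathSApply(h.html, "//p/a", xmlGetAttr, "href")      # external story links
    all_text[[h]] <- lapply(stories, function(s)
      tryCatch(get_story_text(s), error = function(e) NA))
  }
}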
