Issue scraping financial data via xpath + tables

Issue scraping financial data via xpath + tables - web-scraping

I'm trying to build a stock analysis spreadsheet in Google sheets by using the importXML function in conjunction with XPath (absolute) and importHTML function using tables to scrape financial data from www.morningstar.co.uk key ratios page for the corresponding companies I like to keep an eye on.
Example: https://tools.morningstar.co.uk/uk/stockreport/default.aspx?tab=10&vw=kr&SecurityToken=0P00007O1V%5D3%5D0%5DE0WWE%24%24ALL&Id=0P00007O1V&ClientFund=0&CurrencyId=BAS
=importxml(N9,"/html/body/div[2]/div[2]/form/div[4]/div/div[1]/div/div[3]/div[2]/div[2]/div/div[2]/table/tbody/tr/td[3]")
=INDEX(IMPORTHTML(N9","table",12),3,2)
N9 being the cell containing the URL to the data source
I'm mainly using Morningstar as my source data due to the overwhelming amount of free information but the links keep on breaking, either the URL has slightly changed or the XPath hierarchy altered.
I'm guessing from what I've read so far is that busy websites such as these are dynamic and change often which is why my static links are breaking.
Is anyone able to suggest a solution or confirm if CSS selectors would be a more stable / reliable method of retrieving the data.
Many thanks in advance
Tried short XPath and long XPath links ( copied from dev tool in chrome ) frequently changed URL to repair link to data source but keeps breaking shortly after and unable to retrieve any information

Related

Scraping sector information from Yahoo Finance into Google Sheets using IMPORTXML [duplicate]

This question already has answers here:
Scraping data to Google Sheets from a website that uses JavaScript
(2 answers)
Closed last month.
I am very new to web-scraping and was introduced to it just today after trying to figure out a formula on a spreadsheet.
I would like to retrieve the Sector information onto Yahoo Finance into Google Sheets. I would also like to the data to update when there is a change to cell B7. Link: https://finance.yahoo.com/quote/MIDD/profile?p=MIDD
I came up with the following, but get a #N/A error: =importxml("https://finance.yahoo.com/quote/",B7,"/profile?p=",B7, "//*[#class='Fw(600) [#data-reactid='21']")
Please let me know what I might be doing wrong. Thank you in advance.

Solution
This is the right syntax to use IMPORTXML formula:
=IMPORTXML("URL", "XPATH_QUERY")
In your case this will translate to:
=importxml("https://finance.yahoo.com/quote/"&B7&"/profile?p="&B7,"//*[#class='Fw(600)'] [#data-reactid='21']")
Which will return an empty result.
Considerations
Keep in mind that many sites go to great lengths to actively prevent scraping. Allowing you to scrape their data entirely, undermines their business model. Since they might make profit from adds for example.
Check in the page you want to scrape if the tags you are watching for correspond to the data you wanted to get in the first place. I believe in this case it's just a matter of changing the tags values to the proper ones.

Attempting to design a flexible reporting system. Getting stuck

I’m having some trouble coming up with a future-proof-ish design for reports for a company. Essentially the requirements are:
Be able to pull whatever data from the database
Generate formatted report from that data by populating a template (HTML, docx)
Export to Word and/or PDF
So initially I made an API endpoint per report (this is a web app), and had PDFs generated and formatted correctly.
But now I need to get the data into .docx/Word format, and I’m trying to figure out how I can design something as D.R.Y. as possible so that I don’t have to put in a TON of work every time the company decides they need another report (they’ve done this two, three times which is how I became aware that I had coded myself into a corner).
Every report I’ve done thus far has been done via a “brute-force” method: code the queries needed for the report, format the data, and then render to PDF (using HTML to PDF via phantomjs).
The complexity occurred when the company came back and said “Hey, we need all of those reports in Word format, also we have 3 other new reports that we need and a report that is a slight variation on the old one but +/- 2 fields”.
I am just having trouble coming up with a solid design/abstraction here, one that doesn’t send me down a week long hacking spree every time a requirement changes.

Web-scraping in R when a table changes but the URL does not

I am trying to scrape NCAA gymnastics scores from roadtonationals.com into R. I have been able to do this in the past, using readLines(), but the website has been updated recently, and my old code no longer works.
In particular, when I am looking at the standings (roadtonationals.com/results/standings/), I can change season, year, week, and team/individual using the drop down menus. I can change between the four events and the all around using the tabs on the right. However, even if the table changes, the URL remains the same. I know very little about coding for websites, so I don’t even really know what this type of table is called or where to start with it.
Technically, I could copy and paste, but eventually, I’d like each individual score, like I used to be able to get, from a page like roadtonationals.com/results/schedule/meet/20409, which also involves selecting the teams or the events without changing the URL.
I found this question:
Using R to scrape tables when URL does not change
which seems to be asking the same thing that I am.
However, when I tried
library(httr)
standings <- POST(url = "https://roadtonationals.com/results/standings/season")
I get a message that says, “Not Acceptable.” and “An appropriate representation of the requested resource /results/standings/season could not be found on this server.”

Retrieve a number from each page of a paginated website

I have a list from approx. 36,000 URLs, ranging from https://www.fff.fr/la-vie-des-clubs/1/infos-cles to https://www.fff.fr/la-vie-des-clubs/36179/infos-cles (a few of those pages return 404 erros).
Each of those pages contains a number (the number of teams the soccer club contains). In the HTML file, the number appears as <p class="number">5</p>.
Is there a reasonably simple way to compile an excel or csv file with the URL and the associated number of teams as a field ?
I've tried looking into phantomJS but my method took 10 seconds to open a single webpage and I don't really want to spend 100 hours doing this. I was not able to figure out how (or whether it was at all possible) to use scraping tools such as import.io to do this.
Thanks !

For the goal you want to achieve, I can see two solutions:
Code it in Java: Jsoup + any CSV library
In a few minutes, the 36000+ urls can be downloaded easily.
Use a tool like Portia from scrapinghub.com
Portia is a WYSIWYG tool quickly helping you create your project and run it. They offer a free plan which can take in charge the 36000+ links.

Using Yahoo! Pipes

Have you used pipes.yahoo.com to quickly and easily do... anything? I've recently created a quick mashup of StackOverflow tags (via rss) so that I can browse through new questions in fields I like to follow.
This has been around for some time, but I've just recently revisited it and I'm completely impressed with it's ease of use. It's almost to the point where I could set up a pipe and then give a client privileges to go in and edit feed sources... and I didn't have to write more than a few lines of code.
So, what other practical uses can you think of for pipes?

It's nice for aggregating feeds, yes, but the other handy thing to do is filtering the feeds. A while back, I created a feed for Digg (before Digg fell into the Fark pit of dispair). I didn't care about the overwhelming Apple and Ubuntu news, so I filtered those keywords out of Technology, which I then combined with Science and World & Business feeds.
Anyway, you can do a lot more than just combine things. If you wanted to be smart about it, you could set up per-subfeed and whole-feed filters to give granular or over-arching filtering abilities as the news changes and you get bored with one topic or another.

The one thing I have really used Y! Pipes for (rather than just playing around with it) is to clean up item titles, merge and finally de-dupe the feeds I got from querying multiple blog search engines with the same search term. This is something I’ve done in several very different contexts, eg. for my own ego surfing, in another case for the planet site set up by some conference’s organisers to keep an eye on their conference’s buzz, etc. Highly recommended.

You can do tons of things with pipes. For example for sites like digg or reddit, you can make one to bypass the site and go directly to the linked article (rewriting the RSS).
I like also to filter webcomics' feeds to keep just the comics, and then mix them all in only one feed

I've taken the liberty of copying your pipe and rearranging it a bit so that it's easier to add and remove tags:
Yahoo Pipe: StackOverflow Merge Tags
Tags are now listed in a string builder, so to add a tag you just have to hit the + button on the string builder and type in the tag preceded by a slash.

Well, pipes are real fast and useful.
Other effective uses might be:
1) combine many feeds into one, then sort, filter and translate it.
2) geocode your favorite feeds and browse the items on an interactive map.
3) power widgets/badges on your web site.
4) grab the output of any Pipes as RSS, JSON, KML, and other formats.
This is by no means a comprehensive list.

One of my favorite things to do with Yahoo! Pipes is to aggregate multiple craigslist feeds into a single feed. You can make a feed out of any category or search criteria on craigslist. I live in a university town and am always on the lookout for tickets to sporting events, for example. I have a half-dozen craigslist searches all being combined into a single feed via Yahoo! Pipes. This works a lot better for me than simply monitoring the entire "Tickets" category; filters out most of the tickets I am not interested in. Yes, this is another aggregating feeds example, but the craigslist usage is quite valuable with the ability to aggregate feeds that are themselves based upon searches.

I've used Pipes to translate blogs into English. I would have liked to use it to fetch the full text for blogs which only provide a summary of the content in the feed, but unfortunately they don't provide any input which fetches the content from a parameterizable source :-(.

Just stumbled on this while looking for ways to connect Excel to Pipes. A bit necromancer-ish, but here goes.
One thing I've done, is take an HTML page (science data) which has links to tons of CSV files for a bunch of Army Corps measurement stations. Each station has a big table of datafiles, all organized individually by month and year. I use YQL to parse out and organize the links to the individual CSV files in a way that Pipes can read them. Then, I use that as input into a Pipe, which has a user input for "Station" and "Date."
Using this, I can go to the Pipes page, type in those values and get the values only for a specific station and date, rather than have to find the station on a website, find the year and month in a big table, click the link, open the CSV file, and find the values for a day within that month's worth of data. I can even change the pipe to specify the hour, and the parameter, and then get a single value returned.
Now, I wish I could figure out how to program Excel so that I can use "=yahoo_function(station, datetime)" to place that value automatically into a cell give the values of other columns!

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Issue scraping financial data via xpath + tables - web-scraping

Related

Scraping sector information from Yahoo Finance into Google Sheets using IMPORTXML [duplicate]

Attempting to design a flexible reporting system. Getting stuck

Web-scraping in R when a table changes but the URL does not

Retrieve a number from each page of a paginated website

Using Yahoo! Pipes

Categories

Resources