Web Scraping Blocked - Using Googlesheets IMPORTXML()

I was hoping someone smarter than me would know how to get around two websites, Woolworths.com.au and Coles.com.au, that appear to block the IMPORTXML() function.
I am doing some personal budgeting and want to pull the prices of various products (e.g. toothpaste) into my spreadsheet.
Here's what my spreadsheet looks like so far. I am trying to avoid manually entering the prices of the Woolworths and Coles items.
I have been trying IMPORTHTML and IMPORTXML.
=IMPORTXML("https://www.woolworths.com.au/shop/productdetails/238473/colgate-plax-alcohol-free-mouthwash-freshmint","//shared-price[@class='ng-star-inserted']")
I was initially able to get some price data, but after about a day it stopped working (I think they blocked my sheet).
Pretty tricky to work around so if anyone has suggestions for advanced Googlesheets forums to

Related

Scraping Spotify Top 200 streaming data with R

Novice R user here. I'm looking to scrape a large amount of data on daily streaming volumes for songs on Spotify's Top 200 charts for a research project I am involved with. Basically, I would like to write a script that scrapes all info for the tracks in the top 200 on a given day, such as today's chart, and repeats this for every day over a number of years, across a number of countries. I previously used some code from a guide to scrape this data successfully, but it no longer works for me.
I previously followed this guide pretty much word for word. While this originally worked, it now returns an empty tibble. I suspect that the problem may have to do with the fact that Spotify have re-developed their charts site since my last attempt. The site is different in appearance, but importantly the html node names appear to be different as well. My hunch is that this is what is causing the issue.
However, I am not at all sure if this is the case. Would appreciate it greatly if I could have some guidance on what I would need to do differently to achieve my aims, and whether it is indeed still possible to scrape these charts.
Cheers

Issue scraping financial data via xpath + tables

I'm trying to build a stock analysis spreadsheet in Google sheets by using the importXML function in conjunction with XPath (absolute) and importHTML function using tables to scrape financial data from www.morningstar.co.uk key ratios page for the corresponding companies I like to keep an eye on.
Example: https://tools.morningstar.co.uk/uk/stockreport/default.aspx?tab=10&vw=kr&SecurityToken=0P00007O1V%5D3%5D0%5DE0WWE%24%24ALL&Id=0P00007O1V&ClientFund=0&CurrencyId=BAS
=importxml(N9,"/html/body/div[2]/div[2]/form/div[4]/div/div[1]/div/div[3]/div[2]/div[2]/div/div[2]/table/tbody/tr/td[3]")
=INDEX(IMPORTHTML(N9,"table",12),3,2)
N9 being the cell containing the URL to the data source
I'm mainly using Morningstar as my data source because of the overwhelming amount of free information, but the links keep breaking: either the URL has changed slightly or the XPath hierarchy has been altered.
From what I've read so far, I'm guessing that busy websites such as these are dynamic and change often, which is why my static links are breaking.
Is anyone able to suggest a solution, or confirm whether CSS selectors would be a more stable and reliable method of retrieving the data?
Many thanks in advance
I have tried both short and long XPath expressions (copied from Chrome's dev tools) and have frequently changed the URL to repair the link to the data source, but it keeps breaking shortly afterwards and I am unable to retrieve any information.
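To illustrate why an absolute XPath copied from dev tools tends to break when the page hierarchy is altered, here is a minimal Python sketch using only the standard library. The markup, the `ratios` id, and the helper name `first_text` are all made up for illustration; whether the real Morningstar page exposes a stable attribute like this is an assumption that would need checking. Note that `xml.etree.ElementTree` supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the same table before and after the site wraps it
# in one extra <div>. This is NOT real Morningstar markup.
BEFORE = (
    "<html><body><div>"
    "<table id='ratios'><tr><td>Revenue</td><td>100</td></tr></table>"
    "</div></body></html>"
)
AFTER = (
    "<html><body><div><div>"
    "<table id='ratios'><tr><td>Revenue</td><td>100</td></tr></table>"
    "</div></div></body></html>"
)

# Position-based path: depends on every ancestor staying where it is.
ABSOLUTE = "./body/div/table/tr/td[2]"
# Attribute-anchored path: only depends on the table keeping its id.
RELATIVE = ".//table[@id='ratios']/tr/td[2]"


def first_text(markup, path):
    """Return the text of the first node matching path, or None."""
    node = ET.fromstring(markup).find(path)
    return None if node is None else node.text


for label, markup in (("before", BEFORE), ("after", AFTER)):
    print(label, first_text(markup, ABSOLUTE), first_text(markup, RELATIVE))
```

In this sketch the absolute path stops matching once the extra wrapper appears, while the attribute-anchored path still finds the cell. The same idea carries over to IMPORTXML, which accepts relative expressions such as `//table[@id='...']`, provided the page offers an attribute that stays stable.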

Scraping sector information from Yahoo Finance into Google Sheets using IMPORTXML [duplicate]

This question already has answers here:
Scraping data to Google Sheets from a website that uses JavaScript
(2 answers)
Closed last month.
I am very new to web-scraping and was introduced to it just today after trying to figure out a formula on a spreadsheet.
I would like to retrieve the Sector information from Yahoo Finance into Google Sheets. I would also like the data to update when cell B7 changes. Link: https://finance.yahoo.com/quote/MIDD/profile?p=MIDD
I came up with the following, but get a #N/A error: =importxml("https://finance.yahoo.com/quote/",B7,"/profile?p=",B7, "//*[@class='Fw(600) [@data-reactid='21']")
Please let me know what I might be doing wrong. Thank you in advance.

Solution
This is the correct syntax for the IMPORTXML formula:
=IMPORTXML("URL", "XPATH_QUERY")
In your case this will translate to:
=importxml("https://finance.yahoo.com/quote/"&B7&"/profile?p="&B7,"//*[@class='Fw(600)'] [@data-reactid='21']")
Which will return an empty result.
Considerations
Keep in mind that many sites go to great lengths to actively prevent scraping. Letting you scrape all of their data would undermine their business model, since they may make their profit from ads, for example.
Check on the page you want to scrape that the tags you are targeting actually correspond to the data you want. I believe in this case it's just a matter of changing the tag values to the proper ones.
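As a rough illustration of the two fixes above (joining the URL parts into a single string, and writing each XPath predicate with `@` and its own closing bracket), here is a small Python sketch using only the standard library. The sample markup and the helper names `build_profile_url` and `select_text` are invented for the example; the markup is not what Yahoo Finance actually serves, and `xml.etree.ElementTree` implements only a subset of XPath.

```python
import xml.etree.ElementTree as ET


def build_profile_url(ticker):
    # Mirrors "https://finance.yahoo.com/quote/"&B7&"/profile?p="&B7
    return "https://finance.yahoo.com/quote/" + ticker + "/profile?p=" + ticker


# Hypothetical, hand-written markup used only to exercise the XPath.
SAMPLE = """
<html><body>
  <span class="Fw(600)" data-reactid="21">Example Sector</span>
  <span class="Fw(600)" data-reactid="25">Example Industry</span>
</body></html>
""".strip()

# Two separate predicates; both must hold for the same element.
XPATH = ".//*[@class='Fw(600)'][@data-reactid='21']"


def select_text(markup, xpath):
    """Return the text of every element matching xpath."""
    root = ET.fromstring(markup)
    return [node.text for node in root.findall(xpath)]


print(build_profile_url("MIDD"))
print(select_text(SAMPLE, XPATH))
```

If the live page fills in these elements with JavaScript after loading, which the linked duplicate suggests may be the case, the HTML that IMPORTXML receives will not contain them, and a well-formed query will still come back empty.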

Import data from ebay to google spreadsheet using IMPORTHTML

I'm trying to Import a list from:
https://www.ebay.com/sch/i.html?_from=R40&_nkw=silver+chain&_sacat=0&rt=nc&LH_Sold=1&LH_Complete=1
into a Google spreadsheet using the =IMPORTHTML function. The formula I am using is below:
A1:
https://www.ebay.com/sch/i.html?_from=R40&_nkw=silver+chain&_sacat=0&rt=nc&LH_Sold=1&LH_Complete=1
A2:
=IMPORTHTML(A1,"list",4)
But it returns incorrect results.
For example:
ebay - Alloy (34,237)
google sheets - Alloy (42,069)
Can somebody help me? I'm new to Google Sheets scripting and I'd be really grateful for any help.
Waiting to hear from somebody...

You may be getting a discrepancy in results based on your location.
If you are located overseas, for example, eBay may count only the results from sellers who will ship to your region. When the Google server makes the request, it does so from a location different from your own, and thus may get a different total.
I run across this from time to time. If you're logged into eBay, you could try changing your shipping address and see what that does to the total number of results.

Web-scraping in R when a table changes but the URL does not

I am trying to scrape NCAA gymnastics scores from roadtonationals.com into R. I have been able to do this in the past, using readLines(), but the website has been updated recently, and my old code no longer works.
In particular, when I am looking at the standings (roadtonationals.com/results/standings/), I can change season, year, week, and team/individual using the drop down menus. I can change between the four events and the all around using the tabs on the right. However, even if the table changes, the URL remains the same. I know very little about coding for websites, so I don’t even really know what this type of table is called or where to start with it.
Technically, I could copy and paste, but eventually, I’d like each individual score, like I used to be able to get, from a page like roadtonationals.com/results/schedule/meet/20409, which also involves selecting the teams or the events without changing the URL.
I found this question:
Using R to scrape tables when URL does not change
which seems to be asking the same thing that I am.
However, when I tried
library(httr)
standings <- POST(url = "https://roadtonationals.com/results/standings/season")
I get a message that says, “Not Acceptable.” and “An appropriate representation of the requested resource /results/standings/season could not be found on this server.”
