XPath of investing.com for scraping

I want to import data from https://www.investing.com/equities/boc-hong-kong-historical-data with the IMPORTXML formula in Google Sheets. It can be done with IMPORTHTML, but I would like to import it by XPath because then it would not have scraping update issues.
I used IMPORTXML("https://www.investing.com/equities/boc-hong-kong-historical-data","//*[@id='curr_table']"), and it scraped the data but in bad shape; for example, it does not separate the rows and columns, nor is it comma-delimited.
How can I extract data by XPath in Google Sheets?

I believe your goal is as follows.
You want to retrieve the table returned by =IMPORTHTML("https://www.investing.com/equities/boc-hong-kong-historical-data","table",2), but using XPath in Google Sheets.
Modified formula:
In order to retrieve the values using XPath, please use the following formula.
=IMPORTXML("https://www.investing.com/equities/boc-hong-kong-historical-data","//table[@id='curr_table']//tr")
In this case, the XPath is //table[@id='curr_table']//tr.
Also, you can use the XPath //*[@id='curr_table']//tr.
Note:
As another method, I think that IMPORTHTML can also be used as below. This returns the same result as the formula above.
=IMPORTHTML("https://www.investing.com/equities/boc-hong-kong-historical-data","table",2)
References:
IMPORTXML
IMPORTHTML

Related

XPATH inside IMPORTXML with double quotation marks in query

I am trying to scrape data from a website into Google Sheets, but because of the double quotes around "compTable" in the xpath_query I keep getting a formula parse error. When I try single quotes instead, i.e. 'compTable', I get the error that the imported content is empty. Is there a way to handle double quotation marks in the XPath inside an IMPORTXML function so that it does not return an error?
=IMPORTXML("https://www.levels.fyi/comp.html?track=Software%20Engineer&search=sydney&city=1311","//*[#id="compTable"]/tbody/tr[1]/td[2]/span/a")
For context I am trying to use this formula to get the company name from the table in the url e.g. Google, Amazon, Canva. Ultimately I want to scrape this website to create a Google Sheet with each row of the table in this URL so that I have each data point (company name, total compensation, level etc.) on each row of my Google Sheet.
use:
=IMPORTXML("https://www.levels.fyi/comp.html?track=Software%20Engineer&search=sydney&city=1311",
"//*[#id='compTable']/tbody/tr[1]/td[2]/span/a")

Scraping availability of the product

My goal is to scrape whether the product is available or not.
At present I am using the following:
=importxml (B2,//*[@id="product-controls"]/div/div[1]/div[1])
Unfortunately, I am receiving an error. Here is the link to the file https://docs.google.com/spreadsheets/d/11OJvxRRIXJolpi2UttmNIOArAdwh1qeZhjqczlVI8oc/edit#gid=1531110146
As an example, I want to get the data from the url https://radiodetal.com.ua/mikroshema-5m0365r-dip8
and the XPath should be taken from the availability element on that page.
Use the following formula:
=IMPORTXML(B2,"//div[@class='stock']")

What to do if rvest isn't recognizing a node in R?

I am trying to scrape referee game data using rvest. See the code below:
library(rvest)

page_ref <- read_html("https://www.pro-football-reference.com/officials/HittMa0r.htm")
ref_tab <- page_ref %>%
  html_node("#games") %>%
  html_text()
  # html_table()
But rvest does not recognize any of the nodes for the "Games" table in the link. It can pull the data from the first table, "Season Totals", just fine. So, am I missing something? In general, what does it mean if rvest doesn't recognize a node that SelectorGadget identifies and that is clearly visible in the developer tools?
It is because the first table is in the HTML you get from the server, while the other tables are filled in by JavaScript. rvest can only get you what is there in the HTML response from the server. If you want the data filled in by JS, you need to use another tool such as Selenium or Puppeteer (a minimal example is sketched after the links below).
Selenium
Puppeteer / gepetto
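To make that concrete, here is a minimal sketch in Python with Selenium (my own illustration; the answer does not prescribe a tool or language, and this assumes Chrome is installed and that the #games table appears once the page's JavaScript has run):
from io import StringIO
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # assumes Chrome; Selenium Manager fetches the driver
driver.get("https://www.pro-football-reference.com/officials/HittMa0r.htm")
# Wait until the JavaScript-built "Games" table is actually in the DOM.
WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.ID, "games")))
# The rendered page source now contains the table, so pandas can parse it.
games = pd.read_html(StringIO(driver.page_source), attrs={"id": "games"})[0]
driver.quit()
print(games.head())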

Identify names of metrics from view

I need to query the Google Analytics API to reproduce the following view:
In my Python code I have a list of dimensions and metrics that I want to query:
'metrics': [{'expression': 'ga:productListClicks'}],
'dimensions': [{'name': 'ga:landingPagePath'}],
My problem is that I do not know the name of the columns in the format 'ga:...' and in the Query Explorer there are multiple names for a given column.
Is there a way to see the name of the columns in the format 'ga:...' directly in GA?
If not, how can I find the right names?
If you use the GA query builder, you can use the type ahead/search feature of the tool to find these attributes.
You'll find the query explorer here:
https://ga-dev-tools.appspot.com/query-explorer/
Some of the ones you're looking for are:
ga:impressions
ga:adClicks
ga:CTR
ga:sessions
ga:bounceRate
etc.
The benefit of doing this in the Query Explorer is that you can test the query before going back to Python. There are a lot of complications when mixing metrics and dimensions, and making sure what you're doing is valid here first will save headaches!
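Once a combination validates in the Query Explorer, the ga:... names drop straight into the request body. A rough sketch against the Analytics Reporting API v4 with googleapiclient (my own example; the view id, credentials path, and date range are placeholders, and the metric and dimension names are the ones mentioned above):
from google.oauth2 import service_account
from googleapiclient.discovery import build

VIEW_ID = "XXXXXXXX"  # placeholder: your GA view id
creds = service_account.Credentials.from_service_account_file(
    "credentials.json",  # placeholder path to a service-account key
    scopes=["https://www.googleapis.com/auth/analytics.readonly"],
)
analytics = build("analyticsreporting", "v4", credentials=creds)
response = analytics.reports().batchGet(body={
    "reportRequests": [{
        "viewId": VIEW_ID,
        "dateRanges": [{"startDate": "30daysAgo", "endDate": "today"}],
        # Names checked in the Query Explorer, in the ga:... format.
        "metrics": [{"expression": "ga:sessions"}, {"expression": "ga:bounceRate"}],
        "dimensions": [{"name": "ga:landingPagePath"}],
    }]
}).execute()
print(response["reports"][0]["data"].get("rows", []))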

Scrapy returning numbers and letters instead of "?" for href value

I am trying to scrape a web forum using Scrapy for the href link info and when I do so, I get the href link with many letters and numbers where the question mark should be.
This is a sample of the html document that I am scraping:
I am scraping the html data for the href link using the following code:
response.xpath('.//*[contains(@id, "thread_title")]/@href').extract()
When I run this, I get the following results:
[u'showthread.php?s=f969fe6ed424b22d8fddf605a9effe90&t=2676278']
What should be returned is:
[u'showthread.php?t=2676278']
I have run other tests scraping for href data with question marks elsewhere in the document, and I also get the "s=f969fe6ed424b22d8fddf605a9effe90&" returned.
Why am I getting this data returned with the "s=f969fe6ed424b22d8fddf605a9effe90&" instead of just the question mark?
Thanks!
It seems that the site I am scraping uses a unique session identifier in order to more accurately track the number of views per thread. I was not able to return scraped data without the unique id (it changed over time), so I scraped a different HTML tag for the thread ID and then joined it to the web address (showthread.php?t=) to create the link I was looking for.
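Another option, if you would rather keep the original XPath and just drop the session token, is to strip the s parameter from the extracted href. A small sketch using w3lib, which Scrapy already depends on (the example value is the one from the question):
from w3lib.url import url_query_cleaner

href = "showthread.php?s=f969fe6ed424b22d8fddf605a9effe90&t=2676278"
# Keep only the thread id parameter; the per-session "s" value is dropped.
print(url_query_cleaner(href, ["t"]))  # showthread.php?t=2676278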
