XPath inside IMPORTXML with double quotation marks in query

I am trying to scrape data from a website into Google Sheets, but because of the double quotes around "compTable" in the xpath_query I keep getting a formula parse error. When I try single quotes instead, i.e. 'compTable', I get the error "imported content is empty". Is there a way to handle double quotation marks in an XPath inside an IMPORTXML function so that the function does not return an error?
=IMPORTXML("https://www.levels.fyi/comp.html?track=Software%20Engineer&search=sydney&city=1311","//*[@id="compTable"]/tbody/tr[1]/td[2]/span/a")
For context I am trying to use this formula to get the company name from the table in the url e.g. Google, Amazon, Canva. Ultimately I want to scrape this website to create a Google Sheet with each row of the table in this URL so that I have each data point (company name, total compensation, level etc.) on each row of my Google Sheet.

Use single quotes inside the XPath string:
=IMPORTXML("https://www.levels.fyi/comp.html?track=Software%20Engineer&search=sydney&city=1311",
"//*[@id='compTable']/tbody/tr[1]/td[2]/span/a")

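If the double quotes have to stay in the XPath, a Google Sheets string literal also accepts a double quote written twice ("") as an escaped quote. The sketch below builds such a formula as text; the helper names sheets_string and importxml_formula are made up for illustration and are not part of any Sheets API. It only fixes the parse error and says nothing about whether the page actually returns content.

```python
def sheets_string(value: str) -> str:
    """Wrap text as a Sheets string literal, doubling any embedded double quotes."""
    return '"' + value.replace('"', '""') + '"'


def importxml_formula(url: str, xpath: str) -> str:
    """Build the text of an =IMPORTXML(...) formula (hypothetical helper)."""
    return f"=IMPORTXML({sheets_string(url)},{sheets_string(xpath)})"


if __name__ == "__main__":
    url = "https://www.levels.fyi/comp.html?track=Software%20Engineer&search=sydney&city=1311"
    xpath = '//*[@id="compTable"]/tbody/tr[1]/td[2]/span/a'
    # The XPath argument comes out as "//*[@id=""compTable""]/tbody/tr[1]/td[2]/span/a"
    print(importxml_formula(url, xpath))
```
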
Related

Xpath of investing.com for scraping

I want to import data from https://www.investing.com/equities/boc-hong-kong-historical-data with the IMPORTXML formula in Google Sheets. It can be done with IMPORTHTML, but I would like to import it by XPath because it would not have issues with scraping updates.
I used IMPORTXML("https://www.investing.com/equities/boc-hong-kong-historical-data","//*[@id='curr_table']") and it scraped the data, but in bad shape; for example, the output is not split into rows and columns or comma-delimited.
How can I extract data by xPath in Google Sheets?
I believe your goal is as follows.
You want to retrieve the table in the URL of =IMPORTHTML("https://www.investing.com/equities/boc-hong-kong-historical-data","table",2) using the xpath on Google Spreadsheet.
Modified formula:
In order to retrieve the values using the xpath, please use the following xpath.
=IMPORTXML("https://www.investing.com/equities/boc-hong-kong-historical-data","//table[@id='curr_table']//tr")
In this case, the xpath is //table[@id='curr_table']//tr.
Also, you can use the xpath of //*[@id='curr_table']//tr.
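To see why selecting the tr elements gives one spreadsheet row per table row, here is a minimal Python sketch that runs the same XPath over a tiny hand-written sample. The markup and its values are invented, not taken from investing.com, and table_rows is a hypothetical helper. The standard library's ElementTree supports only a subset of XPath, so this illustrates the selection rather than the engine IMPORTXML uses.

```python
import xml.etree.ElementTree as ET

# Invented sample markup; only the id "curr_table" comes from the question.
SAMPLE = """
<html><body>
  <table id="other"><tr><td>ignored</td></tr></table>
  <table id="curr_table">
    <tr><th>Date</th><th>Price</th></tr>
    <tr><td>Day 1</td><td>10.0</td></tr>
    <tr><td>Day 2</td><td>10.5</td></tr>
  </table>
</body></html>
"""


def table_rows(markup: str, table_id: str) -> list[list[str]]:
    """Return the cell texts of every <tr> under the table with the given id."""
    root = ET.fromstring(markup.strip())
    # Same shape as //table[@id='curr_table']//tr, written relative to the root.
    rows = root.findall(f".//table[@id='{table_id}']//tr")
    return [[(cell.text or "").strip() for cell in row] for row in rows]


if __name__ == "__main__":
    for row in table_rows(SAMPLE, "curr_table"):
        print(row)
```

Each matched tr becomes one list of cells, which corresponds to one row in the sheet.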
Note:
As another method, I think that IMPORTHTML can also be used as shown below. It returns the same result as the formula above.
=IMPORTHTML("https://www.investing.com/equities/boc-hong-kong-historical-data","table",2)
References:
IMPORTXML
IMPORTHTML

Google Analytics Filter - Filtering out URL Parameters

I have the following urls:
example.org/
example.org/index.cfm
example.org/index.cfm?SSO=1
...all of which point to the same page. In GA, I want them all to point to
example.org/. Every month, I have to add up the data for each of these, and it's quite annoying.
I'm aware of how filters work, and I have the following:
Search & Replace > Request URI
Search String: index.cfm$
Replace String: (left the field blank)
...and that's working.
The issue is when I'm trying to search & replace or exclude the "?SSO=1" from the string. Nothing seems to be working.

Scrapy returning numbers and letters instead of "?" for href value

I am trying to scrape a web forum using Scrapy for the href link info and when I do so, I get the href link with many letters and numbers where the question mark should be.
This is a sample of the html document that I am scraping:
I am scraping the html data for the href link using the following code:
response.xpath('.//*[contains(@id, "thread_title")]/@href').extract()
When I run this, I get the following results:
[u'showthread.php?s=f969fe6ed424b22d8fddf605a9effe90&t=2676278']
What should be returned is:
[u'showthread.php?t=2676278']
I have run other tests scraping for href data with question marks elsewhere in the document, and I also get the "s=f969fe6ed424b22d8fddf605a9effe90&" returned.
Why am I getting this data returned with the "s=f969fe6ed424b22d8fddf605a9effe90&" instead of just the question mark?
Thanks!
It seems that the site I am scraping uses a unique identifier to update the number of views per thread more accurately. I was not able to return scraped data without the unique ID, and the ID changed over time, so I scraped a different HTML tag for the thread ID and then joined it to the web address (showthread.php?t=) to create the link I was looking for.
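As an alternative to scraping the thread ID from another tag, the session-style parameter could also be stripped from the href after extraction. The following is a minimal sketch using only the Python standard library; drop_query_param is a hypothetical helper name, and it assumes the unwanted parameter is always called s, as in the sample output above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def drop_query_param(href: str, name: str = "s") -> str:
    """Return href with every query parameter called `name` removed."""
    parts = urlsplit(href)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != name
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


if __name__ == "__main__":
    raw = "showthread.php?s=f969fe6ed424b22d8fddf605a9effe90&t=2676278"
    print(drop_query_param(raw))  # showthread.php?t=2676278
```

In a spider this could be applied to each value returned by .extract() before the link is stored or followed.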

Filter to Group URL on Visitors Flow

I have found a similar question earlier here:
Google Analytics Visitors Flow: grouping URLs?
However, I'm confused because people suggest different ways to write the Replace String, and whichever way I try it, I am not able to make it work.
So I have an ecommerce site with hundreds of different pages. The different parts of the website are:
http://example.com/sv/ (Root)
http://example.com/sv/category/1-name/
http://example.com/sv/product/1-name/
http://example.com/sv/designer-tool/1-name/
http://example.com/sv/checkout/
When I go to the visitors flow, I want to see the number of people who go from, for example, Root to Category, from Category to Product, from Product to Designer Tool, and from Designer Tool to Checkout. However, with so many different pages it becomes very difficult to follow the visitors flow, because the product pages, for example, are not grouped together.
So instead of the above, I would like to remove the 1-name/ part at the end and only see /sv/category/, /sv/product/, /sv/designer-tool/.
In the earlier post I understand you can use an advanced filter to do this. I have set the following settings:
Type: Search & Replace
Field: Request URI
Search String: ^/(category|product|designer-tool)(/\d*)(.*)
Replace String: /$A1$A3
I guess that my search string and my replace string are wrong. Any ideas?
EDIT: I updated my filter to the following:
Search String: ^/sv/(category|product|designer-tool)(/\d*)(.*)$
Replace String: /sv/\1/
Still testing and unsure if it's the correct way to set it up.
I was able to solve this using the Search String and the Replace String in my edit above.
So basically what I did was:
Create a secondary view/profile for your site. If you apply the filter to your one and only view/profile, you won't be able to see any detailed data about specific pages, because the filter removes it.
Add an Advanced Filter with the following settings:
Type: Search & Replace
Field: Request URI
Search String: ^/sv/(category|product|designer-tool)(/\d*)(.*)$
Replace String: /sv/\1/
You need to wait 24h after creating your new profile/view before you can see any data in it.
So my confusion was regarding the Search String and the Replace String. The Search String is a regular expression matched against everything after your .tld. So for example, for http://www.example.com/sv/mypage/1-post/, the Search String will only search within /sv/mypage/1-post/.
The Replace String is what the whole matched string is replaced with. In my case, I matched all URLs that looked like /sv/category/1-string/. I only wanted to keep the "category" part, so I replaced the whole string with /sv/category/ by entering the Replace String /sv/\1/.
/sv/ means just what it says. \1 takes the value of the first () group of my Search String (in this case "category"). The ending / is just a trailing slash.
All in all, it means that any URLs that looked like http://example.com/sv/category/1-string/ were changed to http://example.com/sv/category/, so I can now see data for all my categories as a group instead of as individual pages.
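The search-and-replace pair can be tried out locally before waiting for the new view to collect data. This is a minimal sketch with Python's re module, which may differ in details from the regex flavor Google Analytics uses; group_uri is a hypothetical helper name, and the sample paths are made up to follow the URL patterns listed in the question.

```python
import re

# Search String and Replace String copied from the filter settings above.
SEARCH = re.compile(r"^/sv/(category|product|designer-tool)(/\d*)(.*)$")
REPLACE = r"/sv/\1/"


def group_uri(request_uri: str) -> str:
    """Collapse a detail-page URI to its section; other URIs pass through unchanged."""
    return SEARCH.sub(REPLACE, request_uri)


if __name__ == "__main__":
    for uri in ("/sv/", "/sv/category/1-name/", "/sv/product/1-name/",
                "/sv/designer-tool/1-name/", "/sv/checkout/"):
        print(uri, "->", group_uri(uri))
```

URIs that do not match the Search String, such as /sv/ and /sv/checkout/, are left as they are, which is what keeps the root and checkout steps visible in the flow.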

How to fetch output of page as string in ASP.net

I want to fetch a page's output as a string. I want to request a web page from my code, such as www.example.com, keep the whole response in a string, and then, in the next step, search that string for all anchor href values. But first, how do I get the page's output as a string in ASP.NET?
It sounds like you're trying to scrape another website. This may help: http://www.beansoftware.com/ASP.NET-Tutorials/Screen-Scraping-Web-Fetching.aspx
