Extract link text from an <a> tag with web scraping using Google Sheets


I have the following HTML:
<a href="link.html">Text</a>
How can I get the "Text" value? I am trying this, but I get an empty value:
=INDEX(importxml("http://www.remoteurl.com";"//a[@href='link.html']");1)

I tried using your syntax and it worked for me. I shortened it a little for testing purposes.
=importxml("https://www.remoteurl.com","//a[@href='link.html']")
Be sure that the href value you are passing in the xpath query is exactly what is present on the web page, e.g. if the web page uses a relative path then you must also use the same relative path.
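
To illustrate why the href must match exactly, here is a minimal Python sketch that runs the same kind of XPath predicate with the standard library. The PAGE string is a hypothetical stand-in for the remote page, and link_texts is a helper name made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the remote page.
PAGE = """<html><body>
<a href="link.html">Text</a>
<a href="/docs/link.html">Other</a>
</body></html>"""

root = ET.fromstring(PAGE)


def link_texts(href):
    """Return the text of every <a> whose href equals `href` exactly."""
    return [a.text for a in root.findall(f".//a[@href='{href}']")]


print(link_texts("link.html"))   # the exact relative path matches
print(link_texts("/link.html"))  # a different spelling of the path matches nothing
```

The predicate compares the attribute value as a plain string, so "link.html" and "/link.html" are different values even if a browser would resolve them to the same address.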

I was doing it properly, but the problem is that the code was inside an iframe, so it was impossible to reach it.
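
Content inside an iframe belongs to a separate document, so an XPath query against the outer page never sees it. One possible workaround, assuming the frame's src URL can be fetched on its own, is to find that URL and query it directly. A minimal Python sketch follows; the OUTER markup, the inner.html URL, and the helper names are hypothetical.

```python
from html.parser import HTMLParser


class IframeSrcCollector(HTMLParser):
    """Collects the src attribute of every <iframe> in a document."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


# Hypothetical outer page; the inner URL is invented for illustration.
OUTER = (
    "<html><body><h1>Outer</h1>"
    '<iframe src="https://www.remoteurl.com/inner.html"></iframe>'
    "</body></html>"
)


def iframe_sources(html):
    """Return the src URLs of all iframes found in `html`."""
    collector = IframeSrcCollector()
    collector.feed(html)
    collector.close()
    return collector.sources


print(iframe_sources(OUTER))
```

If the framed URL is reachable, the original query could then be pointed at that URL instead of the outer page.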


Error during web scraping in R using Selector Gadget

I hope you are all doing well.
I am facing an error while web scraping in R with the Selector Gadget tool. When I select the data with the tool on the Coursera website, the number of values it shows is correct (10). But when I copy that CSS selector into R and run it, it returns 18 names in the list. Can anyone please help me with this? Here is a screenshot of the Selector Gadget output:
And here is what gets returned in R when I scrape that css selector:

The rendered content seen in a browser is not exactly the same as the HTML returned by a plain HTTP request such as the one rvest makes. This is because a browser can run JavaScript to update the content.
Inspect the page source by pressing Ctrl+U in the browser on that webpage.
You can rewrite your CSS selector list to match the actual HTML returned. One example would be as follows, which also removes the reliance on dynamic classes, which change more frequently and would break your program sooner.
library(rvest)
read_html("https://in.coursera.org/degrees/bachelors") |>
html_elements('[data-e2e="degree-list"] div[class] > p:first-child') |>
html_text2()
Learn about CSS selectors and operators here: https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Selectors

When using apoc.load.html, Is it possible to return the full HTML rather than only text?

Let's say I want to scrape the Neo4j RefCard found at: https://neo4j.com/docs/cypher-refcard/current/
And I would like to fetch a 'code' example along with its styling. Here's my target. Notice that it has CSS treatment (font, color...):
...so in Neo4j I call the apoc.load.html procedure as shown here, and you can see it's no problem finding the content:
It returns a map with three keys: tagName, attributes, and text.
The text is the issue for me. It's stripped of all styling. I would like for it to let me know more about the styling of the different parts of this text.
The actual HTML in the webpage looks like following image with all of these span class tags: cm-string, cm-node, cm-atom, etc. Note that this was not generated by Neo4j's apoc.load.html procedure. It came straight from my Chrome browser's inspect console.
I don't need the actual fonts and colors, just the tag names.
I can see in the documentation that there is an optional config map you can supply, but there's no explanation of what can be configured there. It would be lovely if I could configure it to return, say, HTML rather than text.
The library that Neo4j uses for CSS selection here is jsoup.
So I am hoping to not strip the <span> tags, or otherwise, extract their class names for each segment of text.

Could you not generate the HTML yourself from the properties in your object? It looks like they are all span tags with 3 different classes depending on whether you're using the property name, property value, or property delimiter.
That is probably how they are generating the HTML themselves.

Okay, two years later I revisited this question I posted, and did find a solution. I'll keep it short.
The APOC procedure CALL apoc.load.html is using the scraping library Jsoup, which is not a full-fledged browser. When it visits a page it reads the html sent by the server but ignores any javascript. As a result, if a page uses javascript for inserting content or even just formatting the content, then Jsoup will miss the html that the javascript would have generated had it run.
So I have just tried out the service at prerender.com. It's simple to use: you pass it a URL, it fetches that page itself, executes the page's JavaScript, and returns the final result as static HTML.
So if I call prerender.com through apoc.load.html, the Jsoup library simply asks for the HTML and this time gets the fully rendered HTML. :)
You can try the following two queries and see the difference pre-rendering makes. The span tags in this page are rendered only by javascript. So if we call it asking for its span tags without pre-rendering we get nothing returned.
CALL apoc.load.html("https://neo4j.com/docs/cypher-refcard/current/", {target:".listingblock pre:contains(age: 38) span"}) YIELD value
UNWIND value.target AS spantags
RETURN spantags
...but if we call it via the prerender.com website, we get a bunch of span tags and their content.
CALL apoc.load.html("https://service.prerender.cloud/https://neo4j.com/docs/cypher-refcard/current/", {target:".listingblock pre:contains(age: 38) span"}) YIELD value
UNWIND value.target AS spantags
RETURN spantags
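
Once fully rendered HTML is available, the class name of each text segment can be read straight from the span tags. The following is a rough Python sketch of that idea, separate from APOC and Jsoup. The RENDERED snippet is a hypothetical example that only borrows the cm-node, cm-atom, and cm-string class names mentioned in the question, and SpanCollector and span_segments are names invented here.

```python
from html.parser import HTMLParser


class SpanCollector(HTMLParser):
    """Records a (class, text) pair for each piece of text inside a <span>."""

    def __init__(self):
        super().__init__()
        self._open_classes = []  # class attribute of each currently open span
        self.segments = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._open_classes.append(dict(attrs).get("class", ""))

    def handle_endtag(self, tag):
        if tag == "span" and self._open_classes:
            self._open_classes.pop()

    def handle_data(self, data):
        # Text outside any span (delimiters, whitespace) is skipped.
        if self._open_classes and data.strip():
            self.segments.append((self._open_classes[-1], data.strip()))


# Hypothetical pre-rendered markup, not copied from the real page.
RENDERED = (
    '<pre><span class="cm-node">n</span>.'
    '<span class="cm-atom">name</span> = '
    '<span class="cm-string">"Alice"</span></pre>'
)


def span_segments(html):
    """Return (class, text) pairs for every span in `html`."""
    collector = SpanCollector()
    collector.feed(html)
    collector.close()
    return collector.segments


for css_class, text in span_segments(RENDERED):
    print(css_class, text)
```

The same pairs should be recoverable inside Neo4j from the tagName, attributes, and text keys that apoc.load.html returns for each span, assuming the class is exposed under attributes.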

Scrapy: Unable to access a class even though it's there

I am trying to scrape this page to fetch the color name, LT. BLUE. In Chrome I see this HTML:
<div id="desc-options"><div class="option"><span class="label">Color:</span> LT. BLUE</div><div class="option"><span class="label">Size:</span> 6.5</div></div>
I tried response.css("#desc-options") to access everything inside, but it returns []. Even BeautifulSoup fails.

The element you're looking for is dynamically created via JavaScript. You cannot parse it from the plain HTML.
The good news is: the data you're looking for is probably still in the page. Check out the <script> tag defining the spConfig variable. Looks like there's some JSON there you can parse ...
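
A minimal sketch of pulling that JSON out of the raw page source with the Python standard library. The PAGE string and the shape of the JSON (attributes, color, options, label) are assumptions made up for illustration, since the real structure of spConfig is not shown here; extract_json_after is a hypothetical helper.

```python
import json

# Hypothetical page source; the real spConfig payload will look different.
PAGE = """
<script type="text/javascript">
    var spConfig = new Product.Config({"attributes": {"color": {"options": [{"label": "LT. BLUE"}]}}});
</script>
"""


def extract_json_after(source, marker):
    """Decode the first JSON object that starts after `marker` in `source`."""
    start = source.index("{", source.index(marker))
    # raw_decode stops at the end of the object and ignores trailing text.
    obj, _end = json.JSONDecoder().raw_decode(source, start)
    return obj


config = extract_json_after(PAGE, "spConfig")
labels = [option["label"] for option in config["attributes"]["color"]["options"]]
print(labels)
```

In a Scrapy callback the same helper could be fed response.text, after checking in the page source what keys the real object uses.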

How to click on a link based on text in a table using selenium

Hi All,
I have the following table with links that I need to select. In this specific example I need to select the DIY Payroll link, but sometimes it can change its position within the table. The current xpath is:
.//*[@id='catalog-category-div-1']/table/tbody/tr/td[1]/ul/li[4]/a
So I do a:
By.xpath(".//*[@id='catalog-category-div-1']/table/tbody/tr/td[1]/ul/li[4]/a").click()
But the problem is that it can change position: it can be in td[2] or td[3], and at any li position.
Can I have Selenium go through the table and click on it based on text? Will By.linkText() work here?

You can use either of the following; both handle the dynamic changes.
You can use the linkText() method as follows:
driver.findElement(By.linkText("DIY Payroll")).click();
If you want to use XPath, then you can use the following code:
driver.findElement(By.xpath(".//a[contains(text(),'DIY Payroll')]")).click();
If you need any more clarification you are welcome :)

I would suggest that you try By.linkText() or By.partialLinkText(). It will locate an A tag that contains the desired text.
driver.findElement(By.linkText("DIY Payroll")).click();
A couple issues you might run into:
The link text may exist more than once on the page. In this case, find an element that's easy to find (e.g. by id) that is a parent of only the link you want and then search from that element.
driver.findElement(By.id("someId")).findElement(By.linkText("DIY Payroll")).click();
The A tag may contain extra spaces or other characters, be capitalized differently, etc. In these cases, you'll just have to try .partialLinkText() or use trial and error.
In some cases I've seen a link that isn't an A tag or contains additional tags inside. In this case, you're going to have to find another method to locate the text like XPath.

You could use a CSS selector for this case. Can you try:
By.cssSelector("a.browse-catalog-categories-link")

You can use XPath to do this. //a will select all 'a' tags. The part inside of the square brackets will select everything with text "DIY Payroll". Combined together you get the desired solution.
//a[contains(text(),'DIY Payroll')]
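
The idea common to these answers is to match the link by its text rather than by its position in the table. A small Python sketch of that idea with the standard library follows, not Selenium itself. The TABLE markup, the other link names, and the href values are hypothetical. ElementTree has no contains(), so this version uses an exact-text predicate, which needs Python 3.7 or later.

```python
import xml.etree.ElementTree as ET

# Hypothetical table; only the "DIY Payroll" text comes from the question.
TABLE = """<div id="catalog-category-div-1"><table><tbody><tr>
<td><ul><li><a href="/a">Accounting</a></li></ul></td>
<td><ul><li><a href="/b">Invoicing</a></li><li><a href="/c">DIY Payroll</a></li></ul></td>
</tr></tbody></table></div>"""

root = ET.fromstring(TABLE)


def find_link(text):
    """Return the first <a> whose complete text equals `text`, or None."""
    # No td[...] or li[...] index appears, so the link may sit in any cell.
    return root.find(f".//a[.='{text}']")


link = find_link("DIY Payroll")
print(link.get("href"))
```

Because the path carries no positional index, moving the link to another td or li does not change the result.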

seam-gen and flex

I have integrated Seam and Flex with FlamingoDS.
I generated an HTML file from the MXML file and stored it in the WebContent folder, which works fine.
Then I want to create a link named 'Plan' in menu.xhtml.
My aim is to show that HTML file when I click this link, but I don't know how to do that.
so, I have created some test.xhtml in that top element is the
for the template attribute this element I have given the template.html
and I used
Then for the 'Plan' link I set view="/test.xhtml".
That works: when I click the link I get the test.seam page, which includes our HTML file. But the HTML file is shown only in a fixed area with scroll bars, even though there is plenty of space to fit it.
Please help me.

First of all, it is very difficult to read your post. Please format it to be more readable.
Secondly, we can only guess what's wrong when we cannot see any code. But my hunch is that you are using s:decorate, which includes some formatting you are not aware of. This comes standard with seam-gen. Try removing the s:decorate element or pointing it to another style you wish to use.
