Get the text associated with a href element in a given page in scrapy - web-scraping

Currently the yield in my Scrapy spider looks as follows:
yield {
    'hreflink': mylink,
    'Parentlink': response.url
}
This gives me a dict:
{
'hreflink':"https://www.southeasthealth.org/wp-content/uploads/Southeast-Health-Standard-Charges-2022.xlsx",
'Parentlink': "https://www.southeasthealth.org/financial-information-price-transparency/"
}
Now, I also want the 'text' that is associated with this particular hreflink, in that particular Parentlink. So my final output should look like
{
'hreflink':"https://www.southeasthealth.org/wp-content/uploads/Southeast-Health-Standard-Charges-2022.xlsx",
'Parentlink': "https://www.southeasthealth.org/financial-information-price-transparency/",
'Yourtext' : "Download Pricing Info"
}
What would be the simplest way to achieve that? I want to use an XPath expression to get the text of the link in the parent page whose href attribute (@href) equals my hreflink.
Here is what I have tried so far:
Yourtext = response.xpath('//a[@href='+json.dumps(each)+']//text()').get()
but it's not printing anything. I printed my response and it is the right page: 'https://www.southeasthealth.org/financial-information-price-transparency/'

If I understand you correctly, you want to get the text belonging to the link, i.e. "Download Pricing Info".
I suggest you try using:
response.xpath("//span[@class='fusion-button-text']//text()").get()

I found the answer to my question.
'//a[@href='+json.dumps(each)+']//text()'
This is the correct expression. However, the comparison against the href value 'each' is case-sensitive, so the link must match the attribute exactly for this XPath to work.
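
To illustrate the exact-match point, here is a minimal sketch. It uses Python's standard-library ElementTree as a stand-in for Scrapy's selectors so that it runs without Scrapy installed; the markup, the example.org URLs, and the helper name link_text are made up for illustration. In a spider, the same @href predicate would go into the expression passed to response.xpath(...).

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the parent page's markup.
PAGE = """
<div>
  <a href="https://example.org/files/Charges-2022.xlsx"><span>Download Pricing Info</span></a>
  <a href="https://example.org/about">About</a>
</div>
"""

def link_text(root, href):
    """Return the text inside the <a> whose href equals `href` exactly, or None."""
    # json.dumps wraps the URL in double quotes, which gives a valid XPath
    # string literal as long as the URL contains no double quotes or backslashes.
    anchor = root.find(".//a[@href=" + json.dumps(href) + "]")
    if anchor is None:
        return None
    # Collect text from the anchor and all of its descendants.
    return "".join(anchor.itertext()).strip()

root = ET.fromstring(PAGE)
exact = link_text(root, "https://example.org/files/Charges-2022.xlsx")
wrong_case = link_text(root, "https://example.org/files/charges-2022.xlsx")
print(exact)
print(wrong_case)
```

The second lookup differs from the attribute only in the case of one letter and finds nothing, which is the behaviour described above.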

Related

Not able to find the Xpath

I am trying to scrape the IMDb top 250 movies using Scrapy and am stuck finding the XPath for the duration of each movie (I need to extract "2", "h", "44" and "m"). Website link: https://www.imdb.com/title/tt15097216/?ref_=adv_li_tt
Here's the image of the HTML:
I've tried this XPath but it's not accurate:
//li[@class='ipc-inline-list__item']/following::li/text()
If it's always in the same position, what about:
//li[@class='ipc-inline-list__item']/following::li[2]
or more simply:
//li[@class='ipc-inline-list__item'][3]
or, since the others have hyperlinks as children, filter to just the li that has text() child nodes:
//li[@class='ipc-inline-list__item'][text()]
However, the original XPath may be fine; the problem may be in how you are consuming the result. If you are using .get(), which returns only the first match, try .getall() instead.
You can use this XPath to locate the element:
//span[contains(@class,'Runtime')]
To extract the text you can use this:
//span[contains(@class,'Runtime')]/text()
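
Once .getall() returns the separate text nodes, they still need to be combined. The following is a minimal sketch of that step, assuming the nodes come back as pieces like "2", "h", "44", "m" as described in the question; the helper name runtime_minutes and the sample list are made up for illustration.

```python
import re

def runtime_minutes(text_nodes):
    """Join text nodes such as ["2", "h", " ", "44", "m"] and return total minutes."""
    joined = "".join(text_nodes)
    # Optional hours part followed by an optional minutes part.
    match = re.fullmatch(r"\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*m)?\s*", joined)
    if match is None or not any(match.groups()):
        return None  # nothing that looks like a duration
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes

# Assumed shape of what .getall() might return for the runtime element.
nodes = ["2", "h", " ", "44", "m"]
total = runtime_minutes(nodes)
print(total)
```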

How to translate TextMeshPro-StyleTags to the actual RichText in Unity?

I have the following string in TextMeshPro: "<style=Title>This is a Title (...)".
I would like to translate the StyleTag to the defined Opening Tags.
For this example it would translate the string above to the following: "<size=125%><align=center>This is a Title (...)".
How can I do this?
You can get the OpeningTags to a StyleTag by calling the following function: TMP_StyleSheet.GetStyle("[StyleName]").styleOpeningDefinition (with TMP_StyleSheet being a reference to the used TMP-StyleSheet).
So a possible solution is to extract the style name from your string (e.g. "(...text) <style=Example> (text...)" would yield "Example") and pass it to the function above. Regular expressions can help with the extraction. Then replace the whole tag with whatever the function returns (e.g. "<size=125%>"). Note: it returns null if the style does not exist. Then do the same with the closing tag.
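
The extract-and-replace logic can be sketched as follows. This is written in Python rather than C#, purely to show the regular-expression step; the STYLES dictionary is a hypothetical stand-in for the TMP style sheet lookup, and the closing tags in it are assumed values, not taken from a real style sheet.

```python
import re

# Hypothetical stand-in for the style sheet: name -> (opening tags, closing tags).
STYLES = {"Title": ("<size=125%><align=center>", "</align></size>")}

# Matches <style=Name>, <style="Name"> or </style>.
TAG = re.compile(r'<style="?([^">]+)"?>|</style>')

def expand_style_tags(text):
    """Replace known style tags with their opening/closing definitions."""
    open_styles = []  # stack, so each </style> closes the most recent <style=...>

    def replace(match):
        name = match.group(1)
        if name is not None:  # opening tag
            style = STYLES.get(name)
            open_styles.append(style)
            # Unknown styles are left untouched.
            return match.group(0) if style is None else style[0]
        if not open_styles:  # stray closing tag
            return match.group(0)
        style = open_styles.pop()
        return match.group(0) if style is None else style[1]

    return TAG.sub(replace, text)

expanded = expand_style_tags("<style=Title>This is a Title (...)")
print(expanded)
```

The stack is a design choice for nested styles: a closing tag carries no name, so the code has to remember which style was opened last.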

SimpleDom Search Via plaintext Text

I am using the "PHP Simple HTML DOM Parser" library and want to find elements based on their text value (plaintext).
For example, I need to find the span element with the value "Red".
<span class="color">Red</span>
I was expecting the code below to work, but it seems that it just replaces the value instead of searching for it.
$brand = $html->find('span',0)->plaintext='Red';
I read the manual and also looked through the library code itself but could not find a solution. Kindly advise whether I am missing something or whether this is simply not possible with Simple HTML DOM Parser.
P.S. Kindly note that I am aware of other ways, such as regex.
Using $html->find('span', 0) finds the nth span, where n is zero in this case (i.e. the first span).
Using $html->find('span',0)->plaintext='Red'; sets the plaintext of that span to Red; it does not search for it.
If you want to find the elements whose text is Red, you could omit the 0 to get all the spans and loop over them.
For example, using innertext instead of plaintext:
$spansWithRedText = [];
foreach ($html->find('span') as $element) {
    if ($element->innertext === "Red") {
        $spansWithRedText[] = $element;
    }
}

IMacros: Extract text from site

I need to extract an activation link to the clipboard; the link changes with every registration.
HTML Code:
Try something like this code:
SEARCH SOURCE=REGEXP:"(http://mctop.me/approve/\w+)" EXTRACT=$1
SET !CLIPBOARD {{!EXTRACT}}
Error -1200: parses "(http://mctop.me/approve/\w+)" - Unrecognized esc-sequence \w.

QTP - getting value of element

I am just starting with QTP and cannot figure out how to get the value of an element. For example, I want to compare the number of results found by Google. I tried selecting the element with Object Spy and using Val(Element) to assign the value to a variable, but it doesn't work. Could anyone help with this? BTW, I am not sure whether selecting the text (element) to compare with Object Spy is correct.
Thanks!
You should use GetROProperty in order to get the text and then parse it for the value.
Looking at a Google results page I see that the result is in a paragraph with id=resultStats in the 3rd bold tag.
<p id="resultStats"> Results <b>1</b> - <b>10</b> of about
<b>2,920,000</b>
for <b>qtp</b>. (<b>0.22</b> seconds)</p>
So the following script gets the number (as a string with commas).
Browser("micclass:=Browser").Page("micclass:=Page").WebElement("html id:=resultStats").WebElement("html tag:=b","index:=2").GetROProperty("innertext")
