XPath expression from a website to extract price

I am trying to find the XPath for my Content Egg plugin on WordPress, but on these two websites I can't find the right expression. I would be grateful if someone could help.
On the first one, for example: https://www.pcgarage.ro/sisteme-pc-garage/pc-garage/gaming-ares-iv/
I tried:
//span[contains(text(),'26.999,99 RON')]
.//*[@itemprop='price']
and
//span[@class='price_num']
On the second one: https://www.emag.ro/aparat-de-aer-conditionat-heinner-crystal-9000-btu-clasa-a-functie-incalzire-filtru-cu-densitate-ridicata-follow-me-functie-turbo-r32-alb-hac-cr09whn/pd/DGW0DJBBM/?ref=hp_prod-widget_flash_deals_1_1&provider=site
//span[@class='product-new-price'] (this works, but it also returns the <sup> containing 99, which I don't need; a way to exclude the <sup> would be great)

The XPath to get the price on the first page is
//td[@class='pip_text']//b
For the second site you can use this:
//div[@class='product-highlights-wrapper']//p[@class='product-new-price']
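
To try the second expression outside the plugin, and to drop the <sup> part mentioned in the question, a minimal Python sketch could look like the following. The HTML fragment is a made-up stand-in for the real page (the actual markup may differ), and the code uses only the standard library's xml.etree.ElementTree, which supports a limited XPath subset.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment imitating the structure targeted by the XPath;
# the live page is not guaranteed to look like this.
SAMPLE = (
    '<div class="product-highlights-wrapper">'
    '<p class="product-new-price">1.299<sup>99</sup> <span>Lei</span></p>'
    '</div>'
)

root = ET.fromstring(SAMPLE)

# Same idea as //div[...]//p[@class='product-new-price'], searched from the div.
price_node = root.find(".//p[@class='product-new-price']")

# .text holds only the text before the first child element,
# so the content of <sup> is left out.
whole_part = (price_node.text or "").strip()
print(whole_part)
```

In full XPath 1.0, appending /text() to the expression selects only the element's direct text nodes, which likewise leaves out the text inside <sup>. Whether Content Egg accepts that form is an assumption that would need checking.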

Related

Not able to find the Xpath

I am trying to scrape the IMDb top 250 movies using Scrapy and am stuck finding the XPath for the duration of each movie (I need to extract "2", "h", "44" and "m"). Website link: https://www.imdb.com/title/tt15097216/?ref_=adv_li_tt
I've tried this Xpath but it's not accurate:
//li[@class ='ipc-inline-list__item']/following::li/text()

If it's always in the same position, what about:
//li[@class ='ipc-inline-list__item']/following::li[2]
or more simply:
//li[@class ='ipc-inline-list__item'][3]
or, since the others have hyperlinks as their child, filter to just the li that has text() child nodes:
//li[@class ='ipc-inline-list__item'][text()]
However, the original XPath may be fine; the problem may be in how you are consuming the result. If you are using .get(), try .getall() instead.
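
The "keep only the li that has its own text" idea can be sketched in plain Python as follows. This is not Scrapy code: it uses the standard library's xml.etree.ElementTree on a made-up fragment (the real IMDb markup may differ), and the filtering that [text()] does in XPath is done in a list comprehension because ElementTree's XPath subset has no text() test.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical fragment: two items wrap a link, the third holds bare text.
SAMPLE = (
    '<ul>'
    '<li class="ipc-inline-list__item"><a href="#">2022</a></li>'
    '<li class="ipc-inline-list__item"><a href="#">PG-13</a></li>'
    '<li class="ipc-inline-list__item">2h 44m</li>'
    '</ul>'
)

root = ET.fromstring(SAMPLE)
items = root.findall(".//li[@class='ipc-inline-list__item']")

# Keep only items with direct text of their own (what [text()] expresses).
durations = [li.text.strip() for li in items if li.text and li.text.strip()]

# Split "2h 44m" into the pieces the question asks for.
parts = re.findall(r"\d+|[a-z]+", durations[0])
print(durations, parts)
```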

You can use this XPath to locate the element:
//span[contains(@class,'Runtime')]
To extract the text you can use this:
//span[contains(@class,'Runtime')]/text()

Extract XML child attribute based on another child attribute

I have the following XML structure. I am trying to extract the StartDate and EndDate child elements of the relationship period, that is, only where rr:PeriodType is RELATIONSHIP_PERIOD.
However, the nodes for "relationship" and "accounting" have exactly the same name and I am not sure how to proceed.
<rr:RelationshipPeriods>
<rr:RelationshipPeriod>
<rr:StartDate>2018-01-01T00:00:00.000Z</rr:StartDate>
<rr:EndDate>2018-12-31T00:00:00.000Z</rr:EndDate>
<rr:PeriodType>ACCOUNTING_PERIOD</rr:PeriodType>
</rr:RelationshipPeriod>
<rr:RelationshipPeriod>
<rr:StartDate>2019-01-02T00:00:00.000Z</rr:StartDate>
<rr:PeriodType>RELATIONSHIP_PERIOD</rr:PeriodType>
</rr:RelationshipPeriod>
</rr:RelationshipPeriods>
I tried using this code
ldply(xpathApply(xmlData, '//rr:RelationshipPeriod/rr:StartDate', getChildrenStrings), rbind)
But it doesn't work well, because it's hard to tell whether it is extracting the accounting or the relationship period.
Any help would be greatly appreciated!

For rr:StartDate use XPath:
//rr:RelationshipPeriod[rr:PeriodType='RELATIONSHIP_PERIOD']/rr:StartDate
But probably better to first find the correct rr:RelationshipPeriod using XPath:
//rr:RelationshipPeriod[rr:PeriodType='RELATIONSHIP_PERIOD']
See this answer on how to reuse the result of an XPath.
Then select rr:StartDate and rr:EndDate relative to that node, without // in front of them, so that the search stays inside the matched period.
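
The question uses R, but the two-step approach (find the right rr:RelationshipPeriod first, then read its children relatively) can be sketched in Python with the standard library. The namespace URI below is a placeholder, since the real one is not shown in the question; the XML data is the sample from the question.

```python
import xml.etree.ElementTree as ET

# Assumption: the real document declares rr with some other URI.
NS = {"rr": "http://example.com/rr"}

XML = (
    '<rr:RelationshipPeriods xmlns:rr="http://example.com/rr">'
    '<rr:RelationshipPeriod>'
    '<rr:StartDate>2018-01-01T00:00:00.000Z</rr:StartDate>'
    '<rr:EndDate>2018-12-31T00:00:00.000Z</rr:EndDate>'
    '<rr:PeriodType>ACCOUNTING_PERIOD</rr:PeriodType>'
    '</rr:RelationshipPeriod>'
    '<rr:RelationshipPeriod>'
    '<rr:StartDate>2019-01-02T00:00:00.000Z</rr:StartDate>'
    '<rr:PeriodType>RELATIONSHIP_PERIOD</rr:PeriodType>'
    '</rr:RelationshipPeriod>'
    '</rr:RelationshipPeriods>'
)

root = ET.fromstring(XML)

# Step 1: pick the period whose PeriodType child has the wanted value.
period = root.find(
    "rr:RelationshipPeriod[rr:PeriodType='RELATIONSHIP_PERIOD']", NS
)

# Step 2: read children relative to that node (no leading //).
start = period.findtext("rr:StartDate", default=None, namespaces=NS)
end = period.findtext("rr:EndDate", default=None, namespaces=NS)
print(start, end)
```

In the sample data the relationship period has no rr:EndDate, so the relative lookup returns None instead of accidentally picking up the accounting period's end date.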

Scrapy: grabbing sibling elements of a regex match

I'm using Scrapy to scrape college essay topics from college websites. I know how to match a keyword using a regular expression, but the information that I really want is the other elements in the same div as the match. The Response.css(...).re(...) function in Scrapy returns a string. Is there any way to navigate to the parent div of the regex match?
Example: https://admissions.utexas.edu/apply/freshman-admission#fndtn-freshman-admission-essay-topics. On the above page, I can match the essay topics h1 using: response.css("*::text").re("Essay Topics"). However, I can't figure out a way to grab the 2 actual essay topics in the same div under Topic A and Topic N.

That's not the right way to do it. You should use something like below
response.xpath("//div[@id='freshman-admission-essay-topics']//h5//text()").extract()
In case you just want css then you can use
In [7]: response.css("#freshman-admission-essay-topics h5::text, #freshman-admission-essay-topics h5 span::text").extract()
Out[7]: ['Topic A \xa0\xa0', 'Topic N']

Regular Expression - replace urls in style tags

I have searched Google and Stack Overflow but didn't find a good answer. I also tried it myself, but I am no regex guru.
My goal is to replace all relative urls in a html style tag with the absolute version.
e.g.
style="url(/test.png)" with style="url(http://mysite.com/test.png)"
style="url("/test.png")" with style="url("http://mysite.com/test.png")"
style="url('/test.png')" with style="url('http://mysite.com/test.png')"
style="url(../test.png)" with style="url(http://mysite.com/test.png)"
style="url("../test.png")" with style="url('http://mysite.com/test.png')"
style="url('../test.png')" with style="url('http://mysite.com/test.png')"
and so on.
Here is what I tried with my poor regex skills:
url\((?<Url>[^\)]*)\)
gives me the url in the "url" function.
thanks in advance!

Well, you can try the regex:
style="url\((['"])?(?:\.\.)?(?<url>[^'"]+)\1?\)"
And replace with:
style="url($1http://mysite.com$2$1)"
regex101 demo
(['"])? will capture the quote if one is present, and \1? matches the same quote again at the end.
(?:\.\.)? skips a leading .. if there is one.
(?<url>[^'"]+) will capture the URL itself.
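
For comparison, here is a rough Python adaptation of the same idea. It is not the regex from the answer: Python writes named groups as (?P<name>...), the quote group is made always-participating so it can be reused safely, and the function name absolutize and the base parameter are made up for this sketch. As in the question's examples, a leading .. is simply dropped rather than resolved.

```python
import re

# Optional quote, optional "..", then a path that starts with "/".
URL_RE = re.compile(r"""url\((['"]?)(?:\.\.)?(?P<path>/[^'")]+)\1\)""")


def absolutize(style: str, base: str = "http://mysite.com") -> str:
    """Prefix relative url(...) paths with base, keeping the original quotes."""

    def repl(match: re.Match) -> str:
        quote = match.group(1)
        return f"url({quote}{base}{match.group('path')}{quote})"

    return URL_RE.sub(repl, style)


print(absolutize("url(/test.png)"))
print(absolutize("url('../test.png')"))
```

URLs that are already absolute do not start with / or ../ inside url(...), so this pattern leaves them unchanged.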

Selenium IDE - Select checkbox on table row

I'm using Selenium IDE and I have a table with many rows and columns. Each row has its own checkbox to select that row.
I was using this command to search for a specific row:
css=tr:contains('US Tester4') input[type="checkbox"]
The problem is that this column also contains similar values such as "US Tester41" and "US Tester42", so this command selects the wrong row.
I thought replacing "contains" with something like "equals" or "exactly" would work, but it didn't (I don't know the syntax).
Any ideas?
Here is the screenshot:
http://oi41.tinypic.com/2ake9hw.jpg

I'm not familiar with Selenium IDE, but with the selenium webdriver I would use an xpath. So I guess something like this will work for you:
xpath=//tr[td[3][text()='US Tester4']]//input[@type='checkbox']
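
The difference between an exact text match and contains can be checked offline with a small Python sketch. The table below is invented for illustration (the real page's column layout is only visible in the screenshot), and the code uses the standard library's xml.etree.ElementTree rather than Selenium.

```python
import xml.etree.ElementTree as ET

# Hypothetical table: checkbox in the first cell, name in the second.
TABLE = (
    '<table>'
    '<tr><td><input type="checkbox" name="uid" value="1"/></td><td>US Tester4</td></tr>'
    '<tr><td><input type="checkbox" name="uid" value="2"/></td><td>US Tester41</td></tr>'
    '<tr><td><input type="checkbox" name="uid" value="3"/></td><td>US Tester42</td></tr>'
    '</table>'
)

root = ET.fromstring(TABLE)

# [td='US Tester4'] keeps only rows having a cell whose full text equals
# the value, so "US Tester41" and "US Tester42" do not match.
rows = root.findall(".//tr[td='US Tester4']")

# Collect the checkbox inside each matching row.
values = [row.find(".//input[@type='checkbox']").get("value") for row in rows]
print(values)
```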

This worked for me:
//tr//td[.='US Tester4']//input[type="checkbox"]
against:
<table>
<tr><td>US Tester</td>input(type="checkbox")</tr>
<tr><td>US Tester4</td>input(type="checkbox")</tr>
<tr><td>US Tester41</td>input(type="checkbox")</tr>
<tr><td>US Tester412</td>input(type="checkbox")</tr>
</table>
It matched the second element.

This worked for me
xpath=(//input[@name='uid'])[2]
The 2 is the position of the element among the matches.

I'm not very familiar with the IDE but I have used the Webdriver before. If possible I would use this xpath.
xpath = "//td[.='US Tester4']/preceding-sibling::td//input[@type='checkbox']"
This should locate only one element on the page. The preceding-sibling and following-sibling axes are very helpful when the exact element you want has no good identifier of its own. In your case the <td> that contains the checkbox has no good identifier, whereas the <td> after it has text that you can match with the '=' operator. You just need preceding-sibling to get from that cell to the <td> with the checkbox.
