I am trying to scrape the IMDb Top 250 movies using Scrapy and am stuck finding the XPath for the duration of each movie (I need to extract "2", "h", "44" and "m"). Website link: https://www.imdb.com/title/tt15097216/?ref_=adv_li_tt
Here's the image of the HTML:
I've tried this Xpath but it's not accurate:
//li[@class ='ipc-inline-list__item']/following::li/text()
If it's always in the same position, what about:
//li[@class ='ipc-inline-list__item']/following::li[2]
or more simply:
//li[@class ='ipc-inline-list__item'][3]
or, since the others have hyperlinks as children, filter to just the li that has text() child nodes:
//li[@class ='ipc-inline-list__item'][text()]
However, the original XPath may be fine; the problem may be in how you are consuming the result. If you are using .get(), which returns only the first match, try .getall() instead.
You can use this XPath to locate the element:
//span[contains(@class,'Runtime')]
To extract the text you can use this:
//span[contains(@class,'Runtime')]/text()
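Once the text nodes have been extracted (for example with .getall()), they still need to be split into the number and unit pieces the question asks for. The following is a minimal sketch of that step using only the standard library; the function name split_runtime and the sample input lists are assumptions for illustration, since the exact shape of the text nodes on the page is not shown here.

```python
import re


def split_runtime(text_nodes):
    """Split runtime text nodes into number and unit tokens.

    `text_nodes` is assumed to be the list of strings returned by
    something like response.xpath(...).getall().
    """
    # Join the nodes and trim surrounding whitespace.
    joined = "".join(text_nodes).strip()
    # Collect runs of digits and runs of letters, skipping whitespace.
    return re.findall(r"\d+|[A-Za-z]+", joined)


# Hypothetical inputs: one with separate nodes, one with a single node.
print(split_runtime(["2", "h", " ", "44", "m"]))
print(split_runtime(["2h 44m"]))
```

Both hypothetical inputs yield the same four tokens, so the split does not depend on how the page happens to break the text into nodes.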
Currently my 'yield' in my scrapy spider looks as follows :
yield {
'hreflink':mylink,
'Parentlink':response.url
}
This returns me a dict
{
'hreflink':"https://www.southeasthealth.org/wp-content/uploads/Southeast-Health-Standard-Charges-2022.xlsx",
'Parentlink': "https://www.southeasthealth.org/financial-information-price-transparency/"
}
Now, I also want the 'text' that is associated with this particular hreflink, in that particular Parentlink. So my final output should look like
{
'hreflink':"https://www.southeasthealth.org/wp-content/uploads/Southeast-Health-Standard-Charges-2022.xlsx",
'Parentlink': "https://www.southeasthealth.org/financial-information-price-transparency/",
'Yourtext' : "Download Pricing Info"
}
What would be the simplest way to achieve that? I want to use XPath expressions to get the "text" in a Parentlink where the link's @href equals my hreflink.
So far, here is what I tried:
Yourtext = response.xpath('//a[@href='+json.dumps(each)+']//text()').get()
but it's not printing anything. I tried printing my response and it returns the right page - 'https://www.southeasthealth.org/financial-information-price-transparency/'
If I understand you correctly you want to get the text belonging to the link Download Pricing Info.
I suggest you try using:
response.xpath("//span[@class='fusion-button-text']//text()").get()
I found the answer to my question.
'//a[@href='+json.dumps(each)+']//text()'
This is the correct expression; however, the comparison is case sensitive, so the href value in 'each' must match the attribute exactly for this XPath to work.
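The exact-match behaviour can be reproduced outside Scrapy. The sketch below uses the standard library's xml.etree.ElementTree, which supports the [@attr='value'] predicate, against a hypothetical, simplified fragment; the example.org URL, the PAGE markup, and the helper name link_text are all assumptions made for illustration and are not taken from the real page.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the real page.
PAGE = """
<div>
  <a href="https://example.org/files/Charges-2022.xlsx"><span>Download Pricing Info</span></a>
</div>
"""


def link_text(root, href):
    # json.dumps wraps the value in double quotes, which doubles as the
    # quoting for the XPath string literal (this assumes the URL itself
    # contains no double quote or backslash).
    node = root.find(".//a[@href=" + json.dumps(href) + "]")
    if node is None:
        return None
    # Gather text from the link and all of its descendants.
    return "".join(node.itertext()).strip()


root = ET.fromstring(PAGE)
exact = link_text(root, "https://example.org/files/Charges-2022.xlsx")
wrong_case = link_text(root, "https://example.org/files/charges-2022.xlsx")
print(exact, wrong_case)
```

Only the href that matches character for character finds the link; the variant with a lowercase "c" finds nothing, which is consistent with the empty result described above.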
I am trying to find the XPath expressions for my Content Egg plugin on WordPress, but on these two websites I can't find the right XPath. I would be grateful if someone could help me.
On the first one, for example: https://www.pcgarage.ro/sisteme-pc-garage/pc-garage/gaming-ares-iv/
I tried:
//span[contains(text(),'26.999,99 RON')]
.//*[@*itemprop='price']
And
//span[@class='price_num']
On the second one: https://www.emag.ro/aparat-de-aer-conditionat-heinner-crystal-9000-btu-clasa-a-functie-incalzire-filtru-cu-densitate-ridicata-follow-me-functie-turbo-r32-alb-hac-cr09whn/pd/DGW0DJBBM/?ref=hp_prod-widget_flash_deals_1_1&provider=site
//span[@class='product-new-price'] (this works, but it also returns the <sup> containing 99, which I don't need; if there is a way to exclude the <sup>, that would be great)
The XPath to get the price on the first page is
//td[@class='pip_text']//b
For the second site you can use this:
//div[@class='product-highlights-wrapper']//p[@class='product-new-price']
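On the question of leaving out the <sup>: in XPath, appending /text() to an element path selects only that element's direct text nodes, so text inside a child such as <sup> is not included. Whether the plugin accepts such an expression is not confirmed here. The sketch below illustrates the same idea with Python's standard library on a hypothetical fragment; the markup and the price values are assumptions, and the real page may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; the real page markup may differ.
FRAGMENT = '<div><p class="product-new-price">1.299<sup>99</sup> Lei</p></div>'

root = ET.fromstring(FRAGMENT)
price = root.find(".//p[@class='product-new-price']")

# .text holds only the text before the first child element,
# so the content of <sup> is not part of it.
whole_part = price.text.strip()

# Text that follows </sup> is stored as the child's tail.
currency = price.find("sup").tail.strip()

print(whole_part, currency)
```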
I'm using Scrapy to scrape college essay topics from college websites. I know how to match a keyword using a regular expression, but the information that I really want is the other elements in the same div as the match. The Response.css(...).re(...) function in Scrapy returns a string. Is there any way to navigate to the parent div of the regex match?
Example: https://admissions.utexas.edu/apply/freshman-admission#fndtn-freshman-admission-essay-topics. On the above page, I can match the essay topics h1 using: response.css("*::text").re("Essay Topics"). However, I can't figure out a way to grab the 2 actual essay topics in the same div under Topic A and Topic N.
That's not the right way to do it. You should use something like below
response.xpath("//div[@id='freshman-admission-essay-topics']//h5//text()").extract()
In case you just want css then you can use
In [7]: response.css("#freshman-admission-essay-topics h5::text, #freshman-admission-essay-topics h5 span::text").extract()
Out[7]: ['Topic A \xa0\xa0', 'Topic N']
I am trying to find an XPath for the items under “Main Lists” on this site. So far I have:
//div[starts-with(@class, ('sm-CouponLink_Label'))]
However this finds 32 matches…
//div[starts-with(@class, ('sm-CouponLink_Label'))][contains(text(),'*')or[contains(Style(),'*')]
Unfortunately in this case I am wanting to use XPaths and not CSS.
It is for this site, my code is here and here's an image of XPATH I am after.
I have also tried:
CSS: div:nth-child(1) > .sm-MarketContainer_NumColumns3 > div > div
Xpath equiv...: //div[1]//div[starts-with(@class, ('sm-MarketContainer_NumColumns3'))]//div//div
Though it does not appear to work.
UPDATED
WORKING CSS: div.sm-Market:has(div >div:contains('Main Lists')) * > .sm-CouponLink_Label
Xpath: //div[Contains(@class, ('sm-Market'))]//preceding::('Main Lists')//div[Contains(@class, ('sm-CouponLink_Label'))]
Not working as of yet.
I am also unsure whether Selenium has an equivalent for :has.
Alternatively...
Something like:
//div[contains(text(),"Main Lists")]//following::div[contains(@class,"sm-Market")]//div[contains(@class,"sm-CouponLink_Label")]//preceding::div[contains(@class,"sm-Market_HeaderOpen ")]
(wrong area)
You can get all required elements with below piece of code:
league_names = [league for league in driver.find_elements_by_xpath('//div[normalize-space(@class)="sm-Market" and .//div="Main Lists"]//div[normalize-space(@class)="sm-CouponLink_Label"]') if league.text]
This should return a list containing only the non-empty nodes.
If I understand this correctly, you want to narrow down further the result of your first XPath to return only div that has inner text or has attribute style. In this case you can use the following XPath :
//div[starts-with(@class, ('sm-CouponLink_Label'))][@style or text()]
UPDATE
As you clarified further, you want to get div with class 'sm-CouponLink_Label' that resides in the 'Main Lists' section. For this purpose, you should try to incorporate the 'Main Lists' in the XPath somehow. This is one possible way (formatted for readability) :
//div[
div/div/text()='Main Lists'
]//div[
starts-with(@class, 'sm-CouponLink_Label')
and
normalize-space()
]
Notice how normalize-space() is used to filter out empty div from the result. This should return 5 elements as expected, here is the result when I tested in Chrome :
I'm using Selenium IDE and I have a table with many rows and columns. Each row has its own checkbox to select that row.
I was using this command to search for a specific row:
css=tr:contains('US Tester4') input[type="checkbox"]
But the problem is that in this column I have other similar values like "US Tester41", "US Tester42" ... and when I use this command, it selects the wrong row.
I thought replacing "contains" with something like "equals" or "exactly" would work, but it didn't (I don't know the syntax).
Any ideas?
Follow the screenshot:
http://oi41.tinypic.com/2ake9hw.jpg
I'm not familiar with Selenium IDE, but with the selenium webdriver I would use an xpath. So I guess something like this will work for you:
xpath=//tr[td[3][text()='US Tester4']]//input[@type='checkbox']
This worked for me:
//tr//td[.='US Tester4']//input[@type="checkbox"]
against:
<table>
<tr><td>US Tester</td>input(type="checkbox")</tr>
<tr><td>US Tester4</td>input(type="checkbox")</tr>
<tr><td>US Tester41</td>input(type="checkbox")</tr>
<tr><td>US Tester412</td>input(type="checkbox")</tr>
</table>
It matched the second element.
This worked for me
xpath=(//input[@name='uid'])[2]
The 2 is the position of the element among the matches.
I'm not very familiar with the IDE but I have used the Webdriver before. If possible I would use this xpath.
xpath = "//td[.= 'US Tester4']/preceding-sibling::td//input[@type = 'checkbox']"
This should locate only one element on screen. The preceding-sibling and following-sibling axes are very helpful when you don't have a good enough identifier on the exact element you want to find. In your case the <td> which contains the checkbox doesn't have a good identifier, whereas the <td> after it has text which you can match exactly using the '=' operator. You just need to use 'preceding-sibling' to find the <td> with the checkbox.