Scrapy: grabbing sibling elements of a regex match - css

I'm using Scrapy to scrape college essay topics from college websites. I know how to match a keyword using a regular expression, but the information that I really want is the other elements in the same div as the match. The Response.css(...).re(...) function in Scrapy returns a string. Is there any way to navigate to the parent div of the regex match?
Example: https://admissions.utexas.edu/apply/freshman-admission#fndtn-freshman-admission-essay-topics. On the above page, I can match the essay topics h1 using: response.css("*::text").re("Essay Topics"). However, I can't figure out a way to grab the 2 actual essay topics in the same div under Topic A and Topic N.

That's not the right way to do it. Select the elements directly instead of going through a regex match, with something like the following:
response.xpath("//div[@id='freshman-admission-essay-topics']//h5//text()").extract()
If you would rather use CSS selectors, you can use:
In [7]: response.css("#freshman-admission-essay-topics h5::text, #freshman-admission-essay-topics h5 span::text").extract()
Out[7]: ['Topic A \xa0\xa0', 'Topic N']
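If the goal really is to start from a regex match and walk to its neighbours, one option is to find the matching element first and then look at its parent's children. The sketch below shows that idea using only the Python standard library. The markup, the essay text, and the helper name siblings_of_match are made up for illustration and are not taken from the real page.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the real page.
SAMPLE = """
<html><body>
  <div id="other"><h1>Deadlines</h1><p>Not relevant</p></div>
  <div id="freshman-admission-essay-topics">
    <h1>Essay Topics</h1>
    <h5>Topic A</h5><p>Tell us your story.</p>
    <h5>Topic N</h5><p>Nursing applicants only.</p>
  </div>
</body></html>
"""


def siblings_of_match(root, pattern):
    """Return (tag, text) for every element sharing a parent with a regex match."""
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    results = []
    for element in root.iter():
        if element in parent_of and element.text and re.search(pattern, element.text):
            for sibling in parent_of[element]:
                if sibling is not element:
                    results.append((sibling.tag, (sibling.text or "").strip()))
    return results


root = ET.fromstring(SAMPLE)
topics = siblings_of_match(root, r"Essay Topics")
print(topics)
```

The same "match, then step up to the parent" approach should carry over to Scrapy selectors, where `..` in an XPath expression selects the parent of the current node.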

Related

Not able to find the Xpath

I am trying to scrape the IMDB top 250 movies using Scrapy and am stuck finding the XPath for the duration of each movie (I need to extract "2", "h", "44" and "m"). Website link: https://www.imdb.com/title/tt15097216/?ref_=adv_li_tt
Here's the image of the HTML:
I've tried this Xpath but it's not accurate:
//li[@class ='ipc-inline-list__item']/following::li/text()
If it's always in the same position, what about:
//li[@class ='ipc-inline-list__item']/following::li[2]
or more simply:
//li[@class ='ipc-inline-list__item'][3]
or since the others have hyperlinks as the child, filter to just the li that has text() child nodes:
//li[@class ='ipc-inline-list__item'][text()]
However, the original XPath may be fine; the problem may be in how you are consuming the result. If you are using .get(), which returns only the first match, try .getall() instead.
You can use this XPath to locate the element:
//span[contains(@class,'Runtime')]
To extract the text you can use this:
//span[contains(@class,'Runtime')]/text()
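Once the text nodes are extracted, the pieces ("2", "h", "44", "m") still need to be combined. A minimal sketch, assuming the fragments arrive as a list of strings in document order; the function name parse_runtime is hypothetical:

```python
import re


def parse_runtime(fragments):
    """Join text fragments such as ["2", "h", "44", "m"] and return total minutes."""
    text = "".join(fragments)
    # Both the hours part and the minutes part are optional.
    match = re.fullmatch(r"\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*m)?\s*", text)
    if not match or not any(match.groups()):
        return None  # Nothing that looks like a runtime.
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return hours * 60 + minutes


print(parse_runtime(["2", "h", "44", "m"]))  # 164
```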

Unable to find xpath list trying to use wild card contains text or style

I am trying to find an XPath for the elements under “Main Lists” on this site. So far I have:
//div[starts-with(@class, ('sm-CouponLink_Label'))]
However this finds 32 matches…
`//div[starts-with(@class, ('sm-CouponLink_Label'))]`[contains(text(),'*')or[contains(Style(),'*')]
Unfortunately in this case I am wanting to use XPaths and not CSS.
It is for this site, my code is here and here's an image of XPATH I am after.
I have also tried:
CSS: div:nth-child(1) > .sm-MarketContainer_NumColumns3 > div > div
Xpath equiv...: //div[1]//div[starts-with(@class, ('sm-MarketContainer_NumColumns3'))]//div//div
Though it does not appear to work.
UPDATED
WORKING CSS: div.sm-Market:has(div >div:contains('Main Lists')) * > .sm-CouponLink_Label
Xpath: //div[Contains(@class, ('sm-Market'))]//preceding::('Main Lists')//div[Contains(@class, ('sm-CouponLink_Label'))]
Not working as of yet.
I am also unsure whether Selenium has an equivalent for :has.
Alternatively...
Something like:
//div[contains(text(),"Main Lists")]//following::div[contains(@class,"sm-Market")]//div[contains(@class,"sm-CouponLink_Label")]//preceding::div[contains(@class,"sm-Market_HeaderOpen ")]
(wrong area)
You can get all the required elements with the following piece of code:
league_names = [league for league in driver.find_elements_by_xpath('//div[normalize-space(@class)="sm-Market" and .//div="Main Lists"]//div[normalize-space(@class)="sm-CouponLink_Label"]') if league.text]
This should return a list containing only the non-empty nodes.
If I understand this correctly, you want to narrow down the result of your first XPath so that it returns only the div elements that have inner text or a style attribute. In that case you can use the following XPath:
//div[starts-with(@class, ('sm-CouponLink_Label'))][@style or text()]
UPDATE
As you clarified further, you want to get div with class 'sm-CouponLink_Label' that resides in the 'Main Lists' section. For this purpose, you should try to incorporate the 'Main Lists' in the XPath somehow. This is one possible way (formatted for readability) :
//div[
div/div/text()='Main Lists'
]//div[
starts-with(@class, 'sm-CouponLink_Label')
and
normalize-space()
]
Notice how normalize-space() is used to filter out empty div elements from the result. This returned 5 elements, as expected, when I tested it in Chrome.
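The filtering logic in these answers (restrict to the section titled 'Main Lists', match the class prefix, drop empty nodes) can also be written as plain Python instead of XPath functions. The sketch below does that with the standard library's ElementTree, which does not support starts-with() or normalize-space(). The markup, the label texts, and the function name main_list_labels are invented for illustration; the real page structure may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; only the class names come from the question.
SAMPLE = """
<div class="root">
  <div class="sm-Market">
    <div class="sm-Market_HeaderOpen "><div>Main Lists</div></div>
    <div class="sm-CouponLink_Label ">UK Elite</div>
    <div class="sm-CouponLink_Label "> </div>
    <div class="sm-CouponLink_Label ">Europe Elite</div>
  </div>
  <div class="sm-Market">
    <div class="sm-Market_HeaderOpen "><div>Other Lists</div></div>
    <div class="sm-CouponLink_Label ">Ignored</div>
  </div>
</div>
"""


def main_list_labels(root, section="Main Lists", prefix="sm-CouponLink_Label"):
    """Collect non-empty label texts from the market block titled `section`."""
    labels = []
    for market in root.iter("div"):
        if market.get("class", "").strip() != "sm-Market":
            continue
        # Plays the role of the  .//div="Main Lists"  predicate.
        titles = [(d.text or "").strip() for d in market.iter("div")]
        if section not in titles:
            continue
        for div in market.iter("div"):
            text = (div.text or "").strip()
            # The non-empty check plays the role of normalize-space().
            if div.get("class", "").startswith(prefix) and text:
                labels.append(text)
    return labels


labels = main_list_labels(ET.fromstring(SAMPLE))
print(labels)
```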

R Using Regex to find a word after a pattern

I'm grabbing the following page and storing it in R with the following code:
gQuery <- getURL("https://www.google.com/#q=mcdimalds")
Within this, there's the following snippet of code
Showing results for</span> <a class="spell" href="/search?rlz=1C1CHZL_enUS743US743&q=mcdonalds&spell=1&sa=X&ved=0ahUKEwj9koqPx_TTAhUKLSYKHRWfDlYQvwUIIygA"><b><i>mcdonalds</i></b></a>
Everything other than "showing results for" and the italics tags encasing the desired name for extraction are subject to change from query to query.
What I want to do is extract the mcdonalds out of this string using regex that occurs here: <b><i>mcdonalds</i> aka the second instance of mcdonalds. However, I'm not too sure how to write the regex to do so.
Any help accomplishing this would be greatly appreciated. As always, please let me know if any additional information should be added to clarify the question.
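One possible approach, sketched here in Python rather than R: anchor the pattern on the fixed text "Showing results for", skip lazily over the changing link markup, and capture whatever sits inside the <b><i> tags. The snippet is the one quoted above; the variable names are illustrative. R's regex functions accept a similar pattern when Perl-compatible matching is enabled, though that is not shown here.

```python
import re

# The snippet quoted in the question.
snippet = (
    'Showing results for</span> <a class="spell" '
    'href="/search?rlz=1C1CHZL_enUS743US743&q=mcdonalds&spell=1&sa=X'
    '&ved=0ahUKEwj9koqPx_TTAhUKLSYKHRWfDlYQvwUIIygA">'
    "<b><i>mcdonalds</i></b></a>"
)

# Fixed prefix, lazy skip over the variable part, then capture the tag contents.
pattern = re.compile(
    r"Showing results for.*?<b><i>([^<]+)</i>", re.IGNORECASE | re.DOTALL
)

match = pattern.search(snippet)
corrected = match.group(1) if match else None
print(corrected)  # mcdonalds
```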

Create a regex for a string with 1 item that changes

I am trying to build a regex for a line of inline CSS in which only one item changes.
This is the line of code in question
<div="Box1" style="background-color:Transparent;border-color:Transparent;border-style:None;height:436px;"></div>
I need to be able to pick this out, but the height is different on every page.
Everything else is exactly the same; only the height changes.
If you have that line, you can use the following regex to get the height:
'<div="Box1" style="background-color:Transparent;border-color:Transparent;border-style:None;height:436px;"></div>'
.match(/height:([\sa-z0-9]+);/)
This will return:
["height:436px;", "436px"]
This example is in JS; I don't know which language you want to use the regex in. In CSS itself you can't.
[0-9]+ matches an arbitrary number.
However, for the HTML part you should not use a regex at all but an HTML parser, and then only use a regex on the style attribute.
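A minimal sketch of that two-step approach, using Python's standard-library HTMLParser to collect the style attributes and a regex only on the attribute value. It assumes the element is really written as <div id="Box1" ...> (the line quoted in the question is not valid HTML as shown); the class and function names are hypothetical.

```python
import re
from html.parser import HTMLParser


class StyleCollector(HTMLParser):
    """Collect the style attribute of every div start tag."""

    def __init__(self):
        super().__init__()
        self.styles = []

    def handle_starttag(self, tag, attrs):
        style = dict(attrs).get("style")
        if tag == "div" and style:
            self.styles.append(style)


def div_heights(html):
    """Return the pixel heights declared in inline div styles."""
    collector = StyleCollector()
    collector.feed(html)
    heights = []
    for style in collector.styles:
        # The lookbehind keeps properties such as line-height from matching.
        match = re.search(r"(?<![\w-])height:\s*(\d+)px", style)
        if match:
            heights.append(int(match.group(1)))
    return heights


# Assumed well-formed version of the line from the question.
page = (
    '<div id="Box1" style="background-color:Transparent;'
    'border-color:Transparent;border-style:None;height:436px;"></div>'
)
print(div_heights(page))  # [436]
```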

Using selenium CSS selector for multiple things

This is in Perl if it matters. I have several lists of links that collapse and expand. I know how many there are from using
get_xpath_count('//li/a')
The problem is I need to get a list of the names of these actual links. I've tried using xpath, but haven't found much luck, and was hoping CSS selectors would be able to help. I've tried using
get_text('css=li a:nth-child('.$i.')')
which prints out a [-] icon next to a link, then the very top link in the list, and then an out of range error. I'm not familiar with CSS selectors at all, so any help would be great. If I left out important info, please let me know.
Try this (in pseudo-code, because I avoid Perl like the plague):
list linkNames;
count = selenium.get_xpath_count('//li/a');
for (i = 1; i <= count; i++) {
    linkNames.append(selenium.get_text('xpath=(//li/a)[' + i + ']'));
}
Note:
XPath expressions count from 1 to n, not 0 to n-1 like most C-derived languages.
The XPath form for selecting the i'th match of a pattern is (pattern)[i], not pattern[i].
Selenium doesn't assume the (pattern)[i] locator is an XPath, so you need to say so by starting it with xpath=.
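The three notes above can be condensed into a small helper that builds the locator strings. This is only a sketch in Python: it produces the strings and does not call Selenium; the function name indexed_locators is hypothetical.

```python
def indexed_locators(pattern, count):
    """Build one 'xpath=(pattern)[i]' locator per match, counting from 1."""
    # Parentheses make [i] index the whole result set, not each parent's children.
    return [f"xpath=({pattern})[{i}]" for i in range(1, count + 1)]


locators = indexed_locators("//li/a", 3)
print(locators)
# ['xpath=(//li/a)[1]', 'xpath=(//li/a)[2]', 'xpath=(//li/a)[3]']
```

Each string would then be passed to get_text in turn, with count taken from get_xpath_count.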
