Want to scrape highlights - css

I want to scrape the Hot stat line from Flashscore, for example from this page: https://www.flashscore.com/match/UZOxr6ME/#match-summary
Specifically, I want to scrape:
<div class="previewLine"><b>Hot stat:</b> PL games refereed by Andy Madley this season
have seen a 75% home win ratio.</div>
There are 6 lines on this page with the div class 'previewLine', so I am wondering what the unique CSS path for the Hot stat line is, or how it can be recognized.
I hope this question is clear.
Thanks in advance.

You can do it with xpath:
//div[@class="previewLine"][b[text()="Hot stat:"]]/text()
This finds every div that has the class "previewLine" and a 'b' child with the text "Hot stat:", then returns that div's own text nodes.
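As a rough illustration of the same idea, here is a minimal sketch using only Python's standard library. The SAMPLE markup is a hypothetical, simplified stand-in for the real page, and hot_stat is an illustrative helper name. ElementTree supports only a subset of XPath, so the predicate is written as [b='Hot stat:'] rather than with text().

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the match page markup.
SAMPLE = """
<div id="preview">
  <div class="previewLine"><b>Form:</b> Three wins in the last five.</div>
  <div class="previewLine"><b>Hot stat:</b> PL games refereed by Andy Madley this season have seen a 75% home win ratio.</div>
  <div class="previewLine"><b>Hot streak:</b> Unbeaten in four.</div>
</div>
"""


def hot_stat(markup):
    """Return the sentence that follows the 'Hot stat:' label, or None."""
    root = ET.fromstring(markup)
    # Keep only previewLine divs that have a <b> child whose full text
    # is exactly the label we are looking for.
    div = root.find(".//div[@class='previewLine'][b='Hot stat:']")
    if div is None:
        return None
    # The sentence is the text that follows the <b> element (its "tail").
    return (div.find("b").tail or "").strip()


print(hot_stat(SAMPLE))
```

Matching on the label text rather than on position means the selector should keep working if the order of the preview lines changes.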

Related

Not able to find the Xpath

I am trying to scrape the IMDB top 250 movies using Scrapy and am stuck finding the XPath for the duration of each movie (I need to extract "2", "h", "44" and "m"). Website link: https://www.imdb.com/title/tt15097216/?ref_=adv_li_tt
Here's the image of the HTML:
I've tried this XPath but it's not accurate:
//li[@class ='ipc-inline-list__item']/following::li/text()
If it's always in the same position, what about:
//li[@class ='ipc-inline-list__item']/following::li[2]
or more simply:
//li[@class ='ipc-inline-list__item'][3]
or, since the others have hyperlinks as children, filter to just the li that has text() child nodes:
//li[@class ='ipc-inline-list__item'][text()]
However, the original XPath may be fine; the problem may be how you are consuming the result. If you are using .get(), try .getall() instead.
You can use this XPath to locate the element:
//span[contains(@class,'Runtime')]
To extract the text you can use this:
//span[contains(@class,'Runtime')]/text()
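If .getall() does return the duration as separate fragments, they still need to be joined before use. The following is a minimal sketch under that assumption. The fragment list ["2", "h", " ", "44", "m"] is assumed from the question, not taken from a real response, and runtime_minutes is a hypothetical helper name.

```python
import re


def runtime_minutes(fragments):
    """Join text fragments such as ["2", "h", " ", "44", "m"] and
    convert the result to a total number of minutes (None if unparsable)."""
    text = "".join(fragments)
    match = re.fullmatch(r"\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*m)?\s*", text)
    # Reject input where neither an hour nor a minute part was found.
    if match is None or not any(match.groups()):
        return None
    hours, minutes = (int(group) if group else 0 for group in match.groups())
    return hours * 60 + minutes


# Assumed shape of the list returned by .getall() for the runtime element.
print(runtime_minutes(["2", "h", " ", "44", "m"]))
```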

XPath expression from a website to extract price

I am trying to find the XPath for my Content Egg plugin on WordPress, but on these 2 websites I can't find the right one. If someone here can help me, I would be grateful.
On the first one, for example: https://www.pcgarage.ro/sisteme-pc-garage/pc-garage/gaming-ares-iv/
I tried:
//span[contains(text(),'26.999,99 RON')]
.//*[@itemprop='price']
And
//span[@class='price_num']
On the second one: https://www.emag.ro/aparat-de-aer-conditionat-heinner-crystal-9000-btu-clasa-a-functie-incalzire-filtru-cu-densitate-ridicata-follow-me-functie-turbo-r32-alb-hac-cr09whn/pd/DGW0DJBBM/?ref=hp_prod-widget_flash_deals_1_1&provider=site
//span[@class='product-new-price'] (this works, but it also returns the <sup> containing 99, which I don't need; a way to exclude the <sup> would be great)
The XPath to get the price on the first page is:
//td[@class='pip_text']//b
For the second site you can use this:
//div[@class='product-highlights-wrapper']//p[@class='product-new-price']
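On excluding the <sup>: in XPath, appending /text() to an element selects only its direct text nodes, so the text inside a child element such as <sup> is left out. Below is a minimal standard-library sketch of that idea. The SAMPLE markup and the price value are hypothetical, and whole_price is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the decimals sit in a <sup> child of the price element.
SAMPLE = (
    '<div class="product-highlights-wrapper">'
    '<p class="product-new-price">1.299<sup>99</sup> <span>Lei</span></p>'
    "</div>"
)


def whole_price(markup):
    """Return the price text that precedes the <sup>, or None if absent."""
    root = ET.fromstring(markup)
    node = root.find(".//p[@class='product-new-price']")
    if node is None:
        return None
    # .text holds only the text before the first child element,
    # so the contents of <sup> are not included.
    return (node.text or "").strip()


print(whole_price(SAMPLE))
```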

Scraping an item that varies in position from a XML subpage

So far I have not succeeded in scraping the table "Die Verlustursache" from this page:
http://www.ubootarchiv.de/ubootwiki/index.php/U_205
using the libraries XML, rvest and readr.
I can address all tables on the site with individual code lines like
table <-readHTMLTable("http://www.ubootarchiv.de/ubootwiki/index.php/U_203") %>% .[1]
but the table indices vary across the other pages.
check for example: http://www.ubootarchiv.de/ubootwiki/index.php/U_27
I just realized that the table I need is always the fourth last one (meaning: the last table minus 4).
In another scraping project, I once used this line to scrape only the last item of a list page:
html_nodes(xpath="/html/body/div/div[3]/div[2]/div[1]/div[2]/div/table/tbody/tr[last()]")
However, I was not able to find a solution for something like "last - 4".
Please advise, and thanks in advance.
You could use this if it is always the fourth last table:
table <-readHTMLTable("http://www.ubootarchiv.de/ubootwiki/index.php/U_203")
table[length(table) - 4]
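Because "fourth last" and "last minus 4" can point at different tables, it is worth checking the offset on one page first. The same indexing idea can be sketched in Python; nth_from_last is a hypothetical helper and the list of table names is made up for illustration.

```python
def nth_from_last(items, offset):
    """Return the item `offset` places before the last one (offset 0 = last)."""
    if offset < 0 or offset >= len(items):
        raise IndexError("offset out of range")
    return items[len(items) - 1 - offset]


# Made-up stand-in for the list of tables parsed from one page.
tables = ["t1", "t2", "t3", "t4", "t5", "t6", "t7"]

# "Last minus 4" on a list of 7 tables is the 3rd table, which is the same
# element that the 1-based R expression table[length(table) - 4] selects.
print(nth_from_last(tables, 4))
```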

Scrapy: grabbing sibling elements of a regex match

I'm using Scrapy to scrape college essay topics from college websites. I know how to match a keyword using a regular expression, but the information that I really want is the other elements in the same div as the match. The Response.css(...).re(...) function in Scrapy returns a string. Is there any way to navigate to the parent div of the regex match?
Example: https://admissions.utexas.edu/apply/freshman-admission#fndtn-freshman-admission-essay-topics. On the above page, I can match the essay topics h1 using: response.css("*::text").re("Essay Topics"). However, I can't figure out a way to grab the 2 actual essay topics in the same div under Topic A and Topic N.
That's not the right way to do it. You should use something like this:
response.xpath("//div[@id='freshman-admission-essay-topics']//h5//text()").extract()
If you want CSS only, you can use:
In [7]: response.css("#freshman-admission-essay-topics h5::text, #freshman-admission-essay-topics h5 span::text").extract()
Out[7]: ['Topic A \xa0\xa0', 'Topic N']
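If the container's id is not known in advance, another option is to locate the matching element first and then walk up to its parent. The sketch below shows that idea with the standard library only. The SAMPLE markup is a hypothetical simplification of the page and siblings_of_match is an illustrative helper name. ElementTree keeps no parent pointers, so a child-to-parent map is built first.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup with two sections.
SAMPLE = """
<body>
  <div id="other"><h1>Deadlines</h1><h5>Fall</h5></div>
  <div id="freshman-admission-essay-topics">
    <h1>Essay Topics</h1>
    <h5>Topic A</h5>
    <h5>Topic N</h5>
  </div>
</body>
"""


def siblings_of_match(markup, pattern, sibling_tag):
    """Find elements whose text matches `pattern`, then return the text of
    every `sibling_tag` element that shares the same parent."""
    root = ET.fromstring(markup)
    # Build a child -> parent lookup, since elements do not know their parent.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    results = []
    for element in root.iter():
        if element.text and re.search(pattern, element.text) and element in parent_of:
            parent = parent_of[element]
            results.extend((s.text or "").strip() for s in parent.findall(sibling_tag))
    return results


print(siblings_of_match(SAMPLE, r"Essay Topics", "h5"))
```

In Scrapy itself, a comparable approach would be to select the heading with XPath and step to its parent with /.. instead of calling .re(), which returns plain strings.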

Reading in html with R rvest. How do I check if a CSS selector class contains anything?

This is my first attempt to deal with HTML and CSS selectors. I am using the R package rvest to scrape the Billboard Top 100 website. The data I am interested in includes this week's rank, the song, whether or not the song is new, and whether or not the song has any awards.
I am able to get the song name and rank with the following:
library(rvest)
URL <- "http://www.billboard.com/charts/hot-100/2017-09-30"
webpage <- read_html(URL)
current_week_rank <- html_nodes(webpage, '.chart-row__current-week')
current_week_rank <- as.numeric(html_text(current_week_rank))
My problem comes with the new and award indicators. The songs are listed in rows with each of the 100 contained in:
<article> class="chart-row char-row--1 js chart-row" ....
</article>
If a song is new, this will have class within it like:
<div class="chart-row__new-indicator">
If a song has an award, there will be this class within it:
<div class="chart-row__award-indicator">
Is there a way that I can look at all 100 instances of the class="chart-row char-row--1 js chart-row" ... and see if either of these exist within it? The output that I get from the current_week_rank is one column of 100 values. I am hoping that there is a way to get this so that I have one observation for each song.
Thank you for any help or advice.
This basically amounts to a tailored version of the Q&A I indicated above. I can't tell with 100% certainty whether the or is working as intended, since there's only one row in your example page with a <div class="chart-row__new-indicator">, and that row also happens to have a <div class="chart-row__award-indicator"> tag.
#xpath to focus on the 100 rows of interest
primary_xp = '//div[@class="chart-row__primary"]'
#xpath which subselects rows you're after
check_xp = paste('div[@class="chart-row__award-indicator" or' ,
'@class="chart-row__new-indicator"]')
webpage %>% html_nodes(xpath = primary_xp) %>%
#row__primary for which there are no such child nodes
# will come back NA, and hence so will html_attr('class')
html_node(xpath = check_xp) %>%
#! is a bit extraneous, as it only flips FALSE to TRUE
# for the rows you're after (necessity depends on
# particulars of your application)
html_attr('class') %>% is.na %>% `!`
FWIW, you may be able to shorten check_xp to the following:
check_xp = 'div[contains(@class, "indicator")]'
This certainly covers both "chart-row__award-indicator" and "chart-row__new-indicator", but it would also pick up other nodes with a class containing "indicator", if such an alternative tag exists (you'll have to determine this for yourself).
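To get one observation per song, with separate flags for the two indicators, the same check can be done row by row. The following is a minimal standard-library sketch in Python. The SAMPLE markup is a hypothetical simplification, and placing the rank span inside chart-row__primary is an assumption. row_flags is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified chart markup with three rows.
SAMPLE = """
<main>
  <article class="chart-row">
    <div class="chart-row__primary">
      <div class="chart-row__award-indicator"/>
      <span class="chart-row__current-week">1</span>
    </div>
  </article>
  <article class="chart-row">
    <div class="chart-row__primary">
      <div class="chart-row__new-indicator"/>
      <div class="chart-row__award-indicator"/>
      <span class="chart-row__current-week">2</span>
    </div>
  </article>
  <article class="chart-row">
    <div class="chart-row__primary">
      <span class="chart-row__current-week">3</span>
    </div>
  </article>
</main>
"""


def row_flags(markup):
    """Return one dict per chart row with its rank and indicator flags."""
    root = ET.fromstring(markup)
    rows = []
    for primary in root.iter("div"):
        if primary.get("class") != "chart-row__primary":
            continue
        # Collect the class of every direct child div of this row.
        classes = {child.get("class") for child in primary.findall("div")}
        rank = primary.findtext("span[@class='chart-row__current-week']")
        rows.append({
            "rank": int(rank) if rank else None,
            "new": "chart-row__new-indicator" in classes,
            "award": "chart-row__award-indicator" in classes,
        })
    return rows


for row in row_flags(SAMPLE):
    print(row)
```

Checking each row separately keeps the flags aligned with the rank, which a single page-wide selector for the indicator classes would not guarantee.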
