How to identify this node using XPath in Google Sheets IMPORTXML()

I am trying to use IMPORTXML() in my Google Sheet to extract data from a particular node at this URL. The node I'm targeting looks like the following.
<div data-elm-id="asset_2820933_address" class="styles__address-container--2l39p styles__u-mr-1--3qZyj">
<h4 data-elm-id="asset_2820933_address_content_1" class="styles__asset-font-big--vQU7K">
246 LOWER VIEW ROAD
</h4>
<label data-elm-id="asset_2820933_address_content_2" class="styles__asset-font-small--2JgrX">
Strasburg, VA 22657, Warren County
</label>
</div>
My goal is to extract the two strings representing the address.
246 LOWER VIEW ROAD
Strasburg, VA 22657, Warren County
However, when I do so, I get an error.
My XPaths are as follows:
//h4[starts-with(@class,"styles__asset-font-big")]
//label[starts-with(@class,"styles__asset-font-small")]
So my full Google Sheets formula looks like this:
IMPORTXML("https://www.auction.com/residential/VA/active_lt/auction_date_order,resi_sort_v2_st/y_nbs/bank-owned,newly-foreclosed,foreclosures_at/", '//h4[starts-with(@class,"styles__asset-font-big")] | //label[starts-with(@class,"styles__asset-font-small")]')
Is this even possible? Or is scraping that site being blocked somehow? If it's possible, what am I doing wrong?

Related

Want to scrape highlights

I want to scrape the "Hot stat" line from flashscore, for example the one on this page: https://www.flashscore.com/match/UZOxr6ME/#match-summary
So the element I want to scrape is:
<div class="previewLine"><b>Hot stat:</b> PL games refereed by Andy Madley this season
have seen a 75% home win ratio.</div>.
Since this page has 6 lines with the div class 'previewLine', I am wondering what the unique CSS path for the Hot stat line is, or how this line can be recognized.
I hope this question is clear.
Thanks in advance.
You can do it with XPath:
//div[@class="previewLine"][b[text()="Hot stat:"]]/text()
This selects every div that has the class "previewLine" and a 'b' child with the text "Hot stat:", then returns the text nodes that sit directly inside that div.
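As a rough offline check of that selection logic, here is a minimal Python sketch using only the standard library. It is a stand-in under stated assumptions, not the answer's method: ElementTree implements only a subset of XPath and has no text() test, so the predicate is written as [b='Hot stat:'] and the trailing text is read from the b element's tail. The wrapping root element and the "Venue:" line are hypothetical sample data, and the live page would need a real HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample; only the "Hot stat:" div comes from the question.
SAMPLE = """<root>
<div class="previewLine"><b>Venue:</b> Example Stadium</div>
<div class="previewLine"><b>Hot stat:</b> PL games refereed by Andy Madley this season
have seen a 75% home win ratio.</div>
</root>"""


def hot_stat(markup):
    """Return the text that follows <b>Hot stat:</b>, or None if no such line exists."""
    root = ET.fromstring(markup)
    # Same idea as the XPath above: a previewLine div that has a b child reading "Hot stat:".
    div = root.find(".//div[@class='previewLine'][b='Hot stat:']")
    if div is None:
        return None
    # Text placed after </b> inside the div is stored as the b element's tail.
    tail = div.find("b").tail or ""
    return " ".join(tail.split())  # collapse the line break and extra spaces


print(hot_stat(SAMPLE))
```

The class test alone would match every previewLine div; it is the predicate on the b child that narrows the match to one line.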

Scraping an item that varies in position from a XML subpage

So far I haven't succeeded in scraping the table "Die Verlustursache" from this page
http://www.ubootarchiv.de/ubootwiki/index.php/U_205
using the libraries XML, rvest, and readr.
I can address each table on the page with individual lines of code like
table <-readHTMLTable("http://www.ubootarchiv.de/ubootwiki/index.php/U_203") %>% .[1]
but the index of the table I need differs on the other pages.
check for example: http://www.ubootarchiv.de/ubootwiki/index.php/U_27
I just realized that the table I need is always the fourth last one (meaning: the last table minus 4).
In another scraping project, I once used this line to only scrape the last item of a list page:
html_nodes(xpath="/html/body/div/div[3]/div[2]/div[1]/div[2]/div/table/tbody/tr[last()]")
However, I was not able to find a solution for something like "last - 4".
Please advise, and thanks in advance.
You could use this if it is always the fourth last table:
library(XML)
table <- readHTMLTable("http://www.ubootarchiv.de/ubootwiki/index.php/U_203")
table[length(table) - 4]  # the list element four positions before the last one
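"Fourth last" and "last minus 4" are easy to mix up, so here is a small Python sketch of the index arithmetic (Python rather than R, only so it runs without extra packages). The helper name and the list of table labels are hypothetical. It shows that the element picked by length(table) - 4 sits four positions before the last one, which is the fifth from the end if the last table counts as the first.

```python
def item_before_last(items, offset):
    """Return the item `offset` positions before the last one (offset 0 is the last item)."""
    if not 0 <= offset < len(items):
        raise IndexError("offset is outside the list")
    return items[len(items) - 1 - offset]


# Hypothetical stand-ins for the tables parsed from one page.
tables = ["t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8"]

last_minus_four = item_before_last(tables, 4)  # same element as R's table[length(table) - 4]
fourth_from_end = item_before_last(tables, 3)  # counting the last table as the first from the end

print(last_minus_four, fourth_from_end)
```

Which of the two offsets is right depends on how the tables on the target pages are actually counted, so it is worth checking against one or two pages by hand.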

How to include diaeresis / trema in URL to query The Plant List website?

I query The Plant List website (http://www.theplantlist.org) from R, but this does not work if there is a diaeresis (ë) in the plant name.
Usually, searching for a plant species name, e.g. "Vaccinium acosta", correctly leads to the individual species page (in R and in Firefox) with the URL "http://www.theplantlist.org/tpl1.1/search?q=vaccinium+acosta".
How can I query the species page for "Vaccinium borneënse" using the species name in the URL (not the--unknown--record ID as in http://www.theplantlist.org/tpl1.1/record/tro-50262461)? Is this even possible for this website?
I tried, among others, the following, but they all lead to the overview page for the genus Vaccinium (containing many different species):
http://www.theplantlist.org/tpl1.1/search?q=vaccinium+borneënse
http://www.theplantlist.org/tpl1.1/search?q=Vaccinium+borneense
http://www.theplantlist.org/tpl1.1/search?q=Vaccinium+borne%C3%ABnse
http://www.theplantlist.org/tpl1.1/search?q=Vaccinium+borneënse
Ultimately, I want to read specific species pages for a list of species in R using read.csv:
read.csv("http://www.theplantlist.org/tpl1.1/search?q=vaccinium+acosta&csv=true")
You can use: http://www.theplantlist.org/tpl1.1/search?q=Vaccinium+borne?nse, as described in the text at http://www.theplantlist.org/tpl1.1/search. Just replace every vowel that carries a diaeresis with "?" in the same way.
IMHO this is an error in The Plant List. In botanical nomenclature a diaeresis is not a valid character of its own; it is only used as an aid to pronunciation, so the database should accept the second query (borneense) as well.
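A minimal sketch of that substitution, written in Python with only the standard library (the question itself uses R). The function name and the set of vowels to mask are assumptions made for illustration; whether the site treats "?" as a single-character wildcard is taken from the answer above and not verified here.

```python
import unicodedata
from urllib.parse import quote_plus

SEARCH_URL = "http://www.theplantlist.org/tpl1.1/search?q="

# Assumed set of diaeresis vowels; extend it if other characters occur in the names.
DIAERESIS_VOWELS = "äëïöüÿÄËÏÖÜŸ"
TO_WILDCARD = str.maketrans({ch: "?" for ch in DIAERESIS_VOWELS})


def wildcard_query_url(name):
    """Build a search URL in which each diaeresis vowel is replaced by '?'."""
    # NFC makes sure 'e' + combining diaeresis becomes the single character 'ë' first.
    composed = unicodedata.normalize("NFC", name)
    masked = composed.translate(TO_WILDCARD)
    # Keep '?' literal, as in the URL from the answer; spaces become '+'.
    return SEARCH_URL + quote_plus(masked, safe="?")


print(wildcard_query_url("Vaccinium borneënse"))
# For comparison, plain UTF-8 percent-encoding gives the %C3%AB form already tried in the question.
print(quote_plus("Vaccinium borneënse"))
```

Names without a diaeresis pass through unchanged apart from the "+" for spaces, so the same helper could be applied to a whole species list.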

Scraping for a rank number using Nokogiri in Ruby

I'm still doing some web scraping practice using this article:
https://www.pastemagazine.com/articles/2018/01/the-75-best-tv-shows-on-netflix-2018.html
I'd like to get just the rank number of each show and found what I think is the HTML element:
<div class="copy entry manual-ads">
<p>
<b class="big">
"75."
<i>
Chewing Gum
</i>
</b>
</p>
</div>
I'm using the following code to grab just the rank number (in this case, "75."):
doc.css("b.big").text
However, it returns the rank number along with the show title. How can I get just the rank number?
Use a regex to pick out the first run of digits from the text:
doc.css("b.big").text[/\d+/]
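One caveat, as far as I know: calling .text on the whole Nokogiri node set joins the text of every matched element, so the regex above returns only the first rank on the page. To get one rank per show, the same regex can be applied to each element separately. The sketch below shows that idea in Python with only the standard library; the second label is a hypothetical example, and the label strings stand in for the text of each b.big element.

```python
import re


def rank_of(label):
    """Return the first run of digits in a label such as '"75." Chewing Gum', or None."""
    match = re.search(r"\d+", label)
    return match.group() if match else None


# The first label mirrors the question's markup; the second one is hypothetical.
labels = ['"75." Chewing Gum', '"74." Example Show']
ranks = [rank_of(label) for label in labels]

print(ranks)
```

Because the rank comes first in each label, taking the first run of digits is enough even when a title itself contains a number.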

Is there a way to filter through html_nodes with html_attributes?

I have started using R and have a question: I'm trying to collect a list of prices from an HTML page. Here's an example of what I get when I ask R for the prices:
<h3 class="item_price" itemprop="price" content="16450"> 16 450 €
</h3>
I know that I have exactly 35 prices of the form <h3 class="item_price" itemprop="price" content="1234">.
Is it possible to filter the h3 elements by the attribute class="item_price" and then read the value of the content attribute?
Thanks for the help.
Yes, it's possible. The following example is taken from the rvest::html_attr documentation:
movie <- read_html("http://www.imdb.com/title/tt1490017/")
cast <- html_nodes(movie, "#titleCast span.itemprop")
html_text(cast)
html_name(cast)
html_attrs(cast)
html_attr(cast, "class")
In case you have a more sophisticated question, please provide a reproducible example.
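For the markup in the question, the rvest equivalent would presumably be to select "h3.item_price" with html_nodes and then call html_attr with "content". The same filter-then-read-attribute idea is sketched below in Python with only the standard library, so it can be run without installing anything. The class name PriceCollector is hypothetical, and the sample is the single h3 element from the question.

```python
from html.parser import HTMLParser


class PriceCollector(HTMLParser):
    """Collect the content attribute of every <h3> whose class list includes item_price."""

    def __init__(self):
        super().__init__()
        self.prices = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if tag == "h3" and "item_price" in classes and attr.get("content") is not None:
            self.prices.append(attr["content"])


# The one element shown in the question; a real page would contain 35 of these.
sample = '<h3 class="item_price" itemprop="price" content="16450"> 16 450 €\n</h3>'

collector = PriceCollector()
collector.feed(sample)
prices = collector.prices

print(prices)
```

Reading the content attribute avoids having to clean up the displayed text, which contains spaces and the currency sign.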
