How to retrieve texts in table? - web-scraping

I'm trying to get the text from the table below.
The texts I want to get are ["ONE", "TWO", "THREE", "FOUR", "FIVE"]
<div class="inner">
<table class="table_class">
<tbody>
<tr>
<td>
<dl>
<dt>AAA</dt>
<dd class="desc">ONE</dd>
</dl>
</td>
<td>
<dl>
<dt>BBB</dt>
<dd>TWO</dd>
</dl>
</td>
<td>
<dl>
<dt>CCC</dt>
<dd class="level4_2">THREE</dd>
</dl>
</td>
<td>
<dl>
<dt>DDDD</dt>
<dd>FOUR</dd>
</dl>
</td>
</tr>
<tr>
<td>
<dl>
<dt>EEEE</dt>
<dd class="level5_1">FIVE</dd>
</dl>
</td>
</tr>
</tbody>
</table>
</div>
I tried the code below, but the nodes come back empty.
I'm not sure how to retrieve the information I want correctly.
Thanks in advance.
let document = Html::parse_document(&body);
let stats = Selector::parse("div.inner > table > tbody > tr > td > dl > dd").unwrap();
let mut stats_element = document.select(&stats).collect::<Vec<_>>();
let first_stat = stats_element .text().collect::<Vec<_>>()[0]
.trim()
.to_string();

Try something like this:
let document = Html::parse_document(&body);
let stats = Selector::parse("div.inner > table > tbody > tr > td > dl > dd").unwrap();
for stats_element in document.select(&stats) {
for text in stats_element.text() {
println!("{:?}", text);
}
}
The following Iterator methods are likely to be useful here: map, flat_map, filter. impl FromIterator for Result<A, E> and Itertools::exactly_one may also be useful.

Related

How to get the content of a span returning empty nodeset?

This is the div of my website I want to work on to extract information:
<div class="_24er">
<table class="_4dmd _4eok uiGrid _51mz" cols="4" cellspacing="0" cellpadding="0"><tbody>
<tr class="_51mx">
<td class="_5px7 _51m-">
<span class="_5x8v _5a5j _5a5i">
<span class="_5a4-">FÉV</span>
<span class="_5a4z">11</span>
</span>
</td>
<td class="_4dmi _51m-"><div class="_4dmj">
<div class="_4dmk">
<a data-hovercard="/ajax/hovercard/event.php?id=769853670060959" href="/events/769853670060959/?acontext=%7B%22source%22%3A5%2C%22action_history%22%3A[%7B%22surface%22%3A%22page%22%2C%22mechanism%22%3A%22main_list%22%2C%22extra_data%22%3A%22%5C%22[]%5C%22%22%7D]%2C%22has_source%22%3Atrue%7D" id="js_9a" aria-describedby="u_2r_1" aria-owns="">
<span class=" _50f7"> HipHop Night With YOUSTAAZ (-60% Countdown Sur Toute La Carte)
</span>
</a>
</div>
<div class="_4dml fsm fwn fcg">
<span class="">11 févr. - 12 févr.</span>
<span aria-hidden="true"> · </span>
15 invités</div>
</div>
</td>
<td class="_5pxd _51m-">
<div class="_4dmn">
<div class="_30n-">
<a data-hovercard="/ajax/hovercard/hovercard.php?id=1276481845698447" href="https://xxxxxxx">JOBI - Gammarth</a>
</div>
<div class="_30n_">Tunis, Tunisie</div>
</div></td>
<td class="_4dmt _51mw _51m-">
<div class="_4dmu">
<div class="_2ib5">
<div class="_2ib4">
<div><button class="_4jy0 _4jy3 _517h _51sy _42ft" type="submit" value="1"><i alt="" class="_3-8_ img sp_7RV3BBvGAaI sx_1551de"></i>Ça m’intéresse</button></div>
</div>
</div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
I'm trying to extract the content of the span node below:
<span class=" _50f7"> HipHop Night With YOUSTAAZ (-60% Countdown Sur Toute La Carte)
</span>
I have already extracted the nodes for the date (month and day of the event), but when I try to extract the event name, which is in the span shown above, I get an empty node set:
cc<-remDr$findElement(using = "css", "[class = '_24er']")
cc<-remDr$getPageSource()
page_events<-read_html(cc[[1]][1])
events =html_nodes(page_events,'._24er')
mois_data=html_nodes(page_events,'._24er > table > tbody > tr > td > span > ._5a4-')
jours_data=html_nodes(page_events,'._24er > table > tbody > tr > td > span > ._5a4z')
links_events_data=html_nodes(page_events,'._24er > table > tbody > tr > td > div> div > a ')
# Getting the event names: I get {xml_nodeset (0)} as a result
nom_events_data=html_nodes(page_events,'._24er > table > tbody > tr > td > div> div > a > span > ._50f7')
# I tried to use the class to get the content, and I get this error:
# Error in xml2::xml_text(x, trim = trim) :
# object 'noms_events_data' not found
nom_events_data=html_nodes(page_events,"[class='._50f7']")
# I tried XPath as well, with the same error:
nom_events_data=html_nodes(page_events,xpath = '//*[@id="js_9a"]/span')
# The result is always character(0)
noms_events = html_text(noms_events_data)
After checking the documentation, the correct syntax is:
noms_events_data=html_nodes(page_events,"._50f7")
instead of:
noms_events_data=html_nodes(page_events,'[class="._50f7"]')
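The difference between the two selectors can be sketched in JavaScript with a simplified model of CSS matching. The helper names matchesClassSelector and matchesAttributeEquals are hypothetical and belong to no library; this is an illustration, not how rvest works internally. A class selector such as ._50f7 checks whether the whitespace-separated class list contains that token, while [class="..."] compares the whole attribute string, so a value written with a leading dot can never equal the attribute.

```javascript
// Simplified model of two CSS selector forms; not a real selector engine.

// `.name` matches when `name` is one of the whitespace-separated class tokens.
function matchesClassSelector(classAttr, name) {
  return classAttr.split(/\s+/).filter(Boolean).includes(name);
}

// `[class="value"]` matches only when the whole attribute string equals `value`.
function matchesAttributeEquals(classAttr, value) {
  return classAttr === value;
}

// The class attribute of the span in the question (note the leading space).
const spanClass = " _50f7";

console.log(matchesClassSelector(spanClass, "_50f7"));    // true
console.log(matchesAttributeEquals(spanClass, "._50f7")); // false
console.log(matchesAttributeEquals(spanClass, "_50f7"));  // false
```

The last line shows that even without the dot, an exact attribute comparison would miss this span because of the leading space in its class attribute.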

Remove all tr elements unless tr>td>input has a class 'DontRemoveMe'

Sorry for the very specific title, couldn't think of how to say it in more general terms.
Assume you have a table and each row contains a cell that has an input, but some input fields have a class of 'DontRemoveMe'. How do you target every row except the 'DontRemoveMe' rows?
Manipulation of DOM Elements requires JavaScript. One way to achieve this is with jQuery:
function remove() {
$('tr:not(.dontRemoveMe)').remove();
}
.dontRemoveMe td {
background-color: green;
}
<script src="https://code.jquery.com/jquery-3.3.1.min.js"></script>
<table style="width:100%">
<tr>
<th>Firstname</th>
<th>Lastname</th>
<th>Age</th>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
<tr class="dontRemoveMe">
<td>Jill</td>
<td>Smith</td>
<td>50</td>
</tr>
<tr>
<td>Jon</td>
<td>Jones</td>
<td>33</td>
</tr>
</table>
<button onclick="remove()">Remove rows</button>
CSS (not implemented at the time of writing):
Using the CSS Selectors Level 4 :has() pseudo-class, I believe it would be
tr:has(td > input:not(.DontRemoveMe))
However, :has() was not implemented in any browser when this was written (current major browsers now support it), so JavaScript is the portable option.
JavaScript:
// Select all rows that don't contain an input.DontRemoveMe
let rows = Array.from(document.querySelectorAll("tr")).filter(x => !(x.querySelector("input.DontRemoveMe")));
// Add a special class to these rows so we can target them with CSS
rows.forEach(x => x.classList.add("selected"));
td {
padding: 8px; /* Padding for all rows to make background visible */
}
.selected {
background: red;
}
<table>
<tr>
<td><input type="text" value="selected" />
</td>
</tr>
<tr>
<td><input class="DontRemoveMe" type="text" value="not selected" />
</td>
</tr>
<tr>
<td>
<input type="text" value="selected" />
</td>
</tr>
</table>
Here is an old-school JavaScript way.
Find all the tr elements, then look inside each one for descendants with the class DontRemoveMe; if none are found, add a .hide class to that row.
Honestly, though, I'd question why you want to do it like this; chances are there is a more sensible way.
var tr = document.getElementsByTagName('tr');
var i = 0;
var length = tr.length
for (; i < length; i++) {
var dontRemove = tr[i].getElementsByClassName('DontRemoveMe')
if (!dontRemove.length) {
tr[i].classList.add('hide')
}
}
td {
color: #ededed;
}
.red {
background-color: #ff3030;
}
.blue {
background-color: #6495ED;
}
.hide {
display: none;
}
<table>
<tr class="red">
<td>Normal</td>
<td>Normal</td>
<td class="DontRemoveMe">Don't Remove Me</td>
</tr>
<tr class="blue">
<td>Can't see me</td>
<td>Can't see me</td>
<td>Can't see me</td>
</tr>
<tr class="red">
<td class="DontRemoveMe">Don't Remove Me</td>
<td>Normal</td>
<td class="DontRemoveMe">Don't Remove Me</td>
</tr>
</table>

Showing Edit/Delete links when hover over the data row

I have a table with ng-repeat on the <tr>. The last <td> holds edit/delete links, and I only want them to show when the user hovers over the <tr>.
<tr ng-repeat="form in allData | filter:search | orderBy: orderByValue : orderIn" ng-click="set_index($index)">
<td><a ng-href={{form.link}}>{{form.ar_ref}}</a></td>
<td>{{form.title}}</td>
<td>{{form.category}}
<span ng-class="{'show_edit_link', $index==selected_index}">
<button ng-click="showUpdate()">Update</button>
<button ng-click="showDelete()">Delete</button>
</span>
</td>
</tr>
My JS Controller:
pp.controller('formsListCtrl', ['$scope', '$http', function($scope, $http){
$http.get('/php_user/formJSON.php').success(function(response){
$scope.allData=response;
//Show hover edit links
$scope.selected_index = 0;
$scope.set_index = function(i){ //i is the $index that we will pass in when hover in the forms_admin.php
$scope.selected_index = i;
}
CSS:
.edit_link_show{
display: inline;
}
.edit_link{
display: none;
}
There is a syntax error in your ng-class expression: the class name and its condition must be separated by a : rather than a comma. You also want a second ng-class entry for rows where selected_index is not equal to $index:
<tr ng-repeat="form in allData | filter:search | orderBy: orderByValue : orderIn" ng-click="set_index($index)">
<td><a ng-href={{form.link}}>{{form.ar_ref}}</a></td>
<td>{{form.title}}</td>
<td>{{form.category}}
<span ng-class="{'show_edit_link': $index==selected_index, 'edit_link': $index!=selected_index}">
<button ng-click="showUpdate()">Update</button>
<button ng-click="showDelete()">Delete</button>
</span>
</td>
</tr>
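To see why the corrected expression shows the buttons on one row only, here is a rough JavaScript model of how an object passed to ng-class becomes class names: each key is applied when its value is truthy. The function classesFor is a hypothetical illustration, not AngularJS's implementation.

```javascript
// Rough model of ng-class with an object map: keep the keys whose values are truthy.
function classesFor(index, selectedIndex) {
  const map = {
    show_edit_link: index === selectedIndex,
    edit_link: index !== selectedIndex,
  };
  return Object.keys(map).filter((name) => map[name]);
}

// With row 1 selected, only that row gets 'show_edit_link'.
const perRow = [0, 1, 2].map((i) => classesFor(i, 1));
console.log(perRow);
// [ [ 'edit_link' ], [ 'show_edit_link' ], [ 'edit_link' ] ]
```

Note that the CSS in the question defines .edit_link_show while the markup uses show_edit_link; the two names need to agree for the rule to take effect.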

How to convert Xpath to CSS

My xpath is: /html/body/div/table/tbody/tr[2]/td[4]
I need a CSS selector to use with jsoup.
I found a comparison between XPath and CSS (here), and its example "Second <E> element anywhere on page" says it can't be done: XPath xpath=(//E)[2], CSS N/A.
Maybe I'm not finding what I'm looking for. Any ideas?
Here's the html I'm trying to parse (I need to get values: 1 and 3):
<div class=tablecont>
<table width=100%>
<tr>
<td class=header align=center>Panel Color</td>
<td class=header align=center>Locked</td>
<td class=header align=center>Unqualified</td>
<td class=header align=center>Qualified</td>
<td class=header align=center>Finished</td>
<td class=header align=center>TOTAL</td>
</tr>
<tr>
<td align=center>
<div class=packagecode>ONE</div>
<div>
<div class=packagecolor style=background-color:#FC0;></div>
</div>
</td>
<td align=center>0</td>
<td align=center>0</td>
<td align=center>1</td>
<td align=center>12</td>
<td align=center class=rowhead>53</td>
</tr>
<tr>
<td align=center>
<div class=packagecode>two</div>
<div>
<div class=packagecolor style=background-color:#C3F;></div>
</div>
</td>
<td align=center>0</td>
<td align=center>0</td>
<td align=center>3</td>
<td align=center>42</td>
<td align=center class=rowhead>26</td>
</tr>
</table>
</div>
While an expression like (//E)[2] can't be represented with a CSS selector, an expression like E[2] can be emulated using the :nth-of-type() pseudo-class:
html > body > div > table > tbody > tr:nth-of-type(2) > td:nth-of-type(4)
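A minimal sketch of that translation in JavaScript, assuming the XPath is absolute and made only of element names with optional positional predicates such as tr[2]. The function name positionalXPathToCss is made up for this example. Anything else, such as //, attribute tests, or (//E)[2], is rejected rather than guessed at.

```javascript
// Convert a simple absolute XPath (element names plus optional [n]) to a CSS selector.
function positionalXPathToCss(xpath) {
  if (!xpath.startsWith("/") || xpath.startsWith("//")) {
    throw new Error("only absolute paths like /html/body/div are handled");
  }
  return xpath
    .slice(1)
    .split("/")
    .map((step) => {
      // A step is an element name, optionally followed by a 1-based position.
      const match = /^([A-Za-z][\w-]*)(?:\[(\d+)\])?$/.exec(step);
      if (!match) {
        throw new Error("unsupported step: " + step);
      }
      const [, name, position] = match;
      // XPath E[n] and CSS E:nth-of-type(n) both count same-named siblings from 1.
      return position ? `${name}:nth-of-type(${position})` : name;
    })
    .join(" > ");
}

console.log(positionalXPathToCss("/html/body/div/table/tbody/tr[2]/td[4]"));
// html > body > div > table > tbody > tr:nth-of-type(2) > td:nth-of-type(4)
```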
This works well for me:
//Author: Oleksandr Knyga
function xPathToCss(xpath) {
return xpath
.replace(/\[(\d+?)\]/g, function(s,m1){ return '['+(m1-1)+']'; })
.replace(/\/{2}/g, '')
.replace(/\/+/g, ' > ')
.replace(/@/g, '')
.replace(/\[(\d+)\]/g, ':eq($1)')
.replace(/^\s+/, '');
}
Are you looking for something like this:
http://jsfiddle.net/YZu8D/
.tablecont tr:nth-child(2) td:nth-child(4) {background-color: yellow; }
.tablecont tr:nth-child(3) td:nth-child(4) {background-color: yellow; }
One should learn how to write CSS selectors, but for a quick fix, try cssify.
For example, I put in your XPath and it spat out: html > body > div > table > tbody > tr:nth-of-type(2) > td:nth-of-type(4)
Try it out.

CSS select several types under one id

I want to select all th and td elements under a div.
Writing something like:
#div_id th,td {
...
}
is not good, since it selects every td on the page and not only those under #div_id.
I saw a solution here that writes
#div_id th,#div_id td {
...
}
Is there another way, so that I don't have to repeat the #div_id?
Thanks.
In pure CSS, there is no better answer than your own. But you could look into Sass or a similar preprocessor, whose nesting lets you write the #div_id prefix once.
Assuming that your code looks something like this:
<div id="div_id">
<table border="1">
<tr>
<th>
Header 1
</th>
<th>
Header 2
</th>
</tr>
<tr>
<td>
row 1,
<p>
cell 1
</p>
</td>
<td>
row 1,
<p>
cell 2
</p>
</td>
</tr>
<tr>
<td>
row 2,
<p>
cell 1
</p>
</td>
<td>
row 2,
<p>
cell 2
</p>
</td>
</tr>
</table>
</div>
You could apply a rule such as this:
#div_id tr > * {
border: 1px dotted #CCC;
}
This selects only the direct children of a tr, which are bound to be th or td elements.
Notice that if I had instead defined the rule as:
#div_id tr * {
border: 1px dotted #CCC;
}
The p tags would also have a dotted border.
However, it generally makes more sense to simply type the extra 8 characters, as it's easier to understand what you did later.
