CSS Selector Help To Extract First Line within TD


I am trying to extract each line of an address on a webpage using a CSS selector.
The HTML block within the page that contains the address is this:
<tr>
<td ALIGN="right" VALIGN="top" CLASS='directsub'>Mailing Address </td>
<td CLASS="coltext">Name<br/>Acme Foundation
<br/>PO Box 195<br/>
Olympia WA 98507
</td>
</tr>
When I use the following selector, it extracts the entire address, from Acme Foundation to the ZIP code.
html > body > div:eq(3) > div > div > table:eq(0) > tbody > tr:nth-child(5) > td.coltext >
How do I extract each part of the address separately using CSS selectors instead of all in a single block?

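You cannot do this with CSS alone, because the address lines are separated only by break tags and are not wrapped in elements of their own. CSS selectors match elements, not individual text nodes. I would recommend using JavaScript to split the address up into individual lines.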
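The same split can be done in whatever language the scraper runs in, not only JavaScript. Below is a minimal Python sketch using only the standard library's html.parser. The names CellLineSplitter and split_cell_lines are made up for this illustration, and it assumes the address cell is the td with class coltext, as in the question's snippet, with no tables nested inside it.

```python
from html.parser import HTMLParser

SAMPLE = """<tr>
<td ALIGN="right" VALIGN="top" CLASS='directsub'>Mailing Address </td>
<td CLASS="coltext">Name<br/>Acme Foundation
<br/>PO Box 195<br/>
Olympia WA 98507
</td>
</tr>"""


class CellLineSplitter(HTMLParser):
    """Collect the text inside td.coltext cells, starting a new line at each <br>."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.lines = []    # finished lines
        self._buffer = []  # text pieces of the line in progress

    def _flush(self):
        # Collapse runs of whitespace and drop lines that end up empty.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.lines.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so CLASS becomes class.
        if tag == "td" and dict(attrs).get("class") == "coltext":
            self.in_cell = True
        elif tag == "br" and self.in_cell:
            self._flush()

    def handle_endtag(self, tag):
        if tag == "td" and self.in_cell:
            self._flush()
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self._buffer.append(data)


def split_cell_lines(markup):
    """Return the text lines found in td.coltext cells of the given markup."""
    parser = CellLineSplitter()
    parser.feed(markup)
    parser.close()
    return parser.lines


print(split_cell_lines(SAMPLE))
# ['Name', 'Acme Foundation', 'PO Box 195', 'Olympia WA 98507']
```

Each address part is then available by index, for example split_cell_lines(SAMPLE)[1] for the organisation name.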

Related

Scrapy response.css - two tags without distinct identifiers

I am just a beginner in scrapy facing some problems:
<tr>
<td rowspan="2" style="vertical-align: top; width: 20%;">
1. c4<br>
<script type="text/javascript">
...
</script>
</td>
<td style="vertical-align: top;">The English Defense, here I give up the centre to Black as a target for attack.</td>
</tr>
If I want to get both the "c4" text and "The English Defense, here I give up the centre to Black as a target for attack.", I can use response.css('tr td::text').extract().
But what can I do if I want just the second <td> tag's text, since the <td> tags don't have an id, a class or anything else to distinguish them? In this link, I didn't find a way to select by style or rowspan...

You could use the nth-child selector. In your specific case that would be:
response.css("td:nth-child(2)::text").extract()
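To see what "second cell of each row" selects without a Scrapy project at hand, here is a rough sketch that uses the standard library's xml.etree.ElementTree as a stand-in. This is not Scrapy's API: the path .//tr/td[2] is ElementTree's limited XPath and picks the second td among a row's td children, which is the same cell as td:nth-child(2) here only because every child of the row is a td. The markup is a trimmed, XML-clean version of the question's row, and second_cell_texts is a name invented for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the row from the question (script block omitted,
# <br> written as <br/> so the snippet parses as XML).
ROW = """
<table>
  <tr>
    <td rowspan="2" style="vertical-align: top; width: 20%;">1. c4<br/></td>
    <td style="vertical-align: top;">The English Defense, here I give up the centre to Black as a target for attack.</td>
  </tr>
</table>
"""


def second_cell_texts(markup):
    """Return the leading text of the second <td> in every row."""
    root = ET.fromstring(markup)
    return [(td.text or "").strip() for td in root.findall(".//tr/td[2]")]


for text in second_cell_texts(ROW):
    print(text)
```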

CSS selector for hideous table-driven markup

JSFiddle: https://jsfiddle.net/dc9wdwem/
I inherited a legacy application that some clients are still using and expecting upgrades for. One recent upgrade "broke" the existing CSS and the easiest way to resolve it is to "un-break" just one little table.
The markup is nested table upon nested table. But for the sake of stripping down to the bare essentials, here's the barest version of where to find my table.
<div id="someId">
<table>
<tr>
<td>
<table>
<tr>
<td>
<table> <!-- not this table --> </table>
</td>
<td>
<table> <!-- THIS ONE!! --> </table>
</td>
</tr>
</table>
</td>
</tr>
</table>
</div>
There are other tables and rows and cells scattered throughout, but this structure gets you there.
Using the "direct descendant" symbol is tricky because the tables are nested within rows and cells as well as other tables, so table>table>table isn't going to do it. But if you go with a general descendant selector, you end up selecting too many things: table table table will get you a whole bunch of tables. Here's the closest I've got so far:
#someId>table table td:nth-child(2) table {
background-color: red;
}
I would normally be glad to add even more > selectors; however, I believe the browsers themselves are filling in tbody elements and so forth and I don't know that I can reasonably predict that the proper structure will always be intact. The above selector is selecting more tables than the one I'm trying to isolate.
None of the nested tables have IDs or classes, nor do I have the opportunity to add them. The upgrade process does not upgrade the customer's markup, which they may have partially customized themselves over time.
Anybody have any CSS selector magic that will work, assuming the above markup alongside browser-filled elements like tbody?

This will work for the specific HTML in your fiddle:
#someId>table table:nth-of-type(1) td:nth-of-type(2) table {
background-color: red;
}
Obviously, if the HTML changes in pretty much any way, this is probably not going to work.

You missed a table in your CSS.
Try:
div#someId > table table table td:nth-child(2) > table
https://jsfiddle.net/ba52Lwkg/
#someId > table table:first-of-type td + td > table
this should work.
https://jsfiddle.net/dc9wdwem/

Can and should a style sheet style every element in this HTML?

I'm developing a CMS plugin that generates HTML. I want to let users style the HTML any way they want. Here is the HTML:
<div id="ss:">
<table>
<colgroup>
<col span="1">
<!-- span can range from 3 to 6. -->
<col span="4">
<col span="4">
</colgroup>
<thead>
<tr>
<th rowspan="2">Variable text goes here</th>
<!-- span can range from 3 to 6. -->
<th colspan="4">Responses</th>
<th colspan="4">Percentage</th>
</tr>
<tr>
<!-- this row could contain from 6 to 12 headings -->
<th>Small</th>
<th>Med.</th>
<th>Large</th>
<th>Tot.</th>
<th>Small</th>
<th>Med.</th>
<th>Large</th>
<th>Tot.</th>
</tr>
</thead>
<tbody>
<!-- one or more rows with this structure -->
<tr>
<th>1. What size Coke do you prefer?</th>
<td>24</td>
<!-- largest number surrounded by strong tags -->
<td><strong>28</strong></td>
<td>0</td>
<td>52</td>
<td>46</td>
<!-- largest percent surrounded by strong tags -->
<td><strong>54</strong></td>
<td>0</td>
<td>100</td>
</tr>
</tbody>
</table>
</div>
I've placed the HTML inside a div with an ID so that users can select only the elements within it. So my questions are:
Can a stylesheet style every element here without using classes, even if that means using pseudo-classes like nth-child?
Would that be a good practice? If not, what is a good strategy?
I could actually generate a class for every element, but where's the line between that's good and that's crazy?

Can a stylesheet style every element here without using classes, even if that means using pseudo-classes like nth-child?
Absolutely. There are many ways to target elements. You would have to use nth-child once you need to tell the individual td, th and tr elements apart.
#ss\:,
table,
colgroup,
col,
[span="1"],
[span="4"],
thead,
tr,
th,
[rowspan="2"],
[colspan="4"],
tbody,
td,
td strong {
/* css */
}
Would that be a good practice? If not, what is a good strategy?
The argument against using nth-child is that the browser has to examine every child element and do the math to find the matching ones, whereas with classes or IDs it can find the matching elements more easily. So CSS that targets classes and IDs is cheaper for the browser to process. I read about how browsers process nth-child this week, but I couldn't find the article for reference. I'm a big fan of this CSS Tricks page for nth-child references.
I could actually generate a class for every element, but where's the line between that's good and that's crazy?
Everyone has their own definition of crazy. Giving rows a class would be helpful, then let the user get into the nth-child depth.

Why do you need IDs or classes? Just target the elements themselves
h1 {
...
}
h2 {
...
}
h3 {
..
}
etc...
You can target your stylesheets dynamically like in this SO post.
Also, for the record, DOM look-ups by ID are generally faster than look-ups by class, although for styling purposes the difference is rarely significant. A quick Google search on that will tell you more than I ever could.

Click on specific link in table row

I'm testing with Selenium Webdriver in Firefox and ideally also in IE8.
Here is my html structure:
<table id="table">
<tbody>
<tr>
<td>Text1</td>
<td><a id="assign" href="/assign/1">Assign</a></td>
</tr>
<tr>
<td>Text2</td>
<td><a id="assign" href="/assign/2">Assign</a></td>
</tr>
<tr>
<td>Text3</td>
<td><a id="assign" href="/assign/3">Assign</a></td>
</tr>
</tbody>
</table>
Basically what I need to do is this:
Click on the Assign link in the row that contains Text1.
So far I came up with the XPath //*[@id='table']//tr/td//following-sibling::td//following-sibling::td//following-sibling::td//a, which selects all the Assign links. Changing it to //*[@id='table']//tr/td[text='Text1']//following-sibling::td//following-sibling::td//following-sibling::td//a returns "No matching nodes" from Firebug.
However, I want a CSS selector for this. So I tried #table>tbody>tr:contains('Text1'), but Firebug returns "Invalid CSS Selector".
Any suggestions?

You should find the td that has a preceding td sibling with the text Text1, then get the a tag inside it:
//table[@id="table"]//td[preceding-sibling::td="Text1"]/a[@id="assign"]

Alternatively, you can find the tr that has a td with the text 'Text1', and then, inside that tr, find the td with the text 'Assign':
//table[@id='table']//tr[td[.='Text1']]/td[.='Assign']
As for CSS selectors, there is no pure CSS selector for text-based search. :contains is not part of the CSS standard (jQuery supports it as an extension), so it may not work in your case.
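The "row that has a cell with this text" idea can be tried outside the browser. The sketch below uses the standard library's xml.etree.ElementTree, whose limited XPath supports the [td='Text1'] child-text predicate. It is only an illustration of the path logic, not Selenium code. The function name assign_href_for is invented, the duplicate id attributes from the question are left out, and the label is interpolated into the path unescaped, so it must not contain a single quote.

```python
import xml.etree.ElementTree as ET

TABLE = """
<table id="table">
  <tbody>
    <tr><td>Text1</td><td><a href="/assign/1">Assign</a></td></tr>
    <tr><td>Text2</td><td><a href="/assign/2">Assign</a></td></tr>
    <tr><td>Text3</td><td><a href="/assign/3">Assign</a></td></tr>
  </tbody>
</table>
"""


def assign_href_for(markup, label):
    """Return the href of the link in the row whose cell text equals label."""
    root = ET.fromstring(markup)
    # Select the row that has a td child with the given text,
    # then the link inside any td of that same row.
    link = root.find(f".//tr[td='{label}']/td/a")
    return None if link is None else link.get("href")


print(assign_href_for(TABLE, "Text1"))
```

In a WebDriver test, the equivalent XPath from the answers above would be passed to the driver's find-by-XPath call, and the returned element clicked.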

storeXpathCount within an frame, Selenium IDE

I want to store the count of li elements in a ul list. The list resides in a frame with the id contentFrame. The li elements all contain an anchor tag with class listHead.
First I tried this:
<tr>
<td>storeXpathCount</td>
<td>//ul/li/a[@class=listHead]</td>
<td>countMax</td>
</tr>
<tr>
<td>echo</td>
<td>${countMax}</td>
<td>${countMax}</td>
</tr>
The countMax returned is 0. If I change the target to //*, I get an xpathCount of only 13. Inspecting the source showed that most of the page is within iframes. So I tried adding selectFrame:
<tr>
<td>selectFrame</td>
<td>contentFrame</td>
<td></td>
</tr>
<tr>
<td>storeXpathCount</td>
<td>//ul/li/a[@class=listHead]</td>
<td>countMax</td>
</tr>
<tr>
<td>echo</td>
<td>${countMax}</td>
<td>${countMax}</td>
</tr>
The echo of countMax still returns 0, or 13 if the target is changed to //*. How do I get a count of the elements in the frame? I am using Selenium IDE 2.5.0 with Firefox.

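It seems like your XPath attribute filter is missing quotes around the class name. Without quotes, XPath reads listHead as the name of a child element rather than as a string literal, so nothing matches. Try:
//ul/li/a[@class="listHead"]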
