Scrapy chain selector with different parents - css

I want to chain two selectors together that have different parents. The selector I'm using currently:
..css('td:nth-child(8) > span.cap.mtv > ::text')
Which yields:
<Selector xpath="descendant-or-self::td[count(preceding-sibling::*) = 7]/span[@class and contains(concat(' ', normalize-space(@class), ' '), ' cap ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' mtv '))]/*/text()" data='$725,000'>
The issue I have is that I also want the following:
..xpath('td[8]/div/text()')
Which yields:
<Selector xpath='td[8]/div/text()' data='UFA'>
Ultimately I want to use an item loader and extract to get:
$725,000
UFA
...
I want to achieve something similar to the following..
...xpath('td[8]').css('span.cap.mtv > ::text').xpath('/div/text()')
Previously, I have just re-scraped an element with another set of selectors whenever the program found nothing, but I would much rather have this sort of 'either/or' flexibility. Or would I be better off looking at another selector altogether for this situation?
Any help is much appreciated!

If you're using item loaders, you can simply add multiple selectors for a single field as shown in scrapy docs.
Something like this should work, after creating a loader:
loader.add_css('field', 'td:nth-child(8) > span.cap.mtv > ::text')
loader.add_xpath('field', 'td[8]/div/text()')
Your input/output processors would then be responsible for how this information is combined.
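The either/or idea can be illustrated without Scrapy. The sketch below is only a stand-in: it uses the standard library's xml.etree.ElementTree rather than Scrapy selectors, and the sample rows and the helper name first_text are made up for illustration. It tries each path in turn and keeps the first non-empty text, which is similar in spirit to what a take-first output processor does with the values collected from several selectors.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample rows: the 8th cell holds either a span (salary) or a div (status).
ROWS = """
<table>
  <tr><td/><td/><td/><td/><td/><td/><td/><td><span class="cap mtv"><a>$725,000</a></span></td></tr>
  <tr><td/><td/><td/><td/><td/><td/><td/><td><div>UFA</div></td></tr>
</table>
"""


def first_text(row, paths):
    """Return the first non-empty text found by trying each path in order."""
    for path in paths:
        for node in row.findall(path):
            text = (node.text or "").strip()
            if text:
                return text
    return None


table = ET.fromstring(ROWS)
# Rough equivalents of the CSS and XPath selectors from the question.
paths = ["td[8]/span/*", "td[8]/div"]
values = [first_text(row, paths) for row in table.findall("tr")]
print(values)
```

With a real loader, the two add_css/add_xpath calls collect the values and the field's output processor decides which one wins.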

Related

Scrapy can't manage to request text with neither CSS or xPath

I've been trying to extract some text for a while now, and while everything works fine, there is something I can't manage to get.
Take this website : https://duproprio.com/fr/montreal/pierrefonds-roxboro/condo-a-vendre/hab-305-5221-rue-riviera-854000
I want to get the text from the class=listing-main-characteristics__number nodes (below the picture, the box with "2 chambres 1 salle de bain Aire habitable (s-sol exclu) 1,030 pi2 (95,69m2)"). There are 3 elements with that class on the page ("2", "1" and "1,030 pi² (95,69 m²)"). I've tried a bunch of options in XPath and CSS, but none has worked, and some gave back strange answers.
For example, with :
response.xpath('//span[@class="listing-main-characteristics__number"]').getall()
I get :
['<span class="listing-main-characteristics__number">\n 2\n </span>', '<span class="listing-main-characteristics__number">\n 1\n </span>']
For example, something else that works just fine on the same webpage :
response.xpath('//div[@property="description"]/p/text()').getall()
If I get all the spans with this query :
response.css('span::text').getall()
I can find the texts mentioned at the beginning in the output. But from this :
response.css('span[class=listing-main-characteristics__number]::text').getall()
I only get this
['\n 2\n ', '\n 1\n ']
Could someone clue me in with what kind of selection I would need? Thank you so much!
Here is the xpath that you have to use.
//div[@data-label='#description']//div[@class='listing-main-characteristics__label']|//div[@data-label='#description']//div[@class='listing-main-characteristics__item-dimensions']/span[2]
You might have to use the above XPath. (Add /text() if you want the associated text.)
response.xpath("//div[@data-label='#description']//div[@class='listing-main-characteristics__label']|//div[@data-label='#description']//div[@class='listing-main-characteristics__item-dimensions']/span[2]").getall()
Below is the python sample code
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumption: any configured WebDriver will do
url = "https://duproprio.com/fr/montreal/pierrefonds-roxboro/condo-a-vendre/hab-305-5221-rue-riviera-854000#description"
driver.get(url)
# get the output elements, then read the text from each of them
outputs = driver.find_elements(By.XPATH, "//div[@data-label='#description']//div[@class='listing-main-characteristics__label']|//div[@data-label='#description']//div[@class='listing-main-characteristics__item-dimensions']/span[2]")
for output in outputs:
    # replace the newline character with a space and trim the text
    print(output.text.replace("\n", ' ').strip())
Output:
2 chambres
1 salle de bain
1,030 pi² (95,69 m²)
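The extracted strings in the question carry leading and trailing newlines and spaces. A minimal sketch of cleaning them up in plain Python follows; the helper name clean is hypothetical, and the third sample value is an assumption modelled on the output above rather than something copied from the page.

```python
import re


def clean(text):
    """Collapse any run of whitespace (including newlines) to one space and trim."""
    return re.sub(r"\s+", " ", text).strip()


# The first two values are the ones shown in the question; the third is assumed.
raw = ['\n 2\n ', '\n 1\n ', '\n 1,030 pi² (95,69 m²)\n ']
print([clean(t) for t in raw])
```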

How to make all the attributions in the css selector into one line with vim?

Here is part of my CSS file.
body{
width:1100px;
height:800px;
}
div.main{
margin:20px auto 0 auto;
background-color:#f7f7f7;
}
I want to rewrite it as below.
body{width:1100px;height:800px;}
div.main{margin:20px auto 0 auto;background-color:#f7f7f7;}
All properties and values should be rewritten on a single line. Is there a smarty Vim command to do the job?
One option would be
:g/{/,/}/j
which breaks down as
g      start a global command
/{/    search for {
,/}/   for each match, set a range up until the }
j      join the range
Note that this might be too naive as-is: it doesn't take nested braces into account. You might first want to restrict it to a visual selection covering the block of text you want to change.
You could use the J or gJ (alternative that doesn't add spaces) commands. They can be run in visual mode to join all selected lines, or take a count.
Alternatively, the splitjoin.vim plugin provides specific support for css rules as you are asking. With the cursor over the first line of the css block, type gJ to join the whole block into a single line.
Either way, you may want or need to strip leading whitespace first by running :s/^\s\+// on the lines before joining them.
EDIT: I guess a 'smarty' way to do this, and without using plugins, would be the following macro: vf}:s/^\s\+/^MgvgJ (the ^M means pressing the enter key - you may have to enter the macro manually to get this). Use it by putting the cursor at the beginning of the line at the top of the css rule you want to rewrite.
As @romainl said, you should use a minifier. However, I am going to assume what you really want is a way to glance at your CSS rules quickly. If that is the case, then I suggest you look into folding. @Luc Hermitte gave a great answer on this subject on the post: Using vi, how can I make CSS rules into one liners?
Below is a variant of @Luc Hermitte's answer. Put the following in ~/.vim/ftplugin/css_fold.vim:
let b:width = 25
" Use the following mappings to adjust the foldtext "columns"
nnoremap <silent> <buffer> >s :<c-u>let b:width+=v:count1<cr><c-l>
nnoremap <silent> <buffer> <s :<c-u>let b:width-=v:count1<cr><c-l>
if !exists('*s:CssFoldText')
  function! s:CssFoldText()
    " Single-quoted patterns keep \s intact (a double-quoted "\s" loses its backslash)
    let line = printf("% *s {", -1*b:width, substitute(getline(v:foldstart), '{\s*$', '', ''))
    let nnum = nextnonblank(v:foldstart + 1)
    let lst = []
    while nnum <= v:foldend
      let line = line . " " . substitute(getline(nnum), '^\s*', '', '')
      let nnum += 1
    endwhile
    return line
  endfunction
  map <SID>xx <SID>xx
  let s:sid = substitute(maparg("<SID>xx"),'xx$','', '')
  unmap <SID>xx
endif
exe "setlocal foldtext=" . s:sid . "CssFoldText()"
setlocal foldmethod=syntax
Now you can use folding commands like zM to close all folds, zR to open all folds, and za to toggle the current fold. Vimcasts has a nice screencast on this topic, How to fold.
For more information see:
:h folds
:h 'foldtext'
:h 'foldmethod'
:h za
:h zR
:h zM

How should I specify an item in a checkbox list, when using the dom?

I am currently automating use of the wordpress editor using VBA in access 2003, but would like to extend the automation to include the selection of category, taxonomy items etc.
That is with Checkbox lists.
An example with the same structure is here: http://devblog.xing.com/frontend/the-checkbox-list/
My list is based on a hierarchy of geographic entities: On the local database I may have data associated with the locality: Algeria for instance.
I want to be able to use (ie As SHDocVw.InternetExplorer) ie.document.what?
I am a bit lost as to an elegant approach. I haven't tried it yet, but I guess I can get the innerHTML for each selectit class, check whether it contains my keyword and, if so, extract the input id with a bit of string manipulation, and then use ie.document.getelementbyid("whatever").Click, Check or Toggle.
But is there a better approach?
(Ultimately I will have to look at how to connect to the remote database and drag the tag_id from the tables there - but I thought this would be quicker especially in that the automation functionality in a larger sense already exists)
Any pointers appreciated!
<ul id="localitieschecklist" class="categorychecklist form-no-clear"
data-wp-lists="list:localities">
<li id="localities-8" class="popular-category">
<label class="selectit">
<input id="in-localities-8" type="checkbox" name="tax_input[localities][]" value="8"> </input>
Africa
</label>
<ul class="children"><li id="localities-96">
<label class="selectit"><input id="in-localities-96" type="checkbox" name="tax_input[localities][]" value="96"></input>
Algeria
</label>
Thanks for your help @Tim Williams. I am not sure that worked on the elegance front, but it has enabled me to move forward. I ran into a bit of a headache with the hierarchical nature of the list, and time has dictated a compromise of only supporting the first two levels of the hierarchy. It will do for now, but any further comments are certainly welcome!
'Localities
If artClassLocalities <> "" And Not IsNull(artClassLocalities) Then
    artClasses = Split(artClassLocalities, ",")
    Set Element = .Document.getElementByID("localitieschecklist")
    For i = 0 To Element.childNodes.Length - 1 'the element collection represents the globe
        'popular category items (6) one for each landmass
        Set Landmass = Element.childNodes(i)
        'landmass has 2 nodes
        'child 1 is the selectitnode node 0 (item 1)
        If InStr(1, artClassLocalities, Right(Landmass.childNodes(0).innerText, Len(Landmass.childNodes(0).innerText) - 1)) Then
            Call Landmass.childNodes(0).childNodes(0).setAttribute("checked", True)
        Else
            Call Landmass.childNodes(0).childNodes(0).setAttribute("checked", False)
        End If
        For j = 0 To Landmass.childNodes(1).childNodes.Length - 1 'the children are the countries
            Set Country = Landmass.childNodes(1).childNodes(j) ' a given child is a country
            If InStr(1, artClassLocalities, Right(Country.childNodes(0).innerText, Len(Country.childNodes(0).innerText) - 1)) Then
                Call Country.childNodes(0).childNodes(0).setAttribute("checked", True)
            Else
                Call Country.childNodes(0).childNodes(0).setAttribute("checked", False)
            End If
            'Support for Subregions not yet functional
            'For k = 0 To Country.childNodes(1).childNodes.Length - 1
            'Set PAndADiv = Country.childNodes(j)
        Next
    Next
End If

Using Vim, how can I make CSS rules into one liners?

I would like to come up with a Vim substitution command to turn multi-line CSS rules, like this one:
#main {
padding: 0;
margin: 10px auto;
}
into compacted single-line rules, like so:
#main {padding:0;margin:10px auto;}
I have a ton of CSS rules that are taking up too many lines, and I cannot figure out the :%s/ commands to use.
Here's a one-liner:
:%s/{\_.\{-}}/\=substitute(submatch(0), '\n', '', 'g')/
\_. matches any character, including a newline, and \{-} is the non-greedy version of *, so {\_.\{-}} matches everything between a matching pair of curly braces, inclusive.
The \= allows you to substitute the result of a vim expression, which we here use to strip out all the newlines '\n' from the matched text (in submatch(0)) using the substitute() function.
The inverse (converting the one-line version to multi-line) can also be done as a one liner:
:%s/{\_.\{-}}/\=substitute(submatch(0), '[{;]', '\0\r', 'g')/
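For comparison, the same non-greedy match can be sketched outside Vim. The Python version below is only an illustration of the technique, not part of the original answer: the function name join_rules is made up, `.` with re.DOTALL stands in for Vim's \_., and `*?` stands in for \{-}. Like the Vim command, it assumes braces are not nested.

```python
import re


def join_rules(css):
    """Strip newlines inside every {...} block (assumes un-nested braces)."""
    # re.DOTALL lets '.' match newlines; '*?' is the non-greedy quantifier.
    return re.sub(
        r"\{.*?\}",
        lambda m: m.group(0).replace("\n", ""),
        css,
        flags=re.DOTALL,
    )


css = "#main {\npadding: 0;\nmargin: 10px auto;\n}\n"
print(join_rules(css))
```

As with the Vim one-liner, any indentation inside the block is left in place; only the newlines are removed.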
If you are at the beginning or end of the rule, V%J will join it into a single line:
Go to the opening (or closing) brace
Hit V to enter visual mode
Hit % to match the other brace, selecting the whole rule
Hit J to join the lines
Try something like this:
:%s/{\n/{/g
:%s/;\n/;/g
:%s/{\s\+/{/g
:%s/;\s\+/;/g
This removes the newlines after opening braces and semicolons ('{' and ';') and then removes the extra whitespace between the concatenated lines.
If you want to change the file, go for rampion's solution.
If you don't want to (or can't) change the file, you can play with custom folding, since it lets you choose what to display for the folded text and how. For instance:
" {rtp}/fold/css-fold.vim
" [-- local settings --] {{{1
setlocal foldexpr=CssFold(v:lnum)
setlocal foldtext=CssFoldText()
let b:width1 = 20
let b:width2 = 15
nnoremap <buffer> + :let b:width2+=1<cr><c-l>
nnoremap <buffer> - :let b:width2-=1<cr><c-l>
" [-- global definitions --] {{{1
if exists('*CssFold')
  setlocal foldmethod=expr
  " finish
endif
function! CssFold(lnum)
  let cline = getline(a:lnum)
  if cline =~ '{\s*$'
    return 'a1'
  elseif cline =~ '}\s*$'
    return 's1'
  else
    return '='
  endif
endfunction
function! s:Complete(txt, width)
  let length = strlen(a:txt)
  if length > a:width
    return a:txt
  endif
  return a:txt . repeat(' ', a:width - length)
endfunction
function! CssFoldText()
  let lnum = v:foldstart
  let txt = s:Complete(getline(lnum), b:width1)
  let lnum += 1
  while lnum < v:foldend
    let add = s:Complete(substitute(getline(lnum), '^\s*\(\S\+\)\s*:\s*\(.\{-}\)\s*;\s*$', '\1: \2;', ''), b:width2)
    if add !~ '^\s*$'
      let txt .= ' ' . add
    endif
    let lnum += 1
  endwhile
  return txt. '}'
endfunction
I leave the sorting of the fields as an exercise. Hint: get all the lines between v:foldstart+1 and v:foldend in a List, sort the list, build the string, and that's all.
I won’t answer the question directly, but instead I suggest you reconsider your needs. I think that your “bad” example is in fact the better one. It is more readable, and easier to modify and reason about. Good indentation is very important not only when it comes to programming languages, but also in CSS and HTML.
You mention that CSS rules are “taking up too many lines”. If you are worried about file size, you should consider using CSS and JS minifiers like YUI Compressor instead of making the code less readable.
A convenient way of doing this transformation is to run the following
short command:
:g/{/,/}/j
Go to the first line of the file and use the command gqG to run the whole file through the formatter, assuming that runs of nonempty lines should be collapsed throughout the file.

How can I search CSS with Perl?

First question from a long time user.
I'm writing a Perl script that will go through a number of HTML files, search them line-by-line for instances of "color:" or "background-color:" (the CSS tags) and print the entire line when it comes across one of these instances. This is fairly straightforward.
Now I'll admit I'm still a beginning programmer, so this next part may be extremely obvious, but that's why I came here :).
What I want it to do is when it finds an instance of "color:" or "background-color:" I want it to trace back and find the name of the element, and print that as well. For example:
If my document contained the following CSS:
.css_class {
font-size: 18px;
font-weight: bold;
color: #FFEFA1;
font-family: Arial, Helvetica, sans-serif;
}
I would want the script to output something like:
css_class,#FFEFA1
Ideally it would output this as a text file.
I would greatly appreciate any advice that could be given to me regarding this!
Here is my script in full thus far:
$color = "color:";
open (FILE, "index.html");
@document = <FILE>;
close (FILE);
foreach $line (@document){
    if($line =~ /$color/){
        print $line;
    }
}
Since you asked for advice (and this isn't a coding service) I'll offer just that.
Always use strictures and warnings:
use strict;
use warnings;
Always check the return value of open calls:
open(FILE, 'filename') or die "Can't read file 'filename' [$!]\n";
Use the three-arg form of open and lexical filehandles instead of globs:
open(my $fh, '<', 'filename') or die "Can't read file 'filename' [$!]\n";
Don't slurp when line-by-line processing will do:
while (my $line = <$fh>) {
# do something with $line
}
Use backreferences to retrieve data from regex matches:
if ($line =~ /color *: *(#[0-9a-fA-F]{6})/) {
# color value is in $1
}
Save the class name in a temporary variable so that you have it when you match a color:
if ($line =~ /^\.(\w+) *\{/) {
    $class = $1;
}
Well, this is not as simple as it seems.
CSS classes can be defined in many ways. For example,
.classy {
color: black;
}
Good luck using a line-by-line approach for parsing that.
Actually, my first approach would be searching CPAN. This looks promising:
CSS - Object oriented access to Cascading Style Sheets (CSS)
Edit:
I installed HTML::TreeBuilder and CSS modules from CPAN and concocted the following aberration:
use strict;
use HTML::TreeBuilder;
use CSS;
foreach my $file_name (@ARGV) {
    my $tree = HTML::TreeBuilder->new; # empty tree
    $tree->parse_file($file_name);
    my $styles = $tree->find('style');
    if ($styles) {
        foreach my $style ($styles) {
            # This is an insane hack, not guaranteed
            # to work in the future.
            my $css = CSS->new;
            $css->read_string(join "\n", @{$style->{_content}});
            print $css->output;
        }
    }
    $tree = $tree->delete;
}
This thing only prints all the CSS selectors from list of HTML files, but nicely formatted so you should be able to continue from here.
For yet another way to do it, you can ask perl to read from the file in sections other than lines, for example by using the "}" as a record separator.
my $color = "color:";
open (my $fh, '<', "index.html") || die "Can't open file: $!";
{
    local $/ = "}";   # read one "}"-terminated section at a time
    while( my $section = <$fh>) {
        if($section =~ /$color(.*)/) {
            # the selector is whatever precedes the opening brace
            my ($selector) = $section =~ /(.*)\{/;
            print "$selector, $section\n";
        }
    }
}
Untested! Also, this of course assumes that your CSS neatly ends its sections with a } on a line of its own.
I'm not having problems with the regex's but rather with the capture of data. Since CSS elements are typically multi-line, I need to figure out how to create an array between the { and } with each linebreak as a delimiter for list items.
No, you don't.
For the problem as stated, the only lines of interest will be those containing either a class name or a color definition, and possibly also lines containing } to mark the end of a class. All other lines can be ignored, so there's no need to put them into an array.
Since class specifications cannot be nested[1], the last seen set of class names will always be the active set of classes. Therefore, you need only record the last seen set of class names and, when a color specification is encountered, print those class names.
There are still some potential difficulties handling cases in which a specification block is shared by multiple classes (.foo, .bar, .baz { ... }), which may or may not be spread across multiple lines, or if multiple attributes are defined on the same line, but dealing with those should follow fairly easily from what I've already laid out. Depending on your input data, you may also need to include a basic state engine to keep track of whether you're in comments or not.
[1] i.e., Although you can have semantically-nested classes, such as .foo and .foo .bar, they have to be specified in the CSS file as
.foo {
...
}
.foo .bar {
...
}
and cannot be
.foo {
...
.bar {
...
}
}
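The last-seen-selector approach described above can be sketched as follows. This is an illustration in Python rather than Perl, with hypothetical names (SELECTOR_RE, COLOR_RE, colors_by_selector); it assumes the selector and its opening brace share a line, and it does not handle comments or selectors spread across several lines. The selector is kept as written, leading dot included.

```python
import re

# Text before an opening brace on the same line is treated as the selector.
SELECTOR_RE = re.compile(r"^\s*([^{}]+?)\s*\{")
# Matches "color:" and, deliberately, "background-color:" as well.
COLOR_RE = re.compile(r"color\s*:\s*([^;]+);")


def colors_by_selector(lines):
    """Yield (selector, color) pairs, remembering only the last selector seen."""
    current = None
    for line in lines:
        m = SELECTOR_RE.search(line)
        if m:
            current = m.group(1)
        m = COLOR_RE.search(line)
        if m and current is not None:
            yield current, m.group(1).strip()
        if "}" in line:
            current = None  # the block is closed, so forget the selector


sample = [
    ".css_class {",
    "font-size: 18px;",
    "font-weight: bold;",
    "color: #FFEFA1;",
    "font-family: Arial, Helvetica, sans-serif;",
    "}",
]
for selector, color in colors_by_selector(sample):
    print(f"{selector},{color}")
```

No array of block lines is needed: a single variable holding the current selector is enough state for the problem as stated.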
Although I have not tested the code below, something like this should work:
if ($line =~ m/\.(.*?) \{(.*?)color:(.*?);(.*)/) {
print "$1,$3\n";
}
You should invest some time learning regular expressions for Perl.
