How can I scrape the string from this tag in ruby - css

I'm currently working on my first proper project outside of Codecademy/Baserails and could use some pointers. I'm using a scraper from one of the Baserails projects as a base to work from. My aim is to get the string "Palms Trax" and store it in an array called DJ. I also want to get the string "Solid Steel Radio Show" and store it in an array called source. My plan was to extract all the lines from the details section into a subarray and then filter it into the DJ and Source arrays, but if there is a better way of doing it, please tell me. I've been trying various combinations such as '.details none.li.div', 'ul details none.li.div.a', etc., but can't seem to stumble on the right one. Also, could someone please explain to me why the code
page.css('ol').each do |line|
subarray = line.text.strip.split(" - ")
end
only works if I declare subarray earlier, outside of the loop? In the Baserails project I am working from, this did not seem to be the case.
Here is the relevant html:
<!-- Infos -->
<ul class="details none">
<li><span>Source</span><div> Solid Steel Radio Show</div></li>
<li><span>Date</span><div>2015.02.27</div></li>
<li><span>Artist</span><div>Palms Trax</div></li>
<li><span>Genres</span><div>Deep HouseExperimentalHouseMinimalTechno</div></li>
<li><span>Categories</span><div>Radio ShowsSolid Steel Radio Show</div></li>
<li><span>File Size</span><div> 135 MB</div></li>
<li><span>File Format</span><div> MP3 Stereo 44kHz 320Kbps</div></li>
</ul>
and my code so far:
require "open-uri"
require "nokogiri"
require "csv"
#store url to be scraped
url = "http://www.electronic-battle-weapons.com/mix/solid-steel-palms-trax/"
#parse the page
page = Nokogiri::HTML(open(url))
#initalize empty arrays
details = []
dj = []
source = []
artist = []
track = []
subarray =[]
#store data in arrays
page.css('ul details none.li.div').each do |line|
details = line.text.strip
end
puts details
page.css('ol').each do |line|
subarray = line.text.strip.split(" - ")
end

I'm Alex, one of the co-founders of BaseRails. Glad you're now starting to work on your own projects - that's the best way to start applying what you've learned. I thought I'd chip in and see if I can help out.
I'd try this:
page.css('ul.details.none li div a')
This will grab each of the <a> tags, and you'll be able to use .text to extract the text of the link (e.g. Solid Steel Radio Show, Palms Trax, etc). To understand the code above, remember that the . means "with a class called..." and a space means "that has the following nested inside".
So in English, "ul.details.none li div a" translates to "a <ul> tag with a class called 'details' and another class called 'none' that has an <li> tag nested inside, with a <div> tag nested inside that, with an <a> tag inside that". Try that out and see if you can then figure out how to filter the results into DJ, Source, etc.
Finally, I'm not sure why your subarray needs to be declared. It shouldn't need to be declared if that's the only context in which you're using it. FYI the reason why we don't need to declare it in the BaseRails course is because the .split function returns an array by default. It's unlike our name, price, and details arrays where we're using a different function (<<). The << function can be used in multiple contexts, so it's important that we make clear that we're using it to add elements to an array.
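One possible explanation, offered as an assumption since the rest of the script isn't shown: in Ruby, a variable first assigned inside a do...end block is local to that block, so it no longer exists once the loop finishes. Declaring it before the loop makes the block assign to the outer variable instead. The sample strings below are made up for illustration.

```ruby
# Hypothetical sample lines standing in for the text of each <ol> element.
lines = ["Artist A - Track One", "Artist B - Track Two"]

# Case 1: `inner` is first assigned inside the block, so it is block-local.
lines.each do |line|
  inner = line.strip.split(" - ")
end
inner_visible = !defined?(inner).nil?
puts inner_visible # false: `inner` does not exist outside the block

# Case 2: `subarray` exists before the block, so the block assigns to it.
subarray = []
lines.each do |line|
  subarray = line.strip.split(" - ")
end
p subarray # holds only the split of the last line

# If the goal is to keep every split, map returns one sub-array per line.
pairs = lines.map { |line| line.strip.split(" - ") }
p pairs
```

Note that in case 2 each iteration overwrites subarray, so if you need all the lines afterwards, something like the map version is probably closer to what you want.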
Hope that helps!

Related

Extract number data from HTML with RobotFramework

I need to extract a number from an HTML page and convert it into a variable in my test case.
The problem is that this element has no ID of its own. Here is the HTML; I want to get the 54 (that number can change, which is why I need to identify the element some other way). I tried Get Text using "resultat", but I get "54 ligne(s) trouvée(s)" when I only want "54":
<div class="tab-interpage">
<div class="resultat">
<b>54</b>
ligne(s) trouvée(s)
</div>
...
You have other options how to locate an element, see Locating elements section in Selenium Library.
This might be a situation that requires XPath. I'd expect this one to work (but I can't see the whole DOM, so I can't be 100% sure):
//div[@class="resultat"]/b
combined with the keyword:
${var}=    Get Text    //div[@class="resultat"]/b
Obviously if there're more div elements with class "resultat", you might run into problems here. In this case, explore the DOM a bit more and see what are some other ways you can get to the element you need.
I think it'd be much more readable if the HTML elements had proper attributes like:
form with class attribute
unique ids usually work best

Qt: how to save QTextEdit contents with custom properties

I have a text editor (QTextEdit). Some words in my editor have additional information attached (namely, two corresponding integer positions in a wave file for that word).
They are stored in a Python object as custom properties of QTextCharFormat objects (I attach them with code like this: self.editor.textCursor().setCharFormat(QTextCharFormat().setProperty(MyPropertyID, myWordAttachment) ))
Unfortunately, if I save my document to HTML, all of that additional formatting is lost.
So I want to perform the simplest task: to save my document with all of its formatting, including myWordAttachment (and to load it from disk).
Am I right that Qt5 doesn't have something ready for this, and that I have to write all of the document's serialization code myself? (I still hope that there is a simple function that does the job.)
1. Loop over your text character by character.
2. For each character, get its charFormat().
3. Read the properties from that format.
4. The properties are ultimately plain values (int, str, ...), so you can read them with charFormat().property(1), (2), (3)... or with properties().
5. The most important thing is each character's position and range. You get the position during the loop in step 1.
6. As you collect the char formats, store them in a container such as a list, and don't forget to store each format's position as well.
7. Save your document together with the positions and properties.
My suggestion for your solution:
1. Get characterCount() from the QTextDocument object.
2. Loop over the range of characterCount().
3. Before the loop, create a QTextCursor object.
4. Place the text cursor at the first position (movePosition method, Start move operation, KeepAnchor flag).
5. Move the cursor to the right one character at a time.
6. Check the character's format with tc.charFormat() and its position with tc.position().
7. This is the point to think twice. A char format always applies to a run of characters, so you will probably get several consecutive characters with the same char format. You can prepare for that. I can think of a few ways, but you should set the QTextCharFormat's objectType or a property ID in advance (while editing the document) so that you can identify the format later. You could also store the text itself in the properties, for use after saving and loading. I hope you can get past this point with some debugging and experimenting.
8. When you get a char format, check its objectType().
9. If the objectType() is the same as the one you found before, skip it without doing anything.
10. The second important thing is to call clearSelection() on each iteration.
11. Save your document as an HTML string, as it is, and save the char formats' properties alongside it.
12. When you load your document, the HTML comes back; then load the properties. Create a QTextCursor and call setPosition() with the position you saved earlier. Move the QTextCursor to that position, select the target text, and apply the char format properties again. That's it.
Summary
The important thing is how you identify the charFormat().
You can read the char format without any problem, but a char format applies to a range, so you must be able to tell where that range starts and ends.
1. Store the targeted text in the QTextCharFormat's property.
2. Have the QTextCursor skip ahead while it is still inside the same QTextCharFormat object.
Those are the approaches I can think of.
I hope this is of some help.

How do I output a comma separated field as separate items efficiently?

I have a field in a mongodb collection with comma separated values such as...
{tags: 'Family friendly, Clean & tidy, Close to town centre, Good location, Friendly staff, Good breakfast, Book my next stay'}
In my template files I would normally call something like {{tags}}, or if I had an array might be able to use {{#each tags}} etc...
What I want to do though is wrap each item in additional HTML such as a span.
Any ideas?
UPDATE: Here's my helper function so far, it creates an array but I don't know the best way to use this in my HTML page so I can wrap spans around each item.
Template.tags.helpers({
  getTags: function(input) {
    var tagArray = [];
    tagArray = input.split(',');
    return tagArray;
  }
})
You can use tags.split( ", " ), but storing the tags in an array is more flexible and makes more sense.
Using my helper getTags I could iterate over the array it returned with the following code:
{{#each getTags reviewTags}}
<span>{{this}}</span>
{{/each}}
The this keyword can be used to output each item.
I'm not sure if this is the most efficient way, but it keeps the HTML where I want it.

using the chrome console to select out data

I'm looking to pull out all of the companies from this page (https://angel.co/finder#AL_claimed=true&AL_LocationTag=1849&render_tags=1) in plain text. I saw someone use the Chrome Developer Tools console to do this and was wondering if anyone could point me in the right direction?
TLDR; How do I use Chrome console to select and pull out some data from a URL?
Note: since jQuery is available on this page, I'll just go ahead and use it.
First of all, we need to select elements that we want, e.g. names of the companies. These are being kept on the list with ID startups_content, inside elements with class items in a field with class name. Therefore, selector for these can look like this:
$('#startups_content .items .name a')
As a result, we will get a bunch of HTMLElements. Since we want plain text, we need to extract it from these HTMLElements by doing:
.map(function(idx, item){ return $(item).text(); }).toArray()
Which gives us an array of company names. However, let's make a single plain-text list out of it:
.join('\n')
Connecting all the steps above we get:
$('#startups_content .items .name a').map(function(idx, item){ return $(item).text(); }).toArray().join('\n');
which should be executed in the DevTools console.
If you need some other data, e.g. company URLs, just follow the same steps as described above, making the appropriate changes.

Cucumber/Webrat: follow link by CSS class?

Is it possible to follow a link by its class name instead of the id, text or title? Given I have (haha, Cucumber insider, eh?) the following HTML code:
<div id="some_information_container">
Translation here
</div>
I do not want to match by text because I'd have to care about the translation values in my tests
I want to have my buttons look all the same style, so I will use the CSS class.
I don't want to assign a id to every single link, because some of them are perfectly identified through the container and the link class
Is there anything I missed in Cucumber/Webrat? Or do you have any advice on solving this in a better way?
Thanks for your help and best regards,
Joe
edit: I found an interesting discussion going on about this topic right here - seems to remain an open issue for now. Do you have any other solutions for this?
Here's how I did it with Cucumber, hope it helps. The # in the step definition makes the CSS selector match by ID.
This only works with IDs, not class names.
Step Definition
Then /^(?:|I )should see ([^\"]*) within a div with id "([^\"]*)"$/ do |text, selector|
  # checks for text within a specified div id
  within "##{selector}" do |content|
    if defined?(Spec::Rails::Matchers)
      content.should contain(text)
    else
      hc = Webrat::Matchers::HasContent.new(text)
      assert hc.matches?(content), hc.failure_message
    end
  end
end
Feature
Scenario Outline: Create Project
  When I fill in name with <title>
  And I select <data_type> from data_type
  And I press "Create"
  Then I should see <title> within a div with id "specifications"
  Scenarios: Search Terms and Results
    | data_type | title        |
    | Books     | A Book Title |
Here is how to assert text within an element with the class name "edit_button":
Then I should see "Translation here" within "[@class='edit_button']"
How about find('a.some-class').click?
I'm not very familiar with the Webrat API, but what about using a DOM lookup to get the ID of the element with the class you are looking for, then passing that to the click_link function?
Here's a link to some JavaScript that retrieves an element by class.
http://mykenta.blogspot.com/2007/10/getelementbyclass-revisited.html
Now that I think about it, what about using JavaScript to simply change the element's ID to some random value and then clicking that?
Either way, that should work until the debate over including a get-by-class function is resolved.
Does have_tag work for you?
have_tag('a.edit_button')

Resources