Extract specific text using iMacro


OS: Ubuntu 14.04
Browser: Mozilla Firefox 36.0.1
iMacros: VERSION BUILD=8920312 RECORDER=FX
I want an iMacros script that extracts the number "118" (or 118.5) from this HTML code to a CSV file:
<div class="betting_row clearfix">
<a href="#" class="betLink cell fifty last_cell" id="3976203966" >Over 118.5
<strong class="odds">0.90</strong>
</a>
</div>
Note: this HTML appears on every live basketball game page of the betting site http://www.bet.co.za/, where I want to extract the "total points" number.
I've tried this code:
TAG POS=1 TYPE=A ATTR=CLASS:"betLink cell fifty last_cell" EXTRACT=TXT
SET !EXTRACT EVAL("'{{!EXTRACT}}'.match(/\d\d\d\.\d/)[0]")
but it extracts both the number and the odds: " Over 118.5 0.90",
and all I want is "118".

For your particular case this code will be helpful:
TAG POS=1 TYPE=A ATTR=CLASS:"betLink cell fifty last_cell" EXTRACT=TXT
SET !EXTRACT EVAL("'{{!EXTRACT}}'.match(/\d\d\d\.\d/)[0]")
For a more universal solution, one needs to know the possible patterns (or range) of the extracted value.
BTW, if you’re still interested in the script for a csv-file without quotes, contact me via e-mail ( shugarjs#gmail.com ) and I’ll give it to you.
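To illustrate why the pattern matters, here is a minimal Python sketch of the same idea outside iMacros. The function name `extract_total` is hypothetical, and the pattern assumes the first number in the extracted text is the total. Unlike the fixed three-digit pattern above, it accepts any number of digits, returns the whole part ("118") separately from the full value ("118.5"), and returns None instead of failing when nothing matches.

```python
import re


def extract_total(text):
    """Return (whole_part, full_value) for the first number in text, or None."""
    # (\d+) captures the whole part; (?:\.\d+)? optionally matches the decimals.
    match = re.search(r"(\d+)(?:\.\d+)?", text)
    if match is None:
        return None
    return match.group(1), match.group(0)


# Sample string taken from the question: link text followed by the odds.
print(extract_total(" Over 118.5 0.90"))
```

Because only the first match is used, the odds value ("0.90") that follows the total is ignored.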

Related

Incorrect behaviour of Google Translation API with notranslate tags

Google Translate API allows indicating chunks of text that should not be translated with
<span translate='no'>Skip this text while translating</span>
In some cases there is an incorrect behaviour with non-translate tags, that causes the translation API to omit one of the words and to duplicate the non-translate tag. Input of the translation API:
0c40152c asdasd alsdls3 ec3f297a <span translate="no">AAAAA123AAAA</span> Nov 30 translate
When translating from Italian to English (not sure if the language matters), the following result is returned:
0c40152c asdasd alsdls3 ec3f297a <span translate="no">AAAAA123AAAA</span> Nov 30 <span translate="no">AAAAA123AAAA</span>
Please note that the 'translate' at the end of the text is substituted with the non-translate tag.
The same issue is present if, instead of <span translate='no'>, I use the alternative syntax <span class='notranslate'>.
Is this a known bug? Does it have a sensible workaround?

Is this a known bug?
Yes: https://issuetracker.google.com/issues/121076288
Translation problem with notranslate class in span tag
Problem you have encountered:
The translation API gives wrong results translating from german to arabic
German text:
QANTARA Migration - Kostenfreie Erstprüfung Ihrer Chancen für die erfolgreiche Immigration nach Deutschland
Arabic translation:
QANTARA Migration - إجراء فحص أولي مجاني لفرص نجاح QANTARA Migration إلى ألمانيا
What you expected to happen:
Correct translation without doubling the span with notranslate - this was doubled in arabic translation as you can see
There are also a few others that seem related, like https://issuetracker.google.com/issues/74168658 and https://issuetracker.google.com/issues/35902695.
Does it have a sensible workaround?
Only hacky ones, I'm afraid.
The easiest workaround is to replace such sections with a token, such as a unique number or URL that Translate is smart enough not to touch, then translate, then swap the original string back in.
A more general solution is to use something like ModelFront (full-disclosure: I work there) to detect errors, and do something only in those cases.
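The token workaround can be sketched roughly as follows. This is only an illustration: the helper names `protect` and `restore`, the numeric token base, and the regular expression are all assumptions, and the actual translation call is left out (the text is passed through unchanged where the API request would go).

```python
import re

# Matches both syntaxes mentioned in the question, with single or double quotes.
NOTRANSLATE = re.compile(
    r"""<span (?:translate=["']no["']|class=["']notranslate["'])>.*?</span>"""
)


def protect(text, base=70000000):
    """Swap each no-translate span for a numeric token; return text and mapping."""
    mapping = {}

    def swap(match):
        # Hypothetical token scheme: consecutive numbers starting at `base`.
        token = str(base + len(mapping))
        mapping[token] = match.group(0)
        return token

    return NOTRANSLATE.sub(swap, text), mapping


def restore(text, mapping):
    """Put the original spans back in place of their tokens."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text


source = ('0c40152c asdasd alsdls3 ec3f297a '
          '<span translate="no">AAAAA123AAAA</span> Nov 30 translate')
protected, mapping = protect(source)
translated = protected  # a real translation request would go here
print(restore(translated, mapping))
```

The token base should be chosen so that it cannot collide with numbers already present in the text.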

It seems you have specified Italian as the input language, but the text contains very few words that can be translated (for example "translate"), and they are not recognised as belonging to the source language.
This can lead to issues with the translation algorithm, which seems to be the case here.
A workaround would be to let the API detect the source language automatically and to check the confidence value:
The confidence value is an optional floating point value between 0 and
1. The closer this value is to 1, the higher the confidence level for the language detection. This member is not always available
If the confidence value is high enough for your needs, you can rely on the detected source language for the translation.
Another workaround could be adding more words to the text so the algorithm has more data to work with. I have tested the API with the same input as you describe, but with a few more words added. The output is then as expected.

How can I use custom classes in chapter titles with Asciidoc epub3 converter?

In the adoc file I define a chapter header like:
== [big-number]#2064# Das Spiele-Labor
For HTML that translates to
<span class="big-number">2064</span>
For the EPUB version, converted with asciidoctor-epub3, the class is apparently omitted. The relevant line in converter.rb:
<h1 class="chapter-title">#{title_upper}#{subtitle ? %[ <small class="subtitle">#{subtitle_formatted_upper}</small>] : nil}</h1>
(/var/lib/gems/1.9.1/gems/asciidoctor-epub3-1.5.0.alpha.7.dev/lib/asciidoctor-epub3/converter.rb)
How can I get the class information over to the chapter-title to format the first number in a special way?
Or is there another way to solve this? (The first number of the chapter title should be large, and CSS has no 'first-word' selector.)

IDML : What are Kinsoku/Mojikumi tables?

I am new to the world of Adobe InDesign and IDML file format. I am trying to understand the IDML file format so that I can create IDML files dynamically through code!
I am going through the IDML File format specification and have found references to "Mojikumi Tables" and "Kinsoku Tables" and "Aki". Though the documentation defines various attributes for these elements, there's no clear explanation what these elements actually are.
Any pointers or links to relevant articles would be really helpful.
Thanks.

These are all additional typography settings used in laying out Japanese text.
Kinsoku: A rule set in the Japanese language that is used to determine characters that are not permitted at the beginning or end of a line. Reference.
Mojikumi: Determines spacing between punctuation, symbols, numbers, and other character classes in Japanese type. Reference.
Aki: Means space in Japanese:
"When the glyphs that correspond to characters of different character
classes come together in a run of text, there is spacing behaviour. In
other words, extra space, measured using a fraction of an em, is
introduced depending on which two character classes are in proximity*.
Typical values are one-fourth and one-half of an em"
(Footnote: * 'In Japanese this space is referred to as aki, which simply means
"space"')
Reference and source for this quote.
Here's a link to a book that should provide more information: CJKV Information Processing, 2nd Edition

Marking up citations in HTML

Is there a somewhat standardized way to semantically mark up citations in HTML? I know that when I'm citing from a website I can do:
<q title="Article by John" cite="http://example.com/article">quoted text...</q>
But I was rather thinking something a bit more precise, maybe using RDFa and Dublin Core. Along the lines of:
<q cite="http://example.com/article">quoted text...</q>
<span xmlns:dc="http://purl.org/dc/elements/1.1/" about="#berkman">
<cite property="dc:title">Find It Fast: How to Uncover Expert Information on Any Subject</cite>
<span property="dc:creator">Berkman, R. I.</span>
<span property="dc:date">1994</span>
<span property="dc:publisher">New York: HarperPerennial</span>
<span property="dc:type">book</span>
</span>
Then I could run some JavaScript or XSLT over it to display the citation as a hover-text or footnote or something (HTML5 recommendation on footnotes). But this way seems to be rather loose on semantics. Isn't there a smart way to associate the quoted text (in the q tag) with an RDF triple? Like:
"quoted text..." voc:isQuotedFrom _b1.
_b1 dc:title "Find it Fast";
dc:creator "Berkman, R. I.".
I've stumbled over BibTeXML and a proposed Citation microformat, but they (as well as all usages of Dublin Core I've seen) always seem to focus on the metadata of a specific book (as it might appear in a bibliography) and not on how to mark up a citation and link it to a book.
Any thoughts or tips appreciated, thanks.

Use the BIBO Ontology - it has all the terms you want:
http://bibotools.googlecode.com/svn/bibo-ontology/trunk/doc/index.html
In particular:
bibo:citedBy
bibo:cites
bibo:Quote
RDF doesn't support literals as subjects.
So in the example above I would recommend the following, based on the new RDFa-Core Specification:
<div vocab="http://purl.org/ontology/bibo/" typeof="Quote">
<span rel="cites" resource="http://mybookstore.com/books#berkman"></span>
<q cite="http://example.com/article" property="shortDescription">quoted text...</q>
</div>
<span about="http://mybookstore.com/books#berkman" typeof="Book" prefix="dc: http://purl.org/dc/elements/1.1/" vocab="http://purl.org/ontology/bibo/">
<cite property="dc:title">Find It Fast: How to Uncover Expert Information on Any Subject</cite>
<span property="dc:creator">Berkman, R. I.</span>
<span property="dc:date">1994</span>
<span property="dc:publisher">New York: HarperPerennial</span>
</span>
This would resolve to the following RDF:
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix bibo: <http://purl.org/ontology/bibo/> .
@prefix mybooks: <http://mybookstore.com/books#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
_:b1 rdf:type bibo:Quote;
    bibo:cites mybooks:berkman;
    bibo:shortDescription "quoted text...".
mybooks:berkman rdf:type bibo:Book;
    dc:title "Find It Fast: How to Uncover Expert Information on Any Subject";
    dc:creator "Berkman, R. I.";
    dc:date "1994";
    dc:publisher "New York: HarperPerennial".

Some of it is stored in meta-tags in the header.
<meta name="author" content="Hege Refsnes" />
and
<meta name="revised" content="Hege Refsnes, 23/10/2011" />
Both are copied from the W3Schools page on meta tags.
There are other commonly used ones that are not in the W3C specification. I'm not sure if there is a list anywhere.
Additionally, if you decide to use HTML5, there are new tags and attributes for that. Some of the tags are address, cite, details, and summary.
Another feature of HTML5 that you can use in XHTML (although it won't validate in XHTML) is custom attributes. In HTML5 you should start them with data-, as in data-yourAttributeName, just in case yourAttributeName gets used in later versions of HTML.
E.g. <p data-date='13MAR2012'></p>

How to preserve spaces in hyperlinks when converting restructured text to html?

I'm using Python docutils and the rst2html.py script to convert reStructuredText to HTML.
I want to convert a line like this:
Test1 `(link1) <C:/path with spaces/file.html>`_
Into something like this:
<p>Test1 <a class="reference external" href="C:/path with spaces/file.html">(link1)</a>
But instead I get this (spaces in path are dropped):
<p>Test1 <a class="reference external" href="C:/pathwithspaces/file.html">(link1)</a>
How do I preserve the whitespace in links?

I don't know how you are reading the line from the file (or stdin), but you should convert the link-related string to HTML entities. You can find more information on the Python Wiki page "Escaping HTML".
Hope this helps.
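A related option, which is a different technique from the HTML-entity escaping suggested above, is to percent-encode the spaces in the link target before it reaches docutils. The sketch below assumes that docutils leaves percent-escapes such as %20 untouched and that the reStructuredText source is generated by a script. The helper name `rst_link` is hypothetical.

```python
from urllib.parse import quote


def rst_link(label, target):
    """Build a reStructuredText hyperlink with the target percent-encoded."""
    # Keep "/" and ":" literal so a path such as C:/... stays readable.
    encoded = quote(target, safe="/:")
    return f"`{label} <{encoded}>`_"


print("Test1 " + rst_link("(link1)", "C:/path with spaces/file.html"))
```

Each space in the target becomes %20, so no whitespace remains in the URI for the converter to drop.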
