XPath: How to extract a particular word from text()?

Can anyone help me extract a particular word from the text() result of an XPath expression?
I'm currently scraping the names of coins from this website: https://coinmarketcap.com/currencies/bitcoin/
I have used the XPath expression:
(//h1[@class='priceHeading']/text())[1]
which returns 'Bitcoin Price'. I just need the first word, 'Bitcoin', ignoring the rest.
Don't mind my mistakes, I'm a newbie here :)

Well, it depends on what you can rely on and which version of XPath you are using. Whether splitting on a space is sufficient, or whether you need more sophisticated tokenization, depends largely on the data and your requirements.
With XPath 1.0 and later, you can use substring-before() with a space:
substring-before((//h1[@class='priceHeading']/text())[1], ' ')
With XPath 2.0 and later, you can use tokenize() and select the first item:
tokenize((//h1[@class='priceHeading']/text())[1], ' ')[1]
If you know that it will always end with " Price", then you could use that value instead of just a space in substring-before() or tokenize(), or (also in XPath 2.0 and later) use replace() to swap " Price" for an empty string:
replace((//h1[@class='priceHeading']/text())[1], ' Price', '')
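If the XPath engine at hand lacks these functions, the same three ideas can be applied after the text node has been pulled into the host language. The following is a minimal Python sketch, assuming the heading text has already been extracted as a plain string; the function names are made up for illustration and are not part of any scraping library.

```python
def first_word(text):
    """Mimic substring-before(text, ' '): text before the first space, '' if there is none."""
    head, sep, _ = text.partition(" ")
    return head if sep else ""


def first_token(text):
    """Mimic tokenize(text, ' ')[1]: first piece after splitting on single spaces."""
    return text.split(" ")[0]


def strip_suffix_text(text, suffix=" Price"):
    """Mimic replace(text, ' Price', ''): remove every occurrence of the suffix text."""
    return text.replace(suffix, "")


heading = "Bitcoin Price"  # the value the question says the XPath returns
print(first_word(heading), first_token(heading), strip_suffix_text(heading))
```

Note the difference when the text contains no space at all: the substring-before() style returns an empty string, while the tokenize() style returns the whole text.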

Related

DevExpress FormatString

I'm trying to concatenate two object data source fields with a space between them, like this:
FormatString([Check].[CheckAmount] + [Check].[CheckCurrency])
How can I add a space between them in the expression editor?

Please take a look at the Concat function:
Concat([Prop1], ' ', [Prop2])
P.S. See the Criteria Language Syntax help article for details.

Extract numerical value before a string in R

I have been mucking around with regex strings and strsplit but can't figure out how to solve my problem.
I have a collection of HTML documents that will always contain the phrase "people own these". I want to extract the number immediately preceding this phrase. For example, from '732,234 people own these' I'm hoping to capture the number 732,234 (including the comma, though I don't care if it's removed).
The number and phrase are always enclosed in an HTML tag. I tried using XPath, but that seemed even harder than a regular expression. Any help or advice is greatly appreciated!
example string: >742,811 people own these<
-> 742,811

Could you please try the following:
val <- "742,811 people own these"
gsub(' [a-zA-Z]+',"",val)
Output will be as follows.
[1] "742,811"
Explanation: this uses R's gsub (global substitution) function. The pattern matches every occurrence of a space followed by one or more lowercase or uppercase letters, and each match in val is replaced with an empty string.

Try using str_extract_all from the stringr library:
str_extract_all(data, "\\d{1,3}(?:,\\d{3})*(?:\\.\\d+)?(?= people own these)")
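For comparison, the same lookahead pattern can be tried outside R. This is a small Python sketch using the standard re module; the function name owner_counts is hypothetical, and the sample input is the example string from the question.

```python
import re

# Digits in groups of three separated by commas, an optional decimal part,
# and a lookahead that requires " people own these" right after the number.
NUMBER_BEFORE_PHRASE = re.compile(
    r"\d{1,3}(?:,\d{3})*(?:\.\d+)?(?= people own these)"
)


def owner_counts(html):
    """Return every number that immediately precedes ' people own these'."""
    return NUMBER_BEFORE_PHRASE.findall(html)


print(owner_counts(">742,811 people own these<"))
```

Because the phrase sits inside a lookahead, it is checked but not included in the match, so only the number is returned.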

Cognos Expression Definition: How do I deal with apostrophes in a string?

I am attempting to improve an existing "data item query expression" in Cognos 10 via Report Studio. The current expression works fine... except it can't accommodate words with an apostrophe in them.
In many cases, we have just removed the apostrophe in our supporting data sources, but instances of apostrophes remain. Example: L'ESSENTIAL has been changed to L ESSENTIAL. L'AGENDA has become L AGENDA. My goal is to correct the expression so when it does encounter a L'ESSENTIAL or L'AGENDA it knows what to do with them.
My trial-and-error efforts generally result in parsing errors.
I've tried surrounding or preceding the apostrophe with quotes ("'), asterisks (*'), tildes (~'), and percent signs (%'), but none of these variations have been successful.
Here is a highly abbreviated version of the formula:
case when [_Dimensions].[Product Dimension (Configured)].[Product Dimension (Configured)].[Item].[Catalog Brand or Catalog Group] in ('L ESSENTIEL','L AGENDA') then '01 NO APOSTROPHE'
when [_Dimensions].[Product Dimension (Configured)].[Product Dimension (Configured)].[Item].[Catalog Brand or Catalog Group] in ('L%'ESSENTIEL','L%'AGENDA') then '02 WITH APOSTROPHE'
else '99 EVERYTHING ELSE'
end
How do I re-write the bolded part so it recognizes L'ESSENTIAL and L'AGENDA as strings?
Forgive my lack of experience in this arena... this is not my area of expertise, unfortunately.
Thanks in advance for any novice level guidance you can offer.

You can escape single-quote characters by using two single-quotes in a row. So the in() clause in bold above would be:
in ('L''ESSENTIEL','L''AGENDA')
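Doubling the single quote is the usual SQL convention for string literals, so it can be tried out in any SQL engine. The sketch below uses Python's built-in sqlite3 module purely as a stand-in; it is not Cognos, and the helper name sql_quote is an assumption made for illustration.

```python
import sqlite3


def sql_quote(value):
    """Wrap a string in single quotes, doubling any embedded single quote."""
    return "'" + value.replace("'", "''") + "'"


conn = sqlite3.connect(":memory:")

# The doubled quote inside the literal is read back as one apostrophe.
literal = conn.execute("SELECT 'L''ESSENTIEL'").fetchone()[0]

# The same escaping works inside an IN (...) list.
in_list = ", ".join(sql_quote(v) for v in ["L'ESSENTIEL", "L'AGENDA"])
matched = conn.execute(f"SELECT 'L''AGENDA' IN ({in_list})").fetchone()[0]

print(literal, in_list, matched)
conn.close()
```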

Regular Expression to remove contents in string

I have a string as below:
4s: and in this <em>new</em>, 5s: <em>year</em> everybody try to make our planet clean and polution free.
Desired result:
4s: and in this <em>new</em>, <em>year</em> everybody try to make our planet clean and polution free.
What I want is this: if the string has two <em> tags, and the gap between them is just one word of the form ns: (where n is a numeric value 0 to 4 characters long), then I want to remove ns: from the string, while keeping any punctuation marks ('?', '.', ',') between the two <em> tags as they are.
I should also note that the input string may or may not have punctuation marks between these two <em> tags.
My regular expression is below:
Regex.Replace(txtHighlight, @"</em>.(\s*)(\d*)s:(\s*).<em", "</em> <em");
I hope my requirement is clear.
How can I do this using regular expressions?

Not really sure what you need, but how about:
Regex.Replace(txtHighlight, @"</em>(.)\s*\d+s:\s*(.)<em", "</em>$1$2<em");

If you just want to take out the 5s: bit, you could do something like this:
Regex.Replace(txtHighlight, @"\s\d+s:", "");
This will match a space followed by one or more digits, the letter s, and a colon.
If that's not what you're after, my apologies. I hope it might help :)
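The requirement from the question (drop an ns: marker only when it is the single word between a closing </em> and the next <em>, keeping any '?', '.' or ',' in place) can also be sketched with an optional punctuation group. This is an illustrative Python version using the standard re module, not a translation of either answer; the name drop_marker is hypothetical.

```python
import re

# Group 1: the closing tag plus an optional '?', '.' or ','.
# Then the marker: 0 to 4 digits followed by "s:", with optional whitespace around it.
# Group 2: the next opening tag.
MARKER_BETWEEN_EM = re.compile(r"(</em>[?.,]?)\s*\d{0,4}s:\s*(<em>)")


def drop_marker(text):
    """Remove an 'ns:' marker that stands alone between </em> and <em>."""
    return MARKER_BETWEEN_EM.sub(r"\1 \2", text)


sample = ("4s: and in this <em>new</em>, 5s: <em>year</em> everybody try "
          "to make our planet clean and polution free.")
print(drop_marker(sample))
```

The leading 4s: is left alone because it is not located between two <em> elements.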

How to write a regex for any text except quotes or multiple hyphens?

Can anybody tell me how to write a regular expression for "no quotes (single or double) allowed and only single hyphens allowed"? For example, "good", 'good', and good--looking are not allowed (but good-looking is).
I need to use this regex as follows:
<asp:RegularExpressionValidator ID="revProductName" runat="server"
ErrorMessage="Can not have " or '." Font-Size="Smaller"
ControlToValidate="txtProductName"
ValidationExpression="^[^'|\"]*$"></asp:RegularExpressionValidator>
The one I have handles double and single quotes. Now I need to add multiple hyphens to it. I tried "^[^'|\"|--]*$", but it is not working.

^(?:-(?!-)|[^'"-]++)*$
should do.
^ # Start of string
(?: # Either match...
-(?!-) # a hyphen, unless followed by another hyphen
| # or
[^'"-]++ # one or more characters except quotes/hyphen (possessive match)
)* # any number of times
$ # End of string

So, the regexp has to fail when there is a ', a ", or a --.
The regexp should therefore check for these at every position and fail if one is found:
^(?:(?!['"]|--).)*$
The idea is to consume the whole line with ., but to check each time, before using ., that the next character is not ', not ", and not the beginning of --.
Also, I like the other answer very much. It uses a slightly different approach. It consumes only characters that are not quotes or hyphens ([^'"-]), and when it consumes a -, it checks that it is not followed by another -.
There is one more possible approach: search for ', ", or -- in the string, and fail the regex if any of them is found. It could be achieved using a regex conditional expression, but this flavor of regex engine doesn't seem to support that kind of condition.
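Both the lookahead pattern and the "search for the forbidden parts and fail" idea can be compared side by side in a host language. Here is a small Python sketch using the standard re module; it is not the ASP.NET validator itself, and the function names are made up for illustration.

```python
import re

# Lookahead approach: before consuming each character, make sure it is
# not a quote and not the start of a double hyphen.
ALLOWED = re.compile(r"""(?:(?!['"]|--).)*""")

# Inverted approach: look for any forbidden piece and reject on a hit.
FORBIDDEN = re.compile(r"""['"]|--""")


def is_valid_lookahead(name):
    """True when the whole string matches the lookahead pattern."""
    return ALLOWED.fullmatch(name) is not None


def is_valid_search(name):
    """True when no quote and no double hyphen occurs anywhere."""
    return FORBIDDEN.search(name) is None


for candidate in ["good-looking", "good--looking", '"good"', "'good'"]:
    print(candidate, is_valid_lookahead(candidate), is_valid_search(candidate))
```

The two functions agree on single-line input; they differ only for strings containing a newline, which . does not match by default.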
