How can I remove emojis from text using XQuery?

I have $text = "Hello 😀😃😄 💜 🙏🏻 🦦üäö$"
I want to remove just the emojis from the text using XQuery. How can I do that?
Expected result: "Hello üäö$"
I tried to use:
replace($text, '\p{IsEmoticons}+', '')
but it didn't work; it only removed the smileys.
Result now: "Hello 💜 🙏🏻 🦦üäö$"
Expected result: "Hello üäö$"
Thanks in advance :)

I outlined the approach in my answer to the original question, which I updated based on your comment asking about how to strip out 💜.
Quoting from that expanded answer:
The "Emoticons" block doesn't contain all characters commonly associated with "emoji." For example, 💜 (Purple Heart, U+1F49C), according to a site like https://www.compart.com/en/unicode/U+1F49C that lets you look up Unicode character information, is from:
Miscellaneous Symbols and Pictographs, U+1F300 - U+1F5FF
This block is not available in XPath or XQuery processors, since it is neither listed in the XML Schema 1.0 spec linked above, nor is it in the list of Unicode block names for use in XSD regular expressions that XPath and XQuery processors conforming to XML Schema 1.1 are required to support.
For characters from blocks not available in XPath or XQuery, you can manually construct character classes. For example, given the purple heart character above, we can match it as follows:
replace("Purple 💜 heart", "[🌀-🗿]", "")
This returns the string with the heart character removed:
Purple  heart
This approach can be applied to 🙏🏻, 🦦, or any other character:
1. Locate the character's Unicode block.
2. Craft your regular expression with the block name (if available in XPath) or a character class.
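Putting this together for every emoji in the question's string (a sketch of mine, not from the original answer): 😀, 😃, 😄, and 🙏 fall in Emoticons (U+1F600 to U+1F64F), 💜 and the skin-tone modifier 🏻 fall in Miscellaneous Symbols and Pictographs (U+1F300 to U+1F5FF), and 🦦 falls in Supplemental Symbols and Pictographs (U+1F900 to U+1F9FF). Combining the three ranges as literal character classes:
xquery version "3.1";
let $text := "Hello 😀😃😄 💜 🙏🏻 🦦üäö$"
return
(: each range spans one Unicode block, written with its first and last character :)
replace($text, "[🌀-🗿😀-🙏🤀-🧿]", "")
This removes all of the emoji, though the spaces that separated them remain in the string.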
Alternatively, rather than locating the blocks of characters you want to strip out, you could identify the blocks of characters you want to preserve. For example, given the example string in the original post, perhaps the goal is to preserve only those characters in the "Basic Latin" block. To do so, we can match characters NOT in this block via the \P Category Escape:
xquery version "3.1";
let $text := "Hello 😀😃😄 💜 🙏🏻 🦦üäö$"
return
replace($text, "\P{IsBasicLatin}", "")
This query returns:
Hello $
Notice that this has stripped out the characters with diacritics, which perhaps isn't desired. These characters with diacritics belong to the Latin-1 Supplement block. To preserve characters from both the Basic Latin and Latin-1 Supplement blocks, we'd need to adjust the query as follows:
xquery version "3.1";
let $text := "Hello 😀😃😄 💜 🙏🏻 🦦üäö$"
return
replace($text, "[^\p{IsBasicLatin}\p{IsLatin-1Supplement}]", "")
... which returns:
Hello üäö$
This now preserves the characters with diacritics.
To be precise about the characters you preserve or remove, you need to consult the Unicode blocks and charts.
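One final practical note (my addition, not part of the answer above): the replace() call keeps every space that separated the emoji, so the string actually contains runs of spaces even though they render collapsed above. To normalize them explicitly, the standard fn:normalize-space() function can be applied on top:
xquery version "3.1";
let $text := "Hello 😀😃😄 💜 🙏🏻 🦦üäö$"
return
normalize-space(replace($text, "[^\p{IsBasicLatin}\p{IsLatin-1Supplement}]", ""))
This returns exactly the expected result from the question: Hello üäö$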

Related

What do these characters mean? (ANSI code)

This is the code I'm printing with Node:
const m = `[38;5;1;48;5;16m TEST`
console.log(m)
Output: it changes the text color.
As you can see, there is a special character before the [ that I don't understand (it's not being shown by the browser). How does it work?
Is there any alternative for ESC?
As @puucee already mentions, these are terminal control characters. I find it surprising that the code shows ESC[ as that won't be escaped in normal Node; I suspect your IDE is converting the "true" escape character to the visible text ESC. Node does not support octal escapes (such as \033), but it does support hexadecimal escapes. That is, your string should usually look like this:
console.log('\x1b[38;5;1;48;5;16m TEST \x1b[0m')
These are terminal control characters. They are often used, e.g., for coloring the output. Some are non-printable. The backticks ` in your JavaScript example delimit template literals.
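As a sketch of how these sequences compose (my illustration; the color number is an arbitrary choice): \x1b[ starts a control sequence, the numeric parameters select attributes, m applies them, and \x1b[0m resets everything:
// \x1b is the ESC character (U+001B)
const ESC = '\x1b';
const red = `${ESC}[38;5;1m`; // 38;5;N selects foreground color N from the 256-color palette
const reset = `${ESC}[0m`;    // 0 resets all attributes
console.log(`${red}TEST${reset} back to normal`);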

Convert a comment string to an ASCII character list in SICStus Prolog

I am currently working on a comparison between SICStus 3 and SICStus 4, but I hit one issue: SICStus 4 will not consult any cases where the comment string contains carriage returns, tab characters, etc., as shown below.
An example case is given below. It has three arguments with a comma delimiter.
case('pr_ua_sfochi',"
Response:
answer(amount(2370.09,usd),[[01AUG06SFO UA CHI Q9.30 1085.58FUA2SFS UA SFO Q9.30 1085.58FUA2SFS NUC2189.76END ROE1.0 XT USD 180.33 ZPSFOCHI 164.23US6.60ZP5.00AY XF4.50SFO4.5]],amount(2189.76,usd),amount(2189.76,usd),amount(180.33,usd),[[fua2sfs,fua2sfs]],amount(6.6,usd),amount(4.5,usd),amount(0.0,usd),amount(18.6,usd),lasttktdate([20061002]),lastdateafterres(200712282]),[[fic_ticketinfo(fare(fua2sfs),fic([]),nvb([]),nva([]),tktiss([]),penalty([]),tktendorsement([]),tourinfo([]),infomsgs([])),fic_ticketinfo(fare(fua2sfs),fic([]),nvb([]),nva([]),tktiss([]),penalty([]),tktendorsement([]),tourinfo([]),infomsgs([]))]],<>,<>,cat35(cat35info([])))
.
02/20/2006 17:05:10 Transaction 35 served by static.static.server1 (usclsefat002:7551) running E*Fare version $Name: build-2006-02-19-1900 $
",price(pnr(
user('atl','1y',<>,<>,dept(<>,'0005300'),<>,<>,<>),
[
passenger(adt,1,[ptconly(n)])
],
[
segment(1,sfo,chi,'ua','<>','100',20140901,0800,f,20140901,2100,'737',res(20140628,1316),hk,pf2(n,[],[],n),<>,flags(no,no,no,no,no,no,no,no,no)),
segment(2,chi,sfo,'ua','<>','101',20140906,1000,f,20140906,1400,'737',res(20140628,1316),hk,pf2(n,[],[],n),<>,flags(no,no,no,no,no,no,no,no,no))
]),[
rebook(n),
ticket(20140301,131659),
dbaccess(20140301,131659),
platingcarrier('ua'),
tax_exempt([]),
trapparm("trap:ffil"),
city(y)
])).
The predicate below removes the comment section from a case like the one above.
flatten-cases :-
    getmessage(M1),
    write_flattened_case(M1),
    flatten-cases.
flatten-cases.

write_flattened_case(M1) :-
    M1 = case(Case,_Comment,Entry), !,
    M2 = case(Case,Entry),
    writeq(M2), write('.'), nl.

getmessage(M) :-
    read(M),
    !,
    M \== end_of_file.

:- flatten-cases.
Now my requirement is to convert the comment string to an ASCII character list.
Layout characters other than a regular space cannot occur literally in a quoted atom or a double-quoted list. This is a requirement of the ISO standard, and it has been fully implemented in SICStus since 3.9.0 when invoking SICStus 3 with the option --iso. Since SICStus 4, only ISO syntax is supported.
You need to insert \n and \t accordingly. So instead of
log('Response:
yes'). % BAD!
Now write
log('Response:\n\tyes').
Or, to make it better readable use a continuation escape sequence:
log('Response:\n\
\tyes').
Note that using literal tabs and literal newlines is highly problematic anyway. On a printout you do not see them! Think of 'A \nB': the trailing spaces or tabs before the newline would not show at all.
There are also many other situations where they get lost: making a screenshot of program text, taking a photo of program text, using a 3270 terminal emulator and copying the output, or, in the past, punched cards and the text mode when reading files (which was originally motivated by punched cards). Similar arguments hold for the tab character, which comes from typewriters with their manually settable tab stops.
And on Stack Overflow it is quite difficult to type a TAB at all: the browser refuses to insert it (very wisely), and if you paste one in, it gets rendered as spaces.
While I am at it, there is also another problem. The name flatten-cases should rather be written flatten_cases: with the hyphen, flatten-cases is parsed as the compound term -(flatten, cases) rather than as a single predicate name.
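As for the stated requirement of converting the comment string to an ASCII character list (this sketch is my addition, not part of the answer above): with SICStus' default value of the double_quotes flag, codes, a double-quoted string is already read as a list of character codes, and the standard atom_codes/2 converts atoms:
% "abc" is read directly as the code list [97,98,99]
?- X = "abc".
% X = [97,98,99]
% atom_codes/2 converts between an atom and its code list
?- atom_codes('Response:', Codes).
% Codes = [82,101,115,112,111,110,115,101,58]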

I am receiving \u200b in certain translations from the Google Translate API

I am new to using the Google Translate API, and during testing we noticed that for some translations (I have not been able to find a pattern yet) we get \u200b characters in the response. That causes a lot of issues and, above all, it does not seem to serve any purpose or make any sense. As a simple example:
https://www.googleapis.com/language/translate/v2?key=YOURKEY&source=NL&target=EN&q=Hergeneer%20verkopen
returns:
{
  "data": {
    "translations": [
      {
        "translatedText": "Sell \u200b\u200bHerge Down"
      }
    ]
  }
}
Our software stumbles over these \u200b strings/characters and I have not found a way to prevent them or get rid of them.
Please read the documentation of the JSON format: https://json.org/
A string is a sequence of zero or more Unicode characters.
A char is either any Unicode character except " or \ or control-character,
[...]
or it is \u followed by four hex-digits.
We are in this last case: \u followed by four hex digits represents a Unicode character, here Unicode Character 'ZERO WIDTH SPACE' (U+200B). It even has its own Wikipedia page: Zero-width space, and its own Stack Overflow question: What's HTML character code 8203?
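To make the invisible concrete (my illustration, using the response string from above): JSON.parse turns each six-character \u200b escape into a single invisible character:
const s = JSON.parse('"Sell \\u200b\\u200bHerge Down"');
console.log(s === "Sell Herge Down");                   // false, although both render identically
console.log([...s].filter(c => c === "\u200b").length); // 2 zero-width spaces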
Now, there are plenty of Unicode characters with special behaviors, and this is one of them, an invisible one among others. So you need to be aware of how Unicode works, and you should sanitize input/output from third-party APIs (and from user input as well).
Just define the list of characters that you actually want to support, and be sure to strip or filter out all the others. For instance, if you want to support NL and EN, you could strip everything outside the Latin script in Unicode.
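A minimal sketch of such filtering (my addition; the exact character set to strip is an assumption to adapt to what your software supports):
// Remove zero-width and byte-order-mark characters from API output.
// Caution: stripping U+200D (zero-width joiner) also breaks emoji ZWJ
// sequences, which is acceptable here if emoji are unwanted anyway.
function sanitize(text) {
  return text.replace(/[\u200B-\u200D\uFEFF]/g, "");
}
console.log(sanitize("Sell \u200b\u200bHerge Down")); // "Sell Herge Down"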
Stripping the U+200B that you're encountering, along with other undesirable characters, may save you from potential surprises such as:
big characters ⎲⎳
zalgo characters C̨̦̺̩̲̥͉̭͚̜̻̝̣̼͙̮̯̪o̴̡͇̘͎̞̲͇̦̲͞͡m̸̩̺̝̣̹̱͚̬̥̫̳̼̞̘̯͘ͅẹ͇̺̜́̕͢
invisible characters
emojis 👨‍👩‍👧‍👦#️⃣🏳️‍🌈

emphasis and not emphasis in the same word

In reStructuredText, is it possible to have emphasis and no emphasis in the same word? For example:
*emph*not-emph
leading to "emph" (emphasized) immediately followed by "not-emph", with no whitespace in between? I can't find a way to do it, not even with a substitution.
What you are looking for is Character-Level Inline Markup. The description from the reStructuredText specification is (emphasis mine):
It is possible to mark up individual characters within a word with backslash escapes [...] Backslash escapes can be used to allow arbitrary text to immediately follow inline markup.
The two examples provided in the specification are:
For a single character immediately following inline markup:
Python ``list``\s use square bracket syntax.
For arbitrary text immediately following inline markup:
Possible in *re*\ ``Structured``\ *Text*, though not encouraged.
So to achieve the output you want, you need to use the backslash-escaped whitespace pattern:
*emph*\ not-emph
This is required because the inline markup recognition rules state that:
Inline markup end-strings must end a text block or be immediately followed by
whitespace,
one of the ASCII characters - . , : ; ! ? \ / ' " ) ] } > or
a non-ASCII punctuation character with Unicode category Pd (Dash), Po (Other), Pe (Close), Pf (Final quote), or Pi (Initial quote).
Note that the use of that pattern above is discouraged in the reStructuredText specification:
The use of backslash-escapes for character-level inline markup is not encouraged. Such use is ugly and detrimental to the unprocessed document's readability. Please use this feature sparingly and only where absolutely necessary.
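For the question's concrete case, the escaped whitespace disappears from the output, so (my illustration, assuming docutils' default HTML writer):
*emph*\ not-emph
renders roughly as:
<em>emph</em>not-emph
that is, "emph" in italics immediately followed by "not-emph" with no intervening space.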

SQLite: which character can be ignored with FTS match in one word

I need to find a special character that, when I put it in the middle of a word, SQLite FTS MATCH will ignore as if it were not there, e.g.:
Text body: book's
If my match string is 'books', I need "book's" to be returned.
Either the porter or the simple tokenizer is fine.
I tried many characters for this, like book!s, book?s, book|s, book,s, book:s, and so on, but when searching with MATCH for 'books', none of these were returned.
I don't understand why.
I am using contentless FTS4 tables and external-content FTS4 tables. My text body has many such characters in each word, and they should be ignored when searching.
I cannot change the match query because I do not know where in the word the special character is. Also, I need the original word length to stay equal to the length of the FTS index word so that I can use matchinfo() or snippet(); as such, I cannot remove these characters from the text body.
The default tokenizers do not ignore punctuation characters but treat them as word separators.
So the text body or match string book's will end up as two words, book and s.
These will never match a single word like books.
To ignore characters like ', you have to install your own custom tokenizer.
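A quick illustration of that separator behavior (my sketch; the table and data are invented for the demonstration):
CREATE VIRTUAL TABLE docs USING fts4(body, tokenize=simple);
INSERT INTO docs(body) VALUES ('book''s');
-- No row: 'book''s' was indexed as the two tokens "book" and "s".
SELECT * FROM docs WHERE body MATCH 'books';
-- Matches: both tokens are present in the document.
SELECT * FROM docs WHERE body MATCH 'book s';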
