QRegExp: individual quantifiers can't be non-greedy, but what good alternatives then? - qt

I'm trying to write code that appends the ending _my_ending to a filename without changing the file extension.
Examples of what I need to get:
"test.bmp" -> "test_my_ending.bmp"
"test.foo.bar.bmp" -> "test.foo.bar_my_ending.bmp"
"test" -> "test_my_ending"
I have some experience with PCRE, where this is a trivial task. Because of my lack of experience with Qt, I initially wrote the following code:
QString new_string = old_string.replace(
QRegExp("^(.+?)(\\.[^.]+)?$"),
"\\1_my_ending\\2"
);
This code does not work (no match at all), and then I found in the docs that
Non-greedy matching cannot be applied to individual quantifiers, but can be applied to all the quantifiers in the pattern
As you see, in my regexp I tried to reduce greediness of the first quantifier + by adding ? after it. This isn't supported in QRegExp.
This is really disappointing for me, and so, I have to write the following ugly but working code:
//-- write regexp that matches only filenames with extension
QRegExp r = QRegExp("^(.+)(\\.[^.]+)$");
r.setMinimal(true);
QString new_string;
if (old_string.contains(r)){
//-- filename contains extension, so, insert ending just before it
new_string = old_string.replace(r, "\\1_my_ending\\2");
} else {
//-- filename does not contain extension, so, just append ending
new_string = old_string + "_my_ending";
}
But is there some better solution? I like Qt, but some things that I see in it seem to be discouraging.

How about using QFileInfo? This is shorter than your 'ugly' code:
QFileInfo fi(old_string);
QString new_string = fi.completeBaseName() + "_my_ending"
+ (fi.suffix().isEmpty() ? "" : ".") + fi.suffix();
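For comparison, the pattern from the question does work in engines that allow laziness on individual quantifiers. A minimal sketch in Python (using Python's `re` module rather than QRegExp; the helper name `add_ending` is made up for illustration):

```python
import re

# Same pattern as in the question: a lazy first group, then an optional extension.
PATTERN = re.compile(r"^(.+?)(\.[^.]+)?$")


def add_ending(filename, ending="_my_ending"):
    """Insert `ending` before the last extension, or append it if there is none."""
    def repl(match):
        # Group 2 is None when the filename has no extension.
        return match.group(1) + ending + (match.group(2) or "")

    return PATTERN.sub(repl, filename)


for name in ("test.bmp", "test.foo.bar.bmp", "test"):
    print(name, "->", add_ending(name))
```

This only illustrates the regex behaviour; inside a Qt program the QFileInfo approach above avoids the regex altogether.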

Related

Can R read html-encoded emoji characters?

Question
My question, explained below, is:
How can R be used to read a string that includes HTML emoji codes like &#55358;&#56599;?
I'd like to:
(1) represent the emoji symbol (e.g., as a Unicode symbol: 🤗) in the parsed string, OR
(2) convert it into its text equivalent (":hugging face:")
Background
I have an XML dataset of text messages (from the Android/iOS app Signal) that I am reading into R for a text mining project. The data look like this, with each text message represented in an sms node:
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!-- File Created By Signal -->
<smses count="1">
<sms protocol="0" address="+15555555555" contact_name="Jane Doe" date="1483256850399" readable_date="Sat, 31 Dec 2016 23:47:30 PST" type="1" subject="null" body="Hug emoji: &#55358;&#56599;" toa="null" sc_toa="null" service_center="null" read="1" status="-1" locked="0" />
</smses>
Problem
I am currently reading the data using the xml2 package for R. When I use the xml2::read_xml function, however, I get the following error message:
Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html, :
xmlParseCharRef: invalid xmlChar value 55358
Which, as I understand, indicates that the emoji character is not recognized as valid XML.
Using the xml2::read_html function does work, but drops the emoji character. A small example of this is here:
example_text <- "Hugging emoji: &#55358;&#56599;"
xml2::xml_text(xml2::read_html(paste0("<x>", example_text, "</x>")))
(Output: [1] "Hugging emoji: ")
This character is valid HTML -- Googling &#55358;&#56599; actually converts it in the search bar to the "hugging face" emoji, and brings up results relating to that emoji.
Other information I've found that seems relevant to this question
I've been searching Stack Overflow, and have not found any questions relating to this particular issue. I've also not been able to find a table that straightforwardly gives HTML codes next to the emoji they represent, and so am not able to do an (albeit inefficient) conversion of these HTML codes to their textual equivalents in a big loop before parsing the dataset; for example, neither this list nor its underlying dataset seem to include the string 55358.
tl;dr: the emoji aren't valid HTML entities; UTF-16 numbers have been used to build them instead of Unicode code points. I describe an algorithm at the bottom of the answer to convert them so that they are valid XML.
Identifying the Problem
R definitely handles emoji.
In fact, a few packages exist for handling emoji in R. For example, the emojifont and emo packages both let you retrieve emoji based on Slack-style keywords. It's just a question of getting your source characters through from the HTML-escaped format so that you can convert them.
xml2::read_xml seems to do fine with other HTML entities, like an ampersand or double quotes. I looked at this SO answer to see whether there were any XML-specific constraints on HTML entities, and it seemed like they were storing emoji fine. So I tried changing the emoji codes in your reprex to the ones in that answer:
body="Hug emoji: &#128512;&#128515;"
And, sure enough, they were preserved (though they're obviously not the hug emoji anymore):
> test8 = read_html('Desktop/test.xml')
> test8 %>% xml_child() %>% xml_child() %>% xml_child() %>% xml_attr('body')
[1] "Hug emoji: \U0001f600\U0001f603"
I looked up the hug emoji on this page, and the decimal HTML entity given there is not &#55358;&#56599;. It looks like the UTF-16 decimal codes for the emoji have been wrapped in &# and ;.
In conclusion, I think the answer is that your emoji are, in fact, not valid HTML entities. If you can't control the source, you might need to do some pre-processing to account for these errors.
So, why does the browser convert them properly? I'm wondering if the browser is a little more flexible with these things and is making some guesses about what those codes could be. I'm just speculating, though.
Converting UTF-16 to Unicode code points
After some more investigation, it looks like valid emoji HTML entities use the Unicode code point (in decimal, if it's &#...;, or hex, if it's &#x...;). The Unicode code point is different from the UTF-8 or UTF-16 code. (That link explains a lot about how emoji and other characters are variously encoded, BTW! Good read.)
So we need to convert the UTF-16 codes used in your source data to Unicode code points. Referring to this Wikipedia article on UTF-16, I've verified how it's done. For a code point above U+FFFF (our target), you first subtract 0x10000, which leaves a 20-bit number, or five hex digits. When going from Unicode to UTF-16, you split that number into two 10-bit halves (the middle hex digit gets cut in half, with two of its bits going to each block), then add 0xd800 to the high half and 0xdc00 to the low half to get the two UTF-16 code units.
Going backwards, as you want to, it's done like this:
Your decimal UTF-16 number (which is in two separate blocks for now) is 55358 56599
Converting those blocks to hex (separately) gives 0x0d83e 0x0dd17
You subtract 0xd800 from the first block and 0xdc00 from the second to give 0x3e 0x117
Converting them to binary, padding them out to 10 bits and concatenating them, it's 0b0000 1111 1001 0001 0111
Then we convert that back to hex, which is 0x0f917
Finally, we add 0x10000, giving 0x1f917
Therefore, our (hex) HTML entity is &#x1f917;. Or, in decimal, &#129303;.
So, to preprocess this dataset, you'll need to extract the existing numbers, use the algorithm above, then put the result back in (with one &#...;, not two).
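The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a hardened preprocessor: the function names are made up for this example, and the regex assumes the surrogate halves always appear as adjacent decimal entities, as in the sample data.

```python
import re


def surrogate_pair_to_code_point(high, low):
    """Combine a UTF-16 surrogate pair into a single Unicode code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # Shifting left by 10 bits is the "pad to 10 bits and concatenate" step.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# Decimal surrogates lie in 55296..57343, so only 5-digit values starting
# with 55, 56 or 57 are candidates; the exact range is checked in repl().
PAIR = re.compile(r"&#(5[56]\d{3});&#(5[67]\d{3});")


def fix_surrogate_entities(text):
    """Replace each '&#high;&#low;' pair with one '&#x...;' entity."""
    def repl(match):
        high, low = int(match.group(1)), int(match.group(2))
        try:
            return "&#x{:x};".format(surrogate_pair_to_code_point(high, low))
        except ValueError:
            return match.group(0)  # leave anything else untouched

    return PAIR.sub(repl, text)


print(fix_surrogate_entities("Hug emoji: &#55358;&#56599;"))
```

Running the function over the raw file contents before handing them to the XML parser should leave ordinary entities such as &#38; unchanged, since they fall outside the surrogate range.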
Displaying emoji in R
As far as I'm aware, there's no solution to printing emoji in the R console: they always come out as "\U0001f600" (or what have you). However, the packages I described above can help you plot emoji in some circumstances (I'm hoping to expand ggflags to display arbitrary full-colour emoji at some point). They can also help you search for emoji to get their codes, but they can't get names given the codes AFAIK. But maybe you could try importing the emoji list from emojilib into R and doing a join with your data frame, if you've extracted the emoji codes into a column, to get the English names.
JavaScript Solution
I had this exact same problem, but needed the solution in JavaScript, not R. Using rensa's comment above (hugely helpful!), I created the following code to solve this issue, and I just wanted to share it in case anyone else happens across this thread as I did, but needed it in JavaScript.
str.replace(/(&#\d+;){2}/g, function(match) {
match = match.replace(/&#/g,'').split(';');
var binFirst = (parseInt('0x' + parseInt(match[0]).toString(16)) - 0xd800).toString(2);
var binSecond = (parseInt('0x' + parseInt(match[1]).toString(16)) - 0xdc00).toString(2);
binFirst = '0000000000'.substr(binFirst.length) + binFirst;
binSecond = '0000000000'.substr(binSecond.length) + binSecond;
return '&#x' + (('0x' + (parseInt(binFirst + binSecond, 2).toString(16))) - (-0x10000)).toString(16) + ';';
});
And, here's a full snippet of it working if you'd like to run it:
var str = '&#55357;&#56842;&#55357;&#56856;&#55357;&#56832;&#55357;&#56838;&#55357;&#56834;'
str = str.replace(/(&#\d+;){2}/g, function(match) {
match = match.replace(/&#/g,'').split(';');
var binFirst = (parseInt('0x' + parseInt(match[0]).toString(16)) - 0xd800).toString(2);
var binSecond = (parseInt('0x' + parseInt(match[1]).toString(16)) - 0xdc00).toString(2);
binFirst = '0000000000'.substr(binFirst.length) + binFirst;
binSecond = '0000000000'.substr(binSecond.length) + binSecond;
return '&#x' + (('0x' + (parseInt(binFirst + binSecond, 2).toString(16))) - (-0x10000)).toString(16) + ';';
});
document.getElementById('result').innerHTML = str;
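// &#55357;&#56842;&#55357;&#56856;&#55357;&#56832;&#55357;&#56838;&#55357;&#56834;
// is turned into
// &#x1f60a;&#x1f618;&#x1f600;&#x1f606;&#x1f602;
// which is rendered by the browser as the emojis
Original:<br>&#55357;&#56842;&#55357;&#56856;&#55357;&#56832;&#55357;&#56838;&#55357;&#56834;<br><br>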
Result:<br>
<div id='result'></div>
My SMS XML Parser application is working great now, but it stalls out on large XML files so, I'm thinking about rewriting it in PHP. If/when I do, I'll post that code as well.
I've implemented the algorithm described by rensa above in R, and am sharing it here. I am happy to release the code snippet below under a CC0 dedication (i.e., putting this implementation into the public domain for free reuse).
This is a quick and unpolished implementation of rensa's algorithm, but it works!
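library(stringr)  # provides str_match_all() and re-exports %>%; stringi and BMS must also be installed
utf16_double_dec_code_to_utf8 <- function(utf16_decimal_code){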
string_elements <- str_match_all(utf16_decimal_code, "&#(.*?);")[[1]][,2]
string3a <- string_elements[1]
string3b <- string_elements[2]
string4a <- sprintf("0x0%x", as.numeric(string3a))
string4b <- sprintf("0x0%x", as.numeric(string3b))
string5a <- paste0(
# "0x",
as.hexmode(string4a) - 0xd800
)
string5b <- paste0(
# "0x",
as.hexmode(string4b) - 0xdc00
)
string6 <- paste0(
stringi::stri_pad(
paste0(BMS::hex2bin(string5a), collapse = ""),
10,
pad = "0"
) %>%
stringr::str_trunc(10, side = "left", ellipsis = ""),
stringi::stri_pad(
paste0(BMS::hex2bin(string5b), collapse = ""),
10,
pad = "0"
) %>%
stringr::str_trunc(10, side = "left", ellipsis = "")
)
string7 <- BMS::bin2hex(as.numeric(strsplit(string6, split = "")[[1]]))
string8 <- as.hexmode(string7) + 0x10000
unicode_pattern <- string8
unicode_pattern
}
make_unicode_entity <- function(x) {
paste0("\\U000", utf16_double_dec_code_to_utf8(x))
}
make_html_entity <- function(x) {
paste0("&#x", utf16_double_dec_code_to_utf8(x), ";")
}
# An example string, using the "hug" emoji:
example_string <- "test &#55358;&#56599; test"
output_string <- stringr::str_replace_all(
example_string,
"(&#[0-9]*?;){2}", # Find all two-character "&#...;&#...;" codes.
make_unicode_entity
# make_html_entity
)
cat(output_string)
# To print the Unicode string (it doesn't display in the R console, but can be
# copied and pasted elsewhere).
# This assumes you've used 'make_unicode_entity' above in the str_replace_all
# call:
stringi::stri_unescape_unicode(output_string)
Translated Chad's JavaScript answer to Go since I too had the same issue, but needed a solution in Go.
https://play.golang.org/p/h9JBFzqcd90
package main
import (
"fmt"
"html"
"regexp"
"strconv"
"strings"
)
func main() {
emoji := "&#55357;&#56842;&#55357;&#56856;&#55357;&#56832;&#55357;&#56838;&#55357;&#56834;"
regexp := regexp.MustCompile(`(&#\d+;){2}`)
matches := regexp.FindAllString(emoji, -1)
var builder strings.Builder
for _, match := range matches {
s := strings.Replace(match, "&#", "", -1)
parts := strings.Split(s, ";")
a := parts[0]
b := parts[1]
c, err := strconv.Atoi(a)
if err != nil {
panic(err)
}
d, err := strconv.Atoi(b)
if err != nil {
panic(err)
}
c = c - 0xd800
d = d - 0xdc00
e := strconv.FormatInt(int64(c), 2)
f := strconv.FormatInt(int64(d), 2)
// Left-pad each half to 10 binary digits.
g := "0000000000"[len(e):] + e
h := "0000000000"[len(f):] + f
j, err := strconv.ParseInt(g + h, 2, 64)
if err != nil {
panic(err)
}
k := j + 0x10000
_, err = builder.WriteString("&#x" + strconv.FormatInt(k, 16) + ";")
if err != nil {
panic(err)
}
}
fmt.Println(html.UnescapeString(emoji))
emoji = html.UnescapeString(builder.String())
fmt.Println(emoji)
}

How do I use regex to find FS.File on FS.Collection in meteor

How do I use a regex to find an FS.File in an FS.Collection in Meteor? My code is as follows, and it is not working:
partOfFileName = "*User_" + clickedResellerId + "_*";
var imgs = Images.find({fileName:{$regex:partOfFileName}});
//var imgs = Images.find();
return imgs // Where Images is an FS.Collection instance
In place of fileName I've also tried name and it is not working either. Please help
I don't think your regex is valid. Did you perhaps mean the following?
partOfFileName = ".*User_" + clickedResellerId + "_.*";
Please note that POSIX wildcard notation is different from regular expressions. In regular expressions, the * operator indicates repetition of the preceding element (in my case a ., i.e., any character). A * by itself has no meaning, and it doesn't mean "anything" as it does in POSIX wildcards.
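The difference can be checked quickly outside Meteor. A small sketch using Python's `re` module (other engines may report the problem differently; `build_pattern`, `is_valid_regex` and the sample file names are made up for illustration):

```python
import re


def is_valid_regex(pattern):
    """Return True if `pattern` compiles as a regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


def build_pattern(reseller_id):
    # A regex search matches anywhere in the string, so no leading or
    # trailing ".*" is needed; re.escape guards against special characters.
    return "User_" + re.escape(str(reseller_id)) + "_"


print(is_valid_regex("*User_42_*"))    # wildcard-style pattern from the question
print(is_valid_regex(".*User_42_.*"))  # corrected pattern from the answer
print(bool(re.search(build_pattern(42), "img_User_42_avatar.png")))
```

Because an unanchored search already matches anywhere in the string, the shorter pattern without ".*" should select the same file names as the corrected one.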

How to remove double quotes in Objective-C

Let me introduce myself.
My name is Vladimir, I'm a C++ programmer from Serbia. Two weeks ago I started to learn Objective-C, and it was fine until tonight.
Problem:
I can't remove double quotes from my NSLog output.
NSLog(@"The best singers:%@", list.best);
Strings are joined with componentsJoinedByString:@" and "
I would like to get something like this:
The best singers: Mickey and John.
But I get this:
The best singers: ("Mickey", and "John").
I can't remove the comma (,) and parentheses either.
I have tried "replaceOccurrencesOfString", but with no success. It can remove any character except the quote and comma.
Also, I have used the -(NSString *)description method to return the string.
You are getting the raw output from your list (which I assume is an array). You will have to do your own formatting to get this to display in the format that you want. You can achieve this by building your string by iterating through your array. Note that this probably isn't the most efficient nor the most robust way to achieve this.
NSMutableString *finalString = [NSMutableString string];
BOOL first = YES;
for (NSString *nameString in list) {
if (first) {
[finalString appendString:nameString];
first = NO;
} else {
[finalString appendString:[NSString stringWithFormat:@" and %@", nameString]];
}
}

removing whitespaces from a QRegExpValidator

I have some code that someone else wrote:
this->llBankCode = new widgetLineEditWithLabel(tr("Bankleitzahl"), "", Qt::AlignTop, this);
QRegExpValidator *validatorBLZ = new QRegExpValidator(this);
validatorBLZ->setRegExp(QRegExp( "[0-9]*", Qt::CaseSensitive));
this->llBankCode->lineEdit->setValidator(validatorBLZ);
As can be seen from this code, validatorBLZ accepts only the digits 0 to 9. I would like to change it so that validatorBLZ also accepts whitespace in the input (but not at the start), without the whitespace being shown.
Example:
If I try to copy and paste a string of the format '22 34 44', the result is an empty field. What I would like to happen is that the string '22 34 44' is shown in the field as '223444'.
How could I do it?
You could try using:
QString string = "22 34 44";
string.replace(QString(" "), QString(""));
That removes every space from the string.
Write your own QValidator subclass and reimplement validate and fixup. Fixup does what you ask for: changes the input in a way that makes it intermediate/acceptable.
In your case, consider the following code-snippet for fixup:
// Member of your QValidator subclass: keeps only the digits of the input.
void fixup(QString &input) const
{
QString fixed;
fixed.reserve(input.size());
for (int i=0; i<input.size(); ++i)
if (input.at(i).isDigit()) fixed.append(input.at(i));
input = fixed;
}
(this is not tested)
The validate function will obviously look similar, returning QValidator::Invalid when it encounters a non-digit character and reporting the corresponding position in pos.
If your BLZ is limited to Germany, you could easily add the validation feature that it only returns QValidator::Acceptable when there are eight digits, and QValidator::Intermediate else.
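The intended logic of fixup and validate can be sketched independently of Qt. The following Python snippet is only an illustration of the decision rules described above: the `State` enum stands in for QValidator's Invalid/Intermediate/Acceptable states, and all names are hypothetical.

```python
from enum import Enum


class State(Enum):
    # Stand-ins for QValidator::Invalid / Intermediate / Acceptable.
    INVALID = 0
    INTERMEDIATE = 1
    ACCEPTABLE = 2


BLZ_LENGTH = 8  # German bank codes have eight digits


def is_ascii_digit(ch):
    return "0" <= ch <= "9"


def fixup(text):
    """Drop everything that is not a digit, e.g. '22 34 44' -> '223444'."""
    return "".join(ch for ch in text if is_ascii_digit(ch))


def validate(text):
    """Return (state, pos): INVALID with the offending position on the first
    non-digit, ACCEPTABLE for exactly eight digits, INTERMEDIATE otherwise."""
    for pos, ch in enumerate(text):
        if not is_ascii_digit(ch):
            return State.INVALID, pos
    state = State.ACCEPTABLE if len(text) == BLZ_LENGTH else State.INTERMEDIATE
    return state, len(text)


print(fixup("22 34 44"))
print(validate("22 34 44"))
print(validate("12345678"))
```

In a real subclass the same checks would go into the reimplemented validate and fixup members.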
Anyhow, writing your own QValidator, which is often very easy and straightforward, is the best (and most future-proof) solution most of the time. RegExes are great, but C++ clearly is the more powerful language here, which in addition results in a much more readable validator ;).

Using Vim, how can I make CSS rules into one liners?

I would like to come up with a Vim substitution command to turn multi-line CSS rules, like this one:
#main {
padding: 0;
margin: 10px auto;
}
into compacted single-line rules, like so:
#main {padding:0;margin:10px auto;}
I have a ton of CSS rules that are taking up too many lines, and I cannot figure out the :%s/ commands to use.
Here's a one-liner:
:%s/{\_.\{-}}/\=substitute(submatch(0), '\n', '', 'g')/
\_. matches any character, including a newline, and \{-} is the non-greedy version of *, so {\_.\{-}} matches everything between a matching pair of curly braces, inclusive.
The \= allows you to substitute the result of a vim expression, which we here use to strip out all the newlines '\n' from the matched text (in submatch(0)) using the substitute() function.
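For readers who want to see the same idea outside Vim, here is a rough Python equivalent (a sketch only; `join_css_rules` is a made-up name, and like the Vim command it assumes braces are not nested):

```python
import re


def join_css_rules(css):
    """Strip newlines inside each {...} block, like the :%s command above."""
    # re.DOTALL lets '.' match newlines (Vim's \_.), and '.*?' is the
    # non-greedy counterpart of Vim's \{-}.
    return re.sub(
        r"\{.*?\}",
        lambda m: m.group(0).replace("\n", ""),
        css,
        flags=re.DOTALL,
    )


css = "#main {\n  padding: 0;\n  margin: 10px auto;\n}\n"
print(join_css_rules(css))
```

As with the Vim version, only the newlines are removed, so any indentation inside the braces is kept as spaces.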
The inverse (converting the one-line version to multi-line) can also be done as a one liner:
:%s/{\_.\{-}}/\=substitute(submatch(0), '[{;]', '\0\r', 'g')/
If you are at the beginning or end of the rule, V%J will join it into a single line:
Go to the opening (or closing) brace
Hit V to enter visual mode
Hit % to match the other brace, selecting the whole rule
Hit J to join the lines
Try something like this:
:%s/{\n/{/g
:%s/;\n/;/g
:%s/{\s\+/{/g
:%s/;\s\+/;/g
This removes the newlines after opening braces and semicolons ('{' and ';') and then removes the extra whitespace between the concatenated lines.
If you want to change the file, go for rampion's solution.
If you don't want to (or can't) change the file, you can play with custom folding, since it lets you choose what to display for the folded text and how. For instance:
" {rtp}/fold/css-fold.vim
" [-- local settings --] {{{1
setlocal foldexpr=CssFold(v:lnum)
setlocal foldtext=CssFoldText()
let b:width1 = 20
let b:width2 = 15
nnoremap <buffer> + :let b:width2+=1<cr><c-l>
nnoremap <buffer> - :let b:width2-=1<cr><c-l>
" [-- global definitions --] {{{1
if exists('*CssFold')
setlocal foldmethod=expr
" finish
endif
function! CssFold(lnum)
let cline = getline(a:lnum)
if cline =~ '{\s*$'
return 'a1'
elseif cline =~ '}\s*$'
return 's1'
else
return '='
endif
endfunction
function! s:Complete(txt, width)
let length = strlen(a:txt)
if length > a:width
return a:txt
endif
return a:txt . repeat(' ', a:width - length)
endfunction
function! CssFoldText()
let lnum = v:foldstart
let txt = s:Complete(getline(lnum), b:width1)
let lnum += 1
while lnum < v:foldend
let add = s:Complete(substitute(getline(lnum), '^\s*\(\S\+\)\s*:\s*\(.\{-}\)\s*;\s*$', '\1: \2;', ''), b:width2)
if add !~ '^\s*$'
let txt .= ' ' . add
endif
let lnum += 1
endwhile
return txt. '}'
endfunction
I leave the sorting of the fields as an exercise. Hint: get all the lines between v:foldstart+1 and v:foldend in a List, sort the list, build the string, and that's all.
I won't answer the question directly, but instead I suggest you reconsider your needs. I think that your "bad" example is in fact the better one. It is more readable, easier to modify and reason about. Good indentation is very important not only when it comes to programming languages, but also in CSS and HTML.
You mention that CSS rules are "taking up too many lines". If you are worried about file size, you should consider using CSS and JS minifiers like YUI Compressor instead of making the code less readable.
A convenient way of doing this transformation is to run the following
short command:
:g/{/,/}/j
Go to the first line of the file, and use the command gqG to run the whole file through the formatter. This assumes that runs of nonempty lines should be collapsed throughout the file.
