Unexpected Behavior of QRegularExpression - qt

I've just started switching to QRegularExpression, and I'm using it to tokenize a string with multiple delimiter possibilities. I've encountered a surprising behavior, which seems to me to be a bug. I'm using Qt 5.5.1 on Windows.
Here's sample code:
#include <QRegularExpression>
#include <QString>
#include <QtDebug>
int main(int argc, char *argv[])
{
Q_UNUSED (argc);
Q_UNUSED (argv);
QRegularExpression regex ("^ ");
qDebug () << "Expected: " << QString ("M 100").indexOf(regex);
qDebug () << "NOT expected:" << QString ("M 100").indexOf(regex, 1);
qDebug () << "Expected: " << QString (" 100").indexOf(regex);
QRegularExpression regex1 (" ");
qDebug () << "Expected: " << QString ("M 100").indexOf(regex1);
}
And the output:
Expected: -1
NOT expected: -1
Expected: 0
Expected: 1
The use of the caret (^) when used with a starting position other than 0 in the "indexOf" call is preventing the expression from matching. Intuitively, I expected that the caret matches the string at the position that I specified. Instead, it simply never matches.
I'm going to switch my tokenizing to use splitRref to avoid this problem. While that's probably slightly cleaner anyway, I need to understand whether this is correct behavior or if I should be reporting a bug to Qt.
UPDATE: Using splitRef doesn't entirely solve my problem because I need to use a regular expression to detect if some tokens are floating point numbers, and I can't use a QRegularExpression with QStringRef. For that possibility, I have to convert my QStringRef token into an actual QString, which was what I was trying to avoid in the first place.

^ matches at the beginning of the subject string, or after a newline when in multiline mode. The offset does not alter these semantics. Hence, matching /^ / (in regex notation) against M 100 at offset 1 correctly results in no match.
Perhaps you want \G? From pcrepattern(3):
\G matches at the first matching position in the subject
The \G assertion is true only when the current matching position is at the start point of the match, as specified by the startoffset argument of pcre_exec(). It differs from \A when the value of startoffset is non-zero.
With that, this code:
QRegularExpression regex ("\\G ");
qDebug () << "Expected: " << QString ("M 100").indexOf(regex);
qDebug () << "NOT expected:" << QString ("M 100").indexOf(regex, 1);
qDebug () << "Expected: " << QString (" 100").indexOf(regex);
prints
Expected: -1
NOT expected: 1
Expected: 0

Related

Do QString's toUpper()/toLower() member functions only convert Latin alpha characters?

According to the documentation QString's toUpper() and toLower() member functions convert in the C-locale. Does it mean that only Posix Portable Character Set characters (Latin A-Z/a-z) are converted and any international unicode characters are left as-is?
We can easily test this with the simple test program here:
#include <QString>
#include <iostream>
int main()
{
QString s("БΣTest3φب");
std::cout << "String: " << s.toStdString() << std::endl;
std::cout << "Lower: " << s.toLower().toStdString() << std::endl;
std::cout << "Upper: " << s.toUpper().toStdString() << std::endl;
return 0;
}
Which returns:
09:48:48: Starting /home/tzig/test/test ...
String: БΣTest3φب
Lower: бσtest3φب
Upper: БΣTEST3Φب
09:48:49: /home/tzig/test/test exited with code 0
So we know that:
Cyrillic letters work fine
Latin letters work fine
Numbers are untouched
Greek letters work fine
Arabic letters don't have uppercase/lowercase so they are untouched
Feel free to test other Unicode characters with the test program.

How to get QString qDebug output as string?

Let's have a look at this little application:
#include <QString>
#include <QDebug>
int main(int argc, char *argv[]) {
const auto test_string =
QString{"Some string \n \x01 \u0002 with some \r special chars"};
qDebug() << test_string;
qDebug(qPrintable(test_string));
}
It gives the following output:
"Some string \n \u0001 \u0002 with some \r special chars"
Some string
special chars
Press <RETURN> to close this window...
This demonstrates how the qDebug << operator comes with some functionality that converts all the special characters of a QString to some readable string, which can easily be put in a string declaration in C++.
I would like to use this functionality to feed strings into a custom logging framework. Is there a possibility to use the same conversion function directly?
Effectively, this would mean to convert test_string to a QString instance that gives the same output on both the above qDebug statements.
I had the same question and I did not found the complete answer (yet). However, I found QVariant which allows you to call toString() on the most basic C and Qt types:
QVariant("foo").toString(); // "foo"
QVariant(true).toString(); // "true"
QVariant(QDateTime("2020-11-28")).toString(); // "2020-11-28"
Then you could wrap this into one method:
QString variantToString(const QVariant variant) {
return (variant.userType() != QMetaType::QString
&& variant.canConvert(QMetaType::QStringList))
? "(" + variant.toStringList().join(", ") + ")"
: variant.toString();
}
variantToString(42); // "42" // works due to implicit cast
You could do a few extra checks for non-stringifiable types (see also canConvert() and userType(), e.g. lists, regular expressions or whatever you need, but I totally agree that it would be nicer to reuse Qt's own logging functions instead ...
I had a similar issue but I wanted to use my custom operator<< overload that was defined for QDebug, therefore, I did the following:
// This could be anything that provides an 'operator<<' overload.
QString value = "Hello, world!";
QString result;
QDebug{ &result } << value;

QRegExp does not match even though regex101.com does

I need to extract some data from string with simple syntax. The syntax is this:
_IMPORT:[any text] - [HEX number] #[decimal number]
Therefore I created regex you can see below in the code:
//SYNTAX: _IMPORT:%1 - %2 #%3
static const QRegExp matchImportLink("^_IMPORT:(.*?) - ([A-Fa-f0-9]+) #([0-9]+)$");
QRegExp importLink(matchImportLink);
QString qtWtf(importLink.pattern());
const int index = importLink.indexIn(mappingName);
qDebug()<< "Input string: "<<mappingName;
qDebug()<< "Regular expression:"<<qtWtf;
qDebug()<< "Result: "<< index;
For some reason, that does not work, I get this output:
Input string: "_IMPORT:ddd - 92806f0f96a6dea91c37244128f7d00f #0"
Regular expression: "^_IMPORT:(.*?) - ([A-Fa-f0-9]+) #([0-9]+)$"
Result: -1
I even tried to remove the anchors ^ and $ but that didn't help and also is undesired. The annoying thing is that this regexp works perfectly if I copy the output in regex101.com, as you can see here: https://regex101.com/r/oT6cY3/1
Can anyone explain what is wrong here? Did I stumble upon Qt bug? I use Qt 5.6. Is there any workaround for this?
It seems like Qt does not recognize the quatifier *? as valid. Check the method QRegExp::isValid() againts your pattern. In my case it did not work because of this. And the documentation tells that any invalid pattern will never match.
So first thing I tried was skipping the ? which perfectly fits your provided string with all capturing groups. Here is my code.
QString str("_IMPORT:ddd - 92806f0f96a6dea91c37244128f7d00f #0");
QRegExp exp("^_IMPORT:(.*) - ([A-Fa-f0-9]+) #([0-9]+)$");
qDebug() << "pattern:" << exp.pattern();
qDebug() << "valid:" << exp.isValid();
int pos = 0;
while ((pos = exp.indexIn(str, pos)) != -1) {
for (int i = 1; i <= exp.captureCount(); ++i)
qDebug() << "pos:" << pos << "len:" << exp.matchedLength() << "val:" << exp.cap(i);
pos += exp.matchedLength();
}
And here is the resulting output.
pattern: "^_IMPORT:(.*) - ([A-Fa-f0-9]+) #([0-9]+)$"
valid: true
pos: 0 len: 49 val: "ddd"
pos: 0 len: 49 val: "92806f0f96a6dea91c37244128f7d00f"
pos: 0 len: 49 val: "0"
Tested using Qt 5.6.1.
Also note that you may set greedy evaluation using QRegExp::setMinimal(bool).

Pass a string from ECL to C++

I'm trying to get into the fascinating world of Common Lisp embedded in C++. My problem is that I can't manage to read and print from c++ a string returned by a lisp function defined in ECL.
In C++ I have this function to run arbitrary Lisp expressions:
cl_object lisp(const std::string & call) {
return cl_safe_eval(c_string_to_object(call.c_str()), Cnil, Cnil);
}
I can do it with a number in this way:
ECL:
(defun return-a-number () 5.2)
read and print in C++:
auto x = ecl_to_float(lisp("(return-a-number)"));
std::cout << "The number is " << x << std::endl;
Everything is set and works fine, but I don't know to do it with a string instead of a number. This is what I have tried:
ECL:
(defun return-a-string () "Hello")
C++:
cl_object y = lisp("(return-a-string)");
std::cout << "A string: " << y << std::endl;
And the result of printing the string is this:
A string: 0x3188b00
that I guess is the address of the string.
Here it is a capture of the debugger and the contents of the y cl_object. y->string.self type is an ecl_character.
Debug
(Starting from #coredump's answer that the string.self field provides the result.)
The string.self field is defined as type ecl_character* (ecl/object.h), which appears to be given in ecl/config.h as type int (although I suspect this is slightly platform dependent). Therefore, you will not be able to just print it as if it was a character array.
The way I found worked for me was to reinterpret it as a wchar_t (i.e. a unicode character). Unfortunately, I'm reasonably sure this isn't portable and depends both on how ecl is configured and the C++ compiler.
// basic check that this should work
static_assert(sizeof(ecl_character)==sizeof(wchar_t),"sizes must be the same");
std::wcout << "A string: " << reinterpret_cast<wchar_t*>(y->string.self) << std::endl;
// prints hello, as required
// note the use of wcout
The alternative is to use the lisp type base-string which does use char (base-char in lisp) as its character type. The lisp code then reads
(defun return-a-base-string ()
(coerce "Hello" 'base-string))
(there may be more elegant ways to do the conversion to base-string but I don't know them).
To print in C++
cl_object y2 = lisp("(return-a-base-string)");
std::cout << "Another: " << y2->base_string.self << std::endl;
(note that you can't mix wcout and cout in the same program)
According to section 2.6 Strings of The ECL Manual, I think that the actual character array is found by accessing the string.self field of the returned object. Can you try the following?
std::cout << y->string.self << std::endl;
std::string str {""};
cl_object y2 = lisp("(return-a-base-string)");
//get dimension
int j = y2->string.dim;
//get pointer
ecl_character* selv = y2->string.self;
//do simple pointer addition
for(int i=0;i<j;i++){
str += (*(selv+i));
}
//do whatever you want to str
this code works when the string is build from ecl_characters
from the documentation:
"ECL defines two C types to hold its characters: ecl_base_char and ecl_character.
When ECL is built without Unicode, they both coincide and typically match unsigned char, to cover the 256 codes that are needed.
When ECL is built with Unicode, the two types are no longer equivalent, with ecl_character being larger.
For your code to be portable and future proof, use both types to really express what you intend to do."
On my system the return-a-base-string is not needed, but I think it could be good to add for compatibility. I use the (ecl) embedded CLISP 16.1.2 version.
The following piece of code reads a string from lisp and converts to C++ strings types - std::string and c-string- and store them on C++ variables:
// strings initializations: string and c-string
std::string str2 {""};
char str_c[99] = " ";
// text read from clisp, whatever clisp function that returns string type
cl_object cl_text = lisp("(coerce (text-from-lisp X) 'base-string)");
//cl_object cl_text = lisp("(text-from-lisp X)"); // no base string conversions
// catch dimension
int cl_text_dim = cl_text->string.dim;
// complete c-string char by char
for(int ind=0;i<cl_text_dim;i++){
str_c[i] = ecl_char(cl_text,i); // ecl function to get char from cl_object
}
str_c[cl_text_dim] ='\0'; // end of the c-string
str2 = str_c; // get the string on the other string type
std::cout << "Dim: " << cl_ text_dim << " C-String var: " << str_c() << " String var << str2 << std::endl;
It is a slow process as passing char by char but it is the only way by the moment I know. Hope it helps. Greetings!

Assign pair of raw pointers returned by a function to unique_ptr

I've looked around a little bit but couldn't find an answer to this.
I have a function returning a pair of pointers to objects, the situation can be simplified to:
#include <iostream>
#include <utility>
#include <memory>
std::pair<int *, int *> shallow_copy()
{
int *i = new int;
int *j = new int;
*i = 5;
*j = 7;
return std::make_pair(i, j);
}
int main(int argc, char *argv[])
{
std::pair<int *, int *> my_pair = shallow_copy();
std::cout << "a = " << my_pair.first << " b = " << *my_pair.second << std::endl;
// This is just creating a newpointer:
std::unique_ptr<int> up(my_pair.first);
std::cout << "a = " << &up << std::endl;
delete my_pair.first;
delete my_pair.second;
return 0;
}
I cannot change the return value of the function. From std::cout << "a = " << &up << std::endl; I can see that the address of the smart pointer is different from the address of the raw pointer.
Is there a way to capture tha std::pair returned by the function in a std::unique_ptr and prevent memory leaks without calling delete explicitly?
NB: The question have been edited to better state the problem and make me look smarter!
You're doing it the right way, but testing it the wrong one. You're comparing the address in first with the address of up. If you print up.get() instead (the address stored in up), you'll find they're equal.
In addition, your code has a double-delete problem. You do delete my_pair.first;, which deallocates the memory block pointed to by my_pair.first and also by up. Then, the destructor of up will deallocate it again when up goes out of scope, resulting in a double delete.
You also asked how to capture both pointers in smart pointers. Since the constructor of std::unique_ptr taking a raw pointer is explicit, you cannot directly do this with a simple std::pair<std::unique_ptr<int>, std::unique_ptr<int>>. You can use a helper function, though:
std::pair<std::unique_ptr<int>, std::unique_ptr<int>> wrapped_shallow_copy()
{
auto orig = shallow_copy();
std::pair<std::unique_ptr<int>, std::unique_ptr<int>> result;
result.first.reset(orig.first);
result.second.reset(orig.second);
return result;
}
Now, use wrapped_shallow_copy() instead of shallow_copy() and you will never leak memory from the call.

Resources