QRegExp does not match even though regex101.com does - qt

I need to extract some data from a string with a simple syntax. The syntax is this:
_IMPORT:[any text] - [HEX number] #[decimal number]
So I wrote the regex you can see in the code below:
//SYNTAX: _IMPORT:%1 - %2 #%3
static const QRegExp matchImportLink("^_IMPORT:(.*?) - ([A-Fa-f0-9]+) #([0-9]+)$");
QRegExp importLink(matchImportLink);
QString qtWtf(importLink.pattern());
const int index = importLink.indexIn(mappingName);
qDebug()<< "Input string: "<<mappingName;
qDebug()<< "Regular expression:"<<qtWtf;
qDebug()<< "Result: "<< index;
For some reason, that does not work, I get this output:
Input string: "_IMPORT:ddd - 92806f0f96a6dea91c37244128f7d00f #0"
Regular expression: "^_IMPORT:(.*?) - ([A-Fa-f0-9]+) #([0-9]+)$"
Result: -1
I even tried removing the anchors ^ and $, but that didn't help, and I want to keep them anyway. The annoying thing is that this regexp works perfectly if I copy the output into regex101.com, as you can see here: https://regex101.com/r/oT6cY3/1
Can anyone explain what is wrong here? Did I stumble upon a Qt bug? I use Qt 5.6. Is there any workaround for this?

It seems that QRegExp does not recognize the quantifier *? as valid. Check your pattern with QRegExp::isValid(). In my case that was the reason it did not work, and the documentation says that an invalid pattern will never match.
So the first thing I tried was dropping the ?, which still matches your provided string with all capturing groups. Here is my code.
QString str("_IMPORT:ddd - 92806f0f96a6dea91c37244128f7d00f #0");
QRegExp exp("^_IMPORT:(.*) - ([A-Fa-f0-9]+) #([0-9]+)$");
qDebug() << "pattern:" << exp.pattern();
qDebug() << "valid:" << exp.isValid();
int pos = 0;
while ((pos = exp.indexIn(str, pos)) != -1) {
    for (int i = 1; i <= exp.captureCount(); ++i)
        qDebug() << "pos:" << pos << "len:" << exp.matchedLength() << "val:" << exp.cap(i);
    pos += exp.matchedLength();
}
And here is the resulting output.
pattern: "^_IMPORT:(.*) - ([A-Fa-f0-9]+) #([0-9]+)$"
valid: true
pos: 0 len: 49 val: "ddd"
pos: 0 len: 49 val: "92806f0f96a6dea91c37244128f7d00f"
pos: 0 len: 49 val: "0"
Tested using Qt 5.6.1.
Also note that you can switch between greedy and non-greedy (minimal) matching using QRegExp::setMinimal(bool).
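For what it's worth, dropping the ? cannot change the captures for this particular input: the anchors and the literal " - " and " #" separators leave only one way to split the string. Below is a minimal sketch that checks this outside Qt, using std::regex (ECMAScript grammar, which does accept *?). captureImportLink is a hypothetical helper name, and this says nothing about how QRegExp itself parses the pattern.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the captured groups of a full match,
// or an empty vector if the pattern does not match the input.
std::vector<std::string> captureImportLink(const std::string& pattern,
                                           const std::string& input) {
    const std::regex re(pattern);  // ECMAScript grammar by default
    std::smatch m;
    std::vector<std::string> groups;
    if (std::regex_match(input, m, re)) {
        // m[0] is the whole match; the capture groups start at index 1.
        for (std::size_t i = 1; i < m.size(); ++i)
            groups.push_back(m[i].str());
    }
    return groups;
}
```

Calling it once with the lazy pattern and once with the greedy one on the input from the question should give the same three groups.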

Related

QRegularExpression: how to get the failing position?

I guess this must have been asked before, but I cannot find anything about it.
I also think that maybe the answer is right there and I just can't see it.
So, if QRegularExpression::match() does not find a match, how do I know the position of the character that made the validation fail?
I'm pretty sure that internally, there should be some variable storing the "current position" as the string is being evaluated.
Yes, maybe there is backtracking in that evaluation so if the exact failing char is hard to get, at least the last good one could be easier.
Any hints? Thank you.
Edit (2022-08-08):
I'm starting to feel like it's possible that no one asked this before, in fact, considering how people think I am asking something like "why my regex does not work". Not my case.
This is not about a particular regular expression. It's about Qt's class QRegularExpression.
I apologize if I've not been clear. I've tried to explain the best I could since the very beginning.
Anyway, let's say you have one string, to be evaluated against some (ANY) regex. No match is found. Then I want to know, if possible, the point where the evaluation failed.
This regex: "abc"
This string: "abd", failing position: 2
This regex: "abc"
This string: "acb", failing position: 1
This regex: "abc"
This string: "xyz", failing position: 0
I feel very stupid asking this, mostly because I think it's a very basic question.
But it's not what you immediately think at first glance. I swear I searched for answers as much as I could, but everything I found was about errors in the regexes themselves.
I hate this, but it works.
int getFailingPosition(QString sRegEx, QString sText) {
    int iResult;
    QRegularExpression rxRegEx;
    QRegularExpressionMatch rxmMatch;
    rxRegEx.setPattern(QRegularExpression::anchoredPattern(sRegEx));
    for (iResult = sText.length(); iResult > 0; iResult--) {
        rxmMatch = rxRegEx.match(sText);
        if (rxmMatch.hasMatch())
            break;
        else {
            rxmMatch = rxRegEx.match(
                sText,
                0,
                QRegularExpression::MatchType::PartialPreferCompleteMatch
            );
            if (rxmMatch.hasPartialMatch())
                break;
        }
        sText.chop(1);
    }
    return iResult;
}
Tests:
#define REGEX_USA_ZIPCODE "\\d{4}?\\d$|^\\d{4}?\\d-\\d{4}"
#define REGEX_SIGNED_NUMBER "[-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?"
#define REGEX_ISO8601_DATE "\\d{4}-(0[1-9]|1[012])-(0[1-9]|[12]\\d|3[0-1])"
#define REGEX_USA_PHONE "\\(?\\d{1,3}?\\)?[-.\\s]?\\d{1,4}[-.\\s]?\\d{1,4}[-.\\s]?\\d{1,9}"
qDebug() << getFailingPosition("abc","abcd"); // 3
qDebug() << getFailingPosition("abc","abd"); // 2
qDebug() << getFailingPosition("abc","acb"); // 1
qDebug() << getFailingPosition("abc","xyz"); // 0
qDebug() << getFailingPosition("abc","x"); // 0
qDebug() << getFailingPosition("abc",""); // 0
qDebug() << getFailingPosition("abc","a"); // 1
qDebug() << getFailingPosition("abc","ab"); // 2
qDebug() << getFailingPosition(REGEX_USA_ZIPCODE,"12345-1"); // 7 (missing chars)
qDebug() << getFailingPosition(REGEX_SIGNED_NUMBER,"-0.123e"); // 7 (missing chars)
qDebug() << getFailingPosition(REGEX_ISO8601_DATE,"2021-23-31"); // 5 (unexpected char)
qDebug() << getFailingPosition(REGEX_USA_PHONE,"202-3(24)-3000"); // 5 (unexpected char)
getFailingPosition() should be called only after we're sure there is no match; otherwise it returns the string length, giving the wrong idea that something is missing.
This should have a built-in function...

QString remove last characters

How do I remove /Job from /home/admin/job0/Job?
QString name = "/home/admin/job0/Job";
I want to remove everything from the last "/" onward.
You have QString::chop() for the case when you already know how many characters to remove.
It is the same as QString::remove(), except that it works from the back of the string.
Find the last slash with QString::lastIndexOf.
After that, get the substring up to the position of the last slash with QString::left:
QString name = "/home/admin/job0/Job";
int pos = name.lastIndexOf(QChar('/'));
qDebug() << name.left(pos);
This will print:
"/home/admin/job0"
You should check pos for -1 to be sure the slash was found at all.
To include the last slash in the output, add 1 to the found position:
qDebug() << name.left(pos+1);
Will output:
"/home/admin/job0/"
The easiest version for later readers to understand would probably be:
QString s("/home/admin/job0/Job");
s.truncate(s.lastIndexOf(QChar('/')));
qDebug() << s;
as the code literally says what you intended.
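Both variants above rely on lastIndexOf actually finding a slash. Here is a minimal sketch of the same last-separator logic with the not-found check made explicit, written with std::string so it compiles without Qt. dirPart is a hypothetical helper name, and std::string::npos plays the role of the -1 that QString::lastIndexOf returns.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns everything before the last '/'.
// If keepSlash is true, the trailing '/' is kept in the result.
// If the path contains no '/', it is returned unchanged.
std::string dirPart(const std::string& path, bool keepSlash = false) {
    const std::size_t pos = path.rfind('/');
    if (pos == std::string::npos)
        return path;  // no separator found, nothing to cut
    return path.substr(0, keepSlash ? pos + 1 : pos);
}
```

Returning the input unchanged when there is no slash is just one possible policy; returning an empty string would be equally defensible, depending on what the caller expects.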
You can do something like this:
QString s("/home/admin/job0/Job");
s.remove(QRegularExpression("\\/(?:.(?!\\/))+$"));
// s is "/home/admin/job0" now
If you are using Qt 6 or later and are sure that the string contains "/", you can use QString::first(qsizetype n) instead of QString::left(qsizetype n).
Example:
QString url = "/home/admin/job0/Job";
QString result = url.first(url.lastIndexOf(QChar('/')));
If you run this code:
QElapsedTimer timer;
timer.start();
for (int j=0; j<10000000; j++)
{
QString name = "/home/admin/job0/Job";
int pos = name.lastIndexOf("/");
name.left(pos);
}
qDebug() << "left method" << timer.elapsed() << "milliseconds";
timer.start();
for (int j=0; j<10000000; j++)
{
QString name = "/home/admin/job0/Job";
int pos = name.lastIndexOf(QChar('/'));
name.first(pos);
}
qDebug() << "first method" << timer.elapsed() << "milliseconds";
Results:
left method 10034 milliseconds
first method 8098 milliseconds
Sorry for replying to this post after 4 years, but I have (I think) the most efficient answer.
You can use
qstr.remove(0, 1); //removes the first character
qstr.remove(qstr.size() - 1, 1); //removes the last character
That's everything you have to do to delete characters ONE BY ONE (first or last) from a QString.

Unexpected Behavior of QRegularExpression

I've just started switching to QRegularExpression, and I'm using it to tokenize a string with multiple delimiter possibilities. I've encountered a surprising behavior, which seems to me to be a bug. I'm using Qt 5.5.1 on Windows.
Here's sample code:
#include <QRegularExpression>
#include <QString>
#include <QtDebug>
int main(int argc, char *argv[])
{
Q_UNUSED (argc);
Q_UNUSED (argv);
QRegularExpression regex ("^ ");
qDebug () << "Expected: " << QString ("M 100").indexOf(regex);
qDebug () << "NOT expected:" << QString ("M 100").indexOf(regex, 1);
qDebug () << "Expected: " << QString (" 100").indexOf(regex);
QRegularExpression regex1 (" ");
qDebug () << "Expected: " << QString ("M 100").indexOf(regex1);
}
And the output:
Expected: -1
NOT expected: -1
Expected: 0
Expected: 1
The use of the caret (^) when used with a starting position other than 0 in the "indexOf" call is preventing the expression from matching. Intuitively, I expected that the caret matches the string at the position that I specified. Instead, it simply never matches.
I'm going to switch my tokenizing to use splitRef to avoid this problem. While that's probably slightly cleaner anyway, I need to understand whether this is correct behavior or if I should be reporting a bug to Qt.
UPDATE: Using splitRef doesn't entirely solve my problem because I need to use a regular expression to detect if some tokens are floating point numbers, and I can't use a QRegularExpression with QStringRef. For that possibility, I have to convert my QStringRef token into an actual QString, which was what I was trying to avoid in the first place.
^ matches at the beginning of the subject string, or after a newline when in multiline mode. The offset does not alter these semantics. Hence, matching /^ / (in regex notation) against M 100 at offset 1 correctly results in no match.
Perhaps you want \G? From pcrepattern(3):
\G matches at the first matching position in the subject
The \G assertion is true only when the current matching position is at the start point of the match, as specified by the startoffset argument of pcre_exec(). It differs from \A when the value of startoffset is non-zero.
With that, this code:
QRegularExpression regex ("\\G ");
qDebug () << "Expected: " << QString ("M 100").indexOf(regex);
qDebug () << "NOT expected:" << QString ("M 100").indexOf(regex, 1);
qDebug () << "Expected: " << QString (" 100").indexOf(regex);
prints
Expected: -1
NOT expected: 1
Expected: 0

Pass a string from ECL to C++

I'm trying to get into the fascinating world of Common Lisp embedded in C++. My problem is that I can't manage to read and print from c++ a string returned by a lisp function defined in ECL.
In C++ I have this function to run arbitrary Lisp expressions:
cl_object lisp(const std::string & call) {
return cl_safe_eval(c_string_to_object(call.c_str()), Cnil, Cnil);
}
I can do it with a number in this way:
ECL:
(defun return-a-number () 5.2)
read and print in C++:
auto x = ecl_to_float(lisp("(return-a-number)"));
std::cout << "The number is " << x << std::endl;
Everything is set and works fine, but I don't know how to do it with a string instead of a number. This is what I have tried:
ECL:
(defun return-a-string () "Hello")
C++:
cl_object y = lisp("(return-a-string)");
std::cout << "A string: " << y << std::endl;
And the result of printing the string is this:
A string: 0x3188b00
that I guess is the address of the string.
In the debugger, the contents of the y cl_object show that y->string.self is of type ecl_character.
(Starting from #coredump's answer that the string.self field provides the result.)
The string.self field is defined as type ecl_character* (ecl/object.h), which appears to be given in ecl/config.h as type int (although I suspect this is slightly platform dependent). Therefore, you will not be able to just print it as if it was a character array.
The way I found worked for me was to reinterpret it as a wchar_t (i.e. a unicode character). Unfortunately, I'm reasonably sure this isn't portable and depends both on how ecl is configured and the C++ compiler.
// basic check that this should work
static_assert(sizeof(ecl_character)==sizeof(wchar_t),"sizes must be the same");
std::wcout << "A string: " << reinterpret_cast<wchar_t*>(y->string.self) << std::endl;
// prints hello, as required
// note the use of wcout
The alternative is to use the lisp type base-string which does use char (base-char in lisp) as its character type. The lisp code then reads
(defun return-a-base-string ()
(coerce "Hello" 'base-string))
(there may be more elegant ways to do the conversion to base-string but I don't know them).
To print in C++
cl_object y2 = lisp("(return-a-base-string)");
std::cout << "Another: " << y2->base_string.self << std::endl;
(note that you can't mix wcout and cout in the same program)
According to section 2.6 Strings of The ECL Manual, I think that the actual character array is found by accessing the string.self field of the returned object. Can you try the following?
std::cout << y->string.self << std::endl;
std::string str {""};
cl_object y2 = lisp("(return-a-base-string)");
//get dimension
int j = y2->string.dim;
//get pointer
ecl_character* selv = y2->string.self;
//do simple pointer addition
for(int i=0;i<j;i++){
str += (*(selv+i));
}
//do whatever you want to str
This code works when the string is built from ecl_characters.
From the documentation:
"ECL defines two C types to hold its characters: ecl_base_char and ecl_character.
When ECL is built without Unicode, they both coincide and typically match unsigned char, to cover the 256 codes that are needed.
When ECL is built with Unicode, the two types are no longer equivalent, with ecl_character being larger.
For your code to be portable and future proof, use both types to really express what you intend to do."
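One caveat about the loop above: appending each ecl_character directly to a std::string only preserves characters that fit in a char. A minimal sketch of an alternative, assuming each element holds one Unicode code point (an assumption about the Unicode build, not something the quoted documentation states), is to encode the code points as UTF-8. codePointsToUtf8 is a hypothetical helper, templated so that it does not depend on the exact integer type behind ecl_character.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: encodes n code points as a UTF-8 std::string.
// Surrogates and values above 0x10FFFF are replaced with U+FFFD.
template <typename CodePoint>
std::string codePointsToUtf8(const CodePoint* data, std::size_t n) {
    std::string out;
    for (std::size_t i = 0; i < n; ++i) {
        std::uint32_t c = static_cast<std::uint32_t>(data[i]);
        if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
            c = 0xFFFD;  // not a valid scalar value
        if (c < 0x80) {                      // 1 byte
            out += static_cast<char>(c);
        } else if (c < 0x800) {              // 2 bytes
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {            // 3 bytes
            out += static_cast<char>(0xE0 | (c >> 12));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        } else {                             // 4 bytes
            out += static_cast<char>(0xF0 | (c >> 18));
            out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

With an ECL string object this would presumably be called as codePointsToUtf8(y->string.self, y->string.dim), mirroring the fields used in the loop above.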
On my system the return-a-base-string is not needed, but I think it could be good to add for compatibility. I use the (ecl) embedded CLISP 16.1.2 version.
The following piece of code reads a string from Lisp, converts it to the C++ string types (std::string and C string) and stores the results in C++ variables:
// strings initializations: string and c-string
std::string str2 {""};
char str_c[99] = " ";
// text read from clisp, whatever clisp function that returns string type
cl_object cl_text = lisp("(coerce (text-from-lisp X) 'base-string)");
//cl_object cl_text = lisp("(text-from-lisp X)"); // no base string conversions
// catch dimension
int cl_text_dim = cl_text->string.dim;
// complete c-string char by char
for(int i=0;i<cl_text_dim;i++){
str_c[i] = ecl_char(cl_text,i); // ecl function to get char from cl_object
}
str_c[cl_text_dim] ='\0'; // end of the c-string
str2 = str_c; // get the string on the other string type
std::cout << "Dim: " << cl_text_dim << " C-String var: " << str_c << " String var: " << str2 << std::endl;
It is a slow process, since it passes the string char by char, but it is the only way I know at the moment. Hope it helps. Greetings!

Peek on QTextStream

I would like to peek the next characters of a QTextStream reading a QFile, in order to create an efficient tokenizer.
However, I don't find any satisfying solution to do so.
QFile f("test.txt");
f.open(QIODevice::WriteOnly);
f.write("Hello world\nHello universe\n");
f.close();
f.open(QIODevice::ReadOnly);
QTextStream s(&f);
int i = 0;
while (!s.atEnd()) {
++i;
qDebug() << "Peek" << i << s.device()->peek(3);
QString v;
s >> v;
qDebug() << "Word" << i << v;
}
Gives the following output:
Peek 1 "Hel" # it works only the first time
Word 1 "Hello"
Peek 2 ""
Word 2 "world"
Peek 3 ""
Word 3 "Hello"
Peek 4 ""
Word 4 "universe"
Peek 5 ""
Word 5 ""
I tried several implementations, also with QTextStream::pos() and QTextStream::seek(). It works better, but pos() is buggy (returns -1 when the file is too big).
Does anyone have a solution to this recurrent problem? Thank you in advance.
You peek from QIODevice, but then you read from QTextStream, that's why peek works only once. Try this:
while (!s.atEnd()) {
++i;
qDebug() << "Peek" << i << s.device()->peek(3);
QByteArray v = s.device()->readLine ();
qDebug() << "Word" << i << v;
}
Unfortunately, QIODevice does not support reading single words, so you would have to do it yourself with a combination of peek and read.
Try disabling QTextStream::autoDetectUnicode. It may read ahead on the device to perform detection and cause your problem.
Also set a codec, just in case.
Add s.device()->pos() and s.device()->bytesAvailable() to the logs to verify that.
I've checked the QTextStream code. It looks like it always caches as much data as possible and there is no way to disable this behavior. I was expecting it to use peek on the device, but it only reads greedily. The bottom line is that you can't use QTextStream and peek on the device at the same time.

Resources