QRegularExpression: how to get the failing position? - qt

I guess this must have been asked before, but I cannot find anything about it.
I also think the answer may be right in front of me, but I can't see it either.
So, if QRegularExpression::match() does not find a match, how do I know the position of the character that made the validation fail?
I'm pretty sure that internally there must be some variable storing the "current position" as the string is being evaluated.
Yes, there may be backtracking during that evaluation, so if the exact failing character is hard to get, at least the last good one should be easier.
Any hints? Thank you.
Edit (2022-08-08):
I'm starting to think that perhaps no one has asked this before after all, considering that people assume I am asking something like "why does my regex not work". That is not my case.
This is not about a particular regular expression. It's about Qt's class QRegularExpression.
I apologize if I've not been clear. I've tried to explain the best I could since the very beginning.
Anyway, let's say you have one string, to be evaluated against some (ANY) regex. No match is found. Then I want to know, if possible, the point where the evaluation failed.
This regex: "abc"
This string: "abd", failing position: 2
This regex: "abc"
This string: "acb", failing position: 1
This regex: "abc"
This string: "xyz", failing position: 0
I feel very stupid asking this, mostly because I think it's a very basic question.
But it's not what you might think at first glance. I swear I searched for answers as much as I could, but everything I found was about errors in the regexes themselves.
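To pin down what "failing position" means in the "abc" examples above: for a purely literal pattern it is simply the index of the first character that differs. The following is a minimal sketch in plain standard C++ (no Qt), and literalFailingPosition is a hypothetical name. It only covers literal patterns and does not extend to real regular expressions, which is exactly why the question is hard.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not a Qt API): index of the first character where
// `text` stops agreeing with the literal pattern `literal`.
std::size_t literalFailingPosition(const std::string& literal, const std::string& text) {
    std::size_t i = 0;
    // Advance while both strings still have characters and they agree.
    while (i < literal.size() && i < text.size() && literal[i] == text[i])
        ++i;
    return i;
}

// The three examples from the question.
void demoLiteralFailingPosition() {
    assert(literalFailingPosition("abc", "abd") == 2);
    assert(literalFailingPosition("abc", "acb") == 1);
    assert(literalFailingPosition("abc", "xyz") == 0);
}
```

If the text runs out before the pattern does (for example "ab"), the function returns the text length, meaning "more characters were expected here".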

I hate this, but it works.
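int getFailingPosition(QString sRegEx,QString sText) {
    int iResult;
    QRegularExpression rxRegEx;
    QRegularExpressionMatch rxmMatch;
    // Anchor the pattern so that only matches covering the whole text count.
    rxRegEx.setPattern(QRegularExpression::anchoredPattern(sRegEx));
    // Shorten the text from the end until what remains is a complete or partial match.
    for(iResult=sText.length();iResult>0;iResult--) {
        rxmMatch=rxRegEx.match(sText);
        if(rxmMatch.hasMatch())
            break;
        else {
            rxmMatch=rxRegEx.match(
                sText,
                0,
                QRegularExpression::MatchType::PartialPreferCompleteMatch
            );
            if(rxmMatch.hasPartialMatch())
                break;
        }
        // Neither a complete nor a partial match: drop the last character and retry.
        sText.chop(1);
    }
    // Length of the longest prefix that could still lead to a match.
    return iResult;
}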
Tests:
#define REGEX_USA_ZIPCODE "\\d{4}?\\d$|^\\d{4}?\\d-\\d{4}"
#define REGEX_SIGNED_NUMBER "[-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?"
#define REGEX_ISO8601_DATE "\\d{4}-(0[1-9]|1[012])-(0[1-9]|[12]\\d|3[0-1])"
#define REGEX_USA_PHONE "\\(?\\d{1,3}?\\)?[-.\\s]?\\d{1,4}[-.\\s]?\\d{1,4}[-.\\s]?\\d{1,9}"
qDebug() << getFailingPosition("abc","abcd"); // 3
qDebug() << getFailingPosition("abc","abd"); // 2
qDebug() << getFailingPosition("abc","acb"); // 1
qDebug() << getFailingPosition("abc","xyz"); // 0
qDebug() << getFailingPosition("abc","x"); // 0
qDebug() << getFailingPosition("abc",""); // 0
qDebug() << getFailingPosition("abc","a"); // 1
qDebug() << getFailingPosition("abc","ab"); // 2
qDebug() << getFailingPosition(REGEX_USA_ZIPCODE,"12345-1"); // 7 (missing chars)
qDebug() << getFailingPosition(REGEX_SIGNED_NUMBER,"-0.123e"); // 7 (missing chars)
qDebug() << getFailingPosition(REGEX_ISO8601_DATE,"2021-23-31"); // 5 (unexpected char)
qDebug() << getFailingPosition(REGEX_USA_PHONE,"202-3(24)-3000"); // 5 (unexpected char)
getFailingPosition() should be called only after we're sure there is no match; otherwise it returns the string length, which wrongly suggests that something is missing at the end.
This should have a built-in function...

Related

Unexpected Behavior of QRegularExpression

I've just started switching to QRegularExpression, and I'm using it to tokenize a string with multiple delimiter possibilities. I've encountered a surprising behavior, which seems to me to be a bug. I'm using Qt 5.5.1 on Windows.
Here's sample code:
#include <QRegularExpression>
#include <QString>
#include <QtDebug>
int main(int argc, char *argv[])
{
    Q_UNUSED (argc);
    Q_UNUSED (argv);
    QRegularExpression regex ("^ ");
    qDebug () << "Expected: " << QString ("M 100").indexOf(regex);
    qDebug () << "NOT expected:" << QString ("M 100").indexOf(regex, 1);
    qDebug () << "Expected: " << QString (" 100").indexOf(regex);
    QRegularExpression regex1 (" ");
    qDebug () << "Expected: " << QString ("M 100").indexOf(regex1);
}
And the output:
Expected: -1
NOT expected: -1
Expected: 0
Expected: 1
The use of the caret (^) when used with a starting position other than 0 in the "indexOf" call is preventing the expression from matching. Intuitively, I expected that the caret matches the string at the position that I specified. Instead, it simply never matches.
I'm going to switch my tokenizing to use splitRef to avoid this problem. While that's probably slightly cleaner anyway, I need to understand whether this is correct behavior or whether I should report a bug to Qt.
UPDATE: Using splitRef doesn't entirely solve my problem because I need to use a regular expression to detect if some tokens are floating point numbers, and I can't use a QRegularExpression with QStringRef. For that possibility, I have to convert my QStringRef token into an actual QString, which was what I was trying to avoid in the first place.
^ matches at the beginning of the subject string, or after a newline when in multiline mode. The offset does not alter these semantics. Hence, matching /^ / (in regex notation) against M 100 at offset 1 correctly results in no match.
Perhaps you want \G? From pcrepattern(3):
\G matches at the first matching position in the subject
The \G assertion is true only when the current matching position is at the start point of the match, as specified by the startoffset argument of pcre_exec(). It differs from \A when the value of startoffset is non-zero.
With that, this code:
QRegularExpression regex ("\\G ");
qDebug () << "Expected: " << QString ("M 100").indexOf(regex);
qDebug () << "NOT expected:" << QString ("M 100").indexOf(regex, 1);
qDebug () << "Expected: " << QString (" 100").indexOf(regex);
prints
Expected: -1
NOT expected: 1
Expected: 0

QRegExp does not match even though regex101.com does

I need to extract some data from string with simple syntax. The syntax is this:
_IMPORT:[any text] - [HEX number] #[decimal number]
Therefore I created regex you can see below in the code:
//SYNTAX: _IMPORT:%1 - %2 #%3
static const QRegExp matchImportLink("^_IMPORT:(.*?) - ([A-Fa-f0-9]+) #([0-9]+)$");
QRegExp importLink(matchImportLink);
QString qtWtf(importLink.pattern());
const int index = importLink.indexIn(mappingName);
qDebug()<< "Input string: "<<mappingName;
qDebug()<< "Regular expression:"<<qtWtf;
qDebug()<< "Result: "<< index;
For some reason, that does not work, I get this output:
Input string: "_IMPORT:ddd - 92806f0f96a6dea91c37244128f7d00f #0"
Regular expression: "^_IMPORT:(.*?) - ([A-Fa-f0-9]+) #([0-9]+)$"
Result: -1
I even tried to remove the anchors ^ and $ but that didn't help and also is undesired. The annoying thing is that this regexp works perfectly if I copy the output in regex101.com, as you can see here: https://regex101.com/r/oT6cY3/1
Can anyone explain what is wrong here? Did I stumble upon a Qt bug? I use Qt 5.6. Is there any workaround for this?
It seems that QRegExp does not recognize the quantifier *? as valid. Check your pattern with QRegExp::isValid(). In my case the pattern did not work for this reason, and the documentation says that an invalid pattern never matches.
So the first thing I tried was dropping the ?, which fits your provided string perfectly, with all capturing groups. Here is my code.
QString str("_IMPORT:ddd - 92806f0f96a6dea91c37244128f7d00f #0");
QRegExp exp("^_IMPORT:(.*) - ([A-Fa-f0-9]+) #([0-9]+)$");
qDebug() << "pattern:" << exp.pattern();
qDebug() << "valid:" << exp.isValid();
int pos = 0;
while ((pos = exp.indexIn(str, pos)) != -1) {
    for (int i = 1; i <= exp.captureCount(); ++i)
        qDebug() << "pos:" << pos << "len:" << exp.matchedLength() << "val:" << exp.cap(i);
    pos += exp.matchedLength();
}
And here is the resulting output.
pattern: "^_IMPORT:(.*) - ([A-Fa-f0-9]+) #([0-9]+)$"
valid: true
pos: 0 len: 49 val: "ddd"
pos: 0 len: 49 val: "92806f0f96a6dea91c37244128f7d00f"
pos: 0 len: 49 val: "0"
Tested using Qt 5.6.1.
Also note that you can enable non-greedy (minimal) matching for the whole pattern with QRegExp::setMinimal(true).
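Why does dropping the ? not change the captures here? Because the greedy .* backtracks until the rest of the pattern fits, and this input contains only one " - " separator. The sketch below checks that with std::regex (ECMAScript grammar), which is used purely as a stand-in engine that accepts both forms; it is not QRegExp, and captures() is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns capture groups 1..n of the first match,
// or an empty vector if the pattern does not match.
std::vector<std::string> captures(const std::string& pattern, const std::string& text) {
    std::vector<std::string> groups;
    const std::regex re(pattern);
    std::smatch m;
    if (std::regex_search(text, m, re)) {
        for (std::size_t i = 1; i < m.size(); ++i)
            groups.push_back(m[i].str());
    }
    return groups;
}

// Greedy and lazy forms of the pattern from the question, on the same input.
void demoGreedyVsLazy() {
    const std::string text = "_IMPORT:ddd - 92806f0f96a6dea91c37244128f7d00f #0";
    const std::vector<std::string> greedy =
        captures(R"(^_IMPORT:(.*) - ([A-Fa-f0-9]+) #([0-9]+)$)", text);
    const std::vector<std::string> lazy =
        captures(R"(^_IMPORT:(.*?) - ([A-Fa-f0-9]+) #([0-9]+)$)", text);
    assert(greedy == lazy);
    assert(greedy.size() == 3);
    assert(greedy[0] == "ddd");
}
```

The pattern is anchored on both ends, so every capture is pinned by the surrounding literals and the choice between greedy and lazy matching only affects the search order, not the result for this input.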

Error on adding to empty comboBox in Qt

I use Qt 5.2.0 (MSVC 2010).
I added to my form in Qt a ComboBox.
Then I want to fill it with numbers:
for (i = 0; i < n; i++){
    ui->tableCombo->addItem(QString::number(i));
}
When I add a first element directly in the form, the loop adds the numbers successfully. But when I leave the combo box empty, it fails with an error:
ASSERT failure in QVector::operator[]: "index out of range"
The debugger shows that the error occurred on exactly this line, and there is no QVector anywhere on that line.
After adding qDebug() calls:
qDebug() << "readFileToStringList: msg10";
for (i = 0; i < n; i++){
    qDebug() << "readFileToStringList: msg20 i = " << i;
    ui->tableCombo->addItem(QString::number(i+1));
    qDebug() << "readFileToStringList: msg30";
}
qDebug() << "readFileToStringList: msg40";
I get the same result
readFileToStringList: msg10
readFileToStringList: msg20 i = 0
ASSERT failure in QVector<T>::operator[]: "index out of range", file C:\Qt\Qt5.2.0\5.2.0\mingw48_32\include/QtCore/qvector.h, line 369
I had this exact problem and couldn't figure it out for a couple of hours. I realized addItem() was triggering the currentIndexChanged(int) signal, which I had connected to a function that was causing an out-of-range error in a container.
I would say that was possibly the problem here too, but I'm sure the OP has moved on since then. To me it isn't exactly intuitive that currentIndexChanged would be emitted when new items are inserted, but adding the first item to an empty combo box moves the current index from -1 to 0.
Hopefully if anyone else gets tripped up this will help them!
addItem() doesn't throw that error! I'm positive it's coming from another instruction in your code.
The Qt documentation has an entire section on Debugging Techniques, but if you are afraid of debuggers you can use the poor man's debugger: put several qDebug() messages before and after the instructions you think are responsible for the problem:
qDebug() << "methodX: msg10";
for (i = 0; i < n; i++){
    qDebug() << "methodX: msg20 i = " << i;
    ui->tableCombo->addItem(QString::number(i));
    qDebug() << "methodX: msg30";
}
qDebug() << "methodX: msg40";
If the message methodX: msg30 gets printed to the screen, it means that addItem() didn't cause the error.

Peek on QTextStream

I would like to peek the next characters of a QTextStream reading a QFile, in order to create an efficient tokenizer.
However, I don't find any satisfying solution to do so.
QFile f("test.txt");
f.open(QIODevice::WriteOnly);
f.write("Hello world\nHello universe\n");
f.close();
f.open(QIODevice::ReadOnly);
QTextStream s(&f);
int i = 0;
while (!s.atEnd()) {
    ++i;
    qDebug() << "Peek" << i << s.device()->peek(3);
    QString v;
    s >> v;
    qDebug() << "Word" << i << v;
}
Gives the following output:
Peek 1 "Hel" # it works only the first time
Word 1 "Hello"
Peek 2 ""
Word 2 "world"
Peek 3 ""
Word 3 "Hello"
Peek 4 ""
Word 4 "universe"
Peek 5 ""
Word 5 ""
I tried several implementations, also with QTextStream::pos() and QTextStream::seek(). It works better, but pos() is buggy (returns -1 when the file is too big).
Does anyone have a solution to this recurrent problem? Thank you in advance.
You peek from QIODevice, but then you read from QTextStream, that's why peek works only once. Try this:
while (!s.atEnd()) {
    ++i;
    qDebug() << "Peek" << i << s.device()->peek(3);
    QByteArray v = s.device()->readLine ();
    qDebug() << "Word" << i << v;
}
Unfortunately, QIODevice does not support reading single words, so you would have to do it yourself with a combination of peek and read.
Try disabling Unicode auto-detection with QTextStream::setAutoDetectUnicode(false). Auto-detection may read ahead from the device, which could cause your problem.
Also set a codec, just in case.
Add s.device()->pos() and s.device()->bytesAvailable() to the log output to verify this.
I've checked the QTextStream code. It looks like it always caches as much data as possible, and there is no way to disable this behavior. I expected it to use peek on the device, but it only reads greedily. The bottom line is that you can't use QTextStream and peek at the device at the same time.

How to deal with "%1" in the argument of QString::arg()?

Everybody loves
QString("Put something here %1 and here %2")
.arg(replacement1)
.arg(replacement2);
but things get itchy as soon as you have the faintest chance that replacement1 actually contains %1 or even %2 anywhere. Then, the second QString::arg() will replace only the re-introduced %1 or both %2 occurrences. Anyway, you won't get the literal "%1" that you probably intended.
Is there any standard trick to overcome this?
If you need an example to play with, take this
#include <QCoreApplication>
#include <QDebug>
int main()
{
    qDebug() << QString("%1-%2").arg("%1").arg("foo");
    return 0;
}
This will output
"foo-%2"
instead of
"%1-foo"
as might be expected (not).
qDebug() << QString("%1-%2").arg("%2").arg("foo");
gives
"foo-foo"
and
qDebug() << QString("%1-%2").arg("%3").arg("foo");
gives
"%3-foo"
See the Qt docs about QString::arg():
QString str;
str = "%1 %2";
str.arg("%1f", "Hello"); // returns "%1f Hello"
Note that the arg() overload for multiple arguments only takes QString. In case not all the arguments are QStrings, you could change the order of the placeholders in the format string:
QString("1%1 2%2 3%3 4%4").arg(int1).arg(string2).arg(string3).arg(int4);
becomes
QString("1%1 2%3 3%4 4%2").arg(int1).arg(int4).arg(string2, string3);
That way, everything that is not a string is replaced first, and then all the strings are replaced at the same time.
You should try using
QString("%1-%2").arg("%2","foo");
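The reason the multi-argument call behaves differently from chained calls is that all placeholders are substituted in a single pass, so text coming from a replacement is never rescanned. The sketch below illustrates that idea in plain standard C++ (no Qt). formatOnce is a hypothetical helper, and it makes simplifying assumptions that Qt does not: only %1 to %9 are recognized, and %N always maps to the N-th argument.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not Qt's implementation): replaces %1..%9 with the
// corresponding argument in one left-to-right pass over the format string.
std::string formatOnce(const std::string& fmt, const std::vector<std::string>& args) {
    std::string out;
    for (std::size_t i = 0; i < fmt.size(); ++i) {
        if (fmt[i] == '%' && i + 1 < fmt.size()
            && std::isdigit(static_cast<unsigned char>(fmt[i + 1]))) {
            const std::size_t n = static_cast<std::size_t>(fmt[i + 1] - '0');
            if (n >= 1 && n <= args.size()) {
                // Copy the replacement verbatim; it is never scanned again.
                out += args[n - 1];
                ++i; // skip the digit
                continue;
            }
        }
        out += fmt[i];
    }
    return out;
}

// The cases from the question: a replacement that itself looks like a placeholder.
void demoFormatOnce() {
    assert(formatOnce("%1-%2", {"%1", "foo"}) == "%1-foo");
    assert(formatOnce("%1-%2", {"%2", "foo"}) == "%2-foo");
}
```

With chained substitution, the second step would see the "%1" or "%2" introduced by the first step and replace it again; a single pass avoids that by construction.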
