Can you use wildcards in a token with ParseKit? - wildcard

I'm trying to add a symbol token using ParseKit below:
[t.symbolState add:#"<p style=\"margin-left: 20px;\">"];
I'm wondering if ParseKit allows for wildcards when adding a symbol, such as:
[t.symbolState add:#"<p style=\"margin-left: ##px;\">"];
I want to be able to then extract the wildcard from the token during the parsing procedure.
Is such a thing possible with ParseKit?

Developer of ParseKit here.
I think using ParseKit in this way is not a good idea.
ParseKit (and its successor PEGKit) excel at tokenizing input and then parsing at the token level.
There are several natural tokens in the example input you've provided, but what you are trying to do here is ignore those natural tokens, combine them into a blob of input, and then do fancy sub-token matching using patterns.
There is a popular, powerful tool for fancy sub-token matching using patterns: Regular Expressions. They will be a much better solution for that kind of thing than PEGKit.
However, I still don't think Regular Expressions are the tool you want to use here (or, at least not the only tool).
It looks like you want to parse XML input. Don't use Regex or PEGKit for that. Use an XML parser. Always use an XML parser for parsing XML input.
You may choose to use another XML API layered on top of the XML Parser (SAX, StAX, DOM, XSLT, XQuery, etc.) but, underneath it all, you should be parsing with an XML parser (and, of course, all of those tools I listed, do).
See here for more info.
Then, once you have the style attribute string value you are looking for, use Regex to do fancy pattern matching.

Related

Is there any way to find special characters in string

I have a requirement to perform different operations if string contains any special character.
Is there any way to implement regular expression in gremlin.
Input_Name= Test#input
if Input_Name.contains( "#/$%...")
{
println " error "
}
else
{
println "sucess"
}
Currently the Gremlin language does not have a TextP.regex predicate. Some implementations, such as JanusGraph, do add custom regex extensions to Gremlin. You could also, if the database you are using allows it, use Groovy closure syntax to include a regex in a query. Within the TinkerPop community we are planning to add a TextP.regex to the Gremlin language. The code is written and on a branch that we hope will be part of the TinkerPop 3.6.0 release if all goes well.
However, in your case, perhaps the existing TextP.containing could be used if there are a finite set of special characters you are looking for but it is likely not the most optimal way to solve the problem as you will have to or several has steps together.
Another option might be to use an external index if your database implementation supports that.
Just as an example of the closure syntax, if your implementation allows it, a REGEX match would look like the example below. In general though, use of closures is not recommended, and many implementations either fully block or extremely limit their use.
gremlin> g.V().limit(20).filter {it.get().values('desc').next() ==~ "[A-Za-z]* [A-Z]'(.*)"}.values('desc')
==>Chicago O'Hare International Airport

Pact matching non JSON body

Is there any way of matching non JSON bodies (either XML, byte or whatever). Looking for the Python solution, however will appreciate any ideas behind that (even monkeypatching).
It's possible, but not directly supported.
Currently there's only the ability to match JSON. You can fake non-JSON matching by expecting a string body, but then you won't be able to use pact's built in matchers- which might mean your tests will be data dependent unless you do a bit of leg work.
There is a stub for xml support, but it's not currently implemented.
If you're willing to get your hands dirty in Ruby (not that different to Python!) you can write your own matcher. I can show you how to configure the pact-provider-verifier to use the custom matching code. Currently, if you use a content type that is not JSON, as J_A_X says, it will do an exact string diff.

Using duplicate parameters in a URL

We are building an API in-house and often are passing a parameter with multiple values.
They use: mysite.com?id=1&id=2&id=3
Instead of: mysite.com?id=1,2,3
I favor the second approach but I was curious if it was actually incorrect to do the first?
I'm not an HTTP guru, but from what I understand there's not a definitive standard on the query part of the URL regarding multiple values, it's typically up to the CGI that handles the request to parse the query string.
RFC 1738 section 3.3 mentions a searchpart and that it should go after the ? but doesn't seem to elaborate on its format.
http://<host>:<port>/<path>?<searchpart>
I did not (bother to) check which RFC standard defines it. (Anyone who knows about this please leave a reference in the comment.) But in practice, the mysite.com?id=1&id=2&id=3 way is already how a browser would produce when a form contains duplicated fields, typically the checkboxes. See it in action in this w3schools example page. So there is a good chance that the whatever programming language you are using, already provides some helper functions to parse an input like that and probably returns a list.
You could, of course, go with your own approach such as mysite.com?id=1,2,3, which is not bad at all in this particular case. But you will need to implement your own logic to produce and to consume such format. Now you may or may not need to think about handling some corner cases by yourself, such as: what if the input is not well-formed, like mysite.com?id=1,2,? And do you need to invent yet another separator, if the comma sign itself can also be a valid input, like mysite.com?name=Doe,John|Doe,Jane? Would you reach to a point that you will use a json string as the value, like mysite.com?name=["John Doe", "Jane Doe"]? etc. etc.. Your mileage may vary.
Worth adding that inconsistend handling of duplicate parameters in the URL on the server is may lead to vulnerabilities, specifically server-side HTTP parameter pollution, with a practical example - Client side Http Parameter Pollution - Yahoo! Classic Mail Video Poc.
in your first approach you will get an array of querystring values but in second approach you will get a string of querystring values.
I guess it depends on technology you use, how it becomes convenient. I am currently standing in front of the same question using currency=USD,CHF or currency=USD&currency=CHF
I am using Thymeleaf and using the second option makes it easy to work, I can then request something like: ${param.currency.contains(currency.value)}. When I try to use the first option it seems it takes the "array" like a string, so I need to split first and then do contain, what leads me to a more mess code.
Just my 50 cents :-)

Good Regex for sanitising input

I'm after a general regex for sanitising form input, I want to use it on first name last name fields , which will be stored in DB, and pretty much use it in other general places if I can.
I'm using ASP.net does any on
Sanitising user data is an output problem, not an input problem.
What is considered "sanitary" for a MySQL database is not necessarily "sanitary" for MSSQL or PostGreSQL. What is considered "sanitary" for a database is most likely not the same as what you could safely send in an HTML document. XHTML is a different story again and if you are outputing the user-supplied data into a javascript block or a CSS block it's different yet again. There is no way to sanitise user-supplied data for all output targets.
It's better to use the supplied library functions for sanitising data rather than building your own regex. PHP (which I happen to know better than ASP.net) has mysql_real_escape_string(). I'm sure ASP.net will have a library function for sanitising user-supplied data for use with various databases. It will also likely have library functions for sanitising user-supplied data for HTML as well.
Parameterised queries are even better than sanitising user-supplied data. And it can be done with ASP.net. This is the right way to use a database.

Good whitelist for search terms

I'm implementing a simple search on a website, and right now I'm working on sanitizing the input. My plan is to make a whitelist of allowed characters. I'm using PHP, and so far I've got the current regex:
preg_replace('/[^a-z0-9 -]/i', '', $s);
So, I'm removing anything that's not alphanumeric or a space or a hyphen.
Is there a generally accepted whitelist for this sort of thing, or does it just depend on the application? I'm going to be searching on book titles, author names and book blurbs.
What about 2010 (A space odyssey)? What about Giscard d`Estaing's autobiography? ... This is really impossible to answer generally, it will depend on your application and data structures.
You want to look into the fulltext search functions of the database of your choice, or even specialized search appliances like Sphinx.
Clarify what engine you will use first to actually perform your search, and the rules on what you need to strip out will become much clearer.
Google has some pretty advanced rules for searches, but their basic rule is this:
Generally, punctuation is ignored, including ##$%^&*()=+[]\ and other special characters.
However, Google makes exceptions for common search terms, like C++, C#, or $100.
If you want a search as sophisticated as Google's, you can make rules against the above punctuation and have some exceptions. However, for a simple search, just ignore the characters that Google generally ignores.
There's not a generic regular expression to solve this problem. Your code strips out a lot of things you might want to keep, like commas, exclamation points, (semi-)colons, and non-English letters. If you have a full list of all of the titles in your database, you should be able to write a script that will construct a list of all characters found in all of your titles. If your regular expression strips out any of those characters, then you risk having problems (although passing this test doesn't mean that you won't run into problems).
Depending on how the rest of your search is implemented, you may be able to strip out valid characters and still return relevant search results. In this case, you would want your expression to allow non-English characters (since you don't want to split a word) but you might be able to remove all punctuation marks that aren't inside of a quote-delimited phrase. For example, searching for red haired should give you all of the results you would get from searching for red-haired plus a few extra.

Resources