Scraping an HTML response for specific values on Apigee gateway - apigee

Has anyone determined a good method of scraping an HTML response, formatted largely in HTML5 and not XML-compliant, for specific values using the Apigee gateway?
That is to say, if I were to get the following piece of a response:
<input name="a" value="a1">
<input name="b" value="b1">
<input name="c" value="c1">
<input name="d" value="d1">
Can I return the values of a and b?
As this is HTML and not strict XML, Apigee's XPath does not work.
Alternatively, is there a recommended method of allowing DOM parsing at the gateway?

Try a regex in a Java, JavaScript, or Python policy, based on your preference. You can assign the entire response payload to a variable as a string, then perform string operations such as regex matching to extract the specific part of the HTML text.
For XML response payloads, you can use XSLT and XPath expressions.
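As an illustration, here is a minimal sketch of such a regex extraction, written as a plain JavaScript function so it could be dropped into an Apigee JavaScript policy (where the payload would come from context.getVariable('response.content')). It assumes the name attribute appears before the value attribute, as in the question's fragment:

```javascript
// Extract the value attribute of a named <input> element using a regex.
// Assumes the name attribute precedes the value attribute in the tag.
function extractInputValue(html, name) {
  var re = new RegExp('<input[^>]*name="' + name + '"[^>]*value="([^"]*)"');
  var match = re.exec(html);
  return match ? match[1] : null;
}

var html = '<input name="a" value="a1">\n' +
           '<input name="b" value="b1">\n' +
           '<input name="c" value="c1">\n' +
           '<input name="d" value="d1">';

console.log(extractInputValue(html, 'a')); // a1
console.log(extractInputValue(html, 'b')); // b1
```

In a real policy you would write the extracted values back with context.setVariable so later policies can use them.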

You can take the Node.js route to do this on the gateway. Since Apigee Edge supports Node.js out of the box, you can use npm modules to work with the DOM. My personal favorite is Cheerio (https://github.com/cheeriojs/cheerio). Cheerio's API is modeled on jQuery.
var cheerio = require('cheerio');
var $ = cheerio.load('<h2 class="title">Hello world</h2>');
$('h2.title').text('Hello there!');
$('h2').addClass('welcome');
$.html();
//=> <h2 class="title welcome">Hello there!</h2>

Related

amazon S3 upload API is sensitive to http post parameter ordering

I've created a bucket whose policy makes it mandatory to specify the content-type of the object being uploaded. If I specify the content-type after the file element (example below),
<form action="..">
...
<input type='file' name='file' />
<input name='content-type' value='image/jpeg' />
<input type='submit' />
</form>
it returns the following error:
<Error>
<Code>AccessDenied</Code>
<Message>Invalid according to Policy: Policy Condition failed: ["starts-with", "$Content-Type", ""]</Message>
<RequestId>15063EB427B4A469</RequestId>
<HostId>yEzAPF4Z2inaafhcqyQ4ooLVKdwnsrwqQhnYg6jm5hPQWSOLtPTuk0t9hn+zkBEbk+rP4S5Nfvs=</HostId>
</Error>
If I specify the content-type before the file element, the upload works as expected. I have encountered this behaviour for the first time, and I have a few questions about it.
Is it part of some specification that clients and all intermediate proxies are supposed to maintain the order of HTTP POST params? Please point me to it.
Why would you make your API aware of such ordering? In this particular case I can guess that the file can be huge, and unless you have seen all the expected params first, you should immediately return a failure. Please correct me if my understanding is not correct.
It is part of the spec that the parts are sent in the order in which they appear in the form. There is no reason to believe that reordering by an intermediate proxy would be allowed. From the documentation:
The form data and boundaries (excluding the contents of the file) cannot exceed 20K.
...
The file or content must be the last field in the form.
http://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-HTTPPOSTForms.html#sigv4-HTTPPOSTFormFields
The logical assumption is that this design allows S3 to reject invalid uploads early.
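To make the ordering requirement concrete, here is a hedged sketch that builds a multipart/form-data body by hand, with the policy fields first and the file part last; the boundary, field names, and values are made up for the example:

```javascript
// Build a multipart/form-data body by hand to make field ordering explicit.
// S3 POST policies require the file part to come last.
function buildMultipartBody(boundary, fields, file) {
  var parts = [];
  // All policy fields (e.g. Content-Type) must precede the file part.
  for (var name in fields) {
    parts.push(
      '--' + boundary + '\r\n' +
      'Content-Disposition: form-data; name="' + name + '"\r\n\r\n' +
      fields[name] + '\r\n'
    );
  }
  // The file part goes last, as the S3 POST documentation requires.
  parts.push(
    '--' + boundary + '\r\n' +
    'Content-Disposition: form-data; name="file"; filename="' + file.name + '"\r\n' +
    'Content-Type: ' + file.type + '\r\n\r\n' +
    file.data + '\r\n'
  );
  return parts.join('') + '--' + boundary + '--\r\n';
}

var body = buildMultipartBody('XYZ',
  { 'Content-Type': 'image/jpeg' },
  { name: 'photo.jpg', type: 'image/jpeg', data: '...bytes...' });

// The Content-Type field appears before the file part.
console.log(body.indexOf('name="Content-Type"') < body.indexOf('name="file"')); // true
```

A server reading such a stream can reject the request as soon as a required field is missing, without buffering the file.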

JMeter: "__VIEWSTATE" and "__EVENTVALIDATION" value does not get replaced after extracting values using Regular Expression Extractor Post Processor

I am trying to load test an ASP.NET website, and after a bit of research it became apparent that JMeter was running into issues with VIEWSTATE, one of the workarounds ASP.NET WebForms uses to make HTTP appear stateful. JMeter sends a stale value for VIEWSTATE because it replays the HTTP requests in the test plan. I extracted the VIEWSTATE from each response and re-included that value in subsequent requests using two Regular Expression Extractors, but I still don't see the values getting replaced after parameterization.
Your regexp is probably wrong.
It's better to use the CSS/JQuery Extractor instead of a regexp in this case.
For the first extractor, put:
- expression: input[id=__VIEWSTATE]
- attribute: value
and for the second one:
- expression: input[id=__EVENTVALIDATION]
- attribute: value
Use the regex below; it worked for me (the hyphen is placed last in the character class so it is treated literally):
input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="([A-Za-z0-9+=/_-]+?)"
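A quick sanity check of that pattern in JavaScript, run against a sample hidden field (the VIEWSTATE value here is made up):

```javascript
// Sanity-check the VIEWSTATE extraction regex against a sample hidden field.
// The hyphen sits at the end of the character class so it is literal.
var re = /input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="([A-Za-z0-9+=\/_-]+?)"/;
var html = '<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="dDwtMTIzNDU2Nzg5Ow==" />';
var viewState = re.exec(html)[1];
console.log(viewState); // dDwtMTIzNDU2Nzg5Ow==
```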

How to remove cross site scripting? [duplicate]

I have a website that allows entering HTML through a TinyMCE rich editor control. Its purpose is to allow users to format text using HTML.
This user entered content is then outputted to other users of the system.
However this means someone could insert JavaScript into the HTML in order to perform a XSS attack on other users of the system.
What is the best way to filter out JavaScript code from a HTML string?
If I perform a regular expression check for <SCRIPT> tags, that's a good start, but an evildoer could still attach JavaScript to the onclick attribute of a tag.
Is there a fool-proof way to strip out all JavaScript code while leaving the rest of the HTML untouched?
For my particular implementation, I'm using C#
Microsoft have produced their own anti-XSS library, Microsoft Anti-Cross Site Scripting Library V4.0:
The Microsoft Anti-Cross Site Scripting Library V4.0 (AntiXSS V4.0) is an encoding library designed to help developers protect their ASP.NET web-based applications from XSS attacks. It differs from most encoding libraries in that it uses the white-listing technique -- sometimes referred to as the principle of inclusions -- to provide protection against XSS attacks. This approach works by first defining a valid or allowable set of characters, and encodes anything outside this set (invalid characters or potential attacks). The white-listing approach provides several advantages over other encoding schemes. New features in this version of the Microsoft Anti-Cross Site Scripting Library include:
- A customizable safe list for HTML and XML encoding
- Performance improvements
- Support for Medium Trust ASP.NET applications
- HTML Named Entity Support
- Invalid Unicode detection
- Improved Surrogate Character Support for HTML and XML encoding
- LDAP Encoding Improvements
- application/x-www-form-urlencoded encoding support
It uses a whitelist approach to strip out potential XSS content.
Here are some relevant links related to AntiXSS:
Anti-Cross Site Scripting Library
Microsoft Anti-Cross Site Scripting Library V4.2 (AntiXSS V4.2)
Microsoft Web Protection Library
Peter, I'd like to introduce you to two concepts in security;
Blacklisting - Disallow things you know are bad.
Whitelisting - Allow things you know are good.
While both have their uses, blacklisting is insecure by design.
What you are asking for is, in fact, blacklisting. As long as there is an alternative to <script> (such as <img src="bad" onerror="hack()"/>), you won't be able to avoid this issue.
Whitelisting, on the other hand, allows you to specify the exact conditions you are allowing.
For example, you would have the following rules:
allow only these tags: b, i, u, img
allow only these attributes: src, href, style
That is just the theory. In practice, you must parse the HTML accordingly, hence the need for a proper HTML parser.
If you want to allow some HTML but not all, you should use something like OWASP AntiSamy, which allows you to build a whitelisted policy over which tags and attributes you allow.
HTMLPurifier might also be an alternative.
It's of key importance that it is a whitelist approach, as new attributes and events are added to HTML5 all the time, so any blacklisting would fail within short time, and knowing all "bad" attributes is also difficult.
Edit: Oh, and regex is a bit hard to do here. HTML can come in lots of different forms: tags can be unclosed, attributes can appear with or without quotes (single or double), and there can be line breaks and all kinds of spaces within the tags, to name a few issues. I would rely on a well-tested library like the ones I mentioned above.
Regular expressions are the wrong tool for the job, you need a real HTML parser or things will turn bad. You need to parse the HTML string and then remove all elements and attributes but the allowed ones (whitelist approach, blacklists are inherently insecure). You can take the lists used by Mozilla as a starting point. There you also have a list of attributes that take URL values - you need to verify that these are either relative URLs or use an allowed protocol (typically only http:/https:/ftp:, in particular no javascript: or data:). Once you've removed everything that isn't allowed you serialize your data back to HTML - now you have something that is safe to insert on your web page.
I try to neutralize tag elements by rewriting them, like this:
using System.Text.RegularExpressions;

public class Utility
{
    public static string PreventXSS(string sInput)
    {
        if (sInput == null)
            return string.Empty;
        // Insert a space after every '<' (and any whitespace that follows it)
        // so the browser no longer parses the following text as a tag.
        return Regex.Replace(sInput, @"<\s*", "< ");
    }
}
Usage before save to db:
string sResultNoXSS = Utility.PreventXSS(varName)
I have test that I have input data like :
<script>alert('hello XSS')</script>
it will be run on browser. After I add Anti XSS the code above will be:
< script>alert('hello XSS')< /script>
(There is a space after <)
And the result, the script won't be run on browser.

what is the correct syntax of a post form in an http request

What is the format of a form sent in an HTTP POST request?
I am writing an HTTP client program and want to send a form in an HTTP POST request.
I tried:
< FORM METHOD=POST >
< INPUT name="name" value="chriss">
< /FORM >
Is this correct?
On the server side, when I try to get the value of name (I use form.getFirstValue("name")), I get null.
(I am using Restlet as my API.)
Can anyone help me, please?
The body of the POST request sent by an HTML form usually uses the "application/x-www-form-urlencoded" media type.
If your client is also a Restlet client, you should be able to use the Form class, set the required values for each name/value pairs, and get the representation to send using getWebRepresentation().
Essentially, the body will look like this:
name=chriss
If you had more parameters, they would be separated by &.
(If you were sending files, you'd use the multipart/form-data encoding instead.)
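In plain JavaScript (Node or a browser), a body in that format can be built with URLSearchParams; the date field here is made up to show the & separator and the percent-encoding of reserved characters:

```javascript
// Build an application/x-www-form-urlencoded POST body.
var params = new URLSearchParams();
params.append('name', 'chriss');
params.append('date', '10/12/2010'); // hypothetical extra field
var body = params.toString();
console.log(body); // name=chriss&date=10%2F12%2F2010
```

This string is what goes in the request body, alongside a Content-Type: application/x-www-form-urlencoded header.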
An HTML reference will be helpful. There are plenty of good HTML books and online references.
<form method="post" action="/url/to/submit/to">
<input type="text" name="name" value="chriss">
</form>

Is encoding of the query string needed?

I see that Firefox does NOT encode a URL like http://www.mysite.com/foo?bar=10/12/2010 when it sends a GET request. I know that URLs must be encoded, so I expected to see Firefox request http://www.mysite.com/foo?bar=10%2F12%2F2010 (/ = %2F). I inspected the GET requests using Wireshark.
Should the query string in the URL be escaped?
I use WebHarvest, and I see that when I ask it to download a page with the http directive, a URL like the one above is encoded as I expected (%2F instead of "/").
The / is allowed unencoded in the query of a URI:
query = *( pchar / "/" / "?" )
Anything else must be encoded using the percent-encoding.
If, by escaped, you mean URL-encoded, the short answer is yes.
There are a number of characters that are normally encoded during URL encoding but would usually appear in a URL without causing problems.
The potential problems are not always obvious, though, so I would recommend URL-encoding query arguments and decoding them at your site. After all, if you decode too many times, that should not cause any problem.
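Using the URL from the question, encoding just the query value looks like this:

```javascript
// Percent-encode the query value so reserved characters like '/' are escaped.
var bar = '10/12/2010';
var url = 'http://www.mysite.com/foo?bar=' + encodeURIComponent(bar);
console.log(url); // http://www.mysite.com/foo?bar=10%2F12%2F2010
```

Note that encodeURIComponent is applied to the value only, not to the whole URL, so the scheme, host, and ? separator stay intact.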
Can't reproduce your problem.
<form>
<input type="hidden" name="bar" value="10/12/2010">
<input type="submit">
</form>
This produces the proper escaping in the address bar. Aren't you supplying this URL in an <a> element? Then you need to escape it in the HTML page yourself, either by hardcoding it or by using the functions provided by the server-side language.
