I am trying to debug some problems with very finicky/complex web services where some of the clients that are theoretically making the same requests are getting different results. A debugging proxy like Charles helps a lot, but since the requests are complex (lots of headers, cookies, query strings, form data, etc.) and the clients create the headers in different orders (which should be perfectly acceptable), it's an extremely tedious process to do manually.
I'm pondering writing something to do this myself but I was hoping someone else had already solved this problem?
As an aside, does anyone know of any Charles-like debugging proxies that are completely open source? If Charles were open source, I would definitely contribute any work I did on this front back to the project. If there is something similar out there, I would much rather do that than write a separate program from scratch (especially since I imagine Charles or any analogue already has all of the data structures I might need, etc.).
Edit:
Just to be clear: text diffing will not work, as the order of lines (headers, at least) may be different and/or the order of values within lines (cookies, at least) can be different. In both cases, as long as the names, values, and metadata are all the same, the different ordering should not cause requests that are otherwise the same to be considered different.
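Here is roughly the kind of normalization I have in mind, as a minimal Python sketch. It assumes each captured request has already been exported from the proxy into a simple dict (the dict layout here is invented for illustration): headers, cookies, and query parameters are put into a canonical order before the two requests are compared.

from urllib.parse import parse_qsl

def normalize(request):
    # Build an order-independent view of one request.
    headers = {}
    for name, value in request.get("headers", []):
        name = name.lower()
        if name == "cookie":
            # Cookie ordering should not matter, so sort the pairs.
            pairs = sorted(p.strip() for p in value.split(";"))
            value = "; ".join(pairs)
        headers[name] = value
    # Query-string parameter order should not matter either.
    path, _, query = request["path"].partition("?")
    return {
        "method": request["method"].upper(),
        "path": path,
        "query": sorted(parse_qsl(query, keep_blank_values=True)),
        "headers": sorted(headers.items()),
        "body": request.get("body", ""),
    }

def diff_requests(a, b):
    # Yield (field, value_in_a, value_in_b) only for fields that really differ.
    na, nb = normalize(a), normalize(b)
    for key in na:
        if na[key] != nb[key]:
            yield key, na[key], nb[key]

Anything a tool like Charles can export (even raw HTTP text, parsed into that shape) could be fed through this before diffing, so reordered headers or cookies no longer show up as differences.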
Fiddler has such an option, if you have WinDiff in your path. I don't know whether it will suit your needs, though, because at first glance it's just doing text comparisons. Perhaps it normalizes the sessions before that, but I can't say.
If there's nothing purpose-built for the job, you can use packet capture to get the message content saved to a text file (something that inserts itself into the IP stack, like CommView). Then you can text-diff the results for different messages.
Can the open-source proxy Squid maybe help?
I want to prevent or hamper the parsing of the classifieds website that I'm improving.
The website uses an API with JSON responses. As a solution, I want to add useless data in between my real data, since scrapers will probably parse by ID, and give no clue about it in either the JSON response body or the headers, so they won't be able to distinguish it without close inspection.
To prevent users from seeing it, I won't serve that "useless data" to my users unless they request it explicitly by ID. From an SEO perspective, I know that Google won't crawl pages containing the useless data if there are no internal or external links to them.
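Roughly, what I have in mind looks like the following simplified Python sketch (the record fields, ratios, and storage are invented for illustration; the real version would live in the API layer):

import random

# The set of decoy IDs lives only on the server, so nothing in the JSON
# marks which records are fake.
DECOY_IDS = set()

def make_decoy(template):
    # Build a plausible-looking fake record from a real one and remember its ID.
    decoy = dict(template)
    decoy["id"] = random.randint(10_000_000, 99_999_999)
    DECOY_IDS.add(decoy["id"])
    return decoy

def listing_response(real_items, decoy_ratio=0.2):
    # Mix decoys into the listing/search responses that scrapers harvest.
    items = list(real_items)
    items += [make_decoy(i) for i in real_items if random.random() < decoy_ratio]
    random.shuffle(items)
    return items

def detail_response(item_id, load_item):
    # A request for a decoy ID is a strong signal that the client is a scraper.
    if item_id in DECOY_IDS:
        return {"error": "not found"}   # and log/throttle this client
    return load_item(item_id)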
How reliable would that technique be? And what problems/disadvantages/drawbacks do you think could occur in terms of user experience or SEO? Any ideas or suggestions will be very much appreciated.
P.S. I'm also limiting large numbers of requests made in a short time, but it doesn't help. That's why I'm considering this technique.
I don't think banning scrapers would work any better, because they can change IPs, etc.
Maybe I can get a better solution by requiring a login to access more than, say, 50 item details (and that might even work against Selenium?). Requiring registration will make it harder, and even if they do register, I can identify those users and slow down their response times, etc.
Icinga 2 by default sends some of its internal performance metrics to Graphite, but I can't see a way to send my own performance data without also sending the internal metrics, so I was wondering if there is a way to configure Graphite's carbon-cache to simply ignore certain metrics.
I know I could possibly work around this by using carbon-relay to selectively relay only a subset, but this feels like a kludge.
I am also aware that it would be better if Icinga 2 simply didn't send the things I don't care about, which I will continue to look into, but I can see other use cases where I might want to stop storing certain metrics sooner than I could push a code update to stop an application from sending them.
If I've understood your problem correctly, I think what you're looking for here is the carbon-cache blacklist: you specify a regular expression to ignore particular patterns of metrics.
For example, I use:
^prod\.service\..+$
... to ignore anything starting with prod.service (unfortunately it can get messy with escaping the periods!). The file can contain any number of lines like that.
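If it helps, you can sanity-check which metric names a pattern will drop before deploying it. A quick Python check (the metric names here are just made-up examples) looks like this:

import re

# Patterns exactly as they would appear in the blacklist, one per line.
patterns = [re.compile(p) for p in [
    r"^prod\.service\..+$",
]]

# Hypothetical metric names to test against the patterns.
metrics = [
    "prod.service.api.response_time",
    "prod.web.requests",
    "icinga2.localhost.services.ping4.rta",
]

for name in metrics:
    dropped = any(p.search(name) for p in patterns)
    print(name, "-> dropped" if dropped else "-> kept")

Since the pattern is anchored with ^ and $, it behaves the same whether the matcher uses search or match.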
I need to make an application that will access a URL (like http://google.com), return the time spent loading all elements (images, CSS, JS, ...), and compare these results with previous results.
The application needs to be a desktop app. I will save the information in a text file or XML and use that file to compare with previous results.
I have searched for a similar application, but found nothing...
There are some plugins for Firefox that list these elements, like YSlow or Firebug, but they're not what I need.
So I'm totally lost and I don't know how to start this work.
Is it possible to make this application? What language is best for this type of application?
Thanks!
This is a very open-ended question, so without you elaborating more on your requirements, you may not get any useful answers.
Some things you would need to answer are: how many URLs do you want to check, where do you want to store the results (database, files, etc.), and does it need to run on the desktop or on a server?
Personally, I like the statistics that cURL gives you (DNS time, connect time, receive time, etc.), so you could write something in PHP, but as I stress, that is personal preference and may not suit your situation.
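Those statistics come from libcurl itself, so any binding exposes them. For example, a small sketch using Python's pycurl (the URL is just a placeholder, and this only times a single resource; timing every image/CSS/JS would mean parsing the page and fetching each referenced asset the same way):

import io
import pycurl

def time_url(url):
    # Fetch one URL and return libcurl's timing breakdown in seconds.
    buf = io.BytesIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.WRITEFUNCTION, buf.write)
    c.setopt(pycurl.FOLLOWLOCATION, True)
    c.perform()
    timings = {
        "dns": c.getinfo(pycurl.NAMELOOKUP_TIME),
        "connect": c.getinfo(pycurl.CONNECT_TIME),
        "first_byte": c.getinfo(pycurl.STARTTRANSFER_TIME),
        "total": c.getinfo(pycurl.TOTAL_TIME),
        "bytes": len(buf.getvalue()),
    }
    c.close()
    return timings

print(time_url("http://google.com"))

Appending each run's results to a text file would give you the history to compare against.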
Is there a way to determine whether a request coming to a handler (let's assume the handler responds to GET and POST) is being performed by a real browser versus a programmatic client?
I already know that it is easy to spoof things like the User-Agent and the Referer, but are there other headers that are more difficult to spoof? Maybe headers that are not commonly available in classes like .NET's HttpWebRequest?
The other path I looked at is using the encrypted ViewState to send a value to the browser that gets validated on the server side, but couldn't that value simply be scraped from the previous response and added as a POST parameter to the next request?
Any help would be much appreciated,
Cheers,
There is no easy way to differentiate, because in the end a programmatic POST looks the same to the server as a POST made by a user from a browser.
As mentioned, CAPTCHAs can be used to control posting, but they are not perfect (it is very hard, but not impossible, for a computer to solve them). They can also annoy users.
Another route is to only allow authenticated users to post, but this can still be done programmatically.
If you want to get a good feel for how people are going to try to abuse your site, then you may want to look at http://seleniumhq.org/
This is very similar to the famous Halting Problem in computer science. See some more on the proof, and Alan Turing here: http://webcache.googleusercontent.com/search?q=cache:HZ7CMq6XAGwJ:www-inst.eecs.berkeley.edu/~cs70/fa06/lectures/computability/lec30.ps+alan+turing+infinite+loop+compiler&cd=1&hl=en&ct=clnk&gl=us
The most common way is using CAPTCHAs. Of course, CAPTCHAs have their own issues (users don't really care for them), but they do make it much more difficult to post data programmatically. They don't really help with GETs, though you can force users to solve a CAPTCHA before delivering content.
There are many ways to do this, like dynamically generated XHR requests that can only be completed through human interaction.
Here's a great article on NP-Hard problems. I can see a huge possibility here:
http://www.i-programmer.info/news/112-theory/3896-classic-nintendo-games-are-np-hard.html
One way: you could use some tricky JS to handle tokens on click. Your server issues token IDs to elements on the page during the back-end render phase and logs them in a database or data file. Then, when users click around and submit, you can compare the IDs sent via the onclick() function. There are plenty of ways around this, but you could apply some heuristics to determine whether posts are too fast to come from a human: even if someone scripts the hijacking of the token IDs and auto-submits, you can check whether the time between click events looks automated. Signed up for a Twitter account lately? They use passive human detection that, while not 100% foolproof, is slower and more difficult to break. Many, if not all, of the spam accounts there had to be opened by a human.
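A bare-bones sketch of that token bookkeeping in Python (the in-memory store and the timing threshold are made up; a real version would persist tokens server-side):

import secrets
import time

# In-memory stand-in for the database/data file mentioned above.
issued_tokens = {}

def issue_token(element_id):
    # Called during the render phase: attach a token to a clickable element.
    token = secrets.token_hex(16)
    issued_tokens[token] = {"element": element_id, "issued_at": time.time()}
    return token

def validate_submission(token, min_seconds=2.0):
    # Called on submit: unknown tokens or inhumanly fast clicks look scripted.
    record = issued_tokens.pop(token, None)   # tokens are single-use
    if record is None:
        return False
    elapsed = time.time() - record["issued_at"]
    return elapsed >= min_seconds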
Another way: http://areyouahuman.com/
As long as you are using encrypted methods, verifying humanity without a crappy CAPTCHA is possible. I mean, don't ignore your headers either; these are complementary approaches.
The key is to have enough complexity to make for an NP-complete problem, in the sense that the number of ways of solving the total set of problems is extraordinary. http://en.wikipedia.org/wiki/NP-complete
When the day comes that AI can solve multiple complex human problems on its own, we will have other things to worry about than request tampering.
http://louisville.academia.edu/RomanYampolskiy/Papers/1467394/AI-Complete_AI-Hard_or_AI-Easy_Classification_of_Problems_in_Artificial
Another company doing interesting research is http://www.vouchsafe.com/play-games: they actually use games built around the reverse Turing test, training it to be solvable only by humans!
In J2ME, which connection type is better, GET or POST? Which one is faster? Which one uses less bandwidth? And which one is supported by most handsets? What are the advantages and disadvantages of each?
Also, see Is there a limit to the length of a GET request? which may be relevant if you plan to abuse GET.
Be aware that network operators (certainly in the UK) have caching schemes in place that may affect your traffic.
If you look at what Opera Mini does, they only use HTTP POST in their HTTP mode.
I think this is a great idea because of the following reasons:
POSTs are never cached (according to the HTTP spec, at least); this saves you from operator caching, etc.
It seems some operators handle POSTs better than GETs; this is an impression I get from what some Nigerian users mention.
Opera probably has the most installations of any J2ME app in the world, and if they do it, it's probably the safer choice.
No problems with HTTP GET limits on query length.
You can use a more flexible data format if you like, one that uses less data (no encoding needed on the data, as with GET); see the sketch below.
I think it's much cleaner, but it does require some extra work; e.g., if you are using your HTTP web logs to parse out the number of requests per "?type=blah", then you'll have to move that into your site's logic.
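To make the query-length and encoding points concrete, here is a plain-Python sketch rather than J2ME (the URL and payload are placeholders), but the wire-level difference is the same: a GET has to URL-encode everything into the request line, while a POST carries the payload in the body in whatever format you choose.

from urllib.parse import urlencode
from urllib.request import Request

payload = {"type": "blah", "comment": "free text, no length-limit worries"}

# GET: data must be URL-encoded into the query string (subject to URL limits).
get_req = Request("http://example.com/api?" + urlencode(payload))

# POST: data travels in the body; any compact format will do.
post_req = Request(
    "http://example.com/api",
    data=b'{"type": "blah", "comment": "free text, no length-limit worries"}',
    headers={"Content-Type": "application/json"},
)

# Either would be sent with urllib.request.urlopen(...); in J2ME the same
# choice shows up via HttpConnection.setRequestMethod(GET or POST).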
If you follow the standards, GET should be used only for data retrieval and POST for adding new items. Which one is faster or slower depends on the server handler implementation.