I am new to screen scraping. When I use a proxy server and track the HTTP transactions, I can see my POST data exposed. So my doubts/problems here are:
1) Will it get stored on the server side, or is it revealed only to the client side?
2) Do we have an option of encrypting the POST data in screen scraping?
3) Is it advisable to use screen scraping for banking applications?
I am using the screen-scraper tool (Enterprise version), which I downloaded from http://www.screen-scraper.com/download/choose_version.php.
Thanks in advance.
My experience with scraping is that unless you are doing something super complex (like logging into a secure site such as an online banking website), Python has some great libraries that will help you out a lot.
To answer your questions:
1) You may need to be more clear, but this really depends on your server/client architecture.
2) As a matter of fact you do. urllib and urllib2 (built-in Python libraries) can both send the POST over HTTPS, so the data is encrypted in transit by TLS rather than by your own code. As far as how secure that is, for most applications it will suffice.
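For instance, here is a minimal sketch of an HTTPS POST using urllib/urllib2 (the URL and form fields are placeholders; the encryption in transit comes from the https:// URL, not from the libraries themselves):

    import urllib
    import urllib2

    # Hypothetical endpoint and form fields; TLS protects the request because the URL is https://.
    url = "https://example.com/login"
    post_data = urllib.urlencode({"username": "alice", "password": "secret"})

    response = urllib2.urlopen(urllib2.Request(url, post_data))
    print(response.read())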
3) I actually have done scraping on online banking sites! I'm not exactly familiar with that tool, but I would recommend using something a little different from a scraper. Selenium, which is a "web driver", allows you to simulate the use of a browser, meaning anything the browser does in the background to validate the session is automatically taken care of. The main problem I ran into while trying to scrape the banking site was the loss of important session data.
Selenium - https://pypi.python.org/pypi/selenium
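As a rough illustration, a minimal Selenium login sketch might look like this (the URL and element ids are hypothetical, and newer Selenium releases use driver.find_element(By.ID, ...) instead of the older helpers shown here):

    from selenium import webdriver

    # Hypothetical login page and element ids.
    driver = webdriver.Firefox()
    driver.get("https://example.com/login")
    driver.find_element_by_id("username").send_keys("alice")
    driver.find_element_by_id("password").send_keys("secret")
    driver.find_element_by_id("login-button").click()

    # The browser keeps the session cookies, so later navigation stays authenticated.
    html = driver.page_source
    driver.quit()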
Other libraries you may find useful are: urllib, urllib2, and Mechanize
I hope I was somewhat helpful!
I've used screen-scraper to scrape banking sites before. It hits the site just like your browser does: if the site uses encryption, the connection from screen-scraper to the site will be encrypted too.
If you have a client page sending data to screen-scraper, you probably should encrypt that. I generally just make the connection via SSH.
1) What do you mean by server side? Your proxy server or the screen-scraper software? Either of them can read/store your information.
2) If you are connecting over HTTPS then your software should warn you about a malicious proxy server: https://security.stackexchange.com/questions/8145/does-https-prevent-man-in-the-middle-attacks-by-proxy-server
3) I don't think they have a logger they can read, but if you are concerned you can try to write your own scraper. There are libraries that let you read HTML easily with jQuery syntax:
https://pypi.python.org/pypi/pyquery or XPath: http://net.tutsplus.com/tutorials/javascript-ajax/web-scraping-with-node-js/
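For example, a small pyquery sketch (the HTML here is made up; in practice you would feed it the page you scraped):

    from pyquery import PyQuery as pq

    # Made-up HTML standing in for a scraped page.
    html = "<table><tr><td class='balance'>1,234.56</td></tr></table>"
    doc = pq(html)
    print(doc("td.balance").text())  # prints: 1,234.56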
TL;DR
Is there a tool that can record all the network activity as I visit a website and create a mock server that responds to those requests with the same responses?
I'm investigating ways of mocking the complex backend for our React application. We're currently developing against the real backend (plus test/staging environments). I've looked around a bit and it looks like there are a number of tools for mocking individual endpoints/features and sending the rest through to the real API (Mirage is leading the pack at the moment).
However, the Platonic ideal would be to mock the entire server so that a front end dev can work without an internet connection (again: Platonic ideal). It's a crazy lofty goal, I know this. And of course it would require mocking not only our backend but also requests to any 3rd-party data sources. And of course the data would be thin and dumb and stale. But this is just for ultra-speedy front end development, it's just mocking. The data doesn't need to be rich, it'll be up to us to make it as useful/realistic as we need it to be.
Probably the quickest way would be to recreate the responses the backend is already sending, and then modify them as needed for new features or features under test etc.
To do this, we might go into Chrome DevTools and recreate everything on the network tab: mock every request that was made by hardcoding the response it returned. Taking it from there, do smart things like use URL pattern matching to return a simple placeholder image for any request to get a user's avatar.
What I want to know is: is there any tool out there that does this automatically? That can watch as I load the site, click a bunch of stuff, take a bunch of actions, and spit out or set up a mock that recreates all the responses? And then we could edit any of them as we saw fit to simplify.
Does something like this exist? Maybe it's a browser tool. Maybe it's webpack middleware. Maybe it's a magic rooster.
PS. I imagine this may not be a specific, actionable enough question for SO. I'll understand if it's closed, but I'd really appreciate being directed somewhere where such questions/discussions would fit? I'm new enough to this world that SO is all I know!
There is a practice called service virtualization - a subset of the test double family.
Wikipedia has a list of tools you can use to do that. Here are a couple of examples from that list:
Open source WireMock will let you record the mocks and edit the responses programmatically.
Commercial Traffic Parrot will let you record the mocks and edit the responses via a UI and/or programmatically.
https://mswjs.io/ can mock all the requests for you. It intercepts all your client's requests and returns your defined mock data.
Could anyone walk me through the steps of implementing Encrypted Media Extensions using videojs-contrib-eme on a local server (with an access point) that doesn't have internet access?
Users connect to the local server over WiFi with their mobile devices and play back the videos in the browser.
So my question is this: EME implementations use the following external components:
Key System
Content Decryption Module (CDM)
License (Key) server
Packaging service
(for more info, see https://developers.google.com/web/fundamentals/media/eme)
Which components are already provided by videojs-contrib-eme, and which components do I need to implement?
It sounds like you are building for an offline case. The main DRM systems supported by most browsers (Widevine, FairPlay, and PlayReady) usually require an internet connection for the license request and response.
It is possible to have persistent licenses, i.e. a DRM license that will work offline for "download and go" use cases like watching movies offline, but even this requires internet connectivity for the original license request and response.
If you plan to implement your own proprietary DRM system, then you will need more changes than just to the player itself, i.e. video.js, in your example.
You will need to implement some form of key server, your own CDM and some form of packager.
It's certainly possible to do all this, but it is a lot of work. If this is not just for a learning exercise, it may be more practical to implement some simple encryption solution on your server and then add simple decryption functionality just before you play the content. This is not as secure but may be good enough for your needs.
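As one possible sketch of that "simple encryption" route, assuming Python is available on the server (the file names are placeholders, and this is illustrative rather than anything videojs-contrib-eme provides):

    from cryptography.fernet import Fernet

    # One-time setup: generate and store a key (keep it out of the content directory).
    key = Fernet.generate_key()
    fernet = Fernet(key)

    # Encrypt a media file before publishing it.
    with open("movie.mp4", "rb") as f:
        encrypted = fernet.encrypt(f.read())
    with open("movie.mp4.enc", "wb") as f:
        f.write(encrypted)

    # Decrypt just before playback, in whatever component serves the stream.
    with open("movie.mp4.enc", "rb") as f:
        plain = fernet.decrypt(f.read())

For large video files you would likely encrypt in chunks or use a streaming cipher rather than loading the whole file into memory, and the decryption step would live in whatever serves or plays the content.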
Alternatively, if you really want DRM-level security, it might be worth seeing if you can have limited internet access just for the DRM license requests and responses, which are typically very small. This would also allow you to leverage standard browsers and packagers.
I was searching for information about one of my questions, but I couldn't find any. I'm working on an ASP.NET site and using AJAX to request data. Since I'm currently working on my own, I don't know web programming best practices.
I usually get all the information I need from the server, use JavaScript to display/modify it, and use AJAX to send it back to the server. A friend of mine uses PHP for most of the programming; he rarely uses any JavaScript and told me it's way faster this way, since it does not consume the client's resources.
The basic question actually is:
According to best practices, is it better for the server to just provide the data needed by the application, or is it better to use the server for more than this?
That is going to depend on the expected amount of traffic for the site, the amount of content being generated, and the expectations of the end-user.
In a high-traffic site, it is actually "faster" for the end user if you let JavaScript generate a portion of the content on the client side. Also, you can deliver a better user experience during long load times through client-side scripting than you can if the content is loaded completely on the server.
In most cases you would need at least some backend code, for example when validating user input or when retrieving information from a persistent database. And what about visitors who have JavaScript disabled in their user agent, people using a screen reader, or search engine crawlers?
IMHO you should at least (again, in most cases) have backend code that is able to do all the work and return a full webpage to the client. On top of this you can add JavaScript functionality to make the user interface "smoother", for example by validating user data before submitting it to the server (remember to ALWAYS also check on the server side) or by loading partial HTML (AJAX).
The point about being faster or using fewer resources when doing it server-side doesn't make much sense. Even if it were true, it wouldn't matter much (and I highly doubt the claim). If you use client-side scripting to load only the parts that are needed, you would actually use fewer resources on both the client and the server.
I'm attempting to write a script to log on (username and password) to a website and download a file. I've tried using the WebClient and WebBrowser classes to no avail; they don't seem to work for what I need them to do.
Does anyone else have any suggestions?
I'd suggest you look at this StackOverflow thread. This is not specific to any programming language or platform (you didn't mention which language/platform you're using, so I can't offer any specific code advice) but in my answer I detail the basic approach you'll need to take for successful HTML screen-scraping.
In a nutshell, first you'll need to find the right combination of HTTP headers, URLs, and POST data to fool the server into thinking your client app is a "real" browser. A tool like Fiddler, which allows you to see the actual HTTP requests going over the wire and experimentally build new requests based on browser requests, is invaluable here.
Next you'll need to figure out, in your language and platform, how to produce that set of headers, URLs, and/or POST data. Handling cookies and redirects is typically the hardest part, and most platforms have specialized classes (e.g. CookieContainer in .NET) to help with things like this.
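To make that concrete, here is a rough sketch in Python (the question doesn't say which platform is in use, so treat the URLs, form fields, and headers as placeholders you would replace with whatever Fiddler shows the real browser sending):

    import cookielib
    import urllib
    import urllib2

    # Keep cookies between requests, like a browser would.
    cookie_jar = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))
    opener.addheaders = [("User-Agent", "Mozilla/5.0")]  # look like a normal browser

    # Step 1: post the login form; the session cookie lands in cookie_jar.
    login_data = urllib.urlencode({"username": "alice", "password": "secret"})
    opener.open("https://example.com/login", login_data)

    # Step 2: fetch the protected file with the same opener so the cookie is sent along.
    response = opener.open("https://example.com/reports/statement.pdf")
    with open("statement.pdf", "wb") as f:
        f.write(response.read())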
The closest example I can think of is iTunes. I'm thinking about a system where a server stores loads of files, and each user only has access to those they have paid for. Using a desktop app, they can download these to their local PC where they are stored as regular files.
How might one approach this? I can see a couple of possible options, and have some initial thoughts, but would welcome feedback on these or other ideas. If you post your preferred design, people can vote on them!
1) Use HTTP requests, and the response is the file data. Then a simple servlet (or similar) can act as a control on which files are downloaded.
PROs: easy to do
CONs: seems a little hacky, how would you display a progress bar?
2) Use sockets, and a custom server app which pipes the file data to the client
PROs: Perhaps more performant (?), can send data in nice sized chunks
CONs: A little more work on the client side, quite a bit more to write a custom server-side app that runs 24/7
Thanks in advance. Someone please edit my tags, I can't think of the right ones!
Use HTTP requests, and the response is the file data. Then a simple servlet (or similar) can act as a control on which files are downloaded. PROs: easy to do CONs: seems a little hacky, how would you display a progress bar?
I don't see why this is hacky. Your app would authenticate using the user's username and password (if you want it to work like iTunes) and fetch files according to permission level. A progress bar is easy to do because you will get the Content-Length header in the response. It's a more flexible approach than FTP - but if FTP already does everything you need, go for that.
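For example, here is a rough client-side sketch in Python (the question mentions a servlet backend, so this is only an illustration; the URL is a placeholder, and the server just needs to stream the bytes and set Content-Length):

    import urllib2

    # Hypothetical file URL behind the authenticating servlet.
    url = "https://example.com/files/song.mp3"
    response = urllib2.urlopen(url)
    total = int(response.info().getheader("Content-Length"))

    downloaded = 0
    with open("song.mp3", "wb") as f:
        while True:
            chunk = response.read(8192)
            if not chunk:
                break
            f.write(chunk)
            downloaded += len(chunk)
            print("%.1f%% downloaded" % (100.0 * downloaded / total))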
As said, FTP may be what you need. To control per-user, per-file permissions you can create a system user and then apply filesystem-level ACLs. Then an FTP server like PureFTPd will let you log in with system accounts that have the specified permissions.