As far as I understand it, scraping website content is illegal, especially if scraping is explicitly forbidden in the site's terms and conditions.
My question is: what if I manually copy URLs from the browser and save them in a database for later analysis or whatever, e.g. craigslist postings? Is this legal?
And what if I write an app into which I would manually copy/paste craigslist URLs, and it would extract only the metadata from the page header, e.g. the title or meta tags, and save that to a database? Would that be legal, since metadata is technically not content... or is it?
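To be concrete, here is a minimal sketch (Python, standard library only) of the kind of extraction I mean; the URL is just a placeholder and the parsing is deliberately simplistic:

# Sketch only: pull the <title> and <meta> tags from a manually pasted URL and
# store them in a local SQLite database. The URL at the bottom is a placeholder.
import sqlite3
import urllib.request
from html.parser import HTMLParser

class HeadMetadataParser(HTMLParser):
    """Collects the page title and any <meta name/property=... content=...> pairs."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            attrs = dict(attrs)
            name = attrs.get("name") or attrs.get("property")
            if name and "content" in attrs:
                self.meta[name] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def save_metadata(url, db_path="listings.db"):
    html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
    parser = HeadMetadataParser()
    parser.feed(html)
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT, title TEXT, meta TEXT)")
    conn.execute("INSERT INTO pages VALUES (?, ?, ?)",
                 (url, parser.title.strip(), repr(parser.meta)))
    conn.commit()
    conn.close()

if __name__ == "__main__":
    save_metadata("https://example.org/some-posting")  # placeholder URL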
I'm trying to download a municipal plan together with all of its related documents.
All of the documents can be reached from the following link.
I've tried the following command (which has worked well for other sites) and several variations, without success.
wget -E -k -r -l 3 "http://www.mavat.moin.gov.il/MavatPS/Forms/SV4.aspx?tid=4&et=1&mp_id=ppnCWTcsST9gG0%2fa0ayWnjFyZ%2bo14s221Ujlpi7UvR4jIRAHLKhJ8lOLSkomZ%2fvlHk8b2T0oENpI6Wh2hKzxQJCw9BPJP8gav%2ftgiKlk5S0%3d"
The same plan is on their new site, but I can't get the files from there either:
https://mavat.iplan.gov.il/SV4/1/5000931297/310
I'd appreciate any help.
Well, these days, and especially with .NET web sites, things work differently.
We don't use hyperlinks with a simple (full) path to the actual files on the web server. In fact, in most cases the web server isn't even given rights to those folders (they are not exposed to Internet Information Services).
So no actual links, in the form of a full URL to a document, exist.
What happens is that when you click on a button or button link, the code-behind on the web server runs (and that is code you don't have). Furthermore, that code-behind can browse, read, and retrieve any file from any folder on the server, or on other servers. But links from the web site don't exist, and it is not even possible to type in a URL that resolves to an actual file on the server.
So the server-side code (not Internet Information Services) goes and grabs the document. In fact, the documents could be in a database: the code-behind pulls the binary data (which represents a valid PDF file) from the database, or it reads the file from disk, and then it streams the file for download.
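To illustrate the pattern (this is not the site's actual .NET code; it's a minimal sketch of the same idea in Python/Flask, with invented table and column names): the handler looks the document up by its internal id and streams the bytes back, so no file URL is ever exposed.

# Sketch of the "code-behind streams the file" pattern, not the site's real code.
# The table name, column names and database file are invented for illustration.
import sqlite3
from flask import Flask, Response, abort

app = Flask(__name__)

@app.route("/openDoc/<doc_id>")
def open_doc(doc_id):
    # The browser never sees a path to the PDF; it only knows this internal id.
    conn = sqlite3.connect("documents.db")
    row = conn.execute(
        "SELECT filename, data FROM documents WHERE doc_id = ?", (doc_id,)
    ).fetchone()
    conn.close()
    if row is None:
        abort(404)
    filename, data = row
    # Stream the binary content back with headers that trigger a download.
    return Response(
        data,
        mimetype="application/pdf",
        headers={"Content-Disposition": f'attachment; filename="{filename}"'},
    )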
Now, this is often done for reasons of security. It means that no valid URL exists to get at a document.
It is not only done for security. From a developer's point of view it is often better to retrieve a row from a database. That row holds the information you see rendered on the page; the page is not static, so displaying the information comes down to pulling rows from the database and assigning the result to some control - say a DataGrid, a ListView, or whatever. That assignment is only a line or two of code, and the control plus the web server then renders it.
So this is done because the developer only assigns the result of a database query to the control that renders on the page. To add or remove documents, you only have to edit the database, and the information the page renders changes with it.
As a result, there are no direct links to the actual documents on the server. To retrieve a document, you have to send the web site the exact command it requires.
You can hit F12 (most browsers support this), which puts your browser into developer mode. Do this, use the select-element feature, and then click on a PDF link. You get this:
<img src="../images/ft/file_PDF.gif" style="cursor:pointer"
onclick="openDoc('99000526871729',
'AABA7BE646E182B67DB1C15220E531DF36BBB591D8EEA7757435B2606C08E6F9')">
So note the above: the openDoc event in that onclick is the server-side code you have to run to retrieve a document. There is no link, and you are not going to be able to wire up or run your own web page that hits that server and runs that onclick routine.
However, the onclick does expose the internal database document numbers used to pull, read, and retrieve a given document. But the path name, and how the code gets that file? You have no idea, and server-side code (C# or VB.NET) has to run. That code, as noted, grabs the file and then streams it down when you click the link.
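If all you are after is the list of those internal ids, you could in principle pull them out of a saved copy of the page with a regular expression - a sketch, assuming the page source was saved locally as plan.html and that every document uses the same openDoc('id', 'hash') form:

# Sketch: collect the openDoc(...) id/hash pairs from a locally saved copy of
# the page (plan.html is an assumed file name). This only lists the ids; it
# cannot download anything, since no direct URL to the files exists.
import re

with open("plan.html", encoding="utf-8") as f:
    html = f.read()

pattern = re.compile(r"openDoc\('([^']+)',\s*'([^']+)'\)")
for doc_id, doc_hash in pattern.findall(html):
    print(doc_id, doc_hash)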
So what about simple HTML-style pages, the kind built by someone who took a one-day HTML course? Sure, such web sites will have src=some path that is a valid URL, and those simple systems do let you enter a URL to grab a document: the documents are fully exposed to the web site, and a simple, valid URL path to each file exists. Not so with ASP.NET. As noted, this is done not only for security; it is also an overall better developer experience to write code that grabs the files, as opposed to rendering full path links to the files.
There are many additional benefits. For example, the database that drives this likely has a column (or several) containing the path names of the documents. If they run out of storage, or want to move older files to a much slower, much cheaper storage system, they can move the files and update the path columns in the database. The web site keeps working, since it never uses an exposed URL. And as noted, direct URLs don't exist; the web server (IIS), as opposed to the code-behind, doesn't even have rights to those file paths.
As a result, you won't be able to simply pull the web page and then extract URLs to the files.
What you might be able to do is write code that loads the web page, scans the event stubs attached to the links, and clicks each button through web-browser automation, as in the sketch below. But even that doesn't let you type file names into the download prompts.
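Here is a rough sketch of that browser-automation idea using Playwright for Python. The selector, the plan URL, and the assumption that clicking each icon triggers an ordinary browser download are all guesses about how the page behaves, so treat it as a starting point rather than something known to work:

# Sketch only: drive a real browser, click each openDoc icon and save whatever
# the site streams back. The selector and the page behaviour are assumptions.
from playwright.sync_api import sync_playwright

PLAN_URL = "https://mavat.iplan.gov.il/SV4/1/5000931297/310"

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto(PLAN_URL, wait_until="networkidle")

    # Assumed selector: the PDF icons carry an onclick="openDoc(...)" handler.
    icons = page.query_selector_all("img[onclick*='openDoc']")
    for i, icon in enumerate(icons):
        with page.expect_download() as dl_info:
            icon.click()
        download = dl_info.value
        download.save_as(f"doc_{i}_{download.suggested_filename}")

    browser.close()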
So what you ask is not easy, likely not possible, and at best a very difficult task. The simple reason is that this site does not use simple HTML and static links to files. It never exposes a direct link to a file, and worse, the web server does not have, or even allow, a direct URL to a file - such URLs don't exist, and the web server has no rights to those file names (only the .NET code-behind does, not Internet Information Services).
When you do click a document link, the code-behind grabs the document and streams the file back to the page or link you clicked. Simple HTML coders in the past would create a folder (usually a virtual folder) pointing to the files on some server and folder; with .NET it is easier, and far more secure, to stream instead.
Modern development tools don't use old-fashioned ideas like a URL that directly retrieves a file; they are designed differently.
In some cases URLs are allowed or created, and that is done for the sake of sharing links: if you have a cute video or document, the designers of the system will often permit parameters in the URL so you can pass a link to someone else. This page has no such provisions. You can share a link to the page, but no actual URL to a document, nor any provision for one, exists.
So this pretty much means that to retrieve a document you have to go to that web page, and only when you click on a document will the web site stream down that one particular document.
I would like to retrieve the URLs of all the unsecured pages on my site.
Are there tools for this?
If I've understood the question correctly, you need to check, by crawling your entire site, for pages that alert the user about "security" issues.
In substance, you have to find out whether any of your site's pages contains an "http:" (not "https:") URL inside the HTML code while you are browsing your site on https://yoursite.com.
If this is the case, I would use wget (to download the "rendered" pages of the site locally) and grep (to find any http:// link inside the code) for this task, under Linux.
If the site is not generated by a CMS and you have access to the sources, I would parse the source files to search for any "http:" inside the code.
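The grep half can also be a short Python script run over the directory that wget mirrored - a sketch, assuming the downloaded pages ended up under ./yoursite.com:

# Sketch: scan the pages mirrored by wget for insecure http:// references.
# "./yoursite.com" is the directory wget creates for the placeholder domain.
import pathlib
import re

insecure = re.compile(r"""(?:href|src)\s*=\s*["']http://[^"']+["']""", re.IGNORECASE)

for path in pathlib.Path("./yoursite.com").rglob("*.html"):
    text = path.read_text(encoding="utf-8", errors="replace")
    for match in insecure.findall(text):
        print(f"{path}: {match}")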
By the way, this method does not help you check for linked objects (images, scripts...) that use insecure SHA-1 certificates, or expired ones, even under https://...
Could you kindly further qualify your specific needs?
Best regards.
I'm creating a database/storage using Firebase as the backend for my website.
I have the option of attaching file metadata like size, cache control, etc.
Where do I get this information from?
And why is it important?
A majority of the information is already populated for you (size, content-type, etc.), so you don't have to get it anywhere else.
Additional information (such as content-disposition or cache-control) ends up setting the relevant headers when the file is downloaded, which can change the behavior (such as what the name of the downloaded file is, or how long the browser caches the file).
Basically, in the normal case you won't have to do anything special, but if you want custom behavior, you have these features when you need them.
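For example, with the Firebase Admin SDK for Python you can set those fields yourself at upload time; the bucket name, file paths and credentials file below are placeholders:

# Sketch: upload a file with custom cache-control and content-disposition
# metadata. Bucket name, paths and the service-account file are placeholders.
import firebase_admin
from firebase_admin import credentials, storage

cred = credentials.Certificate("service-account.json")
firebase_admin.initialize_app(cred, {"storageBucket": "my-project.appspot.com"})

bucket = storage.bucket()
blob = bucket.blob("images/logo.png")

# These values end up as response headers when the file is downloaded.
blob.cache_control = "public, max-age=3600"                    # cache for an hour
blob.content_disposition = 'attachment; filename="logo.png"'   # download file name

blob.upload_from_filename("logo.png")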
I'm building a backend admin system that edits JSON files controlling the look and feel of the main site. I want to add a 'preview' button that works before the user hits save. To do that, I want to use the main site itself, but instead of calling the actual JSON file in production, save a temp version of it and redirect this user's traffic for that file to the temp file - from the original site code.
I've considered Chrome plugins, configuring an iframe somehow, or, in the worst case, grabbing the production front end, parsing out the call to the prod JSON file and replacing it with the new temp JSON file. That is obviously not ideal, as it would entail a lot of work, and if anything changes on the prod site this would have to be updated.
I would love your ideas!
Do you have access to the main site's source code? You could implement a preview option on the main site that accepts a GET parameter and uses a temporary JSON setting based on that parameter.
From the backend admin system's point of view, it's just a matter of adding the JSON as part of the ajax GET request.
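A minimal sketch of that idea, written here in Python/Flask purely for illustration (the file names and the preview parameter are made up):

# Sketch: serve the page with a temporary JSON config when ?preview=<id> is
# present, and with the production config otherwise. All names are made up.
import json
import os
from flask import Flask, request

app = Flask(__name__)

def load_settings(preview_id=None):
    if preview_id and preview_id.isalnum():
        # temp file saved by the admin tool before the user hits save
        path = os.path.join("previews", f"{preview_id}.json")
    else:
        path = "settings.json"  # the production file
    with open(path) as f:
        return json.load(f)

@app.route("/")
def home():
    settings = load_settings(request.args.get("preview"))
    # A real site would feed `settings` into its templates; dumping it here
    # keeps the sketch self-contained.
    return f"<pre>{json.dumps(settings, indent=2)}</pre>"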
Unfortunately though, there is no easy way of doing this if you don't have access to the main site's source code or if you can't reach out to whoever maintains that main site.
Your cleanest option might be to recreate the main site's look and feel instead and pass it off as a "preview".
Sites like cnn.com or foxnews.com.
Where do they store all the articles? In HTML files? In a database?
It seems more logical to store everything in a database, but then how do you generate a static link to something that lives inside the database?
It's not that they have a dynamic page load like LoadArticle.aspx?ArticleID=123; every article has its own address.
Please explain how this is done.
They use a special content management library called VoodooLib.dll.
Seriously, when you write something to a database, you normally generate some kind of unique identifier - 123, for example. It gets permanently associated with that record (the article content), and after that it is used to generate the same id as part of a URL at any time later.
As for the static link, it is a simple matter of URL rewriting.
You generate static-looking links to display on a page because they work much better for SEO. When a request for that static URL hits the server, it gets substituted with something "server friendly" and is then processed.
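A quick sketch of that idea: the id comes back from the database insert, and the "static" link shown on the page is just a string built from it (the exact URL shape is only one possible convention):

# Sketch: build the static-looking link for a newly stored article.
# The /YYYY/MM/DD/id/slug shape is just one possible convention.
import re
from datetime import date

def slugify(title):
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def article_url(article_id, title, published):
    return f"/{published:%Y/%m/%d}/{article_id}/{slugify(title)}"

if __name__ == "__main__":
    print(article_url(123, "Markets Rally After Election", date(2008, 11, 5)))
    # -> /2008/11/05/123/markets-rally-after-election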
They probably use some form of Content Management System (CMS). There are many different ones out there - most store the actual content in a database or as XML (some store XML in a database). They will then either publish that content as static HTML pages or, more commonly now, as dynamic pages that are cached. Many use what are known as "friendly URLs": virtual addresses that are mapped to the actual physical file path using URL-rewriting techniques.
Note you can't tell whether a page is dynamic or static simply from the extension. It is quite possible to have dynamic pages that end in the .html extension.
Just because the URL looks "static" doesn't mean it is; they could be using something like mod_rewrite or an IIS ISAPI to make the URLs more search engine friendly.
For the high-volume news sites that you mention, however, they may very well generate the pages statically in order to prevent overloading the database with repeated requests for the same article.
Look at the URL of this page; it doesn't have xxx.aspx?some-query-string.
You are referring to friendly URLs.
To do something like that, one common way is to use URL rewriting and/or a custom HttpModule.
Here's a good reference: http://weblogs.asp.net/scottgu/archive/2007/02/26/tip-trick-url-rewriting-with-asp-net.aspx
Just because a page has a normal URL does not mean that it isn't serving dynamic content. With the Apache mod_rewrite module, it is possible to manipulate URLs. So, for example, a page like http://www.domain.tld/permalink/12345/message-title-slug can be converted internally to http://www.domain.tld/permalink/index.php?id=12345&slug=message-title-slug.
I do not know exactly what cnn.com and foxnews.com use, but I would bet that they use a Content Management System (CMS) which serves all pages dynamically, with the content stored either in a database or on the filesystem, and with authoring/publishing all being performed through the particular CMS.
Just checking cnn.com, the article links contain:
Year
Location (US or WORLD/specificlocationid)
Month
Day
Article name.
All of this information together can be used to uniquely identify any article (even less of it is probably actually needed). The dynamic content loading page address could easily be hidden by some method of URL rewriting, and then the information in the requested URL is used to determine which article in the DB is to be served up.
I don't know why all the other answerers seem to assume that some form of URL rewriting is necessary to create friendly URLs. It's not true at all.
It's perfectly possible to write web-serving code that splits a URL into parameters - e.g. year, month, title - and passes them directly to the code that gets the content from the database, without any need to rewrite the URL. Most modern web frameworks, such as Django and Rails, include this functionality out of the box.
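For example, in Django the mapping lives in the URL configuration and no rewriting is involved; the Article model and field names here are invented for illustration:

# Sketch of Django URL routing (no rewriting involved): the URLconf captures
# year, month and slug and passes them straight to the view, which looks the
# article up in the database. The Article model and its fields are assumed.
from django.shortcuts import get_object_or_404, render
from django.urls import path

from myapp.models import Article  # assumed model with `published` and `slug` fields

def article_detail(request, year, month, slug):
    article = get_object_or_404(
        Article, published__year=year, published__month=month, slug=slug
    )
    return render(request, "article.html", {"article": article})

urlpatterns = [
    path("<int:year>/<int:month>/<slug:slug>/", article_detail, name="article-detail"),
]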
This is done through mod_rewrite techniques.
Here's an article about the mod rewriting engine: http://httpd.apache.org/docs/1.3/mod/mod_rewrite.html
And here's their "guide": http://httpd.apache.org/docs/2.0/misc/rewriteguide.html
I hope that helps. It should make for a good starting point. Good luck.