Scrape image with no extension - web-scraping

I'm trying to scrape images from this site:
http://mis.historiska.se/mis/sok/bild.asp?uid=336358&g=1
The site also has an option to download different sizes, like the big image here:
http://catview.historiska.se/catview/media/highres/336358
I have no problem downloading manually, scraping the image, or even scraping the URL, but both the image and the URL are missing the file extension.
I need to scrape the full URL with filename and extension, NOT the actual image.

The proper way to do this would be to check the headers after making a request to the given url for the filename and extension. A simple curl request to the given url gives me the following response:
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Content-Type: image/jpeg
Content-Length: 569050
Date: Wed, 20 Jan 2016 15:33:49 GMT
The most reliable way to guess the file extension is to check the "Content-Type" header. Similarly, the filename would come from the "Content-Disposition" header. The server is not required to send it, and when it is absent (as in the response above) we need to guess the filename from the URL.
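A simple Python snippet for guessing the extension would be as follows:
import mimetypes
import requests

url = "http://catview.historiska.se/catview/media/highres/336358"
resp = requests.get(url)
# Drop parameters such as "; charset=utf-8" before looking up the type.
content_type = resp.headers["content-type"].split(";")[0].strip()
# Returns e.g. ".jpg", or None if the MIME type is unknown.
ext = mimetypes.guess_extension(content_type)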
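To cover the filename as well, here is a minimal sketch of the fallback order described above: Content-Disposition first, then the last URL path segment, then an extension taken from Content-Type. The helper name guess_filename and the "download" default are my own assumptions, not something the site or any library provides. It only uses the standard library and works on any mapping of headers.

```python
import mimetypes
import posixpath
from email.message import Message
from urllib.parse import unquote, urlparse


def guess_filename(url, headers):
    """Guess a filename (with extension) from response headers and the URL."""
    # email.message.Message gives case-insensitive header access and
    # knows how to parse header parameters such as filename="...".
    msg = Message()
    for key, value in headers.items():
        msg[key] = value

    # 1. Prefer the filename the server announces in Content-Disposition.
    name = msg.get_filename()
    if name:
        # Keep only the final component in case the server sent a path.
        name = posixpath.basename(name)
    if not name:
        # 2. Otherwise fall back to the last path segment of the URL.
        name = posixpath.basename(unquote(urlparse(url).path)) or "download"

    # 3. Append an extension derived from Content-Type if the name has none.
    if not posixpath.splitext(name)[1] and "Content-Type" in msg:
        ext = mimetypes.guess_extension(msg.get_content_type())
        if ext:
            name += ext
    return name
```

With requests, this could be called as guess_filename(resp.url, resp.headers) after the request shown above.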

Related

How to save response body of Intellij http client response to a file on local file system?

In the response handling section, suppose the overall response is the following:
[
{
"empId":1001,
"empName":"abc"
},
{
"empId":1002,
"empName":"xyz"
}
]
I am able to get this response as
> {%
console.log(response.body);
%}
Is there any way to write this response to a file on the local file system?
Also, we seem to have access only to the client and response objects.
Can we also write control structures such as for loops?
You can redirect a response to a file. Use >> to create a new file with a suffix if it already exists and >>! to rewrite the file if it exists. You can specify an absolute path or relative to the current HTTP Request file. You can also use variables in paths, including environment variables and the following predefined variables:
{{$projectRoot}} points to the project root: .idea
{{$historyFolder}} points to .idea/httpRequests/
The following example HTTP request creates myFile.json in myFolder next to the HTTP Request file and redirects the response to it. If the file already exists, it creates myFile-1.json and so on.
POST https://httpbin.org/post
Content-Type: application/json
{
"id": 999,
"value": "content"
}
>> myFolder/myFile.json
ref:
https://www.jetbrains.com/help/idea/exploring-http-syntax.html#response-redirect
https://www.jetbrains.com/help/idea/http-response-handling-api-reference.html
Not possible yet, here's a link to a corresponding feature request: https://youtrack.jetbrains.com/issue/IDEA-239333. You can vote for it or comment on it to receive updates.
Edit: this is now possible, see PhpStorm docs on that topic.

How to upload a file with PUT in HTTPie

I'm looking for the syntax for a PUT operation that uploads a file with HTTPie. Could you please point me to the right syntax? I could not find a way to do so in the official documentation.
To achieve this with httpie, you need to do two things:
Set the HTTP method to PUT, which is trivial: $ http PUT […]
Pass the contents of the file, for which there are various ways:
Redirected input:
$ http PUT httpbin.org/put Content-Type:image/png < /images/photo.png
Request data from a filename (automatically sets the Content-Type header):
$ http PUT httpbin.org/put @/images/photo.png
Form file upload:
$ http --form PUT httpbin.org/put photo@/images/photo.png
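For comparison, this is roughly what the "request data from a filename" variant amounts to if you script it instead of using the CLI. It is only a sketch using Python's standard library: build_put_request is a hypothetical helper of mine, and the Content-Type guess from the file name is an approximation of what HTTPie does, not its actual implementation. The request is built but not sent.

```python
import mimetypes
import urllib.request


def build_put_request(url, path):
    """Build (but do not send) a PUT request whose body is the file at path."""
    with open(path, "rb") as fh:
        body = fh.read()
    # Guess the Content-Type from the file name; fall back to a generic type.
    content_type = mimetypes.guess_type(path)[0] or "application/octet-stream"
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": content_type},
    )


# To actually send it (needs network access):
# request = build_put_request("http://httpbin.org/put", "/images/photo.png")
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```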

Changing MIME type with Meteor

When I'm running my app on localhost, I get 2 warnings concerning MIME type. This is one of them:
Resource interpreted as Stylesheet but transferred with MIME type text/html: "http://localhost:3000/BootstrapEssentials/bootstrap.css".
The other warning is identical except for the file name. Both files are in my working directory. So far, I have looked at these similar questions but they haven't helped:
Resource interpreted as stylesheet but transferred with MIME type text/html (seems not related with web server)
Chrome says "Resource interpreted as script but transferred with MIME type text/plain.", what gives?
Resource interpreted as stylesheet but transferred with MIME type text/html
Originally I was trying to use this line:
<link rel="stylesheet" href="/BootstrapEssentials/bootstrap.css">
I have since added in the type field:
<link rel="stylesheet" href="/BootstrapEssentials/bootstrap.css" type="text/css">
but that didn't do anything. I also used the JavaScript console to see that the response header has content-type: text/html; charset=utf-8. I believe that if I can change that to content-type: text/css; charset=utf-8 everything will be fine, but I can't find how to do that either.
As per the Meteor docs:
All files inside a top-level directory called public/ are served as-is to the client. When referencing these assets, do not include public/ in the URL, write the URL as if they were all in the top level. For example, reference public/bg.png as /bg.png. This is the best place for favicon.ico, robots.txt, and similar files.
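Moving your stylesheets to a directory in public should do the trick! For example, put the file at public/BootstrapEssentials/bootstrap.css and keep the href as /BootstrapEssentials/bootstrap.css.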

Gzip on nginx server enabled, but only certain files compressed

I have a website that is hosted on an nginx server with gzip compression enabled. If I check my website on http://gzipwtf.com/, I see that only 2 files out of 7 are compressed. At first I thought that something might be wrong with the filename or pattern, but I could not figure out why certain files are compressed and others are not. There is no connection between file type, filename, file location and compression. It's not that only js files or only css files fail: both types work, but not for every file.
I have the following files:
css/sb.min.css (compressed)
css/style.min.css (not compressed)
css/jquery.mcustomscrollbar.min.css (not compressed)
js/jquery.mcustomscrollbar.concat.min.js (not compressed)
js/jquery.sidebar.min.js (not compressed)
js/jquery.min.js (not compressed)
js/sbhandlers.js (compressed)
I also checked in the Developer Tools of Google Chrome. The request headers looked like this for each file:
Accept:*/*
Accept-Encoding:gzip, deflate, sdch
Accept-Language:en-US,en;q=0.8
...
But, as expected, only the two compressed files had this in the response headers:
Content-Encoding:gzip
I have absolutely no idea why only two files are compressed. I already wrote to the web host's support, but they said everything is configured correctly.
Am I missing the point?!
Thanks for your help!

Using FileInfo to see when a file was updated... on another server

Hey guys, I am pulling a vehicle feed into an auto dealer website for one of our clients. Every night at midnight(ish) the new XML file is uploaded to our FTP and overwrites the current one. Currently he has two identical websites and the file needs to be uploaded to both. I was looking into setting it up so both websites can use the same XML file, to cut down on the risk of errors and for convenience.
Pulling the file works great: both websites can read the XML file and have no issues displaying the inventory. The issue comes in when I try to display the date the file was last updated. I created a small snippet that reads the date the file was updated and displays "Last Update:" followed by the date, but when I try to reference a non-local file I get an error that says "URI formats are not supported". Does anyone know of a way to do this, or if it's even possible?
What it currently is:
FileInfo fileInfo = new FileInfo(Server.MapPath("~/feed/VEHICLES.XML"));
DateTime timeOfCreation = fileInfo.LastWriteTime;
What I tried:
FileInfo fileInfo = new FileInfo("http://www.autodealername.com/feed/VEHICLES.XML");
DateTime timeOfCreation = fileInfo.LastWriteTime;
This was no good.
This can be done via FTP, since you're using it already.
http://msdn.microsoft.com/en-us/library/system.net.ftpwebresponse.lastmodified.aspx
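The same idea outside .NET, as a rough sketch: the FTP MDTM command returns a file's modification time in UTC as "213 YYYYMMDDHHMMSS". This assumes the server supports MDTM, which not every FTP server does. The function names, host, credentials and path are placeholders of mine.

```python
from datetime import datetime, timezone
from ftplib import FTP


def parse_mdtm(reply):
    """Parse an FTP MDTM reply such as '213 20111123173744' into a UTC datetime."""
    code, _, stamp = reply.strip().partition(" ")
    if code != "213":
        raise ValueError(f"unexpected MDTM reply: {reply!r}")
    # Ignore optional fractional seconds ("YYYYMMDDHHMMSS.sss").
    stamp = stamp.split(".")[0]
    return datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def ftp_last_modified(host, user, password, path):
    """Ask the FTP server when path was last modified (requires MDTM support)."""
    with FTP(host) as ftp:
        ftp.login(user, password)
        return parse_mdtm(ftp.sendcmd("MDTM " + path))
```

A call would look like ftp_last_modified("ftp.example.com", "user", "secret", "feed/VEHICLES.XML"), with your own host and credentials.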
FileInfo uses information from the underlying file system which isn't available over HTTP. You'll need to think of some other way.
If you load the file this way:
FileInfo fileInfo = new FileInfo("http://www.autodealername.com/feed/VEHICLES.XML");
the file is most likely served to you by IIS or whatever web server runs on that domain, which is not the same as opening the file from the file system directly.
I think you have at least two alternatives:
open the file from a network share like \\machinename\ShareName\FileName;
create a service endpoint on the remote server (WCF or XML web service) which takes a file name and returns the information you need.
You can try using a WebRequest using the HEAD method and look for the Last-Modified header.
Here's the code I used...
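// Requires: using System; using System.Net;
var web = (HttpWebRequest)WebRequest.Create("http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=4");
web.Method = "HEAD"; // request the headers only, not the body
using (var response = web.GetResponse()) // dispose the response to release the connection
{
    var lastModified = DateTime.Parse(response.Headers["last-modified"]);
    Console.WriteLine(lastModified);
}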
Here's what the http response looks like (from Fiddler)...
HTTP/1.1 200 OK
Server: nginx/0.8.36
Date: Wed, 23 Nov 2011 17:37:44 GMT
Content-Type: image/png
Connection: keep-alive
Cache-Control: max-age=604800
Last-Modified: Tue, 06 Sep 2011 21:44:29 GMT
ETag: "6237328de6ccc1:0"
Content-Length: 19706
X-Cache: HIT
Accept-Ranges: bytes
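If you want to try the HEAD approach from a script before wiring it into the site, here is a small Python sketch of the same technique using only the standard library. The function names are mine. Note that a server is free to omit Last-Modified, in which case this returns None.

```python
import urllib.request
from email.utils import parsedate_to_datetime


def parse_http_date(value):
    """Parse an HTTP date such as 'Tue, 06 Sep 2011 21:44:29 GMT'."""
    return parsedate_to_datetime(value)


def http_last_modified(url):
    """Return the Last-Modified time of url via a HEAD request, or None."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        value = response.headers.get("Last-Modified")
    return parse_http_date(value) if value else None
```

For the response shown above, parse_http_date would receive the string "Tue, 06 Sep 2011 21:44:29 GMT" and return a timezone-aware datetime in UTC.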
You could also add the updated field to the feed so you can get the last time it was updated from the feed itself.
RSS pubDate:
http://www.w3schools.com/rss/rss_tag_pubdate.asp
<?xml version="1.0" encoding="ISO-8859-1" ?>
<rss version="2.0">
<channel>
<title>W3Schools Home Page</title>
<link>http://www.w3schools.com</link>
<description>Free web building tutorials</description>
<!-- YOU COULD USE THIS -->
<pubDate>Thu, 27 Apr 2006</pubDate>
<item>
<title>RSS Tutorial</title>
<link>http://www.w3schools.com/rss</link>
<description>New RSS tutorial on W3Schools</description>
</item>
</channel>
</rss>
Atom updated:
http://www.atomenabled.org/developers/syndication/atom-format-spec.php#rfc.section.1.1
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<title>Example Feed</title>
<link href="http://example.org/"/>
<!-- YOU COULD USE THIS -->
<updated>2003-12-13T18:30:02Z</updated>
<author>
<name>John Doe</name>
</author>
<id>urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6</id>
<entry>
<title>Atom-Powered Robots Run Amok</title>
<link href="http://example.org/2003/12/13/atom03"/>
<id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
<updated>2003-12-13T18:30:02Z</updated>
<summary>Some text.</summary>
</entry>
</feed>
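Reading such a field back is straightforward. Below is a minimal Python sketch that returns the feed-level timestamp text from either of the two formats above. It assumes the feed really is RSS 2.0 or Atom; a custom vehicle feed would need its own element name, and feed_timestamp is just a name I made up.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def feed_timestamp(xml_data):
    """Return the feed-level timestamp text from an RSS or Atom document, or None."""
    root = ET.fromstring(xml_data)
    if root.tag == "rss":
        node = root.find("channel/pubDate")
    elif root.tag == ATOM_NS + "feed":
        # Direct child only, so per-entry <updated> elements are ignored.
        node = root.find(ATOM_NS + "updated")
    else:
        node = None
    return node.text.strip() if node is not None and node.text else None
```

The returned string is left unparsed on purpose, since RSS and Atom use different date formats.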
Maybe try using the FileSystemWatcher Class, which can notify you when a file was changed, modified, etc. Take a look at it.
Good luck!

Resources