Use a robots.txt file to block everything except images on a CDN

I only serve images from my CDN.
I have a robots.txt file set up on my CDN domain, separate from the one set up on my 'normal' www domain.
I want to format the CDN robots.txt file so that it blocks the indexing of everything except images (regardless of their location).
The reason for all this is that I want to avoid duplicate content.
Is this correct?
User-agent: *
Disallow: /
Allow: /*.jpg$
Allow: /*.jpeg$
Allow: /*.gif$
Allow: /*.png$

If you have all images in certain folders, you could use:
For Google's image bot only:
User-agent: Googlebot-Image
Allow: /some-images-folder/
For all user-agents:
User-agent: *
Allow: /some-images-folder/
Additionally, Google has introduced increased flexibility to the robots.txt file standard through the use of wildcards. Disallow patterns may include "*" to match any sequence of characters, and patterns may end in "$" to indicate the end of a URL.
To allow a specific file type (for example, .gif images) you can use the following robots.txt entry:
User-agent: Googlebot-Image
Allow: /*.gif$
Info 1: By default (in case you don't have a robots.txt), all content is crawled.
Info 2: The Allow statements should come before the Disallow statement, no matter how specific your statements are, since some crawlers apply the first matching rule rather than the most specific one.
Here's a wiki link to the robots exclusion standard for a more detailed description.
According to that, your example should look like:
User-agent: *
Allow: /*.jpg$
Allow: /*.jpeg$
Allow: /*.gif$
Allow: /*.png$
Disallow: /
NOTE: As nev pointed out in his comment, it's also important to watch out for query strings at the end of extensions, like image.jpg?x12345, so also include:
Allow: /*.jpg?*$

Yes, your Disallow is right, and your Allow is right too! And just as a tip, specify a sitemap too! :)
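For example, append a line like this to the file above (the URL is a placeholder for your own sitemap):
Sitemap: https://cdn.example.com/sitemap.xml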

Nginx hanging on audio file request

I'm having the most bizarre issue with nginx.
After upgrading from 1.6.3 to 1.12.2 on RHEL 7.2, requests for audio files are just hanging:
Connecting to mydomain [...] ... connected.
HTTP request sent, awaiting response...
In my nginx access.log, I'm seeing a 200 status:
"GET /media/Example.mp3 HTTP/1.1" 200 105243 "-" "Wget/1.19.4 (linux-gnu)" "-"
If I request an MP4 file in the same directory, with the same permissions, it works just fine. I've tried MP4s that are both larger and smaller than my MP3 file: they work just fine.
CSS/JS/images also work fine.
If I comment out the MP3 mime type in /etc/nginx/mime.types, and then request /media/Example.mp3, it works just fine (!!!).
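For reference, that is this stock line in /etc/nginx/mime.types; commenting it out makes nginx fall back to its default type, application/octet-stream:
audio/mpeg    mp3;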
I added the ogg mime type to see if this was somehow related to just audio, and indeed, OGG files fail in the same way as MP3s.
I've set up debug logging, and everything looks normal for an MP3 request.
I've disabled SELinux, checked the permissions on the files, parent folders, etc. and confirmed that there is not a problem with the actual MP3 file.
I've tried turning sendfile off.
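That test was roughly this (a sketch; the rest of the server block is omitted):
location /media/ {
    # serve files via plain read()/write() instead of the sendfile() syscall
    sendfile off;
}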
I can't undo this YUM transaction; it looks like there was a security issue with that version of nginx, and it is no longer available.
I've searched around online, but can't find any related reports. Does anyone have any thoughts/suggestions?
EDIT
When I set the Accept header and try to connect, curl output looks like:
* Trying my.ipaddress...
* TCP_NODELAY set
* Connected to my.host (my.ipaddress) port 80 (#0)
> GET /media/Example.mp3 HTTP/1.1
> Host: my.host
> User-Agent: curl/7.58.0
> Accept: audio/mpeg
>
And then it just hangs...

How do I find out what Content Types are on offer (for HTTP Content Negotiation)?

What one gets back when resolving a DOI depends on content negotiation.
I was looking at https://citation.crosscite.org/docs.html#sec-3 and I see different services offer different Content Types.
For a particular URL, I want to know all the content types it can give me.
Some of them might be more useful than any that I am aware of (i.e. I don't want to write a list of preferences in advance).
For example:
https://doi.org/10.5061/dryad.1r170
I thought maybe OPTIONS was the way to do it, but that gave back nothing interesting, only information about allowed request methods:
shell> curl -v -X OPTIONS http://doi.org/10.5061/dryad.1r170
* Hostname was NOT found in DNS cache
* Trying 2600:1f14:6cf:c01::d...
* Trying 54.191.229.235...
* Connected to doi.org (2600:1f14:6cf:c01::d) port 80 (#0)
> OPTIONS /10.5061/dryad.1r170 HTTP/1.1
> User-Agent: curl/7.38.0
> Host: doi.org
> Accept: */*
>
< HTTP/1.1 200 OK
* Server Apache-Coyote/1.1 is not blacklisted
< Server: Apache-Coyote/1.1
< Allow: GET, HEAD, POST, TRACE, OPTIONS
< Content-Length: 0
< Date: Mon, 29 Jan 2018 07:01:14 GMT
<
* Connection #0 to host doi.org left intact
I guess there is no such standard yet, but the Link header (https://www.w3.org/wiki/LinkHeader) could expose this information.
Personally, though, I wouldn't rely too much on it. For example, a server could start sending a new content type and still NOT expose it via this header.
It might be useful to check the API response headers frequently, via manual or automated means, for any changes.
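Failing that, a pragmatic approach is to probe the URL with one Accept header at a time and see which types succeed. A sketch, using two content types listed in the Crosscite docs (an unsupported type should come back with an error status such as 406 Not Acceptable, though behavior varies by server):
curl -sIL -H "Accept: application/x-bibtex" https://doi.org/10.5061/dryad.1r170
curl -sIL -H "Accept: application/vnd.citationstyles.csl+json" https://doi.org/10.5061/dryad.1r170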

LinkedIn Link Sharing: Open Graph Image Issue

NOTE: I've seen a variety of similar questions to this one, so I'm going to try to be succinct in describing the issue along with what I've done so far.
The issue is that LinkedIn is failing to properly scrape images from articles on a WordPress site. The site uses All-in-One SEO to add the appropriate meta tags and, judging by Facebook's sharing and object debuggers, is doing so correctly.
Here's a sample article that demonstrates the issue.
Upon entering the URI into a LinkedIn article, LinkedIn attempts to fetch the page's data. It returns the title and description but leaves an empty space where the image would presumably display.
In tailing the access logs, I've seen LinkedIn hitting the site along with 200 status codes for the page itself and the image:
[ip redacted] - - [29/Mar/2017:19:50:44 +0000] "GET /linkedin-test/?refTest=LI17 HTTP/1.1" 200 23758 "-" "LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/3.1 +http://www.linkedin.com)" 0.906 "[ips redacted]"
[ip redacted] - - [29/Mar/2017:19:50:44 +0000] "GET /wp-content/uploads/2017/03/modern-architecture-skyscrapers-modest-with-images-of-modern-architecture-ideas-on-gallery.jpg HTTP/1.1" 200 510088 "-" "LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/3.1 +http://www.linkedin.com)" 0.000 "[ips redacted]"
Following some other Stack Overflow threads, I experimented with the following:
Attempt to bust LinkedIn's cache with query strings.
Result: no change with query strings or completely new URLs.
Verify that the og:image resource's dimensions are not too small.
Result: no change using images that match or exceed the dimensions indicated in LinkedIn's knowledge base.
Revise all meta tags to include prefix="og: http://ogp.me/ns#" (see the sketch after this list).
Result: no change.
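For reference, the resulting head markup looked roughly like this (the URLs and text here are placeholders, not the real article's values):
<head prefix="og: http://ogp.me/ns#">
  <meta property="og:title" content="Sample Article" />
  <meta property="og:description" content="A test article for LinkedIn sharing." />
  <meta property="og:image" content="https://example.com/wp-content/uploads/2017/03/sample.jpg" />
  <meta property="og:url" content="https://example.com/linkedin-test/" />
</head>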
I feel like I've hit a wall so any suggestions or thoughts are definitely welcome!
Thanks!
This seems to work fine for me:
https://www.linkedin.com/sharing/share-offsite/?url=https%3A%2F%2Fdev-agilealliance.pantheonsite.io%2F
Source: official Microsoft documentation for sharing on LinkedIn.
So, what changed? I did some digging.
Let's look at the 2018 robots.txt file for dev-agilealliance.pantheonsite.io and compare it to the working 2020 robots.txt file. Seems pretty clear; this is why things were blocked in 2018...
# Pantheon's documentation on robots.txt: http://pantheon.io/docs/articles/sites/code/bots-and-indexing/
User-agent: *
Disallow: /
And this is the working robots.txt in 2020...
User-agent: RavenCrawler
User-agent: rogerbot
User-agent: dotbot
User-agent: SemrushBot
User-agent: SemrushBot-SA
User-agent: PowerMapper
User-agent: Swiftbot
Allow: /
Of course, it could also be that sharing services are much more lenient in 2020 than in 2017! But I can confirm that things do seem to be working now with the robots.txt above.
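As a side note: if you need to keep a site closed to crawlers in general but still let LinkedIn scrape it, a sketch like this should work (LinkedInBot is the user-agent token that appears in the access logs above):
User-agent: LinkedInBot
Allow: /

User-agent: *
Disallow: /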

Nginx return statement not accepting "text"

Following config is working for me:
server {
    listen 80;
    root /app/web;
    index index.json;

    location / {
        return 409;
    }
}
If I hit the website, the 409 page will be presented. However, the following is not working:
server {
    listen 80;
    root /app/web;
    index index.json;

    location / {
        return 409 "foobar";
    }
}
The page is unreachable. But according to the docs (http://nginx.org/en/docs/http/ngx_http_rewrite_module.html#return),
return 409 "foobar";
should work. Any ideas what's wrong? There are no logs in nginx/error.log.
The thing is, Nginx does exactly what you ask it to do. You can verify this by calling curl -v http://localhost (or whatever hostname you use). The result will look somewhat like this:
* Rebuilt URL to: http://localhost/
* Hostname was NOT found in DNS cache
* Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 80 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.35.0
> Host: localhost
> Accept: */*
>
< HTTP/1.1 409 Conflict
* Server nginx/1.4.6 (Ubuntu) is not blacklisted
< Server: nginx/1.4.6 (Ubuntu)
< Date: Fri, 08 May 2015 19:43:12 GMT
< Content-Type: application/octet-stream
< Content-Length: 6
< Connection: keep-alive
<
* Connection #0 to host localhost left intact
foobar
As you can see, Nginx returns both 409 and foobar, as you ordered.
So the real question here is why your browser shows the pretty formatted error page when there is no custom text after the return code, and the gray "unreachable" one, when such text is present.
And the answer is: because of the Content-Type header value.
The HTTP standard states that some response codes should or must come with the response body. To comply with the standard, Nginx does this: whenever you return a special response code without the required body, the web server sends its own hardcoded HTML response to the client. And a part of this response is the header Content-Type: text/html. This is why you see that pretty white error page, when you do return 409 without the text part — because of this header your browser knows that the returned data is HTML and it renders it as HTML.
On the other hand, when you do specify the text part, there is no need for Nginx to send its own version of the body. So it just sends back to the client your text, the response code and the value of Content-Type that matches the requested file (see /etc/nginx/mime.types).
When there is no file, like when you request a folder or a site root, the default MIME type is used instead. And this MIME type is application/octet-stream, which defines some binary data. Since most browsers have no idea how to render random binary data, they do the best they can, that is, they show their own hardcoded error pages.
And this is why you get what you get.
Now if you want to make your browser show your foobar, you need to send a suitable Content-Type, something like text/plain or text/html. Usually this could be done with add_header, but not in your case, because that directive works only with a limited list of response codes (200, 201, 204, 206, 301, 302, 303, 304, or 307).
The only other option I see is to rewrite your original request to something familiar to Nginx, so that it could use a value from /etc/nginx/mime.types for Content-Type:
server {
    listen 80;
    root /app/web;
    index index.json;

    location / {
        rewrite ^.*$ /index.html;
        return 409 "foobar";
    }
}
This might seem somewhat counter-intuitive but this will work.
EDIT:
It appears that the Content-Type can be set with the default_type directive. So you can (and should) use default_type text/plain; instead of the rewrite line.
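Putting that together, a minimal sketch based on the config above:
server {
    listen 80;
    root /app/web;
    index index.json;

    location / {
        # tell the client the body is plain text so browsers render it
        default_type text/plain;
        return 409 "foobar";
    }
}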
Updating @ivan-tsirulev's answer:
Nowadays you can set headers even on error status codes by appending the always parameter:
location @custom_error_page {
    return 409 "foobar";
    add_header Content-Type text/plain always;
}
But if you set default_type as well, the response will have two Content-Type headers: the default one, then the added one. Nevertheless, it works fine.

How to differentiate request coming from command-line and browsers?

To check whether it is a CLI or HTTP request in PHP, the php_sapi_name() method can be used; take a look here. I am trying to replicate that in the Apache conf file. The underlying idea is: if the request is coming from the CLI, minimal info is served; if the request is from a browser, the user is redirected to a different location. Is this possible?
MY PSEUDO CODE:
IF (REQUEST_COMING_FROM_CLI) {
    ProxyPass / http://${IP_ADDR}:5000/
    ProxyPassReverse / http://${IP_ADDR}:5000/
} ELSE IF (REQUEST_COMING_FROM_WEB_BROWSERS) {
    ProxyPass / http://${IP_ADDR}:8585/welcome/
    ProxyPassReverse / http://${IP_ADDR}:8585/welcome/
}
Addition: cURL supports a host of different protocols, including HTTP, FTP and Telnet. Can Apache figure out whether the request is from the CLI or a browser?
As far as I know, there is no way to find the difference using Apache.
If a request from the command line is set up properly, Apache cannot tell the difference between the command line and a browser.
When you check it in PHP (using php_sapi_name, as you suggested), it only checks where PHP itself was called from (CLI, Apache, etc.), not where the HTTP request came from.
Using telnet on the command line, you can connect to Apache, set the required HTTP headers and send the request as if you were using a browser (only, the browser sets the headers for you).
So I do not think Apache can differentiate between console and browser.
The only way to do this is to test the user agent sent in the header of the request, but this information can be easily changed.
By default, every PHP HTTP request looks like this to the Apache server:
192.168.1.15 - - [01/Oct/2008:21:52:43 +1300] "GET / HTTP/1.0" 200 5194 "-" "-"
This information can easily be changed to make the request look like it came from a browser, for example using this:
ini_set('user_agent',
'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3');
Then the HTTP request will look like this:
192.168.1.15 - - [01/Oct/2008:21:54:29 +1300] "GET / HTTP/1.0" 200 5193
"-" "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3"
At this moment, Apache will think that the connection comes from Firefox 3.0.3 on Windows.
So there is no exact way to get this information.
You can use a BrowserMatch directive if the CLI requests are not spoofing a real browser in the User-Agent header. Otherwise, like everyone else has said, there is no way to tell the difference.
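A minimal sketch of that idea with mod_rewrite and mod_proxy (the backend address and ports are placeholders echoing the pseudocode above; this only catches clients that identify themselves, so a spoofed User-Agent defeats it):
# inside the VirtualHost; requires mod_rewrite, mod_proxy and mod_proxy_http
RewriteEngine On
# curl and Wget announce themselves at the start of the User-Agent header
RewriteCond %{HTTP_USER_AGENT} ^(curl|wget) [NC]
RewriteRule ^/(.*)$ http://203.0.113.10:5000/$1 [P,L]
# everything else is assumed to be a browser
RewriteRule ^/(.*)$ http://203.0.113.10:8585/welcome/$1 [P,L]
ProxyPassReverse / http://203.0.113.10:5000/
ProxyPassReverse / http://203.0.113.10:8585/welcome/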
