LinkedIn Link Sharing: Open Graph Image Issue - WordPress

NOTE: I've seen a variety of similar questions to this one so I'm going to try and be succinct in describing the issue along with what I've done so far.
The issue is that LinkedIn is failing to properly scrape images from articles on a WordPress site. The site uses All in One SEO to add the appropriate meta tags and, judging by Facebook's sharing and object debuggers, is doing so correctly.
Here's a sample article that demonstrates the issue.
Upon entering the URI into a LinkedIn article, LinkedIn attempts to fetch the page's data. It comes back with the title and description but leaves an empty space where the image would presumably display.
While tailing the access logs, I've seen LinkedIn hitting the site, with 200 status codes for both the page itself and the image:
[ip redacted] - - [29/Mar/2017:19:50:44 +0000] "GET /linkedin-test/?refTest=LI17 HTTP/1.1" 200 23758 "-" "LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/3.1 +http://www.linkedin.com)" 0.906 "[ips redacted]"
[ip redacted] - - [29/Mar/2017:19:50:44 +0000] "GET /wp-content/uploads/2017/03/modern-architecture-skyscrapers-modest-with-images-of-modern-architecture-ideas-on-gallery.jpg HTTP/1.1" 200 510088 "-" "LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/3.1 +http://www.linkedin.com)" 0.000 "[ips redacted]"
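For what it's worth, one way to double-check what the crawler actually receives is to request the page with the same User-Agent and inspect the Open Graph tags in the response. A rough sketch (the URL is a placeholder for the sample article above):
# Fetch the article as the LinkedIn crawler would and print the og:* meta tags it sees.
# The URL is a placeholder, not the real article.
curl -sL -A "LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/3.1 +http://www.linkedin.com)" \
  "https://example.com/linkedin-test/?refTest=LI17" | grep -io '<meta[^>]*og:[^>]*>'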
Following some other Stack Overflow threads, I experimented with the following:
Attempt to bust LinkedIn's cache with query strings
Result: No change with query strings or completely new URLs
Verify image dimensions for og:image resource are not too small
Result: No change when using images whose dimensions match or exceed those indicated in LinkedIn's knowledge base
Revise all meta tags to include prefix="og: http://ogp.me/ns#"
Result: No change
I feel like I've hit a wall so any suggestions or thoughts are definitely welcome!
Thanks!

This seems to work fine for me:
https://www.linkedin.com/sharing/share-offsite/?url=https%3A%2F%2Fdev-agilealliance.pantheonsite.io%2F
Source: the official Microsoft documentation for Sharing on LinkedIn.
So, what changed??? I did some digging.
Let's look at the 2018 robots.txt file for dev-agilealliance.pantheonsite.io and compare it to the working 2020 robots.txt file. It seems pretty clear why things were blocked in 2018...
# Pantheon's documentation on robots.txt: http://pantheon.io/docs/articles/sites/code/bots-and-indexing/
User-agent: *
Disallow: /
And this is the working robots.txt in 2020...
User-agent: RavenCrawler
User-agent: rogerbot
User-agent: dotbot
User-agent: SemrushBot
User-agent: SemrushBot-SA
User-agent: PowerMapper
User-agent: Swiftbot
Allow: /
Of course, it could also be that sharing services are much more lenient in 2020 than they were in 2017! But I can confirm that things do seem to be working now with the setup above.
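If you want to reproduce the check yourself, a rough sketch of the two steps (the domain and share URL are the ones from above; nothing else is assumed):
# Confirm what crawlers are allowed to fetch on the site.
curl -s https://dev-agilealliance.pantheonsite.io/robots.txt
# Then paste the share URL from above into a browser and check whether the preview image renders:
# https://www.linkedin.com/sharing/share-offsite/?url=https%3A%2F%2Fdev-agilealliance.pantheonsite.io%2F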

Related

Why wget fails to download a file but browser succeeds?

I am trying to download the ClamAV virus database from http://database.clamav.net/main.cvd. I am able to download main.cvd from a web browser (Chrome or Firefox), but I am unable to do the same with wget; I get the following error:
--2021-05-03 19:06:01-- http://database.clamav.net/main.cvd
Resolving database.clamav.net (database.clamav.net)... 104.16.219.84, 104.16.218.84, 2606:4700::6810:db54, ...
Connecting to database.clamav.net (database.clamav.net)|104.16.219.84|:80... connected.
HTTP request sent, awaiting response... 403 Forbidden
2021-05-03 19:06:01 ERROR 403: Forbidden.
Any lead on this issue?
Edit 1:
This is what my Chrome cookies look like when I try to download main.cvd
It might be that the blocking is based on the User-Agent header. You can use the --user-agent= option to set the same User-Agent as your browser. For example,
wget --user-agent="Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0" https://www.example.com
will download the example.com page while identifying itself to the server as Firefox. If you want to know more about the meaning of the parts of the User-Agent string, you can read the MDN docs for the User-Agent header.
Also check for session cookies or tokens from the browser, as some websites add that kind of protection as well.
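Combining both suggestions, a hedged sketch (cookies.txt is a hypothetical Netscape-format cookie export of your browser session, not an existing file):
# Send a browser-like User-Agent and reuse the browser's session cookies.
wget --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0" \
     --load-cookies=cookies.txt \
     http://database.clamav.net/main.cvd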

Wordpress site gets infected with malware, random POST requests from hackers return 200 results, trying to understand how this happens

A WordPress site I maintain gets infected with PHP scripts that use a .ico extension, along with the links that invoke them. I periodically remove them, and I have now written a cron job to find and remove them every minute. I am trying to find the source of this hack. I have closed all the back doors as far as I know (FTP, DB users, etc.).
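(For context, a cleanup job of the kind described might look roughly like the following sketch; the web-root path is an assumption, not the actual cron job.)
# Find .ico files under the web root that actually contain PHP, and delete them.
# /var/www/html is an assumed document root; adjust to the real one.
find /var/www/html -type f -name '*.ico' -exec grep -q '<?php' {} \; -delete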
After reading similar questions and looking at https://perishablepress.com/protect-post-requests/, I now think this could be because of malicious POST requests. Monitoring the access log, I see plenty of POST requests that fail with a 40X response. But I also see requests succeed that should not. In the example below, the first request fails, then a similar POST request succeeds with a 200 response a few hours later.
I tried duplicating a similar request with https://www.askapache.com/online-tools/http-headers-tool/, but that fails with a 40X response. Help me understand this behavior. Thanks.
POST Fails as expected
146.185.253.165 - - [08/Dec/2019:04:49:13 -0700] "POST / HTTP/1.1" 403 134 "http://website.com/" "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/534.24 (KHTML, like Gecko) RockMelt/0.9.58.494 Chrome/11.0.696.71 Safari/534.24" website.com
A few hours later, the same POST succeeds
146.185.253.165 - - [08/Dec/2019:08:55:39 -0700] "POST / HTTP/1.1" 200 33827 "http://website.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.861.0 Safari/535.2" website.com
146.185.253.167 - - [08/Dec/2019:08:55:42 -0700] "POST / HTTP/1.1" 200 33827 "http://website.com/" "Mozilla/5.0 (Windows NT 5.1)

Nginx hanging on audio file request

I'm having the most bizarre issue with nginx.
After upgrading from 1.6.3 to 1.12.2 on RHEL 7.2, requests for audio files are just hanging:
Connecting to mydomain [...] ... connected.
HTTP request sent, awaiting response...
In my nginx access.log, I'm seeing a 200 status:
"GET /media/Example.mp3 HTTP/1.1" 200 105243 "-" "Wget/1.19.4 (linux-gnu)" "-"
If I request an MP4 file in the same directory, with the same permissions, it works just fine. I've tried MP4s that are both larger and smaller than my MP3 file: they work just fine.
CSS/JS/images also work fine.
If I comment out the MP3 mime type in /etc/nginx/mime.types, and then request /media/Example.mp3, it works just fine (!!!).
I added the ogg mime type to see if this was somehow related to just audio, and indeed, OGG files fail in the same way as MP3s.
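(As a side note, which extensions map to those types can be confirmed with something like the following; the path is nginx's default location, which is an assumption.)
# Show which extensions nginx maps to the audio types in question.
grep -nE 'audio/(mpeg|ogg)' /etc/nginx/mime.types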
I've set up debug logging, and everything looks normal for an MP3 request.
I've disabled SELinux, checked the permissions on the files, parent folders, etc. and confirmed that there is not a problem with the actual MP3 file.
I've tried turning sendfile off.
I can't undo this YUM transaction; it looks like there was a security issue with that version of nginx, and it is no longer available.
I've searched around online, but can't find any related reports. Does anyone have any thoughts/suggestions?
EDIT
When I set the Accept header and try to connect, curl output looks like:
* Trying my.ipaddress...
* TCP_NODELAY set
* Connected to my.host (my.ipaddress) port 80 (#0)
> GET /media/Example.mp3 HTTP/1.1
> Host: my.host
> User-Agent: curl/7.58.0
> Accept: audio/mpeg
>
And then it just hangs...
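For reference, the request above was presumably produced with something along these lines (the hostname and path are placeholders matching the output):
# Reproduce the hanging request: verbose curl with an explicit Accept header.
curl -v -H 'Accept: audio/mpeg' http://my.host/media/Example.mp3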

Use robots.txt file to block everything except images

I only serve images from my CDN.
I have a robots.txt file set up in my CDN domain which is separate from the one set up in my 'normal' www domain.
I want to format the robots.txt file in my CDN domain so that it blocks the indexing of everything except images (regardless of their location).
The reason for all this is that I want to avoid duplicate content.
Is this correct?
User-agent: *
Disallow: /
Allow: /*.jpg$
Allow: /*.jpeg$
Allow: /*.gif$
Allow: /*.png$
If you have all images in certain folders, you could use:
For google-bot only:
User-agent: Googlebot-Image
Allow: /some-images-folder/
For all user-agents:
User-agent: *
Allow: /some-images-folder/
Additionally, Google has introduced increased flexibility to the robots.txt file standard through the use of asterisks. Disallow patterns may include "*" to match any sequence of characters, and patterns may end in "$" to indicate the end of a name.
To allow a specific file type (for example, .gif images), you can use the following robots.txt entry:
User-agent: Googlebot-Image
Allow: /*.gif$
Info 1: By default (in case you don't have a robots.txt), all content is crawled.
Info 2: The Allow statement should come before the Disallow statement, no matter how specific your statements are.
Here's a wiki link to the robots exclusion standard for a more detailed description.
According to that, your example should look like:
User-agent: *
Allow: /*.jpg$
Allow: /*.jpeg$
Allow: /*.gif$
Allow: /*.png$
Disallow: /
NOTE: As nev pointed out in his comment, it's also important to watch out for query strings at the end of extensions, like image.jpg?x12345, so also include:
Allow: /*.jpg?*$
Yeah! Disallow is right! Allow is right too!
And just as a tip: specify a sitemap too! :)
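Putting the tips from both answers together, the CDN robots.txt might end up looking something like this (shown as a shell heredoc for convenience; the sitemap URL is a placeholder):
# Sketch of a complete robots.txt for the CDN domain, combining the rules above.
cat > robots.txt <<'EOF'
User-agent: *
Allow: /*.jpg$
Allow: /*.jpg?*$
Allow: /*.jpeg$
Allow: /*.jpeg?*$
Allow: /*.gif$
Allow: /*.gif?*$
Allow: /*.png$
Allow: /*.png?*$
Disallow: /
Sitemap: https://cdn.example.com/sitemap.xml
EOF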

How to differentiate request coming from command-line and browsers?

To check whether a request comes from the CLI or over HTTP, the PHP method php_sapi_name can be used; take a look here. I am trying to replicate that in the Apache conf file. The underlying idea is: if the request is coming from the CLI, a 'minimal info' page is served; if the request is from a browser, the user is redirected to a different location. Is this possible?
MY PSEUDO CODE:
IF (REQUEST_COMING_FROM_CLI) {
    ProxyPass / http://${IP_ADDR}:5000/
    ProxyPassReverse / http://${IP_ADDR}:5000/
} ELSE IF (REQUEST_COMING_FROM_WEB_BROWSERS) {
    ProxyPass / http://${IP_ADDR}:8585/welcome/
    ProxyPassReverse / http://${IP_ADDR}:8585/welcome/
}
Addition: cURL supports a host of different protocols, including HTTP, FTP, and Telnet. Can Apache figure out whether the request is from the CLI or a browser?
As far as I know, there is no way to tell the difference using Apache.
If a request from the command line is set up properly, Apache cannot distinguish between the command line and a browser.
When you check it in PHP (using php_sapi_name, as you suggested), it only checks where PHP itself was called from (CLI, Apache, etc.), not where the HTTP request came from.
Using telnet from the command line, you can connect to Apache, set the required HTTP headers and send the request as if you were using a browser (except that the browser sets those headers for you).
So, I do not think Apache can differentiate between console and browser.
The only way to do this is to test the User-Agent sent in the header of the request, but this information can easily be changed.
By default, every PHP HTTP request looks like this to the Apache server:
192.168.1.15 - - [01/Oct/2008:21:52:43 +1300] "GET / HTTP/1.0" 200 5194 "-" "-"
This information can easily be changed to look like a browser's, for example by using this:
ini_set('user_agent',
'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3');
The HTTP request will then look like this:
192.168.1.15 - - [01/Oct/2008:21:54:29 +1300] "GET / HTTP/1.0" 200 5193
"-" "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3"
At this point, Apache will think that the connection came from Firefox 3.0.3 on Windows.
So there is no exact way to get this information.
You can use a BrowserMatch directive if the CLI requests are not spoofing a real browser in the User-Agent header. Otherwise, as everyone else has said, there is no way to tell the difference.
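As a quick illustration of that caveat: any command-line client can simply claim to be a browser, which defeats a User-Agent check such as BrowserMatch (the URL is a placeholder):
# curl pretending to be Firefox; to Apache this is indistinguishable from a real browser request.
curl -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0" http://example.com/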
