I am looking at this robots.txt.
How do I know which part of the website can be scraped? Can this page be scraped?
The page you ask about, "https://www.marketscreener.com/BANK-OF-AMERICA-11751/", is not disallowed in robots.txt. The rules in the file read as follows:
Disallow: /formation/patrimoine/
All files and folders inside /formation/patrimoine/
Disallow: /content_*
All files and folders starting with "content_"
Disallow: /mods_a/setcesi.php*
The file "/mods_a/setcesi.php" and all querystrings to the file
Disallow: /prov.php*
The file "/prov.php" and all querystrings to the file
Disallow: /images/maps/static_map.php*
The file "/images/maps/static_map.php" and all querystrings to the file
Disallow: /*/news-twitter/*
Any file or folder that contains the path "/news-twitter/"
Disallow: /images/actions/2019/FD/*
Any file or folder inside "/images/actions/2019/FD/"
Even so, you need to make sure you do not request any of this disallowed content indirectly, for example through an AJAX call or an image src. You should also follow the basic etiquette for robots, namely limiting the number of requests you make to the site and respecting any applicable copyright.
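As a rough illustration, the rules listed above can also be checked mechanically. The Python sketch below assumes the wildcard semantics used by the major search engines ("*" matches any run of characters, a trailing "$" anchors the end, and everything else is a prefix match). The helper names rule_to_regex and is_disallowed are hypothetical, the rule list is copied from the lines quoted above rather than fetched from the live site, and User-agent grouping is ignored.

```python
import re
from urllib.parse import urlsplit

# Disallow rules copied from the robots.txt lines quoted above.
DISALLOW_RULES = [
    "/formation/patrimoine/",
    "/content_*",
    "/mods_a/setcesi.php*",
    "/prov.php*",
    "/images/maps/static_map.php*",
    "/*/news-twitter/*",
    "/images/actions/2019/FD/*",
]


def rule_to_regex(rule):
    """Translate one robots.txt path pattern into a compiled regex."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape each literal piece, then join the pieces with '.*' for every '*'.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_disallowed(url, rules=DISALLOW_RULES):
    """Return True if the URL's path (plus query string) matches any rule."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    # re.match anchors at the start, which gives the prefix-match behaviour.
    return any(rule_to_regex(rule).match(target) for rule in rules)


print(is_disallowed("https://www.marketscreener.com/BANK-OF-AMERICA-11751/"))
```

A scraper could call is_disallowed on every URL it is about to request, including URLs found in AJAX calls or image src attributes, and skip those for which it returns True.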
I am using WordPress and need to hide the wp-includes and wp-content/uploads directories from visitors. I have tried adding the code below to .htaccess:
Options -Indexes
Also, I have referred to this link but still, it's not working for me.
The below link is working
http://localhost:8080/wordpress/wp-includes/
but if I add a subfolder to the path, I can see all the files. The same happens for the uploads folder:
http://localhost:8080/wordpress/wp-includes/assets
http://localhost:8080/wordpress/wp-content/uploads/2022/01
Note: localhost is just an example.
Where does the .htaccess file exist on the server? Which path?
Are you sure your web server is processing this file? Are you indeed running Apache?
As long as you have a file named .htaccess (no spaces) placed in your root HTML folder, or in both the wp-includes and wp-content folders, and the file contains Options -Indexes on a line of its own, this should be respected for all subfolders and turn off auto-indexing. Can you share the entire contents of this file? Perhaps you have placed this line somewhere it is not being read.
After some research, I created a .htaccess file in wp-content/uploads and added the code below, and it started working:
# Kill PHP execution
# Apache 2.2 syntax; on Apache 2.4 without mod_access_compat, use "Require all denied" instead
<Files ~ "\.ph(?:p[345]?|t|tml)$">
deny from all
</Files>
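To see which file names that pattern catches, the same regular expression can be tried out in Python. This is only a sketch: Apache uses PCRE rather than Python's re module, although a pattern this simple should behave the same in both. The leading dot is escaped here so that it matches only a literal ".", which is an assumption about the intent, and the sample file names are made up.

```python
import re

# Pattern from the <Files> block, with the leading dot escaped so it
# matches only a literal "." (an assumption about the intent).
PHP_LIKE = re.compile(r"\.ph(?:p[345]?|t|tml)$")


def is_blocked(filename):
    """Return True if the file name ends in one of the PHP-style extensions."""
    return PHP_LIKE.search(filename) is not None


for name in ["shell.php", "old.php5", "page.phtml", "photo.jpg", "new.php7"]:
    print(name, is_blocked(name))
```

Note that an extension such as .php7 is not covered by this pattern, so it may need widening depending on which extensions the server hands to PHP.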
I have a directory on my server which has a bunch of shell scripts. This directory is autoindexed, but when I try to open the files in my browser, they prompt for download instead of opening as text. I can easily map the filetype "sh" to text/plain in the mime.types file, but some of the scripts don't have that sh extension. How can I serve all files in a certain directory as .txt files so that they are opened in the user's browser instead of prompting the user for download?
Edit: adding add_header Content Type text/plain; to that directory's location block does show all of the scripts as text, but it also affects the autoindex page.
That should be add_header Content-Type text/plain; (instead of add_header Content Type text/plain;), right?
But you're right about it affecting all files. It even renders HTML as plain text and displays the source when visited with a browser.
I'm experiencing the same problem as you (files with many different extensions, most of which are plain text).
I have a local WordPress website on IP "192.168.0.115".
There is a page in it called "Invoice" ("192.168.0.115/invoice").
I have about 800 PDF (invoice) files.
I would like their URL to be "192.168.0.115/invoice/invoice001.pdf", "...002.pdf" and so on.
What I did:
I made a folder called "invoice" in the htdocs folder next to "wp-admin", "wp-content" and "wp-includes" where I put all 800 PDF files.
It worked out - when I entered the URL "192.168.0.115/invoice/invoice001.pdf" it opened the desired PDF file.
But when I entered the URL "192.168.0.115/invoice", I no longer had access to the page "Invoice".
Instead I got all of the files listed as in a folder through the browser.
I couldn't access the page, because of the same folder name and same page name -> "invoice".
My question:
Is there any way I can tell WordPress to ignore the folder called "invoice" and load the page with URL "192.168.0.115/invoice", and at the same time open files with URLs like "192.168.0.115/invoice/invoice001.pdf"?
If you are using Apache, you can add a rewrite rule to your .htaccess file.
You can't have a WordPress page and a folder with the same name. So the first thing you need to do is rename your folder from invoice to my-invoices. Then add this rule to your .htaccess file:
RewriteRule ^invoice/(.*)\.pdf$ my-invoices/$1.pdf [L]
Don't forget to rename your folder, but leave your WordPress page slug.
Then you'll be able to go to each of these URLs:
http://192.168.0.115/invoice/
http://192.168.0.115/invoice/invoice001.pdf
Be sure to place your new rewrite rule above the ones for WordPress.
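The mapping that the rule performs can be simulated in Python to check which URLs it affects. This is a sketch only, not how mod_rewrite is implemented: the function name rewrite is hypothetical, the dot before "pdf" is escaped so it matches a literal dot, and the path is given without a leading slash, which is how mod_rewrite presents it in a .htaccess context.

```python
import re

# Same pattern as the RewriteRule, with the dot before "pdf" escaped.
INVOICE_PDF = re.compile(r"^invoice/(.*)\.pdf$")


def rewrite(path):
    """Return the rewritten path, or None when the rule does not apply.

    `path` is given without the leading slash, as mod_rewrite sees it
    in a per-directory (.htaccess) context.
    """
    match = INVOICE_PDF.match(path)
    if match is None:
        return None
    return "my-invoices/" + match.group(1) + ".pdf"


print(rewrite("invoice/invoice001.pdf"))  # my-invoices/invoice001.pdf
print(rewrite("invoice/"))                # None
```

Because "invoice/" does not match, that request falls through to the WordPress rules and loads the page, while PDF requests are served from the renamed folder.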
I'm using wp-deploy to deploy WordPress to my server. The deployment is successful. The database has all the required tables.
I'm able to access the admin page through: http://example.com/wordpress/wp-login.php. Using the password supplied at the end of the deploy script, I'm also able to log in and interact with the dashboard.
The problem is that I'm not able to access the WordPress homepage: http://example.com. It's just the 'white screen of death'. There's nothing in the Apache error log. The Apache access log has this entry when I visit the homepage:
"GET / HTTP/1.1" 200 300 "-" "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/39.0.2171.65 Chrome/39.0.2171.65 Safari/537.36"
Normally, when WordPress is extracted into /var/www/html, index.php, .htaccess and wp-config.php sit alongside all the other files in a single wordpress folder. But wp-deploy uses a slightly different structure because it relies on Capistrano (v3). WordPress is now deployed into /var/www/html/blog/current, and the current folder contains only index.php, .htaccess and wp-config.php. The remaining files are in a separate folder, wordpress.
As per wp-deploy's suggestion, I've pointed Apache's DocumentRoot to the current folder. Here are the relevant Apache VirtualHost lines:
ServerName example.com
DocumentRoot /var/www/html/blog/current
And here's the WP_HOME, WP_SITEURL and WP_CONTENT_URL from wp-config.php (This was autogenerated based on stage_url setting I gave in the deploy script. The value I gave was http://example.com):
define('WP_HOME','http://example.com');
define('WP_SITEURL','http://example.com/wordpress');
define('WP_CONTENT_URL', 'http://example.com/content');
Also there's a content folder within /var/www/html/blog/current. It has these directories - plugins, themes, uploads. I think the WP_CONTENT_URL above refers to that.
wp-deploy has generated a .htaccess file in the current folder. Here it is. I tried deleting it and accessing the site, but no luck still.
Since I'm not very familiar with PHP or Apache, could anyone please point out any mistakes I'm making here?
PS: I have 2 DNS A records. '#' and 'blog' both pointing to my server's ip.
The issue was that there was nothing in the themes folder. The default theme files were in the wordpress/wp-content/themes folder, but wp-deploy had pointed WP_CONTENT_URL in wp-config.php at a different folder outside the wordpress folder, and that folder only had empty directories for plugins and themes.
Once I copied the files into the new content folder, the site was up.
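A quick way to spot this situation is to check whether the content folder actually contains a theme. The Python sketch below is hypothetical (the function name installed_themes is made up) and relies only on the fact that every WordPress theme ships a style.css file. The path in the example is the one from the question.

```python
from pathlib import Path


def installed_themes(content_dir):
    """List theme folders under <content_dir>/themes that contain style.css."""
    themes_dir = Path(content_dir) / "themes"
    if not themes_dir.is_dir():
        return []
    return sorted(
        entry.name
        for entry in themes_dir.iterdir()
        if entry.is_dir() and (entry / "style.css").is_file()
    )


if __name__ == "__main__":
    # Path taken from the question; adjust it for your own deployment.
    themes = installed_themes("/var/www/html/blog/current/content")
    print(themes if themes else "No themes found in the content folder")
```

An empty result for the folder that WP_CONTENT_URL refers to would be consistent with the blank homepage described above.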
In my openx directory
openx/www/image
I have a PHP file with a name like c7dc84ecdf4fee7.php
Should this file be there, or is it a malicious file?
By default, openx/www/images contains:
openx/www/images/layerstyles/*
openx/www/images/robots.txt
openx/www/images/1x1.gif
Only image banners are stored here.
A .php file in this directory may be malicious code.
Remove that .php file immediately.
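To check whether other PHP files are sitting in the same place, the directory can be scanned with a short script. This is a minimal sketch, assuming the images directory should hold no PHP files at all, as described above. The function name find_php_files is hypothetical and the path should be adjusted to the actual install location.

```python
from pathlib import Path


def find_php_files(images_dir):
    """Return every file under images_dir whose name ends in .php (any case)."""
    root = Path(images_dir)
    if not root.is_dir():
        return []
    return sorted(
        str(path.relative_to(root))
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() == ".php"
    )


if __name__ == "__main__":
    # Directory name from the answer above; adjust to your install path.
    for name in find_php_files("openx/www/images"):
        print("unexpected PHP file:", name)
```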