Software to scrape or crawl for website urls - web-scraping

I want to scrape/crawl (not sure which term fits best) website URLs. For example, I want to get every URL from
www.Site.com/posts.html that contains www.Site.com/2015-04-01/1.
So I would enter www.Site.com into the software, set the depth to 2, and set the required URL text to www.Site.com/2015-04-01/1.
So the software should:
go to www.Site.com/posts.html
and find the matching URLs. Let's say it finds:
www.Site.com/2015-04-01/1/Working-Stuff.html
www.Site.com/2015-04-01/1/New-stuff.html
www.Site.com/2015-04-01/1/News.html
Then it goes to the first matched URL (a) and looks for more URLs containing www.Site.com/2015-04-01/1.
So, for example, it would look like this:
Main site: `www.Site.com/posts.html`
1)www.Site.com/2015-04-01/1/Working-Stuff.html
1a) www.Site.com/2015-04-01/1/Break.htm
1b) www.Site.com/2015-04-01/1/How-to.htm
1c) www.Site.com/2015-04-01/1/Lets-say.htm
1d) www.Site.com/2015-04-01/1/Gamer-life.htm
2) www.Site.com/2015-04-01/1/New-stuff.html
2a) www.Site.com/2015-04-01/1/My-Story-about.htm
3) www.Site.com/2015-04-01/1/News.html
3a) www.Site.com/2015-04-01/1/Go-to-hell.htm
3b) www.Site.com/2015-04-01/1/Leave.htm
Of course I don't need the 1), 2), 2a) grouping; I only want to grab the URLs.
I used:
A1 Website Scraper - but when I try to scrape from ......html it cuts off the .html part and doesn't give me the full URL list :/

[edited my previous slightly simplistic answer]
Screen scraping is the process of extracting data from a web page. The R package rvest is very good at screen scraping.
Web crawling is the process of traversing a website, moving from page to page. The R package RSelenium is very good at mimicking a user's movement from page to page, but only when you know the structure of the web site.
You sound like you want to do a crawl from page to page, starting from a head page and moving forward. I think you could code this up using a combination of the rvest and RSelenium packages. Between the two of them you can customise the crawl and follow whatever route you discover along the way.
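As a rough illustration, here is a minimal sketch using rvest/xml2 alone, assuming the pages are static HTML (if the links are rendered by JavaScript you would need RSelenium to fetch the pages first). The start URL, the required URL text, and the depth of 2 are taken from the question; the function and variable names are just made up for the example.
library(rvest)
library(xml2)

# Breadth-first crawl: follow only links whose URL contains `must_contain`,
# going up to `depth` levels away from the start page.
crawl_matching <- function(start_url, must_contain, depth = 2) {
  seen  <- character(0)
  queue <- start_url
  for (level in seq_len(depth)) {
    next_queue <- character(0)
    for (page_url in queue) {
      page <- tryCatch(read_html(page_url), error = function(e) NULL)
      if (is.null(page)) next                       # skip pages that fail to load
      hrefs <- html_attr(html_nodes(page, "a"), "href")
      hrefs <- url_absolute(hrefs[!is.na(hrefs)], page_url)
      hits  <- unique(hrefs[grepl(must_contain, hrefs, fixed = TRUE)])
      next_queue <- c(next_queue, setdiff(hits, seen))
      seen <- union(seen, hits)
    }
    queue <- unique(next_queue)
  }
  seen
}

urls <- crawl_matching("http://www.Site.com/posts.html",
                       "www.Site.com/2015-04-01/1", depth = 2)
writeLines(urls)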

Related

Retrieve a number from each page of a paginated website

I have a list of approx. 36,000 URLs, ranging from https://www.fff.fr/la-vie-des-clubs/1/infos-cles to https://www.fff.fr/la-vie-des-clubs/36179/infos-cles (a few of those pages return 404 errors).
Each of those pages contains a number (the number of teams the soccer club contains). In the HTML file, the number appears as <p class="number">5</p>.
Is there a reasonably simple way to compile an Excel or CSV file with the URL and the associated number of teams as a field?
I've tried looking into PhantomJS, but my method took 10 seconds to open a single web page and I don't really want to spend 100 hours doing this. I was also unable to figure out how (or whether it was possible at all) to use scraping tools such as import.io to do this.
Thanks !
For the goal you want to achieve, I can see two solutions:
Code it in Java: Jsoup + any CSV library
The 36,000+ URLs can be downloaded easily in a few minutes (a rough R equivalent is sketched after these two options).
Use a tool like Portia from scrapinghub.com
Portia is a WYSIWYG tool that quickly helps you create your project and run it. They offer a free plan which can handle the 36,000+ links.
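For completeness, here is a minimal sketch of the same loop-and-extract idea in R with rvest, rather than the Java/Jsoup route named above. It assumes the <p class="number"> element is present in the static HTML (i.e. no JavaScript rendering is needed) and that the URL pattern from the question covers all pages; the output file name is just a placeholder.
library(rvest)

urls <- sprintf("https://www.fff.fr/la-vie-des-clubs/%d/infos-cles", 1:36179)

get_number <- function(u) {
  page <- tryCatch(read_html(u), error = function(e) NULL)   # 404s etc. become NA
  if (is.null(page)) return(NA_character_)
  node <- html_node(page, "p.number")
  if (inherits(node, "xml_missing")) return(NA_character_)
  html_text(node, trim = TRUE)
}

teams <- vapply(urls, get_number, character(1))
write.csv(data.frame(url = urls, teams = teams), "teams.csv", row.names = FALSE)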

Parsing Web page with R

This is my first time posting here. I have less than a week of experience with HTML parsing/web scraping and am having difficulty parsing this web page:
https://www.jobsbank.gov.sg/
What I want to do is parse the content of all the job listings available on the site.
My approach:
Click search with an empty search bar, which returns all listed records. The resulting web page is: https://www.jobsbank.gov.sg/ICMSPortal/portlets/JobBankHandler/SearchResult.do
Provide the search result web address to R and identify all the job listing links.
Supply the job listing links to R and ask R to go to each listing and extract the content.
Look for the next page and repeat steps 2 and 3.
However, the problem is that the web address I get from step 1 does not take me to the search result page; instead it redirects me back to the home page.
Is there any way to overcome this problem?
Supposing I manage to get the web address for the search results, I intend to use the following code:
library(RCurl)
# Fetch the raw HTML of the search result page
base_url <- "https://www.jobsbank.gov.sg/ICMSPortal/portlets/JobBankHandler/SearchResult.do"
base_html <- getURLContent(base_url, cainfo = "cacert.pem")[[1]]
# Crude first pass at splitting out the anchor tags
links <- strsplit(base_html, "a href=")[[1]]
Learn to use the web developer tools in your web browser (hint: Use Chrome or Firefox).
Learn about HTTP GET and HTTP POST requests.
Notice the search box sends a POST request.
See what the Form Data parameters are (they seem to be {actionForm.checkValidRequest}: YES and {actionForm.keyWord}: my search string).
Construct a POST request using one of the R http packages with that form data in.
Hope the server doesn't care about cookies; if it does, get the cookies and feed them back to it.
Hence you end up using postForm from the RCurl package:
p = postForm(url, .params = list(checkValidRequest = "YES", keyword = "finance"))
And then just extract the table from p. Getting the next page involves constructing another form request with a bunch of different form parameters.
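Here is a minimal sketch of that POST-then-parse step, assuming the braced field names seen in the developer tools ({actionForm.checkValidRequest} and {actionForm.keyWord}) are what the server expects and that no cookies are required; "finance" is just an example keyword.
library(RCurl)
library(XML)

url <- "https://www.jobsbank.gov.sg/ICMSPortal/portlets/JobBankHandler/SearchResult.do"

# Send the same form data the browser sends
p <- postForm(url,
              .params = list("{actionForm.checkValidRequest}" = "YES",
                             "{actionForm.keyWord}" = "finance"))

# Parse the returned HTML and pull every <table> into a data frame
doc    <- htmlParse(p, asText = TRUE)
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)
str(tables)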
Basically, a web request is more than just a URL: there's a whole conversation going on between the browser and the server involving form parameters and cookies, and sometimes AJAX requests inside the page updating parts of it.
There are a lot of "I can't scrape this site" questions on SO, and although we could spoon-feed you the precise answer to this exact problem, I do feel the world would be better served if we just told you to go learn about the HTTP protocol, forms, and cookies; then you'll understand how to use the tools better.
Note that I've never seen a job site or a financial site that likes having its content scraped - although I can't see a warning about it on this site, that doesn't mean it isn't there, and I would be careful about breaking the Terms and Conditions of Use. Otherwise you might find all your requests failing.

How To Extract Page URLs From Any Website in Bulk?

I'm looking for a free solution/tool/software with which I can pull out all of a website's page URLs. The site has approx. 992,000 pages, so I need the URLs of all of them in an Excel sheet.
I'm using "site: mywebsite.com" and it gives me 992,000 results. I know I can set the max results per page to 100, but that still doesn't make my life easier, and Google won't show any results past 1,000. I tried to use the Google API but without any luck, and sitemap generators didn't work either.
You can use a crawler tool to crawl the entire website and save the URLs visited (or roll your own; see the sketch after this list). Free tools include:
IRobotSoft: http://www.irobotsoft.com/help/irobot-manual.pdf. Use its CrawlWebsite(SourceSites, CallTask) function.
Scrapy: http://doc.scrapy.org/en/latest/intro/tutorial.html
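If you'd rather roll your own in R, here is a rough sketch of a breadth-first crawler with rvest/xml2 that stays on one host and writes every URL it visits to a CSV file. The domain, the page cap, and the output file name are placeholders, not values from the question, and a site of ~992,000 pages would also need politeness delays and restart handling that are omitted here.
library(rvest)
library(xml2)

crawl_site <- function(start_url, max_pages = 5000) {
  host    <- url_parse(start_url)$server
  visited <- character(0)
  queue   <- start_url
  while (length(queue) > 0 && length(visited) < max_pages) {
    u <- queue[1]; queue <- queue[-1]
    if (u %in% visited) next
    visited <- c(visited, u)
    page <- tryCatch(read_html(u), error = function(e) NULL)
    if (is.null(page)) next                           # skip pages that fail to load
    hrefs <- html_attr(html_nodes(page, "a"), "href")
    hrefs <- url_absolute(hrefs[!is.na(hrefs)], u)
    hrefs <- hrefs[url_parse(hrefs)$server == host]   # stay on the same host
    queue <- unique(c(queue, setdiff(hrefs, visited)))
  }
  visited
}

urls <- crawl_site("http://www.mywebsite.com/")
write.csv(data.frame(url = urls), "all-urls.csv", row.names = FALSE)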
Google limits search query results to 1,000. The only way a tool can really bypass this is to query subsets of the keyword, e.g. (site:abc.com + random-word). Each random word returns fewer results, and with enough of these queries scraped and combined into one list, you can delete the duplicates and get a near-complete list for the original search term.

Google analytics and dynamic pages

I have a (Symfony-based) website. I would LIKE to analyze the site traffic using Google Analytics. My site is divided into several (i.e. N) categories, each of which may have 0 to M sub-categories.
Schematically, the taxonomy of the site breaks down into something like this:
N major categories
Each major category may have 0 to M sub categories
further nesting is possible, but I have just kept it simple for the purpose of illustration.
I need to know which sections of the website are generating more traffic, so that I can concentrate my efforts on those sections. My question is:
Is there any way to identify the data that is being generated from the different sections of my site?
Put another way, is there a code or 'tag' that I can generate dynamically (in each page that is being monitored) and pass to GA, so that I can identify which section of the website the traffic came from?
The documentation I found on Google about this topic was not very useful (at least it did not answer this question).
You can pass a URI to _trackPageview, which lets you log the request in whatever format you'd like, regardless of how your users are actually requesting the page.
Remove/replace the original call to pageTracker._trackPageview with the following:
pageTracker._trackPageview('/topcategory/subcategory');
You'd just need to plug in the top category and sub-category info. If the info is available in the URL, you could parse it out with JavaScript on the fly.

How to provide multiple search functionality in website?

I am developing a web application in which I have the following types of search functionality:
Normal search: the user enters a search keyword to search the records.
Popular: not really a search; it displays the popular records on the website, much like Digg and other social bookmarking sites do.
Recent: this displays recently added records on my website.
City search: city names such as "Delhi", "Mumbai", etc. are presented to the user, and when the user clicks one, all records from that particular city are displayed.
Tag search: same as city search, but with tag links; when the user clicks a tag, all records marked with that tag are displayed.
Alphabet search: like city and tag search, this has links for letters such as "A", "B", etc., and when the user clicks a letter, all records starting with that letter are displayed.
Now, my problem is that I have to provide all of the searches listed above, but I can't decide whether to go with a single page (result.aspx) that displays the results for every search, working out from the query string which search the user is performing and what data to display. For example, if I search for the city delhi and the tag delhi-hotels, the URLs for both would be:
For City: www.example.com/result.aspx?search_type=city&city_name=delhi
For Tags: www.example.com/result.aspx?search_type=tag&tag_name=delhi-hotels
For Normal Search: www.example.com/result.aspx?search_type=normal&q=delhi+hotels+and+bar&filter=hotlsOnly
Now, I feel the above idea of using a single page for all searches is messy, so I thought of a cleaner idea: using separate pages for each type of search, as in:
For City: www.example.com/city.aspx?name=delhi
For Tags: www.example.com/tag.aspx?name=delhi-hotels
For Normal Search: www.example.com/result.aspx?q=delhi+hotels+and+bar&filter=hotlsOnly
For Recent: www.example.com/recent.aspx
For Popular: www.example.com/popular.aspx
My new idea is cleaner and tells the user specifically which page is for what; it also gives them an idea of where they are and what records they are currently seeing. But the new idea has one problem: if I have to change anything in my search result display, I have to make the change on every page, one by one. I thought of a solution for that too: using a user control inside a repeater control, I'll pass my values one by one to the user control to render the HTML for each record.
Everything is fine with the new idea, but I am still not able to decide which one to go with. Can anyone share your thoughts on this problem?
I want to implement an approach that is easy to maintain, SEO-friendly (gives my website a good ranking), and user-friendly (easy for users to use and understand).
Thanks.
One thing to mention on the SEO front:
As a lot of the "results" pages will be linking through to the same content, there are a couple of advantages to appearing* to have different URLs for these pages:
Some search engines get cross if you appear to have duplicate content on the site, or if there's the possibility of almost infinite lists.
Analysing traffic flow.
So for point 1, as an example, you'll notice that SO has numerous ways of finding questions, including:
On the home page
Through /questions
Through /tags
Through /unanswered
Through /feeds
Through /search
If you take a look at the robots.txt for SO, you'll see that spiders are not allowed to visit (among other things):
Disallow: /tags
Disallow: /unanswered
Disallow: /search
Disallow: /feeds
Disallow: /questions/tagged
So the search engine should only find one route to the content rather than three or four.
Having them all go through the same page doesn't allow you to filter like this. Ideally you want the search engine to index the list of Cities and Tags, but you only need it to index the actual details once - say from the A to Z list.
For point 2, when analysing your site traffic, it will be a lot easier to see how people are using your site if the URLs are meaningful and the results aren't hidden in the form header - many decent stats packages allow you to report on query string values, and if you have "nice" URLs this is even easier. Having this sort of information will also make selling advertising easier, if that's what you're interested in.
Finally, as I mentioned in the comments to other responses, users may well want to bookmark a particular search - having the query baked into the URL one way or another (query strings or a rewritten URL) is the simplest way to allow this.
*I say "appearing" because as others have pointed out, URL rewriting would enable this without actually having different pages on the server.
There are a few issues that need to be addressed to properly answer your question:
You do not necessarily need to redirect to the Result page before being able to process the data. On submit, the page or control that contains the search interface could process the submitted search parameters (and the type of search) and call the database or intermediary web service that supplies the search results. You could then use a single Results page to display the retrieved data.
If you must pass the submitted search parameters via querystring to the result page, then you would be much better off using a single Result page that parses these parameters and displays the results conditionally.
Most users do not rely on the url/querystring information in the browser's address bar to identify their current location in a website. You should have something more visually indicative (such as a Breadcrumbs control or header labels) to indicate current location. Also, as you mentioned, the maintainability issue is quite significant here.
I would definitely not recommend the second option (using separate result pages for each kind of search). If you are concerned about SEO, use URL rewriting to construct URL "slugs" to create more intuitive paths.
I would stick with the original result.aspx result page. My reasoning for this from a user point of view is that the actual URL itself communicates little information. You would be better off creating visual cues on the page that states stuff like "Search for X in Category Y with Tags Z".
As for coding and maintenance, since everything is so similar besides the category it would be wise to just keep it in one tight little package. Breaking it out as you proposed with your second idea just complicates something that doesn't need to be complicated.
Ditch the querystrings and use URL rewriting to handle your "sections".. much better SEO and clearer from a bookmark/user readability standpoint.
City: www.example.com/city/delhi/
Tag: www.example.com/tag/delhi-hotels/
Recent: www.example.com/recent/
Popular: www.example.com/popular/
Regular search can just go to www.example.com/search.aspx or something.
