How To Extract Page URLs From Any Website in Bulk? - web-scraping

I'm looking for a free solution/tool/software through which I can pull out all of a website's page URLs. The site has approximately 992,000 pages, so I need the URLs of all of them in an Excel sheet.
I'm using "site:mywebsite.com" and it gives me 992,000 results. I know I can raise the maximum results per page to 100, but that still doesn't make my life easier, and Google won't show any results beyond the first 1,000 anyway. I tried the Google API without any luck, and sitemap generators didn't work either.

You can use a crawler tool to crawl the entire website and save the URLs visited. Free tools include:
IRobotSoft: http://www.irobotsoft.com/help/irobot-manual.pdf. Use its CrawlWebsite(SourceSites, CallTask) function.
Scrapy: http://doc.scrapy.org/en/latest/intro/tutorial.html
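For example, a minimal Scrapy spider that follows every internal link and records each URL it visits could look like this; the domain is a placeholder, so treat it as a sketch rather than a ready-made solution:

```python
import scrapy

class UrlSpider(scrapy.Spider):
    """Minimal sketch: visit every internal page and record its URL."""
    name = "urls"
    allowed_domains = ["mywebsite.com"]      # placeholder domain
    start_urls = ["https://mywebsite.com/"]

    def parse(self, response):
        # Record the page we just visited.
        yield {"url": response.url}
        # Queue every link on the page; Scrapy deduplicates requests itself.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

Running it with `scrapy runspider url_spider.py -o urls.csv` writes the visited URLs to a CSV file that opens directly in Excel.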

Google limits search results to 1,000 per query. The only way a tool can really get around this is to query subsets of the search, e.g. site:abc.com plus a random word. Each such query returns fewer than 1,000 results, and with enough of these queries scraped and combined into one list, you can delete the duplicates and end up with a near-complete list for the original search term.
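A sketch of that combine-and-deduplicate step, assuming a hypothetical search(query) helper that returns the (at most ~1,000) result URLs for one query; how you actually fetch those results (API or otherwise) is a separate problem:

```python
def search(query):
    """Hypothetical helper: return the result URLs for one query,
    however you obtain them (API, scraper, exported result pages)."""
    raise NotImplementedError

random_words = ["contact", "news", "2015", "review", "blog"]  # any word list

all_urls = set()  # a set drops duplicate URLs automatically
for word in random_words:
    all_urls.update(search(f"site:mywebsite.com {word}"))

print(len(all_urls), "unique URLs collected")
```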

Related

WordPress custom query-string not indexed by Google

I have a WordPress site (www.AgingSafely.com) on which I have built a plugin to show the “Details” about various Adult Family Homes (AFHs). All of the details are retrieved from a database table via a query string (?asi_id=WA_Af_nnnnn), where the n's are the AFH's license number. I have created a “Site Map” page (https://www.agingsafely.com/asi-site-map/) that lists an overview and has links to the details page for each AFH, so that Google can find and link them. They are also listed in sitemap.xml.
Google isn’t indexing them, but is indexing the more normal pages on my site.
I figure that I need to change my URLs from https://www.agingsafely.com/adult-family-home/?asi_id=WA_Af_751252 to something like https://www.agingsafely.com/adult-family-home/AFH/751252 to make Google happy. To add a little more complication, the “Af” in the query string is for “Adult Family Home”; the plugin also handles “Boarding Homes” (“Bf”) and “Nursing Facilities” (“Nf”).
How do I get the URL with ?asi_id=WA_Af_751252 rewritten to AFH/751252?
This appears to have two parts: change the links in the plugin to the /AFH/nnnnnn format, which should be easy, and add a rewrite rule that converts the new URL format back to a query string.
What is the best way to do this?
Does Google ignore query strings?
Are you planning on a lot of people entering that particular string in a Google search? Possibly some will, but probably not. If you want people to be able to find your products/homes easily via Google searches, then yes, I would change the links to something like https://www.agingsafely.com/adult-family-home/AFH/751252. Literally spell out as much as you can, unless a string is a popular part number that people search for, or something like that.
Also, is your site integrated with Google Analytics and Google Search Console? I would definitely do that if you haven't.
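As for the rewrite itself: WordPress's add_rewrite_rule() (or an Apache RewriteRule) can map the pretty URL back to the query string. Purely as an illustration of the mapping such a rule must perform, here it is in Python; the /AFH/, /BH/ and /NF/ path segments and the type-code table are assumptions based on the question:

```python
import re

# Assumed mapping from pretty path segments to the plugin's type codes.
TYPE_CODES = {"AFH": "Af", "BH": "Bf", "NF": "Nf"}

def pretty_to_query(path):
    """Turn /adult-family-home/AFH/751252 back into the query-string form."""
    m = re.match(r"^/adult-family-home/(AFH|BH|NF)/(\d+)/?$", path)
    if m is None:
        return None
    return f"/adult-family-home/?asi_id=WA_{TYPE_CODES[m.group(1)]}_{m.group(2)}"

print(pretty_to_query("/adult-family-home/AFH/751252"))
# -> /adult-family-home/?asi_id=WA_Af_751252
```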

Google Analytics - grouping pages

I've been reading a lot about grouping pages in Google Analytics but haven't found a clear answer to my problem.
My issue is this:
The same page appears twice, once with a trailing / and once without.
Basically, when I read my analytics I get two different entries for the same page, because some external links send people to the URL without the trailing slash (let's call it Page1), while others send people to the URL with the trailing slash (Page2).
It's a bit annoying when reading the stats, because you have to add up these two pages to get a clear view of what's going on.
I tried one option: adding a filter that removes the trailing slash. With this, I was able to get all the statistics on Page1; it was a simple search-and-replace filter that grouped the two pages.
However, looking back at this option, it created another problem: the filter is not retroactive, which means Page1 only has stats from the day I applied the filter, while Page2 scores 0 from that exact same date. A small picture to make that clear:
[Screenshots: statistics for Page2 (with slash) and for Page1 (after the filter)]
Clearly there is a discontinuity in my stats: to check long-term data I have to select one page, and to check new data I have to check the page without the trailing slash.
I removed the filter because the data is very difficult to read right now, and I'm looking for a solution to group these two pages so my data will be readable...
Thank you very much for your help,
Michael
Edit: I'm on WordPress; maybe there's a way to do it there?
There is nothing you can do about this within Google Analytics itself. I suggest you create your reports in Google Data Studio, which is free and allows you to aggregate URLs by using regular expressions to find matching parts (there is an example in this question).
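If you are willing to work outside the GA interface, you can also merge the two entries yourself after exporting the report. A minimal sketch with pandas, assuming a CSV export with Page and Pageviews columns (the filename is a placeholder):

```python
import pandas as pd

df = pd.read_csv("all_pages.csv")  # placeholder: exported "All Pages" report

# Strip a trailing slash, but leave the root page "/" intact.
df["Page"] = df["Page"].str.replace(r"(?<=.)/$", "", regex=True)

# Page1 and Page2 now share one key, so their stats add up into one row.
merged = df.groupby("Page", as_index=False)["Pageviews"].sum()
print(merged)
```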

Retrieve a number from each page of a paginated website

I have a list of approx. 36,000 URLs, ranging from https://www.fff.fr/la-vie-des-clubs/1/infos-cles to https://www.fff.fr/la-vie-des-clubs/36179/infos-cles (a few of those pages return 404 errors).
Each of those pages contains a number (the number of teams the soccer club has). In the HTML, the number appears as <p class="number">5</p>.
Is there a reasonably simple way to compile an Excel or CSV file with the URL and the associated number of teams as a field?
I've tried looking into PhantomJS, but my method took 10 seconds to open a single web page, and I don't really want to spend 100 hours doing this. I was not able to figure out how (or whether it was at all possible) to use scraping tools such as import.io to do this.
Thanks!
For the goal you want to achieve, I can see two solutions:
Code it in Java: Jsoup + any CSV library. In a few minutes, the 36,000+ URLs can be downloaded easily.
Use a tool like Portia from scrapinghub.com. Portia is a WYSIWYG tool that helps you quickly create and run your project, and its free plan can handle the 36,000+ links.
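For what it's worth, the first approach is just as short in Python (requests + BeautifulSoup instead of Jsoup); a minimal sketch, with the output filename as a placeholder:

```python
import csv
import requests
from bs4 import BeautifulSoup

BASE = "https://www.fff.fr/la-vie-des-clubs/{}/infos-cles"

with open("teams.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "teams"])
    for club_id in range(1, 36180):
        url = BASE.format(club_id)
        resp = requests.get(url, timeout=10)
        if resp.status_code == 404:  # a few pages are missing
            continue
        tag = BeautifulSoup(resp.text, "html.parser").find("p", class_="number")
        if tag is not None:
            writer.writerow([url, tag.get_text(strip=True)])
```

Sequential requests will be slow for 36,000 pages; in practice you would parallelize (e.g. with concurrent.futures) and add retries.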

Scrape all Google search results for a specific name

I think this question has been answered here before, but I could not find the desired topic. I am a newbie in web scraping. I have to develop a script that will take all the Google search results for a specific name, then grab the related data for that name; if more than one is found, the data will be grouped according to name.
All I know is that Google has some kind of restriction on scraping, and that they provide a Custom Search API. I have not used that API yet, but I hope to get all the result links for a query from it. However, I could not understand what the ideal process would be for scraping the information from those links. Any tutorial link or suggestion is very much appreciated.
You should have said a bit more about what you have tried; it does not sound like you even tried to solve it yourself.
Anyway, if you are still on it:
You can scrape Google in two ways: one is allowed, one is not.
a) Use their API; you can get around 2,000 results a day.
You can raise that to around 3,000 a day for 2,000 USD/year, and raise it further by getting in contact with them directly.
You will not be able to get accurate ranking positions with this method. If you only need a low number of requests and are mainly interested in getting some websites for a keyword, this is the right choice.
The starting point would be here: https://code.google.com/apis/console/
b) You can scrape the real search results.
That's the only way to get the true ranking positions, for SEO purposes or to track website positions. It also allows you to get a large number of results, if done right.
You can Google for code; the most advanced free (PHP) code I know of is at http://scraping.compunect.com
However, there are other projects and code snippets.
You can start off at 300-500 requests per day, and this can be multiplied by using multiple IPs. Look at the linked article if you want to go that route; it explains things in more detail and is quite accurate.
That said, if you choose route b) you break Google's terms, so either do not accept them or make sure you are not detected. If Google detects you, your script will be blocked by IP ban/captcha; not getting detected should be a priority.
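For route a), a minimal sketch of querying the Custom Search JSON API with the requests library; the key and search-engine ID (cx) are placeholders you get from the API console:

```python
import requests

API_KEY = "YOUR_API_KEY"        # placeholder
CX = "YOUR_SEARCH_ENGINE_ID"    # placeholder

def search(query, start=1):
    """Return one page (up to 10 items) of results for the query."""
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": CX, "q": query, "start": start},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("items", [])

# Each item carries the link and title of one result.
for item in search("some specific name"):
    print(item["link"], "-", item["title"])
```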

How to split a large dataset into Google sitemaps

I'm working on a site and want to create a Google sitemap (with images) efficiently. I have over 30,000 pages, and every page has an image on it. About 1,000 new pages are added every month. (I also have lots of other pages, but they don't need to be in the sitemap.) Old pages are not changed very often, but they can be deleted or modified.
I have one sitemap index and 35 sitemaps in it; every sitemap has 1,000 pages (I have a limitation on the number of results, probably from the framework I use).
Is there a better solution for this?
Unless you can read the database and create the image sitemap directly from it, you will need to check out sitemap tools that can be scheduled (to run at night = efficient?) and that can create image sitemaps, e.g. A1 Sitemap Generator (use the image sitemap preset); there are more tools available if you search Google :)
However, if you have direct access to the database containing both page and image URLs, you may want to program it yourself. That would be the most "efficient" method (i.e. reading the database directly).
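A minimal sketch of that do-it-yourself route, assuming you can pull (page URL, image URL) pairs from the database; the chunk size of 1,000 matches the question, though the sitemap protocol allows up to 50,000 URLs per file:

```python
CHUNK = 1000  # the protocol allows up to 50,000 URLs per sitemap file

def write_sitemaps(pages, base_url):
    """pages: list of (page_url, image_url) tuples pulled from the database."""
    names = []
    for i in range(0, len(pages), CHUNK):
        name = f"sitemap-{i // CHUNK + 1}.xml"
        with open(name, "w", encoding="utf-8") as f:
            f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"\n'
                    '        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">\n')
            for page_url, image_url in pages[i:i + CHUNK]:
                f.write(f"  <url>\n    <loc>{page_url}</loc>\n"
                        f"    <image:image><image:loc>{image_url}</image:loc></image:image>\n"
                        "  </url>\n")
            f.write("</urlset>\n")
        names.append(name)
    # Index file pointing at each chunk.
    with open("sitemap-index.xml", "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for name in names:
            f.write(f"  <sitemap><loc>{base_url}/{name}</loc></sitemap>\n")
        f.write("</sitemapindex>\n")
```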
