How to split a large dataset into Google sitemaps - xml-sitemap

I'm working on a site and want to create a Google sitemap efficiently (with images). I have over 30,000 pages, and every page has an image on it. About 1,000 new pages are added every month. (I also have lots of other pages, but they don't need to be in the sitemap.) Old pages are not changed very often, but they can be deleted or modified.
I have one sitemap index and 35 sitemaps in it; every sitemap has 1,000 pages. (I have a limit, probably imposed by the framework I use, on the number of results.)
Is there a better solution for this?

Unless you can read the database and create the image sitemap directly from it, you will need to look at sitemapper tools that can be scheduled (to run at night, for efficiency) and that can create image sitemaps, e.g. A1 Sitemap Generator (use the image sitemap preset), but there are more tools available if you search Google :)
However, if you have direct access to the database containing both page and image URLs, you may want to program it yourself. That would be the most "efficient" method (i.e. reading the database directly).
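If you do go the "read the database directly" route, here is a minimal sketch of what that could look like in Python, assuming a pages table with url and image_url columns (the table layout, file names, and the use of sqlite3 are assumptions, not details from the question). It writes one image sitemap per 1,000 URLs, matching the split described above, plus a sitemap index pointing at all of them; note that the sitemap protocol itself allows up to 50,000 URLs per file.

# Hypothetical sketch: build image sitemaps plus a sitemap index straight from a database.
# The table name, column names, and file paths are assumptions, not taken from the question.
import sqlite3
from datetime import date
from xml.sax.saxutils import escape

CHUNK = 1000  # the question's per-sitemap limit; the protocol itself allows up to 50,000 URLs

def write_sitemaps(db_path, base_url):
    con = sqlite3.connect(db_path)
    rows = con.execute("SELECT url, image_url FROM pages ORDER BY id").fetchall()
    con.close()

    names = []
    for i in range(0, len(rows), CHUNK):
        name = f"sitemap-{i // CHUNK + 1}.xml"
        names.append(name)
        with open(name, "w", encoding="utf-8") as f:
            f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" '
                    'xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">\n')
            for url, image_url in rows[i:i + CHUNK]:
                f.write("  <url>\n")
                f.write(f"    <loc>{escape(url)}</loc>\n")
                f.write(f"    <image:image><image:loc>{escape(image_url)}</image:loc></image:image>\n")
                f.write("  </url>\n")
            f.write("</urlset>\n")

    # One sitemap index pointing at every chunk file.
    with open("sitemap-index.xml", "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for name in names:
            f.write(f"  <sitemap><loc>{base_url}/{name}</loc>"
                    f"<lastmod>{date.today().isoformat()}</lastmod></sitemap>\n")
        f.write("</sitemapindex>\n")

Scheduled nightly (cron or Task Scheduler), something like this should regenerate the whole set quickly even at 30,000+ pages, and deleted or modified pages simply disappear from or update in the next run.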

Related

How to open 100000 different websites manually

I need to click through 100,000 different URLs to scrape data about each website. The data has to be extracted manually because it is different on each website and doesn't follow a pattern.
My question is: is there any program or script into which I can paste the 100,000 URLs and it opens/preloads some of them in tabs or windows, so that when I close one tab the next URL opens in a new tab? That way I work in the main tab, which takes me about 10 seconds to review, then press Ctrl+W and move on to the next URL.
That way I could save a lot of time compared with clicking each link manually and waiting for it to load.
You can use Python web scraping, or RPA if you don't know basic Python; with a few logical steps you can automate any number of tasks.
You can also make use of Python's pyautogui library to click on visual elements.
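Neither suggestion comes with code, so here is a small hedged sketch of the "preload the next URLs" workflow using only Python's standard webbrowser module; the urls.txt file name and the batch size of 5 are assumptions. You paste the 100,000 URLs into the text file, review each tab, close it with Ctrl+W, and press Enter in the terminal whenever you want the next batch preloaded.

# Hypothetical sketch: open a long URL list in small batches of browser tabs.
# "urls.txt" (one URL per line) and the batch size are assumptions of this sketch, not the question.
import webbrowser

BATCH = 5  # how many tabs to preload at a time

def open_in_batches(path="urls.txt"):
    with open(path, encoding="utf-8") as f:
        urls = [line.strip() for line in f if line.strip()]
    for i in range(0, len(urls), BATCH):
        for url in urls[i:i + BATCH]:
            webbrowser.open_new_tab(url)  # preload the next few pages
        done = min(i + BATCH, len(urls))
        # Review each tab, close it with Ctrl+W, then come back and press Enter.
        input(f"Opened {done} of {len(urls)} URLs - press Enter for the next batch... ")

if __name__ == "__main__":
    open_in_batches()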

Retrieve a number from each page of a paginated website

I have a list of approx. 36,000 URLs, ranging from https://www.fff.fr/la-vie-des-clubs/1/infos-cles to https://www.fff.fr/la-vie-des-clubs/36179/infos-cles (a few of those pages return 404 errors).
Each of those pages contains a number (the number of teams the soccer club contains). In the HTML file, the number appears as <p class="number">5</p>.
Is there a reasonably simple way to compile an Excel or CSV file with the URL and the associated number of teams as a field?
I've tried looking into PhantomJS, but my method took 10 seconds to open a single webpage and I don't really want to spend 100 hours doing this. I was not able to figure out how (or whether it was at all possible) to use scraping tools such as import.io to do this.
Thanks!
For the goal you want to achieve, I can see two solutions:
Code it in Java: Jsoup + any CSV library.
In a few minutes, the 36,000+ URLs can be downloaded easily.
Use a tool like Portia from scrapinghub.com.
Portia is a WYSIWYG tool that quickly helps you create your project and run it. They offer a free plan which can handle the 36,000+ links.
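The same idea also fits in a few lines of Python (shown here as a sketch rather than the answer's Java/Jsoup route), using requests and BeautifulSoup and writing straight to CSV; the URL pattern and the "number" CSS class come from the question, while the timeout and the output file name are assumptions.

# Hypothetical sketch: read <p class="number"> from each club page and write a CSV.
# URL pattern and CSS class are from the question; error handling is minimal on purpose.
import csv
import requests
from bs4 import BeautifulSoup

def scrape_team_counts(first_id=1, last_id=36179, out_path="teams.csv"):
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "teams"])
        for club_id in range(first_id, last_id + 1):
            url = f"https://www.fff.fr/la-vie-des-clubs/{club_id}/infos-cles"
            resp = requests.get(url, timeout=10)
            if resp.status_code == 404:  # a few pages are missing, per the question
                continue
            tag = BeautifulSoup(resp.text, "html.parser").find("p", class_="number")
            if tag:
                writer.writerow([url, tag.get_text(strip=True)])

if __name__ == "__main__":
    scrape_team_counts()

At roughly one request per second this still takes on the order of ten hours, so in practice you would parallelise it (e.g. with a thread pool) or use a crawler such as Scrapy, which handles concurrency for you.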

How To Extract Page URLs From Any Website in Bulk?

I'm looking for a free solution/tool/software through which I can pull out all of a website's page URLs. The site has approx. 992,000 pages, so I need the URLs of all of them in an Excel sheet.
I'm using "site:mywebsite.com" and it gives me 992,000 results. I know I can raise the max results per page to 100, but that still doesn't make my life easier, and Google won't show any results beyond the first 1,000. I tried the Google API without any luck, and sitemap generators didn't work either.
You can use a crawler tool to crawl the entire website and save the URLs visited. Free tools include:
IRobotSoft: http://www.irobotsoft.com/help/irobot-manual.pdf. Use: CrawlWebsite (SourceSites, CallTask) function.
Scrapy: http://doc.scrapy.org/en/latest/intro/tutorial.html
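As a concrete illustration of the Scrapy route, here is a minimal spider sketch that follows every internal link and records each URL it visits; mywebsite.com is just the placeholder domain from the question, and the output format is whatever you pass to -o.

# Hypothetical sketch: a Scrapy spider that records every internal URL it visits.
# "mywebsite.com" is the placeholder domain from the question.
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class UrlSpider(CrawlSpider):
    name = "url_collector"
    allowed_domains = ["mywebsite.com"]
    start_urls = ["https://mywebsite.com/"]
    rules = (Rule(LinkExtractor(), callback="parse_item", follow=True),)

    def parse_item(self, response):
        yield {"url": response.url}

# Run with (file name is just an example):
#   scrapy runspider url_spider.py -o urls.csv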
Google limits search query results to 1,000. The only way a tool could really get around this is to query subsets of the keyword, e.g. (site:abc.com + random-word). Each random word returns fewer results, and with enough of these queries scraped and combined into a list, you could then delete the duplicates and end up with a near-complete list for the original search.

Best approach for fetching news from websites?

I have a function which scrapes all the latest news from a website (approximately 10 items; the exact number depends on the website). Note that the news items are in chronological order.
For example, yesterday I got 10 news items and stored them in the database. Today I get 10 items, but 3 of them were not there yesterday (7 items stayed the same, 3 are new).
My current approach is to extract each item until I find an old one (the first of the 7 unchanged items); then I stop extracting, only update the "lastUpdateDate" field of the old items, and add the new items to the database. I think this approach is somewhat complicated and it takes time.
I'm actually getting news from 20 websites with the same content structure (Moodle), so each run takes about 2 minutes, which my free host doesn't support.
Would it be better to delete all the news and then extract everything from scratch (this actually inflates the ID numbers in the database by a huge amount)?
First, check to see if the website has a published API. If it has one, use it.
Second, check the website's terms of service, which may specifically and explicitly disallow scraping the website.
Third, look at a module in your programming language of choice that handles both the fetching of the pages and the extraction of the content from the pages. In Perl, you would start with WWW::Mechanize or Web::Scraper.
Whatever you do, don't fall into the trap that so many who post to Stack Overflow fall into: fetching the web page and then trying to parse the content themselves, most often with regular expressions, which are an inadequate tool for the job. Browse the SO html-parsing tag for tales of sorrow from those who have tried to roll their own HTML parsing instead of using existing tools.
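To make that concrete, here is a hedged Python sketch of the "fetch with a library, parse with a real HTML parser" approach (the question doesn't state a language, so Python stands in for Perl's WWW::Mechanize / Web::Scraper); the URL and CSS selectors are placeholders for whatever the Moodle news pages actually use.

# Hypothetical sketch: fetch a news page and parse it with a real HTML parser,
# never with regular expressions. The URL and selectors are placeholders.
import requests
from bs4 import BeautifulSoup

def fetch_latest_news(url="https://example.com/news"):
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for node in soup.select("div.news-item"):  # placeholder selector
        items.append({
            "title": node.select_one("h2").get_text(strip=True),
            "link": node.select_one("a")["href"],
        })
    return items  # newest first, matching the chronological order in the question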
It depends on the requirement: do you want to show old news to the users or not?
For scraping, you can create a custom local script, run as a cron job, which will grab the data from those news websites and store it in the database.
You can also check by subject whether an item already exists or not.
Finally, make a custom news block which will show all the database entries.
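Here is a minimal sketch of that last check, combining the "does this subject already exist?" test with the questioner's lastUpdateDate field; the sqlite3 backend and the table layout are assumptions. Existing items just get their timestamp refreshed, new items are inserted, and nothing is deleted and re-created, so the IDs stay stable.

# Hypothetical sketch: insert new items, refresh lastUpdateDate on ones already stored.
# The news table layout and the use of sqlite3 are assumptions.
import sqlite3
from datetime import datetime, timezone

def store_news(db_path, items):
    """items: list of dicts with 'title' and 'link' keys, newest first."""
    now = datetime.now(timezone.utc).isoformat()
    con = sqlite3.connect(db_path)
    with con:
        con.execute("""CREATE TABLE IF NOT EXISTS news (
                           id INTEGER PRIMARY KEY,
                           title TEXT UNIQUE,
                           link TEXT,
                           lastUpdateDate TEXT)""")
        for item in items:
            updated = con.execute(
                "UPDATE news SET lastUpdateDate = ? WHERE title = ?",
                (now, item["title"])).rowcount
            if updated == 0:  # not seen before: a genuinely new item
                con.execute(
                    "INSERT INTO news (title, link, lastUpdateDate) VALUES (?, ?, ?)",
                    (item["title"], item["link"], now))
    con.close()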

Dynamic Data Web Site is unusable due to slowness

I have created a small "Dynamic Data Web Site" using the Entity Framework. I've no experience with this really, but it looks very interesting. Anyway, I have a single table being displayed on a single web page. The table contains over 21000 rows and the page limits me to 10 records per page, which is all fine.
My problem is that the page is incredibly slow. I'm guessing that maybe every row in the table is being loaded whenever I try to navigate, but I can't be sure this is the cause.
How can I increase the performance of the page? I want to be able to click through pages of results quickly and easily. It currently takes more than 60 seconds to click to the next set of results.
This is usually caused by filters on a table where the filtered column has MANY rows. You could fix this by using the Autocomplete filter, which pre-filters the data based on what the user types in.
You can get this filter and others from my NuGet package, Dynamic Data Custom Filters.
Also try having a look at it using Ayende's EFProf. It is a commercial product, but it has a free 30-day trial. It can sometimes point out silly things you are doing and suggest ways to optimise your data access.
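The question is specific to ASP.NET Dynamic Data, but the underlying rule is stack-agnostic: never materialise all 21,000 rows to render a 10-row page; let the database return one page at a time. Purely as an illustration of that principle (not Dynamic Data code), here is a sketch in Python with sqlite3, where the assumed records table stands in for the real one.

# Hypothetical, language-agnostic illustration: fetch only the rows for the current page.
# In Dynamic Data / Entity Framework the equivalent is paging the IQueryable with
# Skip/Take so the query, not the application, does the limiting.
import sqlite3

PAGE_SIZE = 10

def get_page(db_path, page_number):
    con = sqlite3.connect(db_path)
    try:
        return con.execute(
            "SELECT * FROM records ORDER BY id LIMIT ? OFFSET ?",
            (PAGE_SIZE, page_number * PAGE_SIZE)).fetchall()
    finally:
        con.close()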
