How to open 100000 different websites manually - web-scraping

I need to visit 100,000 different URLs and scrape different data from each website. The data has to be extracted manually because it is different on each website and doesn't follow a pattern.
My question is: is there any program or script where I can paste the 100,000 URLs and have it open/preload some of them in tabs/windows, so that when I close one tab the next URL opens in a new tab? That way I work in the main tab, which takes me about 10 seconds to review, then press Ctrl+W and move on to the next URL.
That would save a lot of time compared to clicking each link manually and waiting for it to load.

You can use Python web scraping, or RPA if you don't know basic Python; with a few logical steps you can automate any number of tasks.
You can also use Python's pyautogui library to click on visual elements.
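For the specific tab-feeding workflow you describe, the standard library alone is enough. A minimal sketch, assuming the URLs sit in a text file, one per line (the file name and batch size are placeholder choices):

# Sketch: open URLs from a file in the default browser in small batches;
# press Enter after closing the reviewed tabs to get the next batch.
# "urls.txt" and BATCH_SIZE are placeholders, not from the question.
import webbrowser

BATCH_SIZE = 5  # how many tabs to preload at once

with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

for start in range(0, len(urls), BATCH_SIZE):
    batch = urls[start:start + BATCH_SIZE]
    for url in batch:
        webbrowser.open_new_tab(url)  # opens in the default browser
    done = start + len(batch)
    if done < len(urls):
        input(f"{done}/{len(urls)} opened - close the reviewed tabs and press Enter for the next batch...")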

Related

How to use URLs extracted from a website as a data source for another table in Power BI

I have a situation where I need to extract tables from 13 different links that share the same structure, and then append them into a single table with all the data. At first, I extracted the links from a home page by copying each hyperlink, and then imported the data through the Web connector in Power BI. However, three months later I realized that those links change every quarter, while the home-page link where they are listed stays the same.
So I did some research and found this video on YouTube (https://www.youtube.com/watch?v=oxglJL0VWOI), which explains how to scrape the links from a website by building a table with the link text as one column and the URL as another. That way, the links are updated automatically whenever I refresh the data.
The problem is that I can't figure out how to use these links to extract the data automatically, without copying them one by one and importing each through the Power BI Web connector (Web.BrowserContents). Can anyone give me a hint on how to implement this?
Thanks in advance!
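Not a Power BI/M answer, but the shape of the transformation being asked about (scrape the link list, pull the same-structured table from each link, append everything into one table) can be sketched in Python with pandas, purely as an illustration; the URLs and the link filter below are hypothetical:

# Illustration only: scrape links from a home page, read the first table
# on each linked page, and append them into one combined table.
# Requires: pip install pandas lxml requests beautifulsoup4
import pandas as pd
import requests
from bs4 import BeautifulSoup

HOME = "https://example.com/reports"  # hypothetical home page listing the links

# Step 1: scrape the quarterly links from the home page.
soup = BeautifulSoup(requests.get(HOME, timeout=10).text, "html.parser")
links = [a["href"] for a in soup.find_all("a", href=True)
         if "quarterly" in a["href"]]  # hypothetical filter for the 13 links

# Step 2: extract the (identically structured) table from each link
# and append them into a single table.
frames = [pd.read_html(url)[0] for url in links]  # [0] = first table on the page
combined = pd.concat(frames, ignore_index=True)
print(combined.head())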

Retrieve a number from each page of a paginated website

I have a list of approx. 36,000 URLs, ranging from https://www.fff.fr/la-vie-des-clubs/1/infos-cles to https://www.fff.fr/la-vie-des-clubs/36179/infos-cles (a few of those pages return 404 errors).
Each of those pages contains a number (the number of teams the soccer club contains). In the HTML file, the number appears as <p class="number">5</p>.
Is there a reasonably simple way to compile an Excel or CSV file with the URL and the associated number of teams as fields?
I've tried looking into PhantomJS, but my method took 10 seconds to open a single web page, and I don't really want to spend 100 hours doing this. I was not able to figure out how (or whether it was at all possible) to use scraping tools such as import.io to do this.
Thanks!
For the goal you want to achieve, I can see two solutions:
Code it in Java: Jsoup plus any CSV library. The 36,000+ URLs can be downloaded easily in a few minutes (a rough sketch of the same loop, in Python rather than Java, follows below).
Use a tool like Portia from scrapinghub.com.
Portia is a WYSIWYG tool that quickly helps you create your project and run it. They offer a free plan that can handle the 36,000+ links.
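For illustration, here is a rough sketch of the first option's loop, written in Python (requests + BeautifulSoup + the standard csv module) instead of Java/Jsoup. The URL pattern and the <p class="number"> selector come from the question; the output file name is an assumption:

# Sketch: fetch each club page, pull the team count out of
# <p class="number">, and write url,teams rows to a CSV.
# Requires: pip install requests beautifulsoup4
import csv

import requests
from bs4 import BeautifulSoup

with open("teams.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["url", "teams"])
    for club_id in range(1, 36180):
        url = f"https://www.fff.fr/la-vie-des-clubs/{club_id}/infos-cles"
        resp = requests.get(url, timeout=10)
        if resp.status_code == 404:  # some ids do not exist
            continue
        tag = BeautifulSoup(resp.text, "html.parser").find("p", class_="number")
        if tag:
            writer.writerow([url, tag.get_text(strip=True)])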

Scrapy: How to recrawl a page after some time?

Being lazy, I'm trying to use Scrapy instead of implementing my own scraping service with Celery + requests (been there, done that). Let's say I have a list of N pages that I'd like to monitor. After retrieving page X and reading its content, I want to tell the system to rescan it some time later (depending on its content), say once two hours have passed.
Is such a thing possible with Scrapy?
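There is no built-in "recrawl after a delay" feature, but one pattern that stays inside Scrapy (a sketch, assuming you can keep the crawl process alive) is to hold the revisits yourself and keep the spider open via the spider_idle signal:

# Sketch of one possible pattern: keep the spider alive with
# DontCloseSpider and requeue each page once its revisit time comes.
# The start URL, the fixed two-hour delay, and the parse logic are
# placeholders; derive the delay from the content as described above.
import time

import scrapy
from scrapy import signals
from scrapy.exceptions import DontCloseSpider


class MonitorSpider(scrapy.Spider):
    name = "monitor"
    start_urls = ["https://example.com/page-to-watch"]  # hypothetical

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        spider.pending = []  # (revisit_at, request) pairs waiting their turn
        crawler.signals.connect(spider.on_idle, signal=signals.spider_idle)
        return spider

    def parse(self, response):
        # ... read the content, then decide when to look at it again ...
        delay = 2 * 60 * 60  # e.g. two hours; compute this from the content
        request = response.request.replace(dont_filter=True)
        self.pending.append((time.time() + delay, request))

    def on_idle(self, spider):
        # Fired repeatedly while the queue is empty: requeue whatever is
        # due and refuse to close while revisits are still outstanding.
        now = time.time()
        due = [req for when, req in self.pending if when <= now]
        self.pending = [(when, req) for when, req in self.pending if when > now]
        for req in due:
            self.crawler.engine.crawl(req)  # older Scrapy versions: crawl(req, spider)
        if due or self.pending:
            raise DontCloseSpider

An alternative that avoids a long-lived process is to re-run the whole crawl on a schedule (cron, Celery beat) and skip pages whose revisit time hasn't come yet.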

How to split a large dataset into Google sitemaps

I'm working on a site and want to create a Google sitemap efficiently (with images). I have over 30,000 pages, and every page has an image on it. About 1,000 new pages are added each month. (I also have lots of other pages, but they don't need to be in the sitemap.) Old pages don't change very often, but they can be deleted or modified.
I have one sitemap index page and 35 sitemaps inside it; every sitemap holds 1,000 pages (there is a limit on the number of results, probably imposed by the framework I use).
Is there a better solution for this?
Unless you can read the database and create the image sitemap directly from it, you will need to look at sitemap-generator tools that can be scheduled (to run at night, which may be what you mean by efficient) and that can create image sitemaps, e.g. A1 Sitemap Generator (use the image sitemap preset); there are more tools available if you search Google.
However, if you have direct access to the database containing both page and image URLs, you may want to program it yourself. That would be the most efficient method (i.e. reading the database directly); a sketch follows below.
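A minimal sketch of the program-it-yourself option, assuming a get_pages() helper standing in for your database query; the 1,000-URL chunk size matches your framework limit (the sitemap protocol itself allows up to 50,000 URLs per file):

# Sketch: write chunked image sitemaps plus a sitemap index directly
# from (page_url, image_url) pairs. get_pages(), the file names, and
# example.com are placeholders for your database and domain.
from xml.sax.saxutils import escape

CHUNK = 1000
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"


def get_pages():
    # Hypothetical stand-in for a database query.
    return [(f"https://example.com/page/{i}", f"https://example.com/img/{i}.jpg")
            for i in range(30000)]


pages = get_pages()
sitemap_files = []
for n, start in enumerate(range(0, len(pages), CHUNK), start=1):
    name = f"sitemap-{n}.xml"
    with open(name, "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write(f'<urlset xmlns="{SITEMAP_NS}" xmlns:image="{IMAGE_NS}">\n')
        for page_url, image_url in pages[start:start + CHUNK]:
            f.write(f"  <url><loc>{escape(page_url)}</loc>"
                    f"<image:image><image:loc>{escape(image_url)}</image:loc>"
                    f"</image:image></url>\n")
        f.write("</urlset>\n")
    sitemap_files.append(name)

with open("sitemap-index.xml", "w", encoding="utf-8") as f:
    f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    f.write(f'<sitemapindex xmlns="{SITEMAP_NS}">\n')
    for name in sitemap_files:
        f.write(f"  <sitemap><loc>https://example.com/{escape(name)}</loc></sitemap>\n")
    f.write("</sitemapindex>\n")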

Next track or shuffle in M3U playlist?

I have an M3U play-list that has URLs for some MP3s from around the web. It's on a server so I can open it on other computers and my iPhone.
Unfortunately, none of the players I've tried lets me hit the "next" button to go to the next song in the play-list. Is there a way to specify that ability in the M3U file? Or, if not, can I make a media player shuffle the play-list automatically?
I could always make a script to shuffle it myself, but I'd like to use something built into M3U if it exists.
If you don't already have the length of the track in your M3U, try including it, in this format:
#EXTM3U
#EXTINF:180,Unknown Artist - Unknown Track
/tmp/musicfile.mp3
(where 180 is the length of the track in seconds, and the #EXTM3U header on the first line marks the file as an extended M3U)
Alternatively, have you seen this article on creating iPhone-and-desktop-friendly embedded-audio web pages?
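On the script route you mention, shuffling an M3U yourself is only a few lines. A sketch in Python (file names are placeholders) that keeps each #EXTINF line paired with the entry that follows it:

# Sketch: shuffle an extended M3U while keeping #EXTINF lines attached
# to their tracks. "playlist.m3u" and "shuffled.m3u" are placeholders.
import random

with open("playlist.m3u") as f:
    lines = [line.rstrip("\n") for line in f if line.strip()]

header = [line for line in lines if line.startswith("#EXTM3U")]
body = [line for line in lines if not line.startswith("#EXTM3U")]

# Group an optional #EXTINF line with the URL/path on the next line.
entries, i = [], 0
while i < len(body):
    if body[i].startswith("#EXTINF") and i + 1 < len(body):
        entries.append([body[i], body[i + 1]])
        i += 2
    else:
        entries.append([body[i]])
        i += 1

random.shuffle(entries)

with open("shuffled.m3u", "w") as f:
    for line in header + [line for entry in entries for line in entry]:
        f.write(line + "\n")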
