How do I extract data from websites with different structures? - web-scraping

I am currently working on a job-portal-style project in which we generate links (official government website links) related to jobs through a customized search engine. Is there any way to extract the data from these generated links?
I have tried web scraping, but the structure of every website is different, so I need a generic method to extract the data from these websites.
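There is no single selector that works across differently structured sites, but a common fallback is to scrape only structure-independent fields such as the page title, the meta description, and the visible paragraph text. A minimal sketch in R with rvest, where the example links and the fallback selectors are placeholders rather than anything from your project:

# A minimal sketch, assuming rvest is installed; the URLs and selectors below
# are placeholders, since every site structures its postings differently.
library(rvest)

extract_generic <- function(url) {
  page <- read_html(url)
  list(
    url         = url,
    title       = page %>% html_element("title") %>% html_text2(),
    description = page %>% html_element("meta[name='description']") %>% html_attr("content"),
    # Fall back to all visible paragraph text when no site-specific selector is known
    body_text   = page %>% html_elements("p") %>% html_text2() %>% paste(collapse = "\n")
  )
}

# Example: apply to a vector of generated links (placeholder URLs)
links   <- c("https://example.gov/job/123", "https://example.gov/job/456")
results <- lapply(links, extract_generic)

For anything richer than that (salary, closing date, location), you generally need a per-site selector configuration, e.g. a small rules table keyed by domain.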

Related

Is there a way in R to extract data from a website using Microsoft Power BI

I am working on a project related to COVID travel restrictions and want to use data from
https://migration.iom.int/, in particular the data on country travel restrictions (press the tab on the bottom right once the page has loaded). My usual rvest approach to web scraping does not seem to work for this site. Any suggestions on possible ways to extract data from it?
The data is from JS files like this one: https://migration.iom.int/sites/all/themes/fmp/pages/heatmap/js/heatmap_2020-07-23.js
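Since the data lives in a JavaScript file rather than in the rendered HTML, one approach is to download that .js file directly and parse the JSON it carries. A rough sketch, assuming the file is essentially a JSON payload wrapped in a JavaScript variable assignment; the wrapper-stripping regex is an assumption and may need adjusting:

# Not a tested solution: assumes the .js file holds a JSON array/object
# assigned to a variable, so we strip the wrapper and parse what's left.
library(httr)
library(jsonlite)

js_url <- "https://migration.iom.int/sites/all/themes/fmp/pages/heatmap/js/heatmap_2020-07-23.js"
raw_js <- content(GET(js_url), as = "text", encoding = "UTF-8")

# Keep everything between the first '[' or '{' and the last ']' or '}' (assumption)
json_txt <- sub("^[^\\[{]*", "", raw_js)
json_txt <- sub("[^\\]}]*$", "", json_txt)

restrictions <- fromJSON(json_txt)
str(restrictions)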

Search Engine in Website showing results from certain sites

I am currently planning to build a search engine website. I want my engine to search certain other sites (let's say 10 sites) and return results from them. One way to achieve that is to do it from scratch: build spiders that scan the sites, build a database of products, index them, and then add a simple search mechanism to return results. Is there an easier way to do that (maybe using plugins in CMSs like WordPress?), free or paid? Which way would be more efficient and faster to build?
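For the from-scratch route described above, the core is just a crawler that indexes pages from your ten sites plus a search over that index. A toy illustration in R with rvest, where the URLs and selectors are placeholders:

# Toy sketch only: crawl a fixed list of sites, index title/text into a data
# frame, and search it with a simple keyword match.
library(rvest)

index_page <- function(url) {
  page <- read_html(url)
  data.frame(
    url   = url,
    title = page %>% html_element("title") %>% html_text2(),
    text  = page %>% html_elements("p") %>% html_text2() %>% paste(collapse = " "),
    stringsAsFactors = FALSE
  )
}

# Placeholder site list; a real crawler would also follow product links per site
index <- do.call(rbind, lapply(c("https://site1.example", "https://site2.example"), index_page))

search_index <- function(query) {
  index[grepl(query, paste(index$title, index$text), ignore.case = TRUE), ]
}
search_index("laptop")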

How to do web scraping using R

I'm a beginner in web scraping and I'm trying to learn how to implement an automated process to collect data from the web by submitting search terms.
The specific problem I’m working on is as follows:
Given the Stack Overflow webpage https://stackoverflow.com/, I submit a search for the term "web scraping" and want to collect, in a list, all question links and the content of each question.
Is it possible to scrape these results?
My plan is to create a list of terms:
term <- c("web scraping", "crawler", "web spider")
submit a search for each term and collect both the question title and the content of each question.
Of course, the process should be repeated for each page of results.
Unfortunately, being relatively new to web scraping, I'm not sure what to do.
I’ve already downloaded some packages to scrape the web (rvest, RCurl, XML, RCrawler).
Thanks for your help
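A minimal sketch of that plan with rvest: build the search URL for each term, pull the result links and titles, and loop over the page parameter for further pages. The a.s-link selector is an assumption about Stack Overflow's current search-result markup and may need updating; note also that the Stack Exchange API is often a more robust route than scraping the HTML.

# Sketch only; assumes rvest and that "a.s-link" matches result titles.
library(rvest)

terms <- c("web scraping", "crawler", "web spider")

scrape_term <- function(term, page = 1) {
  url  <- paste0("https://stackoverflow.com/search?q=",
                 URLencode(term, reserved = TRUE), "&page=", page)
  html  <- read_html(url)
  links <- html %>% html_elements("a.s-link")
  data.frame(
    term  = term,
    title = html_text2(links),
    link  = paste0("https://stackoverflow.com", html_attr(links, "href")),
    stringsAsFactors = FALSE
  )
}

# One page of results per term; loop over `page` to cover additional pages
results <- do.call(rbind, lapply(terms, scrape_term))

To get each question's content, follow each collected link with read_html() and extract the question body the same way.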

How to perform web scraping dynamically using R

I am trying to automate web scraping for a list of physician names. I have the list of physician names in a .csv file.
The first step is to enter each physician's name in the search bar of this site.
Then the search button is to be clicked.
Then the first result link is to be selected.
Then I want to scrape the required details of the physician from that page.
These are the steps to be performed.
The same steps are to be repeated for every physician.
Can anyone help me with this process using R?
Google searching 'web scraping with R' brought me to this tutorial and this tutorial. Both of these seem simple enough that you should be able to accomplish what you need. Also, heed hrbrmstr's warning, and see if you can acquire the data you need without abusing metacrawler's website.
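Since the site isn't specified, here is only a hedged sketch of the general pattern using rvest's session functions; the URL, the form-field name, the CSV column, and the CSS selectors are all placeholders, and if the search page is driven by JavaScript you would need RSelenium instead:

# Hedged sketch of the pattern: read names, submit the search form, follow the
# first result, scrape details. All names/URLs/selectors are placeholders.
library(rvest)

physicians <- read.csv("physicians.csv", stringsAsFactors = FALSE)   # placeholder file

lookup_physician <- function(name) {
  sess <- session("https://example.org/physician-search")            # placeholder URL
  form <- html_form(sess)[[1]]
  form <- html_form_set(form, q = name)                              # placeholder field name
  results <- session_submit(sess, form)
  # Follow the first result link (placeholder selector)
  first  <- results %>% html_element(".result a") %>% html_attr("href")
  detail <- session_jump_to(results, first)
  list(
    name    = name,
    address = detail %>% html_element(".address") %>% html_text2()   # placeholder selector
  )
}

details <- lapply(physicians$Name, lookup_physician)                 # placeholder column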

Import.io - Can it replace Kimonolabs

I use Kimonolabs right now for scraping data from websites that have the same goal. To keep it simple, let's say these websites are online shops selling stuff online (actually they are job websites with online application possibilities, but technically they look a lot like webshops).
This works great. For each website a scraper API is created that goes through the available advanced search pages to crawl all product URLs. Let's call this API the 'URL list'. Then a 'product API' is created for the product detail page that scrapes all necessary elements, e.g. the title, product text and specs like the brand, category, etc. The product API is set to crawl daily using all the URLs gathered in the 'URL list'.
Then the gathered information for all products is fetched from the Kimonolabs JSON endpoint by our own service.
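(Purely to make that two-stage workflow concrete, not a Kimonolabs or import.io feature: the same pattern expressed in plain R with rvest, using placeholder URLs and selectors.)

# Stage 1: the "URL list" – walk the paginated search pages and collect product URLs
library(rvest)

collect_urls <- function(base = "https://example-shop.com/search?page=", pages = 1:5) {
  unlist(lapply(pages, function(p) {
    read_html(paste0(base, p)) %>%
      html_elements("a.product-link") %>%   # placeholder selector
      html_attr("href")
  }))
}

# Stage 2: the "product API" – scrape the detail fields from each product page
scrape_product <- function(url) {
  page <- read_html(url)
  list(
    url   = url,
    title = page %>% html_element("h1") %>% html_text2(),
    brand = page %>% html_element(".brand") %>% html_text2()  # placeholder selector
  )
}

products <- lapply(collect_urls(), scrape_product)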
However, Kimonolabs will end its service at the end of February 2016 :-(. So I'm looking for an easy alternative. I've been looking at import.io, but I'm wondering:
Does it support automatic updates (letting the API scrape hourly/daily/etc)?
Does it support fetching all product URLs from a paginated advanced search page?
I'm tinkering around with the service. Basically, it seems to extract data via the same easy process as Kimonolabs. It's just unclear to me whether paginating the URLs needed for the product API, and automatically keeping it up to date, are supported.
Are there any import.io users here who can advise whether import.io is a useful alternative for this? Maybe even give some pointers in the right direction?
Look into Portia. It's an open source visual scraping tool that works like Kimono.
Portia is also available as a service and it fulfills the requirements you have for import.io:
automatic updates, by scheduling periodic jobs to crawl the pages you want, keeping your data up-to-date.
navigation through pagination links, based on URL patterns that you can define.
Full disclosure: I work at Scrapinghub, the lead maintainer of Portia.
Maybe you want to give Extracty a try. It's a free web scraping tool that allows you to create endpoints that extract any information and return it in JSON. It can easily handle paginated searches.
If you know a bit of JS, you can write CasperJS endpoints and integrate any logic that you need to extract your data. It has a similar goal to Kimonolabs and can solve the same problems (if not more, since it's programmable).
If Extracty does not solve your needs, you can check out these other market players that aim for similar goals:
Import.io (as you already mentioned)
Mozenda
Cloudscrape
TrooclickAPI
FiveFilters
Disclaimer: I am a co-founder of the company behind Extracty.
I'm not particularly fond of Import.io, but it seems to me that it allows pagination through bulk input URLs. Read here.
So far there has been little progress in getting a whole website through the API:
Chain more than one API/Dataset: It is currently not possible to fully automate the extraction of a whole website with Chain API.
For example, if I want data that is found within category pages or paginated lists, I first have to create a list of URLs, run Bulk Extract, save the result as an imported data set, and then chain it to another Extractor. Once set up, I would like to be able to do this more automatically, in one click.
P.S. If you are somehow familiar with JS you might find this useful.
Regarding automatic updates:
This is a beta feature right now. I'm testing this for myself after migrating from Kimonolabs... You can enable this for your own APIs by appending &bulkSchedule=1 to your API URL. Then you will see a "Schedule" tab. In the "Configure" tab, select "Bulk Extract" and add your URLs; after this, the scheduler will run daily or weekly.
