I am trying to scrape data from a few websites using Selenium. I'm scraping the standard way, locating elements by CSS selector, XPath, ID, etc., but for every new website I have to write a new Selenium script after inspecting that website's elements.
Now I have a new requirement from a client: he wants a scraper that will scrape data from "any website", no matter what the website's HTML structure is.
It should extract the data and put it into columns such as postdate, description, category, location, and a few other headers like department, region, etc.
Not all of the above headers will be present on a given website, but the scraper should get whatever data is available and put it into the respective columns in a DB.
And I need to do it without the standard Selenium approach; the way I picture it, it's like a search-engine robot that goes to websites and fetches the h1, h2, h3 elements, etc.
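To make the idea concrete, here is a minimal sketch of that heading-based approach in Python with requests and BeautifulSoup; the field heuristics are illustrative assumptions, not a production-ready universal scraper:

import requests
from bs4 import BeautifulSoup

def scrape_generic(url):
    """Pull only structure-independent signals from any page."""
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    record = {"url": url, "postdate": None, "description": None}
    # Headings are present on almost every site, whatever its layout.
    record["headings"] = [h.get_text(strip=True)
                          for h in soup.find_all(["h1", "h2", "h3"])]
    # The meta description is the most reliable generic "description".
    meta = soup.find("meta", attrs={"name": "description"})
    if meta and meta.get("content"):
        record["description"] = meta["content"]
    # <time> elements are a common, layout-independent date signal.
    time_tag = soup.find("time")
    if time_tag:
        record["postdate"] = time_tag.get("datetime") or time_tag.get_text(strip=True)
    return record

print(scrape_generic("https://example.com"))

Anything a page doesn't expose generically would simply stay empty and go into the DB as NULL.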
I am confused about how to achieve this, so I am looking for hints, ideas, suggestions, etc. to complete my task. I hope my question is valid for SO.
Any programming language will do.
Thanks in advance
The first column of this table contains all the links I have to work with: https://www.metabolomicsworkbench.org/data/DRCCStudySummary.php?Mode=StudySummary&SortBy=Analysis&AscDesc=asc&ResultsPerPage=2000
From each of those links I have to download an entire table like this one: https://www.metabolomicsworkbench.org/data/show_metabolites_by_study.php?STUDY_ID=ST000886&SORTFIELD=moverz_quant
and put each table from each link into a separate sheet in Excel.
I'd highly appreciate it if anyone could tell me how to automate the entire process.
P.S.: I can't code...
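(For anyone reading who can code: the whole task fits in a short Python sketch like the one below. The link filter and the "largest table on the page" heuristic are assumptions about the current page markup, and pandas.read_html needs lxml installed.)

import requests
import pandas as pd
from bs4 import BeautifulSoup
from urllib.parse import urljoin

SUMMARY = ("https://www.metabolomicsworkbench.org/data/DRCCStudySummary.php"
           "?Mode=StudySummary&SortBy=Analysis&AscDesc=asc&ResultsPerPage=2000")

soup = BeautifulSoup(requests.get(SUMMARY, timeout=60).text, "html.parser")
# Assume the per-study links all point at show_metabolites_by_study.php.
links = {a["href"] for a in soup.find_all("a", href=True)
         if "show_metabolites_by_study" in a["href"]}

with pd.ExcelWriter("metabolites.xlsx") as writer:
    for href in sorted(links):
        tables = pd.read_html(urljoin(SUMMARY, href))  # every <table> on the page
        df = max(tables, key=len)  # assume the biggest table is the metabolite table
        # Excel sheet names are capped at 31 characters; use the study ID.
        sheet = href.split("STUDY_ID=")[-1].split("&")[0][:31]
        df.to_excel(writer, sheet_name=sheet, index=False)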
ParseHub is a free and powerful web-scraping tool that can scrape data from tables.
I have used it in the past by following this step-by-step description:
NO CODING NEEDED.
I have a Google Sheet that builds URLs to JIRA from a project ID and some parameters, so that specific searches are available from the "hub" sheet for each project listed. What I'd like is for the text of the hyperlink cell to display the number of issues returned by the search behind the link.
Now I'd just like to know the best way to do this. I'm not a programmer at all, so I'd rather spend time learning something that will end up working instead of just trying things on my own .-.
Could a kind soul maybe let me know what they think the best tool/flow for this would be?
PS: The reason I'm bothering with a sheet and not a JIRA Dashboard is that the order and list of the projects I need to keep track of changes every one or two days :[
If you are looking to scrape the generated URL, you will need to use whichever of the IMPORT formulas fits your need:
IMPORTHTML
IMPORTXML
IMPORTDATA
etc.
Then all you need to do is combine it with the hyperlink, like:
=HYPERLINK(CONCATENATE("URL link to search"), IMPORT...())
I am trying to automate web scraping for a list of physician names that I have in a .csv file.
The first step is to enter a physician's name into the search bar of this site.
Then the search button is clicked.
Then the first result link is selected.
Then I want to scrape that page to collect the required details of the physician.
Those are the steps to be performed.
The same process should be repeated for every physician.
Can anyone help me with this process using R?
Googling 'web scraping with R' brought me this tutorial and this tutorial. Both of these seem simple enough that you should be able to accomplish what you need. Also, heed hrbrmstr's warning, and see if you can acquire the data you need without abusing metacrawler's website.
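The question asks for R, but the control flow is the same in any language. As a language-neutral illustration, here is a rough Python sketch; the search URL, query parameter, and CSS selectors are placeholders, since the target site isn't named, and a JavaScript-driven search would need Selenium/RSelenium instead of plain HTTP requests:

import csv
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

SEARCH_URL = "https://example-directory.com/search"  # placeholder site

with open("physicians.csv", newline="") as f:
    names = [row[0] for row in csv.reader(f)]

for name in names:
    # Enter the name in the search bar and "hit search" (a GET form here).
    page = requests.get(SEARCH_URL, params={"q": name}, timeout=30)
    soup = BeautifulSoup(page.text, "html.parser")
    # Select the first result link (the selector is a placeholder).
    first = soup.select_one("a.result-link")
    if first is None:
        continue
    detail_url = urljoin(SEARCH_URL, first["href"])
    detail = BeautifulSoup(requests.get(detail_url, timeout=30).text, "html.parser")
    # Scrape whatever details are needed from the profile page.
    print(name, detail.select_one("h1").get_text(strip=True))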
I currently use Kimonolabs for scraping data from websites that all share the same goal. To keep it simple, let's say these websites are online shops selling stuff (actually they are job websites with online application options, but technically they look a lot like webshops).
This works great. For each website, a scraper API is created that goes through the available advanced search page to crawl all product URLs. Let's call this API the 'URL list'. Then a 'product API' is created for the product detail page that scrapes all the necessary elements, e.g. the title, the product text, and specs like the brand, category, etc. The product API is set to crawl daily using all the URLs gathered in the 'URL list'.
Then the gathered information for all products is fetched through the Kimonolabs JSON endpoint by our own service.
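(For readers unfamiliar with the setup: stripped of the Kimonolabs UI, the two APIs amount to the two-stage pattern below. The base URL and selectors are placeholders, since the real sites aren't named; the second stage is what would be scheduled to run daily.)

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "https://example-shop.com"  # placeholder

# Stage 1, the "URL list": walk the paginated advanced search results.
product_urls, page = [], 1
while True:
    soup = BeautifulSoup(requests.get(f"{BASE}/search?page={page}", timeout=30).text,
                         "html.parser")
    links = [urljoin(BASE, a["href"]) for a in soup.select("a.product")]  # placeholder selector
    if not links:
        break  # ran past the last page of results
    product_urls.extend(links)
    page += 1

# Stage 2, the "product API": scrape each detail page (schedule this daily).
for url in product_urls:
    s = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    print({"title": s.select_one("h1").get_text(strip=True),
           "brand": s.select_one(".brand").get_text(strip=True)})  # placeholder selector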
However, Kimonolabs will shut down its service at the end of February 2016 :-(. So I'm looking for an easy alternative. I've been looking at import.io, but I'm wondering:
Does it support automatic updates (letting the API scrape hourly/daily/etc)?
Does it support fetching all product URLs from a paginated advanced search page?
I'm tinkering around with the service. Basically, it seems to extract data via the same easy process as Kimonolabs. It's just unclear to me whether the pagination of URLs needed for the product API, and automatically keeping it up to date, are supported.
Are there any import.io users here who can advise whether import.io is a useful alternative for this? Maybe even give some pointers in the right direction?
Look into Portia. It's an open source visual scraping tool that works like Kimono.
Portia is also available as a service and it fulfills the requirements you have for import.io:
automatic updates, by scheduling periodic jobs to crawl the pages you want, keeping your data up-to-date.
navigation through pagination links, based on URL patterns that you can define.
Full disclosure: I work at Scrapinghub, the lead maintainer of Portia.
Maybe you want to give Extracty a try. It's a free web-scraping tool that allows you to create endpoints that extract any information and return it in JSON. It can easily handle paginated searches.
If you know a bit of JS, you can write CasperJS endpoints and integrate any logic that you need to extract your data. It has a similar goal to Kimonolabs and can solve the same problems (if not more, since it's programmable).
If Extracty does not meet your needs, you can check out these other market players that aim at similar goals:
Import.io (as you already mentioned)
Mozenda
Cloudscrape
TrooclickAPI
FiveFilters
Disclaimer: I am a co-founder of the company behind Extracty.
I'm not that fond of Import.io, but it seems to me it allows pagination through bulk input URLs. Read here.
So far there is not much progress in getting the whole website through the API:
Chain more than one API/Dataset: It is currently not possible to fully automate the extraction of a whole website with Chain API.
For example, if I want data that is found within category pages or paginated lists, I first have to create a list of URLs, run Bulk Extract, save the result as an import data set, and then chain it to another Extractor. Once set up, I would like to be able to do this more automatically, in one click.
P.S. If you are somewhat familiar with JS, you might find this useful.
Regarding automatic updates:
This is a beta feature right now. I'm testing this for myself after migrating from Kimonolabs... You can enable it for your own APIs by appending &bulkSchedule=1 to your API URL. Then you will see a "Schedule" tab. In the "Configure" tab, select "Bulk Extract" and add your URLs; after this, the scheduler will run daily or weekly.
I'm fairly new to web development and have never done any screen scraping or web crawling before, but yesterday a friend of mine asked me if I could grab some data from this website. It's not mine, nor his, but the data is publicly available, even for download.
The problem with the data is that it's available only as one file per date or company, rather than one file for multiple dates or companies, which involves a lot of tedious clicking through the calendar. So he thought it would be nice if I could create an app that grabs all the data with one click and outputs it in a single file, or something similar.
The website uses an ASPX WebForm with __doPostBack to retrieve the data for different dates; even the links to download the data as XLS aren't the usual "href=…" links. They are, I assume, references to some ASP script…
To be honest, the only thing I tried was PHP cURL, which didn't work. But since this was my first time using cURL, I don't even know whether it failed because the task isn't possible with cURL, or just because I don't know how to work with it.
I am only somewhat proficient in PHP and JavaScript, but not in ASP, though I wouldn't mind learning something new.
So my question is:
Is it at all possible to grab the data from a website like this? And if it is, would you be so kind as to give me some hints on how to approach this kind of problem?
The website, again, is here: http://extranet.net4gas.cz/capacity_ee.aspx
Thanks
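(To answer the "is it possible" part: yes. __doPostBack just submits the page's form with a couple of hidden fields filled in, and that POST can be replayed outside the browser. Here is a rough Python sketch, assuming the standard ASP.NET WebForms hidden fields; the exact __EVENTTARGET control name is a placeholder you'd read out of the page's own __doPostBack(...) calls.)

import requests
from bs4 import BeautifulSoup

URL = "http://extranet.net4gas.cz/capacity_ee.aspx"
session = requests.Session()

# GET the page once to collect the hidden WebForms state fields.
soup = BeautifulSoup(session.get(URL, timeout=30).text, "html.parser")
state = {inp["name"]: inp.get("value", "")
         for inp in soup.select("input[type=hidden]") if inp.get("name")}

# Replay the postback: __EVENTTARGET names the control that was "clicked".
state["__EVENTTARGET"] = "ctl00$SomeDateControl"  # placeholder control name
state["__EVENTARGUMENT"] = ""
response = session.post(URL, data=state, timeout=30)
print(response.status_code, len(response.text))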
C# has a nice WebClient class to do the job:
using System.Net;

// Create the web client.
WebClient client = new WebClient();
// Download the page HTML as a string.
string value = client.DownloadString("http://www.microsoft.com/");
Once you have the page HTML in a string, you can use regular expressions to scrape the content you are looking for.
Here is a very basic regular expression to give you a hint:
using System;
using System.Text.RegularExpressions;

// Match the first run of digits in the input text.
Regex regex = new Regex(@"\d+");
Match match = regex.Match("hello here 10 values");
if (match.Success)
{
    Console.WriteLine(match.Value); // prints "10"
}
Marosko, as you said, the data on the website is open to the public, so you can certainly scrape it; the task is just to cut down the manual clicking through dates while extracting the data. I personally don't have much of an idea of how cURL would work here, but I am sure it would involve a lot of coding. I would rather suggest you automate the entire process using an automation tool, i.e. a software application. Try Automation Anywhere; I bought it a few months back for some data-extraction purposes and it worked very well. It is automated, and you can check out the screen-scraping capabilities it offers. It's my favorite :)
Charles