How to perform web scraping dynamically using R

I am trying to automate web scraping for a list of physician names. I have the physician names in a .csv file.
First, a physician's name should be entered into the search bar of this site.
Then the search button should be clicked.
Then the first result link should be selected.
Then I want to scrape that page to collect the required details about the physician.
Those are the steps to be performed.
The same steps need to be repeated for every physician in the list.
Can anyone help me with this process using R?

Googling 'web scraping with R' brought me this tutorial and this tutorial. Both of these seem simple enough that you should be able to accomplish what you need. Also, heed hrbrmstr's warning, and see if you can acquire the data you need without abusing metacrawler's website.
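If the site's search can be driven with a plain GET request, a minimal rvest sketch along these lines may be enough to loop over the CSV of names. Note that the search URL, the CSV column name, and the CSS selectors below are placeholders you would have to adapt after inspecting the actual site; if the search only works through JavaScript, you would need something like RSelenium instead.

library(rvest)

physicians <- read.csv("physicians.csv", stringsAsFactors = FALSE)  # one column of names (assumed)

scrape_physician <- function(name) {
  # Build the search URL (placeholder; inspect the real site's search form)
  search_url  <- paste0("https://example-directory.com/search?q=", URLencode(name))
  search_page <- read_html(search_url)

  # Take the first result link (placeholder selector)
  first_link <- html_attr(html_element(search_page, "a.search-result"), "href")
  profile    <- read_html(xml2::url_absolute(first_link, search_url))

  Sys.sleep(1)  # be polite to the site

  # Collect the required details (placeholder selectors)
  data.frame(
    name      = name,
    specialty = html_text(html_element(profile, ".specialty"), trim = TRUE),
    address   = html_text(html_element(profile, ".address"), trim = TRUE),
    stringsAsFactors = FALSE
  )
}

results <- do.call(rbind, lapply(physicians$Name, scrape_physician))
write.csv(results, "physician_details.csv", row.names = FALSE)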

Related

How would I show, in a Google Sheets cell, the number of issues found in a JIRA search

I have a Google sheet that creates URLs to JIRA with project ID and some parameters to have specific searches available from the "hub" sheet for each project listed. What I'd like to do is have the text in the hyperlink cell display the number of issues in the search from the link.
Now I'd just like to know the best way to do this; as I'm not a programmer at all, I'd rather spend time learning something that will end up working instead of just trying things on my own .-.
Could a kind soul maybe let me know what they think the best tool/flow for this would be?
PS: The reason I'm bothering with a sheet and not a JIRA Dashboard is that the order and list of the projects I need to keep track of changes every one or two days :[
If you are looking to scrape the generated URL, you will need to use one of the import formulas that fits your need:
IMPORTHTML
IMPORTXML
IMPORTDATA
etc.
Then all you need to do is combine it like:
=HYPERLINK(CONCATENATE("URL link to search"), IMPORT...())
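As a purely illustrative example, assuming the project key is in cell A2, the JIRA instance is readable without logging in, and the results page exposes the total count in an element reachable by XPath (all of which you would need to verify and adapt), the combined formula could look like:

=HYPERLINK(CONCATENATE("https://jira.example.com/issues/?jql=project%3D", A2), IMPORTXML(CONCATENATE("https://jira.example.com/issues/?jql=project%3D", A2), "//span[@class='results-count-total']"))

Note that IMPORTXML cannot run JavaScript, so this only works if the count is present in the page's static HTML.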

How to search the internet for pages containing specified terms and store the results in a data table, from within R, using OpenSearch

I am setting up a database of certain events that have occurred in the past, and I need to search the internet for a number of terms to retrieve as many pages as possible that contain terms related to the happenings I want to document.
First I looked into achieving this using Google's "Custom Search API", after reading this question:
Need to access Google Custom search api through R
I did manage to get a JSON of search results through the browser, but not through R, so I moved on.
When I saw that the Custom Search API was using OpenSearch, and found the rOpenSearch package for R, I wanted to try going down this path:
http://terradue.github.io/rOpenSearch/
After reading through the documentation, I found that it only provides examples of searching sites that supply OpenSearch descriptions. As I need to search as many websites as possible, it seems like I would need an OpenSearch description for a search engine like Google, but I can't seem to find that anywhere.
Is there any way to search the internet via R using OpenSearch and collect the results in a data table?
If you know of a better solution to my problem, I'd appreciate if you could point me in another direction.
If I read you correctly, you are looking for something called web scraping via R.
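For instance, a minimal rvest sketch in that spirit could collect the results for each term into a data frame. The search URL pattern and CSS selectors here are purely illustrative, and you should make sure the engine or sites you query actually allow automated access.

library(rvest)

terms <- c("term one", "term two")

fetch_results <- function(term) {
  url   <- paste0("https://example-search.com/results?q=", URLencode(term))
  page  <- read_html(url)
  links <- html_elements(page, ".result a")  # placeholder selector
  data.frame(
    term  = term,
    title = html_text(links, trim = TRUE),
    link  = html_attr(links, "href"),
    stringsAsFactors = FALSE
  )
}

results <- do.call(rbind, lapply(terms, fetch_results))
# data.table::as.data.table(results) if you prefer a data.table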

Converting Excel math to SQL in VB.NET, ASP.NET web application

I am trying to automate a process that is currently done mainly with Excel files. These files have been used for a while and are customized just how the users like them. I am turning this into a data-driven VB.NET application and am now at the task of configuring all the computed columns to do the calculations the users' Excel spreadsheets currently perform.
The main ones I need but can't find information on are STANDARDIZE, PERCENTRANK and STDEVA (at least for computed columns; I have seen STDEVA used in select queries).
Excuse me if there is documentation on this I can refer to; I searched Google and Stack Overflow and wasn't able to find anything. If you could point me to any documentation like this that might exist, that would be a huge help!

Import.io - Can it replace Kimonolabs

I currently use Kimonolabs for scraping data from websites that share the same goal. To keep it simple, let's say these websites are online shops (actually they are job websites with online application possibilities, but technically they look a lot like webshops).
This works great. For each website a scraper API is created that goes through the available advanced search page to crawl all product URLs. Let's call this API the 'URL list'. Then a 'product API' is created for the product detail page that scrapes all necessary elements, e.g. the title, product text, and specs like the brand, category, etc. The product API is set to crawl daily using all the URLs gathered in the 'URL list'.
Then the gathered information for all products is fetched from the Kimonolabs JSON endpoint by our own service.
However, Kimonolabs will shut down its service at the end of February 2016 :-(. So, I'm looking for an easy alternative. I've been looking at import.io, but I'm wondering:
Does it support automatic updates (letting the API scrape hourly/daily/etc.)?
Does it support fetching all product URLs from a paginated advanced search page?
I'm tinkering around with the service. Basically, it seems to extract data via the same easy process as Kimonolabs. It's just unclear to me whether the pagination of URLs needed for the product API, and automatically keeping it up to date, are supported.
Are there any import.io users here who can advise whether import.io is a useful alternative for this? Maybe even give some pointers in the right direction?
Look into Portia. It's an open source visual scraping tool that works like Kimono.
Portia is also available as a service and it fulfills the requirements you have for import.io:
automatic updates, by scheduling periodic jobs to crawl the pages you want, keeping your data up-to-date.
navigation through pagination links, based on URL patterns that you can define.
Full disclosure: I work at Scrapinghub, the lead maintainer of Portia.
Maybe you want to give Extracty a try. It's a free web scraping tool that allows you to create endpoints that extract any information and return it in JSON. It can easily handle paginated searches.
If you know a bit of JS you can write CasperJS endpoints and integrate any logic you need to extract your data. It has a similar goal to Kimonolabs and can solve the same problems (if not more, since it's programmable).
If Extracty does not meet your needs, you can check out these other market players that aim for similar goals:
Import.io (as you already mentioned)
Mozenda
Cloudscrape
TrooclickAPI
FiveFilters
Disclaimer: I am a co-founder of the company behind Extracty.
I'm not that fond of Import.io, but it seems to me it allows pagination through bulk input URLs. Read here.
So far there has not been much progress in getting a whole website through the API:
Chain more than one API/Dataset: It is currently not possible to fully automate the extraction of a whole website with Chain API.
For example, if I want data that is found within category pages or paginated lists, I first have to create a list of URLs, run Bulk Extract, save the result as an import data set, and then chain it to another Extractor. Once set up, I would like to be able to do this more automatically, in one click.
P.S. If you are somewhat familiar with JS you might find this useful.
Regarding automatic updates:
This is a beta feature right now. I'm testing it for myself after migrating from Kimonolabs... You can enable it for your own APIs by appending &bulkSchedule=1 to your API URL. Then you will see a "Schedule" tab. In the "Configure" tab select "Bulk Extract" and add your URLs; after this, the scheduler will run daily or weekly.

How to automate the process of downloading an image based on its name using web scraping?

I am learning the basics of web scraping. I would like to automate the following process:
Go to a site for example: http://www.vesseltracker.com
Provide the vessel name or MMSI number
Download the image of the vessel
Repeat this process
I followed the Get all Images from WebPage Program | Java link to get the basics, but with little success. Could anyone provide me with an example in Java?
Thanks a lot in advance
Either of the examples on the Stack Overflow question that you referenced should work for downloading images. Simply replace the System.out.println line with Ben Noland's saveUrl function from this Stack Overflow question.
If you want a specific image, you'll need to determine how you want to filter it out from the others. Perhaps only save the images that contain /vessel/ in the URL?
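In case a fuller illustration helps, here is a rough, self-contained sketch of that approach using the jsoup library (my choice here, not necessarily what the linked example uses); the page URL and the /vessel/ filter are illustrative assumptions only:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class VesselImageDownloader {
    public static void main(String[] args) throws Exception {
        // Placeholder URL: in practice you would build this from the vessel
        // name or MMSI number via the site's search.
        String pageUrl = "http://www.vesseltracker.com/some-vessel-page";
        Document doc = Jsoup.connect(pageUrl).get();
        for (Element img : doc.select("img")) {
            String src = img.absUrl("src");      // resolve relative image URLs
            if (src.contains("/vessel/")) {      // illustrative filter
                saveImage(src, src.substring(src.lastIndexOf('/') + 1));
            }
        }
    }

    // Stream one image URL straight to a local file
    static void saveImage(String imageUrl, String targetFile) throws Exception {
        try (InputStream in = new URL(imageUrl).openStream()) {
            Files.copy(in, Paths.get(targetFile), StandardCopyOption.REPLACE_EXISTING);
        }
    }
}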
