Scraping Google public data - web-scraping

I'm interested in scraping this dataset from Google Public data:
https://www.google.com/publicdata/explore?ds=gb66jodhlsaab_#!ctype=l&strail=false&bcs=d&nselm=h&met_y=Capital_St&scale_y=lin&ind_y=false&rdim=state&idim=state:AL:AK:AZ:AR:CO:CA:CT:DE:DC:FL:GA:HI:IL:ID:IN:KS:KY:LA:IA:ME:MD:MA:MI:MS:MO:MN:MT:NV:NH:NE:NJ:NM:NY:NC:ND:OK:OH:OR:PA:RI:SC:SD:TX:UT:TN:VT:WA:VA:WV:WI:WY&ifdim=state&hl=en_US&dl=en_US&ind=false
Is there a way to do it, given that when I click on the link at the bottom of the page, it's not obvious where the data are stored? (And Google Public Data doesn't allow downloads?)

There's a link at the bottom of the page that leads to the dataset's original source. They have the data available in Excel, so it might be what you're looking for.
https://www.fhwa.dot.gov/policyinformation/index.cfm
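If all you need is the underlying figures, one option is to pull one of those Excel files straight into R rather than scraping the chart page. This is only a rough sketch: the .xlsx URL below is a placeholder you would replace with an actual file link found on the FHWA page, and it assumes the readxl package is installed.
library(readxl)
# Placeholder URL -- substitute the real Excel file link from the FHWA page
xlsx.url  <- "https://www.fhwa.dot.gov/policyinformation/statistics/example_table.xlsx"
local.xls <- tempfile(fileext = ".xlsx")
# Excel files are binary, so mode = "wb" matters on Windows
download.file(xlsx.url, destfile = local.xls, mode = "wb")
capital.outlay <- read_excel(local.xls, sheet = 1)  # first sheet into a data frame
head(capital.outlay)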

Related

Issue scraping financial data via xpath + tables

I'm trying to build a stock analysis spreadsheet in Google Sheets, using the IMPORTXML function with absolute XPath expressions and the IMPORTHTML function with tables to scrape financial data from the www.morningstar.co.uk key ratios pages for the companies I like to keep an eye on.
Example: https://tools.morningstar.co.uk/uk/stockreport/default.aspx?tab=10&vw=kr&SecurityToken=0P00007O1V%5D3%5D0%5DE0WWE%24%24ALL&Id=0P00007O1V&ClientFund=0&CurrencyId=BAS
=importxml(N9,"/html/body/div[2]/div[2]/form/div[4]/div/div[1]/div/div[3]/div[2]/div[2]/div/div[2]/table/tbody/tr/td[3]")
=INDEX(IMPORTHTML(N9","table",12),3,2)
N9 being the cell containing the URL to the data source
I'm mainly using Morningstar as my data source because of the wealth of free information, but the links keep breaking: either the URL changes slightly or the XPath hierarchy is altered.
From what I've read so far, I'm guessing that busy websites like these are dynamic and change often, which is why my static references keep breaking.
Is anyone able to suggest a solution, or confirm whether CSS selectors would be a more stable/reliable way of retrieving the data?
Many thanks in advance
I've tried both short and long XPath expressions (copied from the Chrome dev tools) and have repeatedly changed the URL to repair the link to the data source, but it keeps breaking again shortly afterwards and I'm unable to retrieve any information.
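As a point of comparison outside Sheets, the same idea can be tried in R with CSS selectors via rvest, which tend to survive layout changes better than absolute XPaths because they hang off a class or id rather than the full div hierarchy. This is only an illustrative sketch: the "table.keyRatiosTable" selector is hypothetical, so you would need to inspect the page and substitute whatever class or id Morningstar actually uses, and it assumes the key ratios table is present in the served HTML rather than rendered by JavaScript.
library(rvest)
url  <- "https://tools.morningstar.co.uk/uk/stockreport/default.aspx?tab=10&vw=kr&SecurityToken=0P00007O1V%5D3%5D0%5DE0WWE%24%24ALL&Id=0P00007O1V&ClientFund=0&CurrencyId=BAS"
page <- read_html(url)
# CSS selector: keeps working as long as the class/id survives a redesign
ratios <- html_table(html_element(page, "table.keyRatiosTable"))
# Absolute XPath: breaks as soon as any ancestor div is added or removed
cell <- html_text(html_element(page, xpath = "/html/body/div[2]/div[2]/form/div[4]/div/div[1]/div/div[3]/div[2]/div[2]/div/div[2]/table/tbody/tr/td[3]"))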

How to use URLs extracted from a website as data source for another table in Power BI

I have a situation where I need to extract tables from 13 different links, all with the same structure, and then append them into a single table with all the data. At first I extracted the links from a home page by copying each hyperlink and then imported the data through the Web connector in Power BI. However, three months later I realized that those links change every quarter, while the homepage link where they are listed stays the same.
So I did some research and found this video on YouTube (https://www.youtube.com/watch?v=oxglJL0VWOI), which explains how to scrape the links from a website by building a table with the link's header text as one column and the link itself as another. That way the links are updated automatically whenever I refresh the data.
The thing is, I'm having trouble figuring out how I can use these links to extract the data automatically, without having to copy them one by one and import the data through the Power BI Web connector (Web.BrowserContents). Can anyone give me a hint on how to implement this?
Thanks in advance!

Seeing cost data in analytics reports

I've created a Custom Data Uploader script which uploads data to my Google Analytics profile.
I can see it's working and that it's uploading the file; it appears on the "Custom Definitions" tab of the Profile page (second picture on the link I attached).
But I can't see the data in the reports.
I tried looking under Traffic Sources -> Overview, where I thought it should be.
Where can I find this data in the reports?
https://developers.google.com/analytics/devguides/platform/features/cost-data-import
The Traffic Sources > Cost Analysis report should contain your data, but it can take 12 hours for the data on a new feed to show up. However, I've found that subsequent loads are usually much faster.

rcurl & innerHTML/innertext (scraping google trends with R)

I've used rcurl a fair bit for simple text retrieval and simple scraping, but I'm stumped with google trends. Let's use obama & romney as an example. If you append "&export=1", google trends returns a page displaying the data underlying the graph.
http://www.google.com/trends/explore?q=obama%2C+romney#q=obama%2C%20romney&export=1
On that page, the data lives in the reportContent div, which you can examine by inspecting the element:
<div id="reportContent" class="report-content"> </div>
More specifically, it is tucked away in the innerHTML and innerText properties associated with that div. I've never seen this before and am wondering how to access that data with RCurl. I'm also curious, if anyone happens to know, why Google does not just present the data in plain HTML. I'll admit I'm not very knowledgeable; I'm reading as much as I can about it, but what I have found out about the innerText property (not much) is not particularly illuminating or helpful in modifying my RCurl script.
You have to log in to Google in order to get data for multiple trends; otherwise it is easy to get blocked. Google may consider several factors when blocking you, e.g. IP address, Google account, device type, and whether you look like a machine or a human.
I provide an online Google Trends scraping service at http://www.datadriver.info/scrapdata/?case_task_id=b333f048be31cad3922f1c8c919700f860f5adbe. Using this service, you won't run into the tiresome "You have reached your quota limit. Please try again later." problem.
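For what it's worth, here is roughly what a plain RCurl + XML attempt looks like, and why it comes back empty: the reportContent div appears to be filled in client-side by JavaScript after the page loads (which is why the numbers only show up in innerHTML/innerText in the inspector), so the raw HTML that RCurl retrieves does not contain them. This is just a sketch; the login/cookie handling Google expects for the export is not shown.
library(RCurl)
library(XML)
url  <- "http://www.google.com/trends/explore?q=obama%2C+romney"
html <- getURL(url, followlocation = TRUE)
doc  <- htmlParse(html, asText = TRUE)
# The div exists in the raw source but is empty; the report is injected
# by JavaScript after the page loads, so there is nothing here to extract.
xpathSApply(doc, "//div[@id='reportContent']", xmlValue)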

Need help web scraping webpages and their links with an automated function in R

I am interested in extracting data on paranormal activity reported in the news, so that I can analyze the place and time of the appearances for any correlations. This project is just for fun, to learn and use web scraping, text extraction, and spatial and temporal correlation analysis. So please forgive me for the choice of topic; I wanted to do something interesting and challenging.
First, I found that this website has a collection of reported paranormal incidents; they have collections for 2009, 2010, 2011 and 2012.
The structure of the website goes like this: within each year they have pages 1..10, and the links look like this
for year 2009
link http://paranormal.about.com/od/paranormalgeneralinfo/tp/2009-paranormal-activity.htm
On each page they have collected the stories under headings like this
Internal structure
Paranormal Activity, Posted 03-14-09
Each of these headlines has two pages inside it, which go like this
link http://paranormal.about.com/od/paranormalgeneralinfo/a/news_090314n.htm
On each of these pages they have the actual reported stories collected under various headlines, along with links to the original websites for those stories. I am interested in collecting that reported text and extracting information on the kind of paranormal activity (ghost, demon or UFO) and the time, date and place of the incidents. I then want to analyze this data for any spatial or temporal correlations: if UFOs or ghosts are real, their movements should show some pattern or correlation in space or time. That is the long and short of the story...
I need help with scraping the text from the pages described above. Below is the code I have written to follow one page and its links down to the final text I want. Can anyone let me know whether there is a better, more efficient way to get the clean text from the final page, and how to automate collecting the text by following all 10 pages for the whole of 2009?
library(XML)
#source of paranormal news from about.com
#first page to start
#2009 - http://paranormal.about.com/od/paranormalgeneralinfo/tp/2009-paranormal-activity.htm
pn.url<-"http://paranormal.about.com/od/paranormalgeneralinfo/tp/2009-paranormal-activity.htm"
pn.html<-htmlTreeParse(pn.url,useInternalNodes=T)
pn.h3=xpathSApply(pn.html,"//h3",xmlValue)
#extracting the links of the headlines to follow to the story
pn.h3.links=xpathSApply(pn.html,"//h3/a/@href")
#Extracted the links of the Internal structure to follow ...
#Paranormal Activity, Posted 01-03-09 (following this head line)
#http://paranormal.about.com/od/paranormalgeneralinfo/a/news_090314n.htm
pn.l1.url<-pn.h3.links[1]
pn.l1.html<-htmlTreeParse(pn.l1.url,useInternalNodes=T)
pn.l1.links=xpathSApply(pn.l1.html,"//p/a/@href")
#Extracted the links of the Internal structure to follow ...
#British couple has 'black-and-white-twins' twice (following this head line)
#http://www.msnbc.msn.com/id/28471626/
pn.l1.f1.url=pn.l1.links[7]
pn.l1.f1.html=htmlTreeParse(pn.l1.f1.url,useInternalNodes=T)
pn.l1.f1.text=xpathSApply(pn.l1.f1.html,"//text()[not(ancestor::script)][not(ancestor::style)]",xmlValue)
I sincerely thank you in advance for reading my post and for taking the time to help me.
I will be grateful to any expert who would like to mentor me through this project.
Regards
Sathish
Try the Scrapy and BeautifulSoup libraries. Although they are Python based, they are considered among the best in the scraping domain. You can use the command line interface to connect the two; for more details about connecting R and Python, have a look here.
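Staying with the R/XML approach from the question, here is a rough, untested sketch of how the per-headline steps could be wrapped into a loop over every h3 link on the 2009 index page. It assumes the page structure described above (h3 headlines whose links lead to story pages, which in turn link out to the original articles); extending it to the other index pages is just a matter of looping over their URLs as well.
library(XML)
get.clean.text <- function(url) {
  doc <- htmlTreeParse(url, useInternalNodes = TRUE)
  # all visible text, skipping script and style blocks
  txt <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)]", xmlValue)
  paste(trimws(txt[nzchar(trimws(txt))]), collapse = " ")
}
index.url  <- "http://paranormal.about.com/od/paranormalgeneralinfo/tp/2009-paranormal-activity.htm"
index.html <- htmlTreeParse(index.url, useInternalNodes = TRUE)
headline.links <- xpathSApply(index.html, "//h3/a/@href")
# follow every headline page and collect the text of each story it links to
stories <- lapply(headline.links, function(h3.url) {
  h3.html    <- htmlTreeParse(h3.url, useInternalNodes = TRUE)
  story.urls <- xpathSApply(h3.html, "//p/a/@href")
  lapply(story.urls, function(u) try(get.clean.text(u), silent = TRUE))
})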
