I am looking to write a loop in R that goes through every game's boxscore data on the NFL statistics website here: http://www.pro-football-reference.com/years/2012/games.htm
At the moment I have to manually click on the "boxscore" link for every game every week; is there any way to automate this in R? My code works with the full play-by-play dataset within each link, and doing this by hand is taking me ages!
Web scraping may be against the terms of use of some websites. The enforceability of these terms is unclear. While outright duplication of original expression will in many cases be illegal, in the United States the courts ruled in Feist Publications v. Rural Telephone Service that duplication of facts is allowable.
require(RCurl)
require(XML)
# fetch the schedule page and pull out every link whose href contains "boxscore"
bdata <- getURL('http://www.pro-football-reference.com/years/2012/games.htm')
bdata <- htmlParse(bdata)
boxdata <- xpathSApply(bdata, '//a[contains(@href,"boxscore")]', xmlAttrs)[-1]
The above will get the boxscore URL stems for the various games.
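From there, a minimal sketch of the loop itself might look like the following. It uses readHTMLTable to pull every table from each boxscore page; exactly which of those tables holds the play-by-play data is an assumption you will need to check against the pages themselves.
# loop over every boxscore link, fetch the page, and extract its tables
base.url <- "http://www.pro-football-reference.com"

all.games <- lapply(boxdata, function(href) {
  page <- htmlParse(getURL(paste0(base.url, href)))
  Sys.sleep(1)  # be polite to the server
  readHTMLTable(page, stringsAsFactors = FALSE)
})
names(all.games) <- boxdata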
Novice R user here. I'm looking to scrape a large amount of data on daily streaming volumes for songs on Spotify's Top 200 charts for a research project I am involved with. Basically, I would like to write a script that scrapes all the info for tracks in the top 200 on a given day, such as today's chart, and have this done for every day over a number of years, across a number of countries. I used code from a guide that I had previously followed to successfully scrape this data, but it is now not working for me.
I previously followed this guide pretty much word for word. While it originally worked, it now returns an empty tibble. I suspect the problem may be that Spotify have re-developed their charts site since my last attempt. The site is different in appearance, but more importantly the HTML node names appear to be different as well. My hunch is that this is what is causing the issue.
However, I am not at all sure if this is the case. Would appreciate it greatly if I could have some guidance on what I would need to do differently to achieve my aims, and whether it is indeed still possible to scrape these charts.
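A bare-bones check along these lines might show whether the chart table is even present in the delivered HTML at all (the URL and selector below are placeholders rather than the exact ones from the guide); if a selector that used to work now returns an empty set, the data is most likely rendered by JavaScript or fetched from an API instead of sitting in the raw page.
library(rvest)
chart_url <- "https://charts.spotify.com/charts/view/regional-global-daily/latest"  # placeholder URL
page <- read_html(chart_url)
# compare what an old-style selector finds against the raw HTML that is actually delivered
html_elements(page, "table")
html_text(html_element(page, "body"))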
Cheers
I am trying to extract the business descriptions of multiple firms from their 10-K reports using the R package edgar. I am using the getBusinDescr function to do so. However, I am only able to extract Item 1 (the business description) together with Item 1A (the risk factors). Does anybody know how to modify the code of the function "getBusinDescr" so that it retrieves only Item 1? The parsing somehow has to end at "Item 1A. Risk Factors".
I have been working on SEC filings for a while now for my research, and my suggestion is either to develop your own scraper, which I don't advise unless you know what you are doing, or to refer to the Software Repository for Accounting and Finance from the University of Notre Dame. You can find the link here.
The full set of 10-K filings has already been downloaded there, under the name Stage One Data Parser. The full dataset is a bit heavy, but it is already in plain text, so no hassle there. The only thing you need to do is define some regular expressions to heuristically find the beginning and end of Item 1 and Item 1A of the report.
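As a rough sketch of that regex step, assuming you have one filing loaded as plain text in a single string and that the section runs from the "Item 1. Business" heading up to "Item 1A. Risk Factors":
# filing_text: one plain-text 10-K as a single string, e.g.
# filing_text <- paste(readLines("some_filing.txt"), collapse = "\n")
extract_item1 <- function(filing_text) {
  pattern <- "(?is)item\\s*1\\s*[.:-]?\\s*business.*?(?=item\\s*1A\\s*[.:-]?\\s*risk\\s*factors)"
  hits <- regmatches(filing_text, gregexpr(pattern, filing_text, perl = TRUE))[[1]]
  if (length(hits) == 0) return(NA_character_)
  # the headings usually also appear in the table of contents, so keep the
  # longest match, which is normally the section itself rather than the TOC entry
  hits[which.max(nchar(hits))]
}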
Feel free to reach out to me for more.
I participated in a hackathon in my city, and the traffic department made public a dataset with more than 250 thousand traffic accident datapoints, each one containing latitude, longitude, type of accident, vehicles involved, etc.
I made a test to display the data using Google Maps API and Google Fusion Tables, but the usage limits were quickly reached with the first two years of a total of 13 years of records.
The data for two years can be displayed and filtered here.
So my question is:
Which free online services could I use in order to interactively display and filter 250 thousand such datapoints as map layers?
It is important that the service be free, because we are volunteering our time for non-profit public good. Currently our City Hall is implementing an API, but it is not ready yet, and it would be useful to show them some well-accepted use cases to apply some political pressure for further API development on THEIR server (especially remotely querying a database instead of crawling a bunch of .csv files as it is now...)
An alternative would be to put everything on GitHub and load the whole dataset client-side to be manipulated with D3.js, for example, but that seems very inefficient both for the client/user and for the server.
Thanks for reading, and feel free to re-tag if needed.
You need Google Maps API for Business to achieve what you want, but it costs a lot of money.
However, in some cases you can get this Business licence if you work for a non-profit organization. I can't find the exact rules for being eligible for this free licence; I tried googling them but couldn't find anything. I only found this link; take a look and see if it answers your problem.
You should be able to do that with Google Fusion Tables. The limit is 100,000 points per table, but you can overlay 5 layers onto a single map so in effect you can reach 500,000 points. I implemented the website below and have run it with over 200,000 points.
http://www.skyscan.co.uk/mapsearch.html
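As a rough illustration of working within that 100,000-points-per-table limit, the dataset could be split into chunks before upload, with each chunk becoming its own table/layer; the file and column handling below is only an assumption about how the accident CSVs are laid out.
# split ~250k accident records into chunks of at most 100,000 rows,
# one chunk per Fusion Table / map layer
accidents <- read.csv("accidents.csv", stringsAsFactors = FALSE)  # assumed filename

chunk.size <- 100000
chunk.id   <- ceiling(seq_len(nrow(accidents)) / chunk.size)

for (i in unique(chunk.id)) {
  write.csv(accidents[chunk.id == i, ],
            sprintf("accidents_part%02d.csv", i),
            row.names = FALSE)
}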
I am interested in extracting data on paranormal activity reported in the news, so that I can analyze the place and time of each appearance for any correlations. This project is just for fun, to learn and use web scraping, text extraction, and spatial and temporal correlation analysis. So please forgive me for choosing this topic; I wanted to do something interesting and challenging.
First, I found that this website has a collection of reported paranormal incidents for 2009, 2010, 2011 and 2012.
The structure of the website is as follows: each year has pages 1..10, and the links look like this
for year 2009:
link http://paranormal.about.com/od/paranormalgeneralinfo/tp/2009-paranormal-activity.htm
On each page, the stories are collected under headings like this:
Internal structure
Paranormal Activity, Posted 03-14-09
Each of these headlines has two pages inside it; the links look like this:
link http://paranormal.about.com/od/paranormalgeneralinfo/a/news_090314n.htm
On each of these pages they have the actual reported stories collected under various headlines, plus the links to the original websites for those stories. I am interested in collecting that reported text and extracting information about the kind of paranormal activity (ghost, demon or UFO) and the time, date and place of the incidents. I wish to analyze this data for any spatial and temporal correlations. If UFOs or ghosts are real, they must show some behaviour and correlations in space or time in their movements. That is the long and short of the story...
I need help with scraping the text from the pages described above. Below is the code I wrote to follow one page and its links down to the final text I want. Can anyone tell me whether there is a better and more efficient way to get the clean text from the final page, and how to automate collecting the text by following all 10 pages for the whole of 2009?
library(XML)
#source of paranormal news from about.com
#first page to start
#2009 - http://paranormal.about.com/od/paranormalgeneralinfo/tp/2009-paranormal-activity.htm
pn.url<-"http://paranormal.about.com/od/paranormalgeneralinfo/tp/2009-paranormal-activity.htm"
pn.html<-htmlTreeParse(pn.url,useInternalNodes=T)
pn.h3=xpathSApply(pn.html,"//h3",xmlValue)
#extracting the links of the headlines to follow to the story
pn.h3.links=xpathSApply(pn.html,"//h3/a/#href")
#Extracted the links of the Internal structure to follow ...
#Paranormal Activity, Posted 01-03-09 (following this head line)
#http://paranormal.about.com/od/paranormalgeneralinfo/a/news_090314n.htm
pn.l1.url<-pn.h3.links[1]
pn.l1.html<-htmlTreeParse(pn.l1.url,useInternalNodes=T)
pn.l1.links=xpathSApply(pn.l1.html,"//p/a/#href")
#Extracted the links of the Internal structure to follow ...
#British couple has 'black-and-white-twins' twice (following this head line)
#http://www.msnbc.msn.com/id/28471626/
pn.l1.f1.url=pn.l1.links[7]
pn.l1.f1.html=htmlTreeParse(pn.l1.f1.url,useInternalNodes=T)
pn.l1.f1.text=xpathSApply(pn.l1.f1.html,"//text()[not(ancestor::script)][not(ancestor::style)]",xmlValue)
I sincerely thank you in advance for reading my post and for your time in helping me.
I will be grateful to any expert who would like to mentor me on this whole project.
Regards
Sathish
Try the Scrapy and BeautifulSoup libraries. Although they are Python based, they are considered among the best in the scraping domain. You can use the command line interface to connect the two; for more details about connecting R and Python, have a look here.
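If you would rather stay in R, the steps already shown in the question can simply be wrapped in loops. A minimal sketch, assuming the page structure described above still holds:
library(XML)

# pull the visible text of a page, skipping script and style nodes
get_page_text <- function(url) {
  html <- tryCatch(htmlTreeParse(url, useInternalNodes = TRUE),
                   error = function(e) NULL)
  if (is.null(html)) return(NA_character_)
  txt <- xpathSApply(html,
                     "//text()[not(ancestor::script)][not(ancestor::style)]",
                     xmlValue)
  paste(trimws(txt), collapse = " ")
}

# start from the 2009 index page, follow every headline link,
# then every story link inside each headline page
pn.url  <- "http://paranormal.about.com/od/paranormalgeneralinfo/tp/2009-paranormal-activity.htm"
pn.html <- htmlTreeParse(pn.url, useInternalNodes = TRUE)
headline.links <- xpathSApply(pn.html, "//h3/a/@href")

stories <- lapply(headline.links, function(link) {
  page        <- htmlTreeParse(link, useInternalNodes = TRUE)
  story.links <- xpathSApply(page, "//p/a/@href")
  Sys.sleep(1)  # be polite to the servers
  sapply(story.links, get_page_text)
})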
Suppose I want to regress Gross Profit on Total Revenue in R. I need data for this, and the more, the better.
There is a package on CRAN that I find very useful, quantmod, which does what I need.
library(quantmod)
# auto.assign = TRUE (the default) creates an object called AMD.f in the workspace
getFinancials(Symbol = "AMD", src = "google")
# to see the row names of the annual income statement: rownames(AMD.f$IS$A)
Total.Revenue <- AMD.f$IS$A["Revenue", ]
Gross.Profit  <- AMD.f$IS$A["Gross Profit", ]
# finally:
reg1 <- lm(Gross.Profit ~ Total.Revenue)
The biggest issue that I have is that this library gets me data only for 4 years (4 observations, and who runs a regression with only 4 observations???). Is there any other way (maybe other libraries) that would get data for MORE than 4 years?
I agree that this is not an R programming question, but I'm going to make a few comments anyway before this question is (likely) closed.
It boils down to this: getting reliable fundamental data across sectors and markets is difficult enough even if you have money to spend. If you are looking at the US then there are a number of options, but all the major (read 'relatively reliable') providers require thousands of dollars per month - FactSet, Bloomberg, Datastream and so on. For what it's worth, for working with fundamental data I prefer and use FactSet.
Generally speaking, because the Excel tools offered by each provider are more mature, I have found it easier to populate spreadsheets with the data and then read the data into R. Then again, I typically deal with the fundamentals of a few dozen companies at most, because once you move out of the domain of your "known" companies the time it takes to check anomalies increases exponentially.
There are numerous potential "gotchas". The most obvious is that definitions vary from sector to sector. "Sales" for an industrial company is very different from "sales" for a bank, for example. Another problem is changes in definitions. Pretty much every year some accounting regulation or other changes and breaks your data series. Last year minorities were reported here, but this year this item is moved to another position in the P&L and so on.
Another problem is companies themselves changing. How does one deal with mergers, acquisitions and spin-offs, for example? This sort of thing can make measuring organic sales growth next to impossible. Yet another point to bear in mind is that if you're dealing with operating or net profit, you have to consider exceptionals and whether to adjust for them.
Dealing with companies outside the US adds a whole bunch of further problems. Of course, the major data providers try to standardise globally (FactSet Fundamentals for example). This just adds another layer of abstraction and typically it is hard to check to see how the data has been manipulated.
In short, getting the data is onerous and I know of no reliable free sources. Unless you're dealing with the simplest items for a very homogenous group of companies, this is a can of worms even if you do have the data.