Using read_html in R to get Russell 3000 holdings?

Is there a way to automatically pull the Russell 3000 holdings from the iShares website in R using read_html (or other rvest functions)?
url: https://www.ishares.com/us/products/239714/ishares-russell-3000-etf
(all holdings in the table at the bottom, not just the top 10)
So far I have had to copy and paste into an Excel document, save it as a CSV, and use read_csv to create a tibble in R of the ticker, company name, and sector.
I have used read_html to pull the S&P 500 holdings from Wikipedia, but I can't figure out the path I need to supply to have R automatically pull from the iShares website (and I haven't found other reputable websites with all ~3,000 holdings). Here is the code I used for the S&P 500:
read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")%>%
html_node("table.wikitable")%>%
html_table()%>%
select('Symbol','Security','GICS Sector','GICS Sub Industry')%>%
as_tibble()
First post, sorry if it is hard to follow...
Any help would be much appreciated
Michael

IMPORTANT
According to the Terms & Conditions listed on BlackRock's website:
Use any robot, spider, intelligent agent, other automatic device, or manual process to search, monitor or copy this Website or the reports, data, information, content, software, products services, or other materials on, generated by or obtained from this Website, whether through links or otherwise (collectively, "Materials"), without BlackRock's permission, provided that generally available third-party web browsers may be used without such permission;
I suggest you make sure you are abiding by those terms before pulling their data programmatically. For educational purposes only, here is how the data could be obtained:
First you need to get at the actual data (not the interactive JavaScript). How familiar are you with the developer tools in your browser? If you navigate through the website and track the network traffic, you will notice a large AJAX request:
https://www.ishares.com/us/products/239714/ishares-russell-3000-etf/1467271812596.ajax?tab=all&fileType=json
This endpoint returns everything you need (all holdings). After locating it, the rest is just cleaning the data. Example:
library(jsonlite)

# locate the raw data by searching the Network traffic in your browser's developer tools
url <- "https://www.ishares.com/us/products/239714/ishares-russell-3000-etf/1467271812596.ajax?tab=all&fileType=json"

# pull the data in via fromJSON
x <- jsonlite::fromJSON(url, flatten = TRUE)
# > Large list (10.4 Mb)

# use a combination of `lapply` and `rapply` to unlist, structuring the results as one large list
y <- lapply(rapply(x, enquote, how = "unlist"), eval)
# > Large list (50677 elements, 6.9 Mb)
y1 <- y[1:15]
> str(y1)
List of 15
$ aaData1 : chr "MSFT"
$ aaData2 : chr "MICROSOFT CORP"
$ aaData3 : chr "Equity"
$ aaData.display: chr "2.95"
$ aaData.raw : num 2.95
$ aaData.display: chr "109.41"
$ aaData.raw : num 109
$ aaData.display: chr "2,615,449.00"
$ aaData.raw : int 2615449
$ aaData.display: chr "$286,156,275.09"
$ aaData.raw : num 2.86e+08
$ aaData.display: chr "286,156,275.09"
$ aaData.raw : num 2.86e+08
$ aaData14 : chr "Information Technology"
$ aaData15 : chr "2588173"
Updated: in case you are unable to clean the data, here you are:
testdf <- data.frame(matrix(unlist(y), nrow = 50677, byrow = TRUE), stringsAsFactors = FALSE)
# where we want to break the data frame (every nth row)
breaks <- 17
# number of rows in the full data frame
nbr.row <- nrow(testdf)
repeats <- rep(1:ceiling(nbr.row / breaks), each = breaks)[1:nbr.row]
# split the data frame for clean-up
newDF <- split(testdf, repeats)
Result:
> str(head(newDF))
List of 6
$ 1:'data.frame': 17 obs. of 1 variable:
..$ matrix.unlist.y...nrow...50677..byrow...T.: chr [1:17] "MSFT" "MICROSOFT CORP" "Equity" "2.95" ...
$ 2:'data.frame': 17 obs. of 1 variable:
..$ matrix.unlist.y...nrow...50677..byrow...T.: chr [1:17] "AAPL" "APPLE INC" "Equity" "2.89" ...
$ 3:'data.frame': 17 obs. of 1 variable:
..$ matrix.unlist.y...nrow...50677..byrow...T.: chr [1:17] "AMZN" "AMAZON COM INC" "Equity" "2.34" ...
$ 4:'data.frame': 17 obs. of 1 variable:
..$ matrix.unlist.y...nrow...50677..byrow...T.: chr [1:17] "BRKB" "BERKSHIRE HATHAWAY INC CLASS B" "Equity" "1.42" ...
$ 5:'data.frame': 17 obs. of 1 variable:
..$ matrix.unlist.y...nrow...50677..byrow...T.: chr [1:17] "FB" "FACEBOOK CLASS A INC" "Equity" "1.35" ...
$ 6:'data.frame': 17 obs. of 1 variable:
..$ matrix.unlist.y...nrow...50677..byrow...T.: chr [1:17] "JNJ" "JOHNSON & JOHNSON" "Equity" "1.29" ...
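If you want a single rectangular table rather than a list of chunks, here is a minimal sketch that binds the 17-field chunks back together, one row per holding. The positions of the ticker, company, asset class, and sector fields are assumed from the str() output above, so verify them against the live feed:
# bind each 17-field chunk into one row
holdings <- do.call(rbind, lapply(newDF, function(d) {
  setNames(as.data.frame(t(d[[1]]), stringsAsFactors = FALSE), paste0("V", 1:17))
}))
# fields 1-3 and 14 appear to be ticker, company, asset class, and sector
names(holdings)[c(1, 2, 3, 14)] <- c("ticker", "company", "asset_class", "sector")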

Related

httr warning when there is no data in the desired response

I am building an API wrapper using httr. The API I'm using doesn't have content under people for the current year, but it still returns the copyright element. When I use httr::GET I still get a 200 status code since there is a response.
The response should have data similar to 2019's. How do I use httr to throw an error? Is there a warning similar to httr::warn_for_status available?
Example of a request that works and returns people, vs. one that doesn't:
library(httr)
data <- GET("https://statsapi.mlb.com/api/v1/sports/1/players?season=2019")
# response with data in content. I'll spare everyone the 1,410 rows in the JSON Response
str(content(data, type= "application/json"), list.len = 2)
#> List of 2
#> $ copyright: chr "Copyright 2020 MLB Advanced Media, L.P. Use of any content on this page acknowledges agreement to the terms po"| __truncated__
#> $ people :List of 1410
#> ..$ :List of 36
#> .. ..$ id : int 472551
#> .. ..$ fullName : chr "Fernando Abad"
#> .. .. [list output truncated]
#> ..$ :List of 35
#> .. ..$ id : int 650556
#> .. ..$ fullName : chr "Bryan Abreu"
#> .. .. [list output truncated]
#> .. [list output truncated]
no_data <- GET("https://statsapi.mlb.com/api/v1/sports/1/players?season=2020")
str(content(no_data, type= "application/json"))
#> List of 2
#> $ copyright: chr "Copyright 2020 MLB Advanced Media, L.P. Use of any content on this page acknowledges agreement to the terms po"| __truncated__
#> $ people : list()
The alternative I've used is to parse the data and then check nrow(df) < 1, but there has to be a better way.
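For reference, a minimal sketch of the kind of manual check I mean, combining stop_for_status for HTTP errors with a check on the parsed payload (the error message wording is illustrative):
library(httr)

resp <- GET("https://statsapi.mlb.com/api/v1/sports/1/players?season=2020")
stop_for_status(resp)  # errors on 4xx/5xx, but a 200 with an empty payload passes

parsed <- content(resp, type = "application/json")
# no built-in warn-on-empty exists, so raise the condition yourself
if (length(parsed$people) == 0) {
  stop("Request succeeded but returned no `people` for this season.", call. = FALSE)
}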

Converting a data frame with a list of character vectors as a column into long form

I was hoping somebody could help me with a problem I'm having with an exercise on the DataCamp "Building Web Applications with Shiny in R" course, specifically with transforming one of the datasets they use in the exercise.
I've imported their dataset (RDS) using the readRDS function and it looks like this:
$ id : int 10259 25693 20130 22213 13162 6602 42779 3735 16903 12734 ...
$ cuisine : chr "greek" "southern_us" "filipino" "indian" ...
$ ingredients:List of 39774
..$ : chr "romaine lettuce" "black olives" "grape tomatoes" "garlic" ...
..$ : chr "plain flour" "ground pepper" "salt" "tomatoes" ...
..$ : chr "eggs" "pepper" "salt" "mayonaise" ...
..$ : chr "water" "vegetable oil" "wheat" "salt"
..$ : chr "black pepper" "shallots" "cornflour" "cayenne pepper" ...
..$ : chr "plain flour" "sugar" "butter" "eggs" ...
..$ : chr "olive oil" "salt" "medium shrimp" "pepper" ...
..$ : chr "sugar" "pistachio nuts" "white almond bark" "flour" ...
..$ : chr "olive oil" "purple onion" "fresh pineapple" "pork" ...
..$ : chr "chopped tomatoes" "fresh basil" "garlic" "extra-virgin olive oil" ...
In their tutorial, they have a dataset that's been transformed so that there are three columns, id, cuisine and ingredients, but ingredients holds only one ingredient per row (meaning there are multiple rows for the same id).
Usually when I have to do something like this, I use the tidyr function gather, but that won't work in this instance as it is for gathering multiple columns, rather than splitting up a column containing character vectors of varying length. I also tried to use the separate() function, but it requires you to specify which columns you want to separate the vectors into, which I can't do as they all vary in length.
If somebody could give me an idea as to how I'd go about transforming the above dataframe so that it's longform, I'd be very grateful.
Many thanks!
Sounds like you are looking for unnest: https://tidyr.tidyverse.org/reference/unnest.html. It expands a list-column so that each element gets its own row, with the other columns repeated, which is the long form you describe (gather and spread operate on ordinary columns and won't split a list-column).
It should also be mentioned that gather and spread are no longer being updated, having been replaced by their arguably more explicit counterparts pivot_longer and pivot_wider: https://tidyr.tidyverse.org/reference/pivot_longer.html and https://tidyr.tidyverse.org/reference/pivot_wider.html. DataCamp may not have updated their courses to reflect this, however.
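For example, a minimal sketch on a toy frame shaped like the course data (the recipes object and its values are made up for illustration):
library(tidyr)
library(tibble)

recipes <- tibble(
  id = c(10259L, 25693L),
  cuisine = c("greek", "southern_us"),
  ingredients = list(
    c("romaine lettuce", "black olives", "grape tomatoes"),
    c("plain flour", "ground pepper", "salt")
  )
)
# one row per ingredient; id and cuisine are repeated down the rows
unnest(recipes, ingredients)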

How to scrape columns from a website in R?

Can anyone suggest how to read the Avg Price and Value columns in R from the given website?
I am not able to understand what is happening: with the same code I am able to read all the columns except these two.
The code I am using is:
library(rvest)
library(dplyr)

url <- "http://relationalstocks.com/showinsiders.php?date=2017-09-15&buysell=buysell"
url_html <- read_html(url)

SharesTraded_html <- html_nodes(url_html, 'td:nth-child(6)')
SharesTraded <- html_text(SharesTraded_html)
SharesTraded <- as.numeric(gsub(",", "", SharesTraded))

# note: html_node() (singular) returns only the first match; html_nodes() returns all of them
AvgPriceDollars_html <- html_node(url_html, 'td:nth-child(7)')
AvgPriceDollars <- html_text(AvgPriceDollars_html)
AvgPriceDollars
The simplest way to do that is to use html_table:
library(rvest)
library(dplyr)

page <- read_html("http://relationalstocks.com/showinsiders.php?date=2017-09-15&buysell=buysell")
tb <- page %>%
  html_node("#insidertab") %>%
  html_nodes("table") %>%
  html_table(fill = TRUE) %>%
  as.data.frame()
str(tb)
'data.frame': 253 obs. of 9 variables:
$ Reported.Time: chr "2017-09-15 21:00:47" "2017-09-15 20:11:26" "2017-09-15 20:11:26" "2017-09-15 20:10:27" ...
$ Tran. : chr "2017-09-12 Purchase" "2017-09-13 Sale" "2017-09-14 Sale" "2017-09-15 Sale" ...
$ Company : chr "Double Eagle Acquisition Corp." "PHIBRO ANIMAL HEALTH CORP" "PHIBRO ANIMAL HEALTH CORP" "Guidewire Software, Inc." ...
$ Ticker : chr "EAGL" "PAHC" "PAHC" "GWRE" ...
$ Insider : chr "SAGANSKY JEFFREYChief Executive Officer, Director, 10% owner" "Johnson Richard GChief Financial Officer" "Johnson Richard GChief Financial Officer" "Roza ScottChief Business Officer" ...
$ Shares.Traded: chr "30,000" "15,900" "39,629" "782" ...
$ Avg.Price : chr "$10.05" "$36.46" "$36.23" "$78.20" ...
$ Value : chr "$301,500" "$579,714" "$1,435,758" "$61,152" ...
$ Filing : logi NA NA NA NA NA NA ...
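If you then need Avg.Price and Value as numbers, a short follow-up sketch (column names as in the str() output above):
tb$Avg.Price <- as.numeric(gsub("[$,]", "", tb$Avg.Price))
tb$Value <- as.numeric(gsub("[$,]", "", tb$Value))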

List available WFS layers and read them into a data frame with rgdal

I have the following problem: according to different sources, it should be possible to read WFS layers in R using rgdal.
dsn<-"WFS:http://geomap.reteunitaria.piemonte.it/ws/gsareprot/rp-01/areeprotwfs/wfs_gsareprot_1?service=WFS&request=getCapabilities"
ogrListLayers(dsn)
readOGR(dsn,"SIC")
That code should 1) list the available WFS layers and 2) read a specific layer (SIC) into R as a Spatial(Points)DataFrame.
I have tried several other WFS servers as well, but it does not work.
I always get the warning:
Cannot open data source
Checking for the WFS driver, I get the following result:
> "WFS" %in% ogrDrivers()$name
[1] FALSE
Well, it looks like the WFS driver is not implemented in rgdal (anymore?). But then why are there so many examples "claiming" the opposite?
I also tried the gdalUtils package, and it works, but it returns the entire console message of ogrinfo.exe rather than only the available layers. (I guess it "just" calls ogrinfo.exe and sends the result back to R, much like using the shell or system command.)
Does anyone know what I'm doing wrong, or whether something like this is even possible with rgdal or a similar package?
You can combine the two packages to accomplish your task.
First, convert the layer you need into a local shapefile using gdalUtils. Then use rgdal as normal. NOTE: you'll see a warning message after the ogr2ogr call, but it performed the conversion fine for me. Also, ogr2ogr won't overwrite local files unless the overwrite parameter is TRUE (there are other parameters that may be of use as well).
library(gdalUtils)
library(rgdal)
dsn <- "WFS:http://geomap.reteunitaria.piemonte.it/ws/gsareprot/rp-01/areeprotwfs/wfs_gsareprot_1?service=WFS&request=getCapabilities"
ogrinfo(dsn, so=TRUE)
## [1] "Had to open data source read only."
## [2] "INFO: Open of `WFS:http://geomap.reteunitaria.piemonte.it/ws/gsareprot/rp-01/areeprotwfs/wfs_gsareprot_1?service=WFS&request=getCapabilities'"
## [3] " using driver `WFS' successful."
## [4] "1: AreeProtette"
## [5] "2: ZPS"
## [6] "3: SIC"
ogr2ogr(dsn, "sic.shp", "SIC")
sic <- readOGR("sic.shp", "sic", stringsAsFactors=FALSE)
## OGR data source with driver: ESRI Shapefile
## Source: "sic.shp", layer: "sic"
## with 128 features
## It has 23 fields
plot(sic)
str(sic@data)
## 'data.frame': 128 obs. of 23 variables:
## $ gml_id : chr "SIC.510" "SIC.472" "SIC.470" "SIC.508" ...
## $ objectid : chr "510" "472" "470" "508" ...
## $ inspire_id: chr NA NA NA NA ...
## $ codice : chr "IT1160026" "IT1160017" "IT1160018" "IT1160020" ...
## $ nome : chr "Faggete di Pamparato, Tana del Forno, Grotta delle Turbiglie e Grotte di Bossea" "Stazione di Linum narbonense" "Sorgenti del T.te Maira, Bosco di Saretto, Rocca Provenzale" "Bosco di Bagnasco" ...
## $ cod_tipo : chr "B" "B" "B" "B" ...
## $ tipo : chr "SIC" "SIC" "SIC" "SIC" ...
## $ cod_reg_bi: chr "1" "1" "1" "1" ...
## $ des_reg_bi: chr "Alpina" "Alpina" "Alpina" "Alpina" ...
## $ mese_istit: chr "11" "11" "11" "11" ...
## $ anno_istit: chr "1996" "1996" "1996" "1996" ...
## $ mese_ultmo: chr "2" NA NA NA ...
## $ anno_ultmo: chr "2002" NA NA NA ...
## $ sup_sito : chr "29396102.9972" "82819.1127" "7272687.002" "3797600.3563" ...
## $ perim_sito: chr "29261.8758" "1227.8846" "17650.289" "9081.4963" ...
## $ url1 : chr "http://gis.csi.it/parchi/schede/IT1160026.pdf" "http://gis.csi.it/parchi/schede/IT1160017.pdf" "http://gis.csi.it/parchi/schede/IT1160018.pdf" "http://gis.csi.it/parchi/schede/IT1160020.pdf" ...
## $ url2 : chr "http://gis.csi.it/parchi/carte/IT1160026.djvu" "http://gis.csi.it/parchi/carte/IT1160017.djvu" "http://gis.csi.it/parchi/carte/IT1160018.djvu" "http://gis.csi.it/parchi/carte/IT1160020.djvu" ...
## $ fk_ente : chr NA NA NA NA ...
## $ nome_ente : chr NA NA NA NA ...
## $ url3 : chr NA NA NA NA ...
## $ url4 : chr NA NA NA NA ...
## $ tipo_geome: chr "poligono" "poligono" "poligono" "poligono" ...
## $ schema : chr "Natura2000" "Natura2000" "Natura2000" "Natura2000" ...
Neither the questioner nor the answerer says how rgdal was installed. If it is a CRAN binary for Windows or OS X, it may well have a smaller set of drivers than the independent GDAL installation underlying gdalUtils. Always state your platform and whether rgdal was installed as a binary or from source, and always provide the messages displayed as rgdal loads, as well as the output of sessionInfo(), to show the platform on which you are running.
Given the possible difference in driver sets, the advice given seems reasonable.

RStudio crashes when using system(some-http-command) - anybody know a fix?

Whenever I use any sort of HTTP command via the system() function in RStudio, the rainbow circle of death appears and I have to force-quit RStudio. Up until now, I've written a bunch of checks to make sure a user isn't in RStudio before using an HTTP command (which I use a ton to access data), but it's quite a pain, and it would be fantastic to get to the root of the problem.
e.g.
system("http get http://api.bls.gov/publicAPI/v1/timeseries/data/CXUALCBEVGLB0101M")
causes RStudio to crash. Oddly, on another laptop of mine, such commands don't crash RStudio but instead produce the following error: 'sh: http: command not found', even though http is installed and works fine from the terminal.
Does anybody know how to fix this problem / why it happens? Does it occur for you too? Although I know a lot about R, I'm afraid I have no idea how to start fixing this.
Thanks!!!
Using http from the httpie package on Linux hangs RStudio (but not plain terminal R) on my Linux system (your rainbow circle implies it's a Mac?), so I'm getting the same behaviour as you.
Installing and using wget works for me:
system("wget -O /tmp/data.out http://api.bls.gov/publicAPI/v1/timeseries/data/CXUALCBEVGLB0101M")
Or you could try R's native download.file function. There are a whole bunch of other functions for getting stuff off the web - see the Web Technologies Task View: http://cran.r-project.org/web/views/WebTechnologies.html
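For example, a minimal download.file equivalent (the destination path is illustrative):
download.file("http://api.bls.gov/publicAPI/v1/timeseries/data/CXUALCBEVGLB0101M",
              destfile = "/tmp/data.out")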
I've not seen this http command used much, so maybe it's flaky. Or maybe it's opening stdin...
Yes... try this:
system("http get http://api.bls.gov/publicAPI/v1/timeseries/data/CXUALCBEVGLB0101M >/tmp/data2.out </dev/null")
I think http is opening stdin, the Unix standard input channel, and RStudio isn't sending anything to it, so it waits. If you explicitly redirect http's stdin from /dev/null, then http completes. This works for me in RStudio.
However, I still prefer wget- or curl-based solutions!
Without more contextual information regarding the RStudio version / operating system, it is hard to do more than suggest an alternative approach that avoids the use of system().
Instead you could use RCurl and getURL
library(RCurl)
getURL('http://api.bls.gov/publicAPI/v1/timeseries/data/CXUALCBEVGLB0101M')
#[1] "{\"status\":\"REQUEST_SUCCEEDED\",\"responseTime\":129,\"message\":[],\"Results\":{\n\"series\":\n[{\"seriesID\":\"CXUALCBEVGLB0101M\",\"data\":[{\"year\":\"2013\",\"period\":\"A01\",\"periodName\":\"Annual\",\"value\":\"445\",\"footnotes\":[{}]},{\"year\":\"2012\",\"period\":\"A01\",\"periodName\":\"Annual\",\"value\":\"451\",\"footnotes\":[{}]},{\"year\":\"2011\",\"period\":\"A01\",\"periodName\":\"Annual\",\"value\":\"456\",\"footnotes\":[{}]}]}]\n}}"
You could also use PUT, GET, POST, etc directly in R, abstracted from RCurl by the httr package:
library(httr)
tmp <- GET("http://api.bls.gov/publicAPI/v1/timeseries/data/CXUALCBEVGLB0101M")
dat <- content(tmp, as="parsed")
str(dat)
## List of 4
## $ status : chr "REQUEST_SUCCEEDED"
## $ responseTime: num 27
## $ message : list()
## $ Results :List of 1
## ..$ series:'data.frame': 1 obs. of 2 variables:
## .. ..$ seriesID: chr "CXUALCBEVGLB0101M"
## .. ..$ data :List of 1
## .. .. ..$ :'data.frame': 3 obs. of 5 variables:
## .. .. .. ..$ year : chr [1:3] "2013" "2012" "2011"
## .. .. .. ..$ period : chr [1:3] "A01" "A01" "A01"
## .. .. .. ..$ periodName: chr [1:3] "Annual" "Annual" "Annual"
## .. .. .. ..$ value : chr [1:3] "445" "451" "456"
## .. .. .. ..$ footnotes :List of 3
## .. .. .. .. ..$ :'data.frame': 1 obs. of 0 variables
## .. .. .. .. ..$ :'data.frame': 1 obs. of 0 variables
## .. .. .. .. ..$ :'data.frame': 1 obs. of 0 variables
