Can anyone suggest how we can read the Avg Price and Value columns in R from the given website?
I am not able to understand what is happening: with the same code I am able to read all the columns except these two.
The code I am using is:
library(rvest)
library(dplyr)
url <- "http://relationalstocks.com/showinsiders.php?date=2017-09-15&buysell=buysell"
url_html <- read_html(url)
SharesTraded_html <- html_nodes(url_html, 'td:nth-child(6)')
SharesTraded <- html_text(SharesTraded_html)
SharesTraded <- as.numeric(gsub(",", "", SharesTraded))
AvgPriceDollars_html <- html_node(url_html, 'td:nth-child(7)')
AvgPriceDollars <- html_text(AvgPriceDollars_html)
AvgPriceDollars
The simplest way to do that is to use html_table():
library(rvest)
library(dplyr)
url <- read_html("http://relationalstocks.com/showinsiders.php?date=2017-09-15&buysell=buysell")
tb <- url %>%
html_node("#insidertab") %>%
html_nodes("table") %>%
html_table(fill = TRUE) %>%
as.data.frame()
str(tb)
'data.frame': 253 obs. of 9 variables:
$ Reported.Time: chr "2017-09-15 21:00:47" "2017-09-15 20:11:26" "2017-09-15 20:11:26" "2017-09-15 20:10:27" ...
$ Tran. : chr "2017-09-12 Purchase" "2017-09-13 Sale" "2017-09-14 Sale" "2017-09-15 Sale" ...
$ Company : chr "Double Eagle Acquisition Corp." "PHIBRO ANIMAL HEALTH CORP" "PHIBRO ANIMAL HEALTH CORP" "Guidewire Software, Inc." ...
$ Ticker : chr "EAGL" "PAHC" "PAHC" "GWRE" ...
$ Insider : chr "SAGANSKY JEFFREYChief Executive Officer, Director, 10% owner" "Johnson Richard GChief Financial Officer" "Johnson Richard GChief Financial Officer" "Roza ScottChief Business Officer" ...
$ Shares.Traded: chr "30,000" "15,900" "39,629" "782" ...
$ Avg.Price : chr "$10.05" "$36.46" "$36.23" "$78.20" ...
$ Value : chr "$301,500" "$579,714" "$1,435,758" "$61,152" ...
$ Filing : logi NA NA NA NA NA NA ...
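The Avg.Price and Value columns come back as character because of the "$" and "," formatting. A quick follow-up sketch (not part of the original answer) to make them numeric:
# strip the "$" and "," formatting, then convert to numeric
tb$Avg.Price <- as.numeric(gsub("[$,]", "", tb$Avg.Price))
tb$Value     <- as.numeric(gsub("[$,]", "", tb$Value))
str(tb[, c("Avg.Price", "Value")])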
Very new to R, but I'm hoping this will be simple. I have an HTML table I scraped into R and it looks like this:
'data.frame': 238 obs. of 6 variables:
$ Facility Name : chr "Affinity Healthcare Center" "Alameda Care Center" "Alcott Rehabilitation Hospital" "Alden Terrace Convalescent Hospital" ...
$ City : chr "Paramount" "Burbank" "Los Angeles" "Los Angeles" ...
$ State : chr " CA" " CA" " CA" " CA" ...
$ Confirmed Staff : chr "26" "36" "14" "27" ...
$ Confirmed Residents: chr "29" "49" "26" "85" ...
$ Total Deaths : chr 26 36 14 27 19 3 1 7 16 3 ...
I want Confirmed Staff, Confirmed Residents and Total Deaths to be integers so I can do some math on them and sort, order, etc.
I tried this for one variable and it seemed to work ok:
tbls_ls4$`Total Deaths` <- as.integer(tbls_ls4$`Total Deaths`)
But I'd like to apply it to all three variables and not sure how to do that.
Several ways to do that:
library(data.table)
setDT(df)
cols <- c("Confirmed Staff", "Confirmed Residents", "Total Deaths")
df[, (cols) := lapply(.SD, as.integer), .SDcols = cols]
Or if you prefer base R:
df[, cols] <- lapply(cols, function(d) as.integer(df[,d]))
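For example, with a small toy data frame (hypothetical values, not from the question), the data.table approach converts all three columns in place:
library(data.table)
df <- data.frame(`Confirmed Staff`     = c("26", "36"),
                 `Confirmed Residents` = c("29", "49"),
                 `Total Deaths`        = c("3", "1"),
                 check.names = FALSE)
cols <- c("Confirmed Staff", "Confirmed Residents", "Total Deaths")
setDT(df)
df[, (cols) := lapply(.SD, as.integer), .SDcols = cols]
sapply(df, class)   # all three columns should now be integer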
You could also use mutate_at from the tidyverse (dplyr) to convert certain columns from character to integer.
library(tidyverse)
df <- df %>%
  # Indicate the column names you wish to convert inside vars()
  mutate_at(vars(`Confirmed Staff`, `Confirmed Residents`, `Total Deaths`),
            # Indicate the function to apply to each column
            as.integer)
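mutate_at has since been superseded; if you are on dplyr 1.0 or later, the same conversion can be written with across(). A sketch, assuming the same spaced column names as above:
library(dplyr)
df <- df %>%
  mutate(across(c(`Confirmed Staff`, `Confirmed Residents`, `Total Deaths`),
                as.integer))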
I was wondering if there is a way to automatically pull the Russell 3000 holdings from the iShares website in R using the read_html (or rvest) function?
url: https://www.ishares.com/us/products/239714/ishares-russell-3000-etf
(all holdings in the table on the bottom, not just top 10)
So far I have had to copy and paste into an Excel document, save as a CSV, and use read_csv to create a tibble in R of the ticker, company name, and sector.
I have used read_html to pull the S&P 500 holdings from Wikipedia, but can't seem to figure out the path I need to use to have R automatically pull from the iShares website (and there aren't other reputable websites I've found with all ~3000 holdings). Here is the code used for the S&P 500:
read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")%>%
html_node("table.wikitable")%>%
html_table()%>%
select('Symbol','Security','GICS Sector','GICS Sub Industry')%>%
as_tibble()
First post, sorry if it is hard to follow...
Any help would be much appreciated
Michael
IMPORTANT
According to the Terms & Conditions listed on BlackRock's website (here):
Use any robot, spider, intelligent agent, other automatic device, or manual process to search, monitor or copy this Website or the reports, data, information, content, software, products services, or other materials on, generated by or obtained from this Website, whether through links or otherwise (collectively, "Materials"), without BlackRock's permission, provided that generally available third-party web browsers may be used without such permission;
I suggest you ensure you are abiding by those terms before using their data. For educational purposes, here is how the data would be obtained:
First you need to get to the actual data (not the interactive JavaScript). How familiar are you with the developer tools in your browser? If you navigate through the website and track the traffic, you will notice a large AJAX request:
https://www.ishares.com/us/products/239714/ishares-russell-3000-etf/1467271812596.ajax?tab=all&fileType=json
This is the data you need (all of it). After locating it, the rest is just cleaning the data. Example:
library(jsonlite)
# Locate the raw data by searching the Network traffic:
url <- "https://www.ishares.com/us/products/239714/ishares-russell-3000-etf/1467271812596.ajax?tab=all&fileType=json"
# pull the data in via fromJSON
x <- jsonlite::fromJSON(url, flatten = TRUE)
# > Large list (10.4 Mb)
# use a combination of `lapply` and `rapply` to unlist, structuring the results as one large list
y <- lapply(rapply(x, enquote, how = "unlist"), eval)
# > Large list (50677 elements, 6.9 Mb)
y1<-y[1:15]
> str(y1)
List of 15
$ aaData1 : chr "MSFT"
$ aaData2 : chr "MICROSOFT CORP"
$ aaData3 : chr "Equity"
$ aaData.display: chr "2.95"
$ aaData.raw : num 2.95
$ aaData.display: chr "109.41"
$ aaData.raw : num 109
$ aaData.display: chr "2,615,449.00"
$ aaData.raw : int 2615449
$ aaData.display: chr "$286,156,275.09"
$ aaData.raw : num 2.86e+08
$ aaData.display: chr "286,156,275.09"
$ aaData.raw : num 2.86e+08
$ aaData14 : chr "Information Technology"
$ aaData15 : chr "2588173"
Updated: in case you are unable to clean the data, here you are:
testdf<- data.frame(matrix(unlist(y), nrow=50677, byrow=T),stringsAsFactors=FALSE)
#Where we want to break the DF at (every nth row)
breaks <- 17
#number of rows in full DF
nbr.row <- nrow(testdf)
repeats<- rep(1:ceiling(nbr.row/breaks),each=breaks)[1:nbr.row]
#split DF from clean-up
newDF <- split(testdf,repeats)
Result:
> str(head(newDF))
List of 6
$ 1:'data.frame': 17 obs. of 1 variable:
..$ matrix.unlist.y...nrow...50677..byrow...T.: chr [1:17] "MSFT" "MICROSOFT CORP" "Equity" "2.95" ...
$ 2:'data.frame': 17 obs. of 1 variable:
..$ matrix.unlist.y...nrow...50677..byrow...T.: chr [1:17] "AAPL" "APPLE INC" "Equity" "2.89" ...
$ 3:'data.frame': 17 obs. of 1 variable:
..$ matrix.unlist.y...nrow...50677..byrow...T.: chr [1:17] "AMZN" "AMAZON COM INC" "Equity" "2.34" ...
$ 4:'data.frame': 17 obs. of 1 variable:
..$ matrix.unlist.y...nrow...50677..byrow...T.: chr [1:17] "BRKB" "BERKSHIRE HATHAWAY INC CLASS B" "Equity" "1.42" ...
$ 5:'data.frame': 17 obs. of 1 variable:
..$ matrix.unlist.y...nrow...50677..byrow...T.: chr [1:17] "FB" "FACEBOOK CLASS A INC" "Equity" "1.35" ...
$ 6:'data.frame': 17 obs. of 1 variable:
..$ matrix.unlist.y...nrow...50677..byrow...T.: chr [1:17] "JNJ" "JOHNSON & JOHNSON" "Equity" "1.29" ...
I have a data frame headcount.df which looks like below:
Classes ‘data.table’ and 'data.frame': 2762 obs. of 7 variables:
$ Worker ID : chr "1693" "1812" "1822" "1695" ...
$ Job Posting Title: chr "Accountant" "Business Analyst I" "Finance Analyst II" "Business Analyst V" ...
$ State/Province : chr "Texas" "Michigan" "Heredia" "California" ...
$ Country : chr "USA" "USA" "CRI" "USA" ...
$ Worker Start Date: POSIXct, format: "2016-05-01" "2016-05-01" "2016-05-01" "2016-05-01" ...
$ Worker End Date : POSIXct, format: "2017-04-30" "2017-04-30" "2017-04-30" "2017-04-30" ...
$ Labor Type : chr "Business Professional" "Business Professional" "Business Professional" "Business Professional" ...
As a note, there may be duplicate records in here.
I am able to create a chart with ggplot using the code below:
x <- "2017-03-03"
y <- "2017-10-31"
headcountbar <- headcount.df %>%
  filter(`Worker Start Date` >= x & `Worker End Date` <= y) %>%
  group_by(`State/Province`) %>%
  summarise(Headcount = n_distinct(`Worker ID`))
ggplot(data = headcountbar, aes(x=`State/Province`,y = Headcount, fill = `State/Province` )) +
geom_bar(stat="identity",position = position_dodge())
The above code only gives me a total headcount of workers between the two dates; I would like to be able to break it down by month/quarter as well.
I would like to use shinydashboard to make this more interactive where I can select the x axis to maybe show headcount by state over time range or headcount by labor type.
I know there is a lot in here so any guidance is greatly appreciated.
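A minimal sketch of a monthly breakdown, not from the original thread, assuming lubridate is available and the column names shown in the str() output above:
library(dplyr)
library(lubridate)
library(ggplot2)
# counts each worker in the month they started; a true point-in-time headcount
# would need an interval-overlap check instead
headcount_monthly <- headcount.df %>%
  filter(`Worker Start Date` >= x & `Worker End Date` <= y) %>%
  mutate(Month = floor_date(`Worker Start Date`, unit = "month")) %>%
  group_by(Month, `State/Province`) %>%
  summarise(Headcount = n_distinct(`Worker ID`))
ggplot(headcount_monthly, aes(x = Month, y = Headcount, fill = `State/Province`)) +
  geom_col(position = position_dodge())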
I am trying to scrape wiki for certain astronomy related definitions for my project. The code works pretty well, but I am not able to avoid 404s. I tried tryCatch. I think I am missing something here.
I am looking for a way to overcome 404s while running a loop. Here is my code:
library(rvest)
library(httr)
library(XML)
library(tm)
topic<-c("Neutron star", "Black hole", "sagittarius A")
for(i in topic){
site<- paste("https://en.wikipedia.org/wiki/", i)
site <- read_html(site)
stats<- xmlValue(getNodeSet(htmlParse(site),"//p")[[1]]) #only the first paragraph
#error = function(e){NA}
stats[["topic"]] <- i
stats<- gsub('\\[.*?\\]', '', stats)
#stats<-stats[!duplicated(stats),]
#out.file <- data.frame(rbind(stats,F[i]))
output<-rbind(stats,i)
}
Build the variable urls in the loop using sprintf.
Extract all the body text from the paragraph nodes.
Remove any vectors returning length 0.
I added a step to include all of the body text, annotated by a prepended [paragraph - n] for reference... because, well, friends don't let friends waste data or make multiple HTTP requests.
Build a data frame for each iteration in your topics list in the form below, then bind all of the data frames in the list into one:
wiki_url: should be obvious
topic: from the topics list
info_summary: the first paragraph (the one you mentioned in your post)
all_info: in case you need more... ya know.
Note that I use an older, source version of rvest. For ease of understanding, I'm simply assigning the name html to what would be your read_html.
library(rvest)
library(jsonlite)
html <- rvest::read_html
wiki_base <- "https://en.wikipedia.org/wiki/%s"
my_table <- lapply(sprintf(wiki_base, topic), function(i){
raw_1 <- html_text(html_nodes(html(i),"p"))
raw_valid <- raw_1[nchar(raw_1)>0]
all_info <- lapply(1:length(raw_valid), function(i){
sprintf(' [paragraph - %d] %s ', i, raw_valid[[i]])
}) %>% paste0(collapse = "")
  data.frame(wiki_url = i,
             topic = basename(i),
             info_summary = raw_valid[[1]],
             all_info = trimws(all_info),
             stringsAsFactors = FALSE)
}) %>% rbind.pages
> str(my_table)
'data.frame': 3 obs. of 4 variables:
$ wiki_url : chr "https://en.wikipedia.org/wiki/Neutron star" "https://en.wikipedia.org/wiki/Black hole" "https://en.wikipedia.org/wiki/sagittarius A"
$ topic : chr "Neutron star" "Black hole" "sagittarius A"
$ info_summary: chr "A neutron star is the collapsed core of a large star (10–29 solar masses). Neutron stars are the smallest and densest stars kno"| __truncated__ "A black hole is a region of spacetime exhibiting such strong gravitational effects that nothing—not even particles and electrom"| __truncated__ "Sagittarius A or Sgr A is a complex radio source at the center of the Milky Way. It is located in the constellation Sagittarius"| __truncated__
$ all_info : chr " [paragraph - 1] A neutron star is the collapsed core of a large star (10–29 solar masses). Neutron stars are the smallest and "| __truncated__ " [paragraph - 1] A black hole is a region of spacetime exhibiting such strong gravitational effects that nothing—not even parti"| __truncated__ " [paragraph - 1] Sagittarius A or Sgr A is a complex radio source at the center of the Milky Way. It is located in the constell"| __truncated__
EDIT
A function for error handling... it returns a logical, so this becomes our first step.
url_works <- function(url){
  tryCatch(
    # httr::HEAD avoids downloading the page body; 200 means the url resolves
    identical(httr::status_code(httr::HEAD(url)), 200L),
    error = function(e){
      FALSE
    })
}
Based on your use of 'exoplanet', here is all of the applicable data from the wiki page:
exo_data <- (html_nodes(html('https://en.wikipedia.org/wiki/List_of_exoplanets'),'.wikitable')%>%html_table)[[2]]
str(exo_data)
'data.frame': 2048 obs. of 16 variables:
$ Name : chr "Proxima Centauri b" "KOI-1843.03" "KOI-1843.01" "KOI-1843.02" ...
$ bf : int 0 0 0 0 0 0 0 0 0 0 ...
$ Mass (Jupiter mass) : num 0.004 0.0014 NA NA 0.1419 ...
$ Radius (Jupiter radii) : num NA 0.054 0.114 0.071 1.012 ...
$ Period (days) : num 11.186 0.177 4.195 6.356 19.224 ...
$ Semi-major axis (AU) : num 0.05 0.0048 0.039 0.052 0.143 0.229 0.0271 0.053 1.33 2.1 ...
$ Ecc. : num 0.35 1.012 NA NA 0.0626 ...
$ Inc. (deg) : num NA 72 89.4 88.2 87.1 ...
$ Temp. (K) : num 234 NA NA NA 707 ...
$ Discovery method : chr "radial vel." "transit" "transit" "transit" ...
$ Disc. Year : int 2016 2012 2012 2012 2010 2010 2010 2014 2009 2005 ...
$ Distance (pc) : num 1.29 NA NA NA 650 ...
$ Host star mass (solar masses) : num 0.123 0.46 0.46 0.46 1.05 1.05 1.05 0.69 1.25 0.22 ...
$ Host star radius (solar radii): num 0.141 0.45 0.45 0.45 1.23 1.23 1.23 NA NA NA ...
$ Host star temp. (K) : num 3024 3584 3584 3584 5722 ...
$ Remarks : chr "Closest exoplanet to our Solar System. Within host star’s habitable zone; possibly Earth-like." "controversial" "controversial" "controversial" ...
Test our url_works function on a random sample of the table:
tests <- dplyr::sample_frac(exo_data, 0.02) %>% .$Name
Now let's build a reference table with the name, the url to check, and a logical indicating whether the url is valid, and in one step create a list of two data frames: one containing the urls that don't exist, and the other containing those that do. The ones that check out we can run through the above function with no issues. This way the error handling is done before we actually start trying to parse in a loop, which avoids headaches and gives a reference back to which items need to be looked into further.
b <- plyr::ldply(sprintf('https://en.wikipedia.org/wiki/%s', tests), function(i){
  data.frame(name = basename(i), url_checked = i, url_valid = url_works(i))
}) %>% split(.$url_valid)
> str(b)
List of 2
$ FALSE:'data.frame': 24 obs. of 3 variables:
..$ name : chr [1:24] "Kepler-539c" "HD 142 A c" "WASP-44 b" "Kepler-280 b" ...
..$ url_checked: chr [1:24] "https://en.wikipedia.org/wiki/Kepler-539c" "https://en.wikipedia.org/wiki/HD 142 A c" "https://en.wikipedia.org/wiki/WASP-44 b" "https://en.wikipedia.org/wiki/Kepler-280 b" ...
..$ url_valid : logi [1:24] FALSE FALSE FALSE FALSE FALSE FALSE ...
$ TRUE :'data.frame': 17 obs. of 3 variables:
..$ name : chr [1:17] "HD 179079 b" "HD 47186 c" "HD 93083 b" "HD 200964 b" ...
..$ url_checked: chr [1:17] "https://en.wikipedia.org/wiki/HD 179079 b" "https://en.wikipedia.org/wiki/HD 47186 c" "https://en.wikipedia.org/wiki/HD 93083 b" "https://en.wikipedia.org/wiki/HD 200964 b" ...
..$ url_valid : logi [1:17] TRUE TRUE TRUE TRUE TRUE TRUE ...
Obviously the second item of the list contains the data frame with valid urls, so apply the prior function to the url column in that one. Note that I sampled the table of all planets for purposes of explanation... there are 2,400-some-odd names, so that check will take a minute or two to run in your case. Hope that wraps it up for you.
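Putting the pieces together, a sketch (not from the original answer) of running the earlier parsing routine only over the urls that checked out, assuming b and the html helper defined above:
good_urls <- as.character(b[["TRUE"]]$url_checked)
my_table_ok <- lapply(good_urls, function(i){
  raw_1 <- html_text(html_nodes(html(i), "p"))
  raw_valid <- raw_1[nchar(raw_1) > 0]
  data.frame(wiki_url = i,
             topic = basename(i),
             info_summary = raw_valid[[1]],
             stringsAsFactors = FALSE)
}) %>% rbind.pages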
I have the following problem: according to different sources it should be possible to read a WFS layer in R using rgdal.
dsn<-"WFS:http://geomap.reteunitaria.piemonte.it/ws/gsareprot/rp-01/areeprotwfs/wfs_gsareprot_1?service=WFS&request=getCapabilities"
ogrListLayers(dsn)
readOGR(dsn,"SIC")
The result of that code should be 1) to list the available WFS layer and 2) to read a specific Layer (SIC) into R as a Spatial(Points)DataFrame.
I tried several other WFS servers, but it does not work.
I always get the warning:
Cannot open data source
Checking for the WFS driver i get the following result:
> "WFS" %in% ogrDrivers()$name
[1] FALSE
Well, it looks like the WFS driver is not implemented in rgdal (anymore?). Or why are there so many examples "claiming" the opposite?
I also tried the gdalUtils package, and it works, but it gives out the whole console message of ogrinfo.exe and not only the available layers. (I guess it "just" calls ogrinfo.exe and sends the result back to R, like using the R shell or system command.)
Does anyone know what I'm doing wrong, or whether something like this is even possible with rgdal or any similar package?
You can combine the two packages to accomplish your task.
First, convert the layer you need into a local shapefile using gdalUtils. Then, use rgdal as normal. NOTE: you'll see a warning message after the ogr2ogr call but it performed the conversion fine for me. Also, ogr2ogr won't overwrite local files without the overwrite parameter being TRUE (there are other parameters that may be of use as well).
library(gdalUtils)
library(rgdal)
dsn <- "WFS:http://geomap.reteunitaria.piemonte.it/ws/gsareprot/rp-01/areeprotwfs/wfs_gsareprot_1?service=WFS&request=getCapabilities"
ogrinfo(dsn, so=TRUE)
## [1] "Had to open data source read only."
## [2] "INFO: Open of `WFS:http://geomap.reteunitaria.piemonte.it/ws/gsareprot/rp-01/areeprotwfs/wfs_gsareprot_1?service=WFS&request=getCapabilities'"
## [3] " using driver `WFS' successful."
## [4] "1: AreeProtette"
## [5] "2: ZPS"
## [6] "3: SIC"
ogr2ogr(dsn, "sic.shp", "SIC")
sic <- readOGR("sic.shp", "sic", stringsAsFactors=FALSE)
## OGR data source with driver: ESRI Shapefile
## Source: "sic.shp", layer: "sic"
## with 128 features
## It has 23 fields
plot(sic)
str(sic@data)
## 'data.frame': 128 obs. of 23 variables:
## $ gml_id : chr "SIC.510" "SIC.472" "SIC.470" "SIC.508" ...
## $ objectid : chr "510" "472" "470" "508" ...
## $ inspire_id: chr NA NA NA NA ...
## $ codice : chr "IT1160026" "IT1160017" "IT1160018" "IT1160020" ...
## $ nome : chr "Faggete di Pamparato, Tana del Forno, Grotta delle Turbiglie e Grotte di Bossea" "Stazione di Linum narbonense" "Sorgenti del T.te Maira, Bosco di Saretto, Rocca Provenzale" "Bosco di Bagnasco" ...
## $ cod_tipo : chr "B" "B" "B" "B" ...
## $ tipo : chr "SIC" "SIC" "SIC" "SIC" ...
## $ cod_reg_bi: chr "1" "1" "1" "1" ...
## $ des_reg_bi: chr "Alpina" "Alpina" "Alpina" "Alpina" ...
## $ mese_istit: chr "11" "11" "11" "11" ...
## $ anno_istit: chr "1996" "1996" "1996" "1996" ...
## $ mese_ultmo: chr "2" NA NA NA ...
## $ anno_ultmo: chr "2002" NA NA NA ...
## $ sup_sito : chr "29396102.9972" "82819.1127" "7272687.002" "3797600.3563" ...
## $ perim_sito: chr "29261.8758" "1227.8846" "17650.289" "9081.4963" ...
## $ url1 : chr "http://gis.csi.it/parchi/schede/IT1160026.pdf" "http://gis.csi.it/parchi/schede/IT1160017.pdf" "http://gis.csi.it/parchi/schede/IT1160018.pdf" "http://gis.csi.it/parchi/schede/IT1160020.pdf" ...
## $ url2 : chr "http://gis.csi.it/parchi/carte/IT1160026.djvu" "http://gis.csi.it/parchi/carte/IT1160017.djvu" "http://gis.csi.it/parchi/carte/IT1160018.djvu" "http://gis.csi.it/parchi/carte/IT1160020.djvu" ...
## $ fk_ente : chr NA NA NA NA ...
## $ nome_ente : chr NA NA NA NA ...
## $ url3 : chr NA NA NA NA ...
## $ url4 : chr NA NA NA NA ...
## $ tipo_geome: chr "poligono" "poligono" "poligono" "poligono" ...
## $ schema : chr "Natura2000" "Natura2000" "Natura2000" "Natura2000" ...
Neither the questioner nor the answerer says how rgdal was installed. If it is a CRAN binary for Windows or macOS, it may well have a smaller set of drivers than the independent GDAL installation underlying gdalUtils. Always state your platform and whether rgdal was installed as a binary or from source, and always provide the output of the messages displayed as rgdal loads, as well as sessionInfo(), to show the platform on which you are running.
Given the possible difference in driver sets, the advice given seems reasonable.