I am trying to scrape Wikipedia for certain astronomy-related definitions for my project. The code works pretty well, but I am not able to avoid 404s. I tried tryCatch, but I think I am missing something here.
I am looking for a way to overcome 404s while running a loop. Here is my code:
library(rvest)
library(httr)
library(XML)
library(tm)

topic <- c("Neutron star", "Black hole", "sagittarius A")
output <- NULL
for(i in topic){
  site <- paste0("https://en.wikipedia.org/wiki/", i)
  site <- read_html(site)
  stats <- xmlValue(getNodeSet(htmlParse(site), "//p")[[1]]) # only the first paragraph
  #error = function(e){NA}
  stats[["topic"]] <- i
  stats <- gsub('\\[.*?\\]', '', stats) # strip citation markers like [1]
  #stats <- stats[!duplicated(stats),]
  #out.file <- data.frame(rbind(stats, F[i]))
  output <- rbind(output, stats)
}
Build the URLs with sprintf.
Extract all of the body text from the paragraph nodes.
Drop any empty entries (nchar of 0).
I added a step to include all of the body text, annotated by a prepended [paragraph - n] for reference, because, well, friends don't let friends waste data or make multiple HTTP requests.
Build a data frame for each iteration in your topics list, in the form below:
wiki_url: should be obvious
topic: from the topics list
info_summary: the first paragraph (the one you mentioned in your post)
all_info: in case you need more, you know.
Bind all of the data frames in the list into one.
Note that I use an older, source version of rvest; for ease of understanding, I'm simply assigning the name html to what would be your read_html.
library(rvest)
library(jsonlite) # for rbind.pages (newer jsonlite versions call it rbind_pages)

html <- rvest::read_html
wiki_base <- "https://en.wikipedia.org/wiki/%s"

my_table <- lapply(sprintf(wiki_base, topic), function(i){
  # grab all paragraph nodes, keeping only the non-empty ones
  raw_1 <- html_text(html_nodes(html(i), "p"))
  raw_valid <- raw_1[nchar(raw_1) > 0]
  # annotate each paragraph with a prepended [paragraph - n] marker and collapse
  all_info <- lapply(1:length(raw_valid), function(i){
    sprintf(' [paragraph - %d] %s ', i, raw_valid[[i]])
  }) %>% paste0(collapse = "")
  data.frame(wiki_url = i,
             topic = basename(i),
             info_summary = raw_valid[[1]],
             all_info = trimws(all_info),
             stringsAsFactors = FALSE)
}) %>% rbind.pages
> str(my_table)
'data.frame': 3 obs. of 4 variables:
$ wiki_url : chr "https://en.wikipedia.org/wiki/Neutron star" "https://en.wikipedia.org/wiki/Black hole" "https://en.wikipedia.org/wiki/sagittarius A"
$ topic : chr "Neutron star" "Black hole" "sagittarius A"
$ info_summary: chr "A neutron star is the collapsed core of a large star (10–29 solar masses). Neutron stars are the smallest and densest stars kno"| __truncated__ "A black hole is a region of spacetime exhibiting such strong gravitational effects that nothing—not even particles and electrom"| __truncated__ "Sagittarius A or Sgr A is a complex radio source at the center of the Milky Way. It is located in the constellation Sagittarius"| __truncated__
$ all_info : chr " [paragraph - 1] A neutron star is the collapsed core of a large star (10–29 solar masses). Neutron stars are the smallest and "| __truncated__ " [paragraph - 1] A black hole is a region of spacetime exhibiting such strong gravitational effects that nothing—not even parti"| __truncated__ " [paragraph - 1] Sagittarius A or Sgr A is a complex radio source at the center of the Milky Way. It is located in the constell"| __truncated__
EDIT
A function for error handling that returns a logical. This becomes our first step.
library(httr) # for HEAD() and status_code()

url_works <- function(url){
  tryCatch(
    identical(status_code(HEAD(url)), 200L),
    error = function(e){
      FALSE
    })
}
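For example, you could screen the topic vector from the top of the post before scraping anything (a minimal sketch, reusing wiki_base from above):
# Keep only the topics whose Wikipedia page answers with HTTP 200
urls <- sprintf(wiki_base, topic)
topic_ok <- topic[sapply(urls, url_works)]
topic_ok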
Based on your use of 'exoplanet', here is all of the applicable data from the wiki page:
exo_data <- (html_nodes(html('https://en.wikipedia.org/wiki/List_of_exoplanets'), '.wikitable') %>% html_table)[[2]]
str(exo_data)
'data.frame': 2048 obs. of 16 variables:
$ Name : chr "Proxima Centauri b" "KOI-1843.03" "KOI-1843.01" "KOI-1843.02" ...
$ bf : int 0 0 0 0 0 0 0 0 0 0 ...
$ Mass (Jupiter mass) : num 0.004 0.0014 NA NA 0.1419 ...
$ Radius (Jupiter radii) : num NA 0.054 0.114 0.071 1.012 ...
$ Period (days) : num 11.186 0.177 4.195 6.356 19.224 ...
$ Semi-major axis (AU) : num 0.05 0.0048 0.039 0.052 0.143 0.229 0.0271 0.053 1.33 2.1 ...
$ Ecc. : num 0.35 1.012 NA NA 0.0626 ...
$ Inc. (deg) : num NA 72 89.4 88.2 87.1 ...
$ Temp. (K) : num 234 NA NA NA 707 ...
$ Discovery method : chr "radial vel." "transit" "transit" "transit" ...
$ Disc. Year : int 2016 2012 2012 2012 2010 2010 2010 2014 2009 2005 ...
$ Distance (pc) : num 1.29 NA NA NA 650 ...
$ Host star mass (solar masses) : num 0.123 0.46 0.46 0.46 1.05 1.05 1.05 0.69 1.25 0.22 ...
$ Host star radius (solar radii): num 0.141 0.45 0.45 0.45 1.23 1.23 1.23 NA NA NA ...
$ Host star temp. (K) : num 3024 3584 3584 3584 5722 ...
$ Remarks : chr "Closest exoplanet to our Solar System. Within host star’s habitable zone; possibly Earth-like." "controversial" "controversial" "controversial" ...
Test our url_works function on a random sample of the table:
tests <- dplyr::sample_frac(exo_data, 0.02) %>% .$Name
Now let's build a reference table with the name, the URL to check, and a logical indicating whether the URL is valid, and in one step split it into a list of two data frames: one containing the URLs that don't exist, and one containing those that do. The ones that check out we can run through the scraping function above with no issues. This way the error handling is done before we actually start trying to parse in a loop, which avoids headaches and gives a reference back to which items need a further look.
library(plyr) # for ldply()

b <- ldply(sprintf('https://en.wikipedia.org/wiki/%s', tests), function(i){
  data.frame(name = basename(i), url_checked = i, url_valid = url_works(i))
}) %>% split(.$url_valid)
> str(b)
List of 2
$ FALSE:'data.frame': 24 obs. of 3 variables:
..$ name : chr [1:24] "Kepler-539c" "HD 142 A c" "WASP-44 b" "Kepler-280 b" ...
..$ url_checked: chr [1:24] "https://en.wikipedia.org/wiki/Kepler-539c" "https://en.wikipedia.org/wiki/HD 142 A c" "https://en.wikipedia.org/wiki/WASP-44 b" "https://en.wikipedia.org/wiki/Kepler-280 b" ...
..$ url_valid : logi [1:24] FALSE FALSE FALSE FALSE FALSE FALSE ...
$ TRUE :'data.frame': 17 obs. of 3 variables:
..$ name : chr [1:17] "HD 179079 b" "HD 47186 c" "HD 93083 b" "HD 200964 b" ...
..$ url_checked: chr [1:17] "https://en.wikipedia.org/wiki/HD 179079 b" "https://en.wikipedia.org/wiki/HD 47186 c" "https://en.wikipedia.org/wiki/HD 93083 b" "https://en.wikipedia.org/wiki/HD 200964 b" ...
..$ url_valid : logi [1:17] TRUE TRUE TRUE TRUE TRUE TRUE ...
Obviously the second element of the list contains the data frame with valid URLs, so apply the prior scraping function to the url column in that one, as sketched below. Note that I sampled the table of all planets for purposes of explanation; there are 2,400-odd names, so the check will take a minute or two to run in your case.
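A minimal sketch of that last step, reusing the html alias and rbind.pages from above (the valid URLs live in the element of b named "TRUE"):
# Scrape only the URLs that passed the url_works check.
# as.character() guards against url_checked having been stored as a factor.
valid_urls <- as.character(b$`TRUE`$url_checked)
my_table <- lapply(valid_urls, function(i){
  raw_1 <- html_text(html_nodes(html(i), "p"))
  raw_valid <- raw_1[nchar(raw_1) > 0]
  data.frame(wiki_url = i,
             topic = basename(i),
             info_summary = raw_valid[[1]],
             stringsAsFactors = FALSE)
}) %>% rbind.pages
Hope that wraps it up for you.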
Related
I would like to extract data from a large list with many sub-lists, called 'summary': https://www.dropbox.com/s/uiair94p0v7z2zr/summary10.csv?dl=0
This file is a compilation of dose-response curve fits by patient and drug. I share a small file with just 10 patients, 105 drugs, and x and y as readouts for the fitting, each with 100 points.
I would like to save the fit for each patient and each drug in a separate file.
I tried to write the list into a data frame to use the tidyverse, but didn't manage. I have only just started out with R, so this is very complex for me.
for (i in 1:length(summary10)) {
  for (j in 1:length(summary10[[i]])) {
    x1 <- summary10[[i]][[j]][[1]]
    y1 <- summary10[[i]][[j]][[2]]
    print(summary10[[i]][[j]])
  }
}
The loop works, but I don't know how to save the results to different files in a way that lets me know what is what. I tried something I found online, but it doesn't work:
for (i in 1:length(summary10)) {
  for (j in 1:length(summary10[[i]])) {
    x1 <- summary10[[i]][[j]][[1]]
    y1 <- summary10[[i]][[j]][[2]]
    cbind(x1, y1) -> resp
    write.csv(resp, file = paste0(summary[[i]], ".-csv"), row.names = FALSE)
  }
}
Error in file(file, ifelse(append, "a", "w")) : invalid 'description' argument
In addition:
Warning message: In if (file == "") file <- stdout() else if (is.character(file)) { : the condition has length > 1 and only the first element will be used
It's really hard to anticipate what goes wrong when we cannot see how you made summary10; I am not going to guess how you got from your tabular file to a list of lists (or whatever summary10 may be).
But in the end, your error indicates that you are providing an invalid filename in the file = paste0(summary[[i]], ".-csv") argument. The first tip on debugging is simply printing to the console. Try this on for size:
cbind(x1, y1) -> resp
cat(paste0(summary[[i]], ".-csv"), '\n') # <-----
# use `cat` to print to console the contents of your expressions
write.csv(resp, file = paste0(summary[[i]], ".-csv"), row.names = FALSE)
What is it? It should evaluate to a simple string, say B.M.21.S.-csv, but that might not be the case.
At first glance, I would guess you've misspelled your variable: summary is usually a function, whereas you are probably looking for summary10. Still, the i'th element of summary10 looks like it could be a list itself, so your expression will fail to produce a simple string.
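To see why, evaluate the filename expression on its own once summary10 is loaded (its structure appears in the update below); paste0 vectorizes over every drug entry in the patient's list, so you get a character vector of huge deparsed strings rather than a single filename:
# Not one string but one (enormous, deparsed) string per drug entry:
fn <- paste0(summary10[[1]], ".-csv")
length(fn) # 106, not 1 -- hence the "condition has length > 1" warning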
Update with summary10
I always recommend using str to examine the structure of an object. For lists, use the argument max.level to avoid printing endless nested lists:
> str(summary10, max.level=1)
List of 10
$ B-HR-25 :List of 106
$ B-SR-22 :List of 106
$ B-VHR-01:List of 106
$ B-SR-23 :List of 106
$ B-SR-24 :List of 106
$ B-HR-21 :List of 106
$ B-M-21 :List of 106
$ B-SR-21 :List of 106
$ B-MR-01 :List of 106
$ B-M-01 :List of 106
And then a step further in:
> str(summary10[[1]], max.level=2)
List of 106
$ PP242 :List of 2
..$ x: num [1:100] 1 1.1 1.2 1.32 1.45 ...
..$ y: num [1:100] 0.923 0.922 0.921 0.92 0.919 ...
$ AZD8055 :List of 2
..$ x: num [1:100] 1 1.1 1.2 1.32 1.45 ...
..$ y: num [1:100] 0.953 0.953 0.953 0.952 0.952 ...
So the object summary10 is a collection of patients (lists of lists); summary10[1] is the collection containing only the first patient, and summary10[[1]] is the first patient (a list itself) with their responses to drugs.
So what happens when you try to make a filename from summary10[[i]]? Try it; I won't print the output here. Back in str(summary10), the patients' designations ("B-HR-25", etc.) are the names of the entries. Get them with names(summary10). As an exercise, compare names(summary10), names(summary10)[1], names(summary10[1]) and names(summary10[[1]]).
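Putting it together, a minimal sketch of a corrected loop: one CSV per patient-drug pair, with the filename built from the list names rather than the list contents (the naming scheme itself is my own choice):
for (i in seq_along(summary10)) {
  for (j in seq_along(summary10[[i]])) {
    resp <- cbind(x = summary10[[i]][[j]][[1]],
                  y = summary10[[i]][[j]][[2]])
    # e.g. "B-HR-25_PP242.csv"
    fname <- paste0(names(summary10)[i], "_", names(summary10[[i]])[j], ".csv")
    write.csv(resp, file = fname, row.names = FALSE)
  }
}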
Is there a way to automatically pull the Russell 3000 holdings from the iShares website in R, using the read_html (or rvest) function?
url: https://www.ishares.com/us/products/239714/ishares-russell-3000-etf
(all holdings in the table on the bottom, not just top 10)
So far I have had to copy and paste into an Excel document, save as a CSV, and use read_csv to create a tibble in R of the ticker, company name, and sector.
I have used read_html to pull the S&P 500 holdings from Wikipedia, but I can't seem to figure out the path I need to put in to have R automatically pull from the iShares website (and there aren't other reputable websites I've found with all ~3000 holdings). Here is the code used for the S&P 500:
library(rvest)
library(dplyr) # for select() and as_tibble()

read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies") %>%
  html_node("table.wikitable") %>%
  html_table() %>%
  select('Symbol', 'Security', 'GICS Sector', 'GICS Sub Industry') %>%
  as_tibble()
First post, sorry if it is hard to follow...
Any help would be much appreciated
Michael
IMPORTANT
According to the Terms & Conditions listed on BlackRock's website (here):
Use any robot, spider, intelligent agent, other automatic device, or manual process to search, monitor or copy this Website or the reports, data, information, content, software, products services, or other materials on, generated by or obtained from this Website, whether through links or otherwise (collectively, "Materials"), without BlackRock's permission, provided that generally available third-party web browsers may be used without such permission;
I suggest you ensure you are abiding by those terms before using their data in a way that violates them. For educational purposes, here is how the data would be obtained:
First you need to get to the actual data (not the interactive JavaScript). How familiar are you with the developer tools in your browser? If you navigate through the website and track the traffic, you will notice a large AJAX request:
https://www.ishares.com/us/products/239714/ishares-russell-3000-etf/1467271812596.ajax?tab=all&fileType=json
This is the data you need (all of it). After locating it, it is just a matter of cleaning the data. Example:
library(jsonlite)

# Locate the raw data by searching the Network traffic:
url <- "https://www.ishares.com/us/products/239714/ishares-russell-3000-etf/1467271812596.ajax?tab=all&fileType=json"

# pull the data in via fromJSON
x <- jsonlite::fromJSON(url, flatten = TRUE)
# Large list (10.4 Mb)

# use a combination of `lapply` and `rapply` to unlist, structuring the results as one large list
y <- lapply(rapply(x, enquote, how = "unlist"), eval)
# Large list (50677 elements, 6.9 Mb)

y1 <- y[1:15]
> str(y1)
List of 15
$ aaData1 : chr "MSFT"
$ aaData2 : chr "MICROSOFT CORP"
$ aaData3 : chr "Equity"
$ aaData.display: chr "2.95"
$ aaData.raw : num 2.95
$ aaData.display: chr "109.41"
$ aaData.raw : num 109
$ aaData.display: chr "2,615,449.00"
$ aaData.raw : int 2615449
$ aaData.display: chr "$286,156,275.09"
$ aaData.raw : num 2.86e+08
$ aaData.display: chr "286,156,275.09"
$ aaData.raw : num 2.86e+08
$ aaData14 : chr "Information Technology"
$ aaData15 : chr "2588173"
Updated: in case you are unable to clean the data, here you are:
testdf <- data.frame(matrix(unlist(y), nrow = 50677, byrow = TRUE), stringsAsFactors = FALSE)

# Where we want to break the DF (every nth row)
breaks <- 17
# number of rows in the full DF
nbr.row <- nrow(testdf)
repeats <- rep(1:ceiling(nbr.row/breaks), each = breaks)[1:nbr.row]
# split the DF for clean-up
newDF <- split(testdf, repeats)
Result:
> str(head(newDF))
List of 6
$ 1:'data.frame': 17 obs. of 1 variable:
..$ matrix.unlist.y...nrow...50677..byrow...T.: chr [1:17] "MSFT" "MICROSOFT CORP" "Equity" "2.95" ...
$ 2:'data.frame': 17 obs. of 1 variable:
..$ matrix.unlist.y...nrow...50677..byrow...T.: chr [1:17] "AAPL" "APPLE INC" "Equity" "2.89" ...
$ 3:'data.frame': 17 obs. of 1 variable:
..$ matrix.unlist.y...nrow...50677..byrow...T.: chr [1:17] "AMZN" "AMAZON COM INC" "Equity" "2.34" ...
$ 4:'data.frame': 17 obs. of 1 variable:
..$ matrix.unlist.y...nrow...50677..byrow...T.: chr [1:17] "BRKB" "BERKSHIRE HATHAWAY INC CLASS B" "Equity" "1.42" ...
$ 5:'data.frame': 17 obs. of 1 variable:
..$ matrix.unlist.y...nrow...50677..byrow...T.: chr [1:17] "FB" "FACEBOOK CLASS A INC" "Equity" "1.35" ...
$ 6:'data.frame': 17 obs. of 1 variable:
..$ matrix.unlist.y...nrow...50677..byrow...T.: chr [1:17] "JNJ" "JOHNSON & JOHNSON" "Equity" "1.29" ...
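If you would rather end up with one row per holding than a list of 17-row chunks, here is a sketch; the column names are my own illustrative guesses based on the y1 output above, not names supplied by the feed:
# Transpose each 17-row chunk into one row, then bind them all together
holdings <- do.call(rbind, lapply(newDF, function(chunk){
  as.data.frame(t(chunk[[1]]), stringsAsFactors = FALSE)
}))
# in y1, fields 1, 2, 3 and 14 held ticker, name, asset class and sector
names(holdings)[c(1, 2, 3, 14)] <- c("ticker", "name", "asset_class", "sector")
head(holdings[, c("ticker", "name", "asset_class", "sector")])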
The structure of the list is as follows (the list goes on with the same structure):
> str(parsedData)
List of 1658
 $ :List of 2
  ..$ Date      : chr "2010-08-16"
  ..$ Volatility: num 11.1
 $ :List of 2
  ..$ Date      : chr "2010-08-17"
  ..$ Volatility: num 26.2
As you can see, the names at the first level of the structure are empty. I tried to extract the elements, but failed:
> parsedData$Date
NULL
Can anyone tell me how to extract only the Date and Volatility from this list (especially with no names) and put them all in the same data frame, like this? Thanks!
Date Volatility
2010-08-16 11.1
2010-08-17 26.2
... ...
(This is the first time I have asked a question, sorry for any editing mistakes :) )
Not tested:
setNames(data.frame(do.call(rbind, lapply(1:length(parsedData), function(i)
  cbind(parsedData[[i]][1], parsedData[[i]][2])))), c("Date", "Volatility"))
OR:
setNames(data.frame(do.call(rbind, lapply(1:length(parsedData), function(i)
  t(parsedData[[i]][1:2])))), c("Date", "Volatility"))
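If you prefer something easier to read, here is a sketch that coerces each two-element sub-list to a one-row data frame and stacks them (assuming every element carries exactly the Date and Volatility fields shown above):
df <- do.call(rbind, lapply(parsedData, function(el){
  data.frame(Date = el$Date, Volatility = el$Volatility,
             stringsAsFactors = FALSE)
}))
head(df)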
I am trying to use the R package termstrc to estimate the term structure. To do that, I have to prepare the data as the couponbonds class required by the package. I used some fake data to rule out potential problems with the real data. Though I tried a lot, it still didn't work.
Any idea what is going wrong?
Structure of the official demo data, which works:
data("govbonds")
str(govbonds)
List of 3
$ GERMANY:List of 8
..$ ISIN : chr [1:52] "DE0001141414" "DE0001137131" "DE0001141422" "DE0001137149" ...
..$ MATURITYDATE: Date[1:52], format: "2008-02-15" "2008-03-14" "2008-04-11" ...
..$ ISSUEDATE : Date[1:52], format: "2002-08-14" "2006-03-08" "2003-04-11" ...
..$ COUPONRATE : num [1:52] 0.0425 0.03 0.03 0.0325 0.0413 ...
..$ PRICE : num [1:52] 100 99.9 99.8 99.8 100.1 ...
..$ ACCRUED : num [1:52] 4.09 2.66 2.43 2.07 2.39 ...
..$ CASHFLOWS :List of 3
.. ..$ ISIN: chr [1:384] "DE0001141414" "DE0001137131" "DE0001141422" "DE0001137149" ...
.. ..$ CF : num [1:384] 104 103 103 103 104 ...
.. ..$ DATE: Date[1:384], format: "2008-02-15" "2008-03-14" "2008-04-11" ...
..$ TODAY : Date[1:1], format: "2008-01-30"
#another two are omitted here
- attr(*, "class")= chr "couponbonds"
> ns_res <- estim_nss(govbonds, c("GERMANY"), method = "ns",tauconstr=list(c(0.2, 5, 0.1)))
[1] "Searching startparameters for GERMANY"
beta0 beta1 beta2 tau1
5.008476 -1.092510 -3.209695 2.400100
My code to prepare the fake data:
bond <- list()
bond$CHINA <- list()
n <- 30*12 # suppose I have n bonds
enddate <- as.Date('2014/11/7')
isin <- sprintf('DE%010d', 1:n) # some fake ISINs
bond$CHINA$ISIN <- isin
bond$CHINA$MATURITYDATE <- enddate + (1:n)*30
bond$CHINA$ISSUEDATE <- rep(enddate, n)
bond$CHINA$COUPONRATE <- rep(5/100, n)
bond$CHINA$PRICE <- rep(100, n)
bond$CHINA$ACCRUED <- rep(0, n)
bond$CHINA$CASHFLOWS <- list()
bond$CHINA$CASHFLOWS$ISIN <- isin
bond$CHINA$CASHFLOWS$CF <- 100 + (1:n)*5/12
bond$CHINA$CASHFLOWS$DATE <- enddate + (1:n)*30
bond$CHINA$TODAY <- enddate
class(bond) <- 'couponbonds'
ns_res <- estim_nss(bond, c("CHINA"), method = "ns", tauconstr = list(c(0.2, 5, 0.1)))
The output:
Error in `colnames<-`(`*tmp*`, value = c("DE0000000001", "DE0000000002", :
attempt to set 'colnames' on an object with less than two dimensions
The problem was finally solved by adding one cashflow with amount zero to CASHFLOWS$CF.
Put another way, at least one bond should have at least two cashflows.
You may then face another error, caused by the uniroot function. Be sure to include only cashflows dated after TODAY; termstrc does not filter the cashflows by TODAY for you.
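A minimal sketch of that fix, applied to the fake data above: give the first bond a second, zero-amount cashflow dated after TODAY, so that at least one bond has two cashflows:
bond$CHINA$CASHFLOWS$ISIN <- c(bond$CHINA$CASHFLOWS$ISIN, isin[1])
bond$CHINA$CASHFLOWS$CF   <- c(bond$CHINA$CASHFLOWS$CF, 0)
bond$CHINA$CASHFLOWS$DATE <- c(bond$CHINA$CASHFLOWS$DATE, enddate + 15)
ns_res <- estim_nss(bond, c("CHINA"), method = "ns", tauconstr = list(c(0.2, 5, 0.1)))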
I have a CSV file of daily bars, with just two lines:
"datestamp","Open","High","Low","Close","Volume"
"2012-07-02",79.862,79.9795,79.313,79.509,48455
(That file was an xts that was converted to a data.frame then passed on to write.csv)
I load it with this:
z=read.zoo(file='tmp.csv',sep=',',header=T,format = "%Y-%m-%d")
And it is fine as print(z) shows:
Open High Low Close Volume
2012-07-02 79.862 79.9795 79.313 79.509 48455
But then as.xts(z) gives: Error in coredata.xts(x) : currently unsupported data type
Here is the str(z) output:
‘zoo’ series from 2012-07-02 to 2012-07-02
Data:List of 5
$ : num 79.9
$ : num 80
$ : num 79.3
$ : num 79.5
$ : int 48455
- attr(*, "dim")= int [1:2] 1 5
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:5] "Open" "High" "Low" "Close" ...
Index: Date[1:1], format: "2012-07-02"
I've so far confirmed the problem is not that four columns are num and one is int, as I still get the error even after removing the Volume column. But then what could that error message be talking about?
As Sebastian pointed out in the comments, the problem is the single row: the coredata is a list when read.zoo reads a single row, but something else (a matrix) when there are 2+ rows.
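You can confirm that diagnosis with a quick check on the one-row object (z as loaded above):
typeof(coredata(z)) # "list" -- a 1x5 list matrix, which as.xts cannot handle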
I replaced the call to read.zoo with the following, and it works fine whether there is 1 row or 2+:
library(xts)

d <- read.table(fname, sep = ',', header = TRUE)
x <- as.xts(subset(d, select = -datestamp), order.by = as.Date(d$datestamp))