I am trying to gather the locations of the documents found on the SEC website. When I use the read_html() function, the table selector returns an empty set and I'm not sure why. When I inspect the elements in my web browser, I can see that the nodes are populated, but that information is not being carried over into my session.
test_url <- "https://www.sec.gov/edgar/search/#/dateRange=custom&entityName=(CIK%25200000887568)&startdt=1980-01-01&enddt=2021-06-23&filter_forms=10-K"
pg <- read_html(test_url) %>%
  html_nodes(css = "#hits > table")
# But it's empty
xml_attrs(pg, xml_child(pg[[1]], 2))
[[1]]
class
"table"
Thank you for any and all help!
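For what it's worth, the #hits table on that page is filled in by JavaScript after the page loads, so the static HTML that read_html() downloads contains only an empty placeholder table, which is exactly what you're seeing. The usual workaround is to find the JSON request the page makes (visible in your browser's network tab) and query it directly. A minimal sketch follows; the endpoint and parameter names below are assumptions taken from watching the network tab, so verify them against your own:
library(jsonlite)

# Hypothetical request URL: substitute whatever you see in the network tab.
# The SEC may also reject requests without a User-Agent header; if so, fetch
# with httr::GET() plus httr::add_headers() and parse the text instead.
json_url <- "https://efts.sec.gov/LATEST/search-index?q=%220000887568%22&forms=10-K"
results <- fromJSON(json_url)
str(results, max.level = 2)  # inspect the structure to locate the hits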
Related
New to web scraping.
I am trying to scrape a site. I recently learnt how to get information from tables, but I want to know how to get the table name. (I believe "table name" might be the wrong term here, but bear with me.)
Eg - https://www.msc.com/che/about-us/our-fleet?page=1
MSC is a shipping firm, and I need to get the list of their fleet and information on each ship.
I have written the following code that will retrieve the table data for each ship.
df <- MSCwp[i, 1] %>%
  read_html() %>%
  html_table()
MSCwp is the list of URLs. This code gets me all the information I need about the ships listed on the webpage except their names.
Is there any way to retrieve the name along with the table?
Eg - df for the above-mentioned website will return 10 tables (corresponding to the ships on the webpage). df[1] will have information about the ship Agamemnon, but I am not sure how to retrieve the ship name along with the table.
You need to pull the names out from the main page.
library(rvest)
library(dplyr)
url <- "https://www.msc.com/che/about-us/our-fleet?page=1"
page <- read_html(url)
names <- page %>% html_elements("dd a") %>% html_text()
names
[1] "AGAMEMNON" "AGIOS DIMITRIOS" "ALABAMA" "ALLEGRO" "AMALTHEA" "AMERICA" "ANASTASIA"
[8] "ANTWERP TRADER" "ARCHIMIDIS" "ARIES"
In this case I am looking for the text in the "a" child node of the "dd" nodes.
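Since html_table() pulls every table in document order, one way to pair them up (assuming the tables come back in the same order as the links, which is worth spot-checking against a couple of ships) is:
tables <- page %>% html_table()

# Assumes html_table() returns the tables in the same document order
# as the <dd> links above; verify a few before relying on it.
names(tables) <- names
tables[["AGAMEMNON"]]  # the Agamemnon's table, now addressable by ship name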
Really new to R here - this is the code I normally use to scrape tables, but I couldn't get it to work because the table on this website is rendered reactively.
This is the url: https://sgpgrid.com/filter/property-fund-management-including-reit-management-and-direct-property-fund-management
And this is the code chunk I used.
library(rvest)
d2 <- read_html("https://sgpgrid.com/filter/property-fund-management-including-reit-management-and-direct-property-fund-management")
stats <- d2 %>%
  html_node(".rt-table") %>%
  html_table()
stats
RStudio keeps showing "Error in html_table.xml_node(.) : html_name(x) == "table" is not TRUE" whenever I try to run the code...
Would really appreciate any help here :(
The data is rendered from a JSON object located in a script tag (ReactJS local state). You can get it by selecting the script tag with id __NEXT_DATA__:
library(rvest)
data <- "https://sgpgrid.com/filter/property-fund-management-including-reit-management-and-direct-property-fund-management" %>%
read_html %>%
html_nodes('script#__NEXT_DATA__') %>%
html_text()
output <- jsonlite::fromJSON(data)
print(output$props$initialState$company$companies)
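By default jsonlite::fromJSON() simplifies an array of objects into a data frame, so the companies object above should come back as a regular data frame that you can filter, join, or export directly.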
I want to compare rookies across leagues with stats like points per game (PPG) and such. ESPN and NBA have great tables to scrape from (as does Basketball-Reference), but I just found out that they're not stored in the HTML, so I can't use rvest. For context, I'm trying to scrape tables like this one (from NBA):
https://i.stack.imgur.com/SdKjE.png
I'm trying to learn how to use httr and jsonlite for this, but I'm running into some issues. I followed the answer in this post, but it's not working out for me.
This is what I've tried:
library(httr)
library(jsonlite)
coby.white <- GET('https://www.nba.com/players/coby/white/1629632')
out <- content(coby.white, as = "text") %>%
  fromJSON(flatten = FALSE)
However, I get an error:
Error: lexical error: invalid char in json text.
<!DOCTYPE html><html class="" l
(right here) ------^
Is there an easier way to scrape a table from ESPN or NBA, or is there a solution to this issue?
PPG and other stats come from
https://data.nba.net/prod/v1/2019/players/1629632_profile.json
and player info e.g. weight, height
https://www.nba.com/players/active_players.json
So you could use jsonlite to parse it, e.g.
library(jsonlite)
data <- jsonlite::read_json('https://data.nba.net/prod/v1/2019/players/1629632_profile.json')
You can find these in the network tab when refreshing the page. It looks like you can use the player id in the URL to get different players' info for the season.
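As a rough sketch of pulling a stat out of that feed (the nested path below is an assumption about the profile JSON's layout, and the data.nba.net feeds have since been retired, so inspect the parsed list first):
# Hypothetical path into the parsed list; verify with str(data, max.level = 3).
latest <- data$league$standard$stats$latest
latest$ppg  # points per game for the latest season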
You actually can web scrape this with rvest; here's an example of scraping White's totals table from Basketball-Reference. Anything on Sports Reference's sites that is not the first table on the page is stored inside an HTML comment, meaning we must extract the comment nodes first and then pull the desired data table out of them.
library(rvest)
library(dplyr)
cobywhite <- 'https://www.basketball-reference.com/players/w/whiteco01.html'

totalsdf <- cobywhite %>%
  read_html() %>%
  html_nodes(xpath = '//comment()') %>%  # grab every HTML comment
  html_text() %>%
  paste(collapse = '') %>%               # stitch the comments into one string
  read_html() %>%                        # re-parse the hidden markup
  html_node("#totals") %>%
  html_table()
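Collapsing the comment text into one string and re-parsing it with read_html() turns the commented-out markup into an ordinary document, which is why html_node("#totals") can then find the table.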
I'm trying to scrape a table; however, I can only get it to paste the value of the hyperlink. I want the URL to be pasted instead of the value in the table. I've worked out how to do this for a single hyperlink, but I need to go through and acquire every XPath. Is there a quicker way of doing this?
This is the code I've been working with:
library(rvest)
url <- read_html("https://coinmarketcap.com/coins/views/all/")
cryptocurrencies <- url %>%
  html_nodes(xpath = '//*[@id="currencies-all"]') %>%  # note: @id, not #id, in XPath
  html_table(fill = TRUE)
cryptocurrencies <- cryptocurrencies[[1]]
I suspect there is an argument in the html_nodes function that would allow me to paste the href, but I can't seem to work out how to do it. Thanks.
First, you need to use html_attr() to get the attributes of each node; in your case, the attribute is href:
relative_paths <- page %>%
  html_nodes(".currency-name-container") %>%
  html_attr("href")  # note these are relative paths

relative_paths[1:3]
[1] "/currencies/bitcoin/"  "/currencies/ethereum/" "/currencies/ripple/"
Once you have the relative paths, you can use the jump_to() or follow_link() functions to scrape each page.
# display the first three results
for (path in relative_paths[1:3]) {
  current_session <- html_session("https://coinmarketcap.com/coins/views/all/") %>%
    jump_to(path)
  # do something here
  print(current_session$url)
}
[1] "https://coinmarketcap.com/currencies/bitcoin/"
[1] "https://coinmarketcap.com/currencies/ethereum/"
[1] "https://coinmarketcap.com/currencies/ripple/
Or can get the absolute path:
#or get absolute path
absolute_path <- paste0("https://coinmarketcap.com", relative_paths)
absolute_path[1:3]
[1] "https://coinmarketcap.com/currencies/bitcoin/" "https://coinmarketcap.com/currencies/ethereum/" "https://coinmarketcap.com/currencies/ripple/"
Finally, you can merge it into your data frame.
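For example (assuming one href per table row, in the same order as the rows of the scraped table, which is worth checking against a few rows):
# Hypothetical column name; assumes the hrefs line up with the table rows.
cryptocurrencies$url <- paste0("https://coinmarketcap.com", relative_paths)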
There are a number of NBA fantasy projections that I would like to scrape in a more streamlined way. Currently I use a combination of the IMPORTHTML function in Google Sheets and plain old cut-and-paste.
I use R regularly to scrape other data from the internet; however, I can't manage to get these tables to scrape. The tables I am having trouble with are located at three separate addresses (one table per page):
1) http://www.sportsline.com/nba/player-projections/player-stats/all-players/
2) https://swishanalytics.com/optimus/nba/daily-fantasy-projections
3) http://www.sportingcharts.com/nba/dfs-projections/
For all my other scraping activities I use the rvest and XML packages. Following the same process, I've tried both methods listed below, which produce the outputs shown. I'm sure this has something to do with the format of the tables on the websites; however, I haven't been able to find anything that helps.
Method 1
library(XML)
projections1 <- readHTMLTable("http://www.sportsline.com/nba/player-projections/player-stats/all-players/")
projections2 <- readHTMLTable("https://swishanalytics.com/optimus/nba/daily-fantasy-projections")
projections3 <- readHTMLTable("http://www.sportingcharts.com/nba/dfs-projections/")
Output
projections1
named list()
projections2
named list()
Warning message:
XML content does not seem to be XML: 'https://swishanalytics.com/optimus/nba/daily-fantasy-projections'
projections3 - I get the headers of the table but not the content of the table.
Method 2
library(rvest)
URL <- "http://www.sportsline.com/nba/player-projections/player-stats/all-players/"
projections1 <- URL %>%
read_html %>%
html_nodes("table") %>%
html_table(trim=TRUE,fill=TRUE)
URL <- "https://swishanalytics.com/optimus/nba/daily-fantasy-projections"
projections2 <- URL %>%
read_html %>%
html_nodes("table") %>%
html_table(trim=TRUE,fill=TRUE)
URL <- "http://www.sportingcharts.com/nba/dfs-projections/"
projections3 <- URL %>%
read_html %>%
html_nodes("table") %>%
html_table(trim=TRUE,fill=TRUE)
Output
projections1
list()
projections2 - I get the headers of the table but not the content of the table.
projections3 - I get the headers of the table but not the content of the table.
If anybody could point me in the right direction it would be greatly appreciated.
The content of the table is generated by JavaScript, so readHTMLTable and read_html find nothing. You can find the table as shown below.
projections1: link
import requests

url = 'http://www.sportsline.com/sportsline-web/service/v1/playerProjections?league=nba&position=all-players&sourceType=FD&game=&page=PS&offset=0&max=25&orderField=&optimal=false&release2Ver=true&auth=3'
r = requests.get(url)
print(r.json())
projections2: view-source:https://swishanalytics.com/optimus/nba/daily-fantasy-projections Line 1181
import json
import requests

url = 'https://swishanalytics.com/optimus/nba/daily-fantasy-projections'
r = requests.get(url)
text = r.text
# json.loads replaces the original eval; it assumes the embedded
# "this.players = [...]" literal is valid JSON.
print(json.loads(text.split('this.players = ')[1].split(';')[0]))
projections3: view-source Line 918
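Since the question is in R, here is a rough httr/jsonlite translation of the projections2 approach. It assumes the page source still embeds a this.players = [...]; assignment and that the literal is valid JSON, so treat it as a sketch rather than a definitive implementation:
library(httr)
library(jsonlite)

page_text <- content(
  GET("https://swishanalytics.com/optimus/nba/daily-fantasy-projections"),
  as = "text"
)

# Cut out the literal between "this.players = " and the next semicolon.
json_txt <- strsplit(page_text, "this.players = ", fixed = TRUE)[[1]][2]
json_txt <- strsplit(json_txt, ";", fixed = TRUE)[[1]][1]

players <- fromJSON(json_txt)
head(players)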