How can I scrape data from this website (multiple webpages) using R? - r

I am a beginner in scraping data from website. It seems difficult for me to interpret the structure of html using XML or other packages.
Can anyone help me to download the data from this website?
http://wszw.hzs.mofcom.gov.cn/fecp/fem/corp/fem_cert_stat_view_list.jsp
It is about the investment from China. The character set is in Chinese.
What I've tried so far:
library("rvest")
url <- "http://wszw.hzs.mofcom.gov.cn/fecp/fem/corp/fem_cert_stat_view_list.jsp"
firm <- url %>%
html() %>%
html_nodes(xpath='//*[#id="Grid1MainLayer"]/table[1]') %>%
html_table()
firm <- firm[[1]] head(firm)

You can try with the function in the XML package called readHTMLTable that should download all the tables in the page and already format it into a data.frame.
library(XML)
all_tables = readHTMLTable("http://wszw.hzs.mofcom.gov.cn/fecp/fem/corp/fem_cert_stat_view_list.jsp")
Then since there is only one table in the page you linked it should be enough to get the first element so:
target_table = all_tables[[1]]

Related

How to scrape data from a website with similar “#” urls in menu tabs using R?

I want to scrape stock data from other tabs of the following website: http://www.tsetmc.com/Loader.aspx?ParTree=151311&i=35178706978554988 But all of them have same urls. when I try to use "rvest" library function such as read_html() , html_nodes() and html_text() , I can just scrape data from the main tab. Switching between tabs get same results. I tried to use following code, but still couldn't get appropriate results.
Previously I could extract some info such as "InsCode" and "ZTitad" stored in the "" section using "rvest". But because all other tabs' data are not written in the "html-source" section, I didn't have any idea what to do.
#Scraping Libraries
library(rvest)
library(jsonlite)
#Target website
my_url<-"http://www.tsetmc.com/Loader.aspx?ParTree=151311&i=35178706978554988"
pagesource <- read_html(my_url)
content<- pagesource %>% html_node("script") %>% html_text()
data <- fromJSON(content)
Ultimately I want to export "حقیقی-حقوقی" tab data into a data-frame to continue my other analysis.

Scraping table into R

Trying to scrape data into R from table on website but having some trouble in finding the table. The table seems to not have any distinguishable class or id as it is labeled as . I have scraped data from many tables but have never encountered this type of situation. Still a novice and have done some searches but nothing has worked as of yet.
I have tried the following R commands but they only produce a result with "" and no data scraped. I had some results in assigning the article-template class but it is jumbled so I know that the data exists on the html_read. I am just not sure the proper way to program for this type of website.
Here is what I have so far in R:
library(xml2)
library(rvest)
website <- read_html("https://baseballsavant.mlb.com/daily_matchups")
dailymatchup <- website %>%
html_nodes(xpath = '//*[#id="leaderboard"]') %>%
html_text()
Essentially pulling the data from the table should allow a quick and organizable data frame. This is the ultimate target I am seeking.

download/scraping table from htmlTable

I am trying to download a csv from
https://oceanwatch.pifsc.noaa.gov/erddap/griddap/goes-poes-1d-ghrsst-RAN.html
or I am trying to scrape data frame the html table output from the website found here
https://oceanwatch.pifsc.noaa.gov/erddap/griddap/goes-poes-1d-ghrsst-RAN.htmlTable?analysed_sst[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-6.975):1:(42.025)][(179.025):1:(238.025)],analysis_error[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-6.975):1:(42.025)][(179.025):1:(238.025)],mask[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-6.975):1:(42.025)][(179.025):1:(238.025)],sea_ice_fraction[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-6.975):1:(42.025)][(179.025):1:(238.025)]
I have tried to scrape the data using
library(rvest)
url <- read_html("https://oceanwatch.pifsc.noaa.gov/erddap/griddap/goes-
poes-1d-ghrsst-RAN.htmlTable?analysed_sst[(2019-02-09T12:00:00Z):1:(2019-
02-09T12:00:00Z)][(-7):1:(42)][(179):1:(238)],analysis_error[(2019-02-
09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-7):1:(42)][(179):1:
(238)],mask[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-7):1:(42)]
[(179):1:(238)],sea_ice_fraction[(2019-02-09T12:00:00Z):1:(2019-02-
09T12:00:00Z)][(-7):1:(42)][(179):1:(238)]")
test <- url %>%
html_nodes(xpath='table.erd.commonBGColor.nowrap') %>%
html_text()
And I have tried to download a csv with
download.file(url, destfile = "~/Documents/test.csv", mode = 'wb')
But neither worked either. The download.file function downloaded a csv with the node description. and the rvest method gave me a huge character string on my macbook and a null data frame on my windows. I have also tried to use selectorgadget (chrome extension) to obtain only data i need, but selectorgadget does not seem to work on the htmlTable
I managed to find workaround solution using htmltab package, not sure if it's optimal though, it's big data frame for a webpage, took a while to load in data frame. table[2] is for actual table, as there're 2 html tables in link you've given.
url1 <- "https://oceanwatch.pifsc.noaa.gov/erddap/griddap/goes-poes-1d-ghrsst-RAN.htmlTable?analysed_sst[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-6.975):1:(42.025)][(179.025):1:(238.025)],analysis_error[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-6.975):1:(42.025)][(179.025):1:(238.025)],mask[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-6.975):1:(42.025)][(179.025):1:(238.025)],sea_ice_fraction[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-6.975):1:(42.025)][(179.025):1:(238.025)]"
tbls <- htmltab(url1,which = "//table[2]")
rdf <- as.data.frame(tbls)
let me know if it helps.

R: Scraping multiple tables in URL

I'm learning how to scrape information from websites using httr and XML in R. I'm getting it to work just fine for websites with just a few tables, but can't figure it out for websites with several tables. Using the following page from pro-football-reference as an example: https://www.pro-football-reference.com/boxscores/201609110atl.htm
# To get just the boxscore by quarter, which is the first table:
URL = "https://www.pro-football-reference.com/boxscores/201609080den.htm"
URL = GET(URL)
SnapTable = readHTMLTable(rawToChar(URL$content), stringAsFactors=F)[[1]]
# Return the number of tables:
AllTables = readHTMLTable(rawToChar(URL$content), stringAsFactors=F)
length(AllTables)
[1] 2
So I'm able to scrape info, but for some reason I can only capture the top two tables out of the 20+ on the page. For practice, I'm trying to get the "Starters" tables and the "Officials" tables.
Is my inability to get the other tables a matter of the website's setup or incorrect code?
If it comes down to web scraping in R make intensive use of the package rvest.
While managing to get the html is just about fine - rvest makes use of css selectors - SelectorGadget helps finding a pattern in styling for a particular table which is hopefully unique. Therefore you can extract exactly the tables you are looking for instead of coincidence
To get you started - read the vignette on rvest for more detailed information.
#install.packages("rvest")
library(rvest)
library(magrittr)
# Store web url
fb_url = "https://www.pro-football-reference.com/boxscores/201609080den.htm"
linescore = fb_url %>%
read_html() %>%
html_node(xpath = '//*[#id="content"]/div[3]/table') %>%
html_table()
Hope this helps.

Web Scraping NBA Fantasy Projections - R

There are a number of NBA Fantasy Projections that I would like to scrape in a more streamlined approach. Currently I use a combination of importhtml function in google sheets and simple archaic cut'n'paste.
I use R regularly to scrape other data from the internet, however, I can't manage to get these tables to scrape. The tables I am having trouble with are located at three separate addresses (1 table per page), they are:
1) http://www.sportsline.com/nba/player-projections/player-stats/all-players/
2) https://swishanalytics.com/optimus/nba/daily-fantasy-projections
3) http://www.sportingcharts.com/nba/dfs-projections/
For all my other scraping activities I use packages rvest and xml. Following the same process I've tried both methods listed below which result in the outputs shown. I'm sure this has something to do with the format of the table on the website, however I haven't been able to find something that can help me.
Method 1
library(XML)
projections1 <- readHTMLTable("http://www.sportsline.com/nba/player-projections/player-stats/all-players/")
projections2 <- readHTMLTable("https://swishanalytics.com/optimus/nba/daily-fantasy-projections")
projections3 <- readHTMLTable("http://www.sportingcharts.com/nba/dfs-projections/")
Output
projections1
named list()
projections2
named list()
Warning message:
XML content does not seem to be XML: 'https://swishanalytics.com/optimus/nba/daily-fantasy-projections'
projections3 - I get the headers of the table but not the content of the table.
Method 2
library(rvest)
URL <- "http://www.sportsline.com/nba/player-projections/player-stats/all-players/"
projections1 <- URL %>%
read_html %>%
html_nodes("table") %>%
html_table(trim=TRUE,fill=TRUE)
URL <- "https://swishanalytics.com/optimus/nba/daily-fantasy-projections"
projections2 <- URL %>%
read_html %>%
html_nodes("table") %>%
html_table(trim=TRUE,fill=TRUE)
URL <- "http://www.sportingcharts.com/nba/dfs-projections/"
projections3 <- URL %>%
read_html %>%
html_nodes("table") %>%
html_table(trim=TRUE,fill=TRUE)
Output
projections1
list()
projections2 - I get the headers of the table but not the content of the table.
projections3 - I get the headers of the table but not the content of the table.
If anybody could point me in the right direction it would be greatly appreciated.
the content of the table is generated by javascript, so readHTMLTable and read_html find nothing, you can find the table as below
projections1: link
import requests
url = 'http://www.sportsline.com/sportsline-web/service/v1/playerProjections?league=nba&position=all-players&sourceType=FD&game=&page=PS&offset=0&max=25&orderField=&optimal=false&release2Ver=true&auth=3'
r = requests.get(url)
print r.json()
projections2: view-source:https://swishanalytics.com/optimus/nba/daily-fantasy-projections Line 1181
import requests
url = 'https://swishanalytics.com/optimus/nba/daily-fantasy-projections'
r = requests.get(url)
text = r.content
print eval(text.split('this.players = ')[1].split(';')[0])
projections3: view-source Line 918

Resources