I am trying to download a csv from
https://oceanwatch.pifsc.noaa.gov/erddap/griddap/goes-poes-1d-ghrsst-RAN.html
Alternatively, I am trying to scrape the HTML table output from the following page into a data frame:
https://oceanwatch.pifsc.noaa.gov/erddap/griddap/goes-poes-1d-ghrsst-RAN.htmlTable?analysed_sst[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-6.975):1:(42.025)][(179.025):1:(238.025)],analysis_error[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-6.975):1:(42.025)][(179.025):1:(238.025)],mask[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-6.975):1:(42.025)][(179.025):1:(238.025)],sea_ice_fraction[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-6.975):1:(42.025)][(179.025):1:(238.025)]
I have tried to scrape the data using
library(rvest)
url <- read_html("https://oceanwatch.pifsc.noaa.gov/erddap/griddap/goes-poes-1d-ghrsst-RAN.htmlTable?analysed_sst[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-7):1:(42)][(179):1:(238)],analysis_error[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-7):1:(42)][(179):1:(238)],mask[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-7):1:(42)][(179):1:(238)],sea_ice_fraction[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-7):1:(42)][(179):1:(238)]")
test <- url %>%
html_nodes(xpath='table.erd.commonBGColor.nowrap') %>%
html_text()
And I have tried to download a csv with
download.file(url, destfile = "~/Documents/test.csv", mode = 'wb')
But neither worked. download.file produced a CSV containing only the node description, and the rvest method gave me a huge character string on my MacBook and a null data frame on Windows. I have also tried using SelectorGadget (a Chrome extension) to pick out only the data I need, but SelectorGadget does not seem to work on the htmlTable page.
I managed to find a workaround using the htmltab package. I'm not sure it's optimal: this is a big data frame for a web page, and it took a while to load. The XPath //table[2] selects the actual data table, since the link you've given contains two HTML tables.
url1 <- "https://oceanwatch.pifsc.noaa.gov/erddap/griddap/goes-poes-1d-ghrsst-RAN.htmlTable?analysed_sst[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-6.975):1:(42.025)][(179.025):1:(238.025)],analysis_error[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-6.975):1:(42.025)][(179.025):1:(238.025)],mask[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-6.975):1:(42.025)][(179.025):1:(238.025)],sea_ice_fraction[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-6.975):1:(42.025)][(179.025):1:(238.025)]"
library(htmltab)
tbls <- htmltab(url1, which = "//table[2]")
rdf <- as.data.frame(tbls)
Let me know if it helps.
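Since the original goal was a CSV, it may be worth noting that ERDDAP griddap URLs carry the response type as a file extension, so the same query can usually be requested as .csv instead of .htmlTable. The sketch below is only an illustration under assumptions: erddap_as_csv and read_erddap_csv are hypothetical helper names, the example.org URL and the toy rows are made up, and it assumes the CSV response has column names on the first line and units on the second.

```r
# Swap the response type of an ERDDAP griddap URL from .htmlTable to .csv.
# (Hypothetical helper; assumes the URL contains ".htmlTable?".)
erddap_as_csv <- function(u) sub(".htmlTable?", ".csv?", u, fixed = TRUE)

# Parse ERDDAP-style CSV lines: line 1 holds column names, line 2 holds units,
# and the data start on line 3 (an assumption about the response layout).
read_erddap_csv <- function(lines) {
  header <- strsplit(lines[1], ",", fixed = TRUE)[[1]]
  read.csv(text = lines[-(1:2)], header = FALSE, col.names = header)
}

# Made-up URL and rows, used only to exercise the helpers offline.
demo_url <- "https://example.org/erddap/griddap/someDataset.htmlTable?var[(0):1:(1)]"
csv_url <- erddap_as_csv(demo_url)

toy <- c("time,latitude,analysed_sst",
         "UTC,degrees_north,degree_C",
         "2019-02-09T12:00:00Z,-6.975,28.1",
         "2019-02-09T12:00:00Z,-6.925,28.3")
sst <- read_erddap_csv(toy)
```

With a real query, something like read_erddap_csv(readLines(erddap_as_csv(url1))) would be the equivalent call, though a request this large may still take a while.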
Related
I am trying to download data from this URL:
https://migration.iom.int/datasets/europe-%E2%80%94-mixed-migration-flows-europe-quarterly-overview-april-june-2021
The page offers the dataset as an Excel file, and the download link is https://migration.iom.int/system/tdf/datasets/Q2%202021%20Mixed%20Migration%20Flows%20to%20Europe%20%28April%20-%20June%202021%29.xlsx?file=1&type=node&id=12261
So I want to download all this data in Excel format directly into R.
library(rvest)
URL <- "https://migration.iom.int/system/tdf/datasets/Q2%202021%20Mixed%20Migration%20Flows%20to%20Europe%20%28April%20-%20June%202021%29.xlsx?file=1&type=node&id=12261"
pg <- read_html(URL)
html_attr(html_nodes(pg, "download"), "href")
But I have made a mistake somewhere and nothing is downloaded. Can anybody help me download this data into R?
I personally would go about it in the following way.
Download the data to a specified destination, then read the Excel file from that location. An idea would be:
library(readxl)
destinationFile <- tempfile(fileext = ".xlsx")
download.file(URL, destinationFile, mode = "wb")  # URL as defined in the question; binary mode keeps the .xlsx intact
fileInR <- read_excel(destinationFile)
However, a simple Google search for both steps (downloading and reading an Excel file in R) should provide you with plenty more options.
Basically, on baseball-reference.com there is a way to switch the tables to csv format, but not actually a .csv link. I am trying to see if the csv formatted text on the webpage can be converted to a .csv file in order to make it a usable table.
I tried to use the normal 'rvest' package with the following code
#Los Angeles Dodgers
dodgerBatting <- read_html('https://www.baseball-reference.com/teams/LAD/2019.shtml')
dodgerCSV <- dodgerBatting%>%
html_nodes('#csv_team_batting')%>%
html_text()
print(head(dodgerCSV))
The result is an empty character vector:
character(0)
You can get the tables present on the webpage using the html_table function in rvest.
library(rvest)
url <- "https://www.baseball-reference.com/teams/LAD/2019.shtml"
out_table <- url %>% read_html %>% html_table()
This returns a list of data frames; individual data frames can be accessed with out_table[[1]], out_table[[2]], and so on. You might need to do some cleaning before using them.
If you need them in CSV format, write.csv will write them to disk:
write.csv(out_table[[1]], "/path/of/the/file.csv")
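If every table in the list is wanted, the same write.csv call can be looped over the list. This is a minimal sketch, not part of the original answer: write_tables is a hypothetical helper, and toy_tables is a made-up stand-in for the list returned by html_table().

```r
# Write every data frame in a list to its own CSV file and return the paths.
write_tables <- function(tables, dir, prefix = "table") {
  paths <- file.path(dir, sprintf("%s_%02d.csv", prefix, seq_along(tables)))
  for (i in seq_along(tables)) {
    write.csv(tables[[i]], paths[i], row.names = FALSE)
  }
  paths
}

# Toy stand-in for out_table (made-up values, not scraped data).
toy_tables <- list(
  data.frame(Name = c("A", "B"), HR = c(10, 20)),
  data.frame(Name = "C", ERA = 3.5)
)
out_dir <- tempdir()
paths <- write_tables(toy_tables, out_dir)
```

With the real list, write_tables(out_table, "/path/of/the") would produce table_01.csv, table_02.csv, and so on in that directory.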
My question has two parts, as I explored two methods in this exercise but succeeded with neither. I would greatly appreciate any help.
[PART 1:]
I am attempting to scrape data from a page on the Singapore Stock Exchange website, https://www2.sgx.com/derivatives/negotiated-large-trade, which contains data stored in a table. I have some basic knowledge of scraping data with rvest. However, using the inspector in Chrome, I found the HTML hierarchy much more complex than I expected. I can see that the data I want is under <div class="table-container">, and here's what I've tried:
library(rvest)
library(httr)
library(XML)
SGXurl <- "https://www2.sgx.com/derivatives/negotiated-large-trade"
SGXdata <- read_html(SGXurl, stringsASfactors = FALSE)
html_nodes(SGXdata,".table-container")
However, the code picks up nothing, and I doubt I'm using it correctly.
[PART 2:]
Then I realized there is a small "download" button on the page that downloads exactly the data file I want in .csv format. So I thought I would write some code to mimic the download button. I found the question "Using R to "click" a download file button on a webpage", but I was unable to get it to work with some modifications to that code.
There are a few filters on the page. Mostly I am interested in downloading data for a particular business day while leaving the other filters blank, so I tried writing the following function:
library(httr)
library(rvest)
library(purrr)
library(dplyr)
crawlSGXdata = function(date){
POST("https://www2.sgx.com/derivatives/negotiated-large-trade",
body = NULL,
encode = "form",
write_disk("SGXdata.csv")) -> resfile
res = read.csv(resfile)
return(res)
}
I intended to pass the function input "date" in the "body" argument, but I could not figure out how to do that, so I started with "body = NULL", assuming it applies no filtering. However, the result is still unsatisfactory. The downloaded file is basically empty apart from the following error:
Request Rejected
The requested URL was rejected. Please consult with your administrator.
Your support ID is: 16783946804070790400
The content is loaded dynamically from an API call that returns JSON. You can find the call in the network tab of your browser's dev tools.
The following returns that content. I find the total number of pages of results, then loop over them, combining the data frame returned from each call into one final data frame containing all results.
library(jsonlite)
url <- 'https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode&category=futures&businessdatestart=20190708&businessdateend=20190708&pagestart=0&pageSize=250'
r <- jsonlite::fromJSON(url)
num_pages <- r$meta$totalPages
df <- r$data
url2 <- 'https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode&category=futures&businessdatestart=20190708&businessdateend=20190708&pagestart=placeholder&pageSize=250'
if(num_pages > 1){
for(i in seq(1, num_pages)){
newUrl <- gsub("placeholder", i , url2)
newdf <- jsonlite::fromJSON(newUrl)$data
df <- rbind(df, newdf)
}
}
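The paging logic can be tried offline by separating it from the HTTP call. The sketch below is an illustration under assumptions, not the API's documented behaviour: it assumes pages are zero-indexed (so the last page is totalPages - 1), fetch_all_pages is a hypothetical helper, and fake_fetch is a made-up stand-in for jsonlite::fromJSON() on a URL with a given pagestart.

```r
# Collect all pages from a zero-indexed paged source.
# fetch_page(i) must return list(meta = list(totalPages = n), data = <data.frame>).
fetch_all_pages <- function(fetch_page) {
  first <- fetch_page(0)
  pages <- list(first$data)
  n <- first$meta$totalPages
  if (n > 1) {
    # Page 0 is already fetched, so request pages 1 .. n - 1.
    for (i in seq_len(n - 1)) {
      pages[[i + 1]] <- fetch_page(i)$data
    }
  }
  do.call(rbind, pages)
}

# Hypothetical stand-in for the API: 3 pages with 2 rows each.
fake_fetch <- function(i) {
  list(meta = list(totalPages = 3),
       data = data.frame(page = i, row = 1:2))
}
all_rows <- fetch_all_pages(fake_fetch)
```

Collecting the pages in a list and calling rbind once avoids growing the data frame on every iteration, which matters mostly when there are many pages.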
I'm trying to scrape the URL of the first website returned by a Google search using the rvest package in R.
I seem to be able to get the URL into an XML file, but I can't transfer the right part of the XML file into a data frame.
I've used the code below.
url <- 'https://www.google.co.nz/search?rlz=1C1GCEB_enNZ790NZ790&ei=P4jsW6fbL4_RrQHd_K3wBw&q=auckland+university+of+technology+lifespan+development+and+communication+heal504&oq=auckland+university+of+technology+lifespan+development+and+communication+heal504&gs_l=psy-ab.3...20931.45570..45696...3.0..2.284.15672.0j63j18......0....1..gws-wiz.......0j0i71j35i39j0i67j0i131j0i131i67j0i20i263j0i13j0i22i10i30j0i22i30j33i21j33i160j33i22i29i30j33i10.xTnG49NmCBs'
googleurl <- read_html(url)
address <- html_nodes(googleurl,'.r')
address <- html_text(address)
urlname <- data.frame(address)
I can see the URL when I open the XML file in R as pictured in the attached image. However, when I transfer this to a data frame using html_text the relevant URL seems to be lost.
Screenshot image
html_text() returns the text of an element. To get the URL, you need to select the a tag and use html_attr():
address <- html_nodes(googleurl,'.r>a')
address <- html_attr(address, "href")
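The difference between the two functions is easy to see on a small inline snippet. This is only a sketch: the HTML string and the example.org link are made up, and the .r class simply mirrors the selector used in the question.

```r
library(rvest)

# Made-up HTML fragment with the same shape as the selector above expects.
page <- read_html('<div class="r"><a href="https://example.org/page">Example title</a></div>')

links <- html_nodes(page, ".r > a")
titles <- html_text(links)          # visible link text only
hrefs <- html_attr(links, "href")   # value of the href attribute
```

Here titles holds the link text and hrefs holds the URL, which is why html_text() alone loses the address.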
I am a beginner in scraping data from website. It seems difficult for me to interpret the structure of html using XML or other packages.
Can anyone help me to download the data from this website?
http://wszw.hzs.mofcom.gov.cn/fecp/fem/corp/fem_cert_stat_view_list.jsp
It is about investment from China, and the page is in Chinese.
What I've tried so far:
library("rvest")
url <- "http://wszw.hzs.mofcom.gov.cn/fecp/fem/corp/fem_cert_stat_view_list.jsp"
firm <- url %>%
html() %>%
html_nodes(xpath='//*[@id="Grid1MainLayer"]/table[1]') %>%
html_table()
firm <- firm[[1]]
head(firm)
You can try the readHTMLTable function from the XML package, which should download all the tables on the page and format each one as a data.frame.
library(XML)
all_tables = readHTMLTable("http://wszw.hzs.mofcom.gov.cn/fecp/fem/corp/fem_cert_stat_view_list.jsp")
Then, since there is only one table on the page you linked, taking the first element should be enough:
target_table = all_tables[[1]]