Scrape webpage using R and Chrome

I am trying to pull the table from this website into R using the selector path from Chrome's inspector, but it does not work. Could you help me with that? Thanks.
library(rvest)
library(XML)
url <- "https://seekingalpha.com/symbol/MNHVF/profitability"
webpage <- read_html(url)
rank_data_html <- html_nodes(webpage, 'section#cresscap') # table.cresscap-table
rank_data <- html_table(rank_data_html)
rank_data1 <- rank_data[[1]]

The data comes from an additional XHR call the page makes dynamically. You can request that endpoint yourself and handle the JSON response with jsonlite: extract the relevant list of lists and use dplyr's bind_rows to build your output. You can rename the columns to match those on the page if you want.
library(jsonlite)
library(dplyr)
# JSON endpoint the page calls via XHR (visible in the browser's Network tab)
data <- jsonlite::read_json('https://seekingalpha.com/symbol/MNHVF/cresscap/fields_ratings?category_id=4&sa_pro=false')
# each element of data$fields becomes one row of the on-page table
df <- bind_rows(data$fields)
head(df)
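If you want the column names to mirror the page, a dplyr::rename along these lines works; the names below are placeholders, since the exact field names depend on what the JSON response contains:
# hypothetical mapping -- inspect names(df) first and adjust accordingly
df <- df %>%
  rename(Metric = name) # placeholder: rename whichever columns you need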

Related

R Curl is it possible to wait several seconds before Scrape Javascript content

I am using RCurl to scrape sentiment data, but I need it to wait several seconds before scraping. This is my initial code:
library(stringr)
library(curl)
links <- "https://www.dailyfx.com/sentiment"
con <- curl(links)
open(con)
html_string <- readLines(con, n = 3000)
html_string[1580:1700] # the data value attribute is "--" at this point
How do I add the wait properly?
Special thanks to @MrFlick for pointing out the situation:
curl will only pull the source code for that web page. The data that
is shown on that page is loaded via javascript after the page loads;
it is not contained in the page source. If you want to interact with a
page that uses javascript, you'll need to use something like RSelenium
instead. Or you'll need to reverse engineer the javascript to see
where the data is coming from and then perhaps make a curl request to
the data endpoint directly rather than the HTML page
With that said, I used RSelenium to accomplish this in the way I wanted:
library(RSelenium)
library(rvest)
library(tidyverse)
library(stringr)
rD <- rsDriver(browser="chrome", verbose=F, chromever = "103.0.5060.134")
remDr <- rD[["client"]]
remDr$navigate("https://www.dailyfx.com/sentiment")
Sys.sleep(10) # give the page time to load fully
html <- remDr$getPageSource()[[1]]
html_obj <- read_html(html)
# take the buy and sell sentiment for a specific asset
buy_sentiment <- html_obj %>%
html_nodes(".dfx-technicalSentimentCard__netLongContainer") %>%
html_children()
buy_sentiment <- as.character(buy_sentiment[[15]])
buy_sentiment <- as.numeric(str_match(buy_sentiment, "[0-9]+"))
sell_sentiment <- html_obj %>%
html_nodes(".dfx-technicalSentimentCard__netShortContainer") %>%
html_children()
sell_sentiment <- as.character(sell_sentiment[[15]])
sell_sentiment <- as.numeric(str_match(sell_sentiment, "[0-9]+"))
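When you are done, it is worth closing the browser and stopping the Selenium server so the Chrome process does not keep running in the background:
remDr$close()    # close the browser session
rD$server$stop() # stop the chromedriver/Selenium server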

How to read specific tags using XML2

Problem
I am trying to get all of the URLs in https://www.ato.gov.au/sitemap.xml (N.B. it's a ~9 MB file) using xml2. Any pointers appreciated.
My attempt
library("xml2")
data <- read_xml("https://www.ato.gov.au/sitemap.xml")
xml_find_all(data, ".//loc")
I'm not getting the output I need:
{xml_nodeset (0)}
This isn't using xml2, but I was able to get it with rvest:
library(dplyr)
library(rvest)
url <- "https://www.ato.gov.au/sitemap.xml"
url %>%
read_html() %>%
html_nodes("loc") %>%
html_text()
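For what it's worth, the empty nodeset from the xml2 attempt is most likely a namespace issue: sitemaps declare a default XML namespace, so a plain .//loc XPath matches nothing. Stripping the namespace should make the original approach work; a minimal sketch:
library(xml2)
data <- read_xml("https://www.ato.gov.au/sitemap.xml")
xml_ns_strip(data) # drop the default sitemap namespace
urls <- xml_text(xml_find_all(data, ".//loc"))
head(urls)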
In case you need all the URLs in a data frame, you can use the code below:
library(XML)
library(RCurl)
library(dplyr)
url <- "https://www.ato.gov.au/sitemap.xml"
xData <- getURL(url)
doc <- xmlParse(xData)
data <- xmlToList(doc)
# flatten the list and keep only the entries that look like URLs
a <- as.data.frame(unlist(data))
a <- dplyr::filter(a, grepl("http", `unlist(data)`))
head(a)
The code above will give you a data frame with all of the URLs. You could also use the "Xenu" link-checker software to extract URLs from a website that are not included in the sitemap.
Let me know if you get stuck anywhere.

Rselenium xpath not able to save response

I'm trying to get the stocks from https://www.vinmonopolet.no/
for example this wine https://www.vinmonopolet.no/vmp/Land/Chile/Gato-Negro-Cabernet-Sauvignon-2017/p/295301
Using Rselenium
library('RSelenium')
rD=rsDriver()
remDr =rD[["client"]]
remDr$navigate("https://www.vinmonopolet.no/vmp/Land/Chile/Gato-Negro-Cabernet-Sauvignon-2017/p/295301")
webElement = remDr$findElement('xpath', '//*[@id="product_2953010"]/span[2]')
webElement$clickElement()
Clicking it renders a response, but how do I store it?
Maybe rvest is what you are looking for?
library(rvest)
library(tidyverse)
url <- "https://www.vinmonopolet.no/vmp/Land/Chile/Gato-Negro-Cabernet-Sauvignon-2017/p/295301"
page <- read_html(url)
stock <- page %>%
html_nodes(".product-stock-status div") %>%
html_text()
stock.df <- data.frame(url,stock)
To extract the number, use:
stock.df <- stock.df %>%
mutate(stock=as.numeric(gsub(".*?([0-9]+).*", "\\1", stock)))
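If you need the stock for several products, the same approach scales by mapping over a vector of product URLs; a sketch (the vector here only contains the single example product, so substitute your own):
library(rvest)
library(purrr)
# product pages to check -- placeholder vector
urls <- c("https://www.vinmonopolet.no/vmp/Land/Chile/Gato-Negro-Cabernet-Sauvignon-2017/p/295301")
stock_for <- function(u) {
  page <- read_html(u)
  txt <- page %>% html_nodes(".product-stock-status div") %>% html_text()
  data.frame(url = u, stock = txt, stringsAsFactors = FALSE)
}
stocks <- map_dfr(urls, stock_for)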
I got it to work by just sending the right plain request; no need for R:
https://www.vinmonopolet.no/vmp/store-pickup/1101/pointOfServices?locationQuery=0661&cartPage=false&entryNumber=0&CSRFToken=718228c1-1dc1-41cd-a35e-23197bed7b0c
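That same request can also be fired from R with httr if you prefer; a minimal sketch (the CSRFToken in the URL is session-specific, so you would need to copy a fresh one from your own browser session):
library(httr)
# reuse the store-pickup endpoint above; the token below will likely have expired
resp <- GET("https://www.vinmonopolet.no/vmp/store-pickup/1101/pointOfServices?locationQuery=0661&cartPage=false&entryNumber=0&CSRFToken=718228c1-1dc1-41cd-a35e-23197bed7b0c")
content(resp, as = "text")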

How can I scrape data from this website (multiple webpages) using R?

I am a beginner at scraping data from websites. I find it difficult to interpret the structure of the HTML using XML or other packages.
Can anyone help me to download the data from this website?
http://wszw.hzs.mofcom.gov.cn/fecp/fem/corp/fem_cert_stat_view_list.jsp
It is about the investment from China. The character set is in Chinese.
What I've tried so far:
library("rvest")
url <- "http://wszw.hzs.mofcom.gov.cn/fecp/fem/corp/fem_cert_stat_view_list.jsp"
firm <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="Grid1MainLayer"]/table[1]') %>%
  html_table()
firm <- firm[[1]]
head(firm)
You can try the readHTMLTable function in the XML package, which downloads all the tables on the page and already formats them as data frames:
library(XML)
all_tables = readHTMLTable("http://wszw.hzs.mofcom.gov.cn/fecp/fem/corp/fem_cert_stat_view_list.jsp")
Since there is only one table on the page you linked, it is enough to take the first element:
target_table = all_tables[[1]]
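Since the question mentions multiple pages, the same call can be wrapped in a loop once you know how the site encodes the page number in its URL or form parameters. The query parameter below is purely a placeholder, so treat this as a sketch rather than a working call:
library(XML)
# hypothetical pagination -- replace "?page=" with whatever parameter the site actually uses
base_url <- "http://wszw.hzs.mofcom.gov.cn/fecp/fem/corp/fem_cert_stat_view_list.jsp"
pages <- 1:5
tables <- lapply(pages, function(p) readHTMLTable(paste0(base_url, "?page=", p))[[1]])
all_firms <- do.call(rbind, tables)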

How to get data from Wikipedia page using WikipediR package in R?

I need to fetch a certain part of the data from multiple Wikipedia pages. How can I do that using the WikipediR package? Or is there some better option? To be precise, I only need the part marked below from all the pages.
How can I get that? Any help would be appreciated.
Can you be a little more specific as to what you want? Here's a simple way to import data from the web, and specifically from Wikipedia.
library(rvest)
scotusURL <- "https://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States"
## ********************
## Option 1: Grab the tables from the page and use the html_table function to extract the tables you're interested in.
temp <- scotusURL %>%
  read_html() %>%
  html_nodes("table")
html_table(temp[1]) ## Just the "legend" table
html_table(temp[2]) ## THE MAIN TABLE
Now, if you want to import data from multiple pages that share essentially the same structure and differ only by, say, a page number, try this method:
library(RCurl)
library(XML)
pageNum <- 1:10
url <- "http://www.totaljobs.com/JobSearch/Results.aspx?Keywords=Leadership&LTxt=&Radius=10&RateType=0&JobType1=CompanyType=&PageNum="
urls <- paste0(url, pageNum)
allPages <- lapply(urls, function(x) getURLContent(x)[[1]])
xmlDocs <- lapply(allPages, function(x) XML::htmlParse(x))
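From there you would pull whatever nodes you need out of each parsed document. The XPath below is only a placeholder, since the right selector depends on the markup of the results pages:
# hypothetical extraction step -- adjust the XPath to the actual markup
jobTitles <- lapply(xmlDocs, function(doc) xpathSApply(doc, "//h2", xmlValue))
head(jobTitles[[1]])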
