How can I do web scraping of USDA FoodData Central? - R

I am trying to scrape a table at this URL using R. I tried the code below, but it returned xml_missing. How can I retrieve the nutrition table at this URL?
library(rvest)
library(tidyverse)
url <- "https://fdc.nal.usda.gov/fdc-app.html#/food-details/2237774/nutrients"
read_html(url) %>% html_element(xpath = '//*[@id="nutrients-table"]')
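Note: the nutrients table on this page is rendered client-side by JavaScript, so read_html() never sees a #nutrients-table node in the static HTML. A minimal sketch of an alternative, using the FoodData Central REST API with httr and jsonlite (assumes a free api.data.gov API key; 2237774 is the fdcId from the page URL, and the flattened column names may vary by food type):
library(httr)
library(jsonlite)
res <- GET(
  "https://api.nal.usda.gov/fdc/v1/food/2237774",
  query = list(api_key = "YOUR_API_KEY")  # placeholder; get a free key at api.data.gov
)
food <- fromJSON(content(res, as = "text", encoding = "UTF-8"), flatten = TRUE)
# foodNutrients should correspond to the rows of the on-page table
nutrients <- food$foodNutrients
names(nutrients)  # inspect; the flattened columns typically include nutrient.name and amount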

Related

How to do web scraping using R when the page redirects to another URL?

I am a new user of the rvest package in R, trying to conduct web scraping on the Marriott website.
I would like to make a list of the names and prices of Marriott hotels in Japan from the URL: https://www.marriott.com/search/submitSearch.mi?destinationAddress.destination=Japan&destinationAddress.country=JP&searchType=InCity&filterApplied=false.
What I have done is as below:
# libraries
library(rvest)
library(dplyr)
library(stringr)  # needed for str_subset()
# get the url
url <- "https://www.marriott.com/hotel-search.mi"
html <- read_html(url)  # read the webpage
# pull out the links that point at the Japan search results
links <- html %>%
  html_nodes(".js-region-pins") %>%
  html_attr("href") %>%
  str_subset("Japan")
Here, links contains the URL of the page that lists the 47 Japanese hotels, as below:
links
[1] "/search/submitSearch.mi?destinationAddress.destination=Japan&destinationAddress.country=JP&searchType=InCity&filterApplied=false"
Then,
url_japan <- paste0("https://www.marriott.com", links)
url_japan
[1] "https://www.marriott.com/search/submitSearch.mi?destinationAddress.destination=Japan&destinationAddress.country=JP&searchType=InCity&filterApplied=false"
Here is the problem I came across.
When we jump to url_japan, the loaded page is redirected to another URL (https://www.marriott.com/search/findHotels.mi).
In this case, how can I continue web scraping with the rvest package?
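A minimal sketch of one approach: rvest's session() follows HTTP redirects like a browser, so you can land on the final findHotels.mi page and keep scraping from there. The selectors below are hypothetical; inspect the redirected page for the real ones, and note that content injected by JavaScript still won't be visible:
library(rvest)
sess <- session(url_japan)   # follows the redirect chain like a browser
sess$url                     # the final URL after redirection
page <- read_html(sess)      # parse the page you actually landed on
# hypothetical selectors -- inspect the redirected page for the real ones
hotels <- page %>% html_elements(".property-record")
hotel_names <- hotels %>% html_element(".property-name") %>% html_text2()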

How to scrape data from a website with similar "#" URLs in menu tabs using R?

I want to scrape stock data from the other tabs of the following website: http://www.tsetmc.com/Loader.aspx?ParTree=151311&i=35178706978554988 But all of the tabs share the same URL. When I use rvest functions such as read_html(), html_nodes() and html_text(), I can only scrape data from the main tab; switching tabs gives the same results. I tried the following code, but still couldn't get appropriate results.
Previously I could extract some info, such as "InsCode" and "ZTitad", stored in the <script> section using rvest. But because the other tabs' data is not written in the HTML source, I had no idea what to do.
# scraping libraries
library(rvest)
library(jsonlite)
# target website
my_url <- "http://www.tsetmc.com/Loader.aspx?ParTree=151311&i=35178706978554988"
pagesource <- read_html(my_url)
content <- pagesource %>% html_node("script") %>% html_text()
data <- fromJSON(content)  # fails: the script body is JavaScript, not JSON
Ultimately I want to export the "حقیقی-حقوقی" (individual vs. institutional trades) tab data into a data frame to continue my other analysis.
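A sketch of the usual workaround: the tab contents arrive via XHR calls rather than in the initial HTML, so you can replay the request the tab makes. The endpoint below is a placeholder, not a documented API; copy the real request URL from the browser's network tab, and inspect the response format before parsing:
library(httr)
ins_code <- "35178706978554988"
xhr_url <- paste0("http://www.tsetmc.com/tsev2/data/clienttype.aspx?i=", ins_code)  # hypothetical endpoint
res <- GET(xhr_url, user_agent("Mozilla/5.0"))
raw <- content(res, as = "text", encoding = "UTF-8")
# assuming semicolon-separated rows of comma-separated fields; verify against the real response
rows <- strsplit(strsplit(raw, ";", fixed = TRUE)[[1]], ",", fixed = TRUE)
df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)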

Web scraping data for use in RStudio

I want to pull the data from this server site into RStudio. I am new to R, so I'm not at all sure what is possible. Any help with the coding to achieve this would be appreciated.
http://hbrcdata.hbrc.govt.nz/hydrotel/cgi-bin/hydwebserver.cgi/points/details?point=679&samples=true
install.packages("rvest")
library('rvest')
install.packages('XML')
library('XML')
library("httr")
#Specifying the url for desired website to be scrapped
url <- 'http://hbrcdata.hbrc.govt.nz/hydrotel/cgi-
bin/hydwebserver.cgi/points/samples?point=679'
webpage <- read_html(url)
tbls <- html_nodes(webpage, "table")
head(tbls)
tbls_ls <- webpage %>%
html_nodes("table") %>%
html_table(fill = TRUE)
tbl <- as.data.frame(tbls_ls)
View(tbl)
I have tried to fetch a few other tables from the given website, and that works fine.
For example, rainfall depth:
http://hbrcdata.hbrc.govt.nz/hydrotel/cgi-bin/hydwebserver.cgi/points/details?point=63
A small modification to the URL (details?point=63 becomes samples?point=63) will fetch the actual table; the rest of the code remains the same:
url <- 'http://hbrcdata.hbrc.govt.nz/hydrotel/cgi-bin/HydWebServer.cgi/points/samples?point=63'
For more help you can refer to this page:
http://bradleyboehmke.github.io/2015/12/scraping-html-tables.html
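Putting that together, a minimal end-to-end sketch for the rainfall-depth samples (assuming the site still serves the table as static HTML):
library(rvest)
url <- 'http://hbrcdata.hbrc.govt.nz/hydrotel/cgi-bin/HydWebServer.cgi/points/samples?point=63'
tbl <- read_html(url) %>%
  html_nodes("table") %>%
  html_table(fill = TRUE) %>%
  as.data.frame()
head(tbl)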

Getting an email address in web scraping through rvest

Hi, I am trying to get a little information about this webpage through web scraping in R using the rvest package. I am getting the name and everything else, but I am unable to get the email address, i.e. info@brewhemia.co.uk. If I view the read_html result as text, I don't see the email address in the parsed HTML. Can anybody please help? I am new to web scraping, but I know R.
link <- 'https://food.list.co.uk/place/22191-brewhemia-edinburgh/'
page <- read_html(link)
name_html <- html_nodes(page,'.placeHeading')
business_adr <- html_text(adr_html)
tel_html <- html_nodes(page,'.value')
business_tel <- html_text(tel_html)
The email address is in an 'a' HTML tag, but I am not able to extract it.
You need a JavaScript engine here to process the JS code. Luckily, R has V8.
Modify your code after installing the V8 package:
library(rvest)
library(V8)
link <- 'https://food.list.co.uk/place/22191-brewhemia-edinburgh/'
page <- read_html(link)
name_html <- html_nodes(page, '.placeHeading')
adr_html <- html_nodes(page, '.address')  # assumed selector, as above
business_adr <- html_text(adr_html)
tel_html <- html_nodes(page, '.value')
business_tel <- html_text(tel_html)
# the email is written into the page by an inline script; grab that script's text
emailjs <- page %>% html_nodes('li') %>% html_nodes('script') %>% html_text()
# strip the document.write() wrapper and evaluate the remaining JS in V8
ct <- v8()
read_html(ct$eval(gsub('document.write', '', emailjs))) %>% html_text()
Output:
> read_html(ct$eval(gsub('document.write','',emailjs))) %>% html_text()
[1] "info#brewhemia.co.uk"

Web Scraping NBA Fantasy Projections - R

There are a number of NBA fantasy projections that I would like to scrape in a more streamlined way. Currently I use a combination of the importhtml function in Google Sheets and simple, archaic cut-and-paste.
I use R regularly to scrape other data from the internet; however, I can't manage to get these tables to scrape. The tables I am having trouble with are located at three separate addresses (one table per page):
1) http://www.sportsline.com/nba/player-projections/player-stats/all-players/
2) https://swishanalytics.com/optimus/nba/daily-fantasy-projections
3) http://www.sportingcharts.com/nba/dfs-projections/
For all my other scraping activities I use the rvest and XML packages. Following the same process, I've tried both methods listed below, which produce the outputs shown. I'm sure this has something to do with how the table is rendered on the website, but I haven't been able to find anything that helps.
Method 1
library(XML)
projections1 <- readHTMLTable("http://www.sportsline.com/nba/player-projections/player-stats/all-players/")
projections2 <- readHTMLTable("https://swishanalytics.com/optimus/nba/daily-fantasy-projections")
projections3 <- readHTMLTable("http://www.sportingcharts.com/nba/dfs-projections/")
Output
projections1
named list()
projections2
named list()
Warning message:
XML content does not seem to be XML: 'https://swishanalytics.com/optimus/nba/daily-fantasy-projections'
projections3 - I get the headers of the table but not the content of the table.
Method 2
library(rvest)
URL <- "http://www.sportsline.com/nba/player-projections/player-stats/all-players/"
projections1 <- URL %>%
read_html %>%
html_nodes("table") %>%
html_table(trim=TRUE,fill=TRUE)
URL <- "https://swishanalytics.com/optimus/nba/daily-fantasy-projections"
projections2 <- URL %>%
read_html %>%
html_nodes("table") %>%
html_table(trim=TRUE,fill=TRUE)
URL <- "http://www.sportingcharts.com/nba/dfs-projections/"
projections3 <- URL %>%
read_html %>%
html_nodes("table") %>%
html_table(trim=TRUE,fill=TRUE)
Output
projections1
list()
projections2 - I get the headers of the table but not the content of the table.
projections3 - I get the headers of the table but not the content of the table.
If anybody could point me in the right direction it would be greatly appreciated.
The content of the table is generated by JavaScript, so readHTMLTable and read_html find nothing. You can find the data as below.
projections1: the data comes from a JSON endpoint, visible in the browser's network tab:
import requests
url = 'http://www.sportsline.com/sportsline-web/service/v1/playerProjections?league=nba&position=all-players&sourceType=FD&game=&page=PS&offset=0&max=25&orderField=&optimal=false&release2Ver=true&auth=3'
r = requests.get(url)
print(r.json())
projections2: view-source:https://swishanalytics.com/optimus/nba/daily-fantasy-projections, line 1181:
import requests
url = 'https://swishanalytics.com/optimus/nba/daily-fantasy-projections'
r = requests.get(url)
text = r.text
# the table is embedded in the page source as "this.players = [...];"
print(eval(text.split('this.players = ')[1].split(';')[0]))
projections3: view-source, line 918 (the same embedded-data approach applies).
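For R users, a rough equivalent of the Python sketches above, using httr and jsonlite. The endpoint URL is the one shown for projections1; the string-splitting for projections2 mirrors the Python and assumes the embedded literal is valid JSON:
library(httr)
library(jsonlite)
# projections1: hit the JSON endpoint directly
api_url <- "http://www.sportsline.com/sportsline-web/service/v1/playerProjections?league=nba&position=all-players&sourceType=FD&game=&page=PS&offset=0&max=25&orderField=&optimal=false&release2Ver=true&auth=3"
proj1 <- fromJSON(content(GET(api_url), as = "text", encoding = "UTF-8"))
# projections2: pull the "this.players = [...]" literal out of the page source
page_txt <- content(GET("https://swishanalytics.com/optimus/nba/daily-fantasy-projections"),
                    as = "text", encoding = "UTF-8")
players_js <- strsplit(page_txt, "this.players = ", fixed = TRUE)[[1]][2]
players_js <- strsplit(players_js, ";", fixed = TRUE)[[1]][1]
proj2 <- fromJSON(players_js)  # may need cleanup if the literal is not strict JSON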
