I am trying to webscrape some recipes for my own personal collection. It works great on some sites because the website structure sometimes easily allows for scraping, but some are harder. This one I have no idea how to deal with:
https://www.koket.se/halloumigryta-med-tomat-linser-och-chili
For the moment, let's just assume I want the ingredients on the left. If I inspect the website it looks like what I want are the two article class="ingredients" chunks. But I can't seem to get there.
I start with the following:
library(rvest)
library(tidyverse)

read_html("https://www.koket.se/halloumigryta-med-tomat-linser-och-chili") %>%
  html_nodes(".recipe-column-wrapper") %>%
  html_nodes(xpath = '//*[@id="react-recipe-page"]')
However, running the above code shows that all of the ingredients are stored in data-item like so:
<div id="react-recipe-page" data-item="{
"chefNames":"<a href='/kockar/siri-barje'>Siri Barje</a>",
"groupedIngredients":[{
"header":"Kokosris",
"ingredients":[{
"name":"basmatiris","unit":"dl","amount":"3","amount_info":{"from":3},"main":false,"ingredient":true
}
<<<and so on>>>
So I am a little bit puzzled, because from inspecting the website everything seems to be neatly placed in things I can extract, but now it's not. Instead, I'd need some serious regular expressions in order to get everything like I want it.
So my question is: am I missing something? Is there some way I can get the contents of the ingredients articles?
(I tried SelectorGadget, but it just gave me "No valid path found".)
You can extract attributes by using html_attr("data-item") from the rvest package.
Furthermore, the data-item attribute looks like JSON, which you can convert to a list using fromJSON() from the jsonlite package:
library(jsonlite)

html <- read_html("https://www.koket.se/halloumigryta-med-tomat-linser-och-chili") %>%
  html_nodes(".recipe-column-wrapper") %>%
  html_nodes(xpath = '//*[@id="react-recipe-page"]')

recipe <- html %>%
  html_attr("data-item") %>%
  fromJSON()
Lastly, the recipe list contains lots of different values that are not relevant, but the ingredients and measurements are in there as well, in the element recipe$ingredients.
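Going one step further, the grouped structure shown in the question can be flattened into a single data frame. A minimal offline sketch, using a hand-made JSON fragment modelled on the data-item snippet above (a hypothetical sample; the real attribute contains many more fields):

```r
library(jsonlite)

# Hypothetical JSON fragment shaped like the data-item attribute:
json <- '{
  "groupedIngredients": [
    {
      "header": "Kokosris",
      "ingredients": [
        {"name": "basmatiris", "unit": "dl", "amount": "3", "main": false, "ingredient": true}
      ]
    },
    {
      "header": "Gryta",
      "ingredients": [
        {"name": "halloumi", "unit": "g", "amount": "200", "main": true, "ingredient": true}
      ]
    }
  ]
}'

recipe <- fromJSON(json)

# fromJSON simplifies each "ingredients" entry to a data frame, so we can
# bind the groups together while keeping the group header as a column:
grouped <- recipe$groupedIngredients
ingredients <- do.call(rbind, Map(function(header, df) {
  df$header <- header
  df
}, grouped$header, grouped$ingredients))

ingredients[, c("header", "name", "amount", "unit")]
```

The same pattern should apply to the full attribute once it has been pulled out with html_attr("data-item").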
I'm trying to obtain the game IDs for each game listed on this page:
https://www.chess.com/member/bogginssloggins
Here's what I'm doing now:
First, I'm downloading the HTML with RSelenium and saving it as htmlfile.txt (the table doesn't render unless you use Selenium)
Then, I'm using RVest to parse the HTML.
Here is my code, skipping the RSelenium part
library(rvest)
html <- read_html("htmlfile.txt")
GameTable <- html %>% html_table() %>% .[[1]]
Unfortunately GameTable doesn't include the game IDs, just the data actually visible in the table. A sample GameID would be something like the link below.
https://www.chess.com/analysis/game/live/9296762565?username=bogginssloggins
These games are very much present in the HTML, but I don't know how to systematically grab them and link them to the corresponding rows of the table. My ideal output would be the data in the table on the webpage (e.g. the players in the game, who won, etc.), but also including a column for the gameID. I believe one of the important things to look for is the "archived-games-link" class in the HTML. There are twenty of those links in the HTML and twenty rows in the table, so it seems like they should match. However, when I run the code below:
"htmlfile.txt" %>% read_html() %>%
html_nodes("[class='archived-games-link']") %>%
html_attr("href")
I get only 18 results returned, even though when I Ctrl+F for "archived-games-link" in the HTML document, 20 results are found.
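For what it's worth, extracting the IDs from the hrefs themselves seems doable; here is a minimal sketch on hand-made HTML shaped like the page (hypothetical markup, with two links standing in for the twenty):

```r
library(rvest)

# Hypothetical minimal HTML mirroring the structure described above:
html <- minimal_html('
  <table><tr><td>gameA</td></tr><tr><td>gameB</td></tr></table>
  <a class="archived-games-link" href="/analysis/game/live/9296762565?username=bogginssloggins"></a>
  <a class="archived-games-link" href="/analysis/game/live/1234567890?username=bogginssloggins"></a>
')

hrefs <- html %>% html_nodes("a.archived-games-link") %>% html_attr("href")

# Pull the numeric game ID out of each href:
game_ids <- sub(".*/live/(\\d+).*", "\\1", hrefs)
game_ids
```

If the link count matched the row count, cbind(GameTable, gameID = game_ids) would give the output I want; the 18 vs. 20 mismatch is exactly what I can't explain.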
I am trying to get a piece of text from a webpage. To simplify my question, let me use @RonakShah's Stack Overflow account as an example and extract the reputation value. With SelectorGadget showing "div, div", I used the following code:
library(rvest)
so <- read_html('https://stackoverflow.com/users/3962914/ronak-shah') %>%
html_nodes("div") %>% html_nodes("div") %>% html_text()
This gave an object so with as many as 307 items.
Then, I turned the object into a dataframe:
so <- as.data.frame(so)
View(so)
Then, I manually went through all items in the data frame until I found the correct value, so$so[69]. My question is how to quickly find the specific target value. In my real case it is a little more complicated to do manually, as there are multiple items with the same values and I need to identify the correct order. Thanks.
You need to find a specific tag and its respective class closer to your target. You can find that using SelectorGadget.
library(rvest)
read_html('https://stackoverflow.com/users/3962914/ronak-shah') %>%
html_nodes("div.grid--cell.fs-title") %>%
html_text()
#[1] "254,328"
As far as scraping Stack Overflow is concerned, it has an API to get information about users/questions/answers. In R, there is a wrapper package around it called stackr (not on CRAN) which makes this very easy.
library(stackr)
data <- stack_users(3962914)
data$reputation
[1] 254328
data has lot of other information as well about the user.
3962914 is the user id of the user you are interested in, which can be found from their profile link (https://stackoverflow.com/users/3962914/ronak-shah).
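For instance, you can pull the id out of the profile link with a one-line regex (a small sketch, assuming the usual /users/&lt;id&gt;/&lt;name&gt; URL layout):

```r
# Extract the numeric user id from a Stack Overflow profile URL:
profile <- "https://stackoverflow.com/users/3962914/ronak-shah"
user_id <- sub(".*/users/(\\d+)/.*", "\\1", profile)
user_id
# [1] "3962914"
```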
I am trying to scrape off the number amounts listed in a set of donation websites. So in this example, I would like to get
$3, $10, $25, $100, $250, $1500, $2800
The xpath indicates that one of them should be
/html/body/div[1]/div[3]/div[2]/div/div[1]/div/div/
form/div/div[1]/div/div/ul/li[2]/label
and the css selector
li.btn--wrapper:nth-child(2) > label:nth-child(1)
Up to the following, I see something in the xml_nodeset:
library(rvest)
url <- "https://secure.actblue.com/donate/pete-buttigieg-announcement-day"
read_html(url) %>% html_nodes(
  xpath = '//*[@id="cf-app-target"]/div[3]/div[2]/div/div[1]/div/div'
)
Then I add the second part of the xpath and it comes up blank. The same happens with
X %>% html_nodes("li")
which gives a bunch of things, but all the StyledButton__StyledAnchorButton-a7s38j-0 kEcVlT nodes come up blank.
I have worked with rvest for a fair bit now, but this one's baffling. And I am not quite sure how RSelenium will help here, although I have knowledge on how to use it for screenshots and clicks. If it helps, the website also refuses to be captured in the wayback machine---there's only the background and nothing else.
I have even tried just taking a screenshot with RSelenium and attempting OCR with tesseract and magick, but while other pages worked, this particular example fails spectacularly, because the text is white and in a rather nonstandard font. Yes, I've tried image_negate and image_resize to see if they helped, but they only showed that relying on OCR is a rather bad idea, as it depends on the screenshot size.
Any advice on how to best extract what I want in this situation? Thanks.
You can use a regex to extract the numbers from a script tag. You get a comma-separated character vector:
library(rvest)
library(stringr)
con <- url('https://secure.actblue.com/donate/pete-buttigieg-announcement-day?refcode=website', "rb")
page = read_html(con)
res <- page %>%
html_nodes(xpath=".//script[contains(., 'preloadedState')]")%>%
html_text() %>% as.character %>%
str_match_all(.,'(?<="amounts":\\[)(\\d+,?)+')
print(res[[1]][,1])
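To see what the pattern does without hitting the network, here it is applied to a made-up fragment shaped like the preloadedState JSON (a hypothetical sample string):

```r
library(stringr)

# Hypothetical fragment of the script contents:
script_text <- '... "amounts":[3,10,25,100,250,1500,2800], "currency":"usd" ...'

# Lookbehind anchors the match right after "amounts":[ , then the
# digits-and-commas run is captured as one string:
m <- str_match_all(script_text, '(?<="amounts":\\[)(\\d+,?)+')
amounts <- as.numeric(strsplit(m[[1]][, 1], ",")[[1]])
amounts
```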
I'm learning how to scrape information from websites using httr and XML in R. I'm getting it to work just fine for websites with just a few tables, but can't figure it out for websites with several tables. Using the following page from pro-football-reference as an example: https://www.pro-football-reference.com/boxscores/201609110atl.htm
library(httr)
library(XML)

# To get just the boxscore by quarter, which is the first table:
url = "https://www.pro-football-reference.com/boxscores/201609080den.htm"
resp = GET(url)
SnapTable = readHTMLTable(rawToChar(resp$content), stringsAsFactors=F)[[1]]

# Return the number of tables:
AllTables = readHTMLTable(rawToChar(resp$content), stringsAsFactors=F)
length(AllTables)
[1] 2
So I'm able to scrape info, but for some reason I can only capture the top two tables out of the 20+ on the page. For practice, I'm trying to get the "Starters" tables and the "Officials" tables.
Is my inability to get the other tables a matter of the website's setup or incorrect code?
If it comes down to web scraping in R, make intensive use of the rvest package.
While getting the HTML is straightforward enough, rvest works with CSS selectors, and SelectorGadget helps you find a styling pattern for a particular table that is hopefully unique. That way you can extract exactly the tables you are looking for rather than relying on coincidence.
To get you started, read the vignette on rvest for more detailed information.
#install.packages("rvest")
library(rvest)
library(magrittr)
# Store web url
fb_url = "https://www.pro-football-reference.com/boxscores/201609080den.htm"
linescore = fb_url %>%
  read_html() %>%
  html_node(xpath = '//*[@id="content"]/div[3]/table') %>%
  html_table()
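The same html_node() + html_table() mechanics can be tried offline on a tiny inline document (a hypothetical table, just to show the pattern):

```r
library(rvest)

# Hypothetical linescore-style table:
doc <- minimal_html('
  <table id="linescore">
    <tr><th>Team</th><th>Q1</th><th>Q2</th></tr>
    <tr><td>CAR</td><td>0</td><td>7</td></tr>
    <tr><td>DEN</td><td>7</td><td>10</td></tr>
  </table>
')

# Select the one table we want, then parse it into a data frame:
tab <- doc %>% html_node("table#linescore") %>% html_table()
tab
```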
Hope this helps.
I tried to scrape Kickstarter. However I do not get a result when I try to get the URLs that refer to the projects.
This should be one of the results:
https://www.kickstarter.com/projects/1534822242/david-bowie-hunger-city-photo-story?ref=category_ending_soon
and this is my code:
Code:
main.page1 <- read_html(x = "https://www.kickstarter.com/discover/advanced?category_id=1&sort=end_date&seed=2498921&page=1")
urls1 <- main.page1 %>% # feed `main.page` to the next step
html_nodes(".block.img-placeholder.w100p") %>% # get the CSS nodes
html_attr("href") # extract the URLs
Does anyone see where I go wrong?
First declare all the packages you use - I had to go search to realise I needed rvest:
> library(rvest)
> library(dplyr)
Get your HTML:
> main.page1 <- read_html(x ="https://www.kickstarter.com/discover/advanced?category_id=1&sort=end_date&seed=2498921&page=1")
As it stands, the data for each project is stashed in a data-project attribute in a bunch of divs. Some JavaScript (I suspect built using the React framework) in the browser normally fills the other divs in, fetches the images, formats the links, etc. But you have just grabbed the raw HTML, so none of that has happened. The raw data is still there, though. So...
The relevant divs appear to be class "react-disc-landing" so this gets the data as text strings:
> data = main.page1 %>%
html_nodes("div.react-disc-landing") %>%
html_attr("data-project")
These things appear to be JSON strings:
> substr(data[[1]],1,80)
[1] "{\"id\":208460273,\"photo\":{\"key\":\"assets/017/007/465/9b725fdf5ba1ee63e8987e26a1d33"
So let's use the rjson package to decode the first one:
> library(rjson)
> jdata = fromJSON(data[[1]])
jdata is now a very complex nested list. Use str(jdata) to see what is in it. I'm not sure what bit of it you want, but maybe this URL:
> jdata$urls$web$project
[1] "https://www.kickstarter.com/projects/1513052868/sense-of-place-by-jose-davila"
If not, the URL you want must be in that structure somewhere.
Repeat over data[[i]] to get all links.
Note that you should check the site's terms and conditions to make sure you are allowed to do this, and also see if there's an API you should really be using instead.
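The "repeat over data[[i]]" step can be sketched like this, with two made-up data-project strings standing in for the real ones (hypothetical values):

```r
library(rjson)

# Hypothetical data-project strings shaped like the ones above:
data <- c(
  '{"id":1,"urls":{"web":{"project":"https://www.kickstarter.com/projects/a/one"}}}',
  '{"id":2,"urls":{"web":{"project":"https://www.kickstarter.com/projects/b/two"}}}'
)

# Decode each JSON string and pull out the project URL:
links <- sapply(data, function(d) fromJSON(d)$urls$web$project, USE.NAMES = FALSE)
links
```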