I am trying to scrape Kickstarter. However, I do not get a result when I try to extract the URLs that refer to the projects.
This should be one of the results:
https://www.kickstarter.com/projects/1534822242/david-bowie-hunger-city-photo-story?ref=category_ending_soon
and this is my code:
Code:
main.page1 <- read_html(x = "https://www.kickstarter.com/discover/advanced?category_id=1&sort=end_date&seed=2498921&page=1")
urls1 <- main.page1 %>%                          # feed `main.page1` to the next step
  html_nodes(".block.img-placeholder.w100p") %>% # get the CSS nodes
  html_attr("href")                              # extract the URLs
Does anyone see where I go wrong?
First declare all the packages you use - I had to go search to realise I needed rvest:
> library(rvest)
> library(dplyr)
Get your HTML:
> main.page1 <- read_html(x ="https://www.kickstarter.com/discover/advanced?category_id=1&sort=end_date&seed=2498921&page=1")
As that stands, the data for each project is stashed in a data-project attribute in a bunch of divs. In the browser, some JavaScript (built with the React framework, I suspect) would normally fill in the other divs, fetch the images, format the links and so on. But you have just grabbed the raw HTML, so none of that has happened. The raw data is still there, though... So...
The relevant divs appear to be class "react-disc-landing" so this gets the data as text strings:
> data = main.page1 %>%
html_nodes("div.react-disc-landing") %>%
html_attr("data-project")
These things appear to be JSON strings:
> substr(data[[1]],1,80)
[1] "{\"id\":208460273,\"photo\":{\"key\":\"assets/017/007/465/9b725fdf5ba1ee63e8987e26a1d33"
So let's use the rjson package to decode the first one:
> library(rjson)
> jdata = fromJSON(data[[1]])
jdata is now a very complex nested list. Use str(jdata) to see what is in it. I'm not sure which bit of it you want, but maybe this URL:
> jdata$urls$web$project
[1] "https://www.kickstarter.com/projects/1513052868/sense-of-place-by-jose-davila"
If not, the URL you want must be in that structure somewhere.
Repeat over data[[i]] to get all links.
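For example, a minimal sketch that maps the same decoding over every element (assuming all the data-project strings share the structure of the first one):
> urls <- sapply(data, function(d) fromJSON(d)$urls$web$project)
urls is then a character vector with one project URL per div.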
Note that you should check the site's T&Cs to confirm you are allowed to do this, and also see if there's an API you should really be using.
I am trying to get a piece of text from a webpage. To simplify my question, let me use @Ronak Shah's Stack Overflow account as an example and extract the reputation value. With SelectorGadget showing "div, div", I used the following code:
library(rvest)
so <- read_html('https://stackoverflow.com/users/3962914/ronak-shah') %>%
  html_nodes("div") %>% html_nodes("div") %>% html_text()
This gave an object so with as many as 307 items.
Then, I turned the object into a dataframe:
so <- as.data.frame(so)
View(so)
Then, I manually went through all the items in the dataframe until I found the correct value, so$so[69]. My question is how to quickly find the specific target value. In my real case it is a little more complicated to do manually, as there are multiple items with the same value and I need to identify the correct order. Thanks.
You need to find a specific tag and its respective class closer to your target. You can find that using SelectorGadget.
library(rvest)
read_html('https://stackoverflow.com/users/3962914/ronak-shah') %>%
  html_nodes("div.grid--cell.fs-title") %>%
  html_text()
#[1] "254,328"
As far as scraping Stack Overflow is concerned, it has an API to get information about users/questions/answers. In R, there is a wrapper package around it called stackr (not on CRAN) which makes this very easy.
library(stackr)
data <- stack_users(3962914)
data$reputation
[1] 254328
data has a lot of other information about the user as well.
3962914 is the user id of the user you are interested in which can be found out from their profile link. (https://stackoverflow.com/users/3962914/ronak-shah).
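For instance, a quick sketch of pulling that numeric id out of a profile URL with a regex:
profile <- "https://stackoverflow.com/users/3962914/ronak-shah"
as.numeric(sub(".*/users/(\\d+)/.*", "\\1", profile))
[1] 3962914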
I am trying to scrape off the number amounts listed in a set of donation websites. So in this example, I would like to get
$3, $10, $25, $100, $250, $1500, $2800
The xpath indicates that one of them should be
/html/body/div[1]/div[3]/div[2]/div/div[1]/div/div/form/div/div[1]/div/div/ul/li[2]/label
and the css selector
li.btn--wrapper:nth-child(2) > label:nth-child(1)
Up to the following, I see something in the xml_nodeset:
library(rvest)
url <- "https://secure.actblue.com/donate/pete-buttigieg-announcement-day"
read_html(url) %>% html_nodes(
  xpath = '//*[@id="cf-app-target"]/div[3]/div[2]/div/div[1]/div/div'
)
But when I add the second part of the xpath, it shows up blank. Same with
X %>% html_nodes("li")
which gives a bunch of things, but all the StyledButton__StyledAnchorButton-a7s38j-0 kEcVlT nodes come up blank.
I have worked with rvest a fair bit now, but this one is baffling. And I am not quite sure how RSelenium would help here, although I do know how to use it for screenshots and clicks. If it helps, the website also refuses to be captured in the Wayback Machine: there is only the background and nothing else.
I have even tried taking a screenshot with RSelenium and attempting OCR with tesseract and magick, but while other pages worked, this particular example fails spectacularly, because the text is white and in a rather nonstandard font. Yes, I have tried image_negate and image_resize to see if they helped, but they only showed that relying on OCR is a rather bad idea, as it depends on screenshot size.
Any advice on how to best extract what I want in this situation? Thanks.
You can use a regex to extract the numbers from the script tag. You get a comma-separated character vector:
library(rvest)
library(stringr)
# Read the page from a binary connection
con <- url('https://secure.actblue.com/donate/pete-buttigieg-announcement-day?refcode=website', "rb")
page <- read_html(con)

# Find the script tag holding the preloaded state, then grab the digits
# that follow "amounts":[ using a lookbehind
res <- page %>%
  html_nodes(xpath = ".//script[contains(., 'preloadedState')]") %>%
  html_text() %>%
  str_match_all('(?<="amounts":\\[)(\\d+,?)+')
print(res[[1]][, 1])
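If you want the individual values as numbers rather than one comma-separated string, a small follow-up sketch (assuming the first match is the set of default donation amounts):
# Split the comma-separated capture and coerce to numeric
amounts <- as.numeric(strsplit(res[[1]][, 1], ",")[[1]])
# should give 3 10 25 100 250 1500 2800, the values the question is after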
GOAL: I'm trying to scrape win-loss records for NBA teams from basketball-reference.com.
More broadly, I'm trying to better understand how to correctly use CSS selector gadget to scrape specified elements from a website, but would appreciate a solution for this problem.
The url I'm using (https://www.basketball-reference.com/leagues/NBA_2018_standings.html) has multiple tables on it, so I'm trying to use the CSS selector gadget to specify the element I want, which is the "Expanded Standings" table - about 1/3 of the way down the page.
I have read various tutorials about web scraping that involve the rvest and dplyr packages, as well as the CSS selector web browser add-in (which I have installed in Chrome, my browser of choice). That's what I'm going for.
Here is my code so far:
url <- "https://www.basketball-reference.com/leagues/NBA_2018_standings.html"
css <- "#expanded_standings"
url %>%
  read_html() %>%
  html_nodes(css) %>%
  html_table()
The result of this code is an error:
Error: html_name(x) == "table" is not TRUE
When I delete the last line of code, I get:
url %>%
  read_html() %>%
  html_nodes(css)
{xml_nodeset (0)}
It seems like there's an issue with the way I'm defining the CSS object/how I'm using the CSS selector tool. What I've been doing is clicking at the very right edge of the desired table, so that the table has a rectangle around it.
I've also tried to click a specific "cell" in the table (i.e., "65-17", the value in the "Overall" column for the Houston Rockets row), but that seems to highlight some, but not all, of the table, plus random parts of other tables on the page.
Can anyone provide a solution? Bonus points if you can help me understand where/why what I'm doing is incorrect.
Thanks in advance!
library(rvest)
library(dplyr)
library(stringr)
library(magrittr)

url <- "https://www.basketball-reference.com/leagues/NBA_2018_standings.html"
# "#expanded_standings" matches nothing: the table is hidden inside an HTML
# comment, so target the wrapper div instead
css <- "#all_expanded_standings"
webpage <- read_html(url)
mynode <- html_nodes(webpage, css)

# Strip the comment markers and re-parse the string as a new document
mystr <- toString(mynode)
mystr <- gsub("<!--", "", mystr)
mystr <- gsub("-->", "", mystr)
newdiv <- read_html(mystr)
newtable <- html_nodes(newdiv, "#expanded_standings")
newframe <- html_table(newtable)
print(newframe)
Here is how I spotted the problem: dump the node to the console and look at the raw markup.
library(rvest)
library(dplyr)
library(stringr)
library(magrittr)

url <- "https://www.basketball-reference.com/leagues/NBA_2018_standings.html"
css <- "#all_expanded_standings"
webpage <- read_html(url)
mynode <- html_nodes(webpage, css)
# print the node to the console - cat renders the escaped characters
cat(toString(mynode))
I tried downloading the bare URL's HTML (before any JavaScript renders). Strangely, the table data sits inside a comment block, and it is in this div that the 'Expanded Standings' table lives.
I originally used Python and BeautifulSoup to extract the element, removed the comment markers, re-souped the string section, and then parsed the string into td bits. Oddly, the rank is in a th element.
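If you'd rather avoid the string surgery, here is a sketch of the same idea using an XPath comment() query (assuming, as above, that the expanded standings table lives inside an HTML comment):
library(rvest)

url <- "https://www.basketball-reference.com/leagues/NBA_2018_standings.html"

# Collect every HTML comment, keep the one holding the expanded standings,
# then parse that comment's text as a document of its own
comment_text <- read_html(url) %>%
  html_nodes(xpath = "//comment()") %>%
  html_text()
standings <- comment_text[grepl("expanded_standings", comment_text)][1] %>%
  read_html() %>%
  html_node("#expanded_standings") %>%
  html_table()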
I am trying to webscrape some recipes for my own personal collection. It works great on some sites because the website structure sometimes easily allows for scraping, but some are harder. This one I have no idea how to deal with:
https://www.koket.se/halloumigryta-med-tomat-linser-och-chili
For the moment, let's just assume I want the ingredients on the left. If I inspect the website it looks like what I want are the two article class="ingredients" chunks. But I can't seem to get there.
I start with the following:
library(rvest)
library(tidyverse)
read_html("https://www.koket.se/halloumigryta-med-tomat-linser-och-chili") %>%
html_nodes(".recipe-column-wrapper") %>%
html_nodes(xpath = '//*[#id="react-recipe-page"]')
However, running the above code shows that all of the ingredients are stored in data-item like so:
<div id="react-recipe-page" data-item="{
"chefNames":"<a href='/kockar/siri-barje'>Siri Barje</a>",
"groupedIngredients":[{
"header":"Kokosris",
"ingredients":[{
"name":"basmatiris","unit":"dl","amount":"3","amount_info":{"from":3},"main":false,"ingredient":true
}
<<<and so on>>>
So I am a little bit puzzled, because from inspecting the website everything seems to be neatly placed in things I can extract, but now it's not. Instead, I'd need some serious regular expressions in order to get everything like I want it.
So my question is: am I missing something? Is there some way I can get the contents of the ingredients articles?
(I tried SelectorGadget, but it just gave me "No valid path found".)
You can extract attributes using html_attr("data-item") from the rvest package.
Furthermore, the data-item attribute looks like JSON, which you can convert to a list using fromJSON from the jsonlite package:
library(rvest)
library(jsonlite)

html <- read_html("https://www.koket.se/halloumigryta-med-tomat-linser-och-chili") %>%
  html_nodes(".recipe-column-wrapper") %>%
  html_nodes(xpath = '//*[@id="react-recipe-page"]')

recipe <- html %>%
  html_attr("data-item") %>%
  fromJSON()
Lastly, the recipe list contains lots of values that are not relevant, but the ingredients and measurements are in there too, in the element recipe$ingredients.
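For instance, a minimal sketch for pulling out just the ingredient names, assuming the groupedIngredients structure shown in the question and jsonlite's default simplification (each group's ingredients becomes a data frame in a list column):
groups <- recipe$groupedIngredients
# One character vector of ingredient names across all groups
ingredient_names <- unlist(lapply(groups$ingredients, `[[`, "name"))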
I'm stuck on this one after much searching....
I started with scraping the contents of a table from:
http://www.skatepress.com/skates-top-10000/artworks/
Which is easy:
library(XML)  # for readHTMLTable

data <- data.frame()
for (i in 1:100) {
  print(paste("page", i, "of 100"))
  url <- paste("http://www.skatepress.com/skates-top-10000/artworks/", i, "/", sep = "")
  temp <- readHTMLTable(stringsAsFactors = FALSE, url, which = 1, encoding = "UTF-8")
  data <- rbind(data, temp)
} # end of scraping loop
However, I need to additionally scrape the detail that is contained in a pop-up box when you click on each name (and on the artwork title) in the list on the site.
I can't for the life of me figure out how to pass the breadcrumb (or artist-id or painting-id) through in order to make this happen. Since straight up using rvest to access the contents of the nodes doesn't work, I've tried the following:
I tried passing the painting id through in the url like this:
url <- ("http://www.skatepress.com/skates-top-10000/artworks/?painting_id=576")
site <- html(url)
But it still gives an empty result when scraping:
node1 <- "bread-crumb > ul > li.activebc"
site %>% html_nodes(node1) %>% html_text(trim = TRUE)
character(0)
I'm (clearly) not a scraping expert so any and all assistance would be greatly appreciated! I need a way to capture this additional information for each of the 10,000 items on the list...hence why I'm not interested in doing this manually!
Hoping this is an easy one and I'm just overlooking something simple.
This will be a more efficient base scraper and you can get progress bars for free with the pbapply package:
library(xml2)
library(httr)
library(rvest)
library(dplyr)
library(pbapply)
library(jsonlite)
base_url <- "http://www.skatepress.com/skates-top-10000/artworks/%d/"
n <- 100
bind_rows(pblapply(1:n, function(i) {
  mutate(html_table(html_nodes(read_html(sprintf(base_url, i)), "table"))[[1]],
         `Sale Date` = as.Date(`Sale Date`, format = "%m.%d.%Y"),
         `Premium Price USD` = as.numeric(gsub(",", "", `Premium Price USD`)))
})) -> skatepress
I added trivial date & numeric conversions.
I believe your main issue is that the site requires a login to get the additional data. You should give that (i.e. logging in) a shot using httr and grab the wordpress_logged_inXXXXXXX… cookie from that endeavour. I just grabbed it by inspecting the session with Developer Tools in Chrome and that will also work for you (but it's worth the time to learn how to do it via httr).
You'll need to scrape two additional <a> tags from each table row. The one for "artist" looks like:
Pablo Picasso
You can scrape the contents with:
POST("http://www.skatepress.com/wp-content/themes/skatepress/scripts/query_artist.php",
set_cookies(wordpress_logged_in_XXX="userid%XXXXXreallylongvalueXXXXX…"),
encode="form",
body=list(id="pab_pica_1881"),
verbose()) -> artist_response
fromJSON(content(artist_response, as="text"))
(The return value is too large to post here)
The one for "artwork" looks like:
Les femmes d′Alger (Version ′O′)
and you can get that in similar fashion:
POST("http://www.skatepress.com/wp-content/themes/skatepress/scripts/query_artwork.php",
set_cookies(wordpress_logged_in_XXX="userid%XXXXXreallylongvalueXXXXX…"),
encode="form",
body=list(id=576),
verbose()) -> artwork_response
fromJSON(content(artwork_response, as="text"))
That's not huge but I won't clutter the response with it.
NOTE that you can also use rvest's html_session() to do the login (which will get you the cookies for free) and then continue to use that session for the scraping (instead of read_html()), which means you don't have to do the httr GET/POST work yourself.
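Here is a rough sketch of that session-based approach. The wp-login.php path and the standard WordPress field names ("log"/"pwd") are assumptions; check them against the site's actual login form:
library(rvest)

# Log in once; the session then carries the wordpress_logged_in_* cookie
sess <- html_session("http://www.skatepress.com/wp-login.php")  # assumed login URL
login <- set_values(html_form(sess)[[1]],
                    log = "your_username",  # assumed WordPress field names,
                    pwd = "your_password")  # verify against the real form
sess <- submit_form(sess, login)

# Subsequent requests ride on the same cookies
page1 <- jump_to(sess, "http://www.skatepress.com/skates-top-10000/artworks/1/")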
You'll have to figure out how you want to incorporate that data into the data frame or associate it with it via various id's in the data frame (or some other strategy).
You can see it call those two PHP scripts via Developer Tools, which also shows the data it passes in. I'm also really shocked that the site doesn't have any anti-scraping clauses in its ToS, but it doesn't.