I'm stuck on this one after much searching....
I started with scraping the contents of a table from:
http://www.skatepress.com/skates-top-10000/artworks/
Which is easy:
data <- data.frame()
for (i in 1:100){
print(paste("page", i, "of 100"))
url <- paste("http://www.skatepress.com/skates-top-10000/artworks/", i, "/", sep = "")
temp <- readHTMLTable(stringsAsFactors = FALSE, url, which = 1, encoding = "UTF-8")
data <- rbind(data, temp)
} # end of scraping loop
However, I need to additionally scrape the detail that is contained in a pop-up box when you click on each name (and on the artwork title) in the list on the site.
I can't for the life of me figure out how to pass the breadcrumb (or artist-id or painting-id) through in order to make this happen. Since straight up using rvest to access the contents of the nodes doesn't work, I've tried the following:
I tried passing the painting id through in the url like this:
url <- ("http://www.skatepress.com/skates-top-10000/artworks/?painting_id=576")
site <- html(url)
But it still gives an empty result when scraping:
node1 <- "bread-crumb > ul > li.activebc"
site %>% html_nodes(node1) %>% html_text(trim = TRUE)
character(0)
I'm (clearly) not a scraping expert so any and all assistance would be greatly appreciated! I need a way to capture this additional information for each of the 10,000 items on the list...hence why I'm not interested in doing this manually!
Hoping this is an easy one and I'm just overlooking something simple.
This will be a more efficient base scraper and you can get progress bars for free with the pbapply package:
library(xml2)
library(httr)
library(rvest)
library(dplyr)
library(pbapply)
library(jsonlite)
base_url <- "http://www.skatepress.com/skates-top-10000/artworks/%d/"
n <- 100
bind_rows(pblapply(1:n, function(i) {
mutate(html_table(html_nodes(read_html(sprintf(base_url, i)), "table"))[[1]],
`Sale Date`=as.Date(`Sale Date`, format="%m.%d.%Y"),
`Premium Price USD`=as.numeric(gsub(",", "", `Premium Price USD`)))
})) -> skatepress
I added trivial date & numeric conversions.
I belive your main issue is that the site requires a login to get the additional data. You should give that (i.e. logging in) a shot using httr and grab the wordpress_logged_inXXXXXXX… cookie from that endeavour. I just grabbed it from inspecting the session with Developer Tools in Chrome and that will also work for you (but it's worth the time to learn how to do it via httr).
You'll need to scrape two additional <a … tags from each table row. The one for "artist" looks like:
Pablo Picasso
You can scrape the contents with:
POST("http://www.skatepress.com/wp-content/themes/skatepress/scripts/query_artist.php",
set_cookies(wordpress_logged_in_XXX="userid%XXXXXreallylongvalueXXXXX…"),
encode="form",
body=list(id="pab_pica_1881"),
verbose()) -> artist_response
fromJSON(content(artist_response, as="text"))
(The return value is too large to post here)
The one for "artwork" looks like:
Les femmes d′Alger (Version ′O′)
and you can get that in similar fashion:
POST("http://www.skatepress.com/wp-content/themes/skatepress/scripts/query_artwork.php",
set_cookies(wordpress_logged_in_XXX="userid%XXXXXreallylongvalueXXXXX…"),
encode="form",
body=list(id=576),
verbose()) -> artwork_response
fromJSON(content(artwork_response, as="text"))
That's not huge but I won't clutter the response with it.
NOTE that you can also use rvest's html_session to do the login (which will get you cookies for free) and then continue to use that session in the scraping (vs read_html) which will mean you don't have to do the httr GET/PUT.
You'll have to figure out how you want to incorporate that data into the data frame or associate it with it via various id's in the data frame (or some other strategy).
You can see it call those two php scripts via Developer Tools and it also shows the data it passes in. I'm also really shocked that site doesn't have any anti-scraping clauses in their ToS but they don't.
Related
Thanks in advance for any feedback.
As part of my dissertation I'm trying to scrape data from the web (been working on this for months). I have a couple issues:
-Each document I want to scrape has a document number. However, the numbers don't always go up in order. For example, one document number is 2022, but the next one is not necessarily 2023, it could be 2038, 2040, etc. I don't want to hand go through to get each document number. I have tried to wrap download.file in purrr::safely(), but once it hits a document that does not exist it stops.
-Second, I'm still fairly new to R, and am having a hard time setting up destfile for multiple documents. Indexing the path for where to store downloaded data ends up with the first document stored in the named place, the next document as NA.
Here's the code I've been working on:
base.url <- "https://www.europarl.europa.eu/doceo/document/"
document.name.1 <- "P-9-2022-00"
document.extension <- "_EN.docx"
#document.number <- 2321
document.numbers <- c(2330:2333)
for (i in 1:length(document.numbers)) {
temp.doc.name <- paste0(base.url,
document.name.1,
document.numbers[i],
document.extension)
print(temp.doc.name)
#download and save data
safely <- purrr::safely(download.file(temp.doc.name,
destfile = "/Users/...[i]"))
}
Ultimately, I need to scrape about 120,000 documents from the site. Where is the best place to store the data? I'm thinking I might run the code for each of the 15 years I'm interested in separately, in order to (hopefully) keep it manageable.
Note: I've tried several different ways to scrape the data. Unfortunately for me, the RSS feed only has the most recent 25. Because there are multiple dropdown menus to navigate before you reach the .docx file, my workaround is to use document numbers. I am however, open to more efficient way to scrape these written questions.
Again, thanks for any feedback!
Kari
After quickly checking out the site, I agree that I can't see any easier ways to do this, because the search function doesn't appear to be URL-based. So what you need to do is poll each candidate URL and see if it returns a "good" status (usually 200) and don't download when it returns a "bad" status (like 404). The following code block does that.
Note that purrr::safely doesn't run a function -- it creates another function that is safe and which you then can call. The created function returns a list with two slots: result and error.
base.url <- "https://www.europarl.europa.eu/doceo/document/"
document.name.1 <- "P-9-2022-00"
document.extension <- "_EN.docx"
#document.number <- 2321
document.numbers <- c(2330:2333,2552,2321)
sHEAD = purrr::safely(httr::HEAD)
sdownload = purrr::safely(download.file)
for (i in seq_along(document.numbers)) {
file_name = paste0(document.name.1,document.numbers[i],document.extension)
temp.doc.name <- paste0(base.url,file_name)
print(temp.doc.name)
print(sHEAD(temp.doc.name)$result$status)
if(sHEAD(temp.doc.name)$result$status %in% 200:299){
sdownload(temp.doc.name,destfile=file_name)
}
}
It might not be as simple as all of the valid URLs returning a '200' status. I think in general URLs in the range 200:299 are ok (edited answer to reflect this).
I used parts of this answer in my answer.
If the file does not exists, tryCatch simply skips it
library(tidyverse)
get_data <- function(index) {
paste0(
"https://www.europarl.europa.eu/doceo/document/",
"P-9-2022-00",
index,
"_EN.docx"
) %>%
download.file(url = .,
destfile = paste0(index, ".docx"),
mode = "wb",
quiet = TRUE) %>%
tryCatch(.,
error = function(e) print(paste(index, "does not exists - SKIPS")))
}
map(2000:5000, get_data)
I have a list of names (first name, last name, and date-of-birth) that I need to search the Fulton County Georgia (USA) Jail website to determine if a person is in or released from jail.
The website is http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400
The site requires you enter a last name and first name, then it gives you a list of results.
I have found some stackoverflow posts that have given me some direction, but I'm still struggling to figure this out. I"m using this post as and example to follow. I am using SelectorGaget to help figure out the CSS tags.
Here is the code I have so far. Right now I can't figure out what html_node to use.
library(rvest)
# Specify URL
fc.url <- "http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400"
# start session
jail <- html_session(fc.url)
# Grab initial form
form.unfilled <- jail %>% html_node("form")
form.unfilled
The result I get from form.unfilled is {xml_missing} <NA> which I know isn't right.
I think if I can figure out the html_node value, I can proceed to using set_values and submit_form.
Thanks.
It appears on the initial call the webpage opens onto "http://justice.fultoncountyga.gov/PAJailManager/default.aspx". Once the session is started you should be able to jump to the search page:
library(rvest)
# Specify URL
fc.url <- "http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400"
# start session
jail <- html_session("http://justice.fultoncountyga.gov/PAJailManager/default.aspx")
#jump to search page
jail2 <- jail %>% jump_to("http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400")
#list the form's fields
html_form(jail2)[[1]]
# Grab initial form
form.unfilled <- jail2 %>% html_node("form")
Note: Verify that your actions are within the terms of service for the website. Many sites do have policy against scraping.
The website relies heavily on Javascript to render itself. When opening the link you provided in a fresh browser instance, you get redirected to http://justice.fultoncountyga.gov/PAJailManager/default.aspx, where you have to click the "Jail Records" link. This executes a bit a Javascript, to send you to the page with the form.
rvest is unable to execute arbitrary Javascript. You might have to look at RSelenium. Selenium basically remote-controls a browser (for example Firefox or Chrome), which executes the Javascript as intended.
Thanks to Dave2e.
Here is the code that works. This questions is answered (but I'll post another one because I'm not getting a table of data as a result.)
Note: I cannot find any Terms of Service on this site that I'm querying
library(rvest)
# start session
jail <- html_session("http://justice.fultoncountyga.gov/PAJailManager/default.aspx")
#jump to search page
jail2 <- jail %>% jump_to("http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400")
#list the form's fields
html_form(jail2)[[1]]
# Grab initial form
form.unfilled <- jail2 %>% html_node("form") %>% html_form()
form.unfilled
#name values
lname <- "DOE"
fname <- "JOHN"
# Fille the form with name values
form.filled <- form.unfilled %>%
set_values("LastName" = lname,
"FirstName" = fname)
#Submit form
r <- submit_form(jail2, form.filled,
submit = "SearchSubmit")
#grab tables from submitted form
table <- r %>% html_nodes("table")
#grab a table with some data
table[[5]] %>% html_table()
# resulting text in this table:
# " An error occurred while processing your request.Please contact your system administrator."
I am rather new in webscraping but need data for my PhD project. For this, I am extracting data on different activities of MEPs from the European Parliament's website. Concretely, and where I have problems, I would like to extract the title and especially the link underlying the title of each speech from a MEP's personal page. I use a code that already worked fine several times, but here I do not succeed in getting the link, but only the title of the speech. For the links I get the error message "subscript out of bounds". I am working with RSelenium because there are several load more buttons on the individual pages I have to click first before extracting the data (which makes rvest a complicated option as far as I see it).
I am basically trying to solve this for days now, and I really do not know how to get further. I have the impression that the css selector is not actually capturing the underlying link (as it extracts the title without problems), but the class has a compounded name ("ep-a_heading ep-layout_level2") so it is not possible to go via this way either. I tried Rvest as well (ignoring the problem I would then have for the load more--button) but I still do not get to those links.
```{r}
library(RSelenium)
library(wdman)
library(rvest, warn.conflicts=FALSE)
library(stringr)
server <- phantomjs(port=7005L)
browser <- remoteDriver(browserName = "phantomjs", port=7005L)
## this is one of the urls I will use, there are others, constructed all
##the same way and all with the same problem
url <- 'http://www.europarl.europa.eu/meps/en/124936/MARIA_ARENA/all-
activities/plenary-speeches/8'
browser$open()
browser$navigate(url)
## now I identify the load more button and click on it as long as there
##is a "load more" button on the page
more <- browser$findElement(using = "css", value=".erpl-activities-
loadmore-button .ep_name")
while (!is.null(more)){
more$clickElement()
Sys.sleep(1)}
## I get an error message doing this in the end but it is working anyway
##(yes, I really am a beginner!)
##Now, what I want to extract are the title of the speech and most
##importantly: the URL.
links <- browser$findElements(using="css", ".ep-layout_level2 .ep_title")
length(links)
## there are 128 Speeches listed on the page
URL <- rep(NA, length(links))
Title <- rep(NA, length(links))
## after having created vectors to store the results, I apply the loop
##function that had worked fine already many times to extract the data I
##want
for (i in 1:length(links)){
URL[i] <- links[[i]]$getElementAttribute('href')[[1]]
Title[i] <- links[[i]]$getElementText()[[1]]
}
speeches <- data.frame(Title, URL)
For this example there 128 speeches on the page, so in the end I would need a table with 128 titles and links. The code works fine when I only try for the title but for the URLs I get:
`"Error in links[[i]]$getElementAttribute("href")[[1]] : subscript out of bounds"`
Thank you very much for your help, I already read many posts on subscript out of bounds issues in this forum, but unfortunately I still couldn't solve the problem.
Have a great day!
I don't seem to have a problem using rvest to get that info. No need for overhead of using selenium. You want to target the a tag child of that class i.e. .ep-layout_level2 a in order to be able to access an href attribute. Same selector would apply for selenium.
library(rvest)
library(magrittr)
page <- read_html('https://www.europarl.europa.eu/meps/en/124936/MARIA_ARENA/all-activities/plenary-speeches/8')
titles <- page %>% html_nodes('.ep-layout_level2 .ep_title') %>% html_text() %>% gsub("\\r\\n\\t+", "", .)
links <- page %>% html_nodes('.ep-layout_level2 a') %>% html_attr(., "href")
results <- data.frame(titles,links)
Here you have a working solution based on the code you provided:
library(RSelenium)
library(wdman)
library(rvest, warn.conflicts=FALSE)
library(stringr)
server <- phantomjs(port=7005L)
browser <- remoteDriver(browserName = "phantomjs", port=7005L)
## this is one of the urls I will use, there are others, constructed all
##the same way and all with the same problem
url <- 'http://www.europarl.europa.eu/meps/en/124936/MARIA_ARENA/all-activities/plenary-speeches/8'
browser$open()
browser$navigate(url)
## now I identify the load more button and click on it as long as there
##is a "load more" button on the page
more <- browser$findElement(using = "class",value= "erpl-activity-loadmore-button")
while ((grepl("erpl-activity-loadmore-button",more$getPageSource(),fixed=TRUE)){
more$clickElement()
Sys.sleep(1)}
## I get an error message doing this in the end but it is working anyway
##(yes, I really am a beginner!)
##Now, what I want to extract are the title of the speech and most
##importantly: the URL.
links <- browser$findElements(using="class", "ep-layout_level2")
## there are 128 Speeches listed on the page
URL <- rep(NA, length(links))
Title <- rep(NA, length(links))
## after having created vectors to store the results, I apply the loop
##function that had worked fine already many times to extract the data I
##want
for (i in 1:length(links)){
l=links[[i]]$findChildElement(using="css","a")
URL[i] <-l$getElementAttribute('href')[[1]]
Title[i] <- links[[i]]$getElementText()[[1]]
}
speeches <- data.frame(Title, URL)
speeches
The main differences are:
In the first findElement I use value= erpl-activity-loadmore-button. Indeed the documentation says that you can not look at multiple class values at once
Same when it comes to look for the links
In the final loop, you need fist to select the link element in the
div you selected and then read the href attribute
To answer your question about the error message in comments after the while loop: When you pressed enough time the "Load more" buttons it become invisible, but still exists. So when you check for !is.null(more)it is TRUE because the button still exists, but when you try to click it you get and error message because it is invisible. So you can fix it by checking it it is visible or note.
There are 2 parts of my questions as I explored 2 methods in this exercise, however I succeed in none. Greatly appreciated if someone can help me out.
[PART 1:]
I am attempting to scrape data from a webpage on Singapore Stock Exchange https://www2.sgx.com/derivatives/negotiated-large-trade containing data stored in a table. I have some basic knowledge of scraping data using (rvest). However, using Inspector on chrome, the html hierarchy is much complex then I expected. I'm able to see that the data I want is hidden under < div class= "table-container" >,and here's what I've tied:
library(rvest)
library(httr)
library(XML)
SGXurl <- "https://www2.sgx.com/derivatives/negotiated-large-trade"
SGXdata <- read_html(SGXurl, stringsASfactors = FALSE)
html_nodes(SGXdata,".table-container")
However, nothing has been picked up by the code and I'm doubt if I'm using these code correctly.
[PART 2:]
As I realize that there's a small "download" button on the page which can download exactly the data file i want in .csv format. So i was thinking to write some code to mimic the download button and I found this question Using R to "click" a download file button on a webpage, but i'm unable to get it to work with some modifications to that code.
There's a few filtera on the webpage, mostly I will be interested downloading data for a particular business day while leave other filters blank, so i've try writing the following function:
library(httr)
library(rvest)
library(purrr)
library(dplyr)
crawlSGXdata = function(date){
POST("https://www2.sgx.com/derivatives/negotiated-large-trade",
body = NULL
encode = "form",
write_disk("SGXdata.csv")) -> resfile
res = read.csv(resfile)
return(res)
}
I was intended to put the function input "date" into the “body” argument, however i was unable to figure out how to do that, so I started off with "body = NULL" by assuming it doesn't do any filtering. However, the result is still unsatisfactory. The file download is basically empty with the following error:
Request Rejected
The requested URL was rejected. Please consult with your administrator.
Your support ID is: 16783946804070790400
The content is loaded dynamically from an API call returning json. You can find this in the network tab via dev tools.
The following returns that content. I find the total number of pages of results and loop combining the dataframe returned from each call into one final dataframe containing all results.
library(jsonlite)
url <- 'https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode&category=futures&businessdatestart=20190708&businessdateend=20190708&pagestart=0&pageSize=250'
r <- jsonlite::fromJSON(url)
num_pages <- r$meta$totalPages
df <- r$data
url2 <- 'https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode&category=futures&businessdatestart=20190708&businessdateend=20190708&pagestart=placeholder&pageSize=250'
if(num_pages > 1){
for(i in seq(1, num_pages)){
newUrl <- gsub("placeholder", i , url2)
newdf <- jsonlite::fromJSON(newUrl)$data
df <- rbind(df, newdf)
}
}
I am trying to use web scraping to generate a play by play dataset from ESPN. I have figured out most of it, but have been unable to tell which team the event is for, as this is only encoded on ESPN in the form of an image. The best way I have come up with to solve this problem is to get the URL of the logo for each entry and compare it to the URL of the logo for each team at the top of the page. However, I have been unable to figure out how to get an attribute such as the url from the image.
I am running this on R and am using the rvest package. The url I am scraping is https://www.espn.com/mens-college-basketball/playbyplay?gameId=400587906 and I am scraping using the SelectorGadget Chrome extension. I have also tried comparing the name of the player to the boxscore, which has all of the players listed, but each team has a player with the last name of Jones, so I would prefer to be able to get the team by looking at the image, as this will always be right.
library(rvest)
url <- "https://www.espn.com/mens-college-basketball/playbyplay?gameId=400587906"
webpage <- read_html(url)
# have been able to successfully scrape game_details and score
game_details_html <- html_nodes(webpage,'.game-details')
game_details <- html_text(game_details_html) %>% as.character()
score_html <- html_nodes(webpage,'.combined-score')
score <- html_text(score_html)
# have not been able to scrape image
ImgNode <- html_nodes(webpage, css = "#gp-quarter-1 .team-logo")
link <- html_attr(ImgNode, "src")
For each event, I want it to be labeled "Duke" or "Wake Forest".
Is there a way to generate the URL for each image? Any help would be greatly appreciated.
"https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/150.png&h=100&w=100"
"https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/154.png&h=100&w=100"
Your code returns these.
500/150 is Duke and 500/154 is Wake Forest. You can create a simple dataframe with these and then join the tables.
link_df <- as.data.frame(link)
link_ref_df <- data.frame(link = c("https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/150.png&h=100&w=100", "https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/154.png&h=100&w=100"),
team_name = c("Duke", "Wake Forest"))
link_merged <- merge(link_df,
link_ref_df,
by = 'link',
all.x = T)
This is not scalable if you're doing hundreds of these with other teams, but works for this specific option.