I'm trying to web scrape Nike to see when new sneakers drop. I'm relatively new to web scraping and was wondering if there is an easy way to check for differences since the last search, or to pull information about the date products were posted.
So far I've been able to pull the list of the most recent products by scraping the new arrivals page that's sorted by newest, but can't seem to find information on that page about when items were posted.
library(rvest)
library(tidyverse)
url<-"https://www.nike.com/w/new-mens-shoes-3n82yznik1zy7ok?sort=newest"
search<-read_html(url)
search%>%html_nodes(css ="div.product-card")%>%html_text()
Any tips are appreciated.
The easiest way is to save the list to your local drive and then, each time you perform the query, compare the newly obtained list with the previously saved list.
Here I created a data frame with today's date and today's query and saved it to a file named "Nike.csv" (a sketch of that one-time setup is shown below). Then each time I run the script that follows, it determines the newly added shoes and appends them to the existing file. You can then open the CSV file and see the date each shoe was added to the list.
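For reference, the one-time setup might look something like this (a minimal sketch that reuses the scraping code from the question and assumes the two-column layout, date and shoes, used below):
library(rvest)

# One-time setup: scrape today's list and save it together with today's date
url <- "https://www.nike.com/w/new-mens-shoes-3n82yznik1zy7ok?sort=newest"
search <- read_html(url)
shoes <- search %>% html_nodes(css = "div.product-card") %>% html_text()
write.csv(data.frame(date = Sys.Date(), shoes = shoes), "Nike.csv", row.names = FALSE)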
library(rvest)
library(dplyr)

# Read the file of previously found shoes
existing <- read.csv("Nike.csv")

# Retrieve the latest list
url <- "https://www.nike.com/w/new-mens-shoes-3n82yznik1zy7ok?sort=newest"
search <- read_html(url)
shoes <- search %>% html_nodes(css = "div.product-card") %>% html_text()

# Find the differences between the latest list and the previously found list
newshoes <- !(shoes %in% existing$shoes)

# Append the differences to the existing file
if (length(shoes[newshoes]) > 0) {
  df <- data.frame(date = Sys.Date(), shoes = shoes[newshoes])
  # write.csv() ignores 'append', so use write.table() to add rows without repeating the header
  write.table(df, "Nike.csv", sep = ",", row.names = FALSE, col.names = FALSE,
              append = TRUE)
} else {
  print("No new records found")
}
There are ways to optimize this and improve on the error checking, but this will get you started.
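For example, one small improvement (a sketch, assuming the same "Nike.csv" layout as above) is to guard against the file not existing yet on the first run:
# Only read the previous list if the file already exists
if (file.exists("Nike.csv")) {
  existing <- read.csv("Nike.csv")
} else {
  existing <- data.frame(date = as.Date(character()), shoes = character())
}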
Thanks in advance for any feedback.
As part of my dissertation I'm trying to scrape data from the web (I've been working on this for months). I have a couple of issues:
- Each document I want to scrape has a document number. However, the numbers don't always go up in order: one document number might be 2022, but the next is not necessarily 2023; it could be 2038, 2040, etc. I don't want to go through by hand to get each document number. I have tried to wrap download.file in purrr::safely(), but once it hits a document that does not exist it stops.
- Second, I'm still fairly new to R and am having a hard time setting up destfile for multiple documents. When I index the path for where to store the downloaded data, the first document ends up stored in the named place, but the next document comes out as NA.
Here's the code I've been working on:
base.url <- "https://www.europarl.europa.eu/doceo/document/"
document.name.1 <- "P-9-2022-00"
document.extension <- "_EN.docx"
#document.number <- 2321
document.numbers <- c(2330:2333)
for (i in 1:length(document.numbers)) {
temp.doc.name <- paste0(base.url,
document.name.1,
document.numbers[i],
document.extension)
print(temp.doc.name)
#download and save data
safely <- purrr::safely(download.file(temp.doc.name,
destfile = "/Users/...[i]"))
}
Ultimately, I need to scrape about 120,000 documents from the site. Where is the best place to store the data? I'm thinking I might run the code for each of the 15 years I'm interested in separately, in order to (hopefully) keep it manageable.
Note: I've tried several different ways to scrape the data. Unfortunately for me, the RSS feed only has the most recent 25. Because there are multiple dropdown menus to navigate before you reach the .docx file, my workaround is to use document numbers. I am, however, open to more efficient ways to scrape these written questions.
Again, thanks for any feedback!
Kari
After quickly checking out the site, I agree that I can't see any easier way to do this, because the search function doesn't appear to be URL-based. So what you need to do is poll each candidate URL, see if it returns a "good" status (usually 200), and skip the download when it returns a "bad" status (like 404). The following code block does that.
Note that purrr::safely doesn't run a function -- it creates another function that is safe and which you then can call. The created function returns a list with two slots: result and error.
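For example (a quick illustration, not part of the solution; log("a") is just a call that happens to error):
slog <- purrr::safely(log)
slog(10)$result   # 2.302585; $error is NULL
slog("a")$error   # the error object; $result is NULL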
base.url <- "https://www.europarl.europa.eu/doceo/document/"
document.name.1 <- "P-9-2022-00"
document.extension <- "_EN.docx"
#document.number <- 2321
document.numbers <- c(2330:2333,2552,2321)
sHEAD <- purrr::safely(httr::HEAD)
sdownload <- purrr::safely(download.file)

for (i in seq_along(document.numbers)) {
  file_name <- paste0(document.name.1, document.numbers[i], document.extension)
  temp.doc.name <- paste0(base.url, file_name)
  print(temp.doc.name)

  # Poll the URL once and keep the status code
  status <- sHEAD(temp.doc.name)$result$status_code
  print(status)

  # Only download documents that actually exist (2xx status)
  if (isTRUE(status %in% 200:299)) {
    sdownload(temp.doc.name, destfile = file_name, mode = "wb")
  }
}
It might not be as simple as all of the valid URLs returning a '200' status; in general, statuses in the 200:299 range should be fine (I edited the answer to reflect this).
I used parts of this answer in my answer.
If the file does not exist, tryCatch simply skips it:
library(tidyverse)

get_data <- function(index) {
  url <- paste0(
    "https://www.europarl.europa.eu/doceo/document/",
    "P-9-2022-00",
    index,
    "_EN.docx"
  )
  # Wrap the download in tryCatch so a missing document is skipped instead of stopping the run
  tryCatch(
    download.file(url = url,
                  destfile = paste0(index, ".docx"),
                  mode = "wb",
                  quiet = TRUE),
    error = function(e) print(paste(index, "does not exist - SKIPPED"))
  )
}
map(2000:5000, get_data)
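Since get_data is called only for its side effect (the downloaded file), purrr::walk would do the same job without returning and printing a long list at the end:
walk(2000:5000, get_data)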
I am trying to scrape data from the web, specifically from a table that has different filters and pages. I have the following code:
library(rvest)
url.colombia.compra <- "https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra?&number_order=&state=&entity=&tool=IAD%20Software%20I%20-%20Microsoft&date_to=&date_from="
tmp <- read_html(url.colombia.compra)
tmp_2 <- html_nodes(tmp, ".active")
The problem is that the code gives me a list, but I need to format it as a table, and I have not succeeded. Besides that, it only shows me data from the first page of the table. How could I extend the code so that it reads the data from all the pages of the table and formats it as a table?
[Screenshot of the table with the data]
I would split this problem into two parts. The first is how to programmatically access each of the 11 pages of this online table.
Since this is a simple html table, using the "Next" button (siguiente) will take us to a new page. If we look at the URL on the Next page, we can see the page number in the query parameters.
...tienda-virtual-del-estado-colombiano/ordenes-compra?page=1&number_order=&state...
We know that the pages are numbered starting from 0 (because "Next" takes us to page=1), and using the navigation bar we can see that there are 11 pages.
We can use the query parameters to construct a series of rvest::read_html() calls, one per page number, simply by using lapply and paste0 to fill in page=. This will let us access all the pages of the table.
The second part is making use of rvest::html_table, which will parse a tibble from the result of each read_html call.
pages <-
  lapply(0:11, function(x) {
    data.frame(
      html_table(
        read_html(x = paste0("https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra?page=",
                             x,
                             "&number_order=&state=&entity=&tool=IAD%20Software%20I%20-%20Microsoft&date_to_=%20&date_from_="))
      )
    )
  })
The result is a list of dataframes which we can combine with do.call.
do.call(rbind, pages)
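Equivalently, dplyr's bind_rows combines the list in one step and is a bit more forgiving if the columns ever differ between pages (a minor alternative, assuming dplyr is installed):
dplyr::bind_rows(pages)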
I am currently working on a project that deals with web scraping using R.
It is very basic but I am trying to understand how it works.
I am using a Google stock search page as my URL, and the Google ticker as the stock I am viewing.
Here is my code:
# Declaring our URL variable
google = html("https://www.google.com/searchq=google+stock%5D&oq=google+stock%5D&aqs=chrome..69i57j0l2j69i60l3.5208j0j4&sourceid=chrome&ie=UTF-8")
# Prints and initializes the data
google_stock = google %>%
html_nodes("._FOc , .fac-l") %>%
html_text()
# Creating a data frame table
googledf = data.frame(table(google_stock))
# Orders the data into highest frequency shown
googledf_order = googledf[order(-googledf$Freq),]
# Displays first few rows of data
head(googledf_order)
When I run this I get integer(0), which should be displaying a stock price.
I am not sure why this is not displaying the correct stock price.
I also tried running the code up until html_text() and it still did not show me the data that I wanted or needed.
I just need this to display the stock price from the web.
I am using SelectorGadget to get my html node ("._FOc , .fac-l")
I think there might be something wrong with your URL. When I try to paste it into a browser, I get a 404 error.
Instead of scraping you could use the quantmod package. To get historical data you could use the following:
library(quantmod)
start <- as.Date("2018-01-01")
end <- as.Date("2018-01-20")
getSymbols("GOOGL", src = "google", from = start, to = end)
To get the current stock quote you could use:
getQuote("GOOGL", src = "yahoo")
From the quantmod documentation, the getQuote function "only handles sourcing quotes from Yahoo Finance."
I'm trying to scrape some data using this code.
require(XML)
tables <- readHTMLTable('http://fantasynba.movistarplus.es/basketball/reports/player_rankings.asp')
str(tables, max.level = 1)
df <- tables$searchResults
It works perfectly, but the problem is that it only gives me data for the first 188 observations, which correspond to the players whose position is "Base". Whenever I try to get data for "Pivot" or "Alero" players, it gives me the same info. Since the URL never changes, I don't know how to get this info.
For a little project for myself I'm trying to get the results from some races.
I can access the pages with the results and download the data from the table on each page. However, there are only 20 results per page. Luckily the web addresses are built logically, so I can construct them and, in a loop, access these pages and download the data. However, each category has a different number of racers and thus can have a different number of pages. I want to avoid having to manually check how many racers there are in each category.
My first thought was to just generate a lot of links, making sure there are enough (based on the total amount of racers) to get all the data.
library(XML)

nrs <- rep(seq(1,5,1),2)
sex <- c("M","M","M","M","M","F","F","F","F","F")
links <- NULL
#Loop to create 10 links: 5 for the male age group 18-24, 5 for the female age group 18-24.
#However, there are only 3 pages with a table in the male age group.
for (i in 1:length(nrs) ) {
links[i] = paste("http://www.ironman.com/triathlon/events/americas/ironman/texas/results.aspx?p=",nrs[i],"&race=texas&rd=20160514&sex=",sex[i],"&agegroup=18-24&loc=",sep="")
}
resultlist <- list() #create empty list to store results
for (i in 1:length(links)) {
results = readHTMLTable(links[i],
as.data.frame = TRUE,
which=1,
stringsAsFactors = FALSE,
header = TRUE) #get data
resultlist[[i]] <- results #combine results in one big list
}
results = do.call(rbind, resultlist) #combine results into dataframe
As you can see in this code, readHTMLTable throws an error as soon as it encounters a page with no table, and then stops.
I thought of two possible solutions.
1) Somehow check whether all the links exist. I tried url.exists from the RCurl package, but this doesn't work: it returns TRUE for all pages, because the page does exist, it just doesn't have a table in it (so for me it is a false positive). I would need some code to check whether a table exists in the page, but I don't know how to go about that.
2) Suppress the error message from readHTMLTable so the loop continues, but I'm not sure if that's possible.
Any suggestions for these two methods, or any other suggestions?
I think that method #2 is easier. I modified your code with tryCatch, one of R's built-in exception handling mechanisms. It works for me.
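Roughly, the modification might look like this (a sketch based on the loop above; a page without a table simply yields NULL, which rbind ignores when the results are combined):
library(XML)

resultlist <- list() #create empty list to store results
for (i in 1:length(links)) {
  resultlist[[i]] <- tryCatch(
    readHTMLTable(links[i],
                  as.data.frame = TRUE,
                  which = 1,
                  stringsAsFactors = FALSE,
                  header = TRUE),
    error = function(e) NULL #pages without a table are skipped
  )
}
results = do.call(rbind, resultlist) #combine results into dataframe; NULL entries drop out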
PS I would recommend using rvest for web scraping like this.
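For reference, a rough rvest equivalent of the body of that loop might look like the following (a sketch; it assumes the results table is the first table on each page):
library(rvest)

# Inside the same loop over links:
page <- read_html(links[i])
tables <- html_table(page, header = TRUE)
if (length(tables) > 0) resultlist[[i]] <- tables[[1]] #pages without a table store nothing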