Web scraping stocks - R

I am currently working on a project that deals with web scraping using R.
It is very basic but I am trying to understand how it works.
I am using a Google stock search as my URL, with Google's ticker as the stock I am viewing.
Here is my code:
# Declaring our URL variable
google = html("https://www.google.com/searchq=google+stock%5D&oq=google+stock%5D&aqs=chrome..69i57j0l2j69i60l3.5208j0j4&sourceid=chrome&ie=UTF-8")
# Prints and initializes the data
google_stock = google %>%
html_nodes("._FOc , .fac-l") %>%
html_text()
# Creating a data frame table
googledf = data.frame(table(google_stock))
# Orders the data into highest frequency shown
googledf_order = googledf[order(-googledf$Freq),]
# Displays first few rows of data
head(googledf_order)
When I run this I get integer(0), when it should be displaying a stock price.
I am not sure why this is not displaying the correct stock price.
I also tried running the code up until html_text() and it still did not show me the data that I wanted or needed.
I just need this to display the stock price from the web.
I am using SelectorGadget to get my html node ("._FOc , .fac-l")

I think there might be something wrong with your URL. When I try to paste it into a browser, I get a 404 error.
Instead of scraping, you could use the quantmod package. To get historical data, you could use the following:
library(quantmod)
start <- as.Date("2018-01-01")
end <- as.Date("2018-01-20")
getSymbols("GOOGL", src = "google", from = start, to = end)
To get the current stock quote you could use:
getQuote("GOOGL", src = "yahoo")
From the quantmod documentation, the getQuote function "only handles sourcing quotes from Yahoo Finance."
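For example, a minimal sketch of pulling just the latest traded price out of that quote; the "Last" column name is the one getQuote usually returns, but verify it against the data frame you actually get back:
library(quantmod)
# getQuote() returns a one-row data frame of quote fields
quote <- getQuote("GOOGL", src = "yahoo")
# The latest traded price is typically in the "Last" column
quote$Last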

Related

Way to see if website was updated with rvest?

I'm trying to web scrape Nike to see when new sneakers drop. I'm relatively new to web scraping and was wondering if there is an easy way to check for differences since the last search, or to pull information about the date products were posted.
So far I've been able to pull the list of the most recent products by scraping the new arrivals page that's sorted by newest, but can't seem to find information on that page about when items were posted.
library(rvest)
library(tidyverse)
url<-"https://www.nike.com/w/new-mens-shoes-3n82yznik1zy7ok?sort=newest"
search<-read_html(url)
search%>%html_nodes(css ="div.product-card")%>%html_text()
Any tips are appreciated.
The easiest way is to save the list to your local drive and then, each time you perform the query, compare the newly obtained list with the previously saved one.
Here I first created a data frame with today's date and today's query results and saved it to a file named "Nike.csv" (a one-off bootstrap for creating that file is sketched at the end of this answer). Each run of the script below retrieves the latest list, determines the newly added shoes, and appends them to the existing file. You can then open the CSV file and see the date when each shoe was added to the list.
library(rvest)
library(dplyr)
#Read the file of previously found shoes
existing <- read.csv("Nike.csv")
#Retrieve the latest list
url<-"https://www.nike.com/w/new-mens-shoes-3n82yznik1zy7ok?sort=newest"
search<-read_html(url)
shoes <- search%>%html_nodes(css ="div.product-card")%>%html_text()
#Find the shoes in the latest list that are not in the previously saved list
newshoes <- !(shoes %in% existing$shoes)
#Append the new shoes to the existing file
if (length(shoes[newshoes]) > 0) {
  df <- data.frame(date = Sys.Date(), shoes = shoes[newshoes])
  # write.csv() ignores 'append', so use write.table() to add rows without rewriting the header
  write.table(df, "Nike.csv", sep = ",", row.names = FALSE, col.names = FALSE, append = TRUE)
} else {
  print("No new records found")
}
There are ways to optimize this and improve the error checking, but this will get you started.
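For reference, a minimal one-off bootstrap that could create the initial "Nike.csv" before the script above has anything to compare against; the column names date and shoes are assumptions and must match what the comparison code reads back in:
library(rvest)
# One-off: scrape today's list and save it as the starting point
url <- "https://www.nike.com/w/new-mens-shoes-3n82yznik1zy7ok?sort=newest"
search <- read_html(url)
shoes <- search %>% html_nodes(css = "div.product-card") %>% html_text()
# Column names must match those the comparison script expects ("date" and "shoes")
write.csv(data.frame(date = Sys.Date(), shoes = shoes), "Nike.csv", row.names = FALSE)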

Using R to mimic “clicking” a download file button on a webpage

There are two parts to my question, as I explored two methods in this exercise, but I succeeded in neither. I would greatly appreciate it if someone could help me out.
[PART 1:]
I am attempting to scrape data from a webpage on the Singapore Stock Exchange, https://www2.sgx.com/derivatives/negotiated-large-trade, which contains data stored in a table. I have some basic knowledge of scraping data using rvest. However, using the Inspector in Chrome, the HTML hierarchy is much more complex than I expected. I can see that the data I want is hidden under <div class="table-container">, and here's what I've tried:
library(rvest)
library(httr)
library(XML)
SGXurl <- "https://www2.sgx.com/derivatives/negotiated-large-trade"
SGXdata <- read_html(SGXurl, stringsASfactors = FALSE)
html_nodes(SGXdata,".table-container")
However, nothing is picked up by the code, and I doubt I'm using it correctly.
[PART 2:]
I then realized that there's a small "download" button on the page which downloads exactly the data file I want in .csv format. So I was thinking of writing some code to mimic the download button, and I found this question: Using R to "click" a download file button on a webpage, but I'm unable to get it to work even with some modifications to that code.
There are a few filters on the webpage; mostly I will be interested in downloading data for a particular business day while leaving the other filters blank, so I've tried writing the following function:
library(httr)
library(rvest)
library(purrr)
library(dplyr)
crawlSGXdata = function(date){
POST("https://www2.sgx.com/derivatives/negotiated-large-trade",
body = NULL,
encode = "form",
write_disk("SGXdata.csv")) -> resfile
res = read.csv(resfile)
return(res)
}
I intended to put the function input "date" into the "body" argument, but I was unable to figure out how to do that, so I started with "body = NULL", assuming it doesn't do any filtering. However, the result is still unsatisfactory. The downloaded file is basically empty, with the following error:
Request Rejected
The requested URL was rejected. Please consult with your administrator.
Your support ID is: 16783946804070790400
The content is loaded dynamically from an API call that returns JSON. You can find this in the network tab of the browser's dev tools.
The following returns that content. I find the total number of result pages and then loop over them, combining the data frame returned from each call into one final data frame containing all results.
library(jsonlite)
# First request: page 0 of the results for the chosen business date
url <- 'https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode&category=futures&businessdatestart=20190708&businessdateend=20190708&pagestart=0&pageSize=250'
r <- jsonlite::fromJSON(url)
num_pages <- r$meta$totalPages
df <- r$data
# Template for the remaining pages; "placeholder" is swapped for the page number
url2 <- 'https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode&category=futures&businessdatestart=20190708&businessdateend=20190708&pagestart=placeholder&pageSize=250'
if (num_pages > 1) {
  # Pages are zero-indexed and page 0 was already fetched above
  for (i in seq(1, num_pages - 1)) {
    newUrl <- gsub("placeholder", i, url2)
    newdf <- jsonlite::fromJSON(newUrl)$data
    df <- rbind(df, newdf)
  }
}
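Building on that, here is a rough sketch of the date-parameterised wrapper PART 2 was aiming for; it simply substitutes a yyyymmdd string (the format used in the URL above) into the API query and reuses the same paging logic:
library(jsonlite)
# Hypothetical wrapper: fetch all pages of results for a single business day.
# "date" is expected as a "yyyymmdd" string, e.g. "20190708".
crawlSGXdata <- function(date) {
  base <- paste0("https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode",
                 "&category=futures&businessdatestart=", date,
                 "&businessdateend=", date,
                 "&pagestart=%d&pageSize=250")
  r <- jsonlite::fromJSON(sprintf(base, 0))
  df <- r$data
  num_pages <- r$meta$totalPages
  if (num_pages > 1) {
    for (i in seq(1, num_pages - 1)) {
      df <- rbind(df, jsonlite::fromJSON(sprintf(base, i))$data)
    }
  }
  df
}
# Usage: sgx_20190708 <- crawlSGXdata("20190708")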

Web Scraping Image URL for a series of events in ESPN Play-By-Play

I am trying to use web scraping to generate a play by play dataset from ESPN. I have figured out most of it, but have been unable to tell which team the event is for, as this is only encoded on ESPN in the form of an image. The best way I have come up with to solve this problem is to get the URL of the logo for each entry and compare it to the URL of the logo for each team at the top of the page. However, I have been unable to figure out how to get an attribute such as the url from the image.
I am running this in R using the rvest package. The URL I am scraping is https://www.espn.com/mens-college-basketball/playbyplay?gameId=400587906 and I am using the SelectorGadget Chrome extension. I have also tried comparing the name of the player to the box score, which has all of the players listed, but each team has a player with the last name Jones, so I would prefer to identify the team from the image, as this will always be right.
library(rvest)
url <- "https://www.espn.com/mens-college-basketball/playbyplay?gameId=400587906"
webpage <- read_html(url)
# have been able to successfully scrape game_details and score
game_details_html <- html_nodes(webpage,'.game-details')
game_details <- html_text(game_details_html) %>% as.character()
score_html <- html_nodes(webpage,'.combined-score')
score <- html_text(score_html)
# have not been able to scrape image
ImgNode <- html_nodes(webpage, css = "#gp-quarter-1 .team-logo")
link <- html_attr(ImgNode, "src")
For each event, I want it to be labeled "Duke" or "Wake Forest".
Is there a way to generate the URL for each image? Any help would be greatly appreciated.
"https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/150.png&h=100&w=100"
"https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/154.png&h=100&w=100"
Your code returns these.
500/150 is Duke and 500/154 is Wake Forest. You can create a simple dataframe with these and then join the tables.
link_df <- as.data.frame(link)
link_ref_df <- data.frame(link = c("https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/150.png&h=100&w=100",
                                   "https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/154.png&h=100&w=100"),
                          team_name = c("Duke", "Wake Forest"))
link_merged <- merge(link_df,
                     link_ref_df,
                     by = 'link',
                     all.x = TRUE)
This is not scalable if you're doing hundreds of these with other teams, but it works for this specific case.
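If you do need something more scalable, here is a small sketch of one option: pull the numeric team id out of each logo URL with a regular expression and keep a single id-to-name lookup table. The assumption is that ESPN keeps the id in the same position of the combiner URL, and the lookup table shown is only seeded with these two teams:
# Extract the numeric team id (e.g. "150" or "154") from each logo URL
team_id <- sub(".*/ncaa/500/([0-9]+)\\.png.*", "\\1", link)
# Hypothetical lookup table; extend it as more teams are encountered
id_lookup <- data.frame(team_id = c("150", "154"),
                        team_name = c("Duke", "Wake Forest"),
                        stringsAsFactors = FALSE)
# match() keeps the original play-by-play order, unlike merge()
event_team <- id_lookup$team_name[match(team_id, id_lookup$team_id)]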

Scrape contents of dynamic pop-up window using R

I'm stuck on this one after much searching....
I started with scraping the contents of a table from:
http://www.skatepress.com/skates-top-10000/artworks/
Which is easy:
library(XML)  # provides readHTMLTable()
data <- data.frame()
for (i in 1:100){
  print(paste("page", i, "of 100"))
  url <- paste("http://www.skatepress.com/skates-top-10000/artworks/", i, "/", sep = "")
  temp <- readHTMLTable(stringsAsFactors = FALSE, url, which = 1, encoding = "UTF-8")
  data <- rbind(data, temp)
} # end of scraping loop
However, I need to additionally scrape the detail that is contained in a pop-up box when you click on each name (and on the artwork title) in the list on the site.
I can't for the life of me figure out how to pass the breadcrumb (or artist-id or painting-id) through in order to make this happen. Since straight up using rvest to access the contents of the nodes doesn't work, I've tried the following:
I tried passing the painting id through in the url like this:
url <- ("http://www.skatepress.com/skates-top-10000/artworks/?painting_id=576")
site <- html(url)
But it still gives an empty result when scraping:
node1 <- "bread-crumb > ul > li.activebc"
site %>% html_nodes(node1) %>% html_text(trim = TRUE)
character(0)
I'm (clearly) not a scraping expert so any and all assistance would be greatly appreciated! I need a way to capture this additional information for each of the 10,000 items on the list...hence why I'm not interested in doing this manually!
Hoping this is an easy one and I'm just overlooking something simple.
This will be a more efficient base scraper and you can get progress bars for free with the pbapply package:
library(xml2)
library(httr)
library(rvest)
library(dplyr)
library(pbapply)
library(jsonlite)
base_url <- "http://www.skatepress.com/skates-top-10000/artworks/%d/"
n <- 100
bind_rows(pblapply(1:n, function(i) {
  mutate(html_table(html_nodes(read_html(sprintf(base_url, i)), "table"))[[1]],
         `Sale Date` = as.Date(`Sale Date`, format = "%m.%d.%Y"),
         `Premium Price USD` = as.numeric(gsub(",", "", `Premium Price USD`)))
})) -> skatepress
I added trivial date & numeric conversions.
I believe your main issue is that the site requires a login to get the additional data. You should give that (i.e. logging in) a shot using httr and grab the wordpress_logged_inXXXXXXX… cookie from that endeavour. I just grabbed it by inspecting the session with Developer Tools in Chrome, and that will also work for you (but it's worth the time to learn how to do it via httr).
You'll need to scrape two additional <a … tags from each table row. The one for "artist" looks like:
Pablo Picasso
You can scrape the contents with:
POST("http://www.skatepress.com/wp-content/themes/skatepress/scripts/query_artist.php",
set_cookies(wordpress_logged_in_XXX="userid%XXXXXreallylongvalueXXXXX…"),
encode="form",
body=list(id="pab_pica_1881"),
verbose()) -> artist_response
fromJSON(content(artist_response, as="text"))
(The return value is too large to post here)
The one for "artwork" looks like:
Les femmes d′Alger (Version ′O′)
and you can get that in similar fashion:
POST("http://www.skatepress.com/wp-content/themes/skatepress/scripts/query_artwork.php",
set_cookies(wordpress_logged_in_XXX="userid%XXXXXreallylongvalueXXXXX…"),
encode="form",
body=list(id=576),
verbose()) -> artwork_response
fromJSON(content(artwork_response, as="text"))
That's not huge but I won't clutter the response with it.
NOTE that you can also use rvest's html_session to do the login (which will get you cookies for free) and then continue to use that session in the scraping (vs read_html) which will mean you don't have to do the httr GET/PUT.
You'll have to figure out how you want to incorporate that data into the data frame or associate it with it via various id's in the data frame (or some other strategy).
You can see the page call those two PHP scripts via Developer Tools, which also shows the data it passes in. I'm also genuinely surprised that the site doesn't have any anti-scraping clauses in its ToS, but it doesn't.
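For completeness, a rough sketch of the html_session route mentioned in the NOTE above. The login URL and form field names below (wp-login.php with fields "log" and "pwd", typical for WordPress) are assumptions you would need to confirm in Developer Tools:
library(rvest)
# Start a session so cookies are carried along automatically
login_page <- html_session("http://www.skatepress.com/wp-login.php")
# Assumed: the login form is the first form on that page
login_form <- html_form(login_page)[[1]]
filled <- set_values(login_form, log = "your_username", pwd = "your_password")
sess <- submit_form(login_page, filled)
# sess now holds the wordpress_logged_in cookie and can be reused with
# jump_to() for the table pages and for the query_artist.php / query_artwork.php calls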

How can I scrape data from this website (multiple webpages) using R?

I am a beginner at scraping data from websites. It seems difficult for me to interpret the structure of the HTML using XML or other packages.
Can anyone help me to download the data from this website?
http://wszw.hzs.mofcom.gov.cn/fecp/fem/corp/fem_cert_stat_view_list.jsp
It is about investment from China, and the page is in Chinese.
What I've tried so far:
library("rvest")
url <- "http://wszw.hzs.mofcom.gov.cn/fecp/fem/corp/fem_cert_stat_view_list.jsp"
firm <- url %>%
html() %>%
html_nodes(xpath = '//*[@id="Grid1MainLayer"]/table[1]') %>%
html_table()
firm <- firm[[1]]
head(firm)
You can try the readHTMLTable function from the XML package, which downloads all the tables on the page and already formats them as data frames.
library(XML)
all_tables = readHTMLTable("http://wszw.hzs.mofcom.gov.cn/fecp/fem/corp/fem_cert_stat_view_list.jsp")
Then, since there is only one table on the page you linked, it should be enough to take the first element:
target_table = all_tables[[1]]
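Since the page is in Chinese, it can also help to pass an encoding explicitly so the characters survive parsing. A small sketch, assuming the page is served as UTF-8 (check the page's actual charset):
library(XML)
# Pass the encoding through so Chinese characters are read correctly
all_tables <- readHTMLTable("http://wszw.hzs.mofcom.gov.cn/fecp/fem/corp/fem_cert_stat_view_list.jsp",
                            encoding = "UTF-8", stringsAsFactors = FALSE)
target_table <- all_tables[[1]]
head(target_table)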
