Unfortunately, I am not an experienced scraper yet, but I need to scrape key statistics for multiple stocks from Yahoo Finance with R.
I am somewhat familiar with scraping data directly from HTML using read_html(), html_nodes(), and html_text() from the rvest package. However, the MSFT key statistics page is more complicated: I am not sure whether the stats sit in XHR, JS, or the document itself, and I am guessing the data is stored as JSON.
If anyone knows a good way to extract and parse the data on this page with R, please answer my question; many thanks in advance!
Alternatively, if there is a more convenient way to get these metrics via quantmod or Quandl, kindly let me know; that would be an excellent solution!
The goal is to have tickers/symbols as row names/labels and the statistics as columns. An illustration of what I need can be found at this Finviz link:
https://finviz.com/screener.ashx
The reason I would like to scrape Yahoo Finance is that Yahoo also includes key stats such as Enterprise Value and EBITDA.
EDIT:
I meant to refer to the key statistics page, for example: https://finance.yahoo.com/quote/MSFT/key-statistics/ . The code should produce one data frame with rows of stock symbols and columns of key statistics.
Code
library(rvest)
library(tidyverse)
# Define stock name
stock <- "MSFT"
# Extract and transform data
df <- paste0("https://finance.yahoo.com/quote/", stock, "/financials?p=", stock) %>%
  read_html() %>%
  html_table() %>%
  map_df(bind_cols) %>%
  # Transpose
  t() %>%
  as_tibble()
# Set first row as column names
colnames(df) <- df[1,]
# Remove first row
df <- df[-1,]
# Add stock name column
df$Stock_Name <- stock
Result
Revenue `Total Revenue` `Cost of Revenu… `Gross Profit`
<chr> <chr> <chr> <chr>
1 6/30/2… 110,360,000 38,353,000 72,007,000
2 6/30/2… 96,571,000 33,850,000 62,721,000
3 6/30/2… 91,154,000 32,780,000 58,374,000
4 6/30/2… 93,580,000 33,038,000 60,542,000
# ... with 25 more variables: ...
edit:
Or, for convenience, as a function:
get_yahoo <- function(stock){
  # Extract and transform data
  x <- paste0("https://finance.yahoo.com/quote/", stock, "/financials?p=", stock) %>%
    read_html() %>%
    html_table() %>%
    map_df(bind_cols) %>%
    # Transpose
    t() %>%
    as_tibble()
  # Set first row as column names
  colnames(x) <- x[1,]
  # Remove first row
  x <- x[-1,]
  # Add stock name column
  x$Stock_Name <- stock
  return(x)
}
Usage: get_yahoo(stock)
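To get one data frame covering several symbols (the goal stated in the question), the function can be mapped over a vector of tickers and the results row-bound. A minimal sketch, assuming the Yahoo pages keep serving plain HTML tables; the same pattern should also work for the key-statistics URL by swapping "financials" for "key-statistics" in the paste0() call:
library(tidyverse)
library(rvest)

stocks <- c("MSFT", "AAPL", "IBM")   # illustrative tickers

# Map get_yahoo() over the tickers and stack the results into one data frame;
# map_dfr()/bind_rows() fill columns missing for some symbols with NA.
all_stats <- map_dfr(stocks, get_yahoo)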
I hope this is what you are looking for:
library(quantmod)
library(plyr)
what_metrics <- yahooQF(c("Price/Sales",
"P/E Ratio",
"Price/EPS Estimate Next Year",
"PEG Ratio",
"Dividend Yield",
"Market Capitalization"))
Symbols<-c("XOM","MSFT","JNJ","GE","CVX","WFC","PG","JPM","VZ","PFE","T","IBM","MRK","BAC","DIS","ORCL","PM","INTC","SLB")
metrics <- getQuote(paste(Symbols, sep="", collapse=";"), what=what_metrics)
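The result of getQuote() should already be close to the stated goal: one row per symbol, with the tickers as row names and the requested statistics as columns. A small optional usage sketch:
head(metrics)

# Optional: turn the row names (the tickers) into an explicit Symbol column
metrics_df <- tibble::rownames_to_column(as.data.frame(metrics), "Symbol")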
To get the list of available metrics, call yahooQF() with no arguments.
You can use lapply to get prices for more than one symbol:
library(quantmod)
Symbols<-c("XOM","MSFT","JNJ","GE","CVX","WFC","PG","JPM","VZ","PFE","T","IBM","MRK","BAC","DIS","ORCL","PM","INTC","SLB")
StartDate <- as.Date('2015-01-01')
Stocks <- lapply(Symbols, function(sym) {
Cl(na.omit(getSymbols(sym, from=StartDate, auto.assign=FALSE)))
})
Stocks <- do.call(merge, Stocks)
In this case I get the closing prices; see the Cl() function.
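If a plain data frame is preferred over the merged xts object, here is a small follow-up sketch (the column names are whatever getSymbols()/Cl() produce, e.g. XOM.Close):
library(quantmod)   # xts/zoo provide index() and coredata()

# Convert the merged xts of closing prices into a data frame with a Date column
stocks_df <- data.frame(Date = index(Stocks), coredata(Stocks))
head(stocks_df)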
Related
I am interested in the data provided by the link: https://stats.bis.org/api/v1/data/WS_XRU_D/D.RU../all?detail=full
My code for retrieving the daily exchange rate history has so far developed into just these basic lines, and I am stuck with the realization that I am not able to extract the core daily data I am interested in:
u <- "https://stats.bis.org/api/v1/data/WS_XRU_D/D.RU../all?detail=full"
d <- xml2::read_xml(u)
d
{xml_document}
<StructureSpecificData
xmlns:ss="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/data/structurespecific
...
[1] <message:Header>\n <message:ID>IDREF85e8b4cf-d7d2-4506-81e6-adf668cd841b</message:ID>\n <message:Test>fa ...
[2] <message:DataSet ss:dataScope="DataStructure" xsi:type="ns1:DataSetType" ss:structureRef="BIS_WS_XRU_D_1_0 ...
I would greatly appreciate any suggestion on how to proceed correctly with retrieving the XML data!
You started on the right track; it is just a matter of extracting the correct nodes and obtaining the attribute values:
library(xml2)
library(dplyr)
#read the page
url <- "https://stats.bis.org/api/v1/data/WS_XRU_D/D.RU../all?detail=full"
page <- xml2::read_xml(url)
#extract the OBS nodes
#<Obs TIME_PERIOD="1992-07-01" OBS_VALUE="0.12526" OBS_STATUS="A" OBS_CONF="F"/>
observations <- xml_find_all(page, ".//Obs")
#extract the attribute values from each node
date <- observations %>% xml_attr("TIME_PERIOD") %>% as.Date("%Y-%m-%d")
value <- observations %>% xml_attr("OBS_VALUE") %>% as.numeric()
status <- observations %>% xml_attr("OBS_STATUS")
conf <- observations %>% xml_attr("OBS_CONF")
answer <- data.frame(date, value, status, conf)
plot(answer$date, answer$value)
This is the second time I'm trying to formulate a question here - hopefully this time I'll make myself clear and in line with the recommendations on this site.
To the problem: I have a dataset on certain companies and their headquarters. The structure of the data looks a bit messy to me (please see the link below). Even more problematic, the data come in 15 separate files for the years 2003, 2007, 2011, 2015 and 2019 (three csv files per year, because of the size I guess).
For the purpose of this question I've merged the three files for the year 2003 into one.
Now, what I want is this: 1) merge all 15 files and, from there, 2) generate a set of variables indicating the total number of companies per country and year [note, though, that the year is not included as a variable in the files].
Since I have data on four main addresses, I'd like to create separate "sum variables" based on the address order (1, 2, 3, 4) and, in addition, one variable that does not take the order of countries into account.
Just to give an example of how I'd like it to look like:
country year total_c1 total_c2 ...
USA 2003 100 100
USA 2007 150 120
CAN 2003 50 50
CAN 2007 100 60
I intend to merge this data with a panel data that I have (country-year data).
Please click on the link to access the data. Data sample for 2003.
The first variable indicates the ID of companies.
The second (country_1) is the country of the first address, the third (country_2) the country of the second address, and so on. After that comes a bunch of variables (over 2,800), each indicating a single company in the dataset.
Here is what I've come up with in my attempt to do this in R (rather than doing it manually). Credit to @Duck for helping me with the merging part.
myfun <- function(df)
{
  #Code
  new <- df %>%
    pivot_longer(starts_with('country')) %>%
    group_by(name) %>%
    summarise_all(sum, na.rm = T)
  return(new)
}
#Load files
myfiles <- list.files(pattern = '.csv')
#List of files
L <- lapply(myfiles, read.csv)
#Apply function
L <- lapply(L,myfun)
# turn to a df
df <- as.data.frame(L)
But this didn't work out for me since I couldn't figure out which year the data come from. Instead I merged the files for one year (for example 2003) and tried to create the variables I want by running this:
df2 <- df %>%
  mutate(Total_c1 = select(., A2654:U9340) %>% rowSums(na.rm = TRUE))
df3 <- df2 %>%
  group_by(country_1) %>%
  summarise(Total_c1 = sum(Total_c1, na.rm = TRUE))
And here I'm stuck. Any suggestion that can take me forward from here (or get me started on the right track) would be much appreciated!
You can try the following code, assuming all the csv files that you want to combine are in the working directory itself.
library(tidyverse)
myfiles <- list.files(pattern = '.csv')
map_df(myfiles, function(x) {
  year_number <- readr::parse_number(x)   # year taken from the file name
  df <- read.csv2(x)
  df %>%
    mutate(Total = rowSums(select(., -(1:5)), na.rm = TRUE)) %>%
    pivot_longer(cols = starts_with('country')) %>%
    group_by(name, value) %>%
    summarise(Total = sum(Total)) %>%
    pivot_wider(names_from = name, values_from = Total) %>%
    rename(country = value) %>%            # so arrange(country, year) below works
    mutate(year = year_number)
}) %>%
  arrange(country, year) -> result
result
You have asked for help with several different problems here; I will answer just one. With the data.table library you can efficiently read in many CSVs from the same directory that share the same (or nearly the same) column names. This produces one object (l1):
library(data.table)
# setDTthreads() # use some appropriate integer
# unzip all the files you want row bound .... to this directory
setwd("D:/Politics/General.2020/BallotReturnStats/11.24.2020")
l1 <- as.data.table({})
for(i in dir()) {l1 <- rbind(l1,fread(i),fill=TRUE)}
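As a side note (not part of the original answer), growing the table with rbind() inside the loop copies the accumulated object on every iteration; data.table's rbindlist() does the same job in one step. A minimal sketch under the same assumption that the working directory holds the CSVs to combine:
library(data.table)

# Read every file in the directory and row-bind the results in one step;
# fill = TRUE pads columns missing from some files with NA.
files <- dir()                      # narrow with dir(pattern = "\\.csv$") if needed
l1 <- rbindlist(lapply(files, fread), fill = TRUE)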
I am new to R and this is my first question. I apologize if this has been solved before, but I haven't found a solution.
Using the code below, which I found here, I can get data for a specific subsector from the Finviz screener:
library (rvest)
url <- read_html("https://finviz.com/screener.ashx?v=111&f=geo_usa,ind_specialtyindustrialmachinery&o=industry")
tables <- html_nodes(url,"table")
screener <- tables %>% html_nodes("table") %>% .[11] %>%
html_table(fill=TRUE) %>% data.frame()
head(screener)
It was a bit difficult to find the table number, but I did. My question concerns screens with more than 20 results, like the one in my example; the site pages through them with &r=1, &r=21, &r=41, &r=61 at the end of each URL.
How could I build the loop structure for this case?
i=0
for(z in ...){
Many thanks in advance for your help.
Updated script based on the new table number and link:
library (rvest)
library(stringr)
url <- "https://finviz.com/screener.ashx?v=111&f=geo_usa,ind_specialtyindustrialmachinery&o=industry"
TableList<-c("1","21","41","61") # table list
GetData <- function(URL, tableNo){
  cat('\n', "Running For table", tableNo, '\n', 'Weblink Used:', stringr::str_c(URL, "&r=", tableNo), '\n')
  tables <- read_html(stringr::str_c(URL, "&r=", tableNo)) # get data from webpage based on table numbers
  screener <- tables %>%
    html_nodes("table") %>%
    .[17] %>%
    html_table(fill = TRUE) %>%
    data.frame()
  return(screener)
}
AllData<- lapply(TableList, function(x) GetData(URL=url, tableNo = x)) # getting all data in form of list
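AllData is a list with one data frame per page. If a single combined data frame is wanted (an assumption about the desired output, not stated in the original answer), the list can be stacked:
# Stack the per-page data frames into one screener table
screener_all <- dplyr::bind_rows(AllData)
# base R alternative: screener_all <- do.call(rbind, AllData)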
Here is one approach using stringr and lapply:
library (rvest)
library(stringr)
url <- "https://finviz.com/screener.ashx?v=111&f=geo_usa,ind_specialtyindustrialmachinery&o=industry" # base url
TableList<-c("1","21","41","61") # table number list
GetData <- function(URL, tableNo){
  cat('\n', "Running For table", tableNo, '\n', 'Weblink Used:', stringr::str_c(URL, "&", tableNo), '\n')
  tables <- read_html(stringr::str_c(URL, "&", tableNo)) # get data from webpage based on table numbers
  screener <- tables %>%
    html_nodes("table") %>%
    .[11] %>% # check
    html_table(fill = TRUE) %>%
    data.frame()
  return(screener)
}
AllData<- lapply(TableList, function(x) GetData(URL=url, tableNo = x)) # list of dataframes
However, please check the .[11] table index, as it will change for these URLs (the ones with &1, &21, etc.); it works fine for the base URL, but the data is not at the 11th index for the URLs with &1, &21, etc. Please change it accordingly.
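One way to avoid hard-coding the table index (.[11] or .[17]) is to pick the table by its content rather than its position. A hedged sketch, assuming the screener results table always contains a header cell with the text "Ticker":
library(rvest)

# Return the first table on the page that contains a "Ticker" cell
pick_screener_table <- function(page) {
  tabs <- page %>% html_nodes("table") %>% html_table(fill = TRUE)
  idx  <- which(vapply(tabs, function(t) any(unlist(t) == "Ticker", na.rm = TRUE),
                       logical(1)))[1]
  tabs[[idx]]
}

# Inside GetData(), the html_nodes("table") %>% .[n] %>% html_table() chain could
# then be replaced by pick_screener_table(read_html(...)).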
I'm scraping the ASN database (http://aviation-safety.net/database/). I've written code to paginate through each of the years (1919-2019) and scrape all relevant nodes except fatalities (represented as "fat."). Selector Gadget tells me the fatalities node is '#contentcolumnfull :nth-child(5)'. For some reason ".list:nth-child(5)" doesn't work.
When I scrape #contentcolumnfull :nth-child(5), the first element is blank, represented as "".
How can I write a function to delete the first empty element for every year/page that's scraped? It's simple to delete the first element when I scrape a single page on its own:
fat <- html_nodes(webpage, '#contentcolumnfull :nth-child(5)')
fat <- html_text(fat)
fat <- fat[-1]
but I'm finding it difficult to write into a function.
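A minimal sketch of that wrapper (object names are illustrative, and it assumes every page returns the same leading empty element):
library(rvest)
library(dplyr)

read_fat <- function(url) {
  fat <- read_html(url) %>%
    html_nodes("#contentcolumnfull :nth-child(5)") %>%
    html_text()
  tibble(value = fat[-1])   # drop the leading empty element returned on each page
}

# 'pages' is the vector of year URLs built in the code further down
fat <- bind_rows(lapply(pages, read_fat))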
I also have a second question regarding date-times and formatting. My dates are represented as day-month-year. Several entries are missing the day and/or month (e.g. ??-??-1985, JAN-??-2004). Ideally, I'd like to transform the dates into a lubridate object, but I can't do that with missing components, or if I keep only the years.
At this point, I've used gsub() and regex to clean the data (deleting "??" and floating dashes), so I have a mixed bag of date formats. However, this makes it difficult to visualize the data. Thoughts on best practice?
# Load libraries
library(tidyverse)
library(rvest)
library(xml2)
library(httr)
years <- seq(1919, 2019, by=1)
pages <- c("http://aviation-safety.net/database/dblist.php?Year=") %>%
paste0(years)
# Leaving out the category, location, operator, etc. nodes for sake of brevity
read_date <- function(url){
  az <- read_html(url)
  date <- az %>%
    html_nodes(".list:nth-child(1)") %>%
    html_text() %>%
    as_tibble()
}
read_type <- function(url){
  az <- read_html(url)
  type <- az %>%
    html_nodes(".list:nth-child(2)") %>%
    html_text() %>%
    as_tibble()
}
date <- bind_rows(lapply(pages, read_date))
type <- bind_rows(lapply(pages, read_type))
# Writing to dataframe
aviation_df <- cbind(type, date)
aviation_df <- data.frame(aviation_df)
# Excluding data cleaning
It is bad practice to ping the same page more than once in order to extract the requested information. You should read the page, extract all of the desired information and then move to the next page.
In this case the individual nodes are all stored in one master table. rvest's html_table() function is handy for converting an HTML table into a data frame.
library(rvest)
library(dplyr)
years <- seq(2010, 2015, by=1)
pages <- c("http://aviation-safety.net/database/dblist.php?Year=") %>%
paste0(years)
# Leaving out the category, location, operator, etc. nodes for sake of brevity
read_table <- function(url){
  #add delay so that one is not attacking the host server (be polite)
  Sys.sleep(0.5)
  #read page
  page <- read_html(url)
  #extract the table (the data frame is stored in the first element of the list)
  answer <- (page %>% html_nodes("table") %>% html_table())[[1]]
  #convert the fatalities column to character to standardize the column type
  answer$fat. <- as.character(answer$fat.)
  answer
}
# Writing to dataframe
aviation_df <- bind_rows(lapply(pages, read_table))
There are a few extra columns which will need clean-up.
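On the date question from the post above: one practical pattern is to parse the full date where it exists and fall back to the year, which is always present. A hedged sketch, assuming the raw values look like "28-DEC-2014", "JAN-??-2004" or "??-??-1985":
library(dplyr)
library(lubridate)
library(stringr)

clean_dates <- function(x) {
  tibble(
    date_raw = x,
    date     = suppressWarnings(dmy(x)),             # NA where day/month are missing
    year     = as.integer(str_extract(x, "\\d{4}"))  # the year is always recoverable
  )
}

clean_dates(c("28-DEC-2014", "JAN-??-2004", "??-??-1985"))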
I am new to R, and I have come upon a problem I can't solve. I would like to scrape Swedish election data at the electoral district level. The data are structured as can be seen here: http://www.val.se/val/val2014/slutresultat/K/valdistrikt/25/82/0134/personroster.html
I get the data I want by using this code:
library(rvest)
district.data <- read_html("http://www.val.se/val/val2014/slutresultat/K/kommun/25/82/0134/personroster.html")
prost <- district.data %>%
html_nodes("table") %>%
.[[2]] %>%
html_table()
But that is just one district out of 6,227. The districts are identified by the URL; for the website mentioned above, the identifier is "25/82/0134". I can find the identities of all districts here: http://www.val.se/val/val2014/statistik/2014_riksdagsval_per_valdistrikt.skv
And I read this semi-colon separated file into R by using this code:
valres <- read_csv2("http://www.val.se/val/val2014/statistik/2014_riksdagsval_per_valdistrikt.skv" )
(As a side note, how can I change the encoding so that the Swedish letters (e.g. å, ä, ö) are imported correctly? I manage to do that with read.csv by specifying encoding='utf-8', but not with read_csv.)
In this data frame, the columns LAN, KOM and VALDIST give the identities of the districts (note that VALDIST sometimes has just 2 characters). Hence the addresses have the following structure: http://www.val.se/val/val2014/slutresultat/K/kommun/LAN/KOM/VALDIST/personroster.html
So, I would like to use the combination in each row to get the identity of the district, scrape the information into R, add a column with the district identity (i.e. LAN, KOM and VALDIST combined into one string), and repeat this over all 6,227 districts, appending the information from each into a single data frame. I assume I need some kind of loop or one of the apply functions to iterate over the data frame, but I have not figured out how.
UPDATE:
After the help I received (thank you!) in the answer below, the code is now as follows. My remaining problem is that I want to add the district identity (i.e. paste0(LAN, KOM, VALDIST)) for each scraped website as a column in the final data frame. Can someone help me with this final step?
# Read the identities of the districts (with Swedish letters)
districts_url <- "http://www.val.se/val/val2014/statistik/2014_riksdagsval_per_valdistrikt.skv"
valres <- read_csv2(districts_url, locale=locale("sv",encoding="ISO-8859-1", asciify=FALSE))
# Add a variable to separate the two types of electoral districts
valres$typ <- "valdistrikt"
valres$typ [nchar(small_valres$VALDIST) == 2] <- "onsdagsdistrikt"
# Create a vector with all the web addresses to the district data
base_url <- "http://www.val.se/val/val2014/slutresultat/K/%s/%s/%s/%s/personroster.html"
urls <- with(small_valres, sprintf(base_url, typ, LAN, KOM, VALDIST))
# Scrape the data
pb <- progress_estimated(length(urls))
map_df(urls, function(x) {
  pb$tick()$print()
  # Maybe add Sys.sleep(1)
  read_html(x) %>%
    html_nodes("table") %>%
    .[[2]] %>%
    html_table()
}) -> df
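One possible sketch for that final step (not from the original posts; it assumes the objects from the update above, small_valres and base_url, plus the tidyverse and rvest are loaded): iterate over row indices so the district identity can be attached to each scraped table.
ids  <- with(small_valres, paste0(LAN, KOM, VALDIST))
urls <- with(small_valres, sprintf(base_url, typ, LAN, KOM, VALDIST))

pb <- progress_estimated(length(urls))
map_df(seq_along(urls), function(i) {
  pb$tick()$print()
  read_html(urls[i]) %>%
    html_nodes("table") %>%
    .[[2]] %>%
    html_table() %>%
    mutate(district_id = ids[i])   # the LAN/KOM/VALDIST identity for this page
}) -> df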
Any help would be greatly appreciated!
All the best,
Richard
You can use sprintf() to do positional substitution and then use purrr::map_df() to iterate over a vector of URLs and generate a data frame:
library(rvest)
library(readr)
library(purrr)
library(dplyr)
districts_url <- "http://www.val.se/val/val2014/statistik/2014_riksdagsval_per_valdistrikt.skv"
valres <- read_csv2(districts_url, locale=locale("sv",encoding="UTF-8", asciify=FALSE))
base_url <- "http://www.val.se/val/val2014/slutresultat/K/valdistrikt/%s/%s/%s/personroster.html"
urls <- with(valres, sprintf(base_url, LAN, KOM, VALDIST))
pb <- progress_estimated(length(urls))
map_df(urls, function(x) {
  pb$tick()$print()
  read_html(x) %>%
    html_nodes("table") %>%
    .[[2]] %>%
    html_table()
}) -> df
HOWEVER, you should add a randomized delay to avoid being blocked as a bot and should look at wrapping read_html() with purrr::safely() since not all those LAN/KOM/VALDIST combinations are valid URLs (at least in my testing).
That code also provides a progress bar, since it's going to take a while (probably an hour on a moderately decent connection).
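A hedged sketch of those two safety measures (a randomized delay plus purrr::safely()); the delay bounds and object names are illustrative:
safe_read <- purrr::safely(read_html)

map_df(urls, function(x) {
  pb$tick()$print()
  Sys.sleep(runif(1, 0.5, 2))              # randomized politeness delay
  page <- safe_read(x)
  if (is.null(page$result)) return(NULL)   # skip LAN/KOM/VALDIST combos that fail
  page$result %>%
    html_nodes("table") %>%
    .[[2]] %>%
    html_table()
}) -> df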