GOAL:
The end goal is to be able to retrieve the dynamic data found in the table under the 'Summary' tab on a Yahoo Finance page. I need to be able to do this without the use of a third-party API like yahoo_fin or yfinance. To clarify which table (there are many), it is the quote summary table on the page linked in the code below.
The only cells I really need are the 'Volume' and 'Avg. Volume' values, but I believe these are only reachable through elements carrying a long auto-generated class string. Here is what I've tried:
Code:
import requests
from bs4 import BeautifulSoup
url = 'https://finance.yahoo.com/quote/AMZN?p=EDTK&.tsrc=fin-srch'
r = requests.get(url)
web_content = BeautifulSoup(r.text,'lxml')
table_values = web_content.find_all('div', class_='Bxz(bb) Bdbw(1px) Bdbs(s) Bdc($seperatorColor) H(36px)')
print(table_values)
returns: '[]'
Is there anything I'm clearly doing wrong? I'm no expert in bs4, but I've used this syntax before and never had any issues.
You're selecting <div> instead of <tr>:
import requests
from bs4 import BeautifulSoup
url = "https://finance.yahoo.com/quote/AMZN?p=EDTK&.tsrc=fin-srch"
r = requests.get(url)
web_content = BeautifulSoup(r.text, "lxml")
table_values = web_content.find_all(
    "tr", class_="Bxz(bb) Bdbw(1px) Bdbs(s) Bdc($seperatorColor) H(36px)"
)
for row in table_values:
    # each row holds a label cell followed by a value cell
    tds = [td.get_text(strip=True) for td in row.select("td")]
    print(*tds)
Prints:
Previous Close 3,421.57
Open 3,424.80
Bid 3,472.00 x 1000
Ask 3,473.00 x 1100
Day's Range 3,395.59 - 3,472.58
52 Week Range 2,871.00 - 3,773.08
Volume 4,312,055
Market Cap 1.758T
Beta (5Y Monthly) 1.14
PE Ratio (TTM) 60.47
EPS (TTM) 57.40
Earnings Date Oct 27, 2021-Nov 01, 2021
Forward Dividend & Yield N/A (N/A)
Ex-Dividend Date N/A
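If you only need the two volume figures, here is a small follow-up sketch building on table_values above (note that 'Avg. Volume' will only show up if its row is present in the HTML that requests receives, and Yahoo does change these class names over time):
summary = {}
for row in table_values:
    # each row is a label cell followed by a value cell
    cells = [td.get_text(strip=True) for td in row.select("td")]
    if len(cells) == 2:
        label, value = cells
        summary[label] = value

print(summary.get("Volume"), summary.get("Avg. Volume"))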
I'm trying to scrape the NHL playoff bracket from Wikipedia, for the years 1988 on, using Beautiful Soup 4 in Python. Inconsistent formatting makes this hard (sometimes there is more than one team on a row; see https://en.wikipedia.org/wiki/2004_Stanley_Cup_playoffs). I would like to identify the team, round, and number of games won for every series in that year.
Initially, I converted the table to text and used regular expressions to identify the teams and the information, but the ordering shifts depending on whether the brackets allow more than one team per row or not.
Now I'm trying to work my way down the rows and count things like the number of cells/column spans, but the results are inconsistent. I'm missing how the 4th-round teams are identified.
What I have so far is an attempt to count the number of cells before a cell with a team is reached...
import requests
from bs4 import BeautifulSoup as soup

hockeyteams = ['Anaheim','Arizona','Atlanta','Boston','Buffalo','Calgary','Carolina','Chicago','Colorado','Columbus','Dallas','Detroit',
               'Edmonton','Florida','Hartford','Los Angeles','Minnesota','Montreal','Nashville','New Jersey',
               'Ottawa','Philadelphia','Pittsburgh','Quebec','San Jose','St. Louis','Tampa Bay','Toronto','Vancouver','Vegas','Washington',
               'Winnipeg','NY Rangers','NY Islanders']

# fetch the content from the url (full_link is the URL of the playoff-year article)
page_response = requests.get(full_link, timeout=5)
# use the html parser to parse the page
page_content = soup(page_response.content, "html.parser")

tables = page_content.find_all('table')

# identify the appropriate table
for table in tables:
    if ('Semi' in table.text) & ('Stanley Cup Finals' in table.text):
        bracket = table
        break

# walk the rows and count cells/colspans before each team cell
row_num = 0
for row in bracket.find_all('tr'):
    row_num += 1
    print(row_num, '#')
    colcnt = 0
    for col in row.find_all('td'):
        if "colspan" in col.attrs:
            colcnt += int(col.attrs['colspan'])
        else:
            colcnt += 1
        if col.text.strip(' \n') in str(hockeyteams):
            print(colcnt, col.text)
    print('col width:', colcnt)
Ultimately I'd like something like a dataframe that has:
Round, Team A, Team A Wins, Team B, Team B Wins
1, Tampa Bay, 4, NY Islanders, 1
2, Tampa Bay, 4, Montreal, 0
etc.
That table can be scraped with pandas:
import pandas as pd
tables = pd.read_html('https://en.wikipedia.org/wiki/2004_Stanley_Cup_playoffs#Playoff_bracket')
bracket = tables[2].dropna(axis=1, how='all').dropna(axis=0, how='all')
print(bracket)
The output is full of NaNs, but it has what I think you're looking for and you can modify it using standard pandas methods.
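If you want to push it toward the Round / Team / Wins shape, one rough starting point (a sketch only; exactly where read_html puts each team and win count in that grid is an assumption you will need to check) is to flatten the sparse frame into (row, column, value) triples and filter for team names:
import pandas as pd

tables = pd.read_html('https://en.wikipedia.org/wiki/2004_Stanley_Cup_playoffs#Playoff_bracket')
bracket = tables[2].dropna(axis=1, how='all').dropna(axis=0, how='all')

# Flatten the sparse bracket grid into (row, column, value) triples so that
# team names and win counts can be matched up by their positions.
long_view = bracket.stack().reset_index(name='value')

teams = ['Tampa Bay', 'NY Islanders', 'Montreal']  # extend with the full team list
mask = long_view['value'].astype(str).str.contains('|'.join(teams))
print(long_view[mask])
In that layout the column index roughly tracks the round, so pairing each team cell with the numeric cell beside it should get you toward the series table.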
The CIA publishes a list of world leaders and cabinet ministers for all countries multiple times a year. This information is in PDF form.
I want to convert this PDF to CSV using R and then separate and tidy the data.
I am getting the PDF from "https://www.cia.gov/library/publications/resources/world-leaders-1/"
under the link 'PDF Version for Prior Years', located at the center-right of the page.
Each PDF has some introductory pages and then lists the Leaders and Ministers for each country.
with each 'Title' and 'Name' separated by a run of dots ('..........') of varying length.
I have tried to use the pdftools package to convert from PDF, but I am not quite sure how to deal with the format of the data for sorting and tidying.
Here are the first steps I have taken with a downloaded PDF:
library(pdftools)
text <- pdf_text("Data/April2006ChiefsDirectory.pdf")
test <- as.data.frame(text)
Starting with a single PDF, I want to list each minister in a separate row, with individual columns for year, country, title, and name.
With the steps I have taken so far, converting the PDF into .csv without any additional tidying, the data ends up in a single column and each row has a string of text containing the title and name for multiple countries.
I am a novice at data tidying; any help would be much appreciated.
You can do it with tabulizer, but it is going to require some work to clean it up if you want to import all 240 pages of the document.
Here I import page 4, which is the first page with info regarding the leaders:
library(tabulizer)
mw_table <- extract_tables(
  "https://www.cia.gov/library/publications/resources/world-leaders-1/pdfs/2019/January2019ChiefsDirectory.pdf",
  output = "data.frame",
  pages = 4,
  area = list(c(35.68168, 40.88842, 740.97853, 497.74737)),
  guess = FALSE
)
head(mw_table[[1]])
#> X Afghanistan
#> 1 Last Updated: 20 Dec 2017
#> 2 Pres. Ashraf GHANI
#> 3 CEO Abdullah ABDULLAH, Dr.
#> 4 First Vice Pres. Abdul Rashid DOSTAM
#> 5 Second Vice Pres. Sarwar DANESH
#> 6 First Deputy CEO Khyal Mohammad KHAN
You can pass a vector of the pages you want to import as the pages argument. Note that all the country names are buried among the people's names in the second column. You can probably work out a method of identifying the country rows by looking for the empty "" occurrences in the first column.
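For example, a rough sketch of that idea, building on the mw_table object above (the exact contents of each column vary from page to page, so treat the empty-first-column rule and the tidyr::fill() step as assumptions to verify):
library(tidyr)   # fill() is just one convenient way to carry values down

df <- mw_table[[1]]

# Per the note above: rows whose first column is empty hold the country names
# in the second column (a few header lines such as "Last Updated: ..." will be
# caught too and need an extra filter).
df$country <- ifelse(df[[1]] == "", df[[2]], NA_character_)

# Carry each country name down to the minister rows below it, then keep only
# the rows that actually contain a Title / Name pair.
df <- fill(df, country)
leaders <- df[df[[1]] != "", ]
head(leaders)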
I am trying to get into text analysis in R. I have a text file with the following structure.
HD A YEAR Oxxxx
WC 244 words
PD 28 February 2018
SN XYZ
SC hydt
LA English
CY Copyright 2018
LP Rio de Janeiro, Feb 28
TD
With recreational cannabis only months away from legalization in Canada, companies are racing to
prepare for the new market. For many, this means partnerships, supply agreements,
I want to extract the following elements (PD and TD) in R and save them into a table.
I have tried this, but I am unable to get it right.
Extract PD
library(stringr)
library(tidyverse)
pd <- unlist(str_extract_all(txt, "\\bPD\\b\t[0-9]+?\\s[A-Za-z]+?\\s[0-9]+\\s"))
pd <- str_replace_all(pd, "\\bPD\\b\t", "")
if (length(pd) == 0) {
pd <- as.character(NA)
}
pd <- str_trim(pd)
pd <- as.Date(strptime(pd, format = "%d %B %Y"))
Extract TD
td <- unlist(str_extract_all(txt, "\\bTD\\b[\\t\\s]*?.+?\\bCO\\b"))
td <- str_replace_all(td, "\\bTD\\b[\\t\\s]+?", "")
td <- str_replace_all(td, "\\bCO\\b", "")
td <- str_replace_all(td, "\\s+", " ")
if (length(td) == 0) {
td <- as.character(NA)
}
I want a table as follows, please:
PD                  TD
28 February 2018    With recreational cannabis only months away from legalization in Canada, companies are racing to prepare for the new market. For many, this means partnerships, supply agreements, Production hit a record 366.5Mt
Any help would be appreciated. Thank you
I had to add a few characters to the end of your data set, which I inferred from your regexes:
txt <- "HD A YEAR Oxxxx
WC 244 words
PD 28 February 2018
SN XYZ
SC hydt
LA English
CY Copyright 2018
LP Rio de Janeiro, Feb 28
TD
With recreational cannabis only months away from legalization in Canada, companies are racing to
prepare for the new market. For many, this means partnerships, supply agreements,
CO ...further stuff"
Dirty
The dirty solution to your problems is probably:
For the date field, fix the regex so that it expects not a tab but an arbitrary space after the PD tag. E.g. "\\bPD\\b [0-9]+?\\s[A-Za-z]+?\\s[0-9]+\\s" works for me.
For the TD field, let the . in your regex also match newlines by using the dotall option (see ?stringr::regex):
td <- unlist(str_extract_all(txt, regex("\\bTD\\b[\\t\\s]*?.+?\\bCO\\b", dotall=TRUE)))
Maybe shorter regexes are better?
However, I would recommend you capture the characteristics of your input format only as fine-grained as needed. For example, I would not check the date format via a regex. Just search for "^ PD.*" and let R try to parse the result. It will complain anyway if it does not match.
To filter for a text block which starts with multiple spaces like after the TD marker, you can use the multiline= option to use ^ to match every (not only the first) line beginning. E.g.
str_extract_all(txt, regex("^TD\\s+(^\\s{3}.*\\n)+", multiline = TRUE))
(note that the regex class \s comprises \n so I do not need to specify that explicitly after matching the TD line)
Careful if fields are missing
Finally, your current approach might assign the wrong dates to the text if one of the TD or PD fields is ever missing in the input! A for loop in combination with readLines, instead of regex matching over the whole text, might help for this:
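For example, a rough sketch of that line-by-line idea (the tag names, and the assumption that any other two-letter tag such as CO closes the TD block, are taken from the sample above):
lines <- readLines(textConnection(txt))   # or readLines("your_file.txt")

pd <- NA_character_
td_lines <- character(0)
in_td <- FALSE

for (line in lines) {
  if (grepl("^PD\\b", line)) {
    pd <- trimws(sub("^PD", "", line))
    in_td <- FALSE
  } else if (grepl("^TD\\b", line)) {
    in_td <- TRUE
  } else if (grepl("^[A-Z]{2}\\b", line)) {   # any other two-letter tag ends the TD block
    in_td <- FALSE
  } else if (in_td) {
    td_lines <- c(td_lines, trimws(line))
  }
}

result <- data.frame(
  PD = as.Date(pd, format = "%d %B %Y"),
  TD = paste(td_lines, collapse = " ")
)
result
If PD is missing, the date simply comes out as NA for that record instead of borrowing a date from a neighbouring record.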
I have used a forecasting method in R, with the code below:
library(forecast)
t1$StartDate <- as.Date(t1$StartDate, origin = "1899-12-30")
## 10,1 indicates 10th Week & Sunday
ordervalu_ts <- ts(t1$Revenue, start = c(10,1), frequency = 7)
print(ordervalu_ts)
ordervalu_ts_decom <- HoltWinters(ordervalu_ts)
print(ordervalu_ts_decom)
ordervalu_ts_for <- forecast:::forecast.HoltWinters(ordervalu_ts_decom, h=30)
print(ordervalu_ts_for)
t1 is the input file. It has two columns: Date and Revenue. I am trying to forecast the Revenue for the next 30 days. I am able to get output, but the date in the output is not in the right format. I have the following questions:
Start date: I want it to be dynamic rather than static (i.e., the start date should be taken from the "Date" column).
The output does not give me the actual date along with the prediction (it shows 81.42857 instead of the first predicted date), as shown below.
print(ordervalu_ts_for)
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
81.42857 1390.4782 368.3917 2412.565 -172.668266 2953.625
81.57143 1351.3890 328.9055 2373.872 -212.364558 2915.142
81.71429 1355.7625 332.8507 2378.674 -208.646034 2920.171
Can someone help? I have tried reviewing all the videos on YouTube and online. Thanks for your help.
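One way to get calendar dates onto the forecast is to build them from the Date column rather than from the ts index. This is only a minimal sketch, assuming t1 holds one row per day and StartDate is already a Date, as in the code above:
library(forecast)

ordervalu_ts <- ts(t1$Revenue, frequency = 7)   # weekly seasonality, daily steps
fit <- HoltWinters(ordervalu_ts)
fc  <- forecast(fit, h = 30)

# The forecast's fractional time index (81.43, 81.57, ...) just counts
# 1/7-week steps; map each step back onto calendar dates instead.
future_dates <- max(t1$StartDate) + 1:30
fc_table <- data.frame(
  Date     = future_dates,
  Forecast = as.numeric(fc$mean),
  Lo80     = fc$lower[, 1],
  Hi80     = fc$upper[, 1]
)
head(fc_table)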
I have to import some data from Yahoo! Finance into R. For example, I have to import data about Roche Holding (AG) (Roche Yahoo! link) and The Goldman Sachs Group, Inc. (GS) (Goldman Yahoo! link).
My problem is that the Roche Holding (AG) data are in euros and the data about The Goldman Sachs Group, Inc. (GS) are in dollars.
To import my data into R I'm currently using:
library(tseries)
Goldman_Sachs.z = get.hist.quote(instrument = "GS", start = date.start,
                                 end = date.end, quote = "AdjClose", origin = "1970-01-01",
                                 provider = "yahoo", compression = "m", retclass = "zoo")
Is it possible to import these data directly in dollars, or do I have to implement a function in R to do this job?
And in the second case, how do I choose the exchange rate?
Although Roche Holding is also traded on NASDAQ OTC and you might therefore be able to obtain the current share price in USD, the proper way to handle such situations is to retrieve the data from the main market (which is the Swiss Stock Exchange in Zurich in this case) and calculate the value in dollars using the current exchange rate. The problem with OTC values is their low trading volume which may result in inaccurate prices.
To obtain the exchange rate of the CHF/USD Forex pair you can use the quantmod package:
library(quantmod)
getFX("CHF/USD")
tail(CHFUSD,1)
# CHF.USD
#2016-04-16 1.0332
The price in EUR does not seem to be a suitable choice in this case, but since you mentioned that you are looking at a market where Roche Holding is traded in EUR, you may use getFX("EUR/USD") in the same way.
To download the EOD data from the Swiss Stock Exchange I would recommend using
Roche_CH <- getSymbols("SWX:RO", src ="google", auto.assign = FALSE)
or
Roche_CH <- getSymbols("RO.SW", src = "yahoo", auto.assign = FALSE)
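To then express the Swiss prices in dollars, here is a rough sketch along the lines described above (the use of closing prices and the column positions after the merge are assumptions; note also that getFX only returns a limited recent history, so for long backtests you would need an FX series from another source):
library(quantmod)

Roche_CH <- getSymbols("RO.SW", src = "yahoo", auto.assign = FALSE)
getFX("CHF/USD")                          # creates CHFUSD in the workspace

# Align the CHF closing prices with the exchange rate by date, then convert.
roche_chf <- Cl(Roche_CH)                 # closing prices in CHF
merged    <- merge(roche_chf, CHFUSD, join = "inner")
roche_usd <- merged[, 1] * merged[, 2]
tail(roche_usd)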
Hope this helps.