Python code to scrape ticker symbols from Yahoo Finance

I have a list of more than 1,000 companies that I could invest in, and I need the ticker symbol for each of them. I am having difficulty stripping the output of the soup and looping through all the company names.
Please see an example of the site: https://finance.yahoo.com/lookup?s=asml. The idea is to replace asml with each company name, i.e. build 'https://finance.yahoo.com/lookup?s=' + company, so I can loop through all the companies; a rough sketch of that loop follows the sample below.
companies = df
   Company name
0  Abbott Laboratories
1  ABBVIE
2  Abercrombie
3  Abiomed
4  Accenture Plc
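For illustration, a minimal sketch of that URL loop (assuming companies is the dataframe above; this only builds the lookup URLs, it does not parse anything):
import urllib.parse

base = 'https://finance.yahoo.com/lookup?s='
for name in companies['Company name']:
    # URL-encode the name so spaces are legal in the query string
    url = base + urllib.parse.quote(name)
    print(url)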
This is the code I have now; the strip doesn't work, and the loop over all the companies isn't working either.
#Create a function to scrape the data
def scrape_stock_symbols():
    Companies = df
    url = 'https://finance.yahoo.com/lookup?s=' + Companies
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "html.parser")
    Company_Symbol = Soup.find_all('td', attrs={'class': 'data-col0 Ta(start) Pstart(6px) Pend(15px)'})
    for i in company_symbol:
        try:
            row = i.find_all('td')
            company_symbol.append(row[0].text.strip())
        except Exception:
            if company not in company_symbol:
                next(Company)
    return (company_symbol)

#Loop through every company in companies to get all of the tickers from the website
for Company in companies:
    try:
        (temp_company_symbol) = scrape_stock_symbols(company)
    except Exception:
        if company not in companies:
            next(Company)
Another difficulty is that the symbol lookup on Yahoo Finance will return many company names for a single query.
I will have to clean the data afterwards. I want to treat the AMS exchange as the standard: if a company is listed on multiple exchanges, I am only interested in its AMS ticker symbol. The final goal is to create a new dataframe:
   Company name         Company_symbol
0  Abbott Laboratories  ABT
1  ABBVIE               ABBV
2  Abercrombie          ANF

Here's a solution that doesn't require any scraping. It uses a package called yahooquery (disclaimer: I'm the author), which utilizes an API endpoint that returns symbols for a user's query. You can do something like this:
import pandas as pd
import yahooquery as yq
def get_symbol(query, preferred_exchange='AMS'):
    try:
        data = yq.search(query)
    except ValueError:  # Will catch JSONDecodeError
        print(query)
    else:
        quotes = data['quotes']
        if len(quotes) == 0:
            return 'No Symbol Found'
        symbol = quotes[0]['symbol']
        for quote in quotes:
            if quote['exchange'] == preferred_exchange:
                symbol = quote['symbol']
                break
        return symbol
companies = ['Abbott Laboratories', 'ABBVIE', 'Abercrombie', 'Abiomed', 'Accenture Plc']
df = pd.DataFrame({'Company name': companies})
df['Company symbol'] = df.apply(lambda x: get_symbol(x['Company name']), axis=1)
Company name Company symbol
0 Abbott Laboratories ABT
1 ABBVIE ABBV
2 Abercrombie ANF
3 Abiomed ABMD
4 Accenture Plc ACN
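One possible follow-up, since get_symbol falls back to the string 'No Symbol Found' when a query returns no quotes, is to drop those rows before further use:
# Keep only rows where a real ticker was found
found = df[df['Company symbol'] != 'No Symbol Found']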

Related

How would I go about web scraping from an interactive map?

This pertains to this interactive map: https://www.newworld-map.com/?filters=ores
Taking the ores as an example, how would I go about getting the coordinates of each node? It looks like the HTML element is a canvas, and I could not for the life of me figure out where it pulls its data from.
Any help would be greatly appreciated.
Hoping that the next OP's question will be more in line with Stack Overflow's guidelines (see https://stackoverflow.com/help/minimal-reproducible-example), one way to solve this is to inspect what network calls are made when the page loads, and scrape the API endpoint the data is pulled from. Like below:
import requests
import pandas as pd
import time

time_stamp = int(time.time_ns() / 1000)
ore_list = []
url = f'https://www.newworld-map.com/markers.json?time={time_stamp}'
ores = requests.get(url).json()['ores']
for ore in ores:
    for x in ores[ore]:
        ore_list.append((ore, x, ores[ore][x]['x'], ores[ore][x]['y']))
df = pd.DataFrame(ore_list, columns=['Ore', 'Code', 'X_Coord', 'Y_Coord'])
print(df)
Result in terminal:
Ore Code X_Coord Y_Coord
0 brimstone 02d1ba070438d53ce5fbb1955cd7d694 7473.096191 8715.674805
1 brimstone 0a50c499af034aeb6f38e011648a2ea8 7471.124512 8709.161133
2 brimstone 0b5b190c31eb3d314d993dd393aadfe8 5670.894043 7862.319336
3 brimstone 0f5c7427c75d80e10f71f9e92ddc4362 5883.601562 7703.445801
4 brimstone 20b0801bdb41c7dafbb1053b43c25bd8 6020.838379 8147.747070
... ... ... ... ...
4260 starmetal 86h 8766.964000 8431.438000
4261 starmetal 86i 8598.688000 8562.974000
4262 starmetal 86j 8586.000000 8211.000000
4263 starmetal 86k 8688.938000 8509.722000
4264 starmetal 86l 8685.827000 8505.694000
4265 rows × 4 columns

Web scraping with requests.get from Airbnb not working properly

This query returns 0 or 20 randomly every time I run it. Yesterday, when I looped through the pages, I always got 20 and was able to scrape 20 listings across 15 pages. Now I can't run my code properly because the listings sometimes come back as 0.
I tried adding headers to the GET request and a random time.sleep (5-10 s) before each request, but I am still facing the same issue. I also tried connecting to a hotspot to change my IP, with no luck. Does anyone understand why?
import time
from random import randint
from bs4 import BeautifulSoup
import requests #to connect to url
airbnb_url = 'https://www.airbnb.com/s/Mayrhofen--Austria/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&date_picker_type=calendar&query=Mayrhofen%2C%20Austria&place_id=ChIJbzLYLzjdd0cRDtGuTzM_vt4&checkin=2021-02-06&checkout=2021-02-13&adults=4&source=structured_search_input_header&search_type=autocomplete_click'
soup = BeautifulSoup(requests.get(airbnb_url).content, 'html.parser')
listings = soup.find_all('div', '_8s3ctt')
print(len(listings))
It seems Airbnb returns two versions of the page: a "normal" HTML one, and one where the listings are stored inside a <script> tag. To parse the <script> version of the page you can use the following example:
import json
import requests
from bs4 import BeautifulSoup

def find_listing(d):
    if isinstance(d, dict):
        if "__typename" in d and d["__typename"] == "DoraListingItem":
            yield d["listing"]
        else:
            for v in d.values():
                yield from find_listing(v)
    elif isinstance(d, list):
        for v in d:
            yield from find_listing(v)

airbnb_url = "https://www.airbnb.com/s/Mayrhofen--Austria/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&date_picker_type=calendar&query=Mayrhofen%2C%20Austria&place_id=ChIJbzLYLzjdd0cRDtGuTzM_vt4&checkin=2021-02-06&checkout=2021-02-13&adults=4&source=structured_search_input_header&search_type=autocomplete_click"
soup = BeautifulSoup(requests.get(airbnb_url).content, "html.parser")
listings = soup.find_all("div", "_8s3ctt")
if len(listings):
    # normal page:
    print(len(listings))
else:
    # page that has listings stored inside <script>:
    data = json.loads(soup.select_one("#data-deferred-state").contents[0])
    for i, l in enumerate(find_listing(data), 1):
        print(i, l["name"])
Prints (when the <script> version is returned):
1 Mariandl (MHO103) for 36 persons.
2 central and friendly! For Families and Friends
3 Sonnenheim for 5 persons.
4 MO's Apartments
5 MO's Apartments
6 Beautiful home in Mayrhofen with 3 Bedrooms
7 Quaint Apartment in Finkenberg near Ski Lift
8 Apartment 2 Villa Daringer (5 pax.)
9 Modern Apartment in Schwendau with Garden
10 Holiday flats Dornau, Mayrhofen
11 Maple View
12 Laubichl Lodge by Apart Hotel Therese
13 Haus Julia - Apartment Edelweiß Mayrhofen
14 Melcherhof,
15 Rest coke
16 Vacation home Traudl
17 Luxurious Apartment near Four Ski Lifts in Mayrhofen
18 Apartment 2 60m² for 2-4 persons "Binder"
19 Apart ZEMMGRUND, 4-9 persons in Mayrhofen/Tirol
20 Apartment Ahorn View
EDIT: To print lat, lng:
...
for i, l in enumerate(find_listing(data), 1):
    print(i, l["name"], l["lat"], l["lng"])
Prints:
1 Mariandl (MHO103) for 36 persons. 47.16522 11.85723
2 central and friendly! For Families and Friends 47.16209 11.859691
3 Sonnenheim for 5 persons. 47.16809 11.86694
4 MO's Apartments 47.166969 11.863186
...
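If you want a single code path no matter which version Airbnb serves, a small wrapper over the two branches above could look like this. It is only a sketch, reusing the find_listing helper defined above; the retry count and pause are arbitrary assumptions, and note that the two branches return different shapes (tags vs dicts), just as in the answer:
import json
import time

import requests
from bs4 import BeautifulSoup

def get_listings(url, attempts=3):
    """Fetch the search page and parse whichever version Airbnb serves."""
    for _ in range(attempts):
        soup = BeautifulSoup(requests.get(url).content, "html.parser")
        divs = soup.find_all("div", "_8s3ctt")
        if divs:
            return divs  # "normal" HTML version: list of <div> tags
        tag = soup.select_one("#data-deferred-state")
        if tag:
            data = json.loads(tag.contents[0])
            return list(find_listing(data))  # <script> version: list of dicts
        time.sleep(2)  # brief pause before retrying
    return []

listings = get_listings(airbnb_url)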

R extract specific word after keyword

How do I extract a specific word after a keyword in R?
I have the following input text, which contains details about a policy. I need to extract specific field values like FirstName, Surname, Father's Name and date of birth.
input.txt
In Case of unit linked plan, Investment risk in Investment Portfolio is borne by the policyholder.
ly
c I ROPOSAL FORM z
Insurance
Proposal Form Number: 342525 PF 42242
Advisor Coe aranch Code 2
Ff roanumber =F SSOS™S™~™S~S rancid ate = |
IBR. Code S535353424
re GFN ——
INSTRUCTION FOR FILLING THES APPLICATION FORM ; 1. Compiets the proocsal form in CAPITAL LETTERS using = Black Ball Point P]n. 2. Sless= mark your selection by marking “X" insides the
Boe. 3. Slnsse bases 2 Blank soece after eect word, letter or initial 4. Slssse write "MA" for questions whic are not apolicatie. 5.00 NOT USE the Sor") to identify your initial or seperate the sddressiiine.
6. Sulmissson of age proof ie mandatory along wall Ge propel fonm.
IMPORTANT INSTRUCTIONS WITH REGARD TO DISCLOSURE OF INFORMATION: Inturance it a contract of UTMOST GOOD FAITH and itis required by disclose all material and nelevant
fach: complebehy, DO) NOT suppress any fac: in response by the questions in the priposal form. FAILURE TO PROVIDE COMPLETE AND ACCURATE INFORMATION OR
MISREPRESENTATION OF THE FACTS COULD DECLARE THES POLICY CONTRACT NULL AND VOID AFTER PAYMENT OF SURRENDER VALUE, IF ANY, SUBJECT TO SECTION 45 OF
INSURANCE ACT, 1998 As AMENDED FROM TIME TO TIME,
Section I - Details of the Life to be Assured
1. Tite E-] Mr. LJ Mrs. LJ Miss [J Or. LJ Others (Specify)
2. FirstName PETER PAUL
3. Surname T
44. Father's Name
46, Mother's Name ERIKA RESWE D
5. Date of Birth 13/02/1990 6, Gender E] Male ] Female
7. Age Proof L] School Certificate [] Driving License [] Passport {Birth Certificate E"] PAN Card
3, Marital Status D) Single EF] Married 0 Widower) 0 Civorcee
9, Spouse Name ERISEWQ FR
10. Maiden Name
iL. Nationality -] Resident Indian National [J Non Resident Indian (MRI) L] Others (Specify)
12, Education J Postgraduate / Doctorate Ee) Graduate [] 12thstd. Pass [J 10thstd. Pass [J Below 10th std.
OO Dliterate / Uneducated CJ Others (Specify)
13. Address For No 7¥%a vaigai street Flower
Communication Nagar selaiyur
Landmark
City Salem
Pin Code BO00 73: State TAMIL NADU
Address proof [] Passport ([] Driving License [] Voter ID [] Bank Statement [] Utility Bill G4 Others (Specify) Aadhaar Card
14, Permanent No 7¥a vaigai street Flower
Address :
Nagar selaiyur
Landmark
City Salem
Pin Code 5353535 state (TAMIL NADU
Address proof CJ] Passport [9 DrivingLicense [J Voter ID [ Bank Statement [ Utility Bill B] Others (Specify) Aadhaar Card
15. Contact Details Mobile 424242424 Phone (Home)
Office / Business
E-mail fdgrgtr13#yahoo.com
Preferred mode: ((] Letter EF) E-Mail
Preferred Language for Letter {other than English): [] Hindi [] Kannada [-] Tamil J Telugu C] Malayalam C) Gujarati
Bengali GOriya =D] Marathi
16. Occupation CL] Salaried-Govt /PSU ( Salaried-other [9 Self Employed Professional [J Aagriculturist {Farmer [Part Time Business
LJ Retired ] Landlord J Student (current Std) -] Others (Specify) Salaried - MNC
17. Full Name of the Capio software
Employers Businnes/
School/College
18, Designation & Exact nature of Work / Business Manager
19. AnnualIncomein 1,200,000.00 20. Annual Income of Husband / Father = 1,500,000.00
Figures (%) (for female and minor lives)
21. Exact nature of work / business of Husband / Father for female and minor lives Government Employee
Page 10fé
The code below works for me, but the problem is that if the line order changes, everything breaks. Is there a way to extract a keyword's value irrespective of line order?
Current code:
path <- getwd()
my_txt <- readLines(paste(path, "/input.txt", sep = ""))
fName <- sub('.*FirstName', '', my_txt[7])
SName <- sub('.*Surname', '', my_txt[8])
FatherNm <- sub(".*Father's Name", '', my_txt[9])
dob <- sub("6, Gender.*", '',sub(".*Date of Birth", '', my_txt[11]))
You can combine the text into one string and extract the values based on patterns in the data. This approach works irrespective of line numbers, provided the patterns are valid for all the files.
my_txt <- readLines(paste(path, "/input.txt", sep = ""))
#Collapse data in one string
text <- paste0(my_txt, collapse = '\n')
#Extract text after FirstName till '\n'
fName <- sub('.*FirstName (.*?)\n.*', '\\1', text)
fName
#[1] "John Woo"
#Extract text after Surname till '\n'
SName <- sub('.*Surname (.*?)\n.*', '\\1', text)
SName
#[1] "T"
#Extract text after Father's Name till '\n'
FatherNm <- sub(".*Father's Name (.*?)\n.*", '\\1', text)
FatherNm
#[1] "Bill Woo"
#Extract numbers which come after Date of Birth.
dob <- sub(".*Date of Birth (\\d+/\\d+/\\d+).*", '\\1', text)
dob
#[1] "13/07/1970"

Data scraping with a list in Excel

I have a list in Excel: one code in Column A and another in Column B.
There is a website where I need to input both details into two different boxes, which then takes me to another page.
That page contains certain details which I need to scrape into Excel.
Any help with this?
Ok. Give this a shot:
import pandas as pd
import requests

df = pd.read_excel('C:/test/data.xlsx')
url = 'http://rla.dgft.gov.in:8100/dgft/IecPrint'
results = pd.DataFrame()
for row in df.itertuples():
    payload = {
        'iec': '%010d' % row[1],
        'name': row[2]}
    response = requests.post(url, params=payload)
    print('IEC: %010d\tName: %s' % (row[1], row[2]))
    try:
        dfs = pd.read_html(response.text)
    except:
        print('The name Given By you does not match with the data OR you have entered less than three letters')
        temp_df = pd.DataFrame([['%010d' % row[1], row[2], 'ERROR']],
                               columns=['IEC', 'Party Name and Address', 'ERROR'])
        results = results.append(temp_df, sort=False).reset_index(drop=True)
        continue
    generalData = dfs[0]
    generalData = generalData.iloc[:, [0, -1]].set_index(generalData.columns[0]).T.reset_index(drop=True)
    directorData = dfs[1]
    directorData = directorData.iloc[:, [-1]].T.reset_index(drop=True)
    directorData.columns = ['director_%02d' % (each+1) for each in directorData.columns]
    try:
        branchData = dfs[2]
        branchData = branchData.iloc[:, [-1]].T.reset_index(drop=True)
        branchData.columns = ['branch_%02d' % (each+1) for each in branchData.columns]
    except:
        branchData = pd.DataFrame()
        print('No Branch Data.')
    temp_df = pd.concat([generalData, directorData, branchData], axis=1)
    results = results.append(temp_df, sort=False).reset_index(drop=True)

results.to_excel('path.new_file.xlsx', index=False)
Output:
print (results.to_string())
IEC IEC Allotment Date File Number File Date Party Name and Address Phone No e_mail Exporter Type IEC Status Date of Establishment BIN (PAN+Extension) PAN ISSUE DATE PAN ISSUED BY Nature Of Concern Banker Detail director_01 director_02 director_03 branch_01 branch_02 branch_03 branch_04 branch_05 branch_06 branch_07 branch_08 branch_09
0 0305008111 03.05.2005 04/04/131/51473/AM20/ 20.08.2019 NISSAN MOTOR INDIA PVT. LTD. PLOT-1A,SIPCOT IN... 918939917907 shailesh.kumar#rnaipl.com 5 Merchant/Manufacturer Valid IEC 2005-02-07 AACCN0695D FT001 NaN NaN 3 Private Limited STANDARD CHARTERED BANK A/C Type:1 CA A/C No :... HARDEEP SINGH BRAR GURMEL SINGH BRAR HOUSE NO ... JEROME YVES MARIE SAIGOT THIERRY SAIGOT A9/2, ... KOJI KAWAKITA KIHACHI KAWAKITA 3-21-3, NAGATAK... Branch Code:165TH FLOOR ORCHID BUSINESS PARK,S... Branch Code:14NRPDC , WAREHOUSE NO.B -2A,PATAU... Branch Code:12EQUINOX BUSINESS PARK TOWER 3 4T... Branch Code:8GRAND PALLADIUM,5TH FLR.,B WING,,... Branch Code:6TVS LOGISTICS SERVICES LTD.SING,C... Branch Code:2PLOT 1A SIPCOT INDUL PARK,ORAGADA... Branch Code:5BLDG.NO.3 PART,124A,VALLAM A,SRIP... Branch Code:15SURVEY NO. 678 679 680 681 682 6... Branch Code:10INDOSPACE SKCL INDL.PARK,BULD.NO...

Dictionary with a running total

This is for a homework assignment I am doing.
I have a .txt file that looks like this:
11
eggs
1.17
milk
3.54
bread
1.50
coffee
3.57
sugar
1.07
flour
1.37
apple
.33
cheese
4.43
orange
.37
bananas
.53
potato
.19
What I'm trying to do is keep a running total: when you type in the word "eggs" and then the word "bread", it needs to add the cost of both, and keep going until "EXIT". I also know I'm going to run into a KeyError and need help with that as well.
def main():
    key = ''
    infile = open('shoppinglist.txt', 'r')
    total = 0
    count = infile.readline()
    grocery = ''
    groceries = {}
    print('This program keeps a running total of your shopping list.')
    print('Use \'EXIT\' to exit.')
    while grocery != 'EXIT':
        grocery = input('Enter an item: ')
        for line in infile:
            line = line.strip()
            if key == '':
                key = line
            else:
                groceries[key] = line
                key = ''
        print('Your current total is $' + groceries[grocery])

main()
Does the file contain the prices of each of the different groceries?
The user input statement should have a .strip() at the end too, as line-ending characters can sometimes be included in user input.
You should only need to read the file once, not inside the loop.
When the user enters a grocery item, the code should, as you say, check that it exists:
if grocery in groceries:
    ...
else:
    # grocery name not recognised
I think you should have a separate dictionary to store the counts of each grocery, something like this: http://docs.python.org/library/collections.html#collections.Counter
import collections
quantitiesWanted = collections.Counter()
Then any grocery can be queried like quantitiesWanted['eggs'], which returns 0 by default. Doing quantitiesWanted['eggs'] += 1 increases it to 1, and so on.
To get the current total you can do:
total = 0
for key, value in quantitiesWanted.items():
    # prices were read from the file as strings, so convert them
    total += float(groceries[key]) * value
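Putting those pieces together, a minimal end-to-end sketch might look like this (assuming shoppinglist.txt has the alternating name/price format shown above; this is one way to structure it, not the only way):
import collections

def main():
    # Read the file once, up front: the first line is an item count,
    # then alternating name/price lines.
    groceries = {}
    with open('shoppinglist.txt', 'r') as infile:
        infile.readline()  # skip the leading item count
        key = ''
        for line in infile:
            line = line.strip()
            if key == '':
                key = line
            else:
                groceries[key] = float(line)
                key = ''

    quantities = collections.Counter()
    print('This program keeps a running total of your shopping list.')
    print("Use 'EXIT' to exit.")
    while True:
        grocery = input('Enter an item: ').strip()
        if grocery == 'EXIT':
            break
        if grocery in groceries:
            quantities[grocery] += 1
            total = sum(groceries[k] * v for k, v in quantities.items())
            print('Your current total is $%.2f' % total)
        else:
            # Unknown item: this branch avoids the KeyError you were worried about
            print('Item not recognised:', grocery)

main()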
