SEC Edgar filings extraction master.idx question - web-scraping

I encountered an issue while using the code from https://codingandfun.com/scraping-sec-edgar-python/
I tried to contact the authors of the website, but that didn't work out. I am hoping to get some help here, and thank you in advance.
It seems that when I get to the print(download) step, the output is some weird special characters instead of organized firm URLs. Is there something wrong with the SEC master.idx? Could someone help me identify the issue?
Here is the code:
import bs4 as bs
import requests
import pandas as pd
import re

company = 'Facebook Inc'
filing = '10-Q'
year = 2020
quarter = 'QTR3'

# get the quarterly master index listing all filings
download = requests.get(f'https://www.sec.gov/Archives/edgar/full-index/{year}/{quarter}/master.idx').content
download = download.decode("utf-8").split('\n')
print(download)

You need to declare your user-agent, as described here; otherwise you will download an HTML page prompting you to do so.
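A minimal sketch of the fix: send a User-Agent header that identifies you (SEC's fair-access guidance asks for a name and contact email; the string below is a placeholder, substitute your own). The DataFrame step is an extra assumption about what you want next, based on master.idx's pipe-delimited layout:

import requests
import pandas as pd

year, quarter = 2020, 'QTR3'

# SEC asks automated clients to identify themselves; replace with your details
headers = {'User-Agent': 'Your Name your.email@example.com'}

url = f'https://www.sec.gov/Archives/edgar/full-index/{year}/{quarter}/master.idx'
resp = requests.get(url, headers=headers)
resp.raise_for_status()

lines = resp.content.decode('utf-8', errors='replace').split('\n')

# master.idx records are pipe-delimited: CIK|Company Name|Form Type|Date Filed|Filename,
# preceded by a preamble that ends with a row of dashes
start = next(i for i, line in enumerate(lines) if line.startswith('-----')) + 1
records = [line.split('|') for line in lines[start:] if line.count('|') == 4]
df = pd.DataFrame(records, columns=['CIK', 'Company Name', 'Form Type', 'Date Filed', 'Filename'])
print(df.head())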

Related

How do I scrape a live-updating website using BeautifulSoup?

I have been trying to extract live data from worldometer.com (https://www.worldometers.info/), particularly the health section data. I was able to extract the title (example: 'Communicable disease deaths today'), but I cannot extract the live data (the numbers). Can anyone please help me with this?
The live data (the numbers) is populated by JavaScript, so you can grab it easily with an automation tool such as Selenium. Here is an example; just run the code.
Script:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

url = "https://www.worldometers.info/"

# webdriver_manager downloads a matching chromedriver automatically
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.maximize_window()
driver.get(url)
time.sleep(5)  # give the JavaScript counters time to render

# parse the rendered page source rather than the raw HTML
soup = BeautifulSoup(driver.page_source, 'lxml')
num = soup.select_one('div#c49 > div > span.counter-number')
print(num.text)
Output:
2,134,658
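If the fixed sleep proves flaky, the more robust Selenium idiom is an explicit wait that blocks until the counter element actually exists. A sketch, assuming the same div#c49 selector still matches the page:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get("https://www.worldometers.info/")

# wait up to 15 seconds for the JavaScript-populated counter to appear
num = WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'div#c49 > div > span.counter-number'))
)
print(num.text)
driver.quit()

This avoids guessing a delay: the wait returns as soon as the element appears, or raises TimeoutException after 15 seconds.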

How to use jsonlite to import CMS dataset

I am trying to import a dataset from CMS using an API. My code, however, only returns 1,000 of the 155,262 observations. I don't know what I am doing wrong. Another user posted a similar problem, but regrettably, I still cannot figure it out.
library(jsonlite)
# url for CMS dataset
url <- 'https://data.cms.gov/data-api/v1/dataset/3cc6ad89-5cc0-4071-91e1-2a91aff79975/data?'
# read url and convert to data.frame
document <- fromJSON(url)
This is the link to the website on CMS: https://data.cms.gov/provider-characteristics/hospitals-and-other-facilities/provider-of-services-file-hospital-non-hospital-facilities. I am interested in accessing the POS file for Q4 2021. Thanks for your help.
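The data API appears to page its responses (1,000 rows is a typical default page size), so the usual remedy is to request successive pages and bind them together. A sketch of that pattern in Python, assuming the endpoint accepts size and offset query parameters for paging (worth verifying against the CMS API docs); the same loop translates directly to R with jsonlite::fromJSON by appending ?size=...&offset=... to the URL:

import requests
import pandas as pd

base = 'https://data.cms.gov/data-api/v1/dataset/3cc6ad89-5cc0-4071-91e1-2a91aff79975/data'
size = 5000  # rows per request; an assumption about the API's page-size cap
frames = []
offset = 0
while True:
    batch = requests.get(base, params={'size': size, 'offset': offset}).json()
    if not batch:  # an empty page means we have everything
        break
    frames.append(pd.DataFrame(batch))
    offset += size

df = pd.concat(frames, ignore_index=True)
print(len(df))  # should approach the full 155,262 observations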

How to get data from the live table using web scraping?

I am trying to set up a live table by downloading the data directly from a website through Python. I believe I am following all the steps exactly, but I still cannot get the data from the table.
I have referred to many web pages and blogs to try to correct the issue, without success, so I would like the Stack Overflow community's help here.
This is the website, and there is only one table on the page from which I am trying to get the data:
https://etfdb.com/themes/smart-beta-etfs/#complete-list__esg&sort_name=assets_under_management&sort_order=desc&page=1
The data in the table is partially available for free and the rest is paid, so perhaps that is the problem, though I would assume I should at least be able to download the free data. Since this is my first attempt and I am a beginner at Python, I may be wrong. All help is appreciated.
The code is as follows:
import requests
from bs4 import BeautifulSoup
import pandas as pd

site = 'https://etfdb.com/themes/smart-beta-etfs/#complete-list&sort_name=assets_under_management&sort_order=desc&page=1'
page1 = requests.get(site)  # proxy_support was never defined, so the proxies argument is dropped
page1.status_code
page1.text

soup = BeautifulSoup(page1.text, 'html.parser')
print(soup.prettify())
table = soup.find_all("div", class_="fixed-table-body")
table
When I run the table command, it gives me no data; the result is completely empty even though there is a table on the website. All help will be really appreciated.
The page makes another request for this data, which returns JSON you can parse:
import requests
r = requests.get('https://etfdb.com/data_set/?tm=77630&cond=&no_null_sort=&count_by_id=&sort=assets_under_management&order=desc&limit=25&offset=0').json()
Some of the keys (those for the output columns Symbol and ETF Name, i.e. the keys symbol and name) contain HTML, so you can use bs4 on those values to extract the final desired result; the other key-value pairs are straightforward.
For example, if you loop over each row in the JSON
for row in r['rows']:
    print(row)
    break
you get rows for parsing, of which two items need bs4, like this.
Python:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

r = requests.get('https://etfdb.com/data_set/?tm=77630&cond=&no_null_sort=&count_by_id=&sort=assets_under_management&order=desc&limit=25&offset=0').json()

results = []
for row in r['rows']:
    # symbol and name hold HTML fragments; parse them with bs4
    soup = bs(row['symbol'], 'lxml')
    symbol = soup.select_one('.caps').text
    soup = bs(row['name'], 'lxml')
    etf_name = soup.select_one('a').text
    # the remaining fields are plain values
    esg_score = row['esg_quality_score']
    esg_quality_score_pctl_peer = row['esg_quality_score_pctl_peer']
    esg_quality_score_pctl_global = row['esg_quality_score_pctl_global']
    esg_weighted_avg_carbon_inten = row['esg_weighted_avg_carbon_inten']
    esg_sustainable_impact_pct = row['esg_sustainable_impact_pct']
    results.append([symbol, etf_name, esg_score, esg_quality_score_pctl_peer,
                    esg_quality_score_pctl_global, esg_weighted_avg_carbon_inten,
                    esg_sustainable_impact_pct])

headers = ['Symbol', 'ETF Name', 'ESG Score', 'ESG Score Peer Percentile (%)',
           'ESG Score Global Percentile (%)',
           'Carbon Intensity (Tons of CO2e / $M Sales)', 'Sustainable Impact Solutions (%)']
df = pd.DataFrame(results, columns=headers)
print(df)
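Note that this request returns only the first 25 rows because of the limit=25&offset=0 parameters in the URL. If the endpoint honors larger offsets (an assumption worth testing), you could page through the full list like this:

import requests

rows = []
offset = 0
while True:
    url = (f'https://etfdb.com/data_set/?tm=77630&cond=&no_null_sort=&count_by_id='
           f'&sort=assets_under_management&order=desc&limit=25&offset={offset}')
    batch = requests.get(url).json().get('rows', [])
    if not batch:  # an empty page means there is nothing left
        break
    rows.extend(batch)
    offset += 25

print(len(rows))  # total rows collected across pages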
You can also use a pandas DataFrame to fetch the table and export it to CSV:
import pandas as pd

tables = pd.read_html("https://etfdb.com/themes/smart-beta-etfs/#complete-list&sort_name=assets_under_management&sort_order=desc&page=1")
table = tables[0][:-1]  # drop the last (non-data) row
print(table)
table.to_csv('table.csv')  # the CSV file appears in the project folder after the run

urllib.request.urlopen(url) NOT Working

I am trying to read a simple stock page with the following code. The last line returns an error. I have double-checked that the URL works and have also tried multiple URLs as a check. Any help, please?
import urllib.request
url = "https://www.google.com"
data = urllib.request.urlopen(url).read()
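Hard to say without the traceback, but a common cause is the server rejecting urllib's default "Python-urllib" user agent (typically with HTTP 403). A sketch of the usual workaround, sending a browser-like User-Agent header:

import urllib.request

url = "https://www.google.com"
# some servers block the default "Python-urllib/x.y" user agent
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
data = urllib.request.urlopen(req).read()
print(len(data))

If that still fails, the exact exception (HTTPError, URLError, SSL error) would narrow it down.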

iPython: Unable to export data to CSV

I have searched multiple articles but am unable to get IPython (Python 2.7) to export data to a CSV, and I do not receive an error message to troubleshoot the specific problem. When I include print(new_links) I obtain the desired output; thus, the issue is writing to the CSV.
Any suggestions on next steps are much appreciated !
Thanks!
import csv
import requests
import lxml.html as lh

url = 'http://wwwnc.cdc.gov/travel/destinations/list'
page = requests.get(url)
doc = lh.fromstring(page.content)

new_links = []
for link_node in doc.iterdescendants('a'):
    try:
        new_links.append(link_node.attrib['href'])
    except KeyError:
        pass

cdc_part1 = open("cdc_part1.csv", 'wb')
wr = csv.writer(cdc_part1, dialect='excel')
wr.writerow(new_links)
Check whether new_links is a list of lists.
If it is and wr.writerow(new_links) is still not working, you can try:
for row in new_links:
    wr.writerow(row)
I would also check the open statement's file path and mode; see if you can get it to work with 'w'.
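One caveat with the loop above: each row here is a plain href string, and csv.writerow iterates a bare string character by character, producing one column per character. A sketch of writing one URL per row instead, assuming new_links from the question's script (note this also closes the file, which the original never does, so buffered output may never reach disk otherwise):

import csv

# Python 2.7, matching the question: binary mode is correct there
with open("cdc_part1.csv", 'wb') as f:
    wr = csv.writer(f, dialect='excel')
    for link in new_links:
        wr.writerow([link])  # wrap in a list: one URL per row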
