Replace special characters in a string (like "{ with { ) in R

I am stuck replacing one series of special characters with a different one.
For instance, I have "{'request' and I want to change the starting "{ to {.
Here's one line of my data:
"{'request':{'id':'n98u4jiqp61c19v8eknicioq4be74pfe','time':'2017-08-21T21:57:27+00:00','type':'web','tcp':{'signature':{'attributes':{'ip_version':4,'initial_ttl':128,'options_length':0,'mss':1360,'window_size':'8192','window_scale':8,'options':'mss,nop,ws,nop,nop,sok','header_quirks':'df,id+'},'normalized_full':'34.2178821511','normalized_partial':'18.3082608836'},'mtu':{'type':'Probably IPsec or other VPN','size':1400},'ssl':{'protocol':'TLSv1.2','cipher':'ECDHE-RSA-AES128-GCM-SHA256','handshake':{'version':'3.3','ciphers':{'value':'aaaa,cca9,cca8,c02b,c02f,c02c,c030,c013,c014,9c,9d,2f,35,a','signature':{'value':'cca9,cca8,c02b,c02f,c02c,c030,c013,c014,9c,9d,2f,35,a','normalized':'13.552269047','garbage':['aaaa']}},'extensions':{'value':'dada,ff01,?0,17,23,d,5,12,10,b,a,baba','signature':{'value':'ff01,?0,17,23,d,5,12,10,b,a','normalized':'10.1792723498','garbage':['dada','baba']}},'flags':'ver,rtime'},'signature':{'normalized':'48.3888883277'}}},'network':{'rtt':{'value':84958,'variance':25927},'distance':27,'ip':{'address':'0.0.0.0','hostname':'0.0.0.0','asn':{'number':'AS2609','organization':'Tunisia BackBone AS'},'location':{'continent':{'code':'AF','name':'Africa'},'country':{'code':'TN','name':'Tunisia'},'city':{'name':'D'ile Deux'},'region':{'name':null},'timezone':{'name':'Africa\/Tunis','offset':-60},'coordinates':{'latitude':34,'longitude':9}}}},'header':{'structure':{'value':['Host','Connection','Content-Length','Origin','User-Agent','Content-type','Accept','Referer','Accept-Encoding','Accept-Language'],'leftover':['Content-Length','Origin','Content-type'],'normalized':'7.2362511174'},'languages':{'value':{'fr-FR':1,'fr':0.8,'en-US':0.6,'en':0.4},'normalized':'35.775591716'},'agent':{'string':'Mozilla\/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/60.0.3112.101 Safari\/537.36','os':{'name':'Windows','version':'10','platform':'x64','family':'Windows'},'client':{'type':'browser','name':'Chrome','version':'60.0.3112.101','engine':'Blink'},'device':{'type':'desktop','vendor':'','model':''}}},'context':{'source':'javascript-2.0','processor':'web-1.0','details':{'browser':{'features':['52','127','126','47','0','204'],'plugins':{'hash':'4.834303856','list':['widevinecdmadapter.dll 1481000','mhjfbmdgcfjbbpaeojofohoefgiehjai','internal-nacl-plugin','internal-pdf-viewer']},'fonts':{'hash':'44.591879809','list':['Agency FB','Arabic Typesetting','Arial Black','Bauhaus 93','Bell MT','Bodoni MT','Bookman Old Style','Broadway','Calibri','Californian FB','Castellar','Centaur','Century Gothic','Colonna MT','Copperplate Gothic Light','Engravers MT','Forte','Franklin Gothic Heavy','French Script MT','Gabriola','Gigi','Goudy Old Style','Haettenschweiler','Harrington','Impact','Informal Roman','Lucida Bright','Lucida Fax','Magneto','Malgun Gothic','Matura MT Script Capitals','MingLiU-ExtB','MS Reference Sans Serif','Niagara Solid','Palace Script MT','Papyrus','Perpetua','Playbill','Rockwell','Segoe Print','Showcard Gothic','Snap ITC','Vladimir Script','Wide 
Latin']},'webgl':{'hashes':{'extensions':'25,1910689852','attributes':'3594354498','info':'1687665201'},'strings':{'attributes':'1,2,8,8,8,8,24,0,8,16384,32,16384,1024,16,16384,30,16,16,4096,1,1,1,1024,16384,16384,16,35633,35632','vendor':'GoogleInc.','renderer':'ANGLE(Intel(R)HDGraphicsDirect3D11vs_5_0ps_5_0)'}},'properties':{'name':'-1','platform':'Win32','concurrency':2,'flash_version':'0.0.0','math_vector':'2297712969','colors':'28,3846833241'},'navigator':'39,1102024947','languages':['fr'],'tokens':{'id':'20170821X4013172038Xn98u4jiqp61c19v8eknicioq4be74pfe'},'is_incognito':0,'history_length':3},'screen':{'width':1366,'height':768,'color_depth':24,'window_inner':'1366x662','window_outer':'1366x728','max_touch_points':'0','availwidth':1366,'availheight':728},'media':{'structure':'MMVSS','structure_list':['M741301070','M2957980628','V3222337076','S741301070','S3687074924'],'audio_signature':'18614.611131555066,172.67165302389913'},'battery':{'string':'A?,26','percentage':26,'status':'charging','status_seconds':0},'timezone':{'offset':1,'list':['-60','-60','-60']},'network':{'ipv4':['192.168.1.37'],'ipv6':['2001::9d38:6abd:3442:3626:3afa:f267'],'networks':['192.168.1']},'timing':{'dns_connection_ssl':1134,'latency':1134,'dns_resolution':72,'ssl_timing':974,'server_time':401,'content_download_time':177,'dom_timing':1623,'browser_process_time':1623,'event_binding_timing':0,'page_load':3183}}}},'identity':{'profile':'Chrome # Windows # TN','tag':{'id':'55e2dffc-8cb9-521b-98b2-526a68f603e4','fuzzy':'69d2f304-1795-5296-bd2a-9acdf4ad75c5','general':'35fc63aa-dd16-59b5-8932-d45255fd9117'},'recognized':{'fingerprints':{'passive':'18.3082608836\/48.3888883277\/7.2362511174\/35.775591716','context':'2.3300655535\/3.3983292286\/2.2957713582\/2.4102577126\/1.3430272718\/4.834303856\/44.591879809\/3.1275262912'},'os':[{'name':'Windows 7','signatures':5,'samples':66,'history':{'first_seen':1502198736,'last_seen':1503328113}},{'name':'Windows 10','signatures':5,'samples':45,'history':{'first_seen':1499346211,'last_seen':1503236236}},{'name':'Windows 8','signatures':5,'samples':34,'history':{'first_seen':1500265706,'last_seen':1503244946}}],'browser':[{'name':'Chrome','signatures':5,'samples':112,'history':{'first_seen':1496772867,'last_seen':1503244946}},{'name':'Opera','signatures':3,'samples':20,'history':{'first_seen':1499281234,'last_seen':1502415113}},{'name':'QQ Browser','signatures':2,'samples':4,'history':{'first_seen':1502916160,'last_seen':1502916411}},{'name':'Amigo','signatures':2,'samples':4,'history':{'first_seen':1502936943,'last_seen':1502937076}},{'name':'Sogou Explorer','signatures':2,'samples':6,'history':{'first_seen':1502525699,'last_seen':1502745114}}],'risk':[]}}}"
I was trying to use str_replace from the stringr package, but I can't make it work with all those special characters in between.
I guess I need something like
str_replace(data, ""{", "{")
but it's not working.
Any idea how to replace "{ with {, for instance?
Thanks!

You need to escape the special character " with a backslash. So do this:
str_replace(data, "\"{", "{")
Note that str_replace() interprets the pattern as a regular expression by default, and { can be special in regex syntax; if you run into problems, wrap the pattern in fixed() (i.e. str_replace(data, fixed("\"{"), "{")) so it is matched literally.

Related

How to web scrape AQI from airnow?

I am trying to scrape the current AQI for my location with BeautifulSoup 4.
url = "https://www.airnow.gov/?city=Burlingame&state=CA&country=USA"
header = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36",
"Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8"
}
response = requests.get(url, headers=header)
soup = BeautifulSoup(response.content, "lxml")
aqi = soup.find("div", class_="aqi")
When I print aqi, it is just an empty div.
However, on the website there should be an element inside this div containing the AQI number that I want.
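For what it's worth, a quick way to confirm what the static HTML actually contains (an illustrative check, not from the original post), continuing from the snippet above and reusing the same soup object:
div = soup.find("div", class_="aqi")
print(div)  # the div exactly as the server returned it
if div is not None and not div.get_text(strip=True):
    # The element exists but has no text in the static HTML, which usually means
    # the number is injected after page load (e.g. by JavaScript), so requests +
    # BeautifulSoup alone never see it.
    print("div is present but empty in the static HTML")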

Why is b' ' included in the excel file after web scraping?

I'm learning web scraping and was able to scrape data from a website into an Excel file. However, in the Excel file you can see that it also includes b' ' instead of just the strings (names of YouTube channels, uploads, views). Any idea where this came from?
from bs4 import BeautifulSoup
import csv
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'} # Need to use this otherwise it returns error 403.
url = requests.get('https://socialblade.com/youtube/top/50/mostviewed', headers=headers)
#print(url)
soup = BeautifulSoup(url.text, 'lxml')
rows = soup.find('div', attrs = {'style': 'float: right; width: 900px;'}).find_all('div', recursive = False)[4:] # If the website's inspector shows a class instead of a style attribute, use class_= instead of 'style'. We don't need the first 4 rows, so [4:]
file = open('/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/My_Projects/Web_scraping/topyoutubers.csv', 'w')
writer = csv.writer(file)
# write header rows
writer.writerow(['Username', 'Uploads', 'Views'])
for row in rows:
    username = row.find('a').text.strip()
    numbers = row.find_all('span', attrs = {'style': 'color:#555;'})
    uploads = numbers[0].text.strip()
    views = numbers[1].text.strip()
    print(username + ' ' + uploads + ' ' + views)
    writer.writerow([username.encode('utf-8'), uploads.encode('utf-8'), views.encode('utf-8')])
file.close()
It is caused by the way you do the encoding: .encode() turns each string into a bytes object, and that is what ends up written to the CSV. You are better off defining the encoding once when opening the file:
file = open('topyoutubers.csv', 'w', encoding='utf-8')
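To make the cause concrete (a small illustration, not part of the original answer): str.encode() returns a bytes object, and when csv.writer is handed a non-string value it writes str() of it, which is exactly the b'...' text that shows up in the file.
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(['T-Series'.encode('utf-8')])  # bytes object, not str
writer.writerow(['T-Series'])                  # plain str
print(buf.getvalue())
# b'T-Series'
# T-Series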
New code
from bs4 import BeautifulSoup
import csv
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'} # Need to use this otherwise it returns error 403.
url = requests.get('https://socialblade.com/youtube/top/50/mostviewed', headers=headers)
#print(url)
soup = BeautifulSoup(url.text, 'lxml')
rows = soup.find('div', attrs = {'style': 'float: right; width: 900px;'}).find_all('div', recursive = False)[4:] # If the website's inspector shows a class instead of a style attribute, use class_= instead of 'style'. We don't need the first 4 rows, so [4:]
file = open('/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/My_Projects/Web_scraping/topyoutubers.csv', 'w', encoding='utf-8')
writer = csv.writer(file)
# write header rows
writer.writerow(['Username', 'Uploads', 'Views'])
for row in rows:
    username = row.find('a').text.strip()
    numbers = row.find_all('span', attrs = {'style': 'color:#555;'})
    uploads = numbers[0].text.strip()
    views = numbers[1].text.strip()
    print(username + ' ' + uploads + ' ' + views)
    writer.writerow([username, uploads, views])
file.close()
Output
Username Uploads Views
1 T-Series 15,029 143,032,749,708
2 Cocomelon - Nursery Rhymes 605 93,057,513,422
3 SET India 48,505 78,282,384,002
4 Zee TV 97,302 59,037,594,757
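One more small tweak worth mentioning (not part of the original answer): the csv module documentation recommends opening the file with newline='' so the writer controls line endings itself, which avoids blank rows appearing between records on some platforms:
file = open('topyoutubers.csv', 'w', encoding='utf-8', newline='')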

How to change user agent when Tor ip changes in Scrapy

I use Tor and Privoxy with TorIpChanger to change the IP after a random number of items_scraped, and it is working fine. I would like to change the user agent as well when the IP changes.
I am a bit confused about how to achieve this. I have had a look at scrapy_useragents and similar solutions looking for inspiration, without much success so far. This is what I'm trying to do, based on https://github.com/khpeek/scraper-compose/ and https://docs.scrapy.org/en/latest/topics/extensions.html
extensions.py
import logging
import random

from scrapy import signals
from scrapy.exceptions import NotConfigured

logger = logging.getLogger(__name__)


class TorRenewIdentity(object):

    def __init__(self, crawler, item_count, user_agents):
        self.crawler = crawler
        self.item_count = self.randomize(item_count)  # Randomize the item count to confound traffic analysis
        self._item_count = item_count                 # Also remember the given item count for future randomizations
        self.items_scraped = 0
        self.user_agents = user_agents
        # Connect the extension object to signals
        self.crawler.signals.connect(self.item_scraped, signal=signals.item_scraped)

    @staticmethod
    def randomize(item_count, min_factor=0.5, max_factor=1.5):
        '''Randomize the number of items scraped before changing identity. (A similar technique is applied to Scrapy's DOWNLOAD_DELAY setting).'''
        randomized_item_count = random.randint(int(min_factor * item_count), int(max_factor * item_count))
        logger.info("The crawler will scrape the following (randomized) number of items before changing identity (again): {}".format(randomized_item_count))
        return randomized_item_count

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.getbool('TOR_RENEW_IDENTITY_ENABLED'):
            raise NotConfigured
        item_count = crawler.settings.getint('TOR_ITEMS_TO_SCRAPE_PER_IDENTITY', 10)
        user_agents = crawler.settings['USER_AGENT']
        return cls(crawler=crawler, item_count=item_count, user_agents=user_agents)  # Instantiate the extension object

    def item_scraped(self, item, spider):
        '''When item_count items are scraped, pause the engine and change IP address.'''
        self.items_scraped += 1
        if self.items_scraped == self.item_count:
            logger.info("Scraped {item_count} items. Pausing engine while changing identity...".format(item_count=self.item_count))
            self.crawler.engine.pause()
            ip_changer.get_new_ip()  # Change IP address with toripchanger https://github.com/DusanMadar/TorIpChanger
            self.items_scraped = 0  # Reset the counter
            self.item_count = self.randomize(self._item_count)  # Generate a new random number of items to scrape before changing identity again
            # Get new user agent from list
            if self.user_agents:
                new_user_agent = random.choice(self.user_agents)
                logger.info('Load {} user_agents from settings. New user agent is {}.'.format(
                    len(self.user_agents) if self.user_agents else 0, new_user_agent))
                # Change user agent here ?
                # For next self.item_count items
                # headers.setdefault('User-Agent', new_user_agent)
                #
            self.crawler.engine.unpause()
settings.py
USER_AGENT = [
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:56.0) Gecko/20100101 Firefox/56.0'
]
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'scrapydevua.middlewares.ScrapydevuaSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'scrapydevua.middlewares.ScrapydevuaDownloaderMiddleware': 543,
#}
EXTENSIONS = {
'scrapydevua.extensions.TorRenewIdentity': 1,
}
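For reference, one possible way to actually apply the chosen user agent (a sketch under assumptions, not code from the original post): have the extension store the value where the download path can read it, for example as a spider attribute, and set the header from a small downloader middleware. The names TorUserAgentMiddleware and current_user_agent below are hypothetical.
# In the extension, after picking new_user_agent, store it on the spider (hypothetical attribute):
#     spider.current_user_agent = new_user_agent

# middlewares.py
class TorUserAgentMiddleware:
    def process_request(self, request, spider):
        # Apply whichever user agent the TorRenewIdentity extension last picked.
        user_agent = getattr(spider, 'current_user_agent', None)
        if user_agent:
            request.headers['User-Agent'] = user_agent

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapydevua.middlewares.TorUserAgentMiddleware': 543,
}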

Lua - Concatenation of variables (pairs) escaping special characters - Openresty Nginx

I don't use Lua, but need to use it with OpenResty (nginx) as provided in the link.
OpenResty has a Lua module which I managed to install; the OpenResty nginx version runs correctly and the website is working.
This answer shows how to concatenate headers into a string $request_headers:
set_by_lua $request_headers '
    local h = ngx.req.get_headers()
    local request_headers_all = ""
    for k, v in pairs(h) do
        request_headers_all = request_headers_all .. "["..k..": "..v..\n"]"
    end
    return request_headers_all
';
I changed the format from ""..k..": "..v..";" to "["..k..": "..v.."]" in the lua function above.
Log format:
log_format log_realip 'Host: "$host", Via : "$http_via", RemoteAddr: "$remote_addr", XForwardedFor: "$http_x_forwarded_for", '
                      'RealIPRemoteAddr: "$realip_remote_addr" - $status - "$request_uri" - "[HEADERS]" - $request_headers';
Host: "192.168.1.4", Via : "-",
//trimmed just to show the [HEADERS]
....
"[HEADERS]" - [sec-ch-ua: \x22Chromium\x22;v=\x2288\x22, \x22Google Chrome\x22;v=\x228
8\x22, \x22;Not A Brand\x22;v=\x2299\x22][sec-ch-ua-mobile: ?0][cookie: __utmz=abcdef; frontend=abcdef; adminhtml=abcdef
08; TestName=Some Value][upgrade-insecure-requests: 1][accept-language: en-US,en;q=0.9][user-agent: Mozilla/5.0
(Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36][accept
-encoding: gzip, deflate, br][accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/we
bp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9][sec-fetch-dest: document][host: 192.168.1.4][se
c-fetch-user: ?1][connection: keep-alive][sec-fetch-mode: navigate][cache-control: max-age=0][sec-fetch-site: n
one
When using log_format with the $request_headers string I get all the headers on one line, but I am trying to add a newline \n to break the string into lines. The example above is where I added \n, but it doesn't seem to output a break to the log file.
I understand that request_headers_all .. concatenates the string, but what is happening here with the key k and value v: ""..k..": "..v..\n""?
What is the "..variablename.." doing? Is this how variables are always used inside Lua strings?
How would I be able to create a line break in that string? Or is it possible that nginx (OpenResty) doesn't output the newline?
You added the \n in the wrong place; you can change it to
request_headers_all = request_headers_all .. "["..k..": "..v.."]\n" to get a newline in the log.
In Lua, .. is the concatenation operator, used to join two strings. For example:
print("hello" .. "world")
gives the result helloworld.
Your code \n"]" has a syntax error, because the \n is not inside a string.
Lua strings cannot embed variables directly; usually Lua uses string.format for complex strings. For example:
local test = "hello"
string.format("%s world", test) -- hello world
You can use string.format for your string concatenation.
You can also use table.concat to concatenate strings.
For example:
local test = {}
table.insert(test, "hello")
table.insert(test, "world")
local concated_string = table.concat(test, ' ')
print(concated_string) -- hello world
request_headers_all = request_headers_all .. "["..k..": "..v..\n"]"
contains a syntax error. Replace "["..k..": "..v..\n"]" with "["..k..": "..v.."]\n".
The newline needs to be inside the quotes, as it is part of the string, and it will probably make sense to add it after the closing bracket.
What is the "..variablename.." doing, is this how variables are always
used inside Lua strings?
using the concatenation operator on a variable concatenates a string value, concatenates its string representation if it is a number or invokes __concat or raises an error if neither of those is true.
Read https://www.lua.org/manual/5.4/manual.html#3.4.6
The above answers gave me some guidance, but the formats suggested still didn't work. After playing around with string.format("%s %s\n", k, v) and string.format("%s %s\\n", k, v) I still got "unfinished string" errors or no newline in the output (the second example was an attempt to escape the string).
Based on the answers given I assumed they contained correct Lua information, so I decided that most likely Lua + OpenResty does something different.
I will attempt to update the title to reflect the more specific requirements.
TLDR
OpenResty + Lua string manipulation with special characters might not work as expected, even when using string.format().
Change from set_by_lua, which returns a string, to set_by_lua_block, which handles string.format() and string concatenation better.
Update the nginx configuration with your custom/existing log_format and add the switch escape=none.
Full Explanation
Investigating the set_by_lua documentation from the answer linked above:
NOTE Use of this directive is discouraged following the v0.9.17 release. Use the set_by_lua_block directive instead.
So, starting from the original set_by_lua in the link:
set_by_lua $request_headers '
return 'stringvalue'
';
I changed to the set_by_lua_block directive:
this directive inlines the Lua source directly inside a pair of curly braces ({}) instead of in an Nginx string literal (which requires special character escaping)
set_by_lua_block $request_headers {
    local h = ngx.req.get_headers()
    local request_headers_all = ""
    for k, v in pairs(h) do
        local rowtext = ""
        rowtext = string.format("[%s %s]\n", k, v)
        request_headers_all = request_headers_all .. rowtext
    end
    return request_headers_all
}
The important part is that this _block {} form handles the special characters correctly, without the extra escaping required by an Nginx string literal.
After that, I received output in the log files showing \x0A (the newline escaped as a character literal) instead of actual line breaks.
The final step then is to update the nginx.conf file with the custom log_format and add escape=none:
log_format log_realip escape=none "$nginx_variables"
The main question/problem here is how the wrapper software handles newline characters/escapes: is it expecting "\n", or is it expecting "\r\n"?
Ultimately the newline does not actually exist until it is interpreted and printed, and you are creating one massive string that gets returned from the Lua engine to the wrapper, so it is up to the wrapper software how to interpret newlines.
Edit: I missed that this was already answered by using the other directive.
Additionally, the docs state that with the original directive, escapes need to be double escaped:
Here, \\\\d+ is stripped down to \\d+ by the Nginx config file parser and this is further stripped down to \d+ by the Lua language parser before running.

Seeking Alpha scraping conference call transcripts issues

I am trying to collect transcripts of conference calls from Seeking Alpha for a research project (I am a PhD student). I found code online to extract the transcripts and store them in a .json file, and I have already adjusted it to rotate user agents. However, the code only extracts the first page of each conference call transcript because of the following:
body = response.css('div#a-body p.p1')
chunks = body.css('p.p1')
The pages are represented by a series of <p> elements, with the class .p1 .p2 .p3 etc. that indicate the page numbers. I have already tried a number of things such as replacing the above code with:
response.xpath('//div[@id="a-body"]/p')
but I have not been able to extract the full conference call transcript (only the first page). Below is the full code:
import scrapy
# This enum lists the stages of each transcript.
from enum import Enum
import random
# SRC: https://developers.whatismybrowser.com/useragents/explore/
user_agent_list = [
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.94 Chrome/37.0.2062.94 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
#Firefox
'Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1)',
'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)',
'Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko',
'Mozilla/5.0 (Windows NT 6.2; WOW64; Trident/7.0; rv:11.0) like Gecko',
'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko',
'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0)',
'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko',
'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
'Mozilla/5.0 (Windows NT 6.1; Win64; x64; Trident/7.0; rv:11.0) like Gecko',
'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)',
'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)',
'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)'
]
Stage = Enum('Stage', 'preamble execs analysts body')
# Some transcript preambles are concatenated on a single line. This list is used
# To separate the title and date sections of the string.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
transcripts = {}
class TranscriptSpider(scrapy.Spider):
    name = 'transcripts'
    custom_settings = {
        'DOWNLOAD_DELAY': 2  # 0.25 == 250 ms of delay, 1 == 1000ms of delay, etc.
    }
    start_urls = ['http://seekingalpha.com/earnings/earnings-call-transcripts/1']

    def parse(self, response):
        # Follows each transcript page's link from the given index page.
        for href in response.css('.dashboard-article-link::attr(href)').extract():
            user_agent = random.choice(user_agent_list)
            yield scrapy.Request(response.urljoin(href), callback=self.parse_transcript, headers={'User-Agent': user_agent})
        # Follows the pagination links at the bottom of given index page.
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

    def parse_transcript(self, response):
        i = 0
        transcript = {}
        details = {}
        execs = []
        analysts = []
        script = []
        mode = 1
        # As the pages are represented by a series of `<p>` elements we have to do this the
        # old-fashioned way - breaking it into chunks and iterating over them.
        body = response.css('div#a-body p.p1')
        chunks = body.css('p.p1')
        while i < len(chunks):
            # If the current line is a heading and we're not currently going
            # through the transcript body (where headings represent speakers),
            # change the current section flag to the next section.
            if (len(chunks[i].css('strong::text').extract()) == 0) or (mode == 4):
                currStage = Stage(mode)
                # If we're on the preamble stage, each bit of data is extracted
                # separately as they all have their own key in the JSON.
                if currStage == Stage['preamble']:
                    # If we're on the first line of the preamble, that's the
                    # company name, stock exchange and ticker acroynm (or should
                    # be - see below)
                    if i == 0:
                        # Checks to see if the second line is a heading. If not,
                        # everything is fine.
                        if len(chunks[1].css('strong::text').extract()) == 0:
                            details['company'] = chunks[i].css('p::text').extract_first()
                            if " (" in details['company']:
                                details['company'] = details['company'].split(' (')[0]
                            # If a specific stock exchange is not listed, it
                            # defaults to NYSE
                            details['exchange'] = "NYSE"
                            details['ticker'] = chunks.css('a::text').extract_first()
                            if ":" in details['ticker']:
                                ticker = details['ticker'].split(':')
                                details['exchange'] = ticker[0]
                                details['ticker'] = ticker[1]
                        # However, if it is, that means this line contains the
                        # full, concatenated preamble, so everything must be
                        # extracted here
                        else:
                            details['company'] = chunks[i].css('p::text').extract_first()
                            if " (" in details['company']:
                                details['company'] = details['company'].split(' (')[0]
                            # if a specific stock exchange is not listed, default to NYSE
                            details['exchange'] = "NYSE"
                            details['ticker'] = chunks.css('a::text').extract_first()
                            if ":" in details['ticker']:
                                ticker = details['ticker'].split(':')
                                details['exchange'] = ticker[0]
                                details['ticker'] = ticker[1]
                            titleAndDate = chunks[i].css('p::text').extract()[1]
                            for date in months:
                                if date in titleAndDate:
                                    splits = titleAndDate.split(date)
                                    details['title'] = splits[0]
                                    details['date'] = date + splits[1]
                    # Otherwise, we're onto the title line.
                    elif i == 1:
                        title = chunks[i].css('p::text').extract_first()
                        # This should never be the case, but just to be careful
                        # I'm leaving it in.
                        if len(title) <= 0:
                            title = "NO TITLE"
                        details['title'] = title
                    # Or the date line.
                    elif i == 2:
                        details['date'] = chunks[i].css('p::text').extract_first()
                # If we're onto the 'Executives' section, we create a list of
                # all of their names, positions and company name (from the
                # preamble).
                elif currStage == Stage['execs']:
                    anExec = chunks[i].css('p::text').extract_first().split(" - ")
                    # This covers if the execs are separated with an em- rather
                    # than an en-dash (see above).
                    if len(anExec) <= 1:
                        anExec = chunks[i].css('p::text').extract_first().split(" – ")
                    name = anExec[0]
                    if len(anExec) > 1:
                        position = anExec[1]
                    # Again, this should never be the case, as an Exec-less
                    # company would find it hard to get much done.
                    else:
                        position = ""
                    execs.append((name, position, details['company']))
                # This does the same, but with the analysts (which never seem
                # to be separated by em-dashes for some reason).
                elif currStage == Stage['analysts']:
                    name = chunks[i].css('p::text').extract_first().split(" - ")[0]
                    company = chunks[i].css('p::text').extract_first().split(" - ")[1]
                    analysts.append((name, company))
                # This strips the transcript body of everything except simple
                # HTML, and stores that.
                elif currStage == Stage['body']:
                    line = chunks[i].css('p::text').extract_first()
                    html = "p>"
                    if line is None:
                        line = chunks[i].css('strong::text').extract_first()
                        html = "h1>"
                    script.append("<" + html + line + "</" + html)
            else:
                mode += 1
            i += 1
        # Adds the various arrays to the dictionary for the transcript
        details['exec'] = execs
        details['analysts'] = analysts
        details['transcript'] = ''.join(script)
        # Adds this transcript to the dictionary of all scraped
        # transcripts, and yield that for the output
        transcript["entry"] = details
        yield transcript
I have been stuck on this for a week now (still new to Python and web scraping) so it would be great if someone brighter than me could take a look!
It seems that the transcripts are spread over several pages.
So I think you have to add to your parse_transcript method a part where you find the link to the next page of the transcript, then open it and submit it to parse_transcript.
Something like this:
# Follows the pagination links at the bottom of transcript page.
next_page = response.css(YOUR CSS SELECTOR GOES HERE).extract_first()
if next_page is not None:
next_page = response.urljoin(next_page)
yield scrapy.Request(next_page, callback=self.parse_transcript)
Obviously, you have to modify your parse_transcript method to parse not only the paragraphs extracted from the first page. You have to make this part more general:
body = response.css('div#a-body p.p1')
chunks = body.css('p.p1')
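A sketch of how that could look in practice (the selector and the meta keys below are assumptions for illustration, not taken from the original code): follow the transcript's own pagination link and carry the partially built result along in request.meta, yielding the item only once the last page has been parsed.
def parse_transcript(self, response):
    # Pick up whatever was collected from earlier pages of this transcript.
    details = response.meta.get('details', {})
    script = response.meta.get('script', [])

    # ... extract this page's chunks into details/script as in the code above ...

    # Hypothetical selector for the transcript's "next page" link.
    next_page = response.css('a[rel="next"]::attr(href)').extract_first()
    if next_page is not None:
        yield scrapy.Request(
            response.urljoin(next_page),
            callback=self.parse_transcript,
            meta={'details': details, 'script': script},
        )
    else:
        # Last page reached: finish the item and yield it once.
        details['transcript'] = ''.join(script)
        yield {'entry': details}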
