Seeking Alpha: issues scraping conference call transcripts - web-scraping
I am trying to collect transcripts of conference calls from Seeking Alpha for a research project (I am a PhD student). I found code online that extracts the transcripts and stores them in a .json file, and I have already adjusted it to rotate user agents. However, the code only extracts the first page of each conference call transcript because of the following:
body = response.css('div#a-body p.p1')
chunks = body.css('p.p1')
The pages are represented by a series of <p> elements with the classes .p1, .p2, .p3, etc., which indicate the page numbers. I have already tried a number of things, such as replacing the above code with:
response.xpath('//div[@id="a-body"]/p')
but I have not been able to extract the full conference call transcript (only the first page). Below is the full code:
import scrapy
# This enum lists the stages of each transcript.
from enum import Enum
import random
# SRC: https://developers.whatismybrowser.com/useragents/explore/
user_agent_list = [
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.94 Chrome/37.0.2062.94 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
#Firefox
'Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1)',
'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)',
'Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko',
'Mozilla/5.0 (Windows NT 6.2; WOW64; Trident/7.0; rv:11.0) like Gecko',
'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko',
'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0)',
'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko',
'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
'Mozilla/5.0 (Windows NT 6.1; Win64; x64; Trident/7.0; rv:11.0) like Gecko',
'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)',
'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)',
'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)'
]
Stage = Enum('Stage', 'preamble execs analysts body')
# Some transcript preambles are concatenated on a single line. This list is used
# To separate the title and date sections of the string.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
transcripts = {}
class TranscriptSpider(scrapy.Spider):
    name = 'transcripts'
    custom_settings = {
        'DOWNLOAD_DELAY': 2 # 0.25 == 250 ms of delay, 1 == 1000ms of delay, etc.
    }
    start_urls = ['http://seekingalpha.com/earnings/earnings-call-transcripts/1']

    def parse(self, response):
        # Follows each transcript page's link from the given index page.
        for href in response.css('.dashboard-article-link::attr(href)').extract():
            user_agent = random.choice(user_agent_list)
            yield scrapy.Request(response.urljoin(href), callback=self.parse_transcript, headers={'User-Agent': user_agent})

        # Follows the pagination links at the bottom of the given index page.
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

    def parse_transcript(self, response):
        i = 0
        transcript = {}
        details = {}
        execs = []
        analysts = []
        script = []
        mode = 1

        # As the pages are represented by a series of `<p>` elements we have to do this the
        # old-fashioned way - breaking it into chunks and iterating over them.
        body = response.css('div#a-body p.p1')
        chunks = body.css('p.p1')
        while i < len(chunks):
            # If the current line is a heading and we're not currently going
            # through the transcript body (where headings represent speakers),
            # change the current section flag to the next section.
            if (len(chunks[i].css('strong::text').extract()) == 0) or (mode == 4):
                currStage = Stage(mode)
                # If we're on the preamble stage, each bit of data is extracted
                # separately as they all have their own key in the JSON.
                if currStage == Stage['preamble']:
                    # If we're on the first line of the preamble, that's the
                    # company name, stock exchange and ticker acronym (or should
                    # be - see below)
                    if i == 0:
                        # Checks to see if the second line is a heading. If not,
                        # everything is fine.
                        if len(chunks[1].css('strong::text').extract()) == 0:
                            details['company'] = chunks[i].css('p::text').extract_first()
                            if " (" in details['company']:
                                details['company'] = details['company'].split(' (')[0]
                            # If a specific stock exchange is not listed, it
                            # defaults to NYSE
                            details['exchange'] = "NYSE"
                            details['ticker'] = chunks.css('a::text').extract_first()
                            if ":" in details['ticker']:
                                ticker = details['ticker'].split(':')
                                details['exchange'] = ticker[0]
                                details['ticker'] = ticker[1]
                        # However, if it is, that means this line contains the
                        # full, concatenated preamble, so everything must be
                        # extracted here
                        else:
                            details['company'] = chunks[i].css('p::text').extract_first()
                            if " (" in details['company']:
                                details['company'] = details['company'].split(' (')[0]
                            # If a specific stock exchange is not listed, default to NYSE
                            details['exchange'] = "NYSE"
                            details['ticker'] = chunks.css('a::text').extract_first()
                            if ":" in details['ticker']:
                                ticker = details['ticker'].split(':')
                                details['exchange'] = ticker[0]
                                details['ticker'] = ticker[1]
                            titleAndDate = chunks[i].css('p::text').extract()[1]
                            for date in months:
                                if date in titleAndDate:
                                    splits = titleAndDate.split(date)
                                    details['title'] = splits[0]
                                    details['date'] = date + splits[1]
                    # Otherwise, we're onto the title line.
                    elif i == 1:
                        title = chunks[i].css('p::text').extract_first()
                        # This should never be the case, but just to be careful
                        # I'm leaving it in.
                        if len(title) <= 0:
                            title = "NO TITLE"
                        details['title'] = title
                    # Or the date line.
                    elif i == 2:
                        details['date'] = chunks[i].css('p::text').extract_first()
                # If we're onto the 'Executives' section, we create a list of
                # all of their names, positions and company name (from the
                # preamble).
                elif currStage == Stage['execs']:
                    anExec = chunks[i].css('p::text').extract_first().split(" - ")
                    # This covers if the execs are separated with an em- rather
                    # than an en-dash (see above).
                    if len(anExec) <= 1:
                        anExec = chunks[i].css('p::text').extract_first().split(" – ")
                    name = anExec[0]
                    if len(anExec) > 1:
                        position = anExec[1]
                    # Again, this should never be the case, as an Exec-less
                    # company would find it hard to get much done.
                    else:
                        position = ""
                    execs.append((name, position, details['company']))
                # This does the same, but with the analysts (which never seem
                # to be separated by em-dashes for some reason).
                elif currStage == Stage['analysts']:
                    name = chunks[i].css('p::text').extract_first().split(" - ")[0]
                    company = chunks[i].css('p::text').extract_first().split(" - ")[1]
                    analysts.append((name, company))
                # This strips the transcript body of everything except simple
                # HTML, and stores that.
                elif currStage == Stage['body']:
                    line = chunks[i].css('p::text').extract_first()
                    html = "p>"
                    if line is None:
                        line = chunks[i].css('strong::text').extract_first()
                        html = "h1>"
                    script.append("<" + html + line + "</" + html)
            else:
                mode += 1
            i += 1

        # Adds the various arrays to the dictionary for the transcript
        details['exec'] = execs
        details['analysts'] = analysts
        details['transcript'] = ''.join(script)

        # Adds this transcript to the dictionary of all scraped
        # transcripts, and yields that for the output
        transcript["entry"] = details
        yield transcript
I have been stuck on this for a week now (still new to Python and web scraping) so it would be great if someone brighter than me could take a look!
It seems that the transcripts are split across several pages.
So I think you have to add a part to your parse_transcript method that finds the link to the next page of the transcript, then opens it and submits it back to parse_transcript.
Something like this:
# Follows the pagination links at the bottom of transcript page.
next_page = response.css(YOUR CSS SELECTOR GOES HERE).extract_first()
if next_page is not None:
next_page = response.urljoin(next_page)
yield scrapy.Request(next_page, callback=self.parse_transcript)
Obviously, you have to modify your parse_transcript method so that it parses not only the paragraphs extracted from the first page. You have to make this part more general:
body = response.css('div#a-body p.p1')
chunks = body.css('p.p1')
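For example, here is a minimal sketch of how parse_transcript could follow the transcript's own pagination and accumulate the paragraphs from every page before building the JSON entry. The selector 'a[rel="next"]::attr(href)' for the next-page link is only an assumption and has to be checked against the actual HTML; passing the collected paragraphs along in request.meta is just one possible way to carry state between pages.

def parse_transcript(self, response):
    # Paragraphs (as raw HTML strings) collected from earlier pages of this
    # transcript, if any were passed along with the request.
    chunks = response.meta.get('chunks', [])
    # Take every <p> in the body, not only the first page's p.p1 elements.
    chunks += response.css('div#a-body p').extract()

    # Assumed selector for the transcript's next-page link - verify it first.
    next_page = response.css('a[rel="next"]::attr(href)').extract_first()
    if next_page is not None:
        # Not the last page yet: request the next page and hand the
        # accumulated paragraphs over to the next callback.
        yield scrapy.Request(response.urljoin(next_page),
                             callback=self.parse_transcript,
                             meta={'chunks': chunks})
    else:
        # Last page: every paragraph of the transcript is now in `chunks`.
        yield {'transcript': ''.join(chunks)}

Once all pages have been collected like this, the joined HTML can be re-parsed (for example with scrapy.Selector(text=''.join(chunks))) and run through your existing stage-by-stage loop instead of only the first page's p.p1 chunks.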
Related
How to web scrape AQI from airnow?
I am trying to scrape the current AQI for my location with BeautifulSoup 4.
import requests
from bs4 import BeautifulSoup

url = "https://www.airnow.gov/?city=Burlingame&state=CA&country=USA"
header = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36",
    "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8"
}
response = requests.get(url, headers=header)
soup = BeautifulSoup(response.content, "lxml")
aqi = soup.find("div", class_="aqi")
When I print aqi, it is just an empty div. However, on the website there should be an element inside this div containing the AQI number that I want.
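If the value is filled in by JavaScript after the page loads (which the empty div suggests), requests will never see it. A minimal sketch of one common workaround, assuming the number ends up inside the same div.aqi element and that a chromedriver install is available, is to drive a real browser with Selenium and wait for the element to be rendered:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "https://www.airnow.gov/?city=Burlingame&state=CA&country=USA"
driver = webdriver.Chrome()  # assumes chromedriver is on the PATH
try:
    driver.get(url)
    # Wait up to 20 seconds for the JavaScript-rendered AQI element to appear.
    aqi_div = WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div.aqi"))
    )
    print(aqi_div.text)
finally:
    driver.quit()

Checking the Network tab in DevTools for a JSON endpoint the page calls is usually faster than driving a browser, if such a request is visible.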
Beautiful soup articles scraping
Why does my code only find 5 articles instead of all 30 on the page? Here is my code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'}
url = 'https://www.15min.lt/tema/svietimas-24297'
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
antrastes = soup.find_all('h3', {'class': 'vl-title'})
print(antrastes)
The page uses JavaScript to add items, but requests/BeautifulSoup can't run JavaScript. You may need to use Selenium to control a real web browser, which can run JavaScript, and it may also need some JavaScript code to scroll the page. Alternatively, you can check in DevTools in Firefox/Chrome whether JavaScript loads the data from some URL, and then try to use that URL with requests. It may need a Session to get cookies and headers from the first GET.
This code uses a URL which I found in DevTools (tab: Network, filter: XHR). It needs a different offset (a datetime) in the URL to get different rows - url.format(offset). If you use the current datetime, you don't even need to read the main page. It needs the header 'X-Requested-With': 'XMLHttpRequest' to work. The server sends JSON data with the keys rows (with HTML) and offset (with the datetime for the next rows), and I use this offset to get the next rows. I run this in a loop to get more rows.
import urllib.parse
import requests
from bs4 import BeautifulSoup
import datetime

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}

url = 'https://www.15min.lt/tags/ajax/list/svietimas-24297?tag=24297&type=&offset={}&last_row=2&iq=L&force_wide=true&cachable=1&layout%5Bw%5D%5B%5D=half_wide&layout%5Bw%5D%5B%5D=third_wide&layout%5Bf%5D%5B%5D=half_wide&layout%5Bf%5D%5B%5D=third_wide&cosite=default'

offset = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')

for _ in range(5):
    print('=====', offset, '=====')
    offset = urllib.parse.quote_plus(offset)
    response = requests.get(url.format(offset), headers=headers)
    data = response.json()

    soup = BeautifulSoup(data['rows'], 'html.parser')
    antrastes = soup.find_all('h3', {'class': 'vl-title'})
    for item in antrastes:
        print(item.text.strip())
        print('---')

    offset = data['offset']  # offset for next data
Result:
===== 2022-03-09 21:20:36 =====
Konkursas „Praeities stiprybė – dabarčiai“. Susipažinkite su finalininkų darbais ir išrinkite nugalėtojus
---
ŠMSM į ukrainiečių vaikų ugdymą žada įtraukti ir atvykstančius mokytojus
---
Didėjant būrelių Vilniuje finansavimui, tikimasi įtraukti ir ukrainiečių vaikus
---
Mylėti priešus – ne glostyti palei plauką
---
Atvira pamoka su prof. Alfredu Bumblausku: „Ką reikėtų žinoti apie Ukrainos istoriją?“
---
===== 2022-03-04 13:20:21 =====
Vilniečiams vaikams – didesnis neformaliojo švietimo krepšelis
---
Premjerė: sudėtingiausiose situacijoje mokslo ir mokslininkų svarba tik didėja
---
Prasideda priėmimas į sostinės mokyklas: ką svarbu žinoti?
---
Dešimtokai lietuvių kalbos ir matematikos pasiekimus gegužę tikrinsis nuotoliniu būdu
---
Vilniuje prasideda priėmimas į mokyklas
---
===== 2022-03-01 07:09:05 =====
Nuotolinė istorijos pamoka apie Ukrainą sulaukė 30 tūkst. peržiūrų
---
J.Šiugždinienė: po Ukrainos pergalės bendradarbiavimas su šia herojiška valstybe tik didės
---
Vilniaus savivaldybė svarsto įkurdinti moksleivius buvusiame „Ignitis“ pastate
---
Socialdemokratai ragina stabdyti švietimo įstaigų tinklo pertvarką
---
Pokyčiai mokyklinėje literatūros programoje: mažiau privalomų autorių, brandos egzaminas – iš kelių dalių
---
===== 2022-02-26 11:04:29 =====
Mokytojo Gyčio „pagalbos“ – žygis, puodas ir uodas
---
Nuo kovo 2-osios pradinukams klasėse nebereikės dėvėti kaukių
---
Dr. Austėja Landsbergienė: Matematikos nerimas – kas tai ir ar įmanoma išvengti?
---
Ukrainos palaikymui – visuotinė istorijos pamoka Lietuvos mokykloms
---
Mokinius kviečia didžiausias chemijos dalyko konkursas Lietuvoje
---
===== 2022-02-23 10:11:14 =====
Mokyklų tinklo stiprinimas savivaldybėse: klausimai ir atsakymai
---
Vaiko ir paauglio kelias į sėkmę, arba Kaip gauti Nobelio premiją
---
Geriausias ugdymas – žygis, laužas, puodas ir uodas
---
Vilija Targamadzė: Bendrojo ugdymo mokyklų reformatoriai, ar ir toliau sėsite kakofoniją?
---
Švietimo ministrė: tai, kad turime sujungtas 5–8 klases, yra kažkas baisaus
---
web scraping using BeautifulSoup: reading tables
I'm trying to get data from a table on transfermarkt.com. I was able to get the first 25 entries with the following code. However, I need to get the rest of the entries, which are on the following pages. When I click on the second page, the URL does not change. I tried to increase the range in the for loop, but it gives an error. Any suggestion would be appreciated.
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.transfermarkt.com/spieler-statistik/wertvollstespieler/marktwertetop'
heads = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'}

r = requests.get(url, headers=heads)
source = r.text
soup = BeautifulSoup(source, "html.parser")

players = soup.find_all("a", {"class": "spielprofil_tooltip"})
values = soup.find_all("td", {"class": "rechts hauptlink"})

playerslist = []
valueslist = []
for i in range(0, 25):
    playerslist.append(players[i].text)
    valueslist.append(values[i].text)

df = pd.DataFrame({"Players": playerslist, "Values": valueslist})
Alter the URL in the loop and also change your selectors:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

players = []
values = []
headers = {'User-Agent': 'Mozilla/5.0'}

with requests.Session() as s:
    for page in range(1, 21):
        r = s.get(f'https://www.transfermarkt.com/spieler-statistik/wertvollstespieler/marktwertetop?ajax=yw1&page={page}', headers=headers)
        soup = bs(r.content, 'lxml')
        players += [i.text for i in soup.select('.items .spielprofil_tooltip')]
        values += [i.text for i in soup.select('.items .rechts.hauptlink')]

df = pd.DataFrame({"Players": players, "Values": values})
Replace special characters in a string (like "{ by { ) in R
I am stuck with replacing a series of special characters by another different series. For instance, I have "{'request' and I want to change the starting "{ to {. Here's one line of my data: "{'request':{'id':'n98u4jiqp61c19v8eknicioq4be74pfe','time':'2017-08-21T21:57:27+00:00','type':'web','tcp':{'signature':{'attributes':{'ip_version':4,'initial_ttl':128,'options_length':0,'mss':1360,'window_size':'8192','window_scale':8,'options':'mss,nop,ws,nop,nop,sok','header_quirks':'df,id+'},'normalized_full':'34.2178821511','normalized_partial':'18.3082608836'},'mtu':{'type':'Probably IPsec or other VPN','size':1400},'ssl':{'protocol':'TLSv1.2','cipher':'ECDHE-RSA-AES128-GCM-SHA256','handshake':{'version':'3.3','ciphers':{'value':'aaaa,cca9,cca8,c02b,c02f,c02c,c030,c013,c014,9c,9d,2f,35,a','signature':{'value':'cca9,cca8,c02b,c02f,c02c,c030,c013,c014,9c,9d,2f,35,a','normalized':'13.552269047','garbage':['aaaa']}},'extensions':{'value':'dada,ff01,?0,17,23,d,5,12,10,b,a,baba','signature':{'value':'ff01,?0,17,23,d,5,12,10,b,a','normalized':'10.1792723498','garbage':['dada','baba']}},'flags':'ver,rtime'},'signature':{'normalized':'48.3888883277'}}},'network':{'rtt':{'value':84958,'variance':25927},'distance':27,'ip':{'address':'0.0.0.0','hostname':'0.0.0.0','asn':{'number':'AS2609','organization':'Tunisia BackBone AS'},'location':{'continent':{'code':'AF','name':'Africa'},'country':{'code':'TN','name':'Tunisia'},'city':{'name':'D'ile Deux'},'region':{'name':null},'timezone':{'name':'Africa\/Tunis','offset':-60},'coordinates':{'latitude':34,'longitude':9}}}},'header':{'structure':{'value':['Host','Connection','Content-Length','Origin','User-Agent','Content-type','Accept','Referer','Accept-Encoding','Accept-Language'],'leftover':['Content-Length','Origin','Content-type'],'normalized':'7.2362511174'},'languages':{'value':{'fr-FR':1,'fr':0.8,'en-US':0.6,'en':0.4},'normalized':'35.775591716'},'agent':{'string':'Mozilla\/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/60.0.3112.101 Safari\/537.36','os':{'name':'Windows','version':'10','platform':'x64','family':'Windows'},'client':{'type':'browser','name':'Chrome','version':'60.0.3112.101','engine':'Blink'},'device':{'type':'desktop','vendor':'','model':''}}},'context':{'source':'javascript-2.0','processor':'web-1.0','details':{'browser':{'features':['52','127','126','47','0','204'],'plugins':{'hash':'4.834303856','list':['widevinecdmadapter.dll 1481000','mhjfbmdgcfjbbpaeojofohoefgiehjai','internal-nacl-plugin','internal-pdf-viewer']},'fonts':{'hash':'44.591879809','list':['Agency FB','Arabic Typesetting','Arial Black','Bauhaus 93','Bell MT','Bodoni MT','Bookman Old Style','Broadway','Calibri','Californian FB','Castellar','Centaur','Century Gothic','Colonna MT','Copperplate Gothic Light','Engravers MT','Forte','Franklin Gothic Heavy','French Script MT','Gabriola','Gigi','Goudy Old Style','Haettenschweiler','Harrington','Impact','Informal Roman','Lucida Bright','Lucida Fax','Magneto','Malgun Gothic','Matura MT Script Capitals','MingLiU-ExtB','MS Reference Sans Serif','Niagara Solid','Palace Script MT','Papyrus','Perpetua','Playbill','Rockwell','Segoe Print','Showcard Gothic','Snap ITC','Vladimir Script','Wide 
Latin']},'webgl':{'hashes':{'extensions':'25,1910689852','attributes':'3594354498','info':'1687665201'},'strings':{'attributes':'1,2,8,8,8,8,24,0,8,16384,32,16384,1024,16,16384,30,16,16,4096,1,1,1,1024,16384,16384,16,35633,35632','vendor':'GoogleInc.','renderer':'ANGLE(Intel(R)HDGraphicsDirect3D11vs_5_0ps_5_0)'}},'properties':{'name':'-1','platform':'Win32','concurrency':2,'flash_version':'0.0.0','math_vector':'2297712969','colors':'28,3846833241'},'navigator':'39,1102024947','languages':['fr'],'tokens':{'id':'20170821X4013172038Xn98u4jiqp61c19v8eknicioq4be74pfe'},'is_incognito':0,'history_length':3},'screen':{'width':1366,'height':768,'color_depth':24,'window_inner':'1366x662','window_outer':'1366x728','max_touch_points':'0','availwidth':1366,'availheight':728},'media':{'structure':'MMVSS','structure_list':['M741301070','M2957980628','V3222337076','S741301070','S3687074924'],'audio_signature':'18614.611131555066,172.67165302389913'},'battery':{'string':'A?,26','percentage':26,'status':'charging','status_seconds':0},'timezone':{'offset':1,'list':['-60','-60','-60']},'network':{'ipv4':['192.168.1.37'],'ipv6':['2001::9d38:6abd:3442:3626:3afa:f267'],'networks':['192.168.1']},'timing':{'dns_connection_ssl':1134,'latency':1134,'dns_resolution':72,'ssl_timing':974,'server_time':401,'content_download_time':177,'dom_timing':1623,'browser_process_time':1623,'event_binding_timing':0,'page_load':3183}}}},'identity':{'profile':'Chrome # Windows # TN','tag':{'id':'55e2dffc-8cb9-521b-98b2-526a68f603e4','fuzzy':'69d2f304-1795-5296-bd2a-9acdf4ad75c5','general':'35fc63aa-dd16-59b5-8932-d45255fd9117'},'recognized':{'fingerprints':{'passive':'18.3082608836\/48.3888883277\/7.2362511174\/35.775591716','context':'2.3300655535\/3.3983292286\/2.2957713582\/2.4102577126\/1.3430272718\/4.834303856\/44.591879809\/3.1275262912'},'os':[{'name':'Windows 7','signatures':5,'samples':66,'history':{'first_seen':1502198736,'last_seen':1503328113}},{'name':'Windows 10','signatures':5,'samples':45,'history':{'first_seen':1499346211,'last_seen':1503236236}},{'name':'Windows 8','signatures':5,'samples':34,'history':{'first_seen':1500265706,'last_seen':1503244946}}],'browser':[{'name':'Chrome','signatures':5,'samples':112,'history':{'first_seen':1496772867,'last_seen':1503244946}},{'name':'Opera','signatures':3,'samples':20,'history':{'first_seen':1499281234,'last_seen':1502415113}},{'name':'QQ Browser','signatures':2,'samples':4,'history':{'first_seen':1502916160,'last_seen':1502916411}},{'name':'Amigo','signatures':2,'samples':4,'history':{'first_seen':1502936943,'last_seen':1502937076}},{'name':'Sogou Explorer','signatures':2,'samples':6,'history':{'first_seen':1502525699,'last_seen':1502745114}}],'risk':[]}}}" I was trying to use the str_replace from the stringr package, but I can't make it work with all those special characters in between. I guess I need something like str_replace(data, ""{", "{") but it's not working Any guess on how to change "{ with { for instance? Thanks!
You need to escape the special character " with a backslash. So do this:
str_replace(data, "\"{", "{")
How to log in to a website using HTTP Client in Delphi XE
I am trying to implement the HTTP Client in my project, but I can't log in to my account; I get Forbidden!. With IdHTTP it works well. What is missing or wrong in my code?
NetHTTPClient1 properties:
ConnectionTimeout = 30000
AllowCookies = True
HandleRedirects = True
UserAgent = Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36
NetHTTPRequest1 properties:
MethodString = POST
URL = https://www.instagram.com/accounts/web_create_ajax/attempt/
Code:
procedure TForm2.Button1Click(Sender: TObject);
var
  Params: TStrings;
  lHTTP: TIdHTTP;
  IdSSL: TIdSSLIOHandlerSocketOpenSSL;
  N: Integer;
  Token, email, S: string;
  Reply: TStringList;
  Cookie: TIdCookie;
begin
  lHTTP := TIdHTTP.Create(nil);
  try
    IdSSL := TIdSSLIOHandlerSocketOpenSSL.Create(lHTTP);
    IdSSL.SSLOptions.Method := sslvTLSv1;
    IdSSL.SSLOptions.Mode := sslmClient;
    lHTTP.IOHandler := IdSSL;
    lHTTP.ReadTimeout := 30000;
    lHTTP.HandleRedirects := True;
    lHTTP.Request.UserAgent := 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36';
    lHTTP.Get('https://www.instagram.com', TStream(nil));
    Cookie := lHTTP.CookieManager.CookieCollection.Cookie['csrftoken', 'www.instagram.com'];
    if Cookie <> nil then
      Token := Cookie.Value;
  finally
  end;
  try
    Params := TStringList.Create;
    Params.Add('username=' + 'myusername');
    Params.Add('password=' + 'mypassword');
    NetHTTPClient1.CustomHeaders['X-CSRFToken'] := Token;
    NetHTTPClient1.CustomHeaders['X-Instagram-AJAX'] := '1';
    NetHTTPClient1.CustomHeaders['X-Requested-With'] := 'XMLHttpRequest';
    NetHTTPClient1.CustomHeaders['Referer'] := 'https://www.instagram.com/';
    Memo1.Lines.Add(NetHTTPRequest1.Post('https://www.instagram.com/accounts/login/ajax/', Params).StatusText);
  finally
  end;
  ///login with IdHTTP /// works ///
  try
    lHTTP.Request.CustomHeaders.Values['X-CSRFToken'] := Token;
    lHTTP.Request.CustomHeaders.Values['X-Instagram-AJAX'] := '1';
    lHTTP.Request.CustomHeaders.Values['X-Requested-With'] := 'XMLHttpRequest';
    lHTTP.Request.Referer := 'https://www.instagram.com/';
    lHTTP.Request.ContentType := 'application/x-www-form-urlencoded';
    lHTTP.Request.UserAgent := 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36';
    Reply := lHTTP.Post('https://www.instagram.com/accounts/login/ajax/', Params);
    Memo1.Lines.Add(Reply);
  end;
TNetHTTPClient is buggy with HandleRedirects and POST: https://quality.embarcadero.com/browse/RSP-14671. After you log in, you receive cookies (the key, in a way), and you must send these cookies with every future request.
"TNetHTTPClient is buggy with handleRedirect and post. " This is already fix in version: 10.2 Tokyo Release 2