How can I webscrape these ticker symbols from barchart.com? - web-scraping

I am trying to use Beautiful Soup to scrape the list of ticker symbols from this page: https://www.barchart.com/options/most-active/stocks
My code returns a lot of HTML from the page, but I can't find any of the ticker symbols with Ctrl+F. I'd appreciate it if someone could let me know how I can access them!
Code:
from bs4 import BeautifulSoup as bs
import requests
headers = {'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36"}
url = "https://www.barchart.com/options/most-active/stocks"
page = requests.get(url, headers=headers)
html = page.text
soup = bs(html, 'html.parser')
print(soup.find_all())

The ticker symbols aren't in the static HTML; the page loads them from an internal API, which you can call directly once you forward the site's XSRF token:
import requests
from urllib.parse import unquote
import pandas as pd

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0",
}

def main(url):
    with requests.Session() as req:
        req.headers.update(headers)
        # Hit the site root first (url[:25]) so the XSRF-TOKEN cookie gets set
        r = req.get(url[:25])
        req.headers.update(
            {'X-XSRF-TOKEN': unquote(r.cookies.get_dict()['XSRF-TOKEN'])})
        params = {
            "list": "options.mostActive.us",
            "fields": "symbol,symbolType,symbolName,hasOptions,lastPrice,priceChange,percentChange,optionsImpliedVolatilityRank1y,optionsTotalVolume,optionsPutVolumePercent,optionsCallVolumePercent,optionsPutCallVolumeRatio,tradeTime,symbolCode",
            "orderBy": "optionsTotalVolume",
            "orderDir": "desc",
            "between(lastPrice,.10,)": "",
            "between(tradeTime,2021-08-03,2021-08-04)": "",
            "meta": "field.shortName,field.type,field.description",
            "hasOptions": "true",
            "page": "1",
            "limit": "500",
            "raw": "1"
        }
        r = req.get(url, params=params).json()
        df = pd.DataFrame(r['data']).iloc[:, :-1]
        print(df)

main('https://www.barchart.com/proxies/core-api/v1/quotes/get?')
Output:
symbol symbolType ... tradeTime symbolCode
0 AMD 1 ... 08/03/21 STK
1 AAPL 1 ... 08/03/21 STK
2 TSLA 1 ... 08/03/21 STK
3 AMC 1 ... 08/03/21 STK
4 PFE 1 ... 08/03/21 STK
.. ... ... ... ... ...
495 BTU 1 ... 08/03/21 STK
496 EVER 1 ... 08/03/21 STK
497 VRTX 1 ... 08/03/21 STK
498 MCHP 1 ... 08/03/21 STK
499 PAA 1 ... 08/03/21 STK
[500 rows x 14 columns]
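If all you need is the list of tickers, you can pull them from the symbol column of the resulting DataFrame. A minimal sketch (the small stand-in DataFrame below takes the place of the live API response):

```python
import pandas as pd

# Stand-in for the DataFrame returned by the API call above
df = pd.DataFrame({
    "symbol": ["AMD", "AAPL", "TSLA"],
    "symbolType": [1, 1, 1],
})

# The ticker symbols as a plain Python list
tickers = df["symbol"].tolist()
print(tickers)  # ['AMD', 'AAPL', 'TSLA']
```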

Related

Automatically downloading pdf's from a website with python and wget

I am trying to download all PDF files containing scanned school books from a website. I tried using wget, but it doesn't work; I suspect this is because the website is an ASP page with selection options for the course/year.
I also tried selecting a certain year/course and saving the HTML file locally, but this doesn't work either.
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen
from urllib.parse import urlparse
import wget

def get_pdfs(my_url):
    links = []
    html = urlopen(my_url).read()
    html_page = bs(html, features="lxml")
    og_url = html_page.find("meta", property="og:url")
    base = urlparse(my_url)
    print("base", base)
    for link in html_page.find_all('a'):
        current_link = link.get('href')
        if current_link.endswith('pdf'):
            if og_url:
                print("currentLink", current_link)
                links.append(og_url["content"] + current_link)
            else:
                links.append(base.scheme + "://" + base.netloc + current_link)
    for link in links:
        try:
            wget.download(link)
        except Exception:
            print(" \n \n Unable to Download A File \n")

my_url = 'https://www.svpo.nl/curriculum.asp'
get_pdfs(my_url)

my_url_local_html = r'C:\test\en_2.html'  # year 2 English books page saved locally to extract pdf links
get_pdfs(my_url_local_html)
Snippet of my_url_local_html with links to the PDFs:
<li><a target="_blank" href="https://www.ib3.nl/curriculum/engels\010 TB 2 Ch 5.pdf">Chapter 5 - Going extreme</a></li>
<li><a target="_blank" href="https://www.ib3.nl/curriculum/engels\020 TB 2 Ch 6.pdf">Chapter 6 - A matter of taste</a></li>
You need to POST the form payload, for example vak=Engels and klas_en_schoolsoort=2e klas:
import requests
from bs4 import BeautifulSoup

url = "https://www.svpo.nl/curriculum.asp"
payload = 'vak=Engels&klas_en_schoolsoort=2e klas'
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Content-Type': 'application/x-www-form-urlencoded',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
}
response = requests.post(url, data=payload, headers=headers)
for link in BeautifulSoup(response.text, "lxml").find_all('a'):
    current_link = link.get('href')
    if current_link.endswith('pdf'):
        print(current_link)
OUTPUT:
https://www.ib3.nl/curriculum/engels\010 TB 2 Ch 5.pdf
https://www.ib3.nl/curriculum/engels\020 TB 2 Ch 6.pdf
https://www.ib3.nl/curriculum/engels\030 TB 2 Ch 7.pdf
https://www.ib3.nl/curriculum/engels\040 TB 2 Ch 8.pdf
https://www.ib3.nl/curriculum/engels\050 TB 2 Ch 9.pdf
https://www.ib3.nl/curriculum/engels\060 TB 2 Reading matters.pdf
https://www.ib3.nl/curriculum/engels\080 TB 3 Ch 1.pdf
https://www.ib3.nl/curriculum/engels\090 TB 3 Ch 2.pdf
https://www.ib3.nl/curriculum/engels\100 TB 3 Ch 3.pdf
https://www.ib3.nl/curriculum/engels\110 TB 3 Ch 4.pdf
https://www.ib3.nl/curriculum/engels\120 TB 3 Ch 5.pdf
https://www.ib3.nl/curriculum/engels\130 TB 3 Ch 6.pdf
https://www.ib3.nl/curriculum/engels\140 TB 3 Ch 7.pdf
https://www.ib3.nl/curriculum/engels\150 TB 3 Ch 8.pdf
https://www.ib3.nl/curriculum/engels\160 TB 3 Reading matters.pdf
https://www.ib3.nl/curriculum/engels\170 TB 3 Grammar.pdf
https://www.ib3.nl/curriculum/engels\Grammar Survey StSt 2.pdf
https://www.ib3.nl/curriculum/engels\StSt 2 Reading Matters.pdf
https://www.ib3.nl/curriculum/engels\StSt2 Yellow Pages.pdf
https://www.ib3.nl/curriculum/engels\050 WB 2 Ch 5.pdf
https://www.ib3.nl/curriculum/engels\060 WB 2 Ch 6.pdf
https://www.ib3.nl/curriculum/engels\070 WB 2 Ch 7.pdf
https://www.ib3.nl/curriculum/engels\080 WB 2 Ch 8.pdf
https://www.ib3.nl/curriculum/engels\090 WB 2 Ch 9.pdf
https://www.ib3.nl/curriculum/engels\110 WB 3 Ch 1.pdf
https://www.ib3.nl/curriculum/engels\115 WB 3 Ch 2.pdf
https://www.ib3.nl/curriculum/engels\120 WB 3 Ch 3.pdf
https://www.ib3.nl/curriculum/engels\125 WB 3 Ch 4.pdf
https://www.ib3.nl/curriculum/engels\130 WB 3 Ch 5.pdf
https://www.ib3.nl/curriculum/engels\135 WB 3 Ch 6.pdf
https://www.ib3.nl/curriculum/engels\140 WB 3 Ch 7.pdf
https://www.ib3.nl/curriculum/engels\145 WB 3 Ch 8.pdf
UPDATE:
To save a PDF from a link (as @KJ pointed out, you need to percent-encode the spaces and flip the backslash to a forward slash):
url = r'https://www.ib3.nl/curriculum/engels\010 TB 2 Ch 5.pdf'.replace(' ', '%20').replace('\\', '/')
response = requests.get(url)
with open('somefilename.pdf', 'wb') as f:
    f.write(response.content)
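The two fixes from the update (percent-encoding spaces and flipping the backslash) can be wrapped in a small helper and applied to every extracted link. A sketch, where fix_pdf_url is a hypothetical helper name and the actual download is left commented out:

```python
def fix_pdf_url(url: str) -> str:
    # Flip the backslash to a forward slash and percent-encode the spaces
    return url.replace('\\', '/').replace(' ', '%20')

links = [
    r'https://www.ib3.nl/curriculum/engels\010 TB 2 Ch 5.pdf',
    r'https://www.ib3.nl/curriculum/engels\020 TB 2 Ch 6.pdf',
]

for link in links:
    fixed = fix_pdf_url(link)
    print(fixed)
    # response = requests.get(fixed)          # download as in the update above
    # with open('somefilename.pdf', 'wb') as f:
    #     f.write(response.content)
```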

Resource not found on NSE India even when using curl

I am scraping the NSE India results calendar site using a curl command, but even so it gives me a "Resource not Found" error. Here's my code:
import os
import time
import json
import pandas as pd

url = "https://www.nseindia.com/api/event-calendar?index=equities"
header1 = "Host:www.nseindia.com"
header2 = "User-Agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:82.0) Gecko/20100101 Firefox/82.0"
header3 = "Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"
header4 = "Accept-Language:en-US,en;q=0.5"
header5 = "Accept-Encoding:gzip, deflate, br"
header6 = "DNT:1"
header7 = "Connection:keep-alive"
header8 = "Upgrade-Insecure-Requests:1"
header9 = "Pragma:no-cache"
header10 = "Cache-Control:no-cache"

def run_curl_command(curl_command, max_attempts):
    result = os.popen(curl_command).read()
    count = 0
    while "Resource not found" in result and count < max_attempts:
        result = os.popen(curl_command).read()
        count += 1
        time.sleep(1)
    print("API Read")
    result = json.loads(result)
    result = pd.DataFrame(result)

def init():
    max_attempts = 100
    curl_command = f'curl "{url}" -H "{header1}" -H "{header2}" -H "{header3}" -H "{header4}" -H "{header5}" -H "{header6}" -H "{header7}" -H "{header8}" -H "{header9}" -H "{header10}" --compressed '
    print(f"curl_command : {curl_command}")
    run_curl_command(curl_command, max_attempts)

init()
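A likely cause of the "Resource not found" response is that the API rejects requests that arrive without the cookies the main site sets. A commonly suggested workaround, sketched here as an assumption rather than a verified fix, is to make both requests through one requests.Session so a homepage visit captures the cookies first (fetch_event_calendar is a hypothetical helper name):

```python
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:82.0) Gecko/20100101 Firefox/82.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}

def fetch_event_calendar():
    with requests.Session() as s:
        s.headers.update(headers)
        # The first request lets the site set the cookies the API endpoint expects
        s.get("https://www.nseindia.com")
        r = s.get("https://www.nseindia.com/api/event-calendar?index=equities")
        r.raise_for_status()
        return r.json()
```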

scrapy stops scraping elements that are addressed

Here are my spider code and the log I got. The problem is that the spider seems to stop scraping items somewhere in the middle of page 10 (while there are 352 pages to be scraped). When I check the XPath expressions of the remaining elements, they look the same in my browser.
Here is my spider:
# -*- coding: utf-8 -*-
import scrapy
import logging
import urllib.parse

parts = urllib.parse.urlsplit(u'http://fa.wikipedia.org/wiki/صفحهٔ_اصلی')
parts = parts._replace(path=urllib.parse.quote(parts.path.encode('utf8')))
encoded_url = parts.geturl().encode('ascii')
# 'https://fa.wikipedia.org/wiki/%D8%B5%D9%81%D8%AD%D9%87%D9%94_%D8%A7%D8%B5%D9%84%DB%8C'

class CriptolernSpider(scrapy.Spider):
    name = 'criptolern'
    allowed_domains = ['arzdigital.com']

    def start_requests(self):
        yield scrapy.Request(url='https://arzdigital.com',
                             callback=self.parse, dont_filter=True)

    def parse(self, response):
        posts = response.xpath("//a[@class='arz-last-post arz-row']")
        try:
            for post in posts:
                post_title = post.xpath(".//@title").get()
                post_link = post.xpath(".//@href").get()
                post_date = post.xpath(".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-last-post__publish-time']/time/@datetime").get()
                if post.xpath(".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-post__info-likes']/span[2]/text()"):
                    likes = int(post.xpath(".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-post__info-likes']/span[2]/text()").get())
                else:
                    likes = 0
                if post.xpath(".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-post__info-comment']/span[2]/text()"):
                    commnents = int(post.xpath(".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-post__info-comment']/span[2]/text()").get())
                else:
                    commnents = 0
                yield {
                    'post_title': post_title,
                    'post_link': post_link,
                    'post_date': post_date,
                    'likes': likes,
                    'commnents': commnents
                }
            next_page = response.xpath("//div[@class='arz-last-posts__get-more']/a[@class='arz-btn arz-btn-info arz-round arz-link-nofollow']/@href").get()
            if next_page:
                yield scrapy.Request(url=next_page, callback=self.parse, dont_filter=True)
            else:
                next_pages = response.xpath("//div[@class='arz-pagination']/ul/li[@class='arz-pagination__item arz-pagination__next']/a[@class='arz-pagination__link']/@href").get()
                if next_pages:
                    yield scrapy.Request(url=next_pages, callback=self.parse, dont_filter=True)
        except AttributeError:
            logging.error("The element didn't exist")
Here is the log, when the spider stops:
2021-12-04 11:06:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://arzdigital.com/latest-posts/page/10/>
{'post_title': 'ولادیمیر پوتین: ارزهای دیجیتال در نوع خود ارزشمند هستند', 'post_link': 'https://arzdigital.com/russias-putin-says-crypto-has-value-but-maybe-not-for-trading-oil-html/', 'post_date': '2021-10-16', 'likes': 17, 'commnents': 1}
2021-12-04 11:06:51 [scrapy.core.scraper] ERROR: Spider error processing <GET https://arzdigital.com/latest-posts/page/10/> (referer: https://arzdigital.com/latest-posts/page/9/)
Traceback (most recent call last):
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
yield next(it)
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
for x in result:
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\shima\projects\arzdigital\arzdigital\spiders\criptolern.py", line 32, in parse
likes=int(post.xpath(".//div[#class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[#class='arz-last-post__info']/div[#class='arz-post__info-likes']/span[2]/text()").get())
ValueError: invalid literal for int() with base 10: '۱,۸۵۱'
2021-12-04 11:06:51 [scrapy.core.engine] INFO: Closing spider (finished)
2021-12-04 11:06:51 [scrapy.extensions.feedexport] INFO: Stored csv feed (242 items) in: dataset.csv
2021-12-04 11:06:51 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 4112,
'downloader/request_count': 12,
'downloader/request_method_count/GET': 12,
'downloader/response_bytes': 292561,
'downloader/response_count': 12,
'downloader/response_status_count/200': 12,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 12, 4, 7, 36, 51, 830291),
'item_scraped_count': 242,
'log_count/DEBUG': 254,
'log_count/ERROR': 1,
'log_count/INFO': 10,
'request_depth_max': 10,
'response_received_count': 12,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 11,
'scheduler/dequeued/memory': 11,
'scheduler/enqueued': 11,
'scheduler/enqueued/memory': 11,
'spider_exceptions/ValueError': 1,
'start_time': datetime.datetime(2021, 12, 4, 7, 36, 47, 423017)}
2021-12-04 11:06:51 [scrapy.core.engine] INFO: Spider closed (finished)
I can't find the problem, or tell whether it is related to a wrong XPath expression. Thanks for any help!
EDIT:
So I guess it's better to see two files here.
The first is settings.py:
BOT_NAME = 'arzdigital'
SPIDER_MODULES = ['arzdigital.spiders']
NEWSPIDER_MODULE = 'arzdigital.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 10
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'arzdigital.middlewares.ArzdigitalDownloaderMiddleware': None,
    'arzdigital.middlewares.UserAgentRotatorMiddleware': 400
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 60
# The maximum download delay to be set in case of high latencies
AUTOTHROTTLE_MAX_DELAY = 120
# The average number of requests Scrapy should be sending in parallel to
# each remote server
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
HTTPCACHE_ENABLED = False
FEED_EXPORT_ENCODING='utf-8'
And the second file is middlewares.py:
from scrapy import signals
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
import random, logging

class UserAgentRotatorMiddleware(UserAgentMiddleware):
    user_agent_list = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
        'Mozilla/5.0 (Windows NT 5.1; rv:7.0.1) Gecko/2010010 1 Firefox/7.0.1',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWeb Kit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14393'
    ]

    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        try:
            self.user_agent = random.choice(self.user_agent_list)
            request.headers.setdefault('User-Agent', self.user_agent)
        except IndexError:
            logging.error("Couldn't fetch the user agent")
Your code works as you expect; the problem was in the pagination portion. I've moved the pagination into start_urls, a style of pagination that is always accurate and more than twice as fast as following a "next page" link.
Code
import scrapy
import logging

# base url = https://arzdigital.com/latest-posts/
# start_url = https://arzdigital.com/latest-posts/page/2/

class CriptolernSpider(scrapy.Spider):
    name = 'criptolern'
    allowed_domains = ['arzdigital.com']
    start_urls = [f'https://arzdigital.com/latest-posts/page/{i}/' for i in range(1, 353)]

    def parse(self, response):
        posts = response.xpath("//a[@class='arz-last-post arz-row']")
        try:
            for post in posts:
                post_title = post.xpath(".//@title").get()
                post_link = post.xpath(".//@href").get()
                post_date = post.xpath(
                    ".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-last-post__publish-time']/time/@datetime").get()
                if post.xpath(".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-post__info-likes']/span[2]/text()"):
                    likes = int(post.xpath(
                        ".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-post__info-likes']/span[2]/text()").get())
                else:
                    likes = 0
                if post.xpath(".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-post__info-comment']/span[2]/text()"):
                    commnents = int(post.xpath(
                        ".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-post__info-comment']/span[2]/text()").get())
                else:
                    commnents = 0
                yield {
                    'post_title': post_title,
                    'post_link': post_link,
                    'post_date': post_date,
                    'likes': likes,
                    'commnents': commnents
                }
        except AttributeError:
            logging.error("The element didn't exist")
Output:
2021-12-04 17:25:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://arzdigital.com/latest-posts/page/352/>
{'post_title': 'تأکید مقام رسمی سابق وزارت دفاع آمریکا مبنی بر تشویق سرمایه گذاری بر روی بلاکچین', 'post_link': 'https://arzdigital.com/blockchain-investment/', 'post_date': '2017-07-27', 'likes': 4, 'commnents': 0}
2021-12-04 17:25:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://arzdigital.com/latest-posts/page/352/>
{'post_title': 'ریسک سرمایه گذاری از طریق ICO', 'post_link': 'https://arzdigital.com/ico-risk/', 'post_date': '2017-07-27', 'likes': 9, 'commnents': 0}
2021-12-04 17:25:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://arzdigital.com/latest-posts/page/352/>
{'post_title': '\xa0ای.سی.او چیست؟', 'post_link': 'https://arzdigital.com/what-is-ico/', 'post_date': '2017-07-27', 'likes': 7, 'commnents': 7}
2021-12-04 17:25:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://arzdigital.com/latest-posts/page/352/>
{'post_title': 'چرا\xa0فراریت بیت کوین و واحدهای مشابه آن، نسبت به سایر واحدهای پولی بیش\u200cتر است؟', 'post_link': 'https://arzdigital.com/bitcoin-currency/', 'post_date': '2017-07-27', 'likes': 6, 'commnents': 0}
2021-12-04 17:25:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://arzdigital.com/latest-posts/page/352/>
{'post_title': 'اتریوم کلاسیک Ethereum Classic چیست ؟', 'post_link': 'https://arzdigital.com/what-is-ethereum-classic/', 'post_date': '2017-07-24', 'likes': 10, 'commnents': 2}
2021-12-04 17:25:19 [scrapy.core.engine] INFO: Closing spider (finished)
2021-12-04 17:25:19 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 111431,
'downloader/request_count': 353,
'downloader/request_method_count/GET': 353,
'downloader/response_bytes': 8814416,
'downloader/response_count': 353,
'downloader/response_status_count/200': 352,
'downloader/response_status_count/301': 1,
'elapsed_time_seconds': 46.29503,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 12, 4, 11, 25, 19, 124154),
'httpcompression/response_bytes': 55545528,
'httpcompression/response_count': 352,
'item_scraped_count': 7920
.. so on
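Note that the original traceback failed on the Persian-formatted count '۱,۸۵۱'. If such values appear in fields you want to keep, a small helper (parse_count is a hypothetical name, not part of the answer above) can normalize the digits before calling int():

```python
# Translation table from Persian (Extended Arabic-Indic) digits to ASCII
PERSIAN_DIGITS = str.maketrans('۰۱۲۳۴۵۶۷۸۹', '0123456789')

def parse_count(text):
    # Normalize the digits, then drop ASCII and Arabic thousands separators
    cleaned = text.translate(PERSIAN_DIGITS).replace(',', '').replace('٬', '')
    return int(cleaned)

print(parse_count('۱,۸۵۱'))  # 1851
```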
settings.py file
In settings.py, make sure you change only the portion shown uncommented below and nothing else:
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 10
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'gs_spider.middlewares.GsSpiderSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'gs_spider.middlewares.GsSpiderDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'gs_spider.pipelines.GsSpiderPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

How can I show more than 100 results per page?

I want to change the number of results on this page: https://fifatracker.net/players/ to more than 100, then export the table to Excel to make things much easier for me. I tried to scrape it with Python following a tutorial, but I can't make it work. If there is a way to extract the table from all the pages, that would also help.
As stated, it's restricted to 100 results per request. Simply iterate through the query payload on the API to get each page:
import pandas as pd
import requests

url = 'https://fifatracker.net/api/v1/players/'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36'}

page = 1
payload = {
    "pagination": {
        "per_page": "100", "page": page},
    "filters": {
        "attackingworkrate": [],
        "defensiveworkrate": [],
        "primarypositions": [],
        "otherpositions": [],
        "nationality": [],
        "order_by": "-overallrating"},
    "context": {
        "username": "guest",
        "slot": "1", "season": 1},
    "currency": "eur"}

jsonData = requests.post(url, headers=headers, json=payload).json()
current_page = jsonData['pagination']['current_page']
last_page = jsonData['pagination']['last_page']

dfs = []
for page in range(1, last_page + 1):
    if page == 1:
        pass
    else:
        payload['pagination']['page'] = page
        jsonData = requests.post(url, headers=headers, json=payload).json()
    players = pd.json_normalize(jsonData['result'])
    dfs.append(players)
    print('Page %s of %s' % (page, last_page))

df = pd.concat(dfs).reset_index(drop=True)
Output:
print(df)
slug ... info.contract.loanedto_clubname
0 lionel-messi ... NaN
1 cristiano-ronaldo ... NaN
2 robert-lewandowski ... NaN
3 neymar-jr ... NaN
4 kevin-de-bruyne ... NaN
... ... ...
19137 levi-kaye ... NaN
19138 phillip-cancar ... NaN
19139 julio-pérez ... NaN
19140 alan-mclaughlin ... NaN
19141 tatsuki-yoshitomi ... NaN
[19142 rows x 92 columns]
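Since the goal was an Excel export, the combined DataFrame can then be written out with pandas. A small sketch (the stand-in DataFrame below replaces the live API result; to_excel additionally requires the openpyxl package to be installed):

```python
import pandas as pd

# Stand-in for the concatenated DataFrame built in the loop above
df = pd.DataFrame({"slug": ["lionel-messi", "cristiano-ronaldo"],
                   "overallrating": [93, 92]})

df.to_csv("players.csv", index=False)        # no extra dependency needed
# df.to_excel("players.xlsx", index=False)   # requires the openpyxl package
```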

R -RMySQL- how to save more sql queries to file?

I do some data analysis in R. At the end of the script I want to save my results to a file. I know there are several ways to do it, but none of them works properly. When I try sink(), it works but gives me:
<MySQLResult:1,5,1>
host logname user time request_fline status
1 142.4.5.115 - - 2018-01-03 12:08:58 GET /phpmyadmin?</script><script>alert('<!--VAIBS-->');</script><script> HTTP/1.1 400
size_varchar referer agent ip_adress size_int cookie time_microsec filename
1 Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0 445 - 142.4.5.115 445 - 159 -
request_protocol keepalive request_method contents_of_foobar contents_of_notefoobar port child_id
<MySQLResult:1,5,1>
[1] host logname user time request_fline status
[7] size_varchar referer agent ip_adress size_int cookie
<0 rows> (or 0-length row.names)
which is totally unusable because I can't export that type of data. If I try write.table, it gives a file with one readable row, but after that one row the R script fails with the error Error in isOpen(file, "w") : invalid connection, and when I try write.csv the result is the same. And when I try lapply, it gives me just an empty file.
Here is my code:
fileConn <- file("outputX.txt")
fileCon2 <- file("outputX.csv")
sink("outputQuery.txt")
for (i in 1:length(awq)) {
  sql <- paste("SELECT * FROM mtable ORDER BY cookie LIMIT ", awq[i], ",1")
  nb <- dbGetQuery(mydb, sql)
  print(nb)
  write.table(nb, file = fileConn, append = TRUE, quote = FALSE, sep = " ", eol = "\n", na = "NA", row.names = FALSE, col.names = FALSE)
  write.csv(nb, file = fileCon2, row.names = FALSE, sep = " ")
  lapply(nb, write, fileConn, append = TRUE, ncolumns = 7)
  writeLines(unlist(lapply(nb, paste, collapse = " ")))
}
sink()
close(fileConn)
close(fileCon2)
I am new to R, so I don't know what else I should try. What I want is one file where the data are printed in a form that is easy to read and export. For example, like this:
142.4.5.115 - - 2018-01-03 12:08:58 GET /phpmyadmin?/><!--VAIBS--> HTTP/1.1 400 Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0 445 - 142.4.5.115 445 - 145 - HTTP/1.1 0 GET - - 80 7216 ?/><!--VAIBS--> GET /phpmyadmin?/><!--VAIBS--> HTTP/1.1 - 0 /phpmyadmin - 354 0
142.4.5.115 - - 2018-01-03 12:10:23 GET /phpmyadmin?/><!--VAIBS--> HTTP/1.1 400 Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0 445 - 142.4.5.115 445 - 145 - HTTP/1.1 0 GET - - 80 7216 ?/><!--VAIBS--> GET /phpmyadmin?/><!--VAIBS--> HTTP/1.1 - 0 /phpmyadmin - 354 0
142.4.5.115 - - 2018-01-03 12:12:41 GET /phpmyadmin?/><!--VAIBS--> HTTP/1.1 400 Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0 445 - 142.4.5.115 445 - 145 - HTTP/1.1 0 GET - - 80 7216 ?/><!--VAIBS--> GET /phpmyadmin?/><!--VAIBS--> HTTP/1.1 - 0 /phpmyadmin - 354 0
142.4.5.115 - - 2018-01-03 12:15:29 GET /phpmyadmin?/><!--VAIBS--> HTTP/1.1 400 Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0 445 - 142.4.5.115 445 - 145 - HTTP/1.1 0 GET - - 80 7216 ?/><!--VAIBS--> GET /phpmyadmin?/><!--VAIBS--> HTTP/1.1 - 0 /phpmyadmin - 354 0
or this :
host,logname,user,time, request_fline status,size_varchar,referer agent,ip_adress,size_int,cookie,time_microsec,filename,request_protocol,keepalive,request_method,contents_of_foobar,contents_of_notefoobar port child_id
1 142.4.5.115 - - 2018-01-03 12:08:58 GET /phpmyadmin?/><!--VAIBS--> HTTP/1.1 400 Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0 445 - 142.4.5.115 445 - 145 - HTTP/1.1 0 GET - - 80 7216 ?/><!--VAIBS--> GET /phpmyadmin?/><!--VAIBS--> HTTP/1.1 - 0 /phpmyadmin - 354 0
2 142.4.5.115 - - 2018-01-03 12:10:23 GET /phpmyadmin?/><!--VAIBS--> HTTP/1.1 400 Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0 445 - 142.4.5.115 445 - 145 - HTTP/1.1 0 GET - - 80 7216 ?/><!--VAIBS--> GET /phpmyadmin?/><!--VAIBS--> HTTP/1.1 - 0 /phpmyadmin - 354 0
3 142.4.5.115 - - 2018-01-03 12:12:41 GET /phpmyadmin?/><!--VAIBS--> HTTP/1.1 400 Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0 445 - 142.4.5.115 445 - 145 - HTTP/1.1 0 GET - - 80 7216 ?/><!--VAIBS--> GET /phpmyadmin?/><!--VAIBS--> HTTP/1.1 - 0 /phpmyadmin - 354 0
4 142.4.5.115 - - 2018-01-03 12:15:29 GET /phpmyadmin?/><!--VAIBS--> HTTP/1.1 400 Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0 445 - 142.4.5.115 445 - 145 - HTTP/1.1 0 GET - - 80 7216 ?/><!--VAIBS--> GET /phpmyadmin?/><!--VAIBS--> HTTP/1.1 - 0 /phpmyadmin - 354 0
or something similar. Best of all would be some help on how to call write.table in a loop without the error, but I will welcome any working solution. The best I have is:
sql <- paste("SELECT * FROM idsaccess ORDER BY cookie LIMIT ", awq[1], ",1")
nb <- dbGetQuery(mydb, sql)
write.table(nb, file = fileConn, append = TRUE, quote = FALSE, sep = " ", eol = "\n", na = "NA", row.names = FALSE, col.names = FALSE)
fileConn <- file("outputX1.txt")
sql <- paste("SELECT * FROM idsaccess ORDER BY cookie LIMIT ", awq[2], ",1")
nb <- dbGetQuery(mydb, sql)
write.table(nb, file = fileConn, append = TRUE, quote = FALSE, sep = " ", eol = "\n", na = "NA", row.names = FALSE, col.names = FALSE)
But this writes every query to its own file, and I don't want every query in its own file. Any help?
Simply concatenate all the query data frames into one large data frame, since they all share the same structure, then write to file in a single call. That is the typical way to use write.table or its wrapper write.csv.
Specifically, turn this for loop:
for (i in 1:length(awq)) {
  sql <- paste("SELECT * FROM mtable ORDER BY cookie LIMIT ", awq[i], ",1")
  nb <- dbGetQuery(mydb, sql)
}
into an lapply that returns a list of data frames:
df_list <- lapply(1:length(awq), function(i) {
  sql <- paste0("SELECT * FROM mtable ORDER BY cookie LIMIT ", awq[i], ",1")
  nb <- dbGetQuery(mydb, sql)
})
Then row-bind with do.call to stack all the data frames into a single one and write it out:
final_df <- do.call(rbind, df_list)
write.table(final_df, file = "outputX.txt", append = FALSE, quote = FALSE, sep = " ",
            eol = "\n", na = "NA", row.names = FALSE, col.names = FALSE)