I am scraping this ASP.NET site, and since the request URL is always the same, Scrapy's dupefilter does not work. As a result I am getting tons of duplicated URLs, which puts my spider into an infinite run. How can I deal with it?
My code looks like this:
if '1' in page:
    target = response.xpath("//a[@class='dtgNormalPage']").extract()[1:]
    for i in target:
        i = i.split("'")[1]
        i = i.replace('$', ':')
        yield FormRequest.from_response(response, callback=self.pages, dont_filter=True,
                                        formdata={'__EVENTTARGET': i})
I tried adding a set to keep track of page numbers, but I have no clue how to deal with the '...' link, which leads to the next 10 pages.
if '1' in page:
    target = response.xpath("//a[@class='dtgNormalPage']")
    for i in target[1:]:
        page = i.xpath("./text()").extract_first()
        if page in self.pages_seen:
            pass
        else:
            self.pages_seen.add(page)
            i = i.xpath("./@href").extract_first()
            i = i.split("'")[1]
            i = i.replace('$', ':')
            yield FormRequest.from_response(response, callback=self.pages, dont_filter=True,
                                            formdata={'__EVENTTARGET': i})
    self.pages_seen.remove('[ ... ]')
The more threads I set, the more duplicates I receive.
So it looks like the only solution so far is to reduce the thread count to 3 or less.
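Not from the original post, just a minimal sketch of one way the '...' problem could be sidestepped: deduplicate on the extracted __EVENTTARGET value instead of the visible page text, on the assumption that every pager link (including the '...' ones) posts back its own distinct target. targets_seen is a hypothetical set created in the spider's __init__.
def pages(self, response):
    # Sketch only: assumes each pager link carries a distinct __EVENTTARGET.
    for href in response.xpath("//a[@class='dtgNormalPage']/@href").extract()[1:]:
        event_target = href.split("'")[1].replace('$', ':')
        if event_target in self.targets_seen:      # hypothetical set from __init__
            continue
        self.targets_seen.add(event_target)
        yield FormRequest.from_response(
            response,
            callback=self.pages,
            dont_filter=True,
            formdata={'__EVENTTARGET': event_target},
        )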
I'm not certain whether I understand you correctly, but ASP.NET usually relies a lot on cookies for delivering content. So when crawling ASP.NET websites you want to use the cookiejar feature of Scrapy:
class MySpider(Spider):
    name = 'cookiejar_asp'

    def start_requests(self):
        for i, url in enumerate(self.start_urls):
            yield Request(url, meta={'cookiejar': i})

    def parse(self, response):
        # Keep in mind that the cookiejar meta key is not "sticky": you need to
        # keep passing it along on subsequent requests. For example:
        return Request(
            "http://www.example.com/otherpage",
            callback=self.parse_other_page,
            meta={'cookiejar': response.meta['cookiejar']},  # <--- carry over cookiejar
        )
Read more about cookiejars here:
https://doc.scrapy.org/en/latest/topics/downloader-middleware.html?highlight=cookiejar#multiple-cookie-sessions-per-spider
Related
I'm trying to scrape game reviews from Steam.
When running the spider below, I get the first page with 10 reviews, then the second page with 10 reviews three times.
import logging

import scrapy
from bs4 import BeautifulSoup


class MySpider(scrapy.Spider):
    name = "MySpider"
    download_delay = 6
    page_number = 1
    start_urls = (
        'https://steamcommunity.com/app/1794680/reviews/',
    )
    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        'LOG_ENABLED': False,
        'LOG_FILE': 'logging.txt',
        'LOG_FILE_APPEND': False,
        'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
        'FEEDS': {"items.json": {"format": "json", 'overwrite': True}},
    }

    def parse(self, response):
        soup = BeautifulSoup(response.text, 'lxml')
        for review in soup.find_all('div', class_="apphub_UserReviewCardContent"):
            {...}
        if self.page_number < 4:
            self.page_number += 1
            yield scrapy.Request('https://steamcommunity.com/app/1794680/homecontent/?userreviewscursor=AoIIPwYYanu12fcD&userreviewsoffset={offset}&p={p}&workshopitemspage={p}&readytouseitemspage={p}&mtxitemspage={p}&itemspage={p}&screenshotspage={p}&videospage={p}&artpage={p}&allguidepage={p}&webguidepage={p}&integratedguidepage={p}&discussionspage={p}&numperpage=10&browsefilter=trendweek&browsefilter=trendweek&l=english&appHubSubSection=10&filterLanguage=default&searchText=&maxInappropriateScore=100'.format(offset=10*(self.page_number-1), p=self.page_number), method='GET', callback=self.parse)
I captured a few requests while scrolling through the reviews and looked at the JSON output.
I changed all values that looked like a page number and replaced them with {p}, and I also tried changing 'userreviewsoffset' to fit the request format.
I noticed that 'userreviewscursor' has a different value on every request, but I don't know where it comes from.
Your issue is with the userreviewscursor=AoIIPwYYanu12fcD part of the URL. That bit changes on every call, and you can find it in the HTML response under:
<input type="hidden" name="userreviewscursor" value="AoIIPwYYanLi8vYD">
Get that value and add it to the next call, and you're all right. (I didn't want to babysit you and write out the full code, but if need be, let me know.)
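For illustration only, a rough sketch of how that could look inside the spider's parse callback; the shortened query string and the cursor handling are assumptions layered on top of the code in the question.
def parse(self, response):
    # ... process the 10 reviews on this page ...

    # The hidden input holds the cursor for the *next* page of reviews.
    cursor = response.css('input[name="userreviewscursor"]::attr(value)').get()
    if cursor and self.page_number < 4:
        self.page_number += 1
        next_url = (
            'https://steamcommunity.com/app/1794680/homecontent/'
            '?userreviewscursor={cursor}&userreviewsoffset={offset}&p={p}'
            '&numperpage=10&browsefilter=trendweek&l=english'
            '&appHubSubSection=10&filterLanguage=default'
            '&searchText=&maxInappropriateScore=100'
        ).format(cursor=cursor, offset=10 * (self.page_number - 1), p=self.page_number)
        yield scrapy.Request(next_url, callback=self.parse)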
Hopefully a quick question: I'm trying to connect to the KuCoin API. That part is not super relevant, as I think this is more an issue with how I'm using the POST function and how it sends JSON along.
Here is my function that is supposed to place an order:
API.Order <- function(pair, buysell, price, size) {
  path = "/api/v1/orders"
  now = as.integer(Sys.time()) * 1000
  json <- list(
    clientOid = as.character(now),
    side = buysell,
    symbol = pair,
    type = "limit",
    price = price,
    size = size
  )
  json = toJSON(json, auto_unbox = TRUE)
  str_to_sign = paste0(as.character(now), 'POST', path, json)
  signature = as.character(base64Encode(hmac(api_secret, str_to_sign, "sha256", raw = TRUE)))
  passphrase = as.character(base64Encode(hmac(api_secret, api_passphrase, "sha256", raw = TRUE)))
  response = content(POST(url = url,
                          path = path,
                          body = json,
                          encode = "json",
                          config = add_headers("KC-API-SIGN" = signature,
                                               "KC-API-TIMESTAMP" = as.character(now),
                                               "KC-API-KEY" = api_key,
                                               "KC-API-PASSPHRASE" = passphrase,
                                               "KC-API-KEY-VERSION" = "2")),
                     "text", encoding = "UTF-8")
  response
  data.table(fromJSON(response)$data)
}
API.Order(pair,"sell",1.42,1.0)
And everything works, except I get the following response:
"{\"code\":\"415000\",\"msg\":\"Unsupported Media Type\"}"
Which is puzzling to me. Everything else checks out (the signature and other auth headers), and I set encode to "json" in the POST. I can also set it to the standard "application/json", and neither works. I've been staring at this for hours now and I can't see what (likely very small) thing I got wrong?
Thanks
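For comparison, here is the same signing scheme the R function above implements, sketched in Python. This mirrors the R code rather than being an official client; the base URL and credentials are placeholders, and the Content-Type header is set explicitly because the body is passed as an already-serialized string.
import base64
import hashlib
import hmac
import json
import time

import requests

api_key = "..."          # placeholder
api_secret = "..."       # placeholder
api_passphrase = "..."   # placeholder

def place_order(pair, buysell, price, size):
    path = "/api/v1/orders"
    now = str(int(time.time() * 1000))
    body = json.dumps({
        "clientOid": now,
        "side": buysell,
        "symbol": pair,
        "type": "limit",
        "price": price,
        "size": size,
    })
    # Sign timestamp + method + path + body with HMAC-SHA256, base64-encoded,
    # exactly as in the R function above (KC-API-KEY-VERSION 2).
    str_to_sign = now + "POST" + path + body
    signature = base64.b64encode(
        hmac.new(api_secret.encode(), str_to_sign.encode(), hashlib.sha256).digest()
    ).decode()
    passphrase = base64.b64encode(
        hmac.new(api_secret.encode(), api_passphrase.encode(), hashlib.sha256).digest()
    ).decode()
    headers = {
        "KC-API-SIGN": signature,
        "KC-API-TIMESTAMP": now,
        "KC-API-KEY": api_key,
        "KC-API-PASSPHRASE": passphrase,
        "KC-API-KEY-VERSION": "2",
        "Content-Type": "application/json",  # sent explicitly alongside the raw JSON body
    }
    # Base URL is an assumption; the R snippet keeps it in an external `url` variable.
    return requests.post("https://api.kucoin.com" + path, data=body, headers=headers).json()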
There's a website where users can add products, and I want to scrape the products a given user (me) has added to the page.
I have the following so far:
from fake_headers import Headers
from bs4 import BeautifulSoup
import requests


def get_page(link):
    # Get the product links
    headers = Headers(headers=True)
    n_tries = 0
    max_n_tries = 5
    is_valid = False
    while (not is_valid) & (n_tries < max_n_tries):
        try:
            head = headers.generate()
            r = requests.get(link, headers=head, timeout=10)
            is_valid = r.status_code == 200  # Try 5 different headers. If no timeout and no 200 -> invalid url
            n_tries += 1
        except TimeoutError:
            n_tries += 1
    if n_tries == max_n_tries:
        return 404
    page = r.text
    return page


link = "https://xn--nskeskyen-k8a.dk/share/Jakob_Daller"
page = get_page(link)
soup = BeautifulSoup(page, "lxml")

# Get saved items
results = soup.find_all('a')
products = [x['href'] for x in results if x.text.strip() == 'LINK']
With the same link, products sometimes returns two items, sometimes one. When it returns two, it isn't always the same two items either (there are three items on the link at the moment). After a while it returns all three items every time. This happens each time I delete or add an item on the page.
Note: if I inspect the page in my browser, I can see all the items all the time.
The same happens if I just use page = requests.get(link).text with no headers.
Since I cannot inspect the entire page body, I don't know whether it's due to BeautifulSoup or to the body returned by requests.
You don't need bs4 for this. There's an API you can get the data from.
Try this:
import requests

response = requests.get("https://api.xn--nskeskyen-k8a.dk/api/share/Jakob_Daller")

for wish in response.json()["wishes"]:
    print(f"{wish['title']}\n{wish['trackingUrl']}")
Output:
Molo CANDI - Jerseykjoler - red/rød - Zalando.dk
https://xn--nskeskyen-k8a.dk/api/redirect/wish/38317697
Molo FLORIE - Overall / Jumpsuit /Buksedragter - multi-coloured/flerfarvet - Zalando.dk
https://xn--nskeskyen-k8a.dk/api/redirect/wish/38317401
I've collected 48,000 URLs and put them in a list. My goal is to extract 8 pieces of data from each page using BeautifulSoup, appending each data point to its own list.
Before I ran the for loop below, I tested the extraction on 3 URLs from the list and it worked perfectly fine.
I know the code works, but I am questioning the amount of time it is taking to scrape all 48,000 pages, since my code has been running for a day and a half already. This makes me wonder whether I wrote the code inefficiently.
Can someone please review my code and provide any suggestions or ideas on how to make it run faster?
Thanks in advance!
title_list = []
price_list = []
descrip_list = []
grape_variety_list = []
region_list = []
region_list2 = []
wine_state_list = []
wine_country_list = []
with requests.Session() as session:
    for link in grape_review_links_list:
        response2 = session.get(link, headers=headers)
        wine_html = response2.text
        soup2 = BeautifulSoup(wine_html, 'html.parser')

        wine_title = soup2.find('span', class_='rating').findNext('h1').text
        title_list.append(wine_title)

        wine_price = soup2.find(text='Buy Now').findPrevious('span').text.split(',')[0]
        price_list.append(wine_price)

        wine_descrip = soup2.find('p', class_='description').find(text=True, recursive=False)
        descrip_list.append(wine_descrip)

        wine_grape = soup2.find(text='Buy Now').findNext('a').text
        grape_variety_list.append(wine_grape)

        wine_region = soup2.find(text='Appellation').findNext('a').text
        region_list.append(wine_region)

        wine_region2 = soup2.find(text='Appellation').findNext('a').findNext('a').text
        region_list2.append(wine_region2)

        wine_state = soup2.find(text='Appellation').findNext('a').findNext('a').findNext('a').text
        wine_state_list.append(wine_state)

        wine_country = soup2.find(text='Appellation').findNext('a').findNext('a').findNext('a').findNext('a').text
        wine_country_list.append(wine_country)
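For reference, one common way to cut the wall-clock time here is to fetch and parse the pages concurrently instead of one at a time. Below is a rough sketch using concurrent.futures: scrape_one is a hypothetical helper, headers and grape_review_links_list come from the code above, and only two of the eight fields are shown.
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor


def scrape_one(link):
    # Hypothetical helper: download one page and pull out the fields,
    # reusing the same BeautifulSoup lookups as the loop above.
    html = requests.get(link, headers=headers, timeout=30).text
    soup = BeautifulSoup(html, 'html.parser')
    return {
        'title': soup.find('span', class_='rating').findNext('h1').text,
        'price': soup.find(text='Buy Now').findPrevious('span').text.split(',')[0],
        # ... the remaining six fields exactly as in the original loop ...
    }


# Fetch pages in parallel; tune max_workers to whatever the site tolerates.
with ThreadPoolExecutor(max_workers=16) as pool:
    rows = list(pool.map(scrape_one, grape_review_links_list))
Most of the elapsed time is network latency rather than parsing, which is why overlapping the requests helps far more than micro-optimising the BeautifulSoup calls.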
To the best of my current knowledge, I have written a little web spider/crawler that can crawl recursively with a variable nesting depth and is also capable of doing an optional POST/GET pre-login before crawling (if required).
As I am a complete beginner, I would like to get some feedback, improvements, or whatever you throw at this.
I am only adding the parser function here. The whole source can be viewed at GitHub: https://github.com/cytopia/crawlpy
What I really want to make sure is that the recursion in combination with yield is as efficient as possible and that I am also doing it in the right way.
Any comments on that and on the coding style are very much welcome.
def parse(self, response):
    """
    Scrapy parse callback
    """
    # Get current nesting level
    if 'depth' in response.meta:
        curr_depth = response.meta['depth']
    else:
        curr_depth = 1

    # Only crawl the current page if we hit a HTTP-200
    if response.status == 200:
        hxs = Selector(response)
        links = hxs.xpath("//a/@href").extract()

        # We store already crawled links in this list
        crawled_links = []

        # Pattern to check proper link
        linkPattern = re.compile("^(?:http|https):\/\/(?:[\w\.\-\+]+:{0,1}[\w\.\-\+]*@)?(?:[a-z0-9\-\.]+)(?::[0-9]+)?(?:\/|\/(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+)|\?(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+))?$")

        for link in links:
            # Link could be a relative url from response.url
            # such as link: '../test', response.url: http://dom.tld/foo/bar
            if link.find('../') == 0:
                link = response.url + '/' + link
            # Prepend BASE URL if it does not have it
            elif 'http://' not in link and 'https://' not in link:
                link = self.base_url + link

            # If it is a proper link and is not checked yet, yield it to the Spider
            if (link
                    and linkPattern.match(link)
                    and link.find(self.base_url) == 0):
                    #and link not in crawled_links
                    #and link not in uniques):

                # Check if this url already exists
                re_exists = re.compile('^' + link + '$')
                exists = False
                for i in self.uniques:
                    if re_exists.match(i):
                        exists = True
                        break

                if not exists:
                    # Store the link
                    crawled_links.append(link)
                    self.uniques.append(link)

                    # Do we recurse?
                    if curr_depth < self.depth:
                        request = Request(link, self.parse)
                        # Add meta-data about the current recursion depth
                        request.meta['depth'] = curr_depth + 1
                        yield request
                    else:
                        # Nesting level too deep
                        pass
            else:
                # Link not in condition
                pass

        #
        # Final return (yield) to user
        #
        for url in crawled_links:
            #print "FINAL FINAL FINAL URL: " + response.url
            item = CrawlpyItem()
            item['url'] = url
            item['depth'] = curr_depth
            yield item

        #print "FINAL FINAL FINAL URL: " + response.url
        #item = CrawlpyItem()
        #item['url'] = response.url
        #yield item
    else:
        # NOT HTTP 200
        pass
Your whole code could be shortened to something like:
from scrapy import Request
from scrapy.linkextractors import LinkExtractor


def parse(self, response):
    # Get current nesting level
    curr_depth = response.meta.get('depth', 1)

    item = CrawlpyItem()  # could also just be `item = dict()`
    item['url'] = response.url
    item['depth'] = curr_depth
    yield item

    links = LinkExtractor().extract_links(response)
    for link in links:
        yield Request(link.url, meta={'depth': curr_depth + 1})
If I understand correctly, what you want to do here is broad-crawl all URLs and yield depth and URL as items, right?
Scrapy already has a dupe filter enabled by default, so you don't need to implement that logic yourself. Also, your parse() method will never receive anything but a 200 response, so that check is useless.
Edit: rework to avoid dupes.
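Not part of the original answer, but worth noting as a sketch: Scrapy's built-in DepthMiddleware already records the recursion depth in request.meta['depth'] (0 for start requests), and the DEPTH_LIMIT setting caps how deep the crawl goes, so even the manual depth bookkeeping above could arguably be dropped. A minimal sketch, with the domain and start URL as placeholders:
from scrapy import Spider
from scrapy.linkextractors import LinkExtractor


class CrawlpySpider(Spider):
    name = 'crawlpy'
    start_urls = ['http://example.com/']          # placeholder
    custom_settings = {'DEPTH_LIMIT': 3}          # DepthMiddleware drops requests deeper than this

    def parse(self, response):
        # DepthMiddleware fills in meta['depth'] automatically (0 for start requests)
        yield {'url': response.url, 'depth': response.meta.get('depth', 0)}

        for link in LinkExtractor(allow_domains=['example.com']).extract_links(response):
            yield response.follow(link, callback=self.parse)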