Stuck scraping the same 2nd page with infinite scroll

I'm trying to scrape game reviews from Steam.
When I run the spider below, I get the first page with 10 reviews,
and then the same second page of 10 reviews three times.
import logging

import scrapy
from bs4 import BeautifulSoup


class MySpider(scrapy.Spider):
    name = "MySpider"
    download_delay = 6
    page_number = 1
    start_urls = (
        'https://steamcommunity.com/app/1794680/reviews/',
    )
    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        'LOG_ENABLED': False,
        'LOG_FILE': 'logging.txt',
        'LOG_FILE_APPEND': False,
        'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
        'FEEDS': {"items.json": {"format": "json", 'overwrite': True}},
    }

    def parse(self, response):
        soup = BeautifulSoup(response.text, 'lxml')
        for review in soup.find_all('div', class_="apphub_UserReviewCardContent"):
            {...}
        if self.page_number < 4:
            self.page_number += 1
            yield scrapy.Request('https://steamcommunity.com/app/1794680/homecontent/?userreviewscursor=AoIIPwYYanu12fcD&userreviewsoffset={offset}&p={p}&workshopitemspage={p}&readytouseitemspage={p}&mtxitemspage={p}&itemspage={p}&screenshotspage={p}&videospage={p}&artpage={p}&allguidepage={p}&webguidepage={p}&integratedguidepage={p}&discussionspage={p}&numperpage=10&browsefilter=trendweek&browsefilter=trendweek&l=english&appHubSubSection=10&filterLanguage=default&searchText=&maxInappropriateScore=100'.format(offset=10*(self.page_number-1), p=self.page_number), method='GET', callback=self.parse)
JSON output
I captured a few of the requests made while scrolling through the reviews.
I replaced every value that looked like a page number with {p},
and I also tried changing 'userreviewsoffset' to fit the request format.
I noticed that 'userreviewscursor' has a different value on every request, but I don't know where it comes from.

Your issue is with the userreviewscursor=AoIIPwYYanu12fcD part of the URL. That bit changes on every call, and you can find it in the HTML response under:
<input type="hidden" name="userreviewscursor" value="AoIIPwYYanLi8vYD">
Grab that value and add it to the next call, and you're all right. (I didn't want to babysit you and write the full code, but if need be, let me know.)
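For reference, a minimal sketch of that idea as a replacement parse method for the spider above (the query string is trimmed to the parameters that look relevant, which is an assumption; keep the full set from the question if Steam requires them):
    def parse(self, response):
        soup = BeautifulSoup(response.text, 'lxml')
        for review in soup.find_all('div', class_="apphub_UserReviewCardContent"):
            {...}  # extract the review fields as in the question

        # pull the cursor for the next page out of the current response
        cursor = soup.find('input', attrs={'name': 'userreviewscursor'})
        if cursor is not None and self.page_number < 4:
            self.page_number += 1
            next_url = (
                'https://steamcommunity.com/app/1794680/homecontent/'
                '?userreviewscursor={cursor}'
                '&userreviewsoffset={offset}'
                '&p={p}'
                '&numperpage=10&browsefilter=trendweek&l=english'
                '&appHubSubSection=10&filterLanguage=default'
                '&searchText=&maxInappropriateScore=100'
            ).format(cursor=cursor['value'],
                     offset=10 * (self.page_number - 1),
                     p=self.page_number)
            yield scrapy.Request(next_url, callback=self.parse)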

Related

"requests.get"/beautifulSoup returns different result each call on the same URL

There's a website where users can add products, and I want to scrape the products a given user (me) has added to the page.
I have the following so far:
from bs4 import BeautifulSoup
from fake_headers import Headers
import requests

def get_page(link):
    # Get the product-links, retrying with freshly generated headers
    headers = Headers(headers=True)
    n_tries = 0
    max_n_tries = 5
    is_valid = False
    while (not is_valid) and (n_tries < max_n_tries):
        try:
            head = headers.generate()
            r = requests.get(link, headers=head, timeout=10)
            is_valid = r.status_code == 200  # Try 5 different headers. If no timeout and no 200 -> invalid url
            n_tries += 1
        except requests.exceptions.Timeout:
            n_tries += 1
    if n_tries == max_n_tries:
        return 404
    page = r.text
    return page
link = "https://xn--nskeskyen-k8a.dk/share/Jakob_Daller"
page = get_page(link)
soup = BeautifulSoup(page,"lxml")
#Get saved items
results = soup.find_all('a')
products = [x['href'] for x in results if x.text.strip() == 'LINK']
With the same link, products sometimes returns two items and sometimes one. When it returns two, they aren't always the same items either (there are three items on the link at the moment). After a while, it returns all three items every time. This happens each time I delete/add an item on the page.
Note that if I inspect the page in my browser, I can always see all the items.
The same happens if I just use page = requests.get(link).text with no headers.
Since I cannot inspect the entire page body, I don't know whether it's due to BeautifulSoup or to the body returned by requests.
You don't need bs4 for this. There's an API you can get the data from.
Try this:
import requests
response = requests.get("https://api.xn--nskeskyen-k8a.dk/api/share/Jakob_Daller")
for wish in response.json()["wishes"]:
    print(f"{wish['title']}\n{wish['trackingUrl']}")
Output:
Molo CANDI - Jerseykjoler - red/rød - Zalando.dk
https://xn--nskeskyen-k8a.dk/api/redirect/wish/38317697
Molo FLORIE - Overall / Jumpsuit /Buksedragter - multi-coloured/flerfarvet - Zalando.dk
https://xn--nskeskyen-k8a.dk/api/redirect/wish/38317401

Duplication in data while scraping data using Scrapy

I am using Scrapy to scrape data from a website, where I want to scrape graphics card titles, prices, and whether they are in stock or not. The problem is my code is looping twice, so instead of 10 products I am getting 20.
import scrapy

class ThespiderSpider(scrapy.Spider):
    name = 'Thespider'
    start_urls = ['https://www.czone.com.pk/graphic-cards-pakistan-ppt.154.aspx?page=2']

    def parse(self, response):
        data = {}
        cards = response.css('div.row')
        for card in cards:
            for c in card.css('div.product'):
                data['Title'] = c.css('h4 a::text').getall()
                data['Price'] = c.css('div.price span::text').getall()
                data['Stock'] = c.css('div.product-stock span.product-data::text').getall()
                yield data
You're doing a nested for loop when one isn't necessary.
Each card can be captured by the CSS selector response.css('div.product')
Code Example
def parse(self, response):
    cards = response.css('div.product')
    for card in cards:
        data = {}  # build a fresh dict for each product
        data['Title'] = card.css('h4 a::text').getall()
        data['Price'] = card.css('div.price span::text').getall()
        data['Stock'] = card.css('div.product-stock span.product-data::text').getall()
        yield data
Additional Information
Use get() instead of getall(). The output you get is a list, you'll probably want a string which is what get() gives you.
If you're thinking about multiple pages, an items dictionary (a Scrapy Item) may be better than yielding a plain dictionary. Invariably there will be something you need to alter, and an Item gives you more flexibility to do this.
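For illustration, a rough sketch of what an Item-based version could look like (the item class and field names here are invented for the example, not taken from the original answer):
import scrapy

class GraphicsCardItem(scrapy.Item):
    # hypothetical item class; field names are illustrative only
    Title = scrapy.Field()
    Price = scrapy.Field()
    Stock = scrapy.Field()

class ThespiderSpider(scrapy.Spider):
    name = 'Thespider'
    start_urls = ['https://www.czone.com.pk/graphic-cards-pakistan-ppt.154.aspx?page=2']

    def parse(self, response):
        for card in response.css('div.product'):
            item = GraphicsCardItem()
            item['Title'] = card.css('h4 a::text').get()
            item['Price'] = card.css('div.price span::text').get()
            item['Stock'] = card.css('div.product-stock span.product-data::text').get()
            yield item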

Duplicates in asp.net pagination when scraping?

I am scraping this ASP.NET site, and since the request URL stays the same, Scrapy's dupefilter does not work. As a result I am getting tons of duplicated URLs, which puts my spider into an infinite run. How can I deal with it?
My code looks like this:
if '1' in page:
    target = response.xpath("//a[@class = 'dtgNormalPage']").extract()[1:]
    for i in target:
        i = i.split("'")[1]
        i = i.replace('$', ':')
        yield FormRequest.from_response(response, url, callback=self.pages, dont_filter=True,
                                        formdata={'__EVENTTARGET': i,
                                                  })
I tried to add a set to keep track of page numbers but have no clue how to deal with '...' which leads to the next 10 pages.
if '1' in page:
    target = response.xpath("//a[@class = 'dtgNormalPage']")
    for i in target[1:]:
        page = i.xpath("./text()").extract_first()
        if page in self.pages_seen:
            pass
        else:
            self.pages_seen.add(page)
            i = i.xpath("./@href").extract_first()
            i = i.split("'")[1]
            i = i.replace('$', ':')
            yield FormRequest.from_response(response, url, callback=self.pages, dont_filter=True,
                                            formdata={'__EVENTTARGET': i,
                                                      })
self.pages_seen.remove('[ ... ]')
The more threads I set, the more duplicates I receive.
So it looks like the only solution so far is to reduce the thread count to 3 or less.
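(If "threads" here refers to Scrapy's request concurrency, a minimal sketch of capping it via settings; the values are only an assumption:)
custom_settings = {
    'CONCURRENT_REQUESTS': 3,             # overall cap on parallel requests
    'CONCURRENT_REQUESTS_PER_DOMAIN': 3,  # per-domain cap
}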
I'm not certain I understand you correctly, but ASP.NET usually relies a lot on cookies for delivering content. So when crawling ASP.NET websites you want to use the cookiejar feature of Scrapy:
class MySpider(Spider):
    name = 'cookiejar_asp'

    def start_requests(self):
        for i, url in enumerate(self.start_urls):
            yield Request(url, meta={'cookiejar': i})

    def parse(self, response):
        # Keep in mind that the cookiejar meta key is not "sticky". You need to
        # keep passing it along on subsequent requests. For example:
        return Request(
            "http://www.example.com/otherpage",
            callback=self.parse_other_page,
            meta={'cookiejar': response.meta['cookiejar']},  # <--- carry over cookiejar
        )
Read more about cookiejars here:
https://doc.scrapy.org/en/latest/topics/downloader-middleware.html?highlight=cookiejar#multiple-cookie-sessions-per-spider
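Separately, if the duplicates come from re-submitting the same pagination targets, one possible sketch (an assumption, not part of the original answer) is to dedupe on the __EVENTTARGET value itself rather than the visible page text, so the '...' links are handled like any other target:
def parse(self, response):
    for link in response.xpath("//a[@class='dtgNormalPage']")[1:]:
        href = link.xpath("./@href").extract_first()
        target = href.split("'")[1].replace('$', ':')
        if target in self.targets_seen:   # assumes self.targets_seen = set() in __init__
            continue
        self.targets_seen.add(target)
        yield FormRequest.from_response(
            response,
            callback=self.pages,
            dont_filter=True,
            formdata={'__EVENTTARGET': target},
        )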

Scrapy: Scraping nested links

I am new to Scrapy and web scraping, so please bear with me. I am trying to scrape profilecanada.com. When I run the code below, no errors are given, but I think it is still not scraping anything. In my code, I start on a page that contains a list of links. Each link leads to a page with another list of links, and each of those leads to the page with the data I need to extract and save into a JSON file. In general it's something like "nested link scraping"; I don't know what it is actually called. Please see the image below for the result when I ran the spider. Thank you in advance for your help.
import scrapy

class ProfilecanadaSpider(scrapy.Spider):
    name = 'profilecanada'
    allowed_domains = ['http://www.profilecanada.com']
    start_urls = ['http://www.profilecanada.com/browse_by_category.cfm/']

    def parse(self, response):
        # urls in from start_url
        category_list_urls = response.css('div.div_category_list > div.div_category_list_column > ul > li.li_category > a::attr(href)').extract()
        # start_u = 'http://www.profilecanada.com/browse_by_category.cfm/'

        # for each category of company
        for url in category_list_urls:
            url = url[3:]
            url = response.urljoin(url)
            return scrapy.Request(url=url, callback=self.profileCategoryPages)

    def profileCategoryPages(self, response):
        company_list_url = response.css('div.dv_en_block_name_frame > a::attr(href)').extract()

        # for each company in the list
        for url in company_list_url:
            url = response.urljoin(url)
            return scrapy.Request(url=url, callback=self.companyDetails)

    def companyDetails(self, response):
        return {
            'company_name': response.css('span#name_frame::text').extract_first(),
            'street_address': str(response.css('span#frame_addr::text').extract_first()),
            'city': str(response.css('span#frame_city::text').extract_first()),
            'region_or_province': str(response.css('span#frame_province::text').extract_first()),
            'postal_code': str(response.css('span#frame_postal::text').extract_first()),
            'country': str(response.css('div.type6_GM > div > div::text')[-1].extract())[2:],
            'phone_number': str(response.css('span#frame_phone::text').extract_first()),
            'fax_number': str(response.css('span#frame_fax::text').extract_first()),
            'email': str(response.css('span#frame_email::text').extract_first()),
            'website': str(response.css('span#frame_website > a::attr(href)').extract_first()),
        }
[Image: the result in cmd when the spider was run]
You should change allowed_domains to allowed_domains = ['profilecanada.com'] and all the return scrapy.Request calls to yield scrapy.Request, and it will start working. Keep in mind that obeying robots.txt is not always enough; you should also throttle your requests if necessary.
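Applied to the spider above, the first callback would then look roughly like this (only the two changed lines matter):
class ProfilecanadaSpider(scrapy.Spider):
    name = 'profilecanada'
    allowed_domains = ['profilecanada.com']   # no scheme, just the domain
    start_urls = ['http://www.profilecanada.com/browse_by_category.cfm/']

    def parse(self, response):
        category_list_urls = response.css(
            'div.div_category_list > div.div_category_list_column > ul > li.li_category > a::attr(href)').extract()
        for url in category_list_urls:
            url = response.urljoin(url[3:])
            # yield instead of return, so every category link is followed
            yield scrapy.Request(url=url, callback=self.profileCategoryPages)
The same return-to-yield change applies in profileCategoryPages.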

How to scrape all of the data from the website?

My code is only giving me data for 44 links instead of 102. Can someone tell me why it is extracting like that? How can I extract it properly? I would appreciate your help.
import scrapy

class ProjectItem(scrapy.Item):
    title = scrapy.Field()
    owned = scrapy.Field()
    Revenue2014 = scrapy.Field()
    Revenue2015 = scrapy.Field()
    Website = scrapy.Field()
    Rank = scrapy.Field()
    Employees = scrapy.Field()
    headquarters = scrapy.Field()
    FoundedYear = scrapy.Field()

class ProjectSpider(scrapy.Spider):
    name = "cin100"
    allowed_domains = ['cincinnati.com']
    start_urls = ['http://www.cincinnati.com/story/money/2016/11/26/see-which-companies-16-deloitte-100/94441104/']

    def parse(self, response):
        # get selector for all 100 companies
        sel_companies = response.xpath('//p[contains(.,"click or tap here.")]/following-sibling::p/a')

        # create request for every single company detail page from href
        for sel_companie in sel_companies:
            href = sel_companie.xpath('./@href').extract_first()
            url = response.urljoin(href)
            request = scrapy.Request(url, callback=self.parse_company_detail)
            yield request

    def parse_company_detail(self, response):
        # On detail page create item
        item = ProjectItem()
        # get detail information with specific XPath statements
        # e.g. title is the first paragraph
        item['title'] = response.xpath('//div[@role="main"]/p[1]//text()').extract_first().rsplit('-')[1]
        # e.g. family owned has a label we can select
        item['owned'] = response.xpath('//div[@role="main"]/p[contains(.,"Family owned")]/text()').extract_first()
        item['Revenue2014'] = '$' + response.xpath('//div[@role="main"]/p[contains(.,"2014")]/text()').extract_first().rsplit('$')[1]
        item['Revenue2015'] = '$' + response.xpath('//div[@role="main"]/p[contains(.,"$")]/text()').extract_first().rsplit('$')[1]
        item['Website'] = response.xpath('//div[@role="main"]/p/a[contains(.,"www.")]/@href').extract_first()
        item['Rank'] = response.xpath('//div[@role="main"]/p[contains(.,"rank")]/text()').extract_first()
        item['Employees'] = response.xpath('//div[@role="main"]/p[contains(.,"Employ")]/text()').extract_first()
        item['headquarters'] = response.xpath('//div[@role="main"]/p[10]//text()').extract()
        item['FoundedYear'] = response.xpath('//div[@role="main"]/p[contains(.,"founded")]/text()').extract()

        # Finally: yield the item
        yield item
Looking closer at the output of Scrapy, you'll find that after a few dozen requests they start getting redirected, as shown below:
DEBUG: Redirecting (302) to <GET http://www.cincinnati.com/get-access/?return=http%3A%2F%2Fwww.cincinnati.com%2Fstory%2Fmoney%2F2016%2F11%2F27%2Ffrischs-restaurants%2F94430718%2F> from <GET http://www.cincinnati.com/story/money/2016/11/27/frischs-restaurants/94430718/>
The page that gets requested says: "We hope you have enjoyed your complimentary access."
So it looks like they offer only limited access to anonymous users. You probably need to register with their service to get full access to the data.
There are a few potential problems with your XPaths:
it's usually a bad idea to make XPaths look for text that's on a page. Text can change from one minute to the next; the layout and HTML structure is much more long-lived.
using 'following-sibling' is also a last-resort XPath feature that is quite vulnerable to slight changes on the website.
What I would be doing instead:
# iterate all paragraphs within the article:
for para in response.xpath("//*[@itemprop='articleBody']/p"):
    url = para.xpath("./a/@href").extract()
    # ... etc
len(response.xpath("//*[@itemprop='articleBody']/p")) gives me the expected 102, by the way.
You might have to filter the URLs to remove non-company URLs like the one labeled with "click or tap here".
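Putting that together, a rough sketch of the parse callback (the exact filter condition is an assumption):
def parse(self, response):
    # iterate all paragraphs within the article body
    for para in response.xpath("//*[@itemprop='articleBody']/p"):
        href = para.xpath("./a/@href").extract_first()
        link_text = para.xpath("./a/text()").extract_first() or ''
        # skip paragraphs without a link and the "click or tap here" navigation link
        if not href or 'click or tap here' in link_text.lower():
            continue
        yield scrapy.Request(response.urljoin(href), callback=self.parse_company_detail)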
