Duplication in data while scraping data using Scrapy

I am using Scrapy to scrape data from a website: for each graphics card I want the title, the price, and whether it is in stock. The problem is that my code loops over the products twice, so instead of 10 products I get 20.
import scrapy
class ThespiderSpider(scrapy.Spider):
    name = 'Thespider'
    start_urls = ['https://www.czone.com.pk/graphic-cards-pakistan-ppt.154.aspx?page=2']
    def parse(self, response):
        data = {}
        cards = response.css('div.row')
        for card in cards:
            for c in card.css('div.product'):
                data['Title'] = c.css('h4 a::text').getall()
                data['Price'] = c.css('div.price span::text').getall()
                data['Stock'] = c.css('div.product-stock span.product-data::text').getall()
                yield data

You're doing a nested for loop when one isn't necessary.
Each card can be captured by the CSS selector response.css('div.product')
Code Example
def parse(self, response):
    # div.product matches each product card directly, so a single loop is enough
    cards = response.css('div.product')
    for card in cards:
        # build a fresh dict per card so yielded items do not share state
        data = {}
        data['Title'] = card.css('h4 a::text').getall()
        data['Price'] = card.css('div.price span::text').getall()
        data['Stock'] = card.css('div.product-stock span.product-data::text').getall()
        yield data
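To illustrate why the nested loop can double the results, here is a small sketch using only the standard library. It assumes (the question does not confirm this) that the page nests one div.row inside another, so a descendant search from each row finds the same products again. The HTML snippet and the by_class helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up markup: an outer div.row wrapping an inner div.row (an assumption).
HTML = """
<body>
  <div class="row">
    <div class="row">
      <div class="product"><h4><a>Card A</a></h4></div>
      <div class="product"><h4><a>Card B</a></h4></div>
    </div>
  </div>
</body>
"""


def by_class(node, cls):
    """Return <div> elements at or below node whose class attribute equals cls."""
    return [d for d in node.iter("div") if d.get("class") == cls]


root = ET.fromstring(HTML)

# Nested loop: every row is searched for products, so both rows report them.
nested = [
    product.find("h4/a").text
    for row in by_class(root, "row")
    for product in by_class(row, "product")
]

# Single loop: each product is matched exactly once.
flat = [product.find("h4/a").text for product in by_class(root, "product")]

print(nested)
print(flat)
```

If the real page is structured differently, the duplication may have another cause, but selecting div.product directly avoids it either way.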
Additional Information
Use get() instead of getall(). getall() returns a list of every match, whereas get() returns the first match as a single string, which is probably what you want for a single title or price.
If you're thinking about multiple pages, an items dictionary (a Scrapy Item) may be better than yielding a plain dictionary. Sooner or later there will be a field you need to alter, and an Item gives you more flexibility to do this.

Related

stuck scraping the same 2nd page with infinite scroll

I'm trying to scrape game reviews from Steam.
When I run the spider below, I get the first page with 10 reviews,
then the second page with 10 reviews three times.
import logging
import scrapy
from bs4 import BeautifulSoup
class MySpider(scrapy.Spider):
    name = "MySpider"
    download_delay = 6
    page_number = 1
    start_urls = (
        'https://steamcommunity.com/app/1794680/reviews/',
    )
    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        'LOG_ENABLED': False,
        'LOG_FILE': 'logging.txt',
        'LOG_FILE_APPEND': False,
        'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
        'FEEDS': {"items.json": {"format": "json", 'overwrite': True},},
    }
    def parse(self, response):
        soup = BeautifulSoup(response.text, 'lxml')
        for review in soup.find_all('div', class_="apphub_UserReviewCardContent"):
            {...}
        if(self.page_number<4):
            self.page_number +=1
            yield scrapy.Request('https://steamcommunity.com/app/1794680/homecontent/?userreviewscursor=AoIIPwYYanu12fcD&userreviewsoffset={offset}&p={p}&workshopitemspage={p}&readytouseitemspage={p}&mtxitemspage={p}&itemspage={p}&screenshotspage={p}&videospage={p}&artpage={p}&allguidepage={p}&webguidepage={p}&integratedguidepage={p}&discussionspage={p}&numperpage=10&browsefilter=trendweek&browsefilter=trendweek&l=english&appHubSubSection=10&filterLanguage=default&searchText=&maxInappropriateScore=100'.format(offset=10*(self.page_number-1) ,p=self.page_number),method='GET', callback=self.parse)
json output
I captured a few requests while scrolling through the reviews.
I replaced every value that looked like a page number with {p},
and I also tried changing 'userreviewsoffset' to fit the request format.
I noticed that 'userreviewscursor' has a different value in every request, but I don't know where it comes from.

Your issue is with the userreviewscursor=AoIIPwYYanu12fcD part of the URL. That value changes for every call, and you can find it in the HTML response under:
<input type="hidden" name="userreviewscursor" value="AoIIPwYYanLi8vYD">
Get that value, add it to the next call, and you're all set. (I didn't want to babysit you by writing the full code, but if need be, let me know.)
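For what it's worth, here is a rough standard-library sketch of that step: pull the hidden input's value out of a response body and build the next URL from it. The helper names (CursorParser, extract_cursor, next_page_url) are hypothetical, and the query string is deliberately trimmed to a handful of the parameters from the question, so treat it as an outline rather than a drop-in replacement for the full request.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class CursorParser(HTMLParser):
    """Remember the value of <input name="userreviewscursor">, if present."""

    def __init__(self):
        super().__init__()
        self.cursor = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "input" and attributes.get("name") == "userreviewscursor":
            self.cursor = attributes.get("value")


def extract_cursor(html):
    """Return the cursor found in the page, or None if there is none."""
    parser = CursorParser()
    parser.feed(html)
    return parser.cursor


def next_page_url(html, page, per_page=10):
    """Build the next request URL from the cursor in the current response."""
    cursor = extract_cursor(html)
    if cursor is None:
        return None  # no cursor: assume there are no further pages
    params = {
        "userreviewscursor": cursor,
        "userreviewsoffset": per_page * (page - 1),
        "p": page,
        "numperpage": per_page,
        "browsefilter": "trendweek",
        "l": "english",
    }
    base = "https://steamcommunity.com/app/1794680/homecontent/?"
    return base + urlencode(params)


sample = '<input type="hidden" name="userreviewscursor" value="AoIIPwYYanLi8vYD">'
print(next_page_url(sample, page=2))
```

In the spider, the same idea would mean reading the cursor from response.text inside parse and passing it into the next scrapy.Request instead of the hard-coded value.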

Get structured output with Scrapy

I'm just starting to use Scrapy and this is one of my first few projects. I am trying to scrape some company metadata from https://www.baincapitalprivateequity.com/portfolio/ . I have figured out my selectors, but I'm unable to structure the output: currently everything ends up in one cell, whereas I want one row per company. If someone could point out where I'm going wrong, that would be great.
import scrapy
from ..items import BainpeItem
class BainPeSpider(scrapy.Spider):
    name = 'Bain-PE'
    allowed_domains = ['baincapitalprivateequity.com']
    start_urls = ['https://www.baincapitalprivateequity.com/portfolio/']
    def parse(self, response):
        items = BainpeItem()
        all_cos = response.css('div.grid')
        for i in all_cos:
            company = i.css('ul li::text').extract()
            about = i.css('div.companyDetail p').extract()
            items['company'] = company
            items['about'] = about
        yield items

You can just yield each item in the for loop:
for i in all_cos:
    item = BainpeItem()
    company = i.css('ul li::text').extract()
    about = i.css('div.companyDetail p').extract()
    item['company'] = company
    item['about'] = about
    yield item
This way each item will arrive in the pipeline separately.
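To see why the placement of the yield matters, here is a sketch in plain Python with no Scrapy dependency. The company records and function names are made up for illustration. The first generator mimics what the answer implies about the question's code (one shared item, yielded once after the loop); the second creates and yields a fresh item on every iteration.

```python
# Hypothetical stand-in for the blocks matched by the selectors.
BLOCKS = [
    {"company": "Acme", "about": "Makes anvils"},
    {"company": "Globex", "about": "Makes everything"},
]


def parse_once(blocks):
    """One shared item, yielded a single time after the loop."""
    item = {}
    for block in blocks:
        item["company"] = block["company"]
        item["about"] = block["about"]
    yield item  # only the values from the last block survive


def parse_each(blocks):
    """A fresh item per block, yielded inside the loop."""
    for block in blocks:
        item = {"company": block["company"], "about": block["about"]}
        yield item


rows_once = list(parse_once(BLOCKS))
rows_each = list(parse_each(BLOCKS))
print(len(rows_once), len(rows_each))
```

Note that this only helps if the selector matches one node per company; if div.grid turns out to be a single container for the whole page, the loop would need to iterate over the per-company elements inside it.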

How to scrape options from dropdown list and store them in table?

I am trying to build an interactive dashboard for analysing car listings. I would like the user to be able to pick a car brand, for example BMW or Audi, and, based on this choice, to be able to pick only from that brand's models. My problem is that after selecting a brand I am not able to scrape the models that belong to it. The pages I am scraping from:
main page --> https://www.otomoto.pl/osobowe/
sub car brand page example --> https://www.otomoto.pl/osobowe/audi/
I have tried to scrape every option, so that later on I can clean the data and keep only the models.
code:
otomoto_models <- paste0("https://www.otomoto.pl/osobowe/", "audi/")
models <- read_html(otomoto_models) %>%
  html_nodes("option") %>%
  html_text()
But this only scrapes the brands, along with the other options available on the page (engine type etc.), even though I can clearly see the model types when I inspect the element.
otomoto <- "https://www.otomoto.pl/osobowe/"
brands <- read_html(otomoto) %>%
  html_nodes("option") %>%
  html_text()
brands <- data.frame(brands)
for (i in 1:nrow(brands)){
  no_marka_pojazdu <- i
  if(brands[i,1] == "Marka pojazdu"){
    break
  }
}
no_marka_pojazdu <- no_marka_pojazdu + 1
for (i in 1:nrow(brands)){
  zuk <- i
  if(substr(brands[i,1],1,3) == "Żuk"){
    break
  }
}
Modele_pojazdow <- as.character(brands[no_marka_pojazdu:zuk,1])
Modele_pojazdow <- removeNumbers(Modele_pojazdow)
Modele_pojazdow <- substr(Modele_pojazdow,1,nchar(Modele_pojazdow)-2)
Modele_pojazdow <- data.frame(Modele_pojazdow)
The code above only picks the supported car brands from the webpage and stores them in a data frame. With that I am able to build the URL for one selected brand.
I would like to have an object similar to "Modele_pojazdow", but containing only the models of the previously selected car brand.
The dropdown list with models appears as a white box with the text "Model pojazdu", to the right of the "Audi" box.

Some may frown on the solution language being Python, but the aim was to give some pointers (the high-level process). I haven't written R in a long time, so Python was quicker.
EDIT: R script now added
General outline:
The first dropdown options can be grabbed from the value attribute of each node returned by using a css selector of #param571 option. This uses an id selector (#) to target the parent dropdown select element, and then option type selector in descendant combination, to specify the option tag elements within. The html to apply this selector combination to can be retrieved by an xhr request to the url you initially provided. You want a nodeList returned to iterate over; akin to applying selector with js document.querySelectorAll.
The page uses ajax POST requests to update the second dropdown based on your first dropdown choice. Your first dropdown choice determines the value of a parameter search[filter_enum_make], which is used in the POST request to the server. The subsequent response contains a list of the available options (it includes some case alternatives which can be trimmed out).
I captured the POST request by using fiddler. This showed me the request headers and params in the request body. Screenshot sample shown at end.
The simplest way to extract the options from the response text, IMO, is to regex the appropriate string out (I wouldn't normally recommend regex for working with html but in this case it serves us nicely). If you don't want to use regex, you can grab the relevant info from the data-facets attribute of the element with id body-container. For the non-regex version you need to handle unquoted nulls, and retrieve the inner dictionary whose key is filter_enum_model. I show a function re-write, at the end, to handle this.
The retrieved string is a string representation of a dictionary. This needs converting to an actual dictionary object which you can then extract the option values from. Edit: As R doesn't have a dictionary object a similar structure needs to be found. I will look at this when converting.
I create a user-defined function, getOptions(), to return the options for each make. Each car make value comes from the list of possible items in the first dropdown. I loop over those possible values, use the function to return a list of options for each make, and add those lists as values to a dictionary, results, whose keys are the car makes. Again, for R an object with similar functionality to a Python dictionary needs to be found.
That dictionary of lists needs converting to a dataframe which includes a transpose operation to make a tidy output of headers, which are the car makes, and columns underneath each header, which contain the associated models.
The whole thing can be written to csv at the end.
So, hopefully that gives you an idea of one way to achieve what you want. Perhaps someone else can use this to help write you a solution.
Python demonstration of this below:
import requests
from bs4 import BeautifulSoup as bs
import re
import ast
import pandas as pd
headers = {
    'User-Agent' : 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'
}
def getOptions(make): #function to return options based on make
    data = {
        'search[filter_enum_make]': make,
        'search[dist]' : '5',
        'search[category_id]' : '29'
    }
    r = requests.post('https://www.otomoto.pl/ajax/search/list/', data = data, headers = headers)
    try:
        # verify the regex here: https://regex101.com/r/emvqXs/1
        data = re.search(r'"filter_enum_model":(.*),"new_used"', r.text ,flags=re.DOTALL).group(1) #regex to extract the string containing the models associated with the car make filter
        aDict = ast.literal_eval(data) #convert string representation of dictionary to python dictionary
        d = len({k.lower(): v for k, v in aDict.items()}.keys()) #find length of unique keys when accounting for case
        dirtyList = list(aDict)[:d] #trim to unique values
        cleanedList = [item for item in dirtyList if item != 'other' ] #remove 'other' as doesn't appear in dropdown
    except:
        cleanedList = [] # sometimes there are no associated values in 2nd dropdown
    return cleanedList
r = requests.get('https://www.otomoto.pl/osobowe/')
soup = bs(r.content, 'lxml')
values = [item['value'] for item in soup.select('#param571 option') if item['value'] != '']
results = {}
# build a dictionary of lists to hold options for each make
for value in values:
    results[value] = getOptions(value) #function call to return options based on make
# turn into a dataframe and transpose so each column header is the make and the options are listed below
df = pd.DataFrame.from_dict(results,orient='index').transpose()
#write to csv
df.to_csv(r'C:\Users\User\Desktop\Data.csv', sep=',', encoding='utf-8-sig',index = False )
Sample of csv output: (screenshot not reproduced)
Example as sample json for alfa-romeo: (screenshot not reproduced)
Example of regex match for alfa-romeo:
{"145":1,"146":1,"147":218,"155":1,"156":118,"159":559,"164":2,"166":39,"33":1,"Alfasud":2,"Brera":34,"Crosswagon":2,"GT":89,"GTV":7,"Giulia":251,"Giulietta":378,"Mito":224,"Spider":24,"Sportwagon":2,"Stelvio":242,"alfasud":2,"brera":34,"crosswagon":2,"giulia":251,"giulietta":378,"gt":89,"gtv":7,"mito":224,"spider":24,"sportwagon":2,"stelvio":242}
Example of the filter option list returned from function call with make parameter value alfa-romeo:
['145', '146', '147', '155', '156', '159', '164', '166', '33', 'Alfasud', 'Brera', 'Crosswagon', 'GT', 'GTV', 'Giulia', 'Giulietta', 'Mito', 'Spider', 'Sportwagon', 'Stelvio']
Sample of fiddler request: (screenshot not reproduced)
Sample of ajax response html containing options:
<section id="body-container" class="om-offers-list"
data-facets='{"offer_seek":{"offer":2198},"private_business":{"business":1326,"private":872,"all":2198},"categories":{"29":2198,"161":953,"163":953},"categoriesParent":[],"filter_enum_model":{"145":1,"146":1,"147":219,"155":1,"156":116,"159":561,"164":2,"166":37,"33":1,"Alfasud":2,"Brera":34,"Crosswagon":2,"GT":88,"GTV":7,"Giulia":251,"Giulietta":380,"Mito":226,"Spider":25,"Sportwagon":2,"Stelvio":242,"alfasud":2,"brera":34,"crosswagon":2,"giulia":251,"giulietta":380,"gt":88,"gtv":7,"mito":226,"spider":25,"sportwagon":2,"stelvio":242},"new_used":{"new":371,"used":1827,"all":2198},"sellout":null}'
data-showfacets=""
data-pagetitle="Alfa Romeo samochody osobowe - otomoto.pl"
data-ajaxurl="https://www.otomoto.pl/osobowe/alfa-romeo/?search%5Bbrand_program_id%5D%5B0%5D=&search%5Bcountry%5D="
data-searchid=""
data-keys=''
data-vars=""
Alternative version of function without regex:
from bs4 import BeautifulSoup as bs
def getOptions(make): #function to return options based on make
    data = {
        'search[filter_enum_make]': make,
        'search[dist]' : '5',
        'search[category_id]' : '29'
    }
    r = requests.post('https://www.otomoto.pl/ajax/search/list/', data = data, headers = headers)
    soup = bs(r.content, 'lxml')
    data = soup.select_one('#body-container')['data-facets'].replace('null','"null"')
    aDict = ast.literal_eval(data)['filter_enum_model'] #convert string representation of dictionary to python dictionary
    d = len({k.lower(): v for k, v in aDict.items()}.keys()) #find length of unique keys when accounting for case
    dirtyList = list(aDict)[:d] #trim to unique values
    cleanedList = [item for item in dirtyList if item != 'other' ] #remove 'other' as doesn't appear in dropdown
    return cleanedList
print(getOptions('alfa-romeo'))
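As a possible simplification of the non-regex version: the data-facets sample above looks like valid JSON, so json.loads may be able to parse it directly, mapping the unquoted null to None without the replace step. The sketch below assumes that format holds for every make. models_from_facets is a hypothetical helper, and the facets string is a shortened version of the sample above with an "other" key added for illustration. It drops case duplicates by keeping the first spelling seen rather than relying on key order.

```python
import json


def models_from_facets(facets_json):
    """Return model names from a data-facets JSON string, without case duplicates."""
    facets = json.loads(facets_json)  # JSON null becomes None here
    models = facets.get("filter_enum_model") or {}
    seen = set()
    cleaned = []
    for name in models:
        key = name.lower()
        if key == "other" or key in seen:
            continue  # skip 'other' and lower-case repeats of earlier keys
        seen.add(key)
        cleaned.append(name)
    return cleaned


# Shortened, partly made-up sample in the same shape as the response above.
sample = (
    '{"filter_enum_model":{"145":1,"Brera":34,"GT":89,'
    '"brera":34,"gt":89,"other":3},"sellout":null}'
)
print(models_from_facets(sample))
```

In getOptions, the string passed in would be the data-facets attribute of the #body-container element.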
R conversion and improved python:
Whilst converting to R I found a better way of extracting the parameters from a js file on the server. If you open dev tools you can see the file listed in the sources tab.
R (To be improved):
library(httr)
library(jsonlite)
url <- 'https://www.otomoto.pl/ajax/jsdata/params/'
r <- GET(url)
contents <- content(r, "text")
data <- strsplit(contents, "var searchConditions = ")[[1]][2]
data <- strsplit(as.character(data), ";var searchCondition")[[1]][1]
source <- fromJSON(data)$values$'573'$'571'
makes <- names(source)
for(make in makes){
  print(make)
  print(source[make][[1]]$value)
  #break
}
Python:
import requests
import json
import pandas as pd
r = requests.get('https://www.otomoto.pl/ajax/jsdata/params/')
data = r.text.split('var searchConditions = ')[1]
data = data.split(';var searchCondition')[0]
items = json.loads(data)
source = items['values']['573']['571']
makes = [item for item in source]
results = {}
for make in makes:
    df = pd.DataFrame(source[make]) ## build a dictionary of lists to hold options for each make
    results[make] = list(df['value'])
dfFinal = pd.DataFrame.from_dict(results,orient='index').transpose() # turn into a dataframe and transpose so each column header is the make and the options are listed below
mask = dfFinal.applymap(lambda x: x is None) #tidy up None values to empty strings https://stackoverflow.com/a/31295814/6241235
cols = dfFinal.columns[(mask).any()]
for col in dfFinal[cols]:
    dfFinal.loc[mask[col], col] = ''
print(dfFinal)
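If pandas is not available, the same "one column per make, padded with empty strings" layout can be sketched with the standard library. This is an alternative to the DataFrame approach above, not what the answer uses; to_csv_text and the sample results dictionary are made up for illustration.

```python
import csv
import io
from itertools import zip_longest


def to_csv_text(results):
    """Render {make: [models...]} as CSV text with one column per make."""
    makes = list(results)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(makes)  # header row: the car makes
    # zip_longest pads the shorter columns with empty strings
    for row in zip_longest(*(results[make] for make in makes), fillvalue=""):
        writer.writerow(row)
    return buffer.getvalue()


results = {"audi": ["a3", "a4"], "bmw": ["x5"]}
print(to_csv_text(results))
```

To write a file instead, the same writer can be pointed at a file opened with newline="".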

Scrapy Data Table extract

I am trying to scrape "https://www.expireddomains.net/deleted-com-domains/"
for the expired domain data list.
I always get empty item fields for the following
class ExpiredSpider(BaseSpider):
    name = "expired"
    allowed_domains = ["example.com"]
    start_urls = ['https://www.expireddomains.net/deleted-com-domains/']
    def parse(self, response):
        log.msg('parse(%s)' % response.url, level = log.DEBUG)
        rows = response.xpath('//table[@class="base1"]/tbody/tr')
        for row in rows:
            item = DomainItem()
            item['domain'] = row.xpath('td[1]/text()').extract()
            item['bl'] = row.xpath('td[2]/text()').extract()
            yield item
Can somebody point out what is wrong? Thanks.

As a first note, you should use scrapy.Spider instead of BaseSpider, which is deprecated.
Secondly, the .extract() method returns a list rather than a single element.
This is how the item extraction should look:
item['domain'] = row.xpath('td[1]/text()').extract_first()
item['bl'] = row.xpath('td[2]/text()').extract_first()
Also, you should use Python's built-in logging library:
import logging
logging.debug("parse("+response.url+")")
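As a small variation on the logging call above, the standard library also lets you pass the URL as an argument, so the message is only formatted if the record is actually handled. The sketch below uses a made-up logger name and an in-memory handler purely so the result can be inspected; in a real spider you would most likely just call self.logger.debug(...), which Scrapy spiders provide.

```python
import logging


class ListHandler(logging.Handler):
    """Collect formatted log messages in a list (for demonstration only)."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


logger = logging.getLogger("expired_spider_demo")  # hypothetical logger name
logger.setLevel(logging.DEBUG)
handler = ListHandler()
logger.addHandler(handler)


def log_parse(url):
    # %s-style arguments are interpolated lazily by the logging module
    logger.debug("parse(%s)", url)


log_parse("https://www.expireddomains.net/deleted-com-domains/")
print(handler.messages)
```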

What's wrong with my filter query to figure out if a key is a member of a list(db.key) property?

I'm having trouble retrieving a filtered list from google app engine datastore (using python for server side). My data entity is defined as the following
class Course_Table(db.Model):
    course_name = db.StringProperty(required=True, indexed=True)
    ....
    head_tags_1=db.ListProperty(db.Key)
So the head_tags_1 property is a list of keys (which are the keys to a different entity called Headings_1).
In the handler below I iterate over my Course_Table entities to find the courses that have a particular Headings_1 key as a member of the head_tags_1 property. However, the query doesn't seem to retrieve anything, even though I know there is data that satisfies the request: the log messages below never appear when I iterate over the query results. Any ideas what I'm doing wrong?
def get(self,level_num,h_key):
path = []
if level_num == "1":
q = Course_Table.all().filter("head_tags_1 =", h_key)
for each in q:
logging.info('going through courses with this heading name')
logging.info("course name filtered is %s ", each.course_name)
MANY MANY THANK YOUS

I assume h_key is the key of a Headings_1 entity. Since head_tags_1 is a list, I believe what you need is the IN operator. https://developers.google.com/appengine/docs/python/datastore/queries
Note: your indentation inside the for loop does not seem correct.
My bad: apparently '=' on a list property already checks membership. Using = to check membership works for me, so can you make sure h_key is really a datastore Key object?
Here is my example: the first get produces a result, whereas the second one does not.
import webapp2
from google.appengine.ext import db
class Greeting(db.Model):
    author = db.StringProperty()
    x = db.ListProperty(db.Key)
class C(db.Model):
    name = db.StringProperty()
class MainPage(webapp2.RequestHandler):
    def get(self):
        ckey = db.Key.from_path('C', 'abc')
        dkey = db.Key.from_path('C', 'def')
        ekey = db.Key.from_path('C', 'ghi')
        Greeting(author='xxx', x=[ckey, dkey]).put()
        x = Greeting.all().filter('x =',ckey).get()
        self.response.write(x and x.author or 'None')
        x = Greeting.all().filter('x =',ekey).get()
        self.response.write(x and x.author or 'None')
app = webapp2.WSGIApplication([('/', MainPage)],
                              debug=True)
