How to convert from urllib to scrapy? - web-scraping
I am scraping a list of Amazon product URLs. Each link points to a book page that shows the book's price, so the idea is to collect every link together with the price found on that page.
I wrote the code below using urllib. However, when I run it over the 230 links I want to scrape, I get an HTTP 308 response status code. I searched and found out that urllib doesn't yet support 308 codes, and I think that Scrapy would.
Here is my urllib code:
import pandas as pd
import json
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup as bs
import requests
from pprint import pprint
import ast
from time import sleep

url = "https://api.nytimes.com/svc/books/v3/lists/full-overview.json?api-key=mykey"
data = requests.get(url).text
data = json.loads(data)
best_sellers_history = []
for index in range(0, len(data['results']['lists'])):
    for book in range(0, len(data['results']['lists'][index]['books'])):
        amazon_product_url = data['results']['lists'][index]['books'][book]['amazon_product_url']
        pprint(amazon_product_url)
        req = Request(amazon_product_url, headers=ast.literal_eval("{'User-Agent':'Mozilla/5.0'}"))
        page = urlopen(req)
        soup = bs(page, 'html.parser')
        price = soup.find('span', {'class': 'a-size-base a-color-price a-color-price'})
        if price:
            price = price.get_text(strip=True).replace('$', '')
        else:
            price = "None"
        print(price)
        sleep(2)
I tried and failed to convert this into scrapy. Could anyone help me to convert this into scrapy?
Here is a Scrapy version that gets the Amazon book links from the NY Times API and scrapes each book page.
Steps
$ scrapy startproject amazon
$ cd amazon
$ scrapy genspider ny-times https://www.nytimes.com/
This creates the following files:
D:\temp\amazon>tree /F
Folder PATH listing for volume DATA
Volume serial number is 16D6-338C
D:.
│   scrapy.cfg
│
└───amazon
    │   items.py
    │   middlewares.py
    │   pipelines.py
    │   settings.py
    │   __init__.py
    │
    ├───spiders
    │   │   ny_times.py
    │   │   __init__.py
    │   │
    │   └───__pycache__
    │           __init__.cpython-310.pyc
    │
    └───__pycache__
            settings.cpython-310.pyc
            __init__.cpython-310.pyc
Of these files, we only need to touch two: items.py and ny_times.py.
Overwrite both files with the code below.
ny_times.py
import scrapy
import json
from amazon.items import AmazonItem

class NyTimesSpider(scrapy.Spider):
    name = 'ny-times'
    start_urls = ['https://api.nytimes.com/svc/books/v3/lists/full-overview.json?api-key=******** your API-KEY **********']

    def start_requests(self):
        # Fetch the NY Times JSON first; its response is handled by url_parse
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                callback = self.url_parse
            )

    def url_parse(self, response):
        # Collect the Amazon link of every book in every list
        books = []
        jsonresponse = json.loads(response.text)
        for lists in jsonresponse['results']['lists']:
            for book in lists['books']:
                books.append({
                    'amazon_product_url' : book['amazon_product_url']
                })
        # Request each Amazon page; its response is handled by parse
        for book in books:
            yield scrapy.Request(
                book['amazon_product_url'],
                callback = self.parse
            )

    def parse(self, response):
        title = response.xpath("//span[@id='productTitle']//text()").get().strip()
        price = response.xpath("//span[@class='a-size-base a-color-price a-color-price']//text()").get().strip()
        loader = AmazonItem() # Here you create a new item each iteration
        loader['title'] = title
        loader['url'] = response.request.url
        loader['price'] = price
        yield loader
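The nested loop in url_parse can be checked without running a crawl. Below is a minimal sketch, assuming the API payload has the results / lists / books shape that the spider reads. The function name extract_amazon_urls and the sample payload are made up for illustration and are not part of the answer's code.

```python
def extract_amazon_urls(payload):
    """Collect every amazon_product_url from a full-overview style payload."""
    urls = []
    for book_list in payload.get('results', {}).get('lists', []):
        for book in book_list.get('books', []):
            url = book.get('amazon_product_url')
            if url:  # skip books that carry no product link
                urls.append(url)
    return urls


# Hypothetical sample data, shaped like the fields the spider uses
sample = {
    'results': {
        'lists': [
            {'books': [{'amazon_product_url': 'https://www.amazon.com/dp/0000000001'},
                       {'amazon_product_url': ''}]},
            {'books': [{'amazon_product_url': 'https://www.amazon.com/dp/0000000002'}]},
        ]
    }
}
print(extract_amazon_urls(sample))
```

Unlike the spider, this sketch skips books with an empty or missing link rather than raising a KeyError. That is a design choice of the sketch, not behaviour of the original code.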
items.py
import scrapy

class AmazonItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    url = scrapy.Field()
    price = scrapy.Field()
Run the spider and save the output to result.json.
If the steps above succeeded, you will see the crawl log in the terminal, and result.json will contain the scraped items.
$ scrapy crawl ny-times -O result.json
...
[
{"title": "Demon Copperhead: A Novel", "url": "https://www.amazon.com/Demon-Copperhead-Novel-Barbara-Kingsolver/dp/0063251922", "price": "$19.82"},
{"title": "Friends, Lovers, and the Big Terrible Thing: A Memoir", "url": "https://www.amazon.com/Friends-Lovers-Big-Terrible-Thing/dp/1250866448", "price": "$14.00"},
{"title": "Triple Cross: The Greatest Alex Cross Thriller Since Kiss the Girls (An Alex Cross Thriller, 28)", "url": "https://www.amazon.com/Triple-Cross-Alex-Thriller-28/dp/0316499188", "price": "$14.00"},
{"title": "The Light We Carry: Overcoming in Uncertain Times", "url": "https://www.amazon.com/the-light-we-carry/dp/0593237463", "price": "$16.89"},
{"title": "I'm Glad My Mom Died", "url": "https://www.amazon.com/Im-Glad-My-Mom-Died/dp/1982185821", "price": "$17.28"},
{"title": "The Seven Husbands of Evelyn Hugo: A Novel", "url": "https://www.amazon.com/Seven-Husbands-Evelyn-Hugo-Novel/dp/1501161938", "price": "$9.42"},
{"title": "Where the Crawdads Sing", "url": "https://www.amazon.com/Where-Crawdads-Sing-Delia-Owens/dp/0735219095", "price": "$14.63"},
{"title": "The Silent Patient", "url": "https://www.amazon.com/Silent-Patient-Alex-Michaelides/dp/1250301696", "price": "$12.99"},
{"title": "Fairy Tale", "url": "https://www.amazon.com/Fairy-Tale-Stephen-King/dp/1668002175", "price": "$16.25"},
{"title": "November 9: A Novel", "url": "https://www.amazon.com/November-9-Novel-Colleen-Hoover-ebook/dp/B00UDCI1S8", "price": "$10.99"},
{"title": "The Boys from Biloxi: A Legal Thriller", "url": "https://www.amazon.com/Boys-Biloxi-Legal-Thriller/dp/0385548923", "price": "$14.00"},
{"title": "Mad Honey: A Novel", "url": "https://www.amazon.com/Mad-Honey-Novel-Jodi-Picoult/dp/1984818384", "price": "$15.67"},
{"title": "Dreamland: A Novel", "url": "https://www.amazon.com/Dreamland-Novel-Nicholas-Sparks/dp/059344955X", "price": "$14.00"},
{"title": "Verity", "url": "https://www.amazon.com/Verity-Colleen-Hoover/dp/1791392792", "price": "$17.00"},
{"title": "Ugly Love: A Novel", "url": "https://www.amazon.com/Ugly-Love-Novel-Colleen-Hoover-ebook/dp/B00HB62MC0", "price": "$10.99"},
{"title": "Six of Crows (Six of Crows, 1)", "url": "https://www.amazon.com/Six-Crows-Leigh-Bardugo/dp/1627792120", "price": "$11.49"},
{"title": "Legendborn (The Legendborn Cycle)", "url": "https://www.amazon.com/Legendborn-Tracy-Deonn/dp/1534441603", "price": "$13.26"},
{"title": "She's Gone", "url": "https://www.amazon.com/Shes-Gone-David-Bell/dp/1728254205", "price": "$9.89"},
{"title": "Better Than the Movies", "url": "https://www.amazon.com/Better-Than-Movies-Lynn-Painter/dp/1534467637", "price": "$10.38"},
{"title": "Demon Slayer: Kimetsu no Yaiba―The Flower of Happiness (Demon Slayer: Kimetsu no Yaiba Novels)", "url": "https://www.amazon.com/Demon-Slayer-Kimetsu-Yaiba_The-Happiness/dp/1974732525", "price": "$8.68"},
{"title": "They Both Die at the End", "url": "https://www.amazon.com/They-Both-Die-at-End/dp/0062457799", "price": "$13.99"},
{"title": "All the Bright Places", "url": "https://www.amazon.com/All-Bright-Places-Jennifer-Niven/dp/0385755880", "price": "$15.76"},
{"title": "The Book Thief", "url": "https://www.amazon.com/Book-Thief-Markus-Zusak/dp/0375842209", "price": "$6.99"},
{"title": "We Were Liars", "url": "https://www.amazon.com/We-Were-Liars-Lockhart-ebook/dp/B00FPOSDGY", "price": "$8.99"},
{"title": "Amari and the Night Brothers (Supernatural Investigations, 1)", "url": "https://www.amazon.com/Amari-Night-Brothers-Supernatural-Investigations/dp/0062975161", "price": "$14.99"},
{"title": "The One and Only Bob (One and Only Ivan)", "url": "https://www.amazon.com/One-Only-Bob-Ivan/dp/0062991310", "price": "$11.79"},
{"title": "Restart", "url": "https://www.amazon.com/Restart-Gordon-Korman/dp/1338053809", "price": "$6.49"},
{"title": "A Wolf Called Wander", "url": "https://www.amazon.com/Wolf-Called-Wander-Rosanne-Parry/dp/0062895931", "price": "$10.14"},
{"title": "Map of Flames (The Forgotten Five, Book 1)", "url": "https://www.amazon.com/Map-Flames-Forgotten-Five-Book/dp/0593325400", "price": "$10.59"},
{"title": "Pax", "url": "https://www.amazon.com/Pax-Sara-Pennypacker/dp/0062377019", "price": "$13.93"},
{"title": "The Wild Robot (The Wild Robot, 1)", "url": "https://www.amazon.com/Wild-Robot-Peter-Brown/dp/0316381993", "price": "$7.60"},
{"title": "A Christmas Promise: A Will and a Way and Home for Christmas: A 2-in-1 Collection", "url": "https://www.amazon.com/Christmas-Promise-2-1-Collection/dp/1250847257", "price": "$7.83"},
{"title": "Wish", "url": "https://www.amazon.com/Wish-Barbara-OConnor/dp/1250144051", "price": "$4.17"},
{"title": "A Long Walk to Water: Based on a True Story", "url": "https://www.amazon.com/Long-Walk-Water-Based-Story/dp/0547577311", "price": "$7.64"},
{"title": "Dear Santa: A Novel", "url": "https://www.amazon.com/Dear-Santa-Novel-Debbie-Macomber/dp/1984818813", "price": "$12.51"},
{"title": "The One and Only Ivan", "url": "https://www.amazon.com/One-Only-Ivan-Katherine-Applegate/dp/0061992275", "price": "$6.96"},
{"title": "Snowflakes and Starlight: A Novel", "url": "https://www.amazon.com/Snowflakes-Starlight-Novel-Debbie-Macomber/dp/0778386902", "price": "$7.48"},
{"title": "Blind Tiger", "url": "https://www.amazon.com/Blind-Tiger-Sandra-Brown/dp/1538751968", "price": "$9.48"},
{"title": "False Witness: A Novel", "url": "https://www.amazon.com/False-Witness-Novel-Karin-Slaughter/dp/0062858092", "price": "$15.00"},
{"title": "Kingdom of Bones: A Thriller (Sigma Force Novels, 22)", "url": "https://www.amazon.com/Kingdom-Bones-Thriller-Sigma-Novels/dp/0062892983", "price": "$12.89"},
{"title": "The Santa Suit: A Novel", "url": "https://www.amazon.com/Santa-Suit-Mary-Kay-Andrews/dp/1250279313", "price": "$15.99"},
{"title": "Flying Angels: A Novel", "url": "https://www.amazon.com/Flying-Angels-Novel-Danielle-Steel/dp/1984821555", "price": "$13.00"},
{"title": "Wyoming Homecoming: A Novel (Wyoming Men, 11)", "url": "https://www.amazon.com/Wyoming-Homecoming-Men-11/dp/1335620958", "price": "$7.48"},
{"title": "Tom Clancy Chain of Command (A Jack Ryan Novel)", "url": "https://www.amazon.com/Clancy-Chain-Command-Jack-Novel/dp/0593188160", "price": "$15.17"},
{"title": "Chainsaw Man, Vol. 4 (4)", "url": "https://www.amazon.com/Chainsaw-Man-Vol-4/dp/1974717275", "price": "$9.99"},
{"title": "The Paris Detective", "url": "https://www.amazon.com/Paris-Detective-James-Patterson/dp/1538718847", "price": "$7.48"},
{"title": "The Dark Hours (A Renée Ballard and Harry Bosch Novel, 4)", "url": "https://www.amazon.com/Hours-Ren%C3%A9e-Ballard-Harry-Bosch/dp/0316485640", "price": "$14.29"},
{"title": "Invisible: A Novel", "url": "https://www.amazon.com/Invisible-Novel-Danielle-Steel/dp/198482158X", "price": "$13.40"},
{"title": "Cat Kid Comic Club: Perspectives: A Graphic Novel (Cat Kid Comic Club #2): From the Creator of Dog Man", "url": "https://www.amazon.com/Cat-Kid-Comic-Club-Perspectives/dp/1338784854", "price": "$8.68"},
{"title": "The Bad Guys in Open Wide and Say Arrrgh! (The Bad Guys #15)", "url": "https://www.amazon.com/Bad-Guys-Open-Wide-Arrrgh/dp/1338813188", "price": "$5.24"},
{"title": "The Judge's List: A Novel (The Whistler)", "url": "https://www.amazon.com/Judges-List-Novel-John-Grisham/dp/0385546025", "price": "$13.55"},
{"title": "Dog Man: For Whom the Ball Rolls: From the Creator of Captain Underpants (Dog Man #7)", "url": "https://www.amazon.com/Dog-Man-Creator-Captain-Underpants/dp/1338236598", "price": "$6.78"},
{"title": "Shuna's Journey", "url": "https://www.amazon.com/Shunas-Journey-Hayao-Miyazaki/dp/1250846528", "price": "$19.59"},
{"title": "Chainsaw Man, Vol. 3 (3)", "url": "https://www.amazon.com/Chainsaw-Man-Vol-3/dp/1974709957", "price": "$9.98"},
{"title": "Chainsaw Man, Vol. 2 (2)", "url": "https://www.amazon.com/Chainsaw-Man-Vol-2/dp/1974709949", "price": "$9.73"},
{"title": "Number One Is Walking: My Life in the Movies and Other Diversions", "url": "https://www.amazon.com/Number-One-Walking-Movies-Diversions/dp/1250815290", "price": "$15.00"},
{"title": "Chainsaw Man, Vol. 1 (1)", "url": "https://www.amazon.com/Chainsaw-Man-Vol-1/dp/1974709930", "price": "$7.68"},
{"title": "Jessi's Secret Language: A Graphic Novel (The Baby-sitters Club #12) (The Baby-Sitters Club Graphix)", "url": "https://www.amazon.com/Jessis-Secret-Language-Baby-sitters-Graphic/dp/1338616072", "price": "$10.99"},
{"title": "Dog Man: Grime and Punishment: A Graphic Novel (Dog Man #9): From the Creator of Captain Underpants (9)", "url": "https://www.amazon.com/Dog-Man-Punishment-Creator-Underpants/dp/1338535625", "price": "$6.48"},
{"title": "Dog Man: Mothering Heights: A Graphic Novel (Dog Man #10): From the Creator of Captain Underpants (10)", "url": "https://www.amazon.com/Dog-Man-Mothering-Heights-Underpants/dp/1338680455", "price": "$6.78"},
{"title": "The Bad Guys in the Others?! (The Bad Guys #16)", "url": "https://www.amazon.com/Bad-Guys-16-Aaron-Blabey/dp/1338820532", "price": "$5.78"},
{"title": "Cat Kid Comic Club: On Purpose: A Graphic Novel (Cat Kid Comic Club #3): From the Creator of Dog Man", "url": "https://www.amazon.com/Cat-Kid-Comic-Club-Purpose/dp/1338801945", "price": "$8.21"},
{"title": "Cat Kid Comic Club: Collaborations: A Graphic Novel (Cat Kid Comic Club #4): From the Creator of Dog Man", "url": "https://www.amazon.com/Cat-Kid-Comic-Club-Collaborations/dp/1338846620", "price": "$7.49"},
{"title": "The Book of Boundaries: Set the Limits That Will Set You Free", "url": "https://www.amazon.com/Book-Boundaries-Limits-That-Will/dp/0593448707", "price": "$21.69"},
{"title": "Power Failure: The Rise and Fall of an American Icon", "url": "https://www.amazon.com/Power-Failure-Rise-Fall-American/dp/0593084160", "price": "$31.94"},
{"title": "Like a Rolling Stone: A Memoir", "url": "https://www.amazon.com/Like-Rolling-Stone-Jann-Wenner/dp/0316415197", "price": "$17.50"},
{"title": "Chip War: The Fight for the World's Most Critical Technology", "url": "https://www.amazon.com/Chip-War-Worlds-Critical-Technology/dp/1982172002", "price": "$24.99"},
{"title": "Empire of Pain: The Secret History of the Sackler Dynasty", "url": "https://www.amazon.com/Empire-Pain-History-Sackler-Dynasty/dp/0385545681", "price": "$18.00"},
{"title": "What Happened to You : Conversations on Trauma, Resilience, and Healing", "url": "https://www.amazon.com/What-Happened-You-Understanding-Resilience/dp/1250223180", "price": "$14.49"},
{"title": "Grit: The Power of Passion and Perseverance", "url": "https://www.amazon.com/Grit-Passion-Perseverance-Angela-Duckworth-ebook/dp/B010MH9V3W", "price": "$14.99"},
{"title": "Dare to Lead: Brave Work. Tough Conversations. Whole Hearts.", "url": "https://www.amazon.com/Dare-Lead-Brave-Conversations-Hearts/dp/0399592520", "price": "$14.63"},
{"title": "Finding Me: A Memoir", "url": "https://www.amazon.com/Finding-Me-Memoir-Viola-Davis/dp/0063037327", "price": "$18.48"},
{"title": "The Myth of Normal: Trauma, Illness, and Healing in a Toxic Culture", "url": "https://www.amazon.com/Myth-Normal-Illness-Healing-Culture/dp/0593083881", "price": "$24.99"},
{"title": "The Trump Tapes: Bob Woodward's Twenty Interviews with President Donald Trump", "url": "https://www.amazon.com/Trump-Tapes-Woodwards-Interviews-President/dp/1797124722", "price": "$44.78"},
{"title": "The Choice: The Dragon Heart Legacy, Book 3 (The Dragon Heart Legacy, 3)", "url": "https://www.amazon.com/Choice-Dragon-Heart-Legacy-Book/dp/1250272726", "price": "$15.23"},
{"title": "Greenlights", "url": "https://www.amazon.com/Greenlights-Matthew-McConaughey/dp/0593139135", "price": "$15.00"},
{"title": "Bloodmarked (2) (The Legendborn Cycle)", "url": "https://www.amazon.com/Bloodmarked-Legendborn-Cycle-Tracy-Deonn/dp/1534441638", "price": "$9.99"},
{"title": "The Lost Metal: A Mistborn Novel (The Mistborn Saga, 7)", "url": "https://www.amazon.com/Lost-Metal-Mistborn-Novel-Saga/dp/0765391198", "price": "$23.48"},
{"title": "I Was Born for This", "url": "https://www.amazon.com/Was-Born-This-Alice-Oseman/dp/1338830937", "price": "$12.00"},
{"title": "Loveless", "url": "https://www.amazon.com/Loveless-Alice-Oseman/dp/133875193X", "price": "$9.49"},
{"title": "Family of Liars: The Prequel to We Were Liars", "url": "https://www.amazon.com/Family-Liars-Prequel-We-Were/dp/0593485858", "price": "$11.99"},
{"title": "Lightlark (The Lightlark Saga Book 1)", "url": "https://www.amazon.com/Lightlark-Book-1-Alex-Aster/dp/1419760866", "price": "$14.76"},
{"title": "A Thousand Heartbeats", "url": "https://www.amazon.com/Thousand-Heartbeats-Kiera-Cass/dp/0062665782", "price": "$14.98"},
{"title": "The First to Die at the End", "url": "https://www.amazon.com/First-Die-at-End/dp/0063240807", "price": "$12.99"},
{"title": "Five Survive", "url": "https://www.amazon.com/Five-Survive-Holly-Jackson/dp/0593374169", "price": "$14.39"},
{"title": "Long Live the Pumpkin Queen: Tim Burton's The Nightmare Before Christmas", "url": "https://www.amazon.com/Long-Live-Pumpkin-Queen-Nightmare/dp/1368069606", "price": "$13.16"},
{"title": "Straight On Till Morning (A Twisted Tale): A Twisted Tale", "url": "https://www.amazon.com/Straight-Till-Morning-Twisted-Tale/dp/1484781309", "price": "$15.99"},
{"title": "One of Us Is Lying", "url": "https://www.amazon.com/One-Us-Lying-Karen-McManus/dp/1524714682", "price": "$10.70"},
{"title": "The Final Gambit (The Inheritance Games, 3)", "url": "https://www.amazon.com/Final-Gambit-Inheritance-Games/dp/0316370959", "price": "$11.79"},
{"title": "As Good as Dead: The Finale to A Good Girl's Guide to Murder", "url": "https://www.amazon.com/As-Good-Dead-Finale-Murder/dp/0593379853", "price": "$12.72"},
{"title": "The Last Kids on Earth and the Nightmare King", "url": "https://www.amazon.com/Last-Kids-Earth-Nightmare-King/dp/0425288714", "price": "$8.50"},
{"title": "We'll Always Have Summer (The Summer I Turned Pretty)", "url": "https://www.amazon.com/Well-Always-Summer-Turned-Pretty/dp/1416995587", "price": "$14.18"},
{"title": "The Brightest Night (Wings of Fire #5) (5)", "url": "https://www.amazon.com/Wings-Fire-Book-Five-Brightest/dp/0545349222", "price": "$16.99"},
{"title": "Pete the Cat's 12 Groovy Days of Christmas: A Christmas Holiday Book for Kids", "url": "https://www.amazon.com/Pete-Cats-Groovy-Days-Christmas/dp/0062675273", "price": "$5.84"},
{"title": "Captain Underpants and the Revolting Revenge of the Radioactive Robo-Boxers (Captain Underpants #10) (10)", "url": "https://www.amazon.com/Captain-Underpants-Revolting-Radioactive-Robo-Boxers/dp/0545175364", "price": "$7.44"},
{"title": "The Titan's Curse (Percy Jackson and the Olympians, Book 3)", "url": "https://www.amazon.com/Titans-Curse-Percy-Jackson-Olympians/dp/1423101480", "price": "$7.99"},
{"title": "How to Catch a Unicorn", "url": "https://www.amazon.com/How-Catch-Unicorn-Adam-Wallace/dp/1492669733", "price": "$5.74"},
{"title": "Harry Potter and the Order of the Phoenix (5)", "url": "https://www.amazon.com/Harry-Potter-Order-Phoenix-Rowling/dp/0439358078", "price": "$6.78"},
{"title": "Little Blue Truck Makes a Friend: A Friendship Book for Kids", "url": "https://www.amazon.com/Little-Blue-Truck-Makes-Friend/dp/0358722829", "price": "$16.33"},
{"title": "5 More Sleeps ‘til Christmas", "url": "https://www.amazon.com/5-More-Sleeps-til-Christmas/dp/1250266475", "price": "$15.00"},
{"title": "Diary of a Wimpy Kid: Hard Luck, Book 8", "url": "https://www.amazon.com/Diary-Wimpy-Kid-Hard-Luck/dp/1419711326", "price": "$8.95"},
{"title": "The Pigeon Will Ride the Roller Coaster!", "url": "https://www.amazon.com/Pigeon-Will-Ride-Roller-Coaster/dp/1454946865", "price": "$13.99"},
{"title": "How to Catch an Elf", "url": "https://www.amazon.com/How-Catch-Elf-Adam-Wallace/dp/1492646318", "price": "$4.94"},
{"title": "Construction Site on Christmas Night: (Christmas Book for Kids, Children's Book, Holiday Picture Book) (Goodnight, Goodnight Construction Site)", "url": "https://www.amazon.com/Construction-Christmas-Sherri-Duskey-Rinker/dp/1452139113", "price": "$8.49"},
{"title": "The Door of No Return", "url": "https://www.amazon.com/Door-No-Return-Kwame-Alexander/dp/0316441864", "price": "$12.39"},
{"title": "Daughter of the Deep", "url": "https://www.amazon.com/Daughter-Deep-Rick-Riordan/dp/1368077927", "price": "$9.99"},
{"title": "The Wonderful Things You Will Be", "url": "https://www.amazon.com/Wonderful-Things-You-Will-Be/dp/0385376715", "price": "$8.55"},
{"title": "Odder", "url": "https://www.amazon.com/Odder-Katherine-Applegate/dp/1250147425", "price": "$8.49"},
{"title": "Two Degrees", "url": "https://www.amazon.com/Two-Degrees-Alan-Gratz/dp/1338735675", "price": "$15.99"},
{"title": "The Day the Crayons Quit", "url": "https://www.amazon.com/Day-Crayons-Quit-Drew-Daywalt/dp/0399255370", "price": "$9.19"},
{"title": "Unstoppable Us, Volume 1: How Humans Took Over the World (Unstoppable Us, 1)", "url": "https://www.amazon.com/Unstoppable-Us-Humans-Took-World/dp/0593643461", "price": "$21.99"},
{"title": "Wonder", "url": "https://www.amazon.com/Wonder-R-J-Palacio/dp/B0051ANPZQ", "price": "$10.99"},
{"title": "Dragons Love Tacos", "url": "https://www.amazon.com/Dragons-Love-Tacos-Adam-Rubin/dp/0803736800", "price": "$9.92"},
{"title": "The Official Harry Potter Baking Book: 40+ Recipes Inspired by the Films", "url": "https://www.amazon.com/Official-Harry-Potter-Baking-Book/dp/1338285262", "price": "$13.98"},
{"title": "The Christmas Pig", "url": "https://www.amazon.com/Christmas-Pig-J-K-Rowling/dp/1338790234", "price": "$14.00"},
{"title": "Half Baked Harvest Every Day: Recipes for Balanced, Flexible, Feel-Good Meals: A Cookbook", "url": "https://www.amazon.com/Half-Baked-Harvest-Every-Day/dp/0593232550", "price": "$18.49"},
{"title": "The Complete Cookbook for Young Chefs: 100+ Recipes that You'll Love to Cook and Eat", "url": "https://www.amazon.com/Complete-Cookbook-Young-Chefs/dp/1492670022", "price": "$10.44"},
{"title": "Atlas of the Heart: Mapping Meaningful Connection and the Language of Human Experience", "url": "https://www.amazon.com/Atlas-Heart-Meaningful-Connection-Experience/dp/0399592555", "price": "$17.78"},
{"title": "The Boy, the Mole, the Fox and the Horse", "url": "https://www.amazon.com/Boy-Mole-Fox-Horse/dp/0062976583", "price": "$12.01"},
{"title": "The Stories We Tell: Every Piece of Your Story Matters", "url": "https://www.amazon.com/Stories-We-Tell-Every-Matters/dp/1400333873", "price": "$16.97"},
{"title": "The Complete Baking Book for Young Chefs: 100+ Sweet and Savory Recipes that You'll Love to Bake, Share and Eat!", "url": "https://www.amazon.com/Complete-Baking-Book-Young-Chefs/dp/1492677698", "price": "$14.89"},
{"title": "The Simply Happy Cookbook: 100-Plus Recipes to Take the Stress Out of Cooking (The Happy Cookbook Series)", "url": "https://www.amazon.com/Simply-Happy-Cookbook-100-Plus-Recipes/dp/0063209233", "price": "$18.80"},
{"title": "Faith Still Moves Mountains: Miraculous Stories of the Healing Power of Prayer", "url": "https://www.amazon.com/Faith-Still-Moves-Mountains-Miraculous/dp/006322593X", "price": "$17.45"},
{"title": "Never Finished: Unshackle Your Mind and Win the War Within", "url": "https://www.amazon.com/Never-Finished-Unshackle-Your-Within/dp/1544534078", "price": "$18.45"},
{"title": "The Subtle Art of Not Giving a F*ck: A Counterintuitive Approach to Living a Good Life", "url": "https://www.amazon.com/Subtle-Art-Not-Giving-Counterintuitive/dp/0062457713", "price": "$13.99"},
{"title": "Atomic Habits: An Easy & Proven Way to Build Good Habits & Break Bad Ones", "url": "https://www.amazon.com/Atomic-Habits-Proven-Build-Break/dp/0735211299", "price": "$11.98"},
{"title": "These Precious Days: Essays", "url": "https://www.amazon.com/These-Precious-Days-Ann-Patchett/dp/0063092786", "price": "$17.88"},
{"title": "Go-To Dinners: A Barefoot Contessa Cookbook", "url": "https://www.amazon.com/Go-Dinners-Barefoot-Contessa-Cookbook/dp/1984822780", "price": "$15.19"},
{"title": "Killers of the Flower Moon: The Osage Murders and the Birth of the FBI", "url": "https://www.amazon.com/Killers-Flower-Moon-Osage-Murders/dp/0385534248", "price": "$14.59"},
{"title": "The Spy and the Traitor: The Greatest Espionage Story of the Cold War", "url": "https://www.amazon.com/Spy-Traitor-Greatest-Espionage-Story/dp/1101904194", "price": "$64.99"},
{"title": "The Bomber Mafia: A Dream, a Temptation, and the Longest Night of the Second World War", "url": "https://www.amazon.com/Bomber-Mafia-Temptation-Longest-Second/dp/0316296619", "price": "$13.97"},
{"title": "Educated: A Memoir", "url": "https://www.amazon.com/Educated-Memoir-Tara-Westover/dp/0399590501", "price": "$14.63"},
{"title": "Talking to Strangers: What We Should Know about the People We Don't Know", "url": "https://www.amazon.com/Talking-Strangers-Should-about-People/dp/0316478520", "price": "$13.38"},
{"title": "The Splendid and the Vile: A Saga of Churchill, Family, and Defiance During the Blitz", "url": "https://www.amazon.com/Splendid-Vile-Churchill-Family-Defiance/dp/0385348711", "price": "$14.90"},
{"title": "Devotion (Movie Tie-in): An Epic Story of Heroism, Friendship, and Sacrifice", "url": "https://www.amazon.com/Devotion-Movie-Tie-Friendship-Sacrifice/dp/0593722337", "price": "$20.00"},
{"title": "The Greatest Beer Run Ever: A Memoir of Friendship, Loyalty, and War", "url": "https://www.amazon.com/Greatest-Beer-Run-Ever-Friendship/dp/0062995464", "price": "$17.98"},
{"title": "All About Love: New Visions", "url": "https://www.amazon.com/All-About-Love-New-Visions/dp/0060959479", "price": "$12.84"},
{"title": "Maybe Now: A Novel (Maybe Someday)", "url": "https://www.amazon.com/Maybe-Now-Novel-Someday/dp/1668013347", "price": "$10.65"},
{"title": "Daisy Jones & The Six: A Novel", "url": "https://www.amazon.com/Daisy-Jones-Taylor-Jenkins-Reid/dp/1524798622", "price": "$22.99"},
{"title": "All Your Perfects: A Novel", "url": "https://www.amazon.com/All-Your-Perfects-Colleen-Hoover-ebook/dp/B078MC547V", "price": "$10.99"},
{"title": "Things We Never Got Over", "url": "https://www.amazon.com/Things-We-Never-Got-Over/dp/1728278872", "price": "$18.97"},
{"title": "Becoming", "url": "https://www.amazon.com/Becoming-Michelle-Obama/dp/1524763136", "price": "$24.72"},
{"title": "Starry Messenger: Cosmic Perspectives on Civilization", "url": "https://www.amazon.com/Starry-Messenger-Cosmic-Perspectives-Civilization/dp/1250861500", "price": "$18.04"},
{"title": "Cinema Speculation", "url": "https://www.amazon.com/Cinema-Speculation-Quentin-Tarantino/dp/0063112582", "price": "$21.86"},
{"title": "The Song of Achilles: A Novel", "url": "https://www.amazon.com/Song-Achilles-Novel-Madeline-Miller/dp/0062060627", "price": "$10.34"},
{"title": "Going Rogue: Rise and Shine Twenty-Nine (29) (Stephanie Plum)", "url": "https://www.amazon.com/Going-Rogue-Shine-Twenty-Nine-Stephanie/dp/1668003058", "price": "$15.14"},
{"title": "Confess: A Novel", "url": "https://www.amazon.com/Confess-Novel-Colleen-Hoover-ebook/dp/B00LD1OHE0", "price": "$10.99"},
{"title": "Killing the Legends: The Lethal Danger of Celebrity", "url": "https://www.amazon.com/Killing-Legends-Lethal-Danger-Celebrity/dp/1250283302", "price": "$15.00"},
{"title": "Maybe Someday", "url": "https://www.amazon.com/Maybe-Someday-Colleen-Hoover-ebook/dp/B00DPM7RJW", "price": "$10.99"},
{"title": "No Plan B: A Jack Reacher Novel", "url": "https://www.amazon.com/No-Plan-Jack-Reacher-Novel/dp/1984818546", "price": "$17.49"},
{"title": "Babel: Or the Necessity of Violence: An Arcane History of the Oxford Translators' Revolution", "url": "https://www.amazon.com/Babel-Necessity-Violence-Translators-Revolution/dp/0063021420", "price": "$20.49"},
{"title": "Tomorrow, and Tomorrow, and Tomorrow: A novel", "url": "https://www.amazon.com/Tomorrow-novel-Gabrielle-Zevin/dp/0593321200", "price": "$14.69"},
{"title": "The Midnight Library: A Novel", "url": "https://www.amazon.com/Midnight-Library-Novel-Matt-Haig/dp/0525559477", "price": "$13.59"},
{"title": "A World of Curiosities: A Novel (Chief Inspector Gamache Novel, 18)", "url": "https://www.amazon.com/World-Curiosities-Novel-Inspector-Gamache/dp/1250145295", "price": "$20.22"},
{"title": "Tom Clancy Red Winter (A Jack Ryan Novel)", "url": "https://www.amazon.com/Clancy-Winter-Jack-Ryan-Novel/dp/0593422759", "price": "$20.06"},
{"title": "All About Me!: My Remarkable Life in Show Business", "url": "https://www.amazon.com/All-About-Me-Remarkable-Business/dp/059315911X", "price": "$20.00"},
{"title": "Braiding Sweetgrass: Indigenous Wisdom, Scientific Knowledge and the Teachings of Plants", "url": "https://www.amazon.com/Braiding-Sweetgrass-Indigenous-Scientific-Knowledge/dp/1571313567", "price": "$13.25"},
{"title": "What If? 2: Additional Serious Scientific Answers to Absurd Hypothetical Questions", "url": "https://www.amazon.com/What-Additional-Scientific-Hypothetical-Questions/dp/0525537112", "price": "$18.57"},
{"title": "The Song of the Cell: An Exploration of Medicine and the New Human", "url": "https://www.amazon.com/Song-Cell-Exploration-Medicine-Human/dp/1982117354", "price": "$16.25"},
{"title": "So Help Me God", "url": "https://www.amazon.com/So-Help-God-Mike-Pence/dp/1982190337", "price": "$21.78"},
{"title": "Radio's Greatest of All Time", "url": "https://www.amazon.com/Radios-Greatest-Time-Rush-Limbaugh/dp/1668001845", "price": "$21.49"},
{"title": "The Philosophy of Modern Song", "url": "https://www.amazon.com/Philosophy-Modern-Song-Bob-Dylan/dp/1451648707", "price": "$22.50"},
{"title": "An Immense World: How Animal Senses Reveal the Hidden Realms Around Us", "url": "https://www.amazon.com/Immense-World-Animal-Senses-Reveal/dp/0593133234", "price": "$18.31"},
{"title": "The Revolutionary: Samuel Adams", "url": "https://www.amazon.com/Revolutionary-Samuel-Adams-Stacy-Schiff/dp/0316441112", "price": "$17.50"},
{"title": "And There Was Light: Abraham Lincoln and the American Struggle", "url": "https://www.amazon.com/There-Was-Light-American-Struggle/dp/0553393960", "price": "$20.00"},
{"title": "Lessons in Chemistry: A Novel", "url": "https://www.amazon.com/Lessons-Chemistry-Novel-Bonnie-Garmus/dp/038554734X", "price": "$17.83"},
{"title": "The Body Keeps the Score: Brain, Mind, and Body in the Healing of Trauma", "url": "https://www.amazon.com/Body-Keeps-Score-Healing-Trauma/dp/0670785938", "price": "$19.39"},
{"title": "Surrender: 40 Songs, One Story", "url": "https://www.amazon.com/Surrender-40-Songs-One-Story/dp/0525521046", "price": "$17.00"},
{"title": "It Starts with Us: A Novel (It Ends with Us)", "url": "https://www.amazon.com/Starts-Us-Novel-Ends/dp/1668001225", "price": "$10.98"},
{"title": "It Ends with Us: A Novel", "url": "https://www.amazon.com/Ends-Us-Novel-Colleen-Hoover-ebook/dp/B0176M3U10", "price": "$10.99"}
]
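The prices in result.json are strings with a leading $. Below is a minimal sketch for turning them into numbers after the crawl, using only the standard library. The function name load_prices is hypothetical, and the two records are copied from the output above.

```python
import json
from decimal import Decimal


def load_prices(raw_json):
    """Map each title to its price as a Decimal, dropping the currency sign."""
    prices = {}
    for record in json.loads(raw_json):
        text = record['price'].strip().lstrip('$').replace(',', '')
        prices[record['title']] = Decimal(text)
    return prices


# Two records copied from result.json above
raw = '''[
{"title": "Pax", "url": "https://www.amazon.com/Pax-Sara-Pennypacker/dp/0062377019", "price": "$13.93"},
{"title": "Wish", "url": "https://www.amazon.com/Wish-Barbara-OConnor/dp/1250144051", "price": "$4.17"}
]'''
print(load_prices(raw))
```

Decimal is used here instead of float so that prices keep their exact two decimal places.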
This is actually a follow-up question:
How can I send the data to Elasticsearch after it has been scraped?
This is my code, but something is missing.
In pipelines.py:
from itemadapter import ItemAdapter
import sqlite3
from elasticsearch import Elasticsearch as es

class AmazonPipeline:
    # The SQLite version below is disabled by wrapping it in a string
    '''
    def __init__(self):
        self.con = sqlite3.connect('books.db')
        self.cur = self.con.cursor()
        self.cur.execute("""
        CREATE TABLE IF NOT EXISTS book_prices(
            id INTEGER PRIMARY KEY,
            title TEXT,
            price INTEGER,
            ratings INTEGER,
            url TEXT
        )
        """)

    def process_item(self, item, spider):
        ## Define insert statement
        self.cur.execute("""
            INSERT INTO book_prices (title, price, ratings, url) VALUES (?, ?, ?, ?)
        """,
        (
            item['title'],
            item['price'],
            item['ratings'],
            item['url']
        ))
        ## Execute insert of data into database
        self.con.commit()
        return item
    '''

#This is the code that should connect and create an index in Elasticsearch
def ingestForES(title: str,rank: int,description):
    es_url = es(hosts=[{'host':'http://34.255.105.149','port': 9200}])
    es = es(es_url,basic_auth=("elastic", "datascientest"), verify_certs=False)
    data = {'title': title,
            'rank':rank,
            'description': description,
            }
    resp = es.create.index(index='myindex', body=data)
    print(resp)
In settings.py I added the following lines:
ITEM_PIPELINES = {
    'amazon.pipelines.AmazonPipeline': 300,
    'scrapyelasticsearch.scrapyelasticsearch.ElasticSearchPipeline': 500
}
ELASTICSEARCH_SERVERS = 'http://34.255.105.149:9200'
ELASTICSEARCH_INDEX = 'myindex'
ELASTICSEARCH_TYPE = 'items'
#ELASTICSEARCH_UNIQ_KEY = 'url'
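One possible direction, as a sketch only, is to index each item from inside a pipeline's process_item instead of from a standalone function. It assumes version 8 of the elasticsearch Python client, where index() accepts a document argument. The class name, constructor arguments and placeholder credentials are hypothetical. The client is created in open_spider, so the class can be exercised here with a stand-in client and no server.

```python
class ElasticsearchBookPipeline:
    """Hypothetical pipeline that indexes every scraped item."""

    def __init__(self, url='http://localhost:9200', index='myindex',
                 auth=('elastic', 'changeme')):  # placeholder values
        self.url = url
        self.index = index
        self.auth = auth
        self.client = None

    def open_spider(self, spider):
        # Imported lazily so the module loads without the client installed
        from elasticsearch import Elasticsearch
        self.client = Elasticsearch(self.url, basic_auth=self.auth)

    def process_item(self, item, spider):
        # Assumes the 8.x client API: index(index=..., document=...)
        self.client.index(index=self.index, document=dict(item))
        return item


class FakeClient:
    """Stand-in that records calls instead of contacting a server."""

    def __init__(self):
        self.calls = []

    def index(self, index, document):
        self.calls.append((index, document))


pipeline = ElasticsearchBookPipeline()
pipeline.client = FakeClient()
result = pipeline.process_item({'title': 'Pax', 'price': '$13.93'}, spider=None)
print(pipeline.client.calls)
```

To try it in the project, the class would be listed in ITEM_PIPELINES under its own dotted path, for example 'amazon.pipelines.ElasticsearchBookPipeline' if it lives in pipelines.py. That path is an assumption about where the class is placed.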
Related
How to create and use custom connection field in Airflow 2.0.2
I am creating a emr_default connection through DAG (I don't create using UI). AWS credentials are already defined in UI. The code is as below. c = Connection( conn_id='emr_default1', conn_type='Elastic MapReduce', extra=json.dumps(dict({"Name": "Data Spark", "LogUri": "s3://aws-logs-55-us-west-1/elasticmap/", "ReleaseLabel": "emr-6.5.0", "Instances": {"Ec2KeyName": "tester", "Ec2SubnetId": "subnet-0d62", "InstanceGroups": [{"Name": "Master nodes", "Market": "ON_DEMAND", "InstanceRole": "MASTER", "InstanceType": "m3.xlarge", "InstanceCount": 1}, {"Name": "Core nodes", "Market": "ON_DEMAND", "InstanceRole": "CORE", "InstanceType": "m3.xlarge", "InstanceCount": 0}], "TerminationProtected": false, "KeepJobFlowAliveWhenNoSteps": false}, "Applications": [{"Name": "Spark"}], "Configurations": [{"Classification": "core-site", "Properties": {"io.compression.codec.lzo.class": "", "io.compression.codecs": "org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec"}, "Configurations": []}], "VisibleToAllUsers": true, "JobFlowRole": "EMR_EC2_DefaultRole", "ServiceRole": "EMR_DefaultRole", "Tags": [{"Key": "app", "Value": "analytics"}, {"Key": "environment", "Value": "development"}]})), ) print(f"AIRFLOW_CONN_{c.conn_id.upper()}='{c.get_uri()}'") and I am using this connection in emr creation cluster_creator = EmrCreateJobFlowOperator(task_id='create_job_flow', emr_conn_id='emr_default1', job_flow_overrides=JOB_FLOW_OVERRIDES) But it is not making this connection, can someone please tell me how to create emr_default and use this connection. Errro: botocore.exceptions.NoCredentialsError: Unable to locate credentials Thanks, Xi
jq replace values based on external map
I would like to change a field in my json file as specified by another json file. My input file is something like: {"id": 10, "name": "foo", "some_other_field": "value 1"} {"id": 20, "name": "bar", "some_other_field": "value 2"} {"id": 25, "name": "baz", "some_other_field": "value 10"} I have an external override file that specifies how name in certain objects should be overridden, for example: {"id": 20, "name": "Bar"} {"id": 10, "name": "foo edited"} As shown above, the override may be shorter than input, in which case the name should be unchanged. Both files can easily fit into available memory. Given the above input and the override, I would like to obtain the following output: {"id": 10, "name": "foo edited", "some_other_field": "value 1"} {"id": 20, "name": "Bar", "some_other_field": "value 2"} {"id": 25, "name": "baz", "some_other_field": "value 10"} Being a beginner with jq, I wasn't really sure where to start. While there are some questions that cover similar ground (the closest being this one), I couldn't figure out how to apply the solutions to my case.
There are many possibilities, but probably the simplest efficient solution uses the built-in function INDEX/2, e.g. as follows:
jq -n --slurpfile dict f2.json '
  (INDEX($dict[]; .id) | map_values(.name)) as $d
  | inputs
  | .name = ($d[.id|tostring] // .name)
' f1.json
This uses inputs with the -n option to read the first file so that each JSON object can be processed in turn. Since the solution is so short, it should be easy enough to figure it out with the aid of the online jq manual.
Caveat
This solution assumes that there are no "collisions" between ids in the dictionary as a result of the use of "tostring" (e.g. if {"id": 10} and {"id": "10"} both occurred). If the dictionary does or might have such collisions, then the above solution can be tweaked accordingly, but it is a bit tricky.
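For readers who find the jq filter terse, the same lookup-then-override idea can be sketched in Python. This is only an illustration of the logic, not a replacement for the jq command: the function names apply_overrides and read_json_lines are hypothetical, and the sample data is copied from the question. Ids are stringified with str() to mirror the "tostring" step, so the same collision caveat applies.

```python
import json


def read_json_lines(text):
    """Parse one JSON object per non-empty line (the layout used in the question)."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def apply_overrides(records, overrides):
    """Return copies of records whose "name" is replaced when an override has the same id."""
    # Build an id -> name lookup; ids are stringified, as INDEX does in jq.
    names_by_id = {str(o["id"]): o["name"] for o in overrides}
    return [
        # Fall back to the record's own name when no override exists.
        {**rec, "name": names_by_id.get(str(rec["id"]), rec["name"])}
        for rec in records
    ]


records = read_json_lines("""
{"id": 10, "name": "foo", "some_other_field": "value 1"}
{"id": 20, "name": "bar", "some_other_field": "value 2"}
{"id": 25, "name": "baz", "some_other_field": "value 10"}
""")
overrides = read_json_lines("""
{"id": 20, "name": "Bar"}
{"id": 10, "name": "foo edited"}
""")

updated = apply_overrides(records, overrides)
for rec in updated:
    print(json.dumps(rec))
```

One small difference from the jq version: the // operator also falls back when the override value is null or false, whereas dict.get only falls back when the id is missing.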
R Parsing error when trying to import JSON to R
I've got a JSON file that looks like this. I am trying to import it into R using the jsonlite package:
#Load package for import
library(jsonlite)
df <- fromJSON("test.json")
But it throws an error:
Error in parse_con(txt, bigint_as_char) : parse error: trailing garbage ome in at a later time." } { "id": "e5fa37f44557c62ee (right here) ------^
I've tried looking at all solutions on stackoverflow, but haven't been able to figure this out. Any inputs would be very helpful.
The JSON file you linked contains two JSON objects. Perhaps you want an array: [ { "id": "71bb8883780bb152e4bb4db976bedc62", "metadata": { "abc_bad_date": "true", "abc_client": "Hydra Corp", "abc_doc_id": 1, "abc_file": "Hydra Corp 2016.txt", "abc_interview_type": "Post Analysis", "abc_interviewee_role": "Director Corporate Engineering; Greater Chicago Area; Global Procurement Director Facilities and MRO", "abc_interviewer": "Piper Thomas", "abc_services_provided": "Food", "section": "on_expectations" }, "text": "Gerrit: There were a number ...." }, { "id": "e5fa37f44557c62eef44baafb13128f0", "metadata": { "abc_bad_date": "true", "abc_client": "Hydra Corp", "abc_doc_id": 1, "abc_file": "Hydra Corp 2016.txt", "abc_interview_type": "Post Analysis", "abc_interviewee_role": "Director Corporate Engineering; Greater Chicago Area; Global Procurement Director Facilities and MRO", "abc_interviewer": "Piper Thomas", "abc_services_provided": "Painting", "section": "on_relationships" }, "text": "Gerrit: I thought the ABC ..." } ]
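If editing the file by hand is impractical, the back-to-back objects can be collected into an array programmatically. The sketch below is in Python rather than R and only illustrates the idea; parse_concatenated_json is a hypothetical helper name and the sample string is a shortened stand-in for the real file. It relies on json.JSONDecoder.raw_decode, which parses one value and reports where it ended.

```python
import json


def parse_concatenated_json(text):
    """Parse several JSON values written one after another into a list."""
    decoder = json.JSONDecoder()
    values = []
    pos = 0
    while True:
        # raw_decode does not accept leading whitespace, so skip it manually.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        value, pos = decoder.raw_decode(text, pos)
        values.append(value)
    return values


# Shortened, made-up stand-in for the two objects in the question's file.
sample = '{"id": "a", "text": "first"}\n{"id": "b", "text": "second"}'

objects = parse_concatenated_json(sample)
as_array = json.dumps(objects, indent=2)  # a single JSON array
print(as_array)
```

The resulting array is a single valid JSON document, which is the shape the answer above suggests.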
Importing csv file but few columns are full of special symbols
I have imported a movie dataset in CSV format. A few of the columns are full of special symbols along with the data I need (an example is shown below, along with an image of the movie dataset). Do I have to remove those special characters individually, or is there a shortcut for removing them while importing the file into R? Thanks
Movie.csv Image
GENRE
[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]
Spoken Languages
[{"iso_639_1": "en", "name": "English"}, {"iso_639_1": "es", "name": "Espa\u00f1ol"}]
Deleting multiple keys at once with jq
I need to delete multiple keys at once from some JSON (using jq), and I'm trying to learn if there is a better way of doing this, than calling map and del every time. Here's my input data: test.json [ { "label": "US : USA : English", "Country": "USA", "region": "US", "Language": "English", "locale": "en", "currency": "USD", "number": "USD" }, { "label": "AU : Australia : English", "Country": "Australia", "region": "AU", "Language": "English", "locale": "en", "currency": "AUD", "number": "AUD" }, { "label": "CA : Canada : English", "Country": "Canada", "region": "CA", "Language": "English", "locale": "en", "currency": "CAD", "number": "CAD" } ] For each item, I want to remove the number, Language, and Country keys. I can do that with this command: $ cat test.json | jq 'map(del(.Country)) | map(del(.number)) | map(del(.Language))' That works fine, and I get the desired output: [ { "label": "US : USA : English", "region": "US", "locale": "en", "currency": "USD" }, { "label": "AU : Australia : English", "region": "AU", "locale": "en", "currency": "AUD" }, { "label": "CA : Canada : English", "region": "CA", "locale": "en", "currency": "CAD" } ] However, I'm trying to understand if there is a jq way of specifying multiple labels to delete, so I don't have to have multiple map(del()) directives?
You can provide a stream of paths to delete: $ cat test.json | jq 'map(del(.Country, .number, .Language))' Also, consider that, instead of blacklisting specific keys, you might prefer to whitelist the ones you do want: $ cat test.json | jq 'map({label, region, locale, currency})'
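The blacklist versus whitelist distinction may be easier to see outside jq. Here is a rough Python sketch of both approaches; drop_keys and keep_keys are hypothetical names, and the data is a trimmed copy of test.json from the question.

```python
def drop_keys(items, keys):
    """Blacklist: copy each object without the listed keys."""
    unwanted = set(keys)
    return [{k: v for k, v in item.items() if k not in unwanted} for item in items]


def keep_keys(items, keys):
    """Whitelist: copy only the listed keys (missing keys become None, like null in jq)."""
    return [{k: item.get(k) for k in keys} for item in items]


items = [
    {"label": "US : USA : English", "Country": "USA", "region": "US",
     "Language": "English", "locale": "en", "currency": "USD", "number": "USD"},
    {"label": "AU : Australia : English", "Country": "Australia", "region": "AU",
     "Language": "English", "locale": "en", "currency": "AUD", "number": "AUD"},
]

blacklisted = drop_keys(items, ["Country", "number", "Language"])
whitelisted = keep_keys(items, ["label", "region", "locale", "currency"])
print(blacklisted == whitelisted)
```

For this input the two results are identical; they diverge as soon as an object gains a key that appears in neither list, which is the practical reason to choose one approach over the other.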
There is no need to use both map and del. You can pass multiple paths to del, separated by commas. Here is a solution using "dot-style" path notation:
jq 'del( .[] .Country, .[] .number, .[] .Language )' test.json
This form doesn't require quotation marks (which you may feel makes it more readable), but it doesn't group the paths (you have to retype .[] once per path).
Here is an example using "array-style" path notation, which allows you to combine paths with a common prefix like so:
jq 'del( .[] ["Country", "number", "Language"] )' test.json
This combines subpaths under the "last common ancestor" (which in this case is the top-level list iterator .[]).
peak's answer uses map and delpaths, though it seems you can also use delpaths on its own:
jq '[.[] | delpaths( [["Country"], ["number"], ["Language"]] )]' test.json
This requires both quotation marks and an array of singleton arrays, and you have to put the result back into a list (with the start and end square brackets).
Overall, here I'd go for the array-style notation for brevity, but it's always good to know multiple ways to do the same thing.
A better compromise between the "array-style" and "dot-style" notations mentioned by Louis in his answer:
del(.[] | .Country, .number, .Language)
This form can also be used to delete a list of keys from a nested object (see russholio's answer):
del(.a | .d, .e)
This implies that you can also pick a single index to delete keys from:
del(.[1] | .Country, .number, .Language)
Or multiple:
del(.[2,3,4] | .Country,.number,.Language)
You can delete a range using the range() function (slice notation doesn't work):
del(.[range(2;5)] | .Country,.number,.Language) # same as targeting indices 2,3,4
Some side notes:
map(del(.Country,.number,.Language))
# Is by definition equivalent to
[.[] | del(.Country,.number,.Language)]
If the key contains special characters or starts with a digit, you need to surround it with double quotes like this: ."foo$", or else .["foo$"].
This question is very high in the google results, so I'd like to note that some time in the intervening years, del has apparently been altered so that you can delete multiple keys with just: del(.key1, .key2, ...) So don't tear your hair out trying to figure out the syntax work-arounds, assuming your version of jq is reasonably current.
In addition to @user3899165's answer, I found that the following deletes a list of keys from a "sub-object":
example.json
{ "a": { "b": "hello", "c": "world", "d": "here's", "e": "the" }, "f": { "g": "song", "h": "that", "i": "I'm", "j": "singing" } }
$ jq 'del(.a["d", "e"])' example.json
delpaths is also worth knowing about, and is perhaps a little less mysterious: map( delpaths( [["Country"], ["number"], ["Language"]] )) Since the argument to delpaths is simply JSON, this approach is particularly useful for programmatic deletions, e.g. if the key names are available as JSON strings.
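To make the "paths are just JSON" point concrete, here is a small Python sketch of path-based deletion where the key names arrive as a JSON string. It is an approximation of the idea, not a reimplementation of jq's delpaths: delete_paths is a hypothetical helper, it silently skips missing or empty paths, and it is only intended for object keys (deleting several array indices in one call would be affected by index shifting).

```python
import copy
import json


def delete_paths(value, paths):
    """Return a deep copy of value with each path (a list of keys) removed if present."""
    result = copy.deepcopy(value)
    for path in paths:
        if not path:
            continue  # assumption: ignore empty paths rather than delete everything
        *parents, last = path
        target = result
        try:
            for key in parents:
                target = target[key]
            del target[last]
        except (KeyError, IndexError, TypeError):
            continue  # path does not exist in this value; nothing to delete
    return result


# Key names supplied as JSON, e.g. read from a config file or another program.
key_names = json.loads('["Country", "number", "Language"]')
paths = [[name] for name in key_names]

item = {"label": "US : USA : English", "Country": "USA", "region": "US",
        "Language": "English", "locale": "en", "currency": "USD", "number": "USD"}
cleaned = delete_paths(item, paths)

# Longer paths reach into nested objects, as in the example.json case above.
nested = {"a": {"b": "hello", "c": "world", "d": "here's", "e": "the"}}
trimmed = delete_paths(nested, [["a", "d"], ["a", "e"]])
print(cleaned, trimmed)
```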