Use scrapy to collect information for one item from multiple pages (and output it as a nested dictionary) - web-scraping

I'm trying to scrape data from a tournaments site.
Each tournament has some information such as the venue, the date, prices etc.
And also the rank of teams that took part. The rank is a table that simply provides the name of the team, and its position in the rank.
Then, you can click on the name of the team, which takes you to a page where we can get the roster of players that the team selected for that tournament.
I'd like to scrape the data into something like:
[{
    "name": "Grand Tournament",
    "venue": "...",
    "date": "...",
    "rank": [
        {"team_name": "Team name",
         "rank": 1,
         "roster": ["player1", "player2", "..."]
        },
        {"team_name": "Team name",
         "rank": 2,
         "roster": ["player1", "player2", "..."]
        }
    ]
}]
I have the following spider to scrape a single tournament page (usage: scrapy crawl tournamentspider -a start_url="<tournamenturl>")
import scrapy
from urllib.parse import urlparse

class TournamentSpider(scrapy.Spider):
    name = "tournamentspider"
    allowed_domains = ["..."]

    def start_requests(self):
        try:
            yield scrapy.Request(url=self.start_url, callback=self.parse)
        except AttributeError:
            raise ValueError("You must use this spider with argument start_url.")

    def parse(self, response):
        tournament_item = TournamentItem()
        tournament_item['teams'] = []
        tournament_item['name'] = "Tournament Name"
        tournament_item['date'] = "Date"
        tournament_item['venue'] = "Venue"
        ladder = response.css('#ladder')
        for row in ladder.css('table tbody tr'):
            row_cells = row.xpath('td')
            participation_item = PlayerParticipationItem()
            participation_item['team_name'] = "Team Name"
            participation_item['rank'] = "x"
            # Parse roster
            roster_url_page = row_cells[2].xpath('a/@href').get()
            # Follow link to extract list
            base_url = urlparse(response.url)
            absolute_url = f'{base_url.scheme}://{base_url.hostname}/{roster_url_page}'
            request = scrapy.Request(absolute_url, callback=self.parse_roster_page)
            request.meta['participation_item'] = participation_item
            yield request
            # Include participation item in the tournament
            tournament_item['teams'].append(participation_item)
        yield tournament_item

    def parse_roster_page(self, response):
        participation_item = response.meta['participation_item']
        participation_item['roster'] = ["Player1", "Player2", "..."]
        return participation_item
My problem is that this spider produces the following output:
[{
    "name": "Grand Tournament",
    "venue": "...",
    "date": "...",
    "rank": [
        {"team_name": "Team name",
         "rank": 1
        },
        {"team_name": "Team name",
         "rank": 2
        }
    ]
},
{"team_name": "Team name",
 "rank": 1,
 "roster": ["player1", "player2", "..."]
},
{"team_name": "Team name",
 "rank": 2,
 "roster": ["player1", "player2", "..."]
}]
I know that those extra items in the output are generated by the yield request line. When I remove it, I'm no longer scraping the roster page, so the extra items disappear, but I no longer have the roster data.
Is it possible to get the output I'm aiming for?
I know that a different approach could be to scrape the tournament information, and then teams with a field that identifies the tournament. But I'd like to know if the initial approach is achievable.

You can use the scrapy-inline-requests package to call parse_roster_page and get the roster data without yielding it as a separate item.
The only change you need is to add the @inline_requests decorator to the function parse_roster_page.
from inline_requests import inline_requests

class TournamentSpider(scrapy.Spider):
    def parse(self, response):
        ...

    @inline_requests
    def parse_roster_page(self, response):
        ...
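If you would rather not depend on that package, another common pattern is to chain the roster requests: carry the partially built tournament item and the list of teams still to visit from callback to callback, and emit the tournament only once that list is empty. The sketch below models the idea in plain Python, so it is not the answer's method and not Scrapy code. `FAKE_PAGES`, `next_step` and `crawl` are hypothetical stand-ins for the site, the hand-off logic and the Scrapy engine. In a real spider the `state` dictionary would travel in `request.meta`, as the question already does with `participation_item`, and `next_step` would return either a `scrapy.Request` or the finished item.

```python
# Hypothetical stand-in for the site: URL -> already-parsed page content.
FAKE_PAGES = {
    "/tournament": {
        "name": "Grand Tournament",
        "ladder": [("Team A", 1, "/roster/a"), ("Team B", 2, "/roster/b")],
    },
    "/roster/a": ["player1", "player2"],
    "/roster/b": ["player3", "player4"],
}


def next_step(item, pending):
    """Return the next roster request, or the finished item if none remain."""
    if not pending:
        return ("item", item)
    team, url = pending[0]
    state = {"item": item, "team": team, "pending": pending[1:]}
    return ("request", url, state)


def parse_tournament(page):
    """First callback: build the tournament item and queue the roster pages."""
    item = {"name": page["name"], "rank": []}
    pending = [
        ({"team_name": name, "rank": rank}, url)
        for name, rank, url in page["ladder"]
    ]
    return next_step(item, pending)


def parse_roster(page, state):
    """Roster callback: attach the roster, then move on to the next team."""
    state["team"]["roster"] = list(page)
    state["item"]["rank"].append(state["team"])
    return next_step(state["item"], state["pending"])


def crawl(start_url):
    """Tiny driver playing the role of the Scrapy engine."""
    step = parse_tournament(FAKE_PAGES[start_url])
    while step[0] == "request":
        _, url, state = step
        step = parse_roster(FAKE_PAGES[url], state)
    return [step[1]]


output = crawl("/tournament")
print(output)
```

The trade-off is that the roster pages of one tournament are fetched one after another instead of concurrently, in exchange for a single nested item in the output.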

Related

OpenAI package leaving linebreak in response

I've started using the OpenAI API in R. I downloaded the openai package. I keep getting a double line break in the text response. Here's an example of my code:
library(openai)
vector = create_completion(
  model = "text-davinci-003",
  prompt = "Tell me what the weather is like in London, UK, in Celsius in 5 words.",
  max_tokens = 20,
  temperature = 0,
  echo = FALSE
)
vector_2 = vector$choices[1]
vector_2$text
[1] "\n\nRainy, mild, cool, humid."
Is there a way to get rid of this without 'correcting' the response text using other functions?
No, it's not possible.
The OpenAI API returns the completion with a leading \n\n by default, and the Completions endpoint has no parameter to control this.
You need to remove the line breaks manually.
Example response looks like this:
{
  "id": "cmpl-uqkvlQyYK7bGYrRHQ0eXlWi7",
  "object": "text_completion",
  "created": 1589478378,
  "model": "text-davinci-003",
  "choices": [
    {
      "text": "\n\nThis is indeed a test",
      "index": 0,
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "completion_tokens": 7,
    "total_tokens": 12
  }
}
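Removing the line breaks manually is a one-liner in most languages. Here is a minimal sketch in Python for illustration, using an abridged copy of the example response above; in R, base `trimws()` would play the same role. The variable names are my own.

```python
import json

# Abridged copy of the example response; the raw string keeps the
# backslash-n sequences intact so that the JSON parser decodes them.
raw = r'''{"choices": [{"text": "\n\nThis is indeed a test", "index": 0}]}'''

response = json.loads(raw)
text = response["choices"][0]["text"]

# Strip only leading newlines; the rest of the completion is left untouched.
clean = text.lstrip("\n")
print(repr(clean))
```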

Is my partition transform in Vega written correctly because the graph that is visualized is not accurate

I am creating a hierarchical representation of data in Vega, using the stratify and partition transforms. The problem lies with the x coordinates that the partition transform generates. In the link, navigate to the data viewer and select tree-map. For the topmost element of the hierarchy, "completed stories", x0 and x1 range from 0 to 650. The next two elements, "testable" and "not testable", should together span 0 to 650, but instead they span 0 to 455. The widths should be based on their quantities, stored in the "amount" field. Any suggestions as to why the generated rectangles are not commensurate with the quantities?
Link to Vega Editor with code shown
For your dataset "rawNumbers", values should only be provided for the leaf nodes when using the stratify transform.
{
  "name": "rawNumbers",
  "values": [
    {"id": "completed stories", "parent": null},
    {"id": "testable", "parent": "completed stories"},
    {"id": "not testable", "parent": "completed stories", "amount": 1435},
    {"id": "sufficiently tested", "parent": "testable"},
    {"id": "insufficiently tested", "parent": "testable"},
    {"id": "integration tested", "parent": "sufficiently tested", "amount": 1758},
    {"id": "unit tested", "parent": "sufficiently tested", "amount": 36},
    {"id": "partial coverage", "parent": "insufficiently tested", "amount": 298},
    {"id": "no coverage", "parent": "insufficiently tested", "amount": 341}
  ]
},
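To see why amounts on internal nodes distort the widths, here is a small Python model of the arithmetic. It assumes the partition layout totals values the way d3-hierarchy's `sum` does, that is, a node's total is its own value plus the totals of its children. That is my reading of the behaviour, not something taken from the spec above. The `amount=1000` on the root in the second case is purely hypothetical, and the helper names are my own.

```python
# Rows from the dataset above; only leaf nodes carry "amount".
rows = [
    {"id": "completed stories", "parent": None},
    {"id": "testable", "parent": "completed stories"},
    {"id": "not testable", "parent": "completed stories", "amount": 1435},
    {"id": "sufficiently tested", "parent": "testable"},
    {"id": "insufficiently tested", "parent": "testable"},
    {"id": "integration tested", "parent": "sufficiently tested", "amount": 1758},
    {"id": "unit tested", "parent": "sufficiently tested", "amount": 36},
    {"id": "partial coverage", "parent": "insufficiently tested", "amount": 298},
    {"id": "no coverage", "parent": "insufficiently tested", "amount": 341},
]


def subtree_totals(rows):
    """Total per node = its own amount (0 if absent) + totals of its children."""
    own = {r["id"]: r.get("amount", 0) for r in rows}
    children = {}
    for r in rows:
        children.setdefault(r["parent"], []).append(r["id"])

    totals = {}

    def total(node):
        totals[node] = own[node] + sum(total(c) for c in children.get(node, []))
        return totals[node]

    for root in children.get(None, []):
        total(root)
    return totals


def child_span(rows, parent, width):
    """Combined width that the children of `parent` receive out of `width`."""
    totals = subtree_totals(rows)
    kids = [r["id"] for r in rows if r["parent"] == parent]
    return width * sum(totals[k] for k in kids) / totals[parent]


# Leaves only: the two children fill the whole parent.
leaves_only = child_span(rows, "completed stories", 650)

# Hypothetical: the root also carries its own amount, so the children shrink.
with_root_amount = [dict(rows[0], amount=1000)] + rows[1:]
shrunk = child_span(with_root_amount, "completed stories", 650)

print(leaves_only, shrunk)
```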

split a key-value pair in Python

I have a dictionary as follows:
{
    "age": "76",
    "Bank": "98310",
    "Stage": "final",
    "idnr": "4578",
    "last number + Value": "[345:K]"}
I am trying to adjust the dictionary by splitting the last key-value pair and creating a new key ('Total data'). It should look like this:
    "Total data": [
        {
            "last number": "345",
            "Value": "K"
        }]
}
Does anyone know if there is a split function based on ':' and '+' or a for loop to accomplish this?
Thanks in advance.
One option is to take the last key from the dict, split the key on + and the value on : (after removing the outer square brackets). This assumes the format of the data is always the same.
If you want Total data to contain a list, you can wrap the resulting dict in [].
from pprint import pprint
d = {
    "age": "76",
    "Bank": "98310",
    "Stage": "final",
    "idnr": "4578",
    "last number + Value": "[345:K]"
}
last = list(d.keys())[-1]
d["Total data"] = dict(
    zip(
        last.strip().split('+'),
        d[last].strip("[]").split(':')
    )
)
pprint(d)
Output (tested with Python 3.9.4)
{'Bank': '98310',
'Stage': 'final',
'Total data': {' Value': 'K', 'last number ': '345'},
'age': '76',
'idnr': '4578',
'last number + Value': '[345:K]'}
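Note that the keys in the output above still carry the spaces around the + sign. A variant that trims them and wraps the result in a list, as the question requests, could look like the sketch below. Removing the original combined pair with `pop` is my assumption about the desired result, and the variable names are my own.

```python
d = {
    "age": "76",
    "Bank": "98310",
    "Stage": "final",
    "idnr": "4578",
    "last number + Value": "[345:K]",
}

# Remove the combined pair (an assumption) and split it into clean parts.
combined_key = "last number + Value"
raw_value = d.pop(combined_key)
keys = [k.strip() for k in combined_key.split("+")]
values = [v.strip() for v in raw_value.strip("[]").split(":")]

# Wrap the new dict in a list to match the layout shown in the question.
d["Total data"] = [dict(zip(keys, values))]
print(d)
```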

How to split column into rows by Object in array?

Hi all my dataframe looks somewhat like this:
**| Descriptor |**
[{"name": "Some name", "id": "L73871287"}, {"name": "Another name", "id": "L7123287"}]
[{"name": "Yet another name", "id": "L73556287"}, {"name": "Yet another name", "id": "L73556287"}]
How would one go about splitting this data by objects in R?
So to get:
**| Descriptor |**
{"name": "Some name", "id": "L73871287"}
{"name": "Another name", "id": "L7123287"}
{"name": "Yet another name", "id": "L73556287"}
{"name": "Yet another name", "id": "L73556287"}
Even better would be to just get a column "name" and a column "id", but I don't know if this is possible in R (I have a Python and JavaScript background, but the file was too large for Python).
Maybe this is what you are looking for:
library(jsonlite)
json <- '[{"name": "Some name", "id": "L73871287"}, {"name": "Another name", "id": "L7123287"}],
[{"name": "Yet another name", "id": "L73556287"}, {"name": "Yet another name", "id": "L73556287"}]'
ls <- fromJSON(txt = paste0("[", json, "]"))
do.call(rbind, ls)
#>               name        id
#> 1        Some name L73871287
#> 2     Another name  L7123287
#> 3 Yet another name L73556287
#> 4 Yet another name L73556287

Is there an R library or function for formatting international currency strings?

Here's a snippet of the JSON data I'm working with:
{
"item" = "Mexican Thing",
...
"raised": "19",
"currency": "MXN"
},
{
"item" = "Canadian Thing",
...
"raised": "42",
"currency": "CDN"
},
{
"item" = "American Thing",
...
"raised": "1",
"currency": "USD"
}
You get the idea.
I'm hoping there's a function out there that can take in a standard currency abbreviation and a number and spit out the appropriate string. I could theoretically write this myself except I can't pretend like I know all the ins and outs of this stuff and I'm bound to spend days and weeks being surprised by bugs or edge cases I didn't think of. I'm hoping there's a library (or at least a web api) already written that can handle this but my Googling has yielded nothing useful so far.
Here's an example of the result I want (let's pretend "currency" is the function I'm looking for)
currency("USD", "32") --> "$32"
currency("GBP", "45") --> "£45"
currency("EUR", "19") --> "€19"
currency("MXN", "40") --> "MX$40"
Assuming your real json is valid, then it should be relatively simple. I'll provide a valid json string, fixing the three invalid portions here: = should be :; ... is obviously a placeholder; and it should be a list wrapped in [ and ]:
js <- '[{
  "item": "Mexican Thing",
  "raised": "19",
  "currency": "MXN"
},
{
  "item": "Canadian Thing",
  "raised": "42",
  "currency": "CDN"
},
{
  "item": "American Thing",
  "raised": "1",
  "currency": "USD"
}]'
with(jsonlite::parse_json(js, simplifyVector = TRUE),
     paste(raised, currency))
# [1] "19 MXN" "42 CDN" "1 USD"
Edit: in order to change to specific currency characters, don't make this too difficult: just instantiate a lookup vector where "USD" (for example) prepends "$" and appends "" (nothing) to the raised string. (I say both prepend/append because I believe some currencies are always post-digits ... I could be wrong.)
pre_currency <- Vectorize(function(curr) switch(curr, USD="$", GBP="£", EUR="€", CDN="$", "?"))
post_currency <- Vectorize(function(curr) switch(curr, USD="", GBP="", EUR="", CDN="", "?"))
with(jsonlite::parse_json(js, simplifyVector = TRUE),
paste0(pre_currency(currency), raised, post_currency(currency)))
# [1] "?19?" "$42" "$1"
I intentionally left "MXN" out of the vector here to demonstrate that you need a default setting, "?" (pre/post) here. You may choose a different default/unknown currency value.
An alternative:
currency <- function(val, currency) {
  pre <- sapply(currency, switch, USD="$", GBP="£", EUR="€", CDN="$", "?")
  post <- sapply(currency, switch, USD="", GBP="", EUR="", CDN="", "?")
  paste0(pre, val, post)
}
with(jsonlite::parse_json(js, simplifyVector = TRUE),
     currency(raised, currency))
# [1] "?19?" "$42" "$1"
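The same lookup-with-default idea, sketched in Python for comparison. The symbol table is illustrative and deliberately incomplete. "MX$" follows the output requested in the question, "CDN" is kept only because the sample data uses it (the ISO 4217 code is CAD), and `format_currency` is a name of my own choosing.

```python
# Prefix and suffix per currency code; illustrative and incomplete.
SYMBOLS = {
    "USD": ("$", ""),
    "GBP": ("£", ""),
    "EUR": ("€", ""),
    "MXN": ("MX$", ""),
    "CDN": ("$", ""),  # as spelled in the sample data
}

# Fallback for codes missing from the table, mirroring the "?" default above.
UNKNOWN = ("?", "?")


def format_currency(code, amount):
    """Wrap `amount` in the prefix and suffix registered for `code`."""
    prefix, suffix = SYMBOLS.get(code, UNKNOWN)
    return f"{prefix}{amount}{suffix}"


records = [
    {"item": "Mexican Thing", "raised": "19", "currency": "MXN"},
    {"item": "Canadian Thing", "raised": "42", "currency": "CDN"},
    {"item": "American Thing", "raised": "1", "currency": "USD"},
]

formatted = [format_currency(r["currency"], r["raised"]) for r in records]
print(formatted)
```

Keeping prefix and suffix as a pair leaves room for currencies whose symbol is conventionally written after the digits.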

Resources