Scraping multiple pages with Scrapy and saving as a csv file - web-scraping

I want to scrape all the pages of Internshala and extract the Job ID, Job name, Company name and the Last date to apply and store everything in a csv to later convert to a dataframe.
import requests
import scrapy
from bs4 import BeautifulSoup
from scrapy import Selector
from scrapy.crawler import CrawlerProcess
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
import string
import pandas as pd
url='https://internshala.com/fresher-jobs'
sel=Selector(text=BeautifulSoup(requests.get(url).content).prettify())
pages=sel.xpath('//span[@id="total_pages"]').xpath('normalize-space(./text())').extract()
pages[0]=int(pages[0])
print(pages[0]) #which gives -> 4
class jobMan(scrapy.Spider):
    name='job'
    to_remove={0:["\n ","\n "],\
               1:['\n ','\n ']}
    def start_requests(self):
        urls="https://internshala.com/fresher-jobs/page-1"
        yield scrapy.Request(url=urls,callback=self.parse)
    def parse(self,response):
        ID=response.xpath('//div[@class="container-fluid individual_internship visibilityTrackerItem"]/@internshipid').extract()
        Job_Post = response.xpath('//div[@class="heading_4_5 profile"]/a').xpath('normalize-space(./text())').extract()
        Company = response.xpath('//a[@class="link_display_like_text"]').xpath('normalize-space(./text())').extract()
        Apply_By = response.xpath('//div[@class="internship_other_details_container"]/div[@class="other_detail_item_row"][2]//div[@class="item_body"]').xpath('normalize-space(./text())').extract()
        for page in range(2,pages[0]+1):
            yield(scrapy.Request(url=f"https://internshala.com/fresher-jobs/page-{page}",callback=self.parse))
        yield {
            'ID': ID,
            'Job':Job_Post,
            'Company':Company,
            'Apply_By':Apply_By
        }
process=CrawlerProcess(settings={
    'FEED_URI':'JOBSS.csv',
    'FEED_FORMAT':'csv'
})
process.crawl(jobMan)
process.start()
And then finally:
final=pd.read_csv('JOBSS.csv')
print(final)
Which gave me:
ID Job \
0 NaN Product Developer - Science,Salesforce Develop...
1 NaN Business Development Manager,Mobile App Develo...
2 NaN Software Engineer,Social Media Strategist And ...
3 NaN Reactjs Developer,Full Stack Developer,Busines...
Company \
0 Open Door Education,Aekot Consulting And Techn...
1 ISB Studienkolleg,TutorBin,Alphacore Technolog...
2 CrewKarma,Internshala,Mithi Software Technolog...
3 Startxlabs Technologies Private Limited,RavGin...
Apply_By
0 7 Aug' 21,7 Aug' 21,7 Aug' 21,7 Aug' 21,7 Aug'...
1 31 Jul' 21,30 Jul' 21,30 Jul' 21,31 Jul' 21,30...
2 24 Jul' 21,24 Jul' 21,23 Jul' 21,23 Jul' 21,23...
3 11 Jul' 21,11 Jul' 21,11 Jul' 21,11 Jul' 21,11...
Doubt_1: Why is it not printing the IDs? I tried scraping just the ID for the first page using the same XPath and got the correct output, but not while crawling.
Doubt_2: I wanted a dataframe in which, for example, the Job_Post column contains each job post's name on its own row, merged across all the pages, but I am getting one row per page.
How can I solve these issues? Please help.

Doubt_1: Why is it not printing the IDs? I tried scraping just the ID for the first page using the same XPath and got the correct output, but not while crawling.
Because the class attribute has spaces in it (it holds several class names), an exact match on @class is fragile. Use contains() instead:
ID=response.xpath('//div[contains(@class, "container-fluid individual_internship visibilityTrackerItem")]/@internshipid').extract()
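Doubt_2 is not addressed above. A minimal sketch of one way to get one row per job, assuming the four lists extracted in parse line up index by index (which only holds if every listing on the page has all four fields): zip them and yield one dict per job instead of one dict per page. The helper name rows_per_job and the sample values are hypothetical.

```python
def rows_per_job(ids, jobs, companies, apply_by):
    """Yield one dict per job from four parallel per-page lists."""
    # zip() pairs items purely by position and stops at the shortest list,
    # so this is only correct when the four lists are aligned.
    for job_id, job, company, deadline in zip(ids, jobs, companies, apply_by):
        yield {'ID': job_id, 'Job': job, 'Company': company, 'Apply_By': deadline}


# Hypothetical sample values standing in for one scraped page.
sample_rows = list(rows_per_job(
    ['101', '102'],
    ['Software Engineer', 'Reactjs Developer'],
    ['Company A', 'Company B'],
    ["24 Jul' 21", "11 Jul' 21"],
))
print(sample_rows[0])
```

Inside the spider's parse method, `yield from rows_per_job(ID, Job_Post, Company, Apply_By)` would take the place of the single `yield {...}`, so the CSV feed should get one line per job rather than one line per page.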

Related

How would I go about web scraping from an interactive map?

This pertains to this interactive map, https://www.newworld-map.com/?filters=ores
An example is the ores here, how would I go about getting the coordinates of each node? It looks like the html element is a Canvas and I could not for the life of me figure out where it pulls the data from for this.
Any help would be greatly appreciated
Hoping that OP's next question will be more in line with Stack Overflow's guidelines (see https://stackoverflow.com/help/minimal-reproducible-example), one way to solve this is to inspect the network calls made when the page loads and scrape the API endpoint the data is pulled from, like below:
import requests
import pandas as pd
import time
time_stamp = int(time.time_ns() / 1000)
ore_list = []
url = f'https://www.newworld-map.com/markers.json?time={time_stamp}'
ores= requests.get(url).json()['ores']
for ore in ores:
    for x in ores[ore]:
        ore_list.append((ore, x, ores[ore][x]['x'], ores[ore][x]['y']))
df = pd.DataFrame(ore_list, columns = ['Ore', 'Code', 'X_Coord', 'Y_Coord'])
print(df)
Result in terminal:
Ore Code X_Coord Y_Coord
0 brimstone 02d1ba070438d53ce5fbb1955cd7d694 7473.096191 8715.674805
1 brimstone 0a50c499af034aeb6f38e011648a2ea8 7471.124512 8709.161133
2 brimstone 0b5b190c31eb3d314d993dd393aadfe8 5670.894043 7862.319336
3 brimstone 0f5c7427c75d80e10f71f9e92ddc4362 5883.601562 7703.445801
4 brimstone 20b0801bdb41c7dafbb1053b43c25bd8 6020.838379 8147.747070
... ... ... ... ...
4260 starmetal 86h 8766.964000 8431.438000
4261 starmetal 86i 8598.688000 8562.974000
4262 starmetal 86j 8586.000000 8211.000000
4263 starmetal 86k 8688.938000 8509.722000
4264 starmetal 86l 8685.827000 8505.694000
4265 rows × 4 columns
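The loop above assumes that the 'ores' part of markers.json nests as ore type, then marker code, then coordinates. A minimal offline sketch of that flattening, run against a hand-made payload; the function name flatten_markers and the sample values are hypothetical, and the real response may contain other keys.

```python
def flatten_markers(ores):
    """Turn {ore: {code: {'x': ..., 'y': ...}}} into a list of flat tuples."""
    rows = []
    for ore, nodes in ores.items():
        for code, coords in nodes.items():
            rows.append((ore, code, coords['x'], coords['y']))
    return rows


# Hand-made stand-in for requests.get(url).json()['ores'].
sample_ores = {
    'brimstone': {'a1': {'x': 7473.1, 'y': 8715.7}},
    'starmetal': {'86h': {'x': 8766.9, 'y': 8431.4},
                  '86i': {'x': 8598.7, 'y': 8563.0}},
}
marker_rows = flatten_markers(sample_ores)
print(len(marker_rows))
```

The resulting list of tuples can be passed to pd.DataFrame with the same column names as above.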

Rvest: using css selector pulls data from different tab in URL

I am very new to scraping, and am trying to pull data from a section of this website - https://projects.fivethirtyeight.com/soccer-predictions/premier-league/. The data I'm trying to get is in the second tab, "Matches," and is the section titled "Upcoming Matches."
I have attempted to do this with SelectorGadget and using rvest, as follows -
library(rvest)
url <- ("https://projects.fivethirtyeight.com/soccer-predictions/premier-league/")
url %>%
html_nodes(".prob, .name") %>%
html_text()
This returns values, but they correspond to the first tab on the page, "Standings." How can I reference the section I am actually trying to pull?
First: I don't know R, only Python.
When you click Matches, the page uses JavaScript to generate the matches, and it loads JSON data from:
https://projects.fivethirtyeight.com/soccer-predictions/forecasts/2021_premier-league_forecast.json
https://projects.fivethirtyeight.com/soccer-predictions/forecasts/2021_premier-league_matches.json
https://projects.fivethirtyeight.com/soccer-predictions/forecasts/2021_premier-league_clinches.json
I checked only one of them, 2021_premier-league_matches.json, and I see it has data for Completed Matches.
I made an example in Python:
import requests
url = 'https://projects.fivethirtyeight.com/soccer-predictions/forecasts/2021_premier-league_matches.json'
response = requests.get(url)
data = response.json()
for item in data:
    # search date
    if item['datetime'].startswith('2022-03-16'):
        print('team1:', item['team1_code'], '|', item['team1'])
        print('prob1:', item['prob1'])
        print('score1:', item['score1'])
        print('adj_score1:', item['adj_score1'])
        print('chances1:', item['chances1'])
        print('moves1:', item['moves1'])
        print('---')
        print('team2:', item['team2_code'], '|', item['team2'])
        print('prob2:', item['prob2'])
        print('score2:', item['score2'])
        print('adj_score2:', item['adj_score2'])
        print('chances2:', item['chances2'])
        print('moves2:', item['moves2'])
        print('----------------------------------------')
Result:
team1: BHA | Brighton and Hove Albion
prob1: 0.30435
score1: 0
adj_score1: 0.0
chances1: 1.244
moves1: 1.682
---
team2: TOT | Tottenham Hotspur
prob2: 0.43627
score2: 2
adj_score2: 2.1
chances2: 1.924
moves2: 1.056
----------------------------------------
team1: ARS | Arsenal
prob1: 0.22114
score1: 0
adj_score1: 0.0
chances1: 0.569
moves1: 0.514
---
team2: LIV | Liverpool
prob2: 0.55306
score2: 2
adj_score2: 2.1
chances2: 1.243
moves2: 0.813
----------------------------------------
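The question asks for upcoming matches, while the example above filters on a single date. A minimal sketch of a filter for fixtures on or after a given day, assuming each item's datetime string starts with an ISO date (YYYY-MM-DD), as the startswith check above suggests. The function name upcoming_matches and the sample records are hypothetical.

```python
from datetime import date


def upcoming_matches(matches, today=None):
    """Return the matches whose date is on or after `today`."""
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    cutoff = (today or date.today()).isoformat()
    return [m for m in matches if m['datetime'][:10] >= cutoff]


# Made-up records that only mimic the 'datetime', 'team1', 'team2' keys.
sample = [
    {'datetime': '2022-03-16T19:30:00Z', 'team1': 'Team A', 'team2': 'Team B'},
    {'datetime': '2022-04-02T14:00:00Z', 'team1': 'Team C', 'team2': 'Team D'},
]
later = upcoming_matches(sample, today=date(2022, 3, 20))
for m in later:
    print(m['datetime'][:10], m['team1'], 'vs', m['team2'])
```

With the real feed, `upcoming_matches(response.json())` would be the equivalent call; whether the site's "Upcoming Matches" section uses exactly this cutoff is not confirmed here.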

Webscraping RequestGet from Airbnb not working properly

This query returns 0 or 20 at random every time I run it. Yesterday, when I looped through the pages, I always got 20 and was able to scrape 20 listings on each of 15 pages. But now I can't run my code properly because the listings sometimes come back as 0.
I tried adding headers to the GET request and sleeping for a random 5-10 s before each request, but I still face the same issue. I also tried connecting to a hotspot to change my IP, with the same result. Does anyone understand why?
import time
from random import randint
from bs4 import BeautifulSoup
import requests #to connect to url
airbnb_url = 'https://www.airbnb.com/s/Mayrhofen--Austria/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&date_picker_type=calendar&query=Mayrhofen%2C%20Austria&place_id=ChIJbzLYLzjdd0cRDtGuTzM_vt4&checkin=2021-02-06&checkout=2021-02-13&adults=4&source=structured_search_input_header&search_type=autocomplete_click'
soup = BeautifulSoup(requests.get(airbnb_url).content, 'html.parser')
listings = soup.find_all('div', '_8s3ctt')
print(len(listings))
It seems Airbnb returns two versions of the page: one "normal" HTML version, and another where the listings are stored inside <script>. To parse the <script> version of the page, you can use the following example:
import json
import requests
from bs4 import BeautifulSoup
def find_listing(d):
    if isinstance(d, dict):
        if "__typename" in d and d["__typename"] == "DoraListingItem":
            yield d["listing"]
        else:
            for v in d.values():
                yield from find_listing(v)
    elif isinstance(d, list):
        for v in d:
            yield from find_listing(v)
airbnb_url = "https://www.airbnb.com/s/Mayrhofen--Austria/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&date_picker_type=calendar&query=Mayrhofen%2C%20Austria&place_id=ChIJbzLYLzjdd0cRDtGuTzM_vt4&checkin=2021-02-06&checkout=2021-02-13&adults=4&source=structured_search_input_header&search_type=autocomplete_click"
soup = BeautifulSoup(requests.get(airbnb_url).content, "html.parser")
listings = soup.find_all("div", "_8s3ctt")
if len(listings):
    # normal page:
    print(len(listings))
else:
    # page that has listings stored inside <script>:
    data = json.loads(soup.select_one("#data-deferred-state").contents[0])
    for i, l in enumerate(find_listing(data), 1):
        print(i, l["name"])
Prints (when returned the <script> version):
1 Mariandl (MHO103) for 36 persons.
2 central and friendly! For Families and Friends
3 Sonnenheim for 5 persons.
4 MO's Apartments
5 MO's Apartments
6 Beautiful home in Mayrhofen with 3 Bedrooms
7 Quaint Apartment in Finkenberg near Ski Lift
8 Apartment 2 Villa Daringer (5 pax.)
9 Modern Apartment in Schwendau with Garden
10 Holiday flats Dornau, Mayrhofen
11 Maple View
12 Laubichl Lodge by Apart Hotel Therese
13 Haus Julia - Apartment Edelweiß Mayrhofen
14 Melcherhof,
15 Rest coke
16 Vacation home Traudl
17 Luxurious Apartment near Four Ski Lifts in Mayrhofen
18 Apartment 2 60m² for 2-4 persons "Binder"
19 Apart ZEMMGRUND, 4-9 persons in Mayrhofen/Tirol
20 Apartment Ahorn View
EDIT: To print lat, lng:
...
for i, l in enumerate(find_listing(data), 1):
    print(i, l["name"], l["lat"], l["lng"])
Prints:
1 Mariandl (MHO103) for 36 persons. 47.16522 11.85723
2 central and friendly! For Families and Friends 47.16209 11.859691
3 Sonnenheim for 5 persons. 47.16809 11.86694
4 MO's Apartments 47.166969 11.863186
...
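To see how the recursive search behaves without hitting the network, here is a self-contained sketch run against a small hand-made structure. The payload shape is invented for illustration; only the "__typename" == "DoraListingItem" check and the "listing" key come from the answer above.

```python
def find_listing(d):
    """Recursively yield the "listing" value of every DoraListingItem node."""
    if isinstance(d, dict):
        if d.get("__typename") == "DoraListingItem":
            yield d["listing"]
        else:
            for v in d.values():
                yield from find_listing(v)
    elif isinstance(d, list):
        for v in d:
            yield from find_listing(v)
    # Strings, numbers and None fall through and yield nothing.


# Invented payload: the real page data is far larger and nested differently.
fake_data = {
    "sections": [
        {"__typename": "Other", "items": [
            {"__typename": "DoraListingItem", "listing": {"name": "Maple View"}},
        ]},
        {"__typename": "DoraListingItem", "listing": {"name": "Rest coke"}},
    ]
}
names = [item["name"] for item in find_listing(fake_data)]
print(names)
```

Because the function is a generator that walks dicts and lists alike, it does not need to know the exact path to the listings, which is useful when the surrounding JSON layout is undocumented and may change.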

Python code to scrape ticker symbols from Yahoo finance

I have a list of more than 1,000 companies that I could invest in, and I need the ticker symbols for all of them. I am having difficulty stripping the output of the soup and looping through all the company names.
Please see an example of the site: https://finance.yahoo.com/lookup?s=asml. The idea is to replace asml with each company name, i.e. 'https://finance.yahoo.com/lookup?s=' + company, so I can loop through all the companies.
companies=df
Company name
0 Abbott Laboratories
1 ABBVIE
2 Abercrombie
3 Abiomed
4 Accenture Plc
This is the code I have now. The strip call doesn't work, and the loop over all the companies isn't working either.
#Create a function to scrape the data
def scrape_stock_symbols():
    Companies=df
    url= 'https://finance.yahoo.com/lookup?s='+ Companies
    page= requests.get(url)
    soup = BeautifulSoup(page.text, "html.parser")
    Company_Symbol=Soup.find_all('td',attrs ={'class':'data-col0 Ta(start) Pstart(6px) Pend(15px)'})
    for i in company_symbol:
        try:
            row = i.find_all('td')
            company_symbol.append(row[0].text.strip())
        except Exception:
            if company not in company_symbol:
                next(Company)
    return (company_symbol)
#Loop through every company in companies to get all of the tickers from the website
for Company in companies:
    try:
        (temp_company_symbol) = scrape_stock_symbols(company)
    except Exception:
        if company not in companies:
            next(Company)
Another difficulty is that the symbol lookup on Yahoo Finance retrieves many company names for a single query.
I will have to clean the data afterwards. I want to set the AMS exchange as the standard: if a company is listed on multiple exchanges, I am only interested in the AMS ticker symbol. The final goal is to create a new dataframe:
Company name Company_symbol
0 Abbott Laboratories ABT
1 ABBVIE ABBV
2 Abercrombie ANF
Here's a solution that doesn't require any scraping. It uses a package called yahooquery (disclaimer: I'm the author), which utilizes an API endpoint that returns symbols for a user's query. You can do something like this:
import pandas as pd
import yahooquery as yq
def get_symbol(query, preferred_exchange='AMS'):
    try:
        data = yq.search(query)
    except ValueError: # Will catch JSONDecodeError
        print(query)
    else:
        quotes = data['quotes']
        if len(quotes) == 0:
            return 'No Symbol Found'
        symbol = quotes[0]['symbol']
        for quote in quotes:
            if quote['exchange'] == preferred_exchange:
                symbol = quote['symbol']
                break
        return symbol
companies = ['Abbott Laboratories', 'ABBVIE', 'Abercrombie', 'Abiomed', 'Accenture Plc']
df = pd.DataFrame({'Company name': companies})
df['Company symbol'] = df.apply(lambda x: get_symbol(x['Company name']), axis=1)
Company name Company symbol
0 Abbott Laboratories ABT
1 ABBVIE ABBV
2 Abercrombie ANF
3 Abiomed ABMD
4 Accenture Plc ACN
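The exchange-preference logic inside get_symbol can be exercised without network access. A minimal sketch, assuming each quote is a dict with 'symbol' and 'exchange' keys as used above; the function name pick_symbol, the sample quotes, and the exchange code 'XXX' are made up for illustration.

```python
def pick_symbol(quotes, preferred_exchange='AMS'):
    """Return the symbol on the preferred exchange, else the first hit, else None."""
    if not quotes:
        return None
    for quote in quotes:
        if quote.get('exchange') == preferred_exchange:
            return quote['symbol']
    # No listing on the preferred exchange: fall back to the first result.
    return quotes[0]['symbol']


# Made-up quotes mimicking only the two keys the logic depends on.
sample_quotes = [
    {'symbol': 'AAA', 'exchange': 'XXX'},
    {'symbol': 'AAA.AS', 'exchange': 'AMS'},
]
print(pick_symbol(sample_quotes))
print(pick_symbol(sample_quotes, preferred_exchange='ZZZ'))
print(pick_symbol([]))
```

Returning None instead of the string 'No Symbol Found' is a design choice in this sketch: missing values are then easy to spot with pandas' isna() after the apply step.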

bytes object has no attribute find_all

I've been trying for the last 3 hours to scrape this website and get the rank, name, wins, and losses of each team.
When implementing this code:
import requests
from bs4 import BeautifulSoup
halo = requests.get("https://www.halowaypoint.com/en-us/esports/standings")
page = BeautifulSoup(halo.content, "html.parser")
final = page.encode('utf-8')
print(final.find_all("div"))
I keep getting this error: AttributeError: 'bytes' object has no attribute 'find_all'
If anyone can help me out then it would be much appreciated!
Thanks!
You are calling the method on the wrong variable. Use the BeautifulSoup object page, not the byte string final:
print(page.find_all("div"))
Getting the table data is pretty straightforward: all the data is inside the div with the CSS classes table and table--hcs (selector "div.table.table--hcs"):
halo = requests.get("https://www.halowaypoint.com/en-us/esports/standings")
page = BeautifulSoup(halo.content, "html.parser")
table = page.select_one("div.table.table--hcs")
print(",".join([td.text for td in table.select("header div.td")]))
for row in table.select("div.tr"):
    rank,team = row.select_one("span.numeric--medium.hcs-trend-neutral").text,row.select_one("div.td.hcs-title").span.a.text
    wins, losses = [div.span.text for div in row.select("div.td.em-7")]
    print(rank,team, wins, losses)
If we run the code, you can see the data matches the table:
In [4]: print(",".join([td.text for td in table.select("header div.td")]))
Rank,Team,Wins,Losses
In [5]: for row in table.select("div.tr"):
   ...:     rank,team = row.select_one("span.numeric--medium.hcs-trend-neutral").text,row.select_one("div.td.hcs-title").span.a.text
   ...:     wins, losses = [div.span.text for div in row.select("div.td.em-7")]
   ...:     print(rank,team, wins, losses)
   ...:
1 Counter Logic Gaming 10 1
2 Team EnVyUs 8 3
3 Enigma6 8 3
4 Renegades 6 5
5 Team Allegiance 5 6
6 Evil Geniuses 4 7
7 OpTic Gaming 2 9
8 Team Liquid 1 10
