How would I go about web scraping from an interactive map? - web-scraping

This pertains to this interactive map, https://www.newworld-map.com/?filters=ores
Take the ores as an example: how would I go about getting the coordinates of each node? The HTML element appears to be a canvas, and I could not for the life of me figure out where it pulls the data from.
Any help would be greatly appreciated.

Hoping that OP's next question will be more in line with Stack Overflow's guidelines (see https://stackoverflow.com/help/minimal-reproducible-example), one way to solve this is to inspect which network calls are made when the page loads and scrape the API endpoint the data is pulled from, like below:
import requests
import pandas as pd
import time

# timestamp appended as a cache-busting query parameter
time_stamp = int(time.time_ns() / 1000)
url = f'https://www.newworld-map.com/markers.json?time={time_stamp}'

# the map's marker data comes from a single JSON endpoint
ores = requests.get(url).json()['ores']

# flatten the nested dict into (ore type, node id, x, y) rows
ore_list = []
for ore in ores:
    for x in ores[ore]:
        ore_list.append((ore, x, ores[ore][x]['x'], ores[ore][x]['y']))

df = pd.DataFrame(ore_list, columns=['Ore', 'Code', 'X_Coord', 'Y_Coord'])
print(df)
Result in terminal:
            Ore                              Code      X_Coord      Y_Coord
0     brimstone  02d1ba070438d53ce5fbb1955cd7d694  7473.096191  8715.674805
1     brimstone  0a50c499af034aeb6f38e011648a2ea8  7471.124512  8709.161133
2     brimstone  0b5b190c31eb3d314d993dd393aadfe8  5670.894043  7862.319336
3     brimstone  0f5c7427c75d80e10f71f9e92ddc4362  5883.601562  7703.445801
4     brimstone  20b0801bdb41c7dafbb1053b43c25bd8  6020.838379  8147.747070
...         ...                               ...          ...          ...
4260  starmetal                               86h  8766.964000  8431.438000
4261  starmetal                               86i  8598.688000  8562.974000
4262  starmetal                               86j  8586.000000  8211.000000
4263  starmetal                               86k  8688.938000  8509.722000
4264  starmetal                               86l  8685.827000  8505.694000
[4265 rows x 4 columns]
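From there the result behaves like any other DataFrame, so it can be filtered or written out; a small follow-up example using the columns above (the output file name is just a placeholder):
# e.g. keep only the brimstone nodes, and save everything for later use
brimstone = df[df['Ore'] == 'brimstone']
df.to_csv('ore_nodes.csv', index=False)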

Related

Rvest: using css selector pulls data from different tab in URL

I am very new to scraping, and am trying to pull data from a section of this website - https://projects.fivethirtyeight.com/soccer-predictions/premier-league/. The data I'm trying to get is in the second tab, "Matches," and is the section titled "Upcoming Matches."
I have attempted to do this with SelectorGadget and using rvest, as follows -
library(rvest)
url <- "https://projects.fivethirtyeight.com/soccer-predictions/premier-league/"
url %>%
  html_nodes(".prob, .name") %>%
  html_text()
This returns values, but they correspond to the first tab on the page, "Standings". How can I reference the correct section that I am trying to pull?
First: I don't know R, only Python.
When you click "Matches", the page uses JavaScript to generate the matches, and it loads the JSON data from:
https://projects.fivethirtyeight.com/soccer-predictions/forecasts/2021_premier-league_forecast.json
https://projects.fivethirtyeight.com/soccer-predictions/forecasts/2021_premier-league_matches.json
https://projects.fivethirtyeight.com/soccer-predictions/forecasts/2021_premier-league_clinches.json
I checked only one of them - 2021_premier-league_matches.json - and I can see it has the data for the Completed Matches.
I made an example in Python:
import requests

url = 'https://projects.fivethirtyeight.com/soccer-predictions/forecasts/2021_premier-league_matches.json'
response = requests.get(url)
data = response.json()

for item in data:
    # search date
    if item['datetime'].startswith('2022-03-16'):
        print('team1:', item['team1_code'], '|', item['team1'])
        print('prob1:', item['prob1'])
        print('score1:', item['score1'])
        print('adj_score1:', item['adj_score1'])
        print('chances1:', item['chances1'])
        print('moves1:', item['moves1'])
        print('---')
        print('team2:', item['team2_code'], '|', item['team2'])
        print('prob2:', item['prob2'])
        print('score2:', item['score2'])
        print('adj_score2:', item['adj_score2'])
        print('chances2:', item['chances2'])
        print('moves2:', item['moves2'])
        print('----------------------------------------')
Result:
team1: BHA | Brighton and Hove Albion
prob1: 0.30435
score1: 0
adj_score1: 0.0
chances1: 1.244
moves1: 1.682
---
team2: TOT | Tottenham Hotspur
prob2: 0.43627
score2: 2
adj_score2: 2.1
chances2: 1.924
moves2: 1.056
----------------------------------------
team1: ARS | Arsenal
prob1: 0.22114
score1: 0
adj_score1: 0.0
chances1: 0.569
moves1: 0.514
---
team2: LIV | Liverpool
prob2: 0.55306
score2: 2
adj_score2: 2.1
chances2: 1.243
moves2: 0.813
----------------------------------------
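If you would rather work with the whole feed at once than loop over it, the same JSON loads straight into pandas; a small sketch, assuming the endpoint and the field names used above are still valid:
import pandas as pd
import requests

url = 'https://projects.fivethirtyeight.com/soccer-predictions/forecasts/2021_premier-league_matches.json'
df = pd.DataFrame(requests.get(url).json())

# one day's fixtures, with a few of the fields printed above
day = df[df['datetime'].str.startswith('2022-03-16')]
print(day[['datetime', 'team1', 'team2', 'prob1', 'prob2']])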

Scraping multiple pages with Scrapy and saving as a csv file

I want to scrape all the pages of Internshala and extract the Job ID, Job name, Company name and the Last date to apply and store everything in a csv to later convert to a dataframe.
import requests
import scrapy
from bs4 import BeautifulSoup
from scrapy import Selector
from scrapy.crawler import CrawlerProcess
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
import string
import pandas as pd
url = 'https://internshala.com/fresher-jobs'
sel = Selector(text=BeautifulSoup(requests.get(url).content).prettify())
pages = sel.xpath('//span[@id="total_pages"]').xpath('normalize-space(./text())').extract()
pages[0] = int(pages[0])
print(pages[0])  # which gives -> 4
class jobMan(scrapy.Spider):
    name = 'job'
    to_remove = {0: ["\n ", "\n "],
                 1: ['\n ', '\n ']}

    def start_requests(self):
        urls = "https://internshala.com/fresher-jobs/page-1"
        yield scrapy.Request(url=urls, callback=self.parse)

    def parse(self, response):
        ID = response.xpath('//div[@class="container-fluid individual_internship visibilityTrackerItem"]/@internshipid').extract()
        Job_Post = response.xpath('//div[@class="heading_4_5 profile"]/a').xpath('normalize-space(./text())').extract()
        Company = response.xpath('//a[@class="link_display_like_text"]').xpath('normalize-space(./text())').extract()
        Apply_By = response.xpath('//div[@class="internship_other_details_container"]/div[@class="other_detail_item_row"][2]//div[@class="item_body"]').xpath('normalize-space(./text())').extract()
        for page in range(2, pages[0] + 1):
            yield scrapy.Request(url=f"https://internshala.com/fresher-jobs/page-{page}", callback=self.parse)
        yield {
            'ID': ID,
            'Job': Job_Post,
            'Company': Company,
            'Apply_By': Apply_By
        }

process = CrawlerProcess(settings={
    'FEED_URI': 'JOBSS.csv',
    'FEED_FORMAT': 'csv'
})
process.crawl(jobMan)
process.start()
And then finally:
final=pd.read_csv('JOBSS.csv')
print(final)
Which gave me:
ID Job \
0 NaN Product Developer - Science,Salesforce Develop...
1 NaN Business Development Manager,Mobile App Develo...
2 NaN Software Engineer,Social Media Strategist And ...
3 NaN Reactjs Developer,Full Stack Developer,Busines...
Company \
0 Open Door Education,Aekot Consulting And Techn...
1 ISB Studienkolleg,TutorBin,Alphacore Technolog...
2 CrewKarma,Internshala,Mithi Software Technolog...
3 Startxlabs Technologies Private Limited,RavGin...
Apply_By
0 7 Aug' 21,7 Aug' 21,7 Aug' 21,7 Aug' 21,7 Aug'...
1 31 Jul' 21,30 Jul' 21,30 Jul' 21,31 Jul' 21,30...
2 24 Jul' 21,24 Jul' 21,23 Jul' 21,23 Jul' 21,23...
3 11 Jul' 21,11 Jul' 21,11 Jul' 21,11 Jul' 21,11...
Doubt 1: Why is it not printing the IDs? I tried scraping just the ID for the first page using the same XPath and got the correct output, but not while crawling.
Doubt 2: I wanted a dataframe in which, for example, the Job_Post column contains each job post's name from all the pages as its own row, but instead I am getting one row per page.
How can I solve these issues? Please help.
Doubt 1: Why is it not printing the IDs? I tried scraping just the ID for the first page using the same XPath and got the correct output, but not while crawling.
Because the class attribute holds several space-separated class names, use contains() instead of an exact match:
ID = response.xpath('//div[contains(@class, "container-fluid individual_internship visibilityTrackerItem")]/@internshipid').extract()
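A quick way to see the difference with Scrapy's own Selector on a made-up snippet (the extra class and the id value here are purely illustrative):
from scrapy import Selector

html = '<div class="container-fluid individual_internship visibilityTrackerItem extra" internshipid="12345"></div>'
sel = Selector(text=html)

# the exact @class comparison fails as soon as the element carries any extra class
print(sel.xpath('//div[@class="container-fluid individual_internship visibilityTrackerItem"]/@internshipid').get())  # None
# contains() still matches
print(sel.xpath('//div[contains(@class, "individual_internship")]/@internshipid').get())  # 12345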

Python code to scrape ticker symbols from Yahoo finance

I have a list of more than 1,000 companies which I could use to invest in. I need the ticker symbols for all of these companies. I run into difficulties when trying to strip the output of the soup and when looping through all of the company names.
Please see an example of the site: https://finance.yahoo.com/lookup?s=asml. The idea is to replace asml with each company name ('https://finance.yahoo.com/lookup?s=' + company) so I can loop through all the companies.
companies = df
          Company name
0  Abbott Laboratories
1               ABBVIE
2          Abercrombie
3              Abiomed
4        Accenture Plc
This is the code I have now; the strip code doesn't work, and the loop over all the companies isn't working either.
#Create a function to scrape the data
def scrape_stock_symbols():
    Companies = df
    url = 'https://finance.yahoo.com/lookup?s=' + Companies
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "html.parser")
    Company_Symbol = Soup.find_all('td', attrs={'class': 'data-col0 Ta(start) Pstart(6px) Pend(15px)'})
    for i in company_symbol:
        try:
            row = i.find_all('td')
            company_symbol.append(row[0].text.strip())
        except Exception:
            if company not in company_symbol:
                next(Company)
    return (company_symbol)

#Loop through every company in companies to get all of the tickers from the website
for Company in companies:
    try:
        (temp_company_symbol) = scrape_stock_symbols(company)
    except Exception:
        if company not in companies:
            next(Company)
Another difficulty is that the symbol lookup on Yahoo Finance will retrieve many company names, so I will have to clean the data afterwards. I want to set the AMS exchange as the standard: if a company is listed on multiple exchanges, I am only interested in the AMS ticker symbol. The final goal is to create a new dataframe:
          Company name Company_symbol
0  Abbott Laboratories            ABT
1               ABBVIE           ABBV
2          Abercrombie            ANF
Here's a solution that doesn't require any scraping. It uses a package called yahooquery (disclaimer: I'm the author), which utilizes an API endpoint that returns symbols for a user's query. You can do something like this:
import pandas as pd
import yahooquery as yq

def get_symbol(query, preferred_exchange='AMS'):
    try:
        data = yq.search(query)
    except ValueError:  # Will catch JSONDecodeError
        print(query)
    else:
        quotes = data['quotes']
        if len(quotes) == 0:
            return 'No Symbol Found'
        symbol = quotes[0]['symbol']
        for quote in quotes:
            if quote['exchange'] == preferred_exchange:
                symbol = quote['symbol']
                break
        return symbol

companies = ['Abbott Laboratories', 'ABBVIE', 'Abercrombie', 'Abiomed', 'Accenture Plc']
df = pd.DataFrame({'Company name': companies})
df['Company symbol'] = df.apply(lambda x: get_symbol(x['Company name']), axis=1)
          Company name Company symbol
0  Abbott Laboratories            ABT
1               ABBVIE           ABBV
2          Abercrombie            ANF
3              Abiomed           ABMD
4        Accenture Plc            ACN

Too many values in one argument case_when?

I am not sure why this code doesn't run, but if I break it into two smaller chunks it works. Is there any way I can run the whole chunk at once?
When I run this code, the plus sign appears in the console and I can't click Run in R Markdown.
dataT4<- dataT4 %>% mutate (coupleID=case_when(id==10011~1, id==10021~2,
id==10032~3, id==10041~4,id==10062~5, id==10071~6,id==10082~7, id==10092~8,
id==10112~9, id==10121~10,id== 10131~11, id==10142~12, id==10151~13,
id==10162~14,id==10171~15, id==10181~16, id==10202~17, id==10212~18, id==10221~19,
id==10232~20, id==10242~21, id==10251~22, id==10262~23, id==10271~24, id==10292~25,
id==10311~26, id==10332~27, id==10342~28, id==10351~29, id==10361~30, id==10372~31,
id==10382~32, id==10391~33, id==10401~34, id==10412~35, id==10421~36, id==10432~37,
id==10442~38, id==10452~39, id==10461~40, id==10471~41, id==10481~42, id==10492~43,
id==10501~44, id==10511~45, id==10521~46, id==10532~47, id==10542~48, id==10562~49,
id==10581~50, id==10592~51, id==10602~52, id==10611~53, id==10642~54, id==10651~55,
id==10662~56, id==10672~57, id==10681~58, id==10702~59, id==10761~60, id==10782~61,
id==10791~62, id==10802~63, id==10812~64, id==10822~65, id==10831~66, id==10852~67,
id==10862~68, id==10881~69, id==10912~70, id==10942~71, id==10951~72, id==10962~73,
id==10972~74, id==10982~75, id==10992~76, id==11001~77, id==11031~78, id==11052~79,
id==11061~80, id==11072~81, id==11092~82, id==11101~83, id==11112~84, id==11171~85,
id==11192~86, id==11202~87, id==11221~88, id==11231~89, id==11252~90, id==11261~91,
id==11281~92, id==11292~93, id==11322~94, id==11332~95, id==11372~96, id==11382~97,
id==11391~98, id==11411~99, id==11422~100, id==11441~101, id==11461~102,
id==11471~103, id==11492~104, id==11501~105, id==11512~106,
id==11521~107,id==11562~108,id==11591~109, id==11601~110, id==11611~111,
id==11621~112, id==11632~113, id==11641~114, id==11651~115, id==11662~116,
id==11682~117,id==11691~118,id==11712~119, id==11771~120, id==11782~121,
id==11811~122, id==11821~123, id==11831~124, id==11841~125, id==11852~126,
id==11861~127,id==11872~128,id==11882~129, id==11892~130, id==11902~131,
id==11911~132, id==11922~133, id==11961~134, id==11972~135,
id==11992~136,id==12011~137, id==12041~138, id==12052~139, id==12061~140,
id==12081~141, id==12101~142, id==12111~143, id==12122~144, id==12131~145,
id==12142~146, id==12151~147, id==12161~148, id==12182~149, id==12191~150,
id==12201~151, id==12232~152, id==12261~153, id==12272~154, id==12322~155,
id==12332~156, id==12342~157, id==12352~158, id==12382~159, id==12392~160,
id==12401~161, id==12411~162, id==12421~163, id==12432~164, id==12441~165,
id==12451~166, id==12461~167, id==12471~168, id==12492~169, id==12501~170,
id==12512~171, id==12521~172, id==12542~173, id==12552~174, id==12562~175,
id==12572~176, id==12581~177, id==12612~178, id==12622~179, id==12652~180,
id==12662~181, id==12682~182, id==12701~183, id==12712~184, id==12731~185,
id==12741~186, id==12762~187, id==12792~188, id==12802~189, id==12811~190,
id==12822~191, id==12832~192, id==12841~193, id==12862~194, id==12882~195,
id==12891~196, id==12911~197, id==12931~198, id==12942~199, id==12952~200,
id==12961~201, id==12972~202, id==13011~203, id==13021~204, id==13032~205,
id==13042~206, id==13061~207, id==13082~208, id==13102~209, id==13111~210,
id==13132~211, id==13142~212, id==13151~213, id==13162~214, id==13191~215,
id==13202~216, id==13212~217, id==13262~218, id==13271~219, id==13281~220,
id==13311~221, id==13322~222, id==13331~223, id==13351~224, id==13361~225,
id==13372~226, id==13422~227, id==13432~228, id==13452~229, id==13462~230,
id==13472~231, id==13481~232, id==13501~233, id==13511~234, id==13521~235,
id==13561~236, id==13571~237, id==13601~238, id==13612~239, id==13632~240,
id==13642~241, id==13652~242, id==13662~243, id==13671~244, id==13681~245,
id==13691~246, id==13701~247, id==13711~248, id==13732~249, id==13742~250,
id==13752~251, id==13782~252, id==13842~253, id==13802~254, id==13822~255,
id==13851~256, id==13872~257, id==13882~258, id==13892~259, id==13912~260,
id==13921~261, id==13932~262, id==13941~263, id==13952~264, id==13971~265,
id==13981~266, id==13992~267, id==14011~268, id==14021~269, id==14031~270,
id==14041~271, id==14052~272, id==14072~273, id==14111~274, id==14131~275,
id==14162~276, id==14172~277, id==14182~278, id==14191~279, id==14212~280,
id==14222~281, id==14241~282, id==14261~283, id==14291~284, id==14302~285,
id==14312~286, id==14321~287, id==14342~288, id==14352~289, id==14362~290,
id==14371~291, id==14392~292, id==14402~293, id==14432~294, id==14451~295,
id==14472~296, id==14482~297, id==14491~298, id==14511~299, id==14521~300,
id==14531~301, id==14541~302, id==14552~303, id==14562~304, id==14572~305,
id==14581~306, id==14592~307, id==14602~308, id==14621~309, id==14632~310,
id==14641~311, id==14651~312, id==14671~313, id==14681~314, id==14692~315,
id==14712~316, id==14722~317, id==14732~318, id==14741~319, id==14751~320,
id==14781~321, id==14792~322, id==14812~323, id==14842~324, id==14852~325,
id==14862~326, id==14882~327, id==14892~328, id==14901~329, id==11012~330))
As a single statement it is just too long to be parsed. You may be better served putting all of these values into a separate data.frame and merging it into your data instead of using a giant case_when.
Usually when I want to do something like this, I'll open Excel or something similar, put the column names in the first row (here that would be id and coupleID), enter all of the values, save it as a CSV, read the CSV into R as a data.frame, and then merge it, as sketched below.
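A minimal sketch of that merge approach with dplyr's left_join (the CSV file name is just a placeholder, and only the first few id/coupleID pairs from the question are shown):
library(dplyr)

# lookup table - in practice, read it from the CSV you prepared,
# e.g. couple_lookup <- read.csv("couple_ids.csv")
couple_lookup <- data.frame(id       = c(10011, 10021, 10382, 11012),
                            coupleID = c(1, 2, 32, 330))

dataT4 <- data.frame(id = c(10011, 10021, 10382, 11012))
dataT4 <- dataT4 %>% left_join(couple_lookup, by = "id")
dataT4
#      id coupleID
# 1 10011        1
# 2 10021        2
# 3 10382       32
# 4 11012      330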
You can use rank:
dataT4 <- data.frame(id=c(10011, 10021, 10382, 11012))
dataT4 <- dataT4 %>% mutate (coupleID=rank(id))
dataT4
     id coupleID
1 10011        1
2 10021        2
3 10382        3
4 11012        4
Data:
dataT4 <- data.frame(id=c(10011, 10021, 10382, 11012))

read csv into index of year, dayofyear, and hour/min into a pandas datetime object

I am trying to read in a csv in this form:
2014,92,1931,6.234,10.14
2014,92,1932,5.823,9.49
2014,92,1933,5.33,7.65
2014,92,1934,4.751,6.19
2014,92,1935,4.156,5.285
2014,92,1936,3.962,4.652
2014,92,1937,3.74,4.314
2014,92,1938,3.325,3.98
2014,92,1939,2.909,3.847
2014,92,1940,2.878,3.164
To be clear, this is (Year, Day of year, 2400hr time, and 2 columns of values).
I gave this some thought in a previous question, but to no avail, and it's proving to be a matter of a few problems... (Create an indexed datetime from date/time info in 3 columns using pandas)
As noted in the above question, the following "read_csv" attempt
df = pd.read_csv("home_prepped.dat", parse_dates={"dt" : [0,1,2]},
                 date_parser=parser, header=None)
triggers a TypeError:
TypeError: parser() takes exactly 1 argument (3 given)
This is due to the "parse_dates" arg having 0,1,2 in it.
I have also tried putting them in double brackets [[0,1,2]] and get:
ValueError: [0, 1, 2] is not in list
I have gotten past this by setting parse_dates=True and thought I could just set_index after but get this:
TypeError: must be string, not numpy.int64
My parser gets hung up on the format too, and I have read conflicting stories about zero-padding the "day of year" value. Mine are not zero-padded, but even so, the above errors aside, the format gets hung up on the first value, the year! Here is the parser:
def parser(x):
    return pd.datetime.strptime(x, '%Y %j %H%M')
So yes, I have had errors saying '2014' was not recognized and '92' (day of year) was not recognized, but I have been encouraged because at least strptime has been able to make its way "through" to try out the format.
I am wondering if this has something to do with my data.
I am looking for a way to get this datetime info indexed as a datetime, and I have had nothing but problems. I have gone ahead and zero-padded the day-of-year values in case someone wants to test whether the padding is the problem; see below:
2014,092,1931,6.234,10.14
2014,092,1932,5.823,9.49
2014,092,1933,5.33,7.65
2014,092,1934,4.751,6.19
2014,092,1935,4.156,5.285
2014,092,1936,3.962,4.652
2014,092,1937,3.74,4.314
2014,092,1938,3.325,3.98
2014,092,1939,2.909,3.847
2014,092,1940,2.878,3.164
Thanks for your help guys, I am starting to really get frustrated here :S
After correcting your %m (month) to %M (minute), your code works for me:
>>> import pandas as pd
>>> print pd.version.version
0.15.2-10-gf7af818
>>>
>>> def parser(x):
... return pd.datetime.strptime(x, '%Y %j %H%M')
...
>>> df = pd.read_csv("home_prepped.dat", parse_dates={"dt" : [0,1,2]},
... date_parser=parser, header=None)
>>> df
                   dt      3       4
0 2014-04-02 19:31:00  6.234  10.140
1 2014-04-02 19:32:00  5.823   9.490
2 2014-04-02 19:33:00  5.330   7.650
3 2014-04-02 19:34:00  4.751   6.190
4 2014-04-02 19:35:00  4.156   5.285
5 2014-04-02 19:36:00  3.962   4.652
6 2014-04-02 19:37:00  3.740   4.314
7 2014-04-02 19:38:00  3.325   3.980
8 2014-04-02 19:39:00  2.909   3.847
9 2014-04-02 19:40:00  2.878   3.164
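If you also want that column as the index, which was the original goal, one more step on top of the df above should do it:
df = df.set_index('dt')  # the frame now has a DatetimeIndex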
But after playing around with this for a little while, there are some very strange behaviours when an error happens, leading to some odd error messages, so I can see why it's very hard to debug this.
If for some reason the above isn't working, you could try doing the parsing yourself:
df = pd.read_csv("home_prepped.dat", header=None)
# join the first three columns (year, day of year, HHMM) into one string per row
timestr = df.iloc[:, :3].astype(str).apply(' '.join, axis=1)
# keep only the value columns
df = df.iloc[:, 3:]
times = pd.to_datetime(timestr, format='%Y %j %H%M')
df["dt"] = times
As mentioned above, when something goes wrong (e.g. a parse error) the error messages are very confusing from within read_csv.
The following seems to work, I think. Keep in mind this is the first time I have ever brought anything into pandas to work with, so I am not sure how to properly test it, but it recognizes the format and says:
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-04-02 19:31:00, ..., 2014-12-21 23:59:00]
Length: 337917, Freq: None, Timezone: None
Which is sweet, as I believe this means I have finally indexed a datetime!
Here is what I did...
In [41]:
import numpy as np
import pandas as pd
from datetime import datetime
In [60]:
def parse(yr, yearday, hrmn):
    date_string = ''.join([yr, yearday, hrmn])
    return datetime.strptime(date_string, "%Y%j%H%M")
In [61]:
df = pd.read_csv('home_prepped.csv', parse_dates={'datetime':[0,1,2]}, date_parser=parse, index_col='datetime', header=None)
I had tried putting a space inside the '' before the .join, and it separated the %Y and %j but only managed to pick up a "1" as part of the %H, so I got rid of the space and changed the format to be spaceless as well.
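For what it's worth, the spaceless format also seems to cope with the original, un-padded day-of-year values; a quick check:
from datetime import datetime

# day of year 92, un-padded and zero-padded, both parse with the spaceless format
print(datetime.strptime('2014921931', '%Y%j%H%M'))   # 2014-04-02 19:31:00
print(datetime.strptime('20140921931', '%Y%j%H%M'))  # 2014-04-02 19:31:00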
Thanks for your work on this DSM.
