Web Scrape interactive map coordinates - web-scraping

I am struggling with how to scrape an interactive map (or its coordinates) from a website;
below is an example of the map I would like to scrape with requests / bs4.
The idea is to scrape around 100 map locations and plot them on a map.
Could you please advise on how to scrape the map at the bottom of this page:
https://www.njuskalo.hr/nekretnine/gradevinsko-zemljiste-zagreb-lucko-5000-m2-oglas-34732559

The location data is hidden in a script tag within the HTML; you can get it out like this:
import requests
import json

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'}
url = 'https://www.njuskalo.hr/nekretnine/gradevinsko-zemljiste-zagreb-lucko-5000-m2-oglas-34732559'

resp = requests.get(url, headers=headers)

# The marker JSON sits between these two substrings in the page source.
start = '"defaultMarker":'
end = ',"cluster":{"icon1'

s = resp.text
dirty_json = s[s.find(start) + len(start):s.rfind(end)].strip()  # get the JSON out of the HTML
clean_json = json.loads(dirty_json)
print(clean_json['lat'], clean_json['lng'])
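To scale this up to ~100 listings and plot them, you can wrap the extraction in a loop. Below is a minimal sketch, assuming you have already collected the listing URLs you care about (only the example URL from the question is filled in) and that every listing page embeds the same "defaultMarker" JSON; it plots longitude against latitude with matplotlib as a simple scatter "map":
import json
import requests
import matplotlib.pyplot as plt

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'}

def extract_lat_lng(html):
    # Slice the marker JSON out of the page source, same technique as above.
    start = '"defaultMarker":'
    end = ',"cluster":{"icon1'
    marker = json.loads(html[html.find(start) + len(start):html.rfind(end)].strip())
    return marker['lat'], marker['lng']

# Hypothetical list of listing URLs -- fill in the ~100 pages you want to plot.
listing_urls = [
    'https://www.njuskalo.hr/nekretnine/gradevinsko-zemljiste-zagreb-lucko-5000-m2-oglas-34732559',
]

coords = []
for url in listing_urls:
    resp = requests.get(url, headers=headers)
    try:
        coords.append(extract_lat_lng(resp.text))
    except (ValueError, KeyError):
        pass  # skip pages that don't embed the marker JSON

if coords:
    lats, lngs = zip(*coords)
    plt.scatter(lngs, lats, s=10)
    plt.xlabel('Longitude')
    plt.ylabel('Latitude')
    plt.show()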

Related

How to read JSON field in Kusto query when fields are dynamic

I am working with the JSON data (below) resulting from the following query.
SignInLogs
| project AdditionalDetails
Results
[{"value":"test.com","key":"TenantId"},{"value":"PC100921","key":"PolicyId"},{"value":"f4525425-60ff-42a7-acf4-f88c4266431f","key":"ApplicationId"},{"value":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36","key":"Client"},{"value":"SMS","key":"VerificationMethod"},{"value":"+1232123211","key":"PhoneNumber"},{"value":"e000::5890, 128.1.1.1","key":"ClientIpAddress"},{"value":"https://test.com","key":"DomainName"}]
I would like to access a particular field, e.g. PolicyId, using a query like SignInLogs | extend Policy = extractjson("$.[1].value", tostring(AdditionalDetails)) | project Policy. However, since the ordering of the fields and their presence are not guaranteed, I can't always use [1] as an index.
Is there a better way to access JSON fields when ordering and availability are not guaranteed? In other languages you can check for an empty reference and access a field by key name.
Something like this?
let T = datatable(AdditionalDetails:dynamic)[dynamic([{"value":"test.com","key":"TenantId"},{"value":"PC100921","key":"PolicyId"},{"value":"f4525425-60ff-42a7-acf4-f88c4266431f","key":"ApplicationId"},{"value":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36","key":"Client"},{"value":"SMS","key":"VerificationMethod"},{"value":"+1232123211","key":"PhoneNumber"},{"value":"e000::5890, 128.1.1.1","key":"ClientIpAddress"},{"value":"https://test.com","key":"DomainName"}])];
T
| mv-apply AdditionalDetails on (
    extend IP = iif(AdditionalDetails.key == "ClientIpAddress", tostring(AdditionalDetails.value), ""),
           PolicyId = iif(AdditionalDetails.key == "PolicyId", tostring(AdditionalDetails.value), "")
    | where isnotempty(IP) or isnotempty(PolicyId)
    | summarize take_any(IP), take_any(PolicyId)
)
IP                       PolicyId
e000::5890, 128.1.1.1    PC100921
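Just to illustrate the "access by key name" idea the question mentions from other languages (this is not part of the Kusto solution), here is a minimal Python sketch that turns a key/value array like the one above into a dictionary, so lookups work regardless of ordering or whether a field is present:
import json

# A subset of the AdditionalDetails array shown in the question.
additional_details = json.loads(
    '[{"value":"test.com","key":"TenantId"},'
    '{"value":"PC100921","key":"PolicyId"},'
    '{"value":"e000::5890, 128.1.1.1","key":"ClientIpAddress"}]'
)

# Build a key -> value lookup so ordering and presence no longer matter.
details = {item['key']: item['value'] for item in additional_details}

print(details.get('PolicyId'))            # PC100921
print(details.get('ClientIpAddress'))     # e000::5890, 128.1.1.1
print(details.get('VerificationMethod'))  # None when the field is absent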

R webshot package login and redirect: issue with selector

A tool we use has regular updates. To keep the screenshots in our documentation up to date, I would like to start using Rmd and the webshot package. A first step would be to be able to log in (and next to redirect to the desired page).
Based on the example in the package, I tried the code below to log in, but it triggers an "element not found" error.
https://dev79379.service-now.com/login.do is the login page where I took the selectors from
https://dev79379.service-now.com/home would be one of the URLs of interest
So, I have two questions:
What would be the correct selector to find the element?
How can I redirect from login.do to /home?
library(webshot)

url <- "https://dev79379.service-now.com/home"
fn <- tools::file_path_sans_ext(basename(url))

webshot(url, paste0(fn, ".png"),
        useragent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36",
        eval = "casper.then(function() {
          // Enter username and password
          this.sendKeys('#user_name.form-control input[type=\"text\"]', 'test');
          this.sendKeys('#user_password.form-control input[type=\"password\"]', 'test');
          // Now click in the search box. This results in a box expanding below
          this.click('#sysverb_login.pull-right.btn.bt-primary input[type=\"submit\"]');
          // Wait 500ms
          this.wait(500);
        });"
)

Webscraping of Weather Website returns nil

I'm new to Python and I'm trying to get the temperature from The Weather Network, but I receive no value for the temperature. Can someone please help me with this? I've been stuck on it for a while. :( Thank you in advance!
import time
import schedule
import requests
from bs4 import BeautifulSoup

def FindTemp():
    myurl = "https://www.theweathernetwork.com/ca/36-hour-weather-forecast/ontario/toronto"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'
    }
    r = requests.get(myurl, headers=headers)
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    all = soup.find("div", {"class": "obs-area"}).find("span", {'class': 'temp'})
    todaydate = time.asctime()
    TorontoTemp = all.text
    print("The temperature in Toronto is", TorontoTemp, "on", todaydate)
    print(TorontoTemp)

print(FindTemp())
It may not work at all, even if you didn't do anything wrong. Many sites use JavaScript to fetch their data, so you'd need a scraper with Chromium built in that uses the same DOM you'd see if you were interacting with the site yourself, in person. And many sites with valuable data, such as weather data, actively protect themselves from scraping, since the data they provide has monetary value (i.e. you can buy access to the data feed).
In any case, you should start with a site that's known to scrape well. BeautifulSoup's own webpage is a good start :)
And you should use a debugger to see the intermediate values your code generates, and investigate at which point they diverge from your expectations.
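If the temperature really is rendered client-side, a headless browser is one way to get at it. Here is a minimal Selenium sketch, assuming the .obs-area / .temp selectors from the question still exist once the page has rendered and that the site doesn't block automated browsers:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://www.theweathernetwork.com/ca/36-hour-weather-forecast/ontario/toronto")
    # Wait for the JS-rendered temperature element (selector taken from the question).
    temp = WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, ".obs-area .temp"))
    ).text
    print("The temperature in Toronto is", temp)
finally:
    driver.quit()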

httr::GET / xml2::read_html - Stuck/Frozen - No response

I wanted to pick up some data from health.usnews.com. I ran the following lines, but both give me the exact same result: no response at all. R gets stuck on the line and I have to manually click "interrupt R".
page_response <- httr::GET("https://health.usnews.com/")
# or
page <- xml2::read_html("https://health.usnews.com/")
What am I missing?
The website uses the user-agent header to detect web scrapers. Add a fake user-agent header and you'll be able to get the result:
page_response <- httr::GET(
  "https://health.usnews.com/",
  config = httr::add_headers(
    `user-agent` = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"
  )
)
The problem, though, is that most of the data is generated by JS. I don't know what info you need, but you'll probably need the V8 package to help you.

scraping search results page in amazon using Jsoup

I am using Jsoup to scrape two URLs:
http://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Daps&field-keywords=pendrives&rh=i%3Aaps%2Ck%3Apendrives
http://www.amazon.in/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=FDW+CLEAR+SPRINGS+125+GMS
In the first URL, I am searching for pendrives and getting results nested under the "atfresults" tag, which I have been able to scrape.
For the second URL, I am searching for FDW CLEAR SPRINGS 125 GMS, for which I get "Your search FDW CLEAR SPRINGS 125 GMS did not match any products.", but the page does return three products in "searchTemplate", which I am unable to traverse using Jsoup. I need help finding the descriptions of those 3 products.
You can find them using:
Document doc = Jsoup.connect(url)
.userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36")
.maxBodySize(0)
.get();
Elements products = doc.select(".s-result-list-parent-container > ul > li");
Or you can directly find the description using:
Elements products = doc.select(".s-result-list-parent-container > ul > li .s-access-title");
