How to scrape this database site? - web-scraping

I wanted to scrape this site, but it seems like the information is not in the HTML code. How can I scrape this site/information?
https://golden.com/query/list-of-incubator-companies-NMB3
I have tried normal HTML scraping, but I am not very familiar with scraping at all.

This site uses JavaScript to render its content; however, you can use its API to scrape all of the data in JSON format.
The API endpoint is:
url = f"https://golden.com/api/v1/queries/list-of-incubators-and-accelerators-NMB3/results/?page={page_number}&per_page=25&order=&search="
A simple Scrapy example could look something like this:
import scrapy


class MySpider(scrapy.Spider):
    name = 'golden'

    def start_requests(self):
        for page_num in range(1, 4):
            url = f"https://golden.com/api/v1/queries/list-of-incubators-and-accelerators-NMB3/results/?page={page_num}&per_page=25&order=&search="
            yield scrapy.Request(url)

    def parse(self, response):
        data = response.json()
        yield {"data": data["results"]}

Related

How to recover a hidden ID from a query string from an XHR GET request?

I'm trying to use the hidden Airbnb API. I need to reverse engineer where the ID comes from in the query string of a GET request. For example, take this listing:
https://www.airbnb.ca/rooms/47452643
The "public" ID is shown to be 47452643. However, another ID is needed to use the API.
If you look at the XHR requests in Chrome, you'll see a request starting with "StaysPdpSections?operationName". This is the request I want to replicate. If I copy the request into Insomnia or Postman, I see a variable in the query string starting with:
"variables":"{"id":"U3RheUxpc3Rpbmc6NDc0NTI2NDM="
The hidden ID "U3RheUxpc3Rpbmc6NDc0NTI2NDM" is what I need. It is needed to get the data from this request and must be inserted into the query string. How can I recover the hidden ID "U3RheUxpc3Rpbmc6NDc0NTI2NDM" for each listing dynamically?
That target id is buried really deep in the HTML...
import requests
from bs4 import BeautifulSoup as bs
import json

url = 'https://www.airbnb.ca/rooms/47452643'
req = requests.get(url)
soup = bs(req.content, 'html.parser')
# The page embeds its state as JSON inside a <script id="data-state"> tag
script = soup.select_one('script[type="application/json"][id="data-state"]')
data = json.loads(script.text)
# Walk down to the variables block that holds the hidden id
target = data.get('niobeMinimalClientData')[2][1]['variables']
print(target.get('id'))
Output:
U3RheUxpc3Rpbmc6NDc0NTI2NDM=
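As a side note (an observation, not part of the answer above): the hidden ID appears to be just the base64 encoding of "StayListing:<public id>", so it can also be generated directly from the room number without fetching the page. A quick sketch:

import base64

public_id = 47452643
# base64-encode "StayListing:<public id>" to reproduce the hidden ID from the question
hidden_id = base64.b64encode(f"StayListing:{public_id}".encode()).decode()
print(hidden_id)  # U3RheUxpc3Rpbmc6NDc0NTI2NDM=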

How does adding the dont_filter=True argument in scrapy.Request make my parsing method work?

Here's a simple scrapy spider
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["https://www.dmoz.org"]
    start_urls = ('https://www.dmoz.org/',)

    def parse(self, response):
        yield scrapy.Request(self.start_urls[0], callback=self.parse2)

    def parse2(self, response):
        print(response.url)
When you run the program, the parse2 method doesn't run and response.url never gets printed. Then I found the solution to this in the thread below.
Why is my second request not getting called in the parse method of my scrapy spider
It's just that I needed to add dont_filter=True as an argument to the request to make the parse2 function work.
yield scrapy.Request(self.start_urls[0],callback=self.parse2,dont_filter=True)
But in the examples given in the Scrapy documentation and many YouTube tutorials, they never use the dont_filter=True argument in scrapy.Request, and yet their second parse functions still work.
Take a look at this
def parse_page1(self, response):
    return scrapy.Request("http://www.example.com/some_page.html",
                          callback=self.parse_page2)

def parse_page2(self, response):
    # this would log http://www.example.com/some_page.html
    self.logger.info("Visited %s", response.url)
Why can't my spider work unless dont_filter=True is added? What am I doing wrong? What were the duplicate links that my spider filtered in my first example?
P.S. I could've resolved this in the Q&A thread I posted above, but I'm not allowed to comment unless I have 50 reputation (poor me!).
Short answer: You are making duplicate requests. Scrapy has built-in duplicate filtering which is turned on by default. That's why parse2 doesn't get called. When you add dont_filter=True, Scrapy doesn't filter out the duplicate requests, so this time the request is processed.
Longer version:
In Scrapy, if you have set start_urls or have the method start_requests() defined, the spider automatically requests those URLs and passes the response to the parse method, which is the default method used for parsing responses. You can then yield new requests from there, which will again be handled by Scrapy. If you don't set a callback, the parse method will be used again; if you set a callback, that callback will be used.
Scrapy also has a built-in filter which stops duplicate requests. That is, if Scrapy has already crawled a URL and parsed the response, then even if you yield another request for that URL, Scrapy will not process it.
In your case, you have the URL in start_urls. Scrapy starts with that URL, crawls the site, and passes the response to parse. Inside that parse method, you again yield a request to that same URL (which Scrapy just processed), but this time with parse2 as the callback. When this request is yielded, Scrapy sees it as a duplicate, so it ignores the request and never processes it. So no call to parse2 is made.
If you want to control which URLs get processed and with which callback, I recommend you override start_requests() and return a list of scrapy.Request objects instead of using the single start_urls attribute, as shown in the sketch below.
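A minimal sketch of that recommendation, reusing the dmoz URL from the question (illustrative only, not tested against the real site):

import scrapy


class ExampleSpider(scrapy.Spider):
    name = "dmoz"

    def start_requests(self):
        # Each URL is requested exactly once with the callback chosen up front,
        # so there is no second (duplicate) request for the filter to drop.
        yield scrapy.Request("https://www.dmoz.org/", callback=self.parse2)

    def parse2(self, response):
        print(response.url)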

R output to confluence page REST API

I am new to using APIs. I have R output in the form of a list which I want to paste onto a Confluence page, and I have no idea how. I have been trying to use the REST API but it's confusing me.
I have been able to get a 200 response from the website using
library(httr)

httr::set_config(config(ssl_verifypeer = FALSE))
URL <- "http://xxx.xx.xx/xx/xx/daily_report"
response = GET(URL, authenticate("xxx", "xxx"))
response
Really clueless where to go next.
Try this:
content(response, "parsed", "application/json")$results

How to pass parameter to Url with Python urlopen

I'm currently new to Python programming. My problem is that my Python program doesn't seem to pass/encode the parameter properly to the ASP file that I've created. This is my sample code:
import urllib.request

url = 'http://www.sample.com/myASP.asp'
# sentData is defined elsewhere in the script
full_url = url + "?data='" + str(sentData).replace("'", '"').replace(" ", "%20").replace('"', "%22") + "'"
print(full_url)
response = urllib.request.urlopen(full_url)
print(response)
the output would give me something like:
http://www.sample.com/myASP.asp?data='{%22mykey%22:%20[{%22idno%22:%20%22id123%22,%20%22name%22:%20%22ej%22}]}'
The ASP file is supposed to insert the acquired query string into a database. But whenever I check my database, no record is saved. Though if I copy and paste the printed output into my browser's URL bar, the record is saved. Any input on this? TIA
Update:
Is it possible that Python calls my ASP File A but ASP File A doesn't call my ASP File B? ASP File A is called by Python, while ASP File B is called by ASP File A. Whenever I run the URL in a browser, the saving goes well, but from Python no database record is saved even though the data passed from Python is read by ASP File A.
Use firebug with Firefox and watch the network traffic when the page is loaded. If it is actually an HTTP POST, which I suspect it is, check the post parameters on that post and do something like this:
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 / Python 2 style imports
import urllib

post_params = {
    'param1': 'val1',
    'param2': 'val2',
    'param3': 'val3'
}
post_args = urllib.urlencode(post_params)
url = 'http://www.sample.com/myASP.asp'
fp = urllib.urlopen(url, post_args)  # passing data makes this an HTTP POST
soup = BeautifulSoup(fp)
If it's actually an HTTP POST, this will work.
In case anybody stumbles upon this, this is what I've come up with:
py file:
url = "my.url.com"
data = {'sample': 'data'}
encodeddata = urllib.parse.urlencode(data).encode('UTF-8')
req = urllib.request.Request(url, encodeddata)
response = urllib.request.urlopen(req)
and in my asp file, I used json2.js:
jsondata = request.form("data")
jsondata = replace(jsondata,"'","""")
SET jsondata = JSON.parse(jsondata)
Note: use requests instead. ;)
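For reference, a rough sketch of the same POST using requests (with the placeholder URL and data from the snippet above; not a tested call against a real ASP page):

import requests

url = "my.url.com"
data = {'sample': 'data'}
# requests form-encodes the dict and sends it in the POST body,
# so the ASP side can read it with Request.Form("sample")
response = requests.post(url, data=data)
print(response.status_code)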
First off, I don't know Python.
But from this: doc on urllib.request
the HTTP request will be a POST instead of a GET when the data parameter is provided
Let me make a really wild guess: you are accessing the form values with Request.QueryString(..) in the ASP page, so your POST won't pass any values. But when you paste the URL in the address bar, it is a GET, and it works.
Just guessing; you could show the .asp page so we can check further.
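If the ASP page does read Request.QueryString, one fix on the Python side is to keep the request a GET and put the properly encoded data in the query string. A rough sketch, using the asker's sentData object (defined elsewhere in their script):

import json
import urllib.parse
import urllib.request

# sentData is the asker's object; json.dumps + urlencode take care of quoting/escaping
query = urllib.parse.urlencode({'data': json.dumps(sentData)})
# No data argument to urlopen, so this stays a GET and the values land in the query string
response = urllib.request.urlopen('http://www.sample.com/myASP.asp?' + query)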

web scraping in txt mode

I am currently using Watir to scrape a website that hides all of its data from the usual HTML source. If I am not wrong, they are using XML and AJAX techniques to hide it. Firefox can see the data, but only via "DOM Source of Selection".
Everything works fine, but now I am looking for a tool equivalent to Watir that works without a browser, entirely in text mode.
Right now, Watir is using my browser to render the page and return the whole HTML code I am looking for. I would like to do the same but without the browser.
Is it possible?
Thanks
Regards
Tak
Your best bet would be to use something like WebScarab and capture the URLs of the AJAX requests your browser is making.
That way, you can just grab the "important" data yourself by simulating those calls with any HTTP library.
It is possible with a little Python coding.
I wrote a simple script to fetch locations of cargo offices.
First steps
Open the AJAX page with Google Chrome, for example (it is in Turkish, but you can follow along):
http://www.yurticikargo.com/bilgi-servisleri/Sayfalar/en-yakin-sube.aspx
Press F12 to open the developer tools at the bottom and go to the Network tab.
Select the XHR tab at the bottom.
Trigger an AJAX request by selecting an item in the first combobox, then go to the Headers tab.
You will see GetTownByCity in the left pane; click it and inspect it.
Request URL: (...)/_layouts/ArikanliHolding.YurticiKargo.WebSite/ajaxproxy-sswservices.aspx/GetTownByCity
Request Method: POST
Status Code: 200 OK
Under the Request Payload item you will see the payload:
{cityId:34}
This gives us everything we need to write the Python code. Let's do it.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
import json
# import simplejson as json

baseUrl = 'http://www.yurticikargo.com/'
ajaxRoot = '_layouts/ArikanliHolding.YurticiKargo.WebSite/'
getTown = 'ajaxproxy-sswservices.aspx/GetTownByCity'
urlGetTown = baseUrl + ajaxRoot + getTown

headers = {'content-type': 'application/json', 'encoding': 'utf-8'}  # we are sending JSON

for plaka in range(1, 82):  # because Turkiye has number plates from 1 to 81
    payload = {'cityId': plaka}
    r = requests.post(urlGetTown, data=json.dumps(payload), headers=headers)
    data = r.json()  # the returned data is JSON; if you need the raw body, use r.content
    # ... process the fetched data with a JSON parser,
    # or, if it were HTML, with Beautiful Soup, lxml, etc.
Note that this code is part of my working code and was written on the fly; most importantly, I did not test it, so it may require small modifications to run.
