How To Parse a URL in Deno

How do you parse a URL in Deno, like Node.js's url.parse()?

No external module is needed to parse URLs in Deno. The URL class is available as a global, just like in your browser:
const urlString = "https://www.google.com";
const url = new URL(urlString);
console.log(`URL: ${url.protocol}//${url.host}`);
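For reference (not part of the original answer), the same global URL object also exposes the pieces Node's url.parse() used to give you:
const u = new URL("https://www.google.com/search?q=deno#results");
console.log(u.hostname);              // "www.google.com"
console.log(u.pathname);              // "/search"
console.log(u.searchParams.get("q")); // "deno"
console.log(u.hash);                  // "#results"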

Aside from using ECMAScript's native URL class, as illustrated by jsejcksn's answer, you can also use the url library from the /std/node compatibility module as follows:
import * as UrlLib from "https://deno.land/std/node/url.ts";
const url = "https://www.google.com";
const purl = UrlLib.parse(url);
console.log(`URL: ${purl.protocol}//${purl.host}`);
Output:
URL: https://www.google.com

Related

'TypeError: Protocol "http:" not supported. Expected "https:"' error when fetching HTTPS site

I'm trying to use node-fetch to capture the contents of a page, and I'm running into an unexpected error. I checked a similar question but it doesn't seem relevant. I am fetching an HTTPS site with an HTTPS agent, yet I'm getting an unexpected error about HTTP. I wonder whether this may be due to redirects, but I can't see anything that would cause it. It only fails for this particular URL (it works fine, for example, with https://www.robinhood.com), and I'm trying to figure out why. Here is a minimal example. Note that it uses some certificates I have saved locally, but I'm not sure how necessary that is to reproduce.
//start SO example
var siteURL = "https://robinhood.com/l/privacy";
import path from 'path';
import sslrootcas from 'ssl-root-cas';
const rootCas = sslrootcas.create();
import {fileURLToPath} from 'url';
const __filename = fileURLToPath(import.meta.url);
const __dirname = path.dirname(__filename);
rootCas.addFile(path.resolve(__dirname, 'intermediate.pem'));
import http from 'node:http';
import https from 'node:https';
import fetch from 'node-fetch';
import UserAgent from 'user-agents';
const myhttpsAgent = new https.Agent({ ca: rootCas });
// const requestcheck = fetch("https://www.google.com", {
const requestcheck = fetch(siteURL, {
  method: "GET",
  headers: { "User-Agent": new UserAgent() },
  agent: myhttpsAgent,
});
Here is the error I'm getting:
node:internal/errors:477
ErrorCaptureStackTrace(err);
^
TypeError: Protocol "http:" not supported. Expected "https:"
at new NodeError (node:internal/errors:387:5)
at new ClientRequest (node:_http_client:177:11)
at request (node:http:96:10)
at file:///home/app/node_modules/node-fetch/src/index.js:94:20
at new Promise (<anonymous>)
at fetch (file:///home/app/node_modules/node-fetch/src/index.js:49:9)
at ClientRequest.<anonymous> (file:///home/app/node_modules/node-fetch/src/index.js:236:15)
at ClientRequest.emit (node:events:525:35)
at HTTPParser.parserOnIncomingClient [as onIncoming] (node:_http_client:674:27)
at HTTPParser.parserOnHeadersComplete (node:_http_common:128:17)
at TLSSocket.socketOnData (node:_http_client:521:22)
at TLSSocket.emit (node:events:525:35)
at addChunk (node:internal/streams/readable:315:12)
at readableAddChunk (node:internal/streams/readable:289:9)
at TLSSocket.Readable.push (node:internal/streams/readable:228:10)
at TLSWrap.onStreamRead (node:internal/stream_base_commons:190:23) {
code: 'ERR_INVALID_PROTOCOL'
}
The question already suspected redirects, and that is indeed the cause:
https://robinhood.com/l/privacy redirects to
https://robinhood.com/us/en/support/articles/privacy-policy, which then redirects to
http://robinhood.com/us/en/support/articles/privacy-policy/
That last URL is plain HTTP, which is the wrong protocol for an HTTPS-only agent, hence the error.
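A possible workaround (a sketch, not from the original answer): node-fetch documents that the agent option can be either an Agent instance or a function of the parsed URL, so the agent can be chosen per protocol and the plain-HTTP redirect no longer gets funneled into the HTTPS-only agent:
// sketch only; siteURL and rootCas are the values from the question above
import http from 'node:http';
import https from 'node:https';
import fetch from 'node-fetch';

const httpAgent = new http.Agent();
const httpsAgent = new https.Agent({ ca: rootCas });

const response = await fetch(siteURL, {
  method: 'GET',
  // pick the agent from the protocol of each (possibly redirected) request
  agent: (parsedUrl) => (parsedUrl.protocol === 'http:' ? httpAgent : httpsAgent),
});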

Download a static file with strict name via Nginx [duplicate]

I'm writing a web application that, among other things, allows users to upload files to my server. In order to prevent name clashes and to organize the files, I rename them once they are put on my server. By keeping track of the original file name I can communicate with the file's owner without them ever knowing I changed the file name on the back end. That is, until they go to download the file. In that case they're prompted to download a file with an unfamiliar name.
My question is, is there any way to specify the name of a file to be downloaded using just HTML? So a user uploads a file named 'abc.txt' and I rename it to 'xyz.txt', but when they download it I want the browser to save the file as 'abc.txt' by default. If this isn't possible with just HTML, is there any way to do it?
When they click a button to download the file, you can add the HTML5 download attribute to the link, which sets the default filename, e.g.:
<a href="xyz.txt" download="abc.txt">Download</a>
That's what I did when I created an xlsx file and the browser wanted to save it as a zip file.
I can't find a way in HTML. I think you'll need a server-side script which outputs a Content-Disposition header. In PHP this is done like this:
header('Content-Disposition: attachment; filename="downloaded.pdf"');
If you wish to provide a default filename, but not force an automatic download, this seems to work:
header('Content-Disposition: inline; filename="filetodownload.jpg"');
In fact, it is the server that is directly serving your files, so you have no way to interact with it from HTML, as HTML is not involved at all.
You just need to use the HTML5 <a> tag's download attribute.
CodePen live demo:
https://codepen.io/xgqfrms/full/GyEGzG/
Updated answer:
Whether a file is downloadable depends on the server's response headers, such as Content-Type and Content-Disposition;
the downloaded file's extension is optional and also depends on the server's configuration.
'Content-Type': 'application/octet-stream',
// it means unknown binary file,
// browsers usually don't execute it, or even ask if it should be executed.
'Content-Disposition': `attachment; filename=server_filename.filetype`,
// if the header specifies a filename,
// it takes priority over a filename specified in the download attribute.
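For illustration only (an assumed Node.js sketch, not part of the original answer), a bare-bones handler that forces a download under a server-chosen name could set those headers like this:
// minimal sketch: serve the stored file xyz.txt but have the browser save it as abc.txt
import { createServer } from 'node:http';
import { readFile } from 'node:fs/promises';

createServer(async (req, res) => {
  const data = await readFile('./xyz.txt'); // name used on the server
  res.writeHead(200, {
    'Content-Type': 'application/octet-stream', // unknown binary, so the browser downloads it
    'Content-Disposition': 'attachment; filename="abc.txt"', // name offered to the user
  });
  res.end(data);
}).listen(8080);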
Download a blob URL file:
function generatorBlobVideo(url, type, dom, link) {
  var xhr = new XMLHttpRequest();
  xhr.open('GET', url);
  xhr.responseType = 'arraybuffer';
  xhr.onload = function (res) {
    // console.log('res =', res);
    var blob = new Blob(
      [xhr.response],
      {'type': type},
    );
    // create a blob url
    var urlBlob = URL.createObjectURL(blob);
    dom.src = urlBlob;
    // download the file using the `a` tag
    link.href = urlBlob;
  };
  xhr.send();
}

(function() {
  var type = 'image/png';
  var url = 'https://cdn.xgqfrms.xyz/logo/icon.png';
  var dom = document.querySelector('#img');
  var link = document.querySelector('#img-link');
  generatorBlobVideo(url, type, dom, link);
})();
https://cdn.xgqfrms.xyz/HTML5/Blob/index.html
refs
https://developer.mozilla.org/en-US/docs/Web/HTML/Element/a#download
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Disposition
https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types#important_mime_types_for_web_developers
Sometimes #Mephiztopheles' answer won't work with blob storage and some browsers.
In that case you need to use a custom function to convert the file to a blob and download it:
const coverntFiletoBlobAndDownload = async (file, name) => {
  const blob = await fetch(file).then(r => r.blob())
  const url = URL.createObjectURL(blob)
  const a = document.createElement('a')
  a.style.display = 'none'
  a.href = url
  a.download = name // add custom extension here
  document.body.appendChild(a)
  a.click()
  window.URL.revokeObjectURL(url)
}
Same code as #Hillkim Henry's, but with an a.remove() improvement.
This removes the a tag from the body and avoids piling up multiple elements:
const coverntFiletoBlobAndDownload = async (file, name) => {
  const blob = await fetch(file).then(r => r.blob())
  const url = URL.createObjectURL(blob)
  const a = document.createElement('a')
  a.style.display = 'none'
  a.href = url
  a.download = name // add custom extension here
  document.body.appendChild(a)
  a.click()
  window.URL.revokeObjectURL(url)
  // Remove the "a" tag from the body
  a.remove()
}
Well, #Palantir's answer is, for me, the most correct way!
If you plan to use this with multiple files, I suggest you use (or write) a PHP download manager.
BUT, if you only want to do it for one or two files, I suggest the mod_rewrite option:
You have to create or edit the .htaccess file in your htdocs folder and add this:
RewriteEngine on
RewriteRule ^abc\.txt$ xyz.txt
With this rule, users will download the data of xyz.txt under the name abc.txt.
NOTE: Check whether you already have "RewriteEngine on" in the file; if so, add only the second line for each file you wish to redirect.
Good luck ;)
(Sorry for my English)

getStaticPaths and getStaticProps with domain flavouring

I have a question related to the static generation of Next.js:
I'm creating whitelabel websites for my customers; it means that I'm reading the domain where the request is coming from to load a config file and some specific CSS files. That's working fine and looks like this:
export const readConfig = async ({ req }) => {
  const configs = await import('../configs.json')
  const domain = req ? req.headers['host'].split(':')[0] : window.location.hostname
  const config = configs[domain]
  return { domain, config }
}

Page.getInitialProps = readConfig
However, I'm using getInitialProps for this, and my understanding is that because I rely on req, this code will run on the server for every request.
Now, let's say I want static generation of some pages: how should I proceed? Can I avoid having count_different_domains * count_different_items combinations? Is it somehow possible to cache the result of some queries and revalidate it later (but not as an entire page)?
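For the cache-and-revalidate part, Next.js Incremental Static Regeneration (getStaticProps with revalidate, plus fallback pages from getStaticPaths) is the usual mechanism. Here is a rough, hypothetical sketch that assumes the domain is encoded into the route as a [domain] path segment, since getStaticProps has no access to req:
// pages/[domain]/index.js (hypothetical sketch; assumes configs.json maps domains to configs)
import configs from '../configs.json'

export async function getStaticPaths() {
  return {
    // pre-build only the domains listed in the config file
    paths: Object.keys(configs).map((domain) => ({ params: { domain } })),
    fallback: 'blocking', // render anything else on first request
  }
}

export async function getStaticProps({ params }) {
  const config = configs[params.domain]
  return {
    props: { domain: params.domain, config },
    revalidate: 60, // re-generate the page at most once per minute
  }
}

export default function Page({ domain, config }) {
  return <pre>{JSON.stringify({ domain, config }, null, 2)}</pre>
}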

BeautifulSoup isn't returning a url when we query for the src of the img tag

from bs4 import BeautifulSoup
from urllib import request
url = "https://amazon-asin.com/asincheck/?product_id=B000JMLBHU"
req = request.urlopen(url)
soap = BeautifulSoup(req,'html.parser')
soap.find('img',{'class':'resp-img'})['ng-src']
I'm using ng-src because, with only 'src', it returns nothing. But, with ng-src, it returns this:
'{{data.product_details.image_url}}'
Why doesn't it return the URL? How can I scrape the URL of this image?
The image URL is filled in by JavaScript at runtime (hence the ng-src template placeholder), so it never appears in the static HTML that BeautifulSoup receives. Try rendering the page with Selenium instead:
from selenium import webdriver
driver = webdriver.Firefox(executable_path='c:program/geckodriver')
url = "https://amazon-asin.com/asincheck/?product_id=B000JMLBHU"
driver.get(url)
driver.implicitly_wait(10)
print(driver.find_element_by_css_selector('img.resp-img').get_attribute('ng-src'))
driver.close()
Prints:
https://m.media-amazon.com/images/I/51sPuWd2JbL.jpg
Note: you need selenium and geckodriver, and in this code geckodriver is expected to be loaded from c:/program/geckodriver.exe.

Scrapy does not extract data

I am trying to get some technical information about automobiles from this page.
Here is my current code:
import scrapy
import re
from arabamcom.items import ArabamcomItem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class BasicSpider(CrawlSpider):
    name = "arabamcom"
    allowed_domains = ["arabam.com"]
    start_urls = ['https://www.arabam.com/ikinci-el/otomobil']
    rules = (Rule(LinkExtractor(allow=(r'/ilan')), callback="parse_item", follow=True),)

    def parse_item(self, response):
        item = ArabamcomItem()
        item['fiyat'] = response.css('span.color-red.font-huge.bold::text').extract()
        item['marka'] = response.css('p.color-black.bold.word-break.mb4::text').extract()
        item['yil'] = response.xpath('//*[@id="js-hook-appendable-technicalPropertiesWrapper"]/div[2]/dl[1]/dd/span/text()').extract()
And this is my items.py file:
import scrapy

class ArabamcomItem(scrapy.Item):
    fiyat = scrapy.Field()
    marka = scrapy.Field()
    yil = scrapy.Field()
When I run the code I can get data for the 'marka' and 'fiyat' items, but the spider does not get anything for the 'yil' attribute, nor for other parts like 'Yakit Tipi', 'Vites Tipi', etc. How can I solve this problem?
What's wrong:
//*[@id="js-hook-appendable-technicalPropertiesWrapper"]/......
This id starts with js and the element is probably appended dynamically by JavaScript.
Scrapy does not have the ability to render JavaScript by default.
There are two solutions you can try:
Scrapy-Splash
This is a JavaScript rendering service for Scrapy.
Install Splash as a Docker container.
Modify your settings.py file to integrate Splash (append the following middlewares to your project):
SPLASH_URL = 'http://127.0.0.1:8050'

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
Replace your Request Function with SplashRequest
from scrapy_splash import SplashRequest as SP
SP(url=url, callback=parse, endpoint='render.html', args={'wait': 5})
Selenium WebDriver
This is a browser automation-testing framework
Install Selenium from PyPI and install the corresponding driver (e.g. Firefox -> geckodriver) into a folder on PATH.
Append the following middleware class to your project's middleware.py file:
from scrapy import signals
from scrapy.http import HtmlResponse
from scrapy.utils.python import to_bytes
from selenium import webdriver

class SeleniumMiddleware(object):
    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_opened, signals.spider_opened)
        crawler.signals.connect(middleware.spider_closed, signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        request.meta['driver'] = self.driver
        self.driver.get(request.url)
        self.driver.implicitly_wait(2)
        body = to_bytes(self.driver.page_source)
        return HtmlResponse(self.driver.current_url, body=body, encoding='utf-8', request=request)

    def spider_opened(self, spider):
        """Change your browser mode here"""
        self.driver = webdriver.Firefox()

    def spider_closed(self, spider):
        self.driver.close()
Modify your settings.py file to integrate the Selenium middleware (append the following to your project and replace yourproject with your project name):
DOWNLOADER_MIDDLEWARES = {
    'yourproject.middlewares.SeleniumMiddleware': 200
}
Comparison
Scrapy-Splash
An official module from the Scrapy developers.
You can deploy a Splash instance to the cloud, browse the URL there, and transfer the rendered HTML back to your spider.
It's slow.
The Splash container will stop if there is a memory leak (be sure to deploy the Splash instance on a high-memory cloud instance).
Selenium WebDriver
You have to have Firefox or Chrome with the corresponding automated-test driver on your machine, unless you use PhantomJS.
You can't modify request headers directly with Selenium WebDriver.
You could render the webpage using a headless browser, but this data can be easily extracted without it. Try this:
import re
import ast
...

def parse_item(self, response):
    regex = re.compile(r'dataLayer.push\((\{.*\})\);', re.DOTALL)
    html_info = response.xpath('//script[contains(., "dataLayer.push")]').re_first(regex)
    data = ast.literal_eval(html_info)
    yield {'fiyat': data['CD_Fiyat'],
           'marka': data['CD_marka'],
           'yil': data['CD_yil']}
    # outputs an item like {'fiyat': '103500', 'marka': 'Renault', 'yil': '2017'}
