Scrapy Shell and Scrapy Splash - web-scraping

We've been using the scrapy-splash middleware to pass the scraped HTML source through the Splash JavaScript rendering engine running inside a Docker container.
If we want to use Splash in the spider, we configure several required project settings and yield a Request with specific meta arguments:
yield Request(url, self.parse_result, meta={
    'splash': {
        'args': {
            # set rendering arguments here
            'html': 1,
            'png': 1,
            # 'url' is prefilled from request url
        },
        # optional parameters
        'endpoint': 'render.json',  # optional; default is render.json
        'splash_url': '<url>',      # optional; overrides SPLASH_URL
        'slot_policy': scrapy_splash.SlotPolicy.PER_DOMAIN,
    }
})
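For reference, the "several required project settings" mentioned above are roughly the following, based on the scrapy-splash README; this is a sketch that assumes Splash is listening on localhost:8050:
# settings.py (sketch; adjust SPLASH_URL to wherever your Splash container runs)
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'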
This works as documented. But how can we use scrapy-splash inside the Scrapy Shell?

Just wrap the URL you want to open in the shell in the Splash HTTP API.
So you would want something like:
scrapy shell 'http://localhost:8050/render.html?url=http://example.com/page-with-javascript.html&timeout=10&wait=0.5'
where:
localhost:8050 is where your Splash service is running
url is the URL you want to crawl; don't forget to URL-quote it (see the sketch below)!
render.html is one of the possible HTTP API endpoints; in this case it returns the rendered HTML page
timeout is the request timeout in seconds
wait is the time in seconds to wait for JavaScript to execute before reading/saving the HTML.
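For example, a small Python sketch for building that shell URL with the target properly URL-quoted (the host, port and target page are placeholders for your own setup):
from urllib.parse import quote

# Page we want Splash to render (placeholder URL)
target = 'http://example.com/page-with-javascript.html'

# Quote the target so its own query string does not clash with Splash's parameters
splash_url = ('http://localhost:8050/render.html?url=' + quote(target, safe='')
              + '&timeout=10&wait=0.5')

print(splash_url)  # pass this string to: scrapy shell '<splash_url>'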

You can run scrapy shell without arguments inside a configured Scrapy project, then create req = scrapy_splash.SplashRequest(url, ...) and call fetch(req).
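A minimal sketch of that shell session, assuming scrapy-splash is installed and the project settings above are in place:
$ scrapy shell
>>> from scrapy_splash import SplashRequest
>>> req = SplashRequest('http://example.com/page-with-javascript.html', args={'wait': 0.5})
>>> fetch(req)
>>> response.css('title::text').get()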

For Windows users who use Docker Toolbox:
Replace the single quotes with double quotes to prevent the invalid hostname:http error.
Change localhost to the Docker machine IP address shown below the whale logo; for me it was 192.168.99.100.
Finally I got this:
scrapy shell "http://192.168.99.100:8050/render.html?url=https://example.com/category/banking-insurance-financial-services/"

Related

Next.JS - localhost is prepended when making external API call

I have a simple Next app where I'm making an external API call to fetch some data. This worked perfectly fine until a couple of days ago: when the app makes an API request, I can see in the network tab that the URL it is trying to call has the Next app's address (localhost:3000) prepended in front of the actual URL that needs to be called, e.g. instead of http://{serverAddress}/api/articles it is calling http://localhost:3000/{serverAddress}/api/articles, and this request resolves to 404 Not Found.
To make the API call, I'm using fetch. Before making the request, I logged the URL that was passed into fetch and it was the correct URL that I need. I also confirmed my API is working as expected by making the request to the expected URL using Postman.
I haven't tried using another library like axios to make this request, because my app was working perfectly fine using only fetch, so I want to understand why this is happening.
I haven't made any code changes since my app was working; however, I was Dockerizing my services, so I installed Docker and WSL2 with Ubuntu. I was deploying those containers on another machine; now both the API I'm calling and the Next app are running directly on my development machine when this issue is happening.
I saw this post and confirmed I don't have any whitespace in the URL. As one comment mentions, I did install WSL2, but I am not running the app via the WSL terminal. I've also tried executing wsl --shutdown to see if that helps, but unfortunately the issue still persists. If this is the cause of the issue, how can I fix it? Uninstall WSL2? If not, what might be another possible cause?
Thanks in advance.
EDIT:
The code I'm using to call fetch:
fetcher.js
export const fetcher = (path, options) =>
  fetch(`${process.env.NEXT_PUBLIC_API_URL}${path}`, options)
    .then(res => res.json());
useArticles.js
import { useSWRInfinite } from 'swr';
import { fetcher } from '../../utils/fetcher';

const getKey = (pageIndex, previousPageData, pageSize) => {
  if (previousPageData && !previousPageData.length) return null;
  return `/api/articles?page=${pageIndex}&limit=${pageSize}`;
};

export default function useArticles(pageSize) {
  const { data, error, isValidating, size, setSize } = useSWRInfinite(
    (pageIndex, previousPageData) =>
      getKey(pageIndex, previousPageData, pageSize),
    fetcher
  );

  return {
    data,
    error,
    isValidating,
    size,
    setSize
  };
}
You might be missing the protocol (http/https) in your API call. fetch resolves a URL without a protocol relative to the host serving the page, which is why the Next app's address gets prepended.
Either put it into the env variable:
NEXT_PUBLIC_API_URL=http://server_address
Or prefix your fetch call with the protocol name:
fetch(`http://${process.env.NEXT_PUBLIC_API_URL}${path}`, options)

Find out if you are running in a command and the command parameters in Symfony

We have a service that can be called from a Symfony command and from a normal web request. Is there a way to find out if the service was called from a command or from a web request? If so, if it was called from a command, is there a way to find out the parameters that were used when running the command?
In the Symfony Console, the command-line context does not know about your VirtualHost or domain name.
This means you can check the request scheme, host, base_url and base path: these request properties have no values in the console context unless you configure them explicitly (https://symfony.com/doc/current/console/request_context.html#configuring-the-request-context-globally).
You can use php_sapi_name() to check whether the service is running from the CLI; under Apache you will get apache2handler instead:
if (php_sapi_name() === 'cli') {
    // some code
}
https://www.php.net/manual/en/function.php-sapi-name.php

Http request command line tool

I am developing a command line tool in Swift 3 and I have this code:
let url = "www.google.com"
var request = URLRequest(url: url)
request.httpMethod = "GET"
let task = session.dataTask(with: a) { (data, response, error) in
    print("REACHED")
    handler(response, data)
}
task.resume()
I cannot reach the task if I use "http://" or "https://" as a prefix in any URL. I am wondering if I need an App Transport Security plist; I already tried creating a simple plist. Does anyone know if there is something particular about this problem?
When specifying a URL, you need the "scheme" (e.g. http:// or https://).
The url is a URL, not a String, so it should be:
let url = URL(string: "http://www.google.com")!
Yes, you need an Info.plist entry if you want to use http://. See e.g. https://stackoverflow.com/a/37552442/1271826 or "Transport security has blocked a cleartext HTTP", or just search Stack Overflow for "[ios] http info.plist" or with [osx].
Note that in Xcode 8.1, console apps don't necessarily have an Info.plist file, so if you don't have one, you may have to add one by pressing command+N, update your target settings to specify the plist, and then add the appropriate App Transport Security settings.
I assume where you have dataTask(with: a) you meant dataTask(with: request).
You'll need something to keep the command app alive while you're performing the request (e.g. a semaphore or something like that).

Sitecore - build URL with Agent

I have an email sending class; when the item is activated it generates a link to the dashboard as follows:
Item dashboardItem = DatabaseManager.WebDatabase.GetItem("/sitecore/content/Public/Pages/Users/Dashboard");
string url = LinkManager.GetItemUrl(dashboardItem, opt);
The URL is generated as http://mysite/Pages/Users/Dashboard, which is the expected behaviour. This is the user-accessible URL.
I am trying to generate the same email using a scheduled task, but when it runs and executes this code the URL is generated as follows:
http://127.0.0.1/sitecore/content/Public/Pages/Users/Dashboard
It seems that when we use the scheduler, LinkManager cannot identify the URL mapped to the item. How can I generate the user-accessible URL from the scheduled task?
This happens because the scheduled task is running in a different SiteContext.
In the code of your task, you should manually switch to the SiteContext that contains the item you are linking to.
Like this:
using (new Sitecore.Sites.SiteContextSwitcher(
    Sitecore.Sites.SiteContext.GetSite("your_site_name")))
{
    // load item & generate url here ...
}
your_site_name is the site name that is configured in the <sites> configuration.

Set proxy to hide my IP address for scraping a webpage using scrapy

I am using Scrapy to crawl a website and now I need to set a proxy to handle the requests being sent. Can anyone help me set a proxy in a Scrapy app? Please share a sample link too if you have one. I also need a way to tell which IP the request is going out from.
You can do it through the code below found here:
1 – Create a new file called middlewares.py in your Scrapy project and add the following code to it.
# base64 is only needed if the proxy we are going to use requires authentication
import base64

# Start your middleware class
class ProxyMiddleware(object):
    # overwrite process_request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # Set up basic authentication for the proxy
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode()).decode()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
2 – Open your project’s configuration file (./project_name/settings.py) and add the following code
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'project_name.middlewares.ProxyMiddleware': 100,
}
Also, you can use multiple proxies with Scrapy. More information can be found here.
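As a lighter-weight alternative, if you only need a proxy for particular requests you can skip the custom middleware and set it per request via request.meta, letting Scrapy's built-in HttpProxyMiddleware pick it up. A minimal sketch with a placeholder proxy address and hypothetical spider name:
import scrapy

class ProxiedSpider(scrapy.Spider):
    name = 'proxied_example'

    def start_requests(self):
        # The built-in HttpProxyMiddleware honours request.meta['proxy']
        yield scrapy.Request(
            'http://example.com',
            meta={'proxy': 'http://YOUR_PROXY_IP:PORT'},
        )

    def parse(self, response):
        self.logger.info('Fetched %s through the proxy', response.url)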
