Spider set as default in a Scrapy project - web-scraping

I have a Scrapy project with two spiders, spider_a and spider_b.
The problem I'm facing is that spider_b is always the one that runs.
When I type:
scrapy crawl spider_a
spider_b is executed. The same happens when I try:
scrapy crawl spider_b
spider_b is executed again. The only way I have managed to run spider_a so far is by deleting the file that contains spider_b.
What could be causing this behavior?
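For what it's worth, scrapy crawl matches a spider by its name class attribute rather than by file name, so a minimal sketch of the layout the command expects, with two distinct names, looks like this (the class, file, and URL names here are illustrative, not from the project above):

# spiders/spider_a.py
import scrapy

class SpiderA(scrapy.Spider):
    name = "spider_a"  # "scrapy crawl spider_a" resolves to this value
    start_urls = ["https://example.com"]

    def parse(self, response):
        yield {"url": response.url}

# spiders/spider_b.py would define its own class with name = "spider_b";
# if both files declare the same name, the selection becomes ambiguous.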

Related

Unwanted page refresh with webpack5 in Next.js

When I turn on webpack5 and call an internal API (/api/*) from a page after the first render, the page refreshes and logs "Refreshing page data due to server-side change." After refreshing once, it works fine, as with webpack4.
Expected Behavior
The page should not refresh on an API call after the first render.
I recently updated to Next.js 12 and suddenly started encountering this issue as well. I'm not sure it's necessarily related to that, as I believe Next.js 11 was also using webpack 5 for HMR, but they did switch over to WebSocket communication for hot reloading, rather than the server-sent events used in previous versions. https://nextjs.org/docs/upgrading#nextjs-hmr-connection-now-uses-a-websocket
I have a file /inc/paths.js where I organize and export URI path string variables for different resources in my app. A number of paths in that module were also being used by parts of my /api scripts, namely object keys for AWS S3 bucket paths. So, this module was being imported not only by React components in the /pages directory and elsewhere, but also by the modules in the /api directory. By moving all the variables used by the /api modules into their own file and making sure that none of them are imported by React components or pages, the error seems to have disappeared for me (a sketch of that split is shown after the quote below).
This may be related to this quote from Vercel:
Finally, if you edit a file that's imported by files outside of the React tree, Fast Refresh will fall back to doing a full reload. You might have a file which renders a React component but also exports a value that is imported by a non-React component. For example, maybe your component also exports a constant, and a non-React utility file imports it. In that case, consider migrating the constant to a separate file and importing it into both files. This will re-enable Fast Refresh to work. Other cases can usually be solved in a similar way.
https://nextjs.org/docs/basic-features/fast-refresh
Although the logic doesn't quite match up, it leads me to believe there is something occurring along those lines.
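As a rough illustration of that kind of split (the file and variable names here are hypothetical, not from the original project):

// inc/api-paths.js — constants used only by /api route handlers
export const S3_UPLOAD_PREFIX = 'uploads/';

// inc/paths.js — constants imported by React components and pages
export const PROFILE_PATH = '/profile';

// pages/api/* modules import only from inc/api-paths.js,
// while React components and pages import only from inc/paths.js.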
I updated Next.js because of the console warning I got whenever Next.js ran, telling me it was using webpack 4 instead of 5.
You can still use webpack 4 by changing it in your next config, as it's an update issue.
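If you're on a Next.js version that still supports the opt-out (it was available in Next.js 11 and removed in Next.js 12), that change would look roughly like this:

// next.config.js — opt back into webpack 4
module.exports = {
  webpack5: false,
};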
On the client page, I changed the code to call the internal API from a useEffect() hook to fetch the data, instead of triggering the data-fetch function from an onClick handler, and the issue was gone.
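A minimal sketch of that change (the page and the /api/example endpoint are illustrative):

// pages/example.js — fetch once after mount instead of in an onClick handler
import { useEffect, useState } from 'react';

export default function ExamplePage() {
  const [data, setData] = useState(null);

  useEffect(() => {
    fetch('/api/example')
      .then((res) => res.json())
      .then(setData);
  }, []);

  return <pre>{JSON.stringify(data)}</pre>;
}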

ReactPHP with Symfony images and CSS

I am using ReactPHP with Symfony. My ReactPHP web server URL is http://localserver.reactsymfony:1337/.
None of the CSS and image files are served. For example, http://localserver.reactsymfony:1337/bundles/calibration/css/bootstrap.min.css gives me a 404 Not Found error.
I am clueless as to how to get rid of this error.
You need to serve these files yourself with ReactPHP or instead run a separate file server on a different port. You might be able to find some component that does exactly that, but I don't know of any.
By default, ReactPHP doesn't have any concept of a document root. It just has a request event handler and that's it.
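A very rough sketch of serving those files yourself, assuming react/http ^1.x (the document root, port, and naive path handling are illustrative and would need hardening for real use):

<?php
require __DIR__ . '/vendor/autoload.php';

use Psr\Http\Message\ServerRequestInterface;
use React\Http\HttpServer;
use React\Http\Message\Response;
use React\Socket\SocketServer;

$docRoot = __DIR__ . '/web'; // where Symfony's bundles/... assets live (illustrative)

$server = new HttpServer(function (ServerRequestInterface $request) use ($docRoot) {
    $path = $docRoot . $request->getUri()->getPath();
    if (is_file($path)) {
        // Naive static response: no path sanitization or caching headers here.
        return new Response(200, ['Content-Type' => mime_content_type($path)], file_get_contents($path));
    }
    // ...otherwise hand the request to the Symfony kernel as before...
    return new Response(404, ['Content-Type' => 'text/plain'], 'Not found');
});

$server->listen(new SocketServer('0.0.0.0:1337'));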

Server-Side routing for single page app

I am trying to get my Dart Polymer 1.0 single page app working with pushState. I have set up nginx to route all requests to the dev server that runs when executing pub serve. Nginx also takes care of always requesting index.html instead of the real URL.
The problem I am facing is that as soon as I load a URL with at least one folder, the JS can no longer be loaded.
Example
Requesting project.local loads the index.html file and works fine. The same is true for project.local/test. As soon as I try going to project.local/test/something, it stops working because the file index.bootstrap.initialize.dart is requested from project.local/test/index.bootstrap.initialize.dart and not from project.local/index.bootstrap.initialize.dart.
Source code
The whole project can be found at https://github.com/agileaddicts/blitzlicht. The index.html is where the magic happens.
How do I tell the transformer to put absolute urls into the html instead of relative ones?
You should be able to upgrade to the latest version of Polymer by changing the version of reflectable:
reflectable: >=0.5.0
and perhaps add this to your pubspec:
- $dart2js:
    $include: '**/*.bootstrap.initialize.dart'

Scrapy follow external links [duplicate]

I saw this post about making Scrapy crawl any site without the allowed domains restriction.
Is there a better way of doing it, such as using a regular expression in the allowed_domains variable, like
allowed_domains = ["*"]
I hope there is some other way than hacking into the Scrapy framework to do this.
Don't set allowed_domains at all.
Look at the get_host_regex() function in this scrapy file:
https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/spidermiddleware/offsite.py
You should deactivate the offsite middleware, which is a built-in spider middleware in Scrapy.
For more information: http://doc.scrapy.org/en/latest/topics/spider-middleware.html
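A minimal sketch of both suggestions combined, assuming a recent Scrapy release (the spider name and start URL are illustrative, and the middleware path differs in very old versions such as the scrapy.contrib one linked above):

import scrapy

class FollowAnythingSpider(scrapy.Spider):
    name = "follow_anything"
    start_urls = ["https://example.com"]
    # No allowed_domains attribute, so no request is filtered as offsite.

    # Optionally disable the built-in offsite spider middleware explicitly.
    custom_settings = {
        "SPIDER_MIDDLEWARES": {
            "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": None,
        },
    }

    def parse(self, response):
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)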

httpmodules - configure to run only once?

I've created an HttpModule, but it seems to be getting called on every request within a page.
Can it be configured to run only once, or only for a certain file type?
At the moment, when a request is made to load an image, it runs the module too.
Any ideas?
Thanks,
I think the only way to fix that is to put some checking within the HttpModule to work out which file the request is for and filter accordingly.
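A rough sketch of that filtering approach inside the module itself (the module name and the .aspx-only check are illustrative assumptions):

using System;
using System.Web;

public class FilteredModule : IHttpModule
{
    public void Init(HttpApplication app)
    {
        app.BeginRequest += (sender, e) =>
        {
            // Work out what kind of file is being requested and skip static assets.
            string ext = VirtualPathUtility.GetExtension(app.Context.Request.Url.AbsolutePath);
            if (!string.Equals(ext, ".aspx", StringComparison.OrdinalIgnoreCase))
                return; // ignore images, CSS, JS, etc.

            // ...the module's real work goes here...
        };
    }

    public void Dispose() { }
}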
