Scrapy shell with playwright - web-scraping

Is it possible to invoke Playwright in a Scrapy shell?
I would like to use a shell to test my xpaths, which I intend to place in a spider that incorporates Scrapy Playwright.
My scrapy settings file has the usual Playwright setup:
# Scrapy Playwright Setup
DOWNLOAD_HANDLERS = {
"http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
"https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

Yes, it is possible. In fact, all you have to do is run scrapy shell inside a folder that contains a Scrapy project. It will automatically load the default settings from settings.py; you can see this in the logs when scrapy shell starts.
You can also override settings with the -s parameter:
scrapy shell -s DOWNLOAD_HANDLERS='<<your custom handlers>>'
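Once the shell has picked up those settings, a request can be routed through Playwright by setting the playwright meta key before fetching it. A rough sketch of how to test XPaths that way (URL and XPath are placeholders, and this assumes the shell was started from the project root so the asyncio reactor and handlers are active):
import scrapy
req = scrapy.Request("https://example.com", meta={"playwright": True})
fetch(req)                               # fetch() is provided by the shell
response.xpath("//h1/text()").getall()   # response now holds the fetched page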
Happy Scraping :)

I believe the scrapy shell command might not work with scrapy-playwright; here I am using python3 as a demonstration instead.
This documentation link should help you further:
https://playwright.dev/python/docs/intro#interactive-mode-repl
I believe that instead of scrapy shell you just need python3 in interactive mode. This way you also get autocompletion, which the Scrapy shell never had.
Here is the synchronous example in a file called spider_interactive.py:
from playwright.sync_api import sync_playwright

# Start Playwright outside a "with" block so the objects stay alive in the session.
playwright = sync_playwright().start()
browser = playwright.firefox.launch()
page = browser.new_page()
page.goto("http://whatsmyuseragent.org/")

# Remember to run these manually when you're done, to avoid leaving stray browser processes behind:
# browser.close()
# playwright.stop()
Run with:
python3 -i spider_interactive.py
Then you can enter for example the following command:
page.locator("p.intro-text").all_inner_texts()
which returns:
['Mozilla/5.0 (X11; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0', 'My IP Address: your_ip_address_here']

Related

How to start the grpc server when definitions are in a module

My project structure looks like this:
/definitions (for all dagster python definitions)
    __init__.py
    repositories.py
    /exchangerates
        pipelines.py
        ...
...
workspace.yaml
I've tried running the grpc server using various methods, especially the following (started in project root):
dagster api grpc -h 0.0.0.0 -p 4000 -f definitions/repositories.py
dagster api grpc -h 0.0.0.0 -p 4000 -m definitions
dagster api grpc -h 0.0.0.0 -p 4000 -m definitions.repositories
The first command yields the following error:
dagster.core.errors.DagsterImportError: Encountered ImportError: attempted relative import with no known parent package while importing module repositories from file C:\Users\Klaus\PycharmProjects\dagsterexchangerates\definitions\repositories.py. Consider using the module-based options -m for CLI-based targets or the python_package workspace.yaml target.
The second and third command yield the following error:
(stacktrace comes before this)
ModuleNotFoundError: No module named 'definitions'
How can this be solved?
EDIT:
I've uploaded the current version of the example I'm working on to GitHub: https://github.com/kstadler/dagster-exchangerates
EDIT2:
Reflected changes in directory structure
Sorry about the trouble - there are a couple of options here to get your server running.
To get it working with the '-f' option, the relative imports need to be replaced with absolute imports. That would look like this:
-from .pipelines import exchangerates_pipline
-from .partitions import year_partition_set
+from definitions.pipelines import exchangerates_pipline
+from definitions.partitions import year_partition_set
(This is the same error you would get if you tried to run python definitions/repositories.py directly).
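For reference, a stripped-down sketch of what definitions/repositories.py might look like after the change; the import names are the ones from the diff above, and the repository body is only illustrative:
from dagster import repository

from definitions.pipelines import exchangerates_pipline
from definitions.partitions import year_partition_set

@repository
def exchangerates_repository():
    return [exchangerates_pipline, year_partition_set]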
I'm still digging into why exactly the third '-m' option isn't working the way I'd expect it to. Curiously, the following command, which should be close to identical, seems to work for me:
python -m dagster.grpc -h 0.0.0.0 -p 4000 -m definitions.repositories
Incidentally, your example contains a workspace.yaml that should cause dagit and other Dagster processes to automatically start up the gRPC server for you for that module - so depending on your goal you may not need to run dagster api grpc yourself at all.

Problem with launching scrapy spider from sh script

When I try to run my spider from an sh script, it displays only one warning and does not parse anything, but when I run the spider by itself, the parsing goes fine.
My sh file:
#!/bin/bash
# shellcheck disable=SC2164
cd /var/www/scrapy_parser/avito/avito/spiders
scrapy crawl avito -L WARNING
cd /var/www/scrapy_parser/info/info/spiders
scrapy crawl info_v1 -L WARNING
sh output:
WARNING: /usr/local/lib/python3.6/site-packages/scrapy/extensions/feedexport.py:210:
ScrapyDeprecationWarning: The `FEED_URI` and `FEED_FORMAT` settings have been deprecated in
favor of the `FEEDS` setting. Please see the `FEEDS` setting docs for more details
exporter = cls(crawler)
What can I do to fix this?
In the end I solved this problem through logging: it turned out that the site I parse had blocked me as a bot, and the sh script did not pick up the proxies preconfigured on the system. I fixed it by adding the proxy settings directly to the sh script before starting the spider.
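An alternative to exporting the proxy in the sh script is to set it per request inside the spider, which Scrapy's built-in HttpProxyMiddleware honors regardless of the environment the script runs in. A minimal sketch with placeholder URL and proxy credentials:
import scrapy

class AvitoSpider(scrapy.Spider):
    name = "avito"

    def start_requests(self):
        # HttpProxyMiddleware reads request.meta["proxy"]; the proxy URL is a placeholder
        yield scrapy.Request(
            "https://example.com",
            meta={"proxy": "http://user:password@proxy.example.com:8080"},
        )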

paramiko and nohup ''

OK, so I have paramiko v2.2.1 and I am trying to log in to a machine and restart a service. Inside the service scripts it basically starts a process via nohup. However, if I allow paramiko to disconnect as soon as it is done, the started process terminates with a PIPE signal when it writes to stdout.
If I start the service by ssh'ing into the box and manually starting it, there is no issue and it runs in the background fine. Also, if I add a sleep(10) before disconnecting (closing) paramiko, it also seems to work just fine.
The service is started via an init.d script, with a line like this:
env LD_LIBRARY_PATH=$bin_path nohup $bin_path/ServerLoop.sh \
    "$bin_path/Service service args" "$@" &
Where ServerLoop.sh simply calls the service forever in a loop like this so it will never die:
SERVER=$1
shift
ARGS="$@"
logger $ARGS
while [ 1 ]; do
    $SERVER $ARGS
    STATUS=$?   # exit code used in the log line below
    logger "$SERVER terminated with exit code: $STATUS. Server has been restarted"
    sleep 1
done
I have noticed that when I start the service by ssh'ing into the box, I get a nohup.out file written to the root directory. However, when I run through paramiko, no nohup.out is written anywhere on the system. For example, this is after I manually ssh into the box and start the service:
root@ts4700:/mnt/mc.fw/bin# find / -name "nohup*"
/usr/bin/nohup
/usr/share/man/man1/nohup.1.gz
/nohup.out
And this is after I run through paramiko:
root@ts4700:/mnt/mc.fw/bin# find / -name "nohup*"
/usr/bin/nohup
/usr/share/man/man1/nohup.1.gz
As I understand it, nohup will only redirect the output to nohup.out "if standard output is a terminal" (quoting the manual); otherwise it assumes the output is already going to a file and does not redirect it. Hence I tried the following:
In [43]: import paramiko
In [44]: paramiko.__version__
Out[44]: '2.2.1'
In [45]: ssh = paramiko.SSHClient()
In [46]: ssh.set_missing_host_key_policy(AutoAddPolicy())
In [47]: ssh.connect(ip, username='root', password=not_for_so_sorry, look_for_keys=False, allow_agent=False)
In [48]: stdin, stdout, stderr = ssh.exec_command("tty")
In [49]: stdout.read()
Out[49]: 'not a tty\n'
So I am thinking that nohup is not redirecting to nohup.out when I run it through paramiko because tty does not report a terminal. I don't know why adding a sleep(10) would fix this, though, as the service is quite verbose when run from the command line.
I have also noticed that if the service is started from a manual ssh, its tty in the ps ax output is still set to the ssh tty; however, if the process is started by paramiko, its tty in the ps ax output is set to "?". Since both processes are run through nohup I would have expected this to be the same.
If the problem is that nohup is indeed not redirecting the output to nohup.out because of the tty, is there a way to force this to happen, or a better way to run this sort of command via paramiko?
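For what it's worth, the kind of forced redirection I have in mind would look roughly like this (host, credentials, service path and log path are all placeholders, and I haven't verified it):
import paramiko

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect("ts4700", username="root", password="not_for_so_sorry",
            look_for_keys=False, allow_agent=False)

# Redirect stdout/stderr to a file and detach stdin, so the started process
# never writes to the SSH channel after paramiko disconnects.
cmd = "nohup /etc/init.d/myservice restart > /tmp/myservice.log 2>&1 < /dev/null &"
stdin, stdout, stderr = ssh.exec_command(cmd)
stdout.channel.recv_exit_status()   # wait for the remote shell to return
ssh.close()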
Thanks all, any help with this would be great :)

Can I use node-inspector with meteor?

Can I use node-inspector with meteor?
I tried "--debug" option, and succeed to connect debug port.
But, cannot to access my codes.
exec "$DEV_BUNDLE/bin/node" "--debug" "$METEOR" "$#"
You might have encountered the same problem I have:
On a Linux machine, the Meteor script will spawn two processes:
Process1: node "meteor files"
Process2: node "your meteor files"
When you run exec "$DEV_BUNDLE/bin/node" "--debug" "$METEOR" "$@", it spawns process1 in debug mode, but process2 still runs in normal mode. This is why you cannot see your files.
I just run the regular meteor script and send kill -s USR1 to process2; then you can see your files in node-inspector.

Post to Twitter using Terminal with CURL

I got this far:
:~ curl -u username:password -d status="new_status" http://twitter.com/statuses/update.xml
Now, how can I alias this with variables so I can easily tweet from Terminal? And how can I make the alias persist across sessions (when I close Terminal, aliases reset)?
Thanks!
Basic Authentication is no longer supported by twitter. Please use OAuth.
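For a scriptable replacement, one option is a small Python script that signs the request with OAuth 1.0a via the requests-oauthlib package against the v1.1 statuses/update.json endpoint; the four credentials below come from a Twitter developer app and are placeholders:
from requests_oauthlib import OAuth1Session

twitter = OAuth1Session(
    client_key="CONSUMER_KEY",
    client_secret="CONSUMER_SECRET",
    resource_owner_key="ACCESS_TOKEN",
    resource_owner_secret="ACCESS_TOKEN_SECRET",
)
resp = twitter.post(
    "https://api.twitter.com/1.1/statuses/update.json",
    data={"status": "new_status"},
)
print(resp.status_code, resp.text)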
You clearly have the alias command: stick it in your ~/.bashrc and it will be set up when your bash shell starts. (.shrc should also work for sh-like shells.)
If you stick it in a script file as the previous answer suggests:
(a) add the line
#!/bin/sh
at the top;
(b) make sure it's on your path or you'll have to type the whole path to the script when you want to run it.
(c) make it executable:
chmod +x tweet.sh
What about putting it in a file and using the first argument as $1:
# tweet.sh "post my status, moron!":
curl -u username:password -d status="$1" http://twitter.com/statuses/update.xml
will that work?
You need to create a file in your home directory that will get referenced each time a new terminal opens.
Do a bit of research as to what to name the file, according to what type of shell you are using (tcsh looks for a file called .tcshrc while bash looks for .bashrc).
Once you have that file, make it executable by running:
chmod +x name_of_file
Then, in that file, create your alias (again, you'll need to research how to do this depending on what type of shell you are using). For tcsh, my alias looks like this:
alias tw 'curl -u username:password -d status=\!^ http://twitter.com/statuses/update.xml'
Bash aliases use an equals sign. A bash alias would look something more like this:
alias tw='curl -u username:password -d status=\!^ http://twitter.com/statuses/update.xml'
Note the change in the command after "status=". The \!^ tells the line of code to insert the first argument passed after the alias itself.
Save your file.
You could then run an update to twitter by typing the following in a new terminal:
tw 'my first post to twitter via the terminal, using aliases'
Don't forget to escape 'special' characters (like exclamations) with the escape character, \ (i.e. \!)
Since Basic Authentication is no longer supported by twitter, you have to use OAuth to achieve your goal.
But if you just want to post to Twitter using the terminal, there are many applications that can do it.
Take a look at Rainbowstream or t.
With rainbowstream, the following lines will let you tweet from console:
$ sudo pip install rainbowstream
$ rainbowstream
[@yourscreenname] t whatever you want
