Problem with launching scrapy spider from sh script

Problem with launching scrapy spider from sh script - web-scraping

When I try to run my sh script with a spider, it displays only one warning and does not parse, but when I run the spider on my own, the parsing goes fine
my sh file
#!/bin/bash
# shellcheck disable=SC2164
cd /var/www/scrapy_parser/avito/avito/spiders
scrapy crawl avito -L WARNING
cd /var/www/scrapy_parser/info/info/spiders
scrapy crawl info_v1 -L WARNING.
sh output:
WARNING: /usr/local/lib/python3.6/site-packages/scrapy/extensions/feedexport.py:210:
ScrapyDeprecationWarning: The `FEED_URI` and `FEED_FORMAT` settings have been deprecated in
favor of the `FEEDS` setting. Please see the `FEEDS` setting docs for more details
exporter = cls(crawler)
What can I do to fix this?

In general, I solved this problem by logging, in the end it turned out that the site that I parse blocked me as a bot and the sh script did not read the pre-installed proxies in the system, I solved everything by adding the proxy setting directly to the sh script before starting the spider

Related

How to make HTML changes or customize module like poll in BigBlueButton html5-client?

I'm using bigbluebutton (2.3-dev) in Ubuntu 18.04 server I installed it using bbb-install (# wget -qO- https://ubuntu.bigbluebutton.org/bbb-install.sh | bash -s -- -v bionic-230-dev -s bbb.example.com -e info#example.com -a -w) and its work perfect.
Now I want to make some changes in html5-client (https://doamin/html5client/join?sessionToken=e)
I found the file path - /usr/share/meteor/bundle and it's served from this path /usr/share/meteor/bundle/programs/web.browser but problem is this is a build file so I can't make any changes because every time this file is new generator when stop and start or restart.
I want to add one link in left side menu (http://prntscr.com/umy63l). How can I do this and where I can do this?
Thanks in advance!

Did you install a dev environement for bbb-html5 ? You can find the doc about it here :
https://docs.bigbluebutton.org/2.2/dev.html#developing-the-html5-client

Unable to export env variable from script

I'm currently struggling with running a .sh script I'm trying to trigger from Jenkins.
Within the Jenkins "execute shell" section, I'm connecting to a remote server (The Jenkins agent does not have right OS to build what I need.), using:
cp -r . /to/shared/drive/to/have/access/on/remote
ssh -t -t username#servername << EOF
cd /to/shared/drive/to/have/access/on/remote
source build.sh dev
exit
EOF
Inside build.sh, I'm exporting R_LIBS to build a package for different R versions.
...
for path in "${!rVersionPaths[#]}"; do
export R_LIBS="${path}"
Rscript -e 'install.packages(c("someDependency", "someOtherDependency"), repos="http://cran.r-project.org");'
...
Setting R_LIBS should functions here like setting lib within install.packages(...). For some reason the R_LIBS export doesn't get picked up. Also setting other env variables like http_proxy are ignored. This causes any requests outside the network to fail.
Is there any particular way of achieving this?

Maybe pass those variables with env, like
env R_LIBS="${path}" Rscript -e 'install.packages(c("someDependency", .....

Well i'm not able to comment on the question, so posting it as answer.
I had similar problem when calling remote shell script from Jenkins, the problem was somehow bash_profile variables were not loaded when called the script from Jenkins but locally it worked. Loading the bash profile in ssh connection solved it for me.
Add source to bash_profile in build.sh
. ~/.bash_profile OR source ~/.bash_profile
Or
Reload bash_profile in ssh connection
`ssh -t -t username#servername << EOF
. ~/.bash_profile
your commands here
exit
EOF

You can set that variable in the same command line like this:
R_LIBS="${path}" Rscript -e \
'install.packages(c("someDependency", "someOtherDependency"), repos="http://cran.r-project.org");'
It's possible to append more variables in this way. Note that this will set those environment variables only for the command being called after them (and its children processes as well).

You said that "R_LIBS export doesn't get picked up". Question Is the value UNSET? Or is it set to some other value & you are trying to override it?
It is possible that SSH may be invoking "/bin/sh -c". Based on the second answer to: Why does 'cd' command not work via SSH?, you can simplify the SSH command and explicitly invoke the build.sh script in Bash:
cp -r . /to/shared/drive/to/have/access/on/remote
ssh -t -t username#servername "cd /to/shared/drive/to/have/access/on/remote && bash -f build.sh dev"
This makes the SSH invocation more similar to invoking the command within a remote interactive shell. (You can avoid sourcing scripts and exporting variables.)
You don't need to export R_LIBSor env R_LIBS when it is possible to prefix any command with local environment variable overrides (agrees with Luis' answer):
...
for path in "${!rVersionPaths[#]}"; do
R_LIBS="${path}" Rscript -e 'install.packages(c("someDependency", "someOtherDependency"), repos="http://cran.r-project.org");'
...
The Rscript may be doing a lot with env vars. You can verify that you are setting the R_LIBS env var by replacing Rscript with the env command and observe the output:
...
for path in "${!rVersionPaths[#]}"; do
R_LIBS="${path}" env
...
According to this manual "Initialization at Start of an R Session", Rscript looks in several places to load "site and user files":
$R_PROFILE
$R_HOME/etc/Renviron
$R_HOME/etc/Renviron.site
$R_ENVIRON_USER
$R_PROFILE_USER
./.Rprofile
$HOME/.Rprofile
./.RData
The "Examples" section of that manual shows this:
## Not run:
## Example ~/.Renviron on Unix
R_LIBS=~/R/library
PAGER=/usr/local/bin/less
If you add the --vanilla command-line option to ignore all of these files, then you may get different results and know something in the site/init/environ files is affecting your R_LIBS! I cannot run this system myself. Hopefully we have given you some areas to investigate.

You probably don't want to source build.sh, just invoke it directly (i.e. remove the source command).
By source-ing the file your script is executed in the SSH shell (likely sh) rather than by bash, which it sounds like is what you intended.

How do I turn off wget proxy?

I had been using a proxy for a long time. Now I need to remove it. I have forgotten how I have added the proxy to wget. Can someone please help me get back to the normal wget where it doesn't use any proxy. As of now, I'm using
wget <link> --proxy=none
But I'm facing a problem when I'm installing using a pre-written script. It's painstaking to search all through the scripts and change each command.
Any simpler solution will be very much appreciated.
Thanks

Check your
~/.wgetrc
/etc/wgetrc
and remove proxy settings.
Or use wget --no-proxy command line option to override them.

In case your OS is alpine/busybox then the wget might vary from the one used by #Logu.
There the correct command is
wget --proxy off http://server:port/
Running wget --help outputs:
/ # wget --help
BusyBox v1.31.1 () multi-call binary.
Usage: wget [-c|--continue] [--spider] [-q|--quiet] [-O|--output-document FILE]
[-o|--output-file FILE] [--header 'header: value'] [-Y|--proxy on/off]
[-P DIR] [-S|--server-response] [-U|--user-agent AGENT] [-T SEC] URL...
Retrieve files via HTTP or FTP
--spider Only check URL existence: $? is 0 if exists
-c Continue retrieval of aborted transfer
-q Quiet
-P DIR Save to DIR (default .)
-S Show server response
-T SEC Network read timeout is SEC seconds
-O FILE Save to FILE ('-' for stdout)
-o FILE Log messages to FILE
-U STR Use STR for User-Agent header
-Y on/off Use proxy

crontab not working: perhaps an error in notation?

In an Amazon EC2 terminal, I type: `sudo nano crontab -e' to bring up the editor. I have the following (empty line at the end included):
#reboot echo "Running RMV scrape & R Shiny via: nano crontab -e"
#reboot nohup python /home/ec2-user/RMV/RMV_scrape.py &
#reboot nohup shiny-server &
#reboot service start httpd
#hourly cp -f /home/ec2-user/RMV/wait_times.csv /var/shiny-server/www/wait_times.csv
Here, I'm trying to run (a) my program, (b) apache, (c) R Shiny server and (d) a script that runs hourly to copy a file.
For some reason, this fails to run. pgrep chron does show chron runs upon startup. It shouldn't be a permissions issue because I ran crontab using sudo. I had one relative pathname in my .py script but I changed it to an absolute pathname.
I've consulted:
https://askubuntu.com/questions/23009/reasons-why-crontab-does-not-work
http://www.unix.com/answers-to-frequently-asked-questions/13527-cron-crontab.html
Any ideas why this may not be working?

I think your problems is with the command you used to edit the crontab sudo nano crontab -e does not edit the crontab you made a file named crontab in whatever directory you were working in, but crontab files are in /var and are not intended to be edited directly. For any given user crontab -e will edit the crontab using the editor specified in the environment variable EDITOR. So to edit root's crontab the command is sudo crontab -e.
That said adding entries to root's crontab is probably not what you want. You probably want to use the system crontab for some thing like this. In almost all cases the system crontab is /etc/crontab which can be edited using sudo nano /etc/crontab. Note that for the system crontab you need to add the user of the command between the time and command sections. e.g.
#reboot root echo "Running RMV scrape & R Shiny via: nano crontab -e"
Also note that crontab uses a very minimal PATH environment variable for security reasons. If a command you issue is not on the path it will not execute. Remember to either add the paths you need to the crontab PATH (specified in the particular crontab file) or use the full path to a given executable from the (filesystem) root directory.

How to run my script using wget?

I have a URL in my custom module which runs a long script. If i call url via wget it downloads the page content. It doesn't run the script. How to do it?

I would have thought that even though it downloaded the page it would still run the script.
To run without downloading the file use:
wget -O - -q -t 1 http://example.com/path/to/file.php
From memory:
-O and the hyphen are redirecting the output so it's not saved to a file.
-q is for quiet
-t is the number of attempts.
You can use man wget to look any more options up.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Problem with launching scrapy spider from sh script - web-scraping

In general, I solved this problem by logging, in the end it turned out that the site that I parse blocked me as a bot and the sh script did not read the pre-installed proxies in the system, I solved everything by adding the proxy setting directly to the sh script before starting the spider

Related

How to make HTML changes or customize module like poll in BigBlueButton html5-client?

Unable to export env variable from script

How do I turn off wget proxy?

crontab not working: perhaps an error in notation?

How to run my script using wget?

Categories

Resources