How can I speed up downloading files using paramiko? - sftp

I have written code for downloading a file from an SFTP server, but the process is taking a lot of time. Could you please tell me whether there is any way to speed it up?
The code I am using:
import paramiko
ssh_client = paramiko.SSHClient()
ssh_client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh_client.connect(hostname='****', port=22, username='*****', password='', key_filename='******')
sftp_obj = ssh_client.open_sftp()
sftp_obj.get(sftp_loc, ec2_loc)  # sftp_loc: remote file path, ec2_loc: local destination path
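One change that sometimes helps, offered as a hedged sketch rather than a verified fix: connect through paramiko.Transport with a larger SSH window so more data stays in flight on high-latency links, and prefetch the remote file before reading it. The masked credentials are the same placeholders as above, and pkey stands in for SSHClient's key_filename (an RSA key is assumed here).
import paramiko
# Sketch only: a bigger default_window_size reduces round-trip stalls on slow links.
transport = paramiko.Transport(('****', 22), default_window_size=2 ** 23)
transport.connect(username='*****', pkey=paramiko.RSAKey.from_private_key_file('******'))
sftp_obj = paramiko.SFTPClient.from_transport(transport)
with sftp_obj.open(sftp_loc, 'rb') as remote_file:
    # prefetch() issues the read requests up front instead of one at a time.
    remote_file.prefetch()
    with open(ec2_loc, 'wb') as local_file:
        # Note: read() pulls the whole file into memory, so this suits moderate file sizes.
        local_file.write(remote_file.read())
transport.close()
Whether this is actually faster depends mostly on network latency and the server, so it is worth benchmarking against the plain sftp_obj.get() call above.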

Related

Deno 500 error from dev.jspm.io installing dependencies

I'm trying to run my first Deno script, which is pretty much taken from the denoDB docs. It just tries to connect to a database with a SQLite3 connector (I'm on a MacBook Pro, so SQLite should be installed):
import { Database, SQLite3Connector } from 'https://deno.land/x/denodb/mod.ts';
const connector = new SQLite3Connector({
  filepath: './db.sqlite',
});
export const db = new Database(connector);
I'm running deno run api/db.ts and I get this error after a few successful downloads:
Download https://deno.land/std@0.149.0/encoding/hex.ts
Download https://deno.land/std@0.149.0/hash/_wasm/lib/deno_hash.generated.mjs
error: Import 'https://dev.jspm.io/inherits@2.0' failed: 500 Internal Server Error
at https://raw.githubusercontent.com/Zhomart/dex/930253915093e1e08d48ec0409b4aee800d8bd0c/lib-dyn/deps.ts:4:26
I've deleted my /Users/<me>/Library/Caches/deno/deps/https and re-run the script a few times, but I still can't get past this. In my browser, following the URL https://dev.jspm.io/inherits@2.0 does give me an error. What is going on here? There's not much code and I imagine it's not broken for everybody. What do I need to do to get this script to run without issues?
EDIT: it seems to be a library error https://github.com/eveningkid/denodb/issues/348
This error is caused by a nested dependency from a project that is no longer maintained.
See this for more info: https://jspm.org/jspm-dev-release
The point is that dev.jspm.io is now jspm.dev.
One way to fix this is to fork the project and update its dependencies.
Alternatively, if you're not using Deno Deploy, you can just use this as a replacement for your denodb import: https://raw.githubusercontent.com/joeldesante/denodb/master/mod.ts
Just note that this script is no longer maintained either, but it will fix your problem.
Edit
I just made a quick-and-dirty fix for Deno Deploy; use this as a dependency instead of denodb: https://raw.githubusercontent.com/ninjinskii/denodb/master/mod.ts
Again, I may not maintain this script forever.
The best outcome would be an update from these libraries' maintainers.

requests_html render() throwing OSError: [WinError 14001]

Hello, I'm trying to do web scraping with the Python module requests-html to handle dynamic content on the page https://www.monster.com/jobs/search?q=Software+Engineer&where=. My code is:
from requests_html import HTMLSession
url = 'https://www.monster.com/jobs/search?q=Software+Engineer&where='
session = HTMLSession()
response = session.get(url)
response.html.render()
but when I run response.html.render() I get this error
OSError: [WinError 14001] The application has failed to start because its side-by-side configuration is incorrect. Please see the application event log or use the command-line sxstrace.exe tool for more detail
The first time I ran render() I got
[W:pyppeteer.chromium_downloader] start chromium download.
Download may take a few minutes.
[W:pyppeteer.chromium_downloader]
chromium download done.
[W:pyppeteer.chromium_downloader] chromium extracted to: C:\Users\user\AppData\Local\pyppeteer\pyppeteer\local-chromium\588429
However, that file path doesn't exist, even though pyppeteer is actually installed as a package (pyppeteer==0.2.5). Does anyone have an idea what is going on?
You're having this issue because the Chromium setup failed.
You can either try to reinstall requests_html, or do what I did: switch from the Windows Store Python to the installer from the python.org website and install requests_html again.
After getting everything set up correctly with the downloaded Python, I switched back to Python 3.9 from the store and everything still works.
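If reinstalling alone does not cure it, another thing worth checking (my assumption, not part of the original answer) is whether pyppeteer's bundled Chromium ever extracted properly; the helpers in pyppeteer.chromium_downloader can verify that and force a fresh download. A rough sketch:
import shutil
from pyppeteer import chromium_downloader
if chromium_downloader.check_chromium():
    print('Chromium is in place at', chromium_downloader.chromium_executable())
else:
    # Remove any half-extracted files, then let pyppeteer download Chromium again.
    shutil.rmtree(chromium_downloader.DOWNLOADS_FOLDER, ignore_errors=True)
    chromium_downloader.download_chromium()
After a clean download, response.html.render() should be able to launch the browser it just fetched.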

Using Apache Airflow Tool, Implement a DAG for a batch processing pipeline to get a directory from a remote system

Using the Apache Airflow tool, how can I implement a DAG for the following Python code? The task accomplished in the code is to get a directory from a GPU server to the local system. The code works fine in a Jupyter notebook. Please help me implement it in Airflow... I'm very new to this. Thanks.
import pysftp
import os
myHostname = "hostname"
myUsername = "username"
myPassword = "pwd"
with pysftp.Connection(host=myHostname, username=myUsername, password=myPassword) as sftp:
    print("Connection successfully established ... ")
    src = '/path/src/'
    dst = '/home/path/path/destination'
    os.mkdir(dst)
    sftp.get_d(src, dst, preserve_mtime=True)
    print("Fetched source images from GPU server to local directory")
    # connection closed automatically at the end of the with-block
For SFTP duties, Airflow provides an SFTPOperator that you can use directly.
Alternatively, its corresponding SFTPHook can be used with a simple PythonOperator.
I acknowledge there aren't many examples, but this might be helpful.
For the SSH connection, see this.
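As a rough sketch of the PythonOperator route (assuming Airflow 2.x and reusing the hard-coded credentials from the question; in real use you would store them in an Airflow connection and reach for the SFTPOperator/SFTPHook mentioned above), the DAG file could look like this:
import os
from datetime import datetime
import pysftp
from airflow import DAG
from airflow.operators.python import PythonOperator

def fetch_directory():
    # Same transfer logic as the notebook code in the question.
    with pysftp.Connection(host="hostname", username="username", password="pwd") as sftp:
        src = '/path/src/'
        dst = '/home/path/path/destination'
        # makedirs with exist_ok so a re-run does not fail if the folder already exists
        os.makedirs(dst, exist_ok=True)
        sftp.get_d(src, dst, preserve_mtime=True)

with DAG(
    dag_id="fetch_gpu_directory",   # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,         # trigger manually; set a cron string for a daily batch
    catchup=False,
) as dag:
    PythonOperator(task_id="sftp_get_directory", python_callable=fetch_directory)
The file goes into your DAGs folder, and pysftp has to be installed in the environment the Airflow workers run in.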

E-mail (or similar) notification when code execution is finished

I am currently running several simulations in R that each take quite a long time to execute, and the time it takes for each to finish varies from case to case. To use the time in between more efficiently, I wondered if it would be possible to set up something (like an e-mail notification system or similar) that would notify me as soon as a chunk of simulation is completed.
Does somebody here have any experience with setting up something similar, or does someone know a resource that could teach me to implement a notification system via R?
I recently saw an R package for this kind of thing: pushoverr. However, I haven't used it myself, so I haven't tested how it works, but it seems like it might be useful in your case.
I assume you run the time-consuming simulations on a server, correct? If they run on your own PC, your PC will be slow as hell anyway, and I don't see much benefit in sending a mail to myself.
For long calculations: run them on a virtual machine. I use the following workflow for my own calculations.
Write your R script. Important: have it write a .txt file when the calculation finishes at the end. The shell script below will wait in a loop for that file to exist.
Copy the code below and save it as a Python script. I once tried to get mailR running on Linux and it did not work; this code worked on the first try.
#!/usr/bin/env python3
import smtplib
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
from email.mime.base import MIMEBase
from email import encoders

email_user = 'youownmail@gmail.com'   # sending account
email_password = 'password'
email_send = 'theothersmail.com'      # recipient address
subject = 'yourreport'

msg = MIMEMultipart()
msg['From'] = email_user
msg['To'] = email_send
msg['Subject'] = subject

body = 'Calculation is done'
msg.attach(MIMEText(body, 'plain'))

# Attach the marker file the R script writes when it finishes.
with open('/tmp/finished.txt', 'rb') as attachment:
    part = MIMEBase('application', 'octet-stream')
    part.set_payload(attachment.read())
encoders.encode_base64(part)
part.add_header('Content-Disposition', 'attachment; filename=finished.txt')
msg.attach(part)

text = msg.as_string()
server = smtplib.SMTP('smtp.gmail.com', 587)
server.starttls()
server.login(email_user, email_password)
server.sendmail(email_user, email_send, text)
server.quit()
Make sure you are allowed to run the scripts.
sudo chmod 777 /path/script.R
sudo chmod 777 /path/script.py
Run both your script.R and script.py from a script.sh file. It looks like the following:
R < /path/script.R --no-save
while [ ! -f /tmp/finished.txt ]
do
    sleep 2
done
python /path/script.py
This may sound a bit overwhelming if you are not familiar with these technologies, but I think it is a pretty well automated workflow that spares your own resources and can be used "in production". (I use this workflow to send myself my own stock reports.)

R Import - CSV file from password protected URL - in .BAT file

Okay - so here is what I'm trying to do.
I've got a password-protected CSV file I'm trying to import into R.
I can import it fine using:
read.csv()
and when I run my code in RStudio everything works perfect.
However, when I try to run my .R file using a batch file (Windows .bat), it doesn't work. I want to use the .BAT file so that I can set up a scheduled task to run my code every morning.
Here is my .BAT file:
"E:\R-3.0.2\bin\x64\R.exe" CMD BATCH "E:\Control Files\download_data.R" "E:\Control Files\DailyEmail.txt"
And here is my .R file:
url <- "http://username:password#www.url.csv"
data <- read.csv(url, skip=1)
** Note: I've put my username/password and the exact location of the CSV in my actual code. I've used generic stuff here, as this is work-related and posting usernames and passwords is probably frowned upon.
As I've said, this code works fine when I use it in RStudio. But fails when I use the .BAT file.
I get the following error message:
Error in download.file(url, "E:/data/data.csv") :
cannot open URL 'websiteurl'
In addition: Warning message:
In download.file(url, "E:/data/data.csv") :
unable to resolve 'username'
Execution halted
** Above, 'websiteurl' is the HTTP URL from the code above (I can't post links).
So obviously, the .BAT is having trouble with the username/password? Any thoughts?
* EDIT *
I've gone so far as trying this on Linux, thinking maybe Windows was playing silly buggers.
Just from the terminal, I run Rscript -e "download_data.r" and get the EXACT same error message as I did on Windows. So I suspect this may be a problem with where I'm getting the data from? Could the provider be blocking downloads from the command line, but not from within RStudio?
I have had similar problems which had to do with file permissions. The .bat file somehow does not have the same privileges as you running the code directly from RStudio. Try using Rscript (http://stat.ethz.ch/R-manual/R-devel/library/utils/html/Rscript.html) within your .bat file, like:
Rscript "E:\Control Files\download_data.R"
What is the purpose of the argument "E:\Control Files\DailyEmail.txt"? Is the program supposed to use it in any way?
So, I've found a solution, which is likely not the most practical for most people, but works for me.
What I did was migrate my project over to a Linux system. Running daily scripts is easier on Linux anyway.
The solution makes use of the wget command on Linux.
You can either run wget right in your shell script, or make use of the system() function in R to run it.
The wget call looks like:
wget -O /home/user/.../file.csv --user=userid --password='password' http://www.url.com/file.csv
And you can do something like:
syscomand >- "wget -O /home/.../file.csv --user=userid --password='password' http://www.url.com/file.csv"
system (syscommand)
in R to download the CSV to a location on your hard drive, then grab the CSV using read.csv()
Doing it this way gave me some more insight into the potential root cause of the problem. While the system(syscommand) is running, I get the following output:
Connecting to www.website.com (www.website.com)|ip.ad.re.ss|:80... connected.
HTTP request sent, awaiting response... 401 Unauthorized
Reusing existing connection to www.weburl.com:80.
HTTP request sent, awaiting response... 200 OK
I'm not sure why it has to send the request twice, or why I'm getting a 401 Unauthorized on the first try.
