How can I read a .docx or .doc file directly from a URL link without downloading it to the local system in Python 3.7?

I have built resume parsing code in Python 3.7, but I want to read a .docx or .doc resume file directly from a URL link (e.g. http://13.234.163.240/storage/userData/phpjsK4iJ.docx) without downloading it to my local system.

How about Requests?
import requests
from io import BytesIO

# Fetch the file over HTTP and keep the bytes in memory
r = requests.get('http://your.link/storage/file.docx')

# Wrap the bytes in a file-like object that the parser can read like an open file
fh = BytesIO(r.content)
your_resume_parser.parse_resume(fh)
More about Requests: https://2.python-requests.org/en/master/user/quickstart/
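If your parser itself needs a .docx reader, the same in-memory handle can be fed to python-docx as well. A quick sketch, assuming python-docx is installed (it handles .docx only, not legacy .doc), using the URL from the question:
import requests
from io import BytesIO
from docx import Document  # pip install python-docx; an assumption, not part of the original answer

# Download the resume into memory; nothing is written to disk
resp = requests.get('http://13.234.163.240/storage/userData/phpjsK4iJ.docx')
resp.raise_for_status()

# python-docx accepts any file-like object, so the BytesIO wrapper can be passed directly
doc = Document(BytesIO(resp.content))
resume_text = '\n'.join(p.text for p in doc.paragraphs)
print(resume_text[:500])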

Related

Adobe Extract PDF unable to move result file from one disk to another

I am using Adobe Extract PDF and it works fine. I only have one problem, when I try to save the result file:
result.save_as("./output/ExtractTextInfoFromPDF.zip")
meaning that I am saving the file in the ./output directory at the root of the current Python app.
Adobe's API responds with an OSError: [WinError 17] stating that the temp file stored at AppData\Local\Temp\extractSdkResult\5b6dd24443b011ed9c77010101010000.zip cannot be moved to my local directory on drive E:.
What should I do?
Thanks a lot,
Pierre-Emmanuel
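One workaround worth trying (not from the original thread): WinError 17 means a file cannot be moved across disk drives, so the guess here is that the SDK tries to move its temp file from C: to the E: output in one step. Letting save_as stay on the temp drive and doing the cross-drive move yourself avoids that; a minimal sketch:
import os
import shutil
import tempfile

# Assumption: the failure is a rename across drives (C: temp -> E: output).
# Save the result on the same drive as the temp file, then move it across drives with shutil.
os.makedirs("./output", exist_ok=True)
tmp_zip = os.path.join(tempfile.gettempdir(), "ExtractTextInfoFromPDF.zip")
result.save_as(tmp_zip)                                       # stays on the temp drive, so the SDK's internal move succeeds
shutil.move(tmp_zip, "./output/ExtractTextInfoFromPDF.zip")   # shutil.move falls back to copy+delete across drives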

How can I speed up downloading files using paramiko?

I have written code for downloading a file from an SFTP server, but the process is taking a lot of time. Could you please tell me, is there any way to speed up the process?
The code I am using:
import paramiko

ssh_client = paramiko.SSHClient()
ssh_client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh_client.connect(hostname='****', port=22, username='*****', password='', key_filename='******')
sftp_obj = ssh_client.open_sftp()

# sftp_loc is the remote path on the SFTP server, ec2_loc is the local destination path
sftp_obj.get(sftp_loc, ec2_loc)
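For what it's worth, two tunings that often get suggested are a larger SSH transport window/packet size and explicit read-ahead prefetching. A minimal sketch of the prefetch variant (host, credentials, and paths are placeholders, not values from the question):
import shutil
import paramiko

ssh_client = paramiko.SSHClient()
ssh_client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh_client.connect(hostname='****', port=22, username='*****', key_filename='******')
sftp_obj = ssh_client.open_sftp()

remote_path = '/remote/path/file.bin'   # placeholder for sftp_loc
local_path = '/local/path/file.bin'     # placeholder for ec2_loc

# Ask paramiko to read ahead in the background instead of issuing one
# synchronous request per block, then stream the data to the local file.
with sftp_obj.open(remote_path, 'rb') as remote_file:
    remote_file.prefetch()
    with open(local_path, 'wb') as local_file:
        shutil.copyfileobj(remote_file, local_file, length=1024 * 1024)

ssh_client.close()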

Reading csv files from Microsoft Azure using R

I have recently started working with Databricks and Azure.
I have Microsoft Azure Storage Explorer. I ran a jar program on Databricks
which outputs many csv files in Azure Storage Explorer under the path
..../myfolder/subfolder/output/old/p/
The usual thing I do is to go to the folder p, download all the csv files
by right-clicking the p folder and clicking download to my local drive,
and then read these csv files in R to do any analysis.
My issue is that sometimes my runs can generate more than 10000 csv files,
and downloading them to the local drive takes a lot of time.
I wondered if there is a tutorial/R package which helps me read in
the csv files from the path above without downloading them. For example,
is there any way I can set
..../myfolder/subfolder/output/old/p/
as my working directory and process all the files in the same way I do now?
EDIT:
The full URL to the path looks something like this:
https://temp.blob.core.windows.net/myfolder/subfolder/output/old/p/
According to the official document CSV Files for Azure Databricks, you can directly read a csv file in R in an Azure Databricks notebook, as shown in the R example in the section Read CSV files notebook example.
Alternatively, I used the R package reticulate and the Python package azure-storage-blob to read a csv file directly from a blob URL with a SAS token for Azure Blob Storage.
Here are my steps:
1. I created an R notebook in my Azure Databricks workspace.
2. I installed the R package reticulate via install.packages("reticulate").
3. I installed the Python package azure-storage-blob as below.
%sh
pip install azure-storage-blob
4. I ran a Python script to generate a container-level SAS token and used it to get a list of blob URLs with the SAS token; see the code below.
library(reticulate)
py_run_string("
from azure.storage.blob.baseblobservice import BaseBlobService
from azure.storage.blob import BlobPermissions
from datetime import datetime, timedelta

account_name = '<your storage account name>'
account_key = '<your storage account key>'
container_name = '<your container name>'

blob_service = BaseBlobService(
    account_name=account_name,
    account_key=account_key
)

sas_token = blob_service.generate_container_shared_access_signature(container_name, permission=BlobPermissions.READ, expiry=datetime.utcnow() + timedelta(hours=1))
blob_names = blob_service.list_blob_names(container_name, prefix='myfolder/')
blob_urls_with_sas = ['https://'+account_name+'.blob.core.windows.net/'+container_name+'/'+blob_name+'?'+sas_token for blob_name in blob_names]
")
blob_urls_with_sas <- py$blob_urls_with_sas
5. Now I can use different ways in R to read a csv file from the blob URL with the SAS token, such as the ones below.
5.1. df <- read.csv(blob_urls_with_sas[[1]])
5.2. Using R package data.table
install.packages("data.table")
library(data.table)
df <- fread(blob_urls_with_sas[[1]])
5.3. Using R package readr
install.packages("readr")
library(readr)
df <- read_csv(blob_urls_with_sas[[1]])
Note: for the reticulate library, please refer to the RStudio article Calling Python from R.
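Side note: if pip install azure-storage-blob pulls in version 12 or later of the SDK, the baseblobservice import above no longer exists. A rough, untested sketch of step 4 against the v12 API (same placeholders, runnable through py_run_string in the same way) could look like this:
from datetime import datetime, timedelta
from azure.storage.blob import ContainerClient, ContainerSasPermissions, generate_container_sas

account_name = '<your storage account name>'
account_key = '<your storage account key>'
container_name = '<your container name>'

# Container-level, read-only SAS token valid for one hour
sas_token = generate_container_sas(
    account_name=account_name,
    container_name=container_name,
    account_key=account_key,
    permission=ContainerSasPermissions(read=True),
    expiry=datetime.utcnow() + timedelta(hours=1),
)

container_client = ContainerClient(
    account_url='https://' + account_name + '.blob.core.windows.net',
    container_name=container_name,
    credential=account_key,
)

# Same list of per-blob URLs with the SAS token appended
blob_urls_with_sas = [
    'https://' + account_name + '.blob.core.windows.net/' + container_name + '/' + blob.name + '?' + sas_token
    for blob in container_client.list_blobs(name_starts_with='myfolder/')
]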
Hope it helps.

Transfer files through RDP connection

Trying to copy files from a remote desktop to my local machine.
Here is the code that I tried...
import os
import os.path
import shutil
import sys
import win32wnet

def netcopy(host, source, dest_dir, username=None, password=None, move=False):
    """ Copies files or directories to a remote computer. """
    wnet_connect(host, username, password)
    dest_dir = covert_unc(host, dest_dir)

    # Pad a backslash to the destination directory if not provided.
    if not dest_dir[len(dest_dir) - 1] == '\\':
        dest_dir = ''.join([dest_dir, '\\'])

    # Create the destination dir if it's not there.
    if not os.path.exists(dest_dir):
        os.makedirs(dest_dir)
    else:
        # Create a directory anyway if file exists so as to raise an error.
        if not os.path.isdir(dest_dir):
            os.makedirs(dest_dir)

    if move:
        shutil.move(source, dest_dir)
    else:
        shutil.copy(source, dest_dir)
Trying to figure out how to establish a connection and copy files over to my local machine.
New to Python here...
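For reference, the snippet above calls two helpers (wnet_connect and covert_unc) that it never defines. A minimal sketch of what they typically look like in this kind of win32wnet recipe (an assumption, since the original post omits them):
import win32wnet

def wnet_connect(host, username, password):
    # Open an authenticated connection to the remote machine so its admin shares are reachable
    net_resource = win32wnet.NETRESOURCE()
    net_resource.lpRemoteName = '\\\\' + host
    win32wnet.WNetAddConnection2(net_resource, password, username, 0)

def covert_unc(host, path):
    # Turn a drive path such as 'C:\\data' into the UNC admin-share form '\\\\host\\C$\\data'
    return ''.join(['\\\\', host, '\\', path.replace(':', '$')])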
Are you using an RDP client?
Is this Windows, Linux, or Mac?
Which app are you using?
Is this code you wrote?
Do you know what virtual channels are?
Is NLA on?
There is very little information here.
Can you even connect? Can you ping the server?

download large zipped csv over https, unzip, and load

I'm trying to follow this example to download a zipped file over https, extract the csv file (14GB), and load the data into a dataframe. I created a small example (<1MB).
library(data.table)
temp <- tempfile()
download.file("https://www.dropbox.com/s/h130oe03krthcl0/example.csv.zip",
              temp, method="curl")
data <- fread(unz(temp, "example.csv"))
unlink(temp)
Is my mistake obvious?
This works fine for me (download.file does too, but I'm on R 3.2.2 on OS X so this is more "portable" given the updates to download.file since 3.1.2):
library(httr)
response <- GET("https://www.dropbox.com/s/h130oe03krthcl0/example.csv.zip?dl=1",
                write_disk("example.csv.zip"),
                progress())
fil <- unzip("example.csv.zip")
read.csv(fil[1], stringsAsFactors=FALSE)
## v1 v2 v3
## 1 1 2 3
## 2 1 2 3
## 3 1 2 3
I didn't try it without the ?dl=1 (and I do that by rote, not due to the edit queue suggestion).
Honestly, though, I'd probably skip doing the download in R and just use curl on the command line in an automated workflow for files of the size you've indicated (and I'd do that if the processing language were Python [et al.], too).
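For completeness, a rough Python version of that same workflow (stream the download, then read the csv straight out of the archive; requests and pandas are assumptions, and the URL is the one from the question):
import zipfile
import requests
import pandas as pd

# Stream the zip to disk in chunks so a 14 GB archive never has to fit in memory
url = "https://www.dropbox.com/s/h130oe03krthcl0/example.csv.zip?dl=1"
with requests.get(url, stream=True) as resp:
    resp.raise_for_status()
    with open("example.csv.zip", "wb") as out:
        for chunk in resp.iter_content(chunk_size=1024 * 1024):
            out.write(chunk)

# Read the csv straight out of the archive without extracting it first
with zipfile.ZipFile("example.csv.zip") as zf:
    with zf.open("example.csv") as fh:
        df = pd.read_csv(fh)
print(df.head())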
In one of my applications I was trying to download a zip file over http and create a stream for unzipping that file into a folder.
After some Google searching I was able to write the following code, which helped me with this task.
Here are the few steps you have to follow:
Install the unzipper package.
Import unzipper and http into the code file:
import unzipper from 'unzipper';
import http from 'http';
Now you have to download the zip file and create a stream for it; here is the complete code:
import unzipper from 'unzipper';
import http from 'http';

var self = this;
http.get('http://yoururl.com/file.zip', function(res) {
    res.pipe(unzipper.Extract({ path: 'C:/cmsdata/' })).on('close', function() {
        // Here you can perform any action after completion of stream unzipping
    });
});
