Why can't I export data to CSV with Scrapy? - web-scraping

I want to scrape a trading site with the Scrapy package, and I have configured all the settings. When I run
scrapy runspider exp.py -o exp1.csv
it scrapes the site, but nothing appears in the CSV file.
What could be the problem? I fixed the response status so that it is now 200 instead of 204, and the site is not rendered with JavaScript.

Related

ReadError: file could not be opened successfully. But I am not sure where the tar file is stored to resolve this

I am using biobert-embeddings==0.1.2 and torch==1.2.0 versions to embed some documents. But, I get the following error when I try to load the model by
from biobert_embedding.embedding import BiobertEmbedding
biobert = BiobertEmbedding()
Output/Error I get is -
Extracting biobert model tar.gz
ReadError: file could not be opened successfully
I had the same issue. Follow these steps to run the model:
1. Download the model from https://www.dropbox.com/s/hvsemunmv0htmdk/biobert_v1.1_pubmed_pytorch_model.tar.gz?dl=0
2. Extract all the files from the downloaded tar.gz file.
3. Load the model from the extracted folder:
biobert = BiobertEmbedding(model_path = "location_you_installed")
Note: Make sure "location_you_installed" contains config.json, pytorch_model.bin, and vocab.txt. These files are obtained in step 2.
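The ReadError message matches what Python's tarfile module raises when it cannot read a file as an archive, which often points to an incomplete download. Here is a minimal sketch of the extract-and-check steps above using only the standard library. The function name extract_model and the folder layout are my own assumptions, not part of biobert-embedding.

```python
import tarfile
from pathlib import Path

# File names taken from the note above; adjust if your archive differs.
REQUIRED_FILES = ("config.json", "pytorch_model.bin", "vocab.txt")


def extract_model(archive_path, target_dir):
    """Extract a model archive and return the folder holding the model files.

    Hypothetical helper for illustration; not part of biobert-embedding.
    """
    archive = Path(archive_path)
    target = Path(target_dir)
    if not tarfile.is_tarfile(archive):
        raise ValueError(f"{archive} is not a readable tar archive; try downloading it again")
    with tarfile.open(archive, "r:*") as tar:
        tar.extractall(target)
    # The files may sit in a subfolder of the archive, so search for config.json.
    for config in sorted(target.rglob("config.json")):
        folder = config.parent
        if all((folder / name).is_file() for name in REQUIRED_FILES):
            return folder
    raise FileNotFoundError(f"no folder under {target} contains {REQUIRED_FILES}")
```

The returned folder is what you would pass on, for example BiobertEmbedding(model_path=str(folder)).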

How can I read a .docx or .doc file directly from the URL link without downloading it to the local system in Python 3.7?

I have built resume-parsing code in Python 3.7, but I want to read a .docx or .doc resume file directly from a URL (e.g. http://13.234.163.240/storage/userData/phpjsK4iJ.docx) without downloading it to my local system.
How about Requests?
import requests
from io import BytesIO
r = requests.get('http://your.link/storage/file.docx')
fh = BytesIO(r.content)
your_resume_parser.parse_resume(fh)
More about Requests: https://2.python-requests.org/en/master/user/quickstart/
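If your parser does not accept a file-like object, the same in-memory idea can be sketched with the standard library alone. This relies on a .docx being a ZIP archive whose main text lives in word/document.xml. The function names fetch_bytes and docx_paragraphs are made up for illustration, and this sketch does not handle the legacy binary .doc format.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from urllib.request import urlopen

# WordprocessingML namespace used inside word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def fetch_bytes(url, timeout=30):
    """Download a URL into memory without writing it to disk."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def docx_paragraphs(data):
    """Return the text of each non-empty paragraph in a .docx given as bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return paragraphs
```

Usage would be something like docx_paragraphs(fetch_bytes('http://your.link/storage/file.docx')), with the placeholder URL from the answer above.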

download large zipped csv over https, unzip, and load

I'm trying to follow this example to download a zipped file over https, extract the csv file (14GB), and load the data into a dataframe. I created a small example (<1MB).
library(data.table)
temp <- tempfile()
download.file("https://www.dropbox.com/s/h130oe03krthcl0/example.csv.zip",
temp, method="curl")
data <- fread(unz(temp, "example.csv"))
unlink(temp)
Is my mistake obvious?
This works fine for me (download.file does too but I'm on 3.2.2 OS X so this is more "portable" given the updates to download.file since 3.1.2):
library(httr)
response <- GET("https://www.dropbox.com/s/h130oe03krthcl0/example.csv.zip?dl=1",
write_disk("example.csv.zip"),
progress())
fil <- unzip("example.csv.zip")
read.csv(fil[1], stringsAsFactors=FALSE)
## v1 v2 v3
## 1 1 2 3
## 2 1 2 3
## 3 1 2 3
I didn't try it without the ?dl=1 (and I add that by rote, not because of the edit queue suggestion).
Honestly, though, I'd probably spare the download in R and just use curl on the command line in an automated workflow for files the size you've indicated (and, I'd do that if the processing language was python [et al], too).
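For the Python route mentioned above, here is a rough sketch of reading the CSV straight out of the zip once curl has saved it, without unpacking the whole file first. It assumes the archive is already on disk and the CSV is UTF-8 encoded; iter_csv_rows is a hypothetical helper name.

```python
import csv
import io
import zipfile


def iter_csv_rows(zip_path, member):
    """Yield rows of a CSV stored inside a zip archive, one dict per row."""
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as raw:
            # Wrap the binary stream so csv can decode it lazily, line by line.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            yield from csv.DictReader(text)
```

Because the rows are yielded one at a time, a loop such as for row in iter_csv_rows("example.csv.zip", "example.csv") never needs the full file in memory, which matters more at 14GB than at 1MB.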
In one of my applications I needed to download a zip file over HTTP and stream it straight into a folder while unzipping it.
After some searching I was able to write the following code, which does the job.
Here are the steps to follow:
Install the unzipper package.
Import unzipper and http into the code file.
Download the zip file and pipe the response stream into the extractor. Here is the complete code:
import unzipper from 'unzipper';
import http from 'http';
http.get('http://yoururl.com/file.zip', function (res) {
  res.pipe(unzipper.Extract({ path: 'C:/cmsdata/' })).on('close', function () {
    // Perform any action needed after the archive has been fully extracted.
  });
});

How to authenticate myself to download data in R?

I want to download secured data from LendingClub (a P2P lending company, please Google it if you're interested in what they do).
The secured data can only be downloaded if you have an account. So now I have a username and password, and I check the download page to copy the file download link. Then how can I authenticate myself to download the data? I tried the following:
file <- 'lc1'
url <- "https://www.lendingclub.com/fileDownload.action?type=gen&file=LoanStats3a_securev1.csv.zip"
download.file(url, file)
But it throws a warning:
trying URL 'https://www.lendingclub.com/fileDownload.action?type=gen&file=LoanStats3a_securev1.csv.zip'
Content type 'text/html;charset=UTF-8' length 200 bytes
opened URL
downloaded 14 Kb
Warning message:
In download.file(url, file) :
downloaded length 14531 != reported length 200
The file that gets downloaded is a text file, not the zip file I want. I guess that is because no authentication step is involved: without an account you can still download partial data, but the link is different:
url <- "https://resources.lendingclub.com/LoanStats3a.csv.zip"
and the previous commands work fine with it. So where can I add the authentication step?
You'll have to use their REST API with an API key that they give you here.
Then you can build a URL to the resource that you're looking to download in the format you'd like it in (or a format that you can manipulate to use in your code).
You can use curl to double-check your URL:
$ curl -v -H "Authorization: <api key>" -XGET https://api.lendingclub.com/api/investor/v1/accounts/<investor_id>/summary
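The same request can be sketched in Python with the standard library. This only mirrors the curl command above (a GET with the key in the Authorization header). The helper names are hypothetical, and the investor id and API key are placeholders you would replace with your own.

```python
from urllib.request import Request, urlopen

# Base URL copied from the curl command above.
API_ROOT = "https://api.lendingclub.com/api/investor/v1"


def summary_request(investor_id, api_key):
    """Build the same request as the curl command: GET with an Authorization header."""
    url = f"{API_ROOT}/accounts/{investor_id}/summary"
    return Request(url, headers={"Authorization": api_key}, method="GET")


def fetch(request, timeout=30):
    """Send the request and return the raw response body."""
    with urlopen(request, timeout=timeout) as response:
        return response.read()
```

Building the request separately from sending it makes it easy to print the URL and headers first, which is the same double-check the curl -v call gives you.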

R Import - CSV file from password protected URL - in .BAT file

Okay - so here is what I'm trying to do.
I've got this password protected CSV file I'm trying to import into R.
I can import it fine using:
read.csv()
and when I run my code in RStudio everything works perfectly.
However, when I try and run my .R file using a batch file (windows .bat) it doesn't work. I want to use the .BAT file so that I can set up a scheduled task to run my code every morning.
Here is my .BAT file:
"E:\R-3.0.2\bin\x64\R.exe" CMD BATCH "E:\Control Files\download_data.R" "E:\Control Files\DailyEmail.txt"
And here is my .R file:
url <- "http://username:password@www.url.csv"
data <- read.csv(url, skip=1)
** note, I've put my username/password and the exact location of the CSV in my code. I've used generic stuff here, as this is work related and posting usernames and passwords is probably frowned upon.
As I've said, this code works fine when I use it in RStudio. But fails when I use the .BAT file.
I get the following error message:
Error in download.file(url, "E:/data/data.csv") :
cannot open URL 'websiteurl'
In addition: Warning message:
In download.file(url, "E:/data/data.csv") :
unable to resolve 'username'
Execution halted
** above websiteurl is the http above (I can't post links)
So obviously, the .BAT is having trouble with the username/password? Any thoughts?
* EDIT *
I've gone so far as trying this on Linux. Thinking maybe windows was playing silly bugger.
Just from the terminal, I run Rscript -e "download_data.r" and get the EXACT same error message as I did on Windows. So I suspect this may be a problem with where I'm getting the data. Could the provider be blocking requests from the command line, but not from within RStudio?
I have had similar problems that came down to file permissions. The .bat file somehow does not have the same privileges as you do when running the code directly from RStudio. Try using Rscript (http://stat.ethz.ch/R-manual/R-devel/library/utils/html/Rscript.html) within your .bat file, like this:
Rscript "E:\Control Files\download_data.R"
What is the purpose of the argument "E:\Control Files\DailyEmail.txt"? Is the program supposed to use it in any way?
So, I've found a solution, which is likely not the most practical for most people, but works for me.
What I did was migrate my project over to a Linux system. Running daily scripts is easier on Linux anyway.
The solution makes use of the wget command on Linux.
You can either run wget directly in your shell script, or use the system() function in R to run it.
The command looks like:
wget -O /home/user/.../file.csv --user=userid --password='password' http://www.url.com/file.csv
And you can do something like:
syscommand <- "wget -O /home/.../file.csv --user=userid --password='password' http://www.url.com/file.csv"
system(syscommand)
in R to download the CSV to a location on your hard drive, then grab the CSV using read.csv()
Doing it this way gave me some more insight into the potential root cause of the problem. While the system(syscommand) is running, I get the following output:
Connecting to www.website.com (www.website.com)|ip.ad.re.ss|:80... connected.
HTTP request sent, awaiting response... 401 Unauthorized
Reusing existing connection to www.weburl.com:80.
HTTP request sent, awaiting response... 200 OK
Not sure why it has to send the request twice? And why I'm getting a 401 Unauthorized the first try?
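As far as I know, wget by default holds back the credentials until the server answers with a 401 challenge, which would explain the two requests in the log. Its --auth-no-challenge option sends them on the first request instead. The sketch below shows the "send credentials up front" idea in Python, assuming the server uses HTTP Basic authentication. The function names and the URL are placeholders.

```python
import base64
from urllib.request import Request, urlopen


def basic_auth_request(url, user, password):
    """Build a GET request that carries Basic credentials on the first attempt."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return Request(url, headers={"Authorization": f"Basic {token}"})


def download(url, user, password, destination, timeout=30):
    """Fetch a protected file and save it to destination."""
    request = basic_auth_request(url, user, password)
    with urlopen(request, timeout=timeout) as response, open(destination, "wb") as out:
        out.write(response.read())
```

Keep in mind that Basic credentials are only base64-encoded, not encrypted, so over plain http they travel in readable form.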
