Download file smaller than its real size - python-3.6

I'm trying to download all the comics (PNG) from xkcd.com, and here's my code to do the download job:
# res is the requests response for comicLink (obtained earlier with requests.get)
imageFile = open(os.path.join('XKCD', os.path.basename(comicLink)), 'wb')
for chunk in res.iter_content(100000):
    imageFile.write(chunk)
imageFile.close()
The downloaded file is 6,388 bytes and cannot be opened, while the real file at the link is 27.6 KB.
I've already tested my code line by line in the shell, so I'm pretty sure I get the right link and the right file.
I just don't understand why the PNG downloaded by my code is smaller.
I also tried to search for why this happens, but found nothing helpful.
Thanks.

Okay, since you are using requests, here is a function that will let you download a file given a URL:
import requests

def download_file(url):
    local_filename = url.split('/')[-1]
    # NOTE the stream=True parameter
    r = requests.get(url, stream=True)
    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)
    return local_filename
Link to the documentation -> http://docs.python-requests.org/en/latest/user/advanced/#body-content-workflow
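For example, a minimal usage sketch (the comic URL below is only an illustration, not taken from the original post; substitute the link you scraped):
import os

comic_url = "https://imgs.xkcd.com/comics/python.png"  # hypothetical example URL
saved = download_file(comic_url)
print(saved, os.path.getsize(saved), "bytes")  # size on disk should now match the source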

Related

Save website content into txt files

I am trying to write R code where I input a URL and output (save on the hard drive) a .txt file. I created a large list of URLs using the "edgarWebR" package. An example would be "https://www.sec.gov/Archives/edgar/data/1131013/000119312518074650/d442610dncsr.htm". Basically:
open the link
Copy everything (CTRL+A, CTRL+C)
open an empty text file and paste the content (CTRL+V)
save .txt file under specified name
(in a looped fashion, of course). I am inclined to "hard code" it (as in: open the website in a browser using browseURL(...) and use "send keys" commands). But I am afraid that it will not run very smoothly. Other commands (such as readLines()) seem to copy the HTML structure, therefore returning not only the text.
In the end I am interested in a short paragraph of each of those shareholder letters (containing only text; tables/graphs are no concern in my particular setup).
Anyone aware of an R function that would help?
Thanks in advance!
Let me know in case the code below works for you. xpathSApply can be applied to other HTML components as well; in your case only paragraphs are required.
library(RCurl)
library(XML)

# Create character vector of urls
urls <- c("url1", "url2", "url3")

for (url in urls) {
  # download html
  html <- getURL(url, followlocation = TRUE)
  # parse html
  doc <- htmlParse(html, asText = TRUE)
  plain.text <- xpathSApply(doc, "//p", xmlValue)
  # write the extracted paragraphs to a .txt file
  # (depends whether you need separate files for each url or one combined file;
  #  basename(url) is used so the file name is valid on the local file system)
  fileConn <- file(paste(basename(url), "txt", sep = "."))
  writeLines(paste(plain.text, collapse = "\n"), fileConn)
  close(fileConn)
}
Thanks everyone for your input. It turns out that any HTML conversion took too much time given the amount of websites I need to parse. The (working) solution probably violates some best-practice guidelines, but it does do the job.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import clipboard  # provides clipboard.paste(); assumed here, pyperclip works the same way

# initialize driver; 'path' is the local folder holding geckodriver
driver = webdriver.Firefox(executable_path=path + '/codes_ml/geckodriver/geckodriver.exe')
# it is fine to open the driver just once
# inside a loop over the report URLs, pull the text of each page:
driver.get(report_url)
element = driver.find_element_by_css_selector("body")
element.send_keys(Keys.CONTROL + 'a')
element.send_keys(Keys.CONTROL + 'c')
text = clipboard.paste()
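To complete the original goal of saving each page as a .txt file, a minimal follow-up sketch (the file-naming convention is just one possibility, not from the original post):
import os

# write the copied text to a .txt file named after the report URL
out_name = os.path.basename(report_url) + '.txt'
with open(out_name, 'w', encoding='utf-8') as f:
    f.write(text)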

Download URL links using R

I am new to R and would like to seek some advice.
I am trying to download multiple URL links (PDF format, not HTML) and save them as PDF files using R.
The links I have are character strings (taken from the HTML code of the website).
I tried using the download.file() function, but it requires a specific URL link (written in the R script) and therefore can only download one file per call. However, I have many URL links and would like some help doing this.
Thank you.
I believe what you are trying to do is download a list of URLs. You could try something like this approach:
Store all the links in a vector using c(), e.g.:
urls <- c("http://link1", "http://link2", "http://link3")
Iterate through the vector and download each file:
for (url in urls) {
  download.file(url, destfile = basename(url))
}
If you're using Linux/Mac and HTTPS, you may need to specify the method and extra attributes for download.file:
download.file(url, destfile = basename(url), method="curl", extra="-k")
If you want, you can test my proof of concept here: https://gist.github.com/erickthered/7664ec514b0e820a64c8
Hope it helps!
URL
url = c('https://cran.r-project.org/doc/manuals/r-release/R-data.pdf',
'https://cran.r-project.org/doc/manuals/r-release/R-exts.pdf',
'http://kenbenoit.net/pdfs/text_analysis_in_R.pdf')
Designated names
names = c('manual1',
'manual2',
'manual3')
Iterate through the vector and download each file with its corresponding name:
for (i in 1:length(url)) {
  download.file(url[i], destfile = names[i], mode = 'wb')
}
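If you would rather keep a .pdf extension on the saved files, a small variation of the loop above (url and names as defined above):
for (i in seq_along(url)) {
  download.file(url[i], destfile = paste0(names[i], ".pdf"), mode = 'wb')
}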

How to download and display an image from a URL in R?

My goal is to download an image from a URL and then display it in R.
I got a URL and figured out how to download it. But the downloaded file can't be previewed because it is 'damaged, corrupted, or is too big'.
y = "http://upload.wikimedia.org/wikipedia/commons/5/5d/AaronEckhart10TIFF.jpg"
download.file(y, 'y.jpg')
I also tried
image('y.jpg')
in R, but the error message shows like:
Error in image.default("y.jpg") : argument must be matrix-like
Any suggestions?
If I try your code, it looks like the image is downloaded. However, when opened with the Windows image viewer it also says it is corrupt.
The reason for this is that you have not specified the mode in the download.file call.
Try this:
download.file(y,'y.jpg', mode = 'wb')
For more info about the mode argument, see ?download.file.
This way at least the file that you downloaded is working.
To view the image in R, have a look at:
library(jpeg)  # provides readJPEG
jj <- readJPEG("y.jpg", native = TRUE)
plot(0:1, 0:1, type = "n", ann = FALSE, axes = FALSE)
rasterImage(jj, 0, 0, 1, 1)
or how to read.jpeg in R 2.15
or Displaying images in R in version 3.1.0
This could work too:
library(jpeg)
library(png)
library(RCurl)  # provides getURLContent
x <- "http://upload.wikimedia.org/wikipedia/commons/5/5d/AaronEckhart10TIFF.jpg"
image_name <- readJPEG(getURLContent(x))  # for jpg
image_name <- readPNG(getURLContent(x))   # for png
After downloading the image, you can use base R to open the file using your default image viewer program like this:
file.show(yourfilename)

R write(append=TRUE) overwrites file contents

Here's my R code:
out = file('testfile')
write('hello', file=out, append=T)
write('world', file=out, append=T)
close(out)
When I run this (using R 3.1.0), testfile then contains:
world
I expected:
hello
world
The same behavior happens if I use cat() instead of write(). Why? How can I append to files?
You must open the file for writing:
out = file('testfile', 'w')
...
When R opens (or does not open) connections automatically is a bit complicated, but it's explained in the help (?file).
If you do not pass 'w', each write call opens and closes the file, and I guess this causes the strange behaviour you observe.
If you want to open an existing file for appending, use
out = file('testfile', 'a')
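Putting that together, a minimal sketch of the corrected version of the original snippet:
out <- file('testfile', 'w')   # open the connection for writing once
write('hello', file = out)     # both writes now go to the same open connection
write('world', file = out)
close(out)
readLines('testfile')          # "hello" "world"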
The clue comes in the help page for cat (which write is a wrapper for):
append logical. Only used if the argument file is the name of file
(and not a connection or "|cmd"). If TRUE output will be appended to
file; otherwise, it will overwrite the contents of file.
When using connections you should set the connection to be opened for appending, eg:
file('testfile', open="a")
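For example, a minimal sketch that appends to a file that already exists:
out <- file('testfile', open = "a")  # existing contents are kept
write('again', file = out)
close(out)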

Converting PDF to a collection of images on the server using GhostScript

These are the steps I am trying to achieve:
Upload a PDF document on the server.
Convert the PDF document to a set of images using GhostScript (every page is converted to an image).
Send the collection of images back to the client.
So far, I am interested in #2.
First, I downloaded both gswin32c.exe and gsdll32.dll and managed to manually convert a PDF to a collection of images (I opened cmd and ran the command below):
gswin32c.exe -dSAFER -dBATCH -dNOPAUSE -sDEVICE=jpeg -r150 -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -dMaxStripSize=8192 -sOutputFile=image_%d.jpg somepdf.pdf
Then I thought, I'll put gswin32c.exe and gsdll32.dll into ClientBin of my web project, and run the .exe via a Process.
System.Diagnostics.Process process1 = new System.Diagnostics.Process();
process1.StartInfo.WorkingDirectory = Request.MapPath("~/");
process1.StartInfo.FileName = Request.MapPath("ClientBin/gswin32c.exe");
process1.StartInfo.Arguments = "-dSAFER -dBATCH -dNOPAUSE -sDEVICE=jpeg -r150 -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -dMaxStripSize=8192 -sOutputFile=image_%d.jpg somepdf.pdf";
process1.Start();
Unfortunately, nothing was output in ClientBin. Anyone got an idea why? Any recommendation will be highly appreciated.
I've tried your code and it seems to be working fine. I would recommend checking the following things:
Verify that your somepdf.pdf is in the working folder of the gs process, or specify the full path to the file in the command line. It would also be useful to see Ghostscript's output by doing something like this:
....
process1.StartInfo.RedirectStandardOutput = true;
process1.StartInfo.UseShellExecute = false;
process1.Start();
// read output
string output = process1.StandardOutput.ReadToEnd();
...
process1.WaitForExit();
...
If gs can't find your file, you will get "Error: /undefinedfilename in (somepdf.pdf)" in the output stream.
Another possibility is that your script proceeds without waiting for the gs process to finish generating the resulting image_N.jpg files. I guess adding process1.WaitForExit() should solve the issue.
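Putting those suggestions together, a rough sketch (the paths and the PDF location are illustrative assumptions, not taken from the original post):
System.Diagnostics.Process process1 = new System.Diagnostics.Process();
process1.StartInfo.WorkingDirectory = Request.MapPath("~/");
process1.StartInfo.FileName = Request.MapPath("ClientBin/gswin32c.exe");
// pass the full path to the PDF so gs does not depend on the working directory
process1.StartInfo.Arguments = "-dSAFER -dBATCH -dNOPAUSE -sDEVICE=jpeg -r150 " +
    "-sOutputFile=image_%d.jpg " + Request.MapPath("ClientBin/somepdf.pdf");
process1.StartInfo.UseShellExecute = false;
process1.StartInfo.RedirectStandardOutput = true;
process1.Start();
string output = process1.StandardOutput.ReadToEnd();  // check for /undefinedfilename here
process1.WaitForExit();                               // wait until the image_N.jpg files exist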
