R download.file: download raw HTML without images

I am downloading raw HTML pages with R's download.file function.
For storage efficiency, I would like to know whether it is possible to download the HTML files without the images they contain.
I seem to remember hearing about something like this before, but I can neither recall the details nor find any instructions.
I would be very grateful for some help.
Greetings,
Marcel
url_list <- c("http://www.spiegel.de/",
              "http://www.faz.net/")
dest_list <- c("test1.html",
               "test2.html")
download.file(url_list,
              dest_list,
              method = "libcurl",
              quiet = FALSE)

Related

Ways to extract images from pdf using R

Is there a way to extract images from a PDF using R and save them into a folder?
There are a lot of similar questions for other programming languages, and there is apparently a way to do this in Python. I was wondering whether the same can be done in R: https://www.thepythoncode.com/article/extract-pdf-images-in-python
There is the pdftools package in R, but it does not seem to help much with images: it reads text and has an option for OCR. I just want to extract the images and store them in a folder.
I could use the reticulate package to call this Python method from R, but then I would not be able to loop or map over it as I would like. That is why I am asking whether anyone knows a way in R.
Thank you.
You can try something like this, which renders each page of the PDF to an image file:
library(pdftools)
path_To_PDF <- "C:/my_pdf.pdf"
pdf_convert(path_To_PDF)
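For the embedded images themselves, as opposed to whole-page renderings, one option outside pdftools is the pdfimages command-line tool from poppler-utils, called from R. The following is only a sketch, assuming pdfimages is installed and on the PATH; the helper names and the folder names are made up for illustration:

```r
# Build the argument vector for poppler's `pdfimages` tool.
# `-png` asks for PNG output; the last argument is the output file prefix.
pdfimages_args <- function(pdf, out_dir) {
  prefix <- file.path(out_dir, tools::file_path_sans_ext(basename(pdf)))
  c("-png", pdf, prefix)
}

# Extract the embedded images of one PDF into `out_dir`.
extract_pdf_images <- function(pdf, out_dir = "pdf_images") {
  if (!nzchar(Sys.which("pdfimages"))) {
    stop("pdfimages (poppler-utils) was not found on the PATH")
  }
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  system2("pdfimages", shQuote(pdfimages_args(pdf, out_dir)))
}

# Usage (hypothetical folder): loop over many PDFs
# pdfs <- list.files("my_pdfs", pattern = "\\.pdf$", full.names = TRUE)
# invisible(lapply(pdfs, extract_pdf_images))
```

Because each call is an ordinary R function call, it can be looped or mapped over a list of files without going through reticulate.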

How to display a table of images in R output in a Jupyter notebook?

How do I take a list of image URLs, and display them in an HTML table in a Jupyter notebook with an R kernel?
Here's a list of URLs:
image_urls = c('https://i.stack.imgur.com/wdQNM.jpg',
'https://i.stack.imgur.com/8oysP.jpg')
Here's some code to display one image from an image_url:
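library(jpeg)
library(RCurl)
# Pick one URL from the list above
image_url <- image_urls[1]
img <- RCurl::getBinaryURL(image_url)
jj <- jpeg::readJPEG(img, native = TRUE)
# Empty plot without axes or annotation, then draw the image over it
plot(0:1, 0:1, type = "n", ann = FALSE, axes = FALSE)
rasterImage(jj, 0, 0, 1, 1)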
Edit: Another way to think of this is, is there functionality like ipython's display? It looks like there might be in https://github.com/IRkernel/repr. I have to read more.
I’m the maintainer of all IRkernel-related projects.
IRdisplay is the package you’re searching for, specifically display_jpeg:
library(IRdisplay)
display_jpeg(file = 'filename.jpg')
Sadly the file parameter doesn’t work with URLs (yet), so you have to manually pass data to it:
jpeg_data <- RCurl::getBinaryURL(image_url)
display_jpeg(jpeg_data)
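For the table of images asked about in the question, one possible approach is to build the HTML by hand and pass it to IRdisplay's display_html. This is a rough sketch, not something taken from the answer above: the helper image_table_html and the fixed width of 200 are assumptions for illustration, and the URLs are inserted without HTML escaping.

```r
# Build an HTML table with one <img> cell per URL, `ncol` cells per row.
image_table_html <- function(urls, ncol = 2) {
  cells <- sprintf('<td><img src="%s" width="200"></td>', urls)
  # Assign each cell to a row: 1, 1, 2, 2, ... for ncol = 2
  row_id <- ceiling(seq_along(cells) / ncol)
  rows <- tapply(cells, row_id, paste, collapse = "")
  paste0("<table>", paste0("<tr>", rows, "</tr>", collapse = ""), "</table>")
}

image_urls <- c('https://i.stack.imgur.com/wdQNM.jpg',
                'https://i.stack.imgur.com/8oysP.jpg')
html <- image_table_html(image_urls)

# In a Jupyter notebook with the R kernel:
# IRdisplay::display_html(html)
```

With this approach the browser loads the images from their URLs, so nothing has to be downloaded into the R session first.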

Loading ftp directly to R dataframe - how to see the original text?

Is it possible to load FTP files directly into the R workspace without downloading them first?
I have 700+ files, each around 1.5 Gb, and I want to extract approximately 0.1 % of the information from every file and add it to a single data frame.
I had a look at "Download .RData and .csv files from FTP using RCurl (or any other method)", but could not get it to work.
Edit: After some reading, I managed to get the files into R:
library(httr)
r <- GET("ftp://ftp.ais.dk/ais_data/aisdk_20141001.csv", write_memory())
When I try to read the body, I use
content(r, "text")
but the output is gibberish. It might be because of the encoding, but how do I know which encoding the server uses? Any ideas on how to get the original data from the FTP server?
I found a solution, which is very simple but works nonetheless:
library(data.table)
r <- fread("ftp://ftp.ais.dk/ais_data/aisdk_20141001.csv")
This blog was helpful
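To go from one file to the 700+ files mentioned in the question, the same fread call could be wrapped in a loop that keeps only the needed columns and rows and stacks the results. The following is a sketch under assumptions: read_subset, read_all, keep_cols and keep_rows are hypothetical names, and the demo uses small temporary CSV files in place of the FTP URLs.

```r
library(data.table)

# Read one file, keep only the wanted columns, then keep only the wanted rows.
# `keep_cols` and `keep_rows` are hypothetical; adapt them to the real data.
read_subset <- function(path, keep_cols, keep_rows) {
  dt <- fread(path, select = keep_cols)
  dt[keep_rows(dt)]
}

# Apply to many files (paths or URLs) and stack the results into one table.
read_all <- function(paths, keep_cols, keep_rows) {
  rbindlist(lapply(paths, read_subset,
                   keep_cols = keep_cols, keep_rows = keep_rows))
}

# Demo with two small local files standing in for the FTP files
f1 <- tempfile(fileext = ".csv")
f2 <- tempfile(fileext = ".csv")
fwrite(data.table(id = 1:3, speed = c(0, 5, 9), note = "x"), f1)
fwrite(data.table(id = 4:6, speed = c(7, 0, 1), note = "y"), f2)

result <- read_all(c(f1, f2),
                   keep_cols = c("id", "speed"),
                   keep_rows = function(dt) dt$speed > 4)
```

Filtering each file before combining keeps only the small extracted subset in memory, rather than all files at once.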

Downloading sound files using urls in R

I am trying to download some sound files through R (mostly mp3). I've started off using download.file() as shown below. However, the sound files downloaded this way sound horrible, as if they were playing way too fast. Any ideas?
download.file("http://www.mfiles.co.uk/mp3-downloads/frederic-chopin-piano-sonata-2-op35-3-funeral-march.mp3","test.mp3")
Even better than getting the above function to work: is there a way to download files without having to specify the extension? Sometimes I only have the redirecting page.
Thanks!
Try explicitly setting binary mode with mode="wb". On Windows, the default text mode can corrupt binary files such as mp3s:
download.file("http://www.mfiles.co.uk/mp3-downloads/frederic-chopin-piano-sonata-2-op35-3-funeral-march.mp3",
tf <- tempfile(fileext = ".mp3"),
mode="wb")
(You can view the filename with cat(tf).)
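For the second part of the question, the local file name (including its extension) can often be taken from the URL itself. The following is a minimal sketch; dest_from_url and download_binary are hypothetical helpers, and this does not cover redirecting pages whose URL carries no file name.

```r
# Derive a local file name from the last path segment of a URL,
# dropping any query string or fragment.
dest_from_url <- function(url, dir = ".") {
  clean <- sub("[?#].*$", "", url)
  file.path(dir, basename(clean))
}

# Download in binary mode so the bytes are written unchanged,
# and return the path of the downloaded file.
download_binary <- function(url, dir = ".") {
  dest <- dest_from_url(url, dir)
  download.file(url, dest, mode = "wb")
  dest
}

# Usage (performs a real download):
# download_binary("http://www.mfiles.co.uk/mp3-downloads/frederic-chopin-piano-sonata-2-op35-3-funeral-march.mp3")
```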

Converting .pdf files to excel (.xls)

A friend of mine who is doing an internship asked me 2 hours ago whether I could help him avoid manually converting 462 PDF files to .xls using free online software.
I thought of a shell script using unoconv, but I couldn't find out how to use it properly, and I am not sure whether unoconv can solve this problem, since it mainly converts files to PDF, not the reverse.
Conversion from PDF to any other structured format is not always possible and not generally recommended.
Having said that, this does look like a one-off job and there are a fair few of them (462).
It's worth pursuing if you can reliably extract text from most of them and it's reasonably structured. It's a matter of getting regular text output across a sample of the PDFs that you can reliably parse into a table structure.
There are plenty of tools around that target either direct or OCR-based text extraction; just google around.
One I like is pstotext from the ghostscript suite; the -bboxes option lets me get the coordinates of each word and leaves it up to me to reassemble the structure. Despite its name, it does work on input PDFs. The downside is that it can be a bit flaky and works on some PDFs but not others.
If you get this far, you'd most likely then need to write a shell script or program to convert that to a CSV. You can either open this directly in a spreadsheet or look for tools to convert it to XLS.
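The reassembly step could look roughly like the following R sketch. It assumes the extracted words have already been loaded into a data frame with x, y and text columns; that layout and the sample values are made up for illustration, and the real output format depends on the extraction tool. The joining is naive: words containing commas are not quoted.

```r
# Hypothetical input: one row per word with its page position.
# The real column layout depends on the extraction tool used.
words <- data.frame(
  x    = c(10, 60, 10, 60),
  y    = c(100, 100, 80, 80),
  text = c("Name", "Amount", "Alice", "42"),
  stringsAsFactors = FALSE
)

# Group words that share a y coordinate into one line (top of page first,
# assuming y grows upward), order each line left to right, join with commas.
words_to_csv_lines <- function(words) {
  by_line <- split(words, words$y)
  by_line <- by_line[order(as.numeric(names(by_line)), decreasing = TRUE)]
  vapply(by_line, function(w) paste(w$text[order(w$x)], collapse = ","),
         character(1), USE.NAMES = FALSE)
}

csv_lines <- words_to_csv_lines(words)
csv_path <- file.path(tempdir(), "table.csv")
writeLines(csv_lines, csv_path)
```

Real PDFs rarely align words on exactly the same y value, so in practice the y coordinates would probably need rounding or binning before grouping.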
PS: If he hasn't already, get the intern to ask whether there's any possible way of getting at the original data that was used to create the PDFs. It will save a lot of time and effort and lead to a far more accurate result.
Update: An alternative to pstotext is the renderpdf.pl command, which is included in the Perl CAM::PDF module. It is more robust, but just reports the text (x,y) position, not bounding boxes.
Other responses on a linked question suggest Tabula, too.
https://github.com/tabulapdf/tabula
I tried it and it works very well.
