Downloading images from multiple URLs in R

Thanks to everyone who helped me with my previous query!
I have a follow-up question about how to download those images using a loop.
I would like to download, all at once, the images from my data frame, which consists of URL links that each point directly to a .jpg image.
This is my current code to collect the URLs:
# load libraries and packages
library("rvest")
library("ralger")
library("tidyverse")
library("jpeg")
library("here")
# set the number of pages
num_pages <- 5
# set working directory for photos to be stored
setwd("~/Desktop/lab/male_generic")
# create a list to hold the output
male <- vector("list", num_pages)
# looping the scraping, images from istockphoto
for(page_result in 1:num_pages){
link = paste0("https://www.istockphoto.com/search/2/image?alloweduse=availableforalluses&mediatype=photography&phrase=man&page=", page_result)
male[[page_result]] <- images_preview(link)
}
male <- unlist(male)
I only figured out how to download one image at a time, but I would like to learn how to do it all at once:
test = "https://media.istockphoto.com/id/1028900652/photo/man-meditating-yoga-at-sunset-mountains-travel-lifestyle-relaxation-emotional-concept.jpg?s=612x612&w=0&k=20&c=96TlYdSI8POnOrcqH10GlPgOeWFjEIoY-7G_yMV4Eco="
download.file(test,'test.jpg', mode = 'wb')

num_pages = 10 # set the number of pages you want to download
link = paste0("https://www.istockphoto.com/search/2/image?alloweduse=availableforalluses&mediatype=photography&phrase=man&page=", 1:num_pages)
sapply(link, function(x) {
  download.file(x,
                destfile = paste0("C:/Users/USUARIO/Desktop/", # change it to your directory
                                  str_extract(x, pattern = "(?<=page=)[0-9]+"), ".jpg"),
                mode = "wb")
})
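To fetch the images themselves rather than the search pages, the same idea can be applied to the image URLs collected in male above. A minimal sketch, assuming male holds direct .jpg links and the working directory is where the files should go (the numbered filename scheme is just an illustration):
# loop over the scraped image URLs and save each one as a numbered .jpg
for (i in seq_along(male)) {
  destfile <- sprintf("male_%04d.jpg", i) # e.g. male_0001.jpg; naming scheme is an assumption
  tryCatch(download.file(male[i], destfile, mode = "wb"),
           error = function(e) message("Failed to download: ", male[i]))
}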

Related

Download file from url R

I am having problems downloading data directly into R from the link below:
kaggle.com/c/house-prices-advanced-regression-techniques/data
I tried with this code:
data <- read.csv("https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data?select=test.csv", skip = 1)
I tried most of the options listed here:
Access a URL and read Data with R
However, I only get an HTML table, not the tables with the relevant house-price data from the website. I'm not sure what I am doing wrong.
Thanks!
Here's a simple example post on Kaggle showing how to achieve your goal; the code below is taken from that example.
Create a verified account
Log in
Go to your account (click the top right -> Account)
Click "Create new API token"
Place the file somewhere sensible that you can access from R
library(httr)
library(jsonlite)
kgl_credentials <- function(kgl_json_path="~/.kaggle/kaggle.json"){
# returns user credentials from kaggle json
user <- fromJSON(kgl_json_path, flatten = TRUE)
return(user)
}
kgl_dataset <- function(ref, file_name, type="dataset", kgl_json_path="~/.kaggle/kaggle.json"){
# ref: depends on 'type':
# - dataset: "sudalairajkumar/novel-corona-virus-2019-dataset"
# - competition: competition ID, e.g. 8587 for "competitive-data-science-predict-future-sales"
# file_name: specific dataset wanted, e.g. "covid_19_data.csv"
.kaggle_base_url <- "https://www.kaggle.com/api/v1"
user <- kgl_credentials(kgl_json_path)
if(type=="dataset"){
# dataset
url <- paste0(.kaggle_base_url, "/datasets/download/", ref, "/", file_name)
}else if(type=="competition"){
# competition
url <- paste0(.kaggle_base_url, "/competitions/data/download/", ref, "/", file_name)
}
# call
rcall <- httr::GET(url, httr::authenticate(user$username, user$key, type="basic"))
# content type
content_type <- rcall[[3]]$`content-type`
if( grepl("zip", content_type)){
# download and unzip
temp <- tempfile()
download.file(rcall$url,temp)
data <- read.csv(unz(temp, file_name))
unlink(temp)
}else{
# else read as text -- note: code this better
data <- content(rcall, type="text/csv", encoding = "ISO-8859-1")
}
return(data)
}
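For reference, the kaggle.json placed above is just a small JSON file with "username" and "key" fields, which is exactly what kgl_credentials() returns as a list; a quick sanity check might be:
# the token file has the form {"username":"your_user_name","key":"xxxxxxxx"}
user <- kgl_credentials("~/.kaggle/kaggle.json")
names(user) # expected: "username" "key"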
Then you can use the credentials to download the dataset as described in the post
kgl_dataset(file_name = 'test.csv',
type = 'competition',
ref = 'house-prices-advanced-regression-techniques',
kgl_json_path = 'kaggle.json')
Alternatively, you can use the unofficial R API:
library(devtools)
install_github('mkearney/kaggler')
library(kaggler)
kgl_auth(creds_file = 'kaggle.json')
kgl_competitions_data_download('house-prices-advanced-regression-techniques', 'test.csv')
However, this fails due to a mistake in the implementation of kgl_api_get:
function (path, ..., auth = kgl_auth())
{
r <- httr::GET(kgl_api_call(path, ...), auth)
httr::warn_for_status(r)
if (r$status_code != 200) { # <== should be "=="
...
}
I downloaded the data myself (which you should just do too; it's quite easy), but in case you don't want to, I uploaded it to Pastebin so you can run the code below. This is their 'train' dataset, downloaded from the link you provided above:
data <- read.delim("https://pastebin.com/raw/aGvwwdV0", header=T)

How to download multiple gzip files?

I have to download a lot of files in .gz format (each file is ~40 MB, ~40k rows).
The files contain data for other countries as well; I would like to keep only the data for France ('fr'), limiting the number of columns.
I am trying to automate this process but I have problems with unpacking.
The data is on a webpage, and I'm interested in the data from the whole folder.
I am trying the following approach:
create a tempfile
download the .gz file to the tempfile
unpack, read, and select the rows I need
save as a new file and repeat for the next file
I would like to ask if this way of thinking is correct (the code below will go inside a for loop).
temp <- tempfile()
temp1 <- "C:/Users/tdo/Desktop/data/test.txt"
download.file("https://dumps.wikimedia.org/other/pageviews/2018/2018-
06/pageviews-20180601-000000.gz",temp) # example
unzip(files = temp,exdir = temp1)
data <- read.table(..)
daata[data$name == 'fr']
write.table(...)
In this way I created the links:
library(rvest)
library(dplyr)
dumpList <- read_html("https://dumps.wikimedia.org/other/pageviews/2018/2018-04/")
links <- data_frame(filename = html_attr(html_nodes(dumpList, "a"), "href")) %>%
  filter(grepl(x = filename, "pageviews")) %>% # data by project
  mutate(link = paste0("https://dumps.wikimedia.org/other/pageviews/2018/2018-04/", filename))
Why not directly read the gzipped files? I don't see the need to locally unpack the archives, if all you want to do is subset/filter the data and store as new local files.
I recommend using readr::read_table2 to directly read the gzipped file.
Here is a minimal example:
# List of files to download
# url is the link, target the local filename
lst.files <- list(
list(
url = "https://dumps.wikimedia.org/other/pageviews/2018/2018-06/pageviews-20180601-000000.gz",
target = "pageviews-20180601-000000.gz"))
# Download gzipped files (only if file does not exist)
lapply(lst.files, function(x)
if (!file.exists(x$target)) download.file(x$url, x$target))
# Open files
library(readr)
lst <- lapply(lst.files, function(x) {
df <- read_table2(x$target)
# Filter/subset entries
# Write to file with write_delim
})
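For the France-only subset mentioned in the question, the lapply body above could be filled in along these lines. This is a sketch that assumes the pageview dumps have no header row, so readr assigns the default column names X1 (project code), X2 (page title), X3 (view count), X4 (response size):
lst <- lapply(lst.files, function(x) {
  df <- read_table2(x$target, col_names = FALSE)
  df_fr <- df[df$X1 == "fr", ] # keep only rows for the French project code
  # write the subset next to the original, e.g. fr_pageviews-20180601-000000.txt
  write_delim(df_fr, paste0("fr_", sub("\\.gz$", ".txt", x$target)), delim = " ")
  df_fr
})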

Loop through array of PDF files online and copy text from each

I see it is super-easy to grab a PDF file, save it, and fetch all the text from the file.
library(pdftools)
download.file("http://www2.sas.com/proceedings/sugi30/085-30.pdf", "sample.pdf", mode = "wb")
txt <- pdf_text("sample.pdf")
I am wondering how to loop through an array of PDF files, based on links, download each, and scrape the text from each. I want to go to the following link:
http://www2.sas.com/proceedings/sugi30/toc.html#dp
Then I want to download each file from 'Paper 085-30:' to 'Paper 095-30:'. Finally, I want to scrape the text out of each file. How can I do that?
I would think it would be something like this, but I suspect the paste function is not set up correctly.
library(pdftools)
values <- c('085-30', '086-30', '087-30', '088-30', '089-30')
for (i in values) {
  download.file(paste0("http://www2.sas.com/proceedings/sugi30/", i, ".pdf"),
                paste0(i, ".pdf"), mode = "wb")
}
You can get a list of pdfs using rvest.
library(rvest)
x <- read_html("http://www2.sas.com/proceedings/sugi30/toc.html#dp")
href <- x %>% html_nodes("a") %>% html_attr("href")
# char vector of links, use regular expression to fetch only papers
links <- href[grepl("^http://www2.sas.com/proceedings/sugi30/\\d{3}.*\\.pdf$", href)]
I've added some error handling, and don't forget to put the R session to sleep so you don't flood the server. If a download is unsuccessful, the link is stored in a variable which you can inspect after the loop has finished, so you can adapt your code or just download those files manually.
# write failed links to this variable
unsuccessful <- c()
for (link in links) {
out <- tryCatch(download.file(url = link, destfile = basename(link), mode = "wb"),
error = function(e) e, warning = function(w) w)
if (class(out) %in% c("simpleError", "simpleWarning")) {
message(sprintf("Unable to download %s ?", link))
unsuccessful <- c(unsuccessful, link)
}
sleep <- abs(rnorm(1, mean = 10, sd = 10))
message(sprintf("Sleeping for %f seconds", sleep))
Sys.sleep(sleep) # don't flood the server, sleep for a while
}
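Once the PDFs are on disk, the text extraction the question asks about is one more step with pdftools. A small sketch, reusing the basename(link) file names from the loop above:
library(pdftools)
# read the text of every successfully downloaded paper into a named list
downloaded <- basename(setdiff(links, unsuccessful))
texts <- lapply(downloaded, pdf_text)
names(texts) <- downloaded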

R Web scrape Excel spreadsheet URLs to read with openxlsx

I need to read parts of an Excel file into R. I have some existing code, but the authority changed the source. Previously there was a direct URL to the document; now the document can only be reached through a website landing page.
Could someone tell me with which package I can achieve that? The link to the Excel file is: http://www.snamretegas.it/it/business-servizi/dati-operativi-business/8_dati_operativi_bilanciamento_sistema/
There I am looking for the document: "Dati operativi relativi al bilanciamento del sistema post Del. 312/2016/R/gas - Database 2018"
I've added my previous code to give an idea of what I did. As you can see, I only needed read.xlsx for this first step.
Many thanks in advance!
library(ggplot2)
library(lubridate)
library(openxlsx)
library(reshape2)
library(dplyr)
Bilres <- read.xlsx(xlsxFile = "http://www.snamretegas.it/repository/file/Info-storiche-qta-gas-trasportato/dati_operativi/2017/DatiOperativi_2017-IT.xlsx",sheet = "Storico_G", startRow = 1, colNames = TRUE)
# Select column R from Storico_G and store it in Bilres_df
Bilres_df <- data.frame(Bilres$pubblicazione, Bilres$BILANCIAMENTO.RESIDUALE )
# Convert pubblicazione to date and time format
Bilres_df$pubblicazione <- ymd_h(Bilres_df$Bilres.pubblicazione)
Bilreslast=tail(Bilres_df,1)
Bilreslast=data.frame(Bilreslast)
Bilreslast$Bilres.BILANCIAMENTO.RESIDUALE <- as.numeric(as.character((Bilreslast$Bilres.BILANCIAMENTO.RESIDUALE)))
If you copy the URL from the web page, you can then use download.file() to download it as a binary file and read.xlsx() to read the data. Depending on how frequently the content changes on the web page, you may be better off just copying the URL than parsing it from the page.
oldFile <- "http://www.snamretegas.it/repository/file/Info-storiche-qta-gas-trasportato/dati_operativi/2017/DatiOperativi_2017-IT.xlsx"
newFile <- "http://www.snamretegas.it/repository/file/it/business-servizi/dati-operativi-business/dati_operativi_bilanciamento_sistema/2017/DatiOperativi_2017-IT.xlsx"
if(!file.exists("./data/downloadedXlsx.xlsx")){
download.file(newFile,"./data/downloadedXlsx.xlsx",
method="curl", #use "curl" for OS X / Linux, "wininet" for Windows
mode="wb") # "wb" means "write binary"
} else message("file already loaded locally, using disk version")
library(openxlsx)
Bilres <- read.xlsx(xlsxFile = "./data/downloadedXlsx.xlsx",
sheet = "Storico_G", startRow = 1, colNames = TRUE)
head(Bilres[,1:3])
...and the output:
> head(Bilres[,1:3])
pubblicazione aggiornato.il IMMESSO
1 2017_01_01_06 42736.24 1915484
2 2017_01_01_07 42736.28 1915484
3 2017_01_01_08 42736.33 1866326
4 2017_01_01_09 42736.36 1866326
5 2017_01_01_10 42736.41 1866326
6 2017_01_01_11 42736.46 1866326
>
UPDATE: Added logic to avoid downloading the file once it has been downloaded.
You can find the .xlsx links this way:
library(rvest)
library(magrittr)
pg <- read_html("http://www.snamretegas.it/it/business-servizi/dati-operativi-business/8_dati_operativi_bilanciamento_sistema/")
# get all the Excel (xlsx) links on that page:
html_nodes(pg, xpath=".//a[contains(@href, '.xlsx')]") %>%
html_attr("href") %>%
sprintf("http://www.snamretegas.it%s", .) -> excel_links
head(excel_links)
## [1] "http://www.snamretegas.it/repository/file/it/business-servizi/dati-operativi-business/dati_operativi_bilanciamento_sistema/2017/DatiOperativi_2017-IT.xlsx"
## [2] "http://www.snamretegas.it/repository/file/it/business-servizi/dati-operativi-business/dati_operativi_bilanciamento_sistema/2018/DatiOperativi_2018-IT.xlsx"
And, pass in what you want to your Excel reading function:
openxlsx::read.xlsx(excel_links[1], sheet = "Storico_G", startRow = 1, colNames = TRUE)
## data frame output here that I'm not going to show
BUT!!
This is a very selfish and unkind way to do this since you hit that site for the Excel file every time you want to read it, wasting their CPU and bandwidth and your bandwidth.
You should use the download.file() technique Len described to cache a local copy and only re-download when necessary.
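A sketch of that combination, assuming a local "data" folder for the cached copies (the folder name and the basename() file naming are just illustrations):
library(openxlsx)
if (!dir.exists("data")) dir.create("data")
# download each scraped workbook once, then read from the local copy
for (u in excel_links) {
  local_copy <- file.path("data", basename(u))
  if (!file.exists(local_copy)) download.file(u, local_copy, mode = "wb")
}
Bilres <- read.xlsx(file.path("data", basename(excel_links[1])),
                    sheet = "Storico_G", startRow = 1, colNames = TRUE)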
This should get you going in the right direction.
library(data.table)
mydat <- fread('http://www.snamretegas.it/repository/file/Info-storiche-qta-gas-trasportato/dati_operativi/2017/DatiOperativi_2017-IT.xlsx')
head(mydat)

How to download multiple files using a loop in R?

I have to download multiple .xlsx files of a country's census data from the internet using R. The files are located at this link. The problems are:
I am unable to write a loop that goes back and forth through the site to download each file.
The downloaded files have odd names rather than the district names, so how can I rename them to the district names dynamically?
I have used the below mentioned codes:
url<-"http://www.censusindia.gov.in/2011census/HLO/HL_PCA/HH_PCA1/HLPCA-28532-2011_H14_census.xlsx"
download.file(url, "HLPCA-28532-2011_H14_census.xlsx", mode="wb")
But this downloads one file at a time and doesn't change the file name.
Thanks in advance.
Assuming you want all the data without knowing all of the URLs, your question involves web parsing. The httr package provides useful functions for retrieving the HTML code of a given website, which you can then parse for links.
Maybe this bit of code is what you're looking for:
library(httr)
base_url = "http://www.censusindia.gov.in/2011census/HLO/" # main website
r <- GET(paste0(base_url, "HL_PCA/Houselisting-housing-HLPCA.html"))
rc = content(r, "text")
rcl = unlist(strsplit(rc, "<a href =\\\"")) # find links
rcl = rcl[grepl("Houselisting-housing-.+?\\.html", rcl)] # find links to houslistings
names = gsub("^.+?>(.+?)</.+$", "\\1",rcl) # get names
names = gsub("^\\s+|\\s+$", "", names) # trim names
links = gsub("^(Houselisting-housing-.+?\\.html).+$", "\\1",rcl) # get links
# iterate over regions
for(i in 1:length(links)) {
url_hh = paste0(base_url, "HL_PCA/", links[i])
if(!url_success(url_hh)) next
r <- GET(url_hh)
rc = content(r, "text")
rcl = unlist(strsplit(rc, "<a href =\\\"")) # find links
rcl = rcl[grepl(".xlsx", rcl)] # find links to houslistings
hh_names = gsub("^.+?>(.+?)</.+$", "\\1",rcl) # get names
hh_names = gsub("^\\s+|\\s+$", "", hh_names) # trim names
hh_links = gsub("^(.+?\\.xlsx).+$", "\\1",rcl) # get links
# iterate over subregions
for(j in 1:length(hh_links)) {
url_xlsx = paste0(base_url, "HL_PCA/",hh_links[j])
if(!url_success(url_xlsx)) next
filename = paste0(names[i], "_", hh_names[j], ".xlsx")
download.file(url_xlsx, filename, mode="wb")
}
}
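As in the PDF answer earlier, it may be worth guarding each download so that one dead link does not stop the whole loop. A hedged sketch of a small wrapper (safe_download is a hypothetical helper, not part of httr); it could replace the download.file(url_xlsx, filename, mode = "wb") call in the inner loop:
safe_download <- function(url, destfile) {
  # skip over files that fail instead of stopping the loop
  tryCatch(download.file(url, destfile, mode = "wb"),
           error = function(e) message("Skipping ", url),
           warning = function(w) message("Skipping ", url))
}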
