I have to download multiple xlsx files of a country's census data from the internet using R. The files are located at this
Link. The problems are:
I am unable to write a loop that goes back and forth to download each file.
The downloaded file has a cryptic name rather than the district name. How can I change it to the district name dynamically?
I have used the code below:
url<-"http://www.censusindia.gov.in/2011census/HLO/HL_PCA/HH_PCA1/HLPCA-28532-2011_H14_census.xlsx"
download.file(url, "HLPCA-28532-2011_H14_census.xlsx", mode="wb")
But this downloads one file at a time and doesn't change the file name.
Thanks in advance.
Assuming you want all the data without knowing all of the URLs, your question involves web parsing. The httr package provides functions for retrieving the HTML code of a given website, which you can then parse for links.
Maybe this bit of code is what you're looking for:
library(httr)
base_url = "http://www.censusindia.gov.in/2011census/HLO/" # main website
r <- GET(paste0(base_url, "HL_PCA/Houselisting-housing-HLPCA.html"))
rc = content(r, "text")
rcl = unlist(strsplit(rc, "<a href =\\\"")) # split on links
rcl = rcl[grepl("Houselisting-housing-.+?\\.html", rcl)] # keep links to houselistings
names = gsub("^.+?>(.+?)</.+$", "\\1", rcl) # extract link text (region names)
names = gsub("^\\s+|\\s+$", "", names) # trim whitespace
links = gsub("^(Houselisting-housing-.+?\\.html).+$", "\\1", rcl) # extract link targets
# iterate over regions
for(i in 1:length(links)) {
  url_hh = paste0(base_url, "HL_PCA/", links[i])
  if(!url_success(url_hh)) next
  r <- GET(url_hh)
  rc = content(r, "text")
  rcl = unlist(strsplit(rc, "<a href =\\\"")) # split on links
  rcl = rcl[grepl(".xlsx", rcl)] # keep links to .xlsx files
  hh_names = gsub("^.+?>(.+?)</.+$", "\\1", rcl) # extract link text (district names)
  hh_names = gsub("^\\s+|\\s+$", "", hh_names) # trim whitespace
  hh_links = gsub("^(.+?\\.xlsx).+$", "\\1", rcl) # extract link targets
  # iterate over subregions
  for(j in 1:length(hh_links)) {
    url_xlsx = paste0(base_url, "HL_PCA/", hh_links[j])
    if(!url_success(url_xlsx)) next
    filename = paste0(names[i], "_", hh_names[j], ".xlsx")
    download.file(url_xlsx, filename, mode="wb")
  }
}
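Note: `url_success()` was deprecated and later removed from httr, so the loop above may error on recent versions. A small stand-in (my sketch, not part of the original answer) that mirrors the old behaviour of reporting whether a URL responds with a status below 400:

```r
library(httr)

# drop-in replacement for the removed httr::url_success():
# TRUE when the URL responds without error and with a status code < 400
url_success <- function(url) {
  resp <- tryCatch(HEAD(url), error = function(e) NULL)
  !is.null(resp) && status_code(resp) < 400
}
```

With this defined first, the rest of the loop should run unchanged.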
Thanks to everyone who helped me with my previous query!
I have another question, about how to download those images using a loop.
I would like to download all of the images from my data frame, which consists of URL links pointing directly to .jpg images, at once.
I've attached the current code below:
This is the current code to load the URLs
# load libraries and packages
library("rvest")
library("ralger")
library("tidyverse")
library("jpeg")
library("here")
# set the number of pages
num_pages <- 5
# set working directory for photos to be stored
setwd("~/Desktop/lab/male_generic")
# create a list to hold the output
male <- vector("list", num_pages)
# looping the scraping, images from istockphoto
for(page_result in 1:num_pages){
link = paste0("https://www.istockphoto.com/search/2/image?alloweduse=availableforalluses&mediatype=photography&phrase=man&page=", page_result)
male[[page_result]] <- images_preview(link)
}
male <- unlist(male)
I only figured out how to download one image at a time, but I would like to learn how to do it all at once:
test = "https://media.istockphoto.com/id/1028900652/photo/man-meditating-yoga-at-sunset-mountains-travel-lifestyle-relaxation-emotional-concept.jpg?s=612x612&w=0&k=20&c=96TlYdSI8POnOrcqH10GlPgOeWFjEIoY-7G_yMV4Eco="
download.file(test,'test.jpg', mode = 'wb')
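Building on that single-file call, one way to do them all at once (a sketch, assuming `male` is the character vector of direct .jpg URLs built above) is to loop over the vector and number the output files:

```r
# download every image URL collected in `male`, numbering the files,
# since the full istockphoto URLs are too unwieldy to use as file names
for (k in seq_along(male)) {
  destfile <- sprintf("male_%03d.jpg", k)
  tryCatch(download.file(male[k], destfile, mode = "wb", quiet = TRUE),
           error = function(e) message("failed: ", male[k]))
  Sys.sleep(0.5)  # small pause to avoid hammering the server
}
```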
library(stringr)

num_pages = 10 # set the number of pages you want to download
link = paste0("https://www.istockphoto.com/search/2/image?alloweduse=availableforalluses&mediatype=photography&phrase=man&page=", 1:num_pages)
sapply(link, function(x) {
  download.file(x,
                destfile = paste0("C:/Users/USUARIO/Desktop/", # change it to your directory
                                  # pull the page number after "page=" for the file name
                                  str_extract(x, pattern = "(?<=page=)[0-9]+"), ".jpg"),
                mode = "wb")
})
I have been trying to work this out but I have not been able to do it...
I want to create a data frame with four columns: country-number-year-(content of the .txt file)
There is a .zip file in the following URL:
https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/0TJX8Y/PZUURT
The file contains a folder with 49 folders in it, and each of them contains roughly 150 .txt files.
I first tried to download the zip file with get_dataset, but that did not work:
if (!require("dataverse")) devtools::install_github("iqss/dataverse-client-r")
library("dataverse")
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")
get_dataset("=doi:10.7910/DVN/0TJX8Y/PZUURT", key = "", server = "dataverse.harvard.edu")
"Error in get_dataset("=doi:10.7910/DVN/0TJX8Y/PZUURT", key = "", server = "dataverse.harvard.edu") :
Not Found (HTTP 404)."
Then I tried
temp <- tempfile()
download.file("https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/0TJX8Y/PZUURT",temp)
UNGDC <-unzip(temp, "UNGDC+1970-2018.zip")
It worked up to a point... I downloaded the .zip file and then created UNGDC, but nothing happened, because it only contains the following information:
UNGDC
A connection with
description "/var/folders/nl/ss_qsy090l78_tyycy03x0yh0000gn/T//RtmpTc3lvX/fileab730f392b3:UNGDC+1970-2018.zip"
class "unz"
mode "r"
text "text"
opened "closed"
can read "yes"
can write "yes"
Here I don't know what to do... I have not found relevant information on how to proceed... Can someone please give me some hints, or point me to somewhere I can learn how to do this?
Thanks for your attention and help!
How about this? I used the zip package to unzip, but possibly the base unzip might work as well.
library(zip)
dir.create(temp <- tempfile())
url<-'https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/0TJX8Y/PZUURT'
download.file(url, paste0(temp, '/PZUURT.zip'), mode = 'wb')
unzip(paste0(temp, '/PZUURT.zip'), exdir = temp)
Note in particular I had to set the mode = 'wb' as I'm on a Windows machine.
I then saw that the unzipped archive had a _MACOSX folder and a Converted sessions folder. Assuming I don't need the MACOSX stuff, I did the following to get just the files I'm interested in:
root_folder <- paste0(temp,'/Converted sessions/')
filelist <- list.files(path = root_folder, pattern = '\\.txt$', recursive = TRUE)
filenames <- basename(filelist)
'filelist' contains the full paths to each text file, while 'filenames' has just each file name, which I'll then break up to get the country, the number and the year:
df <- data.frame(t(sapply(strsplit(filenames, '_'),
function(x) c(x[1], x[2], substr(x[3], 1, 4)))))
colnames(df) <- c('Country', 'Number', 'Year')
Finally, I can read the text from each of the files and stick it into the dataframe as a new Text field:
df$Text <- sapply(paste0(root_folder, filelist), function(x) readChar(x, file.info(x)$size))
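To sanity-check the split, here is how it behaves on a made-up file name following the COUNTRY_NUMBER_YEAR pattern (the name itself is hypothetical, not from the actual archive):

```r
# hypothetical example file name: COUNTRY_NUMBER_YEAR.txt
fn <- "USA_52_1997.txt"
parts <- strsplit(fn, "_")[[1]]
row <- c(Country = parts[1], Number = parts[2], Year = substr(parts[3], 1, 4))
# row is c(Country = "USA", Number = "52", Year = "1997")
```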
I have downloaded one photo of each deputy. In total, I have 513 photos (but I hosted a file with 271 photos). Each photo was named with the ID of the deputy. I want to change the name of photo to the deputy's name. This means that "66179.jpg" file would be named "norma-ayub.jpg".
I have a column with the IDs ("uri") and another with their names ("name_lower"). I tried to run the code with the "destfile" argument of download.file(), but it accepts only a string. I couldn't figure out how to work with file.rename().
And rename_r_to_R changes only the file extension.
I am a beginner in working with R.
CSV file:
https://gist.github.com/gabrielacaesar/3648cd61a02a3e407bf29b7410b92cec
Photos:
https://github.com/gabrielacaesar/studyingR/blob/master/chamber-of-deputies-17jan2019-files.zip
(It's not necessary to download the ZIP file; when running the code below you also get the photos, but downloading them takes some time.)
library(data.table)
library(httr)
library(jsonlite)

deputados <- fread("dep-legislatura56-14jan2019.csv")
i <- 1
while (i <= 514) {
  tryCatch({
    url <- deputados$uri[i]
    api_content <- rawToChar(GET(url)$content)
    pessoa_info <- jsonlite::fromJSON(api_content)
    pessoa_foto <- pessoa_info$dados$ultimoStatus$urlFoto
    download.file(pessoa_foto, basename(pessoa_foto), mode = "wb")
    Sys.sleep(0.5)
  }, error = function(e) return(NULL))
  i <- i + 1
}
I downloaded the files you provided and either read them directly into R or unzipped them into a new folder, respectively:
df <- data.table::fread(
"https://gist.githubusercontent.com/gabrielacaesar/3648cd61a02a3e407bf29b7410b92cec/raw/1d682d8fcdefce40ff95dbe57b05fa83a9c5e723/chamber-of-deputies-17jan2019",
sep = ",",
header = TRUE)
download.file("https://github.com/gabrielacaesar/studyingR/raw/master/chamber-of-deputies-17jan2019-files.zip",
destfile = "temp.zip")
dir.create("photos")
unzip("temp.zip", exdir = "photos")
Then I use list.files to get the file names of all photos, match them with the dataset, and rename the photos. This runs very fast, and the last bit reports whether renaming each file was successful.
photos <- list.files(
path = "photos",
recursive = TRUE,
full.names = TRUE
)
for (p in photos) {
  id <- basename(p)
  id <- gsub("\\.jpg$", "", id)
  name <- df$name_lower[match(id, basename(df$uri))]
  fname <- paste0(dirname(p), "/", name, ".jpg")
  success <- file.rename(p, fname)
  # optional
  cat(
    "renaming",
    basename(p),
    "to",
    name,
    "successful:",
    ifelse(success, "Yes", "No"),
    "\n"
  )
}
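As an aside, the loop is not strictly necessary: file.rename() is vectorised, so the same renaming can be sketched in three lines (assuming `photos` and `df` as above):

```r
# vectorised variant of the renaming loop:
# strip the .jpg extension to recover each deputy's ID,
# look up the matching name, and rename all files in one call
ids   <- gsub("\\.jpg$", "", basename(photos))
new   <- df$name_lower[match(ids, basename(df$uri))]
moved <- file.rename(photos, file.path(dirname(photos), paste0(new, ".jpg")))
```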
I have some file names which look like the following:
Year1:
blds_PANEL_DPK_8237_8283
blds_PANEL_DPR_8237_8283
blds_PANEL_MWK_8237_8283
All of these are located in the same file path. However, in a different file path for a different year I have very similar files:
Year 2:
blds_PANEL_MHG_9817_9876
blds_PANEL_HKG_9817_9876
blds_PANEL_DPR_9817_9876
Some of the files have the same names as the previous year's, yet some of the names change. The only part of the name which changes is the MHG, HKG, DPR section; the blds_PANEL_ part stays the same, as does 9817_9876.
I have built the path with paste0():
file_path = "C:/Users..."
product = "blds"
part_which_keeps_changing = "HKG"
weeks = "9817_9876"
read.csv(paste0(file_path, product, "/", product, "_PANEL_", part_which_keeps_changing, "_", weeks, ".DAT"), header = TRUE)
It was working well for one product; however, for new products I am running into some errors. So I am trying to load the data in a way that ignores this changing part of the file name.
EDIT: This seems to solve what I am looking to do
temp <- list.files(paste0(files, product), pattern = "*.DAT")
location <- paste0(files, product, temp)
myfiles = lapply(location, read.csv)
library(plyr)
df <- ldply(myfiles, data.frame)
However, I am running into a slightly different problem for some of the files.
If I have the following;
blds_PANEL_DPK_8237_8283
blds_PANEL_DPR_8237_8283
blds_PANEL_MWK_8237_8283
It is possible that one of the files contains no information, and when that happens lapply breaks and stops loading in the data.
Is it possible to skip over these files? Here's the error:
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
no lines available in input
EDIT 2:
This seems to get around the lapply errors:
lapply_with_error <- function(X, FUN, ...) {
  lapply(X, function(x, ...) tryCatch(FUN(x, ...),
                                      error = function(e) NULL))
}
myfiles = lapply_with_error(location, read.delim)
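An alternative to swallowing the errors is to filter out zero-byte files before reading them at all. A self-contained demo (using made-up temporary files, not the real .DAT data):

```r
# demo: one file with data, one empty file, in a temporary directory
d     <- tempdir()
ok    <- file.path(d, "blds_PANEL_DPK_8237_8283.DAT")
empty <- file.path(d, "blds_PANEL_MWK_8237_8283.DAT")
writeLines("a\tb\n1\t2", ok)
file.create(empty)

location <- c(ok, empty)
location <- location[file.size(location) > 0]  # drop empty files up front
myfiles  <- lapply(location, read.delim)
length(myfiles)  # 1
```

file.size() needs R >= 3.2; on older versions file.info(location)$size works the same way.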
I see it is super-easy to grab a PDF file, save it, and fetch all the text from the file.
library(pdftools)
download.file("http://www2.sas.com/proceedings/sugi30/085-30.pdf", "sample.pdf", mode = "wb")
txt <- pdf_text("sample.pdf")
I am wondering how to loop through an array of PDF files, based on links, download each, and scrape the text from each. I want to go to the following link.
http://www2.sas.com/proceedings/sugi30/toc.html#dp
Then I want to download each file from 'Paper 085-30:' to 'Paper 095-30:'. Finally, I want to scrape the text out of each file. How can I do that?
I would think it would be something like this, but I suspect the paste function is not set up correctly.
library(pdftools)
for(i in values){'085-30',' 086-30','087-30','088-30','089-30'
paste(download.file("http://www2.sas.com/proceedings/sugi30/"i".pdf", i".pdf", mode = "wb")sep = "", collapse = NULL)
}
You can get a list of pdfs using rvest.
library(rvest)
x <- read_html("http://www2.sas.com/proceedings/sugi30/toc.html#dp")
href <- x %>% html_nodes("a") %>% html_attr("href")
# char vector of links, use regular expression to fetch only papers
links <- href[grepl("^http://www2.sas.com/proceedings/sugi30/\\d{3}.*\\.pdf$", href)]
I've added some error handling, and don't forget to put the R session to sleep so you don't flood the server. In case a download is unsuccessful, the link is stored in a variable which you can investigate after the loop has finished, perhaps to adapt your code or just download those files manually.
# write failed links to this variable
unsuccessful <- c()
for (link in links) {
  out <- tryCatch(download.file(url = link, destfile = basename(link), mode = "wb"),
                  error = function(e) e, warning = function(w) w)
  if (inherits(out, c("simpleError", "simpleWarning"))) {
    message(sprintf("Unable to download %s", link))
    unsuccessful <- c(unsuccessful, link)
  }
  sleep <- abs(rnorm(1, mean = 10, sd = 10))
  message(sprintf("Sleeping for %f seconds", sleep))
  Sys.sleep(sleep) # don't flood the server, sleep for a while
}
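Once the loop finishes, the text-scraping step from the question is the same pdf_text() call applied to every downloaded file (a sketch, assuming the PDFs landed in the working directory):

```r
library(pdftools)

# read the text of every downloaded PDF into a named list,
# one character vector per file (one element per page)
pdfs  <- list.files(pattern = "\\.pdf$")
texts <- setNames(lapply(pdfs, pdf_text), pdfs)
```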