Downloading txt files from multiple directories using R

I've been trying to download .txt files from https://ds.data.jma.go.jp/gmd/goos/data/pub/JMA-product/cobe2_sst_glb_M/
So far, I've managed to download the complete set for 1850 by adapting the code from "Download all the files (.zip and .txt) from a webpage using R", which for my case is:
page <- "https://ds.data.jma.go.jp/gmd/goos/data/pub/JMA-product/cobe2_sst_glb_M/1850/"
a <- readLines(page)
loc.txt <- grep(".txt", a)
#------------------------------------
convfn <- function(line, marker, page){
  # index of the first character of the file name (just after 'href="')
  i <- unlist(gregexpr(pattern = 'href="', line)) + 6
  # index of the last character of the file name (the final 't' of '.txt')
  i2 <- unlist(gregexpr(pattern = marker, line)) + 3
  # target file
  .destfile <- substring(line, i[1], i2[1])
  # target url
  .url <- paste(page, .destfile, sep = "/")
  # print targets
  cat(.url, '\n', .destfile, '\n')
  # the workhorse function
  download.file(url = .url, destfile = .destfile)
}
#------------------------------------
print(getwd())
sapply(a[loc.txt], FUN = convfn, marker = '.txt"', page = page)
I would like to know how to write a function that automates this for the years 1850 to 2022, since doing it by hand would be long and repetitive (over 170 years). My idea is stuck at this line:
page <- paste0("https://ds.data.jma.go.jp/gmd/goos/data/pub/JMA-product/cobe2_sst_glb_M/", c(seq(1850, 2022, by = 1)), "/")
but I do not know how to turn it into a working function.
Please help, thank you, and keep safe.
Best regards,
Raven

I'd be inclined to do this differently: use XML/XPath to extract the file links.
library(httr) # for GET(...)
library(XML) # for htmlParse(...)
base.url <- 'https://ds.data.jma.go.jp/gmd/goos/data/pub/JMA-product/cobe2_sst_glb_M'
get.docs <- function(year) {
  url  <- paste(base.url, year, sep = '/')
  html <- htmlParse(content(GET(url), type = 'text'))
  file.names <- html['//td/a/@href'][-1] # first href points to the parent directory; drop it
  ##
  # uncomment the next line to actually save the files
  #
  # mapply(download.file, paste(url, file.names, sep='/'), file.names)
  print(sprintf('Downloaded from: %s to: %s', paste(url, file.names, sep = '/'), file.names))
}
lapply(1850:2022, get.docs)
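For completeness, the readLines()/convfn() approach from the question can also be wrapped in a per-year function. This is only a sketch, untested against the live server; it assumes convfn() as defined above and that each year's listing is laid out like the 1850 one:
download.year <- function(year) {
  page <- paste0("https://ds.data.jma.go.jp/gmd/goos/data/pub/JMA-product/cobe2_sst_glb_M/", year, "/")
  a <- readLines(page)
  loc.txt <- grep(".txt", a)
  # reuse convfn() from the question to download every .txt file listed for this year
  sapply(a[loc.txt], FUN = convfn, marker = '.txt"', page = page)
}
# invisible(lapply(1850:2022, download.year))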

Related

How to rename multiple files inside a loop in R

I have downloaded one photo of each deputy. In total, I have 513 photos (but I hosted a file with 271 photos). Each photo was named with the ID of the deputy. I want to rename each photo with the deputy's name, so the file "66179.jpg" would become "norma-ayub.jpg".
I have a column with the IDs ("uri") and their names ("name_lower"). I tried using the destfile argument of download.file(), but it only accepts a string. I also couldn't figure out how to use file.rename().
And rename_r_to_R changes only the file extension.
I am a beginner in working with R.
CSV file:
https://gist.github.com/gabrielacaesar/3648cd61a02a3e407bf29b7410b92cec
Photos:
https://github.com/gabrielacaesar/studyingR/blob/master/chamber-of-deputies-17jan2019-files.zip
(It's not necessary to download the ZIP file; when running the code below, you also get the photos, but it takes some time to download them.)
library(data.table) # for fread()
library(httr)       # for GET()
deputados <- fread("dep-legislatura56-14jan2019.csv")
i <- 1
while (i <= 514) {
  tryCatch({
    url <- deputados$uri[i]
    api_content <- rawToChar(GET(url)$content)
    pessoa_info <- jsonlite::fromJSON(api_content)
    pessoa_foto <- pessoa_info$dados$ultimoStatus$urlFoto
    download.file(pessoa_foto, basename(pessoa_foto), mode = "wb")
    Sys.sleep(0.5)
  }, error = function(e) return(NULL))
  i <- i + 1
}
I downloaded the files you provided and directly read them into R or unzipped them into a new folder, respectively:
df <- data.table::fread(
"https://gist.githubusercontent.com/gabrielacaesar/3648cd61a02a3e407bf29b7410b92cec/raw/1d682d8fcdefce40ff95dbe57b05fa83a9c5e723/chamber-of-deputies-17jan2019",
sep = ",",
header = TRUE)
download.file("https://github.com/gabrielacaesar/studyingR/raw/master/chamber-of-deputies-17jan2019-files.zip",
destfile = "temp.zip")
dir.create("photos")
unzip("temp.zip", exdir = "photos")
Then I use list.files to get the file names of all photos, match them with the dataset, and rename the photos. This runs very fast, and the last bit will report whether renaming each file was successful.
photos <- list.files(
path = "photos",
recursive = TRUE,
full.names = TRUE
)
for (p in photos) {
  id <- basename(p)
  id <- gsub(".jpg$", "", id)
  name <- df$name_lower[match(id, basename(df$uri))]
  fname <- paste0(dirname(p), "/", name, ".jpg")
  success <- file.rename(p, fname)
  # optional
  cat(
    "renaming",
    basename(p),
    "to",
    name,
    "successful:",
    ifelse(success, "Yes", "No"),
    "\n"
  )
}
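An alternative, if you are re-running the downloads anyway, is to name each file at download time instead of renaming it afterwards. A sketch, reusing the setup from the question (it assumes, as the question implies, that name_lower and uri sit in the same rows of deputados):
i <- 1
while (i <= nrow(deputados)) {
  tryCatch({
    url <- deputados$uri[i]
    api_content <- rawToChar(GET(url)$content)
    pessoa_info <- jsonlite::fromJSON(api_content)
    pessoa_foto <- pessoa_info$dados$ultimoStatus$urlFoto
    # save directly under the deputy's name instead of the numeric ID
    download.file(pessoa_foto, paste0(deputados$name_lower[i], ".jpg"), mode = "wb")
    Sys.sleep(0.5)
  }, error = function(e) NULL)
  i <- i + 1
}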

rbind txt files from online directory (R)

I am trying to concatenate text files from a URL, but I don't know how to deal with the HTML and the different folders.
This is the code I tried, but it only lists the text files and returns a lot of HTML markup mixed in. How do I fix this so that I can combine the text files into one CSV file?
library(RCurl)
library(readr) # for read_delim()
url <- "http://weather.ggy.uga.edu/data/daily/"
dir <- getURL(url, dirlistonly = TRUE)
filenames <- unlist(strsplit(dir, "\n")) # split into filenames
# append the files one after another
for (i in 1:length(filenames)) {
  file <- paste(url, filenames[i], sep = '') # build the full url
  if (i == 1) {
    cp <- read_delim(file, delim = ',', col_names = FALSE)
  } else {
    temp <- read_delim(file, delim = ',', col_names = FALSE)
    cp <- rbind(cp, temp) # append to the existing data
    rm(temp)              # remove the temporary copy
  }
}
Here is a code snippet that worked for me. I prefer rvest over RCurl, simply because that's what I've learned. In this case, html_nodes isolates each file ending in .txt. The resulting table has the times saved as character strings, but you could fix that later. Let me know if you have any questions.
library(rvest)
library(readr)
url <- "http://weather.ggy.uga.edu/data/daily/"
doc <- xml2::read_html(url)
text <- rvest::html_text(rvest::html_nodes(doc, "tr td a:contains('.txt')"))
# define column types of fwf data ("c" = character, "n" = number)
ctypes <- paste0("c", paste0(rep("n",11), collapse = ""))
data <- data.frame()
for (i in 1:2) { # first two files as a demo; use seq_along(text) to read them all
  file <- paste0(url, text[i])
  date <- as.Date(read_lines(file, n_max = 1), "%m/%d/%y")
  # Read the file once to determine the column widths
  columns <- fwf_empty(file, skip = 3)
  # Manually expand the `solar` column to be 3 spaces wider
  columns$begin[8] <- columns$begin[8] - 3
  data <- rbind(data, cbind(date, read_fwf(file, columns,
                                           skip = 3, col_types = ctypes)))
}
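Since the stated goal is a single CSV file, the combined data frame can then be written out in one step (the output file name here is just an example):
write.csv(data, "daily_weather_combined.csv", row.names = FALSE)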

How to scrape all pages (1, 2, 3, ..., n) from a website using rvest

# I would like to read the list of .html files to extract data. Appreciate your help.
library(rvest)
library(XML)
library(stringr)
library(data.table)
library(RCurl)
u0 <- "https://www.r-users.com/jobs/"
u1 <- read_html("https://www.r-users.com/jobs/")
download_folder <- ("C:/R/BNB/")
pages <- html_text(html_node(u1, ".results_count"))
Total_Pages <- substr(pages, 4, 7)
TP <- as.numeric(Total_Pages)
# read each results page and save it as a separate .html file
for (i in 1:TP) {
  url <- paste(u0, "page=/", i, sep = "")
  download.file(url, paste(download_folder, i, ".html", sep = ""))
  # create html object
  html <- html(paste(download_folder, i, ".html", sep = ""))
}
Here is a potential solution:
library(rvest)
library(stringr)
u0 <- "https://www.r-users.com/jobs/"
u1 <- read_html("https://www.r-users.com/jobs/")
download_folder <- getwd() # note the change in output directory
TP <- max(as.integer(html_text(html_nodes(u1, "a.page-numbers"))), na.rm = TRUE)
# read each results page and save it as a separate .html file
for (i in 1:TP) {
  url <- paste(u0, "page/", i, "/", sep = "")
  print(url)
  download.file(url, paste(download_folder, "/", i, ".html", sep = ""))
  # create html object
  html <- read_html(paste(download_folder, "/", i, ".html", sep = ""))
}
I could not find the class .results_count in the HTML, so instead I looked for the page-numbers class and picked the highest value returned.
Also, the function html() is deprecated, so I replaced it with read_html().
Good luck!
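To then actually extract data from the saved pages, something along these lines could serve as a starting point. It is only a sketch that collects every link on each page; the exact CSS class of a job listing is not confirmed here, so you would need to check the page source and narrow the selector accordingly:
# read each saved page back in and collect every link it contains
pages <- list.files(download_folder, pattern = "\\.html$", full.names = TRUE)
links <- do.call(rbind, lapply(pages, function(f) {
  doc <- read_html(f)
  a <- html_nodes(doc, "a")
  data.frame(text = html_text(a), href = html_attr(a, "href"),
             stringsAsFactors = FALSE)
}))
head(links)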

How to use a loop (for) to download.file and write.csv using names from a list in R

So I need to write a CSV for each downloaded file, with the currency data for a bunch of countries, taken from the web. And I wanted each one to be saved under its ticker.
So I did:
codigos = list("JPY", "RUB", "SGD", "BRL", "INR", "THB", "GBP", "EUR", "CHF")
for (i in 1:9) {
  url1 = 'http://www.exchangerates.org.uk/'
  url2 = '-USD-exchange-rate-history-full.html'
  codigos = list("JPY", "RUB", "SGD", "BRL", "INR", "THB", "GBP", "EUR", "CHF")
  codigo = codigos[i]
  url <- paste(url1, codigo, url2, sep = "")
  download.file(url, destfile = 'codigo.html')
  dados <- readHTMLTable('codigo.html')
  write.csv(dados, file = "codigo.csv")
}
Although it builds each of the URLs altered by the loop, it doesn't download them to separate files, nor save the CSVs individually. During the process I can see each of them being "saved" to a file named codigo.html, and at the very end I get a single codigo.html and a single codigo.csv containing only the last country in the list.
The problem is that you're saving everything to the same filename. Each pass through the loop will overwrite the prior contents entirely.
Note also that readHTMLTable will take a URL. So perhaps something like this is in order:
for (i in 1:9) {
  url1 = 'http://www.exchangerates.org.uk/'
  url2 = '-USD-exchange-rate-history-full.html'
  cod = codigos[i]
  url <- paste(url1, cod, url2, sep = "")
  dados <- readHTMLTable(url)
  # Create a unique name for each file
  filename <- paste(cod, 'csv', sep = '.')
  write.csv(dados, file = filename)
}
Instead of creating csv files on disk, you might be better off using a list to hold the data, so you can manipulate the list:
url1 <- 'http://www.exchangerates.org.uk/'
url2 <- '-USD-exchange-rate-history-full.html'
l <- lapply(codigos, function(i) readHTMLTable(paste0(url1, i, url2)))
names(l) <- codigos
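If you do want one CSV per currency afterwards, you can loop over the names of the list. A sketch; it assumes the table of interest is the first one readHTMLTable finds on each page:
for (nm in names(l)) {
  # first table on the page; adjust the index if the rates sit in another table
  write.csv(l[[nm]][[1]], file = paste0(nm, ".csv"), row.names = FALSE)
}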
a) In your loop, the url <- ... line should come before download.file(url, ...):
url <- paste(url1, cod, url2, sep = "")
download.file(url, destfile = 'cod.html')
b) In your line write.csv(url, file = nome), nome must be in quotes:
write.csv(url, file = "nome")

Recursively ftp download, then extract gz files

I have a multiple-step file download process I would like to do within R. I have got the middle step, but not the first and third...
# STEP 1 Recursively find all the files at an ftp site
# ftp://prism.oregonstate.edu//pub/prism/pacisl/grids
all_paths <- #### a recursive listing of the ftp path contents??? ####
# STEP 2 Choose all the ones whose filename starts with "hi"
all_files <- sapply(sapply(strsplit(all_paths, "/"), rev), "[", 1)
hawaii_log <- substr(all_files, 1, 2) == "hi"
hi_paths <- all_paths[hawaii_log]
hi_files <- all_files[hawaii_log]
# STEP 3 Download & extract from gz format into a single directory
mapply(download.file, url = hi_paths, destfile = hi_files)
## and now how to extract from gz format?
For part 1, RCurl might be helpful. The getURL function retrieves one or more URLs; dirlistonly lists the contents of the directory without retrieving the files. The rest of the function builds the next level of URLs:
library(RCurl)
getContent <- function(dirs) {
  urls <- paste(dirs, "/", sep = "")
  fls <- strsplit(getURL(urls, dirlistonly = TRUE), "\r?\n")
  ok <- sapply(fls, length) > 0
  unlist(mapply(paste, urls[ok], fls[ok], sep = "", SIMPLIFY = FALSE),
         use.names = FALSE)
}
So starting with
dirs <- "ftp://prism.oregonstate.edu//pub/prism/pacisl/grids"
we can invoke this function and look for things that look like directories, continuing until done
fls <- character()
while (length(dirs)) {
  message(length(dirs))
  urls <- getContent(dirs)
  isgz <- grepl("gz$", urls)
  fls <- append(fls, urls[isgz])
  dirs <- urls[!isgz]
}
We could then use getURL again, but this time on fls (or elements of fls, in a loop) to retrieve the actual files. Or, perhaps better, open a URL connection and use gzcon to decompress and process the file on the fly. Along the lines of:
con <- gzcon(url(fls[1], "r"))
meta <- readLines(con, 7)
data <- scan(con, integer())
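For the "extract into a single directory" part of step 3, one option is to fetch each .gz file and write a decompressed copy alongside it. A rough sketch using only base R; dest.dir is an assumed local folder name, and readBin's n is just a generous upper bound on the decompressed size:
dest.dir <- "pacisl_grids"
dir.create(dest.dir, showWarnings = FALSE)
for (f in fls) {
  gz.local <- file.path(dest.dir, basename(f))
  download.file(f, gz.local, mode = "wb")
  # gzfile() transparently decompresses when reading
  con <- gzfile(gz.local, "rb")
  contents <- readBin(con, what = raw(), n = 1e8)
  close(con)
  writeBin(contents, sub("\\.gz$", "", gz.local))
}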
I can read the contents of the ftp page if I start R with the internet2 option. I.e.
C:\Program Files\R\R-2.12\bin\x64\Rgui.exe --internet2
(The shortcut used to start R on Windows can be modified to add the internet2 argument via right-click / Properties / Target, or you can just run that at the command line; the equivalent is obvious on GNU/Linux.)
The text on that page can be read like this:
download.file("ftp://prism.oregonstate.edu//pub/prism/pacisl/grids", "f.txt")
txt <- readLines("f.txt")
It's a little more work to parse out the Directory listings, then read them recursively for the underlying files.
## (something like)
dirlines <- txt[grep("Directory <A HREF=", txt)]
## split and extract text after "grids/"
split1 <- sapply(strsplit(dirlines, "grids/"), function(x) rev(x)[1])
## split and extract remaining text after "/"
sapply(strsplit(split1, "/"), function(x) x[1])
[1] "dem" "ppt" "tdmean" "tmax" "tmin"
It's about here that this stops seeming very attractive and gets a bit laborious, so I would actually recommend a different option. There is no doubt a better solution, perhaps with RCurl, but I would recommend learning to use an FTP client for you and your users. Command-line ftp, anonymous logins, and mget all work pretty easily.
The internet2 option was explained for a similar ftp site here:
https://stat.ethz.ch/pipermail/r-help/2009-January/184647.html
ftp.root <- where are the files
dropbox.root <- where to put the files
#=====================================================================
# Function that downloads files from URL
#=====================================================================
fdownload <- function(sourcelink) {
  targetlink <- paste(dropbox.root, substr(sourcelink, nchar(ftp.root) + 1,
                                           nchar(sourcelink)), sep = '')
  # list of contents
  filenames <- getURL(sourcelink, ftp.use.epsv = FALSE, dirlistonly = TRUE)
  filenames <- strsplit(filenames, "\n")
  filenames <- unlist(filenames)
  files <- filenames[grep('\\.', filenames)]
  dirs <- setdiff(filenames, files)
  if (length(dirs) != 0) {
    dirs <- paste(sourcelink, dirs, '/', sep = '')
  }
  # files
  for (filename in files) {
    sourcefile <- paste(sourcelink, filename, sep = '')
    targetfile <- paste(targetlink, filename, sep = '')
    download.file(sourcefile, targetfile)
  }
  # subfolders
  for (dirname in dirs) {
    fdownload(dirname)
  }
}
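A call would then look something like this (a sketch: the FTP path is the one from the question, dropbox.root is an assumed local target, and both paths need a trailing slash so the paste() calls above build valid URLs and file paths; RCurl must be loaded for getURL, and any subdirectories under dropbox.root need to exist already, or you can add a dir.create(targetlink) near the top of fdownload()):
ftp.root     <- "ftp://prism.oregonstate.edu//pub/prism/pacisl/grids/"
dropbox.root <- "~/Dropbox/prism/"   # assumed local destination; adjust as needed
fdownload(ftp.root)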
