R loop to extract CSV files from FTP

I'm trying to loop through all the CSV files on an FTP site and upload the contents of CSVs with a certain filename to a database.
So far I've been able to
access the FTP using...
getURL(url, userpwd = userpwd, ftp.use.epsv = FALSE, dirlistonly = TRUE)
get a list of the filenames using...
unlist(strsplit(filenames, "\r\n"))
and create a dataframe with a list of the full URLs (e.g. ftp://sample@ftpserver.name.com/samplename.csv) using...
for (i in seq_along(myfiles)) {
url_list[i,] <- paste(url, myfiles[i], sep = '')
}
How do I loop through this dataframe, filtering for certain filenames, in order to create a new dataframe with all of data from the relevant CSVs? (half the files are named Type1SampleName and half are Type2SampleName)
I would then upload this data to the database.
Thanks!

Since RCurl::getURL returns the response content directly (here, the contents of each CSV), consider extending your lapply call to pass the result into read.csv via its text argument:
# VECTOR OF URLs
urls <- paste0(url, myfiles[grep("Type1", myfiles)])
# LIST OF DATA FRAMES FROM EACH CSV
mydata <- lapply(urls, function(url) {
  resp <- getURL(url, userpwd = userpwd, connecttimeout = 60)
  read.csv(text = resp)
})
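If you then need a single data frame rather than a list (as the question asks), the pieces can be stacked afterwards; this is a small sketch that assumes all the Type1 CSVs share the same column structure:
# STACK LIST OF DATA FRAMES INTO ONE (assumes identical columns across files)
alldata <- do.call(rbind, mydata)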
Alternatively, getURL supports a callback function via its write argument; from the documentation:
Alternatively, if a value is supplied for the write parameter, this is returned. This allows the caller to create a handler within the call and get it back. This avoids having to explicitly create and assign it and then call getURL and then access the result. Instead, the 3 steps can be inlined in a single call.
# USER DEFINED METHOD
import_csv <- function(resp) read.csv(text = resp)
# LONG FORM NOTATION
mydata <- lapply(urls, function(url)
  getURL(url, userpwd = userpwd, connecttimeout = 60, write = import_csv)
)
# SHORT FORM NOTATION
mydata <- lapply(urls, getURL, userpwd = userpwd, connecttimeout = 60, write = import_csv)

Just an update on how I finished this off and what worked for me in the end...
mydata <- lapply(urls, getURL, userpwd = userpwd, connecttimeout = 60)
Following on from above..
i <- 1
while (i <= length(mydata)) {
  mydata1 <- paste0(mydata[[i]])
  bin <- read.csv(text = mydata1, header = FALSE, skip = 1)
  # Column renaming and formatting here
  # Uploading to database using RODBC here
  i <- i + 1
}
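For completeness, the RODBC step flagged in the comment could look roughly like the sketch below; the DSN name "mydsn" and table name "sample_data" are placeholders, not details from the original post:
# MINIMAL RODBC UPLOAD SKETCH ("mydsn" and "sample_data" are hypothetical)
library(RODBC)
ch <- odbcConnect("mydsn")
sqlSave(ch, bin, tablename = "sample_data", append = TRUE, rownames = FALSE)
odbcClose(ch)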
Thanks for the pointers @Parfait - really appreciated.
Like most problems it looks straightforward after you've done it!

Related

Use R Loop to Bulk Download Youtube Transcripts with youtubecaption

I'm trying to use the youtubecaption library to download all the transcripts for a playlist then create a dataframe with all the results.
I have a list of the video URLs and have tried to create a for loop to pass them into the get_caption() function. I can only get one video's transcripts added to the df.
I've tried a few approaches:
vids <- as.list(mydata$videoId)
for (i in 1:length(vids)){
  vids2 <- paste("https://www.youtube.com/watch?v=", vids[i], sep = "")
  test_transcript2 <-
    get_caption(
      url = vids2,
      language = "en",
      savexl = FALSE,
      openxl = FALSE,
      path = getwd())
  rbind(test_transcript, test_transcript2)
}
Also using the column of the main dataframe:
captions <- sapply(mydata[,24], FUN = get_captions)
Is there an efficient way to accomplish this?
In your code, you call rbind(test_transcript, test_transcript2) but never assign the result, so it is lost. Combine that with the general advice to avoid the rbind(old, newrow) grow-in-a-loop pattern, and your code might be:
vids <- as.list(mydata$videoId)
out <- list()
for (i in 1:length(vids)){
  vids2 <- paste("https://www.youtube.com/watch?v=", vids[i], sep = "")
  test_transcript2 <-
    get_caption(
      url = vids2,
      language = "en",
      savexl = FALSE,
      openxl = FALSE,
      path = getwd())
  out <- c(out, list(test_transcript2))
}
alldat <- do.call(rbind, out)
Some other pointers:
for (i in 1:length(vids)) is a fragile pattern, especially if this code is later wrapped in a function (when vids is empty, 1:length(vids) yields c(1, 0)); prefer for (i in seq_along(vids))
since the index itself is never actually needed, we can iterate over the values directly with for (vid in vids)
the pasteing can be done in one shot, which is generally faster in R: for (vid in paste0("https://www.youtube.com/watch?v=", vids)), then pass url = vid in the call to get_caption
with all that, it might be even simpler to use lapply for the whole thing:
path <- getwd()
out <- lapply(paste0("https://www.youtube.com/watch?v=", vids),
get_caption, language = "en", savexl = FALSE,
openxl = FALSE, path = path)
do.call(rbind, out)
(NB: untested.)
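As a small variation (not part of the original answer), if the per-video data frames do not all share identical columns, dplyr::bind_rows() can stand in for do.call(rbind, out):
# bind_rows accepts a list of data frames and fills missing columns with NA
alldat <- dplyr::bind_rows(out)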

XLSX data upload with RestRserve

I would like to work with RestRserve to have a .xlsx file uploaded for processing. I have tried the below using a .csv with success, but slight modifications for .xlsx with get_file were not fruitful.
ps <- r_bg(function(){
  library(RestRserve)
  library(readr)
  library(xlsx)
  app = Application$new(content_type = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet")
  app$add_post(
    path = "/echo",
    FUN = function(request, response) {
      cnt <- request$get_file("xls")
      dt <- xlsx::read.xlsx(cnt, sheetIndex = 1, header = TRUE)
      response$set_body("some function")
    }
  )
  backend = BackendRserve$new()
  backend$start(app, http_port = 65080)
})
What have you tried? According to the documentation, the request$get_file() method returns a raw vector, i.e. a binary representation of the file. I'm not aware of R packages/functions that read an xls/xlsx file directly from a raw vector (such functions may exist, I just don't know of them).
You can instead write the body to a temporary file and then read it the normal way:
library(RestRserve)
library(readxl)
app = Application$new()
app$add_post(
  path = "/xls",
  FUN = function(request, response) {
    fl = tempfile(fileext = '.xlsx')
    xls = request$get_file("xls")
    # need to drop attributes as writeBin()
    # can't write object with attributes
    attributes(xls) = NULL
    writeBin(xls, fl)
    xls = readxl::read_excel(fl, sheet = 1)
    response$set_body("done")
  }
)
backend = BackendRserve$new()
backend$start(app, http_port = 65080)
Also note that the content_type argument controls how the response is encoded, not how the request is decoded.
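For local testing, a request along these lines should exercise the endpoint; the file name data.xlsx is a placeholder, and the multipart field name "xls" must match what request$get_file("xls") expects:
library(httr)
# POST a local workbook to the /xls route started above on port 65080
resp <- POST("http://localhost:65080/xls",
             body = list(xls = upload_file("data.xlsx")),
             encode = "multipart")
content(resp)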

Specify the path to the common folder with json files in R

To parse JSON, I can use this approach:
library("rjson")
json_file <- "https://api.coindesk.com/v1/bpi/currentprice/USD.json"
json_data <- fromJSON(paste(readLines(json_file), collapse=""))
But what if I want to work with a set of JSON files located in a folder, e.g.
json_file <- "C:/myfolder/"
How can I parse all the JSON files in this folder (there are about 1000) into a data.frame?
A lot of info is missing, but this will probably work.
I used pblapply to get a nice progress bar (since you mention >1000 files).
I have never used the solution below for JSON files (no experience with JSON), but it works flawlessly on .csv and .xls files (of course with different read functions), so I expect it to work with JSON as well.
library(data.table)
library(pbapply)
library(rjson)
folderpath <- "C:\\myfolder\\"
filefilter <- "*.json$"
# set parameters as needed
f <- list.files( path = folderpath,
pattern = filefilter,
full.names = TRUE,
recursive = FALSE )
#read all files to a list
f.list <- pblapply( f, function(x) fromJSON( file = x ) )
#join lists together
dt <- data.table::rbindlist( f.list )
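As an optional variation on the answer above (not from the original), if the files do not all share exactly the same fields, or you want to record which file each row came from, rbindlist's fill and idcol arguments can help:
# name the list elements by file so idcol records the source file per row
names(f.list) <- basename(f)
dt <- data.table::rbindlist(f.list, fill = TRUE, idcol = "file")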

Using httr package, setting a header where the header name is a variable

I want to set a header on a request using the R httr package when the name of the header is stored in a variable.
I would like to do something like this:
tokenName = 'X-Auth-Token'
get_credentials_test <- function (token) {
  url <- paste(baseUrl, "/api/usercredentials", sep = '')
  r <- GET(url, add_headers(tokenName = token))
  r
}
However, the above code sets a header literally named tokenName rather than X-Auth-Token.
It does work if I do the following:
get_credentials_test <- function (token) {
  url <- paste(baseUrl, "/api/usercredentials", sep = '')
  r <- GET(url, add_headers('X-Auth-Token' = token))
  r
}
but I want some flexibility if the name of the header changes, since the requirement to add the header is sprinkled liberally around the code. I am not sure whether it is possible to add a header whose name is held in a variable, but that is what I would like to do.
You could create the headers as a named vector, and then pass it as the .headers argument:
h <- c(token)
names(h) <- tokenName
r <- GET(url, add_headers(.headers = h))
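Equivalently (a compact variant, not from the original answer), the naming can be done inline with base R's setNames():
# build the named header vector in one step
r <- GET(url, add_headers(.headers = setNames(token, tokenName)))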
While this works because add_headers takes a .headers argument, a more general alternative for calling a function with arbitrary argument names is do.call:
h <- list(token)
names(h) <- tokenName
r <- GET(url, do.call(add_headers, h))
It's easy with structure():
get_creds <- function(base.url, path, header.name, token) {
  url <- paste0(base.url, path)
  header <- structure(token, names = header.name)
  r <- httr::GET(url, httr::add_headers(header))
  r
}
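A hypothetical call might look like this; the base URL, header name and token value below are placeholders:
r <- get_creds("https://example.com", "/api/usercredentials", "X-Auth-Token", "some-token-value")
httr::status_code(r)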

How to use a loop (for) to download.file and write.csv using names from a list in R

I need to write a CSV for each file downloaded from the web containing the exchange rates for a bunch of countries, and I wanted each one saved using its currency ticker.
So I did:
codigos = list("JPY", "RUB","SGD","BRL","INR","THB","GBP","EUR","CHF")
for (i in 1:9){
  url1 = 'http://www.exchangerates.org.uk/'
  url2 = '-USD-exchange-rate-history-full.html'
  codigos = list("JPY", "RUB","SGD","BRL","INR","THB","GBP","EUR","CHF")
  codigo = codigos[i]
  url <- paste(url1, codigo, url2, sep = "")
  download.file(url, destfile = 'codigo.html')
  dados <- readHTMLTable('codigo.html')
  write.csv(dados, file = "codigo.csv")
}
Although the loop builds and reads each URL correctly, it doesn't download them or save the CSVs individually. During the process I can see each of them being "saved" to a file named codigo.html, and at the very end I get a single codigo.html and a single codigo.csv containing only the last country in the list.
The problem is that you're saving everything to the same filename: 'codigo.html' and "codigo.csv" are string literals, not the codigo variable, so each pass through the loop overwrites the prior contents entirely.
Note also that readHTMLTable will take a URL. So perhaps something like this is in order:
for (i in 1:9){
  url1 = 'http://www.exchangerates.org.uk/'
  url2 = '-USD-exchange-rate-history-full.html'
  cod = codigos[i]
  url <- paste(url1, cod, url2, sep = "")
  dados <- readHTMLTable(url)
  # Create a unique name for each file
  filename <- paste(cod, 'csv', sep = '.')
  write.csv(dados, file = filename)
}
Instead of creating csv files on disk, you might be better off using a list to hold the data, so you can manipulate the list:
url1 <- 'http://www.exchangerates.org.uk/'
url2 <- '-USD-exchange-rate-history-full.html'
l <- lapply(codigos
, function(i) readHTMLTable(paste0(url1, i, url2))
)
names(l) <- codigos
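If you still want one CSV per ticker on disk, you could write them out from the named list, mirroring the write.csv(dados, ...) call in the loop above:
# write one file per ticker, e.g. JPY.csv, RUB.csv, ...
for (cod in names(l)) {
  write.csv(l[[cod]], file = paste0(cod, ".csv"))
}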
a) In your loop, the url <- ... line should come before download.file(url, ...):
url <- paste(url1, cod, url2, sep = "")
download.file(url, destfile='cod.html')
b) In your write.csv(url, file = nome) line, nome must be quoted:
write.csv(url, file= "nome")
