I would like to download multiple files from a list of URLs. Some of the URLs may be invalid, and I would like to skip any that produce an error.
If possible, I would also like to rename each downloaded file based on its ID.
I would appreciate it if someone could help me out. A sample of my data is as follows:
ID <- c('L18491','K18781','I28004')
url <- c('https://file-examples-com.github.io/uploads/2017/02/file_example_XLSX_50.xlsx',
'https://file-examples-com.github.io/uploads/2017/02/file_example_XLSX_101.xlsx',
'https://file-examples-com.github.io/uploads/2017/02/file_example_XLSX_100.xlsx')
df <- data.frame(ID, url)
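We can use possibly from purrr, which wraps a function so that it returns a default value instead of raising an error:
library(purrr)
# Read one workbook from a URL
f1 <- function(urllink) {
  openxlsx::read.xlsx(urllink)
}
# Return NA instead of stopping when f1 fails
pfun <- possibly(f1, otherwise = NA)
out_lst <- map(df$url, pfun)
names(out_lst) <- df$ID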
Or another option is tryCatch
f2 <- function(urllink) {
  tryCatch(openxlsx::read.xlsx(urllink),
           error = function(e) message("error occurred"))
}
out_lst2 <- lapply(df$url, f2)
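One detail worth noting about the tryCatch version: message() returns NULL, so failed URLs leave NULL entries in the result list. A minimal sketch of filtering them out follows. It uses read.csv on local temporary files as a stand-in for openxlsx::read.xlsx on URLs (an assumption made so the example runs offline), and the helper name read_or_null is hypothetical.

```r
# Hypothetical helper: return the parsed table, or NULL if reading fails.
read_or_null <- function(path) {
  tryCatch(
    read.csv(path),
    error = function(e) {
      message("skipping ", path, ": ", conditionMessage(e))
      NULL
    }
  )
}

# One readable file and one path that does not exist (both are stand-ins).
good <- tempfile(fileext = ".csv")
write.csv(data.frame(a = 1:3), good, row.names = FALSE)
paths <- c(ok = good, bad = tempfile(fileext = ".csv"))

results <- lapply(paths, read_or_null)      # named list, NULL where reading failed
loaded <- Filter(Negate(is.null), results)  # keep only the successful reads
names(loaded)
```

Keeping the names on the input vector means the surviving entries can still be matched back to their IDs after filtering.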
If we want to use download.file
lapply(seq_along(df$url), function(i)
  tryCatch(download.file(df$url[i], paste0(getwd(), "/", df$ID[i], ".xlsx")),
           error = function(e) message("error occurred")))
Or using iwalk
library(tibble)
pfun2 <- possibly(download.file, otherwise = NA)
iwalk(deframe(df), ~ pfun2(.x, as.character(glue::glue('{getwd()}/{.y}.xlsx'))))
You can use download.file to download each file and name it according to the ID variable.
Map(function(x, y) tryCatch(download.file(x, sprintf('%s.xlsx', y)),
error = function(e) {},
warning = function(w) {}), df$url, df$ID)
This will download the files to your working directory and name each one ID.xlsx. It will also silently skip any errors or warnings that are raised.
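If it is useful to know afterwards which IDs failed, the same idea can be wrapped in a small function that reports success per file. This is only a sketch: the name safe_download and the dest_dir argument are hypothetical, and mode = "wb" is added on the assumption that the files are binary workbooks.

```r
# Hypothetical helper: download one file to <dest_dir>/<id>.xlsx and
# report TRUE on success, FALSE on any error or warning.
safe_download <- function(url, id, dest_dir = getwd()) {
  dest <- file.path(dest_dir, paste0(id, ".xlsx"))
  tryCatch(
    # download.file() returns 0 on success; mode = "wb" keeps binary files intact
    download.file(url, dest, mode = "wb", quiet = TRUE) == 0,
    error = function(e) FALSE,
    warning = function(w) FALSE
  )
}

# Usage with the question's data frame (needs network access, so left commented out):
# ok <- mapply(safe_download, df$url, df$ID)
# df$ID[!ok]  # IDs whose download failed
```

Returning a logical rather than discarding the condition makes it easy to retry or log only the failed downloads.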
I have more than 1000 csv files. I would like to combine them into a single file after running some processing. So I used a loop as follows:
setwd("C:/....")
files <- dir(".", pattern = ".csv$") # Get the names of all the csv files in the current directory.

for (i in 1:length(files)) {
  obj_name <- files %>% str_sub(end = -5)
  assign(obj_name[i], read_csv(files[i]))
}
Up to here, it works well.
I tried to concatenate the imported files into a list to manipulate them at once, as follows:
command <- paste0("RawList <- list(", paste(obj_name, collapse = ","), ")")
eval(parse(text = command))

rm(i, obj_name, command, list = ls(pattern = "^g20"))
Ref_com_list = list()
Up to here, it is still okay. But ...
for (i in 1:length(RawList)) {
  df <- RawList[[i]] %>%
    pivot_longer(cols = -A, names_to = "B", values_to = "C") %>%
    mutate(time_sec = paste(YMD[i], B) %>% ymd_hms()) %>%
    mutate(minute = format(as.POSIXct(B, format = "%H:%M:%S"), "%M"))

  ...(some calculation)
  Ref_com_list[[i]] <- file_all
}

Ref_com_all <- do.call(rbind, Ref_com_list)
At that point, I got the following error:
Error: Can't combine `A` <double> and `B` <datetime<UTC>>.
Run `rlang::last_error()` to see where the error occurred.
If I run an individual file, it works well. But if I run it in the for loop, the error shows up.
Could anyone tell me what the problem is?
Thanks a lot in advance.
There is substantial scope for improvement in your code. Broadly speaking, if you are working in the tidyverse, you can pass multiple files to read_csv directly. Example:
# Generate some sample files
tmp_dir <- fs::path_temp("some_csv_files")
fs::dir_create(tmp_dir)
for (i in 1:100) {
readr::write_csv(mtcars, fs::file_temp(pattern = "cars",
tmp_dir = tmp_dir, ext = ".csv"))
}
# Actual file reading
dta_cars <- readr::read_csv(
file = fs::dir_ls(path = tmp_dir, glob = "*.csv"),
id = "file_path"
)
If you want to keep information on where each row originated, using id = "file_path" in read_csv will store the source path in a column. This is arguably more efficient and less error-prone than:
for (i in 1:length(files)) {
  obj_name <- files %>% str_sub(end = -5)
  assign(obj_name[i], read_csv(files[i]))
}
This is much cleaner and will be faster than creating objects one by one in a loop. Afterwards you can proceed with your transformations:
dta_cars %>% ...
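For readers who do not have readr available, roughly the same result can be sketched in base R. This is an illustration rather than the approach described above: the directory name and the cars_*.csv file names are made up for the example.

```r
# Generate a few sample files (hypothetical names, written to a temp directory)
tmp_dir <- file.path(tempdir(), "some_csv_files_base")
dir.create(tmp_dir, showWarnings = FALSE)
for (i in 1:3) {
  write.csv(head(mtcars, 2), file.path(tmp_dir, sprintf("cars_%d.csv", i)),
            row.names = FALSE)
}

# Read every file, tagging each row with the file it came from
files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)
tables <- lapply(files, function(f) data.frame(file_path = f, read.csv(f)))

# Stack the per-file tables into one data frame
dta_cars <- do.call(rbind, tables)
```

The file_path column plays the same role as id = "file_path" in read_csv: it records which file each row was read from.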
Try:
library(data.table)
files <- list.files(path = '.', full.names = TRUE, pattern = '\\.csv$')
files_open <- lapply(files, function(x) fread(x, ...)) # ... for arguments like sep, dec, etc...
big_file <- rbindlist(files_open)
fwrite(big_file, ...) # ... for arguments like sep, dec, path to save data, etc...
I have now found the reason why it happened. There was another file with a different name but the same file type, so the code read all the files and produced the error.
I am sorry for the confusion.
Thank you so much!
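A stray file like that can be kept out by making the file-name pattern stricter. The sketch below assumes the wanted files share the g20 prefix that appears in the question's ls(pattern = "^g20") call, followed by digits. That naming scheme and the sample file names are assumptions for illustration.

```r
# Create a few empty files, including one csv that should NOT be picked up
tmp_dir <- file.path(tempdir(), "csv_demo")
dir.create(tmp_dir, showWarnings = FALSE)
file.create(file.path(tmp_dir,
                      c("g2001.csv", "g2002.csv", "notes.csv", "g20_summary.xlsx")))

# "^g20[0-9]+" anchors the prefix; "\\.csv$" matches a literal ".csv" at the end
wanted <- list.files(tmp_dir, pattern = "^g20[0-9]+\\.csv$")
wanted
```

Note that in the original pattern ".csv$" the dot is a regular-expression wildcard, so it matches any character before "csv". Escaping it as "\\." restricts the match to a real file extension.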
I have a folder full of .csv files that have to be slightly changed and then saved as an xlsx document.
To do this, I have created a loop:
library(xlsx)
docs <- Sys.glob( "*.csv" )
for( i in docs )
{
df <- read.csv(i)
df2 <- select(df, X, Y)
df3 <- mutate(df2, Z = (X - Y) / 3600)
write.xlsx(df3, paste( "C:/users/Desktop/Files/", i), row.names = FALSE)
}
However, when I execute this for loop, the following error message pops up:
Error in createWorkbook(type = ext) : Unknown format csv
Did I forget anything? I would be very grateful if you could help me, as I have no idea what else to change...
The package rio removes most of the headache with xlsx files. It can also be used to read in files:
library(dplyr) # for select() and mutate()
docs <- Sys.glob("*.csv")
for(i in docs) {
  df <- rio::import(i)
  df2 <- select(df, X, Y)
  # build on df2; df3 does not exist yet at this point
  df3 <- mutate(df2, Z = (X - Y) / 3600)
  rio::export(df3, paste0("C:/users/Desktop/Files/", i, ".xlsx"))
}
This should work on the import/export part. What I'm not so sure about is your Sys.glob, since I've never used that before. I find list.files has a really easy and powerful syntax...
Update
If you want to get rid of the .csv file extension, you can use this instead:
for(i in docs) {
  df <- rio::import(i)
  df2 <- select(df, X, Y)
  df3 <- mutate(df2, Z = (X - Y) / 3600)
  # escape the dot so only a literal ".csv" suffix is removed
  fname <- gsub("\\.csv$", "", i)
  rio::export(df3, paste0("C:/users/Desktop/Files/", fname, ".xlsx"))
}
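The error message in the question ("Unknown format csv") suggests that write.xlsx takes the workbook type from the output file's extension, which was still .csv. A small base R sketch for building the output path is shown here. The helper name xlsx_path is hypothetical, and tools::file_path_sans_ext is used in place of the gsub call.

```r
# Hypothetical helper: turn "some/dir/name.csv" into "<out_dir>/name.xlsx"
xlsx_path <- function(csv_file, out_dir) {
  stem <- tools::file_path_sans_ext(basename(csv_file))  # drop directory and extension
  file.path(out_dir, paste0(stem, ".xlsx"))
}

xlsx_path("measurements.csv", "C:/users/Desktop/Files")
```

Using file.path also avoids the stray space that paste() inserts between its arguments by default.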
I think I am missing something simple, but I am having trouble accessing the elements of a list inside lapply.
Problem: I have a number of files on an FTP server that I want to download and read. So I need to specify the location, download them and read them. I thought all of this could be handled best with a few lists, but I can't really get it to work in my function.
I would like to be able to start by calling lapply(lst, ...) because I need both the variable name (a) and the URL in the same function, to download and name the files easily.
Code-example:
a <- "ftp://user:pass#url_A1"
b <- "ftp://user:pass#url_B1"
c <- "ftp://user:pass#url_C1"
d <- "ftp://user:pass#url_D1"
lst <- list(a, b, c, d)
names(lst) <- c("a", "b", "c", "d")
Desired goal:
print(lst[[1]]), ...., print(lst[[4]])
What I've tried:
lapply(lst,
function(x) print(x[[]])
)
# Error!
My real code looks something more like:
lapply(lst,
function(x) download.file(url = x[[]], # Error!
destfile = paste0(lok, paste0(names(x), ".csv")),
quiet = FALSE)
)
EDIT:
I know the x[[]] throws an error, it is just to illustrate what I would like to get.
Untested:
lapply(names(lst), function(x) {
  download.file(url = lst[[x]],
                destfile = paste0(lok, paste0(x, ".csv")),
                quiet = FALSE)
})
This should work, provided lok is defined.
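An alternative to iterating over the names is to walk the values and the names in parallel with Map. The sketch below only builds the destination paths, so it runs without an FTP server. Here lok is set to a temporary directory as a placeholder, and the two URLs are copied from the question.

```r
lst <- list(a = "ftp://user:pass#url_A1", b = "ftp://user:pass#url_B1")
lok <- tempdir()  # placeholder for the real download folder

# Map() passes one URL and its matching name to the function at a time
dest <- Map(function(url, name) file.path(lok, paste0(name, ".csv")),
            lst, names(lst))

# The result keeps the names of the first argument
names(dest)
```

In real use the function body would call download.file(url = url, destfile = ...) instead of only returning the path.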
I want to apply the read.table() function to a list of .txt files. These files are in my current directory.
my.txt.list <-
list("subject_test.txt", "subject_train.txt", "X_test.txt", "X_train.txt")
Before applying read.table() to the elements of this list, I want to check whether the data table (dt) has already been computed and is in a cache directory. Data tables from the cache directory are already in my environment(), in the form file_name.dt
R> ls()
"subject_test.dt" "subject_train.dt"
In this example, I only want to compute "X_test.txt" and "X_train.txt". I wrote a small function to test whether the dt has already been cached and to apply read.table() if not.
my.rt <- function(x,...){
# apply read.table to txt files if data table is not already cached
# x is a character vector
y <- strsplit(x,'.txt')
y <- paste(y,'.dt',sep = '')
if (y %in% ls() == FALSE){
rt <- read.table(x, header = F, sep = "", dec = '.')
}
}
This function works if I take one element this way:
subject_test.dt <- my.rt('subject_test.txt')
Now I want to sapply it to my file list this way:
my.res <- sapply(my.txt.list,my.rt)
I get my.res as a list of data frames, but the issue is that the function reads all files and does not take into account files that have already been computed.
I must be missing something, but I can't see why.
Thanks for any suggestions.
I think it has to do with the use of strsplit in your example. strsplit returns a list.
What about this?
my.txt.files <- c("subject_test.txt", "subject_train.txt", "X_test.txt", "X_train.txt")
> ls()
[1] "subject_test.dt" "subject_train.dt"
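my.rt <- function(x){
  y <- gsub(".txt", ".dt", x, fixed = TRUE)
  # ls() inside a function lists the function's own variables,
  # so look in the global environment explicitly
  if (!(y %in% ls(envir = globalenv()))) {
    read.table(x, header = FALSE, sep = "", dec = '.')
  }
}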
my.res <- sapply(my.txt.files, FUN = my.rt)
Note that I'm replacing .txt with .dt and I'm doing a "not in". You will get NULL entries in the result list if a file is not processed.
This is untested, but I think it should work...
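One thing to watch with this kind of check: called without arguments inside a function, ls() lists that function's local variables, not the objects in the workspace. The sketch below illustrates this and shows a variant based on exists(). The helper name is_cached and the separate cache environment are assumptions made so the example does not touch the global environment.

```r
# ls() with no arguments lists the calling environment, so inside a
# function it sees only that function's local variables.
local_names <- function() {
  y <- 1
  ls()
}
local_names()

# Hypothetical helper: is there an object named <file>.dt in `env`?
is_cached <- function(txt_file, env = globalenv()) {
  dt_name <- sub("\\.txt$", ".dt", txt_file)
  exists(dt_name, envir = env, inherits = FALSE)
}

# Stand-in for the workspace that holds the cached tables
cache <- new.env()
assign("subject_test.dt", data.frame(), envir = cache)

is_cached("subject_test.txt", cache)
is_cached("X_test.txt", cache)
```

With the default env = globalenv(), is_cached checks the interactive workspace, which is where the question says the cached .dt objects live.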
What I need to do is read data from hundreds of links. Some of the links contain no data, so with the code here:
urls <-paste0("http://somelink.php?station=",station, "&start=", Year, "01-01&etc")
myData <- lapply(urls, read.table, header = TRUE, sep = '|')
an error pops up saying "no lines available in input". I have tried using try, but I get the same error. Please help, thanks.
Here are 2 possible solutions (untested because your example is not reproducible):
Using try:
myData <- lapply(urls, function(x) {
tmp <- try(read.table(x, header = TRUE, sep = '|'))
if (!inherits(tmp, 'try-error')) tmp
})
Using tryCatch:
myData <- lapply(urls, function(x) {
tryCatch(read.table(x, header = TRUE, sep = '|'), error=function(e) NULL)
})
Does this help?
dims <- sapply(myData, dim)[2,]
bad_Ones <- myData[dims==1]
good_Ones <- myData[dims>1]
If myData still grabs something off the station page, the above code should separate the myData list into two separate groups. good_Ones would be the list you would want to work with. (assuming the above is accurate, of course)
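If myData was built with one of the try or tryCatch versions above, failed links are stored as NULL, and dim(NULL) is also NULL, so the sapply(myData, dim)[2,] step may no longer return a matrix. A sketch of a split that tolerates NULL entries follows. The sample list is made up for illustration.

```r
# Made-up stand-in for the downloaded results: a real table, a failed
# link (NULL), and a one-column table
myData <- list(
  data.frame(a = 1:2, b = 3:4),
  NULL,
  data.frame(x = "no data")
)

# Column count per entry, treating anything that is not a data frame as 0
n_cols <- vapply(myData,
                 function(d) if (is.data.frame(d)) ncol(d) else 0L,
                 integer(1))

good_Ones <- myData[n_cols > 1]
bad_Ones  <- myData[n_cols <= 1]
```

vapply is used instead of sapply so the result is always an integer vector, whatever the entries look like.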