Reading .csv files, converting them and saving them as .xlsx using R

I have a folder full of .csv files that need to be slightly changed and then saved as .xlsx documents.
To do this, I have created a loop:
library(xlsx)
docs <- Sys.glob( "*.csv" )
for( i in docs )
{
df <- read.csv(i)
df2 <- select(df, X, Y)
df3 <- mutate(df2, Z = (X - Y) / 3600)
write.xlsx(df3, paste( "C:/users/Desktop/Files/", i), row.names = FALSE)
}
However, when I execute this for loop, the following error message pops up:
Error in createWorkbook(type = ext) : Unknown format csv
Did I forget anything? I would be very grateful if you could help me, as I have no idea what else to change.

The rio package removes most of the headache with xlsx files. It can also be used to read in files:
library(dplyr) # provides select() and mutate()
docs <- Sys.glob("*.csv")
for (i in docs) {
  df <- rio::import(i)
  df2 <- select(df, X, Y)
  df3 <- mutate(df2, Z = (X - Y) / 3600) # build on df2, not the not-yet-defined df3
  rio::export(df3, paste0("C:/users/Desktop/Files/", i, ".xlsx"))
}
This should work on the import/export part. What I'm not so sure about is your Sys.glob, since I've never used that before. I find list.files has a really easy and powerful syntax...
Update
If you want to get rid of the .csv file extension, you can use this instead:
for (i in docs) {
  df <- rio::import(i)
  df2 <- select(df, X, Y)
  df3 <- mutate(df2, Z = (X - Y) / 3600)
  fname <- gsub("\\.csv$", "", i) # escape the dot so only a literal ".csv" suffix is removed
  rio::export(df3, paste0("C:/users/Desktop/Files/", fname, ".xlsx"))
}
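The error message in the question (createWorkbook(type = ext): Unknown format csv) indicates that the workbook type is taken from the output file's extension, which was still .csv. As a minimal base-R sketch of building the output names, assuming two made-up input names and a hypothetical output folder out_dir:

```r
# Hypothetical input names and output folder, used only for illustration
docs <- c("a.csv", "b.csv")
out_dir <- file.path(tempdir(), "Files")

# Drop the .csv extension, then append .xlsx
base_names <- tools::file_path_sans_ext(basename(docs))
out_paths <- file.path(out_dir, paste0(base_names, ".xlsx"))

print(basename(out_paths))
```

file.path also avoids the stray space that paste() inserts between the folder and the file name by default.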

Related

Issues with for loops in R

I am trying to combine some excel spreadsheets. There are 50 documents. I am looking to get sheets 2:5, except some only have sheets 2:3, 2:4, etc - this is why I include the try function. I need ranges F6:AZ2183 and I am transposing the data.
The issue I am running into is that only the last file is saving into the data frame df.
I attached the code below. If you have any ideas, I would much appreciate it!
Also, I'm a longtime lurker first time poster, so if my etiquette is poor, I apologize.
df <- data.frame()
for (i in 1:50){
for (j in 2:5) {
try({
df.temp <- t(read_excel((paste0('FqReport',i,'.xlsx')), sheet = j, range ='F6:AZ2183'))
df.temp <- df.temp[rowSums(is.na(df.temp)) != ncol(df.temp), ]
df <- rbind(df, df.temp)
rm(df.temp)
gc()
}, silent = TRUE)
}
}
You can read the names of the sheets available in each Excel file, which avoids the need for try. Also, growing a data frame inside a loop is quite inefficient. Try this lapply approach:
library(readxl)
filename <- paste0('FqReport',1:50,'.xlsx')
df <- do.call(rbind, lapply(filename, function(x) {
sheet_name <- excel_sheets(x)[-1]
do.call(rbind, lapply(sheet_name, function(y) {
df.temp <- t(read_excel(x, y, range ='F6:AZ2183'))
df.temp[rowSums(is.na(df.temp)) != ncol(df.temp), ]
}))
}))
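If an explicit loop is preferred, the same idea can be sketched by collecting each piece in a list and binding once at the end. Here read_piece is a hypothetical stand-in for the read_excel call, so the example runs without any Excel files:

```r
# Hypothetical stand-in for read_excel(): one small matrix per (file, sheet) pair
read_piece <- function(file_id, sheet_id) {
  matrix(file_id * 10 + sheet_id, nrow = 2, ncol = 3)
}

pieces <- list()
for (i in 1:3) {
  for (j in 2:3) {
    # Append to a list instead of calling rbind() on every iteration
    pieces[[length(pieces) + 1]] <- read_piece(i, j)
  }
}

# Bind all pieces in a single step
combined <- do.call(rbind, pieces)
print(dim(combined))
```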

Using R to loop through downloading of files and renaming it

I would like to download multiple files from a list of URLs. Some of the URLs may be invalid, and I would like to skip those if there is an error.
If possible, I would also like to rename each downloaded file based on its ID.
I would appreciate it if someone could help me out. A sample of my data is as follows:
ID <- c('L18491','K18781','I28004')
url <- c('https://file-examples-com.github.io/uploads/2017/02/file_example_XLSX_50.xlsx',
'https://file-examples-com.github.io/uploads/2017/02/file_example_XLSX_101.xlsx',
'https://file-examples-com.github.io/uploads/2017/02/file_example_XLSX_100.xlsx')
df <- data.frame(ID, url)
We can use possibly from purrr
library(purrr)
out_lst <- map(df$url, pfun)
names(out_lst) <- df$ID
where
pfun <- possibly(f1, otherwise = NA)
where
f1 <- function(urllink) {
openxlsx::read.xlsx(urllink)
}
Or another option is tryCatch
f2 <- function(urllink) {
tryCatch(openxlsx::read.xlsx(urllink),
error = function(e) message("error occurred"))
}
out_lst2 <- lapply(df$url, f2)
If we want to use download.file
lapply(seq_along(df$url), function(i)
  # mode = "wb" keeps binary files such as .xlsx intact on Windows
  tryCatch(download.file(df$url[i], paste0(getwd(), "/", df$ID[i], ".xlsx"), mode = "wb"),
           error = function(e) message("error occurred")))
Or using iwalk
library(tibble)
pfun2 <- possibly(download.file, otherwise = NA)
iwalk(deframe(df), ~ pfun2(.x, as.character(glue::glue('{getwd()}/{.y}.xlsx'))))
You can use download.file to download each file and name it according to the ID variable.
Map(function(x, y) tryCatch(download.file(x, sprintf('%s.xlsx', y)),
error = function(e) {},
warning = function(w) {}), df$url, df$ID)
This will download the files to your working directory and name each one ID.xlsx. It will also silently skip any errors or warnings that are raised.
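To see which IDs were skipped, the error handler can return a marker value instead of nothing. A minimal sketch, assuming a hypothetical fetch function that stands in for download.file and fails for one made-up URL:

```r
# Hypothetical stand-in for download.file(): fails for URLs containing "bad"
fetch <- function(url) {
  if (grepl("bad", url, fixed = TRUE)) stop("cannot open URL: ", url)
  paste("contents of", url)
}

# Made-up IDs and URLs, used only for illustration
urls <- c(A1 = "https://example.com/ok1",
          B2 = "https://example.com/bad",
          C3 = "https://example.com/ok2")

# Return NA for any URL that raises an error, so the loop keeps going
results <- vapply(urls, function(u) {
  tryCatch(fetch(u), error = function(e) NA_character_)
}, character(1))

# IDs whose download failed
failed <- names(results)[is.na(results)]
print(failed)
```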

R loop over write.xlsx()

I want to export a couple of data frames to an excel file using the function write.xlsx() from openxlsx. So, for example the following:
library(openxlsx)
x <- c(1,2,3)
for (i in x) {
name <- paste("sheet", i, sep = "")
assign(name, data.frame(1:4, 2:3))
path <- paste("/some_directory/",name,".xlsx" , sep = "")
write.xlsx(name, file = path)
}
This does create three different data frames with the values 1 to 4 and 2 to 3, those have the right names, it also creates three different excel files with the right names, but the excel files only contain the name instead of the values from the dataframe. Does anyone know how to change that?
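write.xlsx(name, ...) writes the character string stored in name, not the data frame of that name. You need to keep your data.frame in a variable and pass that instead: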
library(glue)
library(openxlsx)
x <- c(1,2,3)
for (i in x) {
name <- paste("sheet", i, sep = "")
df <- data.frame(1:4, 2:3) # This step is missing in your example
assign(name, df)
path <- glue("/some_directory/{name}.xlsx", name = name)
write.xlsx(df, file = path)
}
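An alternative that avoids assign() altogether is to keep the data frames in a named list. A minimal sketch in base R, with hypothetical column names a and b:

```r
# Store each data frame in a named list instead of using assign()
sheets <- list()
for (i in 1:3) {
  name <- paste0("sheet", i)
  sheets[[name]] <- data.frame(a = 1:4, b = 2:3)
}

print(class(name))            # `name` itself is only a string
print(class(sheets[[name]]))  # the list element is the data frame

# With openxlsx installed, each element could then be written inside the loop:
# write.xlsx(sheets[[name]], file = path)
```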

R efficiently bind_rows over many dataframes stored on harddrive

I have roughly 50000 .rda files. Each contains a dataframe named results with exactly one row. I would like to append them all into one dataframe.
I tried the following, which works, but is slow:
root_dir <- paste(path, "models/", sep="")
files <- paste(root_dir, list.files(root_dir), sep="")
load(files[1])
results_table = results
rm(results)
for(i in c(2:length(files))) {
print(paste("We are at step ", i,sep=""))
load(files[i])
results_table= bind_rows(list(results_table, results))
rm(results)
}
Is there a more efficient way to do this?
Using .rds is a little bit easier. But if we are limited to .rda the following might be useful. I'm not certain if this is faster than what you have done:
library(purrr)
library(dplyr)
library(tidyr)
## make and write some sample data to .rda
x <- 1:10
fake_files <- function(x){
df <- tibble(x = x)
save(df, file = here::here(paste0(as.character(x),
".rda")))
return(NULL)
}
purrr::map(x,
~fake_files(x = .x))
## map and load the .rda files into a single tibble
load_rda <- function(file) {
foo <- load(file = file) # foo just provides the name of the objects loaded
return(df) # note df is the name of the rda returned object
}
rda_files <- tibble(files = list.files(path = here::here(""),
pattern = "\\.rda$", # pattern is a regular expression, not a glob
full.names = TRUE)) %>%
mutate(data = pmap(., ~load_rda(file = .x))) %>%
unnest(data)
This is untested code but should be pretty efficient:
root_dir <- paste(path, "models/", sep="")
files <- paste(root_dir, list.files(root_dir), sep="")
data_list <- lapply(files, function(f) { # iterate over all files, not a single hard-coded name
  message("loading file: ", f)
  name <- load(f) # load() returns the name(s) of the loaded object(s)
  get(name[1]) # return the object whose name is stored in `name`
})
results_table <- data.table::rbindlist(data_list)
data.table::rbindlist is very similar to dplyr::bind_rows but a little faster.
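A variant that does not depend on what the stored object is called loads each file into its own environment. This is a minimal base-R sketch; the folder, file names, and columns are made up, and do.call(rbind, ...) is used in place of rbindlist only to keep the example dependency-free:

```r
# Write three hypothetical one-row files, each holding an object named `results`
dir_path <- file.path(tempdir(), "models_demo")
dir.create(dir_path, showWarnings = FALSE)
for (k in 1:3) {
  results <- data.frame(id = k, score = k * 1.5)
  save(results, file = file.path(dir_path, paste0("m", k, ".rda")))
}
rm(results)

# Load one file into a private environment and return its first object
load_one <- function(f) {
  env <- new.env()
  obj_names <- load(f, envir = env)
  get(obj_names[1], envir = env)
}

files <- list.files(dir_path, pattern = "\\.rda$", full.names = TRUE)
results_table <- do.call(rbind, lapply(files, load_one))
print(results_table)
```

Loading into a private environment also keeps the global workspace from being overwritten on every iteration.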

I want to loop through subfolders within a main directory, grab two tifs and perform the same function on each pair

I have a main directory, and each folder within it has the same format of data. I'd like to go into each folder individually and perform the same function. Each folder has a slightly different name, but the files all follow the same naming protocol (i.e. YYYYMMDD_XX_raster1.tif; YYYYMMDD_XX_raster2.geo.tif).
I've tried modifying a code that I already had, but in that instance the .tifs were all in the same folder.
library(raster)
library(sp)
library(rgdal)
#create sample rasters
r <- raster(nrow=10, ncol=10)
x <- setValues(r, sample(-180:180,ncell(r), replace(T))
y <- setValues(r, sample(-90:90,ncell(r), replace(T))
dq2<- function(in_dir_path, add_file_name, silent=F){
#inputs
#in_dir_path: path to directory
#add_file_name: ie. finished
list_files <- sort(list.files(path= in_dir_path, pattern =""))
print(list_files)
if(silent == F){
message(paste"Nb files in in_dir", length(list_files))
}
for(i in 1: length(list_files)){
if(silent == F){
message(paste("Processing", i, "out of", length(list_files)))
}
#get rasters (this would normally grab from within the subfolder, but here refers to the rasters created above)
x <- raster(paste(in_dir_path, "/$raster1.geo.tif", list_files[i], sep = ""))
y <- raster(paste(in_dir_path, "/$raster2.geo.tif", list_files[i], sep = ""))
#Create filename to save
split<- strsplit(list_files[i], split=".tif")[[1]][1]
sprint (split)
nameFile <- paster(in_dir_path, "/", split, "_", add_file_name, ".tif", sep = "")
# calculation here
tf <- function(x,y) {
return(100 * x / y)
}
answer <- overlay(x,y, fun=tf)
WriteRaster(answer, filename=nameFile, format="GTiff", overwrite=TRUE, options=c('TFW=YES'))
if(silent == F){
message(paste("process complete", nameFile))
}
}
message("finished")
dq2("C:/Users", "fin", silent=F)
When I try to run this, it looks like the code is waiting for more input.
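For the subfolder part of the task, one possible shape is sketched below in base R. It only locates the pair of files in each subfolder. The directory layout and names are made up, empty placeholder files stand in for real rasters, and the raster calculation itself is left as a comment:

```r
# Build a hypothetical directory layout with empty placeholder files
main_dir <- file.path(tempdir(), "main_demo")
for (d in c("20200101_AA", "20200202_BB")) {
  dir.create(file.path(main_dir, d), recursive = TRUE, showWarnings = FALSE)
  file.create(file.path(main_dir, d, paste0(d, "_raster1.tif")))
  file.create(file.path(main_dir, d, paste0(d, "_raster2.geo.tif")))
}

# One entry per immediate subfolder of the main directory
subfolders <- sort(list.dirs(main_dir, recursive = FALSE))

pairs <- lapply(subfolders, function(folder) {
  f1 <- list.files(folder, pattern = "_raster1\\.tif$", full.names = TRUE)
  f2 <- list.files(folder, pattern = "_raster2\\.geo\\.tif$", full.names = TRUE)
  # The raster calculation from the question would go here, using f1 and f2
  list(folder = basename(folder), raster1 = basename(f1), raster2 = basename(f2))
})

print(pairs[[1]])
```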
