automate directory paths in R

I am building an automated program that requires the user to modify only a few parameters in the Launch file.
I am wondering whether it is possible to build the file paths automatically from the user-defined "year" and "month"?
Structure
Launch.r
| Import.r #(load libraries and call specific programs)
| Topic1.r
|.......
| Topic2.r
......
#--> Launch file
# Parameters to be personalized by the user
year <- 2021
month <- 01
# File directory
import <- c('c:/folderX/year/month/folder')
export <- c('c:/folderX/year/export/folder1')
.....
When I run the program, R returns
import "c:/folderX/year/month/folder"
export "c:/folderX/year/export/folder1"
My goal is to get
import "c:/folderX/2021/01/folder"
export "c:/folderX/2021/export/folder1"
Do you have any tips to help me?

If I am understanding the question correctly, file.path() can accept variables as parts of the path, e.g.
year <- 2021
month <- 01
import <- file.path("c:/folderX", year, month, "folder")
should give
#> [1] "c:/folderX/2021/1/folder"
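Note that the numeric 01 prints as 1 in the path. If you want the zero-padded month from your goal, format it first, e.g. with base R's sprintf() (a small sketch):
import <- file.path("c:/folderX", year, sprintf("%02d", month), "folder")
#> [1] "c:/folderX/2021/01/folder"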

You can also use the glue package to generate strings from templates:
# Parameters to be personalized by the user
library(glue)
year <- 2021
month <- 01
# File directory
import <- glue(
  'c:/folderX/{year}/{month}/folder',
  month = sprintf('%02d', month),
  year = year
)
export <- glue(
  'c:/folderX/{year}/{month}/folder1',
  month = sprintf('%02d', month),
  year = year
)
# glue can also take variables from the environment:
export <- glue(
  'c:/folderX/{year}/{month}/folder1'
)
# in this case, just make sure the month variable is already a string in the correct format
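With year <- 2021 and month <- 01 as above, the first call then produces the zero-padded path from your goal:
import
#> c:/folderX/2021/01/folder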

Related

Using purrr to iterate over a list of file directories to read the max subfolder where the file exists

The code below seeks to find the most recent year within workbook_dir, then the most recent month, and then read the Excel file with the readxl command at the end. The issue I'm having is when the file doesn't exist but the subfolder has already been created. How would I change the code below to find the most recent file path where the referenced file actually exists? I would imagine some form of purrr? So in this case, if the file doesn't exist in 9, then iterate to check whether it exists in 8. If it doesn't exist in subfolder 1 within 2021, then check 12 within 2020. Code that gets me most of the way there is below.
library(tidyverse)  # for str_extract(), parse_number(), and %>%

# path for workbook log files
workbook_dir <- "Z:/Ac/Iron/STAR"
# list files in the directory
year_dirs <- fs::dir_ls(workbook_dir)
# Find the latest year and get that directory
years <- str_extract(year_dirs, '\\d{4}/?$') %>% parse_number()
latest_year_dir <- year_dirs[which.max(years)]
# Repeat for month (\\d{1,2} so unpadded months like "9" also match)
month_dirs <- fs::dir_ls(latest_year_dir)
months <- str_extract(month_dirs, '\\d{1,2}/?$') %>% parse_number()
latest_month_dir <- month_dirs[which.max(months)]
# read excel file
file <- readxl::read_excel(file.path(latest_month_dir, "Working", "Group Workbook Log.xlsx"))
I'll try to answer with a little pseudocode here. I'm going to assume your directories are in the format year/month/file (e.g. 2020/01/File.xlsx)
library(tidyverse)
library(stringr)
# get a list of all files in all subdirectories - working directory
# should be the parent folder of the folder(s) containing years
# or give that path directly to list.files()
df <- list.files(recursive = TRUE) %>% data.frame(paths = .)
# filter to .xlsx files
most_recent_file <- df %>%
  filter(grepl("\\.xlsx$", paths)) %>%
  # extract Year/Month then create a Date column
  # (the year may sit at the start of a relative path, so don't require a leading slash)
  mutate(
    Year = str_extract(paths, "\\d{4}(?=\\/)"),
    Month = str_extract(paths, "(?<=\\/)\\d{2}(?=\\/)"),
    Date = as.Date(paste(Year, Month, "01", sep = "-"))
  ) %>%
  # filter to the most recent date
  filter(Date == max(Date)) %>%
  # extract the relevant path as a string
  pull(paths)
At which point we will have filtered df to (assuming there's only one file per month) a single file path which can then be read into R.
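The existence check itself could be folded in with purrr::detect(), which returns the first element that satisfies a predicate. A sketch (untested; it assumes workbook_dir as defined in the question and the same year/month layout, with possibly unpadded month folders):
library(fs)
library(stringr)
library(purrr)

# all year/month directories, e.g. .../2021/9
month_dirs <- dir_ls(workbook_dir, recurse = 1, type = "directory",
                     regexp = "\\d{4}/\\d{1,2}$")
# order newest first by numeric year, then numeric month
yr <- as.numeric(str_extract(month_dirs, "\\d{4}(?=/\\d{1,2}$)"))
mo <- as.numeric(str_extract(month_dirs, "\\d{1,2}$"))
ordered <- month_dirs[order(-yr, -mo)]
# take the first candidate whose workbook actually exists
target_file <- detect(
  file.path(ordered, "Working", "Group Workbook Log.xlsx"),
  file.exists
)
file <- readxl::read_excel(target_file)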

apply inside apply function?

I have a data frame with the start and end of each month of the year 2019.
I need to fetch data from an API and write a CSV file named mydf plus the month (e.g. mydf-01.csv, mydf-02.csv, etc.).
I need to fetch the data, write the CSV, clean up memory to avoid the "not enough memory" error, and continue with the next month.
For now I have the code below, but it gives the error "not enough memory", because the expected data for all of 2019 is around 3 GB.
I was thinking of writing a for loop. But maybe I can use another apply-family function?
Months: my_dates data.frame
This is how it looks:
from         to
2019-01-01   2019-01-31
2019-02-01   2019-02-28
2019-03-01   2019-03-31
...
Code to generate the 12 months:
som <- function(x) as.Date(cut(as.Date(x), "month")) # start of month
eom <- function(x) som(som(x) + 32) - 1 # end of month
month_ranges <- function(from, to) {
  s <- seq(som(from), as.Date(to), "month")
  data.frame(from = pmax(as.Date(from), s), to = pmin(as.Date(to), eom(s)))
}
my_dates <- month_ranges(som("2019-01-01"), eom("2019-12-31"))
Code to fetch data:
Currently it fetches all months, holds them in memory, and rbinds them together at the end. However, this approach gives an error when the range of months is too large, because the data exceeds 2 GB. So I'd like it to save each month's data to a CSV and then continue to the next month.
library(googleAuthR)
library(googleAnalyticsR)
my_fetch <- function(ga_id, d1, d2) {
  google_analytics(ga_id,
                   date_range = c(d1, d2),
                   metrics = c("totalEvents"),
                   dimensions = c("ga:date", "ga:eventCategory", "ga:eventAction", "ga:eventLabel"),
                   anti_sample = TRUE,
                   anti_sample_batches = 1,
                   rows_per_call = 400)
}
my_fetches <- mapply(my_fetch, myviewID, my_dates$from, my_dates$to, SIMPLIFY = FALSE)
total <- do.call(rbind, my_fetches)
UPDATE 1:
Maybe it is also possible to make the loop skip an iteration that raises an error, such as an API timeout, and continue with the next month?
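A minimal sketch of that idea (untested; it assumes my_dates, my_fetch, and myviewID as defined above): fetch one month at a time, write it straight to a mydf-XX.csv file, free the memory, and let tryCatch() skip any month that errors out, e.g. on an API timeout:
for (i in seq_len(nrow(my_dates))) {
  res <- tryCatch(
    my_fetch(myviewID, my_dates$from[i], my_dates$to[i]),
    error = function(e) NULL  # e.g. an API timeout: skip this month
  )
  if (!is.null(res)) {
    write.csv(res, sprintf("mydf-%02d.csv", i), row.names = FALSE)
  }
  rm(res)
  gc()  # release memory before fetching the next month
}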

Iterate import of excel files and averaging matched values by file name in R

I have a folder containing 630 excel files, all with similar file names. Each file represents climate data in specific geographic areas for a month of a specific year. My goal is to find a way to iterate my importing of these files and find the average of values for specific variables. All files are titled as such:
PRISM_ppt_stable_4kmM3_201201_bil
where "ppt" represents climate variable the data is about, "2012" represents the year 2012 and "01" represents the month of January. The next file in the folder is titled:
PRISM_ppt_stable_4kmM3_201202_bil
where "ppt" represents the same variable,"2012" again represents the year 2012 and "02" this time represents the month of February. These repeat for every month of every year and for 7 different variables. The variables are titled:
ppt, vpdmax, vpdmin, tmax, tmin, tdmean, tmean
Each Excel file contains >1500 observations of 11 variables, where I am interested in finding the average of the MEAN variable among all matching tl_2016_us values. Some quick sample data is shown below:
tl_2016_us   MEAN
14136        135.808
14158        132.435
etc.         etc.
It gets tricky in that I only wish to find my averages over a designated winter season, in this case November through March. So all files with 201211, 201212, 201301, 201302 and 201303 in the file name should be matched by tl_2016_us and the corresponding MEAN variables averaged. Ideally, this process would repeat to the next year of 201311, 201312, 201401, 201402, 201403. To this point, I have used
list.files(path = "filepath", pattern = "ppt")
to create lists of my filenames for each of the 7 variables.
I don't really get what the "tl_2016_us" variables are/mean.
However, you can easily get the list of only winter months using a bit of regular expressions like so:
library(tidyverse)
# Assuming your files are already in your working directory
all_files <- list.files(full.names = TRUE, pattern = "ppt")
# a character class like [01, 02, ...] matches single characters, so use alternation instead
winter_mos <- str_subset(all_files, "(01|02|03|11|12)_\\w{3}$")
After that, you can iterate reading in all files into a data frame with map() from purrr:
library(readxl)
data <- map(winter_mos, ~ read_xlsx(.x)) %>% bind_rows(.id = "id")
After that, you should be able to select the variables you want, use group_by() to group by id (i.e. id of each Excel file), and then summarize_all(mean)
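That last step might look like this (a sketch, assuming the column names from the question and that one average per tl_2016_us is the goal):
data %>%
  group_by(tl_2016_us) %>%
  summarise(MEAN = mean(MEAN))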
Maybe something like (not very elegant):
library(readxl)

filetypes <- c("ppt", "vpdmax", "vpdmin", "tmax", "tmin", "tdmean", "tmean")
data_years <- c(2012, 2013, 2014)
df <- NULL
for (i in 1:length(data_years)) {
  yr <- data_years[i]
  datecodes <- c(paste(yr, "11", sep = ""),
                 paste(yr, "12", sep = ""),
                 paste(yr + 1, "01", sep = ""),
                 paste(yr + 1, "02", sep = ""),
                 paste(yr + 1, "03", sep = ""))
  for (j in 1:length(filetypes)) {
    filetype <- filetypes[j]
    file_prefix <- paste("PRISM", filetype, "stable_4kmM3", sep = "_")
    for (k in 1:length(datecodes)) {
      datecode <- datecodes[k]
      filename <- paste(file_prefix, datecode, "bil", sep = "_")
      dk <- read_excel(filename)
      M <- dim(dk)[1]
      dk$RefYr <- rep(yr, M)
      dk$DataType <- rep(filetype, M)
      if (is.null(df)) {  # first file read: start the data frame
        df <- dk
      } else {
        df <- rbind(df, dk)
      }
    }
  }
}
Once that has run, you will have a data frame containing all the data you need to compute your averages (I think).
You could then do something like:
df_new <- NULL
for (i in 1:length(data_years)) {
  yr <- data_years[i]
  di <- df[df$RefYr == yr, ]
  for (j in 1:length(filetypes)) {
    filetype <- filetypes[j]
    dj <- di[di$DataType == filetype, ]
    tls <- unique(dj$tl_2016_us)
    for (k in 1:length(tls)) {
      tl <- tls[k]
      dk <- dj[dj$tl_2016_us == tl, ]
      dijk <- data.frame(RefYr = yr, TL2016 = tl, DataType = filetype,
                         SeasonAverage = mean(dk$MEAN))
      if (is.null(df_new)) {  # first result: start the output data frame
        df_new <- dijk
      } else {
        df_new <- rbind(df_new, dijk)
      }
    }
  }
}
I'm sure there are more elegant ways to do it, and there may be bugs in the above since I couldn't actually run the code, but I think you should be left with a data frame containing what you are looking for.
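For what it's worth, one of those more elegant ways could be a single grouped summary over the combined df built above (a sketch using the same column names):
library(dplyr)
df_new <- df %>%
  group_by(RefYr, DataType, tl_2016_us) %>%
  summarise(SeasonAverage = mean(MEAN), .groups = "drop")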

run an r script from within an r script and change some values

I have an R script that I would like to run to import some data. The R script is called load_in_year1-4. I have designed this script so that I only have to make 3 changes to the code at the top and everything will run and import the correct data files.
The 3 changes are;
year <- "Year4"
weeks <- "1270_1321"
product <- "/cookies"
However, I have 20 years' worth of data and more than 50 products.
I am currently changing the top of each file manually and running it, so the data currently has no errors.
What I would like to do is to create a separate R script which will run the current script.
I would like to have something like
year <- c("year1", "year2", "year3"....)
weeks <- c("1270_1321", "1321_1327"....)
product <- c("product1", "product2"....)
So it will take year1, weeks 1270_1321, and product1, call them year, weeks, and product, and run the R script which I have created.
Is there a grid function anybody can suggest?
EDIT: I have something like the following
#Make changes here
year <- "Year11"
weeks <- "1635_1686"
product <- "/cigets"
# year1: "1114_1165", year2: "1166_1217", year3: "1218_1269"
#Does not need changing
files <- gsub("Year1", as.character(year), "E:/DATA/Dataset/Year1")
parsedstub <- "E:/DATA/Dataset/files/"
produc <- paste0("prod", gsub("/", "_", as.character(product)))
drug <- "_drug_"
groc <- "_groc_"
####################Reading in the data###########################################
drug <- read.table(paste0(files, product, product, drug, weeks), header = TRUE)
groc <- read.table(paste0(files, product, product, groc, weeks), header = TRUE)
To make a function out of your script, do something like this:
get.tables <- function(year, weeks, product) {
  files <- gsub("Year1", as.character(year), "E:/DATA/Dataset/Year1")
  parsedstub <- "E:/DATA/Dataset/files/"
  product <- paste0("prod", gsub("/", "_", as.character(product)))
  drug <- "_drug_"
  groc <- "_groc_"
  ####################Reading in the data###########################################
  drug <- read.table(paste0(files, product, product, drug, weeks), header = TRUE)
  groc <- read.table(paste0(files, product, product, groc, weeks), header = TRUE)
  list(drug = drug, groc = groc)
}
Then you could use something in the apply family to apply this function to different years, weeks, and products.
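For example, Map() walks several vectors in parallel, passing the i-th year, weeks, and product to each call (a sketch using the example values from the question; the result is a list of drug/groc pairs):
years    <- c("year1", "year2")
weeks    <- c("1270_1321", "1321_1327")
products <- c("product1", "product2")

results <- Map(get.tables, years, weeks, products)
# e.g. results[[1]]$drug is the drug table for year1 / weeks 1270_1321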

dynamic date in R to insert in url

I'm working in Power BI and am using an R script to insert a dynamic date into a URL.
The URL in question is as follows:
Tests <- read.csv(file=paste0("https://www,testtest/getCsv?startdate_Day=1&startdate_Month=01&startdate_Year=2016"), header=TRUE, sep=";")
My problem is that the day, month, and year values always have to be the current date.
I have tried, for example:
Year<-Sys.Date() format(Year, format="%Y")
Month<-Sys.Date() format(Month, format="%m")
Day<-Sys.Date() format(Day, format="%d")
<- read.csv(file=paste0("read.csv(file=paste0("https://www,testtest/getCsv??startdate_Day=", Day,"&startdate_Month=", Month,"&startdate_Year=", Year"), header=TRUE, sep=",")
But then all three parameters end up filled with the full date:
startdate_Day=2016-08-12&startdate_Month=2016-08-12&startdate_Year=2016-08-12
My question is: how do I fill startdate_Day, startdate_Month, and startdate_Year automatically with the current day, month, and year?
If that's the exact code you're running, then you have a few problems:
As noted, you have a comma in your URL (www,testtest), do you mean to?
Your calls to Sys.Date() and format() aren't written properly.
Try this:
# Get the date parts we need
Year <-format(Sys.Date(), format="%Y")
Month <- format(Sys.Date(), format="%m")
Day <- format(Sys.Date(), format="%d")
# Create the file string
file_string <- paste0("https://www.testtest/getCsv?startdate_Day=", Day, "&startdate_Month=", Month, "&startdate_Year=", Year)
# Run read.csv
Tests <- read.csv(file_string, header = TRUE, sep = ";")
