Looping through and merging files in a folder in R

I am trying to loop through hundreds of weather data files (.nc) and merge them. I can load and merge two of them manually using:
library(raster)
library(ncdf4)
library(ncdf4.helpers)
require(data.table)
#define input paths, load data, then merge
baseline_path_file <- "E:/input_data/HadOBS/tas/tas_hadukgrid_uk_1km_mon_202001-202012.nc"
baseline_path_file2 <- "E:/input_data/HadOBS/tas/tas_hadukgrid_uk_1km_mon_201901-201912.nc"
BASELINE <- setDT(as.data.frame(brick(baseline_path_file), xy = T))
BASELINE2 <- setDT(as.data.frame(brick(baseline_path_file2), xy = T))
combined <- merge(BASELINE, BASELINE2, by = c("x","y"))
but what I would like to do is define the list of files in a folder and merge them all automatically.
e.g.
library(fs)
files <- dir_ls("E:/input_data/HadOBS/tas")
combined2 <- map(files, brick) %>%
  as.data.frame %>%
  setDT %>%
  reduce(inner_join, by = c("x", "y"))
but that obviously isn't working... I can't seem to get the piping in the right order. Any ideas how to get this right? Many thanks indeed.

The problem appears to be that you are using map() only to apply brick() to your list elements, and not also as.data.frame() and setDT().
Lacking the data, I didn't run this code, so it might not work, but you get the idea:
combined2 <- map(files,
                 ~ .x %>%
                   brick() %>%
                   as.data.frame(xy = TRUE) %>%  # keep x/y so the join keys exist
                   setDT()
                 ) %>%
  reduce(inner_join, by = c("x", "y"))
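Note that map() and reduce() come from purrr (and inner_join() from dplyr), neither of which is loaded in the original snippet. A minimal self-contained sketch of the whole pipeline, not run here for lack of the .nc data, and using merge() on the data.tables as in the manual version:
library(raster)
library(ncdf4)
library(data.table)
library(purrr)
library(fs)

files <- dir_ls("E:/input_data/HadOBS/tas")

# Convert each .nc file to a data.table keeping its x/y coordinates,
# then join them all on the grid coordinates
combined2 <- map(files, ~ setDT(as.data.frame(brick(.x), xy = TRUE))) %>%
  reduce(merge, by = c("x", "y"))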

Related

A for() loop to overwrite existing data.frames

I'm testing some operations with the classic '99 Czech bank data set, trying to execute a tidyverse task on several data frames in my global environment, but the loop I've created keeps overwriting the same object, val, which was supposed to stand in for the data frames themselves:
x <- c("loans93","loans94","loans95","loans96",
"loans97","loans98")
x <- base::mget(x, envir=as.environment(-1), mode= "any", inherits=F)
for (val in x) {
val <- val %>%
select(account_id, district_id, balance, status, date) %>%
group_by(account_id, district_id, status, date) %>%
summarise(balance=mean(balance, na.rm=T)) %>%
ungroup()
}
What am I doing wrong? I've searched for similar questions, but people keep answering with lapply() solutions; I just need the result to be saved back to my data frames instead of to this val object I keep getting.
You could try this:
df_names <- c("loans93", "loans94", "loans95",
              "loans96", "loans97", "loans98")

for(df_name in df_names){
  get(df_name) %>%
    head %>% ## replace with desired manipulations
    assign(value = .,
           x = paste0(df_name, '_manipulated'), ## or just: df_name to overwrite the original
           envir = globalenv())
}
Aside: list2env() is handy for "spawning" list members (e.g. data frames) into an environment. Example:
list_of_dataframes <- list(
  iris_short = head(iris),
  cars_short = head(cars)
)
list2env(x = list_of_dataframes, envir = globalenv())
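If the goal is to write the summarised tables back over the original data frames (rather than create *_manipulated copies), one alternative along the lines of the lapply answers the asker mentions, untested against the actual loans data, is to do the manipulation on the list returned by mget() and push it back with list2env():
library(dplyr)

df_names <- c("loans93", "loans94", "loans95",
              "loans96", "loans97", "loans98")

# Summarise each data frame in the named list, then write the results
# back into the global environment under their original names
mget(df_names) %>%
  lapply(function(df) {
    df %>%
      select(account_id, district_id, balance, status, date) %>%
      group_by(account_id, district_id, status, date) %>%
      summarise(balance = mean(balance, na.rm = TRUE)) %>%
      ungroup()
  }) %>%
  list2env(envir = globalenv())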

Add filename to a new column when using map_df in r

Is there a quick and easy way, using dplyr, to add a column called 'site_id' that populates each row with the number taken from the file name, when using map_df from the purrr package to bring the data into one data frame?
For example, my.files will pick up two csv files:
"H:/Documents/2015.csv" and "H:/Documents/2021.csv"
my.files <- list.files(my.path, pattern = "*.csv", full.names = TRUE)
I then use map_df to bring all the data into one data frame, but would like to create an additional column called 'site_id' that will populate each row from a given file with its original file title, e.g. 2015 or 2021.
I currently merge the .csv files together with this code:
temp.df <- my.files %>% map_df(~read.csv(., skip = 15))
I envisage using mutate to help, but am unsure how it would work...
temp.df <- my.files %>%
  map_df(~ read.csv(., skip = 15) %>%
           mutate(site_id = ????))
Any help is much appreciated.
We may use imap if we want to use mutate
library(dplyr)
library(purrr)
setNames(my.files, my.files) %>%
  imap_dfr(~ read.csv(.x, skip = 15) %>%
             mutate(site_id = .y))
Or specify .id in map_dfr:
setNames(my.files, my.files) %>%
  map_dfr(read.csv, skip = 15, .id = "site_id")
Using purrr & dplyr:
temp.df <- my.files %>%
  purrr::set_names() %>%
  purrr::map(., ~ read.csv(., skip = 15)) %>%
  dplyr::bind_rows(.id = "site_id")
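Since the question asks for site_id values like 2015 or 2021 rather than the full path, one option (an assumption about the desired output, not part of the answers above) is to name each path by its bare file name first, e.g. with basename() and tools::file_path_sans_ext():
library(dplyr)
library(purrr)

# Name each path by its file name without directory or extension,
# so .id picks up "2015" / "2021" instead of the full path
site_ids <- tools::file_path_sans_ext(basename(my.files))

temp.df <- my.files %>%
  setNames(site_ids) %>%
  map_dfr(read.csv, skip = 15, .id = "site_id")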

Pass multiple functions (each applied once) using purrr

I have a folder with different files, each with a different format, so I created different functions able to read each of the files. Is it possible to use map to apply the corresponding function to the corresponding file?
I found this post on applying several functions to the same object, but I don't think it is applicable in this case, since there all of the functions are always applied.
all_files <- list.dirs(file.path(path))
fun_A <- function(x) {read.csv(x)}
fun_B <- function(x) {read.table(x)}
fun_C <- function(x) {read.delim(x)}
funs <- c(fun_A, fun_B, fun_C)
So, if I do it manually it works:
(all_files %>%
   purrr::map(., ~ list.files(., full.names = TRUE)))[[1]][1] %>%
  fun_A() %>%
  dplyr::bind_rows(
    (all_files %>%
       purrr::map(., ~ list.files(., full.names = TRUE)))[[1]][2] %>% fun_B()
  ) %>%
  dplyr::bind_rows(
    (all_files %>%
       purrr::map(., ~ list.files(., full.names = TRUE)))[[1]][3] %>% fun_C()
  )
But I tried several times with purrr and I am not able to make it work. This is my final attempt:
all_files %>% purrr::map(.x = ., ~ {
  df = (.x)
  funs %>% purrr::map(., ~ df %>% (.))
})
Any suggestions?
You can use Map() or map2() as suggested by @akrun:
do.call(rbind, Map(function(x, y) y(x), all_files, funs))
Using map2_df:
purrr::map2_df(all_files, funs, ~.y(.x))
For this to work, all_files and funs are expected to be of equal length.
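Putting it together, a minimal sketch of the map2 approach applied to the files inside the first directory, assuming (as in the manual version) that the three files come back from list.files() in the same order as fun_A, fun_B and fun_C:
library(purrr)
library(dplyr)

# Files inside the first directory, in the same order as funs
first_dir_files <- list.files(all_files[[1]], full.names = TRUE)

combined <- map2_df(first_dir_files, funs, ~ .y(.x))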

How to make this work without a global variable in R?

I am parsing some metadata-containing JSON files into similar data frames, using tidyjson. I finally made it work like this:
append_arrays_and_objects <- function(tbl) {
  objs <- tbl %>%
    filter(is_json_object(.)) %>% gather_object %>%
    append_values_string
  arr <- tbl %>%
    filter(is_json_array(.)) %>% gather_array %>%
    append_values_string
  if (nrow(objs) > 0) append_arrays_and_objects(objs)
  if (nrow(arr) > 0) append_arrays_and_objects(arr)
  print(objs)
  print(arr)
  res1 <- merge(objs, arr, all = TRUE)
  result <<- merge(result, res1, all = TRUE)
  result
}
# parse microdata
result <- data.frame()
md <- dataHighest$JSON %>%
  enter_object(microdata) %>%
  append_arrays_and_objects
rm(result)
It just bothers me that I can't make it work without the global data frame result. When I tried returning any combination of local data frames, I always ended up with only the "first level" data frame. I think that once all the data has been collected, I can't seem to pass it back up anymore. This should be trivial to solve?
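Lacking the JSON data this is untested, but one sketch of a fix is to let the function merge the results of its own recursive calls and return that, instead of accumulating into a global result:
append_arrays_and_objects <- function(tbl) {
  objs <- tbl %>%
    filter(is_json_object(.)) %>% gather_object %>%
    append_values_string
  arr <- tbl %>%
    filter(is_json_array(.)) %>% gather_array %>%
    append_values_string

  # Start from what was found at this level ...
  res <- merge(objs, arr, all = TRUE)

  # ... and merge in whatever the recursive calls find deeper down,
  # returning the combined result instead of writing to a global
  if (nrow(objs) > 0) res <- merge(res, append_arrays_and_objects(objs), all = TRUE)
  if (nrow(arr) > 0) res <- merge(res, append_arrays_and_objects(arr), all = TRUE)
  res
}

md <- dataHighest$JSON %>%
  enter_object(microdata) %>%
  append_arrays_and_objects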

How can I read in and bind together several data frames without using a for loop?

for() loops in R always seem to be my go-to for reading in multiple things, but there must be a better way of doing what I want to do.
Let's say I have several data sets that all come from an automated data-pull system:
# Fake data set up
library(lubridate)
library(dplyr)

dir_path <- tempdir()
file_1 <- paste0(Sys.Date() - days(2), ".rds")
file_2 <- paste0(Sys.Date() - days(1), ".rds")
file_3 <- paste0(Sys.Date(), ".rds")
data.frame(thing = rnorm(100)) %>%
saveRDS(file.path(dir_path, file_1))
data.frame(thing = rnorm(100)) %>%
saveRDS(file.path(dir_path, file_2))
data.frame(thing = rnorm(100)) %>%
saveRDS(file.path(dir_path, file_3))
I want to read each of these into my R session, do a small bit of processing to each, then stick them all into the same data frame:
read_in_data <- function(file_name, dir){
  d <- substr(file_name, 1, 10)
  thing <-
    readRDS(file.path(dir, file_name)) %>%
    mutate(date = d)
}
files <- list.files(dir_path, pattern = "^2017-1")
this_thing <- NULL

for(i in 1:length(files)){
  this_thing <-
    this_thing %>%
    bind_rows(read_in_data(files[i], dir_path))
}
This is great and does exactly what I want, but I have the sneaking suspicion that, as the number of files I want to read in and bind together grows, the for() loop will end up being very slow.
I could do something like
this_thing <-
  read_in_data(files[1], dir_path) %>%
  bind_rows(read_in_data(files[2], dir_path)) %>%
  bind_rows(read_in_data(files[3], dir_path))
but this is gross and will be impossible to maintain, especially as the number of files I want to read in grows.
How can I get rid of this for loop? I know that growing things in a for() loop is a bad idea but I don't know how else to do this kind of operation. What am I missing? Probably something pretty simple.
I ended up using the purrr package:
library(purrr)
library(dplyr)

files %>%
  map(safely(read_in_data, quiet = FALSE), dir = dir_path) %>%
  transpose() %>%
  simplify_all() %>%
  .$result %>%
  bind_rows() %>%
  saveRDS(file.path("path to .rds file"))
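If the error handling from safely() isn't needed, a shorter alternative (assuming the same read_in_data() helper and dir_path from the question) is map_dfr(), which reads and row-binds in one step:
library(purrr)
library(dplyr)

# Read each file, tag it with its date, and row-bind the results
this_thing <- map_dfr(files, read_in_data, dir = dir_path)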
