This is a tricky dplyr & purrr question
I want to simplify the following code into one dplyr pipe:
filenames <- list.files(path = data.location, pattern = "*.csv") %>%
  map_chr(function(name) gsub('(.*)\\.csv', '\\1', name))

files.raw <- list.files(path = data.location, pattern = "*.csv", full.names = TRUE) %>%
  map(read_csv) %>%
  setNames(filenames)
I tried to collapse this into one pipe, but it failed: read_csv() needs the file names with their full paths (full.names = TRUE), while I want to assign the names without the path.
In other words, this worked - but only with full path in filenames:
list.files(path = data.location, pattern = "*.csv", full.names = TRUE) %>%
  { . ->> filenames } %>%
  map(read_csv) %>%
  setNames(filenames)
but this didn't:
list.files(path = data.location, pattern = "*.csv", full.names = TRUE) %>%
  { map_chr(., function(name) gsub(paste0(data.location, '/(.*).csv'), '\\1', name)) ->> filenames } %>%
  map(read_csv) %>%
  setNames(filenames)
Is there a way to make the map_chr() call work with the in-pipe save (->> filenames), or is there an even simpler way that avoids the temporary variable (filenames) altogether?
To do it in one pipeline without intermediate variables, and similar to #Ronak Shah's answer, why not set the names first and then read in the CSVs? Ronak nests the setNames() call, but it can be put into the pipeline to make it more readable:
library(tidyverse)
list.files(path = data.location, pattern = "*.csv", full.names = TRUE) %>%
  setNames(., sub("\\.csv$", "", basename(.))) %>%
  map(read_csv)
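This works because map() preserves the names of its input, so the names set on the vector of file paths carry straight through to the resulting list of data frames.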
We can do this with only tidyverse functions:
library(readr)
library(purrr)
library(dplyr)
library(stringr)

all_files <- list.files(path = data.location, pattern = "\\.csv$", full.names = TRUE)

map(all_files, read_csv) %>%
  set_names(str_remove(basename(all_files), "\\.csv$"))
Try using this method:
all_files <- list.files(path = data.location, pattern = "*.csv", full.names = TRUE)

purrr::map(all_files, readr::read_csv) %>%
  setNames(sub("\\.csv$", "", basename(all_files)))
Here, we first get the complete path of all files and use it to read them with read_csv(). We then use basename() to keep only the file name, strip the ".csv" extension with sub(), and assign the names using setNames().
To do this in one pipe, we can do:
list.files(path = data.location, pattern = "*.csv", full.names = TRUE) %>%
  { setNames(map(., read_csv), sub("\\.csv$", "", basename(.))) }
Inspired by
- the answer below by #Ronak Shah, and
- the intermediary assignment suggested by #G. Grothendieck here,
I put together the following solution, which
- combines the reading and naming of files into one dplyr pipe (my initial goal), and
- makes the code more intuitive to read (my always implicit goal) - important for collaborators or in general.
So here's my one-in-all dplyr pipe:
list.files(path = data.location, pattern = "*.csv", full.names = TRUE) %>%
  { filenames <<- . } %>%
  map(read_csv) %>%
  setNames(filenames %>% basename() %>% map(~gsub('(.*)\\.csv', '\\1', .x)))
The only tricky part in the above solution is the use of the scoping assignment operator <<-, as <- would not work. If you want to use the latter, you can do so by putting the whole second section into braces - also suggested by #G. Grothendieck in the post linked above:
list.files(path = data.location, pattern = "*.csv", full.names = TRUE) %>%
  {
    { filenames <- . } %>%
      map(read_csv) %>%
      setNames(filenames %>% basename() %>% map(~gsub('(.*)\\.csv', '\\1', .x)))
  }
Related
I discovered R a couple of years ago, and it has been very handy for cleaning up dataframes, preparing data, and handling other basic tasks.
Now I would like to try using R to apply basic treatments to many different files stored in different folders at once.
Here is the script I would like to turn into one function that loops through my folders "dataset_2006" and "dataset_2007" and does all the work.
library(dplyr)
library(readr)
library(sf)
library(purrr)

setwd("C:/Users/Downloads/global_data/dataset_2006")

shp2006 <- list.files(pattern = 'data_2006.*\\.shp$', full.names = TRUE)
listOfShp <- lapply(shp2006, st_read)
combinedShp <- do.call(what = sf:::rbind.sf, args = listOfShp)

# import and merge CSV files into one data frame
folderfiles <- list.files(pattern = 'csv_2006_.*\\.csv$', full.names = TRUE)
csv_data <- folderfiles %>%
  set_names() %>%
  map_dfr(.f = read_delim, delim = ";", .id = "file_name")

new_shp_2006 <- merge(combinedShp, csv_data, by = "ID") %>%
  filter(label %in% c("AR45T", "GK879"))
st_write(new_shp_2006, "new_shp_2006.shp", overwrite = TRUE)

setwd("C:/Users/Downloads/global_data/dataset_2007")

shp2007 <- list.files(pattern = 'data_2007.*\\.shp$', full.names = TRUE)
listOfShp <- lapply(shp2007, st_read)
combinedShp <- do.call(what = sf:::rbind.sf, args = listOfShp)

# import and merge CSV files into one data frame
folderfiles <- list.files(pattern = 'csv_2007_.*\\.csv$', full.names = TRUE)
csv_data <- folderfiles %>%
  set_names() %>%
  map_dfr(.f = read_delim, delim = ";", .id = "file_name")

new_shp_2007 <- merge(combinedShp, csv_data, by = "ID") %>%
  filter(label %in% c("AR45T", "GK879"))
st_write(new_shp_2007, "new_shp_2007.shp", overwrite = TRUE)
This is easy to achieve with a for loop that iterates over multiple items. To allow wildcards, we can also add the function Sys.glob():
myfunction <- function(directories) {
  for (dir in Sys.glob(directories)) {
    # do something with a single dir
    print(dir)
  }
}

# you can specify multiple directories manually:
myfunction(c('C:/Users/Downloads/global_data/dataset_2006',
             'C:/Users/Downloads/global_data/dataset_2007'))

# or use a wildcard to automatically get all files/directories that match the pattern:
myfunction('C:/Users/Downloads/global_data/dataset_200*')
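If it helps, here is a minimal, untested sketch of how the repeated per-year script from the question could move inside that loop. It reuses the file patterns, the ID join column, and the label filter from your script; extracting the year from the folder name with str_extract() is an assumption about the naming scheme (dataset_2006, dataset_2007, ...):

library(dplyr)
library(readr)
library(sf)
library(purrr)
library(stringr)

process_dataset <- function(directories) {
  for (dir in Sys.glob(directories)) {
    # assumption: the folder name ends in the four-digit year, e.g. dataset_2006
    year <- str_extract(basename(dir), "\\d{4}")

    # combine all shapefiles for that year
    shp_files <- list.files(dir, pattern = paste0("data_", year, ".*\\.shp$"),
                            full.names = TRUE)
    combinedShp <- do.call(sf:::rbind.sf, lapply(shp_files, st_read))

    # import and merge the CSV files into one data frame
    csv_files <- list.files(dir, pattern = paste0("csv_", year, "_.*\\.csv$"),
                            full.names = TRUE)
    csv_data <- csv_files %>%
      set_names() %>%
      map_dfr(.f = read_delim, delim = ";", .id = "file_name")

    # merge, filter, and write the result into the same folder
    merge(combinedShp, csv_data, by = "ID") %>%
      filter(label %in% c("AR45T", "GK879")) %>%
      st_write(file.path(dir, paste0("new_shp_", year, ".shp")))
  }
}

process_dataset('C:/Users/Downloads/global_data/dataset_200*')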
I'm trying to read in every file from a directory, but prior to binding the data.frames to each other with map_df, I want to create a column that stores the file name for each data.frame.
Here are various failed attempts:
data <-
  list.files(path, pattern = "*.csv") %>%
  map_df(~fread(paste0(path, .)), ~mutate(file_name = .))

data <-
  list.files(path, pattern = "*.csv") %>%
  map_df(~fread(paste0(path, .))) %>%
  mutate(file_name = .)

data <-
  list.files(path, pattern = "*.csv") %>%
  map_df(~fread(paste0(path, .))) %>%
  mutate(~file_name = .)
However, I can't seem to figure out how to access the dot operator in the mutate function to get the file name. Any ideas?
map_dfr() is designed for this. It has an argument .id that creates a new column storing either the name (if .x is named) or the index (if .x is unnamed) of each input.
data <- list.files(path, pattern = "*.csv") %>%
  set_names() %>%
  map_dfr(~ fread(paste0(path, .x)), .id = "file_name")
set_names() sets the names of a vector and, by default, names each element after itself. E.g.
c("file_1.csv", "file_2.csv") %>% set_names()
#   file_1.csv   file_2.csv   (name)
# "file_1.csv" "file_2.csv"   (value)
I am trying to merge various .csv files into one dataframe using the following:
df <- list.files(path = "C:/Users...", pattern = "*.csv", full.names = TRUE) %>%
  lapply(read_csv) %>%
  bind_rows()
However, I get an error saying that I can't combine X character variable and X double variable.
Is there a way I can transform one of them to a character or double variable?
Since each csv file is slightly different, from my beginner's standpoint I believe that lapply would be best in this case, unless there is an easier way around this.
Thank you everyone for your time and attention!
You can change the X variable to character in all the files. You can also use map_df to combine all the files into one dataframe.
library(tidyverse)

result <- list.files(path = "C:/Users...", pattern = "*.csv", full.names = TRUE) %>%
  map_df(~read_csv(.x) %>% mutate(X = as.character(X)))
If more columns have type-mismatch issues, you can change all the columns to character, combine the data, and then use type_convert() to restore their classes:
result <- list.files(path = "C:/Users...", pattern = "*.csv", full.names = TRUE) %>%
  map_df(~read_csv(.x) %>% mutate(across(.fns = as.character))) %>%
  type_convert()
If all files have the same number of columns, then try plyr::rbind.fill instead of dplyr::bind_rows:
list.files(path = "C:/Users...", pattern = "*.csv", full.names = TRUE) %>%
  lapply(read_csv) %>%
  plyr::rbind.fill()
I need to open 100 large ndjson files (all with the same columns),
and I have prepared a script to apply to each file, but I would not like to repeat this 100 times!
With ndjson::stream_in, I can only open one ndjson file into R as a data frame at a time.
I tried the process for opening multiple csv files and consolidating them into a single dataframe, but it does not work with ndjson files :(
library(data.table)
library(purrr)

map_df_fread <- function(path, pattern = "*.ndjson") {
  list.files(path, pattern, full.names = TRUE) %>%
    map_df(~fread(., stringsAsFactors = FALSE))
}

myfiles <-
  list.files(path = "C:/Users/sandrine/Documents/Projet/CAD/A/",
             pattern = "*.ndjson",
             full.names = TRUE) %>%
  map_df_fread(~fread(., stringsAsFactors = FALSE))
I also tried to find a package to convert ndjson files into csv ... but did not find any.
Any idea?
Using your own approach that you mentioned first, does this work?
library(tidyverse)
library(ndjson)

final_df <-
  list.files(path = "C:/Users/sandrine/Documents/Projet/CAD/A/",
             pattern = "*.ndjson",
             full.names = TRUE) %>%
  map_dfr(~stream_in(.))
I have multiple folders with files named numerically (e.g. 12345.in). I am trying to write a function that will list the nearest file if the requested file is not in the folder.
soili <- 371039  # this is the file name

Getmapunit <- function(soili) {
  soilfile <- list.files(pattern = paste0(soili, ".in"), recursive = TRUE)
  if (length(soilfile) == 0) {
    soilfile <- list.files(pattern = paste0(soili + 1, ".in"), recursive = TRUE)
  }
  soilfile
}

soilfile <- Getmapunit(soili)
# I want to extract the file name closest to 371039; I was able to write a function that gets the file name with the next number
I would try to extract the number of each file and check for the nearest value:
library(magrittr)
library(stringr)

soili <- 371039

# get all files in the specific folder
files <- list.files(path = "file folder", full.names = FALSE)

# extract the number of each file and turn it into an integer
numbers <- str_extract(files, ".*(?=\\.in)") %>% as.integer()

# get the number of the nearest file
nearest_file <- numbers[which.min(abs(soili - numbers))]

# turn it into a filename
paste0(as.character(nearest_file), ".in")
You can also put everything into one pipe:
soili <- 371039

nearest_file <- list.files(path = "file folder", full.names = FALSE) %>%
  str_extract(".*(?=\\.in)") %>%
  as.integer() %>%
  .[which.min(abs(soili - .))] %>%
  paste0(".in")
Of course, you can also translate this approach into a function.
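For example, here is a minimal sketch of such a function, assuming the same single "file folder" and the .in naming convention from the question:

library(magrittr)
library(stringr)

# return the file name whose numeric part is closest to `soili`
get_nearest_file <- function(soili, path = "file folder") {
  list.files(path = path, full.names = FALSE) %>%
    str_extract(".*(?=\\.in)") %>%
    as.integer() %>%
    .[which.min(abs(soili - .))] %>%
    paste0(".in")
}

get_nearest_file(371039)
# e.g. "371039.in", or the nearest existing number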
Edit:
If you have all the files in different folders, you can use this approach:
soili <- 371039

files <- list.files(path = "highest_file_folder", full.names = TRUE)

nearest_file <- files %>%
  str_extract("[^/]*$") %>%
  str_extract(".*(?=\\.in)") %>%
  as.integer() %>%
  .[which.min(abs(soili - .))] %>%
  paste0(".in")

# getting the filepath matching nearest_file out of the files vector
files[str_detect(files, nearest_file)]
# little example
files <- c("./folder1/12345.in", "./folder2/56789.in") %>%
  str_extract("[^/]*$") %>%
  str_extract(".*(?=\\.in)")