I am trying to understand purrr, and how to map/walk over a list of images and save them to files. Below is my code that works using a for loop, but how would this be structured using purrr? I am confused by the various versions (walk, walk2, pwalk, map, map2, pmap etc.)
library(magick)
library(purrr)
#create a list of the files
inpath <- "C:\\Path\\To\\Images"
file_list <- list.files(path = inpath, full.names = TRUE)
# read the files and negate them
imgn <- map(file_list, image_read) %>%
map(image_negate)
# assign list names as original file names
names(imgn) = list.files(path = inpath)
# how to use walk, map, map2? walk2, pwalk? to do this
for (i in 1:length(imgn)) {
image_write(imgn[[i]], path = names(imgn)[[i]])
}
Using Map from base R
Map(function(x, y) image_write(x, path = y), imgn, file_list)
If I'm correct in understanding your code, it looks like you're trying to save your edited images to their original file paths. If so, could you replace your for loop with:
map2(imgn, file_list, ~ image_write(.x, path = .y))
As an explanation, you want to use map2 because you're applying a function with two inputs; the image you're saving (stored in imgn), and the filepath you're writing it to (stored in file_list). You can then use formula notation to specify the function and arguments you'd like to map, as above (more on this in the map docs).
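If you only care about the files being written and not about the return value, walk2 takes the same arguments as map2 but returns its first input invisibly; pmap and pwalk generalise this to any number of inputs passed as a list. Below is a minimal sketch of walk2, assuming plain text stand-ins instead of magick images so that it runs without any image files. The names contents, out_dir and out_paths are hypothetical.

```r
library(purrr)

# Hypothetical stand-ins for the images: three short strings
contents <- list("first", "second", "third")

# Hypothetical output paths inside a temporary directory
out_dir <- file.path(tempdir(), "walk2_demo")
dir.create(out_dir, showWarnings = FALSE)
out_paths <- file.path(out_dir, paste0("file", seq_along(contents), ".txt"))

# walk2() calls the function once per (.x, .y) pair, purely for the side effect
walk2(contents, out_paths, ~ writeLines(.x, con = .y))

# Each path now holds the matching element
readLines(out_paths[2])
```

With the images from the question, the same shape would be walk2(imgn, file_list, ~ image_write(.x, path = .y)).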
I read data files from a directory where I don't know the number or the names of the files. Each file is a data frame (stored as a parquet file). I can read those files, but how do I name the results?
I would like to have something like a named list where the filename is the name of the element. I don't know how to do this in R. In Python I would use dictionaries like this
file_names = ['A.parquet', 'B.parquet']
all_data = {}
for fn in file_names:
data = pd.read_parquet(fn)
all_data[fn] = data
How can I solve this in R?
library("arrow")
file_names = c('a.parquet', 'B.parquet')
# "named vector"?
daten = c()
for (pf in file_names) {
# name of data frame (filename without suffix)
df_name <- strsplit(pf, ".", fixed=TRUE)[[1]][1]
df <- arrow::read_parquet(pf)
daten[df_name] = df
}
This doesn't work because I got this error
number of items to replace is not a multiple of replacement length
In the tidyverse you would use purrr. This is basically the same as the lapply() or sapply() approach, but in a different ecosystem.
library(arrow)
library(purrr)
file_names = c('a.parquet', 'B.parquet')
daten <- file_names %>%
set_names(tools::file_path_sans_ext) %>%
map(read_parquet)
You would access each list item through the usual ways.
daten$a
daten$B
# or
daten[["a"]]
daten[["B"]]
Explanation
The pipe operator %>% is an extremely common thing to run into in R these days. It is from the magrittr package, but is also exported from various other tidyverse packages, including purrr.
The pipe takes the left hand argument and enters it as the first argument on the right side expression. So f(x, y) can be written as x %>% f(y). This is useful to chain together expressions. R itself has a native pipe operator |> starting with version 4.1.0.
file_names is an unnamed character vector of the file names.
set_names() will make this a named vector by applying the function file_path_sans_ext() to file_names. This removes the file extension, so each element is named according to its name before the extension.
map() will iterate over each element of the vector, returning a list named according to the names of the vector elements. Each iteration runs the read_parquet function on the input (the file name).
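The steps above can be tried without any parquet files. This is a minimal sketch, assuming two small CSV files written to a temporary directory and read with read.csv instead of read_parquet; demo_dir and named_paths are hypothetical names, and the calls are written without the pipe so each step is visible.

```r
library(purrr)

# Write two small CSV files to a temporary directory (hypothetical data)
demo_dir <- file.path(tempdir(), "named_list_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(x = 1:2), file.path(demo_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(x = 3:5), file.path(demo_dir, "B.csv"), row.names = FALSE)

file_names <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Name each path by its file name without directory or extension
named_paths <- set_names(file_names, tools::file_path_sans_ext(basename(file_names)))

# map() keeps the names, so the result is a named list of data frames
daten <- map(named_paths, read.csv)
names(daten)
```

Because the names travel with the vector, daten$a and daten[["B"]] work exactly as in the parquet version.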
You can use named lists like so.
You can either use the names directly
sapply(file_names, arrow::read_parquet,USE.NAMES = TRUE,simplify = FALSE)
or set them after with whatever function you want to apply
setNames(lapply(file_names, arrow::read_parquet), stringr::str_extract(file_names, '^.+(?=\\.)'))
Suppose we have files file1.csv, file2.csv, ... , and file100.csv in directory C:\R\Data and we want to read them all into separate data frames (e.g. file1, file2, ... , and file100).
The reason for this is that, despite having similar names they have different file structures, so it is not that useful to have them in a list.
I could use lapply but that returns a single list containing 100 data frames. Instead I want these data frames in the Global Environment.
How do I read multiple files directly into the global environment? Or, alternatively, How do I unpack the contents of a list of data frames into it?
Thank you all for replying.
For completeness here is my final answer for loading any number of (tab) delimited files, in this case with 6 columns of data each where column 1 is characters, 2 is factor, and remainder numeric:
##Read files named xyz1111.csv, xyz2222.csv, etc.
filenames <- list.files(path="../Data/original_data",
pattern="xyz+.*csv")
##Create list of data frame names without the ".csv" part
names <-substr(filenames,1,7)
###Load all files
for(i in names){
filepath <- file.path("../Data/original_data/",paste(i,".csv",sep=""))
assign(i, read.delim(filepath,
colClasses=c("character","factor",rep("numeric",4)),
sep = "\t"))
}
Quick draft, untested:
Use list.files() aka dir() to dynamically generate your list of files.
This returns a vector, just run along the vector in a for loop.
Read the i-th file, then use assign() to place the content into a new variable file_i
That should do the trick for you.
Use assign with a character variable containing the desired name of your data frame.
for(i in 1:100)
{
oname = paste("file", i, sep="")
assign(oname, read.csv(paste(oname, ".csv", sep="")))
}
This answer is intended as a more useful complement to Hadley's answer.
While the OP specifically wanted each file read into their R workspace as a separate object, many other people naively landing on this question may think that that's what they want to do, when in fact they'd be better off reading the files into a single list of data frames.
So for the record, here's how you might do that.
#If the path is different than your working directory
# you'll need to set full.names = TRUE to get the full
# paths.
my_files <- list.files("path/to/files")
#Further arguments to read.csv can be passed in ...
all_csv <- lapply(my_files,read.csv,...)
#Set the name of each list element to its
# respective file name. Note full.names = FALSE to
# get only the file names, not the full path.
names(all_csv) <- gsub(".csv","",
list.files("path/to/files",full.names = FALSE),
fixed = TRUE)
Now any of the files can be referred to by all_csv[["filename"]], which really isn't much worse than just having separate filename variables in your workspace, and often it is much more convenient.
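To illustrate why the list form is often more convenient, here is a minimal sketch that assumes a small hand-built list in place of the files read from disk (the element names first and second and their contents are hypothetical). The same operation can be applied to every data frame in one call, which is awkward with separate workspace variables.

```r
# Hypothetical stand-in for all_csv: a named list of data frames
all_csv <- list(
  first = data.frame(id = 1:3, value = c(2.5, 3.1, 4.0)),
  second = data.frame(id = 1:2, value = c(1.0, 9.9))
)

# Pull out one data frame by name
all_csv[["first"]]

# Apply the same operation to every data frame in one call
row_counts <- sapply(all_csv, nrow)
row_counts
```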
Here is a way to unpack a list of data.frames using just lapply
filenames <- list.files(path="../Data/original_data",
pattern="xyz+.*csv")
filelist <- lapply(filenames, read.csv)
#if necessary, assign names to data.frames
names(filelist) <- c("one","two","three")
#note the invisible function keeps lapply from spitting out the data.frames to the console
invisible(lapply(names(filelist), function(x) assign(x,filelist[[x]],envir=.GlobalEnv)))
Reading all the CSV files from a folder and creating variables with the same names as the files:
setwd("your path to folder where CSVs are")
filenames <- gsub("\\.csv$","", list.files(pattern="\\.csv$"))
for(i in filenames){
assign(i, read.csv(paste(i, ".csv", sep="")))
}
A simple way to access the elements of a list from the global environment is to attach the list. Note that this actually creates a new environment on the search path and copies the elements of your list into it, so you may want to remove the original list after attaching to prevent having two potentially different copies floating around.
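A minimal sketch of the attach approach, assuming a small hand-built list (my_list, df_one and df_two are hypothetical names). It also shows the copy behaviour mentioned above: changing the original list afterwards does not change the attached objects.

```r
# Hypothetical list of data frames
my_list <- list(
  df_one = data.frame(x = 1:3),
  df_two = data.frame(y = c("a", "b"))
)

# attach() puts a copy of the list's elements on the search path
attach(my_list)
nrow(df_one)   # df_one is now found by name

# The attached copy is independent of the original list
my_list$df_one$x <- 0L
copy_x <- df_one$x   # still the values from when the list was attached

# Remove the attached environment when finished
detach(my_list)
```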
I want to update the answer given by Joran:
#If the path is different than your working directory
# you'll need to set full.names = TRUE to get the full
# paths.
my_files <- list.files(path="set your directory here", full.names=TRUE)
#full.names=TRUE is important to be added here
#Further arguments to read.csv can be passed in ...
all_csv <- lapply(my_files, read.csv)
#Set the name of each list element to its
# respective file name. Note full.names = FALSE to
# get only the file names, not the full path.
names(all_csv) <- gsub(".csv","",list.files("copy and paste your directory here",full.names = FALSE),fixed = TRUE)
#Now you can create a dataset based on each filename
df <- as.data.frame(all_csv$nameofyourfilename)
A simplified version, assuming your csv files are in the working directory:
listcsv <- list.files(pattern= "*.csv") #creates list from csv files
names <- substr(listcsv,1,nchar(listcsv)-4) #creates list of file names, no .csv
for (k in 1:length(listcsv)){
assign(names[[k]] , read.csv(listcsv[k]))
}
#cycles through the names and assigns each relevant dataframe using read.csv
#copy all the files you want to read in R in your working directory
a <- dir()
#using lapply to remove the ".csv" from the filename
#(lapply already visits every element, so no surrounding for loop is needed)
list1 <- lapply(a, function(x) gsub(".csv","",x, fixed = TRUE))
#Final step
for(i in list1){
filepath <- file.path("../Data/original_data/..",paste(i,".csv",sep=""))
assign(i, read.csv(filepath))
}
Use list.files and map_dfr to read many csv files
df <- list.files(data_folder, full.names = TRUE) %>%
map_dfr(read_csv)
Reproducible example
First write sample csv files to a temporary directory.
It's more complicated than I thought it would be.
library(dplyr)
library(purrr)
library(purrrlyr)
library(readr)
data_folder <- file.path(tempdir(), "iris")
dir.create(data_folder)
iris %>%
# Keep the Species column in the output
# Create a new column that will be used as the grouping variable
mutate(species_group = Species) %>%
group_by(species_group) %>%
nest() %>%
by_row(~write.csv(.$data,
file = file.path(data_folder, paste0(.$species_group, ".csv")),
row.names = FALSE))
Read these csv files into one data frame.
Note the Species column has to be present in the csv files, otherwise we would lose that information.
iris_csv <- list.files(data_folder, full.names = TRUE) %>%
map_dfr(read_csv)
I would like to read data from several files into separated data frames. The files are in a different folder than the script.
I have used list with filenames.
users_list <- list.files(path = "Data_Eye/Jazz/data/first",
pattern = "*.cal", full.names = F)
I tried to use the functions map and read_delim, but without success. It is important for me to read each file into a different data frame. Ideally I would end up with a list of data frames.
You can do something like this, though I don't have any .cal files to test it on. So, it's possible you might need a different function for reading in those files.
library(devtools)
devtools::install_github("https://github.com/KirtOnthank/OTools")
library(OTools)
# Give full path to your files (if different than working directory).
temp = list.files(path="../Data/original_data", pattern="*.cal", full.names = TRUE)
# Then, apply the read.cal function to the list of files.
myfiles = lapply(temp, OTools::read.cal)
# Then, set the name of each list element (each dataframe) to its respective file name.
names(myfiles) <- gsub(".cal","",
list.files("../Data/original_data",full.names = FALSE),
fixed = TRUE)
# Now, put all of those individual dataframes from your list into the global environment as separate dataframes.
list2env(myfiles,envir=.GlobalEnv)
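With lapply from base R (and read_delim from readr) you can generate a list of data frames. Since the files are in a different folder than the script, build users_list with full.names = TRUE so that the paths resolve: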
list_of_dfs <- lapply(users_list, read_delim)
I would like to write a function to repeat a chunk of code over a collection of file names (names of files present in my working directory) in R. I would also like to save the output of each call as a global environment object if possible. The general structure of what I am trying to do is given below. The function name is made up.
global_environment_object_1 <- anyfunction("filename and extension")
# Repeat this over a set of filenames in the working directory and save each as a separate
# global environment object with separate names.
A Real life example can be:
sds22 <- get_subdatasets("EVI_2017_June_2.hdf")
sds23 <- get_subdatasets("EVI_2016_June_1.hdf")
-where object names and file names are changing and the total number of files is 48.
Thanks for the help in advance!
Try using:
#Get all the filenames
all_files <- list.files(full.names = TRUE)
#Probably to be specific
#all_files <- list.files(pattern = "\\.hdf$", full.names = TRUE)
#apply get_subdatasets to each one of them
all_data <- lapply(all_files, get_subdatasets)
#Assign name to list output
names(all_data) <- paste0('sds', seq_along(all_data))
#Get the data in global environment
list2env(all_data, .GlobalEnv)
I am pretty new to R and programming, so I apologise if this question has been asked elsewhere.
I'm trying to load multiple .csv files, edit them and save again. But cannot find out how to manage more than one .csv file and also name new files based on a list of character strings.
So I have .csv file and can do:
species_name<-'ace_neg'
{species<-read.csv('species_data/ace_neg.csv')
species_1_2<-species[,1:2]
species_1_2$species<-species_name
species_3_2_1<-species_1_2[,c(3,1,2)]
write.csv(species_3_2_1, file='ace_neg.csv',row.names=FALSE)}
But I would like to run this code for all .csv files in the folder and add text to a new column based on .csv file name.
So I can load all .csv files and make a list of character strings for use as a new column text and as new file names.
NDOP_files <- list.files(path="species_data", pattern="*.csv$", full.names=TRUE, recursive=FALSE)
short_names<- substr(NDOP_files, 14,20)
Then I tried:
lapply(NDOP_files, function(x){
species<-read.csv(x)
species_1_2<-species[,1:2]
species_1_2$species<-'name' #don't know how to insert the first string of short_names instead of 'name', then the second string of short_names for the second csv file, etc.
Then continue in the code to change the order of the columns
species_3_2_1<-species_1_2[,c(3,1,2)]
And then write all the modified csv files and name them again using the list of short_names.
I'm sorry if the text is somewhat confusing.
Any help or suggestions would be great.
You are actually quite close, and using lapply() is a really good idea. As you state, the issue is that it only takes one list as an argument, but you want to work with two. mapply() is a function in base R that you can feed multiple lists into and cycle through in parallel.

lapply() and mapply() are both designed to create or manipulate objects in R, but you want to write files and are not interested in the output within R. The purrr package has the walk*() functions, which are useful when you want to cycle through lists and are only interested in creating side effects (in your case, saving files). purrr::walk2() takes two lists, so you can provide the data and the file names at the same time.
library(purrr)
First I create some example data (I’m basically already using the same concept here as I will below):
test_data <- map(1:5, ~ data.frame(
a = sample(1:5, 3),
b = sample(1:5, 3),
c = sample(1:5, 3)
))
walk2(test_data,
paste0("species_data/", 1:5, "test.csv"),
~ write.csv(.x, .y))
Instead of getting the file paths and then stripping away the path
to get the file names, I just call list.files(), once with full.names = TRUE and once with full.names = FALSE.
NDOP_filepaths <-
list.files(
path = "species_data",
pattern = "*.csv$",
full.names = TRUE,
recursive = FALSE
)
NDOP_filenames <-
list.files(
path = "species_data",
pattern = "*.csv$",
full.names = FALSE,
recursive = FALSE
)
Now I feed the two lists into purrr::walk2(). Using the ~ before the curly brackets, I can define the anonymous function a bit more elegantly and then use .x and .y to refer to the entries of the first and the second list.
walk2(NDOP_filepaths,
NDOP_filenames,
~ {
species <- read.csv(.x)
species <- species[, 1:2]
species$species <- gsub(".csv", "", .y)
write.csv(species, .x)
})
Learn more about purrr at purrr.tidyverse.org.
Alternatively, you could just extract the file name in the loop and stick to lapply() or use purrr::map()/purrr::walk(), like this:
lapply(NDOP_filepaths,
function(x) {
species <- read.csv(x)
species <- species[, 1:2]
species$species <- tools::file_path_sans_ext(basename(x))
write.csv(species, basename(x))
})
NDOP_files <- list.files(path="species_data", pattern="*.csv$",
full.names=TRUE, recursive=FALSE)
# Get name of each file (without the extension)
# basename() removes all of the path up to and including the last path separator
# file_path_sans_ext() removes the .csv extension
csvFileNames <- tools::file_path_sans_ext(basename(NDOP_files))
Then, I would write a function that takes in one csv file, does some manipulation, and returns a data frame. Since you have a vector of csv files from list.files, you can use the map function in the purrr package to apply your function to each csv file.
doSomething <- function(NDOP_file){
# your code here to manipulate NDOP_file to your liking
return(NDOP_file)
}
NDOP_files <- map(NDOP_files, ~doSomething(.x))
Lastly, you can manipulate the file names when you write the new csv files using csvFileNames and a custom function you write to change the file name to your liking. Essentially, use the same architecture of defining your custom function and using map to apply to each of your files.
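Put together, the whole approach might look like the sketch below. It assumes two small input files created in a temporary directory so that it runs anywhere; processOne, in_dir, out_dir, out_paths and the file name bet_pen.csv are hypothetical, and the column handling follows the steps described in the question.

```r
library(purrr)

# Set up two hypothetical input files in a temporary directory
in_dir <- file.path(tempdir(), "species_data_demo")
out_dir <- file.path(tempdir(), "species_out_demo")
dir.create(in_dir, showWarnings = FALSE)
dir.create(out_dir, showWarnings = FALSE)
write.csv(data.frame(x = 1:2, y = 3:4, z = 5:6),
          file.path(in_dir, "ace_neg.csv"), row.names = FALSE)
write.csv(data.frame(x = 7:8, y = 9:10, z = 11:12),
          file.path(in_dir, "bet_pen.csv"), row.names = FALSE)

NDOP_files <- list.files(in_dir, pattern = "\\.csv$", full.names = TRUE)
csvFileNames <- tools::file_path_sans_ext(basename(NDOP_files))

# Read one file, keep the first two columns, and put the species name first
processOne <- function(path, name) {
  species <- read.csv(path)[, 1:2]
  species$species <- name
  species[, c(3, 1, 2)]
}
processed <- map2(NDOP_files, csvFileNames, processOne)

# Write each result under its short name; only the side effect matters here
out_paths <- file.path(out_dir, paste0(csvFileNames, ".csv"))
walk2(processed, out_paths, ~ write.csv(.x, .y, row.names = FALSE))
```

map2 is used for the step whose result is kept, and walk2 for the step that only writes files.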