dplyr: get access to dot operator within mutate

I'm trying to read in every file from a directory, but prior to binding the data.frames together with map_df, I want to create a column that stores the file name of each data.frame.
Here are various failed attempts:
data <-
  list.files(path, pattern = "*.csv") %>%
  map_df(~fread(paste0(path, .)), ~mutate(file_name = .))
data <-
  list.files(path, pattern = "*.csv") %>%
  map_df(~fread(paste0(path, .))) %>% mutate(file_name = .)
data <-
  list.files(path, pattern = "*.csv") %>%
  map_df(~fread(paste0(path, .))) %>% mutate(~file_name = .)
However, I can't seem to figure out how to access the dot operator in the mutate function to get the file name. Any ideas?

map_dfr() is designed for this. It has an argument .id that creates a new column storing either the name (if .x is named) or the index (if .x is unnamed) of each input.
data <- list.files(path, pattern = "*.csv") %>%
  set_names() %>%
  map_dfr(~ fread(paste0(path, .x)), .id = "file_name")
set_names() sets the names of a vector and, by default, names each element after itself. E.g.
c("file_1.csv", "file_2.csv") %>% set_names()
# file_1.csv file_2.csv (names)
# "file_1.csv" "file_2.csv" (values)
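As a self-contained illustration of how .id picks up the list names (using hypothetical in-memory data frames standing in for files read from disk):

```r
library(purrr)

# Hypothetical named list standing in for the files read from disk
dfs <- list(a.csv = data.frame(x = 1:2), b.csv = data.frame(x = 3))

map_dfr(dfs, ~ .x, .id = "file_name")
# The file_name column is "a.csv", "a.csv", "b.csv"
```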

Related

write_csv for a list of csv files maintaining original file names

I have a df list (df_list) and I want a tidyverse approach to write all csv files from my list while maintaining their original file names.
So far I did:
df = dir(pattern = "\\.csv$", full.names = TRUE)
df_list = vector("list", length(df))
for (i in seq_along(df)) {
  df_list[[i]] = read.csv(df[[i]], sep = ";")
}
imap(df_list, ~ write_csv(.x, paste0(.y, ".csv")))
My current output is:
1.csv; 2.csv; 3.csv ...
The below will read in a set of files from an example directory, apply a function to those files, then save the files with the exact same names.
library(purrr)
library(dplyr)
# Create example directory with example .csv files
dir.create(path = "example")
data.frame(x1 = letters) %>% write.csv(., file = "example/example1.csv")
data.frame(x2 = 1:20) %>% write.csv(., file = "example/example2.csv")
# Get relative paths of all .csv files in the example subdirectory
path_list <- list.files(pattern = "example.*csv", recursive = TRUE) %>%
  as.list()
# Read every file into a list
file_list <- path_list %>%
  map(~ read.csv(.x, sep = ","))
# Do something to the data
file_list_updated <- file_list %>%
  map(~ .x %>% mutate(foo = 5))
# Write the updated files to the old file names
map2(.x = file_list_updated,
     .y = path_list,
     ~ write.csv(x = .x, file = .y))

Can't Combine X <character> and X <double>

I am trying to merge various .csv files into one dataframe using the following:
df <- list.files(path = "C:/Users...", pattern = "*.csv", full.names = TRUE) %>%
  lapply(read_csv) %>%
  bind_rows()
However, I get an error saying that I can't combine X character variable and X double variable.
Is there a way I can transform one of them to a character or double variable?
Since each csv file is slightly different, from my beginner's standpoint I believe that lapply would be best in this case, unless there is an easier way to get around this.
Thank you everyone for your time and attention!
You can change the X variable to character in all the files. You can also use map_df to combine all the files in one dataframe.
library(tidyverse)
result <- list.files(path = "C:/Users...", pattern = "*.csv", full.names = TRUE) %>%
  map_df(~ read_csv(.x) %>% mutate(X = as.character(X)))
If more columns have type-mismatch issues, you can change all the columns to character, combine the data, and use type_convert() to restore their classes.
result <- list.files(path = "C:/Users...", pattern = "*.csv", full.names = TRUE) %>%
  map_df(~ read_csv(.x) %>% mutate(across(everything(), as.character))) %>%
  type_convert()
If all files have the same columns, you can also try plyr::rbind.fill() instead of dplyr::bind_rows():
list.files(path = "C:/Users...", pattern = "*.csv", full.names = TRUE) %>%
  lapply(read_csv) %>%
  plyr::rbind.fill()

Define one column only when reading in a list of csv files?

I'm reading in a list of files and combining into a single df with:
fi <- list.files(path = 'data', pattern = '*.csv', full.names = T) %>%
  map_df(~read_csv(paste0(.)))
But this is hitting an error:
Error: Can't combine App ID <character> and App ID <double>.
Within the directory in question there are many files to read in and combine. I'd like to know if it's possible to define the column type for the problem field, App ID, while reading in like this? E.g. map_df(~read_csv(paste0(.), cols(App ID = CHR)))
We can use
library(dplyr)
library(purrr)
library(readr)
out <- list.files(path = 'data', pattern = '*\\.csv',
                  full.names = TRUE) %>%
  map_df(~ read_csv(.x) %>%
           mutate(across(everything(), as.character))) %>%
  type.convert(as.is = TRUE)
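To answer the literal question, read_csv() also accepts a col_types argument, so you can force only the problem column while reading (a sketch, assuming the column really is named App ID):

```r
library(purrr)
library(readr)

# Force only `App ID` to character; every other column is guessed as usual
fi <- list.files(path = 'data', pattern = '*.csv', full.names = TRUE) %>%
  map_df(~ read_csv(.x, col_types = cols(`App ID` = col_character())))
```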

Select file from the list of the folder

I have multiple folders with files named numerically (12345.in). I am trying to write a function that will list the nearest file if the file in the command is not in the folder.
soili = 371039 # this is the file name
Getmapunit <- function(soili) {
  soilfile = list.files(pattern = paste0(soili, ".in"), recursive = TRUE)
  if (length(soilfile) == 0) {
    soilfile = list.files(pattern = paste0(soili + 1, ".in"), recursive = TRUE)
  }
  soilfile
}
soilfile = Getmapunit(soili)
# I want to extract the file name closest to 371039; I was able to write a function to get the file name with the next number
I would try to extract the number of each file and check for the nearest value:
library(magrittr)
library(stringr)
soili <- 371039
# get all files in the specific folder
files <- list.files(path = "file folder", full.names = FALSE)
# extract the number of each file and turn it into an integer
numbers <- str_extract(files, ".*(?=\\.in)") %>% as.integer()
# get the number of the nearest file
nearest_file <- numbers[which.min(abs(soili - numbers))]
# turn it into a filename
paste0(as.character(nearest_file), ".in")
You can also put everything into one pipe:
soili <- 371039
nearest_file <- list.files(path = "file folder", full.names = FALSE) %>%
  str_extract(".*(?=\\.in)") %>%
  as.integer() %>%
  .[which.min(abs(soili - .))] %>%
  paste0(".in")
Of course, you can also translate this approach into a function.
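For instance, the steps above could be wrapped up roughly like this (a sketch; get_nearest_file is a hypothetical name, and it assumes file names like "12345.in"):

```r
library(stringr)

# Sketch: return the file name whose number is closest to soili
get_nearest_file <- function(soili, files) {
  numbers <- as.integer(str_extract(files, ".*(?=\\.in)"))
  paste0(numbers[which.min(abs(soili - numbers))], ".in")
}

get_nearest_file(371039, c("371037.in", "371050.in"))
# "371037.in"
```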
Edit:
If you have all the files in different folders, you can use this approach:
soili <- 371039
files <- list.files(path = "highest_file_folder", full.names = TRUE)
nearest_file <- files %>%
  str_extract("[^/]*$") %>%
  str_extract(".*(?=\\.in)") %>%
  as.integer() %>%
  .[which.min(abs(soili - .))] %>%
  paste0(".in")
# getting the filepath with nearest_file out of the files vector
files[str_detect(files, nearest_file)]
# little example
c("./folder1/12345.in", "./folder2/56789.in") %>%
  str_extract("[^/]*$") %>%
  str_extract(".*(?=\\.in)")

Merging multiple csv files with name sequence

I am trying to merge 700+ csv files in R. I was able to merge them successfully using the code:
library(dplyr)
library(readr)
df <- list.files(full.names = TRUE) %>%
lapply(read_csv) %>%
bind_rows
Now my problem is that the file names are flux.0, flux.1, flux.2, ..., flux.733. R binds the files in the order flux.0, flux.1, flux.10, flux.100, flux.101, ... and so on. Since the sequence of the files is important to me, can you suggest how to incorporate this in the above code?
Many thanks for the help!
My pipeline for things like that is to get the list of all the files (as you did), turn it into a tbl/data.frame, and then use map to read the files and unnest() them. That way I can keep the path/file name for each file I loaded.
require(tidyverse)
df <- list.files(path = "path",
                 full.names = TRUE,
                 recursive = TRUE,
                 pattern = "*.csv") %>%
  tibble(value = .) %>%
  mutate(data = map(value, read.csv)) %>%
  arrange(parse_number(basename(value))) %>%
  unnest(data)
Here you have another answer using your own approach. I've just added a function that reads the csv and adds a new column called 'file' with the name of the file without the extension.
library(dplyr)
library(readr)
df <- list.files(full.names = TRUE) %>%
  lapply(function(x) {
    a <- read_csv(x)
    mutate(a, file = tools::file_path_sans_ext(basename(x)))
  }) %>%
  bind_rows()
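Since the order matters here, another option (a sketch, assuming stringr >= 1.4 is available) is to sort the file names numerically before reading, so that flux.2 comes before flux.10:

```r
library(stringr)

files <- c("flux.0", "flux.1", "flux.10", "flux.100", "flux.2")

# numeric = TRUE sorts digit runs by value instead of lexically
str_sort(files, numeric = TRUE)
# "flux.0" "flux.1" "flux.2" "flux.10" "flux.100"
```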
