Copy and rename specific files based on parent directories in R

I am attempting to solve this issue in R, but I'll upvote answers in any programming language.
I have an example vector of file names called file_list:
c("D:/example/sub1/session1/OD/CD/text.txt", "D:/example/sub2/session1/OD/CD/text.txt",
"D:/example/sub3/session1/OD/CD/text.txt")
What I'm trying to do is copy and rename the text files based on the part of the parent directory path that contains the sub and session folders. So the first file would be renamed sub1_session1_text.txt and be copied, along with the other text files, into a single new directory called all_files.
I'm struggling with some of the specifics of how to rename the file. I'm trying to use substr combined with str_locate_all and paste0 to copy and rename the files based on these parent directories.
Locate the position in each element of the vector file_list to construct starting and ending position for substr
library(stringr)
ending<-str_locate_all(pattern="/OD",file_list)
starting <- str_locate_all(pattern="/sub", file_list)
I then want to somehow pull the starting and ending positions of those patterns out of those lists for each element, feed them to substr to get the naming down, and then in turn use paste0 to create the new file paths.
What I'd like is something like
substr_naming_vector<-substr(file_list, start=starting[starting_position],stop=ending[starting_position])
but I don't know how to index the list such that it can know how to correctly index for each element the starting_position. Once I figure that out I'd fill in something like this
#paste the filenames into a vector that represents them being renamed in a new directory
all_files <- paste0("D:/all_files/", substr_naming_vector)
#rename and copy the files
file.copy(from = file_list, to = all_files)

Here's an example using a regular expression, which makes it somewhat shorter:
library(stringr)
library(magrittr)
all_dirs <-
c("D:/example/sub1/session1/OD/CD/text.txt",
"D:/example/sub2/session1/OD/CD/text.txt",
"D:/example/sub3/session1/OD/CD/text.txt")
new_dirs <-
all_dirs %>%
# Match each group using regex
str_match_all("D:/example/(.+)/(.+)/OD/CD/(.+)") %>%
# Paste the matched groups into one path
vapply(function(x) paste0(x[2:4], collapse = "_"), character(1)) %>%
paste0("D:/all_files/", .)
# Copy them.
file.copy(all_dirs, new_dirs)

This is one way of doing it. I assumed your file is always called text.txt.
library(stringr)
my_files <- c("D:/example/sub1/session1/OD/CD/text.txt",
"D:/example/sub2/session1/OD/CD/text.txt",
"D:/example/sub3/session1/OD/CD/text.txt")
# get the sub information
# [0-9]+ also matches multi-digit numbers such as sub10
subs <- str_extract(string = my_files,
pattern = "sub[0-9]+")
# get the session information
sessions <- str_extract(string = my_files,
pattern = "session[0-9]+")
# paste it all together
new_file_names <- paste("D:/all_files/",
paste(subs,
sessions,
"text.txt",
sep = "_"),
sep = "")
file.copy(from = my_files,
to = new_file_names)
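If the file name is not always text.txt, the same idea can be sketched in base R with a single sub() call that captures the sub folder, the session folder and the file name. This is only an illustration: the temporary directory tree and the names root, target_dir and copied are assumptions made so the example runs anywhere, and the pattern assumes folders named sub<number> and session<number> as in the question.

```r
# Build a throwaway directory tree that mirrors the layout in the question
root <- file.path(tempdir(), "example")
file_list <- file.path(root, paste0("sub", 1:3), "session1", "OD", "CD", "text.txt")
for (f in file_list) {
  dir.create(dirname(f), recursive = TRUE, showWarnings = FALSE)
  writeLines("demo", f)
}

# Capture the sub folder, the session folder and the file name,
# then glue the three captures together with underscores
new_names <- sub("^.*/(sub[0-9]+)/(session[0-9]+)/.*/([^/]+)$",
                 "\\1_\\2_\\3", file_list)

# Copy everything into one target directory (it has to exist first)
target_dir <- file.path(tempdir(), "all_files")
dir.create(target_dir, showWarnings = FALSE)
copied <- file.copy(from = file_list,
                    to = file.path(target_dir, new_names),
                    overwrite = TRUE)
```

file.copy() returns one logical value per file, so all(copied) is a quick check that nothing was skipped.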

Related

Cannot combine files in list of files when opening multiple .dta files [duplicate]

I have a folder with more than 500 .dta files. I would like to load some of these files into a single R object.
My .dta files have a generic name composed of four parts: 'two letters/four digits/y/.dta'. For instance, a name can be 'de2015y.dta' or 'fr2008y.dta'. Only the two letters and the four digits change across the .dta files.
I have written code that works, but I am not satisfied with it. I would like to avoid using a loop and to shorten it.
My code is:
# Select the .dta files I want to load
#.....................................
name <- list.files(path="E:/Folder") # names of the .dta files in the folder
db <- as.data.frame(name)
db$year <- substr(db$name, 3, 6)
db <- subset (db, year == max(db$year)) # keep last year available
db$country <- substr(db$name, 1, 2)
list.name <- as.list(db$country)
# Loading all the .dta files in the Global environment
#..................................................
for(i in c(list.name)){
obj_name <- paste(i, '2015y', sep='')
file_name <- file.path('E:/Folder',paste(obj_name,'dta', sep ='.'))
input <- read.dta13(file_name)
assign(obj_name, value = input)
}
# Merge the files into a single object
#..................................................
df2015 <- rbind (at2015y, be2015y, bg2015y, ch2015y, cy2015y, cz2015y, dk2015y, ee2015y, ee2015y, es2015y, fi2015y,
fr2015y, gr2015y, hr2015y, hu2015y, ie2015y, is2015y, it2015y, lt2015y, lu2015y, lv2015y, mt2015y,
nl2015y, no2015y, pl2015y, pl2015y, pt2015y, ro2015y, se2015y, si2015y, sk2015y, uk2015y)
Does anyone know how I can avoid the loop and shorten my code?
You can also use purrr for your task.
First create a named vector of all files you want to load (as I understand your question, you simply need all files from 2015). The setNames() part is only necessary in case you'd like an ID variable in your data frame and it is not already included in the .dta files.
After that, simply use map_df() to read all files and return a data frame. Specifying .id is optional and results in an ID column the values of which are based on the names of in_files.
library(purrr)
library(haven)
in_files <- list.files(path="E:/Folder", pattern = "2015y", full.names = TRUE)
in_files <- setNames(in_files, in_files)
df2015 <- map_df(in_files, read_dta, .id = "id")
The following steps should give you what you want:
Load the foreign package:
library(foreign) # or alternatively: library(haven)
create a list of file names
file.list <- list.files(path="E:/Folder", pattern="\\.dta$", full.names = TRUE)
determine which files to read (NOTE: check that the positions passed to substr are correct for your paths; 13 to 16 is my estimate and fits paths of the form E:/Folder/de2015y.dta)
vec <- as.integer(substr(file.list,13,16))
file.list2 <- file.list[vec==max(vec)]
read the files
df.list <- sapply(file.list2, read.dta, simplify=FALSE)
remove the path from the list names
names(df.list) <- basename(names(df.list))
bind the data frames together into one data.frame/data.table and create an id column as well
library(data.table)
df <- rbindlist(df.list, idcol = "id")
# or with 'dplyr'
library(dplyr)
df <- bind_rows(df.list, .id = "id")
Now you have a data.frame with an id-column that identifies the different original files.
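The position estimate in substr can be avoided by applying substr to basename() of each path, so the year always sits at characters 3 to 6 regardless of the folder. The sketch below illustrates this in base R. As an assumption made purely so that it runs without Stata files, it uses small CSV stand-ins and read.csv(); with real data you would swap in a .dta reader. The names folder, read_one and df_latest are hypothetical.

```r
# Stand-in data: tiny CSV files named like the .dta files in the question
folder <- file.path(tempdir(), "Folder")
dir.create(folder, showWarnings = FALSE)
for (nm in c("de2014y", "de2015y", "fr2015y")) {
  write.csv(data.frame(x = 1:2),
            file.path(folder, paste0(nm, ".csv")), row.names = FALSE)
}

# Keep only files that follow the naming scheme, then pick the latest year
files  <- list.files(folder, pattern = "^[a-z]{2}[0-9]{4}y\\.csv$", full.names = TRUE)
year   <- as.integer(substr(basename(files), 3, 6))
latest <- files[year == max(year)]

# Read one file and tag its rows with the country code from the file name
read_one <- function(f) {
  d <- read.csv(f)
  d$country <- substr(basename(f), 1, 2)
  d
}

# Stack all selected files without an explicit loop or assign()
df_latest <- do.call(rbind, lapply(latest, read_one))
```

Because the country code travels with the rows, there is no need to create one object per file in the global environment.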
I would change the working directory for this task...
Then does this do what you are asking for?
setwd("C:/.../yourfiles")
# get file names where year equals "2015"
name=list.files(pattern="\\.dta$")
name=name[substr(name,3,6)=="2015"]
# read in the files in a list
files=lapply(name,foreign::read.dta)
# remove ".dta" from file names and
# give the file contents in the list their name
names(files)=lapply(name,function(x) substr(x, 1, nchar(x)-4))
#or alternatively
names(files)=as.list(substr(name,1,nchar(name)-4))
# optional: put all file contents into one data-frame
#(data-frames (vectors) need to have the same row counts (lengths) for this last step to work)
mydatafrm = data.frame(files)

R function to get directory name of a file as characters

I can create a list of csv files in folder_A:
list1 <- dir_ls("path to folder_A")
I can define a function to add a column with filenames and combine these files into one dataframe:
read_and_save_combo <- function(fileX){
read_csv(fileX) %>%
mutate(fileX = path_file(fileX))}
combo_df <- map_df(list1, read_and_save_combo)
I want to add another column with the enclosing folder name (it would be the same for all files: folder_A). If I use dirname() on an individual file, I get the full parent directory path to folder_A, but I only want the characters "folder_A". If I use dirname() inside the function, I get another column, but it's filled with ".". Less importantly, I don't know why I get "." instead of the full path. More importantly, is there a function like path_parentfoldername that would let me add a new column containing only the name of the folder that holds each file to every row of the combined data frame?
Thanks!
Edit:
New function for clarity after answers:
read_and_save_combo <- function(fileX){
read_csv(fileX) %>%
mutate(filename = path_file(fileX), foldername = dirname(fileX) %>%
str_replace(pattern = ".*/", replacement = ""))}
This works because . is the wildcard but * modifies the meaning to 0-infinity characters, so ".*" is any character and any number of characters preceding /. Gregor said this but now I understand it.
Also, I was getting the column filled with ".", because in the function, I was reading one file, but then trying to mutate with dirname operating on the list, which is a vector with multiple elements (more than one file).
You can use dirname + basename :
list1 <- list.files('folder_A_path', full.names = TRUE)
read_and_save_combo <- function(fileX) {
readr::read_csv(fileX) %>%
dplyr::mutate(fileX = basename(dirname(fileX)))
}
combo_df <- purrr::map_df(list1, read_and_save_combo)
If your file is at the path 'Users/Downloads/FolderA/Filename.csv' :
dirname('Users/Downloads/FolderA/Filename.csv')
#[1] "Users/Downloads/FolderA"
basename(dirname('Users/Downloads/FolderA/Filename.csv'))
#[1] "FolderA"
"path to folder_A" is a bad example, use "path/to/folder_A". You need to delete everything from the start through the last /:
library(stringr)
str_replace("path/to/folder_A", pattern = ".*/", replacement = "")
# [1] "folder_A"
If you're worried about \\ or other non-standard things, use dirname() as the input.
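As a small base R illustration of the points above: basename(dirname(x)) is vectorised, and dirname() returns "." whenever the path it receives has no directory part. The latter is one plausible source of a column filled with ".", for example if dirname() ends up being applied to bare file names. The helper name parent_folder is hypothetical.

```r
# Name of the folder that directly contains each file (works on vectors)
parent_folder <- function(paths) basename(dirname(paths))

single <- parent_folder("Users/Downloads/FolderA/Filename.csv")
several <- parent_folder(c("a/b/c.csv", "x/y.csv"))

# A bare file name has no directory component, so dirname() falls back to "."
bare <- dirname("Filename.csv")

print(single)
print(several)
print(bare)
```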
Here are two ways to do what I wanted, using the helpful answers above:
read_and_save_combo <- function(file){
read_csv(file) %>%
mutate(filename = path_file(file), foldername = basename(dirname(file)))}
read_and_save_combo <- function(file){
read_csv(file) %>%
mutate(filename = path_file(file), foldername = dirname(file) %>%
str_replace(pattern = ".*/", replacement = ""))}
Other basic things I learned that could be helpful for other beginners:
(1) While writing the function, point all the functions (read_csv(), dirname(), etc.) at a uniform variable (here written as "file" but it could be just a letter "g" or whatever you choose). Then you will avoid the problem I had where part of the function is acting on one file and another part is acting on a list.
(2) filex and fileX look far too similar to each other in certain fonts, so capitalization can trip you up.

Using a function within list.files function in r

I want to create a program where I select files with a user defined prefix in list.files()
My folder will have files beginning with various characters. I want to define a variable or function at the beginning of the program which I can use in list.files in the program
List of files:
MP201901 MP201902 MP201903 SG201901 SG201902 SG201903 XY201901 XY202001 XY202002
If I use
inpfiles1 <- list.files(path =Input, pattern = "*SG.*.csv", full.names = TRUE)
it gives correct output but I want to store the prefix somewhere so we can just change the prefix
Currently using code
A<-"SG"
inpfiles2 <- list.files(path =Input, pattern = "*A*.*.csv", full.names = TRUE)
but this gives an empty result.
With your current code, R doesn't know that A is a variable name, and so it's ignoring your variable and literally using the letter A.
You can use paste0 instead:
A <- "SG"
pattern <- paste0(A, '.*.csv')
You have to concatenate the user-inputted pattern in A with your own suffix. I.e.
A <- "SG"
pattern <- paste0(A, ".*.csv")
inpfiles2 <- list.files(path=Input, pattern=pattern, full.names=TRUE)
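Since the pattern argument of list.files() is a regular expression, a slightly stricter variant can anchor the prefix to the start of the name and escape the dot before the extension. The sketch below wraps this in a small helper. The function name list_by_prefix and the temporary folder are assumptions for illustration, and the prefix is assumed to contain no regex special characters.

```r
# Throwaway folder with a few empty files named like those in the question
input_dir <- file.path(tempdir(), "Input")
dir.create(input_dir, showWarnings = FALSE)
invisible(file.create(file.path(
  input_dir, c("MP201901.csv", "SG201901.csv", "SG201902.csv", "XY202001.csv")
)))

# Build the regex from the prefix: ^ anchors the start, \\. is a literal dot,
# $ anchors the end of the file name
list_by_prefix <- function(dir, prefix) {
  list.files(dir, pattern = paste0("^", prefix, ".*\\.csv$"), full.names = TRUE)
}

sg_files <- list_by_prefix(input_dir, "SG")
print(basename(sg_files))
```

Changing the prefix then only means changing the second argument, for example list_by_prefix(input_dir, "MP").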

How to replace the title of columns in a merged document with the file directory using R?

I have performed an experiment under different conditions. Each of those conditions has its own folder. In each of those folders, there is a subfolder for each replicate that contains a text file called DistList.txt. This then looks like this, where the folders "C1.1", "C1.2" and so on contain the mentioned .txt files:
I have now managed to combine all those single DistList.txt files using the following script:
setwd("~/Desktop/Experiment/.")
fileList <- list.files(path = ".", recursive = TRUE, pattern = "DistList.txt", full.names = TRUE)
listData <- lapply(fileList, read.table)
names(listData) <- gsub("DistList.txt","",basename(fileList))
library(tidyverse)
library(reshape2)
bind_rows(listData, .id = "FileName") %>%
group_by(FileName) %>%
mutate(rowNum = row_number()) %>%
dcast(rowNum~FileName, value.var = "V1") %>%
select(-rowNum) %>%
write.csv(file="Result.csv")
This then yields a .csv file that has just numbers as titles (marked in red), which are not that useful for me, as shown in this picture:
I would rather like to have the directory of the "DistList.txt" files or even better only the name of the folder they are in as a title. I thought that I could do that using the function list.dirs() and colnames, but I somehow didn't manage to get it to work.
I would be very grateful, if someone could help me with this issue!
I think this line
names(listData) <- gsub("DistList.txt", "", basename(fileList))
should be:
names(listData) <- gsub("DistList.txt", "", fileList)
Because by using basename we are removing all the folders, leaving us with filename "DistList.txt", and that filename gets replaced by empty string "" using gsub.
What we might actually want instead is the line below, which extracts the last directory and should give, in your case, something like c("C1.1", "C1.2", ...):
names(listData) <- basename(dirname(fileList))
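The difference between the two approaches can be checked on a few example paths without reading any files. In this sketch the condition folder names Condition1 and Condition2 are made up; only C1.1, C1.2 and DistList.txt come from the question.

```r
# Example paths shaped like the output of list.files(recursive = TRUE, full.names = TRUE)
fileList <- c("./Condition1/C1.1/DistList.txt",
              "./Condition1/C1.2/DistList.txt",
              "./Condition2/C2.1/DistList.txt")

# Original approach: basename() leaves only "DistList.txt", which gsub() then empties
old_names <- gsub("DistList.txt", "", basename(fileList))

# Suggested approach: take the name of the folder that contains each file
new_names <- basename(dirname(fileList))

print(old_names)
print(new_names)
```

With empty names, bind_rows(.id = "FileName") has nothing meaningful to put in the id column, which would explain the plain numbers in the header.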

How to insert text at a specific position in a directory path in R

I am looking for an elegant way to insert a character string (a name) into a file path and create a .csv file. I found one possible solution, but I am looking for another one that works by "inserting" text between specific characters rather than "replacing" it.
#lets start
df <-data.frame()
name <- c("John Johnson")
dir <- c("C:/Users/uzytkownik/Desktop/.csv")
#how to insert "name" vector between "Desktop/" and "." to get:
dir <- c("C:/Users/uzytkownik/Desktop/John Johnson.csv")
write.csv(df, file=dir)
#???
#I found the answer but it is not very elegant in my opinion
library(qdapRegex)
dir2 <- c("C:/Users/uzytkownik/Desktop/ab.csv")
dir2<-rm_between(dir2,'a','b', replacement = name)
> dir2
[1] "C:/Users/uzytkownik/Desktop/John Johnson.csv"
write.csv(df, file=dir2)
I like sprintf syntax for "fill-in-the-blank" style string construction:
name <- c("John Johnson")
sprintf("C:/Users/uzytkownik/Desktop/%s.csv", name)
# [1] "C:/Users/uzytkownik/Desktop/John Johnson.csv"
Another option, if you can't put the %s in the directory string, is to use sub. This is replacing, but it replaces .csv with <name>.csv.
dir <- c("C:/Users/uzytkownik/Desktop/.csv")
sub(".csv", paste0(name, ".csv"), dir, fixed = TRUE)
# [1] "C:/Users/uzytkownik/Desktop/John Johnson.csv"
This should get you what you need.
dir <- "C:/Users/uzytkownik/Desktop/.csv"
name <- "joe depp"
dirsplit <- strsplit(dir,"\\/\\.")
paste0(dirsplit[[1]][1],"/",name,".",dirsplit[[1]][2])
# [1] "C:/Users/uzytkownik/Desktop/joe depp.csv"
I find that paste0() is the way to go, so long as you store your directory and extension separately:
path <- "some/path/"
file <- "file"
ext <- ".csv"
write.csv(myobj, file = paste0(path, file, ext))
For those unfamiliar, paste0() is shorthand for paste( , sep="").
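A related base R option is file.path(), which inserts the "/" separator itself, so the directory can be stored without a trailing slash. The sketch below also shows how paste() with its default separator differs from paste0(). The variable names dir_path and out are only illustrative.

```r
dir_path <- "C:/Users/uzytkownik/Desktop"
name <- "John Johnson"

# file.path() joins its arguments with "/", paste0() appends the extension
out <- file.path(dir_path, paste0(name, ".csv"))
print(out)

# paste() separates its arguments with a space by default, paste0() does not
with_space <- paste("file", ".csv")
no_space <- paste0("file", ".csv")
```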
Let's suppose you have a vector with the desired names for some data structures you want to save, for instance:
names <- c("file_1", "file_2", "file_3")
Now you want to build the path where each file will be saved by adding the name plus the extension:
path <- "/Users/Documents/Test_Folder/"
extension <- ".csv"
A simple way to achieve this is to use paste0() to create the full path as input for write.csv() inside an lapply(), as follows (paste() with its default sep = " " would put spaces into the path):
lapply(names, function(x) {
# data is the object to be written; it is assumed to exist already
write.csv(x = data,
file = paste0(path, x, extension))
}
)
The good thing about this approach is that you can iterate over the vector that contains the names of your files and the final path is built automatically. One possible extension is to define a vector of extensions and update the path accordingly.
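When each name should go with its own data set, one way to sketch this is a named list of data frames combined with Map(), which walks over the data and the target paths in parallel. The output folder inside tempdir() and the example data frames are assumptions made so the snippet can run anywhere.

```r
# Output folder inside the session's temporary directory (illustrative only)
out_dir <- file.path(tempdir(), "Test_Folder")
dir.create(out_dir, showWarnings = FALSE)

# Named list: the list names become the file names
tables <- list(file_1 = data.frame(a = 1:2),
               file_2 = data.frame(a = 3:4))
paths <- file.path(out_dir, paste0(names(tables), ".csv"))

# Write each data frame to its matching path
invisible(Map(function(d, p) write.csv(d, p, row.names = FALSE), tables, paths))
```

Keeping the names on the list means the data and the file names cannot drift out of sync.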

Resources