I am working with stock files from a repository; a new file is generated every day.
For example:
"stock2021-11-05.txt"
I need to read the most recently generated file, or failing that, read all the files whose names begin with the word stock and join them.
I currently use the following code:
fileList <- list.files( pattern= "*.txt")
But this returns all the .txt files in the repository, not just the ones whose names start with the word stock.
I would appreciate some help with this.
Thanks!
Simply use:
list.files(pattern = "stock.*\\.txt")
to find all files that begin with "stock" and end with ".txt"
Check out this REGEX cheat sheet from the stringr package to learn more:
https://github.com/rstudio/cheatsheets/blob/main/strings.pdf
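For the second half of the question (read the latest file, or read them all and join), here is a minimal sketch, assuming the files are tab-separated with a header row and share the same columns:
fileList <- list.files(pattern = "^stock.*\\.txt$")
# latest file: the ISO date in the name means a plain text sort works
latest_data <- read.table(sort(fileList, decreasing = TRUE)[1], header = TRUE)
# or read all stock files and join them into one data frame
all_data <- do.call(rbind, lapply(fileList, read.table, header = TRUE))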
So you managed to filter out files which are not .txt files. Two steps are missing. A possible fileList now could be:
fileList <- c("stock2021-11-05.txt",
"stock2020-11-15.txt",
"stock2021-02-05.txt",
"vwxyz2018-01-01.txt")
1 - Filter stock-files
> fileList_stock <- grep("^stock", fileList, value = TRUE)
> fileList_stock
[1] "stock2021-11-05.txt" "stock2020-11-15.txt" "stock2021-02-05.txt"
2 - Get latest file
> sort(fileList_stock, decreasing = TRUE)[1]
[1] "stock2021-11-05.txt"
1+2 - Wrapper function
> library(magrittr)  # needed for %>% and the . placeholder
> get_last_stock_file <- function(x){
+   grep("^stock", x, value = TRUE) %>% sort(decreasing = TRUE) %>% .[1]
+ }
> get_last_stock_file(fileList)
[1] "stock2021-11-05.txt"
I have been working on a dataset of folders and subfolders (folder -> subfolder -> file).
I have trouble reading the first 10 folders of data. I have used the code below, but it doesn't work. Please help.
> for(i in seq_along(my_folders)){
+ my_data[[[i]]] = list.files(path = "~/dataset1", recursive = TRUE)
Below is the problem I have reading a txt file in a subfolder:
> for(i in 1:13){
+ current_dir = dirs[i]
+ lines = readLines(mydata[[i]])}
This gives error: Error in file(con, "r") : invalid 'description' argument
But outside of the loop this works:
> lines <- readLines(my_data[[1]])
What do you think of this:
dirs = list.dirs("~/dataset1", recursive = FALSE)  # all top-level directories/folders
mydata = list()                                    # create an empty list
for (i in 1:10) {                                  # only take the first 10 directories
  current_dir = dirs[i]
  mydata[[i]] = list.files(path = current_dir, recursive = TRUE, full.names = TRUE)
}
You only have to adapt it to your folder structure.
@sequoia's answer works, but in R you can take advantage of concise functional programming, which @langtang's answer gets at with lapply(). Try this one-liner:
library(tidyverse)
library(fs)
# map() rather than walk(): walk() is for side effects and discards its results
d <- dir_ls("path/to/folders", recurse = TRUE, type = "file") %>% map(read_lines)
Use dir to get a vector of file names, for example all .txt files in folder "f" and all its subfolders:
files <- dir("f", pattern = "\\.txt$", full.names = TRUE, recursive = TRUE)
files
[1] "f/f1/f1_1/f1_1.txt"
[2] "f/f1/f1_2/f1_2.txt"
[3] "f/f2/f2_1/f2_1.txt"
[4] "f/f2/f2_2/f2_2.txt"
Then read them using readLines
lapply(files, readLines)
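If you want to keep track of which lines came from which file, a small sketch: name the list by file name.
contents <- setNames(lapply(files, readLines), basename(files))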
I am looking for an elegant way to insert a character string (a name) into a directory path and create a .csv file. I found one possible solution; however, I am looking for another one that does not "replace" but "inserts" text between specific characters.
#lets start
df <-data.frame()
name <- c("John Johnson")
dir <- c("C:/Users/uzytkownik/Desktop/.csv")
#how to insert "name" vector between "Desktop/" and "." to get:
dir <- c("C:/Users/uzytkownik/Desktop/John Johnson.csv")
write.csv(df, file=dir)
#???
#I found the answer but it is not very elegant in my opinion
library(qdapRegex)
dir2 <- c("C:/Users/uzytkownik/Desktop/ab.csv")
dir2<-rm_between(dir2,'a','b', replacement = name)
> dir2
[1] "C:/Users/uzytkownik/Desktop/John Johnson.csv"
write.csv(df, file=dir2)
I like sprintf syntax for "fill-in-the-blank" style string construction:
name <- c("John Johnson")
sprintf("C:/Users/uzytkownik/Desktop/%s.csv", name)
# [1] "C:/Users/uzytkownik/Desktop/John Johnson.csv"
Another option, if you can't put the %s in the directory string, is to use sub. This is replacing, but it replaces .csv with <name>.csv.
dir <- c("C:/Users/uzytkownik/Desktop/.csv")
sub(".csv", paste0(name, ".csv"), dir, fixed = TRUE)
# [1] "C:/Users/uzytkownik/Desktop/John Johnson.csv"
This should get you what you need.
dir <- "C:/Users/uzytkownik/Desktop/.csv"
name <- "joe depp"
dirsplit <- strsplit(dir, "/\\.")
paste0(dirsplit[[1]][1], "/", name, ".", dirsplit[[1]][2])
[1] "C:/Users/uzytkownik/Desktop/joe depp.csv"
I find that paste0() is the way to go, so long as you store your directory and extension separately:
path <- "some/path/"
file <- "file"
ext <- ".csv"
write.csv(myobj, file = paste0(path, file, ext))
For those unfamiliar, paste0() is shorthand for paste(..., sep = "").
Let's suppose you have a list with the desired names for some data structures you want to save, for instance:
names <- c("file_1", "file_2", "file_3")
Now, you want to update the path where you are going to save your files, adding the name plus the extension:
path <- "/Users/Documents/Test_Folder/"
extension <- ".csv"
A simple way to achieve this is to use paste0() to create the full path as input for write.csv() inside an lapply(), as follows:
lapply(names, function(x) {
  write.csv(x = data,
            file = paste0(path, x, extension))  # paste0 avoids the spaces paste() would insert
})
The good thing about this approach is that you can iterate over the list containing the names of your files, and the final path is updated automatically. One possible extension is to define a list of extensions and update the path accordingly, as sketched below.
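A minimal sketch of that extension, assuming one extension per file name (path, names and data as defined above; the extensions vector is hypothetical):
extensions <- c(".csv", ".txt", ".csv")  # hypothetical: one extension per name
Map(function(n, e) write.csv(x = data, file = paste0(path, n, e)), names, extensions)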
I want to read the most recently modified (or created) CSV file in each of several directories and then put them into a pre-existing single data frame (df_total).
I have two kinds of directories to read:
A:/LogIIS/FOLDER01/files.csv
Others contain several subfolders, each with its own files.csv, as in the example below:
"A:/LogIIS/FOLDER02/FOLDER_A/"files.csv"
"A:/LogIIS/FOLDER02/FOLDER_B/"files.csv"
"A:/LogIIS/FOLDER02/FOLDER_C/"files.csv"
"A:/LogIIS/FOLDER03/FOLDER_A/"files.csv"
"A:/LogIIS/FOLDER03/FOLDER_B/"files.csv"
"A:/LogIIS/FOLDER03/FOLDER_C/"files.csv"
"A:/LogIIS/FOLDER03/FOLDER_D/"files.csv"
Something like this...
#get a vector of all filenames
files <- list.files(path="A:/LogIIS",pattern="files.csv",full.names = TRUE,recursive = TRUE)
#get the directory names of these (for grouping)
dirs <- dirname(files)
#find the last file in each directory (i.e. latest modified time)
lastfiles <- tapply(files,dirs,function(v) v[which.max(file.mtime(v))])
You can then loop through these and read them in.
If you just want the latest file overall, this will be files[which.max(file.mtime(files))].
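To finish the job from the question, a sketch of the read-and-append step, assuming all the CSVs share df_total's columns:
df_total <- rbind(df_total, do.call(rbind, lapply(lastfiles, read.csv)))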
Here is a tidyverse-friendly solution:
library(tidyverse)
list.files("data/", full.names = TRUE) %>%
  enframe(name = NULL) %>%
  bind_cols(pmap_df(., file.info)) %>%
  filter(mtime == max(mtime)) %>%
  pull(value)
Consider creating a data frame of files, since file.info returns OS file-system metadata for each path, such as created time:
setwd("A:/LogIIS")
files <- list.files(getwd(), full.names = TRUE, recursive = TRUE)
# DATAFRAME OF FILE, DIR, AND METADATA
filesdf <- cbind(file = files,
                 dir = dirname(files),
                 data.frame(file.info(files), row.names = NULL),
                 stringsAsFactors = FALSE)
# SORT BY DIR AND CREATED TIME (DESC)
filesdf <- with(filesdf, filesdf[order(dir, -xtfrm(ctime)), ])
# AGGREGATE LATEST FILE PER DIR
latestfiles <- aggregate(. ~ dir, filesdf, FUN = function(i) head(i)[[1]])
# LOOP THROUGH LATEST FILE VECTOR FOR IMPORT
df_total <- do.call(rbind, lapply(latestfiles$file, read.csv))
Here is a pipe-friendly way to get the most recent file in a folder. It uses an anonymous function, which in my view is slightly more readable than a one-liner. file.mtime is faster than file.info(fpath)$ctime.
library(magrittr)  # for %>%
dir(path = "your_path_goes_here", full.names = TRUE) %>% # on Windows, use pattern = "^your_pattern"
  (function(fpath){
    ftime <- file.mtime(fpath)      # file.info(fpath)$ctime for file CREATED time
    return(fpath[which.max(ftime)]) # returns the most recent file path
  })
I feel I am very close to the solution, but at the moment I can't figure out how to get there.
I've got the following problem.
In my folder "Test" I've got stacked data files with the names M1_1, M1_2, M1_3 and so on: /Test/M1_1.dat, for example.
Now I want to separate the files, so that I get M1_1[1].dat, M1_1[2].dat, M1_1[3].dat and so on. I'd like to save these files in specific subfolders: Test/M1/M1_1[1], Test/M1/M1_1[2] and so on, and Test/M2/M1_2[1], Test/M2/M1_2[2] and so on.
I have already created the subfolders, and I have the following command to split up the files, which gives me M1_1.dat[1] and so on:
for (e in dir(path = "Test/", pattern = "\\.dat$", full.names = TRUE, recursive = TRUE)){
  data <- read.table(e, header = TRUE)
  df <- data[ -c(2) ]
  out <- split(df, f = df$.imp)
  lapply(names(out), function(z){
    write.table(out[[z]], paste0(e, "[", z, "].dat"),
                sep = "\t", row.names = FALSE, col.names = FALSE)})
}
Now the paste0 command gets me my desired split-up data (although it's M1_1.dat[1] instead of M1_1[1].dat), but I can't figure out how to get this data into my subfolders.
Maybe you've got an idea?
Thanks in advance.
I don't have any idea what your data looks like, so I am going to attempt to recreate the scenario with the gender datasets available at baby names.
Assuming all the files from the zip folder are stored in "inst/data",
store all file paths in the all_fi variable:
all_fi <- list.files("inst/data",
                     full.names = TRUE,
                     recursive = TRUE,
                     pattern = "\\.txt$")
> head(all_fi, 3)
[1] "inst/data/yob1880.txt" "inst/data/yob1881.txt"
Set up a function that will be applied to each file in the directory:
library(dplyr)
library(data.table)
library(jsonlite)  # for rbind.pages()
library(tools)     # for file_path_sans_ext()
f.it <- function(f_in = NULL){
  # Create the new folder based on the existing basename of the input file
  new_folder <- file_path_sans_ext(f_in)
  dir.create(new_folder)
  data.table::fread(f_in) %>%
    select(name = 1, gender = 2, freq = 3) %>%
    mutate(
      gender = ifelse(grepl("F", gender), "female", "male")
    ) %>% (function(x){
      # Dataset contains names for males and females
      # so that's what I'm using to mimic your split
      out <- split(x, x$gender)
      o <- rbind.pages(
        lapply(names(out), function(i){
          # New filename for each iteration of the split dataframes
          ###### THIS IS WHERE YOU NEED TO TWEAK FOR YOUR NEEDS
          new_dest_file <- sprintf("%s/%s.txt", new_folder, i)
          # Write the sub-data-frame to the new file
          data.table::fwrite(out[[i]], new_dest_file)
          # For our purposes return a dataframe with file info on the new
          # files...
          data.frame(
            file_name = new_dest_file,
            file_size = file.size(new_dest_file),
            stringsAsFactors = FALSE)
        })
      )
      o
    })
}
Now we can just loop through:
NOTE: for my purposes I'm not going to spend time looping through each file, for your purposes this would apply to each of your initial files, or in my case all_fi rather than all_fi[2:5].
> rbind.pages(lapply(all_fi[2:5], f.it))
============================ =========
file_name file_size
============================ =========
inst/data/yob1881/female.txt 16476
inst/data/yob1881/male.txt 15306
inst/data/yob1882/female.txt 18109
inst/data/yob1882/male.txt 16923
inst/data/yob1883/female.txt 18537
inst/data/yob1883/male.txt 15861
inst/data/yob1884/female.txt 20641
inst/data/yob1884/male.txt 17300
============================ =========
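Applied back to the original question, the key tweak is the destination path inside the loop: build the subfolder from the file name and put the [z] index before the extension. A sketch using the e, out and z from the question's loop (the file-name-to-subfolder rule below is a guess; adapt the sub() call to your naming scheme):
base <- tools::file_path_sans_ext(basename(e))      # e.g. "M1_1"
sub_dir <- file.path("Test", sub("_.*$", "", base)) # e.g. "Test/M1", created beforehand
write.table(out[[z]], file.path(sub_dir, paste0(base, "[", z, "].dat")),
            sep = "\t", row.names = FALSE, col.names = FALSE)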
I'm using R to visualize some data, all of which is in .txt format. There are a few hundred files in a directory, and I want to load them all into one table in one shot.
Any help?
EDIT:
Listing the files is not a problem. But I am having trouble going from list to content. I've tried some of the code from here, but I get a bug with this part:
all.the.data <- lapply( all.the.files, txt , header=TRUE)
saying
Error in match.fun(FUN) : object 'txt' not found
Any snippets of code that would clarify this problem would be greatly appreciated.
You can try this:
filelist = list.files(pattern = "\\.txt$")
# assuming tab-separated values with a header
datalist = lapply(filelist, function(x) read.table(x, header = TRUE))
#assuming the same header/columns for all files
datafr = do.call("rbind", datalist)
There are three fast ways to read multiple files and put them into a single data frame or data table
First get the list of all txt files (including those in sub-folders)
list_of_files <- list.files(path = ".", recursive = TRUE,
                            pattern = "\\.txt$",
                            full.names = TRUE)
1) Use fread() w/ rbindlist() from the data.table package
#install.packages("data.table", repos = "https://cran.rstudio.com")
library(data.table)
# Read all the files and create a FileName column to store filenames
DT <- rbindlist(sapply(list_of_files, fread, simplify = FALSE),
                use.names = TRUE, idcol = "FileName")
2) Use readr::read_table2() w/ purrr::map_df() from the tidyverse framework:
#install.packages("tidyverse",
# dependencies = TRUE, repos = "https://cran.rstudio.com")
library(tidyverse)
# Read all the files and create a FileName column to store filenames
df <- list_of_files %>%
  set_names(.) %>%
  map_df(read_table2, .id = "FileName")
3) (Probably the fastest out of the three) Use vroom::vroom():
#install.packages("vroom",
# dependencies = TRUE, repos = "https://cran.rstudio.com")
library(vroom)
# Read all the files and create a FileName column to store filenames
df <- vroom(list_of_files, id = "FileName")  # note: vroom's argument is id, not .id
Note: to clean up file names, use the basename or gsub functions, as sketched below.
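For example, on the data.table result from option 1 (FileName is the column created by idcol above):
DT[, FileName := tools::file_path_sans_ext(basename(FileName))]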
Benchmark: readr vs data.table vs vroom for big data
Edit 1: to read multiple csv files and skip the header using readr::read_csv
list_of_files <- list.files(path = ".", recursive = TRUE,
pattern = "\\.csv$",
full.names = TRUE)
df <- list_of_files %>%
purrr::set_names(nm = (basename(.) %>% tools::file_path_sans_ext())) %>%
purrr::map_df(read_csv,
col_names = FALSE,
skip = 1,
.id = "FileName")
Edit 2: to convert a pattern including a wildcard into the equivalent regular expression, use glob2rx(), for example:
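A quick illustration; glob2rx() turns a shell-style wildcard into the regex that list.files() expects:
glob2rx("*.txt")
# [1] "^.*\\.txt$"
list_of_files <- list.files(pattern = glob2rx("*.txt"), recursive = TRUE, full.names = TRUE)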
There is a really, really easy way to do this now: the readtext package.
readtext::readtext("path_to/your_files/*.txt")
It really is that easy.
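The result is a data frame with one row per file and two columns, doc_id (the file name) and text (the file's contents), so it drops straight into a regular data-frame workflow.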
Look at the help for the functions dir() aka list.files(). They allow you to get a list of files, possibly filtered by regular expressions, over which you can loop.
If you want them all at once, you first have to have the content in one stream. One option would be to use cat to write all the files to stdout and read that through a pipe() connection. See help(connections) for more.
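A minimal sketch of that idea (Unix-like shells only; the directory path is a placeholder):
con <- pipe("cat /path/to/dir/*.txt")  # stream all files through one connection
all_lines <- readLines(con)
close(con)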
Thanks for all the answers!
In the meantime, I also hacked together a method of my own. Let me know if it is of any use:
library(foreign)
setwd("/path/to/directory")
files <- list.files()
data <- character(0)  # start empty so no stray 0 ends up in the result
for (f in files) {
  tempData = scan(f, what = "character")
  data <- c(data, tempData)
}