I have 45 csv files in a folder called myFolder. Each csv file has 13 columns and 640 rows.
I want to read each csv, divide columns 7:12 by 10, and save the result in a new folder called 'newFolder'. Here's my approach, which
uses a simple for loop.
library(data.table)
dir.create('newFolder')
allFiles <- list.files(file.path('myFolder'), pattern = '.csv')
for(a in seq_along(allFiles)){
fileRef <- allFiles[a]
temp <- fread(file.path('myFolder', fileRef))
temp[, 7:12] <- temp[, 7:12]/10
fwrite(temp, file.path('newFolder', paste0('new_',fileRef)))
}
Is there a simpler solution, in a line or two, using data.table and an apply function to achieve this?
Your code is already pretty good but these improvements could be made:
define the input and output folders up front for modularity
use full.names = TRUE so that allFiles contains complete paths
use .csv$ as the pattern to anchor it to the end of the filename ('\\.csv$' is stricter still, since an unescaped dot matches any character)
iterate over the full names rather than an index
use basename in fwrite to extract out the base name from the path name
The code is then
library(data.table)
myFolder <- "myFolder"
newFolder <- "newFolder"
dir.create(newFolder)
allFiles <- list.files(myFolder, pattern = '.csv$', full.names = TRUE)
for(f in allFiles) {
temp <- fread(f)
temp[, 7:12] <- temp[, 7:12] / 10
fwrite(temp, file.path(newFolder, paste0('new_', basename(f))))
}
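For the "line or two" asked about, one possible data.table variant updates the columns by reference with := and .SDcols and wraps the work in lapply(). This is only a minimal sketch: the temporary folders, the two demo files, and the helper name scale_file are assumptions made so the example runs on its own.

```r
library(data.table)

# Hypothetical demo folders under tempdir(), standing in for myFolder/newFolder
in_dir  <- file.path(tempdir(), "myFolder")
out_dir <- file.path(tempdir(), "newFolder")
dir.create(in_dir, showWarnings = FALSE)
dir.create(out_dir, showWarnings = FALSE)

# Two small 13-column demo files filled with the value 10
for (nm in c("a.csv", "b.csv")) {
  fwrite(as.data.table(matrix(10, nrow = 3, ncol = 13)), file.path(in_dir, nm))
}

# Hypothetical helper: read one file, divide the chosen columns by 10, write it out
scale_file <- function(f, cols = 7:12) {
  dt <- fread(f)
  dt[, (cols) := lapply(.SD, function(v) v / 10), .SDcols = cols]
  fwrite(dt, file.path(out_dir, paste0("new_", basename(f))))
}

all_files <- list.files(in_dir, pattern = "\\.csv$", full.names = TRUE)
invisible(lapply(all_files, scale_file))
```

Once the helper exists, the per-file work is the single lapply() call at the end. It is unlikely to be faster than the loop, since reading and writing the files dominates the run time.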
You can use purrr::walk if you want to improve readability of your code and get rid of the loop:
library(data.table)
allFiles <- list.files('myFolder', pattern = '\\.csv$')
purrr::walk(allFiles, function(x){
  temp <- fread(file.path('myFolder', x))
  temp[, 7:12] <- temp[, 7:12]/10
  # write to the output folder (assumed to exist already), using the current file name x
  fwrite(temp, file.path('newFolder', paste0('new_', x)))
})
From the reference page of purrr::walk:
walk() returns the input .x (invisibly)
I don't think it helps speed-wise, though.
Related
I have 332 csv files and each file has the same number of variables and the same format, and I need to create a function that every time the user calls it, can specify the folder where the csv files are located and the id of the csv files they want to store in one data frame.
The name of the files follows the next format: 001.csv, 002.csv ... 332.csv.
data <- function(directory, id_default = 1:332){
setwd(paste0("/Users/", directory))
id <- id_default
for(i in length(id)){
if(i < 10){
aux <- paste0("00",i)
filename <- paste0(aux,".csv")
}else if(i < 100){
aux <- paste0("0", i)
filename <- paste0(aux, ".csv")
}else if(i >= 100){
filename <- paste0(i, ".csv")
}
my_dataframe <- do.call(rbind, lapply(filename, read.csv))
}
my_dataframe #Print dataframe
}
But the problem is that it only stores the last csv file; it seems that every time it enters the loop, it overwrites the data frame with the last csv file.
How do I fix it? Plz help
Here, length(id) is a single number, so the loop runs only once, with i set to the last 'id', i.e. the length. Instead it should be
for(i in 1:length(id))
Or more correctly
for(i in seq_along(id))
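To see the difference between the three forms, the following sketch counts how many times each loop body would run. The helper count_iterations is hypothetical and exists only for this illustration.

```r
# Hypothetical helper: count how many times a for loop over index_set runs
count_iterations <- function(index_set) {
  n <- 0
  for (i in index_set) n <- n + 1
  n
}

id <- 1:5
n_length <- count_iterations(length(id))      # loops over the single value 5
n_colon  <- count_iterations(1:length(id))    # loops over 1, 2, 3, 4, 5
n_along  <- count_iterations(seq_along(id))   # loops over 1, 2, 3, 4, 5

# With an empty vector, 1:length(x) becomes 1:0, i.e. c(1, 0)
empty <- integer(0)
n_colon_empty <- count_iterations(1:length(empty))  # body still runs twice
n_along_empty <- count_iterations(seq_along(empty)) # body never runs
```

The empty-vector case is why seq_along() is the safer habit: 1:length(x) counts down from 1 to 0 instead of skipping the loop.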
In addition to the issue with looping, the if/else if is not really needed. We could use sprintf
filenames <- sprintf('%03d.csv', id)
i.e.
data <- function(directory, id_default = 1:332){
setwd(paste0("/Users/", directory))
filenames <- sprintf('%03d.csv', id_default)
do.call(rbind, lapply(filenames, read.csv))
}
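A variant of the same idea that does not change the working directory is sketched below. It assumes the caller passes a complete directory path (rather than a name under "/Users/"), and the function name read_ids and the demo files are made up for the example.

```r
# Hypothetical variant: build full paths instead of calling setwd()
read_ids <- function(directory, id = 1:332) {
  paths <- file.path(directory, sprintf("%03d.csv", id))
  do.call(rbind, lapply(paths, read.csv))
}

# Demo data: three small files named 001.csv, 020.csv and 300.csv
demo_dir <- file.path(tempdir(), "ids_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (i in c(1, 20, 300)) {
  write.csv(data.frame(id = i, value = i * 2),
            file.path(demo_dir, sprintf("%03d.csv", i)),
            row.names = FALSE)
}

# Read only the requested ids and stack them into one data frame
combined <- read_ids(demo_dir, id = c(1, 300))
```

Because the paths are built with file.path(), the caller's working directory is left untouched after the function returns.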
A tidy solution will use purrr (better than a loop for this task): https://purrr.tidyverse.org/reference/map.html
library(tidyverse)
directory <- "directory"
id <- c(1,20,300)
# add leading 0s with stringr's str_pad
id <- str_pad(id, 3, pad = "0")
It is best to avoid using setwd() like this.
Instead, add directory to the file paths.
paths <- str_c(directory, "/", id, ".csv")
# map files to that function (similar to a loop) and stack rows
map_dfr(paths, read_csv)
Even better, use here()--it makes file paths work: https://github.com/jennybc/here_here
paths <- str_c(
here::here(directory, id),
".csv")
# map files to that function (similar to a loop) and stack rows
map_dfr(paths, read_csv)
Your example seems to want to make the default ids 1:332. If we wanted all files in the directory, we could use paths <- list.files(here::here(directory), full.names = TRUE).
read_my_data <- function(directory, id = 1:332){
paths <- str_c(
here::here(directory, str_pad(id, 3, pad = "0")),
".csv")
map_dfr(paths, read_csv)
}
read_my_data("directory")
If you need to combine files from multiple directories in parallel, you can use pmap_dfr()
How can I read many CSV files and make each of them into data tables?
I have files of 'A1.csv' 'A2.csv' 'A3.csv'...... in Folder 'A'
So I tried this.
link <- c("C:/A")
filename<-list.files(link)
listA <- c()
for(x in filename) {
temp <- read.csv(paste0(link , x), header=FALSE)
listA <- list(unlist(listA, recursive=FALSE), temp)
}
And it doesn't work well. How can I do this job?
Write a regex to match the filenames
reg_expression <- "A[0-9]+"
files <- grep(reg_expression, list.files(link), value = TRUE)
and then run the same loop, but use assign to name the data frames dynamically if you want
for(file in files){
assign(paste0(file, "_df"), read.csv(file.path(link, file)))
}
But in general introducing unknown variables into the scope is bad practice so it might be best to do a loop like
dfs <- list()
for(index in seq_along(files)){
file <- files[index]
dfs[[index]] <- read.csv(file.path(link, file))
}
Unless each file is a completely different structure (i.e., different columns ... the number of rows does not matter), you can consider a more efficient approach of reading the files in using lapply and storing them in a list. One of the benefits is that whatever you do to one frame can be immediately done to all of them very easily using lapply.
files <- list.files(link, full.names = TRUE, pattern = "csv$")
list_of_frames <- lapply(files, read.csv)
# optional
names(list_of_frames) <- files # or basename(files), if filenames are unique
Something like sapply(list_of_frames, nrow) will tell you how many rows are in each frame. If you have something more complex,
new_list_of_frames <- lapply(list_of_frames, function(x) {
# do something with 'x', a single frame
})
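As a concrete illustration of the list-of-frames pattern, here is a small self-contained sketch. The demo folder, the two files, and the ratio column are invented for the example.

```r
# Demo data: two small csv files with the same columns but different row counts
demo_dir <- file.path(tempdir(), "frames_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(x = 1:3, y = c(2, 4, 6)),
          file.path(demo_dir, "A1.csv"), row.names = FALSE)
write.csv(data.frame(x = 1:2, y = c(10, 20)),
          file.path(demo_dir, "A2.csv"), row.names = FALSE)

# Read every file into one named list
files <- list.files(demo_dir, full.names = TRUE, pattern = "csv$")
list_of_frames <- lapply(files, read.csv)
names(list_of_frames) <- basename(files)

# One call summarises every frame
row_counts <- sapply(list_of_frames, nrow)

# One call applies the same change to every frame
with_ratio <- lapply(list_of_frames, function(x) {
  x$ratio <- x$y / x$x
  x
})
```

Individual frames stay reachable by name, for example with_ratio[["A2.csv"]], without creating one global variable per file.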
The most immediate problem is that when pasting your file path together, you need a path separator. When composing file paths, it's best to use the function file.path, as it inserts the platform's path separator between the components. So you want to use:
read.csv(file.path(link , x), header=FALSE)
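The difference is easy to check with the values from the question (the file name A1.csv is taken from the question's description of the folder):

```r
link <- "C:/A"
x <- "A1.csv"

# paste0() simply concatenates, so the separator is missing
pasted <- paste0(link, x)     # "C:/AA1.csv"

# file.path() puts a separator between its arguments
joined <- file.path(link, x)  # "C:/A/A1.csv"
```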
Better yet, just have the full path returned when listing out the files (and can filter for .csv):
filename <- list.files(link, full.names = TRUE, pattern = "csv$")
Combining with the idea to use assign to dynamically create the variables:
link <- c("C:/A")
files <-list.files(link, full.names = TRUE, pattern = "csv$")
for(file in files){
assign(paste0(basename(file), "_df"), read.csv(file))
}
I want to use a loop to read in multiple csv files and append a list in R.
path = "~/path/to/csv/"
file.names <- dir(path, pattern =".csv")
mylist=c()
for(i in 1:length(file.names)){
datatmp <- read.csv(file.names[i],header=TRUE, sep=";", stringsAsFactors=FALSE)
listtmp = datatmp[ ,6]
finallist <- append(mylist, listtmp)
}
finallist
For each csv file, the desired column has a different length.
In the end, I want to get the full appended list with all values in that certain column from all csv files.
I am fairly new to R, so I am not sure what I'm missing...
There are three errors in your approach.
First, file.names <- dir(path, pattern =".csv") returns just the file names, without the path. So when you try to import them, read.csv() can't find them.
Building the path
You can build the right path using paste0():
path = "~/path/to/csv/"
file.names <- paste0(path, dir(path, pattern =".csv"))
Or file.path(), which adds the slashes automatically.
path = "~/path/to/csv"
file.names <- file.path(path, dir(path, pattern =".csv"))
Another way to create the path, which I find more efficient, is the one from the answer mentioned in Tung's comment.
file.names <- list.files(path = "~/path/to/csv", recursive = TRUE,
pattern = "\\.csv$", full.names = TRUE)
This is better because, in addition to doing everything in one step, you can use it in a directory containing files of various formats. The code above will match all .csv files in the folder and, because recursive = TRUE, in its subfolders.
Importing, selecting and creating the list
The second error is in mylist <- c(). You want a list, but this creates a vector. So the correct version is:
mylist <- list()
And the last error is inside the loop. Instead of creating another list when appending, use the same object created before the loop:
for(i in 1:length(file.names)){
datatmp <- read.csv(file.names[i], sep=";", stringsAsFactors=FALSE)
listtmp = datatmp[, 6]
mylist <- append(mylist, list(listtmp))
}
mylist
Another approach, easier and cleaner, is looping with lapply(). Just this:
mylist <- lapply(file.names, function(x) {
df <- read.csv(x, sep = ";", stringsAsFactors = FALSE)
df[, 6]
})
Hope it helps!
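Since the goal in the question is one combined set of values, the list returned by lapply() can be flattened with unlist(). The sketch below is self-contained; the demo folder and the two semicolon-separated files are made up, with the sixth column holding the values of interest.

```r
# Demo data: two semicolon-separated files whose 6th column has different lengths
demo_dir <- file.path(tempdir(), "append_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.table(data.frame(a = 1:2, b = 1:2, c = 1:2, d = 1:2, e = 1:2, f = c(10, 20)),
            file.path(demo_dir, "one.csv"), sep = ";", row.names = FALSE)
write.table(data.frame(a = 1:3, b = 1:3, c = 1:3, d = 1:3, e = 1:3, f = c(30, 40, 50)),
            file.path(demo_dir, "two.csv"), sep = ";", row.names = FALSE)

file.names <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# One list element per file, each holding that file's 6th column
mylist <- lapply(file.names, function(x) {
  df <- read.csv(x, sep = ";", stringsAsFactors = FALSE)
  df[, 6]
})

# Flatten the list into a single vector of all values
all_values <- unlist(mylist)
```

Keeping the list is useful when you need to know which file a value came from; unlist() is for when only the pooled values matter.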
I have 40 text files with names :
[1] "2006-03-31.txt" "2006-06-30.txt" "2006-09-30.txt" "2006-12-31.txt" "2007-03-31.txt"
[6] "2007-06-30.txt" "2007-09-30.txt" "2007-12-31.txt" "2008-03-31.txt" etc...
I need to extract one specific value from each file. I know how to do it individually, but this takes a while:
m_value1 <- `2006-03-31.txt`$Marknadsvarde_tot[1]
m_value2 <- `2006-06-30.txt`$Marknadsvarde_tot[1]
m_value3 <- `2006-09-30.txt`$Marknadsvarde_tot[1]
m_value4 <- `2006-12-31.txt`$Marknadsvarde_tot[1]
Can someone help me with a for loop which would extract the data from a specific column and row through all the different text files please?
Assuming your files are all in the same folder, you can use list.files to get the names of all the files, then loop through them and get the value you need. So something like this?
m_value<-character() #or whatever the type of your variable is
filelist<-list.files(path="...", full.names = TRUE)
for (i in seq_along(filelist)){
df<-read.table(filelist[i], header=TRUE)
m_value[i]<-df$Marknadsvarde_tot[1]
}
EDIT:
In case you have already imported all the data, you can use get:
txt_files <- list.files(pattern = "\\.txt$")
for(i in txt_files) {
x <- read.delim(i, header=TRUE)
assign(i,x)
}
m_value<-character()
for(i in 1:length(txt_files)) {
m_value[i] <- get(txt_files[i])$Marknadsvarde_tot[1]
}
You could utilize the select-parameter from fread of the data.table-package for this:
library(data.table)
file.list <- list.files(pattern = '.txt')
lapply(file.list, fread, select = 'Marknadsvarde_tot', nrows = 1, header = TRUE)
This will result in a list of data.tables. If you just want a vector with all the values:
sapply(file.list, function(x) fread(x, select = 'Marknadsvarde_tot', nrows = 1, header = TRUE)[[1]])
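If data.table is not available, roughly the same thing can be sketched in base R by reading only the header and the first data row of each file. The demo files, the second column Other, and the numeric values are assumptions for illustration; the real files may use a different delimiter.

```r
# Demo data: two whitespace-delimited text files with a header row
demo_dir <- file.path(tempdir(), "txt_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.table(data.frame(Marknadsvarde_tot = c(100.5, 1), Other = c("a", "b")),
            file.path(demo_dir, "2006-03-31.txt"), row.names = FALSE, quote = FALSE)
write.table(data.frame(Marknadsvarde_tot = c(200.25, 2), Other = c("c", "d")),
            file.path(demo_dir, "2006-06-30.txt"), row.names = FALSE, quote = FALSE)

file.list <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)

# Read the header plus one data row per file and keep the wanted value
m_value <- vapply(file.list, function(f) {
  as.numeric(read.table(f, header = TRUE, nrows = 1)$Marknadsvarde_tot)
}, numeric(1))

# Label each value with the file it came from
names(m_value) <- basename(file.list)
```

vapply() with numeric(1) fails loudly if a file yields anything other than a single number, which is a useful check across 40 files.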
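temp = list.files(pattern="\\.txt$")
library(data.table)
# read every file with fread and create one object per file in the global environment,
# named after the file with the .txt extension removed
list2env(
lapply(setNames(temp, make.names(gsub("\\.txt$", "", temp))),
fread), envir = .GlobalEnv)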
Added data.table to an existing answer at Importing multiple .csv files into R
After you have read all your files, you can get data from the data.tables using DT[i, j, by], where i is your row condition.
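A tiny illustration of the DT[i, j, by] form follows. The table contents and the grouping column fund are invented for the example; only the column name Marknadsvarde_tot comes from the question.

```r
library(data.table)

# Hypothetical table standing in for one imported file
DT <- data.table(
  Marknadsvarde_tot = c(100, 250, 80),
  fund = c("a", "b", "a")
)

# i: keep rows above 90; j: sum the column; by: one result row per fund
res <- DT[Marknadsvarde_tot > 90, .(total = sum(Marknadsvarde_tot)), by = fund]

# First value of the column, as asked for in the question
first_value <- DT[1, Marknadsvarde_tot]
```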
I am new to R and am currently working on a set of financial data. I have around 10 csv files in my working directory, and I want to analyze one of them and apply the same commands to the rest of the csv files.
Here are all the names of these files: ("US%10y.csv", "UK%10y.csv", "GER%10y.csv","JAP%10y.csv", "CHI%10y.csv", "SWI%10y.csv","SOA%10y.csv", "BRA%10y.csv", "CAN%10y.csv", "AUS%10y.csv")
For example, because the Date column in the CSV files is read as a factor, I need to convert it to Date format:
CAN <- read.csv("CAN%10y.csv", header = T, sep = ",")
CAN$Date <- as.character(CAN$Date)
CAN$Date <- as.Date(CAN$Date, format ="%m/%d/%y")
CAN_merge <- merge(all.dates.frame, CAN, all = T)
CAN_merge$Bid.Yield.To.Maturity <- NULL
all.dates.frame is a data frame of 731 consecutive days. I want to merge them so that each file will have the same number of rows which later enables me to combine 10 files together to get a 731 X 11 master data frame.
Surely I can copy and paste this code and change the file name, but is there any simple approach to use apply or for loop to do that ???
Thank you very much for your help.
This should do the trick. Leave a comment if a certain part doesn't work. Wrote this blind without testing.
Get a list of files in your current directory whose names end in .csv
L = list.files(".", ".csv")
Loop through the names, read in each file, perform the actions you want, return the data.frame DF_Merge, and store the results in a list.
O = lapply(L, function(x) {
DF <- read.csv(x, header = T, sep = ",")
DF$Date <- as.character(DF$Date)
DF$Date <- as.Date(DF$Date, format ="%m/%d/%y")
DF_Merge <- merge(all.dates.frame, DF, all = T)
DF_Merge$Bid.Yield.To.Maturity <- NULL
return(DF_Merge)})
Bind all the DF_Merge data.frames into one big data.frame
do.call(rbind, O)
I'm guessing you need some kind of indicator, so this may be useful. Create an indicator column based on the first 3 characters of your file name: rep(substring(L, 1, 3), each = 731)
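If the goal is the wide master frame described in the question (one row per date, one column per country) rather than stacked rows, one option is to merge the frames one after another with Reduce(). This is a sketch under assumptions: the value column name Yield is hypothetical, and the demo uses 5 days and 2 files instead of 731 days and 10 files.

```r
# Demo stand-in for all.dates.frame: 5 consecutive days
all.dates.frame <- data.frame(
  Date = seq(as.Date("2020-01-01"), by = "day", length.out = 5)
)

# Demo files with gaps in their dates; "Yield" is a hypothetical column name
demo_dir <- file.path(tempdir(), "yields_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(Date = c("01/01/20", "01/03/20"), Yield = c(1.5, 1.6)),
          file.path(demo_dir, "CAN%10y.csv"), row.names = FALSE)
write.csv(data.frame(Date = c("01/02/20", "01/03/20", "01/05/20"),
                     Yield = c(0.5, 0.6, 0.7)),
          file.path(demo_dir, "US%10y.csv"), row.names = FALSE)

files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Read each file, convert Date, and name the value column after the country code
frames <- lapply(files, function(f) {
  df <- read.csv(f, stringsAsFactors = FALSE)
  df$Date <- as.Date(df$Date, format = "%m/%d/%y")
  names(df)[names(df) == "Yield"] <- sub("%.*$", "", basename(f))
  df
})

# Left-join every frame onto the full date range, one after another
master <- Reduce(function(x, y) merge(x, y, by = "Date", all.x = TRUE),
                 frames, all.dates.frame)
```

Dates missing from a file show up as NA in that country's column, so every column keeps the full set of rows.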
A dplyr solution (though untested since no reproducible example given):
library(dplyr)
file_list <- c("US%10y.csv", "UK%10y.csv", "GER%10y.csv","JAP%10y.csv", "CHI%10y.csv", "SWI%10y.csv","SOA%10y.csv", "BRA%10y.csv", "CAN%10y.csv", "AUS%10y.csv")
can_l <- lapply(
file_list
, read.csv
)
can_l <- lapply(
can_l
, function(df) {
df %>% mutate(Date = as.Date(as.character(Date), format ="%m/%d/%y"))
}
)
# Rows do need to match when column-binding
can_merge <- left_join(
all.dates.frame
, bind_cols(can_l)
)
can_merge <- can_merge %>%
select(-Bid.Yield.To.Maturity)
One possible solution would be to read all the files into R in the form of a list, and then use lapply to apply a function to all data files. For example:
# Create vector of file names in working directory
files <- list.files()
files <- files[grep("csv", files)]
#create empty list
lst <- vector("list", length(files))
#Read files in to list
for(i in 1:length(files)) {
lst[[i]] <- read.csv(files[i])
}
#Apply a function to the list
l <- lapply(lst, function(x) {
x$Date <- as.Date(as.character(x$Date), format = "%m/%d/%y")
return(x)
})
Hope it's helpful.