Naming a dataframe like the path - R

I have a lot of CSV files that need to be standardized. I created a dictionary for doing so, and so far the function I have looks like this:
inputpath <- ("input")
files<- paste0(inputpath, "/",
list.files(path = inputpath, pattern = '*.gz',
full.names = FALSE))
standardizefunctiontofiles = lapply(files, function(x){
DF <- read_delim(x, delim = "|", na="")
names(DF) <- dictionary$final_name[match(names(DF), dictionary$old_name)]
})
Nonetheless, the issue I have is that when I read the CSVs and turn them into data frames they lose their path, and therefore I can't write each of them out as a CSV that matches the input name. What I would normally do is:
output_name <- str_replace(x, "input", "output")
write_delim(DF, output_name, delim = "|")
I was thinking that a way of solving this would be to make this step:
DF <- read_delim(x, delim = "|", na="")
so that the DF gets the name of the path, but I haven't found any solution for that.
Any ideas on how to apply the function to each file and write each one back out as a standardized CSV?

I don't completely understand the question, but as far as I understand it, you want to overwrite the CSV files you are reading with new CSV files that contain the information of the modified (and corrected) data frames.
I think you have a few alternatives.
Option 1) When reading the data, store both the CSV (as a data frame) and the path (as a string) within a list.
This would be something like:
file_list <- list()
for (i in seq_along(files)) {
  file_list[[i]] <- list(df = read_delim(files[[i]], delim = "|", na = ""),
                         path = files[[i]])
}
Then, when you write the corrected data frames, you can use the paths stored in the second element of each inner list. Note that in order to get the path as a string you will need something like file_list[[1]][["path"]].
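For instance, a minimal sketch of that write-out step, assuming the readr and stringr packages and the input/output directory convention from the question:
library(readr)   # write_delim()
library(stringr) # str_replace()

for (item in file_list) {
  output_name <- str_replace(item[["path"]], "input", "output")
  write_delim(item[["df"]], output_name, delim = "|")
}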
Option 2) Use assign
for (i in seq_along(files)) {
  assign(files[[i]], read_delim(files[[i]], delim = "|", na = ""))
}
Option 3) Use do.call and the fact that <- is a function!
for (i in seq_along(files)) {
  do.call("<-", list(files[[i]], read_delim(files[[i]], delim = "|", na = "")))
}
I hope this is useful!!
NB) None of these snippets is implemented as efficiently as possible; they just introduce the idea.
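For completeness, putting Option 1 together with the str_replace() idea from the question, an end-to-end sketch could look like this (assuming an existing output directory mirroring input):
library(readr)
library(stringr)

for (x in files) {
  DF <- read_delim(x, delim = "|", na = "")
  names(DF) <- dictionary$final_name[match(names(DF), dictionary$old_name)]
  write_delim(DF, str_replace(x, "input", "output"), delim = "|")
}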

Related

Read multiple .txt files and add new column identifying file name in R

I have 1500+ .txt files called data_{date from 2015070918 to today}, all with 7 columns of data and a variable number of rows. I have managed to use the following code to extract and merge the data into one table:
files <- list.files(pattern = ".txt")
myData <- lapply(files, function(x) {
  tryCatch(read.table(x, header = F, sep = ','), error = function(e) NULL)
})
Note: there are no headers on the columns; currently I don't even know which variable is which!
At the moment the data only has the date in the file name, so it isn't possible to distinguish between the daily subsets. I want to create an additional column containing the date, which I can extract if I can include the file name in an extra column.
I searched on Stack Exchange and came across this possible solution: Importing multiple .csv files into R and adding a new column with file name
df <- do.call(rbind, lapply(files, function(x)
  cbind(read.csv(x, header = F, sep = ","),
        name = strsplit(x, '\\.')[[1]][1])))
However I get the following error:
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
no lines available in input
I have used read.csv on individual files and they have imported without any issues. Any ideas to resolve this would be greatly appreciated!
This should work, if your read.table command is correct:
myData_list <- lapply(files, function(x) {
  out <- tryCatch(read.table(x, header = F, sep = ','), error = function(e) NULL)
  if (!is.null(out)) {
    out$source_file <- x
  }
  return(out)
})
myData <- data.table::rbindlist(myData_list)
In the past I found that you can spare yourself a lot of headache using data.table::fread instead of read.table. So you could consider this:
myData_list <- lapply(files, function(x) {
  out <- data.table::fread(x, header = FALSE)
  out$source_file <- x
  return(out)
})
myData <- data.table::rbindlist(myData_list)
You can add the tryCatch part back if necessary. Depending on how the files vector looks, basename() might be worth applying to the source_file column.
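For example, a sketch of both variants:
# Strip the directory part after the fact
myData$source_file <- basename(myData$source_file)

# Alternatively, skip the per-file assignment entirely and let rbindlist()
# label each chunk itself via its idcol argument:
myData_list <- lapply(files, data.table::fread, header = FALSE)
names(myData_list) <- basename(files)
myData <- data.table::rbindlist(myData_list, idcol = "source_file")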
You could try using lapply with an index corresponding to each of the files:
files <- list.files(pattern = ".txt")
myData <- lapply(seq_along(files), function(x) {
  tryCatch(
    {
      dt <- read.table(files[x], header = F, sep = ',')
      dt$index <- x # or files[x] if you want to use the file name instead
      dt
    },
    error = function(e) { NULL }
  )
})

Appending a list in a loop (R)

I want to use a loop to read in multiple csv files and append a list in R.
path = "~/path/to/csv/"
file.names <- dir(path, pattern =".csv")
mylist=c()
for(i in 1:length(file.names)){
  datatmp <- read.csv(file.names[i], header=TRUE, sep=";", stringsAsFactors=FALSE)
  listtmp = datatmp[ ,6]
  finallist <- append(mylist, listtmp)
}
finallist
For each csv file, the desired column has a different length.
In the end, I want to get the full appended list with all values in that certain column from all csv files.
I am fairly new to R, so I am not sure what I'm missing...
There are three errors in your approach.
First, file.names <- dir(path, pattern = ".csv") extracts just the file names, without the path, so when you try to import them, read.csv() cannot find the files.
Building the path
You can build the right path with paste0():
path = "~/path/to/csv/"
file.names <- paste0(path, dir(path, pattern =".csv"))
Or with file.path(), which adds the slashes automatically.
path = "~/path/to/csv"
file.names <- file.path(path, dir(path, pattern =".csv"))
Yet another way to create the paths, and to me the most convenient, is the one suggested in the answer linked by Tung:
file.names <- list.files(path = "~/path/to/csv", recursive = TRUE,
                         pattern = "\\.csv$", full.names = TRUE)
This is better because, in addition to doing everything in one step, it works in a directory containing files of various formats: the pattern above matches only files ending in .csv.
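For example, with a hypothetical layout of ~/path/to/csv/a.csv and ~/path/to/csv/2019/b.csv, the call above would return the full paths, descending into subfolders:
file.names
[1] "~/path/to/csv/a.csv"      "~/path/to/csv/2019/b.csv"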
Importing, selecting and creating the list
The second error is in mylist <- c(). You want a list, but this creates a vector. So, the correct form is:
mylist <- list()
And the last error is inside the loop: instead of creating another object when appending, use the same one created before the loop:
for(i in 1:length(file.names)){
  datatmp <- read.csv(file.names[i], sep=";", stringsAsFactors=FALSE)
  listtmp <- datatmp[, 6]
  mylist <- append(mylist, list(listtmp))
}
mylist
Another approach, easier and cleaner, is to loop with lapply(). Just this:
mylist <- lapply(file.names, function(x) {
  df <- read.csv(x, sep = ";", stringsAsFactors = FALSE)
  df[, 6]
})
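And if what you ultimately want is one flat vector of all the values rather than a list of per-file vectors, you can flatten the result afterwards:
finallist <- unlist(mylist, use.names = FALSE)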
Hope it helps!

Using lapply to apply a function over read-in list of files and saving output as new list of files

I'm quite new at R and a bit stuck on what I feel is likely a common operation. I have a number of files (57, with ~1.5 billion rows cumulatively across 6 columns) that I need to perform basic functions on. I'm able to read these files in and perform the calculations I need no problem, but I'm tripping up on the final output. I envision the function working on one file at a time, outputting the worked file and moving on to the next.
After the calculations I would like to output 57 new .txt files named after the file the input data came from. So far I'm able to perform the calculations on smaller test datasets and spit out one appended .txt file, but this isn't what I want as the final output.
#list filenames
files <- list.files(path=, pattern="*.txt", full.names=TRUE, recursive=FALSE)
#begin looping process
loop_output = lapply(files,
  function(x) {
    #Load 'x' file in
    DF <- read.table(x, header = FALSE, sep = "\t")
    #Call calculated height average a name
    R_ref = 1647.038203
    #Add column names to .las data
    colnames(DF) <- c("X","Y","Z","I","A","FC")
    #Calculate return
    DF$R_calc <- (R_ref - DF$Z)/cos(DF$A*pi/180)
    #Calculate intensity
    DF$Ir_calc <- DF$I * (DF$R_calc^2/R_ref^2)
    #Output new .txt with calculated columns
    write.table(DF, file=, row.names = FALSE, col.names = FALSE, append = TRUE, fileEncoding = "UTF-8")
  })
My latest code endeavors have been to mess around with the initial lapply/sapply call, like so:
#begin looping process
loop_output = sapply(names(files),
  function(x) {
As well as the output line:
#Output new .csv with calculated columns
write.table(DF, file=paste0(names(DF), "txt", sep="."),
            row.names = FALSE, col.names = FALSE, append = TRUE, fileEncoding = "UTF-8")
From what I've been reading, the file naming during the write.table output may be one of the pieces I don't have fully aligned with the rest of the script yet. I've been viewing a lot of other questions that I felt were applicable:
Using lapply to apply a function over list of data frames and saving output to files with different names
Write list of data.frames to separate CSV files with lapply
but without luck. I deeply appreciate any insights or pointers toward the right direction for reading in x files, performing the same function on each, and then outputting the same x number of files. Thank you.
The reason the output is directed to the same file is probably that file = paste0(names(DF), "txt", sep=".") returns the same value for every iteration: DF has the same column names in every iteration, therefore names(DF) is the same, and so is the resulting file name. Combined with the append = TRUE option, the result is that all output is written to the same file.
Inside the anonymous function, x is the name of the input file. Instead of using names(DF) as a basis for the output file name you could do some transformation of this character string.
For example, given
x <- "/foo/raw_data.csv"
Inside the function you could do something like this:
infile <- x
outfile <- file.path(dirname(infile), gsub('raw', 'clean', basename(infile)))
outfile
[1] "/foo/clean_data.csv"
Then use the new name for output, with append = FALSE (unless you really need it to be TRUE):
write.table(DF, file = outfile, row.names = FALSE, col.names = FALSE,
            append = FALSE, fileEncoding = "UTF-8")
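Folded back into the original lapply() call, the whole thing could look roughly like this (R_ref as defined in the question; the "raw"-to-"clean" renaming is only an illustration, so substitute whatever naming rule fits your files):
loop_output <- lapply(files, function(x) {
  DF <- read.table(x, header = FALSE, sep = "\t")
  colnames(DF) <- c("X", "Y", "Z", "I", "A", "FC")
  DF$R_calc  <- (R_ref - DF$Z) / cos(DF$A * pi / 180)
  DF$Ir_calc <- DF$I * (DF$R_calc^2 / R_ref^2)
  # one output file per input file (hypothetical naming rule)
  outfile <- file.path(dirname(x), gsub("raw", "clean", basename(x)))
  write.table(DF, file = outfile, row.names = FALSE, col.names = FALSE,
              append = FALSE, fileEncoding = "UTF-8")
})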
Using your code, this is the general idea:
library(purrr)

#list filenames
files <- list.files(path=, pattern="*.txt", full.names=TRUE, recursive=FALSE)

#Call calculated height average a name
R_ref <- 1647.038203

dfTransform <- function(df){
  colnames(df) <- c("X","Y","Z","I","A","FC")
  #Calculate return
  df$R_calc <- (R_ref - df$Z)/cos(df$A*pi/180)
  #Calculate intensity
  df$Ir_calc <- df$I * (df$R_calc^2/R_ref^2)
  return(df)
}

#One output name per input file -- this naming rule is just a placeholder
outfiles <- sub("\\.txt$", "_out.txt", files)

output <- files %>%
  map(read.table, header = FALSE, sep = "\t") %>%
  map(dfTransform) %>%
  map2(outfiles, ~ write.table(.x, file = .y, row.names = FALSE,
                               col.names = FALSE, fileEncoding = "UTF-8"))
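Since that last step is called only for its side effect, purrr's walk2() is arguably the more idiomatic closer; a sketch using the same outfiles vector:
files %>%
  map(read.table, header = FALSE, sep = "\t") %>%
  map(dfTransform) %>%
  walk2(outfiles, ~ write.table(.x, file = .y, row.names = FALSE,
                                col.names = FALSE, fileEncoding = "UTF-8"))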

Combine csv files with common file identifier

I have a list of approximately 500 csv files, each with a filename that consists of a six-digit number followed by a year (ex. 123456_2015.csv). I would like to append together all files that share the same six-digit number. I tried to implement the code suggested in this question:
Import and rbind multiple csv files with common name in R, but I want the appended data to be saved as new csv files in the same directory as the original files. I have also tried the code below, but the csv files it produces contain no data.
rm(list=ls())
filenames <- list.files(path = "C:/Users/smithma/Desktop/PM25_test")
NAPS_ID <- gsub('.+?\\_([0-9]{5,6}?)\\_.+?$', '\\1', filenames)
Unique_NAPS_ID <- unique(NAPS_ID)
n <- length(Unique_NAPS_ID)
for(j in 1:n){
  curr_NAPS_ID <- as.character(Unique_NAPS_ID[j])
  NAPS_ID_pattern <- paste(".+?\\_(", curr_NAPS_ID, "+?)\\_.+?$", sep = "")
  NAPS_filenames <- list.files(path = "C:/Users/smithma/Desktop/PM25_test", pattern = NAPS_ID_pattern)
  write.csv(do.call("rbind", lapply(NAPS_filenames, read.csv, header = TRUE)),
            file = paste("C:/Users/smithma/Desktop/PM25_test/MERGED", "MERGED_", Unique_NAPS_ID[j], ".csv", sep = ""),
            row.names = FALSE)
}
Any help would be greatly appreciated.
Because you're not doing any data manipulation, you don't need to treat the files like tabular data. You only need to copy the file contents.
filenames <- list.files("C:/Users/smithma/Desktop/PM25_test", full.names = TRUE)
NAPS_ID <- substr(basename(filenames), 1, 6)
Unique_NAPS_ID <- unique(NAPS_ID)

for (curr_NAPS_ID in Unique_NAPS_ID) {
  NAPS_filenames <- filenames[startsWith(basename(filenames), curr_NAPS_ID)]
  output_file <- paste0(
    "C:/Users/smithma/Desktop/PM25_test/MERGED_", curr_NAPS_ID, ".csv"
  )
  for (fname in NAPS_filenames) {
    line_text <- readLines(fname)
    # Write the header from the first file
    if (fname == NAPS_filenames[1]) {
      cat(line_text[1], '\n', sep = '', file = output_file)
    }
    # Append every line in the file except the header
    line_text <- line_text[-1]
    cat(line_text, file = output_file, sep = '\n', append = TRUE)
  }
}
My changes:
list.files(..., full.names = TRUE) is usually the best way to go.
Because the digits appear at the start of the filenames, I suggest substr. It's easier to get an idea of what's going on when skimming the code.
Instead of looping over the indices of a vector, loop over the values. It's more succinct and less likely to cause problems if the vector's empty.
startsWith and endsWith are relatively new functions, and they're great.
You only care about copying lines, so just use readLines to get them in and cat to get them out (see the note after this list).
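As a side note: if the files had no header row at all, base R could concatenate them without even parsing the lines, e.g.:
file.create(output_file)                 # start with an empty output file
file.append(output_file, NAPS_filenames) # append the raw contents of each input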
You might consider something like this:
##will take the first 6 characters of each file name
six.digit.filenames <- substr(filenames, 1, 6)
path <- "C:/Users/smithma/Desktop/PM25_test/"
unique.numbers <- unique(six.digit.filenames)
for(j in unique.numbers){
  sub <- filenames[which(substr(filenames, 1, 6) == j)]
  data.for.output <- c()
  for(file in sub){
    ##now do your stuff with these files, including reading them in
    data <- read.csv(paste0(path, file))
    data.for.output <- rbind(data.for.output, data)
  }
  write.csv(data.for.output, paste0(path, j, '.csv'), row.names = FALSE)
}

Function to read in multiple delimited text files

Using this answer, I have created a function that should read in all the text datasets in a directory:
read.delims = function(dir, sep = "\t"){
  # Make a list of all data frames in the "data" folder
  list.data = list.files(dir, pattern = "*.(txt|TXT|csv|CSV)")
  # Read them in
  for (i in 1:length(list.data)) {
    assign(list.data[i],
           read.delim(paste(dir, list.data[i], sep = "/"),
                      sep = sep))
  }
}
However, even though there are .txt and .csv files in the specified directory, no R objects get created (I'm guessing this happens because I'm using read.delim within a function). How can I correct this?
You can add the parameter envir to your assign() call, like this:
read.delims = function(dir, sep = "\t"){
  # Make a list of all data frames in the "data" folder
  list.data = list.files(dir, pattern = "*.(txt|TXT|csv|CSV)")
  # Read them in
  for (i in 1:length(list.data)) {
    assign(list.data[i],
           read.delim(paste(dir, list.data[i], sep = "/"),
                      sep = sep),
           envir = .GlobalEnv)
  }
}
Doing this, your objects will be created in the global environment and not just in the function environment.
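A quick way to check (assuming a folder data/ containing, say, a.csv and b.txt; both names are hypothetical):
read.delims("data")
ls()
[1] "a.csv" "b.txt"
Note that the objects keep their file extensions in their names, so you would access them with backticks, e.g. `a.csv`, or with get("a.csv").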
As I said in my comment, it is necessary to return() a value after assigning. I don't really see the point in using assign() though, so here it is with a simple for-loop, assuming you want your output to be a list of data frames.
Note that I changed the reading function to read.table() for personal convenience. You might want to adjust that.
read.delims <- function(dir, sep = "\t"){
  # Make a list of all data frames in the "data" folder
  list.data <- list.files(dir, pattern = "*.(txt|TXT|csv|CSV)")
  list.out <- as.list(1:length(list.data))
  # Read them in
  for (i in 1:length(list.data)) {
    list.out[[i]] <- read.table(paste(dir, list.data[i], sep = "/"), sep = sep)
  }
  return(list.out)
}
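To keep track of which data frame came from which file, you could also name the list elements after the files just before returning:
names(list.out) <- list.data  # then access e.g. list.out[["myfile.txt"]] (hypothetical name)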
Maybe you should also add a $ (end-of-string anchor) to your regular expression.
Cheers.
