My R script not picking up all the files in the folder - r

My R script is trying to aggregate excel spreadsheets that are in different folders within the Concerned Files folder (shown in the directory below) and putting all the data into one master file. However, the script is randomly selecting files to copy information from and when i run the code, the following error shows so i am assuming this is why it's not choosing every file in the folder?
all_some_data <- rbind(all_some_data, temp)
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
The whole code:
#list of people's name it has to search the folders for. For our purposes, i am only taking one name
managers <- c("Name")
#directory of all the files
directory = 'C:/Users/Username/OneDrive/Desktop/Testing/Concerned Files/'
#Create an empty dataframe
all_HR_data <-
setNames(
data.frame(matrix(ncol = 8, nrow = 0)),
c("Employee", "ID", "Overtime", "Regular", "Total", "Start", "End", "Manager")
)
str(files)
#loop through managers to get time sheets and then add file to combined dataframe
for (i in managers){
#a path to find all the extract files
files <-
list.files(
path = paste(directory, i, "/", sep = ""),
pattern = "*.xls",
full.names = FALSE,
recursive = FALSE
)
#for each file, get a start and end date of period, remove unnecessary columns, rename columns and add manager name
for (j in files){
temp <- read_excel(paste(directory, i, "/", j, sep = ""), skip = 8)
#a bunch of manipulations with the data being copied over. Code not relevant to the problem
all_some_data <- rbind(all_some_data, temp)
}
}

The most likely cause of your problem is an extra column in one or more of your files.
A potential solution along with a performance improvement is to use the bind_rows function from the dplyr package. This function is more fault tolerant than the base R rbind.
Wrap you loop up with lapply statement and then use bind_rows on the entire list of dataframes in one statement.
output <-lapply(files, function(j) {
temp <- read_excel(paste(directory, i, "/", j, sep = ""), skip = 8)
#a bunch of manipulations with the data being copied over.
# Code not relevant to the problem
temp #this is the returned value to the list
})
all_some_data <- dplyr::bind_rows(output)

Related

How to loop over on different files and save the output with filename in R?

I have several files with the names RTDFE, TRYFG, FTYGS, WERTS...like 100 files in txt format. For each file, I'm using the following code and writing the output in a file.
name = c("RTDFE")
file1 <- paste0(name, "_filter",".txt")
file2 <- paste0(name, "_data",".txt")
### One
A <- read.delim(file1, sep = "\t", header = FALSE)
#### two
B <- read.delim(file2, sep = "\t", header = FALSE)
C <- merge(A, B, by="XYZ")
nrow(C)
145
Output:
Samples Common
RTDFE 145
Every time I'm assigning the file to variable name running my code and writing the output in the file. Instead, I want the code to be run on all the files in one go and want the following output. Common is the row of merged data frame C
The output I need:
Samples Common
RTDFE 145
TRYFG ...
FTYGS ...
WERTS ...
How to do this? Any help.
How about putting all your names in a single vector, called names, like this:
names<-c("TRYFG","RTDFE",...)
and then feeding each one to a function that reads the files, merges them, and returns the rows
f<-function(n) {
fs = paste0(n,c("_filter", "_data"),".txt")
C = merge(
read.delim(fs[1],sep="\t", header=F),
read.delim(fs[2],sep="\t", header=F), by="XYZ")
data.frame(Samples=n,Common=nrow(C))
}
Then just call call this function f on each of the values in names, row binding the result together
do.call(rbind, lapply(names, f))
An easy way to create the vector names is like this:
p = "_(filter|data).txt"
names = unique(gsub(p,"",list.files(pattern = p)))
I am making some assumptions here.
The first assumption is that you have all these files in a folder with no other text files (.txt) in this folder.
If so you can get the list of files with the command list.files.
But when doing so you will get the "_data.txt" and the "filter.txt".
We need a way to extract the basic part of the name.
I use "str_replace" to remove the "_data.txt" and the "_filter.txt" from the list.
But when doing so you will get a list with two entries. Therefore I use the "unique" command.
I store this in "lfiles" that will now contain "RTDFE, TRYFG, FTYGS, WERTS..." and any other file that satisfy the conditions.
After this I run a for loop on this list.
I reopen the files similarly as you do.
I merge by XYZ and I immediately put the results in a data frame.
By using rbind I keep adding results to the data frame "res".
library(stringr)
lfiles=list.files(path = ".", pattern = ".txt")
## we strip, from the files, the "_filter and the data
lfiles=unique( sapply(lfiles, function(x){
x=str_replace(x, "_data.txt", "")
x=str_replace(x, "_filter.txt", "")
return(x)
} ))
res=NULL
for(i in lfiles){
file1 <- paste0(i, "_filter.txt")
file2 <- paste0(i, "_data.txt")
### One
A <- read.delim(file1, sep = "\t", header = FALSE)
#### two
B <- read.delim(file2, sep = "\t", header = FALSE)
res=rbind(data.frame(Samples=i, Common=nrow(merge(A, B, by="XYZ"))))
}
Ok, I will assume you have a folder called "data" with files named "RTDFE_filter.txt, RTDFE_data, TRYFG_filter.txt, TRYFG_data.txt, etc. (only and exacly this files).
This code should give a possible way
# save the file names
files = list.files("data")
# get indexes for "data" (for "filter" indexes, add 1)
files_data_index = seq(1, length(f), 2) # 1, 3, 5, ...
# loop on indexes
results = lapply(files_data_index, function(i) {
A <- read.delim(files[i+1], sep = "\t", header = FALSE)
B <- read.delim(files[i], sep = "\t", header = FALSE)
C <- merge(A, B, by="XYZ")
samp = strsplit(files[i], "_")[[1]][1]
com = nrow(C)
return(c(Samples = samp, Comon = com))
})
# combine results
do.call(rbind, results)

Can I automate an increasing value in a file name in R?

So I have .csv's of nesting data that I need to trim. I wrote a series of functions in R and then spit out the new pretty .csv. The issue is that I need to do this with 59 .csv's and I would like to automate the name.
data1 <- read.csv("Nest001.csv", skip = 3, header=F)
functions functions functions
write.csv("Nest001_NEW.csv, file.path(out.path, edit), row.names=F)
So...is there any way for me to loop the name Nest001 to Nest0059 so that I don't have to delete and retype the name for every .csv?
EDIT to incorporate Gregor's suggestion:
One option:
filenames_in <- sprintf("Nest%03d.csv", 1:59)
filenames_out <- sub(pattern = "(\\d{3})(\\.)", replacement = "\\1_NEW\\2", filenames_in)
all_files <- matrix(c(filenames_in, filenames_out), ncol = 2)
And then loop through them:
for (i in 1:nrow(all_files)) {
temp <- read.csv(all_files[[i, 1]], skip = 3, header=F)
do stuff
write.csv(temp, all_files[[i, 2]], row.names = f)
)
To do this purrr-style, you would create two lists similar to the above, and then write a custom function to read in the file, perform all the functions, and then output it.
e.g.
purrr::walk2(
.x = list(filenames_in),
.y = list(filenames_out),
.f = ~my_function()
)
Consider .x and .y as the i in the for loop; it goes through both lists simultaneously, and performs the function on each item.
More info is available here.
Your best bet is to put all of these CSVs into one folder, without any other CSVs in that folder. Then, you can write a loop to go over every file in that folder, and read them in.
library(dplyr)
setwd("path to the folder with CSV's goes here")
combinedData = data.frame()
files = list.files()
for (file in files)
{
read.csv(file)
combinedData = bind_rows(combinedData, file)
}
EDIT: if there are other files in the folder that you don't want to read, you can add this line of code to only read in files that contain the word "Nest" in the title:
files= files[grepl("Nest",filesToRead)]
I don't remember off the top of my head if that is case sensitive or not

Using lapply to apply a function over read-in list of files and saving output as new list of files

I'm quite new at R and a bit stuck on what I feel is likely a common operation to do. I have a number of files (57 with ~1.5 billion rows cumulatively by 6 columns) that I need to perform basic functions on. I'm able to read these files in and perform the calculations I need no problem but I'm tripping up in the final output. I envision the function working on 1 file at a time, outputting the worked file and moving onto the next.
After calculations I would like to output 57 new .txt files named after the file the input data first came from. So far I'm able to perform the calculations on smaller test datasets and spit out 1 appended .txt file but this isn't what I want as a final output.
#list filenames
files <- list.files(path=, pattern="*.txt", full.names=TRUE, recursive=FALSE)
#begin looping process
loop_output = lapply(files,
function(x) {
#Load 'x' file in
DF<- read.table(x, header = FALSE, sep= "\t")
#Call calculated height average a name
R_ref= 1647.038203
#Add column names to .las data
colnames(DF) <- c("X","Y","Z","I","A","FC")
#Calculate return
DF$R_calc <- (R_ref - DF$Z)/cos(DF$A*pi/180)
#Calculate intensity
DF$Ir_calc <- DF$I * (DF$R_calc^2/R_ref^2)
#Output new .txt with calcuated columns
write.table(DF, file=, row.names = FALSE, col.names = FALSE, append = TRUE,fileEncoding = "UTF-8")
})
My latest code endeavors have been to mess around with the intial lapply/sapply function as so:
#begin looping process
loop_output = sapply(names(files),
function(x) {
As well as the output line:
#Output new .csv with calcuated columns
write.table(DF, file=paste0(names(DF), "txt", sep="."),
row.names = FALSE, col.names = FALSE, append = TRUE,fileEncoding = "UTF-8")
From what I've been reading the file naming function during write.table output may be one of the keys I don't have fully aligned yet with the rest of the script. I've been viewing a lot of other asked questions that I felt were applicable:
Using lapply to apply a function over list of data frames and saving output to files with different names
Write list of data.frames to separate CSV files with lapply
to no luck. I deeply appreciate any insights or paths towards the right direction on inputting x number of files, performing the same function on each, then outputting the same x number of files. Thank you.
The reason the output is directed to the same file is probably that file = paste0(names(DF), "txt", sep=".") returns the same value for every iteration. That is, DF must have the same column names in every iteration, therefore names(DF) will be the same, and paste0(names(DF), "txt", sep=".") will be the same. Along with the append = TRUE option the result is that all output is written to the same file.
Inside the anonymous function, x is the name of the input file. Instead of using names(DF) as a basis for the output file name you could do some transformation of this character string.
example.
Given
x <- "/foo/raw_data.csv"
Inside the function you could do something like this
infile <- x
outfile <- file.path(dirname(infile), gsub('raw', 'clean', basename(infile)))
outfile
[1] "/foo/clean_data.csv"
Then use the new name for output, with append = FALSE (unless you need it to be true)
write.table(DF, file = outfile, row.names = FALSE, col.names = FALSE, append = FALSE, fileEncoding = "UTF-8")
Using your code, this is the general idea:
require(purrr)
#list filenames
files <- list.files(path=, pattern="*.txt", full.names=TRUE, recursive=FALSE)
#Call calculated height average a name
R_ref= 1647.038203
dfTransform <- function(file){
colnames(file) <- c("X","Y","Z","I","A","FC")
#Calculate return
file$R_calc <- (R_ref - file$Z)/cos(file$A*pi/180)
#Calculate intensity
file$Ir_calc <- file$I * (file$R_calc^2/R_ref^2)
return(file)
}
output <- files %>% map(read.table,header = FALSE, sep= "\t") %>%
map(dfTransform) %>%
map(write.table, file=paste0(names(DF), "txt", sep="."),
row.names = FALSE, col.names = FALSE, append = TRUE,fileEncoding = "UTF-8")

R - write.table overwrites file

My script reads in a list of text files from a folder. A calculation for all values in a few columns in each text file is made.
At the end I want to write the resulting data.frame into a new text file in a different location.
The problem is, that the script keeps overwriting the file it created before. So I end up with only one file (the last one that was read in).
But I don't get what I am doing wrong here. The output file name is different each time, so in my head it should produce separate files.
The script looks as follows:
RAW <- "C:/path/tofiles"
files <- list.files(RAW, full.names = TRUE)
for(j in length(files)) {
if(file.exists(files[[j]])){
data <- read.csv(files[[j]], skip = 0, header=FALSE)
data[9] <- do.call(cbind,lapply(data[9], function(x){(data[9]*0.01701)/0.00848}))
data[11] <- do.call(cbind,lapply(data[11], function(x){(data[11]*0.01834)/0.00848}))
data[13] <- do.call(cbind,lapply(data[13], function(x){(data[13]*0.00982)/0.00848}))
data[15] <- do.call(cbind,lapply(data[15], function(x){(data[15]*0.01011)/0.00848}))
OUT <- paste("C:/path/to/destination_folder",basename(files[[j]]),sep="")
write.table(data, OUT, sep=",", row.names = FALSE, col.names = FALSE, append = FALSE)
}
}
The problem is in your for loop. length(files) just provides 1 value, namely the length of your files-vector, while I think you want to have a sequence with that length.
Try seq_along or just for(j in files).

Extracting file numbers from file names in r and looping through files

I have a folder full of .txt files that I want to loop through and compress into one data frame, but each .txt file is data for one subject and there are no columns in the text files that indicate subject number or time point in the study (e.g. 1-5). I need to add a line or two of code into my loop that looks for strings of four numbers (i.e. each file is labeled something like: "4325.5_ERN_No_Startle") and just creates a column with 4325 and another column with 5 that will appear for every data point for that subject until the loop gets to the next one. I have been looking for awhile but am still coming up empty, any suggestions?
I also have not quite gotten the loop to work:
path = "/Users/me/Desktop/Event Codes/ERN task/ERN text files transferred"
out.file <- ""
file <- ""
file.names <- dir(path, pattern =".txt")
for(i in 1:length(file.names)){
file <- read.table(file.names[i],header=FALSE, fill = TRUE)
out.file <- rbind(out.file, file)
}
which runs okay until I get this error message part way through:
Error in read.table(file.names[i], header = FALSE, fill = TRUE) :
no lines available in input
Consider using regex to parse the file name for study period and subject, both of which are then binded in a lapply of list.files:
path = "path/to/text/files"
# ANY TXT FILE WITH PATTERN OF 4 DIGITS FOLLOWED BY A PERIOD AND ONE DIGIT
file.names <- list.files(path, pattern="*[0-9]{4}\\.[0-9]{1}.*txt", full.names=TRUE)
# IMPORT ALL FILES INTO A LIST OF DATAFRAMES AND BINDS THE REGEX EXTRACTS
dfList <- lapply(file.names, function(x) {
if (file.exists(x)) {
data.frame(period=regmatches(x, gregexpr('[0-9]{4}', x))[[1]],
subject=regmatches(x, gregexpr('\\.[0-9]{1}', x))[[1]],
read.table(x, header=FALSE, fill=TRUE),
stringsAsFactors = FALSE)
}
})
# COMBINE EACH DATA FRAME INTO ONE
df <- do.call(rbind, dfList)
# REMOVE PERIOD IN SUBJECT (NEEDED EARLIER FOR SPECIAL DIGIT)
df['subject'] <- sapply(df['subject'],
function(x) gsub("\\.", "", x))
You can try to use tryCatchwhich basically would give you a NULL instead of an error.
file <- tryCatch(read.table(file.names[i],header=FALSE, fill = TRUE), error=function(e) NULL))

Resources