How to combine .txt files from multiple folders - r

I want to combine multiple .txt files in R from multiple folders. However, I'm running into trouble when I want to separate the data into different columns. Right now, the files combine but into one single column when there should be four.
I used list.files to find .txt files in the folders in my working directory. Then I used rbind and lapply to combine them with read.delim. (see below)
files = list.files(pattern = "*.txt")
myfiles = do.call(rbind, lapply(files, function(x) read.delim(x, header = FALSE, stringsAsFactors = FALSE)))
The above code combines all of the .txt files, but the first 3 rows of each file are artifacts of the data download (basically just a naming feature) and are not pertinent to the data itself. So once the files are combined, the three lines repeat. I cannot use filter(), as I would have to manually go through the data (many thousands of lines). I would also like to repeat this process in another folder with a similar setup. So I'd like to be able to use the same code.
I think I can resolve the issue by removing the top 3 lines of each .txt file before combining them. Then I can set header = FALSE and just add in headers once the files are combined. But again, there are many hundreds of files, so I do not wish to do this manually. I'm not sure how to do this, though. Any suggestions?
Thank you for any help.

Options, transcribed from the comment:
By itself, read.delim(..., skip = 3) will remove those leading duplicate rows. This will also remove the header row, so all of your frames will have generic column names, not a big problem.
To fix that, you can re-read the first row of one of the files (first?) to get the column names, with read.delim(..., nrows=1). If we used nrows=0, it reads all, so we need a minimum of 1 to limit the rows read; in the comment I included [0,], but since all you need is the column-names, it doesn't really affect things.
You can do it the first time with something like:
files = list.files(pattern = "*.txt")
myfiles = do.call(rbind, lapply(files, function(x) read.delim(x, skip = 3, header = FALSE, stringsAsFactors = FALSE)))
# the added part above is skip = 3
colnames(myfiles) <- colnames(read.delim(files[1], header=TRUE, nrows=1))
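Since the question mentions files spread over several folders, here is a minimal sketch of the same skip = 3 approach applied recursively. The folder names, file contents, and the four column names are made up for illustration. The recursive = TRUE and full.names = TRUE arguments are an addition to the answer above, and the files are assumed to be tab-delimited (the read.delim default); if they use another separator, sep would need to change.

```r
# Build a throwaway folder tree so the sketch runs anywhere (hypothetical data).
root <- file.path(tempdir(), "txt_demo")
unlink(root, recursive = TRUE)
for (d in c("folderA", "folderB")) {
  dir.create(file.path(root, d), recursive = TRUE)
  writeLines(
    c("artifact 1", "artifact 2", "artifact 3",      # three download artifact lines
      paste(d, 1:2, "x", "y", sep = "\t")),          # two tab-separated data rows
    file.path(root, d, "data.txt")
  )
}

# Find every .txt file below root, keeping the folder in each path.
files <- list.files(root, pattern = "\\.txt$", recursive = TRUE, full.names = TRUE)

# Read one file, dropping the three artifact lines at the top.
read_one <- function(path) {
  read.delim(path, skip = 3, header = FALSE, stringsAsFactors = FALSE)
}

myfiles <- do.call(rbind, lapply(files, read_one))

# Hypothetical column names, assigned once after combining.
colnames(myfiles) <- c("site", "id", "a", "b")
```

Pointing root at a different top-level folder is the only change needed to repeat the process elsewhere.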

Related

Merge RDS files from two different file paths?

Folder 1 and Folder 2 are full of .rds files. How would I go about merging all files in both folders into 1 .rds file?
What I have so far
mergedat <- do.call('rbind', lapply(list.files("File/Path/To/Folder/1/", full.names = TRUE), readRDS))
However I don't know how to add the second file path and even then, the code above does not seem to be working.
The .rds files are all set up exactly the same as far as number of columns and column headers go, but the information in them is obviously different. I also just realized that my code was not actually reading the files in.
Any suggestions?
You can do something like this twice, each time for a different path:
path <- "./files"
files <- list.files(path = path,
                    full.names = TRUE,
                    all.files = FALSE)
files <- files[!file.info(files)$isdir]
data <- lapply(files, readRDS)
You end up with 2 data objects, each a list whose elements are the data frames stored in the corresponding RDS files. If all those files have the same structure, you can use dplyr::bind_rows() to concatenate all the data frames into one combined data frame.
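As a rough sketch of how both folders could be handled in one pass: list.files() accepts a vector of paths, so the two folders can be listed together. The folder names and data below are invented, and base rbind is used here instead of dplyr::bind_rows() only to keep the example free of package dependencies.

```r
# Two throwaway folders, each holding one .rds file (hypothetical data).
root <- file.path(tempdir(), "rds_demo")
unlink(root, recursive = TRUE)
paths <- file.path(root, c("folder1", "folder2"))
for (p in paths) {
  dir.create(p, recursive = TRUE)
  saveRDS(data.frame(id = 1:2, source = basename(p)),
          file.path(p, "part.rds"))
}

# list.files() takes a character vector of directories.
files <- list.files(paths, pattern = "\\.rds$", full.names = TRUE,
                    ignore.case = TRUE)

# Read every file, then stack the identically structured data frames.
data <- lapply(files, readRDS)
mergedat <- do.call(rbind, data)

# Write the combined result out as a single .rds file.
saveRDS(mergedat, file.path(root, "merged.rds"))
```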

R script to open folders then identify a file, rename it, and read it

I have recently learned to code with R and I sort of manage to handle the data within files but I can't get it to manipulate the files themselves. Here is my problem:
I'd like to open successively, in my working directory "Laurent/R", the 3 folders that are within it ("gene_1", "gene_2", "gene_3").
In each folder, I want one specific .csv file (the one containing the specific word "Cq") to be renamed as "gene_x_Cq" (and then to move these 3 renamed files in a new folder (is that necessary?)).
I want then to be able to successively open these 3 .csv files (with read.csv i suppose) to manipulate the data within them.
I've looked at different functions like list.files, unlist, file.rename but I'm not sure they are appropriate and I can't figure out how to use them in my case.
Can anyone help ? (I use a Mac)
Thanks
Laurent
Here's a potential solution. If you don't understand something, just shout out and ask!
setwd("Your own file path/Laurent")
library(stringr)
# list all .csv files
csvfiles <- list.files(recursive = TRUE, pattern = "\\.csv")
csvfiles
# Pick out files that have cq in them, ensuring that you ignore uppercase/lowercase
cq.files <- csvfiles[str_detect(csvfiles, fixed("cq", ignore_case = TRUE))]
# Get the gene number for each file - using "2" here because the gene folder is at the second level in the file path
gene.nb <- str_sub(word(cq.files, 2, 2, sep = "/"), 6, 6)
gene.nb
# create a new folder to place new files into
dir.create("R/genefiles")
# This will copy files, not move them. To move them, use file.rename - but be careful, I'd try file.copy first.
# file.copy returns TRUE/FALSE per file, so keep that in its own variable
copied <- file.copy(cq.files,
                    paste0("R/genefiles/gene_", gene.nb, "_", "Cq", ".csv"))
# Now to work with all files in the new folder
library(purrr)
genefiles <- list.files("R/genefiles", full.names = TRUE)
# This will bring in all data into one dataframe. If you want them brought in as separate dataframes,
# use something like gene1 <- read.csv("R/genefiles/gene_1_Cq.csv")
files <- map_dfr(genefiles, read.csv)

To stack up results in one masterfile in R

Using this script I have created a specific folder for each csv file and then saved all my further analysis results in this folder. The name of the folder and csv file are same. The csv files are stored in the main/master directory.
Now, I have created a csv file in each of these folders which contains a list of all the fitted values.
I would now like to do the following:
Set the working directory to the particular filename
Read fitted values file
Add a row/column stating the name of the site/ unique ID
Add it to the masterfile which is stored in the main directory with a title specifying site name/filename. It can be stacked by rows or by columns it doesn't really matter.
Come to the main directory to pick the next file
Repeat the loop
Using merge(), rbind(), or cbind() combines all the data under one column name. I want to keep all the sites separate for comparison at a later stage.
This is what I'm using at the moment and I'm lost on how to proceed further.
setwd( "path") # main directory
path <-"path" # need this for convenience while switching back to main directory
# import all files and create a character type array
files <- list.files(path=path, pattern="*.csv")
for(i in seq(1, length(files), by = 1)){
fileName <- read.csv(files[i]) # repeat to set the required working directory
base <- strsplit(files[i], ".csv")[[1]] # getting the filename
setwd(file.path(path, base)) # setting the working directory to the same filename
master <- read.csv(paste(base,"_fiited_values curve.csv"))
# read the fitted value csv file for the site and store it in a list
}
I want to construct a for loop to make one master file with the files in different directories. I do not want to merge all under one column name.
For example, If I have 50 similar csv files and each had two columns of data, I would like to have one csv file which accommodates all of it; but in its original format rather than appending to the existing row/column. So then I will have 100 columns of data.
Please tell me what further information can I provide?
For reading a group of files from a number of different directories, with pathnames patha, pathb, pathc:
paths = c('patha','pathb','pathc')
files = unlist(sapply(paths, function(path) list.files(path,pattern = "*.csv", full.names = TRUE)))
listContainingAllFiles = lapply(files, read.csv)
If you want to be really quick about it, you can grab fread from data.table:
library(data.table)
listContainingAllFiles = lapply(files, fread)
Either way this will give you a list of all objects, kept separate. If you want to join them together vertically/horizontally, then:
do.call(rbind, listContainingAllFiles)
do.call(cbind, listContainingAllFiles)
EDIT: NOTE, the latter makes no sense unless your rows actually mean something when they're corresponding. It makes far more sense to just create a field tracking what location the data is from.
If you want to include the names of the files as the method of determining sample location (I don't see where you're getting this info from in your example), then you want to do this as you read in the files, so:
listContainingAllFiles = lapply(files,
  function(file) data.frame(filename = file,
                            read.csv(file)))
Then later you can split that column to get your details (assuming, of course, you have a standard naming convention).
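To make the "track the location in a field" idea concrete, here is a small sketch. It assumes the site name is the name of the folder that holds each file, which matches the folder-per-site layout described in the question; the folder names, file name, and column names are all hypothetical.

```r
# One folder per site, each with a fitted-values file (hypothetical data).
root <- file.path(tempdir(), "site_demo")
unlink(root, recursive = TRUE)
paths <- file.path(root, c("siteA", "siteB"))
for (p in paths) {
  dir.create(p, recursive = TRUE)
  write.csv(data.frame(x = 1:3, fitted = c(0.1, 0.2, 0.3)),
            file.path(p, "fitted_values.csv"), row.names = FALSE)
}

files <- list.files(paths, pattern = "\\.csv$", full.names = TRUE)

# Tag each data frame with its site (the parent folder name) as it is read.
listContainingAllFiles <- lapply(files, function(file) {
  data.frame(site = basename(dirname(file)), read.csv(file))
})

# Stack by rows; the site column keeps the sources distinguishable.
master <- do.call(rbind, listContainingAllFiles)

# Sites can still be pulled apart later for comparison.
fitted_by_site <- split(master$fitted, master$site)
```

Keeping one long table with a site column avoids the row-alignment problem of cbind, and split() or a reshape can recover per-site columns when they are needed.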

Combining 1200 csv files in r with different column numbers

I need to combine 1200 csv files into one but they have different numbers of columns. Newbie here: Upon searching through the forums, I've decided that my code should look something like this:
list.files()
filenames <- list.files(path = "~/")
do.call("rbind.fill", lapply(filenames, read.csv, header = TRUE))
When I run this, I don't receive anything but: NULL
Any ideas for me to be able to output one large csv file that combines all of these would be appreciated. Thanks.
Your "filenames" is most likely empty. Make sure list.files() actually finds files in the folder you specified.
Excerpt from rbind.fill documentation:
Arguments
... : input data frames to row bind together. The first argument can be a list of data frames, in which case all other arguments are ignored. Any NULL inputs are silently dropped. If all inputs are NULL, the output is NULL
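A hedged sketch of the two checks implied above: confirm that list.files() found something (with full.names = TRUE so the paths can be read from any working directory), then combine frames whose columns differ. The helper rbind_fill_base below is a hypothetical base-R stand-in written for this example, not plyr's rbind.fill; it pads missing columns with NA before calling rbind. File names and data are invented.

```r
# Two small CSV files with different columns (hypothetical data).
root <- file.path(tempdir(), "fill_demo")
unlink(root, recursive = TRUE)
dir.create(root, recursive = TRUE)
write.csv(data.frame(a = 1:2, b = 3:4), file.path(root, "one.csv"), row.names = FALSE)
write.csv(data.frame(a = 5L, c = "z"), file.path(root, "two.csv"), row.names = FALSE)

# Restrict to .csv files and keep full paths.
filenames <- list.files(root, pattern = "\\.csv$", full.names = TRUE)
stopifnot(length(filenames) > 0)   # fail early if nothing was found

frames <- lapply(filenames, read.csv, header = TRUE)

# Hypothetical helper: add any missing columns as NA, then row-bind.
rbind_fill_base <- function(frames) {
  all_cols <- unique(unlist(lapply(frames, names)))
  padded <- lapply(frames, function(df) {
    for (col in setdiff(all_cols, names(df))) df[[col]] <- NA
    df[all_cols]                   # same column order in every frame
  })
  do.call(rbind, padded)
}

combined <- rbind_fill_base(frames)

# Write the single large file outside the input folder.
write.csv(combined, file.path(tempdir(), "combined.csv"), row.names = FALSE)
```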

Deleting headers of various lengths in r

I am working to combine multiple .txt files, using the read.fwf function. My issue is that each text file is preceded by several header lines, varying from 23-28 lines before the data actually start. I want to somehow delete the first n rows in the file, so that all I am importing and combining are the data themselves.
Does anyone have any clues on how to do this? The start of each data file will be the same ("01Jan") followed by a year. I basically want to delete everything before 01Jan in the file.
Right now, my code looks like:
for (i in 1:length(files.x)){
  if (!exists("X")){
    X<-read.fwf(files.x[i], c(11,5, 16), header=FALSE, skip=23, stringsAsFactors=FALSE)
    X<-head(X, -1) #delete the last row of each table
    names(X)<-c("Date", "Time", "Data")
  } else if (exists("X")){
    temp_X<-read.fwf(files.x[i], c(11,5,16), header=FALSE, skip=23, stringsAsFactors=FALSE) #read in fixed width file
    temp_X<-head(temp_X, -1)
    names(temp_X)<-c("Date", "Time", "Data")
    X<-rbind(X, temp_X)
  }
}
I need the skip=23 to vary according to the file being read in. Any ideas other than manually reading in each file and then combining?
Perhaps
hdr <- readLines(files.x[i],n=50) ## or some reasonable upper bound
firstLine <- grep("^01Jan",hdr)[1]
X <- read.fwf(files.x[i], skip=firstLine-1, ...)
Also, it would be more efficient to read in all the files via fileList <- lapply(files.x,getFile) (where getFile is a little utility function you write to encapsulate the logic of reading in a single file) and then do.call(rbind,fileList)
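One possible shape for the getFile utility mentioned above, as a sketch rather than a definitive version. It reuses the widths and column names from the question; the max_header argument, the error on a missing "01Jan" line, and the demo files are assumptions added for illustration.

```r
# Read one fixed-width file, skipping however many header lines precede "01Jan".
getFile <- function(path, widths = c(11, 5, 16), max_header = 50) {
  hdr <- readLines(path, n = max_header)        # assumed upper bound on header length
  firstLine <- grep("^01Jan", hdr)[1]
  if (is.na(firstLine)) stop("no line starting with '01Jan' in ", path)
  X <- read.fwf(path, widths = widths, header = FALSE,
                skip = firstLine - 1, stringsAsFactors = FALSE)
  X <- head(X, -1)                              # delete the last row of each table
  names(X) <- c("Date", "Time", "Data")
  X
}

# Demo files with different header lengths and a trailing footer (hypothetical data).
make_demo <- function(path, n_header) {
  rows <- sprintf("%-11s%-5s%16s",
                  c("01Jan2020", "02Jan2020"), c("00:00", "00:00"), c("1.5", "2.5"))
  writeLines(c(paste("header line", seq_len(n_header)), rows, "footer"), path)
}
files.x <- file.path(tempdir(), c("fwf_a.txt", "fwf_b.txt"))
make_demo(files.x[1], 3)
make_demo(files.x[2], 5)

# Read every file with the helper, then combine once at the end.
fileList <- lapply(files.x, getFile)
X <- do.call(rbind, fileList)
```

Binding once at the end avoids growing X inside a loop, and the exists("X") check is no longer needed.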
