I'm working on some R code to automate file import. I've been using sub to change the path string; more specifically, I want to go through Trial1 to Trial10 for participant 1 and so forth, and then save the result as data[i]. Instead of typing this out manually for every trial, could it be done more efficiently with a loop? The function itself adds the file path to the imported data so I can use this information later.
path <- "C:/Users/Thomas/Desktop/tapping backup/Pilot141116/pilot_151116_pat1_250/realisations/participant_8/Trial1"
setwd( path )
files <- list.files(path = path, pattern = "midi.*\\.csv", full.names = T )
# set up a function to read a file and add a column for filename
import <- function( file ) {
df <- read_csv( file, col_names = T )
df$file <- file
return( df )
}
# run that function across all files.
data1 <- ldply( .data = files, .fun = import )
I would build the file list from pilot_151116_pat1_250/realisations/ with recursive set to TRUE and full.names set to TRUE. Then you run the ldply loop with the import function. Later you can deduce from the file column which participant and trial your data was part of. This can be done by using strsplit with split equal to /, or by using separate from the tidyr package.
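A minimal sketch of that approach, assuming the directory layout from the question (realisations/participant_*/Trial*/) and reusing the import function defined above:

library(readr)  # read_csv()
library(plyr)   # ldply()
base_path <- "C:/Users/Thomas/Desktop/tapping backup/Pilot141116/pilot_151116_pat1_250/realisations"
files <- list.files(path = base_path, pattern = "midi.*\\.csv",
                    recursive = TRUE, full.names = TRUE)
data_all <- ldply(.data = files, .fun = import)
# The trial is the directory containing the file and the participant
# is the directory above it, so basename()/dirname() recover both
# without any string splitting.
data_all$trial <- basename(dirname(data_all$file))
data_all$participant <- basename(dirname(dirname(data_all$file)))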
So I have .csv's of nesting data that I need to trim. I wrote a series of functions in R and then spit out the new, pretty .csv. The issue is that I need to do this with 59 .csv's, and I would like to automate the naming.
data1 <- read.csv("Nest001.csv", skip = 3, header = F)
# functions functions functions
write.csv(data1, file.path(out.path, "Nest001_NEW.csv"), row.names = F)
So... is there any way for me to loop the name Nest001 through Nest059 so that I don't have to delete and retype the name for every .csv?
EDIT to incorporate Gregor's suggestion:
One option:
filenames_in <- sprintf("Nest%03d.csv", 1:59)
filenames_out <- sub(pattern = "(\\d{3})(\\.)", replacement = "\\1_NEW\\2", filenames_in)
all_files <- matrix(c(filenames_in, filenames_out), ncol = 2)
And then loop through them:
for (i in 1:nrow(all_files)) {
  temp <- read.csv(all_files[i, 1], skip = 3, header = F)
  # do stuff
  write.csv(temp, all_files[i, 2], row.names = F)
}
To do this purrr-style, you would create two vectors like the ones above, and then write a custom function that reads in the file, performs all the functions, and then writes it out.
e.g.
purrr::walk2(
  .x = filenames_in,
  .y = filenames_out,
  .f = ~ my_function(.x, .y)
)
Consider .x and .y as the i in the for loop; walk2 goes through both vectors simultaneously and applies the function to each pair of items.
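The custom function is whatever your pipeline needs; a hypothetical my_function matching the read/clean/write pattern above (the cleaning step is a placeholder) might be:

my_function <- function(file_in, file_out) {
  temp <- read.csv(file_in, skip = 3, header = F)
  # do stuff
  write.csv(temp, file_out, row.names = F)
}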
More info is available in the purrr documentation.
Your best bet is to put all of these CSVs into one folder, without any other CSVs in that folder. Then, you can write a loop to go over every file in that folder, and read them in.
library(dplyr)
setwd("path to the folder with CSV's goes here")
combinedData = data.frame()
files = list.files()
for (file in files)
{
temp = read.csv(file)
combinedData = bind_rows(combinedData, temp)
}
EDIT: if there are other files in the folder that you don't want to read, you can add this line of code to only read in files that contain the word "Nest" in the title:
files = files[grepl("Nest", files)]
Note that grepl() is case-sensitive by default; pass ignore.case = TRUE for a case-insensitive match.
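For example, this keeps any file whose name contains "nest" in any capitalisation:

files = files[grepl("nest", files, ignore.case = TRUE)]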
I'm processing files through an application using R. The application requires a simple inputfile, outputfilename specification as parameters. Using the below code, this works fine.
input <- "\"7374.txt\""
output <- "\"7374_cleaned.txt\""
system2("DataCleaner", args = c(input, output))
However, I wish to process a folder of .txt files rather than have to do each one individually. If I had access to the source code I would simply alter the application to accept a folder rather than an individual file, but unfortunately I don't. Is it possible to somehow do this in R? I had tried starting to create a loop,
input <- dir(pattern=".txt")
but I don't know how I could insert a vector as an argument without the regex being included as part of it. Also, I would then need to be able to paste '_cleaned' onto the end of the output file names. Many thanks in advance.
Obviously I can't test it, because I don't have your DataCleaner program, but how about this...
# make some files to demonstrate with
dir.create('folder')
x = sapply(1:5, function(f) {t = tempfile(tmpdir = 'folder', fileext = '.txt'); file.create(t); t})
# find the files
inputfiles = list.files(path = 'folder', pattern = 'txt', full.names = TRUE)
# remove the extension
base = tools::file_path_sans_ext(inputfiles)
# make the output file names
outputfiles = paste0(base, '_cleaned.txt')
# wrap the system call so it can be applied to one input/output pair at a time
mysystem <- function(input, output) {
  system2('DataCleaner', args = c(input, output))
}
lapply(seq_along(inputfiles), function(f) mysystem(inputfiles[f], outputfiles[f]))
It uses lapply to iterate over the input and output file name vectors in parallel and calls the system2 function on each pair. If your paths contain spaces, wrap them with shQuote() first, which is what the embedded quotes in your original example were doing by hand.
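Equivalently, base R's Map() iterates over several vectors in parallel, which reads slightly more directly here:

Map(mysystem, inputfiles, outputfiles)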
I want to run the following code in R for all the files. I actually made a for loop for that, but when I run it, it is applied to only one file, not all of them. BTW, my files do not have a header.
You use [[ to subset something from peaks. However, once read.delim() has read the file, peaks is a data frame with no remaining reference to the file name, so you just have to get rid of the [[i]].
PeakSizes <- c()
for (i in filelist.coverages) {
  peaks <- read.delim(i, sep = '', header = F)
  PeakSizes <- c(PeakSizes, peaks$V3 - peaks$V2)
}
By using the iterator i within read.delim(), which holds a new file name on each pass, peaks gets the content of a different file every time R goes through the loop.
In your code, i references a file name. Use indices instead.
And, by the way, don't use setwd; use the full.names = TRUE option in list.files. And since each file contributes a whole vector of peak sizes, preallocate PeakSizes as a list with PeakSizes <- vector('list', length(filelist.coverages)) and flatten it with unlist() at the end.
So do:
filelist.coverages <- list.files('K:/prostate_cancer_porto/H3K27me3_ChIPseq/',
pattern = 'island.bed', full.names = TRUE)
##all 97 bed files
PeakSizes <- vector('list', length(filelist.coverages))
for (i in seq_along(filelist.coverages)) {
  peaks <- read.delim(filelist.coverages[i], sep = '', header = FALSE)
  PeakSizes[[i]] <- peaks$V3 - peaks$V2
}
PeakSizes <- unlist(PeakSizes)
Or you could simply use sapply or purrr::map:
sapply(filelist.coverages, function(file) {
peaks <- read.delim(file, sep = '', header = FALSE)
peaks$V3 - peaks$V2
})
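Because the number of peaks differs from file to file, sapply cannot simplify the result and returns a list of numeric vectors. Wrap the call in unlist() if you want one flat vector of all peak sizes:

PeakSizes <- unlist(sapply(filelist.coverages, function(file) {
  peaks <- read.delim(file, sep = '', header = FALSE)
  peaks$V3 - peaks$V2
}))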
I'm trying to write a function to more or less automate file import for experimental data. So far it works fine if the folder only contains one file, but the program I use creates two files in the given file path for e.g. Trial1: Trial1_001_match_20161115_121628.csv.aborted and Trial1_001_midi_20161115_121628.csv.aborted. I'm only interested in the midi file. Is there an easy way to ensure that only the file containing the string midi gets imported, or something like this?
path <- "C:/Users/Thomas/Desktop/tapping backup/Pilot141116/pilot_151116_pat1_250/realisations/participant_8/Trial1"
setwd( path )
files <- list.files(path = path, pattern = ".csv", full.names = T )
# set up a function to read a file and add a column for filename
import <- function( file ) {
df <- read_csv( file, col_names = T )
df$file <- file
return( df )
}
# run that function across all files.
data1 <- ldply( .data = files, .fun = import )
As you don't give a reproducible example I can't check, but the following should work: files[grepl("midi", files)].
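Alternatively, filter when building the list by putting midi into the pattern argument of list.files, e.g.:

files <- list.files(path = path, pattern = "midi.*\\.csv", full.names = TRUE)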
I have a .xlsx file that has two sheets, and I want to generate a list of both Excel sheets using read_excel from the readxl package. I have used this code:
my_work <- lapply(excel_sheets("data.xlsx"),
read_excel,
path = "data.xlsx")
The read_excel() function is called multiple times on the "data.xlsx" file and each sheet is loaded in one after the other. The result is a list of data frames, each data frame representing one of the sheets in data.xlsx.
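As a small usage note, you can name the list elements after the sheets so individual data frames are easier to pick out later (this assumes the sheet names come from the same workbook):

sheets <- excel_sheets("data.xlsx")
my_work <- lapply(sheets, read_excel, path = "data.xlsx")
names(my_work) <- sheets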
My question is: why should I write the path argument in the lapply function, since the file is already in the working directory?
Not sure this is the easiest way, but you can create a short function wrapping read_excel() that takes both a sheet name parameter and a path, then lapply over that function.
library(readxl)
path <- "data.xlsx"
sheet_names <- excel_sheets(path)
# create function
read_excel_sheet <- function(sheet_name, path) {
  read_excel(path = path, sheet = sheet_name)
}
my_work <- lapply(sheet_names, read_excel_sheet, path = path)
The documentation:
read_excel(path, sheet = 1, col_names = TRUE, col_types = NULL, na = "", skip = 0)
The path parameter is a required argument with no default, so you need to fill it in; otherwise an error will pop up. The working directory only determines how a relative path like "data.xlsx" is resolved; it does not supply the argument for you.