Function to read in multiple delimited text files - r

Using this answer, I have created a function that should read in all the text datasets in a directory:
read.delims = function(dir, sep = "\t"){
# Make a list of all data frames in the "data" folder
list.data = list.files(dir, pattern = "*.(txt|TXT|csv|CSV)")
# Read them in
for (i in 1:length(list.data)) {
assign(list.data[i],
read.delim(paste(dir, list.data[i], sep = "/"),
sep = sep))
}
}
However, even though there are .txt and .csv files in the specified directory, no R objects get created (I'm guessing this happens because I'm calling read.delim inside a function). How can I correct this?

You can add the envir argument to your assign() call, like this:
read.delims = function(dir, sep = "\t"){
# Make a list of all data frames in the "data" folder
list.data = list.files(dir, pattern = "*.(txt|TXT|csv|CSV)")
# Read them in
for (i in 1:length(list.data)) {
assign(list.data[i],
read.delim(paste(dir, list.data[i], sep = "/"),
sep = sep),
envir=.GlobalEnv)
}
}
This way, your objects are created in the global environment rather than only in the function's environment.
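To see why the original version appears to do nothing, here is a minimal sketch of how assign() chooses its target environment. The names make_local, make_global and demo_value are made up for this illustration and do not come from the question.

```r
# Start from a clean slate so the checks below are meaningful
if (exists("demo_value", envir = .GlobalEnv, inherits = FALSE)) {
  rm("demo_value", envir = .GlobalEnv)
}

# Without envir, assign() writes into the environment it is called from
make_local <- function() {
  assign("demo_value", 1)                  # lives only inside this call
  exists("demo_value", inherits = FALSE)   # visible here, gone after return
}

# With envir = .GlobalEnv, the object survives the function call
make_global <- function() {
  assign("demo_value", 2, envir = .GlobalEnv)
  invisible(NULL)
}

local_seen <- make_local()
exists_after_local <- exists("demo_value", envir = .GlobalEnv, inherits = FALSE)

make_global()
exists_after_global <- exists("demo_value", envir = .GlobalEnv, inherits = FALSE)
```

Here local_seen and exists_after_global come out TRUE while exists_after_local is FALSE, which mirrors what happens to the data frames in the question.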

As I said in my comment, it is necessary to return() a value after assigning. I don't really see the point in using assign() though, so here it is with a simple for-loop, assuming you want your output to be a list of data frames.
Note that I changed the reading function to read.table() for personal convenience. You might want to adjust that.
read.delims <- function(dir, sep = "\t"){
# Make a list of all data frames in the "data" folder
list.data <- list.files(dir, pattern = "*.(txt|TXT|csv|CSV)")
list.out <- vector("list", length(list.data))
# Read them in (seq_along() also behaves correctly when no files are found)
for (i in seq_along(list.data)) {
list.out[[i]] <- read.table(paste(dir, list.data[i], sep = "/"), sep = sep)
}
return(list.out)
}
Maybe you should also add a $ to the end of your regular expression, so that the extension is only matched at the end of the file name.
Cheers.
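As a variation on the list-returning approach, the sketch below names each list element after its file and anchors the extension with $. The helper name read_delims_list and the demo files are assumptions made for illustration; it also assumes the files have a header row, as read.delim does by default.

```r
# Hypothetical helper: read every .txt/.csv file in `dir` into a named list
read_delims_list <- function(dir, sep = "\t") {
  # "\\." matches a literal dot; "$" anchors the extension to the end of the name
  paths <- list.files(dir, pattern = "\\.(txt|csv)$",
                      ignore.case = TRUE, full.names = TRUE)
  out <- lapply(paths, read.delim, sep = sep)
  names(out) <- basename(paths)   # name each element after its file
  out
}

# Small demonstration in a temporary directory
demo_dir <- file.path(tempdir(), "read_delims_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.table(data.frame(a = 1:2, b = c("x", "y")),
            file.path(demo_dir, "one.txt"), sep = "\t", row.names = FALSE)
write.table(data.frame(a = 3:5, b = c("p", "q", "r")),
            file.path(demo_dir, "two.TXT"), sep = "\t", row.names = FALSE)

frames <- read_delims_list(demo_dir)
```

Individual data frames can then be reached by file name, for example frames[["one.txt"]].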

Related

How to load multiple csv files into separate objects (dataframes) in R based on filename?

I know how to load a whole folder of .csv files quite easily using:
csv_files = list.files(pattern ="*.csv")
myfiles = lapply(csv_files, read.delim, header = FALSE)
From there I can easily iterate over 'myfiles' and do whatever I wish. The problem is that this simply loads all the .csv files in the working directory.
What I would like to do is be able to assign the files to objects in the script based on the filename.
Say, for example, in one directory I have the files file001, file002, file003 and exfile001, exfile002, exfile003.
I want to be able to load them in such a way that
file_object <- file...
exfile_object <- exfile...
So that when I execute the script, it essentially does whatever I've programmed it to do for file_object (assigned as file001 in this example) and exfile_object (assigned as exfile001 in this example). It then continues in this way for the rest of the files in the directory (e.g. file002, exfile002, file003, exfile003).
I know how to do it in MATLAB, but am just getting to grips with R.
I thought perhaps getting them into separate lists using the list.files function might work, by just changing the working directory in the script, but it seems messy and would involve rewriting things in my case...
Thanks!
Solution for anyone curious...
files <- list.files(pattern = ".*csv")
for(file in 1:length(files)) {
file_name <- paste0("file00", file)
ex_file_name <- paste0("exfile00", file)
file_object <- read.csv(file = paste(file_name, ".csv", sep=""),fileEncoding="UTF-8-BOM")
exfile_object <- read.csv(file = paste(ex_file_name, ".csv", sep=""),fileEncoding="UTF-8-BOM")
}
Essentially, build the file name within the loop, then pass it to the read.csv function on each iteration.
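A variation that does not depend on counting files or on the "00" padding is to list only the "file" members and derive each "exfile" name from them. This is a sketch under the assumption that every fileNNN.csv has a matching exfileNNN.csv; the demo directory and the row_counts placeholder are made up for illustration.

```r
# Create a few small example files in a temporary directory
demo_dir <- file.path(tempdir(), "pairs_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (id in c("001", "002")) {
  write.csv(data.frame(v = 1:2),
            file.path(demo_dir, paste0("file", id, ".csv")), row.names = FALSE)
  write.csv(data.frame(v = 3:5),
            file.path(demo_dir, paste0("exfile", id, ".csv")), row.names = FALSE)
}

# "^file" matches names that start with "file", so "exfile..." is excluded
file_names <- list.files(demo_dir, pattern = "^file.*\\.csv$")

row_counts <- list()
for (file_name in file_names) {
  ex_file_name  <- paste0("ex", file_name)   # assumed naming convention
  file_object   <- read.csv(file.path(demo_dir, file_name))
  exfile_object <- read.csv(file.path(demo_dir, ex_file_name))
  # Process the pair here; as a placeholder, record the row counts
  row_counts[[file_name]] <- c(file = nrow(file_object),
                               exfile = nrow(exfile_object))
}
```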
If your list of frames, myfiles is named using this:
names(myfiles) <- sub("\\.csv$", "", csv_files)
then you can do
list2env(myfiles, globalenv())
to convert those individual frames to separate objects in the global environment.
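A small self-contained sketch of the same idea follows. The two data frames stand in for the result of lapply(csv_files, read.delim, ...), and a fresh environment is used as the target so the example does not clutter the workspace; both choices are assumptions made for illustration.

```r
# Stand-ins for the file names and the list of data frames
csv_files <- c("file001.csv", "exfile001.csv")
myfiles <- list(data.frame(x = 1:3), data.frame(x = 4:6))

# Strip the extension without writing a regular expression
names(myfiles) <- tools::file_path_sans_ext(csv_files)

# Copy each list element into its own variable in the target environment
# (pass globalenv() instead to create the objects in the workspace)
target <- new.env()
list2env(myfiles, envir = target)

created <- ls(target)
```

After this, target$file001 and target$exfile001 are separate data frames.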

Naming a dataframe like the path

I have a lot of CSV files that need to be standardized. I created a dictionary for doing so, and so far the function that I have looks like this:
inputpath <- ("input")
files<- paste0(inputpath, "/",
list.files(path = inputpath, pattern = '*.gz',
full.names = FALSE))
standardizefunctiontofiles = lapply(files, function(x){
DF <- read_delim(x, delim = "|", na="")
names(DF) <- dictionary$final_name[match(names(DF), dictionary$old_name)]
})
Nonetheless, the issue is that when I read the CSVs and turn them into data frames, they lose their path, and therefore I can't write each of them as a CSV that matches the input name. What I would normally do would be:
output_name <- str_replace(x, "input", "output")
write_delim(DF, output_name, delim = "|")
I was thinking that a way of solving this would be to make this step:
DF <- read_delim(x, delim = "|", na="")
so that the DF gets the name of the path, but I haven't found any solution for that.
Any ideas on how to solve this issue for being able to apply a function and writing each of them as a standardized CSV?
I don't completely understand the question, but as far as I can tell, you want to overwrite the CSV files you are reading with new CSV files that contain the modified (and corrected) data frames.
I think you have three alternatives:
Option 1) When reading data, store both CSV as a data frame and path as a string within a list.
This would be something like
file_list <- list()
for (i in seq_along(files)) {
file_list[[i]] <- list(df = read_delim(files[[i]], delim = "|", na = ""),
path = files[[i]])
}
Then, when you write the corrected data frames, you can use the paths in the second element of the list within the list file_list. Note that in order to get the path as a string you will need to do something like file_list[[1]][["path"]]
Option 2) Use assign
for (i in seq_along(files)) {
assign(files[[i]], read_delim(files[[i]], delim = "|", na = ""))
}
Option 3) Use do.call and the fact that <- is a function!
for (i in seq_along(files)) {
do.call("<-", list(files[[i]], read_delim(files[[i]], delim = "|", na = "")))
}
I hope this is useful!!
NB) None of the functions are implemented as efficiently as possible. They just introduce the idea.
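Another possibility is to do the reading, renaming and writing inside one function, where the input path x is still in scope. The sketch below uses base R (read.delim and write.table) instead of readr so that it runs without extra packages; the dictionary contents, the folder layout under tempdir() and the file name part1.csv are assumptions made for illustration.

```r
# Hypothetical dictionary mapping old column names to standardized ones
dictionary <- data.frame(old_name = c("ID", "Nm"),
                         final_name = c("id", "name"),
                         stringsAsFactors = FALSE)

# Example input/output folders inside a temporary directory
root <- file.path(tempdir(), "standardize_demo")
dir.create(file.path(root, "input"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "output"), showWarnings = FALSE)
write.table(data.frame(ID = 1:2, Nm = c("a", "b")),
            file.path(root, "input", "part1.csv"),
            sep = "|", row.names = FALSE, quote = FALSE)

files <- list.files(file.path(root, "input"), full.names = TRUE)

written <- vapply(files, function(x) {
  DF <- read.delim(x, sep = "|", na.strings = "")
  names(DF) <- dictionary$final_name[match(names(DF), dictionary$old_name)]
  # x is still available here, so the output path is derived from it
  output_name <- file.path(root, "output", basename(x))
  write.table(DF, output_name, sep = "|", row.names = FALSE, quote = FALSE)
  output_name
}, character(1))
```

The vector written then holds the path of every standardized file, which makes it easy to check the results afterwards.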

looping over all files in the same directory in R

I want to apply the following code in R to all the files. I wrote a for loop for that, but when I run it, it is applied to only one file, not all of them. By the way, my files do not have a header.
You use [[ to subset something from peaks. However, once a file has been read using its name, peaks is a data frame that no longer has any reference to the file name. Thus, you just have to get rid of the [[i]].
for (i in filelist.coverages) {
peaks <- read.delim(i, sep='', header=F)
PeakSizes <- c(PeakSizes, peaks$V3 - peaks$V2)
}
Because the iterator i holds a new file name on each pass through the loop, read.delim() reads a different file each time, so peaks holds the content of a new file.
In your code, i refers to a file name. Use indices instead.
And, by the way, don't use setwd; use the full.names = TRUE option in list.files instead. You can also preallocate the result. Since a file may contain more than one peak, peaks$V3 - peaks$V2 is a vector, so preallocate a list with one slot per file and flatten it at the end:
filelist.coverages <- list.files('K:/prostate_cancer_porto/H3K27me3_ChIPseq/',
pattern = 'island.bed', full.names = TRUE)
##all 97 bed files
PeakSizes <- vector("list", length(filelist.coverages))
for (i in seq_along(filelist.coverages)) {
peaks <- read.delim(filelist.coverages[i], sep = '', header = FALSE)
PeakSizes[[i]] <- peaks$V3 - peaks$V2
}
PeakSizes <- unlist(PeakSizes)
Or you could simply use lapply (or purrr::map) and flatten the result:
PeakSizes <- unlist(lapply(filelist.coverages, function(file) {
peaks <- read.delim(file, sep = '', header = FALSE)
peaks$V3 - peaks$V2
}))

How to automate read.csv command in R?

I'm doing something stupid and I cannot get write.csv to write a lot of files.
If I write:
write.csv(X1, file = "X1.csv")
Then it writes a ~2mb csv file which is ok. I have around 2000 variables in memory and I've tried
for (i in seq_along(fotos)) {
write.csv(paste("X", i, sep = ""), file = paste(paste("X", i, sep = ""),"csv", sep="."))}
I obtain the desired files, but they are only ~2kb each. X1.csv contains a single cell saying "X1.csv", and all the other files are similar (X1000.csv contains "X1000.csv"). This is unlike the command write.csv(X1, file = "X1.csv"), which creates a file X1.csv containing a 96x96 matrix.
Any idea of what I'm doing wrong?
Many thanks in advance.
You can get the object by name with the function get. However, it is much better to read the data frames into a list than into objects related by having common names.
So you can create a list of the data frames:
X <- lapply(seq_along(fotos), function(i) get(paste0("X", i)))
names(X) <- fotos
And then write them (and this is what you'd use if you had a list to start with):
lapply(names(X), function(name) write.csv(X[[name]], paste(name, 'csv', sep='.')))
You could try using the get() function:
for (i in seq_along(fotos)) {
write.csv(get(paste("X", i, sep = "")), file = paste(paste("X", i, sep = ""),"csv", sep="."))}
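If the objects already exist in memory, mget() fetches several of them by name in one call and returns a named list, which combines both answers. In this sketch the two small matrices stand in for the real X1, X2, ... objects, and the output folder under tempdir() is an assumption made for illustration.

```r
# Stand-ins for the objects X1, X2, ... already in memory
X1 <- matrix(1:4, nrow = 2)
X2 <- matrix(5:8, nrow = 2)
object_names <- paste0("X", 1:2)

# mget() looks up several objects by name and returns them as a named list
X <- mget(object_names)

# Write each element to <name>.csv in a temporary output folder
out_dir <- file.path(tempdir(), "write_demo")
dir.create(out_dir, showWarnings = FALSE)
for (name in names(X)) {
  write.csv(X[[name]], file = file.path(out_dir, paste0(name, ".csv")))
}
```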

Executing function on objects of name 'i' within for-loop in R

I am still pretty new to R and very new to for-loops and functions, but I searched quite a bit on stackoverflow and couldn't find an answer to this question. So here we go.
I'm trying to create a script that will (1) read in multiple .csv files and (2) apply a function that strips Twitter handles from the URLs in these files and does some other things to them. I have developed scripts for these two tasks separately, so I know that most of my code works, but something goes wrong when I try to combine them. I prepare for doing so using the following code:
# specify directory for your files and replace 'file' with the first, unique part of the
# files you would like to import
mypath <- "~/Users/you/data/"
mypattern <- "file+.*csv"
# Get a list of the files
file_list <- list.files(path = mypath,
pattern = mypattern)
# List of names to be given to data frames
data_names <- str_match(file_list, "(.*?)\\.")[,2]
# Define function for preparing datasets
handlestripper <- function(data){
data$handle <- str_match(data$URL, "com/(.*?)/status")[,2]
data$rank <- c(1:500)
names(data) <- c("dateGMT", "url", "tweet", "twitterid", "rank")
data <- data[,c(4, 1:3, 5)]
}
That all works fine. The problem comes when I try to execute the function handlestripper() within the for-loop.
# Read in data
for(i in data_names){
filepath <- file.path(mypath, paste(i, ".csv", sep = ""))
assign(i, read.delim(filepath, colClasses = "character", sep = ","))
i <- handlestripper(i)
}
When I execute this code, I get the following error: Error in data$URL : $ operator is invalid for atomic vectors. I know that this means that my function is being applied to the string I called from within the vector data_names, but I don't know how to tell R that, in this last line of my for-loop, I want the function applied to the objects of name i that I just created using the assign command, rather than to i itself.
Inside your loop, you can change this:
assign(i, read.delim(filepath, colClasses = "character", sep = ","))
i <- handlestripper(i)
to
tmp <- read.delim(filepath, colClasses = "character", sep = ",")
assign(i, handlestripper(tmp))
I think you should make as few get and assign calls as you can, but there's nothing wrong with indexing your loop with names as you are doing. I do it all the time, anyway.
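To avoid get and assign altogether, the data frames can be kept in a named list and processed with lapply. The sketch below is only an illustration: add_handle is a simplified, hypothetical stand-in for handlestripper() that uses base R instead of stringr, and the demo files and URLs are invented.

```r
# Simplified stand-in for handlestripper(): extract the handle from each URL
# and return the modified data frame explicitly
add_handle <- function(data) {
  data$handle <- sub(".*com/([^/]+)/status.*", "\\1", data$URL)
  data
}

# Two tiny example files in a temporary directory
demo_dir <- file.path(tempdir(), "handles_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(URL = "https://twitter.com/some_user/status/1"),
          file.path(demo_dir, "file1.csv"), row.names = FALSE)
write.csv(data.frame(URL = "https://twitter.com/other_user/status/2"),
          file.path(demo_dir, "file2.csv"), row.names = FALSE)

file_list  <- list.files(demo_dir, pattern = "^file.*\\.csv$", full.names = TRUE)
data_names <- tools::file_path_sans_ext(basename(file_list))

# Read everything into a list, transform each element, then name the elements
datasets <- lapply(file_list, read.csv, colClasses = "character")
datasets <- lapply(datasets, add_handle)
names(datasets) <- data_names
```

Each prepared data frame is then available as datasets[["file1"]], datasets[["file2"]], and so on.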

Resources