I have the following list
L = list.files(".", ".txt")
which is
a.txt
b.txt
c.txt
and I want to apply some code to all files in that list, but I want to save the data frames with the same name plus some characters to indicate that they are modified. For example
a_modified.txt
b_modified.txt
c_modified.txt
I'm currently using this code:
datalist = lapply(L, function(x) {
DF = read.csv(x, sep = ",")
DF$X = gsub("[:.:][[:digit:]]{1,3}", "", DF$X)
colnames(DF)[colnames(DF)=="X"] <- "ID"
DF <- merge(DF, genes ,by="ID")
write.csv(DF, x)
return(DF)
})
I tried using
write.csv(DF, x + "_modified")
which was obviously wrong, as R does not support + for string concatenation.
Any ideas?
We need paste0() instead of +
write.csv(DF, paste0(sub("\\.txt", "", x), "_modified.csv"))
or this can be done within sub itself
write.csv(DF, sub("\\.txt", "_modified.csv", x))
NOTE: initial datasets were .txt
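To see the renaming on its own, without touching any files, a minimal sketch:

```r
# sub() replaces the first match, so the trailing ".txt" is swapped
# for the "_modified.csv" suffix in a single step
old <- c("a.txt", "b.txt", "c.txt")
sub("\\.txt$", "_modified.csv", old)
# "a_modified.csv" "b_modified.csv" "c_modified.csv"
```

Anchoring the pattern with `$` makes sure only the extension at the end of the name is replaced.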
Related
I am attempting to load multiple text files into R and in each of the files, the columns are divided using the "|" character.
To give a sense of what the file structure looks like, a given row will look like:
congression printer|-182.431552949032
In this file I want to separate the congressional printer string from the numerical characters.
When using the following code:
folder <- '~/filepath'
file_list <- list.files(path=folder, pattern="*.txt")
data <-
do.call('rbind',
lapply(file_list,
function(x)
read.table(paste(folder, x, sep= ""),
header = TRUE, row.names = NULL)))
It'll load in the data as:
[1] [2]
congression printer|-182.431552949032
Is there a way to correct this later using the tidyr::separate() function, or a way to avoid the problem in the first place? When I try to just put sep = "|" in the code above, that only impacts how my text files are found, so it doesn't really work.
Things are always easier (and more powerful) with data.table:
library(data.table)
folder <- '~/filepath'
pathsList <- list.files(path=folder, pattern="*.txt", full.names = TRUE)
rbindlist(lapply(pathsList, fread))
this works too:
folder <- '~/filepath'
file_list <- list.files(path=folder, pattern="*.txt")
data <-
do.call('rbind',
lapply(file_list,
function(x)
read.table(file.path(folder, x), sep = "|",
header = TRUE, row.names = NULL)))
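To check the effect of sep = "|" in isolation, read.table() also accepts inline text via its text argument; a file-free sketch:

```r
# Without sep = "|" the whole line stays in one column; with it,
# the label and the number are split into separate columns
raw <- "congression printer|-182.431552949032"
one_col <- read.table(text = raw, sep = "\t")
two_col <- read.table(text = raw, sep = "|")
ncol(one_col)  # 1
ncol(two_col)  # 2
```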
I have a lot of CSV files that need to be standardized. I created a dictionary for doing so, and so far the function that I have looks like this:
inputpath <- "input"
files<- paste0(inputpath, "/",
list.files(path = inputpath, pattern = '*.gz',
full.names = FALSE))
standardizefunctiontofiles = lapply(files, function(x){
DF <- read_delim(x, delim = "|", na="")
names(DF) <- dictionary$final_name[match(names(DF), dictionary$old_name)]
})
Nonetheless, the issue is that when I read the CSVs and turn them into data frames they lose their paths, and therefore I can't write each of them out as a CSV that matches the input name. What I would normally do is:
output_name <- str_replace(x, "input", "output")
write_delim(DF, output_name, delim = "|")
I was thinking that a way of solving this would be to make this step:
DF <- read_delim(x, delim = "|", na="")
so that DF keeps the name of the path, but I haven't found any solution for that.
Any ideas on how to apply the function and write each of them out as a standardized CSV?
I don't completely understand the question, but as far as I understood you want to overwrite the CSV files you are reading with new CSV files that contain the information of the modified (and corrected) data frames.
I think you have a few alternatives
Option 1) When reading data, store both CSV as a data frame and path as a string within a list.
This would be something like
file_list <- list()
for (i in seq_along(files)) {
file_list[[i]] <- list(df = read_delim(files[[i]], delim = "|", na = ""),
path = files[[i]])
}
Then, when you write the corrected data frames, you can use the paths in the second element of the list within the list file_list. Note that in order to get the path as a string you will need to do something like file_list[[1]][["path"]]
Option 2) Use assign
for (i in seq_along(files)) {
assign(files[[i]], read_delim(files[[i]], delim = "|", na = ""))
}
Option 3) Use do.call and the fact that <- is a function!
for (i in seq_along(files)) {
do.call("<-", list(files[[i]], read_delim(files[[i]], delim = "|", na = "")))
}
I hope this is useful!!
NB) None of the functions are implemented as efficiently as possible. They just introduce the idea.
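A minimal, file-free sketch of Option 1's bookkeeping (the data frames here are stand-ins for the real read_delim() results):

```r
# keep each data frame together with the path it came from
files <- c("input/a.csv", "input/b.csv")
file_list <- lapply(files, function(x) {
  list(df = data.frame(id = 1:2), path = x)  # stand-in data frame
})

# the stored path is available when writing back out
file_list[[1]][["path"]]
# "input/a.csv"

# deriving the output path from the stored input path
sub("input", "output", file_list[[2]][["path"]])
# "output/b.csv"
```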
I am reading my files into file_list. The data is read using read.csv; however, I want the data in datalist to have column names taken from the file names in file_list. The original files do not have a header.
How do I change function(x) so that the second column's colname matches the file name? The first column does not have to be unique.
file_list = list.files(pattern="*.csv")
datalist = lapply(file_list, function(x){read.csv(file=x,header=F,sep = "\t")})
How do I change function(x) so that the second column has a colname similar to the file name?
datalist = lapply(file_list, function(x){
dat = read.csv(file=x, header=F, sep = "\t")
names(dat)[2] = x
return(dat)
})
This will put the name of the file as the name of the second column. If you want to edit the name, use gsub or substr (or similar) on x to modify the string.
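For example, stripping the ".csv" extension before assigning the name (a sketch with a stand-in data frame in place of read.csv):

```r
x <- "sample01.csv"
dat <- data.frame(a = 1:3, b = 4:6)   # stand-in for read.csv(file = x, ...)
# sub() drops the extension so the column name is the bare file name
names(dat)[2] <- sub("\\.csv$", "", x)
names(dat)
# "a" "sample01"
```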
You can just add another step.
names(datalist) <- file_list
I have a directory with lots of files inside it. I'm trying to read all the files, select the second column of each file, and use rbind to create a matrix. But the problem is that after creating the matrix, there are no colnames or rownames in it.
Basically, the row names should be the names of the files whose second columns I read and rbind; the colnames should be the first column of one of the files.
Here is my efforts:
nm <- list.files(path="path/to/file")
MyMatrix<-do.call(rbind, lapply(nm, function(x) read.table(file=x)[, 2]))
A fake data setup (e.g. all your files are in the testdir directory):
my.data <- Indometh
write.csv(my.data, file = "testdir/test1.csv", row.names = FALSE)
my.data$time <- my.data$time + 1
write.csv(my.data, file = "testdir/test2.csv", row.names = FALSE)
my.data$time <- my.data$time + 1
write.csv(my.data, file = "testdir/test3.csv", row.names = FALSE)
Then a few changes to your loop are needed:
nm <- list.files(path="testdir")
my.file <- paste("testdir", nm, sep="/")
MyDataFrame<-do.call(cbind, lapply(my.file, function(x) {
col2name <- gsub( "\\..+$","", basename(x))
my.col <- data.frame(read.csv(file=x)[, 2])
names(my.col) <- col2name
my.col
}))
MyDataFrame
here it's done with read.csv, adapt it to your needs :)
HTH, Luca
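If you prefer to stay with the original rbind-to-matrix approach, the names can also be attached after the fact. A sketch with stand-in data in place of the read.table() calls:

```r
nm <- c("f1.txt", "f2.txt")
rows <- list(c(10, 20, 30), c(40, 50, 60))  # stand-in for read.table(file = x)[, 2]
MyMatrix <- do.call(rbind, rows)

rownames(MyMatrix) <- nm                            # file names as row names
colnames(MyMatrix) <- c("geneA", "geneB", "geneC")  # stand-in for read.table(nm[1])[, 1]

MyMatrix["f2.txt", "geneB"]
# 50
```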
I have a bunch of csv files that follow the naming scheme: est2009US.csv.
I am reading them into R as follows:
myFiles <- list.files(path="~/Downloads/gtrends/", pattern = "^est[[:digit:]]{4}US\\.csv$")
myDB <- do.call("rbind", lapply(myFiles, read.csv, header = TRUE))
I would like to find a way to create a new variable that, for each record, is populated with the name of the file the record came from.
You can avoid looping twice by using an anonymous function that assigns the file name as a column to each data.frame in the same lapply that you use to read the csvs.
myDB <- do.call("rbind", lapply(myFiles, function(x) {
dat <- read.csv(x, header=TRUE)
dat$fileName <- tools::file_path_sans_ext(basename(x))
dat
}))
I stripped out the directory and file extension. basename() returns the file name, not including the directory, and tools::file_path_sans_ext() removes the file extension.
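A quick, file-free check of that stripping, using a path from the question's naming scheme:

```r
p <- "~/Downloads/gtrends/est2009US.csv"
# basename() drops the directory; file_path_sans_ext() drops ".csv"
tools::file_path_sans_ext(basename(p))
# "est2009US"
```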
plyr makes this very easy:
library(plyr)
paths <- dir(pattern = "\\.csv$")
names(paths) <- basename(paths)
all <- ldply(paths, read.csv)
Because paths is named, all will automatically get a column containing those names.
Nrows <- lapply(lapply(myFiles, read.csv, header = TRUE), NROW)
# might have been easier to store: lapply(myFiles, read.csv, header = TRUE)
myDB$grp <- rep(myFiles, unlist(Nrows))
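The rep() step works by repeating each file name once per row of its file; a file-free sketch with made-up row counts:

```r
files <- c("est2009US.csv", "est2010US.csv")
nrows <- c(3L, 2L)  # stand-in for the per-file row counts
# one copy of each file name per row of that file
rep(files, nrows)
# "est2009US.csv" "est2009US.csv" "est2009US.csv" "est2010US.csv" "est2010US.csv"
```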
You can create the object from lapply first.
Lapply <- lapply(myFiles, read.csv, header = TRUE)
names(Lapply) <- myFiles
for(i in myFiles)
Lapply[[i]]$Source = i
do.call(rbind, Lapply)