I'm trying to use a for loop with a function that takes each data frame read by the loop, modifies it, and saves the result as a CSV file. I can't figure out how to set up the loop so that each result is saved as a separate CSV file.
Below are examples of my data frames:
df1 <- data.frame("age" = c(1.5, 5.5, 10), "group" = rep("A", 3))
df2 <- data.frame("age" = c(1, 5.5, 9, 15), "group" = rep("B", 4))
dffiles <- list(df1, df2)
Right now my code looks like this:
addone <- function(id,df){
new <- df[,1]+1
df$index <- id
name <- paste0("df",i)
assign(name, df)
write.csv(new, paste0("added", id, ".csv"))
}
for (i in 1:2){
dffile <- dffiles[[i]]
addone(i, dffile)
}
So the for loop reads each original data frame, which should get a new index column and be saved as a CSV file. There seems to be a problem with my loop, though: I get the correct result when I run the lines of the function individually, but the output from the loop doesn't include the index column. Can anyone point out which part of the loop I messed up?
You need to create a new file name and then use the newly created name in the write.csv function.
Something like this should work:
#function takes an ID and creates the file name
func <- function(id, df) {
new <- df[,1]+1
write.csv(new, paste0("new", id, ".csv"))
}
#the call to the function passes the identifier.
dffiles <- list(df1, df2)
for (i in 1:2){
dffile <- dffiles[[i]]
func(i, dffile)
}
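Note that the snippet above still writes only the modified age vector, so the index column from the question never reaches the file. Here is a minimal sketch of one way around that, assuming the goal is to save each whole data frame with the incremented age and the new index column. The function name add_one_and_save and the out_dir argument are illustrative and do not come from the question.

```r
# Example data from the question
df1 <- data.frame(age = c(1.5, 5.5, 10), group = rep("A", 3))
df2 <- data.frame(age = c(1, 5.5, 9, 15), group = rep("B", 4))
dffiles <- list(df1, df2)

# Hypothetical helper: modifies df, writes the whole data frame, returns the path
add_one_and_save <- function(id, df, out_dir = tempdir()) {
  df[[1]] <- df[[1]] + 1          # increment the first column (age)
  df$index <- id                  # add the index column
  out_path <- file.path(out_dir, paste0("added", id, ".csv"))
  write.csv(df, out_path, row.names = FALSE)  # write df, not just one vector
  invisible(out_path)
}

out_paths <- character(length(dffiles))
for (i in seq_along(dffiles)) {
  out_paths[i] <- add_one_and_save(i, dffiles[[i]])
}
```

The function in the question modified df but then wrote new, which is why the index column went missing. Its assign() call only created a local variable inside the function, so it is not needed here.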
Beginner here. I have 31 Excel files that I want to read into data frames. I want an R loop to read all the files, keep only 2 columns, and change the column names. Then I want to combine the files by matching rows.
This is my attempt:
files = list.files(path=".", pattern="xls")
for (i in 1:length(files)){
table = data.frame(readxl::read_xls(files[i]), stringsAsFactors=FALSE)
table = table[,c(1,3)]
colnames(table) = c("UR",paste0("Zscore",i))
}
The problem is that I don't know how to code it so that each file's table is saved. This code only keeps the last file. I tried googling all night and couldn't figure it out.
I also tried assign() but I don't know how to modify the tables within assign as part of the loop.
files = list.files(pattern="*.xls")
for (i in 1:length(files))assign(files[i], data.frame(readxl::read_xls(files[i])))
I want the result to end up with columns like UR, Zscore1, Zscore2, Zscore3...
So instead I did it manually like this:
table1 = data.frame(readxl::read_xls(files[1]), stringsAsFactors=FALSE)
table1 = table1[,c(1,3)]
colnames(table1) = c("UR",paste0("Zscore",1))
table2 = data.frame(readxl::read_xls(files[2]), stringsAsFactors=FALSE)
table2 = table2[,c(1,3)]
colnames(table2) = c("UR",paste0("Zscore",2))
tableA = merge(table1,table2, all.x = T)
table3 = data.frame(readxl::read_xls(files[3]), stringsAsFactors=FALSE)
table3 = table3[,c(1,3)]
colnames(table3) = c("UR",paste0("Zscore",3))
tableA = merge(tableA,table3, all.x = T)
Try this approach with lapply and Reduce:
files = list.files(path=".", pattern="xls")
Reduce(function(x, y) merge(x, y, all.x = T, by = 'UR'),
lapply(seq_along(files), function(i) {
data <- readxl::read_xls(files[i])
data <- data[c(1, 3)]
names(data) <- c('UR', paste0('Zscore', i))
data
})) -> result
result
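To see what the Reduce/merge pattern produces without needing any Excel files, here is a small sketch that uses made-up in-memory data frames in place of the readxl::read_xls() results. The values and the tables list are assumptions for illustration only.

```r
# Stand-ins for the data read from each Excel file (made-up values)
tables <- list(
  data.frame(UR = c("a", "b", "c"), other = 1:3, z = c(0.1, 0.2, 0.3)),
  data.frame(UR = c("a", "b"),      other = 4:5, z = c(1.1, 1.2)),
  data.frame(UR = c("b", "c", "a"), other = 6:8, z = c(2.2, 2.3, 2.1))
)

# Keep columns 1 and 3 and rename them, as in the answer
prepared <- lapply(seq_along(tables), function(i) {
  d <- tables[[i]][c(1, 3)]
  names(d) <- c("UR", paste0("Zscore", i))
  d
})

# Left-join every table onto the accumulated result by UR
result <- Reduce(function(x, y) merge(x, y, by = "UR", all.x = TRUE), prepared)
```

Because all.x = TRUE keeps every row of the left-hand table, a UR value missing from a later file shows up as NA in that file's Zscore column rather than being dropped.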
The main problem is that every iteration assigns to the same variable, table, so the previous result is overwritten each time.
On every iteration, you should store the created table in its own element [[i]] of a list, using the assignment operator <-
Maybe something like this will work:
files <- list.files(path=".", pattern="xls")
list_of_tables <- vector(mode = "list", length = length(files))
for (i in seq_along(files)){
# [[i]] stores the whole data frame as element i of the list
list_of_tables[[i]] <- data.frame(readxl::read_xls(files[i]), stringsAsFactors=FALSE)[,c(1,3)]
names(list_of_tables[[i]]) <- c("UR",paste0("Zscore",i))
}
Then, if you want to combine the whole list side by side into a single data frame (which requires every table to have the same number of rows), you can use cbind, as in:
my_data_frame<-do.call(cbind, list_of_tables)
Otherwise just keep it as a list
Suppose I have a file structure defined as follows:
There are three folders: A, B and C. Each of the folders contains a file called file_demo.csv. I would like to read the file from each of the folders, do some operation on it, and export the results to three new files outside those folders. I use lapply() to do this.
Here's some code for a demo:
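# packages used below: readr for read_csv()/write_csv(), magrittr for %>%
library(readr)
library(magrittr)
# the folder list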
folder_list <- c('A', 'B', 'C')
# creating demo data frames
set.seed(1)
file_demo_a <- data.frame(X = rnorm(5),
Y = rpois(5, lambda = 2))
write_csv(file_demo_a, 'A/file_demo.csv')
set.seed(2)
file_demo_b <- data.frame(X = rnorm(5),
Y = rpois(5, lambda = 2))
write_csv(file_demo_b, 'B/file_demo.csv')
set.seed(3)
file_demo_c <- data.frame(X = rnorm(5),
Y = rpois(5, lambda = 2))
write_csv(file_demo_c, 'C/file_demo.csv')
# defining a function
df_mod_func <- function(folder_name){
path_name <- paste(folder_name, 'file_demo.csv', sep = "/")
new_demo <- read_csv(path_name)
new_demo <- new_demo + 1 # do a new operation
csv_file_name <- paste(folder_name, 'new_file_demo.csv', sep = "_")
new_demo %>% write_csv(csv_file_name)
# return(NULL)
}
lapply(folder_list, df_mod_func)
The problem I am facing is that when I call lapply(), each of the final data frames is printed to the console. This is a problem because the data files that I will load are huge and I do not want R to crash. I also do not want to store the result as an object because of its size. I have tried returning NULL from the function, but that seems hacky, and I do not want to fill up my console with useless output.
Is there a way to not get lapply (or use any other function) to collect the output in this case and just silently execute?
If it's just about not printing the result, you can always use invisible(), as in:
invisible( lapply( folder_list, df_mod_func ) )
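If the results are not needed at all, a plain for loop is another option, since it neither collects return values nor auto-prints them. The sketch below is self-contained and uses base R read.csv()/write.csv() with a temporary directory instead of the readr calls from the question. The names base_dir and df_mod_quiet are hypothetical.

```r
# Self-contained demo: three small CSV files in a temporary directory
base_dir <- file.path(tempdir(), "lapply_demo")
folder_list <- c("A", "B", "C")
for (folder in folder_list) {
  dir.create(file.path(base_dir, folder), recursive = TRUE, showWarnings = FALSE)
  write.csv(data.frame(X = 1:3, Y = 4:6),
            file.path(base_dir, folder, "file_demo.csv"), row.names = FALSE)
}

# Same idea as df_mod_func, but returning only the output path, invisibly
df_mod_quiet <- function(folder_name) {
  new_demo <- read.csv(file.path(base_dir, folder_name, "file_demo.csv"))
  new_demo <- new_demo + 1        # do a new operation
  out_path <- file.path(base_dir, paste0(folder_name, "_new_file_demo.csv"))
  write.csv(new_demo, out_path, row.names = FALSE)
  invisible(out_path)             # a short path instead of the whole data frame
}

# A for loop never auto-prints and never collects the results
for (folder in folder_list) df_mod_quiet(folder)
```

Returning only the path also keeps the memory footprint small if lapply() is used after all, because the list it builds then holds three short strings rather than three large data frames. purrr::walk() is a further alternative if purrr is already in use.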
I have two data frames having "TagNames" and "FileNames" and I have CSV files in a directory. I need to open csv files one by one using "FileNames" then fetch columns from CSV file by matching "TagNames", append them to a "result" data frame and move to next CSV file (repeat).
Note: I also have to take care of date and time, because records coming from different files must be placed in order of date and time.
TagNames and FileNames are as follows: [image: Tag Names and File Names]
The files directory and data look like this: [image: Files Directory and Data Shape in CSV]
My R Script is this:
# Load the Data
basepath <- dirname(rstudioapi::getActiveDocumentContext()$path)
FilesDF <- read.csv("Config/Files.csv")
TagsDF <- read.csv("Config/Tags.csv")
FilesList <- list(FilesDF)
TagsList <- list(TagsDF)
extractData <- function(x) {
result <- NULL;
temp <- NULL;
for (i in 1:nrow(x)) {
new_df <- read.csv(file=x$FileNames[i,], header=TRUE, sep=",")
for(j in q:ncol(new_df))
{
temp <- rbind(temp, new_df[which(new_df[1,j])==TagsList$Tag.Names[i,]])
}
result <- rbind(result, temp)
temp <- NULL
}
return(result)
}
df_combined <- lapply(FilesList, extractData)
write.csv(df_combined, file = "UreaSVR2.csv")
In base R I would use something like:
do.call(rbind, lapply(fileList, function(f) read.csv(f)[, TagNames, drop = FALSE]))
Here fileList is assumed to be a character vector of file paths and TagNames a character vector of column names. lapply() reads each file and keeps only the matching columns; do.call(rbind, ...) then puts the list together into a single data.frame. Calling rbind() directly on the list would not combine the data frames.
I would probably use purrr and dplyr myself, though, and write it more like this:
map(fileList, read.csv) %>%
map_df(select, TagNames)
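As a concrete illustration of the read, select, and stack pattern, here is a self-contained base R sketch. The file names, the column names (DateTime, TagA, TagB, Other), and the timestamp format are all made up for the demo, since the real data is only shown as images in the question.

```r
# Two small CSV files with made-up columns, written to a temporary directory
tmp <- file.path(tempdir(), "tags_demo")
dir.create(tmp, showWarnings = FALSE)
write.csv(data.frame(DateTime = c("2020-01-01 00:00", "2020-01-01 01:00"),
                     TagA = 1:2, TagB = 3:4, Other = 5:6),
          file.path(tmp, "file1.csv"), row.names = FALSE)
write.csv(data.frame(DateTime = "2019-12-31 23:00",
                     TagA = 7, TagB = 8, Other = 9),
          file.path(tmp, "file2.csv"), row.names = FALSE)

fileList <- file.path(tmp, c("file1.csv", "file2.csv"))
TagNames <- c("DateTime", "TagA", "TagB")   # columns to keep (assumed names)

# Read each file, keep the wanted columns, then stack the pieces
pieces <- lapply(fileList, function(f) read.csv(f)[, TagNames, drop = FALSE])
result <- do.call(rbind, pieces)

# Order rows by timestamp, as the question requires
result$DateTime <- as.POSIXct(result$DateTime, format = "%Y-%m-%d %H:%M", tz = "UTC")
result <- result[order(result$DateTime), ]
```

Sorting after stacking means the order in which the files are read does not matter, as long as every file carries a parseable timestamp column.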
After having searched for help in different threads on this topic, I still have not become wiser. Therefore: Here comes another question on looping through multiple data files...
OK. I have multiple CSV files in one folder containing 5 columns of data. The filenames are as follows:
Moist yyyymmdd hh_mm_ss.csv
I would like to create a script that reads and processes the CSV files one by one, doing the following steps:
1) load file
2) check number of rows and exclude file if less than 3 registrations
3) calculate mean value of all measurements (=rows) for column 2
4) calculate mean value of all measurements (=rows) for column 4
5) output the filename timestamp, mean column 2 and mean column 4 to a data frame.
I have written the following function
moist.each.mean <- function() {
library("tcltk")
directory <- tk_choose.dir("","Choose folder for Humidity data files")
setwd(directory)
filelist <- list.files(path = directory)
filetitles <- regmatches(filelist, regexpr("[0-9].*[0-9]", filelist))
mdf <- data.frame(timestamp=character(), humidity=numeric(), temp=numeric())
for(i in 1:length(filelist)){
file.in[[i]] <- read.csv(filelist[i], header=F)
if (nrow(file.in[[i]]<3)){
print("discard")
} else {
newrow <- c(filetitles[[i]], round(mean(file.in[[i]]$V2),1), round(mean(file.in[[i]]$V4),1))
mdf <- rbind(mdf, newrow)
}
}
names(mdf) <- c("timestamp", "humidity", "temp")
}
but I keep getting an error:
Error in `[[<-.data.frame`(`*tmp*`, i, value = list(V1 = c(10519949L, :
replacement has 18 rows, data has 17
Any ideas?
Thx, kruemelprinz
I'd also suggest to use (l)apply... Here's my take:
getMeans <- function(fpath,
target_cols = c(2),
sep=",",
dec=".",
header = TRUE,
min_obs_threshold = 3){
# list only the .csv files, with their full paths
fcsv <- list.files(fpath, pattern = "\\.csv$", full.names = TRUE)
csv_list <- lapply(fcsv,read.table,sep = sep,
dec = dec, header = header)
csv_rows <- sapply(csv_list,nrow)
# keep only files with at least min_obs_threshold rows
rel_csv_list <- csv_list[csv_rows >= min_obs_threshold]
# drop = FALSE keeps a data frame even when only one column is selected
lapply(rel_csv_list,function(x) colMeans(x[,target_cols, drop = FALSE]))
}
Also with that kind of error message, the debugger might be very helpful.
Just run debug(moist.each.mean) and execute the function stepwise.
Here's a slightly different approach. Use lapply to read each csv file, exclude it if necessary, otherwise create a summary. This gives you a list where each element is a data frame summary. Then use rbind to create the final summary data frame.
Without a sample of your data, I can't be sure the code below exactly matches your problem, but hopefully it will be enough to get you where you want to go.
# Get vector of filenames to read
filelist=list.files(path=directory, pattern="csv")
# Read all the csv files into a list and create summaries
df.list = lapply(filelist, function(f) {
file.in = read.csv(f, header=FALSE, stringsAsFactors=FALSE)
# Print a message (and return no data frame) if the file has fewer than 3 rows of data
if (nrow(file.in) < 3) {
print(paste("Discard", f))
# Otherwise, capture file timestamp and summarise data frame
} else {
data.frame(timestamp=substr(f, 7, 23),
humidity=round(mean(file.in$V2),1),
temp=round(mean(file.in$V4),1))
}
})
# Bind list into final summary data frame (excluding the list elements
# that don't contain a data frame because they didn't have enough rows
# to be included in the summary)
result = do.call(rbind, df.list[sapply(df.list, is.data.frame)])
One issue with your original code is that you create a vector of summary results rather than a data frame of results:
c(filetitles[[i]], round(mean(file.in[[i]]$V2),1), round(mean(file.in[[i]]$V4),1)) is a vector with three elements. What you actually want is a data frame with three columns:
data.frame(timestamp=filetitles[[i]],
humidity=round(mean(file.in[[i]]$V2),1),
temp=round(mean(file.in[[i]]$V4),1))
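The difference between the two forms is easy to check in isolation. In this short sketch the timestamp string and the numbers are made up for illustration.

```r
# c() builds an atomic vector, so mixing a string with numbers
# coerces everything to character
newrow_vec <- c("20150101 10_00_00", 45.2, 21.7)
class(newrow_vec)            # "character"

# A one-row data frame keeps each column's own type
newrow_df <- data.frame(timestamp = "20150101 10_00_00",
                        humidity = 45.2,
                        temp = 21.7)
sapply(newrow_df, class)

# Rows built this way can be stacked with rbind()
mdf <- rbind(newrow_df,
             data.frame(timestamp = "20150101 11_00_00",
                        humidity = 47.0,
                        temp = 21.9))
```

With the data frame form, humidity and temp stay numeric after rbind(), so later calculations on those columns do not need a conversion step.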
Thanks for the suggestions using lapply. This is definitely of value as it saves a whole lot of code as well! Meanwhile, I managed to fix my original code as well:
library("tcltk")
# directory: path to csv files
directory <-
tk_choose.dir("","Choose folder for Humidity data files")
setwd(directory)
filelist <- list.files(path = directory)
filetitles <-
regmatches(filelist, regexpr("[0-9].*[0-9]", filelist))
mdf <- data.frame()
for (i in 1:length(filelist)) {
file.in <- read.csv(filelist[i], header = F, skipNul = T)
if (nrow(file.in) < 3) {
print("discard")
} else {
newrow <-
matrix(
c(filetitles[[i]], round(mean(file.in$V2, na.rm=T),1), round(mean(file.in$V4, na.rm=T),1)), nrow = 1, ncol =
3, byrow = T
)
mdf <- rbind(mdf, newrow)
}
}
names(mdf) <- c("timestamp", "humidity", "temp")
However, I could not get it to work as a function: mdf then ended up with only one row, containing the data from the last file. Somehow each iteration overwrote row 1 instead of adding a row. Using the code without a function wrapper worked fine...
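For reference, here is a sketch of how a function version could look. It is not a diagnosis of the problem described above, which cannot be reproduced from the post. The names summarise_moist_files and min_rows are hypothetical, and the demo files are made up. One point worth knowing is that a function whose last statement is names(mdf) <- ... returns the names rather than mdf, so this version returns the combined data frame explicitly.

```r
# Hypothetical function version: takes file paths, returns one summary row per kept file
summarise_moist_files <- function(paths, min_rows = 3) {
  rows <- lapply(paths, function(p) {
    file.in <- read.csv(p, header = FALSE)
    if (nrow(file.in) < min_rows) return(NULL)   # discard short files
    data.frame(
      timestamp = regmatches(basename(p), regexpr("[0-9].*[0-9]", basename(p))),
      humidity  = round(mean(file.in$V2, na.rm = TRUE), 1),
      temp      = round(mean(file.in$V4, na.rm = TRUE), 1)
    )
  })
  do.call(rbind, rows)   # NULL entries are dropped; this value is what the function returns
}

# Demo with two made-up files: one long enough, one too short
demo_dir <- file.path(tempdir(), "moist_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.table(data.frame(1:4, c(40, 42, 44, 46), 0, c(20, 21, 22, 23), 0),
            file.path(demo_dir, "Moist 20150101 10_00_00.csv"),
            sep = ",", row.names = FALSE, col.names = FALSE)
write.table(data.frame(1:2, c(50, 52), 0, c(25, 26), 0),
            file.path(demo_dir, "Moist 20150102 11_00_00.csv"),
            sep = ",", row.names = FALSE, col.names = FALSE)

mdf <- summarise_moist_files(
  list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
)
```

Passing full paths into the function avoids the setwd() call, and collecting the rows in a list before a single rbind avoids growing mdf inside the loop.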
In summary, I have a script for importing lots of data stored in several txt files. Within a single file, not all the rows belong in the same table (DF, now switching to DT), so for each file I select all the rows belonging to the same DF, get the DF, and assign the rows to it.
The first time I create a DF named, say, table1, I do:
name <- "table1" # in my code the value of name will depend on different factors
# and **not** known in advance
assign(name, someRows)
Then, during the execution my code may find (in other files) other lines to be put in the table1 data frame, so:
name <- "table1"
assign(name, rbind.fill(get(name), someRows))
My question is: is assign(get(string), anyObject) the best way for doing assignment programmatically? Thanks
EDIT:
here is a simplified version of my code: (each item in dataSource is the result of read.table() so one single text file)
set.seed(1)
#
dataSource <- list(data.frame(fileType = rep(letters[1:2], each=4),
id = rep(LETTERS[1:4], each=2),
var1 = as.integer(rnorm(8))),
data.frame(fileType = rep(letters[1:2], each=4),
id = rep(LETTERS[1:4], each=2),
var1 = as.integer(rnorm(8))))
# # #
#
library(plyr)
#
tablesnames <- unique(unlist(lapply(dataSource,function(x) as.character(unique(x[,1])))))
for(l in tablesnames){
temp <- lapply(dataSource, function(x) x[x[,1]==l, -1])
if(exists(l)) assign(l, rbind.fill(get(l), rbind.fill(temp))) else assign(l, rbind.fill(temp))
}
#
#
# now two data frames a and b are created
#
#
# different method using rbindlist in place of rbind.fill (faster and, until now, I don't
# have missing column to fill)
#
rm(a,b)
library(data.table)
#
tablesnames <- unique(unlist(lapply(dataSource,function(x) as.character(unique(x[,1])))))
for(l in tablesnames){
temp <- lapply(dataSource, function(x) x[x[,1]==l, -1])
if(exists(l)) assign(l, rbindlist(list(get(l), rbindlist(temp)))) else assign(l, rbindlist(temp))
}
I would recommend using a named list and skipping assign and get. Many of the cool R features (lapply, for example) work very well on lists but do not work on variables managed with assign and get. In addition, you can easily pass a list to a function, while this is cumbersome with groups of variables combined with assign and get.
If you want to read a set of files into one big data.frame I'd use something like this (assuming csv like text files):
library(plyr)
list_of_files = list.files(pattern = "\\.csv$")  # pattern is a regular expression, not a glob
big_dataframe = ldply(list_of_files, read.csv)
or if you want to keep the result in a list:
big_list = lapply(list_of_files, read.csv)
and possibly use rbind.fill:
big_dataframe = do.call("rbind.fill", big_list)
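To make the named-list suggestion concrete for the data in the question, here is a minimal base R sketch. It rebuilds the dataSource list from the question's EDIT and collects the rows per fileType in a list called tables. That name and the loop variables are illustrative. Plain rbind() is used, assuming all pieces share the same columns; rbind.fill or rbindlist could be substituted when they do not.

```r
set.seed(1)
# Same shape as dataSource in the question: one data frame per input file
dataSource <- list(
  data.frame(fileType = rep(letters[1:2], each = 4),
             id = rep(LETTERS[1:4], each = 2),
             var1 = as.integer(rnorm(8))),
  data.frame(fileType = rep(letters[1:2], each = 4),
             id = rep(LETTERS[1:4], each = 2),
             var1 = as.integer(rnorm(8)))
)

tables <- list()   # named list that replaces the free-floating variables
for (src in dataSource) {
  for (nm in unique(as.character(src$fileType))) {
    rows <- src[src$fileType == nm, -1]
    # tables[[nm]] is NULL the first time, and rbind(NULL, rows) is just rows
    tables[[nm]] <- rbind(tables[[nm]], rows)
  }
}

names(tables)      # "a" "b"
nrow(tables$a)     # 8
```

The exists() check from the question disappears, because a missing list element is simply NULL. The table names can also be discovered at run time, and the whole collection can be handed to lapply() or passed to a function as a single object.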