I have a vector of file paths called dfs, and I want to create a dataframe from each of those files and bind them together into one huge dataframe, so I did something like this:
for (df in dfs){
  clean_df <- bind_rows(as.data.table(read.delim(df, header=T, sep="|")))
  return(clean_df)
}
but only the last file's dataframe is being returned. How do I fix this?
I'm not sure about your file format, so I'll take a common .csv as an example. Replace the a * i part with code that actually reads each of the different files, instead of just generating mock-up data.
files = list()
for (i in 1:10) {
  a = read.csv('test.csv', header = FALSE)
  a = a * i
  files[[i]] = a
}
full_frame = data.frame(data.table::rbindlist(files))
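For your pipe-delimited files, the same list-then-rbindlist pattern would look like this (a sketch, assuming dfs holds the file paths as in your question):
files = list()
for (i in seq_along(dfs)) {
  # read.delim handles the pipe separator directly
  files[[i]] = read.delim(dfs[i], header = TRUE, sep = "|")
}
full_frame = data.frame(data.table::rbindlist(files))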
The problem is that read.delim() can only read one file at a time. The solution is to use a function like lapply() to read each file in your dfs vector, and then bind the resulting list of data frames together.
Here's an example:
library(tidyverse)
df <- c("file1.txt","file2.txt")
all.files <- lapply(df,function(i){read.delim(i, header=T, sep="|")})
clean_df <- bind_rows(all.files)
(clean_df)
Note that you don't need return() here; return() only belongs inside a function. Wrapping clean_df in parentheses prompts R to print the variable.
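Applied directly to your dfs vector, the whole thing collapses to two lines (a sketch, assuming every file shares the same pipe-delimited format):
# Read every path in dfs, then stack the resulting data frames
all.files <- lapply(dfs, read.delim, header=T, sep="|")
clean_df <- bind_rows(all.files)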
I need to write multiple variables (dataframes) to different .txt files, named after their original variable names. I tried to use the ls() function to select my desired variables by pattern, but with no success. Is there any other approach to do this?
Using the ls() function I was able to create .txt files with the correct filenames based on my variables (data1_tables.txt, data2_tables.txt, etc.), but with the wrong output.
#create some variables based on mtcars data
data1 <- mtcars[1:5,]
data2 <- mtcars[6:10,]
data3 <- mtcars[11:20,]
fileNames = ls(pattern="data", all.names=TRUE)
for (i in fileNames) {
  write.table(i, paste(i,"_tables.txt",sep=""), row.names=T, sep="\t", quote=F)
}
I want the created files (data1_tables.txt, data2_tables.txt, data3_tables.txt) to contain the contents of the original data1, data2, data3 variables.
What is happening is that what you're actually writing to the files are the elements of the fileNames vector (which are just strings). If you want to write an object to a file through the write functions, you need to pass the object itself, not the name of the object.
#create some variables based on mtcars data
data1 <- mtcars[1:5,]
data2 <- mtcars[6:10,]
data3 <- mtcars[11:20,]
fileNames = ls(pattern="data", all.names=TRUE)
for(i in fileNames) {
  write.table(x=get(i),                       # The get function gets an object with a given name.
              file=paste0(i, "_tables.txt"),  # paste0 is basically paste with sep="" by default
              row.names=T,
              sep="\t",
              quote=F)
}
Change the end of your code to:
for (i in fileNames) {
  write.table(eval(as.name(i)), paste(i,"_tables.txt",sep=""), row.names=T, sep="\t", quote=F)
}
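For completeness, the same write-out works without an explicit loop; here is a sketch using mget() to fetch all the matched objects at once (same fileNames vector as above):
tables <- mget(fileNames)  # named list: object name -> data frame
invisible(mapply(function(obj, nm) {
  write.table(obj, paste0(nm, "_tables.txt"), row.names=T, sep="\t", quote=F)
}, tables, names(tables)))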
I have read multiple questionnaire files into DFs in R. Now I want to create new DFs based on them, but with only specific rows in them, by looping over all of them. The loop appears to work fine. However, the selection of the rows does not seem to work. When I try selecting with simple square brackets, I get the error "incorrect number of dimensions". I tried it with subset(), but I don't seem to be able to set the subset correctly.
Here is what I have so far:
for (i in 1:length(subjectlist)) {
  p[i] <- paste("path", subjectlist[i], sep="")
  files <- list.files(path=p, full.names=T, include.dirs=T)
  assign(paste("subject_", i, sep=""),
         read.csv(paste("path", subjectlist[i], ".csv", sep=""), header=T, stringsAsFactors=T, row.names=NULL))
  assign(paste("subject_", i, "_t", sep=""),
         sapply(paste("subject_", i, sep=""), [c((3:22),(44:63),(93:112),(140:159),(180:199),(227:246)),]))
}
Here's some code that abstracts away the details and does what it seems like you're trying to do. If you just want to read in a bunch of files and then select certain rows, you can avoid the assign calls entirely and use sapply to read all the data frames into a list. Let me know if this helps:
# Get the names of the files we want to read in
files = list.files([arguments])
df.list = sapply(files, function(file) {
  # Read in a csv file from the files vector
  df = read.csv(file, header=TRUE, stringsAsFactors=FALSE)
  # Add a column recording which csv file each row came from
  df$SourceFile = file
  # Select only the rows we want
  df = df[c(3:22,44:63,93:112,140:159,180:199,227:246), ]
}, simplify=FALSE)
If you now want to combine all the data frames into a single data frame, you can do the following (the SourceFile column tells you which file each row originally came from):
# Combine all the files into a single data frame
allDFs = do.call(rbind, df.list)
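With the SourceFile column in place, pulling one file's rows back out of the combined frame is a one-liner (the filename here is hypothetical):
# Recover the rows that came from a single source file
subset(allDFs, SourceFile == "subject_1.csv")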
I have 100 files, from each of which I want to extract the 4th column (total_volume) containing 100,000 rows, and put them together into one big file that then contains 100 columns of 100,000 rows each. I was trying something with the following script:
setwd("/run/media/mydirectory")
library(data.table)
fileNames <- Sys.glob("*.txt.csv")
#read file in fileNames
for (fileName in fileNames) {
  dataDf <- read.delim(fileName, header = FALSE)
  # remove columns with only example values
  dataDf <- dataDf[, -(7:14)]
  # convert data frame to data table
  dataDt <- data.table(dataDf)
  # set column names
  setnames(dataDt, c("mcs", "cell_type", "cell_number", "total_volume"))
  # new vector with only total volume
  total_volume <- dataDt$total_volume
  # export file
  write.table(dataDt$total_volume, file = "total_volume20.csv")
}
But what I get then is that all the columns get superimposed, with the result being a .csv file containing only the 4th column of the last file. I would like the columns to be next to each other instead of superimposed. How could I do that?
Thanks in advance!
P.S. Obviously the overwriting thing happens because I used a loop. However, I am not sure how else to combine everything, so suggestions are very welcome!
You haven't given us a reproducible example, so I can't test this properly, but this should give you a table with one column for total volume from each of the files you get from the call to Sys.glob(). The idea is to make a function that does what you want for one file; use lapply() to make a list with the results of that function for each file in your target environment; then cbind the columns in that list into one big table.
setwd("/run/media/mydirectory")
library(data.table)
fileNames <- Sys.glob("*.txt.csv")
# For the function, I'm reproducing your code. You could do it in fewer lines and without
# data.table if you like, but maybe there's a reason you chose this approach.
extractor <- function(fileName) {
  require(data.table)
  dataDf <- read.delim(fileName, header = FALSE)
  dataDf <- dataDf[, -(7:14)]
  dataDt <- data.table(dataDf)
  setnames(dataDt, c("mcs", "cell_type", "cell_number", "total_volume"))
  total_volume <- dataDt$total_volume
  return(total_volume)
}
total.list <- lapply(fileNames, extractor)
total.table <- Reduce(cbind, total.list)
write.table(total.table, file = "total_volume20.csv")
Or do that last bit in one line if you like:
write.table(Reduce(cbind, lapply(Sys.glob("*.txt.csv"), extractor)), file="total_volume20.csv")
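One caveat: Reduce(cbind, ...) silently recycles values if the files yield unequal row counts, so a quick guard before binding is cheap insurance (assumes total.list as built above):
# Stop early if any file yielded a different number of rows
stopifnot(length(unique(sapply(total.list, length))) == 1)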
I have written a loop in R (still learning). My purpose is to pick the max AvgConc and the max Roll_TotDep from each file in the loop, and then build two data frames that each contain all the max values picked from the individual files. The code I wrote only saves the results of the last iteration (for one single file)... Can someone point me in the right direction to revise my code, so I can append the result of each new iteration to the previous ones? Thanks!
data.folder <- "D:\\20150804"
files <- list.files(path=data.folder)
for (i in 1:length(files)) {
  sub <- read.table(file.path(data.folder, files[i]), header=T)
  max1Conc <- sub[which.max(sub$AvgConc),]
  maxETD <- sub[which.max(sub$Roll_TotDep),]
  write.csv(max1Conc, file= "max1Conc.csv", append=TRUE)
  write.csv(maxETD, file= "maxETD.csv", append=TRUE)
}
The problem is that max1Conc and maxETD are not lists, data.frames, or vectors (or other types of object capable of storing more than one value); they get overwritten on every iteration.
To fix this:
maxETD <- data.frame()
max1Conc <- data.frame()
for (i in 1:length(files)) {
  sub <- read.table(file.path(data.folder, files[i]), header=T)
  max1Conc <- rbind(max1Conc, sub[which.max(sub$AvgConc),])
  maxETD <- rbind(maxETD, sub[which.max(sub$Roll_TotDep),])
}
# Write the accumulated results once, after the loop
# (write.csv ignores append=TRUE, so appending inside the loop would not work)
write.csv(max1Conc, file = "max1Conc.csv", row.names = FALSE)
write.csv(maxETD, file = "maxETD.csv", row.names = FALSE)
The difference here is that the two variables you wish to write out (max1Conc and maxETD) start as empty data frames, each iteration's max row is added with rbind, and the files are written once after the loop finishes.
There are more idiomatic R ways of accomplishing your goal; personally, I suggest you look into learning the apply family of functions. (http://adv-r.had.co.nz/Functionals.html)
I can't directly test the whole thing because I don't have a directory of files like yours, but I tested the parts, and I think this should work as an apply-driven alternative. It starts with a pair of functions, one to ingest a file from your directory and the other to make a one-row data frame out of the two max values from each of those files:
library(dplyr)
data.folder <- "D:\\20150804"
getfile <- function(filename) {
  sub <- read.table(file.path(data.folder, filename), header=TRUE)
  return(sub)
}
getmaxes <- function(df) {
  rowi <- data.frame(AvgConc.max = max(df[,"AvgConc"]), Roll_TotDep.max = max(df[,"Roll_TotDep"]))
  return(rowi)
}
Then it uses a couple of rounds of lapply, embedded in piping courtesy of dplyr, to a) build a list with each data set as an item, b) build a second list of one-row data frames with the maxes from each item in the first list, c) rbind those rows into one big data frame, and d) cbind the filenames to that data frame for reference.
dfmax <- lapply(as.list(list.files(path = data.folder)), getfile) %>%
  lapply(., getmaxes) %>%
  Reduce(function(...) rbind(...), .) %>%
  data.frame(file = list.files(path = data.folder), .)
I would like to know how to solve the following problem using higher-order functions like ddply, ldply, and dlply, avoiding problematic for loops.
The problem:
I have a .csv file representing a dataset loaded into a data.frame, with each row containing the path to a directory where more information is stored in files. I want to use the directory information in the data.frame to open the files ("file1.txt", "file2.txt") in that directory, merge them, and then combine the merged files from each entry into one large dataframe.
something like this:
df =
entryName,dir
1,/home/guest/data/entry1
2,/home/guest/data/entry2
3,/home/guest/data/entry3
4,/home/guest/data/entry4
What I would like to do is apply a function to the dataframe that takes the directory, appends the two file names "file1.txt" and "file2.txt", then merges the two files together based on a given field.
For example, file1.txt could be:
entry,subEntry,value
1,A,2
1,B,3
1,C,4
1,D,5
1,E,3
1,F,3
For example, file2.txt could be:
entry,subEntry,value
1,A,8
1,B,7
1,C,8
1,D,9
1,E,8
1,F,7
The output would look something like this:
entryName,subEntry,valueFromFile1,valueFromFile2
1,A,2,8
1,B,3,7
1,C,4,8
1,D,5,9
1,E,3,8
1,F,3,7
2,A,4,8
2,B,5,9
2,C,6,7
2,D,3,7
2,E,6,8
2,F,5,9
Right now I am using a for loop, but for obvious reasons would like to use a higher order function. Here is what I have so far:
allCombined <- data.frame()
df <- read.csv(file="allDataEntries.csv", header=TRUE)
numberOfEntries <- dim(df)[1]
for(i in 1:numberOfEntries){
  dir <- df$dir[i]
  file1String <- paste(dir,"/file1.txt",sep='')
  file2String <- paste(dir,"/file2.txt",sep='')
  file1.df <- read.csv(file=file1String, header=TRUE)
  file2.df <- read.csv(file=file2String, header=TRUE)
  localMerged <- merge(file1.df, file2.df, by="value")
  allCombined <- rbind(allCombined, localMerged)
}
#rest of my analysis...
Here is one way to do it. The idea is to create a list with the contents of all the files, and then use Reduce to merge them sequentially using the common columns entry and subEntry.
# READ DIRECTORIES, FILES AND ENTRIES
dirs <- read.csv(file = "allDataEntries.csv", header = TRUE, as.is = TRUE)$dir
files <- as.vector(outer(dirs, c('file1.txt', 'file2.txt'), 'file.path'))
entries <- lapply(files, 'read.csv', header = TRUE)
# APPLY CUSTOM MERGE FUNCTION TO COMBINE ENTRIES
merge_by <- function(x, y){
  merge(x, y, by = c('entry', 'subEntry'))
}
Reduce('merge_by', entries)
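One detail to anticipate: when merge() finds clashing non-key columns, it suffixes them value.x and value.y. Renaming after the merge reproduces the headers from your desired output (a sketch for a single directory's pair of files, using the dirs vector from above):
combined <- merge_by(read.csv(file.path(dirs[1], 'file1.txt')),
                     read.csv(file.path(dirs[1], 'file2.txt')))
names(combined)[names(combined) == 'value.x'] <- 'valueFromFile1'
names(combined)[names(combined) == 'value.y'] <- 'valueFromFile2'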
I've not tested this, but it seems like it should work. The anonymous function takes a single row from df, reads in the two associated files, and merges them together by value. ddply then takes these data frames and makes a single one out of them by rbinding (since the requested output is a data frame). It does assume entryName is not repeated in df; if it is, you can add a unique id column to group on instead.
library(plyr)
ddply(df, .(entryName), function(DF) {
  dir <- DF$dir
  file1String <- paste(dir,"/file1.txt",sep='')
  file2String <- paste(dir,"/file2.txt",sep='')
  file1.df <- read.csv(file=file1String, header=TRUE)
  file2.df <- read.csv(file=file2String, header=TRUE)
  merge(file1.df, file2.df, by="value")
})