I am working on a project that imports all csv files from a given folder and merges them into one file. I was able to import the rows and columns I wanted from each of the files from the folder but now need help merging them all into one file. I do not know how many files I will eventually end up with (probably around 120) so I do not want to merge them 1 by 1.
Here is what I have so far:
# Import All files
rowsToUse <- c(9:104,657:752)
colsToUse <- c(15,27,28,29,30,33,35)
filenames <- list.files("save", pattern="*.csv", full.names=TRUE)
for (i in seq_along(filenames)) {
  assign(paste("df", i, sep = "."), read.csv(filenames[i])[!is.na(30),][rowsToUse,colsToUse])
}
# Merge into one file
for (i in seq_along(filenames)) {
  df <- rbind(df.[i])
}
The first part of the code creates a series of data frames labeled df.1, df.2, etc. I would like them to end up in one final data frame called df. All files are identical in structure.
I would really appreciate some help if someone has a few extra minutes! Thank you!
Since you have already read the files in, you can try the following:
do.call(rbind, mget(ls(pattern = "df")))
The ls(pattern = "df") should capture all of your "df.1", "df.2", and so on. Hopefully you don't have other things named with the same pattern, but if you do, experiment with a stricter pattern until the command lists just your data frames.
mget() will bring all of these into a list on which you can use do.call(rbind, ...).
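For example, a stricter pattern (just a sketch, assuming your data frames are named exactly df.1, df.2, and so on) could anchor the match:
# Match only names like df.1, df.25, ... and nothing else
df <- do.call(rbind, mget(ls(pattern = "^df\\.[0-9]+$")))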
Those all seem complicated ;). The answers above seem to be operating on "we have a list of objects with very similar names, how do we handle that". Answer: they don't need to have very similar names. They don't even have to be different objects.
If you read the files in not through a for loop, but through lapply(), you get a single object that contains all of the data frames - each one as a single element. These can then trivially be extracted. So you'd have something that looks like...
#Grab a list of filenames
filenames <- list.files("save", pattern="*.csv", full.names=TRUE)
#Iterate through that list of names, using lapply(), reading the data in.
list_of_data_frames <- lapply(filenames, function(x){
  #Read the data in (same row/column selection as in the question; note that
  #!is.na(30) is always TRUE, so that subset keeps every row)
  to_return <- read.csv(x)[!is.na(30), ][c(9:104, 657:752), c(15, 27, 28, 29, 30, 33, 35)]
  #Return it. You could save lines of code (and processor time!) by just reading
  #straight into return(), but it would be a lot less clear.
  return(to_return)
})
#Now use do.call to turn it into a single data frame.
data.df <- do.call("rbind", list_of_data_frames)
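As a side note, if you have the data.table package installed, data.table::rbindlist() does the same stacking and is usually faster when there are many files:
# Same result as do.call("rbind", ...), returned as a data.table
# (which also inherits from data.frame)
data.df <- data.table::rbindlist(list_of_data_frames)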
Related
I have several .csv files of data stored in a directory, and I need to import all of them into R.
Each .csv has two columns when imported into R. However, the 1001st row needs to be stored as a separate variable for each of the .csv files (it corresponds to an expected value which was stored here during the simulation; I want it to be outside of the main data).
So far I have the following code to import my .csv files as matrices.
#Load all .csv in directory into list
dataFiles <- list.files(pattern="*.csv")
for(i in dataFiles) {
  #read all of the csv files
  name <- gsub("-",".",i)
  name <- gsub(".csv","",name)
  i <- paste(".\\",i,sep="")
  assign(name,read.csv(i, header=T))
}
This produces several matrices with the naming convention "sim_data_L_mu" where L and mu are parameters from the simulation. How can I remove the 1001st row (which has a number in the first column, and the second column is null) from each matrix and store it as a variable named "sim_data_L_mu_EV"? The main problem I have is that I do not know how to call all of the newly created matrices in my for loop.
Couldn't post long code in comments so am writing here:
# Use dialog to select folder
# Full names are required to access files that are not in the current working directory
file_list <- list.files(path = choose.dir(), pattern = "*.csv", full.names = T)
big_list <- lapply(file_list, function(z){
  df <- read.csv(z)
  scalar <- df[1000, 1]  # the stored expected value (adjust the index if your header shifts it)
  df <- df[-1000, ]      # keep the expected value out of the main data
  return(list(df, scalar))
})
To access the scalar value from the third file, you can use
big_list[[3]][[2]]
The elements in big_list follow the order of file_list so you always know which file the data comes from.
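If you want all of the expected values at once, here is a small sketch using the same list structure:
# Second element of each inner list is the scalar
all_scalars <- sapply(big_list, function(el) el[[2]])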
If you use data.table::fread() instead of read.csv, you can play around with assigning column names, selecting which rows/columns to read etc. It's also considerably faster for large datafiles.
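For instance, a minimal fread() sketch (the column names here are made up, just to illustrate the arguments):
library(data.table)
# Read only the first two columns of the first 1000 data rows,
# assigning column names explicitly
dt <- fread(file_list[1], select = 1:2, nrows = 1000, col.names = c("value", "expected"))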
Hope this helps!
I am trying to clean up some data in R. I have a bunch of .txt files: each .txt file is named with an ID (e.g. ABC001), and there is a column (let's call this ID_Column) in the .txt file that contains the same ID. Each column has 5 rows (or less - some files have missing data). However, some of the files have incorrect/missing IDs (e.g. ABC01). Here's an image of what each file looks like:
https://i.stack.imgur.com/lyXfV.png
What I am trying to do here is to import everything AND replace the ID_Column with the filename (which I know to all be correct).
Is there any way to do this easily? I think this can probably be done with a for loop but I would like to know if there is any other way. Right now I have this:
all_files <- list.files(pattern=".txt")
data <- do.call(rbind, lapply(all_files, read.table, header=TRUE))
So, basically, I want to know if it is possible to use lapply (or any other function) to replace data$ID_Column with the filenames in all_files. I am having trouble as each filename is only represented once in all_files, while each ID_Column in data is represented 5 times (but not always, due to missing data). I think the solution is to create a function and call it within lapply, but I am having trouble with that.
Thanks in advance!
I would just make a function that uses read.table and adds the file's name as a column.
all_files <- list.files(pattern=".txt")
data <- do.call(rbind, lapply(all_files, function(x){
  a <- read.table(x, header=TRUE)
  a$ID_Column <- x
  return(a)
}))
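If you want the ID without the ".txt" extension, you could instead set the column inside the function like this (just a sketch):
# Strip the extension so ID_Column holds e.g. "ABC001" rather than "ABC001.txt"
a$ID_Column <- sub("\\.txt$", "", x)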
I have multiple CSV files and I know how to read them and rbind them. But my problem is that before binding them, I want to perform some actions, and then rbind them.
So for one file i would do this:
a<-read.table(file="F:..... .csv", skip=1401, nrow=2,header=FALSE, sep=";")
head(a)
##display only some columns
G<-a[,c(11:13)]
H<-a[, c(14:16)]
names(G)<-names(H)
H_G<-as.data.frame(rbind(G, H))
##transpose to long format
H_G<-t(H_G)
And now I want to rbind the same result from all the other files.
I tried it with this:
filenames <- list.files(path="F:....2",pattern="*.csv")
readlist <- lapply(filenames, read.table, skip=1401, nrow=2,header=FALSE, sep=";")
but then I do not get the result I want.
This code will do what you want.
Here I initialize some test matrices:
a<-matrix(1:100,10)
b<-matrix(901:1000,10)
write.csv(file="test.csv",a)
write.csv(file="test2.csv",b)
Here I perform your loop:
filenames <- dir(pattern="*.csv")
output <- NULL
for (i in seq_along(filenames)){
  print(filenames[i])
  assign(filenames[i], read.csv(filenames[i], header=FALSE))
  assign(filenames[i], get(filenames[i])[,8:10])
  output <- rbind(output, get(filenames[i]))           # accumulate rows across files
  if (i == length(filenames)) { output <- t(output) }  # transpose once at the end
}
Note: the column numbers I used in the line assign(filenames[i], get(filenames[i])[,8:10]) are arbitrary; you should insert your own.
Let me know if you have any questions or if this doesn't work for you.
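If you'd rather stay with the lapply() approach from your attempt, here is a minimal sketch of the same idea (the folder path is a placeholder, and the column indices come from your single-file example):
filenames <- list.files(path="F:/your/folder", pattern="*.csv", full.names=TRUE)
process_one <- function(f) {
  a <- read.table(file=f, skip=1401, nrow=2, header=FALSE, sep=";")
  G <- a[, c(11:13)]
  H <- a[, c(14:16)]
  names(G) <- names(H)
  t(rbind(G, H))  # transpose to long format, as in the single-file example
}
combined <- do.call(rbind, lapply(filenames, process_one))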
I have an assignment on Coursera and I am stuck - I do not necessarily need or want a complete answer (as this would be cheating) but a hint in the right direction would be highly appreciated.
I have over 300 CSV files in a folder (named 001.csv, 002.csv and so on). Each contains a data frame with a header. I am writing a function that will take three arguments: the location of the files, the name of the column you want to calculate the mean (inside the data frames) and the files you want to use in the calculation (id).
I have tried to keep it as simple as possible:
pm <- function(directory, pollutant, id = 1:332) {
setwd("C:/Users/cw/Documents")
setwd(directory)
files <<- list.files()
First of all, set the wd and get a list of all files
x <- id[1]
x
get the starting point of the user-specified ID.
Problem
for (i in x:length(id)) {
df <- rep(NA, length(id))
df[i] <- lapply(files[i], read.csv, header=T)
result <- do.call(rbind, df)
return(df)
}
}
So this is where I am hitting a wall: I would need to take the user-specified input from above (e.g. 10:25) and put the content from files "010.csv" through "025.csv" into a dataframe to actually come up with the mean of one specific column.
So my idea was to run a for-loop along the length of id (e.g. 16 for 10:25) starting with the starting point of the specified id. Within this loop I would then need to take the appropriate values of files as the input for read.csv and put the content of the .csv files in a dataframe.
I can get single .csv files and put them into a dataframe, but not several.
Does anybody have a hint on how I could proceed?
Based on your example e.g. 16 files for 10:25, i.e. 010.csv, 011.csv, 012.csv, etc.
Under the assumption that your naming convention follows the order of the files in the directory, you could try:
csvFiles <- list.files(pattern="\\.csv")[10:25] # here [10:25] ... in production, use your function parameter (id) here
df_list <- lapply(X=csvFiles, read.csv, header=TRUE)
names(df_list) <- csvFiles # OPTIONAL: if you want to trace rows back to the csv they came from
df <- do.call("rbind", df_list)
mean(df[ ,"columnName"])
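One more hint: if the column contains missing values (which it likely will in this assignment), mean() needs na.rm = TRUE, i.e. mean(df[ ,"columnName"], na.rm = TRUE).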
These code snippets should be straightforward to adapt and incorporate into your routine.
You can aggregate your csv files into one big table like this:
bigtable <- NULL
for(i in 100:250)
{
  infile <- paste("C:/Users/cw/Documents/", i, ".csv", sep="")
  newtable <- read.csv(infile)
  newtable <- cbind(newtable, rep(i, dim(newtable)[1])) # if you want to be able to identify tables after they are aggregated
  bigtable <- rbind(bigtable, newtable)
}
(you will have to replace 100:250 with the user-specified input).
Then, calculating what you want shouldn't be very hard.
That won't work for files 001 to 099; you'll have to handle those differently because of the leading zeros, but it's fixable with a little extra treatment.
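For example, one small sketch of that fix is to zero-pad the number when building the file name:
infile <- paste("C:/Users/cw/Documents/", sprintf("%03d", i), ".csv", sep="") # 7 becomes "007"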
Why do you have lapply inside a for loop? Just do lapply(files[files %in% paste0(id, ".csv")], read.csv, header=T) (use sprintf("%03d.csv", id) instead of paste0 if you need the leading zeros in the names).
They should also teach you to never use <<-.
I need to read many csv files into dataframes from one folder. The csv file names are of the form fxpair-yyyy-mm.csv (e.g. AUDJPY-2009-05.csv). I want to read all csv files in and create dataframes of the form fxpair.yyyy.mm
I am having trouble creating the dataframe names in the loop for assignment from the read.csv statements
filenames <- list.files(path=getwd())
numfiles <- length(filenames)
#fx.data.frames to hold names that will be assigned to csv files in csv.read
fx.data.frames <- gsub(pattern="-",x=filenames,replacement=".")
fx.data.frames <- gsub(pattern=".csv",x=fx.data.frames,replacement="")
i <-1
for (i in c(1:numfiles)){
  filenames[i] <- paste(".\\",filenames[i],sep="")
  fx.data.frames[i] <- read.csv(filenames[i], header=FALSE)
}
The read.csv seems to work fine, but I am not able to create the data frame objects in the way I intend. I just want some way to name the data frames read in the fxpair.yyyy.mm format based on the file name.
Am I missing something obvious? THANK YOU FOR ANY HELP!!
Just to illustrate my comment:
for (i in filenames){
  name <- gsub("-", ".", i)
  name <- gsub(".csv", "", name)
  i <- paste(".\\", i, sep="")
  assign(name, read.csv(i, header=FALSE))
}
Or, to save all data frames in a list:
All <- lapply(filenames, function(i){
  i <- paste(".\\", i, sep="")
  read.csv(i, header=FALSE)
})
filenames <- gsub("-", ".", filenames)
names(All) <- gsub(".csv", "", filenames)
I'd go for the second solution, as I like working with lists. It's less of a hassle to clean up the workspace afterwards. You also get rid of the name and i clutter in the global environment. These might cause some funny bugs later in the code if you're not careful. See also Is R's apply family more than syntactic sugar?
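With the list approach, each data frame can then be pulled out by name, e.g. (using the example file name from the question):
All[["AUDJPY.2009.05"]]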
How about this:
for (i in c(1:numfiles)){
  name <- gsub("-", ".", gsub("[.]csv$", "", filenames[i]))
  filenames[i] <- paste(".\\", filenames[i], sep="")
  assign(name, read.csv(filenames[i], header=FALSE))
}