Read CSV files and perform a function, then bind together - R

I have multiple CSV files and I know how to read them and rbind them. My problem is that I want to perform some actions on each file before binding them together.
So for one file I would do this:
a<-read.table(file="F:..... .csv", skip=1401, nrow=2,header=FALSE, sep=";")
head(a)
##display only some columns
G<-a[,c(11:13)]
H<-a[, c(14:16)]
names(G)<-names(H)
H_G<-as.data.frame(rbind(G, H))
##transpose to long format
H_G<-t(H_G)
and now I want to rbind the results from all the other files.
I tried it with this:
filenames <- list.files(path="F:....2",pattern="*.csv")
readlist <- lapply(filenames, read.table, skip=1401, nrow=2,header=FALSE, sep=";")
but then I do not get the result I want.

This code will do what you want
Here I initialize some test matrices:
a<-matrix(1:100,10)
b<-matrix(901:1000,10)
write.csv(file="test.csv",a)
write.csv(file="test2.csv",b)
Here I perform your loop:
filenames <- dir(pattern="*.csv")
for (i in seq_along(filenames)) {
  print(filenames[i])
  assign(filenames[i], read.csv(filenames[i], header=FALSE))
  assign(filenames[i], get(filenames[i])[, 8:10])
  if (i == 1) { results <- data.frame() }
  results <- rbind(results, get(filenames[i]))
  if (i == length(filenames)) { output <- t(results) }
}
Note: the column numbers in the line assign(filenames[i], get(filenames[i])[,8:10]) are arbitrary; you should insert your own.
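If you prefer to avoid assign()/get(), here is a minimal lapply-based sketch of the same idea, assuming the per-file steps from the question (columns 11:13 and 14:16, stacked and then transposed); process_one is just an illustrative helper name:
# Assumes the same skip/nrow/sep settings as in the question
filenames <- list.files(path = "F:....2", pattern = "*.csv", full.names = TRUE)
process_one <- function(f) {
  a <- read.table(file = f, skip = 1401, nrow = 2, header = FALSE, sep = ";")
  G <- a[, 11:13]
  H <- a[, 14:16]
  names(G) <- names(H)
  t(rbind(G, H))            # same long-format transpose as for a single file
}
combined <- do.call(rbind, lapply(filenames, process_one))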
Let me know if you have any questions or if this doesn't work for you.

Related

Iterating over CSVs to different dataframes based on file names

I have a dataframe that contains the names of a bunch of .CSV files. It looks like the Key list shown below.
What I'm trying to do is convert each of these .CSVs into a dataframe and append the results of each. Specifically, I want to create three different dataframes based on what's in the file names:
Create a dataframe with all results from .CSV files with -callers- in its file name
Create a dataframe with all results from .CSV files with -results- in its filename
Create a dataframe with all results from .CSV files with -script_results- in its filename
The command to actually convert the .CSV file into a dataframe looks like this if I were using the first .CSV in the dataframe below:
data <- aws.s3::s3read_using(read.csv, object = "s3://abc-testtalk/08182020-testpilot-arizona-results-08-18-2020--08-18-2020-168701001.csv")
But what I'm trying to do is:
Iterate over ALL the .csv files under Key using the s3read_using function
Put them in three separate dataframes based on the file names as listed above
Key
08182020-testpilot-arizona-results-08-18-2020--08-18-2020-168701001.csv
08182020-testpilot-arizona-results-08-18-2020--08-18-2020-606698088.csv
08182020-testpilot-arizona-script_results-08-18-2020--08-18-2020-114004469.csv
08182020-testpilot-arizona-script_results-08-18-2020--08-18-2020-450823767.csv
08182020-testpilot-iowa-callers-08-18-2020-374839084.csv
08182020-testpilot-maine-callers-08-18-2020-396935866.csv
08182020-testpilot-maine-results-08-18-2020--08-18-2020-990912614.csv
08182020-testpilot-maine-script_results-08-18-2020--08-18-2020-897037786.csv
08182020-testpilot-michigan-callers-08-18-2020-367670258.csv
08182020-testpilot-michigan-follow-ups-08-18-2020--08-18-2020-049435266.csv
08182020-testpilot-michigan-results-08-18-2020--08-18-2020-544974900.csv
08182020-testpilot-michigan-script_results-08-18-2020--08-18-2020-239089219.csv
08182020-testpilot-nevada-callers-08-18-2020-782329503.csv
08182020-testpilot-nevada-results-08-18-2020--08-18-2020-348644934.csv
08182020-testpilot-nevada-script_results-08-18-2020--08-18-2020-517037762.csv
08182020-testpilot-new-hampshire-callers-08-18-2020-134150800.csv
08182020-testpilot-north-carolina-callers-08-18-2020-739838755.csv
08182020-testpilot-pennsylvania-callers-08-18-2020-223839956.csv
08182020-testpilot-pennsylvania-results-08-18-2020--08-18-2020-747438886.csv
08182020-testpilot-pennsylvania-script_results-08-18-2020--08-18-2020-546894204.csv
08182020-testpilot-virginia-callers-08-18-2020-027531377.csv
08182020-testpilot-virginia-follow-ups-08-18-2020--08-18-2020-419338697.csv
08182020-testpilot-virginia-results-08-18-2020--08-18-2020-193170030.csv
Create 3 empty dataframes. You will probably also need to indicate column names matching the column names from each of the files you want to append:
results <- data.frame()
script_results <- data.frame()
callers <- data.frame()
Then iterate over file_name and read each file into the data object. Depending on which pattern ("-results-", "-script_results-", or "-callers-") is contained in the name of each file, it will be appended to the correct dataframe:
for (file in file_name) {
  data <- aws.s3::s3read_using(read.csv, object = paste0("s3://abc-testtalk/", file))
  if (grepl("-results-", file)) { results <- rbind(results, data) }
  if (grepl("-script_results-", file)) { script_results <- rbind(script_results, data) }
  if (grepl("-callers-", file)) { callers <- rbind(callers, data) }
}
As an alternative to @JohnFranchak's recommendation for map_dfr (which likely works just fine), the method that I referenced in comments would look something like this:
alldat <- lapply(setNames(nm = dat$file_name),
function(obj) aws.s3::s3read_using(read.csv, object = obj))
callers <- do.call(rbind, alldat[grepl("-callers-", names(alldat))])
results <- do.call(rbind, alldat[grepl("-results-", names(alldat))])
script_results <- do.call(rbind, alldat[grepl("-script_results-", names(alldat))])
others <- do.call(rbind, alldat[!grepl("-(callers|results|script_results)-", names(alldat))])
The do.call(rbind, ...) part is analogous to dplyr::bind_rows and data.table::rbindlist in that it accepts a list of frames, and the result is a single frame. Some differences:
do.call(rbind, ...) really requires all columns to exist in all frames, in the same order. It's not hard to enforce this externally (e.g., adding missing columns, rearranging), but it's not automatic.
data.table::rbindlist will complain under the same conditions (missing columns or different order), but it has fill= and use.names= arguments that can be set to TRUE to handle them.
dplyr::bind_rows will fill and row-bind by-name by default, without message or warning. (I don't agree that a default of silence is good all of the time, but it is the simplest.)
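For instance, a hedged illustration of the difference with two made-up frames whose columns don't line up (assuming data.table and dplyr are installed):
df1 <- data.frame(a = 1:2, b = c("x", "y"))
df2 <- data.frame(b = "z", c = 99)
# do.call(rbind, list(df1, df2))   # errors: names do not match
data.table::rbindlist(list(df1, df2), use.names = TRUE, fill = TRUE)   # fills with NA
dplyr::bind_rows(df1, df2)                                             # fills with NA, silently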
Lastly, my use of setNames(nm=..) is merely to assign the filename to each object. This is not strictly necessary since we still have dat$file_name, but I've found that with two separate objects, it is easy to accidentally change (delete, append, or reorder) one of them and not the other, so I prefer to keep the names and the objects (frames) perfectly tied together. These two calls produce essentially the same named list:
lapply(setNames(nm = dat$file_name), ...)
sapply(dat$file_name, ..., simplify = FALSE)
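A tiny check of that equivalence, with made-up file names and nchar standing in for the read function:
nm <- c("a-callers-.csv", "b-results-.csv")     # made-up file names
x1 <- lapply(setNames(nm = nm), nchar)
x2 <- sapply(nm, nchar, simplify = FALSE)
identical(x1, x2)    # TRUE: both are named lists keyed by the file name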

How can I save dataframes that I read in as a list in R?

I have say about 10 .txt files in my directory that I read like this:
sampleFiles <- list.files(directory)
for (i in 1:length(sampleFiles)) {
table <- read.table( sampleFiles[i], header = TRUE)
}
I want to store the files I read in so that I can access them as table1 for i=1, table2 for i=2, and tablen for i=n. How can I read all these files and save them as dataframes with the base name table?
Use lapply
Data <- lapply( list.files(directory), read.table, header=TRUE)
In order to access each data.frame you can use [[ as in Data[[1]], Data[[2]],...,Data[[n]]
Read about how to Extract or Replace Parts of an Object using [[
To name them as you describe, replace the table <- assignment in your loop with
assign(paste0("table", i), read.table(sampleFiles[i], header = TRUE))
Your question title is slightly misleading, as this is not saving the tables as a list in the formal R sense of a list (for which, see the other answer).
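If you want both the list and the table1/table2/... naming, here is a small sketch that names the list elements and, optionally, pushes them into the global environment with base R's list2env:
Data <- lapply(list.files(directory), read.table, header = TRUE)
names(Data) <- paste0("table", seq_along(Data))
# Data$table1, Data$table2, ... now work; to also get free-standing
# objects table1, table2, ... in the workspace:
list2env(Data, envir = .GlobalEnv)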

Replace values within dataframe with filename while importing using read.table (R)

I am trying to clean up some data in R. I have a bunch of .txt files: each .txt file is named with an ID (e.g. ABC001), and there is a column (let's call this ID_Column) in the .txt file that contains the same ID. Each file has 5 rows (or fewer; some files have missing data). However, some of the files have incorrect/missing IDs (e.g. ABC01). Here's an image of what each file looks like:
https://i.stack.imgur.com/lyXfV.png
What I am trying to do here is to import everything AND replace the ID_Column with the filename (which I know to all be correct).
Is there any way to do this easily? I think this can probably be done with a for loop but I would like to know if there is any other way. Right now I have this:
all_files <- list.files(pattern=".txt")
data <- do.call(rbind, lapply(all_files, read.table, header=TRUE))
So, basically, I want to know if it is possible to use lapply (or any other function) to replace data$ID_Column with the filenames in all_files. I am having trouble as each filename is only represented once in all_files, while each ID_Column in data is represented 5 times (but not always, due to missing data). I think the solution is to create a function and call it within lapply, but I am having trouble with that.
Thanks in advance!
I would just make a function that uses read.table and adds the file's name as a column.
all_files <- list.files(pattern=".txt")
data <- do.call(rbind, lapply(all_files, function(x){
  a <- read.table(x, header = TRUE)
  a$ID_Column <- x
  return(a)
}))
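If ID_Column should hold the bare ID (e.g. ABC001) rather than the full filename, a hedged variant strips the path and the .txt extension with base R's basename and tools::file_path_sans_ext:
data <- do.call(rbind, lapply(all_files, function(x){
  a <- read.table(x, header = TRUE)
  # "ABC001.txt" -> "ABC001"
  a$ID_Column <- tools::file_path_sans_ext(basename(x))
  return(a)
}))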

Save all data frames in list to separate .csv files

I have a list of data frames that I want to save to independent .csv files.
Currently I have a new line for each data frame:
write.csv(lst$df1, "C:/Users/.../df1")
write.csv(lst$df2, "C:/Users/.../df2")
ad nauseam
Obviously this isn't ideal: a change in the names would mean going through every case and updating it. I considered using something like
lapply(lst, f(x) write.csv(x, "C:/Users/.../x")
but that clearly won't work. How do I save each data frame in the list as a separate .csv file?
You can do this:
N <- names(lst)
for (i in seq_along(N)) write.csv(lst[[i]], file=paste0("C:/Users/.../", N[i], ".csv"))
Following Heroka's comment, a shorter version:
for (df in names(lst)) write.csv(lst[[df]], file=paste0("C:/Users/.../", df, ".csv"))
or
lapply(names(lst), function(df) write.csv(lst[[df]], file=paste0("C:/Users/.../", df, ".csv")))
The mapply function and its Map wrapper are the multi-argument versions of lapply. You have the list of data.frames; you need to build a vector of file names. Like this:
filenames<-paste0("C:/Users/.../",names(lst), ".csv")
Map(write.csv,lst,filenames)
What does Map do? It calls the function provided as the first argument multiple times, and in each iteration its arguments are taken from the corresponding elements of the other arguments provided. Something along the lines of:
list(write.csv(lst[[1]],filenames[[1]]),write.csv(lst[[2]],filenames[[2]]),...)
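One small usage note: write.csv returns NULL invisibly, so Map will print a list of NULLs at the console; wrapping the call in invisible() keeps the output quiet:
filenames <- paste0("C:/Users/.../", names(lst), ".csv")
invisible(Map(write.csv, lst, filenames))   # same writes, no printed NULLs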

Combine several data frames in the global environment by row (rbind)

I am working on a project that imports all csv files from a given folder and merges them into one file. I was able to import the rows and columns I wanted from each of the files from the folder but now need help merging them all into one file. I do not know how many files I will eventually end up with (probably around 120) so I do not want to merge them 1 by 1.
Here is what I have so far:
# Import All files
rowsToUse <- c(9:104,657:752)
colsToUse <- c(15,27,28,29,30,33,35)
filenames <- list.files("save", pattern="*.csv", full.names=TRUE)
for (i in seq_along(filenames)) {
assign(paste("df", i, sep = "."), read.csv(filenames[i])[!is.na(30),][rowsToUse,colsToUse])
}
# Merge into one file
for (i in seq_along(filenames)) {
df<-rbind(df.[i])
}
The first part of the code creates a series of dataframes labeled df.1, df.2, etc. I would like them to end up in one final dataframe called df. All files are identical in structure.
I would really appreciate some help if someone has a few extra minutes! Thank you!
Since you have already read the files in, you can try the following:
do.call(rbind, mget(ls(pattern = "df")))
The ls(pattern = "df") should capture all of your "df.1", "df.2", and so on. Hopefully you don't have other objects named with the same pattern, but if you do, experiment with a stricter pattern until the command lists just your data.frames.
mget() will bring all of these into a list on which you can use do.call(rbind, ...).
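For example, assuming the frames are literally named df.1, df.2, ..., a stricter anchored pattern looks like this:
# Only objects named exactly df.<number>; note ls() sorts alphabetically (df.10 before df.2)
df <- do.call(rbind, mget(ls(pattern = "^df\\.[0-9]+$")))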
Those all seem complicated ;). The answers above seem to be operating on "we have a list of objects with very similar names, how do we handle that". Answer: they don't need to have very similar names. They don't even have to be different objects.
If you read the files in not through a for loop, but through lapply(), you get a single object that contains all of the data frames - each one as a single element. These can then trivially be extracted. So you'd have something that looks like...
#Grab a list of filenames
filenames <- list.files("save", pattern="*.csv", full.names=TRUE)
#Iterate through that list of names, using lapply(), reading the data in.
list_of_data_frames <- lapply(filenames, function(x){
#Read the data in
to_return <- read.csv(x)[!is.na(30),][c(9:104,657:752),c(15,27,28,29,30,33,35)]
#Return it. You could save lines of code (and processor time!) by just reading
#straight into return(), but it would be a lot less clear.
return(to_return)
})
#Now use do.call to turn it into a single data frame.
data.df <- do.call("rbind", list_of_data_frames)
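If you already use the tidyverse, a hedged sketch of the purrr::map_dfr route (it needs dplyr installed for the binding) collapses the read-and-combine into one call, mirroring the subsetting from the question:
library(purrr)
filenames <- list.files("save", pattern="*.csv", full.names=TRUE)
rowsToUse <- c(9:104,657:752)
colsToUse <- c(15,27,28,29,30,33,35)
data.df <- map_dfr(filenames, function(x) read.csv(x)[!is.na(30),][rowsToUse, colsToUse])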
