cbind a column in several CSV files in R

I am new to R and don't know exactly how to write for loops.
Here is my problem: I have about 160 CSV files in a folder, each with a specific name. Each file name follows the pattern "HL.X.Y.Z.", where X = region, Y = cluster, and Z = point. What I need to do is read all these CSV files, extract the strings from the names, add a column with those strings to each file's data, and bind everything into a single data frame.
Here is some code showing what I am trying to do:
setwd("C:/Users/worddirect")
files.names<-list.files(getwd(),pattern="*.csv")
files.names
head(files.names)
[1] "HL.1.1.1.2F31CA.150722.csv" "HL.1.1.2.2F316A.150722.csv"
[3] "HL.1.1.3.2F3274.150722.csv" "HL.1.1.4.2F3438.csv"
[5] "HL.1.10.1.3062CD.150722.csv" "HL.1.10.2.2F343D.150722.csv"
Reading all the files like this works just fine:
files.names
for (i in 1:length(files.names)) {
assign(files.names[i], read.csv(files.names[i],skip=18))
}
Adding the extra columns to an individual CSV file like this also works fine:
test<-cbind("Region"=rep(substring(files.names[1],4,4),times=nrow(HL.1.1.1.2F31CA.150722.csv)),
"Cluster"=rep(substring(files.names[1],6,6),times=nrow(HL.1.1.1.2F31CA.150722.csv)),
"Point"=rep(substring(files.names[1],8,8),times=nrow(HL.1.1.1.2F31CA.150722.csv)),
HL.1.1.1.2F31CA.150722.csv)
head(test)
Region Cluster Point Date.Time Unit Value
1 1 1 1 6/2/14 11:00:01 PM C 24.111
2 1 1 1 6/3/14 1:30:01 AM C 21.610
3 1 1 1 6/3/14 4:00:01 AM C 20.609
However, a for loop over the above doesn't work.
files.names
for (i in 1:length(files.names)) {
assign(files.names[i], read.csv(files.names[i],skip=18))
cbind("Region"=rep(substring(files.names[i],4,4),times=nrow(i)),
"Cluster"=rep(substring(files.names[i],6,6),times=nrow(i)),
"Point"=rep(substring(files.names[i],8,8),times=nrow(i)),
i)
}
Error in rep(substring(files.names[i], 4, 4), times = nrow(i)) :
invalid 'times' argument
The final step would be to bind all the CSV files into a single data frame.
I appreciate any suggestions. If there is a simpler way to do what I did, I'd appreciate that too!

There are many ways to solve a problem in R. A more R-like way to solve this one is with an apply() function. The apply() family of functions acts like an implied for loop, applying one or more operations to each item passed to it via a function argument.
Another important feature of R is the anonymous function. Combining lapply() with an anonymous function, we can solve your multi-file read problem.
setwd("C:/Users/worddirect")
files.names<-list.files(getwd(),pattern="*.csv")
# read csv files and return them as items in a list()
theList <- lapply(files.names,function(x){
theData <- read.csv(x,skip=18)
# bind the region, cluster, and point data and return
cbind(
"Region"=rep(substring(x,4,4),times=nrow(theData)),
"Cluster"=rep(substring(x,6,6),times=nrow(theData)),
"Point"=rep(substring(x,8,8),times=nrow(theData)),
theData)
})
# rbind the data frames in theList into a single data frame
theResult <- do.call(rbind,theList)
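One caveat worth noting: the file listing in the question includes names like "HL.1.10.1.3062CD.150722.csv", where the cluster is two digits, so the fixed-position substring(x, 6, 6) calls would return only "1" instead of "10". Splitting the name on the dots is more robust (a sketch, using one of the question's own file names):

```r
# Split a file name into its dot-separated parts
parts <- strsplit("HL.1.10.1.3062CD.150722.csv", ".", fixed = TRUE)[[1]]
region  <- parts[2]  # "1"
cluster <- parts[3]  # "10"  (substring(x, 6, 6) would have given "1")
point   <- parts[4]  # "1"
```

Inside the lapply() call above, parts[2:4] could replace the three substring() expressions.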

i is a number, which doesn't have an nrow property.
You can use the following code. Note that it reads each file into an actual data frame first, so that nrow() has something to count, and accumulates the rows with rbind():
result <- data.frame()
for (i in 1:length(files.names)) {
  theData <- read.csv(files.names[i], skip = 18)
  result <- rbind(result,
    cbind(
      "Region" = rep(substring(files.names[i], 4, 4), times = nrow(theData)),
      "Cluster" = rep(substring(files.names[i], 6, 6), times = nrow(theData)),
      "Point" = rep(substring(files.names[i], 8, 8), times = nrow(theData)),
      theData))
}

Related

Loop over a large number of CSV files with the same statements in R?

I'm having a lot of trouble reading/writing to CSV files. Say I have over 300 CSV's in a folder, each being a matrix of values.
If I wanted to find out a characteristic of each individual CSV file, such as which rows had an exact number of 3's, and write the result to another CSV file for each test, how would I go about iterating this over 300 different CSV files?
For example, say I have this code I am running for each file:
values_4 <- read.csv(file = 'values_04.csv', header = FALSE)  # read CSV in as its own data frame
values_4$howMany3s <- apply(values_4, 1, function(x) length(which(x == 3)))  # compute number of 3's per row
values_4$exactly4 <- apply(values_4[50], 1, function(x) length(which(x == 4)))  # show 1/0 for rows where column 50 equals 4
values_4  # print the new data frame
I am then continuously copying and pasting this code, changing the "4" to a 5, 6, etc., and noting the values. This seems wildly inefficient to me, but I'm not experienced enough at R to know what my options are. Should I add all 300 CSV files to a single list and somehow loop through them?
Appreciate any help!
Here's one way you can read all the files and process them. Untested code, as you haven't given us anything to work with.
# Get a list of CSV files. Use the path argument to point to a folder
# other than the current working directory
files <- list.files(pattern=".+\\.csv")
# For each file, work your magic
# lapply runs the function defined in the second argument on each
# value of the first argument
everything <- lapply(
files,
function(f) {
values <- read.csv(f, header=FALSE)
apply(values, 1, function(x) length(which(x==3)))
}
)
# And returns the results in a list. Each element consists of
# the results from one function call.
# Make sure you can access the elements of the list by filename
names(everything) <- files
# The return value is a list. Access all of it with
everything
# Or a single element with
everything[["values_04.csv"]]
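To write one result file per input file, as the question asks, Map() can pair each element of the results list with an output name. A toy sketch (the "_counts" suffix and the stand-in data are illustrations, not part of the original code):

```r
# Stand-in for the `everything` list produced above, keyed by input file name
everything <- list("values_04.csv" = data.frame(howMany3s = c(2, 0, 1)))

# Build one output path per input file, here in a temporary directory
out_paths <- file.path(tempdir(), sub("\\.csv$", "_counts.csv", names(everything)))

# Map() calls write.csv() once per (result, path) pair
Map(write.csv, everything, out_paths)
file.exists(out_paths)  # TRUE
```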

R: Doing the same steps on many data frames with their names stored in a vector

I have several .RData files, each of which has letters and numbers in its name, eg. m22.RData. Each of these contains a single data.frame object, with the same name as the file, eg. m22.RData contains a data.frame object named "m22".
I can generate the file names easily enough with something like datanames <- paste0(c("m","n"),seq(1,100)) and then use load() on those, which will leave me with a few hundred data.frame objects named m1, m2, etc. What I am not sure of is how to do the next step -- prepare and merge each of these dataframes without having to type out all their names.
I can make a function that accepts a data frame as input and does all the processing. But if I pass it datanames[22] as input, I am passing it the string "m22", not the data frame object named m22.
My end goal is to repeatedly do the same steps on a bunch of different data frames without manually typing out "prepdata(m1) prepdata(m2) ... prepdata(n100)". I can think of two ways to do it, but I don't know how to implement either of them:
Get from a vector of the names of the data frames to a list containing the actual data frames.
Modify my "prepdata" function so that it can accept the name of the data frame, but then still somehow be able to do things to the data frame itself (possibly by way of "assign"? But the last step of the function will be to merge the prepared data to a bigger data frame, and I'm not sure if there's a method that uses "assign" that can do that...)
Can anybody advise on how to implement either of the above methods, or another way to make this work?
See this answer and the corresponding R FAQ
Basically:
temp1 <- c(1,2,3)
save(temp1, file = "temp1.RData")
x <- c()
x[1] <- load("temp1.RData")
get(x[1])
#> [1] 1 2 3
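The same pattern, made self-contained with a temporary file. The key point is that load() returns the name of the restored object as a string, and get() turns that name back into the value:

```r
# Save a vector under the name temp1, then recover it by name
tmp <- tempfile(fileext = ".RData")
temp1 <- c(1, 2, 3)
save(temp1, file = tmp)

x <- load(tmp)  # x is the character string "temp1"
get(x)          # the vector c(1, 2, 3)
```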
Assuming all your data exists in the same folder, you can create an R object with all the paths, then write a function that takes a path to an RData file, reads it, and calls "prepdata". Finally, using the purrr package, you can apply the same function over the input vector.
Something like this should work:
library(purrr)
rdata_paths <- list.files(path = "path/to/your/files", full.names = TRUE)
read_rdata <- function(path) {
  # load() returns the *name* of the loaded object, not the object itself,
  # so fetch the value with get()
  obj_name <- load(path)
  return(get(obj_name))
}
prepdata <- function(data) {
### your prepdata implementation
}
master_function <- function(path) {
data <- read_rdata(path)
result <- prepdata(data)
return(result)
}
merged_rdatas <- map_df(rdata_paths, master_function) # creates one dataset, merging everything together

how to use "for loop" to write multiple .csv file names?

Does anyone know the best way to carry out a "for loop" that would read in different subject ids and append them to the name of an exported CSV?
As an example, I have multiple output files from an electrocardiogram software program (each file belongs to one individual). The files are named C800_HR.bdf.evt, C801_HR.bdf.evt, C802_HR.bdf.evt etc. Each file gets read into r and then has a script applied to calculate heart rate variability. At the end of the script, I need to add a loop that will extract the subject id (e.g., C800, C801, C802) and write a new file name for each individual so that it becomes C800_RtoR.csv. Essentially, I would like to avoid changing the syntax every time I read in and export a file name.
I am currently using the following syntax to read in multiple files:
setwd("/Users/kmpc/Downloads")
myhrvdata <- lapply(Sys.glob("C8**_HR.bdf.evt"), read.delim)
Try this out:
cardio_files <- list.files(pattern = "C8\\d{2}_HR\\.bdf\\.evt")
subject_ids <- sub("^(C8\\d{2})_.*", "\\1", cardio_files)
myList <- lapply(cardio_files, read.delim)
## do calculations on the list
for (i in seq_along(myList)) {
  write.csv(myList[[i]], paste0(subject_ids[i], "_RtoR.csv"))
}
The only thing is, you have to work with a list when doing your calculations. You could combine the files into a single data.frame, but it is easiest to leave them as a list for writing the output files at the end.
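If you do want a single data.frame at some point, one way is to tag each list element with its subject id before binding (a sketch with toy data, not the real .evt contents):

```r
# A stand-in for myList: one small data frame per subject
myList <- list(C800 = data.frame(rr = c(0.8, 0.9)),
               C801 = data.frame(rr = 0.7))

# cbind() each element with its name, then stack everything
combined <- do.call(rbind, Map(cbind, subject = names(myList), myList))
# combined now has a `subject` column alongside the original columns
```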
Consider generalizing your process by creating a function that: 1) reads in the file, 2) processes the data, 3) outputs to CSV. Then have lapply call the defined function iteratively across all Sys.glob items and even return a list of the calculated data frames.
proc_heart_rate <- function(f_name) {
# READ IN .evt FILE INTO df
df <- read.delim(f_name)
# CALCULATE HEART RATE VARIABILITY WITH df
...
# OUTPUT df TO CSV
subject_id <- gsub("\\_.*", "", f_name)
write.csv(df, paste0(subject_id, "_RtoR.csv"))
# RETURN df FOR OTHER USES
return(df)
}
# LIST OF DATA FRAMES WITH CALCULATIONS
myhrvdata_list <-lapply(Sys.glob("C8**_HR.bdf.evt"), proc_heart_rate)
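The subject-id extraction used above, on its own: gsub("_.*", "", f_name) drops everything from the first underscore onward, leaving just the id.

```r
# Extract the subject id from an example file name
subject_id <- gsub("_.*", "", "C800_HR.bdf.evt")
subject_id  # "C800"
```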

Assign new names to dataframes and save as separate objects in R

I am performing a set of analyses in R. The flow of the analysis is: read in a dataframe (i.e. input_dataframe), perform a set of calculations, and produce a new, smaller dataframe (called final_result). The same set of calculations is performed on 23 different files, each of which contains a dataframe.
My question is as follows: for each of the 23 files that is read in, I am trying to save a unique R object. How do I do so? When I save the resulting final_result dataframe (using save()) to an R object, I cannot then load all 23 objects into a new R session without the different R objects overriding each other. Other suggestions (such as Create a variable name with "paste" in R?) did not work for me, since they rely on calling the new variable by its name once it has been assigned, which I cannot do in this case.
To Summarize/Reword: Is there a way to save an object in R but change the name of the object for when it will be loaded later?
For example:
x=5
magicSave(x,file="saved_variable_1.r",to_save_as="result_1")
x=93
magicSave(x,file="saved_variable_2.r",to_save_as="result_2")
load("saved_variable_1.r")
load("saved_variable_2.r")
result_1
#returns 5
result_2
#returns 93
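Such a magicSave() can in fact be sketched in a few lines using assign() plus the list argument of save(), which takes object names as strings (this is an illustration, not a standard function):

```r
# Save `obj` under the name `to_save_as`, so that name is what load() restores
magicSave <- function(obj, file, to_save_as) {
  assign(to_save_as, obj)                 # bind the value to the desired name
  save(list = to_save_as, file = file,    # save() by name, from this
       envir = environment())             # function's environment
}

x <- 5
tmp <- tempfile(fileext = ".RData")
magicSave(x, file = tmp, to_save_as = "result_1")
load(tmp)
result_1  # 5
```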
In R it's generally a good idea to store as a list everything that can be seen as a list. It makes everything more elegant afterwards.
First you put all your paths in a list or a vector :
paths <- c("C:/somewhere/file1.csv",
"C:/somewhere/file2.csv") # etc
Then you read them :
objects <- lapply(paths,read.csv) # objects is a list of tables
Then you apply your transformation on each element :
output <- lapply(objects,transformation_function)
And then you can save your output (I find saveRDS cleaner than save, since you know exactly what variables you'll be introducing into your workspace when loading):
saveRDS(output,"C:/somewhere/output.RDS")
which you will load with
output <- readRDS("C:/somewhere/output.RDS")
OR, if you prefer for some reason to save them as separate objects:
output_paths <- paste0("C:/somewhere/output", seq_along(output), ".RDS")
Map(saveRDS, output, output_paths)
To load later with:
output <- lapply(output_paths, readRDS)
x=5
write.csv(x,"one_thing.csv", row.names = F)
x=93
write.csv(x,"two_thing.csv", row.names = F)
result_1 <- read.csv("one_thing.csv")
result_2 <- read.csv("two_thing.csv")
result_1
# x
# 1 5
result_2
# x
# 1 93

Which data structure should be used that can be appended in a customized way?

I have to load data from files related to multiple experiments, and later process them to generate a plot. Each experiment generated multiple files. Files related to experiment 1 have names starting with "Experiment1", postfixed by the type of data they contain, i.e. "Experiment1-per0", "Experiment1-per50", "Experiment1-per100".
These postfixes are fixed for all experiments. So to load the files, I want to give only the experiment names, and later append the postfixes in the R script. Consequently, for each experiment name "ExperimentX" I give, I will load three separate data files by appending the postfixes (i.e. "ExperimentX-per0", "ExperimentX-per50", "ExperimentX-per100").
I am unable to figure out in which data structure I should store the initial experiment names and then the postfixed names.
Sample file (Experiment1-per50):
# the last column also shows the type of data i.e postfix of file
Obj TGiven TUsed TOGiven TOServed per50
16570 8 7 12 6 per50
18430 8 8 12 9 per50
16890 8 7 12 9 per50
Currently, I put in every file name manually, which takes a lot of time.
If each experiment has the same set of suffixes, you can store your list of experiment names and your list of suffixes separately. Then, using a nested loop, you can combine each experiment name and suffix with the paste function to get the filename.
Your code might look something like this:
experiments = c("Experiment1","Experiment2","Experiment3")
suffixes = c("per0","per50","per100")
for (experiment in experiments) {
for (suffix in suffixes) {
filename <- paste(experiment, suffix, sep="-")
df <- read.table(filename)
df$experiment <- experiment
# Do something with the dataframe here
}
}
Alternatively, if you just want a vector of all the filenames from given experiments and suffixes lists, this would combine them:
as.vector(sapply(experiments, paste, suffixes, sep="-"))
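With the vectors above, that one-liner expands as follows. sapply() builds a matrix with one column per experiment, and as.vector() flattens it column by column:

```r
experiments <- c("Experiment1", "Experiment2", "Experiment3")
suffixes <- c("per0", "per50", "per100")

# paste() each experiment against all suffixes, then flatten
filenames <- as.vector(sapply(experiments, paste, suffixes, sep = "-"))
filenames
# "Experiment1-per0"  "Experiment1-per50"  "Experiment1-per100"
# "Experiment2-per0"  ... and so on, 9 names in total
```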
If the columns are different
If the columns are different between the experiments, I would wrap the experiments in lists as follows:
library(plyr)
library(stringr) # for str_c
experiments <- c("Experiment1","Experiment2","Experiment3")
suffixes <- c("per0","per50","per100")
# if you want to go ahead and get the data
data <- llply(experiments, function(experiment) {
  llply(suffixes, function(suffix) {
    fn <- str_c(experiment, '-', suffix, '.csv') # make filename
    # later, try to read fn; for now just return it
    return(fn)
  })
})
You can then iterate through data for further processing. llply is part of the plyr package. It iterates over a list (the first l in llply) and returns a list (the second l).
If the columns are the same
library(plyr)
library(stringr) # for str_c
experiments <- c("Experiment1","Experiment2","Experiment3")
suffixes <- c("per0","per50","per100")
data <- ldply(experiments, function(experiment) {
  ldply(suffixes, function(suffix) {
    data.frame(
      experiment = experiment,
      suffix = suffix,
      fn = str_c(experiment, '-', suffix, '.csv'))
  })
})
This will read all the data as a single data.frame, which you can then parse as needed (for example, using plyr and/or subset).
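A base-R alternative to the nested llply/ldply calls is expand.grid(), which builds every experiment/suffix combination in one step (a sketch; the "-" separator follows the question's file names):

```r
experiments <- c("Experiment1", "Experiment2", "Experiment3")
suffixes <- c("per0", "per50", "per100")

# One row per experiment/suffix combination
combos <- expand.grid(experiment = experiments, suffix = suffixes,
                      stringsAsFactors = FALSE)
combos$fn <- paste(combos$experiment, combos$suffix, sep = "-")
nrow(combos)  # 9
```

Each row of combos then carries the experiment name, the suffix, and the filename to read, so a single loop (or lapply over combos$fn) covers all the files.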
