I am new to R and am practicing writing R functions. I have 100 separate csv data files stored in my directory, each labeled by its id, e.g. "1" to "100". I would like to write a function that reads some selected files into R, calculates the number of complete cases in each data file, and arranges the results into a data frame.
Below is the function that I wrote. First I read all the file names into "dat". Then, using the rbind function, I read the selected files into a data.frame. Lastly, I compute the number of complete cases using sum(complete.cases()). This seems straightforward, but the function does not work. I suspect there is something wrong with the index, but I have not figured out why. I searched through various topics but could not find a useful answer. Many thanks!
complete = function(directory, id) {
  dat = list.files(directory, full.names = TRUE)
  dat.em = data.frame()
  for (i in id) {
    dat.ful = rbind(dat.em, read.csv(dat[i]))
    obs = numeric()
    obs[i] = sum(complete.cases(dat.ful[dat.ful$ID == i, ]))
  }
  data.frame(ID = id, count = obs)
}

complete("envi", c(1, 3, 5))
I get an error and a warning message:
Error in data.frame(ID = id, count = obs) : arguments imply differing number of rows: 3, 5
One problem with your code is that you reset obs to numeric() each time you go through the loop, so at the end obs holds only the count for the last file you read, padded with NAs up to the last id (5 in your example). Its length is therefore 5 rather than 3, which is exactly why data.frame(ID = id, count = obs) complains about differing numbers of rows.
Another issue is that the line dat.ful = rbind(dat.em, read.csv(dat[i])) resets dat.ful to contain just the data frame being read in that iteration of the loop. This won't cause an error, but you don't actually need to store the previous data frames, since you're just checking the number of complete cases for each data frame you read in.
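If you prefer to keep your loop-based structure, a minimal fix is sketched below. It assumes, as your call implies, that dat[i] really is the file for monitor i (i.e. that the files sort in plain numeric order); otherwise the filename-based approach further down is safer.
complete <- function(directory, id) {
  dat <- list.files(directory, full.names = TRUE)
  obs <- numeric(length(id))           # initialise once, outside the loop
  for (k in seq_along(id)) {
    dat.ful <- read.csv(dat[id[k]])    # read only the file for this id
    obs[k] <- sum(complete.cases(dat.ful))
  }
  data.frame(ID = id, count = obs)
}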
Here's a different approach using lapply instead of a loop. Note that instead of giving the function a vector of indices, this function takes a vector of file names. In your example, you use the index instead of the file name as the file "id". It's better to use the file names directly, because even if the file names are numbers, using the index will give an incorrect result if, for some reason, your vector of file names is not sorted in ascending numeric order, or if the file names don't use consecutive numbers.
# Read files and return a data frame with the number of complete cases in each csv file
complete = function(directory, files) {
  # Read each csv file in turn and store its name and number of
  # complete cases in a list
  obs.list = lapply(files, function(x) {
    dat = read.csv(paste0(directory, "/", x))
    data.frame(fileName = x, count = sum(complete.cases(dat)))
  })
  # Return a data frame with the number of complete cases for each file
  return(do.call(rbind, obs.list))
}
Then, to run the function, you need to give it a directory and a list of file names. For example, to read all csv files in the current working directory, you can do this:
filesToRead = list.files(pattern = "\\.csv$")
complete(getwd(), filesToRead)
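If your files really are named by their id, for example "1.csv", "3.csv", "5.csv" (an assumption based on your description), you can reproduce your original call like this:
complete("envi", paste0(c(1, 3, 5), ".csv"))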
I'm having a lot of trouble reading/writing to CSV files. Say I have over 300 CSVs in a folder, each being a matrix of values.
If I wanted to find out a characteristic of each individual CSV file, such as which rows had an exact number of 3's, and write the result to another CSV file for each test, how would I go about iterating this over 300 different CSV files?
For example, say I have this code I am running for each file:
values_4 <- read.csv(file = 'values_04.csv', header = FALSE)  # read CSV in as its own DF
values_4$howMany3s <- apply(values_4, 1, function(x) length(which(x == 3)))  # compute number of 3's
values_4$exactly4 <- apply(values_4[50], 1, function(x) length(which(x == 4)))  # show 1/0 for rows that have exactly four 3's
values_4  # print new matrix
I am then continually copying and pasting this code, changing the "4" to a 5, 6, etc., and noting the values. This seems wildly inefficient to me, but I'm not experienced enough at R to know exactly what my options are. Should I look at adding all 300 CSV files to a single list and somehow looping through them?
Appreciate any help!
Here's one way you can read all the files and process them. Untested code, as you haven't given us anything to work on.
# Get a list of CSV files. Use the path argument to point to a folder
# other than the current working directory
files <- list.files(pattern=".+\\.csv")
# For each file, work your magic
# lapply runs the function defined in the second argument on each
# value of the first argument
everything <- lapply(
  files,
  function(f) {
    values <- read.csv(f, header = FALSE)
    apply(values, 1, function(x) length(which(x == 3)))
  }
)
# And returns the results in a list. Each element consists of
# the results from one function call.
# Make sure you can access the elements of the list by filename
names(everything) <- files
# The return value is a list. Access all of it with
everything
# Or a single element with
everything[["values04.csv"]]
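Since you also want to write the result for each file back out to its own CSV, you can loop over the named list afterwards. The output naming scheme below is just a hypothetical suggestion:
for (f in names(everything)) {
  out_name <- sub("\\.csv$", "_3counts.csv", f)   # hypothetical output name
  write.csv(everything[[f]], out_name, row.names = FALSE)
}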
Does anyone know the best way to carry out a "for loop" that would read in different subject ids and append them to the name of an exported csv?
As an example, I have multiple output files from an electrocardiogram software program (each file belongs to one individual). The files are named C800_HR.bdf.evt, C801_HR.bdf.evt, C802_HR.bdf.evt, etc. Each file gets read into R and then has a script applied to calculate heart rate variability. At the end of the script, I need to add a loop that will extract the subject id (e.g., C800, C801, C802) and write a new file name for each individual so that it becomes, e.g., C800_RtoR.csv. Essentially, I would like to avoid changing the syntax every time I read in and export a file.
I am currently using the following syntax to read in multiple files:
setwd("/Users/kmpc/Downloads")
myhrvdata <- lapply(Sys.glob("C8**_HR.bdf.evt"), read.delim)
Try this out:
cardio_files <- list.files(pattern = "C8\\d{2}_HR\\.bdf\\.evt")
subject_ids <- sub("^(C8\\d{2})_.*", "\\1", cardio_files)

myList <- lapply(cardio_files, read.delim)
names(myList) <- subject_ids   # name the list so each element is keyed by subject id

## do calculations on the list

for (i in names(myList)) {
  write.csv(myList[[i]], paste0(i, "_RtoR.csv"))
}
The only thing is, you have to deal with using a list when doing your calculations. You could combine them into a single data.frame, but it is probably best to leave it as a list so you can write the files out at the end.
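If you do end up needing one combined data frame at some point (assuming all the files share the same columns), the usual idiom is:
combined <- do.call(rbind, myList)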
Consider generalizing your process by creating a function that: 1) reads in file, 2) processes data, 3) outputs to csv. Then have lapply call the defined method iteratively across all Sys.glob items and even return a list of calculated data frames.
proc_heart_rate <- function(f_name) {
  # READ IN .evt FILE INTO df
  df <- read.delim(f_name)

  # CALCULATE HEART RATE VARIABILITY WITH df
  ...

  # OUTPUT df TO CSV
  subject_id <- gsub("\\_.*", "", f_name)
  write.csv(df, paste0(subject_id, "_RtoR.csv"))

  # RETURN df FOR OTHER USES
  return(df)
}

# LIST OF DATA FRAMES WITH CALCULATIONS
myhrvdata_list <- lapply(Sys.glob("C8**_HR.bdf.evt"), proc_heart_rate)
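If you also want to look results up by subject afterwards, you can name the returned list; the sketch below simply reuses the same glob pattern and id extraction as above:
files <- Sys.glob("C8**_HR.bdf.evt")
myhrvdata_list <- lapply(files, proc_heart_rate)
names(myhrvdata_list) <- gsub("\\_.*", "", files)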
I am analyzing a data set and have created a function that summarizes most of my columns. The goal of my script is to automate the creation and extraction of summary tables (more or less data frames).
To generalize as much as possible, I want to pass a character string to my function to be used to name columns, rows, files, and more.
What I am working with currently:
NameFun <- function(df, name) {
  ## Name the first column
  colnames(df)[1] <- "name"
  ## Write DF to Excel Workbook
  write.xlsx(df, "Workbook.xlsx", sheetName = "name",
             col.names = TRUE, row.names = TRUE, append = TRUE)
}
The objective here is to input a character name and use it within the function. I have tried "eval", "assign", and "get" with no luck. I have tried a few other approaches, but either R doesn't recognize the name in the environment, nothing happens at all, or R rejects the idea of passing a character altogether.
I am open to any other solutions that help generalize my script even more. Each column will have a unique name, but its summary will report the same number of columns and type of metrics. Ideally, I would be able to pass a list of column names to the function and loop it through the whole data set.
Thanks!
-J
You could probably do this:
# Initialize a list to hold your results
ll <- list()

# Run your summary once per name (in a loop, or repeatedly by hand),
# storing each result in the list under the name you want
ll[[name]] <- summary_Method(...)  # or store the data frame itself

# If you wrap the assignment in a function, remember to return the
# updated list, since R functions work on a copy of their arguments
NameFun <- function(name, ll, df) {
  ll[[name]] <- df
  ll
}

# Write the list of data frames to an Excel file, one sheet per name
lapply(names(ll), function(x)
  write.xlsx(ll[[x]], "Workbook.xlsx", sheetName = x, append = TRUE))
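For completeness, the original NameFun can also work as intended once it uses the value of its name argument instead of the literal string "name". Here is a sketch, assuming write.xlsx from the xlsx package as in your code:
library(xlsx)

NameFun <- function(df, name) {
  colnames(df)[1] <- name                      # use the value, not the string "name"
  write.xlsx(df, "Workbook.xlsx", sheetName = name,
             col.names = TRUE, row.names = TRUE, append = TRUE)
}

# Example call with a built-in data frame and a made-up sheet name
NameFun(mtcars, "CarSummary")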
I have read multiple questionnaire files into DFs in R. Now I want to create new DFs based on them, but with only specific rows in them, by looping over all of them. The loop appears to work fine. However, the selection of the rows does not seem to work. When I try selecting with simple square brackets, I get the error "incorrect number of dimensions". I tried it with subset(), but I don't seem to be able to set the subset correctly.
Here is what I have so far:
for (i in 1:length(subjectlist)) {
p[i] <- paste("path",subjectlist[i],sep="")
files <- list.files(path=p,full.names = T,include.dirs = T)
assign(paste("subject_",i,sep=""),read.csv(paste("path",subjectlist[i],".csv",sep=""),header=T,stringsAsFactors = T,row.names=NULL))
assign(paste("subject_",i,"_t",sep=""),sapply(paste("subject_",i,sep=""),[c((3:22),(44:63),(93:112),(140:159),(180:199),(227:246)),]))
}
Here's some code that tries to abstract away the details and do what it seems like you're trying to do. If you just want to read in a bunch of files and then select certain rows, I think you can avoid the assign functions and just use sapply to read all the data frames into a list. Let me know if this helps:
# Get the names of files we want to read in
files = list.files([arguments])
df.list = sapply(files, function(file) {
# Read in a csv file from the files vector
df = read.csv(file, header=TRUE, stringsAsFactors=FALSE)
# Add a column telling us the name of the csv file that the data came from
df$SourceFile = file
# Select only the rows we want
df = df[c(3:22,44:63,93:112,140:159,180:199,227:246), ]
}, simplify=FALSE)
If you now want to combine all the data frames into a single data frame, you can do the following (the SourceFile column tells you which file each row originally came from):
# Combine all the files into a single data frame
allDFs = do.call(rbind, df.list)
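If you then want to pull out just the rows that came from one particular file, you can filter on the SourceFile column (the file name below is only a hypothetical example):
subset(allDFs, SourceFile == "subject_01.csv")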
First, I am new here; this is my first post, so my apologies in advance if I am not doing everything correctly. I did take the time to search around first but couldn't find what I am looking for.
Second, I am pretty sure I am breaking a rule in that this question is related to a coursera.org R programming course I am taking (this was part of an assignment), but the due date has lapsed and I have failed for now. I will repeat the subject next month and try again, but right now I am in damage control, trying to find out what went wrong.
Basically below is my code:
What I am trying to do is read in data from a series of files. The files are four columns wide, with the column titles Date, nitrate, sulfate, and id, and contain varying numbers of rows of data.
The function I am trying to write should take the arguments of the directory of the files, the pollutant (so either nitrate or sulfate), and the set of numbered files, e.g. files 1 and 2, files 1 through to 4 etc. The return of the function should be the average value of the selected pollutant across the selected files.
I would call the function using a call like this
pollutantmean("datafolder", "nitrate", 1:3)
and the return should just be a number, which in this case is the average of nitrate across data files 1 through 3.
OK, I hope I have provided enough information. Other stuff that may be useful is:
Operating system :Ubuntu
Language: R
Error message received:
Warning message:
In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
As I say, the data files are a series of files located in a folder and are four columns wide and vary as to the number of rows.
My function code is as follows:
pollutantmean <- function(directory, pollutant, id = 1:5) {  # content of the function
  # create a list of files, a vector I think
  files_list <- dir(directory, full.names = TRUE)
  # Now create an empty data frame
  dat <- data.frame()
  # Next step is to execute a loop to read all the selected data files into the data frame
  for (i in 1:5) {
    dat <- rbind(dat, read.csv(files_list[i]))
  }
  # subset the rows matching the selected monitor numbers
  dat_subset <- dat[dat[, "ID"] == id, ]
  # identify the median of the pollutant and ignore the NA values
  median(dat_subset$pollutant, na.rm = TRUE)
}
OK, that is it. Through trial and error I am pretty sure the final line of code, median(dat_subset$pollutant, na.rm = TRUE), is the problem. I pass a pollutant argument to the function, which should be either sulfate or nitrate, but it seems the dat_subset$pollutant part of the code is what is not working. Somehow the passed pollutant argument does not make it into the function body: dat_subset$pollutant should ideally be equivalent to either dat_subset$nitrate or dat_subset$sulfate, depending on the argument fed to the function.
You cannot subset with the $ operator if you pass the column name in an object, as in your example (where it is stored in pollutant). So subset using [] instead; in your case that would be:
median(dat_subset[,pollutant], na.rm = TRUE)
or
median(dat_subset[[pollutant]], na.rm = TRUE)
Does that work?
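To see why the $ form fails, here is a tiny made-up example (the data are invented purely for illustration):
df <- data.frame(nitrate = c(1, NA, 3), sulfate = c(4, 5, NA))
pollutant <- "nitrate"

df$pollutant                            # NULL: looks for a column literally named "pollutant"
df[[pollutant]]                         # 1 NA 3: uses the value stored in pollutant
median(df[[pollutant]], na.rm = TRUE)   # 2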