Conducting summary statistics on multiple dataframes in R - r

Apologies if this has been answered elsewhere. I am looking to calculate and output summary statistics across multiple dataframes in R.
For context, my data is stored in .txt files for each subject - just one column: 63 obs of 1 variable. In total I have 48 files corresponding to 48 subjects.
I read these files into Rstudio and created multiple per-subject dataframes using the following scripts:
filenames <- gsub("\\.txt$","", list.files(pattern="\\.txt$"))
for(i in filenames){
assign(i, read.delim(paste(i,".txt", sep="")))
}
The dataframes are named e.g. 001_fd, 002_fd ...
So what I hope to do is create a for loop that calculates summary stats for each dataframe and then output the results for each into a single csv file.
Any assistance here will be greatly appreciated

Object names that start with a number are best avoided. You also haven't said which summary statistics you want to calculate; I'll calculate the mean and median here, and you can include more if needed.
First, get all the dataframes in a list using mget
list_df <- mget(ls(pattern = '\\d+_fd'))
Using lapply, you can calculate whatever you want. Let's say you have a single column in each dataframe with x as a column name, you can do
output_df <- do.call(rbind, lapply(list_df, function(df)
data.frame(mean = mean(df$x), med = median(df$x))))
Or with purrr::map_df which makes this shorter.
output_df <- purrr::map_df(list_df,
~data.frame(mean = mean(.x$x), med = median(.x$x)))
Write the results to csv.
write.csv(output_df, 'results.csv', row.names = FALSE)

You don't have to use assign to create a variable for each txt file.
Just use list.files to list all the txt files, loop over them, and append each file's summary to an initially empty dataframe.
This is the simplest method but may not be the most efficient.
filenames <- list.files(pattern = "\\.txt$")  # pattern is a regular expression, not a glob
output = data.frame()
for(f in filenames){
  content = read.delim(f, header = FALSE)
  stats = summary(content[,1])  # named 'stats' to avoid masking base::sum
  output = rbind(output, stats)
}
colnames(output) = c("Min.","1st Qu.","Median","Mean","3rd Qu.","Max.")
write.csv(output,"output.csv",row.names = FALSE)
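A minimal sketch that combines the two answers: read the files into a named list (no assign) and build one row of statistics per subject. The temporary directory, the simulated files, and the helper name summarise_file are assumptions made for illustration only; the example files are written without a header, so header = FALSE is used.

```r
# Create three small example files in a temporary directory (illustrative data only)
tmp <- file.path(tempdir(), "fd_example")
dir.create(tmp, showWarnings = FALSE)
set.seed(1)
for (id in c("001_fd", "002_fd", "003_fd")) {
  write.table(rnorm(63), file.path(tmp, paste0(id, ".txt")),
              row.names = FALSE, col.names = FALSE)
}

# Read every file into a named list instead of separate global objects
files <- list.files(tmp, pattern = "\\.txt$", full.names = TRUE)
names(files) <- sub("\\.txt$", "", basename(files))
list_df <- lapply(files, read.delim, header = FALSE)

# Hypothetical helper: one row of summary statistics per data frame
summarise_file <- function(df) {
  x <- df[[1]]
  data.frame(mean = mean(x), median = median(x), sd = sd(x),
             min = min(x), max = max(x))
}

# The list names become row names; keep them as an explicit subject column
stats <- do.call(rbind, lapply(list_df, summarise_file))
stats <- cbind(subject = rownames(stats), stats)
write.csv(stats, file.path(tmp, "results.csv"), row.names = FALSE)
```

Keeping the subject ID as a column means the CSV still says which row belongs to which file, which the row.names = FALSE versions above lose.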

Related

Merging thousands of csv files into a single dataframe in R

I have 2500 csv files, all with the same columns and a variable number of observations.
Each file is approximately 3mb (~10000 obs per file).
Ideally, I would like to read all of these in to a single dataframe.
Each file represents a generation and contains info in regard to traits, phenotypes and allele frequencies.
While reading in this data, I am also trying to add an extra column to each read indicating the generation.
I have written the following code:
read_data <- function(ex_files,ex){
df <- NULL
ex <- as.character(ex)
for(n in 1:length(ex_files)){
temp <- read.csv(paste("Experiment ",ex,"/all e",ex," gen",as.character(n),".csv",sep=""))
temp$generation <- n
df <- rbind(df,temp)
}
return(df)
}
ex_files refers to list.length, while ex refers to the experiment number as it was performed in replicate (ie. I have multiple experiments each with 2500 csv files).
I am currently running it (I hope it's written correctly!), however it is taking quite a while (as expected). I'm wondering if there is a quicker way of doing this at all?
It is inefficient to grow objects in a loop. List all the files that you want to read using list.files and with purrr::map_df combine them into one dataframe with an additional column called generation which will give a unique number to each file.
filenames <- list.files(pattern = '\\.csv', full.names = TRUE)
df <- purrr::map_df(filenames, read.csv, .id = 'generation')
head(df)
Try the plyr package:
filenames = list.files(pattern = '\\.csv', full.names = TRUE)
df = plyr::ldply(filenames, read.csv)
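The "collect in a list, bind once" idea can also be sketched in base R. The example files and the regular expression that pulls the generation number out of the file name are assumptions for illustration; taking the number from the name avoids relying on the alphabetical order returned by list.files, where gen10 sorts before gen2.

```r
# Write a few small example files (illustrative data only)
tmp <- file.path(tempdir(), "gen_example")
dir.create(tmp, showWarnings = FALSE)
for (n in c(1, 2, 10)) {
  write.csv(data.frame(trait = seq_len(3) * n),
            file.path(tmp, paste0("all e1 gen", n, ".csv")), row.names = FALSE)
}

filenames <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)

# Read each file into a list element and tag it with its generation number,
# taken from the file name (assumes names end in "gen<number>.csv")
pieces <- lapply(filenames, function(f) {
  d <- read.csv(f)
  d$generation <- as.integer(sub(".*gen(\\d+)\\.csv$", "\\1", basename(f)))
  d
})

# Bind once at the end instead of growing the data frame inside the loop
df <- do.call(rbind, pieces)
df <- df[order(df$generation), ]
```

For thousands of files a faster reader may help further, but binding once rather than on every iteration is usually the larger gain.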

How can I add a column with mutate() to each of the multiple data sets I read?

I am a beginner in R and am currently learning how to do data wrangling across multiple data sets.
Right now I read 55 csv data sets with 300 rows each using the following code:
Rawdata <- list.files(pattern = "*.csv")
for(i in 1:length(Rawdata)){
assign(Rawdata[i],read.csv(Rawdata[i], header = TRUE)[1:300])
}
Each data set has variables "acc_X_value", "acc_Y_value", and "acc_Z_value".
I failed to add a column with mutate() in these data sets. I want to show the average of these variables in a new column. Any ideas? Thank you!
Usually it is better to keep related things in lists rather than use assign to store them in the global environment. I would do it something like this:
library(tidyverse)
rawData <- map(list.files(pattern = "\\.csv$"), read_csv)
newData <- map(rawData, mutate, average = (acc_X_value + acc_Y_value + acc_Z_value) / 3)
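If "average" means a per-row mean of the three axes, the same list-based approach can be sketched in base R with rowMeans. The toy data frames below stand in for the 55 files and are assumptions for illustration; only the three column names come from the question.

```r
# Two toy data sets standing in for the CSV files (illustrative values)
raw_data <- list(
  a = data.frame(acc_X_value = c(1, 2), acc_Y_value = c(3, 4), acc_Z_value = c(5, 6)),
  b = data.frame(acc_X_value = c(0, 3), acc_Y_value = c(0, 3), acc_Z_value = c(0, 3))
)

acc_cols <- c("acc_X_value", "acc_Y_value", "acc_Z_value")

# Add a per-row average column to every data frame in the list
new_data <- lapply(raw_data, function(d) {
  d$average <- unname(rowMeans(d[, acc_cols]))
  d
})

new_data$a$average  # 3 4
```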

R for loop - appending results outside the loop

I'm taking an introductory R-programming course on Coursera. The first assignment has us evaluating a list of hundreds of csv files in a specified directory ("./specdata/"). Each csv file, in turn, contains hundreds of records of sample pollutant data in the atmosphere - a date, a sulfite sample, a nitrate sample, and an ID that identifies the sampling location.
The assignment asks us to create a function that takes the pollutant and an id or range of ids for the sampling location and returns a sample mean, given the supplied arguments.
My code (below) uses a for loop to use the id argument to only read the files of interest (seems more efficient than reading in all 322 files before doing any processing). That works great.
Within the loop, I assign the contents of the csv file to a variable. I then make that variable a data frame and use rbind to append to it the file read in during each loop. I use na.omit to remove the missing values from the variable. Then I use rbind to append the result of each iteration of the loop to the variable. When I print the data frame variable within the loop, I can see the entire full list, subgrouped by id. But when I print the variable outside the loop, I only see the last element in the id vector.
I would like to create a consolidated list of all records matching the id argument within the loop, then pass the consolidated list outside the loop for further processing. I can't get this to work. My code is shown below.
Is this the wrong approach? Seems like it could work. Any help would be most appreciated. I searched StackOverflow and couldn't find anything that quite addresses what I'm trying to do.
pmean <- function(directory = "./specdata/", pollutant, id = 1:322) {
x <- list.files(path=directory, pattern="*.csv")
x <- paste(directory, x, sep="")
id1 <- id[1]
id2 <- id[length(id)]
for (i in id1:id2) {
df <- read.csv(x[i], header = TRUE)
df <- data.frame(df)
df <- na.omit(df)
df <- rbind(df)
print(df)
}
# would like a consolidated list of records here to do more stuff, e.g. filter on pollutant and calculate mean
}
You can just define the data frame outside the for loop and append to it. Also you can skip some steps in between... There are more ways to improve here... :-)
pmean <- function(directory = "./specdata/", pollutant, id = 1:322) {
x <- list.files(path=directory, pattern="*.csv")
x <- paste(directory, x, sep="")
df_final <- data.frame()
for (i in id) {
df <- read.csv(x[i], header = TRUE)
df <- data.frame(df)
df <- na.omit(df)
df_final <- rbind(df_final, df)
print(df)
}
# would like a consolidated list of records here to do more stuff, e.g. filter on pollutant and calculate mean
return(df_final)
}
By only calling df <- rbind(df) you are effectively overwriting df every time. You can fix this by doing something like this:
df = data.frame() # empty data frame
for(i in 1:10) { # for all your csv files
x <- mean(rnorm(10)) # some new information
df <- rbind(df, x) # bind old dataframe and new value
}
By the way, if you know beforehand how big df will be, growing it with rbind on every iteration is not the recommended way to do it; preallocate the result instead.
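As a rough sketch of the alternative hinted at here: when the number of iterations is known in advance, the results can be stored in a preallocated vector or list and combined once after the loop. The simulated values mirror the toy example above and are not from the question.

```r
set.seed(42)
n <- 10

# Preallocate the result instead of growing it with rbind() on every pass
means <- numeric(n)
for (i in seq_len(n)) {
  means[i] <- mean(rnorm(10))  # some new information
}
df <- data.frame(iteration = seq_len(n), mean = means)

# For whole data frames, collect them in a preallocated list and bind once
pieces <- vector("list", n)
for (i in seq_len(n)) {
  pieces[[i]] <- data.frame(id = i, value = rnorm(3))
}
combined <- do.call(rbind, pieces)
```

The second pattern is the one that would apply to the csv files in the question: read each file into pieces[[i]] and call rbind once at the end.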

R: Looping through dataframes and subsetting

I have a number of dataframes (imported from CSV) that have the same structure. I would like to loop through all these dataframes and keep only two of these columns.
The loop below does not seem to work, any ideas why? Would ideally like to do this using a loop as I am trying to get better at using these.
frames <- ls()
for (frame in frames){
frame <- subset(frame, select = c("Col_A","Col_B"))
}
Cheers in advance for any advice.
For anyone interested I used Richard Scriven's idea of reading in the dataframes as one object, with a function added that showed where the file had been imported from. This allowed me to then use the Plyr package to manipulate the data:
library(plyr)
dataframes <- list.files(path = TEESMDIR, full.names = TRUE)
## Define a function to add the filename to the dataframe
read_csv_filename <- function(filename){
ret <- read.csv(filename)
ret$Source <- filename #EDIT
ret
}
list_dataframes <- ldply(dataframes, read_csv_filename)
selection <- llply(list_dataframes, subset, select = c(var1,var3))
The basic problem is that ls() returns a character vector of the names of the objects in your environment, not the objects themselves. To get and replace an object using a character variable containing its name, you can use the get()/assign() functions. You could rewrite your loop as
frames <- ls()
for (frame in frames){
assign(frame, subset(get(frame), select = c("Col_A","Col_B")))
}
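A list-based variant of the same idea, sketched under the assumption that the data frames can be told apart from other objects by a name pattern (here the hypothetical prefix df_). This avoids touching unrelated objects that a bare ls() would also return.

```r
# Toy data frames standing in for the imported CSVs (illustrative only)
df_one <- data.frame(Col_A = 1:2, Col_B = 3:4, Col_C = 5:6)
df_two <- data.frame(Col_A = 7:8, Col_B = 9:10, Col_C = 11:12)

# Gather the data frames by name pattern into a named list
frames <- mget(ls(pattern = "^df_"))

# Keep only the two columns of interest in each element
frames <- lapply(frames, function(d) d[, c("Col_A", "Col_B")])

names(frames$df_one)  # "Col_A" "Col_B"
```

The originals in the global environment are left unchanged; the subsetted copies live in the list.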

Using for loops to match pairs of data frames in R

Using a particular function, I wish to merge pairs of data frames, for multiple pairings in an R directory. I am trying to write a ‘for loop’ that will do this job for me, and while related questions such as Merge several data.frames into one data.frame with a loop are helpful, I am struggling to adapt example loops for this particular use.
My data frames end with either "_df1.csv" or "_df2.csv". Each pair that I wish to merge into an output data frame has an identical number at the beginning of the file name (i.e. 543_df1.csv and 543_df2.csv).
I have created a character string for each of the two types of file in my directory using the list.files command as below:
df1files <- list.files(path="~/Desktop/combined files", pattern="*_df1.csv", full.names=T, recursive=FALSE)
df2files <- list.files(path="~/Desktop/combined files", pattern="*_df2.csv", full.names=T, recursive=FALSE)
The function and commands that I want to apply in order to merge each pair of data frames are as follows:
findRow <- function(dt, df) { min(which(df$datetime > dt )) }
rows <- sapply(df2$datetime, findRow, df=df1)
merged <- cbind(df2, df1[rows,])
I am now trying to incorporate these commands into a for loop starting with something along the following lines, to prevent me from having to manually merge the pairs:
for(i in 1:length(df2files)){ ……
I am not yet a strong R programmer, and have hit a wall, so any help would be greatly appreciated.
My intuition (which I haven't had a chance to check) is that you should be able to do something like the following:
# read in the data as two lists of dataframes:
dfs1 <- lapply(df1files, read.csv)
dfs2 <- lapply(df2files, read.csv)
# define your merge commands as a function
merge2 <- function(df1, df2){
findRow <- function(dt, df) { min(which(df$datetime > dt )) }
rows <- sapply(df2$datetime, findRow, df=df1)
merged <- cbind(df2, df1[rows,])
}
# apply that merge command to the list of lists
mergeddfs <- mapply(merge2, dfs1, dfs2, SIMPLIFY=FALSE)
# write results to files
outfilenames <- gsub("df1","merged",df1files)
mapply(function(x,y) write.csv(x,y), mergeddfs, outfilenames)
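To make the row-matching step concrete, here is a small worked example on made-up data. For each row of df2, findRow returns the index of the first row of df1 whose datetime is later (assuming df1 is sorted by datetime). The sketch adds a guard, not in the original, that returns NA when no later row exists, because min(which(...)) on an empty result would otherwise give Inf with a warning.

```r
# Made-up example data; only the datetime column name comes from the question
df1 <- data.frame(datetime = as.POSIXct(c("2020-01-01 10:00:00",
                                          "2020-01-01 11:00:00",
                                          "2020-01-01 12:00:00"), tz = "UTC"),
                  value1 = c(10, 11, 12))
df2 <- data.frame(datetime = as.POSIXct(c("2020-01-01 09:30:00",
                                          "2020-01-01 11:30:00"), tz = "UTC"),
                  value2 = c("a", "b"))

# Index of the first df1 row later than dt, or NA if there is none
findRow <- function(dt, df) {
  idx <- which(df$datetime > dt)
  if (length(idx) == 0) NA_integer_ else min(idx)
}

rows <- sapply(df2$datetime, findRow, df = df1)
rows  # 1 3

# One output row per df2 row, with the matched df1 row alongside
merged <- cbind(df2, df1[rows, ])
```

Note that merged ends up with two columns named datetime, one from each input; renaming one of them before cbind may make the result easier to work with.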
