Calculate and write Mean and SD for columns in R

I'm currently working with data in a .csv file. I already have an effective script to plot my data, but I'm stumped by what seems to be the simplest task: I'm trying to write a script that takes my data (arranged in columns), calculates the mean of each column, and writes it to a new document (./testAVG).
I'm also trying to take the same data, calculate the SD by column, and append the result to the end of the original document (preferably repeated for the total number of rows of data I have).
Here's the script I have so far:
#Number of lines with data
Nlines = 5
#Number of lines to skip
Nskip = 0
chem <- read.table("./test.csv", skip=Nskip, sep=",", col.names = c("Sample", "SiO2", "Al2O3", "FeO", "MgO", "CaO", "Na2O", "K2O", "Total", "eSiO2", "eAl2O3", "eFeO", "eMgO", "eCaO", "eNa2O", "eK2O"), fill=TRUE, header = TRUE, nrow=Nlines)
sd1 <- sd(chem$SiO2)
sd2 <- sd(chem$Al2O3)
sd3 <- sd(chem$FeO)
sd4 <- sd(chem$MgO)
sd5 <- sd(chem$CaO)
sd6 <- sd(chem$Na2O)
sd7 <- sd(chem$K2O)
avg1 <- colMeans(chem$SiO2, na.rm = FALSE, dims=1)
avg2 <- colMeans(chem$Al2O3, na.rm = FALSE, dims=1)
avg3 <- colMeans(chem$FeO, na.rm = FALSE, dims=1)
avg4 <- colMeans(chem$MgO, na.rm = FALSE, dims=1)
avg5 <- colMeans(chem$CaO, na.rm = FALSE, dims=1)
avg6 <- colMeans(chem$Na2O, na.rm = FALSE, dims=1)
avg7 <- colMeans(chem$K2O, na.rm = FALSE, dims=1)
write <- write.table(sd1,sd2,sd3,sd4,sd5,sd6,sd7, file="./test.csv", append=TRUE, sep=",", dec=".", col.names = c("eSiO2", "eAl2O3", "eFeO", "eMgO", "eCaO", "eNa2O", "eK2O"))
write <- write.table(avg1, avg2, avg3, avg4, avg5, avg6, avg7, file="./testAVG.csv", append=FALSE, sep=",", dec=".", col.names = c("Sample", "SiO2", "Al2O3", "FeO", "MgO", "CaO", "Na2O", "K2O", "Total"))
The data I'm working with is this:
Sample, SiO2, Al2O3, FeO, MgO, CaO, Na2O, K2O, Total,eSiO2,eAl2O3,eFeO,eMgO,eCaO,eNa2O,eK2O
01,65.01,14.77,0.34,1.31,17.27,1.14,0.2,100,,,,,,,
02,72.6,16.27,0.53,0.06,1.27,5.55,3.71,100,,,,,,,
03,64.95,14.65,0.18,1.29,17.48,1.21,0.23,100,,,,,,,
04,64.95,14.65,0.18,1.29,17.48,1.21,0.23,100,,,,,,,
I get this error:
Error in colMeans(chem$SiO2, na.rm = FALSE, dims = 1) :
'x' must be an array of at least two dimensions
Any advice? Thanks

The comments already hint at how to do it, but it seems that you are rather new to R, so let me explicitly show you how you could do it better, using the mtcars dataset:
df <- mtcars
df_sd <- apply(df, 2, sd) # this is how to use apply. See ?apply
df_avg <- colMeans(df) # this is how to use colMeans. See ?colMeans
write.table(df_sd, file="test.csv") # no assignment necessary.
write.table(df_avg, file="testAVG.csv") # writing the file is a desired side effect...
Moreover, please consider the following line:
avg1 <- colMeans(chem$SiO2, na.rm = FALSE, dims=1)
The cool thing about colMeans is that it computes the columnwise means for many columns at once. Here, you are supplying only one vector, namely chem$SiO2. If this is really what you want to do, you would just write
avg1 <- mean(chem$SiO2)
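Applied to the data in the question, the whole script then collapses to a few lines. The following is only a minimal sketch (it assumes the chem data frame has been read in as in the question, and the output layout is just one reasonable choice):
num_cols <- c("SiO2", "Al2O3", "FeO", "MgO", "CaO", "Na2O", "K2O")
chem_sd  <- apply(chem[num_cols], 2, sd)  # one SD per column
chem_avg <- colMeans(chem[num_cols])      # one mean per column
# append the SDs as one extra row of the original file
write.table(t(chem_sd), file = "./test.csv", append = TRUE,
            sep = ",", col.names = FALSE, row.names = FALSE)
# write the means to a new file
write.table(t(chem_avg), file = "./testAVG.csv", sep = ",",
            col.names = num_cols, row.names = FALSE)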

Related

How do I run descriptive statistics in R for each file in a directory and append it to a data frame (txt file final output as a table)?

I have a folder of CSV files that contain filtered/gated data: two columns (dihedral angle vs. bend angle). The data was filtered based upon an individualized min and max for each file.
I need at least the mean, median, SD, skewness, and kurtosis for each column of each file, presented as a table (one line per file in the final product).
I am not familiar with which R packages may be suitable for this task, so I was trying to do something simple. I can get it to work for a single file, but I have over 200 files, and they will likely be updated, so I'll have to run this multiple times.
module load ccs/container/R/4.1.0
R
library(moments)
files <- list.files("/mnt/gpfs2_4m/scratch/username/fs_scripts/foldedstart_*", pattern="*.csv", recursive=TRUE, full.names=TRUE)
cat("filename","\t","dihedral mean","\t","bend mean","\t","dihedral median","\t","bend median","\t","dh sd","\t","bd sd","\t","dh skew","\t","bd skew","\t","dh kurt","\t","bd kurt","\n")
for (currentFile in files) {
  df <- read.table(fileName[i], header=TRUE)
  z1 <- mean(df$V1)
  z2 <- median(df$V1)
  z3 <- sd(df$V1)
  z4 <- skewness(df$V1)
  z5 <- kurtosis(df$V1)
  z7 <- mean(df$V2)
  z8 <- median(df$V2)
  z9 <- sd(df$V2)
  z10 <- skewness(df$V2)
  z11 <- kurtosis(df$V2)
  cat(filename,"\t",z1,"\t",z7,"\t",z2,"\t",z8,"\t",z3,"\t",z9,"\t",z4,"\t",z10,"\t",z5,"\t",z11,"\n")
  write.table(newdata, file=statsFileName[i])
}
The "first cat line" is the header and labels.
The "for cat line" likely goes "no where," but it is the format that I am trying to achieve.
The "write.table line" is something that I found, but I don't think it may be appropriate for this.
I truly appreciate any help on this. I am not that familiar with R and the examples that I have found do not appear to relate enough to what I trying to do for me to adapt them.
Edit: To give some context, a plot of the data (not reproduced here) shows several major areas of density; I'm looking for the medians (centers) of each. Screenshots of the data (head and tail) and of some of the files in the folder were also attached to the original post.
The following computes, for each file, all of the statistics the question asks for, and writes a table of results to a CSV file.
library(moments)
stats <- function(filename, na.rm = TRUE) {
  tryCatch({
    x <- read.csv(filename)
    xbar <- colMeans(x, na.rm = na.rm)
    med <- apply(x, 2, median, na.rm = na.rm)
    S <- apply(x, 2, sd, na.rm = na.rm)
    skwn <- skewness(x, na.rm = na.rm)
    kurt <- kurtosis(x, na.rm = na.rm)
    #
    # return a data.frame, it will
    # make the code simpler further on
    out <- data.frame(
      filename = filename,
      dihedral.mean = xbar[1],
      bend.mean = xbar[2],
      dihedral.median = med[1],
      bend.median = med[2],
      dihedral.sd = S[1],
      bend.sd = S[2],
      dihedral.skewness = skwn[1],
      bend.skewness = skwn[2],
      dihedral.kurtosis = kurt[1],
      bend.kurtosis = kurt[2]
    )
    row.names(out) <- NULL
    out
  },
  error = function(e) e
  )
}
statsFileName <- "statsfile.txt"
#files <- list.files("/mnt/gpfs2_4m/scratch/username/fs_scripts/foldedstart_*", pattern="*.csv", recursive=TRUE, full.names=TRUE)
files <- list.files("~/Temp", "^t.*\\.csv$")
newdata <- lapply(files, stats)
ok <- !sapply(newdata, inherits, "error")
cat("files read:", sum(ok), "\n")
if(any(!ok)) {
  cat("errors:", sum(!ok), "\n")
  err_list <- list(
    files = files[!ok],
    # conditionMessage() works on one condition at a time, so apply it element-wise
    error = sapply(newdata[!ok], conditionMessage)
  )
}
newdata <- do.call(rbind, newdata[ok])
write.csv(newdata, file = statsFileName, row.names = FALSE)
This solution uses dplyr to summarise each file, combines the summaries into a single dataframe, then writes the results to a csv file.
library(moments)
library(dplyr)
csv_output_path <- "./results.csv"
data_dir <- "./data"
### Create dummy csv files for reproducibility ###
if(!dir.exists(data_dir)) dir.create(data_dir)
for(i in 1:200){
  write.csv(data.frame(V1 = runif(100), V2 = runif(100)),
            file = paste0(data_dir, "/file_", i, ".csv"),
            row.names = FALSE)
}
### Summarise files ###
files <- list.files(data_dir, pattern = "\\.csv$", recursive = TRUE, full.names = TRUE)
all_results <- vector("list", length(files)) # results placeholder
# Loop that calculates summary statistics
for (i in 1:length(files)) {
  currentFile <- files[i]
  df <- tryCatch(read.csv(file = currentFile, header=TRUE),
                 error = function(e) NULL)
  if(is.null(df))
    next
  result <- df %>%
    summarise_all(list(mean = mean, median = median,
                       sd = sd, skew = skewness, kur = kurtosis)) %>%
    mutate(file = currentFile) %>%  # add filename to the result
    select(file, everything())      # reorder
  all_results[[i]] <- result
}
# Combine results into a single df
final_table <- bind_rows(all_results)
# write file
write.csv(final_table, csv_output_path, row.names = FALSE)
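As a side note, on dplyr >= 1.0 the summarise_all() family is superseded by across(); the summary step inside the loop could also be written as follows (a sketch producing the same statistics as above):
result <- df %>%
  summarise(across(everything(),
                   list(mean = mean, median = median, sd = sd,
                        skew = skewness, kur = kurtosis))) %>%
  mutate(file = currentFile) %>%
  select(file, everything())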

Speeding up dplyr pipe including checks with mutate_if and if_else on larger tables

I wrote some code to perform oversampling, meaning that I replicate my observations in a data.frame and add noise to the replicates, so they are no longer exactly the same. I'm quite happy that it works as intended now, but... it is too slow. I'm just learning dplyr and have no clue about data.table, but I hope there is a way to improve my function. I'm running this code in a function for hundreds of data.frames, which may contain about 10,000 columns and 400 rows.
This is some toy data:
library(tidyverse)
train_set1 <- rep(0, 300)
train_set2 <- rep("Factor1", 300)
train_set3 <- data.frame(replicate(1000, sample(0:1, 300, rep = TRUE)))
train_set <- cbind(train_set1, train_set2, train_set3)
row.names(train_set) <- c(paste("Sample", c(1:nrow(train_set)), sep = "_"))
This is the code to replicate each row a given number of times and a function to determine whether the added noise later will be positive or negative:
# replicate each row twice, added row.names contain a "."
train_oversampled <- train_set[rep(seq_len(nrow(train_set)), each = 3), ]
# create a flip function
flip <- function() {
  sample(c(-1,1), 1)
}
In the relevant "too slow" piece of code, I'm subsetting the row.names for the added "." to filter for the replicates. Then I select only the numeric columns. I go through those columns row by row and leave the values untouched if they are 0; if not, a certain amount of noise is added (here +/- 1%). Later on, I combine this data set with the original data set and have my oversampled data.frame.
# add percentage of noise to non-zero values in numerical columns
noised_copies <- train_oversampled %>%
  rownames_to_column(var = "rowname") %>%
  filter(grepl("\\.", row.names(train_oversampled))) %>%
  rowwise() %>%
  mutate_if(~ is.numeric(.), ~ if_else(. == 0, 0, . + (. * flip() * 0.01))) %>%
  ungroup() %>%
  column_to_rownames(var = "rowname")
# combine original and oversampled, noised data set
train_noised <- rbind(noised_copies, train_set)
I assume there are faster ways using e.g. data.table, but it was already tough work to get this code running and I have no idea how to improve its performance.
EDIT:
The solution is working perfectly fine with fixed values, but called within a for loop I receive "Error in paste(Sample, n, sep = ".") : object 'Sample' not found"
Code to replicate:
library(data.table)
train_set <- data.frame(
  x = c(rep(0, 10)),
  y = c(0:9),
  z = c(rep("Factor1", 10)))
# changing the row name to avoid confusion with "Sample"
row.names(train_set) <- c(paste("Observation", c(1:nrow(train_set)), sep = "_"))
train_list <- list(aa = train_set, bb = train_set, cc = train_set)
for(current_table in train_list) {
  setDT(current_table, keep.rownames="Sample")
  cols <- names(current_table)[sapply(current_table, is.numeric)]
  noised_copies <- lapply(c(1,2), function(n) {
    copy(current_table)[,
      c("Sample", cols) := c(.(paste(Sample, n, sep=".")),
        .SD * sample(c(-1.01, 1.01), .N*ncol(.SD), TRUE)),
      .SDcols=cols]
  })
  train_noised <- rbindlist(c(noised_copies, list(train_set)), use.names=FALSE)
  # As this is an example, I did not write anything to actually
  # store the results, so I have to remove the object
  rm(train_noised)
}
Any ideas why the column Sample can't be found now?
Here is a more vectorized approach using data.table:
library(data.table)
setDT(train_set, keep.rownames="Sample")
cols <- names(train_set)[sapply(train_set, is.numeric)]
noised_copies <- lapply(c(1,2), function(n) {
  copy(train_set)[,
    c("Sample", cols) := c(.(paste(Sample, n, sep=".")),
      .SD * sample(c(-1.01, 1.01), .N*ncol(.SD), TRUE)),
    .SDcols=cols]
})
train_noised <- rbindlist(c(noised_copies, list(train_set)), use.names=FALSE)
With data.table version >= 1.12.9, you can pass is.numeric directly to the .SDcols argument, and there may be a shorter way (e.g. (.SD) or names(.SD)) to express the left-hand side of :=.
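An untested sketch of what that might look like (both idioms, .SDcols = is.numeric and names(.SD) on the left-hand side of :=, are assumptions about sufficiently recent data.table versions, not something verified here):
noised_copies <- lapply(c(1,2), function(n) {
  dt <- copy(train_set)
  dt[, Sample := paste(Sample, n, sep=".")]
  # .SDcols = is.numeric selects the numeric columns;
  # names(.SD) on the LHS overwrites those same columns
  dt[, names(.SD) := .SD * sample(c(-1.01, 1.01), .N*ncol(.SD), TRUE),
     .SDcols = is.numeric]
  dt
})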
To address the OP's updated post: the issue is that although each data.frame within the list is converted to a data.table, train_list itself is not updated. You can update the list with lapply before the for loop:
library(data.table)
train_set <- data.frame(
  x = c(rep(0, 10)),
  y = c(0:9),
  z = c(rep("Factor1", 10)))
# changing the row name to avoid confusion with "Sample"
row.names(train_set) <- c(paste("Observation", c(1:nrow(train_set)), sep = "_"))
train_list <- list(aa = train_set, bb = copy(train_set), cc = copy(train_set))
train_list <- lapply(train_list, setDT, keep.rownames="Sample")
for(current_table in train_list) {
  cols <- names(current_table)[sapply(current_table, is.numeric)]
  noised_copies <- lapply(c(1,2), function(n) {
    copy(current_table)[,
      c("Sample", cols) := c(.(paste(Sample, n, sep=".")),
        .SD * sample(c(-1.01, 1.01), .N*ncol(.SD), TRUE)),
      .SDcols=cols]
  })
  # bind the noised copies to the table they came from
  train_noised <- rbindlist(c(noised_copies, list(current_table)), use.names=FALSE)
  # As this is an example, I did not write anything to actually
  # store the results, so I have to remove the object
  rm(train_noised)
}

How to create a random sample from a .csv in R

I need to pull a random sample of 100,000 to 200,000 rows from a CSV dataset of 2.8 million rows. How do I do this effectively, so that the random sample can be cleaned and processed?
I have used the sampleCSV function from the DMwR2 library, but the output data messes up the 22 variables that I need to use.
library(caret)
library(DMwR2)
dataset source: https://www.kaggle.com/pschale/mlb-pitch-data-20152018#pitches.csv
pitchData <- sampleCSV(file.choose(), 200000, 2867154 , header = TRUE , mxPerc = 0.5)
summary(pitchData)
I expect the output of summary(pitchData) to have the same variable names as the csv file, but it renames them using random numbers, and some of the variables are lost.
Maybe the following function can do what the question asks for. Note that it uses a function from package R.utils.
The return value is a list with two members:
lines: the line numbers read in;
data: the data frame.
This can be changed to return just the data frame.
sample_csv <- function(fname, n, sep = ",", header = TRUE, ...){
  N <- R.utils::countLines(fname)   # total number of lines in the file
  stopifnot(N >= n)
  lns <- sample(N, n)               # line numbers to keep
  x <- sapply(lns, function(l){
    # read exactly one line, skipping the l - 1 lines before it
    scan(fname, what = character(), skip = l - 1, nlines = 1, quiet = TRUE)
  })
  list(lines = lns,
       data = read.table(textConnection(x),
                         sep = sep, header = header, ...)
  )
}
set.seed(1234)
res <- sample_csv(filename, 100, header = FALSE)
str(res$data)
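If the whole file fits in memory, a simpler alternative is to read everything and sample afterwards. A sketch, assuming data.table is available and the pitches.csv file from the Kaggle link has been downloaded and fits in RAM:
library(data.table)
pitch <- fread("pitches.csv")             # fast CSV reader, keeps all column names
set.seed(1234)
pitchSample <- pitch[sample(.N, 200000)]  # 200,000 randomly chosen rows
summary(pitchSample)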

How can I incorporate NA removal into aggregate based on a custom function?

This is my first time using any custom functions, so bear with me. I made a function for standard error that I'd like to use with aggregate. It worked until I tried to exclude NAs.
Dummy data frame to work with:
se <- function(x) sd(x)/sqrt(length(x))
df <- data.frame(site = c('N','N','N','S','S','S'),
                 birds = c(NA,4,2,9,3,1),
                 worms = c(2,1,2,4,0,5))
means <- aggregate(df[,2:3], na.rm = T, list(site = df$site), FUN = mean)
error <- aggregate(df[,2:3], na.rm = T, list(site = df$site), FUN = se)
So aggregate worked before I excluded NAs (e.g. error <- aggregate(df[,2:3], list(site = df$site), FUN = se)), and it works when finding the mean (taking the mean of the remaining values and ignoring the missing one). How can I exclude NAs in the same manner when using my custom se function?
The problem is that you do not have an explicit argument for na.rm in your se function. If you add that to your function, it should work:
se <- function(x, na.rm = TRUE) {
  sd(x, na.rm = na.rm)/sqrt(sum(!is.na(x)))
}
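With the explicit na.rm argument, the original call works unchanged, because aggregate forwards extra arguments (such as na.rm = T) through ... to FUN. For example, with the df from the question:
error <- aggregate(df[,2:3], na.rm = T, list(site = df$site), FUN = se)
error
#   site     birds     worms
# 1    N 1.0000000 0.3333333
# 2    S 2.4037009 1.5275252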

Create longitudinal data from a list of igraph objects in R

I'm doing analysis on company networks in R and am trying to export my igraph results into a dataframe.
Here's a reproducible example:
library(igraph)
sample <- data.frame(ID = 1:8, org_ID = c(5,4,1,2,2,2,5,7), mon = c("199801", "199802","199802","199802","199904","199912","200001", "200012"))
create.graphs <- function(df){
  g <- graph.data.frame(d = df, directed = TRUE)
  g <- simplify(g, remove.multiple = FALSE, remove.loops = TRUE)
  E(g)$weight <- count_multiple(g)
  # calculate global values
  g$centrality <- centralization.degree(g)
  # calculate local values
  g$indegree <- degree(g, mode = "in",
                       loops = FALSE, normalized = FALSE)
  return(g)
}
df.list <- split(sample, sample$mon)
g <- lapply(df.list, create.graphs)
As you can see, I have graphs for multiple months. I want to export this to longitudinal data, where each row represents a month (per ID) and each column represents the corresponding network measures.
So far I've managed to create a data frame for a single month, but I haven't figured out how to run this over the whole list of graphs and put the results into a fitting format. An additional problem could be that the graphs have different numbers of nodes (some have around 25, others more than 40), but that should theoretically just be recognised as missing by my regression model.
output <- data.frame(Centrality = g$`199801`$centrality,
                     Indegree = g$`199801`$indegree)
output
summary(output)
I tried writing a function similar to the one above for this, but unfortunately to no avail.
Thanks in advance for reading this; any help is greatly appreciated.
I wanted to share how I solved it (thanks to Dave2e's suggestion).
Note that ci$monat defines my time periods in the original data, so one row for each point in time.
sumarTable <- data.frame(time = unique(ci$monat))
sumarTable$indegree <- lapply(g, function(x){x$indegree})
sumarTable$outdegree <- lapply(g, function(x){x$outdegree})
sumarTable$constraint <- lapply(g, function(x){x$constraint})
etc
Edit:
in order to export these values, I had to "flatten" the lists:
sumarTable$indegree <- vapply(sumarTable$indegree, paste, collapse = ", ", character(1L))
sumarTable$outdegree <- vapply(sumarTable$outdegree, paste, collapse = ", ", character(1L))
sumarTable$constraint <- vapply(sumarTable$constraint, paste, collapse = ", ", character(1L))
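As a side note, the lapply step and the later flattening could be combined, since sapply can build the collapsed strings directly (a sketch following the same pattern as above):
sumarTable$indegree <- sapply(g, function(x) paste(x$indegree, collapse = ", "))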
