Creating a pipeline in R that serially processes multiple csv files - r

My pipeline reads in a csv to a dataframe, assigns rownames, removes a column, performs a pca, plots the pca and extracxts the meaningful variables from the pca which are also plotted.
Here is my current code, which only goes as far as the first plot:
library(ggplot2)
library(ggrepel)
tsv = read.csv('matrix.tsv', sep='\t')
bell= read.csv('bell.tsv', sep='\t')
tail= read.csv('tail.tsv', sep='\t')
dfList = list(tail, tsv, bell)
#process csv's
dfList = lapply(dfList, function(dum){
rownames(dum) = dum[,1]
dum[,1] = NULL
dum$X = NULL
dum = dum[, -grep('un', colnames(dum))]
})
#create pca's of dataframes
pcaList = lapply(dfList, function(pca){
prin_comp = prcomp(pca, scale. = T)
})
#plot top 2 principle components in the pca
plotList = lapply(pcaList, function(prin_comp){
t = qplot(x=prin_comp$rotation[,1], y=prin_comp$rotation[,2]) + geom_text_repel(aes(label=row.names(prin_comp$rotation)))
})
#this plots the 3 plots, one for each pca, but they are un-named
plotList
The problem is that the plots don't have meaningful names/titles. I don't know how to keep that information present, passed from function to function.
I know there must be a more elegant way of doing this. And I have spent a day reading similar and not so similar questions regarding processing multiple csv files. But either they weren't applicable or didn't work for my case.
And as the title of this question implies, I would prefer to do this on one csv at a time, not all 3 at a time, as the csv's in question are very large, over 5GB each, so keeping each dataframe and pca in memory at the same time is impossible.

You just need to keep a string you want to use as the title somewhere and add ggtitle(YOUR_TITLE) to your plot, but this is not so easy with your current code. Instead of performing each step of the analysis for each CSV before going to the next step, why don't you just perform all steps for one CSV at a time?
Your code could look like:
library(ggplot2)
library(ggrepel)
csvs <- c("matrix.tsv","bell.tsv","tail.tsv")
for (i in csvs) {
# read file
df <- read.csv(i, sep='\t')
# process file
rownames(df) <- df[,1]
df[,1] <- NULL
df$X = NULL
df = df[, -grep('un', colnames(df))]
# create pca
pca <- prcomp(df, scale = T)
# plot pca
pcaPlot <- qplot(x=pca$rotation[,1], y=pca$rotation[,2]) +
geom_text_repel(aes(label=row.names(pca$rotation))) +
ggtitle(i)
print(pcaPlot)
# extract and plot meaningful variables
# ...
}
Basically I just put everything you do in a lapply call inside of a for loop, this approach also does the processing for one CSV at the time.

Related

How can I count the number of rows in multiple .csv files that meet a condition in order to plot them on a bar chart in R?

I have a folder with many .csv files each containing a list of annotated variants from sequencing. I would like to plot a bar chart of the number of somatic variants in each file - there is an Origin column with the value "somatic" or "germline".
I have been able to plot the total number of variants by counting the rows in each file with the following code:
combined_data <- list.files(pattern = ".csv")
numvar <- lapply(X = combined_data, FUN = function(x) {
length(count.fields(x, skip = 1))
})
var <- do.call(rbind,numvar)
varn <- c(as.numeric(var))
names <- c(1:41)
table <- data.frame(names, varn)
ggplot(data=table, aes(x=names, y=varn)) + geom_bar(stat="identity")
While this worked to create a bar chart for the total number of variants per file, I am not sure where it would be possible to add a condition specifying to count only the rows that meet the "somatic" condition.
Any advice would be very much appreciated.
I would do this with a for loop which performs the following steps:
Read the table
Add the somatic variants
Add the germline variants
Here is a starter in which number of variants are stored in external variables to the loop and then the loop just adds them to those from the new file read. In the end your variables will store the total number of variants assuming you have a column named "variant.class" in your tables
fils <- list.files(pattern = ".csv")
som.vars.n <- 0
germline.vars.n <- 0
for(fil in fils){
fil.tab <- read.csv(fil)
som.vars.n <- som.vars.n + sum(fil.tab$variant.class == "somatic")
germline.vars.n <- germline.vars.n + sum(fil.tab$variant.class == "germline")
}
Please provide minimal info to reproduce the situation if you wanna get a more accurate response. Hope I gave you a general idea.
Best

Reading ZIP file of machine-written data won't "plot" in RStudio

Summary: Despite a complicated lead-up, the solution was very simple: In order to plot a row of a dataframe as a line instead of a lattice, I needed to transpose the data in order to invert from x obs of y variables to y obs of x variables.
I am using RStudio on a Windows 10 computer.
I am using scientific equipment to write measurements to a csv file. Then I ZIP several files and read to R using read.csv. However, the data frame behaves strangely. Commands "length" and "dim" disagree and the "plot" function throws errors. Because I can create simulated data that doesn't throw the errors, I think the problem is either in how the machine wrote the data or in my loading and processing of the data.
Two ZIP files are located in my stackoverflow repository (with "Monterey Jack" in the name):
https://github.com/baprisbrey/stackoverflow
Here is my code for reading and processing them:
# Unzip the folders
unZIP <- function(folder){
orig.directory <- getwd()
setwd(folder)
zipped.folders <- list.files(pattern = ".*zip")
for (i in zipped.folders){
unzip(i)}
setwd(orig.directory)
}
folder <- "C:/Users/user/Documents/StackOverflow"
unZIP(folder)
# Load the data into a list of lists
pullData <- function(folder){
orig.directory <- getwd()
setwd(folder)
#zipped.folders <- list.files(pattern = ".*zip")
#unzipped.folders <- list.files(folder)[!(list.files(folder) %in% zipped.folders)]
unzipped.folders <- list.dirs(folder)[-1] # Removing itself as the first directory.
oData <- vector(mode = "list", length = length(unzipped.folders))
names(oData) <- str_remove(unzipped.folders, paste(folder,"/",sep=""))
for (i in unzipped.folders) {
filenames <- list.files(i, pattern = "*.csv")
#setwd(paste(folder, i, sep="/"))
setwd(i)
files <- lapply(filenames, read.csv, skip = 5, header = TRUE, fileEncoding = "UTF-16LE") #Note unusual encoding
oData[[str_remove(i, paste(folder,"/",sep=""))]] <- vector(mode="list", length = length(files))
oData[[str_remove(i, paste(folder,"/",sep=""))]] <- files
}
setwd(orig.directory)
return(oData)
}
theData <- pullData(folder) #Load the data into a list of lists
# Process the data into frames
bigFrame <- function(bigList) {
#where bigList is theData is the result of pullData
#initialize the holding list of frames per set
preList <- vector(mode="list", length = length(bigList))
names(preList) <- names(bigList)
# process the data
for (i in 1:length(bigList)){
step1 <- lapply(bigList[[i]], t) # transpose each data
step2 <- do.call(rbind, step1) # roll it up into it's own matrix #original error that wasn't reproduced: It showed length(step2) = 24048 when i = 1 and dim(step2) = 48 501. Any comments on why?
firstRow <- step2[1,] #holding onto the first row to become the names
step3 <- as.data.frame(step2) # turn it into a frame
step4 <- step3[grepl("µA", rownames(step3)),] # Get rid of all those excess name rows
rownames(step4) <- 1:(nrow(step4)) # change the row names to rowID's
colnames(step4) <- firstRow # change the column names to the first row steps
step4$ID <- rep(names(bigList[i]),nrow(step4)) # Add an I.D. column
step4$Class[grepl("pos",tolower(step4$ID))] <- "Yes" # Add "Yes" class
step4$Class[grepl("neg",tolower(step4$ID))] <- "No" # Add "No" class
preList[[i]] <- step4
}
# bigFrame <- do.call(rbind, preList) #Failed due to different number of measurements (rows that become columns) across all the data sets
# return(bigFrame)
return(preList) # Works!
}
frameList <- bigFrame(theData)
monterey <- rbind(frameList[[1]],frameList[[2]])
# Odd behaviors
dim(monterey) #48 503
length(monterey) #503 #This is not reproducing my original error of length = 24048
rowOne <- monterey[1,1:(ncol(monterey)-2)]
plot(rowOne) #Error in plot.new() : figure margins too large
#describe the data
quantile(rowOne, seq(0, 1, length.out = 11) )
quantile(rowOne, seq(0, 1, length.out = 11) ) %>% plot #produces undesired lattice plot
# simulate the data
doppelganger <- sample(1:20461,501,replace = TRUE)
names(doppelganger) <- names(rowOne)
# describe the data
plot(doppelganger) #Successful scatterplot. (With my non-random data, I want a line where the numbers in colnames are along the x-axis)
quantile(doppelganger, seq(0, 1, length.out = 11) ) #the random distribution is mildly different
quantile(doppelganger, seq(0, 1, length.out = 11) ) %>% plot # a simple line of dots as desired
# investigating structure
str(rowOne) # results in a dataframe of 1 observation of 501 variables. This is a correct interpretation.
str(as.data.frame(doppelganger)) # results in 501 observations of 1 variable. This is not a correct interpretation but creates the plot that I want.
How do I convert the rowOne to plot like doppelganger?
It looks like one of my errors is not reproducing, where calls to "dim" and "length" apparently disagree.
However, I'm confused as to why the "plot" function is producing a lattice plot on my processed data and a line of dots on my simulated data.
What I would like is to plot each row of data as a line. (Next, and out of the scope of this question, is I would like to classify the data with adaboost. My concern is that if "plot" behaves strangely then the classifier won't work.)
Any tips or suggestions or explanations or advice would be greatly appreciated.
Edit: Investigating the structure with ("str") of the two examples explains the difference between plots. I guess my modified question is, how do I switch between the two structures to enable plotting a line (like doppelganger) instead of a lattice (like rowOne)?
I am answering my own question.
I am leaving behind the part about the discrepancy between "length" and "dim" since I can't provide a reproducible example. However, I'm happy to leave up for comment.
The answer is that in order to produce my plot, I simply have to transpose the row as follows:
rowOne %>% t() %>% as.data.frame() %>% plot
This inverts the structure from one observation of 501 variables to 501 obs of one variable as follows:
rowOne %>% t() %>% as.data.frame() %>% str()
#'data.frame': 501 obs. of 1 variable:
# $ 1: num 8712 8712 8712 8712 8712 ...
Because of the unusual encoding I used, and the strange "length" result, I failed to see a simple solution to my "plot" problem.

ggplot2 - importing and plotting multiple .csv files

I have tried batch importing, but I think ggplot2 requires data frames and I have only been able to make a list of elements. I have set up a simple code in ggplot2 that imports data from multiple csv files and overlays their trendlines. All of the .csv files are in the same folder and have the same format. Is there a way to import all of the .csv files from the folder and plot all of them in ggplot without copying this code hundreds of times?
Thank you for your help!
library(ggplot2)
points1 <- read.csv("http://drive.google.com")[1:10,1:2]
points2 <- read.csv("http://drive.google.com")[1:10,1:2]
g <- (ggplot(points1, aes(x=ALPHA, y=BETA))
+labs(title="Model Run", subtitle="run4", y="LabelY", x="LabelX", caption="run4")
+ coord_cartesian(xlim=c(0,10), ylim=c(0,11))
#+ geom_point(data = points1)#
+geom_smooth(method="loess", span=.8, data = points1, se=FALSE)
#+ geom_point(data = points2)#
+geom_smooth(method="loess", span=.8, data = points2, se=FALSE))
plot(g)
This is a fun one. I am using some packages from the tidyverse (ggplot, purrr, readr) to make it more consistent.
Since you want to plot all the data in one plot, it makes sense to put all of it into one dataframe. The function purrr::map_df is perfect for this.
library(tidyverse)
files <- list.files("data/", "*.csv", full.names = T)
names(files) <- list.files("data/", "*.csv")
df <- map_df(files, ~read_csv(.), .id = "origin")
df %>% ggplot()+
aes(x,y, color = origin)+
geom_point()
A few explainations
The first two lines create a named vector with its elements being the full paths to the csv-files and the names of this vector being the filenames. This makes is easier to use the .id argument of map_df, which creates an additional column namend "origin" from the filenames. The notation inside map might seem a little weird at first, you could also supply a function written elesewhere to apply to each element but the ~ symbol is pretty handy: it creates a function right there and this function always takes the argument . as the current element of the vector or list you are iterating over.

Retaining sample information after processing with a function in R

I won't pretend that this code is even remotely optimal, but here is the problem I have. I have a list of files with multiple columns read in with sapply(), such that if I call file.list[[1]] I get a summary of that data.frame, and summary(file.list) is a list of files.
I am fitting curves to the data using the mgcv package as follows:
gam_data <- function(curves)
{
out <- gam(curves[, 15] ~ s(curves[, 23]))
pd <- plot(out)
return(pd)
}
out <- lapply(file.list, gam_data)
split_curves <- function(splitting)
{
pd_2 <- c(splitting[[1]]$fit)
pd_3 <- c(splitting[[1]]$x)
pd_4 <- c(splitting[[1]]$se)
curveg <- cbind(pd_2, pd_3, pd_4)
colnames(curveg) <- c("fitted", "sphro", "se")
return(curveg)
}
out2 <- lapply(out, split_curves)
Where the first block is performing gam and the second is extracting the fit of the curve. However, after all of that the original information in file.list such as replicate, genotype, etc. is lost, and the data.frames are not the same length anymore. This is probably a trivial question, but how does one retain that information through processing? I'm applying this to hundreds of data frames so I cannot just manually recreate the columns.

Plot data from several large data files in ggplot

I have several data files (numeric) with around 150000 rows and 25 columns. Before I was using gnuplot (where script lines are proportional plot objects) to to plot the data but as I have to do now some additional analysis with it I moved to R and ggplot2.
How to organize the data, thought? Is one big data.frame with an additional column to mark from which file the data is coming from really the only option? Or is there some way around that?
Edit: To be a bit more precise, I'll give as an example in what form I have the data now:
filelst=c("filea.dat", "fileb.dat", "filec.dat")
dat=c()
for(i in 1:length(filelst)) {
dat[[i]]=read.table(file[i])
}
Assuming you have filenames ending with ".dat", here's a mockup example of the strategies proposed by Chase,
require(plyr)
# list the files
lf = list.files(pattern = "\.dat")
str(lf)
# 1. read the files into a data.frame
d = ldply(lf, read.table, header = TRUE, skip = 1) # or whatever options to read
str(d) # should contain all the data, and and ID column called L1
# use the data, e.g. plot
pdf("all.pdf")
d_ply(d, "L1", plot, t="l")
dev.off()
# or using ggplot2
ggplot(d, aes(x, y, colour=L1)) + geom_line()
# 2. read the files into a list
ld = lapply(lf, read.table, header = TRUE, skip = 1) # or whatever options to read
names(ld) = gsub("\.dat", "", lf) # strip the file extension
str(ld)
# use the data, e.g. plot
pdf("all2.pdf")
lapply(names(l), function(ii) plot(l[[ii]], main=ii), t="l")
dev.off()
# 3. is not fun
Your question is a little vague. If I followed along properly, I think you have three main options:
Do as you suggest and then use any one of the "split-apply-combine" functions that exist in R to conduct your analyses by group. These functions may include by, aggregate, ave, package(plyr), package(data.table) and many others.
Store your data object as separate elements in a list(). Then use lapply() and friends to work on them.
Keep everything separate in different data objects and work on them individually. This is probably the most inefficient way to go about doing things, unless you have memory constraints et al.

Resources