I have several numeric data files with around 150,000 rows and 25 columns each. Previously I used gnuplot (where the script grows in proportion to the number of plot objects) to plot the data, but as I now have to do some additional analysis with it, I moved to R and ggplot2.
How should I organize the data, though? Is one big data.frame with an additional column marking which file each row came from really the only option? Or is there some way around that?
Edit: To be a bit more precise, here is an example of the form my data is in now:
filelst=c("filea.dat", "fileb.dat", "filec.dat")
dat=c()
for(i in 1:length(filelst)) {
dat[[i]]=read.table(file[i])
}
Assuming your file names end with ".dat", here's a mock-up of the strategies proposed by Chase:
require(plyr)
library(ggplot2)

# list the files
lf = list.files(pattern = "\\.dat$")
str(lf)

# 1. read the files into a single data.frame
d = ldply(lf, read.table, header = TRUE, skip = 1) # or whatever options you need to read
str(d) # should contain all the data, plus an ID column called L1

# use the data, e.g. plot one panel per file
pdf("all.pdf")
d_ply(d, "L1", plot, type = "l")
dev.off()

# or using ggplot2 (assuming columns named x and y)
ggplot(d, aes(x, y, colour = L1)) + geom_line()

# 2. read the files into a list
ld = lapply(lf, read.table, header = TRUE, skip = 1) # or whatever options you need to read
names(ld) = gsub("\\.dat$", "", lf) # strip the file extension
str(ld)

# use the data, e.g. plot each list element, titled by file name
pdf("all2.pdf")
lapply(names(ld), function(ii) plot(ld[[ii]], main = ii, type = "l"))
dev.off()

# 3. is not fun
Your question is a little vague. If I followed along properly, I think you have three main options:
Do as you suggest and then use any one of the "split-apply-combine" functions that exist in R to conduct your analyses by group. These include by, aggregate, ave, the plyr and data.table packages, and many others; a minimal sketch follows this list.
Store your data objects as separate elements in a list(). Then use lapply() and friends to work on them.
Keep everything separate in different data objects and work on them individually. This is probably the most inefficient way to go about doing things, unless you have memory constraints or the like.
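To make option 1 concrete, here is a minimal sketch of split-apply-combine with base functions, assuming a combined data.frame d that has an ID column L1 and a numeric column y, as in the mock-up above:

# summarise y within each file's rows -- one result per level of L1
aggregate(y ~ L1, data = d, FUN = mean)

# the same split-apply idea with by()
by(d$y, d$L1, summary)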
I am new to R (if you know some good online courses in R, I'd be glad for any recommendation) and I am trying to apply the same set of functions for statistical analysis to many datasets (here is an example of part of my dataset).
This is how I do it for an individual file:
library(dplyr)    # for filter_if()
library(fBasics)  # for basicStats()

ims.no3.1.csv <- read.csv("M:\\blabla\\datasetname1.csv", sep = ",")
ims.no3.non0.1 <- filter_if(ims.no3.1.csv, is.numeric, all_vars((.) != 0))
ims.no3.1 <- data.frame(value = rep(ims.no3.non0.1$bin.start, ims.no3.non0.1$count))
basicStats(ims.no3.1)
h.ims.no3.1 <- hist(ims.no3.1$value, breaks = 400, ylim = c(0, 400), xlim = c(0, 3),
                    xlab = "lifetime (ns)", main = "Histogram of ims+no3_1")
So I was trying to write my own function and then apply it with lapply, but I don't know how to deal with the names I would have to use (at least for the rep and filter_if calls).
Thanks in advance for your help!
If I understand your question, you want to perform the same analysis on multiple datasets. A simple example is below, where I find the dimensions of a group of datasets in my working directory:
setwd("YourPath:/")
files <- list.files(pattern = "\\.csv$") # list of my csv files
# analyze each file in the list
# e.g. what are the dimensions of these files?
a <- list()
for (i in 1:length(files)) { # for each of the csv files in the list...
  message("i = ", files[i])
  a[[i]] <- dim(read.csv(files[i])) # compute the dimensions and store them in the list
  print(a[[i]]) # print the result for this file
}
The loop above will print the dimensions (rows and columns) of each dataset in my working directory. You can adapt it to compute means or sums, produce plots, write Rmd files, etc.
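For the specific pipeline in the question, a sketch of the same idea might look like the following. It assumes the packages behind the original snippet (dplyr for filter_if() and fBasics for basicStats()), that every CSV has bin.start and count columns, and it derives the histogram title from the file name, so no per-dataset variable names are needed; analyzeFile is just an illustrative name:

library(dplyr)
library(fBasics)

analyzeFile <- function(path) {
  raw  <- read.csv(path)
  non0 <- filter_if(raw, is.numeric, all_vars((.) != 0))
  vals <- data.frame(value = rep(non0$bin.start, non0$count))
  print(basicStats(vals))
  hist(vals$value, breaks = 400, ylim = c(0, 400), xlim = c(0, 3),
       xlab = "lifetime (ns)",
       main = paste("Histogram of", basename(path)))  # title taken from the file
  invisible(vals)
}

results <- lapply(list.files(pattern = "\\.csv$", full.names = TRUE), analyzeFile)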
My pipeline reads a CSV into a dataframe, assigns rownames, removes a column, performs a PCA, plots the PCA, and extracts the meaningful variables from the PCA, which are also plotted.
Here is my current code, which only goes as far as the first plot:
library(ggplot2)
library(ggrepel)
tsv  = read.csv('matrix.tsv', sep = '\t')
bell = read.csv('bell.tsv', sep = '\t')
tail = read.csv('tail.tsv', sep = '\t')
dfList = list(tail, tsv, bell)
# process the csv's: set rownames, drop unneeded columns
dfList = lapply(dfList, function(dum) {
  rownames(dum) = dum[, 1]
  dum[, 1] = NULL
  dum$X = NULL
  dum = dum[, -grep('un', colnames(dum))]
})
# create a pca for each dataframe
pcaList = lapply(dfList, function(pca) {
  prin_comp = prcomp(pca, scale. = T)
})
# plot the top 2 principal components of each pca
plotList = lapply(pcaList, function(prin_comp) {
  t = qplot(x = prin_comp$rotation[, 1], y = prin_comp$rotation[, 2]) +
    geom_text_repel(aes(label = row.names(prin_comp$rotation)))
})
# this plots the 3 plots, one for each pca, but they are un-named
plotList
The problem is that the plots don't have meaningful names/titles. I don't know how to keep that information present, passed from function to function.
I know there must be a more elegant way of doing this. And I have spent a day reading similar and not so similar questions regarding processing multiple csv files. But either they weren't applicable or didn't work for my case.
And as the title of this question implies, I would prefer to do this one CSV at a time, not all 3 at once, as the CSVs in question are very large (over 5 GB each), so keeping every dataframe and PCA in memory at the same time is impossible.
You just need to keep a string you want to use as the title somewhere and add ggtitle(YOUR_TITLE) to your plot, but this is not so easy with your current code. Instead of performing each step of the analysis for each CSV before going to the next step, why don't you just perform all steps for one CSV at a time?
Your code could look like:
library(ggplot2)
library(ggrepel)
csvs <- c("matrix.tsv","bell.tsv","tail.tsv")
for (i in csvs) {
# read file
df <- read.csv(i, sep='\t')
# process file
rownames(df) <- df[,1]
df[,1] <- NULL
df$X = NULL
df = df[, -grep('un', colnames(df))]
# create pca
pca <- prcomp(df, scale = T)
# plot pca
pcaPlot <- qplot(x=pca$rotation[,1], y=pca$rotation[,2]) +
geom_text_repel(aes(label=row.names(pca$rotation))) +
ggtitle(i)
print(pcaPlot)
# extract and plot meaningful variables
# ...
}
Basically, I just took everything you were doing in lapply calls and put it inside a single for loop; this approach also processes one CSV at a time.
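Because the files are over 5 GB each, you may also want to drop the large objects explicitly at the end of each iteration; an optional addition just before the closing brace of the loop:

rm(df, pca)  # discard the large objects before the next file is read
gc()         # prompt R to release the freed memory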
I have written an executable script in R that simply plots a graph given an input file in tab-delimited format. However, the script I wrote is specific to a single file in terms of what to use as x and y. I want this script to be able to plot whatever file I give it. All files I will be using with this script will be in the same format: tab-delimited with 4 headers labelled a, b, c, d. Headers b, c, and d have a different name in each file. My x values for the graph will be the values under header b, and my y values will be the values under header c. How can I plot a graph that uses whatever is under headers b and c?
My script is posted below.
#!/usr/bin/env Rscript
args = commandArgs(trailingOnly = TRUE)
data = read.table(args[1], header = TRUE, fill = TRUE, sep = "\t")
attach(data)
jpeg(args[2])
plot(RPMb, RPMc)
dev.off()
Instead of using attach() (which is almost never recommended), use data frame indexing to extract the relevant variables from your data variable.
#!/usr/bin/env Rscript
args = commandArgs(trailingOnly = TRUE)
data = read.table(args[1], header = TRUE, fill = TRUE, sep = "\t")
jpeg(args[2])
x <- names(data)[2]
y <- names(data)[3]
plot(data[[x]], data[[y]], xlab = x, ylab = y)
dev.off()
You could also just use plot(data[,2],data[,3]) ...
A couple of other details/comments:
it's generally best to avoid naming variables after built-in functions such as data. It will usually work, but occasionally it will bite you.
are you sure you want JPEG output? PNG or PDF is usually better for line graphs, depending on whether you need a raster or a vector format ...
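For instance, here is a sketch of the same script with both suggestions applied (a variable name that doesn't clash with a built-in, and PNG output); it assumes the same input layout as above:

#!/usr/bin/env Rscript
args = commandArgs(trailingOnly = TRUE)
dat = read.table(args[1], header = TRUE, fill = TRUE, sep = "\t")
png(args[2])  # or pdf(args[2]) for vector output
x <- names(dat)[2]
y <- names(dat)[3]
plot(dat[[x]], dat[[y]], xlab = x, ylab = y)
dev.off()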
I am working with some ocean sensors that were deployed at different depths. Each sensor recorded several parameters (time, temperature, oxygen) at a different depth, and each produced an identically formatted file, which I have renamed to 'top.csv', 'mid.csv', 'bot.csv' (for top, middle, bottom).
I currently have only three files, but will eventually have more so I want to set this up iteratively. Optimally I would have something set up such that:
R will import all csv files from a specified directory
It will add a column to each data frame called "depth" with the name of the original file.
rbind them into a single data frame.
I am able to do steps 1 and 3 with the two lines below. The first line gets the file names from a specific directory that match the pattern, while the second uses lapply nested in do.call to read all the files and concatenate them vertically.
files = list.files('./data/', pattern = "\\.csv$")
oxygenData = do.call(rbind, lapply(files, function(x) read.csv(paste0('./data/', x))))
The reason for ending up with a single data frame is to make plotting easier, like so:
ggplot(data = oxygenData, aes(x = time, y = oxygen, group = depth, color = depth)) + geom_line()
Also, would dealing with this kind of data be easier with data.table? Thank you!
You can accomplish this by building your own function:
myFunc <- function(fileName) {
  # read in the file
  temp <- read.csv(paste0("<filePath>/", fileName), as.is = TRUE)
  # record which file each row came from
  temp$fileName <- fileName
  # return the data.frame
  temp
}
Note that you could generalize myFunc by adding a second argument that takes the file path, allowing the directory to be set dynamically. Next, put this into lapply to get a list of data.frames:
myList <- lapply(fileNameVector, myFunc)
Finally, append the data.frames using do.call and rbind:
res <- do.call(rbind, myList)
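As for the data.table question at the end: yes, this pattern is arguably simpler there, since fread() reads each file quickly and rbindlist() can stack a named list while turning the names into the depth column. A minimal sketch, assuming the same './data/' directory as above:

library(data.table)

files <- list.files("./data/", pattern = "\\.csv$", full.names = TRUE)

# read each file into a named list; idcol turns the list names
# ("top", "mid", "bot") into a column called "depth"
dtList <- lapply(files, fread)
names(dtList) <- tools::file_path_sans_ext(basename(files))
oxygenData <- rbindlist(dtList, idcol = "depth")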
I want to use ChemoSpec with mass spectra of about 60,000 data points.
I already have them in one txt file as a matrix (X plus 90 samples = 91 columns; 60,000 rows).
How can I adapt this file for use as spectral data without exporting each single sample again in CSV format (which is quite slow in R given the size of my data)?
The typical (and only?) way to import data into ChemoSpec is via the getManyCsv() function, which, as the question indicates, requires one CSV file per sample.
Creating 90 CSV files from the 91-column, 60,000-row file described may be somewhat slow and tedious in R, but it could be done with a standalone application, whether an existing utility or an ad-hoc script.
An R-only solution would be to create a new function, say getOneBigCsv(), adapted from getManyCsv(). After all, the logic of getManyCsv() is relatively straightforward.
Don't expect such a solution to be sizzling fast, but it should in any case be comparable to the time it takes to run getManyCsv(), and it avoids having to create and manage the many files, so overall it is faster and certainly less messy.
Sorry I missed your question 2 days ago. I'm the author of ChemoSpec - always feel free to write directly to me in addition to posting somewhere.
The solution is straightforward. You already have your data in a matrix (after reading it in with read.csv("file.txt")), so you can use it to manually create a Spectra object. In the R console, type ?Spectra to see the structure of a Spectra object, which is a list with specific entries. You will need to put your X column (which I assume is mass) into the freq slot. The rest of the data matrix then goes into the data slot. Then manually create the other needed entries (making sure the data types are correct). Finally, assign the Spectra class to your completed list with something like class(my.spectra) <- "Spectra" and you should be good to go. I can give you more details on or off list if you describe your data a bit more fully. Perhaps you have already solved the problem?
By the way, ChemoSpec is totally untested with MS data, but I'd love to find out how it works for you. There may be some changes that would be helpful so I hope you'll send me feedback.
Good Luck, and let me know how else I can help.
Many years have passed and I am not sure whether anybody is still interested in this topic. But I had the same problem and wrote a little workaround to convert my data to class 'Spectra' by extracting the information from the data itself:
# Assumption:
# data is stored as a numeric data.frame whose column names identify the samples
# and whose row names hold the domain axis (e.g. frequency)
dataframe2Spectra <- function(Spectrum_df,
                              freq = as.numeric(rownames(Spectrum_df)),
                              data = as.matrix(t(Spectrum_df)),
                              names = paste("YourFileDescription", 1:dim(Spectrum_df)[2]),
                              groups = rep(factor("Factor"), dim(Spectrum_df)[2]),
                              colors = rainbow(dim(Spectrum_df)[2]),
                              sym = 1:dim(Spectrum_df)[2],
                              alt.sym = letters[1:dim(Spectrum_df)[2]],
                              unit = c("a.u.", "Domain"),
                              desc = "Some signal. Describe it with 'desc'") {

  # build an empty list with the entries a Spectra object expects
  features <- c("freq", "data", "names", "groups", "colors", "sym",
                "alt.sym", "unit", "desc")
  Spectrum_chem <- vector("list", length(features))
  names(Spectrum_chem) <- features

  Spectrum_chem$freq    <- freq
  Spectrum_chem$data    <- data
  Spectrum_chem$names   <- names
  Spectrum_chem$groups  <- groups
  Spectrum_chem$colors  <- colors
  Spectrum_chem$sym     <- sym
  Spectrum_chem$alt.sym <- alt.sym
  Spectrum_chem$unit    <- unit
  Spectrum_chem$desc    <- desc

  # important step: assign the class
  class(Spectrum_chem) <- "Spectra"

  # some sanity checks
  if (length(freq) != dim(data)[2]) warning("Dimension of data is NOT #samples x length of freq")
  if (length(names)  > dim(data)[1]) warning("Too many names")
  if (length(names)  < dim(data)[1]) warning("Too few names")
  if (length(groups) > dim(data)[1]) warning("Too many groups")
  if (length(groups) < dim(data)[1]) warning("Too few groups")
  if (length(colors) > dim(data)[1]) warning("Too many colors")
  if (length(colors) < dim(data)[1]) warning("Too few colors")
  if (!is.matrix(data)) warning("'data' is not a matrix or it's not numeric")

  return(Spectrum_chem)
}
Spectrum_chem <- dataframe2Spectra(Spectrum)
chkSpectra(Spectrum_chem)