I have set up some simple ggplot2 code that imports data from multiple .csv files and overlays their trendlines. All of the .csv files are in the same folder and have the same format. I have tried batch importing, but I think ggplot2 requires data frames, and I have only been able to produce a list of elements. Is there a way to import all of the .csv files from the folder and plot all of them in ggplot2 without copying this code hundreds of times?
Thank you for your help!
library(ggplot2)
points1 <- read.csv("http://drive.google.com")[1:10, 1:2]
points2 <- read.csv("http://drive.google.com")[1:10, 1:2]
g <- ggplot(points1, aes(x = ALPHA, y = BETA)) +
  labs(title = "Model Run", subtitle = "run4", y = "LabelY", x = "LabelX", caption = "run4") +
  coord_cartesian(xlim = c(0, 10), ylim = c(0, 11)) +
  # geom_point(data = points1) +
  geom_smooth(method = "loess", span = 0.8, data = points1, se = FALSE) +
  # geom_point(data = points2) +
  geom_smooth(method = "loess", span = 0.8, data = points2, se = FALSE)
print(g)
This is a fun one. I am using some packages from the tidyverse (ggplot2, purrr, readr) to keep things consistent.
Since you want to plot all the data in one plot, it makes sense to put all of it into one data frame. The function purrr::map_df is perfect for this.
library(tidyverse)
files <- list.files("data/", pattern = "*.csv", full.names = TRUE)
names(files) <- list.files("data/", pattern = "*.csv")
df <- map_df(files, ~ read_csv(.), .id = "origin")
df %>%
  ggplot() +
  aes(x, y, color = origin) +
  geom_point()
A few explanations
The first two lines create a named vector whose elements are the full paths to the CSV files and whose names are the filenames. This makes it easier to use the .id argument of map_df, which creates an additional column named "origin" from the filenames. The notation inside map might seem a little weird at first; you could also supply a function written elsewhere to apply to each element, but the ~ symbol is pretty handy: it creates a function right there, and that function always takes the argument . as the current element of the vector or list you are iterating over.
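To make the shorthand concrete, here is a minimal sketch that spells the anonymous function out explicitly and adds the loess overlay from the original question (read_one is a hypothetical name, and I'm assuming each CSV has columns named x and y, as above):
library(tidyverse)
files <- list.files("data/", pattern = "*.csv", full.names = TRUE)
names(files) <- basename(files)
# the formula ~ read_csv(.) is shorthand for an explicit function like this
read_one <- function(path) read_csv(path)
df <- map_df(files, read_one, .id = "origin")
# one loess trendline per source file, as in the original question
ggplot(df, aes(x, y, color = origin)) +
  geom_smooth(method = "loess", span = 0.8, se = FALSE)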
Related
I am trying to understand purrr, and how to map/walk over a list of images and save them to files. Below is my code that works using a for loop, but how would this be structured using purrr? I am confused by the various versions (walk, walk2, pwalk, map, map2, pmap etc.)
library(magick)
library(purrr)
#create a list of the files
inpath <- "C:\\Path\\To\\Images"
file_list <- list.files(path = inpath, full.names = TRUE)
# read the files and negate them
imgn <- map(file_list, image_read) %>%
map(image_negate)
# assign list names as original file names
names(imgn) = list.files(path = inpath)
# how to use walk, map, map2? walk2, pwalk? to do this
for (i in 1:length(imgn)) {
image_write(imgn[[i]], path = names(imgn)[[i]])
}
Using Map from base R
Map(function(x, y) image_write(x, path = y), imgn, file_list)
If I'm correct in understanding your code, it looks like you're trying to save your edited images to their original file paths. If so, could you replace your for loop with:
map2(imgn, file_list, ~ image_write(.x, path = .y))
As an explanation, you want to use map2 because you're applying a function with two inputs: the image you're saving (stored in imgn) and the file path you're writing it to (stored in file_list). You can then use formula notation to specify the function and arguments you'd like to map, as above (more on this in the map docs).
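One small addition: since image_write is called purely for its side effect of writing files, walk2 is arguably an even better fit than map2 here. It takes the same arguments but returns its input invisibly, so the console isn't flooded with return values. A minimal sketch:
library(purrr)
library(magick)
# same as the map2 call above, but the return values are discarded
walk2(imgn, file_list, ~ image_write(.x, path = .y))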
My pipeline reads a CSV into a data frame, assigns row names, removes a column, performs a PCA, plots the PCA, and extracts the meaningful variables from the PCA, which are also plotted.
Here is my current code, which only goes as far as the first plot:
library(ggplot2)
library(ggrepel)
tsv = read.csv('matrix.tsv', sep='\t')
bell= read.csv('bell.tsv', sep='\t')
tail= read.csv('tail.tsv', sep='\t')
dfList = list(tail, tsv, bell)
#process csv's
dfList = lapply(dfList, function(dum){
rownames(dum) = dum[,1]
dum[,1] = NULL
dum$X = NULL
dum = dum[, -grep('un', colnames(dum))]
})
#create pca's of dataframes
pcaList = lapply(dfList, function(pca){
prin_comp = prcomp(pca, scale. = T)
})
#plot top 2 principal components in the pca
plotList = lapply(pcaList, function(prin_comp){
t = qplot(x=prin_comp$rotation[,1], y=prin_comp$rotation[,2]) + geom_text_repel(aes(label=row.names(prin_comp$rotation)))
})
#this plots the 3 plots, one for each pca, but they are un-named
plotList
The problem is that the plots don't have meaningful names/titles. I don't know how to keep that information present, passed from function to function.
I know there must be a more elegant way of doing this. And I have spent a day reading similar and not so similar questions regarding processing multiple csv files. But either they weren't applicable or didn't work for my case.
And as the title of this question implies, I would prefer to do this one CSV at a time, not all 3 at once, as the CSVs in question are very large, over 5GB each, so keeping every data frame and PCA in memory at the same time is impossible.
You just need to keep a string you want to use as the title somewhere and add ggtitle(YOUR_TITLE) to your plot, but this is not so easy with your current code. Instead of performing each step of the analysis for each CSV before going to the next step, why don't you just perform all steps for one CSV at a time?
Your code could look like:
library(ggplot2)
library(ggrepel)
csvs <- c("matrix.tsv","bell.tsv","tail.tsv")
for (i in csvs) {
# read file
df <- read.csv(i, sep='\t')
# process file
rownames(df) <- df[,1]
df[,1] <- NULL
df$X <- NULL
df <- df[, -grep('un', colnames(df))]
# create pca
pca <- prcomp(df, scale. = TRUE)
# plot pca
pcaPlot <- qplot(x=pca$rotation[,1], y=pca$rotation[,2]) +
geom_text_repel(aes(label=row.names(pca$rotation))) +
ggtitle(i)
print(pcaPlot)
# extract and plot meaningful variables
# ...
}
Basically I just put everything you did inside lapply calls into a for loop; this approach also processes one CSV at a time.
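One optional addition, given that the files are over 5GB each: explicitly drop the large objects and trigger garbage collection at the end of every iteration, so the memory from one CSV is released before the next one is read. The same loop with that change sketched in:
for (i in csvs) {
  df <- read.csv(i, sep = '\t')
  rownames(df) <- df[, 1]
  df[, 1] <- NULL
  df$X <- NULL
  df <- df[, -grep('un', colnames(df))]
  pca <- prcomp(df, scale. = TRUE)
  print(qplot(x = pca$rotation[, 1], y = pca$rotation[, 2]) +
          geom_text_repel(aes(label = row.names(pca$rotation))) +
          ggtitle(i))
  # free the big objects before the next file is read
  rm(df, pca)
  gc()
}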
I am working with some ocean sensors that were deployed at different depths. Each sensor recorded several parameters (time, temperature, oxygen) at a different depth, and each output an identically formatted file, which I have renamed to 'top.csv', 'mid.csv', 'bot.csv' (for top, middle, bottom).
I currently have only three files, but will eventually have more so I want to set this up iteratively. Optimally I would have something set up such that:
R will import all csv files from a specified directory
It will add a column to each data frame called "depth" with the name of the original file.
rbind them into a single data frame.
I am able to do steps 1 and 3 with the two lines below. The first line gets the file names from a specific directory that match the pattern, while the second uses lapply nested in do.call to read all the files and vertically concatenate them.
files = list.files('./data/', pattern = "*.csv")
oxygenData = do.call(rbind, lapply(files, function(x) read.csv(paste0('./data/', x))))
The justification to end up with a single data file is to plot them easier, as such:
ggplot(data = oxygenData, aes(x = time, y = oxygen, group = depth, color = depth))+geom_line()
Also, would dealing with this kind of data be easier with data.table? Thank you!
You can accomplish this by building your own function:
myFunc <- function(fileName) {
# read in file
temp <- read.csv(paste0("<filePath>/", fileName), as.is=TRUE)
# assign file name
temp$fileName <- fileName
# return data.frame
temp
}
Note that you could generalize myFunc by adding a second argument that takes the file path, allowing the directory to be set dynamically. Next, put this into lapply to get a list of data.frames:
# fileNameVector would be, e.g., the result of list.files("<filePath>")
myList <- lapply(fileNameVector, myFunc)
Finally, append the files using do.call and rbind.
res <- do.call(rbind, myList)
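Regarding your data.table question: this particular pattern is quite compact there too. A minimal sketch, assuming the same ./data/ directory as in your question; fread replaces read.csv, and rbindlist's idcol argument turns the list names into the "depth" column you wanted:
library(data.table)
files <- list.files("./data/", pattern = "*.csv", full.names = TRUE)
names(files) <- basename(files)
# one data.table per file, stacked, with the file name recorded in "depth"
oxygenData <- rbindlist(lapply(files, fread), idcol = "depth")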
I have many (more than 100) CSV files with the same table structure: in every file the headers are in row 4, there are 6 columns, and the data run from row 5 to row 400001.
I need to plot these data in a scatter plot in which x is the first column (40001 time units) and the other columns are the Ys for different variables. Preferably I would be able to format the plot (colors, ranges, titles, legends, ...), automatically input these CSV files, and export PNG or PDF or anything else that might be useful. I have both Excel and R, but I don't know how to do this plotting in an efficient manner. (Naming is also important; the plots should have the names of their CSV files.)
Any idea how I can do this with less effort?
Thanks
Your question is a bit light on specific detail, so I'm going to make some assumptions to get started on a kind of skeleton of an answer.
Let's make some fake CSV files as example data
Set working directory to folder containing data...
setwd("C:/my-csv-files")
Make 100 data frames of six columns by 500 rows (to keep things quick)...
df <- lapply(1:100, function(i) data.frame(cbind(1:500, matrix(sample(1000), 500, 5))))
Make 100 csv files from these data frames in the working directory...
lapply(1:length(df), function(i) write.csv(df[[i]],file=paste("df",i,"csv",sep=".")))
Now we can reproduce your problem and quickly read many CSV files into R like so...
# create a list of all CSV files in all the folders
files <- (dir("C:/my-csv-files", recursive=TRUE, full.names=TRUE, pattern="\\.(csv|CSV)$"))
# read in the CSV files and add the filename of each file as a column to
# each dataset so we can trace back dodgy data
# so, create a function to read the CSV and get filenames
read.tables <- function(file.names, ...) {
require(plyr)
ldply(file.names, function(fn) data.frame(Filename=fn, read.csv(fn, ...)),.progress = 'text')
}
# execute function to read in data from each CSV, including file names of file that data comes from
mydata <- read.tables(files, stringsAsFactors = FALSE)
Now plot the data; you say you just want one plot of all the data in the CSV files...
Melt the data into a format suitable for plotting; here X1 is your time variable and X2 to X6 are the other variables in your CSV files
require(reshape2)
dat <- melt(mydata, id.vars = c("X1"), measure.vars = c("X2", "X3", "X4", "X5", "X6"))
And here's a single scatter plot of your time variable by the other variables (colour-coded). It's just not clear from your question exactly what you want to plot, so do ask another question with more details.
require(ggplot2)
ggplot(dat, aes(X1, value)) +
geom_point(aes(colour = factor(variable)))
Now, save it as a PDF or PNG, see ?ggsave for the numerous options here...
ggsave(file="myplot.pdf")
ggsave(file="myplot.png")
Find the location of those files
getwd()
To make one plot per CSV file, here's one method:
listcsvs <- lapply(files,function(i) read.csv(i, stringsAsFactors = FALSE))
names(listcsvs) <- files
require(reshape2)
require(ggplot2)
for (i in seq_along(files)) {
tmp <- melt(listcsvs[[i]], id.vars = "X1", measure.vars = c("X2", "X3", "X4", "X5", "X6"))
print(ggplot(tmp,aes(X1, value)) +
geom_point(aes(colour = factor(variable))) +
ggtitle(names(listcsvs[i]))
)
}
If you are using RStudio you can scroll through the plots and Export the ones you want to save them as a PDF or PNG.
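If you would rather have the exports happen automatically than click through RStudio's Export button, you can add a ggsave() call inside the loop above. A minimal sketch; the sub() call that derives each PNG name from its CSV name is my assumption about the naming you want:
for (i in seq_along(files)) {
  tmp <- melt(listcsvs[[i]], id.vars = "X1", measure.vars = c("X2", "X3", "X4", "X5", "X6"))
  p <- ggplot(tmp, aes(X1, value)) +
    geom_point(aes(colour = factor(variable))) +
    ggtitle(names(listcsvs[i]))
  # e.g. "df.1.csv" becomes "df.1.png", saved next to the source file
  ggsave(filename = sub("\\.csv$", ".png", files[i], ignore.case = TRUE), plot = p)
}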
So that's covered the main parts of your question:
Read in a large amount of CSV files into R
Plot the data as one scatter plot displaying several variables against one variable
Plot data as one scatter plot per CSV file
Save the plots as a PDF or PNG file
And as a bonus you've got code for creating example data which you can use in your future questions. In general, the better the quality of your example data, the better quality answers you'll get (as Thomas suggests in his comment).
I have several data files (numeric) with around 150000 rows and 25 columns. I was previously using gnuplot (where the length of the script grows in proportion to the number of plot objects) to plot the data, but as I now have to do some additional analysis with it, I moved to R and ggplot2.
How to organize the data, though? Is one big data.frame with an additional column to mark which file the data comes from really the only option? Or is there some way around that?
Edit: To be a bit more precise, I'll give as an example in what form I have the data now:
filelst = c("filea.dat", "fileb.dat", "filec.dat")
dat = list()
for (i in 1:length(filelst)) {
dat[[i]] = read.table(filelst[i])
}
Assuming you have filenames ending with ".dat", here's a mockup example of the strategies proposed by Chase,
require(plyr)
# list the files
lf = list.files(pattern = "\\.dat$")
str(lf)
# 1. read the files into a data.frame
d = ldply(lf, read.table, header = TRUE, skip = 1) # or whatever options to read
str(d) # should contain all the data, and an ID column called L1
# use the data, e.g. plot
pdf("all.pdf")
d_ply(d, "L1", plot, t="l")
dev.off()
# or using ggplot2
ggplot(d, aes(x, y, colour=L1)) + geom_line()
# 2. read the files into a list
ld = lapply(lf, read.table, header = TRUE, skip = 1) # or whatever options to read
names(ld) = gsub("\\.dat$", "", lf) # strip the file extension
str(ld)
# use the data, e.g. plot
pdf("all2.pdf")
lapply(names(ld), function(ii) plot(ld[[ii]], main = ii, t = "l"))
dev.off()
# 3. is not fun
Your question is a little vague. If I followed along properly, I think you have three main options:
Do as you suggest and then use any one of the "split-apply-combine" functions that exist in R to conduct your analyses by group. These include by, aggregate, ave, and the functions in the plyr and data.table packages, among many others.
Store your data objects as separate elements in a list(). Then use lapply() and friends to work on them (see the sketch after this list).
Keep everything separate in different data objects and work on them individually. This is probably the most inefficient way to go about doing things, unless you have memory constraints or the like.
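For example, a minimal sketch of option 2, reusing the filelst from your edit (colMeans stands in for whatever per-file analysis you actually need):
filelst <- c("filea.dat", "fileb.dat", "filec.dat")
# option 2: a named list with one data frame per file
dat <- lapply(filelst, read.table)
names(dat) <- filelst
# run the same analysis on every element, e.g. per-file column means
lapply(dat, colMeans)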