I have two dataframes I'd like to plot against each other:
> df1 <- data.frame(HV = c(3,3,3), NAtlantic850t = c(0.501, 1.373, 1.88), AO = c(-0.0512, 0.2892, 0.0664))
> df2 <- data.frame(HV = c(3,3,2), NAtlantic850t = c(1.2384, 1.3637, -0.0332), AO = c(-0.5915, -0.0596, -0.8842))
They have the same structure, and I'd like to plot them column against column (e.g. df1$HV vs. df2$HV): loop through the data frame columns and plot each pair in a scatter plot.
I've looked through 20+ questions asking similar things and can't figure it out - would appreciate some help on where to start. Can I use lapply and plot or ggplot when they're two DFs? Should I merge them first?
As you suggest, I would indeed first rearrange into a list of plottable data frames before calling the plot command. I think that would especially be the way to go if you want to feed the data argument into ggplot. Something like:
plot_dfs <- lapply(names(df1),function(nm)data.frame(col1 = df1[,nm], col2 = df2[,nm]))
for (df in plot_dfs)plot(x = df[,"col1"], y = df[,"col2"])
or using ggplot:
for (df in plot_dfs){
print(
ggplot(data = df, aes(x=col1, y=col2)) +
geom_point())}
and if you want to add the column names as plot titles, you can do:
for (idx in seq_along(plot_dfs)){
print(
ggplot(data = plot_dfs[[idx]], aes(x=col1, y=col2)) +
ggtitle(names(df1)[idx]) +
geom_point())}
You can loop through the columns like this:
for(col in 1:ncol(df1)){
plot(df1[,col], df2[,col])
}
Make sure that both data frames have the same number of columns (and that the columns are in the same order) before running this.
Here’s one way to do it — loop over the column indices and create the plots one by one, adding them to a list and writing each one to a file:
library(ggplot2)
# create some data to plot
df1 <- iris[, sapply(iris, is.numeric)]
df2 <- iris[sample(1:nrow(iris)), sapply(iris, is.numeric)]
# a list to catch each plot object
plot_list <- vector(mode="list", length=ncol(df1))
for (idx in seq_along(df1)){
plot_list[[idx]] <- ggplot2::qplot(df1[[idx]], df2[[idx]]) +
labs(title=names(df1)[idx])
ggsave(filename=paste0(names(df1)[idx], ".pdf"), plot=plot_list[[idx]])
}
As you suggest in the question, you can also use s/lapply() with an anonymous function, e.g. like this (though here we're not storing the plots, just writing each one to disk):
lapply(seq_along(df1), function(idx){
the_plot <- ggplot2::qplot(df1[[idx]], df2[[idx]]) + labs(title=names(df1)[idx])
ggsave(filename=paste0(names(df1)[idx], ".pdf"), plot=the_plot)
})
If you want to keep the list of plots (as in the for-loop example), just assign the result of lapply() to a variable (e.g. plot_list) and add a line like return(the_plot) before closing the function.
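A minimal sketch of that variant, assuming ggplot2 is installed. It uses ggplot() with geom_point() in place of qplot() (a swap made here, not part of the original answer), and the seed and the names plot_data and nm are illustrative only:

```r
library(ggplot2)

# Same kind of toy data as above: numeric iris columns and a shuffled copy
df1 <- iris[, sapply(iris, is.numeric)]
set.seed(1)  # hypothetical seed, only to make the shuffle repeatable
df2 <- df1[sample(nrow(df1)), ]

# lapply() returns the value of the last expression, i.e. the plot object
plot_list <- lapply(names(df1), function(nm) {
  plot_data <- data.frame(x = df1[[nm]], y = df2[[nm]])
  ggplot(plot_data, aes(x = x, y = y)) +
    geom_point() +
    labs(title = nm)
})
names(plot_list) <- names(df1)

# Display one stored plot by column name
print(plot_list[["Sepal.Length"]])
```

Naming the list elements after the columns makes it easy to pull out a single plot later.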
There's tons of ways you could modify/adapt this approach, depending on what your objectives are.
Hope this helps ~~
p.s. if it's possible the columns won't be in the same order, it is better to loop over column names instead of column indices (i.e. use for (colname in names(df1)){... instead of for (idx in seq_along(df1)){...). You can use the same [[ subsetting syntax with both names and indices.
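A small base R sketch of looping over names rather than indices, using the data from the question with the columns of df2 deliberately reordered. The variable shared_cols is a name made up for this example:

```r
# Data from the question; df2's columns are deliberately in a different order
df1 <- data.frame(HV = c(3, 3, 3),
                  NAtlantic850t = c(0.501, 1.373, 1.88),
                  AO = c(-0.0512, 0.2892, 0.0664))
df2 <- data.frame(AO = c(-0.5915, -0.0596, -0.8842),
                  HV = c(3, 3, 2),
                  NAtlantic850t = c(1.2384, 1.3637, -0.0332))

# Only plot columns present in both data frames (order follows df1)
shared_cols <- intersect(names(df1), names(df2))
stopifnot(nrow(df1) == nrow(df2))

for (colname in shared_cols) {
  # [[ works with a name just as it does with an index
  plot(df1[[colname]], df2[[colname]],
       xlab = paste("df1:", colname),
       ylab = paste("df2:", colname),
       main = colname)
}
```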
Related
I have a wide data frame consisting of 1000 rows and over 300 columns. The first 2 columns are GroupID and Categorical fields. The remaining columns are all continuous numeric measurements. What I would like to do is loop through a specific range of these columns in R, beginning with the first numeric column (column #3). For example, loop through columns 3:10. I would also like to retain the column names in the loop. I've started with the following code:
for(i in 3:ncol(df)){
print(i)
}
But this includes all columns to the right of column #3 (not the range 3:10), and this does not identify column names. Can anyone help get me started on this loop so I can specify the column range and also retain column names? TIA!
Side Note: I've used tidyr to gather the data frame in long format. That works, but I've found it makes my data frame very large and therefore eats a lot of time and memory in my loop.
Since you did not include your data, I created similar dummy data (1000 rows and 302 columns, 2 of them ID variables) to show you how to select columns and prepare them for plotting:
library(reshape2)
library(ggplot2)
set.seed(123)
#Dummy data
Numvars <- as.data.frame(matrix(rnorm(1000*300),nrow = 1000,ncol = 300))
vec1 <- 1:1000
vec2 <- rep(paste0('class',1:5),200)
IDs <- data.frame(vec1,vec2,stringsAsFactors = F)
#Bind data
Data <- cbind(IDs,Numvars)
#Select vars (in your case 10 initial vars)
df <- Data[,1:12]
#Prepare for plot
df.melted <- melt(data = df,id.vars = c('vec1','vec2'))
#Plot
ggplot(df.melted,aes(x=vec1,y=value,group=variable,color=variable))+
geom_line()+
facet_wrap(~vec2)
You will end up with a plot like this:
I hope this helps.
You can keep column names by feeding them into an lapply function, here's an example with the iris dataset:
library(ggplot2)
lapply(names(iris)[2:4], function(columntoplot){
  df <- data.frame(datatoplot = iris[[columntoplot]])
  graphname <- columntoplot
  # keep the plot in a variable so ggsave() does not have to rely on last_plot()
  p <- ggplot(df, aes(x = datatoplot)) +
    geom_histogram() +
    ggtitle(graphname)
  ggsave(filename = paste0(graphname, ".png"), plot = p, width = 4, height = 4)
})
In the lapply function, you create a new dataset comprising one column (note the double brackets). You can then plot and optionally save the output within the function (see ggsave line). You're then able to use the column name as the plot title as well as the file name.
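For the original question of looping over a fixed column range while keeping the names, a base R sketch could look like the following. The data frame here is a hypothetical stand-in (2 ID columns plus 20 numeric columns), and computing the mean is only a placeholder for whatever the loop body should do:

```r
# Hypothetical stand-in for the wide data frame: 2 ID columns + 20 numeric columns
set.seed(42)
df <- data.frame(GroupID = 1:50,
                 Category = rep(c("a", "b"), 25),
                 matrix(rnorm(50 * 20), nrow = 50))

# Names of columns 3 to 10 only, not everything to the right of column 3
cols_to_use <- names(df)[3:10]

col_means <- numeric(0)
for (nm in cols_to_use) {
  # nm is the column name; df[[nm]] is the column itself
  col_means[nm] <- mean(df[[nm]])
}
print(col_means)
```

Iterating over the names instead of the positions gives both the name and the data in each pass, without reshaping the data frame to long format.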
I have a list of dataframes where each column in a df holds the values of a function evaluated on a different numeric vector, all of the same length.
Each list object(dataframe) is generated with a different function
I would like to iterate through each list object (dataframe) to
1. Generate a plot for each list object(dataframe), with columns as data series.
2. Generate a new list of new dataframes which contains a column for each column mean from the original dataframe
The below code is functional, but is there a better way to use apply statements and avoid the for loop?
plots <- list()
trait.estimate <- list()
for(i in 1:length(component.estimation)) { #outer loop start
component.estimation[[i]]$hr <- hr #add hr vector to end of dataframe
temporary.df <- melt(component.estimation[[i]] , id.vars = 'hr', variable.name = 'treatment')
#Store a plot of each df
plots[[i]] <- ggplot(temporary.df, aes(hr , value), group = treatment, colour = treatment, fill = treatment) +
geom_point(aes(colour = treatment, fill = treatment))+
geom_line(aes(colour= treatment, linetype = treatment))+
ggtitle( names(component.estimation)[i])+ #title to correspond to trait
theme_classic()
#Generate column averages for each df
trait.estimate[[i]] <- apply(component.estimation[[i]] ,2, mean)
trait.estimate[[i]] <- as.data.frame(trait.estimate[[i]])
trait.estimate[[i]]$treatment <- row.names(trait.estimate[[i]])
} #outer loop close
Your for loop looks fine to me, I wouldn't worry about transitioning to lapply. Personally, I think lapply is great when you want to do something simple, but when you want something more complicated, a for loop can be just as readable.
The only real change I'd make is to use colMeans rather than apply(., 2, mean). I also might break apart the trait.estimate part and the plotting part as they seem wholly separate operations. Seems nicer organizationally.
As an example, pulling out the trait.estimate calculations would look like this:
# inside for loop version
trait.estimate[[i]] <- colMeans(component.estimation[[i]])
trait.estimate[[i]] <- as.data.frame(trait.estimate[[i]])
trait.estimate[[i]]$treatment <- row.names(trait.estimate[[i]])
# outside for loop lapply version
trait.estimate = lapply(component.estimation, colMeans)
trait.estimate = lapply(trait.estimate, as.data.frame)
# return x explicitly; an assignment alone would return only the row names
trait.estimate = lapply(trait.estimate, function(x) { x$treatment = row.names(x); x })
# all in one lapply version with anonymous function
trait.estimate = lapply(component.estimation, function(x) {
means = colMeans(x)
means = as.data.frame(means)
means$treatment = row.names(means)
return(means)
})
Which is better? I'll leave that to you to decide. Use whichever you prefer.
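To try the all-in-one version without the original data, here is a sketch on a hypothetical stand-in for component.estimation (the trait and treatment names are invented). It builds each result data frame directly from the named vector that colMeans() returns:

```r
# Hypothetical stand-in: a named list of numeric data frames
component.estimation <- list(
  traitA = data.frame(ctrl = c(1, 2, 3), trt = c(4, 5, 6)),
  traitB = data.frame(ctrl = c(2, 4, 6), trt = c(1, 1, 1))
)

trait.estimate <- lapply(component.estimation, function(x) {
  means <- colMeans(x)  # named numeric vector, one mean per column
  data.frame(treatment = names(means),
             mean = unname(means),
             stringsAsFactors = FALSE)
})

print(trait.estimate$traitA)
```

Because lapply() keeps the names of its input list, the results stay labelled by trait.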
I'm new to R and want to automate plotting of my data. I wrote a small script for this that should do the following things:
1) Iterates over csv-files in a given path and stores data in a list of dataframes
2) Iterates over this list, uses ggplot to plot every single dataframe within this list and stores result in variable.
#loading libraries#
library(ggplot2)
library(forcats)
library(magrittr)
#setting wd#
#creating list#
#listing files in given path#
setwd("/path/to/csvfiles")
mylist <- list()
filenames <- list.files(path=getwd())
#iterating over csv-files#
#storing input in variables (var_1,...var_n) and in "mylist"#
#change decimal comma to dot#
for (i in (1:length(filenames))){
mylist[[i]] <- assign(paste("var",i, sep="_"), read.csv(filenames[i])[ ,1:2]) %>%
sapply(gsub, pattern=",", replacement=".") %>%
as.data.frame()
}
#Iterate over list, plotting single element of list and save in variables (plot_1,...,plot_n) in order to later plot them in a given frame (e.g. by using multiplot)#
for (k in 1:length(mylist)){
mylist[[k]]$value <- as.numeric(as.character(mylist[[k]]$value))
assign(paste("plot",k, sep="_"),
((ggplot(mylist[[k]], aes(x=mylist[[k]]$column1, y=mylist[[k]]$value))) +
aes(x = fct_inorder(column1)) +
ggtitle(paste("plot",k,sep="")) +
geom_boxplot(outlier.shape = NA)))
}
I end up with the actual number of plots I expect from n different files (--> n plots). All of them have the corresponding title (e.g. variable plot_1 also has the title "plot_1", ...).
However, all of the plots look the same. I checked in detail and found that they all show the data of the last element of the list, which suggests that the other elements are not plotted and saved properly into the variables (although iterating over the title obviously works...). When I plot a given element of the list manually, it works perfectly, but within the loop the problems described above appear.
Probably it's something really obvious but I can't figure it out.
Thank you for your help!
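One likely explanation (not confirmed in this thread) is that expressions inside aes() are evaluated only when a plot is drawn. With aes(x = mylist[[k]]$column1, ...), every stored plot then looks up k after the loop has finished, so all of them show the last element, while the title is computed immediately and stays correct. A sketch of a possible fix, assuming ggplot2 is installed and using invented stand-in data, refers to the columns by bare name and keeps the plots in a list:

```r
library(ggplot2)

# Hypothetical stand-in for the list read from the CSV files
set.seed(1)
mylist <- lapply(1:3, function(i) {
  data.frame(column1 = rep(c("a", "b"), each = 10),
             value = rnorm(20, mean = i * 10))
})

plot_list <- vector("list", length(mylist))
for (k in seq_along(mylist)) {
  # Bare column names are looked up in the data of this plot,
  # so nothing depends on the value of k when the plot is drawn
  plot_list[[k]] <- ggplot(mylist[[k]], aes(x = column1, y = value)) +
    geom_boxplot(outlier.shape = NA) +
    ggtitle(paste("plot", k, sep = "_"))
}

print(plot_list[[1]])
```

Storing the plots in a list rather than in variables created with assign() also makes it simpler to pass them on to a function that arranges several plots.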
I'd like to plot multiple lines with a varying number of points per line, with different colors using ggplot2. My MWE is given by
test <- list()
length(test) <- 10
for(i in 1:10){
test[[i]] <- rnorm(100 - i) # Note - different number of points per line!!!!
}
Note that the vectors in the list have different lengths, so it is not possible to convert the list directly into a data.frame.
So this gets you what you want, I think. Note that it works on your list that has a different number of points per vector, which of course is one main reason why one would use a list instead of a dataframe.
Most if not all of the examples on SO for this scenario are working with dataframes instead of data in lists. Since the vectors have different lengths, links that address this by melting a dataframe to a long form do not apply.
However if you did happen to have a dataframe, which implies a set of vectors of the same length, then you could use melt. However using gather from tidyr would probably be a more modern idiom for this than melt from reshape2. Note that melt can also be used on lists, although I would have to research how it handles the id.
I also chose not to use a function from the lapply family because I wanted to emphasize the "wide data" to "long data" aspect, something I think a for loop does far better than lapply, which beginning users can find mysterious.
Anyway we should probably be using something from purrr now as that is a modern type-stable functional library.
Here is some code - using a for loop, so not the most compact, but unrolled to make it easy and quick to understand:
library(ggplot2)
test <- list()
length(test) <- 10
for(i in 1:10){
test[[i]] <- rnorm(100 - i)
}
# Convert data to long form
df <- NULL
for(i in 1:10){
ydat <- test[[i]]
ndf <- data.frame(key=paste0("id",i),x=1:length(ydat),y=ydat)
df <- rbind(df,ndf)
}
# plot it
ggplot(df) + geom_line(aes(x=x,y=y,color=key))
Yielding:
As already pointed out by Mike Wise in his accepted answer, ggplot2 requires a data.frame as input, preferably in long format.
However, both question and accepted answer used for loops although R has neat functions. To create the test data set, the following "one-liner" can be used:
set.seed(1234L) # required to ensure reproducible data
test <- lapply(100L - 1:10, rnorm)
instead of
test <- list()
length(test) <- 10
for(i in 1:10){
test[[i]] <- rnorm(100 - i)
}
Note the use of set.seed() to ensure reproducible random data.
To reshape test from wide to long form, the whole list is turned into a data.frame at once using unlist(), adding the additional columns as required:
df <- data.frame(
id = rep(seq_along(test), lengths(test)),
x = sequence(lengths(test)),
y = unlist(test)
)
instead of turning each list element into a separate small data.frame and incrementally appending the pieces to a target data.frame using a for loop.
The plot is then created by
library(ggplot2)
ggplot(df) + geom_line(aes(x = x, y = y, color = as.factor(id)))
Alternatively, the melt() function from reshape2 has a method for lists (data.table is used here only for setDT() and rowid()):
library(data.table)
long <- reshape2::melt(test) # one row per value; L1 holds the list index
setDT(long)[, rn := rowid(L1)] # add row numbers for each group
ggplot(long) + aes(x = rn, y = value, color = as.factor(L1)) + geom_line()
As there were some remarks about the for loops, here is an alternate and more sophisticated approach in a modern idiom (i.e. purrr from the tidyverse).
1. Creates an id vector as a factor (ids) so as to avoid warnings about combining levels later.
2. Sets up a function (mkdf) to make a data frame from an id variable and a vector of data.
3. Uses map2 from purrr to merge ids and the original data list with mkdf.
4. Uses bind_rows from dplyr to merge the resulting list of data frames into one.
5. Plots it.
The code:
library(purrr)   # map2
library(dplyr)   # bind_rows and the %>% pipe
library(ggplot2)
# dummy up some wide data (but of different lengths) in a **list** of curves
test <- list()
for(i in 1:5){
test[[i]] <- rnorm(10 - i)
}
# helper data (could do inline, but it would be harder to read)
ids <- as.factor(sprintf("id-%d",1:length(test))) # curve ids as factors
mkdf <- function(x,y) data.frame(xx=1:length(x),yy=x,key=y) # makes into dataframe
df <- test %>% map2(ids,mkdf) %>% bind_rows() #single pipe using purrr and dplyr
# plot it
ggplot(df) + geom_line(aes(x=xx,y=yy,color=key))
The resulting plot (I reduced the data sizes to make it easier to see):
I have a dataframe (NBA_Data) that is 27,538 by 29. One of the columns is called "month", and there are nine months (October through June). I would like to write a function that automatically subsets this data frame by month and then creates one plot object per month in ggplot2. I'm getting out of my depth here, but I imagine that these would be stored in a single list ("My Plots").
I plan to use geom_point for each plot object to plot Points against Minutes.Played. I'm not familiar with grid.arrange, but I'm guessing that once I have "My Plots", I could use that (in some form) as an argument for grid.arrange.
I've tried:
empty_list <- list()
for (cat in unique(NBA_Data$month)){
d <- subset(NBA_Data, month == cat)
empty_list <- c(empty_list, d)
}
This gives an unbroken list that repeats all 29 columns for each month, with a length of 261. Not ideal, but workable maybe. Then I try using lapply to split the list, but I screw it up.
lapply(empty_list, split(empty_list, empty_list$month))
Error in match.fun(FUN) :
'split(x = empty_list, f = empty_list$month)' is not a function, character or symbol
In addition: Warning message:
In split.default(x = empty_list, f = empty_list$month) :
data length is not a multiple of split variable
Any suggestions?
Thank you.
You can use split to chunk the dataset into a list already:
month_list <- split(NBA_Data, NBA_Data$month) # avoid naming the result "list", which masks base::list
Also you can use facet_wrap to make multiple plots on one page with the same data if you're using ggplot.
library(ggplot2)
ggplot(NBA_Data, aes(x = Minutes.Played, y = Points)) + geom_point() + facet_wrap(~month)
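If separate plot objects are needed after all, the split() result can be fed to lapply() and the resulting list passed to grid.arrange() through its grobs argument. This is a sketch on invented stand-in data, assuming ggplot2 and gridExtra are installed. The column names follow the question, and by_month and my_plots are names made up here:

```r
library(ggplot2)
library(gridExtra)

# Hypothetical stand-in for NBA_Data with only three months
set.seed(1)
NBA_Data <- data.frame(
  month = rep(c("October", "November", "December"), each = 20),
  Minutes.Played = runif(60, 5, 40)
)
NBA_Data$Points <- 0.5 * NBA_Data$Minutes.Played + rnorm(60)

# One data frame per month
by_month <- split(NBA_Data, NBA_Data$month)

# One plot object per month, titled with the month name
my_plots <- lapply(names(by_month), function(m) {
  ggplot(by_month[[m]], aes(x = Minutes.Played, y = Points)) +
    geom_point() +
    ggtitle(m)
})

# Arrange all plots on one page
grid.arrange(grobs = my_plots, ncol = 3)
```

Note that split() orders a character column alphabetically, so converting month to a factor with explicit levels first would keep the plots in calendar order.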