A function to create multiple plots by subsets of data frame - r

I have a data frame (NBA_Data) that is 27,538 by 29. One of the columns is called "month", and there are nine months (October through June). I would like to write a function that automatically subsets this data frame by month and then creates one ggplot2 plot object per month. I'm getting out of my depth here, but I imagine that these would be stored in a single list ("My Plots").
I plan to use geom_point for each plot object to plot Points against Minutes.Played. I'm not familiar with grid.arrange, but I'm guessing that once I have "My Plots", I could use that (in some form) as an argument for grid.arrange.
I've tried:
empty_list <- list()
for (cat in unique(NBA_Data$month)){
d <- subset(NBA_Data, month == cat)
empty_list <- c(empty_list, d)
}
This gives an unbroken list that repeats all 29 columns for each month, with a length of 261. Not ideal, but workable maybe. Then I try using lapply to split the list, but I screw it up.
lapply(empty_list, split(empty_list, empty_list$month))
Error in match.fun(FUN) :
'split(x = empty_list, f = empty_list$month)' is not a function, character or symbol
In addition: Warning message:
In split.default(x = empty_list, f = empty_list$month) :
data length is not a multiple of split variable
Any suggestions?
Thank you.

You can use split to chunk the dataset into a list of data frames, one per month:
data_by_month <- split(NBA_Data, NBA_Data$month)
If you're using ggplot2, you can also use facet_wrap to draw one panel per month on a single page from the same data:
library(ggplot2)
ggplot(NBA_Data, aes(x = Minutes.Played, y = Points)) + geom_point() + facet_wrap(~month)
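If you do want the separate plot objects in a list, as the question describes, the split and plotting steps can be wrapped in one function. The following is a minimal sketch, not a definitive implementation: the small NBA_Data built here is a made-up stand-in so the code runs on its own, plots_by_month is a hypothetical helper name, and the last line assumes the gridExtra package is installed.

```r
library(ggplot2)

# Hypothetical stand-in for NBA_Data, only so the sketch runs on its own
set.seed(1)
NBA_Data <- data.frame(
  month = rep(c("October", "November", "December"), each = 20),
  Minutes.Played = runif(60, 0, 48),
  Points = rpois(60, 12)
)

# Build one scatter plot per month and return them as a named list
plots_by_month <- function(data) {
  pieces <- split(data, data$month)
  plots <- lapply(names(pieces), function(m) {
    ggplot(pieces[[m]], aes(x = Minutes.Played, y = Points)) +
      geom_point() +
      ggtitle(m)
  })
  setNames(plots, names(pieces))
}

my_plots <- plots_by_month(NBA_Data)

# Lay the plots out on one page (assumes gridExtra is installed)
gridExtra::grid.arrange(grobs = my_plots, ncol = 3)
```

Note that split orders a character column alphabetically, so convert month to a factor with the desired level order first if the calendar order matters.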

Related

For Loop Across Specific Column Range in R

I have a wide data frame consisting of 1000 rows and over 300 columns. The first 2 columns are GroupID and Categorical fields. The remaining columns are all continuous numeric measurements. What I would like to do is loop through a specific range of these columns in R, beginning with the first numeric column (column #3). For example, loop through columns 3:10. I would also like to retain the column names in the loop. I've started with the following code:
for(i in 3:ncol(df)){
print(i)
}
But this includes all columns to the right of column #3 (not the range 3:10), and this does not identify column names. Can anyone help get me started on this loop so I can specify the column range and also retain column names? TIA!
Side Note: I've used tidyr to gather the data frame in long format. That works, but I've found it makes my data frame very large and therefore eats a lot of time and memory in my loop.
Since you did not include your data, I created similar dummy data (1000 rows and 302 columns, including 2 ID variables) to show how to select columns and prepare them for plotting:
library(reshape2)
library(ggplot2)
set.seed(123)
#Dummy data
Numvars <- as.data.frame(matrix(rnorm(1000*300),nrow = 1000,ncol = 300))
vec1 <- 1:1000
vec2 <- rep(paste0('class',1:5),200)
IDs <- data.frame(vec1,vec2,stringsAsFactors = FALSE)
#Bind data
Data <- cbind(IDs,Numvars)
#Select vars (in your case 10 initial vars)
df <- Data[,1:12]
#Prepare for plot
df.melted <- melt(data = df,id.vars = c('vec1','vec2'))
#Plot
ggplot(df.melted,aes(x=vec1,y=value,group=variable,color=variable))+
geom_line()+
facet_wrap(~vec2)
You will end up with a line plot that has one panel per class in vec2 and one colored line per selected variable.
I hope this helps.
You can keep column names by feeding them into an lapply function, here's an example with the iris dataset:
library(ggplot2)
lapply(names(iris)[2:4], function(columntoplot){
  df <- data.frame(datatoplot = iris[[columntoplot]])
  graphname <- columntoplot
  # store the plot so ggsave() receives it explicitly
  p <- ggplot(df, aes(x = datatoplot)) +
    geom_histogram() +
    ggtitle(graphname)
  ggsave(filename = paste0(graphname, ".png"), plot = p, width = 4, height = 4)
})
In the lapply function, you create a new dataset comprising one column (note the double brackets). You can then plot and optionally save the output within the function (see ggsave line). You're then able to use the column name as the plot title as well as the file name.
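For the original question (a fixed column range, with names retained), a plain for loop over the names of that range may be all that is needed. This is a minimal sketch under stated assumptions: the data frame wide is made up for illustration, and computing a mean stands in for whatever you actually do with each column.

```r
# Hypothetical wide data frame: 2 ID columns followed by numeric columns
set.seed(42)
wide <- data.frame(GroupID = 1:50,
                   Category = rep(c("a", "b"), 25),
                   matrix(rnorm(50 * 10), nrow = 50))

cols <- names(wide)[3:10]   # names of columns 3 through 10 only

col_means <- numeric(0)
for (nm in cols) {
  # nm is the column name; wide[[nm]] is the column itself
  col_means[nm] <- mean(wide[[nm]])
}
```

Because the loop variable is the name rather than the index, it can be used directly for titles, file names, or as the names of a result vector, and no reshaping to long format is required.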

Print histograms including variable name for all variables in R

I'm trying to generate a simple histogram for every variable in my dataframe, which I can do using sapply below. But, how can I include the name of the variable in either the title or the x-axis so I know which one I'm looking at? (I have about 20 variables.)
Here is my current code:
x = # initialize dataframe
sapply(x, hist)
Here's a way to modify your existing approach to include column name as the title of each histogram, using the iris dataset as an example:
# loop over column *names* instead of actual columns
invisible(sapply(names(iris), function(cname){
  # (make sure we only plot the numeric columns)
  if (is.numeric(iris[[cname]]))
    # use the `main` param to put column name as plot title
    # (hist() draws the plot itself; print() would only dump the bin data)
    hist(iris[[cname]], main=cname)
}))
After you run that, you'll be able to flip through the plots with the arrows in the viewer pane (assuming you're using RStudio).
Here's an example output:
p.s. check out grid::grob(), gridExtra::grid.arrange(), and related functions if you want to arrange the histograms onto a single plot window and save it to a single file.
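For base-graphics histograms like these, one simple way to get them all into a single window and a single file is par(mfrow = ...). The sketch below is only one option: it writes to a temporary PDF, and the file name is a placeholder chosen for illustration.

```r
# Numeric columns of iris, by name
num_cols <- names(iris)[sapply(iris, is.numeric)]

# Hypothetical output path; any writable location would do
out_file <- file.path(tempdir(), "iris_histograms.pdf")

pdf(out_file, width = 8, height = 8)
old_par <- par(mfrow = c(2, 2))   # 2 x 2 grid of panels on one page
for (cname in num_cols) {
  hist(iris[[cname]], main = cname, xlab = cname)
}
par(old_par)                      # restore the previous layout settings
dev.off()
```

With about 20 variables, a larger grid such as c(4, 5) would fit them on one page, at the cost of smaller panels.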
How about this? Assuming you have wide data, you can transform it to long format with gather, then use a ggplot solution with geom_histogram and facet_wrap:
library(tidyverse)
# make wide data (20 columns)
df <- matrix(rnorm(1000), ncol = 20)
df <- as.data.frame(df)
colnames(df) <- LETTERS[1:20]
# transform to long format (2 columns)
df <- gather(df, key = "name", value = "value")
# plot histograms per name
ggplot(df) +
geom_histogram(aes(value)) +
facet_wrap(~name, ncol = 5)

Loop through and plot columns of two identical dataframes

I have two dataframes I'd like to plot against each other:
df1 <- data.frame(HV = c(3,3,3), NAtlantic850t = c(0.501, 1.373, 1.88), AO = c(-0.0512, 0.2892, 0.0664))
df2 <- data.frame(HV = c(3,3,2), NAtlantic850t = c(1.2384, 1.3637, -0.0332), AO = c(-0.5915, -0.0596, -0.8842))
They have identical structure (the same columns). I'd like to plot them column vs column (e.g. df1$HV, df2$HV): loop through the data frame columns and plot them against each other in a scatter graph.
I've looked through 20+ questions asking similar things and can't figure it out - would appreciate some help on where to start. Can I use lapply and plot or ggplot when they're two DFs? Should I merge them first?
As you suggest, I would indeed first rearrange into a list of plottable data frames before calling the plot command. I think that would especially be the way to go if you want to feed the data argument into ggplot. Something like:
plot_dfs <- lapply(names(df1),function(nm)data.frame(col1 = df1[,nm], col2 = df2[,nm]))
for (df in plot_dfs)plot(x = df[,"col1"], y = df[,"col2"])
or using ggplot:
for (df in plot_dfs){
print(
ggplot(data = df, aes(x=col1, y=col2)) +
geom_point())}
and if you want to add the column names as plot titles, you can do:
for (idx in seq_along(plot_dfs)){
print(
ggplot(data = plot_dfs[[idx]], aes(x=col1, y=col2)) +
ggtitle(names(df1)[idx]) +
geom_point())}
You can loop through the columns like this:
for(col in 1:ncol(df1)){
plot(df1[,col], df2[,col])
}
Make sure that both data frames have the same number of columns (and that the order of the columns is the same) before running this.
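That check can be done in code before the loop runs. The sketch below is an illustration only: df2 is deliberately built with its columns in a different order to show the reordering step, and the column subset is a shortened version of the data in the question.

```r
df1 <- data.frame(HV = c(3, 3, 3), AO = c(-0.0512, 0.2892, 0.0664))
# Hypothetical variant of df2 with the same columns in a different order
df2 <- data.frame(AO = c(-0.5915, -0.0596, -0.8842), HV = c(3, 3, 2))

# Stop early if the two data frames do not share the same set of columns
stopifnot(setequal(names(df1), names(df2)))

# Reorder df2 so that position i refers to the same variable in both
df2 <- df2[, names(df1), drop = FALSE]

for (col in seq_len(ncol(df1))) {
  plot(df1[, col], df2[, col], main = names(df1)[col])
}
```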
Here's one way to do it: loop over the column indices and create the plots one by one, adding them to a list and writing each one to a file:
library(ggplot2)
# create some data to plot
df1 <- iris[, sapply(iris, is.numeric)]
df2 <- iris[sample(1:nrow(iris)), sapply(iris, is.numeric)]
# a list to catch each plot object
plot_list <- vector(mode="list", length=ncol(df1))
for (idx in seq_along(df1)){
plot_list[[idx]] <- ggplot2::qplot(df1[[idx]], df2[[idx]]) +
labs(title=names(df1)[idx])
ggsave(filename=paste0(names(df1)[idx], ".pdf"), plot=plot_list[[idx]])
}
As you suggest in the question, you can also use s/lapply() with an anonymous function, e.g. like this (though here we're not storing the plots, just writing each one to disk):
lapply(seq_along(df1), function(idx){
the_plot <- ggplot2::qplot(df1[[idx]], df2[[idx]]) + labs(title=names(df1)[idx])
ggsave(filename=paste0(names(df1)[idx], ".pdf"), plot=the_plot)
})
If you want to keep the list of plots (as in the for-loop example), just assign the result of lapply() to a variable (e.g. plot_list) and add a line like return(the_plot) before closing the function.
There are many ways you could modify or adapt this approach, depending on what your objectives are.
Hope this helps ~~
p.s. if it's possible the columns won't be in the same order, it is better to loop over column names instead of column indices (i.e. use for (colname in names(df1)){... instead of for (idx in seq_along(df1)){...). You can use the same [[ subsetting syntax with both names and indices.
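As a rough sketch of that name-based variant, the following loops over the column names the two data frames share and keeps the plots in a named list instead of writing files. It uses ggplot() with geom_point() rather than qplot(), which is a substitution on my part, and the object names shared and pair are made up for illustration.

```r
library(ggplot2)

# Same example data as above: numeric iris columns, with df2 row-shuffled
df1 <- iris[, sapply(iris, is.numeric)]
df2 <- iris[sample(nrow(iris)), sapply(iris, is.numeric)]

# Only loop over columns that exist in both data frames
shared <- intersect(names(df1), names(df2))

plot_list <- lapply(shared, function(colname) {
  pair <- data.frame(x = df1[[colname]], y = df2[[colname]])
  ggplot(pair, aes(x = x, y = y)) +
    geom_point() +
    labs(title = colname)
})
names(plot_list) <- shared   # allows plot_list[["Sepal.Length"]]
```

Since the plot is the last expression in the anonymous function, it is returned without an explicit return() call.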

using an apply function with ggplot2 to create bar plots for more than one variable in a data.frame

Is there a way to use an apply function in R in order to create barplots with ggplot2?
Say we have a data frame containing only factor variables, one of which is boolean. In my case I have a data frame with 40+ variables. Can one plot all the variables against the boolean one with a single line of code?
data("diamonds")
factors <- sapply(diamonds, function(x) is.factor(x))
factors_only <- diamonds[,factors]
factors_only$binary <- sample(c(1, 0), nrow(factors_only), replace=TRUE)
factors_only$binary <- as.factor(factors_only$binary)
But I want to create barplots like this one:
qplot(factors_only$color, data=factors_only, geom="bar", fill=factors_only$binary)
This does not work:
sapply(factors_only,function(x) qplot(x, data=factors_only, geom="bar", fill=binary))
Please advise
You can use lapply to run through the variable names and then use get to pull up the current variable.
temp <- lapply(names(factors_only),
function(x) qplot(get(x), data=factors_only, geom="bar", fill=binary, xlab=x))
To print a list item,
print(temp[[1]])
will produce
A nice feature of running through the variable names is that you can use these to dynamically name the labels in the figure.
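An alternative to get() that avoids qplot is ggplot2's .data pronoun, which looks a column up by a string held in a variable. This is a sketch of that swapped-in technique rather than the approach in the answer above; it assumes a ggplot2 version that provides .data (3.0.0 or later), and the names vars and bar_plots are made up.

```r
library(ggplot2)

# Rebuild the example data from the question
data("diamonds", package = "ggplot2")
factors_only <- as.data.frame(diamonds[, sapply(diamonds, is.factor)])
set.seed(1)
factors_only$binary <- factor(sample(c(0, 1), nrow(factors_only), replace = TRUE))

# Every factor column except the fill variable itself
vars <- setdiff(names(factors_only), "binary")

bar_plots <- lapply(vars, function(v) {
  ggplot(factors_only, aes(x = .data[[v]], fill = binary)) +
    geom_bar() +
    xlab(v)   # label the axis with the column name
})
names(bar_plots) <- vars
```

Printing a single element, for example print(bar_plots[["color"]]), draws that bar plot.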

create multiple graphs/plots from xlsx file with n columns

I have a .xlsx file with several columns (with some inter-dependencies). I'd like to plot multiple graphs on the same chart using a select number of the columns. The first column is Date (which will be my only X-variable) and the remaining columns of interest will be Y-values. There are 1000 rows of data in this file.
So ...
X-axis ... "Date" column only
Y-axis (multiple data) ... columns B, C, D, E, T, U, V only
Question:
How to:
1) Read the file
2) plot a line graph of the data, all on the same chart (X-axis = Date, Y-axis = columns B, C, D, E, T, U, V)
3) Color code each line with some type of a legend
I've read this post and many more (not allowed to post more than 2 links??) ... none has been helpful. Most are too arbitrary:
how to plot all the columns of a data frame in R
The problem you have is with this labels/sublabels combination. They mess up the import (variable classes are not recognized). Here is a two-step solution.
In the first step, we import the database just to extract clean column names. What I did for that was to concatenate the main label (row 2) with the sublabel (row 3) when there was one. There are two pairs of identical column labels, so we also rename them to have clean colnames (I suggest you take the time to review your variable names and give them proper labels). Then we save them as an object (n).
Then, we import the file again, skipping the first two rows. That way, read_excel knows what classes to expect. We assign the previously saved names to the new data.frame. Now the data is clean. The rest is trivial: melt with tidyr::gather and plot with ggplot.
Code
library(readxl)
library(tidyr)
library(zoo)
library(ggplot2)
df <- read_excel("./myfile.xlsx",skip = 1)
names(df)[!is.na(df[1,])] <- paste(na.locf(names(df)[!is.na(df[1,])]),df[1,][!is.na(df[1,])],sep="_")
names(df)[duplicated(names(df))] <- paste0(names(df)[duplicated(names(df))],"bis")
n <- names(df)
df <- read_excel("./myfile.xlsx",skip = 2)
names(df) <- n
# df <- dplyr::slice(df,1:3) # this line is for the censored datafile that has only three rows
melted <- gather(df,key,value,-Date)
ggplot(melted, aes(x=Date,y=value,color=key)) + geom_line()
Of course, with only three rows of data, the result is ugly.
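The code above plots every column against Date. To restrict the chart to a chosen set of columns, as the question asks, one option is to subset before reshaping. The sketch below uses a small made-up data frame in place of the imported spreadsheet, with shortened column names, and uses tidyr::pivot_longer where the answer uses gather.

```r
library(tidyr)
library(ggplot2)

# Hypothetical stand-in for the imported spreadsheet
set.seed(7)
df <- data.frame(Date = as.Date("2020-01-01") + 0:9,
                 B = rnorm(10), C = rnorm(10), D = rnorm(10), T = rnorm(10))

keep <- c("B", "C", "T")            # only the series to plot
sub <- df[, c("Date", keep)]

# Long format: one row per (Date, series) pair
melted <- pivot_longer(sub, cols = -Date, names_to = "key", values_to = "value")

# One colored line per series, with a legend generated from key
p <- ggplot(melted, aes(x = Date, y = value, color = key)) + geom_line()
print(p)
```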
