Print histograms including variable name for all variables in R

I'm trying to generate a simple histogram for every variable in my dataframe, which I can do using sapply below. But, how can I include the name of the variable in either the title or the x-axis so I know which one I'm looking at? (I have about 20 variables.)
Here is my current code:
x = # initialize dataframe
sapply(x, hist)

Here's a way to modify your existing approach to include column name as the title of each histogram, using the iris dataset as an example:
# loop over column *names* instead of actual columns
invisible(sapply(names(iris), function(cname){
  # (make sure we only plot the numeric columns)
  if (is.numeric(iris[[cname]]))
    # use the `main` param to put column name as plot title;
    # hist() draws the plot itself, so no print() is needed
    hist(iris[[cname]], main=cname, xlab=cname)
}))
After you run that, you'll be able to flip through the plots with the arrows in the Plots pane (assuming you're using RStudio).
p.s. check out grid::grob(), gridExtra::grid.arrange(), and related functions if you want to arrange the histograms onto a single plot window and save it to a single file.
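If the goal is just to get every histogram onto one page and into one file, here is a minimal sketch using base graphics with par(mfrow = ...) instead of the grid functions mentioned above. The output file name and the 2 x 2 layout are assumptions that happen to fit the four numeric columns of iris; adjust both for your own data.

```r
# Sketch: one histogram per numeric column, all on a single page, saved to one PDF.
# Uses base graphics only; the file name below is a hypothetical choice.
num_cols <- names(iris)[sapply(iris, is.numeric)]
out_file <- file.path(tempdir(), "iris_histograms.pdf")

pdf(out_file, width = 8, height = 8)
par(mfrow = c(2, 2))  # 2 x 2 grid of panels on the page
for (cname in num_cols) {
  # column name goes into both the title and the x-axis label
  hist(iris[[cname]], main = cname, xlab = cname)
}
invisible(dev.off())  # close the device so the file is written
```

With about 20 variables you would probably want something like par(mfrow = c(4, 5)) and a larger page size.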

How about this? Assuming you have wide data, you can transform it to long format with gather. Then use a ggplot solution with geom_histogram and facet_wrap:
library(tidyverse)
# make wide data (20 columns)
df <- matrix(rnorm(1000), ncol = 20)
df <- as.data.frame(df)
colnames(df) <- LETTERS[1:20]
# transform to long format (2 columns)
df <- gather(df, key = "name", value = "value")
# plot histograms per name
ggplot(df) +
geom_histogram(aes(value)) +
facet_wrap(~name, ncol = 5)

Related

For Loop Across Specific Column Range in R

I have a wide data frame consisting of 1000 rows and over 300 columns. The first 2 columns are GroupID and Categorical fields. The remaining columns are all continuous numeric measurements. What I would like to do is loop through a specific range of these columns in R, beginning with the first numeric column (column #3). For example, loop through columns 3:10. I would also like to retain the column names in the loop. I've started with the following code using
for(i in 3:ncol(df)){
print(i)
}
But this includes all columns to the right of column #3 (not the range 3:10), and this does not identify column names. Can anyone help get me started on this loop so I can specify the column range and also retain column names? TIA!
Side Note: I've used tidyr to gather the data frame in long format. That works, but I've found it makes my data frame very large and therefore eats a lot of time and memory in my loop.
Since you did not include your data, I created similar dummy data (1000 rows and 302 columns, 2 of them id vars) in order to show you how to select columns and prepare them for plotting:
library(reshape2)
library(ggplot2)
set.seed(123)
#Dummy data
Numvars <- as.data.frame(matrix(rnorm(1000*300),nrow = 1000,ncol = 300))
vec1 <- 1:1000
vec2 <- rep(paste0('class',1:5),200)
IDs <- data.frame(vec1,vec2,stringsAsFactors = F)
#Bind data
Data <- cbind(IDs,Numvars)
#Select vars (in your case 10 initial vars)
df <- Data[,1:12]
#Prepare for plot
df.melted <- melt(data = df,id.vars = c('vec1','vec2'))
#Plot
ggplot(df.melted,aes(x=vec1,y=value,group=variable,color=variable))+
geom_line()+
facet_wrap(~vec2)
I hope this helps.
You can keep column names by feeding them into an lapply function, here's an example with the iris dataset:
library(ggplot2)
lapply(names(iris)[2:4], function(columntoplot){
  df <- data.frame(datatoplot=iris[[columntoplot]])
  graphname <- columntoplot
  # keep the plot in a variable so it can be passed to ggsave() explicitly
  p <- ggplot(df, aes(x = datatoplot)) +
    geom_histogram() +
    ggtitle(graphname)
  ggsave(filename = paste0(graphname, ".png"), plot = p, width = 4, height = 4)
})
In the lapply function, you create a new dataset comprising one column (note the double brackets). You can then plot and optionally save the output within the function (see ggsave line). You're then able to use the column name as the plot title as well as the file name.
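For the literal request in the question (loop over columns 3:10 only and keep the column names), a plain for loop over an index range also works. This is only a sketch: the data frame below is a small hypothetical stand-in for the real one, and taking the mean is a placeholder for whatever you do with each column.

```r
set.seed(1)
# Hypothetical stand-in: 2 id columns followed by 10 numeric columns (X1..X10)
df <- data.frame(GroupID = 1:20,
                 Category = rep(c("a", "b"), 10),
                 matrix(rnorm(20 * 10), nrow = 20))

col_range <- 3:10           # the specific column range to loop over
col_means <- numeric(0)     # results, named by column

for (i in col_range) {
  cname <- names(df)[i]     # look up the column name from its index
  col_means[cname] <- mean(df[[i]])
}

print(col_means)
```

Because the loop works on the wide data frame directly, there is no need to gather it into long format first.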

Loop through and plot columns of two identical dataframes

I have two dataframes I'd like to plot against each other:
df1 <- data.frame(HV = c(3,3,3), NAtlantic850t = c(0.501, 1.373, 1.88), AO = c(-0.0512, 0.2892, 0.0664))
df2 <- data.frame(HV = c(3,3,2), NAtlantic850t = c(1.2384, 1.3637, -0.0332), AO = c(-0.5915, -0.0596, -0.8842))
They have identical structure (the same columns in the same order). I'd like to plot them column vs column (e.g. df1$HV against df2$HV), i.e. loop through the dataframe columns and plot each pair against each other in a scatter graph.
I've looked through 20+ questions asking similar things and can't figure it out - would appreciate some help on where to start. Can I use lapply and plot or ggplot when they're two DFs? Should I merge them first?
As you suggest, I would indeed first rearrange into a list of plottable data frames before calling the plot command. I think that would especially be the way to go if you want to feed the data argument into ggplot. Something like:
plot_dfs <- lapply(names(df1),function(nm)data.frame(col1 = df1[,nm], col2 = df2[,nm]))
for (df in plot_dfs)plot(x = df[,"col1"], y = df[,"col2"])
or using ggplot:
for (df in plot_dfs){
print(
ggplot(data = df, aes(x=col1, y=col2)) +
geom_point())}
and if you want to add the column names as plot titles, you can do:
for (idx in seq_along(plot_dfs)){
print(
ggplot(data = plot_dfs[[idx]], aes(x=col1, y=col2)) +
ggtitle(names(df1)[idx]) +
geom_point())}
You can loop through the columns like this:
for(col in 1:ncol(df1)){
plot(df1[,col], df2[,col])
}
Make sure that both data frames have the same number of columns (and that the order of the columns is the same) before running this.
Here's one way to do it: loop over the column indices and create the plots one by one, adding them to a list and writing each one to a file:
library(ggplot2)
# create some data to plot
df1 <- iris[, sapply(iris, is.numeric)]
df2 <- iris[sample(1:nrow(iris)), sapply(iris, is.numeric)]
# a list to catch each plot object
plot_list <- vector(mode="list", length=ncol(df1))
for (idx in seq_along(df1)){
plot_list[[idx]] <- ggplot2::qplot(df1[[idx]], df2[[idx]]) +
labs(title=names(df1)[idx])
ggsave(filename=paste0(names(df1)[idx], ".pdf"), plot=plot_list[[idx]])
}
As you suggest in the question, you can also use s/lapply() with an anonymous function, e.g. like this (though here we're not storing the plots, just writing each one to disk):
lapply(seq_along(df1), function(idx){
the_plot <- ggplot2::qplot(df1[[idx]], df2[[idx]]) + labs(title=names(df1)[idx])
ggsave(filename=paste0(names(df1)[idx], ".pdf"), plot=the_plot)
})
If you want to keep the list of plots (as in the for-loop example), just assign the result of lapply() to a variable (e.g. plot_list) and add a line like return(the_plot) before closing the function.
There's tons of ways you could modify/adapt this approach, depending on what your objectives are.
Hope this helps ~~
p.s. if it's possible the columns won't be in the same order, it is better to loop over column names instead of column indices (i.e. use for (colname in names(df1)){... instead of for (idx in seq_along(df1)){...). You can use the same [[ subsetting syntax with both names and indices.
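Here is a small sketch of the loop-over-names variant, using the data from the question and base plot() instead of ggplot2. The column order of df2 is shuffled on purpose to show that matching by name still pairs the right columns; the output file name is an assumption.

```r
df1 <- data.frame(HV = c(3, 3, 3),
                  NAtlantic850t = c(0.501, 1.373, 1.88),
                  AO = c(-0.0512, 0.2892, 0.0664))
df2 <- data.frame(HV = c(3, 3, 2),
                  NAtlantic850t = c(1.2384, 1.3637, -0.0332),
                  AO = c(-0.5915, -0.0596, -0.8842))

# Deliberately reorder df2's columns to mimic mismatched layouts
df2 <- df2[, c("AO", "HV", "NAtlantic850t")]

# Only loop over columns that exist in both data frames
shared_cols <- intersect(names(df1), names(df2))

out_file <- file.path(tempdir(), "column_scatter.pdf")  # hypothetical file name
pdf(out_file)
for (colname in shared_cols) {
  # [[ with a name picks the matching column regardless of its position
  plot(df1[[colname]], df2[[colname]],
       main = colname,
       xlab = paste("df1:", colname),
       ylab = paste("df2:", colname))
}
invisible(dev.off())
```

Each column pair ends up on its own page of the PDF.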

Plot multiple days data of multiple homes in Facet form of ggplot

I have hourly timeseries data of three homes(H1, H2, H3) for continuous five days created as
library(xts)
library(ggplot2)
set.seed(123)
dt <- data.frame(H1 = rnorm(24*5,200,2),H2 = rnorm(24*5,150,2),H3 = rnorm(24*5,50,2)) # hourly data of three homes for 5 days
timestamp <- seq(as.POSIXct("2016-01-01"),as.POSIXct("2016-01-05 23:59:59"), by = "hour") # create timestamp
dt$timestamp <- timestamp
Now I want to plot data homewise in facet form; accordingly I melt dataframe as
tempdf <- reshape2::melt(dt,id.vars="timestamp") # melt data for faceting
colnames(tempdf) <- c("time","var","val") # rename so as not to result in conflict with another melt inside geom_line
Within each facet (for each home), I want to see the values of all the five days in line plot form (each facet should contain 5 lines corresponding to different days). Accordingly,
ggplot(tempdf) + facet_wrap(~var) +
geom_line(data = function(x) {
locdat <- xts(x$val,x$time)# create timeseries object for easy splitting
sub <- split.xts(locdat,f="days") # split data daywise of considered home
sub2 <- sapply(sub, function(y) return(coredata(y))) # arrange data in matrix form
df_sub2 <- as.data.frame(sub2)
df_sub2$timestamp <- index(sub[[1]]) # forcing same timestamp for all days [okay with me]
df_melt <- reshape2::melt(df_sub2,id.vars="timestamp") # melt to plot inside each facet
#return(df_melt)
df_melt
}, aes(x=timestamp, y=value,group=variable,color=variable),inherit.aes = FALSE)
I have forced the same timestamp for all the days of a home to make plotting simple. With above code, I get plot as
The only problem with the above plot is that it plots the same data in all the facets. Ideally, the H1 facet should contain the data of home 1 only, the H2 facet the data of home 2, and so on. I know that I am not passing home-wise data to geom_line() correctly; can anyone help me do this in the correct manner?
I think that you may find it more efficient to modify the data outside the call to ggplot rather than inside it (allows closer inspection of what is happening at each step, at least in my opinion).
Here, I am using lubridate to generate two new columns. The first holds only the date (and not the time) to allow faceting on that. The second holds the full datetime, but I then modify the date so that they are all the same. This leaves only the times as mattering (and we can suppress the chosen date in the plot).
library(lubridate)
tempdf$day <- date(tempdf$time)
tempdf$forPlotTime <- tempdf$time
date(tempdf$forPlotTime) <- "2016-01-01"
Then, I can pass that modified data.frame to ggplot. You will likely want to modify colors/labels, but this should get you a pretty good start.
ggplot(tempdf
, aes(x = forPlotTime
, y = val
, col = as.factor(day))) +
geom_line() +
facet_wrap(~var) +
scale_x_datetime(date_breaks = "6 hours"
, date_labels = "%H:%M")

Plotting a matrix using ggplot2 in R

I want to plot using ggplot2 the distribution of 5 variables corresponding to a matrix's column names
a <- matrix(runif(1:25),ncol=5,nrow=1)
colnames(a) <- c("a","b","c","d","e")
rownames(a) <- c("count")
I tried:
ggplot(data=melt(a),aes(x=colnames(a),y=a[,1]))+ geom_point()
However, this gives a result as if all columns had the same y value.
Note: I'm using the reshape package for the melt() function.
All columns look like they have the same y-value because you are only specifying 1 number in the y= statement. You are saying y=a[,1] which if you type a[,1] into your command window you will find is 0.556 (the number that everything is appearing at). I think this is what you want:
library(reshape2)
library(ggplot2)
a_melt<- melt(a)
ggplot(data=a_melt,aes(x=Var2,y=value))+ geom_point()
Note that I saved a new dataset called a_melt so that things were easier to reference. Also, since the data was melted, it is cleaner if we define our x-values to be the Var2 column of a_melt rather than the column names of a.
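To see what the melted data looks like without loading reshape2, here is a rough base R equivalent using as.data.frame(as.table(...)). This is a swapped-in technique, not what melt() does internally, and the seed and the "value" column name are assumptions chosen to mirror the melt() output.

```r
set.seed(42)
# 1 x 5 matrix with named rows and columns, as in the question
a <- matrix(runif(5), nrow = 1, ncol = 5,
            dimnames = list("count", c("a", "b", "c", "d", "e")))

# Long format: one row per matrix cell, with the row name, column name and value
a_long <- as.data.frame(as.table(a), responseName = "value")

print(a_long)
```

The result has one row per column of a, with the column names in Var2 and the numbers in value, which is why Var2 and value are the natural x and y aesthetics.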

Plot multiple ordered conditional box plots using columns from a data frame

I am trying to plot several ordered (ie., from high to low median) conditional box plots from a single data frame. The general sequence is as follows:
Reverse sort group medians for variable1 according to variable.group ;
Create ordered conditional box plot using variable.group and sorted medians;
Repeat (loop?) process for remaining variables in data frame.
I want to loop through about 70 variables using the above process but am stuck moving from tapply to aggregate, accessing each variable in the dataframe, and coding the looping sequence. Apologies in advance for the lack of elegance in my R code below:
bpdf = data.frame(group=c("A","A","A","B","B","B","C","C","C"),
x=c(1,1,2,2,3,3,3,4,4),
y=c(7,5,2,9,7,6,3,1,2),
z=c(4,5,2,9,8,9,7,6,7))
sorted.medians = rev(sort(with(bpdf,tapply(bpdf$x,bpdf$group,median))))
boxplot(bpdf$x~factor(bpdf$group,levels=names(sorted.medians)))
I think you just need to put your 2 lines inside lapply:
lapply(bpdf[,-1],function(x){
## decreasing better than rev here
y <- sort(tapply(x,bpdf$group,median),decreasing=TRUE)
boxplot(x~factor(bpdf$group,levels=names(y)))
})
EDIT: to plot the variable name, use the main argument of boxplot and loop over the colnames of bpdf:
lapply(colnames(bpdf[,-1]),function(i){
## decreasing better than rev here
x <- bpdf[,i]
title <- paste0('title',i) ## you can change it here
y <- sort(tapply(x,bpdf$group,median),decreasing=TRUE)
boxplot(x~factor(bpdf$group,levels=names(y)),main=title)
})
If I understand the question correctly, I think following should do what you want:
Load in a few packages and create some data:
library(plyr)
library(reshape2)
dd = data.frame(group=c("A","B","C", "D"),
x1=runif(40),x2=runif(40),x3=runif(40),x4=runif(40))
Now calculate the median conditional on the variable and group
dd_m = melt(dd, "group")
meds = ddply(dd_m, c("variable", "group"), summarise, m = median(value))
Order the data frame by variable and median:
sorted_meds = meds[with(meds, order(variable, -m)), ]
Loop through the variables, and reorder the groups for each one in turn:
for(var in unique(sorted_meds$variable)){
grp_order = sorted_meds[sorted_meds$variable==var, ]$group
dd_tmp = dd_m[dd_m$variable==var,]
dd_tmp$group = factor(dd_tmp$group, levels = grp_order)
boxplot(dd_tmp$value ~ dd_tmp$group)
}
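Another option, sketched here with the example data from the question, is base R's reorder(), which builds the factor with levels sorted by a summary statistic in one step. Negating the values gives a high-to-low median order. The output file name and the level_orders list (kept only so the orderings can be inspected) are assumptions for illustration.

```r
bpdf <- data.frame(group = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
                   x = c(1, 1, 2, 2, 3, 3, 3, 4, 4),
                   y = c(7, 5, 2, 9, 7, 6, 3, 1, 2),
                   z = c(4, 5, 2, 9, 8, 9, 7, 6, 7))

level_orders <- list()                                    # group order used per variable
out_file <- file.path(tempdir(), "ordered_boxplots.pdf")  # hypothetical file name

pdf(out_file)
for (v in setdiff(names(bpdf), "group")) {
  # reorder() sorts levels by increasing median of -value,
  # i.e. by decreasing median of the variable itself
  g <- reorder(bpdf$group, -bpdf[[v]], FUN = median)
  level_orders[[v]] <- levels(g)
  boxplot(bpdf[[v]] ~ g, main = v, xlab = "group", ylab = v)
}
invisible(dev.off())

print(level_orders)
```

Looping over the column names rather than the columns themselves keeps the variable name available for the plot title.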
