Setting up a CSV file for R to display histograms

Greetings,
Basically, I have two vectors of data (let's call them experimental and baseline). I want to use the lattice library and the histogram function of R to plot the two histograms side by side, just as seen at the end of this page.
I have my data in a CSV file like this:
Label1,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
Label2,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
Label3,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
Label4,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
Label5,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
Label6,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
Each row should be a new pair of histograms. Columns 1-9 represent the data for the experiment (left-side histogram). Columns 10-18 represent the baseline data (right-side histogram).
Can anyone help me on this? Thanks.

Your data is poorly formatted for faceting with lattice. You can restructure it using reshape.
read.csv(textConnection("Label1,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
Label2,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
Label3,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
Label4,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
Label5,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
Label6,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18"), header = F)->data
colnames(data)[1] <- "ID"
colnames(data)[2:10] <- paste("exp",1:9, sep = "_")
colnames(data)[11:19] <- paste("base", 1:9, sep = "_")
library(reshape)
data.m <- melt(data, id = "ID")
data.m <- cbind(data.m, colsplit(data.m$variable, "_", names = c("Source","Measure")))
data.m is now in the format you really want your data to be in for almost everything. I don't know if each of the 9 measurements from the experiment and the baseline are meaningful or can be meaningfully compared so I kept them distinct.
Now you can use lattice properly.
library(lattice)
histogram(~value | Source + ID, data = data.m)
If the measurements are meaningfully compared (that is, data[,2] and data[,11] are somehow the "same"), you could recast the data to directly compare experiment to baseline
data.comp <- cast(data.m, ID + Measure ~ Source)
## I know ggplot2 better
library(ggplot2)
qplot(base, exp, data = data.comp)+
geom_abline()+
expand_limits(x = 0, y = 0)
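For readers without the reshape package, the same wide-to-long step can be sketched with base R's stack(). This is only a sketch under assumptions: the data frame `wide` below is a hypothetical stand-in for the CSV in the question, and the `exp_`/`base_` column names follow the naming used above.

```r
library(lattice)

# Hypothetical stand-in for the CSV: 6 labelled rows, each holding 1..18
wide <- data.frame(ID = paste0("Label", 1:6),
                   matrix(rep(1:18, each = 6), nrow = 6))
names(wide)[-1] <- c(paste("exp", 1:9, sep = "_"),
                     paste("base", 1:9, sep = "_"))

# stack() puts the 18 value columns into 'values' plus a factor 'ind';
# ID is recycled because 108 rows is a whole multiple of 6
long <- data.frame(ID = wide$ID, stack(wide[-1]))

# Recover the source (exp or base) from the column name
long$Source <- factor(sub("_.*$", "", long$ind), levels = c("exp", "base"))

# One row of panels per ID: experiment on the left, baseline on the right
p <- histogram(~ values | Source + ID, data = long, layout = c(2, 6))
print(p)
```

Setting the factor levels explicitly is what puts the experiment panel before the baseline panel; without it the levels would sort alphabetically.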

Something like this should work:
library(lattice)
data <- matrix(1:18, ncol=18, nrow=3, byrow=T)
for (i in 1:nrow(data))
{
tmp <- cbind(data[i,], rep(1:2, each=9))
print(histogram(~tmp[,1]|tmp[,2]), split=c(1,i,1,nrow(data)), more=T)
}
Note: this will work only for a few rows of data. For larger datasets you may want a slightly different layout (change the split argument passed to print()).

Related

How to efficiently draw lots of graphs in R from data in a wide format?

I'm trying to draw 18 graphs using R and the ggplot2 package. My data look like this:
v1 v2 v3 ... v18 subject group
534 543 512 ... 410 1 (6.5, 18]
437 576 465 ... 420 2 (0, 6.5]
466 487 492 ... 501 3 (18, 55]
And I need to create a "faceted" histogram showing distributions for all of the groups in one frame (i. e. to conveniently present all of the subgroups' distributions) like this:
I came up with this code for a single plot:
ggplot(data = df, aes (x = v1)) + geom_histogram (boundary = 500) + facet_wrap(~Group, nrow = 2)
But since there are 18 variables (v1, v2,...), I'm looking for a way to write an efficient function/loop/command that would draw all the 18 graphs without me having to copy/paste and change the variable name 18 times. Like this:
ggplot(data = df, aes (x = v1)) + geom_histogram (boundary = 500) + facet_wrap(~Group, nrow = 2)
ggplot(data = df, aes (x = v2)) + geom_histogram (boundary = 500) + facet_wrap(~Group, nrow = 2)
ggplot(data = df, aes (x = v3)) + geom_histogram (boundary = 500) + facet_wrap(~Group, nrow = 2)
I know that the solution probably lies in looping and it seems like a useful skill to have, so I'm also using this opportunity to learn this right.
Thank you, any help is appreciated! (And thanks to all the suggestions so far!)
This is where I've gotten so far with the kind help of the user below:
for (v in c(v1,v2)) {
pdf("plots.pdf")
histograms <- ggplot(data = data, aes (x = v)) + geom_histogram (boundary = 500) + facet_wrap(~Group, nrow = 2)
print(histograms)
}
dev.off()

EDIT A significantly revised answer is provided having clarified the needs.
The problem presents several common issues, each of which is addressed in other posts. However, perhaps this suggestion allows for a one-stop solution to these common issues.
My first suggestion is to reformat the data into a "long" format. There are many resources describing this and packages to help. Many users embrace the "tidyverse" set of tools and I'll leave that to others. I'll demonstrate a simple approach using base functions. I don't recommend the reshape() function in the stats package. I find it to be useful for repeated measures with time as one of the variables but find it rather complicated for other data.
A large fake data set will be generated in the "wide" format with demographic data (id, sex, weight, age, group) and 18 variables named "v01", "v02", ..., "v18", built from random integers between 400 and 600.
# Set random number generator and number of "individuals" in fake data
set.seed(1234) # to ensure reproducibility
N <- 936 # number of "individuals" in the fake data
# Create typical fake demographic data and divide the age into 4 groups
id <- factor(sample(1e4:9e4, N, replace = FALSE))
age <- rpois(N, 36)
sex <- sample(c("F","M"), N, replace = TRUE)
weight <- 16 * log(age)
group <- cut(age, breaks = c(12, 32, 36, 40, 62))
Generate 18 fake values for each individual for the wide format and then create the fake "wide" data.frame.
# 18 variable measurements for wide format
V <- replicate(18, sample(400:600, N, replace = TRUE), simplify = FALSE)
names(V) <- sprintf("v%02d", 1:18)
# Add a little variation to the fake data
adj <- sample(1:6, 18, replace = TRUE)
V <- Map("/", V, adj) # divide each value by the number in 'adj'
V <- lapply(V, round, 1) # simplify
# Create data.frame with variable data in wide format
vars <- as.data.frame(V)
names(vars)
# Assemble demographic and variable data into a typical "wide" data set
wide <- data.frame(id, sex, weight, age, group, vars)
names(wide)
head(wide)
In the "wide" format, each row corresponds to a unique individual with demographic information and 18 values for 18 variables. This is going to be changed into the "long" format with each value represented by a row. The new "long" data frame will have two new variables: one for the data (values) and a factor indicating the original variable from which each value came (ind). Typically they get renamed but I will simply work with the default names here.
As noted above, the simple base function stack() will be used to stack the variables into a single vector. In contrast to cbind(), the data.frame() function will replicate shorter columns only when the longer length is a whole multiple of the shorter one. The following code takes advantage of this property to build the "long" data.frame.
# Identify those variables to be stacked (they all start with 'v')
sel <- grepl("^v", names(wide))
long <- data.frame(wide[!sel], stack(wide[sel]))
head(long)
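To see what stack() and the recycling in data.frame() do, here is a tiny illustration. The two-row data frame `tiny` is made up for this purpose and is not part of the fake data above.

```r
# Made-up wide data: 2 individuals, 2 measurement columns
tiny <- data.frame(id = c("a", "b"), v01 = c(1, 2), v02 = c(3, 4))

# Same idiom as above: keep the non-'v' columns, stack the 'v' columns
sel <- grepl("^v", names(tiny))
tiny_long <- data.frame(tiny[!sel], stack(tiny[sel]))

# The 2 id values are recycled to fill the 4 stacked rows
print(tiny_long)
```

Because stack() works column by column, the recycled id values line up with the individuals they belong to.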
My second suggestion is to use one of the "apply" functions to create a list of ggplot objects. By storing the plots in this variable, you have the option of plotting them with different formats without running the plotting code each time.
The code creates a plot for each of the 18 different variables, which are identified by the new variable ind. I changed boundary = 500 to a bins = 10 since I don't know what your actual data looks like. I also added a "caption" to each plot identifying the original variable.
library(ggplot2) # to use ggplot...
plotList <- lapply(levels(long$ind), function(i)
ggplot(data = subset(long, ind == i), aes(x = values))
+ geom_histogram(bins = 10)
+ facet_wrap(~ group, nrow = 2)
+ labs(caption = paste("Variable", i)))
names(plotList) <- levels(long$ind) # name the list elements for convenience
Now to examine each of the 18 plots (this may not work in RStudio):
opar <- par(ask = TRUE)
plotList # This is the same as print(plotList)
par(opar) # turn off the 'ask' option
To save the plots to file, the advice of Imo is good. But it would be wise to take control of the size and nature of the file output. I suggest you look at the help files for pdf() and dev.print(). The last part of this answer shows one possibility with the pdf() function using a for loop to generate single page plots.
for (v in levels(long$ind)) {
fname <- paste(v, "pdf", sep = ".")
fname <- file.path("~", fname) # change this to specify a directory
pdf(fname, width = 6.5, height = 7, paper = "letter")
print(plotList[[v]])
dev.off()
}
And just to add another possible approach, here's a solution with lattice showing 6 groups of variables per plot. (Personally, I'm a fan of this simpler approach.)
library(lattice)
idx <- split(levels(long$ind), gl(3, 6, 18))
opar <- par(ask = TRUE)
for (i in idx)
plot(histogram(~values | group + ind, data = long,
subset = ind %in% i, as.table = TRUE))
par(opar)
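For the plain loop that the question was attempting, a minimal sketch that stays in the wide format could look like the following. It assumes a reasonably recent ggplot2 (3.0.0 or later) for the `.data` pronoun; the data frame `df`, its column names and the output path are hypothetical. The two changes from the attempt in the question are that the loop runs over column names rather than column values, and that the PDF device is opened and closed once, outside the loop.

```r
library(ggplot2)

# Hypothetical stand-in data: three measurement columns and a grouping column
set.seed(1)
df <- data.frame(v1 = rnorm(60, 500, 40),
                 v2 = rnorm(60, 480, 40),
                 v3 = rnorm(60, 520, 40),
                 group = rep(c("a", "b", "c"), each = 20))

# Names (not values) of the columns to plot
vars <- grep("^v[0-9]+$", names(df), value = TRUE)

out <- file.path(tempdir(), "plots.pdf")  # change to a real path as needed
pdf(out)                                  # open the device once
for (v in vars) {
  p <- ggplot(df, aes(x = .data[[v]])) +  # look the column up by name
    geom_histogram(bins = 10) +
    facet_wrap(~ group, nrow = 2) +
    labs(x = v)
  print(p)                                # each print() adds a page
}
dev.off()                                 # close the device once
```

This writes one multi-page PDF instead of one file per variable.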

How to change x axis (in order to be scaled) of a boxplot in R

I'm new to R and I'm having some trouble solving this problem.
I have the following table/dataframe:
I am trying to generate a boxplot like this one:
However, I want the x-axis to be scaled according to the labels 1000, 2000, 5000, etc.
That is, the distance between 1000 and 2000 should differ from the distance between 50000 and 100000, since the actual differences are not the same.
Is it possible to do that in R?
Thank you everyone and have a nice day!

Maybe try to convert the data set into this format, i.e. with the x values as integers in a column rather than as header titles?
# packages
library(ggplot2)
library(reshape2)
# data in ideal format
dt <- data.frame(x=rep(c(1,10,100), each=5),
y=runif(15))
# data that we have. Use reshape2::dcast to get data in to this format
dt$id <- rep(1:5, 3)
dt_orig <- dcast(dt, id~x, value.var = "y")
dt_orig$id <- NULL
names(dt_orig) <- paste0("X", names(dt_orig))
# lets get back to dt, the ideal format :)
# melt puts it in (variable, value) form. Need to configure variable column
dt2 <- melt(dt_orig)
# firstly, remove X from the string
dt2$variable <- gsub("X", "", dt2$variable)
# almost there, can see we have a character, but need it to be an integer
class(dt2$variable)
dt2$variable <- as.integer(dt2$variable)
# boxplot with variable X axis
ggplot(dt2, aes(x=variable, y=value, group=variable)) + geom_boxplot() + theme_minimal()
Base way of re-shaping data: https://www.statmethods.net/management/reshape.html
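If you would rather stay in base graphics, a rough equivalent is to pass the numeric positions to boxplot() through its at argument. This is a sketch under assumptions: the data frame `wide_box` and its `X1000`-style column names are invented to mimic the table in the question, and the xlim padding is arbitrary.

```r
# Invented wide data: one column per x position
set.seed(42)
wide_box <- data.frame(X1000 = runif(5), X2000 = runif(5), X5000 = runif(5))

# Turn the header titles into numeric x positions
x <- as.numeric(sub("^X", "", names(wide_box)))

# 'at' places each box at its true x value, so spacing reflects the numbers
b <- boxplot(wide_box, at = x, names = x,
             xlim = range(x) + c(-500, 500),
             xlab = "x", ylab = "value")
```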

plot returns discontinuous lines (with gaps) in R when using spectacles package

UPDATE ABOUT SOLUTION:
Set the argument gaps = FALSE in your plot() command:
library(spectacles)
spectra(your_data) <- sr_no ~ ... ~ 350:2500 # turn your data frame into a Spectra* object with the spectral range of interest
plot(your_data, gaps = FALSE) # plot without gaps or discontinuous lines
I have to work with a Spectra* object and plot it with code similar to this, just replacing the built-in "australia" data frame with my data:
library(spectacles)
data(australia)
spectra(australia) <- sr_no ~ ... ~ 350:2500
s <- cut(australia, wl =c(-1*450:500, -1*1800:2050))
plot(s)
My plot in R comes out with discontinuous lines (gaps). The data frame's dimensions are 150 rows by 3156 columns. The problem remains when I reduce the data frame to 1800 columns.
Has anyone had the same problems? Could you suggest a solution? Thank you so much in advance!

Please bear with me as it is a lot easier to explain my question by visualization than to leave a comment. Do you want to remove this data or are you wanting to cut a portion of the data? See examples below.
s1 <- cut(australia, wl =c(-1*450:500, -1*1800:2050))
plot(s1, gap=TRUE)
Alternatively, you could use the example from the help file in R.
library(ggplot2)
data(australia)
spectra(australia) <- id ~ ... ~ 500:1800 # changed the range
r <- melt_spectra(australia)
australia$fact <- sample(LETTERS[1:3], size = nrow(australia), replace = TRUE)
r <- melt_spectra(australia, attr = 'fact')
p <- ggplot(r) + geom_line(aes(x=wl, y=nir, group=id, colour=fact)) + theme_bw()
print(p)
Alternatively, you can keep only a portion of the spectrum.
s <- as(cut(australia, wl = 1*450:550), 'Spectra')
plot(s)

Convert absolute values to ranges for charting in R

Warning: still new to R.
I'm trying to construct some charts (specifically, a bubble chart) in R that shows political donations to a campaign. The idea is that the x-axis will show the amount of contributions, the y-axis the number of contributions, and the area of the circles the total amount contributed at this level.
The data looks like this:
CTRIB_NAML CTRIB_NAMF CTRIB_AMT FILER_ID
John Smith $49 123456789
The FILER_ID field is used to filter the data for a particular candidate.
I've used the following functions to convert this data frame into a bubble chart (thanks to help here and here).
vals<-sort(unique(dfr$CTRIB_AMT))
sums<-tapply( dfr$CTRIB_AMT, dfr$CTRIB_AMT, sum)
counts<-tapply( dfr$CTRIB_AMT, dfr$CTRIB_AMT, length)
symbols(vals,counts, circles=sums, fg="white", bg="red", xlab="Amount of Contribution", ylab="Number of Contributions")
text(vals, counts, sums, cex=0.75)
However, this results in way too many intervals on the x-axis. There are several million records all told, and divided up for some candidates could still result in an overwhelming amount of data. How can I convert the absolute contributions into ranges? For instance, how can I group the vals into ranges, e.g., 0-10, 11-20, 21-30, etc.?
----EDIT----
Following comments, I can convert vals to numeric and then slice into intervals, but I'm not sure then how I combine that back into the bubble chart syntax.
new_vals <- as.numeric(as.character(sub("\\$","",vals)))
new_vals <- cut(new_vals,100)
But regraphing:
symbols(new_vals,counts, circles=sums)
Is nonsensical -- all the values line up at zero on the x-axis.

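Once the amounts are binned into a factor with cut, you can use tapply again to find the counts and the sums for the new breaks. Note that the factor has to be built from the full CTRIB_AMT column (one entry per record) rather than from the unique values in vals, otherwise tapply will complain that its arguments differ in length. For example:
amt = as.numeric(sub("\\$", "", dfr$CTRIB_AMT))
bins = cut(amt, 100)
counts = tapply(amt, bins, length)
sums = tapply(amt, bins, sum)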
For this type of thing, though, you might find the plyr and ggplot2 packages helpful. Here is a complete reproducible example:
library(ggplot2)
library(plyr) # provides ddply
# Options
n = 1000
breaks = 10
# Generate data
set.seed(12345)
CTRIB_NAML = replicate(n, paste(letters[sample(10)], collapse=''))
CTRIB_NAMF = replicate(n, paste(letters[sample(10)], collapse=''))
CTRIB_AMT = paste('$', round(runif(n, 0, 100), 2), sep='')
FILER_ID = replicate(10, paste(as.character((0:9)[sample(9)]), collapse=''))[sample(10, n, replace=T)]
dfr = data.frame(CTRIB_NAML, CTRIB_NAMF, CTRIB_AMT, FILER_ID)
# Format data
dfr$CTRIB_AMT = as.numeric(sub('\\$', '', dfr$CTRIB_AMT))
dfr$CTRIB_AMT_cut = cut(dfr$CTRIB_AMT, breaks)
# Summarize data for plotting
plot_data = ddply(dfr, 'CTRIB_AMT_cut', function(x) data.frame(count=nrow(x), total=sum(x$CTRIB_AMT)))
# Make plot
dev.new(width=4, height=4)
qplot(CTRIB_AMT_cut, count, data=plot_data, geom='point', size=total) + theme(axis.text.x=element_text(angle=90, hjust=1))
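To feed the binned values back into the original symbols() call, one option is to plot each bin at its midpoint. The following is only a sketch: the data are simulated, and the bin width of 10 and the inches value are arbitrary choices. Since symbols() interprets circles as radii, the totals are converted with sqrt() so that circle area, not radius, tracks the amount contributed.

```r
# Simulated contributions in the "$49" style of the question
set.seed(1)
dfr <- data.frame(CTRIB_AMT = paste0("$", round(runif(500, 0, 100), 2)))

# Strip the dollar sign and bin every record (not just the unique values)
amt <- as.numeric(sub("\\$", "", dfr$CTRIB_AMT))
breaks <- seq(0, 100, by = 10)
bins <- cut(amt, breaks = breaks, include.lowest = TRUE)

counts <- as.vector(table(bins))           # number of contributions per bin
sums <- as.vector(tapply(amt, bins, sum))  # total contributed per bin (NA if empty)
mids <- head(breaks, -1) + diff(breaks) / 2

keep <- counts > 0                         # skip empty bins
symbols(mids[keep], counts[keep], circles = sqrt(sums[keep] / pi),
        inches = 0.35, fg = "white", bg = "red",
        xlab = "Amount of Contribution", ylab = "Number of Contributions")
text(mids[keep], counts[keep], round(sums[keep]), cex = 0.75)
```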

Subset of data included in more than one ggplot facet

I have a population and a sample of that population. I've made a few plots comparing them using ggplot2 and its faceting option, but it occurred to me that having the sample in its own facet will distort the population plots (however slightly). Is there a way to facet the plots so that all records are in the population plot, and just the sampled records in the second plot?

Matt,
If I understood your question properly - you want to have a faceted plot where one panel contains all of your data, and the subsequent facets contain only a subset of that first plot?
There's probably a cleaner way to do this, but you can create a new data.frame object with the appropriate faceting variable that corresponds to each subset. Consider:
library(ggplot2)
df <- data.frame(x = rnorm(100), y = rnorm(100), sub = sample(letters[1:5], 100, TRUE))
df2 <- rbind(
cbind(df, faceter = "Whole Sample")
, cbind(df[df$sub == "a" ,], faceter = "Subset A")
#other subsets go here...
)
qplot(x,y, data = df2) + facet_wrap(~ faceter)
Let me know if I've misunderstood your question.
-Chase