Merge data.frames for grouped boxplot r - r

I have two data frames z (1 million observations) and b (500k observations).
z= Tracer time treatment
15 0 S
20 0 S
25 0 X
04 0 X
55 15 S
16 15 S
15 15 X
20 15 X
b= Tracer time treatment
2 0 S
35 0 S
10 0 X
04 0 X
20 15 S
11 15 S
12 15 X
25 15 X
I'd like to create grouped boxplots using time as a factor and treatment as colour. Essentially I need to bind them together and then differentiate between them but not sure how. One way I tried was using:
zz<-factor(rep("Z", nrow(z))
bb<-factor(rep("B",nrow(b))
dumB<-merge(z,zz) #this won't work because it says it's too big
dumB<-merge(b,zz)
total<-rbind(dumB,dumZ)
But z and zz merge won't work because it says it's 10G in size (which can't be right)
The end plot might be similar to this example: Boxplot with two levels and multiple data.frames
Any thoughts?
Cheers,
EDIT: Added boxplot

I would approach it as follows:
# create a list of your data.frames
l <- list(z,b)
# assign names to the dataframes in the list
names(l) <- c("z","b")
# bind the dataframes together with rbindlist from data.table
# the id parameter will create a variable with the names of the dataframes
# you could also use 'bind_rows(l, .id="id")' from 'dplyr' for this
library(data.table)
zb <- rbindlist(l, id="id")
# create the plot
ggplot(zb, aes(x=factor(time), y=Tracer, color=treatment)) +
geom_boxplot() +
facet_wrap(~id) +
theme_bw()
which gives:
Other alternatives for creating your plot:
# facet by 'time'
ggplot(zb, aes(x=id, y=Tracer, color=treatment)) +
geom_boxplot() +
facet_wrap(~time) +
theme_bw()
# facet by 'time' & color by 'id' instead of 'treatment'
ggplot(zb, aes(x=treatment, y=Tracer, color=id)) +
geom_boxplot() +
facet_wrap(~time) +
theme_bw()
In respons to your last comment: to get everything in one plot, you use interaction to distinguish between the different groupings as follows:
ggplot(zb, aes(x=treatment, y=Tracer, color=interaction(id, time))) +
geom_boxplot(width = 0.7, position = position_dodge(width = 0.7)) +
theme_bw()
which gives:

The key is you do not need to perform a merge, which is computationally expensive on large tables. Instead assign a new variable and value (source c(b,z) in my code below) to each dataframe and then rbind. Then it becomes straight forward, my solution is very similar to #Jaap's just with different faceting.
library(ggplot2)
#Create some mock data
t<-seq(1,55,by=2)
z<-data.frame(tracer=sample(t,size = 10,replace = T), time=c(0,15), treatment=c("S","X"))
b<-data.frame(tracer=sample(t,size = 10,replace = T), time=c(0,15), treatment=c("S","X"))
#Add a variable to each table to id itself
b$source<-"b"
z$source<-"z"
#concatenate the tables together
all<-rbind(b,z)
ggplot(all, aes(source, tracer, group=interaction(treatment,source), fill=treatment)) +
geom_boxplot() + facet_grid(~time)

Related

Plot selected rows with multiple columns with ggplot

I have a dataset set with 34 columns and 600+ rows.
I successfully managed to reshape it for my data to be predicted for 5 columns (5 years) using reshape2
Dataset_name <- melt(data=XYZ, id.vars=c("A", "B", "C",.... {so on minus 5 columns}))
Now I have the reshaped data and plotted the graph and since it has 600+ points in each column, I cant make sense of it.
Is it possible for me to plot the top Row 1 to Row 50 in one graph and in another Row 51 to Row 100 and so on?
Also, I want to connect the dots to see whether they varied over the years.
Thanks.
Dataset
You can assign rows numbers (first 50 designated as 1, second 50 as 2...) and use that variable in facet_wrap. Each facet would thus hold 50 data points. Here's an example on the iris dataset which comes shipped with R.
library(ggplot2)
nrow(iris) # 150, let's do 50 obs. per facet
iris <- iris[sample(1:nrow(iris)), ] # shuffle the dataset
iris$desig <- rep(c("set1", "set2", "set3"), each = 50)
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
theme_bw() +
geom_point() +
facet_wrap(~ desig)

ggplot multicolomn on R using csv

I want to plot my .csv data (I named it p) on R using ggplot2 but I am having difficulties.
Time d d d d c m m m m c c c........... (top row of data p)
there are 14 rows and there are 304 columns. First column is time and rest are d c m so on ......I want to plot Time on x axis against rest 303 on y axis on a single plot window and these 303 graph lines to be distinguished by Color.
the top row has letters like d c m.. theese are my 3 forest Groups: coniferous, deciduous, mixed. so i want all graph lines with 'd' to be grouped in one particular Color. then 'c' in another Color and 'm' in some other.
I found a way to do that using ggplot
ggplot(p, aes(x = Time, group = 1)) +
geom_line(aes(y = d), colour="blue") +
geom_line(aes(y = c), colour = "red") +
geom_line(aes(y = m), colour = "green") +
ylab(label="NDVI") + xlab("Time")
but from 303 i have 117 columns of d
77 for c
109 for m
What code should I use so R would plot all columns by giving all ds , cs and ms different Color?
Please help I have been stuck on this for days.
Here is a proposition I hope will suit your needs:
require(ggplot2)
require(reshape2)
time <- 1:14 # your 14 rows
# here data with only 3 forest types, consider using yours instead !
data.test <- data.frame(time=time, d=runif(14, max=1)*time, c=runif(14, max=2)*time, m=runif(14, max=3)*time)
#reshape your data for it to suit the ggploting
data.test <- melt(data.test, id.vars="time", variable.name="forest", value.name="y")
# add a numeric version of the factor forest (for gradient color)
data.test$numeric.forest <- rep(1:3, each=14) # replace 1:3 by 1:304
# first plot, each forest type got a color ... should not be readable with your dimensions (304 lines)
ggplot(data.test) + geom_line(aes(x=time, y=y, group=forest, color=forest))
# if you have 304 forest types, you might consider using color gradient,
# but you need to identify a logical order in your forest, for the gradient to be informative ...
ggplot(data.test) + geom_line(aes(x=time, y=y, group=forest, color=numeric.forest), alpha=.9)
# consider playing with alpha (transparency) for your lines to be readable

Generating a histogram and density plot from binned data

I've binned some data and currently have a dataframe that consists of two columns, one that specifies a bin range and another that specifies the frequency like this:-
> head(data)
binRange Frequency
1 (0,0.025] 88
2 (0.025,0.05] 72
3 (0.05,0.075] 92
4 (0.075,0.1] 38
5 (0.1,0.125] 20
6 (0.125,0.15] 16
I want to plot a histogram and density plot using this but I can't seem to find a way of doing so without having to generate new bins etc. Using this solution here I tried to do the following:-
p <- ggplot(data, aes(x= binRange, y=Frequency)) + geom_histogram(stat="identity")
but it crashes. Anyone know of how to deal with this?
Thank you
the problem is that ggplot doesnt understand the data the way you input it, you need to reshape it like so (I am not a regex-master, so surely there are better ways to do is):
df <- read.table(header = TRUE, text = "
binRange Frequency
1 (0,0.025] 88
2 (0.025,0.05] 72
3 (0.05,0.075] 92
4 (0.075,0.1] 38
5 (0.1,0.125] 20
6 (0.125,0.15] 16")
library(stringr)
library(splitstackshape)
library(ggplot2)
# extract the numbers out,
df$binRange <- str_extract(df$binRange, "[0-9].*[0-9]+")
# split the data using the , into to columns:
# one for the start-point and one for the end-point
df <- cSplit(df, "binRange")
# plot it, you actually dont need the second column
ggplot(df, aes(x = binRange_1, y = Frequency, width = 0.025)) +
geom_bar(stat = "identity", breaks=seq(0,0.125, by=0.025))
or if you don't want the data to be interpreted numerically, you can just simply do the following:
df <- read.table(header = TRUE, text = "
binRange Frequency
1 (0,0.025] 88
2 (0.025,0.05] 72
3 (0.05,0.075] 92
4 (0.075,0.1] 38
5 (0.1,0.125] 20
6 (0.125,0.15] 16")
library(ggplot2)
ggplot(df, aes(x = binRange, y = Frequency)) + geom_bar(stat = "identity")
you won't be able to plot a density-plot with your data, given its not continous but rather categorical, thats why I actually prefer the second way of showing it,
You can try
library(ggplot2)
ggplot(df, aes(x = binRange, y = Frequency)) + geom_col()

Color Dependent Bar Graph in R

I'm a bit out of my depth with this one here. I have the following code that generates two equally sized matrices:
MAX<-100
m<-5
n<-40
success<-matrix(runif(m*n,0,1),m,n)
samples<-floor(MAX*matrix(runif(m*n),m))+1
the success matrix is the probability of success and the samples matrix is the corresponding number of samples that was observed in each case. I'd like to make a bar graph that groups each column together with the height being determined by the success matrix. The color of each bar needs to be a color (scaled from 1 to MAX) that corresponds to the number of observations (i.e., small samples would be more red, for instance, whereas high samples would be green perhaps).
Any ideas?
Here is an example with ggplot. First, get data into long format with melt:
library(reshape2)
data.long <- cbind(melt(success), melt(samples)[3])
names(data.long) <- c("group", "x", "success", "count")
head(data.long)
# group x success count
# 1 1 1 0.48513473 8
# 2 2 1 0.56583802 58
# 3 3 1 0.34541582 40
# 4 4 1 0.55829073 64
# 5 5 1 0.06455401 37
# 6 1 2 0.88928606 78
Note melt will iterate through the row/column combinations of both matrices the same way, so we can just cbind the resulting molten data frames. The [3] after the second melt is so we don't end up with repeated group and x values (we only need the counts from the second melt). Now let ggplot do its thing:
library(ggplot2)
ggplot(data.long, aes(x=x, y=success, group=group, fill=count)) +
geom_bar(position="stack", stat="identity") +
scale_fill_gradient2(
low="red", mid="yellow", high="green",
midpoint=mean(data.long$count)
)
Using #BrodieG's data.long, this plot might be a little easier to interpret.
library(ggplot2)
library(RColorBrewer) # for brewer.pal(...)
ggplot(data.long) +
geom_bar(aes(x=x, y=success, fill=count),colour="grey70",stat="identity")+
scale_fill_gradientn(colours=brewer.pal(9,"RdYlGn")) +
facet_grid(group~.)
Note that actual values are probably different because you use random numbers in your sample. In future, consider using set.seed(n) to generate reproducible random samples.
Edit [Response to OP's comment]
You get numbers for x-axis and facet labels because you start with matrices instead of data.frames. So convert success and samples to data.frames, set the column names to whatever your test names are, and prepend a group column with the "list of factors". Converting to long format is a little different now because the first column has the group names.
library(reshape2)
set.seed(1)
success <- data.frame(matrix(runif(m*n,0,1),m,n))
success <- cbind(group=rep(paste("Factor",1:nrow(success),sep=".")),success)
samples <- data.frame(floor(MAX*matrix(runif(m*n),m))+1)
samples <- cbind(group=success$group,samples)
data.long <- cbind(melt(success,id=1), melt(samples, id=1)[3])
names(data.long) <- c("group", "x", "success", "count")
One way to set a threshold color is to add a column to data.long and use that for fill:
threshold <- 25
data.long$fill <- with(data.long,ifelse(count>threshold,max(count),count))
Putting it all together:
library(ggplot2)
library(RColorBrewer)
ggplot(data.long) +
geom_bar(aes(x=x, y=success, fill=fill),colour="grey70",stat="identity")+
scale_fill_gradientn(colours=brewer.pal(9,"RdYlGn")) +
facet_grid(group~.)+
theme(axis.text.x=element_text(angle=-90,hjust=0,vjust=0.4))
Finally, when you have names for the x-axis labels they tend to get jammed together, so I rotated the names -90°.

Plotting multiple variables via ggplot2

I'd like to create a bar chart using factors and more than two variables! My data looks like this:
Var1 Var2 ... VarN Factor1 Factor2
Obs1 1-5 1-5 ... 1-5
Obs2 1-5 1-5 ... ...
Obs3 ... ... ... ...
Each datapoint is a likert item ranging from 1-5
Plotting total sums using a dichotomized version (every item above 4 is a one, else 0)
I converted the data using this
MyDataFrame = dichotomize(MyDataFrame,>=4)
p <- colSums(MyDataFrame)
p <- data.frame(names(p),p)
names(p) <- c("var","value")
ggplot(p,aes(var,value)) + geom_bar() + coord_flip()
Doing this i loose the information provided by factor1 etc, i'd like to use stacking in order to visualize from which group of people the rating came
Is there a elegant solution to this problem? I read about using reshape to melt the data and then applying ggplot?
I would suggest the following: use one of your factor for stacking, the other one for faceting. You can remove position="fill" to geom_bar() to use counts instead of standardized values.
my.df <- data.frame(replicate(10, sample(1:5, 100, rep=TRUE)),
F1=gl(4, 5, 100, labels=letters[1:4]),
F2=gl(2, 50, labels=c("+","-")))
my.df[,1:10] <- apply(my.df[,1:10], 2, function(x) ifelse(x>4, 1, 0))
library(reshape2)
my.df.melt <- melt(my.df)
library(plyr)
res <- ddply(my.df.melt, c("F1","F2","variable"), summarize, sum=sum(value))
library(ggplot2)
ggplot(res, aes(y=sum, x=variable, fill=F1)) +
geom_bar(stat="identity", position="fill") +
coord_flip() +
facet_grid(. ~ F2) +
ylab("Percent") + xlab("Item")
In the above picture, I displayed observed frequencies of '1' (value above 4 on the Likert scale) for each combination of F1 (four levels) and F2 (two levels), where there are either 10 or 15 observations:
> xtabs(~ F1 + F2, data=my.df)
F2
F1 + -
a 15 10
b 15 10
c 10 15
d 10 15
I then computed conditional item sum scores with ddply,† using a 'melted' version of the original data.frame. I believe the remaining graphical commands are highly configurable, depending on what kind of information you want to display.
† In this simplified case, the ddply instruction is equivalent to with(my.df.melt, aggregate(value, list(F1=F1, F2=F2, variable=variable), sum)).

Resources