I have two data sets, d1 and d2, in CSV files. Each data set has six columns. I managed to combine them with melt() and plot them together in ggplot. After I added one extra column, another variable the graph should depend on, I could no longer get the graph I need. A sample of the data set and my code are below.
The dataset after using melt and reshape package:
initi A B C D E L1
0.005 1 23.7 1.0 1.0 24.7 d2
0.005 2 31.2 2.0 2.1 31.2 d2
0.005 3 35.8 3.1 3.2 35.6 d2
1 1 6.2 1.0 1.0 6.2 d1
1 2 10.1 2.0 2.1 7.0 d1
1 3 11.2 3.0 3.5 7.0 d1
2 1 14.2 8.0 14.3 5.2 d1
2 2 15.9 7.0 13.0 5.5 d1
2 3 16.0 6.2 12.4 5.8 d1
I need to plot A on the x-axis and B on the y-axis. Each initi value should get its own line and appear in the legend. For d1, I need to plot B against A. For d2, I want two lines: the first for initi = 1 and the second for initi = 2. All of the lines for d1 and d2 plot B against A and are combined in one graph, so in this case there are 3 lines in total.
I managed to graph d1 and d2 before I added the initi column. Now I am struggling. Below is my code:
dlist <- list(d1 =data1 ,d2 = data2)
reshaped_data <- melt(dlist, id.vars = c('initi','A','B','C','D','E'))
graph_AB<-ggplot(reshaped_data,aes(x = A, y = B, colour = initi)) +
geom_point(size = 5)+
geom_line() +
ggtitle("DATA1 vs DATA2")
The above code is close to what I want, except that the lines are connected in a strange way. Each value of "initi" should get its own line. Also, the legend does not show each value of 'initi'.
You say you want two graphs, one for when initi is 1 and another for when it is 2, but it also takes the value 0.005. You need to subset your data first if you want to omit the 0.005 level.
In ggplot2, multiple graphs are called "facets", and since your facets depend on one variable, the command you want is facet_wrap(). (You'd use facet_grid() if your facets depended on 2 variables.) Facets should be on factors, so we'll make sure initi is a factor:
reshaped_data$initi <- factor(reshaped_data$initi)
then just add + facet_wrap(~ initi) to your ggplot.
EDIT:
If you want just one graph, try
graph_AB<-ggplot(reshaped_data,aes(x = A, y = B, colour = initi, group = L1)) +
geom_point(size = 5)+
geom_line() +
ggtitle("DATA1 vs DATA2")
graph_AB
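If the goal is one line per initi value (three lines for the sample data), a variant of the plot above can group by initi instead of L1. This is only a sketch: it retypes the sample values by hand so it runs on its own, and graph_by_initi is a name introduced here for illustration.

```r
library(ggplot2)

# Sample values from the question, typed in by hand for illustration.
reshaped_data <- data.frame(
  initi = c(0.005, 0.005, 0.005, 1, 1, 1, 2, 2, 2),
  A     = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  B     = c(23.7, 31.2, 35.8, 6.2, 10.1, 11.2, 14.2, 15.9, 16),
  L1    = c("d2", "d2", "d2", "d1", "d1", "d1", "d1", "d1", "d1")
)

# A factor gives a discrete legend with one entry per initi value.
reshaped_data$initi <- factor(reshaped_data$initi)

# Grouping by initi makes geom_line() draw one line per initi value.
graph_by_initi <- ggplot(reshaped_data,
                         aes(x = A, y = B, colour = initi, group = initi)) +
  geom_point(size = 5) +
  geom_line() +
  ggtitle("DATA1 vs DATA2")
print(graph_by_initi)
```

Because initi is a factor here, the legend lists 0.005, 1 and 2 as separate entries rather than showing a continuous colour bar.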
In the future, post data using dput. For example, if you do dput(reshaped_data) and paste it into your question, the output will be this:
structure(list(initi = c(0.005, 0.005, 0.005, 1, 1, 1, 2, 2,
2), A = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), B = c(23.7, 31.2,
35.8, 6.2, 10.1, 11.2, 14.2, 15.9, 16), C = c(1, 2, 3.1, 1, 2,
3, 8, 7, 6.2), D = c(1, 2.1, 3.2, 1, 2.1, 3.5, 14.3, 13, 12.4
), E = c(24.7, 31.2, 35.6, 6.2, 7, 7, 5.2, 5.5, 5.8), L1 = structure(c(2L,
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("d1", "d2"), class = "factor")), .Names = c("initi",
"A", "B", "C", "D", "E", "L1"), class = "data.frame", row.names = c(NA,
-9L))
which anyone can paste into R and use easily.
dlist <- list(d1 =data1 ,d2 = data2)
reshaped_data <- melt(dlist, id.vars = c('initi','A','B','C','D','E'))
graph_AB<-ggplot(reshaped_data,aes(x = A, y = B, colour = initi)) +
geom_point(size = 5)+
ggtitle("DATA1 vs DATA2")
print(graph_AB)
I just removed geom_line() from the code in the original question.
I am trying to visualize my data separately as a bar graph and as a dot plot connected by a line.
The experimental design includes 2 treatments, 3 levels for each treatment, and 2 diets as independent variables and weight measurement as a dependent variable. Each sample (e.g. treatment "a", level "1", diet "l") is duplicated. Below is a sample data frame (the response variable values are simplified):
df <- data.frame(treatment=c('a','a','a','b','b','b','a','a','a','b','b','b',
'a','a','a','b','b','b','a','a','a','b','b','b',
'a','a','a','b','b','b','a','a','a','b','b','b',
'a','a','a','b','b','b','a','a','a','b','b','b'),
level=c(1,2,3,1,2,3,1,2,3,1,2,3,
1,2,3,1,2,3,1,2,3,1,2,3,
1,2,3,1,2,3,1,2,3,1,2,3,
1,2,3,1,2,3,1,2,3,1,2,3,
1,2,3,1,2,3,1,2,3,1,2,3,
1,2,3,1,2,3,1,2,3,1,2,3,
1,2,3,1,2,3,1,2,3,1,2,3,
1,2,3,1,2,3,1,2,3,1,2,3),
diet=c('l','l','l','l','l','l','h','h','h','h','h','h',
'l','l','l','l','l','l','h','h','h','h','h','h',
'l','l','l','l','l','l','h','h','h','h','h','h',
'l','l','l','l','l','l','h','h','h','h','h','h'),
rep=c(1,1,1,1,1,1,1,1,1,1,1,1,
2,2,2,2,2,2,2,2,2,2,2,2,
1,1,1,1,1,1,1,1,1,1,1,1,
2,2,2,2,2,2,2,2,2,2,2,2),
weight=c(100,75,50,50,25,12.5,100,75,50,50,25,12.5,
100,75,50,50,25,12.5,100,75,50,50,25,12.5,
200,150,100,100,50,25,200,150,100,100,50,25,
200,150,100,100,50,25,200,150,100,100,50,25))
Using a linear mixed model, I see that treatment and level effects are individually significant.
fit_df <- lmer(weight ~ treatment*level*diet + (1|rep), data=df)
I have also run emmeans to see pairwise contrasts between each combination of treatment and level.
(emm_wt <- emmeans(fit_df, specs=pairwise~treatment*level))
Then, I want to visualize the result shown below in a bar graph and a dot plot connected by a line. For the bar graph, the y-axis is emmean, x-axis is treatment*level, and error bars show emmean±SE.
$emmeans
treatment level emmean SE df lower.CL upper.CL
a 1 150.0 7.98 27.7 133.64 166.4
b 1 75.0 7.98 27.7 58.64 91.4
a 2 112.5 7.98 27.7 96.14 128.9
b 2 37.5 7.98 27.7 21.14 53.9
a 3 75.0 7.98 27.7 58.64 91.4
b 3 18.8 7.98 27.7 2.39 35.1
Results are averaged over the levels of: diet
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
The code below produces something similar to what I am looking for, but I am not sure how to add a line connecting the dots by the treatment (a1 to a3 and b1 to b3)...
It would also be nice to assign colors by the treatment (e.g. red for a and blue for b).
plot(emm_wt[[1]],
CIs=TRUE,
PIs=TRUE,
comparisons=TRUE,
colors=c("black","dark grey","grey","red"),
alpha=0.05,
adjust="tukey") +
theme_bw() +
coord_flip()
If anybody has any insights as to how I could visualize this, please let me know. Thank you in advance!
You could do something like this, using ggplot2:
library(ggplot2)
ggplot(df,aes(reorder(trt,level),emmean, group=treatment, color=treatment)) +
geom_line(size=2) +
scale_color_manual(values=c("a" = "red", "b"="blue")) +
geom_linerange(aes(ymin=lower.CL, ymax=upper.CL), size=2,show.legend = F) +
geom_point(color="black", size=8) +
ylim(0,200) + labs(x="Treatment/Level", color="Treatment") +
theme(legend.position="bottom")
Input:
df = structure(list(treatment = c("a", "b", "a", "b", "a", "b"), level = c(1L,
1L, 2L, 2L, 3L, 3L), emmean = c(150, 75, 112.5, 37.5, 75, 18.8
), SE = c(7.98, 7.98, 7.98, 7.98, 7.98, 7.98), df = c(27.7, 27.7,
27.7, 27.7, 27.7, 27.7), lower.CL = c(133.64, 58.64, 96.14, 21.14,
58.64, 2.39), upper.CL = c(166.4, 91.4, 128.9, 53.9, 91.4, 35.1
), trt = structure(c(1L, 4L, 2L, 5L, 3L, 6L), .Label = c("a1",
"a2", "a3", "b1", "b2", "b3"), class = c("ordered", "factor"))), row.names = c(NA,
-6L), class = "data.frame")
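The trt column in the input above is not part of the emmeans output, so it has to be derived. A minimal sketch of one way to do that in base R, assuming the emmeans table has already been turned into a plain data frame (emm_df, ymin and ymax are names introduced here). It also adds emmean ± SE bounds, since the question asks for SE error bars rather than confidence limits.

```r
# emmeans table from the question, typed in by hand for illustration.
# Assumption: with the fitted objects available, as.data.frame(emm_wt$emmeans)
# should give an equivalent table (not run here).
emm_df <- data.frame(
  treatment = c("a", "b", "a", "b", "a", "b"),
  level     = c(1L, 1L, 2L, 2L, 3L, 3L),
  emmean    = c(150, 75, 112.5, 37.5, 75, 18.8),
  SE        = 7.98,
  lower.CL  = c(133.64, 58.64, 96.14, 21.14, 58.64, 2.39),
  upper.CL  = c(166.4, 91.4, 128.9, 53.9, 91.4, 35.1)
)

# Combined treatment/level label, ordered a1, a2, a3, b1, b2, b3.
labels <- paste0(emm_df$treatment, emm_df$level)
emm_df$trt <- factor(labels, levels = sort(unique(labels)), ordered = TRUE)

# Bounds for emmean +/- SE error bars.
emm_df$ymin <- emm_df$emmean - emm_df$SE
emm_df$ymax <- emm_df$emmean + emm_df$SE
print(emm_df)
```

To draw SE bars instead of confidence intervals, ymin and ymax can be mapped in place of lower.CL and upper.CL in geom_linerange().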
I'm trying to create a combination Boxplot/Scatterplot. I'm doing alright with it so far but there's one issue that's really bothering me that I've been unable to figure out. I'm in R and I've installed the ggplot2 package. Here's the code I'm using:
#(xx= stand in for my data set, which I imported from excel with the
# column labels as the X-axis values)
> boxplot(xx, lwd = 1.5, ylab = 'Minutes', xlab = "Epoch")
> stripchart(xx, vertical = TRUE,
+ method = "jitter", add = TRUE, pch = 20, col = 'blue')
This gives me a plot that is pretty close to what I want but the problem is that the outliers are placed on the chart twice. If possible, I'd like to have the stripchart exclude them (highest groups of blue dots) and only use the ones from the boxplot (black outlined circles) so they stand out as different and don't look so sloppy.
I've tried to alter the points in question by putting a lot of different outlier arguments into the stripchart command, unfortunately with no luck. I've tried setting y-limits below their values, tried using outline=false (which completely removes the stripchart), tried changing outlier color, outpch, etc. The command has not worked for any of these attempts. Here's an example of ylim:
> stripchart(xx, vertical = TRUE,
+ method = "jitter", add = TRUE, pch = 20, col = 'blue', ylim = true,
ylim (0,20))
Error in ylim(0, 20) : could not find function "ylim"
And here's an example with outlier color:
> stripchart(xx, vertical = TRUE,
+ method = "jitter", add = TRUE, pch = 20, col = 'blue', outcol = "black")
Warning messages:
1: In plot.xy(xy.coords(x, y), type = type, ...) : "outcol" is not a
graphical parameter
.......# warning messages continue as such.
Are stripcharts capable of outlier exclusion? Or do I simply not know enough about them yet (and R as a whole, for that matter) to effectively write the code?
If this can be done, how should I proceed? I'm totally fine with solutions that don't directly address the outlier issue in terms of the data as long as the visual effect on the plot is the same.
Thank you for your time and any help you can give!
Edit: Here's some of the data to play around with. The top row holds the column labels and the data are beneath. Sorry if this formatting is bad. The 29s and 30s in the 9th row of data (10th overall) are examples of points plotted as outliers in my graphs that I would like to keep in the boxplot but not in the scatterplot/stripchart.
1 5 10 15 30 60
7.233333333 8.166666667 9.666666667 7.75 9 7
7.133333333 9.25 9.333333333 9.75 10 11
0.733333333 0.5 0.833333333 1 1 0
1.766666667 1.166666667 1 0.75 1 0
1.75 2.25 2.333333333 2.25 1 1
6.75 7 7.166666667 7.75 6.5 7
1.516666667 1.75 1.333333333 2 2 2
1.533333333 1.5 2 1.25 1.5 2
27.3 28.33333333 29.33333333 30.25 28.5 29
6.35 6 6.333333333 7 6 6
7.083333333 8.333333333 8.833333333 8.75 8 8
8.533333333 10.08333333 10.5 12 10.5 11
7.65 8.416666667 9 10.75 9 12
6.85 7.333333333 8 7.25 6 8
4.433333333 5 5.5 5 6.5 6
8.616666667 10 11.66666667 12.25 13 12
3.633333333 3.75 3.5 3.25 3 2
0.8 0.75 0.833333333 1 1 0
7.283333333 8.583333333 9.666666667 9.75 12 8
7.483333333 8.75 8.333333333 7.75 6.5 7
3.466666667 2.916666667 3.166666667 2.5 2 0
5.483333333 6.416666667 6.833333333 6.75 7 8
There are a few things going on here. If you want to stick with the base plotting functions (boxplot() and stripchart()), you can tell stripchart() to plot only the points that meet some criterion. One common rule of thumb treats any point 3 or more standard deviations away from the mean as an outlier. Instead of passing the unmodified data set to stripchart(), filter each column separately (note the [ ] brackets). mean() and sd() need numeric vectors rather than a whole data frame, and filtering column by column keeps each group in its own strip.
boxplot(xx)
stripchart(lapply(xx, function(col) col[col <= mean(col) + 3 * sd(col)]),
vertical = TRUE, method = 'jitter', add = TRUE, pch = 20, col = 'blue')
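The 3-standard-deviation cutoff is not the rule boxplot() uses for its own outlier circles: by default it flags points more than 1.5 times the interquartile range beyond the box. To hide exactly those points in the stripchart, boxplot.stats() can supply them. A minimal sketch, using only two of the six columns from the question so that it stays short; drop_boxplot_outliers is a helper name made up for this example.

```r
# Two of the six columns from the question, typed in by hand.
xx <- data.frame(
  X1  = c(7.233333333, 7.133333333, 0.733333333, 1.766666667, 1.75, 6.75,
          1.516666667, 1.533333333, 27.3, 6.35, 7.083333333, 8.533333333,
          7.65, 6.85, 4.433333333, 8.616666667, 3.633333333, 0.8,
          7.283333333, 7.483333333, 3.466666667, 5.483333333),
  X60 = c(7, 11, 0, 0, 1, 7, 2, 2, 29, 6, 8, 11, 12, 8, 6, 12, 2, 0, 8, 7, 0, 8)
)

# Hypothetical helper: keep only the values that boxplot() would NOT draw
# as outliers (boxplot.stats()$out lists the flagged values).
drop_boxplot_outliers <- function(col) col[!col %in% boxplot.stats(col)$out]

# A list of numeric vectors, one per column; stripchart() accepts this.
inliers <- lapply(xx, drop_boxplot_outliers)

boxplot(xx, lwd = 1.5, ylab = "Minutes", xlab = "Epoch")
stripchart(inliers, vertical = TRUE, method = "jitter",
           add = TRUE, pch = 20, col = "blue")
```

With this approach the black outlier circles come only from boxplot(), and the blue jittered points cover everything else.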
Of course, if you really did want to use ggplot2 (and I recommend installing not only that package, but the entire tidyverse with install.packages('tidyverse')), you could produce an arguably nicer plot:
The data formatting and commands needed to produce the ggplot version are quite different from the base graphics version, and beyond the scope of this answer. Reproducible code follows.
library(tidyverse)
df <- structure(list(X1 = c(7.233333333, 7.133333333, 0.733333333, 1.766666667, 1.75, 6.75, 1.516666667, 1.533333333, 27.3, 6.35, 7.083333333, 8.533333333, 7.65, 6.85, 4.433333333, 8.616666667, 3.633333333, 0.8, 7.283333333, 7.483333333, 3.466666667, 5.483333333 ), X5 = c(8.166666667, 9.25, 0.5, 1.166666667, 2.25, 7, 1.75, 1.5, 28.33333333, 6, 8.333333333, 10.08333333, 8.416666667, 7.333333333, 5, 10, 3.75, 0.75, 8.583333333, 8.75, 2.916666667, 6.416666667 ), X10 = c(9.666666667, 9.333333333, 0.833333333, 1, 2.333333333, 7.166666667, 1.333333333, 2, 29.33333333, 6.333333333, 8.833333333, 10.5, 9, 8, 5.5, 11.66666667, 3.5, 0.833333333, 9.666666667, 8.333333333, 3.166666667, 6.833333333), X15 = c(7.75, 9.75, 1, 0.75, 2.25, 7.75, 2, 1.25, 30.25, 7, 8.75, 12, 10.75, 7.25, 5, 12.25, 3.25, 1, 9.75, 7.75, 2.5, 6.75), X30 = c(9, 10, 1, 1, 1, 6.5, 2, 1.5, 28.5, 6, 8, 10.5, 9, 6, 6.5, 13, 3, 1, 12, 6.5, 2, 7), X60 = c(7L, 11L, 0L, 0L, 1L, 7L, 2L, 2L, 29L, 6L, 8L, 11L, 12L, 8L, 6L, 12L, 2L, 0L, 8L, 7L, 0L, 8L)), .Names = c("X1", "X5", "X10", "X15", "X30", "X60"), class = "data.frame", row.names = c(NA, -22L))
df.long <- gather(df, x, value) %>%
mutate(x = as.factor(as.numeric(gsub('X', '', x)))) %>%
group_by(x) %>%
mutate(is.outlier = value > mean(value) + sd(value) * 3)
plot.df <- ggplot(data = df.long, aes(x = x, y = value, group = x)) +
geom_boxplot() +
geom_point(data = filter(df.long, !is.outlier), color = '#0000ff88', position = position_jitter(width = 0.1))
print(plot.df)
My code is reading in a file which looks like this:
Wafer Analyses
%HN20396815 150.000000 03/27/2016 04:06:27
Vf2 Site
2.76 1
2.32 2
2.56 3
2.45 2
2.76 4
2.98 3
2.58 1
2.42 2
2.76 1
2.32 4
2.56 3
2.45 2
2.76 4
2.98 3
2.58 1
2.42 2
Then I separate the "Vf2" data by "Site" and try to plot a histogram for each site, but the histograms are not drawn. How do I fix this? This is my code:
blufile <- read.table("C:/blufiles/WFRHN20396815_700.blu", skip = 2, header = TRUE, sep="\t")
Vf2 = blufile["Vf2"]
Site = blufile["Site"]
blufile2 <- data.frame(Vf2, Site)
blufile2$Vf2[blufile2$Site == '2']
set.seed(42)
p1 <- blufile2$Vf2[blufile2$Site == '1'] # centered at 4
p2 <- blufile2$Vf2[blufile2$Site == '2'] # centered at 6
p3 <- blufile2$Vf2[blufile2$Site == '3'] # centered at 4
p4 <- blufile2$Vf2[blufile2$Site == '4'] # centered at 6
plot( p1, col=rgb(0,0,1,1/4), xlim=c(0,4)) # first histogram
plot( p2, col=rgb(1,0,0,1/4), xlim=c(0,4), add=T) # second
plot( p3, col=rgb(0,1,0,1/4), xlim=c(0,4), add=T) # first histogram
plot( p4, col=rgb(1,1,0,1/4), xlim=c(0,4), add=T) # second
I want a similar output ->
The solution @alistaire provided works with the data you have provided:
ggplot(df, aes(x=Vf2)) +
geom_histogram(bins=5) +
facet_wrap(~Site)
The data df used in this case is:
structure(list(Vf2 = c(2.76, 2.32, 2.56, 2.45, 2.76, 2.98, 2.58,
2.42, 2.76, 2.32, 2.56, 2.45, 2.76, 2.98, 2.58, 2.42), Site = c(1L,
2L, 3L, 2L, 4L, 3L, 1L, 2L, 1L, 4L, 3L, 2L, 4L, 3L, 1L, 2L)), .Names = c("Vf2",
"Site"), class = "data.frame", row.names = c(NA, -16L))
Please elaborate on your issue if this still does not work.
EDIT:
The issue with your code is that you used plot(), which produces an x-y point plot, instead of hist().
Here's a solution with base:
df_list <- split(df, df$Site) # list of 4 data frames, one per level of Site
if (dev.cur() > 1) dev.off() # close any open graphics device; skipped when none is open
hist_cols <- c(rgb(0,0,1,1/4), rgb(1,0,0,1/4), rgb(0,1,0,1/4), rgb(1,1,0,1/4)) # one rgb colour per site
for(i in seq_along(df_list)){
dat <- df_list[[i]]$Vf2 # select Vf2 data from site i
hist(dat, col=hist_cols[i], xlim=c(0,4), ylim=c(0, 5), add=(i > 1)) # draw the first histogram, then overlay the others on top
}
The ggplot solution is slightly more involved in this case (filter() and %>% come from dplyr, so load it alongside ggplot2):
ggplot(data=df, aes(x=Vf2, fill=factor(Site), colour=factor(Site))) +
geom_histogram(data=df %>% filter(Site==1), bins=5, alpha=.5) +
geom_histogram(data=df %>% filter(Site==2), bins=5, alpha=.5) +
geom_histogram(data=df %>% filter(Site==3), bins=5, alpha=.5) +
geom_histogram(data=df %>% filter(Site==4), bins=5, alpha=.5) +
theme_bw()
I'm looking to learn how to write functions. I have data laid out as follows (for example):
Genus Species Wing Tail
A X 10.5 20.3
A Y 10.7 20.7
B XX 15.2 22.5
B XY 15.5 24
I calculate variance for a given trait using the equation:
sqrt(max(Wing) - min (Wing))
which I sum for all traits.
So I can write the following function to sum the variance over the whole data set:
variance<- function(data){
t <- sqrt(max(Tail)-min(Tail))
w <- sqrt(max(Wing)-min(Wing))
x <- sum(t,w)
x
}
But I can't work out how to make the result depend on the Genus. I'm looking to generate output like:
Genus A Genus B
2.345 3.456
I am going to give a new name to your function because it's just wrong to call it "variance". I hope you can overlook that. We can work on a dataframe object
dput(dfrm)
structure(list(Genus = structure(c(1L, 1L, 2L, 2L), .Label = c("A",
"B"), class = "factor"), Species = structure(c(1L, 4L, 2L, 3L
), .Label = c("X", "XX", "XY", "Y"), class = "factor"), Wing = c(10.5,
10.7, 15.2, 15.5), Tail = c(20.3, 20.7, 22.5, 24)), .Names = c("Genus",
"Species", "Wing", "Tail"), class = "data.frame", row.names = c(NA,
-4L))
dev2<- function(df){
t <- sqrt(max(df[["Tail"]])-min(df[["Tail"]]))
w <- sqrt(max(df[["Wing"]])-min(df[["Wing"]]))
x <- sum(t,w)
x
}
Now use it to work on the full dataframe, using the split-lapply strategy, which passes sections of the original dataframe determined by the Genus values to the dev2 function
lapply( split(dfrm, list(dfrm$Genus)), FUN = dev2)
$A
[1] 1.079669
$B
[1] 1.772467
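If a single named vector is preferred over a list (closer to the layout asked for in the question), sapply() can replace lapply(). A minimal sketch, assuming the same data and dev2 function as above; the data frame is rebuilt with data.frame() only so that the snippet runs on its own, and by_genus is a name introduced here.

```r
dfrm <- data.frame(
  Genus   = c("A", "A", "B", "B"),
  Species = c("X", "Y", "XX", "XY"),
  Wing    = c(10.5, 10.7, 15.2, 15.5),
  Tail    = c(20.3, 20.7, 22.5, 24)
)

# Sum over traits of sqrt(max - min), as defined in the question.
dev2 <- function(df) {
  t <- sqrt(max(df[["Tail"]]) - min(df[["Tail"]]))
  w <- sqrt(max(df[["Wing"]]) - min(df[["Wing"]]))
  sum(t, w)
}

# split() cuts the data frame by Genus; sapply() simplifies the results
# to a numeric vector named after the Genus values.
by_genus <- sapply(split(dfrm, dfrm$Genus), dev2)
print(round(by_genus, 3))
```

The values are the same as in the lapply() output; only the container changes from a list to a named numeric vector.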
I'm trying to understand how to have more than one series on a plot, using the following data.
Year <- c('1950', '1960', '1970', '1980')
Bus <- c(10,20,30,40)
Bus.sd <- c(1.1, 2.2, 3.3, 4.4)
Car <- c(20, 20, 40, 40)
Car.sd <- c(1.1, 2.2, 3.3, 4.4)
sample_data = data.frame(Year, Bus, Bus.sd, Car, Car.sd)
qplot(Year, Bus, data=sample_data, geom="pointrange",
ymin = Bus - Bus.sd/2, ymax = Bus + Bus.sd/2)
For example, using the above data, how do I show both sample_data$Bus and sample_data$Car on the same plot in different colors?
What I tried doing was:
p <- qplot(...)
then
p <- p + qplot(...)
where I replicated the previous line, but this gave me an error.
I don't fully understand how aes() works. I have studied the ggplot2 examples, but have difficulty understanding the relevant examples here. Or, if it is possible to make a stacked bar (geom_bar) using this data, I think that would also represent it appropriately.
I hope this helps.
ggplot2 works best with data in long format, like so:
Year score sd variable
1 1950 10 1.1 bus
2 1960 20 2.2 bus
3 1970 30 3.3 bus
4 1980 40 4.4 bus
5 1950 20 1.1 car
6 1960 20 2.2 car
7 1970 40 3.3 car
8 1980 40 4.4 car
This will get the data into R:
data <- structure(list(Year = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L,
4L), class = "factor", .Label = c("1950", "1960", "1970", "1980"
)), score = c(10, 20, 30, 40, 20, 20, 40, 40), sd = c(1.1, 2.2,
3.3, 4.4, 1.1, 2.2, 3.3, 4.4), variable = c("bus", "bus", "bus",
"bus", "car", "car", "car", "car")), .Names = c("Year", "score",
"sd", "variable"), row.names = c(NA, -8L), class = "data.frame")
And this will make the plot, with dodging and all. You probably need the dodge because your data overlap. You can control the amount of dodging with the width argument. Both layers are grouped by variable (through the colour mapping), so the points and their ranges are dodged together.
ggplot(data, aes(x=Year, y=score, col=variable))+
geom_point(position=position_dodge(width=0.2))+
geom_pointrange(aes(ymin=score-sd, ymax=score+sd), position=position_dodge(width=0.2))
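The long-format table used in this answer can also be built from the question's wide sample_data instead of being typed in. A minimal base-R sketch, assuming the column names from the question (Bus, Bus.sd, Car, Car.sd); long_data is a name introduced here.

```r
# Wide data as defined in the question.
sample_data <- data.frame(
  Year   = c("1950", "1960", "1970", "1980"),
  Bus    = c(10, 20, 30, 40),
  Bus.sd = c(1.1, 2.2, 3.3, 4.4),
  Car    = c(20, 20, 40, 40),
  Car.sd = c(1.1, 2.2, 3.3, 4.4)
)

# One block of rows per series, stacked on top of each other.
long_data <- rbind(
  data.frame(Year = sample_data$Year, score = sample_data$Bus,
             sd = sample_data$Bus.sd, variable = "bus"),
  data.frame(Year = sample_data$Year, score = sample_data$Car,
             sd = sample_data$Car.sd, variable = "car")
)

# Year as a factor, matching the structure used in the answer.
long_data$Year <- factor(long_data$Year)
print(long_data)
```

Stacking with rbind() keeps the example dependency-free; reshaping helpers from other packages would do the same job for wider tables.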