I have two histogram plots that I can produce one at a time. The following code gives a 2-row by 3-column facet plot of six different histograms:
ggplot(data) +
aes(x=values) +
geom_histogram(binwidth=2, fill='blue', alpha=0.3, color="black", aes(y=(..count..)*100/(sum(..count..)/6))) +
facet_wrap(~ model_f, ncol = 3)
Here the aes(y = ...) mapping just gives percentages instead of counts.
As stated, I have two of these six-panel facet_wrap plots, which I now wish to combine to show that one is more shifted than the other.
In addition, the two data sets are not the same size. For one I have:
# A tibble: 5,988 x 5
values ID structure model model_f
<dbl> <chr> <chr> <chr> <fctr>
1 6 1 bone qua Model I
2 7 1 bone liu Model II
3 20 1 bone dav Model III
4 3 1 bone ema Model IV
5 3 1 bone tho Model V
6 4 1 bone ranc Model VI
7 3 2 bone qua Model I
8 5 2 bone liu Model II
9 18 2 bone dav Model III
10 2 2 bone ema Model IV
# ... with 5,978 more rows
And the other:
# A tibble: 954 x 5
values ID structure model model_f
<dbl> <chr> <chr> <chr> <fctr>
1 9 01 bone qua Model I
2 8 01 bone liu Model II
3 22 01 bone dav Model III
4 6 01 bone ema Model IV
5 5 01 bone tho Model V
6 9 01 bone ran Model VI
7 12 02 bone qua Model I
8 11 02 bone liu Model II
9 24 02 bone dav Model III
10 9 02 bone ema Model IV
# ... with 944 more rows
So they are not the same size and the IDs are not the same (the data are not related), but I still wish to merge the histograms in order to see the difference between the two data sets.
I thought this might do the trick:
ggplot() +
geom_histogram(data=data1, aes(x=values), binwidth=1, fill='blue', alpha=0.3, color="black", aes(y=(..count..)*100/(sum(..count..)/6))) +
geom_histogram(data=data2, aes(x=values), binwidth=1, fill='blue', alpha=0.3, color="black", aes(y=(..count..)*100/(sum(..count..)/6))) +
facet_wrap(~ model_f, ncol = 3)
However, that didn't do much.
So now I'm stuck. Is this possible to do, or...?
Here is my crack at this, based on the built-in dataset iris (since you did not provide reproducible data). To create the smaller, shifted dataset, I am using dplyr to keep the first 20 rows from each species and add 1 to the sepal length for each observation:
library(dplyr)
smallIris <-
iris %>%
group_by(Species) %>%
slice(1:20) %>%
ungroup() %>%
mutate(Sepal.Length = Sepal.Length + 1)
Your code at the end gets you close, but you did not specify different colors for the two histograms. If you set the fill differently for each, you will get them to show up differently. You could either set this directly (e.g., change "blue" to "red" in one of them) or by setting a name within aes. Setting it in aes has the advantage of creating (and labeling) a legend:
ggplot() +
geom_histogram(data=iris
, aes(x=Sepal.Length
, fill = "Big"
, y=(..count..)*100/(sum(..count..)))
, alpha=0.3) +
geom_histogram(data=smallIris
, aes(x=Sepal.Length
, fill = "Small"
, y=(..count..)*100/(sum(..count..)))
, alpha=0.3) +
facet_wrap(~Species)
Creates this:
However, I tend to dislike the look of overlapping histograms, so I would prefer to use a density plot. You can do it just like the above (just change the geom_histogram), but I think you get a bit more control (and the ability to expand this to more than two groups) by stacking the data. Again, this uses dplyr to stitch the two datasets together:
bigIris <-
bind_rows(
small = smallIris
, big = iris
, .id = "Source"
)
Then, you can create the plot relatively easily:
bigIris %>%
ggplot(aes(x = Sepal.Length, col = Source)) +
geom_line(stat = "density") +
facet_wrap(~Species)
creates:
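If you would rather keep histograms, the stacked data works for them too. This is only a sketch: the binwidth of 0.25 is my own assumption, and it uses after_stat(density) (available in ggplot2 3.3.0 and later) instead of the percentage formula from the question, so that each source is scaled by its own size rather than by the combined row count:

```r
library(dplyr)
library(ggplot2)

# Rebuild the smaller, shifted dataset and the stacked data from above
smallIris <- iris %>%
  group_by(Species) %>%
  slice(1:20) %>%
  ungroup() %>%
  mutate(Sepal.Length = Sepal.Length + 1)

bigIris <- bind_rows(small = smallIris, big = iris, .id = "Source")

# position = "identity" overlays the two sources instead of stacking the bars;
# density is computed separately for each Source within each panel
p <- ggplot(bigIris, aes(x = Sepal.Length, fill = Source)) +
  geom_histogram(aes(y = after_stat(density)),
                 binwidth = 0.25,   # assumed value, tune for your data
                 alpha = 0.3,
                 position = "identity") +
  facet_wrap(~Species)
print(p)
```

Because the fill is mapped from a column, adding a third dataset only requires one more entry in bind_rows().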
Related
I am trying to use ggplot to graph 3 linear models:
model = lm(dep_delay ~ season, data = data)
model2 = lm(dep_delay ~ carrier, data = data)
model3 = lm(dep_delay ~ origin, data = data)
My data is structured as follow:
season origin carrier dep_delay
1 winter EWR UA 2
2 winter LGA UA 4
3 winter JFK AA 2
4 winter JFK B6 -1
5 winter LGA DL -6
6 winter EWR UA -4
I am trying to use this line of code:
ggplot(data, aes(x = season, y = dep_delay)) + geom_boxplot() + labs(x="season") + geom_smooth(method = "lm",se=FALSE, col = "blue")
It gives me the plot I want, but it does not put a line on the plot. How do I get the line to appear?
geom_smooth() cannot fit the line because x is not numeric. You can recode season as a number, for example by using as.numeric(season) when specifying aes. This requires season to be a factor, since as.numeric() on a character vector returns NA.
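A minimal sketch of that suggestion, using made-up data (the data frame toy, its values, and the level order are assumptions, not the asker's flights data). The boxplots keep the factor on the x-axis, while the smoothing layer gets the numeric level codes:

```r
library(ggplot2)

set.seed(42)
seasons <- c("winter", "spring", "summer", "fall")

# Hypothetical stand-in for the real data
toy <- data.frame(
  season    = factor(rep(seasons, each = 25), levels = seasons),
  dep_delay = rnorm(100, mean = rep(c(2, 5, 9, 4), each = 25), sd = 5)
)

p <- ggplot(toy, aes(x = season, y = dep_delay)) +
  geom_boxplot() +
  # as.numeric() on a factor returns the level codes 1, 2, 3, 4
  geom_smooth(aes(x = as.numeric(season)),
              method = "lm", formula = y ~ x,
              se = FALSE, colour = "blue") +
  labs(x = "season")
print(p)
```

Note that the fitted line treats the level codes as equally spaced numbers, so it depends on the order of the factor levels and does not correspond to lm(dep_delay ~ season), which estimates one coefficient per season.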
Description of the data
I am trying to produce in R a suitable graphical display of the cluster means.
How can I place the attributes on the x-axis and treat the means for each cluster as trajectories over the items?
All the data is continuous.
What about the following approach: since your variables are on a similar measurement scale (e.g. Likert scale) you could show the distribution of each variable within each cluster (with e.g. boxplots) and visually compare their distribution by using the same axis limits on every cluster.
This can be accomplished by putting your data in a suitable format and using the ggplot2 package to generate the plot. This is shown below.
Step 1: Generate simulated data to mimic the numeric data you have
The generated data contains four non-negative integer variables and a cluster variable with 3 clusters.
library(ggplot2)
set.seed(1717) # make the simulated data repeatable
N = 100
nclusters = 3
cluster = as.numeric( cut(rnorm(N), breaks=nclusters, label=seq_len(nclusters)) )
df = data.frame(cluster=cluster,
x1=floor(cluster + runif(N)*5),
x2=floor(runif(N)*5),
x3=floor((nclusters-cluster) + runif(N)*5),
x4=floor(cluster + runif(N)*5))
df$cluster = factor(df$cluster) # define cluster as factor to ease plotting code below
tail(df)
table(df$cluster)
whose output is:
cluster x1 x2 x3 x4
95 2 5 2 5 2
96 3 5 4 0 3
97 3 3 3 1 7
98 2 5 4 3 3
99 3 6 1 1 7
100 3 5 1 2 5
1 2 3
15 64 21
i.e., out of the 100 simulated cases, the data contains 15 cases in cluster 1, 64 cases in cluster 2 and 21 cases in cluster 3.
Step 2: Prepare the data for plotting
Here we use reshape() from the stats package to transpose the dataset from wide to long so that the four numeric variables (x1, x2, x3, x4) are placed into one single column, suitable for generating a boxplot for each of the four variables which are then grouped by the cluster variable.
vars2transpose = c("x1", "x2","x3", "x4")
df.long = reshape(df, direction="long", idvar="id",
varying=list(vars2transpose),
timevar="var", times=vars2transpose, v.names="value")
head(df.long)
table(df.long$cluster)
whose output is:
cluster var value id
1.x1 1 x1 5 1
2.x1 1 x1 3 2
3.x1 3 x1 5 3
4.x1 1 x1 1 4
5.x1 2 x1 3 5
6.x1 1 x1 2 6
1 2 3
60 256 84
Note that the number of cases in each cluster has increased 4-fold (i.e. the number of numeric variables) since the data is now in transposed long format.
Step 3: Create the variable's boxplots by cluster with line-connected means
We plot horizontal boxplots for each variable x1, x2, x3, x4 that show their distribution in each cluster, and mark the mean values with connected red crosses (the trajectories you are after).
gg <- ggplot(df.long, aes(x=var, y=value))
gg + facet_grid(cluster ~ ., labeller=label_both) +
geom_boxplot(aes(fill=cluster)) +
stat_summary(fun=mean, geom="point", color="red", pch="x", size=3) + # fun replaces fun.y, which is deprecated since ggplot2 3.3.0
stat_summary(fun=mean, geom="line", color="red", aes(group=1)) +
coord_flip() # swap the x and y axis to make boxplots horizontal instead of vertical
which generates the following graph.
The graph might get packed with the many variables you have, so you may want to:
either show vertical boxplots by removing the last coord_flip() line
or remove the boxplots altogether and just show the connected red crosses by eliminating the geom_boxplot() line.
And if you want to compare each variable side by side among the different clusters, you can swap the grouping and x-axis variables as follows:
gg <- ggplot(df.long, aes(x=cluster, y=value))
gg + facet_grid(var ~ ., labeller=label_both) +
geom_boxplot(aes(group=cluster, fill=cluster)) +
stat_summary(fun=mean, geom="point", color="red", pch="x", size=3) + # fun replaces fun.y, which is deprecated since ggplot2 3.3.0
stat_summary(fun=mean, geom="line", color="red", aes(group=1)) +
coord_flip() # swap the x and y axis to make boxplots horizontal instead of vertical
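If you only want the trajectories, one option is to compute the cluster means first and draw them in a single panel, one line per cluster. The following is a sketch on freshly simulated data (the cluster assignment via rep() and the object names are my own assumptions, chosen so the block runs on its own):

```r
library(ggplot2)

set.seed(1717)
n <- 99
cluster <- rep(1:3, length.out = n)   # assumed: equal-sized clusters

df <- data.frame(cluster = factor(cluster),
                 x1 = floor(cluster + runif(n) * 5),
                 x2 = floor(runif(n) * 5),
                 x3 = floor((3 - cluster) + runif(n) * 5),
                 x4 = floor(cluster + runif(n) * 5))

# Long format: one row per case and variable
vars <- c("x1", "x2", "x3", "x4")
df.long <- data.frame(cluster = rep(df$cluster, times = length(vars)),
                      var     = rep(vars, each = n),
                      value   = unlist(df[vars], use.names = FALSE))

# One mean per cluster and variable
cluster.means <- aggregate(value ~ cluster + var, data = df.long, FUN = mean)

# Variables on the x-axis, one trajectory per cluster
p <- ggplot(cluster.means,
            aes(x = var, y = value, group = cluster, colour = cluster)) +
  geom_line() +
  geom_point()
print(p)
```

This loses the distribution information the boxplots provide, but it scales better when there are many variables.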
I'm trying to create simple line charts with R that connect the averages of groups of respondents (it would also be nice to label them or distinguish them with different colors, etc.).
My data is in long format and sorted as shown below (I also have it in wide format if that's of any value):
ID gender week class motivation
1 male 0 1 100
1 male 6 1 120
1 male 10 1 130
...
2 female 0 1 90
2 female 6 1 NA
2 female 10 1 117
...
3 male 0 2 89
3 male 6 2 112
3 male 10 2 NA
...
Basically, every respondent was measured a total of n times and the occasions (week) were the same for everyone. Some respondents were missing during one or more occasions. Let's say for motivation. Variables like gender, class and ID don't change, motivation does.
I tried to get a line chart using ggplot2
## define base for the graphs and store in object 'plot'
plot <- ggplot(data = DataRlong, aes(x = week, y = motivation, group = gender))
plot + geom_line()
As grouping variable, I want to use class or gender for example.
However, my approach does not lead to lines that connect the averages per group.
I also get vertical lines at each measurement occasion. What does this mean? The only fix I could imagine is to create a new variable average.motivation, compute the average for every group per occasion, and then assign this average to all members of the group. However, this would mean that I would have to do this for every single grouping variable whenever I want to display group lines based on another factor.
Also, how does the plot handle missing data? (If one member of a group has a missing value, I still want the group average of this occasion to calculate the point rather than omitting the whole occasion for that group ).
Edit:
Thank you, the solution with dplyr works great for all my categorical variables.
Now, I'm trying to figure out how I can distinguish between subgroups by colouring their lines based on a second/third factor.
For example, I plot 20 lines for the groups of "class2", but rather than having all of them in 20 different colors, I would like them to use the same colour, if they belong to the same type of class ("class_type", e.g. A, B or C =20 lines, three groups of colours).
I've added the second factor to "mean_data2". That works well. Next, I've tried to change the colour argument in ggplot, (also tried as in geom_line), but that way, I don't have 20 lines anymore.
mean_data2 <- group_by(DataRlong, class2, class_type, occ)%>%
summarise(procras = mean(procras, na.rm = TRUE))
library(ggplot2)
ggplot(na.omit(mean_data2), aes(x = occ, y = procras,
colour=class2)) + geom_point() + geom_line(aes(colour=class_type))
You can also use the dplyr package to aggregate the data:
library(dplyr)
mean_data <- group_by(data, gender, week) %>%
summarise(motivation = mean(motivation, na.rm = TRUE))
You can use na.omit() to get rid of the NA values as follows:
library(ggplot2)
ggplot(na.omit(mean_data), aes(x = week, y = motivation, colour = gender)) +
geom_point() + geom_line()
There is no need here to explicitly use the group aesthetic because ggplot will automatically group the lines by the categorical variables in your plot. And the only categorical variable you have is gender. (See this answer for more information).
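The automatic grouping stops being enough once colour is mapped to a coarser variable than the one that identifies each line, which is the situation in the question's edit. In that case an explicit group aesthetic keeps one line per class while the colour follows the class type. A sketch with made-up summary data (six classes instead of twenty; all values are hypothetical):

```r
library(ggplot2)

set.seed(1)
# Hypothetical summary data: 6 classes, 3 occasions, 3 class types
mean_data2 <- expand.grid(class2 = paste0("class", 1:6), occ = c(0, 6, 10))
mean_data2$class_type <- c("A", "B", "C")[(as.integer(mean_data2$class2) - 1) %% 3 + 1]
mean_data2$procras <- rnorm(nrow(mean_data2), mean = 100, sd = 10)

# group decides which points are connected; colour only decides the colour
p <- ggplot(mean_data2,
            aes(x = occ, y = procras, group = class2, colour = class_type)) +
  geom_point() +
  geom_line()
print(p)

# Number of separate lines drawn by the geom_line layer
n_lines <- length(unique(ggplot_build(p)$data[[2]]$group))
```

Without group = class2, ggplot would connect all points that share a class_type into a single line.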
Another possibility is using stat_summary, so you can do it only with ggplot.
ggplot(data = DataRlong, aes(x = week, y = motivation, group = gender)) +
stat_summary(geom = "line", fun = mean) # fun replaces fun.y, which is deprecated since ggplot2 3.3.0
You almost certainly have to make sure those grouping variables are factors.
I'm not quite sure what you want, but here's a shot...
library("ggplot2")
df <- read.table(textConnection("ID gender week class motivation
1 male 0 1 100
1 male 6 1 120
1 male 10 1 130
2 female 0 1 90
2 female 6 1 NA
2 female 10 1 117
3 male 0 2 89
3 male 6 2 112
3 male 10 2 NA"), header=TRUE, stringsAsFactors=FALSE)
df2 <- aggregate(df$motivation, by=list(df$gender, df$week),
function(x)mean(x, na.rm=TRUE))
names(df2) <- c("gender", "week", "avg")
df2$gender <- factor(df2$gender)
ggplot(data = df2[!is.na(df2$avg), ],
aes(x = week, y = avg, group=gender, color=gender)) +
geom_point()+geom_line()
I'm using ggplot to plot an ordered sequence of numbers that is colored by a factor. For example, given this fake data:
# Generate fake data
library(dplyr)
set.seed(12345)
plot.data <- data.frame(fitted = rnorm(20),
actual = sample(0:1, 20, replace=TRUE)) %>%
arrange(fitted)
head(plot.data)
fitted actual
1 -1.8179560 0
2 -0.9193220 1
3 -0.8863575 1
4 -0.7505320 1
5 -0.4534972 1
6 -0.3315776 0
I can easily plot the actual column from rows 1–20 as colored lines:
# Plot with lines
ggplot(plot.data, aes(x=seq(length.out = length(actual)), colour=factor(actual))) +
geom_linerange(aes(ymin=0, ymax=1))
The gist of this plot is to show how often the actual numbers appear sequentially across the range of fitted values. As you can see in the image, sequential 0s and 1s are readily seen as sequential blue and red vertical lines.
However, I'd like to move away from the lines and use geom_rect instead to create bands for the sequential number. I can fake this with really thick lineranges:
# Fake rectangular regions with thick lines
ggplot(plot.data, aes(x=seq(length.out = length(actual)), colour=factor(actual))) +
geom_linerange(aes(ymin=0, ymax=1), size=10)
But the size of these lines is dependent on the number of observations—if they're too thick, they'll overlap. Additionally, doing this means that there are a bunch of extraneous graphical elements that are plotted (i.e. sequential rectangular sections are really just a bunch of line segments that bleed into each other). It would be better to use geom_rect instead.
However, geom_rect requires that data include minimum and maximum values for x, meaning that I need to reshape actual to look something like this instead:
xmin xmax colour
0 1 red
1 5 blue
I need to programmatically calculate the run length of each color to mark the beginning and end of that color. I know that R has the rle() function, which is likely the best option for calculating the run length, but I'm unsure about how to split the run length into two columns (xmin and xmax).
What's the best way to calculate the run length of a variable so that geom_rect can plot it correctly?
Thanks to @baptiste, it seems that the best way to go about this is to condense the data into just those rows where actual changes, recording the position x of each change:
condensed <- plot.data %>%
mutate(x = seq_along(actual), change = c(0, diff(actual))) %>%
subset(change != 0 ) %>% select(-change)
first.row <- plot.data[1,] %>% mutate(x = 0)
condensed.plot.data <- rbind(first.row, condensed) %>%
mutate(xmax = lead(x),
xmax = ifelse(is.na(xmax), max(x) + 1, xmax)) %>%
rename(xmin = x)
condensed.plot.data
# fitted actual xmin xmax
# 1 -1.8179560 0 0 2
# 2 -0.9193220 1 2 6
# 3 -0.3315776 0 6 9
# 4 -0.1162478 1 9 11
# 5 0.2987237 0 11 14
# 6 0.5855288 1 14 15
# 7 0.6058875 0 15 20
# 8 1.8173120 1 20 21
ggplot(condensed.plot.data) +
geom_rect(aes(xmin=xmin, xmax=xmax, ymin=0, ymax=1, fill=factor(actual)))
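Since the question mentions rle(), here is a sketch of how the run lengths could be turned into xmin and xmax directly. The vector actual is made up for illustration, and the convention that the first band starts at 0 is my assumption:

```r
library(ggplot2)

# Made-up sequence of 0s and 1s, already in plotting order
actual <- c(0, 1, 1, 1, 1, 0, 0, 0, 1, 1)

runs <- rle(actual)             # lengths and values of consecutive runs
xmax <- cumsum(runs$lengths)    # right edge of each run
xmin <- xmax - runs$lengths     # left edge of each run

bands <- data.frame(xmin   = xmin,
                    xmax   = xmax,
                    actual = factor(runs$values))

p <- ggplot(bands) +
  geom_rect(aes(xmin = xmin, xmax = xmax, ymin = 0, ymax = 1, fill = actual))
print(p)
```

Each row of bands is one rectangle, so the number of graphical elements equals the number of runs rather than the number of observations.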
Hi, I'm trying to draw a histogram in ggplot, but my data doesn't list every observation; it only has the values and their number of occurrences.
value=c(1,2,3,4,5,6,7,8,9,10)
weight<-c(8976,10857,10770,14075,18075,20757,24770,14556,11235,8042)
df <- data.frame(value,weight)
df
value weight
1 1 8976
2 2 10857
3 3 10770
4 4 14075
5 5 18075
6 6 20757
7 7 24770
8 8 14556
9 9 11235
10 10 8042
Does anybody know either how to bin the values or how to plot a histogram of binned values?
I want to get something that would look like
bin weight
1 1-2 19833
2 3-4 24845
...
I would add another variable that designates the binning and then
df$group <- rep(c("1-2", "3-4", "5-6", "7-8", "9-10"), each = 2)
draw it using ggplot.
ggplot(df, aes(y = weight, x = group)) + stat_summary(fun=sum, geom="bar") # fun replaces fun.y, which is deprecated since ggplot2 3.3.0
Here's one method for binning the data up:
df$bin <- findInterval(df$value,seq(1,max(df$value),2))
result <- aggregate(df["weight"],df["bin"],sum)
# get your named bins automatically without specifying them individually
result$bin <- tapply(df$value,df$bin,function(x) paste0(x,collapse="-"))
# result
bin weight
1 1-2 19833
2 3-4 24845
3 5-6 38832
4 7-8 39326
5 9-10 19277
# barplot it (base example since Roman has covered ggplot)
with(result,barplot(weight,names.arg=bin))
Just expand your data:
value=c(1,2,3,4,5,6,7,8,9,10)
weight<-c(8976,10857,10770,14075,18075,20757,24770,14556,11235,8042)
dat = rep(value,weight)
# plot result
histres = hist(dat)
And histres contains some potentially useful information if you want details of the histogram data.
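If you want to stay in ggplot without expanding the data, geom_histogram() also accepts a weight aesthetic, so each row can count as many times as its weight. A sketch, where binwidth = 2 and boundary = 0.5 are my assumptions, chosen so that the bins cover 1-2, 3-4, and so on:

```r
library(ggplot2)

value  <- 1:10
weight <- c(8976, 10857, 10770, 14075, 18075, 20757, 24770, 14556, 11235, 8042)
df <- data.frame(value, weight)

# Each bar's height is the sum of the weights falling into that bin
p <- ggplot(df, aes(x = value, weight = weight)) +
  geom_histogram(binwidth = 2, boundary = 0.5, colour = "black")
print(p)

# Bin totals as computed by ggplot
bin_totals <- ggplot_build(p)$data[[1]]$count
```

The values in bin_totals should agree with the aggregated weights in the binning answer above.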