Distribution chart using ggplot 2 [duplicate] - r

I've tried to search for an answer, but can't seem to find the right one that does the job for me.
I have a dataset (data) with two variables: people's ages (age) and number of awards (awards)
My objective is to plot the number of awards against age in R. FYI, a person can have multiple awards and people can have the same age.
I tried to plot a histogram and barplot, but the problem with that is that it counts the number of observations instead of summing the number of awards.
A sample dataset:
age <- c(21,22,22,25,30,34,45,26,37,46,49,21)
awards <- c(0,3,2,1,0,0,1,3,1,1,1,1)
data <- data.frame(cbind(age,awards))
What I'm looking for is a histogram (or barplot) that represents this data.
Ideally, I'd want the ages to be split into age groups. For example,
20-30, 31-40, 41-50 and then the total number of awards for each group.
The age group would be on the x-axis and the total number of awards for each age group would be on the y-axis.
Thanks!

We can use the aggregate function and then use the ggplot2 package. I don't make too many barplots in base R these days so I'm not sure of the best way to do it without loading ggplot2:
create sample data
#data
set.seed(123)
dat <- data.frame(age = sample(20:50, 200, replace = TRUE),
awards = rpois(200, 3))
head(dat)
age awards
1 28 2
2 44 6
3 32 3
4 47 3
5 49 2
6 21 5
By age
#aggregate
sum_by_age <- aggregate(awards ~ age, data = dat, FUN = sum)
library(ggplot2)
ggplot(sum_by_age, aes(x = age, y = awards))+
geom_bar(stat = 'identity')
By age group
#create groups
dat$age_group <- ifelse(dat$age <= 30, '20-30',
ifelse(dat$age <= 40, '31-40',
'41 +'))
sum_by_age_group <- aggregate(awards ~ age_group, data = dat, FUN = sum)
ggplot(sum_by_age_group, aes(x = age_group, y = awards))+
geom_bar(stat = 'identity')
Note
We could skip the aggregate step altogether and just use:
ggplot(dat, aes(x = age, y = awards)) + geom_bar(stat = 'identity')
but I prefer not to, because an intermediate data step can be useful elsewhere in your analytical pipeline, for comparisons other than visualization.
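For the sample data in the question itself, the group-then-sum step could be sketched as follows. This is only an illustration under a few assumptions: the data frame is called people here (a made-up name, chosen to avoid masking the built-in data() function), ages are whole numbers, and the breaks are picked so that the labels match the requested 20-30, 31-40 and 41-50 groups.

```r
# Sample data from the question
age    <- c(21, 22, 22, 25, 30, 34, 45, 26, 37, 46, 49, 21)
awards <- c(0, 3, 2, 1, 0, 0, 1, 3, 1, 1, 1, 1)
people <- data.frame(age, awards)  # hypothetical name, not from the question

# Right-closed bins (19,30], (30,40], (40,50], labelled as the requested groups
people$age_group <- cut(people$age,
                        breaks = c(19, 30, 40, 50),
                        labels = c("20-30", "31-40", "41-50"))

# Sum the awards within each group instead of counting rows
awards_by_group <- aggregate(awards ~ age_group, data = people, FUN = sum)
print(awards_by_group)

# Bar chart of the totals: age group on the x-axis, total awards on the y-axis
barplot(awards_by_group$awards,
        names.arg = as.character(awards_by_group$age_group),
        xlab = "Age group", ylab = "Total awards")
```

The same awards_by_group data frame can be passed to ggplot() with geom_bar(stat = 'identity'), as in the answer above.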

For completeness, I am adding a base R solution to @bouncyball's great answer. I will use their synthetic data, but I will use cut to create the age groups before aggregation.
# Creates data for plotting
> set.seed(123)
> dat <- data.frame(age = sample(20:50, 200, replace = TRUE),
awards = rpois(200, 3))
# Created a new column containing the age groups
> dat[["ageGroups"]] <- cut(dat[["age"]], c(-Inf, 20, 30, 40, Inf),
right = FALSE)
cut divides a numeric vector into intervals at the breaks given in its second argument. right = FALSE makes each interval closed on the left and open on the right (i.e. 20 <= x < 30 rather than the default of 20 < x <= 30). The groups do not have to be equally spaced. If you do not want to include data above or below a certain value, remove the Inf from the end or the -Inf from the beginning of the breaks, respectively; values outside the breaks are then returned as <NA>. If you would like to give your groups names, you can do so with the labels argument.
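A small sketch of how these arguments behave, using a handful of made-up ages (the vector and the label names are illustrative, not part of the original data):

```r
ages <- c(20, 29, 30, 39, 40, 50)  # made-up values sitting on the interval edges

# Left-closed intervals: 20 <= x < 30, 30 <= x < 40, 40 <= x
left_closed <- cut(ages, breaks = c(20, 30, 40, Inf), right = FALSE)

# Default right-closed intervals: 20 < x <= 30, and so on.
# 20 is not inside any interval here, so it becomes NA.
right_closed <- cut(ages, breaks = c(20, 30, 40, Inf))

# Custom group names through the labels argument
labelled <- cut(ages, breaks = c(20, 30, 40, Inf), right = FALSE,
                labels = c("20-29", "30-39", "40+"))

print(data.frame(ages, left_closed, right_closed, labelled))
```

Note how 30 falls into the second group with right = FALSE but into the first group with the default.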
Now we can aggregate based on the groups we created.
> (summedGroups <- aggregate(awards ~ ageGroups, dat, FUN = sum))
ageGroups awards
1 [20,30) 188
2 [30,40) 212
3 [40, Inf) 194
Finally, we can plot these data using the barplot function. The key here is to pass the age groups as bar labels through names.arg.
> barplot(summedGroups[["awards"]], names.arg = summedGroups[["ageGroups"]])

Related

How to reorder bars in barplot using ggplot 2 [duplicate]

I want to order the bars by beetle number, i.e. 0, then 1-5, 6-10, 11-15 and Above 15. I also want Village to come first, followed by Municipality. The panels should be ordered by the age of the building as well: Under 5 years first, then 5-10 years, followed by Above 10 years.
ggplot(g,aes(x=Locality.Division))+
geom_bar(aes(fill=Number.of.Beetle),position="dodge")+
facet_wrap(~Building.Age)
#> Error in ggplot(g, aes(x = Locality.Division)): could not find function "ggplot"
Created on 2021-05-30 by the reprex package (v2.0.0)
The order of the bars is determined by the order of the factor levels of the variable.
The Number.of.Beetle variable in your data is a character variable. ggplot() converts this to a factor variable with factor(), which by default sorts character values alphabetically. To specify a different order, convert the variable to a factor yourself before plotting:
library(dplyr)
# the levels must match the values in your data exactly; unlisted values become NA
g <- mutate(g,
Number.of.Beetle = factor(Number.of.Beetle, levels = c("1-5", "6-10", "11-15", "15+"))
)
If the order is shown backwards, then also use forcats::fct_rev() to reverse the order:
g <- mutate(g,
Number.of.Beetle = forcats::fct_rev(factor(Number.of.Beetle, levels = c("1-5", "6-10", "11-15", "15+")))
)
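To see the effect of the levels argument without any plotting, here is a minimal base R sketch. The values are made up, and the last step is a base R stand-in for forcats::fct_rev(), not that function itself.

```r
beetles <- c("6-10", "1-5", "15+", "11-15", "1-5")  # made-up values

# Default: levels are the sorted unique values (the sort order depends on the locale)
default_levels <- levels(factor(beetles))
print(default_levels)

# Explicit levels fix the order used by table() and by ggplot2 bars
wanted <- c("1-5", "6-10", "11-15", "15+")
beetles_f <- factor(beetles, levels = wanted)
print(table(beetles_f))

# Reversing the level order in base R; the data values themselves are unchanged
beetles_rev <- factor(beetles_f, levels = rev(levels(beetles_f)))
print(levels(beetles_rev))
```

Only the level order changes; the values stored in the vector stay the same, which is why the bars move but the counts do not.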
I hope the following helps to get you started. You did not provide a minimal reproducible example, thus, I simulate some data. I also adapted the variable names.
A key strategy to control the order of variables is making them a factor. I do this when plotting.
Note: number of beetles is quasi-sorted given the values used. Here you could also work with a factor, if needed.
library(ggplot2)
set.seed(666) # fix random picks for replicability
# simulate data of 30 buildings
df <- data.frame(
Building = 1:30
, Building.Age = sample(x = c("U5","5-10","A10"), size = 30, replace = TRUE)
, Nbr.Beetle = sample(x = c("1-5","6-10","11-15","15+"), size = 30, replace = TRUE)
, Locality = sample(x = c("A","B","C"), size = 30, replace = TRUE))
# plot my example
ggplot(data = df, aes(x=Locality)) +
geom_bar(aes(fill=Nbr.Beetle),position="dodge") +
# --------------------- control the sequence of panels by forcing level sequence of factor
facet_wrap(. ~ factor( Building.Age, levels = c("U5","5-10","A10") ) )

Plotting of cluster datapoints between two columns of a dataframe

I have a data frame in which two of the columns are Age and Income. I have clustered the data using k-means. Now I want to plot Age against Income, distinguishing the data points by cluster (using colours).
df
Age Income Cluster
20 10000 1
30 20000 2
40 25000 1
50 20000 2
60 10000 3
70 15000 3
.
plot(df$Age,df$Income)
I want to plot Income against Age, with each data point coloured by its cluster.
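You could use ggplot() for this:
library(ggplot2)
# factor(Cluster) gives one discrete colour per cluster instead of a continuous gradient
ggplot(df) +
geom_point(mapping = aes(x = Age, y = Income, color = factor(Cluster)))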
Here it is creating the aesthetics based on the values in the data (x position of the point is based on age, y position on the income, and colour of the point on the variable "cluster").
You could also do this in base R; here's an example using the mtcars dataset:
plot(x = mtcars$wt, y = mtcars$mpg, col = mtcars$cyl)
Try something like this:
library(ggplot2)
ggplot() + geom_point(data = df, aes(x = Age, y = Income, group = Cluster, color = Cluster))
I found a way using the plot function:
df
Age Income
20 10000
30 20000
40 25000
50 20000
60 10000
70 15000
# df here contains only the Age and Income columns (no Cluster column, unlike in the question)
clust <- kmeans(df, centers = 3)
# cluster is one of the components of the kmeans object
plot(df, col = clust$cluster, las = 1, xlab = "Age", ylab = "Income")
If your data frame contains more than two columns, subset it to the two columns you want to plot.
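Putting the pieces together, a self-contained sketch with the six rows from the question might look like this. Two things here are assumptions rather than part of the answers above: the columns are standardised with scale() before clustering (Income is on a much larger scale than Age and would otherwise dominate the distances), and nstart = 10 is an arbitrary choice. Cluster numbers are arbitrary labels and may differ from those shown in the question.

```r
# The six rows shown in the question
df <- data.frame(Age    = c(20, 30, 40, 50, 60, 70),
                 Income = c(10000, 20000, 25000, 20000, 10000, 15000))

set.seed(1)  # k-means starts from random centres
km <- kmeans(scale(df), centers = 3, nstart = 10)  # scaling is an assumption, see above

# Store the assignment as a factor so it is treated as a category
df$Cluster <- factor(km$cluster)

# Base R scatter plot, one colour per cluster
plot(df$Age, df$Income, col = as.integer(df$Cluster), pch = 19,
     xlab = "Age", ylab = "Income")
legend("topright", legend = levels(df$Cluster),
       col = seq_along(levels(df$Cluster)), pch = 19, title = "Cluster")
```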

Efficient way to summarise, re-group and plot data set of group frequencies in R

I have a set of data for three groups (A, B, C) which gives information on how often a certain value "x" (between -3 and +3) is observed for that group (0 to 100). To give a simplified example:
df <- data.frame(x = seq(-3, 3, 1),
A = c(0, 10, 25, 30, 15, 0, 0),
B = c(25, 30, 24, 29, 2, 15, 0),
C = c(0, 0, 5, 10, 20, 30, 30))
The actual data set is quite big, however, so there is a large number of very detailed x values (at least two decimals) for which each group has associated frequencies, which often drop to near-zero for certain x values. When plotting this using the command below, the result looks rather convoluted.
library(reshape2)
library(ggplot2)
df <- melt(df, id = "x")
ggplot(df, aes(x=x, y=value, color=variable)) + geom_line()
1. What would be the best way to calculate summary statistics (mean, median, ...) for each group?
2. What would be the most effective way to aggregate x values and their neighbours into ranges of x values, summing up the associated group frequencies in the process, to get a more generalised picture?
3. How would one tell ggplot to produce a histogram or density plot which accounts for the observed frequencies?
I thought of iterating over the data set and doing all of the above "manually", but figured that this would be inefficient and prone to errors. Any suggestions you may have would be greatly appreciated!
In order to create a histogram you need to remove the "value" variable and create the corresponding number of rows for "x" based on that value. So, if for group A you have x = 3 and value = 10, the process has to create x = 3 for group A 10 times. Run the process step by step to see how it works. I've included decimals for "x".
library(reshape2)
library(dplyr)
library(ggplot2)
set.seed(22)
df <- data.frame(x = seq(-3, 3, 0.01),
A = round(c(rnorm(200, 30,3),rnorm(401,20,4))),
B = round(c(rexp(300, 1/5), rexp(301,1/20))),
C = round(runif(601, 2, 25)))
df <- melt(df, id = "x")
# create number of rows for each x and group based on the value
df2 <-
df %>%
rowwise() %>%
do(data.frame(x = rep(.$x, .$value),
variable = rep(.$variable, .$value))) %>%
ungroup()
# check mean and median x values for each group
df2 %>%
group_by(variable) %>%
summarise(N = n(),
MEAN_X= mean(x),
MEDIAN_X= median(x))
# variable N MEAN_X MEDIAN_X
# 1 A 13979 -0.27480292 -0.47
# 2 B 7051 0.84527159 1.03
# 3 C 7906 -0.03190741 -0.07
ggplot(df2, aes(x=x, fill=variable)) +
geom_histogram(binwidth=.2, alpha=.5, position="dodge")
ggplot(df2, aes(x=x, colour=variable)) +
geom_density()
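As an alternative sketch that is not part of the answer above, the frequencies can be used directly as weights instead of expanding the rows. The example uses the small data frame from the question; the names groups and long are made up for illustration. The ggplot2 part relies on the weight aesthetic of geom_histogram() and is only run if ggplot2 is installed.

```r
# Simplified data from the question: x values and per-group frequencies
df <- data.frame(x = seq(-3, 3, 1),
                 A = c(0, 10, 25, 30, 15, 0, 0),
                 B = c(25, 30, 24, 29, 2, 15, 0),
                 C = c(0, 0, 5, 10, 20, 30, 30))
groups <- c("A", "B", "C")

# Mean of x per group, weighting each x by its frequency
weighted_means <- sapply(df[groups], function(freq) weighted.mean(df$x, w = freq))

# Same quantity via row expansion, as in the answer above
expanded_means <- sapply(df[groups], function(freq) mean(rep(df$x, times = freq)))
print(rbind(weighted_means, expanded_means))

# Weighted histogram without expanding rows (needs ggplot2)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  long <- data.frame(x = rep(df$x, times = length(groups)),
                     variable = rep(groups, each = nrow(df)),
                     value = unlist(df[groups], use.names = FALSE))
  p <- ggplot(long, aes(x = x, weight = value, fill = variable)) +
    geom_histogram(binwidth = 1, alpha = 0.5, position = "dodge")
  print(p)
}
```

Both approaches give the same means; the weighted version avoids building a data frame with one row per observation, which may matter for the large data set described in the question.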
If you want to group x for each group in terms of the frequencies you can use a regression tree method that will split x into bins and will give you the break-point(s):
library(party)
# tree for group A only
model = ctree(value~x+variable, data = df[df$variable=="A",])
plot(model, type = "simple")
This tells you that for group A there's a break point at x = -1.01 (you can see it in the histograms as well), which splits x into two groups. The left side averages a value of 29.8 and the right side a value of 19.99. The bins contain 200 and 401 observations respectively, which is correct, as that is how I created this variable in the beginning.
Note that trees are statistical models, which split your variable(s) based on statistically significant differences (or other metrics). You can't force a grouping yourself. If you want to do that, it's better to split your variable "x" into N groups (based on quantiles, perhaps, or something else that makes more sense to you) and see how the value changes within those groups.
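A manual grouping of this kind could be sketched with cut() and aggregate(), shown here on the small data frame from the question. The break points are arbitrary and chosen only for illustration; quantile-based breaks could be supplied in the same way.

```r
# Simplified data from the question: x values and per-group frequencies
df <- data.frame(x = seq(-3, 3, 1),
                 A = c(0, 10, 25, 30, 15, 0, 0),
                 B = c(25, 30, 24, 29, 2, 15, 0),
                 C = c(0, 0, 5, 10, 20, 30, 30))

# Assign each x to a range; include.lowest keeps x = -3 in the first bin
df$bin <- cut(df$x, breaks = c(-3, -1, 1, 3), include.lowest = TRUE)

# Sum the frequencies of every group within each range of x
binned <- aggregate(cbind(A, B, C) ~ bin, data = df, FUN = sum)
print(binned)
```

The result has one row per range of x and one column of summed frequencies per group, which can then be melted and plotted as bars.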

Plots in R (ggplot2) for time series with multiple values per time?

Let's say I have data consisting of the time I leave the house and the number of minutes it takes me to get to work. I'll have some repeated values:
08:00, 20
08:04, 25
08:30, 40
08:20, 23
08:04, 22
Some times will repeat (like 08:04). What I want to do is run a scatter plot that is correctly scaled on the x-axis but allows these multiple values per entry, so that I can view the trend.
Is a time-series even what I want to be using? I've been able to plot a time series graph that has one value per time, and I've gotten multiple values plotted but without the time-series scaling. Can anyone suggest a good approach? Preference for ggplot2 but I'll take standard R plotting if it's easier.
First, let's prepare some more data:
set.seed(123)
df <- data.frame(Time = paste0("08:", sample(35:55, 40, replace = TRUE)),
Length = sample(20:50, 40, replace = TRUE),
stringsAsFactors = FALSE)
df <- df[order(df$Time), ]
df$Attempt <- unlist(sapply(rle(df$Time)$lengths, function(i) 1:i))
df$Time <- as.POSIXct(df$Time, format = "%H:%M") # Fixing y axis
head(df)
Time Length Attempt
6 08:35 24 1
18 08:35 43 2
35 08:35 34 3
15 08:37 37 1
30 08:38 33 1
38 08:39 38 1
As I understand it, you want to preserve the order of observations that share the same departure time. At first I ignored that and got a scatter plot like this:
ggplot(data = df, aes(x = Length, y = Time)) +
geom_point(aes(size = Length, colour = Length)) +
geom_path(aes(group = Time, colour = Length), alpha = I(1/3)) +
scale_size(range = c(2, 7)) + theme(legend.position = 'none')
but with three dimensions (Time, Length and Attempt), a scatter plot can no longer show all the information. I hope I understood you correctly and this is what you are looking for:
ggplot(data = df, aes(y = Time, x = Attempt)) + geom_tile(aes(fill = Length))
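For the plain scatter plot asked about in the question, with time on a properly scaled x-axis and repeated departure times kept as separate points, a minimal sketch using the five observations from the question could look like this. The column names are made up, and the time zone is fixed to UTC only to keep the example reproducible. The ggplot2 part is only run if ggplot2 is installed.

```r
# Commute log from the question: departure time and minutes to get to work
commute <- data.frame(left    = c("08:00", "08:04", "08:30", "08:20", "08:04"),
                      minutes = c(20, 25, 40, 23, 22))

# Parse the clock times; as.POSIXct fills in today's date
commute$Time <- as.POSIXct(commute$left, format = "%H:%M", tz = "UTC")

# Base R: the x-axis is scaled by time, and both 08:04 points are drawn
plot(commute$Time, commute$minutes, pch = 19,
     xlab = "Departure time", ylab = "Minutes to work")

# ggplot2 version with hour:minute axis labels
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(commute, aes(x = Time, y = minutes)) +
    geom_point() +
    scale_x_datetime(date_labels = "%H:%M")
  print(p)
}
```

Because the x variable is a date-time rather than a label, the gap between 08:04 and 08:20 is drawn four times as wide as the gap between 08:00 and 08:04.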
