Grouped Bar Plot Extra Variable in R

I have the following data frame in R:
> data <- data.frame(tbi_military[0:4])
> data
          Severity Active Guard Reserve
1      Penetrating    189    33      12
2           Severe    102    26      11
3         Moderate    709   177      63
4             Mild   5896  1332     541
5 Not Classifiable    122    29      12
And when I do barplot(as.matrix(data)) I get the following output:
Barplot Image
Is there a way for me to get rid of the severity on the x-axis to only have Active, Guard, Reserve? Thanks

One option is to pass only the columns you want to plot to the plotting function. Here that means columns two through four, so a small adjustment to your function call does the job:
barplot(as.matrix(data[, 2:4]))
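If the goal is grouped rather than stacked bars, barplot() can also place the severity levels side by side within each branch. A minimal sketch, assuming a stand-in data frame rebuilt from the printed output in the question (the original tbi_military object is not shown); beside, legend.text and args.legend are standard barplot() arguments:

```r
# Stand-in for the data frame printed in the question (an assumption)
data <- data.frame(
  Severity = c("Penetrating", "Severe", "Moderate", "Mild", "Not Classifiable"),
  Active   = c(189, 102, 709, 5896, 122),
  Guard    = c(33, 26, 177, 1332, 29),
  Reserve  = c(12, 11, 63, 541, 12)
)

# Keep only the numeric columns; one matrix row per severity level
counts <- as.matrix(data[, 2:4])
rownames(counts) <- data$Severity

# beside = TRUE draws the rows as adjacent bars instead of stacking them
barplot(counts, beside = TRUE, legend.text = rownames(counts),
        args.legend = list(x = "topright"))
```

Because Severity now lives in the row names rather than in a column, it shows up in the legend and no longer appears as a bar on the x-axis.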
A solution within the tidyverse (dplyr, tidyr and ggplot2) would be this:
library(dplyr)
library(tidyr)
library(ggplot2)
data %>%
# get data in tidy format to be able to use ggplot2 efficiently
tidyr::pivot_longer(-Severity, names_to = "Type", values_to = "Value") %>%
# set up the plot by assigning variable to plot
ggplot2::ggplot(aes(Type, Value, fill = Severity)) +
# put out a bar chart with stat parameter set for stacked barchart
ggplot2::geom_bar(stat = "identity")

Related

How to plot multiple boxplots with a single variable each on ggplot2?

I have a dataset looks like this:
01/02/2013 02/02/2013 03/02/2013 04/02/2013
1 2 3 3
2 1 6 7
3 3 4 2
4 1 1 8
I want to make a graph with n boxplots according to the number of the columns in my dataset, where each boxplot only contains one variable which is its corresponding column. So in this case, there would be 4 boxplots.
I used the boxplot() function and it worked for my data; however, I want to use geom_jitter() from ggplot2 to beautify my plots. ggplot2 requires both an x and a y variable, which I don't really have in my dataset.
This is what I want for my plot:
Bring your data into long format with pivot_longer from the tidyr package (part of the tidyverse).
Use ggplot from the ggplot2 package (also part of the tidyverse).
Add geom_boxplot, and geom_jitter if needed.
library(tidyverse)
df %>%
mutate(id = row_number()) %>%
pivot_longer(
cols = starts_with("X"),
names_to = "names",
values_to = "values"
) %>%
ggplot(aes(x=names, y=values, fill=names))+
geom_boxplot() +
geom_jitter(aes(y=values))
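The starts_with("X") selector above relies on how the file was read: read.table() and read.csv() pass headers through make.names() by default, so a header such as 01/02/2013 becomes X01.02.2013. A small base-R sketch of that behaviour, using the four columns from the question as inline text (the object names df_dates and df_raw are made up for illustration):

```r
raw <- "01/02/2013 02/02/2013 03/02/2013 04/02/2013
1 2 3 3
2 1 6 7
3 3 4 2
4 1 1 8"

# check.names = TRUE (the default) makes the headers syntactically valid
df_dates <- read.table(text = raw, header = TRUE)
names(df_dates)

# With check.names = FALSE the original headers survive, so nothing starts
# with "X" and a selector such as cols = -id would be needed instead
df_raw <- read.table(text = raw, header = TRUE, check.names = FALSE)
names(df_raw)
```

If the data was created some other way, it is worth checking names(df) first and adjusting the cols argument to match.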

How to plot top 5 most frequent variables by region in R

I am looking to plot the most commonly occurring FINAL_CALL_TYPE in my dataset by BOROUGH in NYC. I have a dataset with over 3 million obs. I broke this down into a sample of 2000, and have refined it further to just the incident type and the borough it occurred in.
Essentially, I want to create a plot that visualizes the 5 most common call types in each borough, with the count of each call type in each borough.
Below is a brief look of how my data looks with just Call Type and Borough
> head(df)
FINAL_CALL_TYPE BOROUGH
1804978 INJURY BRONX
1613888 INJMAJ BROOKLYN
294874 INJURY BROOKLYN
1028374 DRUG BROOKLYN
1974030 INJURY MANHATTAN
795815 CVAC BRONX
This shows how many unique values there are
> str(df)
'data.frame': 2000 obs. of 2 variables:
$ FINAL_CALL_TYPE: Factor w/ 139 levels "ABDPFC","ABDPFT",..: 50 48 50 34 50 25 17 138 28 28 ...
$ BOROUGH : Factor w/ 5 levels "BRONX","BROOKLYN",..: 1 2 2 2 3 1 4 2 4 4 ...
This is the code that I have tried
> ggplot(df, aes(x=BOROUGH, y=FINAL_CALL_TYPE)) +
+ geom_bar(stat = 'identity') +
+ facet_grid(~BOROUGH)
and below is the result
I have tried a few suggestions across this community, but I have not found any that show how to do this with 2 columns.
It would be much appreciated if someone knows a solution for this.
Thanks!
If I understand correctly, you can use the tidyverse to do something like:
library(dplyr)
library(ggplot2)
df <- df %>%
group_by(BOROUGH, FINAL_CALL_TYPE) %>%
summarise(count = n()) %>%
# summarise() drops the last grouping level, so top_n() works per BOROUGH
top_n(n = 5, wt = count)
then plot
ggplot(df, aes(x = FINAL_CALL_TYPE, y = count)) +
geom_col() +
facet_wrap(~BOROUGH, scales = "free")
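In recent dplyr releases top_n() has been superseded by slice_max(). A sketch of the same top-five step in that style, assuming dplyr 1.0.0 or later and a small made-up sample (calls) in place of the real call data:

```r
library(dplyr)

# Made-up sample standing in for the real data (an assumption)
calls <- data.frame(
  FINAL_CALL_TYPE = c("INJURY", "INJURY", "DRUG", "CVAC", "INJURY", "DRUG"),
  BOROUGH         = c("BRONX", "BRONX", "BRONX", "BRONX", "BROOKLYN", "BROOKLYN")
)

top_calls <- calls %>%
  count(BOROUGH, FINAL_CALL_TYPE, name = "count") %>%  # one row per borough/type pair
  group_by(BOROUGH) %>%
  slice_max(count, n = 5, with_ties = FALSE) %>%       # at most 5 rows per borough
  ungroup()

top_calls
```

With with_ties = FALSE each borough contributes at most five rows even when several call types share the same count; leaving it at the default keeps all tied rows, which is closer to what top_n() does.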
creating the barplot
The first part of your problem is to create the barplot. With geom_bar you only need to supply the x variable, as the y-axis is the count of observations of that variable. You can then use the facet option to separate that count into different panels for another grouping variable.
library(ggplot2)
ggplot(data = diamonds, aes(x = color)) +
geom_bar() +
facet_grid(.~cut)
filtering to top 5 observations
The second part of your problem, limiting the data to only the top five in each group is slightly more complex. An easy way to do this is to first tally the data which will create a column n that has the count of observations. By adding the sort option we can filter the data to the first five rows in each group. tally, like summarize, automatically removes the last group.
In the ggplot call I now use geom_col instead of geom_bar and I explicitly specify that the y-variable is n (n is created by tally).
geom_bar plots the count of observations per x-variable, geom_col plots a y-variable value for each value of the x-variable.
scales = "free_x" removes values from the x-axis that are present in one cut panel but not another.
library(tidyverse)
df <- diamonds %>%
group_by(cut, color) %>%
tally(sort = TRUE) %>%
filter(row_number() <= 5)
ggplot(data = df, aes(x = color, y = n)) +
geom_col() +
facet_grid(.~cut, scales = "free_x")

Creating a line chart in r for the average value of groups

I'm trying to create simple line charts with R that connect data points showing the average of groups of respondents (it would also be nice to label them or distinguish them with different colors, etc.).
My data is in long format and sorted as shown (I also have it in wide format if that's of any value):
ID gender week class motivation
1 male 0 1 100
1 male 6 1 120
1 male 10 1 130
...
2 female 0 1 90
2 female 6 1 NA
2 female 10 1 117
...
3 male 0 2 89
3 male 6 2 112
3 male 10 2 NA
...
Basically, every respondent was measured a total of n times and the occasions (week) were the same for everyone. Some respondents were missing during one or more occasions. Let's say for motivation. Variables like gender, class and ID don't change, motivation does.
I tried to get a line chart using ggplot2
## define base for the graphs and store in object 'p'
plot <- ggplot(data = DataRlong, aes(x = week, y = motivation, group = gender))
plot + geom_line()
As grouping variable, I want to use class or gender for example.
However, my approach does not lead to lines that connect the averages per group.
I also get vertical lines for each measurement occasion. What does this mean? The only way I could imagine fixing this is to create a new variable average.motivation, compute the average for every group per occasion, and then assign this average to all members of the group. However, this would mean that I had to do this for every single grouping variable whenever I want to display group lines based on another factor.
Also, how does the plot handle missing data? (If one member of a group has a missing value, I still want the group average of this occasion to calculate the point rather than omitting the whole occasion for that group ).
Edit:
Thank you, the solution with dplyr works great for all my categorical variables.
Now, I'm trying to figure out how I can distinguish between subgroups by colouring their lines based on a second/third factor.
For example, I plot 20 lines for the groups of "class2", but rather than having all of them in 20 different colors, I would like them to use the same colour, if they belong to the same type of class ("class_type", e.g. A, B or C =20 lines, three groups of colours).
I've added the second factor to "mean_data2". That works well. Next, I've tried to change the colour argument in ggplot, (also tried as in geom_line), but that way, I don't have 20 lines anymore.
mean_data2 <- group_by(DataRlong, class2, class_type, occ)%>%
summarise(procras = mean(procras, na.rm = TRUE))
library(ggplot2)
ggplot(na.omit(mean_data2), aes(x = occ, y = procras, colour = class2)) +
geom_point() + geom_line(aes(colour = class_type))
You can also use the dplyr package to aggregate the data:
library(dplyr)
mean_data <- group_by(data, gender, week) %>%
summarise(motivation = mean(motivation, na.rm = TRUE))
You can use na.omit() to get rid of the NA values as follows:
library(ggplot2)
ggplot(na.omit(mean_data), aes(x = week, y = motivation, colour = gender)) +
geom_point() + geom_line()
There is no need here to explicitly use the group aesthetic because ggplot will automatically group the lines by the categorical variables in your plot. And the only categorical variable you have is gender. (See this answer for more information).
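The group aesthetic does become necessary when the lines should be coloured by a coarser factor than the one that defines them, as in the edit to the question (one line per class2, one colour per class_type). A minimal sketch with made-up means; mean_demo and its values are assumptions, and only the column names follow the question:

```r
library(ggplot2)

# Made-up means: four classes, two class types, three occasions
mean_demo <- data.frame(
  class2     = rep(c("c1", "c2", "c3", "c4"), each = 3),
  class_type = rep(c("A", "A", "B", "B"), each = 3),
  occ        = rep(c(0, 6, 10), times = 4),
  procras    = c(3.1, 3.4, 3.0, 2.8, 2.9, 3.3, 3.6, 3.2, 3.1, 2.5, 2.7, 2.9)
)

# group decides which points are joined into a line; colour only sets the hue
p <- ggplot(mean_demo, aes(x = occ, y = procras,
                           group = class2, colour = class_type)) +
  geom_point() +
  geom_line()
p
```

Without group = class2, ggplot would join all points that share a class_type into a single line, which is why the number of lines drops when only the colour mapping is changed.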
Another possibility is using stat_summary, so you can do it only with ggplot.
ggplot(data = DataRlong, aes(x = week, y = motivation, group = gender)) +
stat_summary(geom = "line", fun = mean) # fun replaces fun.y from ggplot2 3.3.0 onwards
You almost certainly have to make sure those grouping variables are factors.
I'm not quite sure what you want, but here's a shot...
library("ggplot2")
df <- read.table(textConnection("ID gender week class motivation
1 male 0 1 100
1 male 6 1 120
1 male 10 1 130
2 female 0 1 90
2 female 6 1 NA
2 female 10 1 117
3 male 0 2 89
3 male 6 2 112
3 male 10 2 NA"), header=TRUE, stringsAsFactors=FALSE)
df2 <- aggregate(df$motivation, by=list(df$gender, df$week),
function(x)mean(x, na.rm=TRUE))
names(df2) <- c("gender", "week", "avg")
df2$gender <- factor(df2$gender)
ggplot(data = df2[!is.na(df2$avg), ],
aes(x = week, y = avg, group=gender, color=gender)) +
geom_point()+geom_line()
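On the question of how missing values are handled: all of the aggregation approaches above depend on na.rm = TRUE inside mean(). A short base-R illustration, using the three week-6 values from the sample data purely as an example vector:

```r
week6 <- c(120, NA, 112)      # week-6 motivation for IDs 1, 2 and 3
mean(week6)                   # NA: one missing value makes the whole mean missing
mean(week6, na.rm = TRUE)     # 116: mean of the two observed values

# A group whose values are all missing yields NaN, which is.na() also flags
mean(c(NA_real_, NA_real_), na.rm = TRUE)
```

That last case is why the plotting calls still wrap the summarised data in na.omit() or filter on !is.na(): a group with no observed values at an occasion would otherwise leave a gap in its line.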

R ggplot multiple series curved line

I am plotting multiple series of data on one plot.
I have data that looks like this:
count_id AMV Hour duration_in_traffic AMV_norm
1 16012E 4004 14 99 0
2 16012E 4026 12 94 22
3 16012E 4099 15 93 95
4 16012E 4167 11 100 163
5 16012E 4239 10 97 235
I am plotting in R using:
ggplot(td_results, aes(AMV,duration_in_traffic)) + geom_line(aes(colour=count_id))
This is giving me:
However, rather than straight lines linking points I would like curved.
I found the following question but got an unexpected output. Equivalent of curve() for ggplot
I used: ggplot(td_results, aes(AMV,duration_in_traffic)) + geom_line(aes(colour=count_id)) + stat_function(fun=sin)
Thus giving:
How can I get a curve with some form of higher order polynomial?
As @MrFlick mentions in the comments, there are serious statistical ways of getting curved lines, which are probably off topic here.
If you just want your graph to look nicer however, you could try interpolating your data with spline, then adding it on as another layer.
First we make some spline data, using 10 times the number of data points you had (you can increase or decrease this as desired):
library(dplyr)
dat2 <- td_results %>% select(count_id, AMV, duration_in_traffic) %>%
group_by(count_id) %>%
do(as.data.frame(spline(x= .[["AMV"]], y= .[["duration_in_traffic"]], n = nrow(.)*10)))
Then we plot, using your original data for points, but then using lines from the spline data (dat2):
library(ggplot2)
ggplot(td_results, aes(AMV, duration_in_traffic)) +
geom_point(aes(colour = factor(count_id))) +
geom_line(data = dat2, aes(x = x, y = y, colour = factor(count_id)))
This gives me the following graph from your test data:
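To see what spline() hands back before it goes into the dplyr pipeline, it can be run on a single series in base R. A small sketch using the five rows shown in the question (the variable names amv, duration and sm are made up here):

```r
# The five rows shown in the question
amv      <- c(4004, 4026, 4099, 4167, 4239)
duration <- c(99, 94, 93, 100, 97)

# spline() returns a list with components x and y, evaluated at
# n equally spaced x values between min(x) and max(x)
sm <- spline(x = amv, y = duration, n = length(amv) * 10)

str(sm)
plot(amv, duration)   # original points
lines(sm)             # interpolated curve passing through them
```

The curve is an interpolation, so it passes through every observed point; it is a cosmetic smoothing of the line, not a fitted model.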

Stacked bar plot in R

I have a csv file like the one below. I want to make a stacked bar plot where the x-axis is the link column, the y-axis shows the frequency, and each bar is split by Freq_E and Freq_S. When I read the csv and pass it to barplot, it doesn't work. I searched a lot, but in all the examples the data is in the form of a contingency table. I don't know what I should do...
link Freq_E Freq_S
1 tube.com 214 214
2 list.net 120 120
3 vector.com 119 118
4 4cdn.co 95 96
"It doesn't work" is not an error message in R that I'm familiar with, but I'm guessing your problem is that you are trying to use barplot on a data.frame while you should be using a matrix or a vector.
Assuming your data.frame is called "df" (as defined at the start of Codoremifa's answer), you can try the following:
x <- as.matrix(df[-1]) ## Drop the first column since it's a character vector
rownames(x) <- df[, 1] ## Add the first column back in as the rownames
barplot(t(x)) ## Transpose the new matrix and plot it
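The transpose matters because barplot() draws one bar per matrix column and stacks the rows within it. A self-contained sketch of the same steps on the sample data, with a legend added; legend.text and las are standard barplot() and graphical parameters, and the data frame is rebuilt inline here:

```r
df <- read.table(text = "link Freq_E Freq_S
tube.com 214 214
list.net 120 120
vector.com 119 118
4cdn.co 95 96", header = TRUE, stringsAsFactors = FALSE)

x <- as.matrix(df[-1])     # numeric columns only
rownames(x) <- df[, 1]     # links become the row names

# After transposing: one column per link, one row per Freq_* series
dim(t(x))

# legend.text = TRUE labels the stacked segments with the row names of t(x);
# las = 2 turns the link labels vertical so they do not overlap
barplot(t(x), legend.text = TRUE, las = 2)
```

Plotting x without the transpose would instead give two bars, Freq_E and Freq_S, each stacked by link.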
You should look at the excellent ggplot2 library, try this code snippet for your example -
df <- read.table(textConnection(
'link Freq_E Freq_S
tube.com 214 214
list.net 120 120
vector.com 119 118
4cdn.co 95 96'), header = TRUE)
library(ggplot2)
library(reshape2)
df <- melt(df, id = 'link')
ggplot(
data = df,
aes(
y = value,
x = link,
group = variable,
shape = variable,
fill = variable
)
) +
geom_bar(stat = "identity")
