Creating a bar plot with new data - r

I'm trying to create a bar plot of the average number of units per customer by Lifestage and I'm unable to figure out how to plot this data. It's basically a 7x2 matrix with the first column being life stage and the second column being the respective "unit per customer". Does anyone know what code I can use to create a bar plot with this new vector?
units_by_lifestage <- aggregate(data$PROD_QTY,
by=list(data$LIFESTAGE),
FUN=sum)
#Calculate the average number of units per customer by LIFESTAGE
units_per_customer_by_lifestage <- units_by_lifestage$x / customers_by_lifestage$x
mat <- as.matrix(units_per_customer_by_lifestage)
LIFESTAGE <- c("MIDAGE SINGLES/COUPLES", "NEW FAMILIES", "OLDER FAMILIES","OLDER SINGLES/COUPLES","RETIREES","YOUNG FAMILIES","YOUNG SINGLES/COUPLES")
new_mat <- cbind(LIFESTAGE, mat)
new_mat
Below is the output of str(new_mat), to give an idea of the data:
chr [1:7, 1:2] "MIDAGE SINGLES/COUPLES" "NEW FAMILIES" "OLDER FAMILIES" "OLDER SINGLES/COUPLES" "RETIREES" "YOUNG FAMILIES" "YOUNG SINGLES/COUPLES" ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:2] "LIFESTAGE" ""
The respective values of units_per_customer_by_lifestage are:
1.901697, 1.857781, 1.946410, 1.913354, 1.892593, 1.940460, 1.834025

"New data", as you call it, doesn't change how one creates a bar plot. Your post is about a pretty basic question and quite likely a duplicate.
There is the base barplot():
df <- data.frame(lifestage = c("MIDAGE SINGLES/COUPLES", "NEW FAMILIES", "OLDER FAMILIES", "OLDER SINGLES/COUPLES", "RETIREES", "YOUNG FAMILIES", "YOUNG SINGLES/COUPLES"), units = c(1.901697, 1.857781, 1.946410, 1.913354, 1.892593, 1.940460, 1.834025))
barplot(units~lifestage, df)
You can also use the ggplot2 library.
library(ggplot2)
ggplot(df, aes(y=units, x=lifestage)) + geom_col()
geom_col() is one of two ways to create the bar plot; the other is geom_bar(stat = "identity"). The ggplot2 figure looks like this:
The axis labels are a little simplified because I was quick with recreating your data frame. By default, the axis labels are the names of the columns used. To customize them, add labs() like this:
ggplot(df, aes(y=units, x=lifestage)) + geom_col() + labs(x="test")
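As a side note on why str(new_mat) in the question shows a chr matrix: cbind() of a character vector and a numeric vector coerces everything to character, so the numbers can no longer be plotted directly. A minimal sketch in base R, using the values from the question (the variable names units, lifestage and m are just illustrative):

```r
# Values copied from the question
units <- c(1.901697, 1.857781, 1.946410, 1.913354,
           1.892593, 1.940460, 1.834025)
lifestage <- c("MIDAGE SINGLES/COUPLES", "NEW FAMILIES", "OLDER FAMILIES",
               "OLDER SINGLES/COUPLES", "RETIREES", "YOUNG FAMILIES",
               "YOUNG SINGLES/COUPLES")

# cbind() of character and numeric vectors yields a character matrix
m <- cbind(LIFESTAGE = lifestage, units)
is.character(m)

# A data frame keeps each column's own type
df <- data.frame(lifestage = lifestage, units = units,
                 stringsAsFactors = FALSE)

# A named numeric vector is all that base barplot() needs
barplot(setNames(df$units, df$lifestage), las = 2, cex.names = 0.6,
        ylab = "Units per customer")
```

Building the data frame directly from the aggregated results, rather than from a matrix, avoids the coercion altogether.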

Related

Adjusting facet order and legend labels when using plot_model function of sjplot

I have successfully used the plot_model function of sjplot to plot a multinomial logistic regression model. The regression contains an outcome (Info Sought, with 3 levels) and 2 continuous predictors (DSA, ASA). I have also changed the values of ASA in the plot_model so as to plot predicted effect outcomes based on the ASA mean value and SDs:
plot1 <- plot_model(multinomialmodel , type = "pred", terms = c("DSA", "ASA[meansd]"))
I have two customization questions:
1) Facet Order: The facet order is based on the default alphabetical order of the outcome levels ("Expand" then "First Pic" then "Multiple Pics"). Is there a means by which to adjust this? I tried resorting the levels with factor() (as exampled here with ggplot2) prior to running and plotting the model, but this did not cause any changes in the resulting facet order. Perhaps instead something through ggplot2, as exampled in the first solution provided here?
2) Legend Labels: The legend currently labels the plotted lines with the -1 SD, mean, and +1 SD values for ASA; is there a way to adjust these labels to instead simply say "-1 SD", "mean", and "+1 SD" instead of the raw values?
Thanks!
First I replicate your plot using your supplied data:
library(dplyr)
library(readr)
library(nnet)
library(sjPlot)
"ASA,DSA,Info_Sought
-0.108555801,0.659899854,First Pic
0.671946671,1.481880373,First Pic
2.184170211,-0.801398848,First Pic
-0.547588442,1.116555698,First Pic
-1.27930951,-0.299077419,First Pic
0.037788412,1.527545958,First Pic
-0.74271406,-0.755733264,Multiple Pics
1.20854212,-1.166723523,Multiple Pics
0.769509479,-0.390408588,Multiple Pics
-0.450025633,-1.02972677,Multiple Pics
0.769509479,0.614234269,Multiple Pics
0.281695434,0.705565438,Multiple Pics
-0.352462824,-0.299077419,Expand
0.671946671,1.481880373,Expand
2.184170211,-0.801398848,Expand
-0.547588442,1.116555698,Expand
-0.157337206,1.070890114,Expand
-1.27930951,-0.299077419,Expand" %>%
read_csv() -> d
multinomialmodel <- multinom(Info_Sought ~ ASA + DSA, data = d)
p1 <- plot_model(multinomialmodel ,
type = "pred",
terms = c("DSA", "ASA[meansd]"))
p1
Your attempt to re-factor did not work because sjPlot::plot_model() does not pay heed. One way to tackle reordering the facets is to produce an initial plot as above and replace the faceting variable in the data with a factor version containing your desired order like so:
p2 <- p1
p2$data$response.level <- factor(p2$data$response.level,
levels = c("Multiple Pics", "First Pic", "Expand"))
p2
Finally, to tackle the legend labeling issue, we can just replace the color scale with one containing your desired labels:
p2 +
scale_color_discrete(labels = c("-1 SD", "mean", "+1 SD"))
Just following up on @the-mad-statter's answer, I wanted to add a note on how to change the legend title and labels when you're working with a black-and-white graph where the lines differ by linetype (i.e. using sjPlot's colors = "bw" argument).
p1 <- plot_model(multinomialmodel ,
type = "pred",
terms = c("DSA", "ASA[meansd]"),
colors = "bw")
As the lines are all black, if you would like to change the legend title and labels, you need to use the scale_linetype_manual() function instead of scale_color_discrete(), like this:
p1 + scale_linetype_manual(name = "ASA values",
values = c("dashed", "solid", "dotted"),
labels = c("Low (-1 SD)", "Medium (mean)", "High (+1 SD)"))
The resulting graph will look like this:
Note that I also took this opportunity to change how linetypes are assigned to values, making the line corresponding to the mean of ASA solid.

Ordering bars in ggplot2 stacked barplot via levels() but output looks different

I'm struggling with my ggplot2 stacked barplot. I want to define the order of the bars manually. So I do that normally by transforming the variable into a factor and defining the levels in my desired order.
data <- transform(data, variable = factor(variable, levels = c("A4 Da/De/Du", "A2 London", "A3 Berlin", "A1 Paris", "A5 Rome")))
When I check my variable levels I can see that the levels are in my desired order to plot
head(data$variable)
When I plot my data everything looks as desired, but somehow, and I have no clue why, one variable (for example "A4 Da/De/Du") is not in my defined variable order...
Has someone an idea what the problem could be?
-It's the only variable with special characters (e.g "/") in it
-It's the only variable which has zero levels in it (e.g. c(20,40,0,0,40))
-My ggplot code is quite complex, and I use the "reorder()" function, and I use the "forcats" package in my ggplot2 code. Could that be a problem?
Thanks very much for any help or ideas!
EDIT (some example data)
library(reshape2)
library(ggplot2)
library(dplyr)
df <- data.frame(cbind(a=c(20,40,20,10,10),b=c(10,30,50,5,5), c=c(60,10,10,15,5), d=c(80,20,0,0,0), e=c(50,10,10,15,15)))
colnames(df) <- c("D1 Paris", "D2 London", "D3 Berlin", "D4 Da/De/Du", "D5 Rome")
df$category <- c("C1", "C2", "C3", "C4", "C5")
# reshape to long format first, keeping category as the id column
data <- melt(df, id.vars = "category")
data <- data %>% arrange(variable)
data$percent <- data$value/100
data <- transform(data, variable = factor(variable,
levels = c("D4 Da/De/Du", "D2 London", "D3 Berlin", "D1 Paris", "D5 Rome")))
And the short version of the ggplot2 code:
ggplot(data, aes(x=reorder((variable), percent), y=percent, fill=category)) +
coord_flip()+
geom_bar(stat="identity", width = .4, colour="black", lwd=0.1)
SOLUTION
I finally solved my problem :)
Gregor was right: once the levels of the variable have been set in the desired order, the reorder() call in my ggplot2 code is no longer necessary. Worse, it overwrites the previously defined levels, which is what caused my problem.
Thanks Gregor!
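The effect described in the solution can be reproduced without any plotting. A minimal sketch with made-up values (the factor f and the scores x are hypothetical, not the OP's data):

```r
# Hypothetical factor whose levels were set manually in the desired order
f <- factor(c("b", "a", "c"), levels = c("c", "b", "a"))
x <- c(3, 1, 2)  # made-up values, one per observation

# The manual order is kept on the factor itself
manual_levels <- levels(f)

# reorder() discards it and re-sorts the levels by the mean of x per level
reordered_levels <- levels(reorder(f, x))

print(manual_levels)     # "c" "b" "a"
print(reordered_levels)  # "a" "c" "b"
```

In the plot call from the question, mapping x = variable instead of x = reorder(variable, percent) should therefore keep the manually defined order.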

How can I color a line graph by grouping the variables in R?

I have produced a line graph that looks something like this
I have a data set of 50 countries and their GDP for the last 10 years.
Sample data:
Country variable value
China Y2007 3.55218e+12
USA Y2007 1.45000e+13
Japan Y2007 4.51526e+12
UK Y2007 3.06301e+12
Russia Y2007 1.29971e+12
Canada Y2007 1.46498e+12
Germany Y2007 3.43995e+12
India Y2007 1.20107e+12
France Y2007 2.66311e+12
SKorea Y2007 1.12268e+12
I generated the line graph using the code
GDP_lineplot = ggplot(data=GDP_linechart, aes(x=variable,y=value)) +
geom_line() +
scale_y_continuous(name = "GDP(USD in Trillions)",
breaks = c(0.0e+00,5.0e+12,1.0e+13,1.5e+13),
labels = c(0,5,10,15)) +
scale_x_discrete(name = "Years", labels = c(2007,"",2009,"",2011,"",2013,"",2015))
The idea is to make the graph look like this.
I tried adding
group=country, color = country
This colors every country individually.
How can I color only the top 4 countries and give the rest a single color?
PS: I am still naive with R.
By plotting subsets, the other groups aren't included in the colour legend on the right. The alternative approach below manipulates factor levels and uses a customized color scale to overcome this.
Preparing data
It is assumed that GDP_long contains the data in long format. This is in line with the data shown by the OP (GDP_linechart, but see the Data section below for differences). To manipulate factor levels, the forcats package is used (and data.table).
library(data.table)
library(forcats)
# coerce to data.table, reorder factor levels by the value in the last (most recent) year
setDT(GDP_long)[, Country := fct_reorder(Country, -value, last)]
# create new factor which collapses all countries to "Other" except the top 4 countries
GDP_long[, top_country := fct_other(Country, keep = head(levels(Country), 4))]
Create plot
library(ggplot2)
ggplot(GDP_long, aes(Year, value/1e12, group = Country, colour = top_country)) +
geom_point() + geom_line(size = 1) + theme_bw() + ylab("GDP(USD in Trillions)") +
scale_colour_manual(name = "Country",
values = c("green3", "orange", "blue", "red", "grey"))
The chart is now quite similar to the expected result. The lines of the top 4 countries are displayed in different colours while the other countries are displayed in grey but do appear in the colour legend to the right.
Note that the group aesthetic is still needed so that a single line is plotted for each country while colour is controlled by the levels of top_country.
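For readers who do not use forcats or data.table, the same collapsing step can be approximated in base R. This is only a sketch under assumptions: the data frame gdp and its values are made up, and "top" is taken to mean the largest value in the most recent year, as in the answer above.

```r
# Hypothetical long-format data: 6 countries, 2 years (values are made up)
gdp <- data.frame(
  Country = rep(c("A", "B", "C", "D", "E", "F"), times = 2),
  Year    = rep(c(2015L, 2016L), each = 6),
  value   = c(5, 9, 1, 7, 3, 2,   6, 10, 1, 8, 4, 2),
  stringsAsFactors = FALSE
)

# Rank countries by their value in the most recent year
latest <- gdp[gdp$Year == max(gdp$Year), ]
top4 <- latest$Country[order(-latest$value)][1:4]

# Keep the top 4 as their own levels, collapse everything else to "Other"
gdp$top_country <- factor(
  ifelse(gdp$Country %in% top4, gdp$Country, "Other"),
  levels = c(top4, "Other")
)

levels(gdp$top_country)
```

Because "Other" is the last level, the last colour passed to scale_colour_manual() (grey in the answer above) is the one applied to the collapsed countries.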
Data
The data set is too large to be reproduced here (even with dput()). The structure
str(GDP_long)
'data.frame': 1763 obs. of 3 variables:
$ Country: chr "Afghanistan" "Albania" "Algeria" "Andorra" ...
$ Year : int 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
$ value : num 9.84e+09 1.07e+10 1.35e+11 4.01e+09 6.04e+10 ...
is similar to the OP's data, except that the variable column has already been converted to an integer column Year. This gives a nicely formatted x-axis without additional effort.
My apologies I missed the part about only coloring a subset of the countries... in the geom_line calls you can add the subsetting that suits your needs.
df <- data.frame(Country=rep(LETTERS[1:10], each=5),
Year=rep(2007:2011, length.out=10),
value=rnorm(50))
ggplot(df) +
geom_line(data=df[21:50, ], aes(x=Year, y=value, group=Country), color="#999999") +
geom_line(data=df[1:20, ], aes(Year, y=value, color=Country))

box-plot not working with factor data

I'm trying to create a simple boxplot of some survey data.
Data
The data is survey data, and each row has a response recorded 1-5.
**Example Data**
Race= 2,2,3,2,5
Rating = 1,1,3,5,5
Converting to factors
df$Race = factor(df$Race)
df$Rating = factor(df$Rating)
Assigning each factor variable levels
levels(df$Race) = c("Asian/Pacific Islander", "White" , "American Indian/Eskimo", "Black/African American", "Other","NA")
levels(df$Rating) = c("Poor","Below Avg.","Neutral","Good","Excellent", "NA")
ggplot(df, aes(x=Race, y=Rating)) + geom_boxplot()
Using the full data I get a result like this.
Please let me know why this turns out funky. Also, how can I remove NAs? I'm brand new to R, so if you see something else that I am doing wrong or poorly, please let me know! Thanks!
UPDATE
Using the code @jlhoward provided in the comments, I can generate the following, but it's plotting them all the same, and not plotting "White."
ggplot(df, aes(x=Race, y=as.numeric(Rating))) + geom_boxplot() +scale_y_continuous(labels=df$Rating,breaks=as.integer(df$Rating))
If I understand correctly, you want the factor levels ("Poor", "Below Avg" etc.) to appear on the Y axis, but you actually want the "rating" boxplot to be computed with numerical values. Is that correct?
If that is the case, you would need to not convert your "rating" variable into a factor before feeding them to ggplot (leave it numerical), and then simply label the y axis appropriately according to your factor levels.
(A reproducible example would help answer the question more fully).
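A minimal sketch of that suggestion, using the five example rows from the question. It assumes that the survey codes 1 to 5 correspond to the labels in the order listed in the question; that mapping is an assumption, not something the question confirms. Passing levels and labels to factor() maps each code explicitly, whereas assigning levels() afterwards renames whatever codes happen to be present by position.

```r
library(ggplot2)

# Example data from the question
df <- data.frame(Race = c(2, 2, 3, 2, 5), Rating = c(1, 1, 3, 5, 5))

# Assumed code-to-label mapping (codes 1..5 in the order given in the question)
race_labels <- c("Asian/Pacific Islander", "White", "American Indian/Eskimo",
                 "Black/African American", "Other")
rating_labels <- c("Poor", "Below Avg.", "Neutral", "Good", "Excellent")

# Drop rows with missing answers instead of creating an "NA" level
df <- df[complete.cases(df), ]

# Map codes to labels explicitly; Rating stays numeric for the boxplot
df$Race <- factor(df$Race, levels = 1:5, labels = race_labels)

p <- ggplot(df, aes(x = Race, y = Rating)) +
  geom_boxplot() +
  scale_y_continuous(breaks = 1:5, labels = rating_labels, limits = c(1, 5))
p
```

With only one or two observations per group, as in this toy data, the boxes collapse to lines; the full survey data should give proper boxes.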

Ordering the axis labels in geom_tile

I have a data frame containing order data for each of 20+ products from each of 20+ countries. I have put it in a highlight table using ggplot2 with code similar to this:
require(ggplot2)
require(reshape)
require(scales)
mydf <- data.frame(industry = c('all industries','steel','cars'),
'all regions' = c(250,150,100), americas = c(150,90,60),
europe = c(150,60,40), check.names = FALSE)
mydf
mymelt <- melt(mydf, id.var = c('industry'))
mymelt
ggplot(mymelt, aes(x = industry, y = variable, fill = value)) +
geom_tile() + geom_text(aes(fill = mymelt$value, label = mymelt$value))
Which produces a plot like this:
In the real plot, the 450 cell table very nicely shows the 'hotspots' where orders are concentrated. The last refinement I want to implement is to arrange the items on both the x-axis and y-axis in alphabetical order. So in the plot above, the y-axis (variable) would be ordered as all regions, americas, then europe and the x-axis (industry) would be ordered all industries, cars and steel. In fact the x-axis is already ordered alphabetically, but I wouldn't know how to achieve that if it were not already the case.
I feel somewhat embarrassed about having to ask this question as I know there are many similar on SO, but sorting and ordering in R remains my personal bugbear and I cannot get this to work. Although I do try, in all except the simplest cases I get lost in a welter of calls to factor, levels, sort, order and with.
Q. How can I arrange the above highlight table so that both y-axis and x-axis are ordered alphabetically?
EDIT: The answers from smillig and joran below do resolve the question with the test data but with the real data the problem remains: I can't get an alphabetical sort. This leaves me scratching my head as the basic structure of the data frame looks the same. Clearly I have omitted something, but what??
> str(mymelt)
'data.frame': 340 obs. of 3 variables:
$ Industry: chr "Animal and vegetable products" "Food and beverages" "Chemicals" "Plastic and rubber goods" ...
$ variable: Factor w/ 17 levels "Other areas",..: 17 17 17 17 17 17 17 17 17 17 ...
$ value : num 0.000904 0.000515 0.007189 0.007721 0.000274 ...
However, applying the with statement doesn't result in levels with an alphabetical sort.
> with(mymelt,factor(variable,levels = rev(sort(unique(variable)))))
[1] USA USA USA
[4] USA USA USA
[7] USA USA USA
[10] USA USA USA
[13] USA USA USA
[16] USA USA USA
[19] USA USA Canada
[22] Canada Canada Canada
[25] Canada Canada Canada
[28] Canada Canada Canada
All the way down to:
[334] Other areas Other areas Other areas
[337] Other areas Other areas Other areas
[340] Other areas
And if you do a levels() it seems to show the same thing:
[1] "Other areas" "Oceania" "Africa"
[4] "Other Non-Eurozone" "UK" "Other Eurozone"
[7] "Holland" "Germany" "Other Asia"
[10] "Middle East" "ASEAN-5" "Singapore"
[13] "HK/China" "Japan" "South Central America"
[16] "Canada" "USA"
That is, the non-reversed version of the above.
The following shot shows what the plot of the real data looks like. As you can see, the x-axis is sorted and the y-axis is not. I'm perplexed. I'm missing something but can't see what it is.
The y-axis on your chart is also already ordered alphabetically, but from the origin. I think you can achieve the order of the axes that you want by using xlim and ylim. For example:
ggplot(mymelt, aes(x = industry, y = variable, fill = value)) +
geom_tile() + geom_text(aes(fill = mymelt$value, label = mymelt$value)) +
ylim(rev(levels(mymelt$variable))) + xlim(levels(mymelt$industry))
will order the y-axis from all regions at the top, followed by americas, and then europe at the bottom (which is reverse alphabetical order, technically). The x-axis is alphabetically ordered from all industries to steel with cars in between.
As smillig says, the default is already to order the axes alphabetically, but the y axis will be ordered from the lower left corner up.
The basic rule with ggplot2 that applies to almost anything that you want in a specific order is:
If you want something to appear in a particular order, you must make the corresponding variable a factor, with the levels sorted in your desired order.
In this case, all you should need to do is this:
mymelt$variable <- with(mymelt,factor(variable,levels = rev(sort(unique(variable)))))
which should work regardless of whether you're running R with stringsAsFactors = TRUE or FALSE.
This principle applies to ordering axis labels, ordering bars, ordering segments within bars, ordering facets, etc.
For continuous variables there is a convenient scale_*_reverse() but apparently not for discrete variables, which would be a nice addition, I think.
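Another possibility is to use fct_reorder() from the forcats package (pivot_longer() and mutate() come from tidyr and dplyr).
library(forcats)
library(dplyr)
library(tidyr)
library(ggplot2)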
mydf %>%
pivot_longer(cols=c('all regions', 'americas', 'europe')) %>%
mutate(name1=fct_reorder(name, value, .desc=FALSE)) %>%
ggplot( aes(x = industry, y = name1, fill = value)) +
geom_tile() + geom_text(aes( label = value))
Maybe a little bit late, but:
with(mymelt,factor(variable,levels = rev(sort(unique(variable)))))
does not sort alphabetically, because variable is already a factor: sort() on a factor follows the existing level order, not the alphabetical order of the labels.
You should first convert the variable to character with as.character(), like so:
with(mymelt,factor(variable,levels = rev(sort(unique(as.character(variable))))))
maybe this StackOverflow question can help:
Order data inside a geom_tile
specifically the first answer by Brandon Bertelsen:
"Note it's not an ordered factor, it's a factor in the right order"
It helped me to get the right order of the y-axis in a ggplot2 geom_tile plot.
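The difference between sorting a factor and sorting its labels can be checked on a small example. This is a sketch with a made-up three-level factor (region is hypothetical, loosely modelled on the level names in the question):

```r
# Hypothetical factor whose levels are NOT in alphabetical order
region <- factor(c("USA", "Canada", "Other areas"),
                 levels = c("Other areas", "Canada", "USA"))

# Sorting the factor itself follows the existing level order
by_levels <- factor(region, levels = rev(sort(unique(region))))

# Sorting the labels as character gives a true alphabetical order
by_alpha <- factor(region,
                   levels = rev(sort(unique(as.character(region)))))

levels(by_levels)  # reversed original level order
levels(by_alpha)   # reversed alphabetical order
```

The rev() is there because ggplot2 draws the first level at the bottom of the y-axis, so the reversed alphabetical order reads top to bottom.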
