I am trying to learn the R programming language to analyse and visualize my data. I have made some good progress so far and I am really enjoying learning R, but I am stumped here.
I am having some trouble creating line graphs for products in specific categories. I have no problem creating graphs that show sales for all categories, but I would like to specify a particular category and show the sales of its products.
This is what my data set looks like.
Can someone show me how I could do this? For example, I would like to create a line graph showing the sales of products in the Bakery category, where the X axis has the product name and the Y axis has the quantity sold.
Any help would be greatly appreciated.
Next time, please include the head of your data. This can be done using
head(Store_sales)
  ProductID      category sales                  product
1       101        Bakery  9468              White bread
2       102 Personal Care  9390 Everday Female deodorant
3       103        Cereal  9372                 Weetabix
4       104       Produce  9276                    Apple
5       105          Meat  9268          Chicken Breasts
6       106        Bakery  9252                Pankcakes
I reproduced the relevant fields to help you out. The first thing to do is filter the Bakery items out of the category column.
> install.packages("tidyverse")
> library(tidyverse)
Store sales before filter
> Store_sales
  ProductID      category sales                  product
1       101        Bakery  9468              White bread
2       102 Personal Care  9390 Everday Female deodorant
3       103        Cereal  9372                 Weetabix
4       104       Produce  9276                    Apple
5       105          Meat  9268          Chicken Breasts
6       106        Bakery  9252                Pankcakes
7       107       Produce  9228                   Carrot
Filter the rows whose category is "Bakery" into Store_sales_bakery
> Store_sales_bakery <- filter(Store_sales, category == "Bakery")
What Store_sales_bakery includes
> Store_sales_bakery
  ProductID category sales     product
1       101   Bakery  9468 White bread
2       106   Bakery  9252   Pankcakes
Unfortunately, the picture you gave us does not contain enough information to produce a line graph (you only have one data point for each product, which is not enough to draw a line), so I created a point plot for you instead.
ggplot(Store_sales, aes(x = product, y = sales)) + geom_point()
Here is a bar plot with two variables
ggplot(Store_sales, aes(x = product, y = sales)) + geom_bar(stat = "identity")
If you had enough data to make a line graph you would replace geom_bar() or geom_point() with geom_line()
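If a single line across the Bakery products is still wanted, a minimal sketch could look like the one below. It assumes the two Bakery rows shown in the output above and rebuilds them by hand so the example runs on its own. With a discrete x axis, ggplot2 puts each product in its own group, so `group = 1` is added to connect the points into one line.

```r
library(ggplot2)

# Rows copied from the filtered output above (rebuilt by hand for this sketch)
Store_sales_bakery <- data.frame(
  ProductID = c(101, 106),
  category  = "Bakery",
  sales     = c(9468, 9252),
  product   = c("White bread", "Pankcakes")
)

# group = 1 treats all products as one series, so geom_line() can
# connect them even though product is a discrete variable
bakery_plot <- ggplot(Store_sales_bakery,
                      aes(x = product, y = sales, group = 1)) +
  geom_line() +
  geom_point()

print(bakery_plot)
```

Whether a line is the right choice here is a judgment call: the products have no natural order, so the bar plot above may communicate the same numbers more honestly.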
Here is a link to ggplot cheatsheet that may help you in the future
https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf
So I don't think this has been asked before, but SO search might just be getting confused by combinations of 'ratio' and 'faceting'. I'm trying to calculate a productivity ratio: the number of widgets produced per worker on a given day or period. I've got my data structured in a single data frame, with each widget produced each day by each worker in its own record, and other workers that worked that day but didn't produce a widget also in their own record, along with various metadata.
Something like this:
widget_ind  employee_active_ind  employee_id  day       product_type  employee_bu
1           1                    123          6/1/2021  pc            americas
0           1                    234          6/1/2021  mac           emea
0           1                    345          6/1/2021  mac           apac
1           1                    444          6/1/2021  mac           americas
1           1                    333          6/1/2021  pc            emea
0           1                    356          6/1/2021  pc            americas
I'm trying to find the ratio of widget_inds to employee_active_inds, over time, while retaining the metadata, so that I can filter or facet within the ggplot2 code, something like:
plot <- ggplot(data = df[df$employee_bu == 'americas',], aes(y = (widget_ind/employee_active_ind), x = day)) +
  geom_bar(stat = 'identity', position = 'stack') +
  facet_wrap(product_type ~ ., scales = 'fixed') # change these to look at different cuts of metadata
print(plot)
Retaining the metadata is appealing rather than making individual dataframes summarizing by the various combinations, but the results with no faceting aren't even correct (e.g. the ggplot is showing a barchart with a height of ~18 widgets per person; creating a summarized dataframe with no faceting is showing a ratio of less than 1 widget per person).
I'm currently getting this warning when I run the ggplot code:
Warning message:
Removed 9865 rows containing missing values (geom_bar).
Which doesn't make sense since in my data frame both widget_ind and employee_active_ind have no NA values, so calculating the ratio of the two should always work?
Edit 1: Clarifying employee_active_ind: I should not have any employee_active_ind = 0, but my current joins produce them (and it passes the reality sniff test; the process we are trying to model allows you to do work on day 1 that results in a widget on day 2, when you may not do any work and so wouldn't be counted as active on that day). I think I need to re-think my data structure. Even so, I'm assuming here that ggplot2 is acting as it would for any bar chart: it takes the number in each widget_ind record for a given day (along with any facets and filters), sums that set and displays the result. The wrinkle I'm adding is dividing by the number of active employees on that day, and while you can have someone out on a given day, you'd never have everyone out. But that isn't what ggplot is doing, is it?
I agree with MrFlick, especially on the question concerning employee_active_ind of 0. If you have them, the division is done row by row, so any row with a 0 in the denominator gives a non-finite result (0/0 is NaN in R, which counts as a missing value).
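To make the point concrete, here is a small sketch using the six rows from the question plus two hypothetical rows with employee_active_ind = 0 (employee IDs 998 and 999 are invented for illustration). It shows what the row-wise division returns, and one possible alternative: summing widgets and active employees per day and product type first, then dividing. The name `productivity` and the choice of grouping columns are assumptions; other cuts of the metadata would need their own `group_by()`.

```r
library(dplyr)

df <- data.frame(
  widget_ind          = c(1, 0, 0, 1, 1, 0, 1, 0),
  employee_active_ind = c(1, 1, 1, 1, 1, 1, 0, 0),  # last two rows are hypothetical
  employee_id         = c(123, 234, 345, 444, 333, 356, 998, 999),
  day                 = "6/1/2021",
  product_type        = c("pc", "mac", "mac", "mac", "pc", "pc", "pc", "pc"),
  employee_bu         = c("americas", "emea", "apac", "americas",
                          "emea", "americas", "emea", "emea")
)

# Row-wise division, as in aes(y = widget_ind / employee_active_ind):
# 1/0 is Inf and 0/0 is NaN, and is.na(NaN) is TRUE
row_ratio <- df$widget_ind / df$employee_active_ind

# Sum first, then divide: one ratio per day and product type
productivity <- df %>%
  group_by(day, product_type) %>%
  summarise(
    widgets = sum(widget_ind),
    active  = sum(employee_active_ind),
    ratio   = widgets / active,
    .groups = "drop"
  )

print(row_ratio)
print(productivity)
```

The row-wise version also hints at why the unfaceted bars came out so tall: with `stat = 'identity'` and `position = 'stack'`, each row's own ratio is stacked, so a bar's height is the sum of the per-row ratios rather than total widgets divided by total active employees.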
I have this dataset
airline avail_seat_km_per_week Number Year
1: Aer Lingus 320906734 2 1985-99
2: Aeroflot* 1197672318 76 1985-99
3: Aerolineas Argentinas 385803648 6 1985-99
4: Aeromexico* 596871813 3 1985-99
5: Air Canada 1865253802 2 1985-99
---
108: United / Continental* 7139291291 14 2000-14
109: US Airways / America West* 2455687887 11 2000-14
110: Vietnam Airlines 625084918 1 2000-14
111: Virgin Atlantic 1005248585 0 2000-14
112: Xiamen Airlines 430462962 2 2000-14
These are some instances of the dataset:
data.frame(airline=c("Aer Lingus", "Aeroflot*", "Aerolineas Argentinas", "Aeromexico*", "Air Canada", "Aer Lingus", "Aeroflot*", "Aerolineas Argentinas", "Aeromexico*", "Air Canada"), Number=c(2, 76, 6, 3, 2,0 ,6,1,5,2), Year=c("1985-99", "1985-99", "1985-99", "1985-99", "1985-99", "2000-14", "2000-14", "2000-14", "2000-14", "2000-14"))
The data include the number of crashes for airlines around the world in two periods, 1985-99 and 2000-14. I want a scatterplot of the number of crashes in 1985-99 against the number in 2000-14. What is a neat way to do this using the dplyr and ggplot2 packages, preferably with pipes?
Please let me know if there is anything I could do to further specify the problem. I appreciate your help!
When asking for help with plots in general, and ggplot, it's helpful if you're very clear about what data goes with each dimension - x, y, color, etc.
library(tidyr)
library(ggplot2)
# (calling your data d)
d %>%
  # widen the data so each plot dimension gets a column
  pivot_wider(names_from = Year, values_from = Number) %>%
  # use backticks for non-standard column names (because of the dash in this case)
  ggplot(aes(x = `1985-99`, y = `2000-14`, color = airline)) +
  geom_point()
It's difficult for me to create a reproducible example of this, as the issue only seems to show up once the data frame is too large to paste here. I hope someone will bear with me and help. I'm sure I'm doing something stupid, but reading the help and searching is failing (perhaps on the "stupid" issue).
I have a data frame of 2,319 rows and three variables: clientID, month and nSlots where clientID is character, month is 1:12 and nSlots is 1:2.
> head(tmpDF2)
month clientID2 nSlots
21 1 8 1
30 2 8 1
31 4 8 1
28 5 8 1
25 6 8 1
24 7 8 1
Here's table(tmpDF2$nSlots)
> table(tmpDF2$nSlots, useNA = "always")
1 2 <NA>
1844 15 0
I'm trying to use ggplot and geom_tile to plot the attendance of clients. I expect two colours for the tiles, one for each of the two values of nSlots, but when the size of the data frame goes up I get a third colour. Here is the plot.
OK. Well I gather you can't see that so perhaps I should stop here! Aha, or maybe you can click through to that link. I hope so!
Here's the code then for what it's worth.
ggplot(data = tmpDF2,
       aes(x = month, y = clientID2, fill = nSlots)) +
  geom_tile() +
  # geom_text(aes(label=nSlots)) +
  theme(panel.background = element_blank()) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.line = element_line()) +
  ylab("clients")
The bizarre thing (to me) is that when I keep the number of rows small, the plot seems to work fine but as the number goes up, there's a point, and I've failed utterly to find if one row in the data or value of nrow(tmpDF2) triggers it, when this third colour, a paler value than the one in the legend, appears.
TIA,
Chris
I'm reading a book and found this code. I tried it and I'm a little confused about the graph I'm getting.
This is a sample of the data.
consumption[sample(1:nrow(consumption), 5, replace=F),]
Food Units Year Amount
8 Fruits and Vegetables Pounds 1980 603.57948
31 Caloric sweeteners Pounds 1995 144.08113
16 Fruits and Vegetables Pounds 1985 630.24491
28 Eggs Number 1995 232.28203
19 Fish and Shellfist Pounds 1990 14.94411
And I'm getting this graph, in which the Y axis labels are numbers from 1 to 20 rather than the correct "Amounts".
What can I do so the Amount index in the Y axis shows correctly?
The figure you show is just like the one in the book, R in a Nutshell, that provided you with the code. Actually, the book provides the code for two different versions of the same plot. I suggest trying them both.
library(nutshell)
data(consumption)
library(lattice)
dotplot(Amount ~ Year | Food, consumption)
dotplot(Amount ~ Year | Food, consumption,
aspect="xy", scales=list(relation="sliced", cex=.4))
I want to create the following plot: The x-axis goes from 1 to 900 representing trial numbers. The y-axis shows three different lines with the moving average of reaction time. One line is shown for each difficulty level (Hard, Medium, Easy). Separate plots should be shown for each participant using facet_wrap.
Now this all works fine if I use ggplot's geom_smooth() function. Like this:
ggplot(cw_trials_f, aes(x=trial_number, y=as.numeric(correct), col=difficulty)) +
facet_wrap(~session_id) +
geom_smooth() +
ggtitle("Stroop Task")
The problem arises when I try to use zoo library's rollmean function. Here is what I tried:
ggplot(cw_trials_f, aes(x=trial_number, y=rollmean(as.numeric(correct)-1, 50, na.pad=T, align="right"), col=difficulty)) +
facet_wrap(~session_id) +
geom_line() +
ggtitle("Stroop Task")
It seems that this doesn't partition the data according to difficulty first and then apply the rollmean function, but the other way around. Thus only one line is shown but in all three colors. How can I have rollmean be applied to each category of trials (Easy, Medium, Hard) separately?
Here is some sample data:
session_id test_number trial_number trial_duration rule concordant switch correct reaction_time difficulty
1 11674020 1 1 1872 word concordant yes yes 1357 Easy
2 11674020 1 2 2839 word discordant no yes 2324 Medium
3 11674020 1 3 1525 color discordant yes no 1025 Hard
4 11674020 1 4 1544 color discordant no no 1044 Medium
5 11674020 1 5 1451 word concordant yes yes 952 Easy
6 11674020 1 6 1252 color concordant yes yes 746 Easy
So, I ended up following @joran's suggestion (thanks) from the comment above and did the following:
library(plyr)  # ddply()
library(zoo)   # rollmean()
# compute the moving averages separately for each session and difficulty level
cw_trials_f <- ddply(cw_trials_f, .(session_id, difficulty), .fun = function(X) {
  transform(X,
            movrt  = rollmean(X$reaction_time, 50, fill = NA, align = "right"),
            movacc = rollmean(as.numeric(X$correct) - 1, 50, fill = NA, align = "right"))
})
This adds two additional columns to the data.frame with the moving averages of accuracy and reaction time.
Then this works fine to plot them:
ggplot(cw_trials_f, aes(x=trial_number, y=movacc, col=difficulty)) + geom_line() + facet_wrap(~session_id) + ggtitle("Stroop Task")
There is one (minor) disadvantage to this compared to what I originally wanted to do: It makes it a bit tedious and slow to try out different lengths for the moving average function.
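One way to reduce that tedium might be to wrap the grouping step in a small function that takes the window length as an argument. The sketch below is not the code used above: `add_moving_averages()` is a hypothetical helper, it uses dplyr instead of plyr, it assumes `correct` holds "yes"/"no" values as in the sample data, and it assumes every session/difficulty group has at least `k` trials. The toy data is randomly generated just so the example runs on its own.

```r
library(dplyr)
library(zoo)
library(ggplot2)

# Hypothetical helper: right-aligned moving averages with window k,
# computed separately for each session_id x difficulty group.
# Assumes each group has at least k rows.
add_moving_averages <- function(trials, k) {
  trials %>%
    group_by(session_id, difficulty) %>%
    arrange(trial_number, .by_group = TRUE) %>%
    mutate(
      movrt  = rollmeanr(reaction_time, k, fill = NA),
      movacc = rollmeanr(as.numeric(correct == "yes"), k, fill = NA)
    ) %>%
    ungroup()
}

# Toy data: 2 sessions x 60 trials, 20 trials per difficulty level in each
set.seed(1)
toy <- data.frame(
  session_id    = rep(c(1, 2), each = 60),
  trial_number  = rep(1:60, times = 2),
  difficulty    = rep(c("Easy", "Medium", "Hard"), times = 40),
  reaction_time = round(runif(120, 700, 2500)),
  correct       = sample(c("yes", "no"), 120, replace = TRUE)
)

smoothed <- add_moving_averages(toy, k = 5)

# The first k - 1 values in each group are NA, so they are dropped before plotting
p <- ggplot(filter(smoothed, !is.na(movacc)),
            aes(x = trial_number, y = movacc, col = difficulty)) +
  geom_line() +
  facet_wrap(~session_id) +
  ggtitle("Stroop Task")
print(p)
```

Trying a different window length is then a single call, for example `add_moving_averages(cw_trials_f, k = 100)`, without overwriting the original data frame.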