Calculating a ratio in a ggplot2 graph while retaining faceting variables - r

So I don't think this has been asked before, but SO search might just be getting confused by combinations of 'ratio' and 'faceting'. I'm trying to calculate a productivity ratio; number of widgets produced for number of workers on a given day or period. I've got my data structured in a single data frame, with each widget produced each day by each worker in it's own record, and other workers that worked that day but didn't produce a widget also in their own record, along with various metadata.
Something like this:
widget_ind
employee_active_ind
employee_id
day
product_type
employee_bu
1
1
123
6/1/2021
pc
americas
0
1
234
6/1/2021
mac
emea
0
1
345
6/1/2021
mac
apac
1
1
444
6/1/2021
mac
americas
1
1
333
6/1/2021
pc
emea
0
1
356
6/1/2021
pc
americas
I'm trying to find the ratio of widget_inds to employee_active_inds, over time, while retaining the metadata, so that i can filter or facet within the ggplot2 code, something like:
plot <- ggplot(data = df[df$employee_bu == 'americas',],aes(y = (widget_ind/employee_active_ind), x = day)) +
geom_bar(stat = 'identity', position = 'stack') +
facet_wrap(product_type ~ ., scales = 'fixed') + #change these to look at different cuts of metadata
print(plot)
Retaining the metadata is appealing rather than making individual dataframes summarizing by the various combinations, but the results with no faceting aren't even correct (e.g. the ggplot is showing a barchart with a height of ~18 widgets per person; creating a summarized dataframe with no faceting is showing a ratio of less than 1 widget per person).
I'm currently getting this error when I run the ggplot code:
Warning message:
Removed 9865 rows containing missing values (geom_bar).
Which doesn't make sense since in my data frame both widget_ind and employee_active_ind have no NA values, so calculating the ratio of the two should always work?
Edit 1: Clarifying employee_active_ind: I should not have any employee_active_ind = 0, but my current joins produce them (and it passes the reality sniff test; the process we are trying to model allows you to do work on day 1 that results in a widget on day 2, where you may not do any work, so wouldn't be counted as active on that day). I think I need to re-think my data structure. Even so, I'm assuming here that ggplot2 is acting like it would for a given bar chart; it's taking the number in each widget_ind record, for a given day (along with any facets and filters), and is then summing that set and displaying the result. The wrinkle I'm adding is dividing by the number of active employees on that day, and while you can have some one out on a given day, you'd never have everyone out. But that isn't what ggplot is doing is it?

I agree with MrFlick - especially the question concerning employee_active_ind of 0. If you have them, this could create NA values where something is divided by 0.

Related

Memory management in R ComplexUpset Package

I'm trying to plot an stacked barplot inside an upset-plot using the ComplexUpset package. The plot I'd like to get looks something like this (where mpaa would be component in my example):
I have a dataframe of size 57244 by 21, where one column is ID and the other is type of recording, and other 19 columns are components from 1 to 19:
ID component1 component2 ... component19 type
1 1 0 1 a
2 0 0 1 b
3 1 1 0 b
Ones and zeros indicate affiliation with a certain component. As shown in the example in the docs, I first convert these ones and zeros to logical, and then try to plot the basic upset plot. Here's the code:
df <- df %>% mutate(across(where(is.numeric), as.logical))
components <- colnames(df)[2:20]
upset(df, components, name='protein', width_ratio = 0.1)
But unfortunately after thinking for a while when processing the last line it spits out an error message like this:
Error: cannot allocate vector of size 176.2 Mb
Though I know I'm using the 32Gb RAM architecture, I'm sure I couldn't have flooded the memory so much that 167 Mb can't be allocated, so my guess is I am managing memory in R somehow wrong. Could you please explein what's faulty in my code, if possible.
I also know that UpsetR package plots the same data, but as far as i know it provides no way for the stacked barplotting.
Somehow, it works if you:
Tweak the min_size parameter so that the plot is not overloaded and makes a better impression
Making the first argument of ComplexUpset a sample with some data also helps, even if your sample is the whole dataset.

Simple barplot displaying voting of a county

I'm fairly new to R and making plots, so sorry about that. I have a dataset of the voting for counties and I want to make a barplot showing how many mandates each county voted for.
What I've done so far is to extract one row, which includes the name of the county and the number of mandates it voted for the different parties (which are headers).
Fylker AP FRP H KrF SP
Ostlandet 3 2 2 0 1
Sorry for the bad display of code, whenever I paste the code, it looks really weird, despite indenting.
The data is called "Ostlandet" and is only 1 row. So as I tried to explain above, I want to make some sort of barplot out of this. The idea is to have the different parties on the x-axis and number of votes on y. I've tried this so far
ggplot(Ostfold, aes(x = Ostfold[1,])) +
geom_histogram(binwidth = 20)
Which just gave me tons of errors.
I've also tried using barplot, but I just can't seem to figure this out.
Sorry, this is probably super easy, but I'm just getting into coding.
You have a few issues. First, there's no need for extracting rows. Second, the data are in "wide" format (mandates in columns) instead of "long format" (a column named "mandate" with values). Third, you want to plot counts so geom_col() is better than geom_histogram().
The gather() function from the tidyr package will get your data from wide into long:
library(tidyr)
library(ggplot2)
Ostfold %>%
gather(Mandate, Votes, -Fylker)
That should generate something like this:
Fylker Mandate Votes
1 Ostlandet AP 3
2 Ostlandet FRP 2
3 Ostlandet H 2
4 Ostlandet KrF 0
5 Ostlandet SP 1
You can pass that to ggplot:
Ostfold %>%
gather(Mandate, Votes, -Fylker) %>%
ggplot(aes(Mandate, Votes)) + geom_col()
Result for your one row:
For a dataset with multiple counties, you might want to add + facet_wrap(~Fylker) to facet the plot by county, depending on how many there are.

R Question: How can I create a histogram with 2 variables against eachother?

Okay, let me be as clear as I can in my problem. I'm new to R, so your patience is appreciated.
I want to create a histogram using two different vectors. The first vector contains a list of models (products). These models are listed as either integers, strings, or NA. I'm not exactly sure how R is storing them (I assume they're kept as strings), or if that is a relevant issue. I also have a vector containing a list of incidents pertaining to that model. So for example, one row in the dataframe might be:
Model Incidents
XXX1991 7
How can I create a histogram where the number of incidents for each model is shown? So the histogram will look like
| =
| =
Frequency of | =
Incidents | = =
| = = =
| = = = = =
- - - - - -
Each different Model
Just to give a general idea.
I also need to be able to map everything out with standard deviation lines, so that it's easy to see which models are the least reliable. But that's not the main question here. I just don't want to do anything that will make me unable to use standard deviation in the future.
So far, all I really understand is how to make a histogram with the frequency marked, but for some reason, the x-axis is marked with numbers, not the models' names.
I don't really care if I have to download new packages to make this work, but I suspect that this already exists in basic R or ggplot2 and I'm just too dumb to figure it out.
Feel free to ask clarfying questions. Thanks.
EDIT: I forgot to mention, there are multiple rows of incidents listed under each model. So to add to my example earlier:
Model Incidents
XXX1991 7
XXX1991 1
XXX1991 19
3
5
XXX1002 9
XXX1002 4
etc . . .
I want to add up all the incidents for a model under one label.
I am assuming that you did not mean to leave the model blank in your example, so I filled in some values.
You can add up the number of incidents by model using aggregate then make the relevant plot using barplot.
## Example Data
data = read.table(text="Model Incidents
XXX1991 7
XXX1991 1
XXX1991 19
XXX1992 3
XXX1992 5
XXX1002 9
XXX1002 4",
header=TRUE)
TAB = aggregate(data$Incidents, list(data$Model), sum)
TAB
Group.1 x
1 XXX1002 13
2 XXX1991 27
3 XXX1992 8
barplot(TAB$x, names.arg=TAB$Group.1 )

geom_tile adds a third fill colour not in the data

It's difficult for me to create a reproducible example of this as the issue only seems to show as the size of the data frame goes up to too large to paste here. I hope someone will bear with me and help here. I'm sure I'm doing something stupid but reading the help and searching is failing (perhaps on the "stupid" issue.)
I have a data frame of 2,319 rows and three variables: clientID, month and nSlots where clientID is character, month is 1:12 and nSlots is 1:2.
> head(tmpDF2)
month clientID2 nSlots
21 1 8 1
30 2 8 1
31 4 8 1
28 5 8 1
25 6 8 1
24 7 8 1
Here's table(tmpDF2$nSlots)
> table(tmpDF2$nSlots, useNA = "always")
1 2 <NA>
1844 15 0
I'm trying to use ggplot and geom_tile to plot the attendance of clients and I expect two colours for the tiles depending on the two values of nSlots but when the size of the data frame goes up, I am getting a third colour. Here is is the plot.
OK. Well I gather you can't see that so perhaps I should stop here! Aha, or maybe you can click through to that link. I hope so!
Here's the code then for what it's worth.
ggplot(dat=tmpDF2,
aes(x=month,y=clientID2,fill=nSlots)) +
geom_tile() +
# geom_text(aes(label=nSlots)) +
theme(panel.background = element_blank()) +
theme(axis.text.x=element_text(angle=90,hjust=1)) +
theme(axis.text.y=element_blank(),
axis.ticks.y=element_blank(),
axis.line=element_line()) +
ylab("clients")
The bizarre thing (to me) is that when I keep the number of rows small, the plot seems to work fine but as the number goes up, there's a point, and I've failed utterly to find if one row in the data or value of nrow(tmpDF2) triggers it, when this third colour, a paler value than the one in the legend, appears.
TIA,
Chris

Excel: Select data for graph

To put it simple, I have three columns in excel like the ones below:
Vehicle x y
1 10 10
1 15 12
1 12 9
2 8 7
2 11 6
3 7 12
x and y are the coordinates of customers assigned to the corresponding vehicle. This file is the output of a program I run in advance. The list will always be sorted by vehicle, but the number of customers assigned to vehicle "k" may change from one experiment to the next.
I would like to plot a graph containing 3 series, one for each vehicle, where the customers of each vehicle would appear (as dots in 2D based on their x- and y- values) in different color.
In my real file, I have 12 vehicles and 3200 customers, and the ranges change from one experiment to the next, so I would like to automate the process, i.e copy-paste the list on my excel and see the graph appear automatically (if this is possible).
Thanks in advance for your time and effort.
EDIT: There is a similar post here: Use formulas to select chart data but requires the use of VB. Moreover, I am not sure whether it has been indeed answered.
you should try this free online tool - www.cloudyexcel.com/excel-to-graph/

Resources