I don't think this has been asked before, but SO search might just be getting confused by combinations of 'ratio' and 'faceting'. I'm trying to calculate a productivity ratio: the number of widgets produced per worker on a given day or period. My data is structured in a single data frame, with each widget produced each day by each worker in its own record, and workers who worked that day but didn't produce a widget also in their own records, along with various metadata.
Something like this:
widget_ind  employee_active_ind  employee_id  day       product_type  employee_bu
1           1                    123          6/1/2021  pc            americas
0           1                    234          6/1/2021  mac           emea
0           1                    345          6/1/2021  mac           apac
1           1                    444          6/1/2021  mac           americas
1           1                    333          6/1/2021  pc            emea
0           1                    356          6/1/2021  pc            americas
I'm trying to find the ratio of widget_inds to employee_active_inds over time while retaining the metadata, so that I can filter or facet within the ggplot2 code, something like:
plot <- ggplot(data = df[df$employee_bu == 'americas', ], aes(y = (widget_ind / employee_active_ind), x = day)) +
  geom_bar(stat = 'identity', position = 'stack') +
  facet_wrap(product_type ~ ., scales = 'fixed') # change these to look at different cuts of metadata
print(plot)
Retaining the metadata is appealing rather than making individual dataframes summarizing by the various combinations, but the results with no faceting aren't even correct (e.g. the ggplot is showing a barchart with a height of ~18 widgets per person; creating a summarized dataframe with no faceting is showing a ratio of less than 1 widget per person).
I'm currently getting this warning when I run the ggplot code:
Warning message:
Removed 9865 rows containing missing values (geom_bar).
Which doesn't make sense since in my data frame both widget_ind and employee_active_ind have no NA values, so calculating the ratio of the two should always work?
Edit 1: Clarifying employee_active_ind: I should not have any employee_active_ind = 0, but my current joins produce them (and it passes the reality sniff test; the process we are trying to model allows you to do work on day 1 that results in a widget on day 2, when you may not do any work and so wouldn't be counted as active on that day). I think I need to rethink my data structure. Even so, I'm assuming here that ggplot2 is acting as it would for any bar chart: it takes the number in each widget_ind record for a given day (along with any facets and filters), sums that set, and displays the result. The wrinkle I'm adding is dividing by the number of active employees on that day, and while you can have someone out on a given day, you'd never have everyone out. But that isn't what ggplot is doing, is it?
I agree with MrFlick, especially on the question about employee_active_ind values of 0. If you have them, they could produce missing values wherever something is divided by 0 (in R, 0/0 evaluates to NaN, which counts as missing).
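To illustrate the point about what ggplot2 is actually summing: with stat = 'identity' and position = 'stack', each row's own widget_ind / employee_active_ind value is stacked, so the bar height is the sum of row-wise ratios rather than total widgets divided by total active employees. A minimal sketch of summing first and dividing afterwards, using base R only, is below. The toy data frame copies the six example rows from the question, and the names totals and ratio are made up for this illustration.

```r
# Toy data copied from the example rows in the question.
df <- data.frame(
  widget_ind          = c(1, 0, 0, 1, 1, 0),
  employee_active_ind = c(1, 1, 1, 1, 1, 1),
  employee_id         = c(123, 234, 345, 444, 333, 356),
  day                 = as.Date(rep("2021-06-01", 6)),
  product_type        = c("pc", "mac", "mac", "mac", "pc", "pc"),
  employee_bu         = c("americas", "emea", "apac",
                          "americas", "emea", "americas")
)

# A row with widget_ind = 0 and employee_active_ind = 0 gives NaN,
# which is.na() reports as missing.
is.na(0 / 0)

# Sum the numerator and denominator per day and product_type first,
# then divide the two sums (hypothetical column name: ratio).
totals <- aggregate(
  cbind(widget_ind, employee_active_ind) ~ day + product_type,
  data = df, FUN = sum
)
totals$ratio <- totals$widget_ind / totals$employee_active_ind
totals
```

The summarised totals data frame could then be passed to ggplot with geom_col() and facet_wrap(~ product_type). The grouping variables on the right-hand side of the formula would need to change for each cut of the metadata, which is the trade-off against keeping one record per row.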
I hope I can explain this correctly. Basically, what I'm trying to do is remove points on a grid where there is no data, but the issue is that I'm trying to do this with 2 factors!
Hopefully I can explain more clearly below.
To begin, I have 2 factors, drink and food, as shown below. Then I'm creating a grid (which I'm using to calculate something else), but I'm trying to remove 'points' from the grid where there is no data. For example:
drink = as.factor(c("A","A","A","A","A","A","A","A","A","A","A","B"))
food = as.factor(c('pizza','pizza','pizza','fries','fries','taco','taco','pizza','taco','pizza','taco','fries'))
# looking at a contingency table
table(drink, food)
     food
drink fries pizza taco
    A     2     5    4
    B     1     0    0
Now I'm creating the grid that spans the entire range of the data like so:
# create the grid
gridvals1 <- levels(drink)
gridvals2 <- levels(food)
gridvalsNew <- expand.grid(gridvals1, gridvals2)
If we plot the data and the grid side-by-side, we can see that the grid covers area where there is no data:
par(mfrow=c(1,2))
plot(drink, food)
plot(gridvalsNew)
What I'm trying to do is resize the grid so that it removes the area where there is no data (i.e., where the count is zero). But I can't figure it out.
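One possible approach, sketched here under the assumption that "no data" means a zero cell in the contingency table, is to turn the table into a data frame and keep only the combinations with a positive count. The names counts and gridvalsObserved are made up for this illustration.

```r
# Same factors as in the question.
drink <- factor(c(rep("A", 11), "B"))
food  <- factor(c("pizza", "pizza", "pizza", "fries", "fries", "taco",
                  "taco", "pizza", "taco", "pizza", "taco", "fries"))

# as.data.frame() on a table gives one row per drink/food combination
# plus a Freq column holding the count.
counts <- as.data.frame(table(drink, food))

# Keep only the grid points that actually occur in the data.
gridvalsObserved <- counts[counts$Freq > 0, c("drink", "food")]
gridvalsObserved
```

With the data above this drops the B/pizza and B/taco combinations and keeps the other four grid points.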
I am very new to this, and I wanted to add the various ways in which I tried to reshape/melt the data. Here is my data in three different variations:
Version 1:
year,type,total,action,perc
2015,v,"1,199,310",crime,42.16
2015,p,"8,024,115",crime,18.24
2015,v,"505,681",arrest,42.16
2015,p,"1,463,213",arrest,18.24
2016,v,"1,250,162",crime,32.85
2016,p,"7,928,530",crime,17.07
2016,v,"410,717",arrest,32.85
2016,p,"1,353,283",arrest,17.07
2017,v,"1,247,321",crime,41.58
2017,p,"7,694,086",crime,16.24
2017,v,"518,617",arrest,41.58
2017,p,"1,249,757",arrest,16.24
Version 2:
year,type,crime,arrest,perc
2015,1,"1,199,310","505,681",42.16
2015,2,"8,024,115","1,463,213",18.24
2016,1,"1,250,162","410,717",32.85
2016,2,"7,928,530","1,353,283",17.07
2017,1,"1,247,321","518,617",41.58
2017,2,"7,694,086","1,249,757",16.24
Version 3:
df <- vpcrimetotal
year,vcrime,varrest,varrestperc,pcrime,parrest,parrestperc
2017,"1,247,321","518,617",0.4158,"7,694,086","1,249,757",0.1624
2016,"1,250,162","410,717",0.3285,"7,928,530","1,353,283",0.1707
2015,"1,199,310","505,681",0.4216,"8,024,115","1,463,213",0.1824
The idea is to show the total number of violent crime versus property crime from 1990-2017 with the number of arrests (labeled as a percent) inside each bar based on crime type (property or violent). The preference is to stack all four into one bar per year with different colors for each.
I found these, which helped, but I was still confused about how to fit my data into them: "how to create stacked bar charts for multiple variables with percentages", but to maybe look like "Count and Percent Together using Stack Bar in R".
I have tried these datasets with that code, but it would probably be confusing if I posted all the different attempts that didn't work.
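As a starting point, here is a rough sketch based on the Version 1 layout (only the 2015 and 2016 rows are copied, for brevity). It assumes the totals arrive as text with thousands separators, which have to be stripped before plotting. The names csv_text, crime, and group are made up for this illustration, and the plotting part only runs if ggplot2 is installed.

```r
# A subset of the "Version 1" data from above, read from inline text.
csv_text <- 'year,type,total,action,perc
2015,v,"1,199,310",crime,42.16
2015,p,"8,024,115",crime,18.24
2015,v,"505,681",arrest,42.16
2015,p,"1,463,213",arrest,18.24
2016,v,"1,250,162",crime,32.85
2016,p,"7,928,530",crime,17.07
2016,v,"410,717",arrest,32.85
2016,p,"1,353,283",arrest,17.07'
crime <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# "1,199,310" is read as character; drop the commas and convert.
crime$total <- as.numeric(gsub(",", "", crime$total))

# One fill category per type/action pair (four segments per bar).
crime$group <- paste(crime$type, crime$action, sep = "_")

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(crime, aes(x = factor(year), y = total,
                         fill = group, group = group)) +
    geom_col() +
    # Label only the arrest segments with their percentage.
    geom_text(aes(label = ifelse(action == "arrest",
                                 paste0(perc, "%"), "")),
              position = position_stack(vjust = 0.5)) +
    labs(x = "year", y = "total", fill = "type_action")
  print(p)
}
```

Note that stacking arrests on top of crimes makes the bar height the sum of all four totals, which may or may not be the intended reading of the chart.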
I'm fairly new to R and making plots, so sorry about that. I have a dataset of the voting for counties and I want to make a barplot showing how many mandates each county voted for.
What I've done so far is to extract one row, which includes the name of the county and the number of mandates it voted for the different parties (which are headers).
Fylker     AP  FRP  H  KrF  SP
Ostlandet  3   2    2  0    1
Sorry for the bad display of code, whenever I paste the code, it looks really weird, despite indenting.
The data is called "Ostlandet" and is only 1 row. So as I tried to explain above, I want to make some sort of barplot out of this. The idea is to have the different parties on the x-axis and number of votes on y. I've tried this so far
ggplot(Ostfold, aes(x = Ostfold[1,])) +
geom_histogram(binwidth = 20)
Which just gave me tons of errors.
I've also tried using barplot, but I just can't seem to figure this out.
Sorry, this is probably super easy, but I'm just getting into coding.
You have a few issues. First, there's no need to extract rows. Second, the data are in "wide" format (mandates in columns) instead of "long" format (a column named "Mandate" with the values in a separate column). Third, you want to plot counts that are already tallied, so geom_col() is better than geom_histogram().
The gather() function from the tidyr package will get your data from wide into long:
library(tidyr)
library(ggplot2)
Ostfold %>%
  gather(Mandate, Votes, -Fylker)
That should generate something like this:
     Fylker Mandate Votes
1 Ostlandet      AP     3
2 Ostlandet     FRP     2
3 Ostlandet       H     2
4 Ostlandet     KrF     0
5 Ostlandet      SP     1
You can pass that to ggplot:
Ostfold %>%
  gather(Mandate, Votes, -Fylker) %>%
  ggplot(aes(Mandate, Votes)) +
  geom_col()
Result for your one row:
For a dataset with multiple counties, you might want to add + facet_wrap(~Fylker) to facet the plot by county, depending on how many there are.
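Since the question also mentions trying barplot, here is a minimal base-R sketch for the one-row case. The data frame is rebuilt by hand from the row shown in the question, and the name votes is made up for this illustration.

```r
# One-row data frame rebuilt from the row shown in the question.
Ostfold <- data.frame(Fylker = "Ostlandet",
                      AP = 3, FRP = 2, H = 2, KrF = 0, SP = 1)

# Drop the county column and flatten the row into a named numeric
# vector; barplot() uses the names as bar labels.
votes <- unlist(Ostfold[1, -1])

barplot(votes, xlab = "Mandate", ylab = "Votes")
```

This avoids reshaping altogether, but it only handles one county at a time, so the long-format approach above scales better to the full dataset.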
I am a beginner with R, so I don't have much experience. I ran into a problem when trying to split my scatterplot into groups based on infection status. My dataset consists of log-transformed antibody levels (logapfhap2 in this example). Infection status (any_Pf_inf) is coded as Yes or No and indicates whether someone was infected during the follow-up period. I am plotting time points (x) against antibody levels (y). For time points 1 and 14 I would like to make 2 groups based on infection status.
This is the main part of the code I use to plot the data without splitting in groups:
ggplot() +
  geom_jitter(data = data2, aes(x = '1', y = logapfhap2, colour = 'PfHAP2A')) +
  geom_jitter(data = data2, aes(x = '14', y = logbpfhap2, colour = 'PfHAP2B')) +
  geom_jitter(data = TRC, aes(x = 'C', y = PfHAP2, colour = 'PfHAP2C'))
which results in this graph:
Then I tried to split it (I only show the first time point here) which returns an error.
ggplot() +
geom_jitter(data=data2[data2$any_Pf_inf=='Yes'],
aes(x='1inf', y=logapfhap2[data2$any_Pf_inf=='Yes'],
colour='PfHAP2A')) +
geom_jitter(data=data2[data2$any_Pf_inf=='No'],
aes(x='1un', y=logapfhap2[data2$any_Pf_inf=='No'],
colour='PfHAP2B'))
I wanted to create this graph but I get this error:
Error: Length of logical index vector must be 1 or 55, got: 482
Hope this is clear! Could anyone help me with this problem? Thanks!
EDIT
Not sure if this makes it clearer, but this is what my data looks like:
I just tried some other things and I have solved it now!
ggplot()+
geom_jitter(data=data2[data2$any_Pf_inf=='Yes',],
aes(x='1inf', y=logapfhap2,
colour='PfHAP2A')) +
geom_jitter(data=data2[data2$any_Pf_inf=='No',],
aes(x='1un', y=logbpfhap2,
colour='PfHAP2B'))
Apparently you have to add a comma after the condition inside the brackets, as in data2[data2$any_Pf_inf=='Yes',], to extract rows instead of columns.
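The difference can be seen on a small made-up data frame (the values below are illustrative only, and the names infected and bad are hypothetical). With a comma, the logical vector selects rows. Without it, R treats the vector as a column selector, which fails when it is longer than the number of columns. The exact wording of the error depends on the data frame class.

```r
# Tiny stand-in for data2 with three rows and two columns.
data2 <- data.frame(
  logapfhap2 = c(1.2, 0.8, 1.5),
  any_Pf_inf = c("Yes", "No", "Yes")
)

# With the comma: the condition picks rows, all columns are kept.
infected <- data2[data2$any_Pf_inf == "Yes", ]
nrow(infected)

# Without the comma: the length-3 logical vector is applied to the
# 2 columns, so the call errors; tryCatch() captures the condition.
bad <- tryCatch(data2[data2$any_Pf_inf == "Yes"],
                error = function(e) e)
inherits(bad, "error")
```

Because the subset already contains only the matching rows, the y aesthetic can simply name the column, as in the working code above.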