R geom_point(): make the number of points reflect the value in a column

Say I have mydf, a dataframe which is as follows:
Name  Value
Mark  101
Joe   121
Bill  131
How would I go about creating a scatterplot in ggplot that takes the data in the value column (e.g., 101) and makes that number of points on a chart? Would this be a stat = that I am unfamiliar with, or would I have to structure the data such that Mark, for example, has 101 unique rows, Joe has 121, etc.?

Update: As suggested by Ben Bolker (many thanks), we can set the width of geom_jitter. We can also add a colour aesthetic:
df %>%
  group_by(Name) %>%
  complete(Value = 1:Value) %>%
  ggplot(aes(x = Name, y = Value, colour = Name)) +
  geom_jitter(width = 0.1)
Or, more compactly, as suggested by Henrik (many thanks), using uncount:
ggplot(uncount(df, Value, .id = "y"), aes(x = Name, y = y)) + ...
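As a minimal, self-contained sketch of the uncount() route: the data frame below is typed in from the question's table, and the names long and p as well as the colour and jitter settings are illustrative choices rather than part of the original answer.

```r
library(tidyr)    # uncount()
library(ggplot2)

# Data from the question (called mydf there)
mydf <- data.frame(Name  = c("Mark", "Joe", "Bill"),
                   Value = c(101, 121, 131))

# One row per point: each row is repeated Value times, and
# .id = "y" numbers the copies 1..Value within each original row
long <- uncount(mydf, Value, .id = "y")

# Build the plot; print(p) draws it
p <- ggplot(long, aes(x = Name, y = y, colour = Name)) +
  geom_jitter(width = 0.1, height = 0)
```

Setting height = 0 keeps each point at its integer y position, so only the horizontal placement is jittered.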
First answer:
Something like this?
library(dplyr)
library(ggplot2)
library(tidyr) # complete
df %>%
  group_by(Name) %>%
  complete(Value = 1:Value) %>%
  ggplot(aes(x = Name, y = Value)) +
  geom_jitter()

Related

How to Create A Stacked Column Plot of Multiple Variables in R (ggplot2)

Currently, I have a data frame that looks like this:
Month       Total Revenue  Dues   Total Retail  Other Revenue
8/31/2020   36615.00       30825  1200          4590
9/30/2020   38096.69       34322  2779.4        995.29
10/31/2020  43594.15       35936  2074.68       5583.47
11/30/2020  51856.9        43432  993.5         7431.4
I want to create a plot (which I imagine should be a stacked column) in ggplot that shows the revenue mix by type for each month. For my data, Total Revenue is the sum of dues, total retail and other revenue. Dues, total retail and other revenue should stack on top of each other, each having its own colour. I also want labels on the column chart describing what percentage of the total revenue is from each source of income.
I can plot the total revenue with no issues, but I cannot seem to wrap my head around splitting the columns up. My only successful example so far is as follows.
# Create Column Plot of Total Revenue
library(tidyverse)
plot1 <- ggplot(August_Data, aes(Month_End, `Total Revenue`)) + geom_col()
This example obviously does not split the revenue into its subcategories. I thought that using the fill aesthetic might work, but I get an error:
plot1 <- ggplot(August_Data, aes(Month_End, `Total Revenue`)) + geom_col(aes(fill = C(Dues, `Total Retail`, `Other Revenue`)))
Thank you so much for your help
Update after clarification:
library(tidyverse)
library(lubridate)
df %>%
  mutate(Month = mdy(Month)) %>% # this line is not necessary in OP's original code (not the one presented here)
  pivot_longer(
    cols = c("Dues", "TotalRetail", "OtherRevenue"),
    # cols = -c(Month_End, SID) in OP's original code
    names_to = "names",
    values_to = "values"
  ) %>%
  mutate(percent = values / TotalRevenue * 100) %>%
  ggplot(aes(x = Month, y = values, fill = names)) +
  geom_col() +
  geom_text(aes(label = paste0(round(percent, 1), "%")),
            position = position_stack(vjust = 0.5), size = 5)
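The percent column can be sanity-checked without drawing anything. The sketch below assumes the data frame is typed in from the question's table with the spaces removed from the column names (as the answer's code does); the names shares and check are illustrative.

```r
library(dplyr)
library(tidyr)

# Data from the question, column names without spaces (an assumption)
df <- data.frame(
  Month        = c("8/31/2020", "9/30/2020", "10/31/2020", "11/30/2020"),
  TotalRevenue = c(36615.00, 38096.69, 43594.15, 51856.90),
  Dues         = c(30825, 34322, 35936, 43432),
  TotalRetail  = c(1200, 2779.40, 2074.68, 993.50),
  OtherRevenue = c(4590, 995.29, 5583.47, 7431.40)
)

# Long format: one row per month and revenue type, with its share of the total
shares <- df %>%
  pivot_longer(cols = c(Dues, TotalRetail, OtherRevenue),
               names_to = "names", values_to = "values") %>%
  mutate(percent = values / TotalRevenue * 100)

# The three shares of each month should add up to 100 (up to rounding error)
check <- shares %>%
  group_by(Month) %>%
  summarise(total_percent = sum(percent))
```

If total_percent is not close to 100 for some month, the denominator (here TotalRevenue) does not match the stacked components, and the labels would be misleading.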
First answer:
You were almost there. Pivot longer and add fill.
library(tidyverse)
library(lubridate)
df %>%
  mutate(Month = mdy(Month)) %>%
  pivot_longer(
    -Month,
    names_to = "names",
    values_to = "values"
  ) %>%
  ggplot(aes(x = Month, y = values, fill = names)) +
  geom_col()

ifelse condition: is in top n

Usually, when I need to label only a subset of points with geom_label(), I use ifelse() and specify a fixed number, as below:
library(tidyverse)
data = starwars %>% filter(mass < 500)
data %>%
  ggplot(aes(x = mass, y = height, label = ifelse(birth_year > 100, name, NA))) +
  geom_point() +
  geom_label()
#> Warning: Removed 54 rows containing missing values (geom_label).
Created on 2020-05-31 by the reprex package (v0.3.0)
But with the dataset I'm working on, I need a dynamic solution, something like ifelse("birth_year is in top n", name, NA).
Thoughts?
For your method, I think using rank should work fine, e.g.,
ifelse(rank(birth_year) < 10, name, NA)
You can use rank(-birth_year) if you want it sorted the other way (or, if you're using dplyr, rank(desc(birth_year)), which will work on non-numeric columns too). You may want to read up on tie methods at ?rank.
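To make the tie and NA behaviour concrete, here is a small sketch on a made-up vector (x and the variable names are illustrative, not taken from the starwars data):

```r
x <- c(30, 10, NA, 20, 20)

# Default: ties get the average rank, and NA is ranked last (na.last = TRUE)
r_default <- rank(x)                       # 4.0 1.0 5.0 2.5 2.5

# Largest first; tied values share the lowest rank of the tie
r_desc <- rank(-x, ties.method = "min")    # 1 4 5 2 2

# Keep NA as NA so that missing values can never enter the top n
in_top2 <- rank(-x, ties.method = "min", na.last = "keep") <= 2
# TRUE FALSE NA TRUE TRUE
```

Note that with ties the "top 2" flag can select more than two elements, and that an NA condition makes ifelse() return NA, which is what the label aesthetic needs here anyway.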
I'd also propose a more general solution: filtering data for the geom_label layer. For more complex conditions (e.g., where a group_by would come in handy) it will be more straightforward:
data %>%
  ggplot(aes(x = mass, y = height, label = name)) +
  geom_point() +
  geom_label(
    data = data %>%
      group_by(species) %>%
      top_n(n = 1, wt = desc(birth_year)) # youngest of each species
  )
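With dplyr 1.0.0 or later, top_n() is documented as superseded by slice_min() and slice_max(). A possible equivalent of the filtering step, as a sketch: dropping missing birth years first and using with_ties = FALSE are choices made here, not part of the answer above, and youngest is an illustrative name.

```r
library(dplyr)

data <- starwars %>% filter(mass < 500)

# Youngest character of each species: smallest birth_year (years before
# the Battle of Yavin), exactly one row per species
youngest <- data %>%
  filter(!is.na(birth_year)) %>%   # assumption: ignore unknown birth years
  group_by(species) %>%
  slice_min(birth_year, n = 1, with_ties = FALSE) %>%
  ungroup()
```

The result can be passed to the label layer in the same way, for example geom_label(data = youngest).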
Something like this? To get top 4 values.
library(ggplot2)
data %>%
  ggplot(aes(x = mass, y = height,
             label = ifelse(birth_year >= sort(birth_year, decreasing = TRUE)[4], name, NA))) +
  geom_point() +
  geom_label()
This is a more explicit approach. I assume you want to count the number of characters per birth year, per your example. In this case, we handle the ranking first, then add a column to the original dataset, then plot. The new 'label' field is either blank/NA or has members of the top set. I suppress the pesky missing data warning in the geom_label arguments.
data = starwars %>% filter(mass < 500)
# counts names per birthyear, returns vector of top 4
top4 <- data %>%
  drop_na(birth_year) %>%
  count(birth_year, sort = TRUE) %>%
  top_n(4) %>%
  pull(birth_year)
# adds column to data with the names from the top 4 birth years
data <- data %>%
  mutate(label = ifelse(birth_year %in% top4, name, NA))
# plots data with label, dropping NAs
data %>%
  ggplot(aes(x = mass, y = height, label = label)) +
  geom_point() +
  geom_label(na.rm = TRUE)

bicolor heatmap with factor levels

I have this dataframe:
set.seed(0)
df <- data.frame(id = factor(sample(1:100, 10000, replace = TRUE), levels = 1:100),
                 year = factor(sample(1950:2019, 10000, replace = TRUE), levels = 1950:2019)) %>%
  unique() %>%
  arrange(id, year)
I want to plot a heatmap with the ids on the x-axis and the years on the y-axis, where a tile is blue when the data point exists and red when it doesn't. I'm almost there, but I can't figure out how to change the fill argument to get the two colors:
ggplot(df, aes(id, year, fill= year)) +
geom_tile()
The reason for plotting both variables as factors is to show every level, even when some year doesn't have any id (its whole row should then be red).
EDIT:
Two things I forgot to add (hope it's not too late):
How can I add alpha transparency to geom_tile() without breaking the plot?
I need to sort the ids from the most missing values to the fewest.
The complete() function from the tidyr package is useful for filling in missing combinations. First, set a flag variable to indicate that the data is present, then expand the data frame with the missing combinations and fill the flag with FALSE for them:
df <- df %>%
  mutate(flag = TRUE) %>%
  complete(id, year, fill = list(flag = FALSE))
ggplot(df, aes(id, year, fill = flag)) +
  geom_tile()
EDIT1: To add transparency, add alpha = 0.x within geom_tile(), where x is a value indicating the transparency. The lower the value, the more transparent.
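Putting the pieces together, a minimal sketch with the transparency and the blue/red colours asked for in the question. The data is regenerated as in the question; full, p, the alpha value 0.6 and the legend labels are illustrative choices.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(0)
df <- data.frame(id = factor(sample(1:100, 10000, replace = TRUE), levels = 1:100),
                 year = factor(sample(1950:2019, 10000, replace = TRUE), levels = 1950:2019)) %>%
  unique() %>%
  arrange(id, year)

# Every id/year combination, flagged TRUE if present and FALSE if missing
full <- df %>%
  mutate(flag = TRUE) %>%
  complete(id, year, fill = list(flag = FALSE))

# Blue = present, red = missing, with partial transparency; print(p) draws it
p <- ggplot(full, aes(id, year, fill = flag)) +
  geom_tile(alpha = 0.6) +
  scale_fill_manual(values = c("TRUE" = "blue", "FALSE" = "red"),
                    labels = c("TRUE" = "present", "FALSE" = "missing"))
```

Because id and year are factors with all levels declared, complete() also creates rows for levels that never occur in the data, which is what makes a fully missing year show up as a red row.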
EDIT2: To sort by missingness add the following code prior to the ggplot code:
# Determine the order of the IDs (fewest present years, i.e. most missing, first)
df_order <- df %>%
  group_by(id) %>%
  summarize(sum = sum(flag)) %>%   # number of years present per id
  arrange(sum) %>%
  mutate(order = row_number()) %>%
  select(id, order)
# Set the IDs in order on the chart (fct_reorder() comes from forcats)
df <- df %>%
  left_join(df_order, by = "id") %>%
  mutate(id = fct_reorder(id, order))
I think you need to do some pre-processing before plotting. Create a temporary variable (data_exist) which denotes data is present for that id and year. Then use complete to fill the missing years for each id and plot it.
library(tidyverse)
df %>%
  mutate_all(~as.integer(as.character(.))) %>%
  mutate(data_exist = 1) %>%
  complete(id, year = min(year):max(year), fill = list(data_exist = 0)) %>%
  mutate(data_exist = factor(data_exist)) %>%
  ggplot() + aes(id, year, fill = data_exist) + geom_tile()
With expand.grid you can create a data frame with all combinations of ids and years, then left join df (with a marker column) onto these combinations to see which of them are present in df:
all <- expand.grid(id = levels(df$id), year = levels(df$year)) %>%
  left_join(df %>% mutate(present = '1'), by = c("id", "year")) %>% # mark the rows that exist in df
  mutate(present = ifelse(is.na(present), '0', '1'))
ggplot(all, aes(as.numeric(id), as.numeric(year), fill = present)) +
  geom_tile() +
  scale_fill_manual(values = c('0' = 'red', '1' = 'blue')) + # change default colors
  theme(legend.position = "None") # hide legend

Adding character values of a column in R

I have two columns i.e. square_id & Smart_Nsmart as given below.
I want to count(add) N's and S's against each square_id and ggplot the data i.e. plot square_id vs Smart_Nsmart.
square_id  Smart_Nsmart
1          S
1          N
2          N
2          N
2          S
2          S
3          N
3          S
3          S
3          S
We can use count and then use ggplot to plot the frequency. Here, we are plotting it with geom_bar (as it is not clear from the OP's post)
library(dplyr)
library(ggplot2)
df %>%
  count(square_id, Smart_Nsmart) %>%
  ggplot(., aes(x = square_id, y = n, fill = Smart_Nsmart)) +
  geom_bar(stat = 'identity')
The above answer is very smart. However, instead of the count function, you can use group_by and summarise, which makes it easier to apply other summary functions later on.
library(dplyr)
library(ggplot2)
dff <- data.frame(a=c(1,1,1,1,2,1,2),b=c("C","C","N","N","N","C","N"))
dff %>%
  group_by(a, b) %>%
  summarise(n = length(b)) %>%
  ggplot(., aes(x = a, y = n, fill = b)) +
  geom_bar(stat = 'identity')
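The two approaches give the same counts. A small sketch using the values from the question, assuming the two columns pair up by position (the question's formatting does not make this explicit); by_count and by_summarise are illustrative names, and .groups = "drop" assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Values from the question, paired by position (an assumption)
df <- data.frame(
  square_id    = c(1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  Smart_Nsmart = c("S", "N", "N", "N", "S", "S", "N", "S", "S", "S")
)

# Shortcut: count() groups, tallies and ungroups in one step
by_count <- df %>% count(square_id, Smart_Nsmart)

# Explicit version: easy to extend with further summary functions
by_summarise <- df %>%
  group_by(square_id, Smart_Nsmart) %>%
  summarise(n = n(), .groups = "drop")
```

Either result can be piped into the ggplot() call shown above.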

grouped by factor level in ggplot2()

I've got a data frame with four three-level categorical variables: before_weight, after_weight, before_pain, and after_pain.
I'd like to make a bar plot showing the proportion for each level of the variables. My current code achieves that.
The problem's the presentation of the data. I'd like the respective before and after bars to be grouped together, so that the bar representing the people that answered 1 in the before_weight variable is grouped next to the bar representing the people that answered 1 in the after_weight variable, and so forth for both the weight and pain variables.
I've been trying to use dplyr, mutate() with numerous ifelse() statements, to make a new variable pairing up the groups in question, but can't seem to get it to work.
Any help would be much appreciated.
starting point (df):
df <- data.frame(before_weight=c(1,2,3,2,1),before_pain=c(2,2,1,3,1),after_weight=c(1,3,3,2,3),after_pain=c(1,1,2,3,1))
current code:
library(tidyr)
dflong <- gather(df, varname, score, before_weight:after_pain, factor_key = TRUE)
dflong$score <- as.factor(dflong$score) # score only exists in the long data, not in df
library(ggplot2)
library(dplyr)
dflong %>%
  group_by(varname) %>%
  count(score) %>%
  mutate(prop = 100 * (n / sum(n))) %>%
  ggplot(aes(x = varname, y = prop, fill = factor(score))) +
  scale_fill_brewer() +
  geom_col(position = 'dodge', colour = 'black')
UPDATE:
I'd like proportions rather than counts, so I've attempted to tweak Nate's code. Since I'm using the question variable to group the data to get the proportions, I can't seem to use gsub() to change the content of that variable. Instead, I added question2 and passed it into facet_wrap(). It seems to work:
df %>% gather("question", "val") %>%
  count(question, val) %>%
  group_by(question) %>%
  mutate(percent = 100 * (n / sum(n))) %>%
  mutate(time = factor(ifelse(grepl("before", question), "before", "after"), c("before", "after"))) %>%
  mutate(question2 = ifelse(grepl("weight", question), "weight", "pain")) %>%
  ggplot(aes(x = val, y = percent, fill = time)) +
  geom_col(position = "dodge") +
  facet_wrap(~question2)
Does this code make the visual comparisons you are after? One ifelse and a gsub will help make variables we can use for facetting and filling in ggplot.
df %>% gather("question", "val") %>% # go long
  mutate(time = factor(ifelse(grepl("before", question), "before", "after"),
                       c("before", "after")), # use factor with levels to control order
         question = gsub(".*_", "", question)) %>% # clean for facets
  ggplot(aes(x = val, fill = time)) + # use fill not color for whole bar
  geom_bar(position = "dodge") + # stacking is the default option
  facet_wrap(~question) # two panels
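A possible variant with pivot_longer(), which can split names such as before_weight into two columns in one step, so no ifelse() or gsub() is needed. This is a sketch that combines the approach above with the percentages from the update; props and p are illustrative names.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

df <- data.frame(before_weight = c(1, 2, 3, 2, 1), before_pain = c(2, 2, 1, 3, 1),
                 after_weight = c(1, 3, 3, 2, 3), after_pain = c(1, 1, 2, 3, 1))

props <- df %>%
  # "before_weight" -> time = "before", question = "weight"
  pivot_longer(everything(),
               names_to = c("time", "question"), names_sep = "_",
               values_to = "val") %>%
  mutate(time = factor(time, levels = c("before", "after"))) %>%
  count(question, time, val) %>%
  group_by(question, time) %>%
  mutate(percent = 100 * n / sum(n)) %>%   # share within each question/time
  ungroup()

# Before/after bars side by side for each answer level; print(p) draws it
p <- ggplot(props, aes(x = factor(val), y = percent, fill = time)) +
  geom_col(position = "dodge") +
  facet_wrap(~question)
```

Answer levels that nobody chose at one of the two time points have no row in props, so the remaining bar in that slot is drawn at full width; tidyr's complete() could be used to add explicit zeros if that matters.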
