ifelse condition: is in top n - r

Usually, when I need to label only a subset of points with geom_label(), I use ifelse() with a hard-coded threshold, as below:
library(tidyverse)
data = starwars %>% filter(mass < 500)
data %>%
ggplot(aes(x = mass, y = height, label = ifelse(birth_year > 100, name, NA))) +
geom_point() +
geom_label()
#> Warning: Removed 54 rows containing missing values (geom_label).
Created on 2020-05-31 by the reprex package (v0.3.0)
But with the dataset I'm working on, I need a dynamic solution, something like ifelse("birth_year is in top n", name, NA).
Thoughts?

For your method, I think using rank should work fine, e.g.,
ifelse(rank(birth_year) < 10, name, NA)
You can use rank(-birth_year) if you want it sorted the other way (or, if you're using dplyr, rank(desc(birth_year)), which will work on non-numeric columns too). You may want to read up on tie methods at ?rank.
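To make the tie and NA behaviour concrete, here is a minimal sketch on a toy vector. The values and names are made up for illustration; ties.method and na.last are documented arguments of base rank().

```r
# Toy data standing in for birth_year and name (hypothetical values,
# including one tie and one NA)
birth_year <- c(19, 112, 33, 41.9, 112, NA, 896)
name <- c("A", "B", "C", "D", "E", "F", "G")

# Rank in descending order. ties.method = "min" gives tied values the same
# (lowest) rank; na.last = "keep" leaves the rank of an NA as NA.
r <- rank(-birth_year, ties.method = "min", na.last = "keep")

# Keep the name only for the top 3; the is.na() check guards the NA rank
label <- ifelse(!is.na(r) & r <= 3, name, NA)
label
```

With ties.method = "min", both tied values are labelled, so "top 3" can occasionally return more than three labels. Use ties.method = "first" if you need exactly n.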
I'd also propose a more general solution: filtering data for the geom_label layer. For more complex conditions (e.g., where a group_by would come in handy) it will be more straightforward:
data %>%
ggplot(aes(x = mass, y = height, label = name)) +
geom_point() +
geom_label(
data = data %>%
group_by(species) %>%
top_n(n = 1, wt = desc(birth_year)) # youngest of each species
)
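In current dplyr, top_n() is superseded by slice_max() and slice_min(), so the same idea could be sketched as below. This is only one possible variant; the name label_data and the choice to drop rows with a missing birth_year are my assumptions, not part of the answer above.

```r
library(dplyr)
library(ggplot2)

data <- starwars %>% filter(mass < 500)

# Youngest character of each species (smallest birth_year).
# Assumption: rows with a missing birth_year should never be labelled.
label_data <- data %>%
  filter(!is.na(birth_year)) %>%
  group_by(species) %>%
  slice_min(birth_year, n = 1, with_ties = FALSE) %>%
  ungroup()

# All points are drawn; only the filtered rows get a label
p <- ggplot(data, aes(x = mass, y = height, label = name)) +
  geom_point() +
  geom_label(data = label_data)
```

Because the label layer has its own data, no NA labels are produced and the "Removed rows" warning from geom_label() should not appear.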

Something like this? To get top 4 values.
library(ggplot2)
data %>%
ggplot(aes(x = mass, y = height, label = ifelse(birth_year >= sort(birth_year, decreasing = TRUE)[4], name, NA))) +
geom_point() +
geom_label()

This is a more explicit approach. I assume you want to count the number of characters per birth year, per your example. In this case, we handle the ranking first, then add a column to the original dataset, then plot. The new 'label' field is either blank/NA or has members of the top set. I suppress the pesky missing data warning in the geom_label arguments.
data = starwars %>% filter(mass < 500)
# counts names per birthyear, returns vector of top 4
top4 <- data %>%
drop_na(birth_year) %>%
count(birth_year, sort = TRUE) %>%
top_n(4) %>%
pull(birth_year)
# adds column to data with the names from the top 4 birth years
data <- data %>%
mutate(label = ifelse(birth_year %in% top4, name, NA))
# plots data with label, dropping NAs
data %>%
ggplot(aes(x = mass, y = height, label = label)) +
geom_point() +
geom_label(na.rm = TRUE)

Related

lapply on a list of dataframe to create plots

I have created a list of data frames, df_list, each of which is an 18 x 4 data frame.
The first column of each data frame is the gene name repeated 18 times, and the remaining three columns hold that gene's information. Each data frame describes a different gene.
Now I'd like to iterate over the list of data frames (i.e., a list of genes and their respective information) with a boxplot function to get one plot per gene; however, I am not sure how to deal with the ggtitle below:
Here is my simplified boxplot function:
box <- function(df){
df %>%
ggplot(df, aes(x = df[,4], y = df[,2])) +
geom_boxplot() +
ggtitle(g)
}
g is the gene name in each dataframe in the df_list
and when I run lapply(df_list,box),
I got Error: Mapping should be created with aes() or aes_().
Does anyone know how to fix this? Thank you.
Because you pipe df into ggplot(), its first argument (data) is already filled, so the extra df is interpreted as the mapping argument instead of aes(...). Drop the extra df:
box <- function(df){
df %>%
ggplot(aes(x = df[,4], y = df[,2])) +
geom_boxplot() +
ggtitle(g)
}
Instead of indexing with df[,4] or df[,2], use column names in aes(). Assuming the x-axis column is named col1 and the y-axis column is named col2, try:
box <- function(df){
df %>%
ggplot(aes(x = col1, y = col2)) +
geom_boxplot() +
ggtitle(df$g[1])
}
lapply(df_list,box)
g is the column name in the dataframe, so we can take the first value from it in title.
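If the column names are not known in advance, a possible sketch is to look them up by position and pass them to aes() through the .data pronoun. The toy df_list below and its column names (gene, expr, batch, group) are invented for illustration; only the 18 x 4 shape follows the question.

```r
library(ggplot2)

# Hypothetical stand-in for df_list: column 1 repeats the gene name
set.seed(1)
make_df <- function(gene) {
  data.frame(
    gene  = rep(gene, 18),
    expr  = rnorm(18),
    batch = rep(1:3, 6),
    group = rep(c("a", "b", "c"), each = 6)
  )
}
df_list <- lapply(c("geneA", "geneB"), make_df)

box <- function(df) {
  # Find the column names by position, then map them via the .data pronoun
  x_col <- names(df)[4]
  y_col <- names(df)[2]
  ggplot(df, aes(x = .data[[x_col]], y = .data[[y_col]])) +
    geom_boxplot() +
    labs(x = x_col, y = y_col) +
    ggtitle(df[[1]][1])  # first value of the gene-name column
}

plots <- lapply(df_list, box)
```

plots[[1]] then draws the boxplot for the first gene, with the gene name as its title.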
This is how to do it with another example dataset (Species instead of genes):
library(tidyverse)
plots <-
iris %>%
pivot_longer(-Species) %>%
nest(data = -Species) %>%
mutate(
plt = data %>% map2(Species, ~ {
.x %>%
ggplot(aes(name, value)) +
geom_boxplot() +
labs(title = .y)
})
) %>%
pull(plt)
plots[[1]]
Created on 2021-09-09 by the reprex package (v2.0.1)

Timeseries graphs of mean values of group in R

I am learning R and working with a data set in which the same set of columns is repeated many times, say 200 times.
I want to take the mean of each column and then group the means by variable, so there will be 200 mean values for each variable. I want to make a line chart of these mean values for each variable.
I am trying this code:
library(data.table)
library(tidyverse)
library(ggplot2)
library(viridisLite)
df <- read.table("H-W.csv", sep = ",")
df
dat %>% filter(Scenario != 'NULL') %>%
mutate("Scenario" = ifelse(Scenario == 'NULL2', "BASELINE", Scenario)) %>%
group_by(.dots = c("X.step.", "Scenario")) %>%
summarise('height.people' = mean(height),
'weight.people' = mean(weight),
"wealth.people" = mean(wealth)) %>%
pivot_longer(c('height.people', 'weight.people', 'wealth.people')) %>%
ggplot(aes(x = X.step., y = value, colour = Scenario)) +
geom_line(size = 1) + facet_grid(name~., scales = "free_y") + theme_classic() +
scale_colour_viridis_d() + scale_y_log10()
I get this error:
Error in UseMethod("filter") :
no applicable method for 'filter' applied to an object of class "NULL"
I think you might have the same problem as this...
Is your data in a data.frame or tibble?
Otherwise, if that doesn't work, try this:
filter is a function in both stats and dplyr,
so you could try changing
dat %>% filter(Scenario != 'NULL') %>%
to
dat %>% dplyr::filter(Scenario != "NULL") %>%
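Note also that the question's code reads the file into df but then pipes dat, so it is worth checking that the object being piped really is the data frame that was read. A minimal sketch of both checks, using a made-up data frame in place of H-W.csv (the values are hypothetical; the column names follow the question):

```r
library(dplyr)

# Hypothetical stand-in for the contents of H-W.csv
df <- data.frame(
  X.step.  = rep(1:2, each = 3),
  Scenario = rep(c("NULL", "NULL2", "A"), 2),
  height   = c(1, 2, 3, 4, 5, 6)
)

# Confirm the piped object is a data frame before filtering
stopifnot(is.data.frame(df))

means <- df %>%
  dplyr::filter(Scenario != "NULL") %>%  # explicit namespace, not stats::filter
  mutate(Scenario = ifelse(Scenario == "NULL2", "BASELINE", Scenario)) %>%
  group_by(X.step., Scenario) %>%
  summarise(height.people = mean(height), .groups = "drop")
means
```

Passing the grouping columns directly to group_by() also avoids the older .dots argument used in the question.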

bicolor heatmap with factor levels

I have this dataframe:
set.seed(0)
df <- data.frame(id = factor(sample(1:100, 10000, replace=TRUE), levels=1:100),
year = factor(sample(1950:2019, 10000, replace=TRUE), levels=1950:2019)) %>% unique() %>% arrange(id, year)
And I'm looking to plot a heatmap where the ids are on the X-axis, the years are on the Y-axis, and the color is blue when the data point exists and red when it doesn't. I'm almost there, but I can't figure out how to change the fill argument to get the two colors:
ggplot(df, aes(id, year, fill= year)) +
geom_tile()
The objective to plot both variables as factors is to plot them even when some year doesn't have any id (and plotting its whole row as red).
EDIT:
Two things I forgot to add (hope it's not too late):
How to add alpha transparency to geom_tile() without messing it?
I need to sort the ids from maximum missings to minimum missings.
The complete() function from the tidyr package is useful for filling in missing combinations. First, set a flag variable to indicate that the data is present, then expand the data frame with the missing combinations and fill the new flag variable with FALSE for them:
df <- df %>%
mutate(flag = TRUE) %>%
complete(id, year, fill = list(flag = FALSE))
ggplot(df, aes(id, year, fill = flag)) +
geom_tile()
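To get the blue/red colors asked for in the question, one option is to add scale_fill_manual() on top of the flag column. The sketch below rebuilds the data from the question; the object name full, the legend title and the alpha value of 0.8 are my own choices for illustration.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(0)
df <- data.frame(
  id   = factor(sample(1:100, 10000, replace = TRUE), levels = 1:100),
  year = factor(sample(1950:2019, 10000, replace = TRUE), levels = 1950:2019)
) %>%
  unique()

# complete() expands over all factor levels, so every id/year pair gets a row
full <- df %>%
  mutate(flag = TRUE) %>%
  complete(id, year, fill = list(flag = FALSE))

# Map TRUE to blue and FALSE to red
p <- ggplot(full, aes(id, year, fill = flag)) +
  geom_tile(alpha = 0.8) +
  scale_fill_manual(
    values = c(`TRUE` = "blue", `FALSE` = "red"),
    name = "Data present"
  )
```

Because id and year are factors with all levels declared, a year with no ids at all still appears as a full red row.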
EDIT1: To add transparency, add alpha = 0.x within geom_tile(), where x is a value indicating the transparency. The lower the value, the more transparent.
EDIT2: To sort by missingness add the following code prior to the ggplot code:
# Determine the order of the IDs, from most missing to least missing
df_order <- df %>%
group_by(id) %>%
summarize(sum = sum(flag)) %>% # number of years present per id
arrange(sum) %>% # fewest present years (most missing) first
mutate(order = row_number()) %>%
select(id, order)
# Set the IDs in order on the chart
df <- df %>%
left_join(df_order, by = "id") %>%
mutate(id = fct_reorder(id, order))
I think you need to do some pre-processing before plotting. Create a temporary variable (data_exist) which denotes data is present for that id and year. Then use complete to fill the missing years for each id and plot it.
library(tidyverse)
df %>%
mutate_all(~as.integer(as.character(.))) %>%
mutate(data_exist = 1) %>%
complete(id, year = min(year):max(year), fill = list(data_exist = 0)) %>%
mutate(data_exist = factor(data_exist)) %>%
ggplot() + aes(id, year, fill= data_exist) + geom_tile()
With expand.grid you can create a data frame with all combinations of ids and years, then left join df onto these combinations to see which ones were present in df:
all <- expand.grid(id=levels(df$id),year=levels(df$year)) %>%
left_join(df %>% mutate(present=1), by=c('id','year')) %>% # mark the observed combinations
mutate(present=ifelse(is.na(present),'0','1'))
ggplot(all, aes(as.numeric(id), as.numeric(year), fill= present)) +
geom_tile() +
scale_fill_manual(values=c('0'='red','1'='blue')) + # change default colors
theme(legend.position="None") # hide legend

using facets on every column with color grouping

I've seen a lot of people use facets to visualize data. I want to be able to run this on every column in my dataset and then have it grouped by some categorical value within each individual plot.
I've seen others use gather() to plot histogram or densities. I can do that ok, but I guess I fundamentally misunderstand how to use this technique.
I want to be able to do just what I have below - but when I have it grouped by a category. For example, histogram of every column but stacked by the value color. Or dual density plots of every column with these two lines of different colors.
I'd like this - but instead of clarity it is every single column like this...
library(tidyverse)
# what I want but clarity should be replaced with every column except FILL
ggplot(diamonds, aes(x = price, fill = color)) +
geom_histogram(position = 'stack') +
facet_wrap(clarity~.)
# it would look exactly like this, except it would have the fill value by a group.
gathered_data = gather(diamonds %>% select_if(is.numeric))
ggplot(gathered_data , aes(value)) +
geom_histogram() +
theme_classic() +
facet_wrap(~key, scales='free')
tidyr::gather needs four pieces:
1) data (in this case diamonds, passed through the pipe into the first parameter of gather below)
2) key
3) value
4) names of the columns that will be converted to key / value pairs.
gathered_data <- diamonds %>%
gather(key, value,
select_if(diamonds, is.numeric) %>% names())
It's not entirely clear what you are looking for. A picture of your expected output would have been much more illuminating than a description (not all of us are native English speakers...), but perhaps something like this?
diamonds %>%
rename(group = color) %>% # change this line to use another categorical
# column as the grouping variable
group_by(group) %>% # select grouping variable + all numeric variables
select_if(is.numeric) %>%
ungroup() %>%
tidyr::gather(key, value, -group) %>% # gather all numeric variables
ggplot(aes(x = value, fill = group)) +
geom_histogram(position = "stack") +
theme_classic() +
facet_wrap(~ key, scales = 'free')
# alternate example using geom density
diamonds %>%
rename(group = cut) %>%
group_by(group) %>%
select_if(is.numeric) %>%
ungroup() %>%
tidyr::gather(key, value, -group) %>%
ggplot(aes(x = value, color = group)) +
geom_density() +
theme_classic() +
facet_wrap(~ key, scales = 'free')
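Since gather() is superseded in current tidyr, the same reshaping could also be sketched with pivot_longer(). This is an alternative I am suggesting, not part of the answers above; the names long, key and value are kept only to mirror the earlier code, and bins = 30 is an arbitrary choice.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Keep the grouping column plus every numeric column, then go long
long <- diamonds %>%
  select(group = color, where(is.numeric)) %>%
  pivot_longer(-group, names_to = "key", values_to = "value")

# One stacked histogram per original numeric column, filled by group
p <- ggplot(long, aes(x = value, fill = group)) +
  geom_histogram(position = "stack", bins = 30) +
  theme_classic() +
  facet_wrap(~ key, scales = "free")
```

Swapping color for cut or clarity in the select() call changes the grouping variable without touching the rest of the pipeline.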

R: using ggplot2 with a group_by data set

I can't quite figure this out. A CSV of 200+ rows assigned to data like so:
gid,bh,p1_id,p1_x,p1_y
90467,R,543333,80.184,98.824
90467,L,408045,74.086,90.923
90467,R,543333,57.629,103.797
90467,L,408045,58.589,95.937
Trying to group by p1_id and plot the mean values for p1_x and p1_y:
grp <- data %>% group_by(p1_id)
Trying to plot geom_point objects like so:
geom_point(aes(mean(grp$p1_x), mean(grp$p1_y), color=grp$p1_id))
But that isn't showing unique plot points per distinct p1_id values.
What's the missing step here?
Why not calculate the mean first?
library(dplyr)
grp <- data %>%
group_by(p1_id) %>%
summarise(mean_p1x = mean(p1_x),
mean_p1y = mean(p1_y))
Then plot:
library(ggplot2)
ggplot(grp, aes(x = mean_p1x, y = mean_p1y)) +
geom_point(aes(color = as.factor(p1_id)))
Edit: As per @eipi10, you can also pipe directly into ggplot:
data %>%
group_by(p1_id) %>%
summarise(mean_p1x = mean(p1_x),
mean_p1y = mean(p1_y)) %>%
ggplot(aes(x = mean_p1x, y = mean_p1y)) +
geom_point(aes(color = as.factor(p1_id)))

Resources