How to set aside certain numeric values of x with ggplot? - r

I have a continuous scale that includes some values coding different categories of missing data (for example 998 and 999), and I want to make a plot that excludes these numeric missing values.
Since the values sit together, I can use xlim each time, but because xlim determines the domain of the plot I have to change the limits for each different case.
So I am asking for a solution. I can think of two possibilities.
Is it possible to set non-determining limits on the x-values? I mean, if I give 990 as the maximum limit but the maximum value that appears is 100, the plot should show an x-range up to approximately 100, not 990, as xlim does.
Is there an opposite of xlim, meaning that the range determined by the limits (or a given discrete set of values) is excluded from the x-axis?
Thanks in advance.

I think the simplest way is to exclude these values in the plot, either before or during the ggplot call.
MWE
library(tidyverse)
# Create example data containing an out-of-range missing code (998)
mtcars2 <- mtcars
mtcars2[5:15, 'mpg'] <- 998
# Full plot
mtcars2 %>% ggplot() +
geom_point(aes(x = mpg, y = disp))
Filtering before plot
mtcars2 %>%
filter(mpg < 250) %>%
ggplot() +
geom_point(aes(x = mpg, y = disp))
Filtering during plot
mtcars2 %>%
ggplot() +
geom_point(aes(x = mpg, y = disp), data = . %>% filter(mpg < 250))
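The data argument of a layer also accepts an ordinary function, which ggplot2 calls with the plot's data; the `. %>% filter(...)` shortcut works because it builds such a function. Below is a minimal sketch of the same idea without the pipe. The helper name drop_codes is hypothetical, and the codes 998 and 999 are taken from the question.

```r
library(ggplot2)

# Same example data as above: rows 5 to 15 carry a missing code
mtcars2 <- mtcars
mtcars2[5:15, "mpg"] <- 998

# Hypothetical helper: keeps only rows whose mpg is not a missing code
drop_codes <- function(d) d[!d$mpg %in% c(998, 999), , drop = FALSE]

p <- ggplot(mtcars2) +
  geom_point(aes(x = mpg, y = disp), data = drop_codes)

# layer_data() returns the data the layer actually draws
plotted <- layer_data(p)
```

Because the filter matches the codes themselves rather than a threshold such as 250, it should not need adjusting when the range of valid values changes.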

I would filter those missing values from the original dataset:
library(dplyr)
df <- data.frame(cat = rep(LETTERS[1:4], 3),
values = sample(10, 12, replace = TRUE)
)
# Add missing values
df$values[c(1,5,10)] <- 999
df$values[c(2,7)] <- 998
invalid_values <- c(998, 999)
library(ggplot2)
df %>%
filter(!values %in% invalid_values) %>%
ggplot() +
geom_point(aes(cat, values))
Alternatively, if that's not possible for some reason, you can define a scale transformation:
df %>%
ggplot() +
geom_point(aes(cat, values)) +
scale_y_continuous(trans = scales::trans_new('remove_invalid',
transform = function(d) if_else(d %in% invalid_values, NA_real_, d),
inverse = function(d) {if_else(is.na(d), 999, d)}
)
)
#> Warning: Transformation introduced infinite values in continuous y-axis
#> Warning: Removed 5 rows containing missing values (geom_point).
Created on 2018-05-09 by the reprex package (v0.2.0).
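If the codes appear in several columns, another option is to recode them to NA once, so that the scales are computed from the valid values only. This is a sketch under the assumption that 998 and 999 never occur as real measurements; df_clean is a name introduced here for illustration.

```r
library(dplyr)
library(ggplot2)

set.seed(1)  # only to make the sketch reproducible
df <- data.frame(cat = rep(LETTERS[1:4], 3),
                 values = sample(10, 12, replace = TRUE))
df$values[c(1, 5, 10)] <- 999
df$values[c(2, 7)] <- 998
invalid_values <- c(998, 999)

# Recode the missing codes to NA in every numeric column
df_clean <- df %>%
  mutate(across(where(is.numeric),
                ~ replace(.x, .x %in% invalid_values, NA)))

# na.rm = TRUE drops the NA rows without a warning
p <- ggplot(df_clean) +
  geom_point(aes(cat, values), na.rm = TRUE)
```

Keeping the rows as NA instead of filtering them out preserves the row count, which can matter if other columns of the same rows are still valid.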

Related

ggplots2 combination of geom_line and geom_point creates too many shapes along lines

I am working on plotting a 3-way interaction effect using my own data, but my code draws too many shapes (one at every data point) along the lines.
How can I keep the points only at the ends of the lines, unlike in the figure attached above?
I would deeply appreciate any help.
g1=ggplot(mygrid,aes(x=control,y=pred,color=factor(nknowledge),
lty=factor(nknowledge),shape=factor(nknowledge)))+
geom_line(size=1.5)+
geom_point(size=2.5)+
labs(x="control", y="attitudes",lty = "inc level")+
scale_linetype_manual("know level",breaks=1:3,values=c("longdash", "dotted","solid"),label=c("M-SD","M","M+SD"))+
scale_color_manual("know level",breaks=1:3,values=c("red", "blue","grey"),label=c("M-SD","M","M+SD"))+
scale_shape_manual("know level",breaks=1:3,values=c(6,5,4),label=c("M-SD","M","M+SD"))+
theme_classic()
This could be achieved by making use of a second dataset which filters the data for the endpoints by group using e.g. a group_by and range and passing the filtered dataset as data to geom_point:
Using some random example data try this:
set.seed(42)
mygrid <- data.frame(
control = runif(30, 1, 7),
pred = runif(30, 1, 3),
nknowledge = sample(1:3, 30, replace = TRUE)
)
library(ggplot2)
library(dplyr)
mygrid_pt <- mygrid %>%
group_by(nknowledge) %>%
filter(control %in% range(control))
ggplot(mygrid,aes(x=control,y=pred,color=factor(nknowledge),
lty=factor(nknowledge),shape=factor(nknowledge)))+
geom_line(size=1.5)+
geom_point(data = mygrid_pt, size=2.5)+
labs(x="control", y="attitudes",lty = "inc level")+
scale_linetype_manual("know level",breaks=1:3,values=c("longdash", "dotted","solid"),label=c("M-SD","M","M+SD"))+
scale_color_manual("know level",breaks=1:3,values=c("red", "blue","grey"),label=c("M-SD","M","M+SD"))+
scale_shape_manual("know level",breaks=1:3,values=c(6,5,4),label=c("M-SD","M","M+SD"))+
theme_classic()
If you use geom_point, then you'll get points for all rows in your data frame. If you want specific points and shapes plotted at the ends of your lines, you'll want to create a filtered data frame containing only the points you want plotted.
library(ggplot2); library(dplyr)
g1 <- ggplot()+
geom_line(data = mtcars,
mapping = aes(x=hp,y=mpg,color=factor(cyl),lty=factor(cyl)),
size=1.5)+
geom_point(data = mtcars %>% group_by(cyl) %>% filter(hp == max(hp) | hp == min(hp)),
mapping = aes(x=hp,y=mpg,color=factor(cyl),shape=factor(cyl)),
size=2.5)
g1
Created on 2021-01-28 by the reprex package (v0.3.0)
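One caveat with hp == min(hp) is that it keeps every tied row, so a group can end up with more than two points (in mtcars, two 8-cylinder cars share the lowest hp). A sketch of a variant that takes exactly the first and last row per group after sorting; the name ends is introduced here for illustration.

```r
library(dplyr)
library(ggplot2)

# Exactly one first and one last row per group, even when the minimum is tied
ends <- mtcars %>%
  group_by(cyl) %>%
  arrange(hp, .by_group = TRUE) %>%
  slice(c(1, n())) %>%
  ungroup()

g <- ggplot(mtcars, aes(hp, mpg, color = factor(cyl))) +
  geom_line() +
  geom_point(data = ends, aes(shape = factor(cyl)), size = 2.5)
```

Which of the tied rows ends up first is arbitrary here; if that matters, add a second sort key to arrange().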

ifelse condition: is in top n

Usually when I need to label only a subset in geom_label(), I use ifelse() and specify a threshold number, as below:
library(tidyverse)
data = starwars %>% filter(mass < 500)
data %>%
ggplot(aes(x = mass, y = height, label = ifelse(birth_year > 100, name, NA))) +
geom_point() +
geom_label()
#> Warning: Removed 54 rows containing missing values (geom_label).
Created on 2020-05-31 by the reprex package (v0.3.0)
But with the dataset I'm working on, I need a dynamic solution, something like ifelse("birth_year is in top n", name, NA).
Thoughts?
For your method, I think using rank should work fine, e.g.,
ifelse(rank(birth_year) < 10, name, NA)
You can use rank(-birth_year) if you want it sorted the other way (or, if you're using dplyr, rank(desc(birth_year)), which will work on non-numeric columns too). You may want to read up on tie methods at ?rank.
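As a sketch of the rank idea in full, assuming "top n" means the n largest birth_year values and using a hypothetical cutoff n_top: dplyr's min_rank() leaves NA as NA, so rows without a birth year simply get no label.

```r
library(dplyr)
library(ggplot2)

data <- starwars %>% filter(mass < 500)
n_top <- 5  # hypothetical cutoff, change as needed

# min_rank(desc(x)) gives rank 1 to the largest value; NA stays NA
labelled <- data %>%
  mutate(label = ifelse(min_rank(desc(birth_year)) <= n_top, name, NA))

p <- ggplot(labelled, aes(mass, height, label = label)) +
  geom_point() +
  geom_label(na.rm = TRUE)
```

With ties at the cutoff, min_rank() can label more than n_top rows; the tie methods described in ?rank control that behaviour.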
I'd also propose a more general solution: filtering data for the geom_label layer. For more complex conditions (e.g., where a group_by would come in handy) it will be more straightforward:
data %>%
ggplot(aes(x = mass, y = height, label = name)) +
geom_point() +
geom_label(
data = data %>%
group_by(species) %>%
top_n(n = 1, wt = desc(birth_year)) # youngest of each species
)
Something like this? To get top 4 values.
library(ggplot2)
data %>%
ggplot(aes(x = mass, y = height, label = ifelse(birth_year >= sort(birth_year, decreasing = TRUE)[4], name, NA))) +
geom_point() +
geom_label()
This is a more explicit approach. I assume you want to count the number of characters per birth year, per your example. In this case, we handle the ranking first, then add a column to the original dataset, then plot. The new 'label' field is either blank/NA or has members of the top set. I suppress the pesky missing data warning in the geom_label arguments.
data = starwars %>% filter(mass < 500)
# counts names per birthyear, returns vector of top 4
top4 <- data %>%
drop_na(birth_year) %>%
count(birth_year, sort = TRUE) %>%
top_n(4) %>%
pull(birth_year)
# adds column to data with the names from the top 4 birth years
data <- data %>%
mutate(label = ifelse(birth_year %in% top4, name, NA))
# plots data with label, dropping NAs
data %>%
ggplot(aes(x = mass, y = height, label = label)) +
geom_point() +
geom_label(na.rm = TRUE)

Create a table with values from ecdf graph

I am trying to create a table using values from an ecdf plot. I've recreated an example below.
#Data
library(dplyr)
library(ggplot2)
data(mtcars)
#Sort by mpg
mtcars <- mtcars[order(mtcars$mpg),]
#Make arbitrary ranking variable based on mpg
mtcars <- mtcars %>% mutate(Rank = dense_rank(mpg))
#Make variable for percent picked
mtcars <- mutate(mtcars, Percent_Picked = Rank/max(mtcars$Rank))
#Make cyl categorical
mtcars$cyl<-cut(mtcars$cyl, c(3,5,7,9), right=FALSE, labels=c(4,6,8))
#Make the graph
ggplot(mtcars, aes(Percent_Picked, color = cyl)) +
stat_ecdf(size=1) +
scale_x_continuous(labels = scales::percent) +
scale_y_continuous(labels = scales::percent)
Which creates this plot
I want to create a table with the value for each of the cylinder types when the overall Percent_Picked is at 25%, 50%, and 75%. So something that shows that 4-cylinder is at 0%, 6 is around 28%, and 8 is around 85%.
Calculating quantiles by group doesn't give me what I want (it shows the percent of all cylinders picked when 25%, 50%, and 75% of the particular cylinder type was picked). (For example, the suggestions by tbradley1013 on their blog only help with quantiles for each particular cylinder, not the overall cdf for each cylinder at given quantiles for Percent_Picked.)
Any leads would be appreciated!
So looking around I found this question. Yours extends this a little by asking for group-specific ecdf values, so we can use the do function in dplyr (here's an example) to do so. There are some slight differences between the values in this table and the values in your ggplot, and I'm not exactly sure why that is. It could just be that the mtcars data set is somewhat small, so if you run this on a larger data set, I'd expect it to be closer to the actual values.
#Sort by mpg
mtcars <- mtcars[order(mtcars$mpg),]
#Make arbitrary ranking variable based on mpg
mtcars <- mtcars %>% mutate(Rank = dense_rank(mpg))
#Make variable for percent picked
mtcars <- mutate(mtcars, Percent_Picked = Rank/max(mtcars$Rank))
#Make cyl categorical
mtcars$cyl<-cut(mtcars$cyl, c(3,5,7,9), right=FALSE, labels=c(4,6,8))
#Make the graph
ggplot(mtcars, aes(Percent_Picked, color = cyl)) +
stat_ecdf(size=1) +
scale_x_continuous(labels = scales::percent) +
scale_y_continuous(labels = scales::percent)
create_ecdf_vals <- function(vec){
df <- data.frame(
x = unique(vec),
y = ecdf(vec)(unique(vec))*length(vec)
) %>%
mutate(y = scale(y, center = min(y), scale = diff(range(y)))) %>%
union_all(data.frame(x=c(0,1),
y=c(0,1))) # adding in max/mins
return(df)
}
mt.ecdf <- mtcars %>%
group_by(cyl) %>%
do(create_ecdf_vals(.$Percent_Picked))
mt.ecdf %>%
summarise(q25 = y[x<=0.25][which.max(x[x<=0.25])],
q50 = y[x<=0.5][which.max(x[x<=0.5])],
q75 = y[x<=0.75][which.max(x[x<=0.75])])
ggplot(mt.ecdf,aes(x,y,color = cyl)) +
geom_step()
~EDIT~
After some digging around in the ggplot2 docs, we can actually explicitly pull out the data from the plot using the layer_data function.
my.plt <- ggplot(mtcars, aes(Percent_Picked, color = cyl)) +
stat_ecdf(size=1) +
scale_x_continuous(labels = scales::percent) +
scale_y_continuous(labels = scales::percent)
plt.data <- layer_data(my.plt) # magic happens here
# and here's the table you want
plt.data %>%
group_by(group) %>%
summarise(q25 = y[which.max(x[x<=0.25])],
q50 = y[which.max(x[x<=0.5])],
q75 = y[which.max(x[x<=0.75])])
A much shorter answer that I can't believe I didn't see earlier. Essentially I just divide the number of rows equal to or less than .25, .5, and .75 by the total number of rows, for each cyl.
cyl.table<-mtcars %>%
group_by(cyl) %>%
summarise("25% Picked" = sum(Percent_Picked<=0.25)/(sum(Percent_Picked<=1)),
"50% Picked" = sum(Percent_Picked<=0.5)/(sum(Percent_Picked<=1)),
"75% Picked" = sum(Percent_Picked<=0.75)/(sum(Percent_Picked<=1)))
cyl.table
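The same proportions can also be read from base R's ecdf(), which returns a step function that can be evaluated at any point. This is a sketch that rebuilds Percent_Picked from a fresh copy of mtcars; the names mt and ecdf_table are introduced here for illustration.

```r
library(dplyr)

mt <- mtcars %>%
  mutate(Percent_Picked = dense_rank(mpg) / max(dense_rank(mpg)),
         cyl = factor(cyl))

# ecdf(x)(q) is the share of values in x that are <= q
ecdf_table <- mt %>%
  group_by(cyl) %>%
  summarise(q25 = ecdf(Percent_Picked)(0.25),
            q50 = ecdf(Percent_Picked)(0.50),
            q75 = ecdf(Percent_Picked)(0.75))
```

Evaluating ecdf() per group is equivalent to dividing the count of rows at or below the threshold by the group size, so it should agree with the table above.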

How to always have fixed number of bins in geom_bar with missing values

I would like to ask how to always have a fixed number of bins in a bar plot, no matter how many distinct values the variable has. It must be a bar plot, not a histogram.
For example:
DF <- mtcars
ggplot(DF, aes(gear)) + geom_bar()
will produce three bars (for the values 3 to 5). I would also like to have the values 1 and 2, with counts equal to zero, so that we end up with 5 bars: the first 2 equal to 0 and the last 3 equal to the counts in the dataset.
You need to include the counts for all missing values of gear that you want. One way of achieving that is by using complete:
library(tidyverse)
DF <- mtcars %>%
group_by( gear ) %>%
tally() %>%
complete( gear = 1:max(gear), fill = list(n=0) )
ggplot(DF, aes(x = gear, y = n)) + geom_bar( stat = 'identity' )
You can edit the properties of the x-axis to include 1 and 2: add a scale_x_continuous and manually define the breaks and the limits. However, you cannot really see the bars for these values, because they have zero height and are drawn as a line...
library(tidyverse)
DF <- mtcars
ggplot(DF, aes(gear)) + geom_bar() +
scale_x_continuous(breaks = 1:5, limits = c(0.5,5.5))
Created on 2019-12-06 by the reprex package (v0.3.0)
Does this help?
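Another option, sketched here under the assumption that gear can be treated as categorical, is to declare all five levels on a factor and tell the discrete scale not to drop the unused ones. The column name gear_f is introduced for illustration.

```r
library(ggplot2)

DF <- mtcars
# Declare all five levels, including the unobserved 1 and 2
DF$gear_f <- factor(DF$gear, levels = 1:5)

p <- ggplot(DF, aes(gear_f)) +
  geom_bar() +
  scale_x_discrete(drop = FALSE)  # keep unused levels on the axis

# The counts behind the bars, with zeros for the empty levels
counts <- table(DF$gear_f)
```

This keeps geom_bar() doing the counting, so no pre-aggregation is needed.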

Trying to filter rows by intervals and plotting number of rows obtained

Consider the column "disp" in mtcars. I am trying to divide disp into intervals so that I can count the number of observations in each interval. After doing this I want to plot the results as a ggplot geom_line
This is what I have tried:
library (tidyverse)
library (ggplot2)
a1 <- mtcars %>% arrange(desc(disp)) %>%
mutate(counts = cut_interval(disp, length = 5)) %>% group_by(counts) %>% mutate(nn = n())
a2 <- a1 %>% select(counts,nn) %>% unique()
ggplot(a2, aes(counts, nn)) +
geom_point(shape = 16, size = 1, show.legend = FALSE) +
theme_bw()
I get the intervals I need in a2. I can use them to draw a scatterplot, but I can see that the x-axis has no proper scale. Is there any way to use these intervals to get a continuous scale and draw a line plot of counts vs nn?
mtcars %>% ggplot(aes(x = disp)) + geom_histogram(binwidth = 1) + theme_bw()
Thanks so much Rui Barradas! I just needed a count plot so no need of doing extra stuff.
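For the line plot on a continuous scale that the question originally asked about, one possible sketch is to count per interval and plot the counts against the interval midpoints. The bin width of 50 and the break range are assumptions chosen to cover disp in mtcars; base hist() is used only for the counting.

```r
library(ggplot2)

# Bin disp into intervals of width 50 without drawing the base plot
h <- hist(mtcars$disp, breaks = seq(50, 500, by = 50), plot = FALSE)
binned <- data.frame(mid = h$mids, n = h$counts)

# Midpoints are numeric, so the x-axis is a proper continuous scale
p <- ggplot(binned, aes(mid, n)) +
  geom_line() +
  geom_point() +
  theme_bw()
```

geom_freqpoly(binwidth = 50) draws a similar line directly from the raw data if the binned table itself is not needed.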
