Stata drop if equivalent for string variable in ggplot (R) - r

I am trying to produce a graph for a categorical variable with three sub-groups, but I would like to present the results for only two of them. In Stata, this can be done while producing the graph by adding something like the line below, but I am not sure whether there is an R equivalent.
drop if sentiment== "neutral"
Here is a data example:
dput(head(sample_graph, 5))
structure(list(sentiment = structure(c(3L, 2L, 4L, NA, 2L), .Label = c("meg",
"negative", "neutral", "positive"), class = "factor"), treatment_announcement = c("pre",
"pre", "pre", "pre", "post"), n = c(78L, 150L, 87L, 1L, 829L),
sentiment_percentage = c(0.246835443037975, 0.474683544303797,
0.275316455696203, 0.00316455696202532, 0.490822972172883
), am = structure(c(2L, 2L, 2L, 2L, 1L), .Label = c("post",
"pre"), class = "factor")), class = c("grouped_df", "tbl_df",
"tbl", "data.frame"), row.names = c(NA, -5L), groups = structure(list(
treatment_announcement = c("post", "pre"), .rows = structure(list(
5L, 1:4), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -2L), .drop = TRUE))
I have used the code below before. It works, but it drops every observation in this category from the data frame itself, whereas I only want to exclude them for visualization purposes.
For instance, after running the code below, my observations declined from 8000 to 6323.
sample_graph<- sample_graph %>%
drop_na() %>%
filter(sentiment != "neutral")
Therefore, I have attempted dropping the specific subgroup within the ggplot itself, but I am facing an error: "Problem with filter() input ..2.
i Input ..2 is aes(x = treatment_announcement, fill = sentiment, y = sentiment_percentage)."
ggplot(sample_graph %>% filter(sentiment != "neutral", aes(x = treatment_announcement, fill = sentiment, y = sentiment_percentage))) +
geom_bar(stat = "identity", position=position_dodge()) +
scale_fill_grey() +
ylab("percentage") +
theme(text=element_text(size=20)) +
scale_fill_manual(values = c("positive" = "green",
"negative" = "red")) +
theme(plot.title = element_text(size = 18, face = "bold")) +
scale_x_discrete(limits = c("pre", "post")) +
theme_bw()
Following Allen's advice below, I tried the following:
twitter_posts |>
drop_na() |>
filter(sentiment != "neutral") |>
select(sentiment, treatment_announcement) |> # we're only interested in sentiment & treatment_announcement
group_by(sentiment) %>% # group data and
add_count(treatment_announcement) |> # add count of treatment_announcement
unique() |> # remove duplicates
ungroup() |> # remove grouping
group_by(treatment_announcement) |> # group by treatment_announcement
mutate(sentiment_percentage = n/sum(n)) |> # ...calculating percentage
mutate(sentiment = as.factor(sentiment)) |> # change to factors so that ggplot treats...
mutate(am = as.factor(treatment_announcement)) |>
twitter_posts (data = teacher_posts, aes(x = treatment_announcement, fill = sentiment, y = sentiment_percentage)) +
geom_bar(stat = "identity", position=position_dodge()) +
scale_fill_grey() +
xlab("Treatment refers to the implementation of the wage subsidy program targeted at jobless teachers") +
ylab("percentage") +
theme(text=element_text(size=20)) +
scale_fill_manual(values = c("positive" = "green",
"negative" = "red")) +
theme(plot.title = element_text(size = 18, face = "bold")) +
scale_x_discrete(limits = c("pre", "post")) +
theme_bw()
And I am receiving this error "Mapping should be created with aes() or aes_()." although I have the aes mapping for the plot.

You can do some version of this either by piping into ggplot() or by using filter() in the data argument:
library(tidyverse)
library(palmerpenguins)
penguins <- penguins
penguins |>
drop_na() |>
filter(species != "Adelie") |>
ggplot(aes(x = bill_length_mm, y = body_mass_g)) +
geom_point()
ggplot(data = filter(penguins,species != "Adelie"), aes(x = bill_length_mm, y = body_mass_g)) +
geom_point()
#> Warning: Removed 1 rows containing missing values (geom_point).
Created on 2022-07-18 by the reprex package (v2.0.1)
So taking the code you provided it would look something like this
twitter_posts |>
drop_na() |>
filter(sentiment != "neutral") |>
select(sentiment, treatment_announcement) |> # we're only interested in sentiment & treatment_announcement
group_by(sentiment) %>% # group data and
add_count(treatment_announcement) |> # add count of treatment_announcement
unique() |> # remove duplicates
ungroup() |> # remove grouping
group_by(treatment_announcement) |> # group by treatment_announcement
mutate(sentiment_percentage = n/sum(n)) |> # ...calculating percentage
mutate(sentiment = as.factor(sentiment)) |> # change to factors so that ggplot treats...
mutate(am = as.factor(treatment_announcement)) |>
ggplot(aes(x = treatment_announcement, fill = sentiment, y = sentiment_percentage)) +
geom_bar(stat = "identity", position=position_dodge()) +
scale_fill_grey() +
xlab("Treatment refers to the implementation of the wage subsidy program targeted at jobless teachers") +
ylab("percentage") +
theme(text=element_text(size=20)) +
scale_fill_manual(values = c("positive" = "green",
"negative" = "red")) +
theme(plot.title = element_text(size = 18, face = "bold")) +
scale_x_discrete(limits = c("pre", "post")) +
theme_bw()
So you would be doing your data cleaning and then plotting it. Because you are piping it you do not need to include the data argument.
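To check that this leaves the underlying data untouched, here is a minimal sketch on made-up data. The toy data frame `posts` and its values are assumptions for illustration, not the asker's data. The filter is applied only to the copy handed to ggplot(), so the row count of the original does not change.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for the asker's summary table
posts <- data.frame(
  sentiment = c("negative", "neutral", "positive",
                "negative", "neutral", "positive"),
  treatment_announcement = c("pre", "pre", "pre", "post", "post", "post"),
  sentiment_percentage = c(0.47, 0.25, 0.28, 0.49, 0.20, 0.31)
)

rows_before <- nrow(posts)

# Filter inside the pipeline only; `posts` itself is never reassigned
p <- posts |>
  filter(sentiment != "neutral") |>
  ggplot(aes(x = treatment_announcement, y = sentiment_percentage,
             fill = sentiment)) +
  geom_col(position = position_dodge()) +
  scale_fill_manual(values = c(positive = "green", negative = "red")) +
  scale_x_discrete(limits = c("pre", "post")) +
  theme_bw()

rows_after <- nrow(posts)  # identical to rows_before
```

Calling print(p) draws the two-group chart, while `posts` still holds all three sentiment groups for later analysis.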

If I were you, I would just create a new dataframe by filtering your original one with
newdataframe <- originaldataframe %>%
filter(variable==)
or something in this style.
From there generating the new graph should be trivial if you already have a working code.
Maybe it is not the most polished way to do it, but it's fast and effective.
Hope it helps.

Related

"Crossing off" tiles on a heatmap

For a heatmap made using ggplot and geom_tile, how would you "cross off" a tile based on a conditional value?
The heatmap shows counts of the number of times an animal performed a behavior between 1990-2020.
Rows are animal IDs, columns are years.
Years go from 1990-2020 but not all animals are alive throughout that time frame (ie, some born later than 1990 or die earlier than 2020)
So I want to cross off any tiles where an animal isn't alive, or before it was born.
Data look like this (shortened to 5 rows for brevity):
data <- data.frame(date = structure(c(8243, 8243, 8243, 8248, 8947), class = "Date"),
year = c("1992", "1992", "1992", "1992", "1994"),
event.id = c(8L, 8L, 8L, 10L, 11L),
id = c("L5", "L58", "L73", "L21", "L5"),
birth = c(1964L, 1980L, 1986L, 1950L, 1964L),
death = c(2012L, 2003L, NA, NA, 2012L))
NA in the death column means the animal is still alive, so for that animal only the tiles before its birth year would be crossed off.
Any help to create this is greatly appreciated!
Code looks like this:
heatmap <- data %>%
mutate(x = case_when(year %in% 1990:1999 ~ "1990-1999",
TRUE ~ year)) %>%
mutate(y = paste(id)) %>%
group_by(x, y, .drop = FALSE) %>%
summarize(count = n()) %>%
arrange(y)
ggplot(data = heatmap, aes(x, y, fill = count)) +
geom_tile()
EDIT
Current heat map.
Here's how you could use color to indicate NA, as suggested by @Gregor Thomas.
Transforming your data to "complete":
library(dplyr)
library(tidyr)
library(ggplot2)
hm <- dat %>%
mutate(x = case_when(year < 1960 ~ "Pre-1960",
year %in% 1960:1969 ~ "1960-1969",
year %in% 1970:1979 ~ "1970-1979",
year %in% 1980:1989 ~ "1980-1989",
year %in% 1990:1999 ~ "1990-1999",
TRUE ~ year)) %>%
mutate(y = paste(matriline, id)) %>%
group_by(x, y, .drop = FALSE) %>%
summarize(count = n()) %>%
ungroup() %>%
tidyr::complete(x, y) %>%
arrange(y) %>%
tidyr::separate(y, into = c("ym", "yid"), sep = " ", remove = FALSE)
Then define a color for NA:
ggplot(data = hm, aes(x, yid, fill = count)) +
geom_tile() +
scale_fill_gradient(low = "white", high = "red", na.value = "grey50") +
scale_x_discrete(position = "top", drop = FALSE) +
scale_y_discrete(limits=rev) +
labs(x = NULL, y = NULL) +
facet_wrap( ~ ym, strip.position = "left", dir = "v", ncol = 1) +
theme(panel.spacing = unit(0.2, "lines"),
strip.background = element_blank(),
strip.placement = "outside",
axis.text.x = element_text(angle = 45, hjust = -0.02))
Data:
ids <- c("J11", "J16", "J17", "J02", "J22", "J26", "J27", "J30")
matrilines <- c("J02","J04", "K11", "L20", "P90", "K100", "R22")
dat <- data.frame(year = as.character(sample(1960:2018, 1000, replace = TRUE)),
id = sample(ids, 1000, replace = TRUE),
matriline = sample(matrilines, 1000, replace = TRUE))
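The approach above shades missing tiles rather than drawing a cross. One possible way to draw an actual cross, sketched here on invented data, is to build every id-by-year combination, flag the years outside each animal's birth-death window, and overlay geom_point(shape = 4) on those tiles. The `lifespan` and `counts` tables, their values, and the helper column `alive` are assumptions for illustration; a death of NA is treated as still alive.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical lifespans; NA in `death` means the animal is still alive
lifespan <- data.frame(
  id    = c("L5", "L58", "L73"),
  birth = c(1964L, 1992L, 1986L),
  death = c(1994L, 2003L, NA)
)

# Hypothetical event counts per animal and year
counts <- data.frame(
  id    = c("L5", "L5", "L58", "L73"),
  year  = c(1991L, 1993L, 1994L, 1992L),
  count = c(3L, 1L, 2L, 4L)
)

# Every id-by-year tile, with a flag for whether the animal was alive
grid <- expand_grid(id = lifespan$id, year = 1990:1995) |>
  left_join(lifespan, by = "id") |>
  left_join(counts, by = c("id", "year")) |>
  mutate(alive = year >= birth & (is.na(death) | year <= death))

p <- ggplot(grid, aes(factor(year), id)) +
  geom_tile(aes(fill = count), colour = "white") +
  # draw an "x" on tiles where the animal was not alive
  geom_point(data = filter(grid, !alive), shape = 4, size = 5) +
  scale_fill_gradient(low = "white", high = "red", na.value = "grey90")
```

Tiles with no recorded events but a living animal stay plain grey, so they remain distinguishable from the crossed-off ones.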

Is there a way to expand your ggplot figure conditioned on a factor variable?

The figure below shows that the estimate for the UK is "0.4", but the label is cut off at the edge of the graph. One solution is to adjust the hjust option in geom_text. However, this is a reproducible example; the original data consist of 30 countries, where the estimate of the last country on the x-axis falls slightly outside the plot, as it does here for the UK, and the hjust option doesn't solve the problem. Is there a way to expand the plot so that the estimate "0.4" is clear to the reader?
Here is the plot:
Here is the code:
df %>%
mutate(label = replace(round(estimate, 2),cntry==1, '')) %>%
ggplot(aes(estimate, cntry,label=label)) +
geom_text(hjust= -2) +
geom_point(mapping=aes(x=estimate, y=cntry), size=2.3, shape=21, fill="black") +
coord_flip()
Here is the data:
structure(list(cntry = structure(1:3, .Label = c("FR", "IT",
"UK"), class = "factor"), term = c("unemp", "unemp", "unemp"),
estimate = c(-1.73, 0.20, 0.48
)), row.names = c(NA, -3L), class = "data.frame")
You can pass an expansion to your axis scale(s):
df %>%
mutate(label = replace(round(estimate, 2),cntry==1, '')) %>%
ggplot(aes(estimate, cntry,label=label)) +
geom_text(hjust= -2) +
geom_point(mapping=aes(x=estimate, y=cntry), size=2.3, shape=21, fill="black") +
coord_flip() +
scale_y_discrete(expand = expansion(add = 1))
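expansion() takes `mult` and `add`, each either a single value for both ends or a length-two vector for the lower and upper end. If only the last country is clipped, a sketch like the following pads just the upper end. The amount 1.5 is an arbitrary choice for illustration, 0.6 is the usual default padding for discrete scales, and the data frame repeats the example data from the question.

```r
library(ggplot2)

df <- data.frame(
  cntry    = factor(c("FR", "IT", "UK")),
  estimate = c(-1.73, 0.20, 0.48)
)

# c(lower, upper): default-sized padding before FR, extra room after UK
pad <- expansion(add = c(0.6, 1.5))

p <- ggplot(df, aes(estimate, cntry, label = round(estimate, 2))) +
  geom_text(hjust = -2) +
  geom_point(size = 2.3, shape = 21, fill = "black") +
  coord_flip() +
  scale_y_discrete(expand = pad)
```

The scale is still `scale_y_discrete()` because cntry is mapped to y; coord_flip() only changes where that axis is drawn.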

plotting a multivariate time series daily in ggplot2

I have daily time series data with multiple variables measured in different units. The data is as below. (Example data)
structure(list(date = structure(18324:18329, class = "Date"),
x = c(-1805605.65336663, -217934.802608961, -1032002.23625031, 234816.624919304, 1321982.20108174, 104251.623282941), y = c(0.633729348424822, 0.244916933588684, 0.873351667076349, 0.552934182109311, 0.348864572821185, 0.197756679030135), z = c(3L, 5L, 5L, 6L, 5L, 6L)), class = "data.frame", row.names = c(NA, -6L
))
Suppose X is measured in Rs Billion, Y is a ratio between 0 and 1, and Z is a count variable. I want to plot all these variables over the time period in multiple graphs ( preferably using facet_wrap)
You can use the following code
library(tidyverse)
library(lubridate)
df %>%
dplyr::mutate(date = ymd(date)) %>%
gather(key = "key", value = "value",-date) %>%
ggplot(aes(x=date, y=value)) + geom_line() + facet_wrap("key", scales = "free")
Update
df %>%
dplyr::mutate(date = ymd(date)) %>%
gather(key = "key", value = "value",-date) %>%
ggplot(aes(x=date, y=value)) + geom_line() + theme_bw() +
facet_wrap(~key, scales = "free_y", ncol = 1,
strip.position = "left",
labeller=as_labeller(c(x = "Rs Billion", y = "Ratio", z = "Count variable (n)"))) +
ylab(NULL) +xlab("Date")+
theme(strip.background = element_blank(),
strip.placement = "outside")
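gather() still works, but tidyr documents it as superseded by pivot_longer(). A sketch of the same reshaping step with pivot_longer(), using the example data from the question (values rounded here for brevity):

```r
library(tidyr)
library(ggplot2)

df <- data.frame(
  date = as.Date(18324:18329, origin = "1970-01-01"),
  x = c(-1805605.65, -217934.80, -1032002.24, 234816.62, 1321982.20, 104251.62),
  y = c(0.634, 0.245, 0.873, 0.553, 0.349, 0.198),
  z = c(3L, 5L, 5L, 6L, 5L, 6L)
)

# One row per date and variable; `key` holds the former column name
long <- pivot_longer(df, cols = -date, names_to = "key", values_to = "value")

p <- ggplot(long, aes(x = date, y = value)) +
  geom_line() +
  facet_wrap(~key, scales = "free_y", ncol = 1)
```

The names `key` and `value` are kept only so the rest of the plotting code above can be reused unchanged.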

how to build a bar chart with multidimensional data in R

I want to build a bar chart showing comparison of cost in combination with two variables.
Data:
Cost should be on the y-axis, and Age and Gender should be on the x-axis.
The goal is to find out which combination of Age and Gender has the highest cost.
Can anyone help out on this ?
I tried:
x = c(q4$Age,q4$Gender)
y = q4$Cost
plt <- ggplot(data = q4, mapping = aes(x,y)) + geom_bar(stat = "identity")
I would like help building a chart where the Age and Gender bars sit side by side, so that the cost of each combination of Age and Gender can be compared.
Thanks a lot for your valuable time.
You can use interaction() inside your aes():
library(ggplot2)
ggplot(df, aes(x = interaction(age,gender), y = cost, fill = interaction(age, gender)))+
geom_bar(stat = "identity", position = position_dodge())
Alternatively, you can also create a new column in your dataframe (here using the function mutate from dplyr) and plot according to this column:
library(ggplot2)
library(dplyr)
df %>% mutate(Age_Gender = paste0("Age: ",age,"\n","Gender: ",gender)) %>%
ggplot(aes(x = Age_Gender, y = cost, fill = Age_Gender))+
geom_bar(stat = "identity", show.legend = FALSE)+
theme(axis.text.x = element_text(angle = 45, hjust =1))
Does this answer your question?
Data
structure(list(age = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5), gender = structure(c(2L,
1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L), .Label = c("F", "M"), class = "factor"),
cost = c(100, 45, 50, 56, 60, 55, 50, 70, 70, 60)), class = "data.frame", row.names = c(NA,
-10L))
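The plots show the comparison visually. To read off the most expensive combination directly, a small base-R sketch on the same example data (the object names `top` and `ranked` are introduced here for illustration):

```r
df <- data.frame(
  age    = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5),
  gender = factor(c("M", "F", "F", "M", "M", "F", "F", "M", "F", "M")),
  cost   = c(100, 45, 50, 56, 60, 55, 50, 70, 70, 60)
)

# Row with the highest cost
top <- df[which.max(df$cost), ]

# All combinations ranked from most to least expensive
ranked <- df[order(-df$cost), ]
```

With this data, `top` is the row for age 1, gender M, with a cost of 100.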
It would be a lot easier with a sample of your data. Please, use dput and check here how to make a great R reproducible example.
However, as I read your request, you can use tidyverse
I wrote the following, please try this on your data inserting your relevant covariates
#Data
data(mtcars)
w <- mtcars
#Script
library(tidyverse)
w %>%
as_tibble() %>%
mutate(cyl=as.character(cyl),
vs =as.factor(vs)) %>%
bind_rows(., mutate(., cyl="all")) %>%
count(cyl, vs) %>%
ggplot(aes(cyl, n, color = vs, fill= vs)) +
scale_fill_manual(values = c("#fcebeb", "#edf1f9"), name="", label=c("Text 1", "Text 2")) +
scale_colour_manual(values = c("red", "#1C73C2"), name="", label=c("Text 1", "Text 2")) +
geom_col(position = position_dodge2(preserve = "single", padding = 0.1))
Which yields
datax$Gender <- as.factor(datax$Gender)
ggplot(datax, aes(fill=Gender, y=cost, x=Age)) + geom_bar(position="dodge",
stat="identity")

Plot confusion matrix in R using ggplot

I have two confusion matrices with calculated values as true positive (tp), false positives (fp), true negatives(tn) and false negatives (fn), corresponding to two different methods. I want to represent them as
I believe facet grid or facet wrap can do this, but I find difficult to start.
Here is the data of two confusion matrices corresponding to method1 and method2
dframe<-structure(list(label = structure(c(4L, 2L, 1L, 3L, 4L, 2L, 1L,
3L), .Label = c("fn", "fp", "tn", "tp"), class = "factor"), value = c(9,
0, 3, 1716, 6, 3, 6, 1713), method = structure(c(1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L), .Label = c("method1", "method2"), class = "factor")), .Names = c("label",
"value", "method"), row.names = c(NA, -8L), class = "data.frame")
This could be a good start
library(ggplot2)
ggplot(data = dframe, mapping = aes(x = label, y = method)) +
geom_tile(aes(fill = value), colour = "white") +
geom_text(aes(label = sprintf("%1.0f",value)), vjust = 1) +
scale_fill_gradient(low = "white", high = "steelblue")
Edited
TClass <- factor(c(0, 0, 1, 1))
PClass <- factor(c(0, 1, 0, 1))
Y <- c(2816, 248, 34, 235)
df <- data.frame(TClass, PClass, Y)
library(ggplot2)
ggplot(data = df, mapping = aes(x = TClass, y = PClass)) +
geom_tile(aes(fill = Y), colour = "white") +
geom_text(aes(label = sprintf("%1.0f", Y)), vjust = 1) +
scale_fill_gradient(low = "blue", high = "red") +
theme_bw() + theme(legend.position = "none")
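To get one 2x2 panel per method from the asker's `dframe`, one option is to split each label into an actual and a predicted class and then facet by method. This is only a sketch: the column names `actual` and `predicted` and the "positive"/"negative" wording are assumptions introduced here, and the data frame is rebuilt from the values in the question.

```r
library(ggplot2)

dframe <- data.frame(
  label  = rep(c("tp", "fp", "fn", "tn"), times = 2),
  value  = c(9, 0, 3, 1716, 6, 3, 6, 1713),
  method = rep(c("method1", "method2"), each = 4)
)

# tp/fn are actual positives; tp/fp are predicted positives
dframe$actual    <- ifelse(dframe$label %in% c("tp", "fn"), "positive", "negative")
dframe$predicted <- ifelse(dframe$label %in% c("tp", "fp"), "positive", "negative")

p <- ggplot(dframe, aes(x = actual, y = predicted)) +
  geom_tile(aes(fill = value), colour = "white") +
  geom_text(aes(label = paste0(label, " = ", value))) +
  scale_fill_gradient(low = "white", high = "steelblue") +
  facet_wrap(~method)
```

Because the true negatives dominate the counts, the fill scale will be nearly white for the other three cells; the text labels carry most of the information.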
It is a very old question, still it seems there is a quite straight forward solution to that using ggplot2 which hasn't been mentioned.
Hope it might be helpful to someone:
library(caret)
library(ggplot2)
cm <- confusionMatrix(factor(y.pred), factor(y.test), dnn = c("Prediction", "Reference"))
plt <- as.data.frame(cm$table)
plt$Prediction <- factor(plt$Prediction, levels=rev(levels(plt$Prediction)))
ggplot(plt, aes(Reference, Prediction, fill= Freq)) + # x = Reference, y = Prediction, matching the axis labels set below
geom_tile() + geom_text(aes(label=Freq)) +
scale_fill_gradient(low="white", high="#009194") +
labs(x = "Reference",y = "Prediction") +
scale_x_discrete(labels=c("Class_1","Class_2","Class_3","Class_4")) +
scale_y_discrete(labels=c("Class_4","Class_3","Class_2","Class_1"))
A slightly more modular solution based on MYaseen208's answer. Might be more effective for large datasets / multinomial classification:
confusion_matrix <- as.data.frame(table(predicted_class, actual_class))
ggplot(data = confusion_matrix,
mapping = aes(x = predicted_class, # table() names the columns after its arguments
y = actual_class)) +
geom_tile(aes(fill = Freq)) +
geom_text(aes(label = sprintf("%1.0f", Freq)), vjust = 1) +
scale_fill_gradient(low = "blue",
high = "red",
trans = "log") # if your results aren't quite as clear as the above example
Here's another ggplot2 based option; first the data (from caret):
library(caret)
# data/code from "2 class example" example courtesy of ?caret::confusionMatrix
lvs <- c("normal", "abnormal")
truth <- factor(rep(lvs, times = c(86, 258)),
levels = rev(lvs))
pred <- factor(
c(
rep(lvs, times = c(54, 32)),
rep(lvs, times = c(27, 231))),
levels = rev(lvs))
confusionMatrix(pred, truth)
And to construct the plots (substitute your own matrix below as needed when setting up "table"):
library(ggplot2)
library(dplyr)
table <- data.frame(confusionMatrix(pred, truth)$table)
plotTable <- table %>%
mutate(goodbad = ifelse(table$Prediction == table$Reference, "good", "bad")) %>%
group_by(Reference) %>%
mutate(prop = Freq/sum(Freq))
# fill alpha relative to sensitivity/specificity by proportional outcomes within reference groups (see dplyr code above as well as original confusion matrix for comparison)
ggplot(data = plotTable, mapping = aes(x = Reference, y = Prediction, fill = goodbad, alpha = prop)) +
geom_tile() +
geom_text(aes(label = Freq), vjust = .5, fontface = "bold", alpha = 1) +
scale_fill_manual(values = c(good = "green", bad = "red")) +
theme_bw() +
xlim(rev(levels(table$Reference)))
# note: for simple alpha shading by frequency across the table at large, simply use "alpha = Freq" in place of "alpha = prop" when setting up the ggplot call above, e.g.,
ggplot(data = plotTable, mapping = aes(x = Reference, y = Prediction, fill = goodbad, alpha = Freq)) +
geom_tile() +
geom_text(aes(label = Freq), vjust = .5, fontface = "bold", alpha = 1) +
scale_fill_manual(values = c(good = "green", bad = "red")) +
theme_bw() +
xlim(rev(levels(table$Reference)))
Here is a reprex using the cvms package, which provides a ggplot2 wrapper function for plotting confusion matrices.
library(cvms)
library(broom)
library(tibble)
library(ggimage)
#> Loading required package: ggplot2
library(rsvg)
set.seed(1)
d_multi <- tibble("target" = floor(runif(100) * 3),
"prediction" = floor(runif(100) * 3))
conf_mat <- confusion_matrix(targets = d_multi$target,
predictions = d_multi$prediction)
# plot_confusion_matrix(conf_mat$`Confusion Matrix`[[1]], add_sums = TRUE)
plot_confusion_matrix(
conf_mat$`Confusion Matrix`[[1]],
add_sums = TRUE,
sums_settings = sum_tile_settings(
palette = "Oranges",
label = "Total",
tc_tile_border_color = "black"
)
)
Created on 2021-01-19 by the reprex package (v0.3.0)
Old question, but I wrote this function which I think makes a prettier answer. Results in a divergent color palette (or whatever you want, but default is divergent):
prettyConfused<-function(Actual,Predict,colors=c("white","red4","dodgerblue3"),text.scl=5){
actual = as.data.frame(table(Actual))
names(actual) = c("Actual","ActualFreq")
#build confusion matrix
confusion = as.data.frame(table(Actual, Predict))
names(confusion) = c("Actual","Predicted","Freq")
#calculate percentage of test cases based on actual frequency
confusion = merge(confusion, actual, by=c('Actual','Actual'))
confusion$Percent = confusion$Freq/confusion$ActualFreq*100
confusion$ColorScale<-confusion$Percent*-1
confusion[which(confusion$Actual==confusion$Predicted),]$ColorScale<-confusion[which(confusion$Actual==confusion$Predicted),]$ColorScale*-1
confusion$Label<-paste(round(confusion$Percent,0),"%, n=",confusion$Freq,sep="")
tile <- ggplot() +
geom_tile(aes(x=Actual, y=Predicted,fill=ColorScale),data=confusion, color="black",size=0.1) +
labs(x="Actual",y="Predicted")
tile = tile +
geom_text(aes(x=Actual,y=Predicted, label=Label),data=confusion, size=text.scl, colour="black") +
scale_fill_gradient2(low=colors[2],high=colors[3],mid=colors[1],midpoint = 0,guide='none')
tile # return the plot visibly; a function ending in an assignment returns its value invisibly
}

Resources