I'm trying an example I found here where data is plotted using ggplot2. The code looks like this:
raw_text %>%
group_by(newsgroup) %>%
summarize(messages = n_distinct(id)) %>%
ggplot(aes(messages, newsgroup)) +
geom_col() +
labs(y = NULL)
The bars inside the diagram are supposed to be horizontal, i.e. from left to right, but for me they are vertical:
What do I have to change to also get horizontal bars?
I reproduced the example you indicated by downloading the data. My plot looks like:
code as given:
library(dplyr)
library(tidyr)
library(purrr)
library(readr)
library(ggplot2)
# The dataset has to be downloaded first (see the question by user1406177)
training_folder <- "data/20news-bydate-train/"
# Define a function to read all files from a folder into a data frame
read_folder <- function(infolder) {
tibble(file = dir(infolder, full.names = TRUE)) %>%
mutate(text = map(file, read_lines)) %>%
transmute(id = basename(file), text) %>%
unnest(text)
}
raw_text <- tibble(folder = dir(training_folder, full.names = TRUE)) %>%
mutate(folder_out = map(folder, read_folder)) %>%
unnest(cols = c(folder_out)) %>%
transmute(newsgroup = basename(folder), id, text)
raw_text %>%
group_by(newsgroup) %>%
summarize(messages = n_distinct(id)) %>%
ggplot(aes(messages, newsgroup)) +
geom_col() +
labs(y = NULL)
I would try this, mapping newsgroup to x and messages to y and then flipping the coordinates explicitly:
#Code
raw_text %>%
group_by(newsgroup) %>%
summarize(messages = n_distinct(id)) %>%
ggplot(aes(y=messages, x=newsgroup)) +
geom_col() +
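labs(x = NULL) + # newsgroup is mapped to x here, so this drops the same axis title as labs(y = NULL) in the original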
coord_flip()
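A likely cause, though nobody in this thread confirmed it, is an older ggplot2 installation. As far as I know, geoms such as geom_col() only started inferring a horizontal orientation from aes(messages, newsgroup) in ggplot2 3.3.0. The check below is a minimal sketch under that assumption; the variable names are made up for illustration.

```r
library(ggplot2)

# Assumption: horizontal bars from aes(messages, newsgroup) rely on the
# orientation detection introduced in ggplot2 3.3.0.
installed_version <- packageVersion("ggplot2")
supports_auto_orientation <- installed_version >= "3.3.0"

if (supports_auto_orientation) {
  message("ggplot2 ", format(installed_version),
          ": geom_col() should infer horizontal bars on its own.")
} else {
  message("ggplot2 ", format(installed_version),
          ": use coord_flip() as above, or update the package.")
}
```

If the check reports an older version, either the coord_flip() variant above or update.packages("ggplot2") should give horizontal bars.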
I'm trying to plot cumulative values as a stacked area plot, but the image I get contains no stacked areas; only the background and the legend are there. Initially I thought the function didn't recognize duplicate dates, so I tried group_by, but it still returns the same result. Does anyone know what is going on here?
The dataset that I'm using can be found on Kaggle (covid dataset).
Update:
I'm able to do ggplot(aes(..., text = count)) but not:
ggplot(aes(..., text = comma(count))) or
ggplot(aes(..., text = paste(count))) or
ggplot(aes(..., text = text)) where text is a mutated column
if (!require("pacman")) install.packages("pacman")
# load packages
pacman::p_load(pacman, installr, rio, dplyr, tidyr, ggplot2, stringr, scales, viridis, hrbrthemes, plotly, htmlwidgets)
daily_covid <- import("./worldometer_coronavirus_daily_data.csv")
daily_covid <-
daily_covid %>%
replace(is.na(.), 0) %>%
mutate(date = as.Date(date))
data <-
daily_covid %>%
group_by(date) %>%
summarise(
cumulative_total_cases = sum(cumulative_total_cases, na.rm = T),
cumulative_total_deaths = sum(cumulative_total_deaths, na.rm = T),
) %>% # actually na.rm is not needed
gather(categories, count,
cumulative_total_cases, cumulative_total_deaths) %>%
# adding this section of code, the dataframe looks fine
rowwise() %>%
mutate(text =
paste(
str_to_title(str_replace_all(categories, "_", " ")),
"Count:", count
)
) %>%
# section end
arrange(date) # this is just to check if text is appended properly
data # a small section of this data is shown below this block
# write.csv(data, paste0(out_data_path, "temp.csv"))
# (subset(data, is.na(text))) # enable this to check if something is na
q3 <- # V here is the problem
ggplot(data, aes(x=date, y=count, fill=categories, text = text)) +
geom_area() +
facet_wrap(~categories, scales = "free_y")
# scale_fill_viridis(discrete = TRUE) +
# theme(legend.position="none") +
# ggtitle("Cumulative Covid Cases Stacked Area plot") +
# theme_ipsum() +
# theme(legend.position="none")
q3
#
itrt_q3 <- ggplotly(q3, tooltip = "text")
itrt_q3
This is a small sample of the manipulated data frame (data) shown above:
"date","categories","count","text"
2020-01-22,"cumulative_total_cases",571,"Cumulative Total Cases Count: 571"
2020-01-22,"cumulative_total_deaths",17,"Cumulative Total Deaths Count: 17"
After you add na.rm = TRUE the graph should print.
data <-
daily_covid %>%
group_by(date) %>%
summarise(
cumulative_total_cases = sum(cumulative_total_cases, na.rm = TRUE),
cumulative_total_deaths = sum(cumulative_total_deaths, na.rm = TRUE),
) %>%
mutate(text =
paste(
"Cumulative Cases:",
cumulative_total_cases,
"\nCumulative Deaths:",
cumulative_total_deaths
)
) %>%
gather(categories, count,
cumulative_total_cases, cumulative_total_deaths) %>%
ungroup() %>%
arrange(date) %>%
group_by(date, text, categories)
data %>%
ggplot(aes(x=date, y=count, fill=categories)) +
geom_area()
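Another possible explanation for the empty plot in the question, not confirmed in this thread, is the text aesthetic itself. ggplot2 groups rows by the interaction of all discrete variables in the mapping, so a text column with a unique value per row puts every row in its own group, and geom_area() cannot draw an area from single points. The sketch below uses made-up toy data (toy, p_split and p_grouped are hypothetical names) and sets group explicitly while keeping text available for a tooltip.

```r
library(ggplot2)

# Hypothetical toy data standing in for the summarised covid data
toy <- data.frame(
  date = rep(as.Date("2020-01-22") + 0:2, each = 2),
  categories = rep(c("cumulative_total_cases", "cumulative_total_deaths"),
                   times = 3),
  count = c(571, 17, 830, 25, 1287, 41)
)
toy$text <- paste(toy$categories, "Count:", toy$count)

# As in the question: no explicit group, so each distinct text value
# ends up as its own one-row group
p_split <- ggplot(toy, aes(x = date, y = count, fill = categories,
                           text = text)) +
  geom_area()

# Explicit group: one area per category, text is still in the mapping
p_grouped <- ggplot(toy, aes(x = date, y = count, fill = categories,
                             group = categories, text = text)) +
  geom_area()

# Inspect how many groups the grouped plot actually draws
n_groups_grouped <- length(unique(layer_data(p_grouped)$group))
```

With group = categories in place, ggplotly(p_grouped, tooltip = "text") should have areas to attach the tooltip to, although I have not run that part here.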
I'm trying to create a sentiment analysis using the tidytext code here, but my graph comes out with vertical bars and the output doesn't make sense, whereas the original is horizontal. How can I fix this?
#Unnest tokens
edAItext = edAI %>% select(Group, Participant_ID, Brainscape_Pattern) %>%
unnest_tokens(word, Brainscape_Pattern)
# Inner join
bing_word_counts <- edAItextTest %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
#Check
bing_word_counts
#Plot
bing_word_counts %>%
group_by(sentiment) %>%
slice_max(n, n = 5) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(x = "Contribution to sentiment",
y = NULL)
This is how it looks:
This is how it's supposed to look:
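This looks like the same symptom as the newsgroup question further up, so the same workaround may apply: map word to x and n to y, then flip the coordinates explicitly. The sketch below is only an assumption about the cause (an older ggplot2 that does not infer orientation) and uses made-up toy counts instead of the real bing_word_counts.

```r
library(ggplot2)

# Hypothetical toy counts standing in for bing_word_counts
toy_counts <- data.frame(
  word = c("good", "great", "happy", "bad", "poor", "sad"),
  sentiment = rep(c("positive", "negative"), each = 3),
  n = c(12, 9, 7, 10, 6, 4)
)
# Order the words by their count so the bars are sorted
toy_counts$word <- reorder(toy_counts$word, toy_counts$n)

p_flip <- ggplot(toy_counts, aes(x = word, y = n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  coord_flip() +
  # After flipping, the y aesthetic (n) is drawn along the horizontal axis
  labs(y = "Contribution to sentiment", x = NULL)

flip_data <- layer_data(p_flip)
```

Note that the axis labels swap roles here: the label for n is set through y even though it appears on the horizontal axis.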
I am trying to fix the limits of a bar chart so the horizontal bars don't extend past the plot area. I could set the limit manually using limits = c(0, 3000000), but I guess there is a way to make it scale automatically. The code:
corona.conf <- read.csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv",header = TRUE,check.names=FALSE)
corona.conf %>% .[,c(-1,-3,-4)] %>% melt(.,variable.name="day") %>%
group_by(`Country/Region`,day) %>% summarize(value=sum(value)) %>%
mutate(day=as.Date(day,format='%m/%d/%y')) %>% mutate(count=value-lag(value)) %>%
replace(is.na(.),0) %>% group_by(`Country/Region`) %>% summarize(count=sum(count)) %>%
top_n(20) %>% arrange(desc(count)) %>% ggplot(.,aes(x=reorder(`Country/Region`,count),y=count,fill=count)) +
geom_bar(stat = "identity") + coord_flip() + geom_text(aes(label=format(count,big.mark = ",")),hjust=-0.1,size=4) +
scale_y_continuous(expand = c(0,0))
I thought something like:
scale_y_continuous(expand = c(0,0),limits=c(0,max(count)))
Appreciate any suggestions on the fix.
I think it would be easier to read and run the code by splitting it into several parts.
We can use layer_data to get the information from a ggplot object, and then calculate the maximum from that. Based on your example, I would also suggest you multiply the maximum by 1.7 to leave room for the text labels.
library(tidyverse)
library(data.table)
corona.conf <- read.csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv",header = TRUE,check.names=FALSE)
dat <- corona.conf %>% .[,c(-1,-3,-4)] %>% melt(.,variable.name="day") %>%
group_by(`Country/Region`,day) %>% summarize(value=sum(value)) %>%
mutate(day=as.Date(day,format='%m/%d/%y')) %>% mutate(count=value-lag(value)) %>%
replace(is.na(.),0) %>% group_by(`Country/Region`) %>% summarize(count=sum(count)) %>%
top_n(20) %>% arrange(desc(count))
p <- ggplot(dat, aes(x=reorder(`Country/Region`,count),y=count,fill=count)) +
geom_bar(stat = "identity") +
coord_flip() +
geom_text(aes(label=format(count,big.mark = ",")),hjust=-0.1,size=4)
p +
scale_y_continuous(expand = c(0,1), limits = c(0, max(layer_data(p)$y) * 1.7))
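An alternative that avoids computing the maximum by hand is to let the scale pad itself with a multiplicative expansion. This is a minimal sketch assuming ggplot2 3.3.0 or later, where expansion() is available; the toy data and the 20% padding are made up, so the multiplier may need tuning for long labels.

```r
library(ggplot2)

# Hypothetical toy data standing in for the top-20 country totals
toy_cases <- data.frame(
  country = c("A", "B", "C"),
  count = c(2500000, 900000, 400000)
)

p_auto <- ggplot(toy_cases,
                 aes(x = reorder(country, count), y = count, fill = count)) +
  geom_col() +
  coord_flip() +
  geom_text(aes(label = format(count, big.mark = ",")),
            hjust = -0.1, size = 4) +
  # No padding at zero, 20% of the data range added past the longest bar
  scale_y_continuous(expand = expansion(mult = c(0, 0.2)))

bar_data <- layer_data(p_auto, 1)
```

Because the padding is relative to the data range, it keeps working when the counts grow, which is the automatic scaling asked for in the question.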
I would like the y-axis (which is flipped in the plot) to start at some arbitrary value, like 7.5.
After a bit of research I came across ylim, but in this case it gives me the following message and warning:
Scale for 'y' is already present. Adding another scale for 'y', which will
replace the existing scale.
Warning message:
Removed 10 rows containing missing values (geom_col).
This is my code, and a way to download the data I'm using:
install.packages("remotes")
remotes::install_github("tweed1e/werfriends")
library(werfriends)
friends_raw <- werfriends::friends_episodes
library(tidytext)
library(tidyverse)
#"best" writers with at least 10 episodes
friends_raw %>%
unnest(writers) %>%
group_by(writers) %>%
summarize(mean_rating = mean(rating),
n = n()) %>%
arrange(desc(mean_rating)) %>%
filter(n > 10) %>%
head(10) %>%
mutate(writers = fct_reorder(writers, mean_rating)) %>%
ggplot(aes(x = writers, y = mean_rating, fill = writers)) + geom_col() +
coord_flip() + theme(legend.position = "None") + scale_y_continuous(breaks = seq(7.5,10,0.5)) +
ylim(7.5,10)
You should use coord_cartesian to zoom in on a particular region (here is the official documentation: https://ggplot2.tidyverse.org/reference/coord_cartesian.html).
With your example, your code should look something like this:
friends_raw %>%
unnest(writers) %>%
group_by(writers) %>%
summarize(mean_rating = mean(rating),
n = n()) %>%
arrange(desc(mean_rating)) %>%
filter(n > 10) %>%
head(10) %>%
mutate(writers = fct_reorder(writers, mean_rating)) %>%
ggplot(aes(x = writers, y = mean_rating, fill = writers)) + geom_col() +
coord_flip() + theme(legend.position = "None") + scale_y_continuous(breaks = seq(7.5,10,0.5)) +
coord_cartesian(ylim = c(7.5,10))
If this is not working please provide a reproducible example of your dataset (see: How to make a great R reproducible example)
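I found the solution. With my actual plot, the answer submitted by @dc37 didn't work because coord_flip() and coord_cartesian() exclude each other: a plot can have only one coordinate system, so the one added last replaces the other. The way to do this is to pass the limits to coord_flip() directly: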
friends_raw %>%
unnest(writers) %>%
group_by(writers) %>%
summarize(mean_rating = mean(rating),
n = n()) %>%
arrange(mean_rating) %>%
filter(n > 10) %>%
head(10) %>%
mutate(writers = fct_reorder(writers, mean_rating)) %>%
ggplot(aes(x = writers, y = mean_rating, fill = writers)) + geom_col() +
theme(legend.position = "None") +
coord_flip(ylim = c(8,8.8))
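For anyone wondering why ylim() produced the "Removed 10 rows" warning: ylim() sets the limits of the scale, and values outside those limits are dropped. Every bar starts at 0, which is below 7.5, so the bars are removed. Limits passed to the coordinate system only zoom the view and keep the data. A small sketch with made-up ratings (toy_ratings and p_zoom are hypothetical names):

```r
library(ggplot2)

# Hypothetical toy ratings standing in for the friends data
toy_ratings <- data.frame(
  writers = c("Writer A", "Writer B", "Writer C"),
  mean_rating = c(8.2, 8.5, 8.7)
)

p_zoom <- ggplot(toy_ratings,
                 aes(x = reorder(writers, mean_rating), y = mean_rating)) +
  geom_col() +
  scale_y_continuous(breaks = seq(7.5, 10, 0.5)) +
  # Zoom the flipped value axis; the bars still start at 0 in the data
  coord_flip(ylim = c(7.5, 10))

zoom_data <- layer_data(p_zoom)
```

All three bars survive in zoom_data, with their baseline at 0 still present; only the visible window changes.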
Although my query returns the values in descending order, ggplot then displays them alphabetically instead of by value.
Known solutions to this problem haven't seemed to work. They suggest using reorder or factor for the values, which didn't work in this case.
This is my code:
boxoffice %>%
group_by(studio) %>%
summarise(movies_made = n()) %>%
arrange(desc(movies_made)) %>%
top_n(10) %>%
arrange(desc(movies_made)) %>%
ggplot(aes(x = studio, y = movies_made, fill = studio, label = as.character(movies_made))) +
geom_bar(stat = 'identity') +
geom_label(label.size = 1, size = 5, color = "white") +
theme(legend.position = "none") +
ylab("Movies Made") +
xlab("Studio")
For those wanting a more complete example, here's where I got:
library(dplyr)
library(ggplot2)
# get some dummy data
boxoffice = boxoffice::boxoffice(dates=as.Date("2017-1-1"))
df <- (
boxoffice %>%
group_by(distributor) %>%
summarise(movies_made = n()) %>%
mutate(studio=reorder(distributor, -movies_made)) %>%
top_n(10))
ggplot(df, aes(x=distributor, y=movies_made)) + geom_col()
You'll need to convert boxoffice$studio to an ordered factor. ggplot will then respect the order of the factor levels, which here follow the row order of the sorted data, rather than alphabetizing. Your dplyr chain will look like this:
boxoffice %>%
group_by(studio) %>%
summarise(movies_made = n()) %>%
arrange(desc(movies_made)) %>%
ungroup() %>% # ungroup
mutate(studio = factor(studio, studio, ordered = T)) %>% # convert variable
top_n(10) %>%
arrange(desc(movies_made)) %>%
ggplot(aes(x = studio, y... (rest of plotting code)
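As a shorter variant of the same idea, reorder() can set the factor levels directly from the counts, as long as the reordered column is the one mapped to x. This is a minimal sketch with made-up studio counts (toy_studios is a hypothetical name), not the real boxoffice data.

```r
library(ggplot2)

# Hypothetical toy data standing in for the summarised boxoffice table
toy_studios <- data.frame(
  studio = c("Alpha", "Beta", "Gamma"),
  movies_made = c(3, 12, 7)
)

# Order the levels by descending count; ggplot follows the level order
toy_studios$studio <- reorder(toy_studios$studio, -toy_studios$movies_made)
studio_levels <- levels(toy_studios$studio)

p_sorted <- ggplot(toy_studios, aes(x = studio, y = movies_made, fill = studio)) +
  geom_col() +
  theme(legend.position = "none") +
  labs(x = "Studio", y = "Movies Made")
```

Here studio_levels comes out as Beta, Gamma, Alpha, so the bars are drawn from the largest count to the smallest.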