Date format using scale_x_date giving Error - r

Hello, I need the X axis of my ggplot to show dates in the format used in the picture, but my date column also contains the time.
sentiment_bing1 <- tidy_trump_tweets %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, created_at, sentiment) %>%
  ungroup()
p <- sentiment_bing1 %>%
  filter(sentiment == "positive") %>%
  ggplot(aes(x = created_at, y = n)) +
  geom_line(stat = "identity", position = "identity", color = "Blue") +
  scale_x_date(date_breaks = '3 months', date_labels = '%b-%Y') +
  stat_smooth() +
  theme_gdocs() +
  xlab("Date") +
  ylab("Normalized Frequency of Positive Words in Trump's Tweets")
1 abound 11/30/17 13:05 positive 0.0
2 abuse 1/11/18 12:33 negative 0.0
3 abuse 10/27/17 1:18 negative 0.0
4 abuse 2/18/18 17:10 negative 0.0
This is what I have done so far. How do I get an axis like the one in the picture? Converting to Date doesn't help, because some tweets were posted on the same day at different times, and that messes up the graph.

Welcome to SO!
It's hard to answer your question without seeing the data you are using and the error that your code is generating. Next time try and create a reproducible question. This will make it easier for someone to identify where your problem lies.
Based on the code and data you've provided I've created a sample data set with a (broadly) similar structure to that from the chart...
library(dplyr)
library(lubridate)
library(ggplot2)
library(ggthemes)
set.seed(100)
start_date <- mdy_hm("03-01-2017-12:00")
end_date <- mdy_hm("03-01-2018-12:00")
number_hours <- interval(start_date, end_date)/hours(1)
created_at <- start_date + hours(6:number_hours)
length(created_at)
word <- sample(c("abound", "abuse"), size = length(created_at), replace = TRUE,
prob=c(0.25, 0.75))
Your plotting code looks good. I could be wrong here, but from what I can tell your problem could lie in the way you are summarising the frequencies. In the code below, I've used the lubridate package to group your data by date (day), allowing for a daily frequency count.
test_plot <- tibble(created_at, word) %>%   # tibble() replaces the deprecated data_frame()
  mutate(sentiment =
           case_when(
             word == "abound" ~ "positive",
             word == "abuse" ~ "negative")) %>%
  filter(sentiment == "positive") %>%
  # created_at is already a date-time, so it can be rounded to the day directly
  mutate(created_at = date(round_date(created_at, unit = "day"))) %>%
  group_by(created_at) %>%
  tally() %>%
  ggplot() +
  aes(x = created_at, y = n) +
  geom_line(stat = "identity", position = "identity", color = "Blue") +
  geom_smooth() +
  scale_x_date(date_breaks = '3 months', date_labels = '%b-%Y') +
  theme_gdocs() +
  xlab("Date") +
  ylab("Frequency of Positive Words in Trump's Tweets")
Which gives you this...
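If the time of day should stay on the axis rather than being collapsed to days, one option is to keep created_at as a date-time (POSIXct) and use scale_x_datetime(), which takes the same date_breaks and date_labels arguments; scale_x_date() only works with Date columns. A minimal sketch, assuming the timestamps look like the four sample rows in the question (the tweets data frame and its n values are made up for illustration):

```r
library(ggplot2)
library(lubridate)

# Hypothetical stand-in for the question's data: timestamps as m/d/y H:M strings
tweets <- data.frame(
  created_at = c("10/27/17 1:18", "11/30/17 13:05", "1/11/18 12:33", "2/18/18 17:10"),
  n = c(1, 3, 2, 4)
)

# Parse to POSIXct so tweets on the same day at different times stay distinct
tweets$created_at <- mdy_hm(tweets$created_at)

p <- ggplot(tweets, aes(x = created_at, y = n)) +
  geom_line(color = "blue") +
  # scale_x_datetime() is the POSIXct counterpart of scale_x_date()
  scale_x_datetime(date_breaks = "1 month", date_labels = "%b-%Y") +
  xlab("Date")
```

Printing p draws the chart. The break spacing of one month is only chosen to suit the short toy data; "3 months" as in the question should work the same way on a full year of tweets.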

sentiment_bing1 <- tidy_trump_tweets %>%
  inner_join(get_sentiments("bing")) %>%
  count(created_at, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  # min-max normalize each count column; refer to the columns directly,
  # since sentiment_bing1 does not exist yet while it is being defined
  mutate(N = (negative - min(negative)) / (max(negative) - min(negative))) %>%
  mutate(P = (positive - min(positive)) / (max(positive) - min(positive))) %>%
  ungroup()
sentiment_bing1$created_at <- as.Date(sentiment_bing1$created_at, "%m/%d/%y")
Using spread helped to separate the positive and negative counts, which could then be normalized to get the result I was looking for!
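The normalization in the two mutate() calls is a min-max rescaling to the range 0 to 1. As a small sketch, it can be pulled into a helper (the name rescale01 and the toy counts are made up for illustration):

```r
library(dplyr)

# Min-max rescaling: maps the smallest value to 0 and the largest to 1
rescale01 <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Toy wide-format counts, shaped like the output of spread()
counts <- tibble(
  negative = c(0, 2, 4),
  positive = c(1, 1, 5)
)

normalized <- counts %>%
  mutate(N = rescale01(negative),
         P = rescale01(positive))
```

Note that a column in which every value is the same would divide zero by zero and give NaN, so that case would need separate handling.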

Related

Difference in n() count and geom_col graph likely resulted from group_by(), but why and how?

Sorry in advance, I'm an R newbie. I was working on Divvy Bike Share data (see here for details). Here is a subset of my df:
I wanted to visualize the total ridership count (how many times bikes were used), aggregated by day of the week. I tried two blocks of code whose only difference is the grouping before summarize(): the second one also includes "month". I don't understand what caused the difference in y-axis values between the two graphs.
p1 <- df %>%
group_by(member_casual, day_of_week) %>%
summarize(total_rides = n()) %>%
ggplot(aes(x = day_of_week, y = total_rides, fill = member_casual)) +
geom_col(position = "dodge") +
labs(title = "Total Rides by Days in a Week", subtitle = "Casual Customers vs. Members", y = "ride times count (in thousands)") +
theme(axis.title.x = element_blank()) +
scale_fill_discrete(name = "") +
scale_y_continuous(labels = label_number(scale = 1e-3, suffix = "k"))
p1
p2 <- df %>%
group_by(member_casual, day_of_week, month) %>%
summarize(total_rides = n()) %>%
ggplot(aes(x = day_of_week, y = total_rides, fill = member_casual)) +
geom_col(position = "dodge") +
labs(title = "Total Rides by Days in a Week", subtitle = "Casual Customers vs. Members", y = "ride times count (in thousands)") +
theme(axis.title.x = element_blank()) +
scale_fill_discrete(name = "") +
scale_y_continuous(labels = label_number(scale = 1e-3, suffix = "k"))
p2
I tested the tables generated before a plot is visualized, so I tried the following blocks:
df %>%
group_by(member_casual, day_of_week) %>%
summarize(total_rides = n())
df %>%
group_by(member_casual, day_of_week, month) %>%
summarize(total_rides = n())
I think I understand that adding more variables to group_by makes the resulting table more finely categorized or "grouped". However, the total should always be the same, no? For example, if you add up all the casual & Sunday rows (split into 12 months) in tibble 2, you get exactly the number in tibble 1 - 392107, the same number as shown in p1, not p2. This only added to my confusion.
So in short, I have two questions:
Why is there a difference between p1 and p2? How can I avoid such errors in the future?
Where do the numbers in p2 come from?
Any advice would be greatly appreciated. Thank you!
You’re assuming that the counts for each month will be stacked, so that together the column will show the total across all months. But in fact the counts are overplotted in front of one another, so only the highest month-count is visible. You can see this is the case if you add a border and make your columns transparent. Using mpg as an example, with cyl as the “extra” grouping variable:
library(dplyr)
library(ggplot2)
mpg %>%
count(drv, year, cyl) %>%
ggplot(aes(year, n, fill = drv)) +
geom_col(
position = "dodge",
color = "black",
alpha = .5
)
NB: count(x) is a shortcut for group_by(x) %>% summarize(n = n()).
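To get bars that show the full total again, one option is to sum over the extra grouping variable before plotting (or simply leave it out of the grouping). A quick sketch with the same mpg example, checking that the per-cyl counts add up to the counts without cyl:

```r
library(dplyr)
library(ggplot2)

# Counts with the extra grouping variable
by_cyl <- mpg %>% count(drv, year, cyl)

# Collapse the extra variable again by summing its counts
collapsed <- by_cyl %>%
  group_by(drv, year) %>%
  summarize(n = sum(n), .groups = "drop")

# Counts computed without cyl in the first place
direct <- mpg %>% count(drv, year)

# Both routes give the same totals
identical(collapsed$n, direct$n)

# One bar per drv/year combination, so no bar is hidden behind another
p <- ggplot(collapsed, aes(factor(year), n, fill = drv)) +
  geom_col(position = "dodge")
```

In the bike-share question, this corresponds to dropping month from group_by() before plotting, which is what p1 already does.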

Plotting a line graph by datetime with a histogram/bar graph by date

I'm relatively new to R and could really use some help with some pretty basic ggplot2 work.
I'm trying to visualize total number of submissions on a graph, showing the overall total in a line graph and the daily total in a histogram (or bar graph) on top of it. I'm not sure how to add breaks or bins to the histogram so that it takes the submission datetime column and makes each bar the daily total.
I tried adding a column that converts the datetime into just date and plots based on that, but I'd really like the line graph to include the time.
Here's what I have so far:
df <- df %>%
mutate(datetime = lubridate::mdy_hm(datetime))%>%
mutate(date = lubridate::as_date(datetime))
#sort by datetime
df <- df %>%
arrange(datetime)
#add total number of submissions
df <- df %>%
mutate(total = row_number())
#ggplot
line_plus_histo <- df%>%
ggplot() +
geom_histogram(data = df, aes(x=datetime)) +
geom_line(data = df, aes(x=datetime, y=total), col = "red") +
stat_bin(data = df, aes(x=date), geom = "bar") +
labs(
title="Submissions by Day",
x="Date",
y="Submissions",
legend=NULL)
line_plus_histo
As you can see, I'm also calculating the total number of submissions by sorting by time and then adding a column with the row number. So if you can help me use a better method I'd really appreciate it.
Please, find below the line plus histogram of time v. submissions:
Here's the pastebin link with my data
You can extend your data manipulation by:
df <- df |>
mutate(datetime = lubridate::mdy_hm(datetime)) |>
arrange(datetime) |>
mutate(midday = as_datetime(floor_date(as_date(datetime), unit = "day") + 0.5)) |>
mutate(totals = row_number()) |>
group_by(midday) |>
mutate(N = n())|>
ungroup()
then use midday for bars and datetime for line:
df%>%
ggplot() +
geom_bar(data = df, aes(x = midday)) +
geom_line(data = df, aes(x=datetime, y=totals), col = "red") +
labs(
title="Submissions by Day",
x="Date",
y="Submissions",
legend=NULL)
PS. Sorry for Polish locales on X axis.
PS2. With geom_bar it looks much better
Created on 2022-02-03 by the reprex package (v2.0.1)

Why can't I get the right horizontal axis labels on my ggplot2 chart?

I am trying to do a faceted plot of a grouped dataframe with ggplot2, using geom_line(). My dataframe has a Date column and I would like to have dates on the horizontal axis. If I just use Date in aes(x=Date, ...) I get nice labels on the horizontal axis. However, the line has an almost horizontal section where the date jumps from the end of one group to the beginning of the next group. This code and chart shows that:
dts <- seq.Date(as.Date("2020-01-01"), as.Date("2021-12-31"), by="day")
mos <- sapply(dts, month)
df <- data.frame(Date=dts, Month=mos)
nr <- nrow(df)
df$X <- rep(1, nr)
df %>%
group_by(Month) -> dfgrp
dfgrp %>%
group_by(Month) %>%
mutate(Time = Date[1:n()],
Z = cumsum(X)) %>%
ggplot(aes(x=Date, y=Z)) +
geom_line(color="darkgreen", size=0.5) +
facet_grid(. ~ Month, scale="free_x") +
theme(axis.text.x = element_text(angle=45, size=7))
I would not like my chart to have those almost-horizontal lines when the date changes by a large amount. I was able to generate a chart without those lines using integers on aes() as follows:
dfgrp %>%
mutate(Time = 1:n() %>% as.integer(),
Z = cumsum(X)) %>%
ggplot(aes(x=Time, y=Z)) +
geom_line(color="darkgreen", size=0.5) +
facet_grid(. ~ Month, scale="free_x") +
scale_x_continuous(breaks = seq(from=1, to=nr, by=10) %>% as.integer(),
labels = function(x) as.character(dfgrp$Date[x])) +
theme(axis.text.x = element_text(angle=45, size=7))
The line on the chart looks like I want it but the dates on the horizontal axis are not correct: they end in February 2020 in every facet while the dates in the dataframe end in December 2021 and the dates in the first chart begin and end on different months in different facets.
I tried many things but nothing worked. Any suggestions on how to have a chart with dates like in the first chart above and lines like in the second chart above?
Help will be much appreciated.
You may want to adjust the dates to be in the same year, but noting the original year as a variable:
library(lubridate)
dfgrp %>%
group_by(Month) %>%
mutate(year = year(Date),
adj_date = ymd(paste(2020, month(Date), day(Date)))) %>%
# 2020 was leap year so 2/29 won't be lost
mutate(Time = Date[1:n()],
Z = cumsum(X)) %>%
ggplot(aes(x=adj_date, y=Z, color = year, group = year)) +
geom_line(size=0.5) +
facet_grid(. ~ Month, scale="free_x") +
theme(axis.text.x = element_text(angle=45, size=7))

Plot number of people (observations) in a data set?

I want to use a barplot to display the number of male and female participants for a program in the month of July. However, I keep getting a percentage instead. I know there's something wrong with my code but I'm not sure what.
july_all%>%
filter(month == 0)%>%
group_by(sex)%>%
summarize(id=round(100*n()/nrow(july_all)))%>%
ggplot(aes(x =sex,y =id)) +
geom_bar(stat ="identity") +
labs(y ="Number of participants")
You haven't provided any data, but I'm guessing the following is a reasonable approximation:
set.seed(1)
july_all <- data.frame(sex = sample(c("Male", "Female"), 500, TRUE), month = 0)
Now, running your code, we get:
july_all %>%
filter(month == 0) %>%
group_by(sex) %>%
summarize(id = round(100*n()/nrow(july_all))) %>%
ggplot(aes(x =sex, y = id)) +
geom_bar(stat ="identity") +
labs(y ="Number of participants")
Which shows percentages. Why? Because that's what you have calculated with summarize(id = round(100*n()/nrow(july_all))). Here, id is the percentage of each sex present in the data.
If you want raw counts, your code can be much simpler, since geom_bar plots counts by default, so you need only do:
july_all %>%
filter(month == 0) %>%
ggplot(aes(x = sex)) +
geom_bar() +
labs(y ="Number of participants")
And you'll see we have the raw counts plotted.
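If the counts are also needed as a table, for example to check the numbers behind the bars, an equivalent sketch computes them first with count() and plots them with geom_col(), which uses the y values as given (same made-up july_all data as above):

```r
library(dplyr)
library(ggplot2)

set.seed(1)
july_all <- data.frame(sex = sample(c("Male", "Female"), 500, TRUE), month = 0)

# One row per sex, with the raw number of participants in column n
sex_counts <- july_all %>%
  filter(month == 0) %>%
  count(sex)

p <- ggplot(sex_counts, aes(x = sex, y = n)) +
  geom_col() +
  labs(y = "Number of participants")
```

Printing sex_counts shows the two counts, and printing p draws the same bars as the geom_bar() version.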

ggplot2 specify a secondary axis for both value and date/datetime

I am trying to produce a ggplot graph where I can compare two time periods without indexing the data, so that I get one time window running along the x-axis at the bottom and one along the x-axis at the top, kind of like the example from the tidyverse manual page (see graph at bottom of linked page).
However, I would like to have a dual axis on the y-axis too, something like this (I made this by copy-pasting, using the code below):
economics_long %>% filter(variable== "pce" & date > "2008-01-01" & date < "2010-01-01") %>%
ggplot(aes(date, value01, colour = variable)) + geom_line()
economics_long %>% filter(variable== "pce" & date > "1990-01-01" & date < "1992-01-01") %>%
ggplot(aes(date, value01, colour = variable)) + geom_line()
I imagine I would need to use bind_rows() to cut the two time periods out and put them on top of each other, and maybe make a new variable like variable with two options, like time-window 1 and time-window 2. However, I wanted to ask here before I start manually building something crazy. Maybe others have done something similar?
I have made some first steps, like,
tw01 <- economics_long %>%
filter(variable== "pce" & date > "2008-01-01" & date < "2010-01-01")
tw02 <- economics_long %>%
filter(variable== "pce" & date > "1990-01-01" & date < "1992-01-01")
tw02$date <- tw01$date
tw <- bind_rows(tw01, tw02, .id = "time_window")
tw %>% ggplot(aes(date, value01, colour = time_window)) + geom_line()
Maybe this is what you are looking for:
For the date transformation I make use of lubridate::years. Additionally, for the transformation inside sec_axis I had to wrap it in hms::hms, as I otherwise got an error.
As I personally always find secondary axes a bit confusing, especially with both a secondary x and y axis, I colored both the x and y labels according to the color of the lines. If you don't like that you can simply drop the theme() adjustments.
library(ggplot2)
library(dplyr)
library(lubridate)  # interval(), ymd(), time_length(), months()
tw1_START <- "2008-01-01"; tw1_END <- "2010-01-01"
tw2_START <- "1990-01-01"; tw2_END <- "1992-01-01"
s_factor <- .52
Intv <- interval(ymd(tw2_START), ymd(tw1_START))
IntvM <- time_length(Intv, "month") # time_length(YrDis , "year")
tw01 <- economics_long %>%
filter(variable== "pce" & date > tw1_START & date < tw1_END )
tw02 <- economics_long %>%
filter(variable== "pce" & date > tw2_START & date < tw2_END) %>%
mutate(date = date + hms::hms(months(IntvM))) %>%
mutate(value01 = value01 + s_factor)
tw <- bind_rows(tw01, tw02, .id = "time_window")
tw %>%
ggplot(aes(date, value01, colour = time_window)) +
geom_line() +
scale_x_date(sec.axis = sec_axis(~ . -hms::hms(months(IntvM)))) +
scale_y_continuous(sec.axis = sec_axis(~ . - s_factor), position = "right") +
theme(axis.text.x.top = element_text(color = scales::hue_pal()(2)[2]),
axis.text.x.bottom = element_text(color = scales::hue_pal()(2)[1]),
axis.text.y.right = element_text(color = scales::hue_pal()(2)[1]),
axis.text.y.left = element_text(color = scales::hue_pal()(2)[2]))
