Issue with plotting multiple axes - r

I'm trying to create a plot that shows a couple of variables (integers) against a single x axis (date), and I'm having some issues getting the base plot created.
I keep getting the error message
"Error in as.matrix(x) : argument "x" is missing, with no default".
Here is all my code, if you can help it would be fantastic!
avail <- avail %>%
mutate(Date = as.Date(avail$Date, format = "%a %d-%b-%Y"))
avail <- format(avail, format="%d-%m-%Y")
avail
names(avail)[2] <- "Available Jobs"
df <- data.frame("Date" = avail$Date,
"Available Jobs" = avail$`Available Jobs`,
"Jobs" = job$Jobs.Added,
"Views" = view$Views)
This is the part that gives error messages:
ggplot(df, aes(x=df$Date))+
geom_line(aes(y=df$Available.Jobs), size=2, color=scale.default())+
geom_line(aes(y=df$Views), size=2, color=scale.default())+
scale_x_continuous(
name= "Available Jobs",
sec.axis = sec_axis(~.*coeff, name = "Views"))+
ggtitle("September views against available jobs")

"it doesn't work without specifying the dataset": well, with respect, then you're doing something wrong. It turns out you're probably doing several things wrong...
You haven't given us a reproducible example, so I can't be 100% sure of what you want, but here's as close as I can get.
First, create some data and load libraries
library(tidyverse)
library(lubridate)
# For reproducibility
set.seed(123)
df <- tibble(
Date=seq(ymd("2022-09-01"), ymd("2022-09-30"), "1 day"),
`Available Jobs`=runif(30, 100, 200),
Jobs=runif(30, 50, 150),
Views=runif(30, 200, 300)
)
Now attempt to create a plot. I've commented where I've either corrected errors or made assumptions.
df %>% ggplot(aes(x=Date)) +
# Note the correction to the name of the y variable
# color=scale.default() is the source of the "as.matrix()" error
geom_line(aes(y=`Available Jobs`), size=2, color="blue") +
geom_line(aes(y=Views), size=2, color="red") +
# Why label the x axis with the name of the y variable? Is this change correct?
scale_y_continuous(
name= "Available Jobs",
# Removing color=scale.default() introduces object 'coeff' not found here
# sec.axis = sec_axis(~.*coeff, name = "Views")
) +
ggtitle("September views against available jobs")
[See? No references to df inside the pipe.] This gives:
Another problem may be that your data is (probably) not tidy - because you have information in the names of your columns. ggplot (and the rest of the tidyverse) expects tidy data, so you are attempting to fight against the tidyverse's expectations. One way to make your data tidy and produce a similar plot might be:
df %>%
pivot_longer(
c(`Available Jobs`, Views),
names_to="Metric",
values_to="Value"
) %>%
ggplot() +
geom_line(aes(x=Date, y=Value, colour=Metric))
I've omitted some of the formatting of the plot to focus on the key issue: making your data tidy.
I've had to make guesses and assumptions in providing this answer, principally because you didn't provide a minimal reproducible example.
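If you do want the secondary axis back, coeff has to exist before the plot is built, and the second series has to be divided by it so that both lines share the primary axis range. A minimal sketch, assuming made-up data in which Views is roughly ten times larger than the job counts; the way coeff is chosen here (ratio of the two maxima) is just one plausible option, not something taken from the question:

```r
library(ggplot2)

# Made-up data for illustration only
set.seed(123)
df <- data.frame(
  Date = seq(as.Date("2022-09-01"), as.Date("2022-09-30"), by = "1 day"),
  Available.Jobs = runif(30, 100, 200),
  Views = runif(30, 2000, 3000)
)

# Assumed scaling factor: ratio of the two series' maxima
coeff <- max(df$Views) / max(df$Available.Jobs)

p <- ggplot(df, aes(x = Date)) +
  geom_line(aes(y = Available.Jobs), color = "blue") +
  # Views is divided by coeff so it fits on the primary axis
  geom_line(aes(y = Views / coeff), color = "red") +
  scale_y_continuous(
    name = "Available Jobs",
    # the secondary axis multiplies by coeff to undo the division
    sec.axis = sec_axis(~ . * coeff, name = "Views")
  ) +
  ggtitle("September views against available jobs")
```

Printing p draws the plot. The secondary axis is only a relabelling of the primary one, so the division in aes() and the multiplication in sec_axis() have to mirror each other.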

Related

Why does my line plot (ggplot2) look vertical?

I am new to coding in R, when I was using ggplot2 to make a line graph, I get vertical lines. This is my code:
all_trips_v2 %>%
group_by(Month_Name, member_casual) %>%
summarise(average_duration = mean(length_of_ride))%>%
ggplot(aes(x = Month_Name, y = average_duration)) + geom_line()
And I'm getting something like this:
This is a sample of my data:
(Not all the cells in the Month_Name is August, it's just sorted)
Any help will be greatly appreciated! Thank you.
I added a bit more code just for the sake of the example. The data I chose is probably not the best choice to display a proper time series.
I hope the features of ggplot I displayed will be beneficial for you in the future.
library(tidyverse)
library(lubridate)
mydat <- sample_frac(storms,.4)
# setting the month of interest as the current system's month
month_of_interest <- month(Sys.Date(),label = TRUE)
mydat %>% group_by(year,month) %>%
summarise(avg_pressure = mean(pressure)) %>%
mutate(month = month(month,label = TRUE),
current_month = month == month_of_interest) %>%
# the mutate code is just for my example.
ggplot(aes(x=year, y=avg_pressure,
color=current_month,
group=month,
size=current_month
))+geom_line(show.legend = FALSE)+
## From here its not really important,
## just ideas for your next plots
scale_color_manual(values=c("grey","red"))+
scale_size_manual(values = c(.4,1))+
ggtitle(paste("Average yearly pressure,\n
with special interest in",month_of_interest))+
theme_minimal()
## Most important is that you notice the group argument and also,
# in most cases you will want to color your different lines.
# I added a logical variable so only the month of interest
# (the current system month) will be colored, but that is not mandatory
You should add a grouping argument.
See further info here:
https://ggplot2.tidyverse.org/reference/aes_group_order.html
# Multiple groups with one aesthetic
p <- ggplot(nlme::Oxboys, aes(age, height))
# The default is not sufficient here. A single line tries to connect all
# the observations.
p + geom_line()
# To fix this, use the group aesthetic to map a different line for each
# subject.
p + geom_line(aes(group = Subject))
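Applied to the question, the same idea would look roughly like the sketch below. The trips data frame is a made-up stand-in for the summarised output (the real all_trips_v2 isn't available), and mapping member_casual to both group and colour is an assumption about what the asker wants to compare:

```r
library(ggplot2)

# Hypothetical stand-in for the result of group_by() + summarise()
months <- month.name[6:9]
trips <- expand.grid(
  Month_Name = factor(months, levels = months),
  member_casual = c("casual", "member")
)
set.seed(1)
trips$average_duration <- runif(nrow(trips), 10, 30)

# Without group, geom_line() joins both rider types within each month,
# which is what produces the vertical segments
p <- ggplot(trips, aes(x = Month_Name, y = average_duration,
                       group = member_casual, colour = member_casual)) +
  geom_line()
```

With a discrete x axis ggplot2 otherwise groups by the x value itself, so stating the group explicitly is what makes each rider type a single line across months.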

ggplot par new=TRUE option

I am trying to plot 400 ecdf graphs in one image using ggplot.
As far as I know ggplot does not support the par(new=T) option.
So the first solution I thought was use the grid.arrange function in gridExtra package.
However, the ecdfs I am generating are in a for loop format.
Below is my code, but you could ignore the steps for data processing.
i=1
for(i in 1:400)
{
test<-subset(df,code==temp[i,])
test<-test[c(order(test$Distance)),]
test$AI_ij<-normalize(test$AI_ij)
AI = test$AI_ij
ggplot(test, aes(AI)) +
stat_ecdf(geom = "step") +
scale_y_continuous(labels = scales::percent) +
theme_bw() +
new_theme +
xlab("Calculated Accessibility Value") +
ylab("Percent")
}
So I have values stored in "AI" in the for loop.
In this case how should I plot 400 graphs in the same chart?
This is not the way to put multiple lines on a ggplot. To do this, it is far easier to pass all of your data together and map code to the "group" aesthetic to give you one ecdf line for each code.
By far the hardest part of answering this question was attempting to reverse-engineer your data set. The following data set should be close enough in structure and naming to allow the code to be run on your own data.
library(dplyr)
library(BBmisc)
library(ggplot2)
set.seed(1)
all_codes <- apply(expand.grid(1:16, LETTERS), 1, paste0, collapse = "")
temp <- data.frame(sample(all_codes, 400), stringsAsFactors = FALSE)
df <- data.frame(code = rep(all_codes, 100),
Distance = sqrt(rnorm(41600)^2 + rnorm(41600)^2),
AI_ij = rnorm(41600),
stringsAsFactors = FALSE)
Since you only want the codes listed in temp to be shown on the plot, you can use dplyr::filter to keep only the rows where code %in% temp[[1]], rather than iterating through the whole thing one element at a time.
You can then group_by code, and arrange by Distance within each group before normalizing AI_ij, so there is no need to split your data frame into a new subset for every line: the data is processed all at once and the data frame is kept together.
Finally, you plot this using the group aesthetic. Note that because you have 400 lines on one plot, you need to make each line faint in order to see the overall pattern more clearly. We do this by setting the alpha value to 0.05 inside stat_ecdf.
Note also that there are multiple packages with a function called normalize and I don't know which one you are using. I have guessed you are using BBmisc.
So you can get rid of the loop and do:
df %>%
filter(code %in% temp[[1]]) %>%
group_by(code) %>%
arrange(Distance, .by_group = TRUE) %>%
mutate(AI = normalize(AI_ij)) %>%
ggplot(aes(AI, group = code)) +
stat_ecdf(geom = "step", alpha = 0.05) +
scale_y_continuous(labels = scales::percent) +
theme_bw() +
xlab("Calculated Accessibility Value") +
ylab("Percent")
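If you'd rather not depend on a particular normalize, here is a small sketch of the same grouped pattern using only ggplot2 and base R. It assumes "normalize" means centring and scaling within each code, which may not match what your function does, and the three-code data set is invented purely for illustration:

```r
library(ggplot2)

set.seed(1)
# Invented miniature data: 3 codes, 50 values each
small <- data.frame(
  code = rep(c("1A", "2B", "3C"), each = 50),
  AI_ij = rnorm(150, mean = rep(c(0, 5, 10), each = 50))
)

# Assumption: normalizing = centre and scale within each code
small$AI <- ave(small$AI_ij, small$code,
                FUN = function(v) (v - mean(v)) / sd(v))

# One ecdf step line per code, all on the same panel
p <- ggplot(small, aes(AI, group = code)) +
  stat_ecdf(geom = "step", alpha = 0.5) +
  scale_y_continuous(labels = scales::percent) +
  theme_bw()
```

ave() applies the function within each code and returns the values in the original row order, so no splitting or looping is needed.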

Adding multiple lines in ggplot2 with dates

Struggled with this question for a while and not sure what's going on. I know this question will definitely demonstrate my deficiency with ggplot.
I have a script like this that functions nicely:
beta.bray= c(0.681963714,0.73301985,0.6797153,0.79358052,0.85055556,0.76297686,0.60653007)
beta.bray.gradient=c(0.182243513, 0.565267411,0.427449441,0.655012391,0.357146893,0.286457524,0.338706138)
Date=c("07/18/14","07/26/14","08/19/14","08/25/14", "07/25/15","08/22/15", "07/26/16")
dat=data.frame(Date, beta.bray, beta.bray.gradient)
test<-ggplot(dat, aes(x=reorder(Date, x=fct_inorder(Date)), y=beta.bray, group=1))+geom_line(linetype="dashed")+geom_point()+
labs(x="Date", y="β, multiple-site dissimilarity", title="SNARL riffle site/site β through time, 2014-2016") +coord_cartesian(xlim=c(1,7),ylim=c(.58,.85))
test
But when I want to add another line for beta.bray.gradient, I can't get anything to work. I think it has something to do with the way I used aes() in the above code, but I didn't know how else to do it, in order to use reorder() and fct_inorder() to make sure the dates are plotted in the right way. Here's an example of a way I tried adding the second line:
test<-ggplot(dat, aes(x=reorder(Date, x=fct_inorder(Date)), y=beta.bray, group=1))+geom_line(linetype="dashed")+geom_point()+
geom_line(dat, aes(y=beta.bray.gradient, linetype="c"))+
labs(x="Date", y="β, multiple-site dissimilarity", title="SNARL riffle site/site β through time, 2014-2016") +coord_cartesian(xlim=c(1,7),ylim=c(.58,.85))
In these situations we see a multitude of errors, in this case Error: ggplot2 doesn't know how to deal with data of class uneval
I would think it best to use actual date objects for the x axis and reshape your data into a long format:
library(dplyr)
library(tidyr)
library(ggplot2)
beta.bray <- c(0.681963714,0.73301985,0.6797153,0.79358052,0.85055556,0.76297686,0.60653007)
beta.bray.gradient <- c(0.182243513, 0.565267411,0.427449441,0.655012391,0.357146893,0.286457524,0.338706138)
Date <- as.Date(c("07/18/14","07/26/14","08/19/14","08/25/14", "07/25/15","08/22/15", "07/26/16"),"%m/%d/%y")
dat <- data.frame(Date, beta.bray, beta.bray.gradient) %>%
gather(key = "grp",value = "val",beta.bray,beta.bray.gradient)
ggplot(dat, aes(x = Date, y = val, group = grp,color = grp)) +
geom_line(linetype="dashed") +
geom_point() +
labs(x="Date", y="β, multiple-site dissimilarity",
title="SNARL riffle site/site β through time, 2014-2016") +
coord_cartesian(xlim=Date[c(1,7)])
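As for the error itself: the first positional argument of geom_line() is mapping and the second is data, so geom_line(dat, aes(...)) hands the data frame over as the mapping and the aes() object as the data, hence the complaint about class uneval. If you want to keep the wide data, naming the arguments avoids this. A sketch along those lines, where the dotted linetype for the second series is my own choice:

```r
library(ggplot2)

dat <- data.frame(
  Date = as.Date(c("07/18/14", "07/26/14", "08/19/14", "08/25/14",
                   "07/25/15", "08/22/15", "07/26/16"), "%m/%d/%y"),
  beta.bray = c(0.681963714, 0.73301985, 0.6797153, 0.79358052,
                0.85055556, 0.76297686, 0.60653007),
  beta.bray.gradient = c(0.182243513, 0.565267411, 0.427449441, 0.655012391,
                         0.357146893, 0.286457524, 0.338706138)
)

p <- ggplot(dat, aes(x = Date, y = beta.bray)) +
  geom_line(linetype = "dashed") +
  geom_point() +
  # data and mapping are named, so neither lands in the wrong slot
  geom_line(data = dat, mapping = aes(y = beta.bray.gradient),
            linetype = "dotted")
```

Because Date is a real Date column here, the x axis is ordered chronologically and no reorder() or fct_inorder() is needed. The data argument in the second geom_line() could also be dropped, since the layer inherits dat from ggplot().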

(Re)name factor levels (or include variable name) in ggplot2 facet_ call

One pattern I do a lot is to facet plots on cuts of numeric values. facet_wrap in ggplot2 doesn't allow you to call a function from within, so you have to create a temporary factor variable. This is okay using mutate from dplyr. The advantage of this is that you can play around doing EDA and varying the number of quantiles, or changing to set cut points etc. and view the changes in one line. The downside is that the facets are only labelled by the factor level; you have to know, for example, that it's a temperature. This isn't too bad for yourself, but even I get confused if I'm doing a facet_grid on two such variables and have to remember which is which. So, it's really nice to be able to relabel the facets by including a meaningful name.
The key point of this problem is that the levels will change as you change the number of quantiles etc.; you don't know what they are in advance. You could use the base levels() function, but that means augmenting the data frame with the cut variable, then calling levels(), then passing this augmented data frame to ggplot().
So, using plyr::mapvalues, we can wrap all this into a dplyr::mutate, but the required arguments for mapvalues() makes it quite clunky. Having to retype "Temp.f" many times is not very "dplyr"!
Is there a neater way of renaming such factor levels "on the fly"? I hope this description is clear enough and the code example below helps.
library(ggplot2)
library(plyr)
library(dplyr)
library(Hmisc)
df <- data.frame(Temp = seq(-100, 100, length.out = 1000), y = rnorm(1000))
# facet_wrap doesn't allow functions so have to create new, temporary factor
# variable Temp.f
ggplot(df %>% mutate(Temp.f = cut2(Temp, g = 4))) + geom_histogram(aes(x = y)) + facet_wrap(~Temp.f)
# fine, but facet headers aren't very clear,
# we want to highlight that they are temperature
ggplot(df %>% mutate(Temp.f = paste0("Temp: ", cut2(Temp, g = 4)))) + geom_histogram(aes(x = y)) + facet_wrap(~Temp.f)
# use of paste0 is undesirable because it creates a character vector and
# facet_wrap then recodes the levels in the wrong numerical order
# This has the desired effect, but is very long!
ggplot(df %>% mutate(Temp.f = cut2(Temp, g = 4), Temp.f = mapvalues(Temp.f, levels(Temp.f), paste0("Temp: ", levels(Temp.f))))) + geom_histogram(aes(x = y)) + facet_wrap(~Temp.f)
I think you can do this from within facet_wrap using a custom labeller function, like so:
myLabeller <- function(x){
lapply(x,function(y){
paste("Temp:", y)
})
}
ggplot(df %>% mutate(Temp.f = cut2(Temp, g = 4))) +
geom_histogram(aes(x = y)) +
facet_wrap(~Temp.f
, labeller = myLabeller)
That labeller is clunky, but it is at least an example. You could write one for each variable that you are going to use (e.g. tempLabeller, yLabeller, etc).
A slight tweak makes this even better: it automatically uses the name of the thing you are facetting on:
betterLabeller <- function(x){
lapply(names(x),function(y){
paste0(y,": ", x[[y]])
})
}
ggplot(df %>% mutate(Temp.f = cut2(Temp, g = 4))) +
geom_histogram(aes(x = y)) +
facet_wrap(~Temp.f
, labeller = betterLabeller)
Okay, with thanks to Mark Peterson for pointing me towards the labeller argument/function, the exact answer I'm happy with is:
ggplot(df %>% mutate(Temp.f = cut2(Temp, g = 4))) + geom_histogram(aes(x = y)) + facet_wrap(~Temp.f, labeller = labeller(Temp.f = label_both))
I'm a fan of lazy and "label_both" means I can simply create a meaningful temporary (or overwrite the original) variable column and both the name and the value are given. Rolling your own labeller function is more powerful, but using label_both is a good, easy option.
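For anyone without Hmisc installed, the same label_both idea can be sketched with base cut(). Cutting at the quartiles is my stand-in for cut2(Temp, g = 4), and the interval labels it produces will not be formatted exactly like cut2's:

```r
library(ggplot2)

set.seed(42)
df <- data.frame(Temp = seq(-100, 100, length.out = 1000), y = rnorm(1000))

# Base-R stand-in for Hmisc::cut2(Temp, g = 4): cut at the quartiles
df$Temp.f <- cut(df$Temp,
                 breaks = quantile(df$Temp, probs = 0:4 / 4),
                 include.lowest = TRUE)

# label_both prefixes every strip with the variable name
p <- ggplot(df) +
  geom_histogram(aes(x = y), bins = 30) +
  facet_wrap(~Temp.f, labeller = label_both)
```

Since label_both uses the column name as the prefix, naming the temporary column something readable (Temp rather than Temp.f, say) is what makes the strips meaningful.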

How to create histogram in R with CSV time data?

I have CSV data of a log for 24 hours that looks like this:
svr01,07:17:14,'u1#user.de','8.3.1.35'
svr03,07:17:21,'u2#sr.de','82.15.1.35'
svr02,07:17:30,'u3#fr.de','2.15.1.35'
svr04,07:17:40,'u2#for.de','2.1.1.35'
I read the data with tbl <- read.csv("logs.csv")
How can I plot this data in a histogram to see the number of hits per hour?
Ideally, I would get 4 bars representing hits per hour per svr01, svr02, svr03, svr04.
Thank you for helping me here!
I don't know if I understood you right, so I will split my answer in two parts. The first part is how to convert your time into a vector you can use for plotting.
a) Converting your data into hours:
#df being the dataframe
df$timestamp <- strptime(df$timestamp, format="%H:%M:%S")
df$hours <- as.numeric(format(df$timestamp, format="%H"))
hist(df$hours)
This gives you a histogram of hits over all servers. If you want to split the histograms this is one way but of course there are numerous others:
b) Making a histogram with ggplot2
#install.packages("ggplot2")
require(ggplot2)
ggplot(data=df) + geom_histogram(aes(x=hours), binwidth=1) + facet_wrap(~ server)
# or use a color instead
ggplot(data=df) + geom_histogram(aes(x=hours, fill=server), binwidth=1)
c) You could also use another package:
require(plotrix)
l <- split(df$hours, f=df$server)
multhist(l)
Of the three, the multhist version makes comparison easier, but I think ggplot2 simply looks better.
An example dataset:
dat = data.frame(server = paste("svr", round(runif(1000, 1, 10)), sep = ""),
time = Sys.time() + sort(round(runif(1000, 1, 36000))))
The trick I use is to create a new variable which only specifies in which hour the hit was recorded:
dat$hr = strftime(dat$time, "%H")
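Now we can use some plyr magic (count is called as plyr::count here so it doesn't clash with dplyr::count if dplyr is also loaded):
hits_hour = plyr::count(dat, vars = c("server","hr"))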
And create the plot:
ggplot(data = hits_hour) + geom_bar(aes(x = hr, y = freq, fill = server), stat="identity", position = "dodge")
Which looks like:
I don't really like this plot, I'd be more in favor of:
ggplot(data = hits_hour) + geom_line(aes(x = as.numeric(hr), y = freq)) + facet_wrap(~ server, nrow = 1)
Which looks like:
Putting all the facets in one row allows easy comparison of the number of hits between the servers. This will look even better when using real data instead of my random data.
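Both answers assume a data frame that already has named server and time columns, but the log in the question has no header row and uses single quotes around some fields. A rough base-R sketch of getting from those four lines to hits per server per hour; the column names server, timestamp, user and ip are my own labels, not something in the file:

```r
# The four log lines from the question, inlined so the example is self-contained
log_text <- "svr01,07:17:14,'u1#user.de','8.3.1.35'
svr03,07:17:21,'u2#sr.de','82.15.1.35'
svr02,07:17:30,'u3#fr.de','2.15.1.35'
svr04,07:17:40,'u2#for.de','2.1.1.35'"

# header = FALSE because the file has no header row;
# quote = "'" because the fields are wrapped in single quotes
df <- read.csv(text = log_text, header = FALSE, quote = "'",
               col.names = c("server", "timestamp", "user", "ip"),
               stringsAsFactors = FALSE)

# Extract the hour of each hit as a number
df$hours <- as.numeric(format(strptime(df$timestamp, format = "%H:%M:%S"), "%H"))

# Count hits for every server/hour combination
hits <- as.data.frame(table(server = df$server, hours = df$hours))
```

With the real file you would replace text = log_text with the file name. The hits data frame has one row per server and hour with the count in Freq, which can be passed to geom_bar(stat = "identity") in the same way as hits_hour above.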
