Improper facet_wrap output - r

Below is my sample data.
Though there are values for mobile and tablet across all 4 dates, when I try to facet_wrap across device category, my results are not what is expected. All the values corresponding to each date are being added to the desktop only and are not being distributed across the 3 categories.
The code that I used is
qplot(data=gaData, x=gaData$Date, y=gaData$Users, xlim = c(20170101,20170101))+
facet_wrap(~gaData$Device.Category, ncol = 1)
The output that I'm seeing in the plot is
I'm new to the whole data visualization area. I'm unable to identify what is wrong with the code.
P.S. I'm able to plot mobile and tablet individually for the same dates successfully as individual plots.

x <- data.frame(Date = c('2017-01-01','2017-01-01','2017-01-01','2017-01-02','2017-01-02','2017-01-02','2017-01-03','2017-01-03','2017-01-03',
'2017-01-04','2017-01-04','2017-01-04'), Device = c("desktop","mobile","tablet","desktop","mobile","tablet",
"desktop","mobile","tablet","desktop","mobile","tablet"),
Users = c(404,223,39,529,211,43,1195,285,29,1019,275,35))
x$Date <- as.POSIXct(x$Date, tz = "UTC")
library(ggplot2)
ggplot(x, aes(Date, Users)) + geom_line() + facet_wrap(~Device)
Is this what you wanted?

Hope this helps.
library(ggplot2)
# Simulate some dummy data
dat <- data.frame(
Date = rep(20170101:20170104, each = 3),
Device = rep(c('D', 'M', 'T'), 4),
Users = round(runif(n = 12, max = 1000, min = 10))
)
# This is the 'base map', map variables onto aesthetics
ggplot(aes(x = Date, y = Users, col = Device), data = dat) +
# What kind of geometry?
geom_line() +
geom_point() +
# From 1d panel to 2d
facet_wrap(~Device, ncol = 1)
Plot Result
You may also consider converting the Date variable to class Date.
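A minimal sketch of that conversion, assuming the dates are stored as integers such as 20170101, as in the dummy data above. The "%Y%m%d" format string is an assumption that matches that layout; adjust it if your dates look different.

```r
# Dummy data with dates stored as integers, as in the example above
dat <- data.frame(
  Date = rep(20170101:20170104, each = 3),
  Device = rep(c("D", "M", "T"), 4)
)

# Convert the integers to class Date by going through character strings
dat$Date <- as.Date(as.character(dat$Date), format = "%Y%m%d")

class(dat$Date)
```

With a real Date column, ggplot2 picks a date scale for the x axis instead of treating the values as plain numbers.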
The following references may help you gain some understanding of ggplot2.
http://tutorials.iq.harvard.edu/R/Rgraphics/Rgraphics.html
http://r-statistics.co/Complete-Ggplot2-Tutorial-Part1-With-R-Code.html
http://ggplot2.tidyverse.org/reference/
Also, DataCamp provides wonderful online tutorials.
Welcome to the amazing world of R. Cheers.

First I suggest you convert 'Date' to a proper date class. I use the lubridate package (ymd() returns a Date object):
library(lubridate)
library(ggplot2)
Date=ymd(c(rep("2017-01-01",3),rep("2017-01-02",3),rep("2017-01-03",3),rep("2017-01-04",3)))
Then we can build the rest of the data frame:
Country=rep("United States",12)
Device.Category=rep(c("Desktop","Mobile","Tablet"),4)
Users=c(404,223,39,529,211,43,1195,285,29,1019,275,35)
df=data.frame(Date,Country,Device.Category,Users)
If you want to plot only for "2017-01-01" use this
ggplot(df,aes(x=Date,y=Users))+geom_point()+facet_grid(Device.Category~.)+xlim(ymd("2017-01-01"),ymd("2017-01-01"))
Or if you want all dates just remove the xlim function
ggplot(df,aes(x=Date,y=Users))+geom_point()+facet_grid(Device.Category~.)

Related

How do I tell R the dates between which I want to draw the segment?

I am making a step plot of inventory levels and I want to include horizontal segments at level 5 on the Y axis. I am importing my table from Excel. The first of the two columns has dates in the format "2020-12-04" and R reads it correctly.
I plot the data using geom_step and it works perfectly, the X axis becomes the dates and the Y axis is the inventory level.
The problem is when I try to add a segment between two dates using geom_segment(aes(x=2020-12-04, y=5, xend=2020-12-12, yend=5 ))
it shows me:
Error: Invalid input: time_trans works with objects of class POSIXct only
How can I fix this? Here is the code I am using
Datos<-read_excel(ruta_excel, sheet="CDSYG", range="J4:K45")
p<-ggplot(data=Datos, aes(x=Dia, y=Nivel))
p+geom_step()+geom_segment(aes(x=2020-12-04, y=5, xend=2020-12-12, yend=5 ))
Convert the dates where you want to draw the segment to a datetime, as in the code below.
BTW: To make it easier for us to help, please share a snippet of your data or use some fake data as I did. See how to make a minimal reproducible example. To post data, type dput(NAME_OF_DATASET) into the console and copy and paste the output starting with structure(... into your post. If your dataset has a lot of observations, you could use dput(head(NAME_OF_DATASET, 20)) for the first twenty rows of data.
library(ggplot2)
ggplot(data = Datos, aes(x = Dia, y = Nivel)) +
geom_step() +
geom_segment(aes(x = as.POSIXct("2020-12-04"), y = 5, xend = as.POSIXct("2020-12-12"), yend = 5), color = "red")
DATA
set.seed(42)
Datos <- data.frame(
Dia = seq.Date(as.Date("2020-12-01"), as.Date("2020-12-31"), by = "day"),
Nivel = sample(1:10, 31, replace = TRUE)
)
Datos$Dia <- as.POSIXct(Datos$Dia)
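As a possible variation, here is a sketch that uses annotate() instead of geom_segment(aes(...)). With constants inside aes(), the segment is recycled for every row of the data, so it is drawn many times on top of itself; annotate() draws it once. The fake data mirrors the data above, and tz = "UTC" is an assumption chosen to match how as.POSIXct() converts a Date column.

```r
library(ggplot2)

# Fake data, as in the answer above: a POSIXct date column and a level
set.seed(42)
Datos <- data.frame(
  Dia = as.POSIXct(seq.Date(as.Date("2020-12-01"), as.Date("2020-12-31"), by = "day")),
  Nivel = sample(1:10, 31, replace = TRUE)
)

# annotate() adds a single segment that is not tied to the rows of Datos
p <- ggplot(Datos, aes(x = Dia, y = Nivel)) +
  geom_step() +
  annotate("segment",
           x = as.POSIXct("2020-12-04", tz = "UTC"),
           xend = as.POSIXct("2020-12-12", tz = "UTC"),
           y = 5, yend = 5, colour = "red")
p
```

If the x column were of class Date rather than POSIXct, the same idea should work with as.Date("2020-12-04") for x and xend.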

I found an unexpected line in my scatterplot, how can I extract all the data near the line for further analysis?

My data is about filesize and the time cost dealing with the file.
When I drew the point plot I got the below result:
ggplot(data,aes(filesize,time))+geom_point()
You can see there are 2 lines in the plot.
How can I extract all the data near the line for further analysis?
Any advice for what to learn? Thank you in advance.
A good next step would be to identify those ratios that seem more common, to make it easier to isolate those observations.
library(dplyr)
library(ggplot2)
data %>%
mutate(time_per_size = time / filesize) %>%
ggplot(aes(time_per_size)) +
geom_histogram(bins = 50) # 30 bins is default, fiddle to see what value captures the predominant ratios most cleanly
Using @PavoDive's sample data, for instance, we can look at the ratios using this process, and use plotly to look at the spikes interactively, seeing that they are around 1.5 and 3.
library(ggplot2); library(dplyr)
dt %>%
mutate(time_per_size = y/x) %>%
filter(time_per_size < 10) %>%
ggplot(aes(time_per_size)) +
geom_histogram(bins = 300)
plotly::ggplotly(.Last.value)
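Once the dominant ratios are known, the rows near each line can be pulled out with a simple filter. The following is a rough sketch in base R on simulated data built like the sample data; the ratios 1.5 and 3 come from that sample, while the tolerance tol and the name near_line are assumptions for illustration that you would tune for real data.

```r
# Simulated data: two hidden linear relationships plus unrelated noise
set.seed(1)
n <- 3000
d <- data.frame(x = rbeta(n, shape1 = 1.8, shape2 = 10),
                type = sample(LETTERS[1:5], n, replace = TRUE))
d$y <- abs(rnorm(n, 0.5, 0.25))
d$y[d$type == "A"] <- 3 * d$x[d$type == "A"]
d$y[d$type == "C"] <- 1.5 * d$x[d$type == "C"]

# Ratio of y to x; points on a line through the origin share one ratio
d$ratio <- d$y / d$x

# Keep the rows whose ratio lies within a small tolerance of either slope
tol <- 0.01
near_line <- d[abs(d$ratio - 1.5) < tol | abs(d$ratio - 3) < tol, ]

# Which groups do the extracted rows belong to?
table(near_line$type)
```

A few noise points may fall inside the tolerance band by chance, so the extracted set is a starting point for inspection rather than an exact classification.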
I agree with @heds1 that there's probably some underlying relationship between your outcome and [at least] a third variable, whether it's known to you or not.
See the following example with dummy data:
library(data.table)
library(ggplot2)
# try to mimic your data in the x axis. Include some random types
set.seed(1)
dt <- data.table(x = rbeta(3000, shape1 = 1.8, shape2 = 10), type = sample(LETTERS[1:5], 3000, TRUE))
# introduce a couple lines:
dt[type == "A", y := 3*x]
dt[type == "C", y := 1.5*x]
# and add some "white noise":
dt[!type %chin% c("A", "C"), y := abs(rnorm(.N, .5, .25))]
# see what you have:
plot(dt$x, dt$y)
# now see the light:
ggplot(dt, aes(x, y, colour = type))+geom_point()

Create barplot to represent time series in ggplot2

I have a basic dataframe with 3 columns: (i) a date (when a sample was taken); (ii) a site location and (iii) a binary variable indicating what the condition was when sampling (e.g. wet versus dry).
Some reproducible data:
df <- data.frame(Date = rep(seq(as.Date("2010-01-01"), as.Date("2010-12-01"), by="months"),times=2))
df$Site <- c(rep("Site.A",times = 12),rep("Site.B",times = 12))
df$Condition<- as.factor(c(0,0,0,0,1,1,1,1,0,0,0,0,
0,0,0,0,0,1,1,0,0,0,0,0))
What I would like to do is use ggplot to create a bar chart indicating the condition of each site (y axis) over time (x axis) - the condition indicated by a different colour. I am guessing some kind of flipped barplot would be the way to do this, but I cannot figure out how to tell ggplot2 to recognise the values chronologically, rather than summed for each condition. This is my attempt so far which clearly doesn't do what I need it to.
ggplot(df) +
geom_bar(aes(x=Site,y=Date,fill=Condition),stat='identity')+coord_flip()
So I have 2 questions. Firstly, how do I tell ggplot to recognise changes in condition over time and not just group each condition in a traditional stacked bar chart?
Secondly, it seems ggplot converts the date to a numerical value, how would I reformat the x-axis to show a time period, e.g. in a month-year format? I have tried doing this via the scale_x_date function, but get an error message.
labDates <- seq(from = (head(df$Date, 1)),
to = (tail(df$Date, 1)), by = "1 months")
Datelabels <-format(labDates,"%b %y")
ggplot(df) +
geom_bar(aes(x=Site,y=Date,fill=Condition),stat='identity')+coord_flip()+
scale_x_date(labels = Datelabels, breaks=labDates)
I have also tried converting sampling times to factors and displaying these instead. Below I have done this by changing each sampling period to a letter (in my own code, the factor levels are in a month-year format - I put letters here for simplicity). But I cannot format the axis to place each level of the factor as a tick mark. Either a date or factor solution for this second question would be great!
df$Factor <- as.factor(unique(df$Date))
levels(df$Factor) <- list(A = "2010-01-01", B = "2010-02-01",
C = "2010-03-01", D = "2010-04-01", E = "2010-05-01",
`F` = "2010-06-01", G = "2010-07-01", H = "2010-08-01",
I = "2010-09-01", J = "2010-10-01", K= "2010-11-01", L = "2010-12-01")
ggplot(df) +
geom_bar(aes(x=Site,y=Date,fill=Condition),stat='identity')+coord_flip()+
scale_y_discrete(breaks=as.numeric(unique(df$Date)),
labels=levels(df$Factor))
Thank you in advance!
It doesn't really make sense to use geom_bar() here, since you do not want to summarise the data and need the visualisation over time.
I would rather use geom_line() and increase the line thickness if you want it to look like a bar chart.
library(tidyr)
library(dplyr)
library(ggplot2)
library(scales)
library(lubridate)
df <- data.frame(Date = rep(seq.Date(as.Date("2010-01-01"), as.Date("2010-12-01"), by="months"),times=2))
df$Site <- c(rep("Site.A",times = 12),rep("Site.B",times = 12))
df$Condition<- as.factor(c(0,0,0,0,1,1,1,1,0,0,0,0,
0,0,0,0,0,1,1,0,0,0,0,0))
df$Date <- ymd(df$Date)
ggplot(df) +
geom_line(aes(y=Site,x=Date,color=Condition),size=10)+
scale_x_date(labels = date_format("%b-%y"))
Note using coord_flip() also does not work, I think this causes the Date issue, see below threads:
how to use coord_carteisan and coord_flip together in ggplot2
In ggplot2, coord_flip and free scales don't work together
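Another option, not used in the answer above, is geom_tile(), which draws one filled rectangle per sample. This is only a sketch on the same reproducible data; the tile height of 0.8 and the "%b %y" label format are assumptions chosen for illustration.

```r
library(ggplot2)

# Same reproducible data as in the question
df <- data.frame(Date = rep(seq(as.Date("2010-01-01"), as.Date("2010-12-01"), by = "month"), times = 2))
df$Site <- rep(c("Site.A", "Site.B"), each = 12)
df$Condition <- factor(c(0,0,0,0,1,1,1,1,0,0,0,0,
                         0,0,0,0,0,1,1,0,0,0,0,0))

# One tile per sampling date and site, coloured by condition
p <- ggplot(df, aes(x = Date, y = Site, fill = Condition)) +
  geom_tile(height = 0.8) +
  scale_x_date(date_breaks = "1 month", date_labels = "%b %y")
p
```

Because Date stays on the x axis, no coord_flip() is needed and scale_x_date() can format the labels directly.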

Dynamically Set X limits on time plot

I am wondering how to dynamically set the x axis limits of a time series plot containing two time series with different dates. I have developed the following code to provide a reproducible example of my problem.
#Dummy Data
Data1 <- data.frame(Date = c("4/24/1995","6/23/1995","2/12/1996","4/14/1997","9/13/1998"), Area_2D = c(20,11,5,25,50))
Data2 <- data.frame(Date = c("6/23/1995","4/14/1996","11/3/1997","11/6/1997","4/15/1998"), Area_2D = c(13,15,18,25,19))
Data3 <- data.frame(Date = c("4/24/1995","6/23/1995","2/12/1996","4/14/1996","9/13/1998"), Area_2D = c(20,25,28,30,35))
Data4 <- data.frame(Date = c("6/23/1995","4/14/1996","11/3/1997","11/6/1997","4/15/1998"), Area_2D = c(13,15,18,25,19))
#Convert date column as date
Data1$Date <- as.Date(Data1$Date,"%m/%d/%Y")
Data2$Date <- as.Date(Data2$Date,"%m/%d/%Y")
Data3$Date <- as.Date(Data3$Date,"%m/%d/%Y")
Data4$Date <- as.Date(Data4$Date,"%m/%d/%Y")
#PLOT THE DATA
max_y1 <- max(Data1$Area_2D)
# Define colors to be used for cars, trucks, suvs
plot_colors <- c("blue","red")
plot(Data1$Date,Data1$Area_2D, col=plot_colors[1],
ylim=c(0,max_y1), xlim=c(min_x1,max_x1),pch=16, xlab="Date",ylab="Area", type="o")
par(new=T)
plot(Data2$Date,Data2$Area_2D, col=plot_colors[2],
ylim=c(0,max_y1), xlim=c(min_x1,max_x1),pch=16, xlab="Date",ylab="Area", type="o")
The main problem I see with the code above is there are two different x axis on the plot, one for Data1 and another for Data2. I want to have a single x axis spanning the date range determined by the dates in Data1 and Data2.
My question is:
How do I dynamically create an x axis for both series? (i.e. select the minimum and maximum date from the data frames 'Data1' and 'Data2')
The solution is to combine the data into one data.frame, and base the x-axis on that. This approach works very well with the ggplot2 plotting package. First we merge the data and add an ID column, which specifies to which dataset it belongs. I use letters here:
Data1$ID = 'A'
Data2$ID = 'B'
merged_data = rbind(Data1, Data2)
And then create the plot using ggplot2, where the color denotes which dataset it belongs to (can easily be changed to different colors):
library(ggplot2)
ggplot(merged_data, aes(x = Date, y = Area_2D, color = ID)) +
geom_point() + geom_line()
Note that you get one uniform x-axis here. In this case this is fine, but if the timeseries do not overlap, this might be problematic. In that case we can use multiple sub-plots, known as facets in ggplot2:
ggplot(merged_data, aes(x = Date, y = Area_2D)) +
geom_point() + geom_line() + facet_wrap(~ ID, scales = 'free_x')
Now each facet has its own x-axis, i.e. one for each sub-dataset. Which approach is most appropriate depends on the specific situation.
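If you would rather stay with base graphics, the shared limits can be computed with range() over the dates of both data frames. This is a minimal sketch using the dummy data from the question; the names x_range and y_range are introduced here for illustration.

```r
# Dummy data from the question, with the dates parsed on creation
Data1 <- data.frame(
  Date = as.Date(c("4/24/1995", "6/23/1995", "2/12/1996", "4/14/1997", "9/13/1998"),
                 format = "%m/%d/%Y"),
  Area_2D = c(20, 11, 5, 25, 50)
)
Data2 <- data.frame(
  Date = as.Date(c("6/23/1995", "4/14/1996", "11/3/1997", "11/6/1997", "4/15/1998"),
                 format = "%m/%d/%Y"),
  Area_2D = c(13, 15, 18, 25, 19)
)

# Limits that span both series
x_range <- range(c(Data1$Date, Data2$Date))
y_range <- c(0, max(Data1$Area_2D, Data2$Area_2D))

# Draw the first series, then add the second to the same axes
plot(Data1$Date, Data1$Area_2D, type = "o", pch = 16, col = "blue",
     xlim = x_range, ylim = y_range, xlab = "Date", ylab = "Area")
lines(Data2$Date, Data2$Area_2D, type = "o", pch = 16, col = "red")
```

Using lines() for the second series avoids par(new = TRUE), so only one set of axes is ever drawn.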

How to create histogram in R with CSV time data?

I have CSV data of a log for 24 hours that looks like this:
svr01,07:17:14,'u1#user.de','8.3.1.35'
svr03,07:17:21,'u2#sr.de','82.15.1.35'
svr02,07:17:30,'u3#fr.de','2.15.1.35'
svr04,07:17:40,'u2#for.de','2.1.1.35'
I read the data with tbl <- read.csv("logs.csv")
How can I plot this data in a histogram to see the number of hits per hour?
Ideally, I would get 4 bars representing hits per hour per svr01, svr02, svr03, svr04.
Thank you for helping me here!
I don't know if I understood you right, so I will split my answer in two parts. The first part is how to convert your time into a vector you can use for plotting.
a) Converting your data into hours:
#df being the dataframe
df$timestamp <- strptime(df$timestamp, format="%H:%M:%S")
df$hours <- as.numeric(format(df$timestamp, format="%H"))
hist(df$hours)
This gives you a histogram of hits over all servers. If you want to split the histograms this is one way but of course there are numerous others:
b) Making a histogram with ggplot2
#install.packages("ggplot2")
library(ggplot2)
ggplot(data=df) + geom_histogram(aes(x=hours), binwidth=1) + facet_wrap(~ server)
# or use a color instead
ggplot(data=df) + geom_histogram(aes(x=hours, fill=server), binwidth=1)
c) You could also use another package:
require(plotrix)
l <- split(df$hours, f=df$server)
multhist(l)
The examples are given below. The third makes comparison easier but ggplot2 simply looks better I think.
EDIT
Here is what these solutions look like:
first solution:
second solution:
third solution:
An example dataset:
dat = data.frame(server = paste("svr", round(runif(1000, 1, 10)), sep = ""),
time = Sys.time() + sort(round(runif(1000, 1, 36000))))
The trick I use is to create a new variable which only specifies in which hour the hit was recorded:
dat$hr = strftime(dat$time, "%H")
Now we can use some plyr magic:
library(plyr)
hits_hour = count(dat, vars = c("server","hr"))
And create the plot:
ggplot(data = hits_hour) + geom_bar(aes(x = hr, y = freq, fill = server), stat="identity", position = "dodge")
Which looks like:
I don't really like this plot, I'd be more in favor of:
ggplot(data = hits_hour) + geom_line(aes(x = as.numeric(hr), y = freq)) + facet_wrap(~ server, nrow = 1)
Which looks like:
Putting all the facets in one row allows easy comparison of the number of hits between the servers. This will look even better when using real data instead of my random data.
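Both answers assume the log is already in a data frame with named columns. Since the sample file has no header row, a sketch of that first step might look like the following. The column names server, timestamp, user and ip are assumptions (the file itself does not name them), and the text is read inline here only to keep the example self-contained.

```r
# The four sample lines from the question, read inline instead of from logs.csv
log_text <- "svr01,07:17:14,'u1#user.de','8.3.1.35'
svr03,07:17:21,'u2#sr.de','82.15.1.35'
svr02,07:17:30,'u3#fr.de','2.15.1.35'
svr04,07:17:40,'u2#for.de','2.1.1.35'"

# header = FALSE keeps the first line as data; quote = "'" strips the single quotes
df <- read.csv(text = log_text, header = FALSE, quote = "'",
               col.names = c("server", "timestamp", "user", "ip"),
               stringsAsFactors = FALSE)

# Extract the hour as a number from the HH:MM:SS string
df$hours <- as.numeric(substr(df$timestamp, 1, 2))

# Hits per server and hour
hits <- as.data.frame(table(server = df$server, hour = df$hours))
hits
```

For a real file, replacing text = log_text with the file name should give the same structure.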

Resources