I have irregularly measured observations of a phenomenon with a timestamp each:
2013-01-03 00:04:23
2013-01-03 00:02:04
2013-01-02 23:45:16
2013-01-02 23:35:16
2013-01-02 23:31:56
2013-01-02 23:31:30
2013-01-02 23:29:18
2013-01-02 23:28:43
...
Now I would like to plot these points on the x axis and apply a kernel density function to them, so I can visually explore temporal density using various bandwidths. The result should look something like the example below, except that the example has no x-axis labels; I would like labels for particular days (January 1st, January 5th, etc.):
It is important, however, that the measurement points themselves are visible in the plot, like above.
#dput
df <- structure(list(V1 = structure(c(2L, 2L, 1L, 3L, 1L, 4L, 5L, 4L), .Label = c("2013-01-02", "2013-01-03", "2013-01-04", "2013-01-05", "2013-01-11"), class = "factor"), V2 = structure(c(1L, 3L, 8L, 4L, 7L, 6L, 5L, 2L), .Label = c(" 04:04:23", " 06:28:43", " 10:02:04", " 11:35:16", " 14:29:18", " 17:31:30", " 23:31:56", " 23:45:16"), class = "factor")), .Names = c("V1", "V2"), class = "data.frame", row.names = c(NA, -8L))
I would use ggplot2, since it gives fine-grained control over the plot. Use separate layers for the measurements and for the density itself.
# the dput above names the columns V1 (date) and V2 (time, with a leading space)
df$tcol <- as.POSIXct(paste(df$V1, trimws(df$V2)), format = "%Y-%m-%d %H:%M:%S")
library(ggplot2)
measurements <- geom_point(aes(x=tcol, y=0), shape=15, color='blue', size=5)
kde <- geom_density(aes(x=tcol), bw="nrd0")
ggplot(df) + measurements + kde
Leads to
Now, if you want to further adjust the x-axis labels (since you want each separate day marked), you can use the scales package.
We are going to use scale_x_date, but that only accepts values of class Date, so the column has to be converted first:
library(scales)
df$tcol <- as.Date(df$tcol, format= "%Y-%m-%d %H:%M:%S")
xlabel <- scale_x_date(labels=date_format("%m-%d"), breaks="1 day")
ggplot(df) + xlabel + measurements + kde
This gives:
Please note that converting to Date drops the time of day, so all measurements from the same day collapse onto a single x position.
Hopefully this helps you move forward.
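If losing the time of day is a problem, one possible variation (a sketch, not part of the answer above) is to keep the column as POSIXct and label the axis with scale_x_datetime, whose date_breaks and date_labels arguments place one tick per day. The data frame name obs and the column name t are made up for this illustration; the timestamps are the eight from the dput in the question.

```r
library(ggplot2)

# The eight timestamps from the question's dput, kept as POSIXct
# (obs and t are illustrative names, not from the original answer)
obs <- data.frame(t = as.POSIXct(c(
  "2013-01-03 04:04:23", "2013-01-03 10:02:04", "2013-01-02 23:45:16",
  "2013-01-04 11:35:16", "2013-01-02 23:31:56", "2013-01-05 17:31:30",
  "2013-01-11 14:29:18", "2013-01-05 06:28:43"), tz = "UTC"))

p <- ggplot(obs, aes(x = t)) +
  geom_density(bw = "nrd0") +                      # bandwidth is measured in seconds here
  geom_point(aes(y = 0), shape = 15, colour = "blue", size = 3) +  # the measurements
  scale_x_datetime(date_breaks = "1 day", date_labels = "%m-%d")   # one label per day

print(p)
```

Because the x values stay at second resolution, points from the same day remain distinguishable.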
Convert your values to POSIXct, convert that to numeric (i.e., seconds in UNIX time) and then apply your kernel density function. If z is your vector of timestamps:
z2 <- as.POSIXct(z, format="%Y-%m-%d %H:%M:%S", tz="GMT")
plot(density(as.numeric(z2)))
It would then be relatively easy to add a labeled x-axis with axis.
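A minimal sketch of that last step, assuming z holds the eight timestamps listed in the question. The choice of five evenly spaced ticks and the "%m-%d %H:%M" label format are arbitrary illustrations, not something the answer prescribes.

```r
# Timestamps copied from the question
z <- c("2013-01-03 00:04:23", "2013-01-03 00:02:04", "2013-01-02 23:45:16",
       "2013-01-02 23:35:16", "2013-01-02 23:31:56", "2013-01-02 23:31:30",
       "2013-01-02 23:29:18", "2013-01-02 23:28:43")
z2 <- as.POSIXct(z, format = "%Y-%m-%d %H:%M:%S", tz = "GMT")

d <- density(as.numeric(z2))           # kernel density on seconds since the epoch
plot(d, xaxt = "n", main = "Temporal density", xlab = "")
rug(as.numeric(z2))                    # show the individual measurements

# Five evenly spaced ticks across the observed range (an arbitrary choice)
ticks <- seq(min(z2), max(z2), length.out = 5)
axis(1, at = as.numeric(ticks), labels = format(ticks, "%m-%d %H:%M", tz = "GMT"))
```

Passing bw or adjust to density() is the usual way to try different bandwidths.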
This question already has answers here:
geom_smooth on a subset of data
Data: Height was recorded daily.
I want to plot the height of my plants (Plant A1 - Z50)
in individual plots, and I want to highlight the current year.
So I made a subset for each plant and a subset for the current year (2018).
Now I need a plot with the total record and the highlighted data from 2018.
dput(Plant)
structure(list(Name = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
3L, 3L, 3L), .Label = c("Plant A1", "Plant B1", "Plant C1"), class = "factor"),
Date = structure(c(1L, 4L, 5L, 7L, 1L, 4L, 6L, 1L, 2L, 3L
), .Label = c(" 2001-01-01", " 2001-01-02", " 2001-01-03",
" 2002-01-01", " 2002-02-01", " 2019-01-01", " 2019-12-31"
), class = "factor"), Height_cm = c(91, 106.1, 107.4, 145.9,
169.1, 192.1, 217.4, 139.8, 140.3, 140.3)), .Names = c("Name",
"Date", "Height_cm"), class = "data.frame", row.names = c(NA,
-10L))
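library(dplyr)
library(ggplot2)
# Date is stored as a factor with leading spaces in the dput above; convert it first
Plant$Date <- as.Date(trimws(as.character(Plant$Date)))
Plant_A1 <- filter(Plant, Name == "Plant A1")
Current_Year <- as.numeric("2018")
Plant_A1_Subset <- filter(Plant_A1, format(Plant_A1$Date, '%Y') == Current_Year)
ggplot(data=Plant_A1,aes(x=Plant_A1$Date, y=Plant_A1$Height_cm)) +
geom_point() +
geom_smooth(method="loess", level=0.95, span=1/2, color="red") +
labs(x="Date", y="Height cm")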
Now I don't know how to put my new subset for 2018 (Plant_A1_Subset) into this graph.
As noted, this question is a duplicate, and the linked question has an answer.
That said, here is likely the most common way of handling your problem.
In ggplot2, every layer inherits the aesthetics passed to aes in the ggplot(aes(...)) call. The plot will therefore use these aesthetics in all later layers unless one overrides them manually. We can solve your problem by simply adding an extra aesthetic to the aes of geom_point. Below I've illustrated a simple way to achieve what you might be looking for.
Specify the aes argument in individual calls
This method is likely the most intuitive. aes controls which variables are mapped to the plot. So if you want to add colour to certain points, one way is to give geom_point its own aes, separate from geom_smooth.
library(ggplot2)
library(lubridate) #for month(), year(), day() functions
current_year <- 2018
ggplot(data = Plant_A1, aes(x = Date, y = Height_cm)) +
  # Note here, colour set in geom_point
  geom_point(aes(col = ifelse(year(Date) == current_year, "Yes", "No"))) +
  geom_smooth(method="loess", level=0.95,
              span=1/2, color="red") +
  labs(x="Date", y="Height cm",
       col = "Current year?") # Specify legend title for colour
Note that I have used the inheritance of the aes argument here. Simply put, aes looks up names within data and, if it finds them, uses them as variables. So there is no need to write data$....
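Another way to get the subset into the graph, sketched here as an alternative rather than as the approach of the answer above, is to pass the subset to a second geom_point layer through its data argument. The small data frame plant and its values are invented for illustration only.

```r
library(ggplot2)

# Invented example data (not the asker's): two readings in 2017, two in 2018
plant <- data.frame(
  Date = as.Date(c("2017-03-01", "2017-09-01", "2018-03-01", "2018-09-01")),
  Height_cm = c(91, 106.1, 107.4, 145.9)
)

# Subset for the year to highlight
current <- subset(plant, format(Date, "%Y") == "2018")

p <- ggplot(plant, aes(x = Date, y = Height_cm)) +
  geom_point() +                                        # full record
  geom_point(data = current, colour = "red", size = 3)  # highlighted subset on top

print(p)
```

The second layer inherits x and y from the ggplot() call and only swaps the data, so the highlighted points sit exactly on top of the originals.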
I am struggling to do this in R. I have a list of station names with two associated variables: Start Date and End Date. What I would like to do is plot a horizontal line or bar chart that spans the start date to the end date for each station name.
I have tried using ggplot, but I'll confess I am a recent user of R.
If you have data looking like this (dput is at the end) with
start date
end date
task name or station name
and an optional group name
(I invented some data, as the OP does not provide data)
StartDate EndDate TaskName Group
1 2018-10-01 2018-11-02 KPI: high level definition KPI Definition
2 2018-11-05 2018-11-16 KPI: data translation KPI Definition
3 2019-02-18 2019-03-01 KPI: corroboration KPI Definition
4 2018-11-05 2018-11-16 KPI: Define Graphical Format KPI Definition
5 2018-10-22 2018-12-07 Data: Which data Define and Get Data
6 2018-10-08 2018-10-19 Data: Mail requesting data Define and Get Data
7 2018-12-07 2018-12-14 Data: Mail defining data Define and Get Data
8 2018-12-17 2018-12-28 Data: Test data dump Define and Get Data
9 2018-12-17 2018-12-28 Data: CSV temporary Define and Get Data
10 2018-12-31 2019-01-25 Data: Quality inspection of Data Dump Define and Get Data
11 2018-12-31 2019-01-25 Data: Create graphs Define and Get Data
12 2019-01-28 2019-02-15 Data: Correct data comparison with KPI defs Define and Get Data
13 2019-02-04 2019-03-01 Data: Create and publish ppt format Define and Get Data
14 2018-11-19 2018-12-14 Storage: Where Storage
15 2018-11-19 2018-12-14 Storage: How much Storage
You will need to put it in long format (a separate line for start and end)
library(ggplot2)
library(reshape2) # for melt to get the data in long format
m_planning_data2 <- melt(planning_data2, measure.vars = c("StartDate", "EndDate"))
Then plot it using ggplot:
ggplot(m_planning_data2, aes(value, TaskName)) +
geom_line(size=4) +
xlab(NULL) +
ylab(NULL) +
ggtitle("Example Assignment Planning 1") +
theme_minimal() +
theme(aspect.ratio = 0.4, axis.text = element_text(size = 7))
... yielding this simple plot:
Or plot it with grouping and an annotation for "today"
ggplot(m_planning_data2, aes(value, TaskName, col = Group)) +
geom_line(size=4) +
xlab(NULL) +
ylab(NULL) +
ggtitle("Example Assignment Planning 2") +
geom_vline(xintercept = as.POSIXct(as.Date(Sys.time())) , linetype = 1, size=1.5, colour = "purple", alpha= .5) +
annotate("text", x = as.POSIXct(as.Date(Sys.time())) + 86400*1.5, y = 3,
label = as.Date(Sys.time()), colour = "purple", angle=90, size= 3) +
theme_minimal() +
theme(aspect.ratio = 0.4, axis.text = element_text(size = 7))
... yielding the following plot:
Please, let me know whether this is what you were after.
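For comparison, here is a sketch of an alternative that skips the melt step by drawing one geom_segment per row directly from the wide data. This is not the approach used above, and the data frame stations with its columns Station, Start and End is hypothetical.

```r
library(ggplot2)

# Hypothetical wide-format data: one row per station
stations <- data.frame(
  Station = c("A", "B", "C"),
  Start = as.Date(c("2018-10-01", "2018-11-05", "2019-02-18")),
  End = as.Date(c("2018-11-02", "2018-11-16", "2019-03-01"))
)

# One horizontal segment per station, from Start to End
p <- ggplot(stations, aes(y = Station)) +
  geom_segment(aes(x = Start, xend = End, yend = Station)) +
  xlab(NULL) +
  ylab(NULL) +
  theme_minimal()

print(p)
```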
DATA
structure(list(StartDate = structure(c(1538344800, 1541372400,
1550444400, 1541372400, 1540159200, 1538949600, 1544137200, 1545001200,
1545001200, 1546210800, 1546210800, 1548630000, 1549234800, 1542582000,
1542582000), class = c("POSIXct", "POSIXt"), tzone = ""), EndDate = structure(c(1541113200,
1542322800, 1551394800, 1542322800, 1544137200, 1539900000, 1544742000,
1545951600, 1545951600, 1548370800, 1548370800, 1550185200, 1551394800,
1544742000, 1544742000), class = c("POSIXct", "POSIXt"), tzone = ""),
TaskName = structure(c(13L, 11L, 10L, 12L, 9L, 6L, 5L, 8L,
4L, 7L, 3L, 1L, 2L, 15L, 14L), .Label = c("Data: Correct data comparison with KPI defs",
"Data: Create and publish ppt format", "Data: Create graphs",
"Data: CSV temporary", "Data: Mail defining data", "Data: Mail requesting data",
"Data: Quality inspection of Data Dump", "Data: Test data dump",
"Data: Which data", "KPI: corroboration", "KPI: data translation",
"KPI: Define Graphical Format", "KPI: high level definition",
"Storage: How much", "Storage: Where"), class = "factor"),
Group = structure(c(2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 3L, 3L), .Label = c("Define and Get Data", "KPI Definition",
"Storage"), class = "factor")), .Names = c("StartDate", "EndDate",
"TaskName", "Group"), row.names = c(NA, -15L), class = "data.frame")
I have a dataframe comprising two columns, 'host', and 'date'; which describes a series of cyber attacks against a number of different servers on specific dates over a seven month period.
Here's what the data looks like,
> china_atks %>% head(100)
host date
1 groucho-oregon 2013-03-03
2 groucho-oregon 2013-03-03
...
46 groucho-singapore 2013-03-03
48 groucho-singapore 2013-03-04
...
Where 'groucho-oregon', 'groucho-singapore', etc., is the hostname of the server targeted by an attack.
There are around 190,000 records, spanning 03/03/2013 to 08/09/2013, e.g.
> unique(china_atks$date)
[1] "2013-03-03" "2013-03-04" "2013-03-05" "2013-03-06" "2013-03-07"
"2013-03-08" "2013-03-09"
[8] "2013-03-10" "2013-03-11" "2013-03-12" "2013-03-13" "2013-03-14"
"2013-03-15" "2013-03-16"
[15] "2013-03-17" "2013-03-18" "2013-03-19" "2013-03-20" "2013-03-21"
"2013-03-22" "2013-03-23"
...
I'd like to create a multi-line time series chart that visualises how many attacks each individual server received each day over the range of dates, but I can't figure out how to pass the data to ggplot to achieve this. There are nine unique hostnames, and so the chart would show nine lines.
Thanks!
Here's one way to do this.
First, summarize the count frequency by host and date.
library(plyr)
df <- plyr::count(da,c("host", "date"))
Then do the plotting.
library(ggplot2)
# group by host so that each server gets its own line
ggplot(data=df, aes(x=date, y=freq, group=host)) +
geom_line(aes(color = host))
Data
da <- structure(list(host = structure(1:4, .Label = c("groucho-eu",
"groucho-oregon", "groucho-singapore", "groucho-tokyo"), class = "factor"),
date = structure(c(1L, 1L, 1L, 1L), .Label = "2013-03-03", class = "factor"),
freq = c(1L, 4L, 2L, 1L)), .Names = c("host", "date", "freq"
), row.names = c(NA, -4L), class = "data.frame")
ggplot2 can compute statistics itself, so another option is to let ggplot handle the count/frequency. Applied to the raw data (one row per attack), this should draw multiple lines (one for each group):
ggplot(china_atks, aes(x=date, colour = host, group = host)) + geom_line(stat = "count")
Note: make sure host is a factor so that the lines get discrete colours.
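A self-contained sketch of the count-then-plot idea in base R plus ggplot2, assuming raw data with one row per attack. The data frame atks and its six rows are invented for illustration; only the hostnames are taken from the question.

```r
library(ggplot2)

# Invented raw data: one row per attack
atks <- data.frame(
  host = c("groucho-oregon", "groucho-oregon", "groucho-singapore",
           "groucho-oregon", "groucho-singapore", "groucho-singapore"),
  date = as.Date(c("2013-03-03", "2013-03-03", "2013-03-03",
                   "2013-03-04", "2013-03-04", "2013-03-04"))
)

# Count attacks per host and day; table() turns the dates into factor levels
counts <- as.data.frame(table(host = atks$host, date = atks$date),
                        responseName = "freq")
counts$date <- as.Date(as.character(counts$date))  # back to Date for a time axis

# One line per host
p <- ggplot(counts, aes(x = date, y = freq, colour = host, group = host)) +
  geom_line()

print(p)
```

Converting date back to Date gives a continuous time axis, and table() also keeps host/day combinations with zero attacks, so the lines drop to zero instead of skipping those days.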
I have a problem connecting two points with the same y value. My dataset looks like this (I hope the formatting is ok):
attackerip,min,max
125.88.146.123,2016-03-29 17:38:17.949778,2016-03-30 07:28:47.912983
58.218.205.101,2016-04-05 15:53:20.69986,2016-05-12 17:32:08.583255
183.3.202.195,2016-04-05 15:58:27.862509,2016-04-15 18:15:13.117774
58.218.199.166,2016-04-05 16:09:34.448588,2016-04-24 06:02:12.237922
58.218.204.107,2016-04-05 16:57:17.624509,2016-05-31 00:52:44.007908
What I have so far is the following:
mydata = read.csv("timeline.csv", sep=',')
mydata$min <- strptime(as.character(mydata$min), format='%Y-%m-%d %H:%M:%S')
mydata$max <- strptime(as.character(mydata$max), format='%Y-%m-%d %H:%M:%S')
plot(mydata$min, mydata$attackerip, col="red")
points(mydata$max, mydata$attackerip, col="blue")
Which results in:
Now I want to connect the points with the same y-axis value, but I cannot get lines or abline to work. Thanks in advance!
EDIT: dput of data
dput(mydata)
structure(list(attackerip = structure(c(1L, 5L, 2L, 3L, 4L), .Label = c("125.88.146.123",
"183.3.202.195", "58.218.199.166", "58.218.204.107", "58.218.205.101"
), class = "factor"), min = structure(1:5, .Label = c("2016-03-29 17:38:17.949778",
"2016-04-05 15:53:20.69986", "2016-04-05 15:58:27.862509", "2016-04-05 16:09:34.448588",
"2016-04-05 16:57:17.624509"), class = "factor"), max = structure(c(1L,
4L, 2L, 3L, 5L), .Label = c("2016-03-30 07:28:47.912983", "2016-04-15 18:15:13.117774",
"2016-04-24 06:02:12.237922", "2016-05-12 17:32:08.583255", "2016-05-31 00:52:44.007908"
), class = "factor")), .Names = c("attackerip", "min", "max"), class = "data.frame", row.names = c(NA,
-5L))
Final Edit:
The reason plotting lines did not work was that min and max were stored as timestamps. Casting them to numeric values yielded the expected result. Thanks for your help, everyone.
The lines function should work just fine. However, you will need to call it for every pair (or set) of points that share the same y value. Here is a reproducible example:
# get sets of observations with the same y value
dupeVals <- unique(y[duplicated(y) | duplicated(y, fromLast=TRUE)])
# put the corresponding indices into a list
dupesList <- lapply(dupeVals, function(i) which(y == i))
# scatter plot
plot(x, y)
# plot the lines using sapply (invisible() suppresses the printed list of NULLs)
invisible(sapply(dupesList, function(i) lines(x[i], y[i])))
This returns
data
set.seed(1234)
x <- sort(5* runif(30))
y <- sample(25, 30, replace=TRUE)
As it appears that you have two separate groups for which you would like to draw these lines, the algorithm would be as follows:
for each group (min and max, I believe),
calculate the duplicate values of the y variable,
put the indices of these duplicates into a dupesList (maybe dupesListMin and dupesListMax),
plot the points,
run one sapply function on each dupesList.
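Since each row in the question's data already pairs one min with one max for the same address, a simpler sketch (an alternative to the algorithm above, not a restatement of it) is to draw one segments() line per row. It uses three rows from the question with the fractional seconds dropped; ypos is a made-up helper that gives each address a numeric y position.

```r
# Three rows from the question, fractional seconds dropped for brevity
mydata <- data.frame(
  attackerip = c("125.88.146.123", "58.218.205.101", "183.3.202.195"),
  min = c("2016-03-29 17:38:17", "2016-04-05 15:53:20", "2016-04-05 15:58:27"),
  max = c("2016-03-30 07:28:47", "2016-05-12 17:32:08", "2016-04-15 18:15:13")
)
t0 <- as.POSIXct(mydata$min, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
t1 <- as.POSIXct(mydata$max, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
ypos <- seq_along(mydata$attackerip)   # numeric y position per address

# Empty plot spanning the full time range, with addresses as y labels
plot(range(c(t0, t1)), range(ypos), type = "n", yaxt = "n", xlab = "", ylab = "")
axis(2, at = ypos, labels = mydata$attackerip, las = 1, cex.axis = 0.6)

# Numeric coordinates, in line with the asker's finding that casting helped
segments(x0 = as.numeric(t0), y0 = ypos, x1 = as.numeric(t1), y1 = ypos)
points(as.numeric(t0), ypos, col = "red")
points(as.numeric(t1), ypos, col = "blue")
```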
I want to rank the variables in my dataset in descending order of the number of plants used. I tried ranking them in the .csv file and then importing it into R, but even then the plot was not ranked in the required order. Here is my dataset:
df <- structure(list(Lepidoptera.Family = structure(c(3L, 2L, 5L, 1L, 4L, 6L),
.Label = c("Hesperiidae", "Lycaenidae", "Nymphalidae", "Papilionidae", "Pieridae","Riodinidae"), class = "factor"),
LHP.Families = c(55L, 55L, 15L, 14L, 13L, 1L)),
.Names = c("Lepidoptera.Family", "LHP.Families"),
class = "data.frame", row.names = c(NA, -6L))
library(ggplot2)
library(reshape2)
gg <- melt(df,id="Lepidoptera.Family", value.name="LHP.Families", variable.name="Type")
ggplot(gg, aes(x=Lepidoptera.Family, y=LHP.Families, fill=Type))+
geom_bar(stat="identity")+
coord_flip()+facet_grid(Type~.)
How do I rank them in descending order? Also, I want to combine 3 plots into one. How can I go about it?
The reason this is happening is that ggplot plots x variables that are factors in the order of their levels (recall that factors are stored as integer codes under the covers). If you want to plot them in a different order, you should change the order of the levels before plotting:
gg$Lepidoptera.Family<-with(gg,
factor(Lepidoptera.Family,
levels=Lepidoptera.Family[order(LHP.Families)]))
The trick is to reorder the levels of the Lepidoptera.Family factor, which by default is alphabetical:
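# reorder() returns the factor with its levels sorted by LHP.Families;
# the result has to be assigned back to the column to take effect
df$Lepidoptera.Family <- with(df, reorder(Lepidoptera.Family, LHP.Families))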
gg <- melt(df,id="Lepidoptera.Family", value.name="LHP.Families", variable.name="Type")
ggplot(gg, aes(x=Lepidoptera.Family, y=LHP.Families, fill=Type))+ geom_bar(stat="identity")+ coord_flip()+facet_grid(Type~.)
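To see what the reordering does to the factor itself, here is a small sketch using the family names and counts from the question. The data frame fam and its column names are made up for the illustration.

```r
# Values from the question's dput, under illustrative column names
fam <- data.frame(
  family = factor(c("Nymphalidae", "Lycaenidae", "Pieridae",
                    "Hesperiidae", "Papilionidae", "Riodinidae")),
  n_plants = c(55, 55, 15, 14, 13, 1)
)

levels(fam$family)   # default: alphabetical order

# Sort the levels by n_plants (ascending); use -fam$n_plants for descending
fam$family <- reorder(fam$family, fam$n_plants)
levels(fam$family)   # now starts with the family that uses the fewest plants
```

With coord_flip(), the first level is drawn at the bottom, so ascending levels put the largest bar at the top.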