Making a presence/absence timeline in R for multiple y objects

This is my first time using SO and I am an R newbie; sorry if this is a little basic or unclear (or if the question has already been answered... I'm struggling with coding and need pretty specific answers to understand)
I would like to produce an image similar to this one:
Except I would like it to be oriented horizontally on a timeline, and with two vertical lines drawn from the x-axis.
I can set the data up simply, and there are only two variables - date and Tag.
Tag Date
1 1 1/1/2014
2 3 1/1/2014
3 1 1/3/2014
4 2 1/3/2014
5 3 1/3/2014
6 5 1/3/2014
7 2 1/4/2015
8 3 1/4/2015
9 4 1/4/2015
10 6 1/4/2015
11 1 1/5/2014
12 2 1/5/2014
13 4 1/5/2014
14 6 1/5/2014
15 1 1/6/2014
16 2 1/6/2014
17 3 1/6/2014
18 4 1/6/2014
19 6 1/6/2014
20 2 1/7/2014
21 4 1/7/2014
22 1 1/8/2014
23 2 1/8/2014
24 6 1/8/2014
Here is a drawn image of what I would like to accomplish:
To recap: I want to take this data, which shows the dates of detection of animals at a certain location, and plot it on a timeline with two vertical lines on two dates. If an animal (say, tag 2) was detected on consecutive days, I would like to connect those dates with a line; if a detection happened without detections on adjacent days, a simple dot will suffice. I imagine the y-axis is stacked with each individual Tag, and the x-axis is a date scale: for each date, if a tag ID was detected, its corresponding x,y coordinate will be marked; if a tag was not detected on a certain date, the corresponding x,y coordinate will remain blank.
Here's a follow-up question:
I want to add a shaded background to some of the dates. I figured I could do this using geom_rect, but I keep getting the following error:
Error: Invalid input: date_trans works with objects of class Date only
Using the code you wrote, this is what I added to produce the error:
geom_rect(aes(xmin=16075, xmax=16078, ymin=-Inf, ymax=Inf), fill="red", alpha=0.25)
This code will plot, but is not transparent, and so becomes fairly useless:
geom_rect(xmin=16075, xmax=16078, ymin=-Inf, ymax=Inf)

You first need to convert your date column to Date class. Then you need to determine which dates are consecutive. Finally, you can plot them. Below is a possible solution using the dplyr and ggplot2 packages.
# needed packages
require(ggplot2)
require(dplyr)
# input your data (changed to 2014 dates)
dat <- structure(list(Tag = c(1L, 3L, 1L, 2L, 3L, 5L, 2L, 3L, 4L, 6L, 1L, 2L, 4L, 6L, 1L, 2L, 3L, 4L, 6L, 2L, 4L, 1L, 2L, 6L), Date = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 7L, 7L, 7L), .Label = c("1/1/2014", "1/3/2014", "1/4/2014", "1/5/2014", "1/6/2014", "1/7/2014", "1/8/2014"), class = "factor")), .Names = c("Tag", "Date"), class = "data.frame", row.names = c(NA, -24L))
# change date to Date format
dat[, "Date"] <- as.Date(dat[, "Date"], format='%m/%d/%Y')
# flag rows whose next detection of the same Tag falls on the following day
dat <- dat %>%
  group_by(Tag) %>%
  mutate(consecutive = c(diff(Date), 2) == 1)
# plotting command: lines are shown only where detections are consecutive
ggplot(dat, aes(Date, Tag)) +
  geom_point() +
  theme_bw() +
  geom_line(aes(alpha = consecutive, group = Tag)) +
  scale_alpha_manual(values = c(0, 1), breaks = c(FALSE, TRUE), guide = 'none')
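For the follow-up geom_rect error: date_trans complains because the rectangle's x limits are given as bare day numbers (16075) while the x-axis has class Date. A minimal sketch of one way around it, passing Date objects and using annotate() so the rectangle is drawn once rather than once per data row (that repeated stacking is what made alpha = 0.25 look opaque). The shading dates below are placeholders:

```r
library(ggplot2)

# minimal stand-in for the `dat` built above
dat <- data.frame(Tag  = c(1, 1, 2),
                  Date = as.Date(c("2014-01-01", "2014-01-03", "2014-01-03")))

# annotate() adds a single rectangle layer; putting the limits in aes() inside
# geom_rect would draw one translucent rectangle per row, stacking to a solid block
p <- ggplot(dat, aes(Date, Tag)) +
  annotate("rect",
           xmin = as.Date("2014-01-02"), xmax = as.Date("2014-01-03"),
           ymin = -Inf, ymax = Inf, fill = "red", alpha = 0.25) +
  geom_point() +
  theme_bw()
```

If you only have the day number, as.Date(16075, origin = "1970-01-01") recovers the corresponding date.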

Related

Interactively plotting multiple lines with shiny and ggplot2

I'm creating a shiny application that will have a checkboxGroupInput, where each box checked will add another line to a frequency plot. I'm trying to wrap my head around reshape2 and ggplot2 to understand how to make this possible.
data:
head(testSet)
date store_id product_id count
1 2015-08-15 3 1 8
2 2015-08-15 3 3 1
3 2015-08-17 3 1 7
4 2015-08-17 3 2 3
5 2015-08-17 3 3 1
6 2015-08-18 3 3 2
class level information:
dput(droplevels(head(testSet, 10)))
structure(list(date = structure(c(16662, 16662, 16664,
16664, 16664, 16665, 16665, 16665, 16666, 16666), class = "Date"),
store_id = c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), product_id = c(1L,
3L, 1L, 2L, 3L, 3L, 1L, 2L, 1L, 2L), count = c(8L, 1L, 7L,
3L, 1L, 2L, 18L, 1L, 0L, 2L)), .Names = c("date", "store_id",
"product_id", "count"), row.names = c(NA, 10L), class = "data.frame")
The graph should have an x-axis that corresponds to date, and a y-axis that corresponds to count. I would like to have a checkbox group input where for each box representing a product checked, a line corresponding to product_id will be plotted on the graph. The data is already filtered to store_id.
My first thought was to write a for loop inside of the plot to render a new geom_line() per each returned value of the input$productId vector. -- however after some research it seems that's the wrong way to go about things.
Currently I'm trying to melt() the data down to something useful, and then aes(...group=product_id), but getting errors on whatever I try.
Attempting to melt the data:
meltSet <- melt(testSet, id.vars="product_id", value.name="count", variable.name="date")
head of meltSet
head(meltSet)
product_id date count
1 1 date 16662
2 3 date 16662
3 1 date 16664
4 2 date 16664
5 3 date 16664
6 3 date 16665
tail of meltSet
tail(meltSet)
product_id date count
76 9 count 5
77 1 count 19
78 2 count 1
79 3 count 39
80 8 count 1
81 9 count 4
Plotting:
ggplot(data=meltSet, aes(x=date, y=count, group = product_id, colour = product_id)) + geom_line()
So my axis and values are all wonky, and not what I'm expecting from setting the plot.
If I'm understanding it correctly you don't need any melting; you just need to aggregate your data, summing up count by date and product_id. You can use data.table for this purpose:
testSet = data.table(testSet)
aggrSet = testSet[, .(count=sum(count)), by=.(date, product_id)]
You can do your ggplot stuff on aggrSet. It has three columns now: date, product_id, count.
When you melt like you did, you merge variables of different types (date, store_id, and count) into a single value column, which is why the axes come out wonky.
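Putting it together for the shiny case: a sketch (not your actual server code) of aggregating with data.table and then filtering to a hypothetical input$productId vector before plotting; factor() keeps product_id discrete for the colour scale:

```r
library(data.table)
library(ggplot2)

# small stand-in for testSet
testSet <- data.table(date       = as.Date("2015-08-15") + c(0, 0, 2, 2),
                      product_id = c(1L, 3L, 1L, 3L),
                      count      = c(8L, 1L, 7L, 1L))

# sum count by date and product, as in the answer above
aggrSet <- testSet[, .(count = sum(count)), by = .(date, product_id)]

# in the shiny server this vector would come from input$productId
selected <- c(1L, 3L)

p <- ggplot(aggrSet[product_id %in% selected],
            aes(x = date, y = count,
                group = factor(product_id), colour = factor(product_id))) +
  geom_line()
```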

Producing summary tables for very large datasets

I am working with migration data, and I want to produce three summary tables from a very large dataset (>4 million rows). An example is detailed below:
migration <- structure(list(area.old = structure(c(2L, 2L, 2L, 2L, 2L, 2L,
2L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("leeds",
"london", "plymouth"), class = "factor"), area.new = structure(c(7L,
13L, 3L, 2L, 4L, 7L, 6L, 7L, 6L, 13L, 5L, 8L, 7L, 11L, 12L, 9L,
1L, 10L, 11L), .Label = c("bath", "bristol", "cambridge", "glasgow",
"harrogate", "leeds", "london", "manchester", "newcastle", "oxford",
"plymouth", "poole", "york"), class = "factor"), persons = c(6L,
3L, 2L, 5L, 6L, 7L, 8L, 4L, 5L, 6L, 3L, 4L, 1L, 1L, 2L, 3L, 4L,
9L, 4L)), .Names = c("area.old", "area.new", "persons"), class = "data.frame", row.names = c(NA,
-19L))
Summary table 1: 'area.within'
The first table I wish to create is called 'area.within'. This will detail only areas where people have moved within the same area (i.e. it will count the total number of persons where 'london' is noted down in 'area.old' and 'area.new'). There will probably be multiple occurrences of this within the data table. It will then do this for all of the different areas, so the summary would be:
area.within persons
1 london 13
2 leeds 5
3 plymouth 5
Using the data table package, I have as far as:
setDT(migration)[as.character(area.old)==as.character(area.new)]
... but this doesn't get rid of duplicates...
Summary table 2: 'moved.from'
The second table will summarise areas which have experienced people moving out (i.e. those unique values in 'area.old'). It will identify areas for which column 1 and 2 are different and add together all the people that are detailed (i.e. excluding those who have moved between areas - in summary table 1). The resulting table should be:
moved.from persons
1 london 24
2 leeds 17
3 plymouth 19
Summary table 3: 'moved.to'
The third table summarises which areas have experienced people moving to (i.e. those unique values in 'area.new'). It will identify all the unique areas for which column 1 and 2 are different and add together all the people that are detailed (i.e. excluding those who have moved between areas - in summary table 1). The resulting table should be:
moved.to persons
1 london 5
2 york 3
3 cambridge 2
4 bristol 5
5 glasgow 6
6 leeds 8
7 york 6
8 harrogate 3
9 manchester 4
10 plymouth 0
11 poole 2
12 newcastle 3
13 bath 4
14 oxford 9
Most importantly, a sum of all the persons detailed in tables 2 and 3 should be the same. And then this value, combined with the persons total for table 1 should equal the sum of the all the persons in the original table.
If anyone could help me sort out how to structure my code using the data table package to produce my tables, I should be most grateful.
Using data.table is a good choice, I think.
setDT(migration) # this has to be done only once
1.
To avoid duplicates just sum them up by city as follows
migration[as.character(area.old)==as.character(area.new),
.(persons = sum(persons)),
by=.(area.within = area.new)]
2.
This is very similar to 1. but uses != in the i argument
migration[as.character(area.old)!=as.character(area.new),
.(persons = sum(persons)),
by=.(moved.from = area.old)]
3.
Same as 2.
migration[as.character(area.old)!=as.character(area.new),
.(persons = sum(persons)),
by=.(moved.to = area.new)]
Alternative
As 2. and 3. are very similar you can also do:
moved <- migration[as.character(area.old)!=as.character(area.new)]
#2
moved[,.(persons = sum(persons)), by=.(moved.from = area.old)]
#3
moved[,.(persons = sum(persons)), by=.(moved.to = area.new)]
This way, the selection of the right rows has to be done only once.
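The bookkeeping described in the question (tables 2 and 3 sum to the same total, and together with table 1 they account for every person) can be checked directly; a sketch on an abbreviated stand-in for the migration table:

```r
library(data.table)

# abbreviated stand-in for the migration data
migration <- data.table(area.old = c("london", "london", "leeds", "leeds"),
                        area.new = c("london", "leeds", "leeds", "london"),
                        persons  = c(6L, 3L, 4L, 5L))

within <- migration[area.old == area.new, sum(persons)]  # table 1 total
from   <- migration[area.old != area.new, sum(persons)]  # table 2 total
to     <- migration[area.old != area.new, sum(persons)]  # table 3 total

stopifnot(from == to)                                  # tables 2 and 3 agree
stopifnot(within + from == migration[, sum(persons)])  # all persons accounted for
```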

Identifying 24-hour periods in GPS data

I would like to identify sequential 24-hour periods in GPS data. I have a datetime column that is numerical (e.g. 41422.29), and I know each rounded number is a day. I know how to get the day (just round); however, my schedule does not follow calendar days. Instead, I would like to identify all of the rows that are within 24 hours of the first row, and then go from there. I cannot use a count of rows, as the fixes are not taken at equal increments within 24 hours.
This is my logic so far, though it doesn't get me where I need to be:
for (i in 1:length(example)) {
  base <- round(example$DT_LMT[i], digits = 0)
  if (example$DT_LMT[i] <= base + 1) {
    example$DaySeq <- base
  } else {
    base + 1
  }
}
I have a dummy data set example, with the kind of thing I would like:
structure(list(ID = 1:19, DT_LMT = c(41423.62517, 41423.79236,
41423.95868, 41424.12534, 41424.29203, 41424.45888, 41424.62535,
41424.79186, 41424.95852, 41425.12502, 41425.29185, 41425.75016,
41425.79201, 41425.83352, 41425.87534, 41425.91744, 41425.95868,
41426.00105, 41426.04257), NEED = c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L)), .Names = c("ID",
"DT_LMT", "NEED"), class = "data.frame", row.names = c(NA, -19L
))
Here is one approach, assuming df is the data assigned in your question. I created a new variable, need, which I believe is your desired outcome.
transform(df, need = trunc(DT_LMT - DT_LMT[1]) + 1)
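A quick check of this one-liner: the computed grouping should reproduce the NEED column (using an abbreviated stand-in for the dummy data above):

```r
# abbreviated stand-in for the dummy data in the question
df <- data.frame(ID     = 1:5,
                 DT_LMT = c(41423.63, 41423.79, 41424.63, 41424.79, 41425.75),
                 NEED   = c(1L, 1L, 2L, 2L, 3L))

# whole 24-hour periods elapsed since the first fix, plus 1
out <- transform(df, need = trunc(DT_LMT - DT_LMT[1]) + 1)
stopifnot(all(out$need == out$NEED))
</test>
```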
I would add 1 to the first value and use that to filter the data frame.
data<-data.frame(ID = 1:19, DT_LMT = c(41423.62517, 41423.79236,
41423.95868, 41424.12534, 41424.29203, 41424.45888, 41424.62535,
41424.79186, 41424.95852, 41425.12502, 41425.29185, 41425.75016,
41425.79201, 41425.83352, 41425.87534, 41425.91744, 41425.95868,
41426.00105, 41426.04257), NEED = c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L))
data[data$DT_LMT<=data$DT_LMT[1]+1,]
Output:
ID DT_LMT NEED
1 1 41423.63 1
2 2 41423.79 1
3 3 41423.96 1
4 4 41424.13 1
5 5 41424.29 1
6 6 41424.46 1
If you want to split the data into a list by 24-hour period (floor() is vectorised, so the lapply() is unnecessary):
split(data, floor(data$DT_LMT - data$DT_LMT[1]))
Output:
$`0`
ID DT_LMT NEED
1 1 41423.63 1
2 2 41423.79 1
3 3 41423.96 1
4 4 41424.13 1
5 5 41424.29 1
6 6 41424.46 1
$`1`
ID DT_LMT NEED
7 7 41424.63 2
8 8 41424.79 2
9 9 41424.96 2
10 10 41425.13 2
11 11 41425.29 2
$`2`
ID DT_LMT NEED
12 12 41425.75 3
13 13 41425.79 3
14 14 41425.83 3
15 15 41425.88 3
16 16 41425.92 3
17 17 41425.96 3
18 18 41426.00 3
19 19 41426.04 3
To add a column with the day (using vectorised arithmetic rather than lapply(), which would create a list column):
data$day <- floor(data$DT_LMT - data$DT_LMT[1]) + 1

R: iterate over columns and plot

My data looks like this:
Group Feature_A Feature_B Feature_C Feature_D
1 1 0 3 2 4
2 1 5 2 2 8
3 1 9 8 6 5
4 2 5 7 8 8
5 2 2 6 8 1
6 2 3 8 6 4
7 3 1 5 3 5
8 3 1 4 3 4
df <- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L), Feature_A = c(0L,
5L, 9L, 5L, 2L, 3L, 1L, 1L), Feature_B = c(3L, 2L, 8L, 7L, 6L,
8L, 5L, 4L), Feature_C = c(2L, 2L, 6L, 8L, 8L, 6L, 3L, 3L), Feature_D = c(4L,
8L, 5L, 8L, 1L, 4L, 5L, 4L)), .Names = c("Group", "Feature_A",
"Feature_B", "Feature_C", "Feature_D"), class = "data.frame", row.names = c(NA,
-8L))
For every Feature I want to generate a plot (e.g., a boxplot) that would highlight differences between Groups.
# Get the Feature names and the unique Groups
Features <- colnames(df)[-1]
Group <- unique(df$Group)
But how can I do the rest?
Pseudo-code might look like this:
Select Feature from Data
Split Data according Group
Boxplot
for (i in 1:levels(df$Features)){
for (o in 1:length(Group)){
}}
How can I achieve this? Hope someone can help me.
I would put my data in the long format. Then, using ggplot2, you can do some nice things.
library(reshape2)
library(ggplot2)
library(gridExtra)
## long format using Group as id
dat.m <- melt(df, id = 'Group')
## bar plot
p1 <- ggplot(dat.m) +
geom_bar(aes(x=Group,y=value,fill=variable),stat='identity')
## box plot
p2 <- ggplot(dat.m) +
geom_boxplot(aes(x=factor(Group),y=value,fill=variable))
## aggregate the 2 plots
grid.arrange(p1,p2)
This is easy to do; I do this all the time.
The code below will generate the charts using ggplot and save them as ch_Feature_A, etc.
You can wrap the answer in a pdf() statement to send them to a PDF as well.
library(ggplot2)
df$Group <- as.factor(df$Group)
for (i in 2:dim(df)[2]) {
  ch <- ggplot(df, aes_string(x = "Group", y = names(df)[i], fill = "Group")) +
    geom_boxplot()
  assign(paste0("ch_", names(df)[i]), ch)
}
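As mentioned, the same loop can be wrapped in pdf()/dev.off() to write every chart to one file; a sketch assuming the df from the question (the file name is arbitrary):

```r
library(ggplot2)

# stand-in for the df defined in the question
df <- data.frame(Group     = factor(c(1, 1, 2, 2)),
                 Feature_A = c(0, 5, 5, 2),
                 Feature_B = c(3, 2, 7, 6))

pdf("feature_boxplots.pdf")  # open one device; each chart becomes a page
for (i in 2:ncol(df)) {
  # print() is required inside a loop for ggplot objects to be drawn
  print(ggplot(df, aes_string(x = "Group", y = names(df)[i], fill = "Group")) +
          geom_boxplot())
}
dev.off()
```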
or even simpler, if you do not want separate charts
library(reshape2)
df1 <- melt(df, id.vars = "Group")
ggplot(df1,aes(x=Group,y=value,fill=Group))+geom_boxplot()+facet_grid(.~variable)

Checking row format of csv

I am trying to import some data (below) and checking to see if I have the appropriate number of rows for later analysis.
repexample <- structure(list(QueueName = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L
), .Label = c(" Overall", "CCM4.usci_retention_eng", "usci_helpdesk"
), class = "factor"), X8Tile = structure(c(1L, 2L, 3L, 4L, 5L,
6L, 7L, 8L, 9L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L,
9L), .Label = c(" Average", "1", "2", "3", "4", "5", "6", "7",
"8"), class = "factor"), Actual = c(508.1821504, 334.6994838,
404.9048759, 469.4068667, 489.2800416, 516.5744106, 551.7966176,
601.5103783, 720.9810622, 262.4622533, 250.2777778, 264.8281938,
272.2807882, 535.2466968, 278.25, 409.9285714, 511.6635101, 553,
641, 676.1111111, 778.5517241, 886.3666667), Calls = c(54948L,
6896L, 8831L, 7825L, 5768L, 7943L, 5796L, 8698L, 3191L, 1220L,
360L, 454L, 406L, 248L, 11L, 9L, 94L, 1L, 65L, 9L, 29L, 30L),
Pop = c(41L, 6L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 3L, 1L, 1L,
1L, 11L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L)), .Names = c("QueueName",
"X8Tile", "Actual", "Calls", "Pop"), class = "data.frame", row.names = c(NA,
-22L))
The data gives 5 columns and is one example of some data that I would typically import (via a .csv file). As you can see there are three unique values in the column "QueueName". For each unique value in "QueueName" I want to check that it has 9 rows, or the corresponding values in the column "X8Tile" ( Average, 1, 2, 3, 4, 5, 6, 7, 8). As an example the "QueueName" Overall has all of the necessary rows, but usci_helpdesk does not.
So my first priority is to at least identify if one of the unique values in "QueueName" does not have all of the necessary rows.
My second priority would be to remove all of the rows corresponding to a unique "QueueName" that does not meet the requirements.
Both these priorities are easily addressed using the Split-Apply-Combine paradigm, implemented in the plyr package.
Priority 1: Identify values of QueueName which don't have enough rows
require(plyr)
# Make a short table of the number of rows for each unique value of QueueName
rowSummary <- ddply(repexample, .(QueueName), summarise, numRows=length(QueueName))
print(rowSummary)
If you have lots of unique values of QueueName, you'll want to identify the values which are not equal to 9:
rowSummary[rowSummary$numRows !=9, ]
Priority 2: Eliminate rows for which QueueName does not have enough rows
repexample2 <- ddply(repexample, .(QueueName), transform, numRows=length(QueueName))
repexampleEdit <- repexample2[repexample2$numRows ==9, ]
print(repexampleEdit)
(I don't quite understand the meaning of 'check that it has 9 rows, or the corresponding values in the column "X8Tile"'.) You could edit the repexampleEdit line based on your needs.
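The same two priorities can also be handled in base R with table() and %in%; a sketch on a cut-down stand-in for repexample:

```r
# cut-down stand-in for repexample: one complete queue, one incomplete
repexample <- data.frame(
  QueueName = rep(c(" Overall", "usci_helpdesk"), c(9, 4)),
  X8Tile    = c(" Average", 1:8, " Average", 1:3)
)

counts <- table(repexample$QueueName)
bad    <- names(counts)[counts != 9]   # priority 1: queues lacking 9 rows
repexampleEdit <- repexample[!(repexample$QueueName %in% bad), ]  # priority 2
```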
This is an approach that makes some assumptions about how your data are ordered. It can be modified (or your data can be reordered) if the assumption doesn't fit:
## Paste together the values from your "X8tile" column
## If all is in order, you should have "Average12345678"
## If anything is missing, you won't....
myMatch <- names(
which(with(repexample, tapply(X8Tile, QueueName, FUN=function(x)
gsub("^\\s+|\\s+$", "", paste(x, collapse = ""))))
== "Average12345678"))
## Use that to subset...
repexample[repexample$QueueName %in% myMatch, ]
# QueueName X8Tile Actual Calls Pop
# 1 Overall Average 508.1822 54948 41
# 2 Overall 1 334.6995 6896 6
# 3 Overall 2 404.9049 8831 5
# 4 Overall 3 469.4069 7825 5
# 5 Overall 4 489.2800 5768 5
# 6 Overall 5 516.5744 7943 5
# 7 Overall 6 551.7966 5796 5
# 8 Overall 7 601.5104 8698 5
# 9 Overall 8 720.9811 3191 5
# 14 CCM4.usci_retention_eng Average 535.2467 248 11
# 15 CCM4.usci_retention_eng 1 278.2500 11 2
# 16 CCM4.usci_retention_eng 2 409.9286 9 2
# 17 CCM4.usci_retention_eng 3 511.6635 94 2
# 18 CCM4.usci_retention_eng 4 553.0000 1 1
# 19 CCM4.usci_retention_eng 5 641.0000 65 1
# 20 CCM4.usci_retention_eng 6 676.1111 9 1
# 21 CCM4.usci_retention_eng 7 778.5517 29 1
# 22 CCM4.usci_retention_eng 8 886.3667 30 1
Similar approaches can be taken with aggregate+merge and similar tools.