Counting occurrence based on condition for each element of a column - r

I am analysing air traffic movements at an airport. My data set comprises aircraft block off times (leaving the gate) and the respective take-off times.
I am looking for an efficient way to count the (cumulative) occurrence of take-off events based on a condition given by the block-time of a flight.
Being relatively new to R, I have managed to code this by
looping over all rows of my data;
subsetting the data for the block time (condition event) in that row; and
counting the number of rows for the (temporary) data frame.
My solution is pretty slow already for a month of data (~ 50.000 flights), so it will be cumbersome to analyse larger time frames of one or two years.
I failed to find a similar problem on Stack Overflow (or elsewhere) that applies to mine, and I could not make apply() or sapply() work properly.
This is my code:
## count departures before own off-block
data$CUM_DEPS <- rep(NA, nrow(data)) # initialise column for dep count
for(i in 1:nrow(data)){ # loop over the data
data$CUM_DEPS[i] <- nrow(data[data$TAKE_OFF_TIME < data$BLOCK_OFF_TIME[i],])
}
Any pointers?
As suggested, this is a snapshot of the data and the result column I created.
FLTID TAKE_OFF_TIME BLOCK_OFF_TIME CUM_DEPS
Flight1 2013-07-01 05:02:42 2013-07-01 04:51:00 0
Flight2 2013-07-01 05:04:30 2013-07-01 04:53:52 0
Flight3 2013-07-01 05:09:01 2013-07-01 04:55:14 0
Flight4 2013-07-01 05:10:30 2013-07-01 05:00:57 0
Flight5 2013-07-01 05:12:58 2013-07-01 05:00:06 0
Flight6 2013-07-01 05:18:45 2013-07-01 05:04:14 1
Flight7 2013-07-01 05:22:12 2013-07-01 05:03:39 1
Flight8 2013-07-01 05:26:02 2013-07-01 05:09:32 3
Flight9 2013-07-01 05:27:24 2013-07-01 05:19:24 6
Flight10 2013-07-01 05:31:32 2013-07-01 05:17:05 5

From the code above, it looks like you are doing a one-to-many comparison.
What makes your code slow is that you subset the data frame with a boolean index at every single step.
data$CUM_DEPS <- rep(NA, nrow(data))
take_off_time = data$TAKE_OFF_TIME
for(i in 1:nrow(data)){
data$CUM_DEPS[i] = sum(data$BLOCK_OFF_TIME[i] > take_off_time)
}
This small modification should make it much faster, although I cannot give an exact number since I do not have a reproducible example.
The biggest differences are that I store the date vector 'take_off_time' once rather than extracting it from the data frame on every iteration, and that I sum a boolean vector instead of subsetting the data frame with it.
All of this assumes that the dates have been parsed correctly, so that they can be compared with inequality operators.

You could check where, in-between "TAKE_OFF_TIME"s, each "BLOCK_OFF_TIME" falls. findInterval is fast for this; the following looks valid, but maybe you'll have to check findInterval's arguments to suit your exact problem.
findInterval(as.POSIXct(DF[["BLOCK_OFF_TIME"]]),
as.POSIXct(DF[["TAKE_OFF_TIME"]]))
#[1] 0 0 0 0 0 1 1 3 6 5
And, for the record, the loop using sapply:
BOT = as.POSIXct(DF[["BLOCK_OFF_TIME"]])
TOT = as.POSIXct(DF[["TAKE_OFF_TIME"]])
sapply(BOT, function(x) sum(TOT < x))
#[1] 0 0 0 0 0 1 1 3 6 5
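One caveat worth sketching: findInterval needs its second argument sorted in non-decreasing order, and by default it counts take-offs at or before each block-off time rather than strictly before. The snippet below is a minimal sketch on made-up timestamps (not the asker's data) that sorts first and uses left.open = TRUE to match the strict TAKE_OFF_TIME < BLOCK_OFF_TIME comparison of the loop; all variable names are illustrative.

```r
# Made-up, deliberately unsorted take-off times; one block-off time ties with a take-off
take_off  <- as.POSIXct(c("2013-07-01 05:09:01", "2013-07-01 05:02:42",
                          "2013-07-01 05:04:30"), tz = "UTC")
block_off <- as.POSIXct(c("2013-07-01 04:51:00", "2013-07-01 05:04:30",
                          "2013-07-01 05:10:00"), tz = "UTC")

# Sort the reference vector, then count take-offs strictly before each block-off time
cum_deps <- findInterval(as.numeric(block_off),
                         sort(as.numeric(take_off)),
                         left.open = TRUE)

# Cross-check against the explicit one-to-many comparison
loop_count <- vapply(seq_along(block_off),
                     function(i) sum(take_off < block_off[i]),
                     integer(1))

cum_deps
# [1] 0 1 3
```

Without left.open = TRUE the second value would be 2, because the take-off at exactly 05:04:30 would be counted as well.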
Where "DF":
DF = structure(list(FLTID = structure(c(1L, 3L, 4L, 5L, 6L, 7L, 8L,
9L, 10L, 2L), .Label = c("Flight1", "Flight10", "Flight2", "Flight3",
"Flight4", "Flight5", "Flight6", "Flight7", "Flight8", "Flight9"
), class = "factor"), TAKE_OFF_TIME = structure(1:10, .Label = c("2013-07-01 05:02:42",
"2013-07-01 05:04:30", "2013-07-01 05:09:01", "2013-07-01 05:10:30",
"2013-07-01 05:12:58", "2013-07-01 05:18:45", "2013-07-01 05:22:12",
"2013-07-01 05:26:02", "2013-07-01 05:27:24", "2013-07-01 05:31:32"
), class = "factor"), BLOCK_OFF_TIME = structure(c(1L, 2L, 3L,
5L, 4L, 7L, 6L, 8L, 10L, 9L), .Label = c("2013-07-01 04:51:00",
"2013-07-01 04:53:52", "2013-07-01 04:55:14", "2013-07-01 05:00:06",
"2013-07-01 05:00:57", "2013-07-01 05:03:39", "2013-07-01 05:04:14",
"2013-07-01 05:09:32", "2013-07-01 05:17:05", "2013-07-01 05:19:24"
), class = "factor"), CUM_DEPS = c(0L, 0L, 0L, 0L, 0L, 1L, 1L,
3L, 6L, 5L)), .Names = c("FLTID", "TAKE_OFF_TIME", "BLOCK_OFF_TIME",
"CUM_DEPS"), class = "data.frame", row.names = c(NA, -10L))

Related

Interactively plotting multiple lines with shiny and ggplot2

I'm creating a shiny application that will have a checkboxGroupInput, where each box checked will add another line to a frequency plot. I'm trying to wrap my head around reshape2 and ggplot2 to understand how to make this possible.
data:
head(testSet)
date store_id product_id count
1 2015-08-15 3 1 8
2 2015-08-15 3 3 1
3 2015-08-17 3 1 7
4 2015-08-17 3 2 3
5 2015-08-17 3 3 1
6 2015-08-18 3 3 2
class level information:
dput(droplevels(head(testSet, 10)))
structure(list(date = structure(c(16662, 16662, 16664,
16664, 16664, 16665, 16665, 16665, 16666, 16666), class = "Date"),
store_id = c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), product_id = c(1L,
3L, 1L, 2L, 3L, 3L, 1L, 2L, 1L, 2L), count = c(8L, 1L, 7L,
3L, 1L, 2L, 18L, 1L, 0L, 2L)), .Names = c("date", "store_id",
"product_id", "count"), row.names = c(NA, 10L), class = "data.frame")
The graph should have an x-axis that corresponds to date, and a y-axis that corresponds to count. I would like to have a checkbox group input where for each box representing a product checked, a line corresponding to product_id will be plotted on the graph. The data is already filtered to store_id.
My first thought was to write a for loop inside of the plot to render a new geom_line() per each returned value of the input$productId vector. -- however after some research it seems that's the wrong way to go about things.
Currently I'm trying to melt() the data down to something useful, and then aes(...group=product_id), but getting errors on whatever I try.
Attempting to melt the data:
meltSet <- melt(testSet, id.vars="product_id", value.name="count", variable.name="date")
head of meltSet
head(meltSet)
product_id date count
1 1 date 16662
2 3 date 16662
3 1 date 16664
4 2 date 16664
5 3 date 16664
6 3 date 16665
tail of meltSet
tail(meltSet)
product_id date count
76 9 count 5
77 1 count 19
78 2 count 1
79 3 count 39
80 8 count 1
81 9 count 4
Plotting:
ggplot(data=meltSet, aes(x=date, y=count, group = product_id, colour = product_id)) + geom_line()
So my axis and values are all wonky, and not what I'm expecting from setting the plot.
If I'm understanding it correctly, you don't need any melting; you just need to aggregate your data, summing count by date and product_id. You can use data.table for this purpose:
library(data.table)
testSet = data.table(testSet)
aggrSet = testSet[, .(count=sum(count)), by=.(date, product_id)]
You can do your ggplot stuff on aggrSet. It has three columns now: date, product_id, count.
When you melt the way you did, you merge variables of different types into a single column: date (Date) and store_id (int).
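To connect the aggregated data to the checkbox group, a rough sketch (assuming ggplot2 is installed, and with `selected` as a hypothetical stand-in for input$productId) is to filter on the checked products before plotting and to map product_id to colour as a factor, so that each product gets its own line. Base aggregate() is used here in place of data.table only to keep the example light on dependencies.

```r
library(ggplot2)

# First six rows of the question's data, rebuilt by hand
test_set <- data.frame(
  date       = as.Date(c("2015-08-15", "2015-08-15", "2015-08-17",
                         "2015-08-17", "2015-08-17", "2015-08-18")),
  store_id   = 3L,
  product_id = c(1L, 3L, 1L, 2L, 3L, 3L),
  count      = c(8L, 1L, 7L, 3L, 1L, 2L)
)

# Hypothetical stand-in for the checked boxes (input$productId in the app)
selected <- c(1L, 3L)

# Keep only the checked products, then sum count per date and product
plot_data <- aggregate(count ~ date + product_id,
                       data = test_set[test_set$product_id %in% selected, ],
                       FUN = sum)

# One line per product: group and colour both follow product_id
p <- ggplot(plot_data,
            aes(x = date, y = count,
                group = product_id, colour = factor(product_id))) +
  geom_line() +
  labs(colour = "product_id")
```

Inside a Shiny server function, the same filtering and plotting would sit in a renderPlot() call, with `selected` replaced by the input value.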

Reducing multiple rows to 1 by index in R [duplicate]

I am relatively new to R. I am working with a dataset that has multiple datapoints per timestamp, but they are in multiple rows. I am trying to make a single row for each timestamp with a column for each variable.
Example dataset
Time Variable Value
10 Speed 10
10 Acc -2
10 Energy 10
15 Speed 9
15 Acc -1
20 Speed 9
20 Acc 0
20 Energy 2
I'd like to convert this to
Time Speed Acc Energy
10 10 -2 10
15 9 -1 (blank or N/A)
20 9 0 2
These are measured values so they are not always complete.
I have tried ddply to extract each individual value into an array and recombine, but the columns are different lengths. I have tried aggregate, but I can't figure out how to keep the variable and value linked. I know I could do this with a for loop type solution, but that seems a poor way to do this in R. Any advice or direction would help. Thanks!
I assume data.frame's name is df
library(tidyr)
spread(df,Variable,Value)
This is typically a job for dcast in reshape2. First, we make your example reproducible:
df <- structure(list(Time = c(10L, 10L, 10L, 15L, 15L, 20L, 20L, 20L),
Variable = structure(c(3L, 1L, 2L, 3L, 1L, 3L, 1L, 2L), .Label = c("Acc",
"Energy", "Speed"), class = "factor"), Value = c(10L, -2L, 10L,
9L, -1L, 9L, 0L, 2L)), .Names = c("Time", "Variable", "Value"),
class = "data.frame", row.names = c(NA, -8L))
Then:
library(reshape2)
dcast(df, Time ~ ...)
Time Acc Energy Speed
10 -2 10 10
15 -1 NA 9
20 0 2 9
With dplyr you can reorder the columns (purely cosmetic) with:
library(dplyr)
dcast(df, Time ~ ...) %>% select(Time, Speed, Acc, Energy)
Time Speed Acc Energy
10 10 -2 10
15 9 -1 NA
20 9 0 2
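If neither tidyr nor reshape2 is available, base R's reshape() can do the same long-to-wide conversion. The following is a minimal sketch on the question's example data; the resulting column names (Value.Speed and so on) are what reshape() generates, and the object name `wide` is just illustrative.

```r
# The example data from the question, typed in by hand
df <- data.frame(
  Time     = c(10, 10, 10, 15, 15, 20, 20, 20),
  Variable = c("Speed", "Acc", "Energy", "Speed", "Acc",
               "Speed", "Acc", "Energy"),
  Value    = c(10, -2, 10, 9, -1, 9, 0, 2)
)

# One row per Time, one Value.* column per distinct Variable;
# combinations that were never measured become NA
wide <- reshape(df, idvar = "Time", timevar = "Variable",
                direction = "wide")

names(wide)
# [1] "Time"         "Value.Speed"  "Value.Acc"    "Value.Energy"
```

The missing Energy reading at Time 15 shows up as NA, as in the dcast output.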

Extract data between two pattern occurrences

I am trying to extract the data between occurrences of a pattern, i.e. once the pattern occurs, subset all data until that pattern occurs again. I then need to give each subset a number so that it is identifiable.
USING (R)
example data:
DF<-(structure(list(date.time = structure(c(1374910680, 1374911040,
1374911160, 1374911580, 1374913380, 1374913500, 1374913620, 1374913740,
1374914160, 1374914400, 1374914520, 1374914940, 1374915000, 1374915120,
1374915240), class = c("POSIXct", "POSIXt"), tzone = ""), aerial = structure(c(2L,
2L, 8L, 8L, 2L, 2L, 2L, 8L, 8L, 8L, 2L, 2L, 8L, 2L, 2L), .Label = c("0",
"1", "10", "11", "2", "3", "4", "5", "6", "7", "8", "9", "m"), class = "factor")), .Names = c("date.time",
"aerial"), row.names = c(1L, 2L, 3L, 4L, 5L,
6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L,
14L, 15L), class = "data.frame") )
example pattern: where DF$aerial repeats 1,1
From the above I want to subset/extract the data between occurrences of the pattern, and then number each subset by the occurrence of the pattern it follows (i.e. this is the first occurrence, this is the second occurrence, etc.).
desired output:
date.time aerial occurrence
3 2013-07-27 08:46:00 5 1
4 2013-07-27 08:53:00 5 1
8 2013-07-27 09:29:00 5 2
9 2013-07-27 09:36:00 5 2
10 2013-07-27 09:40:00 5 2
13 2013-07-27 09:50:00 5 3
I can identify the pattern:
library(zoo)
pat <- c(1,1)
x <- rollapply(DF$aerial, length(pat), FUN=function(x) all(x == pat))
DF[which(x),]
and obviously I can create an is.between function
is.between <- function(x, a, b) {
x > a & x < b
}
However, after this I get stuck.
Note: data between the pattern may not always be aerial 5; this is used to simplify the example.
Help and pointers greatly appreciated!
It seems that it is good enough to exclude all runs of 1's that are at least 2 long so try this:
library(zoo)
a <- as.numeric(as.character(DF$aerial))
r <- rle(a)
cond <- with(r, values != 1 | lengths < 2)
ok <- rep(cond, r$lengths)
occur <- rep(cumsum(cond), r$lengths)
cbind(DF, occur)[ok, ]
which gives:
date.time aerial occur
3 2013-07-27 03:46:00 5 1
4 2013-07-27 03:53:00 5 1
8 2013-07-27 04:29:00 5 2
9 2013-07-27 04:36:00 5 2
10 2013-07-27 04:40:00 5 2
13 2013-07-27 04:50:00 5 3
REVISION: Added occur column
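One thing to keep in mind with the rle approach: cumsum(cond) starts a new number for every kept run, so if the values between two patterns vary (say 5 followed by 7), each value would get its own occurrence number. A possible variation, sketched below under that assumption, numbers the kept rows by how many pattern runs precede them instead. The helper name label_between_runs and its arguments are hypothetical, not part of any package.

```r
# Hypothetical helper: drop runs of `marker` that are at least `min_run` long,
# and number the remaining rows by the count of such runs seen so far
label_between_runs <- function(x, marker = 1, min_run = 2) {
  r <- rle(x)
  keep <- r$values != marker | r$lengths < min_run   # runs to keep
  occur <- cumsum(!keep)                             # pattern runs seen so far
  out <- data.frame(index = seq_along(x),
                    value = x,
                    occur = rep(occur, r$lengths))
  out[rep(keep, r$lengths), ]
}

# Same sequence as DF$aerial in the question
aerial <- c(1, 1, 5, 5, 1, 1, 1, 5, 5, 5, 1, 1, 5, 1, 1)
res <- label_between_runs(aerial)

# A case where the values between patterns vary
mixed <- label_between_runs(c(1, 1, 5, 7, 1, 1, 3))
```

On the question's data this keeps rows 3, 4, 8, 9, 10 and 13 with occurrence numbers 1, 1, 2, 2, 2, 3, the same as the output above. Rows that come before the first pattern would be numbered 0.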

R for loop not working

I'm trying to use R to find the max value of each day for 1 to n days. My issue is that there are multiple values in each day. Here's my code. When I run it, I get an "incorrect number of dimensions" error.
Any suggestions?
Days <- unique(theData$Date) #Gets each unique Day
numDays <- length(Days)
Time <- unique(theData$Time) #Gets each unique time
numTime <- length(Time)
rowCnt <- 1
for (i in 1:numDays) #Do something for each individual day. In this case find max
{
temp <- which(theData[i]$Date == numDays[i])
temp <- theData[[i]][temp,]
High[rowCnt, (i-2)+2] <- max(temp$High) #indexing for when I print to CSV
rowCnt <- rowCnt + 1
}
Here's what the data looks like, except with 1 to n days and times:
Day Time Value
20130310 09:30:00 5
20130310 09:31:00 1
20130310 09:32:00 2
20130310 09:33:00 3
20130311 09:30:00 12
20130311 09:31:00 0
20130311 09:32:00 1
20130311 09:33:00 5
so this should return:
day time value
20130310 09:30:00 5
20130311 09:30:00 12
Any help would be greatly appreciated! Thanks!
Here is a solution using the plyr package:
mydata<-structure(list(Day = structure(c(2L, 2L, 2L, 2L, 3L, 3L, 3L,
3L), .Label = c("", "x", "y"), class = "factor"), Value = c(0L,
1L, 2L, 3L, 12L, 0L, 1L, 5L), Time = c(5L, 6L, 7L, 8L, 1L, 2L,
3L, 4L)), .Names = c("Day", "Value", "Time"), row.names = c(NA,
8L), class = "data.frame")
library(plyr)
ddply(mydata,.(Day),summarize,max.value=max(Value))
Day max.value
1 x 3
2 y 12
Updated1: If your day looks like, say, 10/02/2012 12:00:00 AM, then you need to use:
mydata$Day<-with(mydata,as.Date(Day, format = "%m/%d/%Y"))
ddply(mydata,.(Day),summarize,max.value=max(Value))
Updated2, as per the new data: if your day is like the one in your update, you don't need to convert anything. You can just use the following code:
mydata1<-structure(list(Day = c(20130310L, 20130310L, 20130310L, 20130310L,
20130311L, 20130311L, 20130311L, 20130311L), Time = structure(c(1L,
2L, 3L, 4L, 1L, 2L, 3L, 4L), .Label = c("9:30:00", "9:31:00",
"9:32:00", "9:33:00"), class = "factor"), Value = c(5L, 1L, 2L,
3L, 12L, 0L, 1L, 5L)), .Names = c("Day", "Time", "Value"), class = "data.frame", row.names = c(NA,
-8L))
ddply(mydata1,.(Day),summarize,Time=Time[which.max(Value)],max.value=max(Value))
Day Time max.value
1 20130310 9:30:00 5
2 20130311 9:30:00 12
If you want the time to appear in the output, then just use Time=Time[which.max(Value)] which gives the time at the maximum value.
This is a base function approach:
> do.call( rbind, lapply(split(dfrm, dfrm$Day),
function (df) df[ which.max(df$Value), ] ) )
Day Time Value
20130310 20130310 09:30:00 5
20130311 20130311 09:30:00 12
To explain what's happening it's good to learn to read R functions from the inside out (since they are often built around each other.) You wanted lines from a dataframe, so you would either need to build a numeric or logical vector that spanned the number of rows, .... or you can take the route I did and break the problem up by Day. That's what split does with dataframes. Then within each dataframe I applied a function, which.max to just a single day's subset of the data. Since I only got the results back from lapply as a list of dataframes, I needed to squash them back together, and the typical method for doing so is do.call(rbind, ...).
If I took the other route of making a vector for selection that applied to the whole dataframe I would use ave:
> dfrm[ with(dfrm, ave(Value, Day, FUN=function(v) v==max(v) ) ) , ]
Day Time Value
1 20130310 09:30:00 5
1.1 20130310 09:30:00 5
Huh? That's not right... What's the problem?
with(dfrm, ave(Value, Day, FUN=function(v) v==max(v) ) )
[1] 1 0 0 0 1 0 0 0
So despite asking for a logical vector with the "==" function, I got conversion to a numeric vector, something I still don't understand. But converting to logical outside that result I succeed again:
> dfrm[ as.logical( with(dfrm, ave(Value, Day,
FUN=function(v) v==max(v) ) ) ), ]
Day Time Value
1 20130310 09:30:00 5
5 20130311 09:30:00 12
Also note that the ave function (unlike tapply or aggregate) requires that you offer the function as a named argument with FUN=function(.). That is a common error I make. If you see the error message "unique() applies only to vectors", it seems to come out of the blue, but it means that ave tried to group by an argument that it expected to be discrete and you gave it a function.
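On the numeric result seen above: ave() writes the values returned by FUN back into a copy of its first argument, so the result keeps that argument's type and logical values are coerced to 0/1. The sketch below illustrates this on a hand-built copy of the example data; the name dfrm follows the answer, the rest is illustrative.

```r
# Hand-built copy of the example data
dfrm <- data.frame(
  Day   = rep(c(20130310L, 20130311L), each = 4),
  Time  = rep(c("09:30:00", "09:31:00", "09:32:00", "09:33:00"), times = 2),
  Value = c(5L, 1L, 2L, 3L, 12L, 0L, 1L, 5L)
)

# FUN returns logicals, but they are stored back into the integer Value vector
flag <- ave(dfrm$Value, dfrm$Day, FUN = function(v) v == max(v))
is.logical(flag)
# [1] FALSE

# Comparing with 1 (or wrapping in as.logical) restores a logical index
daily_top <- dfrm[flag == 1, ]
```

Here daily_top holds rows 1 and 5, the rows with each day's maximum.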
Unlike other programming languages, in R it is considered good practice to avoid using for loops. Instead try something like:
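index <- sapply(Days, function(x) {
  rows <- which(theData$Day == x)        # rows belonging to this day
  rows[which.max(theData$Value[rows])]   # row index of that day's maximum
})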
theData[index, c("Day", "Time", "Value")]
This means for each value of Days, find the maximum value of Value and return its index. Then you can select the rows and columns of interest.
I recommend reading the help documentation for apply(), lapply(), sapply(), tapply(), mapply() (I'm probably forgetting one of them…) and the plyr package.
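As a small illustration of two of those functions, here is a sketch on the example data from the question, assuming the columns are named Day, Time and Value. If only the daily maximum is needed (without the matching Time), tapply() or aggregate() is enough.

```r
# Example data from the question, typed in by hand
theData <- data.frame(
  Day   = rep(c(20130310L, 20130311L), each = 4),
  Time  = rep(c("09:30:00", "09:31:00", "09:32:00", "09:33:00"), times = 2),
  Value = c(5L, 1L, 2L, 3L, 12L, 0L, 1L, 5L)
)

# Named vector: one maximum per day
daily_max <- tapply(theData$Value, theData$Day, max)

# Same result as a data frame with columns Day and Value
daily_max_df <- aggregate(Value ~ Day, data = theData, FUN = max)
```

Both give 5 for 20130310 and 12 for 20130311.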

R - Select rows for random sample of column values?

How can I select all of the rows for a random sample of column values?
I have a dataframe that looks like this:
tag weight
R007 10
R007 11
R007 9
J102 11
J102 9
J102 13
J102 10
M942 3
M054 9
M054 12
V671 12
V671 13
V671 9
V671 12
Z990 10
Z990 11
That you can replicate using...
weights_df <- structure(list(tag = structure(c(4L, 4L, 4L, 1L, 1L, 1L, 1L,
3L, 2L, 2L, 5L, 5L, 5L, 5L, 6L, 6L), .Label = c("J102", "M054",
"M942", "R007", "V671", "Z990"), class = "factor"), value = c(10L,
11L, 9L, 11L, 9L, 13L, 10L, 3L, 9L, 12L, 12L, 14L, 5L, 12L, 11L,
15L)), .Names = c("tag", "value"), class = "data.frame", row.names = c(NA,
-16L))
I need to create a dataframe containing all of the rows from the above dataframe for two randomly sampled tags. Let's say tags R007 and M942 get selected at random; my new dataframe needs to look like this:
tag weight
R007 10
R007 11
R007 9
M942 3
How do I do this?
I know I can create a list of two random tags like this:
library(plyr)
tags <- ddply(weights_df, .(tag), summarise, count = length(tag))
set.seed(5464)
tag_sample <- tags[sample(nrow(tags),2),]
tag_sample
Resulting in...
tag count
4 R007 3
3 M942 1
But I just don't know how to use that to subset my original dataframe.
Is this what you want?
subset(weights_df, tag%in%sample(levels(tag),2))
If your data.frame is named dfrm, then this will select 100 random tags
dfrm[ sample(NROW(dfrm), 100), "tag" ] # possibly with repeats
If, on the other hand, you want a dataframe with the same columns (possibly with repeats):
samp <- dfrm[ sample(NROW(dfrm), 100), ] # leave the col name entry blank to get all
A third possibility... you want 100 distinct tags at random, with the probability not weighted by how often each tag occurs:
samp.tags <- unique(dfrm$tag)[ sample(length(unique(dfrm$tag)), 100) ]
Edit: With the revised question, one of these:
subset(dfrm, tag %in% c("R007", "M942") )
Or:
dfrm[dfrm$tag %in% c("R007", "M942"), ]
Or:
dfrm[grep("R007|M942", dfrm$tag), ]
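Putting the sampling and the subsetting together, a minimal sketch (on a hand-typed copy of the question's data, with illustrative variable names) is to sample from the distinct tags and then keep every row whose tag was picked. Which tags come out depends on the seed and the R version, so no expected output is shown.

```r
# Hand-typed copy of the data from the question's dput()
weights_df <- data.frame(
  tag   = c("R007", "R007", "R007", "J102", "J102", "J102", "J102", "M942",
            "M054", "M054", "V671", "V671", "V671", "V671", "Z990", "Z990"),
  value = c(10, 11, 9, 11, 9, 13, 10, 3, 9, 12, 12, 14, 5, 12, 11, 15)
)

set.seed(5464)

# Sample two distinct tags; each tag is equally likely regardless of its row count
picked <- sample(unique(weights_df$tag), 2)

# Keep all rows belonging to the sampled tags
sampled_rows <- weights_df[weights_df$tag %in% picked, ]
```

With the asker's own tag_sample data frame, the equivalent subset would be weights_df[weights_df$tag %in% tag_sample$tag, ].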
