Count number of events in the previous time period in R

I'm trying to create a variable (the made-up "events60" in the data below) that keeps a running count of the number of events in a past time window. In this example the window is 60 minutes, but it could be any arbitrary value; in other words, it keeps a tally of how many events occurred in the previous hour.
I'm making slow headway with cumsum, rle, diff and the like, but I'm certain there is a more elegant and quicker solution. It will be applied to a dataset of at least 30 million rows, so a loop is probably not very efficient.
Example data below in R format
structure(list(Performed_DT_TM = structure(c(1508310211, 1508312843,
1508322697, 1508331061, 1508331161, 1508331452, 1508332222, 1508332900,
1508333781, 1508334349, 1508337531, 1508341065, 1508343542, 1508346756,
1508363905, 1508371639, 1508388245, 1508402001, 1508413612, 1508430173,
1508445426, 1508453675), class = c("POSIXct", "POSIXt"), tzone = ""),
time_since_prev_obs = c(0, 43.8666666666667, 164.233333333333,
139.4, 1.66666666666667, 4.85, 12.8333333333333, 11.3, 14.6833333333333,
9.46666666666667, 53.0333333333333, 58.9, 41.2833333333333,
53.5666666666667, 285.816666666667, 128.9, 276.766666666667,
229.266666666667, 193.516666666667, 276.016666666667, 254.216666666667,
137.483333333333), events60 = c(0, 1, 0, 0, 1, 2, 3, 4, 5,
6, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c(NA,
-22L), class = "data.frame")
Any help greatly appreciated of course
Cheers
Norm

In base R you could do:
m <- outer(df$Performed_DT_TM,df$Performed_DT_TM,"-")
c(0,rowsum(as.numeric(m[lower.tri(m)]<3600),row(m)[lower.tri(m)]))
[1] 0 1 0 0 1 2 3 4 5 6 1 1 1 1 0 0 0 0 0 0 0 0
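If the timestamps are already sorted ascending (as in the example), a vectorised sketch using findInterval() computes the same running count without building the full pairwise matrix, which matters at 30 million rows; the events60_chk name is just for illustration:
t <- as.numeric(df$Performed_DT_TM)
# earlier events, minus those at or before (current time - 3600 s),
# i.e. earlier events strictly within the last 60 minutes
df$events60_chk <- seq_along(t) - 1L - findInterval(t - 3600, t)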

Since your dataset is huge, you can try a rolling join and then a non-equi join from data.table for speed:
setDT(DT)[, Performed_DT_TM := as.POSIXct(Performed_DT_TM, format="%Y-%m-%d %T")]
DT[, c("rn", "endtime") := .(.I, Performed_DT_TM - 60L*60L)]
DT[, Last60mins :=
DT[DT, on=.(Performed_DT_TM=endtime), roll=Inf, i.rn - x.rn - 1L]
]
DT[is.na(Last60mins), Last60mins := fcoalesce(Last60mins,
DT[.SD, on=.(Performed_DT_TM>=endtime, Performed_DT_TM<Performed_DT_TM), .N, by=.EACHI]$N)
]
DT
data:
library(data.table)
DT <- structure(list(Performed_DT_TM = structure(c(1508310211, 1508312843,
1508322697, 1508331061, 1508331161, 1508331452, 1508332222, 1508332900,
1508333781, 1508334349, 1508337531, 1508341065, 1508343542, 1508346756,
1508363905, 1508371639, 1508388245, 1508402001, 1508413612, 1508430173,
1508445426, 1508453675), class = c("POSIXct", "POSIXt"), tzone = ""),
time_since_prev_obs = c(0, 43.8666666666667, 164.233333333333,
139.4, 1.66666666666667, 4.85, 12.8333333333333, 11.3, 14.6833333333333,
9.46666666666667, 53.0333333333333, 58.9, 41.2833333333333,
53.5666666666667, 285.816666666667, 128.9, 276.766666666667,
229.266666666667, 193.516666666667, 276.016666666667, 254.216666666667,
137.483333333333), events60 = c(0, 1, 0, 0, 1, 2, 3, 4, 5,
6, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c(NA,
-22L), class = "data.frame")

Related

match every row whose `region_ID=0` with the rows whose `region_ID=1` and calculate a certain distance

I have a dataset that looks like the following:
structure(list(X = c(36, 37, 38, 39, 40, 41, 1, 2, 3, 4, 5, 6
), Y = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), region_ID = c(0,
0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1)), row.names = c(NA, -12L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x7fb8fc819ae0>)
I want to match every row whose region_ID=0 with the rows whose region_ID=1 and calculate
dist_to_r1 = sqrt((X - i.X)^2 + (Y - i.Y)^2)
where the i. prefix refers to the latter rows. I want to do this using data.table syntax.
I have been trying to do this with left joins, but couldn't make it work.
You want a full join, such that each of the six rows in region 0 is joined to each of the six rows in region 1?
In that case, you can simply set allow.cartesian = T:
data[, id:=1][region_ID==0][data[region_ID==1], on ="id", allow.cartesian=T][, dist_to_r1:=sqrt((X-i.X)^2 + (Y-i.Y)^2)][]
Edit: OP clarified that only the minimum distance to a point in region 0 is required. In this case, we can do something like this:
data[,id:=1]
region0 = data[region_ID==0]
# function that gets the minimum distance between two regions
get_min_dist <- function(region_a, region_b) {
region_a[region_b, on="id", allow.cartesian=T][,min(sqrt((X-i.X)^2 + (Y-i.Y)^2))]
}
# apply the function above to every region
data[,
.(min_dist_to_zero = get_min_dist(
region_a = region0,
region_b = data[region_ID == .BY$region_ID]
)),
by = region_ID]
Output:
region_ID min_dist_to_zero
1: 0 0
2: 1 30
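As a quick cross-check outside the join, here is a short sketch computing the same minimum distance directly (assuming just the two regions of the example data, stored in data):
r0 <- data[region_ID == 0]
r1 <- data[region_ID == 1]
# pairwise Euclidean distances between every region-0 and region-1 point
d <- outer(seq_len(nrow(r0)), seq_len(nrow(r1)),
function(i, j) sqrt((r0$X[i] - r1$X[j])^2 + (r0$Y[i] - r1$Y[j])^2))
min(d) # 30 for the example data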

if_else with haven_labelled column fails because of wrong class

I have the following data:
dat <- structure(list(value = structure(c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
label = "value: This is my label",
labels = c(`No` = 0, `Yes` = 1),
class = "haven_labelled"),
group = structure(c(1, 2, 1, 1, 2, 3, 3, 1, 3, 1, 3, 3, 1, 2, 3, 2, 1, 3, 3, 1),
label = "my group",
labels = c(first = 1, second = 2, third = 3),
class = "haven_labelled")),
row.names = c(NA, -20L),
class = c("tbl_df", "tbl", "data.frame"),
label = "test.sav")
As you can see, the data uses a special class from tidyverse's haven package, the so-called labelled columns.
Now I want to recode my initial value variable such that:
if group equals 1, value should stay the same, otherwise it should be missing
I was trying the following, but getting an error:
dat_new <- dat %>%
mutate(value = if_else(group != 1, NA, value))
# Error: `false` must be a logical vector, not a `haven_labelled` object
I got as far as understanding that if_else from dplyr requires the true and false arguments to be of the same class, and since there is no NA equivalent for class labelled (e.g. similar to NA_real_ for doubles), the code probably fails, right?
So, how can I recode my initial variables and preserve the labels?
I know I could change my code above and replace the if_else by R's base version ifelse. However, this deletes all labels and coerces the value column to a numeric one.
You can try dplyr::case_when for cases where group == 1. If no cases are matched, NA is returned:
dat %>% mutate(value = case_when(group == 1 ~ value))
You can create an NA value in the haven_labelled class with this ugly code:
haven::labelled(NA_real_, labels = attr(dat$value, "labels"))
I'd recommend writing a function for that, e.g.
labelled_NA <- function(value)
haven::labelled(NA_real_, labels = attr(value, "labels"))
and then the code for your mutate isn't quite so ugly:
dat_new <- dat %>%
mutate(value = if_else(group != 1, labelled_NA(value), value))
Then you get
> dat_new[1:5,]
# A tibble: 5 x 2
value group
<dbl+lbl> <dbl+lbl>
1 NA 1 [first]
2 NA 2 [second]
3 0 [No] 1 [first]
4 0 [No] 1 [first]
5 NA 2 [second]

Is there a way to count occurrences of a specific value for unique columns in a dataframe in R?

I am relatively new to R and have a dataframe (cn_data2) with several duplicated columns. It looks something like this:
Gene breast_cancer breast_cancer breast_cancer lung_cancer lung_cancer
myc 1 0 1 1 2
ARID1A 0 2 1 1 0
Essentially, the rows are genes and the columns are different types of cancers. What I want is to find, for each gene, the number of times a value (0, 1, or 2) occurs for each unique cancer type.
I have tried several things but haven't been able to achieve what I want. For example, cn_data2$count1 <- rowSums(cn_data == '1') gives me a column with the number of "1"s for each gene, but what I want is the number of "1"s for each individual disease.
Hope my question is clear! I appreciate any help, thank you!
structure(list(gene1 = structure(1:6, .Label = c("ACAP3", "ACTRT2",
"AGRN", "ANKRD65", "ATAD3A", "ATAD3B"), class = "factor"), glioblastoma_multiforme_Primary_Tumor = c(0,
0, 0, 0, 0, 0), glioblastoma_multiforme_Primary_Tumor.1 = c(-1,
-1, -1, -1, -1, -1), glioblastoma_multiforme_Primary_Tumor.2 = c(0,
0, 0, 0, 0, 0), glioblastoma_multiforme_Primary_Tumor.3 = c(2,
2, 2, 2, 2, 2), glioblastoma_multiforme_Primary_Tumor.4 = c(0,
0, 0, 0, 0, 0)), class = "data.frame", row.names = c(NA, 6L))
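One possible sketch (assuming the dput above is stored as cn_data2, and that the duplicated cancer columns differ only by the .1, .2, ... suffix R appends to make names unique): melt to long format, strip the suffix to recover the cancer type, then count how often each value occurs per gene and cancer:
library(data.table)
long <- melt(as.data.table(cn_data2), id.vars = "gene1",
variable.name = "cancer", value.name = "value")
long[, cancer := sub("\\.\\d+$", "", cancer)] # drop the .1, .2, ... suffix
counts <- long[, .N, by = .(gene1, cancer, value)] # occurrences of each value per gene and cancer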

Tried streamlining w/ SDCols - got "longer object length is not a multiple of shorter object length"

I have tried searching stackoverflow and google to get answers to my question, but I couldn't find anything that applied closely enough for me to be able to apply it. However, I'm very new to R, so it's likely that I may just need a little walking through it.
If I use the following code, it works just fine.
> dput(b)
structure(list(DUMP_END_SHIFT_DATE = structure(c(1420070400,
1420070400, 1420156800, 1420156800, 1420243200, 1420243200, 1420329600,
1420329600, 1420416000, 1420416000, 1420502400), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), QUANTITY_REPORTING = c(235, 219, 232,
219, 219, 219, 219, 219, 219, 219, 235), WTRECV = c(32.71, 32.71,
20.19, 33.42, 21.61, 21.61, 21.61, 20.19, 21.61, 20.19, 24.2),
LC12 = c(0, 0, 0, 94, 100, 100, 100, 0, 100, 0, 100), LC34 = c(0,
100, 0, 6, 0, 0, 0, 0, 0, 0, 0), LC5 = c(0, 0, 5, 0, 0, 0,
0, 5, 0, 5, 0), HIS = c(25, 0, 60, 0, 0, 0, 0, 60, 0, 60,
0), UC = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), IBC = c(75,
0, 35, 0, 0, 0, 0, 35, 0, 35, 0)), .Names = c("DUMP_END_SHIFT_DATE",
"QUANTITY_REPORTING", "WTRECV", "LC12", "LC34", "LC5", "HIS",
"UC", "IBC"), class = c("data.table", "data.frame"), row.names = c(NA,
-11L))
library(data.table)
b_daily <- b[,.(d_tons=sum(QUANTITY_REPORTING)),by=DUMP_END_SHIFT_DATE]
b_daily[,"d_WTRECV" := b[,.(d_WTRECV=sum(QUANTITY_REPORTING*WTRECV)),by=DUMP_END_SHIFT_DATE] [,.(round(d_WTRECV/d_tons, digits=2))]]
b_daily[,"d_LC12" := b[,.(d_LC12=sum(QUANTITY_REPORTING*LC12)),by=DUMP_END_SHIFT_DATE] [,.(round(d_LC12/d_tons, digits=2))]]
b_daily[,"d_LC34" := b[,.(d_LC34=sum(QUANTITY_REPORTING*LC34)),by=DUMP_END_SHIFT_DATE] [,.(round(d_LC34/d_tons, digits=2))]]
b_daily[,"d_LC5" := b[,.(d_LC5=sum(QUANTITY_REPORTING*LC5)),by=DUMP_END_SHIFT_DATE] [,.(round(d_LC5/d_tons, digits=2))]]
b_daily[,"d_HIS" := b[,.(d_HIS=sum(QUANTITY_REPORTING*HIS)),by=DUMP_END_SHIFT_DATE] [,.(round(d_HIS/d_tons, digits=2))]]
b_daily[,"d_UC" := b[,.(d_UC=sum(QUANTITY_REPORTING*UC)),by=DUMP_END_SHIFT_DATE] [,.(round(d_UC/d_tons, digits=2))]]
b_daily[,"d_IBC" := b[,.(d_IBC=sum(QUANTITY_REPORTING*IBC)),by=DUMP_END_SHIFT_DATE] [,.(round(d_IBC/d_tons, digits=2))]]
However, it seems very inelegant - I think I should be able to do this using .SD and .SDcols. I tried the following, just as a test case:
b_daily2 <- b[,lapply(.SD, function (x) sum(x*b[,QUANTITY_REPORTING])/sum(b[,QUANTITY_REPORTING])), by=DUMP_END_SHIFT_DATE, .SDcols=c("WTRECV")] [,.(DUMP_END_SHIFT_DATE,d_WTRECV=round(WTRECV, digits=2))]
The resulting numbers are a little off, and I get the following warning:
"In x * MQD[, QUANTITY_REPORTING] : longer object length is not a multiple of shorter object length"
I understand that this indicates recycling due to objects being different lengths...but I don't understand why or what. Any help would be much appreciated. I apologize in advance if this is an elementary question. Thank you.
This is arguably also inelegant, but at least fits into a single operation:
b_daily <- b[,{
d_tons = sum(QUANTITY_REPORTING)
d_WTRECV = round( sum(QUANTITY_REPORTING*WTRECV)/d_tons, digits = 2 )
list(d_tons = d_tons, d_WTRECV = d_WTRECV)
},by=DUMP_END_SHIFT_DATE]
If there are many columns like d_WTRECV, with names stored in cols = c("WTRECV",...), then...
cols <- c("WTRECV","LC12","LC34","LC5","HIS","UC","IBC")
b_daily2 <- b[,{
d_tons = sum(QUANTITY_REPORTING)
res = lapply(mget(cols), function(x)
round( sum(QUANTITY_REPORTING*x)/d_tons, digits = 2 )
)
c(list(d_tons = d_tons), setNames(res, paste0("d_",cols)))
},by=DUMP_END_SHIFT_DATE]
A similar approach using .SDcols will be possible when a bug related to it is fixed.
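For reference, a sketch of what that .SDcols form might look like once it works (assuming a sufficiently recent data.table; the b_daily3 name is just illustrative):
cols <- c("WTRECV","LC12","LC34","LC5","HIS","UC","IBC")
b_daily3 <- b[, c(list(d_tons = sum(QUANTITY_REPORTING)),
setNames(lapply(.SD, function(x) round(sum(QUANTITY_REPORTING * x) / sum(QUANTITY_REPORTING), 2)),
paste0("d_", cols))),
by = DUMP_END_SHIFT_DATE, .SDcols = cols]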
Aside. I think there is a feature request to allow for the first column to be used in computing the second, like
# NON-WORKING CODE:
b_daily <- b[,.(
d_tons = sum(QUANTITY_REPORTING),
d_WTRECV = round( sum(QUANTITY_REPORTING*WTRECV) / d_tons, digits = 2)
),by=DUMP_END_SHIFT_DATE]
This is how mutate in the dplyr package works. However, for your multicolumn case, dplyr is more of a hassle than a help, as far as I can figure.
By the way, you may want to wait on rounding. Usually, it's only a good idea for printing purposes and just unnecessarily worsens your later calculations.
I don't think there is a particularly elegant way to do this. Here's a quick take.
sdc <- c("WTRECV", "LC12", "LC34", "LC5", "HIS", "UC", "IBC")
b2 <- copy(b)
b2[, (sdc) := lapply(.SD, "*", b2[, QUANTITY_REPORTING]), .SDcols=sdc]
b_daily <- b2[, lapply(.SD, sum), by=DUMP_END_SHIFT_DATE]
data.table(
b_daily[, .(DUMP_END_SHIFT_DATE)],
b_daily[, lapply(lapply(.SD, "/", b_daily[,QUANTITY_REPORTING]), round, 2), .SDcols=sdc]
)

performing calculations on columns created from ddply

I am using ddply within subset to calculate some metrics and roll up a table as required. Some of the metrics I want to calculate need to use the summarized columns created as a result of the ddply operation.
Here is the function with the simple calculated columns:
subset_by_market <- function (q, marketname, dp) {
subset(ddply(df, .(quarter, R.DMA.NAMES, daypart, station), summarise,
spot.count = length(spot.id),
station.investment = sum(rate),
nullspots.male = sum(nullspot.male),
nullspots.allpersons = sum(nullspot.allpersons),
total.male.imp = sum(male.imp),
total.allpersons.imp = sum(allpersons.imp),
spotvalue.male = sum(spotvalue.male),
spotvalue.allpersons = sum(spotvalue.allpersons)),
quarter == q & R.DMA.NAMES == marketname & daypart == dp)
}
I use subset_by_market("Q32013", "Columbus.OH", "primetime") to create a summarized subset. My resulting table looks like:
quarter R.DMA.NAMES daypart station spot.count station.investment nullspots.male nullspots.allpersons
10186 Q32013 Columbus.OH primetime ADSM COLUMBUS, OH 103 5150 67 61
10187 Q32013 Columbus.OH primetime ESYX 49 0 49 49
10188 Q32013 Columbus.OH primetime MTV COLUMBUS, OH 61 4500 7 1
10189 Q32013 Columbus.OH primetime WCMH-Retro TV 94 564 93 93
10190 Q32013 Columbus.OH primetime WTTE 1 0 0 0
10191 Q32013 Columbus.OH primetime WWHO 9 0 2 2
total.male.imp total.allpersons.imp spotvalue.male spotvalue.allpersons
10186 47.2 127.7 4830.409 4775.1068
10187 0.0 0.0 NaN NaN
10188 157.9 371.1 4649.746 4505.2608
10189 0.3 0.3 3162.000 3162.0000
10190 3.5 10.3 570.166 591.0231
10191 3.9 15.8 7155.000 4356.4162
Question 1: I would like to add to the same data frame, e.g.: (i) percentage values of spot.count = spot.count / sum(spot.count) and (ii) percent.nullspots.male = nullspots.male / sum(nullspots.male).
However, when I add that to the ddply arguments, I get 1 (100%) in the resulting column. The value divides by itself instead of dividing by the sum of the column.
Question 2: This is slow, and I humbly accept this may not be optimal coding. I am using an i5 2.6GHz PC with 16GB DDR3 RAM and a 64-bit OS. The dataset is 1M rows.
system.time(subset_by_market ("Q32013" , "Albuquerque.Santa.Fe", "late fringe"))
user system elapsed
228.13 176.84 416.12
The intention is to visualize all calculated metrics on an online dashboard and allow the user to select the subset_by_market(q, marketname, dp) using drop-down menus. How can I make it faster?
ADDING SAMPLE DATA:
structure(list(market = c("Local", "Local", "Local", "Local",
"Local", "Local", "Local", "NATIONAL CABLE", "Local", "Local"
), spot.id = c(11248955L, 11262196L, 11946349L, 11625265L, 12929889L,
11259758L, 11517638L, 11599834L, 12527365L, 12930259L), date = structure(c(1375675200,
1376625600, 1390280400, 1383627600, 1401249600, 1375848000, 1380772800,
1383019200, 1397102400, 1401163200), class = c("POSIXct", "POSIXt"
), tzone = ""), hour = c(15, 17, 11, 18, 19, 1, 13, 14, 16, 22
), time = structure(c(0.642361111111111, 0.749305555555556, 0.481944444444444,
0.770138888888889, 0.830555555555556, 0.0597222222222222, 0.582638888888889,
0.597222222222222, 0.675694444444444, 0.930555555555556), format = "h:m:s", class = "times"),
local.date = structure(c(1375675200, 1376625600, 1390280400,
1383627600, 1401249600, 1375848000, 1380772800, 1383019200,
1397102400, 1401163200), class = c("POSIXct", "POSIXt"), tzone = ""),
local.hour = c(15, 17, 11, 18, 18, 0, 13, 14, 15, 22), local.time = structure(c(0.642361111111111,
0.749305555555556, 0.481944444444444, 0.770138888888889,
0.788888888888889, 0.0180555555555556, 0.582638888888889,
0.597222222222222, 0.634027777777778, 0.930555555555556), format = "h:m:s", class = "times"),
vendor = c("Time Warner - Myrtle Beach", "WMYD", "WSBK",
"WDCA", "Comcast - Memphis", "Charter Media - Birmingham",
"WBNA", "G4", "Comcast - Houston", "Comcast - Youngstown"
), station = c("VH-1 MYRTLE BEACH", "WMYD", "WSBK", "WDCA",
"COM MEMPHIS", "FX BIRMINGHAM", "WBNA", "G4", "SPK HOUSTON",
"COM YOUNGSTOWN CC"), male.imp = c(0, 2, 0, 0, 0.6, 0.4,
0, 0, 3.9, 0), women.imp = c(0, 2.5, 0, 2.5, 0.2, 0.6, 0,
0, 4.6, 0.6), allpersons.imp = c(0, 3.5, 0, 2.5, 0.8, 0.8,
0, 0, 7.8, 0.6), hh.imp = c(0, 8.5, 8, 64.5, 1.3, 2.9, 1.3,
15, 13.7, 1), isci = c("IT6140MB", "ITCD78DT", "IT6192BS",
"IT6170WD", "IT6173ME", "IT6162BI", "IT6155LO", "ITES13410",
"IT3917", "IT3921"), creative = c("Eugene Elbert (Bach. Tcom Eng. Tech) :60",
"The Problem Solvers (revised) - IET :60", "Murtech/Kinetic/Integra :60",
"Kevin Bumper/NTSG/Lifetime :60", "NCR/Schlumberger/Sprint (revised) (Bach) :60",
"Skills Gap (revised) /Kevin :60", "Rising Costs60 (Opportunity Scholar - No Nursing)",
"Irina Lund (Bach. ISS) :60", "Augustine Lopez (A. CEET) :30 (no loc)",
"John Ryan Ellis (B. PM/A. CDD) :30 (no loc)"), program = c(NA,
"TYLER PERRY'S MEET THE BROWNS", "THE PEOPLE'S COURT", "Judge Judy",
NA, NA, "Meet the Browns/Are We There Yet/News/Wendy Willia",
"HEROES", "Spike EF Rotator", NA), rate = c(5, 230, 100,
625, 40, 0, 15, 40, 110, 7), R.DMA.NAMES = c("Myrtle.Beach.Florence",
"Detroit", "Boston.Manchester.", "Washington.DC.Hagrstwn.",
"Memphis", "Birmingham.Ann.and.Tusc.", "Louisville", "NATIONAL CABLE",
"Houston", "Youngstown"), date.time = c("2013-08-05 15:25:00",
"2013-08-16 17:59:00", "2014-01-21 11:34:00", "2013-11-05 18:29:00",
"2014-05-28 19:56:00", "2013-08-07 01:26:00", "2013-10-03 13:59:00",
"2013-10-29 14:20:00", "2014-04-10 16:13:00", "2014-05-27 22:20:00"
), daypart = c("afternoon", "evening", "morning", "evening",
"evening", "late fringe", "afternoon", "afternoon", "afternoon",
"primetime"), quarter = structure(c(4L, 4L, 1L, 6L, 3L, 4L,
6L, 6L, 3L, 3L), .Label = c("Q12014", "Q22013", "Q22014",
"Q32013", "Q32014", "Q42013"), class = "factor"), cpi.allpersons = c(96.2179487179487,
79.0114068441065, 35.1219512195122, 82.3322348711803, 30,
0, 138.721804511278, 28.3135215453195, 28.2384088854449,
86.6666666666667), cpi.male = c(750.5, 188.882673751923,
115.959004392387, 144.492639327024, 38.9847715736041, 0,
595.161290322581, 34.7402005469462, 62.010777084515, 156.712328767123
), spotvalue.allpersons = c(0, 276.539923954373, 0, 205.830587177951,
24, 0, 0, 0, 220.25958930647, 52), spotvalue.male = c(0,
377.765347503846, 0, 0, 23.3908629441624, 0, 0, 0, 241.842030629609,
0), nullspot.allpersons = c(1, 0, 1, 0, 0, 0, 1, 1, 0, 0),
nullspot.male = c(1, 0, 1, 1, 0, 0, 1, 1, 0, 1)), .Names = c("market",
"spot.id", "date", "hour", "time", "local.date", "local.hour",
"local.time", "vendor", "station", "male.imp", "women.imp", "allpersons.imp",
"hh.imp", "isci", "creative", "program", "rate", "R.DMA.NAMES",
"date.time", "daypart", "quarter", "cpi.allpersons", "cpi.male",
"spotvalue.allpersons", "spotvalue.male", "nullspot.allpersons",
"nullspot.male"), row.names = c(561147L, 261262L, 89888L, 941010L,
500366L, 65954L, 484053L, 598996L, 380976L, 968615L), class = "data.frame")
Apologies for the ugly dput.
This answers only my second question, related to making the function faster. Based on @beginneR's tip, I converted the function to dplyr.
subset_by_market <- function (q, marketname, dp) {
subset(df %>% group_by(quarter, R.DMA.NAMES, daypart, station) %>%
summarize (spot.count = length(spot.id), station.investment = sum(rate),
nullspots.male = sum(nullspot.male),
nullspots.allpersons = sum(nullspot.allpersons),
total.male.imp = sum(male.imp),
total.allpersons.imp = sum(allpersons.imp),
spotvalue.male = sum(spotvalue.male),
spotvalue.allpersons = sum(spotvalue.allpersons),
male.imp.per.spot = total.male.imp / spot.count,
allpersons.imp.per.spot = total.allpersons.imp / spot.count,
cost.per.spot = station.investment / spot.count,
male.value.per.spot = spotvalue.male / spot.count,
allpersons.value.per.spot = spotvalue.allpersons / spot.count),
quarter == q & R.DMA.NAMES == marketname & daypart == dp) }
This reduced the time drastically to :
> system.time(subset_by_market ("Q32013" , "Albuquerque.Santa.Fe", "late fringe"))
user system elapsed
1.06 0.00 1.09
The glitch I faced in using dplyr was a column called "time" in my data, which was of class times from the package chron. I kept receiving the error Error: column 'local.time' has unsupported type. I couldn't figure out an exact workaround for this, so I simply converted it to POSIXct using df$time <- as.POSIXct(as.character(df$time, format = "%H:%M:%S")). This was not optimal because the reason I converted it to times using chron was to maintain the time chronology without needing the date or time zone. More on that here: work around to specifying date range in ifelse statement in R. However, it solves the immediate problem at hand.
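For the first question (the percentages coming out as 1), a possible sketch is to compute them in a second pass after the summarise, so the denominator is the sum over the summarised rows rather than the single within-group value; here the sums are taken across stations within each quarter/market/daypart, which is an assumption, and the percent.* names are illustrative:
df %>%
group_by(quarter, R.DMA.NAMES, daypart, station) %>%
summarize(spot.count = length(spot.id),
nullspots.male = sum(nullspot.male)) %>%
group_by(quarter, R.DMA.NAMES, daypart) %>% # regroup so the sums run across stations
mutate(percent.spot.count = spot.count / sum(spot.count),
percent.nullspots.male = nullspots.male / sum(nullspots.male)) %>%
ungroup()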
