Performing calculations on columns created from ddply - R

I am using ddply within subset to calculate some metrics and roll up a table as required. Some of the metrics I want to calculate need to use the summarized columns created as a result of the ddply operation.
Here is the function with the simple calculated columns:
subset_by_market <- function(q, marketname, dp) {
  subset(ddply(df, .(quarter, R.DMA.NAMES, daypart, station), summarise,
               spot.count = length(spot.id),
               station.investment = sum(rate),
               nullspots.male = sum(nullspot.male),
               nullspots.allpersons = sum(nullspot.allpersons),
               total.male.imp = sum(male.imp),
               total.allpersons.imp = sum(allpersons.imp),
               spotvalue.male = sum(spotvalue.male),
               spotvalue.allpersons = sum(spotvalue.allpersons)),
         quarter == q & R.DMA.NAMES == marketname & daypart == dp)
}
I use subset_by_market("Q32013", "Columbus.OH", "primetime") to summarise and create a subset. My resulting table looks like:
quarter R.DMA.NAMES daypart station spot.count station.investment nullspots.male nullspots.allpersons
10186 Q32013 Columbus.OH primetime ADSM COLUMBUS, OH 103 5150 67 61
10187 Q32013 Columbus.OH primetime ESYX 49 0 49 49
10188 Q32013 Columbus.OH primetime MTV COLUMBUS, OH 61 4500 7 1
10189 Q32013 Columbus.OH primetime WCMH-Retro TV 94 564 93 93
10190 Q32013 Columbus.OH primetime WTTE 1 0 0 0
10191 Q32013 Columbus.OH primetime WWHO 9 0 2 2
total.male.imp total.allpersons.imp spotvalue.male spotvalue.allpersons
10186 47.2 127.7 4830.409 4775.1068
10187 0.0 0.0 NaN NaN
10188 157.9 371.1 4649.746 4505.2608
10189 0.3 0.3 3162.000 3162.0000
10190 3.5 10.3 570.166 591.0231
10191 3.9 15.8 7155.000 4356.4162
Question 1: I would like to add percentage columns to the same data frame, e.g. (i) percent.spot.count = spot.count / sum(spot.count), (ii) percent.nullspots.male = nullspots.male / sum(nullspots.male).
However, when I add that to the ddply arguments, I get 1 (100%) in the resulting column. The value divides by itself instead of dividing by the sum of the column.
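A likely explanation and fix (a sketch, not tested on the full data): inside summarise each expression is evaluated per group, so after aggregation spot.count is a single value and sum(spot.count) equals spot.count, hence the column of 1s. Computing the percentages in a second grouped pass over the summarised result, grouped at the level the denominator should cover, avoids this:

```r
library(plyr)

# first pass: one row per station, as in the question
summarised <- ddply(df, .(quarter, R.DMA.NAMES, daypart, station), summarise,
                    spot.count = length(spot.id),
                    nullspots.male = sum(nullspot.male))

# second pass: denominators now span all stations within each quarter/market/daypart
result <- ddply(summarised, .(quarter, R.DMA.NAMES, daypart), mutate,
                percent.spot.count = spot.count / sum(spot.count),
                percent.nullspots.male = nullspots.male / sum(nullspots.male))
```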
Question 2: This is slow, and I humbly accept it may not be optimal coding. I am using an i5 2.6 GHz PC with 16 GB DDR3 RAM and a 64-bit OS. The dataset is 1M rows.
system.time(subset_by_market ("Q32013" , "Albuquerque.Santa.Fe", "late fringe"))
user system elapsed
228.13 176.84 416.12
The intention is to visualize all calculated metrics on an online dashboard and allow user to select the subset_by_market (q , marketname, dp) using drop-down menus. How can I make it faster?
ADDING SAMPLE DATA:
structure(list(market = c("Local", "Local", "Local", "Local",
"Local", "Local", "Local", "NATIONAL CABLE", "Local", "Local"
), spot.id = c(11248955L, 11262196L, 11946349L, 11625265L, 12929889L,
11259758L, 11517638L, 11599834L, 12527365L, 12930259L), date = structure(c(1375675200,
1376625600, 1390280400, 1383627600, 1401249600, 1375848000, 1380772800,
1383019200, 1397102400, 1401163200), class = c("POSIXct", "POSIXt"
), tzone = ""), hour = c(15, 17, 11, 18, 19, 1, 13, 14, 16, 22
), time = structure(c(0.642361111111111, 0.749305555555556, 0.481944444444444,
0.770138888888889, 0.830555555555556, 0.0597222222222222, 0.582638888888889,
0.597222222222222, 0.675694444444444, 0.930555555555556), format = "h:m:s", class = "times"),
local.date = structure(c(1375675200, 1376625600, 1390280400,
1383627600, 1401249600, 1375848000, 1380772800, 1383019200,
1397102400, 1401163200), class = c("POSIXct", "POSIXt"), tzone = ""),
local.hour = c(15, 17, 11, 18, 18, 0, 13, 14, 15, 22), local.time = structure(c(0.642361111111111,
0.749305555555556, 0.481944444444444, 0.770138888888889,
0.788888888888889, 0.0180555555555556, 0.582638888888889,
0.597222222222222, 0.634027777777778, 0.930555555555556), format = "h:m:s", class = "times"),
vendor = c("Time Warner - Myrtle Beach", "WMYD", "WSBK",
"WDCA", "Comcast - Memphis", "Charter Media - Birmingham",
"WBNA", "G4", "Comcast - Houston", "Comcast - Youngstown"
), station = c("VH-1 MYRTLE BEACH", "WMYD", "WSBK", "WDCA",
"COM MEMPHIS", "FX BIRMINGHAM", "WBNA", "G4", "SPK HOUSTON",
"COM YOUNGSTOWN CC"), male.imp = c(0, 2, 0, 0, 0.6, 0.4,
0, 0, 3.9, 0), women.imp = c(0, 2.5, 0, 2.5, 0.2, 0.6, 0,
0, 4.6, 0.6), allpersons.imp = c(0, 3.5, 0, 2.5, 0.8, 0.8,
0, 0, 7.8, 0.6), hh.imp = c(0, 8.5, 8, 64.5, 1.3, 2.9, 1.3,
15, 13.7, 1), isci = c("IT6140MB", "ITCD78DT", "IT6192BS",
"IT6170WD", "IT6173ME", "IT6162BI", "IT6155LO", "ITES13410",
"IT3917", "IT3921"), creative = c("Eugene Elbert (Bach. Tcom Eng. Tech) :60",
"The Problem Solvers (revised) - IET :60", "Murtech/Kinetic/Integra :60",
"Kevin Bumper/NTSG/Lifetime :60", "NCR/Schlumberger/Sprint (revised) (Bach) :60",
"Skills Gap (revised) /Kevin :60", "Rising Costs60 (Opportunity Scholar - No Nursing)",
"Irina Lund (Bach. ISS) :60", "Augustine Lopez (A. CEET) :30 (no loc)",
"John Ryan Ellis (B. PM/A. CDD) :30 (no loc)"), program = c(NA,
"TYLER PERRY'S MEET THE BROWNS", "THE PEOPLE'S COURT", "Judge Judy",
NA, NA, "Meet the Browns/Are We There Yet/News/Wendy Willia",
"HEROES", "Spike EF Rotator", NA), rate = c(5, 230, 100,
625, 40, 0, 15, 40, 110, 7), R.DMA.NAMES = c("Myrtle.Beach.Florence",
"Detroit", "Boston.Manchester.", "Washington.DC.Hagrstwn.",
"Memphis", "Birmingham.Ann.and.Tusc.", "Louisville", "NATIONAL CABLE",
"Houston", "Youngstown"), date.time = c("2013-08-05 15:25:00",
"2013-08-16 17:59:00", "2014-01-21 11:34:00", "2013-11-05 18:29:00",
"2014-05-28 19:56:00", "2013-08-07 01:26:00", "2013-10-03 13:59:00",
"2013-10-29 14:20:00", "2014-04-10 16:13:00", "2014-05-27 22:20:00"
), daypart = c("afternoon", "evening", "morning", "evening",
"evening", "late fringe", "afternoon", "afternoon", "afternoon",
"primetime"), quarter = structure(c(4L, 4L, 1L, 6L, 3L, 4L,
6L, 6L, 3L, 3L), .Label = c("Q12014", "Q22013", "Q22014",
"Q32013", "Q32014", "Q42013"), class = "factor"), cpi.allpersons = c(96.2179487179487,
79.0114068441065, 35.1219512195122, 82.3322348711803, 30,
0, 138.721804511278, 28.3135215453195, 28.2384088854449,
86.6666666666667), cpi.male = c(750.5, 188.882673751923,
115.959004392387, 144.492639327024, 38.9847715736041, 0,
595.161290322581, 34.7402005469462, 62.010777084515, 156.712328767123
), spotvalue.allpersons = c(0, 276.539923954373, 0, 205.830587177951,
24, 0, 0, 0, 220.25958930647, 52), spotvalue.male = c(0,
377.765347503846, 0, 0, 23.3908629441624, 0, 0, 0, 241.842030629609,
0), nullspot.allpersons = c(1, 0, 1, 0, 0, 0, 1, 1, 0, 0),
nullspot.male = c(1, 0, 1, 1, 0, 0, 1, 1, 0, 1)), .Names = c("market",
"spot.id", "date", "hour", "time", "local.date", "local.hour",
"local.time", "vendor", "station", "male.imp", "women.imp", "allpersons.imp",
"hh.imp", "isci", "creative", "program", "rate", "R.DMA.NAMES",
"date.time", "daypart", "quarter", "cpi.allpersons", "cpi.male",
"spotvalue.allpersons", "spotvalue.male", "nullspot.allpersons",
"nullspot.male"), row.names = c(561147L, 261262L, 89888L, 941010L,
500366L, 65954L, 484053L, 598996L, 380976L, 968615L), class = "data.frame")
Apologies for the ugly dput.

This answers only my second question, related to making the function faster. Based on @beginneR's tip, I converted the function to dplyr.
subset_by_market <- function(q, marketname, dp) {
  subset(df %>% group_by(quarter, R.DMA.NAMES, daypart, station) %>%
           summarize(spot.count = length(spot.id),
                     station.investment = sum(rate),
                     nullspots.male = sum(nullspot.male),
                     nullspots.allpersons = sum(nullspot.allpersons),
                     total.male.imp = sum(male.imp),
                     total.allpersons.imp = sum(allpersons.imp),
                     spotvalue.male = sum(spotvalue.male),
                     spotvalue.allpersons = sum(spotvalue.allpersons),
                     male.imp.per.spot = total.male.imp / spot.count,
                     allpersons.imp.per.spot = total.allpersons.imp / spot.count,
                     cost.per.spot = station.investment / spot.count,
                     male.value.per.spot = spotvalue.male / spot.count,
                     allpersons.value.per.spot = spotvalue.allpersons / spot.count),
         quarter == q & R.DMA.NAMES == marketname & daypart == dp)
}
This reduced the time drastically:
> system.time(subset_by_market ("Q32013" , "Albuquerque.Santa.Fe", "late fringe"))
user system elapsed
1.06 0.00 1.09
The glitch I faced in using dplyr was a column called "time" in my data, which was of class times from the chron package. I kept receiving the error Error: column 'local.time' has unsupported type. I couldn't figure out the exact workaround for this, so I simply converted the column to POSIXct using df$time <- as.POSIXct(as.character(df$time), format = "%H:%M:%S"). This was not optimal, because the reason I converted it to times with chron was to maintain the time chronology without needing the date or time zone. More on that here: work around to specifying date range in ifelse statement in R. However, it solves the immediate problem at hand.
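One further change worth trying (a sketch, untested on the full 1M-row data): since the function only returns one quarter/market/daypart anyway, filtering before grouping means summarize only touches the rows you keep, instead of aggregating every combination and then discarding most of the result:

```r
library(dplyr)

subset_by_market <- function(q, marketname, dp) {
  df %>%
    filter(quarter == q, R.DMA.NAMES == marketname, daypart == dp) %>%
    group_by(station) %>%
    summarize(spot.count = length(spot.id),
              station.investment = sum(rate),
              total.male.imp = sum(male.imp))  # remaining metrics as before
}
```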

Related

How do I combine data taken separately into a single dataset?

I have a dataset comprised of leaves which I've weighed individually in order of emergence (first emerged through final emergence), and I'd like to combine these masses so that I have the entire mass of all the leaves for each individual plant.
How would I add these up using R programming language, or what would I need to google to get started on figuring this out?
structure(list(Tray = c(1, 1, 1, 1, 1, 1), Plant = c(2, 2, 2,
2, 3, 3), Treatment = structure(c(4L, 4L, 4L, 4L, 4L, 4L), .Label = c("2TLH",
"E2TL", "EH", "WL"), class = "factor"), PreSwitch = c("Soil",
"Soil", "Soil", "Soil", "Soil", "Soil"), PostSwitch = c("Soil",
"Soil", "Soil", "Soil", "Soil", "Soil"), Pellet = c(1, 1, 1,
1, 1, 1), Rep = c(1, 1, 1, 1, 1, 1), Date = structure(c(1618963200,
1618963200, 1618963200, 1618963200, 1618963200, 1618963200), tzone = "UTC", class = c("POSIXct",
"POSIXt")), DAP = c(60, 60, 60, 60, 60, 60), Position = c(2,
1, 3, 4, 4, 3), Whorl = structure(c(1L, 1L, 2L, 2L, 2L, 2L), .Label = c("1",
"2", "3", "4", "5"), class = "factor"), PetioleLength = c(1.229,
1.365, 1.713, 1.02, 0, 1.408), BladeLength = c(1.604, 1.755,
2.466, 2.672, 0.267, 2.662), BladeWidth = c(1.023, 1.185, 1.803,
1.805, 0.077, 1.771), BladeArea = c(1.289, 1.634, 3.492, 3.789,
0.016, 3.704), BladePerimeter = c(6.721, 7.812, 11.61, 12.958,
1.019, 14.863), BladeCircularity = c(0.359, 0.336, 0.326, 0.284,
0.196, 0.211), BPR = c(1.30512611879577, 1.28571428571429, 1.43957968476357,
2.61960784313725, NA, 1.890625), Leaf.Mass = c(9, 11, 31, 33,
32, 33), BladeAR = c(1.56793743890518, 1.48101265822785, 1.36772046589018,
1.4803324099723, 3.46753246753247, 1.50310559006211), Subirrigation = c(0,
0, 0, 0, 0, 0), Genotype = c(1, 1, 1, 1, 1, 1), Location = c(0,
0, 0, 0, 0, 0)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
I may be missing something but isn't this a sum by Plant?
One solution below sums it for each plant into a separate table with just the totals and the second summarizes and adds it back to the main data set in a single step.
library(tidyverse)
#summary data set
plant_total <- df %>% group_by(Plant) %>% summarize(plant_weight = sum(Leaf.Mass, na.rm= TRUE))
#add plant_weight column to the df data set (assign to df, not plant_total, or the summary table above is overwritten)
df <- df %>% group_by(Plant) %>% mutate(plant_weight = sum(Leaf.Mass, na.rm = TRUE)) %>% ungroup()
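If the tidyverse isn't available, base R equivalents (a sketch) do the same two things:

```r
# summary table: one row per plant with its total leaf mass
plant_total <- aggregate(Leaf.Mass ~ Plant, data = df, FUN = sum)

# add the per-plant total as a new column aligned to each row of df
df$plant_weight <- ave(df$Leaf.Mass, df$Plant,
                       FUN = function(x) sum(x, na.rm = TRUE))
```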

Count number of events in the previous time period

I'm trying to create a variable (the made-up "events60" in the data below) that keeps a running count of the number of events in a past window (60 minutes in this example, but it could be any arbitrary value). So it keeps a tally of how many events occurred in the previous hour.
I'm making slow headway with cumsum, rle, diff, etc., but I'm certain there is a more elegant and quicker solution. It will be applied to a dataset of a minimum 30 million rows, so a loop is probably not very efficient.
Example data below in R format
structure(list(Performed_DT_TM = structure(c(1508310211, 1508312843,
1508322697, 1508331061, 1508331161, 1508331452, 1508332222, 1508332900,
1508333781, 1508334349, 1508337531, 1508341065, 1508343542, 1508346756,
1508363905, 1508371639, 1508388245, 1508402001, 1508413612, 1508430173,
1508445426, 1508453675), class = c("POSIXct", "POSIXt"), tzone = ""),
time_since_prev_obs = c(0, 43.8666666666667, 164.233333333333,
139.4, 1.66666666666667, 4.85, 12.8333333333333, 11.3, 14.6833333333333,
9.46666666666667, 53.0333333333333, 58.9, 41.2833333333333,
53.5666666666667, 285.816666666667, 128.9, 276.766666666667,
229.266666666667, 193.516666666667, 276.016666666667, 254.216666666667,
137.483333333333), events60 = c(0, 1, 0, 0, 1, 2, 3, 4, 5,
6, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c(NA,
-22L), class = "data.frame")
Any help greatly appreciated of course
Cheers
Norm
In base R you could do:
m <- outer(df$Performed_DT_TM,df$Performed_DT_TM,"-")
c(0,rowsum(as.numeric(m[lower.tri(m)]<3600),row(m)[lower.tri(m)]))
[1] 0 1 0 0 1 2 3 4 5 6 1 1 1 1 0 0 0 0 0 0 0 0
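Note that outer() builds an n-by-n matrix, which won't fit in memory for 30 million rows. A vectorised alternative using findInterval scales much better, assuming the timestamps are sorted ascending (a sketch; it reproduces the events60 column for this sample):

```r
x <- as.numeric(df$Performed_DT_TM)  # seconds since epoch, sorted ascending
# for row i there are (i - 1) earlier events; subtract those at or before x[i] - 3600
events60 <- seq_along(x) - 1L - findInterval(x - 3600, x)
```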
Since your dataset is huge, you can try a rolling join and then a non-equi join from data.table for speed:
setDT(DT)[, Performed_DT_TM := as.POSIXct(Performed_DT_TM, format="%Y-%m-%d %T")]
DT[, c("rn", "endtime") := .(.I, Performed_DT_TM - 60L*60L)]
DT[, Last60mins :=
DT[DT, on=.(Performed_DT_TM=endtime), roll=Inf, i.rn - x.rn - 1L]
]
DT[is.na(Last60mins), Last60mins := fcoalesce(Last60mins,
DT[.SD, on=.(Performed_DT_TM>=endtime, Performed_DT_TM<Performed_DT_TM), .N, by=.EACHI]$N)
]
DT
data:
library(data.table)
DT <- structure(list(Performed_DT_TM = structure(c(1508310211, 1508312843,
1508322697, 1508331061, 1508331161, 1508331452, 1508332222, 1508332900,
1508333781, 1508334349, 1508337531, 1508341065, 1508343542, 1508346756,
1508363905, 1508371639, 1508388245, 1508402001, 1508413612, 1508430173,
1508445426, 1508453675), class = c("POSIXct", "POSIXt"), tzone = ""),
time_since_prev_obs = c(0, 43.8666666666667, 164.233333333333,
139.4, 1.66666666666667, 4.85, 12.8333333333333, 11.3, 14.6833333333333,
9.46666666666667, 53.0333333333333, 58.9, 41.2833333333333,
53.5666666666667, 285.816666666667, 128.9, 276.766666666667,
229.266666666667, 193.516666666667, 276.016666666667, 254.216666666667,
137.483333333333), events60 = c(0, 1, 0, 0, 1, 2, 3, 4, 5,
6, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c(NA,
-22L), class = "data.frame")

ggplot2 is not plotting two groups

I am having a complete brain fart right now. Why is the following code not plotting two lines for each category of the variable Ford? To my mind, I have the variable mapped as a grouping variable and then an aesthetic (col) in both geom_point() and geom_line(). I feel like I'm just overlooking something very basic.
#libraries
library(tidyverse)
#data
structure(list(stressx = c(0, 0.33, 0.67, 1, 0, 0.33, 0.67, 1,
0, 0.33, 0.67, 1, 0, 0.33, 0.67, 1), visiblex = c(0, 0, 0, 0,
1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1), ford = c(0, 0, 0, 0, 0,
0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1), preds = c(0.166275511711196,
0.25404479263251, 0.344473748733258, 0.432243029654572, 0.417891216538386,
0.449861131692899, 0.482799832155125, 0.514769747309638, 0.166275511711196,
0.25404479263251, 0.344473748733258, 0.432243029654572, 0.417891216538386,
0.449861131692899, 0.482799832155125, 0.514769747309638), se.fit =
c(0.0216850668407667,
0.0140669010411715, 0.014932848560481, 0.0233259879905658,
0.0546688696666978,
0.034867400606124, 0.0282122239553816, 0.0418803253364085,
0.0216850668407667,
0.0140669010411715, 0.014932848560481, 0.0233259879905658,
0.0546688696666978,
0.034867400606124, 0.0282122239553816, 0.0418803253364085)), .Names =
c("stressx",
"visiblex", "ford", "preds", "se.fit"), out.attrs = structure(list(
dim = structure(c(4L, 2L, 2L), .Names = c("stressx", "visiblex",
"ford")), dimnames = structure(list(stressx = c("stressx=0.0000000",
"stressx=0.3333333", "stressx=0.6666667", "stressx=1.0000000"
), visiblex = c("visiblex=0", "visiblex=1"), ford = c("ford=0",
"ford=1")), .Names = c("stressx", "visiblex", "ford"))), .Names = c("dim",
"dimnames")), row.names = c(NA, -16L), class = "data.frame")
My plot
newdat %>%
mutate(visiblex=recode_factor(visiblex, `0`="Not Visible Minority",
`1`="Visible Minority"), ford=recode_factor(ford, `0`="Disapprove",
`1`="Approve"), stressx=recode_factor(stressx, `0`='Strongly disagree',
`0.33`='Somewhat disagree', `0.67`='Somewhat agree', `1`='Strongly agree'))
%>%
rename(Stress=stressx, Visible=visiblex, Ford=ford, Prob=preds) %>%
#filter(Ford=='Approve') %>%
ggplot(., aes(x=Stress, y=Prob, group=Ford))+
geom_point(aes(col=Ford))+
geom_line(aes(col=Ford))+
facet_wrap(~Visible)+
ylim(c(0,1))+
theme(axis.text.x=element_text(angle=45, vjust=0.5))
It's because you have identical data points for both levels of the factor variable Ford. I have modified your code slightly to show the data, and plotted it with geom_jitter instead of geom_point so that both sets of points are visible. Since the underlying data points are identical, the lines drawn through them also overlap, and only one of them is visible.
#libraries
library(tidyverse)
#data
newdat <- structure(
list(
stressx = c(0, 0.33, 0.67, 1, 0, 0.33, 0.67, 1,
0, 0.33, 0.67, 1, 0, 0.33, 0.67, 1),
visiblex = c(0, 0, 0, 0,
1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1),
ford = c(0, 0, 0, 0, 0,
0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1),
preds = c(
0.166275511711196,
0.25404479263251,
0.344473748733258,
0.432243029654572,
0.417891216538386,
0.449861131692899,
0.482799832155125,
0.514769747309638,
0.166275511711196,
0.25404479263251,
0.344473748733258,
0.432243029654572,
0.417891216538386,
0.449861131692899,
0.482799832155125,
0.514769747309638
),
se.fit =
c(
0.0216850668407667,
0.0140669010411715,
0.014932848560481,
0.0233259879905658,
0.0546688696666978,
0.034867400606124,
0.0282122239553816,
0.0418803253364085,
0.0216850668407667,
0.0140669010411715,
0.014932848560481,
0.0233259879905658,
0.0546688696666978,
0.034867400606124,
0.0282122239553816,
0.0418803253364085
)
),
.Names =
c("stressx",
"visiblex", "ford", "preds", "se.fit"),
out.attrs = structure(
list(
dim = structure(c(4L, 2L, 2L), .Names = c("stressx", "visiblex",
"ford")),
dimnames = structure(
list(
stressx = c(
"stressx=0.0000000",
"stressx=0.3333333",
"stressx=0.6666667",
"stressx=1.0000000"
),
visiblex = c("visiblex=0", "visiblex=1"),
ford = c("ford=0",
"ford=1")
),
.Names = c("stressx", "visiblex", "ford")
)
),
.Names = c("dim",
"dimnames")
),
row.names = c(NA, -16L),
class = "data.frame"
)
#my plot
data <- newdat %>%
mutate(
visiblex = recode_factor(visiblex, `0` = "Not Visible Minority",
`1` = "Visible Minority"),
ford = recode_factor(ford, `0` = "Disapprove",
`1` =
"Approve"),
stressx = recode_factor(
stressx,
`0` = 'Strongly disagree',
`0.33` =
'Somewhat disagree',
`0.67` = 'Somewhat agree',
`1` = 'Strongly agree'
)
) %>%
dplyr::rename(
Stress = stressx,
Visible = visiblex,
Ford = ford,
Prob = preds
)
# display data
data
#> Stress Visible Ford Prob se.fit
#> 1 Strongly disagree Not Visible Minority Disapprove 0.1662755 0.02168507
#> 2 Somewhat disagree Not Visible Minority Disapprove 0.2540448 0.01406690
#> 3 Somewhat agree Not Visible Minority Disapprove 0.3444737 0.01493285
#> 4 Strongly agree Not Visible Minority Disapprove 0.4322430 0.02332599
#> 5 Strongly disagree Visible Minority Disapprove 0.4178912 0.05466887
#> 6 Somewhat disagree Visible Minority Disapprove 0.4498611 0.03486740
#> 7 Somewhat agree Visible Minority Disapprove 0.4827998 0.02821222
#> 8 Strongly agree Visible Minority Disapprove 0.5147697 0.04188033
#> 9 Strongly disagree Not Visible Minority Approve 0.1662755 0.02168507
#> 10 Somewhat disagree Not Visible Minority Approve 0.2540448 0.01406690
#> 11 Somewhat agree Not Visible Minority Approve 0.3444737 0.01493285
#> 12 Strongly agree Not Visible Minority Approve 0.4322430 0.02332599
#> 13 Strongly disagree Visible Minority Approve 0.4178912 0.05466887
#> 14 Somewhat disagree Visible Minority Approve 0.4498611 0.03486740
#> 15 Somewhat agree Visible Minority Approve 0.4827998 0.02821222
#> 16 Strongly agree Visible Minority Approve 0.5147697 0.04188033
# plot the data
data %>%
#filter(Ford=='Approve') %>%
ggplot2::ggplot(data = .,
mapping = aes(x = Stress, y = Prob, group = Ford, colour = Ford)) +
ggplot2::geom_jitter() + # change this back geom_point()
ggplot2::geom_line() +
ggplot2::facet_wrap( ~ Visible) +
ggplot2::scale_y_continuous(limits = c(0, 1)) +
ggplot2::theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
Created on 2018-03-13 by the reprex package (v0.2.0).
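If jitter feels too noisy, another option (a sketch) is to map an extra aesthetic such as linetype, so coincident lines stay distinguishable without moving any points:

```r
data %>%
  ggplot(aes(x = Stress, y = Prob, group = Ford, colour = Ford, linetype = Ford)) +
  geom_point() +
  geom_line() +  # the dashed line's gaps let the solid line underneath show through
  facet_wrap(~ Visible)
```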

tapply(dat$`lagged Date`, INDEX = dat$Location, FUN = diff(dat$`lagged Date`))

Can someone explain me why this is not working?
tapply(dat$`lagged Date`, INDEX = dat$Location, FUN = diff(dat$`lagged Date`))
I receive the following error:
Error in match.fun(FUN) : 'diff(dat$lagged Date)' is not a
function, character or symbol
structure(list(`lagged Date` = structure(c(1466306880, 1466307060,
1466307240, 1466307420, 1466307600, 1466307780), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), Location = c(309, 309, 309, 309, 309,
309), Duration = c(0, 0, 0, 0, 0, 0), Latitude = c(53.50205667,
53.501915, 53.50183667, 53.50178833, 53.50184, 53.50186167),
Longitude = c(-3.354733333, -3.354096667, -3.353838333, -3.353673333,
-3.353711667, -3.353741667), `Number of Records` = c(1, 1,
1, 1, 1, 1), Speed = c(0.9, 0, 0, 0, 0, 0), `Sum of Var` = c(38,
38, 38, 38, 38, 38), check = c(0, 0, 0, 0, 0, 0)), .Names = c("lagged Date",
"Location", "Duration", "Latitude", "Longitude", "Number of Records",
"Speed", "Sum of Var", "check"), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
thank you!
I'm not sure what you want to achieve, but the error occurs because tapply expects FUN to be a function: diff(dat$`lagged Date`) is evaluated first and yields a difftime object, which match.fun cannot use. Passing diff itself as the FUN argument works and produces this output:
tapply(dat$`lagged Date`, INDEX = dat$Location, FUN = diff)
$`309`
Time differences in mins
[1] 3 3 3 3 3
If you want to convert the output into hours, you can do so by extracting the values from the resulting difftime list and converting those:
as.numeric(tapply(dat$`lagged Date`, INDEX = dat$Location, FUN = diff)[[1]], units = "hours")
Output then looks like this:
[1] 0.05 0.05 0.05 0.05 0.05
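The general rule: FUN must be a function object (or its name), never an already-evaluated call. An anonymous function also works, which is handy when diff needs extra arguments:

```r
# equivalent to FUN = diff; the wrapper lets you pass arguments such as lag
tapply(dat$`lagged Date`, INDEX = dat$Location, FUN = function(x) diff(x, lag = 1))
```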

How to properly index list items to return rows, not columns, inside a for loop

I'm trying to write a for loop within another for loop. The first loop grabs the ith vcov matrix from a list of variously sized matrices (vcmats below) and grabs a frame of 24 predictor models of appropriate dimension to multiply with the current vcov matrix from a list of frames (jacobians below) for the different models. The second loop should pull the jth record (row) from the selected predictor frame, correctly format it, then run the calculation with the vcov matrix and output an indicator variable and calculated result needed for post processing to the holding table (holdtab).
When I run the code below I get the following error: Error in jjacob[, 1:4] : incorrect number of dimensions because R is returning the column of 1s (i.e. the intercept column of jacobs), not the complete first record (i.e. jjacob = jacobs[1,]). I've substantially simplified the example but left enough complexity to demonstrate the problem. I would appreciate any help in resolving this issue.
vcmats <- list(structure(c(0.67553, -0.1932, -0.00878, -0.00295, -0.00262,
-0.00637, -0.1932, 0.19988, 0.00331, -0.00159, 0.00149, 2e-05,
-0.00878, 0.00331, 0.00047, -6e-05, 3e-05, 3e-05, -0.00295, -0.00159,
-6e-05, 0.00013, -2e-05, 6e-05, -0.00262, 0.00149, 3e-05, -2e-05,
2e-05, 0, -0.00637, 2e-05, 3e-05, 6e-05, 0, 0.00026), .Dim = c(6L,
6L)), structure(c(0.38399, -0.03572, -0.00543, -0.00453, -0.00634,
-0.03572, 0.10912, 0.00118, -0.00044, 0.00016, -0.00543, 0.00118,
0.00042, -3e-05, 4e-05, -0.00453, -0.00044, -3e-05, 0.00011,
5e-05, -0.00634, 0.00016, 4e-05, 5e-05, 0.00025), .Dim = c(5L,
5L)))
jacobians <- list(structure(list(intcpt = c(1, 1, 1, 1), species = c(1, 1,
0, 0), nage = c(6, 6, 6, 6), T = c(12, 50, 12, 50), hgt = c(90,
90, 90, 90), moon = c(7, 7, 7, 7), hXm = c(0, 0, 0, 0), covr = c(0,
0, 0, 0), het = c(0, 0, 0, 0)), .Names = c("intcpt", "species",
"nage", "T", "hgt", "moon", "hXm", "covr", "het"), row.names = c("1",
"1.4", "1.12", "1.16"), class = "data.frame"), structure(list(
intcpt = c(1, 1, 1, 1), species = c(1, 1, 0, 0), nage = c(6,
6, 6, 6), T = c(12, 50, 12, 50), hgt = c(0, 0, 0, 0), moon = c(7,
7, 7, 7), hXm = c(0, 0, 0, 0), covr = c(0, 0, 0, 0), het = c(0,
0, 0, 0)), .Names = c("intcpt", "species", "nage", "T", "hgt",
"moon", "hXm", "covr", "het"), row.names = c("2", "2.4", "2.12",
"2.16"), class = "data.frame"))
holdtab <- structure(list(model = structure(c(4L, 4L, 4L, 4L, 5L, 5L, 5L,
5L), .Label = c("M.1.BaseCov", "M.2.Height", "M.5.Height.X.LastNewMoon",
"M.6.Height.plus.LastNew", "M.7.LastNewMoon", "M.G.Global"), class = "factor"),
aicc = c(341.317, 341.317, 341.317, 341.317, 342.1412, 342.1412,
342.1412, 342.1412), species = c(NA, NA, NA, NA, NA, NA,
NA, NA), condVar = c(NA, NA, NA, NA, NA, NA, NA, NA)), .Names = c("model",
"aicc", "species", "condVar"), row.names = c(1L, 2L, 3L, 4L,
25L, 26L, 27L, 28L), class = "data.frame")
jloop <- 1
for (imat in vcmats) { # Call the outside loop of vcov matrices
jacobs = jacobians[[jloop]] # Set tempvar jacobs as the jth member of the jacobians frame (n/24)
for (jjacob in jacobs) { # Call inside loop of lines in jacob (each individual set of predictor levels)
# I need to reduce the vector length to match my vcov matrix so
pt1 = jjacob[,1:4] # Separate Core columns from variable columns (because I don't want to drop species when ==0)
pt2 = jjacob[,5:9] # Pull out variable columns for next step
pt2 = pt2[,!apply(pt2 == 0, 2, all)] # Drop any variable columns that ==0
jjacob = cbind(pt1, pt2) # Reconstruct the record now of correct dimensions for the relevant vcov matrix
jjacob = as.matrix(jjacob) # Explicitly convert jjmod - I was having trouble with this previously
tj = (t(jjacob)) # Transpose the vector
condvar = jjacob %*% imat %*% tj # run the calculation
condVarTab[record,3] = jjacob[2] # Write species 0 or 1 to the output table
condVarTab[record,4] = condvar # Write the conditional variance to the table
record = record+1 # Iterate the record number for the next output run
}
jloop = jloop+1 # Once all 24 models in a frame are calculated iterate to the next frame of models which will be associated with a new vcv matrix
}
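The immediate cause of the error: for (jjacob in jacobs) iterates over the columns of a data frame (a data frame is a list of columns), so jjacob is the intercept column, a plain vector with no dimensions, and jjacob[,1:4] fails. Looping over row indices instead keeps each record as a 1-row data frame (a sketch of the inner loop; variable names as in the question, output bookkeeping elided):

```r
for (j in seq_len(nrow(jacobs))) {
  jjacob <- jacobs[j, , drop = FALSE]             # one record as a 1-row data frame
  pt1 <- jjacob[, 1:4]                            # core columns
  pt2 <- jjacob[, 5:9]                            # variable columns
  pt2 <- pt2[, !apply(pt2 == 0, 2, all), drop = FALSE]  # drop all-zero variable columns
  jjacob <- as.matrix(cbind(pt1, pt2))            # 1 x k matrix matching the vcov dimension
  condvar <- jjacob %*% imat %*% t(jjacob)        # 1 x 1 conditional variance
  # write jjacob[2] (species flag) and condvar to the holding table as before
}
```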

Resources