tapply(dat$`lagged Date`, INDEX = dat$Location, FUN = diff(dat$`lagged Date`))

Can someone explain to me why this is not working?
tapply(dat$`lagged Date`, INDEX = dat$Location, FUN = diff(dat$`lagged Date`))
I receive the following error:
Error in match.fun(FUN) : 'diff(dat$lagged Date)' is not a
function, character or symbol
structure(list(`lagged Date` = structure(c(1466306880, 1466307060,
1466307240, 1466307420, 1466307600, 1466307780), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), Location = c(309, 309, 309, 309, 309,
309), Duration = c(0, 0, 0, 0, 0, 0), Latitude = c(53.50205667,
53.501915, 53.50183667, 53.50178833, 53.50184, 53.50186167),
Longitude = c(-3.354733333, -3.354096667, -3.353838333, -3.353673333,
-3.353711667, -3.353741667), `Number of Records` = c(1, 1,
1, 1, 1, 1), Speed = c(0.9, 0, 0, 0, 0, 0), `Sum of Var` = c(38,
38, 38, 38, 38, 38), check = c(0, 0, 0, 0, 0, 0)), .Names = c("lagged Date",
"Location", "Duration", "Latitude", "Longitude", "Number of Records",
"Speed", "Sum of Var", "check"), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
thank you!

I'm not sure what you want to achieve, but the error occurs because tapply expects FUN to be a function (or the name of one): diff(dat$`lagged Date`) is evaluated first and hands tapply a difftime object rather than a function. Using only diff as the FUN argument works and produces this output:
tapply(dat$`lagged Date`, INDEX = dat$Location, FUN = diff)
$`309`
Time differences in mins
[1] 3 3 3 3 3
If you want to convert the output into hours, you can do that by selecting the values of the difftime list object and converting those:
as.numeric(tapply(dat$`lagged Date`, INDEX = dat$Location, FUN = diff)[[1]], units = "hours")
Output then looks like this:
[1] 0.05 0.05 0.05 0.05 0.05
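If you need the conversion for every group rather than just the first one, a small sketch (assuming the same dat as in the dput above):
# Compute the per-group differences once, then convert each group's difftime to hours
diffs <- tapply(dat$`lagged Date`, INDEX = dat$Location, FUN = diff)
lapply(diffs, as.numeric, units = "hours")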

Related

save list elements into separated dfs in R

I have a list containing 180,000 elements, each of which represents data about an investor and a specific traded asset.
I want to save each element of the list as its own data frame in the folder "dev/test-data/investors-singleass/", so that I can later apply a specific function to all the data frames in that folder.
My list has a structure similar to this:
list(`4Z627.004125` = structure(list(investor = c("4Z627", "4Z627",
"4Z627"), asset = c("004125", "004125", "004125"), datetime = c("2015-05-12",
"2015-05-28", "2016-08-19"), Avgprice = c(169.4, 168, 162), operation = c(2000,
1000, -3000), portfolio = c(2000, 3000, 0), last_port = c(0,
2000, 3000), marketprice = c(169.4, 166.5, 161.75), portprice = c(169.4,
168.933333333333, 0), G = c(0, 0, 1), gainminus = c(2, 0, 0),
numasset = c(5, 8, 13)), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame"))
Basically, each element of the list corresponds to an "investor" ID and an "asset" code, for which I then have multiple other columns to work with.
I would do it like this, based on link:
df1 <- list(`4Z627.004125` = structure(list(investor = c("4Z627", "4Z627",
"4Z627"), asset = c("004125", "004125", "004125"),
datetime = c("2015-05-12", "2015-05-28", "2016-08-19"),
Avgprice = c(169.4, 168, 162),
operation = c(2000, 1000, -3000), portfolio = c(2000, 3000, 0),
last_port = c(0,2000, 3000), marketprice = c(169.4, 166.5, 161.75),
portprice = c(169.4,
168.933333333333, 0), G = c(0, 0, 1), gainminus = c(2, 0, 0),
numasset = c(5, 8, 13)), row.names = c(NA, -3L),
class = c("tbl_df", "tbl", "data.frame")),
`4Z628.004128` = structure(list(investor = c("4Z627", "4Z627",
"4Z627"), asset = c("004125", "004125", "004125"),
datetime = c("2015-05-12", "2015-05-28", "2016-08-19"),
Avgprice = c(169.4, 168, 162),
operation = c(2000, 1000, -3000), portfolio = c(2000, 3000, 0),
last_port = c(0,2000, 3000), marketprice = c(169.4, 166.5, 161.75),
portprice = c(169.4,
168.933333333333, 0), G = c(0, 0, 1), gainminus = c(2, 0, 0),
numasset = c(5, 8, 13)), row.names = c(NA, -3L),
class = c("tbl_df", "tbl", "data.frame")))
library(purrr)
iwalk(df1, ~saveRDS(.x, paste0("dev/test-data/investors-singleass/", .y, '.RData')))
You can get the data back into R with
library(dplyr)
df <- list.files(path = "dev/test-data/investors-singleass/", pattern = "\\.RData$", full.names = TRUE) %>%
  map_dfr(readRDS)
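If you instead want to apply a function to each saved data frame separately, one possible sketch (my_fun is a placeholder for whatever per-investor computation you need):
library(purrr)
# List the saved files with their full paths and apply the (hypothetical) function to each
files <- list.files("dev/test-data/investors-singleass/", pattern = "\\.RData$", full.names = TRUE)
results <- map(files, ~ my_fun(readRDS(.x)))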

stargazer package producing empty summary table

I am having an issue with the stargazer package. I have downloaded the newest version of R (4.1.3) for Mac, but when I try to get a summary statistics table with the stargazer package, the table comes up empty. Here is what I have written in my script; I have downloaded stargazer and all other necessary packages, I think.
stargazer(econ.dta[c("responses", "price", "offers", "bestoffer", "meanoffer", "name", "polite", "black", "tattoo", "white")], type = "text",
          summary.stat = c("mean", "sd", "median", "min", "max"))
data
structure(list(ad = structure(c(55, 59, 60, 65, 66, 67), label = "Ad ID", format.stata = "%8.0g"), responses = structure(c(0, 1, 2, 0, 5, 2), label = "Number of responses", format.stata = "%9.0g"), offers = structure(c(0, 1, 0, 0, 2, 1), label = "Number of offers", format.stata = "%9.0g"), bestoffer = structure(c(NA, 95, NA, NA, 90, 75), label = "Best offer", format.stata = "%9.0g"), meanoffer = structure(c(NA, 95, NA, NA, 82.5, 75), label = "Mean offer", format.stata = "%9.0g"),
name = structure(c(NA, 1, 1, NA, 0.200000002980232, 0), label = "Incl.\\ name", format.stata = "%8.0g"), polite = structure(c(NA, 1, 1, NA, 0.600000023841858, 0), label = "Polite", format.stata = "%8.0g"), price = structure(c(130, 110, 90, 110, 90, 110), label = "Asking price", format.stata = "%8.0g"), texttype = structure(c(0, 0, 0, 0, 0, 0), label = "Text series", format.stata = "%9.0g", labels = c(A = 0, B = 1, C = 2), class = c("haven_labelled", "vctrs_vctr", "double")),
black = structure(c(0, 0, 1, 0, 1, 1), label = "Black", format.stata = "%9.0g", labels = c(Other = 0, Black = 1), class = c("haven_labelled", "vctrs_vctr", "double" )), tattoo = structure(c(0, 0, 0, 0, 0, 0), label = "Tattoo", format.stata = "%9.0g", labels = c(Other = 0, Tattoo = 1), class = c("haven_labelled", "vctrs_vctr", "double" )), white = structure(c(1, 1, 0, 1, 0, 0), label = "White", format.stata = "%9.0g", labels = c(Other = 0, White = 1), class = c("haven_labelled", "vctrs_vctr", "double"))), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame" ))
There are two issues going on. The first is that some of the variables are actually of class "haven_labelled", which doesn't play nicely with stargazer, so we need to "zap" the labels:
econ.dta <- haven::zap_labels(econ.dta)
The second is that the stargazer package doesn't seem to work well with tibbles, so the data needs to be converted to a data frame:
stargazer(as.data.frame(econ.dta[c("responses", "price", "offers", "bestoffer", "meanoffer", "name", "polite", "black", "tattoo", "white")]),
          type = "text",
          summary.stat = c("mean", "sd", "median", "min", "max"))
===============================================
Statistic Mean St. Dev. Median Min Max
-----------------------------------------------
responses 1.667 1.862 1.5 0 5
price 106.667 15.055 110 90 130
offers 0.667 0.816 0.5 0 2
bestoffer 86.667 10.408 90 75 95
meanoffer 84.167 10.104 82.500 75.000 95.000
name 0.550 0.526 0.600 0.000 1.000
polite 0.650 0.473 0.800 0.000 1.000
black 0.500 0.548 0.5 0 1
tattoo 0.000 0.000 0 0 0
white 0.500 0.548 0.5 0 1
-----------------------------------------------
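For reference, both fixes can be combined in a single call; a sketch, assuming econ.dta is the tibble from the dput above:
library(haven)
library(stargazer)
vars <- c("responses", "price", "offers", "bestoffer", "meanoffer",
          "name", "polite", "black", "tattoo", "white")
# Zap the haven labels and coerce to a plain data frame before handing off to stargazer
stargazer(as.data.frame(zap_labels(econ.dta[vars])),
          type = "text",
          summary.stat = c("mean", "sd", "median", "min", "max"))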

Issue with loop: argument of length 0

I've been working on this loop and have ended up with the error below. I'm not sure if I can provide data; if needed, I'll do my best to obfuscate it. Here is the loop I am trying to use; any tips on what I'm doing wrong would be greatly appreciated, as I haven't found a viable solution yet. The exact error is below the code.
decay_function = function(df)
{
df <- df[order(df$department,df$product,df$region,df$monthnum),]
for(mk in 1:ncol(levels_department)) {
newdata <- df[which(df$department==as.character(levels_department[,mk])), ]
levels_product<-as.data.frame(t(levels(as.factor(newdata$product))))
for(md in 1:ncol(levels_product)){
newdata <- newdata[which(newdata$product==as.character(levels_product[,md])), ]
levels_region<-as.data.frame(t(levels(as.factor(newdata$region))))
for(dm in 1:ncol(levels_region)){
newdata <- newdata[which(newdata$region==as.character(levels_region[,dm])), ]
for(i in 1:(nrow(newdata)-1)){
start_month = newdata$monthnum[i]
end_month = newdata$monthnum[nrow(newdata)]
row_vector = c()
decay_vector = c()
for(j in 5:ncol(newdata)){
k = 0
for(l in start_month:end_month){
distance_initial = (l - start_month)
vector_increment = (l - (start_month-1))
decay_rate = (0.5)^((1/halflife)*distance_initial)
decay_value = (decay_rate)*(newdata[[i,j]])
k = k + decay_value
}
df2[i,j] = k
}
print(df2)
}
if (mk=='1' & md=='1' & dm=='1'){
outdata<-df2
} else {
outdata<-rbind(outdata,df2)
}
}
}
}
}
output_data = decay_function(tempone)
Error in start_month:end_month : argument of length 0
> dput(head(df))
structure(list(monthnum = c(33, 33, 33, 33, 33, 33), Region = c(2251,
2251, 2251, 2251, 2251, 2251), Department = c("Softlines", "Softlines",
"Softlines", "Softlines", "Softlines", "Softlines"), Product = c("T-Shirt",
"Jacket", "Sweat Shirt", "Tank Top", "Sweat Pants", "Mens Jeans"
), Incentive_Amount = c(5742.43, 108006.61, 459076.67, 34006,
141632.42, 29580.38), Leads_T1 = c(0, 0, 0, 0, 0, 0), DCLeads = c(0,
1, 0, 0, 0, 0), PhoneLeads = c(0, 0, 0, 0, 0, 0), T3_CRM_Leads = c(0,
0, 0, 0, 0, 0), Leads_Third = c(0, 1, 0, 0, 0, 0)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -6L))

mapply through columns and rows at the same time in R

I have a data frame column with birth dates and one vector with 1200 date periods (valuation dates). I want to calculate the age at each of the 1200 valuation dates for every birth date. I have a solution, but I think it is awkward.
Age function:
age <- function(born_date, valuation_date){
if(any(!inherits(born_date, c("POSIXt", "POSIXct", "POSIXlt", "Date"))) |
any(!inherits(valuation_date, c("POSIXt", "POSIXct", "POSIXlt", "Date"))))
stop("Born date or valuation date is not in POSIXt or Date format")
round(as.double(valuation_date - born_date)/365.25)
}
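A quick sanity check of age() with two hypothetical dates:
# Roughly 37.2 years between these dates, rounded to 37
age(as.Date("1980-06-15"), as.Date("2017-08-31"))
#> [1] 37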
Birth dates in df$FNACIMI:
df <- structure(list(FNACIMI = structure(c(-4216, -188, -4124, -2750,
-2113, -95, 3840, -3629, 2321, 1629, -7547, -2883, -9987, -5078,
-895, -848, 3069, -5134, -3778, -3647, -1573, 2156, -3250, -8277,
-2516, -1829, -2448, -4658, -5731, -3554), class = "Date"), PRIMA_ANUAL = c(360,
0, 0, 0, 0, 720, 720, 100, 1200, 1200, 0, 1020, 0, 0, 0, 2100,
0, 0, 1200, 3000, 0, 0, 0, 8000, 1500, 0, 360, 0, 0, 0)), .Names = c("FNACIMI",
"PRIMA_ANUAL"), row.names = c(NA, -30L), class = "data.frame")
My solution is:
flowDates <- function(i) {
  mapply(fluxDates, df$FNACIMI[1:30], i)
}
fluxDates <- function(x, y) {
  dates <- seq(as.Date("2017-08-31"), by = "month", length.out = 1200)
  mapply(age, x, dates[y])
}
result <- sapply(1:1200, flowDates)
As we can see, there is a mapply inside another mapply and inside a sapply. Is there any more elegant way to do that?
Thank you!
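One possibly simpler equivalent (a sketch, assuming the same age() function and the same 1200 monthly valuation dates starting 2017-08-31): subtraction of Dates is vectorised, so outer() can build the whole 30 x 1200 age matrix in one call.
valuation_dates <- seq(as.Date("2017-08-31"), by = "month", length.out = 1200)
# Rows correspond to birth dates, columns to valuation dates
ages <- outer(df$FNACIMI, valuation_dates, age)
dim(ages)  # 30 1200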

performing calculations on columns created from ddply

I am using ddply within subset to calculate some metrics and roll up a table as required. Some of the metrics I want to calculate need to use the summarized columns created as a result of the ddply operation.
Here is the function with the simple calculated columns:
subset_by_market <- function (q, marketname, dp) {
subset(ddply(df, .(quarter, R.DMA.NAMES, daypart, station), summarise,
spot.count = length(spot.id),
station.investment = sum(rate),
nullspots.male = sum(nullspot.male),
nullspots.allpersons = sum(nullspot.allpersons),
total.male.imp = sum(male.imp),
total.allpersons.imp = sum(allpersons.imp),
spotvalue.male = sum(spotvalue.male),
spotvalue.allpersons = sum(spotvalue.allpersons)),
quarter == q & R.DMA.NAMES == marketname & daypart == dp)
}
I use subset_by_market("Q32013", "Columbus.OH", "primetime") to summarize and create a subset. My resulting table looks like this:
quarter R.DMA.NAMES daypart station spot.count station.investment nullspots.male nullspots.allpersons
10186 Q32013 Columbus.OH primetime ADSM COLUMBUS, OH 103 5150 67 61
10187 Q32013 Columbus.OH primetime ESYX 49 0 49 49
10188 Q32013 Columbus.OH primetime MTV COLUMBUS, OH 61 4500 7 1
10189 Q32013 Columbus.OH primetime WCMH-Retro TV 94 564 93 93
10190 Q32013 Columbus.OH primetime WTTE 1 0 0 0
10191 Q32013 Columbus.OH primetime WWHO 9 0 2 2
total.male.imp total.allpersons.imp spotvalue.male spotvalue.allpersons
10186 47.2 127.7 4830.409 4775.1068
10187 0.0 0.0 NaN NaN
10188 157.9 371.1 4649.746 4505.2608
10189 0.3 0.3 3162.000 3162.0000
10190 3.5 10.3 570.166 591.0231
10191 3.9 15.8 7155.000 4356.4162
Question 1: I would like to add to the same data frame, for example: (i) percent.spot.count = spot.count / sum(spot.count), and (ii) percent.nullspots.male = nullspots.male / sum(nullspots.male).
However, when I add that to the ddply arguments, I get 1 (100%) in the resulting column. The value divides by itself instead of dividing by the sum of the column.
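(For what it's worth, one way around this is to compute the sums first and derive the percentages in a second step, so the denominator is the column total of the summarised table rather than the per-group value; a rough sketch with dplyr, using only two of the summaries. Depending on which total you want, you may need to filter or regroup before the mutate.)
library(dplyr)
summarised <- df %>%
  group_by(quarter, R.DMA.NAMES, daypart, station) %>%
  summarise(spot.count = length(spot.id),
            nullspots.male = sum(nullspot.male)) %>%
  ungroup() %>%
  mutate(percent.spot.count = spot.count / sum(spot.count),
         percent.nullspots.male = nullspots.male / sum(nullspots.male))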
Question 2: This is slow, and I humbly accept this may not be optimal coding. I am using an i5 2.6 GHz PC with 16 GB DDR3 RAM and a 64-bit OS. The dataset is 1M rows.
system.time(subset_by_market ("Q32013" , "Albuquerque.Santa.Fe", "late fringe"))
user system elapsed
228.13 176.84 416.12
The intention is to visualize all calculated metrics on an online dashboard and allow the user to select subset_by_market(q, marketname, dp) using drop-down menus. How can I make it faster?
ADDING SAMPLE DATA:
> structure(list(market = c("Local", "Local", "Local", "Local",
"Local", "Local", "Local", "NATIONAL CABLE", "Local", "Local"
), spot.id = c(11248955L, 11262196L, 11946349L, 11625265L, 12929889L,
11259758L, 11517638L, 11599834L, 12527365L, 12930259L), date = structure(c(1375675200,
1376625600, 1390280400, 1383627600, 1401249600, 1375848000, 1380772800,
1383019200, 1397102400, 1401163200), class = c("POSIXct", "POSIXt"
), tzone = ""), hour = c(15, 17, 11, 18, 19, 1, 13, 14, 16, 22
), time = structure(c(0.642361111111111, 0.749305555555556, 0.481944444444444,
0.770138888888889, 0.830555555555556, 0.0597222222222222, 0.582638888888889,
0.597222222222222, 0.675694444444444, 0.930555555555556), format = "h:m:s", class = "times"),
local.date = structure(c(1375675200, 1376625600, 1390280400,
1383627600, 1401249600, 1375848000, 1380772800, 1383019200,
1397102400, 1401163200), class = c("POSIXct", "POSIXt"), tzone = ""),
local.hour = c(15, 17, 11, 18, 18, 0, 13, 14, 15, 22), local.time = structure(c(0.642361111111111,
0.749305555555556, 0.481944444444444, 0.770138888888889,
0.788888888888889, 0.0180555555555556, 0.582638888888889,
0.597222222222222, 0.634027777777778, 0.930555555555556), format = "h:m:s", class = "times"),
vendor = c("Time Warner - Myrtle Beach", "WMYD", "WSBK",
"WDCA", "Comcast - Memphis", "Charter Media - Birmingham",
"WBNA", "G4", "Comcast - Houston", "Comcast - Youngstown"
), station = c("VH-1 MYRTLE BEACH", "WMYD", "WSBK", "WDCA",
"COM MEMPHIS", "FX BIRMINGHAM", "WBNA", "G4", "SPK HOUSTON",
"COM YOUNGSTOWN CC"), male.imp = c(0, 2, 0, 0, 0.6, 0.4,
0, 0, 3.9, 0), women.imp = c(0, 2.5, 0, 2.5, 0.2, 0.6, 0,
0, 4.6, 0.6), allpersons.imp = c(0, 3.5, 0, 2.5, 0.8, 0.8,
0, 0, 7.8, 0.6), hh.imp = c(0, 8.5, 8, 64.5, 1.3, 2.9, 1.3,
15, 13.7, 1), isci = c("IT6140MB", "ITCD78DT", "IT6192BS",
"IT6170WD", "IT6173ME", "IT6162BI", "IT6155LO", "ITES13410",
"IT3917", "IT3921"), creative = c("Eugene Elbert (Bach. Tcom Eng. Tech) :60",
"The Problem Solvers (revised) - IET :60", "Murtech/Kinetic/Integra :60",
"Kevin Bumper/NTSG/Lifetime :60", "NCR/Schlumberger/Sprint (revised) (Bach) :60",
"Skills Gap (revised) /Kevin :60", "Rising Costs60 (Opportunity Scholar - No Nursing)",
"Irina Lund (Bach. ISS) :60", "Augustine Lopez (A. CEET) :30 (no loc)",
"John Ryan Ellis (B. PM/A. CDD) :30 (no loc)"), program = c(NA,
"TYLER PERRY'S MEET THE BROWNS", "THE PEOPLE'S COURT", "Judge Judy",
NA, NA, "Meet the Browns/Are We There Yet/News/Wendy Willia",
"HEROES", "Spike EF Rotator", NA), rate = c(5, 230, 100,
625, 40, 0, 15, 40, 110, 7), R.DMA.NAMES = c("Myrtle.Beach.Florence",
"Detroit", "Boston.Manchester.", "Washington.DC.Hagrstwn.",
"Memphis", "Birmingham.Ann.and.Tusc.", "Louisville", "NATIONAL CABLE",
"Houston", "Youngstown"), date.time = c("2013-08-05 15:25:00",
"2013-08-16 17:59:00", "2014-01-21 11:34:00", "2013-11-05 18:29:00",
"2014-05-28 19:56:00", "2013-08-07 01:26:00", "2013-10-03 13:59:00",
"2013-10-29 14:20:00", "2014-04-10 16:13:00", "2014-05-27 22:20:00"
), daypart = c("afternoon", "evening", "morning", "evening",
"evening", "late fringe", "afternoon", "afternoon", "afternoon",
"primetime"), quarter = structure(c(4L, 4L, 1L, 6L, 3L, 4L,
6L, 6L, 3L, 3L), .Label = c("Q12014", "Q22013", "Q22014",
"Q32013", "Q32014", "Q42013"), class = "factor"), cpi.allpersons = c(96.2179487179487,
79.0114068441065, 35.1219512195122, 82.3322348711803, 30,
0, 138.721804511278, 28.3135215453195, 28.2384088854449,
86.6666666666667), cpi.male = c(750.5, 188.882673751923,
115.959004392387, 144.492639327024, 38.9847715736041, 0,
595.161290322581, 34.7402005469462, 62.010777084515, 156.712328767123
), spotvalue.allpersons = c(0, 276.539923954373, 0, 205.830587177951,
24, 0, 0, 0, 220.25958930647, 52), spotvalue.male = c(0,
377.765347503846, 0, 0, 23.3908629441624, 0, 0, 0, 241.842030629609,
0), nullspot.allpersons = c(1, 0, 1, 0, 0, 0, 1, 1, 0, 0),
nullspot.male = c(1, 0, 1, 1, 0, 0, 1, 1, 0, 1)), .Names = c("market",
"spot.id", "date", "hour", "time", "local.date", "local.hour",
"local.time", "vendor", "station", "male.imp", "women.imp", "allpersons.imp",
"hh.imp", "isci", "creative", "program", "rate", "R.DMA.NAMES",
"date.time", "daypart", "quarter", "cpi.allpersons", "cpi.male",
"spotvalue.allpersons", "spotvalue.male", "nullspot.allpersons",
"nullspot.male"), row.names = c(561147L, 261262L, 89888L, 941010L,
500366L, 65954L, 484053L, 598996L, 380976L, 968615L), class = "data.frame")
Apologies for the ugly dput.
This answers only my second question, related to making the function faster. Based on @beginneR's tip, I converted the function to dplyr.
subset_by_market <- function (q, marketname, dp) {
subset(df %>% group_by(quarter, R.DMA.NAMES, daypart, station) %>%
summarize (spot.count = length(spot.id), station.investment = sum(rate),
nullspots.male = sum(nullspot.male),
nullspots.allpersons = sum(nullspot.allpersons),
total.male.imp = sum(male.imp),
total.allpersons.imp = sum(allpersons.imp),
spotvalue.male = sum(spotvalue.male),
spotvalue.allpersons = sum(spotvalue.allpersons),
male.imp.per.spot = total.male.imp / spot.count,
allpersons.imp.per.spot = total.allpersons.imp / spot.count,
cost.per.spot = station.investment / spot.count,
male.value.per.spot = spotvalue.male / spot.count,
allpersons.value.per.spot = spotvalue.allpersons / spot.count),
quarter == q & R.DMA.NAMES == marketname & daypart == dp) }
This reduced the time drastically to:
> system.time(subset_by_market ("Q32013" , "Albuquerque.Santa.Fe", "late fringe"))
user system elapsed
1.06 0.00 1.09
The glitch I faced in using dplyr was a column called "time" in my data, which was of class times from the chron package. I kept receiving the error Error: column 'local.time' has unsupported type. I couldn't figure out an exact workaround for this, so I simply converted it to POSIXct class using df$time <- as.POSIXct(as.character(df$time, format = "%H:%M:%S")). This was not optimal, because the reason I converted it to times using chron was to maintain the time chronology without needing the date or time zone (more on that here: work around to specifying date range in ifelse statement in R). However, it solves the immediate problem at hand.
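If it needs to be even faster, a further possibility (untested on this data) is to filter before grouping, so rows that the final subset() would discard are never summarised; a sketch of the same function with the filter pulled up, only two summaries shown for brevity:
library(dplyr)
subset_by_market <- function(q, marketname, dp) {
  df %>%
    filter(quarter == q, R.DMA.NAMES == marketname, daypart == dp) %>%
    group_by(quarter, R.DMA.NAMES, daypart, station) %>%
    summarise(spot.count = length(spot.id),
              station.investment = sum(rate))
  # ...remaining summaries as in the version above
}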
