I am solving an exercise in "R for Data Science", under "Useful creation functions" in the chapter on data transformation with dplyr. The question goes as follows, using the nycflights13 dataset:
Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they're not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.
And I saw this answer online:
library(dplyr)
library(nycflights13)

transmute(flights,
  dep_time_since_midnight = (dep_time %% 100) + ((dep_time %/% 100) * 60),
  sched_dep_time_since_midnight = (sched_dep_time %% 100) + ((sched_dep_time %/% 100) * 60)
)
My question is that I don't understand the conversion; this is more of a mathematical problem than a coding problem. Please help.
%% is read as "mod", and it gives you the remainder (e.g. 7 %% 3 = 1)
%/% is integer division (e.g. 7 %/% 3 = 2)
In working with dep_time:
hour = dep_time %/% 100
minute = dep_time %% 100
so, the above expression can be read as minutes + hour * 60
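For example, with a hypothetical departure time of 1330 (i.e. 1:30 pm):

dep_time <- 1330
dep_time %/% 100                               # hour: 13
dep_time %% 100                                # minute: 30
(dep_time %% 100) + (dep_time %/% 100) * 60    # minutes since midnight: 810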
Related
I am using this code to find a difference between two times:
station_data.avg$duration[i] = if_else(
  station_data.avg$swath[i] != 0,
  round(difftime(station_data.avg$end[i], station_data.avg$start[i], units = "mins"), 3),
  0
)
But the output is 3.116667 and I want the output to be in the format min:sec, i.e. 3:07.
I tried
station_data.avg$duration[i]= as.character(times(station_data.avg$duration[i] / (24 * 60 )))
and was hoping that would work but it did not
You can use the chron package to convert a fraction of a minute (i.e., x.25, meaning 25% of a minute) into seconds out of 60 (i.e., x:15, since 15/60 = 0.25). An example is below, but if you edit your question to make it reproducible, I can provide more specific help.
Data
a <- Sys.time()
b <- Sys.time() + 60 * 3 + 15 # add 3 min 15 seconds
Code
difftime(b, a, units = "min")
# Time difference of 3.250006 mins
chron::times(as.numeric(difftime(b, a, units = "days")))
# [1] 00:03:15
Note the change to units = "days" in this context: chron::times() interprets a plain number as a fraction of a day, so the difference has to be expressed in days first.
You could further parse this out by wrapping this in lubridate::hms:
lubridate::hms(
chron::times(as.numeric(difftime(b, a, units = "days")))
)
# [1] "3M 15S"
fff5 <- function(x) {
  x * 31 * 24 * (1 / (31 * 24)) * 0.30 +                     # x GB in the Standard plan
    400 * 31 * 24 * (1 / (31 * 24)) * 0.025 +                # 400 GB in Infrequent Access
    ((10 * 31 * 24 - 100 * 31 * 24 / 20) / (31 * 24)) * 6 -  # 10 MB/s throughput
    200                                                      # maximum budget
}
The fff5 function describes the cost of Amazon Elastic File System, where x is the GB of storage in the Standard plan for 24 hours per day over 31 days, 400 is the GB of storage in EFS Infrequent Access for 24 hours per day over 31 days, 10 is the MB/s of throughput for 24 hours per day over 31 days, and 200 is the maximum budget.
When I do:
uniroot(fff5, lower=0, upper=1, extendInt = "yes",maxiter = 10000)$root
[1] 533.3333
I find the highest number of GB that can be stored in the Standard plan 24 hours a day for 31 days, plus the cost of 400 GB in Infrequent Access, plus the cost of 10 MB/s of throughput, within a maximum budget of 200:
> fff5(533.3333)
[1] -0.00001
> fff5(533.3334)
[1] 0.00002
How do I do the same for the other two unknowns (y, z)? How do I find a root with more than one unknown? How do I find all the combinations of values of x, y, z that make this function positive?
fff6 <- function(x, y, z) {
  x * 31 * 24 * (1 / (31 * 24)) * 0.30 +
    y * 31 * 24 * (1 / (31 * 24)) * 0.025 +
    ((z * 31 * 24 - 100 * 31 * 24 / 20) / (31 * 24)) * 6 -
    200
}
The equation you propose is of the type
ax + by + cz + d = 0
that is, a plane. This means that your solutions are infinite: they are all the points belonging to the plane defined by the equation.
Since there are infinitely many solutions, the only thing you can do is try to narrow down the space in which to look for them as much as possible.
You can choose one unknown (for example x) and treat the other two as parameters.
At this point, assign reasonable values to y and z. Unfortunately I don't know what those variables indicate, but I assume they have the same order of magnitude as the x found in the previous point (~500):
yy <- seq(400, 600, 10)
zz <- seq(400, 600, 10)
These two variables must be recombined in order to obtain a grid:
df_grid <- expand.grid(y = yy, z = zz)
ATTENTION: the longer the vectors, the heavier the calculation will be.
Now you can find the x solutions via uniroot (passing y and z as numbers), and the solutions of your problem (within the chosen range) will be all the triples x, y, z:
fff6 <- function(x, y, z) {
  x * 31 * 24 * (1 / (31 * 24)) * 0.30 +
    y * 31 * 24 * (1 / (31 * 24)) * 0.025 +
    ((z * 31 * 24 - 100 * 31 * 24 / 20) / (31 * 24)) * 6 -
    200
}
x_sol <- NULL
for (i in 1:nrow(df_grid)) {
  xs <- uniroot(fff6, c(-10000, 10000), y = df_grid$y[i], z = df_grid$z[i])$root
  x_sol <- c(x_sol, xs)
}
df_grid$x <- x_sol
NOTE1: There are more elegant ways to avoid writing the previous for loop. For example:
x_sol <- mapply(function(y, z) uniroot(fff6, interval = c(-10000, 10000),
                                       y = y, z = z)$root, df_grid$y, df_grid$z)
df_grid$x <- x_sol
NOTE2: The range I have chosen shows negative solutions (which I suspect are not useful). A possible choice for obtaining positive solutions is:
yy <- seq(100, 300, 10)
zz <- seq(10, 30, 1)
Choose to search for solutions in an appropriate range!
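As a cross-check on the uniroot approach: simplifying the cost function shows the plane is 0.30*x + 0.025*y + 6*z - 230 = 0, so for any given y and z, x can also be obtained in closed form (a hypothetical helper, not part of the original answer):

x_closed <- function(y, z) (230 - 0.025 * y - 6 * z) / 0.30
x_closed(400, 10)
# [1] 533.3333   # matches the uniroot result for fff5 above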
I'm new to R and programming in general, and I'm struggling with a for-loop for building the lx function in a life table.
I have the age x, the death probability qx (the probability that someone aged exactly x will die before reaching age x+1), and the survival probability px = 1 - qx.
I want to write a function that returns a vector with all the lx values from first to last age in my table. The function is simple...
I've defined cohort = 1000000. The first age in my table is x = 5, so, considering x = 5...
l_(x) = cohort
And, from now on, l_(x+n) = l_(x+n-1)*p_(x+n-1)
I've read about for-loops, but I can only get my code working for lx[1] and lx[2]; I get nothing for lx[n] when n > 2.
I wrote that function:
living_x <- function(px, cohort){
  result <- vector("double", length(px))
  l_x <- vector("double", length(px))
  for (i in 1:length(px)){
    if (i == 1){
      l_x[i] = cohort
    }
    else l_x[i] = l_x[i-1]*px[i-1]
    result[i] = l_x
    print(result)
  }
}
When I run it, I get several outputs (more than length(px)) and "There were 50 or more warnings (use warnings() to see the first 50)".
When I run warnings(), I get "In result[i] <- l_x : number of items to replace is not a multiple of replacement length" for every number.
Also, everything else I try gives me different errors or only calculates lx[1] and lx[2]. I know there's something really wrong with my code, but I still can't identify it. I'd be glad if someone could give me a hint to figure out what to change.
Thank you!
Here's an approach using dplyr from the tidyverse packages, using px to calculate lx. The same calculation can be written in one line as excerpt$lx = 100000 * cumprod(1 - dplyr::lag(excerpt$qx, default = 0)) (the default is needed so the first term is 1 rather than NA).
lx is provided in the babynames package, so we can check our work:
library(tidyverse)
library(babynames)
# Get excerpt with age, qx, and lx.
excerpt <- lifetables %>%
  filter(year == 2010, sex == "F") %>%
  select(x, qx_given = qx, lx_given = lx)
excerpt
# A tibble: 120 x 3
       x qx_given lx_given
   <dbl>    <dbl>    <dbl>
 1     0  0.00495   100000
 2     1  0.00035    99505
 3     2  0.00022    99471
 4     3  0.00016    99449
 5     4  0.00012    99433
 6     5  0.00011    99421
 7     6  0.00011    99410
 8     7  0.0001     99399
 9     8  0.0001     99389
10     9  0.00009    99379
# ... with 110 more rows
Using that data to estimate lx_calc:
est_lx <- excerpt %>%
  mutate(px = 1 - qx_given,
         cuml_px = cumprod(lag(px, default = 1)),
         lx_calc = cuml_px * 100000)
And finally, a visual comparison of the given lx with the one calculated from px; they match exactly.
est_lx %>%
  gather(version, val, c(lx_given, lx_calc)) %>%
  ggplot(aes(x, val, color = version)) + geom_line()
I managed to do it in a very simple way after thinking about it for a few more minutes.
lx <- c()
lx[1] <- 10**6
for (i in 2:length(px)){
  lx[i] <- lx[i-1]*px[i-1]
}
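Equivalently, the same vector can be built without an explicit loop using cumprod (a minimal sketch, assuming px is already defined as above):

lx <- 10**6 * cumprod(c(1, px[-length(px)]))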
When I create an rpart tree that uses a date cutoff at a node, the print methods I use - both rpart.plot and fancyRpartPlot - print the dates in scientific notation, which makes it hard to interpret the result. Here's the fancyRpartPlot:
Is there a way to print this tree with more interpretable date values? This tree plot is meaningless as all those dates look the same.
Here's my code for creating the tree and plotting two ways:
library(rpart) ; library(rpart.plot) ; library(rattle)
my_tree <- rpart(a ~ ., data = dat)
rpart.plot(my_tree)
fancyRpartPlot(my_tree)
Using this data:
# define a random date/time selection function
generate_days <- function(N, st = "2012/01/01", et = "2012/12/31") {
  st = as.POSIXct(as.Date(st))
  et = as.POSIXct(as.Date(et))
  dt = as.numeric(difftime(et, st, unit = "sec"))
  ev = runif(N, 0, dt)
  rt = st + ev
  rt
}
set.seed(1)
dat <- data.frame(
  a = runif(1:100),
  b = rpois(100, 5),
  c = sample(c("hi", "med", "lo"), 100, TRUE),
  d = generate_days(100)
)
From a practical standpoint, perhaps you'd like to just use days from the start of the data:
dat$d <- dat$d-as.POSIXct(as.Date("2012/01/01"))
my_tree <- rpart(a ~ ., data = dat)
rpart.plot(my_tree,branch=1,extra=101,type=1,nn=TRUE)
This reduces the number to something manageable and meaningful (though not as meaningful as a specific date, perhaps). You may even want to round it to the nearest day or week. (I can't install GTK+ on my computer so I can't use fancyRpartPlot.)
One possible way might be to use the digits option in print to examine the tree and as.POSIXlt to convert the cutoffs back to dates:
> print(my_tree, digits = 100)
n= 100

node), split, n, deviance, yval
      * denotes terminal node

 1) root 100 7.0885590 0.5178471
   2) d>=1346478795.049611568450927734375 33 1.7406368 0.4136051
     4) b>=4.5 23 1.0294497 0.3654257 *
     5) b< 4.5 10 0.5350040 0.5244177 *
   3) d< 1346478795.049611568450927734375 67 4.8127122 0.5691901
     6) d< 1340921905.3460228443145751953125 55 4.1140164 0.5368048
      12) c=hi 28 1.8580913 0.4779574
        24) d< 1335890083.3241622447967529296875 18 0.7796261 0.3806526 *
        25) d>=1335890083.3241622447967529296875 10 0.6012662 0.6531062 *
      13) c=lo,med 27 2.0584052 0.5978317
        26) d>=1337494347.697483539581298828125 8 0.4785274 0.3843749 *
        27) d< 1337494347.697483539581298828125 19 1.0618892 0.6877082 *
     7) d>=1340921905.3460228443145751953125 12 0.3766236 0.7176229 *
## Get date on first node
> as.POSIXlt(1346478795.049611568450927734375,origin="1970-01-01")
[1] "2012-08-31 22:53:15 PDT"
I also checked the digits option available in rpart.plot and fancyRpartPlot:
rpart.plot(my_tree,digits=10)
fancyRpartPlot(my_tree, digits=10)
I don't know how important the specific chronological date is in your classification, but an alternative method would be to break your dates down by their characteristics. In other words, create indicator bins based on the "year" (2012, 2013, 2014, ...) as [1, 0], the "day of the week" (Mon, Tue, Wed, Thu, Fri, ...) as [1, 0], maybe even the "day of the month" (1, 2, 3, 4, 5, ..., 31) as [1, 0]. This adds a lot more categories to classify by, but it eliminates the issue of working with a fully formatted date.
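A minimal sketch of that idea, assuming the original simulated dat from the question (my_tree2 and the new column names are just illustrative):

# derive categorical date features from the POSIXct column d
# (year is skipped because every simulated date falls in 2012)
dat$wday <- factor(format(dat$d, "%a"))       # day of week, e.g. "Mon"
dat$mday <- as.integer(format(dat$d, "%d"))   # day of month, 1..31
my_tree2 <- rpart(a ~ b + c + wday + mday, data = dat)
rpart.plot(my_tree2)

rpart then splits on these factors and integers directly, so no date formatting issues appear in the plots.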
I have a series of times, like the following:
2013-12-27 00:31:15
2013-12-29 17:01:17
2013-12-31 01:52:41
....
My goal is to find out which time of day is most important, e.g. whether most of the times fall in the period 17:00 ~ 19:00.
In order to do that, I think I should draw every single time as a point on the x-axis, with the x-axis in units of minutes.
I don't know exactly how to do this with R and ggplot2.
Am I on the right track? I mean, is there a better way to reach my goal?
library(chron)
library(ggplot2)
# create some test data - hrs
set.seed(123)
Lines <- "2013-12-27 00:31:15
2013-12-29 17:01:17
2013-12-31 01:52:41
"
tt0 <- times(read.table(text = Lines)[[2]]) %% 1
rng <- range(tt0)
hrs <- 24 * as.vector(sort(diff(rng) * runif(100)^2 + rng[1]))
# create density, find maximum of it and plot
d <- density(hrs)
max.hrs <- d$x[which.max(d$y)]
ggplot(data.frame(hrs)) +
geom_density(aes(hrs)) +
geom_vline(xintercept = max.hrs)
giving:
> max.hrs # in hours - nearly 2 am
[1] 1.989523
> times(max.hrs / 24) # convert to times
[1] 01:59:22
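With real timestamps (rather than the simulated hrs above), the hour-of-day values can be built directly; a sketch, with the three example times from the question hard-coded in place of your actual data:

tt <- as.POSIXct(c("2013-12-27 00:31:15",
                   "2013-12-29 17:01:17",
                   "2013-12-31 01:52:41"))
# decimal hour of day for each timestamp
hrs <- as.numeric(format(tt, "%H")) +
  as.numeric(format(tt, "%M")) / 60 +
  as.numeric(format(tt, "%S")) / 3600

The same density/geom_vline code above then applies to this hrs unchanged.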