Lookup values corresponding to the closest date - r

I have a data.frame x with columns date and Value:
x = structure(list(date = structure(c(1376534700, 1411930800, 1461707400,
1478814300, 1467522000, 1451088000, 1449956100, 1414214400, 1472585400,
1418103000, 1466176500, 1434035100, 1442466300, 1410632100, 1448571900,
1439276400, 1468382700, 1476137400, 1413177300, 1438881300), class = c("POSIXct",
"POSIXt"), tzone = ""), Value = c(44L, 49L, 31L, 99L, 79L, 92L,
10L, 72L, 60L, 41L, 28L, 21L, 67L, 61L, 8L, 65L, 40L, 48L, 53L,
90L)), .Names = c("date", "Value"), row.names = c(NA, -20L), class = "data.frame")
and a vector y that contains only dates:
y = structure(c(1470356820, 1440168960, 1379245020, 1441582800, 1381753740
), class = c("POSIXct", "POSIXt"), tzone = "")
Before I try a loop, is there a quick way (or a package) to look up Value from the closest date in x for each date in y? The goal is to find the date in x that is closest to each date in y and return the corresponding Value.
The desired output (obtained with Excel's VLOOKUP, so it may not be exact) would be something like:
output = structure(list(y = structure(c(1470356820, 1440168960, 1379245020,
1441582800, 1381753740), class = c("POSIXct", "POSIXt"), tzone = ""),
Value = c(40, 65, 44, 65, 44)), .Names = c("y", "Value"), row.names = c(NA,
-5L), class = "data.frame")

sapply(y, function(z) x$Value[which.min(abs(x$date - z))])
# [1] 40 65 44 67 44
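If the matched date is wanted next to each Value, the same which.min idea can be kept as an index and reused. The following is only a sketch: the column name nearest_date and the fixed "UTC" time zone are assumptions made here for reproducibility, not part of the question.

```r
# Data copied from the question; tz = "UTC" is an assumption for reproducibility
x <- data.frame(
  date = as.POSIXct(c(1376534700, 1411930800, 1461707400, 1478814300, 1467522000,
                      1451088000, 1449956100, 1414214400, 1472585400, 1418103000,
                      1466176500, 1434035100, 1442466300, 1410632100, 1448571900,
                      1439276400, 1468382700, 1476137400, 1413177300, 1438881300),
                    origin = "1970-01-01", tz = "UTC"),
  Value = c(44L, 49L, 31L, 99L, 79L, 92L, 10L, 72L, 60L, 41L,
            28L, 21L, 67L, 61L, 8L, 65L, 40L, 48L, 53L, 90L)
)
y <- as.POSIXct(c(1470356820, 1440168960, 1379245020, 1441582800, 1381753740),
                origin = "1970-01-01", tz = "UTC")

# For each date in y, find the row of x whose date is closest (in seconds)
x_secs <- as.numeric(x$date)
nearest <- vapply(as.numeric(y),
                  function(z) which.min(abs(x_secs - z)),
                  integer(1))

# Keep the matched date alongside the looked-up Value
output <- data.frame(y = y,
                     nearest_date = x$date[nearest],
                     Value = x$Value[nearest])
output
```

The fourth Value is 67 rather than the 65 from the Excel lookup, because 2015-09-17 in x is closer to that date in y than 2015-08-11 is.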

With data.table you can use a rolling join to the nearest value:
library(data.table)
x <- as.data.table(x)
y <- data.table(date=y)
res <- x[y, on='date', roll='nearest']

Related

Operate with a table in R

I want to get a tuple from a table in R.
For example, I have this table named batch_task:
arrival_time, departure_time, jod_id, task_id
11792, 11999, 18, 88
11792, 14331, 18, 82
11792, 12112, 18, 91
16281, 16552, 27, 147
16281, 16396, 27, 139
16281, 16529, 27, 137
So, for each job_id I need a tuple {arrival_time, service_time}. For example, for job_id = 18, I want to get the tuple {11792, (11999-11792)+(14331-11792)+(12112-11792)} = {11792, 3066}.
Could anyone help me? Thanks in advance,
Jesús
Assuming that batch_task is actually a data frame rather than a table (which is a specific type of R structure dealing with count data), here's a solution using the dplyr package:
library(dplyr)
batch_task %>%
  group_by(arrival_time) %>%
  summarise(service_time = sum(departure_time - arrival_time), .groups = "drop")
#> # A tibble: 2 x 2
#> arrival_time service_time
#> <int> <int>
#> 1 11792 3066
#> 2 16281 634
Note that there is no data structure called a tuple in R. There are various ways to represent tuples, but the most sensible way in this case would be to keep the result in a data frame format.
Data
batch_task <- structure(list(arrival_time = c(11792L, 11792L, 11792L, 16281L,
16281L, 16281L), departure_time = c(11999L, 14331L, 12112L, 16552L,
16396L, 16529L), jod_id = c(18L, 18L, 18L, 27L, 27L, 27L), task_id = c(88L,
82L, 91L, 147L, 139L, 137L)), class = "data.frame", row.names = c(NA,
-6L))
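If something closer to a tuple per job is still wanted, one possible representation is a named list with one small named vector per jod_id. This is only a sketch in base R; the name tuples is made up here for illustration.

```r
# Data copied from the question (task_id omitted because it is not needed)
batch_task <- data.frame(
  arrival_time   = c(11792L, 11792L, 11792L, 16281L, 16281L, 16281L),
  departure_time = c(11999L, 14331L, 12112L, 16552L, 16396L, 16529L),
  jod_id         = c(18L, 18L, 18L, 27L, 27L, 27L)
)

# One named vector per job: c(arrival_time, service_time)
# Assumes arrival_time is identical for all tasks of a job
tuples <- lapply(split(batch_task, batch_task$jod_id), function(job) {
  c(arrival_time = job$arrival_time[1],
    service_time = sum(job$departure_time - job$arrival_time))
})

tuples[["18"]]
# arrival_time service_time
#        11792         3066
```

The list is indexed by the job id as a character string, so tuples[["27"]] returns the pair for job 27.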
With the data
batch_task <- structure(list(arrival_time = c(11792L, 11792L, 11792L, 16281L,
16281L, 16281L),
departure_time = c(11999L, 14331L, 12112L, 16552L,
16396L, 16529L),
jod_id = c(18L, 18L, 18L, 27L, 27L, 27L),
task_id = c(88L,
82L, 91L, 147L, 139L, 137L)),
class = "data.frame", row.names = c(NA,
-6L))
I propose this solution
library(tidyverse)
batch_task %>%
  mutate(service_time = departure_time - arrival_time) %>%
  group_by(jod_id, arrival_time) %>% # assuming arrival_time is the same within each jod_id
  summarise(service_time = sum(service_time))
Using by.
do.call(rbind, by(d, d$jod_id, function(x) {
  a <- x$arrival_time[1]
  c(a, sum(x$departure_time - a))
}))
# [,1] [,2]
# 18 11792 3066
# 27 16281 634
Data:
d <- structure(list(arrival_time = c(11792L, 11792L, 11792L, 16281L,
16281L, 16281L), departure_time = c(11999L, 14331L, 12112L, 16552L,
16396L, 16529L), jod_id = c(18L, 18L, 18L, 27L, 27L, 27L), task_id = c(88L,
82L, 91L, 147L, 139L, 137L)), class = "data.frame", row.names = c(NA,
-6L))

Assign trip number based on condition

I have time series data. I would like to group and number rows whenever the column "soak" > 3600. The first row where soak > 3600 is numbered 1, and the following rows are also numbered 1 until another row meets the condition soak > 3600. That row and the subsequent rows are then numbered 2 until the third occurrence of soak > 3600.
A small sample of my data and the code I tried are provided below.
My code did the count, but using ave() seems to give me some decimal numbers. Is there a way to output integers?
starts <- structure(list(datetime = structure(c(1440578907, 1440579205,
1440579832, 1440579885, 1440579926, 1440579977, 1440580044, 1440580106,
1440580195, 1440580256, 1440580366, 1440580410, 1440580476, 1440580529,
1440580931, 1440580966, 1440587753, 1440587913, 1440587933, 1440587954
), class = c("POSIXct", "POSIXt"), tzone = ""), soak = c(NA,
70L, 578L, 21L, 2L, 41L, 14L, 16L, 32L, 9L, 45L, 20L, 51L, 25L,
364L, 4L, 6764L, 20L, 4L, 5L)), row.names = c(NA, -20L), class = c("data.table",
"data.frame"))
starts$trip <- with(starts, ave(tdiff, cumsum(replace(soak, NA, 10000) > 3600)))
Using dplyr
library(dplyr)
starts %>% mutate(trip = cumsum(replace(soak, is.na(soak), 1) > 3600))
And with base R
starts$trip = with(starts, ave(soak, FUN=function(x) cumsum(replace(x, is.na(x), 1) > 3600)))
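On the integer part of the question: cumsum() over a logical vector already returns integers, while ave() uses FUN = mean by default, which is a likely source of the decimals. Below is a minimal sketch on the soak column alone. The names trip_a and trip_b are made up here, and variant B assumes the leading NA row should open trip 1, which the question does not state explicitly.

```r
# soak values copied from the sample data in the question
soak <- c(NA, 70L, 578L, 21L, 2L, 41L, 14L, 16L, 32L, 9L, 45L, 20L, 51L, 25L,
          364L, 4L, 6764L, 20L, 4L, 5L)

# Variant A: rows before the first soak > 3600 are numbered 0
trip_a <- cumsum(!is.na(soak) & soak > 3600)

# Variant B (assumption): the leading NA row also starts a trip,
# so numbering begins at 1; TRUE | NA evaluates to TRUE in R
trip_b <- cumsum(is.na(soak) | soak > 3600)

is.integer(trip_a)  # TRUE
```

With this sample, trip_a is 0 for the first 16 rows and 1 for the last 4, while trip_b is 1 for the first 16 rows and 2 for the last 4.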

Transform periodogram to dataframe in R

Simple example with R on mydata:
l=structure(list(dat = structure(1:9, .Label = c("01.01.2016",
"02.01.2016", "03.01.2016", "04.01.2016", "05.01.2016", "06.01.2016",
"07.01.2016", "08.01.2016", "09.01.2016"), class = "factor"),
lpt = c(94L, 3L, 30L, 92L, 20L, 80L, 20L, 190L, 52L)), .Names = c("dat",
"lpt"), class = "data.frame", row.names = c(NA, -9L))
l=ts(l)
spectrum(l)
R returns a plot of the periodogram.
On this periodogram we can see bursts of values (two bursts, at 0.23 and 0.45).
How can the x-axis and y-axis values be extracted into a data frame, but only for those x-axis values that are bursts?
Second question:
Can these values be displayed not as frequencies, but in the absolute, original units (dat, lpt)?
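One possible approach, sketched under assumptions: spectrum() returns an object whose freq and spec components hold the plotted values, so they can be copied into a data frame. The sketch uses only the lpt column as a univariate series, treats a "burst" as a local maximum of the periodogram, and converts frequency to a period of 1 / freq observations (days, if the series is daily). The names pg, is_peak and peaks are made up for illustration.

```r
# lpt values copied from the question; dat is left out because it is only a label
lpt <- c(94L, 3L, 30L, 92L, 20L, 80L, 20L, 190L, 52L)

# Compute the periodogram without drawing it
sp <- spectrum(ts(lpt), plot = FALSE)

# All x-axis (freq) and y-axis (spec) values as a data frame
pg <- data.frame(freq = sp$freq, spec = as.numeric(sp$spec))

# Local maxima: the slope changes from rising to falling
# (padding with -Inf lets the first and last points qualify too)
is_peak <- diff(sign(diff(c(-Inf, pg$spec, -Inf)))) == -2
peaks <- pg[is_peak, ]

# Express each peak as a period in numbers of observations
peaks$period <- 1 / peaks$freq
peaks
```

The periodogram describes cycles, so a peak maps back to a cycle length in time units rather than to individual dat or lpt values.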

How do I count rows in my data starting with a particular occurrence of a value

structure(list(PROD_DATE = structure(c(1465876800, 1465963200,
1466049600, 1466136000, 1466222400, 1466308800, 1466395200, 1466481600,
1466568000, 1466654400), class = c("POSIXct", "POSIXt"), tzone = ""),
FILENUM = c(51922L, 51922L, 51922L, 51922L, 51922L, 51922L,
51922L, 51922L, 51922L, 51922L), CHOKE_SETTING = c(16L, 18L,
50L, 40L, 30L, 23L, 29L, 32L, 35L, 30L)), .Names = c("PROD_DATE",
"FILENUM", "CHOKE_SETTING"), row.names = c(NA, -10L), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), vars = "FILENUM", drop = TRUE, indices = list(
0:9), group_sizes = 10L, biggest_group_size = 10L, labels = structure(list(
FILENUM = 51922L), row.names = c(NA, -1L), class = "data.frame", vars = "FILENUM", drop = TRUE, .Names = "FILENUM"))
df <- df %>% group_by(FILENUM) %>% arrange(PROD_DATE) %>%
mutate(DAYS_ON = row_number())
I'm using the code above to number the rows of the dataset so that they count days since the start, rather than using the date-time variable PROD_DATE.
I am unsure how to add another column that counts days since the occurrence of the max value in a different column. It should start counting at the row where CHOKE_SETTING is 50. Previous rows would be either NA or 0.
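A minimal sketch of one way to do this in base R, assuming the count should start at 1 on the row holding the group's maximum CHOKE_SETTING and that earlier rows should be NA. The helper days_since_max and the column DAYS_SINCE_MAX are hypothetical names, and the data is rebuilt as a plain data.frame with an assumed "UTC" time zone.

```r
# Sample data copied from the question; tz = "UTC" is an assumption
df <- data.frame(
  PROD_DATE = as.POSIXct(c(1465876800, 1465963200, 1466049600, 1466136000,
                           1466222400, 1466308800, 1466395200, 1466481600,
                           1466568000, 1466654400),
                         origin = "1970-01-01", tz = "UTC"),
  FILENUM = rep(51922L, 10),
  CHOKE_SETTING = c(16L, 18L, 50L, 40L, 30L, 23L, 29L, 32L, 35L, 30L)
)

# Sort so that row position within a group follows the date
df <- df[order(df$FILENUM, df$PROD_DATE), ]

# Hypothetical helper: 1 at the first maximum, counting up afterwards,
# NA for rows before the maximum
days_since_max <- function(v) {
  n <- seq_along(v) - which.max(v) + 1L
  replace(n, n < 1L, NA_integer_)
}

# Apply the helper within each FILENUM group
df$DAYS_SINCE_MAX <- ave(df$CHOKE_SETTING, df$FILENUM, FUN = days_since_max)
df$DAYS_SINCE_MAX
# [1] NA NA  1  2  3  4  5  6  7  8
```

which.max() returns the first position of the maximum, so if the maximum occurs more than once the count starts at its first occurrence.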

How do you use OR within ifelse in the data.table package in R

I have this data frame called final:
dput(head(final))
structure(list(Date = structure(c(1468268220, 1468269840, 1468268160,
1468268700, 1468268760, 1468268940), class = c("POSIXct", "POSIXt"
), tzone = ""), Trans = c(1454L, 816L, 1144L, 2542L, 200L, 180L
), Cpu = c(64L, 40L, 56L, 61L, 16L, 11L), HeapUsage = c(57.0708814590398,
80.9735869076328, 48.902594693305, 94.0288575990024, 69.2328919834884,
82.9988047593942), upper = c(65.8548134294912, 42.4943251956111,
51.6875408154387, 84.6919764584793, 23.1855269194862, 22.5880538495753
), lower = c(45.1948274687367, 24.2759551097326, 32.5602919366396,
56.2960707237243, 8.75861931966005, 6.83960375748239), diff = c(1.85481342949123,
2.49432519561113, 4.31245918456129, 23.6919764584793, 7.18552691948618,
11.5880538495753)), .Names = c("Date", "Trans", "Cpu", "HeapUsage",
"upper", "lower", "diff"), class = c("data.table", "data.frame"
), row.names = c(NA, -6L))
Using the data.table package, I need to check whether the Trans value for each date is within the upper and lower limits:
final<-final[,inside:=ifelse(Cpu>=lower || Cpu=<upper,1,0),by=c("Date")]
It doesn't seem to work with OR. Any ideas?
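A few things in that line are likely to cause trouble: || only works on single values rather than whole columns, =< is not a valid R operator (it is written <=), and "within the limits" needs both conditions to hold, which is AND rather than OR. Here is a hedged sketch using the vectorised & on the Cpu, upper and lower columns from the sample, following the column used in the asker's code rather than Trans:

```r
library(data.table)

# Only the columns needed for the check, copied from the question's sample
final <- data.table(
  Cpu   = c(64L, 40L, 56L, 61L, 16L, 11L),
  upper = c(65.8548134294912, 42.4943251956111, 51.6875408154387,
            84.6919764584793, 23.1855269194862, 22.5880538495753),
  lower = c(45.1948274687367, 24.2759551097326, 32.5602919366396,
            56.2960707237243, 8.75861931966005, 6.83960375748239)
)

# Element-wise AND; as.integer() turns TRUE/FALSE into 1/0, so ifelse is not needed
final[, inside := as.integer(Cpu >= lower & Cpu <= upper)]
final$inside
# [1] 1 1 0 1 1 1
```

The comparison is already row-wise, so no by = "Date" grouping is required. The third row is 0 because Cpu = 56 is above its upper limit of about 51.69.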
