Extract intervals from time data in R

My problem is simple. I have a table where each row is an event (month, day, hour and minute are given). However, the machine was set to record 24/7, so I have more events (rows) than I need. How do I remove the surplus daytime rows and keep only the rows from the night (from sunset to sunrise)?
The dreadful thing is that the timing of sunrise/sunset is slightly different each day.
In this example I provide two tables. The first is the table with all events; the second contains the timings of sunset/sunrise for each day.
If it is possible, could an additional column be inserted into the table containing an ID for each night? Please note that EACH night spans two dates (see the scheme below).
# table with all events
my.table <- data.frame(event = 1:34,
                       day = rep(c(30, 31, 1, 2, 3), times = c(8, 9, 7, 8, 2)),
                       month = rep(c(3, 4), each = 17),
                       hour = c(13, 13, 13, 13, 22,
                                22, 23, 23, 2, 2, 2,
                                14, 14, 14, 19, 22, 22,
                                2, 2, 2, 14, 15, 22, 22,
                                3, 3, 3, 14, 14, 14,
                                23, 23, 2, 14),
                       minute = c(11, 13, 44, 55, 27,
                                  32, 54, 57, 10, 14,
                                  26, 12, 16, 46, 30,
                                  12, 13, 14, 16, 45,
                                  12, 15, 12, 15, 24,
                                  26, 28, 12, 16, 23, 12, 13, 11, 11))
# timings of sunset/sunrise for each day
sun.table <- data.frame(day = c(30, 31, 31, 1, 1, 2, 2, 3),
                        month = rep(c(3, 4), times = c(3, 5)),
                        hour = rep(c(19, 6), times = 4),
                        minute = c(30, 30, 31, 29, 32, 28, 33, 27),
                        type = rep(c("sunset", "sunrise"), times = 4))
# right solution: the reduced table would contain only rows
# 5,6,7,8,9,10,11,16,17,18,19,20,23,24,25,26,27,31,32,33.
# nrow("reduced table") == 20

Here's one possible strategy:
# convert sunset/sunrise times to proper date-times
ss <- with(sun.table, ISOdate(2000, month, day, hour, minute))
sunset <- ss[seq(1, length(ss), by = 2)]
sunrise <- ss[seq(2, length(ss), by = 2)]
Here I assume the table is ordered, starts with a sunset, alternates between sunset and sunrise, and ends with a sunrise. Date values also need a year; here I just hard-coded 2000. As long as your data doesn't span years (or leap days) that should be fine, but you'll probably want to pop in the actual year of your observations.
Now do the same for the events:
tt <- with(my.table, ISOdate(2000,month,day,hour,minute))
Find the rows that fall at night, i.e. between a sunset and the following sunrise
night <- sapply(tt, function(x) any(sunset < x & x < sunrise))
and extract those rows
my.table[night, ]
# event day month hour minute
# 5 5 30 3 22 27
# 6 6 30 3 22 32
# 7 7 30 3 23 54
# 8 8 30 3 23 57
# 9 9 31 3 2 10
# 10 10 31 3 2 14
# 11 11 31 3 2 26
# 16 16 31 3 22 12
# 17 17 31 3 22 13
# 18 18 1 4 2 14
# 19 19 1 4 2 16
# 20 20 1 4 2 45
# 23 23 1 4 22 12
# 24 24 1 4 22 15
# 25 25 2 4 3 24
# 26 26 2 4 3 26
# 27 27 2 4 3 28
# 31 31 2 4 23 12
# 32 32 2 4 23 13
# 33 33 3 4 2 11
Here we keep only the events that fall after a sunset and before the following sunrise, which matches the requested output. Note that event 34 (14:11 on 3 April) is not returned: sun.table ends with the sunrise on 3 April, so there is no sunset/sunrise pair that could bracket it.
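To add the night ID the question asks about, one possible extension (a sketch, not part of the original answer) is findInterval(): for each event time it returns the index of the most recent sunset, and for night events that index identifies the night.
# hypothetical extension: number the nights 1, 2, ... in order of their sunsets
# (findInterval() requires the sunset vector to be sorted ascending, which it is here)
my.table$night_id <- ifelse(night, findInterval(tt, sunset), NA)
Night 1 then spans the evening of 30 March and the morning of 31 March, and so on.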


Calculate Cumulative sum for previous 6 months

RECORD  ATTRIBUTE  DATE       MONTH  AMT  CML AMT
1       A          1/1/2021    1     10    10
2       A          2/1/2021    2     10    20
3       A          3/1/2021    3     10    30
4       A          4/1/2021    4     10    40
5       A          5/1/2021    5     10    50
6       A          6/1/2021    6     10    60
7       B          1/1/2021    1     20    20
8       B          3/1/2021    3     20    40
9       B          5/1/2021    5     20    60
10      B          7/1/2021    7     20    80
11      B          9/1/2021    9     20    80
12      B          11/1/2021  11     20    80
13      C          1/1/2021    1     30    30
14      C          8/1/2021    8     30    30
15      C          9/1/2021    9     30    60
I am looking to calculate the cumulative sum (the CML AMT column) of the AMT column over the past 6 months.
The CML AMT column should only look at a window of 6 months.
If there is no other record for the same attribute within that 6-month window, then it should simply return the AMT value.
I tried the below, which clearly won't work as the dates/months are not consistent.
Any help will be appreciated.
SUM(AMT)
OVER (PARTITION BY ATTRIBUTE
ORDER BY DATE
ROWS BETWEEN 4 PRECEDING AND CURRENT ROW)
Unfortunately Teradata doesn't support RANGE, but if you only need to sum over a small number of values (six months = at most six rows) you can apply a brute-force approach:
AMT
+
CASE WHEN LAG(DATE,1) OVER (PARTITION BY ATTRIBUTE ORDER BY DATE) >= ADD_MONTHS(DATE,-6)
THEN LAG(AMT,1) OVER (PARTITION BY ATTRIBUTE ORDER BY DATE)
ELSE 0
END
+
CASE WHEN LAG(DATE,2) OVER (PARTITION BY ATTRIBUTE ORDER BY DATE) >= ADD_MONTHS(DATE,-6)
THEN LAG(AMT,2) OVER (PARTITION BY ATTRIBUTE ORDER BY DATE)
ELSE 0
END
+
...
It looks ugly, but it's mostly cut, paste and modify, and it's still a single step in Explain. Other possible solutions would be based on an additional EXPAND ON or a time-series aggregation step.

Transpose column and group dataframe [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 5 years ago.
I'm trying to reshape a dataframe in R to group multiple rows by a measurement. The table has a location (km), a size (mm), a count of things in that size bin, a site and a year. I want to take the sizes, make a column from each one (2, 4 and 6 in this example), and place the corresponding count into the row for that location, site and year.
It seems like a combination of transposing and grouping, but I can't figure out a way to accomplish this in R. I've looked at t(), dcast() and aggregate(), but those aren't really close at all.
So I would go from something like this:
df <- data.frame(km=c(rep(32,3),rep(50,3)), mm=rep(c(2,4,6),2), count=sample(1:25,6), site=rep("A", 6), year=rep(2013, 6))
km mm count site year
1 32 2 18 A 2013
2 32 4 2 A 2013
3 32 6 12 A 2013
4 50 2 3 A 2013
5 50 4 17 A 2013
6 50 6 21 A 2013
To this:
km site year mm_2 mm_4 mm_6
1 32 A 2013 18 2 12
2 50 A 2013 3 17 21
Edit: I tried the solution in the suggested duplicate, but it did not work for me and I'm not really sure why. The answer below worked better.
As suggested in the comment above, we can use the sep argument in spread:
library(tidyr)
spread(df, mm, count, sep = "_")
km site year mm_2 mm_4 mm_6
1 32 A 2013 4 20 1
2 50 A 2013 15 14 22
As you mentioned dcast(), here is a method using it.
set.seed(1)
df <- data.frame(km = c(rep(32, 3), rep(50, 3)),
                 mm = rep(c(2, 4, 6), 2),
                 count = sample(1:25, 6),
                 site = rep("A", 6),
                 year = rep(2013, 6))
library(reshape2)
dcast(df, ... ~ mm, value.var="count")
# km site year 2 4 6
# 1 32 A 2013 13 10 20
# 2 50 A 2013 3 17 1
And if you want a bit of a challenge you can try the base function reshape().
df2 <- reshape(df, v.names = "count", idvar = "km", timevar = "mm",
               ids = "mm", direction = "wide")
colnames(df2) <- sub("count.", "mm_", colnames(df2))
df2
# km site year mm_2 mm_4 mm_6
# 1 32 A 2013 13 10 20
# 4 50 A 2013 3 17 1
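For what it's worth, in current tidyr (1.0.0 and later) spread() has been superseded by pivot_wider(), which handles the column prefix directly. A minimal sketch, assuming the same df as above:
library(tidyr)
# names_from/values_from replace spread()'s key/value pair; names_prefix adds "mm_"
pivot_wider(df, names_from = mm, values_from = count, names_prefix = "mm_")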

Aggregating by subsets in dplyr

I have a dataset with a million records that I need to aggregate after first subsetting the data. It is difficult to provide a good reproducible sample because in this case the sample size would be rather large, but I will try anyway.
A random sample of the data that I am working with looks like this:
> df
auto_id user_id month
164537 7124 240249 10
151635 7358 226423 9
117288 7376 172463 9
177119 6085 199194 11
128904 7110 141608 9
157194 7143 241964 9
71303 6090 141646 7
72480 6808 175910 7
108705 6602 213098 8
97889 7379 185516 8
184906 6405 212580 12
37242 6057 197905 8
157284 6548 162928 9
17910 6885 194180 10
70660 7162 161827 7
8593 7375 207061 8
28712 6311 176373 10
144194 7324 142715 9
73106 7196 176153 7
67065 7392 171039 7
77954 7116 161489 7
59842 7107 162637 7
101819 5994 182973 9
183546 6427 142029 12
102881 6477 188129 8
In every month many of the users are the same, and first we should subset by month and make a frequency table of the users and the number of trips taken (unfortunately, in the random sample above there is only one trip per user, but in the larger dataset this is not the case):
full_data <- full_data[full_data$month == 7,]
users <- as.data.frame(table(full_data$user_id))
head(users)
Var1 Freq
1 100231 10
2 100744 17
3 111281 1
4 111814 2
5 113716 3
6 117493 3
As we can see, in the full data set users have taken multiple trips in the month of July (month == 7). Now the important part, which is to subset only the top 10% of these users (the top 10% in terms of Freq):
tenPercent = round(nrow(users)/10)
users <- users[order(-users$Freq),]
topten <- head(users, n = tenPercent)
Now the new dataframe, topten, can be summed and we get the number of trips taken by the top ten percent of users:
sum(topten$Freq)
[1] 12147
In the end the output should look like this
> output
month trips
1 7 12147
2 8 ...
3 9 ...
4 10 ...
5 11 ...
6 12 ...
Is there a way to automate this process using dplyr, specifically the subsetting by the top ten percent? I have tried
output <- full_data %>%
  group_by(month) %>%
  summarise(n = n())
But this only aggregates total trips by month. Could someone suggest a way to integrate this part into the dplyr query?
tenPercent = round(nrow(users)/10)
users <- users[order(-users$Freq),]
topten <- head(users, n = tenPercent)
The code below counts the number of rows for each user_id in each month, and then selects the 10% of users with the most rows in each month and sums them. Let me know if it solves your problem.
library(dplyr)
full_data %>%
  group_by(month, user_id) %>%
  tally %>%
  group_by(month) %>%
  filter(percent_rank(n) >= 0.9) %>%
  summarise(n_trips = sum(n))
UPDATE: Following up on your comment, let's do a check with some fake data. Below we have 30 different values of user_id and 10,000 total rows. I've also used the prob argument so that the probability of a user_id being selected is proportional to its value (i.e., user_id 1 is the least likely to be chosen and user_id 30 is the most likely to be chosen).
set.seed(3)
full_data = data.frame(user_id = sample(1:30, 10000, replace = TRUE, prob = 1:30),
                       month = sample(1:12, 10000, replace = TRUE))
Let's look at the number of rows for each user_id for month == 1. The code below counts the number of rows for each user_id and sorts from most to least common. Note that the three most common values of user_id (28, 29, 26) comprise 171 rows (60 + 57 + 54). Since there are 30 different values of user_id, the top three users represent the top 10% of users:
full_data %>%
  filter(month == 1) %>%
  group_by(month, user_id) %>%
  tally %>%
  arrange(desc(n)) %>%
  as.data.frame
month user_id n
1 1 28 60
2 1 29 57
3 1 26 54
4 1 30 53
5 1 27 49
6 1 22 43
7 1 21 41
8 1 20 40
9 1 23 40
10 1 24 38
11 1 25 38
12 1 19 37
13 1 18 33
14 1 16 28
15 1 15 27
16 1 17 27
17 1 14 26
18 1 9 20
19 1 12 20
20 1 13 20
21 1 10 17
22 1 11 17
23 1 6 15
24 1 7 13
25 1 8 13
26 1 4 9
27 1 5 7
28 1 2 3
29 1 3 2
30 1 1 1
So now let's take the next step and select the top 10% of users. To answer the question in your comment, filter(percent_rank(n) >= 0.9) keeps only the top 10% of user_id values, based on the value of n (which is the number of rows for each user_id). percent_rank is one of several ranking functions in dplyr that have different ways of dealing with ties (which may be the reason you're not getting the results you expect). See ?percent_rank for details:
full_data %>%
  filter(month == 1) %>%
  group_by(month, user_id) %>%
  tally %>%
  group_by(month) %>%
  filter(percent_rank(n) >= 0.9)
month user_id n
1 1 26 54
2 1 28 60
3 1 29 57
And the sum of n (the total number of trips for the top 10%) is:
full_data %>%
  filter(month == 1) %>%
  group_by(month, user_id) %>%
  tally %>%
  group_by(month) %>%
  filter(percent_rank(n) >= 0.9) %>%
  summarise(n_trips = sum(n))
month n_trips
1 1 171
So it looks like the code does what we'd naively expect, but maybe the issue is related to how ties are dealt with. Let me know if you're still getting anomalous results in your real data or if I've misunderstood what you're trying to accomplish.
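As a side note, recent dplyr versions (1.0.0 and later) offer slice_max(), which can express the "top 10% per group" step more directly. A sketch, assuming the same full_data; note that its with_ties argument controls the tie behaviour discussed above:
library(dplyr)
full_data %>%
  count(month, user_id) %>%      # trips per user per month, in column n
  group_by(month) %>%
  slice_max(n, prop = 0.1) %>%   # keep the top 10% of users by n; ties are kept by default
  summarise(n_trips = sum(n))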

Data frame to wide with data compression [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 6 years ago.
A data set similar to this...
ID <- c(rep(10,4),rep(20,4),rep(30,4),rep(40,4),rep(50,4))
Activity <- rep(c("In","Start","Finish","Out"),5)
Rsn <- c(rep("Rsn1",4),rep("Rsn11",4),rep("Rsn111",4),rep("Rsn11",4),rep("Rsn111",4))
Inst <- seq(1,20,1)
Loc <- c(rep("Here",4),rep("There",4),rep("Anywhere",4),rep("Somewhere",4),rep("SomewhereElse",4))
dc <- data.frame(ID,Activity,Rsn,Inst,Loc)
ID Activity Rsn Inst Loc
10 In Rsn1 1 Here
10 Start Rsn1 2 Here
10 Finish Rsn1 3 Here
10 Out Rsn1 4 Here
20 In Rsn11 5 There
20 Start Rsn11 6 There
20 Finish Rsn11 7 There
20 Out Rsn11 8 There
30 In Rsn111 9 Anywhere
30 Start Rsn111 10 Anywhere
30 Finish Rsn111 11 Anywhere
30 Out Rsn111 12 Anywhere
40 In Rsn11 13 Somewhere
40 Start Rsn11 14 Somewhere
40 Finish Rsn11 15 Somewhere
40 Out Rsn11 16 Somewhere
50 In Rsn111 17 SomewhereElse
50 Start Rsn111 18 SomewhereElse
50 Finish Rsn111 19 SomewhereElse
50 Out Rsn111 20 SomewhereElse
the end result that I would like to end up with is this...
ID2 <- c(10,20,30,40,50)
In2 <- seq(1,20,4)
Start2 <- seq(2,20,4)
Finish2 <- seq(3,20,4)
Out2 <- seq(4,20,4)
Rsn2 <- c("Rsn1","Rsn11","Rsn111","Rsn11","Rsn111")
Loc2 <- c("Here","There","Anywhere","Somewhere","SomewhereElse")
dw <- data.frame(ID2, In2, Start2, Finish2, Out2, Rsn2,Loc2)
ID In2 Start2 Finish2 Out2 Rsn2 Loc2
10 1 2 3 4 Rsn1 Here
20 5 6 7 8 Rsn11 There
30 9 10 11 12 Rsn111 Anywhere
40 13 14 15 16 Rsn11 Somewhere
50 17 18 19 20 Rsn111 SomewhereElse
I have tried tidyr and reshape and can't seem to get it right.
The data set has 816K lines and takes forever with loops and conditionals.
I have broken the dataset down by location, but it is still slow.
Thanks in advance.
We can use dcast from data.table:
library(data.table)
dcast(setDT(dc), ID+Rsn+Loc~Activity, value.var = "Inst")
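For reference, with the example data above this call should produce one row per ID, with one value column per Activity (column order follows the factor levels of Activity):
#    ID    Rsn           Loc Finish In Out Start
# 1: 10   Rsn1          Here      3  1   4     2
# 2: 20  Rsn11         There      7  5   8     6
# 3: 30 Rsn111      Anywhere     11  9  12    10
# 4: 40  Rsn11     Somewhere     15 13  16    14
# 5: 50 Rsn111 SomewhereElse     19 17  20    18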

Prepare Time Series for Machine Learning - Long to Wide Format

I have a data frame of time series data in a 'long' format where there is 1 row/observation per day. I would like to transform this data into a 'wide' format. Each row/observation should have the time series value for the current date and the previous 2 days.
To provide a concrete example, I will use the Air Quality data available in R. This is what my input data frame looks like.
> input <- airquality[1:4,c("Month", "Day", "Ozone")]
> input
Month Day Ozone
1 5 1 41
2 5 2 36
3 5 3 12
4 5 4 18
I would like to transform this input so that it looks like the following.
output <- data.frame(Month = 5, Day = 1:4, Ozone=c(41,36,12,18), Ozone.Prev.1=c(NA,41,36,12), Ozone.Prev.2=c(NA,NA,41,36))
> output
Month Day Ozone Ozone.Prev.1 Ozone.Prev.2
1 5 1 41 NA NA
2 5 2 36 41 NA
3 5 3 12 36 41
4 5 4 18 12 36
Any suggestions on a nice, clean way to do this? Many thanks in advance.
You can use the lag function from zoo, but the following small function gets the trick done without using additional packages:
shift_vector = function(vec, n) c(rep(NA, n), head(vec, -n))
output = transform(input,
                   prev_1 = shift_vector(Ozone, 1),
                   prev_2 = shift_vector(Ozone, 2))
output
Month Day Ozone prev_1 prev_2
1 5 1 41 NA NA
2 5 2 36 41 NA
3 5 3 12 36 41
4 5 4 18 12 36
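If you're already using dplyr, its lag() function performs the same shift. A sketch equivalent to shift_vector() above (the column names are my own choice):
library(dplyr)
# lag(x, n) pads the front with n NAs and drops the last n values
output <- mutate(input,
                 prev_1 = lag(Ozone, 1),
                 prev_2 = lag(Ozone, 2))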
