How to compute statistics on historical data in R? - r

I have a data.frame named A as the following:
uid uname csttime action_type
1 felix 2014-01-01 01:00:00 1
1 felix 2014-01-01 02:00:00 2
1 felix 2014-01-01 03:00:00 2
1 felix 2014-01-01 04:00:00 2
1 felix 2014-01-01 05:00:00 3
2 john 2014-02-01 01:00:00 1
2 john 2014-02-01 02:00:00 1
2 john 2014-02-01 03:00:00 1
2 john 2014-02-02 08:00:00 3
.......
I want to summarize the historical action_type values for each <uid,uname,csttime> combination. For example, for <1,'felix','2014-01-01 03:00:00'>, I want to know how many times each action_type had already occurred by then. Here, for <1,'felix','2014-01-01 03:00:00'>, action_type_1 is 1 and action_type_2 is 1.

If I'm understanding your question correctly, I believe there is a fairly simple dplyr answer.
library(dplyr)
# stack is the data frame from the question (called A there)
count(stack, uid, action_type)
This will yield:
uid action_type n
1 1 1 1
2 1 2 3
3 1 3 1
4 2 1 3
5 2 3 1
As you can see, this gives you each unique id, the action types they have taken, and the number of times each one occurred. If you want to include the date as well, you can do:
group_by(stack, uid, uname, csttime) %>%
count(uid, csttime, action_type)
Hope that helps.
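If the goal is a running history, that is, for each row, how many times each action_type had already occurred for that uid, a grouped cumulative sum is one way to get it. The following is a minimal base R sketch and not part of the original answer. It assumes the count should exclude the current row (which matches the felix example in the question), and the prior_type_* column names are made up for illustration.

```r
# Sample rows copied from the question
A <- data.frame(
  uid = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  uname = c(rep("felix", 5), rep("john", 4)),
  csttime = as.POSIXct(c(
    "2014-01-01 01:00:00", "2014-01-01 02:00:00", "2014-01-01 03:00:00",
    "2014-01-01 04:00:00", "2014-01-01 05:00:00", "2014-02-01 01:00:00",
    "2014-02-01 02:00:00", "2014-02-01 03:00:00", "2014-02-02 08:00:00"
  ), tz = "UTC"),
  action_type = c(1, 2, 2, 2, 3, 1, 1, 1, 3),
  stringsAsFactors = FALSE
)

# Running counts only make sense in time order within each uid
A <- A[order(A$uid, A$csttime), ]

for (k in sort(unique(A$action_type))) {
  hit <- as.integer(A$action_type == k)
  # cumsum() counts occurrences up to and including the current row;
  # subtracting hit leaves only the occurrences strictly before it
  A[[paste0("prior_type_", k)]] <- ave(hit, A$uid, FUN = cumsum) - hit
}

A[3, ]  # felix at 03:00:00 has prior_type_1 = 1 and prior_type_2 = 1
```

If the current row should be included in the count, drop the `- hit` term.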

Related

Referring to the row above when using mutate() in R

I want to create a new variable in a dataframe that refers to the value of the same new variable in the row above. Here's an example of what I want to do:
A horse is in a field divided into four zones. The horse is wearing a beacon that signals every minute, and the signal is picked up by one of four sensors, one for each zone. The field has a fence that runs most of the way down the middle, such that the horse can pass easily between zones 2 and 3, but to get between zones 1 and 4 it has to go via 2 and 3. The horse cannot jump over the fence.
         |________________|
         |                |
sensor 2 |  X    |        | sensor 3
         |       |        |
         |       |        |
         |       |        |
sensor 1 |      Y|        | sensor 4
         |       |        |
         |----------------|
In the schematic above, if the horse is at position X, it will be picked up by sensor 2. If the horse is near the middle fence at position Y, however, it may be picked up by either sensor 1 or sensor 4, the ranges of which overlap slightly.
In the toy example below, I have a dataframe where I have location data each minute for 20 minutes. In most cases, the horse moves one zone at a time, but in several instances, it switches back and forth between zone 1 and 4. This should be impossible: the horse cannot jump the fence, and neither can it run around in the space of a minute.
I therefore want to calculate a new variable in the dataset that provides the "true" location of the animal, accounting for the impossibility of travelling between 1 and 4.
Here's the data:
library(tidyverse)
library(reshape2)
example <- data.frame(time = seq(as.POSIXct("2022-01-01 09:00:00"),
as.POSIXct("2022-01-01 09:20:00"),
by="1 mins"),
location = c(1,1,1,1,2,3,3,3,4,4,4,3,3,2,1,1,4,1,4,1,4))
example
Create two new variables: "prevloc" is where the animal was in the previous minute, and "diffloc" is the absolute difference between the animal's current and previous location.
example <- example %>% mutate(prevloc = lag(location),
diffloc = abs(location - prevloc))
example
Next, just change the first value of "diffloc" from NA to zero:
example <- example %>% mutate(diffloc = ifelse(is.na(diffloc), 0, diffloc))
example
Now we have a dataframe where diffloc is either 0 (animal didn't move), 1 (animal moved one zone), or 3 (animal apparently moved from zone 1 to zone 4 or vice versa). Where diffloc = 3, I want to create a "true" location taking account of the fact that such a change in location is impossible.
In my example, the animal went from zone 1 -> 4 -> 1 -> 4 -> 1 -> 4. Based on the fact that the animal started in zone 1, my assumption is that the animal just stayed in zone 1 the whole time.
My attempt to solve this below, which doesn't work:
example <- example %>%
mutate(returnloc = ifelse(diffloc < 3, location, lag(returnloc)))
I wonder whether anyone can help me to solve this? I've been trying for a couple of days and haven't even got close...
Best wishes,
Adam
One possible solution: when diffloc == 3, look at the most recent previous location that is neither 1 nor 4. If it is 2, the horse must be in zone 1 afterwards; if it is 3, the horse must be in zone 4.
example %>%
mutate(trueloc = case_when(diffloc == 3 & sapply(seq(row_number()), \(i) tail(location[1:i][!location %in% c(1, 4)], 1) == 2) ~ 1,
diffloc == 3 & sapply(seq(row_number()), \(i) tail(location[1:i][!location %in% c(1, 4)], 1) == 3) ~ 4,
TRUE ~ location))
time location prevloc diffloc trueloc
1 2022-01-01 09:00:00 1 NA 0 1
2 2022-01-01 09:01:00 1 1 0 1
3 2022-01-01 09:02:00 1 1 0 1
4 2022-01-01 09:03:00 1 1 0 1
5 2022-01-01 09:04:00 2 1 1 2
6 2022-01-01 09:05:00 3 2 1 3
7 2022-01-01 09:06:00 3 3 0 3
8 2022-01-01 09:07:00 3 3 0 3
9 2022-01-01 09:08:00 4 3 1 4
10 2022-01-01 09:09:00 4 4 0 4
11 2022-01-01 09:10:00 4 4 0 4
12 2022-01-01 09:11:00 3 4 1 3
13 2022-01-01 09:12:00 3 3 0 3
14 2022-01-01 09:13:00 2 3 1 2
15 2022-01-01 09:14:00 1 2 1 1
16 2022-01-01 09:15:00 1 1 0 1
17 2022-01-01 09:16:00 4 1 3 1
18 2022-01-01 09:17:00 1 4 3 1
19 2022-01-01 09:18:00 4 1 3 1
20 2022-01-01 09:19:00 1 4 3 1
21 2022-01-01 09:20:00 4 1 3 1
Here is an approach using a function containing a for-loop.
You cannot rely on diff, because this will not pick up sequences of (wrong) zone 4's.
c(1,1,4,4,4,1,1,1) should be converted to c(1,1,1,1,1,1,1,1) if I understand your question correctly.
So, you need to iterate (I think).
library(data.table)
# custom sample data set
example <- data.frame(time = seq(as.POSIXct("2022-01-01 09:00:00"),
as.POSIXct("2022-01-01 09:20:00"),
by="1 mins"),
location = c(1,1,1,1,2,3,3,3,4,4,4,3,3,2,1,1,4,4,4,1,4))
# Make it a data.table, make sure the time is ordered
setDT(example, key = "time")
# function
fixLocations <- function(x) {
for (i in seq_along(x)[-1]) {
if (abs(x[i] - x[i-1]) > 1) x[i] <- x[i-1]
}
return(x)
}
Note that this function only works if the location in the first row is correct. If the data start with (wrong) zone 4's, it will go awry.
example[, locationNew := fixLocations(location)][]
# time location locationNew
# 1: 2022-01-01 09:00:00 1 1
# 2: 2022-01-01 09:01:00 1 1
# 3: 2022-01-01 09:02:00 1 1
# 4: 2022-01-01 09:03:00 1 1
# 5: 2022-01-01 09:04:00 2 2
# 6: 2022-01-01 09:05:00 3 3
# 7: 2022-01-01 09:06:00 3 3
# 8: 2022-01-01 09:07:00 3 3
# 9: 2022-01-01 09:08:00 4 4
#10: 2022-01-01 09:09:00 4 4
#11: 2022-01-01 09:10:00 4 4
#12: 2022-01-01 09:11:00 3 3
#13: 2022-01-01 09:12:00 3 3
#14: 2022-01-01 09:13:00 2 2
#15: 2022-01-01 09:14:00 1 1
#16: 2022-01-01 09:15:00 1 1
#17: 2022-01-01 09:16:00 4 1
#18: 2022-01-01 09:17:00 4 1
#19: 2022-01-01 09:18:00 4 1
#20: 2022-01-01 09:19:00 1 1
#21: 2022-01-01 09:20:00 4 1
# time location locationNew
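The same carry-forward rule can also be written without an explicit loop by using base R's Reduce() with accumulate = TRUE, which passes the previously corrected value into each step. This is only a sketch under the same assumption as the loop version (the first reading is correct); fix_locations_reduce is a hypothetical name, not something from the answer.

```r
# Sample locations from the answer above
location <- c(1, 1, 1, 1, 2, 3, 3, 3, 4, 4, 4, 3, 3, 2, 1, 1, 4, 4, 4, 1, 4)

fix_locations_reduce <- function(x) {
  # prev is the already-corrected previous location, cur is the raw reading;
  # a jump of more than one zone is treated as impossible and replaced by prev
  Reduce(function(prev, cur) if (abs(cur - prev) > 1) prev else cur,
         x, accumulate = TRUE)
}

fixed <- fix_locations_reduce(location)
fixed
```

A plain mutate() with lag() cannot express this, because each corrected value depends on the previous corrected value rather than on the previous raw value.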

FOR Loop with multiple parameters from a dataframe in R [closed]

I would like to know whether it is possible to build a FOR loop in R which would change multiple parameters at every run.
I have parameter dataframe [df_params] which looks like this:
group person date_from date_to
1 Mike 2020-10-01 12:00:00 2020-10-01 13:00:00
2 Mike 2020-10-04 09:00:00 2020-10-07 17:00:00
3 Dave 2020-10-07 12:00:00 2020-10-07 13:00:00
4 Dave 2020-10-09 09:00:00 2020-10-11 17:00:00
I would like to loop over a larger dataframe [df] and get only the rows matching parameters of individual rows in the "df_params" dataframe.
The large dataframe [df] looks like this:
person datetime books tasks done
Mike 2020-10-01 12:15:00 5 7 2
Mike 2020-10-01 12:17:00 5 7 3
Mike 2020-10-01 18:00:00 5 7 4
Mike 2020-10-02 12:00:00 5 5 0
Mike 2020-10-04 09:08:00 5 3 3
Mike 2020-10-09 12:00:00 5 7 1
Dave 2020-10-07 12:22:00 7 5 1
Dave 2020-10-08 02:34:00 7 5 2
Dave 2020-10-09 07:00:00 7 3 3
Dave 2020-10-09 08:00:00 7 8 5
Dave 2020-10-09 09:48:00 7 7 2
Nick 2020-10-01 13:00:00 3 7 3
Nick 2020-10-02 12:58:00 3 3 2
Nick 2020-10-03 10:02:00 3 7 1
The desired result would look like this:
person datetime books tasks done group
Mike 2020-10-01 12:15:00 5 7 2 1
Mike 2020-10-01 12:17:00 5 7 3 1
Mike 2020-10-04 09:08:00 5 3 3 2
Dave 2020-10-07 12:22:00 7 5 1 3
Dave 2020-10-09 09:48:00 7 7 2 4
Is something like this possible in R?
Thank you very much for any suggestions.
This might be a slightly expensive solution if your datasets are very large, but it outputs the desired result.
I don't know if your date variables are already in date format; below I convert them with the lubridate package just in case they aren't.
Also, I create the variable date_interval that will be used later for a filtering condition.
library(dplyr)
library(lubridate)
# convert to date format
df_params <- df_params %>%
mutate(
date_from = ymd_hms(date_from),
date_to = ymd_hms(date_to),
# create interval
date_interval = interval(date_from, date_to)
)
df <- df %>%
mutate(datetime = ymd_hms(datetime))
After this manipulation step, I use a left_join on the person name, which pairs every row of df with every parameter row for that person. This produces a larger intermediate dataframe, which is why I said this operation might be a little expensive. I then keep only the rows where datetime falls within the interval created above.
left_join(df, df_params, by = "person") %>%
filter(datetime %within% date_interval) %>%
select(person:group)
# person datetime books tasks done group
# 1 Mike 2020-10-01 12:15:00 5 7 2 1
# 2 Mike 2020-10-01 12:17:00 5 7 3 1
# 3 Mike 2020-10-04 09:08:00 5 3 3 2
# 4 Dave 2020-10-07 12:22:00 7 5 1 3
# 5 Dave 2020-10-09 09:48:00 7 7 2 4
Starting data
df_params <- read.table(text="
group person date_from date_to
1 Mike 2020-10-01T12:00:00 2020-10-01T13:00:00
2 Mike 2020-10-04T09:00:00 2020-10-07T17:00:00
3 Dave 2020-10-07T12:00:00 2020-10-07T13:00:00
4 Dave 2020-10-09T09:00:00 2020-10-11T17:00:00", header=T)
df <- read.table(text="
person datetime books tasks done
Mike 2020-10-01T12:15:00 5 7 2
Mike 2020-10-01T12:17:00 5 7 3
Mike 2020-10-01T18:00:00 5 7 4
Mike 2020-10-02T12:00:00 5 5 0
Mike 2020-10-04T09:08:00 5 3 3
Mike 2020-10-09T12:00:00 5 7 1
Dave 2020-10-07T12:22:00 7 5 1
Dave 2020-10-08T02:34:00 7 5 2
Dave 2020-10-09T07:00:00 7 3 3
Dave 2020-10-09T08:00:00 7 8 5
Dave 2020-10-09T09:48:00 7 7 2
Nick 2020-10-01T13:00:00 3 7 3
Nick 2020-10-02T12:58:00 3 3 2
Nick 2020-10-03T10:02:00 3 7 1 ", header=T)
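For comparison, the loop that the question asks about can be sketched in base R: iterate over the rows of df_params, filter df with that row's person and time window, and tag the matches with the group. This is a rough illustration rather than a recommendation over the join; the variable names pieces, p, keep, hit and result are made up, and the window is assumed to be inclusive at both ends.

```r
# Data copied from the question
df_params <- data.frame(
  group = 1:4,
  person = c("Mike", "Mike", "Dave", "Dave"),
  date_from = as.POSIXct(c("2020-10-01 12:00:00", "2020-10-04 09:00:00",
                           "2020-10-07 12:00:00", "2020-10-09 09:00:00"), tz = "UTC"),
  date_to = as.POSIXct(c("2020-10-01 13:00:00", "2020-10-07 17:00:00",
                         "2020-10-07 13:00:00", "2020-10-11 17:00:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)
df <- data.frame(
  person = c(rep("Mike", 6), rep("Dave", 5), rep("Nick", 3)),
  datetime = as.POSIXct(c(
    "2020-10-01 12:15:00", "2020-10-01 12:17:00", "2020-10-01 18:00:00",
    "2020-10-02 12:00:00", "2020-10-04 09:08:00", "2020-10-09 12:00:00",
    "2020-10-07 12:22:00", "2020-10-08 02:34:00", "2020-10-09 07:00:00",
    "2020-10-09 08:00:00", "2020-10-09 09:48:00", "2020-10-01 13:00:00",
    "2020-10-02 12:58:00", "2020-10-03 10:02:00"), tz = "UTC"),
  books = c(5, 5, 5, 5, 5, 5, 7, 7, 7, 7, 7, 3, 3, 3),
  tasks = c(7, 7, 7, 5, 3, 7, 5, 5, 3, 8, 7, 7, 3, 7),
  done  = c(2, 3, 4, 0, 3, 1, 1, 2, 3, 5, 2, 3, 2, 1),
  stringsAsFactors = FALSE
)

# One list slot per parameter row; each run uses that row's three parameters
pieces <- vector("list", nrow(df_params))
for (i in seq_len(nrow(df_params))) {
  p <- df_params[i, ]
  keep <- df$person == p$person &
    df$datetime >= p$date_from &
    df$datetime <= p$date_to
  hit <- df[keep, ]
  hit$group <- rep(p$group, nrow(hit))  # safe even when nothing matches
  pieces[[i]] <- hit
}
result <- do.call(rbind, pieces)
result
```

Collecting the pieces in a list and binding once at the end avoids growing a data frame inside the loop.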

How to convert minute data to hourly data correctly in R? [closed]

Say I have the following sample minute data.
> data = xts(1:12, as.POSIXct("2020-01-01")+(1:12)*60*20)
> data
[,1]
2020-01-01 00:20:00 1
2020-01-01 00:40:00 2
2020-01-01 01:00:00 3
2020-01-01 01:20:00 4
2020-01-01 01:40:00 5
2020-01-01 02:00:00 6
2020-01-01 02:20:00 7
2020-01-01 02:40:00 8
2020-01-01 03:00:00 9
2020-01-01 03:20:00 10
2020-01-01 03:40:00 11
2020-01-01 04:00:00 12
This is already aligned minute data, but now I want hourly bars.
Easy, just use the to.hourly command, right?
> to.hourly(data)
data.Open data.High data.Low data.Close
2020-01-01 00:40:00 1 2 1 2
2020-01-01 01:40:00 3 5 3 5
2020-01-01 02:40:00 6 8 6 8
2020-01-01 03:40:00 9 11 9 11
2020-01-01 04:00:00 12 12 12 12
The problem is that it puts the end value of each bar into the next bar, and the last value creates its own hour period.
Now to only show correct hourly bars I use align.time.
> align.time(to.hourly(data),60*60)
data.Open data.High data.Low data.Close
2020-01-01 01:00:00 1 2 1 2
2020-01-01 02:00:00 3 5 3 5
2020-01-01 03:00:00 6 8 6 8
2020-01-01 04:00:00 9 11 9 11
2020-01-01 05:00:00 12 12 12 12
The previous last entry creates its own hour bar which I need to remove.
The same issue occurs if I convert to daily: the last entry goes to the next day and an extra day is created.
The question is how to convert to different periods correctly?
The desired result for the example is:
data.Open data.High data.Low data.Close
2020-01-01 01:00:00 1 3 1 3
2020-01-01 02:00:00 4 6 4 6
2020-01-01 03:00:00 7 9 7 9
2020-01-01 04:00:00 10 12 10 12
This seems like a very basic option and I have searched and found many examples, but not one that considers the last value in a period. Thank you.
UPDATE:
Allan Cameron gave a fantastic answer and it absolutely works, I am just concerned that it will fail at some point with different time periods.
My workflow starts with tick data which I convert to second and minute and so on. Converting tick to higher periods would work perfectly, but it is too much data to handle at once, hence the staggered approach. That is why the aligned data needs to work with any period conversion.
I made a small modification to Allan's code:
setNames(shift.time(to.hourly(shift.time(data, -.0000001193)), .0000001193), c("Open", "High", "Low", "Close"))
.0000001193 was the smallest value I found to work with simple trial and error.
Is there any time where this would not work or would the min value be different?
Is this the best way to handle this issue?
Thank you.
You can shift the time back 60 seconds, apply to.hourly, then shift the time forward 60 seconds. This maintains the groupings. You'll need to rename the columns too:
setNames(shift.time(to.hourly(shift.time(data, -60)), 60), c("Open", "High", "Low", "Close"))
#> Open High Low Close
#> 2020-01-01 01:00:00 1 3 1 3
#> 2020-01-01 02:00:00 4 6 4 6
#> 2020-01-01 03:00:00 7 9 7 9
#> 2020-01-01 04:00:00 10 12 10 12
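To see what the shift is doing, it may help to spell out the grouping rule directly: each timestamp belongs to the bar that ends at the next full hour, with a timestamp exactly on the hour closing its own bar. The sketch below uses plain base R instead of xts (so it is not the answer's method) and rounds the epoch seconds up with ceiling(); the names bar_end and ohlc are made up for illustration.

```r
# Same 20-minute series as in the question, as plain vectors
times <- as.POSIXct("2020-01-01", tz = "UTC") + (1:12) * 60 * 20
vals <- 1:12

# Right-closed bars: round each timestamp UP to the next full hour,
# so 01:00:00 stays in the bar that ends at 01:00:00
bar_end <- ceiling(as.numeric(times) / 3600) * 3600

ohlc <- do.call(rbind, lapply(split(vals, bar_end), function(v) {
  c(Open = v[1], High = max(v), Low = min(v), Close = v[length(v)])
}))

# Label each bar with its end time instead of raw epoch seconds
rownames(ohlc) <- format(
  as.POSIXct(as.numeric(rownames(ohlc)), origin = "1970-01-01", tz = "UTC"),
  "%Y-%m-%d %H:%M:%S"
)
ohlc
```

Because the rule does not depend on a hand-picked offset, changing 3600 to another period length (for example 86400 for UTC days) applies the same right-closed grouping, assuming the data are in time order within each bar.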

R Max of Same Date, Previous Date, and Previous Hour Value

A couple basic data manipulations. I searched with different wordings and couldn't find much.
I have data structured as below. In reality the hourly data is continuous, but I just included 4 lines as an example.
start <- as.POSIXlt(c('2017-1-1 1:00','2017-1-1 2:00','2017-1-2 1:00','2017-1-2 2:00'))
values <- as.numeric(c(2,5,4,3))
df <- data.frame(start,values)
df
start values
1 2017-01-01 01:00:00 2
2 2017-01-01 02:00:00 5
3 2017-01-02 01:00:00 4
4 2017-01-02 02:00:00 3
I would like to add a couple columns that:
1) Show the max of the same day.
2) Show the max of the previous day.
3) Show the value of one previous hour.
The goal is to have an output like:
MaxValueDay <- as.numeric(c(5,5,4,4))
MaxValueYesterday <- as.numeric(c(NA,NA,5,5))
PreviousHourValue <- as.numeric(c(NA,2,NA,4))
df2 <- data.frame(start,values,MaxValueDay,MaxValueYesterday,PreviousHourValue)
df2
start values MaxValueDay MaxValueYesterday PreviousHourValue
1 2017-01-01 01:00:00 2 5 NA NA
2 2017-01-01 02:00:00 5 5 NA 2
3 2017-01-02 01:00:00 4 4 5 NA
4 2017-01-02 02:00:00 3 4 5 4
Any help would be greatly appreciated. Thanks
A solution using dplyr, magrittr, and lubridate packages:
library(dplyr)
library(magrittr)
library(lubridate)
df %>%
within(MaxValueDay <- sapply(as.Date(start), function (x) max(df$values[which(x==as.Date(start))]))) %>%
within(MaxValueYesterday <- MaxValueDay[sapply(as.Date(start)-1, match, as.Date(start))]) %>%
within(PreviousHourValue <- values[sapply(start-hours(1), match, start)])
# start values MaxValueDay MaxValueYesterday PreviousHourValue
# 1 2017-01-01 01:00:00 2 5 NA NA
# 2 2017-01-01 02:00:00 5 5 NA 2
# 3 2017-01-02 01:00:00 4 4 5 NA
# 4 2017-01-02 02:00:00 3 4 5 4
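For continuous hourly data, the per-row sapply() calls above can get slow. A vectorised base R variant is sketched below, assuming the timestamps are unique and that a missing previous day or previous hour should give NA; it uses ave() for the per-day maximum and match() for the two lookups. The helper variables day and secs are made up for illustration.

```r
start <- as.POSIXct(c("2017-01-01 01:00", "2017-01-01 02:00",
                      "2017-01-02 01:00", "2017-01-02 02:00"), tz = "UTC")
values <- c(2, 5, 4, 3)
df <- data.frame(start, values)

day  <- as.Date(df$start, tz = "UTC")
secs <- as.numeric(df$start)

# 1) max of the same day: ave() returns the group maximum on every row
df$MaxValueDay <- ave(df$values, format(day), FUN = max)

# 2) max of the previous day: find any row dated one day earlier and
#    read its (already computed) daily maximum; NA when there is none
df$MaxValueYesterday <- df$MaxValueDay[match(as.numeric(day) - 1, as.numeric(day))]

# 3) value one hour earlier: look up the timestamp 3600 seconds before
df$PreviousHourValue <- df$values[match(secs - 3600, secs)]

df
```

Using match() rather than lag() means a gap in the hourly series yields NA instead of silently picking up the wrong row.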

Creating a 4-hr time interval using a reference column in R

I want to create a 4-hrs interval using a reference column from a data frame. I have a data frame like this one:
species<-"ABC"
ind<-rep(1:4,each=24)
hour<-rep(seq(0,23,by=1),4)
depth<-runif(length(ind),1,50)
df<-data.frame(cbind(species,ind,hour,depth))
df$depth<-as.numeric(df$depth)
What I would like is to create a new column (without changing the information or dimensions of the original data frame) that could look at my hour column (reference column) and based on that value will give me a 4-hrs time interval. For example, if the value from the hour column is between 0 and 3, then the value in new column will be 0; if the value is between 4 and 7 the value in the new column will be 4, and so on... In excel I used to use the floor/ceiling functions for this, but in R they are not exactly the same. Also, if someone has an easier suggestion for this using the original date/time data that could work too. In my original script I used the function as.POSIXct to get the date/time data, and from there my hour column.
I appreciate your help,
What about taking the column of hours, converting it to integers, and using integer division to get the floor? Something like this:
# convert hour to integer (hour is currently a col of factors)
i <- as.numeric(levels(df$hour))[df$hour]
# make new column
df$interval <- (i %/% 4) * 4
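The `as.numeric(levels(f))[f]` idiom matters here because calling as.numeric() directly on a factor returns its internal level codes, not the hours. A small sketch with a made-up factor (hour_factor is a hypothetical name) shows the conversion followed by the integer-division floor:

```r
# A factor of hours, as you would get when hour was coerced through cbind()
hour_factor <- factor(c("0", "3", "4", "7", "10", "23"))

# as.numeric(hour_factor) would give level codes, not hours, so go via the labels
hour <- as.numeric(levels(hour_factor))[hour_factor]

# Integer division by 4 gives the block index; multiplying by 4 gives the block start
interval <- (hour %/% 4) * 4
interval
```

If hour is already numeric, only the last step is needed.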
Expanding on my comment, since I think you're ultimately looking for actual dates at some point...
Some sample hourly data:
set.seed(1)
mydata <- data.frame(species = "ABC",
ind = rep(1:4, each=24),
depth = runif(96, 1, 50),
datetime = seq(ISOdate(2000, 1, 1, 0, 0, 0),
by = "1 hour", length.out = 96))
list(head(mydata), tail(mydata))
# [[1]]
# species ind depth datetime
# 1 ABC 1 14.00992 2000-01-01 00:00:00
# 2 ABC 1 19.23407 2000-01-01 01:00:00
# 3 ABC 1 29.06981 2000-01-01 02:00:00
# 4 ABC 1 45.50218 2000-01-01 03:00:00
# 5 ABC 1 10.88241 2000-01-01 04:00:00
# 6 ABC 1 45.02109 2000-01-01 05:00:00
#
# [[2]]
# species ind depth datetime
# 91 ABC 4 12.741841 2000-01-04 18:00:00
# 92 ABC 4 3.887784 2000-01-04 19:00:00
# 93 ABC 4 32.472125 2000-01-04 20:00:00
# 94 ABC 4 43.937191 2000-01-04 21:00:00
# 95 ABC 4 39.166819 2000-01-04 22:00:00
# 96 ABC 4 40.068132 2000-01-04 23:00:00
Transforming that data using cut and format:
mydata <- within(mydata, {
hourclass <- cut(datetime, "4 hours") # Find the intervals
hourfloor <- format(as.POSIXlt(hourclass), "%H") # Display just the "hour"
})
list(head(mydata), tail(mydata))
# [[1]]
# species ind depth datetime hourclass hourfloor
# 1 ABC 1 14.00992 2000-01-01 00:00:00 2000-01-01 00:00:00 00
# 2 ABC 1 19.23407 2000-01-01 01:00:00 2000-01-01 00:00:00 00
# 3 ABC 1 29.06981 2000-01-01 02:00:00 2000-01-01 00:00:00 00
# 4 ABC 1 45.50218 2000-01-01 03:00:00 2000-01-01 00:00:00 00
# 5 ABC 1 10.88241 2000-01-01 04:00:00 2000-01-01 04:00:00 04
# 6 ABC 1 45.02109 2000-01-01 05:00:00 2000-01-01 04:00:00 04
#
# [[2]]
# species ind depth datetime hourclass hourfloor
# 91 ABC 4 12.741841 2000-01-04 18:00:00 2000-01-04 16:00:00 16
# 92 ABC 4 3.887784 2000-01-04 19:00:00 2000-01-04 16:00:00 16
# 93 ABC 4 32.472125 2000-01-04 20:00:00 2000-01-04 20:00:00 20
# 94 ABC 4 43.937191 2000-01-04 21:00:00 2000-01-04 20:00:00 20
# 95 ABC 4 39.166819 2000-01-04 22:00:00 2000-01-04 20:00:00 20
# 96 ABC 4 40.068132 2000-01-04 23:00:00 2000-01-04 20:00:00 20
Note that your new "hourclass" variable is a factor and the new "hourfloor" variable is character, but you can easily change those, even during the within stage.
str(mydata)
# 'data.frame': 96 obs. of 6 variables:
# $ species : Factor w/ 1 level "ABC": 1 1 1 1 1 1 1 1 1 1 ...
# $ ind : int 1 1 1 1 1 1 1 1 1 1 ...
# $ depth : num 14 19.2 29.1 45.5 10.9 ...
# $ datetime : POSIXct, format: "2000-01-01 00:00:00" "2000-01-01 01:00:00" ...
# $ hourclass: Factor w/ 24 levels "2000-01-01 00:00:00",..: 1 1 1 1 2 2 2 2 3 3 ...
# $ hourfloor: chr "00" "00" "00" "00" ...
Tip number 1: don't use cbind to create a data.frame with columns of differing types; everything gets coerced to the same type (in this case factor).
findInterval or cut would seem appropriate here.
df <- data.frame(species,ind,hour,depth)
# copy
df2 <- df
df2$fourhour <- c(0,4,8,12,16,20)[findInterval(df$hour, c(0,4,8,12,16,20))]
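To make the lookup step explicit: findInterval() returns, for each hour, the index of the last break that is less than or equal to it, and that index is then used to pick the break value itself. A short sketch with a few hand-picked hours (the vector names are illustrative only):

```r
breaks <- seq(0, 20, by = 4)            # 0 4 8 12 16 20
hour <- c(0, 3, 4, 7, 8, 19, 20, 23)

idx <- findInterval(hour, breaks)       # position of the interval each hour falls in
fourhour <- breaks[idx]                 # start of that interval

data.frame(hour, idx, fourhour)
```

One caveat: a value below the first break gets index 0, and indexing with 0 silently drops that element, so this lookup assumes every hour is at least breaks[1].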
Though there is probably a simpler way, here is one attempt.
First, though, build your data.frame without cbind, so that hour is numeric rather than a factor:
df <- data.frame(species,ind,hour,depth)
Then:
df$interval <- factor(findInterval(df$hour,seq(0,23,4)),labels=seq(0,23,4))
Result:
> head(df)
species ind hour depth interval
1 ABC 1 0 23.11215 0
2 ABC 1 1 10.63896 0
3 ABC 1 2 18.67615 0
4 ABC 1 3 28.01860 0
5 ABC 1 4 38.25594 4
6 ABC 1 5 30.51363 4
You could also make the labels a bit nicer like:
cutseq <- seq(0,23,4)
df$interval <- factor(
findInterval(df$hour,cutseq),
labels=paste(cutseq,cutseq+3,sep="-")
)
Result:
> head(df)
species ind hour depth interval
1 ABC 1 0 23.11215 0-3
2 ABC 1 1 10.63896 0-3
3 ABC 1 2 18.67615 0-3
4 ABC 1 3 28.01860 0-3
5 ABC 1 4 38.25594 4-7
6 ABC 1 5 30.51363 4-7
