How can I select cases based on time data in R?

I am new to R and am stumbling over a few problems I can't seem to solve on my own. I can't figure out how to select cases based on time values.
I want to select cases where Time_D - Time_A is equal to or above 5 seconds (for the same individual).
For instance, my data frame consists of the following data:
+---+------------+----------+----------+
|   | Individual | Time_A   | Time_D   |
+---+------------+----------+----------+
| 1 | A          | 09:21:27 | 09:21:28 |
| 2 | A          | 09:21:29 | 09:21:40 |
| 3 | A          | 09:21:30 | 09:21:36 |
| 4 | B          | 09:32:14 | 09:32:23 |
| 5 | B          | 09:32:18 | 09:32:22 |
+---+------------+----------+----------+
And I want to only select the cases where Time_D - Time_A >= 5 seconds to get the following data frame:
+---+------------+----------+----------+
|   | Individual | Time_A   | Time_D   |
+---+------------+----------+----------+
| 2 | A          | 09:21:29 | 09:21:40 |
| 3 | A          | 09:21:30 | 09:21:36 |
| 4 | B          | 09:32:14 | 09:32:23 |
+---+------------+----------+----------+
I have already converted the columns to times:
DT <- as.data.table(df3)[, Time_A := as.ITime(Time_A)][, Time_D := as.ITime(Time_D)]

After converting the columns to ITime you can subtract Time_D - Time_A and keep rows where the difference is at least 5 seconds.
library(data.table)
cols <- c('Time_A', 'Time_D')
setDT(df)[, (cols) := lapply(.SD, as.ITime), .SDcols = cols]
df[(Time_D - Time_A) >= 5]
#   Individual   Time_A   Time_D
#1:          A 09:21:29 09:21:40
#2:          A 09:21:30 09:21:36
#3:          B 09:32:14 09:32:23
In base R, you can do this with as.POSIXct.
subset(df, as.POSIXct(Time_D, format = '%T') -
           as.POSIXct(Time_A, format = '%T') >= 5)
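If you prefer the unit of comparison to be explicit rather than relying on the default difftime units, a small variant of the same idea wraps the subtraction in difftime() with units = 'secs':

# base R variant: make the comparison unit explicit with difftime()
subset(df, as.numeric(difftime(as.POSIXct(Time_D, format = '%T'),
                               as.POSIXct(Time_A, format = '%T'),
                               units = 'secs')) >= 5)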

We can use the tidyverse:
library(dplyr)
library(lubridate)
df1 %>%
  filter(period_to_seconds(hms(Time_D)) - period_to_seconds(hms(Time_A)) >= 5)
#  Individual   Time_A   Time_D
#1          A 09:21:29 09:21:40
#2          A 09:21:30 09:21:36
#3          B 09:32:14 09:32:23
data
df1 <- structure(list(Individual = c("A", "A", "A", "B", "B"),
                      Time_A = c("09:21:27", "09:21:29", "09:21:30",
                                 "09:32:14", "09:32:18"),
                      Time_D = c("09:21:28", "09:21:40", "09:21:36",
                                 "09:32:23", "09:32:22")),
                 class = "data.frame", row.names = c(NA, -5L))

Related

Create a new data.table of percentages based off another data.table

I am trying to create a new data.table of percentages based on another data.table.
I thought about creating new columns and dividing them, but I got lost in the logic of how to do this. Basically, I need to know the percentage of visits at which each Subject has Meet = 1.
datatable_1
+------+---------+-------+
| Meet | Subject | Visit |
+------+---------+-------+
| 1    | a       | 1     |
| 1    | a       | 2     |
| 0    | a       | 3     |
| 1    | b       | 1     |
| 1    | b       | 2     |
| 1    | b       | 3     |
+------+---------+-------+
This is what the new data.table should look like
datatable_2
+---------+------------+
| Subject | Percentage |
+---------+------------+
| a       | 0.66       |
| b       | 1.00       |
+---------+------------+
If Meet contains only 1/0 values, you can take the average of Meet for each Subject.
library(data.table)
setDT(df)[, .(Percentage = mean(Meet)), Subject]
#   Subject Percentage
#1:       a      0.667
#2:       b      1.000
This can also be written in base R and dplyr:

# Base R
aggregate(Meet ~ Subject, df, mean)

# dplyr
library(dplyr)
df %>% group_by(Subject) %>% summarise(Percentage = mean(Meet))
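If you want the output rounded to two decimals, as in the desired table, a cosmetic variation of the same computation simply wraps the group mean in round():

library(data.table)
# round the per-Subject proportion to two decimals
setDT(df)[, .(Percentage = round(mean(Meet), 2)), Subject]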

Grouping data based on repetitive records using R

I have a dataset which contains repetitive records/common records. It looks something like this:
| Vendor | Buyer | Amount |
|--------|:-----:|-------:|
| A | P | 100 |
| B | P | 150 |
| C | Q | 300 |
| A | P | 290 |
I need to group similar records together, but I do not want to summarize my Amount; I want the Amount values represented individually. The output should look something like this:
| Vendor | Buyer | Amount |
|--------|:-----:|-------:|
| A | P | 100 |
| A | P | 290 |
| | | |
| B | P | 150 |
| | | |
| C | Q | 300 |
I thought of using split(), but since my original data has so many records, split() creates a huge list of groups and it becomes tedious to create new datasets from them. How can I achieve the above stated output with any other method?
EDIT:
Let us assume that we have an additional column called date and the dataset now looks like this:
| Vendor | Buyer | Amount | Date |
|--------|:-----:|-------:|-----------|
| A | P | 100 | 3/6/2019 |
| B | P | 150 | 7/6/2018 |
| C | Q | 300 | 4/21/2018 |
| A | P | 290 | 6/5/2018 |
Once each buyer and vendor is grouped together, I need to arrange the dates in ascending order for each buyer and vendor, such that it looks something like this:
| Vendor | Buyer | Amount | Date |
|--------|:-----:|-------:|-----------|
| A | P | 290 | 6/5/2018 |
| A | P | 100 | 3/6/2019 |
| | | | |
| B | P | 150 | 7/6/2018 |
| | | | |
| C | Q | 300 | 4/21/2018 |
and then remove the single transactions to get the final table containing only
| Vendor | Buyer | Amount | Date |
|--------|:-----:|-------:|----------|
| A | P | 290 | 6/5/2018 |
| A | P | 100 | 3/6/2019 |
In the following we sort the data frame and add a group column, which allows easy subsequent processing of individual groups without creating a large split of DF. For example:

for(g in unique(DFout$group)) {
  DFsub <- subset(DFout, group == g)
  # ... process DFsub ...
}
1) Base R Sort the data and then assign the group column using cumsum on the non-duplicated elements.

o <- with(DF, order(Vendor, Buyer))
DFo <- DF[o, ]
DFout <- transform(DFo, group = cumsum(!duplicated(data.frame(Vendor, Buyer))))
DFout
giving:
  Vendor Buyer Amount group
1      A     P    100     1
4      A     P    290     1
2      B     P    150     2
3      C     Q    300     3
I am not sure this is such a good idea to do in the first place but if you really want to add a row of NAs after each group:
# append an NA after each group's run of row indices
ix <- unname(unlist(tapply(DFout$group, DFout$group, function(x) c(x, NA))))
# replace the non-NA entries with the actual row numbers
ix[!is.na(ix)] <- seq_len(nrow(DFout))
# indexing with NA produces an all-NA separator row after each group
DFout[ix, ]
2) data.table Convert to data.table, set the key (which sorts it) and use rleid to assign the group number.
library(data.table)
DT <- data.table(DF)
setkey(DT, Vendor, Buyer)
DT[, group := rleid(Vendor, Buyer)]
3) sqldf Another approach is to use SQL. This requires the development version of RSQLite on github. Here dense_rank acts similarly to rleid above.
library(sqldf)
sqldf("select *, dense_rank() over (order by Vendor, Buyer) as [group]
       from DF
       order by Vendor, Buyer")
giving:
  Vendor Buyer Amount group
1      A     P    100     1
2      A     P    290     1
3      B     P    150     2
4      C     Q    300     3
Note
DF <- structure(list(Vendor = structure(c(1L, 2L, 3L, 1L),
                       .Label = c("A", "B", "C"), class = "factor"),
                     Buyer = structure(c(1L, 1L, 2L, 1L),
                       .Label = c("P", "Q"), class = "factor"),
                     Amount = c(100L, 150L, 300L, 290L)),
                class = "data.frame", row.names = c(NA, -4L))
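For the edited question (sorting by Date within each Vendor/Buyer group and dropping single transactions), a possible dplyr sketch, assuming a character Date column in m/d/Y format has been added to DF as shown in the edit:

library(dplyr)
DF %>%
  mutate(Date = as.Date(Date, format = "%m/%d/%Y")) %>%  # parse m/d/Y text
  group_by(Vendor, Buyer) %>%
  filter(n() > 1) %>%                  # remove single transactions
  arrange(Date, .by_group = TRUE) %>%  # ascending dates within each group
  ungroup()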

Create new variable based on the order of values in other columns

I have a data frame of accesses to a website: several accesses per day, with different possible actions and descriptions of the actions.
| People | Date       | Time  | Action | Descr |
|--------|------------|-------|--------|-------|
| j      | 01/01/2010 | 10:13 | X      | A     |
| j      | 01/01/2010 | 10:15 | Y      | B     |
| j      | 02/01/2010 | 14:15 | Z      | C     |
| j      | 03/01/2010 | 11:45 | X      | D     |
| j      | 03/01/2010 | 13:56 | X      | E     |
| j      | 03/01/2010 | 18:43 | Z      | F     |
| j      | 03/01/2010 | 18:44 | X      | A     |
After reducing the data frame to a balanced daily panel, I need to create variables such that:
- the value of the first variable (FirstX) equals the description (Descr) of the first Action = X of the day (if available) and zero otherwise
- the value of the second variable equals the description of the second Action = X of the day and zero otherwise
- and so on
Once I transformed it into a balanced daily panel (which I can do) I need to have a final result which looks like this:
| People | Date       | Accesses | First X | Second X | Third X | Fourth X |
|--------|------------|----------|---------|----------|---------|----------|
| j      | 01/01/2010 | 2        | A       | 0        | 0       | 0        |
| j      | 02/01/2010 | 1        | 0       | 0        | 0       | 0        |
| j      | 03/01/2010 | 4        | D       | E        | A       | 0        |
You can do it using the dplyr package:
library(dplyr)
df %>%
  group_by(People, Date) %>%
  summarise(Accesses = n(),
            FirstX  = ifelse(sum(Action == "X") >= 1, Descr[Action == "X"][1], "0"),
            SecondX = ifelse(sum(Action == "X") >= 2, Descr[Action == "X"][2], "0"),
            ThirdX  = ifelse(sum(Action == "X") >= 3, Descr[Action == "X"][3], "0"),
            FourthX = ifelse(sum(Action == "X") >= 4, Descr[Action == "X"][4], "0"))
This returns:
  People Date       Accesses FirstX SecondX ThirdX FourthX
  <chr>  <chr>         <int> <chr>  <chr>   <chr>  <chr>
1 j      01/01/2010        2 A      0       0      0
2 j      02/01/2010        1 0      0       0      0
3 j      03/01/2010        4 D      E       A      0
Note that you cannot have numeric 0s and characters in the same vector, so I put character 0s in the FirstX, SecondX, .. columns.
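If the number of X's per day is not known in advance, a more general sketch reshapes with tidyr instead of writing one ifelse per position. The helper column pos is illustrative; pivot_wider with a scalar values_fill needs tidyr >= 1.1, and days with no X (such as 02/01/2010) are dropped by the filter and would have to be joined back in:

library(dplyr)
library(tidyr)
df %>%
  group_by(People, Date) %>%
  mutate(Accesses = n()) %>%                   # accesses per person-day
  filter(Action == "X") %>%                    # keep only the X actions
  mutate(pos = paste0("X", row_number())) %>%  # X1, X2, ... within the day
  select(People, Date, Accesses, pos, Descr) %>%
  pivot_wider(names_from = pos, values_from = Descr, values_fill = "0")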
I found a solution myself. I post it here in case this is useful to somebody.
# create a temp variable to be used for the count (just a vector of all the
# numbers from 1 to the number of rows); note that calling the data frame
# "subset" masks base::subset
subset$temp_var1 <- seq_len(nrow(subset))
# generate a variable which starts counting from one and starts again
# every time "date" or "people" changes
subset$count <- ave(subset$temp_var1, subset$date,
                    subset$people, FUN = seq_along)
# drop the variable "Action"
subset <- subset[, c("people", "date", "descr", "count")]
# reshape to wide format: one descr column per count value
subset_comuni <- reshape(subset, idvar = c("people", "date"),
                         timevar = "count", direction = "wide")

data.table alternative for dplyr mutate_?

I have the following R code, which uses dplyr.
Due to the large data size, we want to use data.table instead.
test <- function(Act, mac, type, thisYear){
  Act %>%
    mutate_(var = type) %>%
    filter(var == mac) %>%
    filter(floor_date(as.Date(submit_ts), 'year') == thisYear)
}
Act is as follows:

| submit_ts    | col1 | col2 |
|--------------|------|------|
| '2015-01-01' | 'x'  | 1000 |
| '2015-01-01' | 'y'  | 200  |
| '2015-01-01' | 'x'  | 200  |
Basically, the function works as follows:
test(act, 'x', 'col1', 2015)
The result is as follows:

| submit_ts    | col1 | col2 |
|--------------|------|------|
| '2015-01-01' | 'x'  | 1000 |
| '2015-01-01' | 'x'  | 200  |
test(act, 200, 'col2', 2015)
The result is as follows:

| submit_ts    | col1 | col2 |
|--------------|------|------|
| '2015-01-01' | 'y'  | 200  |
| '2015-01-01' | 'x'  | 200  |
How should I do it using data.table?
We can take a similar approach in data.table:
library(data.table)
library(lubridate)
test1 <- function(Act, mac, type, thisYear){
  setnames(setDT(Act), type, "var")[
    var == mac & year(floor_date(as.Date(submit_ts), "year")) == thisYear]
}

test1(dat, 2, "val", 2013)
#    submit_ts var
#1: 2013-05-05   2
#2: 2013-05-12   2
NOTE: floor_date returns a Date, not a bare yyyy year, which is why it is wrapped in year() before comparing with thisYear.
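If you would rather not rename the column (setnames modifies Act by reference), a sketch of an alternative that looks the column up by name with get(); test2 is just an illustrative name:

# avoid the rename by looking the column up with get()
test2 <- function(Act, mac, type, thisYear){
  setDT(Act)[get(type) == mac & year(as.Date(submit_ts)) == thisYear]
}
test2(dat, 2, "val", 2013)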
data
dat <- data.frame(submit_ts = c("2013-05-05", "2012-05-10", "2013-05-12"),
                  val = c(2, 1, 2), stringsAsFactors = FALSE)
thisYear <- 2013

R ddply sum value from next row

I want to sum the column value from a row with the next one.
> df
+----+-----+--------+-----+
| id | Val | Factor | Col |
+----+-----+--------+-----+
|  1 |  15 |      1 |   7 |
|  3 |  20 |      1 |   4 |
|  2 |  35 |      2 |   8 |
|  7 |  35 |      1 |  12 |
|  5 |  40 |      1 |  11 |
|  6 |  45 |      2 |  13 |
|  4 |  55 |      1 |   4 |
|  8 |  60 |      1 |   7 |
|  9 |  15 |      2 |  12 |
+----+-----+--------+-----+
...
I would like to have the mean of the sum Row$Val + nextRow$Val, based on their id and Col. I can't assume that the id or Col values are consecutive.
I am using ddply to summarize my df. I have tried
> ddply(df, .(Factor), summarize,
        max(Val),
        sum(Val),
        mean(Val + df[df$id == id + 1 & df$Col == Col]$Val)
  )
which gives: "longer object length is not a multiple of shorter object length"
You can build a vector of values with
sapply(df$id, function(x){
  mean(c(subset(df, id == x,     select = Val, drop = TRUE),
         subset(df, id == x + 1, select = Val, drop = TRUE)))
})
You could simplify, but I tried to make it as readable as possible.
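A vectorized sketch of the same idea uses match() to find each id's successor in one pass (rows whose id has no successor get NA):

# Val of the row whose id is exactly id + 1, NA if there is none
nxt <- df$Val[match(df$id + 1, df$id)]
(df$Val + nxt) / 2  # per-row means with the next id's Val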
You can use rollapply from the zoo package. Since you want the mean of only two consecutive rows, you can try
library(zoo)
rollapply(df[order(df$id), 2], 2, function(x) sum(x)/2)
#[1] 17.5 27.5 35.0 37.5 42.5 50.0 57.5 37.5
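Equivalently, zoo's rollmean computes the same two-row rolling mean directly:

library(zoo)
rollmean(df[order(df$id), "Val"], 2)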
You can do something like this with the dplyr package:
library(dplyr)
df <- arrange(df, id)
mean(df$Val + lead(df$Val), na.rm = TRUE)
[1] 76.25
