R: extract the latest revisionTime for each date

I have a revision data frame with 3 columns:
revisionTime
date
value
For instance, here is a sample; my real data is very long (several hundred thousand rows):
df = structure(list(revisionTime = structure(c(1471417781, 1471417781,
1471417781, 1473978576, 1473978576, 1473978576), class = c("POSIXct",
"POSIXt"), tzone = ""), date = structure(c(1464652800, 1467244800,
1469923200, 1456704000, 1467244800, 1472601600), class = c("POSIXct",
"POSIXt"), tzone = ""), value = c(103.7, 104.1, 104.9, 104.414,
104.3, 104.4)), .Names = c("revisionTime", "date", "value"), row.names = 536:541, class = "data.frame")
What I need is a very fast way to extract from this data.frame the latest revisionTime for each date (and the corresponding value). There are some similar questions, but my question is more precise: is there a way to avoid loops?
Thank you

We can use data.table. Convert the 'data.frame' to a 'data.table' with setDT(df), order 'revisionTime' in descending order (in i), group by 'date' after converting it to Date class, and take the first row of each group with head.
library(data.table)
setDT(df)[order(-revisionTime), head(.SD, 1), .(date = as.Date(date))]

If revisionTime is always a proper date-time (Y-m-d H:M:S), as in your example, no conversion is needed and aggregate can take the maximum per date directly. Note that this returns only date and the latest revisionTime, not the corresponding value:
aggregate(revisionTime ~ date, df, max)
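Because the aggregate call above returns only the latest revisionTime per date, one way to keep the matching value as well is to sort and then drop duplicated dates. The following is a minimal base R sketch on the sample data, not the method of either answer; the names ord and latest are made up for illustration, and the time zone is fixed to UTC only to make the example reproducible.

```r
# Sample data from the question (time zone set to UTC for reproducibility; an assumption)
df <- data.frame(
  revisionTime = as.POSIXct(c(1471417781, 1471417781, 1471417781,
                              1473978576, 1473978576, 1473978576),
                            origin = "1970-01-01", tz = "UTC"),
  date = as.POSIXct(c(1464652800, 1467244800, 1469923200,
                      1456704000, 1467244800, 1472601600),
                    origin = "1970-01-01", tz = "UTC"),
  value = c(103.7, 104.1, 104.9, 104.414, 104.3, 104.4)
)

# Sort by date, then by revisionTime descending, so the latest revision comes first
ord <- df[order(df$date, -as.numeric(df$revisionTime)), ]

# Keep the first row of each date, i.e. the latest revision and its value
latest <- ord[!duplicated(ord$date), ]
latest
```

Both order and duplicated are vectorized, so no explicit loop is involved.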

Related

How to apply a specific function to character-type columns only in a dataset?

My test dataset
df=structure(list(Atencion = c(871739L, 866903L, 847986L, 872950L,
860503L, 868579L), NomAtenTipoBase = c("Hospitalización", "Hospitalización",
"Hospitalización", "Urgencias", "Hospitalización", "Hospitalización"
), FecIngreso = structure(c(1656598680, 1656161220, 1654693680,
1656675480, 1655690640, 1656423480), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), Plan_Del_Contrato = c("ATENCIÓN PGP ONCOLÓGICO SUBSIDIADO - CONTRIBUTIV",
"SANTANDER-C", "PBS-C", "ACCIDENTES DE TRANSITO", "ATENCIÓN INTEGRAL ONCOLOGIA REG- SUBSIDIADO",
"ARL")), row.names = c(NA, 6L), class = "data.frame")
I need to apply this function only to the columns of character type (NomAtenTipoBase and Plan_Del_Contrato), instead of writing out the code for each column:
df$NomAtenTipoBase = stri_enc_toutf8(df$NomAtenTipoBase)
df$Plan_Del_Contrato = stri_enc_toutf8(df$Plan_Del_Contrato)
We could use across with where:
library(dplyr)
library(stringi)
df <- df %>%
  mutate(across(where(is.character), stri_enc_toutf8))
Or with base R, find the character columns and update only those by looping over that subset of columns:
i1 <- sapply(df, is.character)
df[i1] <- lapply(df[i1], stri_enc_toutf8)

Find closest date in a df to a date in another df and subtract the difference

I have 2 dataframes:
df1<-structure(list(id = c(1, 2), date = structure(c(1483636800, 1485192000
), class = c("POSIXct", "POSIXt"), tzone = "")), class = "data.frame", row.names = c(NA,
-2L))
df2<-structure(list(id.1 = structure(1:3, .Label = c("A", "B", "C"
), class = "factor"), sunrise = structure(c(1483617946, 1485198384,
1485205584), class = c("POSIXct", "POSIXt"), tzone = "")), class = "data.frame", row.names = c(NA,
-3L))
df1:
id date
1 1 2017-01-05 12:20:00
2 2 2017-01-23 12:20:00
df2:
id.1 sunrise
1 A 2017-01-05 07:05:46
2 B 2017-01-23 14:06:24
3 C 2017-01-23 16:06:24
I would like to find the closest sunrise time in df2 to each of the dates in df1, calculate the time difference (in hours), and put these values in a new column "closest" in df1. The goal is the time until sunrise: a negative value indicates a time after sunrise and a positive value a time before sunrise.
My df1 is very large (> 100 million rows), so it's important that the solution is fast and memory-efficient. The solution I found cannot accommodate the size of my data set, not even on a single column (it returns "Error: vector memory exhausted (limit reached?)"). Attempts to remedy this issue have so far been unsuccessful.
sDTA <- data.table(df1)[2]
rDTB <- data.table(df2)
try1<-sDTA[, closest := rDTB[sDTA, on = .(sunrise = date), roll = "nearest", x.sunrise]][]
try1$Time_until_sunrise <- difftime(try1$closest, try1$date, units = "hours")
I previously encountered exhausted memory issues while trying to use aggregate() functions, but was able to 'repair' this by replacing these with those that used dplyr. Perhaps this could be a solution again.
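One possible way to do the nearest-sunrise lookup without a join is base R's findInterval on the sorted sunrise times. This is only a sketch under assumptions: it swaps in findInterval for the data.table rolling join attempted above, it has been tried only on the two-row sample, and the helper names (s, d, lo, nearest) are made up for illustration. The sign convention follows the question: sunrise minus date, in hours.

```r
# Sample data from the question (time zone set to UTC for reproducibility; an assumption)
df1 <- data.frame(
  id = c(1, 2),
  date = as.POSIXct(c(1483636800, 1485192000), origin = "1970-01-01", tz = "UTC")
)
df2 <- data.frame(
  id.1 = c("A", "B", "C"),
  sunrise = as.POSIXct(c(1483617946, 1485198384, 1485205584),
                       origin = "1970-01-01", tz = "UTC")
)

s <- sort(as.numeric(df2$sunrise))   # sunrise times in seconds, ascending
d <- as.numeric(df1$date)            # dates in seconds

lo <- findInterval(d, s)             # index of the last sunrise <= date (0 if none)
lo_c <- pmax(lo, 1L)                 # candidate at or before the date, clamped to range
hi_c <- pmin(lo + 1L, length(s))     # candidate after the date, clamped to range

# Pick whichever candidate is closer in absolute time
use_hi <- abs(s[hi_c] - d) < abs(s[lo_c] - d)
nearest <- ifelse(use_hi, s[hi_c], s[lo_c])

# Hours until sunrise: negative means the date is after the closest sunrise
df1$closest <- (nearest - d) / 3600
df1
```

The work is done on plain numeric vectors of the same length as df1, so memory use grows linearly with the number of rows; whether that fits a 100-million-row table depends on the machine.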

Convert the date into an individual date and time in R

I would like to convert a date that I have in R into an individual date and time. At the moment the format of the date is POSIXct
An example is given here:
"2019-03-29 20:42:07"
I want the date to be in one column and the time of that date in a corresponding column. I have found something similar here, but it doesn't answer my question.
Many thanks
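If the column is of class POSIXct, create two new columns: coerce to Date (as.Date) for the date and use format for the time part. as.Date converts POSIXct input in UTC unless tz is given, so pass tz = "" to use the session time zone and keep the date consistent with the printed datetime.
df1 <- transform(df1, date = as.Date(datetime, tz = ""), time = format(datetime, "%T"))
df1
#              datetime       date     time
#1 2019-03-29 20:42:07 2019-03-29 20:42:07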
data
df1 <- structure(list(datetime = structure(1553910127, class = c("POSIXct",
"POSIXt"), tzone = "")), class = "data.frame", row.names = c(NA,
-1L))

Subset data frame by rows containing the system date

I would like to subset a data frame by selecting only the rows with the current system date.
For example, I have this data frame:
df = data.frame("var" = c("A", "A", "B", "B"),
"date" = c("2020-03-01", "2020-03-17",
"2020-03-01", "2020-03-17"))
df$date = as.POSIXct(df$date, format = "%Y-%m-%d")
If today is 2020-03-17, I would like to subset the rows that contain only the current date.
I have tried the following:
df_today = df[which(df$date == Sys.Date()),]
Which gives the warning:
Warning message: In which(df$date == Sys.Date()) :
Incompatible methods ("Ops.POSIXt", "Ops.Date") for "=="
I have also tried:
df[which(df$date == as.POSIXct(Sys.Date())),]
Which returns an empty data frame. What I found works is coercing the date column to character and then subsetting the rows in this way:
df$date = as.character(df$date)
df[which(df$date == as.character(Sys.Date())),]
This works, but I would like to know where I am going wrong with my previous attempts, and whether there is a better way than converting back and forth between character and POSIXct.
Thank you in advance for any input!
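Class "Date" is not the same as class "POSIXct", so the two cannot be compared directly with ==. as.POSIXct(Sys.Date()) does not help either, because it gives midnight UTC while your column holds local midnight. Convert the column to Date first, using the local time zone from Sys.timezone().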
df[as.Date(df$date, tz=Sys.timezone()) == Sys.Date(),]
# var date
# 2 A 2020-03-17
# 4 B 2020-03-17
Data used
df <- structure(list(var = structure(c(1L, 1L, 2L, 2L), .Label = c("A",
"B"), class = "factor"), date = structure(c(1583017200, 1584399600,
1583017200, 1584399600), class = c("POSIXct", "POSIXt"), tzone = "")), row.names = c(NA,
-4L), class = "data.frame")
Alternatively, with dplyr, convert the column to Date in the local time zone and then filter:
library(dplyr)
df$date = as.Date(df$date, tz = Sys.timezone())
df %>% filter(date == Sys.Date())
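To see why comparing against as.POSIXct(Sys.Date()) can return nothing, it may help to look at the underlying numbers. The sketch below is illustrative only; the variable names are made up, and the size of the offset depends on the session time zone.

```r
# as.POSIXct on a Date gives midnight UTC of that day
utc_fixed <- as.numeric(as.POSIXct(as.Date("2020-03-17")))
utc_fixed  # 1584403200; the data above stores 1584399600, i.e. local midnight at UTC+1

today <- Sys.Date()

# Parsing a date string without tz gives midnight in the session time zone
local_midnight <- as.POSIXct(format(today), tz = "")
utc_midnight   <- as.POSIXct(today)

# The two instants differ by the session's UTC offset (0 only when running in UTC),
# which is why the == comparison on POSIXct values can match no rows
offset_hours <- (as.numeric(utc_midnight) - as.numeric(local_midnight)) / 3600
offset_hours

# Converting the POSIXct value to Date in the local time zone recovers today's date
same_day <- as.Date(local_midnight, tz = "") == today
same_day
```

Comparing Date to Date, after converting in the local time zone, avoids the dependence on the UTC offset.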

Returning first row of group

I have a dataframe consisting of an ID that is the same for each element in a group, two datetimes, and the time interval between these two. One of the datetime objects is my relevant time marker. Now I would like to get a subset of the dataframe that consists of the earliest entry for each group. The entries (especially the time interval) need to stay untouched.
My first approach was to sort the frame by 1. ID and 2. the relevant datetime. However, I wasn't able to return the first entry for each new group.
I then looked at the aggregate() and ddply() functions, but I could not find an option in either that just returns the first entry without applying an aggregation function to the time interval value.
Is there an (easy) way to accomplish this?
ADDITION:
Maybe I was unclear by adding my aggregate() and ddply() notes. I do not necessarily need to aggregate. Given the fact that the dataframe is sorted in a way that the first row of each new group is the row I am looking for, it would suffice to just return a subset with each row that has a different ID than the one before (which is the start-row of each new group).
Example data:
structure(list(ID = c(1454L, 1322L, 1454L, 1454L, 1855L, 1669L,
1727L, 1727L, 1488L), Line = structure(c(2L, 1L, 3L, 1L, 1L,
1L, 1L, 1L, 1L), .Label = c("A", "B", "C"), class = "factor"),
Start = structure(c(1357038060, 1357221074, 1357369644, 1357834170,
1357913412, 1358151763, 1358691675, 1358789411, 1359538400
), class = c("POSIXct", "POSIXt"), tzone = ""), End = structure(c(1357110430,
1357365312, 1357564413, 1358230679, 1357978810, 1358674600,
1358853933, 1359531923, 1359568151), class = c("POSIXct",
"POSIXt"), tzone = ""), Interval = c(1206.16666666667, 2403.96666666667,
3246.15, 6608.48333333333, 1089.96666666667, 8713.95, 2704.3,
12375.2, 495.85)), .Names = c("ID", "Line", "Start", "End",
"Interval"), row.names = c(NA, -9L), class = "data.frame")
After reproducing the example data frame (stored here as data) and testing it, I found a way of getting the needed result:
Order the data by the relevant columns (ID, Start):
ordered_data <- data[order(data$ID, data$Start),]
Find the first row for each new ID:
final <- ordered_data[!duplicated(ordered_data$ID),]
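The idea from the question's ADDITION, keeping each row whose ID differs from the row before it, can also be written out directly. This is a small sketch on a reduced copy of the example data (only ID, Start and Interval, with the time zone fixed to UTC for reproducibility); is_first and first_rows are illustrative names. It assumes at least one row and gives the same rows as the duplicated approach.

```r
# Reduced copy of the example data (columns ID, Start, Interval only)
data <- data.frame(
  ID = c(1454L, 1322L, 1454L, 1454L, 1855L, 1669L, 1727L, 1727L, 1488L),
  Start = as.POSIXct(c(1357038060, 1357221074, 1357369644, 1357834170,
                       1357913412, 1358151763, 1358691675, 1358789411,
                       1359538400), origin = "1970-01-01", tz = "UTC"),
  Interval = c(1206.16666666667, 2403.96666666667, 3246.15, 6608.48333333333,
               1089.96666666667, 8713.95, 2704.3, 12375.2, 495.85)
)

# Sort so that the earliest Start comes first within each ID
ordered_data <- data[order(data$ID, data$Start), ]
n <- nrow(ordered_data)

# A row starts a new group when its ID differs from the previous row's ID;
# the very first row always starts a group
is_first <- c(TRUE, ordered_data$ID[-1] != ordered_data$ID[-n])

first_rows <- ordered_data[is_first, ]
first_rows
```

All other columns, including Interval, are carried over unchanged because the rows are only subset, never aggregated.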
As you don't provide any data, here is an example using base R with a sample data frame:
df <- data.frame(group=c("a", "b"), value=1:8)
## Order the data frame with the variable of interest
df <- df[order(df$value),]
## Aggregate
aggregate(df, list(df$group), FUN=head, 1)
EDIT: As Ananda suggests in his comment, the following call to aggregate is better:
aggregate(.~group, df, FUN=head, 1)
If you prefer to use plyr, you can replace aggregate with ddply:
library(plyr)
ddply(df, "group", head, 1)
Using ffirst from collapse
library(collapse)
ffirst(df, g = df$group)
data
df <- data.frame(group=c("a", "b"), value=1:8)
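This could also be achieved with dplyr, using group_by and the slice family of functions. Sort first so that the first row of each group is the earliest entry:
library(dplyr)
data %>%
  arrange(ID, Start) %>%
  group_by(ID) %>%
  slice_head(n = 1)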
