It's my first time posting here! I am super stuck on what I'm sure is an easy thing to do.
I have a dataframe of irregular intervals and a vector of dates. If one of the dates occurs within any of the given intervals, I would like a new column to flag this (as the flagged intervals need to be deleted). This is similar to this post, but that solution doesn't work due to the irregular intervals. I have over 2000 intervals and 2000 dates.
I can get the dates that occur within the given intervals using the %within% function, but this is no good as I can't find which intervals the dates are occurring in.
I've tried the solutions in this similar post but I have no grouping variable and can't get them to work.
Any suggestions would be incredibly helpful!!! Thank you so much!!
Example raw data (not as lubridate intervals):
> df1
diveno start fin
1 1 2018-08-01 08:20:40 2018-08-01 08:39:20
2 2 2018-08-01 08:40:50 2018-08-01 08:53:40
3 3 2018-08-01 10:01:00 2018-08-01 10:16:30
4 4 2018-08-01 15:45:30 2018-08-01 15:58:20
5 5 2018-08-01 17:06:00 2018-08-01 17:18:20
> df2
date
1 2018-08-01 08:30:00
2 2018-08-01 15:47:00
3 2018-08-02 17:10:00
What I'd like:
> df3
diveno start fin dateoccurs
1 1 2018-08-01 08:20:40 2018-08-01 08:39:20 Y
2 2 2018-08-01 08:40:50 2018-08-01 08:53:40 N
3 3 2018-08-01 10:01:00 2018-08-01 10:16:30 N
4 4 2018-08-01 15:45:30 2018-08-01 15:58:20 Y
5 5 2018-08-01 17:06:00 2018-08-01 17:18:20 N
Where the dateoccurs column flags whether a date from df2 occurs within the given interval in df1.
Code for example data:
df1<-data.frame(diveno=c(1,2,3,4,5),
start=c("2018-08-01 08:20:40","2018-08-01 08:40:50", "2018-08-01 10:01:00","2018-08-01 15:45:30","2018-08-01 17:06:00"),
fin=c("2018-08-01 08:39:20","2018-08-01 08:53:40","2018-08-01 10:16:30","2018-08-01 15:58:20", "2018-08-01 17:18:20"))
df1$start <- as.POSIXct(df1$start,format="%Y-%m-%d %H:%M:%S",tz="CET")
df1$fin <- as.POSIXct(df1$fin,format="%Y-%m-%d %H:%M:%S",tz="CET")
df2<-data.frame(date=c("2018-08-01 08:30:00", "2018-08-01 15:47:00", "2018-08-02 17:10:00"))
df2$date <- as.POSIXct(df2$date,format="%Y-%m-%d %H:%M:%S",tz="CET")
What I need:
df3<-data.frame(diveno=c(1,2,3,4,5),
start=c("2018-08-01 08:20:40","2018-08-01 08:40:50", "2018-08-01 10:01:00","2018-08-01 15:45:30","2018-08-01 17:06:00"),
fin=c("2018-08-01 08:39:20","2018-08-01 08:53:40","2018-08-01 10:16:30","2018-08-01 15:58:20", "2018-08-01 17:18:20"),
dateoccurs=c("Y","N","N","Y","N"))
The closest I've gotten is using an answer from this post.
But it returns altered 'fin' times, and when applied to the real, massive dataset it seems to duplicate values and change the number of 'diveno' rows!
intervals <- df1
elements <- df2[, 1]
library(data.table) # v1.10.0
# non-equi join: for each element (date), find interval rows with start <= date <= fin
j <- setDT(intervals)[data.table(elements), on = .(start <= elements, fin >= elements)]
j2 <- as.data.frame(j)
na.omit(j2)
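For what it's worth, a minimal sketch of the per-interval %within% check (assuming df1 and df2 as built above); looping over the intervals rather than the dates puts the flag on each interval:
library(lubridate)
# flag each dive whose [start, fin] interval contains at least one date in df2
df1$dateoccurs <- ifelse(
  sapply(seq_len(nrow(df1)),
         function(i) any(df2$date %within% interval(df1$start[i], df1$fin[i]))),
  "Y", "N")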
UPDATED: sample data for df2 that seems to produce false positives.
> dput(df2)
structure(list(date = structure(c(1533096000, 1533096300, 1533096600, 1533096900, 1533097200, 1533097500, 1533097800, 1533098100, 1533098400, 1533098700, 1533099000, 1533099300, 1533099600, 1533099900, 1533100200, 1533100500, 1533100800, 1533101100, 1533101400, 1533101700, 1533102000, 1533102300, 1533102600, 1533102900, 1533103200, 1533103500, 1533103800, 1533104100, 1533104400, 1533104700, 1533105000, 1533105300, 1533105600, 1533105900, 1533106200, 1533106500, 1533106800, 1533107100, 1533107400, 1533107700, 1533108000, 1533108300, 1533108600, 1533108900, 1533109200, 1533109500, 1533109800, 1533110100), tzone = "UTC", class = c("POSIXct", "POSIXt")), depth = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_)), class = "data.frame", row.names = c(NA,
-48L))
Update 2 (sorry!)
Within df2, the following times:
30 2018-08-01 06:25:00 NA
31 2018-08-01 06:30:00 NA
32 2018-08-01 06:35:00 NA
seem to be recognised as falling within the following intervals:
diveno start fin dateoccurs
1 1 2018-08-01 08:20:40 2018-08-01 08:39:20 Y
2 2 2018-08-01 08:40:50 2018-08-01 08:53:40 Y
3 3 2018-08-01 10:01:00 2018-08-01 10:16:30 N
Why might this be happening?
library(lubridate)
s <- df1[1, 2]
f <- df1[1, 3]
int <- interval(s, f)
# nb: %within% compares instants; the dput above has tzone = "UTC" while
# df1 was built with tz = "CET", so printed clock times differ by the offset
df2[, 1] %within% int
A data.table solution using a non-equi join:
library(data.table)
setDT(df1); setDT(df2)
# initialise new column with "N"
df1[, dateoccurs := "N"]
# update join
df1[df2, dateoccurs := "Y", on = .(start <= date, fin >= date)][]
# diveno start fin dateoccurs
# 1: 1 2018-08-01 08:20:40 2018-08-01 08:39:20 Y
# 2: 2 2018-08-01 08:40:50 2018-08-01 08:53:40 N
# 3: 3 2018-08-01 10:01:00 2018-08-01 10:16:30 N
# 4: 4 2018-08-01 15:45:30 2018-08-01 15:58:20 Y
# 5: 5 2018-08-01 17:06:00 2018-08-01 17:18:20 N
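If you also need to know which date fell inside which dive (the part the question said %within% alone could not give), the same non-equi join can list the matching pairs; a sketch (i.date is df2's date seen from inside the join, and nomatch = NULL drops dives without a match):
df1[df2, .(diveno, date = i.date), on = .(start <= date, fin >= date), nomatch = NULL]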
You may use outer.
fun <- function(i, j) data.table::between(df2[i, 'date'], df1[j, 'start'], df1[j, 'fin'])
df1$occ <- colSums(outer(seq_len(nrow(df2)), seq_len(nrow(df1)), Vectorize(fun)))
df1
# diveno start fin occ
# 1 1 2018-08-01 08:20:40 2018-08-01 08:39:20 1
# 2 2 2018-08-01 08:40:50 2018-08-01 08:53:40 0
# 3 3 2018-08-01 10:01:00 2018-08-01 10:16:30 0
# 4 4 2018-08-01 15:45:30 2018-08-01 15:58:20 1
# 5 5 2018-08-01 17:06:00 2018-08-01 17:18:20 0
The binary column can easily be converted to a factor if you like.
df1$occ <- colSums(outer(seq_len(nrow(df2)), seq_len(nrow(df1)), Vectorize(fun))) |>
  factor(labels = c("N", "Y"))  # nb: assumes the counts take only the values 0 and 1
df1
# diveno start fin occ
# 1 1 2018-08-01 08:20:40 2018-08-01 08:39:20 Y
# 2 2 2018-08-01 08:40:50 2018-08-01 08:53:40 N
# 3 3 2018-08-01 10:01:00 2018-08-01 10:16:30 N
# 4 4 2018-08-01 15:45:30 2018-08-01 15:58:20 Y
# 5 5 2018-08-01 17:06:00 2018-08-01 17:18:20 N
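Note that outer evaluates fun once for every date/interval pair, so with roughly 2000 dates and 2000 intervals that is about 4 million between() calls; that should still be fine at this size, but the non-equi update join above avoids materialising the full grid.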
Data:
df1 <- structure(list(diveno = c(1, 2, 3, 4, 5), start = structure(c(1533104440,
1533105650, 1533110460, 1533131130, 1533135960), class = c("POSIXct",
"POSIXt"), tzone = "CET"), fin = structure(c(1533105560, 1533106420,
1533111390, 1533131900, 1533136700), class = c("POSIXct", "POSIXt"
), tzone = "CET"), occ = structure(c(2L, 1L, 1L, 2L, 1L), levels = c("N",
"Y"), class = "factor")), row.names = c(NA, -5L), class = "data.frame")
df2 <- structure(list(date = structure(c(1533105000, 1533131220, 1533222600
), class = c("POSIXct", "POSIXt"), tzone = "CET")), row.names = c(NA,
-3L), class = "data.frame")
Related
I have a data frame df2:
date X Days
2020-01-06 525 NA
2020-01-07 799 NA
2020-01-08 782 NA
2020-01-09 542 NA
2020-01-10 638 5
2020-01-11 1000 5
2020-01-12 1400 3
2020-01-13 3500 1
I want to count how many days it will take for the sum of X to surpass a value. In this case, the value is 3000.
For example, on 1/13 it took 1 day because X is 3500, which already surpasses 3000. On 1/12 it took 3 days (1400 + 1000 + 638 = 3038).
I wish to get the column Days.
dput(df2)
structure(list(date = structure(c(1578268800, 1578355200, 1578441600,
1578528000, 1578614400, 1578700800, 1578787200, 1578873600), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), X = c(525, 799, 782, 542, 638, 1000,
1400, 3500), Days = c(NA, NA, NA, NA, 5, 5, 3, 1)), class = "data.frame", row.names = c(NA,
-8L))
I think a rolling function works well here. Unlike most rolling functions, which have a fixed window smaller than the length of the data, we will intentionally make this one full-width.
zoo::rollapplyr(
df2$X, nrow(df2),
FUN = function(z) which(cumsum(rev(z)) > 3000)[1],
partial = TRUE)
# [1] NA NA NA NA 5 5 3 1
(I'm ignoring date, assuming that the rows are consecutive days.)
cs <- c(0, cumsum(rev(df2$X)))  # cumulative sums counted back from the latest day
out <- sapply(cs, function(x) which(cs - x > 3e3)[1])
rev(out - seq_along(cs))[-1]
#> [1] NA NA NA NA  5  5  3  1
Created on 2022-01-06 by the reprex package (v2.0.1)
This question already has answers here:
Return pmin or pmax of data.frame with multiple columns
(3 answers)
Is there a vectorized parallel max() and min()?
(4 answers)
How to use apply function in a pipe operator
(2 answers)
I have a large data set with over 400 columns, which represent certain user input elements from an online platform and the time when each input occurred. Each row represents a user ID.
200 of those columns are of class "POSIXct" "POSIXt" (e.g. 2019-11-04 15:33:50), and missing values can occur, as not every element is displayed to or filled in by every user.
My goal is to create two additional columns that hold the earliest and the latest date per row across those 200 "POSIXct" "POSIXt" columns.
Here is a simplified example of the frame and one of the desired additional columns.
(ID 4 would be someone who never bothered to open the site, but has data from other sources available, and should remain in the dataset for now.)
ID Other_columns date_column date_column2 date_column3 max_date (what I want)
1 "numeric" 2019-11-04 19:33:50 2019-11-05 15:33:50 2019-11-05 16:33:50 2019-11-05 16:33:50
2 "numeric" NA 2019-11-04 17:20:10 2019-11-09 19:12:50 2019-11-09 19:12:50
3 "numeric" 2019-11-07 20:33:50 NA 2019-11-04 18:31:50 2019-11-07 20:33:50
4 NA NA NA NA NA
So far I have not really gotten further than filtering out the non-date columns:
is.POSIXt <- function(x) inherits(x, "POSIXt")
df %>% select(where(is.POSIXt))
Instead of the select I should probably use mutate_at or something similar as a condition, but what is the best way to check all of those remaining 200 date/time columns and then assign the earliest/latest date to the newly created columns (while ignoring the NA values)?
We can use pmax and pmin on the 'date' columns to return the latest and earliest date for each row.
library(dplyr)
df %>%
mutate(max_date = do.call(pmax, c(select(., starts_with('date')), na.rm = TRUE)),
min_date = do.call(pmin, c(select(., starts_with('date')),
na.rm = TRUE)))
# ID Other_columns date_column date_column2 date_column3 max_date min_date
#1 1 numeric 2019-11-04 19:33:50 2019-11-05 15:33:50 2019-11-05 16:33:50 2019-11-05 16:33:50 2019-11-04 19:33:50
#2 2 numeric <NA> 2019-11-04 17:20:10 2019-11-09 19:12:50 2019-11-09 19:12:50 2019-11-04 17:20:10
#3 3 numeric 2019-11-07 20:33:50 <NA> 2019-11-04 18:31:50 2019-11-07 20:33:50 2019-11-04 18:31:50
#4 4 <NA> <NA> <NA> <NA> <NA> <NA>
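On a recent dplyr this can also be written with pick() instead of the select(., ...) idiom; a sketch, assuming dplyr >= 1.1 (where pick() was introduced):
df %>%
  mutate(max_date = do.call(pmax, c(pick(starts_with('date')), na.rm = TRUE)),
         min_date = do.call(pmin, c(pick(starts_with('date')), na.rm = TRUE)))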
Another option is rowwise() with c_across():
df %>%
rowwise() %>%
mutate(max_date = max(as.POSIXct(c_across(starts_with('date'))),
na.rm = TRUE),
min_date = min(as.POSIXct(c_across(starts_with('date'))),
na.rm = TRUE))
Output:
# A tibble: 4 x 7
# Rowwise:
# ID Other_columns date_column date_column2 date_column3 max_date min_date
# <int> <chr> <chr> <chr> <chr> <dttm> <dttm>
#1 1 numeric 2019-11-04 19:33:50 2019-11-05 15:33:50 2019-11-05 16:33:50 2019-11-05 16:33:50 2019-11-04 19:33:50
#2 2 numeric <NA> 2019-11-04 17:20:10 2019-11-09 19:12:50 2019-11-09 19:12:50 2019-11-04 17:20:10
#3 3 numeric 2019-11-07 20:33:50 <NA> 2019-11-04 18:31:50 2019-11-07 20:33:50 2019-11-04 18:31:50
#4 4 <NA> <NA> <NA> <NA> NA NA
data
df <- structure(list(ID = 1:4, Other_columns = c("numeric", "numeric",
"numeric", NA), date_column = c("2019-11-04 19:33:50", NA, "2019-11-07 20:33:50",
NA), date_column2 = c("2019-11-05 15:33:50", "2019-11-04 17:20:10",
NA, NA), date_column3 = c("2019-11-05 16:33:50", "2019-11-09 19:12:50",
"2019-11-04 18:31:50", NA)), class = "data.frame", row.names = c(NA,
-4L))
Here is another approach that you can use without any additional packages.
First, select the date columns; then apply over each row to get the max and min values accordingly. Here is the example:
df_date <- df[, sapply(df, FUN = function(x) class(x)[1]) %in% c("POSIXct", "POSIXt")]
# MARGIN = 1 applies over rows; note apply() coerces the frame to a matrix,
# so the POSIXct class is lost and the results may need as.POSIXct() again
df$max <- apply(df_date, 1, FUN = function(x) max(x, na.rm = TRUE))
df$min <- apply(df_date, 1, FUN = function(x) min(x, na.rm = TRUE))
Data
structure(list(ID = 1:4, Other_columns = c("numeric", "numeric",
"numeric", NA), date_column = structure(c(1572876230, NA, 1573139030,
NA), class = c("POSIXct", "POSIXt"), tzone = ""), date_column2 = structure(c(1572948230,
1572868210, NA, NA), class = c("POSIXct", "POSIXt"), tzone = ""),
date_column3 = structure(c(1572951830, 1573306970, 1572872510,
NA), class = c("POSIXct", "POSIXt"), tzone = "")), class = "data.frame", row.names = c(NA,
-4L))
I have a dataset containing daily return values of cryptocurrencies, ret.daily; a small part of it looks like this:
Bitcoin Ethereum XRP Bitcoin.Cash Bitcoin.SV ...
2018-01-01 -0.04 0.02 0.04 -0.04 NA
2018-01-02 0.09 0.13 0.04 0.11 NA
2018-01-03 0.01 0.08 0.23 -0.04 NA
...
I have then assigned each coin to one of 5 groups for each day, based on certain values, price.groups.daily (these are just the biggest coins I included; there are many other coins, so there are coins in each of the 5 groups):
Bitcoin Ethereum XRP Bitcoin.Cash Bitcoin.SV ...
2018-01-01 5 5 4 5 NA
2018-01-02 5 5 4 5 NA
2018-01-03 5 5 4 5 NA
...
What I then want to do is take the mean of each group for each day, and make a new matrix looking like this:
1 2 3 4 5
2018-01-01 Mean(groups 1 numbers) Mean(groups 2 numbers) ... ... mean(-0.04, 0.02,-0.04,...)
2018-01-02 Mean(groups 1 numbers) Mean(groups 2 numbers)
2018-01-03 Mean(groups 1 numbers) Mean(groups 2 numbers)
...
When I made the grouping, I did the following (where price.daily is daily price data, which is what I used to sort the data into groups):
col.daily <- seq(1, length(price.daily$Bitcoin))
# daily break points: six quantiles (0%, 20%, ..., 100%) of each day's prices
quantile.daily <- sapply(col.daily, function(y) quantile(x = unlist(price.daily[y, ]), probs = seq(0, 1, length = 6), na.rm = TRUE))
quantile.daily.t <- t(quantile.daily)
rownames(quantile.daily.t) <- rownames(price.daily)
combined.daily <- cbind(price.daily, quantile.daily.t)
# assign each coin to the quantile bin its price falls into on that day
price.groups.daily <- as.data.frame(t(apply(combined.daily, 1, function(x) findInterval(x[1:ncol(price.daily)], x[(1 + ncol(price.daily)):ncol(combined.daily)]))))
colnames(price.groups.daily) <- colnames(price.daily)
# the daily maximum equals the top break point and lands in bin 6; fold it into group 5
price.groups.daily[price.groups.daily == 6] <- 5
I added the last line like that since I didn't know how to handle the case where the biggest value is equal to the end point of the last interval, but this works just fine. I imagine this could also be done using some apply function; I am just not certain how, since before I could use functions such as quantile and findInterval, which did exactly what I wanted. Not sure if there is a function that could work in this scenario?
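For what it's worth, here is a minimal base R sketch of the per-day group means (assuming ret.daily and price.groups.daily are aligned on the same dates and columns, and that the groups run 1 to 5):
group.means.daily <- t(sapply(seq_len(nrow(ret.daily)), function(i)
  # mean return per price group on day i; fixed factor levels keep all 5 columns
  tapply(unlist(ret.daily[i, ]),
         factor(unlist(price.groups.daily[i, ]), levels = 1:5),
         mean, na.rm = TRUE)))
rownames(group.means.daily) <- rownames(ret.daily)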
EDIT: Added some of my data using dput(head(price.groups.daily[1:5])) (my data starts in 2014, but I started from 2018 in my example, since most coins didn't exist at that time):
structure(list(Bitcoin = c(5, 5, 5, 5, 5, 5), Ethereum = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), XRP = c(1L, 1L, 1L, 1L, 1L, 2L), Bitcoin.Cash = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), Bitcoin.SV = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_)), row.names = c("2014-01-01", "2014-01-02",
"2014-01-03", "2014-01-04", "2014-01-05", "2014-01-06"), class = "data.frame")
and for dput(head(ret.daily[1:5])):
structure(list(Bitcoin = c(0.0201473710988784, 0.048620314369761,
0.0826106401572204, 0.0209460599834816, -0.17281055170073, 0.0495261478685647
), Ethereum = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), XRP = c(-0.0390090806022911, 0.0180075172268452, -0.108767309981219,
0.0184572292482077, -0.111605656954607, 0.0104300601469132),
Bitcoin.Cash = c(NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_), Bitcoin.SV = c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_)), row.names = c("2014-01-03",
"2014-01-04", "2014-01-05", "2014-01-06", "2014-01-07", "2014-01-08"
), class = "data.frame")
You could have a look at data.table or various tidyverse functions to accomplish that.
Below is an example using data.table:
library(data.table)
library(Hmisc)
## prepare example data
set.seed(1)
dts <- seq.Date(
from = as.Date("2019/1/1"),
to = as.Date("2020/1/1"),
by = 1
)
ret.daily <- matrix(
rnorm(length(dts) * 50, 0, 6),
ncol = 50,
dimnames = list(
dts,
c("Bitcoin", "Ethereum", "XRP", "Bitcoin.Cash", "Bitcoin.SV",
paste0("coin_", sprintf("%02d", seq_len(45))))
))
ret.daily[sample(seq_len(length(ret.daily)), 200 )] <- NA # add some NA's
ret.daily <- data.frame(ret.daily)
## start of summarizations
ret.daily <- melt(data.table(date = as.character(dts), ret.daily), id.vars = "date")
setkey(ret.daily, date, variable)
cuts <- ret.daily[, .(as.list(
Hmisc::cut2(value, g = 6)
)), by = .(date)]
setkey(cuts, date)
# grouping based on daily percentiles (in long format)
ret.daily[, group := unlist(lapply(cuts$V1, as.numeric))][]
#> date variable value group
#> 1: 2019-01-01 Bitcoin -3.7587229 2
#> 2: 2019-01-01 Ethereum 4.0700411 5
#> 3: 2019-01-01 XRP -6.3744503 1
#> 4: 2019-01-01 Bitcoin.Cash -4.5996998 2
#> 5: 2019-01-01 Bitcoin.SV -4.9012655 2
#> ---
#> 18296: 2020-01-01 coin_41 -4.1377852 2
#> 18297: 2020-01-01 coin_42 -0.7649347 3
#> 18298: 2020-01-01 coin_43 0.7698973 4
#> 18299: 2020-01-01 coin_44 -4.6674720 2
#> 18300: 2020-01-01 coin_45 -3.6291231 2
# summarize mean by group and date, and casting the data into wide format
dcast(ret.daily[, .(mean = mean(value, na.rm = TRUE)), by = .(date, group)],
date ~ group, value.var = "mean")
#> date NA 1 2 3 4 5 6
#> 1: 2019-01-01 NA -8.284783 -4.173707 -0.9096477 1.3175870 4.501497 11.123123
#> 2: 2019-01-02 NA -7.379199 -4.502193 -2.1457718 1.1179902 4.207471 8.069149
#> 3: 2019-01-03 NaN -9.070030 -4.708133 -1.8032877 0.9011769 2.699407 7.673678
#> 4: 2019-01-04 NA -7.019294 -2.995686 -0.9035496 1.6644289 4.565588 9.178561
#> 5: 2019-01-05 NA -9.457924 -3.957598 -1.9535285 0.3493898 3.265330 7.396461
#> ---
#> 362: 2019-12-28 NA -9.866193 -4.481655 -2.2775438 1.0612454 3.863716 9.159870
#> 363: 2019-12-29 NA -8.555226 -3.319358 -0.6815004 1.5801415 4.379455 9.354069
#> 364: 2019-12-30 NA -7.430636 -4.011801 -1.3067570 2.2528401 4.805392 10.595387
#> 365: 2019-12-31 NA -7.316091 -2.784448 -0.8047659 0.7121429 3.508579 7.714213
#> 366: 2020-01-01 NaN -8.502224 -4.369027 -1.7029667 0.5042703 3.959396 9.084915
Created on 2020-04-15 by the reprex package (v0.3.0)
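Since the tidyverse route is mentioned above, here is a rough sketch of the same summarisation with dplyr/tidyr (a sketch only: it restarts from the wide ret.daily data.frame built in the setup above, and uses ntile() for the daily sextiles, so the cut points will differ slightly from Hmisc::cut2):
library(dplyr)
library(tidyr)
data.frame(date = dts, ret.daily) %>%
  pivot_longer(-date, names_to = "coin", values_to = "ret") %>%
  group_by(date) %>%
  mutate(group = ntile(ret, 6)) %>%   # 6 equal-count groups per day; NA returns stay NA
  group_by(date, group) %>%
  summarise(mean = mean(ret, na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(names_from = group, values_from = mean)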
I have the following dataframe:
dff <- structure(list(`MCI ID` = c("070405344", "230349820", "260386435","370390587", "380406805", "391169282", "440377986", "750391394","890373764", "910367024"), `123a_1` = structure(c(16672, 16372,16730, 16688, 16700, 16783, 16709, 17033, 16786, 16675), class = "Date"),`123a_2` = structure(c(17029, 16422, 17088, 17036, 17057,17140, 17072, 17043, 17141, 17038), class = "Date"), `123a_3` = structure(c(NA_real_,NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,NA_real_, NA_real_, NA_real_), class = "Date"), `123a_4` = structure(c(NA_real_,NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,NA_real_, NA_real_, NA_real_), class = "Date"), `123a_5` = structure(c(NA_real_,NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,NA_real_, NA_real_, NA_real_), class = "Date"), max123a = structure(c(17029,16422, 17088, 17036, 17057, 17140, 17072, 17043, 17141, 17038), class = "Date")), .Names = c("MCI ID", "123a_1", "123a_2","123a_3", "123a_4", "123a_5", "max123a"), row.nam... <truncated>
I already have a column for the largest of each row for 123a_1 through 123a_5. For this I was able to use:
dff <- mutate(dff, max123a = pmax(`123a_1`, `123a_2`, `123a_3`, `123a_4`, `123a_5`, na.rm = T))
However, now I need the second largest of each row. This assumes that there might be data other than NA in 123a_3 through 123a_5. Ideally, I'd like a dplyr solution, so that I can pipe the two commands together, but I'll take anything.
With dplyr and tidyr:
library(dplyr)
library(tidyr)
dff %>%
gather(var, val, 2:6) %>%
group_by(`MCI ID`) %>%
summarise(max2 = max(val[val != max(val, na.rm = TRUE)], na.rm = TRUE)) %>%
left_join(dff, .)
This results in:
MCI ID 123a_1 123a_2 123a_3 123a_4 123a_5 max123a max2
1 070405344 2015-08-25 2016-08-16 <NA> <NA> <NA> 2016-08-16 2015-08-25
2 230349820 2014-10-29 2014-12-18 <NA> <NA> <NA> 2014-12-18 2014-10-29
3 260386435 2015-10-22 2016-10-14 <NA> <NA> <NA> 2016-10-14 2015-10-22
4 370390587 2015-09-10 2016-08-23 <NA> <NA> <NA> 2016-08-23 2015-09-10
5 380406805 2015-09-22 2016-09-13 <NA> <NA> <NA> 2016-09-13 2015-09-22
6 391169282 2015-12-14 2016-12-05 <NA> <NA> <NA> 2016-12-05 2015-12-14
7 440377986 2015-10-01 2016-09-28 <NA> <NA> <NA> 2016-09-28 2015-10-01
8 750391394 2016-08-20 2016-08-30 <NA> <NA> <NA> 2016-08-30 2016-08-20
9 890373764 2015-12-17 2016-12-06 <NA> <NA> <NA> 2016-12-06 2015-12-17
10 910367024 2015-08-28 2016-08-25 <NA> <NA> <NA> 2016-08-25 2015-08-28
You could do everything together as follows:
dff %>%
gather(var, val, 2:6) %>%
group_by(`MCI ID`) %>%
summarise(max2 = max(val[val != max(val, na.rm = TRUE)], na.rm = TRUE)) %>%
left_join(dff,.) %>%
mutate(max123a = pmax(`123a_1`, `123a_2`, `123a_3`, `123a_4`, `123a_5`, na.rm = TRUE))
A solution in base R (sort() drops NAs by default, so this picks the second-largest non-NA value per row):
dff$max2 <- apply(dff[2:6], 1, function(x) rev(sort(x))[2])
We can use the tidyverse, working rowwise:
library(tidyverse)
dff %>%
  rowwise() %>%
  mutate(max2 = sort(c_across(2:6), decreasing = TRUE)[2]) %>%
  ungroup()
This question already has answers here:
Getting the minimum of the rows in a data frame
(3 answers)
I have a group of columns, and I need a new column min123 with the minimum value of these columns 123a_1 through 123a_5.
dff <- structure(list(`MCI ID` = c("070405344", "230349820", "260386435","370390587", "380406805", "391169282", "440377986", "750391394","890373764", "910367024"),
`123a_1` = structure(c(16672, 16372,16730, 16688, 16700, 16783, 16709, 17033, 16786, 16675), class = "Date"),
`123a_2` = structure(c(17029, 16422, 17088, 17036, 17057,17140, 17072, 17043, 17141, 17038), class = "Date"),
`123a_3` = structure(c(NA_real_,NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,NA_real_, NA_real_, NA_real_), class = "Date"),
`123a_4` = structure(c(NA_real_,NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,NA_real_, NA_real_, NA_real_), class = "Date"),
`123a_5` = structure(c(NA_real_,NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,NA_real_, NA_real_, NA_real_), class = "Date")),
.Names = c("MCI ID","123a_1", "123a_2", "123a_3", "123a_4", "123a_5"), row.names = c(NA,10L), class = "data.frame")
A base R approach using do.call and pmin:
dff$min123 <- do.call(pmin, c(dff[-1], na.rm = TRUE))
A similar approach in dplyr:
library(dplyr)
dff %>%
mutate(min123 = do.call(pmin, c(select(., -1), na.rm = TRUE)))
Or data.table:
library(data.table)
setDT(dff)[, min123 := do.call(pmin, c(.SD, na.rm = TRUE)), .SDcols = -1]
library(dplyr)
dff %>%
mutate(min123 = pmin(`123a_1`, `123a_2`, `123a_3`, `123a_4`, `123a_5`, na.rm = T))
That's what the function pmin is for:
> str(dff)
'data.frame': 10 obs. of 6 variables:
$ MCI ID: chr "070405344" "230349820" "260386435" "370390587" ...
$ 123a_1: Date, format: "2015-08-25" "2014-10-29" "2015-10-22" ...
$ 123a_2: Date, format: "2016-08-16" "2014-12-18" "2016-10-14" ...
$ 123a_3: Date, format: NA NA NA ...
$ 123a_4: Date, format: NA NA NA ...
$ 123a_5: Date, format: NA NA NA ...
> dff$groupmin <- pmin(dff[[2]],dff[[3]],dff[[4]], dff[[5]], dff[[6]], na.rm=TRUE)
> head(dff)
MCI ID 123a_1 123a_2 123a_3 123a_4 123a_5 groupmin
1 070405344 2015-08-25 2016-08-16 <NA> <NA> <NA> 2015-08-25
2 230349820 2014-10-29 2014-12-18 <NA> <NA> <NA> 2014-10-29
3 260386435 2015-10-22 2016-10-14 <NA> <NA> <NA> 2015-10-22
4 370390587 2015-09-10 2016-08-23 <NA> <NA> <NA> 2015-09-10
5 380406805 2015-09-22 2016-09-13 <NA> <NA> <NA> 2015-09-22
6 391169282 2015-12-14 2016-12-05 <NA> <NA> <NA> 2015-12-14