I have a dataset with hundreds of thousands of measurements taken from several subjects. However, the measurements are only partially available, i.e., there may be large stretches of NA. I need to establish up front the timespan for which non-NA data are available for each subject.
Data:
df
timestamp C B A starttime_ms
1 00:00:00.033 NA NA NA 33
2 00:00:00.064 NA NA NA 64
3 00:00:00.066 NA 0.346 NA 66
4 00:00:00.080 47.876 0.346 22.231 80
5 00:00:00.097 47.876 0.346 22.231 97
6 00:00:00.099 47.876 0.346 NA 99
7 00:00:00.114 47.876 0.346 NA 114
8 00:00:00.130 47.876 0.346 NA 130
9 00:00:00.133 NA 0.346 NA 133
10 00:00:00.147 NA 0.346 NA 147
My (humble) solution so far is to pick out the timestamp values at which a subject's measurement is not NA and to select the first and last such timestamp, for each subject individually. Here's the code for subject C:
NotNA_C <- df$timestamp[which(!is.na(df$C))]
range_C <- paste(NotNA_C[1], NotNA_C[length(NotNA_C)], sep = " - ")
range_C
[1] "00:00:00.080 - 00:00:00.130"
That doesn't look elegant and, what's more, it needs to be repeated for all other subjects. Is there a more efficient way to establish the range of time for which non-NA values are available for all subjects in one go?
EDIT
I've found a base R solution:
sapply(df[,2:4], function(x)
paste(df$timestamp[which(!is.na(x))][1],
df$timestamp[which(!is.na(x))][length(df$timestamp[which(!is.na(x))])], sep = " - "))
C B A
"00:00:00.080 - 00:00:00.130" "00:00:00.066 - 00:00:00.147" "00:00:00.080 - 00:00:00.097"
but would be interested in other solutions as well!
Reproducible data:
df <- structure(list(timestamp = c("00:00:00.033", "00:00:00.064",
"00:00:00.066", "00:00:00.080", "00:00:00.097", "00:00:00.099",
"00:00:00.114", "00:00:00.130", "00:00:00.133", "00:00:00.147"
), C = c(NA, NA, NA, 47.876, 47.876, 47.876, 47.876, 47.876,
NA, NA), B = c(NA, NA, 0.346, 0.346, 0.346, 0.346,
0.346, 0.346, 0.346, 0.346), A = c(NA, NA, NA, 22.231, 22.231, NA, NA, NA, NA,
NA), starttime_ms = c(33, 64, 66, 80, 97, 99, 114, 130, 133,
147)), row.names = c(NA, 10L), class = "data.frame")
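The base R approach from the EDIT above can be sketched so that the non-NA timestamps are computed only once per column. This is a minimal sketch, not a definitive implementation: the helper name `non_na_span` is hypothetical, and returning NA for a column that is entirely NA is an assumption about the desired behaviour.

```r
# Data from the question (measurement columns only)
df <- data.frame(
  timestamp = c("00:00:00.033", "00:00:00.064", "00:00:00.066", "00:00:00.080",
                "00:00:00.097", "00:00:00.099", "00:00:00.114", "00:00:00.130",
                "00:00:00.133", "00:00:00.147"),
  C = c(NA, NA, NA, 47.876, 47.876, 47.876, 47.876, 47.876, NA, NA),
  B = c(NA, NA, 0.346, 0.346, 0.346, 0.346, 0.346, 0.346, 0.346, 0.346),
  A = c(NA, NA, NA, 22.231, 22.231, NA, NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first and last timestamp at which x is not NA
non_na_span <- function(x, ts) {
  ts_ok <- ts[!is.na(x)]                          # timestamps with a measurement
  if (length(ts_ok) == 0L) return(NA_character_)  # assumption: all-NA column gives NA
  paste(ts_ok[1], ts_ok[length(ts_ok)], sep = " - ")
}

# Apply the helper to every subject column in one go
spans <- sapply(df[c("C", "B", "A")], non_na_span, ts = df$timestamp)
spans
```

Like the original, this relies on the rows being sorted by timestamp, since it takes the first and last matching row rather than a minimum and maximum.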
dplyr solution
library(tidyverse)
df <- structure(list(timestamp = c("00:00:00.033", "00:00:00.064",
"00:00:00.066", "00:00:00.080", "00:00:00.097", "00:00:00.099",
"00:00:00.114", "00:00:00.130", "00:00:00.133", "00:00:00.147"
), C = c(NA, NA, NA, 47.876, 47.876, 47.876, 47.876, 47.876,
NA, NA), B = c(NA, NA, 0.346, 0.346, 0.346, 0.346,
0.346, 0.346, 0.346, 0.346), A = c(NA, NA, NA, 22.231, 22.231, NA, NA, NA, NA,
NA), starttime_ms = c(33, 64, 66, 80, 97, 99, 114, 130, 133,
147)), row.names = c(NA, 10L), class = "data.frame")
df %>%
pivot_longer(-c(timestamp, starttime_ms)) %>%
group_by(name) %>%
drop_na() %>%
summarise(min = timestamp %>% min(),
max = timestamp %>% max())
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 3
#> name min max
#> <chr> <chr> <chr>
#> 1 A 00:00:00.080 00:00:00.097
#> 2 B 00:00:00.066 00:00:00.147
#> 3 C 00:00:00.080 00:00:00.130
Created on 2021-02-15 by the reprex package (v0.3.0)
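You could look at the cumulative sum of the differences of the non-NA indicator, coerce it to logical, and take the first and last matching timestamp. Note that this relies on every column starting with NA, as is the case in the example data.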
lapply(data.frame(apply(rbind(0, diff(!sapply(df[c("C", "B", "A")], is.na))), 2, cumsum)),
function(x) c(df$timestamp[as.logical(x)][1], rev(df$timestamp[as.logical(x)])[1]))
# $C
# [1] "00:00:00.080" "00:00:00.130"
#
# $B
# [1] "00:00:00.066" "00:00:00.147"
#
# $A
# [1] "00:00:00.080" "00:00:00.097"
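The cumsum approach above starts counting from 0, so it seems to depend on each column beginning with NA. The sketch below illustrates that edge case on a hypothetical vector `x` (not part of the original data) and shows an index-based alternative using `which()` and `range()`.

```r
x <- c(1.5, 2.5, NA, NA)   # hypothetical column with no leading NA

# Mask built as in the answer above: cumulative sum of changes in "not NA"
mask <- as.logical(cumsum(c(0, diff(!is.na(x)))))
mask   # FALSE FALSE TRUE TRUE, i.e. the NA positions are selected instead

# Index-based alternative that does not depend on how the column starts
idx <- range(which(!is.na(x)))
idx    # 1 2: first and last row holding a value
```

The resulting `idx` could then be used to subset the timestamp column. For a column that is entirely NA, `which()` returns an empty vector and `range()` warns, so that case would need separate handling.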
From the documentation I understand that invisible() returns a (temporarily) invisible copy of an object. However, when I use invisible() I always need to call the object twice before it is actually printed.
I use data.table and would like my function to return an invisible copy of the object when a certain condition is met (i.e., when the function exits early).
I've noticed that this "two calls needed" behaviour also occurs when the invisibly returned object is used inside another function, which makes it seemingly unusable. What causes this behaviour? Am I doing something wrong? How do I get the function to return invisibly and still have the object printed on the first call?
Please see sample code below:
example <- function(DT) {
  if (!(1 %in% DT$RSI.verticalBottom) || !(1 %in% DT$RSI.top)) {
    # abort if there is no buy or sell signal
    DT[, `:=`(pos = NA,
              return = NA
    )]
    return(invisible(DT))
  }
  # ... rest of the function omitted
}
> example(sample.data)
> sample.data
> sample.data
conm tic datadate cshoq gind year month yearmon fdateq pdateq fyr fyearq fqtr
1: NS GROUP INC NSS.1 2000-01-31 NA 101010 2000 1 2000_1 NA <NA> NA NA NA
2: NS GROUP INC NSS.1 2000-02-29 NA 101010 2000 2 2000_2 NA <NA> NA NA NA
3: NS GROUP INC NSS.1 2000-03-31 21.533 101010 2000 3 2000_3 NA <NA> 9 2000 2
4: NS GROUP INC NSS.1 2000-04-30 NA 101010 2000 4 2000_4 NA <NA> NA NA NA
5: NS GROUP INC NSS.1 2000-05-31 NA 101010 2000 5 2000_5 NA <NA> NA NA NA
6: NS GROUP INC NSS.1 2000-06-30 22.008 101010 2000 6 2000_6 NA <NA> 9 2000 3
req epspiq epspxq ajexq saleq saley ivncfy gsubind dpq ibmiiq ibq iby oiadpq
1: NA NA NA NA NA NA NA NA NA NA NA NA NA
2: NA NA NA NA NA NA NA NA NA NA NA NA NA
3: -58.396 -0.38 -0.38 1 100.107 186.733 10.77 10101020 5.517 NA -8.231 -21.165 -5.617
4: NA NA NA NA NA NA NA NA NA NA NA NA NA
5: NA NA NA NA NA NA NA NA NA NA NA NA NA
6: -63.168 -0.19 -0.23 1 73.652 260.385 20.90 10101020 NA NA -5.048 -26.213 NA
oiadpy oibdpq oibdpy xiq xoprq cogsy dlcchy wcapchy QEBIT.adep YEBIT.adep QEBIT.bdep
1: NA NA NA NA NA NA NA NA NA NA NA
2: NA NA NA NA NA NA NA NA NA NA NA
3: -16.924 -0.1 -5.57 0 100.207 177.826 -0.394 NA -0.05610996 -0.09063208 -0.0009989311
4: NA NA NA NA NA NA NA NA NA NA NA
5: NA NA NA NA NA NA NA NA NA NA NA
6: NA NA NA 0 NA NA -0.394 NA NA NA NA
YEBIT.bdep QEBT YEBT f_id I.QSales IWA.QEBIT IWA.QEBT I.YSales IWA.YEBIT
1: NA NA NA NA NA NA NA NA NA
2: NA NA NA NA NA NA NA NA NA
3: -0.000535524 -0.08222202 -0.1133437 2000Q2 19344.53 0.08160277 0.03577741 196223.7 0.08329726
4: NA NA NA NA NA NA NA NA NA
5: NA NA NA NA NA NA NA NA NA
6: NA -0.06853853 -0.1006702 2000Q3 19798.64 0.10680607 0.06096211 196223.7 0.08329726
IWA.YEBT QSales.pc YSales.pc RSI_QEBIT RSI_QEBT RSI_IWA.QEBIT RSI_IWA.QEBT adj.factor
1: NA NA NA NA NA NA NA 1
2: NA NA NA NA NA NA NA 1
3: 0.03875869 0.005174952 0.0009516334 41.45963 32.93934 29.96487 18.23527 1
4: NA NA NA NA NA NA NA 1
5: NA NA NA NA NA NA NA 1
6: 0.03875869 0.003720053 0.0013269806 49.83110 34.64800 37.58678 24.75847 1
dvpsxm cshtrm curcdm close high low trfm trt1m close.unAdj mktcap close.div
1: NA 4557500 USD 8.8750 10.1250 6.7500 1.0409 16.3934 8.8750 NA 8.8750
2: NA 4506100 USD 11.6875 12.1250 8.0625 1.0409 31.6901 11.6875 NA 11.6875
3: NA 4146200 USD 16.3125 16.8125 11.3750 1.0409 39.5722 16.3125 351.2571 16.3125
4: NA 3215400 USD 15.8750 16.3750 12.8750 1.0409 -2.6820 15.8750 NA 15.8750
5: NA 2948800 USD 18.3125 19.3750 16.0625 1.0409 15.3543 18.3125 NA 18.3125
6: NA 4296100 USD 20.9375 21.0000 17.7500 1.0409 14.3345 20.9375 460.7925 20.9375
RSI_close RSI.verticalBottom RSI.top return pos
1: NA NA NA NA NA
2: NA NA NA NA NA
3: NA NA NA NA NA
4: NA NA NA NA NA
5: NA NA NA NA NA
6: NA NA NA NA NA
Sample data
> dput(sample.data)
structure(list(conm = c("NS GROUP INC", "NS GROUP INC", "NS GROUP INC",
"NS GROUP INC", "NS GROUP INC", "NS GROUP INC"), tic = c("NSS.1",
"NSS.1", "NSS.1", "NSS.1", "NSS.1", "NSS.1"), datadate = structure(c(10987,
11016, 11047, 11077, 11108, 11138), class = "Date"), cshoq = c(NA,
NA, 21.533, NA, NA, 22.008), gind = c(101010L, 101010L, 101010L,
101010L, 101010L, 101010L), year = c(2000, 2000, 2000, 2000,
2000, 2000), month = c(1, 2, 3, 4, 5, 6), yearmon = c("2000_1",
"2000_2", "2000_3", "2000_4", "2000_5", "2000_6"), fdateq = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), pdateq = structure(c(NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_), class = "Date"), fyr = c(NA, NA, 9L, NA,
NA, 9L), fyearq = c(NA, NA, 2000L, NA, NA, 2000L), fqtr = c(NA,
NA, 2L, NA, NA, 3L), req = c(NA, NA, -58.396, NA, NA, -63.168
), epspiq = c(NA, NA, -0.38, NA, NA, -0.19), epspxq = c(NA, NA,
-0.38, NA, NA, -0.23), ajexq = c(NA, NA, 1, NA, NA, 1), saleq = c(NA,
NA, 100.107, NA, NA, 73.652), saley = c(NA, NA, 186.733, NA,
NA, 260.385), ivncfy = c(NA, NA, 10.77, NA, NA, 20.9), gsubind = c(NA,
NA, 10101020L, NA, NA, 10101020L), dpq = c(NA, NA, 5.517, NA,
NA, NA), ibmiiq = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), ibq = c(NA, NA, -8.231, NA, NA, -5.048), iby = c(NA,
NA, -21.165, NA, NA, -26.213), oiadpq = c(NA, NA, -5.617, NA,
NA, NA), oiadpy = c(NA, NA, -16.924, NA, NA, NA), oibdpq = c(NA,
NA, -0.1, NA, NA, NA), oibdpy = c(NA, NA, -5.57, NA, NA, NA),
xiq = c(NA, NA, 0, NA, NA, 0), xoprq = c(NA, NA, 100.207,
NA, NA, NA), cogsy = c(NA, NA, 177.826, NA, NA, NA), dlcchy = c(NA,
NA, -0.394, NA, NA, -0.394), wcapchy = c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_), QEBIT.adep = c(NA,
NA, -0.0561099623402959, NA, NA, NA), YEBIT.adep = c(NA,
NA, -0.0906320789576561, NA, NA, NA), QEBIT.bdep = c(NA,
NA, -0.000998931143676266, NA, NA, NA), YEBIT.bdep = c(NA,
NA, -0.000535523983441598, NA, NA, NA), QEBT = c(NA, NA,
-0.0822220224359935, NA, NA, -0.0685385325585184), YEBT = c(NA,
NA, -0.113343651095414, NA, NA, -0.100670161491637), f_id = c(NA,
NA, "2000Q2", NA, NA, "2000Q3"), I.QSales = c(NA, NA, 19344.526,
NA, NA, 19798.641), IWA.QEBIT = c(NA, NA, 0.0816027748625115,
NA, NA, 0.10680606815387), IWA.QEBT = c(NA, NA, 0.0357774080378087,
NA, NA, 0.0609621135107203), I.YSales = c(NA, NA, 196223.665,
NA, NA, 196223.665), IWA.YEBIT = c(NA, NA, 0.0832972567299668,
NA, NA, 0.0832972567299668), IWA.YEBT = c(NA, NA, 0.0387586889685299,
NA, NA, 0.0387586889685299), QSales.pc = c(NA, NA, 0.00517495233535316,
NA, NA, 0.00372005331072976), YSales.pc = c(NA, NA, 0.000951633433204909,
NA, NA, 0.00132698061673652), RSI_QEBIT = c(NA, NA, 41.4596290506163,
NA, NA, 49.8310957229999), RSI_QEBT = c(NA, NA, 32.939339100869,
NA, NA, 34.6480049470139), RSI_IWA.QEBIT = c(NA, NA, 29.9648696052066,
NA, NA, 37.5867809473848), RSI_IWA.QEBT = c(NA, NA, 18.2352737965041,
NA, NA, 24.7584711404174), adj.factor = c(1, 1, 1, 1, 1,
1), dvpsxm = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), cshtrm = c(4557500, 4506100, 4146200, 3215400,
2948800, 4296100), curcdm = c("USD", "USD", "USD", "USD",
"USD", "USD"), close = c(8.875, 11.6875, 16.3125, 15.875,
18.3125, 20.9375), high = c(10.125, 12.125, 16.8125, 16.375,
19.375, 21), low = c(6.75, 8.0625, 11.375, 12.875, 16.0625,
17.75), trfm = c(1.0409, 1.0409, 1.0409, 1.0409, 1.0409,
1.0409), trt1m = c(16.3934, 31.6901, 39.5722, -2.682, 15.3543,
14.3345), close.unAdj = c(8.875, 11.6875, 16.3125, 15.875,
18.3125, 20.9375), mktcap = c(NA, NA, 351.2570625, NA, NA,
460.7925), close.div = c(8.875, 11.6875, 16.3125, 15.875,
18.3125, 20.9375), RSI_close = c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), RSI.verticalBottom = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), RSI.top = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), return = c(NA,
NA, NA, NA, NA, NA), pos = c(NA, NA, NA, NA, NA, NA)), .Names = c("conm",
"tic", "datadate", "cshoq", "gind", "year", "month", "yearmon",
"fdateq", "pdateq", "fyr", "fyearq", "fqtr", "req", "epspiq",
"epspxq", "ajexq", "saleq", "saley", "ivncfy", "gsubind", "dpq",
"ibmiiq", "ibq", "iby", "oiadpq", "oiadpy", "oibdpq", "oibdpy",
"xiq", "xoprq", "cogsy", "dlcchy", "wcapchy", "QEBIT.adep", "YEBIT.adep",
"QEBIT.bdep", "YEBIT.bdep", "QEBT", "YEBT", "f_id", "I.QSales",
"IWA.QEBIT", "IWA.QEBT", "I.YSales", "IWA.YEBIT", "IWA.YEBT",
"QSales.pc", "YSales.pc", "RSI_QEBIT", "RSI_QEBT", "RSI_IWA.QEBIT",
"RSI_IWA.QEBT", "adj.factor", "dvpsxm", "cshtrm", "curcdm", "close",
"high", "low", "trfm", "trt1m", "close.unAdj", "mktcap", "close.div",
"RSI_close", "RSI.verticalBottom", "RSI.top", "return", "pos"
), sorted = c("conm", "tic", "datadate", "cshoq", "gind", "year",
"month", "yearmon"), class = c("data.table", "data.frame"), row.names = c(NA,
-6L), .internal.selfref = <pointer: 0x102806978>)
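To separate the effect of invisible() from anything data.table does, the behaviour can be checked in base R with withVisible(). The sketch below uses a plain data.frame and a hypothetical function `f` standing in for the one above. In base R, invisibility applies only to the value of that one call, so a later reference to the object prints on the first try. That suggests the double call may come from the `:=` step rather than from invisible() itself, though this is an assumption and not confirmed in the post. The data.table FAQ describes a similar situation and suggests ending the function with `DT[]`.

```r
# Hypothetical stand-in for the function in the question, using a data.frame
f <- function(d) {
  d$pos <- NA     # add a column, analogous to the := step
  invisible(d)    # return without auto-printing
}

d <- data.frame(a = 1:3)

res <- withVisible(f(d))
res$visible                # FALSE: a bare call to f(d) prints nothing

out <- f(d)                # assigning keeps the returned value
withVisible(out)$visible   # TRUE: typing `out` at the prompt prints immediately
```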
I have a data frame Depth which consists of LAT and LON coordinates with the corresponding temperature data at a series of depths. For each coordinate pair (LAT and LON) I would like to pull out the last available (non-NA) record across the depth columns into a new data frame. The data look like this:
> Depth<-read.csv('depthdata.csv')
> head(Depth)
LAT LON X150 X175 X200 X225 X250 X275 X300 X325 X350 X375 X400 X425 X450
1 -78.375 -163.875 -1.167 -1.0 NA NA NA NA NA NA NA NA NA NA NA
2 -78.125 -168.875 -1.379 -1.3 -1.259 -1.6 -1.476 -1.374 -1.507 NA NA NA NA NA NA
3 -78.125 -167.625 -1.700 -1.7 -1.700 -1.7 NA NA NA NA NA NA NA NA NA
4 -78.125 -167.375 -2.100 -2.2 -2.400 -2.3 -2.200 NA NA NA NA NA NA NA NA
5 -78.125 -167.125 -1.600 -1.6 -1.600 -1.6 NA NA NA NA NA NA NA NA NA
6 -78.125 -166.875 NA NA NA NA NA NA NA NA NA NA NA NA NA
The result should look like this:
LAT LON
-78.375 -163.875 -1
-78.125 -168.875 -1.507
-78.125 -167.625 -1.7
-78.125 -167.375 -2.2
-78.125 -167.125 -1.6
-78.125 -166.875 NA
I tried the tail() function, but it did not give me the desired result.
As I understand it, you want the last non-NA value in each row, for all columns except the first two.
We can use max.col() together with is.na() on the relevant columns to get the column number of the last non-NA value in each row. 2 is added (the + 2L) to compensate for dropping the first two columns (the [-(1:2)]).
idx <- max.col(!is.na(Depth[-(1:2)]), ties.method = "last") + 2L
We can use idx in cbind() to create an index matrix for retrieving the values.
Depth[cbind(seq_len(nrow(Depth)), idx)]
# [1] -1.000 -1.507 -1.700 -2.200 -1.600 NA
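To see how the two steps above interact, here is a small sketch on a toy data frame. The data frame `toy` and its values are hypothetical and chosen only to cover three cases: trailing NAs, a gap in the middle, and a row that is entirely NA.

```r
# Toy data frame (hypothetical values): two id columns, then three measurements
toy <- data.frame(id1 = 1:3, id2 = 4:6,
                  m1 = c(10, 20, NA),
                  m2 = c(11, NA, NA),
                  m3 = c(NA, 22, NA))

# Logical matrix, TRUE where a measurement exists
present <- !is.na(toy[-(1:2)])

# Position of the last TRUE per row, shifted by the two dropped id columns
idx <- max.col(present, ties.method = "last") + 2L
idx        # 4 5 5

# A two-column (row, column) matrix picks one cell per row
last_val <- toy[cbind(seq_len(nrow(toy)), idx)]
last_val   # 11 22 NA
```

In the all-NA row every entry ties, so ties.method = "last" points at the final column, and the extracted value is NA as desired.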
Bind this together with the first two columns of the original data with cbind() and we're done.
cbind(Depth[1:2], LAST = Depth[cbind(seq_len(nrow(Depth)), idx)])
# LAT LON LAST
# 1 -78.375 -163.875 -1.000
# 2 -78.125 -168.875 -1.507
# 3 -78.125 -167.625 -1.700
# 4 -78.125 -167.375 -2.200
# 5 -78.125 -167.125 -1.600
# 6 -78.125 -166.875 NA
Data:
Depth <- structure(list(LAT = c(-78.375, -78.125, -78.125, -78.125, -78.125,
-78.125), LON = c(-163.875, -168.875, -167.625, -167.375, -167.125,
-166.875), X150 = c(-1.167, -1.379, -1.7, -2.1, -1.6, NA), X175 = c(-1,
-1.3, -1.7, -2.2, -1.6, NA), X200 = c(NA, -1.259, -1.7, -2.4,
-1.6, NA), X225 = c(NA, -1.6, -1.7, -2.3, -1.6, NA), X250 = c(NA,
-1.476, NA, -2.2, NA, NA), X275 = c(NA, -1.374, NA, NA, NA, NA
), X300 = c(NA, -1.507, NA, NA, NA, NA), X325 = c(NA, NA, NA,
NA, NA, NA), X350 = c(NA, NA, NA, NA, NA, NA), X375 = c(NA, NA,
NA, NA, NA, NA), X400 = c(NA, NA, NA, NA, NA, NA), X425 = c(NA,
NA, NA, NA, NA, NA), X450 = c(NA, NA, NA, NA, NA, NA)), .Names = c("LAT",
"LON", "X150", "X175", "X200", "X225", "X250", "X275", "X300",
"X325", "X350", "X375", "X400", "X425", "X450"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))