I am looking for a way to fill in values for 'checks' that have no start and end times. At first I thought of using bilinear interpolation for this task, but that seemed overly complicated; I really just need something much simpler along the same lines.
My data looks something like this:
df <- data.frame("ID" = c("A","A","A","A","A","B","B","B","B","B"),
                 "Check" = c(1:5),
                 "Start_time" = c("start_a1","start_a2","start_a3","start_a4","start_a5","startb1","startb2","startb3",NA,"startb5"),
                 "end_time" = c("end_a1","end_a2","end_a3","end_a4","end_a5","end_b1","end_b2",NA,NA,"endb5")
)
So what I am ideally looking for is this: any check that has a missing start time and end time should pick up the data from the next check's start time, not the previous one.
I am trying the following code block but it's giving me an issue:
df$end_time[df$Check == 3 & is.na(df$end_time)] <- df$start_time[df$Check == 5]
# this gives a length issue
Any advice would be helpful here. My dataset contains approx. 5k rows, and each ID has a number of checks with start and end times.
The tidyr package has a function fill() which does exactly this.
library(dplyr)
library(tidyr)
df %>%
  group_by(ID) %>%
  fill(c(Start_time, end_time), .direction = 'up')
# A tibble: 10 × 4
# Groups: ID [2]
ID Check Start_time end_time
<chr> <int> <chr> <chr>
1 A 1 start_a1 end_a1
2 A 2 start_a2 end_a2
3 A 3 start_a3 end_a3
4 A 4 start_a4 end_a4
5 A 5 start_a5 end_a5
6 B 1 startb1 end_b1
7 B 2 startb2 end_b2
8 B 3 startb3 endb5
9 B 4 startb5 endb5
10 B 5 startb5 endb5
The .direction = "up" parameter means it takes the next non-missing value to fill in blanks. To use the previous value you would use .direction = "down". And .direction = "updown" uses the next value unless there are no more non-missing values below in that group, in which case it takes the previous non-missing value (useful when the missing value is in the last row of the group).
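For example, here is a small made-up group where the missing values sit in the last rows, so "up" alone has nothing to pull from and "updown" falls back to the previous value:
library(dplyr)
library(tidyr)
d <- data.frame(ID = c("A", "A", "A"),
                val = c("v1", NA, NA))
d %>% group_by(ID) %>% fill(val, .direction = "up")
# rows 2 and 3 stay NA: there is no later non-missing value to pull from
d %>% group_by(ID) %>% fill(val, .direction = "updown")
# rows 2 and 3 become "v1": the upward pass finds nothing, the downward pass fills them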
df <- cbind(c(1,1,1,1,1,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,5,5,5,5),
            c(6,12,18,24,30,3,9,21,6,12,18,24,30,36,6,12,18,24,30,36,12,24,36,48),
            c(0.4,1.5,2.7,1.6,0.4,1.3,3.1,3.6,0.5,2.6,3.7,1.8,0.9,0.3,0.7,1.6,1.3,2.8,1.9,1.8,2.0,1.0,3.0,0.8))
colnames(df) <- c("ID","time","value")
I have a dataset as given by the code above. For each ID, starting from the lowest/earliest time, I would like to know whether the value bounced up by at least 2 compared to the lowest value before the rise, and then went back down to a value below or equal to that lowest pre-bounce value. I would like to flag the time of the rise of at least 2 as the time of the bounce.
So for example, in the above dataset, for ID 1, the lowest value was 0.4 at time 6 before it started rising. At time 18 it met the pre-defined threshold of 2 and then at time 30, it went down to a value equal to the pre-bounce lowest value. So I would like to flag ID 1 has having bounce and time 18 as the time for bounce.
On the other hand, for ID 2, although it rose by at least 2 (1.3-->3.6), it never went back down to a value below or equal to 1.3.
For ID 3, the criteria for a bounce are again met (0.5-->2.6-->3.7-->1.8-->0.9-->0.3). So I would like to flag ID 3 as having a bounce and time 18 as the time of the bounce.
For ID 4, although there was a rise of at least 2, i.e. from 0.7-->1.6-->1.3-->2.8 (at time 24), it never later went back down below 0.7, the lowest value before the rise. So it cannot be flagged as having a bounce.
For ID 5, the values were 2-->1-->3-->0.8, so there was a rise of at least 2 (1-->3) and then a fall to a value below the lowest pre-bounce value (0.8 < 1.0). So this ID should be flagged as having a bounce, and the time of the bounce should be time 36.
Please help me with this dynamic calculation and also explain the codes if possible so that I can understand the concept. Thank you in advance.
Consider this:
func <- function(tm, val, threshold = 2) {
  # all pairwise differences val[row] - val[col]; keep only the lower triangle,
  # i.e. "later value minus earlier value"
  mtx <- outer(val, val, `-`)
  mtx[upper.tri(mtx)] <- NA
  # no rise of at least `threshold` anywhere: return an NA of the same class as `tm`
  if (all(mtx < threshold, na.rm = TRUE)) return(tm[NA][1])
  ij <- which.max(mtx) # counts through the matrix, along columns
  i <- (ij-1) %/% length(val) + 1  # column: index of the pre-bounce low
  j <- (ij-1) %% length(val) + 1   # row: index of the peak of the rise
  # flag only if some later value drops back to (or below) the pre-bounce low
  if (i < length(val) && any(val[-seq_len(i)] <= val[i])) {
    return(tm[j])
  } else {
    return(tm[NA][i])
  }
}
df <- data.frame(
ID = c(1,1,1,1,1,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,5,5,5,5),
time = c(6,12,18,24,30,3,9,21,6,12,18,24,30,36,6,12,18,24,30,36,12,24,36,48),
value = c(0.4,1.5,2.7,1.6,0.4,1.3,3.1,3.6,0.5,2.6,3.7,1.8,0.9,0.3,0.7,1.6,1.3,2.8,1.9,1.8,2.0,1.0,3.0,0.8)
)
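As a quick sanity check on a single ID before grouping (using ID 1 from the question, where the expected bounce time is 18):
func(df$time[df$ID == 1], df$value[df$ID == 1])
# [1] 18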
I use which.max and the %/% and %% operators because in general I don't like doing which(mtx == max(mtx, na.rm = TRUE), arr.ind = TRUE); while the latter works, it relies on equality tests of floating point numbers, which can be problematic with extreme values. See Why are these numbers not equal?, Is floating point math broken?, and https://en.wikipedia.org/wiki/IEEE_754. If you don't like this safe-guarding, feel free to adapt the function to use which(.) instead.
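To illustrate the index arithmetic on a small standalone example (the 3x3 matrix here is invented purely for demonstration; arrayInd() is the base R helper that does the same conversion):
m <- matrix(c(5, 1, 9, 2, 7, 3, 8, 4, 6), nrow = 3)
ij <- which.max(m)        # 3: which.max counts down the columns
(ij - 1) %%  nrow(m) + 1  # 3: the row of the maximum
(ij - 1) %/% nrow(m) + 1  # 1: the column of the maximum
arrayInd(ij, dim(m))      # same answer, without any floating-point equality test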
The reason I go through the trouble of tm[NA][1] is so that the return value is of the exact class as your input time variable. For instance, dplyr in many situations can warn or err if the value you're changing in a vector is not the same class. This warning or error is good, as R's native (and silent) coercion of values can be problematic. For instance, Sys.time() is class POSIXt but NA is not. But Sys.time()[NA] is class POSIXt. Similarly, integer and numeric both have different types of NA. Perhaps this is being a bit over-defensive, but the use of tm[NA][1] ensures that the output is always the same class as the input time.
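To see what that buys you:
class(NA)               # "logical"
class(Sys.time()[NA])   # "POSIXct" "POSIXt"
class(1L[NA])           # "integer"
class(1.5[NA])          # "numeric"
Indexing with NA keeps the vector's class, whereas a bare NA is plain logical.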
dplyr
library(dplyr)
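The grouped call that produces the summary below would be along these lines (a sketch, assuming func() and df as defined above):
df %>%
  group_by(ID) %>%
  summarize(time = func(time, value))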
# # A tibble: 5 x 2
# ID time
# * <dbl> <dbl>
# 1 1 18
# 2 2 NA
# 3 3 18
# 4 4 NA
# 5 5 36
data.table
library(data.table)
DF <- as.data.table(df)
DF[, .(time = func(time, value)), by = .(ID)]
# ID time
# <num> <num>
# 1: 1 18
# 2: 2 NA
# 3: 3 18
# 4: 4 NA
# 5: 5 36
I am reading well-structured textual data in R and, in the process of converting from character to numeric, the numbers lose their decimal places.
I have tried using round(digits = 2) but it didn't work since I first had to apply as.numeric. At one point, I did set up options(digits = 2) before the conversion but it didn't work either.
Ultimately, I want to get a data.frame whose numbers look exactly the same as the ones seen as characters.
I looked for help here and did find answers like this, this, and this; however, none of them really helped me solve this issue.
How can I prevent number rounding when converting from character to numeric?
Here's a reproducible piece of code I wrote.
library(purrr)
my_char = c(" 246.00 222.22 197.98 135.10 101.50 86.45
72.17 62.11 64.94 76.62 109.33 177.80")
# Break characters between spaces
my_char = strsplit(my_char, "\\s+")
head(my_char, n = 2)
#> [[1]]
#> [1] "" "246.00" "222.22" "197.98" "135.10" "101.50" "86.45"
#> [8] "72.17" "62.11" "64.94" "76.62" "109.33" "177.80"
# Convert from characters to numeric.
my_char = map_dfc(my_char, as.numeric)
head(my_char, n = 2)
#> # A tibble: 2 x 1
#> V1
#> <dbl>
#> 1 NA
#> 2 246
# Delete first value because it's empty
my_char = my_char[-1,1]
head(my_char, n = 2)
#> # A tibble: 2 x 1
#> V1
#> <dbl>
#> 1 246
#> 2 222.
This is just how R displays data in a tibble.
The map_dfc function is not rounding your data; the truncated decimals are only an artifact of how tibbles print numbers.
If you want to print the data with the usual format, use as.data.frame, like this:
head(as.data.frame(my_char), n = 4)
V1
#>1 246.00
#>2 222.22
#>3 197.98
#>4 135.10
Showing that your data has not been rounded.
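If, in addition, you want the numbers printed with their trailing zeros (e.g. 246.00 rather than 246), one option is to format them explicitly for display; note that format() returns character strings, while the underlying numeric values stay unrounded:
format(my_char$V1, nsmall = 2)
# e.g. "246.00" "222.22" "197.98" ..., each value shown with two decimals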
Hope this helps.
I have 2 vectors. I am trying to create a tibble with all combinations of the 2 vectors, but I get the following error.
C <- c(1,2,3,4)
G <- c(1,2,3,4,5)
tibble('C' = rep(C, each = length(G)), 'G' = rep(G, length(C)))
Error: Column `C` must be length 1 or 100, not 20
The error disappears when I rename column 'C' to, for example, column 'A'.
We also don't get the same error with a data.frame.
I suspect length(C) is picking up the value of the C column from the tibble.
Is this an intended behaviour?
If so can someone explain how this is useful in practice? (i.e how would someone take advantage of this in their code)
Because tibbles are an extension to data.frame, and not an exact drop-in replacement, you can do things like:
tibble(a=1:3, b=a+1)
## A tibble: 3 x 2
# a b
# <int> <dbl>
#1 1 2
#2 2 3
#3 3 4
...where you can reference earlier created columns. And your example is an instance of when that might be a problem.
To quote the manual:
"Arguments are evaluated sequentially, so you can refer to previously
created variables."
Source: http://tibble.tidyverse.org/reference/tibble.html
So in this case, the C in rep(G, length(C)) is actually referencing the tibblename$C you just created, which is length 20, rather than the vector C in the global environment, which is length 4.
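If you want to keep the column name C, one simple way around the self-reference (just a sketch; the *_rep helper names are only for illustration) is to build the vectors before calling tibble(), so that length(C) can only see the vector in the global environment:
library(tibble)
C <- c(1, 2, 3, 4)
G <- c(1, 2, 3, 4, 5)
# evaluated before tibble() is called, so C and G here are the original vectors
C_rep <- rep(C, each = length(G))
G_rep <- rep(G, times = length(C))
tibble(C = C_rep, G = G_rep)   # 20 x 2, every combination of C and G
Base expand.grid(C = C, G = G) builds the same set of combinations and does not evaluate its arguments sequentially, so it sidesteps the issue as well.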
I am trying to group a column of my data.frame/data.table into three groups, all with equal sums.
The data is first ordered from smallest to largest, such that group one would be made up of a large number of rows with small values, and group three would have a small number of rows with large values. This is accomplished in spirit with:
test <- data.frame(x = as.numeric(1:100000))
store <- 0
total <- sum(test$x)
for (i in 1:100000) {
  store <- store + test$x[i]
  if (store < total/3) {
    test$y[i] <- 1
  } else {
    if (store < 2*total/3) {
      test$y[i] <- 2
    } else {
      test$y[i] <- 3
    }
  }
}
While successful, I feel like there must be a better way (and maybe a very obvious solution that I am missing).
1. I never like resorting to loops, especially with nested ifs, when a vectorized approach is available; even with 100,000+ records this code becomes quite slow.
2. This method would become impossibly complex to code for a larger number of groups (not necessarily the looping, but the ifs).
3. It requires pre-ordering of the column. I might not be able to get around this one.
4. As a nuance (not that it makes a difference), the data to be summed would not always (or ever) be consecutive integers.
Maybe with cumsum:
test$z <- cumsum(test$x) %/% (ceiling(sum(test$x) / 3)) + 1
This is more or less a bin-packing problem.
Use the binPack function from the BBmisc package:
library(BBmisc)
test$bins <- binPack(test$x, sum(test$x)/3+1)
The sums of the 3 bins are nearly identical:
tapply(test$x, test$bins, sum)
1 2 3
1666683334 1666683334 1666683332
I thought that the cumsum/modulo division approach was very elegant, but it does return a somewhat irregular allocation:
> tapply(test$x, test$z, sum)
1 2 3
1666636245 1666684180 1666729575
> sum(test)/3
[1] 1666683333
So I thought I would first create a random permutation and offer something similar:
test$x <- sample(test$x)
test$z2 <- cumsum(test$x)[ findInterval(cumsum(test$x),
c(0, 1666683333*(1:2), sum(test$x)+1))]
> tapply(test$x, test$z2, sum)
91099 116379 129539
1666676164 1666686837 1666686999
This also achieves a more even distribution of counts:
> table(test$z2)
91099 116379 129539
33245 33235 33520
> table(test$z)
1 2 3
57734 23915 18351
I must admit to puzzlement regarding the naming of the entries in z2.
Or you can just cut on the cumsum:
test$z <- cut(cumsum(test$x), breaks = 3, labels = 1:3)
or use ggplot2::cut_interval instead of cut:
test$z <- ggplot2::cut_interval(cumsum(test$x), n = 3, labels = 1:3)
You can use fold() from groupdata2 and get an almost equal number of elements per group:
# Create data frame
test <- data.frame(x = as.numeric(1:100000))
# Use fold() to create 3 numerically balanced groups
test <- groupdata2::fold(test, k = 3, num_col = "x")
# Look at the first 10 rows
head(test, 10)
## # A tibble: 10 x 2
## # Groups: .folds [3]
## x .folds
## <dbl> <fct>
## 1 1 1
## 2 2 3
## 3 3 2
## 4 4 1
## 5 5 2
## 6 6 2
## 7 7 1
## 8 8 3
## 9 9 2
## 10 10 3
# Check the sum and number of elements per group
test %>%
  dplyr::group_by(.folds) %>%
  dplyr::summarize(sum_ = sum(x),
                   n_members = dplyr::n())
## # A tibble: 3 x 3
## .folds sum_ n_members
## <fct> <dbl> <int>
## 1 1 1666690952 33333
## 2 2 1666716667 33334
## 3 3 1666642381 33333