I am looking for a way to fill in values for 'checks' that have no start and end times. At first I thought of using bilinear interpolation for this task, but that seemed overly complicated; I really just need something much simpler along the same lines.
My data looks something like this:
df <- data.frame("ID" = c("A","A","A","A","A","B","B","B","B","B"),
                 "Check" = c(1:5),
                 "Start_time" = c("start_a1","start_a2","start_a3","start_a4","start_a5","startb1","startb2","startb3",NA,"startb5"),
                 "end_time" = c("end_a1","end_a2","end_a3","end_a4","end_a5","end_b1","end_b2",NA,NA,"endb5")
)
So what I am ideally looking for is this: any check that has a missing start time and end time should pick up the data from the next check's start time, not the previous one.
I am trying the following code block but it's giving me an issue:
df$end_time[df$Check == 3 & is.na(df$end_time)] <- df$start_time[df$Check == 5]
# this gives a length issue
Any advice would be helpful here. My dataset contains approx. 5k rows, and each ID has a number of checks with start and end times.
The tidyr package has a function fill() which does exactly this.
library(dplyr)
library(tidyr)
df %>%
  group_by(ID) %>%
  fill(c(Start_time, end_time), .direction = 'up')
# A tibble: 10 × 4
# Groups: ID [2]
ID Check Start_time end_time
<chr> <int> <chr> <chr>
1 A 1 start_a1 end_a1
2 A 2 start_a2 end_a2
3 A 3 start_a3 end_a3
4 A 4 start_a4 end_a4
5 A 5 start_a5 end_a5
6 B 1 startb1 end_b1
7 B 2 startb2 end_b2
8 B 3 startb3 endb5
9 B 4 startb5 endb5
10 B 5 startb5 endb5
The .direction = "up" parameter means it takes the next non-missing value to fill in blanks. To use the previous value you would use .direction = "down". And .direction = "updown" uses the next value unless there are no more non-missing values below in that group, in which case it takes the previous non-missing value (useful when the missing value is in the last row of the group).
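For example, here is a small made-up group where the missing values sit in the last rows, so "up" alone has nothing to pull from and "updown" falls back to the previous value:
library(dplyr)
library(tidyr)
d <- data.frame(ID = c("A", "A", "A"),
                val = c("v1", NA, NA))
d %>% group_by(ID) %>% fill(val, .direction = "up")
# rows 2 and 3 stay NA: there is no later non-missing value to pull from
d %>% group_by(ID) %>% fill(val, .direction = "updown")
# rows 2 and 3 become "v1": the upward pass finds nothing, the downward pass fills them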
df <- cbind(c(1,1,1,1,1,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,5,5,5,5),
            c(6,12,18,24,30,3,9,21,6,12,18,24,30,36,6,12,18,24,30,36,12,24,36,48),
            c(0.4,1.5,2.7,1.6,0.4,1.3,3.1,3.6,0.5,2.6,3.7,1.8,0.9,0.3,0.7,1.6,1.3,2.8,1.9,1.8,2.0,1.0,3.0,0.8))
colnames(df) <- c("ID","time","value")
I have a dataset as given by the code above. For each ID, starting from the lowest/earliest time, I would like to know whether the value bounced up by at least 2 compared to the lowest value before the rise, and then went back down to a value below or equal to that lowest pre-bounce value. I would like to flag the time of the rise of at least 2 as the time of the bounce.
So for example, in the above dataset, for ID 1, the lowest value was 0.4 at time 6 before it started rising. At time 18 it met the pre-defined threshold of 2 and then at time 30, it went down to a value equal to the pre-bounce lowest value. So I would like to flag ID 1 has having bounce and time 18 as the time for bounce.
On the other hand, for ID 2, although it rose by at least 2 (1.3-->3.6), it never went back down to a value below or equal to 1.3.
For ID 3, the criteria for a bounce are again met (0.5-->2.6-->3.7-->1.8-->0.9-->0.3). So I would like to flag ID 3 as having a bounce and time 18 as the time of the bounce.
For ID 4, although there was a rise of at least 2, i.e. from 0.7-->1.6-->1.3-->2.8 (at time 24), it never later went back down below 0.7, the lowest value before the rise. So it cannot be flagged as having a bounce.
For ID 5, the values were 2-->1-->3-->0.8, so there was a rise of at least 2 (1-->3) and then a fall to a value below the lowest pre-bounce value (0.8 < 1.0). So this ID should be flagged as having a bounce, and the time of the bounce should be time 36.
Please help me with this dynamic calculation and also explain the codes if possible so that I can understand the concept. Thank you in advance.
Consider this:
func <- function(tm, val, threshold = 2) {
  # all pairwise differences val[row] - val[col]; keep only the lower triangle,
  # i.e. "later value minus earlier value"
  mtx <- outer(val, val, `-`)
  mtx[upper.tri(mtx)] <- NA
  # no rise of at least `threshold` anywhere: return an NA of the same class as `tm`
  if (all(mtx < threshold, na.rm = TRUE)) return(tm[NA][1])
  ij <- which.max(mtx) # counts through the matrix, along columns
  i <- (ij-1) %/% length(val) + 1  # column: index of the pre-bounce low
  j <- (ij-1) %% length(val) + 1   # row: index of the peak of the rise
  # flag only if some later value drops back to (or below) the pre-bounce low
  if (i < length(val) && any(val[-seq_len(i)] <= val[i])) {
    return(tm[j])
  } else {
    return(tm[NA][i])
  }
}
df <- data.frame(
ID = c(1,1,1,1,1,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,5,5,5,5),
time = c(6,12,18,24,30,3,9,21,6,12,18,24,30,36,6,12,18,24,30,36,12,24,36,48),
value = c(0.4,1.5,2.7,1.6,0.4,1.3,3.1,3.6,0.5,2.6,3.7,1.8,0.9,0.3,0.7,1.6,1.3,2.8,1.9,1.8,2.0,1.0,3.0,0.8)
)
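As a quick sanity check on a single ID before grouping (using ID 1 from the question, where the expected bounce time is 18):
func(df$time[df$ID == 1], df$value[df$ID == 1])
# [1] 18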
I use which.max and the %/% and %% operators because in general I don't like doing which(mtx == max(mtx, na.rm = TRUE), arr.ind = TRUE); while the latter works, it relies on equality tests of floating point numbers, which can be problematic with extreme values. See Why are these numbers not equal?, Is floating point math broken?, and https://en.wikipedia.org/wiki/IEEE_754. If you don't like this safe-guarding, feel free to adapt the function to use which(.) instead.
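To illustrate the index arithmetic on a small standalone example (the 3x3 matrix here is invented purely for demonstration; arrayInd() is the base R helper that does the same conversion):
m <- matrix(c(5, 1, 9, 2, 7, 3, 8, 4, 6), nrow = 3)
ij <- which.max(m)        # 3: which.max counts down the columns
(ij - 1) %%  nrow(m) + 1  # 3: the row of the maximum
(ij - 1) %/% nrow(m) + 1  # 1: the column of the maximum
arrayInd(ij, dim(m))      # same answer, without any floating-point equality test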
The reason I go through the trouble of tm[NA][1] is so that the return value is of the exact class as your input time variable. For instance, dplyr in many situations can warn or err if the value you're changing in a vector is not the same class. This warning or error is good, as R's native (and silent) coercion of values can be problematic. For instance, Sys.time() is class POSIXt but NA is not. But Sys.time()[NA] is class POSIXt. Similarly, integer and numeric both have different types of NA. Perhaps this is being a bit over-defensive, but the use of tm[NA][1] ensures that the output is always the same class as the input time.
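To see what that buys you:
class(NA)               # "logical"
class(Sys.time()[NA])   # "POSIXct" "POSIXt"
class(1L[NA])           # "integer"
class(1.5[NA])          # "numeric"
Indexing with NA keeps the vector's class, whereas a bare NA is plain logical.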
dplyr
library(dplyr)
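The grouped call that produces the summary below would be along these lines (a sketch, assuming func() and df as defined above):
df %>%
  group_by(ID) %>%
  summarize(time = func(time, value))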
# # A tibble: 5 x 2
# ID time
# * <dbl> <dbl>
# 1 1 18
# 2 2 NA
# 3 3 18
# 4 4 NA
# 5 5 36
data.table
library(data.table)
DF <- as.data.table(df)
DF[, .(time = func(time, value)), by = .(ID)]
# ID time
# <num> <num>
# 1: 1 18
# 2: 2 NA
# 3: 3 18
# 4: 4 NA
# 5: 5 36
I am reading well-structured textual data in R and, in the process of converting from character to numeric, the numbers lose their decimal places.
I have tried using round(digits = 2) but it didn't work since I first had to apply as.numeric. At one point, I did set up options(digits = 2) before the conversion but it didn't work either.
Ultimately, I want to get a data.frame whose numbers look exactly the same as the ones seen as characters.
I looked for help here and did find answers like this, this, and this; however, none of them really helped me solve this issue.
How can I prevent number rounding when converting from character to numeric?
Here's a reproducible piece of code I wrote.
library(purrr)
my_char = c(" 246.00 222.22 197.98 135.10 101.50 86.45
72.17 62.11 64.94 76.62 109.33 177.80")
# Break characters between spaces
my_char = strsplit(my_char, "\\s+")
head(my_char, n = 2)
#> [[1]]
#> [1] "" "246.00" "222.22" "197.98" "135.10" "101.50" "86.45"
#> [8] "72.17" "62.11" "64.94" "76.62" "109.33" "177.80"
# Convert from characters to numeric.
my_char = map_dfc(my_char, as.numeric)
head(my_char, n = 2)
#> # A tibble: 2 x 1
#> V1
#> <dbl>
#> 1 NA
#> 2 246
# Delete first value because it's empty
my_char = my_char[-1,1]
head(my_char, n = 2)
#> # A tibble: 2 x 1
#> V1
#> <dbl>
#> 1 246
#> 2 222.
This is just how R displays data in a tibble.
The map_dfc function is not rounding your data; the truncated decimals are only an artifact of how tibbles print numbers.
If you want to print the data with the usual format, use as.data.frame, like this:
head(as.data.frame(my_char), n = 4)
V1
#>1 246.00
#>2 222.22
#>3 197.98
#>4 135.10
Showing that your data has not been rounded.
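If, in addition, you want the numbers printed with their trailing zeros (e.g. 246.00 rather than 246), one option is to format them explicitly for display; note that format() returns character strings, while the underlying numeric values stay unrounded:
format(my_char$V1, nsmall = 2)
# e.g. "246.00" "222.22" "197.98" ..., each value shown with two decimals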
Hope this helps.
I have 2 vectors. I am trying to create a tibble with all combinations of the 2 vectors, but I get the following error.
C <- c(1,2,3,4)
G <- c(1,2,3,4,5)
tibble('C' = rep(C, each = length(G)), 'G' = rep(G, length(C)))
Error: Column `C` must be length 1 or 100, not 20
The error disappears when I rename column 'C' to, for example, column 'A'.
We also don't get the same error with a data.frame.
I suspect length(C) is picking up the value of the C column from the tibble.
Is this an intended behaviour?
If so can someone explain how this is useful in practice? (i.e how would someone take advantage of this in their code)
Because tibbles are an extension to data.frame, and not an exact drop-in replacement, you can do things like:
tibble(a=1:3, b=a+1)
## A tibble: 3 x 2
# a b
# <int> <dbl>
#1 1 2
#2 2 3
#3 3 4
...where you can reference earlier created columns. And your example is an instance of when that might be a problem.
To quote the manual:
"Arguments are evaluated sequentially, so you can refer to previously
created variables."
Source: http://tibble.tidyverse.org/reference/tibble.html
So in this case, the C in rep(G, length(C)) is actually referencing the tibblename$C you just created, which is length 20, rather than the vector C in the global environment, which is length 4.
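If you want to keep the column name C, one simple way around the self-reference (just a sketch; the *_rep helper names are only for illustration) is to build the vectors before calling tibble(), so that length(C) can only see the vector in the global environment:
library(tibble)
C <- c(1, 2, 3, 4)
G <- c(1, 2, 3, 4, 5)
# evaluated before tibble() is called, so C and G here are the original vectors
C_rep <- rep(C, each = length(G))
G_rep <- rep(G, times = length(C))
tibble(C = C_rep, G = G_rep)   # 20 x 2, every combination of C and G
Base expand.grid(C = C, G = G) builds the same set of combinations and does not evaluate its arguments sequentially, so it sidesteps the issue as well.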
I am trying to group a column of my data.frame/data.table into three groups, all with equal sums.
The data is first ordered from smallest to largest, such that group one would be made up of a large number of rows with small values, and group three would have a small number of rows with large values. This is accomplished in spirit with:
test <- data.frame(x = as.numeric(1:100000))
store <- 0
total <- sum(test$x)
for (i in 1:100000) {
  store <- store + test$x[i]
  if (store < total/3) {
    test$y[i] <- 1
  } else {
    if (store < 2*total/3) {
      test$y[i] <- 2
    } else {
      test$y[i] <- 3
    }
  }
}
While successful, I feel like there must be a better way (and maybe a very obvious solution that I am missing).
1. I never like resorting to loops, especially with nested ifs, when a vectorized approach is available; even with 100,000+ records this code becomes quite slow.
2. This method would become impossibly complex to code for a larger number of groups (not necessarily the looping, but the ifs).
3. It requires pre-ordering of the column. I might not be able to get around this one.
4. As a nuance (not that it makes a difference), the data to be summed would not always (or ever) be consecutive integers.
Maybe with cumsum:
test$z <- cumsum(test$x) %/% (ceiling(sum(test$x) / 3)) + 1
This is more or less a bin-packing problem.
Use the binPack function from the BBmisc package:
library(BBmisc)
test$bins <- binPack(test$x, sum(test$x)/3+1)
The sums of the 3 bins are nearly identical:
tapply(test$x, test$bins, sum)
1 2 3
1666683334 1666683334 1666683332
I thought that the cumsum/modulo division approach was very elegant, but it does return a somewhat irregular allocation:
> tapply(test$x, test$z, sum)
1 2 3
1666636245 1666684180 1666729575
> sum(test)/3
[1] 1666683333
So I thought I would first create a random permutation and offer something similar:
test$x <- sample(test$x)
test$z2 <- cumsum(test$x)[ findInterval(cumsum(test$x),
c(0, 1666683333*(1:2), sum(test$x)+1))]
> tapply(test$x, test$z2, sum)
91099 116379 129539
1666676164 1666686837 1666686999
This also achieves a more even distribution of counts:
> table(test$z2)
91099 116379 129539
33245 33235 33520
> table(test$z)
1 2 3
57734 23915 18351
I must admit to puzzlement regarding the naming of the entries in z2.
Or you can just cut on the cumsum:
test$z <- cut(cumsum(test$x), breaks = 3, labels = 1:3)
or use ggplot2::cut_interval instead of cut:
test$z <- ggplot2::cut_interval(cumsum(test$x), n = 3, labels = 1:3)
You can use fold() from groupdata2 and get an almost equal number of elements per group:
# Create data frame
test <- data.frame(x = as.numeric(1:100000))
# Use fold() to create 3 numerically balanced groups
test <- groupdata2::fold(test, k = 3, num_col = "x")
# Look at the first 10 rows
head(test, 10)
## # A tibble: 10 x 2
## # Groups: .folds [3]
## x .folds
## <dbl> <fct>
## 1 1 1
## 2 2 3
## 3 3 2
## 4 4 1
## 5 5 2
## 6 6 2
## 7 7 1
## 8 8 3
## 9 9 2
## 10 10 3
# Check the sum and number of elements per group
test %>%
  dplyr::group_by(.folds) %>%
  dplyr::summarize(sum_ = sum(x),
                   n_members = dplyr::n())
## # A tibble: 3 x 3
## .folds sum_ n_members
## <fct> <dbl> <int>
## 1 1 1666690952 33333
## 2 2 1666716667 33334
## 3 3 1666642381 33333