Path dependent vector transformation in R

Path dependent vector transformation in R - r

I have a vector in a dataframe in R which is a time series that oscillates between 0 and 100.
I am wanting to create a new column/vector in R that has that will be series on 1s and 0s. It will be 1 when the time series drops below 10 and will continue to be 1 until it reaches 80. Thereafter it will go back to zero. So there is a path dependency in this problem I am wanting to solve.
Something like;
DataFrame %>% mutate(BinaryIndicator = ....)
I think the picture below will be the easiest way to show what I am wanting to get to. Any help would be sincerely appreciated.
Here is a link to an example of what I would like to create
Any help much appreciated.

Since the value of one row depends on the value of the previous row (after its value is updated from its previous row, etc), I think a rolling-window operation is appropriate. zoo does this well.
dat <- data.frame(x=rep(c(60, 50, 40, 35, 30, 25, 20, 15, 10.2, 9, 2, 3, 9, 40, 72, 81, 90), 2))
dat$binary <- cumsum(zoo::rollapply(dat$x, 2, function(a) {
if (length(a) < 2) return(0)
if (a[1] >= 10 && a[2] < 10) return(1)
if (a[1] < 80 && a[2] >= 80) return(-1)
return(0)
}, partial = TRUE, align = "right"))
dat
# x binary
# 1 60.0 0
# 2 50.0 0
# 3 40.0 0
# 4 35.0 0
# 5 30.0 0
# 6 25.0 0
# 7 20.0 0
# 8 15.0 0
# 9 10.2 0
# 10 9.0 1
# 11 2.0 1
# 12 3.0 1
# 13 9.0 1
# 14 40.0 1
# 15 72.0 1
# 16 81.0 0
# 17 90.0 0
# 18 60.0 0
# 19 50.0 0
# 20 40.0 0
# 21 35.0 0
# 22 30.0 0
# 23 25.0 0
# 24 20.0 0
# 25 15.0 0
# 26 10.2 0
# 27 9.0 1
# 28 2.0 1
# 29 3.0 1
# 30 9.0 1
# 31 40.0 1
# 32 72.0 1
# 33 81.0 0
# 34 90.0 0
(I wonder if the internal logic can be simplified some.)

Related

How to flag/count missing values in RStudio on a row by row basis?

Taking the airquality dataset from the MASS library as an example:
> head(airquality)
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
I want to create three columns:
Missing_Ozone, Missing_Total and Missing_Percent, such that:
Missing_Ozone = 1 if there is a missing value in the Ozone column in the current row.
Missing_total = total count of missing values in the current row
Missing_Percent = percentage of missing values in a row.
So for example, in row 1:
Missing_Ozone = 0, Missing_total = 0, Missing_percent = 0
In row 5:
Missing_Ozone = 1, Missing_total = 2, Missing percent = 100*(2/6)
In row 6:
Missing_Ozone = 0, Missing_total = 1, Missing percent = 100*(1/6)
I tried two approaches, without any luck:
The first was to iterate over each row and use an if statement:
library(MASS)
df_test = airquality
df_test$Missing_Ozone <- 0
for(i in 1:nrow(df_test)){
if (is.na(df_test$Ozone)) {
df_test$Missing_Ozone <- 1
}
}
The second was to just use the if-statement inside that for-loop.
Neither work, and I just get:
> df_test
Ozone Solar.R Wind Temp Month Day Missing_Ozone
1 41 190 7.4 67 5 1 0
2 36 118 8.0 72 5 2 0
3 12 149 12.6 74 5 3 0
4 18 313 11.5 62 5 4 0
5 NA NA 14.3 56 5 5 0
Any help is appreciated.
Edit: Also, does this type of data manipulation have a certain name? I found it hard to search online for a guide that goes through this type of data manipulation.

Tidyverse approach:
library(dplyr)
airquality <- datasets::airquality
cols <- ncol(airquality)
airquality <- airquality %>%
mutate(
Missing_Ozone = as.numeric(is.na(Ozone)),
Missing_Total = rowSums(is.na(.)),
Missing_Percent = Missing_Total/cols
)
> head(airquality)
Ozone Solar.R Wind Temp Month Day Missing_Ozone Missing_Total Missing_Percent
1 41 190 7.4 67 5 1 0 0 0.0000000
2 36 118 8.0 72 5 2 0 0 0.0000000
3 12 149 12.6 74 5 3 0 0 0.0000000
4 18 313 11.5 62 5 4 0 0 0.0000000
5 NA NA 14.3 56 5 5 1 2 0.3333333
6 28 NA 14.9 66 5 6 0 1 0.1666667
Base R approach:
cols <- ncol(airquality)
airquality$Missing_Ozone <- as.numeric(is.na(airquality$Ozone))
airquality$Missing_Total <- rowSums(is.na(airquality))
airquality$Missing_Percent <- airquality$Missing_Total/cols
> head(airquality)
> Ozone Solar.R Wind Temp Month Day Missing_Ozone Missing_Total Missing_Percent
1 41 190 7.4 67 5 1 0 0 0.0000000
2 36 118 8.0 72 5 2 0 0 0.0000000
3 12 149 12.6 74 5 3 0 0 0.0000000
4 18 313 11.5 62 5 4 0 0 0.0000000
5 NA NA 14.3 56 5 5 1 2 0.3333333
6 28 NA 14.9 66 5 6 0 1 0.1666667
edit: A note on performance
I would advise in general against usage of rowwise operations outside of very specific use cases. It will slow you down heavily as your data set scales. The execution time tends to grow linearly with your data, which is really, really bad. A little benchmark with a data set size of 6,426 rows instead of 153:
library(dplyr)
library(microbenchmark)
airquality <- datasets::airquality
# Rowwise
approachA <- function(data) {
result <- data %>%
mutate(Missing_Ozone = as.integer(is.na(Ozone))) %>%
rowwise() %>%
mutate(Missing_Total = sum(is.na((c_across(-Missing_Ozone))))) %>%
mutate(Missing_Percent = Missing_Total/ncol(data)) %>%
ungroup()
return(result)
}
# Tidy
approachB <- function(data) {
cols <- ncol(data)
result <- data %>%
mutate(
Missing_Ozone = as.numeric(is.na(Ozone)),
Missing_Total = rowSums(is.na(.)),
Missing_Percent = Missing_Total/cols
)
return(result)
}
# Base R
approachC <- function(data) {
cols <- ncol(data)
data$Missing_Ozone <- as.numeric(is.na(data$Ozone))
data$Missing_Total <- rowSums(is.na(data))
data$Missing_Percent <- data$Missing_Total/cols
return(data)
}
Result with data x 42: rowwise() has led to some orders of magnitude worse performance over both proposed approaches.
> test_data <- do.call("rbind", replicate(42, airquality, simplify = FALSE))
> set.seed(42)
> microbenchmark::microbenchmark(approachA(test_data), approachB(test_data), approachC(test_data))
Unit: microseconds
expr min lq mean median uq max neval cld
approachA(test_data) 243340.904 251838.3590 259083.8089 256546.9015 260567.8945 405326.615 100 b
approachB(test_data) 577.977 624.0610 723.8304 741.0955 770.3695 2382.756 100 a
approachC(test_data) 102.377 107.9735 139.5595 119.6175 129.4165 2074.231 100 a
Result with data x 420: Execution time of rowwise approach has grown 10x.
test_data <- do.call("rbind", replicate(420, airquality, simplify = FALSE))
> set.seed(42)
> microbenchmark::microbenchmark(approachA(test_data), approachB(test_data), approachC(test_data))
Unit: microseconds
expr min lq mean median uq max neval cld
approachA(test_data) 2519480.258 2620528.08 2671419.663 2672263.417 2707896.209 2907659.730 100 b
approachB(test_data) 1266.818 1488.71 1909.167 1576.327 1678.725 21011.147 100 a
approachC(test_data) 808.684 881.09 1220.151 1000.277 1067.907 8218.655 100 a

A solution using the dplyr package. rowwise and c_cross allow us to do calculation by each row.
library(dplyr)
dat <- airquality %>%
mutate(Missing_Ozone = as.integer(is.na(Ozone))) %>%
rowwise() %>%
mutate(Missing_Total = sum(is.na((c_across(-Missing_Ozone))))) %>%
mutate(Missing_Percent = Missing_Total/ncol(airquality)) %>%
ungroup()
dat
# # A tibble: 153 x 9
# Ozone Solar.R Wind Temp Month Day Missing_Ozone Missing_Total Missing_Percent
# <int> <int> <dbl> <int> <int> <int> <int> <int> <dbl>
# 1 41 190 7.4 67 5 1 0 0 0
# 2 36 118 8 72 5 2 0 0 0
# 3 12 149 12.6 74 5 3 0 0 0
# 4 18 313 11.5 62 5 4 0 0 0
# 5 NA NA 14.3 56 5 5 1 2 0.333
# 6 28 NA 14.9 66 5 6 0 1 0.167
# 7 23 299 8.6 65 5 7 0 0 0
# 8 19 99 13.8 59 5 8 0 0 0
# 9 8 19 20.1 61 5 9 0 0 0
# 10 NA 194 8.6 69 5 10 1 1 0.167
# # ... with 143 more rows

Create run-length ID while allowing for gaps of certain length in runs

(I initially posted a question here, but it didn't fully cover my issue)
I have a data frame with a 'date' column and a measure of precipitation (rainfall):
date precip
1 1 0.0
2 2 0.0
3 3 12.4
4 4 10.2
5 5 0.0
6 6 13.6
I want to create a column "event" with a counter (ID) for each consecutive period of rainfall. A rainfall event can be defined as consecutive runs with precipitation larger than e.g. 0.
If we don't allow any short gaps of zero rain, the 'event' would look like this, with a counter for non-0 periods, and NA for periods with no rain.
date precip event
1 1 0.0 NA
2 2 0.0 NA
3 3 12.4 1
4 4 10.2 1
5 5 0.0 NA
6 6 13.6 2
In addition, I want to able to allow for shorter periods with no rain, e.g. of size n = 1 day, within each run of non-0.
For example, in the data frame above, if we allow for 1 day with 0 rain within a contiguous period of rain, e.g. day 5, then day 3 to 6 can be defined as one rainfall event:
date precip event
1 1 0.0 NA
2 2 0.0 NA
3 3 12.4 1
4 4 10.2 1
5 5 0.0 1 # <- gap of 1 day with no rain: OK
6 6 13.6 1
A slightly larger toy data set:
structure(list(date = 1:31, precip = c(0, 0, 12.3999996185303,
10.1999998092651, 0, 13.6000003814697, 16.6000003814697, 21.5,
7.59999990463257, 0, 0, 0, 0.699999988079071, 0, 0, 0, 5.40000009536743,
0, 1, 35.4000015258789, 11.5, 16.7000007629395, 13.5, 13.1000003814697,
11.8000001907349, 1.70000004768372, 0, 15.1000003814697, 12.8999996185303,
3.70000004768372, 24.2999992370605)), row.names = c(NA, -31L), class = "data.frame")
Now I'm really stuck. I tried some strange things like the one below (just a start), but I think I will not figure it out by myself and would be super grateful for any help
# this is far from being any helpful, but just to show the direction I was heading...
# the threshold could be 0 to mirror the example above...
rainfall_event = function(df,
daily_thresh = .2,
n = 1) {
for (i in 1:nrow(df)) {
zero_index = 1
if (df[i,]$precip < daily_thresh) {
# every time you encounter a value below the threshold count the 0s
zero_counter = 0
while (df[i,]$precip < daily_thresh) {
zero_counter = zero_counter + 1
if (i != nrow(df)) {
i = i + 1
zero_index = zero_index + 1
} else{
break
}
}
if (zero_counter > n) {
df[zero_index:zero_index + zero_counter,][["event"]] = NA
}
} else{
event_counter = 1
while (df[i, ]$precip > daily_thresh) {
df[["event"]] = event_counter
if (i != nrow(rainfall_one_slide)) {
i = i + 1
} else{
break
}
}
}
}
}

An rle alternative:
# limit of n days with precip = 0 to be allowed in runs of non-zero
n = 1
# rle of precip == 0
r = rle(d$precip == 0)
# replace the values of precip = 0 & length > limit with NA
r$values[r$values & r$lengths > n] = NA
# reconstruct the vector from the updated runs
ir = inverse.rle(r)
# rle of "is NA"
r2 = rle(is.na(ir))
# replace length of NA runs with 0
r2$lengths[r2$values] = 0
# replace values of non-NA runs with a sequence
r2$values[!r2$values] = seq_along(r2$values[!r2$values])
# create event column
d[!is.na(ir), "event"] = inverse.rle(r2)
date precip event
1 1 0.0 NA
2 2 0.0 NA
3 3 12.4 1
4 4 10.2 1
5 5 0.0 1
6 6 13.6 1
7 7 16.6 1
8 8 21.5 1
9 9 7.6 1
10 10 0.0 NA
11 11 0.0 NA
12 12 0.0 NA
13 13 0.7 2
14 14 0.0 NA
15 15 0.0 NA
16 16 0.0 NA
17 17 5.4 3
18 18 0.0 3
19 19 1.0 3
20 20 35.4 3
21 21 11.5 3
22 22 16.7 3
23 23 13.5 3
24 24 13.1 3
25 25 11.8 3
26 26 1.7 3
27 27 0.0 3
28 28 15.1 3
29 29 12.9 3
30 30 3.7 3
31 31 24.3 3

Using data.table with rleid
library(data.table)
f1 <- function(dat, n) {
tmp <- as.data.table(dat)[,
grp := rleid(precip != 0)][precip != 0,
event := .GRP,
grp][, event_fill := nafill(nafill(event, 'locf'),
'nocb')]
tmp[, event := fifelse(.N <= n & precip == 0,
fcoalesce(event, event_fill), event), grp][,
c("grp", "event_fill") := NULL][]
}
-testing
f1(df1, 0)
date precip event
1: 1 0.0 NA
2: 2 0.0 NA
3: 3 12.4 1
4: 4 10.2 1
5: 5 0.0 NA
6: 6 13.6 2
7: 7 16.6 2
8: 8 21.5 2
9: 9 7.6 2
10: 10 0.0 NA
11: 11 0.0 NA
12: 12 0.0 NA
13: 13 0.7 3
14: 14 0.0 NA
15: 15 0.0 NA
16: 16 0.0 NA
17: 17 5.4 4
18: 18 0.0 NA
19: 19 1.0 5
20: 20 35.4 5
21: 21 11.5 5
22: 22 16.7 5
23: 23 13.5 5
24: 24 13.1 5
25: 25 11.8 5
26: 26 1.7 5
27: 27 0.0 NA
28: 28 15.1 6
29: 29 12.9 6
30: 30 3.7 6
31: 31 24.3 6
with n = 1
f1(df1, 1)
date precip event
1: 1 0.0 NA
2: 2 0.0 NA
3: 3 12.4 1
4: 4 10.2 1
5: 5 0.0 1
6: 6 13.6 2
7: 7 16.6 2
8: 8 21.5 2
9: 9 7.6 2
10: 10 0.0 NA
11: 11 0.0 NA
12: 12 0.0 NA
13: 13 0.7 3
14: 14 0.0 NA
15: 15 0.0 NA
16: 16 0.0 NA
17: 17 5.4 4
18: 18 0.0 4
19: 19 1.0 5
20: 20 35.4 5
21: 21 11.5 5
22: 22 16.7 5
23: 23 13.5 5
24: 24 13.1 5
25: 25 11.8 5
26: 26 1.7 5
27: 27 0.0 5
28: 28 15.1 6
29: 29 12.9 6
30: 30 3.7 6
31: 31 24.3 6

So, it will probably never be of interested to someone, but I think I kind of have a solution as well:)
f2 = function(d,
n = 1,
daily_thresh = .2) {
# start int the first row
i = 1
# start with rainfall event 1
event_counter = 0
# set the value initially to 0
d[["event"]] = 0
# while still in the dataframe
while (i <= nrow(d)) {
# get the current precip value
precip = d[i,]$precip
# if its below the threshold --> DRY period starts
if (precip < daily_thresh) {
# count unknown number of following dry days of this dry episode
dry_days = 0
### DRY LOOP
# start from the day with rainfall under the threshold
for (j in i:nrow(d)) {
# count the consecutive dry days
if (d[j,]$precip < daily_thresh) {
dry_days = dry_days + 1
} else{
# hit a rainy day --> Get out the dry loop, just decide to which event it belongs
# if the preceeding dry days are smaller than n --> same as last event
if (dry_days <= n) {
# set all the days without rainfall but within n to rainfall
# if its the first event put it to 1
if(event_counter == 0) event_counter = 1
d[(j-1):(j-dry_days),][["event"]] = event_counter
# set the rainy day to the same event
d[j,][["event"]] = event_counter
break # get back to wet peiod
} else{
# if the gap was too big --> its a new event
# set all the days without rainfall and within n to no rainfall
d[(j-1):(j-dry_days),][["event"]] = NA
# set the rainy day to a new rainfall event
event_counter = event_counter + 1
d[j,][["event"]] = event_counter
break # get back to wet period
}
}
}
# set i to where we stopped in the dry loop
i = j + 1
} else{
# if we initially hit a rainy day, just count on
d[i,][["event"]] = event_counter
i = i + 1
}
}
return(d)
}

Check for perfect square and replace with another value in R

I have attempted multiple different ways to check for perfect squares in an R object then replace with 0's. Below are the multiple single lines of codes I have tried; code must be a single line:
> y
[1] 9 72 49 70 16 3 3 4 81 6 43 7 12 9 3
is.integer(sqrt(y))
[1] FALSE
> ifelse(is.integer(sqrt(y)), 0, y)
[1] 9
> ifelse(sqrt(y)==is.integer(y), 0, y)
[1] 9 72 49 70 16 3 3 4 81 6 43 7 12 9 3

You can divide the number by 1 and get the remainder using %% and compare the value with 0.
sqrt(y)
#[1] 3.00 8.49 7.00 8.37 4.00 1.73 1.73 2.00 9.00 2.45 6.56 2.65 3.46 3.00 1.73
sqrt(y) %% 1 == 0
#[1] TRUE FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE
Now turn these values to 0 by :
y[sqrt(y) %% 1 == 0] <- 0
#[1] 0 72 0 70 0 3 3 0 0 6 43 7 12 0 3
Or another way :
y * +(sqrt(y) %% 1 != 0)
#[1] 0 72 0 70 0 3 3 0 0 6 43 7 12 0 3

We could create a condition with round or ceiling or as.integer which convert to integer and only those that are exactly matching will return TRUE because of the precision involved
y[sqrt(y) == round(sqrt(y))] <- 0
y[sqrt(y) == as.integer(sqrt(y))] <- 0
data
y <- c(9, 72, 49, 70, 16, 3, 3, 4, 81, 6, 43, 7, 12, 9, 3)

Average neighbours inside a vector

My data :
data <- c(1,5,11,15,24,31,32,65)
There are 2 neighbours: 31 and 32. I wish to remove them and keep only the mean value (e.g. 31.5), in such a way data would be :
data <- c(1,5,11,15,24,31.5,65)
It seems simple, but I wish to do it automatically, and sometimes with vectors containing more neighbours. For instance :
data_2 <- c(1,5,11,15,24,31,32,65,99,100,101,140)

Here is another idea that creates an id via cumsum(c(TRUE, diff(a) > 1)), where 1 shows the gap threshold, i.e.
#our group variable
grp <- cumsum(c(TRUE, diff(a) > 1))
#keep only groups with length 1 (i.e. with no neighbor)
i1 <- a[!!!ave(a, grp, FUN = function(i) length(i) > 1)]
#Find the mean of the groups with more than 1 rows,
i2 <- unname(tapply(a, grp, function(i)mean(i[length(i) > 1])))
#Concatenate the above 2 (eliminating NAs from i2) to get final result
c(i1, i2[!is.na(i2)])
#[1] 1.0 5.0 11.0 15.0 24.0 65.0 31.5
You can also wrap it in a function. I left the gap as a parameter so you can adjust,
get_vec <- function(x, gap) {
grp <- cumsum(c(TRUE, diff(x) > gap))
i1 <- x[!!!ave(x, grp, FUN = function(i) length(i) > 1)]
i2 <- unname(tapply(x, grp, function(i) mean(i[length(i) > 1])))
return(c(i1, i2[!is.na(i2)]))
}
get_vec(a, 1)
#[1] 1.0 5.0 11.0 15.0 24.0 65.0 31.5
get_vec(a_2, 1)
#[1] 1.0 5.0 11.0 15.0 24.0 65.0 140.0 31.5 100.0
DATA:
a <- c(1,5,11,15,24,31,32,65)
a_2 <- c(1, 5, 11, 15, 24, 31, 32, 65, 99, 100, 101, 140)

Here is my solution, which uses run-length encoding to identify groups:
foo <- function(x) {
y <- x - seq_along(x) #normalize to zero differences in groups
ind <- rle(y) #run-length encoding
ind$values <- ind$lengths != 1 #to find groups
ind$values[ind$values] <- cumsum(ind$values[ind$values]) #group ids
ind <- inverse.rle(ind)
xnew <- x
xnew[ind != 0] <- ave(x, ind, FUN = mean)[ind != 0] #calculate means
xnew[!(duplicated(ind) & ind != 0)] #remove duplicates from groups
}
foo(data)
#[1] 1.0 5.0 11.0 15.0 24.0 31.5 65.0
foo(data_2)
#[1] 1.0 5.0 11.0 15.0 24.0 31.5 65.0 100.0 140.0
data_3 <- c(1, 2, 4, 1, 2)
foo(data_3)
#[1] 1.5 4.0 1.5
I assume that you don't need an extremely efficient solution. If you do, I'd recommend a simple C++ for loop in Rcpp.

I have a data.table based solution, same could be translated into dplyr I guess:
library(data.table)
df <- data.table(data2 = c(1,5,11,15,24,31,32,65,99,100,101,140))
df[,neighbours := ifelse(c(0,diff(data_2)) == 1,1,0)]
df[,neighbours := c(neighbours[1:(.N-1)],1),by = rleid(neighbours)]
df[,neigh_seq := rleid(neighbours)]
unique(df[,ifelse(neighbours == 1,mean(data2),data2),by = neigh_seq])
neigh_seq V1
1: 1 1.0
2: 1 5.0
3: 1 11.0
4: 1 15.0
5: 1 24.0
6: 2 31.5
7: 3 65.0
8: 4 100.0
9: 5 140.0
What it does :
first line set neigbours to 1 if the difference with following number is 1
1: 1 0
2: 5 0
3: 11 0
4: 15 0
5: 24 0
6: 31 0
7: 32 1
8: 65 0
9: 99 0
10: 100 1
11: 101 1
12: 140 0
I wanr to group so that neighbour variable is 1 for all neigbours. I need to add 1 to each end of each groups:
df[,neighbours := c(neighbours[1:(.N-1)],1),by = rleid(neighbours)]
data2 neighbours
1: 1 0
2: 5 0
3: 11 0
4: 15 0
5: 24 0
6: 31 1
7: 32 1
8: 65 0
9: 99 1
10: 100 1
11: 101 1
12: 140 0
Then after I just do a grouping on changing neighbour value, and set the value to mean if they are neihbours
df[,ifelse(neighbours == 1,mean(data2),data2),by = rleid(neighbours)]
rleid V1
1: 1 1.0
2: 1 5.0
3: 1 11.0
4: 1 15.0
5: 1 24.0
6: 2 31.5
7: 2 31.5
8: 3 65.0
9: 4 100.0
10: 4 100.0
11: 4 100.0
12: 5 140.0
and take the unique values. And voila.

This is a dplyr version, also using as a grouping variable cumsum(c(1,diff(x)!=1)):
library(dplyr)
data_2 %>% data.frame(x = .) %>%
group_by(id = cumsum(c(1,diff(x)!=1))) %>%
summarise(res = mean(x)) %>%
select(res)
# A tibble: 9 x 1
res
<dbl>
1 1.0
2 5.0
3 11.0
4 15.0
5 24.0
6 31.5
7 65.0
8 100.0
9 140.0

Create a random data from a subset in R

I have a data set with 10 rows and 5 columns. For example:
A <- c(15.0, 10.0, 5.50, 20, 22, 25, 30,
40, 50, 10.0)
B <- c(1, 30, 30, 6, 7, 10, 2, 25,
3, 27)
C <- c(1, 0, 0, 5, 15, 10, 20, 25,
30, 40)
D <- c(50, 100, 100, 500, 150, 100, 200, 250,
0, 0)
Date <- c("1997-05-01","1997-05-02","1997-05-03","1997-05-04","1997-05-05",
"1997-05-06","1997-05-07","1997-05-08","1997-05-09","1997-05-10")
data <- data.frame(A, B, C, D, Date)
Thus, I have a data table in R:
A B C D date
---- ---- ---- ---- ----
15.0 1 1 50 1997-05-01
10.0 20 0 100 1997-05-02
etc...
The range is based on quantile. For A I wanted < or = to quantile 25 (e.g. 11.375), and B to the > or = to quantile 75 (e.g. 23.750)
quantile(data$A, c(.25, .50, .75))
quantile(data$B, c(.25, .50, .75))
One way is to filter your data frame on those two conditions:
data[data$A <= quantile(data$A, 0.25) &
data$B >= quantile(data$B, 0.75), ]
So, I would like to create a random data (with the same amount of previous values, in this case 10 rows) from this subset of 3 rowa, for example:
The new data would be:
A B C D date
---- ---- ---- ---- ----
10.0 30 0 100 1997-05-02
5.5 30 0 100 1997-05-03
10.0 27 40 0 1997-05-10
5.5 30 0 100 1997-05-03
10.0 27 40 0 1997-05-10
10.0 30 0 100 1997-05-02
10.0 27 40 0 1997-05-10
5.5 30 0 100 1997-05-03
10.0 27 40 0 1997-05-10
10.0 30 0 100 1997-05-02
how to do that best?
Thank you!

Perhaps you would like something like this?
d_filtered <- data[data$A <= quantile(data$A, 0.25) &
data$B >= quantile(data$B, 0.75), ]
d_new <- d_filtered[sample(1:nrow(d_filtered), nrow(data), replace = TRUE), ]
A B C D Date
2 10.0 30 0 100 1997-05-02
3 5.5 30 0 100 1997-05-03
3.1 5.5 30 0 100 1997-05-03
3.2 5.5 30 0 100 1997-05-03
10 10.0 27 40 0 1997-05-10
3.3 5.5 30 0 100 1997-05-03
2.1 10.0 30 0 100 1997-05-02
2.2 10.0 30 0 100 1997-05-02
10.1 10.0 27 40 0 1997-05-10
2.3 10.0 30 0 100 1997-05-02

One mathematically oriented way to do it,
d3 <- data[data$A <= quantile(data$A, 0.25) &
data$B >= quantile(data$B, 0.75), ]
final_df <- rbind(d3[rep(seq_len(nrow(d3)), floor(nrow(data)/nrow(d3))),],
d3[(1: (nrow(data) - floor(nrow(data)/nrow(d3))*nrow(d3))),])
rownames(final_df) <- NULL
final_df
# A B C D Date
#1 10.0 30 0 100 1997-05-02
#2 5.5 30 0 100 1997-05-03
#3 10.0 27 40 0 1997-05-10
#4 10.0 30 0 100 1997-05-02
#5 5.5 30 0 100 1997-05-03
#6 10.0 27 40 0 1997-05-10
#7 10.0 30 0 100 1997-05-02
#8 5.5 30 0 100 1997-05-03
#9 10.0 27 40 0 1997-05-10
#10 10.0 30 0 100 1997-05-02

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Path dependent vector transformation in R - r

Related

How to flag/count missing values in RStudio on a row by row basis?

Create run-length ID while allowing for gaps of certain length in runs

Check for perfect square and replace with another value in R

Average neighbours inside a vector

Create a random data from a subset in R

Categories

Resources