I have a slightly complicated problem. I have two data frames, BD and BD2. For every row in BD I want to search all of BD2 and obtain some summary statistics: Sum, Mean, Sd, etc.
I got Sum to work, but for Mean, Sd and Median I want to collect the matching values in a vector first and then apply these functions.
To be clearer, here is the code. It works for the sum. Now I want to save the values that meet the conditions in a vector and then calculate Mean, Sd and Median. I tried to create the vector outside of BD, but nothing works; either it is something tricky or I just can't figure it out.
for (i in 1:dim(BD)[1])
{
for (j in 1:dim(BD2)[1])
{
if((BD$Start.Date[i]<BD2$X_TIMESTAMP[j]) && (BD2$X_TIMESTAMP[j]<BD$End.Date[i]) && (BD$Linea[i]==BD2$Linea[j]))
{
vl = BD2$X_VALUE[j]
BD$vec[i] = paste(BD$vec[i],vl,sep="")
#vect = as.numeric(BD$vec[i])
BD$Sum[i] = BD$Sum[i]+ BD2$X_VALUE[j]
#BD$Average[i] = mean(vect)
}
}
}
About the data: I attached a screenshot. This is BD.
And this is BD2:
I also tried to create vec before the for loop, but it still doesn't work.
The error is always the same: Error in $<-.data.frame(*tmp*, "vec", value = list(NULL, NULL, NULL, : replacement has 47 rows, data has 530
You may be able to use a combination of dplyr and purrr to get your summaries. You could split() your data.frame BD into a list of rows. Then you can use purrr::map() to work with each element of the list (i.e. each row in your data) to find matches in BD2 and calculate some statistics. After you have your stats, you can recombine everything into a single data.frame with bind_rows():
library(dplyr)
library(purrr)
# create a small data example
BD <- data.frame(Start.Date = 1:5,
End.Date = 2:6,
Linea = c(1, 1, 3, 1, 1))
BD2 <- data.frame(X_TIMESTAMP = c(1.5, 1.5, 2.5, 3.5, 4.5),
X_VALUE = c(31, 4, 60, 20, -11),
Linea = c(1,1,1,1,1))
BD %>%
split(seq(nrow(.))) %>%
map(function(BD_row_i) {
# find the timestamps that overlap in BD2
BD2_matches <- BD2 %>%
filter((BD_row_i$Start.Date < X_TIMESTAMP) &
(BD_row_i$End.Date > X_TIMESTAMP) &
(BD_row_i$Linea == Linea))
if (nrow(BD2_matches) > 0) {
# calculate some stats based on these matches
BD2_matches <- BD2_matches %>%
summarize(Sum = sum(X_VALUE),
Average = mean(X_VALUE),
Median = median(X_VALUE),
Sd = sd(X_VALUE))
} else {
# if no matches, return NAs (blank)
BD2_matches <- BD2_matches %>%
summarize(Sum = NA,
Average = NA,
Median = NA,
Sd = NA)
}
# add these statistics back to your BD_row_i data.frame
BD_row_i <- cbind.data.frame(BD_row_i, BD2_matches)
# return the updated data frame
return(BD_row_i)
}) %>%
bind_rows()
Start.Date End.Date Linea Sum Average Median Sd
1 1 2 1 35 17.5 17.5 19.09188
2 2 3 1 60 60.0 60.0 NA
3 3 4 3 NA NA NA NA
4 4 5 1 -11 -11.0 -11.0 NA
5 5 6 1 NA NA NA NA
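If the goal is to keep the matched values themselves, as the question asks, one option is to collect them in a list (one numeric vector per row of BD) and compute the statistics from that list. The following is only a minimal base R sketch on the same toy data; `matches` and `stat_or_na` are names made up for this illustration.

```r
# Same toy data as above
BD <- data.frame(Start.Date = 1:5,
                 End.Date = 2:6,
                 Linea = c(1, 1, 3, 1, 1))
BD2 <- data.frame(X_TIMESTAMP = c(1.5, 1.5, 2.5, 3.5, 4.5),
                  X_VALUE = c(31, 4, 60, 20, -11),
                  Linea = c(1, 1, 1, 1, 1))

# One numeric vector of matching X_VALUEs per row of BD (hypothetical name)
matches <- lapply(seq_len(nrow(BD)), function(i) {
  BD2$X_VALUE[BD$Start.Date[i] < BD2$X_TIMESTAMP &
                BD2$X_TIMESTAMP < BD$End.Date[i] &
                BD$Linea[i] == BD2$Linea]
})

# Hypothetical helper: apply f to v, or return NA when nothing matched
stat_or_na <- function(v, f) if (length(v) > 0) f(v) else NA_real_

BD$Sum     <- sapply(matches, stat_or_na, f = sum)
BD$Average <- sapply(matches, stat_or_na, f = mean)
BD$Median  <- sapply(matches, stat_or_na, f = median)
BD$Sd      <- sapply(matches, stat_or_na, f = sd)
BD
```

Keeping the vectors in a separate list avoids assigning values of varying length into a data frame column, which appears to be what triggers the "replacement has 47 rows, data has 530" error.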
Hello all, an R noob here.
I hope you can help me with the following.
I need to derive new columns from groups of existing columns in my dataset, and I need to do this multiple times. For the first transformation I use columns 1, 2 and 3, and if certain conditions are met the output is a new column containing 1, otherwise 0; for the second transformation I use columns 4, 5 and 6, and the output should again be 1 or 0. I have to do this 18 times. I already wrote a function that successfully does the transformation if I type in the variable names manually, but I would like to apply this function to all the desired columns at once. My desired output is 18 new columns of 0's and 1's. Finally, I will add a last column that is 1 if any of the 18 columns is 1 and 0 otherwise.
df <- data.frame(admiss1 = sample(seq(as.Date('1990/01/01'), as.Date('2000/01/01'), by="day"), 12),
admiss2 = sample(seq(as.Date('1990/01/01'), as.Date('2000/01/01'), by="day"), 12),
admiss3 = sample(seq(as.Date('1990/01/01'), as.Date('2000/01/01'), by="day"), 12),
visit1 = sample(seq(as.Date('1995/01/01'), as.Date('1996/01/01'), by="day"), 12),
visit2 = sample(seq(as.Date('1997/01/01'), as.Date('1998/01/01'), by="day"), 12),
reason1 = sample(3,12, replace = T),
reason2 = sample(3,12, replace = T),
reason3 = sample(3,12, replace = T))
df$discharge1 <- df$admiss1 + 10
df$discharge2 <- df$admiss2 + 10
df$discharge3 <- df$admiss3 + 10
#every discharge date is 10 days after the admission date for the sake of this example
#now I have the following dataframe
#for the sake of it I included only 3 dates and reasons(instead of 18)
admiss1 admiss2 admiss3 visit1 visit2 reason1 reason2 reason3 discharge1 discharge2 discharge3
1 1990-03-12 1992-04-04 1998-07-31 1995-01-24 1997-10-07 2 1 3 1990-03-22 1992-04-14 1998-08-10
2 1999-05-18 1990-11-25 1995-10-04 1995-03-06 1997-03-13 1 2 1 1999-05-28 1990-12-05 1995-10-14
3 1993-07-16 1998-06-10 1991-07-05 1995-11-06 1997-11-15 1 1 2 1993-07-26 1998-06-20 1991-07-15
4 1991-07-05 1992-06-17 1995-10-12 1995-05-14 1997-05-02 2 1 3 1991-07-15 1992-06-27 1995-10-22
5 1995-08-16 1999-03-08 1992-04-03 1995-02-20 1997-01-03 1 3 3 1995-08-26 1999-03-18 1992-04-13
6 1999-10-07 1991-12-26 1995-05-05 1995-10-24 1997-10-15 3 1 1 1999-10-17 1992-01-05 1995-05-15
7 1998-03-18 1992-04-18 1993-12-31 1995-11-14 1997-06-14 3 2 2 1998-03-28 1992-04-28 1994-01-10
8 1992-08-04 1991-09-16 1992-04-23 1995-05-29 1997-10-11 1 2 3 1992-08-14 1991-09-26 1992-05-03
9 1997-02-20 1990-02-12 1998-03-08 1995-10-09 1997-12-29 1 1 3 1997-03-02 1990-02-22 1998-03-18
10 1992-09-16 1997-06-16 1997-07-18 1995-12-11 1997-01-12 1 2 2 1992-09-26 1997-06-26 1997-07-28
11 1991-01-25 1998-04-07 1999-07-02 1995-12-27 1997-05-28 3 2 1 1991-02-04 1998-04-17 1999-07-12
12 1996-02-25 1993-03-30 1997-06-25 1995-09-07 1997-10-18 1 3 2 1996-03-06 1993-04-09 1997-07-05
admissdate <- function(admis, dis, rsn, vis1, vis2){
xnew <- ifelse(df[eval(substitute(admis))] >= df[eval(substitute(vis1))] & df[eval(substitute(dis))] <= df[eval(substitute(vis2))] & df[eval(substitute(rsn))] == 2, 1, 0)
xnew <- ifelse(df[eval(substitute(admis))] >= df[eval(substitute(vis1))] & df[eval(substitute(admis))] <= df[eval(substitute(vis2))] & df[eval(substitute(dis))] >= df[eval(substitute(vis2))] & df[eval(substitute(rsn))] == 2, 1, xnew)
return(xnew)
}
I wrote this function to generate a 1 if the conditions are true and a 0 if the conditions are false.
-Condition 1: admission date and discharge date are between visit 1 and visit 2 + admission reason is 2.
-Condition 2: admission date is after visit 1 but before visit 2 and the discharge date is after visit 2 with also admission reason 2.
It should return 1 if these conditions are true and 0 if these conditions are false. Eventually, I will end up with 18 new variables with 1's or 0's and will combine them to make one variable with Admission between visit 1 and visit 2 (with reason 2).
If I type in the variable names manually it works, but I can't make it work for all the variables at once. I tried to make character vectors with all the admission dates, discharge dates and reasons and to pass them to the function with mapply, but this does not work.
admiss <- paste0(rep("admiss", 3), 1:3)
discharge <- paste0(rep("discharge", 3), 1:3)
reason <- paste0(rep("reason", 3), 1:3)
visit1 <- rep("visit1",3)
visit2 <- rep("visit2",3)
mapply(admissdate, admis = admiss, dis = discharge, rsn = reason, vis1 = visit1, vis2 = visit2)
I have also considered lapply, but there you have to define an X = ..., which I think I cannot use because I have multiple columns that I want to pass in. Please correct me if I am wrong!
Also I considered using a for loop, but I don't know how to use that with multiple conditions.
Any help would be greatly appreciated!
You can change the function to accept values instead of column names.
admissdate <- function(admis, dis, rsn, vis1, vis2){
xnew <- as.integer(admis >= vis1 & dis <= vis2 & rsn == 2)
xnew <- ifelse(admis >= vis1 & admis <= vis2 & dis >= vis2 & rsn == 2, 1, xnew)
return(xnew)
}
Now create new columns -
admiss <- paste0("admiss", 1:3)
discharge <- paste0("discharge", 1:3)
reason <- paste0("reason", 1:3)
new_col <- paste0('newcol', 1:3)
df[new_col] <- Map(function(x, y, z) admissdate(x, y, z, df$visit1, df$visit2),
df[admiss],df[discharge],df[reason])
#Additional column will be 1 if any of the value in the new column is 1.
df$result <- as.integer(rowSums(df[new_col]) > 0)
df
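To see what the Map() call produces, here is a small sketch on a hand-made data frame with fixed dates, so the expected 0/1 values can be checked by eye. The data frame `toy` and its values are invented for this illustration; the function is the one from the answer above, and only two admission columns are used.

```r
admissdate <- function(admis, dis, rsn, vis1, vis2){
  xnew <- as.integer(admis >= vis1 & dis <= vis2 & rsn == 2)
  xnew <- ifelse(admis >= vis1 & admis <= vis2 & dis >= vis2 & rsn == 2, 1, xnew)
  return(xnew)
}

# Invented example data: every row shares the same two visit dates
toy <- data.frame(
  admiss1 = as.Date(c("1996-01-01", "1997-12-28", "1990-01-01")),
  admiss2 = as.Date(c("1996-06-01", "1996-06-01", "1996-06-01")),
  reason1 = c(2, 2, 2),
  reason2 = c(1, 2, 3),
  visit1  = as.Date("1995-06-01"),
  visit2  = as.Date("1998-01-01")
)
toy$discharge1 <- toy$admiss1 + 10
toy$discharge2 <- toy$admiss2 + 10

admiss    <- paste0("admiss", 1:2)
discharge <- paste0("discharge", 1:2)
reason    <- paste0("reason", 1:2)
new_col   <- paste0("newcol", 1:2)

# Map() walks the three column lists in parallel, one triple per call
toy[new_col] <- Map(function(x, y, z) admissdate(x, y, z, toy$visit1, toy$visit2),
                    toy[admiss], toy[discharge], toy[reason])
toy$result <- as.integer(rowSums(toy[new_col]) > 0)
toy[c(new_col, "result")]
```

In this toy data, row 1 of newcol1 is an admission fully inside the visit window (condition 1), row 2 starts inside the window and ends after visit 2 (condition 2), and row 3 lies before visit 1, so it stays 0.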
I have some data which looks like:
date col1 col2 col3
<chr> <dbl> <dbl> <dbl>
1 2020_09_01 53542. 22133. 25295.
2 2020_09_02 54157. 22505. 25327.
3 2020_09_03 54137. 23115. 24993.
4 2020_09_04 50795. 23127. 24166.
5 2020_09_05 32829. 19600. 21860.
I am trying to use the rollapply function to compute a 7-day percentage change, i.e. a week-on-week percentage change, but I cannot get rollapply to work as I expect. It should compute, for each day, the percentage change from the previous week.
lagPeriod = 7
matrixCalcFunction <- function(x){
(myData[[x]] - myData[[x - lagPeriod]]) / myData[[x - lagPeriod]]
}
myData %>%
pivot_longer(cols = contains("col")) %>%
tidyquant::tq_mutate(
select = value,
mutate_fun = rollapply,
width = lagPeriod ,
align = "right",
FUN = matrixCalcFunction
)
Expected output:
date col1 col2 col3
<chr> <dbl> <dbl> <dbl>
1 2020_09_01 NA NA NA
2 2020_09_02 NA NA NA
3 2020_09_03 NA NA NA
4 2020_09_04 NA NA NA
5 2020_09_05 NA NA NA
6 2020_09_06 NA NA NA
7 2020_09_07 -0.065 -0.055 -0.39
8 2020_09_08 -0.058 -0.029 -0.041
9 2020_09_09 0.068 0.071 0.039
10 2020_09_10 0.023 -0.0002 0.045
Data:
myData <- structure(list(date = c("2020_09_01", "2020_09_02", "2020_09_03",
"2020_09_04", "2020_09_05", "2020_09_06", "2020_09_07", "2020_09_08",
"2020_09_09", "2020_09_10", "2020_09_11", "2020_09_12", "2020_09_13",
"2020_09_14", "2020_09_15", "2020_09_16", "2020_09_17", "2020_09_18",
"2020_09_19", "2020_09_20", "2020_09_21", "2020_09_22", "2020_09_23",
"2020_09_24", "2020_09_25", "2020_09_26", "2020_09_27", "2020_09_28",
"2020_09_29", "2020_09_30"), col1 = c(53542.497, 54156.934, 54136.844,
50794.971, 32828.797, 28475.082, 50083.573, 51017.288, 57819.908,
51945.242, 27823.172, 34349.466, 28527.527, 54845.664, 56531.057,
56556.415, 55396.121, 54303.732, 37513.441, 30041.867, 52397.815,
55449.939, 56203.125, 53654.182, 53289.437, 38511.761, 28046.879,
52132.573, 56055.611, 55520.683), col2 = c(22133.29, 22504.958,
23115.242, 23126.773, 19599.718, 16752.282, 20920.38, 21844.255,
24763.05, 23121.879, 17430.447, 20110.582, 18795.882, 24027.224,
24890.61, 24408.889, 24363.402, 24582.204, 20146.731, 18376.923,
23063.298, 24221.946, 25228.194, 24658.424, 23333.315, 20066.397,
17504.372, 23561.362, 23456.284, 24101.302), col3 = c(25294.573,
25326.797, 24992.764, 24166.084, 21859.885, 17549.005, 24306.496,
24269.409, 25968.326, 25253.976, 17974.404, 22636.375, 20105.166,
27000.274, 26291.22, 27277.371, 26851.75, 26133.317, 24055.107,
19515.875, 25573.014, 31957.279, 28961.316, 26896.495, 26440.726,
22941.927, 19990.825, 26595.878, 27725.468, 25965.802)), row.names = c(NA,
-30L), class = c("tbl_df", "tbl", "data.frame"))
EDIT:
This code runs, but I am a little confused about the diff(.x)/lag(.x, 7) part and not sure it is doing what I want, since I am getting different results from the expected output.
myData %>%
column_to_rownames("date") %>%
mutate(across(everything(), ~ round(c(NA, diff(.x))/lag(.x, 7), 5),
names = "{col}_delta"))
For a single observation (col1 row 1 and row 7) I can use something like diff(c(pull(myData[7, 2]), pull(myData[1, 2]))) = 3458.924 then I can divide this as: 3458.924 / pull(myData[1, 2]) = 0.0646. So would adding something like diff(c(.x, lag(.x, 7))) to the diff function get the result.
Convert it to a zoo object and then use diff giving a zoo object. It could be converted back to a data frame using fortify.zoo(x) where x is the result of the diff. Alternately just leave it as a zoo object so you can use the other facilities of zoo.
library(zoo)
z <- read.zoo(myData, format = "%Y_%m_%d")
diff(z, 6, arith = FALSE, na.pad = TRUE) - 1
To use rollapply instead of the last line use:
rollapplyr(z, 7, function(x) x[7] / x[1] - 1, fill = NA)
or use lag.zoo:
z / lag(z, -6, na.pad = TRUE) - 1
Note that dplyr masks lag (and filter), so either make sure it is not loaded or, if you need it, load it using library(dplyr, exclude = c("filter", "lag")).
Regarding the EDIT try this:
library(dplyr, exclude = c("lag", "filter"))
myData %>% mutate(across(-1, ~ . / dplyr::lag(., 6) - 1))
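The arithmetic behind these one-liners can be spelled out in base R. The sketch below uses the first eight col1 values from the question's data and a lag of 6 rows, so that row 7 is compared with row 1, as in the expected output; `k`, `n` and `pct_change` are names chosen for this illustration.

```r
# First eight col1 values from the question's data
col1 <- c(53542.497, 54156.934, 54136.844, 50794.971,
          32828.797, 28475.082, 50083.573, 51017.288)

k <- 6                 # row 7 is compared with row 1, row 8 with row 2, ...
n <- length(col1)

# The first k entries have no earlier value to compare with, so they are NA
pct_change <- c(rep(NA_real_, k), col1[(k + 1):n] / col1[1:(n - k)] - 1)
round(pct_change, 3)
```

Rows 7 and 8 round to -0.065 and -0.058, which agrees with the col1 column of the expected output in the question.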
I would like to process some GPS-Data rows, pairwise.
For now, I am doing it in a normal for-loop but I'm sure there is a better and faster way.
n = 100
testdata <- as.data.frame(cbind(runif(n,1,10), runif(n,0,360), runif(n,14,16), runif(n, 46,49)))
colnames(testdata) <- c("speed", "heading", "long", "lat")
head(testdata)
diffmatrix <- as.data.frame(matrix(ncol = 3, nrow = dim(testdata)[1] - 1))
colnames(diffmatrix) <- c("distance","heading_diff","speed_diff")
for (i in 1:(dim(testdata)[1] - 1)) {
diffmatrix[i,1] <- spDists(as.matrix(testdata[i:(i+1),c('long','lat')]),
longlat = T, segments = T)*1000
diffmatrix[i,2] <- testdata[i+1,]$heading - testdata[i,]$heading
diffmatrix[i,3] <- testdata[i+1,]$speed - testdata[i,]$speed
}
head(diffmatrix)
How would I do that with an apply function?
Or is it even possible to do the calculation in parallel?
Thank you very much!
I'm not sure what you want to do with the end condition but with dplyr you can do all of this without using a for loop.
library(dplyr)
library(sp)  # provides spDists()
# each %>% must end a line; a line that starts with %>% is a syntax error
testdata %>%
  mutate(heading_diff = c(diff(heading), 0),
         speed_diff = c(diff(speed), 0),
         longdiff = c(diff(long), 0),
         latdiff = c(diff(lat), 0)) %>%
  rowwise() %>%
  mutate(spdist = spDists(cbind(c(long, long + longdiff), c(lat, lat + latdiff)), longlat = T, segments = T) * 1000) %>%
  select(heading_diff, speed_diff, distance = spdist)
# heading_diff speed_diff distance
# <dbl> <dbl> <dbl>
# 1 15.9 0.107 326496
# 2 -345 -4.64 55184
# 3 124 -1.16 25256
# 4 85.6 5.24 221885
# 5 53.1 -2.23 17599
# 6 -184 2.33 225746
I will explain each part below:
The pipe operator %>% is essentially a chain that sends the results from one operation into the next. So we start with your test data and send it to the mutate function.
Use mutate to create 4 new columns that are the difference measurements from one row to the next. Adding in 0 at the last row because there is no measurement following the last datapoint. (Could do something like NA instead)
Next once you have the differences you want to use rowwise so you can apply the spDists function to each row.
Last we create another column with mutate that calls the original 4 columns that we created earlier.
To get only the 3 columns that you were concerned with I used a select statement at the end. You can leave this out if you want the entire dataframe.
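If the row-by-row call is the bottleneck, the pairwise differences can also be computed on whole vectors at once. The sketch below is base R only and swaps spDists for a haversine (spherical Earth) formula, so the distances are an approximation and will differ slightly from the spDists results; `haversine_m` and the 6371 km radius are assumptions made for this illustration.

```r
# Hypothetical helper: great-circle distance in metres on a sphere of radius r
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

set.seed(1)
n <- 100
testdata <- data.frame(speed = runif(n, 1, 10),
                       heading = runif(n, 0, 360),
                       long = runif(n, 14, 16),
                       lat = runif(n, 46, 49))

i <- seq_len(n - 1)  # index of the first point in each consecutive pair
diffs <- data.frame(
  distance = haversine_m(testdata$long[i], testdata$lat[i],
                         testdata$long[i + 1], testdata$lat[i + 1]),
  heading_diff = diff(testdata$heading),
  speed_diff = diff(testdata$speed)
)
head(diffs)
```

This returns n - 1 rows, like the loop in the question, rather than padding the last row with 0.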
I need some help with dplyr.
I have two data frames - one huge, with several time series A,B,... in there (LargeDF), and a second one (Categories) with time intervals (left and right boundaries).
I would like to add another column to LargeDF, labeled leftBoundary, containing the appropriate boundary value, like so:
LargeDF
ts timestamp signal # left_boundary
1 A 0.3209338 10.43279 # 0
2 A 1.4791524 10.34295 # 1
3 A 2.6007494 10.71601 # 2
and
Categories
ts left right
1 A 0 1
2 A 1 2
3 A 2 3
My code I came up with is
LargeDF %>%
group_by(ts) %>%
do(myFUN(., Categories))
# calls this ...
myFUN <- function(Large, Categ) {
CategTS <- Categ %>%
filter(ts == Large[1, "ts"][[1]])
Large %>%
group_by(timestamp) %>% # this is bothering me...
mutate(left_boundary = CategTS$left[CategTS$left < timestamp
& timestamp < CategTS$right])
}
but it is super slow for large time series. I would really like to lose the group_by(timestamp), as they are unique within each ts anyways.
Does someone see a better solution? That would be much appreciated.
# Code for making the example data frames ...
library("dplyr")
n <- 10; series <- c("A", "B", "C")
LargeDF <- data.frame(
ts = rep(series, each = n)
, timestamp = runif(n*length(series), max = 4)
, signal = runif(n*length(series), min = 10, max = 11)
) %>% group_by(ts) %>% arrange(timestamp)
m <- 7
Categories <- data.frame(
ts = rep(series, each = m)
, left = rep(seq(1 : m) - 1, length(series))
, right = rep(seq(1 : m), length(series))
)
Update (data.table and my slightly modified mockup)
So, I tried the suggestions from @DavidArenburg on a quick/dirty mockup-example first, but had the problem that some timestamps were binned twice (into successive categories/intervals).
> foverlaps(d, c, type="any", by.x = c("timestamp", "timestamp2"))
left right value timestamp timestamp2
1: 0.9 1.9 0.1885459 1 1
2: 0.9 1.9 0.0542375 2 2 # binned here
3: 1.9 2.9 0.0542375 2 2 # and here as well
13: 19.9 25.9 0.4579986 20 20
I then read about minoverlap = 1L as a default and realized that a normal timestamp is >> 1.
> as.numeric(Sys.time())
[1] 1429022267
Therefore, if I shifted everything to larger values (e.g. n <- 10 in the example below), everything went fine.
left right value timestamp timestamp2
1: 9 19 0.64971126 10 10
2: 19 29 0.75994751 20 20
3: 29 99 0.98276462 30 30
9: 199 259 0.89816165 200 200
With my real data, everything went smoothly, so thanks again.
## Code for my data.table example -----
n <- 1
d <- data.table( value = runif(9),
timestamp = c(1, 2, 3, 5, 7, 10, 15, 18, 20)*n,
timestamp2 = c(1, 2, 3, 5, 7, 10, 15, 18, 20)*n)
c <- data.table(left = c(0.9, 1.9, 2.9, 9.9, 19.9, 25.9)*n,
right = c(1.9, 2.9, 9.9, 19.9, 25.9, 33.9)*n)
setkey(c, left, right)
foverlaps(d, c, type="any", by.x = c("timestamp", "timestamp2"))
Update 2 (JOIN, then FILTER, within dplyr)
I tested the suggestion from @aosmith to use the dplyr function left_join() to create one (very) large DF, then filter() this again. Very quickly, I ran into memory issues:
Error: std::bad_alloc
Probably, this approach would be a good idea for smaller tables - as the syntax is very nice (but this, again, is personal preference). I'll go for the data.table solution in this case. Thanks again for all suggestions.
dplyr isn't well suited to such operations; try data.table's foverlaps() function instead
library(data.table)
class(LargeDF) <- "data.frame" ## Removing all the dplyr classes
setDT(LargeDF)[, `:=`(left = timestamp, right = timestamp)] # creating min and max boundaries in the large table
setkey(setDT(Categories)) # keying by all columns (necessary for `foverlaps` to work)
LargeDF[, left_boundary := foverlaps(LargeDF, Categories)$left][] # Creating left_boundary
# ts timestamp signal left right left_boundary
# 1: A 0.46771516 10.72175 0.46771516 0.46771516 0
# 2: A 0.58841492 10.35459 0.58841492 0.58841492 0
# 3: A 1.14494484 10.50301 1.14494484 1.14494484 1
# 4: A 1.18298225 10.82431 1.18298225 1.18298225 1
# 5: A 1.69822678 10.04780 1.69822678 1.69822678 1
# 6: A 1.83189609 10.75001 1.83189609 1.83189609 1
# 7: A 1.90947475 10.94715 1.90947475 1.90947475 1
# 8: A 2.73305266 10.14449 2.73305266 2.73305266 2
# 9: A 3.02371968 10.17724 3.02371968 3.02371968 3
# ...
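When the intervals within each ts are contiguous and non-overlapping, as in the example Categories table, a base R lookup with findInterval() is another option. This is only a sketch under that assumption, not a replacement for foverlaps() in the general case; the small data frames below reuse the question's first rows plus two invented rows, and timestamps that fall outside every interval get NA.

```r
Categories <- data.frame(
  ts    = rep(c("A", "B"), each = 3),
  left  = rep(0:2, 2),
  right = rep(1:3, 2)
)
# First three rows are from the question; the last two are invented
LargeDF <- data.frame(
  ts        = c("A", "A", "A", "B", "A"),
  timestamp = c(0.3209338, 1.4791524, 2.6007494, 0.5, 5),
  signal    = c(10.43279, 10.34295, 10.71601, 10.5, 10.9)
)

LargeDF$left_boundary <- NA_real_
for (s in unique(LargeDF$ts)) {
  rows  <- LargeDF$ts == s
  cat_s <- Categories[Categories$ts == s, ]
  cat_s <- cat_s[order(cat_s$left), ]       # findInterval needs sorted breaks

  # Index of the last 'left' that is <= timestamp (0 means before the first)
  idx <- findInterval(LargeDF$timestamp[rows], cat_s$left)
  idx[idx == 0L] <- NA_integer_

  # Keep the match only if the timestamp is also below that interval's 'right'
  ok <- !is.na(idx) & LargeDF$timestamp[rows] < cat_s$right[idx]
  LargeDF$left_boundary[rows] <- ifelse(ok, cat_s$left[idx], NA)
}
LargeDF
```

Note that findInterval() treats the left boundary as inclusive, whereas the question's code uses strict inequalities on both sides, so a timestamp lying exactly on a boundary is handled differently.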
I am creating correlations using R, with the following code:
Values<-read.csv(inputFile, header = TRUE)
O<-Values$Abundance_O
S<-Values$Abundance_S
cor(O,S)
pear_cor<-round(cor(O,S),4)
outfile<-paste(inputFile, ".jpg", sep = "")
jpeg(filename = outfile, width = 15, height = 10, units = "in", pointsize = 10, quality = 75, bg = "white", res = 300, restoreConsole = TRUE)
rx<-range(0,20000000)
ry<-range(0,200000)
plot(rx,ry, ylab="S", xlab="O", main="O vs S", type="n")
points(O,S, col="black", pch=3, lwd=1)
mtext(sprintf("%s %.4f", "pearson: ", pear_cor), adj=1, padj=0, side = 1, line = 4)
dev.off()
pear_cor
I now need to find the lower quartile for each set of data and exclude data that is within the lower quartile. I would then like to rewrite the data without those values and use the new column of data in the correlation analysis (because I want to threshold the data by the lower quartile). If there is a way I can write this so that it is easy to change the threshold by applying arguments from Java (as I have with the input file name) that's even better!
Thank you so much.
I have now implemented the answer below and that is working; however, I need to keep the pairs of data together for the correlation. Here is an example of my data (from csv):
Abundance_O Abundance_S
3635900.752 1390.883073
463299.4622 1470.92626
359101.0482 989.1609251
284966.6421 3248.832403
415283.663 2492.231265
2076456.856 10175.48946
620286.6206 5074.268802
3709754.717 269.6856808
803321.0892 118.2935093
411553.0203 4772.499758
50626.83554 17.29893001
337428.8939 203.3536852
42046.61549 152.1321255
1372013.047 5436.783169
939106.3275 7080.770535
96618.01393 1967.834701
229045.6983 948.3087208
4419414.018 23735.19352
So I need to exclude both values in the row if one does not meet my quartile threshold (0.25 quartile). So if the quartile for O was 45000 then the row "42046.61549,152.1321255" would be removed. Is this possible? If I read in both columns as a dataframe can I search each column separately? Or find the quartiles and then input that value into code to remove the appropriate rows?
Thanks again, and sorry for the evolution of the question!
Please try to provide a reproducible example, but if you have data in a data.frame, you can subset it using the quantile function as the logical test. For instance, in the following data we want to select only rows from the dataframe where the value of the measured variable 'Val' is above the bottom quartile:
# set.seed so you can reproduce these values exactly on your system
set.seed(39856)
df <- data.frame( ID = 1:10 , Val = runif(10) )
df
ID Val
1 1 0.76487516
2 2 0.59755578
3 3 0.94584374
4 4 0.72179297
5 5 0.04513418
6 6 0.95772248
7 7 0.14566118
8 8 0.84898704
9 9 0.07246594
10 10 0.14136138
# Now to select only rows where the value of our measured variable 'Val' is above the bottom 25% quartile
df[ df$Val > quantile(df$Val , 0.25 ) , ]
ID Val
1 1 0.7648752
2 2 0.5975558
3 3 0.9458437
4 4 0.7217930
6 6 0.9577225
7 7 0.1456612
8 8 0.8489870
# And check the value of the bottom 25% quantile...
quantile(df$Val , 0.25 )
25%
0.1424363
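For the paired case in the updated question, the same logical test can be built from both columns, so that a row is dropped as soon as either value falls at or below its column's threshold. The sketch below uses the first eight rows of the example data; `filter_pairs` and its argument `p` are names invented for this illustration.

```r
# First eight rows of the example data from the question
Values <- data.frame(
  Abundance_O = c(3635900.752, 463299.4622, 359101.0482, 284966.6421,
                  415283.663, 2076456.856, 620286.6206, 3709754.717),
  Abundance_S = c(1390.883073, 1470.92626, 989.1609251, 3248.832403,
                  2492.231265, 10175.48946, 5074.268802, 269.6856808)
)

# Hypothetical helper: keep rows where BOTH columns exceed their p-quantile
filter_pairs <- function(d, p = 0.25) {
  keep <- d$Abundance_O > quantile(d$Abundance_O, p) &
    d$Abundance_S > quantile(d$Abundance_S, p)
  d[keep, ]
}

kept <- filter_pairs(Values, p = 0.25)
kept
cor(kept$Abundance_O, kept$Abundance_S)
```

Because the threshold is an ordinary function argument, it could be supplied from outside the script, for example via commandArgs(trailingOnly = TRUE) when the script is run with Rscript; how the Java side launches R is an assumption here.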
Although this is an old question, I came across it during research of my own and I arrived at a solution that someone may be interested in.
I first defined a function which converts a numerical vector into its quantile groups. Parameter n determines the number of quantile groups (n = 4 for quartiles, n = 10 for deciles).
qgroup = function(numvec, n = 4){
qtile = quantile(numvec, probs = seq(0, 1, 1/n))
out = sapply(numvec, function(x) sum(x >= qtile[-(n+1)]))
return(out)
}
Function example:
v = rep(1:20)
> qgroup(v)
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
Consider now the following data:
library(data.table)
dt = data.table(
A0 = runif(100),
A1 = runif(100)
)
We apply qgroup() across the data to obtain two quartile group columns:
cols = colnames(dt)
qcols = c('Q0', 'Q1')
dt[, (qcols) := lapply(.SD, qgroup), .SDcols = cols]
head(dt)
> A0 A1 Q0 Q1
1: 0.72121846 0.1908863 3 1
2: 0.70373594 0.4389152 3 2
3: 0.04604934 0.5301261 1 3
4: 0.10476643 0.1108709 1 1
5: 0.76907762 0.4913463 4 2
6: 0.38265848 0.9291649 2 4
Lastly, we only include rows for which both quartile groups are above the first quartile:
dt = dt[Q0 > 1 & Q1 > 1]