I am trying to create a new column conditional on another column, a bit like a moving average or moving window but based on distance between points. Take for example row 2 with a CO2 of 399.935. I would like to have the mean of all the points within 100 m (traveled) of that point. In my example (looking at column CumDist), rows 1, 3, 4, 5 would be selected to calculate the mean. The column CumDist (*100,000 to have the units in meters) consists of cumulative distance traveled. I have 5000 points and obviously the width (or the number of rows) of the moving window will vary.
I tested over() from the sp package, but it's problematic if the same road is taken more than once. I looked on the web for other solutions and I did not find anything that could help me.
dput(DF)
structure(list(CO2 = c(399.9350305, 399.9350305, 399.9350305,
400.0320031, 400.0320031, 400.0320031, 399.7718229, 399.7718229,
399.7718229, 399.3855075, 399.3855075, 399.3855075, 399.4708139,
399.4708139, 399.4708139, 400.0362474, 400.0362474, 400.0362474,
399.7556753, 399.7556753), lon = c(-103.7093538, -103.709352,
-103.7093492, -103.7093467, -103.7093455, -103.7093465, -103.7093482,
-103.7093596, -103.7094074, -103.7094625, -103.7094966, -103.709593,
-103.709649, -103.7096717, -103.7097349, -103.7097795, -103.709827,
-103.7099007, -103.709924, -103.7099887), lat = c(49.46972027,
49.46972153, 49.46971675, 49.46971533, 49.46971307, 49.4697124,
49.46970636, 49.46968214, 49.46960921, 49.46955984, 49.46953621,
49.46945809, 49.46938994, 49.46935281, 49.46924309, 49.46918635,
49.46914762, 49.46912566, 49.46912407, 49.46913321),distDiff = c(0.000342016147509882,
0.000191466419697602, 0.000569046320857002, 0.000240367540492089,
0.000265977754839834, 0.000103953049523505, 0.000682968856240796,
0.0028176007969857, 0.00882013898948418, 0.00678966015562509,
0.00360774024245839, 0.011149423290729, 0.00859796340323456,
0.00444526066124642, 0.0130344010874029, 0.00709037369666853,
0.00551435348701512, 0.00587377717110946, 0.00169806309901329,
0.00479849401022625), CumDist = c(0.000342016147509882, 0.000533482567207484,
0.00110252888806449, 0.00134289642855657, 0.00160887418339641,
0.00171282723291991, 0.00239579608916071, 0.00521339688614641,
0.0140335358756306, 0.0208231960312557, 0.0244309362737141, 0.0355803595644431,
0.0441783229676777, 0.0486235836289241, 0.0616579847163269, 0.0687483584129955,
0.0742627119000106, 0.08013648907112, 0.0818345521701333, 0.0866330461803596
)), .Names = c("X12CO2_dry", "coords.x1", "coords.x2", "V1",
"CumDist"), row.names = 2:21, class = "data.frame")
thanks, Martin
Man you beat me to it with a cleaner solution mra68.
Here's mine using a few loops.
####################
DF$CO2AVG <- NA_real_                  # create the result column before filling it
for (j in 1:nrow(DF)) {                # row j is the centre of the window
  CO2list <- NULL                      # CO2 values that fall inside the window of row j
  for (i in 1:nrow(DF)) {              # compare every row against row j
    # 0.001 in CumDist units is 100 m (CumDist * 100,000 = metres)
    if (abs(DF$CumDist[i] - DF$CumDist[j]) <= 0.001) {
      # keep the CO2 entries that are within the 100 m window
      CO2list <- c(CO2list, DF$X12CO2_dry[i])
    }
  }
  DF$CO2AVG[j] <- mean(CO2list)        # window mean, stored in column CO2AVG
}
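The same calculation can be sketched without the explicit inner loop. This is only a compact variant of the loop above, not a different method; the function name window_mean and its arguments are made up for illustration, and half_width is assumed to be in the same units as CumDist (0.001 for 100 m).

```r
# Hypothetical helper: for each row, the mean of `value` over all rows whose
# distance lies within `half_width` of that row's own distance.
window_mean <- function(dist, value, half_width) {
  vapply(seq_along(dist), function(j) {
    in_window <- abs(dist - dist[j]) <= half_width  # rows inside the window of row j
    mean(value[in_window])
  }, numeric(1))
}

# Toy data (not the data from the question)
toy_dist  <- c(0, 1, 2, 10)
toy_value <- c(1, 2, 3, 4)
toy_mean  <- window_mean(toy_dist, toy_value, half_width = 1)
```

With the data from the question this would presumably be called as DF$CO2AVG <- window_mean(DF$CumDist, DF$X12CO2_dry, 0.001).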
Here df is the data frame DF from the question. The window that belongs to the i-th row starts at row n[i] and ends at row m[i]-1. Hence the sum of the CO2 values in the i-th window is CumCO2[m[i]]-CumCO2[n[i]]. (Notice that the indices in CumCO2 are shifted by 1 because of the leading 0.) Dividing this CO2 sum by the window size m[i]-n[i] gives the values meanCO2 for the new column:
n <- sapply( df$CumDist,
function(x){
which.max( df$CumDist >= x-0.001 )
}
)
m <- sapply( df$CumDist,
function(x){
which.max( c(df$CumDist,Inf) > x+0.001 )
}
)
CumCO2 <- c( 0, cumsum(df$X12CO2_dry) )
meanCO2 <- ( CumCO2[m] - CumCO2[n] ) / (m-n)
> n
[1] 1 1 1 2 3 3 5 8 9 10 11 12 13 14 15 16 17 18 19 20
> m
[1] 4 5 7 7 8 8 8 9 10 11 12 13 14 15 16 17 18 19 20 21
> meanCO2
[1] 399.9350 399.9593 399.9835 399.9932 399.9606 399.9606 399.9453 399.7718 399.7718 399.3855 399.3855 399.3855 399.4708 399.4708 399.4708 400.0362
[17] 400.0362 400.0362 399.7557 399.7557
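The cumulative-sum idea above can also be sketched with findInterval, which avoids the two sapply passes. This is an illustrative variant under the assumption that the distance vector is sorted in non-decreasing order (as a cumulative distance is); the name window_mean_cumsum is made up here.

```r
# Hypothetical helper; assumes `dist` is sorted in non-decreasing order.
window_mean_cumsum <- function(dist, value, half_width) {
  # number of rows strictly before the start of each window
  lo <- findInterval(dist - half_width, dist, left.open = TRUE)
  # number of rows up to and including the end of each window
  hi <- findInterval(dist + half_width, dist)
  cum <- c(0, cumsum(value))              # leading 0 shifts the indices by 1
  (cum[hi + 1] - cum[lo + 1]) / (hi - lo) # window sum divided by window size
}

# Toy data (not the data from the question)
toy_dist  <- c(0, 1, 2, 10)
toy_value <- c(1, 2, 3, 4)
toy_fast  <- window_mean_cumsum(toy_dist, toy_value, half_width = 1)
```

Because every row is inside its own window, hi - lo is always at least 1, so the division is safe.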
I am new to coding and using R. I am working on a project to simulate the game Liar's Dice, also known as Perudo, and have some questions about creating the simulation.
Basically, the game consists of two or more players each rolling five dice in a cup, turning it over, and making bids on how many dice showing a certain side they think are on the table. You can look at your own dice, but not anyone else's. To make a bid, on your turn you would say "two 5's," which would mean there are at least two dice that landed on 5. Each bid must increase either the side or the amount. So if you said "two 5's," I could then say "two 6's" or "three 3's" on my turn.
When you believe the last bid is incorrect, you say "Liar" on your turn, and then everyone reveals their dice. If you were wrong, you lose a die, but if you were right, the last bidder loses a die. This continues until there is only one player left who has dice.
First, I decided to create a function called cup() which rolls a cup of five six-sided dice.
cup <- function(sides = 6, dice = 5){
sample(1:sides, size = dice, replace = TRUE)
}
Next, with a little assistance, I created a new function called cups() which rolls three cups for three players.
cups <- function(players = 3, sides = 6, dice = 5){
  out <- matrix(nrow = 0, ncol = dice)   # one row per player, one column per die
  for(i in seq_len(players)){
    out <- rbind(out, cup(sides, dice))
  }
  rownames(out) <- paste0("P", seq_len(players))   # P1, P2, ... for any number of players
  return(out)
}
What I want to accomplish next is to create a table of probabilities of possible dice outcomes. In other words, what's the probability of there being at least two of a side given fifteen dice (five for each player) in play? And then the probability of there being three, four, five, etc. all the way up to fifteen in this case.
My question is how would I go about doing this in R? And what direction should I go in after getting the probabilities in R?
Here is a brute-force enumeration that gives the percentage of outcomes by number of distinct faces (1 = all five the same, ..., 5 = all different) when rolling 5 dice:
library(gtools) # package with the permutations function
allcombos <- permutations(6, 5, repeats.allowed = TRUE) # all 6^5 = 7776 ordered rolls of 5 dice
alluniques <- apply(allcombos, 1, unique) # distinct faces in each roll
alllengths <- sapply(alluniques, length) # number of distinct faces in each roll
alllengths2 <- as.factor(alllengths) # convert to factor to count each value
allsum <- summary(alllengths2) # number of rolls per count of distinct faces
allsum
   1    2    3    4    5   # number of distinct faces: 1 = all five the same, 5 = all different
   6  450 3000 3600  720
totsum <- sum(allsum)
allfrac <- allsum / totsum
allpercent <- allfrac * 100
allpercent
1 2 3 4 5
0.07716049 5.78703704 38.58024691 46.29629630 9.25925926 # percentage breakout
There is no doubt an analytical solution but I don't know what it is. You could use standard probability calculations to estimate specific outcomes among multiple players. E.g. P(at least 1 4-same | 3 players) or run some simulations.
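The "run some simulations" route mentioned above could look roughly like the sketch below. It reuses the cup() function from the question; the helper name sim_at_least, the choice of face 1, the number of simulated rounds and the seed are all assumptions made for illustration.

```r
# Roll one cup, as in the question
cup <- function(sides = 6, dice = 5) {
  sample(1:sides, size = dice, replace = TRUE)
}

# Hypothetical helper: estimated chance that at least `k` of all dice
# on the table show the face `face`.
sim_at_least <- function(k, face = 1, players = 3, sides = 6, dice = 5,
                         nsim = 20000) {
  hits <- replicate(nsim, {
    # all dice on the table: one cup per player
    table_dice <- unlist(lapply(seq_len(players), function(p) cup(sides, dice)))
    sum(table_dice == face) >= k
  })
  mean(hits)  # share of simulated rounds in which the bid would hold
}

set.seed(1)  # arbitrary seed, only for reproducibility
est <- sim_at_least(3)  # estimate for "at least three 1's" among 15 dice
```

The estimate is noisy; more simulated rounds (nsim) tighten it at the cost of run time.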
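Here's likely more than you asked for. The function below takes the number of sides on the dice, the total number of dice, and a count Nrolled. Note that, as written, it enumerates unordered outcomes, counts how many of them contain the face 2 at least Nrolled times, and treats every unordered outcome as equally likely, so its results differ from the exact binomial probabilities (for example, the exact chance of at least one 2 among five dice is 1 - (5/6)^5 ≈ 0.598, not 0.5):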
dicegame <- function(Nsides = 6,
Ndice = 5,
Nrolled = 1,
verbose = FALSE)
{
total_possible_outcomes <- choose(Nsides + Ndice - 1, Ndice)
outcomes_matrix <- t(combn(Nsides + Ndice - 1,
Ndice,
sort)) - matrix(rep(c(0:(Ndice - 1)),
each = total_possible_outcomes),
nrow = total_possible_outcomes)
chances <- sum(apply(outcomes_matrix, 1, function(x) sum(x==2)) >= Nrolled) / total_possible_outcomes
if(verbose) {
cat(paste("Number of dice",
Ndice,
"each with", Nsides, "sides",
"chances of rolling", Nrolled,
"\n or more of any one side are:\n"))
}
return(chances)
# return(total_possible_outcomes)
# return(outcomes_matrix)
}
dicegame(verbose = TRUE)
#> Number of dice 5 each with 6 sides chances of rolling 1
#> or more of any one side are:
#> [1] 0.5
dicegame(6, 15, 10)
#> [1] 0.01625387
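Using probability, we can show that the probability of a given face appearing exactly n times among all N = players × dice dice, each with s sides, is:
$$P(X = n) = \binom{N}{n}\left(\frac{1}{s}\right)^{n}\left(1-\frac{1}{s}\right)^{N-n}$$
We can easily write this as an R function: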
prob_get_n <- function(ntimes, players=3, dice=5, sides=6){
if(missing(ntimes)) ntimes <- 0:(players*dice)
choose(players*dice,ntimes)*(1-1/sides)^((players*dice)-ntimes)*sides^(-ntimes)
}
Notice that this function is vectorised by construction, i.e. it accepts inputs such as 1:2 or c(9,5).
probs <- prob_get_n()
data.frame(ntimes=seq_along(probs)-1, probs=probs, or_more=rev(cumsum(rev(probs))))
ntimes probs or_more
1 0 6.490547e-02 1.000000e+00
2 1 1.947164e-01 9.350945e-01
3 2 2.726030e-01 7.403781e-01
4 3 2.362559e-01 4.677751e-01
5 4 1.417535e-01 2.315192e-01
6 5 6.237156e-02 8.976567e-02
7 6 2.079052e-02 2.739411e-02
8 7 5.346134e-03 6.603585e-03
9 8 1.069227e-03 1.257451e-03
10 9 1.663242e-04 1.882242e-04
11 10 1.995890e-05 2.190005e-05
12 11 1.814445e-06 1.941153e-06
13 12 1.209630e-07 1.267076e-07
14 13 5.582909e-09 5.744548e-09
15 14 1.595117e-10 1.616385e-10
16 15 2.126822e-12 2.126822e-12
Edit
Or we can use R's built-in dbinom function to get the distribution and pbinom to get the cumulative distribution function:
probs <- function(ntimes, players=3, dice=5, sides=6){
if(missing(ntimes)) ntimes <- 0:(players*dice)
data.frame(ntimes=ntimes, probs=dbinom(ntimes, players*dice, 1/sides), or_more=1-pbinom(ntimes-1, players*dice, 1/sides))
}
probs()
ntimes probs or_more
1 0 6.490547e-02 1.000000e+00
2 1 1.947164e-01 9.350945e-01
3 2 2.726030e-01 7.403781e-01
4 3 2.362559e-01 4.677751e-01
5 4 1.417535e-01 2.315192e-01
6 5 6.237156e-02 8.976567e-02
7 6 2.079052e-02 2.739411e-02
8 7 5.346134e-03 6.603585e-03
9 8 1.069227e-03 1.257451e-03
10 9 1.663242e-04 1.882242e-04
11 10 1.995890e-05 2.190005e-05
12 11 1.814445e-06 1.941153e-06
13 12 1.209630e-07 1.267076e-07
14 13 5.582909e-09 5.744548e-09
15 14 1.595117e-10 1.616385e-10
16 15 2.126822e-12 2.126743e-12
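The last or_more value differs slightly between the two tables (2.126822e-12 versus 2.126743e-12). A likely cause is cancellation in 1 - pbinom(...) when the tail is tiny. As a sketch, and not part of the original answer, the upper tail can be requested directly with lower.tail = FALSE; the variable names below are made up for illustration.

```r
players <- 3
dice    <- 5
sides   <- 6
n       <- players * dice
ntimes  <- 0:n

# P(X >= k) computed by subtraction, as in the function above
or_more_subtract <- 1 - pbinom(ntimes - 1, n, 1 / sides)

# P(X >= k) = P(X > k - 1), requested directly as an upper tail
or_more_tail <- pbinom(ntimes - 1, n, 1 / sides, lower.tail = FALSE)

# The probability that all 15 dice show the same given face is (1/6)^15
exact_last <- (1 / sides)^n
```

Comparing or_more_tail[n + 1] and or_more_subtract[n + 1] with exact_last shows which of the two is closer for the extreme tail.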
I have data in a form like this:
quantity direction
10 n
5 e
6 ne
12 n
20 nw
5 s
8 n
1 sw
3 se
2 ne
6 nw
8 n
2 se
3 e
4 w
9 nw
on which I want to run the rayleigh.test from circular package (For more information why I want to do this check: https://stats.stackexchange.com/questions/198701/check-for-significant-difference-between-numbers-of-sightings-per-cardinal-direc). I guess that I have to use the circular function up front to prepare the data but I have no clue how to do that. The allowed values for the units argument of this function are “radians”, “degrees”, “hours” and I can't figure out how to fit my directions into that.
How can I get rayleigh.test to accept cardinal directions as input?
I can't judge whether the Rayleigh test is appropriate for non-continuous data, but here is a small example of how to map your direction labels to degrees:
df <- data.frame( quantity = c(37,5,6) , direction = c("n", "ne" , "n") )
df$direction <- as.factor(df$direction)
# create a map from character to degrees:
map <- setNames( c( 0, 45) , c("n", "ne") )
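levels(df$direction) <- map[ levels(df$direction) ]
# factor levels are character strings, so convert them to numeric degrees
df$direction <- as.numeric(as.character(df$direction))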
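Extending the mapping to all eight directions and to the counts in the question might look like the sketch below. It assumes bearings measured clockwise from north and treats every sighting as one observation (each bearing is repeated quantity times); whether the Rayleigh test is appropriate for such grouped data is a separate statistical question. The names bearing, sightings and angles are made up for illustration, and the test is only run if the circular package is installed.

```r
# Compass bearings in degrees, clockwise from north (an assumed convention)
bearing <- c(n = 0, ne = 45, e = 90, se = 135, s = 180, sw = 225, w = 270, nw = 315)

# Data from the question
sightings <- data.frame(
  quantity  = c(10, 5, 6, 12, 20, 5, 8, 1, 3, 2, 6, 8, 2, 3, 4, 9),
  direction = c("n", "e", "ne", "n", "nw", "s", "n", "sw",
                "se", "ne", "nw", "n", "se", "e", "w", "nw")
)

# One angle per individual sighting: look up each label, then repeat it
# `quantity` times (as.character guards against factor columns)
degrees <- unname(bearing[as.character(sightings$direction)])
angles  <- rep(degrees, times = sightings$quantity)

# The test itself needs the circular package; it is skipped if not installed
if (requireNamespace("circular", quietly = TRUE)) {
  x <- circular::circular(angles, units = "degrees", template = "geographics")
  print(circular::rayleigh.test(x))
}
```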
I am attempting to repeatedly add a "fixed number" to a numeric vector depending on a specified bin size. However, the "fixed number" is dependent on the data range.
For instance, I have a data range of 10 to 1010, and I wish to separate the data into 100 bins.
Since 1010 - 10 = 1000
and 1000 / 100 (the number of bins specified) = 10,
the ideal data would look like this:
bin1 - 10 (initial data)
bin2 - 20 (initial data + 10)
bin3 - 30 (initial data + 20)
bin4 - 40 (initial data + 30)
bin100 - 1010 (initial data + 1000)
Now the real data is slightly more complex: there is not just one data range but multiple ranges. Hopefully the example below clarifies this:
# Some fixed values
start <- c(10, 5000, 4857694)
end <- c(1010, 6500, 4897909)
Ideally I wish to get something like
10 20
20 30
30 40
.. ..
5000 5015
5015 5030
5030 5045
.. ..
4857694 4858096 # Note theoretically it would have decimal places,
#but i do not want any decimal place
4858096 4858498
.. ..
So far I was thinking along this kind of function, but it seems inefficient because ;
1) I have to retype the function 100 times (because my number of bin is 100)
2) I can't find a way to repeat the function along my values - In other words my function can only deal with the data 10-1010 and not the next one 5000-6500
# The range of the variable
width <- end - start
# The bin size (Number of required bin)
bin_size <- 100
bin_count <- width/bin_size
# Create a function
f1 <- function(x,y){
c(x[1],
x[1] + y[1],
x[1] + y[1]*2,
x[1] + y[1]*3)
}
f1(x = start, y = bin_count)
[1] 10 20 30 40
Perhaps any hint or ideas would be greatly appreciated. Thanks in advance!
After a few hours of trying, I managed to answer my own question, so I thought I would share it. I used the function "bins" from the package "binr" to get the required bins. Below is my attempt; the output is slightly different from the intended output, but for my purpose it is still okay.
library(binr)
# Some fixed values
start <- c(10, 5000, 4857694)
end <- c(1010, 6500, 4897909)
tmp_list_start <- list() # Create an empty list
# This just extract the output from "bins" function into a list
for (i in seq_along(start)){
tmp <- bins(start[i]:end[i],target.bins = 100,max.breaks = 100)
# Now i need to convert one of the output from bins into numeric value
s <- gsub(",.*", "", names(tmp$binct))
s <- gsub("\\[","",s)
tmp_list_start[[i]] <- as.numeric(s)
}
# Repeating the same thing with slight modification to get the end value of the bin
tmp_list_end <- list()
for (i in seq_along(end)){
tmp <- bins(start[i]:end[i],target.bins = 100,max.breaks = 100)
e <- gsub(".*,", "", names(tmp$binct))
e <- gsub("]","",e)
tmp_list_end[[i]] <- as.numeric(e)
}
v1 <- unlist(tmp_list_start)
v2 <- unlist(tmp_list_end)
df <- data.frame(start=v1, end=v2)
head(df)
start end
1 10 20
2 21 30
3 31 40
4 41 50
5 51 60
6 61 70
Pardon my crappy code. Please share if there is a better way of doing this. It would be nice if someone could comment on how to wrap this into a function.
Here's a way that may help with base R:
bin_it <- function(START, END, BINS = 100) {   # default of 100 bins, as in the question
  range <- END-START
  jump <- range/BINS                           # width of one bin
  v1 <- c(START, seq(START+jump+1, END, jump)) # bin starts
  v2 <- seq(START+jump-1, END, jump)+1         # bin ends
  data.frame(v1, v2)
}
It uses the function seq to create the vectors of numbers leading to the ending number. It may not work for every case, but for the ranges you gave it should give the desired output.
head(bin_it(10, 1010, 100), 5)
v1 v2
1 10 20
2 21 30
3 31 40
4 41 50
5 51 60
head(bin_it(5000, 6500, 100), 5)
v1 v2
1 5000 5015
2 5016 5030
3 5031 5045
4 5046 5060
5 5061 5075
head(bin_it(4857694, 4897909, 100), 6)
v1 v2
1 4857694 4858096
2 4858097 4858498
3 4858499 4858900
4 4858901 4859303
5 4859304 4859705
6 4859706 4860107
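To handle several ranges at once, one option is to build the break points with seq(..., length.out = ) and apply the helper to each start/end pair. This is a sketch, not the method used above: the name make_bins is made up, break points are rounded to whole numbers, and each bin starts where the previous one ends (as in the "ideal" output in the question) rather than one unit later.

```r
# Hypothetical helper: split [start, end] into `bins` consecutive intervals
# with whole-number break points.
make_bins <- function(start, end, bins = 100) {
  breaks <- round(seq(start, end, length.out = bins + 1))
  data.frame(start = head(breaks, -1), end = tail(breaks, -1))
}

# The ranges from the question
start <- c(10, 5000, 4857694)
end   <- c(1010, 6500, 4897909)

# One data frame per range, stacked into a single table
all_bins <- do.call(rbind, Map(make_bins, start, end))
```

Map() pairs the i-th start with the i-th end, so the same function covers any number of ranges.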
GIVEN DATA
I have 6 columns of data of vehicle trajectory (observation of vehicles' change in position, velocity, etc over time) a part of which is shown below:
Vehicle ID   Frame ID   Global X   Vehicle class   Vehicle velocity   Lane
1            177        6451181    2               24.99              5
1            178        6451182    2               24.95              5
1            179        6451184    2               24.91              5
1            180        6451186    2               24.90              5
1            181        6451187    2               24.96              5
1            182        6451189    2               25.08              5
Vehicle ID is the identification of individual vehicles, e.g. vehicle 1, vehicle 2, etc. It is repeated in the column for each frame in which the vehicle was observed. Please note that each frame is 0.1 seconds long, so 10 frames make 1 second. The IDs of the frames are in the Frame ID column. Vehicle class is the type of vehicle (1=motorcycle, 2=car, 3=truck). The Vehicle velocity column gives the instantaneous speed of the vehicle at that instant of time, i.e. in a frame. Lane is the number or ID of the lane in which the vehicle is present in a particular frame.
WHAT I NEED TO FIND
The data I have covers a 15-minute period. The minimum frame ID is 5 and the maximum frame ID is 9952. I need to find the total number of vehicles in every 30-second time period. This means that, starting from the first 30 seconds (frame ID 5 to frame ID 305), I need to know the unique vehicle IDs observed. Also, for each 30-second period, I need to find the average velocity of each vehicle class. This means that, e.g. for cars, I need to find the average of all velocities of those vehicles whose vehicle class is 2.
I need to find this for every 30-second time period, i.e. 5-305, 305-605, 605-905,..., 9605-9905. The output should be tables for cars, trucks and motorcycles like this:
Time Slots   Total Cars   Average Velocity
5-305        xx           xx
305-605      xx           xx
.            .            .
.            .            .
9605-9905    xx           xx
WHAT I HAVE TRIED SO FAR
# Finding the minimum and maximum Frame ID for creating 30-seconds time slots
minfid <- min(data$'Frame ID') # this was 5
maxfid <- max(data$'Frame ID') # this was 9952
for (i in 'Frame ID'==5:Frame ID'==305) {
table ('Vehicle ID')
mean('Vehicle Velocity', 'Vehicle class'==2)
} #For cars in first 30 seconds
I can't generate the required output and I don't know how can I do this for all 30 second periods. Please help.
It's a bit tough to make sure code is completely correct with your data since there is only one vehicle in the sample you show. That said, this is a typical split-apply-combine type analysis you can execute easily with the data.table package:
library(data.table)
dt <- data.table(df) # I just did a `read.table` on the text you posted
dt[, frame.group:=cut(Frame_ID, seq(5, 9905, by=300), include.lowest=T)]
Here, I just converted your data into a data.table (df was a direct import of your data posted above), and then created 300 frame buckets using cut. Then, you just let data.table do the work. In the first expression we calculate total unique vehicles per frame.group
dt[, list(tot.vehic=length(unique(Vehicle_ID))), by=frame.group]
# frame.group tot.vehic
# 1: [5,305] 1
Now we group by frame.group and Vehicle_class to get average speed and count for those combinations:
dt[, list(tot.vehic=length(unique(Vehicle_ID)), mean.speed=mean(Vehicle_velocity)), by=list(frame.group, Vehicle_class)]
# frame.group Vehicle_class tot.vehic mean.speed
# 1: [5,305] 2 1 24.965
Again, a bit silly when we only have one vehicle, but this should work for your data set.
EDIT: to show that it works:
library(data.table)
set.seed(101)
dt <- data.table(
Frame_ID=sample(5:9905, 50000, rep=T),
Vehicle_ID=sample(1:400, 50000, rep=T),
Vehicle_velocity=runif(50000, 25, 100)
)
dt[, frame.group:=cut(Frame_ID, seq(5, 9905, by=300), include.lowest=T)]
dt[, Vehicle_class:=Vehicle_ID %% 3]
head(
dt[order(frame.group, Vehicle_class), list(tot.vehic=length(unique(Vehicle_ID)), mean.speed=mean(Vehicle_velocity)), by=list(frame.group, Vehicle_class)]
)
# frame.group Vehicle_class tot.vehic mean.speed
# 1: [5,305] 0 130 63.34589
# 2: [5,305] 1 131 61.84366
# 3: [5,305] 2 129 64.13968
# 4: (305,605] 0 132 61.85548
# 5: (305,605] 1 132 64.76820
# 6: (305,605] 2 133 61.57129
Maybe it's your data?
Here is a plyr version:
data$timeSlot <- cut(data$FrameID,
breaks = seq(5, 9905, by=300),
dig.lab=5,
include.lowest=TRUE)
# split & combine
library(plyr)
data.sum1 <- ddply(.data = data,
.variables = c("timeSlot"),
.fun = summarise,
totalCars = length(unique(VehicleID)),
AverageVelocity = mean(velocity)
)
# include VehicleClass
data.sum2 <- ddply(.data = data,
.variables = c("timeSlot", "VehicleClass"),
.fun = summarise,
totalCars = length(unique(VehicleID)),
AverageVelocity = mean(velocity)
)
The column names like FrameID would have to be edited to match the ones you use:
data <- read.table(sep = "", header = TRUE, text = "
VehicleID FrameID GlobalX VehicleClass velocity Lane
1 177 6451181 2 24.99 5
1 178 6451182 2 24.95 5
1 179 6451184 2 24.91 5
1 180 6451186 2 24.90 5
1 181 6451187 2 24.96 5
1 182 6451189 2 25.08 5")
data.sum1
# timeSlot totalCars AverageVelocity
# 1 [5,305] 1 24.965
data.sum2
# timeSlot VehicleClass totalCars AverageVelocity
# 1 [5,305] 2 1 24.965
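For completeness, the same split-apply-combine can be sketched in base R with cut, aggregate and merge. The toy data below are made up (random IDs, frames and velocities, with an arbitrary class per vehicle) and only mimic the column names of the plyr example; the object names are assumptions for illustration.

```r
# Toy data with the same column names as the plyr example (values are made up)
set.seed(42)  # arbitrary seed, only for reproducibility
toy <- data.frame(
  VehicleID = sample(1:20, 500, replace = TRUE),
  FrameID   = sample(5:905, 500, replace = TRUE),
  velocity  = runif(500, 20, 30)
)
toy$VehicleClass <- toy$VehicleID %% 3 + 1  # made-up class (1, 2 or 3) per vehicle

# 300 frames = 30 seconds per slot
toy$timeSlot <- cut(toy$FrameID, breaks = seq(5, 905, by = 300),
                    dig.lab = 5, include.lowest = TRUE)

# Unique vehicles and mean velocity per time slot and vehicle class
n_vehicles <- aggregate(VehicleID ~ timeSlot + VehicleClass, data = toy,
                        FUN = function(id) length(unique(id)))
mean_vel   <- aggregate(velocity ~ timeSlot + VehicleClass, data = toy,
                        FUN = mean)

summary_tab <- merge(n_vehicles, mean_vel, by = c("timeSlot", "VehicleClass"))
names(summary_tab)[3:4] <- c("totalVehicles", "AverageVelocity")
```

Subsetting summary_tab by VehicleClass then gives one table per vehicle type.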
I am creating correlations using R, with the following code:
Values<-read.csv(inputFile, header = TRUE)
O<-Values$Abundance_O
S<-Values$Abundance_S
cor(O,S)
pear_cor<-round(cor(O,S),4)
outfile<-paste(inputFile, ".jpg", sep = "")
jpeg(filename = outfile, width = 15, height = 10, units = "in", pointsize = 10, quality = 75, bg = "white", res = 300, restoreConsole = TRUE)
rx<-range(0,20000000)
ry<-range(0,200000)
plot(rx,ry, ylab="S", xlab="O", main="O vs S", type="n")
points(O,S, col="black", pch=3, lwd=1)
mtext(sprintf("%s %.4f", "pearson: ", pear_cor), adj=1, padj=0, side = 1, line = 4)
dev.off()
pear_cor
I now need to find the lower quartile for each set of data and exclude data that is within the lower quartile. I would then like to rewrite the data without those values and use the new column of data in the correlation analysis (because I want to threshold the data by the lower quartile). If there is a way I can write this so that it is easy to change the threshold by applying arguments from Java (as I have with the input file name) that's even better!
Thank you so much.
I have now implemented the answer below and that is working; however, I need to keep the pairs of data together for the correlation. Here is an example of my data (from csv):
Abundance_O Abundance_S
3635900.752 1390.883073
463299.4622 1470.92626
359101.0482 989.1609251
284966.6421 3248.832403
415283.663 2492.231265
2076456.856 10175.48946
620286.6206 5074.268802
3709754.717 269.6856808
803321.0892 118.2935093
411553.0203 4772.499758
50626.83554 17.29893001
337428.8939 203.3536852
42046.61549 152.1321255
1372013.047 5436.783169
939106.3275 7080.770535
96618.01393 1967.834701
229045.6983 948.3087208
4419414.018 23735.19352
So I need to exclude both values in the row if one does not meet my quartile threshold (0.25 quartile). So if the quartile for O was 45000 then the row "42046.61549,152.1321255" would be removed. Is this possible? If I read in both columns as a dataframe can I search each column separately? Or find the quartiles and then input that value into code to remove the appropriate rows?
Thanks again, and sorry for the evolution of the question!
Please try to provide a reproducible example, but if you have data in a data.frame, you can subset it using the quantile function as the logical test. For instance, in the following data we want to select only rows from the dataframe where the value of the measured variable 'Val' is above the bottom quartile:
# set.seed so you can reproduce these values exactly on your system
set.seed(39856)
df <- data.frame( ID = 1:10 , Val = runif(10) )
df
ID Val
1 1 0.76487516
2 2 0.59755578
3 3 0.94584374
4 4 0.72179297
5 5 0.04513418
6 6 0.95772248
7 7 0.14566118
8 8 0.84898704
9 9 0.07246594
10 10 0.14136138
# Now to select only rows where the value of our measured variable 'Val' is above the bottom 25% quartile
df[ df$Val > quantile(df$Val , 0.25 ) , ]
ID Val
1 1 0.7648752
2 2 0.5975558
3 3 0.9458437
4 4 0.7217930
6 6 0.9577225
7 7 0.1456612
8 8 0.8489870
# And check the value of the bottom 25% quantile...
quantile(df$Val , 0.25 )
25%
0.1424363
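For the paired data in the question, the same idea can be applied to both columns at once, so that a row is dropped as soon as either value falls in its column's lower quartile. The sketch below uses the sample rows from the question; the variable names threshold, keep and filtered are made up, and a strict "greater than" comparison is assumed.

```r
# Sample rows from the question
Values <- data.frame(
  Abundance_O = c(3635900.752, 463299.4622, 359101.0482, 284966.6421,
                  415283.663, 2076456.856, 620286.6206, 3709754.717,
                  803321.0892, 411553.0203, 50626.83554, 337428.8939,
                  42046.61549, 1372013.047, 939106.3275, 96618.01393,
                  229045.6983, 4419414.018),
  Abundance_S = c(1390.883073, 1470.92626, 989.1609251, 3248.832403,
                  2492.231265, 10175.48946, 5074.268802, 269.6856808,
                  118.2935093, 4772.499758, 17.29893001, 203.3536852,
                  152.1321255, 5436.783169, 7080.770535, 1967.834701,
                  948.3087208, 23735.19352)
)

threshold <- 0.25  # quantile used as the cut-off; could be passed in as an argument

# Keep a row only if BOTH values are above their own column's quantile
keep <- Values$Abundance_O > quantile(Values$Abundance_O, threshold) &
        Values$Abundance_S > quantile(Values$Abundance_S, threshold)
filtered <- Values[keep, ]

# Correlation on the remaining pairs
pear_cor <- round(cor(filtered$Abundance_O, filtered$Abundance_S), 4)
```

Because whole rows are removed, the O and S values stay paired for the correlation.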
Although this is an old question, I came across it during research of my own and I arrived at a solution that someone may be interested in.
I first defined a function which converts a numerical vector into its quantile groups. Parameter n determines the number of groups (n = 4 for quartiles, n = 10 for deciles).
qgroup = function(numvec, n = 4){
qtile = quantile(numvec, probs = seq(0, 1, 1/n))
out = sapply(numvec, function(x) sum(x >= qtile[-(n+1)]))
return(out)
}
Function example:
v = rep(1:20)
> qgroup(v)
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
Consider now the following data:
library(data.table)
dt = data.table(
A0 = runif(100),
A1 = runif(100)
)
We apply qgroup() across the data to obtain two quartile group columns:
cols = colnames(dt)
qcols = c('Q0', 'Q1')
dt[, (qcols) := lapply(.SD, qgroup), .SDcols = cols]
head(dt)
           A0        A1 Q0 Q1
1: 0.72121846 0.1908863 3 1
2: 0.70373594 0.4389152 3 2
3: 0.04604934 0.5301261 1 3
4: 0.10476643 0.1108709 1 1
5: 0.76907762 0.4913463 4 2
6: 0.38265848 0.9291649 2 4
Lastly, we only include rows for which both values lie above the first quartile, i.e. both quartile groups are greater than 1:
dt = dt[Q0 > 1 & Q1 > 1]