paired t-test with pairs and groups defined in another dataframe - r

I have a dataframe which looks like this
> head(data)
LH3003 LH3004 LH3005 LH3006 LH3007 LH3008 LH3009 LH3010 LH3011
cg18478105 0.02329879 0.08103364 0.01611778 0.01691191 0.01886975 0.01885553 0.01647439 0.02120779 0.01168622
cg14361672 0.09479536 0.07821380 0.02522833 0.06467310 0.05387729 0.05866673 0.08121820 0.10920162 0.04413263
cg01763666 0.03625680 0.04633759 0.04401555 0.08371531 0.09866403 0.17611284 0.07306743 0.12422579 0.11125146
cg02115394 0.10014794 0.09274320 0.08743445 0.08906313 0.09934032 0.18164115 0.06526380 0.08158144 0.08862067
cg13417420 0.01811630 0.02221060 0.01314041 0.01964530 0.02367295 0.01209913 0.01612864 0.01306061 0.04421938
cg26724186 0.32776266 0.31386294 0.24167480 0.29036142 0.24751268 0.26894756 0.20927278 0.28070790 0.33188921
LH3012 LH3013 LH3014
cg18478105 0.02466508 0.01909706 0.02054417
cg14361672 0.09172160 0.06170230 0.07752691
cg01763666 0.04328518 0.13693868 0.04288165
cg02115394 0.08682942 0.08601880 0.12413149
cg13417420 0.01980470 0.02241745 0.02038114
cg26724186 0.30832389 0.27644816 0.37630038
with almost 850000 rows,
and a different dataframe which contains the information behind the sample names:
> variables
Sample_ID Name Group01
3 LH3003 pair1 0
4 LH3004 pair1 1
5 LH3005 pair2 0
6 LH3006 pair2 1
7 LH3007 pair3 0
8 LH3008 pair3 1
9 LH3009 pair4 0
10 LH3010 pair4 1
11 LH3011 pair5 0
12 LH3012 pair5 1
13 LH3013 pair6 0
14 LH3014 pair6 1
Is it possible to do a paired t-test by defining the pairs and the group annotation of the samples based on another dataframe?
Thank you for your help!

Here is an lapply method that stores the result of each test in a list. It assumes that the two samples of each pair sit in adjacent rows of the second data.frame, df2, and that the first data.frame is named df1.
myTestList <- lapply(seq(1, nrow(df2), 2), function(i)
t.test(df1[[df2$Sample_ID[i]]], df1[[df2$Sample_ID[i+1]]], paired=TRUE))
which returns
myTestList
[[1]]
Paired t-test
data: df1[[df2$Sample_ID[i]]] and df1[[df2$Sample_ID[i + 1]]]
t = -0.50507, df = 5, p-value = 0.635
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.03453201 0.02319070
sample estimates:
mean of the differences
-0.005670653
[[2]]
Paired t-test
data: df1[[df2$Sample_ID[i]]] and df1[[df2$Sample_ID[i + 1]]]
t = -2.5322, df = 5, p-value = 0.05239
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.0459320947 0.0003458114
sample estimates:
mean of the differences
-0.02279314
data
df1 <- read.table(header=TRUE, text="LH3003 LH3004 LH3005 LH3006 LH3007 LH3008 LH3009 LH3010 LH3011
cg18478105 0.02329879 0.08103364 0.01611778 0.01691191 0.01886975 0.01885553 0.01647439 0.02120779 0.01168622
cg14361672 0.09479536 0.07821380 0.02522833 0.06467310 0.05387729 0.05866673 0.08121820 0.10920162 0.04413263
cg01763666 0.03625680 0.04633759 0.04401555 0.08371531 0.09866403 0.17611284 0.07306743 0.12422579 0.11125146
cg02115394 0.10014794 0.09274320 0.08743445 0.08906313 0.09934032 0.18164115 0.06526380 0.08158144 0.08862067
cg13417420 0.01811630 0.02221060 0.01314041 0.01964530 0.02367295 0.01209913 0.01612864 0.01306061 0.04421938
cg26724186 0.32776266 0.31386294 0.24167480 0.29036142 0.24751268 0.26894756 0.20927278 0.28070790 0.33188921")[1:4]
df2 <- read.table(header=TRUE, text=" Sample_ID Name Group01
3 LH3003 pair1 0
4 LH3004 pair1 1
5 LH3005 pair2 0
6 LH3006 pair2 1")

You need to stack your data and define a pair column, and then run t.test with paired = TRUE (without it, t.test runs an unpaired Welch test). This is for 1 of the 6 tests:
data2 <- data.frame(x = c(data$LH3003, data$LH3004), pair = rep(c(0, 1), each = nrow(data)))
t.test(data2$x[data2$pair == 0], data2$x[data2$pair == 1], paired = TRUE)

Here's a variation on @Imo's:
lapply(unique(df2$Name), function(x){
samples <- df2[df2$Name==x,1]
t.test(df1[,samples[1]], df1[,samples[2]], paired=TRUE)
})
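If the goal is one paired t-test per row (per probe) across the pairs, rather than one test per pair across rows, the pairing can be read from the annotation data frame. A minimal sketch, assuming every Name occurs exactly once in each Group01 level; the toy values and the helper names g0, g1, m0 and m1 are made up for illustration:

```r
# Toy stand-ins for the question's objects (random values, for illustration only)
set.seed(1)
variables <- data.frame(
  Sample_ID = sprintf("LH30%02d", 3:8),
  Name      = rep(c("pair1", "pair2", "pair3"), each = 2),
  Group01   = rep(0:1, times = 3),
  stringsAsFactors = FALSE
)
data <- as.data.frame(matrix(runif(4 * 6), nrow = 4,
  dimnames = list(paste0("cg", 1:4), variables$Sample_ID)))

# Split the annotation by group and sort each half by pair name,
# so that position k in both halves refers to the same pair
g0 <- variables[variables$Group01 == 0, ]
g1 <- variables[variables$Group01 == 1, ]
g0 <- g0[order(g0$Name), ]
g1 <- g1[order(g1$Name), ]
stopifnot(identical(g0$Name, g1$Name))

m0 <- as.matrix(data[, g0$Sample_ID])
m1 <- as.matrix(data[, g1$Sample_ID])

# One paired t-test per row: group 1 versus group 0 within each pair
p_values <- vapply(seq_len(nrow(data)), function(i) {
  t.test(m1[i, ], m0[i, ], paired = TRUE)$p.value
}, numeric(1))
names(p_values) <- rownames(data)
p_values
```

With around 850000 rows this loop will probably be slow; the same statistic can be computed from the row means and standard deviations of m1 - m0 if speed matters.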

Related

How can I join elements (columns from dataframes) from two lists by row names using R?

I need help please. I have two lists: the first contains ndvi time series for distinct points, the second contains precipitation time series for the same plots (plots are in the same order in the two lists).
I need to combine the two lists. I want to add the column called precipitation from one list to the corresponding ndvi column from the other list, respecting the dates (represented here by letters in the row names), for a later analysis of the correlation between the columns. However, the ndvi and precipitation time series have distinct lengths and distinct dates.
I created the two lists to be used as example of my dataset. However, in my actual dataset the row names are monthly dates in the format "%Y-%m-%d".
library(tidyverse)
set.seed(100)
# First variable is ndvi.mon1 (monthly ndvi)
ndvi.mon1 <- vector("list", length = 3)
for (i in seq_along(ndvi.mon1)) {
aux <- data.frame(ndvi = sample(randu$x,
sample(c(seq(1,20, 1)),1),
replace = T))
ndvi.mon1[i] <- aux
ndvi.mon1 <- ndvi.mon1 %>% map(data.frame)
rownames(ndvi.mon1[[i]]) <- sample(letters, size=seq(letters[1:as.numeric(aux %>% map(length))]) %>% length)
}
# Second variable is precipitation
precipitation <- vector("list", length = 3)
for (i in seq_along(ndvi.mon1)){
prec_aux <- data.frame(precipitation = sample(randu$x*500,
26,
replace = T))
row.names(prec_aux) <- seq(letters[1:as.numeric(prec_aux %>% map(length))])
precipitation[i] <- prec_aux
precipitation <- precipitation %>% map(data.frame)
rownames(precipitation[[i]]) <- letters[1:(as.numeric(precipitation[i] %>% map(dim) %>% map(first)))]
}
Can someone help me please?
Thank you!!!
Marcio.
Maybe like this?
library(dplyr)
library(purrr)
library(tibble) # provides rownames_to_column()
precipitation2 <- precipitation %>%
map(rownames_to_column) %>%
map(rename, precipitation = 2)
ndvi.mon2 <- ndvi.mon1 %>%
map(rownames_to_column) %>%
map(rename, ndvi = 2)
purrr::map2(ndvi.mon2, precipitation2, left_join, by = "rowname")
[[1]]
rowname ndvi precipitation
1 k 0.354886 209.7415
2 x 0.596309 103.3700
3 r 0.978769 403.8775
4 l 0.322291 354.2630
5 c 0.831722 348.9390
6 s 0.973205 273.6030
7 h 0.949827 218.6430
8 y 0.443353 61.9310
9 b 0.826368 8.3290
10 d 0.337308 291.2110
The code below returns a list of data.frames that have been merged on their row names:
lapply(seq_along(ndvi.mon1), function(i) {
merge(
x = data.frame(date = rownames(ndvi.mon1[[i]]), ndvi = ndvi.mon1[[i]][,1]),
y = data.frame(date = rownames(precipitation[[i]]), precip = precipitation[[i]][,1]),
by="date"
)
})
Output:
[[1]]
date ndvi precip
1 b 0.826368 8.3290
2 c 0.831722 348.9390
3 d 0.337308 291.2110
4 h 0.949827 218.6430
5 k 0.354886 209.7415
6 l 0.322291 354.2630
7 r 0.978769 403.8775
8 s 0.973205 273.6030
9 x 0.596309 103.3700
10 y 0.443353 61.9310
[[2]]
date ndvi precip
1 g 0.415824 283.9335
2 k 0.573737 311.8785
3 p 0.582422 354.2630
4 y 0.952495 495.4340
[[3]]
date ndvi precip
1 b 0.656463 332.5700
2 c 0.347482 94.7870
3 d 0.215425 431.3770
4 e 0.063100 499.2245
5 f 0.419460 304.5190
6 g 0.712057 226.7125
7 h 0.666700 284.9645
8 i 0.778547 182.0295
9 k 0.902520 82.5515
10 l 0.593219 430.6630
11 m 0.788715 443.5345
12 n 0.347482 132.3950
13 q 0.719538 79.1835
14 r 0.911370 100.7025
15 s 0.258743 309.3575
16 t 0.940644 142.3725
17 u 0.626980 335.4360
18 v 0.167640 390.4915
19 w 0.826368 63.3760
20 x 0.937211 439.8685
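Once the two lists are aligned, the correlation mentioned in the question can be computed per plot. A minimal sketch with made-up values, assuming each list element is a one-column data frame with dates as row names; the names ndvi_list, precip_list, merged and correlations are hypothetical:

```r
# Toy stand-ins: one data frame per plot, dates as row names (made-up values)
ndvi_list <- list(
  data.frame(ndvi = c(0.2, 0.5, 0.7),
             row.names = c("2020-01-01", "2020-02-01", "2020-04-01")),
  data.frame(ndvi = c(0.3, 0.1, 0.6, 0.4),
             row.names = c("2020-02-01", "2020-03-01", "2020-04-01", "2020-05-01"))
)
precip_list <- list(
  data.frame(precipitation = c(10, 20, 30, 40),
             row.names = c("2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01")),
  data.frame(precipitation = c(5, 15, 25, 35),
             row.names = c("2020-02-01", "2020-03-01", "2020-04-01", "2020-05-01"))
)

# Inner join on row names; merge() stores the key in a column called "Row.names"
merged <- Map(function(n, p) merge(n, p, by = "row.names"),
              ndvi_list, precip_list)

# Correlation between the two aligned series, one value per plot
correlations <- vapply(merged, function(m) cor(m$ndvi, m$precipitation),
                       numeric(1))
correlations
```

Only dates present in both series survive the join, so the correlation is computed on matching months only.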

How do I create a sample from a previous sample of a data frame

I created a data frame with two variables: one character (teams) and one numeric. I'd like to draw a completely random sample of two teams and then sample one of those two. Finally, I'd like to repeat this with the first two selected teams excluded, and be able to replicate the whole procedure.
I have tried this code. However, the second sample is not drawn from the two selected teams, but from two other teams.
teams <- c('madrid','barcelona','psg','mancunited','mancity','juve')
mean <- c(14, 14.5, 13, 10, 13.4, 13.7)
df <- data.frame(teams, stats)
x <- 1:nrow(df)
a1 <- df[sample((x),2),]
y <- sample(c(a1[1,1], a1[2,1]), 1,
prob = c((a1[1,2]/(a1[1,2]+a1[2,2])), (a1[2,2]/(a1[1,2]+a1[2,2]))))
A1 <- df[y,]
A1
df <- df[!(df$teams==a1[1,1] | df$teams==a1[2,1]),]
x <- 1:nrow(df)
b1 <- df[sample((x),2),]
B1 <- df[sample(c(b1[1,1], b1[2,1]), 1,
prob = c((b1[1,2]/(b1[1,2]+b1[2,2])), (b1[2,2]/(b1[1,2]+b1[2,2])))),]
B1
You can use :
#Choose two teams
random_2_x <- sample(x, 2)
#Choose one out of the above two
random_2_1_x <- sample(random_2_x, 1)
#Choose two from those not in random_2_x
random_2_y <- sample(x[-random_2_x], 2)
#Choose one out of the above two
random_2_y_1 <- sample(random_2_y, 1)
You can use these indexes to subset from dataframe :
df[random_2_x, ]
# teams mean
#4 mancunited 10.0
#6 juve 13.7
df[random_2_1_x, ]
# teams mean
#6 juve 13.7
df[random_2_y, ]
# teams mean
#1 madrid 14.0
#2 barcelona 14.5
df[random_2_y_1, ]
# teams mean
#2 barcelona 14.5
data
df<- data.frame(teams, mean)
If you want to use your stats column for weighting the probability of selection on the second draw (choosing 1 team from the 2 already selected), you can use the following function. The prob argument of sample can be a vector of probability weights. So you don't need to calculate actual proportions manually - just provide the stats column and R will do what you want.
game <- function(df){
x <- 1:nrow(df)
a1 <- df[sample((x),2),]
y1 <- sample(a1$teams, 1, prob = a1$stats)
df2 <- df[!(df$teams %in% a1$teams),]
x <- 1:nrow(df2)
b1 <- df2[sample(x,2),]
y2 <- sample(b1$teams, 1, prob = b1$stats)
c(y1, y2)
}
Here's your data:
teams <- c('madrid','barcelona','psg','mancunited','mancity','juve')
stats <- c(14, 14.5, 13, 10, 13.4, 13.7)
df <- data.frame(teams, stats) # R 4.0.0 no need to convert strings to factors.
Replicate 10,000 games:
games <- t(replicate(10000, game(df)))
head(games)
# [,1] [,2]
# [1,] "barcelona" "mancity"
# [2,] "madrid" "mancunited"
# [3,] "madrid" "psg"
# [4,] "juve" "psg"
# [5,] "mancity" "barcelona"
# [6,] "mancity" "juve"
You can see the proportion of times each team got selected in each of your phases.
sort(prop.table(table(games[,1])), decr = TRUE) # phase 1
# barcelona madrid psg juve mancity mancunited
# 0.1797 0.1787 0.1687 0.1677 0.1663 0.1389
sort(prop.table(table(games[,2])), decr = TRUE) # phase 2
# madrid barcelona juve mancity psg mancunited
# 0.1826 0.1755 0.1691 0.1670 0.1663 0.1395

How can I conditionally calculate a new variable from pre-existing variables?

I'm a programming novice trying to calculate some ideal body weight numbers from a large dataset of height, sex, and actual body weight. I would like to create a new column in the data frame (df$ibw) based on the ideal body weight calculation for each individual.
Ideal body weight (IBW) is calculated differently for men and women.
For males... IBW = 50 + 0.91((Height in cm)-152.4)
For females... IBW = 45.5 + 0.91((Height in cm)-152.4)
set.seed(1000)
weight <- rnorm(10, 100, 20) # weight in kilograms
sex <- (0:1) # 0 for Male, 1 for Female
height <- rnorm(10, 150, 10) # height in centimeters
df <- data.frame(weight, sex, height)
df
I've been reading other posts using if else statements and other conditional formats, but I keep getting errors. This is something I will be frequently doing for datasets and I'm trying to figure out the best way to accomplish this task.
You could use a one-liner:
df$IBW <- 0.91 * (df$height - 152.4) + 50 - 4.5 * df$sex
df
# weight sex height IBW
# 1 91.08443 0 140.1757 38.87591
# 2 75.88287 1 144.4551 38.27015
# 3 100.82253 0 151.2138 48.92057
# 4 112.78777 1 148.7913 42.21606
# 5 84.26891 0 136.6396 35.65803
# 6 92.29021 1 151.7006 44.86352
# 7 90.48264 0 151.5508 49.22722
# 8 114.39501 1 150.2493 43.54288
# 9 99.62989 0 129.5341 29.19207
# 10 72.53764 1 152.1315 45.25570
If sex = 1 (female), then we just subtract 50 - 45.5 = 4.5.
This should do it
df$ibw <- ifelse(df$sex == 0, 50 + 0.91 * (df$height - 152.4),
45.5 + 0.91 * (df$height - 152.4))
Something like this should work.
df$ibw <- 0
df[df$sex == 0,]$ibw <- 50 + 0.91*(df[df$sex == 0,]$height - 152.4)
df[df$sex == 1,]$ibw <- 45.5 + 0.91*(df[df$sex == 1,]$height - 152.4)
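Since this will be done for many datasets, the calculation could be wrapped in a small function. A sketch, assuming sex is coded 0 for male and 1 for female as in the question; the function name ibw and the toy heights are made up for illustration:

```r
# Ideal body weight in kg from height in cm; sex: 0 = male, 1 = female
ibw <- function(height_cm, sex) {
  base <- ifelse(sex == 0, 50, 45.5)
  base + 0.91 * (height_cm - 152.4)
}

# Toy data: at exactly 152.4 cm the result is the base value for each sex
df <- data.frame(sex = c(0, 1, 0), height = c(152.4, 152.4, 162.4))
df$ibw <- ibw(df$height, df$sex)
df
```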

Fisher's and Pearson's test for independence

In R I have 2 datasets: group1 and group2.
For group1 I have 10 values of game_id, which is the id of a game, and number, which is the number of times that game has been played in group1.
So if we type
group1
we get this output
game_id number
1 758565
2 235289
...
10 87084
For group2 we get
game_id number
1 79310
2 28564
...
10 9048
If I want to test if there is a statistical difference between group1 and group2 for the first 2 game_id I can use Pearson chi-square test.
In R I simply create the matrix
# The first 2 'numbers' in group1
a <- c( group1[1,2] , group1[2,2] )
# The first 2 'numbers' in group2
b <- c( group2[1,2], group2[2,2] )
# Creating it on matrix-form
m <- rbind(a,b)
So m gives us
a 758565 235289
b 79310 28564
Here I can test H: "a is independent of b", meaning that the split of plays between game_id 1 and game_id 2 is the same in group1 and group2.
In R we type chisq.test(m) and we get a very low p-value, meaning that we can reject H: a and b are not independent.
How should one find game_id's that are played significantly more in group1 than in group2 ?
I created a simpler version of only 3 games. I'm using a chi squared test and a proportions comparison test. Personally, I prefer the second one as it gives you an idea about what percentages you're comparing. Run the script and make sure you understand the process.
# dataset of group 1
dt_group1 = data.frame(game_id = 1:3,
number_games = c(758565,235289,87084))
dt_group1
# game_id number_games
# 1 1 758565
# 2 2 235289
# 3 3 87084
# add extra variables
dt_group1$number_rest_games = sum(dt_group1$number_games) - dt_group1$number_games # needed for chisq.test
dt_group1$number_all_games = sum(dt_group1$number_games) # needed for prop.test
dt_group1$Prc = dt_group1$number_games / dt_group1$number_all_games # just to get an idea about the percentages
dt_group1
# game_id number_games number_rest_games number_all_games Prc
# 1 1 758565 322373 1080938 0.70176550
# 2 2 235289 845649 1080938 0.21767113
# 3 3 87084 993854 1080938 0.08056336
# dataset of group 2
dt_group2 = data.frame(game_id = 1:3,
number_games = c(79310,28564,9048))
# add extra variables
dt_group2$number_rest_games = sum(dt_group2$number_games) - dt_group2$number_games
dt_group2$number_all_games = sum(dt_group2$number_games)
dt_group2$Prc = dt_group2$number_games / dt_group2$number_all_games
# input the game id you want to investigate
input_game_id = 1
# create a table of successes (games played) and failures (games not played)
dt_test = rbind(c(dt_group1$number_games[dt_group1$game_id==input_game_id], dt_group1$number_rest_games[dt_group1$game_id==input_game_id]),
c(dt_group2$number_games[dt_group2$game_id==input_game_id], dt_group2$number_rest_games[dt_group2$game_id==input_game_id]))
# perform chi sq test
chisq.test(dt_test)
# Pearson's Chi-squared test with Yates' continuity correction
#
# data: dt_test
# X-squared = 275.9, df = 1, p-value < 2.2e-16
# create a vector of successes (games played) and vector of total games
x = c(dt_group1$number_games[dt_group1$game_id==input_game_id], dt_group2$number_games[dt_group2$game_id==input_game_id])
y = c(dt_group1$number_all_games[dt_group1$game_id==input_game_id], dt_group2$number_all_games[dt_group2$game_id==input_game_id])
# perform test of proportions
prop.test(x,y)
# 2-sample test for equality of proportions with continuity correction
#
# data: x out of y
# X-squared = 275.9, df = 1, p-value < 2.2e-16
# alternative hypothesis: two.sided
# 95 percent confidence interval:
# 0.02063233 0.02626776
# sample estimates:
# prop 1 prop 2
# 0.7017655 0.6783155
The main thing is that chisq.test compares counts/proportions, so you need to provide the number of "successes" and "failures" for the groups you compare (a contingency table as input). prop.test is another counts/proportions test, for which you provide the number of "successes" and the "totals".
Now that you're happy with the result and you saw how the process works I'll add a more efficient way to perform those tests.
The first one is using dplyr and broom packages:
library(dplyr)
library(broom)
# dataset of group 1
dt_group1 = data.frame(game_id = 1:3,
number_games = c(758565,235289,87084),
group_id = 1) ## adding the id of the group
# dataset of group 2
dt_group2 = data.frame(game_id = 1:3,
number_games = c(79310,28564,9048),
group_id = 2) ## adding the id of the group
# combine datasets
dt = rbind(dt_group1, dt_group2)
dt %>%
group_by(group_id) %>% # for each group id
mutate(number_all_games = sum(number_games), # create new columns
number_rest_games = number_all_games - number_games,
Prc = number_games / number_all_games) %>%
group_by(game_id) %>% # for each game
do(tidy(prop.test(.$number_games, .$number_all_games))) %>% # perform the test
ungroup()
# game_id estimate1 estimate2 statistic p.value parameter conf.low conf.high
# (int) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
# 1 1 0.70176550 0.67831546 275.89973 5.876772e-62 1 0.020632330 0.026267761
# 2 2 0.21767113 0.24429962 435.44091 1.063385e-96 1 -0.029216006 -0.024040964
# 3 3 0.08056336 0.07738492 14.39768 1.479844e-04 1 0.001558471 0.004798407
The other one is using data.table and broom packages:
library(data.table)
library(broom)
# dataset of group 1
dt_group1 = data.frame(game_id = 1:3,
number_games = c(758565,235289,87084),
group_id = 1) ## adding the id of the group
# dataset of group 2
dt_group2 = data.frame(game_id = 1:3,
number_games = c(79310,28564,9048),
group_id = 2) ## adding the id of the group
# combine datasets
dt = data.table(rbind(dt_group1, dt_group2))
# create new columns for each group
dt[, number_all_games := sum(number_games), by=group_id]
dt[, `:=`(number_rest_games = number_all_games - number_games,
Prc = number_games / number_all_games) , by=group_id]
# for each game id compare percentages
dt[, tidy(prop.test(.SD$number_games, .SD$number_all_games)) , by=game_id]
# game_id estimate1 estimate2 statistic p.value parameter conf.low conf.high
# 1: 1 0.70176550 0.67831546 275.89973 5.876772e-62 1 0.020632330 0.026267761
# 2: 2 0.21767113 0.24429962 435.44091 1.063385e-96 1 -0.029216006 -0.024040964
# 3: 3 0.08056336 0.07738492 14.39768 1.479844e-04 1 0.001558471 0.004798407
You can see that each row represents one game and the comparison is between group 1 and 2. You can get the p-values from the corresponding column, along with other information about the test/comparison.
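The tests above are two-sided. Because the question asks for games played significantly more in group1, a one-sided variant with a multiple-testing correction may be closer to what is wanted. A sketch on the same three-game counts, assuming a 0.05 cut-off and Holm adjustment (both are choices made for this example); the names g1, g2, p_greater and result are hypothetical:

```r
# Counts from the simplified three-game example
g1 <- c(758565, 235289, 87084)
g2 <- c(79310, 28564, 9048)

# One-sided test per game: is the proportion in group 1 greater than in group 2?
p_greater <- vapply(seq_along(g1), function(i) {
  prop.test(x = c(g1[i], g2[i]), n = c(sum(g1), sum(g2)),
            alternative = "greater")$p.value
}, numeric(1))

# Adjust for testing several games at once
p_adj <- p.adjust(p_greater, method = "holm")

result <- data.frame(game_id = seq_along(g1), p_greater, p_adj,
                     more_in_group1 = p_adj < 0.05)
result
```

Games with more_in_group1 equal to TRUE are the ones whose share of plays is significantly larger in group1 under these assumptions.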

Using R to remove data which is below a quartile threshold

I am creating correlations using R, with the following code:
Values<-read.csv(inputFile, header = TRUE)
O<-Values$Abundance_O
S<-Values$Abundance_S
cor(O,S)
pear_cor<-round(cor(O,S),4)
outfile<-paste(inputFile, ".jpg", sep = "")
jpeg(filename = outfile, width = 15, height = 10, units = "in", pointsize = 10, quality = 75, bg = "white", res = 300, restoreConsole = TRUE)
rx<-range(0,20000000)
ry<-range(0,200000)
plot(rx,ry, ylab="S", xlab="O", main="O vs S", type="n")
points(O,S, col="black", pch=3, lwd=1)
mtext(sprintf("%s %.4f", "pearson: ", pear_cor), adj=1, padj=0, side = 1, line = 4)
dev.off()
pear_cor
I now need to find the lower quartile for each set of data and exclude data that is within the lower quartile. I would then like to rewrite the data without those values and use the new column of data in the correlation analysis (because I want to threshold the data by the lower quartile). If there is a way I can write this so that it is easy to change the threshold by applying arguments from Java (as I have with the input file name) that's even better!
Thank you so much.
I have now implemented the answer below and that is working; however, I need to keep the pairs of data together for the correlation. Here is an example of my data (from csv):
Abundance_O Abundance_S
3635900.752 1390.883073
463299.4622 1470.92626
359101.0482 989.1609251
284966.6421 3248.832403
415283.663 2492.231265
2076456.856 10175.48946
620286.6206 5074.268802
3709754.717 269.6856808
803321.0892 118.2935093
411553.0203 4772.499758
50626.83554 17.29893001
337428.8939 203.3536852
42046.61549 152.1321255
1372013.047 5436.783169
939106.3275 7080.770535
96618.01393 1967.834701
229045.6983 948.3087208
4419414.018 23735.19352
So I need to exclude both values in the row if one does not meet my quartile threshold (0.25 quartile). So if the quartile for O was 45000 then the row "42046.61549,152.1321255" would be removed. Is this possible? If I read in both columns as a dataframe can I search each column separately? Or find the quartiles and then input that value into code to remove the appropriate rows?
Thanks again, and sorry for the evolution of the question!
Please try to provide a reproducible example, but if you have data in a data.frame, you can subset it using the quantile function as the logical test. For instance, in the following data we want to select only rows from the dataframe where the value of the measured variable 'Val' is above the bottom quartile:
# set.seed so you can reproduce these values exactly on your system
set.seed(39856)
df <- data.frame( ID = 1:10 , Val = runif(10) )
df
ID Val
1 1 0.76487516
2 2 0.59755578
3 3 0.94584374
4 4 0.72179297
5 5 0.04513418
6 6 0.95772248
7 7 0.14566118
8 8 0.84898704
9 9 0.07246594
10 10 0.14136138
# Now to select only rows where the value of our measured variable 'Val' is above the bottom 25% quartile
df[ df$Val > quantile(df$Val , 0.25 ) , ]
ID Val
1 1 0.7648752
2 2 0.5975558
3 3 0.9458437
4 4 0.7217930
6 6 0.9577225
7 7 0.1456612
8 8 0.8489870
# And check the value of the bottom 25% quantile...
quantile(df$Val , 0.25 )
25%
0.1424363
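To keep the pairs together as the updated question asks, the same test can be applied to each column and the results combined, so a row is dropped when either value falls at or below its column's threshold. A sketch with a hypothetical helper filter_by_quantile and toy values; the column names follow the question, and rows with missing values are dropped as an assumption:

```r
# Keep only rows where every listed column is above its own quantile threshold
filter_by_quantile <- function(df, cols, prob = 0.25) {
  keep <- rep(TRUE, nrow(df))
  for (col in cols) {
    threshold <- quantile(df[[col]], prob, na.rm = TRUE)
    keep <- keep & !is.na(df[[col]]) & df[[col]] > threshold
  }
  df[keep, , drop = FALSE]
}

# Toy data: the lower quartile of 1:8 is 2.75 for both columns
Values <- data.frame(Abundance_O = 1:8, Abundance_S = 8:1)
filtered <- filter_by_quantile(Values, c("Abundance_O", "Abundance_S"))
cor(filtered$Abundance_O, filtered$Abundance_S)
```

The prob argument is the adjustable threshold; when the script is run with Rscript it could be read from commandArgs(trailingOnly = TRUE) in the same way as the input file name.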
Although this is an old question, I came across it during research of my own and I arrived at a solution that someone may be interested in.
I first defined a function which converts a numerical vector into its quantile groups. Parameter n determines the number of quantile groups (n = 4 for quartiles, n = 10 for deciles).
qgroup = function(numvec, n = 4){
qtile = quantile(numvec, probs = seq(0, 1, 1/n))
out = sapply(numvec, function(x) sum(x >= qtile[-(n+1)]))
return(out)
}
Function example:
v = 1:20
qgroup(v)
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
Consider now the following data:
library(data.table)
dt = data.table(
A0 = runif(100),
A1 = runif(100)
)
We apply qgroup() across the data to obtain two quartile group columns:
cols = colnames(dt)
qcols = c('Q0', 'Q1')
dt[, (qcols) := lapply(.SD, qgroup), .SDcols = cols]
head(dt)
A0 A1 Q0 Q1
1: 0.72121846 0.1908863 3 1
2: 0.70373594 0.4389152 3 2
3: 0.04604934 0.5301261 1 3
4: 0.10476643 0.1108709 1 1
5: 0.76907762 0.4913463 4 2
6: 0.38265848 0.9291649 2 4
Lastly, we only include rows for which both quartile groups are above the first quartile:
dt = dt[Q0 > 1 & Q1 > 1]
