Using R to remove data which is below a quartile threshold

Using R to remove data which is below a quartile threshold - r

I am creating correlations using R, with the following code:
Values<-read.csv(inputFile, header = TRUE)
O<-Values$Abundance_O
S<-Values$Abundance_S
cor(O,S)
pear_cor<-round(cor(O,S),4)
outfile<-paste(inputFile, ".jpg", sep = "")
jpeg(filename = outfile, width = 15, height = 10, units = "in", pointsize = 10, quality = 75, bg = "white", res = 300, restoreConsole = TRUE)
rx<-range(0,20000000)
ry<-range(0,200000)
plot(rx,ry, ylab="S", xlab="O", main="O vs S", type="n")
points(O,S, col="black", pch=3, lwd=1)
mtext(sprintf("%s %.4f", "pearson: ", pear_cor), adj=1, padj=0, side = 1, line = 4)
dev.off()
pear_cor
I now need to find the lower quartile for each set of data and exclude data that is within the lower quartile. I would then like to rewrite the data without those values and use the new column of data in the correlation analysis (because I want to threshold the data by the lower quartile). If there is a way I can write this so that it is easy to change the threshold by applying arguments from Java (as I have with the input file name) that's even better!
Thank you so much.
I have now implicated the answer below and that is working, however I need to keep the pairs of data together for the correlation. Here is an example of my data (from csv):
Abundance_O Abundance_S
3635900.752 1390.883073
463299.4622 1470.92626
359101.0482 989.1609251
284966.6421 3248.832403
415283.663 2492.231265
2076456.856 10175.48946
620286.6206 5074.268802
3709754.717 269.6856808
803321.0892 118.2935093
411553.0203 4772.499758
50626.83554 17.29893001
337428.8939 203.3536852
42046.61549 152.1321255
1372013.047 5436.783169
939106.3275 7080.770535
96618.01393 1967.834701
229045.6983 948.3087208
4419414.018 23735.19352
So I need to exclude both values in the row if one does not meet my quartile threshold (0.25 quartile). So if the quartile for O was 45000 then the row "42046.61549,152.1321255" would be removed. Is this possible? If I read in both columns as a dataframe can I search each column separately? Or find the quartiles and then input that value into code to remove the appropriate rows?
Thanks again, and sorry for the evolution of the question!

Please try to provide a reproducible example, but if you have data in a data.frame, you can subset it using the quantile function as the logical test. For instance, in the following data we want to select only rows from the dataframe where the value of the measured variable 'Val' is above the bottom quartile:
# set.seed so you can reproduce these values exactly on your system
set.seed(39856)
df <- data.frame( ID = 1:10 , Val = runif(10) )
df
ID Val
1 1 0.76487516
2 2 0.59755578
3 3 0.94584374
4 4 0.72179297
5 5 0.04513418
6 6 0.95772248
7 7 0.14566118
8 8 0.84898704
9 9 0.07246594
10 10 0.14136138
# Now to select only rows where the value of our measured variable 'Val' is above the bottom 25% quartile
df[ df$Val > quantile(df$Val , 0.25 ) , ]
ID Val
1 1 0.7648752
2 2 0.5975558
3 3 0.9458437
4 4 0.7217930
6 6 0.9577225
7 7 0.1456612
8 8 0.8489870
# And check the value of the bottom 25% quantile...
quantile(df$Val , 0.25 )
25%
0.1424363

Although this is an old question, I came across it during research of my own and I arrived at a solution that someone may be interested in.
I first defined a function which will convert a numerical vector into its quantile groups. Parameter n determines the quantile length (n = 4 for quartiles, n = 10 for deciles).
qgroup = function(numvec, n = 4){
qtile = quantile(numvec, probs = seq(0, 1, 1/n))
out = sapply(numvec, function(x) sum(x >= qtile[-(n+1)]))
return(out)
}
Function example:
v = rep(1:20)
> qgroup(v)
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
Consider now the following data:
dt = data.table(
A0 = runif(100),
A1 = runif(100)
)
We apply qgroup() across the data to obtain two quartile group columns:
cols = colnames(dt)
qcols = c('Q0', 'Q1')
dt[, (qcols) := lapply(.SD, qgroup), .SDcols = cols]
head(dt)
> A0 A1 Q0 Q1
1: 0.72121846 0.1908863 3 1
2: 0.70373594 0.4389152 3 2
3: 0.04604934 0.5301261 1 3
4: 0.10476643 0.1108709 1 1
5: 0.76907762 0.4913463 4 2
6: 0.38265848 0.9291649 2 4
Lastly, we only include rows for which both quartile groups are above the first quartile:
dt = dt[Q0 + Q1 > 2]

Related

How to generate random data based on some criteria in R

I wan to generate 300 random data based on the following criteria:
Class value
0 1-8
1 9-11
2 12-14
3 15-16
4 17-20
Logic: when class = 0, I want to get random data between 1-8. Or when class= 1, I want to get random data between 9-11 and so on.
This gives me the following hypothetical table as an example:
Class Value
0 7
0 4
1 10
1 9
1 11
. .
. .
I want to have equal and unequal mixtures in each class

You could do:
df <- data.frame(Class = sample(0:4, 300, TRUE))
df$Value <- sapply(list(1:8, 9:11, 12:14, 15:16, 17:20)[df$Class + 1],
sample, size = 1)
This gives you a data frame with 300 rows and appropriate numbers for each class:
head(df)
#> Class Value
#> 1 0 3
#> 2 1 10
#> 3 4 19
#> 4 2 12
#> 5 4 19
#> 6 1 10
Created on 2022-12-30 with reprex v2.0.2

Providing some additional flexibility in the code, so that different probabilities can be used in the sampling, and having the smallest possible amount of hard-coded values:
# load data.table
library(data.table)
# this is the original data
a = structure(list(Class = 0:4, value = c("1-8", "9-11", "12-14",
"15-16", "17-20")), row.names = c(NA, -5L), class = c("data.table",
"data.frame"))
# this is to replace "-" by ":", we will use that in a second
a[, value := gsub("\\-", ":", value)]
# this is a vector of EQUAL probabilities
probs = rep(1/a[, uniqueN(Class)], a[, uniqueN(Class)])
# This is a vector of UNEQUAL Probabilities. If wanted, it should be
# uncommented and adjusted manually
# probs = c(0.05, 0.1, 0.2, 0.4, 0.25)
# This is the number of Class samples wanted
numberOfSamples = 300
# This is the working horse
a[sample(.N, numberOfSamples, TRUE, prob = probs), ][,
smpl := apply(.SD,
1,
function(x) sample(eval(parse(text = x)), 1)),
.SDcols = "value"][,
.(Class, smpl)]
What is good about this code?
If you change your classes, or the value ranges, the only change you need to be concerned about is the original data frame (a, as I called it)
If you want to use uneven probabilities for your sampling, you can set them and the code still runs.
If you want to take a smaller or larger sample, you don't have to edit your code, you only change the value of a variable.

How to apply function with multiple conditions on multiple columns to get new conditional columns in R

Hello all a R noob here,
I hope you guys can help me with the following.
I need to transform multiple columns in my dataset to new columns based on the values in the original columns multiple times. This means that for the first transformation I use column 1, 2, 3 and if certain conditions are met the output results a new column with a 1 or a 0, for the second transformation I use columns 4, 5, 6 and the output should be a 1 or a 0 also. I have to do this 18 times. I already wrote a function which succesfully does the transformation if I impute the variables manually, but I would like to apply this function to all the desired columns at once. My desired output would be 18 new columns with 0's and 1's. Finally I will make a last column which will display a 1 if any of the 18 columns is a 1 and a 0 otherwise.
df <- data.frame(admiss1 = sample(seq(as.Date('1990/01/01'), as.Date('2000/01/01'), by="day"), 12),
admiss2 = sample(seq(as.Date('1990/01/01'), as.Date('2000/01/01'), by="day"), 12),
admiss3 = sample(seq(as.Date('1990/01/01'), as.Date('2000/01/01'), by="day"), 12),
visit1 = sample(seq(as.Date('1995/01/01'), as.Date('1996/01/01'), by="day"), 12),
visit2 = sample(seq(as.Date('1997/01/01'), as.Date('1998/01/01'), by="day"), 12),
reason1 = sample(3,12, replace = T),
reason2 = sample(3,12, replace = T),
reason3 = sample(3,12, replace = T))
df$discharge1 <- df$admiss1 + 10
df$discharge2 <- df$admiss2 + 10
df$discharge3 <- df$admiss3 + 10
#every discharge date is 10 days after the admission date for the sake of this example
#now I have the following dataframe
#for the sake of it I included only 3 dates and reasons(instead of 18)
admiss1 admiss2 admiss3 visit1 visit2 reason1 reason2 reason3 discharge1 discharge2 discharge3
1 1990-03-12 1992-04-04 1998-07-31 1995-01-24 1997-10-07 2 1 3 1990-03-22 1992-04-14 1998-08-10
2 1999-05-18 1990-11-25 1995-10-04 1995-03-06 1997-03-13 1 2 1 1999-05-28 1990-12-05 1995-10-14
3 1993-07-16 1998-06-10 1991-07-05 1995-11-06 1997-11-15 1 1 2 1993-07-26 1998-06-20 1991-07-15
4 1991-07-05 1992-06-17 1995-10-12 1995-05-14 1997-05-02 2 1 3 1991-07-15 1992-06-27 1995-10-22
5 1995-08-16 1999-03-08 1992-04-03 1995-02-20 1997-01-03 1 3 3 1995-08-26 1999-03-18 1992-04-13
6 1999-10-07 1991-12-26 1995-05-05 1995-10-24 1997-10-15 3 1 1 1999-10-17 1992-01-05 1995-05-15
7 1998-03-18 1992-04-18 1993-12-31 1995-11-14 1997-06-14 3 2 2 1998-03-28 1992-04-28 1994-01-10
8 1992-08-04 1991-09-16 1992-04-23 1995-05-29 1997-10-11 1 2 3 1992-08-14 1991-09-26 1992-05-03
9 1997-02-20 1990-02-12 1998-03-08 1995-10-09 1997-12-29 1 1 3 1997-03-02 1990-02-22 1998-03-18
10 1992-09-16 1997-06-16 1997-07-18 1995-12-11 1997-01-12 1 2 2 1992-09-26 1997-06-26 1997-07-28
11 1991-01-25 1998-04-07 1999-07-02 1995-12-27 1997-05-28 3 2 1 1991-02-04 1998-04-17 1999-07-12
12 1996-02-25 1993-03-30 1997-06-25 1995-09-07 1997-10-18 1 3 2 1996-03-06 1993-04-09 1997-07-05
admissdate <- function(admis, dis, rsn, vis1, vis2){
xnew <- ifelse(df[eval(substitute(admis))] >= df[eval(substitute(vis1))] & df[eval(substitute(dis))] <= df[eval(substitute(vis2))] & df[eval(substitute(rsn))] == 2, 1, 0)
xnew <- ifelse(df[eval(substitute(admis))] >= df[eval(substitute(vis1))] & df[eval(substitute(admis))] <= df[eval(substitute(vis2))] & df[eval(substitute(dis))] >= df[eval(substitute(vis2))] & df[eval(substitute(rsn))] == 2, 1, xnew)
return(xnew)
}
I wrote this function to generate a 1 if the conditions are true and a 0 if the conditions are false.
-Condition 1: admission date and discharge date are between visit 1 and visit 2 + admission reason is 2.
-Condition 2: admission date is after visit 1 but before visit 2 and the discharge date is after visit 2 with also admission reason 2.
It should return 1 if these conditions are true and 0 if these conditions are false. Eventually, I will end up with 18 new variables with 1's or 0's and will combine them to make one variable with Admission between visit 1 and visit 2 (with reason 2).
If I manually impute the variable names it will work, but I cant make it work for all the variables at once. I tried to make a string vector with all the admiss dates, discharge dates and reasons and tried to transform them with mapply, but this does not work.
admiss <- paste0(rep("admiss", 3), 1:3)
discharge <- paste0(rep("discharge", 3), 1:3)
reason <- paste0(rep("reason", 3), 1:3)
visit1 <- rep("visit1",3)
visit2 <- rep("visit2",3)
mapply(admissdate, admis = admiss, dis = discharge, rsn = reason, vis1 = visit1, vis2 = visit2)
I have also considered lapply but here you have to define an X = ..., which I think I cannot use because I have multiple column that I want to impute, please correct me if I am wrong!
Also I considered using a for loop, but I don't know how to use that with multiple conditions.
Any help would be greatly appreciated!

You can change the function to accept values instead of column names.
admissdate <- function(admis, dis, rsn, vis1, vis2){
xnew <- as.integer(admis >= vis1 & dis <= vis2 & rsn == 2)
xnew <- ifelse(admis >= vis1 & admis <= vis2 & dis >= vis2 & rsn == 2, 1, xnew)
return(xnew)
}
Now create new columns -
admiss <- paste0("admiss", 1:3)
discharge <- paste0("discharge", 1:3)
reason <- paste0("reason", 1:3)
new_col <- paste0('newcol', 1:3)
df[new_col] <- Map(function(x, y, z) admissdate(x, y, z, df$visit1, df$visit2),
df[admiss],df[discharge],df[reason])
#Additional column will be 1 if any of the value in the new column is 1.
df$result <- as.integer(rowSums(df[new_col]) > 0)
df

Loop same actions in R

Have an issue here.
I want to loop my operations in R, however, do not know how to make this properly and efficiently.
I have several different sized datasets and performing the same block of code each time is time-consuming.
Here is the code I need to apply to each of the datasets and write the data or the output from a model into the datasets with different names.
##########################################################################################################################
#the combined list of separate data frames where the last letter is changing A, B, C...
z <- list(Data_A, Data_B, Data_C)
#need to loop these operations performed by using data from datasets. Here is an example by using data from Data_A dataset.
# TFP estimation by using ACF method
ACF_A <- prodest::prodestACF(Data_A$turn, fX = Data_A$cogs, sX = Data_A$tfa, pX = Data_A$cogs, idvar = Data_A$ID, timevar = Data_A$Year,
R = 100, cX = NULL, opt = 'DEoptim', theta0 = NULL, cluster = NULL)
omegaACF_A <- prodest::omega(ACF_A)
Data_A$omegaACF_A <- prodest::omega(ACF_A)
#########################################################################################################################
# Growth variables
Data_A <- Data_A %>%
arrange(ID, Year) %>%
group_by(ID) %>%
mutate(domegaACF_A = omegaACF_A - dplyr::lag(omegaACF_A),
debt = LOAN + LTD,
ddebt = debt - dplyr::lag(debt),
dsales = SALE - dplyr::lag(SALE)) %>%
ungroup
# Panel data frame
PData_A <- pdata.frame(Data_A, index = c("ID","Year"))
# Within estimator
within_2way_A <- plm(domegaACF_A ~ dplyr::lag(domegaACF_A, 1) + dplyr::lag(domegaACF_A, 2) + ddebt + lag(ff1, 1) + ddebt:lag(ff1, 1) + log(Age) + ta + dsales,
data = PData_A, effect = "twoways", model ="within", index = c("ID", "Year"))
The main problem is that I do not know how to store the data in separate datasets with according names. For example, in the block of the following code, _A should be changing to _B, _C according to the dataset that is used.
ACF_A <- prodest::prodestACF(Data_A$turn, fX = Data_A$cogs, sX = Data_A$tfa, pX = Data_A$cogs, idvar = Data_A$ID, timevar = Data_A$Year,
R = 100, cX = NULL, opt = 'DEoptim', theta0 = NULL, cluster = NULL)
omegaACF_A <- prodest::omega(ACF_A)
Data_A$omegaACF_A <- prodest::omega(ACF_A)
I know there are lapply and for loops but I do not know how to use them with changing names of storing variables:
z <- list (df1, df2, df3)
for (i in z){
ACF_[1 or 2 or 3] <- prodest::prodestACF(i$turn, fX = i$cogs, sX = i$tfa, pX = i$cogs, idvar = i$ID, timevar = i$Year,
R = 100, cX = NULL, opt = 'DEoptim', theta0 = NULL, cluster = NULL)
omegaACF_[1 or 2 or 3] <- prodest::omega(ACF_[1 or 2 or 3])
Data_[]$omegaACF_[1 or 2 or 3] <- prodest::omega(ACF_[1 or 2 or 3])
{
UPD: Here are several datasets: https://drive.google.com/drive/folders/1gBV2ZkywW6JqDjRICafCwtYhh2DHWaUq?usp=sharing
UPD2:
Data_A
turn cogs tfa SALE
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
Data_B
turn cogs tfa SALE
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
After running the loop I need:
ACF_A, ACF_B, etc. storage variable, where the results of the estimations of prodest function will be stored
omegaACF_A, omegaACF_B, etc. storage where omega variable from prodest will be stored
omegaACF_A, omegaACF_B results of estimations should be added to Data_A, Data_B datasets accordingly as a new variable.
After that, in Data_A, Data_B datasets, growth variables should be created
The plm regression should be stored in within_2way_A, within_2way_B accordingly
So in the end, I need:
Data_A
turn cogs tfa SALE omegaACF_A domegaACF_A debt ddebt dsales
1 1 1 1 0.1 NA 1 NA NA
2 2 2 2 0.3 0.2 2 1 1
3 3 3 3 0.6 0.3 3 1 1
4 4 4 4 0.9 0.3 4 1 1
Data_B
turn cogs tfa SALE omegaACF_B domegaACF_B debt ddebt dsales
5 5 5 5 1.1 NA 5 NA NA
6 6 6 6 1.5 0.4 6 1 1
7 7 7 7 1.7 0.2 7 1 1
8 8 8 8 2.0 0.3 8 1 1

One approach is to separate the ACF estimation and omega calculation from the summary creation with different lapply() commands. Since you did not supply any example data, it's a blind shot, but try the following. Note that I assumed that every dataset has the same column names! In case it doesn't solve your problem I will remove my answer.
data <- list(Data_A, Data_B, Data_C)
Estimates <- lapply(data, function(x){
prodest::prodestACF(x$turn, fX = x$cogs, sX = x$tfa, pX = x$cogs, idvar = x$ID, timevar = x$Year,
R = 100, cX = NULL, opt = 'DEoptim', theta0 = NULL, cluster = NULL)
}
Summaries_estimates <- lapply(Estimates, summary)
Omegas <- lapply(Estimates, function(x) prodest::omega(x))
Summaries_omega <- lapply(Omegas, summary)
Alternative using loops
Since you asked, it is also possible to define a loop that loops everything together though this is usually much slower. For this, we have to define empty lists that carry the results of ACF etc. and loop over the lists of data.frames that we already created.
data <- list(Data_A, Data_B, Data_C)
Estimates <- list()
Summaries_estimates <- list()
Omegas <- list()
Summaries_omegas <- list()
for(i in 1:(length(data))){
Estimates[[i]] <- prodest::prodestACF(data[[i]]$turn, fX = data[[i]]$cogs, sX = data[[i]]$tfa, pX = data[[i]]$cogs, idvar = data[[i]]$ID, timevar = data[[i]]$Year,
R = 100, cX = NULL, opt = 'DEoptim', theta0 = NULL, cluster = NULL)
}
Summaries_estimates[[i]] <- summary(Estimates[[i]])
Omegas[[i]] <- prodest::omega(Estimates[[i]]))
Summaries_omega[[i]] <- summary(Omegas[[i]])
}

Calculate the mean per subject and repeat the value for each subject's row

This is the first time that I ask a question on stack overflow. I have tried searching for the answer but I cannot find exactly what I am looking for. I hope someone can help.
I have a huge data set of 20416 observation. Basically, I have 83 subjects and for each subject I have several observations. However, the number of observations per subject is not the same (e.g. subject 1 has 256 observations, while subject 2 has only 64 observations).
I want to add an extra column containing the mean of the observations for each subject (the observations are reading times (RT)).
I tried with the aggregate function:
aggregate (RT ~ su, data, mean)
This formula returns the correct mean per subject. But then I cannot simply do the following:
data$mean <- aggregate (RT ~ su, data, mean)
as R returns this error:
Error in $<-.data.frame(tmp, "mean", value = list(su = 1:83, RT
= c(378.1328125, : replacement has 83 rows, data has 20416
I understand that the formula lacks a command specifying that the mean for each subject has to be repeated for all the subject's rows (e.g. if subject 1 has 256 rows, the mean for subject 1 has to be repeated for 256 rows, if subject 2 has 64 rows, the mean for subject 2 has to be repeated for 64 rows and so forth).
How can I achieve this in R?

The data.table syntax lends itself well to this kind of problem:
Dt[, Mean := mean(Value), by = "ID"][]
# ID Value Mean
# 1: a 0.05881156 0.004426491
# 2: a -0.04995858 0.004426491
# 3: b 0.64054432 0.038809830
# 4: b -0.56292466 0.038809830
# 5: c 0.44254622 0.099747707
# 6: c -0.10771992 0.099747707
# 7: c -0.03558318 0.099747707
# 8: d 0.56727423 0.532377247
# 9: d -0.60962095 0.532377247
# 10: d 1.13808538 0.532377247
# 11: d 1.03377033 0.532377247
# 12: e 1.38789640 0.568760936
# 13: e -0.57420308 0.568760936
# 14: e 0.89258949 0.568760936
As we are applying a grouped operation (by = "ID"), data.table will automatically replicate each group's mean(Value) the appropriate number of times (avoiding the error you ran into above).
Data:
Dt <- data.table::data.table(
ID = sample(letters[1:5], size = 14, replace = TRUE),
Value = rnorm(14))[order(ID)]

Staying in Base R, ave is intended for this use:
data$mean = with(data, ave(x = RT, su, FUN = mean))

Simply merge your aggregated means data with full dataframe joined by the subject:
aggdf <- aggregate (RT ~ su, data, mean)
names(aggdf)[2] <- "MeanOfRT"
df <- merge(df, aggdf, by="su")

Another compelling way of handling this without generating extra data objects is by using group_by of dplyr package:
# Generating some data
data <- data.table::data.table(
su = sample(letters[1:5], size = 14, replace = TRUE),
RT = rnorm(14))[order(su)]
# Performing
> data %>% group_by(su) %>%
+ mutate(Mean = mean(RT)) %>%
+ ungroup()
Source: local data table [14 x 3]
su RT Mean
1 a -1.62841746 0.2096967
2 a 0.07286149 0.2096967
3 a 0.02429030 0.2096967
4 a 0.98882343 0.2096967
5 a 0.95407214 0.2096967
6 a 1.18823435 0.2096967
7 a -0.13198711 0.2096967
8 b -0.34897914 0.1469982
9 b 0.64297557 0.1469982
10 c -0.58995261 -0.5899526
11 d -0.95995198 0.3067978
12 d 1.57354754 0.3067978
13 e 0.43071258 0.2462978
14 e 0.06188307 0.2462978

Plotting only a subset of the points?

I am trying to plot the CDF curve for a large dataset containing about 29 million values using ggplot. The way I am computing this is like this:
mycounts = ddply(idata.frame(newdata), .(Type), transform, ecd = ecdf(Value)(Value))
plot = ggplot(mycounts, aes(x=Value, y=ecd))
This is taking ages to plot. I was wondering if there is a clean way to plot only a sample of this dataset (say, every 10th point or 50th point) without compromising on the actual result?

I am not sure about your data structure, but a simple sample call might be enough:
n <- nrow(mycounts) # number of cases in data frame
mycounts <- mycounts[sample(n, round(n/10)), ] # get an n/10 sample to the same data frame

Instead of taking every n-th point, can you quantize your data set down to a sufficient resolution before plotting it? That way, you won't have to plot resolution you don't need (or can't see).
Here's one way you can do it. (The function I've written below is generic, but the example uses names from your question.)
library(ggplot2)
library(plyr)
## A data set containing two ramps up to 100, one by 1, one by 10
tens <- data.frame(Type = factor(c(rep(10, 10), rep(1, 100))),
Value = c(1:10 * 10, 1:100))
## Given a data frame and ddply-style arguments, partition the frame
## using ddply and summarize the values in each partition with a
## quantized ecdf. The resulting data frame for each partition has
## two columns: value and value_ecdf.
dd_ecdf <- function(df, ..., .quantizer = identity, .value = value) {
value_colname <- deparse(substitute(.value))
ddply(df, ..., .fun = function(rdf) {
xs <- rdf[[value_colname]]
qxs <- sort(unique(.quantizer(xs)))
data.frame(value = qxs, value_ecdf = ecdf(xs)(qxs))
})
}
## Plot each type's ECDF (w/o quantization)
tens_cdf <- dd_ecdf(tens, .(Type), .value = Value)
qplot(value, value_ecdf, color = Type, geom = "step", data = tens_cdf)
## Plot each type's ECDF (quantizing to nearest 25)
rounder <- function(...) function(x) round_any(x, ...)
tens_cdfq <- dd_ecdf(tens, .(Type), .value = Value, .quantizer = rounder(25))
qplot(value, value_ecdf, color = Type, geom = "step", data = tens_cdfq)
While the original data set and the ecdf set had 110 rows, the quantized-ecdf set is much reduced:
> dim(tens)
[1] 110 2
> dim(tens_cdf)
[1] 110 3
> dim(tens_cdfq)
[1] 10 3
> tens_cdfq
Type value value_ecdf
1 1 0 0.00
2 1 25 0.25
3 1 50 0.50
4 1 75 0.75
5 1 100 1.00
6 10 0 0.00
7 10 25 0.20
8 10 50 0.50
9 10 75 0.70
10 10 100 1.00
I hope this helps! :-)

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Using R to remove data which is below a quartile threshold - r

Related

How to generate random data based on some criteria in R

How to apply function with multiple conditions on multiple columns to get new conditional columns in R

Loop same actions in R

Calculate the mean per subject and repeat the value for each subject's row

Plotting only a subset of the points?

Categories

Resources