R - Adstock by group

I am trying to create an adstock effect for a variable (adstock is defined as the current value plus the previous adstocked value multiplied by an adstock rate). I have a table abc that has two columns: GEOG (A, B, C) and GRPs (1 to 6), for a total of 18 observations. I would like to create a variable b by taking the first obs of the first GEOG and adstocking it with a rate of, say, .5. Then, when I get to the first obs of the second GEOG, reinitialize it to GRPs and do it again. I have written code that works really well with only one geography, but I cannot figure out how to do it BY geography. Coming from a different statistical language, I am still wrapping my head around the way R works. Can anyone help? Thanks in advance. Here is the code that works for one GEOG:
rate1 <- 0.5
rate2 <- 0
for (i in 1:nrow(abc)) {
  if (i == 1) {
    abc[i, 3] <- abc[i, 2]
  } else if (i == 2) {
    # Effect = impression + last week effect * decay rate
    abc[i, 3] <- abc[i, 2] + (abc[i - 1, 3] * rate1)
  } else {
    # Effect = impression + last week effect * decay rate
    abc[i, 3] <- abc[i, 2] + (abc[i - 1, 3] * rate1) + (abc[i - 2, 3] * rate2)
  }
}
Desired output:
GEOG a b
A 1 1
A 2 2.5
A 3 4.25
A 4 6.125
A 5 8.0625
A 6 10.03125
B 1 1
B 2 2.5
B 3 4.25
B 4 6.125
B 5 8.0625
C 1 1
C 2 2.5
C 3 4.25
C 4 6.125
C 5 8.0625

One way is to write a small helper that adstocks a single geography, then apply it to each GEOG with nest() and map():
library(tidyverse)

transfo <- function(df, rate1 = 0.5, rate2 = 0) {
  b <- df[["a"]]
  for (i in seq_along(b)) {
    if (i == 2) {
      b[i] <- b[i] + rate1 * b[i - 1]
    } else if (i > 2) {
      b[i] <- b[i] + rate1 * b[i - 1] + rate2 * b[i - 2]
    }
  }
  df[["b"]] <- b
  df
}

abc %>%
  group_by(GEOG) %>%
  nest() %>%
  mutate(data = map(data, transfo)) %>%
  unnest(data)
Learn more at http://r4ds.had.co.nz/many-models.html.
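If you prefer to avoid the nest/unnest round trip, a slightly shorter formulation (a sketch, assuming dplyr 1.0 or later) applies transfo to each group directly with group_modify:
library(dplyr)

abc %>%
  group_by(GEOG) %>%
  group_modify(~ transfo(.x)) %>%   # apply the adstock helper per geography
  ungroup()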

You can use the stats::filter function and dplyr.
library(tidyverse)
abc %>%
  group_by(GEOG) %>%
  mutate(adstock = as.numeric(stats::filter(x = GRPs, filter = 0.5, method = "recursive"))) %>%
  ungroup()
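To see why this works: with method = "recursive" and a single filter coefficient, stats::filter computes y[i] = x[i] + 0.5 * y[i-1], which is exactly the adstock recursion, and group_by makes it restart within each GEOG. A tiny sketch on made-up numbers (as.numeric just drops the ts class):
as.numeric(stats::filter(c(1, 2, 3), filter = 0.5, method = "recursive"))
#> [1] 1.00 2.50 4.25   # 1; 2 + 0.5*1; 3 + 0.5*2.5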

Related

R: Recursively add rows

The concentration of germs on hands following j surface contacts can be described by the following recursive relationship:
H[j+1]=H[j]+T[j]*(S[j]-H[j])
where S is the surface concentration the hand touches (assumed random for simplicity) and T is the transfer efficiency for each contact. I would like to calculate the eventual hand concentration (with a starting concentration of zero).
I have a data frame with a vector of surface concentrations and a transfer efficiency for each surface. I have two groups, a and b, and within each group assume I touch each surface sequentially, row by row:
df <- data.frame(S = runif(10)*100, T = runif(10),g=rep(c("a","b"),each=5))
I would like to compute the cumulative sum of H by group using dplyr where possible.
A special case:
If g == "a", the starting value of H is 0.
If g == "b", then the starting value of H is the last value from when g == "a".
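For reference, here is what that target looks like as a plain for loop (a sketch only, assuming df as defined above; the update is skipped at the group boundary so that group b starts from the last value of group a):
H <- numeric(nrow(df))
H[1] <- 0                                    # zero starting concentration
for (j in seq_len(nrow(df) - 1)) {
  if (df$g[j + 1] == df$g[j]) {
    H[j + 1] <- H[j] + df$T[j] * (df$S[j] - H[j])   # H[j+1] = H[j] + T[j]*(S[j] - H[j])
  } else {
    H[j + 1] <- H[j]                         # carry the last value of group a into group b
  }
}
df$H <- H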
Here is a similar approach to the one shown by @AnilGoyal, for a generalized case:
library(dplyr)
library(purrr)
df %>%
  mutate(H = accumulate2(S, T * !lead(!duplicated(g), default = FALSE),
                         .init = 0, ~ ..1 + ..3 * (..2 - ..1))[-(1 + n())])
For the sake of completeness, and taking clues from Arun and Onyambu (on a separate question), I am adding a base R answer here too.
transform(df, H = Reduce(function(.x, .y) .x + df$T[.y] * (df$S[.y] - .x) * !c(!duplicated(df$g)[-1], 0)[.y],
                         seq(nrow(df)),
                         init = 0,
                         accumulate = TRUE)[-(1 + nrow(df))])
S T g H
1 37.698250 0.8550377 a 0.00000
2 3.843585 0.4722659 a 32.23342
3 33.150788 0.3684791 a 18.82587
4 8.948116 0.8893603 a 24.10430
5 57.061844 0.5452377 a 10.62499
6 49.648827 0.7719067 b 10.62499
7 95.403697 0.5835950 b 40.74775
8 10.598677 0.1220491 b 72.64469
9 91.913365 0.2166443 b 65.07203
10 69.644200 0.2603413 b 70.88705
Earlier Answer
A slight variation of my friend's answer above; I hope it may serve your purpose. The only assumption I am making is that your data is already sorted by group and that a precedes b (exactly as shown in the sample). Since you have not given a random seed, I am also using the same data as my friend.
Strategy/hack: I used a 0 value of T at the last row of group a inside the accumulate2 argument, so that the last value of H in group a is repeated as the first value of group b.
library(tidyverse)
df <- read.table(header = TRUE, text = ' S T g
1 37.698250 0.8550377 a
2 3.843585 0.4722659 a
3 33.150788 0.3684791 a
4 8.948116 0.8893603 a
5 57.061844 0.5452377 a
6 49.648827 0.7719067 b
7 95.403697 0.5835950 b
8 10.598677 0.1220491 b
9 91.913365 0.2166443 b
10 69.644200 0.2603413 b')
df %>%
  mutate(H = accumulate2(S, replace(T, length(g[g == 'a']), 0),
                         .init = 0, ~ ..1 + ..3 * (..2 - ..1))[-(1 + n())])
S T g H
1 37.698250 0.8550377 a 0.00000
2 3.843585 0.4722659 a 32.23342
3 33.150788 0.3684791 a 18.82587
4 8.948116 0.8893603 a 24.10430
5 57.061844 0.5452377 a 10.62499
6 49.648827 0.7719067 b 10.62499
7 95.403697 0.5835950 b 40.74775
8 10.598677 0.1220491 b 72.64469
9 91.913365 0.2166443 b 65.07203
10 69.644200 0.2603413 b 70.88705
#check - formula
#H[j+1]=H[j]+T[j]*(S[j]-H[j])
# for j =2
# H[2] = H[1] + T[1] * (S[1] -H[1])
0 + 0.8550377 * (37.698250 - 0)
#> [1] 32.23342
#for j=7 (second row group b)
#H[6] + T[6] * (S[6] - H[6])
10.62499 + 0.7719067 * (49.648827 - 10.62499)
#> [1] 40.74775
Created on 2021-07-10 by the reprex package (v2.0.0)
Here is another generalized version I would use for this question:
df$H <- Reduce(function(x, y) {
  x + df$T[y] * (df$g[y] == df$g[y + 1]) * (df$S[y] - x)
}, seq_len(nrow(df))[-nrow(df)], init = 0, accumulate = TRUE)
df
S T g H
1 37.698250 0.8550377 a 0.00000
2 3.843585 0.4722659 a 32.23342
3 33.150788 0.3684791 a 18.82587
4 8.948116 0.8893603 a 24.10430
5 57.061844 0.5452377 a 10.62499
6 49.648827 0.7719067 b 10.62499
7 95.403697 0.5835950 b 40.74775
8 10.598677 0.1220491 b 72.64469
9 91.913365 0.2166443 b 65.07203
10 69.644200 0.2603413 b 70.88705

How can I add more columns to a dataframe with a for loop

I am a beginner in R. I need to translate some EViews code to R. In EViews there is loop code that adds 10 or more columns/variables, each computed with some function of the data.
Here is example EViews code to estimate deflators:
for %x exp con gov inv cap ex im
frml def_{%x} = gdp_{%x}/gdp_{%x}_r*100
next
I used the dplyr package and its mutate function, but it is very tedious to add many variables this way.
library(dplyr)
nominal_gdp<-rnorm(4)
nominal_inv<-rnorm(4)
nominal_gov<-rnorm(4)
nominal_exp<-rnorm(4)
real_gdp<-rnorm(4)
real_inv<-rnorm(4)
real_gov<-rnorm(4)
real_exp<-rnorm(4)
df <- data.frame(nominal_gdp, nominal_inv, nominal_gov, nominal_exp,
                 real_gdp, real_inv, real_gov, real_exp)
df <- df %>% mutate(deflator_gdp = nominal_gdp / real_gdp * 100,
                    deflator_inv = nominal_inv / real_inv,
                    deflator_gov = nominal_gov / real_gov,
                    deflator_exp = nominal_exp / real_exp)
print(df)
Please help me do this in R with a loop.
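For reference, a fairly literal translation of the EViews loop into a base R for loop (a sketch, assuming the nominal_*/real_* column names from your example) would be:
vars <- c("gdp", "inv", "gov", "exp")   # the suffixes looped over in EViews
for (x in vars) {
  df[[paste0("deflator_", x)]] <-
    df[[paste0("nominal_", x)]] / df[[paste0("real_", x)]] * 100
}
This follows the EViews formula, multiplying every deflator by 100. The answers below take different, more idiomatic approaches.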
The answer is that your data is not as "tidy" as it could be.
This is what you have (with an added observation ID for clarity):
library(dplyr)
df <- data.frame(nominal_gdp = rnorm(4),
                 nominal_inv = rnorm(4),
                 nominal_gov = rnorm(4),
                 real_gdp = rnorm(4),
                 real_inv = rnorm(4),
                 real_gov = rnorm(4))
df <- df %>%
  mutate(obs_id = 1:n()) %>%
  select(obs_id, everything())
which gives:
obs_id nominal_gdp nominal_inv nominal_gov real_gdp real_inv real_gov
1 1 -0.9692060 -1.5223055 -0.26966202 0.49057546 2.3253066 0.8761837
2 2 1.2696927 1.2591910 0.04238958 -1.51398652 -0.7209661 0.3021453
3 3 0.8415725 -0.1728212 0.98846942 -0.58743294 -0.7256786 0.5649908
4 4 -0.8235101 1.0500614 -0.49308092 0.04820723 -2.0697008 1.2478635
Consider if you had instead, in df2:
obs_id variable real nominal
1 1 gdp 0.49057546 -0.96920602
2 2 gdp -1.51398652 1.26969267
3 3 gdp -0.58743294 0.84157254
4 4 gdp 0.04820723 -0.82351006
5 1 inv 2.32530662 -1.52230550
6 2 inv -0.72096614 1.25919100
7 3 inv -0.72567857 -0.17282123
8 4 inv -2.06970078 1.05006136
9 1 gov 0.87618366 -0.26966202
10 2 gov 0.30214534 0.04238958
11 3 gov 0.56499079 0.98846942
12 4 gov 1.24786355 -0.49308092
Then what you want to do is trivial:
df2 %>% mutate(deflator = real / nominal)
obs_id variable real nominal deflator
1 1 gdp 0.49057546 -0.96920602 -0.50616221
2 2 gdp -1.51398652 1.26969267 -1.19240392
3 3 gdp -0.58743294 0.84157254 -0.69801819
4 4 gdp 0.04820723 -0.82351006 -0.05853872
5 1 inv 2.32530662 -1.52230550 -1.52749012
6 2 inv -0.72096614 1.25919100 -0.57256297
7 3 inv -0.72567857 -0.17282123 4.19901294
8 4 inv -2.06970078 1.05006136 -1.97102841
9 1 gov 0.87618366 -0.26966202 -3.24919196
10 2 gov 0.30214534 0.04238958 7.12782060
11 3 gov 0.56499079 0.98846942 0.57158146
12 4 gov 1.24786355 -0.49308092 -2.53074800
So the question becomes: how do we get to that nice dplyr-compatible data.frame?
You need to gather your data using tidyr::gather. However, because you have two sets of variables to gather (the real and nominal values), it is not straightforward. I have done it in two steps; there may be a better way, though.
real_vals <- df %>%
  select(obs_id, starts_with("real")) %>%
  # the line below is where the magic happens
  tidyr::gather(variable, real, starts_with("real")) %>%
  # extracting the variable name (by erasing up to the underscore)
  mutate(variable = gsub(variable, pattern = ".*_", replacement = ""))

# Same thing for nominal values
nominal_vals <- df %>%
  select(obs_id, starts_with("nominal")) %>%
  tidyr::gather(variable, nominal, starts_with("nominal")) %>%
  mutate(variable = gsub(variable, pattern = ".*_", replacement = ""))

# Merging them... Now we have something we can work with!
df2 <- full_join(real_vals, nominal_vals, by = c("obs_id", "variable"))
Note the importance of the observation id when merging.
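As an aside, newer versions of tidyr (1.0 and later) can do both gathers in a single step with pivot_longer, using the special ".value" sentinel to split the real/nominal prefix from the variable name; a sketch:
library(tidyr)

df2 <- df %>%
  pivot_longer(-obs_id,
               names_to = c(".value", "variable"),
               names_sep = "_")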
We can grep the matching names, and sort:
x <- colnames(df)
df[ sort(x[ (grepl("^nominal", x)) ]) ] /
df[ sort(x[ (grepl("^real", x)) ]) ] * 100
Similarly, if the columns were sorted, then we could just:
df[ 1:4 ] / df[ 5:8 ] * 100
We can loop over column names using purrr::map_dfc, then apply a custom function over the selected columns (i.e. the columns that match the current name from nms).
library(dplyr)
library(purrr)
#Replace anything before _ with empty string
nms <- unique(sub('.*_','',names(df)))
#Use map if you need the output as a list not a dataframe
map_dfc(nms, ~deflator_fun(df, .x))
Custom function
deflator_fun <- function(df, x){
  # browser()
  nx <- paste0('nominal_', x)
  rx <- paste0('real_', x)
  select(df, matches(x)) %>%
    mutate(!!paste0('deflator_', quo_name(x)) := !!ensym(nx) / !!ensym(rx) * 100)
}
#Test
deflator_fun(df, 'gdp')
nominal_gdp real_gdp deflator_gdp
1 -0.3332074 0.181303480 -183.78433
2 -1.0185754 -0.138891362 733.36121
3 -1.0717912 0.005764186 -18593.97398
4 0.3035286 0.385280401 78.78123
Note: learn more about quo_name, !!, and ensym, which are tools for programming with dplyr, in the "Programming with dplyr" vignette.
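As a quick illustration of those tools (the function and column names below are made up for the example): ensym turns a string or bare name into a symbol, !! unquotes it inside a dplyr verb, and := lets the new column's name itself be computed.
library(dplyr)
library(rlang)

add_ratio <- function(df, num, den, out) {
  num <- ensym(num)                     # capture column names as symbols
  den <- ensym(den)
  mutate(df, !!out := !!num / !!den)    # computed name on the left of :=
}

add_ratio(mtcars, mpg, wt, "mpg_per_wt") %>% head()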

Create a combination of factors with optimization

library(dplyr)
library(tidyr)
df <- data.frame(
  First = c("MW3", "MW3", "MW4", "MW5", "MW6", "MW7", "MW7", "MW8"),
  Second = c("MW4; MW5; MW6", "MW5; MW3; MW7", "MW8; MW7; MW3",
             "MW5; MW6; MW4", "MW3; MW7; MW8", "MW6; MW8; MW4",
             "MW3; MW4; MW5", "MW6; MW3; MW7")
)
df <- df %>%
  mutate(
    ID = row_number(),
    lmt = n_distinct(ID)
  ) %>%
  separate_rows(Second, sep = "; ") %>%
  group_by(ID) %>%
  mutate(
    wgt = row_number()
  ) %>%
  ungroup()
Let's say that for each ID I want to keep only 1 combination of First and Second (i.e. the length of unique IDs in df should always be equal to lmt).
However, I'd like to do that while optimizing certain parameters. The solution should be designed in such a way that:
Combinations with wgt 1 should be selected whenever possible, alternatively 2, but 3 should be avoided (i.e. the sum of wgt should be minimal);
The difference between the frequency of a value in Second and its frequency in First should be close to 0.
Any ideas on how to approach this in R?
Expected output for the above case is:
ID First Second wgt lmt
1 1 MW3 MW4 1 8
2 2 MW3 MW7 3 8
3 3 MW4 MW7 2 8
4 4 MW5 MW5 1 8
5 5 MW6 MW3 1 8
6 6 MW7 MW8 2 8
7 7 MW7 MW3 1 8
8 8 MW8 MW6 1 8
Why? Simply because with this combination, there is never more of any element on the right side (Second) than there is on the left (First). For example, there are two MW3 elements on the right as well as on the left.
However, the price to pay here is that wgt is not always 1 (sum of wgt is not 8 but 12).
Clarification: in case both criteria cannot be minimized at the same time, minimization of the 2nd criterion (the difference between frequencies) should be prioritized.
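To make the 2nd criterion concrete: for any candidate solution (one retained row per ID), it is the total absolute difference between how often each value appears in First and how often it appears in Second. A small helper sketch (the function name is hypothetical):
freq_diff <- function(sol) {
  vals <- union(sol$First, sol$Second)
  first_counts  <- table(factor(sol$First,  levels = vals))
  second_counts <- table(factor(sol$Second, levels = vals))
  sum(abs(first_counts - second_counts))   # 0 means perfectly balanced
}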
I played around with this problem and I can share a solution using a variation of the min-conflicts algorithm. The key here is to find a scoring function that combines your requirements. The implementation below follows your clarification that minimization of the 2nd criterion (the difference between frequencies) should be prioritized. Experiment with other scoring functions on your actual data and let's see how far you get.
On your original data (8 IDs) I got a solution equally as good as the one you posted:
> solution_summary(current_solution)
Name FirstCount SecondCount diff
1: MW3 2 2 0
2: MW4 1 1 0
3: MW5 1 1 0
4: MW6 1 1 0
5: MW7 2 2 0
6: MW8 1 1 0
[1] "Total freq diff: 0"
[1] "Total wgt: 12"
With random data containing 10000 IDs the algorithm is able to find a solution with no difference in First/Second frequencies (although the sum of wgt is larger than the minimum):
> solution_summary(current_solution)
Name FirstCount SecondCount diff
1: MW3 1660 1660 0
2: MW4 1762 1762 0
3: MW5 1599 1599 0
4: MW6 1664 1664 0
5: MW7 1646 1646 0
6: MW8 1669 1669 0
[1] "Total freq diff: 0"
[1] "Total wgt: 19521"
Code below:
library(data.table)
df <- as.data.table(df)
df <- df[, .(ID, First, Second, wgt)]
# PLAY AROUND WITH THIS PARAMETER
freq_weight <- 0.9
wgt_min <- df[, uniqueN(ID)]
wgt_max <- df[, uniqueN(ID) * 3]
freq_min <- 0
freq_max <- df[, uniqueN(ID) * 2] #verify if this is the worst case scenario
score <- function(solution){
  # compute raw scores
  current_wgt <- solution[, sum(wgt)]
  second_freq <- solution[, .(SecondCount = .N), by = Second]
  names(second_freq)[1] <- "Name"
  compare <- merge(First_freq, second_freq, by = "Name", all = TRUE)
  compare[is.na(compare)] <- 0
  compare[, diff := abs(FirstCount - SecondCount)]
  current_freq <- compare[, sum(diff)]
  # normalize
  wgt_score <- (current_wgt - wgt_min) / (wgt_max - wgt_min)
  freq_score <- (current_freq - freq_min) / (freq_max - freq_min)
  # combine
  score <- (freq_weight * freq_score) + ((1 - freq_weight) * wgt_score)
  return(score)
}
#initialize random solution
current_solution <- df[, .SD[sample(.N, 1)], by = ID]
#get freq of First (this does not change)
First_freq <- current_solution[, .(FirstCount = .N), by = First]
names(First_freq)[1] <- "Name"
#get minconflict move to be applied on each iteration
minconflict <- function(df, solution){
  # pick an ID
  change <- solution[, sample(unique(ID), 1)]
  # get permissible values
  values <- df[ID == change, .(Second, wgt)]
  # assign scores
  values[, score := NA_real_]
  for (i in 1:nrow(values)) {
    solution[ID == change, c("Second", "wgt") := values[i, .(Second, wgt)]]
    set(values, i, "score", score(solution))
  }
  # keep the best combination
  scores <<- c(scores, values[, min(score)])
  solution[ID == change, c("Second", "wgt") := values[which.min(score), .(Second, wgt)]]
}
#optimize
scores <- 1
iter <- 0
while(TRUE){
  minconflict(df, current_solution)
  iter <- iter + 1
  # SET MAX NUMBER OF ITERATIONS HERE
  if(scores[length(scores)] == 0 | iter >= 1000) break
}
# summarize obtained solution
solution_summary <- function(solution){
  second_freq <- solution[, .(SecondCount = .N), by = Second]
  names(second_freq)[1] <- "Name"
  compare <- merge(First_freq, second_freq, by = "Name", all = TRUE)
  compare[is.na(compare)] <- 0
  compare[, diff := abs(FirstCount - SecondCount)]
  print(compare)
  print(paste("Total freq diff: ", compare[, sum(diff)]))
  print(paste("Total wgt: ", solution[, sum(wgt)]))
}
solution_summary(current_solution)
This is basically a bipartite graph matching problem, so it can be solved exactly in reasonable time, either by max-flow or by linear programming (bipartite matching of the two sets).
library(lpSolve)
MISMATCH.COST <- 1000
.create.row <- function(row.names, first) {
  row <- vector(mode = "numeric", length = length(first))
  for (i in 1:length(row.names))
    row <- row + (-MISMATCH.COST + i) * (row.names[i] == first)
  return(row)
}

find.pairing <- function(First, Second) {
  row.names <- sapply(Second, strsplit, "; ")
  # Create cost matrix for assignment
  mat <- sapply(row.names, .create.row, First)
  assignment <- lp.assign(mat)
  print("Total cost:")
  print(assignment$objval + length(First) * MISMATCH.COST)
  solution <- assignment$solution
  pairs <- which(solution > 0, arr.ind = TRUE)
  matches <- First[pairs[, 1]]
  # Find out where a mismatch has occurred, and replace the match
  for (i in 1:length(matches)) {
    if (!(matches[i] %in% row.names[[i]])) {
      matches[i] <- row.names[[i]][1]
    }
  }
  result <- data.frame(
    First[pairs[, 2]],
    matches)
  return(result)
}
Running it on your example gives an optimal solution (as it should always do)
> First = c("MW3", "MW3", "MW4", "MW5", "MW6", "MW7", "MW7", "MW8")
> Second = c("MW4; MW5; MW6", "MW5; MW3; MW7", "MW8; MW7; MW3",
+ "MW5; MW6; MW4", "MW3; MW7; MW8", "MW6; MW8; MW4",
+ "MW3; MW4; MW5", "MW6; MW3; MW7")
> find.pairing(First, Second)
[1] "Total cost:"
[1] 12
First.pairs...2.. matches
1 MW3 MW4
2 MW3 MW3
3 MW4 MW7
4 MW5 MW5
5 MW6 MW7
6 MW7 MW8
7 MW7 MW3
8 MW8 MW6

Rolling window with dplyr to find value of factor

I have a tibble like this:
head(a)
# A tibble: 6 x 4
date ROE ROFE ROTFE
<date> <dbl> <dbl> <dbl>
1 2000-01-31 0.033968932 0.0324214815 0.010205926
2 2000-02-29 0.006891111 -0.0003352941 -0.005230147
3 2000-03-31 0.006158519 0.0213992647 0.040399265
4 2000-04-28 0.060022222 0.0151191176 0.047586029
5 2000-05-31 -0.016960000 -0.0287617647 -0.036209559
6 2000-06-30 0.034133577 0.0144456522 0.030756522
I want to pick, at each date, the value of the factor which has the highest cumulative return over the last 2 months.
I have done something like this and it works.
However, my friend told me it can be done in one or two lines of dplyr, and I'm wondering if you could please show me how to do that.
index <- as.Date(unique(a$date))
nmonth <- 2
mean.ROE <- numeric()
for (i in 1:(length(index) - nmonth)) { # i = 2
  index1 <- index[i]
  index2 <- index[nmonth + i]
  index3 <- index[nmonth + i + 1]
  # Take a 2-month window of ROE returns:
  b <- a[a$date >= index1 & a$date < index2, ] %>%
    mutate(cum.ROE = cumprod(1 + ROE)) %>%
    mutate(cum.ROFE = cumprod(1 + ROFE)) %>%
    mutate(cum.ROTFE = cumprod(1 + ROTFE))
  # Use the cumulative return over the 2-month window to determine which factor is best.
  mean.ROE1 <- ifelse(b$cum.ROE[nmonth] > b$cum.ROFE[nmonth] & b$cum.ROE[nmonth] > b$cum.ROTFE[nmonth],
                      a[a$date == index3, ]$ROE,
                      ifelse(b$cum.ROFE[nmonth] > b$cum.ROE[nmonth] & b$cum.ROFE[nmonth] > b$cum.ROTFE[nmonth],
                             a[a$date == index3, ]$ROFE,
                             a[a$date == index3, ]$ROTFE))
  # Bind the answer to the answer vector
  mean.ROE <- rbind(mean.ROE, mean.ROE1)
}
Create a function maxret which takes a window of 2 + nmonth rows, x, and calculates the cumulative return, r, of each column over the first two rows. For the column with the largest of those, it returns the value in the last row of x.
Now use rollapplyr to apply it over a rolling window of width 2 + nmonth:
library(zoo)
maxret <- function(x) {
  r <- apply(1 + x[1:2, ], 2, prod)
  x[2 + nmonth, which.max(r)]
}
z <- read.zoo(as.data.frame(a))
res <- rollapplyr(z, 2 + nmonth, maxret, by.column = FALSE)
giving the zoo series:
> res
2000-04-28 2000-05-31 2000-06-30
0.06002222 -0.03620956 0.03075652
If you want a data frame, use fortify.zoo(res).
Note 1: The input was not provided in reproducible form in the question, so I have assumed this data.frame:
Lines <-
"date ROE ROFE ROTFE
1 2000-01-31 0.033968932 0.0324214815 0.010205926
2 2000-02-29 0.006891111 -0.0003352941 -0.005230147
3 2000-03-31 0.006158519 0.0213992647 0.040399265
4 2000-04-28 0.060022222 0.0151191176 0.047586029
5 2000-05-31 -0.016960000 -0.0287617647 -0.036209559
6 2000-06-30 0.034133577 0.0144456522 0.030756522"
a <- read.table(text = Lines, header = TRUE)
Note 2: With the input in Note 1 or with zoo 1.8.1 (the development version of zoo) this line:
z <- read.zoo(as.data.frame(a))
could be simplified to just:
z <- read.zoo(a)
but we have added the as.data.frame part in the main code so it works with tibbles as well as straight data frames even with the current version of zoo on CRAN.

How to maintain the order of elements of a row when using the by and rbind functions in R?

I have written a function which takes a subset of the data based on the value of the name column. It computes the outliers for the column "mark" and replaces them.
However, when I try to combine these different subsets, the order of my elements changes. Is there any way I can maintain the order of my elements in the column "mark"?
My data set is:
name mark
A 100.0
B 0.5
C 100.0
A 50.0
B 90.0
B 1000.0
C 1200.0
C 5000.0
A 210.0
The function which I have written is:
data.frame(do.call("rbind", as.list(by(data, data$name,
  function(x) {
    apply(x[, .(mark)], 2, function(y) {
      y[y > (quantile(x$mark, na.rm = TRUE)[[3]][[1]] + 1.5 * IQR(x$mark))] <-
        (quantile(x$mark, na.rm = TRUE)[[3]][[1]] + 1.5 * IQR(x$mark))
      y
    })
  }))))
The result of the above function is the first column below (I've manually added the name column back for illustrative purposes):
mark NAME
100.000 ----- A
50.000 ----- A
210.000 ----- A
0.500 ----- B
90.000 ----- B
839.625 ----- B
100.000 ----- C
1200.000 ----- C
4875.000 ----- C
In the above result, the order of the values in the mark column has changed. Is there any way I can maintain the order of the elements?
Are you sure that code is doing what you think it is?
It looks like you're replacing any value greater than the median (third returned value of quantile) + 1.5*IQR with the median + 1.5*IQR. Maybe that's what you intend, I don't know. The bigger problem is that you're doing that in an apply function, so it's going to re-calculate that median and IQR each iteration, updated with the previous rows already being changed. I'd wager that's not what you intend, but I suppose I've seen stranger.
A better option might be to create an external function to do the work, which takes in all of the data, does the calculation, then outputs all the data. I like dplyr for this simply because it's clean.
Reading your data in (why the "----"?)
scores <- read.table(text="
name mark
A 100.0
B 0.5
C 100.0
A 50.0
B 90.0
B 1000.0
C 1200.0
C 5000.0
A 210.0", header=TRUE)
and creating a function that does something a little more sensible: it replaces any value greater than the 75% quantile (referenced by name so you know what it is) or less than the 25% quantile with that limiting value
scale_outliers <- function(data) {
  lim <- quantile(data, na.rm = TRUE)
  data[data > lim["75%"]] <- lim["75%"]
  data[data < lim["25%"]] <- lim["25%"]
  return(data)
}
Chaining this processing into dplyr::mutate is neat, and the result can then be passed on to ggplot. Here's the original data:
library(dplyr)
library(ggplot2)
gg1 <- scores %>% ggplot(aes(x = name, y = mark))
gg1 <- gg1 + geom_point() + geom_boxplot() + coord_cartesian(ylim = range(scores$mark))
gg1
And if we alter it with the new function we get the data back without rows changed around
scores %>% mutate(new_mark = scale_outliers(mark))
#> name mark new_mark
#> 1 A 100.0 100
#> 2 B 0.5 90
#> 3 C 100.0 100
#> 4 A 50.0 90
#> 5 B 90.0 90
#> 6 B 1000.0 1000
#> 7 C 1200.0 1000
#> 8 C 5000.0 1000
#> 9 A 210.0 210
and we can plot that,
gg2 <- scores %>% mutate(new_mark = scale_outliers(mark)) %>% ggplot(aes(x=name, y=new_mark))
gg2 <- gg2 + geom_point() + geom_boxplot() + coord_cartesian(ylim=range(scores$mark))
gg2
Best of all, if you now want to do that quantile comparison group-wise (say, by the name column), it's as easy as adding dplyr::group_by(name):
gg3 <- scores %>% group_by(name) %>% mutate(new_mark = scale_outliers(mark)) %>% ggplot(aes(x=name, y=new_mark))
gg3 <- gg3 + geom_point() + geom_boxplot() + coord_cartesian(ylim=range(scores$mark))
gg3
A slightly refactored version of Hack-R's answer -- you can add an index to your data.table:
library(data.table)
data <- data.table(name = c("A", "B", "C", "A", "B", "B", "C", "C", "A"),
                   mark = c(100, 0.5, 100, 50, 90, 1000, 1200, 5000, 210))
data[, i := .I]
Then you perform your calculation but you keep the name and i:
df <- data.frame(do.call("rbind", as.list(
  by(data, data$name,
     function(x) cbind(i = x$i,
                       name = x$name,
                       apply(x[, .(mark)], 2, function(y) {
                         y[y > (quantile(x$mark, na.rm = TRUE)[[3]][[1]] + 1.5 * IQR(x$mark))] <-
                           (quantile(x$mark, na.rm = TRUE)[[3]][[1]] + 1.5 * IQR(x$mark))
                         y
                       })))
)))
And finally you order using the index:
df[order(df$i),]
i name mark
1 1 A 100
4 2 B 0.5
7 3 C 100
2 4 A 50
5 5 B 90
6 6 B 839.625
8 7 C 1200
9 8 C 4875
3 9 A 210
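For comparison, the same per-group capping at median + 1.5*IQR can be done while preserving the original row order with dplyr, since a grouped mutate never reorders rows (a sketch, assuming the data above):
library(dplyr)

data %>%
  group_by(name) %>%
  mutate(mark = pmin(mark, median(mark, na.rm = TRUE) + 1.5 * IQR(mark, na.rm = TRUE))) %>%
  ungroup()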
