Rolling window with dplyr to find value of factor

Rolling window with dplyr to find value of factor - r

I have a matrix like this
head(a)
# A tibble: 6 x 4
date ROE ROFE ROTFE
<date> <dbl> <dbl> <dbl>
1 2000-01-31 0.033968932 0.0324214815 0.010205926
2 2000-02-29 0.006891111 -0.0003352941 -0.005230147
3 2000-03-31 0.006158519 0.0213992647 0.040399265
4 2000-04-28 0.060022222 0.0151191176 0.047586029
5 2000-05-31 -0.016960000 -0.0287617647 -0.036209559
6 2000-06-30 0.034133577 0.0144456522 0.030756522
I want to pick the value of a factor which has highest cumulative return last 2 months over time.
I have done something like this and it works.
However, my friend told me that it can be done in one or two lines of dplyr and I'm wondering if you could please show me how to do that.
index = as.Date(unique(a$date))
nmonth = 2;
mean.ROE = numeric()
for (i in 1:(length(index) - nmonth)) { # i = 2
index1 = index[i]
index2 = index[nmonth + i]
index3 = index[nmonth + i+1]
# Take a 2-month window of ROE returns:
b = a[a$date >= index1 & a$date < index2,] %>% mutate(cum.ROE = cumprod(1 + ROE)) %>% mutate(cum.ROFE = cumprod(1 + ROFE)) %>% mutate(cum.ROTFE = cumprod(1 + ROTFE))
# Use the cumulative return over the 2-month window to determine which factor is best.
mean.ROE1 = ifelse(b$cum.ROE[nmonth] > b$cum.ROFE[nmonth] & b$cum.ROE[nmonth] > b$cum.ROTFE[nmonth], a[a$date == index3,]$ROE, ifelse(b$cum.ROFE[nmonth] > b$cum.ROE[nmonth] & b$cum.ROFE[nmonth] > b$cum.ROTFE[nmonth], a[a$date == index3,]$ROFE, a[a$date == index3,]$ROTFE))
# Bind the answer to the answer vector
mean.ROE = rbind(mean.ROE, mean.ROE1)
}

Create a function maxret which takes 2 + nmonth rows, x, and calculates the cumulative returns, r, for each column of the first two rows. For the largest of those return the value in the last row of x.
Now use rollapplyr to apply it to a rolling window of width 2 + month:
library(zoo)
maxret <- function(x) {
r <- apply(1 + x[1:2, ], 2, prod)
x[2 + nmonth, which.max(r)]
}
z <- read.zoo(as.data.frame(a))
res <- rollapplyr(z, 2 + nmonth, maxret, by.column = FALSE)
giving the zoo series:
> res
2000-04-28 2000-05-31 2000-06-30
0.06002222 -0.03620956 0.03075652
If you want a data frame use fortify.zoo(res) .
Note: 1 The input was not provided in reproducible form in the question so I have assumed this data.frame:
Lines <-
"date ROE ROFE ROTFE
1 2000-01-31 0.033968932 0.0324214815 0.010205926
2 2000-02-29 0.006891111 -0.0003352941 -0.005230147
3 2000-03-31 0.006158519 0.0213992647 0.040399265
4 2000-04-28 0.060022222 0.0151191176 0.047586029
5 2000-05-31 -0.016960000 -0.0287617647 -0.036209559
6 2000-06-30 0.034133577 0.0144456522 0.030756522"
a <- read.table(text = Lines, header = TRUE)
Note 2: With the input in Note 1 or with zoo 1.8.1 (the development version of zoo) this line:
z <- read.zoo(as.data.frame(a))
could be simplified to just:
z <- read.zoo(a)
but we have added the as.data.frame part in the main code so it works with tibbles as well as straight data frames even with the current version of zoo on CRAN.

Related

How to apply function with multiple conditions on multiple columns to get new conditional columns in R

Hello all a R noob here,
I hope you guys can help me with the following.
I need to transform multiple columns in my dataset to new columns based on the values in the original columns multiple times. This means that for the first transformation I use column 1, 2, 3 and if certain conditions are met the output results a new column with a 1 or a 0, for the second transformation I use columns 4, 5, 6 and the output should be a 1 or a 0 also. I have to do this 18 times. I already wrote a function which succesfully does the transformation if I impute the variables manually, but I would like to apply this function to all the desired columns at once. My desired output would be 18 new columns with 0's and 1's. Finally I will make a last column which will display a 1 if any of the 18 columns is a 1 and a 0 otherwise.
df <- data.frame(admiss1 = sample(seq(as.Date('1990/01/01'), as.Date('2000/01/01'), by="day"), 12),
admiss2 = sample(seq(as.Date('1990/01/01'), as.Date('2000/01/01'), by="day"), 12),
admiss3 = sample(seq(as.Date('1990/01/01'), as.Date('2000/01/01'), by="day"), 12),
visit1 = sample(seq(as.Date('1995/01/01'), as.Date('1996/01/01'), by="day"), 12),
visit2 = sample(seq(as.Date('1997/01/01'), as.Date('1998/01/01'), by="day"), 12),
reason1 = sample(3,12, replace = T),
reason2 = sample(3,12, replace = T),
reason3 = sample(3,12, replace = T))
df$discharge1 <- df$admiss1 + 10
df$discharge2 <- df$admiss2 + 10
df$discharge3 <- df$admiss3 + 10
#every discharge date is 10 days after the admission date for the sake of this example
#now I have the following dataframe
#for the sake of it I included only 3 dates and reasons(instead of 18)
admiss1 admiss2 admiss3 visit1 visit2 reason1 reason2 reason3 discharge1 discharge2 discharge3
1 1990-03-12 1992-04-04 1998-07-31 1995-01-24 1997-10-07 2 1 3 1990-03-22 1992-04-14 1998-08-10
2 1999-05-18 1990-11-25 1995-10-04 1995-03-06 1997-03-13 1 2 1 1999-05-28 1990-12-05 1995-10-14
3 1993-07-16 1998-06-10 1991-07-05 1995-11-06 1997-11-15 1 1 2 1993-07-26 1998-06-20 1991-07-15
4 1991-07-05 1992-06-17 1995-10-12 1995-05-14 1997-05-02 2 1 3 1991-07-15 1992-06-27 1995-10-22
5 1995-08-16 1999-03-08 1992-04-03 1995-02-20 1997-01-03 1 3 3 1995-08-26 1999-03-18 1992-04-13
6 1999-10-07 1991-12-26 1995-05-05 1995-10-24 1997-10-15 3 1 1 1999-10-17 1992-01-05 1995-05-15
7 1998-03-18 1992-04-18 1993-12-31 1995-11-14 1997-06-14 3 2 2 1998-03-28 1992-04-28 1994-01-10
8 1992-08-04 1991-09-16 1992-04-23 1995-05-29 1997-10-11 1 2 3 1992-08-14 1991-09-26 1992-05-03
9 1997-02-20 1990-02-12 1998-03-08 1995-10-09 1997-12-29 1 1 3 1997-03-02 1990-02-22 1998-03-18
10 1992-09-16 1997-06-16 1997-07-18 1995-12-11 1997-01-12 1 2 2 1992-09-26 1997-06-26 1997-07-28
11 1991-01-25 1998-04-07 1999-07-02 1995-12-27 1997-05-28 3 2 1 1991-02-04 1998-04-17 1999-07-12
12 1996-02-25 1993-03-30 1997-06-25 1995-09-07 1997-10-18 1 3 2 1996-03-06 1993-04-09 1997-07-05
admissdate <- function(admis, dis, rsn, vis1, vis2){
xnew <- ifelse(df[eval(substitute(admis))] >= df[eval(substitute(vis1))] & df[eval(substitute(dis))] <= df[eval(substitute(vis2))] & df[eval(substitute(rsn))] == 2, 1, 0)
xnew <- ifelse(df[eval(substitute(admis))] >= df[eval(substitute(vis1))] & df[eval(substitute(admis))] <= df[eval(substitute(vis2))] & df[eval(substitute(dis))] >= df[eval(substitute(vis2))] & df[eval(substitute(rsn))] == 2, 1, xnew)
return(xnew)
}
I wrote this function to generate a 1 if the conditions are true and a 0 if the conditions are false.
-Condition 1: admission date and discharge date are between visit 1 and visit 2 + admission reason is 2.
-Condition 2: admission date is after visit 1 but before visit 2 and the discharge date is after visit 2 with also admission reason 2.
It should return 1 if these conditions are true and 0 if these conditions are false. Eventually, I will end up with 18 new variables with 1's or 0's and will combine them to make one variable with Admission between visit 1 and visit 2 (with reason 2).
If I manually impute the variable names it will work, but I cant make it work for all the variables at once. I tried to make a string vector with all the admiss dates, discharge dates and reasons and tried to transform them with mapply, but this does not work.
admiss <- paste0(rep("admiss", 3), 1:3)
discharge <- paste0(rep("discharge", 3), 1:3)
reason <- paste0(rep("reason", 3), 1:3)
visit1 <- rep("visit1",3)
visit2 <- rep("visit2",3)
mapply(admissdate, admis = admiss, dis = discharge, rsn = reason, vis1 = visit1, vis2 = visit2)
I have also considered lapply but here you have to define an X = ..., which I think I cannot use because I have multiple column that I want to impute, please correct me if I am wrong!
Also I considered using a for loop, but I don't know how to use that with multiple conditions.
Any help would be greatly appreciated!

You can change the function to accept values instead of column names.
admissdate <- function(admis, dis, rsn, vis1, vis2){
xnew <- as.integer(admis >= vis1 & dis <= vis2 & rsn == 2)
xnew <- ifelse(admis >= vis1 & admis <= vis2 & dis >= vis2 & rsn == 2, 1, xnew)
return(xnew)
}
Now create new columns -
admiss <- paste0("admiss", 1:3)
discharge <- paste0("discharge", 1:3)
reason <- paste0("reason", 1:3)
new_col <- paste0('newcol', 1:3)
df[new_col] <- Map(function(x, y, z) admissdate(x, y, z, df$visit1, df$visit2),
df[admiss],df[discharge],df[reason])
#Additional column will be 1 if any of the value in the new column is 1.
df$result <- as.integer(rowSums(df[new_col]) > 0)
df

Create a combination of factors with optimization

library(dplyr)
library(tidyr)
df <- data.frame(
First = c("MW3", "MW3", "MW4", "MW5", "MW6", "MW7", "MW7", "MW8"),
Second = c("MW4; MW5; MW6", "MW5; MW3; MW7", "MW8; MW7; MW3",
"MW5; MW6; MW4", "MW3; MW7; MW8", "MW6; MW8; MW4",
"MW3; MW4; MW5", "MW6; MW3; MW7")
)
df <- df %>%
mutate(
ID = row_number(),
lmt = n_distinct(ID)
) %>%
separate_rows(Second, sep = "; ") %>%
group_by(ID) %>%
mutate(
wgt = row_number()
) %>% ungroup()
Let's say that for each ID I want to keep only 1 combination of First and Second (i.e. the length of unique IDs in df should always be equal to lmt).
However, I'd like to do that with optimizing certain parameters. The solution should be designed in such a way that:
Combinations with wgt 1 should be selected whenever possible, alternatively also 2, but 3 should be avoided (i.e. sum of wgt should be minimal);
The difference between the frequency of a value in Second and frequency in First should be close to 0.
Any ideas on how to approach this in R?
Expected output for the above case is:
ID First Second wgt lmt
1 1 MW3 MW4 1 8
2 2 MW3 MW7 3 8
3 3 MW4 MW7 2 8
4 4 MW5 MW5 1 8
5 5 MW6 MW3 1 8
6 6 MW7 MW8 2 8
7 7 MW7 MW3 1 8
8 8 MW8 MW6 1 8
Why? Simply because with this combination, there is not more of any element on the right side (Second) that it is on the left (First). For example, there are two MW3 elements on the right as well as on the left.
However, the price to pay here is that wgt is not always 1 (sum of wgt is not 8 but 12).
Clarification: In case both criteria cannot be minimized at the same time, the minimization of 2nd criteria (difference between frequencies) should be prioritized.

I played around with this problem and I can share a solution using a variation of minconflicts algorithm. The key here is to find a scoring function that combines your requirements. The implementation below follows your recommendation 'let's say the objective should be to prioritize the minimization of 2nd criteria (difference between frequencies)'. Experiment with other scoring functions on your actual data and let's see how far you get.
On your original data (8 IDs) I got solution equally good as the one you posted:
> solution_summary(current_solution)
Name FirstCount SecondCount diff
1: MW3 2 2 0
2: MW4 1 1 0
3: MW5 1 1 0
4: MW6 1 1 0
5: MW7 2 2 0
6: MW8 1 1 0
[1] "Total freq diff: 0"
[1] "Total wgt: 12"
With random data with 10000 IDs the algorithm is able to find solution with no difference in First/Second frequencies (but sum of wgt is bigger than minimum):
> solution_summary(current_solution)
Name FirstCount SecondCount diff
1: MW3 1660 1660 0
2: MW4 1762 1762 0
3: MW5 1599 1599 0
4: MW6 1664 1664 0
5: MW7 1646 1646 0
6: MW8 1669 1669 0
[1] "Total freq diff: 0"
[1] "Total wgt: 19521"
Code below:
library(data.table)
df <- as.data.table(df)
df <- df[, .(ID, First, Second, wgt)]
# PLAY AROUND WITH THIS PARAMETER
freq_weight <- 0.9
wgt_min <- df[, uniqueN(ID)]
wgt_max <- df[, uniqueN(ID) * 3]
freq_min <- 0
freq_max <- df[, uniqueN(ID) * 2] #verify if this is the worst case scenario
score <- function(solution){
# compute raw scores
current_wgt <- solution[, sum(wgt)]
second_freq <- solution[, .(SecondCount = .N), by = Second]
names(second_freq)[1] <- "Name"
compare <- merge(First_freq, second_freq, by = "Name", all = TRUE)
compare[is.na(compare)] <- 0
compare[, diff := abs(FirstCount - SecondCount)]
current_freq <- compare[, sum(diff)]
# normalize
wgt_score <- (current_wgt - wgt_min) / (wgt_max - wgt_min)
freq_score <- (current_freq - freq_min) / (freq_max - freq_min)
#combine
score <- (freq_weight * freq_score) + ((1 - freq_weight) * wgt_score)
return(score)
}
#initialize random solution
current_solution <- df[, .SD[sample(.N, 1)], by = ID]
#get freq of First (this does not change)
First_freq <- current_solution[, .(FirstCount = .N), by = First]
names(First_freq)[1] <- "Name"
#get mincoflict to be applied on each iteration
minconflict <- function(df, solution){
#pick ID
change <- solution[, sample(unique(ID), 1)]
#get permissible values
values <- df[ID == change, .(Second, wgt)]
#assign scores
values[, score := NA_real_]
for (i in 1:nrow(values)) {
solution[ID == change, c("Second", "wgt") := values[i, .(Second, wgt)]]
set(values, i, "score", score(solution))
}
#return the best combination
scores <<- c(scores, values[, min(score)])
solution[ID == change, c("Second", "wgt") := values[which.min(score), .(Second, wgt)]]
}
#optimize
scores <- 1
iter <- 0
while(TRUE){
minconflict(df, current_solution)
iter <- iter + 1
#SET MAX NUMBER OF ITERATIONS HERE
if(scores[length(scores)] == 0 | iter >= 1000) break
}
# summarize obtained solution
solution_summary <- function(solution){
second_freq <- solution[, .(SecondCount = .N), by = Second]
names(second_freq)[1] <- "Name"
compare <- merge(First_freq, second_freq, by = "Name", all = TRUE)
compare[is.na(compare)] <- 0
compare[, diff := abs(FirstCount - SecondCount)]
print(compare)
print(paste("Total freq diff: ", compare[, sum(diff)]))
print(paste("Total wgt: ", solution[, sum(wgt)]))
}
solution_summary(current_solution)

This is basically a bipartite graph matching problem and so can be solved exactly in reasonable time, either by maxflow or linear programming (bipartite graph matching to match two sets).
library(lpSolve)
MISMATCH.COST <- 1000
.create.row <- function(row.names, first) {
row <- vector(mode="numeric", length=length(first))
for (i in 1:length(row.names))
row = row + (-MISMATCH.COST+i)*(row.names[i]==first)
return(row)
}
find.pairing <- function(First, Second) {
row.names = sapply(Second, strsplit, "; ")
# Create cost matrix for assignment
mat = sapply(row.names, .create.row, First)
assignment <- lp.assign(mat)
print("Total cost:")
print(assignment$objval+length(First)*MISMATCH.COST)
solution <- lp.assign(mat)$solution
pairs <- which(solution>0, arr.ind=T)
matches = First[pairs[,1]]
# Find out where a mismatch has occured, and replace match
for (i in 1:length(matches)) {
if (!(matches[i] %in% row.names[[i]])) {
matches[i] = row.names[[i]][1]
}
}
result = data.frame(
First[pairs[,2]],
matches)
return(result)
}
Running it on your example gives an optimal solution (as it should always do)
> First = c("MW3", "MW3", "MW4", "MW5", "MW6", "MW7", "MW7", "MW8")
> Second = c("MW4; MW5; MW6", "MW5; MW3; MW7", "MW8; MW7; MW3",
"MW5; MW6; MW4", "MW3; MW7; MW8", "MW6; MW8; MW4",
"MW3; MW4; MW5", "MW6; MW3; MW7")
Second = c("MW4; MW5; MW6", "MW5; MW3; MW7", "MW8; MW7; MW3",
+ "MW5; MW6; MW4", "MW3; MW7; MW8", "MW6; MW8; MW4",
+ "MW3; MW4; MW5", "MW6; MW3; MW7")
> find.pairing(First, Second)
[1] "Total cost:"
[1] 12
First.pairs...2.. matches
1 MW3 MW4
2 MW3 MW3
3 MW4 MW7
4 MW5 MW5
5 MW6 MW7
6 MW7 MW8
7 MW7 MW3
8 MW8 MW6

How can I create new column in data frame by aggregating rows?

I have a large (~200k rows) dataframe that is structured like this:
df <-
data.frame(c(1,1,1,1,1), c('blue','blue','blue','blue','blue'), c('m','m','m','m','m'), c(2016,2016,2016,2016,2016),c(3,4,5,6,7), c(10,20,30,40,50))
colnames(df) <- c('id', 'color', 'size', 'year', 'week','revenue')
Let's say it is currently week 7, and I want to compare the trailing 4 week average of revenue to the current week's revenue. What I would like to do is create a new column for that average when all of the identifiers match.
df_new <-
data.frame(1, 'blue', 'm', 2016,7,50, 25 )
colnames(df_new) <- c('id', 'color', 'size', 'year', 'week','revenue', 't4ave')
How can I accomplish this efficiently? Thank you for the help

good question. for loops are pretty inefficient, but since you do have to check the conditions of prior entries, this is the only solution I can think of (mind you, I'm also an intermediate at R):
for (i in 1:nrow(df))
{
# condition for all entries to match up
if ((i > 5) && (df$id[i] == df$id[i-1] == df$id[i-2] == df$id[i-3] == df$id[i-4])
&& (df$color[i] == df$color[i-1] == df$color[i-2] == df$color[i-3] == df$color[i-4])
&& (df$size[i] == df$size[i-1] == df$size[i-2] == df$size[i-3] == df$size[i-4])
&& (df$year[i] == df$year[i-1] == df$year[i-2] == df$year[i-3] == df$year[i-4])
&& (df$week[i] == df$week[i-1] == df$week[i-2] == df$week[i-3] == df$week[i-4]))
# avg of last 4 entries' revenues
avg <- mean(df$revenue[i-1] + df$revenue[i-2] + df$revenue[i-3] + df$revenue[i-4])
# create new variable of difference between this entry and last 4's
df$diff <- df$revenue[i] - avg
}
This code will probably take forever, but it should work. If this is a one time thing for when the code needs to run, then it should be okay. Otherwise, hopefully others will be able to advise.

A solution using dplyr and zoo. The idea is to group the variable that are the same, such as id, color, size, and year. Aftet that, use rollmean to calculate the rolling mean of revenue. Use na.pad = TRUE and align = "right" to make sure the calculation covers the recent weeks. Finally, use lag to "shift" the calculation results to fit your needs.
library(dplyr)
library(zoo)
df2 <- df %>%
group_by(id, color, size, year) %>%
mutate(t4ave = rollmean(revenue, 4, na.pad = TRUE, align = "right")) %>%
mutate(t4ave = lag(t4ave))
df2
# A tibble: 5 x 7
# Groups: id, color, size, year [1]
id color size year week revenue t4ave
<dbl> <fctr> <fctr> <dbl> <dbl> <dbl> <dbl>
1 1 blue m 2016 3 10 NA
2 1 blue m 2016 4 20 NA
3 1 blue m 2016 5 30 NA
4 1 blue m 2016 6 40 NA
5 1 blue m 2016 7 50 25

R: Summing of the column values by ranged values of another column

Good day!
I’ve got a table of two columns. In the first column (x) there are values which I want to divide in into categories according to the specified range of values (in my instance – 300). And then using these categories I want to sum values in anther column (v). For instance, using my test data: The first category is from 65100 to 65400 (65100
The result: there is a table of two columns. The first one is the categories of x; the second column is the sum of according values of v.
Thank you!!!
# data
set.seed(1)
x <- sample(seq(65100, 67900, by=5), 100, replace = TRUE)
v <- sample(seq(1000, 8000), 100, replace = TRUE)
tabl <- data.frame(x=c(x), v=c(v))
attach(tabl)
#categories
seq(((min(x) - min(x)%%300) + 300), ((max(x) - max(x)%%300) + 300), by =300)

I understood you want to:
Cut vector x,
Using pre-calculated cut-off thresholds
Compute sums over vector v using those groupings
This is one line of code with data.table and chaining. Your data are in data.table named DT.
DT[, CUT := cut(x, breaks)][, sum(v), by=CUT]
Explanation:
First, assign cut-offs to variable breaks like so.
breaks <- seq(((min(x) - min(x) %% 300) + 300), ((max(x) - max(x) %% 300) + 300), by =300)
Second, compute a new column CUT to group rows by the data in breaks.
DT[, CUT := cut(x, breaks)]
Third, sum on column v in groups, using by=. I have chained this operation with the previous.
DT[, CUT := cut(x, breaks)][, sum(v), by=CUT]
Convert your data.frame to data.table like so.
library(data.table)
DT <- as.data.table(tabl)
This is the final result:
CUT V1
1: (6.57e+04,6.6e+04] 45493
2: (6.6e+04,6.63e+04] 77865
3: (6.66e+04,6.69e+04] 22893
4: (6.75e+04,6.78e+04] 61738
5: (6.54e+04,6.57e+04] 44805
6: (6.69e+04,6.72e+04] 64079
7: NA 33234
8: (6.72e+04,6.75e+04] 66517
9: (6.63e+04,6.66e+04] 43887
10: (6.78e+04,6.81e+04] 172
You can dress this up to improve aesthetics. For example, you can reset the factor levels for ease of reading.

When I use dplyr I am used to do it like this. Although I like the cut solution too.
# data
set.seed(1)
x <- sample(seq(65100, 67900, by=5), 100, replace = TRUE)
v <- sample(seq(1000, 8000), 100, replace = TRUE)
tabl <- data.frame(group=c(x), value=c(v))
attach(tabl)
#categories
s <- seq(((min(x) - min(x)%%300) + 300), ((max(x) - max(x)%%300) + 300), by =300)
tabl %>% rowwise() %>% mutate(g = s[min(which(group < s), na.rm=T)]) %>% ungroup() %>%
group_by(g) %>% summarise(sumvalue = sum(value))
result:
g sumvalue
<dbl> <int>
65400 28552
65700 49487
66000 45493
66300 77865
66600 43887
66900 21187
67200 65785
67500 66517
67800 61738
68100 1722

Try this (no package needed):
s <- seq(65100, max(tabl$x)+300, 300)
tabl$col = as.vector(cut(tabl$x, breaks = s, labels = 1:10))
df <- aggregate(v~col, tabl, sum)
# col v
# 1 1 33234
# 2 2 44805
# 3 3 45493
# 4 4 77865
# 5 5 43887
# 6 6 22893
# 7 7 64079
# 8 8 66517
# 9 9 61738
# 10 10 1722

Using R to remove data which is below a quartile threshold

I am creating correlations using R, with the following code:
Values<-read.csv(inputFile, header = TRUE)
O<-Values$Abundance_O
S<-Values$Abundance_S
cor(O,S)
pear_cor<-round(cor(O,S),4)
outfile<-paste(inputFile, ".jpg", sep = "")
jpeg(filename = outfile, width = 15, height = 10, units = "in", pointsize = 10, quality = 75, bg = "white", res = 300, restoreConsole = TRUE)
rx<-range(0,20000000)
ry<-range(0,200000)
plot(rx,ry, ylab="S", xlab="O", main="O vs S", type="n")
points(O,S, col="black", pch=3, lwd=1)
mtext(sprintf("%s %.4f", "pearson: ", pear_cor), adj=1, padj=0, side = 1, line = 4)
dev.off()
pear_cor
I now need to find the lower quartile for each set of data and exclude data that is within the lower quartile. I would then like to rewrite the data without those values and use the new column of data in the correlation analysis (because I want to threshold the data by the lower quartile). If there is a way I can write this so that it is easy to change the threshold by applying arguments from Java (as I have with the input file name) that's even better!
Thank you so much.
I have now implicated the answer below and that is working, however I need to keep the pairs of data together for the correlation. Here is an example of my data (from csv):
Abundance_O Abundance_S
3635900.752 1390.883073
463299.4622 1470.92626
359101.0482 989.1609251
284966.6421 3248.832403
415283.663 2492.231265
2076456.856 10175.48946
620286.6206 5074.268802
3709754.717 269.6856808
803321.0892 118.2935093
411553.0203 4772.499758
50626.83554 17.29893001
337428.8939 203.3536852
42046.61549 152.1321255
1372013.047 5436.783169
939106.3275 7080.770535
96618.01393 1967.834701
229045.6983 948.3087208
4419414.018 23735.19352
So I need to exclude both values in the row if one does not meet my quartile threshold (0.25 quartile). So if the quartile for O was 45000 then the row "42046.61549,152.1321255" would be removed. Is this possible? If I read in both columns as a dataframe can I search each column separately? Or find the quartiles and then input that value into code to remove the appropriate rows?
Thanks again, and sorry for the evolution of the question!

Please try to provide a reproducible example, but if you have data in a data.frame, you can subset it using the quantile function as the logical test. For instance, in the following data we want to select only rows from the dataframe where the value of the measured variable 'Val' is above the bottom quartile:
# set.seed so you can reproduce these values exactly on your system
set.seed(39856)
df <- data.frame( ID = 1:10 , Val = runif(10) )
df
ID Val
1 1 0.76487516
2 2 0.59755578
3 3 0.94584374
4 4 0.72179297
5 5 0.04513418
6 6 0.95772248
7 7 0.14566118
8 8 0.84898704
9 9 0.07246594
10 10 0.14136138
# Now to select only rows where the value of our measured variable 'Val' is above the bottom 25% quartile
df[ df$Val > quantile(df$Val , 0.25 ) , ]
ID Val
1 1 0.7648752
2 2 0.5975558
3 3 0.9458437
4 4 0.7217930
6 6 0.9577225
7 7 0.1456612
8 8 0.8489870
# And check the value of the bottom 25% quantile...
quantile(df$Val , 0.25 )
25%
0.1424363

Although this is an old question, I came across it during research of my own and I arrived at a solution that someone may be interested in.
I first defined a function which will convert a numerical vector into its quantile groups. Parameter n determines the quantile length (n = 4 for quartiles, n = 10 for deciles).
qgroup = function(numvec, n = 4){
qtile = quantile(numvec, probs = seq(0, 1, 1/n))
out = sapply(numvec, function(x) sum(x >= qtile[-(n+1)]))
return(out)
}
Function example:
v = rep(1:20)
> qgroup(v)
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
Consider now the following data:
dt = data.table(
A0 = runif(100),
A1 = runif(100)
)
We apply qgroup() across the data to obtain two quartile group columns:
cols = colnames(dt)
qcols = c('Q0', 'Q1')
dt[, (qcols) := lapply(.SD, qgroup), .SDcols = cols]
head(dt)
> A0 A1 Q0 Q1
1: 0.72121846 0.1908863 3 1
2: 0.70373594 0.4389152 3 2
3: 0.04604934 0.5301261 1 3
4: 0.10476643 0.1108709 1 1
5: 0.76907762 0.4913463 4 2
6: 0.38265848 0.9291649 2 4
Lastly, we only include rows for which both quartile groups are above the first quartile:
dt = dt[Q0 + Q1 > 2]

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Rolling window with dplyr to find value of factor - r

Related

How to apply function with multiple conditions on multiple columns to get new conditional columns in R

Create a combination of factors with optimization

How can I create new column in data frame by aggregating rows?

R: Summing of the column values by ranged values of another column

Using R to remove data which is below a quartile threshold

Categories

Resources