Create a dataframe from outliers in a function - r

OK, so admittedly this is related to another question here, but there has been no response, and I suspect that is because I made it too complex. So I'm asking this question, which is different because it is simplified. Happy to be scolded if this is not acceptable.
My core problem is that I want to create a dataframe by including outliers only from each column. The dataframe looks like:
chr leftPos TBGGT 12_try 324Gtt AMN2
1 24352 34 43 19 43
1 53534 2 1 -1 -9
2 34 -15 7 -9 -18
3 3443 -100 -4 4 -9
3 3445 -100 -1 6 -1
3 3667 5 -5 9 5
3 7882 -8 -9 1 3
I would like to calculate the upper and lower limit of each column (from the third onwards), exclude all rows that fall within the limits so that I keep only outliers, and end up with a dataframe like the following for each column. That dataframe then gets passed to the next part of the code (in the loop), but I won't elaborate on that for the sake of simplicity.
chr leftPos TBGGT
2 34 -15
3 3443 -100
3 3445 -100
My code so far:
alpha= 1.5
f1 <- function(df, ZCol){
# Determine the UL and LL and then generate the Zoutliers
UL = median(ZCol, na.rm = TRUE) + alpha*IQR(ZCol, na.rm = TRUE)
LL = median(ZCol, na.rm = TRUE) - alpha*IQR(ZCol, na.rm = TRUE)
Zoutliers <- which(ZCol > UL | ZCol < LL)}
but this just gives me the outliers without the chr and leftPos they are associated with. How do I get those?

Maybe this:
DF <- read.table(text=" chr leftPos TBGGT 12_try 324Gtt AMN2
1 24352 34 43 19 43
1 53534 2 1 -1 -9
2 34 -15 7 -9 -18
3 3443 -100 -4 4 -9
3 3445 -100 -1 6 -1
3 3667 5 -5 9 5
3 7882 -8 -9 1 3", header = TRUE)
#fix your function as explained by @Thilo
#also make alpha a parameter with default value
f1 <- function(ZCol, alpha = 1.5){
UL <- median(ZCol, na.rm = TRUE) + alpha*IQR(ZCol, na.rm = TRUE)
LL <- median(ZCol, na.rm = TRUE) - alpha*IQR(ZCol, na.rm = TRUE)
ZCol > UL | ZCol < LL
}
#loop over the columns and subset with the function's logical return values
outlist <- lapply(3:6, function(i, df) {
df[f1(df[,i]), c(1:2, i)]
}, df = DF)
#[[1]]
# chr leftPos TBGGT
#4 3 3443 -100
#5 3 3445 -100
#
#[[2]]
# chr leftPos X12_try
#1 1 24352 43
#
#[[3]]
# chr leftPos X324Gtt
#1 1 24352 19
#3 2 34 -9
#
#[[4]]
# chr leftPos AMN2
#1 1 24352 43

You basically provided the answer yourself; you're just missing the final link.
Your function computes the limits you define for outliers. We change the result such that it returns a boolean vector that is true if the value is an outlier:
isOutlier <- function(values) {
# Determine the UL and LL
UL <- median(values, na.rm = TRUE) + alpha*IQR(values, na.rm = TRUE)
LL <- median(values, na.rm = TRUE) - alpha*IQR(values, na.rm = TRUE)
values > UL | values < LL # Return a boolean vector that can be used as a filter later on.
}
Now you can subset your data frame simply using this function, i.e.
AMN2.outliers <- subset(df, isOutlier(AMN2))
or
AMN2.outliers <- df[isOutlier(df$AMN2), ]
whichever suits you better. Of course, you could also wrap this line in the function, but for readability I prefer the solution above.
Besides: I would suggest using the <- operator for assignment instead of =. See here.
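A minimal, self-contained sketch of the logical-filter approach, applied to every measurement column. It assumes alpha is passed as an argument instead of being read from the global environment, and the helper name outlier_rows is hypothetical (it does not appear in the posts above).

```r
# Example data from the question; check.names = FALSE keeps names such as "12_try".
DF <- read.table(text = "chr leftPos TBGGT 12_try 324Gtt AMN2
1 24352 34 43 19 43
1 53534 2 1 -1 -9
2 34 -15 7 -9 -18
3 3443 -100 -4 4 -9
3 3445 -100 -1 6 -1
3 3667 5 -5 9 5
3 7882 -8 -9 1 3", header = TRUE, check.names = FALSE)

# TRUE where a value lies outside median +/- alpha * IQR.
isOutlier <- function(values, alpha = 1.5) {
  centre <- median(values, na.rm = TRUE)
  spread <- alpha * IQR(values, na.rm = TRUE)
  values > centre + spread | values < centre - spread
}

# Hypothetical helper: keep chr, leftPos and one measurement column,
# restricted to the rows where that column is an outlier.
outlier_rows <- function(df, col, alpha = 1.5) {
  keep <- which(isOutlier(df[[col]], alpha))  # which() drops NA comparisons
  df[keep, c("chr", "leftPos", col)]
}

value_cols <- setdiff(names(DF), c("chr", "leftPos"))
outliers <- lapply(value_cols, function(col) outlier_rows(DF, col))
names(outliers) <- value_cols

outliers$TBGGT
#   chr leftPos TBGGT
# 4   3    3443  -100
# 5   3    3445  -100
```

Each element of outliers is a small data frame that can be handed to the next step of a loop.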

Related

t-test through all combinations of all factors all levels

I have a dataframe with the following structure:
> str(data_l)
'data.frame': 800 obs. of 5 variables:
$ Participant: int 1 2 3 4 5 6 7 8 9 10 ...
$ Temperature: Factor w/ 4 levels "35","37","39",..: 3 3 3 3 3 3 3 3 3 3 ...
$ Region : Factor w/ 5 levels "Eyes","Front",..: 3 3 3 3 3 3 3 3 3 3 ...
$ Time : Factor w/ 5 levels "0","15","30",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Rating : num 5 5 5 4 5 5 5 5 5 5 ...
I want to run a one-sample t-test for each combination of all levels of all factors, for a total of 4*5*5 = 100 t-tests, with Rating as the dependent variable, or y.
I am stuck at looping through the combinations, and performing t-test at each combo.
I tried splitting the dataframe by the factors, then lapply t.test() through the list, but to no avail.
Does anyone have a better approach? Cheers!
Edit
My ultimate intention is to calculate confidence interval for arrays in all factors all levels. For instance, I was able to do this:
subset1 <- data_l$Rating[data_l$Temperature == 35 & data_l$Region == "Front" & data_l$Time == 0]
Then,
t.test(subset1)$conf.int
But the problem is I will have to do this 100 times.
Edit 2
I am recreating the dataframe.
Temperature <- rep(seq(35, 41, 2), 10)
Region <- rep(c("Front", "Back", "Eyes", "Left", "Right"), 8)
Time <- rep(seq(0, 60, 15), 8)
Rating <- sample(1:5, 40, replace = TRUE)
data_l <- data.frame(Region = factor(Region), Temperature = factor(Temperature), Time = factor(Time), Rating = as.numeric(Rating))
Two things.
Can this be done? Certainly. Should it? Many of your combinations may have insufficient data to find a reasonable confidence interval. While your data sample is certainly reduced and simplified, I have no assurance that your factor combinations will be sufficiently populated.
table(sapply(split(data_l$Rating, data_l[,c("Temperature","Region","Time")]), length))
# 0 2
# 80 20
(There are 80 "empty" combinations of your factor levels.)
Let's try this:
outs <- aggregate(data_l$Rating, data_l[,c("Temperature","Region","Time")],
function(x) if (length(unique(x)) > 1) t.test(x)$conf.int else c(NA, NA))
nrow(outs)
# [1] 20
head(outs)
# Temperature Region Time x.1 x.2
# 1 35 Front 0 NA NA
# 2 37 Front 0 -9.706205 15.706205
# 3 39 Front 0 -2.853102 9.853102
# 4 41 Front 0 -15.559307 22.559307
# 5 35 Back 15 -15.559307 22.559307
# 6 37 Back 15 -4.853102 7.853102
Realize that this is not five columns; the fourth is really a matrix embedded in a frame column:
head(outs$x)
# [,1] [,2]
# [1,] NA NA
# [2,] -9.706205 15.706205
# [3,] -2.853102 9.853102
# [4,] -15.559307 22.559307
# [5,] -15.559307 22.559307
# [6,] -4.853102 7.853102
It's easy enough to extract:
outs$conf1 <- outs$x[,1]
outs$conf2 <- outs$x[,2]
outs$x <- NULL
head(outs)
# Temperature Region Time conf1 conf2
# 1 35 Front 0 NA NA
# 2 37 Front 0 -9.706205 15.706205
# 3 39 Front 0 -2.853102 9.853102
# 4 41 Front 0 -15.559307 22.559307
# 5 35 Back 15 -15.559307 22.559307
# 6 37 Back 15 -4.853102 7.853102
(If you're wondering why I have a conditional on length(unique(x)) > 1, then see what happens without it:
aggregate(data_l$Rating, data_l[,c("Temperature","Region","Time")],
function(x) t.test(x)$conf.int)
# Error in t.test.default(x) : data are essentially constant
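This error comes from combinations whose values are all identical, for which the standard error is zero. aggregate() only passes non-empty groups to the function, so empty combinations never reach t.test(); the guard returns NA for invariant groups instead.)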
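The same idea can be sketched with split() and a guarded helper, which returns a plain two-column matrix instead of a matrix embedded in a data frame column. The data below are simulated and the helper name safe_ci is hypothetical; treat this as one possible arrangement, not the method used above.

```r
set.seed(1)  # arbitrary seed so the simulated ratings are repeatable

# Simulated data with the same columns as in the question (values are made up).
data_l <- expand.grid(Participant = 1:8,
                      Temperature = factor(c(35, 37, 39, 41)),
                      Region = factor(c("Front", "Back", "Eyes", "Left", "Right")),
                      Time = factor(seq(0, 60, 15)))
data_l$Rating <- as.numeric(sample(1:5, nrow(data_l), replace = TRUE))

# Confidence interval for one group, or NA when t.test() cannot be computed
# (fewer than two values, or all values identical).
safe_ci <- function(x, level = 0.95) {
  if (length(x) < 2 || length(unique(x)) < 2) return(c(NA_real_, NA_real_))
  as.numeric(t.test(x, conf.level = level)$conf.int)
}

# One numeric vector per observed Temperature/Region/Time combination.
groups <- split(data_l$Rating,
                data_l[c("Temperature", "Region", "Time")],
                drop = TRUE)

# One row per combination, with the lower and upper confidence limits.
ci <- t(vapply(groups, safe_ci, numeric(2)))
colnames(ci) <- c("lower", "upper")
head(ci)
```

The row names of ci encode the combination (for example "35.Front.0"), so the limits for any single cell can be looked up by name.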
"I am stuck at looping through the combinations, and performing t-test at each combo."
I'm not sure if this is what you wanted.
N <- 800
df <- data.frame(Participant=1:N,
Temperature=gl(4,200),
Region=sample(1:5, 800, TRUE),
Time=sample(1:5, 800, TRUE),
Rating=sample(1:5, 800, TRUE))
head(df)
t_test <- function(data, y, x){
x <- eval(substitute(x), data)
y <- eval(substitute(y), data)
comb <- combn(levels(x), m=2) # this gives all pair-wise combinations
n <- dim(comb)[2]
t <- vector(n, mode="list")
for(i in 1:n){
xlevs <- comb[,i]
DATA <- subset(data, subset=x %in% xlevs)
x2 <- factor(x, levels=xlevs)
tt <- t.test(y~x2, data=DATA)
t[[i]] <- tt
names(t)[i] <- toString(xlevs)
}
t
}
T.test <- t_test(df, Rating, Temperature)
T.test[1]
$`1, 2`
Welch Two Sample t-test
data: y by x2
t = -1.0271, df = 396.87, p-value = 0.305
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.4079762 0.1279762
sample estimates:
mean in group 1 mean in group 2
2.85 2.99
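For comparison, base R's pairwise.t.test() runs all pairwise tests in one call. It is a different tool from the loop above: it reports adjusted p-values only, not confidence intervals. The data here are simulated, and the choice of Holm adjustment is an assumption for illustration.

```r
set.seed(42)  # arbitrary seed, only to make the simulated ratings repeatable
df <- data.frame(Temperature = gl(4, 200),
                 Rating = sample(1:5, 800, replace = TRUE))

# Welch-type t-tests (pool.sd = FALSE) for every pair of Temperature levels,
# with Holm-adjusted p-values.
pw <- pairwise.t.test(df$Rating, df$Temperature,
                      p.adjust.method = "holm", pool.sd = FALSE)
pw$p.value  # lower-triangular matrix of adjusted p-values
```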

User Defined Function to create and sum a subset in R

I need help defining a function that creates a vector in a database where, for each row, the function looks at another column in that database, searches for that value in a designated column of a separate database, creates a subset of that second database consisting of all matching rows, sums a separate column of that new subset, and returns that value to the corresponding row of the new column in the original database.
In other words, I have a data frame that looks something like this:
ID <- c('a', 'b', 'c', 'd', 'e')
M <- 20:39
df <- data.frame(cbind(ID, M))
df$M <- as.numeric(df$M)
> df
ID M
1 a 1
2 b 2
3 c 3
4 d 4
5 e 5
6 a 6
7 b 7
8 c 8
9 d 9
10 e 10
11 a 11
12 b 12
13 c 13
14 d 14
15 e 15
16 a 16
17 b 17
18 c 18
19 d 19
20 e 20
> str(df)
'data.frame': 20 obs. of 2 variables:
$ ID: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5 1 2 3 4 5 ...
$ M : num 1 2 3 4 5 6 7 8 9 10 ...
I would like to create a new data frame, Z, such that Z <- data.frame(cbind(X, Y)) where:
X <- as.character(unique(df$ID))
> X
[1] "a" "b" "c" "d" "e"
and Y is a vector of the sum of all a's, sum of all b's, sum of all c's, etc...
So, Y should be equal to c(34, 38, 42, 46, 50) and my final result should be:
> Z
X Y
1 a 34
2 b 38
3 c 42
4 d 46
5 e 50
> str(Z)
'data.frame': 5 obs. of 2 variables:
$ X: chr "a" "b" "c" "d" ...
$ Y: num 34 38 42 46 50
To do this, I've tried first turning X into a data frame (is it easier to work with as a data table?):
> Z <- data.frame(X)
> Z
X
1 a
2 b
3 c
4 d
5 e
> str(Z)
'data.frame': 5 obs. of 1 variable:
$ X: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
and then defining Y as Z$Y <- sum(df[df$ID == Z$X, 2]) but I don't get unique values:
> Z
X Y
1 a 210
2 b 210
3 c 210
4 d 210
5 e 210
I've also tried defining the function f1() like so:
f1 <- function(v, w, x, y, z){sum(v[v$w == x$y, z])}
but that gets me:
> f1(df, 'ID', Z, 'X', 'M')
[1] 0
I have found a function from another post on this forum that does something similar:
f1 <- function(df, cols, match_with, to_x = 50){
df[cols] <- lapply(df[cols], function(i)
ifelse(grepl(to_x, match_with, fixed = TRUE), 'MID',
i))
return(df)
}
This looks for the value "50" in the match_with column and returns the value "MID" to that row of the column designated by cols, provided both columns in the same designated data base df. So, I would need to replace to_x = 50 with something that, instead of looking for the fixed value "50," looks for whatever value is in the column Z$X and, instead of returning the fixed value "MID," returns the sum of the values df[df$ID == Z$X, df$M]. I've attempted these changes myself by writing variations of the following:
f1 <- function(df, cols, match_with, to_x = df[ , 1], x){
df[cols] <- lapply(df[cols], function(i)
ifelse(grepl(to_x, match_with, fixed = TRUE), sum(x),
i))
return(df)
}
but, so far, none of my variations have produced the desired results. This one gave me:
> f1(Z, df, cols = c('Y'), match_with = df$ID, x = df$M)
X Y
1 a 210
2 b 210
3 c 210
4 d 210
5 e 210
Warning messages:
1: In grepl(to_x, match_with, fixed = TRUE) :
argument 'pattern' has length > 1 and only the first element will be used
2: In `[<-.data.frame`(`*tmp*`, cols, value = list(Y = c(210, 210, :
replacement element 1 has 20 rows to replace 5 rows
It seems to be summing the entirety of df$M instead of the subsets where df$ID == Z$X. In other variations it seemed to have problems referencing a column in a second data frame.
I am somewhat new to R and have almost no experience writing user-defined functions (as you probably could tell by this question). Any help would be very much appreciated!
Never mind, y'all, I think I got it!
> f1 <- function(col1, col2, df2, to_add){
+ lapply(col1, function(i){
+ df2$x <- grepl(i, col2, fixed = TRUE)
+ df3 <- df2[df2$x == TRUE, to_add]
+ sum(df3, na.rm = TRUE)
+ })}
> Z$Y <- f1(Z$X, df$ID, df, c('M'))
> Z
X Y
1 a 34
2 b 38
3 c 42
4 d 46
5 e 50
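A grouped sum like this can also be sketched with base R's aggregate() or tapply(), without a user-defined function. The data frame below is rebuilt directly with M = 1..20 so that it matches the printed df above; that construction is an assumption made to keep the example independent of R version.

```r
# Data with the same shape as in the question: IDs a-e repeated, M = 1..20.
df <- data.frame(ID = rep(c("a", "b", "c", "d", "e"), times = 4), M = 1:20)

# Sum M within each ID using the formula interface.
Z <- aggregate(M ~ ID, data = df, FUN = sum)
names(Z) <- c("X", "Y")
Z
#   X  Y
# 1 a 34
# 2 b 38
# 3 c 42
# 4 d 46
# 5 e 50

# The same sums as a named vector.
tapply(df$M, df$ID, sum)
```

Unlike the lapply() version, this yields an ordinary numeric column Y instead of a list column.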

R - change data frame entries by specific string (count and replace by the value of counter, change sign)

I have a data frame that contains numbers and numbers separated by a "." and I want to change the entries dependent on the "." string.
If the entry does not contain a "." the prefix "-" should be added. That's kind of simple using the subsetting or grep functionality. But I also want to replace the entries which contain a "." with the counter of ".".
my example data:
X1 X2
1 2
3 4
6 8
5 1.2
3.4 7
1.2.5 9
11 3.4.7
and I would like to have it look like this:
X1 X2
-1 -2
-3 -4
-6 -8
-5 1
2 -7
3 -9
-11 4
I have no clue how to do this. I have already tried subsetting and extracting the "." parts to count them, but I cannot insert the counter. Thanks for your help.
Here is an idea via base R,
ind <- rowSums(sapply(df, function(i) cumsum(grepl('\\.', i))))
df[] <- lapply(df[], function(i) ifelse(grepl('\\.', i), ind, paste0('-', i)))
df
# X1 X2
#1 -1 -2
#2 -3 -4
#3 -6 -8
#4 -5 1
#5 2 -7
#6 3 -9
#7 -11 4
NOTE : I converted df to character,
df[] <- lapply(df[], as.character)
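A self-contained variant of the same idea, with the example data entered as character from the start. It builds the running counter slightly differently (cumsum over rows that contain any dotted entry) and assumes at most one dotted entry per row, as in the example; the names has_dot, counter and res are illustrative.

```r
# Example data from the question, stored as character so "1.2.5" survives.
df <- data.frame(X1 = c("1", "3", "6", "5", "3.4", "1.2.5", "11"),
                 X2 = c("2", "4", "8", "1.2", "7", "9", "3.4.7"),
                 stringsAsFactors = FALSE)

# Logical matrix: TRUE where an entry contains a literal ".".
has_dot <- sapply(df, function(col) grepl(".", col, fixed = TRUE))

# Running counter: number of rows with a dotted entry seen so far.
counter <- cumsum(rowSums(has_dot) > 0)

res <- df
for (nm in names(df)) {
  # Dotted entries get the counter; all others get a "-" prefix.
  res[[nm]] <- ifelse(has_dot[, nm], counter, paste0("-", df[[nm]]))
}
res[] <- lapply(res, as.integer)
res
#    X1 X2
# 1  -1 -2
# 2  -3 -4
# 3  -6 -8
# 4  -5  1
# 5   2 -7
# 6   3 -9
# 7 -11  4
```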
EDIT
Regarding your row numbers request, then this should do it,
ind1 <- apply(df, 1, function(i) paste(sort(i), collapse = '.'))
df2 <- sapply(df, function(i) match(i, ind1))
df[] <- lapply(df[], function(i) ifelse(grepl('\\.', i), 0, paste0('-', i)))
df[!is.na(df2)] <- df2[!is.na(df2)]
df
# X1 X2
#1 -1 -2
#2 -3 -4
#3 -6 -8
#4 -5 1
#5 2 -7
#6 4 -9
#7 -11 5
If you are planning on doing calculations with this data frame later on, then you should convert to integer, i.e.,
df[] <- lapply(df[], as.integer)
str(df)
#'data.frame': 7 obs. of 2 variables:
# $ X1: int -1 -3 -6 -5 2 4 -11
# $ X2: int -2 -4 -8 1 -7 -9 5
Here it is with data.table
The idea is to create a counter in a temporary column:
library(data.table)
dt<-data.table(df)
dt$X1 <- as.character(dt$X1 )
dt$X2 <- as.character(dt$X2 )
dt[!grepl(".", dt$X1, fixed=TRUE),X1:=paste("-", X1, sep="") ]
dt[!grepl(".", dt$X2, fixed=TRUE),X2:=paste("-", X2, sep="") ]
dt[grepl(".", dt$X1, fixed=TRUE)|grepl(".", dt$X2, fixed=TRUE), count_point:=as.character(sequence(.N))]
dt[grepl(".", dt$X1, fixed=TRUE),X1:=count_point]
dt[grepl(".", dt$X2, fixed=TRUE),X2:=count_point]
df <- data.frame(dt[, c("X1", "X2"), with = FALSE])
There should be a way to do it in fewer lines, using .SD.

Only include outliers from each column in a dataframe

I have a dataframe as follows:
chr leftPos TBGGT 12_try 324Gtt AMN2
1 24352 34 43 19 43
1 53534 2 1 -1 -9
2 34 -15 7 -9 -18
3 3443 -100 -4 4 -9
3 3445 -100 -1 6 -1
3 3667 5 -5 9 5
3 7882 -8 -9 1 3
I have to create a loop which:
a) Calculates the upper and lower limit (UL and LL) for each column from the third column onwards.
b) Only includes rows that fall outside of the UL and LL (Zoutliers).
c) Then count the number of rows where the Zoutlier is the same direction (i.e. positive or negative) as the previous or the subsequent row for the same chr.
The output would therefore be:
ZScore1 TBGGT 12_try 324Gtt AMN2
nrow 4 6 4 4
So far I have code as follows:
library(data.table)#v1.9.5
f1 <- function(df, ZCol){
#A) Determine the UL and LL and then generate the Zoutliers
UL = median(ZCol, na.rm = TRUE) + alpha*IQR(ZCol, na.rm = TRUE)
LL = median(ZCol, na.rm = TRUE) - alpha*IQR(ZCol, na.rm = TRUE)
Zoutliers <- which(ZCol > UL | ZCol < LL)
#B) Exclude Zoutliers per chr if same direction as previous or subsequent row
na.omit(as.data.table(df)[, {tmp = sign(eval(as.name(ZCol)))
.SD[tmp==shift(tmp) | tmp==shift(tmp, type='lead')]},
by=chr])[, list(.N)]}
nm1 <- names(df)[3:ncol(df)]
setnames(do.call(cbind,lapply(nm1, function(x) f1(df, x))), nm1)[]
The code is patched together from various places. The problem I have is combining parts A) and B) of the code to get the output I want
Can you try this function? I was not sure what alpha is, so I could not reproduce the expected output and included it as variable in the function.
# read your data per copy&paste
d <- read.table("clipboard",header = T)
# or, as Frank suggested in a comment, via fread
d <- data.table::fread("chr leftPos TBGGT 12_try 324Gtt AMN2
1 24352 34 43 19 43
1 53534 2 1 -1 -9
2 34 -15 7 -9 -18
3 3443 -100 -4 4 -9
3 3445 -100 -1 6 -1
3 3667 5 -5 9 5
3 7882 -8 -9 1 3")
# set up the function
foo <- function(x, alpha, chr){
# your code for task a) and b)
UL = median(x, na.rm = TRUE) + alpha*IQR(x, na.rm = TRUE)
LL = median(x, na.rm = TRUE) - alpha*IQR(x, na.rm = TRUE)
Zoutliers <- which(x > UL | x < LL)
# part (c)
# factor which specifies the direction. 0 values are set as positives
pos_neg <- ifelse(x[Zoutliers] >= 0, "positive", "negative")
# count the occurrence per chromosome and direction.
aggregate(x[Zoutliers], list(chr[Zoutliers], pos_neg), length)
}
# apply over the columns and get a list of dataframes with number of outliers per chr and direction.
apply(d[,3:ncol(d)], 2, foo, 0.95, d$chr)
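Part (c) can be read in more than one way. The sketch below takes one reading: among the outlier rows of a column, count those whose sign matches the previous or next outlier row on the same chr. It assumes rows are already ordered by chr and position, uses alpha = 1.5 as an arbitrary default, and does not try to reproduce the counts listed in the question, since alpha is not given there. The helper name count_same_direction is hypothetical.

```r
d <- read.table(text = "chr leftPos TBGGT 12_try 324Gtt AMN2
1 24352 34 43 19 43
1 53534 2 1 -1 -9
2 34 -15 7 -9 -18
3 3443 -100 -4 4 -9
3 3445 -100 -1 6 -1
3 3667 5 -5 9 5
3 7882 -8 -9 1 3", header = TRUE, check.names = FALSE)

# Hypothetical helper: number of outlier rows whose sign equals the sign of
# the neighbouring (previous or next) outlier row on the same chr.
count_same_direction <- function(x, chr, alpha = 1.5) {
  centre <- median(x, na.rm = TRUE)
  spread <- alpha * IQR(x, na.rm = TRUE)
  idx <- which(x > centre + spread | x < centre - spread)
  n <- length(idx)
  if (n == 0) return(0L)
  s <- sign(x[idx])  # direction of each outlier
  g <- chr[idx]      # chromosome of each outlier
  same_prev <- c(FALSE, s[-1] == s[-n] & g[-1] == g[-n])
  same_next <- c(s[-n] == s[-1] & g[-n] == g[-1], FALSE)
  sum(same_prev | same_next)
}

counts <- sapply(d[-(1:2)], count_same_direction, chr = d$chr)
counts
```

With these data and alpha = 1.5, only TBGGT has two adjacent outliers in the same direction on the same chr (the two -100 values on chr 3).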

R apply code to different factors or levels

Below is code to generate data to demonstrate the problem.
con <- textConnection('
Nu Na Vo
100 A 60
103 A 2
104 A 2
106 A 5
107 A 1
108 A 1
112 A 50
100 B 1
108 B 4
109 B 2
120 B 30
109 C 40
')
tt <- read.table(con, header = T)
close(con)
test <- as.data.frame(tt)
I have the following code. It assigns a value to the "Sta" column subject to a specific condition and adds the difference in "Nu" between rows i and i+1 to the "Lag" column.
library(dplyr)
# to sort "Na" column and arrange "Nu" in descending order
# in order to apply the code below.
test2 <- tt %>% arrange(Na, -Nu)
for (i in 1:nrow(test2)) {
if (i < nrow(test2)) {
if (test2[i, ]$Nu - 2 > test2[i+1, ]$Nu) {
test2[i, 4] <- "N"
test2[i, 5] <- test2[i, ]$Nu - test2[i+1, ]$Nu
} else if (test2[i, ]$Nu - 2 <= test2[i+1, ]$Nu) {
test2[i, 4] <- "Y"
test2[i, 5] <- test2[i, ]$Nu - test2[i+1, ]$Nu
}
} else if (i == nrow(test2)) {
test2[i, 4] <- "N"
test2[i, 5] <- 0
}
}
names(test2)[names(test2) == "V4"] <- "Sta"
names(test2)[names(test2) == "V5"] <- "Lag"
test2
After running the code, it produces the result as below:
Nu Na Vo Sta Lag
1 112 A 50 N 4
2 108 A 1 Y 1
3 107 A 1 Y 1
4 106 A 5 Y 2
5 104 A 2 Y 1
6 103 A 2 N 3
7 100 A 60 Y -20
8 120 B 30 N 11
9 109 B 2 Y 1
10 108 B 4 N 8
11 100 B 1 Y -9
12 109 C 40 N 0
The values under the "Sta" column are properly assigned, but those in the "Lag" column are not. The original intention is to apply the code separately for each value/level of "Na", that is "A", "B", "C". I don't know how to apply the code to "A", "B", "C" separately and combine the separate results into ONE table. The desired outcome is:
Nu Na Vo Sta Lag
1 112 A 50 N 4
2 108 A 1 Y 1
3 107 A 1 Y 1
4 106 A 5 Y 2
5 104 A 2 Y 1
6 103 A 2 N 3
7 100 A 60 N 0 << Last row for "A". "Lag" should be "0"; "Sta" should be "N".
8 120 B 30 N 11
9 109 B 2 Y 1
10 108 B 4 N 8
11 100 B 1 N 0 << Last row for "B". "Lag" should be "0"; "Sta" should be "N".
12 109 C 40 N 0 << Last row for "C". "Lag" should be "0"; "Sta" should be "N".
Edited
I'm not sure how to apply the code to the different factors / levels of "Na": "A", "B" and "C". Is it possible to use split() or the apply family of functions? As can be seen from the result and the intent of the code above, the result should be FACTOR / LEVEL / element dependent (I hope I'm using the proper terminology), and this affects the values under both the "Sta" and "Lag" columns. However, my code cannot distinguish between the levels. I'd appreciate any help. Thanks.
An inelegant solution!
For completeness, I am posting a possible solution here. I coded it the hard way. If anyone could help simplify it, it would be very much appreciated.
con <- textConnection('
Nu Na Vo
100 A 60
103 A 2
104 A 2
106 A 5
107 A 1
108 A 1
112 A 50
100 B 1
108 B 4
109 B 2
120 B 30
109 C 40
')
tt <- read.table(con, header = T)
close(con)
require(dplyr); require(data.table)
test2 <- tt %>% arrange(Na, -Nu)
spl <- split(test2, test2$Na)
spl
for (i in seq_along(spl)) {
for (j in 1:nrow(spl[[i]])) {
if (j < nrow(spl[[i]])) {
if (spl[[i]][j, ]$Nu - 2 > spl[[i]][j+1, ]$Nu) {
spl[[i]][j, 4] <- "N"
spl[[i]][j, 5] <- spl[[i]][j, ]$Nu - spl[[i]][j+1, ]$Nu
} else if (spl[[i]][j, ]$Nu - 2 <= spl[[i]][j+1, ]$Nu) {
spl[[i]][j, 4] <- "Y"
spl[[i]][j, 5] <- spl[[i]][j, ]$Nu - spl[[i]][j+1, ]$Nu
}
} else if (j == nrow(spl[[i]])) {
spl[[i]][j, 4] <- "N"
spl[[i]][j, 5] <- 0
}
}
}
spl <- rbindlist(spl)
setnames(spl, c("V4", "V5"), c("Sta", "Lag"))
spl
ave to the rescue - if applied twice this will essentially do the same comparisons as your long loop code.
First, calculate the lag differences using diff for each group, and set the value for the last row in each group to 0. Then use the computed lag values to determine the "Sta" column, forcing the last row in each group's value to be assigned "N".
test2$Lag <- with(test2, ave(Nu, Na, FUN=function(x) -c(diff(x),0)) )
test2$Sta <- with(test2, ave(Lag, Na, FUN=function(x) {
out <- ifelse(x > 2, "N", "Y"); out[length(out)] <- "N"; out}))
Same result as requested:
test2[c(1:3,5,4)]
# Nu Na Vo Sta Lag
#1 112 A 50 N 4
#2 108 A 1 Y 1
#3 107 A 1 Y 1
#4 106 A 5 Y 2
#5 104 A 2 Y 1
#6 103 A 2 N 3
#7 100 A 60 N 0
#8 120 B 30 N 11
#9 109 B 2 Y 1
#10 108 B 4 N 8
#11 100 B 1 N 0
#12 109 C 40 N 0
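The same result can be sketched end to end in base R, without dplyr. This variant computes the lag with ave() as above but marks the last row of each group with duplicated(..., fromLast = TRUE), which is a substitution for the second ave() call, not the code from the answer.

```r
tt <- read.table(text = "Nu Na Vo
100 A 60
103 A 2
104 A 2
106 A 5
107 A 1
108 A 1
112 A 50
100 B 1
108 B 4
109 B 2
120 B 30
109 C 40", header = TRUE)

# Sort by group, then by Nu in descending order.
test2 <- tt[order(tt$Na, -tt$Nu), ]
rownames(test2) <- NULL

# Gap to the next row within each group; 0 for the last row of a group.
test2$Lag <- ave(test2$Nu, test2$Na, FUN = function(x) c(-diff(x), 0))

# TRUE for the last row of each group.
is_last <- !duplicated(test2$Na, fromLast = TRUE)

# "N" when the gap exceeds 2 or the row closes its group, otherwise "Y".
test2$Sta <- ifelse(test2$Lag > 2 | is_last, "N", "Y")
test2[c("Nu", "Na", "Vo", "Sta", "Lag")]
```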

Resources