Detect outliers in a group and outliers in a single row - R

Car 100 200 300
Group1 34 35 34
Group1 57 67 34
Group1 68 76 6
Group2 45 23 23
I have some problems detecting outliers in my data frame. For each group, I want to detect whether a complete vector (one row) is an outlier relative to the other rows of that group (rows one to three). Further, I want to detect whether there is an outlier within one specific row. For the second problem I found the solution below, but with this code I have to repeat it for every single row and check the table for a "TRUE". Is automation possible, e.g. creating a matrix of all outputs so I just have to check sum(matrix == TRUE)?
The code:
library(outliers)  # provides grubbs.test()

x <- as.numeric(data_without[1, 1:400])

grubbs.flag <- function(x) {
  outliers <- NULL
  test <- x
  grubbs.result <- grubbs.test(test)
  pv <- grubbs.result$p.value
  # keep removing the flagged value while the test stays significant
  while (pv < 0.05) {
    outliers <- c(outliers,
                  as.numeric(strsplit(grubbs.result$alternative, " ")[[1]][3]))
    test <- x[!x %in% outliers]
    grubbs.result <- grubbs.test(test)
    pv <- grubbs.result$p.value
  }
  return(data.frame(X = x, Outlier = (x %in% outliers)))
}

grubbs.flag(x)
X Outlier
1 0.1157 FALSE
2 0.1152 FALSE
3 0.1163 FALSE
4 0.1165 FALSE

I've read the grubbs.test documentation, and by default it just checks whether there is a single outlier in the given data. I therefore consider it sufficient to run the test once per group.
First the data is split by group, then the test is run on each group. Only the p-value and the description are returned at the end, to show which value is the outlier, if any; identifying it is easy, since it is always either the maximum or the minimum value.
library(outliers)

df <- t(data.frame(car = c(100, 200, 300),
                   g1  = c(34, 35, 34),
                   g1  = c(57, 67, 34),
                   g1  = c(68, 76, 6),
                   g2  = c(45, 23, 23)))
row.names(df) <- c("car", "group1", "group1", "group1", "group2")

# split the matrix into one list element per distinct row name
lst <- lapply(1:length(unique(row.names(df))), function(x) {
  df[row.names(df) == unique(row.names(df))[x], ]
})
lst
[[1]]
[1] 100 200 300
[[2]]
[,1] [,2] [,3]
group1 34 35 34
group1 57 67 34
group1 68 76 6
[[3]]
[1] 45 23 23
lapply(lst, function(x) {
  tst <- grubbs.test(x)
  c(tst$p.value, tst$alternative)
})
[[1]]
[1] "0.5" "highest value 300 is an outlier"
[[2]]
[1] "0.244875529263511" "lowest value 6 is an outlier"
[[3]]
[1] "0" "highest value 45 is an outlier"

Related

How to subset multiple elements of a list R

I am trying to remove undesired values from elements in a list. The example below is a condensed version of my attempt at solving the problem. This system is going to be made into a Shiny app, so it needs to react to input vectors of any size or cardinality (seen below as A:B, 'group...', remove), as these will be the indirect result of a user selection.
A <- c(35,35,2609,917,0)
B <- c(8,6,9,24,27,35)
C <- c(1,45,91,24)
D <- c(927,38,22,9)
E <- c(6361,7,43)
my.list <- list(A, B, C, D, E)
group1 <- c(1,2)
group2 <- c(3,5)
remove <- c(35, 24, 6361)
my.list[group1] <- my.list[group1] %>% subset(., !.%in% remove)
my.list
###final expected output
my.list
[[1]]
[1] 2609 917 0
[[2]]
[1] 8 6 9 27
[[3]]
[1] 1 45 91 24
[[4]]
[1] 927 38 22 9
[[5]]
[1] 6361 7 43
The solution should allow for any number of input groups that specify the location of the list elements to be subset, any number of list elements, and any number of values to be removed (it shouldn't rely on any fixed cardinality of membership).
Thanks!
my.list[group1] %<>% lapply(function(x) setdiff(x, remove))

R apply a vector of functions to a dataframe

I am currently working on a data frame with raw numeric data in columns. Every column contains data for one parameter (for example, gene expression data of gene xyz), while each row contains a subject. Some of the data in the columns are normally distributed, while some are far from it. I ran Shapiro tests using apply with margin 2 for different transformations and then picked suitable transformations by comparing shapiro.test()$p.value. I sent my picks as characters to a vector, giving me a vector of NA, log10, sqrt with length ncol(DataFrame). I now wonder if it is possible to apply this vector to the data frame via an apply function or, if necessary, a for-loop. How do I do this, or is there a better way? I guess I could loop if-else statements, but there has to be a more efficient way because my code is already slow.
Thanks all!
Update: I tried the code below but it is giving me "Error in file(filename, "r") : invalid 'description' argument"
TransformedExampleDF <- apply(exampleDF, 2, function(x)
  eval(parse(paste(transformationVector, "(", x, ")", sep = ""))))

exampleDF <- as.data.frame(matrix(c(1, 2, 3, 4, 1, 10, 100, 1000, 0.1, 0.2, 0.3, 0.4),
                                  ncol = 3, nrow = 4))
transformationVector <- c(NA, "log10", NA)
So you could do something like this. In the example below, I've cooked up four random functions whose names I've then stored in the character vector my_func_list (note: the last function converts data to NA; that is intentional).
Then I created another function, func_to_df(), that accepts the data.frame and the vector of function names (func_list) as inputs and applies (i.e., executes via get()) each function to the corresponding column of the data.frame. The output is returned (and, in this example, stored in the data.frame my_df1).
tl;dr: just look at what func_to_df() does. It might also be worthwhile looking into the purrr package (although it hasn't been used here). Incidentally, your error arises because parse() treats its first argument as a file name, not as code text; parse(text = ...) would be needed in your attempt.
#---------------------
# Example function 1
myaddtwo <- function(x){
  if(is.numeric(x)){
    x <- x + 2
  } else{
    warning("Input must be numeric!")
  }
  return(x)
  # Constraints such as the one shown above
  # can be added elsewhere to prevent
  # inappropriate action
}

# Example function 2
mymulttwo <- function(x){
  return(x * 2)
}

# Example function 3
mysqrt <- function(x){
  return(sqrt(x))
}

# Example function 4
myna <- function(x){
  return(NA)
}
#---------------------
# Dummy data
my_df <- data.frame(
  matrix(sample(1:100, 40, replace = TRUE),
         nrow = 10, ncol = 4),
  stringsAsFactors = FALSE)

# User somehow ascertains that
# the following order of functions
# is the right one to be applied to the data.frame
my_func_list <- c("myaddtwo", "mymulttwo", "mysqrt", "myna")
#---------------------
# A function which applies
# the functions from func_list
# to the columns of df
func_to_df <- function(df, func_list){
  for(i in 1:length(func_list)){
    df[, i] <- get(func_list[i])(df[, i])
    # Alternative to get():
    # df[, i] <- eval(as.name(func_list[i]))(df[, i])
  }
  return(df)
}
#---------------------
# Execution
my_df1 <- func_to_df(my_df, my_func_list)
#---------------------
#Output
my_df
# X1 X2 X3 X4
# 1 8 85 6 41
# 2 45 7 8 65
# 3 34 80 16 89
# 4 34 62 9 31
# 5 98 47 51 99
# 6 77 28 40 72
# 7 24 7 41 46
# 8 45 80 75 30
# 9 93 25 39 72
# 10 68 64 87 47
my_df1
# X1 X2 X3 X4
# 1 10 170 2.449490 NA
# 2 47 14 2.828427 NA
# 3 36 160 4.000000 NA
# 4 36 124 3.000000 NA
# 5 100 94 7.141428 NA
# 6 79 56 6.324555 NA
# 7 26 14 6.403124 NA
# 8 47 160 8.660254 NA
# 9 95 50 6.244998 NA
# 10 70 128 9.327379 NA
#---------------------
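For completeness, here's a purrr version of the same idea. A sketch, assuming the same my_df and my_func_list as above (my_df2 is a hypothetical name):
library(purrr)
# apply the i-th named function to the i-th column
my_df2 <- as.data.frame(map2(my_df, my_func_list, ~ get(.y)(.x)))
# my_df2 should match my_df1 from above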

Assigning numeric values based on the letters in a string in R

I have a data.frame that is a single column with 235,886 rows. Each row corresponds to a single word of the English language.
E.g.
> words[10000:10005,1]
[1] anticontagionist anticontagious anticonventional anticonventionalism anticonvulsive
[6] anticor
What I'd like to do is convert each row to a number based on the letters in it. So, if "a" = 1, "b" = 2, "c" = 3, and "d" = 4, then "abcd" = 10. Does anyone know of a way to do that?
My ultimate goal is to have a function that scans the data.frame for a given numeric value and returns all the strings, i.e. words, with that value. So, continuing from the example above, if I asked for the value 9, this function would return "dad" and any other rows having a numeric value of 9.
You can use a combination of strsplit and match. I've thrown a tolower in there to make sure that we are matching to the right thing.
Here's a function that implements those steps:
word_value <- function(words) {
  temp <- strsplit(tolower(words), "", TRUE)
  vapply(temp, function(x) sum(match(x, letters)), integer(1L))
}
Here's a sample vector:
myvec <- c("and", "dad", "cat", "fox", "mom", "add", "dan")
Test it out:
word_value(myvec)
# [1] 19 9 24 45 41 9 19
myvec[word_value(myvec) == 9]
# [1] "dad" "add"
myvec[word_value(myvec) > 20]
# [1] "cat" "fox" "mom"
You can use utf8ToInt.
# using the sample data from Ananda's answer
offset <- utf8ToInt("a") - 1
d <- vapply(tolower(myvec),
            function(ii) sum(utf8ToInt(ii) - offset), FUN.VALUE = double(1L))
#and dad cat fox mom add dan
# 19 9 24 45 41 9 19
d[d > 20]
#cat fox mom
# 24 45 41
Using the offset is necessary because utf8ToInt("a") is 97, but you want "a" to be 1.
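For instance, this reproduces the "abcd" = 10 example from the question:
utf8ToInt("abcd") - offset
# [1] 1 2 3 4
sum(utf8ToInt("abcd") - offset)
# [1] 10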
Wrapping with stack will give a different format for the output, if preferred:
d <- stack(vapply(tolower(myvec),
                  function(ii) sum(utf8ToInt(ii) - offset), FUN.VALUE = double(1L)))
# values ind
#1 19 and
#2 9 dad
#3 24 cat
#4 45 fox
#5 41 mom
#6 9 add
#7 19 dan
d[d$values > 20,]
# values ind
#3 24 cat
#4 45 fox
#5 41 mom
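One caveat: utf8ToInt() scores every character, so hyphens or apostrophes in a word would distort the sums. A simple guard is to strip non-letters first (a sketch, reusing offset from above):
dirty <- c("dad's", "co-op")
clean <- gsub("[^a-z]", "", tolower(dirty))   # keep letters only
vapply(clean, function(ii) sum(utf8ToInt(ii) - offset), double(1L))
# dads coop
#   28   49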

Grouping data with the same name and applying a function

I have a matrix like this:
I want to group the columns that have the same name and apply a function to the rows of my matrix.
>data
A A A B B C
gene1 1 6 11 16 21 26
gene2 2 7 12 17 22 27
gene3 3 8 13 18 23 28
gene4 4 9 14 19 24 29
gene5 5 10 15 20 25 30
Basically, I want to put the columns with the same name (like A) into group 1, B into group 2, and so on, and after that calculate a t-test for each gene across all groups.
Can anybody help me with this? First the grouping, then applying the t-test, which should return a t score for each gene between the different groups.
The OP hasn't mentioned what form they want the output in, but I'm entirely updating this answer with a possible solution.
First, some reproducible sample data to work with (that will actually work with t.test).
set.seed(1)
mymat <- matrix(sample(100, 40, replace = TRUE),
                ncol = 8, dimnames = list(
                  paste("gene", 1:5, sep = ""),
                  c("A", "A", "A", "B", "B", "B", "C", "C")))
mymat
# A A A B B B C C
# gene1 27 90 21 50 94 39 49 67
# gene2 38 95 18 72 22 2 60 80
# gene3 58 67 69 100 66 39 50 11
# gene4 91 63 39 39 13 87 19 73
# gene5 21 7 77 78 27 35 83 42
I've left all the hard work to the combn function. Within the combn function, I've made use of the FUN argument to add a function that creates a vector of the t.test "statistic" by each row (I'm assuming one gene per row). I've also added an attribute to the resulting vector to remind us which columns were used in calculating the statistic.
temp <- combn(unique(colnames(mymat)), 2, FUN = function(x) {
  out <- vector(length = nrow(mymat))
  for (i in sequence(nrow(mymat))) {
    out[i] <- t.test(mymat[i, colnames(mymat) %in% x[1]],
                     mymat[i, colnames(mymat) %in% x[2]])$statistic
  }
  attr(out, "NAME") <- paste(x, collapse = "")
  out
}, simplify = FALSE)
The output of the above is a list of vectors. It might be more convenient to convert this into a matrix. Since we know that each value in a vector represents one row, and each vector overall represents one column value combination (AB, AC, or BC), we can use that for the dimnames of the resulting matrix.
DimNames <- list(rownames(mymat), sapply(temp, attr, "NAME"))
final <- do.call(cbind, temp)
dimnames(final) <- DimNames
final
# AB AC BC
# gene1 -0.5407966 -0.5035088 0.157386919
# gene2 0.5900350 -0.7822292 -1.645448267
# gene3 -0.2040539 1.7263502 1.438525163
# gene4 0.6825062 0.5933218 0.009627409
# gene5 -0.4384258 -0.9283003 -0.611226402
Some manual verification:
## Should be the same as final[1, "AC"]
t.test(mymat[1, colnames(mymat) %in% "A"],
       mymat[1, colnames(mymat) %in% "C"])$statistic
# t
# -0.5035088
## Should be the same as final[5, "BC"]
t.test(mymat[5, colnames(mymat) %in% "B"],
       mymat[5, colnames(mymat) %in% "C"])$statistic
# t
# -0.6112264
## Should be the same as final[3, "AB"]
t.test(mymat[3, colnames(mymat) %in% "A"],
       mymat[3, colnames(mymat) %in% "B"])$statistic
# t
# -0.2040539
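The same combn() pattern extends to any component of the t.test result; for example, collecting p-values instead of t statistics (a sketch, reusing mymat from above):
pvals <- combn(unique(colnames(mymat)), 2, FUN = function(x) {
  sapply(seq_len(nrow(mymat)), function(i)
    t.test(mymat[i, colnames(mymat) %in% x[1]],
           mymat[i, colnames(mymat) %in% x[2]])$p.value)
}, simplify = FALSE)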
Update
Building on #EDi's answer, here's another approach. It makes use of melt from "reshape2" to convert the data into a "long" format. From there, as before, it's pretty straightforward subsetting work to get what you want. The output here is transposed relative to the pure combn approach, but the values are the same.
library(reshape2)
mymatL <- melt(mymat)
byGene <- split(mymatL, mymatL$Var1)
RowNames <- combn(unique(as.character(mymatL$Var2)), 2,
                  FUN = paste, collapse = "")
out <- sapply(byGene, function(combos) {
  combn(unique(as.character(mymatL$Var2)), 2, FUN = function(x) {
    t.test(value ~ Var2, combos[combos[, "Var2"] %in% x, ])$statistic
  }, simplify = TRUE)
})
rownames(out) <- RowNames
out
# gene1 gene2 gene3 gene4 gene5
# AB -0.5407966 0.5900350 -0.2040539 0.682506188 -0.4384258
# AC -0.5035088 -0.7822292 1.7263502 0.593321770 -0.9283003
# BC 0.1573869 -1.6454483 1.4385252 0.009627409 -0.6112264
The first (pure combn) option, wrapped here as fun1(), is considerably faster than the melt-based fun2(), at least on this smaller dataset (timings via the microbenchmark package):
microbenchmark(fun1(), fun2())
# Unit: milliseconds
# expr min lq median uq max neval
# fun1() 8.812391 9.012188 9.116896 9.20795 17.55585 100
# fun2() 42.754296 43.388652 44.263760 45.47216 67.10531 100

Generate data where cell counts are random, but row sums always the same

I'm in a situation where I need to create a bunch of fake datasets where the sum of two variables is the same as in my real data, but the counts for each variable are random. Here's the setup:
>df
X.1 X.2
1 145 30
2 55 73
The first row sums to 175, and the second to 128. What I'm looking for is a way to generate a data frame (or a bunch of data frames) like this:
>df.2
X.1 X.2
1 100 75
2 90 38
In df.2, the cell counts have changed, but the rows still sum to the same totals as in the original table. The actual data has hundreds of rows, but only two variables, if that helps. I've tried to figure out how to do this with sample() but haven't had any luck. Any suggestions?
Thanks!
Perhaps you're looking for r2dtable?
> r2dtable(2, c(175,128), c(190, 113))
[[1]]
[,1] [,2]
[1,] 108 67
[2,] 82 46
[[2]]
[,1] [,2]
[1,] 114 61
[2,] 76 52
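Applied to the question's data, the margins can come straight from the table. Note that r2dtable() fixes the column totals as well as the row totals, which is slightly stronger than what was asked for:
df <- data.frame(X.1 = c(145, 55), X.2 = c(30, 73))
# 3 random tables, all with row sums c(175, 128) and column sums c(200, 103)
r2dtable(3, rowSums(df), colSums(df))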
Also, here's a version of #mnel's answer that uses rmultinom to do the n replicates and then combines the results. Not that it really matters if you only need a few replicates, but since rmultinom could do it, I thought I'd see how it might be done.
n <- 10
e <- cbind(X1 = c(100,90,30),X2 = c(75,28,120))
# result: an nrow(e) x ncol(e) x n array; replicate k is [, , k]
aperm(array(sapply(1:nrow(e), function(i)
  rmultinom(n, rowSums(e)[i], (e / rowSums(e))[i, ])),
  dim = c(ncol(e), n, nrow(e))), c(3, 1, 2))
You are sampling from a multinomial distribution.
Edit: to allow for prespecified expected cell counts. Each cell of a multinomial distribution can be viewed as a Poisson distribution (with the expected cell count as its mean), conditional on the total.
Edit 2: allow for any number of rows / expected cell counts. Pass expected as the expected cell counts. Note that rmultinom returns a matrix in which each column is one multinomial sample, hence my use of t to create a single-row matrix.
replicates <- 10
expected <- data.frame(X1 = c(100, 90, 30), X2 = c(75, 28, 120))
##    X1  X2
## 1 100  75
## 2  90  28
## 3  30 120

data_samples <- lapply(seq(replicates), function(i, expected){
  # create a list of expected cell counts (list element = row of expected)
  .list <- lapply(apply(expected, 1, list), unlist)
  # sample from these expected cell counts and recombine into a data.frame
  as.data.frame(do.call(rbind,
    lapply(.list, function(.x) t(rmultinom(n = 1, prob = .x, size = sum(.x))))))
}, expected = expected)
This creates a list of data.frames with the appropriate properties
data_samples[[1]]
## X1 X2
## 1 104 71
## 2 84 34
## 3 19 131
data_samples[[5]]
## X1 X2
## 1 88 87
## 2 92 26
## 3 27 123
Data for use in answers:
test <- data.frame(X.1=c(145,55),X.2=c(30,73))
A version using sample:
t(sapply(
  rowSums(test),
  function(x) {
    one <- sample(1:x, 1)
    two <- (x - one)
    result <- data.frame(one, two)
    names(result) <- names(test)
    return(result)
  }
))
Results look like:
X.1 X.2
[1,] 20 155
[2,] 127 1
or...
X.1 X.2
[1,] 111 64
[2,] 94 34
etc...
Alternatively:
Just add a bit of jitter to one of the numbers first then subtract this from the row sum.
t(apply(
  test,
  1,
  function(x) {
    rsum <- sum(x)
    one <- round(jitter(x[1], 20, 20), 0)
    two <- (rsum - one)
    result <- c(one, two)
    names(result) <- names(test)
    return(result)
  }
))
Result examples:
X.1 X.2
[1,] 160 15
[2,] 47 81
X.1 X.2
[1,] 127 48
[2,] 64 64
If you want a total sample size of, say, n = 40, with 4 cells and 2 columns, then the call should be:
rmultinom(2, size = 40/4, prob = c(0.5,0.5))
[,1] [,2]
[1,] 6 3
[2,] 4 7
If you want a function to deliver this sort of result with specified probability per row then:
my_mat_rand <- function(tot, coln, probs){
  rmultinom(coln, size = tot / length(probs), prob = probs)
}
> my_mat_rand(tot=40, coln=2, probs = c(0.5,0.5))
[,1] [,2]
[1,] 11 10
[2,] 9 10
> my_mat_rand(40, 2, probs = c(0.5,0.5))
[,1] [,2]
[1,] 8 13
[2,] 12 7
If you want the probabilities themselves to be "random", use runif to generate the first element of the probs vector and 1 minus that value as the second element.
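For example (a sketch):
p <- runif(1)                          # a random probability in (0, 1)
my_mat_rand(tot = 40, coln = 2, probs = c(p, 1 - p))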
