Several Grubbs tests simultaneously in R - r

I'm new using R, I'm just starting with the outliers package. Probably this is very easy, but could anybody tell me how to run several Grubbs tests at the same time? I have 20 columns and I want to test all of them simultaneously.
Thanks in advance
Edit: Sorry for not explaining well. I'll try. I started using R today and I learned how to make Grubbs test using grubbs.test(data$S1, type=10 or 11 or 20) and it goes well. But I have a table with 20 columns, and I want to run Grubbs test for each of them simultaneously. I can do it one by one, but I think there must be a way to do it faster.
I ran the code at How to repeat the Grubbs test and flag the outliers as well, and works perfectly, but again, I would like to do it with my 20 samples.
As an example of my data:
S1 S2 S3 S4 S5 S6 S7
96 40 99 45 12 16 48
52 49 11 49 59 77 64
18 43 11 67 6 97 91
79 19 39 28 45 44 99
9 78 88 6 25 43 78
60 12 29 32 2 68 25
18 61 60 30 26 51 70
96 98 55 74 83 17 69
19 0 17 24 0 75 45
42 70 71 7 61 82 100
39 80 71 58 6 100 94
100 5 41 18 33 98 97
Hope this helps.

You can use lapply:
library(outliers)
df = data.frame(a=runif(20),b=runif(20),c=runif(20))
tests = lapply(df,grubbs.test)
# or with parameters:
tests = lapply(df,grubbs.test,opposite=T)
Results:
> tests
$a
Grubbs test for one outlier
data: X[[i]]
G = 1.80680, U = 0.81914, p-value = 0.6158
alternative hypothesis: highest value 0.963759744539857 is an outlier
$b
Grubbs test for one outlier
data: X[[i]]
G = 1.53140, U = 0.87008, p-value = 1
alternative hypothesis: highest value 0.975481075001881 is an outlier
$c
Grubbs test for one outlier
data: X[[i]]
G = 1.57910, U = 0.86186, p-value = 1
alternative hypothesis: lowest value 0.0136249314527959 is an outlier
You can access the results as follows:
> tests$a$statistic
G U
1.8067906 0.8191417
Hope this helps.

A #Florian answer can be updated a bit. For example fancy and easy-reading result can be achieved with purrr package and tidyverse. It can be useful if you are comparing loads of groups:
Load necessary packages:
library(dplyr)
library(purrr)
library(tidyr)
library(outliers)
Create some data - we're going to use the same from Florian's answer, but transformed to a modern tibble and long format:
df <- tibble(a = runif(20),
b = runif(20),
c = runif(20)) %>%
# transform to along format
tidyr::gather(letter, value)
Then instead of apply functions we can use map and map_dbl from purrr:
df %>%
group_by(letter) %>%
nest() %>%
mutate(n = map_dbl(data, ~ nrow(.x)), # number of entries
G = map(data, ~ grubbs.test(.x$value)$statistic[[1]]), # G statistic
U = map(data, ~ grubbs.test(.x$value)$statistic[[2]]), # U statistic
grubbs = map(data, ~ grubbs.test(.x$value)$alternative), # Alternative hypotesis
p_grubbs = map_dbl(data, ~ grubbs.test(.x$value)$p.value)) %>% # p-value
# Let's make the output more fancy
mutate(G = signif(unlist(G), 3),
U = signif(unlist(U), 3),
grubbs = unlist(grubbs),
p_grubbs = signif(p_grubbs, 3)) %>%
select(-data) %>% # remove temporary column
arrange(p_grubbs)
And the desired output would be this:
# A tibble: 3 x 6
letter n G U grubbs p_grubbs
<chr> <dbl> <dbl> <dbl> <chr> <dbl>
1 c 20 1.68 0.843 lowest value 0.0489965472370386 is an outlier 0.84
2 a 20 1.58 0.862 lowest value 0.0174888013862073 is an outlier 1
3 b 20 1.57 0.863 lowest value 0.0656482006888837 is an outlier 1

Related

How to randomly select row from a dataframe for which the row skewness is larger that a given value in R

I am trying to select random rows from a data frame with 1000 lines (and six columns) where the skewness of the line is larger than a given value (say Sk > 0.3).
I've generated the following data frame
df=data.frame(replicate(6,sample(10:100,1000,rep=TRUE)))
I can get row skewness from the fbasics package:
rowSkewness(df) gives:
[8] -0.2243295435 0.5306809351 0.0707122386 0.0341447417 0.3339384838 -0.3910593364 -0.6443905090
[15] 0.5603809206 0.4406091534 -0.3736108832 0.0397860038 0.9970040772 -0.7702547535 0.2065830354
But now, I need to select say 10 rows of the df which have rowskewness greater than say 0.1... May with
for (a in 1:10) {
sample.data[a,] = sample(x=df[which(rowSkewness(df[sample(1:nrow(df),1)>0.1),], size = 1, replace = TRUE)
}
or something like this?
Any thoughts on this will be appreciated.
thanks in advance.
you can use the sample_n() function or sample_frac() - makes your version a little shorter:
library(tidyr)
library(fBasics)
df=data.frame(replicate(6,sample(10:100,1000,rep=TRUE)))
x=df %>% dplyr::filter(rowSkewness(df)>0.1) %>% dplyr::sample_n(10)
Got it:
x=df %>% filter(rowSkewness(df)>0.1)
for (a in 1:samplesize) {
sample.data[a,] = sample(x=x, size = 1, replace = TRUE)
}
Just do a subset:
res1 <- DF[fBasics::rowSkewness(DF) > .1, ]
head(res1)
# X1 X2 X3 X4 X5 X6
# 7 56 28 21 93 74 24
# 8 33 56 23 44 10 12
# 12 29 19 29 38 94 95
# 13 35 51 54 98 66 10
# 14 12 51 24 23 36 68
# 15 50 37 81 22 55 97
Or with e1071::skewness:
res2 <- DF[apply(as.matrix(DF), 1, e1071::skewness) > .1, ]
stopifnot(all.equal(res1, res2))
Data
set.seed(42); DF <- data.frame(replicate(6, sample(10:100, 1000, rep=TRUE)))

Filter all columns in timeseries to keep only top 1/3

I have a timeseries with about 100 dates, 50 entities per date (so 5,000 rows) and 50 columns (all are different variables). How can I filter each column in the data frame, per unique date, to keep the top 1/3 of values for each column on each date. Then get the average Return for that group for that date. Thank you.
My data is organized as follows but the numbers in each column are random and vary like they do in column "a" (this is a sample, the real data has many more columns and many more rows):
Date Identity Return a b c d e f... ...z
2/1/19 X 5 75 43 67 85 72 56 92
2/1/19 Y 4 27 43 67 85 72 56 92
2/1/19 Z 7 88 43 67 85 72 56 92
2/1/19 W 2 55 43 67 85 72 56 92
2/2/19 X 7 69 43 67 85 72 56 92
2/2/19 Y 8 23 43 67 85 72 56 92
2/3/19 X 2 34 43 67 85 72 56 92
2/3/19 Y 3 56 43 67 85 72 56 92
2/3/19 Z 4 62 43 67 85 72 56 92
2/3/19 W 4 43 43 67 85 72 56 92
2/3/19 U 4 26 43 67 85 72 56 92
2/4/19 X 6 67 43 67 85 72 56 92
2/4/19 Y 1 78 43 67 85 72 56 92
2/5/19 X 4 75 43 67 85 72 56 92
2/7/19 X 5 99 43 67 85 72 56 92
2/7/19 Y 4 72 43 67 85 72 56 92
2/7/19 Z 4 45 43 67 85 72 56 92
I am trying to filter data into quantiles. I have a code that works for filtering into quantiles for one measure. However I want filtered results for many measures individually (i.e. I want a “high” group for a ton of columns).
The code that I have that works for one measure is as follows.
Columns are date, identity, and a a is the indicator I want to sort on
High = df[!is.na(df$a),] %>%
group_by(df.date) %>%
filter(a > quantile(a, .666)) %>%
summarise(high_return = sum(df.return) / length(df.identity)
Now I want to loop this for when I have many indicators to sort on individually (I.e. I do not want to sort within one another, I want each sorted separately and the results to be broken out by indicator)
I want the output of the loop to be a new data frame with the following format (where a_Return is the average return of the top 1/3 of the original a's on a given date):
Date a_Return b_Return c_Return
2/1/19 6. 7 3
2/3/19 4. 2 5
2/4/19 2. 4 6
I have tried the code below without it working:
Indicators <- c(“a”, “b”, “c”)
for(i in 1:length(Indicators)){
High = df %>%
group_by(df.date) %>%
filter(High[[I]] > quantile(High[[i]], .666)) %>%
summarise(g = sum(df.return) / length(df.identity)}
With this attempt I get the error: "Error in filter_impl(.data, quo) : Result must have length 20, not 4719.
I also tried:
High %>%
group_by(date) %>%
filter_at(vars(Indicators[i]), any_vars(. > quantile (., .666)))%>%
summarise(!!Indicators[I] := sum(Return) / n())
but with that code I get the error "Strings must match column names. Unknown Columns: NA"
I want High to turn up with a date column and then a column for each a, b, and c.
If you combine the filtering and calculations into a single function, then you can put that into summarize_at to apply it easily to each column. Since you're example data isn't fully reproducible, I'll use the iris dataset. In your case, you'd replace Species with Date, and Petal.Width with Return:
library(dplyr)
top_iris <- iris %>%
group_by(Species) %>%
summarize_at(vars(one_of('Sepal.Length', 'Sepal.Width', 'Petal.Length')),
funs(return = sum(Petal.Width[. > quantile(., .666)]) / length(Petal.Width[. > quantile(., .666)])))
top_iris
# A tibble: 3 x 4
Species Sepal.Length_return Sepal.Width_return Petal.Length_return
<fct> <dbl> <dbl> <dbl>
1 setosa 0.257 0.262 0.308
2 versicolor 1.44 1.49 1.49
3 virginica 2.1 2.22 2.09
The problem with using filter is that each function in the pipe runs in order, so any criteria you give to filter_* will have to be applied to the whole data.frame before the result is piped into summarize_at. Instead, we just use a single summarize_at statement, and filter each column as the summarization function is applied to it.
To explain this in more detail, summarize_at takes 2 arguments:
The first argument is one or more of the variable selector functions described in ?select_helpers, enclosed in the vars function. Here we use one_of which just takes a vector of column names, but we could also use matches to select using a regular expession, or starts_with to choose based on a prefix, for example.
The second argument is a list of one or more function calls to be run on each selected column, enclosed in the funs function. Here we have 1 function call, to which we've given the name return.
Like with any tidyverse function, this is evaluated in a local environment constructed from the data piped in. So bare variable names like Petal.Width function as data$Petal.Width. In *_at functions, the . represents the variable passed in, so when the Sepal.Length column is being summarized:
Petal.Width[. > quantile(., .666)]
means:
data$Petal.Width[data$Sepal.Length > quantile(data$Sepal.Length, .666)]
Finally, since the function in funs is named (that's the return =), then the resulting summary columns have the function's name (return) appended to the original column names.
If you want to remove missing data before running these calculations, you can use na.omit to strip out NA values.
To remove all rows containing NA, just pipe your data through na.omit before grouping:
iris2 <- iris
iris2[c(143:149), c(1:2)] <- NA
iris2 %>%
na.omit() %>%
group_by(Species) %>%
summarize_at(vars(one_of('Sepal.Length', 'Sepal.Width', 'Petal.Length')),
funs(return = sum(Petal.Width[. > quantile(., .666)]) / length(Petal.Width[. > quantile(., .666)])))
Species Sepal.Length_return Sepal.Width_return Petal.Length_return
<fct> <dbl> <dbl> <dbl>
1 setosa 0.257 0.262 0.308
2 versicolor 1.44 1.49 1.49
3 virginica 2.09 2.19 2.07
To strip NA values from each column as it's being summarized, you need to move na.omit inside the summarize function:
iris2 %>%
group_by(Species) %>%
summarize_at(vars(one_of('Sepal.Length', 'Sepal.Width', 'Petal.Length')),
funs(return = {
var <- na.omit(.)
length(Petal.Width[var > quantile(var, .666)])
}))
# A tibble: 3 x 4
Species Sepal.Length_return Sepal.Width_return Petal.Length_return
<fct> <dbl> <dbl> <dbl>
1 setosa 0.257 0.262 0.308
2 versicolor 1.44 1.49 1.49
3 virginica 2.11 2.2 2.09
Here we use curly braces to extend the function we run in summarize_at to multiple expressions. First, we strip out NA values, then we calculate the return values. Since this function is in summarize_at it gets applied to each variable based on the grouping established by group_by.

Normalise only some columns in R

I'm new to R and still getting to grips with how it handles data (my background is spreadsheets and databases). the problem I have is as follows. My data looks like this (it is held in CSV):
RecNo Var1 Var2 Var3
41 800 201.8 Y
43 140 39 N
47 60 20.24 N
49 687 77 Y
54 570 135 Y
58 1250 467 N
61 211 52 N
64 96 117.3 N
68 687 77 Y
Column 1 (RecNo) is my observation number; while it is a number, it is not required for my analysis. Column 4 (Var3) is a Yes/No column which, again, I do not currently need for the analysis but will need later in the process to add information in the output.
I need to normalise the numeric data in my dataframe to values between 0 and 1 without losing the other information. I have the following function:
normalize <- function(x) {
x <- sweep(x, 2, apply(x, 2, min))
sweep(x, 2, apply(x, 2, max), "/")
}
However, when I apply it to my above data by calling
myResult <- normalize(myData)
it returns an error because of the text in Column 4. If I set the text in this column to binary values it runs fine, but then also normalises my case numbers, which I don't want.
So, my question is: How can I change my normalize function above to accept the names of the columns to transform, while outputting the full dataset (i.e. without losing columns)?
I could not get TUSHAr's suggestion to work, but I have found two solutions that work fine:
1. akrun's suggestion above:
myData2 <- myData1 %>% mutate_at(2:3, funs((.-min(.))/max(.-min(.))))
This produces the following:
RecNo Var1 Var2 Var3
1 41 0.62184874 0.40601834 Y
2 43 0.06722689 0.04195255 N
3 47 0.00000000 0.00000000 N
4 49 0.52689076 0.12693105 Y
5 54 0.42857143 0.25663508 Y
6 58 1.00000000 1.00000000 N
7 61 0.12689076 0.07102414 N
8 64 0.03025210 0.21718329 N
9 68 0.52689076 0.12693105 Y
Alternatively, there is the package BBmisc which allowed me the following after transforming my record numbers to factors:
> myData <- myData %>% mutate(RecNo = factor(RecNo))
> myNorm <- normalize(myData2, method="range", range = c(0,1), margin = 1)
> myNorm
RecNo Var1 Var2 Var3
1 41 0.62184874 0.40601834 Y
2 43 0.06722689 0.04195255 N
3 47 0.00000000 0.00000000 N
4 49 0.52689076 0.12693105 Y
5 54 0.42857143 0.25663508 Y
6 58 1.00000000 1.00000000 N
7 61 0.12689076 0.07102414 N
8 64 0.03025210 0.21718329 N
9 68 0.52689076 0.12693105 Y
EDIT: For completion I include TUSHAr's solution as well, showing as always that there are many ways around a single problem:
normalize<-function(x){
minval=apply(x[,c(2,3)],2,min)
maxval=apply(x[,c(2,3)],2,max)
#print(minval)
#print(maxval)
y=sweep(x[,c(2,3)],2,minval)
#print(y)
sweep(y,2,(maxval-minval),"/")
}
df[,c(2,3)]=normalize(df)
Thank you for your help!
normalize<-function(x){
minval=apply(x[,c(2,3)],2,min)
maxval=apply(x[,c(2,3)],2,max)
#print(minval)
#print(maxval)
y=sweep(x[,c(2,3)],2,minval)
#print(y)
sweep(y,2,(maxval-minval),"/")
}
df[,c(2,3)]=normalize(df)

use dplyr mutate() in programming

I am trying to assign a column name to a variable using mutate.
df <-data.frame(x = sample(1:100, 50), y = rnorm(50))
new <- function(name){
df%>%mutate(name = ifelse(x <50, "small", "big"))
}
When I run
new(name = "newVar")
it doesn't work. I know mutate_() could help but I'm struggling in using it together with ifelse.
Any help would be appreciated.
Using dplyr 0.7.1 and its advances in NSE, you have to UQ the argument to mutate and then use := when assigning. There is lots of info on programming with dplyr and NSE here: https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html
I've changed the name of the function argument to myvar to avoid confusion. You could also use case_when from dplyr instead of ifelse if you have more categories to recode.
df <- data.frame(x = sample(1:100, 50), y = rnorm(50))
new <- function(myvar){
df %>% mutate(UQ(myvar) := ifelse(x < 50, "small", "big"))
}
new(myvar = "newVar")
This returns
x y newVar
1 37 1.82669 small
2 63 -0.04333 big
3 46 0.20748 small
4 93 0.94169 big
5 83 -0.15678 big
6 14 -1.43567 small
7 61 0.35173 big
8 26 -0.71826 small
9 21 1.09237 small
10 90 1.99185 big
11 60 -1.01408 big
12 70 0.87534 big
13 55 0.85325 big
14 38 1.70972 small
15 6 0.74836 small
16 23 -0.08528 small
17 27 2.02613 small
18 76 -0.45648 big
19 97 1.20124 big
20 99 -0.34930 big
21 74 1.77341 big
22 72 -0.32862 big
23 64 -0.07994 big
24 53 -0.40116 big
25 16 -0.70226 small
26 8 0.78965 small
27 34 0.01871 small
28 24 1.95154 small
29 82 -0.70616 big
30 77 -0.40387 big
31 43 -0.88383 small
32 88 -0.21862 big
33 45 0.53409 small
34 29 -2.29234 small
35 54 1.00730 big
36 22 -0.62636 small
37 100 0.75193 big
38 52 -0.41389 big
39 36 0.19817 small
40 89 -0.49224 big
41 81 -1.51998 big
42 18 0.57047 small
43 78 -0.44445 big
44 49 -0.08845 small
45 20 0.14014 small
46 32 0.48094 small
47 1 -0.12224 small
48 66 0.48769 big
49 11 -0.49005 small
50 87 -0.25517 big
Following the dlyr programming vignette, define your function as follows:
new <- function(name)
{
nn <- enquo(name) %>% quo_name()
df %>% mutate( !!nn := ifelse(x <50, "small", "big"))
}
enquo takes its expression argument and quotes it, followed by quo_name converting it into a string. Since nn is now quoted, we need to tell mutate not to quote it a second time. That's what !! is for. Finally, := is a helper operator to make it valid R code. Note that with this definition, you can simply pass newVar instead of "newVar" to your function, maintaining dplyr style.
> new( newVar ) %>% head
x y newVar
1 94 -1.07642088 big
2 85 0.68746266 big
3 80 0.02630903 big
4 74 0.18323506 big
5 86 0.85086915 big
6 38 0.41882858 small
Base R solution
df <-data.frame(x = sample(1:100, 50), y = rnorm(50))
new <- function(name){
df[,name]='s'
df[,name][df$x>50]='b'
return(df)
}
I am using dplyr 0.5 so i just combine base R with mutate
new <- function(Name){
df=mutate(df,ifelse(x <50, "small", "big"))
names(df)[3]=Name
return(df)
}
new("newVar")

Using ddply across numerous variables when calculating descriptive statistics

Here's my data. It shows the amount of fish I found at three different sites.
Selidor.Bay Enlades.Bay Cumphrey.Bay
1 39 29 187
2 70 370 50
3 13 44 52
4 0 65 20
5 43 110 220
6 0 30 266
What I would like to do is create a script to calculate basic statistics for each site.
If I re-arrange the data by stacking it. I.e :
values site
1 29 Selidor.Bay
2 370 Selidor.Bay
3 44 Selidor.Bay
4 65 Enlades.Bay
I'm able to use the following:
data <- ddply(df, c("site"), summarise,
N = length(values),
mean = mean(values),
sd = sd(values),
se = sd / sqrt(N),
sum = sum(values)
)
data.
My question is how can I use the script without having to stack my dataframe?
Thanks.
A slight variation on #docendodiscimus' comment:
library(reshape2)
library(dplyr)
DF %>%
melt(variable.name="site") %>%
group_by(site) %>%
summarise_each(funs( n(), mean, sd, se=sd(.)/sqrt(n()), sum ), value)
# site n mean sd se sum
# 1 Selidor.Bay 6 27.5 27.93385 11.40395 165
# 2 Enlades.Bay 6 108.0 131.84688 53.82626 648
# 3 Cumphrey.Bay 6 132.5 104.29909 42.57992 795
melt does what the OP referred to as "stacking" the data.frame. There is likely some analogous function in the tidyr package.

Resources