Removing outlier from excel using R code


The following data come from an Excel file:
Part A B C D E F G H I J K L
XXX 0 1 1 2 0 1 2 3 1 2 1 0
YYY 0 1 2 2 0 30 1 1 0 1 10 0
....
I want to display the parts that contain outliers, where an outlier is any value outside the interval
[median - t * MAD, median + t * MAD]
How can I code this as an R function that works on a large amount of data?

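You would want to calculate robust Z-scores based on the median and the MAD (median absolute deviation) instead of the non-robust mean and SD. Then assess your data using Z, with Z=0 meaning on the median, Z=1 one MAD away, etc. Note that R's mad() multiplies the raw MAD by 1.4826 by default, so that it is comparable to the SD for normally distributed data.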
Let's assume we have the following data, where one set is outliers:
df <- rbind( data.frame(tag='normal', res=rnorm(1000)*2.71), data.frame(tag='outlier', res=rnorm(20)*42))
then Z it:
df$z <- with(df, (res - median(res))/mad(res))
that gives us something like this:
> head(df)
     tag    res       z
1 normal -3.097 -1.0532
2 normal -0.650 -0.1890
3 normal  1.200  0.4645
4 normal  1.866  0.6996
5 normal -6.280 -2.1774
6 normal  1.682  0.6346
Then cut it into Z-bands, eg.
df$band <- cut(df$z, breaks=c(-99,-3,-1,1,3,99))
That can be analyzed in a straightforward way:
> addmargins(xtabs(~band+tag, df))
          tag
band       normal outlier  Sum
  (-99,-3]      1       9   10
  (-3,-1]     137       0  137
  (-1,1]      719       2  721
  (1,3]       143       1  144
  (3,99]        0       8    8
  Sum        1000      20 1020
As can be seen, the observations with the largest absolute Z-scores, those in the (-99,-3] and (3,99] bands, come almost entirely from the outlier group (17 of 18).
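To connect this back to the layout in the question, where each part is a row and the measurements sit in columns A to L, the same median/MAD rule can be applied row by row. The following is only a minimal sketch under stated assumptions: the function name flag_outlier_parts and the threshold t = 3 are hypothetical, the two rows are retyped from the question, and the sheet is assumed to be already loaded into a data frame (for example with a package such as readxl).

```r
# Rows retyped from the question; columns A..L hold the measurements.
parts <- data.frame(
  Part = c("XXX", "YYY"),
  rbind(c(0, 1, 1, 2, 0, 1, 2, 3, 1, 2, 1, 0),
        c(0, 1, 2, 2, 0, 30, 1, 1, 0, 1, 10, 0))
)
names(parts)[-1] <- LETTERS[1:12]

# Hypothetical helper: TRUE for every row that holds at least one value
# outside [median - t * MAD, median + t * MAD].
# mad() is used with R's default scaling constant (1.4826).
flag_outlier_parts <- function(values, t = 3) {
  apply(as.matrix(values), 1, function(x) {
    med <- median(x, na.rm = TRUE)
    spread <- mad(x, na.rm = TRUE)
    # Assumption: when the MAD is 0 the rule is degenerate, so flag nothing.
    if (spread == 0) return(FALSE)
    any(abs(x - med) > t * spread, na.rm = TRUE)
  })
}

has_outlier <- flag_outlier_parts(parts[-1])
parts$Part[has_outlier]   # parts that contain at least one outlier
```

With these two rows only YYY is flagged, because its values 30 and 10 lie far outside the interval while every value of XXX stays within it. The threshold t is a judgment call and should be chosen to suit the data.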

Related

how to find which rows are related by mathematical difference of x in R

i have a data frame with about 20k IDs of chemical compounds and the corresponding molecular weights, something like this:
ID <- c(1,2,3,4,5)
MASS <- c(324,162,508,675,670)
d <- data.frame(ID, MASS)
ID MASS
1 1 324
2 2 162
3 3 508
4 4 675
5 5 670
I would like to find a way to loop over the rows of the column MASS to find which masses are related by a difference (positive or negative) of 162 ± 0.5. Then I would like a new column (d$DIFF) that reports the IDs linked by a MASS difference of 162 ± 0.5, and 0 for those IDs where the condition is not met. In this example it would be something like this:
ID MASS DIFF
1 1 324 1&2
2 2 162 1&2
3 3 508 3&5
4 4 675 0
5 5 670 3&5
Thanks in advance for any help

Here's a base R solution using outer:
d$DIFF <- unlist(lapply(
  apply(outer(d$MASS, d$MASS,
              function(x, y) abs((abs(x - y) - 162)) < 0.5),
        1, which),
  function(x) if(length(x) == 0)
    return("0")
  else
    return(paste(x, collapse = " & "))))
This gives the result:
d
#> ID MASS DIFF
#> 1 1 324 2
#> 2 2 162 1
#> 3 3 508 5
#> 4 4 675 0
#> 5 5 670 3
Note that in your example data, there is at most a single match to other rows, but if you apply this technique to your real data you should get multiple hits for some rows separated by "&" as requested.
You should also note that whatever way you do this in your real data, you will have to make approximately 20K * 20K (400 million) comparisons, so it may take some time to complete, and may result in memory issues depending on your set-up.
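If the full 20K x 20K logical matrix built by outer is too large for memory, one option is to compare one row at a time, so that only a single vector of comparisons exists at any moment. This is a hedged sketch rather than a tuned solution: the helper name find_partners and its arguments target and tol are made up for illustration, and the number of comparisons is still about 400 million, so only the memory footprint improves.

```r
# Example data from the question.
d <- data.frame(ID = 1:5, MASS = c(324, 162, 508, 675, 670))

# Hypothetical helper: for each mass, list the IDs whose mass differs by
# target +/- tol, or "0" when there is no such partner.
find_partners <- function(mass, id, target = 162, tol = 0.5) {
  vapply(seq_along(mass), function(i) {
    hits <- which(abs(abs(mass - mass[i]) - target) < tol)
    if (length(hits) == 0) "0" else paste(id[hits], collapse = " & ")
  }, character(1))
}

d$DIFF <- find_partners(d$MASS, d$ID)
d
```

On the example data this reproduces the DIFF column shown above (2, 1, 5, 0, 3). Because it pastes the ID values rather than row positions, it also works when the IDs are not simply 1 to n.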

Calculate percentage change in dataframe from first row

I want to calculate the per cent change in my dataframe using the first row as the reference. For example my dataframe
Set rate field
A 3 10
B 2 17
C 5 4
Using row A as the reference, I want to calculate the percentage change from row A to every other row for all columns in the dataframe.
which will result in
Set rate field
A 3 10
B -33 70
C 66.66 -60
or
Set rate field pct_rate pct_field
A 3 10 0 0
B 2 17 -33 70
C 5 4 66.66 -60
My code:
z %>%
mutate(pct_rate = (rate - lag(rate)/ rate ) * 100)
which doesn't give me the desired result

library(data.table)
df <- fread("Set rate field
A 3 10
B 2 17
C 5 4")
Solution using dplyr: we can use dplyr's first function to refer to the first element of a vector (your attempt with lag is very close to this solution). I also used first(rate) in the denominator of the percentage difference to reproduce the numbers in your example.
library(dplyr)
df %>%
mutate(pct_rate = (rate - first(rate)) / first(rate) * 100,
pct_field = (field - first(field)) / first(field) * 100)
Returns:
Set rate field pct_rate pct_field
1: A 3 10 0.00000 0
2: B 2 17 -33.33333 70
3: C 5 4 66.66667 -60
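Since the question asks for the change "for all columns", the same first() idea can be applied to every numeric column at once instead of writing one line per column. A minimal sketch, assuming dplyr 1.0 or later (for across() and where()); the pct_ prefix for the new column names is just a naming choice:

```r
library(dplyr)

# Example data from the question, built directly as a data frame.
df <- data.frame(Set = c("A", "B", "C"),
                 rate = c(3, 2, 5),
                 field = c(10, 17, 4))

# Percentage change relative to the first row, for every numeric column.
result <- df %>%
  mutate(across(where(is.numeric),
                ~ (.x - first(.x)) / first(.x) * 100,
                .names = "pct_{.col}"))
result
```

This adds pct_rate and pct_field with the same values as the two-column version, and picks up further numeric columns without code changes.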

You can use z$rate[1] or z$field[1] to get the first element and then do the calculations with all values.
z$pct_rate <- 100 * (z$rate - z$rate[1]) / z$rate[1]
z$pct_field <- 100 * (z$field - z$field[1]) / z$field[1]
z
# Set rate field pct_rate pct_field
#1 A 3 10 0.00000 0
#2 B 2 17 -33.33333 70
#3 C 5 4 66.66667 -60
or for many columns:
rbind(z[1,], do.call(cbind.data.frame, c(z[1],
lapply(z[-1], function(x) 100 * (x - x[1]) / x[1])))[-1,])
# Set rate field
#1 A 3.00000 10
#2 B -33.33333 70
#3 C 66.66667 -60

Bootstrapping multiple columns with R

I'm relatively new at R and I'm trying to build a function which will loop through columns in an imported table and produce an output which consists of the means and 95% confidence intervals. Ideally it should be possible to bootstrap columns with different sample sizes, but first I would like to get the iteration working. I have something that sort-of works, but I can't get it all the way there. This is what the code looks like, with the sample data and output included:
#cdata<-read.csv(file.choose(),header=T)#read data from selected file, works, commented out because data is provided below
#cdata #check imported data
#Sample Data
# WALL NRPK CISC WHSC LKWH YLPR
#1 21 8 1 2 2 5
#2 57 9 3 1 0 1
#3 45 6 9 1 2 0
#4 17 10 2 0 3 0
#5 33 2 4 0 0 0
#6 41 4 13 1 0 0
#7 21 4 7 1 0 0
#8 32 7 1 7 6 0
#9 9 7 0 5 1 0
#10 9 4 1 0 0 0
x<-cdata[,c("WALL","NRPK","LKWH","YLPR")] #only select relevant species
i<-nrow(x) #count number of rows for bootstrapping
g<-ncol(x) #count number of columns for iteration
#build bootstrapping function, this works for the first column but doesn't iterate
bootfun <- function(bootdata, reps) {
  boot <- function(bootdata){
    s1=sample(bootdata, size=i, replace=TRUE)
    ms1=mean(s1)
    return(ms1)
  } # a single bootstrap
  bootrep <- replicate(n=reps, boot(bootdata))
  return(bootrep)
} #replicates bootstrap of "bootdata" "reps" number of times and outputs vector of results
cvr1 <- bootfun(x$YLPR,50000) #have unsuccessfully tried iterating the location various ways (i.e. x[i])
cvrquantile<-quantile(cvr1,c(0.025,0.975))
cvrmean<-mean(cvr1)
vec<-c(cvrmean,cvrquantile) #puts results into a suitable form for output
vecr<-sapply(vec,round,1) #rounds results
vecr
2.5% 97.5%
28.5 19.4 38.1
#apply(x[1:g],2,bootfun) ##doesn't work in this case
#desired output:
#Species Mean LowerCI UpperCI
#WALL 28.5 19.4 38.1
#NRPK 6.1 4.6 7.6
#YLPR 0.6 0.0 1.6
I've also tried this using the boot package, and it works beautifully to iterate through the means but I can't get it to do the same with the confidence intervals. The "ordinary" code above also has the advantage that you can easily retrieve the bootstrapping results, which might be used for other calculations. For the sake of completeness here is the boot code:
#Bootstrapping using boot package
library(boot)
#data<-read.csv(file.choose(),header=TRUE) #read data from selected file
#x<-data[,c("WALL","NRPK","LKWH","YLPR")] #only select relevant columns
#x #check data
#Sample Data
# WALL NRPK LKWH YLPR
#1 21 8 2 5
#2 57 9 0 1
#3 45 6 2 0
#4 17 10 3 0
#5 33 2 0 0
#6 41 4 0 0
#7 21 4 0 0
#8 32 7 6 0
#9 9 7 1 0
#10 9 4 0 0
i<-nrow(x) #count number of rows for resampling
g<-ncol(x) #count number of columns to step through with bootstrapping
boot.mean<-function(x,i){boot.mean<-mean(x[i])} #bootstrapping function to get the mean
z<-boot(x, boot.mean,R=50000) #bootstrapping function, uses mean and number of reps
boot.ci(z,type="perc") #derive 95% confidence intervals
apply(x[1:g],2, boot.mean) #bootstrap all columns
#output:
#WALL NRPK LKWH YLPR
#28.5 6.1 1.4 0.6
I've gone through all of the resources I can find and can't seem to get things working. What I would like for output would be the bootstrapped means with the associated confidence intervals for each column. Thanks!

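Note: the line apply(x[1:g], 2, boot.mean) does not bootstrap anything; because the index argument i is missing, it simply calculates the mean of each column.
For the bootstrap mean and confidence interval of every column, try this:
apply(x, 2, function(y) {
  b <- boot(y, boot.mean, R=50000)
  c(mean(b$t), boot.ci(b, type="perc", conf=0.95)$percent[4:5])
})
The result has one column per species, with the bootstrap mean and the lower and upper confidence limits as rows; wrap the call in t() to get one row per species.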
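For comparison, the hand-written approach from the question can also be made to iterate over columns without the boot package, which keeps the raw bootstrap means easy to retrieve. This is a rough sketch under assumptions: the helper name boot_summary is hypothetical, reps is kept small for speed, and resampling uses each column's own length, so columns with different sample sizes (after dropping NA values) would be handled separately.

```r
# Sample data retyped from the question.
cdata <- data.frame(
  WALL = c(21, 57, 45, 17, 33, 41, 21, 32, 9, 9),
  NRPK = c(8, 9, 6, 10, 2, 4, 4, 7, 7, 4),
  LKWH = c(2, 0, 2, 3, 0, 0, 0, 6, 1, 0),
  YLPR = c(5, 1, 0, 0, 0, 0, 0, 0, 0, 0)
)

# Hypothetical helper: percentile bootstrap of the mean for one column.
# Note: sample() treats a single number as 1:x, so this assumes each
# column has at least two non-missing values.
boot_summary <- function(values, reps = 5000) {
  values <- values[!is.na(values)]
  boot_means <- replicate(reps, mean(sample(values, replace = TRUE)))
  ci <- quantile(boot_means, c(0.025, 0.975), names = FALSE)
  c(Mean = mean(boot_means), LowerCI = ci[1], UpperCI = ci[2])
}

set.seed(42)                               # only for reproducibility
boot_table <- t(sapply(cdata, boot_summary))
round(boot_table, 1)                       # one row per species
```

The exact interval limits depend on the seed and on reps, so they will differ slightly from run to run and from the boot.ci results.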

Calculate mean of a proportion of the data.frame

I'm working with data that looks similar to this:
cat value n
1 100 18
2 0 19
3 -100 15
4 100 13
5 0 17
6 -100 18
In the real data, there are many cats and value can be any number between -100 and 100 (no NA).
What I want to do is to calculate the mean of value based on terciles defined by n
So, for example, since sum(n)=100 what I want to do is to get n's as close as possible to 33 and calculate the mean of value. So for the first tercile, 18 isn't quite 33, so I need to take 15 values from cat=2. So the mean for the first tercile should be (100*18+0*15)/(18+15). The second tercile would be the remaining ns from cat=2, then as many as are needed to get to 33: (0*4+-100*15+100*13+0*1)/(4+15+13+1). Similar for the last tercile.
I got started writing this, but ended up with lots of nasty for loops and if statements. I'm hoping that you see an easier way to deal with this than I do. Thanks in advance!

A solution with data.table:
setDT(df)[rep(1:.N,n)
][,indx:=c(rep("a",33),rep("b",33),rep("c",34))
][,.(mean_val_indx=mean(value)),by=indx]
this gives:
indx mean_val_indx
1: a 54.545455
2: b -6.060606
3: c -52.941176
Which are the means of value for the three parts of the data.
Broken down in the intermediate steps:
1: replicate the rows according to n
setDT(df)[rep(1:.N,n)]
this gives (shortened):
cat value n
1: 1 100 18
2: 1 100 18
....
17: 1 100 18
18: 1 100 18
19: 2 0 19
20: 2 0 19
....
36: 2 0 19
37: 2 0 19
38: 3 -100 15
....
99: 6 -100 18
100: 6 -100 18
2: create an index with [,indx:=c(rep("a",33),rep("b",33),rep("c",34))]
setDT(df)[rep(1:.N,n)
][,indx:=c(rep("a",33),rep("b",33),rep("c",34))]
this gives:
> dt
cat value n indx
1: 1 100 18 a
2: 1 100 18 a
....
17: 1 100 18 a
18: 1 100 18 a
19: 2 0 19 a
20: 2 0 19 a
....
32: 2 0 19 a
33: 2 0 19 a
34: 2 0 19 b
35: 2 0 19 b
....
99: 6 -100 18 c
100: 6 -100 18 c
3: summarise value by indx with [,.(mean_val_indx=mean(value)),by=indx]
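The group sizes 33, 33 and 34 in the index step are hard-coded for sum(n) = 100. When the total differs, the group index can be computed from the row position instead. Below is a small base R sketch of that idea (not the data.table code above); the variable names are made up, and for totals that do not divide evenly, floating-point rounding at the group boundaries is worth checking.

```r
# Example data from the question.
df <- data.frame(cat = 1:6,
                 value = c(100, 0, -100, 100, 0, -100),
                 n = c(18, 19, 15, 13, 17, 18))

m <- 3                                        # number of groups (terciles)
long_value <- rep(df$value, times = df$n)     # one entry per observation
# Position i falls in group ceiling(i * m / N); here that gives 33, 33, 34.
group <- ceiling(seq_along(long_value) * m / length(long_value))
tercile_means <- tapply(long_value, group, mean)
tercile_means
```

For the example this gives the same three means as the data.table result (about 54.55, -6.06 and -52.94).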

You could try something like this, data being your example dataframe:
longData<-unlist(apply(data[,c("value","n")],1,function(x){
rep(x["value"],x["n"])
}))
aggregate(longData,list(cut(seq_along(longData),breaks=3,right=FALSE)),mean)
longData will be a vector of length 100 with, using your example, 18 repetitions of 100, 19 repetitions of 0, etc.
The cut in the aggregate will divide longData into three groups, and the mean of each group will be calculated.

If the data are already very long, replicating the rows by "n" may be
undesirable. The following solution avoids it. Moreover, 1/3 of the sum of the
"n"-values is not rounded to the nearest integer.
"i" is the vector of row numbers where terciles end. Since it is possible
that several terciles end at the same row, those row numbers are replicated.
The result is the vector "k".
For each index "j" the cumulative sum of "data$value"*"data$n" up to row "k[j]"
covers "ms[k[j]]" terciles, so "ms[k[j]]-j" terciles have to be subtracted
to get the cumulative sum up to the "j"th tercile.
m <- 3
sn <- sum(data$n)
ms <- m * cumsum(data$n) / sn
d <- diff(c(0,floor(ms)))
i <- which(d>0)
k <- rep(i,d[i])
vn <- data$value * data$n
sums <- cumsum(vn)[k] - (ms[k]-(1:m))*data$value[k]*sn/m
means <- m*diff(c(0,sums))/sn
The means of the terciles are:
> means
[1] 54 -6 -54
In this example "i" is equal to "k". But if terciles are replaced by deciles,
i.e. "m" is not 3 but 10, they are distinct:
> m
[1] 10
> i
[1] 1 2 3 4 5 6
> k
[1] 1 2 2 3 3 4 5 5 6 6
> means
[1] 100 80 0 -30 -100 60 50 0 -80 -100
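To make the steps above easier to reuse and check, they can be wrapped in a function. This is just the same code repackaged as a sketch: the function name quantile_group_means is hypothetical, and it inherits the caveat that floor(ms) may be sensitive to floating-point error when a group boundary falls exactly on a row.

```r
# Hypothetical wrapper around the steps above: means of m equal-weight
# groups of "value", where each value counts "n" times.
quantile_group_means <- function(value, n, m = 3) {
  sn <- sum(n)
  ms <- m * cumsum(n) / sn            # groups covered up to each row
  d <- diff(c(0, floor(ms)))          # groups ending in each row
  i <- which(d > 0)
  k <- rep(i, d[i])                   # row in which the j-th group ends
  # Cumulative weighted sum up to the end of each group.
  sums <- cumsum(value * n)[k] - (ms[k] - seq_len(m)) * value[k] * sn / m
  m * diff(c(0, sums)) / sn
}

value <- c(100, 0, -100, 100, 0, -100)
n <- c(18, 19, 15, 13, 17, 18)
terciles <- quantile_group_means(value, n, m = 3)
deciles <- quantile_group_means(value, n, m = 10)
```

Called with the example data, the two results match the tercile and decile means printed above.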
I compared the speed of the 4 answers, using our small example with 8 rows:
> ##### "longData"-Answer #####
>
> system.time( for ( i in 1:1000 ) { A1 <- f1(data) } )
user system elapsed
3.48 0.00 3.49
> ##### "sapply"-Answer #####
>
> system.time( for ( i in 1:1000 ) { A2 <- f2(data) } )
user system elapsed
1.00 0.00 0.99
> ##### "data.table"-Answer #####
>
> system.time( for ( i in 1:1000 ) { A3 <- f3(data) } )
user system elapsed
4.73 0.00 4.79
> ##### this Answer #####
>
> system.time( for ( i in 1:1000 ) { A4 <- f4(data) } )
user system elapsed
0.43 0.00 0.44
The "sapply"-Answer even gives a wrong result:
> A1
Group.1 x
1 [0.901,34) 54.545455
2 [34,67) -6.060606
3 [67,100) -52.941176
> A2
(0,33] (33,67] (67,100]
-100.00000 0.00000 93.93939
> A3
indx mean_val_indx
1: a 54.545455
2: b -6.060606
3: c -52.941176
> A4
[1] 54 -6 -54
>

This is basically the same as NicE's answer, although perhaps useful as a different way of assembling the rep and cut operations:
sapply(split( sort(unlist( mapply(rep, res$value, res$n) )),
cut(seq(sum(res$n)), breaks=c(0,33,67,100) )),
mean)
(0,33] (33,67] (67,100]
-100.00000 0.00000 93.93939

How to generalize this algorithm (sign pattern match counter)?

I have this code in R :
corr = function(x, y) {
  sx = sign(x)
  sy = sign(y)
  cond_a = sx == sy && sx > 0 && sy >0
  cond_b = sx < sy && sx < 0 && sy >0
  cond_c = sx > sy && sx > 0 && sy <0
  cond_d = sx == sy && sx < 0 && sy < 0
  cond_e = sx == 0 || sy == 0
  if(cond_a) return('a')
  else if(cond_b) return('b')
  else if(cond_c) return('c')
  else if(cond_d) return('d')
  else if(cond_e) return('e')
}
Its role is to be used in conjunction with the mapply function in R in order to count all the possible sign patterns present in a time series. In this case the pattern has a length of 2 and all the possible tuples are : (+,+)(+,-)(-,+)(-,-)
I use the corr function this way :
> with(dt['AAPL'], table(mapply(corr, Return[-1], Return[-length(Return)])) /length(Return)*100)
a b c d e
24.6129416 25.4466058 25.4863041 24.0174672 0.3969829
> dt["AAPL",list(date, Return)]
symbol date Return
1: AAPL 2014-08-29 -0.3499903
2: AAPL 2014-08-28 0.6496702
3: AAPL 2014-08-27 1.0987923
4: AAPL 2014-08-26 -0.5235654
5: AAPL 2014-08-25 -0.2456037
I would like to generalize the corr function to n arguments. This means that for every n I would have to write down all the conditions corresponding to all the possible n-tuples. Currently the best thing I can think of is a Python script that writes the code string using loops, but there must be a proper way to do this. Do you have an idea how I could avoid writing out these tedious conditions by hand? Maybe I could use expand.grid, but how would I do the matching then?

I think you're better off using rollapply(...) in the zoo package for this. Since you seem to be using quantmod anyway (which loads xts and zoo), here is a solution that does not use all those nested if(...) statements.
library(quantmod)
AAPL <- getSymbols("AAPL",auto.assign=FALSE)
AAPL <- AAPL["2007-08::2009-03"] # AAPL during the crash...
Returns <- dailyReturn(AAPL)
get.patterns <- function(ret,n) {
  f <- function(x) { # identifies which row of `patterns` matches sign(x)
    which(apply(patterns,1,function(row)all(row==sign(x))))
  }
  returns <- na.omit(ret)
  patterns <- expand.grid(rep(list(c(-1,1)),n))
  labels <- apply(patterns,1,function(row) paste0("(",paste(row,collapse=","),")"))
  result <- rollapply(returns,width=n,f,align="left")
  data.frame(100*table(labels[result])/(length(returns)-(n-1)))
}
get.patterns(Returns,n=2)
# Var1 Freq
# 1 (-1,-1) 22.67303
# 2 (-1,1) 26.49165
# 3 (1,-1) 26.73031
# 4 (1,1) 23.15036
get.patterns(Returns,n=3)
# Var1 Freq
# 1 (-1,-1,-1) 9.090909
# 2 (-1,-1,1) 13.397129
# 3 (-1,1,-1) 14.593301
# 4 (-1,1,1) 11.722488
# 5 (1,-1,-1) 13.636364
# 6 (1,-1,1) 13.157895
# 7 (1,1,-1) 12.200957
# 8 (1,1,1) 10.765550
The basic idea is to create a patterns matrix with 2^n rows and n columns, where each row represents one of the possible patterns (e.g., (1,1), (-1,1), etc.). Then pass the daily returns to this function n-wise using rollapply(...) and identify which row in patterns matches sign(x) exactly. Then use this vector of row numbers as an index into labels, which contains a character representation of the patterns, and use table(...) as you did.
This is general for an n-day pattern, but it ignores situations where any return is exactly zero, so the $Freq columns do not add up to 100. As you can see, this doesn't happen very often.
It's interesting that even during the crash it was (very slightly) more likely to have two up days in succession, than two down days. If you look at plot(Cl(AAPL)) during this period, you can see that it was a pretty wild ride.
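The windowing idea described above can also be sketched in base R without zoo or quantmod, using embed() to build the n-wide windows and pasting the signs into a label instead of matching against a patterns matrix. This is an illustrative alternative, not the code above: the function name sign_pattern_freq is hypothetical, the toy returns are the five values shown in the question, and windows containing an exact zero are dropped from the counts but kept in the denominator, as in get.patterns.

```r
# Hypothetical helper: percentage of each n-day sign pattern in a numeric
# vector of returns (oldest first). Assumes at least one window without zeros.
sign_pattern_freq <- function(returns, n) {
  s <- sign(returns[!is.na(returns)])
  # embed() returns each window newest-first, so reverse the columns.
  windows <- embed(s, n)[, n:1, drop = FALSE]
  complete <- rowSums(windows == 0) == 0      # no zero return in the window
  labels <- apply(windows[complete, , drop = FALSE], 1,
                  function(w) paste(ifelse(w > 0, "+", "-"), collapse = ""))
  100 * table(labels) / nrow(windows)
}

toy_returns <- c(-0.3499903, 0.6496702, 1.0987923, -0.5235654, -0.2456037)
freq <- sign_pattern_freq(toy_returns, n = 2)
freq
```

With the five toy returns there are four 2-day windows, one for each possible pattern, so every pattern gets 25%. Because the labels come straight from the data, no list of conditions has to be written for any n.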

This is a slightly different approach, but it may give you what you're looking for and works for any size of n-tuple. The basic idea is to take the sign of each return in every run of n consecutive returns and convert it to a binary digit, where 0 = negative return and 1 = positive return. Each n-tuple of digits is then read as a binary number and converted to its decimal value, which is distinct for each distinct n-tuple. Using a zoo time series for these calculations provides several useful functions, including get.hist.quote() (from tseries) to retrieve stock prices, diff() to calculate returns, and rollapply() to compute the n-tuples and their sums. The code below does these calculations, converts the sums back to n-tuples of binary digits, and collects the results in a data frame.
library(zoo)
library(tseries)
n <- 3 # set size of n-tuple
#
# get stock prices and compute % returns
#
dtz <- get.hist.quote("AAPL","2014-01-01","2014-10-01", quote="Close")
dtz <- merge(dtz, (diff(dtz, arithmetic=FALSE ) - 1)*100)
names(dtz) <- c("prices","returns")
#
# calculate the sum of the sign changes
#
dtz <- merge(dtz, rollapply( data=(sign(dtz$returns)+1)/2, width=n,
FUN=function(x, y) sum(x*y), y = 2^(0:(n-1)), align="right" ))
dtz <- fortify.zoo(dtz)
names(dtz) <- c("date","prices","returns", "sum_sgn_chg")
#
# convert the sum of the sign changes back to an n-tuple of binary digits
#
for( i in 1:nrow(dtz) )
dtz$sign_chg[i] <- paste(((as.numeric(dtz$sum_sgn_chg[i]) %/%(2^(0:(n-1)))) %%2), collapse="")
#
# report first part of result
#
head(dtz, 10)
#
# report count of changes by month and type
#
table(format(dtz$date,"%Y %m"), dtz$sign_chg)
An example of possible output is a table showing the count of changes by type for each month.
000 001 010 011 100 101 110 111 NANANA
2014 01 1 3 3 2 3 2 2 2 3
2014 02 1 2 4 2 2 3 2 3 0
2014 03 2 3 0 4 4 1 4 3 0
2014 04 2 3 2 3 3 2 3 3 0
2014 05 2 2 1 3 1 2 3 7 0
2014 06 3 4 3 2 4 1 1 3 0
2014 07 2 1 2 4 2 5 5 1 0
2014 08 2 2 1 3 1 2 2 8 0
2014 09 0 4 2 3 4 2 4 2 0
2014 10 0 0 1 0 0 0 0 0 0
This shows that in January 2014 there was one set of three days with pattern 000 (three down returns), three sets with pattern 001 (two down returns followed by one positive return), and so forth. Most months seem to have a fairly random distribution, but May and August show 7 and 8 sets of three consecutive positive returns, reflecting the fact that these were strong months for AAPL.
