I have three independent measures of a variable, and they are subject to a lot of noise and sporadic sources of error that can be quite large. I would like to discard the value furthest away from the others, remember which one is discarded, and then calculate the mean with the remaining two. For example,
a b c
15 6 7
11 10 3
5 12 6
would become
a b c ave discard
15 6 7 6.5 15
11 10 3 10.5 3
5 12 6 5.5 12
Try:
ddf <- data.frame(a = c(15, 11, 5), b = c(6, 10, 12), c = c(7, 3, 6))
ddf
   a  b c
1 15  6 7
2 11 10 3
3  5 12 6
# For each row: sort the three values, then average the closer pair
ddf$ave <- apply(ddf[1:3], 1, function(x) {
  x <- sort(x)
  if (abs(x[1] - x[2]) > abs(x[2] - x[3])) mean(x[2:3]) else mean(x[1:2])
})
# For each row: the discarded value is the one left out of that pair
ddf$discard <- apply(ddf[1:3], 1, function(x) {
  x <- sort(x)
  if (abs(x[1] - x[2]) > abs(x[2] - x[3])) x[1] else x[3]
})
ddf
a b c ave discard
1 15 6 7 6.5 15
2 11 10 3 10.5 3
3 5 12 6 5.5 12
Your question is underspecified. Say the three values are 1000, 2000 and 3000. Which one would you discard? Should the answer be 1500 or 2500?
If all you're looking for is a robust measure of central tendency, the median might be a good start (?median in R).
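For instance, a minimal sketch reusing the ddf defined above:
# The row-wise median needs no tie-breaking rule at all
ddf$med <- apply(ddf[1:3], 1, median)
ddf$med
# [1] 7 10 6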
Related
I want to create a table like:
1 1 6 6 10 10 ...
2 2 7 7 11 11 ...
3 3 8 8 12 12 ...
4 4 9 9 13 13 ...
5 5 14 14 ...
15 15 ...
I want to use variables:
n (the number of repeats), m (the total number of columns) and k (each block's starting number, i.e. the previous block's end number + 1; for example 6 = 5 + 1 and 10 = 9 + 1), with rows of different lengths,
to create a table.
I know I can use something like:
rep(list(1:5, 6:9, 10:15), each = 2)
but how do I turn these into parameters, i.e. write list(1:5, 6:9, 10:15, ...) as a general expression in terms of n, m and k?
I tried a loop, for (i in 1:m) etc., but could not work it out.
Finally I want a single sequence via unlist(): 1,2,3,4,5,6,1,2,3,4,5,6,...
Many thanks.
Maybe the code below can help
len <- c(5, 4, 6)
res <- unlist(unname(rep(split(1:sum(len),
                               findInterval(1:sum(len), cumsum(len) + 1)),
                         each = 2)))
which gives
> res
[1] 1 2 3 4 5 1 2 3 4 5 6 7 8 9 6 7 8 9 10 11 12 13 14 15 10 11 12 13 14 15
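Here findInterval(1:sum(len), cumsum(len) + 1) labels each of the numbers 1:15 with its block index, split() cuts the sequence into those blocks, rep(..., each = 2) duplicates every block in place, and unlist() flattens everything back into a single vector.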
Probably, something like this would be helpful.
#Number of times to repeat
r <- 2
#Length of each sequence
len <- c(5, 4, 6)
#Get the end of the sequence
end <- cumsum(len)
#Calculate the start of each sequence
start <- c(1, end[-length(end)] + 1)
#Create a sequence of start and end and repeat it r times
Map(function(x, y) rep(seq(x, y), r), start, end)
#[[1]]
# [1] 1 2 3 4 5 1 2 3 4 5
#[[2]]
#[1] 6 7 8 9 6 7 8 9
#[[3]]
# [1] 10 11 12 13 14 15 10 11 12 13 14 15
You could unlist to get it as one vector.
unlist(Map(function(x, y) rep(seq(x, y), r), start, end))
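For the flat pattern mentioned at the end of the question (1 to k repeated), plain rep() is enough; k and n here are illustrative names for the sequence end and the repeat count:
k <- 6
n <- 2
rep(seq_len(k), times = n)
# [1] 1 2 3 4 5 6 1 2 3 4 5 6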
My data looks like this:
x y
1 1
2 2
3 2
4 4
5 5
6 6
7 6
8 8
9 9
10 9
11 11
12 12
13 13
14 13
15 14
16 15
17 14
18 16
19 17
20 18
y is a grouping variable. I would like to see how well this grouping went.
To do so, I want to extract a sample of n pairs of cases that are grouped together by variable y,
and n pairs of cases that are not grouped together by variable y, in order to calculate the number of
false positives and false negatives (cases falsely grouped together or falsely kept apart). How do I extract a sample of grouped pairs
and a sample of not-grouped pairs?
I would like the samples to look like this (for n = 6):
Grouped sample:
x y
2 2
3 2
9 9
10 9
15 14
17 14
Not-grouped sample:
x y
1 1
2 2
6 8
6 8
11 11
19 17
How would I go about this in R?
I'm not entirely clear on what you'd like to do, partly because I feel there is some context missing as to what you're trying to achieve. I also don't quite understand your expected output (for example, the not-grouped sample contains an entry 6 8 that does not exist in your original data...).
That aside, here is a possible approach.
# Maximum number of samples per group
n <- 3
# Set fixed RNG seed for reproducibility
set.seed(2017)
# Grouped samples: within every group of size > 1, draw up to n rows
df.grouped <- do.call(rbind.data.frame, lapply(split(df, df$y),
    function(x) if (nrow(x) > 1) x[sample(nrow(x), min(n, nrow(x))), ]))
df.grouped
# x y
#2.3 3 2
#2.2 2 2
#6.6 6 6
#6.7 7 6
#9.10 10 9
#9.9 9 9
#13.13 13 13
#13.14 14 13
#14.15 15 14
#14.17 17 14
# Ungrouped samples: draw the same number of rows from the full data
df.ungrouped <- df[sample(nrow(df), nrow(df.grouped)), ]
df.ungrouped
# x y
#7 7 6
#1 1 1
#9 9 9
#4 4 4
#3 3 2
#2 2 2
#5 5 5
#6 6 6
#10 10 9
#8 8 8
Explanation: split df based on y, then from every subset x with more than one row draw min(n, nrow(x)) of its rows; rbind-ing the pieces gives the grouped sample df.grouped. We then draw nrow(df.grouped) rows from df to produce the ungrouped sample df.ungrouped.
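If the not-grouped sample is instead meant to consist of pairs of rows whose y values differ (one possible reading of the question), a rough sketch could look like this; n.pairs and the helper draw.pair are made up for illustration:
n.pairs <- 3                                       # number of not-grouped pairs (hypothetical)
draw.pair <- function() df[sample(nrow(df), 2), ]  # two distinct random rows
pairs <- replicate(n.pairs, {
    repeat {
        p <- draw.pair()
        if (length(unique(p$y)) == 2) break        # keep only cross-group pairs
    }
    p
}, simplify = FALSE)
do.call(rbind, pairs)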
Sample data
df <- read.table(text =
"x y
1 1
2 2
3 2
4 4
5 5
6 6
7 6
8 8
9 9
10 9
11 11
12 12
13 13
14 13
15 14
16 15
17 14
18 16
19 17
20 18", header = T)
I am trying to create a table with random entries from a central hypergeometric distribution where the column and row totals are fixed.
However, I can get the column sums to be fixed and equal, but not the row sums. I have read other answers but none seem to talk specifically about how to do it; my R knowledge is pretty basic and I could do with some help or a pointer in the right direction.
To get the values from a central hypergeometric distribution I am using the BiasedUrn package.
For example:
N <- 50        # total number of balls in the urn
rand <- 10     # number of random draws (columns)
n1 <- 25       # balls sampled per draw, i.e. the fixed column total
K <- 5         # number of categories (rows)
odds0 <- rep(1, K)
m0 <- rep(N/K, K)
library(BiasedUrn)
i <- as.table(rMFNCHypergeo(nran=rand, n=n1, m=m0, odds=odds0))
addmargins(i)
A B C D E F G H I J Sum
A 5 3 5 7 5 5 6 6 5 5 52
B 8 7 4 5 5 6 3 4 5 4 51
C 3 6 4 4 4 5 6 8 5 4 49
D 4 4 6 3 6 4 5 3 3 5 43
E 5 5 6 6 5 5 5 4 7 7 55
Sum 25 25 25 25 25 25 25 25 25 25 250
Where I'm looking to keep all the column sums equal to 25, and all the row sums equal to another number which I can choose such as 50.
Are you looking for the r2dtable function from base R?
set.seed(101)
tt <- r2dtable(n=1,c=rep(25,6),r=rep(50,3))
addmargins(as.table(tt[[1]]))
## A B C D E F Sum
## A 7 9 7 11 9 7 50
## B 10 7 10 6 7 10 50
## C 8 9 8 8 9 8 50
## Sum 25 25 25 25 25 25 150
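Since r2dtable conditions on both sets of margins, a quick sanity check on the draw above:
all(rowSums(tt[[1]]) == 50)  # TRUE: row totals fixed at 50
all(colSums(tt[[1]]) == 25)  # TRUE: column totals fixed at 25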
I have two data sets; one is a subset of the other, but the subset has an additional column and fewer observations.
Basically, I have a unique ID assigned to each participant, and a HHID, the ID of the house from which they were recruited (e.g. 15 participants recruited from 11 houses).
> Healthdata <- data.frame(ID = gl(15, 1), HHID = c(1,2,2,3,4,5,5,5,6,6,7,8,9,10,11))
> Healthdata
Now, I have a subset of the data with only one participant per household, chosen as the one who spent the most hours watching television. In this subset I have computed a socioeconomic score (SSE) for each house.
> set.seed(1)
> Healthdata.1<- data.frame(ID=sample(1:15,11, replace=F), HHID=gl(11,1), SSE = sample(-6.5:3.5, 11, replace=TRUE))
> Healthdata.1
Now, I want to assign the SSE from the subset (Healthdata.1) to the participants in the bigger data set (Healthdata), such that participants from the same house get the same score.
I can't simply merge these, because the data sets have different numbers of observations: 15 in the bigger one but only 11 in the subset.
Is there any way to do this in R? I am very new to it and I am stuck with this.
I want the required output to be something like below, i.e. IDs (participants) with the same HHID (house) should have the same SSE score. The following output is just an example of what I need; the seed above will not reproduce it.
ID HHID SSE
1 1 -6.5
2 2 -5.5
3 2 -5.5
4 3 3.3
5 4 3.0
6 5 2.58
7 5 2.58
8 5 2.58
9 6 -3.05
10 6 -3.05
11 7 -1.2
12 8 2.5
13 9 1.89
14 10 1.88
15 11 -3.02
Thanks.
You can use merge. By default it merges on the intersection of the column names.
merge(Healthdata,Healthdata.1,all.x=TRUE)
ID HHID SSE
1 1 1 NA
2 2 2 NA
3 3 2 NA
4 4 3 NA
5 5 4 NA
6 6 5 NA
7 7 5 NA
8 8 5 NA
9 9 6 0.7
10 10 6 NA
11 11 7 NA
12 12 8 NA
13 13 9 NA
14 14 10 NA
15 15 11 NA
Or you can choose which column to merge by:
merge(Healthdata,Healthdata.1,all.x=TRUE,by='ID')
You need to merge by HHID, not ID. Note this is somewhat confusing, because the IDs in the full data set and the IDs in the subset refer to different participants: ID.x == 4 and ID.y == 4 are not the same person (in fact, in this case they live in different households). Because of that I left both ID columns here to avoid ambiguity, but you can easily subset the result to show only the ID.x one.
> merge(Healthdata, Healthdata.1, by='HHID')
HHID ID.x ID.y SSE
1 1 1 4 -5.5
2 2 2 6 0.5
3 2 3 6 0.5
4 3 4 8 -2.5
5 4 5 11 1.5
6 5 6 3 -1.5
7 5 7 3 -1.5
8 5 8 3 -1.5
9 6 9 9 0.5
10 6 10 9 0.5
11 7 11 10 3.5
12 8 12 14 -2.5
13 9 13 5 1.5
14 10 14 1 3.5
15 11 15 2 -4.5
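To show only the participant IDs from the bigger data set, drop ID.y from the merged result, for example:
res <- merge(Healthdata, Healthdata.1, by = 'HHID')
res$ID.y <- NULL  # keep HHID, ID.x and SSE
res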
library(plyr)
join(Healthdata, Healthdata.1)
# Inner Join
join(Healthdata, Healthdata.1, type = "inner", by = "ID")
# Left Join
# I believe this is what you are after
join(Healthdata, Healthdata.1, type = "left", by = "ID")
I have a table that I routinely compute with R that has three dimensions. I would like to add some subtotal tables in between the (here 5) marginal tables. What I usually do is something like:
A=sample(LETTERS[1:5],100, rep=T)
b=sample(letters[1:2],100, rep=T)
numbers=sample(1:3,100, rep=T)
( tab=table(A,b,numbers) )
( tab1=ftable(addmargins(tab)) )
I would like to add the sum of the first few marginal tables, then the sum of the remaining tables, and then the grand total at the bottom. In the resulting ftable I would want the As, Bs and Cs, then the sum of those three, then the Ds and Es and the sum of those two, and then the sum of all of the tables, like:
numbers 1 2 3 Sum
A b
A a 1 5 0 6
b 4 2 2 8
Sum 5 7 2 14
B a 2 6 6 14
b 5 4 5 14
Sum 7 10 11 28
C a 3 2 5 10
b 1 2 2 5
Sum 4 4 7 15
sumac a 6 13 11 30 #### how do i add these three lines
b ....
sum ....
D a 2 1 1 4
b 4 5 3 12
Sum 6 6 4 16
E a 5 3 4 12
b 4 3 8 15
Sum 9 6 12 27
sumde a 7 4 5 20 #### and these
b ....
sum ....
sumae a 13 17 16 46
b 18 16 20 54
Sum 31 33 36 100
As usual I suspect the solution is probably many fewer lines than the question. Thanks
Seth Latimer
You could do something like this:
isABC <- ifelse(A %in% c("A", "B", "C"), "ABC", "DE")
( tab=table(isABC,A,b,numbers) )
( tab1=ftable(addmargins(tab)) )
Now you have a larger table which holds even more rows than you want, but those should be easy to drop...
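A minimal sketch of that dropping step, assuming the isABC grouping above (groups labelled "ABC" and "DE"): flatten the table with margins into a data frame and keep only the consistent combinations:
df1 <- as.data.frame(addmargins(tab))  # long format: isABC, A, b, numbers, Freq
keep <- (df1$isABC == "ABC" & df1$A %in% c("A", "B", "C", "Sum")) |
        (df1$isABC == "DE"  & df1$A %in% c("D", "E", "Sum")) |
        (df1$isABC == "Sum" & df1$A == "Sum")
df1[keep, ]  # only the row combinations you actually want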