Assign pass/fail value based on mean in large dataset - r

This might be a simple question, but I was hoping someone could point me in the right direction. I have a sample dataset of:
dfrm <- list(L = c("A","B","P","C","D","E","P","F"), J=c(2,2,1,2,2,2,1,2), K=c(4,3,10,16,21,3,17,2))
dfrm <-as.data.frame(dfrm)
dfrm
L J K
1 A 2 4
2 B 2 3
3 P 1 10
4 C 2 16
5 D 2 21
6 E 2 3
7 P 1 17
8 F 2 2
Column J specifies the type of variable that is defined in K. I want to be able to take the mean of the K values that have a 1 assigned next to them. In this example those would be 10 and 17:
T = c(10,17)
mean(T)
13.5
Next I want to be able to assign a pass/fail rank, where pass = 1, fail = 0 to identify whether the number in column K is larger than the mean.
The final data set should look like:
cdfrm <- list(L = c("A","B","P","C","D","E","P","F"), J=c(2,2,1,2,2,2,1,2), K=c(4,3,10,16,21,3,17,2),C = c(0,0,0,1,1,0,1,0))
cdfrm <-as.data.frame(cdfrm)
cdfrm
L J K C
1 A 2 4 0
2 B 2 3 0
3 P 1 10 0
4 C 2 16 1
5 D 2 21 1
6 E 2 3 0
7 P 1 17 1
8 F 2 2 0
This seems so basic, I am sorry guys, I just don't know what I am overthinking.

There are two steps in the solution. The first is to calculate the mean for the value you are interested in. In other words, take the mean of a subset of values in your data.frame. R has a handy function to calculate subsets, called subset. Here it is in action:
meanK <- mean(subset(dfrm, subset=J==1, select=K)$K)
meanK
[1] 13.5
Next, you want to compare column K in your data frame with the mean value we have just calculated. This is a straightforward vector comparison:
dfrm$Pass <- dfrm$K>meanK
dfrm
L J K Pass
1 A 2 4 FALSE
2 B 2 3 FALSE
3 P 1 10 FALSE
4 C 2 16 TRUE
5 D 2 21 TRUE
6 E 2 3 FALSE
7 P 1 17 TRUE
8 F 2 2 FALSE
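The question asks for 1/0 rather than TRUE/FALSE. A minimal sketch, assuming the sample data from the question and the column name C from its expected output: as.integer() turns the logical comparison into 1 for pass and 0 for fail.

```r
# Sample data from the question
dfrm <- data.frame(
  L = c("A", "B", "P", "C", "D", "E", "P", "F"),
  J = c(2, 2, 1, 2, 2, 2, 1, 2),
  K = c(4, 3, 10, 16, 21, 3, 17, 2)
)

# Mean of K over the rows where J == 1 (10 and 17)
meanK <- mean(dfrm$K[dfrm$J == 1])

# TRUE/FALSE becomes 1/0 under as.integer()
dfrm$C <- as.integer(dfrm$K > meanK)
dfrm
```

Indexing dfrm$K directly keeps the result a plain numeric vector, so mean() gets what it expects.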

Here's how to do it in one line:
transform(dfrm, C = K > sapply(split(dfrm$K, dfrm$J), mean)[J])
split groups the values of K by the values of J, and sapply(..., mean) calculates the group-wise means, so each K is compared with the mean of its own J group.
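Comparing each K with its own group mean is not the same rule as comparing every K with the mean of the J == 1 rows, although both give the same answer on the sample data. A minimal sketch of the difference, assuming one hypothetical extra row (L = "G", J = 2, K = 10) that is not in the original question:

```r
dfrm <- data.frame(
  L = c("A", "B", "P", "C", "D", "E", "P", "F", "G"),  # "G" is a hypothetical extra row
  J = c(2, 2, 1, 2, 2, 2, 1, 2, 2),
  K = c(4, 3, 10, 16, 21, 3, 17, 2, 10)
)

# Rule from the question: compare every K with the mean of the J == 1 rows
ref <- dfrm$K > mean(dfrm$K[dfrm$J == 1])

# Group-wise rule: compare every K with the mean of its own J group
# (ave() returns each row's group mean; its default FUN is mean)
grp <- dfrm$K > ave(dfrm$K, dfrm$J)

# Rows where the two rules disagree
dfrm[ref != grp, ]
```

Only the added row differs: 10 is above the J == 2 group mean but below the J == 1 mean of 13.5.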

Filtering observations using multivariate column conditions

I'm not a very experienced R user, so I'm seeking advice on how to optimize what I've built and which direction to take next.
I have one reference data frame, it contains four columns with integer values and one ID.
df <- matrix(ncol=5,nrow = 10)
colnames(df) <- c("A","B","C","D","ID")
# df
for (i in 1:10){
df[i,1:4] <- sample(1:5,4, replace = TRUE)
}
df <- data.frame(df)
df$ID <- make.unique(rep(LETTERS,length.out=10),sep='')
df
A B C D ID
1 2 4 3 5 A
2 5 1 3 5 B
3 3 3 5 3 C
4 4 3 1 5 D
5 2 1 2 5 E
6 5 4 4 5 F
7 4 4 3 3 G
8 2 1 5 5 H
9 4 4 1 3 I
10 4 2 2 2 J
The second data frame holds manual user input. I want to turn this into a Shiny app later on, which is another reason I'm asking about optimization: my code doesn't seem very neat to me.
df.man <- data.frame(matrix(ncol=5,nrow=1))
colnames(df.man) <- c("A","B","C","D","ID")
df.man$ID <- c("man")
df.man$A <- 4
df.man$B <- 4
df.man$C <- 3
df.man$D <- 4
df.man
A B C D ID
4 4 3 4 man
I want to filter rows from the reference sequentially, following these rules:
If there is an exact match across a whole row between the reference table and the manual input, then extract that row (or those rows) from the reference and show it. If not, reduce the number of matching columns from right to left until there is a match, but never match on fewer than two variables (columns A, B).
So with my limited knowledge I've written this:
# subtraction manual from reference
df <- df %>% dplyr::mutate(Adiff=A-df.man$A)%>%
dplyr::mutate(Bdiff=B-df.man$B)%>%
dplyr::mutate(Cdiff=C-df.man$C) %>%
dplyr::mutate(Ddiff=D-df.man$D)
# check manually how much in a row has zero difference and filter those
ifelse(nrow(df%>%filter(Adiff==0 & Bdiff==0 & Cdiff==0 & Ddiff==0)) != 0,
df0<-df%>%filter(Adiff==0 & Bdiff==0 & Cdiff==0 & Ddiff==0),
ifelse(nrow(df%>%filter(Adiff==0 & Bdiff==0 & Cdiff==0)) != 0,
df0<-df%>%filter(Adiff==0 & Bdiff==0 & Cdiff==0),
ifelse(nrow(df%>%filter(Adiff==0 & Bdiff==0)) != 0,
df0<-df%>%filter(Adiff==0 & Bdiff==0),
"less then two exact match")
))
tbl_df(df0[,1:5])
# A tibble: 1 x 5
A B C D ID
<int> <int> <int> <int> <chr>
1 4 4 3 3 G
It works and finds ID G, but it looks ugly to me. So the first question is: what would be the recommended way to improve this? Are there any functions, packages, or something else I'm missing?
Second question: I want to make the condition more complicated.
Imagine we have reference data set.
A B C D ID
2 4 3 5 A
5 1 3 5 B
3 3 5 3 C
4 3 1 5 D
2 1 2 5 E
5 4 4 5 F
4 4 3 3 G
2 1 5 5 H
4 4 1 3 I
4 2 2 2 J
Manual input is
A B C D ID
4 4 2 2 man
The filtering rules should be the following:
If there is an exact match across a whole row between the reference table and the manual input, then extract that row (or those rows) from the reference and show it. If not, reduce the number of matching columns from right to left until there is a match, but never match on fewer than two variables (columns A, B).
From those rows where only two variables match, keep those that differ by at most ±1 in the columns to the right. So cases G and I should be filtered from the reference table in the example above.
Continuing the way I did above, I would do the following:
ifelse(nrow(df0%>%filter(Cdiff %in% (-1:1) & Ddiff %in% (-1:1)))>0,
df01 <- df0%>%filter(Cdiff %in% (-1:1) & Ddiff %in% (-1:1)),
ifelse(nrow(df0%>%filter(Cdiff %in% (-1:1)))>0,
df01<- df0%>%filter(Cdiff %in% (-1:1)),
"NA"))
There will be about 11 columns in the end, but I assume that doesn't matter much.
Keeping this objective in mind, how would you suggest I proceed?
Thanks!
This is a lot to sort through, but I have some ideas that might be helpful.
First, you could keep your df a matrix, and use row names for your letters. Something like:
set.seed(2)
df
A B C D
A 5 1 5 1
B 4 5 1 2
C 3 1 3 2
D 3 1 1 4
E 3 1 5 3
F 1 5 5 2
G 2 3 4 3
H 1 1 5 1
I 2 4 5 5
J 4 2 5 5
And for demonstration, you could use a vector for manual as this is input:
# Complete match example
vec.man <- c(3, 1, 5, 3)
To check for complete matches between the manual input and reference (all 4 columns), with all numbers, you can do:
df[apply(df, 1, function(x) all(x == vec.man)), ]
A B C D
3 1 5 3
If you don't have a complete match, you could calculate the differences between df and vec.man:
# Change example vec.man
vec.man <- c(3, 1, 5, 2)
df.diff <- sweep(df, 2, vec.man)
A B C D
A 2 0 0 -1
B 1 4 -4 0
C 0 0 -2 0
D 0 0 -4 2
E 0 0 0 1
F -2 4 0 0
G -1 2 -1 1
H -2 0 0 -1
I -1 3 0 3
J 1 1 0 3
Rows whose differences start with a run of zeros are your best matches (equivalent to dropping columns from right to left iteratively). For each row, find the column of the first non-zero element; the larger that column index, the better the match:
df.best <- apply(df.diff, 1, function(x) which(x!=0)[1])
A B C D E F G H I J
1 1 3 3 4 1 1 1 1 1
You can see that the best match is E, whose first non-zero difference is in the 4th column (only the last column did not match). You can extract the rows that have the maximum value in df.best as your best matches:
df.match <- df[which(df.best == max(df.best, na.rm = TRUE)), ]
A B C D
3 1 5 3
Finally, if only 2 columns match and you want the rows whose next column is off by +/- 1, check the maximum of df.best (it should be 3, meaning the first mismatch is in the 3rd column). Then compare the absolute differences with the vector c(0,0,1), which means 2 matches followed by a 3rd column off by exactly 1:
# Example vec.man with only 2 matches
vec.man <- c(3, 1, 6, 9)
# (df.diff, df.best and df.match recomputed for this vec.man as above)
> df.match
A B C D
C 3 1 3 2
D 3 1 1 4
E 3 1 5 3
if (max(df.best, na.rm = TRUE) == 3) {
  vec.alt <- c(0, 0, 1)
  df[apply(df.diff[, 1:3], 1, function(x) all(abs(x) == vec.alt)), ]
}
A B C D
3 1 5 3
This should be scalable for 11 columns and 4 matches.
To generalize for different numbers of columns, @IlyaT suggested:
n.cols <- max(df.best, na.rm=TRUE)
vec.alt <- c(rep(0, each=n.cols-1), 1)
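The whole procedure can also be wrapped in one small function. This is only a sketch under stated assumptions: best_prefix_match is a hypothetical helper name, the reference table is the one from the question (not the set.seed(2) example), and "± 1" is read as "differs by at most 1" in columns C and D, which are hard-coded here for the two-match case.

```r
# Reference table from the question, kept as a matrix with IDs as row names
ref <- matrix(
  c(2, 4, 3, 5,
    5, 1, 3, 5,
    3, 3, 5, 3,
    4, 3, 1, 5,
    2, 1, 2, 5,
    5, 4, 4, 5,
    4, 4, 3, 3,
    2, 1, 5, 5,
    4, 4, 1, 3,
    4, 2, 2, 2),
  ncol = 4, byrow = TRUE,
  dimnames = list(LETTERS[1:10], c("A", "B", "C", "D"))
)

# Hypothetical helper: rows sharing the longest run of leading columns with
# `target`, or NULL if that run is shorter than `min.cols`
best_prefix_match <- function(ref, target, min.cols = 2) {
  eq <- sweep(ref, 2, target, FUN = "==")              # cell-wise equality
  n.lead <- apply(eq, 1, function(x) sum(cumprod(x)))  # leading matches per row
  best <- max(n.lead)
  if (best < min.cols) return(NULL)
  ref[n.lead == best, , drop = FALSE]
}

man <- c(4, 4, 2, 2)
m <- best_prefix_match(ref, man)   # only A and B match: rows G and I

# Keep rows whose remaining columns (C, D) are within 1 of the manual input
rest <- sweep(m[, 3:4, drop = FALSE], 2, man[3:4])
near <- m[apply(abs(rest) <= 1, 1, all), , drop = FALSE]
near
```

cumprod() drops to 0 at the first mismatch, so its sum counts the leading matches; drop = FALSE keeps a single matching row as a one-row matrix instead of a bare vector.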

vectorise rows of a dataframe, apply vector function, return to original dataframe r

Given the following df:
a=c('a','b','c')
b=c(1,2,5)
c=c(2,3,4)
d=c(2,1,6)
df=data.frame(a,b,c,d)
a b c d
1 a 1 2 2
2 b 2 3 1
3 c 5 4 6
I'd like to apply a function that normally takes a vector (and returns a vector) like cummax row by row to the columns in position b to d.
Then, I'd like to have the output back in the df, either as a vector in a new column of the df, or replacing the original data.
I'd like to avoid writing it as a for loop that would iterate every row, pull out the content of the cells into a vector, do its thing and put it back.
Is there a more efficient way? I've given the apply family functions a go, but I'm struggling to first get a good way to vectorise content of columns by row and get the right output.
the final output could look something like that (imagining I've applied a cummax() function).
a b c d
1 a 1 2 2
2 b 2 3 3
3 c 5 5 6
or
a b c d output
1 a 1 2 2 (1,2,2)
2 b 2 3 1 (2,3,3)
3 c 5 4 6 (5,5,6)
where output is a vector.
Seems this would just be a simple apply problem that you want to cbind to df:
> cbind(df, apply(df[ , 4:2] # work with columns in reverse order
, 1, # do it row-by-row
cummax) )
a b c d 1 2 3
d a 1 2 2 2 1 6
c b 2 3 1 2 3 6
b c 5 4 6 2 3 6
Ouch. I was bitten by failing to notice that apply returns a column-oriented matrix here, so the result needs to be transposed; such a newbie mistake. But it does show the value of having a question with a reproducible dataset, I suppose.
> cbind(df, t(apply(df[ , 4:2] , 1, cummax) ) )
a b c d d c b
1 a 1 2 2 2 2 2
2 b 2 3 1 1 3 3
3 c 5 4 6 6 6 6
To destructively assign the result to df you would just use:
df <- # .... that code.
This does the concatenation with commas (and as a result no longer needs to be transposed):
> cbind(df, output=apply(df[ , 4:2] , 1, function(x) paste( cummax(x), collapse=",") ) )
a b c d output
1 a 1 2 2 2,2,2
2 b 2 3 1 1,3,3
3 c 5 4 6 6,6,6
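The outputs above run cummax over the columns in the order d, c, b. To reproduce the output shown in the question, the columns can be taken in their original order b to d instead. A minimal sketch, assuming the question's df; cols and res are illustrative names:

```r
df <- data.frame(a = c("a", "b", "c"), b = c(1, 2, 5), c = c(2, 3, 4), d = c(2, 1, 6))

cols <- c("b", "c", "d")
# apply() returns one column per input row, hence the t()
res <- t(apply(df[cols], 1, cummax))

# Option 1: keep the result as a comma-separated string in a new column
df$output <- apply(res, 1, paste, collapse = ",")

# Option 2: overwrite the original columns with the row-wise cumulative maxima
df[cols] <- res
df
```

Either option can be used on its own; they are combined here only to show both forms the question asked about.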

ratios according to two variables, function aggregate in R?

I've been playing with some data in order to obtain the ratios between two levels within one variable and taking into account two other variables. I've been using the function aggregate(), which is very useful to calculate means and sums. However, I'm stuck when I want to calculate some ratios (divisions).
Here you find a dataframe very similar to my data:
w<-c("A","B","C","D","E","F","A","B","C","D","E","F")
x<-c(1,1,1,1,1,1,2,2,2,2,2,2)
y<-c(3,4,5,6,8,10,3,4,5,7,9,10)
z<-runif(12)
df<-data.frame(w,x,y,z)
df
w x y z
1 A 1 3 0.93767621
2 B 1 4 0.09169992
3 C 1 5 0.49012926
4 D 1 6 0.90886690
5 E 1 8 0.37058120
6 F 1 10 0.83558267
7 A 2 3 0.42670001
8 B 2 4 0.05656252
9 C 2 5 0.70694423
10 D 2 7 0.13634309
11 E 2 9 0.92065671
12 F 2 10 0.56276176
What I want is to obtain the ratios of z from the two levels of x and taking into account the variables w and y. So the level "A" from the variable "w" in the level "3" from the variable "y" should be:
df$z[1]/df$z[7]
With the aggregate function it should be something like this:
final<-aggregate(z~y:w, data=df)
However, I know that I am missing something, because in the variable y there are some classes that do not appear in the two categories of w (e.g. 7, 8 and 9).
Any help will be welcomed!
We can use data.table. We convert the 'data.frame' to a 'data.table' (setDT(df)) and group by 'w' and 'y'; if the number of rows in a group (.N) is 2, we divide the first value by the second, otherwise we return 'z' unchanged. We assign (:=) the output to a new column 'z1'.
library(data.table)
setDT(df)[,z1 :=if(.N==2) z[1]/z[2] else z , by = .(w,y)]
df
# w x y z z1
# 1: A 1 3 0.93767621 2.1975069
# 2: B 1 4 0.09169992 1.6212135
# 3: C 1 5 0.49012926 0.6933068
# 4: D 1 6 0.90886690 0.9088669
# 5: E 1 8 0.37058120 0.3705812
# 6: F 1 10 0.83558267 1.4847894
# 7: A 2 3 0.42670001 2.1975069
# 8: B 2 4 0.05656252 1.6212135
# 9: C 2 5 0.70694423 0.6933068
#10: D 2 7 0.13634309 0.1363431
#11: E 2 9 0.92065671 0.9206567
#12: F 2 10 0.56276176 1.4847894
If we just want the summary output we don't need to use :=
setDT(df)[, list(z=if(.N==2) z[1]/z[2] else z) , by = .(w,y)]
Or using aggregate
aggregate(z~w+y, df, FUN=function(x)
if(length(x)==2) x[1]/x[2] else x)
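The approaches above take z[1]/z[2] within each group, which relies on the x = 1 row coming before the x = 2 row. A base R sketch that does not depend on row order, assuming the question's data (with an arbitrary set.seed(1), so the z values differ from the printed ones): split by x and merge on w and y, which keeps only the (w, y) pairs present at both levels of x.

```r
set.seed(1)  # arbitrary seed, only to make the example repeatable
df <- data.frame(
  w = c("A", "B", "C", "D", "E", "F", "A", "B", "C", "D", "E", "F"),
  x = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  y = c(3, 4, 5, 6, 8, 10, 3, 4, 5, 7, 9, 10),
  z = runif(12)
)

x1 <- df[df$x == 1, c("w", "y", "z")]
x2 <- df[df$x == 2, c("w", "y", "z")]

# Inner join: (w, y) pairs found at only one level of x are dropped
ratios <- merge(x1, x2, by = c("w", "y"), suffixes = c(".x1", ".x2"))
ratios$ratio <- ratios$z.x1 / ratios$z.x2
ratios
```

Passing all = TRUE to merge would keep the unmatched pairs too, with NA as their ratio.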

R: combinatorics, number of permutations for one combination

I have three different events (1,2,3) with different probabilities (0.15, 0.76, 0.09) and I would like to draw 5 times with replacement.
I can now determine the number of possible combinations using
nsimplex(3,5) ### =21
from the combinat-package.
And I can determine the probabilities of each combination using
mySimplex <- xsimplex(3,5)
myProbs<-c(0.15, 0.76, 0.09)
results<- apply(mySimplex,2,dmultinom,prob=myProbs)
Further, I can of course determine the number of permutations by calculating 3^5= 243.
But how do I know how often each permutation of the same combination is drawn without counting them manually? That is, how many permutations are in each of my combinations?
If I understand that correctly, there are 243 permutations building 21 different combinations. Now my question is, how many permutations build each combination? E.g. the combination {1,1,1,1,1} will be built up only once, whereas others are created by several permutations.
I guess you can get to this by using the probabilities for each combination, but I do not know how to do it. Or is there any other way to easily determine that in R?
Thank you in advance.
The number of permutations of a indistinguishable copies of item 1, b of item 2, c of item 3, where a + b + c = N, is N! / (a! b! c!).
For example if you had (a,b,c) = (3,1,1) then there are 5!/(3! 1! 1!) = 20 arrangements.
c b a a a b a c a a a b a a c a a c a b
c a b a a b a a c a a c b a a a a b c a
c a a b a b a a a c a c a b a a a b a c
c a a a b a b c a a a c a a b a a a b c
b c a a a a b a c a a a c b a a a a c b
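As a quick check of the formula, and of the question's guess that the count can be recovered from the probabilities, here is a small sketch for (a,b,c) = (3,1,1). The variable names are illustrative; the probabilities are the ones from the question. Since dmultinom returns the count multiplied by the probability of one single arrangement, dividing by that probability gives the count back.

```r
counts <- c(3, 1, 1)

# Direct use of N! / (a! b! c!)
n.direct <- factorial(sum(counts)) / prod(factorial(counts))

# Same number recovered from the multinomial probability
p <- c(0.15, 0.76, 0.09)
n.from.prob <- dmultinom(counts, prob = p) / prod(p^counts)

n.direct
n.from.prob
```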
In general, we can calculate the number as follows
nperm <- function(...) {
  args <- as.numeric(list(...))
  num <- lfactorial(sum(args))
  den <- sum(lfactorial(args))
  round(exp(num - den))
}
So, e.g.,
x<-expand.grid(0:5,0:5,0:5)
x<-x[rowSums(x)==5,]
x[,"nperm"]<-apply(x,1,function(x) do.call(nperm,as.list(x)))
Var1 Var2 Var3 nperm
5 0 0 1
4 1 0 5
3 2 0 10
2 3 0 10
1 4 0 5
0 5 0 1
4 0 1 5
3 1 1 20
2 2 1 30
1 3 1 20
0 4 1 5
3 0 2 10
2 1 2 30
1 2 2 30
0 3 2 10
2 0 3 10
1 1 3 20
0 2 3 10
1 0 4 5
0 1 4 5
0 0 5 1
And sum(x[,"nperm"]) == 243, as expected.
To make this reproducible, I would have needed to use set.seed(<some_value>), but this is one attempt at using sample to draw distinct combinations (without considering the permutations distinct). If the permutations are to be considered distinct, then take out the sort step:
table( # get the counts of distinct combinations
apply( # this will collapse values by column
replicate(100000, # yields a 100,000 column matrix
{sample(c("1","2","3"), 5, replace=TRUE, prob=c(.5,.25,.25) )}),
2, function(x) paste(sort(x), collapse=".")) )
1.1.1.1.1 1.1.1.1.2 1.1.1.1.3 1.1.1.2.2 1.1.1.2.3 1.1.1.3.3 1.1.2.2.2
3090 7705 8144 7851 15408 7649 3997
1.1.2.2.3 1.1.2.3.3 1.1.3.3.3 1.2.2.2.2 1.2.2.2.3 1.2.2.3.3 1.2.3.3.3
11731 11554 3940 949 3844 5955 4019
1.3.3.3.3 2.2.2.2.2 2.2.2.2.3 2.2.2.3.3 2.2.3.3.3 2.3.3.3.3 3.3.3.3.3
961 99 506 990 997 510 101
A.Webb suggests we compare theory dmultinom to practice:
dmultinom(c(4,1,0),prob=c(0.5,0.25,0.25))*2
[1] 0.15625
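The expected count for the first value (1.1.1.1.1) is 0.5^5 * 100000 = 3125, which is close to the simulated 3090. For the second and third values the expected count is 7812.5 each (0.15625 is their combined probability, hence the *2), against simulated counts of 7705 and 8144.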
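The expected counts can be computed directly rather than by hand. A minimal sketch, assuming the same probabilities and 100,000 draws as the simulation above; n.sim and the exp.* names are illustrative:

```r
p <- c(0.5, 0.25, 0.25)
n.sim <- 100000

# Expected number of simulated draws for a given combination (counts of 1s, 2s, 3s)
exp.11111 <- dmultinom(c(5, 0, 0), prob = p) * n.sim  # five 1s
exp.11112 <- dmultinom(c(4, 1, 0), prob = p) * n.sim  # four 1s and one 2
exp.11113 <- dmultinom(c(4, 0, 1), prob = p) * n.sim  # four 1s and one 3

c(exp.11111, exp.11112, exp.11113)
```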

Subsetting data frame by another data frame

The data is as follows:
> x
a b
1 1 a
2 2 a
3 3 a
4 1 b
5 2 b
6 3 b
> y
a b
1 2 a
2 3 a
3 3 b
My goal is to compare both data frames and, for each row in x, indicate whether an equivalent row exists in y. All of the y rows are actually contained in x, so I would like to end up with something like this:
> x
a b intersect.x.y
1 1 a F
2 2 a T
3 3 a T
4 1 b F
5 2 b F
6 3 b T
How about this?
x$rn <- 1:nrow(x)
xyrows <- merge(x,y)$rn # maybe you just want to look at the merge ...?
x$iny <- FALSE
x$iny[xyrows] <- TRUE
I suspect there is a more standard approach, but this way is easy to understand.
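Another common idiom is to build one key per row and test membership with %in%. A minimal sketch, assuming the question's x and y; key is a hypothetical helper, and the "|" separator is an assumption that must not occur in the data itself:

```r
x <- data.frame(a = c(1, 2, 3, 1, 2, 3), b = c("a", "a", "a", "b", "b", "b"))
y <- data.frame(a = c(2, 3, 3), b = c("a", "a", "b"))

# Build one key per row; "|" is an assumed separator that must not occur in the data
key <- function(d) paste(d$a, d$b, sep = "|")

# TRUE where the row of x also appears in y
x$intersect.x.y <- key(x) %in% key(y)
x
```

This avoids the helper row-number column, at the cost of comparing rows as strings.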
