How can I calculate the mean of the top 4 observations in my column? - r

How can I calculate the mean of the top 4 observations in my column?
c(12, 13, 15, 1, 5, 9, 34, 50, 60, 50, 60, 4, 6, 8, 12)
For instance, in the above I would have (50+60+50+60)/4 = 55. I only know how to use the quantile, but it does not work for this.
Any ideas?

Since you're interested in only the top 4 items, you can use partial sort instead of full sort. If your vector is huge, you might save quite some time:
x <- c(12, 13, 15, 1, 5, 9, 34, 50, 60, 50, 60, 4, 6, 8, 12)
idx <- seq(length(x)-3, length(x))
mean(sort(x, partial=idx)[idx])
# [1] 55
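As a rough check of the partial-sort idea, the sketch below compares it with a full sort on a larger random vector. The vector `big` and the seed are made up for illustration; no timings are claimed, since those depend on your machine and R version.

```r
# Hypothetical test vector; any numeric vector without NAs would do
set.seed(1)
big <- runif(1e5)

# Positions that the four largest values occupy in ascending sorted order
idx <- seq(length(big) - 3, length(big))

# A partial sort only guarantees that the positions in `idx` hold the
# values a full sort would put there; the rest may be in any order
top_partial <- sort(big, partial = idx)[idx]
top_full    <- sort(big)[idx]

identical(top_partial, top_full)  # both approaches pick the same four values
mean(top_partial)
```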

Try this:
vec <- c(12, 13, 15, 1, 5, 9, 34, 50, 60, 50, 60, 4, 6, 8, 12)
mean(sort(vec, decreasing=TRUE)[1:4])
gives
[1] 55

Maybe something like this:
v <- c(12, 13, 15, 1, 5, 9, 34, 50, 60, 50, 60, 4, 6, 8, 12)
mean(head(sort(v, decreasing = TRUE), 4))
First, you sort the vector in decreasing order so that the largest values come first. Then head takes the first 4 values of the sorted vector, and mean averages them.
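If you need this more than once, the sort-then-head steps could be wrapped in a small helper. This is only a sketch: the name `top_n_mean`, the default `n = 4`, and the choice to drop `NA` values are assumptions, not part of the original answers.

```r
# Hypothetical helper: mean of the n largest values in x
top_n_mean <- function(x, n = 4) {
  x <- x[!is.na(x)]                 # assumption: missing values are ignored
  n <- min(n, length(x))            # avoid asking for more values than exist
  mean(head(sort(x, decreasing = TRUE), n))
}

v <- c(12, 13, 15, 1, 5, 9, 34, 50, 60, 50, 60, 4, 6, 8, 12)
top_n_mean(v)      # (60 + 60 + 50 + 50) / 4 = 55
top_n_mean(v, 2)   # (60 + 60) / 2 = 60
```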

To be different! Also, please try to do some research on your own before posting.
x <- c(12, 13, 15, 1, 5, 9, 34, 50, 60, 50, 60, 4, 6, 8, 12)
mean(tail(sort(x), 4))

Just to show that you can use quantile in this exercise:
mean(quantile(x,1-(0:3)/length(x),type=1))
#[1] 55
However, the other answers are clearly more efficient.

You could use the order function. Order by -x to give the values in descending order, and just average the first 4:
x <- c(12, 13, 15, 1, 5, 9, 34, 50, 60, 50, 60, 4, 6, 8, 12)
mean(x[order(-x)][1:4])
[1] 55

Related

Calculate log probability, darts game R

I am having trouble with the following problem. I have done some research, but I still cannot come up with a solution.
A darts player throws 30 times every night for a period of 42 days.
Create a function which takes the probability p of hitting the target and calculates the log of the probability that the player hit the target the following number of times on each of the 42 days:
shots = c(
8, 5, 12, 11, 12, 8, 6, 7, 11, 7, 11, 13, 15,
12, 17, 12, 9, 15, 8, 11, 11, 13, 10, 8, 12, 12, 11,
13, 12, 14, 9, 11, 13, 10, 10, 12, 13, 10, 15, 12, 15, 12
)
I am new to probability and this type of programming in R, so any help and approach to solving this problem would be appreciated. Thank you in advance!
The probability of getting 8 shots or less given a hit probability of 0.5 can be found with:
pbinom(8, 30, 0.5)
But to find the probability of exactly 8 shots, we need to subtract the probability of getting 7 shots or less:
pbinom(8, 30, 0.5) - pbinom(8 - 1, 30, 0.5)
Since pbinom is vectorized, we can get the independent probabilities of getting all the shots with:
pbinom(shots, 30, 0.5) - pbinom(shots - 1, 30, 0.5)
But this gives us a vector of 42 probabilities. To get the probability of getting exactly this string of shots, we need to multiply all these probabilities together:
prod(pbinom(shots, 30, 0.5) - pbinom(shots - 1, 30, 0.5))
#> [1] 2.921801e-62
And the log of this value is what we're looking for:
log(prod(pbinom(shots, 30, 0.5) - pbinom(shots - 1, 30, 0.5)))
#> [1] -141.6881
Note, though, that the product of many small probabilities can underflow to zero in floating point arithmetic, so it is safer to take the sum of the logs rather than the log of the product, which is otherwise mathematically equivalent.
sum(log(pbinom(shots, 30, 0.5) - pbinom(shots - 1, 30, 0.5)))
#> [1] -141.6881
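To see why the sum of logs is the safer form, here is a small sketch using a deliberately extreme hit probability. The value `p = 0.99` is chosen only for illustration; it makes every observed count very unlikely, so the product of the 42 probabilities is too small to be represented as a double.

```r
shots <- c(
  8, 5, 12, 11, 12, 8, 6, 7, 11, 7, 11, 13, 15,
  12, 17, 12, 9, 15, 8, 11, 11, 13, 10, 8, 12, 12, 11,
  13, 12, 14, 9, 11, 13, 10, 10, 12, 13, 10, 15, 12, 15, 12
)

# Illustrative, implausible hit probability (an assumption for this demo)
p <- 0.99
probs <- pbinom(shots, 30, p) - pbinom(shots - 1, 30, p)

log_of_product <- log(prod(probs))  # product underflows to 0, so log gives -Inf
sum_of_logs    <- sum(log(probs))   # each log is moderate in size, sum is finite

log_of_product
sum_of_logs
```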
Now all we need to do is wrap this in a function which allows us to specify a number other than 0.5 for probability:
f <- function(p) {
shots = c(
8, 5, 12, 11, 12, 8, 6, 7, 11, 7, 11, 13, 15,
12, 17, 12, 9, 15, 8, 11, 11, 13, 10, 8, 12, 12, 11,
13, 12, 14, 9, 11, 13, 10, 10, 12, 13, 10, 15, 12, 15, 12
)
sum(log(pbinom(shots, 30, p) - pbinom(shots - 1, 30, p)))
}
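As an aside, base R's `dbinom` returns the probability of exactly `k` successes directly, and its `log = TRUE` argument returns the log of that probability, so the subtraction of two `pbinom` calls is not strictly needed. A sketch of an equivalent function follows; the name `log_lik` and the arguments `hits` and `n_throws` are made up here.

```r
shots <- c(
  8, 5, 12, 11, 12, 8, 6, 7, 11, 7, 11, 13, 15,
  12, 17, 12, 9, 15, 8, 11, 11, 13, 10, 8, 12, 12, 11,
  13, 12, 14, 9, 11, 13, 10, 10, 12, 13, 10, 15, 12, 15, 12
)

# Hypothetical alternative to f(): log probability of the observed counts
log_lik <- function(p, hits = shots, n_throws = 30) {
  # dbinom(..., log = TRUE) gives log P(X = k) for each night
  sum(dbinom(hits, size = n_throws, prob = p, log = TRUE))
}

# Compare with the pbinom-based calculation for p = 0.5
via_pbinom <- sum(log(pbinom(shots, 30, 0.5) - pbinom(shots - 1, 30, 0.5)))
all.equal(log_lik(0.5), via_pbinom)
```

Working on the log scale from the start also sidesteps the underflow concern mentioned earlier.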
The reason you are being asked this question is probably as an introduction to likelihood. We can see the likelihood curve of the p parameter by plotting the log probability of getting exactly shots given a particular value of p:
probs <- seq(0.01, 0.99, 0.01)
plot(probs, sapply(probs, f))
We can find the value of p with the greatest likelihood by using optimize:
optimize(f, c(0.01, 0.99), maximum = TRUE)$maximum
#> [1] 0.3714248
So we can infer that the player had approximately 37.14% chance of hitting the target each time.
We can confirm this is right by simply calculating the proportion of throws that hit the target, which should give us (almost exactly) the same value:
mean(shots/30)
#> [1] 0.3714286

How to collect unique values, and sum across other columns with conditions

I have a lot of financial trading data, around a million rows, and I want to condense it into a new data frame with one row per unique UserId. I then want to add up the "trades" for each account under some conditions, e.g. if TransactionTypeId == 2 & AC_Type == 19. I would use SUMIFS in Excel for this, but the size of the file makes that pretty much impossible to run on my computer.
df<- structure(list(UserId = c(1, 1, 1, 1, 2,
2, 2, 3, 3, 3, 4, 5, 6,
6, 6, 7, 7, 7, 8, 8, 8,
8, 8, 9, 9, 9, 10, 11, 12,
12, 13, 13, 13, 14, 14, 15, 15,
16, 16, 16), TransactionTypeId = c(14, 1, 1, 70,
15, 1, 1, 14, 1, 1, 70, 14, 14, 1, 1, 14, 1, 1, 14, 1, 1, 1,
1, 14, 1, 1, 14, 14, 1, 1, 14, 1, 1, 1, 1, 70, 70, 14, 1, 1),
AC_Type = c(21, 21, 21, 21, 19, 19, 19, 19, 19, 19, 19, 19,
19, 19, 19, 21, 21, 21, 19, 19, 19, 19, 19, 19, 19, 19, 20,
19, 19, 19, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20), Trades = c(30,
30, 0.00067116, 0.00067115, 249, 249, 0.00533033, 48.75,
48.75, 0.00101298, 0.00533, 24.37, 146.25, 146.25, 0.00309109,
100.01, 100.01, 0.00233551, 97.5, 90, 0.00189134, 5, 0.00245851,
234, 234, 0.00500802, 100.01, 48.75, 48.5, 0.0275474, 24,
24, 0.00051975, 100, 0.00223998, 0.00051975, 0.00205, 9.75,
8.75, 0.00017811)), row.names = c(NA, -40L), class = c("tbl_df",
"tbl", "data.frame"))
You can subset Trades with the logical condition inside summarise and sum the values that match:
library(dplyr)
df %>%
group_by(UserId) %>%
summarise(count = sum(Trades[TransactionTypeId == 2 & AC_Type== 19]))
Not quite sure what you want ...
library(dplyr)
df %>%
group_by(UserId) %>%
filter(TransactionTypeId == 1 & AC_Type == 19) %>%
summarise(sum = sum(Trades))
# A tibble: 6 x 2
  UserId   sum
   <dbl> <dbl>
1      2 249.
2      3  48.8
3      6 146.
4      8  95.0
5      9 234.
6     12  48.5
Here you first group_by UserId, then filter those rows that meet your conditions (NB: I've changed 2 to 1 as there aren't any 2s in the sample data), and finally summarise by summing up the values in Trades. Note that users with no matching rows are dropped from the result.
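One difference between the approaches is worth spelling out: filtering before summarising removes users who have no matching rows, whereas summing a conditional subset keeps them with a total of 0. The base R sketch below illustrates the second behaviour on a tiny made-up data frame; `toy`, its values, and the condition `TransactionTypeId == 1 & AC_Type == 19` are assumptions for illustration only.

```r
# Hypothetical miniature of the trading data
toy <- data.frame(
  UserId            = c(1, 1, 2, 2, 3),
  TransactionTypeId = c(1, 14, 1, 1, 14),
  AC_Type           = c(19, 19, 19, 21, 19),
  Trades            = c(10, 5, 2, 3, 7)
)

# TRUE for rows that should count towards the total
keep <- toy$TransactionTypeId == 1 & toy$AC_Type == 19

# Multiplying by the logical zeroes out non-matching rows, so every
# UserId still appears in the result (assumes Trades has no NAs)
totals <- tapply(toy$Trades * keep, toy$UserId, sum)

result <- data.frame(
  UserId = as.numeric(names(totals)),
  sum    = as.vector(totals)
)
result
```

Here user 3 has no matching rows and is reported with a sum of 0 instead of disappearing.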
Using data.table
library(data.table)
setDT(df)[, .(count = sum(Trades[TransactionTypeId == 2 &
AC_Type== 19], na.rm = TRUE)), UserId]

Quade test in R

I would like to perform a Quade test with more than one covariate in R. I know the command quade.test and I have seen the example below:
## Conover (1999, p. 375f):
## Numbers of five brands of a new hand lotion sold in seven stores
## during one week.
y <- matrix(c( 5, 4, 7, 10, 12,
1, 3, 1, 0, 2,
16, 12, 22, 22, 35,
5, 4, 3, 5, 4,
10, 9, 7, 13, 10,
19, 18, 28, 37, 58,
10, 7, 6, 8, 7),
nrow = 7, byrow = TRUE,
dimnames =
list(Store = as.character(1:7),
Brand = LETTERS[1:5]))
y
quade.test(y)
My question is as follows: how could I introduce more than one covariate? In this example the covariate is the Store variable.

changing the spacing between vertices in iGraph in R

Suppose I want to make a plot with the following data:
pairs <- c(1, 2, 2, 3, 2, 4, 2, 5, 2, 6, 2, 7, 2, 8, 2, 9, 2, 10, 2, 11, 4,
14, 4, 15, 6, 13, 6, 19, 6, 28, 6, 36, 7, 16, 7, 23, 7, 26, 7, 33,
7, 39, 7, 43, 8, 35, 8, 40, 9, 21, 9, 22, 9, 25, 9, 27, 9, 33, 9,
38, 10, 12, 10, 18, 10, 20, 10, 32, 10, 34, 10, 37, 10, 44, 10, 45,
10, 46, 11, 17, 11, 24, 11, 29, 11, 30, 11, 31, 11, 33, 11, 41, 11,
42, 11, 47, 14, 50, 14, 52, 14, 54, 14, 55, 14, 56, 14, 57, 14, 58,
14, 59, 14, 60, 14, 61, 15, 48, 15, 49, 15, 51, 15, 53, 15, 62, 15,
63)
library(igraph)
g <- graph(pairs)
plot(g, layout = layout.reingold.tilford)
I get a plot like the one below:
As you can see the spaces between some of the vertices are so small that these vertices overlap.
1. I wonder if there is a way to change the spacing between vertices.
2. In addition, is the spacing between vertices arbitrary? For example, Vertices 3, 4, and 5 are very close to each other, but 5 and 6 are far apart.
EDIT:
For my 2nd question, I guess the spacing is dependent on the number of nodes below. E.g., 10 and 11 are farther from each other than 8 and 9 are because there are more children below 10 and 11 than there are below 8 and 9.
I bet there is a better solution, but I cannot find it, so here is my approach. Since there seems to be no general parameter for width, you have to adjust several parameters manually to obtain the desired output.
My approach is mainly to resize some elements of the plot and to adjust the margins so that the available space is used as fully as possible. The most important parameter here is asp, which controls the aspect ratio of the plot (in this case the plot is, I guess, better wide than tall, so an aspect ratio even below 0.5 works well). Other tricks are to reduce the size of the vertices and fonts. Here is the code:
plot( g, layout = layout.reingold.tilford,
edge.width = 1,
edge.arrow.width = 0.3,
vertex.size = 5,
edge.arrow.size = 0.5,
vertex.size2 = 3,
vertex.label.cex = 1,
asp = 0.35,
margin = -0.1)
That produces this plot:
Another approach would be to set the graphical device to PDF (or JPEG etc.) and then set rescale to FALSE. With the RStudio viewer this cuts off a large part of the plot, but with other graphic devices it might work well (no guarantee).
In any case, if you are unsure how to use these parameters (they can be very tricky), type help(igraph.plotting).
For the second part of the question I am not sure. Looking inside the function I could not work out a precise answer, but I guess that the space between elements on the same level is calculated from the child elements they have; say, the spacing of 3, 4, and 5 depends on their children and sub-children, which require space of their own.

Error in data.frame() arguments imply differing number of rows: 1, 11, 10, 3, 5, 4, 9, 2, 6, 7, 8, 12, 22, 13, 16, 14, 15, 19, 17, 20, 18, 28, 2

I am using this command in R Studio to split the data present in one column:
CTE.info <- data.frame(strsplit(as.character(CTE$V11),'|',fixed=TRUE))
But, I am getting the error:
Error in data.frame("orderItems", "79542;2;24.000;24.000;5.310", "Credit;1;-15.000;-15.000;.000", :
arguments imply differing number of rows: 1, 11, 10, 3, 5, 4, 9, 2, 6, 7, 8, 12, 22, 13, 16, 14, 15, 19, 17, 20, 18, 28, 24
Could someone assist and let me know how can this be sorted?
data.frame() fails because the split pieces have different lengths. You can pad the list elements with NA to the same length and it should work.
lst <- strsplit(as.character(CTE$V11),'|',fixed=TRUE)
d1 <- data.frame(lapply(lst, `length<-`, max(lengths(lst))))
colnames(d1) <- paste0('V', seq_along(d1))
data
CTE <- data.frame(V11= c('a|b|c', 'a|b', 'a|b|c|d'))
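In the padded result above, each split string becomes a column. If you would rather have one row per original string, one possible variation is to bind the padded pieces together row-wise. This is a sketch on the small example data; the names `padded`, `m`, and `by_row` are made up here.

```r
CTE <- data.frame(V11 = c('a|b|c', 'a|b', 'a|b|c|d'))

# Split each string on the literal "|" character
lst <- strsplit(as.character(CTE$V11), '|', fixed = TRUE)

# Pad every piece with NA up to the longest length
padded <- lapply(lst, `length<-`, max(lengths(lst)))

# Bind row-wise: one row per original string, one column per field
m <- do.call(rbind, padded)
by_row <- as.data.frame(m)

dim(by_row)   # 3 rows (strings) by 4 columns (fields)
by_row
```

Shorter strings end up with `NA` in their trailing columns.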
