I have a set of user recommendations
review=matrix(c(5:1,10,2,1,1,2), nrow=5, ncol=2, dimnames=list(NULL,c("Star","Votes")))
and wanted to use summary(review) to show its basic properties: mean, median, quartiles, and min/max.
But it gives back a summary of each column separately. I refrained from using a data.frame because the 'Star' values are ordered factors.
How can I tell R that Star is an ordered set of numeric scores and Votes are their frequencies?
I'm not exactly sure what you mean by taking the mean in general if Star is supposed to be an ordered factor. However, in the example you give, where Star is actually a set of numeric values, you can use the following:
library(Hmisc)
R> review=matrix(c(5:1,10,2,1,1,2), nrow=5, ncol=2, dimnames=list(NULL,c("Star","Votes")))
R> wtd.mean(review[, 1], weights = review[, 2])
[1] 4.0625
R> wtd.quantile(review[, 1], weights = review[, 2])
  0%  25%  50%  75% 100% 
1.00 3.75 5.00 5.00 5.00 
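For the weighted mean alone you don't strictly need Hmisc; base R's weighted.mean gives the same figure:
weighted.mean(review[, 1], w = review[, 2])
# [1] 4.0625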
I don't understand what the problem is. Why shouldn't you use a data.frame?
rv <- data.frame(star = ordered(review[, 1]), votes = review[, 2])
You can expand your data.frame into a vector, repeating each star by its vote count:
( vts <- with(rv, rep(star, votes)) )
[1] 5 5 5 5 5 5 5 5 5 5 4 4 3 2 1 1
Levels: 1 < 2 < 3 < 4 < 5
Then do the summary... I just don't know what kind of summary, since summary will bring you back to the start. O_o
summary(vts)
 1  2  3  4  5 
 2  1  1  2 10 
EDIT (on @Prasad's suggestion)
Since vts is an ordered factor, you should convert it to numeric, then calculate the summary (for the moment I will disregard the underlying statistical issues):
nvts <- as.numeric(levels(vts)[vts]) ## numeric conversion
summary(nvts) ## "ordinary" summary
fivenum(nvts) ## Tukey's five number summary
Just to clarify -- when you say you would like "mean, median, quartiles and min/max", you're talking in terms of the number of stars? e.g. mean = 4.062 stars?
Then using aL3xa's code, would something like summary(as.numeric(as.character(vts))) be what you want?
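Combining the expansion idea with a plain numeric summary, a base-R one-liner that skips the factor round-trip entirely (a sketch using the matrix from the question):
## Repeat each star value by its vote count, then summarise
summary(rep(review[, 1], review[, 2]))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.750   5.000   4.062   5.000   5.000 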
My understanding was that dplyr::ntile and statar::xtile are trying to do the same thing. But sometimes the output is different:
dplyr::ntile(1:10, 5)
# [1] 1 1 2 2 3 3 4 4 5 5
statar::xtile(1:10, 5)
# [1] 1 1 2 2 3 3 3 4 5 5
I am converting Stata code into R, so statar::xtile gives the same output as the original Stata code, but I thought dplyr::ntile would be the R equivalent.
The Stata help says that xtile is used to:
Create variable containing quantile categories
And statar::xtile is obviously replicating this.
And dplyr::ntile is:
a rough rank, which breaks the input vector into n buckets.
Do these mean the same thing?
If so, why do they give different answers?
And if not, then:
What is the difference?
When should you use one or the other?
Thanks @alistaire for pointing out that dplyr::ntile is only doing:
function (x, n) { floor((n * (row_number(x) - 1)/length(x)) + 1) }
So not the same as splitting into quantile categories, as xtile does.
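Evaluating that formula by hand for this input shows where the even split comes from; a quick sketch (using base rank in place of dplyr::row_number, which agrees here because there are no ties):
n <- 5
x <- 1:10
floor((n * (rank(x) - 1) / length(x)) + 1)
# [1] 1 1 2 2 3 3 4 4 5 5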
Looking at the code for statar::xtile leads to statar::pctile and the documentation for statar says that:
pctile computes quantile and weighted quantile of type 2 (similarly to Stata _pctile)
Therefore an equivalent to statar::xtile in base R is:
.bincode(1:10, quantile(1:10, seq(0, 1, length.out = 5 + 1), type = 2),
include.lowest = TRUE)
# [1] 1 1 2 2 3 3 3 4 5 5
I'm looking for a way to split a data frame into groups of equal size (essentially the same number of rows in each group) whose groups have nearly equal means.
User Data
1 5.0
2 4.5
3 3.5
4 6.0
5 7.0
6 6.5
7 5.5
8 6.2
9 5.7
10 5.9
This is very similar to this request. However, that approach only splits the data into 2 groups.
My actual dataset contains anywhere from 75-150 rows, and I need to split it into anywhere from 5-10 groups of equal mean and fairly equal size.
I've researched on Google & Stack Exchange for the last few days, and I'm just not having much luck. Any guidance would be great.
Thanks in advance!
More details:
Maybe I need to provide some more details. Below I've included a real dataset. We are a transportation company; this dataset has Driver ID, Miles, and Gallons provided. What I have been doing is reading the data into R and adding an MPG column like so:
data <- read.csv('filename')
data$MPG <- data$Miles / data$Gallons
Then I tried the two answers provided below. Arun's idea gives me almost equal group sizes (9 members per group, 10 groups); however, the variation of the means is large, from 6.615 to 7.093, which is too much variation for me to start with. Thomas' idea gets a little bit tighter variation, but the group sizes all differ, from 6 to 13 members.
What we are looking to do is improve fleet MPG, and we're going to accomplish this with a team-based competition, so I need to randomly put the teams together with all of them starting from roughly the same group MPG.
Maybe that helps and can lead us in the right direction? I tried doing this in my own programming language, but it locks the computer up every time, so I figured R would probably be able to process the data better.
Thanks again!
If similar means are really all that matter, I've put together a simulation below that looks at a bunch of different random groupings of the data (n draws) for a particular group size (k) and then minimizes the variance of the group means. From that minimization you can then extract the best grouping from the simulation results.
df <- data.frame(User=1:1000, Data=rnorm(1000, 0, 1)) # example data
myfun <- function() {
  k <- 5                                      # number of groups
  tmp <- seq(nrow(df)) %% k                   # equal-sized group labels (from qwwqwwq's answer)
  thisgroup <- sample(tmp, nrow(df), FALSE)   # random permutation of the labels
  # thisgroup <- sample(1:k, nrow(df), TRUE)  # original version (unequal group sizes)
  thisavg <- as.vector(by(df$Data, thisgroup, mean)) # group means
  thisvar <- var(thisavg)                     # variance of the means
  return(list(group=thisgroup, avgs=thisavg, var=thisvar))
}
n <- 1000 # number of simulations
sorts <- replicate(n, myfun(), simplify=FALSE)
wh <- which.min(sapply(sorts, function(x) x$var)) # minimization
# sorts[[wh]] # this is the sample you want
split(df, sorts[[wh]]$group) # list of separate dataframes for each group
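To sanity-check the chosen grouping, you can recompute the group means from the split itself; they should come out nearly identical across groups:
sorts[[wh]]$avgs                                 # group means of the best draw
sapply(split(df$Data, sorts[[wh]]$group), mean)  # the same means, recomputed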
You could also let k vary, if you don't care how many cases end up in each group, by moving the k <- 5 line into the function and drawing it randomly from the range of group counts you're willing to accept.
There are probably other ways to do this, though.
Following Thomas' idea, here's a brute-force/greedy approach that will give more or less the same values (you can opt for more repetitions until you're satisfied with the closeness of the solution).
# Assuming the data you provided is in `df`
grp <- 5
myfun <- function() {
samp <- sample(nrow(df))
s.mean <- tapply(df$Data, samp %% grp, mean)
s.var <- var(s.mean)
list(samp, s.mean, s.var)
}
out <- replicate(1000, myfun(), simplify=FALSE)
min.pos <- which.min(sapply(out, `[[`, 3))
min.idx <- out[[min.pos]][[1]]
split(df$Data[min.idx], min.idx %% grp)
$`0`
[1] 7.0 5.9
$`1`
[1] 5.0 6.5
$`2`
[1] 5.5 4.5
$`3`
[1] 6.2 3.5
$`4`
[1] 5.7 6.0
This is what out[min.pos] looks like:
out[min.pos]
[[1]]
[[1]][[1]]
[1] 7 9 8 5 3 4 1 2 10 6
[[1]][[2]]
0 1 2 3 4
5.85 5.70 5.60 5.25 5.50
[[1]][[3]]
[1] 0.05075
The simplest way I can think of: sort the data, take all the indices modulo the number of groups, and you're done. It should work well if the data are normally distributed, I think. It has the advantage that the groups are as equally sized as possible.
mpg <- rnorm(150)
mpg <- sort(mpg)
ngroups <- 13
df <- data.frame(mpg=mpg, group=seq(length(mpg)) %% ngroups)
tapply(df$mpg, df$group, mean)
           0            1            2            3            4            5            6            7            8 
 0.080400272 -0.110797283 -0.046698548 -0.014177675  0.024410834  0.048370962  0.066265303  0.087119914 -0.062259638 
           9           10           11           12 
-0.042172496 -0.003451581  0.033853024  0.056947458 
I tried to do a stochastic simulation of an epidemiological SEIR model using the code below.
library(GillespieSSA)
parms <- c(beta=0.591,sigma=1/8,gamma=1/7)
x0 <- c(S=50,E=0,I=1,R=0)
a <- c("beta*S*I","sigma*E","gamma*I")
nu <- matrix(c(-1,0,0,
1,-1,0,
0,1,-1,
0,0,1),nrow=4,byrow=TRUE)
set.seed(12345)
out <- lapply(X=1:10,FUN=function(x) ssa(x0,a,nu,parms,tf=50)$data)
out
I managed to obtain the 10 simulations' values that I wanted. The time is in continuous form. Now I need to extract the state at discrete times such as 1, 2, 3, ..., 50 from each simulation. What kind of code should I use?
I tried using data.frame and extracting, but I'm still not able to do it.
Thanks in advance for any help.
Let's say the data looks like this:
df <- data.frame(t=seq(0.4,4.5,0.03), x=1:137)
## t x
## 1 0.40 1
## 2 0.43 2
## 3 0.46 3
## 4 0.49 4
## 5 0.52 5
To get the row indices where the time crosses an integer (i.e., where ceiling(df$t) jumps):
idx <- diff(ceiling(df$t)) == 1
The discrete time series will then be:
df[idx,]
## t x
## 21 1.00 21
## 54 1.99 54
## 87 2.98 87
## 121 4.00 121
Having run the simulation myself, a problem seems to be that many of the time stamps are quite far from an integer.
To see these remainders, check: out[[1]][,1] %% 1
The good news is that you can use the output of this, with a tuning parameter, to select what you want. For this purpose, you'll want to find the distance from one and then control what counts as an acceptable gap.
Do this as follows and save the result (a vector of TRUE and FALSE values):
selection <- abs((out[[1]][,1] %% 1) - 1) < 0.1
You can then subset the matrix out using the selection index we just saved:
out[[1]][selection,]
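An alternative that avoids the tuning parameter altogether is step interpolation: between reaction events the state is constant, so you can read the state off at exact integer times with approx(..., method = "constant"). A sketch, assuming each out[[i]] is the matrix returned by ssa()$data with time in its first column (discretize is just an illustrative name):
# State at integer times 1..50, holding the last event's state constant
discretize <- function(sim, times = 1:50) {
  state <- sapply(seq(2, ncol(sim)), function(j)
    approx(sim[, 1], sim[, j], xout = times,
           method = "constant", rule = 2)$y)
  cbind(t = times, state)
}
out_discrete <- lapply(out, discretize)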
Users
I have a distance matrix dMat and want to find the 5 nearest samples to the first one. What function can I use in R? I know how to find the closest sample (cf. the 3rd line of code), but I can't figure out how to get the other 4 samples.
The code:
Mat <- replicate(10, rnorm(10))
dMat <- as.matrix(dist(Mat))
which(dMat[,1]==min(dMat[,1]))
The 3rd line of code finds the index of the closest sample to the first sample.
Thanks for any help!
Best,
Chega
You can use order to do this:
head(order(dMat[-1,1]),5)+1
[1] 10 3 4 8 6
Note that I removed the first one, as you presumably don't want to include the fact that your reference point is 0 distance away from itself.
Alternative using sort:
sort(dMat[,1], index.return = TRUE)$ix[1:6]
It would be nice to add a set.seed(.) when generating the random numbers for the matrix so that we could show the results are identical. I will skip the results here.
Edit (corrected solution): the above solution will only work if the first element is always the smallest! Here's a corrected solution that will always give the 5 closest values to the first element of the column:
> sort(abs(dMat[-1,1] - dMat[1,1]), index.return=TRUE)$ix[1:5] + 1
Example:
> dMat <- matrix(c(70,4,2,1,6,80,90,100,3), ncol=1)
# James' solution
> head(order(dMat[-1,1]),5) + 1
[1] 4 3 9 2 5 # values are 1,2,3,4,6 (wrong)
# old sort solution
> sort(dMat[,1], index.return = TRUE)$ix[1:6]
[1] 4 3 9 2 5 1 # values are 1,2,3,4,6,70 (wrong)
# Correct solution
> sort(abs(dMat[-1,1] - dMat[1,1]), index.return=TRUE)$ix[1:5] + 1
[1] 6 7 8 5 2 # values are 80,90,100,6,4 (right)
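Going back to the distance-matrix setting of the original question, where the self-distance dMat[1, 1] is zero, a small general helper might look like this (a sketch; nearest_k is just an illustrative name):
# k nearest samples to sample i in a distance matrix d
nearest_k <- function(d, i, k) {
  ord <- order(d[, i])   # sample i itself comes first, at distance 0
  ord[ord != i][1:k]
}
nearest_k(dMat, 1, 5)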
It is easy to do an exact binomial test on two values, but what happens if one wants to run the test on a whole set of success counts and numbers of trials? I created a data frame of test sensitivities and potential numbers of enrollees in a study, and then for each row I calculate how many successes that would be. Here is the code.
sens <- seq(from=.1, to=.5, by=0.05)
enroll <- seq(from=20, to=200, by=20)
df <- expand.grid(sens=sens, enroll=enroll)
df <- transform(df, succes=sens*enroll)
But now how do I use each row's combination of successes and number of trials to do the binomial test?
I am only interested in the upper limit of the 95% confidence interval of the binomial test, and I want that single number to be added to the data frame as a column called "upper.limit".
I thought of something along the lines of
binom.test(succes,enroll)$conf.int
alas, conf.int gives something such as
[1] 0.1266556 0.2918427
attr(,"conf.level")
[1] 0.95
All I want is just 0.2918427
Furthermore, I have a feeling that there has to be a do.call in there somewhere, and maybe even an lapply, but I do not know how that would run through the whole data frame. Or should I perhaps be using plyr?
Clearly my head is spinning. Please make it stop.
If this gives you (almost) what you want, then try this:
binom.test(succes,enroll)$conf.int[2]
And apply across the board or across the rows as it were:
> df$UCL <- apply(df, 1, function(x) binom.test(x[3],x[2])$conf.int[2] )
> head(df)
sens enroll succes UCL
1 0.10 20 2 0.3169827
2 0.15 20 3 0.3789268
3 0.20 20 4 0.4366140
4 0.25 20 5 0.4910459
5 0.30 20 6 0.5427892
6 0.35 20 7 0.5921885
Here you go:
R> newres <- do.call(rbind, apply(df, 1, function(x) {
+ bt <- binom.test(x[3], x[2])$conf.int;
+ newdf <- data.frame(t(x), UCL=bt[2]) }))
R>
R> head(newres)
sens enroll succes UCL
1 0.10 20 2 0.31698
2 0.15 20 3 0.37893
3 0.20 20 4 0.43661
4 0.25 20 5 0.49105
5 0.30 20 6 0.54279
6 0.35 20 7 0.59219
R>
This uses apply to loop over your existing data, computes the test, and returns the value you want by sticking it into a new (one-row) data.frame. We then glue all those 90 data.frame objects into a single new one with do.call(rbind, ...) over the list we got from apply.
Ah yes, if you just want to directly insert a single column, the other answer rocks, as it is simpler. My longer answer shows how to grow or construct a data.frame during the sweep of apply.
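One caveat with apply over a data.frame is that it first coerces each row to a plain vector, which is harmless here because every column is numeric, but can bite with mixed types. A mapply sketch that sidesteps the row-wise coercion by vectorising over the two columns directly:
# No matrix coercion: feed the success and trial columns in parallel
df$upper.limit <- mapply(function(s, n) binom.test(s, n)$conf.int[2],
                         df$succes, df$enroll)
head(df)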