Difference between dplyr::ntile and statar::xtile - r

My understanding was that dplyr::ntile and statar::xtile are trying to the same thing. But sometimes the output is different:
dplyr::ntile(1:10, 5)
# [1] 1 1 2 2 3 3 4 4 5 5
statar::xtile(1:10, 5)
# [1] 1 1 2 2 3 3 3 4 5 5
I am converting Stata code into R, so statar::xtile gives the same output as the original Stata code but I thought dplyr::ntile would be the equivalent in R.
The Stata help says that xtile is used to:
Create variable containing quantile categories
And statar::xtile is obviously replicating this.
And dplyr::ntile is:
a rough rank, which breaks the input vector into n buckets.
Do these mean the same thing?
If so, why do they give different answers?
And if not, then:
What is the difference?
When should you use one or the other?

Thanks #alistaire for pointing out that dplyr::ntile is only doing:
function (x, n) { floor((n * (row_number(x) - 1)/length(x)) + 1) }
So not the same as splitting into quantile categories, as xtile does.
Looking at the code for statar::xtile leads to statar::pctile and the documentation for statar says that:
pctile computes quantile and weighted quantile of type 2 (similarly to Stata _pctile)
Therefore an equivalent to statar::xtile in base R is:
.bincode(1:10, quantile(1:10, seq(0, 1, length.out = 5 + 1), type = 2),
include.lowest = TRUE)
# [1] 1 1 2 2 3 3 3 4 5 5

Related

Implement Z = k*Z to model a population growth

What would be your scholarly recommendation to model a population within R when
DELTA_Z = .2Z, Z0 = 10
? The output should be similar to the following
Or as another example, suppose a population is described by the model
Nt+1 = 1.5Nt and N5 = 7.3. Find Nt for t = 0, 1, 2, 3, and 4.
t 0 1 2 3 4 5 6
Zt 10 12 14.4 17.28 20.736 24.8832 29.8598
Those recursions i.e. Z=k*Z are done quite easily within a spreadsheet such as Excel. In R however, the following (far from efficient) have been done thus far:
#loop implementation in R
Z=10;Z;for (t in 6:0)
{Z=.2*Z+Z; print(Z)}
pr
Z0=10;
Z1=.2*Z0+Z0; Z2=.2*Z1+Z1; Z3=.2*Z2+Z2
Z4=.2*Z3+Z3;Z5=.2*Z4+Z4;Z6=.2*Z5+Z5
Zn=c(Z0,Z1,Z2,Z3,Z4,Z5,Z6);
Since R tries to avoid for loops and iterations at all costs, what would be your recommendation (could it be done preferably without iteration?)
What has been done in Excel is the following:
t Nt
5 7.3 k=1.5
4 =B2/$C$2
3 =B3/$C$2
2 =B4/$C$2
1 =B5/$C$2
0 =B6/$C$2
It is a lot easier:
R> Z <- 10
R> Z * 1.2 ^ (0:6)
[1] 10.00000 12.00000 14.40000 17.28000 20.73600 24.88320 29.85984
R>
We set Z to ten, and then multiply it by the growth rate. And that is really just taking 'growth' to the t-th power.
There is a nice short tutorial in the appendix of the An Introduction to R manual that came with your copy of R. I went over that a number of times when I started.

Formatting data for two sample t-tests on R

Suppose I have the dataset that has the following information:
1) Number (of products bought, for example)
1 2 3
2) Frequency for each number (e.g., how many people purchased that number of products)
2 5 10
Let's say I have the above information for each of the 2 groups: control and test data.
How do I format the data such that it would look like this:
controldata<-c(1,1,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3)
(each number * frequency listed as a vector)
testdata<- (similar to above)
so that I can perform the two independent sample t-test on R?
If I don't even need to make them a vector / if there's an alternative clever way to format the data to perform the t-test, please let me know!
It would be simple if the vector is small like above, but I can have the frequency>10000 for each number.
P.S.
Control and test data have a different sample size.
Thanks!
Use rep. Using your data above
rep(c(1, 2, 3), c(2, 5, 10))
# [1] 1 1 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
Or, for your case
control_data = rep(n_bought, frequency)

finding set of multinomial combinations

Let's say I have a vector of integers 1:6
w=1:6
I am attempting to obtain a matrix of 90 rows and 6 columns that contains the multinomial combinations from these 6 integers taken as 3 groups of size 2.
6!/(2!*2!*2!)=90
So, columns 1 and 2 of the matrix would represent group 1, columns 3 and 4 would represent group 2 and columns 5 and 6 would represent group 3. Something like:
1 2 3 4 5 6
1 2 3 5 4 6
1 2 3 6 4 5
1 2 4 5 3 6
1 2 4 6 3 5
...
Ultimately, I would want to expand this to other multinomial combinations of limited size (because the numbers get large rather quickly) but I am having trouble getting things to work. I've found several functions that do binomial combinations (only 2 groups) but I could not locate any functions that do this when the number of groups is greater than 2.
I've tried two approaches to this:
Building up the matrix from nothing using for loops and attempting things with the reshape package (thinking that might be something there for this with melt() )
working backwards from the permutation matrix (720 rows) by attempting to retain unique rows within groups and or removing duplicated rows within groups
Neither worked for me.
The permutation matrix can be obtained with
library(gtools)
dat=permutations(6, 6, set=TRUE, repeats.allowed=FALSE)
I think working backwards from the full permutation matrix is a bit excessive but I'm tring anything at this point.
Is there a package with a prebuilt function for this? Anyone have any ideas how I shoud proceed?
Here is how you can implement your "working backwards" approach:
gps <- list(1:2, 3:4, 5:6)
get.col <- function(x, j) x[, j]
is.ordered <- function(x) !colSums(diff(t(x)) < 0)
is.valid <- Reduce(`&`, Map(is.ordered, Map(get.col, list(dat), gps)))
dat <- dat[is.valid, ]
nrow(dat)
# [1] 90

Reducing a data frame for computing harmonic speedup in R

I have a data set containing the following information:
Workload name
Configuration used
Measured performance
Here you have a toy data set to illustrate my problem (performance data does not make sense at all, I just selected different integers to make the example easy to follow. In reality that data would be floating point values coming from performance measurements):
workload cfg perf
1 a 1 1
2 b 1 2
3 a 2 3
4 b 2 4
5 a 3 5
6 b 3 6
7 a 4 7
8 b 4 8
You can generate it using:
dframe <- data.frame(workload=rep(letters[1:2], 4),
cfg=unlist(lapply(seq_len(4),
function(x) { return(c(x, x)) })),
perf=round(seq_len(8))
)
I am trying to compute the harmonic speedup for the different configurations. For that a base configuration is needed (cfg = 1 in this example). Then the harmonic speedup is computed as:
num_workloads
HS(cfg_i) = num_workloads / sum (perf(cfg_base, wl_j) / perf(cfg_i, wl_j))
wl_j
For instance, for configuration 2 it would be:
HS(cfg_2) = 2 / [perf(cfg_1, wl_1) / perf(cfg_2, wl_1) +
perf(cfg_1, wl_2) / perf_cfg_2, wl_2)]
I would like to compute harmonic speedup for every workload pair and configuration. By using the example data set, the result would be:
workload.pair cfg harmonic.speedup
1 a-b 1 2 / (1/1 + 2/2) = 1
2 a-b 2 2 / (1/3 + 2/4) = 2.4
3 a-b 3 2 / (1/5 + 2/6) = 3.75
4 a-b 4 2 / (1/7 + 2/8) = 5.09
I am struggling with aggregate and ddply in order to find a solution that does not uses loops, but I have not been able to come up with a working solution. So, the basic problems that I am facing are:
how to handle the relationship between workloads and configuration. The results for a given workload pair (A-B), and a given configuration must be handled together (the first two performance measurements in the denominator of the harmonic speedup formula come from workload A, while the other two come from workload B)
for each workload pair and configuration, I need to "normalize" performance values with the values from configuration base (cfg 1 in the example)
I do not really know how to express that with some R function, such as aggregate or ddply (if it is possible, at all).
Does anyone know how this can be solved?
EDIT: I was somehow afraid that using 1..8 as perf could lead to some confusion. I did that for the sake of simplicity, but the values do not need to be those ones (for instance, imagine initializing them like this: dframe$perf <- runif(8)). Both James and Zach's answers understood that part of my question wrong, so I thought it was better to clarify this in the question. Anyway, I generalized both answers to deal with the case where performance for configuration 1 is not (1, 2)
Try this:
library(plyr)
baseline <- dframe[dframe$cfg == 1,]$perf
hspeed <- function(x) length(x) / sum(baseline / x)
ddply(dframe,.(cfg),summarise,workload.pair=paste(workload,collapse="-"),
harmonic.speedup=hspeed(perf))
cfg workload.pair harmonic.speedup
1 1 a-b 1.000000
2 2 a-b 2.400000
3 3 a-b 3.750000
4 4 a-b 5.090909
For problems like this, I like to "reshape" the dataframe, using the reshape2 package, giving a column for workload a, and a column for workload b. It is then easy to compare the 2 columns using vector operations:
library(reshape2)
dframe <- dcast(dframe, cfg~workload, value.var='perf')
baseline <- dframe[dframe$cfg == 1, ]
dframe$harmonic.speedup <- 2/((baseline$a/dframe$a)+(baseline$b/dframe$b))
> dframe
cfg a b harmonic.speedup
1 1 1 2 1.000000
2 2 3 4 2.400000
3 3 5 6 3.750000
4 4 7 8 5.090909

Summary in R for frequency tables?

I have a set of user recommandations
review=matrix(c(5:1,10,2,1,1,2), nrow=5, ncol=2, dimnames=list(NULL,c("Star","Votes")))
and wanted to use summary(review) to show basic properties mean, median, quartiles and min max.
But it gives back the summary of both columns. I refrain from using data.frame because the factors 'Star' are ordered.
How can I tell R that Star is a ordered list of factors numeric score and votes are their frequency?
I'm not exactly sure what you mean by taking the mean in general if Star is supposed to be an ordered factor. However, in the example you give where Star is actually a set of numeric values, you can use the following:
library(Hmisc)
R> review=matrix(c(5:1,10,2,1,1,2), nrow=5, ncol=2, dimnames=list(NULL,c("Star","Votes")))
R> wtd.mean(review[, 1], weights = review[, 2])
[1] 4.0625
R> wtd.quantile(review[, 1], weights = review[, 2])
0% 25% 50% 75% 100%
1.00 3.75 5.00 5.00 5.00
I don't understand what's the problem. Why shouldn't you use data.frame?
rv <- data.frame(star = ordered(review[, 1]), votes = review[, 2])
You should convert your data.frame to vector:
( vts <- with(rv, rep(star, votes)) )
[1] 5 5 5 5 5 5 5 5 5 5 4 4 3 2 1 1
Levels: 1 < 2 < 3 < 4 < 5
Then do the summary... I just don't know what kind of summary, since summary will bring you back to the start. O_o
summary(vts)
1 2 3 4 5
2 1 1 2 10
EDIT (on #Prasad's suggestion)
Since vts is an ordered factor, you should convert it to numeric, hence calculate the summary (at this moment I will disregard the background statistical issues):
nvts <- as.numeric(levels(vts)[vts]) ## numeric conversion
summary(nvts) ## "ordinary" summary
fivenum(nvts) ## Tukey's five number summary
Just to clarify -- when you say you would like "mean, median, quartiles and min/max", you're talking in terms of number of stars? e.g mean = 4.062 stars?
Then using aL3xa's code, would something like summary(as.numeric(as.character(vts))) be what you want?

Resources