What would be your scholarly recommendation to model a population within R when ΔZ = 0.2·Z and Z0 = 10? The output should be similar to the following.
Or, as another example, suppose a population is described by the model N(t+1) = 1.5 · N(t) and N(5) = 7.3. Find N(t) for t = 0, 1, 2, 3, and 4.
t 0 1 2 3 4 5 6
Zt 10 12 14.4 17.28 20.736 24.8832 29.8598
Recursions of this form, i.e. Z = k·Z, are quite easy to do in a spreadsheet such as Excel. In R, however, only the following (far from efficient) attempts have been made so far:
# loop implementation in R: print Z0, then the six updates
Z <- 10; print(Z)
for (t in 1:6) { Z <- 0.2*Z + Z; print(Z) }
Z0 = 10
Z1 = 0.2*Z0 + Z0; Z2 = 0.2*Z1 + Z1; Z3 = 0.2*Z2 + Z2
Z4 = 0.2*Z3 + Z3; Z5 = 0.2*Z4 + Z4; Z6 = 0.2*Z5 + Z5
Zn = c(Z0, Z1, Z2, Z3, Z4, Z5, Z6)
Since idiomatic R tends to avoid explicit for loops and element-by-element iteration, what would be your recommendation? Could this preferably be done without iteration?
What has been done in Excel is the following (t is in column A, Nt in column B, and k = 1.5 is stored in cell C2):
t Nt
5 7.3 k=1.5
4 =B2/$C$2
3 =B3/$C$2
2 =B4/$C$2
1 =B5/$C$2
0 =B6/$C$2
It is a lot easier:
R> Z <- 10
R> Z * 1.2 ^ (0:6)
[1] 10.00000 12.00000 14.40000 17.28000 20.73600 24.88320 29.85984
R>
We set Z to ten and then multiply it by the growth rate taken to the t-th power: since each step multiplies Z by 1.2, the closed form is Zt = Z0 * 1.2^t, so no loop is needed.
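The same closed-form idea handles the second example. A minimal sketch (assuming, as stated above, N(t+1) = 1.5 * N(t) and N(5) = 7.3) that divides back from the known N(5):

# recover N(0)..N(5) from N(5) = 7.3; k = 1.5 is taken from the question
k  <- 1.5
N5 <- 7.3
Nt <- N5 / k^(5:0)   # N(t) = N(5) / k^(5 - t) for t = 0, 1, ..., 5
names(Nt) <- 0:5     # label the values by t, purely for display
Nt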
There is a nice short tutorial in the appendix of the An Introduction to R manual that came with your copy of R. I went over that a number of times when I started.
My understanding was that dplyr::ntile and statar::xtile are trying to do the same thing. But sometimes the output is different:
dplyr::ntile(1:10, 5)
# [1] 1 1 2 2 3 3 4 4 5 5
statar::xtile(1:10, 5)
# [1] 1 1 2 2 3 3 3 4 5 5
I am converting Stata code into R, so statar::xtile gives the same output as the original Stata code, but I thought dplyr::ntile would be its equivalent in R.
The Stata help says that xtile is used to:
Create variable containing quantile categories
And statar::xtile is obviously replicating this.
And dplyr::ntile is:
a rough rank, which breaks the input vector into n buckets.
Do these mean the same thing?
If so, why do they give different answers?
And if not, then:
What is the difference?
When should you use one or the other?
Thanks @alistaire for pointing out that dplyr::ntile is only doing:
function (x, n) { floor((n * (row_number(x) - 1)/length(x)) + 1) }
So not the same as splitting into quantile categories, as xtile does.
Looking at the code for statar::xtile leads to statar::pctile and the documentation for statar says that:
pctile computes quantile and weighted quantile of type 2 (similarly to Stata _pctile)
Therefore an equivalent to statar::xtile in base R is:
.bincode(1:10, quantile(1:10, seq(0, 1, length.out = 5 + 1), type = 2),
         include.lowest = TRUE)
# [1] 1 1 2 2 3 3 3 4 5 5
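Wrapping that up, a small helper (my own sketch, not part of either package) that mimics statar::xtile using type-2 quantile breaks:

# hypothetical helper mimicking statar::xtile via type-2 quantile breaks
xtile2 <- function(x, n) {
  breaks <- quantile(x, probs = seq(0, 1, length.out = n + 1), type = 2)
  .bincode(x, breaks, include.lowest = TRUE)   # assign each value to its quantile bin
}
xtile2(1:10, 5)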
I am totally new to R and I appreciate the time anyone takes to help me with these probably simple tasks. I'm just at a loss with all the resources available and am not sure where to start.
My data looks something like this:
subject sex age nR medL medR meanL meanR pL ageBin
1 0146si 1 67 26 1 1 1.882353 1.5294118 0.5517241 1
2 0162le 1 72 5 2 1 2 1.25 0.6153846 1
3 0323er 1 54 30 2.5 3 2.416667 2.5 0.4915254 0
4 0811ne 0 41 21 2 2 2 1.75 0.5333333 0
5 0825en 1 44 31 2 2 2.588235 1.8235294 0.5866667 0
Though the actual data has many, many more subjects and variables.
The first thing I need to do is compare the two 'ageBin' groups: 0 = under age 60, 1 = over age 60. I want to compare statistics between these two groups, so I guess the first thing I need is the ability to split the data by the ageBin values and use those two groups as the rows of a summary table.
Then I need to do things like calculate the frequency of the values in the two groups (i.e. how many instances of 1 and 0), the mean of the 'age' variable, the median of the age variable, the number of males (i.e. sex = 1), the mean of meanL, etc. Simple things like that. I just want them all in one table.
So an example of a potential table might be
n nMale mAge
ageBin 0 14 x x
ageBin 1 14 x x
I could easily do this stuff in SPSS or even Excel...I just really want to get started with R. So any resource or advice someone could offer to point me in the right direction would be so, so helpful. Sorry if this sounds unclear...I can try to clarify if necessary.
Thanks in advance, anyone.
Use the plyr package to split up the data structure, apply a function to each piece, and then combine all the results back together.
install.packages("plyr") # install package from CRAN
library(plyr) # load the package into R
dd <- list(subject=c("0146si", "0162le", "1323er", "0811ne", "0825en"),
sex = c(1,1,1,0,1),
age = c(67,72,54,41,44),
nR = c(26,5,30,21,31),
medL = c(1,2,2.5,2,2),
medR = c(1,1,3,2,2),
meanL = c(1.882352,2,2.416667,2,2.588235),
meanR = c(1.5294118,1.25,2.5,1.75,1.8235294),
pL = c(0.5517241,0.6153846,0.4915254,0.5333333,0.5866667),
ageBin = c(1,1,0,0,0))
dd <- data.frame(dd) # convert to data.frame
Using the ddply function, you can calculate group summaries such as the number of males and the mean age in each ageBin group:
ddply(dd, .(ageBin), summarise, nMale = sum(sex), mAge = mean(age))
ageBin nMale mAge
0 2 46.33333
1 2 69.50000
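To get closer to the table sketched in the question (group size, number of males, mean and median age, mean of meanL), the same call can be extended; the column names below are just suggestions:

ddply(dd, .(ageBin), summarise,
      n      = length(subject),   # number of subjects in the group
      nMale  = sum(sex == 1),     # number of males
      mAge   = mean(age),         # mean age
      medAge = median(age),       # median age
      mMeanL = mean(meanL))       # mean of meanL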
A very useful resource for getting up to speed with the plyr package is the short tutorial by Sean Anderson.
A more comprehensive resource by Hadley Wickham, the package author, can be found here.
Try the by function. If your data frame is named df:
by(data=df, INDICES=df$ageBin, FUN=summary)
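If you prefer a single table rather than the list-style output of by, here is a base-R sketch along the same lines (the chosen columns are only an example of what you might summarise):

# mean age, proportion male (sex is coded 0/1) and mean of meanL per ageBin group
aggregate(cbind(age, sex, meanL) ~ ageBin, data = df, FUN = mean)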
I am trying to implement the Affinity Propagation clustering algorithm in C++. As part of testing, I want to compare my results with well-established implementations of the algorithm in Matlab (Link) and in R (package apcluster). Unfortunately, the clusterings do not agree.
To be more precise, the (test) data set is:
0.9411760 0.9702140
0.9607826 0.9744693
0.9754896 0.9574479
0.9852929 0.9489372
0.9950962 0.9234050
1.0000000 0.8936175
1.0000000 0.8723408
0.9852929 0.8595747
1.0000000 0.8893622
1.0000000 0.9191497
In R I typed:
S<-negDistMat(data)
A<-apcluster(S,maxits=1000,convits=100, lam=0.9,q=0.5)
and got:
> A@idx
2 2 2 5 5 9 9 9 9 5
In Matlab I just typed:
[idx,netsim,dpsim,expref]=apcluster(S,diag(S));
From the apcluster.m file implementing apcluster (line 77):
maxits=1000; convits=100; lam=0.9; plt=0; details=0; nonoise=0;
This explains the parameters used in the R call; in Matlab these are the default values. Since I'm more comfortable with R when it comes to Affinity Propagation, I stuck with Matlab's defaults for the comparison, just to avoid messing something up unintentionally.
...but got:
>> idx'
ans =
3 3 3 3 5 9 9 9 9 5
In both cases the similarity matrices matched. What could I have missed?
Update:
I have also implemented the Matlab code proposed by Frey & Dueck in their original publication (you may notice that I omitted the noise term). Although I can replicate the indices produced by the former Matlab implementation, the availability and responsibility matrices differ in some values. The error is less than 0.01, but this is significant.
Their code is:
function [idx,A,R]=frey(S);
N=size(S,1);
A=zeros(N,N);
R=zeros(N,N);
lam=0.9;                      % Set damping factor
for iter=1:122
    % Compute responsibilities
    Rold=R;
    AS=A+S;
    [Y,I]=max(AS,[],2);
    for i=1:N
        AS(i,I(i))=-realmax;
    end;
    [Y2,I2]=max(AS,[],2);
    R=S-repmat(Y,[1,N]);
    for i=1:N
        R(i,I(i))=S(i,I(i))-Y2(i);
    end;
    R=(1-lam)*R+lam*Rold;     % Dampen responsibilities
    % Compute availabilities
    Aold=A;
    Rp=max(R,0);
    for k=1:N
        Rp(k,k)=R(k,k);
    end;
    A=repmat(sum(Rp,1),[N,1])-Rp;
    dA=diag(A);
    A=min(A,0);
    for k=1:N
        A(k,k)=dA(k);
    end;
    A=(1-lam)*A+lam*Aold;     % Dampen availabilities
end;
E=R+A;                                          % Pseudomarginals
I=find(diag(E)>0); K=length(I);                 % Indices of exemplars
[tmp c]=max(S(:,I),[],2); c(I)=1:K; idx=I(c);   % Assignments
I have tried all your code and the problem is caused by the way you supply the input preference. In the first case (R), you specify q=0.5. This means that the input preference p is set to the median of off-diagonal similarities (in your example, this is -0.05129912). If I run the Matlab code as follows (I used Octave, but Matlab should give the same result), I get:
octave:7> [idx,netsim,dpsim,expref]=apcluster(S,-0.05129912);
octave:8> idx'
ans =
2 2 2 5 5 9 9 9 9 5
This is exactly the same as the R result. If I run your Matlab code (with diag(S) being the second argument) and if I run
apcluster(S, p=diag(S))
in R (which sets the input preference to 0 for all samples in both cases), I get 10 one-sample clusters in both cases. So the two results match again, though I could not recover your Matlab result:
3 3 3 3 5 9 9 9 9 5
I hope that makes the difference clear.
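For completeness, a minimal sketch of how the q=0.5 preference can be reproduced by hand in R (this assumes S is the similarity matrix built with negDistMat as in the question):

# input preference corresponding to q = 0.5:
# the median of the off-diagonal similarities
p <- median(S[row(S) != col(S)])   # about -0.0513 for the data above
A <- apcluster(S, p = p, maxits = 1000, convits = 100, lam = 0.9)
A@idx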
Cheers, UBod
I have a data set containing the following information:
Workload name
Configuration used
Measured performance
Here is a toy data set to illustrate my problem (the performance data does not make sense at all; I just picked different integers to make the example easy to follow. In reality the data would be floating-point values coming from performance measurements):
workload cfg perf
1 a 1 1
2 b 1 2
3 a 2 3
4 b 2 4
5 a 3 5
6 b 3 6
7 a 4 7
8 b 4 8
You can generate it using:
dframe <- data.frame(workload=rep(letters[1:2], 4),
cfg=unlist(lapply(seq_len(4),
function(x) { return(c(x, x)) })),
perf=round(seq_len(8))
)
I am trying to compute the harmonic speedup for the different configurations. For that a base configuration is needed (cfg = 1 in this example). Then the harmonic speedup is computed as:
HS(cfg_i) = num_workloads / sum_{j = 1 .. num_workloads} ( perf(cfg_base, wl_j) / perf(cfg_i, wl_j) )
For instance, for configuration 2 it would be:
HS(cfg_2) = 2 / [ perf(cfg_1, wl_1) / perf(cfg_2, wl_1) +
                  perf(cfg_1, wl_2) / perf(cfg_2, wl_2) ]
I would like to compute harmonic speedup for every workload pair and configuration. By using the example data set, the result would be:
workload.pair cfg harmonic.speedup
1 a-b 1 2 / (1/1 + 2/2) = 1
2 a-b 2 2 / (1/3 + 2/4) = 2.4
3 a-b 3 2 / (1/5 + 2/6) = 3.75
4 a-b 4 2 / (1/7 + 2/8) = 5.09
I am struggling with aggregate and ddply to find a solution that does not use loops, but I have not been able to come up with a working solution. The basic problems that I am facing are:
how to handle the relationship between workloads and configurations: the results for a given workload pair (a-b) and a given configuration must be handled together (the first two performance measurements in the denominator of the harmonic speedup formula come from workload a, while the other two come from workload b)
for each workload pair and configuration, I need to "normalize" performance values with the values from configuration base (cfg 1 in the example)
I do not really know how to express that with some R function, such as aggregate or ddply (if it is possible, at all).
Does anyone know how this can be solved?
EDIT: I was somewhat afraid that using 1..8 as perf could lead to confusion. I did that for the sake of simplicity, but the values do not need to be those ones (for instance, imagine initializing them with dframe$perf <- runif(8)). Both James's and Zach's answers misunderstood that part of my question, so I thought it was better to clarify it here. Anyway, I generalized both answers to deal with the case where the performance for configuration 1 is not (1, 2).
Try this:
library(plyr)
baseline <- dframe[dframe$cfg == 1,]$perf
hspeed <- function(x) length(x) / sum(baseline / x)
ddply(dframe, .(cfg), summarise, workload.pair = paste(workload, collapse = "-"),
      harmonic.speedup = hspeed(perf))
cfg workload.pair harmonic.speedup
1 1 a-b 1.000000
2 2 a-b 2.400000
3 3 a-b 3.750000
4 4 a-b 5.090909
For problems like this, I like to "reshape" the data frame using the reshape2 package, giving a column for workload a and a column for workload b. It is then easy to compare the two columns using vector operations:
library(reshape2)
dframe <- dcast(dframe, cfg~workload, value.var='perf')
baseline <- dframe[dframe$cfg == 1, ]
dframe$harmonic.speedup <- 2/((baseline$a/dframe$a)+(baseline$b/dframe$b))
> dframe
cfg a b harmonic.speedup
1 1 1 2 1.000000
2 2 3 4 2.400000
3 3 5 6 3.750000
4 4 7 8 5.090909
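Since the question also mentions aggregate, here is a base-R sketch of the same computation; it assumes the original long-format dframe (before the dcast above) and, like the first answer, that rows within each cfg appear in the same workload order as the baseline:

baseline <- dframe$perf[dframe$cfg == 1]   # perf of workloads a and b at cfg 1
hs <- aggregate(perf ~ cfg, data = dframe,
                FUN = function(p) length(p) / sum(baseline / p))
names(hs)[2] <- "harmonic.speedup"
hs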
I have decided to learn R. I am trying to get a sense of how to write "R style" functions and to avoid looping. Here is a sample situation:
Given a vector a, I would like to compute a vector b whose elements b[i] (the vector index begins at 1) are defined as follows:
b[i] = NaN                  for 1 <= i <= 4
b[i] = mean(a[(i-4):i])     for 5 <= i <= length(a)
Essentially, if we pretend 'a' is a list of speeds where the first entry is at time = 0, the second at time = 1 second, the third at time = 2 seconds... I would like to obtain a corresponding vector describing the average speed over the past 5 seconds.
E.g.:
If a is (1,1,1,1,1,4,6,3,6,8,9) then b should be (NaN, NaN, NaN, NaN, 1, 1.6, 2.6, 3, 4, 5.4, 6.4)
I could do this using a loop, but I feel that doing so would not be in "R style".
Thank you,
Tungata
Because these rolling functions often come up with time-series data, some of the newer and richer time-series data-handling packages already do this for you:
R> library(zoo) ## load zoo
R> speed <- c(1,1,1,1,1,4,6,3,6,8,9)
R> zsp <- zoo( speed, order.by=1:length(speed) ) ## creates a zoo object
R> rollmean(zsp, 5) ## default use
3 4 5 6 7 8 9
1.0 1.6 2.6 3.0 4.0 5.4 6.4
R> rollmean(zsp, 5, na.pad=TRUE, align="right") ## with padding and aligned
1 2 3 4 5 6 7 8 9 10 11
NA NA NA NA 1.0 1.6 2.6 3.0 4.0 5.4 6.4
R>
The zoo package has excellent documentation that will show you many, many more examples, in particular how to do this with real (and possibly irregular) dates; xts extends this further, but zoo is a better starting point.
Something like b = filter(a, rep(1/5, 5), sides=1) will do the job, although you will get NA in the first few slots instead of NaN. R has a large library of built-in functions, and "R style" is to use those wherever possible. Take a look at the documentation for the filter function.
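A minimal sketch of that approach; converting the leading NA to NaN and dropping the ts attributes are my own additions, not something filter does for you:

a <- c(1, 1, 1, 1, 1, 4, 6, 3, 6, 8, 9)
b <- stats::filter(a, rep(1/5, 5), sides = 1)  # trailing 5-point moving average
b <- as.numeric(b)                             # drop the ts attributes
b[1:4] <- NaN                                  # match the requested NaN padding
b
# [1] NaN NaN NaN NaN 1.0 1.6 2.6 3.0 4.0 5.4 6.4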
You can also use a combination of cumsum and diff to get the sum over sliding windows. You'll need to pad with your own NaN, though:
> speed <- c(1,1,1,1,1,4,6,3,6,8,9)
> diff(cumsum(c(0,speed)), 5)/5
[1] 1.0 1.6 2.6 3.0 4.0 5.4 6.4
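And, for completeness, one way to add the NaN padding mentioned above (my own sketch, not part of the original answer):

speed <- c(1,1,1,1,1,4,6,3,6,8,9)
b <- c(rep(NaN, 4), diff(cumsum(c(0, speed)), lag = 5) / 5)  # pad the first four slots
b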