Vuong test has different results on R and Stata - r

I am running a zero inflated negative binomial model with probit link on R (http://www.ats.ucla.edu/stat/r/dae/zinbreg.htm) and Stata (http://www.ats.ucla.edu/stat/stata/dae/zinb.htm).
There is a Vuong test to compare whether this specification is better than an ordinary negative binomial model. Where R tells me I am better off using the latter, Stata says a ZINB is the preferable choice. In both instances I assume that the process leading to the excess zeros is the same as for the negative binomial distributed non-zero observations. Coefficients are indeed the same (except that Stata prints one digit more).
In R I run (data code is below)
library(pscl)  # zeroinfl(), vuong()
library(MASS)  # glm.nb()
ZINB <- zeroinfl(Two.Year ~ length + numAuth + numAck,
data=Master,
dist="negbin", link="probit"
)
NB <- glm.nb(Two.Year ~ length + numAuth + numAck,
data=Master
)
Comparing both with vuong(ZINB, NB) from the same package yields
Vuong Non-Nested Hypothesis Test-Statistic: -10.78337
(test-statistic is asymptotically distributed N(0,1) under the
null that the models are indistinguishible)
in this case:
model2 > model1, with p-value < 2.22e-16
Hence: NB is better than ZINB.
In Stata I run
zinb twoyear numauth length numack, inflate(numauth length numack) probit vuong
and receive (iteration fitting suppressed)
Zero-inflated negative binomial regression Number of obs = 714
Nonzero obs = 433
Zero obs = 281
Inflation model = probit LR chi2(3) = 74.19
Log likelihood = -1484.763 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
twoyear | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
twoyear |
numauth | .1463257 .0667629 2.19 0.028 .0154729 .2771785
length | .038699 .006077 6.37 0.000 .0267883 .0506097
numack | .0333765 .010802 3.09 0.002 .0122049 .0545481
_cons | -.4588568 .2068824 -2.22 0.027 -.8643389 -.0533747
-------------+----------------------------------------------------------------
inflate |
numauth | .2670777 .1141893 2.34 0.019 .0432708 .4908846
length | .0147993 .0105611 1.40 0.161 -.0059001 .0354987
numack | .0177504 .0150118 1.18 0.237 -.0116722 .0471729
_cons | -2.057536 .5499852 -3.74 0.000 -3.135487 -.9795845
-------------+----------------------------------------------------------------
/lnalpha | .0871077 .1608448 0.54 0.588 -.2281424 .4023577
-------------+----------------------------------------------------------------
alpha | 1.091014 .175484 .7960109 1.495346
------------------------------------------------------------------------------
Vuong test of zinb vs. standard negative binomial: z = 2.36 Pr>z = 0.0092
In the very last line Stata tells me that in this case ZINB is better than NB: Both test statistic and p-value differ. How come?
Data (R code)
Master <- read.table(text="
Two.Year numAuth length numAck
0 1 4 6
3 3 28 3
3 1 18 4
0 1 42 4
0 2 17 0
2 1 10 3
1 2 20 0
0 1 28 3
1 1 23 7
0 2 34 3
2 2 24 2
0 2 18 0
0 1 23 7
0 1 35 11
4 2 33 13
0 2 24 4
0 2 21 9
1 4 21 0
1 1 8 6
2 1 18 1
0 3 28 2
0 2 17 2
1 1 30 6
4 2 28 16
1 4 35 1
2 3 19 2
0 1 24 2
1 3 26 6
1 1 17 7
0 3 42 4
0 3 32 8
3 1 33 23
7 2 24 9
0 2 25 6
1 1 7 1
0 1 15 2
2 2 16 2
0 1 23 6
2 3 18 7
0 1 28 5
0 1 12 2
1 1 25 4
0 4 18 1
1 2 32 6
1 1 15 2
2 2 14 4
0 2 24 9
0 3 30 9
0 2 19 9
0 2 14 2
2 2 23 3
0 2 18 0
1 3 13 4
0 1 10 4
0 1 24 8
0 2 22 9
2 3 29 5
2 1 25 5
0 2 17 4
1 2 24 0
0 2 26 0
2 2 33 12
1 4 17 2
1 1 25 8
3 1 36 11
0 1 10 4
9 2 60 22
0 2 18 3
2 3 19 6
2 2 23 7
2 2 26 0
1 1 20 5
4 2 31 4
0 2 21 2
0 1 24 12
1 1 12 1
1 3 26 5
1 4 32 8
2 3 21 1
3 3 26 3
4 2 36 6
3 3 28 2
1 3 27 1
0 2 12 5
0 3 24 4
0 2 35 1
0 2 17 2
3 2 28 3
0 3 29 8
0 2 20 3
3 2 28 0
11 1 30 2
0 3 22 2
21 3 59 24
0 2 15 5
0 2 22 2
5 4 33 0
0 2 21 2
4 2 21 0
0 3 25 9
2 2 31 5
1 2 23 1
2 3 25 0
0 1 13 3
0 1 22 7
0 1 16 3
6 1 18 4
2 2 19 7
3 2 22 10
0 1 12 6
0 1 23 8
1 1 23 9
1 2 32 15
1 3 26 8
1 3 15 2
0 3 16 2
0 4 29 2
2 3 24 3
2 3 32 1
2 1 29 13
1 3 26 0
5 1 23 4
3 2 21 2
4 2 19 4
4 3 19 2
2 1 29 0
0 1 13 6
0 2 28 2
0 3 33 1
0 1 20 2
0 1 30 8
1 2 19 2
17 2 30 7
5 3 39 17
21 3 30 5
1 3 29 24
1 1 31 4
4 3 26 13
4 2 14 16
2 3 31 14
5 3 37 10
15 2 52 13
1 1 6 5
2 1 24 13
17 3 17 3
3 2 29 5
2 1 26 7
3 3 34 9
5 2 39 2
3 1 26 7
1 2 32 12
2 3 26 4
9 3 28 8
1 3 29 1
4 1 24 7
9 1 40 13
1 2 27 21
2 2 27 13
5 3 31 10
10 2 29 15
10 2 41 15
8 1 24 17
2 4 16 5
17 2 26 20
3 2 31 3
2 2 18 1
6 3 32 9
2 1 32 11
4 3 34 8
4 1 16 1
5 1 33 5
0 2 17 11
17 2 48 8
2 1 11 2
5 3 33 18
4 2 25 9
10 2 17 5
1 1 25 8
3 3 41 16
2 1 40 13
4 3 25 2
16 4 32 13
10 1 33 18
5 2 25 3
3 2 20 3
2 3 14 7
3 2 23 4
2 2 28 4
3 2 25 19
0 2 14 6
3 1 28 18
8 3 27 11
1 3 25 17
21 2 33 15
9 2 24 2
1 1 16 14
1 1 38 10
16 2 37 13
16 2 41 1
7 2 24 18
4 2 17 5
4 1 37 32
3 1 37 8
13 2 35 6
15 1 23 11
7 1 47 11
3 1 16 6
12 2 36 6
7 1 24 17
4 2 24 8
14 2 24 9
15 2 24 11
0 3 19 4
0 4 28 9
1 1 5 3
11 1 28 15
5 1 33 5
10 2 21 9
3 3 28 8
2 3 13 2
11 2 41 8
4 2 24 11
3 1 32 11
4 2 31 11
7 2 34 3
11 6 33 6
7 3 33 7
2 2 37 13
7 3 19 9
1 2 14 3
6 2 15 11
11 3 37 12
0 2 20 5
7 4 13 6
17 1 52 14
9 3 47 30
1 2 32 27
30 3 36 19
2 2 12 5
3 1 30 7
4 2 19 11
32 3 45 14
13 1 17 7
16 2 24 4
5 1 32 13
7 3 29 14
5 2 46 2
1 2 21 6
1 3 13 17
11 1 41 16
6 2 33 1
7 1 31 20
0 1 16 13
6 3 26 8
11 2 46 7
8 2 20 5
8 1 44 7
2 2 33 12
1 3 22 5
0 4 14 2
4 1 25 8
5 3 24 11
1 1 21 18
5 1 28 5
2 1 51 19
2 1 16 4
17 2 35 2
4 1 35 1
9 3 48 8
2 1 33 16
0 3 24 7
18 2 33 12
11 1 41 5
5 2 17 3
8 1 19 7
4 3 38 2
23 2 27 10
22 3 46 13
5 3 21 1
5 2 38 10
1 2 20 5
2 2 24 8
0 3 30 9
7 2 44 16
7 1 21 7
0 1 20 10
10 2 33 11
4 2 18 2
11 1 45 17
7 2 32 7
7 2 28 6
5 2 25 10
3 2 57 6
8 1 16 2
7 2 34 4
5 2 22 8
2 2 21 7
4 2 37 15
2 4 36 7
1 1 17 4
0 2 23 9
12 2 48 4
8 3 29 13
0 1 29 7
0 2 27 12
1 1 53 10
3 3 15 5
8 1 40 29
2 2 22 11
10 2 20 7
4 4 27 3
4 1 24 4
2 2 24 5
1 2 19 6
10 3 41 10
57 3 46 9
5 1 20 11
6 2 30 4
0 2 20 5
16 3 35 8
1 2 44 1
2 4 24 8
1 1 20 9
5 3 19 11
5 3 29 15
3 1 21 8
3 3 19 3
8 3 44 0
11 3 34 15
2 2 31 1
11 1 39 11
0 3 24 3
4 2 35 6
2 1 14 6
10 1 30 10
6 2 21 4
9 2 32 3
0 1 34 10
6 2 32 3
7 2 50 11
11 1 35 15
4 1 27 9
1 2 32 27
8 2 54 2
0 3 15 8
2 1 31 13
0 1 31 11
0 4 14 5
0 2 37 15
0 2 51 12
0 2 34 1
0 3 29 12
0 2 22 11
0 2 19 15
0 2 39 13
0 3 25 12
0 1 46 2
0 4 42 10
0 1 38 5
0 3 31 4
0 3 33 1
0 2 24 11
0 1 28 16
0 2 28 13
0 1 29 17
0 1 23 13
0 3 36 21
0 2 30 15
0 2 25 12
0 2 26 17
0 3 19 2
0 2 37 5
0 2 47 12
0 1 21 20
0 3 27 21
0 2 16 7
0 1 35 5
0 2 32 24
0 3 31 6
0 3 36 13
0 2 26 20
0 1 31 13
0 2 46 6
0 2 34 12
0 1 18 13
0 1 29 3
0 3 40 9
0 1 25 3
0 3 45 9
0 2 31 3
0 2 35 4
0 3 29 10
0 2 33 13
0 3 22 4
0 2 26 9
0 2 29 19
0 2 28 12
0 2 30 5
0 4 30 3
0 3 32 14
0 3 45 20
0 2 42 9
0 2 25 4
0 2 20 22
0 3 31 5
0 1 26 13
0 2 32 11
0 1 31 2
0 2 42 17
0 1 37 8
0 3 37 16
0 3 25 10
0 2 33 11
0 2 29 7
0 2 21 16
0 3 30 33
0 1 35 8
0 3 25 6
0 2 54 3
0 2 41 10
0 3 35 1
0 4 26 4
0 2 31 4
0 3 26 11
0 3 34 11
0 2 27 7
0 1 19 14
0 1 38 9
0 2 24 1
0 3 30 20
0 4 43 13
0 2 20 10
0 2 38 1
0 2 41 6
0 1 20 9
0 2 34 2
0 2 24 5
0 2 24 2
0 1 31 19
0 3 49 7
0 1 26 0
0 2 44 6
0 3 36 13
0 3 31 14
0 2 30 20
0 1 27 13
0 2 28 9
0 2 22 20
0 4 36 34
0 3 25 3
0 2 29 17
0 2 40 8
0 2 39 17
0 4 29 8
0 1 27 22
0 1 21 10
0 3 17 5
0 3 28 10
0 1 27 7
0 3 40 7
0 2 21 4
0 1 33 14
0 1 31 14
0 3 37 13
0 2 23 9
0 2 25 1
0 2 30 1
0 2 30 12
0 1 41 8
0 2 26 1
0 2 25 14
0 2 26 3
0 3 36 1
0 4 23 1
0 2 18 0
0 2 34 2
0 1 39 6
0 1 16 15
0 3 34 4
0 4 35 6
0 1 22 10
0 1 35 8
0 2 36 13
0 2 50 8
0 2 28 6
0 1 30 14
0 2 33 26
0 3 28 1
0 1 18 10
0 2 27 4
0 2 27 5
0 2 8 2
0 4 32 16
0 3 40 6
0 4 45 15
0 2 38 3
0 2 29 6
0 1 25 9
12 1 27 5
2 1 33 8
4 3 31 3
1 1 33 4
0 3 20 5
0 2 28 6
2 2 32 12
0 3 30 2
0 3 19 3
1 1 14 19
0 2 28 2
0 3 26 3
0 2 32 13
1 3 21 7
1 4 20 0
2 2 40 8
0 2 35 18
1 1 20 6
6 2 21 3
3 2 33 10
1 1 31 15
1 2 22 5
0 2 24 7
2 2 22 3
3 2 17 6
9 2 30 12
2 4 39 9
0 2 46 8
0 2 26 5
1 2 28 5
6 1 18 3
5 2 19 13
1 3 27 3
1 1 20 10
0 1 27 6
0 4 26 1
0 2 19 4
0 1 26 8
1 1 30 8
0 2 22 2
3 3 42 4
3 1 10 5
3 1 30 12
1 1 25 8
1 2 38 8
2 1 28 13
3 1 18 12
2 2 20 11
2 2 29 0
1 2 18 3
1 1 6 2
0 1 6 3
2 2 24 1
0 1 14 1
1 1 17 5
2 2 20 9
1 4 24 0
1 2 8 10
0 2 18 1
1 1 25 5
2 2 12 7
0 3 18 1
0 1 19 1
8 2 21 2
1 2 23 5
7 2 19 6
1 1 21 5
0 1 16 6
1 1 24 1
0 2 19 3
1 2 14 6
3 2 24 2
6 1 32 21
0 1 16 0
1 2 15 0
1 2 8 8
0 1 14 5
0 2 27 5
2 2 17 2
1 1 19 7
1 2 21 2
0 1 29 7
0 2 18 2
0 2 15 6
2 3 27 3
0 2 57 4
2 3 17 2
1 1 18 8
1 1 17 5
0 1 18 1
1 2 18 4
1 1 12 1
0 2 15 6
1 2 24 4
3 2 14 9
0 1 24 6
3 1 30 9
0 1 19 5
3 1 16 7
5 3 21 1
2 2 17 5
4 1 34 9
1 1 17 7
3 2 30 10
12 1 17 6
2 1 26 6
1 1 18 2
2 2 24 0
0 1 12 2
0 2 3 2
1 1 11 4
1 4 18 13
0 1 25 9
8 2 20 7
0 1 11 7
7 3 26 19
6 1 18 6
6 2 32 5
1 1 31 2
1 2 33 9
4 1 17 6
1 2 34 11
5 1 37 3
0 3 27 10
12 2 25 14
3 1 40 6
6 2 27 9
0 2 31 2
1 1 28 7
2 1 37 11
1 1 19 0
5 2 30 17
4 3 40 6
0 1 27 6
5 3 31 7
0 3 26 10
3 2 32 4
1 3 43 6
3 1 19 3
2 2 37 4
0 3 28 4
6 3 30 11
1 1 30 9
4 3 31 26
1 2 14 1
10 1 35 27
1 1 36 7
5 1 32 8
2 1 28 6
3 1 34 16
3 2 32 5
1 3 11 0
2 2 42 5
0 2 30 7
0 1 32 9
3 3 43 2
7 2 43 6
1 2 21 5
2 1 27 20
1 2 37 7
2 1 37 8
0 1 19 3
0 3 28 5
2 2 33 3
3 1 41 6
13 2 41 9
2 1 38 3
4 1 32 5
2 1 34 8
1 1 27 9
8 1 29 7
4 1 17 6
0 1 20 8
1 2 34 4
1 1 16 11
4 2 33 5
0 2 15 6
1 1 27 4
2 3 15 8
1 1 30 8
3 2 41 20
0 1 25 15
1 3 35 24
4 2 30 21
6 2 30 6
16 2 33 21
2 3 37 3
2 2 30 12
4 1 57 11
0 2 18 16
4 4 20 13
3 1 43 10
3 1 25 15
7 2 31 11
2 1 31 3
5 2 40 11
3 2 28 7
4 2 27 10
0 1 26 6
4 2 24 14
4 2 23 8
0 2 25 11
21 2 33 12
1 3 37 0
3 2 28 7
4 2 27 10
1 2 41 15
2 2 30 16
2 2 28 7
6 1 19 8
4 4 22 19
0 2 38 33
1 1 29 11
1 2 27 2
4 2 24 6
2 1 22 5
",header=TRUE,sep="")

The above problem occurred with pscl version 1.4.6. I have since spoken to the author, and he fixed the bug in version 1.4.7. The current version as of February 2015 is 1.4.8.
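The sign convention is easy to misread, so here is a minimal sketch of the raw (uncorrected) Vuong statistic computed from pointwise log-likelihoods. The function name vuong_stat and the simulated data are illustrative assumptions, not code from pscl or Stata. A positive value favors the first model and a negative value favors the second, which is how the R output above reads as "model2 > model1".

```r
# Raw Vuong statistic: sqrt(n) * mean(m) / sd(m), where m is the
# pointwise log-likelihood difference between model 1 and model 2.
# vuong_stat is a hypothetical helper for illustration only.
vuong_stat <- function(ll1, ll2) {
  m <- ll1 - ll2
  sqrt(length(m)) * mean(m) / sd(m)
}

# Simulated overdispersed counts (assumption: true size = 1, mean = 3).
set.seed(42)
y <- rnbinom(500, mu = 3, size = 1)

# Pointwise log-likelihoods under a Poisson and a negative binomial model.
# For simplicity the true size is plugged in rather than estimated.
ll_pois <- dpois(y, lambda = mean(y), log = TRUE)
ll_nb   <- dnbinom(y, mu = mean(y), size = 1, log = TRUE)

z <- vuong_stat(ll_nb, ll_pois)            # NB as model 1
z_swapped <- vuong_stat(ll_pois, ll_nb)    # same test, models swapped
```

Swapping the argument order only flips the sign of the statistic, so it is worth checking which model each program treats as "model 1" before comparing outputs.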


Sum in R based on a date range and another condition?

I am working on a dataframe of baseball data called mlb_team_logs. A random sample lies below.
Date Team season AB PA H X1B X2B X3B HR R RBI BB IBB SO HBP SF SH GDP
1 2015-04-06 ARI 2015 34 39 9 7 1 1 0 4 4 3 0 6 2 0 0 2
2 2015-04-07 ARI 2015 31 36 8 4 1 1 2 7 7 5 0 7 0 0 0 1
3 2015-04-08 ARI 2015 32 35 5 3 2 0 0 2 1 2 0 7 1 0 0 0
4 2015-04-10 ARI 2015 35 38 7 6 0 0 1 4 4 3 0 10 0 0 0 0
5 2015-04-11 ARI 2015 32 35 10 9 0 0 1 6 6 3 0 7 0 0 0 1
6 2015-04-12 ARI 2015 36 38 10 7 3 0 0 4 4 1 0 11 0 0 1 1
7 2015-04-13 ARI 2015 39 44 12 8 3 1 0 8 7 4 0 11 0 0 1 0
8 2015-04-14 ARI 2015 28 32 3 1 2 0 0 1 1 3 0 4 1 0 0 2
9 2015-04-15 ARI 2015 33 34 9 7 1 0 1 2 2 1 0 8 0 0 0 1
10 2015-04-16 ARI 2015 47 51 11 6 2 0 3 7 7 3 1 8 1 0 0 0
240 2015-07-03 ATL 2015 30 32 7 4 1 0 2 2 2 2 0 6 0 0 0 1
241 2015-07-04 ATL 2015 34 40 10 6 3 0 1 9 9 5 0 5 0 0 1 0
242 2015-07-05 ATL 2015 35 37 7 6 1 0 0 0 0 1 0 10 1 0 0 1
243 2015-07-06 ATL 2015 40 44 15 10 4 0 1 5 5 3 0 7 0 0 1 1
244 2015-07-07 ATL 2015 34 37 10 7 1 1 1 4 4 2 0 4 0 0 1 1
245 2015-07-08 ATL 2015 31 38 7 4 1 0 2 5 5 5 1 7 0 0 2 1
246 2015-07-09 ATL 2015 34 37 10 8 2 0 0 3 3 1 0 9 0 1 1 2
247 2015-07-10 ATL 2015 32 35 8 7 0 0 1 3 3 2 0 5 1 0 0 2
248 2015-07-11 ATL 2015 33 38 6 3 1 0 2 2 2 5 1 8 0 0 0 0
249 2015-07-12 ATL 2015 34 41 8 6 2 0 0 3 3 7 1 10 0 0 0 1
250 2015-07-17 ATL 2015 30 36 7 4 3 0 0 4 4 5 1 7 0 0 0 0
The data frame has 43 columns in total. My objective is to sum columns 4 (AB) through 43, subject to two criteria:
the team
the date is within 7 days of the entry in "Date" (ie Date - 7 to Date - 1)
Eventually, I would like these columns to be appended to mlb_team_logs as l7_AB, l7_PA, etc (but I know how to do that if the output will be a new dataframe). Any help is appreciated!
EDIT I altered the sample to allow for more easily tested results
You might be able to use a data.table non-equi join here. The idea would be to create a lower date bound (below, I've named this date_lb), and then join the table on itself, matching on Team = Team, Date < Date, and Date >= date_lb. Then use lapply with .SDcols to sum the columns of interest.
Load the library and convert your data frame to a data.table
library(data.table)
setDT(mlb_team_logs)
Identify the columns you want to sum, in a character vector (change to 4:43 in your full dataset)
sum_cols = names(mlb_team_logs)[4:19]
Add a lower bound on date
mlb_team_logs[, date_lb := Date - 7]
Join the table on itself, and use lapply(.SD, sum) on the columns of interest
result = mlb_team_logs[mlb_team_logs[, .(Team, Date, date_lb)], on=.(Team, Date<Date, Date>=date_lb)][
, lapply(.SD, sum), by=.(Date,Team), .SDcols = sum_cols]
Set the new names (in place, using setnames())
setnames(result, old=sum_cols, new=paste0("I7_",sum_cols))
Output:
Date Team I7_AB I7_PA I7_H I7_X1B I7_X2B I7_X3B I7_HR I7_R I7_RBI I7_BB I7_IBB I7_SO I7_HBP I7_SF I7_SH I7_GDP
<IDat> <char> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1: 2015-04-06 ARI NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2: 2015-04-07 ARI 34 39 9 7 1 1 0 4 4 3 0 6 2 0 0 2
3: 2015-04-08 ARI 65 75 17 11 2 2 2 11 11 8 0 13 2 0 0 3
4: 2015-04-10 ARI 97 110 22 14 4 2 2 13 12 10 0 20 3 0 0 3
5: 2015-04-11 ARI 132 148 29 20 4 2 3 17 16 13 0 30 3 0 0 3
6: 2015-04-12 ARI 164 183 39 29 4 2 4 23 22 16 0 37 3 0 0 4
7: 2015-04-13 ARI 200 221 49 36 7 2 4 27 26 17 0 48 3 0 1 5
8: 2015-04-14 ARI 205 226 52 37 9 2 4 31 29 18 0 53 1 0 2 3
9: 2015-04-15 ARI 202 222 47 34 10 1 2 25 23 16 0 50 2 0 2 4
10: 2015-04-16 ARI 203 221 51 38 9 1 3 25 24 15 0 51 1 0 2 5
11: 2015-07-03 ATL NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
12: 2015-07-04 ATL 30 32 7 4 1 0 2 2 2 2 0 6 0 0 0 1
13: 2015-07-05 ATL 64 72 17 10 4 0 3 11 11 7 0 11 0 0 1 1
14: 2015-07-06 ATL 99 109 24 16 5 0 3 11 11 8 0 21 1 0 1 2
15: 2015-07-07 ATL 139 153 39 26 9 0 4 16 16 11 0 28 1 0 2 3
16: 2015-07-08 ATL 173 190 49 33 10 1 5 20 20 13 0 32 1 0 3 4
17: 2015-07-09 ATL 204 228 56 37 11 1 7 25 25 18 1 39 1 0 5 5
18: 2015-07-10 ATL 238 265 66 45 13 1 7 28 28 19 1 48 1 1 6 7
19: 2015-07-11 ATL 240 268 67 48 12 1 6 29 29 19 1 47 2 1 6 8
20: 2015-07-12 ATL 239 266 63 45 10 1 7 22 22 19 2 50 2 1 5 8
21: 2015-07-17 ATL 99 114 22 16 3 0 3 8 8 14 2 23 1 0 0 3
Date Team I7_AB I7_PA I7_H I7_X1B I7_X2B I7_X3B I7_HR I7_R I7_RBI I7_BB I7_IBB I7_SO I7_HBP I7_SF I7_SH I7_GDP
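As a cross-check on the join above, the same trailing-window sum can be sketched in base R on a tiny toy table. The helper name l7_sum and the five-row data frame are assumptions made up for this illustration. It loops over rows, so it is only meant for verifying results on small samples, not for the full dataset.

```r
# Toy data: one team, one stat column (values taken from the ARI sample rows).
logs <- data.frame(
  Date = as.Date("2015-04-06") + c(0, 1, 2, 4, 9),
  Team = "ARI",
  AB   = c(34, 31, 32, 35, 47)
)

# l7_sum is a hypothetical helper: for each row, sum `col` over rows of the
# same team whose date falls in [Date - 7, Date - 1]; NA if there are none.
l7_sum <- function(d, col) {
  vapply(seq_len(nrow(d)), function(i) {
    in_window <- d$Team == d$Team[i] &
      d$Date >= d$Date[i] - 7 &
      d$Date <  d$Date[i]
    if (any(in_window)) sum(d[[col]][in_window]) else NA_real_
  }, numeric(1))
}

logs$l7_AB <- l7_sum(logs, "AB")
```

The first row has no prior games and gets NA, matching the NA rows in the data.table output.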

Calculate difference between column values in R [duplicate]

How do I generate a variable called "PV" that holds the difference between the DX value and each DR column value?
For example:
PV of DR0 will be 25-19=6 for 03/01/2021 for Room Super
PV of DR1 will be 25-19=6 for 03/01/2021 for Room Super
PV of DR2 will be 25-23=2 for 03/01/2021 for Room Super
And so on.
df <- structure(
list(date = c("01-01-2021","01-01-2021","01-01-2021","01-01-2021","01-01-2021","02-01-2021","02-01-2021",
"02-01-2021","02-01-2021","02-01-2021","03-01-2021","03-01-2021","03-01-2021","03-01-2021","03-01-2021","04-01-2021",
"04-01-2021","04-01-2021","04-01-2021","04-01-2021"),Room = c("Standard","Master","Luxury","Super","Deluxe","Standard","Master",
"Luxury","Super","Deluxe","Standard","Master","Luxury","Super","Deluxe","Standard","Master","Luxury","Super","Deluxe"),
DX=c(5,8,13,5,3,4,6,18,14,5,9,9,25,25,10,10,9,24,25,12),DR0=c(5,8,13,6,4,4,6,19,14,7,9,7,25,19,12,10,10,30,27,12),
DR1=c(5,8,12,6,3,4,6,19,15,8,9,7,27,19,12,10,8,29,23,12),DR2=c(5,8,12,6,3,4,6,18,15,7,9,7,27,23,16,10,8,31,23,12),
DR3=c(5,8,12,6,3,4,6,18,16,7,9,7,24,19,11,10,8,31,29,14),DR4=c(5,8,12,4,3,4,6,18,16,7,9,7,24,18,11,10,8,28,25,10),
DR5=c(5,8,12,4,3,4,6,18,15,7,9,7,24,18,11,10,8,28,23,10)),class = "data.frame", row.names = c(NA, -20L))
> df
date Room DX DR0 DR1 DR2 DR3 DR4 DR5
1 01-01-2021 Standard 5 5 5 5 5 5 5
2 01-01-2021 Master 8 8 8 8 8 8 8
3 01-01-2021 Luxury 13 13 12 12 12 12 12
4 01-01-2021 Super 5 6 6 6 6 4 4
5 01-01-2021 Deluxe 3 4 3 3 3 3 3
6 02-01-2021 Standard 4 4 4 4 4 4 4
7 02-01-2021 Master 6 6 6 6 6 6 6
8 02-01-2021 Luxury 18 19 19 18 18 18 18
9 02-01-2021 Super 14 14 15 15 16 16 15
10 02-01-2021 Deluxe 5 7 8 7 7 7 7
11 03-01-2021 Standard 9 9 9 9 9 9 9
12 03-01-2021 Master 9 7 7 7 7 7 7
13 03-01-2021 Luxury 25 25 27 27 24 24 24
14 03-01-2021 Super 25 19 19 23 19 18 18
15 03-01-2021 Deluxe 10 12 12 16 11 11 11
16 04-01-2021 Standard 10 10 10 10 10 10 10
17 04-01-2021 Master 9 10 8 8 8 8 8
18 04-01-2021 Luxury 24 30 29 31 31 28 28
19 04-01-2021 Super 25 27 23 23 29 25 23
20 04-01-2021 Deluxe 12 12 12 12 14 10 10
base R
tmp <- subset(df, select = DR0:DR5)
cbind(df, setNames(df$DX - tmp, paste0(names(tmp), "_PV")))
# date Room DX DR0 DR1 DR2 DR3 DR4 DR5 DR0_PV DR1_PV DR2_PV DR3_PV DR4_PV DR5_PV
# 1 01-01-2021 Standard 5 5 5 5 5 5 5 0 0 0 0 0 0
# 2 01-01-2021 Master 8 8 8 8 8 8 8 0 0 0 0 0 0
# 3 01-01-2021 Luxury 13 13 12 12 12 12 12 0 1 1 1 1 1
# 4 01-01-2021 Super 5 6 6 6 6 4 4 -1 -1 -1 -1 1 1
# 5 01-01-2021 Deluxe 3 4 3 3 3 3 3 -1 0 0 0 0 0
# 6 02-01-2021 Standard 4 4 4 4 4 4 4 0 0 0 0 0 0
# 7 02-01-2021 Master 6 6 6 6 6 6 6 0 0 0 0 0 0
# 8 02-01-2021 Luxury 18 19 19 18 18 18 18 -1 -1 0 0 0 0
# 9 02-01-2021 Super 14 14 15 15 16 16 15 0 -1 -1 -2 -2 -1
# 10 02-01-2021 Deluxe 5 7 8 7 7 7 7 -2 -3 -2 -2 -2 -2
# 11 03-01-2021 Standard 9 9 9 9 9 9 9 0 0 0 0 0 0
# 12 03-01-2021 Master 9 7 7 7 7 7 7 2 2 2 2 2 2
# 13 03-01-2021 Luxury 25 25 27 27 24 24 24 0 -2 -2 1 1 1
# 14 03-01-2021 Super 25 19 19 23 19 18 18 6 6 2 6 7 7
# 15 03-01-2021 Deluxe 10 12 12 16 11 11 11 -2 -2 -6 -1 -1 -1
# 16 04-01-2021 Standard 10 10 10 10 10 10 10 0 0 0 0 0 0
# 17 04-01-2021 Master 9 10 8 8 8 8 8 -1 1 1 1 1 1
# 18 04-01-2021 Luxury 24 30 29 31 31 28 28 -6 -5 -7 -7 -4 -4
# 19 04-01-2021 Super 25 27 23 23 29 25 23 -2 2 2 -4 0 2
# 20 04-01-2021 Deluxe 12 12 12 12 14 10 10 0 0 0 -2 2 2
dplyr
library(dplyr)
df %>%
mutate(across(DR0:DR5, list(PV = ~ DX - .)))
# date Room DX DR0 DR1 DR2 DR3 DR4 DR5 DR0_PV DR1_PV DR2_PV DR3_PV DR4_PV DR5_PV
# 1 01-01-2021 Standard 5 5 5 5 5 5 5 0 0 0 0 0 0
# 2 01-01-2021 Master 8 8 8 8 8 8 8 0 0 0 0 0 0
# 3 01-01-2021 Luxury 13 13 12 12 12 12 12 0 1 1 1 1 1
# 4 01-01-2021 Super 5 6 6 6 6 4 4 -1 -1 -1 -1 1 1
# 5 01-01-2021 Deluxe 3 4 3 3 3 3 3 -1 0 0 0 0 0
# 6 02-01-2021 Standard 4 4 4 4 4 4 4 0 0 0 0 0 0
# 7 02-01-2021 Master 6 6 6 6 6 6 6 0 0 0 0 0 0
# 8 02-01-2021 Luxury 18 19 19 18 18 18 18 -1 -1 0 0 0 0
# 9 02-01-2021 Super 14 14 15 15 16 16 15 0 -1 -1 -2 -2 -1
# 10 02-01-2021 Deluxe 5 7 8 7 7 7 7 -2 -3 -2 -2 -2 -2
# 11 03-01-2021 Standard 9 9 9 9 9 9 9 0 0 0 0 0 0
# 12 03-01-2021 Master 9 7 7 7 7 7 7 2 2 2 2 2 2
# 13 03-01-2021 Luxury 25 25 27 27 24 24 24 0 -2 -2 1 1 1
# 14 03-01-2021 Super 25 19 19 23 19 18 18 6 6 2 6 7 7
# 15 03-01-2021 Deluxe 10 12 12 16 11 11 11 -2 -2 -6 -1 -1 -1
# 16 04-01-2021 Standard 10 10 10 10 10 10 10 0 0 0 0 0 0
# 17 04-01-2021 Master 9 10 8 8 8 8 8 -1 1 1 1 1 1
# 18 04-01-2021 Luxury 24 30 29 31 31 28 28 -6 -5 -7 -7 -4 -4
# 19 04-01-2021 Super 25 27 23 23 29 25 23 -2 2 2 -4 0 2
# 20 04-01-2021 Deluxe 12 12 12 12 14 10 10 0 0 0 -2 2 2
We can use tidyverse
library(dplyr)
library(stringr)
df %>%
mutate(across(c(DR0:DR5), ~ DX - .,
.names = '{str_replace(.col, "DR", "PV")}'))
-output
date Room DX DR0 DR1 DR2 DR3 DR4 DR5 PV0 PV1 PV2 PV3 PV4 PV5
1 01-01-2021 Standard 5 5 5 5 5 5 5 0 0 0 0 0 0
2 01-01-2021 Master 8 8 8 8 8 8 8 0 0 0 0 0 0
3 01-01-2021 Luxury 13 13 12 12 12 12 12 0 1 1 1 1 1
4 01-01-2021 Super 5 6 6 6 6 4 4 -1 -1 -1 -1 1 1
5 01-01-2021 Deluxe 3 4 3 3 3 3 3 -1 0 0 0 0 0
6 02-01-2021 Standard 4 4 4 4 4 4 4 0 0 0 0 0 0
7 02-01-2021 Master 6 6 6 6 6 6 6 0 0 0 0 0 0
8 02-01-2021 Luxury 18 19 19 18 18 18 18 -1 -1 0 0 0 0
9 02-01-2021 Super 14 14 15 15 16 16 15 0 -1 -1 -2 -2 -1
10 02-01-2021 Deluxe 5 7 8 7 7 7 7 -2 -3 -2 -2 -2 -2
11 03-01-2021 Standard 9 9 9 9 9 9 9 0 0 0 0 0 0
12 03-01-2021 Master 9 7 7 7 7 7 7 2 2 2 2 2 2
13 03-01-2021 Luxury 25 25 27 27 24 24 24 0 -2 -2 1 1 1
14 03-01-2021 Super 25 19 19 23 19 18 18 6 6 2 6 7 7
15 03-01-2021 Deluxe 10 12 12 16 11 11 11 -2 -2 -6 -1 -1 -1
16 04-01-2021 Standard 10 10 10 10 10 10 10 0 0 0 0 0 0
17 04-01-2021 Master 9 10 8 8 8 8 8 -1 1 1 1 1 1
18 04-01-2021 Luxury 24 30 29 31 31 28 28 -6 -5 -7 -7 -4 -4
19 04-01-2021 Super 25 27 23 23 29 25 23 -2 2 2 -4 0 2
20 04-01-2021 Deluxe 12 12 12 12 14 10 10 0 0 0 -2 2 2
Or in base R
nm1 <- paste0("DR", 0:5)
df[paste0("PV", 0:5)] <- df$DX[row(df[nm1])] - df[nm1]

I have a weights variable and I need to create cross tabulations for a chord diagram

I have a dataset with over 15,000 observations. I've dropped all variables but three.
One is the individual's origin (or), the second is the individual's destination (dest), and the third is that individual's weight (wgt).
Origin and destination are categorical variables.
The weights I have are used as analytic weights in Stata. However, Stata can't handle the number of columns I generate when making tables. R generates them with ease. However, I can't figure out how to apply weights into the generated table.
I tried using wtd.table(), but the following error appears.
wtd.table(NonHSGrad$b206reg, NonHSGrad$c305reg, weights=NonHSGrad$ind_wgts)
Error in proxy[, ..., drop = FALSE] : incorrect number of dimensions
When I use only the table(), this comes out:
table(NonHSGrad$b206reg, NonHSGrad$c305reg)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
1 285 38 20 8 6 3 1 2 0 1 0 10 38 46 0 2 14
2 32 312 26 3 1 0 2 1 1 0 1 1 22 51 0 0 8
3 17 35 325 12 12 2 3 7 0 2 3 5 52 13 1 1 25
4 3 5 27 224 19 5 2 10 1 1 1 2 51 4 0 3 35
5 4 9 44 81 778 6 7 22 1 4 5 5 155 5 0 5 47
6 4 5 22 21 10 547 24 12 32 21 32 81 86 5 3 15 58
7 5 4 12 17 20 21 558 20 31 99 93 33 59 1 3 67 15
8 8 9 41 49 17 11 24 919 5 8 37 10 151 2 0 52 19
9 0 1 7 9 1 4 26 5 466 66 19 17 17 2 24 24 7
10 1 2 3 4 2 3 27 8 41 528 21 17 13 2 11 36 2
11 3 0 3 10 1 5 11 5 6 17 519 59 7 1 2 49 1
12 0 1 1 2 0 1 5 2 2 10 39 318 10 0 14 17 1
13 15 9 26 34 25 21 12 42 2 5 3 5 187 2 1 6 15
14 14 47 7 5 0 0 0 1 1 0 0 0 9 475 0 0 0
15 0 0 3 1 2 2 4 2 22 9 3 60 9 2 342 2 3
16 0 2 6 10 3 2 11 21 3 33 29 4 34 0 3 404 5
17 1 1 7 15 2 6 1 2 0 1 1 0 34 0 0 2 463
99 0 0 1 1 0 0 0 1 0 1 0 0 0 1 2 0 1
I am also going to use the table for a chord diagram to show flows.
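One base R option that might help here is xtabs() with the weight on the left-hand side of the formula, which sums the weights within each origin/destination cell instead of counting rows. The sketch below uses a made-up five-row data frame; the names flows, origin, dest and wgt are placeholders, not the columns of NonHSGrad.

```r
# Toy flows: origin and destination as factors, plus a weight per person.
flows <- data.frame(
  origin = factor(c(1, 1, 2, 2, 3), levels = 1:3),
  dest   = factor(c(2, 2, 1, 3, 3), levels = 1:3),
  wgt    = c(1.5, 2.0, 0.5, 1.0, 3.0)
)

# Weighted cross tabulation: each cell holds the sum of wgt, not a row count.
wtab <- xtabs(wgt ~ origin + dest, data = flows)

# Plain numeric matrix with the same dimnames, the usual input format
# for plotting functions that expect a flow matrix.
flow_matrix <- unclass(wtab)
attr(flow_matrix, "call") <- NULL
```

Declaring both variables as factors with the same levels keeps the table square even when some region never appears as an origin or a destination.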

How to create a matrix in simple correspondence analysis?

I am trying to create a matrix in order to apply a simple correspondence analysis on it; I have 2 categorical variables: exp and conexinternet with 3 levels each.
obs conexinternet exp
1 1 2
2 1 1
3 2 2
4 1 1
5 1 1
6 2 1
7 1 2
8 1 2
9 1 2
10 2 1
11 1 1
12 2 1
13 2 2
14 2 1
15 1 1
16 2 2
17 1 1
18 2 2
19 2 2
20 2 2
21 2 2
22 1 1
23 2 3
24 1 1
25 2 1
26 2 1
27 1 1
28 2 2
29 2 1
30 1 2
31 1 2
32 2 3
33 2 1
34 2 1
35 2 1
36 3 2
37 2 1
38 3 2
39 2 3
40 2 3
41 2 2
42 2 3
43 2 2
44 2 2
45 2 1
46 2 2
47 2 3
48 1 3
49 2 3
50 3 2
51 2 2
52 2 2
53 2 1
54 1 2
55 1 1
56 2 3
57 3 2
58 3 1
59 3 1
60 1 2
61 2 3
62 2 2
63 3 1
64 3 2
65 3 2
66 1 2
67 3 2
68 3 2
69 3 3
70 2 1
71 3 3
72 3 2
73 3 2
74 3 2
75 3 1
76 3 2
77 3 1
I want to make a vector that categorizes the observations as 11, 12, 13, 21, 22, 23, 31, 32, 33. How can I do it?
Is this what you want?
d <- read.table(text="obs conexinternet exp
1 1 2
...
77 3 1", header=TRUE)
(tab <- xtabs(~conexinternet+exp, d))
# exp
# conexinternet 1 2 3
# 1 10 9 1
# 2 14 15 9
# 3 5 12 2
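If a per-observation label (11, 12, ..., 33) is wanted in addition to the contingency table, pasting the two codes together is one simple way to get it. This is a minimal sketch on a made-up five-row data frame; the column name cell is an assumption chosen for illustration.

```r
# Toy data with the same two variables as in the question.
d_toy <- data.frame(
  conexinternet = c(1, 1, 2, 3, 2),
  exp           = c(2, 1, 2, 2, 3)
)

# Combine the two codes into one label per observation, e.g. 1 and 2 -> "12".
d_toy$cell <- paste0(d_toy$conexinternet, d_toy$exp)

# Frequency of each combined category.
cell_counts <- table(d_toy$cell)
```

The labels are character strings; wrap them in factor() with all nine levels if empty combinations should also show up in the counts.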

merge data tables in R

My apologies for this simple question. Basically, I want to make three separate cumsum() tables and merge them together by the first table. For example:
a <- cumsum(table(df$variable))
b <- cumsum(table(df$variable[c(TRUE, FALSE)]))
c <- cumsum(table(df$variable[c(FALSE, TRUE)]))
Where a is the cumsum of the entire vector of df$variable, b is the cumsum of the odd-numbered values of df$variable, c is the cumsum of the even-numbered values of df$variable. Another way of interpreting this is that combining b and c produces a.
This is the entire vector of numbers.
[1] 18 17 15 10 5 0 10 10 0 10 15 5 5 5 25 15 13 0 0 0 25 18 15 15 1 4 5
[28] 5 5 15 5 12 15 0 3 12 20 0 5 5 13 10 10 10 3 15 13 20 12 60 10 10 2 0
[55] 5 10 8 4 0 15 5 5 15 5 0 5 2 8 5 5 5 5 9 9 3 7 20 25 5 4 10
[82] 10 2 4 5 5 18 8 0 10 5 5 7 12 5 13 26 20 13 21 5 15 10 10 5 15 5 15
[109] 0 1 13 21 25 25 5 14 5 15 10 0 5 15 3 4 5 15 15 5 25 25 5 15 0 2 13
[136] 22 2 10 3 3 15 11 0 2 40 35 24 24 5 5 10 5 16 0 17 19 20 5 5 5 0 15
[163] 3 13 20 4 5 5 3 19 25 25 0 15 5 3 22 22 25 5 15 15 5 15 17 9 5 5 15
[190] 10
For a, I used cbind(cumsum(table(df$variable)))
0 18
1 20
2 26
3 35
4 41
5 88
7 90
8 93
9 96
10 115
11 116
12 120
13 128
14 129
15 154
16 155
17 158
18 161
19 163
20 169
21 171
22 174
24 176
25 186
26 187
35 188
40 189
60 190
For b, I used cbind(cumsum(table(df$variable[c(TRUE, FALSE)])))
0 10
1 11
2 15
3 22
5 50
7 51
8 52
9 53
10 60
12 61
13 67
15 76
16 77
17 79
18 81
20 85
22 86
24 87
25 93
26 94
40 95
For c, I used cbind(cumsum(table(df$variable[c(FALSE, TRUE)])))
0 8
1 9
2 11
3 13
4 19
5 38
7 39
8 41
9 43
10 55
11 56
12 59
13 61
14 62
15 78
17 79
18 80
19 82
20 84
21 86
22 88
24 89
25 93
35 94
60 95
In frequency form, the distributions should look something like this.
a b c
0 18 10 8
1 2 1 1
2 6 4 2
3 9 7 2
4 6 0 6
5 47 28 19
7 2 1 1
8 3 1 2
9 3 1 2
10 19 7 12
11 1 0 1
12 4 1 3
13 8 6 2
14 1 0 1
15 25 9 16
16 1 1 0
17 3 2 1
18 3 2 1
19 2 0 2
20 6 4 2
21 2 0 2
22 3 1 2
24 2 1 1
25 10 6 4
26 1 1 0
35 1 0 1
40 1 1 0
60 1 0 1
190 95 95
But I want it in cumsum() form, such that it should look something like this. I wrote out the first few rows as illustration.
a b c
0 18 10 8
1 20 11 9
2 26 15 11
3 35 22 13
4 41 22 19
5 88 50 38
7 90 51 39
The problem I've been having is that the subsets b and c don't contain all the values (i.e. some values have 0 frequency), which shortens the resulting vectors; as a result, I'm unable to properly merge or cbind() these values.
Any suggestion is greatly appreciated.
You could probably get there using match quite easily. Assuming your data is:
set.seed(1)
df <- data.frame(variable=rbinom(10,prob=0.5,size=3))
Something like this seems to work
out <- data.frame(a,b=b[match(names(a),names(b))],c=c[match(names(a),names(c))])
replace(out,is.na(out),0)
# a b c
#0 1 0 1
#1 4 2 2
#2 7 4 3
#3 10 5 5
