Calculate difference between column values in R [duplicate] - r
How do I generate a variable called "PV" that holds the difference between the DX value and each of the DR column values?
For example:
PV of DR0 will be 25-19=6 for 03-01-2021 for Room Super
PV of DR1 will be 25-19=6 for 03-01-2021 for Room Super
PV of DR2 will be 25-23=2 for 03-01-2021 for Room Super
And so on.
df <- structure(
list(date = c("01-01-2021","01-01-2021","01-01-2021","01-01-2021","01-01-2021","02-01-2021","02-01-2021",
"02-01-2021","02-01-2021","02-01-2021","03-01-2021","03-01-2021","03-01-2021","03-01-2021","03-01-2021","04-01-2021",
"04-01-2021","04-01-2021","04-01-2021","04-01-2021"),Room = c("Standard","Master","Luxury","Super","Deluxe","Standard","Master",
"Luxury","Super","Deluxe","Standard","Master","Luxury","Super","Deluxe","Standard","Master","Luxury","Super","Deluxe"),
DX=c(5,8,13,5,3,4,6,18,14,5,9,9,25,25,10,10,9,24,25,12),DR0=c(5,8,13,6,4,4,6,19,14,7,9,7,25,19,12,10,10,30,27,12),
DR1=c(5,8,12,6,3,4,6,19,15,8,9,7,27,19,12,10,8,29,23,12),DR2=c(5,8,12,6,3,4,6,18,15,7,9,7,27,23,16,10,8,31,23,12),
DR3=c(5,8,12,6,3,4,6,18,16,7,9,7,24,19,11,10,8,31,29,14),DR4=c(5,8,12,4,3,4,6,18,16,7,9,7,24,18,11,10,8,28,25,10),
DR5=c(5,8,12,4,3,4,6,18,15,7,9,7,24,18,11,10,8,28,23,10)),class = "data.frame", row.names = c(NA, -20L))
> df
date Room DX DR0 DR1 DR2 DR3 DR4 DR5
1 01-01-2021 Standard 5 5 5 5 5 5 5
2 01-01-2021 Master 8 8 8 8 8 8 8
3 01-01-2021 Luxury 13 13 12 12 12 12 12
4 01-01-2021 Super 5 6 6 6 6 4 4
5 01-01-2021 Deluxe 3 4 3 3 3 3 3
6 02-01-2021 Standard 4 4 4 4 4 4 4
7 02-01-2021 Master 6 6 6 6 6 6 6
8 02-01-2021 Luxury 18 19 19 18 18 18 18
9 02-01-2021 Super 14 14 15 15 16 16 15
10 02-01-2021 Deluxe 5 7 8 7 7 7 7
11 03-01-2021 Standard 9 9 9 9 9 9 9
12 03-01-2021 Master 9 7 7 7 7 7 7
13 03-01-2021 Luxury 25 25 27 27 24 24 24
14 03-01-2021 Super 25 19 19 23 19 18 18
15 03-01-2021 Deluxe 10 12 12 16 11 11 11
16 04-01-2021 Standard 10 10 10 10 10 10 10
17 04-01-2021 Master 9 10 8 8 8 8 8
18 04-01-2021 Luxury 24 30 29 31 31 28 28
19 04-01-2021 Super 25 27 23 23 29 25 23
20 04-01-2021 Deluxe 12 12 12 12 14 10 10
base R
tmp <- subset(df, select = DR0:DR5)
cbind(df, setNames(df$DX - tmp, paste0(names(tmp), "_PV")))
# date Room DX DR0 DR1 DR2 DR3 DR4 DR5 DR0_PV DR1_PV DR2_PV DR3_PV DR4_PV DR5_PV
# 1 01-01-2021 Standard 5 5 5 5 5 5 5 0 0 0 0 0 0
# 2 01-01-2021 Master 8 8 8 8 8 8 8 0 0 0 0 0 0
# 3 01-01-2021 Luxury 13 13 12 12 12 12 12 0 1 1 1 1 1
# 4 01-01-2021 Super 5 6 6 6 6 4 4 -1 -1 -1 -1 1 1
# 5 01-01-2021 Deluxe 3 4 3 3 3 3 3 -1 0 0 0 0 0
# 6 02-01-2021 Standard 4 4 4 4 4 4 4 0 0 0 0 0 0
# 7 02-01-2021 Master 6 6 6 6 6 6 6 0 0 0 0 0 0
# 8 02-01-2021 Luxury 18 19 19 18 18 18 18 -1 -1 0 0 0 0
# 9 02-01-2021 Super 14 14 15 15 16 16 15 0 -1 -1 -2 -2 -1
# 10 02-01-2021 Deluxe 5 7 8 7 7 7 7 -2 -3 -2 -2 -2 -2
# 11 03-01-2021 Standard 9 9 9 9 9 9 9 0 0 0 0 0 0
# 12 03-01-2021 Master 9 7 7 7 7 7 7 2 2 2 2 2 2
# 13 03-01-2021 Luxury 25 25 27 27 24 24 24 0 -2 -2 1 1 1
# 14 03-01-2021 Super 25 19 19 23 19 18 18 6 6 2 6 7 7
# 15 03-01-2021 Deluxe 10 12 12 16 11 11 11 -2 -2 -6 -1 -1 -1
# 16 04-01-2021 Standard 10 10 10 10 10 10 10 0 0 0 0 0 0
# 17 04-01-2021 Master 9 10 8 8 8 8 8 -1 1 1 1 1 1
# 18 04-01-2021 Luxury 24 30 29 31 31 28 28 -6 -5 -7 -7 -4 -4
# 19 04-01-2021 Super 25 27 23 23 29 25 23 -2 2 2 -4 0 2
# 20 04-01-2021 Deluxe 12 12 12 12 14 10 10 0 0 0 -2 2 2
dplyr
library(dplyr)
df %>%
mutate(across(DR0:DR5, list(PV = ~ DX - .)))
# date Room DX DR0 DR1 DR2 DR3 DR4 DR5 DR0_PV DR1_PV DR2_PV DR3_PV DR4_PV DR5_PV
# 1 01-01-2021 Standard 5 5 5 5 5 5 5 0 0 0 0 0 0
# 2 01-01-2021 Master 8 8 8 8 8 8 8 0 0 0 0 0 0
# 3 01-01-2021 Luxury 13 13 12 12 12 12 12 0 1 1 1 1 1
# 4 01-01-2021 Super 5 6 6 6 6 4 4 -1 -1 -1 -1 1 1
# 5 01-01-2021 Deluxe 3 4 3 3 3 3 3 -1 0 0 0 0 0
# 6 02-01-2021 Standard 4 4 4 4 4 4 4 0 0 0 0 0 0
# 7 02-01-2021 Master 6 6 6 6 6 6 6 0 0 0 0 0 0
# 8 02-01-2021 Luxury 18 19 19 18 18 18 18 -1 -1 0 0 0 0
# 9 02-01-2021 Super 14 14 15 15 16 16 15 0 -1 -1 -2 -2 -1
# 10 02-01-2021 Deluxe 5 7 8 7 7 7 7 -2 -3 -2 -2 -2 -2
# 11 03-01-2021 Standard 9 9 9 9 9 9 9 0 0 0 0 0 0
# 12 03-01-2021 Master 9 7 7 7 7 7 7 2 2 2 2 2 2
# 13 03-01-2021 Luxury 25 25 27 27 24 24 24 0 -2 -2 1 1 1
# 14 03-01-2021 Super 25 19 19 23 19 18 18 6 6 2 6 7 7
# 15 03-01-2021 Deluxe 10 12 12 16 11 11 11 -2 -2 -6 -1 -1 -1
# 16 04-01-2021 Standard 10 10 10 10 10 10 10 0 0 0 0 0 0
# 17 04-01-2021 Master 9 10 8 8 8 8 8 -1 1 1 1 1 1
# 18 04-01-2021 Luxury 24 30 29 31 31 28 28 -6 -5 -7 -7 -4 -4
# 19 04-01-2021 Super 25 27 23 23 29 25 23 -2 2 2 -4 0 2
# 20 04-01-2021 Deluxe 12 12 12 12 14 10 10 0 0 0 -2 2 2
We can use the tidyverse:
library(dplyr)
library(stringr)
df %>%
mutate(across(c(DR0:DR5), ~ DX - .,
.names = '{str_replace(.col, "DR", "PV")}'))
-output
date Room DX DR0 DR1 DR2 DR3 DR4 DR5 PV0 PV1 PV2 PV3 PV4 PV5
1 01-01-2021 Standard 5 5 5 5 5 5 5 0 0 0 0 0 0
2 01-01-2021 Master 8 8 8 8 8 8 8 0 0 0 0 0 0
3 01-01-2021 Luxury 13 13 12 12 12 12 12 0 1 1 1 1 1
4 01-01-2021 Super 5 6 6 6 6 4 4 -1 -1 -1 -1 1 1
5 01-01-2021 Deluxe 3 4 3 3 3 3 3 -1 0 0 0 0 0
6 02-01-2021 Standard 4 4 4 4 4 4 4 0 0 0 0 0 0
7 02-01-2021 Master 6 6 6 6 6 6 6 0 0 0 0 0 0
8 02-01-2021 Luxury 18 19 19 18 18 18 18 -1 -1 0 0 0 0
9 02-01-2021 Super 14 14 15 15 16 16 15 0 -1 -1 -2 -2 -1
10 02-01-2021 Deluxe 5 7 8 7 7 7 7 -2 -3 -2 -2 -2 -2
11 03-01-2021 Standard 9 9 9 9 9 9 9 0 0 0 0 0 0
12 03-01-2021 Master 9 7 7 7 7 7 7 2 2 2 2 2 2
13 03-01-2021 Luxury 25 25 27 27 24 24 24 0 -2 -2 1 1 1
14 03-01-2021 Super 25 19 19 23 19 18 18 6 6 2 6 7 7
15 03-01-2021 Deluxe 10 12 12 16 11 11 11 -2 -2 -6 -1 -1 -1
16 04-01-2021 Standard 10 10 10 10 10 10 10 0 0 0 0 0 0
17 04-01-2021 Master 9 10 8 8 8 8 8 -1 1 1 1 1 1
18 04-01-2021 Luxury 24 30 29 31 31 28 28 -6 -5 -7 -7 -4 -4
19 04-01-2021 Super 25 27 23 23 29 25 23 -2 2 2 -4 0 2
20 04-01-2021 Deluxe 12 12 12 12 14 10 10 0 0 0 -2 2 2
Or in base R
nm1 <- paste0("DR", 0:5)
df[paste0("PV", 0:5)] <- df$DX[row(df[nm1])] - df[nm1]
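The base R approaches above rely on a vector of length nrow(df) being lined up with the rows of every selected column. A minimal sketch of that behaviour, using a hypothetical three-row data frame `toy` (not the question's data), with `sweep()` as one possible way to state the row-wise subtraction explicitly:

```r
# Hypothetical toy data, used only to illustrate the recycling behaviour
toy <- data.frame(DX = c(10, 20, 30), DR0 = c(1, 2, 3), DR1 = c(4, 5, 6))
dr_cols <- c("DR0", "DR1")

# A vector as long as nrow(toy) is subtracted from each column in turn
pv_recycled <- toy$DX - toy[dr_cols]

# Indexing DX by row() repeats DX once per selected column
dx_by_row <- toy$DX[row(toy[dr_cols])]

# sweep() computes toy[dr_cols] - DX row by row, so negate it to get DX - DR
pv_sweep <- -sweep(as.matrix(toy[dr_cols]), 1, toy$DX)

print(pv_recycled)
```

Both forms should give the same numbers; which one reads better is a matter of taste.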
Related
Creating new column names using dplyr across and .names
I have the following data frame:
df <- data.frame(A_TR1=sample(10:20, 8, replace = TRUE),A_TR2=seq(2, 16, by=2), A_TR3=seq(1, 16, by=2), B_TR1=seq(1, 16, by=2),B_TR2=seq(2, 16, by=2), B_TR3=seq(1, 16, by=2))
> df
A_TR1 A_TR2 A_TR3 B_TR1 B_TR2 B_TR3
1 11 2 1 1 2 1
2 12 4 3 3 4 3
3 18 6 5 5 6 5
4 11 8 7 7 8 7
5 17 10 9 9 10 9
6 17 12 11 11 12 11
7 14 14 13 13 14 13
8 11 16 15 15 16 15
What I would like to do is subtract B_TR1 from A_TR1, B_TR2 from A_TR2, and so on, and create new columns from these, similar to below:
df$x_TR1 <- (df$A_TR1 - df$B_TR1)
df$x_TR2 <- (df$A_TR2 - df$B_TR2)
df$x_TR3 <- (df$A_TR3 - df$B_TR3)
> df
A_TR1 A_TR2 A_TR3 B_TR1 B_TR2 B_TR3 x_TR1 x_TR2 x_TR3
1 12 2 1 1 2 1 11 0 0
2 11 4 3 3 4 3 8 0 0
3 19 6 5 5 6 5 14 0 0
4 13 8 7 7 8 7 6 0 0
5 12 10 9 9 10 9 3 0 0
6 16 12 11 11 12 11 5 0 0
7 16 14 13 13 14 13 3 0 0
8 18 16 15 15 16 15 3 0 0
I would like to name these columns "x TR1", "x TR2", etc. I tried to do the following:
xdf <- df%>%mutate(across(starts_with("A_TR"), -across(starts_with("B_TR")), .names="x TR{.col}"))
However, I get an error in mutate(): attempt to select less than one element in integerOneIndex
I also don't know how to create the proper column names, in terms of getting the numbers right. I am not even sure the glue() syntax allows for it. Any help is appreciated.
We could use .names in the first across to replace the substring 'A' with 'x' in the column names (.col) while subtracting the second set of columns
library(dplyr)
library(stringr)
df <- df %>%
mutate(across(starts_with("A_TR"), .names = "{str_replace(.col, 'A', 'x')}") - across(starts_with("B_TR")))
-output
df
A_TR1 A_TR2 A_TR3 B_TR1 B_TR2 B_TR3 x_TR1 x_TR2 x_TR3
1 10 2 1 1 2 1 9 0 0
2 10 4 3 3 4 3 7 0 0
3 16 6 5 5 6 5 11 0 0
4 12 8 7 7 8 7 5 0 0
5 20 10 9 9 10 9 11 0 0
6 19 12 11 11 12 11 8 0 0
7 17 14 13 13 14 13 4 0 0
8 14 16 15 15 16 15 -1 0 0
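For comparison, the same pairwise subtraction can be sketched in base R without across(). This is only an illustration on a hypothetical two-row data frame; the helper vectors `a_cols`, `b_cols`, and `x_cols` are names introduced here, not part of the original answer.

```r
# Hypothetical toy columns that follow the A_TRn / B_TRn naming scheme
df <- data.frame(A_TR1 = c(12, 11), A_TR2 = c(2, 4),
                 B_TR1 = c(1, 3),   B_TR2 = c(2, 4))

# Derive the matching B_ and x_ names from the A_ names
a_cols <- grep("^A_TR", names(df), value = TRUE)
b_cols <- sub("^A", "B", a_cols)
x_cols <- sub("^A", "x", a_cols)

# Data frames of equal shape are subtracted element by element
df[x_cols] <- df[a_cols] - df[b_cols]
print(df)
```

Deriving `b_cols` from `a_cols` keeps the pairs aligned even if the columns are not stored in matching order.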
Sum in R based on a date range and another condition?
I am working on a dataframe of baseball data called mlb_team_logs. A random sample lies below. Date Team season AB PA H X1B X2B X3B HR R RBI BB IBB SO HBP SF SH GDP 1 2015-04-06 ARI 2015 34 39 9 7 1 1 0 4 4 3 0 6 2 0 0 2 2 2015-04-07 ARI 2015 31 36 8 4 1 1 2 7 7 5 0 7 0 0 0 1 3 2015-04-08 ARI 2015 32 35 5 3 2 0 0 2 1 2 0 7 1 0 0 0 4 2015-04-10 ARI 2015 35 38 7 6 0 0 1 4 4 3 0 10 0 0 0 0 5 2015-04-11 ARI 2015 32 35 10 9 0 0 1 6 6 3 0 7 0 0 0 1 6 2015-04-12 ARI 2015 36 38 10 7 3 0 0 4 4 1 0 11 0 0 1 1 7 2015-04-13 ARI 2015 39 44 12 8 3 1 0 8 7 4 0 11 0 0 1 0 8 2015-04-14 ARI 2015 28 32 3 1 2 0 0 1 1 3 0 4 1 0 0 2 9 2015-04-15 ARI 2015 33 34 9 7 1 0 1 2 2 1 0 8 0 0 0 1 10 2015-04-16 ARI 2015 47 51 11 6 2 0 3 7 7 3 1 8 1 0 0 0 240 2015-07-03 ATL 2015 30 32 7 4 1 0 2 2 2 2 0 6 0 0 0 1 241 2015-07-04 ATL 2015 34 40 10 6 3 0 1 9 9 5 0 5 0 0 1 0 242 2015-07-05 ATL 2015 35 37 7 6 1 0 0 0 0 1 0 10 1 0 0 1 243 2015-07-06 ATL 2015 40 44 15 10 4 0 1 5 5 3 0 7 0 0 1 1 244 2015-07-07 ATL 2015 34 37 10 7 1 1 1 4 4 2 0 4 0 0 1 1 245 2015-07-08 ATL 2015 31 38 7 4 1 0 2 5 5 5 1 7 0 0 2 1 246 2015-07-09 ATL 2015 34 37 10 8 2 0 0 3 3 1 0 9 0 1 1 2 247 2015-07-10 ATL 2015 32 35 8 7 0 0 1 3 3 2 0 5 1 0 0 2 248 2015-07-11 ATL 2015 33 38 6 3 1 0 2 2 2 5 1 8 0 0 0 0 249 2015-07-12 ATL 2015 34 41 8 6 2 0 0 3 3 7 1 10 0 0 0 1 250 2015-07-17 ATL 2015 30 36 7 4 3 0 0 4 4 5 1 7 0 0 0 0 In total, the df has 43 total columns. My objective is to sum columns 4 (AB) to 43 on two criteria: the team the date is within 7 days of the entry in "Date" (ie Date - 7 to Date - 1) Eventually, I would like these columns to be appended to mlb_team_logs as l7_AB, l7_PA, etc (but I know how to do that if the output will be a new dataframe). Any help is appreciated! EDIT I altered the sample to allow for more easily tested results
You might be able to use a data.table non-equi join here. The idea would be to create a lower date bound (below, I've named this date_lb), and then join the table on itself, matching on Team = Team, Date < Date, and Date >= date_lb. Then use lapply with .SDcols to sum the columns of interest.
Load the library and convert your frame to a data.table
library(data.table)
setDT(mlb_team_logs)
Identify the columns you want to sum, in a character vector (change to 4:43 in your full dataset)
sum_cols = names(mlb_team_logs)[4:19]
Add a lower bound on date
mlb_team_logs[, date_lb := Date-7]
Join the table on itself, and use lapply(.SD, sum) on the columns of interest
result = mlb_team_logs[mlb_team_logs[, .(Team, Date, date_lb)], on=.(Team, Date<Date, Date>=date_lb)][, lapply(.SD, sum), by=.(Date,Team), .SDcols = sum_cols]
Set the new names (in place, using setnames())
setnames(result, old=sum_cols, new=paste0("I7_",sum_cols))
Output:
Date Team I7_AB I7_PA I7_H I7_X1B I7_X2B I7_X3B I7_HR I7_R I7_RBI I7_BB I7_IBB I7_SO I7_HBP I7_SF I7_SH I7_GDP
<IDat> <char> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1: 2015-04-06 ARI NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2: 2015-04-07 ARI 34 39 9 7 1 1 0 4 4 3 0 6 2 0 0 2
3: 2015-04-08 ARI 65 75 17 11 2 2 2 11 11 8 0 13 2 0 0 3
4: 2015-04-10 ARI 97 110 22 14 4 2 2 13 12 10 0 20 3 0 0 3
5: 2015-04-11 ARI 132 148 29 20 4 2 3 17 16 13 0 30 3 0 0 3
6: 2015-04-12 ARI 164 183 39 29 4 2 4 23 22 16 0 37 3 0 0 4
7: 2015-04-13 ARI 200 221 49 36 7 2 4 27 26 17 0 48 3 0 1 5
8: 2015-04-14 ARI 205 226 52 37 9 2 4 31 29 18 0 53 1 0 2 3
9: 2015-04-15 ARI 202 222 47 34 10 1 2 25 23 16 0 50 2 0 2 4
10: 2015-04-16 ARI 203 221 51 38 9 1 3 25 24 15 0 51 1 0 2 5
11: 2015-07-03 ATL NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
12: 2015-07-04 ATL 30 32 7 4 1 0 2 2 2 2 0 6 0 0 0 1
13: 2015-07-05 ATL 64 72 17 10 4 0 3 11 11 7 0 11 0 0 1 1
14: 2015-07-06 ATL 99 109 24 16 5 0 3 11 11 8 0 21 1 0 1 2
15: 2015-07-07 ATL 139 153 39 26 9 0 4 16 16 11 0 28 1 0 2 3
16: 2015-07-08 ATL 173 190 49 33 10 1 5 20 20 13 0 32 1 0 3 4
17: 2015-07-09 ATL 204 228 56 37 11 1 7 25 25 18 1 39 1 0 5 5
18: 2015-07-10 ATL 238 265 66 45 13 1 7 28 28 19 1 48 1 1 6 7
19: 2015-07-11 ATL 240 268 67 48 12 1 6 29 29 19 1 47 2 1 6 8
20: 2015-07-12 ATL 239 266 63 45 10 1 7 22 22 19 2 50 2 1 5 8
21: 2015-07-17 ATL 99 114 22 16 3 0 3 8 8 14 2 23 1 0 0 3
Date Team I7_AB I7_PA I7_H I7_X1B I7_X2B I7_X3B I7_HR I7_R I7_RBI I7_BB I7_IBB I7_SO I7_HBP I7_SF I7_SH I7_GDP
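The window logic (same team, Date - 7 up to Date - 1) can also be sketched in plain base R, which may help when checking a join result by hand. The data frame `logs` below is a hypothetical toy with a single stat column; it is not the asker's full table, and this loop is a swapped-in illustration rather than the data.table method.

```r
# Hypothetical toy log: one stat column, two teams
logs <- data.frame(
  Team = c("ARI", "ARI", "ARI", "ARI", "ATL"),
  Date = as.Date(c("2015-04-06", "2015-04-07", "2015-04-08",
                   "2015-04-20", "2015-04-07")),
  AB   = c(34, 31, 32, 35, 30)
)

# For each row, sum AB over the same team's games in [Date - 7, Date - 1]
logs$l7_AB <- vapply(seq_len(nrow(logs)), function(i) {
  in_window <- logs$Team == logs$Team[i] &
    logs$Date <  logs$Date[i] &
    logs$Date >= logs$Date[i] - 7
  # NA when no earlier game falls inside the window
  if (any(in_window)) sum(logs$AB[in_window]) else NA_real_
}, numeric(1))

print(logs)
```

This scans every row for every row, so it is only meant for small tables or spot checks; a join-based approach scales better.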
I have a weights variable and I need to create cross tabulations for a chord diagram
I have a dataset with over 15,000 observations. I've dropped all variables but three. One is the individual's origin (or), another is the individual's destination (dest), and the third is the weight of that individual (wgt). Origin and destination are categorical variables. The weights I have are used as analytic weights in Stata. However, Stata can't handle the number of columns I generate when making tables. R generates them with ease. However, I can't figure out how to apply weights to the generated table. I tried using wtd.table(), but the following error appears.
wtd.table(NonHSGrad$b206reg, NonHSGrad$c305reg, weights=NonHSGrad$ind_wgts)
Error in proxy[, ..., drop = FALSE] : incorrect number of dimensions
When I use only table(), this comes out:
table(NonHSGrad$b206reg, NonHSGrad$c305reg)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
1 285 38 20 8 6 3 1 2 0 1 0 10 38 46 0 2 14
2 32 312 26 3 1 0 2 1 1 0 1 1 22 51 0 0 8
3 17 35 325 12 12 2 3 7 0 2 3 5 52 13 1 1 25
4 3 5 27 224 19 5 2 10 1 1 1 2 51 4 0 3 35
5 4 9 44 81 778 6 7 22 1 4 5 5 155 5 0 5 47
6 4 5 22 21 10 547 24 12 32 21 32 81 86 5 3 15 58
7 5 4 12 17 20 21 558 20 31 99 93 33 59 1 3 67 15
8 8 9 41 49 17 11 24 919 5 8 37 10 151 2 0 52 19
9 0 1 7 9 1 4 26 5 466 66 19 17 17 2 24 24 7
10 1 2 3 4 2 3 27 8 41 528 21 17 13 2 11 36 2
11 3 0 3 10 1 5 11 5 6 17 519 59 7 1 2 49 1
12 0 1 1 2 0 1 5 2 2 10 39 318 10 0 14 17 1
13 15 9 26 34 25 21 12 42 2 5 3 5 187 2 1 6 15
14 14 47 7 5 0 0 0 1 1 0 0 0 9 475 0 0 0
15 0 0 3 1 2 2 4 2 22 9 3 60 9 2 342 2 3
16 0 2 6 10 3 2 11 21 3 33 29 4 34 0 3 404 5
17 1 1 7 15 2 6 1 2 0 1 1 0 34 0 0 2 463
99 0 0 1 1 0 0 0 1 0 1 0 0 0 1 2 0 1
I am also going to use the table for a chord diagram to show flows.
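One base R option for a weighted cross tabulation is xtabs() with the weight on the left-hand side of the formula, which sums the weights within each origin/destination cell instead of counting rows. A minimal sketch on a hypothetical five-row data frame `flows` (the column names `origin`, `dest`, and `wgt` are assumptions for illustration, not the asker's variables):

```r
# Hypothetical toy data: two origins, two destinations, one weight per person
flows <- data.frame(
  origin = factor(c(1, 1, 2, 2, 2)),
  dest   = factor(c(1, 2, 1, 2, 2)),
  wgt    = c(1.5, 2.0, 0.5, 1.0, 3.0)
)

# Left-hand side = weights: each cell holds the sum of wgt, not a row count
weighted_tab <- xtabs(wgt ~ origin + dest, data = flows)

# Plain numeric matrix, in case downstream plotting code expects one
flow_matrix <- unclass(weighted_tab)

print(weighted_tab)
```

Whether summed weights are the right analogue of Stata's analytic weights for a given analysis is a separate question that this sketch does not settle.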
Need to count items from a tables
I have this DF (partially shown) with 15 categories in the first column and each cell has number between 1 and 15. Actually this is just a small example, The 15 categories are repeated with their different numbers in the other columns What I need is to have a 16x15 matrix with the count of appearances of the values as follows. I can program this in an old fashion with IFs etc but I am kind of lost using R I hope this is clear. Any advise is welcome EDITED AS REQUESTED (I apology not to be clear) RESULTADOS DF PREOCUPACIÓN 13 15 4 4 1 8 3 1 TRISTEZA 15 13 2 5 4 14 6 6 PERDIDA 4 11 3 2 14 12 7 10 ANGUSTIA 14 10 11 3 2 13 1 2 IMPOTENCIA 1 8 9 6 5 5 5 4 MUERTE 2 1 14 14 15 6 13 15 ENOJO 12 7 10 8 6 7 12 5 INJUSTICIA 3 9 12 7 12 2 14 13 AUSENCIA 11 14 6 1 8 11 11 11 DOLOR 5 12 5 9 7 15 8 8 CORRUPCIÓN 8 6 15 13 11 3 15 12 MIEDO 9 3 13 10 3 10 9 3 SECUESTRO 10 2 1 11 9 4 4 14 INSEGURIDAD 7 4 7 15 10 1 10 9 DESESPERACIÓN 6 5 8 12 13 9 2 7 PREOCUPACIÓN 14 2 5 4 3 8 8 7 TRISTEZA 5 7 1 8 7 9 13 9 PERDIDA 2 6 6 12 2 10 6 10 ANGUSTIA 13 3 15 9 8 11 7 4 IMPOTENCIA 12 11 7 5 10 12 12 1 MUERTE 3 10 14 2 13 13 9 2 ENOJO 11 5 10 10 11 7 11 5 INJUSTICIA 7 13 2 6 15 14 10 6 AUSENCIA 8 1 9 11 1 6 4 12 DOLOR 6 8 8 13 9 3 3 3 CORRUPCIÓN 10 15 3 14 14 15 5 11 MIEDO 9 4 13 15 4 4 14 8 SECUESTRO 4 9 11 1 12 5 15 13 INSEGURIDAD 1 12 4 7 6 1 1 14 DESESPERACIÓN 15 14 12 3 5 2 2 15 PREOCUPACIÓN 13 10 4 1 7 4 11 2 TRISTEZA 15 11 11 2 9 3 12 8 PERDIDA 2 15 7 4 15 7 3 13 ANGUSTIA 8 13 5 3 6 1 7 1 IMPOTENCIA 10 4 8 5 12 10 13 3 MUERTE 7 8 15 15 3 6 6 9 ENOJO 14 12 12 10 10 8 15 10 INJUSTICIA 4 1 13 6 1 9 2 6 AUSENCIA 12 9 1 7 8 11 1 14 DOLOR 9 14 2 12 5 2 14 12 CORRUPCIÓN 3 6 14 14 14 14 5 15 MIEDO 6 2 3 9 2 5 10 7 SECUESTRO 1 3 6 8 13 15 4 5 INSEGURIDAD 5 5 9 11 4 13 8 4 DESESPERACIÓN 11 7 10 13 11 12 9 11 ... The result I need is like: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 PREOCUPACION 3 2 2 5 1 0 2 3 0 1 1 0 2 0 1 TRISTEZA 1 2 1 1 2 2 2 2 3 0 2 1 1 1 2
Using apply on every row, convert to factor and get table:
res <- cbind.data.frame(name = df1[, 1], t(apply(df1[, -1], 1, function(i){
table(factor(i, levels = 1:15))
})))
res
# name 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
# 1 PREOCUPACIÓN 2 1 2 2 0 2 0 1 0 0 0 0 1 0 1
# 2 TRISTEZA 0 2 0 1 2 3 0 0 1 0 0 0 1 1 1
# 3 PERDIDA 0 1 1 1 0 0 1 0 0 1 2 2 1 2 0
# 4 ANGUSTIA 2 2 1 1 0 0 0 0 1 1 1 0 1 1 1
# ...
Edit: If you have names repeated on multiple rows, then try below. Split the data frame on the 1st column, then loop through each split data frame and get counts per factor level.
res <- t(data.frame(
lapply(split(df1, df1$V1), function(i){
as.numeric(table(factor(unlist(i[-1, ]), levels = 1:15)))
})))
res
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
# ANGUSTIA 4 0 2 1 1 1 2 2 1 0 1 0 2 0 1
# AUSENCIA 4 2 0 1 0 1 1 2 2 0 2 2 0 1 0
# CORRUPCIÓN 0 0 4 0 2 1 0 0 0 1 1 0 0 6 3
# DESESPERACIÓN 0 2 1 2 1 0 1 0 1 1 3 2 1 1 2
# ...
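The reason for converting to a factor with fixed levels is that table() otherwise drops values that never occur, so rows would end up with different lengths. A small sketch, using the eight values visible for the first PREOCUPACIÓN row in the question (the full data has more columns, so these counts differ from the answer's output):

```r
# The eight values shown for the first row in the question's partial data
x <- c(13, 15, 4, 4, 1, 8, 3, 1)

# Plain table(): only the values that occur get an entry
observed_only <- table(x)

# Fixed levels 1..15: absent values are kept with a count of zero
counts <- table(factor(x, levels = 1:15))

print(counts)
```

With the levels fixed, every row yields a vector of length 15, which is what lets apply() stack the results into a rectangular matrix.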
Vuong test has different results on R and Stata
I am running a zero inflated negative binomial model with probit link on R (http://www.ats.ucla.edu/stat/r/dae/zinbreg.htm) and Stata (http://www.ats.ucla.edu/stat/stata/dae/zinb.htm). There is a Vuong test to compare whether this specification is better than an ordinary negative binomial model. Where R tells me I am better off using the latter, Stata says a ZINB is the preferable choice. In both instances I assume that the process leading to the excess zeros is the same as for the negative binomial distributed non-zero observations. Coefficients are indeed the same (except that Stata prints one digit more).
In R I run (data code is below)
require(pscl)
ZINB <- zeroinfl(Two.Year ~ length + numAuth + numAck, data=Master, dist="negbin", link="probit" )
NB <- glm.nb(Two.Year ~ length + numAuth + numAck, data=Master )
Comparing both with vuong(ZINB, NB) from the same package yields
Vuong Non-Nested Hypothesis Test-Statistic: -10.78337
(test-statistic is asymptotically distributed N(0,1) under the
null that the models are indistinguishible)
in this case:
model2 > model1, with p-value < 2.22e-16
Hence: NB is better than ZINB.
In Stata I run
zinb twoyear numauth length numack, inflate(numauth length numack) probit vuong
and receive (iteration fitting suppressed)
Zero-inflated negative binomial regression Number of obs = 714
Nonzero obs = 433
Zero obs = 281
Inflation model = probit LR chi2(3) = 74.19
Log likelihood = -1484.763 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
twoyear | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
twoyear |
numauth | .1463257 .0667629 2.19 0.028 .0154729 .2771785
length | .038699 .006077 6.37 0.000 .0267883 .0506097
numack | .0333765 .010802 3.09 0.002 .0122049 .0545481
_cons | -.4588568 .2068824 -2.22 0.027 -.8643389 -.0533747
-------------+----------------------------------------------------------------
inflate |
numauth | .2670777 .1141893 2.34 0.019 .0432708 .4908846
length | .0147993 .0105611 1.40 0.161 -.0059001 .0354987
numack | .0177504 .0150118 1.18 0.237 -.0116722 .0471729
_cons | -2.057536 .5499852 -3.74 0.000 -3.135487 -.9795845
-------------+----------------------------------------------------------------
/lnalpha | .0871077 .1608448 0.54 0.588 -.2281424 .4023577
-------------+----------------------------------------------------------------
alpha | 1.091014 .175484 .7960109 1.495346
------------------------------------------------------------------------------
Vuong test of zinb vs. standard negative binomial: z = 2.36 Pr>z = 0.0092
In the very last line Stata tells me that in this case ZINB is better than NB: both the test statistic and the p-value differ. How come?
Data (R code) Master <- <-read.table(text=" Two.Year numAuth length numAck 0 1 4 6 3 3 28 3 3 1 18 4 0 1 42 4 0 2 17 0 2 1 10 3 1 2 20 0 0 1 28 3 1 1 23 7 0 2 34 3 2 2 24 2 0 2 18 0 0 1 23 7 0 1 35 11 4 2 33 13 0 2 24 4 0 2 21 9 1 4 21 0 1 1 8 6 2 1 18 1 0 3 28 2 0 2 17 2 1 1 30 6 4 2 28 16 1 4 35 1 2 3 19 2 0 1 24 2 1 3 26 6 1 1 17 7 0 3 42 4 0 3 32 8 3 1 33 23 7 2 24 9 0 2 25 6 1 1 7 1 0 1 15 2 2 2 16 2 0 1 23 6 2 3 18 7 0 1 28 5 0 1 12 2 1 1 25 4 0 4 18 1 1 2 32 6 1 1 15 2 2 2 14 4 0 2 24 9 0 3 30 9 0 2 19 9 0 2 14 2 2 2 23 3 0 2 18 0 1 3 13 4 0 1 10 4 0 1 24 8 0 2 22 9 2 3 29 5 2 1 25 5 0 2 17 4 1 2 24 0 0 2 26 0 2 2 33 12 1 4 17 2 1 1 25 8 3 1 36 11 0 1 10 4 9 2 60 22 0 2 18 3 2 3 19 6 2 2 23 7 2 2 26 0 1 1 20 5 4 2 31 4 0 2 21 2 0 1 24 12 1 1 12 1 1 3 26 5 1 4 32 8 2 3 21 1 3 3 26 3 4 2 36 6 3 3 28 2 1 3 27 1 0 2 12 5 0 3 24 4 0 2 35 1 0 2 17 2 3 2 28 3 0 3 29 8 0 2 20 3 3 2 28 0 11 1 30 2 0 3 22 2 21 3 59 24 0 2 15 5 0 2 22 2 5 4 33 0 0 2 21 2 4 2 21 0 0 3 25 9 2 2 31 5 1 2 23 1 2 3 25 0 0 1 13 3 0 1 22 7 0 1 16 3 6 1 18 4 2 2 19 7 3 2 22 10 0 1 12 6 0 1 23 8 1 1 23 9 1 2 32 15 1 3 26 8 1 3 15 2 0 3 16 2 0 4 29 2 2 3 24 3 2 3 32 1 2 1 29 13 1 3 26 0 5 1 23 4 3 2 21 2 4 2 19 4 4 3 19 2 2 1 29 0 0 1 13 6 0 2 28 2 0 3 33 1 0 1 20 2 0 1 30 8 1 2 19 2 17 2 30 7 5 3 39 17 21 3 30 5 1 3 29 24 1 1 31 4 4 3 26 13 4 2 14 16 2 3 31 14 5 3 37 10 15 2 52 13 1 1 6 5 2 1 24 13 17 3 17 3 3 2 29 5 2 1 26 7 3 3 34 9 5 2 39 2 3 1 26 7 1 2 32 12 2 3 26 4 9 3 28 8 1 3 29 1 4 1 24 7 9 1 40 13 1 2 27 21 2 2 27 13 5 3 31 10 10 2 29 15 10 2 41 15 8 1 24 17 2 4 16 5 17 2 26 20 3 2 31 3 2 2 18 1 6 3 32 9 2 1 32 11 4 3 34 8 4 1 16 1 5 1 33 5 0 2 17 11 17 2 48 8 2 1 11 2 5 3 33 18 4 2 25 9 10 2 17 5 1 1 25 8 3 3 41 16 2 1 40 13 4 3 25 2 16 4 32 13 10 1 33 18 5 2 25 3 3 2 20 3 2 3 14 7 3 2 23 4 2 2 28 4 3 2 25 19 0 2 14 6 3 1 28 18 8 3 27 11 1 3 25 17 21 2 33 15 9 2 24 2 1 1 16 14 1 1 38 10 16 2 37 13 16 2 41 1 7 2 24 18 4 2 17 5 4 1 37 32 3 1 37 8 13 2 35 6 15 1 23 11 7 1 47 11 3 1 16 6 
12 2 36 6 7 1 24 17 4 2 24 8 14 2 24 9 15 2 24 11 0 3 19 4 0 4 28 9 1 1 5 3 11 1 28 15 5 1 33 5 10 2 21 9 3 3 28 8 2 3 13 2 11 2 41 8 4 2 24 11 3 1 32 11 4 2 31 11 7 2 34 3 11 6 33 6 7 3 33 7 2 2 37 13 7 3 19 9 1 2 14 3 6 2 15 11 11 3 37 12 0 2 20 5 7 4 13 6 17 1 52 14 9 3 47 30 1 2 32 27 30 3 36 19 2 2 12 5 3 1 30 7 4 2 19 11 32 3 45 14 13 1 17 7 16 2 24 4 5 1 32 13 7 3 29 14 5 2 46 2 1 2 21 6 1 3 13 17 11 1 41 16 6 2 33 1 7 1 31 20 0 1 16 13 6 3 26 8 11 2 46 7 8 2 20 5 8 1 44 7 2 2 33 12 1 3 22 5 0 4 14 2 4 1 25 8 5 3 24 11 1 1 21 18 5 1 28 5 2 1 51 19 2 1 16 4 17 2 35 2 4 1 35 1 9 3 48 8 2 1 33 16 0 3 24 7 18 2 33 12 11 1 41 5 5 2 17 3 8 1 19 7 4 3 38 2 23 2 27 10 22 3 46 13 5 3 21 1 5 2 38 10 1 2 20 5 2 2 24 8 0 3 30 9 7 2 44 16 7 1 21 7 0 1 20 10 10 2 33 11 4 2 18 2 11 1 45 17 7 2 32 7 7 2 28 6 5 2 25 10 3 2 57 6 8 1 16 2 7 2 34 4 5 2 22 8 2 2 21 7 4 2 37 15 2 4 36 7 1 1 17 4 0 2 23 9 12 2 48 4 8 3 29 13 0 1 29 7 0 2 27 12 1 1 53 10 3 3 15 5 8 1 40 29 2 2 22 11 10 2 20 7 4 4 27 3 4 1 24 4 2 2 24 5 1 2 19 6 10 3 41 10 57 3 46 9 5 1 20 11 6 2 30 4 0 2 20 5 16 3 35 8 1 2 44 1 2 4 24 8 1 1 20 9 5 3 19 11 5 3 29 15 3 1 21 8 3 3 19 3 8 3 44 0 11 3 34 15 2 2 31 1 11 1 39 11 0 3 24 3 4 2 35 6 2 1 14 6 10 1 30 10 6 2 21 4 9 2 32 3 0 1 34 10 6 2 32 3 7 2 50 11 11 1 35 15 4 1 27 9 1 2 32 27 8 2 54 2 0 3 15 8 2 1 31 13 0 1 31 11 0 4 14 5 0 2 37 15 0 2 51 12 0 2 34 1 0 3 29 12 0 2 22 11 0 2 19 15 0 2 39 13 0 3 25 12 0 1 46 2 0 4 42 10 0 1 38 5 0 3 31 4 0 3 33 1 0 2 24 11 0 1 28 16 0 2 28 13 0 1 29 17 0 1 23 13 0 3 36 21 0 2 30 15 0 2 25 12 0 2 26 17 0 3 19 2 0 2 37 5 0 2 47 12 0 1 21 20 0 3 27 21 0 2 16 7 0 1 35 5 0 2 32 24 0 3 31 6 0 3 36 13 0 2 26 20 0 1 31 13 0 2 46 6 0 2 34 12 0 1 18 13 0 1 29 3 0 3 40 9 0 1 25 3 0 3 45 9 0 2 31 3 0 2 35 4 0 3 29 10 0 2 33 13 0 3 22 4 0 2 26 9 0 2 29 19 0 2 28 12 0 2 30 5 0 4 30 3 0 3 32 14 0 3 45 20 0 2 42 9 0 2 25 4 0 2 20 22 0 3 31 5 0 1 26 13 0 2 32 11 0 1 31 2 0 2 42 17 0 1 37 8 0 3 37 16 0 3 25 10 0 2 33 11 0 2 29 7 0 2 21 16 0 3 
30 33 0 1 35 8 0 3 25 6 0 2 54 3 0 2 41 10 0 3 35 1 0 4 26 4 0 2 31 4 0 3 26 11 0 3 34 11 0 2 27 7 0 1 19 14 0 1 38 9 0 2 24 1 0 3 30 20 0 4 43 13 0 2 20 10 0 2 38 1 0 2 41 6 0 1 20 9 0 2 34 2 0 2 24 5 0 2 24 2 0 1 31 19 0 3 49 7 0 1 26 0 0 2 44 6 0 3 36 13 0 3 31 14 0 2 30 20 0 1 27 13 0 2 28 9 0 2 22 20 0 4 36 34 0 3 25 3 0 2 29 17 0 2 40 8 0 2 39 17 0 4 29 8 0 1 27 22 0 1 21 10 0 3 17 5 0 3 28 10 0 1 27 7 0 3 40 7 0 2 21 4 0 1 33 14 0 1 31 14 0 3 37 13 0 2 23 9 0 2 25 1 0 2 30 1 0 2 30 12 0 1 41 8 0 2 26 1 0 2 25 14 0 2 26 3 0 3 36 1 0 4 23 1 0 2 18 0 0 2 34 2 0 1 39 6 0 1 16 15 0 3 34 4 0 4 35 6 0 1 22 10 0 1 35 8 0 2 36 13 0 2 50 8 0 2 28 6 0 1 30 14 0 2 33 26 0 3 28 1 0 1 18 10 0 2 27 4 0 2 27 5 0 2 8 2 0 4 32 16 0 3 40 6 0 4 45 15 0 2 38 3 0 2 29 6 0 1 25 9 12 1 27 5 2 1 33 8 4 3 31 3 1 1 33 4 0 3 20 5 0 2 28 6 2 2 32 12 0 3 30 2 0 3 19 3 1 1 14 19 0 2 28 2 0 3 26 3 0 2 32 13 1 3 21 7 1 4 20 0 2 2 40 8 0 2 35 18 1 1 20 6 6 2 21 3 3 2 33 10 1 1 31 15 1 2 22 5 0 2 24 7 2 2 22 3 3 2 17 6 9 2 30 12 2 4 39 9 0 2 46 8 0 2 26 5 1 2 28 5 6 1 18 3 5 2 19 13 1 3 27 3 1 1 20 10 0 1 27 6 0 4 26 1 0 2 19 4 0 1 26 8 1 1 30 8 0 2 22 2 3 3 42 4 3 1 10 5 3 1 30 12 1 1 25 8 1 2 38 8 2 1 28 13 3 1 18 12 2 2 20 11 2 2 29 0 1 2 18 3 1 1 6 2 0 1 6 3 2 2 24 1 0 1 14 1 1 1 17 5 2 2 20 9 1 4 24 0 1 2 8 10 0 2 18 1 1 1 25 5 2 2 12 7 0 3 18 1 0 1 19 1 8 2 21 2 1 2 23 5 7 2 19 6 1 1 21 5 0 1 16 6 1 1 24 1 0 2 19 3 1 2 14 6 3 2 24 2 6 1 32 21 0 1 16 0 1 2 15 0 1 2 8 8 0 1 14 5 0 2 27 5 2 2 17 2 1 1 19 7 1 2 21 2 0 1 29 7 0 2 18 2 0 2 15 6 2 3 27 3 0 2 57 4 2 3 17 2 1 1 18 8 1 1 17 5 0 1 18 1 1 2 18 4 1 1 12 1 0 2 15 6 1 2 24 4 3 2 14 9 0 1 24 6 3 1 30 9 0 1 19 5 3 1 16 7 5 3 21 1 2 2 17 5 4 1 34 9 1 1 17 7 3 2 30 10 12 1 17 6 2 1 26 6 1 1 18 2 2 2 24 0 0 1 12 2 0 2 3 2 1 1 11 4 1 4 18 13 0 1 25 9 8 2 20 7 0 1 11 7 7 3 26 19 6 1 18 6 6 2 32 5 1 1 31 2 1 2 33 9 4 1 17 6 1 2 34 11 5 1 37 3 0 3 27 10 12 2 25 14 3 1 40 6 6 2 27 9 0 2 31 2 1 1 28 7 2 1 37 11 1 1 19 0 5 2 30 17 4 3 40 6 0 1 27 
6 5 3 31 7 0 3 26 10 3 2 32 4 1 3 43 6 3 1 19 3 2 2 37 4 0 3 28 4 6 3 30 11 1 1 30 9 4 3 31 26 1 2 14 1 10 1 35 27 1 1 36 7 5 1 32 8 2 1 28 6 3 1 34 16 3 2 32 5 1 3 11 0 2 2 42 5 0 2 30 7 0 1 32 9 3 3 43 2 7 2 43 6 1 2 21 5 2 1 27 20 1 2 37 7 2 1 37 8 0 1 19 3 0 3 28 5 2 2 33 3 3 1 41 6 13 2 41 9 2 1 38 3 4 1 32 5 2 1 34 8 1 1 27 9 8 1 29 7 4 1 17 6 0 1 20 8 1 2 34 4 1 1 16 11 4 2 33 5 0 2 15 6 1 1 27 4 2 3 15 8 1 1 30 8 3 2 41 20 0 1 25 15 1 3 35 24 4 2 30 21 6 2 30 6 16 2 33 21 2 3 37 3 2 2 30 12 4 1 57 11 0 2 18 16 4 4 20 13 3 1 43 10 3 1 25 15 7 2 31 11 2 1 31 3 5 2 40 11 3 2 28 7 4 2 27 10 0 1 26 6 4 2 24 14 4 2 23 8 0 2 25 11 21 2 33 12 1 3 37 0 3 2 28 7 4 2 27 10 1 2 41 15 2 2 30 16 2 2 28 7 6 1 19 8 4 4 22 19 0 2 38 33 1 1 29 11 1 2 27 2 4 2 24 6 2 1 22 5 ",header=TRUE,sep="")
The above problem occurred with pscl version 1.4.6. I have since spoken to the author, who fixed the bug in version 1.4.7. The current version as of February 2015 is 1.4.8.
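Since the answer ties the behaviour to specific package versions, a quick version comparison can tell whether an installation predates the fix. A minimal sketch; `has_vuong_fix` is a hypothetical helper introduced here, and the version strings are the ones quoted in the answer:

```r
# First version reported in the answer as containing the fix
fixed_in <- package_version("1.4.7")

# Hypothetical helper: TRUE when a version string is at or past the fix
has_vuong_fix <- function(v) package_version(v) >= fixed_in

print(has_vuong_fix("1.4.6"))
print(has_vuong_fix("1.4.8"))

# On a machine with pscl installed, the installed version could be checked
# with: has_vuong_fix(as.character(packageVersion("pscl")))
```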