left_join with individual lag to new column - r

I need to merge two data frames, probably with left_join(), offset the joined observations by a specific amount, and add the result to a new column. The purpose is the preparation of a time-series analysis, hence the different shifts in calendar weeks. I would like to stay in the tidyverse.
I read a few posts using a nested left_join() and lag(), but that's beyond my current capability.
MWE
library(tidyverse)
set.seed(1234)
df1 <- data.frame(
  Week1 = sample(paste("2015", 20:40, sep = "."), 10, replace = FALSE),
  Qty = as.numeric(sample(1:10))
)
df2 <- data.frame(
  Week2 = paste0("2015.", 1:52),
  Value = as.numeric(sample(1:52))
)
df1 %>%
  left_join(df2, by = c("Week1" = "Week2")) %>%
  rename(Lag_0 = Value)
Current output
+----+---------+-------+-------+
| | Week1 | Qty | Lag_0 |
+====+=========+=======+=======+
| 1 | 2015.35 | 6.00 | 50.00 |
+----+---------+-------+-------+
| 2 | 2015.24 | 10.00 | 26.00 |
+----+---------+-------+-------+
| 3 | 2015.31 | 7.00 | 43.00 |
+----+---------+-------+-------+
| 4 | 2015.34 | 9.00 | 42.00 |
+----+---------+-------+-------+
| 5 | 2015.28 | 4.00 | 10.00 |
+----+---------+-------+-------+
| 6 | 2015.39 | 8.00 | 24.00 |
+----+---------+-------+-------+
| 7 | 2015.25 | 5.00 | 33.00 |
+----+---------+-------+-------+
| 8 | 2015.23 | 1.00 | 39.00 |
+----+---------+-------+-------+
| 9 | 2015.21 | 2.00 | 17.00 |
+----+---------+-------+-------+
| 10 | 2015.26 | 3.00 | 27.00 |
+----+---------+-------+-------+
It might be worth pointing out that the target data frame does not hold the same number of week observations as the joining data frame.
Desired output
+----+---------+-------+-------+-------+-------+--------+
| | Week1 | Qty | Lag_0 | Lag_3 | Lag_6 | Lag_12 |
+====+=========+=======+=======+=======+=======+========+
| 1 | 2015.35 | 6.00 | 50.00 | 9.00 | | |
+----+---------+-------+-------+-------+-------+--------+
| 2 | 2015.24 | 10.00 | 26.00 | 17.00 | | |
+----+---------+-------+-------+-------+-------+--------+
| 3 | 2015.31 | 7.00 | 43.00 | 10.00 | | |
+----+---------+-------+-------+-------+-------+--------+
| 4 | 2015.34 | 9.00 | 42.00 | 43.00 | | |
+----+---------+-------+-------+-------+-------+--------+
| 5 | 2015.28 | 4.00 | 10.00 | 33.00 | | |
+----+---------+-------+-------+-------+-------+--------+
| 6 | 2015.39 | 8.00 | 24.00 | 13.00 | | |
+----+---------+-------+-------+-------+-------+--------+
| 7 | 2015.25 | 5.00 | 33.00 | 25.00 | | |
+----+---------+-------+-------+-------+-------+--------+
| 8 | 2015.23 | 1.00 | 39.00 | 38.00 | | |
+----+---------+-------+-------+-------+-------+--------+
| 9 | 2015.21 | 2.00 | 17.00 | 6.00 | | |
+----+---------+-------+-------+-------+-------+--------+
| 10 | 2015.26 | 3.00 | 27.00 | 39.00 | | |
+----+---------+-------+-------+-------+-------+--------+
Column Lag_3, which I added manually, contains the value from the matching df2 week, but offset by three rows. Lag_6 would be offset by six rows, etc.
I suppose the challenge is that the lag() would have to happen in the joining table after the matching but before the value is returned.
Hope this makes sense and thanks for the assistance.

We just need to create the lags in the second data frame beforehand and then do the join:
library(dplyr)
df2 %>%
  mutate(Lag_3 = lag(Value, 3), Lag_6 = lag(Value, 6)) %>%
  left_join(df1, ., by = c("Week1" = "Week2")) %>%
  rename(Lag_0 = Value)
Output:
# Week1 Qty Lag_0 Lag_3 Lag_6
#1 2015.35 6 50 9 46
#2 2015.24 10 26 17 6
#3 2015.31 7 43 10 33
#4 2015.34 9 42 43 10
#5 2015.28 4 10 33 25
#6 2015.39 8 24 13 16
#7 2015.25 5 33 25 49
#8 2015.23 1 39 38 15
#9 2015.21 2 17 6 32
#10 2015.26 3 27 39 38
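If you also need Lag_12 (or any other set of offsets), the same idea generalizes: build every lag column in df2 first, then join once. Below is a sketch using purrr::reduce together with dplyr's glue-style name injection (requires a recent dplyr; the lags vector here is just the offsets from the question):
library(dplyr)
library(purrr)
lags <- c(3, 6, 12)  # offsets from the question
df2_lagged <- reduce(
  lags,
  function(d, k) mutate(d, "Lag_{k}" := lag(Value, k)),
  .init = df2
)
df1 %>%
  left_join(df2_lagged, by = c("Week1" = "Week2")) %>%
  rename(Lag_0 = Value)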


Calculating weighted average buy and hold return per ID in R

Thanks to @langtang, I was able to calculate the buy-and-hold return around the event date for each company (Calculating Buy and Hold return around event date per ID in R). But now I am facing a new problem.
Below is the data I currently have.
+----+------------+-------+------------+------------+----------------------+
| ID | Date | Price | EventDate | Market Cap | BuyAndHoldIndividual |
+----+------------+-------+------------+------------+----------------------+
| 1 | 2011-03-06 | 10 | NA | 109 | NA |
| 1 | 2011-03-07 | 9 | NA | 107 | -0.10000 |
| 1 | 2011-03-08 | 12 | NA | 109 | 0.20000 |
| 1 | 2011-03-09 | 14 | NA | 107 | 0.40000 |
| 1 | 2011-03-10 | 15 | NA | 101 | 0.50000 |
| 1 | 2011-03-11 | 17 | NA | 101 | 0.70000 |
| 1 | 2011-03-12 | 12 | 2011-03-12 | 110 | 0.20000 |
| 1 | 2011-03-13 | 14 | NA | 110 | 0.40000 |
| 1 | 2011-03-14 | 17 | NA | 100 | 0.70000 |
| 1 | 2011-03-15 | 14 | NA | 101 | 0.40000 |
| 1 | 2011-03-16 | 17 | NA | 107 | 0.70000 |
| 1 | 2011-03-17 | 16 | NA | 104 | 0.60000 |
| 1 | 2011-03-18 | 15 | NA | 104 | NA |
| 1 | 2011-03-19 | 16 | NA | 102 | 0.06667 |
| 1 | 2011-03-20 | 17 | NA | 107 | 0.13333 |
| 1 | 2011-03-21 | 18 | NA | 104 | 0.20000 |
| 1 | 2011-03-22 | 11 | NA | 105 | -0.26667 |
| 1 | 2011-03-23 | 15 | NA | 100 | 0.00000 |
| 1 | 2011-03-24 | 12 | 2011-03-24 | 110 | -0.20000 |
| 1 | 2011-03-25 | 13 | NA | 110 | -0.13333 |
| 1 | 2011-03-26 | 15 | NA | 107 | 0.00000 |
| 2 | 2011-03-12 | 48 | NA | 300 | NA |
| 2 | 2011-03-13 | 49 | NA | 300 | NA |
| 2 | 2011-03-14 | 50 | NA | 290 | NA |
| 2 | 2011-03-15 | 57 | NA | 296 | 0.14000 |
| 2 | 2011-03-16 | 60 | NA | 297 | 0.20000 |
| 2 | 2011-03-17 | 49 | NA | 296 | -0.02000 |
| 2 | 2011-03-18 | 64 | NA | 299 | 0.28000 |
| 2 | 2011-03-19 | 63 | NA | 292 | 0.26000 |
| 2 | 2011-03-20 | 67 | 2011-03-20 | 290 | 0.34000 |
| 2 | 2011-03-21 | 70 | NA | 299 | 0.40000 |
| 2 | 2011-03-22 | 58 | NA | 295 | 0.16000 |
| 2 | 2011-03-23 | 65 | NA | 290 | 0.30000 |
| 2 | 2011-03-24 | 57 | NA | 296 | 0.14000 |
| 2 | 2011-03-25 | 55 | NA | 299 | 0.10000 |
| 2 | 2011-03-26 | 57 | NA | 299 | NA |
| 2 | 2011-03-27 | 60 | NA | 300 | NA |
| 3 | 2011-03-18 | 5 | NA | 54 | NA |
| 3 | 2011-03-19 | 10 | NA | 50 | NA |
| 3 | 2011-03-20 | 7 | NA | 53 | NA |
| 3 | 2011-03-21 | 8 | NA | 53 | NA |
| 3 | 2011-03-22 | 7 | NA | 50 | NA |
| 3 | 2011-03-23 | 8 | NA | 51 | 0.14286 |
| 3 | 2011-03-24 | 7 | NA | 52 | 0.00000 |
| 3 | 2011-03-25 | 6 | NA | 55 | -0.14286 |
| 3 | 2011-03-26 | 9 | NA | 54 | 0.28571 |
| 3 | 2011-03-27 | 9 | NA | 55 | 0.28571 |
| 3 | 2011-03-28 | 9 | 2011-03-28 | 50 | 0.28571 |
| 3 | 2011-03-29 | 6 | NA | 52 | -0.14286 |
| 3 | 2011-03-30 | 6 | NA | 53 | -0.14286 |
| 3 | 2011-03-31 | 4 | NA | 50 | -0.42857 |
| 3 | 2011-04-01 | 5 | NA | 50 | -0.28571 |
| 3 | 2011-04-02 | 8 | NA | 55 | 0.00000 |
| 3 | 2011-04-03 | 9 | NA | 55 | NA |
+----+------------+-------+------------+------------+----------------------+
This time, I would like to make a new column called BuyAndHoldWeightedMarket, which holds the market-cap-weighted average buy-and-hold return across IDs within -5 to +5 days of each event date. For example, for ID = 1, starting from 2011-03-19, BuyAndHoldWeightedMarket is calculated as the sum product of each ID's (Price(t) / Price(eventdate - 6) - 1) and that ID's market cap on that day, divided by the sum of the market caps of the participating IDs on that day.
Please check the picture below for the details; the equations are listed for each case of colored blocks.
Please note that for the uppermost BuyAndHoldWeightedMarket, IDs 2 and 3 are not involved because they begin later than 2011-03-06. For the third block (the grey area), the calculation of the weighted return only includes IDs 1 and 2 because ID 3 begins later than 2011-03-14. Also, for the last block (mixed colors), the first four rows use all three IDs, the blue area uses only IDs 2 and 3 because ID 1 ends on 2011-03-26, and the yellow block uses only ID 3 because IDs 1 and 2 end before 2011-03-28.
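To make the formula concrete, here is a minimal data.table sketch of the weighted calculation for a single three-day slice of two IDs (the base prices are hypothetical, for illustration only; this is not the full solution):
library(data.table)
# Toy slice: IDs 1 and 2 observed on the same three dates; `base` stands in
# for each ID's price on the day before its window opens (hypothetical values)
win <- data.table(
  ID        = rep(1:2, each = 3),
  Date      = rep(as.Date("2011-03-19") + 0:2, times = 2),
  Price     = c(16, 17, 18, 63, 67, 70),
  MarketCap = c(102, 107, 104, 292, 290, 299),
  base      = rep(c(15, 50), each = 3)
)
win[, ret := Price / base - 1]                                   # individual buy-and-hold return
win[, wret := sum(ret * MarketCap) / sum(MarketCap), by = Date]  # cap-weighted average per day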
Eventually, I would like to get a nice data table that looks as below.
+----+------------+-------+------------+------------+----------------------+--------------------------+
| ID | Date | Price | EventDate | Market Cap | BuyAndHoldIndividual | BuyAndHoldWeightedMarket |
+----+------------+-------+------------+------------+----------------------+--------------------------+
| 1 | 2011-03-06 | 10 | NA | 109 | NA | NA |
| 1 | 2011-03-07 | 9 | NA | 107 | -0.10000 | -0.10000 |
| 1 | 2011-03-08 | 12 | NA | 109 | 0.20000 | 0.20000 |
| 1 | 2011-03-09 | 14 | NA | 107 | 0.40000 | 0.40000 |
| 1 | 2011-03-10 | 15 | NA | 101 | 0.50000 | 0.50000 |
| 1 | 2011-03-11 | 17 | NA | 101 | 0.70000 | 0.70000 |
| 1 | 2011-03-12 | 12 | 2011-03-12 | 110 | 0.20000 | 0.20000 |
| 1 | 2011-03-13 | 14 | NA | 110 | 0.40000 | 0.40000 |
| 1 | 2011-03-14 | 17 | NA | 100 | 0.70000 | 0.70000 |
| 1 | 2011-03-15 | 14 | NA | 101 | 0.40000 | 0.40000 |
| 1 | 2011-03-16 | 17 | NA | 107 | 0.70000 | 0.70000 |
| 1 | 2011-03-17 | 16 | NA | 104 | 0.60000 | 0.60000 |
| 1 | 2011-03-18 | 15 | NA | 104 | NA | NA |
| 1 | 2011-03-19 | 16 | NA | 102 | 0.06667 | 0.11765 |
| 1 | 2011-03-20 | 17 | NA | 107 | 0.13333 | 0.10902 |
| 1 | 2011-03-21 | 18 | NA | 104 | 0.20000 | 0.17682 |
| 1 | 2011-03-22 | 11 | NA | 105 | -0.26667 | -0.07924 |
| 1 | 2011-03-23 | 15 | NA | 100 | 0.00000 | 0.07966 |
| 1 | 2011-03-24 | 12 | 2011-03-24 | 110 | -0.20000 | -0.07331 |
| 1 | 2011-03-25 | 13 | NA | 110 | -0.13333 | -0.09852 |
| 1 | 2011-03-26 | 15 | NA | 107 | 0.00000 | 0.02282 |
| 2 | 2011-03-12 | 48 | NA | 300 | NA | NA |
| 2 | 2011-03-13 | 49 | NA | 300 | NA | NA |
| 2 | 2011-03-14 | 50 | NA | 290 | NA | NA |
| 2 | 2011-03-15 | 57 | NA | 296 | 0.14000 | 0.059487331 |
| 2 | 2011-03-16 | 60 | NA | 297 | 0.20000 | 0.147029703 |
| 2 | 2011-03-17 | 49 | NA | 296 | -0.02000 | -0.030094118 |
| 2 | 2011-03-18 | 64 | NA | 299 | 0.28000 | 0.177381404 |
| 2 | 2011-03-19 | 63 | NA | 292 | 0.26000 | 0.177461929 |
| 2 | 2011-03-20 | 67 | 2011-03-20 | 290 | 0.34000 | 0.24836272 |
| 2 | 2011-03-21 | 70 | NA | 299 | 0.40000 | 0.311954459 |
| 2 | 2011-03-22 | 58 | NA | 295 | 0.16000 | 0.025352941 |
| 2 | 2011-03-23 | 65 | NA | 290 | 0.30000 | 0.192911011 |
| 2 | 2011-03-24 | 57 | NA | 296 | 0.14000 | 0.022381918 |
| 2 | 2011-03-25 | 55 | NA | 299 | 0.10000 | 0.009823098 |
| 2 | 2011-03-26 | 57 | NA | 299 | NA | NA |
| 2 | 2011-03-27 | 60 | NA | 300 | NA | NA |
| 3 | 2011-03-18 | 5 | NA | 54 | NA | NA |
| 3 | 2011-03-19 | 10 | NA | 50 | NA | NA |
| 3 | 2011-03-20 | 7 | NA | 53 | NA | NA |
| 3 | 2011-03-21 | 8 | NA | 53 | NA | NA |
| 3 | 2011-03-22 | 7 | NA | 50 | NA | NA |
| 3 | 2011-03-23 | 8 | NA | 51 | 0.14286 | 0.178343199 |
| 3 | 2011-03-24 | 7 | NA | 52 | 0.00000 | 0.010691161 |
| 3 | 2011-03-25 | 6 | NA | 55 | -0.14286 | -0.007160905 |
| 3 | 2011-03-26 | 9 | NA | 54 | 0.28571 | 0.106918456 |
| 3 | 2011-03-27 | 9 | NA | 55 | 0.28571 | 0.073405953 |
| 3 | 2011-03-28 | 9 | 2011-03-28 | 50 | 0.28571 | 0.285714286 |
| 3 | 2011-03-29 | 6 | NA | 52 | -0.14286 | -0.142857143 |
| 3 | 2011-03-30 | 6 | NA | 53 | -0.14286 | -0.142857143 |
| 3 | 2011-03-31 | 4 | NA | 50 | -0.42857 | -0.428571429 |
| 3 | 2011-04-01 | 5 | NA | 50 | -0.28571 | -0.285714286 |
| 3 | 2011-04-02 | 8 | NA | 55 | 0.00000 | 0.142857143 |
| 3 | 2011-04-03 | 9 | NA | 55 | NA | NA |
+----+------------+-------+------------+------------+----------------------+--------------------------+
I tried the following code, with the help of the previous question, but I am having a hard time figuring out how to calculate the weighted buy-and-hold return that begins around a different event date for each ID.
library(data.table)
# choose rows with no NA in EventDate and keep only ID and EventDate
events = unique(df[!is.na(EventDate), .(ID, EventDate)])
# helper column
# := is defined for use in j only. It adds, updates, or removes column(s)
# by reference. It makes no copies of any part of memory at all.
events[, eDate := EventDate]
# new (temporary) lower and upper boundary columns
df[, `:=`(s = Date - 6, e = Date + 6)]
# non-equi match
bhr = events[df, on = .(ID, EventDate >= s, EventDate <= e), nomatch = 0]
# generate the BuyHoldReturn column, by ID and EventDate
bhr2 = bhr[, .(Date, BuyHoldReturnM1 = c(NA, (Price[-1] / Price[1] - 1) * MarketCap[-1])), by = .(ID, eDate)]
# merge back to get the full data
bhr3 = bhr2[df, on = .(ID, Date), .(ID, Date, Price, EventDate = i.EventDate, BuyHoldReturnM1)]
I would be grateful if you could help.
Thank you very much in advance!

expss mdset table with both counts and percentages

For a multiple response (multiple dichotomy) set, I would like to make a simple table showing both counts and percents of valid responses.
checkall <- data.frame(
  ID = 1:60,
  q1a = sample(c(0, 1), size = 60, replace = TRUE),
  q1b = sample(c(0, 1), size = 60, replace = TRUE),
  q1c = sample(c(0, 1), size = 60, replace = TRUE),
  q1d = sample(c(0, 1), size = 60, replace = TRUE),
  q1e = sample(c(0, 1), size = 60, replace = TRUE),
  q1f = sample(c(0, 1), size = 60, replace = TRUE),
  q1g = sample(c(0, 1), size = 60, replace = TRUE),
  q1h = sample(c(0, 1), size = 60, replace = TRUE)
)
calc_cro_cpct(checkall, mdset(q1a %to% q1h))
| | #Total |
| ------------ | ------ |
| q1a | 53.3 |
| q1b | 48.3 |
| q1c | 43.3 |
| q1d | 43.3 |
| q1e | 41.7 |
| q1f | 55.0 |
| q1g | 63.3 |
| q1h | 48.3 |
| #Total cases | 60.0 |
calc_cro_cases(checkall, mdset(q1a %to% q1h))
| | #Total |
| ------------ | ------ |
| q1a | 32 |
| q1b | 29 |
| q1c | 26 |
| q1d | 26 |
| q1e | 25 |
| q1f | 33 |
| q1g | 38 |
| q1h | 29 |
| #Total cases | 60 |
Is there a way to get these into the same table, side by side?
I managed to make it work with tidyverse (without a total cases row).
library(tidyverse)
checkall %>%
  select(q1a:q1h) %>%
  summarize_all(sum, na.rm = TRUE) %>%
  pivot_longer(everything(), names_to = "Choice", values_to = "count") %>%
  mutate(percent = round(count / nrow(checkall), 3))
# A tibble: 8 x 3
Choice count percent
<chr> <dbl> <dbl>
1 q1a 32 0.533
2 q1b 29 0.483
3 q1c 26 0.433
4 q1d 26 0.433
5 q1e 25 0.417
6 q1f 33 0.55
7 q1g 38 0.633
8 q1h 29 0.483
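For reference, a valid-cases row can be bolted onto the tidyverse version too; in the sketch below, valid_n counts the rows with at least one box checked (my working definition of valid cases):
valid_n <- sum(rowSums(select(checkall, q1a:q1h), na.rm = TRUE) > 0)
checkall %>%
  select(q1a:q1h) %>%
  summarize_all(sum, na.rm = TRUE) %>%
  pivot_longer(everything(), names_to = "Choice", values_to = "count") %>%
  mutate(percent = round(count / valid_n, 3)) %>%
  add_row(Choice = "#Total cases", count = valid_n, percent = 1)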
But I'd like to see if expss can do it, because of the ease of computing valid percent (i.e., total cases reflects the number of observations that have at least one choice in the mdset).
I figured it out while looking at the examples for non-mdset tables:
checkall %>%
  tab_cells(mdset(q1a %to% q1h)) %>%
  # tab_cols(q11) %>%
  tab_stat_cases(label = "freq") %>%
  tab_stat_cpct(label = "col %") %>%
  tab_pivot(stat_position = "inside_columns")
| | #Total | |
| | freq | col % |
| ------------ | ------ | ----- |
| q1a | 32 | 53.3 |
| q1b | 29 | 48.3 |
| q1c | 26 | 43.3 |
| q1d | 26 | 43.3 |
| q1e | 25 | 41.7 |
| q1f | 33 | 55.0 |
| q1g | 38 | 63.3 |
| q1h | 29 | 48.3 |
| #Total cases | 60 | 60.0 |
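As a side note, if you would rather interleave counts and percentages row by row instead of side by side, the same pipeline with a different stat_position should work (untested sketch; "inside_rows" is one of tab_pivot's documented positions):
checkall %>%
  tab_cells(mdset(q1a %to% q1h)) %>%
  tab_stat_cases(label = "freq") %>%
  tab_stat_cpct(label = "col %") %>%
  tab_pivot(stat_position = "inside_rows")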

make a table in expss that shows both freq and cpct but only tests cpct on the cpct columns

Using this data set with a multiple dichotomy set and a group:
set.seed(14)
checkall <- data.frame(
  ID = 1:200,
  group = sample(c("A", "B", "C"), size = 200, replace = TRUE),
  q1a = sample(c(0, 1), size = 200, replace = TRUE),
  q1b = sample(c(0, 1), size = 200, replace = TRUE),
  q1c = sample(c(0, 1), size = 200, replace = TRUE),
  q1d = sample(c(0, 1), size = 200, replace = TRUE),
  q1e = sample(c(0, 1), size = 200, replace = TRUE),
  q1f = sample(c(0, 1), size = 200, replace = TRUE),
  q1g = sample(c(0, 1), size = 200, replace = TRUE),
  q1h = sample(c(0, 1), size = 200, replace = TRUE)
)
# doctor some responses to be related to group
checkall$q1c[checkall$group == "A"] <- sample(c(0, 1, 1, 1), size = sum(checkall$group == "A"), replace = TRUE)
checkall$q1e[checkall$group == "A"] <- sample(c(0, 0, 0, 1), size = sum(checkall$group == "A"), replace = TRUE)
I would like to make a table that shows frequencies and column percents like this:
library(dplyr)
if (!require(expss)) { install.packages("expss", dependencies = TRUE); library(expss) }
checkall %>%
  tab_cells(mdset(q1a %to% q1h)) %>%
  tab_cols(total(), group) %>%
  tab_stat_cases(label = "freq") %>%
  tab_stat_cpct(label = "col %") %>%
  tab_pivot(stat_position = "inside_columns")
| | #Total | | group | | | | | |
| | freq | col % | A | | B | | C | |
| | | | freq | col % | freq | col % | freq | col % |
| ------------ | ------ | ----- | ----- | ----- | ---- | ----- | ---- | ----- |
| q1a | 101 | 50.8 | 33 | 47.8 | 36 | 51.4 | 32 | 53.3 |
| q1b | 92 | 46.2 | 34 | 49.3 | 29 | 41.4 | 29 | 48.3 |
| q1c | 111 | 55.8 | 53 | 76.8 | 30 | 42.9 | 28 | 46.7 |
| q1d | 89 | 44.7 | 35 | 50.7 | 30 | 42.9 | 24 | 40.0 |
| q1e | 100 | 50.3 | 19 | 27.5 | 43 | 61.4 | 38 | 63.3 |
| q1f | 89 | 44.7 | 34 | 49.3 | 36 | 51.4 | 19 | 31.7 |
| q1g | 97 | 48.7 | 29 | 42.0 | 33 | 47.1 | 35 | 58.3 |
| q1h | 113 | 56.8 | 40 | 58.0 | 36 | 51.4 | 37 | 61.7 |
| #Total cases | 199 | 199.0 | 69 | 69.0 | 70 | 70.0 | 60 | 60.0 |
But I would like to add the notations that compare the cpct values to that in the first column. I can get that on a table with just cpct values like this:
checkall %>%
  tab_cells(mdset(q1a %to% q1h)) %>%
  tab_cols(total(), group) %>%
  tab_stat_cpct(label = "col %") %>%
  tab_pivot(stat_position = "inside_columns") %>%
  significance_cpct(compare_type = "first_column")
| | #Total | group | | |
| | col % | A | B | C |
| | | col % | col % | col % |
| ------------ | ------ | ------ | ----- | ----- |
| q1a | 50.8 | 47.8 | 51.4 | 53.3 |
| q1b | 46.2 | 49.3 | 41.4 | 48.3 |
| q1c | 55.8 | 76.8 + | 42.9 | 46.7 |
| q1d | 44.7 | 50.7 | 42.9 | 40.0 |
| q1e | 50.3 | 27.5 - | 61.4 | 63.3 |
| q1f | 44.7 | 49.3 | 51.4 | 31.7 |
| q1g | 48.7 | 42.0 | 47.1 | 58.3 |
| q1h | 56.8 | 58.0 | 51.4 | 61.7 |
| #Total cases | 199 | 69 | 70 | 60 |
Is there a way to get the + and - notations onto the first table, in just the cpct columns? If I mix in tab_stat_cases(label = "freq") and then apply significance_cpct(compare_type = "first_column"), I get a weird table that tries to compare both the freq and cpct columns to the first column:
checkall %>%
  tab_cells(mdset(q1a %to% q1h)) %>%
  tab_cols(total(), group) %>%
  tab_stat_cases(label = "freq") %>%
  tab_stat_cpct(label = "col %") %>%
  tab_pivot(stat_position = "inside_columns") %>%
  significance_cpct(compare_type = "first_column")
| | #Total | | group | | | | | |
| | freq | col % | A | | B | | C | |
| | | | freq | col % | freq | col % | freq | col % |
| ------------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| q1a | 101.0 | 50.8 - | 33.0 - | 47.8 - | 36.0 - | 51.4 - | 32.0 - | 53.3 - |
| q1b | 92.0 | 46.2 - | 34.0 - | 49.3 - | 29.0 - | 41.4 - | 29.0 - | 48.3 - |
| q1c | 111.0 | 55.8 - | 53.0 - | 76.8 | 30.0 - | 42.9 - | 28.0 - | 46.7 - |
| q1d | 89.0 | 44.7 - | 35.0 - | 50.7 - | 30.0 - | 42.9 - | 24.0 - | 40.0 - |
| q1e | 100.0 | 50.3 - | 19.0 - | 27.5 - | 43.0 - | 61.4 - | 38.0 - | 63.3 - |
| q1f | 89.0 | 44.7 - | 34.0 - | 49.3 - | 36.0 - | 51.4 - | 19.0 - | 31.7 - |
| q1g | 97.0 | 48.7 - | 29.0 - | 42.0 - | 33.0 - | 47.1 - | 35.0 - | 58.3 - |
| q1h | 113.0 | 56.8 - | 40.0 - | 58.0 - | 36.0 - | 51.4 - | 37.0 - | 61.7 |
| #Total cases | 199 | 199 | 69 | 69 | 70 | 70 | 60 | 60 |
I'm looking for the top table with the + and - notation as below:
| | #Total | | group | | | | | |
| | freq | col % | A | | B | | C | |
| | | | freq | col % | freq | col % | freq | col % |
| ------------ | ------ | ----- | ----- | ----- | ---- | ----- | ---- | ----- |
| q1a | 101 | 50.8 | 33 | 47.8 | 36 | 51.4 | 32 | 53.3 |
| q1b | 92 | 46.2 | 34 | 49.3 | 29 | 41.4 | 29 | 48.3 |
| q1c | 111 | 55.8 | 53 | 76.8 +| 30 | 42.9 | 28 | 46.7 |
| q1d | 89 | 44.7 | 35 | 50.7 | 30 | 42.9 | 24 | 40.0 |
| q1e | 100 | 50.3 | 19 | 27.5 -| 43 | 61.4 | 38 | 63.3 |
| q1f | 89 | 44.7 | 34 | 49.3 | 36 | 51.4 | 19 | 31.7 |
| q1g | 97 | 48.7 | 29 | 42.0 | 33 | 47.1 | 35 | 58.3 |
| q1h | 113 | 56.8 | 40 | 58.0 | 36 | 51.4 | 37 | 61.7 |
| #Total cases | 199 | 199.0 | 69 | 69.0 | 70 | 70.0 | 60 | 60.0 |
There is a special function for such a case - tab_last_sig_cpct - which is applied only to the last calculation:
checkall %>%
  tab_cells(mdset(q1a %to% q1h)) %>%
  tab_cols(total(), group) %>%
  tab_stat_cases(label = "freq") %>%
  tab_stat_cpct(label = "col %") %>%
  tab_last_sig_cpct(compare_type = "first_column") %>%
  tab_pivot(stat_position = "inside_columns")

Varargs is giving key error in Julia

Consider the following table:
julia> using RDatasets, DataFrames
julia> anscombe = dataset("datasets","anscombe")
11x8 DataFrame
| Row | X1 | X2 | X3 | X4 | Y1 | Y2 | Y3 | Y4 |
|-----|----|----|----|----|-------|------|-------|------|
| 1 | 10 | 10 | 10 | 8 | 8.04 | 9.14 | 7.46 | 6.58 |
| 2 | 8 | 8 | 8 | 8 | 6.95 | 8.14 | 6.77 | 5.76 |
| 3 | 13 | 13 | 13 | 8 | 7.58 | 8.74 | 12.74 | 7.71 |
| 4 | 9 | 9 | 9 | 8 | 8.81 | 8.77 | 7.11 | 8.84 |
| 5 | 11 | 11 | 11 | 8 | 8.33 | 9.26 | 7.81 | 8.47 |
| 6 | 14 | 14 | 14 | 8 | 9.96 | 8.1 | 8.84 | 7.04 |
| 7 | 6 | 6 | 6 | 8 | 7.24 | 6.13 | 6.08 | 5.25 |
| 8 | 4 | 4 | 4 | 19 | 4.26 | 3.1 | 5.39 | 12.5 |
| 9 | 12 | 12 | 12 | 8 | 10.84 | 9.13 | 8.15 | 5.56 |
| 10 | 7 | 7 | 7 | 8 | 4.82 | 7.26 | 6.42 | 7.91 |
| 11 | 5 | 5 | 5 | 8 | 5.68 | 4.74 | 5.73 | 6.89 |
I have defined a function as follows:
julia> f1(df, matchval, matchfield, qfields...) = isempty(qfields)
WARNING: Method definition f1(Any, Any, Any, Any...) in module Main at REPL[314]:1 overwritten at REPL[317]:1.
f1 (generic function with 3 methods)
Now, here is the problem:
julia> f1(anscombe, 11, "X1")
ERROR: KeyError: key :field not found
in getindex at ./dict.jl:697 [inlined]
in getindex(::DataFrames.Index, ::Symbol) at /home/arghya/.julia/v0.5/DataFrames/src/other/index.jl:114
in getindex at /home/arghya/.julia/v0.5/DataFrames/src/dataframe/dataframe.jl:228 [inlined]
in f1(::DataFrames.DataFrame, ::Int64, ::String) at ./REPL[249]:2
Where am I going wrong? FYI, I'm using Julia version 0.5.2. How can I overcome this problem? Thanks in advance!
There is nothing wrong with your code - try running just what you've posted in a fresh session. You have probably defined another f1 method before (note the REPL reports "f1 (generic function with 3 methods)"). If you come from R, you may assume the earlier definition is overwritten by f1(df, matchval, matchfield, qfields...) = isempty(qfields), while in fact you're just defining a new method for the f1 function. The error is most likely thrown by a more specific 3-argument method you defined earlier, which dispatch prefers for the call f1(anscombe, 11, "X1"); running methods(f1) will list the competing definitions. Look at https://docs.julialang.org/en/stable/manual/methods/

Mimic tabulate command from Stata in R

I'm trying to get a 2-way table in R similar to this one from Stata. I was trying to use CrossTable from the gmodels package, but the table is not the same. Do you know how this can be done in R?
I hope at least to get the frequencies for cursmoke1 == "Yes" & cursmoke2 == "No" and the reverse.
In R I'm only getting totals for yes, no and NA.
Here is the output:
Stata
. tabulate cursmoke1 cursmoke2, cell column miss row
+-------------------+
| Key |
|-------------------|
| frequency |
| row percentage |
| column percentage |
| cell percentage |
+-------------------+
Current |
smoker, | Current smoker, exam 2
exam 1 | No Yes . | Total
-----------+---------------------------------+----------
No | 1,898 131 224 | 2,253
| 84.24 5.81 9.94 | 100.00
| 86.16 7.59 44.44 | 50.81
| 42.81 2.95 5.05 | 50.81
-----------+---------------------------------+----------
Yes | 305 1,596 280 | 2,181
| 13.98 73.18 12.84 | 100.00
| 13.84 92.41 55.56 | 49.19
| 6.88 35.99 6.31 | 49.19
-----------+---------------------------------+----------
Total | 2,203 1,727 504 | 4,434
| 49.68 38.95 11.37 | 100.00
| 100.00 100.00 100.00 | 100.00
| 49.68 38.95 11.37 | 100.00
R
> CrossTable(cursmoke2, cursmoke1, missing.include = T, format="SAS")
Cell Contents
|-------------------------|
| N |
| Chi-square contribution |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 4434
| cursmoke1
cursmoke2 | No | Yes | NA | Row Total |
-------------|-----------|-----------|-----------|-----------|
No | 2203 | 0 | 0 | 2203 |
| 1122.544 | 858.047 | 250.409 | |
| 1.000 | 0.000 | 0.000 | 0.497 |
| 1.000 | 0.000 | 0.000 | |
| 0.497 | 0.000 | 0.000 | |
-------------|-----------|-----------|-----------|-----------|
Yes | 0 | 1727 | 0 | 1727 |
| 858.047 | 1652.650 | 196.303 | |
| 0.000 | 1.000 | 0.000 | 0.389 |
| 0.000 | 1.000 | 0.000 | |
| 0.000 | 0.389 | 0.000 | |
-------------|-----------|-----------|-----------|-----------|
NA | 0 | 0 | 504 | 504 |
| 250.409 | 196.303 | 3483.288 | |
| 0.000 | 0.000 | 1.000 | 0.114 |
| 0.000 | 0.000 | 1.000 | |
| 0.000 | 0.000 | 0.114 | |
-------------|-----------|-----------|-----------|-----------|
Column Total | 2203 | 1727 | 504 | 4434 |
| 0.497 | 0.389 | 0.114 | |
-------------|-----------|-----------|-----------|-----------|
Maybe I'm missing something here. The default settings for CrossTable seem to provide essentially what you are looking for.
Here is CrossTable with minimal arguments. (I've loaded the dataset as "temp".) Note that the results are the same as what you posted from the Stata output (you just need to multiply by 100 if you want the result as a percentage).
library(gmodels)
with(temp, CrossTable(cursmoke1, cursmoke2, missing.include=TRUE))
Cell Contents
|-------------------------|
| N |
| Chi-square contribution |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 4434
| cursmoke2
cursmoke1 | No | Yes | NA | Row Total |
-------------|-----------|-----------|-----------|-----------|
No | 1898 | 131 | 224 | 2253 |
| 541.582 | 635.078 | 4.022 | |
| 0.842 | 0.058 | 0.099 | 0.508 |
| 0.862 | 0.076 | 0.444 | |
| 0.428 | 0.030 | 0.051 | |
-------------|-----------|-----------|-----------|-----------|
Yes | 305 | 1596 | 280 | 2181 |
| 559.461 | 656.043 | 4.154 | |
| 0.140 | 0.732 | 0.128 | 0.492 |
| 0.138 | 0.924 | 0.556 | |
| 0.069 | 0.360 | 0.063 | |
-------------|-----------|-----------|-----------|-----------|
Column Total | 2203 | 1727 | 504 | 4434 |
| 0.497 | 0.389 | 0.114 | |
-------------|-----------|-----------|-----------|-----------|
Alternatively, you can use format="SPSS" if you want the numbers displayed as percentages.
with(temp, CrossTable(cursmoke1, cursmoke2, missing.include=TRUE, format="SPSS"))
Cell Contents
|-------------------------|
| Count |
| Chi-square contribution |
| Row Percent |
| Column Percent |
| Total Percent |
|-------------------------|
Total Observations in Table: 4434
| cursmoke2
cursmoke1 | No | Yes | NA | Row Total |
-------------|-----------|-----------|-----------|-----------|
No | 1898 | 131 | 224 | 2253 |
| 541.582 | 635.078 | 4.022 | |
| 84.243% | 5.814% | 9.942% | 50.812% |
| 86.155% | 7.585% | 44.444% | |
| 42.806% | 2.954% | 5.052% | |
-------------|-----------|-----------|-----------|-----------|
Yes | 305 | 1596 | 280 | 2181 |
| 559.461 | 656.043 | 4.154 | |
| 13.984% | 73.177% | 12.838% | 49.188% |
| 13.845% | 92.415% | 55.556% | |
| 6.879% | 35.995% | 6.315% | |
-------------|-----------|-----------|-----------|-----------|
Column Total | 2203 | 1727 | 504 | 4434 |
| 49.684% | 38.949% | 11.367% | |
-------------|-----------|-----------|-----------|-----------|
Update: prop.table()
Just FYI (to save you the tedious work of building your own data.frame as you did), you may also be interested in the prop.table() function.
Again, using the data you linked to and assuming it is named "temp", the following gives you the underlying data from which you can construct your data.frame. You may also be interested in looking into the functions margin.table() or addmargins():
## Your basic table
CurSmoke <- with(temp, table(cursmoke1, cursmoke2, useNA = "ifany"))
CurSmoke
# cursmoke2
# cursmoke1 No Yes <NA>
# No 1898 131 224
# Yes 305 1596 280
## Row proportions
prop.table(CurSmoke, 1) # * 100 # If you so desire
# cursmoke2
# cursmoke1 No Yes <NA>
# No 0.84243231 0.05814470 0.09942299
# Yes 0.13984411 0.73177442 0.12838148
## Column proportions
prop.table(CurSmoke, 2) # * 100 # If you so desire
# cursmoke2
# cursmoke1 No Yes <NA>
# No 0.86155243 0.07585408 0.44444444
# Yes 0.13844757 0.92414592 0.55555556
## Cell proportions
prop.table(CurSmoke) # * 100 # If you so desire
# cursmoke2
# cursmoke1 No Yes <NA>
# No 0.42805593 0.02954443 0.05051872
# Yes 0.06878665 0.35994587 0.06314840
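As a further convenience, addmargins() appends the row and column totals that Stata prints; a quick sketch using the CurSmoke table built above:
## Counts with row and column totals
addmargins(CurSmoke)
## Cell percentages with totals
round(addmargins(prop.table(CurSmoke)) * 100, 2)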
