Extracting basic statistics using aggregate inside a function in R

I have created a function to extract basic statistics such as mean, median, mode, SD, and variance, depending on what the user wants. E.g., if the user wants to see only the mean, only the mean should be calculated. So the statistic is passed as an argument.
The code is:
countfunc<-function(dset,Xaxis,Color,Groupby,AggValue){
S1=select(dset,Xaxis,Color,Groupby)
S2=unique(S1)
str(S2)
stackval5<-aggregate(Groupby~Xaxis+Color,data=S2,FUN=AggValue)
return(stackval5)
}
countfunc(sbarr,"workclass","sex","age","mean")
Sample data :
> dput(head(S1,20))
structure(list(workclass = structure(c(8L, 7L, 5L, 5L, 5L, 5L,
5L, 7L, 5L, 5L, 5L, 8L, 5L, 5L, 5L, 5L, 7L, 5L, 5L, 7L), .Label = c(" Federal-gov",
" Local-gov", " NA", " Never-worked", " Private", " Self-emp-inc",
" Self-emp-not-inc", " State-gov", " Without-pay"), class = "factor"),
sex = structure(c(2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L,
2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L), .Label = c(" Female",
" Male"), class = "factor"), age = c(39L, 50L, 38L, 53L,
28L, 37L, 49L, 52L, 31L, 42L, 37L, 30L, 23L, 32L, 40L, 34L,
25L, 32L, 38L, 43L)), .Names = c("workclass", "sex", "age"
), row.names = c(NA, 20L), class = "data.frame")
But when I run the function, it throws the message "In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA", even though the "age" column is an integer. I tried converting it to numeric as well.
str of my data frame:
'data.frame': 886 obs. of 3 variables:
$ workclass: Factor w/ 9 levels " Federal-gov",..: 8 7 5 5 5 5 5 7 5 5 ...
$ sex : Factor w/ 2 levels " Female"," Male": 2 2 2 2 1 1 1 2 1 2 ...
$ age : int 39 50 38 53 28 37 49 52 31 42 ...
Xaxis Color Groupby
1 workclass sex NA
If I hard-code the values (aggregate(age~workclass+sex,data=S1,FUN=mean)), it works as expected. It would be a great help if you could guide me or share some thoughts on what I am doing wrong here. Thanks in advance.

Try the following.
countfunc<-function(dset,Xaxis,Color,Groupby,AggValue){
S1=select(dset,Xaxis,Color,Groupby)
S2=unique(S1)
stackval5 <- aggregate(S2[[Groupby]], list(S2[[Xaxis]], S2[[Color]]), FUN = AggValue)
names(stackval5) <- c(Xaxis, Color, Groupby)
stackval5
}
countfunc(sbarr,"workclass","sex","age","mean")
workclass sex age
1 Private Female 33.60000
2 Self-emp-not-inc Female 43.00000
3 Private Male 39.42857
4 Self-emp-not-inc Male 42.33333
5 State-gov Male 34.50000
What you were doing wrong was the formula. aggregate was looking for the values of the variables Xaxis, Color and Groupby, which were, respectively, "workclass", "sex", and "age". Since the value "age" is neither numeric nor logical, it would return NA. (It would do mean("age") and return NA.)
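If you would rather keep the formula interface, one option is to build the formula from the strings. This is a minimal sketch, not the answer's method: `countfunc2` and the `toy` data frame are hypothetical names used only for illustration, and `reformulate()` from base R's stats package does the string-to-formula step.

```r
# Sketch: build age ~ workclass + sex from the character arguments
countfunc2 <- function(dset, Xaxis, Color, Groupby, AggValue) {
  S2 <- unique(dset[, c(Xaxis, Color, Groupby)])
  # reformulate() turns the column names into a real formula
  fml <- reformulate(c(Xaxis, Color), response = Groupby)
  # aggregate() resolves a function name given as a string via match.fun()
  aggregate(fml, data = S2, FUN = AggValue)
}

# Hypothetical toy data, not the poster's sbarr
toy <- data.frame(
  workclass = c("Private", "Private", "State-gov"),
  sex       = c("Female", "Female", "Male"),
  age       = c(28, 37, 39)
)

res <- countfunc2(toy, "workclass", "sex", "age", "mean")
print(res)

# The original failure in miniature: the mean of a character string is NA
na_demo <- suppressWarnings(mean("age"))
```

Because the formula is constructed from the argument values, aggregate sees the actual column names instead of the argument names.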

Related

gtsummary modified cross tab

I need help writing gtsummary R code to produce the table output shown in the dummy table (originally posted as an image).
library(gtsummary)
id  age  sex  country  edu  ln  ivds  n2  p5
1   a    M    eng      x    45  15    40  15
2   a    M    eng      x    23  26    70  15
4   a    M    eng      x    26  36    35  40
5   b    F    eng      x    26  25    36  47
6   b    F    wal      y    45  45    60  12
7   b    M    wal      y    60  25    36  15
8   c    M    wal      y    70  08    25  36
9   c    F    sco      z    80  25    36  15
10  c    F    sco      z    90  25    26  39
structure(list(id = 1:15, age = structure(c(1L, 1L, 2L, 1L, 2L,
2L, 2L, 3L, 3L, 3L, 1L, 1L, 2L, 1L, 2L), .Label = c("a", "b",
"c"), class = "factor"), sex = structure(c(2L, 1L, 2L, 2L, 2L,
1L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L), .Label = c("F", "M"), class = "factor"),
country = structure(c(1L, 1L, 1L, 1L, 3L, 3L, 3L, 2L, 2L,
2L, 1L, 1L, 1L, 1L, 3L), .Label = c("eng", "scot", "wale"
), class = "factor"), edu = structure(c(1L, 1L, 1L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 2L, 2L), .Label = c("x",
"y", "z"), class = "factor"), lon = c(45L, 23L,
25L, 45L, 70L, 69L, 90L, 50L, 62L, 45L, 23L, 25L, 45L, 70L,
69L), is = c(15L, 26L, 36L, 34L, 2L, 4L, 5L, 8L, 9L,
15L, 26L, 36L, 34L, 2L, 4L), n2 = c(40L, 70L, 50L, 60L,
30L, 25L, 80L, 89L, 10L, 40L, 70L, 50L, 60L, 30L, 25L), p5 = c(15L,
20L, 36L, 48L, 25L, 36L, 28L, 15L, 25L, 15L, 20L, 36L, 48L,
25L, 36L)), row.names = c(NA, 15L), class = "data.frame")
I made a table similar to what you have above (more similar to the table you had before you updated it). But I think it'll get you most of the way there.
The type of table you're requesting is something that is in the works. In the meantime, you will need to use the bstfun::tbl_2way_summary() function. This function lives in another package while we work to improve it before integrating it with gtsummary.
library(bstfun) # install with `remotes::install_github("ddsjoberg/bstfun")`
library(gtsummary)
packageVersion("gtsummary")
#> [1] '1.4.1'
# add a column that is all the same value
trial2 <- trial %>% mutate(constant = TRUE)
# loop over each continuous variable, construct table, then merge them together
tbls_row1 <-
c("age", "marker", "ttdeath") %>%
purrr::map(
~tbl_2way_summary(data = trial2, row = grade, col = constant, con = all_of(.x),
statistic = "{mean} ({sd}) - {min}, {max}") %>%
modify_header(stat_1 = paste0("**", .x, "**"))
) %>%
tbl_merge() %>%
modify_spanning_header(everything() ~ NA)
# repeat for the second row
tbls_row2 <-
c("age", "marker", "ttdeath") %>%
purrr::map(
~tbl_2way_summary(data = trial2, row = stage, col = constant, con = all_of(.x),
statistic = "{mean} ({sd}) - {min}, {max}") %>%
modify_header(stat_1 = paste0("**", .x, "**"))
) %>%
tbl_merge() %>%
modify_spanning_header(everything() ~ NA)
# stack these tables
tbl_stacked <- tbl_stack(list(tbls_row1, tbls_row2))
# lastly, add calculated summary stats for categorical variables, and merge them
tbl_summary_stats <-
trial2 %>%
tbl_summary(
include = c(grade, stage),
missing = "no"
) %>%
modify_header(stat_0 ~ "**n (%)**") %>%
modify_footnote(everything() ~ NA)
tbl_final <-
tbl_merge(list(tbl_summary_stats, tbl_stacked)) %>%
modify_spanning_header(everything() ~ NA) %>%
# column spanning column headers
modify_spanning_header(
list(c(stat_1_1_2, stat_1_2_2) ~ "**Group 1**",
stat_1_3_2 ~ "**Group 2**")
)
Created on 2021-07-10 by the reprex package (v2.0.0)

Data wrangling: Convert long format of sequences to wide format of specific subsequences

I am studying state changes for multiple subject sessions. My original dataset was formatted long, with a separate row for each state for each session. Something like:
Session StateCount State
1 0 B
1 1 C
1 2 B
1 3 B
1 4 A
… … …
56 26 A
56 27 B
Using the tidyr spread function in R...
d0_Spread <- spread(d0, key = "StateCount", value = "State")
...I was able to convert the data to a wide format (1 row per session, 1 column per state in the session), which was needed for some sequence-based calculations:
Session 0 1 2 3 4 … [Nth State of Longest Session]
1 B C B B A … NA
… … … … … … … …
56 A C B C C … NA
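For readers without tidyr, the same long-to-wide step can be sketched in base R with `reshape()`. The toy `d0` below is hypothetical and far smaller than the real data; note that base R names the new columns `State.0`, `State.1`, and so on, rather than `0`, `1`.

```r
# Hypothetical miniature of the long-format data
d0 <- data.frame(
  Session    = c(1, 1, 2),
  StateCount = c(0, 1, 0),
  State      = c("B", "C", "A"),
  stringsAsFactors = FALSE
)

# One row per Session, one State.<StateCount> column per position
d0_wide <- reshape(d0, idvar = "Session", timevar = "StateCount",
                   direction = "wide")
print(d0_wide)
```

Sessions shorter than the longest one are padded with NA, as in the wide table above.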
I now want to run the same calculations on two different types of subsequences within a given session (rather than across an entire session). However, I'm having difficulty wrangling the required data into wide format.
Below is the dput() output for a representative sample of the new long-format dataset:
structure(list(Session = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 56L, 56L, 56L, 56L, 56L, 56L, 56L, 56L, 56L
), StateCount = c(0L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L,
11L, 12L, 13L, 0L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L), State = structure(c(2L,
3L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 3L,
1L, 2L, 1L, 2L, 1L, 2L), .Label = c("A", "B", "C"), class = "factor"),
State_AType = structure(c(5L, 6L, 5L, 1L, 5L, 1L, 5L, 5L,
2L, 5L, 1L, 5L, 6L, 4L, 1L, 5L, 6L, 3L, 5L, 2L, 5L, 1L, 5L
), .Label = c("A", "A_L", "A_S", "A4", "B", "C"), class = "factor"),
State_AType_629Num = structure(c(9L, 10L, 9L, 3L, 9L, 4L,
9L, 9L, 1L, 9L, 5L, 9L, 10L, 6L, 7L, 9L, 10L, 2L, 9L, 1L,
9L, 8L, 9L), .Label = c("A_L", "A_S", "A1", "A2", "A3", "A4",
"A628", "A629", "B", "C"), class = "factor"), StateDuration = c(1L,
1L, 37L, 147L, 32L, 42L, 24L, 2L, 8L, 1L, 17L, 8L, 8L, 2L,
297L, 1L, 11L, 73L, 27L, 28L, 46L, 14L, 127L)), class = "data.frame", row.names = c(NA,
-23L))
It, like the first dataset, has a row for each state within each session. It contains the same data columns as my first dataset (e.g., Session, StateCount, State), but also a few additional columns:
In addition to State (which codes each state as either "A", "B", or "C"), there is State_AType, which differentiates between a normal "A" (of which there are 629) and two alternative A types ("A_S" and "A_L"), a distinction that matters below.
The State_AType_629Num column repeats State_AType, but it provides a unique identifier for each of the 629 normal "A" states (e.g., A1, A2...A629), in case that is helpful for the below.
The StateDuration column captures the time duration of the state (needed for the below).
Using this dataset, I now need to create the following wide formats for the intended subsequence analyses:
1) First state to each normal "A" state - a wide format in which each row represents a sequence from the first state of a session to each "A" state of the session. There would be 629 such rows (one for each normal "A" only; that is, not for "A_S" or "A_L" states). Each column value would be a state within the sequence. Complicating things a bit further, I'd also like to calculate the total duration of the sequence. This means the StateDuration variable would also need to be accounted for, so that each row of the output notes the sum of StateDuration for all of the states in that sequence. For the sample data provided above, the desired data frame would be structured as follows, producing the listed values:
SeqEndpnt 0 1 2 3 4 5 … [Nth State of Longest Sequence] SumDuration
A1 B C B NA NA NA … NA 39
A2 B C B A B A … NA 218
A3 B C B A B A … NA 295
A4 B C B A B A … NA 328
… … … … … … … … … …
A628 NA NA NA NA NA NA … NA NA
A629 A B C A_S B A_L … NA 483
2) Previous "A" or "A_L" state to next "A" state - a wide format in which each row represents a sequence from the previous "A" or "A_L" (whichever was more recent) to the next normal "A" (importantly, an "A_L" state could begin a sequence but cannot terminate it). As above, there would be 629 such rows, with the columns again representing a state in the sequence. The first sequence within a given session would likely be blank, as there would be no "A" previous to the first "A"; the exception would be if there were an "A_L" that preceded the first "A". For the sample data provided above, the desired data frame would be structured as follows, producing the listed values:
SeqEndpnt 0 1 2 3 4 5 … [Nth State of Longest Sequence] SumDuration
A1 NA NA NA NA NA NA … NA NA
A2 B NA NA NA NA NA … NA 32
A3 B B A_L B NA NA … NA 35
A4 B C NA NA NA NA … NA 16
… … … … … … … … … …
A628 NA NA NA NA NA NA … NA NA
A629 B C A_S B A_L B … NA 186
I assume the solution this time will require a combination of the spread function used earlier with some for loops. However, I'm not especially experienced with the latter, particularly in the context of iteratively adding values to a data frame. Alternatively, I thought reshape2, group_by, or mutate might help with producing the desired sets, but I am not familiar with them.
Any help would be extremely appreciated!
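One possible starting point for the first layout is sketched below in base R, on hypothetical toy data rather than the posted sample. The `Label` column stands in for State_AType_629Num, so the sequences keep the numbered labels (map them back to plain "A" if needed). It assumes, as the SumDuration values in the example suggest (39 = 1 + 1 + 37 for A1), that each sequence covers the states before the endpoint and not the endpoint itself.

```r
# Hypothetical toy data in the same long layout
d <- data.frame(
  Session       = c(1, 1, 1, 1, 1, 2, 2, 2),
  StateCount    = c(0, 1, 2, 3, 4, 0, 1, 2),
  Label         = c("B", "C", "A1", "B", "A2", "A3", "B", "A4"),
  StateDuration = c(1, 1, 37, 5, 10, 4, 6, 2),
  stringsAsFactors = FALSE
)

# Sequence endpoints: labels of the form A<number> (excludes A_S and A_L)
ends <- which(grepl("^A[0-9]+$", d$Label))

# For each endpoint, collect the earlier states of the same session
seqs <- lapply(ends, function(i) {
  prior <- d[d$Session == d$Session[i] & d$StateCount < d$StateCount[i], ]
  prior <- prior[order(prior$StateCount), ]
  list(end    = d$Label[i],
       states = prior$Label,
       total  = if (nrow(prior) > 0) sum(prior$StateDuration) else NA_real_)
})

max_len <- max(vapply(seqs, function(s) length(s$states), integer(1)))

# Pad each sequence with NA to the longest length and bind into wide form
wide <- do.call(rbind, lapply(seqs, function(s) {
  padded <- c(s$states, rep(NA_character_, max_len - length(s$states)))
  cols <- as.data.frame(t(padded), stringsAsFactors = FALSE)
  names(cols) <- seq_len(max_len) - 1   # columns "0", "1", "2", ...
  cbind(SeqEndpnt = s$end, cols, SumDuration = s$total,
        stringsAsFactors = FALSE)
}))
print(wide)
```

The second layout could follow the same pattern by changing only how `prior` is selected (rows after the most recent "A" or "A_L" instead of all earlier rows of the session).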

R ggplot2 - How to plot 2 boxplots on the same x value

Suppose I have two boxplots.
trial1 <- ggplot(completionTime, aes(fill=Condition, x=Scenario, y=Trial1))
trial1 + geom_boxplot()+geom_point(position=position_dodge(width=0.75)) + ylim(0, 160)
trial2 <- ggplot(completionTime, aes(fill=Condition, x=Scenario, y=Trial2))
trial2 + geom_boxplot()+geom_point(position=position_dodge(width=0.75)) + ylim(0, 160)
How can I plot Trial1 and Trial2 on the same plot, at the same respective X? They have the same range of y.
I looked at geom_boxplot(position="identity"), but that plots the two conditions (fill) on the same X.
I want to plot two y columns on the same X.
Edit: the dataset
User Condition Scenario Trial1 Trial2
1 1 ME a 67 41
2 1 ME b 70 42
3 1 ME c 40 15
4 1 ME d 65 23
5 1 ME e 45 45
6 1 SE a 100 34
7 1 SE b 54 23
8 1 SE c 70 23
9 1 SE d 56 15
10 1 SE e 30 20
11 2 ME a 42 23
12 2 ME b 22 12
13 2 ME c 28 8
14 2 ME d 22 8
15 2 ME e 38 37
16 2 SE a 59 18
17 2 SE b 65 14
18 2 SE c 75 7
19 2 SE d 37 9
20 2 SE e 31 7
dput()
structure(list(User = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), Condition = structure(c(1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L), .Label = c("ME", "SE"), class = "factor"), Scenario =
structure(c(1L,
2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L,
3L, 4L, 5L), .Label = c("a", "b", "c", "d", "e"), class = "factor"),
Trial1 = c(67L, 70L, 40L, 65L, 45L, 100L, 54L, 70L, 56L,
30L, 42L, 22L, 28L, 22L, 38L, 59L, 65L, 75L, 37L, 31L), Trial2 = c(41L,
42L, 15L, 23L, 45L, 34L, 23L, 23L, 15L, 20L, 23L, 12L, 8L,
8L, 37L, 18L, 14L, 7L, 9L, 7L)), .Names = c("User", "Condition",
"Scenario", "Trial1", "Trial2"), class = "data.frame", row.names = c(NA,
-20L))
You could try using interaction to combine two of your factors and plot against a third. For example, assuming you want to fill by condition as in your original code:
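library(tidyr)
library(ggplot2)
completionTime %>%
  # stack Trial1 and Trial2 into one 'value' column, labelled by 'trial'
  gather(trial, value, -Scenario, -Condition, -User) %>%
  # one x position per Scenario/trial combination, filled by Condition
  ggplot(aes(interaction(Scenario, trial), value)) + geom_boxplot(aes(fill = Condition))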
Result: (plot image not included)
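To see what the gather and interaction steps produce without any packages, here is a rough base R sketch on a hypothetical two-row slice of the data (`ct`, `long`, and `group` are illustrative names, not part of the answer).

```r
# Hypothetical two-row slice of completionTime
ct <- data.frame(
  User      = c(1, 1),
  Condition = c("ME", "SE"),
  Scenario  = c("a", "a"),
  Trial1    = c(67, 100),
  Trial2    = c(41, 34)
)

# Stack the two trial columns: each original row appears once per trial
long <- data.frame(
  ct[rep(seq_len(nrow(ct)), times = 2), c("User", "Condition", "Scenario")],
  trial = rep(c("Trial1", "Trial2"), each = nrow(ct)),
  value = c(ct$Trial1, ct$Trial2),
  row.names = NULL
)

# interaction() creates one x-axis group per Scenario/trial pair
long$group <- interaction(long$Scenario, long$trial)
print(long)
```

Each Scenario then has two x positions (for example a.Trial1 and a.Trial2), and the fill still separates the conditions within each position.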

array manipulation: calculate odds ratios for a layer in a 3-way table

This is a question about array and data frame manipulation and calculation, in the
context of models for log odds in contingency tables. The closest question I've found to this is How can i calculate odds ratio in many table, but mine is more general.
I have a data frame representing a 3-way frequency table, of size 5 (litter) x 2 (treatment) x 3 (deaths).
"Freq" is the frequency in each cell, and deaths is the response variable.
Mice <-
structure(list(litter = c(7L, 7L, 8L, 8L, 9L, 9L, 10L, 10L, 11L,
11L, 7L, 7L, 8L, 8L, 9L, 9L, 10L, 10L, 11L, 11L, 7L, 7L, 8L,
8L, 9L, 9L, 10L, 10L, 11L, 11L), treatment = structure(c(1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("A",
"B"), class = "factor"), deaths = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("0", "1",
"2+"), class = "factor"), Freq = c(58L, 75L, 49L, 58L, 33L, 45L,
15L, 39L, 4L, 5L, 11L, 19L, 14L, 17L, 18L, 22L, 13L, 22L, 12L,
15L, 5L, 7L, 10L, 8L, 15L, 10L, 15L, 18L, 17L, 8L)), .Names = c("litter",
"treatment", "deaths", "Freq"), row.names = c(NA, 30L), class = "data.frame")
From this, I want to calculate the log odds for adjacent categories of the last variable (deaths)
and have this value in a data frame with factors litter (5), treatment (2), and contrast (2), as detailed below.
The data can be seen in xtabs() form:
mice.tab <- xtabs(Freq ~ litter + treatment + deaths, data=Mice)
ftable(mice.tab)
deaths 0 1 2+
litter treatment
7 A 58 11 5
B 75 19 7
8 A 49 14 10
B 58 17 8
9 A 33 18 15
B 45 22 10
10 A 15 13 15
B 39 22 18
11 A 4 12 17
B 5 15 8
>
From this, I want to calculate the (adjacent) log odds of 0 vs. 1 and 1 vs.2+ deaths, which is easy in
array format,
odds1 <- log(mice.tab[,,1]/mice.tab[,,2]) # contrast 0:1
odds2 <- log(mice.tab[,,2]/mice.tab[,,3]) # contrast 1:2+
odds1
treatment
litter A B
7 1.6625477 1.3730491
8 1.2527630 1.2272297
9 0.6061358 0.7156200
10 0.1431008 0.5725192
11 -1.0986123 -1.0986123
>
But, for analysis, I want to have these in a data frame, with factors litter, treatment and contrast
and a column, 'logodds' containing the entries in the odds1 and odds2 tables, suitably strung out.
More generally, for an I x J x K table, where the last factor is the response, my desired result
is a data frame of IJ(K-1) rows, with adjacent log odds in a 'logodds' column, and ideally, I'd like
to have a general function to do this.
Note that if T is the 10 x 3 matrix of frequencies shown by ftable(), the calculation is essentially
log(T) %*% matrix(c(1, -1, 0,
0, 1, -1), nrow = 3)
followed by reshaping and labeling.
Can anyone help with this?
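A minimal sketch of such a function for a 3-way table, using base R only. The function name `adjacent_logodds` and the 2 x 2 x 3 `tab` (litters 7 and 8 from the table above) are illustrative assumptions; it presumes the response is the third dimension and that the array has named dimnames.

```r
# Sketch: adjacent-category log odds along the 3rd dimension of an I x J x K table
adjacent_logodds <- function(tab) {
  K <- dim(tab)[3]
  resp <- dimnames(tab)[[3]]
  pieces <- lapply(seq_len(K - 1), function(k) {
    lo <- log(tab[, , k] / tab[, , k + 1])          # I x J matrix of log odds
    df <- as.data.frame(as.table(lo), responseName = "logodds")
    df$contrast <- paste(resp[k], resp[k + 1], sep = ":")
    df
  })
  do.call(rbind, pieces)                            # I * J * (K - 1) rows
}

# Litters 7 and 8 only, filled in array order (litter varies fastest)
tab <- array(
  c(58, 49, 75, 58,    # deaths = 0
    11, 14, 19, 17,    # deaths = 1
     5, 10,  7,  8),   # deaths = 2+
  dim = c(2, 2, 3),
  dimnames = list(litter = c("7", "8"), treatment = c("A", "B"),
                  deaths = c("0", "1", "2+"))
)

res <- adjacent_logodds(tab)
print(res)
```

Applied to the full table, this would be called as adjacent_logodds(mice.tab); cells with zero counts would give infinite or NaN log odds and need separate handling.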

ggplot2 sorting a plot Part II

I have a melted data.frame, dput(x), below:
## dput(x)
x <- structure(list(variable = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L,
4L, 4L), .Label = c("a", "b", "c", "d"), class = "factor"),
value = structure(c(1L,
2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L,
6L, 1L, 2L, 3L, 4L, 5L, 6L), .Label = c("Never Heard of",
"Heard of but Not at all Familiar",
"Somewhat Familiar", "Familiar", "Very Familiar", "Extremely Familiar"
), class = "factor"), freq = c(10L, 24L, 32L, 90L, 97L, 69L,
15L, 57L, 79L, 94L, 58L, 19L, 11L, 17L, 34L, 81L, 94L, 85L, 4L,
28L, 59L, 114L, 82L, 35L)), .Names = c("variable", "value", "freq"
), row.names = c(NA, -24L), class = "data.frame")
Which looks like this (for those of you who don't need a test set):
variable value freq
1 a Never Heard of 10
2 a Heard of but Not at all Familiar 24
3 a Somewhat Familiar 32
4 a Familiar 90
5 a Very Familiar 97
6 a Extremely Familiar 69
7 b Never Heard of 15
8 b Heard of but Not at all Familiar 57
9 b Somewhat Familiar 79
10 b Familiar 94
11 b Very Familiar 58
12 b Extremely Familiar 19
13 c Never Heard of 11
14 c Heard of but Not at all Familiar 17
15 c Somewhat Familiar 34
16 c Familiar 81
17 c Very Familiar 94
18 c Extremely Familiar 85
19 d Never Heard of 4
20 d Heard of but Not at all Familiar 28
21 d Somewhat Familiar 59
22 d Familiar 114
23 d Very Familiar 82
24 d Extremely Familiar 35
Now, I can make a nice and pretty plot akin to this:
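library(ggplot2)
# geom_col() and labels = scales::percent replace the older
# geom_bar(position = "fill") / formatter = "percent" syntax
ggplot(x, aes(variable, freq, fill = value)) +
geom_col(position = "fill") +
coord_flip() +
scale_y_continuous("", labels = scales::percent)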
Question
What I would like to do is sort a, b, c, d from highest to lowest "freq" of "Extremely Familiar".
?relevel and ?reorder haven't provided any constructive examples for this usage.
Your help is always appreciated.
Cheers,
BEB
Here is another way to do it:
tmp <- subset(x, value=="Extremely Familiar")
x$variable <- factor(x$variable, levels=levels(x$variable)[order(-tmp$freq)])
Here is one way:
tmpfun <- function(i) {
tmp <- x[i,]
-tmp[ tmp$value=='Extremely Familiar', 'freq' ]
}
x$variable <- reorder( x$variable, 1:nrow(x), tmpfun )
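As a quick check of the ordering these approaches should produce, here is a small sketch on a reduced copy of the data (two response levels per variable, with the freq values taken from the question). It is a variant of the subset approach that takes the level names from the subset itself, so it does not depend on the subset rows being in level order; `key` is an illustrative name.

```r
# Reduced copy of x: only two response levels per variable
x <- data.frame(
  variable = factor(rep(c("a", "b", "c", "d"), each = 2)),
  value    = rep(c("Very Familiar", "Extremely Familiar"), times = 4),
  freq     = c(97, 69, 58, 19, 94, 85, 82, 35)
)

# Rows that define the sort key
key <- subset(x, value == "Extremely Familiar")

# Re-level 'variable' by decreasing freq of "Extremely Familiar"
x$variable <- factor(x$variable,
                     levels = as.character(key$variable[order(-key$freq)]))
print(levels(x$variable))
```

With the factor levels set this way, ggplot draws the bars in that order; because coord_flip() lists the first level at the bottom, the levels may need to be reversed to read from top to bottom.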
