aggregating counts per category - r

I have a dataset (df) where I would just like to get some summary stats for each variable over the whole column, and then the same summary for each of 2 specific treatments. So far so good:
summary(var1)
aggregate(var1 ~ treatment, results, summary)
I then have one variable whose values are 1 and 2. I can count these with the sum function:
sum(var3 == 1)
sum(var3 == 2)
However, when I try to sum these by treatment:
aggregate(var3 ~ treatment, results, sum var3 == 1)
I get the following error:
Error in sum == 1 :
comparison (1) is possible only for atomic and list types
I have tried lots of variations on the same theme and taken a look through the textbooks I am using to help me with my first forays into R... but I can't seem to find the answer.

Here's a sample dataset (it's always best to include sample data to make your question reproducible).
set.seed(15)
results<-data.frame(
var1=runif(30),
var3=sample(1:2, 30, replace=TRUE),
treatment=gl(2,15)
)
If you really want to use aggregate, you can do
aggregate(var3==1~treatment, results, sum)
# treatment var3 == 1
# 1 1 9
# 2 2 5
But since you're counting discrete observations, table() may be a better choice, because it does all the counting at once:
with(results, table(var3, treatment))
# treatment
# var3 1 2
# 1 9 5
# 2 6 10
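For reference, aggregate() expects its third argument to be a function, which is why passing an expression such as sum var3 == 1 fails. A minimal sketch, assuming the sample data above, that keeps the var3 ~ treatment formula and moves the comparison into an anonymous function (count_ones is a name introduced here for illustration):

```r
set.seed(15)
results <- data.frame(
  var1 = runif(30),
  var3 = sample(1:2, 30, replace = TRUE),
  treatment = gl(2, 15)
)

# FUN receives the var3 values of one treatment at a time,
# so the comparison happens inside the function.
count_ones <- aggregate(var3 ~ treatment, data = results,
                        FUN = function(x) sum(x == 1))
count_ones
```

The var3 column of count_ones then holds the number of 1s per treatment, which should agree with the first row of the table() output for the same data.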


Simplify time-dependent data created with tmerge

I have a large data.table containing many time-dependent variables(50+) for use in coxph models. This dataset has been generated by using tmerge. Patients are identified by the patid variable and time intervals are defined by tstart and tstop.
The majority of the models I want to fit only use a selection of these time-dependent variables. Unfortunately the speed of Cox proportional hazards models is dependent on the number of rows and the number of timepoints in my data.table even if all the data in these rows is identical. Is there a good/fast way of combining rows which are identical apart from the time interval in order to speed up my models? In many cases, tstop for one line is equal to tstart for the next with everything else identical after removing some columns.
For example I would want to convert the data.table example into results.
library(data.table)
example=data.table(patid = c(1,1,1,2,2,2), tstart=c(0,1,2,0,1,2), tstop=c(1,2,3,1,2,3), x=c(0,0,1,1,2,2), y=c(0,0,1,2,3,3))
results=data.table(patid = c(1,1,2,2), tstart=c(0,2,0,1), tstop=c(2,3,1,3), x=c(0,1,1,2), y=c(0,1,2,3))
This example is extremely simplified. My current dataset has ~600k patients, >20M rows and 3.65k time points. Removing variables should significantly reduce the number of needed rows which should significantly increase the speed of models fit using a subset of variables.
The best I can come up with is:
example=data.table(patid = c(1,1,1,2,2,2), tstart=c(0,1,2,0,1,2), tstop=c(1,2,3,1,2,3), x=c(0,0,1,1,2,2), y=c(0,0,1,2,3,3))
example = example[order(patid,tstart),]
example[,matched:=x==shift(x,-1)&y==shift(y,-1),by="patid"]
example[is.na(matched),matched:=FALSE,by="patid"]
example[,tstop:=ifelse(matched,shift(tstop,-1),tstop)]
example[,remove:=tstop==shift(tstop),by="patid"]
example = example[is.na(remove) | remove==FALSE,]
example$matched=NULL
example$remove=NULL
This solves this example; however, the code is complex, and once the dataset has many columns, having to edit x==shift(x,-1) for each variable is asking for errors. Is there a sane way of doing this? The list of columns will change a number of times based on loops, so accepting as input a vector of column names to compare would be ideal.
This solution also doesn't cope with multiple time periods in a row that contain the same covariate values (e.g. time periods of (0,1), (1,3), (3,4) with the same covariate values).
This solution creates a temporary group id based on the rleid() of the combination of x and y. This temp value is used and then dropped (temp := NULL). Naming x and y explicitly in j keeps the column names in the result.
example[, .(tstart = min(tstart), tstop = max(tstop), x = x[1], y = y[1]),
by = .(patid, temp = rleid(paste(x,y, sep = "_")))][, temp := NULL][]
# patid tstart tstop x y
# 1: 1 0 2 0 0
# 2: 1 2 3 1 1
# 3: 2 0 1 1 2
# 4: 2 1 3 2 3
Here is an option that builds on our conversation/comments above, but allows the flexibility of setting a vector of column names:
cols=c("x","y")
cbind(
example[, id:=rleidv(.SD), .SDcols = cols][, .(tstart=min(tstart), tstop=max(tstop)), .(patid,id)],
example[,.SD[1],.(patid,id),.SDcols =cols][,..cols]
)[,id:=NULL][]
Output:
patid tstart tstop x y
1: 1 0 2 0 0
2: 1 2 3 1 1
3: 2 0 1 1 2
4: 2 1 3 2 3
Based on Wimpel's answer I have created the following solution which also allows using a vector of column names for input.
example=data.table(patid = c(1,1,1,2,2,2), tstart=c(0,1,2,0,1,2), tstop=c(1,2,3,1,2,3), x=c(0,0,1,1,2,2), y=c(0,0,1,2,3,3))
variables = c("x","y")
example[,key_ := do.call(paste, c(.SD,sep = "_")),.SDcols = variables]
example[, c("tstart", "tstop") := .(min(tstart),max(tstop)),
by = .(patid, temp = rleid(key_))][,key_:=NULL]
example = unique(example)
I would imagine this could be simplified, but I think it does what is needed for more complex examples.
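The same run-length idea can also be sketched in base R as a reusable function that takes the column names as a vector. This is only an illustration under stated assumptions: collapse_intervals is a hypothetical helper name, the id column defaults to "patid", the values in the compared columns never contain the "|" separator, and two intervals count as contiguous only when tstart equals the previous tstop exactly. Unlike the rleid() versions above, it also starts a new run when there is a gap between intervals.

```r
# Hypothetical helper (not from the answers above): merge consecutive rows
# that share the same id and the same values in `cols`.
collapse_intervals <- function(d, cols, id = "patid") {
  d <- as.data.frame(d)
  d <- d[order(d[[id]], d$tstart), , drop = FALSE]
  n <- nrow(d)
  # One string key per row built from the id and the compared columns.
  key <- do.call(paste, c(d[c(id, cols)], sep = "|"))
  # A new run starts when the key changes or the intervals are not contiguous.
  new_run <- c(TRUE, key[-1] != key[-n] | d$tstart[-1] != d$tstop[-n])
  run <- cumsum(new_run)
  first <- !duplicated(run)                   # first row of each run
  last <- !duplicated(run, fromLast = TRUE)   # last row of each run
  out <- d[first, c(id, "tstart", cols), drop = FALSE]
  out$tstop <- d$tstop[last]
  out <- out[c(id, "tstart", "tstop", cols)]
  rownames(out) <- NULL
  out
}

example <- data.frame(patid = c(1,1,1,2,2,2), tstart = c(0,1,2,0,1,2),
                      tstop = c(1,2,3,1,2,3), x = c(0,0,1,1,2,2),
                      y = c(0,0,1,2,3,3))
collapsed <- collapse_intervals(example, c("x", "y"))
collapsed
```

Because the run id is a cumulative sum over all rows, any number of consecutive identical periods, such as (0,1), (1,3), (3,4), ends up in a single row. On a table of the size described in the question, the data.table versions are likely to be the faster choice.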

Output list of variable names with significant p-values for simultaneous regressions in R

I am trying to determine how to output a list of the variable names that yield significant (p < 0.05) interactions in a series of regressions.
I have a dataframe that looks like the following:
behavior condition attitude1 attitude2 attitude3
1 0 4 5 7
6 1 3 7 2
5 0 2 1 4
3 1 4 2 6
In reality, I have several more attitudes than displayed here. To run several regressions simultaneously and test for interaction terms, I would typically use the following code:
attitudes <- colnames(df[,3:5])
form <- paste("behavior ~ condition*",attitudes)
model <- form %>%
set_names(attitudes) %>%
map(~lm(.x, data = df))
map(model, summary)
The output is a list of each of the following regressions:
lm(behavior ~ condition * attitude1)
lm(behavior ~ condition * attitude2)
lm(behavior ~ condition * attitude3)
I would like to find a way to output a list of all the variable names with a significant condition*attitude interaction. For example, if p<0.05 for attitude1 and attitude3, the output I would be looking for would be:
attitude1, attitude3
This question is related to what I am trying to do, but it does not show me how I can do this when I am running the models simultaneously using map().
A quick but inelegant way to accomplish your goal:
map_df(model, tidy) %>%
mutate(model = rep(attitudes, each = num_of_your_predictors+1))
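Here tidy() comes from the broom package, and each interaction model contributes four coefficient rows (intercept, condition, the attitude, and their interaction). You can then use filter() to keep the rows with p.value < .05.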
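To get only the names of the attitudes whose interaction term is significant, one option is to pull the interaction row out of each model's coefficient table. The following is a base R sketch, not the answerer's code: the data are simulated purely for illustration (with an interaction built in for attitude1), and models, interaction_p and significant are names introduced here.

```r
set.seed(42)
n <- 200
df <- data.frame(
  condition = rep(0:1, each = n / 2),
  attitude1 = rnorm(n),
  attitude2 = rnorm(n),
  attitude3 = rnorm(n)
)
# Simulated outcome with a deliberate condition x attitude1 interaction.
df$behavior <- 2 * df$condition * df$attitude1 + rnorm(n)

attitudes <- colnames(df)[2:4]

# One model per attitude, named by the attitude it uses.
models <- lapply(setNames(attitudes, attitudes), function(a) {
  lm(as.formula(paste("behavior ~ condition *", a)), data = df)
})

# p-value of the interaction row (the only coefficient whose name has a ":").
interaction_p <- vapply(models, function(fit) {
  ct <- coef(summary(fit))
  ct[grep(":", rownames(ct), fixed = TRUE), "Pr(>|t|)"]
}, numeric(1))

significant <- names(interaction_p)[interaction_p < 0.05]
significant
```

The same lookup works on the list produced by map() in the question, since that list is also named by attitude. The sketch assumes each model has exactly one interaction coefficient, which holds when condition is numeric or a two-level factor.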

How do I sample specific sizes within groups?

I have a specific use problem. I want to sample exact sizes from within groups. What method should I use to construct exact subsets based on group counts?
My use case is that I am going through a two-stage sample design. First, for each group in my population, I want to ensure that 60% of subjects will not be selected. So I am trying to construct a sampling data frame that excludes 60% of available subjects for each group. Further, this is a function where the user specifies the minimum proportion of subjects that must not be used, hence the 1- construction where the user has indicated that at least 60% of subjects in each group cannot be selected for sampling.
After this code, I will be sampling completely at random, to get my final sample.
Code example:
testing <- data.frame(ID = c(seq_len(50)), Age = c(rep(18, 10), rep(19, 9), rep(20,15), rep(21,16)))
testing <- testing %>%
slice_sample(ID, prop=1-.6)
As you can see, the numbers by group are not what I want. I should only have 4 subjects who are 18 years of age, 3 subjects who are 19 years, 6 subjects who are 20 years of age, and 6 subjects who are 21 years of age. With no set seed, the numbers I ended up with were 6 18-year-olds, 1 19-year-old, 6 20-year-olds, and 7 21-year-olds.
However, the overall sample size of 20 is correct.
How do I brute force the sample size within the groups to be what I need?
There are other variables in the data frame so I need to sample randomly from each age group.
EDIT: I messed up trying to give an example. In my real data I am grouping by age inside the dplyr set of commands. But neither calling group_by(Age variable) ahead of slice_sample() nor doing the grouping inside slice_sample() works. In my real data, I get neither the correct set of samples by age, nor the correct overall sample size.
I was using a semi_join to limit the ages to those that had a total remaining after doing the proportion test. For those ages for which no sample could be taken, the semi_join was being used to remove those ages from the population ahead of doing the proportional sampling. I don't know if the semi_join has caused the problem.
That said, the answer provided and accepted shifts me away from relying on the semi_join and I think is an overall large improvement to my real code.
You haven't defined your grouping variable.
Try the following:
set.seed(1)
x <- testing %>% group_by(Age) %>% slice_sample(prop = .4)
x %>% count()
# # A tibble: 4 x 2
# # Groups: Age [4]
# Age n
# <dbl> <int>
# 1 18 4
# 2 19 3
# 3 20 6
# 4 21 6
Alternatively, try stratified from my "splitstackshape" package:
library(splitstackshape)
set.seed(1)
y <- stratified(testing, "Age", .4)
y[, .N, Age]
# Age N
# 1: 18 4
# 2: 19 4
# 3: 20 6
# 4: 21 6
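If exact control over the per-group size matters, the rounding rule can be made explicit. Below is a base R sketch, assuming the testing data from the question; sample_within is a hypothetical helper name, and it rounds each group size down, which gives the 4, 3, 6, 6 split the question asks for (the stratified() output above shows 4 for age 19, which suggests it rounds instead).

```r
testing <- data.frame(
  ID = seq_len(50),
  Age = c(rep(18, 10), rep(19, 9), rep(20, 15), rep(21, 16))
)

# Hypothetical helper: draw floor(prop * group size) rows from every group.
sample_within <- function(d, group, prop) {
  rows_by_group <- split(seq_len(nrow(d)), d[[group]])
  picked <- unlist(lapply(rows_by_group, function(rows) {
    # Small tolerance guards against floating point error in the product.
    k <- floor(length(rows) * prop + 1e-9)
    rows[sample.int(length(rows), k)]
  }), use.names = FALSE)
  d[sort(picked), , drop = FALSE]
}

set.seed(1)
sampled <- sample_within(testing, "Age", 0.4)
table(sampled$Age)
```

Swapping floor() for round() or ceiling() inside the helper changes the rule for groups whose size is not a multiple of 1/prop. Note that flooring each group gives 19 rows in total here rather than 20.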

In R, compute relative frequency of binomial values, grouped by multiple columns, and create a new dataset with this 'summary'

I have a dataset (named 'gala') that has the columns "Day", "Tree", "Trt", and "Countable". The data was collected over time, so within each treatment a given tree number refers to the same tree across all days. The tree numbers are repeated for each treatment (e.g. there is a tree "1" for multiple treatments). I want to know the proportion/frequency of the "Countable" column values. I have converted the values in the "Countable" column to binary ("0" and "1").
I would like to compute the relative frequency of "1" vs. "0" for the 'Countable' column, for each tree per each treatment per each day (e.g. If I had eight 1's and two 0's, the new column value would be "0.8" to summarize with one value that tree for that treatment on that day), and output these results into a new data frame that also includes the original day, Tree, Trt values.
I have been unsuccessfully trying to make a Frankenstein of codes from other Stack Overflow answers, but I cannot get the codes to work. Many people use "sum" but I do not want the sum, I would just like R to treat the "0" and "1" like categorical values and give me the relative proportion of each for each subset of data. If I missed this, I am sorry, and please let me know with a link to this answer. I am new to coding, and R, and do not understand well how other codes not directly relating to what I would like to do can be applied.
It looks like dplyr is probably my best option, based on what I've seen for other similar questions. This is what I have thus far, but I keep getting various errors:
library(dplyr)
RelativeFreq <-
(gala %>%
group_by(Day, Tree, Trt) %>%
summarise(Countable) %>%
mutate(rel.freq=n/length(Countable)))
I've also tried this with no success:
RelativeFreq <- gala[,.("proportion"=frequency(Countable[0,1])), by=c("Day","Tree","Trt")]
Any help is greatly appreciated. Thank you!
You could use data.table:
# create fake data
set.seed(0)
df <- expand.grid(Day = 1:2,
Tree = 1:2,
Trt = 1:2)
df<- rbind(df, df, df)
library(data.table)
# make df a data.table
setDT(df)
# create fake Countable column
df[, Countable := as.integer(runif(.N) < 0.5)]
RelativeFreq <- df[, list(prop = sum(Countable)/.N), by = list(Day, Tree, Trt)]
RelativeFreq
Day Tree Trt prop
1: 1 1 1 0.3333333
2: 2 1 1 0.3333333
3: 1 2 1 0.6666667
4: 2 2 1 0.6666667
5: 1 1 2 0.3333333
6: 2 1 2 0.3333333
7: 1 2 2 0.6666667
8: 2 2 2 0.0000000
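Since Countable only holds 0 and 1, the proportion of 1s in a group is simply the group mean, so sum(Countable)/.N above is the same as mean(Countable). A base R sketch of the same summary, using simulated data in place of the real 'gala' set (the obs column and the 10 observations per tree are assumptions made for illustration):

```r
set.seed(0)
# Simulated stand-in for 'gala': 10 observations per Day/Tree/Trt combination.
gala <- expand.grid(Day = 1:2, Tree = 1:2, Trt = 1:2, obs = 1:10)
gala$Countable <- rbinom(nrow(gala), size = 1, prob = 0.5)

# The mean of a 0/1 column is the relative frequency of 1s.
RelativeFreq <- aggregate(Countable ~ Day + Tree + Trt, data = gala, FUN = mean)
names(RelativeFreq)[names(RelativeFreq) == "Countable"] <- "rel.freq"
RelativeFreq
```

The relative frequency of 0s in each group is then 1 - rel.freq, so a single column summarizes both categories.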

Unsplit reduced data table based on two factors in R

Suppose I have a data frame in R where I would like to use 2 columns "factor1" and "factor2" as factors and I need to calculate mean value for all other columns per each pair of the above mentioned factors. After running the code below, the last line gives the following warnings:
Warning messages:
1: In split.default(seq_along(x), f, drop = drop, ...) :
data length is not a multiple of split variable
...
Why is it happening and what should I do to make it right?
Thanks.
Here is my code:
# Create data frame
myDataFrame <- data.frame(factor1=c(1,1,1,2,2,2,3,3,3), factor2=c(3,3,3,4,4,4,5,5,5), val1=c(1,2,3,4,5,6,7,8,9), val2=c(9,8,7,6,5,4,3,2,1))
# Split by 2 columns (factors)
splitDataFrame <- split(myDataFrame, list(myDataFrame$factor1, myDataFrame$factor2))
# Calculate mean value for each column per each pair of factors
splitMeanValues <- lapply(splitDataFrame, function(x) apply(x, 2, mean))
# Combine back to reduced table whereas there is only one value (mean) per each pair of factors
MeanValues <- unsplit(splitMeanValues, list(unique(myDataFrame$factor1), unique(myDataFrame$factor2)))
EDIT1: Added data frame creation (see above)
If you need to calculate the mean for all other columns than the factors, you can use the formula syntax of aggregate()
aggregate(.~factor1+factor2, myDataFrame, FUN=mean)
That returns
factor1 factor2 val1 val2
1 1 3 2 8
2 2 4 5 5
3 3 5 8 2
Your split() method didn't work because when you unsplit you must have the same number of rows as when you split your data. You were reducing the number of rows for all groups to just one row. Plus, unsplit really should be used with the exact same list of factors that was used to do the split, otherwise groups may get out of order. You could do a split and then lapply some collapsing function and then rbind the list back into a single data.frame if you really wanted, but for a simple mean, aggregate is probably best.
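For completeness, the split / lapply / rbind route mentioned above could look roughly like this. It is a sketch rather than a recommendation: pieces and means are names introduced here, and drop = TRUE is used so that factor combinations that never occur (such as factor1 = 1 with factor2 = 4) do not produce empty groups.

```r
myDataFrame <- data.frame(factor1 = c(1,1,1,2,2,2,3,3,3),
                          factor2 = c(3,3,3,4,4,4,5,5,5),
                          val1 = c(1,2,3,4,5,6,7,8,9),
                          val2 = c(9,8,7,6,5,4,3,2,1))

# Split into one data frame per observed factor1/factor2 pair.
pieces <- split(myDataFrame,
                list(myDataFrame$factor1, myDataFrame$factor2),
                drop = TRUE)

# Collapse each piece to a single row of column means.
means <- lapply(pieces, function(d) as.data.frame(as.list(colMeans(d))))

# Bind the one-row data frames back together.
MeanValues <- do.call(rbind, means)
rownames(MeanValues) <- NULL
MeanValues
```

Because every piece is reduced to one row, rbind is the right way to reassemble it; unsplit() would expect one row per original observation.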
The same result can be obtained with summaryBy() in the doBy package, although it's pretty much the same as aggregate() in this case.
library(doBy)
summaryBy( . ~ factor1+factor2, data = myDataFrame)
# factor1 factor2 val1.mean val2.mean
# 1 1 3 2 8
# 2 2 4 5 5
# 3 3 5 8 2
Have you tried aggregate?
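# 'by' must be a list of grouping vectors; val1 stands in for the value column
aggregate(myDataFrame$val1, by = list(factor1 = myDataFrame$factor1), FUN = mean)
aggregate(myDataFrame$val1, by = list(factor2 = myDataFrame$factor2), FUN = mean)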
