R: cumsum and group_by [duplicate]

This question already has answers here:
How to sum a variable by group
(18 answers)
Calculate cumulative sum (cumsum) by group
(5 answers)
Closed 1 year ago.
I have the following sample data frame:
dates <- c("2021-01-01", "2021-01-03", "2021-01-06", "2021-01-02", "2021-01-04", "2021-01-06")
group <- c("A", "A", "A", "B", "B", "B")
values <- c(1, 5, 4, 2, 7, 3)
df <- data.frame(dates = as.Date(dates), group = group, values)
df
Can someone please tell me how I can compute a new variable as the cumulative sum of values for each group (A and B) separately, in chronological order?
values_cumulated should be 1, 6, 10, 2, 9, 12
I was trying it with group_by() and mutate(values_cum = cumsum(values)) but couldn't get it to work.
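That approach does work once the rows are in chronological order within each group; a minimal sketch, assuming dplyr is loaded and df is defined as above:

library(dplyr)

df %>%
  arrange(group, dates) %>%                  # chronological order within each group
  group_by(group) %>%
  mutate(values_cum = cumsum(values)) %>%
  ungroup()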

Related

Transform a df into individual observations [duplicate]

This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Closed 7 months ago.
I want to transform a df from a "counting" approach (number of cases) to an "individual observations" approach.
Example:
df <- dplyr::tibble(
  city = c("a", "a", "b", "b", "c", "c"),
  sex = c(1, 0, 1, 0, 1, 0),
  age = c(1, 2, 1, 2, 1, 2),
  cases = c(2, 3, 1, 1, 1, 1)
)
Expected result
df <- dplyr::tibble(
  city = c("a", "a", "a", "a", "a", "b", "b", "c", "c"),
  sex = c(1, 1, 0, 0, 0, 1, 0, 1, 0),
  age = c(1, 1, 2, 2, 2, 1, 2, 1, 2)
)
uncount() from tidyr can do that for you.
df |> tidyr::uncount(cases)
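For reference, a base R sketch of the same expansion repeats each row index by its cases count and then drops the cases column:

df[rep(seq_len(nrow(df)), df$cases), c("city", "sex", "age")]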

gtsummary R package: pre-post summary table with paired 2-sample tests?

Is it possible to use the gtsummary R package to make a pre-post summary table with 2 columns that summarize multiple variables at 2 different time points?
I know the arsenal R package supports this, but I would prefer to use gtsummary if possible since it supports the tidyverse.
For example, is it possible to make a pre-post summary table using gtsummary that is similar to the table in this example? Here is a simpler version of the dataset from their example:
dat <- data.frame(
  tp = paste0("Time Point ", c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2)),
  id = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 6),
  Cat = c("A", "A", "A", "B", "B", "B", "B", "A", NA, "B"),
  Fac = factor(c("A", "B", "C", "A", "B", "C", "A", "B", "C", "A")),
  Num = c(1, 2, 3, 4, 4, 3, 3, 4, 0, NA),
  stringsAsFactors = FALSE
)
Note the dataset is in "long format": tp is the 2 pre-post time points, and id is the subject ID for the 2 repeated measures. In the table, Cat and Fac are categorical variables that would be summarized as count (%) at each time point, with McNemar's test used to compare whether they change over time. Num is a numeric variable that would be summarized as mean (standard deviation) at each time point, with a paired t-test to assess change over time.
Yes, as of gtsummary v1.3.6, there is a function called add_difference() for this express purpose. The function supports both paired (e.g. pre- and post-responses) and unpaired data. The method is specified in the test= argument.
Worked example here: http://www.danieldsjoberg.com/gtsummary/articles/gallery.html#paired-test
Here's an unpaired example:
trial %>%
  select(trt, age, marker, response, death) %>%
  tbl_summary(
    by = trt,
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_dichotomous() ~ "{p}%"
    ),
    missing = "no"
  ) %>%
  add_n() %>%
  add_difference()
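For the paired pre-post table asked about above, a rough sketch using the dat from the question might look like the following. It assumes a reasonably recent gtsummary, keeps only subjects observed at both time points (paired tests need complete pairs), and names the tests via add_p()'s test= and group= arguments, so double-check against your installed version and the gallery article linked above:

library(dplyr)
library(gtsummary)

dat %>%
  group_by(id) %>%
  filter(n() == 2) %>%                  # keep only subjects with both time points
  ungroup() %>%
  tbl_summary(
    by = tp,
    include = c(Cat, Fac, Num),
    statistic = all_continuous() ~ "{mean} ({sd})"
  ) %>%
  add_p(
    test = list(all_continuous() ~ "paired.t.test",
                all_categorical() ~ "mcnemar.test"),
    group = id                          # subject ID linking the paired observations
  )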

Allocate ordinal values to numerical vector in R [duplicate]

This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
Closed 2 years ago.
I have a set of data from children, recorded across a number of sessions. The number of sessions, and the age of each child at each session, differ across participants, so the data look something like this:
library(tibble)
mydf <- tribble(~subj, ~age,
"A", 16,
"A", 17,
"A", 19,
"B", 10,
"B", 11,
"B", 12,
"B", 13)
What I don't currently have in the data is a variable for Session number, and I'd like to add this to my dataframe. Basically I want to create a numeric variable that is ordinal from 1-n for each child, something like this:
mydf2 <- tribble(~subj, ~age, ~session,
"A", 16, 1,
"A", 17, 2,
"A", 19, 3,
"B", 10, 1,
"B", 11, 2,
"B", 12, 3
"B", 13, 4)
Ideally I'd like to do this with dplyr.
You simply need to group by subj and use row_number():
mydf %>%
  group_by(subj) %>%
  mutate(session = row_number())
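If the rows are not guaranteed to already be sorted by age within each child, adding an arrange() first makes the session numbering explicitly chronological:

mydf %>%
  arrange(subj, age) %>%        # ensure sessions are ordered by age within each child
  group_by(subj) %>%
  mutate(session = row_number()) %>%
  ungroup()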

Using R merge() to collect non-matching IDs [duplicate]

This question already has answers here:
Find complement of a data frame (anti - join)
(7 answers)
Closed 3 years ago.
So I have these two dataframes:
id <- c(1, 2, 3, 4, 5, 6, 7, 8)
drug <- c("A", "B", "C", "D", "E", "F", "G", "H")
value <- c(100, 200, 300, 400, 500, 600, 700, 800)
df1 <- data.frame(id, drug, value)
id <- c(1, 2, 3, 4, 6, 8)
treatment <- c("C", "IC", "C", "IC", "C", "C")
value <- c(700, 800, 900, 100, 200, 900)
df2 <- data.frame(id, treatment, value)
I used merge() to combine the two datasets like this:
key = "id"
merge(df1,df2[key],by=key)
This worked, but I end up dropping some rows (due to non-matching ids).
Is there a way I can see or collect the ids that were dropped?
My real dataset consists of hundreds of entries, so a way to find the dropped ids would be very useful in R.
library(dplyr)
> anti_join(df1, df2, by = "id")
id drug value
1 5 E 500
2 7 G 700
Or if you just want the IDs
> anti_join(df1, df2, by = "id")$id
[1] 5 7
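The same check can also be done in base R, for example with setdiff() on the id columns, or with %in% to pull out the unmatched rows:

setdiff(df1$id, df2$id)
# [1] 5 7
df1[!df1$id %in% df2$id, ]    # the full rows of df1 with no match in df2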

Random stratified sampling with different proportions

I am trying to split a dataset 80/20 into training and testing sets. I want to split by location, which is a factor with 4 levels; however, the levels have not been sampled equally. Out of 1892 samples:
Location1: 172
Location2: 615
Location3: 603
Location4: 502
I am trying to split the whole dataset 80/20, as mentioned above, but I also want each location to be split 80/20 so that I get an even proportion from each location in the training and testing sets. I've seen one post about this using the stratified() function from the splitstackshape package, but it doesn't seem to split my factor levels as expected.
Here is a simplified reproducible example:
x <- c(1, 2, 3, 4, 1, 3, 7, 4, 5, 7, 8, 9, 4, 6, 7, 9, 7, 1, 5, 6)
xx <- c("A", "A", "B", "B", "B", "B", "B", "B", "B", "C", "C", "C", "C", "C", "C", "D", "D", "D", "D", "D")
df <- data.frame(x, xx)
validIndex <- stratified(df, "xx", size=16/nrow(df))
valid <- df[-validIndex,]
train <- df[validIndex,]
where A, B, C, and D correspond to the four locations, in approximately the same proportions as the actual dataset (~10, 32, 32, and 26%, respectively).
Using bothSets = TRUE should return a list containing the split of the original data frame into the two sets (whose union is the original data frame):
library(splitstackshape)

splt <- stratified(df, "xx", size = 16/nrow(df), replace = FALSE, bothSets = TRUE)
valid <- splt[[1]]   # the stratified sample (~80% of rows here)
train <- splt[[2]]   # the remainder (~20%); swap the assignments if you want the larger share for training
## check: recombining both sets should give back the original data frame
df2 <- as.data.frame(do.call("rbind", splt))
all.equal(df[with(df, order(xx, x)), ],
          df2[with(df2, order(xx, x)), ],
          check.names = FALSE)
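If you would rather stay within dplyr, a sketch of the same stratified 80/20 split is below; the .row id column is only added so the two sets can be separated cleanly, and the seed is arbitrary:

library(dplyr)

set.seed(1)                                         # arbitrary seed for reproducibility
df_id <- df %>% mutate(.row = row_number())         # temporary row id
train <- df_id %>%
  group_by(xx) %>%
  slice_sample(prop = 0.8) %>%                      # ~80% within each level of xx
  ungroup()
valid <- anti_join(df_id, train, by = ".row")       # the remaining ~20%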
