Counting the number of observations by groups with conditions in R

I would like to count the number of observations within each group using conditions in R.
For example, I would like to count how many observations there are for ID "A" in every 10-day window.
ID (A,A,A,A,A,A,A,A)
Day (7,14,17,25,35,37,42,57)
X (9,20,14,24,23,30,20,40)
Desired output:
(In the first 10 days, we have one observation for ID "A". Days: 7
In the next 10 days, we have two observations for ID "A". Days: 14, 17)
ID (A,A,A,A,A,A)
Day_10 (1,2,3,4,5,6)
Count_10 (1,2,1,2,1,1)
Also, it would be great if I could count the number of observations around certain values. For a given X value, I would like to know how many observations fall within [X-10, X+10] within ID "A".
The desired output would be as follows:
ID (A,A,A,A,A,A,A,A)
X (9,20,14,24,23,30,40,50)
Count_X10 (3,3,3,3,3,3,2,1)
Count_X10: for a given X (=9) there are three observations within ID "A" with Day in [X-10, X+10] = [-1, 19], namely Days 7, 14, and 17.

Here are the data loaded as a data.frame to keep the observations connected. Note that I added a second group to show how to handle that:
df <-
  data.frame(
    ID = rep(c("A","B"), each = 8)
    , Day = c(7,14,17,25,35,37,42,57)
    , X = c(9,20,14,24,23,30,20,40)
  )
Then, I used dplyr to pass the data through a series of steps. First, I split by the ID column, then used lapply to run a function on each of those ID groups, calculating the two columns of interest (and returning the whole data.frame). Finally, I stitched the rows back together with bind_rows:
library(dplyr)

df %>%
  split(.$ID) %>%
  lapply(function(x){
    # count observations in the 10 days starting at each Day
    x$nextTen <- sapply(x$Day, function(thisDay){
      sum(between(x$Day, thisDay, thisDay + 10))
    })
    # count observations within 10 days either side of each Day
    x$plusMinusTen <- sapply(x$Day, function(thisDay){
      sum(between(x$Day, thisDay - 10, thisDay + 10))
    })
    return(x)
  }) %>%
  bind_rows()
The result is
ID Day X nextTen plusMinusTen
1 A 7 9 3 3
2 A 14 20 2 3
3 A 17 14 2 4
4 A 25 24 2 3
5 A 35 23 3 4
6 A 37 30 2 3
7 A 42 20 1 3
8 A 57 40 1 1
9 B 7 9 3 3
10 B 14 20 2 3
11 B 17 14 2 4
12 B 25 24 2 3
13 B 35 23 3 4
14 B 37 30 2 3
15 B 42 20 1 3
16 B 57 40 1 1
But any condition you are interested in could be added to that lapply step.
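For reference, the same logic can be written without the split/lapply round trip, using a grouped mutate (a minimal sketch, assuming dplyr 1.0 or later):
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(
    nextTen = sapply(Day, function(d) sum(between(Day, d, d + 10))),
    plusMinusTen = sapply(Day, function(d) sum(between(Day, d - 10, d + 10)))
  ) %>%
  ungroup()
Inside the grouped mutate, Day refers to the Day values of the current ID only, so the counts stay within each group.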

Your sample data:
df = data.frame(
  ID = rep('A', 8),
  Day = c(7, 14, 17, 25, 35, 37, 42, 57),
  X = c(9, 20, 14, 24, 23, 30, 40, 50),
  stringsAsFactors = FALSE)
Note: You give two different values for vector X. I suppose it is c(9, 20, 14, 24, 23, 30, 40, 50), and not c(9, 20, 14, 24, 23, 30, 20, 40).
First calculation:
library(dplyr)
output1 = df %>%
  mutate(Day_10 = ceiling(Day/10)) %>%
  group_by(ID, Day_10) %>%
  summarise(Count_10 = n())
The mutate step creates the ranges of 10 days by rounding Day/10 up to the next integer with ceiling. Then we group by ID and Day_10 and count the number of observations within each group.
> output1
ID Day_10 Count_10
<chr> <dbl> <int>
1 A 1 1
2 A 2 2
3 A 3 1
4 A 4 2
5 A 5 1
6 A 6 1
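As a quick sanity check of the binning step (base R only):
ceiling(c(7, 14, 17, 25, 35, 37, 42, 57) / 10)
# [1] 1 2 2 3 4 4 5 6
Days 1-10 land in bin 1, days 11-20 in bin 2, and so on, which matches the Day_10 groups above.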
Second calculation:
output2 = df %>%
  group_by(ID) %>%
  mutate(Count_X10 = sapply(X, function(x){sum(Day >= x-10 & Day <= x+10)})) %>%
  select(-Day)
We group by ID, and for each X we count the number of days with this ID that are between X-10 and X+10.
> output2
ID X Count_X10
<chr> <dbl> <int>
1 A 9 3
2 A 20 3
3 A 14 3
4 A 24 3
5 A 23 3
6 A 30 3
7 A 40 3
8 A 50 2
Note: I suppose there's a mistake in the desired output you give, because for instance, when X = 50, there are 2 observations within [40, 60] with ID "A": days 42 and 57.
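For completeness, the second count translates naturally to data.table as well; a sketch, assuming the same df as above:
library(data.table)
dt <- as.data.table(df)
dt[, Count_X10 := sapply(X, function(x) sum(Day >= x - 10 & Day <= x + 10)), by = ID]
dt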

Related

R: creating a longitudinal dataset using tidyr

I am looking to generate a longitudinal dataset. I have generated my patient numbers and treatment groups:
library(dplyr)
set.seed(420)
Pat_TNO <- 1001:1618
data.frame(Pat_TNO = Pat_TNO) %>%
  rowwise() %>%
  mutate(
    trt = rbinom(1, 1, 0.5)
  )
My timepoints (in days) are:
timepoint_weeks <- c(seq(2, 12, 2), 16, 20, 24, 52)
timepoint_days <- 7 * timepoint_weeks
How can I pivot this dataset using the vector timepoint_days, so I have 10 rows per participant and column names Pat_TNO, trt, timepoint_days?
You can use the unnest function from tidyr to achieve what you want.
Here is the code
library(dplyr)
library(tidyr)
set.seed(420)
Pat_TNO <- 1001:1618
x <- data.frame(Pat_TNO = Pat_TNO) %>%
  rowwise() %>%
  mutate(
    trt = rbinom(1, 1, 0.5)
  )
timepoint_weeks <- c(seq(2, 12, 2), 16, 20, 24, 52)
timepoint_days <- 7 * timepoint_weeks
x %>%
  mutate(timepoint_days = list(timepoint_days)) %>%
  unnest()
Output
# A tibble: 6,180 × 3
Pat_TNO trt timepoint_days
<int> <int> <dbl>
1 1001 1 14
2 1001 1 28
3 1001 1 42
4 1001 1 56
5 1001 1 70
6 1001 1 84
7 1001 1 112
8 1001 1 140
9 1001 1 168
10 1001 1 364
# … with 6,170 more rows
Here I used the mutate function to add a column containing the full timepoint_days list in every row, and then unnest expands each list element into its own rows, giving 10 rows per participant.
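A version note: with tidyr 1.0+, unnest() wants its columns named explicitly, and the list column can be skipped entirely with crossing(). A sketch under those assumptions:
library(dplyr)
library(tidyr)
# explicit column name, current tidyr
x %>%
  mutate(timepoint_days = list(timepoint_days)) %>%
  unnest(cols = timepoint_days)
# or the cartesian product directly, without the list column
x %>%
  ungroup() %>% # drop the rowwise grouping first
  crossing(timepoint_days = timepoint_days)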

If function with looping to create new datatable

I have a dataset which I want to loop over with an if function:
id <- c(1,2,3,4,5)
value1 <- c(25, 100, 15, 20, 30)
value2 <- c(130, 25, 10, 30, 20)
value3 <- c(50, 60, 20, 120, 10)
month <- c(2, 3, 4, 2, 3)
df1 <- data.frame(id, value1, value2, value3, month)
I need an if function that would calculate the differences between value1 and value2 and between value1 and value3, and check whether one OR the other is higher than 35%. This should not be the change (increase or decrease) but just the difference in percentages. And when the difference is equal to or higher than 35%, it should add all the variables for that specific id to a separate dataframe/table.
However, I need the function to loop over all the ids instead of just the top one.
This is what I have so far:
library(threadr)
if (percentage_difference(value1, value2) >= 35 | percentage_difference(value1, value3) >= 35) { print("bad") }
If I understand you correctly, this should do the trick, though I am not sure how you would want to indicate if it is >= 35.
Here I used a boolean T/F:
df2 <- df1
df2$perc_diff <- pmax(df2$value1 - df2$value2, df2$value1 - df2$value3)
df2$over35 <- df2$perc_diff >= 35
df2
# id value1 value2 value3 month perc_diff over35
# 1 1 25 130 50 2 -25 FALSE
# 2 2 100 25 60 3 75 TRUE
# 3 3 15 10 20 4 5 FALSE
# 4 4 20 30 120 2 -10 FALSE
# 5 5 30 20 10 3 20 FALSE
Here I removed those < 35:
df3 <- df1
df3$perc_diff <- pmax(df3$value1 - df3$value2, df3$value1 - df3$value3)
df3[df3$perc_diff >= 35, ]
# id value1 value2 value3 month perc_diff
# 2 2 100 25 60 3 75
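Note that the question asks for percentage differences, while the code above compares raw differences. If "percentage difference" means |a - b| relative to the mean of a and b (an assumption here; threadr::percentage_difference may define it differently), a base R sketch would be:
# hypothetical helper; assumes percentage difference = |a - b| / mean(a, b) * 100
pct_diff <- function(a, b) abs(a - b) / ((a + b) / 2) * 100
df4 <- df1
df4$perc_diff <- pmax(pct_diff(df4$value1, df4$value2),
                      pct_diff(df4$value1, df4$value3))
df4[df4$perc_diff >= 35, ] # rows where either difference is >= 35%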

Identify the highest number of consecutive numbers in a dataframe and add rows with NA

Here is a reproducible example of the situation I need help with. I have a database (db1) in which weekly ratings of behavioral outcomes are recorded. The variable "Week" corresponds to the number of the week from the beginning of the year (e.g., Week = 1 indicates the week between January 1st and 7th, and so on...) and the variable "Score" to the value obtained by the subject on the criterion measure. In the real data set, I have several participants and a different number of ratings for each subject; however, in this example there is only one subject to make things easier.
library(magrittr)
x1 <- c(14, 18, 19, 20, 21, 23, 24, 25)
y1 <- c(34, 21, 45, 32, 56, 45, 23, 48)
db1 <- cbind(x1, y1) %>% as.data.frame() %>% setNames(c("Week", "Score"))
db1
# Week Score
#1 14 34
#2 18 21
#3 19 45
#4 20 32
#5 21 56
#6 23 45
#7 24 23
#8 25 48
What I need to do is to identify the highest number of ratings that occurred in consecutive weeks in the database. In the example, the highest number is 4 because the ratings were consecutive from week 18 to 21. Here I added a column for demonstration, but it might not be necessary for the solution.
x2 <- c(14, 18, 19, 20, 21, 23, 24, 25)
y2 <- c(34, 21, 45, 32, 56, 45, 23, 48)
z2 <- c(1, 1, 2, 3, 4, 1, 2, 3)
db2 <- cbind(x2, y2, z2) %>% as.data.frame() %>% setNames(c("Week", "Score", "Consecutive"))
db2
# Week Score Consecutive
#1 14 34 1
#2 18 21 1
#3 19 45 2
#4 20 32 3
#5 21 56 4
#6 23 45 1
#7 24 23 2
#8 25 48 3
Finally, because every subject has to have a total of five consecutive ratings, I need to add a row with a missing datum where the highest number of consecutive weeks is below five (so that I can impute the missing data later on). However, there might be ratings before and after the sequence. If that is the case, I want to add the row based on the minimal distance between the first or last week of the longest series of consecutive weeks and the other existing ratings. In the example, that means the row with the missing datum will be added after week 21, because there are 4 missing weeks between week 14 and 18 whereas only 1 between week 21 and 23.
x3 <- c(14, 18, 19, 20, 21, 22, 23, 24, 25)
y3 <- c(34, 21, 45, 32, 56, NA, 45, 23, 48)
z3 <- c(1, 1, 2, 3, 4, 5, 1, 2, 3)
db3 <- cbind(x3, y3, z3) %>% as.data.frame() %>% setNames(c("Week", "Score", "Consecutive"))
db3
# Week Score Consecutive
#1 14 34 1
#2 18 21 1
#3 19 45 2
#4 20 32 3
#5 21 56 4
#6 22 NA 5
#7 23 45 1
#8 24 23 2
#9 25 48 3
For your information, this is not going to be part of the main statistical analyses but rather one of several ways I want to use to test the sensitivity of my model. So do not worry about whether it makes sense from a methodological point of view. In addition, if possible, a tidyverse solution would be greatly appreciated.
Thanks so much to anyone who will take the time.
The code is relatively simple if you want to do it just for the longest run and, if there is more than one, just for the first:
library(tidyverse)

db1 %>%
  mutate(consecutive = accumulate(diff(Week), .init = 1, ~ if (.y == 1) {.x + 1} else {1}),
         dummy = max(consecutive) == consecutive & max(consecutive) < 5) %>%
  group_by(grp = cumsum(consecutive == 1)) %>%
  filter(sum(dummy) > 0) %>% # keep only the group(s) with the max consecutive run
  ungroup() %>% select(-dummy) %>%
  filter(grp == min(grp)) %>% # keep the first such group, if there is more than one
  complete(consecutive = 1:5) %>%
  select(-grp) %>%
  mutate(Week = first(Week) + consecutive - 1)
# A tibble: 5 x 3
consecutive Week Score
<dbl> <dbl> <dbl>
1 1 18 21
2 2 19 45
3 3 20 32
4 4 21 56
5 5 22 NA
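The accumulate() call does the run counting: starting at 1, it adds 1 while the gap to the previous week is exactly 1 and resets to 1 otherwise. On the sample weeks:
library(purrr)
Week <- c(14, 18, 19, 20, 21, 23, 24, 25)
accumulate(diff(Week), .init = 1, ~ if (.y == 1) .x + 1 else 1)
# [1] 1 1 2 3 4 1 2 3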
OLD ANSWER: Another tidyverse strategy (this can be modified to suit your additional column requirements, which you have not given in the sample):
library(tidyverse)
db1
#> Week Score
#> 1 14 34
#> 2 18 21
#> 3 19 45
#> 4 20 32
#> 5 21 56
#> 6 23 45
#> 7 24 23
#> 8 25 48
library(data.table)
db1 %>%
  mutate(consecutive = accumulate(diff(Week), .init = 1, ~ if (.y == 1) {.x + 1} else {1}),
         dummy = max(consecutive) == consecutive & max(consecutive) < 5,
         dummy2 = rleid(dummy)) %>%
  group_split(dummy2, .keep = FALSE) %>%
  map_if(~ .x$dummy[[1]],
         ~ .x %>%
           complete(consecutive = seq(max(consecutive), 5, 1), fill = list(Week = 1)) %>%
           mutate(Week = cumsum(Week))) %>%
  map_dfr(~ .x %>% select(-dummy))
#> # A tibble: 9 x 3
#> Week Score consecutive
#> <dbl> <dbl> <dbl>
#> 1 14 34 1
#> 2 18 21 1
#> 3 19 45 2
#> 4 20 32 3
#> 5 21 56 4
#> 6 22 NA 5
#> 7 23 45 1
#> 8 24 23 2
#> 9 25 48 3
Created on 2021-06-10 by the reprex package (v2.0.0)
If I understand correctly:
library(data.table)
library(tidyverse)
x1 <- c(14, 18, 19, 20, 21, 23, 24, 25)
y1 <- c(34, 21, 45, 32, 56, 45, 23, 48)
db1 <- cbind(x1, y1) %>% as.data.frame() %>% setNames(c("Week", "Score"))
db1 %>%
  mutate(grp = cumsum(c(0, diff(Week)) > 1)) %>%
  group_by(grp) %>%
  mutate(n_grp = n()) %>%
  ungroup() %>%
  filter(n_grp == max(n_grp, na.rm = TRUE)) %>%
  complete(grp,
           n_grp,
           nesting(Week = seq(from = first(Week), length = 5))) %>%
  select(-c(grp, n_grp)) %>%
  rows_upsert(db1, by = c("Week", "Score"))
#> # A tibble: 9 x 2
#> Week Score
#> <dbl> <dbl>
#> 1 18 21
#> 2 19 45
#> 3 20 32
#> 4 21 56
#> 5 22 NA
#> 6 14 34
#> 7 23 45
#> 8 24 23
#> 9 25 48
Created on 2021-06-10 by the reprex package (v2.0.0)
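The key step is the group id: cumsum(c(0, diff(Week)) > 1) starts a new group whenever the gap between consecutive weeks exceeds 1. On the sample data:
Week <- c(14, 18, 19, 20, 21, 23, 24, 25)
cumsum(c(0, diff(Week)) > 1)
# [1] 0 1 1 1 1 2 2 2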
You can also use the following solution. Midway through, before using add_row to add the additional rows, I filter the whole data set to keep only the groups with the maximum number of observations, i.e. the longest runs of consecutive Weeks. After we split by the grouping variable we may end up with 2 or more groups of equally long consecutive Weeks, in which case you can choose whichever one you like based on your preference:
library(dplyr)
library(purrr)
library(tibble)

db1 %>%
  mutate(Consecutive = +(Week - lag(Week, default = first(Week)) == 1),
         grp = cumsum(Consecutive == 0)) %>%
  group_by(grp) %>%
  mutate(Consecutive = row_number()) %>%
  group_by(grp, .drop = TRUE) %>%
  add_count() %>%
  ungroup() -> db2 # we create our grouping variable `grp` here

db2 %>%
  filter(n == max(n)) %>%
  group_split(grp) %>%
  map_dfr(~ add_row(.x, Week = .x$Week[.x$n[1]] + seq(1, 5 - .x$n[1], 1),
                    Consecutive = .x$Consecutive[.x$n[1]] + seq(1, 5 - .x$n[1], 1),
                    grp = .x$grp[1])) %>%
  bind_rows(db2 %>%
              filter(n != max(n))) %>%
  select(-c(grp, n)) %>%
  arrange(Week)
# A tibble: 9 x 3
Week Score Consecutive
<dbl> <dbl> <dbl>
1 14 34 1
2 18 21 1
3 19 45 2
4 20 32 3
5 21 56 4
6 22 NA 5
7 23 45 1
8 24 23 2
9 25 48 3

shift a column with lagged data from other column and enlarge data frame as needed

I have a data frame with values, and I need a new column with the values shifted some rows down; the data frame has to gain rows to accommodate the shifted data.
What I've got so far:
df <- data.frame(day = 1:5, value = floor(runif(5, min = 0, max = 101)))
> df %>% dplyr::mutate(value2=dplyr::lag(value,n=2, default = 0))
day value value2
1 1 19 0
2 2 78 0
3 3 18 19
4 4 14 78
5 5 10 18
Expected result:
day value value2
1 1 19 0
2 2 78 0
3 3 18 19
4 4 14 78
5 5 10 18
6 6 0 14
7 7 0 10
Stuck on making the data frame grow the needed rows.
Here's a way with dplyr:
df %>%
  bind_rows(
    tail(df, 2) %>%
      mutate(day = day + 2, value = 0)
  ) %>%
  mutate(value2 = lag(value, 2, default = 0))
day value value2
1 1 19 0
2 2 78 0
3 3 18 19
4 4 14 78
5 5 10 18
6 6 0 14
7 7 0 10
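If the shift amount may vary, the same idea can be parameterized on n; a sketch:
library(dplyr)
n <- 2
df %>%
  bind_rows(
    tail(df, n) %>%
      mutate(day = day + n, value = 0)
  ) %>%
  mutate(value2 = lag(value, n, default = 0))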
Use a merge. Create the "target" dataset with however many rows you want, fill in NA values with 0, then remap the lagged value onto "value2". It's useful to store "lag" as a variable, at the risk of being more verbose.
have <- data.frame(
  day = 1:5,
  value = c(19, 78, 18, 14, 10),
  value2 = c(0, 0, 19, 78, 18)
)
target <- data.frame(
  day = 1:7
)
want <- merge(have, target, by = 'day', all = TRUE)
want[is.na(want)] <- 0
lag <- 2
## just one way of mapping a lagged response:
## pad with `lag` zeros, then drop the last `lag` values of `value`
want$value2 <- c(rep(0, lag), rev(rev(want$value)[-(1:lag)]))

Extracting row in time series in R

I'm trying to extract the rows from a data frame containing the lowest value in a specific column:
income = c(2, 3, 5, 5, -15, 2, 1)
balance = c(15, 17, 20, 25, 30, 15, 17)
date = as.Date(c("2016/02/11", "2016/02/14", "2017/02/16", "2016/03/01", "2017/03/12", "2016/04/11", "2017/04/24"))
df = data.frame(income, balance, date)
Now what I want is to get the rows containing the minimum "balance" value for each month, so that the outcome would be a data frame looking like this:
income balance date
1 2 15 2016-02-11
2 5 25 2016-03-01
3 2 15 2016-04-11
I have tried the aggregate function:
bymonth = aggregate(balance ~ months(date), data = df, FUN = min)
print(bymonth)
But this gives me the following output:
months(date) balance
1 April 15
2 Februar 15
3 Marts 25
Help!
We can do this with dplyr. After grouping by the month of 'date', we slice the row which has the minimum 'balance' and remove the helper 'mth' column using select:
library(dplyr)
df %>%
  group_by(mth = months(date)) %>%
  slice(which.min(balance)) %>%
  ungroup() %>%
  select(-mth)
# A tibble: 3 x 3
# income balance date
# <dbl> <dbl> <date>
#1 2 15 2016-04-11
#2 2 15 2016-02-11
#3 5 25 2016-03-01
Note that if there are ties for the 'balance', then use filter(balance == min(balance)) in place of slice.
Or with ave from base R to create a logical vector and use that to subset the rows of 'df':
df[with(df, ave(balance, months(date), FUN = min)==balance),]
# income balance date
#1 2 15 2016-02-11
#4 5 25 2016-03-01
#6 2 15 2016-04-11
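One caveat with months(date): it ignores the year, so February 2016 and February 2017 land in the same group (which is why 2017-02-16 with balance 20 loses to 2016-02-11 here). If you want one row per year-month instead, a sketch that only swaps the grouping key:
library(dplyr)
df %>%
  group_by(mth = format(date, "%Y-%m")) %>%
  slice(which.min(balance)) %>%
  ungroup() %>%
  select(-mth)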
