Keep previous value if it is under a certain threshold - r

I would like to create a variable called treatment_cont, computed within each ID group, as follows:
ID  day  day_diff  treatment  treatment_cont
 1    0        NA          1               1
 1   14        14          1               1
 1   20         6          2               2
 1   73        53          1               1
 2    0        NA          1               1
 2   33        33          1               1
 2   90        57          2               2
 2  112        22          3               2
 2  152        40          1               1
 2  178        26          4               1
treatment_cont is the same as treatment, except that the previous treatment regime is carried forward whenever day_diff, the difference in days between treatments, is lower than 30.
I have tried many approaches in dplyr, manipulating the table, but I cannot figure out how to do it efficiently.

Probably a conditional mutate using case_when and lag might work:
df %>% mutate(treatment_cont = case_when(day_diff < 30 ~ treatment, TRUE ~ lag(treatment)))
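As a rough sketch only (assuming df has the columns ID, day, day_diff, and treatment, and applying the rule used in the answer below, i.e. carry the previous regime forward when the gap is under 30 days), the case_when idea would also need the grouping and a fallback for the first row of each ID:

library(dplyr)
df %>%
  group_by(ID) %>%
  arrange(day, .by_group = TRUE) %>%
  mutate(treatment_cont = case_when(
    is.na(day_diff) | day_diff >= 30 ~ treatment,
    TRUE                             ~ lag(treatment, default = first(treatment))
  )) %>%
  ungroup()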

You are probably looking for lag (and perhaps its sibling, lead):
library(dplyr)
library(tidyr)  # for replace_na()

df %>%
  replace_na(list(day_diff = 0)) %>%
  group_by(ID) %>%
  arrange(day) %>%
  mutate(
    treatment_cont = ifelse(day_diff < 30, lag(treatment, default = treatment[1]), treatment)
  ) %>%
  ungroup() %>%
  arrange(ID, day)

# A tibble: 10 x 5
      ID   day day_diff treatment treatment_cont
   <int> <int>    <dbl>     <int>          <int>
 1     1     0        0         1              1
 2     1    14       14         1              1
 3     1    20        6         2              1
 4     1    73       53         1              1
 5     2     0        0         1              1
 6     2    33       33         1              1
 7     2    90       57         2              2
 8     2   112       22         3              2
 9     2   152       40         1              1
10     2   178       26         4              1
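For completeness, a base R sketch of the same logic (assuming df is already ordered by ID and day; ave() builds the lagged treatment within each ID):

df$day_diff[is.na(df$day_diff)] <- 0
prev_treatment <- ave(df$treatment, df$ID,
                      FUN = function(x) c(x[1], head(x, -1)))
df$treatment_cont <- ifelse(df$day_diff < 30, prev_treatment, df$treatment)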

Related

Sum the number of occurrences based on two columns with 3 levels of categories in R

I have two columns with categorical data; they are different categories, both with levels 0, 1, and 2.
I want to count the number of times each combination occurs, but my total count just takes the total sum of the column.
# Groups:   dfMBG$datasetG.snelheid [3]
  dfMBG$datasetG.snelheid as.character(dfMBG$~ countMB
  <chr>                   <chr>                  <dbl>
1 0                       0                        153
2 0                       1                        153
3 0                       2                        153
4 1                       0                        153
5 1                       1                        153
6 1                       2                        153
7 2                       0                        153
8 2                       1                        153
9 2                       2                        153
I want it to look something like this.
# Groups:   dfMBG$datasetG.snelheid [3]
  dfMBG$datasetG.snelheid as.character(dfMBG$~ countMB
  <chr>                   <chr>                  <dbl>
1 0                       0                         12
2 0                       1                         15
3 0                       2                         45
4 1                       0                         12
5 1                       1                         15
6 1                       2                         28
7 2                       0                          4
8 2                       1                         17
9 2                       2                          5
The code that I used is this:
MBGcount <- dfMBG %>%
  rowwise(.) %>%
  group_by(dfMBG$datasetG.snelheid, as.character(dfMBG$datasetG.indicatie)) %>%
  summarise(countMB = sum(as.numeric(dfMBG$datasetG.verstoringsbron)))
MBGcount
dfMBG$datasetG.verstoringsbron consists of a column of 1's.
Thank you for helping me!
If you run this instead, does it achieve what you want?
dfMBG %>%
  group_by(datasetG.snelheid, datasetG.indicatie) %>%
  count()
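If you also want the count column to be called countMB rather than the default n, count() takes a name argument; a small sketch of the same call:

dfMBG %>%
  group_by(datasetG.snelheid, datasetG.indicatie) %>%
  count(name = "countMB")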

Finding cumulative second max per group in R

I have a dataset where I would like to create a new variable that is the cumulative second largest value of another variable, and I would like to perform this function per group.
Let's say I create the following example data frame:
(df1 <- data.frame(patient = rep(1:5, each = 8), visit = rep(1:2, each = 4, 5), trial = rep(1:4, 10), var1 = sample(1:50, 40, replace = TRUE)))
This is pretend data that represents 5 patients who each had 2 study visits, and each visit had 4 trials with a measurement taken (var1).
> head(df1,n=20)
patient visit trial var1
1 1 1 1 25
2 1 1 2 23
3 1 1 3 48
4 1 1 4 37
5 1 2 1 41
6 1 2 2 45
7 1 2 3 8
8 1 2 4 9
9 2 1 1 26
10 2 1 2 14
11 2 1 3 41
12 2 1 4 35
13 2 2 1 37
14 2 2 2 30
15 2 2 3 14
16 2 2 4 28
17 3 1 1 34
18 3 1 2 19
19 3 1 3 28
20 3 1 4 10
I would like to create a new variable, cum2ndmax, that is the cumulative 2nd largest value of var1 and I would like to group this variable by patient # and visit #.
I figured out how to calculate the cumulative 2nd max number like so:
df1$cum2ndmax <- sapply(seq_along(df1$var1),function(x){sort(df1$var1[seq(x)],decreasing=TRUE)[2]})
df1
However, this calculates the cumulative 2nd max across the whole dataset, not for each group. I have attempted to calculate this variable using grouped data like so after installing and loading package dplyr:
library(dplyr)
df2 <- df1 %>%
  group_by(patient, visit) %>%
  mutate(cum2ndmax = sapply(seq_along(df1$var1), function(x) {sort(df1$var1[seq(x)], decreasing = TRUE)[2]}))
But I get an error: Error: Problem with mutate() input cum2ndmax. x Input cum2ndmax can't be recycled to size 4.
Ideally, my result would look something like this:
patient visit trial var1 cum2ndmax
1 1 1 25 NA
1 1 2 23 23
1 1 3 48 25
1 1 4 37 37
1 2 1 41 NA
1 2 2 45 41
1 2 3 8 41
1 2 4 9 41
2 1 1 26 NA
2 1 2 14 14
2 1 3 41 26
2 1 4 35 35
… … … … …
Any help in getting this to work in R would be much appreciated! Thank you!
One dplyr and purrr option could be:
library(purrr)  # for map_dbl()

df1 %>%
  group_by(patient, visit) %>%
  mutate(cum_second_max = map_dbl(.x = seq_along(var1),
                                  ~ ifelse(.x == 1, NA, var1[dense_rank(-var1[1:.x]) == 2])))
patient visit trial var1 cum_second_max
<int> <int> <int> <int> <dbl>
1 1 1 1 25 NA
2 1 1 2 23 23
3 1 1 3 48 25
4 1 1 4 37 37
5 1 2 1 41 NA
6 1 2 2 45 41
7 1 2 3 8 41
8 1 2 4 9 41
9 2 1 1 26 NA
10 2 1 2 14 14
11 2 1 3 41 26
12 2 1 4 35 35
13 2 2 1 37 NA
14 2 2 2 30 30
15 2 2 3 14 30
16 2 2 4 28 30
17 3 1 1 34 NA
18 3 1 2 19 19
19 3 1 3 28 28
20 3 1 4 10 28
Here is an Rcpp solution.
cum_second_max is a modification of cummax which keeps track of the second maximum.
library(tidyverse)

Rcpp::cppFunction("
  NumericVector cum_second_max(NumericVector x) {
    double max_value = R_NegInf, max_value2 = NA_REAL;
    NumericVector result(x.length());
    for (int i = 0; i < x.length(); ++i) {
      if (x[i] > max_value) {
        max_value2 = max_value;
        max_value = x[i];
      } else if (x[i] < max_value && x[i] > max_value2) {
        max_value2 = x[i];
      }
      result[i] = isinf(max_value2) ? NA_REAL : max_value2;
    }
    return result;
  }
")
df1 %>%
  group_by(patient, visit) %>%
  mutate(
    c2max = cum_second_max(var1)
  )
#> # A tibble: 20 x 5
#> # Groups: patient, visit [5]
#> patient visit trial var1 c2max
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 1 25 NA
#> 2 1 1 2 23 23
#> 3 1 1 3 48 25
#> 4 1 1 4 37 37
#> 5 1 2 1 41 NA
#> 6 1 2 2 45 41
#> 7 1 2 3 8 41
#> 8 1 2 4 9 41
#> 9 2 1 1 26 NA
#> 10 2 1 2 14 14
#> 11 2 1 3 41 26
#> 12 2 1 4 35 35
#> 13 2 2 1 37 NA
#> 14 2 2 2 30 30
#> 15 2 2 3 14 30
#> 16 2 2 4 28 30
#> 17 3 1 1 34 NA
#> 18 3 1 2 19 19
#> 19 3 1 3 28 28
#> 20 3 1 4 10 28
Thanks so much everyone! I really appreciate it and could not have solved this without your help. In the end, I used an approach similar to the one suggested by tmfmnk, since I was already using dplyr. Interestingly, the code tmfmnk suggested gave me a column of values that just repeated the first row's number; with a small tweak, changing dense_rank to order, I got exactly what I wanted:
df1 %>%
  group_by(patient, visit) %>%
  mutate(cum_second_max = map_dbl(.x = seq_along(var1),
                                  ~ ifelse(.x == 1, NA, var1[order(-var1[1:.x])[2]])))
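For reference, the sapply idea from the question also works per group once the df1$ prefixes are dropped inside mutate(), so that each group's own var1 is used; a sketch with the df1 defined above:

df1 %>%
  group_by(patient, visit) %>%
  mutate(cum2ndmax = sapply(seq_along(var1),
                            function(i) sort(var1[seq_len(i)], decreasing = TRUE)[2])) %>%
  ungroup()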

How can I create a lag difference variable within group relative to baseline?

I would like a variable that is the lagged difference from the within-group baseline. I have panel data that I have balanced.
my_data <- data.frame(id = c(1,1,1,2,2,2,3,3,3), group = c(1,2,3,1,2,3,1,2,3), score=as.numeric(c(0,150,170,80,100,110,75,100,0)))
id group score
1 1 1 0
2 1 2 150
3 1 3 170
4 2 1 80
5 2 2 100
6 2 3 110
7 3 1 75
8 3 2 100
9 3 3 0
I would like it to look like this:
id group score lag_diff_baseline
1 1 1 0 NA
2 1 2 150 150
3 1 3 170 170
4 2 1 80 NA
5 2 2 100 20
6 2 3 110 30
7 3 1 75 NA
8 3 2 100 25
9 3 3 0 -75
The data.table version of @Liam's answer:
library(data.table)
setDT(my_data)
my_data[, .(group, score, lag_diff_baseline = score - first(score)), by = id]
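If you would rather add the column to my_data by reference instead of returning a new table, the same logic works with := (a small sketch):

my_data[, lag_diff_baseline := score - first(score), by = id]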
I missed the easy answer:
library(dplyr)
my_data %>%
  group_by(id) %>%
  mutate(lag_diff_baseline = score - first(score))
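A base R sketch of the same idea with ave(), written so that the baseline row comes out as NA, matching the desired output above:

my_data$lag_diff_baseline <- ave(my_data$score, my_data$id,
                                 FUN = function(x) c(NA, x[-1] - x[1]))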

How to find percentile and then group in R

I have a data frame like below (df).
day area hour time count
___ ____ _____ ___ ____
1 1 0 1 10
1 1 0 2 12
1 1 0 3 8
1 1 0 4 12
1 1 0 5 15
1 1 0 6 18
1 1 1 1 10
1 1 1 2 12
1 1 1 3 8
1 1 1 4 12
1 1 1 5 15
1 1 1 6 18
1 1 1 7 12
1 1 1 8 15
1 1 1 9 18
1 1 2 1 10
1 1 2 2 18
1 1 2 3 19
.....
2 1 0 1 18
2 1 0 2 12
2 1 0 3 18
2 1 0 4 12
2 1 1 1 8
2 1 1 2 12
2 1 1 3 18
2 1 1 4 10
2 1 1 5 15
2 1 1 6 18
2 1 1 7 12
2 1 1 8 15
2 1 1 9 18
2 1 2 1 10
2 1 2 2 18
2 1 2 3 19
2 1 2 4 9
2 1 2 5 18
2 1 2 6 9
.....
30 99 23 1 9
30 99 23 2 8
30 99 23 3 9
30 99 23 4 19
30 99 23 5 18
30 99 23 6 9
30 99 23 7 19
30 99 23 8 8
30 99 23 9 19
Here I have data for 30 days, 87 areas (1 to 82, plus 90, 93, 95, 97, and 99), and 24 hours (0 to 23) per day. The data records the time taken to cross an area and how many vehicles crossed.
For example:
day area hour time count
___ ____ _____ ___ ____
1 1 0 1 10
1 1 0 2 12
1 1 0 3 8
1 1 0 4 12
1 1 0 5 15
1 1 0 6 18
This shows, for day 1 and hour 0, the time taken to cross area 1:
time count cumulative_count
___ ___ ________________
1 10 10
2 12 22
3 8 30
4 12 42
5 15 57
6 18 75
10 vehicles crossed the area in 1 minute.
12 vehicles crossed the area in 2 minutes.
8 vehicles crossed the area in 3 minutes.
12 vehicles crossed the area in 4 minutes.
15 vehicles crossed the area in 5 minutes.
18 vehicles crossed the area in 6 minutes.
From this I want to calculate how much time it took for 80% of the vehicles to cross area 1 on day 1, hour 0. The total number of vehicles is 10 + 12 + 8 + 12 + 15 + 18 = 75, and 80% of 75 is 60. So the time taken for 80% of the vehicles (60 of the 75) to pass area 1 on day 1, hour 0 will be between 5 and 6 (closer to 5). The result should look like this:
day area hour time_taken_for_80%vehicles_to_pass
___ ____ ____ ___________________________________
1 1 0 5.33(approximately)
1 1 1 7.30
1 1 2 2.16
....
30 1 23 3.13
1 2 0 ---
1 2 1 ---
1 2 2 ---
1 2 3 ---
.......
30 99 21 ---
30 99 22 ---
30 99 23 ---
I know I have to take a quantile and then group by area, day, and hour, so I tried:
library(dplyr)
grp <- group_by(df, day, area, hour, quantile(df$count, 0.8))
But it does not work. Any help is appreciated.
My solution calculates, for each time, the percentage of vehicles that have crossed the area, then takes the first time at which that percentage is above 80%:
str <- 'day area hour time count
1 1 0 1 10
1 1 0 2 12
1 1 0 3 8
1 1 0 4 12
1 1 0 5 15
1 1 0 6 18
1 1 1 1 10
1 1 1 2 12
1 1 1 3 8
1 1 1 4 12
1 1 1 5 15
1 1 1 6 18
1 1 1 7 12
1 1 1 8 15
1 1 1 9 18
1 1 2 1 10
1 1 2 2 18
1 1 2 3 19'
file <- textConnection(str)
df <- read.table(file, header = T)
df
library(dplyr)

df %>%
  group_by(day, area, hour) %>%
  mutate(cumcount = cumsum(count),
         p = cumcount / max(cumcount)) %>%
  filter(p > 0.8) %>%
  summarise(time = min(time))
result:
day area hour time
<int> <int> <int> <int>
1 1 1 0 6
2 1 1 1 8
3 1 1 2 3
Or with a linear estimation of the time when 80% is reached:
df %>% group_by(day, area, hour) %>%
mutate(cumcount = cumsum(count),
p = cumcount/max(cumcount),
g = +(p > 0.8),
order = (g*2-1)*time) %>%
group_by(day, area, hour,g) %>%
filter(row_number((g*2-1)*time)==1) %>%
group_by(day, area, hour) %>%
summarise(time = min(time)+(0.8-min(p))/(max(p)-min(p)))
result:
day area hour time
<int> <int> <int> <dbl>
1 1 1 0 5.166667
2 1 1 1 7.600000
3 1 1 2 2.505263
Or get the same result using lag and lead:
df %>%
  group_by(day, area, hour) %>%
  arrange(hour) %>%
  mutate(cumcount = cumsum(count),
         p = cumcount / max(cumcount)) %>%
  filter((p >= 0.8 & lag(p) < 0.8) | (p < 0.8 & lead(p) >= 0.8)) %>%
  summarise(time = min(time) + (0.8 - min(p)) / (max(p) - min(p)))
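Another option, sketched here, is to let approx() do the linear interpolation of the time at which the cumulative share reaches 80% (same df as above):

df %>%
  group_by(day, area, hour) %>%
  summarise(time = approx(x = cumsum(count) / sum(count),
                          y = time, xout = 0.8)$y)

For day 1, area 1, hour 0 this gives 5.1667, matching the linear estimate above.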

How to cast to multi-column in R, pandas-style?

I searched a lot but didn't find anything relevant.
What I want:
I'm trying to do a simple group-by and summarise in R.
My preferred output would have multi-indexed columns and multi-indexed rows. Multi-indexed rows are easy with dplyr; the difficulty is the columns.
What I already tried:
library(dplyr)
cp <- read.table(text="SEX REGION CAR_TYPE JOB EXPOSURE NUMBER
1 1 1 1 1 70 1
2 1 1 1 2 154 8
3 1 1 2 1 210 10
4 1 1 2 2 21 1
5 1 2 1 1 77 8
6 1 2 1 2 90 6
7 1 2 2 1 105 5
8 1 2 2 2 140 11
")
attach(cp)
cp_gb <- cp %>%
  group_by(SEX, REGION, CAR_TYPE, JOB) %>%
  summarise(counts = round(sum(NUMBER / EXPOSURE * 1000)))

library(reshape2)  # dcast() comes from reshape2 (or data.table)
dcast(cp_gb, formula = SEX + REGION ~ CAR_TYPE + JOB, value.var = "counts")
The problem is that the column index is collapsed into a single level instead of the multi-indexed columns I know from Python/pandas.
Wrong output:
SEX REGION 1_1 1_2 2_1 2_2
1 1 14 52 48 48
1 2 104 67 48 79
An example of how it would work in pandas:
# clipboard, copy this withoud the comments:
# SEX REGION CAR_TYPE JOB EXPOSURE NUMBER
# 1 1 1 1 1 70 1
# 2 1 1 1 2 154 8
# 3 1 1 2 1 210 10
# 4 1 1 2 2 21 1
# 5 1 2 1 1 77 8
# 6 1 2 1 2 90 6
# 7 1 2 2 1 105 5
# 8 1 2 2 2 140 11
import pandas as pd

df = pd.read_clipboard(delim_whitespace=True)
gb = df.groupby(["SEX", "REGION", "CAR_TYPE", "JOB"]).sum()
gb['promille_value'] = (gb['NUMBER'] / gb['EXPOSURE'] * 1000).astype(int)
gb = gb[['promille_value']].unstack(level=[2, 3])
correct Output:
CAR_TYPE 1 1 2 2
JOB 1 2 1 2
SEX REGION
1 1 14 51 47 47
1 2 103 66 47 78
(Update) What works (nearly):
I tried it with ftable, but it only prints ones in the matrix instead of the values of "counts".
ftable(cp_gb, col.vars=c("CAR_TYPE","JOB"), row.vars = c("SEX","REGION"))
ftable accepts lists of factors (a data frame) or a table object. Instead of passing the grouped data frame as it is, converting it to a table object first before passing it to ftable should get you the counts:
# because xtabs expects factors
cp_gb <- cp_gb %>% ungroup %>% mutate_at(1:4, as.factor)
xtabs(counts ~ ., cp_gb) %>%
ftable(col.vars=c("CAR_TYPE","JOB"), row.vars = c("SEX","REGION"))
# CAR_TYPE 1 2
# JOB 1 2 1 2
# SEX REGION
# 1 1 14 52 48 48
# 2 104 67 48 79
There is a difference of 1 in some of the counts between the R and pandas outputs because you used round in R and truncation (.astype(int)) in Python.
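If you want the R figures to line up exactly with the pandas ones, truncating instead of rounding in the summarise step should do it (a sketch of only the changed step):

cp_gb <- cp %>%
  group_by(SEX, REGION, CAR_TYPE, JOB) %>%
  summarise(counts = trunc(sum(NUMBER / EXPOSURE * 1000)))

Note that pandas divides the group sums (sum(NUMBER) / sum(EXPOSURE)); that coincides with the per-row ratio here only because each SEX/REGION/CAR_TYPE/JOB group has a single row.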
