Extracting baseline values from long format data frame - r

I have the following data frame, representing longitudinal data:
df<-structure(list(ID = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2), AGE = c(59,
59, 59, 59, 59, 69, 69, 69, 69, 69), BMI = c(23.8, 23.8, 23.8,
23.8, 23.8, 29.8, 29.8, 29.8, 29.8, 29.8), time = c(0, 1, 3,
5, 6, 0, 1, 3, 5, 6), variable = c(5, 6, 1, 6, 2, 3, 2, NA, 10,
1)), .Names = c("ID", "AGE", "BMI", "time", "var"), row.names = c(NA,
10L), class = "data.frame")
> df
ID AGE BMI time var
1 1 59 23.8 0 5
2 1 59 23.8 1 6
3 1 59 23.8 3 1
4 1 59 23.8 5 6
5 1 59 23.8 6 2
6 2 69 29.8 0 3
7 2 69 29.8 1 2
8 2 69 29.8 3 NA
9 2 69 29.8 5 10
10 2 69 29.8 6 1
AGE and BMI are baseline variables; var is a longitudinal variable measured at different time points (time).
I would like to extract the baseline (time = 0) value of var for each ID and store it in a new variable, var.baseline. The resulting data frame should look like this:
> df
ID AGE BMI time var var.baseline
1 1 59 23.8 0 5 5
2 1 59 23.8 1 6 5
3 1 59 23.8 3 1 5
4 1 59 23.8 5 6 5
5 1 59 23.8 6 2 5
6 2 69 29.8 0 3 3
7 2 69 29.8 1 2 3
8 2 69 29.8 3 NA 3
9 2 69 29.8 5 10 3
10 2 69 29.8 6 1 3
Of course, I can transform the data to wide format, create var.baseline from var.0, and then transform back to long format. However, my real data set is significantly larger and has many more variables, so this becomes cumbersome. Could you please suggest an easier way of extracting baseline data from a long format data frame?

You can try
library(dplyr)
df %>%
group_by(ID) %>%
mutate(var.baseline=var[time==0])
Or
library(data.table)
setDT(df)[,var.baseline:=var[time==0] , by=ID]
Or using base R
merge(df,setNames(subset(df, time==0,select=c("ID", "var")),
c('ID', 'var.baseline')), by='ID')
Or
library(zoo)
df$var.baseline <- with(df, na.locf(var[!NA^time==0]))
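Another base R variant, sketched here as an assumption rather than taken from the answers above, looks up each ID's baseline row with match(). Unlike merge(), it keeps the original row order, and any ID without a time == 0 row simply gets NA. The data frame below is a trimmed copy of the question's data (AGE and BMI are left out for brevity).

```r
# Trimmed copy of the question's data (AGE and BMI omitted)
df <- data.frame(
  ID   = rep(1:2, each = 5),
  time = rep(c(0, 1, 3, 5, 6), 2),
  var  = c(5, 6, 1, 6, 2, 3, 2, NA, 10, 1)
)

# One row per ID holding the baseline (time == 0) value
base <- df[df$time == 0, c("ID", "var")]

# match() gives, for every row of df, the position of its ID in `base`;
# IDs with no baseline row yield NA
df$var.baseline <- base$var[match(df$ID, base$ID)]
df
```

This assumes at most one time == 0 row per ID; with duplicates, match() uses the first one.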

Related

R Identify Max and Col Source [duplicate]

library(dplyr)
HAVE = data.frame("STUDENT"=c(1, 1, 1, 2, 2, 2, 3, 3, 3),
"CLASS"=c('A','A','A','B','B','B','C','C','C'),
"SEMESTER"=c(1, 2, 3, 1, 2, 3, 1, 2, 3),
"SCORE"=c(50, 74, 78, 79, 100, 65, 61, 70, 87),
"TEST"=c(80, 59, 63, 96, 57, 53, 93, 89, 92))
WANT = HAVE %>%
rowwise() %>%
mutate(MAX = max(c(SCORE, TEST)))
WANT$WHICHCOL = c("TEST", "SCORE", "SCORE", "TEST", "SCORE", "SCORE", "TEST", "TEST", "TEST")
I am able to get the max value between SCORE and TEST, but I also want to create the column WHICHCOL, which equals 'TEST' if TEST > SCORE or 'SCORE' if SCORE > TEST.
pmax is a built-in function that will be much more efficient than a rowwise max:
HAVE %>%
mutate(
MAX = pmax(SCORE, TEST),
WHICHCOL = ifelse(SCORE > TEST, "SCORE", "TEST")
)
# STUDENT CLASS SEMESTER SCORE TEST MAX WHICHCOL
# 1 1 A 1 50 80 80 TEST
# 2 1 A 2 74 59 74 SCORE
# 3 1 A 3 78 63 78 SCORE
# 4 2 B 1 79 96 96 TEST
# 5 2 B 2 100 57 100 SCORE
# 6 2 B 3 65 53 65 SCORE
# 7 3 C 1 61 93 93 TEST
# 8 3 C 2 70 89 89 TEST
# 9 3 C 3 87 92 92 TEST
Note that, since I use > not >=, TEST will win ties.
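To make the tie behaviour concrete, here is a small base R sketch. The three score pairs are made up for illustration, and the last pair is a tie.

```r
# Made-up values; the third pair is a tie
SCORE <- c(50, 74, 70)
TEST  <- c(80, 59, 70)

MAX <- pmax(SCORE, TEST)

# Strict comparison: a tie falls through to the "no" branch, so TEST wins
WHICHCOL <- ifelse(SCORE > TEST, "SCORE", "TEST")

# Non-strict comparison: a tie satisfies the condition, so SCORE wins
WHICHCOL_ge <- ifelse(SCORE >= TEST, "SCORE", "TEST")

data.frame(SCORE, TEST, MAX, WHICHCOL, WHICHCOL_ge)
```

Which column should win a tie is a design choice; pick the operator accordingly.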
A base R solution:
df1 <- HAVE[c("SCORE", "TEST")]
x <- max.col(df1, "first")
MAX <- df1[cbind(1:nrow(df1), x)]
WHICHCOL <- names(df1)[x]
HAVE <- cbind(HAVE, MAX, WHICHCOL)
HAVE
#> STUDENT CLASS SEMESTER SCORE TEST MAX WHICHCOL
#> 1 1 A 1 50 80 80 TEST
#> 2 1 A 2 74 59 74 SCORE
#> 3 1 A 3 78 63 78 SCORE
#> 4 2 B 1 79 96 96 TEST
#> 5 2 B 2 100 57 100 SCORE
#> 6 2 B 3 65 53 65 SCORE
#> 7 3 C 1 61 93 93 TEST
#> 8 3 C 2 70 89 89 TEST
#> 9 3 C 3 87 92 92 TEST
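The max.col() approach also extends to more than two columns, which the ifelse() version does not. The sketch below uses made-up values and an invented third column, QUIZ, purely for illustration. Note that with ties.method = "first" a tie goes to the leftmost column, so SCORE wins ties here, the opposite of the ifelse() answer.

```r
# Made-up values; QUIZ is a hypothetical extra column
scores <- data.frame(
  SCORE = c(50, 74, 70),
  TEST  = c(80, 59, 70),
  QUIZ  = c(60, 90, 65)
)

m <- as.matrix(scores)

# Column index of the row-wise maximum; ties go to the first (leftmost) column
idx <- max.col(m, ties.method = "first")

# Pick the maximum by (row, column) matrix indexing
scores$MAX      <- m[cbind(seq_len(nrow(m)), idx)]
scores$WHICHCOL <- colnames(m)[idx]
scores
```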

create a new variable based on other factors using R

So I have this dataframe and I aim to add a new variable based on others:
Qi Age c_gen
1 56 13
2 43 15
5 31 6
3 67 8
I want to create a variable called c_sep such that:
if Qi==1 or Qi==2, c_sep takes a random number between (c_gen + 6) and Age;
if Qi==3 or Qi==4, c_sep takes a random number between (Age-15) and Age;
and 0 otherwise,
so my data would look something like this:
Qi Age c_gen c_sep
1 56 13 24
2 43 15 13
5 31 6 0
3 67 8 40
Any ideas, please?
In base R, you can do something along the lines of:
dat <- read.table(text = "Qi Age c_gen
1 56 13
2 43 15
5 31 6
3 67 8", header = TRUE)
set.seed(100)
dat$c_sep <- 0
dat$c_sep[dat$Qi %in% c(1,2)] <- apply(dat[dat$Qi %in% c(1,2),], 1, \(row) sample(
(row["c_gen"]+6):row["Age"], 1
)
)
dat$c_sep[dat$Qi %in% c(3,4)] <- apply(dat[dat$Qi %in% c(3,4),], 1, \(row) sample(
(row["Age"]-15):row["Age"], 1
)
)
dat
# Qi Age c_gen c_sep
# 1 1 56 13 28
# 2 2 43 15 43
# 3 5 31 6 0
# 4 3 67 8 57
If you are doing it more than twice you might want to put this in a function - depending on your requirements.
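One possible shape for such a function is sketched below. The names draw_between and make_c_sep are hypothetical. The helper also sidesteps a known quirk of sample(): when it is handed a single number n >= 1, it samples from 1:n rather than returning n, which would bite whenever c_gen + 6 equals Age.

```r
# Hypothetical helper: draw one integer uniformly from lo..hi (inclusive).
# sample.int() avoids the sample(n, 1) quirk for one-element ranges.
draw_between <- function(lo, hi) {
  if (lo > hi) return(NA_real_)  # empty range: no valid draw
  lo + sample.int(hi - lo + 1, 1) - 1
}

# Hypothetical wrapper applying the question's rules row by row
make_c_sep <- function(Qi, Age, c_gen) {
  out <- numeric(length(Qi))  # defaults to 0 for all other Qi values
  for (i in seq_along(Qi)) {
    if (Qi[i] %in% c(1, 2)) {
      out[i] <- draw_between(c_gen[i] + 6, Age[i])
    } else if (Qi[i] %in% c(3, 4)) {
      out[i] <- draw_between(Age[i] - 15, Age[i])
    }
  }
  out
}

dat <- data.frame(Qi = c(1, 2, 5, 3), Age = c(56, 43, 31, 67), c_gen = c(13, 15, 6, 8))
set.seed(100)
dat$c_sep <- make_c_sep(dat$Qi, dat$Age, dat$c_gen)
dat
```

Returning NA for an empty range (c_gen + 6 greater than Age) is an assumption; the question does not say what should happen in that case.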
Try this
df$c_sep <- ifelse(df$Qi %in% c(1, 2),
  sapply(1:nrow(df),
    \(x) sample(seq(df$c_gen[x] + 6, df$Age[x]), 1)),
  ifelse(df$Qi %in% c(3, 4),
    sapply(1:nrow(df),
      \(x) sample(seq(df$Age[x] - 15, df$Age[x]), 1)),
    0))
output
Qi Age c_gen c_sep
1 1 56 13 41
2 2 43 15 42
3 5 31 6 0
4 3 67 8 58
A tidyverse option:
library(tidyverse)
df <- tribble(
~Qi, ~Age, ~c_gen,
1, 56, 13,
2, 43, 15,
5, 31, 6,
3, 67, 8
)
df |>
rowwise() |>
mutate(c_sep = case_when(
Qi <= 2 ~ sample(seq(c_gen + 6, Age, 1), 1),
between(Qi, 3, 4) ~ sample(seq(Age - 15, Age, 1), 1),
TRUE ~ 0
)) |>
ungroup()
#> # A tibble: 4 × 4
#> Qi Age c_gen c_sep
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 56 13 39
#> 2 2 43 15 41
#> 3 5 31 6 0
#> 4 3 67 8 54
Created on 2022-06-29 by the reprex package (v2.0.1)
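For larger data, a fully vectorised variant avoids the per-row calls altogether. This is only a sketch: it assumes the lower bound never exceeds Age, and it builds the integer draw from runif() instead of sample().

```r
dat <- data.frame(Qi = c(1, 2, 5, 3), Age = c(56, 43, 31, 67), c_gen = c(13, 15, 6, 8))
set.seed(100)

# Lower bound per row: c_gen + 6 for Qi in 1:2, Age - 15 otherwise
lo <- ifelse(dat$Qi %in% c(1, 2), dat$c_gen + 6, dat$Age - 15)

# Number of integers in lo..Age (assumed to be at least 1)
width <- dat$Age - lo + 1

# floor(runif * width) is a uniform integer in 0..(width - 1)
draw <- lo + floor(runif(nrow(dat)) * width)

# Keep the draw only for Qi in 1:4; everything else gets 0
dat$c_sep <- ifelse(dat$Qi %in% 1:4, draw, 0)
dat
```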

Subtract the result for level time0 from the results from all other levels, for each id

I want to subtract the result for level time0 from the results from all other levels, for each id.
id <- rep(1:4,each=4)
time <- rep(c(0,5,10,15),4)
a <- c(34,56,67,35)
b <-c(56,78,23,90)
c <- c(23,89,67,78)
df <- data.frame(id,time,a,b,c)
df
id time a b c
1 1 0 34 56 23
2 1 5 56 78 89
3 1 10 67 23 67
4 1 15 35 90 78
5 2 0 34 56 23
6 2 5 56 78 89
7 2 10 67 23 67
8 2 15 35 90 78
9 3 0 34 56 23
10 3 5 56 78 89
11 3 10 67 23 67
12 3 15 35 90 78
13 4 0 34 56 23
14 4 5 56 78 89
15 4 10 67 23 67
16 4 15 35 90 78
I started like this but it feels there must be a more efficient way. Any suggestions? Thanks!
for (i in 1:length(unique(df$id))) {
  df_id <- df[df$id == i, ]
  for (j in 2:length(time)) {
    test <- t(df_id[, -1])
    test[, c(2:4)] - test[, 1]
  }
}
Here's an option with dplyr -
library(dplyr)
df %>%
group_by(id) %>%
mutate(across(a:c, ~. - .[time == 0])) %>%
ungroup
# id time a b c
# <int> <dbl> <dbl> <dbl> <dbl>
# 1 1 0 0 0 0
# 2 1 5 22 22 66
# 3 1 10 33 -33 44
# 4 1 15 1 34 55
# 5 2 0 0 0 0
# 6 2 5 22 22 66
# 7 2 10 33 -33 44
# 8 2 15 1 34 55
# 9 3 0 0 0 0
#10 3 5 22 22 66
#11 3 10 33 -33 44
#12 3 15 1 34 55
#13 4 0 0 0 0
#14 4 5 22 22 66
#15 4 10 33 -33 44
#16 4 15 1 34 55
Using time == 0 works only if every id is guaranteed to have exactly one row with time == 0. If some ids have no row with time == 0, or more than one, match is probably the better option: it returns the position of the first match, or NA when there is none.
df %>% group_by(id) %>% mutate(across(a:c, ~. - .[match(0, time)]))
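The difference shows up on data where the baseline row is missing or duplicated. The base R sketch below uses made-up values: id 2 has no time == 0 row and id 3 has two.

```r
# Made-up data: id 2 has no baseline row, id 3 has two
df <- data.frame(
  id   = c(1, 1, 2, 2, 3, 3, 3),
  time = c(0, 5, 5, 10, 0, 0, 5),
  a    = c(34, 56, 40, 45, 10, 12, 20)
)

df$a_change <- NA_real_
for (g in unique(df$id)) {
  rows <- which(df$id == g)
  # match(0, ...) is the position of the first time == 0 row, or NA if none,
  # so `base` is always a single value
  base <- df$a[rows][match(0, df$time[rows])]
  df$a_change[rows] <- df$a[rows] - base
}
df
```

Here id 2 ends up with NA throughout and id 3 is measured against its first baseline row. Indexing with time == 0 instead would give a zero-length baseline for id 2 and a length-two baseline for id 3.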
Use mapply in by.
vc <- c('a', 'b', 'c')
by(df, df$id, \(x) {x[-1, vc] <- mapply(`-`, x[-1, vc], x[1, vc]);x}) |>
do.call(what=rbind)
# id time a b c
# 1.1 1 0 34 56 23
# 1.2 1 5 22 22 66
# 1.3 1 10 33 -33 44
# 1.4 1 15 1 34 55
# 2.5 2 0 34 56 23
# 2.6 2 5 22 22 66
# 2.7 2 10 33 -33 44
# 2.8 2 15 1 34 55
# 3.9 3 0 34 56 23
# 3.10 3 5 22 22 66
# 3.11 3 10 33 -33 44
# 3.12 3 15 1 34 55
# 4.13 4 0 34 56 23
# 4.14 4 5 22 22 66
# 4.15 4 10 33 -33 44
# 4.16 4 15 1 34 55
If the position of the time == 0 row is not consistent within each id, you need a more verbose formulation:
{x[x$time != 0, vc] <- mapply(`-`, x[x$time != 0, vc], x[x$time == 0, vc]);x}
Data:
df <- structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 4L, 4L, 4L, 4L), time = c(0, 5, 10, 15, 0, 5, 10, 15,
0, 5, 10, 15, 0, 5, 10, 15), a = c(34, 56, 67, 35, 34, 56, 67,
35, 34, 56, 67, 35, 34, 56, 67, 35), b = c(56, 78, 23, 90, 56,
78, 23, 90, 56, 78, 23, 90, 56, 78, 23, 90), c = c(23, 89, 67,
78, 23, 89, 67, 78, 23, 89, 67, 78, 23, 89, 67, 78)), class = "data.frame", row.names = c(NA,
-16L))

Create nominal variable from multiple columns R

My intention involves creating a variable based on the values of two numeric ones. I have not written any user-defined functions in R and need help getting started.
Dataset:
My dataset has over 3,000 stores, but I created a reproducible example from the first 10 rows. The deliveries per day of week show the total volume for that day through the year. Store_Num is the store number and Total is the total deliveries for a store throughout the year.
I want the predominant delivery days stored in a variable called Del_Sch, determined by the following tiers. If the first condition is TRUE (50-100%), create the variable from the names of the matching columns. If FALSE, test the second condition and create the variable from all column names between 32-50%, etc. If no day reaches 20%, no predominant delivery days are counted.
-Volume in a day between 50-100% of the total.
-Volume in a day between 32-50% of total
-Volume in a day between 25-32% of total.
-Volume in a day between 20-25% of total.
-Volume in a day less than 20% of total.
Reproducible Example:
Store_Num <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
#Total deliveries made per week
Sun_Del <- c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
Mon_Del <- c(10, 50, 51, 7, 80, 97, 21, 49, 30, 3)
Tue_Del <- c(7, NA, 2, 50, 5, 56, 1, 4, 35, 52)
Wed_Del <- c(49, 51, 1, 4, 51, 16, 2, 2, 1, 1)
Thu_Del <- c(3, 2, 47, 7, 40, 2, 6, 5, 1, 7)
Fri_Del <- c(50, 49, 3, 51, 53, 86, 9, 52, 25, 52)
Sat_Del <- c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
Total <- c(119, 152, 104, 119, 229, 257, 39, 112, 92, 115)
#Single dataset
Schedule <- data.frame(Store_Num, Sun_Del, Mon_Del, Tue_Del,
Wed_Del, Thu_Del, Fri_Del, Sat_Del, Total)
Schedule
Store_Num Sun_Del Mon_Del Tue_Del Wed_Del Thu_Del Fri_Del Sat_Del Total
1 1 NA 10 7 49 3 50 NA 119
2 2 NA 50 NA 51 2 49 NA 152
3 3 NA 51 2 1 47 3 NA 104
4 4 NA 7 50 4 7 51 NA 119
5 5 NA 80 5 51 40 53 NA 229
6 6 NA 97 56 16 2 86 NA 257
7 7 NA 21 1 2 6 9 NA 39
8 8 NA 49 4 2 5 52 NA 112
9 9 NA 30 35 1 1 25 NA 92
10 10 NA 3 52 1 7 52 NA 115
Desired Output:
Store_Num Sun_Del Mon_Del Tue_Del Wed_Del Thu_Del Fri_Del Sat_Del Total Del_Sch
1 1 NA 10 7 49 3 50 NA 119 WF
2 2 NA 50 NA 51 2 49 NA 152 MWF
3 3 NA 51 2 1 47 3 NA 104 MTh
4 4 NA 7 50 4 7 51 NA 119 TF
5 5 NA 80 5 51 40 53 NA 229 MWF
6 6 NA 97 56 16 2 86 NA 257 MTF
7 7 NA 21 1 2 6 9 NA 39 M
8 8 NA 49 4 2 5 52 NA 112 MF
9 9 NA 30 35 1 1 25 NA 92 MTF
10 10 NA 3 52 1 7 52 NA 115 TF
Using tidyr and dplyr. I built the labels by pasting together the first two letters of each day name, which avoids the Tuesday/Thursday ambiguity:
library(dplyr)
library(tidyr)
Schedule %>% gather(Day, del, -Store_Num, -Total) %>%
mutate(proportion = ifelse(del/Total >= 0.5, 1,
ifelse(del/Total >= 0.32, 2,
ifelse(del/Total >= 0.25, 3,
ifelse(del/Total >= 0.20, 4,
NA))))) %>%
group_by(Store_Num) %>%
summarise(days = paste0(substr(Day[which(
proportion == min(proportion, na.rm = TRUE))],
1, 2), collapse = "")) %>%
merge(Schedule, ., by = "Store_Num")
Store_Num Sun_Del Mon_Del Tue_Del Wed_Del Thu_Del Fri_Del Sat_Del Total days
1 1 NA 10 7 49 3 50 NA 119 WeFr
2 2 NA 50 NA 51 2 49 NA 152 MoWeFr
3 3 NA 51 2 1 47 3 NA 104 MoTh
4 4 NA 7 50 4 7 51 NA 119 TuFr
5 5 NA 80 5 51 40 53 NA 229 Mo
6 6 NA 97 56 16 2 86 NA 257 MoFr
7 7 NA 21 1 2 6 9 NA 39 Mo
8 8 NA 49 4 2 5 52 NA 112 MoFr
9 9 NA 30 35 1 1 25 NA 92 MoTu
10 10 NA 3 52 1 7 52 NA 115 TuFr
Edit: there are a few mismatches between my results and your desired output (rows 5, 6 and 9). According to your own rules, the desired output is mistaken there.
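For comparison, the same tier rule can be sketched in base R with findInterval(), using the single-letter codes from the desired output (Th for Thursday). This is an illustrative variant, not the answer's method. It uses only stores 1, 2 and 5 from the question and drops the all-NA weekend columns for brevity.

```r
# Stores 1, 2 and 5 from the question (weekend columns omitted)
Schedule <- data.frame(
  Store_Num = c(1, 2, 5),
  Mon_Del = c(10, 50, 80),
  Tue_Del = c(7, NA, 5),
  Wed_Del = c(49, 51, 51),
  Thu_Del = c(3, 2, 40),
  Fri_Del = c(50, 49, 53),
  Total   = c(119, 152, 229)
)

day_cols <- c("Mon_Del", "Tue_Del", "Wed_Del", "Thu_Del", "Fri_Del")
codes    <- c("M", "T", "W", "Th", "F")  # same order as day_cols

# Share of the yearly total per day; Total is recycled down each column
share <- as.matrix(Schedule[day_cols]) / Schedule$Total

# findInterval counts how many cut points are <= share, so
# tier 1 = 50%+, 2 = 32-50%, 3 = 25-32%, 4 = 20-25%, 5 = below 20%
tier <- matrix(5 - findInterval(share, c(0.20, 0.25, 0.32, 0.50)),
               nrow = nrow(share))
tier[which(tier == 5)] <- NA  # days below 20% never count

# Per store: paste the codes of all days in the best (lowest) tier
Schedule$Del_Sch <- unname(apply(tier, 1, function(t) {
  if (all(is.na(t))) return("")
  paste(codes[which(t == min(t, na.rm = TRUE))], collapse = "")
}))
Schedule
```

Store 5 comes out as M rather than the MWF shown in the desired output, which agrees with the edit note above.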

Enumerate quantiles in reverse order

I'm trying to get the quantile number of a column in a data frame, but in reverse order. I want the highest number to be in quantile number 1.
Here is what I have so far:
> x<-c(10, 12, 75, 89, 25, 100, 67, 89, 4, 67, 120.2, 140.5, 170.5, 78.1)
> x <- data.frame(x)
> within(x, Q <- as.integer(cut(x, quantile(x, probs=0:5/5, na.rm=TRUE),
include.lowest=TRUE)))
x Q
1 10.0 1
2 12.0 1
3 75.0 3
4 89.0 4
5 25.0 2
6 100.0 4
7 67.0 2
8 89.0 4
9 4.0 1
10 67.0 2
11 120.2 5
12 140.5 5
13 170.5 5
14 78.1 3
And what I want to get is:
x Q
1 10.0 5
2 12.0 5
3 75.0 3
4 89.0 2
5 25.0 4
6 100.0 2
7 67.0 4
8 89.0 2
9 4.0 5
10 67.0 4
11 120.2 1
12 140.5 1
13 170.5 1
14 78.1 3
One way to do this is to specify the reversed labels in the cut() function. If you want Q to be an integer then you need to first coerce the factor labels into a character and then into an integer.
result <- within(x, Q <- as.integer(as.character((cut(x,
quantile(x, probs = 0:5/5, na.rm = TRUE),
labels = c(5, 4, 3, 2, 1),
include.lowest = TRUE)))))
head(result)
x Q
1 10 5
2 12 5
3 75 3
4 89 2
5 25 4
6 100 2
Your data:
x <- c(10, 12, 75, 89, 25, 100, 67, 89, 4, 67, 120.2, 140.5, 170.5, 78.1)
x <- data.frame(x)
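An alternative that avoids the factor-to-character round trip is to keep the ascending bin number and flip it arithmetically. This is a sketch, assuming the same quintile breaks as above:

```r
x <- c(10, 12, 75, 89, 25, 100, 67, 89, 4, 67, 120.2, 140.5, 170.5, 78.1)

breaks <- quantile(x, probs = 0:5/5, na.rm = TRUE)

# Ascending bin number: 1 = lowest quintile
q_asc <- as.integer(cut(x, breaks, include.lowest = TRUE))

# With k bins, bin j becomes k + 1 - j, so the highest values land in bin 1
k <- length(breaks) - 1
Q <- k + 1 - q_asc

data.frame(x, Q)
```

Because k is derived from the breaks, the same code works unchanged for deciles or any other number of bins.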
