I have the following data frame, representing longitudinal data:
df<-structure(list(ID = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2), AGE = c(59,
59, 59, 59, 59, 69, 69, 69, 69, 69), BMI = c(23.8, 23.8, 23.8,
23.8, 23.8, 29.8, 29.8, 29.8, 29.8, 29.8), time = c(0, 1, 3,
5, 6, 0, 1, 3, 5, 6), variable = c(5, 6, 1, 6, 2, 3, 2, NA, 10,
1)), .Names = c("ID", "AGE", "BMI", "time", "var"), row.names = c(NA,
10L), class = "data.frame")
> df
ID AGE BMI time var
1 1 59 23.8 0 5
2 1 59 23.8 1 6
3 1 59 23.8 3 1
4 1 59 23.8 5 6
5 1 59 23.8 6 2
6 2 69 29.8 0 3
7 2 69 29.8 1 2
8 2 69 29.8 3 NA
9 2 69 29.8 5 10
10 2 69 29.8 6 1
AGE and BMI are baseline variables; var is a longitudinal variable measured at different time points (time).
I would like to extract the baseline (time = 0) value of var and create a new baseline variable, var.baseline. My data frame should then look like this:
> df
ID AGE BMI time var var.baseline
1 1 59 23.8 0 5 5
2 1 59 23.8 1 6 5
3 1 59 23.8 3 1 5
4 1 59 23.8 5 6 5
5 1 59 23.8 6 2 5
6 2 69 29.8 0 3 3
7 2 69 29.8 1 2 3
8 2 69 29.8 3 NA 3
9 2 69 29.8 5 10 3
10 2 69 29.8 6 1 3
Of course, I can transform the data to wide format, create var.baseline based on variable.0, and then transform back to long format. However, as my real data set is significantly larger and has many more variables, this becomes cumbersome. Could you please suggest an easier way of extracting baseline data from a long-format data frame?
You can try
library(dplyr)
df %>%
group_by(ID) %>%
mutate(var.baseline=var[time==0])
Or
library(data.table)
setDT(df)[,var.baseline:=var[time==0] , by=ID]
Or using base R
merge(df,setNames(subset(df, time==0,select=c("ID", "var")),
c('ID', 'var.baseline')), by='ID')
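A base R variant that keeps the original row order (merge() sorts the result by the key by default) is to look up each ID's baseline with match(). This is only a sketch: it assumes at most one time == 0 row per ID, and it yields NA for any ID without one. The name baseline is introduced here for illustration.

```r
# Reduced copy of the example data (AGE and BMI omitted for brevity)
df <- data.frame(
  ID   = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  time = c(0, 1, 3, 5, 6, 0, 1, 3, 5, 6),
  var  = c(5, 6, 1, 6, 2, 3, 2, NA, 10, 1)
)

# One row per ID holding the time == 0 measurement
baseline <- df[df$time == 0, c("ID", "var")]

# Look up each row's ID in the baseline table; row order is untouched
df$var.baseline <- baseline$var[match(df$ID, baseline$ID)]
df
```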
Or
library(zoo)
df$var.baseline <- with(df, na.locf(var[!NA^time==0]))
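The zoo one-liner is terse, so here is a small sketch of what the index expression does, using a shortened toy vector rather than the full data. It relies on NA^0 being 1 in R and on ! binding more loosely than ==. It also appears to assume that each ID's time == 0 row comes first, since na.locf() carries values forward.

```r
time <- c(0, 1, 3, 0, 1, 3)
var  <- c(5, 6, 1, 3, 2, NA)

# NA^0 is 1, NA^(anything else) is NA, so (NA^time) == 0 is FALSE at
# time 0 and NA elsewhere; negating gives TRUE at time 0, NA elsewhere.
idx <- !NA^time == 0

# Indexing with NA yields NA, so only the baseline values survive
picked <- var[idx]
picked
# zoo::na.locf(picked) would then fill each NA with the last baseline seen
```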
HAVE = data.frame("STUDENT"=c(1, 1, 1, 2, 2, 2, 3, 3, 3),
"CLASS"=c('A','A','A','B','B','B','C','C','C'),
"SEMESTER"=c(1, 2, 3, 1, 2, 3, 1, 2, 3),
"SCORE"=c(50, 74, 78, 79, 100, 65, 61, 70, 87),
"TEST"=c(80, 59, 63, 96, 57, 53, 93, 89, 92))
library(dplyr)
WANT = HAVE %>%
rowwise() %>%
mutate(MAX = max(c(SCORE, TEST)))
WANT$WHICHCOL = c("TEST", "SCORE", "SCORE", "TEST", "SCORE", "SCORE", "TEST", "TEST", "TEST")
I can get the max value between SCORE and TEST, but I also want to create the column WHICHCOL, which equals 'TEST' if TEST > SCORE or 'SCORE' if SCORE > TEST.
pmax is a built-in function that will be much more efficient than a rowwise max:
HAVE %>%
mutate(
MAX = pmax(SCORE, TEST),
WHICHCOL = ifelse(SCORE > TEST, "SCORE", "TEST")
)
# STUDENT CLASS SEMESTER SCORE TEST MAX WHICHCOL
# 1 1 A 1 50 80 80 TEST
# 2 1 A 2 74 59 74 SCORE
# 3 1 A 3 78 63 78 SCORE
# 4 2 B 1 79 96 96 TEST
# 5 2 B 2 100 57 100 SCORE
# 6 2 B 3 65 53 65 SCORE
# 7 3 C 1 61 93 93 TEST
# 8 3 C 2 70 89 89 TEST
# 9 3 C 3 87 92 92 TEST
Note that, since I use > not >=, TEST will win ties.
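If ties should be reported rather than silently assigned to one column, a nested ifelse() can label them explicitly. This is a minimal sketch on toy vectors; the "TIE" label is an assumption, not something the question asked for.

```r
SCORE <- c(50, 74, 80)
TEST  <- c(80, 59, 80)

# Element-wise maximum, as in the answer above
MAX <- pmax(SCORE, TEST)

# Three-way comparison: the last element is a tie and is labelled as such
WHICHCOL <- ifelse(SCORE > TEST, "SCORE",
                   ifelse(TEST > SCORE, "TEST", "TIE"))
data.frame(SCORE, TEST, MAX, WHICHCOL)
```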
A base R solution:
df1 <- HAVE[c("SCORE", "TEST")]
x <- max.col(df1, "first")
MAX <- df1[cbind(1:nrow(df1), x)]
WHICHCOL <- names(df1)[x]
HAVE <- cbind(HAVE, MAX, WHICHCOL)
HAVE
#> STUDENT CLASS SEMESTER SCORE TEST MAX WHICHCOL
#> 1 1 A 1 50 80 80 TEST
#> 2 1 A 2 74 59 74 SCORE
#> 3 1 A 3 78 63 78 SCORE
#> 4 2 B 1 79 96 96 TEST
#> 5 2 B 2 100 57 100 SCORE
#> 6 2 B 3 65 53 65 SCORE
#> 7 3 C 1 61 93 93 TEST
#> 8 3 C 2 70 89 89 TEST
#> 9 3 C 3 87 92 92 TEST
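The max.col() approach generalises to more than two columns, which pmax() plus ifelse() does not do as directly. A sketch with a hypothetical third column QUIZ (not in the original data); note that ties.method = "first" hands ties to the leftmost column, and that the default is "random".

```r
scores <- data.frame(
  SCORE = c(50, 74, 80),
  TEST  = c(80, 59, 80),
  QUIZ  = c(60, 90, 10)   # hypothetical extra column for illustration
)
m <- as.matrix(scores)

# Column index of the row-wise maximum; leftmost column wins ties
x <- max.col(m, ties.method = "first")

which_col <- colnames(m)[x]
row_max   <- m[cbind(seq_len(nrow(m)), x)]   # pick one cell per row
data.frame(scores, MAX = row_max, WHICHCOL = which_col)
```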
So I have this dataframe and I aim to add a new variable based on others:
Qi Age c_gen
1 56 13
2 43 15
5 31 6
3 67 8
I want to create a variable called c_sep such that:
if Qi==1 or Qi==2, c_sep takes a random number between (c_gen + 6) and Age;
if Qi==3 or Qi==4, c_sep takes a random number between (Age-15) and Age;
and 0 otherwise,
so my data would look something like this:
Qi Age c_gen c_sep
1 56 13 24
2 43 15 13
5 31 6 0
3 67 8 40
Any ideas, please?
In base R, you can do something along the lines of:
dat <- read.table(text = "Qi Age c_gen
1 56 13
2 43 15
5 31 6
3 67 8", header = TRUE)
set.seed(100)
dat$c_sep <- 0
dat$c_sep[dat$Qi %in% c(1,2)] <- apply(dat[dat$Qi %in% c(1,2),], 1, \(row) sample(
(row["c_gen"]+6):row["Age"], 1
)
)
dat$c_sep[dat$Qi %in% c(3,4)] <- apply(dat[dat$Qi %in% c(3,4),], 1, \(row) sample(
(row["Age"]-15):row["Age"], 1
)
)
dat
# Qi Age c_gen c_sep
# 1 1 56 13 28
# 2 2 43 15 43
# 3 5 31 6 0
# 4 3 67 8 57
If you are doing it more than twice you might want to put this in a function - depending on your requirements.
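One way such a function could look is sketched below. The helper name sample_between is hypothetical. It draws by position with sample.int(), which sidesteps a known quirk of sample(): when given a single number n, sample(n, 1) draws from 1:n rather than returning n itself.

```r
# Hypothetical helper: one random integer between lo and hi (inclusive)
sample_between <- function(lo, hi) {
  candidates <- seq(lo, hi)
  candidates[sample.int(length(candidates), 1)]
}

dat <- data.frame(Qi = c(1, 2, 5, 3), Age = c(56, 43, 31, 67),
                  c_gen = c(13, 15, 6, 8))
set.seed(100)
dat$c_sep <- 0                      # default for all other Qi values
early <- dat$Qi %in% c(1, 2)
late  <- dat$Qi %in% c(3, 4)
dat$c_sep[early] <- mapply(sample_between, dat$c_gen[early] + 6, dat$Age[early])
dat$c_sep[late]  <- mapply(sample_between, dat$Age[late] - 15, dat$Age[late])
dat
```

With this helper, sample_between(7, 7) returns 7, whereas sample(7:7, 1) could return any value from 1 to 7.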
Try this
df$c_sep <- ifelse(df$Qi == 1 | df$Qi == 2 ,
sapply(1:nrow(df) ,
\(x) sample(seq(df$c_gen[x] + 6, df$Age[x]) ,1)) ,
ifelse(df$Qi == 3 | df$Qi == 4 ,
sapply(1:nrow(df) ,
\(x) sample(seq(df$Age[x] - 15, df$Age[x]) ,1)) , 0))
output
Qi Age c_gen c_sep
1 1 56 13 41
2 2 43 15 42
3 5 31 6 0
4 3 67 8 58
A tidyverse option:
library(tidyverse)
df <- tribble(
~Qi, ~Age, ~c_gen,
1, 56, 13,
2, 43, 15,
5, 31, 6,
3, 67, 8
)
df |>
rowwise() |>
mutate(c_sep = case_when(
Qi <= 2 ~ sample(seq(c_gen + 6, Age, 1), 1),
between(Qi, 3, 4) ~ sample(seq(Age - 15, Age, 1), 1),
TRUE ~ 0
)) |>
ungroup()
#> # A tibble: 4 × 4
#> Qi Age c_gen c_sep
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 56 13 39
#> 2 2 43 15 41
#> 3 5 31 6 0
#> 4 3 67 8 54
Created on 2022-06-29 by the reprex package (v2.0.1)
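For a large data frame, the row-by-row sampling in the answers above could be replaced by a fully vectorised draw. The following is a sketch in base R, assuming integer-valued bounds with lower <= upper on every row; the names lo, hi and draw are introduced here for illustration.

```r
dat <- data.frame(Qi = c(1, 2, 5, 3), Age = c(56, 43, 31, 67),
                  c_gen = c(13, 15, 6, 8))
set.seed(1)

# Lower bound depends on Qi; the upper bound is always Age
lo <- ifelse(dat$Qi %in% c(1, 2), dat$c_gen + 6, dat$Age - 15)
hi <- dat$Age

# Uniform integer in [lo, hi]: scale a U(0, 1) draw by the range width
draw <- lo + floor(runif(nrow(dat)) * (hi - lo + 1))

# Keep the draw only for Qi in 1..4, otherwise 0
dat$c_sep <- ifelse(dat$Qi %in% 1:4, draw, 0)
dat
```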
I want to subtract the result at time 0 from the results at all other time levels, for each id.
id <- rep(1:4,each=4)
time <- rep(c(0,5,10,15),4)
a <- c(34,56,67,35)
b <-c(56,78,23,90)
c <- c(23,89,67,78)
df <- data.frame(id,time,a,b,c)
df
id time a b c
1 1 0 34 56 23
2 1 5 56 78 89
3 1 10 67 23 67
4 1 15 35 90 78
5 2 0 34 56 23
6 2 5 56 78 89
7 2 10 67 23 67
8 2 15 35 90 78
9 3 0 34 56 23
10 3 5 56 78 89
11 3 10 67 23 67
12 3 15 35 90 78
13 4 0 34 56 23
14 4 5 56 78 89
15 4 10 67 23 67
16 4 15 35 90 78
I started like this but it feels there must be a more efficient way. Any suggestions? Thanks!
for( i in 1:length(unique(df$id))){
  df_id <- df[df$id==i,]
  for(j in 2:length(time)){
    test <- t(df_id[,-1])
    test[,c(2:4)]-test[,1]
  }
}
Here's an option with dplyr -
library(dplyr)
df %>%
group_by(id) %>%
mutate(across(a:c, ~. - .[time == 0])) %>%
ungroup
# id time a b c
# <int> <dbl> <dbl> <dbl> <dbl>
# 1 1 0 0 0 0
# 2 1 5 22 22 66
# 3 1 10 33 -33 44
# 4 1 15 1 34 55
# 5 2 0 0 0 0
# 6 2 5 22 22 66
# 7 2 10 33 -33 44
# 8 2 15 1 34 55
# 9 3 0 0 0 0
#10 3 5 22 22 66
#11 3 10 33 -33 44
#12 3 15 1 34 55
#13 4 0 0 0 0
#14 4 5 22 22 66
#15 4 10 33 -33 44
#16 4 15 1 34 55
Using time == 0 works if every id is guaranteed to have exactly one row with time == 0. If some ids have no such row, or more than one, match is probably the better option.
df %>% group_by(id) %>% mutate(across(a:c, ~. - .[match(0, time)]))
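The difference is easiest to see on an id that has no baseline row. The sketch below uses base R only, with a small made-up data frame in which id 2 lacks time == 0; split() and lapply() stand in for group_by() and mutate().

```r
df2 <- data.frame(id   = c(1, 1, 1, 2, 2),
                  time = c(0, 5, 10, 5, 10),
                  a    = c(34, 56, 67, 56, 67))

# Logical subsetting returns a zero-length vector when nothing matches,
# while match() returns NA, which propagates cleanly through the subtraction
no_base <- df2[df2$id == 2, ]
length(no_base$a[no_base$time == 0])   # 0
no_base$a[match(0, no_base$time)]      # NA

# Per-id subtraction using the first time == 0 row (NA if there is none)
out <- do.call(rbind, lapply(split(df2, df2$id), function(g) {
  g$a <- g$a - g$a[match(0, g$time)]
  g
}))
out
```

Here id 1 becomes 0, 22, 33, and id 2 becomes NA for every row, which flags the missing baseline instead of failing or recycling.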
Use mapply in by.
vc <- c('a', 'b', 'c')
by(df, df$id, \(x) {x[-1, vc] <- mapply(`-`, x[-1, vc], x[1, vc]);x}) |>
do.call(what=rbind)
# id time a b c
# 1.1 1 0 34 56 23
# 1.2 1 5 22 22 66
# 1.3 1 10 33 -33 44
# 1.4 1 15 1 34 55
# 2.5 2 0 34 56 23
# 2.6 2 5 22 22 66
# 2.7 2 10 33 -33 44
# 2.8 2 15 1 34 55
# 3.9 3 0 34 56 23
# 3.10 3 5 22 22 66
# 3.11 3 10 33 -33 44
# 3.12 3 15 1 34 55
# 4.13 4 0 34 56 23
# 4.14 4 5 22 22 66
# 4.15 4 10 33 -33 44
# 4.16 4 15 1 34 55
If the position of the time == 0 row is not consistent within each id, you need a more verbose formulation:
{x[x$time != 0, vc] <- mapply(`-`, x[x$time != 0, vc], x[x$time == 0, vc]);x}
Data:
df <- structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 4L, 4L, 4L, 4L), time = c(0, 5, 10, 15, 0, 5, 10, 15,
0, 5, 10, 15, 0, 5, 10, 15), a = c(34, 56, 67, 35, 34, 56, 67,
35, 34, 56, 67, 35, 34, 56, 67, 35), b = c(56, 78, 23, 90, 56,
78, 23, 90, 56, 78, 23, 90, 56, 78, 23, 90), c = c(23, 89, 67,
78, 23, 89, 67, 78, 23, 89, 67, 78, 23, 89, 67, 78)), class = "data.frame", row.names = c(NA,
-16L))
My intention involves creating a variable based on the values of two numeric ones. I have not written any user-defined functions in R and need help getting started.
Dataset:
My dataset has over 3k stores, but I created a reproducible example of the first 10 rows. The day-of-week columns show the total delivery volume for that day over the year. Store_Num is the store number and Total is the total deliveries for a store over the year.
I want the predominant delivery days stored in a variable called Del_Sch, using the following tiers. If the first condition is TRUE (50-100%), create the variable from the matching column names. If FALSE, test the second condition and create the variable from all column names between 32-50%, etc. If no day is over 20%, no predominant delivery days are counted.
-Volume in a day between 50-100% of the total.
-Volume in a day between 32-50% of total
-Volume in a day between 25-32% of total.
-Volume in a day between 20-25% of total.
-Volume in a day less than 20% of total.
Reproducible Example:
Store_Num <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
#Total deliveries made per week
Sun_Del <- c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
Mon_Del <- c(10, 50, 51, 7, 80, 97, 21, 49, 30, 3)
Tue_Del <- c(7, NA, 2, 50, 5, 56, 1, 4, 35, 52)
Wed_Del <- c(49, 51, 1, 4, 51, 16, 2, 2, 1, 1)
Thu_Del <- c(3, 2, 47, 7, 40, 2, 6, 5, 1, 7)
Fri_Del <- c(50, 49, 3, 51, 53, 86, 9, 52, 25, 52)
Sat_Del <- c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
Total <- c(119, 152, 104, 119, 229, 257, 39, 112, 92, 115)
#Single dataset
Schedule <- data.frame(Store_Num, Sun_Del, Mon_Del, Tue_Del,
Wed_Del, Thu_Del, Fri_Del, Sat_Del, Total)
Schedule
Store_Num Sun_Del Mon_Del Tue_Del Wed_Del Thu_Del Fri_Del Sat_Del Total
1 1 NA 10 7 49 3 50 NA 119
2 2 NA 50 NA 51 2 49 NA 152
3 3 NA 51 2 1 47 3 NA 104
4 4 NA 7 50 4 7 51 NA 119
5 5 NA 80 5 51 40 53 NA 229
6 6 NA 97 56 16 2 86 NA 257
7 7 NA 21 1 2 6 9 NA 39
8 8 NA 49 4 2 5 52 NA 112
9 9 NA 30 35 1 1 25 NA 92
10 10 NA 3 52 1 7 52 NA 115
Desired Output:
Store_Num Sun_Del Mon_Del Tue_Del Wed_Del Thu_Del Fri_Del Sat_Del Total Del_Sch
1 1 NA 10 7 49 3 50 NA 119 WF
2 2 NA 50 NA 51 2 49 NA 152 MWF
3 3 NA 51 2 1 47 3 NA 104 MTh
4 4 NA 7 50 4 7 51 NA 119 TF
5 5 NA 80 5 51 40 53 NA 229 MWF
6 6 NA 97 56 16 2 86 NA 257 MTF
7 7 NA 21 1 2 6 9 NA 39 M
8 8 NA 49 4 2 5 52 NA 112 MF
9 9 NA 30 35 1 1 25 NA 92 MTF
10 10 NA 3 52 1 7 52 NA 115 TF
Using tidyr and dplyr. I used the first two letters of each day name as labels to avoid the Tuesday/Thursday ambiguity:
library(dplyr)
library(tidyr)
Schedule %>% gather(Day, del, -Store_Num, -Total) %>%
mutate(proportion = ifelse(del/Total >= 0.5, 1,
ifelse(del/Total >= 0.32, 2,
ifelse(del/Total >= 0.25, 3,
ifelse(del/Total >= 0.20, 4,
NA))))) %>%
group_by(Store_Num) %>%
summarise(days = paste0(substr(Day[which(
proportion == min(proportion, na.rm = TRUE))],
1, 2), collapse = "")) %>%
merge(Schedule, ., by = "Store_Num")
Store_Num Sun_Del Mon_Del Tue_Del Wed_Del Thu_Del Fri_Del Sat_Del Total days
1 1 NA 10 7 49 3 50 NA 119 WeFr
2 2 NA 50 NA 51 2 49 NA 152 MoWeFr
3 3 NA 51 2 1 47 3 NA 104 MoTh
4 4 NA 7 50 4 7 51 NA 119 TuFr
5 5 NA 80 5 51 40 53 NA 229 Mo
6 6 NA 97 56 16 2 86 NA 257 MoFr
7 7 NA 21 1 2 6 9 NA 39 Mo
8 8 NA 49 4 2 5 52 NA 112 MoFr
9 9 NA 30 35 1 1 25 NA 92 MoTu
10 10 NA 3 52 1 7 52 NA 115 TuFr
Edit: there are a couple of mismatches between my results and your desired output (rows 5, 6 and 9); according to your rules, your desired output has mistakes there.
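Since the question mentions wanting a user-defined function, the tier logic can also be sketched as a small base R function using findInterval() instead of nested ifelse(). The function name del_sch and the "Su"/"Sa" labels are assumptions made here; the other labels follow the desired output (M, T, W, Th, F).

```r
# Labels in column order Sun..Sat; "Su" and "Sa" are assumed labels
day_labels <- c("Su", "M", "T", "W", "Th", "F", "Sa")

# Hypothetical helper: deliveries is a length-7 vector for one store
del_sch <- function(deliveries, total) {
  share <- deliveries / total
  share[is.na(share)] <- 0
  # Tier 0: < 20%, 1: 20-25%, 2: 25-32%, 3: 32-50%, 4: >= 50%
  tier <- findInterval(share, c(0.20, 0.25, 0.32, 0.50))
  if (max(tier) == 0) return("")          # no predominant day
  paste(day_labels[tier == max(tier)], collapse = "")
}

store1 <- del_sch(c(NA, 10, 7, 49, 3, 50, NA), 119)
store5 <- del_sch(c(NA, 80, 5, 51, 40, 53, NA), 229)
c(store1, store5)
```

For the full data this could be applied row by row, for example with apply() over the seven day columns. Store 1 gives "WF"; store 5 gives "M", because only Monday reaches the 32-50% tier.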
I'm trying to get the quantile number of a column in a data frame, but in reverse order. I want the highest number to be in quantile number 1.
Here is what I have so far:
> x<-c(10, 12, 75, 89, 25, 100, 67, 89, 4, 67, 120.2, 140.5, 170.5, 78.1)
> x <- data.frame(x)
> within(x, Q <- as.integer(cut(x, quantile(x, probs=0:5/5, na.rm=TRUE),
include.lowest=TRUE)))
x Q
1 10.0 1
2 12.0 1
3 75.0 3
4 89.0 4
5 25.0 2
6 100.0 4
7 67.0 2
8 89.0 4
9 4.0 1
10 67.0 2
11 120.2 5
12 140.5 5
13 170.5 5
14 78.1 3
And what I want to get is:
x Q
1 10.0 5
2 12.0 5
3 75.0 3
4 89.0 2
5 25.0 4
6 100.0 2
7 67.0 4
8 89.0 2
9 4.0 5
10 67.0 4
11 120.2 1
12 140.5 1
13 170.5 1
14 78.1 3
One way to do this is to specify the reversed labels in the cut() function. If you want Q to be an integer then you need to first coerce the factor labels into a character and then into an integer.
result <- within(x, Q <- as.integer(as.character((cut(x,
quantile(x, probs = 0:5/5, na.rm = TRUE),
labels = c(5, 4, 3, 2, 1),
include.lowest = TRUE)))))
head(result)
x Q
1 10 5
2 12 5
3 75 3
4 89 2
5 25 4
6 100 2
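An alternative sketch that avoids the factor-to-character round trip: cut() numbers its bins 1 to 5 from lowest to highest, so subtracting the integer code from 6 reverses the order. The name n_bins is introduced here for illustration.

```r
x <- data.frame(x = c(10, 12, 75, 89, 25, 100, 67, 89, 4, 67,
                      120.2, 140.5, 170.5, 78.1))
n_bins <- 5L

# Bin codes 1..5, lowest values in bin 1
q_low_first <- as.integer(cut(x$x,
                              quantile(x$x, probs = 0:n_bins / n_bins, na.rm = TRUE),
                              include.lowest = TRUE))

# Reverse so that the highest values land in quantile 1
x$Q <- (n_bins + 1L) - q_low_first
x
```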
Your data:
x <- c(10, 12, 75, 89, 25, 100, 67, 89, 4, 67, 120.2, 140.5, 170.5, 78.1)
x <- data.frame(x)