I have a df like this:
ID <- c("A01","B20","C3","D4")
Nb_data <- c(2,2,2,3)
Weight_t1 <- c(70,44,98,65)
Weight_t2 <- c(75,78,105,68)
Weight_t3 <- c(72,52,107,NA)
year1 <- c(20,28,32,50)
year2 <- c(28,32,35,60)
year3 <- c(29,35,38,NA)
LENGTHt1 <- c(175,155,198,165)
LENGTHt2 <- c(175,155,198,163)
LENGTHt3 <- c(176,154,198,NA)
df <- data.frame(ID,Nb_data,Weight_t1,Weight_t2,Weight_t3,year1,year2,year3,LENGTHt1,LENGTHt2,LENGTHt3)
(In my real data, the weight, year, and length columns run from t1 to t28.)
I want to tidy my data like this:

ID  Nb_data Weight Year Length
A01       2     70   20    175
A01       2     75   28    175
A01       2     72   29    176
B20       2     44   28    155
B20       2     78   32    155
B20       2     52   35    154
I tried:
df1 <- df %>%
  pivot_longer(cols = -c('ID','Nb_data'),
               names_to = c('Weight','Year','Length'),
               names_pattern = '(Weight_t[0-9]*|year[0-9]*|LENGTHt[0-9]*)',
               values_drop_na = TRUE)
or names_pattern = '(.t[0-9])(.t[0-9])(.t[0-9])'
I'm having some difficulty with the regex, or maybe pivot_longer isn't suitable...
You need to extract the common timepoint information from the variable names. Make this information consistent first, with a clear separator (_ in this case); then it becomes much easier.
I would do something like this:
library(tidyr)
library(dplyr)

df1 <- df
names(df1) <- gsub("year", "Year_t", names(df1))
names(df1) <- gsub("LENGTH", "Length_", names(df1))

df1 %>%
  pivot_longer(cols = -c('ID','Nb_data'),
               names_to = c("name", "timepoint"),
               names_sep = "_",
               values_drop_na = TRUE) %>%
  pivot_wider(names_from = name, values_from = value)
EDIT: or shorter, using ".value" in the names_to argument (as @onyambu showed in his answer):
df1 %>%
  pivot_longer(cols = -c('ID','Nb_data'),
               names_to = c(".value", "timepoint"),
               names_sep = "_",
               values_drop_na = TRUE)
Output:
ID Nb_data timepoint Weight Year Length
<chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 A01 2 t1 70 20 175
2 A01 2 t2 75 28 175
3 A01 2 t3 72 29 176
4 B20 2 t1 44 28 155
5 B20 2 t2 78 32 155
6 B20 2 t3 52 35 154
7 C3 2 t1 98 32 198
8 C3 2 t2 105 35 198
9 C3 2 t3 107 38 198
10 D4 3 t1 65 50 165
11 D4 3 t2 68 60 163
You could use pivot_longer directly, though with a bit of complex regex, as follows:
df %>%
  pivot_longer(matches("\\d+$"),
               names_to = c(".value", "grp"),
               names_pattern = "(.*?)[_t]{0,2}(\\d+$)",
               values_drop_na = TRUE)
# A tibble: 11 × 6
ID Nb_data grp Weight year LENGTH
<chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 A01 2 1 70 20 175
2 A01 2 2 75 28 175
3 A01 2 3 72 29 176
4 B20 2 1 44 28 155
5 B20 2 2 78 32 155
6 B20 2 3 52 35 154
7 C3 2 1 98 32 198
8 C3 2 2 105 35 198
9 C3 2 3 107 38 198
10 D4 3 1 65 50 165
11 D4 3 2 68 60 163
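As an aside, you can check what that names_pattern captures before trusting it inside pivot_longer; a quick base R illustration with strcapture, shown for three of the column names from df:

strcapture("(.*?)[_t]{0,2}(\\d+$)", c("Weight_t1", "year2", "LENGTHt3"),
           proto = data.frame(name = character(), grp = character()))
#>     name grp
#> 1 Weight   1
#> 2   year   2
#> 3 LENGTH   3

The lazy (.*?) captures the stem, [_t]{0,2} swallows the optional "_t"/"t" separator, and (\\d+$) captures the trailing timepoint digits.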
In my dataframe, I have multiple columns with student grades. I want to sum the "Quiz" columns (e.g., Quiz1, Quiz2). However, I only want to sum the top 2 values, and ignore the others. I want to create a new column with the total (i.e., the sum of the top 2 values).
One issue is that some students have grades that tie for the top 2 grades in a given row. For example, Aaron has a high score of 42, but then there are two scores that tie for the second highest (i.e., 36).
Data
df <-
structure(
list(
Student = c("Aaron", "James", "Charlotte", "Katie", "Olivia",
"Timothy", "Grant", "Chloe", "Judy", "Justin"),
ID = c(30016, 87311, 61755, 55323, 94839, 38209, 34096,
98432, 19487, 94029),
Quiz1 = c(31, 25, 41, 10, 35, 19, 27, 42, 15, 20),
Quiz2 = c(42, 33, 34, 22, 23, 38, 48, 49, 23, 30),
Quiz3 = c(36, 36, 34, 32, 43, 38, 44, 42, 42, 37),
Quiz4 = c(36, 43, 39, 46, 40, 38, 43, 35, 41, 41)
),
row.names = c(NA, -10L),
class = c("tbl_df", "tbl", "data.frame")
)
I know that I can use pivot_longer to do this, which allows me to arrange by group, then take the top 2 values for each student. This works fine, but I would like a more efficient way with the tidyverse, rather than having to pivot back and forth.
What I Tried
library(tidyverse)
df %>%
  pivot_longer(-c(Student, ID)) %>%
  group_by(Student, ID) %>%
  arrange(desc(value), .by_group = TRUE) %>%
  slice_head(n = 2) %>%
  pivot_wider(names_from = name, values_from = value) %>%
  ungroup() %>%
  mutate(Total = rowSums(select(., starts_with("Quiz")), na.rm = TRUE))
I also know that if I wanted to sum all the columns on each row, then I could use rowSums, as I made use of above. However, I am unsure how to do rowSums of just the top 2 values in the 4 quiz columns.
Expected Output
# A tibble: 10 × 7
Student ID Quiz2 Quiz3 Quiz1 Quiz4 Total
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Aaron 30016 42 36 NA NA 78
2 Charlotte 61755 NA NA 41 39 80
3 Chloe 98432 49 NA 42 NA 91
4 Grant 34096 48 44 NA NA 92
5 James 87311 NA 36 NA 43 79
6 Judy 19487 NA 42 NA 41 83
7 Justin 94029 NA 37 NA 41 78
8 Katie 55323 NA 32 NA 46 78
9 Olivia 94839 NA 43 NA 40 83
10 Timothy 38209 38 38 NA NA 76
Based on this StackOverflow answer.
library(tidyverse)
df <-
structure(
list(
Student = c("Aaron", "James", "Charlotte", "Katie", "Olivia",
"Timothy", "Grant", "Chloe", "Judy", "Justin"),
ID = c(30016, 87311, 61755, 55323, 94839, 38209, 34096,
98432, 19487, 94029),
Quiz1 = c(31, 25, 41, 10, 35, 19, 27, 42, 15, 20),
Quiz2 = c(42, 33, 34, 22, 23, 38, 48, 49, 23, 30),
Quiz3 = c(36, 36, 34, 32, 43, 38, 44, 42, 42, 37),
Quiz4 = c(36, 43, 39, 46, 40, 38, 43, 35, 41, 41)
),
row.names = c(NA, -10L),
class = c("tbl_df", "tbl", "data.frame")
)
df %>%
  rowwise() %>%
  mutate(Quiz_Total = sum(sort(c(Quiz1, Quiz2, Quiz3, Quiz4), decreasing = TRUE)[1:2])) %>%
  ungroup()
#> # A tibble: 10 × 7
#> Student ID Quiz1 Quiz2 Quiz3 Quiz4 Quiz_Total
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Aaron 30016 31 42 36 36 78
#> 2 James 87311 25 33 36 43 79
#> 3 Charlotte 61755 41 34 34 39 80
#> 4 Katie 55323 10 22 32 46 78
#> 5 Olivia 94839 35 23 43 40 83
#> 6 Timothy 38209 19 38 38 38 76
#> 7 Grant 34096 27 48 44 43 92
#> 8 Chloe 98432 42 49 42 35 91
#> 9 Judy 19487 15 23 42 41 83
#> 10 Justin 94029 20 30 37 41 78
With base R: select just the quiz result columns and you can treat them like a matrix. Use apply to sort each row in decreasing order and subset the first two elements, then use colSums.
df$Total <- colSums(apply(df[grepl("Quiz", names(df))], 1,
                          function(x) sort(x, decreasing = TRUE)[1:2]))
df
#> # A tibble: 10 × 7
#> Student ID Quiz1 Quiz2 Quiz3 Quiz4 Total
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Aaron 30016 31 42 36 36 78
#> 2 James 87311 25 33 36 43 79
#> 3 Charlotte 61755 41 34 34 39 80
#> 4 Katie 55323 10 22 32 46 78
#> 5 Olivia 94839 35 23 43 40 83
#> 6 Timothy 38209 19 38 38 38 76
#> 7 Grant 34096 27 48 44 43 92
#> 8 Chloe 98432 42 49 42 35 91
#> 9 Judy 19487 15 23 42 41 83
#> 10 Justin 94029 20 30 37 41 78
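Note that it is colSums rather than rowSums here because apply(..., 1, ...) returns its per-row results as the columns of a 2 x 10 matrix, so each student's top 2 scores sit in one column. Inspecting the intermediate result makes this visible (first three students shown, for illustration):

apply(df[grepl("Quiz", names(df))], 1, function(x) sort(x, decreasing = TRUE)[1:2])[, 1:3]
#>      [,1] [,2] [,3]
#> [1,]   42   43   41
#> [2,]   36   36   39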
You do not have to do pivot_wider. Note that the longer format is the tidy format. Just do pivot_longer and left_join:
df %>%
  left_join(pivot_longer(., -c(Student, ID)) %>%
              group_by(Student, ID) %>%
              summarise(Total = sum(sort(value, TRUE)[1:2]), .groups = 'drop'))
# A tibble: 10 x 7
Student ID Quiz1 Quiz2 Quiz3 Quiz4 Total
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Aaron 30016 31 42 36 36 78
2 James 87311 25 33 36 43 79
3 Charlotte 61755 41 34 34 39 80
4 Katie 55323 10 22 32 46 78
5 Olivia 94839 35 23 43 40 83
6 Timothy 38209 19 38 38 38 76
7 Grant 34096 27 48 44 43 92
8 Chloe 98432 42 49 42 35 91
9 Judy 19487 15 23 42 41 83
10 Justin 94029 20 30 37 41 78
Yet another solution, based on tidyverse:
library(tidyverse)
df %>%
  rowwise() %>%
  mutate(Quiz = list(c_across(starts_with("Quiz")) *
                       if_else(rank(c_across(starts_with("Quiz")), ties.method = "last") >= 3,
                               1, NA_real_)),
         across(matches("\\d$"), ~ NULL),
         total = sum(Quiz, na.rm = TRUE)) %>%
  unnest_wider(Quiz, names_sep = "")
#> # A tibble: 10 × 7
#> Student ID Quiz1 Quiz2 Quiz3 Quiz4 total
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Aaron 30016 NA 42 36 NA 78
#> 2 James 87311 NA NA 36 43 79
#> 3 Charlotte 61755 41 NA NA 39 80
#> 4 Katie 55323 NA NA 32 46 78
#> 5 Olivia 94839 NA NA 43 40 83
#> 6 Timothy 38209 NA 38 38 NA 76
#> 7 Grant 34096 NA 48 44 NA 92
#> 8 Chloe 98432 42 49 NA NA 91
#> 9 Judy 19487 NA NA 42 41 83
#> 10 Justin 94029 NA NA 37 41 78
(A bit messy) Base R Solution:
# Store the names of quiz columns as a vector: quiz_colnames => character vector
quiz_colnames <- grep("Quiz\\d+", names(df), value = TRUE)
# Store the names of the non-quiz columns as a vector: non_quiz_colnames => character vector
non_quiz_colnames <- names(df)[!(names(df) %in% quiz_colnames)]
# Store an Idx based on the ID: Idx => integer vector:
Idx <- with(df, as.integer(factor(ID, levels = unique(ID))))
# Split-Apply-Combine to calculate the top 2 quizzes: res => data.frame
res <- data.frame(
do.call(
rbind,
lapply(
with(
df,
split(
df,
Idx
)
),
function(x){
# Extract the top 2 quiz scores: top_2_quizes => named numeric vector
top_2_quizes <- head(sort(unlist(x[,quiz_colnames]), decreasing = TRUE), 2)
# Calculate the quiz columns not used: remainder_quiz_cols => character vector
remainder_quiz_cols <- quiz_colnames[!(quiz_colnames %in% names(top_2_quizes))]
# Nullify the remaining quizzes: x => data.frame
x[, remainder_quiz_cols] <- NA_integer_
# Calculate the resulting data.frame: data.frame => env
transform(
cbind(
x[, non_quiz_colnames],
x[,names(top_2_quizes)],
x[,remainder_quiz_cols]
),
Total = sum(top_2_quizes)
)[, c(non_quiz_colnames, "Quiz2", "Quiz3", "Quiz1", "Quiz4", "Total")]
}
)
),
row.names = NULL,
stringsAsFactors = FALSE
)
Try this base R approach to also get the NAs:
cbind(df[, 1:2], t(sapply(seq_len(nrow(df)), function(x) {
  ord <- gi <- order(unlist(df[x, 3:6]))[1:2]  # positions of the two lowest scores
  arow <- df[x, 3:6]
  arow[ord] <- NA                              # blank out the two lowest
  ttl <- rowSums(arow[-ord], na.rm = TRUE)     # sum the remaining top 2
  cbind(arow, Total = ttl)
})))
Student ID Quiz1 Quiz2 Quiz3 Quiz4 Total
1 Aaron 30016 NA 42 NA 36 78
2 James 87311 NA NA 36 43 79
3 Charlotte 61755 41 NA NA 39 80
4 Katie 55323 NA NA 32 46 78
5 Olivia 94839 NA NA 43 40 83
6 Timothy 38209 NA NA 38 38 76
7 Grant 34096 NA 48 44 NA 92
8 Chloe 98432 NA 49 42 NA 91
9 Judy 19487 NA NA 42 41 83
10 Justin 94029 NA NA 37 41 78
As @akrun showed above, collapse is another efficient possibility. radixorder provides an integer ordering vector; only the top 2 values in each row are kept, while the others are replaced with NA. Then rowSums gets the totals for each row.
library(collapse)
ftransform(gvr(df, "Student|ID"),
           dapply(gvr(df, "^Quiz"),
                  MARGIN = 1,
                  FUN = function(x) replace(x, radixorder(radixorder(x)) %in% 1:2, NA))) %>%
  ftransform(Total = rowSums(gvr(., "^Quiz"), na.rm = TRUE))
Output
# A tibble: 10 × 7
Student ID Quiz1 Quiz2 Quiz3 Quiz4 Total
* <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Aaron 30016 NA 42 NA 36 78
2 James 87311 NA NA 36 43 79
3 Charlotte 61755 41 NA NA 39 80
4 Katie 55323 NA NA 32 46 78
5 Olivia 94839 NA NA 43 40 83
6 Timothy 38209 NA NA 38 38 76
7 Grant 34096 NA 48 44 NA 92
8 Chloe 98432 NA 49 42 NA 91
9 Judy 19487 NA NA 42 41 83
10 Justin 94029 NA NA 37 41 78
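The double radixorder call is the key trick: ordering the ordering vector turns it into each element's rank, so ranks 1 and 2 (the two lowest of the four quiz scores) are the ones replaced with NA. A quick sketch on Aaron's row:

x <- c(31, 42, 36, 36)
radixorder(x)              #> 1 3 4 2  (ordering permutation)
radixorder(radixorder(x))  #> 1 4 2 3  (rank of each element)
# ranks 1 and 2 sit at positions 1 and 3, so 31 and the first 36 become NA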
I have data on repeated measurements (5 or fewer) which include blood pressure measurements. I formatted the data in a long format, but as this is my first time doing so, I no longer see how to get descriptive statistics for my variables.
My example data:
questiondata <- structure(list(id = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3,
3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9),
time = c("time1", "time2", "time3", "time5", "time1", "time2", "time3", "time5",
"time1", "time2", "time3", "time5", "time4", "time5", "time4", "time5",
"time4", "time5", "time4", "time5", "time4", "time5", "time4", "time5"),
cohort = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
systolicBP = c(102, 137, 132, 150, 152, 146, 160.5, 159.5, 144, 138, 137.5, 163,
137, 147, 125, 141, 147, 150, 148, 167.5, 153.5, 164.5, 159, 123),
diastolicBP = c(56, 99, 78, 90, 77, 78, 80.5, 82, 72, 70, 67.5, 61, 86, 90, 80.5,
84, 75, 81, 91, 96, 80, 87.5, 87, 79),
egfr = c(78.2, NA, 55.8, NA, NA, NA, 87.6, NA, NA, NA, 75.6, 70.9, 71.9, 71.8,
47.9, 36.6, 93.7, 81.5, 93.2, 82.1, 92.9, 79.1, 66.6, 55.2)),
row.names = c(NA, -24L), class = c("tbl_df", "tbl", "data.frame"))
Which corresponds to the following tibble:
# A tibble: 24 x 6
id time cohort systolicBP diastolicBP egfr
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 time1 1 102 56 78.2
2 1 time2 1 137 99 NA
3 1 time3 1 132 78 55.8
4 1 time5 1 150 90 NA
5 2 time1 1 152 77 NA
6 2 time2 1 146 78 NA
7 2 time3 1 160. 80.5 87.6
8 2 time5 1 160. 82 NA
9 3 time1 1 144 72 NA
10 3 time2 1 138 70 NA
11 3 time3 1 138. 67.5 75.6
12 3 time5 1 163 61 70.9
13 4 time4 2 137 86 71.9
14 4 time5 2 147 90 71.8
15 5 time4 2 125 80.5 47.9
16 5 time5 2 141 84 36.6
17 6 time4 2 147 75 93.7
18 6 time5 2 150 81 81.5
19 7 time4 2 148 91 93.2
20 7 time5 2 168. 96 82.1
21 8 time4 2 154. 80 92.9
22 8 time5 2 164. 87.5 79.1
23 9 time4 2 159 87 66.6
24 9 time5 2 123 79 55.2
So, for example, now I want to:
1. Get how many persons are in each cohort.
2. Get the mean of the systolicBP: in total, for each cohort, and at each timepoint.
I'm sure this is very easy, but I can't seem to get any viable results.
library(tidyverse)
questiondata %>%
  group_by(cohort) %>%
  summarise(n = n_distinct(id),
            mean_systolic = mean(systolicBP, na.rm = TRUE),
            .groups = "drop")
#> # A tibble: 2 x 3
#> cohort n mean_systolic
#> <dbl> <int> <dbl>
#> 1 1 3 143.
#> 2 2 6 147.
questiondata %>%
  group_by(cohort, time) %>%
  summarise(mean_systolic_time = mean(systolicBP, na.rm = TRUE), .groups = "drop")
#> # A tibble: 6 x 3
#> cohort time mean_systolic_time
#> <dbl> <chr> <dbl>
#> 1 1 time1 133.
#> 2 1 time2 140.
#> 3 1 time3 143.
#> 4 1 time5 158.
#> 5 2 time4 145.
#> 6 2 time5 149.
Created on 2021-06-25 by the reprex package (v2.0.0)
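The question also asked for the overall mean; the same summarise pattern covers that and the per-timepoint mean across cohorts (a sketch, reusing questiondata from above):

# Overall mean systolic BP:
questiondata %>%
  summarise(mean_systolic = mean(systolicBP, na.rm = TRUE))

# Mean systolic BP at each timepoint, across cohorts:
questiondata %>%
  group_by(time) %>%
  summarise(mean_systolic = mean(systolicBP, na.rm = TRUE), .groups = "drop")

And if you want the means formatted with units: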
questiondata %>%
  group_by(cohort, time) %>%
  summarise(mean_systolic_time = str_c(round(mean(systolicBP, na.rm = TRUE), digits = 0), " mmHg"),
            .groups = "drop")
# A tibble: 6 x 3
cohort time mean_systolic_time
<dbl> <chr> <chr>
1 1 time1 133 mmHg
2 1 time2 140 mmHg
3 1 time3 143 mmHg
4 1 time5 158 mmHg
5 2 time4 145 mmHg
6 2 time5 149 mmHg
I have a dataset of logged data at 5-minute intervals that also includes data at 1-minute intervals, denoted by _1 to _5 in the header.
Each row represents a 5-minute interval.
datetime temp speed_1 speed_2 speed_3 speed_4 speed_5
20190710 09:00:00 21 13 14 26 29 32
20190710 09:05:00 21 28 28 29 38 12
20190710 09:10:00 20 8 15 29 30 19
20190711 11:12:00 18 6 9 18 51 49
20190711 11:17:00 17 49 48 48 30 10
The actual dataset has an additional 25 columns of data logged at 5-minute intervals and consists of approximately 25,000 rows.
I'm looking for an efficient way of analyzing the speed for each day.
For example, if I wanted to plot the speed for each day it would take speed_1 to speed_5 from the earliest entry on a particular day, say 09:00:00, then speed_1 to speed_5 from the next time, 09:05:00, and so on for the whole day.
Currently I have created an additional dataframe for the speed that fills in the times to give:
datetime speed
20190710 09:00:00 13
20190710 09:01:00 14
20190710 09:02:00 26
20190710 09:03:00 29
20190710 09:04:00 32
This results in a second df of 125,000 entries. I was wondering if there is a more memory-efficient way of analyzing the original dataset, as the datasets may grow considerably in the future.
Edit: Reproducible code added
structure(list(time = structure(1:3, .Label = c("20190710 09-00-00", "20190710 09-05-00", "20190710 09-10-00"), class = "factor"),
               temp = c(21, 21, 20),
               speed_1 = c(13, 28, 8),
               speed_2 = c(14, 28, 15),
               speed_3 = c(26, 29, 29),
               speed_4 = c(29, 38, 30),
               speed_5 = c(32, 12, 19)),
          .Names = c("time", "temp", "speed_1", "speed_2", "speed_3", "speed_4", "speed_5"),
          row.names = c(NA, -3L), class = "data.frame")
Here is a dplyr version:
library(tidyverse)
library(lubridate)
df <- read.table(text='datetime temp speed_1 speed_2 speed_3 speed_4 speed_5
"20190710 09:00:00" 21 13 14 26 29 32
"20190710 09:05:00" 21 28 28 29 38 12
"20190710 09:10:00" 20 8 15 29 30 19
"20190711 11:12:00" 18 6 9 18 51 49
"20190711 11:17:00" 17 49 48 48 30 10',header=T)
# we take our dataframe
df %>%
  # ...then we put all the speed columns in one column
  pivot_longer(starts_with("speed_"),
               names_to = "minute",
               values_to = "speed") %>%
  # ...then we...
  mutate(datetime = ymd_hms(datetime),  # ...turn the "datetime" column into an actual datetime
         minute = gsub("speed_", "", minute) %>% as.numeric(),  # ...strip "speed_" from the former column names (which are now in column "minute")
         datetime = datetime + minutes(minute - 1))  # ...and add the minute offset to our datetime...
...to get this:
# A tibble: 25 x 4
datetime temp minute speed
<dttm> <int> <dbl> <int>
1 2019-07-10 09:00:00 21 1 13
2 2019-07-10 09:01:00 21 2 14
3 2019-07-10 09:02:00 21 3 26
4 2019-07-10 09:03:00 21 4 29
5 2019-07-10 09:04:00 21 5 32
6 2019-07-10 09:05:00 21 1 28
7 2019-07-10 09:06:00 21 2 28
8 2019-07-10 09:07:00 21 3 29
9 2019-07-10 09:08:00 21 4 38
10 2019-07-10 09:09:00 21 5 12
# ... with 15 more rows
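Since the stated goal was plotting the speed for each day, the long data can go straight into ggplot2. A minimal sketch, assuming the pipeline above is assigned to a name such as df_long (the faceting choice is just one option):

library(ggplot2)

df_long <- df %>%
  pivot_longer(starts_with("speed_"), names_to = "minute", values_to = "speed") %>%
  mutate(datetime = ymd_hms(datetime),
         minute = as.numeric(gsub("speed_", "", minute)),
         datetime = datetime + minutes(minute - 1))

# one panel per day, speed over time
ggplot(df_long, aes(datetime, speed)) +
  geom_line() +
  facet_wrap(~ as.Date(datetime), scales = "free_x")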
Some example data and expected output would help a lot. I gave it a shot anyway. You can do this if you simply want a list of all the speeds for every date:
dataset <- read.table(text='datetime temp speed_1 speed_2 speed_3 speed_4 speed_5
"20190710 09:00:00" 21 13 14 26 29 32
"20190710 09:05:00" 21 28 28 29 38 12
"20190710 09:10:00" 20 8 15 29 30 19
"20190711 11:12:00" 18 6 9 18 51 49
"20190711 11:17:00" 17 49 48 48 30 10',header=T)
dataset$datetime <- as.POSIXlt(dataset$datetime,format="%Y%m%d %H:%M:%OS")
lapply(split(dataset,as.Date(dataset$datetime)), function(x) c(t(x[,3:ncol(x)])) )
output:
$`2019-07-10`
[1] 13 14 26 29 32 28 28 29 38 12 8 15 29 30 19
$`2019-07-11`
[1] 6 9 18 51 49 49 48 48 30 10
Edit: Updated answer so that the speeds are in the correct order.
Here is something raw using data.table:
library(data.table)
setDT(df)
df[, time := as.POSIXct(time, format="%Y%m%d %H-%M-%OS")]
out <- df[, !"temp"
          ][, melt(.SD, id.vars = "time")
            ][, time := time + (rleid(variable) - 1) * 60, by = time
              ][order(time), !"variable"]
out
# time value
# 1: 2019-07-10 09:00:00 13
# 2: 2019-07-10 09:01:00 14
# 3: 2019-07-10 09:02:00 26
# 4: 2019-07-10 09:03:00 29
# 5: 2019-07-10 09:04:00 32
# 6: 2019-07-10 09:05:00 28
# 7: 2019-07-10 09:06:00 28
# 8: 2019-07-10 09:07:00 29
# 9: 2019-07-10 09:08:00 38
# 10: 2019-07-10 09:09:00 12
# 11: 2019-07-10 09:10:00 8
# 12: 2019-07-10 09:11:00 15
# 13: 2019-07-10 09:12:00 29
# 14: 2019-07-10 09:13:00 30
# 15: 2019-07-10 09:14:00 19
Data:
df <- data.frame(
time = factor(c("20190710 09-00-00", "20190710 09-05-00", "20190710 09-10-00")),
temp = c(21, 21, 20),
speed_1 = c(13, 28, 8),
speed_2 = c(14, 28, 15),
speed_3 = c(26, 29, 29),
speed_4 = c(29, 38, 30),
speed_5 = c(32, 12, 19)
)
I have a dataframe that looks like this:
id date1 value1 date2 value2 date3 value3
1 1113 2012-01-14 29 2012-09-29 22 2013-10-28 21
2 1622 2012-12-05 93 2012-12-05 82 2013-01-22 26
3 1609 2014-08-30 30 2013-04-07 53 2013-03-20 100
4 1624 2014-01-20 84 2013-03-17 92 2014-01-10 81
5 1861 2014-10-08 29 2012-08-19 84 2012-09-21 56
6 1640 2014-03-05 27 2012-02-28 5 2015-01-11 65
I want to create a new column that contains whichever value of the three columns "value1", "value2", and "value3" that is the most recent. I don't need to know which date it was associated with.
id date1 value1 date2 value2 date3 value3 value_recent
1 1113 2012-01-14 29 2012-09-29 22 2013-10-28 21 21
2 1622 2012-12-05 93 2012-12-05 82 2013-01-22 26 26
3 1609 2014-08-30 30 2013-04-07 53 2013-03-20 100 30
4 1624 2014-01-20 84 2013-03-17 92 2014-01-10 81 84
5 1861 2014-10-08 29 2012-08-19 84 2012-09-21 56 29
6 1640 2014-03-05 27 2012-02-28 5 2015-01-11 65 65
Code to create working example:
set.seed(1234)
id <- sample(1000:2000, 6, replace=TRUE)
date1 <- sample(seq(as.Date('2012-01-01'), as.Date('2016-01-01'), by="day"), 6)
value1 <- sample(1:100, 6, replace=TRUE)
date2 <- sample(seq(as.Date('2012-01-01'), as.Date('2016-01-01'), by="day"), 6)
value2 <- sample(1:100, 6, replace=TRUE)
date3 <- sample(seq(as.Date('2012-01-01'), as.Date('2016-01-01'), by="day"), 6)
value3 <- sample(1:100, 6, replace=TRUE)
df <- data.frame(id, date1, value1, date2, value2, date3, value3)
Edit: Per @Pierre Lafortune's answer, you can actually collapse this into one statement.
Edit 2: Added in data with NAs, also changed code to handle NAs.
This should do the trick rather nicely. It does require a loop, and I would be interested to see if someone could come up with a concise vectorized solution.
date_cols <- colnames(df)[grep("date", colnames(df))]
# For each row, find the position of the latest date, map it back to its
# column index in df, and take the value column immediately to its right:
df$value_recent <- df[cbind(1:nrow(df),
                            grep("date", colnames(df))[apply(sapply(df[, date_cols], as.numeric),
                                                             1, which.max)] + 1)]
df
id date1 value1 date2 value2 date3 value3 value_recent
1 1113 <NA> 29 2012-09-29 22 2013-10-28 21 21
2 1622 2012-12-05 93 2012-12-05 82 2013-01-22 26 26
3 1609 <NA> 30 2013-04-07 53 2013-03-20 100 53
4 1624 2014-01-20 84 2013-03-17 92 2014-01-10 81 84
5 1861 2014-10-08 29 2012-08-19 84 2012-09-21 56 29
6 1640 2014-03-05 27 2012-02-28 5 2015-01-11 65 65
Data:
df<-structure(list(id = c(1113L, 1622L, 1609L, 1624L, 1861L, 1640L
), date1 = structure(c(NA, 15679, NA, 16090, 16351, 16134), class = "Date"),
value1 = c(29L, 93L, 30L, 84L, 29L, 27L), date2 = structure(c(15612,
15679, 15802, 15781, 15571, 15398), class = "Date"), value2 = c(22L,
82L, 53L, 92L, 84L, 5L), date3 = structure(c(16006, 15727,
15784, 16080, 15604, 16446), class = "Date"), value3 = c(21L,
26L, 100L, 81L, 56L, 65L)), .Names = c("id", "date1", "value1",
"date2", "value2", "date3", "value3"), row.names = c(NA, -6L), class = "data.frame")
I'm using apply to go over the rows looking for the most recent date, then using that index to find the corresponding value. We use matrix subsetting to keep it concise:
indx <- apply(df[grep("date", names(df))], 1, function(x) which(x == max(x))[1])
df$value_recent <- df[grep("val", names(df))][cbind(1:nrow(df), indx)]
# id date1 value1 date2 value2 date3 value3 value_recent
# 1 1113 2012-01-14 29 2012-09-29 22 2013-10-28 21 21
# 2 1622 2012-12-05 93 2012-12-05 82 2013-01-22 26 26
# 3 1609 2014-08-30 30 2013-04-07 53 2013-03-20 100 30
# 4 1624 2014-01-20 84 2013-03-17 92 2014-01-10 81 84
# 5 1861 2014-10-08 29 2012-08-19 84 2012-09-21 56 29
# 6 1640 2014-03-05 27 2012-02-28 5 2015-01-11 65 65
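For reference, the matrix subsetting works because indexing a data frame (or matrix) with a two-column matrix picks one (row, column) cell per row and returns a vector. A toy illustration (made-up data, not from the question):

m <- data.frame(a = 1:3, b = 4:6, c = 7:9)
m[cbind(1:3, c(2, 1, 3))]  # picks m[1,2], m[2,1], m[3,3]
#> [1] 4 2 9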
(Note: arranging your data this way will cause more trouble than it's worth.)
There are probably less verbose ways to do this, but here's one option. First move it to a "long" format; then split it by id, sort, extract the most recent record, and merge that back into the original data frame.
ld <- reshape(df,
              idvar = "id",
              varying = list(paste0("date", 1:3),
                             paste0("value", 1:3)),
              v.names = c("date", "value"),
              direction = "long")
recent <- split(ld, ld$id)
recent <- lapply(recent, function(x) {
  d <- x[order(x$date), ]
  d <- d[nrow(d), c(1, 4)]
  names(d)[2] <- "value_recent"
  d
})
recent <- do.call(rbind, recent)
merge(df, recent, by = "id")
# id date1 value1 date2 value2 date3 value3 value_recent
# 1 1204 2014-10-25 73 2012-12-22 39 2015-07-18 62 62
# 2 1667 2012-01-16 97 2014-02-28 30 2014-12-31 83 83
# 3 1673 2015-01-16 96 2014-12-16 50 2014-08-05 31 96
# 4 1722 2015-02-07 10 2013-12-25 4 2012-08-18 93 10
# 5 1882 2012-10-20 91 2014-12-28 71 2015-09-03 18 18
# 6 1883 2012-03-30 73 2015-04-26 4 2014-12-23 74 4
Here's a similar solution that also starts with reshape but then does the rest in a series of pipes:
library(dplyr)
# reshape() below is stats::reshape from base R, so no reshaping package is needed
df2 <- reshape(df,
               varying = list(names(df)[grep("date", names(df))],
                              names(df)[grep("value", names(df))]),
               v.names = c("date", "value"),
               direction = "long") %>%
  # order data for step to come
  arrange(id, date) %>%
  # next two steps cut down to last (ordered) obs for each id
  group_by(id) %>%
  slice(n()) %>%
  # keep only the columns we need and rename the value column for merging
  select(id, most.recent = value) %>%
  # merge the values back into the original data frame, matching on id
  left_join(df, .)