I have a dataframe that looks like this:
id date1 value1 date2 value2 date3 value3
1 1113 2012-01-14 29 2012-09-29 22 2013-10-28 21
2 1622 2012-12-05 93 2012-12-05 82 2013-01-22 26
3 1609 2014-08-30 30 2013-04-07 53 2013-03-20 100
4 1624 2014-01-20 84 2013-03-17 92 2014-01-10 81
5 1861 2014-10-08 29 2012-08-19 84 2012-09-21 56
6 1640 2014-03-05 27 2012-02-28 5 2015-01-11 65
I want to create a new column that contains whichever of the three values "value1", "value2", and "value3" is the most recent. I don't need to know which date it was associated with.
id date1 value1 date2 value2 date3 value3 value_recent
1 1113 2012-01-14 29 2012-09-29 22 2013-10-28 21 21
2 1622 2012-12-05 93 2012-12-05 82 2013-01-22 26 26
3 1609 2014-08-30 30 2013-04-07 53 2013-03-20 100 30
4 1624 2014-01-20 84 2013-03-17 92 2014-01-10 81 84
5 1861 2014-10-08 29 2012-08-19 84 2012-09-21 56 29
6 1640 2014-03-05 27 2012-02-28 5 2015-01-11 65 65
Code to create working example:
set.seed(1234)
id <- sample(1000:2000, 6, replace=TRUE)
date1 <- sample(seq(as.Date('2012-01-01'), as.Date('2016-01-01'), by="day"), 6)
value1 <- sample(1:100, 6, replace=TRUE)
date2 <- sample(seq(as.Date('2012-01-01'), as.Date('2016-01-01'), by="day"), 6)
value2 <- sample(1:100, 6, replace=TRUE)
date3 <- sample(seq(as.Date('2012-01-01'), as.Date('2016-01-01'), by="day"), 6)
value3 <- sample(1:100, 6, replace=TRUE)
df <- data.frame(id, date1, value1, date2, value2, date3, value3)
Edit: Per @Pierre Lafortune's answer, you can actually collapse this into one statement.
Edit 2: Added in data with NAs, also changed code to handle NAs.
This should do the trick rather nicely. It does require a loop, and I would be interested to see if someone could come up with a concise vectorized solution.
date_cols <- colnames(df)[grep("date", colnames(df))]
most_recent <- apply(sapply(df[, date_cols], as.numeric), 1, which.max)
df$value_recent <- df[cbind(1:nrow(df), grep("date", colnames(df))[most_recent] + 1)]
df
id date1 value1 date2 value2 date3 value3 value_recent
1 1113 <NA> 29 2012-09-29 22 2013-10-28 21 21
2 1622 2012-12-05 93 2012-12-05 82 2013-01-22 26 26
3 1609 <NA> 30 2013-04-07 53 2013-03-20 100 53
4 1624 2014-01-20 84 2013-03-17 92 2014-01-10 81 84
5 1861 2014-10-08 29 2012-08-19 84 2012-09-21 56 29
6 1640 2014-03-05 27 2012-02-28 5 2015-01-11 65 65
Data:
df<-structure(list(id = c(1113L, 1622L, 1609L, 1624L, 1861L, 1640L
), date1 = structure(c(NA, 15679, NA, 16090, 16351, 16134), class = "Date"),
value1 = c(29L, 93L, 30L, 84L, 29L, 27L), date2 = structure(c(15612,
15679, 15802, 15781, 15571, 15398), class = "Date"), value2 = c(22L,
82L, 53L, 92L, 84L, 5L), date3 = structure(c(16006, 15727,
15784, 16080, 15604, 16446), class = "Date"), value3 = c(21L,
26L, 100L, 81L, 56L, 65L)), .Names = c("id", "date1", "value1",
"date2", "value2", "date3", "value3"), row.names = c(NA, -6L), class = "data.frame")
I'm using apply to go over the rows looking for the most recent date, then using that index to find the corresponding value. We use matrix subsetting to keep it concise:
indx <- apply(df[grep("date", names(df))], 1, function(x) which(x == max(x))[1])
df$value_recent <- df[grep("val", names(df))][cbind(1:nrow(df), indx)]
# id date1 value1 date2 value2 date3 value3 value_recent
# 1 1113 2012-01-14 29 2012-09-29 22 2013-10-28 21 21
# 2 1622 2012-12-05 93 2012-12-05 82 2013-01-22 26 26
# 3 1609 2014-08-30 30 2013-04-07 53 2013-03-20 100 30
# 4 1624 2014-01-20 84 2013-03-17 92 2014-01-10 81 84
# 5 1861 2014-10-08 29 2012-08-19 84 2012-09-21 56 29
# 6 1640 2014-03-05 27 2012-02-28 5 2015-01-11 65 65
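The matrix-subsetting step deserves a word: indexing with a two-column matrix of (row, column) pairs extracts one element per row in a single vectorized step. A minimal illustration:

```r
# Indexing with a two-column (row, col) matrix picks one element per row
m <- matrix(1:9, nrow = 3)       # columns: 1:3, 4:6, 7:9
picks <- cbind(1:3, c(2, 1, 3))  # pairs (1,2), (2,1), (3,3)
m[picks]                         # 4 2 9
```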
(Note: arranging your data this way will create more trouble than it's worth.)
There are probably less verbose ways to do this, but here's one option. First move it to a "long" format, then split it by id, sort, and extract the most recent record and merge that back in with the original data frame.
ld <- reshape(df,
idvar = "id",
varying = list(paste0("date", 1:3),
paste0("value", 1:3)),
v.names = c("date", "value"),
direction = "long")
recent <- split(ld, ld$id)
recent <- lapply(recent, function(x) {
d <- x[order(x$date), ]
d <- d[nrow(d), c(1, 4)]
names(d)[2] <- "value_recent"
d
})
recent <- do.call(rbind, recent)
merge(df, recent, by = "id")
# id date1 value1 date2 value2 date3 value3 value_recent
# 1 1204 2014-10-25 73 2012-12-22 39 2015-07-18 62 62
# 2 1667 2012-01-16 97 2014-02-28 30 2014-12-31 83 83
# 3 1673 2015-01-16 96 2014-12-16 50 2014-08-05 31 96
# 4 1722 2015-02-07 10 2013-12-25 4 2012-08-18 93 10
# 5 1882 2012-10-20 91 2014-12-28 71 2015-09-03 18 18
# 6 1883 2012-03-30 73 2015-04-26 4 2014-12-23 74 4
Here's a similar solution that also starts with reshape but then does the rest in a series of pipes:
library(dplyr)
library(reshape)
df2 <- reshape(df,
varying = list(names(df)[grep("date", names(df))],
names(df)[grep("value", names(df))]),
v.names = c("date", "value"),
direction = "long") %>%
# order data for step to come
arrange(id, date) %>%
# next two steps cut down to last (ordered) obs for each id
group_by(id) %>%
slice(n()) %>%
# keep only the columns we need and rename the value column for merging
select(id, most.recent = value) %>%
# merge the values back into the original data frame, matching on id
left_join(df, .)
I have df like this
ID <- c("A01","B20","C3","D4")
Nb_data <- c(2,2,2,3)
Weight_t1 <- c(70,44,98,65)
Weight_t2 <- c(75,78,105,68)
Weight_t3 <- c(72,52,107,NA)
year1 <- c(20,28,32,50)
year2 <- c(28,32,35,60)
year3 <- c(29,35,38,NA)
LENGTHt1 <- c(175,155,198,165)
LENGTHt2 <- c(175,155,198,163)
LENGTHt3 <- c(176,154,198,NA)
df <- data.frame(ID,Nb_data,Weight_t1,Weight_t2,Weight_t3,year1,year2,year3,LENGTHt1,LENGTHt2,LENGTHt3)
In the real data, the weight/year and length columns run from t1 to t28.
I want to tidy my data like:
ID  Nb_data Weight Year Length
A01 3       70     20   175
A01 3       75     28   175
A01 3       72     29   176
B20 3       44     28   155
B20 3       78     32   155
B20 3       52     35   154
I tried
df1 <- df %>%
pivot_longer(cols = -c('ID','Nb_data'),
names_to = c('Weight','Year','Length' ),
names_pattern = '(Weight_t[0-9]*|year[0-9]*|LENGTHt[0-9]*)' ,
values_drop_na = TRUE)
or names_pattern = '(.t[0-9])(.t[0-9])(.t[0-9])'
I am having some difficulty with the regex, or maybe pivot_longer is not suitable...
You need to extract the common timepoint information from the variable names. Make this information consistent first, with a clear separator (_ in this case), then it becomes much easier.
I would do something like this
library(tidyr)
library(dplyr)
df1 <- df
names(df1) <- gsub("year", "Year_t", names(df1))
names(df1) <- gsub("LENGTH", "Length_", names(df1))
df1 %>%
pivot_longer(cols = -c('ID','Nb_data'),
names_to = c("name", "timepoint"),
names_sep = "_",
values_drop_na = TRUE) %>%
pivot_wider(names_from = name, values_from = value)
EDIT: or shorter, using ".value" in the names_to argument (as @onyambu showed in his answer):
df1 %>%
pivot_longer(cols = -c('ID','Nb_data'),
names_to = c(".value", "timepoint"),
names_sep = "_",
values_drop_na = TRUE)
Output:
ID Nb_data timepoint Weight Year Length
<chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 A01 2 t1 70 20 175
2 A01 2 t2 75 28 175
3 A01 2 t3 72 29 176
4 B20 2 t1 44 28 155
5 B20 2 t2 78 32 155
6 B20 2 t3 52 35 154
7 C3 2 t1 98 32 198
8 C3 2 t2 105 35 198
9 C3 2 t3 107 38 198
10 D4 3 t1 65 50 165
11 D4 3 t2 68 60 163
You could use pivot_longer directly, though with a bit of complex regex, as follows:
df %>%
pivot_longer(matches("\\d+$"), names_to = c(".value", "grp"),
names_pattern = "(.*?)[_t]{0,2}(\\d+$)",
values_drop_na = TRUE)
# A tibble: 11 × 6
ID Nb_data grp Weight year LENGTH
<chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 A01 2 1 70 20 175
2 A01 2 2 75 28 175
3 A01 2 3 72 29 176
4 B20 2 1 44 28 155
5 B20 2 2 78 32 155
6 B20 2 3 52 35 154
7 C3 2 1 98 32 198
8 C3 2 2 105 35 198
9 C3 2 3 107 38 198
10 D4 3 1 65 50 165
11 D4 3 2 68 60 163
I want to combine about 20 dataframes, with different numbers of rows and columns, by their mutual rownames only. Any rows that are not shared across ALL dataframes should be deleted. So, for example, with two dataframes:
Patient1 Patient64 Patient472
ABC 28 38 0
XYZ 92 11 998
WWE 1 10 282
ICQ 0 76 56
SQL 22 1002 778
combine with
Pat_9 Pat_1 Pat_111
ABC 65 44 874
CBA 3 311 998
WWE 2 1110 282
vVv 2 760 56
GHG 12 1200 778
The result would be
Patient1 Patient64 Patient472 Pat_9 Pat_1 Pat_111
ABC 28 38 0 65 44 874
WWE 1 10 282 2 1110 282
I know how to use rbind and cbind but not for the purpose of joining according to shared rownames.
Try this, changing the list arguments to your data frames df1, df2, df3, ..., df20:
l <- lapply(list(df1 , df2 ) , \(x) {x[["id"]] <- rownames(x) ; x})
Reduce(\(x,y) merge(x,y , by = "id") , l)
For two dataframes you can try
merge(d1, d2, by = "row.names")
Row.names Patient1 Patient64 Patient472 Pat_9 Pat_1 Pat_111
1 ABC 28 38 0 65 44 874
2 WWE 1 10 282 2 1110 282
For more than two, you can use a tidyverse approach:
library(tidyverse)
lst(d1, d2, d2) %>%
map(rownames_to_column) %>%
reduce(inner_join, by="rowname")
You can first turn your rownames into a column with rownames_to_column, use an inner_join, and at the end convert back with column_to_rownames, like this:
df1 <- read.table(text=" Patient1 Patient64 Patient472
ABC 28 38 0
XYZ 92 11 998
WWE 1 10 282
ICQ 0 76 56
SQL 22 1002 778", header = TRUE)
df2 <- read.table(text = " Pat_9 Pat_1 Pat_111
ABC 65 44 874
CBA 3 311 998
WWE 2 1110 282
vVv 2 760 56
GHG 12 1200 778", header = TRUE)
library(dplyr)
library(tibble)
df1 %>%
rownames_to_column() %>%
inner_join(df2 %>% rownames_to_column(), by = "rowname") %>%
column_to_rownames()
#> Patient1 Patient64 Patient472 Pat_9 Pat_1 Pat_111
#> ABC 28 38 0 65 44 874
#> WWE 1 10 282 2 1110 282
Created on 2022-07-20 by the reprex package (v2.0.1)
Option with list of dataframes:
dfs_list <- list(df1, df2)
transform(Reduce(merge, lapply(dfs_list, function(x) data.frame(x, rn = row.names(x)))), row.names=rn, rn=NULL)
#> Patient1 Patient64 Patient472 Pat_9 Pat_1 Pat_111
#> ABC 28 38 0 65 44 874
#> WWE 1 10 282 2 1110 282
Created on 2022-07-20 by the reprex package (v2.0.1)
I am mainly interested in replacing a specific value (81) in many columns across the dataframe.
For example, if this is my dataset
Id Date Col_01 Col_02 Col_03 Col_04
30 2012-03-31 1 A42.2 20.46 43
36 1996-11-15 42 V73 23 55
96 2010-02-07 X48 81 13 3R
40 2010-03-18 AD14 18.12 20.12 36
69 2012-02-21 8 22.45 12 10
11 2013-07-03 81 V017 78.12 81
22 2001-06-01 11 09 55 12
83 2005-03-16 80.45 V22.15 46.52 X29.11
92 2012-02-12 1 4 67 12
34 2014-03-10 82.12 N72.22 V45.44 10
I would like to replace the value 81 in columns Col_01 through Col_04 with NA. The final expected dataset looks like this:
Id Date Col_01 Col_02 Col_03 Col_04
30 2012-03-31 1 A42.2 20.46 43
36 1996-11-15 42 V73 23 55
96 2010-02-07 X48 NA 13 3R
40 2010-03-18 AD14 18.12 20.12 36
69 2012-02-21 8 22.45 12 10
11 2013-07-03 NA V017 78.12 NA
22 2001-06-01 11 09 55 12
83 2005-03-16 80.45 V22.15 46.52 X29.11
92 2012-02-12 1 4 67 12
34 2014-03-10 82.12 N72.22 V45.44 10
I tried this approach
df %>% select(matches("^Col_\\d+$"))[ df %>% select(matches("^Col_\\d+$")) == 81 ] <- NA
Something similar to this solution data[ , 2:3 ][ data[ , 2:3 ] == 4 ] <- 10 here
Replacing occurrences of a number in multiple columns of data frame with another value in R
This did not work.
Any suggestion is much appreciated. Thanks in advance.
Instead of select, we can directly specify the matches in mutate to replace the values that are '81' with NA (using na_if):
library(dplyr)
df <- df %>%
mutate(across(matches("^Col_\\d+$"), ~ na_if(., "81")))
-output
df
Id Date Col_01 Col_02 Col_03 Col_04
1 30 2012-03-31 1 A42.2 20.46 43
2 36 1996-11-15 42 V73 23 55
3 96 2010-02-07 X48 <NA> 13 3R
4 40 2010-03-18 AD14 18.12 20.12 36
5 69 2012-02-21 8 22.45 12 10
6 11 2013-07-03 <NA> V017 78.12 <NA>
7 22 2001-06-01 11 09 55 12
8 83 2005-03-16 80.45 V22.15 46.52 X29.11
9 92 2012-02-12 1 4 67 12
10 34 2014-03-10 82.12 N72.22 V45.44 10
Or we can use base R
i1 <- grep("^Col_\\d+$", names(df))
df[i1][df[i1] == "81"] <- NA
The issue in the OP's code is that the assignment is not triggered as expected, i.e.
(df %>%
select(matches("^Col_\\d+$")))[(df %>%
select(matches("^Col_\\d+$"))) == "81" ]
[1] "81" "81" "81"
which is same as
df[i1][df[i1] == "81"]
[1] "81" "81" "81"
and not the assignment
(df %>%
select(matches("^Col_\\d+$")))[(df %>%
select(matches("^Col_\\d+$"))) == "81" ] <- NA
Error in (df %>% select(matches("^Col_\\d+$")))[(df %>% select(matches("^Col_\\d+$"))) == :
could not find function "(<-"
In base R, it does the assignment with [<-
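A small sketch of why that matters: `x[i] <- value` is rewritten by R into a call to the replacement function `[<-`, which is defined for data frames, whereas a parenthesized pipe result is not a name, so R goes looking for a nonexistent `(<-`:

```r
# x[i] <- value is evaluated as x <- `[<-`(x, i, value),
# and `[<-.data.frame` exists, so subset-assignment on a
# data frame bound to a name works:
df <- data.frame(a = c("1", "81"), b = c("81", "2"))
df[df == "81"] <- NA   # logical-matrix index, dispatched to `[<-`
df$a                   # "1" NA
df$b                   # NA  "2"
```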
data
df <- structure(list(Id = c(30L, 36L, 96L, 40L, 69L, 11L, 22L, 83L,
92L, 34L), Date = c("2012-03-31", "1996-11-15", "2010-02-07",
"2010-03-18", "2012-02-21", "2013-07-03", "2001-06-01", "2005-03-16",
"2012-02-12", "2014-03-10"), Col_01 = c("1", "42", "X48", "AD14",
"8", "81", "11", "80.45", "1", "82.12"), Col_02 = c("A42.2",
"V73", "81", "18.12", "22.45", "V017", "09", "V22.15", "4", "N72.22"
), Col_03 = c("20.46", "23", "13", "20.12", "12", "78.12", "55",
"46.52", "67", "V45.44"), Col_04 = c("43", "55", "3R", "36",
"10", "81", "12", "X29.11", "12", "10")),
class = "data.frame", row.names = c(NA,
-10L))
We can also use replace:
library(dplyr)
df <- df %>%
mutate(across(matches("^Col_\\d+$"), ~ replace(.x, .x == "81", NA)))
I want to create a time window around a given date. Other questions have centred around having a start and an end date; the only thing I have is one date, around which I would like to create a window. Currently I have a df with multiple dates and an ID, and I would like to create time windows around the dates ranging from -2 to +2 days. The outcome should be stored as a df so that I can join values from another df to it. The actual dataset is a lot larger, so manually entering start and end dates for every ID is not really an option.
df1 =
ID Date
56 2016-05-22
894 2016-11-09
ending up with a df similar to below
ID date
56 2016-05-20
56 2016-05-21
56 2016-05-22
56 2016-05-23
56 2016-05-24
894 2016-11-07
894 2016-11-08
894 2016-11-09
894 2016-11-10
894 2016-11-11
Here is a fast data.table solution
library(data.table)
setDT(df1)[, .(date = seq(as.Date(Date) - 2, as.Date(Date) + 2, 1)), by = ID]
# ID date
# 1: 56 2016-05-20
# 2: 56 2016-05-21
# 3: 56 2016-05-22
# 4: 56 2016-05-23
# 5: 56 2016-05-24
# 6: 894 2016-11-07
# 7: 894 2016-11-08
# 8: 894 2016-11-09
# 9: 894 2016-11-10
#10: 894 2016-11-11
Sample data
df1 <- read.table(text = " ID Date
56 2016-05-22
894 2016-11-09", header = T)
We can use complete from tidyr which makes it easy to complete sequences, i.e.
library(tidyverse)
df %>%
mutate(Date = as.Date(Date)) %>%
group_by(ID) %>%
complete(Date = seq.Date((Date-2), (Date+2), by = 'days'))
which gives,
# A tibble: 10 x 2
# Groups: ID [2]
ID Date
<int> <date>
1 56 2016-05-20
2 56 2016-05-21
3 56 2016-05-22
4 56 2016-05-23
5 56 2016-05-24
6 894 2016-11-07
7 894 2016-11-08
8 894 2016-11-09
9 894 2016-11-10
10 894 2016-11-11
A base R option would be to loop over the 'Date' column and collect each sequence in a list, then replicate the 'ID' according to the lengths of the list elements and concatenate the list into a new data.frame.
lst1 <- lapply(df1$Date, function(x) seq(x-2, x+2, by = '1 day'))
data.frame(ID = rep(df1$ID, lengths(lst1)), date = do.call(c, lst1))
# ID date
#1 56 2016-05-20
#2 56 2016-05-21
#3 56 2016-05-22
#4 56 2016-05-23
#5 56 2016-05-24
#6 894 2016-11-07
#7 894 2016-11-08
#8 894 2016-11-09
#9 894 2016-11-10
#10 894 2016-11-11
data
df1 <- structure(list(ID = c(56L, 894L), Date = structure(c(16943, 17114
), class = "Date")), row.names = c(NA, -2L), class = "data.frame")
I have a sample xts object with the some data:
dates <- seq.Date(from = as.Date("2010-01-01", format = "%Y-%m-%d"),
to = as.Date("2013-12-01", format = "%Y-%m-%d"), by = "month")
sample_data <- cbind(1:length(dates),length(dates):1)
xts_object <- xts(x = sample_data, order.by = dates)
I then use apply.yearly on it with the function cumsum:
apply.yearly(x = xts_object, FUN = cumsum)
The output is a transposed matrix, which is not what I originally intended it to return.
I would expect the snippet above to return the same output as:
rbind(apply(X = xts_object[1:12],MARGIN = 2,FUN = cumsum),
apply(X = xts_object[13:24],MARGIN = 2,FUN = cumsum),
apply(X = xts_object[25:36],MARGIN = 2,FUN = cumsum),
apply(X = xts_object[37:48],MARGIN = 2,FUN = cumsum))
The problem with using apply is that it returns a matrix, not an xts object. While I could solve this with as.xts, I would like to know if there is something I am missing, or if I am using apply.yearly incorrectly. Using plain apply seems more prone to hard-to-catch errors and bugs.
This might not be the most elegant solution, but it works:
# Split xts_object by year
xts_list = split(xts_object, "years")
# cumsum for each year
cumsum_list = lapply(xts_list, FUN = cumsum)
# rbind them together
do.call(rbind, cumsum_list)
# [,1] [,2]
# 2010-01-01 1 48
# 2010-02-01 3 95
# 2010-03-01 6 141
# 2010-04-01 10 186
# 2010-05-01 15 230
# 2010-06-01 21 273
# 2010-07-01 28 315
# 2010-08-01 36 356
# 2010-09-01 45 396
# 2010-10-01 55 435
# 2010-11-01 66 473
# 2010-12-01 78 510
# 2011-01-01 13 36
# 2011-02-01 27 71
# 2011-03-01 42 105
# 2011-04-01 58 138
# 2011-05-01 75 170
# 2011-06-01 93 201
# 2011-07-01 112 231
# 2011-08-01 132 260
# 2011-09-01 153 288
# 2011-10-01 175 315
# 2011-11-01 198 341
# 2011-12-01 222 366
# 2012-01-01 25 24
# 2012-02-01 51 47
# 2012-03-01 78 69
# 2012-04-01 106 90
# 2012-05-01 135 110
# 2012-06-01 165 129
# 2012-07-01 196 147
# 2012-08-01 228 164
# 2012-09-01 261 180
# 2012-10-01 295 195
# 2012-11-01 330 209
# 2012-12-01 366 222
# 2013-01-01 37 12
# 2013-02-01 75 23
# 2013-03-01 114 33
# 2013-04-01 154 42
# 2013-05-01 195 50
# 2013-06-01 237 57
# 2013-07-01 280 63
# 2013-08-01 324 68
# 2013-09-01 369 72
# 2013-10-01 415 75
# 2013-11-01 462 77
# 2013-12-01 510 78
class(do.call(rbind, cumsum_list))
# [1] "xts" "zoo"
The resulting object would still be "xts"
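For what it's worth, the three steps collapse into a single expression (same split/cumsum/rbind logic, assuming the xts package is installed):

```r
library(xts)

# Rebuild the sample object from the question
dates <- seq.Date(from = as.Date("2010-01-01"),
                  to = as.Date("2013-12-01"), by = "month")
xts_object <- xts(cbind(1:length(dates), length(dates):1), order.by = dates)

# split by year, cumsum within each year, bind back together
result <- do.call(rbind, lapply(split(xts_object, "years"), cumsum))
head(result, 2)   # 2010-01-01: 1 48, 2010-02-01: 3 95
```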