Align time series in R

I have a data frame like this.
date X1 X2
1: 2001-12-31 96.32 NA
2: 2002-01-29 NA 100.7
3: 2002-01-31 96.59 NA
4: 2002-02-28 96.67 100.7
5: 2002-03-29 NA 100.7
6: 2002-03-31 97.36 NA
7: 2002-04-29 NA 87.3
8: 2002-04-30 97.72 NA
9: 2002-05-29 NA 87.3
10: 2002-05-31 97.60 NA
My two series are recorded on different dates and I would like to align them to month end: X1 should serve as the "base", and each X2 value should be moved to the month-end date that X1 uses for the same month. The end product would be a clean data frame with matching dates and no NAs, except where a series has no value at all for a month.
Expected output:
date X1 X2
1: 2001-12-31 96.32 NA
2: 2002-01-31 96.59 100.7
3: 2002-02-28 96.67 100.7
4: 2002-03-31 97.36 100.7
5: 2002-04-30 97.72 87.3
6: 2002-05-31 97.60 87.3
Data
df <- structure(list(date = structure(c(11687L, 11716L, 11718L, 11746L,
11775L, 11777L, 11806L, 11807L, 11836L, 11838L), class = "Date"),
X1 = c(96.32, NA, 96.59, 96.67, NA, 97.36, NA, 97.72, NA,
97.6), X2 = c(NA, 100.7, NA, 100.7, 100.7, NA, 87.3, NA,
87.3, NA)), .Names = c("date", "X1", "X2"), row.names = c(NA,
10L), class = "data.frame")

We could try the following using data.table.
library(data.table)
setDT(df)[,month := month(date)][,lapply(.SD, max, na.rm = TRUE), by = month]
# month date X1 X2
#1: 12 2001-12-31 96.32 -Inf
#2: 1 2002-01-31 96.59 100.7
#3: 2 2002-02-28 96.67 100.7
#4: 3 2002-03-31 97.36 100.7
#5: 4 2002-04-30 97.72 87.3
#6: 5 2002-05-31 97.60 87.3
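A new variable, month, is created for grouping purposes (this also keeps the original date column); you can drop it afterwards if it is not needed. The -Inf in the first row appears because max(..., na.rm = TRUE) has no non-missing X2 values to work with in December 2001, in which case it returns -Inf with a warning.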
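As a hedged variation on the answer above, the sketch below groups on a year-month key instead of the month number (so the same calendar month in different years would stay separate) and returns NA rather than -Inf when a month has no value. It uses base R only; the helper name last_obs and the choice of taking the last non-missing value per month are assumptions made for illustration, not part of the original answer.

```r
# Same data as in the question, written out with readable dates.
df <- data.frame(
  date = as.Date(c("2001-12-31", "2002-01-29", "2002-01-31", "2002-02-28",
                   "2002-03-29", "2002-03-31", "2002-04-29", "2002-04-30",
                   "2002-05-29", "2002-05-31")),
  X1 = c(96.32, NA, 96.59, 96.67, NA, 97.36, NA, 97.72, NA, 97.6),
  X2 = c(NA, 100.7, NA, 100.7, 100.7, NA, 87.3, NA, 87.3, NA)
)

# Hypothetical helper: last non-missing value, or NA if there is none.
last_obs <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) NA_real_ else x[length(x)]
}

# Year-month key such as "2002-01"; sorts chronologically as text.
ym <- format(df$date, "%Y-%m")

aligned <- data.frame(
  # latest date seen in each month (here the X1 month-end date)
  date = as.Date(as.vector(tapply(as.numeric(df$date), ym, max)),
                 origin = "1970-01-01"),
  X1 = as.vector(tapply(df$X1, ym, last_obs)),
  X2 = as.vector(tapply(df$X2, ym, last_obs))
)
aligned
```

This assumes the base series (X1) always falls on the latest date within its month, as it does in the sample data.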

Related

How to set missing some columns and their corresponding columns in data frame in R

I have longitudinal data with three follow-ups. In columns 2, 3 and 4 (v_9, v_01 and v_03) I want to set the value 99 to NA, and I also want to set their corresponding columns ("d_9", "d_01", "d_03" and "a_9", "a_01", "a_03") to NA in the same rows.
How can I do this for all individuals across my whole data set in R? Thanks in advance for the help.
As an example, this is the desired result for ID 101:
"id" "v_9" "v_01" "v_03" "d_9" "d_01" "d_03" "a_9" "a_01" "a_03"
101 12 NA 10 2015-03-23 NA 2003-06-19 40.50650 NA 44.1065
structure(list(id = c(101, 102, 103, 104), v_9 = c(12, 99, 16,
25), v_01 = c(99, 12, 16, NA), v_03 = c(10, NA, 99, NA), d_9 = structure(c(16517,
17613, 16769, 10667), class = "Date"), d_01 = structure(c(13291,
NA, 13566, NA), class = "Date"), d_03 = structure(c(12222, NA,
12119, NA), class = "Date"), a_9 = c(40.5065, 40.5065, 30.19713,
51.40862), a_01 = c(42.5065, 41.5112, 32.42847, NA), a_03 = c(44.1065,
NA, 35.46543, NA)), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
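Try this function:
fn <- function(df) {
  for (s in c("_9", "_01", "_03")) {
    # rows where the v column equals 99; which() drops NA comparisons
    i <- which(df[[paste0("v", s)]] == 99)
    # blank out the value and its matching d and a columns in those rows
    df[i, paste0("v", s)] <- NA
    df[i, paste0("d", s)] <- NA
    df[i, paste0("a", s)] <- NA
  }
  df
}
df <- fn(df)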
Output
# A tibble: 4 × 10
id v_9 v_01 v_03 d_9 d_01 d_03 a_9 a_01 a_03
<dbl> <dbl> <dbl> <dbl> <date> <date> <date> <dbl> <dbl> <dbl>
1 101 12 NA 10 2015-03-23 NA 2003-06-19 40.5 NA 44.1
2 102 NA 12 NA NA NA NA NA 41.5 NA
3 103 16 16 NA 2015-11-30 2007-02-22 NA 30.2 32.4 NA
4 104 25 NA NA 1999-03-17 NA NA 51.4 NA NA
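The which() call in the function above matters because some v values are already NA: comparing NA with 99 yields NA rather than FALSE. A minimal sketch on a made-up vector (the name v is only for illustration) shows what gets selected:

```r
# Made-up values: two real hits and one value that is already missing.
v <- c(12, 99, NA, 99)

hits <- v == 99        # FALSE TRUE NA TRUE: the NA comparison stays NA
rows <- which(hits)    # which() keeps only the positions that are TRUE
rows
```

Only positions 2 and 4 are returned, so rows that were already missing are left alone rather than being passed on as NA subscripts.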

Replace values in a data table based on several columns in lookup table

Given a data table df with, among others, a code column and a version column, I would like to replace the values in the code column based on a lookup table look.
df <- data.table(structure(list(year = c("1951", "1951", "1951", "1951", "1951"),
region = c(10, 11, 12, 18, 4),
code = c("140", "140", "140","1403", "1404"),
version = c(6, 7, 8, 9, 9)), row.names = c(NA,-5L), class = c("data.table", "data.frame")))
year region code version
1: 1951 10 140 6
2: 1951 11 140 7
3: 1951 12 140 8
4: 1951 18 1403 9
5: 1951 4 1404 9
look <- data.table(structure(list(code = c("C00", "C000", "C001", "C002", "C003","C004"),
ver67 = c(140L, 1400L, 1401L, NA, NA, NA),
ver8 = c(140L,1400L, 1401L, NA, NA, NA),
ver9 = c(140L, 1400L, 1401L, NA, 1403L,1404L)), row.names = c(NA, -6L), class = c("data.table", "data.frame")))
code ver67 ver8 ver9
1: C00 140 140 140
2: C000 1400 1400 1400
3: C001 1401 1401 1401
4: C002 NA NA NA
5: C003 NA NA 1403
6: C004 NA NA 1404
The code values in df should be replaced with the code values from look wherever the code and the corresponding version match, as below.
year region code
1: 1951 10 C00
2: 1951 11 C00
3: 1951 12 C00
4: 1951 18 C003
5: 1951 4 C004
I am not quite sure how to tackle this challenge and would love some inputs to get me started.
It is probably easiest to first transform your lookup table into long format and then use left_join to join the lookup values to the table (assuming you are fine with using the tidyverse rather than sticking strictly to data.table):
library(data.table)
library(tidyverse)
# one row per combination of new code, version and old value
look_long <- look %>%
  pivot_longer(starts_with("ver"), names_to = "ver") %>%
  drop_na() %>%
  # "ver67" covers versions 6 and 7: strip the prefix, then split into digits
  mutate(ver = str_split(str_remove(ver, "ver"), "")) %>%
  unnest(ver) %>%
  transmute(ver = as.integer(ver),
            value = as.character(value),
            newcode = code)
df %>%
  left_join(look_long, by = c("code" = "value", "version" = "ver"))
#> year region code version newcode
#> 1: 1951 10 140 6 C00
#> 2: 1951 11 140 7 C00
#> 3: 1951 12 140 8 C00
#> 4: 1951 18 1403 9 C003
#> 5: 1951 4 1404 9 C004
Created on 2022-06-30 by the reprex package (v2.0.1)
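For readers who want to see the reshaping step without the tidyverse, here is a rough base R sketch of the same idea: expand each verNN column into one row per single-digit version, then look up (code, version) pairs with match(). The names look_long, ver_cols and newcode mirror the answer above but the implementation is an assumption for illustration, and like the original it assumes every version is a single digit.

```r
df <- data.frame(code = c("140", "140", "140", "1403", "1404"),
                 version = c(6, 7, 8, 9, 9),
                 stringsAsFactors = FALSE)
look <- data.frame(code = c("C00", "C000", "C001", "C002", "C003", "C004"),
                   ver67 = c(140L, 1400L, 1401L, NA, NA, NA),
                   ver8 = c(140L, 1400L, 1401L, NA, NA, NA),
                   ver9 = c(140L, 1400L, 1401L, NA, 1403L, 1404L),
                   stringsAsFactors = FALSE)

ver_cols <- grep("^ver", names(look), value = TRUE)

# Build one long table: newcode, value (old code as text), version.
look_long <- do.call(rbind, lapply(ver_cols, function(nm) {
  # "ver67" -> c(6, 7); assumes single-digit versions
  versions <- as.integer(strsplit(sub("^ver", "", nm), "")[[1]])
  keep <- !is.na(look[[nm]])
  d <- data.frame(newcode = look$code[keep],
                  value = as.character(look[[nm]][keep]),
                  stringsAsFactors = FALSE)
  # no shared columns, so merge() returns every (row, version) combination
  merge(d, data.frame(version = versions))
}))

# Look up each (code, version) pair of df in the long table.
key_df <- paste(df$code, df$version)
key_look <- paste(look_long$value, look_long$version)
df$newcode <- look_long$newcode[match(key_df, key_look)]
df
```

Pairs with no entry in the lookup table come back as NA in newcode, which matches the behaviour of a left join.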

R: Combining several character columns into one by replacing NA-rows

I have a data frame consisting of character variables which looks like this:
V1 V2 V3 V4 V5
1 ID Date pic1 pic2 pic3
2 1 15.06.16 11:50 abc <NA> def
3 1 16.06.16 11:19 <NA> hij <NA>
4 1 17.06.16 11:41 <NA> <NA> nop
5 2 28.05.16 11:40 tuv <NA> <NA>
6 2 29.05.16 11:39 <NA> zab <NA>
7 2 30.05.16 09:07 <NA> <NA> wxy
8 3 03.06.16 07:31 lmn <NA> <NA>
9 3 04.06.16 11:01 <NA> rst <NA>
10 3 05.06.16 13:57 <NA> <NA> opq
So on each day one of the pic variables contains a value and the rest are NA.
Now I want to combine all pic values into one variable by replacing the NAs. Sorry if this is a duplicate; I've already tried a lot of suggested solutions, but nothing has worked so far.
Thanks!
We can try with data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)); then, grouped by 'ID' and 'Date', unlist the Subset of Data.table (.SD) and omit the NA elements (na.omit).
library(data.table)
setDT(df1)[, .(pic = na.omit(unlist(.SD))), by = .(ID, Date)]
# ID Date pic
# 1: 1 15.06.16 11:50 abc
# 2: 1 15.06.16 11:50 def
# 3: 1 16.06.16 11:19 hij
# 4: 1 17.06.16 11:41 nop
# 5: 2 28.05.16 11:40 tuv
# 6: 2 29.05.16 11:39 zab
# 7: 2 30.05.16 09:07 wxy
# 8: 3 03.06.16 07:31 lmn
# 9: 3 04.06.16 11:01 rst
#10: 3 05.06.16 13:57 opq
Or another option is pmax if there is only a single non-NA per row
setDT(df1)[, pic := do.call(pmax, c(.SD, na.rm = TRUE)),
           .SDcols = pic1:pic3][, paste0("pic", 1:3) := NULL][]
Or using dplyr
library(dplyr)
df1 %>%
  mutate(pic = pmax(pic1, pic2, pic3, na.rm = TRUE)) %>%
  select(-(pic1:pic3))
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), Date = c("15.06.16 11:50",
"16.06.16 11:19", "17.06.16 11:41", "28.05.16 11:40", "29.05.16 11:39",
"30.05.16 09:07", "03.06.16 07:31", "04.06.16 11:01", "05.06.16 13:57"
), pic1 = c("abc", NA, NA, "tuv", NA, NA, "lmn", NA, NA), pic2 = c(NA,
"hij", NA, NA, "zab", NA, NA, "rst", NA), pic3 = c("def", NA,
"nop", NA, NA, "wxy", NA, NA, "opq")), .Names = c("ID", "Date",
"pic1", "pic2", "pic3"), row.names = c(NA, -9L), class = "data.frame")
Assuming that "on each day one of the pic-variables contains a value, the rest is NA", you can use coalesce from dplyr to get what you want:
library(dplyr)
result <- df1 %>%
  mutate(pic = coalesce(pic1, pic2, pic3)) %>%
  select(-(pic1:pic3))
With the data supplied by akrun:
print(result)
## ID Date pic
##1 1 15.06.16 11:50 abc
##2 1 16.06.16 11:19 hij
##3 1 17.06.16 11:41 nop
##4 2 28.05.16 11:40 tuv
##5 2 29.05.16 11:39 zab
##6 2 30.05.16 09:07 wxy
##7 3 03.06.16 07:31 lmn
##8 3 04.06.16 11:01 rst
##9 3 05.06.16 13:57 opq
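The answers above give different results for the first row of akrun's data, where both pic1 ("abc") and pic3 ("def") are filled: the unlist approach keeps both values, pmax keeps the larger string, and coalesce keeps the first non-missing one. The sketch below illustrates the last two in base R; coalesce_base is a hypothetical helper written for this example, not dplyr's implementation.

```r
# Hypothetical base R stand-in for coalesce(): first non-NA value per position.
coalesce_base <- function(...) {
  Reduce(function(a, b) {
    miss <- is.na(a)
    a[miss] <- b[miss]   # fill gaps in a from b
    a
  }, list(...))
}

# First three rows of the sample data.
pic1 <- c("abc", NA, NA)
pic2 <- c(NA, "hij", NA)
pic3 <- c("def", NA, "nop")

first_value <- coalesce_base(pic1, pic2, pic3)        # keeps "abc" in row 1
largest <- pmax(pic1, pic2, pic3, na.rm = TRUE)       # keeps "def" in row 1
```

If the assumption of exactly one value per row holds, the two approaches agree; the difference only shows up on rows like the first one.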

searching a data frame value through column names, and get the respective matching value column

Date_Time C 4700C Put.15 4800C Put.16 4900C Put.17
1 20120531 NA NA NA NA NA NA NA
2 20120601 1445 4800 208 84.9 143.3 119.8 92 167
3 20120606 1100 4900 268.85 43 192 66.3 127 100
4 20120607 1500 5000 345 24 261 38.25 183 60.5
5 20120612 1515 NA NA NA NA NA NA NA
I have the above sample data frame. For each row, I want to search for the value in column C among the column names and get back the value from the matching column in that row.
For example, to look up the C value 4900, I want to search all the column names first, and once 4900C is found, get the value of 4900C for that row as the result.
Please help.
It would have been better if the delimiters in the data were clear. For example "Date_Time" as column name could have elements "20120531 NA" as a string.
We find the columns whose names contain digits followed by 'C' using grep ('j2'), remove the non-numeric substring from those names using sub, and match the result against 'C' to get the column index ('j1'). We then build a logical index that is TRUE where a match was found ('i1') and, with a row/column index matrix, extract the elements from the candidate columns (df1[j2]) and assign them to "NewCol".
j2 <- grep("\\d+C", names(df1))
j1 <- match(df1$C, sub("\\D+", "", names(df1)[j2]))
i1 <- !is.na(j1)
df1$NewCol[i1] <- df1[j2][cbind((1:nrow(df1))[i1], j1[i1])]
df1
# Date Time C 4700C Put.15 4800C Put.16 4900C Put.17 NewCol
#1 20120531 NA NA NA NA NA NA NA NA NA
#2 20120601 1445 4800 208.00 84.9 143.3 119.80 92 167.0 143.3
#3 20120606 1100 4900 268.85 43.0 192.0 66.30 127 100.0 127.0
#4 20120607 1500 5000 345.00 24.0 261.0 38.25 183 60.5 NA
#5 20120612 1515 NA NA NA NA NA NA NA NA
NOTE: Here I am assuming that 'Time' is the second column
data
df1 <- structure(list(Date = c(20120531L, 20120601L, 20120606L, 20120607L,
20120612L), Time = c(NA, 1445L, 1100L, 1500L, 1515L), C = c(NA,
4800L, 4900L, 5000L, NA), `4700C` = c(NA, 208, 268.85, 345, NA
), Put.15 = c(NA, 84.9, 43, 24, NA), `4800C` = c(NA, 143.3, 192,
261, NA), Put.16 = c(NA, 119.8, 66.3, 38.25, NA), `4900C` = c(NA,
92L, 127L, 183L, NA), Put.17 = c(NA, 167, 100, 60.5, NA)),
.Names = c("Date",
"Time", "C", "4700C", "Put.15", "4800C", "Put.16", "4900C", "Put.17"
), class = "data.frame", row.names = c("1", "2", "3", "4", "5"))
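The key step in the answer above is indexing with a two-column matrix of (row, column) positions, which picks one cell per row. A minimal sketch on made-up numbers (the names m, target and picked are illustrative only):

```r
# Made-up strike columns, named like the ones in the question.
m <- data.frame(`4700C` = c(1, 2, 3),
                `4800C` = c(10, 20, 30),
                check.names = FALSE)
target <- c(4800, 4700, 5000)   # value to look up for each row

# Strip the trailing letter from the names and find the matching column.
col_idx <- match(target, sub("\\D+", "", names(m)))   # NA when nothing matches
ok <- !is.na(col_idx)

picked <- rep(NA_real_, nrow(m))
# cbind(row, col) selects one cell per row of the matrix
picked[ok] <- as.matrix(m)[cbind(which(ok), col_idx[ok])]
picked
```

Row 1 takes its value from 4800C, row 2 from 4700C, and row 3 stays NA because there is no 5000C column.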

subsetting and performing calculations on time series data, avoiding loops

I'm trying to figure out how to do the following without looping. I have a melted dataset of time, study site, and flow that looks like:
datetime site flow
6/1/2009 00:00 EBT NA
6/2/2009 01:00 EBT NA
6/3/2009 02:00 EBT 0.1
6/4/2009 03:00 EBT NA
6/5/2009 04:00 EBT NA
6/1/2009 00:00 MUT 0.4
6/2/2009 01:00 MUT 0.3
6/3/2009 02:00 MUT 0.2
6/4/2009 03:00 MUT NA
6/5/2009 04:00 MUT NA
I need to subset this by site, and then for periods when there are at least two subsequent flow measurements I need to perform a couple of calculations, for example the mean of the current and previous measurement.
The trick is that I need to perform the average on each set of consecutive measurements, i.e. if there are three in a row, for each of the latter two I need the average of that measurement and the previous one. I've added a goal column to the sample data frame with the results I'd like to get.
I'd like to end up with a similar looking dataframe with the datetime, site, and result of the calculation. There is a full time series for each site.
Thanks for any help!
data generator:
structure(list(datetime = structure(c(1167627600, 1167717600,
1167807600, 1167897600, 1167987600, 1167627600, 1167717600, 1167807600,
1167897600, 1167987600, 1168077600, 1168167600, 1168257600, 1168347600,
1168437600), class = c("POSIXct", "POSIXt"), tzone = ""), site = structure(c(1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("EBT",
"MUT"), class = "factor"), flow = c(NA, 0.1, NA, NA, NA, NA,
0.4, 0.2, NA, NA, 0.4, 0.2, 0.1, NA, NA), goal = c(NA, NA, NA,
NA, NA, NA, NA, 0.3, NA, NA, NA, 0.3, 0.15, NA, NA)), .Names = c("datetime",
"site", "flow", "goal"), row.names = c(NA, -15L), class = "data.frame")
This will separate your dataframe by site and then filter only rows that have two or more consecutive non-NA values in flow:
by(sample, sample$site, function(d) d[with(rle(!is.na(d$flow)), rep(values & lengths>=2, lengths)),])
You can then work on the function inside to do your calculations as needed.
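To see what the rle() expression in the one-liner above selects, here is a minimal sketch on a made-up flow vector (the names runs and in_long_run are only for illustration):

```r
# Made-up flows: one run of two measurements and one isolated measurement.
flow <- c(NA, 0.4, 0.2, NA, 0.1, NA)

# Run-length encode the "is measured" flag: runs of FALSE/TRUE with lengths.
runs <- rle(!is.na(flow))

# Keep runs that are measured AND at least two long, then expand back
# to one flag per original element.
in_long_run <- rep(runs$values & runs$lengths >= 2, runs$lengths)
in_long_run
```

Only positions 2 and 3 are flagged; the single measurement at position 5 is dropped because its run has length one.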
For instance, if you want to add the mean as a new column (assuming you want NA when not defined) you can use this:
f <- function(d) {
  # TRUE for rows that belong to a run of at least two non-NA flows
  x <- with(rle(!is.na(d$flow)), rep(values & lengths >= 2, lengths))
  within(d, {avg <- NA; avg[x] <- mean(d[x, "flow"])})
}
b <- by(sample, sample$site, f)
Reduce(rbind, b)
Result:
datetime site flow avg
1 2009-06-01 01:00:00 EBT NA NA
2 2009-06-02 02:00:00 EBT NA NA
3 2009-06-03 03:00:00 EBT 0.1 NA
4 2009-06-04 04:00:00 EBT NA NA
5 2009-06-05 05:00:00 EBT NA NA
6 2009-06-01 01:00:00 MUT 0.4 0.3
7 2009-06-02 02:00:00 MUT 0.3 0.3
8 2009-06-03 03:00:00 MUT 0.2 0.3
9 2009-06-04 04:00:00 MUT NA NA
10 2009-06-05 05:00:00 MUT NA NA
EDIT: To get the mean between the current flow measure and the previous one, you can use this:
f <- function(d) {
  # mean of each flow value and the one before it (NA for the first row)
  within(d, avg <- (flow + c(NA, head(flow, -1))) / 2)
}
Reduce(rbind, by(sample, sample$site, f))
Note that cases with a single measure are automatically set to NA. New result:
datetime site flow goal avg
1 2007-01-01 03:00:00 EBT NA NA NA
2 2007-01-02 04:00:00 EBT 0.1 NA NA
3 2007-01-03 05:00:00 EBT NA NA NA
4 2007-01-04 06:00:00 EBT NA NA NA
5 2007-01-05 07:00:00 EBT NA NA NA
6 2007-01-01 03:00:00 MUT NA NA NA
7 2007-01-02 04:00:00 MUT 0.4 NA NA
8 2007-01-03 05:00:00 MUT 0.2 0.30 0.30
9 2007-01-04 06:00:00 MUT NA NA NA
10 2007-01-05 07:00:00 MUT NA NA NA
11 2007-01-06 08:00:00 MUT 0.4 NA NA
12 2007-01-07 09:00:00 MUT 0.2 0.30 0.30
13 2007-01-08 10:00:00 MUT 0.1 0.15 0.15
14 2007-01-09 11:00:00 MUT NA NA NA
15 2007-01-10 12:00:00 MUT NA NA NA
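If the rows are already in time order within each site, the same lagged mean can be sketched without splitting and re-binding the data, using base R's ave() to apply a function within each group. The flows data frame and the helper lag_mean below are made up for illustration.

```r
# Made-up data, already ordered by time within each site.
flows <- data.frame(
  site = c("EBT", "EBT", "EBT", "MUT", "MUT", "MUT", "MUT"),
  flow = c(NA, 0.1, NA, 0.4, 0.2, 0.1, NA),
  stringsAsFactors = FALSE
)

# Mean of each value and the previous one; NA if either is missing.
lag_mean <- function(x) (x + c(NA, head(x, -1))) / 2

# ave() applies lag_mean within each site and returns the results in row order.
flows$avg <- ave(flows$flow, flows$site, FUN = lag_mean)
flows
```

Because the lag is computed inside each group, the first row of a site never borrows a value from the previous site.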
Plyr functions are a good way to split apart dataframes by certain variables, which is what you need to do.
I thought of two ways to handle intervals on a vector: first with an outer operation on the vector (for the mean of the data), and second by vectorizing a function (for generating the labels). They both do pretty much the same thing, though.
library(reshape2)
library(plyr)
library(lubridate)
meanBetween <- function(x) {
  l <- length(x)
  # the diagonal of the outer sum pairs each value with its successor
  diag(outer(x[1:(l-1)], x[2:l], "+")) / 2
}
output <- ddply(sample, .(site), function(df) {
  df <- df[order(df$datetime, decreasing = FALSE), ]
  result <- meanBetween(df$flow)
  names(result) <- Reduce(c, (mapply(as.interval,
                                     df$datetime[-1],
                                     df$datetime[1:(length(df$datetime) - 1)],
                                     SIMPLIFY = FALSE)))
  result
})
melt(output) # to make it look nicer
