Pad data frame with missing dates in a series [duplicate] - r

I've aggregated a data frame of rows representing events into another data frame of daily counts using aggregate(). The resultant frame is sorted by date, but it's missing days with zero counts, and I want to fill those days in to get a continuous daily series. The count frame looks something like this:
agg <- data.frame(
date = as.Date(c("2013-04-02", "2013-04-04", "2013-04-07", "2013-04-08")),
count = c(4, 2, 6, 1))
The way I previously solved this was by iterating through the frame to find non-continuous days, then rbinding subsets of the frame with an empty one. But this is an ugly solution, horrible to debug and painfully inefficient to boot. My thinking is that it would be better to generate a new data frame, populate it with the target date series...
target <- data.frame(
date = seq(from = as.Date("2013-04-01"), to = as.Date("2013-04-10"), by = "day"),
count = NA)
... and then somehow project counts from agg onto target using the matching dates. Does anyone know how I'd do this -- or have a better solution?

You're almost there. Just do:
merge(agg,target[-2],all.y=TRUE)
The [-2] subset drops the count column from target, which is not needed for the merge. Alternatively, you could build target without that column in the first place:
target <- data.frame(
date = seq(from = as.Date("2013-04-01"), to = as.Date("2013-04-10"), by = "day"))
merge(agg,target,all.y=TRUE)
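The merge leaves NA in count for the padded days. If zero counts are wanted, as the question describes, the NAs can be replaced afterwards. A minimal sketch, assuming agg as defined in the question; the name padded is only introduced here for illustration:

```r
# data from the question
agg <- data.frame(
  date = as.Date(c("2013-04-02", "2013-04-04", "2013-04-07", "2013-04-08")),
  count = c(4, 2, 6, 1))

# target series of dates, without a count column
target <- data.frame(
  date = seq(from = as.Date("2013-04-01"), to = as.Date("2013-04-10"), by = "day"))

# keep every row of target; days absent from agg get NA in count
padded <- merge(agg, target, by = "date", all.y = TRUE)

# turn the padded days into zero counts
padded$count[is.na(padded$count)] <- 0
padded
```

Naming the key with by = "date" is optional here, but it keeps the merge from silently using other shared columns if the frames change later.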

As another solution, how about this?
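library(dplyr) # filter(), full_join(), arrange() and %>%
# every day in the target range, with a zero count
other <- data.frame(date = seq(as.Date("2013-04-01"), as.Date("2013-04-10"), by = "day"), count = 0)
# keep only the days that are missing from agg
other <- filter(other, !(date %in% agg$date))
join <- full_join(agg, other, by = c("date", "count")) %>% arrange(date)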
It's a little messy, but it does the trick.
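If the tidyr package is available (an assumption, it is not used in the answers above), complete() does the padding and the zero fill in one call. A sketch on the question's data:

```r
library(tidyr)

agg <- data.frame(
  date = as.Date(c("2013-04-02", "2013-04-04", "2013-04-07", "2013-04-08")),
  count = c(4, 2, 6, 1))

# expand date to the full daily series; rows that did not exist get count = 0
padded <- complete(
  agg,
  date = seq(as.Date("2013-04-01"), as.Date("2013-04-10"), by = "day"),
  fill = list(count = 0))
padded
```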

Calculating the difference of elements in a vector with varying lag/lead

I have some lab data and I am looking to calculate the difference between sample measurements over a moving time window, e.g. 2 minutes (as opposed to static 0-2, 2-4, 4-6 minute windows).
The problem is that although the data is sampled every second there are some missed samples (e.g. 1,2,4,6,7) so I cannot use a fixed lag function especially for larger time windows.
Here is the most promising approach I have tried. I tried to calculate the difference in row positions and then use that to determine the lag value.
library(tidyverse)
df <- data.frame(sample_group = c(rep("a", 25), rep("b", 25)),t_seconds = c(1:50), measurement = seq(1,100,2))
df <- df[-c(5,10,23,33,44),] #remove samples
t_window = 5
df_diff <- df %>%
group_by(sample_group) %>%
arrange(t_seconds) %>%
mutate(lag_row = min(which(t_seconds >= t_seconds + t_window))- min(which(t_seconds == t_seconds)), #attempt to identify the lag value for each element
Meas_diff = measurement - lag(measurement, lag_row))
In this example (lag_row) I am trying to refer to a single element of a vector and to the whole vector at the same time, which obviously does not work. To make it clearer, I have added '_v' to mark what I want treated as a vector and '_e' for an element of that vector: min(which(t_seconds_v >= t_seconds_e + t_window)) - min(which(t_seconds_v == t_seconds_e))
I have tried to stay away from loops, but I have failed to solve the problem.
I would appreciate any better ideas.
Your first step should be inserting missing observations into your time series. Then you could fill the missing values using a Last-Observation-Carried-Backwards operation. This provides you with a complete regular time series.
Your desired output is very unclear, so the next step after that in the following example is just a guess. Adjust as needed.
#complete time series (using a data.table join):
library(data.table)
setDT(df)
df_fill <- df[, .SD[data.table(t_seconds = min(t_seconds):max(t_seconds)),
on = "t_seconds"],
by = sample_group]
df_fill[, filled := is.na(measurement)]
#last observation carried backwards
library(zoo)
df_fill[, measurement := na.locf(measurement, fromLast = TRUE), by = sample_group]
#differences
df_fill[, diff_value := shift(measurement, -t_window) - measurement, by = sample_group]
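If filling the series is not wanted, the lookup the question describes (the first sample at or after t + window) can also be done without a loop using findInterval(). This is a sketch in base R under the same guess about the desired output; window_diff is a hypothetical helper and meas_diff a made-up column name:

```r
# data from the question
df <- data.frame(sample_group = c(rep("a", 25), rep("b", 25)),
                 t_seconds = c(1:50),
                 measurement = seq(1, 100, 2))
df <- df[-c(5, 10, 23, 33, 44), ]  # remove samples
t_window <- 5

# for one group: difference to the first sample at or after t + window
window_diff <- function(g, window) {
  g <- g[order(g$t_seconds), ]
  # number of samples strictly before t + window, plus one,
  # is the row of the first sample at or after t + window
  idx <- findInterval(g$t_seconds + window, g$t_seconds, left.open = TRUE) + 1L
  # an index past the end of the group yields NA
  g$meas_diff <- g$measurement[idx] - g$measurement
  g
}

df_diff <- do.call(rbind, lapply(split(df, df$sample_group), window_diff,
                                 window = t_window))
```

Rows near the end of each group, where no sample exists at or after t + window, get NA.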

is there a way to look for same values by month by month in R?

I'll start by showing a part of my dataframe.
For each patient I have a combination of molecules, the date he/she started it and the theoretical date he/she finishes the treatment.
What I need to do is flag the patients who stay on the same combination of molecules for at least six consecutive months.
I already found a way to create sequences of dates between two dates and build a new df from them, but I don't know how to do the last part in R: getting the patients who are on the same treatment for six consecutive months.
Thanks in advance for the help.
Here is an attempt to solve your question. I think it's also possible with an apply function, but I haven't found how yet.
library(lubridate)
df = data.frame(id = c(1,2), traitement = c("A+B+C","D+E"),
begin = c('01/03/2012','01/01/2012'),
end = c('05/06/2012','05/07/2012'),
stringsAsFactors = FALSE)
df$begin = dmy(df$begin)
df$end = dmy(df$end)
df$time = 0
df$toolong = FALSE
for(i in 1:nrow(df)){
df$time[i] = interval(df$begin[i],df$end[i]) %/% months(1) # whole months elapsed; month(as.period(...)) would drop full years
if(df$time[i] >= 6){df$toolong[i] = TRUE} # flag patients who stay on the same treatment for six months or more
}
df
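The loop is not strictly needed, because lubridate's interval arithmetic works on whole columns. A sketch on the same toy data, assuming each row is one uninterrupted treatment (combining several consecutive rows per patient is not handled here):

```r
library(lubridate)

df <- data.frame(id = c(1, 2), traitement = c("A+B+C", "D+E"),
                 begin = c("01/03/2012", "01/01/2012"),
                 end = c("05/06/2012", "05/07/2012"),
                 stringsAsFactors = FALSE)
df$begin <- dmy(df$begin)
df$end <- dmy(df$end)

# whole months between begin and end, computed for all rows at once
df$time <- interval(df$begin, df$end) %/% months(1)

# TRUE for treatments lasting six months or more
df$toolong <- df$time >= 6
df
```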

Joining duplicate columns in single dataframe [duplicate]

I have a dataframe where each column has a unique name, but the content of several columns is identical. The columns with identical content are all factor variables and they end in the same way (e.g. .x or .y). My goal is to join all columns with the same ending (.x or .y) into a single column.
Most solutions I have encountered in this regard combine multiple dataframes, but I have not found a solution yet that does this within a single dataframe. I am providing some example script to illustrate what my dataframe looks like at the moment and the desired output.
# generate some data
dv1 = rnorm(6)
dv2 = rnorm(6)
dv3 = rnorm(6)
# current dataframe
DF <- data.frame(dv1,
iv1.x = sort(rep(letters[1:2], 3)),
iv1.y = as.factor(c(1:6)),
dv2,
iv2.x = sort(rep(letters[1:2], 3)),
iv2.y = as.factor(c(1:6)),
dv3,
iv3.x = sort(rep(letters[1:2], 3)),
iv3.y = as.factor(c(1:6))
)
# desired dataframe
DF.cbmd <- data.frame(dv1,
dv2,
dv3,
iv1.x = sort(rep(letters[1:2], 3)),
iv1.y = as.factor(c(1:6))
)
If they are truly duplicate columns, it seems there's no use to merge them, but you can simply remove them:
dfUnique <- DF[!duplicated(as.list(DF))]
Your data frame seems to be the result of a merge. The ideal fix would be to handle this in the previous step (the merge itself). Another idea is to strip everything before the . in the column names and then drop the columns whose remaining names are duplicated, i.e.
DF[!duplicated(gsub('.*\\.', '', names(DF)))]
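As a quick check on the example data from the question, both suggestions keep the same five columns. The sketch below only rebuilds that data (with a seed added for reproducibility) and compares the resulting names; by_content and by_name are names made up for this illustration:

```r
set.seed(1)  # seed added here only to make the example reproducible
dv1 <- rnorm(6)
dv2 <- rnorm(6)
dv3 <- rnorm(6)
DF <- data.frame(dv1,
                 iv1.x = sort(rep(letters[1:2], 3)),
                 iv1.y = as.factor(c(1:6)),
                 dv2,
                 iv2.x = sort(rep(letters[1:2], 3)),
                 iv2.y = as.factor(c(1:6)),
                 dv3,
                 iv3.x = sort(rep(letters[1:2], 3)),
                 iv3.y = as.factor(c(1:6)))

# drop columns whose content repeats an earlier column
by_content <- DF[!duplicated(as.list(DF))]

# drop columns whose name suffix (after the last dot) repeats an earlier one
by_name <- DF[!duplicated(gsub('.*\\.', '', names(DF)))]

names(by_content)
names(by_name)
```

The two approaches only agree when equal suffixes really mean equal content; the content-based version is the safer one if that is in doubt.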

R check consistency of separated timeseries-table

I have a timeseries-table like this, which goes up to 2000 31 12 23 (12/31/2000 23:00):
I'd like to add temperature values from several weather stations to it. The problem is that the different time series obviously don't match in number of rows, so there must be gaps.
How can I check whether these data frames consistently follow the pattern of 0-23 hours and 1-12 months, and find out where the gaps are?
If your data is in the format of the link then you can probably convert it to a POSIXct object by doing the following (assuming your data frame is called data):
data = as.data.frame(list(YY = rep("1962",10),
MM = rep("01",10),
DD = rep("01",10),
HH = c("00","01","02","03","04",
"05","06","07","08","09")))
date = paste(data$YY,data$MM,data$DD,sep="-")
data$dateTime = as.POSIXct(paste(date,data$HH,sep=" "),format="%Y-%m-%d %H")
That should put your data into a POSIXct format. If your temperature dataset also has a column called "dateTime" and it's a POSIXct object you should be able to use the merge function and it will combine the two data frames
temp = as.data.frame(list(YY = rep("1962",10),
MM = rep("01",10),
DD = rep("01",10),
HH = c("00","01","02","03","04",
"05","06","07","08","09")))
date1 = paste(temp$YY,temp$MM,temp$DD,sep="-")
temp$dateTime = as.POSIXct(paste(date1,temp$HH,sep=" "),format="%Y-%m-%d %H")
temp$temp = round(rnorm(10,0,5),1)
temp = temp[,c("dateTime","temp")]
#let's say your temperature dataset is missing an entry for a certain timestamp
temp = temp[-3,]
# this data frame won't have an entry for 02:00:00
data1 = merge(data,temp)
data1
# if you want to look at time differences you can try something like this
diff(data1$dateTime)
# this one will fill in the temp value as NA at 02:00:00
data2 = merge(data,temp,all.x = TRUE)
data2
diff(data2$dateTime)
I hope that helps, I often use the merge function when I'm trying to match up timestamps from ecological datasets
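To get the location of the gaps directly, one option is to compare the observed timestamps against the full expected hourly sequence. A sketch on a made-up ten-hour series in UTC; expected, observed, missing_times and gap_after are names chosen for this example only:

```r
# hypothetical example: ten expected hourly timestamps
expected <- seq(as.POSIXct("1962-01-01 00:00", tz = "UTC"),
                as.POSIXct("1962-01-01 09:00", tz = "UTC"),
                by = "hour")

# simulate a station series that lacks the 02:00 entry
observed <- expected[-3]

# timestamps that should exist but do not
missing_times <- expected[!(expected %in% observed)]
missing_times

# positions in the observed series after which more than one hour passes
gap_after <- which(diff(as.numeric(observed)) > 3600)
observed[gap_after]
```

Working in UTC avoids spurious one-hour jumps from daylight saving time changes.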
Thank you for your answer and sorry for my late reply.
I couldn't have done it without your helpful hints, though I ended up merging all my time series in a slightly different way:
Sys.setenv(TZ='UTC') #setting system time to UTC for not having DST-gaps
# creating empty hourly timeseries for following join
start = strptime("1962010100", format="%Y%m%d%H")
end = strptime("2000123123", format= "%Y%m%d%H")
series62_00 <- data.frame(
MESS_DATUM=seq(start, end, by="hour",tz ='UTC'), t = NA)
# joining all the temperatureseries with same timespan using "plyr"-package
library("plyr")
t_allstations <- list(series62_00,t282,t867,t1270,t2261,t2503
,t2597,t3668,t3946,t4752,t5397,t5419,t5705)
t_omain_DWD <- join_all(t_allstations, by = "MESS_DATUM", type = "left")
Using join_all with type = "left" makes sure that the date column (MESS_DATUM) is not changed and that missing temperature values are filled in as NA.

how to insert the gap back into a time series so that there is no gap in the result time series?

We have retailer data stored in a database. If there is no sale for an item, there is no row for that item in the database, so after fetching the data from the database into R there are gaps in the resulting time series. Is there any way to remove the gaps in R (insert the missing data back with a value of 0)?
As shown only yesterday, just construct a vector with the desired timeframe, and merge.
# Your dataframe
DF <- data.frame(day = seq(as.POSIXlt("2011-01-17"),
as.POSIXlt("2011-02-14"),
by = "days")[seq(1, 30, by = 4)][-6],
value = c(5,5,7,8,4,6,5,6,9,8,6,6,4,6))
# new dataframe
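# new dataframe covering every day in the range
full <- data.frame(day = seq(min(DF$day),max(DF$day),by = "days"))
# the big merge trick: merge on day only, then set the missing values to 0
res <- merge(DF,full,by = "day",all = TRUE)
res$value[is.na(res$value)] <- 0
res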
See also this question from yesterday.
