I am trying to fill missing values using the zoo package. My data set is as follows:
a=c("2017-01-12 00:00:00","2017-01-12 00:03:00","2017-01-12 00:08:00",
"2017-01-12 00:11:00","2017-01-12 00:14:00","2017-01-12 04:59:00","2017-01-12 05:10:00",
"2017-01-12 05:30:00")
b=c(NA,NA,1,NA,0,NA,1,NA)
df = data.frame(a, b)
To fill in the missing values I am trying:
df$new = na.locf0(df$b,fromLast=F)
The output should be:
a b new
1/12/2017 0:00 NA 0
1/12/2017 0:03 NA 0
1/12/2017 0:08 1 1
1/12/2017 0:11 NA 1
1/12/2017 0:14 0 0
1/12/2017 4:59 NA 0
1/12/2017 5:10 1 1
1/12/2017 5:30 NA 1
Thanks in advance.
na.locf0 (correctly) does not fill in components for which there is no prior value. If you want to fill those in with some particular value, use na.fill. (In the development version of zoo, na.fill0 will also work.)
transform(df, new = na.fill(na.locf0(b), 0))
giving:
a b new
1 2017-01-12 00:00:00 NA 0
2 2017-01-12 00:03:00 NA 0
3 2017-01-12 00:08:00 1 1
4 2017-01-12 00:11:00 NA 1
5 2017-01-12 00:14:00 0 0
6 2017-01-12 04:59:00 NA 0
7 2017-01-12 05:10:00 1 1
8 2017-01-12 05:30:00 NA 1
We can use:
df$new <- na.locf(df$b,fromLast=F, na.rm = FALSE)
df$new[is.na(df$new)] <- 0
df$new
#[1] 0 0 1 1 0 0 1 1
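Note that na.rm = FALSE matters here: na.locf defaults to na.rm = TRUE, which drops the leading NAs, so the result would be shorter than the column and the assignment back into df would fail.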
Option 1
Using zoo::na.locf0
library(zoo)
library(tidyverse)
df %>% mutate(b = na.locf0(b), b = replace(b, is.na(b), 0))
# a b
#1 2017-01-12 00:00:00 0
#2 2017-01-12 00:03:00 0
#3 2017-01-12 00:08:00 1
#4 2017-01-12 00:11:00 1
#5 2017-01-12 00:14:00 0
#6 2017-01-12 04:59:00 0
#7 2017-01-12 05:10:00 1
#8 2017-01-12 05:30:00 1
Option 2
Using tidyr::fill
df %>% fill(b) %>% mutate(b = replace(b, is.na(b), 0))
# a b
#1 2017-01-12 00:00:00 0
#2 2017-01-12 00:03:00 0
#3 2017-01-12 00:08:00 1
#4 2017-01-12 00:11:00 1
#5 2017-01-12 00:14:00 0
#6 2017-01-12 04:59:00 0
#7 2017-01-12 05:10:00 1
#8 2017-01-12 05:30:00 1
Explanation: Both zoo::na.locf0 and tidyr::fill fill NA entries based on previous entries (top down by default); the final replace step sets the leading NA values to 0, since they have no previous entries and therefore cannot be filled.
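To see the two steps in isolation, here is a minimal sketch on the plain b vector from the question:
library(zoo)
x <- c(NA, NA, 1, NA, 0, NA, 1, NA)
na.locf0(x)
# [1] NA NA  1  1  0  0  1  1   (the leading NAs survive)
replace(na.locf0(x), is.na(na.locf0(x)), 0)
# [1] 0 0 1 1 0 0 1 1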
I want to transform turn codes like these:
library(tidyverse)
library(lubridate)
turndata_wide <- tibble(turnID = 1:4,
code = c("a", "b", "a", "g"),
start = c(ymd_hms("2019_05_25 00:00:05"),
ymd_hms("2019_05_25 00:00:02"),
ymd_hms("2019_05_25 00:00:10"),
ymd_hms("2019_05_25 00:00:01")),
end = c(ymd_hms("2019_05_25 00:00:08"),
ymd_hms("2019_05_25 00:00:07"),
ymd_hms("2019_05_25 00:00:15"),
ymd_hms("2019_05_25 00:00:25")))
which results in this:
> turndata_wide
# A tibble: 4 x 4
turnID code start end
<int> <chr> <dttm> <dttm>
1 1 a 2019-05-25 00:00:05 2019-05-25 00:00:08
2 2 b 2019-05-25 00:00:02 2019-05-25 00:00:07
3 3 a 2019-05-25 00:00:10 2019-05-25 00:00:15
4 4 g 2019-05-25 00:00:01 2019-05-25 00:00:25
into what we (social scientists) call time codes. This should look like:
# A tibble: 25 x 4
time a b g
<dttm> <dbl> <dbl> <dbl>
1 2019-05-25 00:00:01 NA NA 1
2 2019-05-25 00:00:02 NA 1 1
3 2019-05-25 00:00:03 NA 1 1
4 2019-05-25 00:00:04 NA 1 1
5 2019-05-25 00:00:05 1 1 1
6 2019-05-25 00:00:06 1 1 1
7 2019-05-25 00:00:07 1 1 1
8 2019-05-25 00:00:08 1 NA 1
9 2019-05-25 00:00:09 NA NA 1
10 2019-05-25 00:00:10 1 NA 1
# … with 15 more rows
I have constructed a (pedestrian and ugly) solution which works, but I'm quite sure there are much better solutions. My (ugly) approach is:
1) Create a long df per turn
2) Join a df with "full time rows" per turn
3) Join these full dfs per turn
4) Spread the codes
## Loop over steps 1) + 2) ########################################
df_per_turn_list <- list()
for(i in 1:nrow(turndata_wide)){
data_turn_temp <- turndata_wide[i,]%>%
gather(startend, time, start, end)%>%
full_join(.,
tibble(time = seq.POSIXt(from = min(.$time),
to = max(.$time),
by = "sec"),
code = .$code[1],
turnID = .$turnID[1]))%>%
select(-startend)%>%
arrange(time)
temp_name <- paste("data_turn_", i, sep = "")
df_per_turn_list[[temp_name]] <- data_turn_temp
}
## Steps 3) + 4): Join dfs_per turn and spread codes ########
reduce(df_per_turn_list, full_join)%>%
mutate(dummy_one = 1)%>%
select(-turnID)%>%
spread(code, dummy_one)%>%
arrange(time)
One way using tidyverse and cSplit_e from splitstackshape: we create a sequence between start and end for every second, group_by each second, collapse the codes into a comma-separated value, and then use cSplit_e to convert them to binary columns.
library(tidyverse)
turndata_wide %>%
mutate(time = map2(start, end, seq, by = "1 sec")) %>%
unnest(cols = time) %>%
select(-start, -end) %>%
group_by(time) %>%
summarise(code = toString(code)) %>%
splitstackshape::cSplit_e("code", type = "character", drop = TRUE)
which returns this output:
# time code_a code_b code_g
#1 2019-05-25 00:00:01 NA NA 1
#2 2019-05-25 00:00:02 NA 1 1
#3 2019-05-25 00:00:03 NA 1 1
#4 2019-05-25 00:00:04 NA 1 1
#5 2019-05-25 00:00:05 1 1 1
#6 2019-05-25 00:00:06 1 1 1
#7 2019-05-25 00:00:07 1 1 1
#8 2019-05-25 00:00:08 1 NA 1
#9 2019-05-25 00:00:09 NA NA 1
#10 2019-05-25 00:00:10 1 NA 1
#11 2019-05-25 00:00:11 1 NA 1
#12 2019-05-25 00:00:12 1 NA 1
#13 2019-05-25 00:00:13 1 NA 1
#14 2019-05-25 00:00:14 1 NA 1
#15 2019-05-25 00:00:15 1 NA 1
#16 2019-05-25 00:00:16 NA NA 1
#17 2019-05-25 00:00:17 NA NA 1
#18 2019-05-25 00:00:18 NA NA 1
#19 2019-05-25 00:00:19 NA NA 1
#20 2019-05-25 00:00:20 NA NA 1
#21 2019-05-25 00:00:21 NA NA 1
#22 2019-05-25 00:00:22 NA NA 1
#23 2019-05-25 00:00:23 NA NA 1
#24 2019-05-25 00:00:24 NA NA 1
#25 2019-05-25 00:00:25 NA NA 1
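If you would rather not add the splitstackshape dependency, roughly the same reshaping can be sketched with tidyr alone (this assumes tidyr >= 1.0 for pivot_wider; the distinct call guards against overlapping turns that share a code):
library(tidyverse)
turndata_wide %>%
  mutate(time = map2(start, end, seq, by = "1 sec")) %>%
  unnest(cols = time) %>%
  distinct(time, code) %>%
  mutate(present = 1) %>%
  pivot_wider(names_from = code, values_from = present) %>%
  arrange(time)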
I am looking to run a cumulative sum at every row for values that occur in two columns before and after that point. In this case I have the volume of 2 incident types at every given minute over two days. I want to create a column which, for each row, adds all the incident1 values that occurred before it and all the incident2 values that occurred after it. SUMIF from Excel comes to mind, but I'm not sure how to port that over to R.
EDIT: added set.seed and easier numbers.
I have the following data set:
set.seed(42)
library(data.table)
master_min =
  setDT(
    data.frame(master_min = seq(
      from = as.POSIXct("2016-1-1 0:00", tz = "America/New_York"),
      to = as.POSIXct("2016-1-2 23:00", tz = "America/New_York"),
      by = "min"
    ))
  )
incident1= round(runif(2821, min=0, max=10))
incident2= round(runif(2821, min=0, max=10))
master_min = head(cbind(master_min, incident1, incident2), 5)
How do I essentially compute the following logic:
for each row, sum all the incident1 values that occurred before that row's timestamp and all the incident2 values that occurred after that row's timestamp? It would be great to get a data.table solution, or failing that dplyr, as I am working with a large dataset. Below is a before and after for the data:
BEFORE:
master_min incident1 incident2
1: 2016-01-01 00:00:00 9 6
2: 2016-01-01 00:01:00 9 5
3: 2016-01-01 00:02:00 3 5
4: 2016-01-01 00:03:00 8 6
5: 2016-01-01 00:04:00 6 9
AFTER THE CALCULATION:
master_min incident1 incident2 new_column
1: 2016-01-01 00:00:00 9 6 25
2: 2016-01-01 00:01:00 9 5 29
3: 2016-01-01 00:02:00 3 5 33
4: 2016-01-01 00:03:00 8 6 30
5: 2016-01-01 00:04:00 6 9 29
If I understand correctly:
# Cumsum of incident1, without current row:
master_min$sum1 <- cumsum(master_min$incident1) - master_min$incident1
# Reverse cumsum of incident2, without current row:
master_min$sum2 <- rev(cumsum(rev(master_min$incident2))) - master_min$incident2
# Your new column:
master_min$new_column <- master_min$sum1 + master_min$sum2
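As a quick sanity check, retyping the BEFORE table's values reproduces the expected new_column:
incident1 <- c(9, 9, 3, 8, 6)
incident2 <- c(6, 5, 5, 6, 9)
sum1 <- cumsum(incident1) - incident1            # 0 9 18 21 29
sum2 <- rev(cumsum(rev(incident2))) - incident2  # 25 20 15 9 0
sum1 + sum2                                      # 25 29 33 30 29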
Update
The following two lines can also do the job; note that here sum1 includes the current row, while sum2 counts only the rows strictly after it:
master_min$sum1 <- cumsum(master_min$incident1)
master_min$sum2 <- sum(master_min$incident2) - cumsum(master_min$incident2)
I rewrote the question's example a bit to show a more comprehensive structure:
library(data.table)
master_min <-
  setDT(
    data.frame(master_min = seq(
      from = as.POSIXct("2016-1-1 0:00", tz = "America/New_York"),
      to = as.POSIXct("2016-1-1 0:09", tz = "America/New_York"),
      by = "min"
    ))
  )
set.seed(2)
incident1= as.integer(runif(10, min=0, max=10))
incident2= as.integer(runif(10, min=0, max=10))
master_min = cbind(master_min, incident1, incident2)
Now master_min looks like this:
> master_min
master_min incident1 incident2
1: 2016-01-01 00:00:00 1 5
2: 2016-01-01 00:01:00 7 2
3: 2016-01-01 00:02:00 5 7
4: 2016-01-01 00:03:00 1 1
5: 2016-01-01 00:04:00 9 4
6: 2016-01-01 00:05:00 9 8
7: 2016-01-01 00:06:00 1 9
8: 2016-01-01 00:07:00 8 2
9: 2016-01-01 00:08:00 4 4
10: 2016-01-01 00:09:00 5 0
Apply the transformations:
master_min$sum1 <- cumsum(master_min$incident1)
master_min$sum2 <- sum(master_min$incident2) - cumsum(master_min$incident2)
Results
> master_min
master_min incident1 incident2 sum1 sum2
1: 2016-01-01 00:00:00 1 5 1 37
2: 2016-01-01 00:01:00 7 2 8 35
3: 2016-01-01 00:02:00 5 7 13 28
4: 2016-01-01 00:03:00 1 1 14 27
5: 2016-01-01 00:04:00 9 4 23 23
6: 2016-01-01 00:05:00 9 8 32 15
7: 2016-01-01 00:06:00 1 9 33 6
8: 2016-01-01 00:07:00 8 2 41 4
9: 2016-01-01 00:08:00 4 4 45 0
10: 2016-01-01 00:09:00 5 0 50 0
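If you also want the asker's new_column (incident1 strictly before the row plus incident2 strictly after it), remember that sum1 includes the current row and subtract it back out:
master_min$new_column <- master_min$sum1 - master_min$incident1 + master_min$sum2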
I've seen similar questions, but none of them applies the condition to specific rows of a data.table or data.frame; they all apply it to the whole table:
Subset a dataframe between 2 dates
How to select some rows with specific date from a data frame in R
I have a dataset with patients who were diagnosed with OA and those who were not:
library(data.table)
dt <- data.table(ID = seq(1,10,1), OA = c(1,0,0,1,0,0,0,1,1,0),
oa.date = as.Date(c("01/01/2006", "01/01/2001", "01/01/2001", "02/03/2005","01/01/2001","01/01/2001","01/01/2001","05/06/2010", "01/01/2011", "01/01/2001"), "%d/%m/%Y"),
stop.date = as.Date(c("01/01/2006", "31/12/2007", "31/12/2008", "02/03/2005", "31/12/2011", "31/12/2011", "31/12/2011", "05/06/2010", "01/01/2011", "31/12/2011"), "%d/%m/%Y"))
dt$oa.date[dt$OA==0] <- NA
> dt
ID OA oa.date stop.date
1: 1 1 2006-01-01 2006-01-01
2: 2 0 <NA> 2007-12-31
3: 3 0 <NA> 2008-12-31
4: 4 1 2005-03-02 2005-03-02
5: 5 0 <NA> 2011-12-31
6: 6 0 <NA> 2011-12-31
7: 7 0 <NA> 2011-12-31
8: 8 1 2010-06-05 2010-06-05
9: 9 1 2011-01-01 2011-01-01
10: 10 0 <NA> 2011-12-31
What I want to do is delete those who were diagnosed with OA (OA==1) before start:
start <- as.Date("01/01/2009", "%d/%m/%Y")
So I want my final data to be:
> dt
ID OA oa.date stop.date
1: 2 0 <NA> 2007-12-31
2: 3 0 <NA> 2008-12-31
3: 5 0 <NA> 2011-12-31
4: 6 0 <NA> 2011-12-31
5: 7 0 <NA> 2011-12-31
6: 8 1 2010-06-05 2010-06-05
7: 9 1 2011-01-01 2011-01-01
8: 10 0 <NA> 2011-12-31
My attempts are:
dt[dt$OA==1] <- dt[!(oa.date < start)]
I've also tried a loop but to no effect.
Any help is much appreciated.
This should be straightforward:
> dt[!(OA & oa.date < start)]
# ID OA oa.date stop.date
#1: 2 0 <NA> 2007-12-31
#2: 3 0 <NA> 2008-12-31
#3: 5 0 <NA> 2011-12-31
#4: 6 0 <NA> 2011-12-31
#5: 7 0 <NA> 2011-12-31
#6: 8 1 2010-06-05 2010-06-05
#7: 9 1 2011-01-01 2011-01-01
#8: 10 0 <NA> 2011-12-31
The OA column is binary (1/0), which is coerced to logical (TRUE/FALSE) in the i-expression.
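The NA dates in oa.date cause no trouble either: under R's three-valued logic the conjunction is FALSE for every OA == 0 row, as this small sketch shows:
FALSE & NA  # FALSE, so !(OA & oa.date < start) is TRUE and OA == 0 rows are kept
TRUE & NA   # NA -- but this never arises, since every OA == 1 row has a date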
You can try:
dt = dt[dt$OA == 0 | (dt$OA == 1 & !(dt$oa.date < start)), ]
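Spelling out both branches like this returns the same rows as dt[!(OA & oa.date < start)] above, since every OA == 1 row has a non-missing oa.date.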
I have some weather forecast data, which records the forecast amount of rainfall for every hour. I would like to compare this to observation data, which has the observed amount of rainfall for every 6 hours. So, I need to aggregate the forecast data to 6-hourly data.
Here is an overview of my data:
DateUtc StationID FcstDay PrecipQuantity_hSum
1 2014-01-01 12:00:00 54745 0 0
2 2014-01-01 13:00:00 54745 0 0
3 2014-01-01 14:00:00 54745 0 0
4 2014-01-01 15:00:00 54745 0 0
5 2014-01-01 16:00:00 54745 0 0
6 2014-01-01 17:00:00 54745 0 0
7 2014-01-01 18:00:00 54745 0 0
8 2014-01-01 19:00:00 54745 0 0
9 2014-01-01 20:00:00 54745 0 0
10 2014-01-01 21:00:00 54745 0 0
11 2014-01-01 22:00:00 54745 0 0
12 2014-01-01 23:00:00 54745 0 0
13 2014-01-02 00:00:00 54745 1 0
14 2014-01-02 01:00:00 54745 1 0
15 2014-01-02 02:00:00 54745 1 0
16 2014-01-02 03:00:00 54745 1 0
17 2014-01-02 04:00:00 54745 1 0
18 2014-01-02 05:00:00 54745 1 0
19 2014-01-02 06:00:00 54745 1 0
20 2014-01-02 07:00:00 54745 1 0
... <NA> <NA> ... ...
13802582 2014-11-20 08:00:00 55005 7 0
13802583 2014-11-20 09:00:00 55005 7 0
13802584 2014-11-20 10:00:00 55005 7 0
13802585 2014-11-20 11:00:00 55005 7 0
13802586 2014-11-20 12:00:00 55005 7 0
To aggregate correctly, it is important to split by StationID (the weather station) and FcstDay (the number of days between the date the prediction was calculated and the date being forecast) before aggregating.
I have used the xts package to do the aggregating, and it works as expected if I manually subset the data first, e.g.:
z <- fcst[which(fcst$StationID=="54745" & fcst$FcstDay==1),]
z.xts <- xts(z$PrecipQuantity_hSum, z$DateUtc)
ends <- endpoints(z.xts, "hours", 6)
precip6 <- as.data.frame(period.apply(z.xts, ends, sum))
I need to automate the subsetting, but I have tried to wrap the xts functions in various split-apply functions and always get the same error:
Error in xts(z$PrecipQuantity_hSum, z$DateUtc) :
NROW(x) must match length(order.by)
This is the latest version of my code:
df <- data.frame()
d_ply(
  .data = fcst,
  .variables = c("FcstDay", "StationID"),
  .fun = function(z){
    z.xts <- xts(z$PrecipQuantity_6hSum, z$DateUtc)
    ends <- endpoints(z.xts, "hours", 6)
    precip6 <- as.data.frame(period.apply(z.xts, ends, sum))
    precip6$DateUtc <- rownames(precip6)
    rownames(precip6) <- NULL
    df <- rbind.fill(df, precip6)
  })
I've also tried nested for loops. Can anybody give any guidance on what's wrong? I've included the code for a reproducible example set below. Thanks in advance.
DateUtc <- rep(seq(from=ISOdatetime(2014,1,1,0,0,0), to=ISOdatetime(2014,12,30,0,0,0), by=(60*60)), times=9)
StationID <- rep(c("50060","50061","50062"), each=3*8713)
FcstDay <- rep(c(1,2,3), each=8713, times=3)
PrecipQuantity_hSum <- rgamma(78417, shape=1, rate=20)
fcst <- data.frame(DateUtc, StationID, FcstDay, PrecipQuantity_hSum)
I think the error David Robinson is getting is because your example code uses PrecipQuantity_6hSum and not PrecipQuantity_hSum. Once this is changed, your code works for me.
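For reference, that error is exactly what xts throws when its x argument is NULL, which is what a misspelled column name silently produces; a minimal sketch:
library(xts)
z <- data.frame(DateUtc = Sys.time() + 0:2, PrecipQuantity_hSum = 1:3)
xts(z$PrecipQuantity_6hSum, z$DateUtc)  # misspelled column -> NULL -> NROW(x) is 0
# Error in xts(z$PrecipQuantity_6hSum, z$DateUtc) :
#   NROW(x) must match length(order.by)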
Does this work for you?
df <- ddply(
  .data = fcst,
  .variables = c("FcstDay", "StationID"),
  .fun = function(z){
    z.xts <- xts(z$PrecipQuantity_hSum, z$DateUtc)
    ends <- endpoints(z.xts, "hours", 6)
    precip6 <- as.data.frame(period.apply(z.xts, ends, sum))
    precip6$DateUtc <- rownames(precip6)
    rownames(precip6) <- NULL
    return(precip6)
  })
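As an aside, the original d_ply attempt could never collect results as written: df <- rbind.fill(df, precip6) inside the anonymous function assigns to a local df, so the outer data frame is never updated. Returning precip6 and letting ddply do the row-binding avoids that entirely.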
I have a data frame which looks something like this:
ID = c(1,1,1,1,2,2,3,3,3,3,4,4)
TIME = as.POSIXct(c("2013-03-31 09:07:00", "2013-09-26 10:07:00", "2013-03-31 11:07:00",
"2013-09-26 12:07:00","2013-03-31 09:10:00","2013-03-31 11:11:00",
"2013-03-31 09:06:00","2013-09-26 09:04:00","2013-03-31 10:35:00",
"2013-09-26 09:07:00","2013-09-26 09:07:00","2013-09-26 10:07:00"))
var = c(0,0,1,1,0,1,0,0,1,1,0,1)
DF = data.frame(ID, TIME, var)
ID TIME var
1 1 2013-03-31 09:07:00 0
2 1 2013-09-26 10:07:00 0
3 1 2013-03-31 11:07:00 1
4 1 2013-09-26 12:07:00 1
5 2 2013-03-31 09:10:00 0
6 2 2013-03-31 11:11:00 1
7 3 2013-03-31 09:06:00 0
8 3 2013-09-26 09:04:00 0
9 3 2013-03-31 10:35:00 1
10 3 2013-09-26 09:07:00 1
11 4 2013-09-26 09:07:00 0
12 4 2013-09-26 10:07:00 1
I would like to remove the row containing the earliest TIME value whenever there are rows with identical ID and var in the data, i.e. to end up with something like this:
ID2 = c(1,1,2,2,3,3,4,4)
TIME2 = as.POSIXct(c("2013-09-26 10:07:00","2013-09-26 12:07:00","2013-03-31 09:10:00",
"2013-03-31 11:11:00","2013-09-26 09:04:00","2013-09-26 09:07:00",
"2013-09-26 09:07:00","2013-09-26 10:07:00"))
var2 = c(0,1,0,1,0,1,0,1)
DF2 = data.frame(ID2, TIME2, var2)
ID2 TIME2 var2
1 1 2013-09-26 10:07:00 0
2 1 2013-09-26 12:07:00 1
3 2 2013-03-31 09:10:00 0
4 2 2013-03-31 11:11:00 1
5 3 2013-09-26 09:04:00 0
6 3 2013-09-26 09:07:00 1
7 4 2013-09-26 09:07:00 0
8 4 2013-09-26 10:07:00 1
As you can see, it is not simply about dropping the measurements performed in March 2013, since those are valid on their own. Only the measurements for which there are "duplicates", i.e. those that were performed again in September, should be affected (see, for example, that ID = 2 remains in DF2).
Hope you can help.
Sincerely,
ykl
Here's an option with dplyr:
library(dplyr)
DF %>% group_by(ID, var) %>% filter(n() == 1L | !TIME %in% min(TIME))
#Source: local data frame [8 x 3]
#Groups: ID, var
#
# ID TIME var
#1 1 2013-09-26 10:07:00 0
#2 1 2013-09-26 12:07:00 1
#3 2 2013-03-31 09:10:00 0
#4 2 2013-03-31 11:11:00 1
#5 3 2013-09-26 09:04:00 0
#6 3 2013-09-26 09:07:00 1
#7 4 2013-09-26 09:07:00 0
#8 4 2013-09-26 10:07:00 1
What this does:
Take the data frame DF and
group it by ID and var.
The filter function is then used to subset by row: it takes a logical vector
and returns the rows for which the vector is TRUE. The logic is:
1) if the group has only 1 row, i.e. n() == 1L, then always return that row;
2) if the group has more than 1 row, i.e. n() > 1L, then check whether the TIME value is
equal to the minimum (earliest) TIME value of the group. By using ! we negate the vector so that it is FALSE whenever TIME is at its minimum. Conditions 1) and 2) are combined with an OR (|).
An option using data.table
library(data.table)
setDT(DF)[ ,{if(.N==1) .SD else .SD[-which.min(TIME)]}, by=list(ID, var)]
# ID var TIME
#1: 1 0 2013-09-26 10:07:00
#2: 1 1 2013-09-26 12:07:00
#3: 2 0 2013-03-31 09:10:00
#4: 2 1 2013-03-31 11:11:00
#5: 3 0 2013-09-26 09:04:00
#6: 3 1 2013-09-26 09:07:00
#7: 4 0 2013-09-26 09:07:00
#8: 4 1 2013-09-26 10:07:00
Or a similar logical approach to the one shown by @docendo discimus:
setDT(DF)[DF[,.N==1L|!TIME %in% min(TIME), by=list(ID, var)]$V1]
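One subtle difference between the two approaches: !TIME %in% min(TIME) drops every row tied at the minimum TIME within a group, while .SD[-which.min(TIME)] drops only the first such row. With the example data the results are identical, since no ID/var group has tied timestamps.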