Count number of rows that are not NA [duplicate] - r

So I have a data frame that looks like this:
"date","id_station","id_parameter","valor","unit","year","day","month","hour","zona"
2019-01-01 00:00:00,"AJM","CO",NA,15,2019,1,1,0,"SO"
2019-01-01 00:00:00,"ATI","CO",NA,15,2019,1,1,0,"NO"
2019-01-01 00:00:00,"BJU","CO",NA,15,2019,1,1,0,"CE"
2019-01-01 00:00:00,"CAM","CO",NA,15,2019,1,1,0,"NO"
2019-01-01 00:00:00,"CCA","CO",NA,15,2019,1,1,0,"SO"
2019-01-01 00:00:00,"CHO","CO",NA,15,2019,1,1,0,"SE"
2019-01-01 00:00:00,"CUA","CO",NA,15,2019,1,1,0,"SO"
2019-01-01 00:00:00,"FAC","CO",NA,15,2019,1,1,0,"NO"
2019-01-01 00:00:00,"HGM","CO",NA,15,2019,1,1,0,"CE"
2019-01-01 00:00:00,"IZT","CO",NA,15,2019,1,1,0,"CE"
2019-01-01 00:00:00,"LLA","CO",NA,15,2019,1,1,0,"NE"
2019-01-01 00:00:00,"LPR","CO",NA,15,2019,1,1,0,"NE"
2019-01-01 00:00:00,"MER","CO",NA,15,2019,1,1,0,"CE"
2019-01-01 00:00:00,"MGH","CO",NA,15,2019,1,1,0,"SO"
2019-01-01 00:00:00,"NEZ","CO",NA,15,2019,1,1,0,"NE"
2019-01-01 00:00:00,"PED","CO",NA,15,2019,1,1,0,"SO"
2019-01-01 00:00:00,"SAG","CO",NA,15,2019,1,1,0,"NE"
2019-01-01 00:00:00,"SFE","CO",NA,15,2019,1,1,0,"SO"
2019-01-01 00:00:00,"SJA","CO",NA,15,2019,1,1,0,"NO"
2019-01-01 00:00:00,"TAH","CO",NA,15,2019,1,1,0,"SE"
2019-01-01 00:00:00,"TLA","CO",NA,15,2019,1,1,0,"NO"
2019-01-01 00:00:00,"TLI","CO",NA,15,2019,1,1,0,"NO"
2019-01-01 00:00:00,"UAX","CO",NA,15,2019,1,1,0,"SE"
2019-01-01 00:00:00,"UIZ","CO",NA,15,2019,1,1,0,"SE"
2019-01-01 00:00:00,"VIF","CO",NA,15,2019,1,1,0,"NE"
2019-01-01 00:00:00,"XAL","CO",NA,15,2019,1,1,0,"NE"
2019-01-01 01:00:00,"AJM","CO",NA,15,2019,1,1,1,"SO"
2019-01-01 01:00:00,"ATI","CO",NA,15,2019,1,1,1,"NO"
2019-01-01 01:00:00,"BJU","CO",NA,15,2019,1,1,1,"CE"
2019-01-01 01:00:00,"CAM","CO",NA,15,2019,1,1,1,"NO"
2019-01-01 01:00:00,"CCA","CO",NA,15,2019,1,1,1,"SO"
2019-01-01 01:00:00,"CHO","CO",NA,15,2019,1,1,1,"SE"
2019-01-01 01:00:00,"CUA","CO",NA,15,2019,1,1,1,"SO"
2019-01-01 01:00:00,"FAC","CO",NA,15,2019,1,1,1,"NO"
2019-01-01 01:00:00,"HGM","CO",NA,15,2019,1,1,1,"CE"
2019-01-01 01:00:00,"IZT","CO",NA,15,2019,1,1,1,"CE"
2019-01-01 01:00:00,"LLA","CO",NA,15,2019,1,1,1,"NE"
2019-01-01 01:00:00,"LPR","CO",NA,15,2019,1,1,1,"NE"
2019-01-01 01:00:00,"MER","CO",NA,15,2019,1,1,1,"CE"
2019-01-01 01:00:00,"MGH","CO",NA,15,2019,1,1,1,"SO"
2019-01-01 01:00:00,"NEZ","CO",NA,15,2019,1,1,1,"NE"
2019-01-01 01:00:00,"PED","CO",NA,15,2019,1,1,1,"SO"
2019-01-01 01:00:00,"SAG","CO",NA,15,2019,1,1,1,"NE"
2019-01-01 01:00:00,"SFE","CO",NA,15,2019,1,1,1,"SO"
2019-01-01 01:00:00,"SJA","CO",NA,15,2019,1,1,1,"NO"
2019-01-01 01:00:00,"TAH","CO",NA,15,2019,1,1,1,"SE"
2019-01-01 01:00:00,"TLA","CO",NA,15,2019,1,1,1,"NO"
2019-01-01 01:00:00,"TLI","CO",NA,15,2019,1,1,1,"NO"
2019-01-01 01:00:00,"UAX","CO",NA,15,2019,1,1,1,"SE"
2019-01-01 01:00:00,"UIZ","CO",NA,15,2019,1,1,1,"SE"
2019-01-01 01:00:00,"VIF","CO",NA,15,2019,1,1,1,"NE"
2019-01-01 01:00:00,"XAL","CO",NA,15,2019,1,1,1,"NE"
What I want to do is group everything by id_station, id_parameter, year, day, and month. Then I want to count, for each day, the number of rows where "valor" is not NA.
Finally, I want to determine, for each id_station, how many days had at least 18 non-NA values. If a station has fewer than 274 such days, I want to delete ALL values associated with that id_station.
How can I do this?

Another possible option is aggregate() from base R:
aggregate(
  cbind(Count = !is.na(valor)) ~ id_station + id_parameter + year + day + month,
  df,
  sum
)

After grouping by the columns of interest, take the sum of a logical vector as the count: is.na(valor) returns TRUE where valor is NA and FALSE where it is not, negating with ! reverses that, and summing the result counts the non-NA elements (each TRUE counts as 1).
library(dplyr)
df1 %>%
  group_by(id_station, id_parameter, year, day, month) %>%
  summarise(Count = sum(!is.na(valor)))
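To also handle the follow-up requirements (keep only stations that have at least 274 days with at least 18 non-NA values), here is a hedged sketch building on the summary above; the thresholds and column names come from the question, while the intermediate object names are made up for illustration:
library(dplyr)

daily_counts <- df1 %>%
  group_by(id_station, id_parameter, year, month, day) %>%
  summarise(Count = sum(!is.na(valor)), .groups = "drop")

keep_stations <- daily_counts %>%
  group_by(id_station) %>%
  summarise(days_ok = sum(Count >= 18)) %>%  # days with at least 18 non-NA readings
  filter(days_ok >= 274)                     # stations with at least 274 such days

df1_kept <- df1 %>%
  semi_join(keep_stations, by = "id_station")  # drops every row of the other stations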

Related

creating special subgroup column in r

I have a large dataset with 516 rows (partial dataset below),
Check_In              Ward_1                Elapsed_time
2019-01-01 00:05:18   2019-01-01 00:09:32    4.2333333 mins
2019-01-01 00:11:3    2019-01-01 00:25:04   13.4500000 mins
2019-01-01 00:21:33   2019-01-01 01:03:31   41.9666667 mins
2019-01-01 00:27:18   2019-01-01 01:15:36   48.3000000 mins
2019-01-01 01:44:07   2019-01-01 02:02:45   18.6333333 mins
2019-01-01 02:10:46   2019-01-01 02:26:18   15.5333333 mins
I would like to create a subgroup number column with 3 rows per subgroup (example below), so that I can then use the qcc.groups function on the Elapsed_time and subgroup columns.
Check_In              Ward_1                Elapsed_time      subgroup
2019-01-01 00:05:18   2019-01-01 00:09:32    4.2333333 mins          1
2019-01-01 00:11:3    2019-01-01 00:25:04   13.4500000 mins          1
2019-01-01 00:21:33   2019-01-01 01:03:31   41.9666667 mins          1
2019-01-01 00:27:18   2019-01-01 01:15:36   48.3000000 mins          2
2019-01-01 01:44:07   2019-01-01 02:02:45   18.6333333 mins          2
2019-01-01 02:10:46   2019-01-01 02:26:18   15.5333333 mins          2
Another base R option
df$subgroup <- ceiling(seq(nrow(df)) / 3)
We can use gl from base R to create the groups, specifying n as the number of rows of the dataset (nrow(df1)) and k = 3:
df1$subgroup <- as.integer(gl(nrow(df1), 3, nrow(df1)))
data
df1 <- structure(list(Check_In = c("2019-01-01 00:05:18", "2019-01-01 00:11:3",
"2019-01-01 00:21:33", "2019-01-01 00:27:18", "2019-01-01 01:44:07",
"2019-01-01 02:10:46"), Ward_1 = c("2019-01-01 00:09:32", "2019-01-01 00:25:04",
"2019-01-01 01:03:31", "2019-01-01 01:15:36", "2019-01-01 02:02:45",
"2019-01-01 02:26:18"), Elapsed_time = c("4.2333333 mins", "13.4500000 mins",
"41.9666667 mins", "48.3000000 mins", "18.6333333 mins", "15.5333333 mins"
)), class = "data.frame", row.names = c(NA, -6L))
Or simply
df1 %>% mutate(grp = (row_number() + 2) %/% 3)
Check_In Ward_1 Elapsed_time grp
1 2019-01-01 00:05:18 2019-01-01 00:09:32 4.2333333 mins 1
2 2019-01-01 00:11:3 2019-01-01 00:25:04 13.4500000 mins 1
3 2019-01-01 00:21:33 2019-01-01 01:03:31 41.9666667 mins 1
4 2019-01-01 00:27:18 2019-01-01 01:15:36 48.3000000 mins 2
5 2019-01-01 01:44:07 2019-01-01 02:02:45 18.6333333 mins 2
6 2019-01-01 02:10:46 2019-01-01 02:26:18 15.5333333 mins 2
df1 dput courtesy of @akrun
Or maybe (thanks to akrun for the data):
library(dplyr)
df1 %>%
  mutate(subgroup = rep(row_number(), each = 3, length.out = n()))
Output:
Check_In Ward_1 Elapsed_time subgroup
1 2019-01-01 00:05:18 2019-01-01 00:09:32 4.2333333 mins 1
2 2019-01-01 00:11:3 2019-01-01 00:25:04 13.4500000 mins 1
3 2019-01-01 00:21:33 2019-01-01 01:03:31 41.9666667 mins 1
4 2019-01-01 00:27:18 2019-01-01 01:15:36 48.3000000 mins 2
5 2019-01-01 01:44:07 2019-01-01 02:02:45 18.6333333 mins 2
6 2019-01-01 02:10:46 2019-01-01 02:26:18 15.5333333 mins 2
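To then feed this into qcc, something like the sketch below might work; it assumes the qcc package's qcc.groups(data, sample) interface and that Elapsed_time is first converted from text such as "4.2333333 mins" to a plain numeric value (the Elapsed_mins name is made up):
library(qcc)

# strip the " mins" suffix so the measurements are numeric
df1$Elapsed_mins <- as.numeric(sub(" mins", "", df1$Elapsed_time))

# qcc.groups() reshapes the measurements into one row per subgroup
grouped <- qcc.groups(df1$Elapsed_mins, df1$subgroup)
qcc(grouped, type = "xbar")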

Generate dates based on condition before and after index dates

I have a data frame with 10,000+ dates. For example:
indexdt
01-02-2019
08-15-2019
I need to create two data frames based on the following conditions:
1. Generate dates such that I get the same day of week, up to 3 weeks before and after the index date. The output should be
Table 1
indexdt dates
01-02-2019 12-26-2018
01-02-2019 12-19-2018
01-02-2019 12-12-2018
01-02-2019 01-09-2019
01-02-2019 01-16-2019
01-02-2019 01-23-2019
08-15-2019 07-25-2019
08-15-2019 08-01-2019
08-15-2019 08-08-2019
08-15-2019 08-22-2019
08-15-2019 08-29-2019
08-15-2019 09-05-2019
2. Same day of week, same month. The output should be
Table 2
indexdt date
01-02-2019 01-09-2019
01-02-2019 01-16-2019
01-02-2019 01-23-2019
01-02-2019 01-30-2019
08-15-2019 08-01-2019
08-15-2019 08-08-2019
08-15-2019 08-22-2019
08-15-2019 08-29-2019
I have answered both questions here, but you should only ask one question per post:
library(dplyr)
library(purrr)
library(tidyr)     # for unnest()
library(lubridate)

# Convert to date
df <- df %>% mutate(indexdt = mdy(indexdt))
Generate dates such that I get the same day of week, up to 3 weeks before and after the index date
We use seq to generate the before and after dates separately. [-1] is used to drop the indexdt date itself, since we don't want it in the final output.
df %>%
  mutate(dates = map(indexdt, ~ c(seq(.x, length.out = 4, by = -7)[-1],
                                  seq(.x, length.out = 4, by = 7)[-1]))) %>%
  unnest(dates)
# indexdt dates
# <date> <date>
# 1 2019-01-02 2018-12-26
# 2 2019-01-02 2018-12-19
# 3 2019-01-02 2018-12-12
# 4 2019-01-02 2019-01-09
# 5 2019-01-02 2019-01-16
# 6 2019-01-02 2019-01-23
# 7 2019-08-15 2019-08-08
# 8 2019-08-15 2019-08-01
# 9 2019-08-15 2019-07-25
#10 2019-08-15 2019-08-22
#11 2019-08-15 2019-08-29
#12 2019-08-15 2019-09-05
Same day of week, same month
Here we create a sequence from the indexdt date back to the start of the month (floor_date) and another from indexdt forward to the end of the month (ceiling_date - 1).
df %>%
  mutate(dates = map(indexdt, ~ c(seq(.x, floor_date(.x, 'month'), by = -7)[-1],
                                  seq(.x, ceiling_date(.x, 'month') - 1, by = 7)[-1]))) %>%
  unnest(dates)
# indexdt dates
# <date> <date>
#1 2019-01-02 2019-01-09
#2 2019-01-02 2019-01-16
#3 2019-01-02 2019-01-23
#4 2019-01-02 2019-01-30
#5 2019-08-15 2019-08-08
#6 2019-08-15 2019-08-01
#7 2019-08-15 2019-08-22
#8 2019-08-15 2019-08-29
data
df <- structure(list(indexdt = c("01-02-2019", "08-15-2019")),
class = "data.frame", row.names = c(NA, -2L))

R: Moving average length of time between two dates

I have a dataset of observations with start and end dates. I would like to calculate the moving average difference between the start and end dates.
I've included an example dataset below.
require(dplyr)
df <- data.frame(id = c(1, 2, 3),
                 start = c("2019-01-01", "2019-01-10", "2019-01-05"),
                 end = c("2019-02-01", "2019-01-15", "2019-01-10"))
df[, c("start", "end")] <- lapply(df[, c("start", "end")], as.Date)
id start end
1 2019-01-01 2019-02-01
2 2019-01-10 2019-01-15
3 2019-01-05 2019-01-10
The overall date range is 2019-01-01 to 2019-02-01. I would like to calculate the average difference between the start and end dates for each of the dates in that range.
The result would look exactly like this. I've included the actual values for the averages that should show up:
date avg
2019-01-01 0
2019-01-02 1
2019-01-03 2
2019-01-04 3
2019-01-05 4
2019-01-06 3
2019-01-07 4
2019-01-08 5
2019-01-09 6
2019-01-10 7
2019-01-11 5.5
. .
. .
. .
Creating a reproducible example:
df <- data.frame(id = c(1, 2, 3, 4),
                 start = c("2019-01-01", "2019-01-01", "2019-01-10", "2019-01-05"),
                 end = c("2019-01-04", "2019-01-05", "2019-01-12", "2019-01-08"))
df[, c("start", "end")] <- lapply(df[, c("start", "end")], as.Date)
df
Returns:
id start end
1 2019-01-01 2019-01-04
2 2019-01-01 2019-01-05
3 2019-01-10 2019-01-12
4 2019-01-05 2019-01-08
Then using the group_by function from dplyr:
library(dplyr)
df %>%
  group_by(start) %>%
  summarize(avg = mean(end - start)) %>%
  rename(date = start)
Returns:
date avg
<time> <time>
2019-01-01 3.5 days
2019-01-05 3.0 days
2019-01-10 2.0 days
Editing the answer as per comments.
Creating the df:
require(dplyr)
df <- data.frame(id = c(1, 2, 3),
                 start = c("2019-01-01", "2019-01-10", "2019-01-05"),
                 end = c("2019-02-01", "2019-01-15", "2019-01-10"))
df[, c("start", "end")] <- lapply(df[, c("start", "end")], as.Date)
Create dates for every start-end combination:
# gives the list of all dates within each start/end frame and calculates the difference
datesList <- lapply(1:nrow(df), function(i) {
  tibble::tibble(date = seq.Date(from = df$start[i], to = df$end[i], by = 1),
                 start = df$start[i]) %>%
    dplyr::mutate(diff = date - start)
})
Finally, group_by the date and compute the average to give output exactly like the one in the question:
finalDf <- bind_rows(datesList) %>%
  dplyr::filter(diff != 0) %>%
  dplyr::group_by(date) %>%
  dplyr::summarise(avg = mean(diff, na.rm = TRUE))
The output thus becomes:
# A tibble: 31 x 2
date avg
<date> <time>
1 2019-01-02 1.0 days
2 2019-01-03 2.0 days
3 2019-01-04 3.0 days
4 2019-01-05 4.0 days
5 2019-01-06 3.0 days
6 2019-01-07 4.0 days
7 2019-01-08 5.0 days
8 2019-01-09 6.0 days
9 2019-01-10 7.0 days
10 2019-01-11 5.5 days
# … with 21 more rows
Let me know if it works.
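For reference, a more compact sketch of the same idea using purrr::map2() and tidyr::unnest() (this is not part of the original answer and assumes tidyr >= 1.0):
library(dplyr)
library(purrr)
library(tidyr)

df %>%
  mutate(date = map2(start, end, seq, by = "day")) %>%   # one date sequence per row
  unnest(date) %>%
  mutate(diff = as.numeric(date - start, units = "days")) %>%
  filter(diff != 0) %>%            # mirror the answer's exclusion of the start day
  group_by(date) %>%
  summarise(avg = mean(diff, na.rm = TRUE))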

Can I aggregate time series data between an on and off date using a data table join or the aggregate function?

I would like to efficiently summarize continuous meteorological data over the periods that discrete samples are being collected.
I currently do this with a time-consuming loop, but I imagine a better solution exists. I'm new to data.table syntax, but it seems like there should be a solution with joining.
continuous <- data.frame(Time = seq(as.POSIXct("2019-01-01 0:00:00"),
                                    as.POSIXct("2019-01-01 9:00:00"), "hour"),
                         CO2 = sample(400:450, 10),
                         Temp = sample(10:30, 10))
> continuous
Time CO2 Temp
1 2019-01-01 00:00:00 430 11
2 2019-01-01 01:00:00 412 26
3 2019-01-01 02:00:00 427 17
4 2019-01-01 03:00:00 435 29
5 2019-01-01 04:00:00 447 23
6 2019-01-01 05:00:00 417 19
7 2019-01-01 06:00:00 408 12
8 2019-01-01 07:00:00 449 28
9 2019-01-01 08:00:00 445 20
10 2019-01-01 09:00:00 420 27
discrete <- data.frame(on = c(as.POSIXct("2019-01-01 0:00:00"),
                              as.POSIXct("2019-01-01 3:00:00")),
                       off = c(as.POSIXct("2019-01-01 3:00:00"),
                               as.POSIXct("2019-01-01 8:00:00")))
> discrete
on off
1 2019-01-01 00:00:00 2019-01-01 03:00:00
2 2019-01-01 03:00:00 2019-01-01 08:00:00
discrete[, c("CO2.mean","Temp.mean")] <-
lapply(seq(length(c("CO2","Temp"))), function(k)
unlist(lapply(seq(length(discrete[, 1])), function(i)
mean(continuous[
which.closest(continuous$Time,discrete$on[i]):
which.closest(continuous$Time, discrete$off[i]),
c("CO2","Temp")[k]]))))
> discrete
on off CO2.mean Temp.mean
1 2019-01-01 00:00:00 2019-01-01 03:00:00 426.0 20.75000
2 2019-01-01 03:00:00 2019-01-01 08:00:00 433.5 21.83333
This works, but when aggregating tens of continuous variables into hundreds of sampling periods, it takes a very long time to run. Thank you for your help!
An option would be a nonequi join in data.table
library(data.table)
setDT(continuous)[discrete,
                  .(CO2mean = mean(CO2), Tempmean = mean(Temp)),
                  on = .(Time >= on, Time <= off), by = .EACHI]
or with a rolling join
setDT(continuous)[discrete,
                  .(CO2mean = mean(CO2), Tempmean = mean(Temp)),
                  on = .(Time = on, Time = off),
                  by = .EACHI, roll = 'nearest']
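For comparison, a hedged sketch of the same aggregation using dplyr's non-equi joins (requires dplyr >= 1.1.0 for join_by(); this is not part of the original answer, and the column names follow the question's continuous/discrete data frames):
library(dplyr)

discrete %>%
  left_join(continuous, by = join_by(on <= Time, off >= Time)) %>%
  group_by(on, off) %>%
  summarise(CO2.mean = mean(CO2), Temp.mean = mean(Temp), .groups = "drop")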

Unexpected dplyr::right_join() behavior for expanding POSIXct time series

I have a data frame containing some daily data timestamped at midnight on each day and some hourly data timestamped at the beginning of each hour throughout the day. I want to expand the data so it's all hourly, and I'd like to do so within a tidyverse "pipe chain".
My thought was to create a data frame containing the full hourly time series and then dplyr::right_join() my data against this time series. I thought this would populate the proper values where there was a match for the daily data (at midnight) and populate NA wherever there was no match (any hour except midnight). This seems to work only when the time series in my data is daily only, rather than a mix of daily and hourly values, which was unexpected. Why does the right join not expand the daily time series when it coexists in a data frame along with another hourly time series?
I've generated a minimal example below. My representative data set that I want to expand is named allData and contains a mix of daily and hourly datasets from two different time series variables, Daily TS and Hourly TS.
dailyData <- data.frame(
  DateTime = seq.POSIXt(lubridate::ymd_hms('2019-01-01', truncated = 3),
                        lubridate::ymd_hms('2019-01-07', truncated = 3),
                        by = 'day'),
  Name = 'Daily TS'
)
allHours <- data.frame(
  DateTime = seq.POSIXt(lubridate::ymd_hms('2019-01-01', truncated = 3),
                        lubridate::ymd_hms('2019-01-07 23:00:00'),
                        by = 'hour')
)
hourlyData <- allHours %>%
  dplyr::mutate(Name = 'Hourly TS')
allData <- rbind(dailyData, hourlyData)
This gives
head( allData, n=15 )
DateTime Name
1 2019-01-01 00:00:00 Daily TS
2 2019-01-02 00:00:00 Daily TS
3 2019-01-03 00:00:00 Daily TS
4 2019-01-04 00:00:00 Daily TS
5 2019-01-05 00:00:00 Daily TS
6 2019-01-06 00:00:00 Daily TS
7 2019-01-07 00:00:00 Daily TS
8 2019-01-01 00:00:00 Hourly TS
9 2019-01-01 01:00:00 Hourly TS
10 2019-01-01 02:00:00 Hourly TS
11 2019-01-01 03:00:00 Hourly TS
12 2019-01-01 04:00:00 Hourly TS
13 2019-01-01 05:00:00 Hourly TS
14 2019-01-01 06:00:00 Hourly TS
15 2019-01-01 07:00:00 Hourly TS
Now, I thought that dplyr::right_join() of the full hourly sequence of POSIXct values against allData$DateTime would have expanded the daily time series, leaving NA values for any hours not explicitly present in the data. I could then use tidyr::fill() to fill these in over the day. However, the following code does not behave this way:
expanded_BAD <- allData %>%
  dplyr::right_join(allHours, by = 'DateTime') %>%
  tidyr::fill(dplyr::everything(), .direction = 'down') %>%
  dplyr::arrange(Name, DateTime)
expanded_BAD shows that the daily data hasn't been expanded by the right_join(). That is, the hours in allHours missing from allData were not retained in the result, which I thought was the whole purpose of using a right join. Here's the head of the result:
head(expanded_BAD, n=15)
DateTime Name
1 2019-01-01 00:00:00 Daily TS
2 2019-01-02 00:00:00 Daily TS
3 2019-01-03 00:00:00 Daily TS
4 2019-01-04 00:00:00 Daily TS
5 2019-01-05 00:00:00 Daily TS
6 2019-01-06 00:00:00 Daily TS
7 2019-01-07 00:00:00 Daily TS
8 2019-01-01 00:00:00 Hourly TS
9 2019-01-01 01:00:00 Hourly TS
10 2019-01-01 02:00:00 Hourly TS
11 2019-01-01 03:00:00 Hourly TS
12 2019-01-01 04:00:00 Hourly TS
13 2019-01-01 05:00:00 Hourly TS
14 2019-01-01 06:00:00 Hourly TS
15 2019-01-01 07:00:00 Hourly TS
Interestingly, if we perform the exact same right join on only the daily data, we get the desired result:
dailyData_expanded_GOOD <- dailyData %>%
  dplyr::right_join(allHours, by = 'DateTime') %>%
  tidyr::fill(dplyr::everything(), .direction = 'down')
Here's the head:
head(dailyData_expanded_GOOD, n=15)
DateTime Name
1 2019-01-01 00:00:00 Daily TS
2 2019-01-01 01:00:00 Daily TS
3 2019-01-01 02:00:00 Daily TS
4 2019-01-01 03:00:00 Daily TS
5 2019-01-01 04:00:00 Daily TS
6 2019-01-01 05:00:00 Daily TS
7 2019-01-01 06:00:00 Daily TS
8 2019-01-01 07:00:00 Daily TS
9 2019-01-01 08:00:00 Daily TS
10 2019-01-01 09:00:00 Daily TS
11 2019-01-01 10:00:00 Daily TS
12 2019-01-01 11:00:00 Daily TS
13 2019-01-01 12:00:00 Daily TS
14 2019-01-01 13:00:00 Daily TS
15 2019-01-01 14:00:00 Daily TS
Why does the right join do different things on the full data compared to only the daily data?
I think the problem is that you are trying to bind the dataframes together too soon. I believe this gives you what you want:
result <- bind_rows(dailyData_expanded_GOOD, hourlyData)
head(result)
#> DateTime Name
#> 1 2019-01-01 00:00:00 Daily TS
#> 2 2019-01-01 01:00:00 Daily TS
#> 3 2019-01-01 02:00:00 Daily TS
#> 4 2019-01-01 03:00:00 Daily TS
#> 5 2019-01-01 04:00:00 Daily TS
#> 6 2019-01-01 05:00:00 Daily TS
The reason right_join() doesn't work is that allHours already matches the hourly time series rows in allData perfectly. From ?right_join:
return all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
You're hoping that rows in y with no match in x will get NA values, but every row in y already matches at least one row in x. For the midnight hours there are actually multiple matches, one for the daily and one for the hourly series, but right_join() just returns both without expanding the daily time series rows.
This is different from the situation in this question, where the datetimes to be expanded do not occur in the left hand data frame. Then the strategy of merging would expand your result as expected.
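A quick way to see this on the example data (a sketch; the counts assume the 7-day/168-hour setup above):
nrow(allHours)      # 168 hours across the 7 days
nrow(expanded_BAD)  # 175: one row per hour plus 7 extra midnight rows from the
                    # daily series -- existing matches are kept, nothing is expanded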
So that explains why a bare right_join() doesn't work, but it doesn't solve the problem, because you have to split up the data manually, and that would get old fast if there are varying numbers of time series. There are a couple of solutions in the comments, and then one additional one that I will add below.
tidyr::expand()
expandedData <- allData %>%
  tidyr::expand(DateTime, Name) %>%
  dplyr::arrange(Name, DateTime)
This works, but only with both time series present. If there is only
dailyData, then the result is not expanded.
The kitchen sink
expandedData1 <- allData %>%
  dplyr::right_join(allHours, by = 'DateTime') %>%
  tidyr::fill(everything()) %>%
  tidyr::expand(DateTime, Name) %>%
  dplyr::arrange(Name, DateTime)
As pointed out in the comments, this works for all cases - both types,
only daily data, only hourly data. This solution and the next generate
warnings unless you use stringsAsFactors = FALSE in the data.frame()
calls above.
The only issue with this solution is that fill() and right_join() are
only there to deal with edge cases. I don't know if that is a real problem
or not.
"Split" in the pipe
The simple solution splits the dataset, and this can be done inside the pipe in a couple of ways.
expandedData2 <- allData %>%
  tidyr::nest(-Name) %>%
  mutate(data = purrr::map(data, ~ right_join(., allHours, by = 'DateTime'))) %>%
  tidyr::unnest()
The other way would use base::split() and then purrr::map_dfr()
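A hedged sketch of that base::split() + purrr::map_dfr() route (not part of the original answer; the expandedData3 name is made up):
library(dplyr)
library(purrr)

expandedData3 <- allData %>%
  split(.$Name) %>%                     # one data frame per time series
  map_dfr(~ .x %>%
            right_join(allHours, by = "DateTime") %>%
            tidyr::fill(Name, .direction = "downup")) %>%
  arrange(Name, DateTime)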
Created on 2019-03-24 by the reprex package (v0.2.0).
