Filling gaps in a time series - R

I have the following data:
id <- c(rep(12, 10), rep(14, 12), rep(16, 2))
m <- c(1:5, 8:12, 1:12, 10, 12)
y <- c(rep(14, 10), rep(14, 12), rep(15, 2))
v <- rnorm(24)
df <- data.frame(id, m, y, v)
> df
id m y v
1 12 1 14 0.9453216
2 12 2 14 1.0666393
3 12 3 14 -0.2750527
4 12 4 14 1.3264349
5 12 5 14 -1.8046676
6 12 8 14 0.3334960
7 12 9 14 -1.2448408
8 12 10 14 0.5258248
9 12 11 14 -0.1233157
10 12 12 14 1.4717530
11 14 1 14 0.6217376
12 14 2 14 -0.8344823
13 14 3 14 1.1468841
14 14 4 14 -0.3363987
15 14 5 14 -1.3543311
16 14 6 14 -0.2146853
17 14 7 14 -0.6546186
18 14 8 14 -2.4286257
19 14 9 14 -1.3314888
20 14 10 14 0.8215581
21 14 11 14 -0.9999368
22 14 12 14 -1.2935147
23 16 10 15 0.7339261
24 16 12 15 1.1303524
The first column is the id, the second column m is the month, the third column y is the year, and the last column v is the value.
In the month column, two observations (June and July) are missing for id 12 in year 14, and November is missing for id 16 in year 15.
I would like those missing months to appear with a value of zero. For example, for year 15 the data should look like this:
16 10 15 0.7339261
16 11 15 0
16 12 15 1.1303524
Can anyone suggest a way to do that?

In data.table: generate the months for each id and year, left join this with the original dataset on id, y, m, and then replace the NAs with 0:
library(data.table)
setDT(df)
df[df[, .(m = min(m):max(m)), by = .(id, y)], on = .(id, y, m)][
  is.na(v), v := 0]
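Since := updates the joined table by reference and returns its result invisibly, chain a trailing [] if you want the filled table printed:
df[df[, .(m = min(m):max(m)), by = .(id, y)], on = .(id, y, m)][
  is.na(v), v := 0][]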

With dplyr and tidyr, you can do:
library(dplyr)
library(tidyr)
df %>%
  group_by(id) %>%
  complete(m = seq(min(m), max(m), 1), fill = list(v = 0)) %>%
  fill(y)
id m y v
<dbl> <dbl> <dbl> <dbl>
1 12 1 14 0.539
2 12 2 14 -0.0768
3 12 3 14 1.85
4 12 4 14 -0.855
5 12 5 14 0.0326
6 12 6 14 0
7 12 7 14 0
8 12 8 14 -1.03
9 12 9 14 -0.982
10 12 10 14 0.00410
11 12 11 14 -0.233
12 12 12 14 -0.499
13 14 1 14 1.55
14 14 2 14 0.0875
15 14 3 14 1.32
16 14 4 14 -0.981
17 14 5 14 -0.246
18 14 6 14 -1.40
19 14 7 14 1.44
20 14 8 14 -0.981
21 14 9 14 1.47
22 14 10 14 -0.991
23 14 11 14 -0.0945
24 14 12 14 -2.88
25 16 10 15 -0.247
26 16 11 15 0
27 16 12 15 0.0147
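A variant of the same idea, as a sketch assuming tidyr's full_seq() helper: grouping by both id and y lets complete() carry the year along, so the separate fill() step is not needed:
df %>%
  group_by(id, y) %>%
  complete(m = full_seq(m, 1), fill = list(v = 0)) %>%
  ungroup()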

Related

Conditionally combining rows and adding to an existing row

Here is my data frame:
df <- data.frame(age = c(10, 11, 12, 11, 11, 10, 11, 13, 13, 13, 14, 14, 15, 15, 15),
                 time1 = 10:24, time2 = 20:34)
I want to sum the rows for ages 14 and 15 and keep the result as age 14. My expected output would look like this:
age time1 time2
1 10 10 20
2 11 11 21
3 12 12 22
4 11 13 23
5 11 14 24
6 10 15 25
7 11 16 26
8 13 17 27
9 13 18 28
10 13 19 29
11 14 110 160
Thank you in advance.
Here is one method - replace the 'age' values of 15 with 14, then summarise across the 'time' columns, taking the sum only where the grouped 'age' values are all 14:
library(dplyr)
df %>%
  group_by(age = replace(age, age %in% 15, 14)) %>%
  summarise(across(everything(), ~ if (all(age == 14)) sum(.x) else .x),
            .groups = 'drop')
-output (note that the rows are ordered by the grouped 'age'):
# A tibble: 11 × 3
age time1 time2
<dbl> <int> <int>
1 10 10 20
2 10 15 25
3 11 11 21
4 11 13 23
5 11 14 24
6 11 16 26
7 12 12 22
8 13 17 27
9 13 18 28
10 13 19 29
11 14 110 160
Or using base R with colSums and subset/rbind:
rbind(subset(df, !age %in% c(14, 15)),
      c(age = 14, colSums(df[df$age %in% c(14, 15), -1])))
-output
age time1 time2
1 10 10 20
2 11 11 21
3 12 12 22
4 11 13 23
5 11 14 24
6 10 15 25
7 11 16 26
8 13 17 27
9 13 18 28
10 13 19 29
11 14 110 160
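A data.table sketch of the same logic, keeping the other ages untouched and appending a single summed row (assuming the df from the question):
library(data.table)
dt <- as.data.table(df)
rbind(dt[!age %in% c(14, 15)],
      dt[age %in% c(14, 15), c(list(age = 14), lapply(.SD, sum)),
         .SDcols = c("time1", "time2")])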

Transpose from long to wide with pair groups in R

I have descriptive statistics for four groups. My sample dataset is:
df <- data.frame(
  Grade = c(3, 3, 3, 3, 4, 4, 4, 4),
  group = c("none", "G1", "G2", "both", "none", "G1", "G2", "both"),
  mean = c(10, 12, 13, 12, 11, 18, 19, 20),
  sd = c(22, 12, 22, 12, 11, 13, 14, 15),
  N = c(35, 33, 34, 32, 43, 45, 46, 47))
> df
Grade group mean sd N
1 3 none 10 22 35
2 3 G1 12 12 33
3 3 G2 13 22 34
4 3 both 12 12 32
5 4 none 11 11 43
6 4 G1 18 13 45
7 4 G2 19 14 46
8 4 both 20 15 47
I would like to compare the groups as pairs and need the descriptive information side by side for each pair, so that each grade has 6 pairs of groups.
Does anyone have any ideas on this?
Thanks!
1) sqldf We can join df to itself on the condition shown below. Note that group is escaped in brackets since group is an SQL keyword.
library(sqldf)
sqldf('select
         a.Grade,
         a.[group] Group1, b.[group] Group2,
         a.mean mean1, b.mean mean2,
         a.sd sd1, b.sd sd2,
         a.N n1, b.N n2
       from df a
       join df b on a.Grade = b.Grade and a.[group] > b.[group]')
giving:
Grade Group1 Group2 mean1 mean2 sd1 sd2 n1 n2
1 3 none G1 10 12 22 12 35 33
2 3 none G2 10 13 22 22 35 34
3 3 none both 10 12 22 12 35 32
4 3 G2 G1 13 12 22 12 34 33
5 3 both G1 12 12 12 12 32 33
6 3 both G2 12 13 12 22 32 34
7 4 none G1 11 18 11 13 43 45
8 4 none G2 11 19 11 14 43 46
9 4 none both 11 20 11 15 43 47
10 4 G2 G1 19 18 14 13 46 45
11 4 both G1 20 18 15 13 47 45
12 4 both G2 20 19 15 14 47 46
2) base R We can perform a merge on part of the condition and then subset the result for the remainder. The column names come out slightly differently, so rename them if that matters.
subset(merge(df, df, by = "Grade"), group.x > group.y)
giving:
Grade group.x mean.x sd.x N.x group.y mean.y sd.y N.y
2 3 none 10 22 35 G1 12 12 33
3 3 none 10 22 35 G2 13 22 34
4 3 none 10 22 35 both 12 12 32
8 3 G1 12 12 33 both 12 12 32
10 3 G2 13 22 34 G1 12 12 33
12 3 G2 13 22 34 both 12 12 32
18 4 none 11 11 43 G1 18 13 45
19 4 none 11 11 43 G2 19 14 46
20 4 none 11 11 43 both 20 15 47
24 4 G1 18 13 45 both 20 15 47
26 4 G2 19 14 46 G1 18 13 45
28 4 G2 19 14 46 both 20 15 47
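3) dplyr The same self-join can be sketched with inner_join; the suffix argument yields the 1/2 column names directly (newer dplyr versions may warn about the many-to-many join, which is intentional here):
library(dplyr)
inner_join(df, df, by = "Grade", suffix = c("1", "2")) %>%
  filter(group1 > group2)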

Trying to integrate over discrete points from a data frame

I have several months of weather data; an example day is here:
Hour Avg.Temp
1 1 11
2 2 11
3 3 11
4 4 10
5 5 10
6 6 11
7 7 12
8 8 14
9 9 15
10 10 17
11 11 19
12 12 21
13 13 22
14 14 24
15 15 23
16 16 22
17 17 21
18 18 18
19 19 16
20 20 15
21 21 14
22 22 12
23 23 11
24 24 10
I need to figure out the total number of hours above 15 degrees by integrating in R. I'm analyzing for degree days, a concept in agriculture that gives valuable information about relative growth rate. For example, hour 10 is 2 degree hours and hour 11 is 4 degree hours above 15 degrees. This can help predict when to harvest fruit. How can I write the code for this?
Another column could potentially work with a simple subtraction. Then I would have to make a cumulative sum after canceling out all negative numbers. That is the approach I'm setting out to do right now. Is there an integral I could write and have an answer in one step?
This solution subtracts your threshold (i.e., 15°), fits a function to the result, then integrates this function. Note that where the temperature is below the threshold it contributes zero to the total rather than a negative value.
df <- read.table(text = "Hour Avg.Temp
1 1 11
2 2 11
3 3 11
4 4 10
5 5 10
6 6 11
7 7 12
8 8 14
9 9 15
10 10 17
11 11 19
12 12 21
13 13 22
14 14 24
15 15 23
16 16 22
17 17 21
18 18 18
19 19 16
20 20 15
21 21 14
22 22 12
23 23 11
24 24 10", header = TRUE)
with(df, integrate(approxfun(Hour, pmax(Avg.Temp - 15, 0)),
                   lower = min(Hour), upper = max(Hour)))
#> 53.00017 with absolute error < 0.0039
Created on 2019-02-08 by the reprex package (v0.2.1.9000)
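Since approxfun builds a piecewise-linear function, the same integral can also be computed exactly with the trapezoidal rule, avoiding the numeric quadrature error (a sketch using the same df):
y <- pmax(df$Avg.Temp - 15, 0)
sum(diff(df$Hour) * (head(y, -1) + tail(y, -1)) / 2)
#> [1] 53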
The OP has requested "to figure out the total number of hours above 15 degrees by integrating in R".
It is not fully clear to me what the expected result is: does the OP want to count the number of hours above 15 degrees, or sum up the degrees greater than 15 ("integrate")?
However, the code below computes both figures. Provided the data is sampled at every hour without gaps (as suggested by the OP's sample dataset), cumsum() and sum() can be used, respectively:
library(data.table)
setDT(DT)[, c("deg_hrs_sum", "deg_hrs_cnt") :=
            .(cumsum(pmax(0, Avg.Temp - 15)), cumsum(Avg.Temp > 15))]
Hour Avg.Temp deg_hrs_sum deg_hrs_cnt
1: 1 11 0 0
2: 2 11 0 0
3: 3 11 0 0
4: 4 10 0 0
5: 5 10 0 0
6: 6 11 0 0
7: 7 12 0 0
8: 8 14 0 0
9: 9 15 0 0
10: 10 17 2 1
11: 11 19 6 2
12: 12 21 12 3
13: 13 22 19 4
14: 14 24 28 5
15: 15 23 36 6
16: 16 22 43 7
17: 17 21 49 8
18: 18 18 52 9
19: 19 16 53 10
20: 20 15 53 10
21: 21 14 53 10
22: 22 12 53 10
23: 23 11 53 10
24: 24 10 53 10
Hour Avg.Temp deg_hrs_sum deg_hrs_cnt
Alternatively,
setDT(DT)[, .(deg_hrs_sum = sum(pmax(0, Avg.Temp - 15)),
deg_hrs_cnt = sum(Avg.Temp > 15))]
returns only the final result (last row):
deg_hrs_sum deg_hrs_cnt
1: 53 10
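For reference, the same two totals can be computed in base R without data.table (assuming the DT defined below):
sum(pmax(0, DT$Avg.Temp - 15)) # degree-hours above 15: 53
sum(DT$Avg.Temp > 15)          # hours above 15: 10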
Data
library(data.table)
DT <- fread("
rn Hour Avg.Temp
1 1 11
2 2 11
3 3 11
4 4 10
5 5 10
6 6 11
7 7 12
8 8 14
9 9 15
10 10 17
11 11 19
12 12 21
13 13 22
14 14 24
15 15 23
16 16 22
17 17 21
18 18 18
19 19 16
20 20 15
21 21 14
22 22 12
23 23 11
24 24 10", drop = 1L)

Sum a variable based on another variable

I have a dataset consisting of two variables, Contents and Time like so:
Time Contents
2017M01 123
2017M02 456
2017M03 789
. .
. .
. .
2018M12 789
Now I want to create a numeric vector that aggregates Contents over six-month blocks; that is, I want to sum 2017M01 through 2017M06 to one number, 2017M07 through 2017M12 to another, and so on.
I'm able to do this by indexing but I want to be able to write: "From 2017M01 to 2017M06 sum contents corresponding to that sequence" in my code.
I would really appreciate some help!
You can create a grouping variable based on the number of rows and the number of elements per group. In your case you want to group every 6 rows, so the number of rows should be divisible by 6. Using iris to demonstrate (it has 150 rows, so 150 / 6 = 25 groups):
rep(seq(nrow(iris) %/% 6), each = 6)
#[1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 5 6 6 6 6 6 6 7 7 7 7 7 7 8 8 8 8 8 8 9 9 9 9 9 9 10 10 10 10
#[59] 10 10 11 11 11 11 11 11 12 12 12 12 12 12 13 13 13 13 13 13 14 14 14 14 14 14 15 15 15 15 15 15 16 16 16 16 16 16 17 17 17 17 17 17 18 18 18 18 18 18 19 19 19 19 19 19 20 20
#[117] 20 20 20 20 21 21 21 21 21 21 22 22 22 22 22 22 23 23 23 23 23 23 24 24 24 24 24 24 25 25 25 25 25 25
There are plenty of ways to handle how you want to specify the range. Here is a custom function that parses a "from:to" string and creates the grouping variable:
f1 <- function(x, df) {
  v1 <- as.numeric(gsub('[0-9]{4}M(.*):[0-9]{4}M(.*)$', '\\1', x))
  v2 <- as.numeric(gsub('[0-9]{4}M(.*):[0-9]{4}M(.*)$', '\\2', x))
  i1 <- (v2 - v1) + 1
  return(rep(seq(nrow(df) %/% i1), each = i1))
}
f1("2017M01:2017M06", iris)
#[1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 5 6 6 6 6 6 6 7 7 7 7 7 7 8 8 8 8 8 8 9 9 9 9 9 9 10 10 10 10
#[59] 10 10 11 11 11 11 11 11 12 12 12 12 12 12 13 13 13 13 13 13 14 14 14 14 14 14 15 15 15 15 15 15 16 16 16 16 16 16 17 17 17 17 17 17 18 18 18 18 18 18 19 19 19 19 19 19 20 20
#[117] 20 20 20 20 21 21 21 21 21 21 22 22 22 22 22 22 23 23 23 23 23 23 24 24 24 24 24 24 25 25 25 25 25 25
EDIT: We can easily make the function handle 'non-zero-remainder' divisions by appending max + 1 of the grouping vector, repeated remainder times:
f1 <- function(x, df) {
  v1 <- as.numeric(gsub('[0-9]{4}M(.*):[0-9]{4}M(.*)$', '\\1', x))
  v2 <- as.numeric(gsub('[0-9]{4}M(.*):[0-9]{4}M(.*)$', '\\2', x))
  i1 <- (v2 - v1) + 1
  final_v <- rep(seq(nrow(df) %/% i1), each = i1)
  if (nrow(df) %% i1 == 0) {
    return(final_v)
  } else {
    remainder <- nrow(df) %% i1
    final_v1 <- c(final_v, rep(max(final_v) + 1, remainder))
    return(final_v1)
  }
}
So for a data frame with 20 rows, doing groups of 6, the above function will yield the result:
f1("2017M01:2017M06", df)
#[1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4
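To actually sum Contents over these groups, pass the grouping vector to tapply() - a sketch assuming a data frame df with the Time/Contents columns from the question:
tapply(df$Contents, f1("2017M01:2017M06", df), sum)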

R: Linear extrapolation between raster layers of different dates

There is already a thread dealing with interpolation between raster layers of different years (2006, 2008, 2010, 2012). Now I have tried to linearly extrapolate to 2020 with the approach suggested by @Ram Narasimhan and approxExtrap from the Hmisc package:
library(raster)
library(Hmisc)
df <- data.frame("2006" = 1:9, "2008" = 3:11, "2010" = 5:13, "2012"=7:15)
#transpose since we want time to be the first col, and the values to be columns
new <- data.frame(t(df))
times <- seq(2006, 2012, by=2)
new <- cbind(times, new)
# Now, apply Linear Extrapolate for each layer of the raster
approxExtrap(new, xout=c(2006:2012), rule = 2)
But instead of getting something like this:
# times X1 X2 X3 X4 X5 X6 X7 X8 X9
#1 2006 1 2 3 4 5 6 7 8 9
#2 2007 2 3 4 5 6 7 8 9 10
#3 2008 3 4 5 6 7 8 9 10 11
#4 2009 4 5 6 7 8 9 10 11 12
#5 2010 5 6 7 8 9 10 11 12 13
#6 2011 6 7 8 9 10 11 12 13 14
#7 2012 7 8 9 10 11 12 13 14 15
#8 2013 8 9 10 11 12 13 14 15 16
#9 2014 9 10 11 12 13 14 15 16 17
#10 2015 10 11 12 13 14 15 16 17 18
#11 2016 11 12 13 14 15 16 17 18 19
#12 2017 12 13 14 15 16 17 18 19 20
#13 2018 13 14 15 16 17 18 19 20 21
#14 2019 14 15 16 17 18 19 20 21 22
#15 2020 15 16 17 18 19 20 21 22 23
I get this:
$x
[1] 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
$y
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
This is quite confusing as both approxTime and approxExtrap are based on approxfun.
I found a way to make this work, although it doesn't seem to be the most elegant approach. The basic idea is to first perform a linear interpolation with approxTime, then use lm to fit a linear model to the time series and extrapolate to the final year with predict. The gap between the end of the first interpolation and the final year is then filled by a second linear interpolation with approxTime.
NOTE: The first linear interpolation is not really necessary here, although I don't know if it makes a difference with more sophisticated data.
library(raster)
library(Hmisc)
library(simecol)
df <- data.frame("2006" = 1:9, "2008" = 3:11, "2010" = 5:13, "2012" = 7:15)
# transpose since we want time to be the first col and the values to be columns
new <- data.frame(t(df))
times <- seq(2006, 2012, by = 2)
new <- cbind(times, new)
# Now, apply linear interpolation for each layer of the raster
intp <- approxTime(new, 2006:2012, rule = 2)
# Extract the years from the data.frame
tm <- intp[, 1]
# Define a function for a linear model using lm
lm.func <- function(i) lm(i ~ tm)
# Define a new data.frame without the years from intp
intp.new <- intp[, -1]
# Create a list of the lm fits for each column of intp.new
lm.list <- apply(intp.new, MARGIN = 2, FUN = lm.func)
# Create a data.frame for the final year of the extrapolation; keep the name tm
new.pred <- data.frame(tm = 2020)
# Make predictions for the final year for each element of lm.list
pred.points <- lapply(lm.list, predict, new.pred)
# Unlist the predicted points
fintime <- matrix(unlist(pred.points))
# Add the final year to the fintime matrix and transpose it
fintime.new <- t(rbind(2020, fintime))
# Convert the intp data.frame into a matrix
intp.ma <- as.matrix(intp)
# Append fintime.new to intp.ma
intp.wt <- as.data.frame(rbind(intp.ma, fintime.new))
# Perform a linear interpolation with approxTime again
approxTime(intp.wt, 2006:2020, rule = 2)
times X1 X2 X3 X4 X5 X6 X7 X8 X9
1 2006 1 2 3 4 5 6 7 8 9
2 2007 2 3 4 5 6 7 8 9 10
3 2008 3 4 5 6 7 8 9 10 11
4 2009 4 5 6 7 8 9 10 11 12
5 2010 5 6 7 8 9 10 11 12 13
6 2011 6 7 8 9 10 11 12 13 14
7 2012 7 8 9 10 11 12 13 14 15
8 2013 8 9 10 11 12 13 14 15 16
9 2014 9 10 11 12 13 14 15 16 17
10 2015 10 11 12 13 14 15 16 17 18
11 2016 11 12 13 14 15 16 17 18 19
12 2017 12 13 14 15 16 17 18 19 20
13 2018 13 14 15 16 17 18 19 20 21
14 2019 14 15 16 17 18 19 20 21 22
15 2020 15 16 17 18 19 20 21 22 23
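If the goal is only a per-cell linear extrapolation (and this toy data is exactly linear), a more compact sketch fits one lm per column and predicts the whole 2006:2020 range in a single step, skipping both approxTime passes:
yrs <- data.frame(times = 2006:2020)
ext <- sapply(new[-1], function(v) predict(lm(v ~ times, data = new), newdata = yrs))
cbind(yrs, ext)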
