Trying to integrate over discrete points from a data frame - r

I have several months of weather data; an example day is here:
Hour Avg.Temp
1 1 11
2 2 11
3 3 11
4 4 10
5 5 10
6 6 11
7 7 12
8 8 14
9 9 15
10 10 17
11 11 19
12 12 21
13 13 22
14 14 24
15 15 23
16 16 22
17 17 21
18 18 18
19 19 16
20 20 15
21 21 14
22 22 12
23 23 11
24 24 10
I need to figure out the total number of hours above 15 degrees by integrating in R. I'm analyzing for degree days, a concept in agriculture, that gives valuable information about relative growth rate. For example, hour 10 is 2 degree hours and hour 11 is 4 degree hours above 15 degrees. This can help predict when to harvest fruit. How can I write the code for this?
Another column could potentially work with a simple subtraction. Then I would have to make a cumulative sum after canceling out all negative numbers. That is the approach I'm setting out to do right now. Is there an integral I could write and have an answer in one step?

This solution subtracts your threshold (i.e., 15°), linearly interpolates the result with approxfun(), then integrates the interpolated function. Note that if the temperature is below the threshold, it contributes zero to the total rather than a negative value.
df <- read.table(text = "Hour Avg.Temp
1 1 11
2 2 11
3 3 11
4 4 10
5 5 10
6 6 11
7 7 12
8 8 14
9 9 15
10 10 17
11 11 19
12 12 21
13 13 22
14 14 24
15 15 23
16 16 22
17 17 21
18 18 18
19 19 16
20 20 15
21 21 14
22 22 12
23 23 11
24 24 10", header = TRUE)
with(df, integrate(approxfun(Hour, pmax(Avg.Temp - 15, 0)),
                   lower = min(Hour), upper = max(Hour)))
#> 53.00017 with absolute error < 0.0039
Created on 2019-02-08 by the reprex package (v0.2.1.9000)
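Since approxfun() interpolates linearly by default, the integrand is piecewise linear, so the trapezoidal rule should give the same area without numerical quadrature. A minimal base R sketch (the names hour, temp, excess and degree_hours are illustrative, not part of the answer above):

```r
# Hourly readings from the example day in the question
hour <- 1:24
temp <- c(11, 11, 11, 10, 10, 11, 12, 14, 15, 17, 19, 21,
          22, 24, 23, 22, 21, 18, 16, 15, 14, 12, 11, 10)

# Degrees above the 15-degree threshold; readings at or below it count as zero
excess <- pmax(temp - 15, 0)

# Trapezoidal rule: interval width times the mean of the two endpoint values
degree_hours <- sum(diff(hour) * (head(excess, -1) + tail(excess, -1)) / 2)
degree_hours
#> [1] 53
```

Under this reading, the small difference from the integrate() result (53.00017) is most likely quadrature error at the kinks of the interpolated function.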

The OP has asked for the total number of hours above 15 degrees, obtained by integrating in R.
It is not fully clear to me what the expected result is. Does the OP want to count the number of hours above 15 degrees, or to sum up the degrees above 15 ("integrate")?
The code below computes both figures. Assuming the data is sampled every hour without gaps (as the OP's sample dataset suggests), cumsum() and sum() can be used, respectively:
library(data.table)
setDT(DT)[, c("deg_hrs_sum", "deg_hrs_cnt") :=
            .(cumsum(pmax(0, Avg.Temp - 15)), cumsum(Avg.Temp > 15))]
Hour Avg.Temp deg_hrs_sum deg_hrs_cnt
1: 1 11 0 0
2: 2 11 0 0
3: 3 11 0 0
4: 4 10 0 0
5: 5 10 0 0
6: 6 11 0 0
7: 7 12 0 0
8: 8 14 0 0
9: 9 15 0 0
10: 10 17 2 1
11: 11 19 6 2
12: 12 21 12 3
13: 13 22 19 4
14: 14 24 28 5
15: 15 23 36 6
16: 16 22 43 7
17: 17 21 49 8
18: 18 18 52 9
19: 19 16 53 10
20: 20 15 53 10
21: 21 14 53 10
22: 22 12 53 10
23: 23 11 53 10
24: 24 10 53 10
Hour Avg.Temp deg_hrs_sum deg_hrs_cnt
Alternatively,
setDT(DT)[, .(deg_hrs_sum = sum(pmax(0, Avg.Temp - 15)),
              deg_hrs_cnt = sum(Avg.Temp > 15))]
returns only the final result (last row):
deg_hrs_sum deg_hrs_cnt
1: 53 10
Data
library(data.table)
DT <- fread("
rn Hour Avg.Temp
1 1 11
2 2 11
3 3 11
4 4 10
5 5 10
6 6 11
7 7 12
8 8 14
9 9 15
10 10 17
11 11 19
12 12 21
13 13 22
14 14 24
15 15 23
16 16 22
17 17 21
18 18 18
19 19 16
20 20 15
21 21 14
22 22 12
23 23 11
24 24 10", drop = 1L)
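The question mentions several months of data. Assuming a day identifier is available, the same clipped sum could be computed per day in base R. This is only a sketch: the Day column, the weather and daily names, and the toy values below are hypothetical.

```r
# Hypothetical two-day data set with four hourly readings per day
weather <- data.frame(
  Day      = rep(1:2, each = 4),
  Hour     = rep(1:4, times = 2),
  Avg.Temp = c(14, 17, 19, 15, 16, 16, 12, 20)
)

# Degrees above the threshold, with readings at or below 15 counted as zero
weather$excess <- pmax(weather$Avg.Temp - 15, 0)

# One degree-hour total per day
daily <- aggregate(excess ~ Day, data = weather, FUN = sum)
daily
```

With hourly sampling and no gaps, each reading stands for one hour, so the per-day sum is the degree-hour total for that day.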

Related

R: How to apply a sliding conditional branch to consecutive values in sequential data

I want to apply a conditional test to consecutive values in a sliding manner.
For example, I have a dataset like this:
data <- data.frame(ID = rep.int(c("A","B"), times = c(24, 12)),
                   time = c(1:24,1:12),
                   visit = as.integer(runif(36, min = 0, max = 20)))
which gives the table below:
> data
ID time visit
1 A 1 7
2 A 2 0
3 A 3 6
4 A 4 6
5 A 5 3
6 A 6 8
7 A 7 4
8 A 8 10
9 A 9 18
10 A 10 6
11 A 11 1
12 A 12 13
13 A 13 7
14 A 14 1
15 A 15 6
16 A 16 1
17 A 17 11
18 A 18 8
19 A 19 16
20 A 20 14
21 A 21 15
22 A 22 19
23 A 23 5
24 A 24 13
25 B 1 6
26 B 2 6
27 B 3 16
28 B 4 4
29 B 5 19
30 B 6 5
31 B 7 17
32 B 8 6
33 B 9 10
34 B 10 1
35 B 11 13
36 B 12 15
I want to flag each ID based on runs of consecutive "visit" values.
If "visit" stays below 10 for 6 consecutive rows, I'd label the ID "empty", and "busy" otherwise.
In the data above, "A" is below 10 throughout rows 1 to 6, so it is "empty". "B", on the other hand, has no run of 6 consecutive single-digit values, so it is "busy".
If the condition is not fulfilled in one segment of 6 values, I want to apply it to the next segment.
I'd like to achieve this using R. Any advice will be appreciated.
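One possible reading of the rule is: an ID is "empty" if it contains any run of at least 6 consecutive visits below 10, wherever that run starts. Under that assumption, a base R sketch with rle() could look like this (flag_visits is a hypothetical helper name; the visit values are copied from the table above so the example is reproducible):

```r
# Hypothetical helper: "empty" if any run of at least n consecutive
# values is below the threshold, "busy" otherwise
flag_visits <- function(visit, threshold = 10, n = 6) {
  runs <- rle(visit < threshold)
  if (any(runs$values & runs$lengths >= n)) "empty" else "busy"
}

# Data copied from the printed table in the question
data <- data.frame(
  ID    = rep.int(c("A", "B"), times = c(24, 12)),
  time  = c(1:24, 1:12),
  visit = c(7, 0, 6, 6, 3, 8, 4, 10, 18, 6, 1, 13,
            7, 1, 6, 1, 11, 8, 16, 14, 15, 19, 5, 13,
            6, 6, 16, 4, 19, 5, 17, 6, 10, 1, 13, 15)
)

# One flag per ID
flags <- vapply(split(data$visit, data$ID), flag_visits, character(1))
flags
```

If the segments are instead meant to be fixed, non-overlapping blocks of 6 rows, the run-length test would have to be replaced by a per-block check.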

Filling the gap in the time series

I have the following data,
id <- c(rep(12, 10), rep(14, 12), rep(16, 2))
m <- c(1:5, 8:12, 1:12, 10, 12)
y <- c(rep(14, 10), rep(14, 12), rep(15, 2))
v <- rnorm(24)
df <- data.frame(id, m, y, v)
> df
id m y v
1 12 1 14 0.9453216
2 12 2 14 1.0666393
3 12 3 14 -0.2750527
4 12 4 14 1.3264349
5 12 5 14 -1.8046676
6 12 8 14 0.3334960
7 12 9 14 -1.2448408
8 12 10 14 0.5258248
9 12 11 14 -0.1233157
10 12 12 14 1.4717530
11 14 1 14 0.6217376
12 14 2 14 -0.8344823
13 14 3 14 1.1468841
14 14 4 14 -0.3363987
15 14 5 14 -1.3543311
16 14 6 14 -0.2146853
17 14 7 14 -0.6546186
18 14 8 14 -2.4286257
19 14 9 14 -1.3314888
20 14 10 14 0.8215581
21 14 11 14 -0.9999368
22 14 12 14 -1.2935147
23 16 10 15 0.7339261
24 16 12 15 1.1303524
The first column is the id, the second column m is the month, the third column y is the year, and the last column is the value.
In the month column, two observations (June and July) are missing for id 12 in year 14, and November is missing in year 15.
I would like to add those missing months with a value of zero. For example, for year 15, the data should look like this:
16 10 15 0.7339261
16 11 15 0
16 12 15 1.1303524
Can anyone suggest a way to do that?
Or in data.table: generate the months for each id and year, left join this with the original dataset on id, y, m, and then replace the NAs with 0:
library(data.table)
setDT(df)
df[df[, .(m = min(m):max(m)), by = .(id, y)], on = .(id, y, m)][
  is.na(v), v := 0]
With dplyr and tidyr, you can do:
library(dplyr)
library(tidyr)
df %>%
  group_by(id) %>%
  complete(m = seq(min(m), max(m), 1), fill = list(v = 0)) %>%
  fill(y)
id m y v
<dbl> <dbl> <dbl> <dbl>
1 12 1 14 0.539
2 12 2 14 -0.0768
3 12 3 14 1.85
4 12 4 14 -0.855
5 12 5 14 0.0326
6 12 6 14 0
7 12 7 14 0
8 12 8 14 -1.03
9 12 9 14 -0.982
10 12 10 14 0.00410
11 12 11 14 -0.233
12 12 12 14 -0.499
13 14 1 14 1.55
14 14 2 14 0.0875
15 14 3 14 1.32
16 14 4 14 -0.981
17 14 5 14 -0.246
18 14 6 14 -1.40
19 14 7 14 1.44
20 14 8 14 -0.981
21 14 9 14 1.47
22 14 10 14 -0.991
23 14 11 14 -0.0945
24 14 12 14 -2.88
25 16 10 15 -0.247
26 16 11 15 0
27 16 12 15 0.0147
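The same idea (build the full month range per id and year, join, replace NA with 0) can also be sketched in base R without extra packages. The small data frame below reuses four rows from the question; the names groups, full and filled are illustrative.

```r
# Four rows from the question: id 12 skips months 6 and 7, id 16 skips month 11
df <- data.frame(
  id = c(12, 12, 16, 16),
  m  = c(5, 8, 10, 12),
  y  = c(14, 14, 15, 15),
  v  = c(-1.8046676, 0.3334960, 0.7339261, 1.1303524)
)

# Full run of months between the first and last observed month, per id and year
groups <- split(df, list(df$id, df$y), drop = TRUE)
full <- do.call(rbind, lapply(groups, function(g) {
  data.frame(id = g$id[1], y = g$y[1], m = seq(min(g$m), max(g$m)))
}))

# Keep every row of the full grid, then turn the unmatched values into 0
filled <- merge(full, df, by = c("id", "y", "m"), all.x = TRUE)
filled$v[is.na(filled$v)] <- 0
filled <- filled[order(filled$id, filled$y, filled$m), c("id", "m", "y", "v")]
rownames(filled) <- NULL
filled
```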

Sum a variable based on another variable

I have a dataset consisting of two variables, Contents and Time like so:
Time Contents
2017M01 123
2017M02 456
2017M03 789
. .
. .
. .
2018M12 789
Now I want to create a numeric vector that aggregates Contents for six months, that is I want to sum 2017M01 to 2017M06 to one number, 2017M07 to 2017M12 to another number and so on.
I'm able to do this by indexing but I want to be able to write: "From 2017M01 to 2017M06 sum contents corresponding to that sequence" in my code.
I would really appreciate some help!
You can create a grouping variable based on the number of rows and the number of elements per group. In your case you want to group every 6 rows, so the number of rows in your data frame should be divisible by 6. Using iris to demonstrate (it has 150 rows, so 150 / 6 = 25 groups):
rep(seq(nrow(iris)%/%6), each = 6)
#[1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 5 6 6 6 6 6 6 7 7 7 7 7 7 8 8 8 8 8 8 9 9 9 9 9 9 10 10 10 10
#[59] 10 10 11 11 11 11 11 11 12 12 12 12 12 12 13 13 13 13 13 13 14 14 14 14 14 14 15 15 15 15 15 15 16 16 16 16 16 16 17 17 17 17 17 17 18 18 18 18 18 18 19 19 19 19 19 19 20 20
#[117] 20 20 20 20 21 21 21 21 21 21 22 22 22 22 22 22 23 23 23 23 23 23 24 24 24 24 24 24 25 25 25 25 25 25
There are plenty of ways to handle how you want to call it. Here is a custom function that allows you to do that (i.e. create the grouping variable),
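f1 <- function(x, df) {
  # Extract the start and end month numbers from a range such as "2017M01:2017M06".
  # Only the months are parsed, so the range must fall within a single year.
  v1 <- as.numeric(gsub('[0-9]{4}M(.*):[0-9]{4}M(.*)$', '\\1', x))
  v2 <- as.numeric(gsub('[0-9]{4}M(.*):[0-9]{4}M(.*)$', '\\2', x))
  # Group size = number of months in the range
  i1 <- (v2 - v1) + 1
  return(rep(seq(nrow(df) %/% i1), each = i1))
}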
f1("2017M01:2017M06", iris)
#[1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 5 6 6 6 6 6 6 7 7 7 7 7 7 8 8 8 8 8 8 9 9 9 9 9 9 10 10 10 10
#[59] 10 10 11 11 11 11 11 11 12 12 12 12 12 12 13 13 13 13 13 13 14 14 14 14 14 14 15 15 15 15 15 15 16 16 16 16 16 16 17 17 17 17 17 17 18 18 18 18 18 18 19 19 19 19 19 19 20 20
#[117] 20 20 20 20 21 21 21 21 21 21 22 22 22 22 22 22 23 23 23 23 23 23 24 24 24 24 24 24 25 25 25 25 25 25
EDIT: The function can easily be made to handle row counts that are not a multiple of the group size: append the next group number (max + 1) to the result, repeated once for each leftover row, i.e.
f1 <- function(x, df) {
  v1 <- as.numeric(gsub('[0-9]{4}M(.*):[0-9]{4}M(.*)$', '\\1', x))
  v2 <- as.numeric(gsub('[0-9]{4}M(.*):[0-9]{4}M(.*)$', '\\2', x))
  i1 <- (v2 - v1) + 1
  final_v <- rep(seq(nrow(df) %/% i1), each = i1)
  if (nrow(df) %% i1 == 0) {
    return(final_v)
  } else {
    remainder <- nrow(df) %% i1
    final_v1 <- c(final_v, rep((max(final_v) + 1), remainder))
    return(final_v1)
  }
}
So for a data frame with 20 rows and groups of 6, the function above yields:
f1("2017M01:2017M06", df)
#[1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4
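To sum by period rather than by row position, one option is to derive a half-year label from the Time string itself and aggregate on that. The sketch below assumes Time always has the form YYYYMmm; the Contents values are made up for illustration, and the names year, month, half and sums are hypothetical.

```r
# Made-up data covering 2017M01 to 2018M12
Time     <- sprintf("%dM%02d", rep(2017:2018, each = 12), rep(1:12, times = 2))
Contents <- 1:24

# Split "2017M03" into its year and month parts
year  <- substr(Time, 1, 4)
month <- as.integer(substr(Time, 6, 7))

# Months 1 to 6 form the first half of the year, 7 to 12 the second
half <- paste0(year, ifelse(month <= 6, "H1", "H2"))

# One sum per half-year, named by its label
sums <- tapply(Contents, half, sum)
sums
```

Because the grouping comes from the dates and not from row counts, this should also tolerate a series that starts or ends mid-period.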

Create partition based on two variables

I have a data set with two outcome variables, case1 and case2. Case1 has 4 levels, while case2 has 50 (levels in case2 could increase later). I would like to create data partition for train and test keeping the ratio in both cases. The real data is imbalanced for both case1 and case2. As an example,
library(caret)
set.seed(123)
matris=matrix(rnorm(10),1000,20)
case1 <- as.factor(ceiling(runif(1000, 0, 4)))
case2 <- as.factor(ceiling(runif(1000, 0, 50)))
df <- as.data.frame(matris)
df$case1 <- case1
df$case2 <- case2
split1 <- createDataPartition(df$case1, p=0.2)[[1]]
train1 <- df[-split1,]
test1 <- df[split1,]
length(split1)
201
split2 <- createDataPartition(df$case2, p=0.2)[[1]]
train2 <- df[-split2,]
test2 <- df[split2,]
length(split2)
220
If I do separate splitting, I get different length for the data frame. If I do one splitting based on case2 (one with more classes), I lose the ratio of classes for case1.
I will be predicting the two cases separately, but in the end my accuracy will be given by having the exact match for both cases (e.g., ix = which(pred1 == case1 & pred2 == case2)), so I need the arrays to be the same size.
Is there a smart way to do this?
Thank you!
If I understand correctly (which I do not guarantee), I can offer the following approach:
Group by case1 and case2 and get the group indices:
library(tidyverse)
df %>%
  select(case1, case2) %>%
  group_by(case1, case2) %>%
  group_indices() -> indices
Use these indices as the outcome variable in createDataPartition():
split1 <- createDataPartition(as.factor(indices), p=0.2)[[1]]
Check whether the result is satisfactory:
table(df[split1,22])
#output
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
5 6 5 8 5 5 6 6 4 6 6 6 6 6 5 5 5 4 4 7 5 6 5 6 7 5 5 8 6 7 6 6 7
34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
4 5 6 6 6 5 5 6 5 6 6 5 4 5 6 4 6
table(df[-split1,22])
#output
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
15 19 13 18 12 13 16 15 8 13 13 15 21 14 11 13 12 9 12 20 17 15 16 19 16 11 14 21 13 20 18 13 16
34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
9 6 12 19 14 10 16 19 17 17 16 14 4 15 14 9 19
table(df[split1,21])
#output
1 2 3 4
71 70 71 67
table(df[-split1,21])
1 2 3 4
176 193 174 178
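The underlying idea, treating each (case1, case2) combination as one stratum and sampling within it, can also be sketched in base R. This is not caret's algorithm, only an illustration under the assumption that roughly 20% of each stratum should go to the test set; strata and test_idx are hypothetical names, and very small strata may contribute no test rows at all.

```r
set.seed(123)
case1 <- factor(sample(1:4, 1000, replace = TRUE))
case2 <- factor(sample(1:50, 1000, replace = TRUE))

# One stratum per observed combination of the two outcomes
strata <- interaction(case1, case2, drop = TRUE)

# Draw about 20% of the row numbers inside each stratum;
# sample.int() avoids the sample(x) surprise when a stratum has a single row
test_idx <- unlist(lapply(split(seq_along(strata), strata), function(i) {
  i[sample.int(length(i), size = round(0.2 * length(i)))]
}), use.names = FALSE)

# Class shares of case1 in the test rows and in the remaining rows
prop.table(table(case1[test_idx]))
prop.table(table(case1[-test_idx]))
```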

Create new columns based on similar row values

How do I create a new set of data frame columns based on matched row values?
For instance, for this sample data frame:
x<-data.frame(cbind(numsp=rep(c(16,64,256),each=12),Colless=rep(c("loIc","midIc","hiIc"),each=4, times=3), lambdaE=rep(c(TRUE,FALSE),each=2,times=9),ntree=rep(c(1,2),length.out=36), metric1=seq(1:36), metric2=seq(1:36)))
For some parameter, e.g., lambdaE, I'd like to create new columns for metric1 and metric2 based on whether lambdaE is TRUE or FALSE.
The data frame would look something like this:
x2<-data.frame(cbind(numsp=rep(c(16,64,256),each=6),Colless=rep(c("hiIc","loIc","midIc"),each=2, times=3), ntree=rep(c(1,2),length.out=18), metric1.lambdE.FALSE=c(11,12,3,4,7,8,35,36,27,28,31,32,23,24,15,16,19,20), metric2.lambdE.FALSE=c(11,12,3,4,7,8,35,36,27,28,31,32,23,24,15,16,19,20),metric1.lambdE.TRUE=c(9,10,1,2,5,6,33,34,25,26,29,30,21,22,13,14,17,18), metric2.lambdE.TRUE=c(9,10,1,2,5,6,33,34,25,26,29,30,21,22,13,14,17,18)))
Or alternatively for the parameter "Colless", a new set of columns for metric1 and metric2 for each level of Colless.
Thanks in advance!
Okay, it looks like base R's reshape() (from the stats package, not reshape2) has a quick solution:
reshape(x, direction="wide", idvar=c("numsp","Colless","ntree"), timevar="lambdaE")
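To see what the wide reshape does, here is a small self-contained illustration on a toy data frame (the long and wide names and the values are made up, not the data from the question). Each level of the timevar column becomes its own metric column, and idvar identifies the rows that are kept.

```r
# Toy long-format data: one metric measured for two ntree values
# under lambdaE = TRUE and lambdaE = FALSE
long <- data.frame(
  ntree   = rep(1:2, times = 2),
  lambdaE = rep(c(TRUE, FALSE), each = 2),
  metric1 = 1:4
)

# One row per ntree, one metric1 column per lambdaE level
wide <- reshape(long, direction = "wide", idvar = "ntree", timevar = "lambdaE")
wide
```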
Alternatively, melt() and dcast() from reshape2 can be used:
library(reshape2)
mm <- melt(x, id = c('numsp', 'Colless', 'lambdaE', 'ntree'))
dcast(mm, numsp + Colless + ntree ~ lambdaE + variable)
numsp Colless ntree FALSE_metric1 FALSE_metric2 TRUE_metric1 TRUE_metric2
1 16 hiIc 1 11 11 9 9
2 16 hiIc 2 12 12 10 10
3 16 loIc 1 3 3 1 1
4 16 loIc 2 4 4 2 2
5 16 midIc 1 7 7 5 5
6 16 midIc 2 8 8 6 6
7 256 hiIc 1 35 35 33 33
8 256 hiIc 2 36 36 34 34
9 256 loIc 1 27 27 25 25
10 256 loIc 2 28 28 26 26
11 256 midIc 1 31 31 29 29
12 256 midIc 2 32 32 30 30
13 64 hiIc 1 23 23 21 21
14 64 hiIc 2 24 24 22 22
15 64 loIc 1 15 15 13 13
16 64 loIc 2 16 16 14 14
17 64 midIc 1 19 19 17 17
18 64 midIc 2 20 20 18 18
